Being able to reproduce experiments and results is important to advancing our knowledge, but it’s not something we’ve always been able to do well. In a series of guest posts, we have invited perspectives and advice on reproducibility in NLP.
by Liling Tan, Research Scientist at Rakuten Institute of Technology / Universität des Saarlandes.
I think there are at least three levels of reproducibility in NLP: (i) Rerun, (ii) Repurpose, and (iii) Reimplementation.
At the rerun level, the aim is to re-run the open source code on the open dataset shared with the publication. It’s a sanity check one does to understand the practicality of the inputs and the expected outputs. This level of replication is often skipped because (i) the open data, open source code, or documentation is missing, or (ii) we trust the integrity of the researchers and the publication.
The repurpose level often starts out as a low-hanging-fruit project. Usually, the goal is to modify the source code slightly to suit other purposes and/or datasets; e.g., if the code implements SRU for an image recognition task, maybe it could work for machine translation. Alternatively, one might add the results from the previous state-of-the-art (SOTA) as features/inputs to the new approach.
The last level, reimplementation, is usually overlooked or done out of necessity. For example, an older SOTA system might have stale code that doesn’t compile or run any more, so it’s easier to reimplement the older technique in the framework you’ve built for the novel approach than to figure out how to make the stale code run. The reimplementation often takes considerable time and effort, and in return it produces that one line of numbers in the table of results.
More often, we see publications simply citing the results of previous studies for SOTA comparisons on the same dataset instead of reimplementing and incorporating the previous methods into the code for the new methods. This is largely because of how we incentivize “newness” over “reproducibility” in research, but this is getting better as “reproducibility” becomes a reviewing criterion.
We seldom question the comparability of results once a publication has exceeded the SOTA performance on a common benchmark metric and dataset. Without replication, we often overlook the sensitivity of the data munging that might happen before the system output is fed into a benchmarking script. For example, the misuse of the infamous multi-bleu.perl evaluation script overlooked the fact that sentences need to be tokenized before the n-gram overlaps in BLEU are computed. Even though the script and gold standards were consistent, different systems have been tokenizing their outputs differently, making the comparability of results inconsistent, especially when there is no open source code or clear documentation of the system reported in the publication. To resolve the multi-bleu.perl misuse, replicating a previous SOTA system with the same pre-/post-processing steps would have given a fairer account of how the previous SOTA and the current approach actually compare.
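The tokenization sensitivity is easy to demonstrate. Below is a minimal sketch using a simplified sentence-level BLEU written from scratch (not multi-bleu.perl itself; the `bleu` helper and the regex tokenizer are illustrative assumptions, not any system’s real pipeline). Scoring the exact same text against itself still loses BLEU points once reference and hypothesis are tokenized differently:

```python
import math
import re
from collections import Counter

def bleu(ref_tokens, hyp_tokens, max_n=4):
    """A simplified sentence-level BLEU (uniform weights, no smoothing) --
    enough to show the effect, not a replacement for a real scorer."""
    log_precision = 0.0
    for n in range(1, max_n + 1):
        ref_ngrams = Counter(tuple(ref_tokens[i:i + n])
                             for i in range(len(ref_tokens) - n + 1))
        hyp_ngrams = Counter(tuple(hyp_tokens[i:i + n])
                             for i in range(len(hyp_tokens) - n + 1))
        overlap = sum((hyp_ngrams & ref_ngrams).values())
        total = sum(hyp_ngrams.values())
        if overlap == 0:
            return 0.0  # any zero n-gram precision zeroes the score
        log_precision += math.log(overlap / total) / max_n
    # Brevity penalty for hypotheses shorter than the reference.
    bp = min(1.0, math.exp(1 - len(ref_tokens) / len(hyp_tokens)))
    return bp * math.exp(log_precision)

text = "The quick brown fox doesn't jump."

# Tokenizer A: naive whitespace split ("doesn't", "jump." stay whole).
ws = text.split()
# Tokenizer B: a rough stand-in for a Moses-style tokenizer that splits
# off punctuation (purely illustrative, not the real Moses tokenizer).
punct = re.findall(r"\w+|[^\w\s]", text)

# Same underlying text on both sides, but a tokenization mismatch
# between reference and hypothesis drags the score well below 1.0.
score_matched = bleu(ws, ws)
score_mismatched = bleu(punct, ws)
print(score_matched, score_mismatched)
```

The score drop comes purely from the tokenizer, not from translation quality — which is exactly why reporting BLEU without documenting the pre-/post-processing makes numbers from different papers incomparable.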
Additionally, “simply citing” often undermines the currency of benchmarking datasets. Like software, datasets are constantly updated and patched; moreover, new datasets that are more relevant to the current day or the latest shared task keep being created. Yet we see publications evaluating on dated benchmarks, most probably to draw comparisons with a previous SOTA. Hopefully, with “reproducibility” as a reviewing criterion, authors will pay more attention to the writing of the paper and share resources so that future work can easily replicate their systems on newer datasets.
The core ingredients of replication studies are open data and open source code. But lacking either shouldn’t hinder reproducibility: if the approaches are well described in the publication, it shouldn’t be hard to reproduce the results on an open dataset. Without shared resources, open source code, and/or proper documentation, one may question the true impact of a publication that can’t be easily replicated.