LRE Map: What? Why? When? Who?

Posted on March 5, 2018 by Leon Derczynski

This guest post is by Nicoletta Calzolari.

Not-documented Language Resources (LRs) don’t exist!

The LRE Map of Language Resources (data and tools) (http://lremap.elra.info) is an innovative instrument introduced at LREC2010 with the aim of monitoring the wealth of data and technologies developed and used in our field. Why “Map”? Because we aimed at representing the relevant features of a large territory, also for the aspects not represented in the official catalogues of the major players of the field. But we had other purposes too: we wanted to draw attention to the importance of the LRs that are behind many of our papers and to map also the “use” of LRs, to understand the purposes of the developed LRs.

Its collaborative, bottom-up, creation was critical: we conceived the Map as a means to influence a “change of culture” in our community, whereby everyone is asked to make a minimal effort to document the LRs that are used or created, thus understanding the need of proper documentation. By spreading the LR documentation effort across many people instead of leaving it only in the hands of the distribution centres, we also encourage awareness of the importance of metadata and proper documentation. Documenting a resource is the first step for making it identifiable, which in its turn is the first step towards reproducibility.

We kept the requested information at a simple level, knowing that we had to compromise between richness of metadata and willingness of authors to fill them in.

With all these purposes in mind we thought we could exploit the great opportunity offered by LREC and the involvement of so many authors from so many countries, from different modalities and working in so many areas of NLP. Afterwards the Map was used also in the framework of other major Conferences, in particular by COLING, and this provides another opportunity for useful comparisons.

The number of LRs currently described in the Map is 7453 (instances), collected from 17 different conferences. The major conferences for which we have data on a regular basis are LREC and COLING.

With initiatives such as the LRE Map and “Share your LRs” (introduced in 2014) we want to encourage in the field of LT and LRs what is already in use in more mature disciplines, i.e. ensure proper documentation and reproducibility as a normal practice. We think that research is strongly affected also by such infrastructural (meta-research) activities and therefore we continue to promote – also through such initiatives – a greater visibility of LRs, the sharing of LRs in an easier way and the reproducibility of research results.

Here is the vision: it must become common practice also in our field that when you submit a paper either to a conference or a journal you are offered the opportunity to document and upload the LRs related to your research. This is even more important in a data-intensive discipline like NLP. The small cost that each of us will pay to document, share, etc. should be paid back from benefiting of others’ efforts.

What do we ask to colleagues submitting at COLING 2018? Please document all the LRs mentioned in your paper!

SemEval: Striving for Reproducibility in Research – Guest post

Posted on February 20, 2018 by Leon Derczynski

Being able to reproduce experiments and results is important to advancing our knowledge, but it’s not something we’ve always been able to do well. In a series of guest posts, we have invited perspectives and advice on reproducibility in NLP.

by Saif M. Mohammad, National Research Council Canada.

A shared task invites participation in a competition where system predictions are examined and ranked by an independent party on a common evaluation framework (common new training and test sets, common evaluation metrics, etc.). The International Workshop on Semantic Evaluation (SemEval) is a popular shared task platform for computational semantic analysis. (See SemEval-2017; participate in SemEval-2018!) Every year, the workshop selects a dozen or so tasks (from a competitive pool of proposals) and co-ordinates their organization—the setting up of task websites, releasing training and test sets, conducting evaluations, and publishing proceedings. It draws hundreds of participants, and publishes over a thousand pages of proceedings. It’s awesome!

Embedded in SemEval, but perhaps less obvious, is a drive for reproducibility in research—obtaining the same results again, using the same method. Why does reproducibility matter? Reproducibility is a foundational tenet of the scientific method. There is no truth other than reproducibility. If repeated data annotations provide wildly diverging labels, then that data is not capturing anything meaningful. If no one else is able to replicate one’s algorithm and results, then that original work is called into question. (See Most Scientists Can’t Replicate Studies by their Peers and also this wonderful article by Ted Pedersen, Empiricism Is Not a Matter of Faith.)

I have been involved with SemEval in many roles: from a follower of the work, to a participant, a task organizer, and co-chair. In this post, I share my thoughts on some of the key ways in which SemEval encourages reproducibility, and how many of these initiatives can easily be carried over to your research (whether or not it is part of a shared task).

SemEval has two core components:

Tasks: SemEval chooses a mix of repeat tasks (tasks that were run in prior years), new-to-SemEval tasks (tasks studied separately by different research groups, but not part of SemEval yet), and some completely new tasks. The completely new tasks are exciting and allow the community to make quick progress. The new-to-SemEval tasks allow for the comparison and use of disparate past work (ideas, algorithms, and linguistic resources) on a common new test set. The repeat tasks allow participants to build on past submissions and help track progress over the years. By drawing the attention of the community to a set of tasks, SemEval has a way of cleaning house. Literature is scoured, dusted, and re-examined to identify what generalizes well—which ideas and resources are truly helpful.

Bragging rights apart, a common motivation to participate in SemEval is to test whether a particular hypothesis is true or not. Irrespective of what rank a system attains, participants are encouraged to report results on multiple baselines, benchmarks, and comparison submissions.

Data and Resources: The common new (previously unseen) test set is a crucial component of SemEval. It minimizes the risk of highly optimistic results from (over)training on a familiar dataset. Participants usually have only two or three weeks from when they get access to the test set to when they have to provide system submissions. Task organizers often provide links to code and other resources that participants can use, including baseline systems and the winning systems from the past years. Participants can thus build on these resources.

SemEval makes a concerted effort to keep the data and the evaluation framework for the shared tasks available through the task websites even after the official competition. Thus, people with new approaches can continue to compare results with that of earlier participants, even years later. The official proceedings record the work done by the task organizers and participants.

Task Websites: For each task, the organizers set up a website providing details of the task definition, data, annotation questionnaires, links to relevant resources, and references. Since 2017, the tasks are run on shared task platforms such as CodaLab. They include special features such as phases and leaderboards. Phases often correspond to a pre-evaluation period (when systems have access to the training data but not the test data), the official evaluation period (when the test data is released and official systems submissions are to be made), and a post-evaluation period. The leaderboard is a convenient way to record system results. Once the organizers set up the task website with the evaluation script, the system automatically generates results on every new submission and uploads it on the leaderboard. There is a separate leaderboard for each phase. Thus, even after the official competition has concluded, one can upload submissions, and the auto-computed results are posted on the leaderboard. Anyone interested in a task can view all of the results in one place.

SemEval also encourages participants to make system submissions freely available and to make system code available where possible.

Proceedings: For each task, the organizers write a task-description paper that describes their task, data, evaluation, results, and a summary of participating systems. Participants write a system-description paper describing their system and submissions. Special emphasis is paid to replicability in the instructions to authors and in the reviewing process. For the task paper: “present all details that will allow someone else to replicate the data creation process and evaluation.” For the system paper: “present all details that will allow someone else to replicate your system.” All papers are accepted except for system papers that fail to provide clear and adequate details of their submission. Thus SemEval is also a great place to record negative results — ideas that seemed promising but did not work out.

Bonus article: Why it’s time to publish research “failures”

All of the above make SemEval a great sandbox for working on compelling tasks, reproducing and refining ideas from prior research, and developing new ones that are accessible to all. Nonetheless, shared tasks can entail certain less-desirable outcomes that are worth noting and avoiding:

Focus on rankings: While the drive to have the top-ranked submission can be productive, it is not everything. More important is the analysis to help improve our collective understanding of the task. Thus, irrespective of one’s rank, it is useful to test different hypotheses and report negative results.
Comparing post-competition results with official competition results: A crucial benefit of participating under the rigor of a shared task is that one does not have access to the reference/gold labels of the test data until the competition has concluded. This is a benefit because having open access to the reference labels can lead to unfair and unconscious optimisation on the test set. Every time one sees the result of their system on a test set and tries something different, it is a step towards optimising on the test set. However, once the competition has concluded the gold labels are released so that the task organizers are not the only gatekeepers for analysis. Thus, even though post-competition work on the task–data combination is very much encouraged, the comparisons of those results with the official competition results have to pass a higher bar of examination and skepticism.

There are other pitfalls worth noting too—feel free to share your thoughts in the comments.

“That’s great!” you say, “but we are not always involved in shared tasks…”

How do I encourage reproducibility of *my* research?

Here are some pointers to get started:

In your paper, describe all that is needed for someone else to reproduce the work. Make use of provisions for Appendices. Don’t be limited by page lengths. Post details on websites and provide links in your paper.
Create a webpage for the research project. Briefly describe the work in a manner that anybody interested can come away understanding what you are working on and why that matters. There is merit in communicating our work to people at large, and not just to our research peers. Also:
- Post the project papers or provide links to them.
- Post annotation questionnaires.
- Post the code on repositories such as GitHub and CodaLab. Provide links.
- Share evaluation scripts.
- Provide interactive visualisations to explore the data and system predictions. Highlight interesting findings.
- Post tables with results of work on a particular task of interest. This is especially handy if you are working on a new task or creating new data for a task. Use tools such as CodaLab to create leaderboards and allow others to upload their system predictions.
- If you are releasing data or code, briefly describe the resource, and add information on:

- - What can the resource be used for and how?
  - What hypotheses can be tested with this resource?
  - What are the properties of the resource — its strengths, biases, and limitations?
  - How can one build on the resource to create something new?
(Feel free to add more suggestions through your comments below.)

Sharing your work often follows months and years of dedicated research. So enjoy it, and don’t forget to let your excitement shine through! 🙂

Many thanks to Svetlana Kiritchenko, Graeme Hirst, Ted Pedersen, Peter Turney, and Tara Small for comments and discussions.

References:

Empiricism Is Not a Matter of Faith
Most Scientists Can’t Replicate Studies by their Peers
Reproducibility (includes a nice little history; includes a differentiation from a related concept ‘replicability’)
Why it’s Time to Publish Research “Failures”

Reproducibility in NLP – Guest Post

Posted on January 9, 2018 by Leon Derczynski

by Liling Tan, Research Scientist at Rakuten Institute of Technology / Universität des Saarlandes.

I think there are at least 3 levels of reproducibility in NLP (i) Rerun, (ii) Repurpose, (iii) Reimplementation.

At the rerun level, the aim is to re-run the open source code on the open dataset shared from the publication. It’s sort of a sanity check that one would do to understand the practicality of the inputs and the expected outputs. Sometimes, this level of replication is usually skipped because (i) either the open data, open source or perhaps the documentation is missing or (ii) we trust the integrity of the researchers and the publication.

The repurpose level often starts out as a low-hanging fruit project. Usually, the goal is to modify the source code slightly to suit other purposes and/or datasets, e.g. if the code was an implementation of SRU to solve an image recognition task, maybe it could work for machine translation. Alternatively, one might also add the results from the previous state-of-the-art (SOTA) as features/inputs to the new approach.

The last reimplementation level is usually overlooked or done out of necessity. For example, an older SOTA might have stale code that doesn’t compile/run any more so it’s easier to reimplement the older SOTA technique into the framework you’ve created for the novel approach than to figure out how to make the stale code run. Often, the re-implementation might take quite some time and effort and in return, it produces that one line of numbers in the table of results.

More often, we see publications simply citing the results of the previous studies for SOTA comparisons on the same dataset instead of reimplementing and incorporating the previous methods into the code for the new methods. This is largely because of how we incentivize “newness” over “reproducibility” in research, but this is getting better as we see “reproducibility” as a reviewing criterion.

We seldom question the comparability of results once a publication has exceeded the SOTA performance on a common benchmark metric and dataset. Without replication, we often overlook the sensitivity of data munging that might be involved before putting the system output through a benchmarking script. For example, the abuse of the infamous multi-bleu.perl evaluation script overlooked the fact that sentences need to be tokenized before computing the n-gram overlaps in BLEU. Even though the script and gold standards were consistent, different system has been tokenizing their outputs differently making comparability of results inconsistent, especially if there’s no open source or clear documentation of the system reported in the publication. To resolve the multi-bleu.perl misuse, replicating a previous SOTA system using the same pre-/post-processing steps would have given a fairer account of the comparability between the previous SOTA and current approach.

Additionally, “simply citing” often undermines the currency of benchmarking datasets. Like software, datasets are constantly updated and patched; moreover new datasets that are more relevant to the current day or latest shared task are created. But we see publications evaluating on dated benchmarks, most probably to draw comparison with a previous SOTA. Hopefully with “reproducibility” as a criterion in reviewing, authors pay more attention to the writing of the paper and share resources such that future work can easily replicate their systems on newer datasets.

The core ingredients of replication studies are open data and open sources. But lacking in neither shouldn’t hinder reproducibility. If the approaches are well-described in the publication, it shouldn’t be hard to reproduce the results on an open dataset. Without shared resources, open sources, and/or proper documentation, one may question the true impact of the publication that can’t be easily replicated.

Slowly Growing Offspring: Zigglebottom Anno 2017 – Guest post

Posted on December 22, 2017 by Leon Derczynski

Reflections on Improving Replication and Reproduction in Computational Linguistics

(See Ted Pedersen’s Empiricism is not a Matter of Faith for the Sad Tale of the Zigglebottom Tagger)

A little over four years ago, we presented our paper Offspring from Reproduction Problems at ACL. The paper discussed two case studies in which we failed to replicate results. While investigating the problem, we found that results differed to an extent that they led to completely different conclusions. The settings, preprocessing and evaluation whose (small) variations led to these changes were not even reported in the original papers.

Though some progress has been made on both the level of ensuring replication (obtaining the same results using the same experiment) as well as reproduction (reach the same conclusion through different means), the problem described in 2013 still seems to apply to the majority of the computational linguistics papers published in 2017. In this blog post, I’d like to reflect on the progress that has been made, but also on the progress we still need to make on the level of publishing both replicable and reproducible research. The core issue around replication is the lack of means provided to other researchers to repeat an experiment carried out elsewhere. Issues around reproducing results are more diverse, but I believe that the way we look at evidence and comparison to previous work in our field is a key element of the problem. I will argue that major steps in addressing these issues can be made by (1) increasing appreciation for replicability and reproducibility in published research and (2) changing the way we use the ‘state-of-the-art’ when judging research in our field. More specifically, good papers provide insight and understanding in a computational linguistics or NLP problem. Reporting results that beat the state-of-the-art is neither sufficient nor necessary for a paper to provide a valuable research contribution.

Replication Problems and Appreciation for Sharing Code

Attention for replicable results (sharing code and resources) has increased in the last four years. Links to git repositories or other version control systems are more and more common and review forms of the main conferences include a question addressing the possibilities of replication. Our research group CLTL has adopted a policy indicating that code and resources not restricted by third party licenses must be made available when publishing. When reading related work for my own research, I have noticed similar tendencies in, among others, the UKP-group in Darmstadt, Stanford NLP and the CS and Linguistics departments of the University of Washington. Our PhD students furthermore typically start by replicating or reproducing previous work which they can then use as a baseline. From their experience, I noticed that the problems reported four years ago still apply today. Results were close or comparable sometimes and once even higher, but also regularly far off. Sometimes provided code did not even run. Authors often provided feedback, but even with their help (sometimes they went as far as looking at our code), the original results could not be replicated. I currently find myself on the other side of the table, with two graduate students wanting to use an analysis from my PhD and the (openly available) code producing errors.

There can be valid reasons for not sharing code or resources. Research teams from industry have often delivered interesting and highly relevant contributions to NLP research and it is difficult to obtain corpora from various genres without copyright on the text. I therefore do not want to argue for less appreciation for research without open source code and resources, but I would very much want to advocate for more appreciation for research that does provide the means for replicating results. In addition to being openly verifiable, it also provides additional means for other researchers to build their work directly upon previous work rather than first going through the frustration of reimplementing a new baseline system good enough to test their hypotheses on.

The General Reproducible and Replicable State-of-the-Art

Comparing performance on benchmark systems has helped in gaining insight into the performance of our systems and in comparing various approaches. Evaluation in our field is often limited to testing whether an approach beats the state-of-the-art. Many even seem to see this as the main purpose to the extent that reviewers rate papers down that don’t beat the state-of-the-art. I suspect that researchers often do not even bother to try and publish their work if performance remains below the best reported. The purpose of evaluation actually is, or should be, to provide insight into how a model works, what phenomena it captures or which patterns the machine learning algorithm picked up, compared to alternative approaches. Moreover, the difficulties involved in replicating results make the practice of judging research on whether it beats the state-of-the-art rather questionable. Reported results may be misleading regarding the actual state-of-the-art. In general, papers should be evaluated based on what they teach us, i.e. whether they verify their hypothesis by comparing it to a suitable baseline. A suitable baseline may indeed mean a baseline that corresponds to the state-of-the-art, but this state-of-the-art should be a valid reflection of what current technologies do.

I would therefore like to introduce the notions of the reproducible state-of-the-art and the generally replicable state-of-the-art. These two notions both aim at gaining better insight into the true state-of-the-art and making building on top of that more accessible to a wider range of researchers. I understand a ‘reproducible state-of-the-art’ to be a result obtained by different groups of researchers independently which increases the likelihood of providing a reliable result and a baseline that is feasible to reproduce for other researchers. This implies having more appreciation for papers that come relatively close to the state-of-the-art without necessarily beating it. Chances of results being reproducible also increase if they hold across datasets and can be obtained by multiple machine learning runs (e.g. if they are relatively stable across different initiations and order of processing training data by a neural network). The ‘generally replicable state-of-the-art’ refers to the best reported results obtained by a fully available system and, preferably, one that can be trained and run using computational resources available to the average NLP research group. One way to obtain better open source systems and encourage researchers to share their resources and code is by instructing reviewers to appreciate improving the new generally replicable state-of-the-art (with open source code and available resources) as much as improving the reported state-of-the-art.

Understanding Computational Models for Natural Language

In the introduction of this blog, I claimed that improving the state-of-the-art is neither necessary nor sufficient for providing an important contribution to computational linguistics. NLP papers often introduce an idea and show that by adding the features or adapting the machine learning approach associated with that idea improves results. Many authors take the improved results as evidence that the idea works, but this is not necessarily the case: improvement can be due to other differences in settings or random variations. The outcome becomes much more convincing if the hypothesis correctly predicts which kind of errors the new approach would solve compared to the baseline. For instance, if you predict that reinforcement learning reduces error propagation, investigate the error propagation in the new system compared to the baseline. Even if it is difficult to predict where improvement comes from, a decent error analysis showing which phenomena are treated better than by other systems, which perform as good or bad and which have gotten worse can provide valuable insights into why an approach works or, more importantly, why it does not. This has several advantages: first of all, if we have better insights into what information and which algorithms help for similar and which for different phenomena, we have a better idea of how to further improve our systems (for those among you who are convinced that achieving high f-scores is our ultimate goal). It becomes easier to publish negative results , which in turn promotes progress by preventing other research groups from going down the same pointless road without knowing of each other’s work. We may learn whether an approach works or does not work due to particularities of the data we are working with. Moreover, an understood result is more likely to be a reproducible result and even if it is not, details about what is working exactly may help other researchers to find out why they cannot reproduce it. In my opinion, this is where our field fails most: we are too easily satisfied when results are high and do not aim for deep insight frequently enough. This aspect may be the hardest to tackle from the points I have raised in this post. On the upside, addressing this is not made impossible by licenses, copyright and commercial code.

Moving Forward

As a community, we are responsible for improving the quality of our research. Most of the effort will probably have to come from bottom up: individual researchers can decide to write (only) papers with a solid methodological setup, and that aim for insights in addition to or even rather than high f-scores and provide code and resources whenever allowed. They can also decide to value papers that follow such practices more and be (more) critical of papers that do not provide insights or good understanding of the methods. Initiatives such as the workshops Analyzing and Interpreting Neural Networks for NLP, Building and Breaking, Ethics in NLP, Relevance of Linguistic Structure in Neural NLP (and many others) show that the desire to obtain better understanding is very much alive in the community.

Researchers serving as program chairs can play a significant role in further encouraging authors and reviewers. The categories of best papers proposed for COLING2018 are a nice example of an incentive that appreciates a variety of contributions to the field. The main conference’s review forms have included questions about resources provided by the paper. Last year, however, the option ‘no code or resources provided’ was followed by ‘(most submissions)’. As a reviewer, I wondered: why this addition? We should at least try to move towards a situation that providing code and resources is normal or maybe even standard. The new NAACL form refers to the encouragement of sharing research for papers introducing new systems. I hope this will also be included for other paper categories and that the chairs will connect this encouragement to a reward for authors who do. I also hope chairs and editors, for all conferences, journals and workshops, will remind their reviewers of the fragility of reported results and remind them to take this into consideration when verifying if empirical results are sufficient compared to related work. Most of all, I hope many researchers will feel encouraged to submit insightful research with low as well as high results and I hope to learn much from it.

Thank you for reading. Please share your ideas and thoughts: I’d specifically love to hear from researchers that have different opinions.

Antske Fokkens

https://twitter.com/antske

Acknowledgements I’d like to thank Leon Derczynski for inviting me to write this post. Thanks to Ted Pedersen (who I have never met in person) for that crazy Saturday we spent hacking across the ocean to finally find out why the original results could not be replicated. I’d like to thank Emily Bender for valuable feedback. Last but not least, thanks to the members of the CLTL research group for discussions and inspiration on this topic as well as the many many colleagues from all over the world I have exchanged thoughts with on this topic over the past four years!

COLING 2018

August 20-26, 2018, Santa Fe, New Mexico, USA

Category Archives: Reproducibility

LRE Map: What? Why? When? Who?

SemEval: Striving for Reproducibility in Research – Guest post

Reproducibility in NLP – Guest Post

Slowly Growing Offspring: Zigglebottom Anno 2017 – Guest post