SemEval: Striving for Reproducibility in Research – Guest post

Being able to reproduce experiments and results is important to advancing our knowledge, but it’s not something we’ve always been able to do well. In a series of guest posts, we have invited perspectives and advice on reproducibility in NLP.

by Saif M. Mohammad, National Research Council Canada.

A shared task invites participation in a competition where system predictions are examined and ranked by an independent party on a common evaluation framework (common new training and test sets, common evaluation metrics, etc.). The International Workshop on Semantic Evaluation (SemEval) is a popular shared task platform for computational semantic analysis. (See SemEval-2017; participate in SemEval-2018!) Every year, the workshop selects a dozen or so tasks (from a competitive pool of proposals) and co-ordinates their organization: setting up task websites, releasing training and test sets, conducting evaluations, and publishing proceedings. It draws hundreds of participants, and publishes over a thousand pages of proceedings. It’s awesome!

Embedded in SemEval, but perhaps less obvious, is a drive for reproducibility in research: obtaining the same results again, using the same method. Why does reproducibility matter? Reproducibility is a foundational tenet of the scientific method. There is no truth other than reproducibility. If repeated data annotations provide wildly diverging labels, then that data is not capturing anything meaningful. If no one else is able to replicate one’s algorithm and results, then that original work is called into question. (See Most Scientists Can’t Replicate Studies by their Peers and also this wonderful article by Ted Pedersen, Empiricism Is Not a Matter of Faith.)

I have been involved with SemEval in many roles: from a follower of the work, to a participant, a task organizer, and co-chair. In this post, I share my thoughts on some of the key ways in which SemEval encourages reproducibility, and how many of these initiatives can easily be carried over to your research (whether or not it is part of a shared task).

SemEval has two core components:

Tasks: SemEval chooses a mix of repeat tasks (tasks that were run in prior years), new-to-SemEval tasks (tasks studied separately by different research groups, but not part of SemEval yet), and some completely new tasks. The completely new tasks are exciting and allow the community to make quick progress. The new-to-SemEval tasks allow for the comparison and use of disparate past work (ideas, algorithms, and linguistic resources) on a common new test set. The repeat tasks allow participants to build on past submissions and help track progress over the years. By drawing the attention of the community to a set of tasks, SemEval has a way of cleaning house. Literature is scoured, dusted, and re-examined to identify what generalizes well: which ideas and resources are truly helpful.

Bragging rights aside, a common motivation to participate in SemEval is to test whether a particular hypothesis holds. Irrespective of what rank a system attains, participants are encouraged to report results on multiple baselines, benchmarks, and comparison submissions.

Data and Resources: The common new (previously unseen) test set is a crucial component of SemEval. It minimizes the risk of highly optimistic results from (over)training on a familiar dataset. Participants usually have only two or three weeks from when they get access to the test set to when they have to provide system submissions. Task organizers often provide links to code and other resources that participants can use, including baseline systems and the winning systems from the past years. Participants can thus build on these resources.

SemEval makes a concerted effort to keep the data and the evaluation framework for the shared tasks available through the task websites even after the official competition. Thus, people with new approaches can continue to compare their results with those of earlier participants, even years later. The official proceedings record the work done by the task organizers and participants.

Task Websites: For each task, the organizers set up a website providing details of the task definition, data, annotation questionnaires, links to relevant resources, and references. Since 2017, the tasks have been run on shared task platforms such as CodaLab. These platforms include special features such as phases and leaderboards. Phases often correspond to a pre-evaluation period (when systems have access to the training data but not the test data), the official evaluation period (when the test data is released and official system submissions are to be made), and a post-evaluation period. The leaderboard is a convenient way to record system results. Once the organizers set up the task website with the evaluation script, the platform automatically evaluates every new submission and posts the results to the leaderboard. There is a separate leaderboard for each phase. Thus, even after the official competition has concluded, one can upload submissions, and the auto-computed results are posted on the leaderboard. Anyone interested in a task can view all of the results in one place.
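To make this concrete, here is a rough sketch of what such an evaluation script often looks like. It follows a common CodaLab-style convention (gold labels under `input_dir/ref/`, the submission under `input_dir/res/`, metrics written to `output_dir/scores.txt`), but the file names and layout here are assumptions, not an official template; check your platform’s documentation.

```python
# Hypothetical CodaLab-style scoring program (a sketch, not an official
# SemEval script). Reads gold labels and a submission, writes metrics
# to output_dir/scores.txt as "metric: value" lines.
import os
import sys

def read_labels(path):
    """Read one label per line, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]

def accuracy(gold, pred):
    if len(gold) != len(pred):
        raise ValueError("submission has the wrong number of lines")
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def main(input_dir, output_dir):
    # These subdirectory and file names are assumed for illustration.
    gold = read_labels(os.path.join(input_dir, "ref", "gold.txt"))
    pred = read_labels(os.path.join(input_dir, "res", "predictions.txt"))
    os.makedirs(output_dir, exist_ok=True)
    with open(os.path.join(output_dir, "scores.txt"), "w", encoding="utf-8") as out:
        out.write(f"accuracy: {accuracy(gold, pred):.4f}\n")

if __name__ == "__main__" and len(sys.argv) == 3:
    main(sys.argv[1], sys.argv[2])
```

Because the platform reruns this script on every upload, anyone who scores a submission years later is scored by exactly the same code as the original participants.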

SemEval also encourages participants to make system submissions freely available and to make system code available where possible.

Proceedings: For each task, the organizers write a task-description paper that describes their task, data, evaluation, results, and a summary of participating systems. Participants write a system-description paper describing their system and submissions. Special emphasis is placed on replicability in the instructions to authors and in the reviewing process. For the task paper: “present all details that will allow someone else to replicate the data creation process and evaluation.” For the system paper: “present all details that will allow someone else to replicate your system.” All papers are accepted except for system papers that fail to provide clear and adequate details of their submission. Thus SemEval is also a great place to record negative results — ideas that seemed promising but did not work out.

Bonus article: Why it’s time to publish research “failures”

All of the above make SemEval a great sandbox for working on compelling tasks, reproducing and refining ideas from prior research, and developing new ones that are accessible to all. Nonetheless, shared tasks can entail certain less-desirable outcomes that are worth noting and avoiding:

  • Focus on rankings: While the drive to have the top-ranked submission can be productive, it is not everything. More important is the analysis to help improve our collective understanding of the task. Thus, irrespective of one’s rank, it is useful to test different hypotheses and report negative results. 
  • Comparing post-competition results with official competition results: A crucial benefit of participating under the rigor of a shared task is that one does not have access to the reference/gold labels of the test data until the competition has concluded. This is a benefit because having open access to the reference labels can lead to unfair and unconscious optimisation on the test set. Every time one sees the result of their system on a test set and tries something different, it is a step towards optimising on the test set. However, once the competition has concluded the gold labels are released so that the task organizers are not the only gatekeepers for analysis. Thus, even though post-competition work on the task–data combination is very much encouraged, the comparisons of those results with the official competition results have to pass a higher bar of examination and skepticism.
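A small simulation (purely illustrative, not drawn from any SemEval task) shows why repeated peeking at test results is risky: if you try many variants that are all equally good underneath, the best observed test score drifts upward even though no variant is actually better.

```python
# Illustrative simulation of test-set peeking: K equally-good system
# variants, each with true accuracy 0.70, evaluated on one test set.
# All numbers here are invented for the demonstration.
import random

N_TEST = 1000    # test set size
K_TRIES = 50     # number of "tweak and re-evaluate" cycles
TRUE_ACC = 0.70  # every variant has the same underlying accuracy

def evaluate_once(rng):
    """Accuracy of one variant: each item is correct with prob TRUE_ACC."""
    return sum(rng.random() < TRUE_ACC for _ in range(N_TEST)) / N_TEST

rng = random.Random(0)
scores = [evaluate_once(rng) for _ in range(K_TRIES)]
print(f"mean observed accuracy: {sum(scores) / len(scores):.3f}")
print(f"best-of-{K_TRIES} accuracy:  {max(scores):.3f}")
```

The best-of-K score beats the mean by construction, which is exactly the optimistic bias that a one-shot, blind evaluation period prevents.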

There are other pitfalls worth noting too—feel free to share your thoughts in the comments.

“That’s great!” you say, “but we are not always involved in shared tasks…”

How do I encourage reproducibility of *my* research?

Here are some pointers to get started:

  • In your paper, describe all that is needed for someone else to reproduce the work. Make use of provisions for appendices; don’t be limited by page limits. Post details on websites and provide links in your paper.
  • Create a webpage for the research project. Briefly describe the work so that anybody interested can come away understanding what you are working on and why it matters. There is merit in communicating our work to people at large, and not just to our research peers. Also:
    • Post the project papers or provide links to them.
    • Post annotation questionnaires.
    • Post the code on repositories such as GitHub and CodaLab. Provide links.
    • Share evaluation scripts.
    • Provide interactive visualisations to explore the data and system predictions. Highlight interesting findings.
    • Post tables with results of work on a particular task of interest. This is especially handy if you are working on a new task or creating new data for a task. Use tools such as CodaLab to create leaderboards and allow others to upload their system predictions.
    • If you are releasing data or code, briefly describe the resource, and add information on:
      • What can the resource be used for and how?
      • What hypotheses can be tested with this resource?
      • What are the properties of the resource — its strengths, biases, and limitations?
      • How can one build on the resource to create something new?
  • (Feel free to add more suggestions through your comments below.)
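When posting results tables, it also helps to indicate whether differences between systems are robust or just test-set noise. One generic way (an assumption on my part, not an official SemEval procedure) is a paired bootstrap over the test items:

```python
# Sketch of a paired bootstrap comparison between two systems'
# predictions on the same test set. Generic illustration only.
import random

def paired_bootstrap(gold, sys_a, sys_b, n_resamples=2000, seed=0):
    """Fraction of resampled test sets on which system A beats system B
    on accuracy. Values near 1.0 suggest a robust difference; values
    near 0.5 suggest the gap may be an artifact of this test set."""
    rng = random.Random(seed)
    n = len(gold)
    a_wins = 0
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        acc_a = sum(sys_a[i] == gold[i] for i in idx) / n
        acc_b = sum(sys_b[i] == gold[i] for i in idx) / n
        if acc_a > acc_b:
            a_wins += 1
    return a_wins / n_resamples
```

Sharing a small helper like this alongside the evaluation script lets others compare against your numbers on equal footing.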

Sharing your work often follows months and years of dedicated research. So enjoy it, and don’t forget to let your excitement shine through! 🙂

Many thanks to Svetlana Kiritchenko, Graeme Hirst, Ted Pedersen, Peter Turney, and Tara Small for comments and discussions.

Area Chairs – and Areas

Traditionally, areas are prescribed by program chairs, in anticipation of the field’s interests. This can lead to last-minute scrambles as the number of submissions ends up varying widely across areas. To avoid this, we chose to follow the methodology developed by Ani Nenkova and Owen Rambow for NAACL 2016 in sunny San Diego, CA, USA. For COLING 2018 we have not defined areas directly, but rather will let them emerge from the interests of the area chairs, expressed in keywords. These keywords are then also used to allocate reviewers to areas, and later, papers. That’s why at COLING you won’t be directly asked to select an area for your paper at all; this is managed automatically. You will only be asked to select the type of your paper and describe its focus in keywords, to make sure it’s reviewed correctly. If you don’t know what paper types are available, we highly recommend you see the list of paper types and review questions. The keywords are exposed through the submission interface.

Each area has two area chairs, as previous experience has shown that it’s helpful to have a collaborator with whom to discuss decisions and to share the workload, but that larger groups can lead to lack of clarity in who’s doing what work.  We created the AC pairings automatically, keeping the following in mind:

  • We want to maximize the similarity of AC research expertise (as captured by the keywords provided) within each pair, across the global set of pairings.
  • We want to minimize AC pairs where there is a large timezone difference, to foster quick troubleshooting and discussion (in the end, we ended up with one pair not in the same global region).
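For the curious, the pairing idea can be sketched as a simple matching heuristic: score candidate pairs by keyword overlap, penalize timezone gaps, and pair greedily. The scoring weights, data shapes, and the greedy strategy below are all invented for illustration; our actual procedure differed in its details.

```python
# Hypothetical sketch of the pairing criteria described above.
# All names, keywords, and weights are invented for illustration.
from itertools import combinations

def pair_score(a, b, tz_weight=0.5):
    """Keyword overlap (Jaccard) minus a penalty for timezone distance."""
    overlap = len(a["keywords"] & b["keywords"]) / len(a["keywords"] | b["keywords"])
    tz_gap = min(abs(a["tz"] - b["tz"]), 24 - abs(a["tz"] - b["tz"]))
    return overlap - tz_weight * (tz_gap / 12)

def greedy_pairs(chairs):
    """Repeatedly take the best-scoring remaining pair of chairs."""
    remaining = list(range(len(chairs)))
    pairs = []
    while len(remaining) >= 2:
        best = max(combinations(remaining, 2),
                   key=lambda ij: pair_score(chairs[ij[0]], chairs[ij[1]]))
        pairs.append(best)
        remaining = [i for i in remaining if i not in best]
    return pairs
```

A greedy pass like this is not optimal in general, but it captures the two goals: high expertise similarity within pairs, small timezone gaps across them.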

In addition, seven of the ACs have been designated not to a specific area but as “Special Circumstances” chairs, who can be called on to troubleshoot or advise as necessary.

Our final AC roster is as follows:

  • Afra Alishahi
  • Alexandre Rademaker
  • Alexis Palmer
  • Aline Villavicencio
  • Alvin Grissom II
  • Andrew Caines
  • Ann Clifton
  • Anna Rumshisky
  • Antske Fokkens
  • Arash Eshghi
  • Aurelie Herbelot
  • Avirup Sil
  • Barry Devereux
  • Chaitanya Shivade
  • Dan Garrette
  • Daniel Lassiter
  • David Schlangen
  • Dekai Wu
  • Deyi Xiong
  • Eric Nichols
  • Francis Bond
  • Frank Ferraro
  • Georgiana Dinu
  • Gerard de Melo
  • Gina-Anne Levow
  • Harry Bunt
  • Hatem Haddad
  • Isabelle Augenstein
  • Jiajun Zhang
  • Jose Camacho Collados
  • Klinton Bicknell
  • Lilja Øvrelid
  • Maja Popovic
  • Manuel Montes-y-Gómez
  • Marcos Zampieri
  • Marie-Catherine de Marneffe
  • Meliha Yetisgen
  • Michael Tjalve
  • Miguel Ballesteros
  • Mike Tian-Jian Jiang
  • Mohammad Taher Pilehvar
  • Na-Rae Han
  • Naomi Feldman
  • Natalie Schluter
  • Nathan Schneider
  • Nikola Ljubešić
  • Nurit Melnik
  • Qin Lu
  • Roman Klinger
  • Sadid A. Hasan
  • Sanja Štajner
  • Sara Tonelli
  • Sarvnaz Karimi
  • Sujian Li
  • Sunayana Sitaram
  • Tal Linzen
  • Valia Kordoni
  • Vivek Kulkarni
  • Viviane Moreira
  • Wei Xu
  • Wenjie Li
  • Xiang Ren
  • Xiaodan Zhu
  • Yang Feng
  • Yonatan Bisk
  • Yue Zhang
  • Yun-Nung Chen
  • Zachary Chase Lipton
  • Zeljko Agic
  • Zhiyuan Liu

With the following ACs in Special Circumstances, spread across the world’s timezones:

  • Anders Søgaard
  • Andreas Vlachos
  • Asad Sayeed
  • Di Jiang
  • Karin Verspoor
  • Kevin Duh
  • Steven Bethard

We are grateful to these distinguished scholars for the time and effort they are committing to COLING 2018!