Being able to reproduce experiments and results is important to advancing our knowledge, but it’s not something we’ve always been able to do well. In a series of guest posts, we have invited perspectives and advice on reproducibility in NLP.
by Saif M. Mohammad, National Research Council Canada.
A shared task invites participation in a competition where system predictions are examined and ranked by an independent party on a common evaluation framework (common new training and test sets, common evaluation metrics, etc.). The International Workshop on Semantic Evaluation (SemEval) is a popular shared task platform for computational semantic analysis. (See SemEval-2017; participate in SemEval-2018!) Every year, the workshop selects a dozen or so tasks (from a competitive pool of proposals) and co-ordinates their organization—setting up task websites, releasing training and test sets, conducting evaluations, and publishing proceedings. It draws hundreds of participants, and publishes over a thousand pages of proceedings. It’s awesome!
Embedded in SemEval, but perhaps less obvious, is a drive for reproducibility in research—obtaining the same results again, using the same method. Why does reproducibility matter? Reproducibility is a foundational tenet of the scientific method. There is no truth other than reproducibility. If repeated data annotations provide wildly diverging labels, then that data is not capturing anything meaningful. If no one else is able to replicate one’s algorithm and results, then that original work is called into question. (See Most Scientists Can’t Replicate Studies by their Peers and also this wonderful article by Ted Pedersen, Empiricism Is Not a Matter of Faith.)
I have been involved with SemEval in many roles: from a follower of the work, to a participant, a task organizer, and co-chair. In this post, I share my thoughts on some of the key ways in which SemEval encourages reproducibility, and how many of these initiatives can easily be carried over to your research (whether or not it is part of a shared task).
SemEval has two core components:
Tasks: SemEval chooses a mix of repeat tasks (tasks that were run in prior years), new-to-SemEval tasks (tasks studied separately by different research groups, but not part of SemEval yet), and some completely new tasks. The completely new tasks are exciting and allow the community to make quick progress. The new-to-SemEval tasks allow for the comparison and use of disparate past work (ideas, algorithms, and linguistic resources) on a common new test set. The repeat tasks allow participants to build on past submissions and help track progress over the years. By drawing the attention of the community to a set of tasks, SemEval has a way of cleaning house. Literature is scoured, dusted, and re-examined to identify what generalizes well—which ideas and resources are truly helpful.
Bragging rights apart, a common motivation to participate in SemEval is to test whether a particular hypothesis is true or not. Irrespective of what rank a system attains, participants are encouraged to report results on multiple baselines, benchmarks, and comparison submissions.
Data and Resources: The common new (previously unseen) test set is a crucial component of SemEval. It minimizes the risk of highly optimistic results from (over)training on a familiar dataset. Participants usually have only two or three weeks from when they get access to the test set to when they have to provide system submissions. Task organizers often provide links to code and other resources that participants can use, including baseline systems and the winning systems from the past years. Participants can thus build on these resources.
SemEval makes a concerted effort to keep the data and the evaluation framework for the shared tasks available through the task websites even after the official competition. Thus, people with new approaches can continue to compare their results with those of earlier participants, even years later. The official proceedings record the work done by the task organizers and participants.
Task Websites: For each task, the organizers set up a website providing details of the task definition, data, annotation questionnaires, links to relevant resources, and references. Since 2017, tasks have been run on shared-task platforms such as CodaLab, which include special features such as phases and leaderboards. Phases often correspond to a pre-evaluation period (when systems have access to the training data but not the test data), the official evaluation period (when the test data is released and official system submissions are to be made), and a post-evaluation period. The leaderboard is a convenient way to record system results. Once the organizers set up the task website with the evaluation script, the platform automatically generates results for every new submission and posts them to the leaderboard. There is a separate leaderboard for each phase. Thus, even after the official competition has concluded, one can upload submissions, and the auto-computed results are posted on the leaderboard. Anyone interested in a task can view all of the results in one place.
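The evaluation script at the heart of such a setup can be very simple. Below is a minimal sketch of the idea—not CodaLab's actual scoring harness; the tab-separated file format and the accuracy metric are illustrative assumptions—showing a scorer that compares a submission against gold labels:

```python
# Minimal sketch of a shared-task scoring script.
# The 'id<TAB>label' file format and accuracy metric are illustrative,
# not CodaLab's actual scoring API.

def load_labels(path):
    """Read one 'id<TAB>label' pair per line into a dict."""
    labels = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            item_id, label = line.rstrip("\n").split("\t")
            labels[item_id] = label
    return labels

def score(gold_path, submission_path):
    """Accuracy of a submission against the gold labels.

    Every gold item must be answered; missing predictions count as wrong.
    """
    gold = load_labels(gold_path)
    pred = load_labels(submission_path)
    correct = sum(1 for item_id, g in gold.items() if pred.get(item_id) == g)
    return correct / len(gold)
```

Once such a script is registered with the task, the platform can run it on each new submission and post the resulting number straight to the leaderboard, which is what keeps post-competition comparisons cheap.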
SemEval also encourages participants to make system submissions freely available and to make system code available where possible.
Proceedings: For each task, the organizers write a task-description paper that describes their task, data, evaluation, results, and a summary of participating systems. Participants write a system-description paper describing their system and submissions. Special emphasis is placed on replicability in the instructions to authors and in the reviewing process. For the task paper: “present all details that will allow someone else to replicate the data creation process and evaluation.” For the system paper: “present all details that will allow someone else to replicate your system.” All papers are accepted except for system papers that fail to provide clear and adequate details of their submission. Thus SemEval is also a great place to record negative results — ideas that seemed promising but did not work out.
Bonus article: Why it’s time to publish research “failures”
All of the above make SemEval a great sandbox for working on compelling tasks, reproducing and refining ideas from prior research, and developing new ones that are accessible to all. Nonetheless, shared tasks can entail certain less-desirable outcomes that are worth noting and avoiding:
- Focus on rankings: While the drive to have the top-ranked submission can be productive, it is not everything. More important is the analysis to help improve our collective understanding of the task. Thus, irrespective of one’s rank, it is useful to test different hypotheses and report negative results.
- Comparing post-competition results with official competition results: A crucial benefit of participating under the rigor of a shared task is that one does not have access to the reference/gold labels of the test data until the competition has concluded. This is a benefit because having open access to the reference labels can lead to unfair and unconscious optimisation on the test set. Every time one sees the result of their system on a test set and tries something different, it is a step towards optimising on the test set. However, once the competition has concluded the gold labels are released so that the task organizers are not the only gatekeepers for analysis. Thus, even though post-competition work on the task–data combination is very much encouraged, the comparisons of those results with the official competition results have to pass a higher bar of examination and skepticism.
There are other pitfalls worth noting too—feel free to share your thoughts in the comments.
“That’s great!” you say, “but we are not always involved in shared tasks…”
How do I encourage reproducibility of *my* research?
Here are some pointers to get started:
- In your paper, describe everything needed for someone else to reproduce the work. Make use of provisions for appendices; don’t be limited by page limits. Post details on websites and provide links in your paper.
- Create a webpage for the research project. Briefly describe the work in a manner that anybody interested can come away understanding what you are working on and why that matters. There is merit in communicating our work to people at large, and not just to our research peers. Also:
- Post the project papers or provide links to them.
- Post annotation questionnaires.
- Post the code on repositories such as GitHub and CodaLab. Provide links.
- Share evaluation scripts.
- Provide interactive visualisations to explore the data and system predictions. Highlight interesting findings.
- Post tables with results of work on a particular task of interest. This is especially handy if you are working on a new task or creating new data for a task. Use tools such as CodaLab to create leaderboards and allow others to upload their system predictions.
- If you are releasing data or code, briefly describe the resource, and add information on:
- What can the resource be used for and how?
- What hypotheses can be tested with this resource?
- What are the properties of the resource — its strengths, biases, and limitations?
- How can one build on the resource to create something new?
- (Feel free to add more suggestions through your comments below.)
Sharing your work often follows months and years of dedicated research. So enjoy it, and don’t forget to let your excitement shine through! 🙂
Many thanks to Svetlana Kiritchenko, Graeme Hirst, Ted Pedersen, Peter Turney, and Tara Small for comments and discussions.
- Empiricism Is Not a Matter of Faith
- Most Scientists Can’t Replicate Studies by their Peers
- Reproducibility (includes a nice little history, and a differentiation from the related concept of ‘replicability’)
- Why it’s Time to Publish Research “Failures”
Traditionally, areas are prescribed by program chairs, in anticipation of the field’s interests. This can lead to last-minute scrambles when the number of submissions ends up varying widely across areas. To avoid this, we chose to follow the methodology developed by Ani Nenkova and Owen Rambow for NAACL 2016 in sunny San Diego, CA, USA. For COLING 2018 we have not defined areas directly, but rather will let them emerge from the interests of the area chairs, expressed in keywords. These keywords are then also used to allocate reviewers to areas, and later, papers. That’s why at COLING you won’t be asked to select an area for your paper at all; this is managed automatically. You will only be asked to select the type of your paper and describe its focus in keywords, to make sure it’s reviewed correctly. If you don’t know what paper types are available, we highly recommend you see the list of paper types and review questions. The keywords are exposed through the submission interface.
Each area has two area chairs, as previous experience has shown that it’s helpful to have a collaborator with whom to discuss decisions and to share the workload, but that larger groups can lead to lack of clarity in who’s doing what work. We created the AC pairings automatically, keeping the following in mind:
- We want to maximize the similarity of research expertise within each AC pair (as captured by the keywords provided), across all pairings.
- We want to minimize AC pairs where there is a large timezone difference, to foster quick troubleshooting and discussion (in the end, we ended up with one pair not in the same global region).
In addition, seven of the ACs have been assigned not to a specific area but rather designated as “Special Circumstances” chairs, who can be called on to troubleshoot or advise as necessary.
Our final AC roster is as follows:
- Afra Alishahi
- Alexandre Rademaker
- Alexis Palmer
- Aline Villavicencio
- Alvin Grissom II
- Andrew Caines
- Ann Clifton
- Anna Rumshisky
- Antske Fokkens
- Arash Eshghi
- Aurelie Herbelot
- Avirup Sil
- Barry Devereux
- Chaitanya Shivade
- Dan Garrette
- Daniel Lassiter
- David Schlangen
- Dekai Wu
- Deyi Xiong
- Eric Nichols
- Francis Bond
- Frank Ferraro
- Georgiana Dinu
- Gerard de Melo
- Gina-Anne Levow
- Harry Bunt
- Hatem Haddad
- Isabelle Augenstein
- Jiajun Zhang
- Jose Camacho Collados
- Klinton Bicknell
- Lilja Øvrelid
- Maja Popovic
- Manuel Montes-y-Gómez
- Marcos Zampieri
- Marie-Catherine de Marneffe
- Meliha Yetisgen
- Michael Tjalve
- Miguel Ballesteros
- Mike Tian-Jian Jiang
- Mohammad Taher Pilehvar
- Na-Rae Han
- Naomi Feldman
- Natalie Schluter
- Nathan Schneider
- Nikola Ljubešić
- Nurit Melnik
- Qin Lu
- Roman Klinger
- Sadid A. Hasan
- Sanja Štajner
- Sara Tonelli
- Sarvnaz Karimi
- Sujian Li
- Sunayana Sitaram
- Tal Linzen
- Valia Kordoni
- Vivek Kulkarni
- Viviane Moreira
- Wei Xu
- Wenjie Li
- Xiang Ren
- Xiaodan Zhu
- Yang Feng
- Yonatan Bisk
- Yue Zhang
- Yun-Nung Chen
- Zachary Chase Lipton
- Zeljko Agic
- Zhiyuan Liu
With the following ACs in Special Circumstances, spread across the world’s timezones:
- Anders Søgaard
- Andreas Vlachos
- Asad Sayeed
- Di Jiang
- Karin Verspoor
- Kevin Duh
- Steven Bethard
We are grateful to these distinguished scholars for the time and effort they are committing to COLING 2018!
The COLING 2018 main conference deadline is in about eight weeks — have you integrated error analysis into your workflow yet?
One distinctive feature of our review forms for COLING 2018 is the question we’ve added about error analysis in the form for the NLP Engineering Experiment paper type. Specifically, we will ask reviewers to consider:
- Error analysis: Does the paper provide a thoughtful error analysis, which looks for linguistic patterns in the types of errors made by the system(s) evaluated and sheds light on either avenues for future work or the source of the strengths/weaknesses of the systems?
Is error analysis required for NLP engineering experiment papers at COLING?
We’ve been asked this, in light of the fact that many NLP engineering experiment papers (by far the most common type of paper published in computational linguistics and NLP conferences of late) do not include an error analysis, and many of those are still influential, important, and valuable.
Our response is of necessity somewhat nuanced. In our ideal world, all NLP engineering experiment papers at COLING 2018 would include thoughtful error analyses. We believe this would amplify the contributions of the research we publish, in terms of both short-term interest and long-term relevance. However, we also recognize that error analysis is not yet as prominent in the field as it could be, and as we’d argue it should be.
And so, our answer is that error analysis is not a strict requirement. However, we ask our reviewers to look for it, to value it, and to include the value of the error analysis in their overall evaluation of the papers they review. (And conversely, we absolutely do not want to see reviewers complaining that space in the paper is ‘wasted’ on error analysis.)
But why is error analysis so important?
As Antske Fokkens puts it in her excellent guest post on reproducibility:
The outcome becomes much more convincing if the hypothesis correctly predicts which kind of errors the new approach would solve compared to the baseline. For instance, if you predict that reinforcement learning reduces error propagation, investigate the error propagation in the new system compared to the baseline. Even if it is difficult to predict where improvement comes from, a decent error analysis showing which phenomena are treated better than by other systems, which perform as good or bad and which have gotten worse can provide valuable insights into why an approach works or, more importantly, why it does not.
In other words, a good error analysis tells us something about why method X is effective or ineffective for problem Y. This in turn provides a much richer starting point for further research, allowing us to go beyond throwing learning algorithms at the wall of tasks and seeing which stick, while allowing us to also discover which are the harder parts of a problem. And, as Antske also points out, a good error analysis makes it easier to publish papers about negative results. The observation that method X doesn’t work for problem Y is far more interesting if we can learn something about why not!
How do you do error analysis anyway?
Fundamentally, error analysis involves examining the errors made by a system and developing a classification of them. (This is typically best done over dev data, to avoid compromising held-out test sets.) At a superficial level, this can involve breaking things down by input length, token frequency or looking at confusion matrices. But we should not limit ourselves to examining only labels (rather than input linguistic forms) as with confusion matrices, or superficial properties of the linguistic signal. Languages are, after all, complex systems and linguistic forms are structured. So a deeper error analysis involves examining those linguistic forms and looking for patterns. The categories in the error analysis typically aren’t determined ahead of time, but rather emerge from the data. Does your sentiment analysis system get confused by counterfactuals? Does your event detection system miss negation not expressed by a simple form like not? Does your MT system trip up on translating pronouns, especially when they are dropped in the source language? Does your morphological analysis system, or do string-based features meant to capture noisy morphology, make assumptions about the form and position of affixes that aren’t equally valid across test languages?
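As a concrete (if heavily simplified) illustration of how such bucketing might start, here is a toy sketch for a sentiment classifier. The phenomenon categories and their trigger-word heuristics are invented for illustration; in a real analysis the categories would emerge from reading the errors, and the heuristics would be refined iteratively:

```python
from collections import Counter

# Toy error analysis for a sentiment classifier: count errors per coarse
# linguistic phenomenon. The trigger-word heuristics below are illustrative
# placeholders; real categories emerge from manually inspecting the errors.

NEGATION = {"not", "never", "no", "n't"}
COUNTERFACTUAL = {"if", "would", "could", "wish"}

def categorise(tokens):
    """Assign one or more coarse phenomenon labels to an input."""
    toks = set(tokens)
    cats = []
    if toks & NEGATION:
        cats.append("negation")
    if toks & COUNTERFACTUAL:
        cats.append("counterfactual")
    if len(tokens) > 20:
        cats.append("long-input")
    return cats or ["other"]

def error_profile(examples):
    """examples: iterable of (tokens, gold_label, predicted_label)."""
    counts = Counter()
    for tokens, gold, pred in examples:
        if gold != pred:
            counts.update(categorise(tokens))
    return counts

data = [
    ("this movie is not good".split(), "neg", "pos"),
    ("i wish it would end sooner".split(), "neg", "pos"),
    ("a delight".split(), "pos", "pos"),
]
print(error_profile(data))  # error counts broken down by phenomenon
```

The point of such a profile is not the numbers themselves but where they send you: the over-represented buckets are the ones worth reading through example by example, with a linguist if you can find one.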
As Emily noted in a guest post over on the NAACL PC blog:
Error analysis of this type requires a good deal of linguistic insight, and can be an excellent arena for collaboration with linguists (and far more rewarding to the linguist than doing annotation). Start this process early. The conversations can be tricky, as you try to explain how the system works to a linguist who might not be familiar with the type of algorithms you’re using and the linguist in turn tries to explain the patterns they are seeing in the errors. But they can be rewarding in equal measure as the linguistic insight brought out by the error analysis can inform further system development.
This brings us to why COLING in particular should be a leader in placing the spotlight on error analysis: As we noted in a previous blog post, COLING has a tradition of being a locus of interdisciplinary communication between (computational) linguistics and NLP as practiced in computer science. Error analysis is a key, under-discussed component of our research process that benefits from such interdisciplinary communication.
This guest post by the workshop chairs describes the process by which workshops were reviewed for COLING and the other major conferences in 2018 and how they were allocated.
For approximately the last 10 years, ACL, COLING, EMNLP, and NAACL have issued a joint call for workshops. While this adds an additional level of effort and coordination for the conference organizers, it lets workshop organizers focus on putting together a strong program and helps to ensure a balanced set of offerings for attendees across the major conferences each year. Workshop proposals are submitted early in the year, and specify which conference(s) they prefer or require. A committee composed of the workshop chairs of each conference then undertakes a review process of the proposals, and decides which proposals to accept and how to assign them to venues. This blog post explains how the process worked in 2018; the process largely followed the guidance on the ACL wiki.
We began by gathering the workshop chairs in August 2017. At that time, workshop chairs from ACL (Brendan O’Connor, Eva Maria Vecchi), COLING (Tim Baldwin, Yoav Goldberg, Jing Jiang), and NAACL (Marie Meteer, Jason Williams) had been appointed, but EMNLP (which occurs last of the 4 events in 2018) had not. This group drafted the call for workshops, largely following previous calls.
The call was issued on August 31, 2017, and specified a due date of October 22, 2017. During those months, the workshop chairs from EMNLP were appointed (Marieke van Erp, Vincent Ng) and joined the committee, which now consisted of 9 people. We received a total of 58 workshop proposals.
We went into the review process with the following goals:
- Ensure a high-quality workshop program across the conferences
- Ensure that the topics are relevant to the research community
- Avoid having topically very similar workshops at the same conference
- For placing workshops in conferences, follow proposer’s preferences wherever possible, diverging only in cases where there existed space limitations and/or substantial topical overlap
In addition to quality and relevance, it is worth noting here that space is an important consideration for workshops. Each conference has a fixed set of meeting rooms available for workshops, and the sizes of those rooms vary widely, with the smallest room holding 44 people and the largest holding 500. We therefore made a considerable effort to estimate the expected attendance at workshops (explained more below).
We started by having each proposal reviewed by 2 members of the committee, with most committee members reviewing around 15 proposals. To aid in the review process, we attempted to first categorize the workshop proposals, to help align proposals with areas of expertise on the committee. This categorization proved quite difficult because many proposals intentionally spanned several disciplines, but it did help identify proposals that were similar.
Our review form included the following questions:
- Relevance: Is the topic of this workshop interesting for the NLP community?
- Originality: Is the topic of this workshop original? (“no” not necessarily a bad thing)
- Variety: Does the topic of this workshop add to the diversity of topics discussed in the NLP community? (“no” not necessarily a bad thing)
- Quality of organizing team: Will the organisers be able to run a successful workshop?
- Quality of program committee: Have the organisers drawn together a high-quality PC?
- Quality of invited speakers (if any): Have high-quality, appropriate invited speaker(s) been identified by the organisers?
- Quality of proposal: Is the topic of the workshop motivated and clearly explained?
- Coherence: Is the topic of the workshop coherent?
- Size (smaller size not necessarily a bad thing):
- Number of previous attendees: Is there an indication of previous numbers of workshop attendees, and if so, what is that number?
- Number of previous submissions: Is there an indication of previous numbers of submissions, and if so, what is that number?
- Projected number of attendees: Is there an indication of projected numbers of workshop attendees, and if so, what is that number?
- Recommendation: Final recommendation
- Text comments to provide to proposers
- Text comments for internal committee use
As was done last year, we also surveyed ACL members to seek input on which workshops people were likely to attend. We felt this survey would be useful in two respects. First, it gave us some additional signal on the relative attendance at each workshop (in addition to workshop organizers’ estimates), which helps assign workshops to appropriately sized rooms. Second, it gave us a rough signal about the interest level from the community. We recognized that results from this type of survey are almost certainly biased, and kept this in mind when interpreting them.
Before considering the bulk of the 58 submissions, we note that there are a handful of large, long-standing workshops which the ACL organization agrees to pre-admit, including *SEM, WMT, CoNLL, and SemEval. These were all placed at their first-choice venue.
We then dug into our main responsibility of making accept/reject and placement decisions for the bulk of proposals. In making these decisions, we took into account proposal preferences, our reviews, available space, and results from the survey. Although we operated as a joint committee, ultimately the workshop chairs for each conference took responsibility for workshops accepted to their conference.
We first examined space. These 4 conferences in 2018 each had between 8 and 14 rooms available over 2 days, with room capacities ranging from 40 to 500 people. The total space available nearly matched the number of proposals. Specifically — had all proposals been accepted — there was enough space for all but 3 proposals to be at their first choice venue, and the remaining 3 at their second choice.
Considering the reviews, the 2 reviews per proposal showed very low variance: for about ⅔ of the proposals the final recommendations were identical, and for the remaining ⅓ they differed by 1 point on a 4-point scale. Overall, we were very impressed by the quality of the proposals, which covered a broad range of topics with strong organizing committees, reviewers, and invited speakers. No reviewer recommended 1 (clear reject) for any proposal. Further, the survey results for most borderline proposals showed reasonable interest from the community.
We also considered topicality. Here we found that there were 5 pairs of workshops where each requested the same conference as their first choice, and were topically very similar. In four of the pairs, we assigned a workshop to its second choice conference. In the final pair, in light of all the factors listed above, one workshop was rejected.
In summary, of the 58 proposals, 53 workshops were accepted to their first-choice conference; 4 were accepted to their second-choice conference; and 1 was rejected.
For the general chairs of *ACL conferences next year, we would definitely recommend continuing to organize a similarly large number of workshop rooms. For workshop chairs, we stress that reviewing and selecting workshops is qualitatively different than reviewing and selecting papers; for this reason, we recommend reviewing the proposals among the committee rather than recruiting reviewers (as was previously pointed out by the workshop chairs from the previous year). We would also suggest having workshop chairs consider using a structured form for workshop submissions, since a fair amount of manual effort was required to extract structured data from each proposal document.
Brendan O’Connor, University of Massachusetts Amherst
Eva Maria Vecchi, University of Cambridge
Tim Baldwin, University of Melbourne
Yoav Goldberg, Bar Ilan University
Jing Jiang, Singapore Management University
Marie Meteer, Brandeis University
Jason Williams, Microsoft Research
Marieke van Erp, KNAW Humanities Cluster
Vincent Ng, University of Texas at Dallas
Continuing our series on reproducibility in computational linguistics research, this guest post is from Prof. Kalina Bontcheva, from the Department of Computer Science at the University of Sheffield.
Tool Support for Low Overhead Reproducibility
The two previous guest posts on reproducibility in NLP did an excellent job of defining the different kinds of reproducibility in NLP, explaining why it is important, and identifying many of the stumbling points. Now let me try to provide some partial answers to the question of how we can achieve low-overhead reproducibility through automated tool support.
Before I begin, and to motivate the somewhat self-centred nature of this post: reproducible and extensible open science has been the main goal and core focus of my research, and that of the GATE team at the University of Sheffield, for close to two decades now. One of my first papers on developing reusable NLP algorithms dates back to 2002 and argues that open source frameworks (and GATE in particular) offer researchers the much-needed tool support that significantly lowers the overhead of NLP repeatability and reproducibility.
So, now 16 years on, let me return to this topic and provide a brief overview of how we address some of the technical challenges in NLP reproducibility through tool support. I will also share how researchers working on open NLP components have benefitted as a result from high visibility and citation counts. As always I will conclude with future work, i.e. outstanding repeatability challenges.
GATE Cloud: Repeatability-as-a-Service
As highlighted in the two previous guest blogs on reproducibility in NLP, there often are major stumbling blocks in repeating an experiment or re-running a method on new data. Examples include outdated software versions, differences in programming languages (e.g. Java vs Python), insufficient documentation and unknown parameter values. Add to this differences in input and output data formats and general software integration challenges, and it is no wonder that many PhD students (and other researchers) simply opt for citing results copied from the original publication.
The quest for low overhead repeatability led us to implement GATE Cloud. It provides an ever-growing set of NLP algorithms (e.g. POS taggers and NE recognisers in multiple languages) through a unified, easy-to-use REST web service interface. Moreover, it allows any GATE-based NLP component or application to be deployed automatically as a service.
Algorithms + Parameter Values + Data = Auto-Packaged Self-Contained Applications
We also realised early on that repeatability needs more than an open source algorithm, so GATE Developer (the Eclipse of NLP, as we like to call it) has the ability to auto-package an experiment by saving it as a GATE application. Effectively this makes a self-contained bundle of all software libraries and components, their parameters, and links to the data that they ran on. The latter is optional, as in some cases it is not possible to distribute copyright-protected datasets. Nevertheless, an application can still point to a directory where it expects the dataset; if the data is available on the user’s computer, it will be loaded and used automatically.
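The concept behind such auto-packaging can be sketched in a few lines: serialise the pipeline’s components, their parameter values, and a pointer to the data, so the whole run can be reloaded later. To be clear, the JSON layout and component names below are invented for illustration and are not GATE’s actual application format:

```python
import json

# Sketch of the idea behind auto-packaging an experiment: record the
# pipeline components, their parameter values, and a pointer to the data
# directory, so the whole experiment can be reloaded and re-run later.
# (Illustrative only; this is not GATE's saved-application format.)

def save_application(components, data_dir, path):
    bundle = {"components": components, "data_dir": data_dir}
    with open(path, "w", encoding="utf-8") as f:
        json.dump(bundle, f, indent=2)

def load_application(path):
    with open(path, encoding="utf-8") as f:
        return json.load(f)

pipeline = [
    {"name": "tokeniser", "params": {"lang": "en"}},
    {"name": "pos-tagger", "params": {"model": "default"}},
]
save_application(pipeline, data_dir="corpus/", path="experiment.json")
app = load_application("experiment.json")  # ready to re-run later
```

Keeping the data reference as a plain directory path, rather than bundling the data itself, is what lets the same saved experiment work for copyright-protected corpora that each user must obtain separately.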
Is My Algorithm Really Better?
A key strength of GATE is that it comes with a large number of reusable and repurposable open-source NLP components, e.g. named entity recognisers, POS taggers, tokenisers. (They aren’t always easy to spot in a vanilla GATE Developer, as they are packaged as optional plugins.) Many researchers not only re-run these as baselines, but also improve, extend, and/or repurpose them for new domains or applications. This then raises the question: is the new algorithm really better than the baseline, and in what ways? GATE aims to make such comparative evaluations and error analyses easier, through a set of reusable evaluation tools working at the document or corpus level.
Open Reproducible Science and Research Impact Indicators
Now, when I advocate open and repeatable science, I sometimes get asked about the significant overhead it could incur. So firstly, as already discussed, the GATE infrastructure very significantly reduces this burden; but secondly, in our experience, the extra effort more than pays off in terms of paper citations. Please allow me to cut some corners, as I’ll take just two examples to illustrate my point:
- The ACL’2002 paper that first introduced GATE currently has 2346 citations on Google Scholar
- Likewise, the equivalent Stanford CORE NLP paper currently has 1990 citations
In other words – allegiance to open and repeatable science tends to translate directly into high paper citation counts, h-indexes for the authors, and consistently excellent research impact evaluations.
The Unreproducible Nature of NLP for Social Media
And now – let me conclude with a reproducibility challenge. As more and more NLP research addresses social media content, the creation of reusable benchmark datasets is becoming increasingly important, but also somewhat elusive, thanks to the ephemeral nature of tweets and forum posts, account deletions, and 404 URLs. How we can solve this in the way most beneficial to the NLP research community remains to be seen.
Thank you for reading!
The post that follows is by our guest author Alice Motes, who is a Research Data and Preservation Manager at the University of Surrey, UK.
What’s Open science?
Great question! Open science refers to a wide range of approaches to doing science including open access to publications, open data, open software/code, open peer review, and citizen science (among others). It is driven by principles of transparency, accessibility, and collaboration, resulting in a very different model of production and dissemination of science than is currently practiced in most fields. There tends to be a gap between what scientists believe and how they actually behave. For example, most scientists agree that sharing data is important to the progress of science. However, fewer of those same scientists report sharing their data or having easily accessible data (Tenopir et al. 2011).
In many ways, open science is about engaging in full faith with the ideals of the scientific process, which prizes transparency, verification, reproducibility, and building on each other’s work to push the field forward. Open science encourages opening up all parts of the scientific process, but I want to focus on data. (Conveniently, the area I’m most familiar with! Funny that.) Open data is a natural extension of open access to academic publications.
Most scholars have probably benefited from open access to academic journals. (Or hit a paywall to a non-open access journal article. Did you know publishers have profit margins higher than Apple?) The strongest argument behind open access is that restrictions on who can get these articles slow scientific advancement and breakthroughs by disadvantaging scientists without access. Combine that with the fact that most research is partially or wholly funded by public money, and it's not a stretch to suggest that these outputs should be made available to the benefit of everyone, scientists and citizens alike.
Open data extends this idea into the realm of data, suggesting that sharing data for verification and reuse can catch errors earlier, foster innovative uses of data, and push science forward faster and more transparently to the benefit of the field. Not to mention the knock-on benefits of those advances to the public and broader society. Some versions of open data advocate for broadening access beyond scientific communities into the public sphere, where data may be examined and reused in potentially entrepreneurial ways to the benefit of society and the economy. You may also see the term open data applied to government agencies at all levels releasing data that they hold, as part of a push for transparency in governance and potential reuse by entrepreneurs, like using Transport for London's API to build travel apps.
What are the potential benefits to open data?
You mean beyond the benefits to your scholarly peers and broader society? Well there are lots of ways sharing data can be advantageous for you:
- More citations – there’s evidence to suggest that papers with accompanying data get cited more (Piwowar and Vision 2013).
- More exposure and impact – more people will see your work, which could lead to more collaborations and publications.
- Innovative reuse – your data may be useful in ways you don’t anticipate outside your field, leading to interdisciplinary impact and more data citations.
- Better reproducibility – the first reuser of your data is actually you! Plus, you help avert a crisis. (Need more reasons? Check out selfish reproducibility.)
Moreover, you’ll benefit from access to your peers’ shared data as well! Think about all the cool stuff you could do.
Great! I’m on board. How do I do it?
Well you just need to answer these three questions, really:
1. Can people get your data?
How are people going to find and download your files? Are you going to deposit the data into a repository?
2. Can people understand the data?
Ok so now they’ve got your data. Have you included enough documentation that they can understand your file organization, code, and supporting documents?
3. Can people use the data?
People have got a copy of your data and they know how to use it. Grand! But can they actually use it? Would someone have to buy expensive software to use it? Could you make a version of your data available in an open format? Have you supplied the code necessary to use the data? (Check out the Software Sustainability Institute for tips.)
For more check out the FAIR principles (Findable, Accessible, Interoperable and Reusable.)
Of course, there are some very good ethical, legal, and commercial reasons why sharing data is not possible, but I think the goal should be to strive towards the European Commission’s ideal of “as open as possible, as closed as necessary”. You can imagine different levels of sharing, expanding outward: within your lab, within your department, within your university, within your scholarly community, and publicly. Most funders across North America and Europe see data as a public good, with the greatest benefit coming from sharing it with the widest possible audience, and encourage publicly sharing data from the projects they fund.
Make an action plan or a data management plan
Here are some things to help you get the ball rolling on sharing data:
- Get started early and stay organized: document your research anticipating a future user. Check out Center for Open Science’s tools and tips.
- Deposit your data into a repository (e.g. Zenodo, Figshare). Many universities have their own repository. Some repositories integrate with GitHub, Dropbox, etc. to make it even easier!
- Get your data a DOI so citations can be tracked. (Repositories or your university library can do this for you.)
- Consider applying a license to your data. Don’t be too restrictive though! You want people to do cool things with your data.
- Ask for help: Your university likely has someone who can help with local resources. Probably in the library. Look for “Research Data Management”. You might find someone like me!
But I don’t have time to do it!
Aren’t you already creating documentation for yourself? You know, in case someone questions your findings after publication, or if reviewer 2 (always reviewer 2 :::shakes fist:::) questions your methods, or in a couple of months when you’re trying to figure out why you decided to run one analysis over another. Surely, making it intelligible to other people isn’t adding much to your workflow…or your graduate assistant’s workflow? If you incorporate these habits early in the process, you’ll cut down the time necessary to prepare data at the end. Also, if you consider how much time you spend planning, collecting, analyzing, writing, and revising, the amount of time it takes to prepare your data and share it is relatively small in the grand scheme of things. And why wouldn’t you want to have another output to share? Matthew Partridge, a researcher from the University of Southampton and cartoonist at Errant Science, has a great comic illustrating this:
In sum, open science and open data are a model for a more transparent and collaborative type of scientific inquiry – one that lives up to the best ideals of science as a community effort moving towards discovery and innovation. Plus, you get a cool new output to list on your CV and track its impact in the world. Not a bad shake if you ask me.
We are proud to announce that Dr. Fabiola Henri will give one of COLING 2018’s keynote talks.
Fabiola Henri has been an Assistant Professor at the University of Kentucky since 2014. She received a Ph.D. in Linguistics from the University of Paris Diderot, France in 2010. She is a creolist who primarily focuses on the structure and complexity of morphology in creole languages from the perspective of recent abstractive models, with insights from both information-theoretic and discriminative learning. Her work examines the emergence of creole morphology as proceeding from a complex interplay between sociohistorical context, natural language change, input from the lexifier, substratic influence, and unguided second language acquisition, among others. Her main interests lie within French-based creoles, and more specifically Mauritian, a language which she speaks natively. Her publications and various presentations offer an empirical and explanatory view of morphological change in French-based creoles, with a view on morphological complexity which starkly contrasts with exceptionalist theories of creolization.
Being able to reproduce experiments and results is important to advancing our knowledge, but it’s not something we’ve always been able to do well. In a series of guest posts, we have invited perspectives and advice on reproducibility in NLP.
by Liling Tan, Research Scientist at Rakuten Institute of Technology / Universität des Saarlandes.
I think there are at least three levels of reproducibility in NLP: (i) rerun, (ii) repurpose, and (iii) reimplementation.
At the rerun level, the aim is to re-run the open source code on the open dataset shared with the publication. It’s a sanity check one does to understand the practicality of the inputs and the expected outputs. This level of replication is often skipped because (i) the open data, open source code, or documentation is missing, or (ii) we trust the integrity of the researchers and the publication.
The repurpose level often starts out as a low-hanging fruit project. Usually, the goal is to modify the source code slightly to suit other purposes and/or datasets, e.g. if the code was an implementation of SRU to solve an image recognition task, maybe it could work for machine translation. Alternatively, one might also add the results from the previous state-of-the-art (SOTA) as features/inputs to the new approach.
The last reimplementation level is usually overlooked or done out of necessity. For example, an older SOTA might have stale code that doesn’t compile/run any more so it’s easier to reimplement the older SOTA technique into the framework you’ve created for the novel approach than to figure out how to make the stale code run. Often, the re-implementation might take quite some time and effort and in return, it produces that one line of numbers in the table of results.
More often, we see publications simply citing the results of the previous studies for SOTA comparisons on the same dataset instead of reimplementing and incorporating the previous methods into the code for the new methods. This is largely because of how we incentivize “newness” over “reproducibility” in research, but this is getting better as we see “reproducibility” as a reviewing criterion.
We seldom question the comparability of results once a publication has exceeded the SOTA performance on a common benchmark metric and dataset. Without replication, we often overlook the sensitivity to data munging that might be involved before putting the system output through a benchmarking script. For example, the widespread misuse of the infamous multi-bleu.perl evaluation script overlooked the fact that sentences need to be tokenized before computing the n-gram overlaps in BLEU. Even though the script and gold standards were consistent, different systems tokenized their outputs differently, making the results incomparable, especially when there is no open source code or clear documentation of the system reported in the publication. To resolve the multi-bleu.perl misuse, replicating a previous SOTA system with the same pre-/post-processing steps would give a fairer account of the comparability between the previous SOTA and the current approach.
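The tokenization sensitivity above is easy to demonstrate. Here is a minimal sketch (not the actual multi-bleu.perl code) of clipped n-gram precision, the building block of BLEU, showing how a system output with punctuation fused to the last word scores lower than the identical output tokenized to match the reference convention; the example sentence and helper function are illustrative, not from the post:

```python
from collections import Counter

def ngram_precision(ref_tokens, hyp_tokens, n):
    """Clipped (modified) n-gram precision, as used inside BLEU."""
    ref_counts = Counter(tuple(ref_tokens[i:i + n])
                         for i in range(len(ref_tokens) - n + 1))
    hyp_counts = Counter(tuple(hyp_tokens[i:i + n])
                         for i in range(len(hyp_tokens) - n + 1))
    # Each hypothesis n-gram is credited at most as often as it occurs in the reference.
    overlap = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
    return overlap / max(sum(hyp_counts.values()), 1)

ref = "the cat sat on the mat .".split()        # reference is tokenized
hyp_raw = "the cat sat on the mat.".split()     # system output: "mat." fused
hyp_tok = "the cat sat on the mat .".split()    # same output, tokenized

print(ngram_precision(ref, hyp_raw, 1))  # 5/6 ≈ 0.83: "mat." matches nothing
print(ngram_precision(ref, hyp_tok, 1))  # 1.0: identical token sequence
```

The two hypotheses are the same sentence; only the tokenization differs, yet the untokenized one loses credit for both "mat" and ".", which is exactly the inconsistency that made multi-bleu.perl scores incomparable across papers.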
Additionally, “simply citing” often undermines the currency of benchmarking datasets. Like software, datasets are constantly updated and patched; moreover, new datasets that are more relevant to the current day or latest shared task are created. But we see publications evaluating on dated benchmarks, most probably to draw comparisons with a previous SOTA. Hopefully, with “reproducibility” as a reviewing criterion, authors will pay more attention to the writing of the paper and share resources such that future work can easily replicate their systems on newer datasets.
The core ingredients of replication studies are open data and open source code, but lacking either shouldn’t hinder reproducibility. If the approach is well described in the publication, it shouldn’t be hard to reproduce the results on an open dataset. Without shared resources, open source code, and/or proper documentation, one may question the true impact of a publication that can’t be easily replicated.
We are proud to announce that Dr. James Pustejovsky will give one of COLING 2018’s keynote talks.
James Pustejovsky is the TJX Feldberg Chair in Computer Science at Brandeis University, where he is also Chair of the Linguistics Program, Chair of the Computational Linguistics MA Program, and Director of the Lab for Linguistics and Computation. He received his B.S. from MIT and his Ph.D. from UMASS at Amherst. He has worked on computational and lexical semantics for twenty five years and is chief developer of Generative Lexicon Theory. He has been committed to developing linguistically expressive lexical data resources for the CL and AI community. Since 2002, he has also been involved in the development of standards and annotated corpora for semantic information in language. Pustejovsky is chief architect of TimeML and ISO-TimeML, a recently adopted ISO standard for temporal information in language, as well as ISO-Space, a specification for spatial information in language.