Tool Support for Low Overhead Reproducibility

Continuing our series on reproducibility in computational linguistics research, this guest post is from Prof. Kalina Bontcheva, from the Department of Computer Science at the University of Sheffield.

The two previous guest posts on reproducibility in NLP did an excellent job of defining the different kinds of reproducibility, explaining why it is important, and identifying many of the stumbling points. Now let me try to provide some partial answers to the question of how we can achieve low overhead reproducibility through automated tool support.

Before I begin, and to motivate the somewhat self-centred nature of this post: reproducible and extensible open science has been the main goal and core focus of my research, and that of the GATE team at the University of Sheffield, for close to two decades now. One of my first papers on developing reusable NLP algorithms dates back to 2002 and argues that open source frameworks (and GATE in particular) offer researchers the much-needed tool support that significantly lowers the overhead of NLP repeatability and reproducibility.

So, now 16 years on, let me return to this topic and provide a brief overview of how we address some of the technical challenges in NLP reproducibility through tool support. I will also share how researchers working on open NLP components have benefited as a result, through high visibility and citation counts. As always, I will conclude with future work, i.e. outstanding repeatability challenges.

GATE Cloud: Repeatability-as-a-Service

As highlighted in the two previous guest posts on reproducibility in NLP, there are often major stumbling blocks in repeating an experiment or re-running a method on new data. Examples include outdated software versions, differences in programming languages (e.g. Java vs Python), insufficient documentation and unknown parameter values. Add to this differences in input and output data formats and general software integration challenges, and it is no wonder that many PhD students (and other researchers) simply opt for citing results copied from the original publication.

The quest for low overhead repeatability led us to implement GATE Cloud. It provides an ever-growing set of NLP algorithms (e.g. POS taggers and NE recognisers in multiple languages) through a unified, easy-to-use REST web service interface. Moreover, it allows any GATE-based NLP component or application to be deployed automatically as a service.

Algorithms + Parameter Values + Data = Auto-Packaged Self-Contained Applications

We also realised early on that repeatability needs more than an open source algorithm, so GATE Developer (the Eclipse of NLP, as we like to call it) has the ability to auto-package an experiment by saving it as a GATE application. Effectively this makes a self-contained bundle of all software libraries and components, their parameters, and links to the data that they ran on. The latter is optional, as in some cases it is not possible to distribute copyright-protected datasets. Nevertheless, an application can still point to a directory where it expects the dataset and, if the dataset is available on the user’s computer, it will be loaded and used automatically.
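
To make the bundle idea concrete, here is a minimal Python sketch of what such a self-contained experiment description might record. It is not GATE's actual application format: the `Component` and `ExperimentManifest` classes, the field names, the component names and the data path are all hypothetical, chosen only to illustrate "components plus parameter values plus an optional link to the data".

```python
import json
from dataclasses import dataclass, field, asdict
from pathlib import Path
from typing import Optional


@dataclass
class Component:
    """One processing step and the exact parameter values it ran with."""
    name: str
    version: str
    params: dict = field(default_factory=dict)


@dataclass
class ExperimentManifest:
    """Hypothetical bundle description (not a real GATE format)."""
    components: list
    # Optional: the data itself may not be redistributable, so we only
    # record where the experiment expects to find it.
    data_dir: Optional[str] = None

    def to_json(self) -> str:
        # asdict() recurses into the nested Component dataclasses.
        return json.dumps(asdict(self), indent=2, sort_keys=True)

    @staticmethod
    def from_json(text: str) -> "ExperimentManifest":
        raw = json.loads(text)
        components = [Component(**c) for c in raw["components"]]
        return ExperimentManifest(components=components,
                                  data_dir=raw.get("data_dir"))

    def data_available(self) -> bool:
        """True only if a data directory is named and exists locally."""
        return self.data_dir is not None and Path(self.data_dir).is_dir()


# Invented example values, for illustration only.
manifest = ExperimentManifest(
    components=[
        Component("tokeniser", "1.0", {"lowercase": False}),
        Component("pos-tagger", "2.3", {"model": "english-news"}),
    ],
    data_dir="corpora/protected-dataset",
)

# Round trip: what is saved is exactly what is restored.
restored = ExperimentManifest.from_json(manifest.to_json())
```

The point of the round trip is that someone receiving the saved description gets the same components and parameter values back, and `data_available()` lets the experiment degrade gracefully when the dataset cannot be shipped with it.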

Is My Algorithm Really Better?

A key strength of GATE is that it comes with a large number of reusable and repurposable open-source NLP components, e.g. named entity recognisers, POS taggers, tokenisers. (They aren’t always easy to spot in a vanilla GATE Developer, as they are packaged as optional plugins.) Many researchers not only re-run these as baselines, but also improve, extend, and/or repurpose them to new domains or applications. This then raises the question: is the new algorithm really better than the baseline and, if so, in what ways? GATE aims to make such comparative evaluations and error analyses easier, through a set of reusable evaluation tools that work at document or corpus level.
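
As an illustration of the kind of comparison meant here, the following Python sketch scores a baseline and a modified system against the same gold annotations, using strict-match precision, recall and F1 at corpus level, and lists the errors each system makes. It is not GATE's evaluation tooling: the `Annotation` representation (document id, start offset, end offset, type), the function names and the toy data are all assumptions made for the example.

```python
from typing import NamedTuple


class Annotation(NamedTuple):
    doc_id: str
    start: int
    end: int
    label: str


def prf(gold: set, predicted: set) -> tuple:
    """Strict-match precision, recall and F1: span and label must both agree."""
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def error_analysis(gold: set, predicted: set) -> dict:
    """Split errors into annotations that were missed and ones wrongly added."""
    return {
        "missing": sorted(gold - predicted),
        "spurious": sorted(predicted - gold),
    }


# Invented toy corpus of two documents.
gold = {
    Annotation("d1", 0, 5, "Person"),
    Annotation("d1", 10, 19, "Location"),
    Annotation("d2", 3, 8, "Organization"),
    Annotation("d2", 20, 26, "Person"),
}
baseline = {
    Annotation("d1", 0, 5, "Person"),
    Annotation("d1", 10, 19, "Organization"),  # right span, wrong type
    Annotation("d2", 3, 8, "Organization"),
}
improved = {
    Annotation("d1", 0, 5, "Person"),
    Annotation("d1", 10, 19, "Location"),
    Annotation("d2", 3, 8, "Organization"),
    Annotation("d2", 20, 25, "Person"),  # boundary error
}

for name, system in [("baseline", baseline), ("improved", improved)]:
    p, r, f = prf(gold, system)
    print(f"{name}: P={p:.3f} R={r:.3f} F1={f:.3f}")
    print("  errors:", error_analysis(gold, system))
```

Scoring both systems against the same gold set with the same matching criterion is what makes the numbers comparable; the error lists then show in what ways the systems differ, not just by how much.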

Open Reproducible Science and Research Impact Indicators

Now, when I advocate open and repeatable science, I sometimes get asked about the significant overhead it could incur. Firstly, as already discussed, the GATE infrastructure reduces this burden very significantly. Secondly, in our experience, the extra effort more than pays off in terms of paper citations. Please allow me to cut some corners, as I’ll take just two examples to illustrate my point:

The ACL 2002 paper that first introduced GATE currently has 2346 citations on Google Scholar
Likewise, the equivalent Stanford CoreNLP paper currently has 1990 citations

In other words, allegiance to open and repeatable science tends to translate directly into high paper citation counts and h-indexes for the authors, and consistently excellent research impact evaluations.

The Unreproducible Nature of NLP for Social Media

And now, let me conclude with a reproducibility challenge. As more and more NLP research addresses social media content, the creation of reusable benchmark datasets is becoming increasingly important, but also somewhat elusive, owing to the ephemeral nature of tweets and forum posts, account deletions, and URLs that now return 404 errors. How we can solve this in the way most beneficial to the NLP research community remains to be seen.

Thank you for reading!

COLING 2018

August 20-26, 2018, Santa Fe, New Mexico, USA
