Area Chairs – and Areas

Traditionally, areas are prescribed by program chairs, in anticipation of the field’s interests. This can lead to last-minute scrambles as the number of submissions ends up varying widely across areas. To avoid this, we chose to follow the methodology developed by Ani Nenkova and Owen Rambow for NAACL 2016 in sunny San Diego, CA, USA. For COLING 2018 we have not defined areas directly, but rather will let them emerge from the interests of the area chairs, expressed in keywords. These keywords are then also used to allocate reviewers to areas, and later, papers. That’s why at COLING you won’t be directly asked to select an area for your paper at all; this is managed automatically. You will only be asked to select the type of your paper and describe its focus in keywords, to make sure it’s reviewed correctly. If you don’t know what paper types are available, we highly recommend you see the list of paper types and review questions. The keywords are exposed through the submission interface.

Each area has two area chairs, as previous experience has shown that it’s helpful to have a collaborator with whom to discuss decisions and to share the workload, but that larger groups can lead to lack of clarity in who’s doing what work.  We created the AC pairings automatically, keeping the following in mind:

  • We want to maximize similarity of AC research expertise (as captured by the keywords provided) in each pair, across the global pairing.
  • We want to minimize AC pairs where there is a large timezone difference, to foster quick troubleshooting and discussion (in the end, only one pair is not in the same global region).
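
The two criteria above can be read as a weighted matching problem. The following is a minimal sketch of that reading, not the procedure actually used for COLING 2018: the chair names, keywords, UTC offsets, the Jaccard similarity measure and the `tz_weight` penalty are all illustrative assumptions, and the exhaustive search only suits a small roster with an even number of chairs.

```python
from functools import lru_cache

# Hypothetical area chairs: keyword sets and UTC offsets are invented for illustration.
CHAIRS = {
    "A": ({"parsing", "syntax", "treebanks"}, 1),
    "B": ({"parsing", "syntax", "morphology"}, 2),
    "C": ({"dialogue", "speech", "pragmatics"}, -8),
    "D": ({"dialogue", "speech", "generation"}, -5),
}


def jaccard(a, b):
    """Keyword overlap between two chairs, from 0.0 to 1.0."""
    return len(a & b) / len(a | b) if a | b else 0.0


def pair_score(x, y, tz_weight=0.02):
    """Similarity minus a penalty per hour of timezone difference (assumed weighting)."""
    (kw_x, tz_x), (kw_y, tz_y) = CHAIRS[x], CHAIRS[y]
    gap = abs(tz_x - tz_y) % 24
    gap = min(gap, 24 - gap)  # offsets wrap around the globe
    return jaccard(kw_x, kw_y) - tz_weight * gap


def best_pairing(names):
    """Exhaustively search for the pairing with the highest total score."""
    @lru_cache(maxsize=None)
    def solve(remaining):
        if not remaining:
            return 0.0, ()
        first, rest = remaining[0], remaining[1:]
        best = (float("-inf"), ())
        for i, partner in enumerate(rest):
            others = rest[:i] + rest[i + 1:]
            score, pairs = solve(others)
            total = score + pair_score(first, partner)
            if total > best[0]:
                best = (total, ((first, partner),) + pairs)
        return best

    return solve(tuple(sorted(names)))


total, pairs = best_pairing(CHAIRS)
print(pairs)
```

Optimizing the total score over the whole pairing, rather than greedily taking the best pair first, matches the "across the global pairing" wording: a greedy choice can leave the remaining chairs with poor matches.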

In addition, seven of the ACs have been designated not to a specific area but as “Special Circumstances” chairs, who can be called on to troubleshoot or advise as necessary.

Our final AC roster is as follows:

  • Afra Alishahi
  • Alexandre Rademaker
  • Alexis Palmer
  • Aline Villavicencio
  • Alvin Grissom II
  • Andrew Caines
  • Ann Clifton
  • Anna Rumshisky
  • Antske Fokkens
  • Arash Eshghi
  • Aurelie Herbelot
  • Avirup Sil
  • Barry Devereux
  • Chaitanya Shivade
  • Dan Garrette
  • Daniel Lassiter
  • David Schlangen
  • Dekai Wu
  • Deyi Xiong
  • Eric Nichols
  • Francis Bond
  • Frank Ferraro
  • Georgiana Dinu
  • Gerard de Melo
  • Gina-Anne Levow
  • Harry Bunt
  • Hatem Haddad
  • Isabelle Augenstein
  • Jiajun Zhang
  • Jose Camacho Collados
  • Klinton Bicknell
  • Lilja Øvrelid
  • Maja Popovic
  • Manuel Montes-y-Gómez
  • Marcos Zampieri
  • Marie-Catherine de Marneffe
  • Meliha Yetisgen
  • Michael Tjalve
  • Miguel Ballesteros
  • Mike Tian-Jian Jiang
  • Mohammad Taher Pilehvar
  • Na-Rae Han
  • Naomi Feldman
  • Natalie Schluter
  • Nathan Schneider
  • Nikola Ljubešić
  • Nurit Melnik
  • Qin Lu
  • Roman Klinger
  • Sadid A. Hasan
  • Sanja Štajner
  • Sara Tonelli
  • Sarvnaz Karimi
  • Sujian Li
  • Sunayana Sitaram
  • Tal Linzen
  • Valia Kordoni
  • Vivek Kulkarni
  • Viviane Moreira
  • Wei Xu
  • Wenjie Li
  • Xiang Ren
  • Xiaodan Zhu
  • Yang Feng
  • Yonatan Bisk
  • Yue Zhang
  • Yun-Nung Chen
  • Zachary Chase Lipton
  • Zeljko Agic
  • Zhiyuan Liu

With the following ACs in Special Circumstances, spread across the world’s timezones:

  • Anders Søgaard
  • Andreas Vlachos
  • Asad Sayeed
  • Di Jiang
  • Karin Verspoor
  • Kevin Duh
  • Steven Bethard

We are grateful to these distinguished scholars for the time and effort they are committing to COLING 2018!

Tool Support for Low Overhead Reproducibility

Continuing our series on reproducibility in computational linguistics research, this guest post is from Prof. Kalina Bontcheva, from the Department of Computer Science at the University of Sheffield.

The two previous guest posts on reproducibility in NLP did an excellent job of defining the different kinds of reproducibility in NLP, why it is important, and many of the stumbling points. Now let me try to provide some partial answers to the question of how we can have low overhead reproducibility through automated tool support.

Before I begin, and to motivate the somewhat self-centred nature of this post – reproducible and extensible open science has been the main goal and core focus of my research and that of the GATE team at the University of Sheffield for close to two decades now. One of my first papers on developing reusable NLP algorithms dates back to 2002 and argues that open source frameworks (and GATE in particular) offer researchers the much needed tool support that significantly lowers the overhead of NLP repeatability and reproducibility.

So, now 16 years on, let me return to this topic and provide a brief overview of how we address some of the technical challenges in NLP reproducibility through tool support. I will also share how researchers working on open NLP components have benefitted as a result from high visibility and citation counts. As always I will conclude with future work, i.e. outstanding repeatability challenges.

GATE Cloud: Repeatability-as-a-Service

As highlighted in the two previous guest blogs on reproducibility in NLP, there often are major stumbling blocks in repeating an experiment or re-running a method on new data. Examples include outdated software versions, differences in programming languages (e.g. Java vs Python), insufficient documentation and unknown parameter values. Add to this differences in input and output data formats and general software integration challenges, and it is no wonder that many PhD students (and other researchers) simply opt for citing results copied from the original publication.  

The quest for low overhead repeatability led us to implement GATE Cloud. It provides an ever-growing set of NLP algorithms (e.g. POS taggers and NE recognisers in multiple languages) through a unified, easy-to-use REST web service interface. Moreover, it allows the automatic deployment as a service of any GATE-based NLP component or application.

Algorithms + Parameter Values + Data = Auto-Packaged Self-Contained Applications

We also realised early on that repeatability needs more than an open source algorithm, so GATE Developer (the Eclipse of NLP as we like to call it) has the ability to auto-package an experiment by saving it as a GATE application. Effectively this makes a self-contained bundle of all software libraries and components, their parameters, and links to the data that they ran on. The latter is optional, as in some cases it is not possible to distribute copyright-protected datasets. Nevertheless, an application can still point to a directory where it expects the dataset and if available on the user’s computer, it will be loaded and used automatically.  
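
To make the idea of a self-contained bundle concrete, here is a minimal sketch of an experiment description that records components, their parameter values and a pointer to the data. This is not the GATE application file format: the field names, component names and the `experiment.json` file are hypothetical, chosen only to illustrate the principle that data is referenced and picked up if present.

```python
import json
import os
import tempfile

# Hypothetical bundle layout; this is NOT the GATE application file format.
bundle = {
    "components": [
        {"name": "tokeniser", "version": "1.0", "params": {"lowercase": False}},
        {"name": "pos_tagger", "version": "2.3", "params": {"model": "english"}},
    ],
    # Data is referenced, not embedded, so restricted corpora need not be shipped.
    "data_dir": "corpus",
}


def save_bundle(bundle, path):
    """Write the experiment description to a single JSON file."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(bundle, handle, indent=2, sort_keys=True)


def load_bundle(path):
    """Reload the description and report whether the referenced data is present."""
    with open(path, encoding="utf-8") as handle:
        loaded = json.load(handle)
    data_path = os.path.join(os.path.dirname(path), loaded["data_dir"])
    loaded["data_available"] = os.path.isdir(data_path)
    return loaded


with tempfile.TemporaryDirectory() as workdir:
    bundle_path = os.path.join(workdir, "experiment.json")
    save_bundle(bundle, bundle_path)
    restored = load_bundle(bundle_path)
    print(restored["data_available"])  # False here: no corpus directory was created
```

Resolving the data directory relative to the bundle file means the description can be moved to another machine and will find the corpus whenever it sits alongside it.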

Is My Algorithm Really Better?

A key strength of GATE is that it comes with a large number of reusable and repurposable open-source NLP components, e.g. named entity recognisers, POS taggers, tokenisers. (They aren’t always easy to spot in a vanilla GATE Developer, as they are packaged as optional plugins.) Many researchers not only re-run these as baselines, but also improve, extend, and/or repurpose them to new domains or applications. This then raises the question – is this new algorithm really better than the baseline, and in what ways? GATE aims to make such comparative evaluations and error analyses easier, through a set of reusable evaluation tools working on a document or corpus level.
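
As an illustration of what such a comparison involves, the sketch below scores a baseline and a new system against the same gold annotations and lists what changed. It does not use GATE's evaluation tools: the span representation, the strict-match criterion and the example annotations are assumptions made for this example.

```python
def prf(gold, predicted):
    """Strict-match precision, recall and F1 over sets of (start, end, label) spans."""
    true_pos = len(gold & predicted)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Invented annotations for a single document.
gold = {(0, 5, "PER"), (10, 16, "LOC"), (20, 27, "ORG"), (30, 34, "PER")}
baseline = {(0, 5, "PER"), (10, 16, "ORG"), (20, 27, "ORG")}
new_system = {(0, 5, "PER"), (10, 16, "LOC"), (20, 27, "ORG"), (40, 44, "LOC")}

for name, output in [("baseline", baseline), ("new system", new_system)]:
    p, r, f = prf(gold, output)
    print(f"{name}: P={p:.2f} R={r:.2f} F1={f:.2f}")

# Beyond the headline numbers: what did the new system fix, and what did it add?
fixed = (new_system & gold) - baseline
introduced = (new_system - gold) - baseline
print("fixed:", sorted(fixed))
print("newly introduced errors:", sorted(introduced))
```

Listing the fixed and newly introduced spans alongside the scores is what answers "in what ways": two systems with similar F1 can differ in which annotations they get right.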

Open Reproducible Science and Research Impact Indicators

Now, when I advocate open and repeatable science, I sometimes get asked about the significant overhead it could incur. So firstly, as already discussed, the GATE infrastructure very significantly reduces this burden, but secondly – in our experience – the extra effort more than pays off in terms of paper citations. Please allow me to cut some corners here, as I’ll take just two examples to illustrate my point:

In other words – allegiance to open and repeatable science tends to translate directly into high paper citation counts, h-indexes for the authors, and consistently excellent research impact evaluations.

The Unreproducible Nature of NLP for Social Media

And now – let me conclude with a reproducibility challenge. As more and more NLP research addresses social media content, the creation of reusable benchmark datasets is becoming increasingly important, but also somewhat elusive, thanks to the ephemeral nature of tweets and forum posts, account deletions, and 404 URLs. How we can solve this in the most beneficial way for the NLP research community remains to be seen.

Thank you for reading!

Open Science

The post that follows is by our guest author Alice Motes, who is a Research Data and Preservation Manager at the University of Surrey, UK.

What’s Open science?

Great question! Open science refers to a wide range of approaches to doing science including open access to publications, open data, open software/code, open peer review, and citizen science (among others). It is driven by principles of transparency, accessibility, and collaboration resulting in a very different model of production and dissemination of science than is currently practiced in most fields. There tends to be a gap between what scientists believe and how they actually behave. For example, most scientists agree that sharing data is important to the progress of science. However, fewer of those same scientists report sharing their data or having easily accessible data (Tenopir et al. 2011).

In many ways, open science is about engaging in full faith with the ideals of the scientific process, which prizes transparency, verification, reproducibility, and building on each other’s work to push the field forward. Open science encourages opening up all parts of the scientific process, but I want to focus on data. (Conveniently, the area I’m most familiar with! Funny that.) Open data is a natural extension of open access to academic publications.

Most scholars have probably benefited from open access to academic journals. (Or hit a paywall to a non-open access journal article. Did you know publishers have profit margins higher than Apple?) The strongest argument behind open access is that restrictions on who can get these articles slow scientific advancement and breakthroughs by disadvantaging scientists without access. Combine that with the fact that most research is partially or wholly funded by public money and it’s not a stretch to suggest that these outputs should be made available to the benefit of everyone, scientists and citizens.

Open data is extending this idea into the realm of the data, suggesting that sharing data for verification and reuse can catch errors earlier, foster innovative uses of data, and push science forward faster and more transparently to the benefit of the field. Not to mention the knock-on benefits of those advances to the public and broader society. Some versions of open data advocate for broadening access beyond scientific communities into the public sphere, where data may be examined and reused in potentially entrepreneurial ways to the benefit of society and the economy. You may also see the term open data applied in relation to government agencies at all levels releasing data that they hold as part of a push for transparency in governance and potential reuse by entrepreneurs, like using Transport for London’s API to build travel apps.

What are the potential benefits to open data?

You mean beyond the benefits to your scholarly peers and broader society? Well there are lots of ways sharing data can be advantageous for you:

  • More citations – there’s evidence to suggest that papers with accompanying data get cited more (Piwowar and Vision 2013).
  • More exposure and impact – more people will see your work, which could lead to more collaborations and publications.
  • Innovative reuse – your data may be useful in ways you don’t anticipate outside your field, leading to interdisciplinary impact and more data citations.
  • Better reproducibility: The first reuser of your data is actually you! Plus, avoid a crisis. (Need more reasons? Check out selfish reproducibility).

Moreover, you’ll benefit from access to your peers’ shared data as well! Think about all the cool stuff you could do.

Great! I’m on board. How do I do it?

Well you just need to answer these three questions, really:

1. Can people get your data?

How are people going to find and download your files? Are you going to deposit the data into a repository?

2. Can people understand the data?

Ok so now they’ve got your data. Have you included enough documentation that they can understand your file organization, code, and supporting documents?

3. Can people use the data?

People have got a copy of your data and they know how to use it. Grand! But can they actually use it? Would someone have to buy expensive software to use it? Could you make a version of your data available in an open format? Have you supplied the code necessary to use the data? (Check out the Software Sustainability Institute for tips.)

For more check out the FAIR principles (Findable, Accessible, Interoperable and Reusable.)

Of course, there are some very good ethical, legal, and commercial reasons why sharing data is not possible, but I think the goal should be to strive towards the European Commission’s ideal of “as open as possible, as closed as necessary”. You can imagine different levels of sharing, expanding outward: within your lab, within your department, within your university, within your scholarly community, and publicly. Most funders across North America and Europe see data as a public good, with the greatest benefit coming from sharing data to the widest possible audience, and encourage publicly sharing data from their funded projects.

Make an action plan or a data management plan

Here are some things to do to help you get the ball rolling on sharing data:

  • Get started early and stay organized: document your research anticipating a future user. Check out Center for Open Science’s tools and tips.
  • Deposit your data into a repository (e.g. Zenodo, Figshare). Many universities have their own repository. Some repositories integrate with GitHub, Dropbox, etc. to make it even easier!
  • Get your data a DOI so citations can be tracked. (Repositories or your university library can do this for you.)
  • Consider applying a license to your data. Don’t be too restrictive though! You want people to do cool things with your data.
  • Ask for help: Your university likely has someone who can help with local resources. Probably in the library. Look for “Research Data Management”. You might find someone like me!
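
One small habit that helps with both "can people get your data?" and "can people use the data?" is to ship a checksum manifest with the deposit, so that anyone who downloads it can confirm the files arrived intact. The sketch below is one way to do that with the Python standard library; the file names and README text are placeholders, and repositories may offer their own integrity checks.

```python
import hashlib
import os
import tempfile


def sha256_of(path, chunk_size=65536):
    """Hash a file in chunks so large data files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its SHA-256 checksum."""
    manifest = {}
    for folder, _, files in os.walk(root):
        for name in sorted(files):
            full = os.path.join(folder, name)
            relative = os.path.relpath(full, root).replace(os.sep, "/")
            manifest[relative] = sha256_of(full)
    return manifest


# Demonstration on a throwaway directory with placeholder files.
with tempfile.TemporaryDirectory() as deposit:
    with open(os.path.join(deposit, "README.txt"), "w", encoding="utf-8") as handle:
        handle.write("Describe the files, columns and licence here.\n")
    with open(os.path.join(deposit, "data.csv"), "w", encoding="utf-8", newline="") as handle:
        handle.write("id,text\n1,hello\n")
    manifest = build_manifest(deposit)
    for path, checksum in sorted(manifest.items()):
        print(checksum, path)
```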

But I don’t have time to do it!

Aren’t you already creating documentation for yourself? You know, in case someone questions your findings after publication or if reviewer 2 (always reviewer 2 :::shakes fist:::) questions your methods or in a couple months when trying to figure out why you decided to run one analysis over another. Surely, making it intelligible to other people isn’t adding much to your workflow…or your graduate assistant’s workflow? If you incorporate these habits early in the process you’ll cut down the time necessary to prepare data at the end. Also, if you consider how much time you spend planning, collecting, analyzing, writing, and revising, the amount of time it takes to prepare your data and share it is relatively small in the grand scheme of things. And why wouldn’t you want to have another output to share? Matthew Partridge, a researcher from the University of Southampton and cartoonist at Errant Science, has a great comic illustrating this:

Image by Matthew Partridge, Errant Science

In sum, open science and open data are a model for a more transparent and collaborative type of scientific inquiry. One that lives up to the best ideals of science as a community effort all moving towards discovery and innovation. Plus you get a cool new output to list on your CV and track its impact in the world. Not a bad shake if you ask me.

Speaker profile – Fabiola Henri

We are proud to announce that Dr. Fabiola Henri will give one of COLING 2018’s keynote talks.

Fabiola Henri has been an Assistant Professor at the University of Kentucky since 2014. She received a Ph.D. in Linguistics from the University of Paris Diderot, France in 2010. She is a creolist who primarily focuses on the structure and complexity of morphology in creole languages from the perspective of recent abstractive models, with insights from both information-theoretic and discriminative learning. Her work examines the emergence of creole morphology as proceeding from a complex interplay between sociohistorical context, natural language change, input from the lexifier, substratic influence, unguided second language acquisition, among others. Her main interests lie within French-based creoles, and more specifically Mauritian, a language which she speaks natively. Her publications and various presentations offer an empirical and explanatory view of morphological change in French-based creoles, with a view on morphological complexity which starkly contrasts with Exceptionalist theories of creolization.

https://linguistics.as.uky.edu/users/fshe223

Reproducibility in NLP – Guest Post

Being able to reproduce experiments and results is important to advancing our knowledge, but it’s not something we’ve always been able to do well. In a series of guest posts, we have invited perspectives and advice on reproducibility in NLP.

by Liling Tan, Research Scientist at Rakuten Institute of Technology / Universität des Saarlandes.

I think there are at least 3 levels of reproducibility in NLP: (i) Rerun, (ii) Repurpose, (iii) Reimplementation.

At the rerun level, the aim is to re-run the open source code on the open dataset shared from the publication. It’s sort of a sanity check that one would do to understand the practicality of the inputs and the expected outputs. This level of replication is often skipped because either (i) the open data, open source code or perhaps the documentation is missing or (ii) we trust the integrity of the researchers and the publication.

The repurpose level often starts out as a low-hanging fruit project. Usually, the goal is to modify the source code slightly to suit other purposes and/or datasets, e.g. if the code was an implementation of SRU to solve an image recognition task, maybe it could work for machine translation. Alternatively, one might also add the results from the previous state-of-the-art (SOTA) as features/inputs to the new approach.

The last reimplementation level is usually overlooked or done out of necessity. For example, an older SOTA might have stale code that doesn’t compile/run any more so it’s easier to reimplement the older SOTA technique into the framework you’ve created for the novel approach than to figure out how to make the stale code run. Often, the re-implementation might take quite some time and effort and in return, it produces that one line of numbers in the table of results.

More often, we see publications simply citing the results of the previous studies for SOTA comparisons on the same dataset instead of reimplementing and incorporating the previous methods into the code for the new methods. This is largely because of how we incentivize “newness” over “reproducibility” in research, but this is getting better as we see “reproducibility” as a reviewing criterion.

We seldom question the comparability of results once a publication has exceeded the SOTA performance on a common benchmark metric and dataset. Without replication, we often overlook the sensitivity of data munging that might be involved before putting the system output through a benchmarking script. For example, the abuse of the infamous multi-bleu.perl evaluation script overlooked the fact that sentences need to be tokenized before computing the n-gram overlaps in BLEU. Even though the script and gold standards were consistent, different systems have been tokenizing their outputs differently, making comparability of results inconsistent, especially if there’s no open source or clear documentation of the system reported in the publication. To resolve the multi-bleu.perl misuse, replicating a previous SOTA system using the same pre-/post-processing steps would have given a fairer account of the comparability between the previous SOTA and current approach.
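
The tokenization effect is easy to see on a toy example. The sketch below is not multi-bleu.perl and does not compute full BLEU; it computes only the clipped unigram precision that BLEU is built on, and the two tokenizers and the example sentences are simplified assumptions made for illustration.

```python
import re
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def ngram_precision(hypothesis, reference, n):
    """Clipped n-gram precision, the building block of BLEU."""
    hyp, ref = ngrams(hypothesis, n), ngrams(reference, n)
    matched = sum(min(count, ref[gram]) for gram, count in hyp.items())
    total = sum(hyp.values())
    return matched / total if total else 0.0


def whitespace_tokens(text):
    return text.split()


def punctuation_tokens(text):
    # Simplified stand-in for a real tokenizer: split off punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text)


reference = "The cat sat on the mat, didn't it?"
system_output = "The cat sat on the mat , did it ?"

for tokenize in (whitespace_tokens, punctuation_tokens):
    p1 = ngram_precision(tokenize(system_output), tokenize(reference), 1)
    print(tokenize.__name__, round(p1, 2))
```

The system output is identical in both runs, yet its unigram precision is 0.5 when both sides are split on whitespace and 0.9 when punctuation is split off first. A reported score is therefore only comparable to another one if both were produced with the same pre- and post-processing.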

Additionally, “simply citing” often undermines the currency of benchmarking datasets. Like software, datasets are constantly updated and patched; moreover new datasets that are more relevant to the current day or latest shared task are created. But we see publications evaluating on dated benchmarks, most probably to draw comparison with a previous SOTA. Hopefully with “reproducibility” as a criterion in reviewing, authors pay more attention to the writing of the paper and share resources such that future work can easily replicate their systems on newer datasets.

The core ingredients of replication studies are open data and open source code. But lacking either shouldn’t hinder reproducibility. If the approaches are well-described in the publication, it shouldn’t be hard to reproduce the results on an open dataset. Without shared resources, open source code, and/or proper documentation, one may question the true impact of a publication that can’t be easily replicated.

Speaker profile – James Pustejovsky

We are proud to announce that Dr. James Pustejovsky will give one of COLING 2018’s keynote talks.

James Pustejovsky is the TJX Feldberg Chair in Computer Science at Brandeis University, where he is also Chair of the Linguistics Program, Chair of the Computational Linguistics MA Program, and Director of the Lab for Linguistics and Computation. He received his B.S. from MIT and his Ph.D. from UMASS at Amherst. He has worked on computational and lexical semantics for twenty-five years and is chief developer of Generative Lexicon Theory. He has been committed to developing linguistically expressive lexical data resources for the CL and AI community. Since 2002, he has also been involved in the development of standards and annotated corpora for semantic information in language. Pustejovsky is chief architect of TimeML and ISO-TimeML, a recently adopted ISO standard for temporal information in language, as well as ISO-Space, a specification for spatial information in language.

James Pustejovsky has authored and/or edited numerous books, including Generative Lexicon (MIT, 1995), The Problem of Polysemy (CUP, with B. Boguraev,1996), The Language of Time: A Reader (OUP, with I. Mani and R. Gaizauskas, 2005), Interpreting Motion: Grounded Representations for Spatial Language (OUP, with I. Mani, 2012), and Natural Language Annotation for Machine Learning, O’Reilly, 2012 (with A. Stubbs). Recently, he has been developing a modeling framework for representing linguistic expressions, gestures, and interactions as multimodal simulations. This platform, VoxML/VoxSim, enables real-time communication between humans and computers and robots for joint tasks. Recent books include: Recent Advances in Generative Lexicon Theory, (Springer, 2013); The Handbook of Linguistic Annotation, Springer, 2017 (edited with Nancy Ide), and two textbooks, The Lexicon, Cambridge University Press, 2018 (with O. Batiukova), and A Guide to Generative Lexicon Theory, Oxford University Press, 2019 (with E. Jezek). He is presently finishing a book on temporal information processing for O’Reilly with L. Derczynski and M. Verhagen.

Speaker profile – Min-Yen Kan

We are proud to announce that Dr. Min-Yen Kan will give one of COLING 2018’s keynote talks.

Min-Yen Kan (BS;MS;PhD Columbia Univ.) is an associate professor at the National University of Singapore.  He is a senior member of the ACM and a member of the IEEE.  Currently, he is an associate editor for the journal “Information Retrieval” and is the Editor for the ACL Anthology, the computational linguistics community’s largest archive of published research.  His research interests include digital libraries and applied natural language processing.  Specific projects include work in the areas of scientific discourse analysis, full-text literature mining, machine translation and applied text summarization. More information about him and his group can be found at the WING homepage: http://wing.comp.nus.edu.sg/

Slowly Growing Offspring: Zigglebottom Anno 2017 – Guest post

Being able to reproduce experiments and results is important to advancing our knowledge, but it’s not something we’ve always been able to do well. In a series of guest posts, we have invited perspectives and advice on reproducibility in NLP; this one is from Antske Fokkens.

Reflections on Improving Replication and Reproduction in Computational Linguistics

(See Ted Pedersen’s Empiricism is not a Matter of Faith for the Sad Tale of the Zigglebottom Tagger)

A little over four years ago, we presented our paper Offspring from Reproduction Problems at ACL. The paper discussed two case studies in which we failed to replicate results. While investigating the problem, we found that results differed to an extent that they led to completely different conclusions. The settings, preprocessing and evaluation whose (small) variations led to these changes were not even reported in the original papers.

Though some progress has been made on the level of ensuring both replication (obtaining the same results using the same experiment) and reproduction (reaching the same conclusion through different means), the problem described in 2013 still seems to apply to the majority of the computational linguistics papers published in 2017. In this blog post, I’d like to reflect on the progress that has been made, but also on the progress we still need to make on the level of publishing both replicable and reproducible research. The core issue around replication is the lack of means provided to other researchers to repeat an experiment carried out elsewhere. Issues around reproducing results are more diverse, but I believe that the way we look at evidence and comparison to previous work in our field is a key element of the problem. I will argue that major steps in addressing these issues can be made by (1) increasing appreciation for replicability and reproducibility in published research and (2) changing the way we use the ‘state-of-the-art’ when judging research in our field. More specifically, good papers provide insight into and understanding of a computational linguistics or NLP problem. Reporting results that beat the state-of-the-art is neither sufficient nor necessary for a paper to provide a valuable research contribution.

Replication Problems and Appreciation for Sharing Code

Attention for replicable results (sharing code and resources) has increased in the last four years. Links to git repositories or other version control systems are more and more common and review forms of the main conferences include a question addressing the possibilities of replication. Our research group CLTL has adopted a policy indicating that code and resources not restricted by third party licenses must be made available when publishing. When reading related work for my own research, I have noticed similar tendencies in, among others, the UKP-group in Darmstadt, Stanford NLP and the CS and Linguistics departments of the University of Washington. Our PhD students furthermore typically start by replicating or reproducing previous work which they can then use as a baseline. From their experience, I noticed that the problems reported four years ago still apply today. Results were close or comparable sometimes and once even higher, but also regularly far off. Sometimes provided code did not even run. Authors often provided feedback, but even with their help (sometimes they went as far as looking at our code), the original results could not be replicated. I currently find myself on the other side of the table, with two graduate students wanting to use an analysis from my PhD and the (openly available) code producing errors.

There can be valid reasons for not sharing code or resources. Research teams from industry have often delivered interesting and highly relevant contributions to NLP research and it is difficult to obtain corpora from various genres without copyright on the text. I therefore do not want to argue for less appreciation for research without open source code and resources, but I would very much want to advocate for more appreciation for research that does provide the means for replicating results. In addition to being openly verifiable, it also provides additional means for other researchers to build their work directly upon previous work rather than first going through the frustration of reimplementing a new baseline system good enough to test their hypotheses on.

The General Reproducible and Replicable State-of-the-Art

Comparing performance on benchmark systems has helped in gaining insight into the performance of our systems and in comparing various approaches. Evaluation in our field is often limited to testing whether an approach beats the state-of-the-art. Many even seem to see this as the main purpose to the extent that reviewers rate papers down that don’t beat the state-of-the-art. I suspect that researchers often do not even bother to try and publish their work if performance remains below the best reported. The purpose of evaluation actually is, or should be, to provide insight into how a model works, what phenomena it captures or which patterns the machine learning algorithm picked up, compared to alternative approaches. Moreover, the difficulties involved in replicating results make the practice of judging research on whether it beats the state-of-the-art rather questionable. Reported results may be misleading regarding the actual state-of-the-art. In general, papers should be evaluated based on what they teach us, i.e. whether they verify their hypothesis by comparing it to a suitable baseline. A suitable baseline may indeed mean a baseline that corresponds to the state-of-the-art, but this state-of-the-art should be a valid reflection of what current technologies do.

I would therefore like to introduce the notions of the reproducible state-of-the-art and the generally replicable state-of-the-art. These two notions both aim at gaining better insight into the true state-of-the-art and making building on top of that more accessible to a wider range of researchers. I understand a ‘reproducible state-of-the-art’ to be a result obtained by different groups of researchers independently, which increases the likelihood of providing a reliable result and a baseline that is feasible to reproduce for other researchers. This implies having more appreciation for papers that come relatively close to the state-of-the-art without necessarily beating it. Chances of results being reproducible also increase if they hold across datasets and can be obtained by multiple machine learning runs (e.g. if they are relatively stable across different initializations and orders of processing training data by a neural network). The ‘generally replicable state-of-the-art’ refers to the best reported results obtained by a fully available system and, preferably, one that can be trained and run using computational resources available to the average NLP research group. One way to obtain better open source systems and encourage researchers to share their resources and code is by instructing reviewers to appreciate improving the new generally replicable state-of-the-art (with open source code and available resources) as much as improving the reported state-of-the-art.
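
Checking stability across machine learning runs can be as simple as repeating the experiment with several random seeds and reporting the spread next to the mean. The sketch below shows the reporting side only: `train_and_evaluate` is a hypothetical stand-in that returns made-up scores, and in practice it would train the model with the given seed (and data order) and return the real evaluation score.

```python
import random
import statistics


def train_and_evaluate(seed):
    """Hypothetical stand-in for a training run; returns a made-up F1 score."""
    rng = random.Random(seed)
    return 0.80 + rng.uniform(-0.02, 0.02)


def summarize(scores):
    """Mean, standard deviation and range over repeated runs."""
    return {
        "mean": statistics.mean(scores),
        "stdev": statistics.stdev(scores),
        "min": min(scores),
        "max": max(scores),
    }


seeds = [11, 23, 37, 41, 59]
scores = [train_and_evaluate(seed) for seed in seeds]
summary = summarize(scores)
print(f"F1 = {summary['mean']:.3f} +/- {summary['stdev']:.3f} "
      f"(range {summary['min']:.3f} to {summary['max']:.3f}, n={len(scores)})")
```

If the gap between a new system and the baseline is smaller than the spread across seeds, a single reported number says little about which system is actually better.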

Understanding Computational Models for Natural Language

In the introduction of this blog, I claimed that improving the state-of-the-art is neither necessary nor sufficient for providing an important contribution to computational linguistics. NLP papers often introduce an idea and show that by adding the features or adapting the machine learning approach associated with that idea improves results. Many authors take the improved results as evidence that the idea works, but this is not necessarily the case: improvement can be due to other differences in settings or random variations. The outcome becomes much more convincing if the hypothesis correctly predicts which kind of errors the new approach would solve compared to the baseline. For instance, if you predict that reinforcement learning reduces error propagation, investigate the error propagation in the new system compared to the baseline. Even if it is difficult to predict where improvement comes from, a decent error analysis showing which phenomena are treated better than by other systems, which perform as good or bad and which have gotten worse can provide valuable insights into why an approach works or, more importantly, why it does not. This has several advantages: first of all, if we have better insights into what information and which algorithms help for similar and which for different phenomena, we have a better idea of how to further improve our systems (for those among you who are convinced that achieving high f-scores is our ultimate goal). It becomes easier to publish negative results , which in turn promotes progress by preventing other  research groups from going down the same pointless road without knowing of each other’s work. We may learn whether an approach works or does not work due to particularities of the data we are working with. Moreover, an understood result is more likely to be a reproducible result and even if it is not, details about what is working exactly may help other researchers to find out why they cannot reproduce it. 
In my opinion, this is where our field fails most: we are too easily satisfied when results are high and do not aim for deep insight frequently enough. Of the points I have raised in this post, this aspect may be the hardest to tackle. On the upside, addressing it is not made impossible by licenses, copyright and commercial code.
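
As a minimal sketch of the kind of error analysis described above, the following Python snippet compares a baseline and a new system item by item and groups the outcome by phenomenon. Everything in it is an assumption made for illustration: the function name, the four outcome labels, the phenomenon tags and the toy data are hypothetical, and real data would need phenomenon annotations of its own.

```python
from collections import defaultdict


def per_phenomenon_comparison(gold, baseline, system, phenomena):
    """Count, per phenomenon, how the new system's outcome differs from the baseline's.

    All four arguments are parallel lists with one entry per test item.
    The outcome labels ("fixed", "broken", ...) are illustrative names only.
    """
    counts = defaultdict(
        lambda: {"fixed": 0, "broken": 0, "both_right": 0, "both_wrong": 0}
    )
    for g, b, s, p in zip(gold, baseline, system, phenomena):
        base_ok = b == g
        sys_ok = s == g
        if sys_ok and not base_ok:
            key = "fixed"        # the new system solves an error of the baseline
        elif base_ok and not sys_ok:
            key = "broken"       # the new system introduces a new error
        elif base_ok:
            key = "both_right"   # no change: both systems are correct
        else:
            key = "both_wrong"   # no change: both systems are wrong
        counts[p][key] += 1
    return {p: dict(c) for p, c in counts.items()}


# Toy data (hypothetical): six items tagged with two made-up phenomena.
gold = ["A", "B", "A", "B", "A", "B"]
baseline = ["A", "A", "B", "B", "B", "A"]
system = ["A", "B", "A", "A", "B", "A"]
phenomena = ["negation"] * 3 + ["long_distance"] * 3

result = per_phenomenon_comparison(gold, baseline, system, phenomena)
for phenomenon, outcome in result.items():
    print(phenomenon, outcome)
```

On this toy data the new system has the higher overall accuracy (three correct items against two), yet all of its gains fall under one phenomenon and it introduces a new error under the other. A single aggregate score would hide exactly this kind of pattern.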

Moving Forward

As a community, we are responsible for improving the quality of our research. Most of the effort will probably have to come from the bottom up: individual researchers can decide to write (only) papers that have a solid methodological setup, that aim for insights in addition to, or even rather than, high f-scores, and that provide code and resources whenever allowed. They can also decide to value papers that follow such practices more highly and to be (more) critical of papers that do not provide insights or a good understanding of the methods. Initiatives such as the workshops Analyzing and Interpreting Neural Networks for NLP, Building and Breaking, Ethics in NLP, Relevance of Linguistic Structure in Neural NLP (and many others) show that the desire to obtain better understanding is very much alive in the community.

Researchers serving as program chairs can play a significant role in further encouraging authors and reviewers. The categories of best papers proposed for COLING 2018 are a nice example of an incentive that appreciates a variety of contributions to the field. The main conference’s review forms have included questions about resources provided by the paper. Last year, however, the option ‘no code or resources provided’ was followed by ‘(most submissions)’. As a reviewer, I wondered: why this addition? We should at least try to move towards a situation in which providing code and resources is normal or maybe even standard. The new NAACL form refers to the encouragement of sharing research for papers introducing new systems. I hope this will also be included for other paper categories and that the chairs will connect this encouragement to a reward for authors who do share. I also hope chairs and editors, for all conferences, journals and workshops, will remind their reviewers of the fragility of reported results and ask them to take this into consideration when judging whether empirical results are sufficient compared to related work. Most of all, I hope many researchers will feel encouraged to submit insightful research with low as well as high results, and I hope to learn much from it.

Thank you for reading. Please share your ideas and thoughts: I’d specifically love to hear from researchers who have different opinions.

Antske Fokkens

https://twitter.com/antske

Acknowledgements: I’d like to thank Leon Derczynski for inviting me to write this post. Thanks to Ted Pedersen (whom I have never met in person) for that crazy Saturday we spent hacking across the ocean to finally find out why the original results could not be replicated. I’d like to thank Emily Bender for valuable feedback. Last but not least, thanks to the members of the CLTL research group for discussions and inspiration on this topic, as well as the many, many colleagues from all over the world I have exchanged thoughts with on this topic over the past four years!

Speaker profile – Hannah Rohde

COLING 2018 will have four full keynote speeches. As we announce the speakers, we’ll introduce them via this blog, too. We are quite proud of this lineup, and it’s hard to refrain from just putting all the info out there at once! So we’ll start by crowing about Dr. Hannah Rohde.

Hannah Rohde is a Reader in Linguistics & English Language at the University of Edinburgh. She works in experimental pragmatics, using psycholinguistic techniques to investigate questions in areas such as pronoun interpretation, referring expression generation, implicature, presupposition, and the establishment of discourse coherence. Her undergraduate degree was in Computer Science and Linguistics from Brown University, from which she went on to complete a PhD in Linguistics at the University of California San Diego, followed by postdoctoral fellowships at Northwestern and Stanford. She currently helps organise the working group on empiricism for the EU-wide “TextLink: Structuring discourse in multilingual Europe” COST Action network and is a recipient of the 2017 Philip Leverhulme Prize in Languages and Literatures.

http://www.lel.ed.ac.uk/~hrohde/

Best paper categories and requirements

Recognition of excellent work is very important.  In particular, we see the role of best/outstanding paper awards as being two-fold: On the one hand, it is a chance for the conference program committee to highlight papers it found particularly compelling and promote them to a broader audience.  On the other hand, it provides recognition to the authors and may help advance their careers.

From the perspective of both of these functions we think it is critical that different kinds of excellent work be recognized.  Accordingly, we have established an expanded set of categories in which an award will be given for COLING 2018. The categories are:

  • Best linguistic analysis
  • Best NLP engineering experiment
  • Best reproduction paper
  • Best resource paper
  • Best position paper
  • Best survey paper
  • Best evaluation, for a paper that does its evaluation very well
  • Most reproducible, where the paper’s work is highly reproducible
  • Best challenge, for a paper that sets a new challenge
  • Best error analysis, where the linguistic analysis of failures is exemplary

The first six of these correspond to our paper types. The last four cut across those categories, at least to a certain extent. We hope that ‘Best evaluation’ and ‘Most reproducible’ in particular will provide motivation for raising the bar in best practice in these areas.

A winner will be selected for each category by a best paper committee. However, while there are more opportunities for recognition, we’ve also raised the minimum requirements for winning a prize. Namely, any work with associated code or other resources must make those openly available, and do so before the best paper committee finishes selecting works.

We’ve taken this step to provide a solid reward for those who share their work and help advance our field (see e.g. “Sharing is Caring”, Nissim et al. 2017, Computational Linguistics), without excluding others (e.g. industrial authors) who cannot easily share work from participating in COLING 2018’s many tracks.

We look forward with great anticipation to this collection of papers!