The COLING 2018 main conference deadline is in about eight weeks — have you integrated error analysis into your workflow yet?
One distinctive feature of our review forms for COLING 2018 is the question we’ve added about error analysis in the form for the NLP Engineering Experiment paper type. Specifically, we will ask reviewers to consider:
- Error analysis: Does the paper provide a thoughtful error analysis, which looks for linguistic patterns in the types of errors made by the system(s) evaluated and sheds light on either avenues for future work or the source of the strengths/weaknesses of the systems?
Is error analysis required for NLP engineering experiment papers at COLING?
We’ve been asked this, in light of the fact that many NLP engineering experiment papers (by far the most common type of paper published in computational linguistics and NLP conferences of late) do not have error analysis and many of those are still influential, important and valuable.
Our response is of necessity somewhat nuanced. In our ideal world, all NLP engineering experiment papers at COLING 2018 would include thoughtful error analyses. We believe that this would amplify the contributions of the research we publish, both in terms of short-term interest and long-term relevance. However, we also recognize that error analysis is not yet as prominent in the field as it could be and, we’d say, should be.
And so, our answer is that error analysis is not a strict requirement. However, we ask our reviewers to look for it, to value it, and to include the value of the error analysis in their overall evaluation of the papers they review. (And conversely, we absolutely do not want to see reviewers complaining that space in the paper is ‘wasted’ on error analysis.)
But why is error analysis so important?
As Antske Fokkens puts it in her excellent guest post on reproducibility:
The outcome becomes much more convincing if the hypothesis correctly predicts which kind of errors the new approach would solve compared to the baseline. For instance, if you predict that reinforcement learning reduces error propagation, investigate the error propagation in the new system compared to the baseline. Even if it is difficult to predict where improvement comes from, a decent error analysis showing which phenomena are treated better than by other systems, which perform as good or bad and which have gotten worse can provide valuable insights into why an approach works or, more importantly, why it does not.
In other words, a good error analysis tells us something about why method X is effective or ineffective for problem Y. This in turn provides a much richer starting point for further research, allowing us to go beyond throwing learning algorithms at the wall of tasks and seeing which stick, while allowing us to also discover which are the harder parts of a problem. And, as Antske also points out, a good error analysis makes it easier to publish papers about negative results. The observation that method X doesn’t work for problem Y is far more interesting if we can learn something about why not!
How do you do error analysis anyway?
Fundamentally, error analysis involves examining the errors made by a system and developing a classification of them. (This is typically best done over dev data, to avoid compromising held-out test sets.) At a superficial level, this can involve breaking things down by input length or token frequency, or looking at confusion matrices. But we should not limit ourselves to examining only labels (rather than input linguistic forms), as with confusion matrices, or superficial properties of the linguistic signal. Languages are, after all, complex systems and linguistic forms are structured. So a deeper error analysis involves examining those linguistic forms and looking for patterns. The categories in the error analysis typically aren’t determined ahead of time, but rather emerge from the data. Does your sentiment analysis system get confused by counterfactuals? Does your event detection system miss negation not expressed by a simple form like ‘not’? Does your MT system trip up on translating pronouns, especially when they are dropped in the source language? Do your morphological analysis system or string-based features meant to capture noisy morphology make assumptions about the form and position of affixes that aren’t equally valid across test languages?
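To make the superficial end of this concrete, here is a minimal Python sketch of the kind of breakdown described above: error counts by input length and a tally of gold/predicted label pairs over dev data. Everything in it is an assumption made for illustration: the function names, the `(tokens, gold, pred)` tuple layout, the bucket width and the four-sentence toy dev set are all hypothetical, not taken from any particular system.

```python
from collections import Counter


def length_bucket(tokens, width=10):
    """Map a token list to a coarse length bucket label such as '0-9'."""
    low = (len(tokens) // width) * width
    return f"{low}-{low + width - 1}"


def error_rate_by_length(examples, width=10):
    """Return {bucket: (errors, total)} for (tokens, gold, pred) triples."""
    totals = Counter()
    errors = Counter()
    for tokens, gold, pred in examples:
        bucket = length_bucket(tokens, width)
        totals[bucket] += 1
        if gold != pred:
            errors[bucket] += 1
    return {bucket: (errors[bucket], totals[bucket]) for bucket in totals}


def confusion_counts(examples):
    """Count (gold, pred) label pairs: a confusion matrix in sparse form."""
    return Counter((gold, pred) for _, gold, pred in examples)


def misclassified(examples):
    """Collect the erroneous examples so that a person can read them."""
    return [(tokens, gold, pred) for tokens, gold, pred in examples if gold != pred]


# Hypothetical dev-set predictions from an imagined sentiment classifier.
dev = [
    ("I loved it".split(), "pos", "pos"),
    ("I would have loved it if the ending had made any sense at all".split(), "neg", "pos"),
    ("Not bad".split(), "pos", "neg"),
    ("Terrible".split(), "neg", "neg"),
]

for bucket, (n_err, n_total) in sorted(error_rate_by_length(dev).items()):
    print(f"length {bucket}: {n_err}/{n_total} errors")

for tokens, gold, pred in misclassified(dev):
    print(f"gold={gold} pred={pred}: {' '.join(tokens)}")
```

Counts like these only tell you where to look. The step that matters is reading what `misclassified` returns and asking what the errors have in common linguistically; in this toy set, that would be a counterfactual and a negated negative adjective.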
As Emily noted in a guest post over on the NAACL PC blog:
Error analysis of this type requires a good deal of linguistic insight, and can be an excellent arena for collaboration with linguists (and far more rewarding to the linguist than doing annotation). Start this process early. The conversations can be tricky, as you try to explain how the system works to a linguist who might not be familiar with the type of algorithms you’re using and the linguist in turn tries to explain the patterns they are seeing in the errors. But they can be rewarding in equal measure as the linguistic insight brought out by the error analysis can inform further system development.
This brings us to why COLING in particular should be a leader in placing the spotlight on error analysis: As we noted in a previous blog post, COLING has a tradition of being a locus of interdisciplinary communication between (computational) linguistics and NLP as practiced in computer science. Error analysis is a key, under-discussed component of our research process that benefits from such interdisciplinary communication.