Slowly Growing Offspring: Zigglebottom Anno 2017 – Guest post

Being able to reproduce experiments and results is important to advancing our knowledge, but it’s not something we’ve always been able to do well. In a series of guest posts, we have invited perspectives and advice on reproducibility in NLP; this one is from Antske Fokkens.

Reflections on Improving Replication and Reproduction in Computational Linguistics

(See Ted Pedersen’s Empiricism is not a Matter of Faith for the Sad Tale of the Zigglebottom Tagger)

A little over four years ago, we presented our paper Offspring from Reproduction Problems at ACL. The paper discussed two case studies in which we failed to replicate results. While investigating the problem, we found that results differed to such an extent that they led to completely different conclusions. The (small) variations in settings, preprocessing and evaluation that led to these changes were not even reported in the original papers.

Though some progress has been made on both replication (obtaining the same results using the same experiment) and reproduction (reaching the same conclusion through different means), the problem described in 2013 still seems to apply to the majority of computational linguistics papers published in 2017. In this blog post, I’d like to reflect on the progress that has been made, but also on the progress we still need to make in publishing both replicable and reproducible research. The core issue around replication is the lack of means provided to other researchers to repeat an experiment carried out elsewhere. Issues around reproducing results are more diverse, but I believe that the way we look at evidence and at comparison to previous work in our field is a key element of the problem. I will argue that major steps in addressing these issues can be made by (1) increasing appreciation for replicability and reproducibility in published research and (2) changing the way we use the ‘state-of-the-art’ when judging research in our field. More specifically, good papers provide insight into and understanding of a computational linguistics or NLP problem. Reporting results that beat the state-of-the-art is neither sufficient nor necessary for a paper to make a valuable research contribution.

Replication Problems and Appreciation for Sharing Code

Attention to replicable results (sharing code and resources) has increased in the last four years. Links to git repositories or other version control systems are increasingly common, and the review forms of the main conferences include a question addressing the possibilities for replication. Our research group CLTL has adopted a policy that code and resources not restricted by third-party licenses must be made available upon publication. When reading related work for my own research, I have noticed similar tendencies in, among others, the UKP group in Darmstadt, Stanford NLP and the CS and Linguistics departments of the University of Washington. Our PhD students furthermore typically start by replicating or reproducing previous work, which they can then use as a baseline. From their experience, I noticed that the problems reported four years ago still apply today. Results were sometimes close or comparable, once even higher, but also regularly far off. Sometimes the provided code did not even run. Authors often provided feedback, but even with their help (sometimes they went as far as looking at our code), the original results could not be replicated. I currently find myself on the other side of the table, with two graduate students wanting to use an analysis from my PhD and the (openly available) code producing errors.

There can be valid reasons for not sharing code or resources. Research teams from industry have often delivered interesting and highly relevant contributions to NLP research, and it is difficult to obtain corpora from various genres whose text is free of copyright restrictions. I therefore do not want to argue for less appreciation of research without open source code and resources, but I do want to advocate for more appreciation of research that provides the means for replicating results. In addition to being openly verifiable, such work also gives other researchers the means to build directly upon previous work rather than first going through the frustration of reimplementing a baseline system good enough to test their hypotheses on.

The Reproducible and Generally Replicable State-of-the-Art

Comparing performance on benchmarks has helped in gaining insight into the performance of our systems and in comparing various approaches. Evaluation in our field, however, is often limited to testing whether an approach beats the state-of-the-art. Many even seem to see this as the main purpose of evaluation, to the extent that reviewers rate down papers that do not beat the state-of-the-art. I suspect that researchers often do not even bother trying to publish their work if performance remains below the best reported results. The actual purpose of evaluation is, or should be, to provide insight into how a model works, what phenomena it captures or which patterns the machine learning algorithm picked up, compared to alternative approaches. Moreover, the difficulties involved in replicating results make the practice of judging research on whether it beats the state-of-the-art rather questionable: reported results may be misleading regarding the actual state-of-the-art. In general, papers should be evaluated based on what they teach us, i.e. whether they verify their hypotheses by comparison to a suitable baseline. A suitable baseline may indeed be one that corresponds to the state-of-the-art, but this state-of-the-art should be a valid reflection of what current technologies can do.

I would therefore like to introduce the notions of the reproducible state-of-the-art and the generally replicable state-of-the-art. Both notions aim at gaining better insight into the true state-of-the-art and at making it accessible for a wider range of researchers to build on top of it. I understand a ‘reproducible state-of-the-art’ to be a result obtained independently by different groups of researchers, which increases the likelihood that the result is reliable and that the baseline is feasible for other researchers to reproduce. This implies having more appreciation for papers that come relatively close to the state-of-the-art without necessarily beating it. The chances of results being reproducible also increase if they hold across datasets and across multiple machine learning runs (e.g. if they are relatively stable across different initializations and across the order in which a neural network processes the training data). The ‘generally replicable state-of-the-art’ refers to the best reported results obtained by a fully available system and, preferably, one that can be trained and run using computational resources available to the average NLP research group. One way to obtain better open source systems and to encourage researchers to share their resources and code is to instruct reviewers to value improving the generally replicable state-of-the-art (with open source code and available resources) as much as improving the reported state-of-the-art.
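
To make the idea of stability across runs concrete, here is a minimal sketch, purely for illustration and not taken from any particular paper, of how one might check whether a result survives different random initializations and data orderings. It uses scikit-learn’s digits data as a stand-in for a real NLP task; the model, its settings and the number of seeds are arbitrary choices.

```python
# Illustrative sketch only: check whether a score is stable across random seeds.
# The seed controls both weight initialization and the shuffling of training data.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

scores = []
for seed in range(5):
    clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=300, random_state=seed)
    clf.fit(X_train, y_train)
    scores.append(clf.score(X_test, y_test))

# Reporting mean and standard deviation across runs, rather than a single best run,
# gives a much better picture of whether an improvement over a baseline is real.
print(f"accuracy: {np.mean(scores):.3f} +/- {np.std(scores):.3f}")
```

Reporting such a spread alongside the best run makes it much easier for others to judge whether a claimed improvement is likely to reproduce.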

Understanding Computational Models for Natural Language

In the introduction of this blog post, I claimed that improving the state-of-the-art is neither necessary nor sufficient for providing an important contribution to computational linguistics. NLP papers often introduce an idea and show that adding the features or adapting the machine learning approach associated with that idea improves results. Many authors take the improved results as evidence that the idea works, but this is not necessarily the case: the improvement can be due to other differences in settings or to random variation. The outcome becomes much more convincing if the hypothesis correctly predicts which kinds of errors the new approach solves compared to the baseline. For instance, if you predict that reinforcement learning reduces error propagation, investigate the error propagation in the new system compared to the baseline. Even if it is difficult to predict where improvement comes from, a decent error analysis showing which phenomena are handled better than by other systems, which equally well or poorly, and which worse can provide valuable insights into why an approach works or, more importantly, why it does not. This has several advantages. First of all, if we have better insight into which information and which algorithms help for which phenomena, we have a better idea of how to further improve our systems (for those among you who are convinced that achieving high f-scores is our ultimate goal). It becomes easier to publish negative results, which in turn promotes progress by preventing research groups from going down the same pointless road without knowing of each other’s work. We may learn whether an approach works or does not work due to particularities of the data we are working with. Moreover, an understood result is more likely to be a reproducible result, and even if it is not, details about exactly what is working may help other researchers find out why they cannot reproduce it. In my opinion, this is where our field fails most: we are too easily satisfied when results are high and do not aim for deep insight frequently enough. This may be the hardest point to tackle of those I have raised in this post. On the upside, addressing it is not made impossible by licenses, copyright or commercial code.
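
As a toy illustration of what such an error analysis might look like in code, the sketch below contrasts a baseline with a new system on gold labels annotated with a phenomenon of interest; all labels, predictions and phenomenon categories here are invented for the example.

```python
# Toy error analysis: for each phenomenon, count how instances move between
# correct and wrong when going from the baseline to the new system.
from collections import Counter

gold       = ["PER", "ORG", "LOC", "PER", "ORG"]   # invented gold labels
baseline   = ["PER", "LOC", "LOC", "ORG", "ORG"]   # invented baseline output
new_system = ["PER", "ORG", "PER", "ORG", "ORG"]   # invented new-system output
phenomena  = ["frequent", "rare", "ambiguous", "ambiguous", "frequent"]

transitions = Counter()
for g, b, n, ph in zip(gold, baseline, new_system, phenomena):
    before = "correct" if b == g else "wrong"
    after = "correct" if n == g else "wrong"
    transitions[(ph, f"{before}->{after}")] += 1

# Which phenomena did the new approach fix, which did it break, and which stayed the same?
for (ph, change), count in sorted(transitions.items()):
    print(f"{ph:10s} {change:15s} {count}")
```

Even a breakdown this simple says more about why an approach helps than a single aggregate score does.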

Moving Forward

As a community, we are responsible for improving the quality of our research. Most of the effort will probably have to come from the bottom up: individual researchers can decide to write (only) papers with a solid methodological setup that aim for insights in addition to, or even rather than, high f-scores, and to provide code and resources whenever allowed. They can also decide to value papers that follow such practices more highly and to be (more) critical of papers that do not provide insight into or good understanding of their methods. Initiatives such as the workshops Analyzing and Interpreting Neural Networks for NLP, Building and Breaking, Ethics in NLP, and Relevance of Linguistic Structure in Neural NLP (and many others) show that the desire to obtain better understanding is very much alive in the community.

Researchers serving as program chairs can play a significant role in further encouraging authors and reviewers. The best paper categories proposed for COLING 2018 are a nice example of an incentive that appreciates a variety of contributions to the field. The review forms of the main conferences have included questions about the resources provided by a paper. Last year, however, the option ‘no code or resources provided’ was followed by ‘(most submissions)’. As a reviewer, I wondered: why this addition? We should at least try to move towards a situation where providing code and resources is normal, or maybe even standard. The new NAACL form refers to the encouragement of sharing research for papers introducing new systems. I hope this will also be included for other paper categories and that the chairs will connect this encouragement to a reward for authors who do share. I also hope chairs and editors of all conferences, journals and workshops will remind their reviewers of the fragility of reported results and ask them to take this into consideration when judging whether empirical results are sufficient compared to related work. Most of all, I hope many researchers will feel encouraged to submit insightful research with low as well as high results, and I hope to learn much from it.

Thank you for reading. Please share your ideas and thoughts: I’d specifically love to hear from researchers who have different opinions.

Antske Fokkens

https://twitter.com/antske

Acknowledgements I’d like to thank Leon Derczynski for inviting me to write this post. Thanks to Ted Pedersen (whom I have never met in person) for that crazy Saturday we spent hacking across the ocean to finally find out why the original results could not be replicated. I’d like to thank Emily Bender for valuable feedback. Last but not least, thanks to the members of the CLTL research group for discussions and inspiration on this topic, as well as to the many, many colleagues from all over the world with whom I have exchanged thoughts on this topic over the past four years!

Speaker profile – Hannah Rohde

COLING 2018 will have four full keynote speeches. As we announce the speakers, we’ll introduce them via this blog, too. We are quite proud of this line-up, and it’s hard to refrain from just putting all the info out there at once! So we’ll start by crowing about Dr. Hannah Rohde.

Hannah Rohde is a Reader in Linguistics & English Language at the University of Edinburgh. She works in experimental pragmatics, using psycholinguistic techniques to investigate questions in areas such as pronoun interpretation, referring expression generation, implicature, presupposition, and the establishment of discourse coherence. Her undergraduate degree was in Computer Science and Linguistics from Brown University, from which she went on to complete a PhD in Linguistics at the University of California San Diego, followed by postdoctoral fellowships at Northwestern and Stanford. She currently helps organise the working group on empiricism for the EU-wide “TextLink: Structuring discourse in multilingual Europe” COST Action network and is a recipient of the 2017 Philip Leverhulme Prize in Languages and Literatures.

http://www.lel.ed.ac.uk/~hrohde/

You can find the slides for Dr. Rohde’s talk here.

Best paper categories and requirements

Recognition of excellent work is very important.  In particular, we see the role of best/outstanding paper awards as being two-fold: On the one hand, it is a chance for the conference program committee to highlight papers it found particularly compelling and promote them to a broader audience.  On the other hand, it provides recognition to the authors and may help advance their careers.

From the perspective of both of these functions we think it is critical that different kinds of excellent work be recognized.  Accordingly, we have established an expanded set of categories in which an award will be given for COLING 2018. The categories are:

  • Best linguistic analysis
  • Best NLP engineering experiment
  • Best reproduction paper
  • Best resource paper
  • Best position paper
  • Best survey paper
  • Best evaluation, for a paper that does its evaluation very well
  • Most reproducible, where the paper’s work is highly reproducible
  • Best challenge, for a paper that sets a new challenge
  • Best error analysis, where the linguistic analysis of failures is exemplary

The first six of these correspond to our paper types.  The last four cross-cut those categories, at least to a certain extent.  We hope that ‘Best evaluation’ and ‘Most reproducible’ in particular will provide motivation for raising the bar on best practice in these areas.

A winner will be selected for each category by a best paper committee. However, while there are more opportunities for recognition, we’ve also raised the minimum requirements for winning a prize. Namely, any work with associated code or other resources must make those openly available, and do so before the best paper committee finishes selecting works.

We’ve taken this step to provide a solid reward for those who share their work and help advance our field (see e.g. “Sharing is Caring”, Nissim et al. 2017, Computational Linguistics), without excluding from COLING 2018’s many tracks those (e.g. industrial authors) who cannot easily share their work.

We look forward with great anticipation to this collection of papers!

COLING as a Locus of Interdisciplinary Communication

The nature of the relationship between (computational) linguistics and natural language processing remains a hot topic in the field.  There is at this point a substantial history of workshops focused on how to get the most out of this interaction, including at least:

[There are undoubtedly more!  Please let us know what we’ve missed in the comments and we’ll add them to this list.]

The interaction between the fields also tends to be a hot-button topic on Twitter, leading to very long and sometimes informative discussions, such as the NLP/CL Megathread of April 2017 (as captured by Sebastian Mielke) or the November 2017 discussion on linguistics, NLP, and interdisciplinarity, summarized in blog posts by Emily M. Bender and Ryan Cotterell.

It is very important to us as PC co-chairs of COLING 2018 to continue the COLING tradition of providing a venue that encourages interdisciplinary work. COLING as a venue should host both computationally-aided linguistic analysis and linguistically informed work on natural language processing. Furthermore, it should provide a space for authors of each of these kinds of papers to provide feedback to each other.

Actions we have taken so far to support this vision include recruiting area chairs whose expertise spans the two fields, as well as designing our paper types and associated review forms with this in mind.

We’d like to see even more discussion of how interdisciplinarity works, or can work, in our field. What do you consider to be best practices for carrying out such interdisciplinary work? What role do you see for linguistics in NLP, and how do computational methods inform your linguistic research? How do you build and maintain collaborations? When you read (or review) in this field, what features of a paper stand out for you as particularly good approaches to interdisciplinary work? Finally, how can COLING further support such best practices?


Recruiting Area Chairs

An absolutely key ingredient for a successful conference is a stellar team of area chairs (ACs). What do we mean by stellar? We need people who take the task seriously, work hard to ensure fairness, bring their expertise to bear in selecting papers that make valuable contributions and constitute a vibrant program, can be effective leaders and get the reviewers to do their job well, and finally who represent a broad range of diverse interests and perspectives on our field. What a tall order!

On top of that, given the size of conferences in our field presently, we need a large team of such amazing colleagues. How big? We are planning for 2000 submissions (yikes!), which we will allocate evenly across 40 areas, so roughly 50 papers per area. We plan to have area chairs work in pairs, so we need 80 area chairs to cover 40 areas. In addition, we anticipate a range of troubleshooting and consulting beyond what we two as PC co-chairs can handle, and so we also want an additional 10 area chairs who can assist across areas, with START troubleshooting, handling papers with COI issues, and whatever else comes up. That means we’re looking for about 100 people total.

We decided to do the recruiting in two phases. The first phase involved recruiting 50 area chairs directly by invitation. Phase II is an open call for nominations (and self-nominations!) for the remaining 50 area chairs. The purpose of this blog post is to give you an update on how we are doing in terms of various metrics of diversity, and, more importantly, to alert you to the call for area chairs. If you would like to serve as area chair, or if you know someone who you’d like to nominate, please fill out this form.

As we select additional area chairs, we will be looking to round out the range of areas of expertise we have recruited so far (see below); maintain our gender balance; improve our regional diversity; improve the representation of area chairs from non-academic affiliations; and improve racial/ethnic diversity. The stats for our area chairs so far are as follows (based on a self-report survey we sent to the area chairs).

Research Interests

A diverse range of areas was described via a free-text entry form. Those with multiple entries are shown in the chart, and the hapaxes are listed below.

  • Accent Variation
  • Active Learning
  • Argument Mining
  • Aspect
  • Authorship Analysis (Attribution, Profiling, Plagiarism Detection)
  • Automatic Summarization
  • Biomedical/clinical Text Processing
  • BioNLP
  • Clinical NLP
  • Clustering
  • Code-mixing
  • Code-switching
  • Computational Cognitive Modeling
  • Computational Discourse
  • Computational Lexical Semantics
  • Computational Lexicography
  • Computational Morphology
  • Computational Pragmatics
  • Conversational AI
  • Conversation Modeling
  • Corpora Construction
  • Corpus Design And Development
  • Corpus Linguistics
  • Cross-language Speech Recognition
  • Cross-lingual Learning
  • Data Modeling And System Architecture
  • Dialogue Pragmatics
  • Dialogue System
  • Dialogue Systems
  • Discourse Modes
  • Discourse Parsing
  • Document Summarization
  • Emotion Analysis
  • Endangered Language Documentation
  • Evaluation
  • Event And Temporal Processing
  • Experimental Linguistics
  • Eye Movements
  • Fact Checking
  • Grammar Correction
  • Grammar Engineering
  • Grammar Induction
  • Grounded Language Learning
  • Grounded Semantics
  • HPSG
  • Incremental Language Processing
  • Information Retrieval
  • KA
  • Korean NLP
  • Language Acquisition
  • Lexical Resources
  • Linguistic Annotation
  • Linguistic Issues In NLP
  • Linguistic Processing Of Non-canonical Text
  • Low-resource Learning
  • Machine Reading
  • Modality
  • Multilingual Systems
  • Multimodal NLP
  • NER
  • NLG
  • NLP In Health Care & Education
  • NLU
  • Ontologies
  • Ontology Construction
  • Phonology
  • POS Tagging
  • Reading
  • Reasoning
  • Relation Extraction
  • Resources
  • Resources And Evaluation
  • Rhetorical Types
  • Semantic Parsing
  • Semantic Processing
  • Short-answer Scoring
  • Situation Types
  • Social Media
  • Social Media Analysis
  • Social Media Analytics
  • Software And Tools
  • Speech
  • Speech Perception
  • Speech Recognition
  • Speech Synthesis
  • Spoken Language Understanding
  • Stance Detection
  • Structured Prediction
  • Summarization
  • Syntactic And Semantic Parsing
  • Syntax/parsing
  • Tagging
  • Temporal Information Extraction
  • Text Classification
  • Text Mining
  • Text Simplification
  • Text Types
  • Transfer Learning
  • Treebanks
  • Vision And Language
  • Weakly Supervised Learning

Gender

We asked a completely open-ended question here, which was furthermore optional, and then binned the answers into the three categories female, male, and other/question skipped.

Country of affiliation

Another open-ended question, which we again binned by region.  Latin America is the Americas minus the US and Canada.  Australia is counted as Asia.  So far Africa is not represented.


Type of affiliation

Our survey anticipated five possible answers here: Academia, Industry – research lab, Industry – other, Government, Other; but only the first two are represented so far.

Race/ethnicity

We are interested in making sure that our senior program committee is diverse in terms of race/ethnicity, but it is very difficult to talk about what this means in an international context, because racial constructs are very much products of the cultures they are a part of. So rather than ask for specific race/ethnicity categories, which we would be unprepared to summarize across cultures, we decided to ask the following pair of questions, both of which were optional (like the question about gender):

As we work to make sure that our senior PC is appropriately diverse, we would like to consider race/ethnicity.  Yet, at the level of an international organization, it is very unclear what categories could possibly be appropriate for such a survey.  Accordingly, we have settled on the distinction minoritized (treated as a minority)/not minoritized (treated as normative/majority).


In the context of your country of current affiliation, and with respect to your race/ethnicity, are you: (optional)

  • Minoritized
  • Not minoritized

During your education or career prior to your current affiliation, has there ever been a significant period of time during which you were minoritized with respect to your race/ethnicity? (optional)

  • Yes
  • No

Please join us!

We’re looking for about 50 more ACs!  Please consider nominating yourself and/or other people who you think would do a good job and would also help us round out our leadership team along the various dimensions identified above.  Both self- and other-nominations can be made via this form. You can nominate as many people as you like (but please only nominate yourself once 😉)


Untangling biases and nuances in double-blind peer review at scale

It’s important to get reviewing right, and remove as many biases as we can. We had a discussion about how to do this in COLING, presented in this blog post in interview format. The participants are the program co-chairs, Emily M. Bender and Leon Derczynski.

LD: How do you feel about blindness in the review process? It could be great for us to have blindness in a few regards. I’ll start with the most important to me. First, reviewers do not see author identities. Next, reviewers do not see each other’s identities. Most people would adjust their own review to align with e.g. Chris Manning’s (sounds terribly boring for him if this happens!). Third, area chairs do not see author identities. Finally, area chairs do not see reviewer identities in connection to their reviews, or a paper. But I don’t know how much of this is possible within the confines of conference management. The last seems the most risky; but reviewer identities being hidden from each other seems like a no-brainer. What do you think?

Reviewers blind from each other

EMB: It looks like we have a healthy difference of opinion here 🙂 Absolutely, reviewers should not see author identities. With them not seeing each other’s identities, I disagree. I think the inter-reviewer discussion tends to go better if people know who they are talking to. Perhaps we can get the software to track the score changes and ask the ACs to be on guard for bigwigs dragging others to their opinions?

LD: Alright, we can try that; but after reading that report from URoch, how would you expect PhD students/postdocs/asst profs to have reacted around a review of Florian Jaeger’s, if they’d had or intended to have any connection with his lab? On the other side, I hear a lot from people unwilling to go against big names, because they’ll look silly. So my perception of this is that discussion goes worse when people know who they’re contradicting—though reviews might end up being more civil, too. I still think big names distort reviews here despite getting reviewing wrong just as often as the small names, so having reviewers know who each other are makes for less fair reviewing.

EMB: I wonder to what extent we’ll have ‘big names’ among our reviewers. I wonder if we can get the best of both worlds, though, by revealing all reviewers’ names to each other only after the decisions are out. So people will be on good behavior in the discussions (and reviews), knowing that they’ll be associated with their remarks eventually, but won’t be swayed by big names during the process?

LD: Yes, let’s do this. OK, what about hiding authors from area chairs?

Authors and ACs

EMB: I think hiding author identities from ACs is a good idea, but we still need to handle conflicts-of-interest somehow. And the cases where reviewers think that the authors should be citing X previous work when X is actually the author’s. Maybe we can have some of the small team of “roving” ACs doing that work? I’m not sure how they can handle all COI checking though.

LD: Ah, that’s tough. I don’t know too much about how the COI process typically works from the AC side, so I can’t comment here. If we agree on the intention—that author identities should ideally be hidden from ACs—we can make the problem better-defined and share it with the community, so some development happens.

EMB: Right. Having ACs be blind to authors is also being discussed in other places in the field, so we might be able to follow in their footsteps.

Reviewers and ACs

LD: So how about reviewer identities being hidden from ACs?

EMB: I disagree again about area chairs not seeing reviewer identities next to their reviews. While a paper should be evaluated solely on its merits, I don’t think we can rely on the reviewers to get absolutely everything into their reviews. And so having the AC know who’s writing which review can provide helpful context.

LD: I suppose we are choosing ACs we hope will be strong and authoritative about their domain. Do you agree there’s a risk of a bias here? I’m not convinced that knowing a reviewer’s identity helps so much—all humans make mistakes with great reliability (else annotation would be easier), and so what we really see is random effect magnification/minimization depending on the AC’s knowledge of a particular reviewer, where a given review’s quality varies on its own.

EMB: True, but/and it’s even more complex: The AC can only directly detect some aspects of review quality (is it thorough? helpful?) but doesn’t necessarily have the ability to tell whether it’s accurate. Also—how are the ACs supposed to do the allocation of reviewers to papers, and do things like make sure those with more linguistic expertise are evenly distributed, if they don’t know who the reviewers are?

LD: My concern is that ACs will have bias about which reviewers are “reliable” (and anyway, no reviewer is 100% reliable). However, in the interest of simplicity: we’ve already taken steps to ensure that we have a varied, balanced AC pool this iteration, which I hope will reduce the effect of AC:reviewer bias when compared to conferences with mostly static AC pools. And the problem of allocating reviews to papers remains unsettled.

EMB: Right. Maybe we’re making enough changes this year?

LD: Right.

Resource papers

LD: An addendum: this kind of blindness may prove impossible for resource-type papers, where author anonymity may become an optionally relaxable constraint.

EMB: Well, I think people should at least go through the motions.

LD: Sure—this makes life easier, too. As long as authors aren’t torn apart during review because someone can guess the authors behind a resource.

EMB: Good point. I’ll make a note in our draft AC duties document.

Reviewing style

LD: I want to bring up review style as well. To nudge reviewers towards good reviewing style, I’d like reviewers to have the option of signing their reviews, with signatures available to authors at notification only. The reviewer identity would not be attached to a specific review, but rather given in general form: “Reviewers of this paper included: Natalie Schluter.” We know that adversarial reviewing drops when reviewer identity is known, and I’d love to see CS—a discipline known for nasty reviews—begin to move in a positive direction. Indeed, as PC co-chairs of a CS-related conference, I feel we in particular have a duty to address this problem. My hope is that I can write a script to add this information, if we do it.

EMB: If the reviewers are opting in, perhaps it makes more sense for them to claim their own reviews. If I think one of my co-reviewers was a jerk, I would be less inclined to put my name to the group of reviews.

LD: That’s an interesting point. Nevertheless, I’d like us to make progress on this front. In some time-rich utopia it might make sense to have the reviewers all agree whether or not to sign all three reviews, and only have their identities revealed to each other after that—but we don’t have time. How about this: reviews may be signed, but only at the point notifications are sent out? This prevents reviewers from knowing who the others are during the process, and lets those who want to stay hidden do so—as well as protecting us all from the collateral damage caused by jerk reviewers.

This could work with a checkbox—”Sign my review with my name in the final author notification”—and the rest’s scripted in Softconf.

EMB: So how about option to sign for author’s view (the checkbox) + all reviewers revealed to each other once the decisions are done?

LD: Good, let’s do that. Reviewer identities are hidden from each other during the process, and revealed later; and reviewers have the option to sign their review via a checkbox in softconf.

EMB: Great.
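
(As an aside: a hypothetical sketch of the kind of script mentioned above. Softconf’s actual data export is not specified here, so the record layout below is invented for illustration.)

```python
# Hypothetical sketch: build the signature line appended to a paper's notification,
# listing only the reviewers who ticked the "sign my review" checkbox.
reviews = [
    {"paper_id": 42, "reviewer": "Natalie Schluter", "sign": True},
    {"paper_id": 42, "reviewer": "Reviewer Two",     "sign": False},
    {"paper_id": 42, "reviewer": "Reviewer Three",   "sign": True},
]

def signature_line(paper_id, reviews):
    # Sort the names so the list does not reveal which reviewer wrote which review.
    signed = sorted(r["reviewer"] for r in reviews
                    if r["paper_id"] == paper_id and r["sign"])
    if not signed:
        return ""
    return "Reviewers of this paper included: " + ", ".join(signed) + "."

print(signature_line(42, reviews))
# -> Reviewers of this paper included: Natalie Schluter, Reviewer Three.
```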

Questions

What do you think? What would you change about the double-blind process?

Writing Mentoring Program

Submit your manuscript for mentoring here:  https://www.softconf.com/coling2018/mentoring/

Among the goals we outlined in our inaugural post was the following:

(1) to create a program of high quality papers which represent diverse approaches to and applications of computational linguistics written and presented by researchers from throughout our international community;

One of our strategies for achieving this goal is to create a writing mentoring program, which takes place before the reviewing stage. This optional program is focused on helping those who perhaps aren’t used to publishing in the field of computational linguistics, are early in their careers, and so on. We see mentoring as a tool that makes COLING accessible to a broader range of high-quality ideas. In other words, this isn’t about pushing borderline papers into acceptance but rather about alleviating presentational problems with papers that, in their underlying research quality, easily meet the high required standard.

In order for this program to be successful, we need buy-in from prospective mentors. In this blog post, we provide the outlines of the program, in order to let the community (including both prospective mentors and mentees) know what we have in mind and to seek (as usual) your feedback.

We plan to run the mentoring program through the START system, as follows:

  • Anyone wishing to receive mentoring will submit an abstract by 4 weeks before the COLING submission deadline. Authors will be instructed that submitting an abstract at this point represents a commitment to submit a full draft by the mentoring deadline and then to submit to COLING.
  • Requesting mentoring doesn’t guarantee receiving mentoring and receiving mentoring doesn’t guarantee acceptance to the conference program.
  • Any reviewer willing to serve as mentor will bid on those abstracts and indicate how many papers total they are willing to mentor. Mentors will receive guidance from the program committee co-chairs on their duties as mentors, as well as a code of conduct.
  • Area chairs will assign papers to mentors by 3 weeks before the submission deadline, giving priority as follows. (Note that if there are not enough mentors, not every paper requesting mentoring will receive it.)
    1. Authors from non-anglophone institutions
    2. Authors from beyond well-represented institutions
  • Authors wishing to receive mentoring will submit complete drafts via START by 3 weeks before the submission deadline.
  • Mentors will provide feedback within one week, using a ‘mentoring form’ created by the PCs structured to encourage constructive feedback.
  • No mentor will serve as a reviewer for a paper they mentored.
  • Mentor bidding will be anonymous, but actual mentoring will not be (in either direction).
  • Mentors will be recognized in the conference handbook/website, but COLING will not indicate which papers received mentoring (though authors are free to acknowledge mentorship in their acknowledgments section).

As a starting point, here are our initial questions for the mentoring form:

  • What is the main claim or result of this paper?
  • What are the strengths of this paper?
  • What questions do you have as a reader?  What do you wish to know about the research that was carried out that is unclear as yet from the paper?
  • What aspect of the paper do you think the COLING audience will find most interesting?
  • Which paper category/review form do you think is most appropriate for this paper?
  • Taking into consideration the specific questions in that review form, in what ways could the presentation of the research be strengthened?
  • If you find grammatical or stylistic issues in the writing, or if you think improvements are possible in the overall organization and structure, please indicate these. It may be most convenient to do so by marking up a PDF with comments.

Regarding code of conduct, by signing up to mentor a paper, mentors agree to:

  • Maintain confidentiality: Do not share the paper draft or discuss its contents with others (without express permission from the author).  Do not appropriate the ideas in the paper.
  • Commit to prompt feedback: Read the paper and provide feedback via the form by the deadline specified.
  • Be constructive: Avoid sarcastic or harsh evaluative remarks; phrase feedback in terms of how to improve, rather than what is wrong or bad.

The benefits to authors are clear: authors participating in the program will receive feedback on the presentation of their work, which, if heeded, might also improve their chances of acceptance as well as enhance the impact of the paper once published. Perhaps the benefits to mentors are more in need of articulation. Here are the benefits we see: mentors will be recognized through a listing in the conference handbook and website, with outstanding mentors receiving further recognition. In addition, mentoring should be rewarding in itself, because the exercise of giving constructive feedback on academic writing provides insight into what makes good writing. Finally, the mentoring program will benefit the entire COLING audience through both improved presentation of research results and improved diversity of authors included in the conference.

Our questions for our readership at this point are:

  1. What would make this program more enticing to you as a prospective mentor or author?
  2. As a prospective mentor or author, are there additional things you’d like to see in the mentoring form?
  3. Are there points you think we should add to the code of conduct?


What kinds of invited speakers could we have?

As we begin to plan the keynote talks for COLING, we are looking for community input.  The keynote talks, among the few shared experiences in a conference with multiple parallel tracks, serve both to anchor the ‘conversation’ that the field is having through the conference and to push it in new directions. In the past, speakers have come both from close to the center of our community and from outside it, lending new, important perspectives that contextualize COLING as well as stories and insights that have led to great successes.

We are seeking two kinds of input:

  1. In public in the comments on this post: What kinds of topics would you like to hear about in the invited keynotes? We’re interested both in suggestions within computational linguistics and in specific topics from related fields: linguistics, machine learning, cognitive science, and applications of computational linguistics to other fields.
  2. Privately, via this web form: If you have specific speakers you would like to nominate, please send us their contact info and any further information you’d like to share.


Call for input: Paper types and associated review forms

In our opening post, we laid out our goals as PC co-chairs for COLING 2018. In this post, we present our approach to the subgoal (of goal #1) of creating a program with many different types of research contributions. As both authors and reviewers, we have been frustrated by the one-size-fits-all review form typical of conferences in our field. When reviewing, how do we answer the ‘technical correctness’ question about a position paper? Or the ‘impact of resources’ question on a paper that doesn’t present any resources?

We believe that a program that includes a wide variety of paper types (as well as a wide variety of paper topics) will be more valuable both for conference attendees and for the field as a whole. We hypothesize that more tailored review forms will lead to fairer treatment of different types of papers, and that fairer treatment will lead to a more varied program. Of course, if we don’t get many papers outside the traditional type (called “NLP engineering experiment paper” below), having tailored review forms won’t do us much good. Therefore, we aim to get the word out early (via this blog post) so that our audience knows what kinds of papers we’re interested in.

Furthermore, we’re interested in what kinds of papers you’re interested in. Below you will find our initial set of five categories, with drafts of the associated review forms. You’ll see that some questions are shared across some or all of the paper types, but we’ve elected to lay them out this way (even though it might feel repetitive) so that you can look at each category, putting yourself in the position of both author and reviewer, and think about what we might be missing and which questions might be inappropriate. Let us know in the comments!

As you answer, keep in mind that our goal with the review forms is to help reviewers structure their reviews in such a way that they are helpful for the area chairs in making final acceptance decisions, informative for the authors (so they understand the decisions that were made), and helpful for the authors (as they improve their work either for camera ready, or for submission to a later venue).

Computationally-aided linguistic analysis

The focus of this paper type is new linguistic insight.

  • Relevance: Is this paper relevant to COLING?
  • Readability/clarity: From the way the paper is written, can you tell what research question was addressed, what was done and why, and how the results relate to the research question?
  • Originality: How original and innovative is the research described? Originality could be in the linguistic question being addressed, in the methodology applied to the linguistic question, or in the combination of the two.
  • Technical correctness/soundness: Is the research described in the paper technically sound and correct? Can one trust the claims of the paper—are they supported by the analysis or experiments and are the results correctly interpreted?
  • Reproducibility: Is there sufficient detail for someone in the same field to reproduce/replicate the results?
  • Generalizability: Does the paper show how the results generalize, either by deepening our understanding of some linguistic system in general or by demonstrating methodology that can be applied to other problems as well?
  • Meaningful comparison: Does the paper clearly place the described work with respect to existing literature? Is it clear both what is novel in the research presented and how it builds on earlier work?
  • Substance: Does this paper have enough substance for a full-length paper, or would it benefit from further development?
  • Overall recommendation: There are many good submissions competing for slots at COLING 2018; how important is it to feature this one? Will people learn a lot by reading this paper or seeing it presented? Please be decisive—it is better to differ from other reviewers than to grade everything in the middle.

NLP engineering experiment paper

This paper type matches the bulk of submissions at recent CL and NLP conferences.

  • Relevance: Is this paper relevant to COLING?
  • Readability/clarity: From the way the paper is written, can you tell what research question was addressed, what was done and why, and how the results relate to the research question?
  • Originality: How original and innovative is the research described? Note that originality could involve a new technique or a new task, or it could lie in the careful analysis of what happens when a known technique is applied to a known task (where the pairing is novel) or in the careful analysis of what happens when a known technique is applied to a known task in a new language.
  • Technical correctness/soundness: Is the research described in the paper technically sound and correct? Can one trust the claims of the paper—are they supported by the analysis or experiments and are the results correctly interpreted?
  • Reproducibility: Is there sufficient detail for someone in the same field to reproduce/replicate the results?
  • Error analysis: Does the paper provide a thoughtful error analysis, which looks for linguistic patterns in the types of errors made by the system(s) evaluated and sheds light on either avenues for future work or the source of the strengths/weaknesses of the systems?
  • Meaningful comparison: Does the paper clearly place the described work with respect to existing literature? Is it clear both what is novel in the research presented and how it builds on earlier work?
  • Substance: Does this paper have enough substance for a full-length paper, or would it benefit from further work?
  • Overall recommendation: There are many good submissions competing for slots at COLING 2018; how important is it to feature this one? Will people learn a lot by reading this paper or seeing it presented? Please be decisive—it is better to differ from other reviewers than to grade everything in the middle.

Reproduction paper

The contribution of a reproduction paper lies in analyses of and in insights into existing methods and problems—plus the added certainty that comes with validating previous results.

  • Relevance: Is this paper relevant to COLING?
  • Readability/clarity: Is the paper well-written and well-structured?
  • Analysis: If the paper was able to replicate the results of the earlier work, does it clearly lay out what needed to be filled in in order to do so? If it wasn’t able to replicate the results of earlier work, does it clearly identify what information was missing/the likely causes?
  • Generalizability: Does the paper go beyond replicating the original results to explore whether they can be reproduced in another setting? Alternatively, in cases of non-replicability, does the paper discuss the broader implications of that result?
  • Informativeness: To what extent does the analysis reported in the paper deepen our understanding of the methodology used or the problem approached? Will the information in the paper help practitioners with their choice of technique/resource?
  • Meaningful comparison: In addition to identifying the experimental results being replicated, does the paper motivate why these particular results are an important target for reproduction and what the future implications are of their having been reproduced or been found to be non-reproducible?
  • Overall recommendation: There are many good submissions competing for slots at COLING 2018; how important is it to feature this one? Will people learn a lot by reading this paper or seeing it presented? Please be decisive—it is better to differ from other reviewers than to grade everything in the middle.

Resource paper

Papers in this track present a new language resource. This could be a corpus, but it could also be an annotation standard, a tool, and so on.

  • Relevance: Is this paper relevant to COLING? Will the resource presented likely be of use to our community?
  • Readability/clarity: From the way the paper is written, can you tell how the resource was produced, how the quality of annotations (if any) was evaluated, and why the resource should be of interest?
  • Originality: Does the resource fill a need in the existing collection of accessible resources? Note that originality could be in the choice of language/language variety or genre, in the design of the annotation scheme, in the scale of the resource, or still other parameters.
  • Resource quality: What kind of quality control was carried out? If appropriate, was inter-annotator agreement measured, and if so, with appropriate metrics? Otherwise, what other evaluation was conducted, and how agreeable were the results?
  • Resource accessibility: Will it be straightforward for researchers to download or otherwise access the resource in order to use it in their own work? To what extent can work based on this resource be shared?
  • Metadata: Do the authors make clear whose language use is captured in the resource and to which populations experimental results based on the resource could be generalized? In the case of annotated resources, are the demographics of the annotators also characterized?
  • Meaningful comparison: Is the new resource situated with respect to existing work in the field, including similar resources it took inspiration from or improves on? Is it clear what is novel about the resource?
  • Overall recommendation: There are many good submissions competing for slots at COLING 2018; how important is it to feature this one? Will people learn a lot by reading this paper or seeing it presented? Please be decisive—it is better to differ from other reviewers than to grade everything in the middle.

Position paper

A position paper presents a challenge to conventional thinking or a futuristic new vision. It could open up a new area or novel technology, propose changes in existing research, or give a new set of ground rules.

  • Relevance: Is this paper relevant to COLING?
  • Readability/clarity: Is it clear what the position is that the paper is arguing for? Are the arguments for it laid out in an understandable way?
  • Soundness: Are the arguments presented in the paper relevant and coherent? Is the vision well-defined, with success criteria? (Note: It should be possible to give a high score here even if you don’t agree with the position taken by the authors)
  • Creativity: How novel or bold is the position taken in the paper? Does it represent well-thought through and creative new ground?
  • Scope: How much scope for new research is opened up by this paper? What effect could it have on existing areas and questions?
  • Meaningful comparison: Is the paper well-situated with respect to previous work, both position papers (taking the same or opposing side on the same or similar issues) and relevant theoretical or experimental work?
  • Substance: Does the paper have enough substance for a full-length paper? Is the issue sufficiently important? Are the arguments sufficiently thoughtful and varied?
  • Overall recommendation: There are many good submissions competing for slots at COLING 2018; how important is it to feature this one? Please be decisive—it is better to differ from other reviewers than to grade everything in the middle.


So, those are the initial submission types. These paper types aren’t limited to single tracks. That is to say, there won’t be a dedicated position paper track, with its own reviewers and chair. You might find a resource paper in any track, for example, and a multi-lingual embeddings track (if one appears—but that’s for a future post) might contain all five kinds of paper mixed together. This makes it even more important that the right questions are asked for each paper type, to help out hard-working reviewers with the task of judging each kind of paper in an appropriate light.

Our questions for you: Is there a type of paper you’d either like to submit to COLING or would like to see at COLING that you think doesn’t fit any of these five already? Should any of the review questions be dropped or refined for any of the paper types? Are there review questions it would be useful to add? Please let us know in the comments!


COLING 2018 PC Blog: Welcome!

Emily M. Bender and Leon Derczynski, at the University of Washington

We (Emily M. Bender and Leon Derczynski) are the PC co-chairs for COLING 2018, to be held in Santa Fe, NM, USA, 20-25 August 2018. Inspired by Min-Yen Kan and Regina Barzilay’s ACL 2017 PC Blog, we will be keeping one of our own. We start today with a brief post introducing ourselves and outlining our goals for COLING 2018. In later posts, we’ll describe the various plans we have for meeting those goals.

First the intros:

Emily is a Professor of Linguistics and Adjunct Professor of Computer Science & Engineering at the University of Washington, Seattle WA (USA), where she has been on the faculty since 2003 and has served as the Faculty Director of the Professional Masters in Computational Linguistics (CLMS) since its inception in 2005. Her degrees are all in Linguistics (AB UC Berkeley, MA and PhD Stanford) and her primary research interests are in grammar engineering, computational semantics, and computational linguistic typology. She is also interested in ethics in NLP, the application of computational methods to linguistic analysis, and different ways of integrating linguistic knowledge into NLP.

Leon is a Research Fellow in Computer Science at the University of Sheffield (UK), the home of the ICCL, where he has been a researcher since 2012, including visiting positions at Aarhus Universitet (Denmark), Innopolis University (Russian Federation) and the University of California, San Diego (USA). His degrees are in Computer Science (MComp and PhD), also from Sheffield, and his research interests are in noisy text, unsupervised methods, and spatio-temporal information extraction. He is also interested in chunking and tagging, effective crowdsourcing, and assessing veracity and fake news.

We first met by proxy, through Tim Baldwin, at LREC 2014 in Reykjavik. Tim pointed out that we both happened to be visiting scholars at the time in a hip Danish city devoid of its own NLP group—Aarhus.  Shortly after returning from Iceland, each on Tim’s recommendation, we met for lunch a few times in Aarhus, chatting about understanding language, language diversity, and the interface between data-driven computational techniques and linguistic reality. We have made a point of catching up regularly ever since, and the city is a place where we still have connections—one that is even more hip now, as the European Capital of Culture for 2017!

Then goals:

Our goals for COLING 2018 are (1) to create a program of high quality papers which represent diverse approaches to and applications of computational linguistics written and presented by researchers from throughout our international community; (2) to facilitate thoughtful reviewing which is both informative to ACs (and to us as PC co-chairs) and helpful to authors; and (3) to ensure that the results published at COLING 2018 are as reproducible as possible.

To give a bit more detail on the first goal, by diverse approaches/applications, we mean that we intend to attract (in the tradition of COLING):

  • papers which develop linguistic insight as well as papers which deepen our understanding of how machine learning can be applied to NLP — and papers that do both!
  • research on a broad variety of languages and genres
  • many different types of research contributions (application papers, resource papers, methodology papers, position papers, reproduction papers…)

We have the challenge and the privilege of taking on this role at a time when our field is growing tremendously quickly.  We hope to advance the way our conferences work by trying new things and improving the experience from all sides.  In approaching this task, we started by reviewing the strategies taken by PC chairs at other recent conferences (including COLING 2016, NAACL 2016, and ACL 2017), learning from them, and then adapting strategies based on our goals for COLING 2018.  We strongly believe that one key to achieving a diverse and strong program is community engagement.  Thus our first step towards that is starting this blog.  Over the coming weeks we will tell you more about what we are working on and seek input on various points in the process.  We look forward to working with you and hope to see many of you in Santa Fe next August!