COLING 2018 – a truly international event!

Posted on September 4, 2018 by Emily M. Bender

There have been a few requests for information about the range of countries represented at COLING 2018. As a partial answer, we collected (from START) the country of affiliation of the corresponding author of each paper, and were impressed to find 38 countries on that list! [Note: Country names are as shown in START.]

Australia, Belgium, Brazil, Canada, China, Czech Republic, Denmark, Ethiopia, France, Germany, Greece, Hong Kong Special Administrative Region of China, India, Ireland, Israel, Italy, Japan, Kazakhstan, Mexico, Netherlands, Norway, Pakistan, Poland, Qatar, Republic of Korea, Romania, Russian Federation, Saudi Arabia, Singapore, South Africa, Spain, Sweden, Switzerland, Taiwan, Turkey, United Kingdom, United States, and Viet Nam

The most frequently represented countries (among 331 accepted papers) were China (89 papers), United States (69), Germany (26), Japan (19), India (16), United Kingdom (15), France (11), and Netherlands (10).

Slides from Keynote Talks

Posted on August 25, 2018 by Emily M. Bender

All four keynote speakers have made the slides from their amazing talks available, and they can be found here:

Pustejovsky: Visualizing Meaning: Modeling Communication through Multimodal Simulations
Henri: Investigating a Discriminative Approach to Creolization
Rohde: Why Are You Telling Me This? Relevance & Informativity in Language Processing
Kan: Research Fast and Slow (video available here)

Q&A best practices: Introduce yourself

Posted on August 13, 2018 by Emily M. Bender

At COLING 2018, we’ll be asking question askers in the Q&A sessions to introduce themselves (name and affiliation) before asking their questions, because we’d like to see this practice spread as a norm in the community. When our field was smaller, it may have been the case that most everyone knew everyone else on sight and could recognize each other’s voices. That’s surely not true now!

We want to emphasize that this advice is for everyone, regardless of whether you expect most people to know who you are. The speaker whose paper you’re asking a question about might well have heard of you, but not recognize you on sight. And the speaker might appreciate the chance to follow up with you later! Likewise, people in the audience appreciate knowing who is asking questions. Even if you’re pretty sure everyone in the audience knows who you are, it’s still important: perhaps not everyone can see you. Furthermore, if the more well-known speakers adopt this practice, it makes it more comfortable for less-established scholars to do so.

PC chairs report back: Paper types and the selection process

Posted on June 28, 2018 by Emily M. Bender

As we stated at the outset, one of our goals for COLING 2018 has been “to create a program of high quality papers which represent diverse approaches to and applications of computational linguistics written and presented by researchers from throughout our international community”. One aspect of the COLING 2018 review process that we designed with this goal in mind was the enumeration of six different paper types, each with its own tailored review form. We first proposed an initial set of five paper types, and then added a sixth and revised the review forms in light of community input. The final set of paper types and review forms can be found here. In this blog post, we report back on this aspect of COLING 2018, both quantitatively and qualitatively.

Submission and acceptance statistics

The first challenge was to recruit papers from the less common paper types. Most papers published at NLP venues fit either our “NLP Engineering Experiment” or “Resources” paper type. The table below shows how many of each type were submitted, withdrawn, and accepted, as well as the acceptance rate per paper type. (The “withdrawn” number is included because these are excluded from the denominator in the acceptance rate, as discussed here.)

Type	Submitted	Withdrawn	Accepted	Acceptance rate
NLPEE	657	85	217	37.94%
CALA	163	28	45	33.33%
Resource	106	7	32	32.32%
Reproduction	35	0	17	48.57%
Position	31	6	8	32.00%
Survey	25	3	12	54.55%
Overall	1017	129	331	37.27%
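
As a quick check on the table, the rates can be recomputed from the counts. This is a minimal sketch that assumes only the rule stated above (withdrawn papers are removed from the denominator); the names `counts` and `acceptance_rate` are illustrative, not part of any conference tooling.

```python
# Counts copied from the table above: (submitted, withdrawn, accepted).
counts = {
    "NLPEE": (657, 85, 217),
    "CALA": (163, 28, 45),
    "Resource": (106, 7, 32),
    "Reproduction": (35, 0, 17),
    "Position": (31, 6, 8),
    "Survey": (25, 3, 12),
}


def acceptance_rate(submitted, withdrawn, accepted):
    """Accepted papers as a percentage of submissions that were not withdrawn."""
    return 100 * accepted / (submitted - withdrawn)


rates = {ptype: round(acceptance_rate(*c), 2) for ptype, c in counts.items()}

# Summing each column gives the figures in the "Overall" row.
totals = tuple(sum(column) for column in zip(*counts.values()))
overall = round(acceptance_rate(*totals), 2)

for ptype, rate in rates.items():
    print(f"{ptype}: {rate:.2f}%")
print(f"Overall: {overall:.2f}%")
```

The per-type columns sum to 1017 submitted, 129 withdrawn, and 331 accepted, which matches the "Overall" row.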

Not surprisingly, the “NLP Engineering Experiment” paper type accounted for more than half of the submissions, but we are pleased that the other paper types are also represented. We hope that if this strategy is taken up in future COLINGs (or other venues), it will continue to gain traction and the minority paper types will become more popular.

These statistics all represent the paper type chosen by the authors at submission time, not necessarily how we would have classified the papers. More discussion on this point below.

Author survey on paper types

As described in our post on our author survey, the feedback on the paper types from authors was fairly positive:

We wanted to find out if people were aware of the paper types (since this is relatively unusual in our field) before submitting their papers, and if so, how they found out. Most—349 (80.4%)—were aware of the paper types ahead of time. Of these, the vast majority (93.4%) found out about the paper types via the Call for Papers. Otherwise, people found out because someone else told them (7.4%), via our Twitter or Facebook feeds (6.0%), or via our blog (3.7%).

We also asked if it was clear to authors which paper type was appropriate for their paper and if they think paper types are a good idea. The answers in both cases were pretty strongly positive: 78.8% said it was clear and 91.0% said it was a good idea. (Interestingly, 74 people who said it wasn’t clear which paper type was a good fit for theirs nonetheless said it was a good idea, and 21 people who thought it was clear which paper type fit nonetheless said it wasn’t.)

Not knowable from that survey is whether/to what extent we failed to reach people who would have submitted e.g. a survey paper or reproduction paper, had they only known we were specifically soliciting them.

Reviewer survey on paper types

We also carried out a survey of our reviewers. This was sent with more delay (on 25 May, though reviews were due 10 April), and as some survey respondents pointed out, we may have gotten more accurate answers if we’d asked more quickly. But, there was plenty else we were worrying about in the interim! The response rate was also relatively low: only 128 of our 1200+ reviewers answered the survey. With those caveats, here are some results. (No question was required, so the answers don’t sum to 100%.)

We asked: “Did you feel like the authors chose the appropriate paper type for their papers?” 69.5% chose “Yes, all of them”, 26.6% “Only some of them”, and 0.8% (just one respondent), “No, none of them.”
We asked: “For papers that were assigned to what you thought was the correct paper type, did you feel that the review form questions helped you evaluate papers of that type?” 29.7% chose “Yes, better than usual for conferences/better than expected”, 57% “Yes, about as usual/about as expected”, 6.3% “No, worse than usual/worse than expected”, 1.6% “No, the review forms were poorly designed”
We asked: “For papers that were assigned to what you thought was an incorrect paper type, how problematic was the mismatch?” 36.7% chose […], 21.9% “Not so bad, even the numerical questions were still somewhat relevant” and no one chose “Pretty bad, I could only say useful things in the comments” or “Terrible, I felt like I couldn’t fairly evaluate the paper.” (58.6% chose “other”, but this was mostly people who didn’t have any mismatches.)

Our takeaway is that, at least for the reviewers who responded to the survey, the differentiated review forms for different paper types were on balance a plus: they helped more than they hurt.

How to handle papers submitted under the wrong type?

Some misclassified papers were easy to spot. We turned them up early in the process while browsing the non-NLP engineering experiment paper types (since we were interested to see what was coming in). Similarly, ACs and reviewers noted many cases of obvious type mismatches. However, we decided against reclassifying papers. The primary reason for this is that, although some papers were clearly mistyped, many other cases would not be clear-cut. It would be impossible to go through all papers, consider reassigning their types, and do so consistently. Furthermore, the point of the paper types was to allow authors to choose what questions reviewers would be answering about their papers. Second-guessing this seemed unfair and non-transparent.

Perhaps the most common clear cases of mistyped papers were papers we considered NLP engineering experiment (NLPEE) papers that were submitted as computationally aided linguistic analysis (CALA) papers. We have a few hypotheses about why that might have happened, not mutually exclusive:

(1) Design factors. CALA was listed first on the paper types page; people read it, thought it matched and looked no further. (Though in the dropdown menu for this question in the submission form on START, NLPEE is first.)

(2) Terminological prejudice. People were put off by “engineering” in the name of NLPEE. We’ve definitely heard some objections to that term from colleagues who take “engineering” to be a derogatory term. But we do not see it that way at all! Engineering research is research. Indeed, a lack of attention to good engineering in our computational experiments makes the science suffer. Furthermore, research contributions focused on building something and then testing how well it works seem to us to be well characterized by the term “engineering experiment”. It’s worth noting that we did struggle to come up with a name for this paper type, in large part because it is so ubiquitous but we couldn’t very well call it “typical NLP paper”. In our discussions, Leon proposed a name involving the word “empirical”, but Emily objected strongly to that: linguistic analysis papers that investigate patterns in language use to better understand linguistic structure or language behavior are very much empirical too.

(3) Interdisciplinary misunderstanding. Perhaps people working on NLPEE-type work from more of an ML background don’t understand the term “linguistic analysis” or “linguistic phenomenon” as we intended it. The CALA paper type was described as follows:

The focus of this paper type is new linguistic insight. It might take the form of an empirical study of some linguistic phenomenon, or of a theoretical result about a linguistically-relevant formal system.

It’s entirely possible that someone without training in linguistics would not know what terms like “linguistic phenomenon” or “formal system” denote for linguists. This speaks to the need for more interdisciplinary communication in our field, and we hope that COLING 2018 will continue the COLING tradition of providing such a venue!

Acceptance rate

Posted on June 8, 2018 by Emily M. Bender

As we noted in a previous post, the acceptance rate is an important metric of competitiveness for authors with accepted papers.

… for individual researchers, especially those employed in or hoping to be employed in academia, acceptance of papers to COLING and similar venues is very important for job prospects/promotion/etc. Furthermore, it isn’t simply a matter of publishing in peer-reviewed venues, but in high-prestige, competitive venues. Where the validation view of peer review would view it as a binary question (does this paper make a validatable contribution or not?), the prestige view instead speaks to ranking, where we end up with best papers, strong papers, borderline papers that get in, borderline papers that don’t get in, and papers that were easy to decide to reject. (And, for full disclosure, it is in the interest of a conference to strive to become and maintain status as a high-prestige, competitive venue.)

Not surprisingly, we’ve received several requests for the acceptance rate for COLING 2018. It turns out that determining that number is not straightforward. We initially had 1017 submissions, but some of those (129) were withdrawn, either early in the process (the authors never in fact completed the paper) or later, usually in light of acceptance at another venue, per the COLING 2018 dual submission policy. The denominator for our acceptance rate excludes these papers as it hardly seems fair to include papers that either weren’t reviewed, or were withdrawn because they were accepted elsewhere. Conversely, we decided to include the papers desk rejected (n=33) in the denominator.

With a total of 332 papers accepted for publication, that gives an acceptance rate of 37.4%.

PC chairs report back: On the effectiveness of author response

Posted on May 16, 2018 by Emily M. Bender

The utility of the author response part of the conference review process is hotly debated. At COLING 2018, we decided to have the author response be addressed only to the area chairs (and PC co-chairs), and not the reviewers. The purpose of this blog post is to report back on our experience with this model (largely positive, from the PC perspective!) and also to share with the community what we have learned, inhabiting this role, about what makes an effective author response.

For background, here is a description of the decision making process at the PC level. Keep in mind that COLING 2018 received 1017 submissions, of which 880 were still ‘active’ at the point of these decisions. (The difference is a combination of desk rejects and papers withdrawn, the latter mostly in light of acceptance to other venues with earlier notifications.)

Outline of our process

Final accept/reject decisions for COLING 2018 were made as follows:

We asked the ACs for each area to provide a ranking of the papers in their area and to indicate recommendations of accept, maybe accept, maybe reject, or reject. We specifically instructed the ACs to not use the reviewer scores to sort the papers, but rather to come to their own ranking based on their judgment, given the reviews, discussion among reviews, author responses, and (where necessary) reading the papers.

Our role as PCs was to turn those recommendations into decisions. To do so, we first looked at each area’s report and determined which papers had clear recommendations and which were borderline. For the former, we went with the AC recommendations directly. The borderline cases were either papers that the ACs marked as ‘maybe accept’ or ‘maybe reject’, or, for areas that only used ‘accept’ and ‘reject’, the last two ‘accept’ papers and the first two ‘reject’ papers in the ACs’ ranking. This gave us a bit over 200 papers to consider.
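
The borderline rule described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name, the data layout (a best-to-worst list of paper IDs plus a dict of recommendations), and the paper IDs are hypothetical, and the actual selection was done by hand from the area reports.

```python
def borderline_papers(ranking, recommendations, margin=2):
    """Return the papers in one area that call for a closer look.

    ranking: paper IDs ordered from best to worst by the ACs.
    recommendations: dict mapping paper ID to one of
        "accept", "maybe accept", "maybe reject", "reject".
    margin: how many papers to take on each side of the accept/reject
        boundary when an area used no "maybe" labels.
    """
    maybes = [p for p in ranking if recommendations[p].startswith("maybe")]
    if maybes:
        return maybes
    accepts = [p for p in ranking if recommendations[p] == "accept"]
    rejects = [p for p in ranking if recommendations[p] == "reject"]
    # Last `margin` accepts and first `margin` rejects in the AC ranking.
    return accepts[-margin:] + rejects[:margin]


# Hypothetical area that only used "accept" and "reject".
ranking = ["p1", "p2", "p3", "p4", "p5", "p6", "p7"]
recs = {"p1": "accept", "p2": "accept", "p3": "accept",
        "p4": "reject", "p5": "reject", "p6": "reject", "p7": "reject"}
print(borderline_papers(ranking, recs))  # ['p2', 'p3', 'p4', 'p5']
```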

We divided the areas into two sets, one for each of us. (We were careful at this point to put the areas containing papers with which one of us had COIs into the other PC’s stack.) Area by area, we looked at the borderline papers, considering the reviews, the reviewer discussion (if any), the author response, comments from the ACs, and sometimes the papers (to clarify particular points; we didn’t read the papers in full). Although the PC role on START allows us to see the authors of all submissions, we worked out ways to look at all the information we needed to do this without seeing the author names (or institutions, etc).

Of the 200 or so papers we looked at, there were 23 for which we wanted to have further discussion. This was done over Skype, despite the 9 hour time difference! These papers were evenly distributed between Emily’s and Leon’s areas, but clustered towards the start of each of our respective stacks; our analysis is that as we worked our way through the process, we each gained a better sense of how to make the decisions and found less uncertainty. (Discussion of COI papers was done with the General Chair, Pierre Isabelle, not the other PC, per our COI policy.)

As a final step to verify data entry (to make sure what is entered in START actually matches our intentions), we went through and looked at both the accepted papers with the lowest reviewer scores and the rejected papers with the highest reviewer scores. 98 papers with an average score of 3 or higher were rejected. 27 papers with an average score lower than 3 were accepted. (Remember, it’s not just about the numbers!) For each of these, we went back to our notes to check that the right information was entered (it was) and in so doing, we found that, for the majority of the papers which were accepted despite low reviewer scores (and correspondingly harsh reviews), our notes reflected effective author responses. This furthermore is consistent with our subjective sense that the author responses really did make a difference in the case of difficult decisions, that is, the papers we were looking at.
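
A data-entry check of this kind could look roughly like the sketch below. The tuple layout, the function name, and the sample papers are assumptions made for illustration; nothing here reflects how START stores decisions.

```python
from statistics import mean


def decisions_to_double_check(papers, threshold=3.0):
    """Pick out decisions that go against the average reviewer score.

    papers: iterable of (paper_id, reviewer_scores, accepted) tuples.
    Returns two lists of (paper_id, average_score):
    accepted papers averaging below the threshold (lowest first) and
    rejected papers averaging at or above it (highest first).
    """
    accepted_low, rejected_high = [], []
    for paper_id, scores, accepted in papers:
        avg = mean(scores)
        if accepted and avg < threshold:
            accepted_low.append((paper_id, avg))
        elif not accepted and avg >= threshold:
            rejected_high.append((paper_id, avg))
    accepted_low.sort(key=lambda item: item[1])
    rejected_high.sort(key=lambda item: item[1], reverse=True)
    return accepted_low, rejected_high


# Hypothetical decisions, not real COLING data.
sample = [
    ("p1", [4, 4, 5], True),
    ("p2", [2, 3, 3], True),    # accepted despite a low average
    ("p3", [3, 4, 3], False),   # rejected despite a high average
    ("p4", [1, 2, 2], False),
]
low, high = decisions_to_double_check(sample)
print([pid for pid, _ in low], [pid for pid, _ in high])
```

Each flagged paper would then be compared against the decision notes by hand, as described above.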

What makes an effective author response?

The effective author responses all had certain characteristics in common. They were written in a tone that was respectful, calm and confident (but not arrogant). They had specific answers to reviewers’ specific questions or specific replies to reviewers’ criticisms. For example, if a reviewer pointed out that a paper failed to discuss important related work, an effective author response would either acknowledge the omission and indicate that it will be addressed in the final version, or clearly state why the indicated paper isn’t in fact relevant. Effective author responses to reviewer questions about points that aren’t clear were short and to the point (and specific). This gave us confidence that the answers would be incorporated in the final version. In many cases, authors related the results of experiments they hadn’t had space for, or ran the analyses during the response period; this is much more effective than an ephemeral promise to add the content. Author responses could also be effective in indicating that reviewers misunderstood key points of the paper or the background into which it fits, but only if they were written in the calm, confident tone mentioned above.

Many effective author responses also expressed gratitude for the reviewers’ feedback. This was nice to see, but it wasn’t a problem when it wasn’t there.

What makes an ineffective author response?

Ineffective author responses, on the other hand, seemed to be written from a place of anger. We understand where authors are coming from when this happens! Reviews, especially negative reviews, can sting. But an author response that comes across as angry, condescending, or combative is not effective at persuading the ACs & PCs that the reviewers have things the wrong way around, nor does it provide good evidence that the paper will be improved for the camera ready version.

Best practices for writing author responses

Here we try to distill our experience of reading the author responses for ~200 papers (not all papers had them, but most did) into some helpful tips.

For conference organizers

We definitely recommend setting up an author response process, but having the author responses go to the ACs (and PCs) only, not the reviewers. Two ways to improve on what we did:

Clarify the word count constraints better than we did. We asked for no more than 400 words total, but the way START enforced that was no more than 400 words per review (since there were separate author response boxes for each review).
Don’t make the mistake we made of sending authors who wanted to do a late author response to their ACs … in the very small number of cases where that happened, it compromised anonymity of authors to ACs.
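
To make the word-count point concrete, here is a minimal sketch of a check that applies one shared limit across all response boxes rather than a limit per box. It is not how START counts words (that is not described here); the function names and the choice to split on underscores as well as whitespace are assumptions for illustration.

```python
import re


def count_words(text):
    """Count words, treating both whitespace and underscores as separators."""
    return len([word for word in re.split(r"[\s_]+", text) if word])


def within_total_limit(responses, limit=400):
    """Check the combined length of all response boxes against one limit."""
    return sum(count_words(response) for response in responses) <= limit


# Three boxes of 200 words each pass a per-box check but not a shared limit.
boxes = ["word " * 200, "word " * 200, "word " * 200]
print(all(count_words(box) <= 400 for box in boxes))  # True
print(within_total_limit(boxes))                       # False
```

Splitting on underscores means that joining words with underscores does not lower the count.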

For authors

Read the reviews and write the angry version. Then set it aside and write a calmer one.
If you can, show your author response to someone who will read it for you and let you know where it sounds angry/arrogant/petty.
Try starting with “Thank you for the helpful feedback”—this isn’t necessary, and you can edit it out afterwards for space, but it might help you get off on the right foot regarding tone.
Don’t play the reviewers off each other (“R1 says this paper is hard to read, but that’s clearly wrong, because R2 said it was easy to follow.”) Rest assured that the ACs will read all of the reviews; they’ll have seen R2’s comments too.
Similarly, don’t feel obliged to reply to everything in the reviews. General negative comments (e.g. “I found this paper hard to read”) don’t require a response and there probably isn’t a response that would be helpful. Either the paper really is unclear or the reviewer doesn’t have sufficient background / didn’t leave enough time to read the paper carefully. Which scenario this is will likely be evident from the rest of the reviews and the author response.
Don’t promise the moon and the stars in the final version. It’s hard to accept a borderline paper based on promises alone.
Do indicate specific answers to key questions, in a way that is obviously easily incorporated in the final version. (And in that case it’s fine to say “We will add clarification along these lines”, or similar.)
Do concisely demonstrate mastery of the area, if reviewers probe issues you have considered during your research and you have the answers to hand.
Don’t play games with the word count. We saw two author responses where the authors got around the software’s restriction to 400 words (per box!) by_joining_whole_sentences_with_underscores. This does not make a good impression.

Ultimately, even a calm and confident author response doesn’t necessarily push a paper on the borderline over into accept. Sometimes the paper just isn’t ready and it’s not reasonable to try to fix what needs fixing or add what needs adding for the final version. Nonetheless, we found that the above patterns do make author responses more effective, and so we wanted to share them.

COLING schedule construction: Next steps

Posted on May 16, 2018 by Emily M. Bender

We are proud to have sent out the acceptance notifications for COLING 2018 ahead of schedule! But, our work as chairs is not done. Here are our next steps:

Program construction

We have prepared a schedule “frame”, with plenary sessions (opening, keynotes, best papers, closing), parallel sessions (talks and posters), all fit in around coffee breaks, lunch and the excursions. Our task now is to group the accepted papers into coherent talk and poster sessions. In doing so, we will consider:

Author preferences (as indicated in START)
Area chair recommendations
Thematic coherence of sessions
Suitability of each topic for each format

Our goal is to have the program constructed by June 13. That timing is partially dependent on the best paper award process, outlined below.

Planning ahead

In an event of this size, it is inevitable that some number of presenters may be unable to attend at the last minute. In that case, we hope that speakers will be able to arrange to present remotely (per the inclusion policy). If that is not possible, and an oral presentation is being pulled, we will seek to replace it with the most thematically similar poster available.

Best paper awards

We have 10 award categories, the 9 listed in our previous post on this topic, plus ‘Best error analysis’, which we really should have thought of initially! We have 11 scholars who have agreed to be on this committee. And we have 41 papers which have been nominated, each for one of the specific awards.

We will shortly be creating subcommittees of the best paper committee to consider each award. Each award will be considered by two committee members and most committee members will be working on two award types. The exception is the “Best NLP engineering experiment” award, as that award type has the most nominations (being the most common paper type among our submissions). The committee members working on that type will focus only on it. We are open to the possibility that some awards may go unallocated (if this is warranted) and also that a paper may end up with a different award than the one it was nominated for.
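
The numbers above fit together neatly: two members dedicated to the NLP engineering experiment award, plus nine members covering the nine remaining awards in pairs, gives eleven members and ten awards. The sketch below shows one assignment that satisfies those constraints. It is purely illustrative: the member and award names are placeholders, and the real subcommittees were not necessarily formed this way.

```python
from collections import Counter


def assign_subcommittees(members, awards, dedicated_award):
    """Give every award two members; all but the dedicated pair get two awards."""
    others = [award for award in awards if award != dedicated_award]
    dedicated, rest = members[:2], members[2:]
    assignment = {dedicated_award: list(dedicated)}
    # Walk around the remaining members in a ring so that neighbours share
    # an award and, with as many members as awards, each serves twice.
    for i, award in enumerate(others):
        assignment[award] = [rest[i % len(rest)], rest[(i + 1) % len(rest)]]
    return assignment


members = [f"member_{i}" for i in range(1, 12)]             # 11 placeholders
awards = ["Best NLP engineering experiment"] + [
    f"award_{i}" for i in range(2, 11)                      # 9 placeholders
]
assignment = assign_subcommittees(members, awards, awards[0])
load = Counter(m for pair in assignment.values() for m in pair)
print(sorted(set(load.values())))  # [1, 2]
```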

Timeline

May 17: Nominated papers to best paper committee
June 1: Each subcommittee reports to the whole BPC with their nomination and a handful of alternates; the BPC then discusses results
June 8: The committee confirms up to ten best paper awards for nomination to the PC co-chairs
June 13: Best papers confirmed, and authors notified

Anonymity

In order to preserve anonymity in the best paper award selection process, we will not post the list of accepted papers until the selection is done. Individual authors are of course free at this point to post their own information, but we trust our best paper committee won’t go hunting for it.

Availability

As mentioned in our requirements post, only papers that have made the resources/code publicly available by camera ready time will be considered for best paper awards; those that rely on code or data, but haven’t made it available, will be taken out of the running.

Best paper committee

Our responsive, expert committee members are:

Steven Bethard, University of Arizona
Kevin Duh, Johns Hopkins University
Eva Hajicova, Charles University
Aurelie Herbelot, University of Trento
Qin Lu, Hong Kong Polytechnic University
Diana Maynard, University of Sheffield
Asad Sayeed, University of Gothenburg
Donia Scott, University of Sussex
Aline Villavicencio, University of Essex
Andreas Vlachos, University of Sheffield
Lilja Øvrelid, University of Oslo

Publication preparation

Going from drafts to papers in proceedings is a massive undertaking, for you and for us. Our hard-working publication chairs, Xiaodan Zhu and Zhiyuan Liu, are directing and supporting the process of getting hundreds of main-conference papers (and later, hundreds more workshop papers) into a form where they can be easily and freely downloaded by anyone. This collection of published papers is a huge part of the output of COLING. Creating it involves getting the proceedings to compile properly, which, as you may know from experience, is tough enough for a single paper, let alone 300+ in one volume. So please support them in this critical, painstaking work by getting your paper as tight and well-formatted as possible.

A window into the decision process

Posted on May 2, 2018 by Emily M. Bender

We are aware that the decision process for a large conference like COLING can be quite opaque from the point of view of authors, especially those who have not served in the role of AC or PC in the past. In this post, we aim to demystify a bit what we are doing (and why it takes so long from submission to decision!). As always, our belief is that more transparency leads to a better process—as we are committed to doing what we lay out, and what we lay out should be justified in this writing—and to a better understanding of the outcomes.

Timeline

Many of our authors are probably aware that reviews were due on April 10, and reviews are seen as the primary determinant of acceptance, so you might well wonder why you won’t be hearing about acceptance decisions until May 17. What could possibly take so long?

We (the PC co-chairs) met in Seattle last July to lay out a detailed timeline, making sure to build in time for careful decision making and also to allow for buffers to handle the near-certainty that some things would go wrong. The portion between April 10 and May 17 looks like this:

April 10	Reviews due
April 11	ACs request reviewer discussion, chase missing reviews
April 15	Reviewer discussion ends
April 16	ACs request fixes to problematic reviews (too short, inappropriate tone)
April 19	Deadline for reviews to be updated based on AC feedback
April 20	Reviews available to authors; author response begins
April 25	Author response ends
April 26	AC discussion starts
May 3	Reviewer identities revealed to co-reviewers
May 4	AC recommendations due to PC co-chairs
May 16	Signatures revealed for signed reviews
May 17	Acceptance notifications

As you can see, the time between the initial deadline for reviews and the final acceptance notification is largely dedicated to two things: making sure all reviews are present and appropriate, and leaving time for thoughtful consideration by both ACs and PC co-chairs in the decision making process.

Of course, not everything goes according to plan. As of April 25, we still have a handful of missing or incomplete reviews. In many of these cases, ACs (including our Special Circumstances ACs) are stepping in to provide the missing reviews. That this can be done blind is another benefit of keeping author identity from the ACs! (It’s not quite double blind, as authors can probably work out who the ACs are for their track, but that direction is less critical in this case.)

How did we end up with missing reviews? In some cases, this was not the fault of the reviewers at all. There were a handful of cases where START had the wrong email addresses for committee members, and we only discovered this when the ACs emailed the committee members from outside START—only to discover they hadn’t received their assignments! In other cases, committee members agreed to review and submitted bids and then didn’t turn in their reviews. While we absolutely understand that things come up, in the case that someone can’t complete their reviewing assignment, the best course of action in terms of minimizing impact on others (authors, other reviewers asked to step in, and the ACs/PCs managing the process) is just to communicate this fact as soon as possible.

Instructions to ACs

In our very first post to this PC blog we laid out our goals for the COLING 2018 program:

Our goals for COLING 2018 are (1) to create a program of high quality papers which represent diverse approaches to and applications of computational linguistics written and presented by researchers from throughout our international community; (2) to facilitate thoughtful reviewing which is both informative to ACs (and to us as PC co-chairs) and helpful to authors; and (3) to ensure that the results published at COLING 2018 are as reproducible as possible.

The process by which reviews are turned into acceptance decisions is a key part of the first of those goals (but not the only part—recruiting a strong, diverse pool of submissions was a key first step, as well as the design of the review process). Accordingly, these are the directions we have given to ACs, as they consider each paper in their area:

Please, please do not simply rank papers by overall score. Three reviewers is just not enough to get a reliable estimate of a paper’s quality. Maybe one reviewer didn’t read the paper, another one didn’t understand it and reacted poorly, and a final reviewer always gives negative scores; maybe one reviewer has warped priorities and another doesn’t know the area as well. There’s too much individual variance for a tiny number of reviewers (i.e. 3) to precisely judge a paper.

In fact, don’t even sort papers like this to start out with; glancing at that list will unconsciously bias perception of the papers and that’ll mean poor decisions. Save yourself - don’t let knowledge of that ranking make a nuanced review go unread.

However, as an area chair, you know your area well, and have good ideas of the technical merits of individual works in that area. You should be able to understand the technical content when needed and to judge the reviews’ quality for yourself. Once the scores are in, you’ll also have a good idea of which reviewers generally grade low (or high).

Try to order the papers in such a way that the ones you like most are at the top, the ones that shouldn’t appear are at the bottom, and each paper is preferable to the one below. You can split this work with your co-AC as you prefer; some will take half the papers and then merge, but if you do this, it’s important to realise that the split won’t be perfect - you won’t be able to interleave the resulting rankings one-by-one. In any event, both you and your co-AC must explicitly agree on the final ranking.

Use the reviews and author feedback as the evidence for the ranking, and be sure and confident about every decision. If you’re not yet confident, there are a few options. Ask the reviewers to clarify, or to examine a point; ask your co-AC for their opinion; find another reviewer for an extra opinion, if this can be done quickly; or ask us to send over resources.

Once you have an ordering, think about which of that set you’d recommend for acceptance, and send us the rankings along with your recommendations. You should also build a short report on your area - the process and the trends you saw there. Between you and your co-chair, this should be around 100-500 words.

As you can see, we are emphasizing holistic understanding of the merits of each paper, and de-emphasizing the numerical scores. Which brings up the obvious question: Why not rely on the scores?

It’s not just about the scores

Scoring is far too unreliable to be used as an acceptance recommendation. We have only three reviewers, each biased in their own way. You won’t get good statistics with a population of 3, and we don’t expect to. This isn’t the reviewers’ fault; it’s just plain statistics. Rather, each review has to be considered on its own, in terms of overall bias, expertise, and how well the reviewer understood the paper.

So, in the words of Jason Eisner, from his fantastic “How to Serve as Program Chair of a Conference” guide:

How not to do it: Please, please, please don’t just sort the papers by the 3 reviewers’ average overall recommendation! There is too much variance in these scores for n=3 to be a large enough sample. Maybe reviewer #1 tends to give high scores to everyone, reviewer #2 has warped priorities, and reviewer #3 barely read the paper or barely knows the area. Whereas another paper drew a different set of 3 reviewers.

How still not to do it: Even as a first step, don’t sort the papers by average recommendation. Trust me — this noisy and uncalibrated ranking isn’t even a good way to triage the papers into likely accepts, likely rejects, and borderline papers that deserve a closer look. Don’t risk letting it subtly influence the final decisions, or letting it doom some actual, nuanced reviews to go unread.

What I told myself: When you’re working with several hundred papers, a single paper with an average score of 3.8 may seem to merit only a shrug and a coin flip. But a single false negative might harm a poor student’s confidence, delay her progress to her next project, or undermine her advisor’s grant proposal or promotion case. Conversely, a single false positive wastes the time of quite a lot of people in your audience.

To do this step fairly, then, for the 872 papers remaining undecided, requires a considerable effort.
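To make the statistical point concrete, here is a small simulation sketch. Everything in it is an illustrative assumption rather than a description of COLING data: reviewer scores are modelled as a paper's "true" quality plus independent Gaussian noise, and the noise level (1.0 on the score scale) and the quality gap between two papers (0.5) are made-up numbers. The function names are hypothetical.

```python
import math
import random


def standard_error(noise_sd, n_reviewers):
    """Standard error of a paper's mean score under independent reviewer noise."""
    return noise_sd / math.sqrt(n_reviewers)


def misorder_rate(quality_gap, noise_sd, n_reviewers, trials=20000, seed=0):
    """Estimate how often the better of two papers gets the LOWER average score.

    Assumption: each reviewer's score is the paper's true quality plus
    Gaussian noise with standard deviation noise_sd.
    """
    rng = random.Random(seed)
    flipped = 0
    for _ in range(trials):
        # Average score of the better paper (true quality = quality_gap).
        better = sum(rng.gauss(quality_gap, noise_sd)
                     for _ in range(n_reviewers)) / n_reviewers
        # Average score of the weaker paper (true quality = 0).
        worse = sum(rng.gauss(0.0, noise_sd)
                    for _ in range(n_reviewers)) / n_reviewers
        if better < worse:
            flipped += 1
    return flipped / trials


if __name__ == "__main__":
    print("standard error with 3 reviewers:", round(standard_error(1.0, 3), 3))
    print("misorder rate with 3 reviewers: ", misorder_rate(0.5, 1.0, 3))
    print("misorder rate with 30 reviewers:", misorder_rate(0.5, 1.0, 30))
```

Under these assumed numbers, the mean of three scores carries a standard error of about 0.58, larger than the assumed quality gap itself, and the average-score ranking puts the weaker paper ahead in roughly a quarter of the simulated cases. Only with many more reviewers than any conference can supply does the misordering become rare, which is why the reviews themselves, not the averages, have to carry the decision.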

The dual role of peer reviewed conferences

As we (the PC co-chairs) work to oversee this process and then construct a final program out of AC recommendations, we are mindful of the dual role that a full-paper peer-review conference like COLING 2018 is playing.

On the one hand, peer review is meant to be an integral part of the process of doing science. If something is published in a peer-reviewed venue, that is an indication that it has been read critically by a set of reviewers and found to make a worthwhile contribution to the field of inquiry. This doesn’t ensure that it is correct, or even that most people up-to-date with the field would find it reliable, but it is an indication of scientific value. (This is all the more difficult in interdisciplinary fields, as we address to some extent in an earlier blog post.) This aspect of peer review fits well with the interests of the conference audience as stake-holders: The audience benefits from having vetted papers curated for them at the event.

On the other hand, for individual researchers, especially those employed in or hoping to be employed in academia, acceptance of papers to COLING and similar venues is very important for job prospects/promotion/etc. Furthermore, it isn’t simply a matter of publishing in peer-reviewed venues, but in high-prestige, competitive venues. Where the validation view of peer review would view it as a binary question (does this paper make a validatable contribution or not?), the prestige view instead speaks to ranking, where we end up with best papers, strong papers, borderline papers that get in, borderline papers that don’t get in, and papers that were easy to decide to reject. (And, for full disclosure, it is in the interest of a conference to strive to become and maintain status as a high-prestige, competitive venue.)

While understanding our role in the validation aspect of peer review, we are indeed viewing it as a ranking rather than binary process, for several reasons. First, the reviewers are also human, and it is simply not the case that any group of 3-5 humans can decide definitively whether any given paper (roughly in their field) is ‘valid’ or ‘invalid’ as a scientific contribution. Second, even if we did have a perfect oracle for validity, it’s not the case that the number of available spots in a given conference will be a perfect match for the number of ‘valid’ papers among the submissions. In case there are more worthy papers than spots, decisions have to be made somehow, and we believe that somehow should include both measures of degree of interest in the paper and overall diversity of approaches and topics in the program. (Conversely, we will not be aiming to ‘fill up’ a certain number of spots just because we have them.) Finally, we work with the understanding that COLING is not the only conference available, and that authors whose work is not accepted to COLING will in most cases be able to improve the presentation and/or underlying methodology and submit to another conference.

That ranking is ultimately binarized into accept/reject (modulo best paper awards) and we understand (and have our own personal experiences with!) the way that a paper rejection can seem to convey: ‘this research is not valid/not worthy.’ Or alternatively, that authors with relatively high headline scores on a paper that is nonetheless rejected might feel that the ‘true’ or ‘correct’ result for their paper was overridden by the ACs or PC. But we hope that this blog post will help to dispel those notions by providing a broader view of the process.

PC process once we have the AC reports

Once the ACs provide us with their rankings and reports, on May 4, we (PC co-chairs) will have the task of building from them a (nearly) complete conference program—the one outstanding piece will be the selection of best papers from among the accepted papers. Ahead of time, we have blocked out a ‘frame’ for the overall program so we have upper limits on how many oral presentations and poster presentations we can accept.

As a first step, we will look to see how the total acceptance recommendations of the ACs compare to the total number of spots available. However, it is not our role to simply accept the ACs’ recommendations, but rather to review them and ensure that the decisions as a whole are consistent (to the extent feasible, given that the whole process is noisy) and that the resulting program meets our goals of diversity in regard to topics and approaches (again, to the extent feasible, given the submission pool). We have also asked ACs to recommend mode of presentation (oral, poster), with the understanding that oral presentations are not ‘better papers’ than posters, but rather that some topics are more likely to be successful in each mode of presentation.

Though the author identities have been hidden from ACs, they haven’t been hidden from us. Nonetheless, as we work with the AC reports, we will have paper numbers & titles (but not author lists) to work from and will not go out of our way to associate author identities. Furthermore, the final accept/reject decisions for any papers that either of us has a COI with will be handled by the other PC co-chair together with the conference GC.

Author response

Posted on April 15, 2018 by Emily M. Bender

The value of the author response mechanism is frequently debated in our field and can be a source of stress for authors. On the one hand, when our work is being reviewed by others, we can feel helpless without the opportunity to respond to those reviews. On the other hand, there is the perennial question about whether author responses ever “help” (in the sense of taking a paper over the line to “accept” from “reject”). (On that point, see this very thoughtful analysis by Hal Daumé III for the process for NAACL 2013.) And finally there is the issue that author responses must be turned around in a short time and can be tricky to write: How to strike the right tone (firm, polite, confident; not pleading or angry), especially when we might still be feeling the sting of negative reviews? As reviewers, we have seen both very effective author responses (expressing gratitude for feedback and pointing out sources of misunderstanding) and very ineffective ones (pure vitriol, or long lists of promises of what will be accomplished before the camera-ready version).

In light of all of this, what we settled on for COLING 2018 is an optional author response to be seen by the area chairs only – and not the reviewers. Thus we are providing authors with the opportunity to flag reviewer misunderstandings for area chairs and to answer questions raised by reviews. The latter should only be done when the information is already available and can be indicated in a short statement (e.g. “Indeed, we did set the random seed and will include this information in the camera ready” but not “That is an interesting idea for a further experiment, we will run that one and include the numbers in the camera ready”). We also note that author response is optional and area chairs will not read anything into the lack of an author response.

Author response will run from 20-25 April.

Why this route? Well, the quantitative evidence is that pointing out reviewer mistakes rarely leads to a change in scores. The folk knowledge has been for some time that responses are really used by ACs to detect misaligned reviews. So rather than encourage an intrinsically difficult communication that has had little to no effect in the past, we instead divert the replies to go to the authoritative party they are relevant to. This gives a little extra work for ACs, but as they’re acting in pairs and areas are roughly the same compact size, our hope is that time can be spent more on working out the dialog around a paper and less on administering a huge set of authors and reviewers.

Lessons Learned

Posted on April 6, 2018 by Emily M. Bender

The role of PC chair is interesting in many ways. It provides a perhaps unparalleled opportunity to influence the way in which research is approached and presented in our field. For COLING 2018, we have been taking this responsibility very seriously and working hard, through both decisions for the review process and the publicization of those ideas in this blog, to push the field in directions that we believe will be fruitful, including stronger interdisciplinarity and more reproducibility.

On the flip side, the role of PC chair comes with some serious downsides. One is the heart-rending process of deciding on and then informing authors of desk rejects. We did our utmost to do this as fairly as humanly possible, starting with publicizing our desk reject policy. We hoped that that move would reduce the number of desk rejects, and it may have, but there were still a handful of papers rejected without review under the policy.

The most common reason for a desk reject, by a long way, was the paper’s length (i.e. documents were submitted with more than 9 content pages). Papers in the wrong template were also desk rejected (we saw e.g. NAACL, ACL, NIPS formats), as were those with squashed line spacing, reduced font size, removed author boxes, and so on. Another reason for desk rejection was bad anonymisation; some papers, for example, linked to the author’s private GitHub repository. This is the sort of thing that can really wait until camera ready. One paper was rejected for breaking the arXiv embargo period, having been published there fewer than 30 days before the COLING deadline. No edits were allowed after the deadline had passed. This was a very unpleasant process overall and we can only make a plea to authors to follow the guidelines so that work gets the attention it needs, instead of rejection without feedback. That way there don’t have to be any desk rejects at all. They are often desperately unpleasant to send, and probably even worse to receive.

In this blog post, we wanted to briefly reflect on what we have learned about the kind of practices that put people in the corners that lead to the kind of mistakes that result in desk rejects. In general, we see that there is a culture of last-minutism in our field. Deadlines can inspire people to get things done that otherwise seem impossible, but doing things in a rush also has downsides. Here are some DOs and DON’Ts of paper submission that we hope will spare people some pain in the future:

Do access the submission system early, so you know what awaits.
Do read the CFP carefully. Such documents can be intimidating, especially for first-time submitters, but the information there all has a purpose, and it’s easier to make use of if you get it early.
Don’t leave submitting your final paper until the absolute last minute. If something goes wrong (e.g. submitting the wrong pdf, losing your internet connection), you’ll have missed the deadline. This happens regularly and is wasteful. Sometimes you might not find out it was the wrong PDF until after the deadline, or might be so rushed that the paper spills over the page limit unnoticed. This means the hard work has to wait for another conference.

And finally a couple of thoughts on interacting with PC chairs, especially in large conferences:

Please don’t ask the PC chairs to upload a PDF for you after the deadline. The deadline is a deadline. Asking for it to be bent is asking the PC chairs to not apply policies evenly and fairly.
Do be aware that the PC chairs in a conference this size are communicating with ~1000 authors and ~1000 reviewers, and keep that in mind as you make requests.

COLING 2018

August 20-26, 2018, Santa Fe, New Mexico, USA

Author Archives: Emily M. Bender
