PC chairs report back: On the effectiveness of author response

The utility of the author response part of the conference review process is hotly debated. At COLING 2018, we decided to have the author response be addressed only to the area chairs (and PC co-chairs), and not the reviewers. The purpose of this blog post is to report back on our experience with this model (largely positive, from the PC perspective!) and also to share with the community what we have learned, inhabiting this role, about what makes an effective author response.

For background, here is a description of the decision making process at the PC level. Keep in mind that COLING 2018 received 1017 submissions, of which 880 were still ‘active’ at the point of these decisions.  (The difference is a combination of desk rejects and papers withdrawn, the latter mostly in light of acceptance to other venues with earlier notifications.)

Outline of our process

Final accept/reject decisions for COLING 2018 were made as follows:

We asked the ACs for each area to provide a ranking of the papers in their area and to indicate recommendations of accept, maybe accept, maybe reject, or reject. We specifically instructed the ACs not to use the reviewer scores to sort the papers, but rather to come to their own ranking based on their judgment, given the reviews, discussion among reviewers, author responses, and (where necessary) reading the papers.

Our role as PCs was to turn those recommendations into decisions. To do so, we first looked at each area’s report and determined which papers had clear recommendations and which were borderline.  For the former, we went with the AC recommendations directly. The borderline cases were either papers that the ACs marked as ‘maybe accept’ or ‘maybe reject’, or, for areas that only used ‘accept’ and ‘reject’, the last two ‘accept’ papers and the first two ‘reject’ papers in the ACs’ ranking. This gave us a bit over 200 papers to consider.
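(For the curious, here is a minimal sketch of that borderline-selection rule. The data layout and field names are hypothetical, not the actual START export.)

    # Purely illustrative sketch of how borderline papers were identified from an
    # AC ranking; field names are hypothetical.
    def borderline_papers(ranked_papers):
        """ranked_papers: list of dicts in the ACs' best-to-worst order, each with a
        'recommendation' in {'accept', 'maybe accept', 'maybe reject', 'reject'}."""
        recs = {p["recommendation"] for p in ranked_papers}
        if "maybe accept" in recs or "maybe reject" in recs:
            # Areas that used the 'maybe' labels: those papers are the borderline set.
            return [p for p in ranked_papers if p["recommendation"].startswith("maybe")]
        # Areas that used only accept/reject: the last two accepts and first two rejects.
        accepts = [p for p in ranked_papers if p["recommendation"] == "accept"]
        rejects = [p for p in ranked_papers if p["recommendation"] == "reject"]
        return accepts[-2:] + rejects[:2]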

We divided the areas into two sets, one for each of us. (We were careful at this point to put the areas containing papers with which one of us had COIs into the other PC’s stack.) Area by area, we looked at the borderline papers, considering the reviews, the reviewer discussion (if any), the author response, comments from the ACs, and sometimes the papers (to clarify particular points; we didn’t read the papers in full). Although the PC role on START allows us to see the authors of all submissions, we worked out ways to look at all the information we needed to do this without seeing the author names (or institutions, etc).

Of the 200 or so papers we looked at, there were 23 for which we wanted to have further discussion. This was done over Skype, despite the 9 hour time difference! These papers were evenly distributed between Emily’s and Leon’s areas, but clustered towards the start of each of our respective stacks; our analysis is that as we worked our way through the process, we each gained a better sense of how to make the decisions and found less uncertainty. (Discussion of COI papers was done with the General Chair, Pierre Isabelle, not the other PC, per our COI policy.)

As a final step to verify data entry (to make sure what was entered in START actually matched our intentions), we went through and looked at both the accepted papers with the lowest reviewer scores and the rejected papers with the highest reviewer scores. 98 papers with an average score of 3 or higher were rejected, and 27 papers with an average score lower than 3 were accepted. (Remember, it’s not just about the numbers!) For each of these, we went back to our notes to check that the right information had been entered (it had), and in doing so we found that, for the majority of the papers accepted despite low reviewer scores (and correspondingly harsh reviews), our notes reflected effective author responses. This is consistent with our subjective sense that author responses really did make a difference in the difficult decisions, which are precisely the papers we were looking at.
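(If you wanted to reproduce that kind of consistency check, a minimal sketch might look like the following; the field names and data layout are assumptions for illustration, not our actual tooling.)

    # Purely illustrative sketch of the sanity check: flag accepted papers with a
    # low average reviewer score and rejected papers with a high one, so their
    # decisions can be compared against the decision notes.
    def papers_to_recheck(papers, threshold=3.0):
        """papers: iterable of dicts with 'id', 'decision' ('accept'/'reject'),
        and 'scores' (list of overall reviewer scores)."""
        flagged = []
        for p in papers:
            avg = sum(p["scores"]) / len(p["scores"])
            if (p["decision"] == "accept" and avg < threshold) or \
               (p["decision"] == "reject" and avg >= threshold):
                flagged.append((p["id"], p["decision"], round(avg, 2)))
        return flagged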

What makes an effective author response?

The effective author responses all had certain characteristics in common. They were written in a tone that was respectful, calm and confident (but not arrogant). They had specific answers to reviewers’ specific questions or specific replies to reviewers’ criticisms. For example, if a reviewer pointed out that a paper failed to discuss important related work, an effective author response would either acknowledge the omission and indicate that it will be addressed in the final version, or clearly state why the indicated paper isn’t in fact relevant. Effective author responses to reviewer questions about points that aren’t clear were short and to the point (and specific). This gave us confidence that the answers would be incorporated in the final version. In many cases, authors related the results of experiments they hadn’t had space for, or ran the analyses during the response period; this is much more effective than an ephemeral promise to add the content. Author responses could also be effective in indicating that reviewers misunderstood key points of the paper or the background into which it fits, but only if they were written in the calm, confident tone mentioned above.

Many effective author responses also expressed gratitude for the reviewers’ feedback. This was nice to see, but it wasn’t a problem when it wasn’t there.

What makes an ineffective author response?

Ineffective author responses, on the other hand, seemed to be written from a place of anger. We understand where authors are coming from when this happens! Reviews, especially negative reviews, can sting. But an author response that comes across as angry, condescending, or combative is not effective at persuading the ACs & PCs that the reviewers have things the wrong way around, nor does it provide good evidence that the paper will be improved for the camera-ready version.

Best practices for writing author responses

Here we try to distill our experience of reading the author responses for ~200 papers (not all papers had them, but most did) into some helpful tips.

For conference organizers

We definitely recommend setting up an author response process, but having the author responses go to the ACs (and PCs) only, not the reviewers.  Two ways to improve on what we did:

  • Clarify the word count constraints better than we did. We asked for no more than 400 words total, but the way START enforced that was no more than 400 words per review (since there were separate author response boxes for each review). A sketch of the check we intended follows this list.
  • Don’t make the mistake we made of sending authors who wanted to do a late author response to their ACs … in the very small number of cases where that happened, it compromised anonymity of authors to ACs.
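To make the first point concrete, here is a purely illustrative sketch of the total-word-count check we had in mind; it is not anything START actually runs.

    # Purely illustrative: the intended constraint was 400 words in total across
    # all of the per-review response boxes, not 400 words per box.
    def within_total_limit(response_boxes, limit=400):
        """response_boxes: list of strings, one author response per review."""
        total = sum(len(box.split()) for box in response_boxes)
        return total <= limit, total

For example, three boxes of 150 words each would satisfy a per-box limit but fail this total check.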

For authors

  • Read the reviews and write the angry version. Then set it aside and write a calmer one.
  • If you can, show your author response to someone who will read it for you and let you know where it sounds angry/arrogant/petty.
  • Try starting with “Thank you for the helpful feedback”—this isn’t necessary, and you can edit it out afterwards for space, but it might help you get off on the right foot regarding tone.
  • Don’t play the reviewers off each other (“R1 says this paper is hard to read, but that’s clearly wrong, because R2 said it was easy to follow.”) Rest assured that the ACs will read all of the reviews; they’ll have seen R2’s comments too.
  • Similarly, don’t feel obliged to reply to everything in the reviews. General negative comments (e.g. “I found this paper hard to read”) don’t require a response and there probably isn’t a response that would be helpful. Either the paper really is unclear or the reviewer doesn’t have sufficient background / didn’t leave enough time to read the paper carefully. Which scenario this is will likely be evident from the rest of the reviews and the author response.
  • Don’t promise the moon and the stars in the final version. It’s hard to accept a borderline paper based on promises alone.
  • Do indicate specific answers to key questions, in a way that is obviously easily incorporated in the final version. (And in that case it’s fine to say “We will add clarification along these lines”, or similar.)
  • Do concisely demonstrate mastery of the area, if reviewers probe issues you have considered during your research and you have the answers to hand.
  • Don’t play games with the word count. We saw two author responses where the authors got around the software’s restriction to 400 words (per box!) by_joining_whole_sentences_with_underscores. This does not make a good impression.

Ultimately, even a calm and confident author response doesn’t necessarily push a paper on the borderline over into accept. Sometimes the paper just isn’t ready and it’s not reasonable to try to fix what needs fixing or add what needs adding for the final version. Nonetheless, we found that the above patterns do make author responses more effective, and so we wanted to share them.

 

 

COLING schedule construction: Next steps

We are proud to have sent out the acceptance notifications for COLING 2018 ahead of schedule! But, our work as chairs is not done. Here are our next steps:

Program construction

We have prepared a schedule “frame”, with plenary sessions (opening, keynotes, best papers, closing), parallel sessions (talks and posters), all fit in around coffee breaks, lunch and the excursions. Our task now is to group the accepted papers into coherent talk and poster sessions. In doing so, we will consider:

  • Author preferences (as indicated in START)
  • Area chair recommendations
  • Thematic coherence of sessions
  • Suitability of each topic for each format

Our goal is to have the program constructed by June 13. That timing is partially dependent on the best paper award process, outlined below.

Planning ahead

In an event of this size, it is inevitable that some presenters will be unable to attend at the last minute. In that case, we hope that speakers will be able to arrange to present remotely (per the inclusion policy). If that is not possible and an oral presentation has to be pulled, we will seek to replace it with the most thematically similar poster available.

Best paper awards

We have 10 award categories, the 9 listed in our previous post on this topic, plus ‘Best error analysis’, which we really should have thought of initially! We have 11 scholars who have agreed to be on this committee. And we have 41 papers which have been nominated, each for one of the specific awards.

We will shortly be creating subcommittees of the best paper committee to consider each award. Each award will be considered by two committee members and most committee members will be working on two award types. The exception is the “Best NLP engineering experiment” award, as that award type has the most nominations (being the most common paper type among our submissions). The committee members working on that type will focus only on it. We are open to the possibility that some awards may go unallocated (if this is warranted) and also that a paper may end up with a different award than the one it was nominated for.

Timeline

May 17: Nominated papers to best paper committee
June 1: Each subcommittee reports to the whole BPC with their nomination and a handful of alternates; the BPC then discusses results
June 8: The committee confirms up to ten best paper awards for nomination to the PC co-chairs
June 13: Best papers confirmed, and authors notified

Anonymity

In order to preserve anonymity in the best paper award selection process, we will not post the list of accepted papers until the selection is done. Individual authors are of course free at this point to post their own information, but we trust our best paper committee won’t go hunting for it.

Availability

As mentioned in our requirements post, only papers that have made the resources/code publicly available by camera ready time will be considered for best paper awards; those that rely on code or data, but haven’t made it available, will be taken out of the running.

Best paper committee

Our responsive, expert committee members are:

Publication preparation

Going from drafts to papers in proceedings is a massive undertaking—for you and for us. Our hard-working publication chairs, Xiaodan Zhu and Zhiyuan Liu, are directing and supporting the process of getting hundreds of main-conference papers (and later, hundreds more workshop papers) into a form where they can be easily and freely downloaded by anyone. This collection of published papers is a huge part of the output of COLING. Creating it involves getting the proceedings to compile properly, which, as you may know from experience, is tough enough for a single paper—let alone 300+ in one volume. So please, support them in this critical, painstaking work by getting your paper as tight and well-formatted as possible.

A window into the decision process

We are aware that the decision process for a large conference like COLING can be quite opaque from the point of view of authors, especially those who have not served in the role of AC or PC in the past. In this post, we aim to demystify a bit what we are doing (and why it takes so long from submission to decision!). As always, our belief is that more transparency leads to a better process—as we are committed to doing what we lay out, and what we lay out should be justified in this writing—and to a better understanding of the outcomes.

Timeline

Many of our authors are probably aware that reviews were due on April 10, and reviews are seen as the primary determinant of acceptance, so you might well wonder why you won’t be hearing about acceptance decisions until May 17. What could possibly take so long?

We (the PC co-chairs) met in Seattle last July to lay out a detailed timeline, making sure to build in time for careful decision making and also to allow for buffers to handle the near-certainty that some things would go wrong.  The portion between April 10 and May 17 looks like this:

April 10: Reviews due
April 11: ACs request reviewer discussion, chase missing reviews
April 15: Reviewer discussion ends
April 16: ACs request fixes to problematic reviews (too short, inappropriate tone)
April 19: Deadline for reviews to be updated based on AC feedback
April 20: Reviews available to authors; author response begins
April 25: Author response ends
April 26: AC discussion starts
May 3: Reviewer identities revealed to co-reviewers
May 4: AC recommendations due to PC co-chairs
May 16: Signatures revealed for signed reviews
May 17: Acceptance notifications

As you can see, the time between the initial deadline for reviews and the final acceptance notification is largely dedicated to two things: making sure all reviews are present and appropriate, and leaving time for thoughtful consideration by both ACs and PC co-chairs in the decision making process.

Of course, not everything goes according to plan. As of April 25, we still have a handful of missing or incomplete reviews. In many of these cases, ACs (including our Special Circumstances ACs) are stepping in to provide the missing reviews. That this can be done blind is another benefit of keeping author identity from the ACs! (It’s not quite double blind, as authors can probably work out who the ACs are for their track, but that direction is less critical in this case.)

How did we end up with missing reviews? In some cases, this was not the fault of the reviewers at all. There were a handful of cases where START had the wrong email addresses for committee members, and we only discovered this when the ACs emailed the committee members from outside START—only to discover they hadn’t received their assignments! In other cases, committee members agreed to review and submitted bids and then didn’t turn in their reviews. While we absolutely understand that things come up, in the case that someone can’t complete their reviewing assignment, the best course of action in terms of minimizing impact on others (authors, other reviewers asked to step in, and the ACs/PCs managing the process) is just to communicate this fact as soon as possible.

Instructions to ACs

In our very first post to this PC blog we laid out our goals for the COLING 2018 program:

Our goals for COLING 2018 are (1) to create a program of high quality papers which represent diverse approaches to and applications of computational linguistics written and presented by researchers from throughout our international community; (2) to facilitate thoughtful reviewing which is both informative to ACs (and to us as PC co-chairs) and helpful to authors; and (3) to ensure that the results published at COLING 2018 are as reproducible as possible.

The process by which reviews are turned into acceptance decisions is a key part of the first of those goals (but not the only part—recruiting a strong, diverse pool of submissions was a key first step, as well as the design of the review process). Accordingly, these are the directions we have given to ACs, as they consider each paper in their area:

Please, please do not simply rank papers by overall score. Three reviewers is just not enough to get a reliable estimate of a paper’s quality. Maybe one reviewer didn’t read the paper, another one didn’t understand it and reacted poorly, and a final reviewer always gives negative scores; maybe one reviewer has warped priorities and another doesn’t know the area as well. There’s too much individual variance for a tiny number of reviewers (i.e. 3) to precisely judge a paper.

In fact, don’t even sort papers like this to start out with; glancing at that list will unconsciously bias perception of the papers and that’ll mean poor decisions. Save yourself - don’t let knowledge of that ranking make a nuanced review go unread.

However, as an area chair, you know your area well, and have good ideas of the technical merits of individual works in that area. You should be able to understand the technical content when needed and be able to judge the reviews’ quality for yourself. Once the scores are in, you’ll also have a good idea of which reviewers generally grade low (or high).

Try to order the papers in such a way that the ones you like most are at the top, the ones that shouldn’t appear are at the bottom, and each paper is preferable to the one below it. You can split this work with your co-AC as you prefer; some will take half the papers each and then merge, but if you do this, it’s important to realise that the split won’t be perfect - you won’t be able to interleave the resulting rankings one-by-one. In any event, both you and your co-AC must explicitly agree on the final ranking.

Use the reviews and author feedback as the evidence for the ranking, and be sure and confident about every decision. If you’re not yet confident, there are a few options: ask the reviewers to clarify, or to examine a point; ask your co-AC for their opinion; find another reviewer for an extra opinion, if this can be done quickly; or ask us to send over resources.

Once you have an ordering, think about which of that set you’d recommend for acceptance, and send us the rankings along with your recommendations. You should also build a short report on your area - the process and the trends you saw there. Between you and your co-chair, this should be around 100-500 words.

As you can see, we are emphasizing holistic understanding of the merits of each paper, and de-emphasizing the numerical scores. Which brings up the obvious question: Why not rely on the scores?

It’s not just about the scores

Scoring is far too unreliable to be used as an acceptance recommendation on its own. We have only three reviewers, each biased in their own way. You won’t get good statistics with a population of 3, and we don’t expect to. This isn’t the reviewers’ fault; it’s just plain statistics. Rather, each review has to be considered on its own—in terms of the reviewer’s overall bias, their expertise, and how well they understood the paper.

So, in the words of Jason Eisner, from his fantastic “How to Serve as Program Chair of a Conference” guide:

How not to do it: Please, please, please don’t just sort the papers by the 3 reviewers’ average overall recommendation! There is too much variance in these scores for n=3 to be a large enough sample. Maybe reviewer #1 tends to give high scores to everyone, reviewer #2 has warped priorities, and reviewer #3 barely read the paper or barely knows the area. Whereas another paper drew a different set of 3 reviewers.

How still not to do it: Even as a first step, don’t sort the papers by average recommendation. Trust me — this noisy and uncalibrated ranking isn’t even a good way to triage the papers into likely accepts, likely rejects, and borderline papers that deserve a closer look. Don’t risk letting it subtly influence the final decisions, or letting it doom some actual, nuanced reviews to go unread.

What I told myself: When you’re working with several hundred papers, a single paper with an average score of 3.8 may seem to merit only a shrug and a coin flip. But a single false negative might harm a poor student’s confidence, delay her progress to her next project, or undermine her advisor’s grant proposal or promotion case. Conversely, a single false positive wastes the time of quite a lot of people in your audience.

Doing this step fairly, then, for the 872 papers remaining undecided requires considerable effort.

The dual role of peer reviewed conferences

As we (the PC co-chairs) work to oversee this process and then construct a final program out of AC recommendations, we are mindful of the dual role that a full-paper peer-review conference like COLING 2018 is playing.

On the one hand, peer review is meant to be an integral part of the process of doing science. If something is published in a peer-reviewed venue, that is an indication that it has been read critically by a set of reviewers and found to make a worthwhile contribution to the field of inquiry. This doesn’t ensure that it is correct, or even that most people up-to-date with the field would find it reliable, but it is an indication of scientific value. (This is all the more difficult in interdisciplinary fields, as we discussed in an earlier blog post.) This aspect of peer review fits well with the interests of the conference audience as stakeholders: the audience benefits from having vetted papers curated for them at the event.

On the other hand, for individual researchers, especially those employed in or hoping to be employed in academia, acceptance of papers to COLING and similar venues is very important for job prospects/promotion/etc. Furthermore, it isn’t simply a matter of publishing in peer-reviewed venues, but in high-prestige, competitive venues. Where the validation view of peer review would treat it as a binary question (does this paper make a validatable contribution or not?), the prestige view instead speaks to ranking—where we end up with best papers, strong papers, borderline papers that get in, borderline papers that don’t get in, and papers that were easy to decide to reject. (And, for full disclosure, it is in the interest of a conference to strive to become and maintain status as a high-prestige, competitive venue.)

While understanding our role in the validation aspect of peer review, we are indeed viewing it as a ranking rather than a binary process, for several reasons. First, the reviewers are also human, and it is simply not the case that any group of 3-5 humans can definitively decide whether any given paper (roughly in their field) is definitely ‘valid’ or ‘invalid’ as a scientific contribution. Second, even if we did have a perfect oracle for validity, it’s not the case that the number of available spots in a given conference will be a perfect match for the number of ‘valid’ papers among the submissions. When there are more worthy papers than spots, decisions have to be made somehow—and we believe that somehow should include both measures of degree of interest in the paper and overall diversity of approaches and topics in the program. (Conversely, we will not be aiming to ‘fill up’ a certain number of spots just because we have them.) Finally, we work with the understanding that COLING is not the only conference available, and that authors whose work is not accepted to COLING will in most cases be able to improve the presentation and/or underlying methodology and submit to another conference.

That ranking is ultimately binarized into accept/reject (modulo best paper awards) and we understand (and have our own personal experiences with!) the way that a paper rejection can seem to convey: ‘this research is not valid/not worthy.’ Or alternatively, that authors with relatively high headline scores on a paper that is nonetheless rejected might feel that the ‘true’ or ‘correct’ result for their paper was overridden by the ACs or PC. But we hope that this blog post will help to dispel those notions by providing a broader view of the process.

PC process once we have the AC reports

Once the ACs provide us with their rankings and reports, on May 4, we (PC co-chairs) will have the task of building from them a (nearly) complete conference program—the one outstanding piece will be the selection of best papers from among the accepted papers. Ahead of time, we have blocked out a ‘frame’ for the overall program so we have upper limits on how many oral presentations and poster presentations we can accept.

As a first step, we will look to see how the total acceptance recommendations of the ACs compare to the total number of spots available. However, it is not our role to simply accept the ACs’ recommendations, but rather to review them and ensure that the decisions as a whole are consistent (to the extent feasible, given that the whole process is noisy) and that the resulting program meets our goals of diversity in regard to topics and approaches (again, to the extent feasible, given the submission pool). We have also asked ACs to recommend mode of presentation (oral, poster), with the understanding that oral presentations are not ‘better papers’ than posters, but rather that some topics are more likely to be successful in each mode of presentation.

Though the author identities have been hidden from ACs, they haven’t been hidden from us. Nonetheless, as we work with the AC reports, we will have paper numbers & titles (but not author lists) to work from and will not go out of our way to associate author identities. Furthermore, the final accept/reject decisions for any papers that either of us have a COI with will be handled by the other PC co-chair together with the conference GC.

Review statistics

So far, there have been many things to measure in our review process at COLING. Here are a few.

Firstly, it’s interesting to see how many reviewers recommend that the authors cite the reviewer’s own work. We can’t evaluate how appropriate this was in each case, but it happened in 68 out of 2806 reviews (2.4%).

Best paper nominations are quite rare in general. This gives very little signal for the best paper committee to work with. To gain more information, in addition to asking whether a paper warranted further recognition, we asked reviewers to say if a given paper was the best out of those they had reviewed. This worked well for 747 reviewers, but 274 reviewers (26.8%) said no paper they reviewed was the best of their reviewing allocation.

Mean scores and confidence can be broken down by type, as follows.

Paper type                                  Mean score  Mean confidence
Computationally-aided linguistic analysis         2.85             3.42
NLP engineering experiment paper                  2.86             3.51
Position paper                                    2.41             3.36
Reproduction paper                                2.92             3.54
Resource paper                                    2.76             3.50
Survey paper                                      2.93             3.58

We can see that reviewers were least confident with position papers, and were both most confident and most pleased with survey papers—though reproduction papers came in a close second in regard to mean score. This fits the general expectation that position papers are hard to evaluate.
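For those who like to compute such breakdowns themselves, a minimal pandas sketch follows; the file and column names are assumptions for illustration, not the actual START export schema.

    # Minimal sketch of the per-type breakdown above, assuming a hypothetical
    # review export with one row per review and columns
    # paper_type, overall_score, confidence.
    import pandas as pd

    reviews = pd.read_csv("reviews.csv")
    by_type = (reviews
               .groupby("paper_type")[["overall_score", "confidence"]]
               .mean()
               .round(2))
    print(by_type)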

The overall distribution of scores follows.

Anonymity and Review

Anonymous review is a way of achieving a fairer process. The ongoing discussion among many in our field led us to examine how well this was really working, and to rethink how anonymity was implemented for COLING this year.

One step we took was to make sure that area chairs did not know who the authors were. This is important because area chairs are the ones putting forward recommendations based on reviews; they decide the fate of borderline papers and assess whether reviewer ratings have put a paper on the wrong side of the acceptance boundary. This is a critical and powerful role. So, if a venue has chosen to run an anonymized process, we should be extra sure that the area chairs don’t see paper authors’ names.

This policy caused a little initial surprise, but everyone has adapted quickly. In order for this to work, authors must continue to hide their identity, especially in the author response to chairs—the stage of the process we are in now.

We also increased anonymity in reviewer discussion: reviewers did not, and still do not, know each other’s identities. To keep review tone professional, we will reveal reviewer identities to co-reviewers later in the process, so if you are one of our generous program committee members, you will be able to see who wrote the excellent review you saw, and also who left the blank one, on submissions you also reviewed.

It’s established that signed reviews—that is, those including the reviewer’s name—are generally found by authors to be of better quality and tone. We gave reviewers the option of signing their reviews; this time, 121 reviewers did so, out of 1020 active review authors (11.9%).

On the topic of anonymity, there have been a few rejections due to poor or absent anonymization. To help future authors, here are some ways anonymity can be broken.

  • Linking to a personal or institutional github account and making it clear in the prose it is the authors’ (e.g. “We make this available at github.com/authorname/tool/”).
  • Describing and citing prior work as “we showed”, “our previous work”, and so on
  • Leaving names and affiliations on the front page
  • Including unpublished papers in the bibliography

Some of these can be avoided by simply holding back references to one’s own past work until the camera-ready copy, which is a strategy we recommend. Of course it’s not always possible, but in most of the cases we saw, refraining from self-citation would not have damaged the narrative and would have left the paper compliant.

The final step in the review process, from the author side, is author response to chairs. Please remember to keep yourself anonymous here—the chairs know neither author nor reviewer identities, which helps them be impartial.

Author response

The value of the author response mechanism is frequently debated in our field and can be a source of stress for authors. On the one hand, when our work is being reviewed by others, it can feel disempowering not to have the opportunity to respond to those reviews. On the other hand, there is the perennial question of whether author responses ever “help” (in the sense of taking a paper over the line from “reject” to “accept”). (On that point, see this very thoughtful analysis by Hal Daumé III of the process for NAACL 2013.) And finally there is the issue that author responses must be turned around in a short time and can be tricky to write: how to strike the right tone (firm, polite, confident; not pleading or angry), especially when we might still be feeling the sting of negative reviews? As reviewers, we have seen both very effective author responses (expressing gratitude for feedback and pointing out sources of misunderstanding) and very ineffective ones (pure vitriol, or long lists of promises of what will be accomplished before the camera-ready version).

In light of all of this, what we settled on for COLING 2018 is an optional author response to be seen by the area chairs only – and not the reviewers. Thus we are providing authors with the opportunity to flag reviewer misunderstandings for area chairs and to answer questions raised by reviews. The latter should only be done when the information is already available and can be indicated in a short statement (e.g. “Indeed, we did set the random seed and will include this information in the camera ready” but not “That is an interesting idea for a further experiment, we will run that one and include the numbers in the camera ready”). We also note that author response is optional and area chairs will not read anything into the lack of an author response.

Author response will run from 20-25 April.

Why this route? Well, the quantitative evidence is that pointing out reviewer mistakes rarely leads to a change in scores, and the folk knowledge has long been that responses are really used by ACs to detect misaligned reviews. So rather than encourage an intrinsically difficult communication that has had little to no effect in the past, we instead divert the replies to the authoritative party they are relevant to. This means a little extra work for ACs, but as they’re acting in pairs and areas are roughly the same compact size, our hope is that their time can be spent more on working out the dialog around a paper and less on administering a huge set of authors and reviewers.

Lessons Learned

The role of PC chair is interesting in many ways. It provides a perhaps unparalleled opportunity to influence the way in which research is approached and presented in our field. For COLING 2018, we have been taking this responsibility very seriously and working hard, through both our decisions about the review process and the publicizing of those ideas on this blog, to push the field in directions that we believe will be fruitful, including stronger interdisciplinarity and more reproducibility.

On the flip side, the role of PC chair comes with some serious downsides. One is the heart-rending process of deciding on and then informing authors of desk rejects. We did our utmost to do this as fairly as humanly possible, starting with publicizing our desk reject policy. We hoped that that move would reduce the number of desk rejects, and it may have, but there were still a handful of papers rejected without review under the policy.

The most common reason for a desk reject, by a long way, was paper length (i.e. documents submitted with more than 9 content pages). Papers in a completely incorrect template were also desk rejected (we saw e.g. NAACL, ACL, and NIPS formats), as were those with squashed line spacing, reduced font size, removed author boxes, and so on. Other desk rejects were for bad anonymisation; some papers, for example, linked to the authors’ private github repository, which is the sort of thing that can really wait until camera ready. One paper was rejected for breaking the arXiv embargo period, having been posted there fewer than 30 days before the COLING deadline. No edits were allowed after the deadline had passed. This was a very unpleasant process overall, and we can only plead with authors to follow the guidelines so that work gets the attention it needs, instead of rejection without feedback; that way there don’t have to be any desk rejects at all. They are often desperately unpleasant to send, and probably even worse to receive.
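To make the most mechanical of those rules concrete, here is a purely illustrative sketch of the page-limit and arXiv-embargo checks as described above; it is not our actual tooling, and the deadline date is a placeholder.

    # Purely illustrative desk-reject checks: more than 9 content pages, or an
    # arXiv preprint posted fewer than 30 days before the submission deadline.
    from datetime import date

    SUBMISSION_DEADLINE = date(2018, 3, 16)  # hypothetical placeholder, not the real CFP date
    PAGE_LIMIT = 9
    EMBARGO_DAYS = 30

    def violates_page_limit(content_pages: int) -> bool:
        return content_pages > PAGE_LIMIT

    def breaks_arxiv_embargo(arxiv_posted: date) -> bool:
        """True if a preprint appeared fewer than 30 days before the deadline."""
        return 0 <= (SUBMISSION_DEADLINE - arxiv_posted).days < EMBARGO_DAYS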

In this blog post, we wanted to briefly reflect on what we have learned about the kind of practices that put people in the corners that lead to the kind of mistakes that result in desk rejects. In general, we see that there is a culture of last-minutism in our field. Deadlines can inspire people to get things done that otherwise seem impossible, but doing things in a rush also has downsides. Here are some DOs and DON’Ts of paper submission that we hope will spare people some pain in the future:

  • Do access the submission system early, so you know what awaits.
  • Do read the CFP carefully. Such documents can be intimidating, especially for first-time submitters, but the information there all has a purpose, and it’s easier to make use of if you get it early.
  • Don’t leave submitting your final paper until the absolute last minute. If something goes wrong (e.g. submitting the wrong pdf, losing your internet connection), you’ll have missed the deadline. This happens regularly and is wasteful. Sometimes you might not find out it was the wrong PDF until after the deadline, or might be so rushed that the paper spills over the page limit unnoticed. This means the hard work has to wait for another conference.

And finally a couple of thoughts on interacting with PC chairs, especially in large conferences:

  • Please don’t ask the PC chairs to upload a PDF for you after the deadline. The deadline is a deadline. Asking for it to be bent is asking the PC chairs to not apply policies evenly and fairly.
  • Do be aware that the PC chairs in a conference this size are communicating with ~1000 authors and ~1000 reviewers, and keep that in mind as you make requests.

COLING 2018 Submissions Overview

We’ve had a successful COLING so far, with over a thousand papers submitted, covering a variety of areas. In total, 1017 papers were submitted to the main conference, all full-length.

Each submitted paper had a type assigned by the authors, which affects how it is reviewed. These types were developed based on our earlier blog post on paper types. The “NLP Engineering Experiment paper” was, unsurprisingly, the dominant type, though it made up only 65% of all papers. We were very happy to receive 25 survey papers, 31 position papers, and 35 reproduction papers—as well as a solid 106 resource papers and a strong showing of 163 computationally-aided linguistic analysis papers, the second largest contingent.

Some papers were withdrawn or desk rejected before review began in earnest. Between ACs and PC co-chairs, in total, 32 papers were rejected without review. Excluding desk rejects, so far 41 papers have been withdrawn from consideration by the authors.

Allocating papers to areas gave each area a mean and median of 27 papers. The largest area has 31 papers and the smallest 19. We interpret this as indicating that area chairs will not be overloaded, leading to better review quality and interpretation.

Author survey results

Shortly after the submission deadline, we sent out a survey to our authors, with the goal of better understanding how our outreach was working.

Respondents

We sent the notification of the survey via START to all corresponding authors (so roughly 1000 people) and asked them to share it with co-authors. The survey recorded 434 total responses, which is a pretty satisfying response rate!

Of those 434, 302 (69.6%) indicated that they were submitting to COLING for the first time, and 101 (23.3%) to a major NLP conference for the first time.

Outreach

We asked how people first found out about COLING 2018. The most popular response was “Web search” (44.2%), followed by “Call for Papers sent over email (e.g. corpora mailing list, ACL mailing list)” (35.9%), then “Other” (12.4%) and “Social media” (7.4%).  The “Other” answers included word-of-mouth, knowing to expect COLING to come around in 2018, and websites that aggregate CFPs.

Paper types

We wanted to find out if people were aware of the paper types (since this is relatively unusual in our field) before submitting their papers, and if so, how they found out. Most—349 (80.4%)—were aware of the paper types ahead of time.  Of these, the vast majority (93.4%) found out about the paper types via the Call for Papers. Otherwise, people found out because someone else told them (7.4%), via our Twitter or Facebook feeds (6.0%), or via our blog (3.7%).

We also asked whether it was clear to authors which paper type was appropriate for their paper, and whether they think paper types are a good idea. The answers in both cases were pretty strongly positive: 78.8% said it was clear and 91.0% said it was a good idea. (Interestingly, 74 people who said it wasn’t clear which paper type fit their paper nonetheless said paper types were a good idea, and 21 people who said it was clear which type fit said they weren’t a good idea.)
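The cross-tabulation behind that parenthetical is straightforward to reproduce; here is a sketch with assumed, hypothetical column names.

    # Sketch of the cross-tabulation above, assuming a hypothetical survey export
    # with yes/no columns 'type_was_clear' and 'paper_types_good_idea'.
    import pandas as pd

    survey = pd.read_csv("author_survey.csv")  # hypothetical file name
    print(pd.crosstab(survey["type_was_clear"], survey["paper_types_good_idea"]))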

Writing mentoring program

We wanted to know if our authors were aware of the writing mentoring program, and for those who were but didn’t take advantage of it, why not. 277 respondents (63.8%) said they were aware of it. The most common reason chosen for not taking advantage of it was “I didn’t/couldn’t have a draft ready in time.” (150 respondents), followed by “I have good mentoring available to me in my local institution” (97 respondents). The other two options available in that check-all-that-apply question were “I have a lot of practice writing papers already” (74 respondents) and “Other” (10). Alas, a few people indicated that they only discovered it too late.

Other channels

We have been putting significant effort into getting information out about our process, but still worry that the channels we’re using aren’t reaching everyone. We asked “What other channels would you like to see information like this publicized on?” referring specifically to the paper types. Most people did not respond, or indicated that what we’re doing is enough. Other responses included: LINGUIST List, LinkedIn, Instagram, ResearchGate, and email. Ideas for email include creating a conference-specific mailing list that people can subscribe to and sending out messages to all email addresses registered in START.

We include these ideas here for posterity (and the benefit of future people filling this role). We have used LinkedIn and Weibo in a limited capacity and are using Twitter and Facebook. Adding additional social media (Instagram) sounds plausible, but is not in our plans for this year. An email list that people could opt into for updates makes a lot of sense, though there’s still the problem of getting the word out about that list. Perhaps a good way to do that would be to include that info in the CFP (starting from the first CFP). Emailing everyone through START may not be feasible (depending on START’s email privacy policy) and at any rate wouldn’t help reach those who have never submitted to a compling/NLP conference before.

Blog readership

Of course we wanted to know if our authors are reading this blog.  44.7% of respondents weren’t aware of the blog (prior to being asked that question!), 15.0% had found it only recently, 24.9% had been aware of it for at least a month but less than 6, and 15.4% indicated that they’ve been aware of it for at least 6 months. 9.2% of respondents read (almost) everything we post, 32.0% read it sometimes, and the remainder don’t read it or read it only rarely.

We also wanted to know if the PC blog helped our authors to understand our submission process or shape their submissions to COLING 2018. 22.8% indicated “Yes, a lot!” and 28.1% “Yes, a little”. On the no side, 22.1% chose “No, not really” and 27.0% “No, not at all”. “Yes, a lot!” people, we’re doing this for you 🙂

 

Outstanding Mentors

The COLING 2018 writing mentoring program went extremely well—we are grateful to all of the mentors who volunteered their time to provide thoughtful comments to the authors who participated. Furthermore, the prompts we used in the writing mentoring form (listed in the description of the program) were effective in eliciting useful feedback for authors.

There is great willingness in our field to participate from the mentoring side.  Over 100 mentors signed up, which means we could have provided mentoring for even more papers than we did. It seems that the biggest hurdle to success for such a program is getting the word out to those who would most likely benefit from it. (We’ve got another blog post in the works about outreach & responses to our author survey.)

Reviewing the work of the mentors to find those to recognize as outstanding mentors was inspiring—and the task of choosing difficult—because so many did such a great job. Even if the mentored papers aren’t ultimately accepted to COLING, the authors who received mentoring will have benefited from thoughtful, constructive feedback on their work, which we hope will inform both future writings on the same topic and perhaps even their approach to writing on other topics.

Against that background, the following mentors distinguished themselves as particularly outstanding:

  • Kevin Cohen
  • Carla Parra Escartín
  • David Mimno
  • Emily Morgan
  • Irina Temnikova
  • Jennifer Williams

Thank you to all of our mentors!