PC chairs report back: On the effectiveness of author response

The utility of the author response part of the conference review process is hotly debated. At COLING 2018, we decided to have the author response be addressed only to the area chairs (and PC co-chairs), and not the reviewers. The purpose of this blog post is to report back on our experience with this model (largely positive, from the PC perspective!) and also to share with the community what we have learned, inhabiting this role, about what makes an effective author response.

For background, here is a description of the decision-making process at the PC level. Keep in mind that COLING 2018 received 1017 submissions, of which 880 were still ‘active’ at the point of these decisions.  (The difference is a combination of desk rejects and papers withdrawn, the latter mostly in light of acceptance to other venues with earlier notifications.)

Outline of our process

Final accept/reject decisions for COLING 2018 were made as follows:

We asked the ACs for each area to provide a ranking of the papers in their area and to indicate recommendations of accept, maybe accept, maybe reject, or reject. We specifically instructed the ACs not to use the reviewer scores to sort the papers, but rather to come to their own ranking based on their judgment, given the reviews, the discussion among reviewers, the author responses, and (where necessary) reading the papers themselves.

Our role as PCs was to turn those recommendations into decisions. To do so, we first looked at each area’s report and determined which papers had clear recommendations and which were borderline.  For the former, we went with the AC recommendations directly. The borderline cases were either papers that the ACs marked as ‘maybe accept’ or ‘maybe reject’, or, for areas that only used ‘accept’ and ‘reject’, the last two ‘accept’ papers and the first two ‘reject’ papers in the ACs’ ranking. This gave us a bit over 200 papers to consider.

We divided the areas into two sets, one for each of us. (We were careful at this point to put the areas containing papers with which one of us had COIs into the other PC’s stack.) Area by area, we looked at the borderline papers, considering the reviews, the reviewer discussion (if any), the author response, comments from the ACs, and sometimes the papers (to clarify particular points; we didn’t read the papers in full). Although the PC role on START allows us to see the authors of all submissions, we worked out ways to look at all the information we needed to do this without seeing the author names (or institutions, etc).

Of the 200 or so papers we looked at, there were 23 for which we wanted to have further discussion. This was done over Skype, despite the 9-hour time difference! These papers were evenly distributed between Emily’s and Leon’s areas, but clustered towards the start of each of our respective stacks; our analysis is that as we worked our way through the process, we each gained a better sense of how to make the decisions and encountered less uncertainty. (Discussion of COI papers was done with the General Chair, Pierre Isabelle, not the other PC, per our COI policy.)

As a final step to verify data entry (to make sure what was entered in START actually matched our intentions), we went through and looked at both the accepted papers with the lowest reviewer scores and the rejected papers with the highest reviewer scores. 98 papers with an average score of 3 or higher were rejected; 27 papers with an average score lower than 3 were accepted. (Remember, it’s not just about the numbers!) For each of these, we went back to our notes to check that the right information was entered (it was), and in doing so we found that, for the majority of the papers accepted despite low reviewer scores (and correspondingly harsh reviews), our notes reflected effective author responses. This is also consistent with our subjective sense that the author responses really did make a difference in the difficult decisions, that is, for the papers we were looking at.

What makes an effective author response?

The effective author responses all had certain characteristics in common. They were written in a tone that was respectful, calm, and confident (but not arrogant). They gave specific answers to reviewers’ specific questions, or specific replies to reviewers’ criticisms. For example, if a reviewer pointed out that a paper failed to discuss important related work, an effective author response would either acknowledge the omission and indicate that it would be addressed in the final version, or clearly state why the indicated paper isn’t in fact relevant. Effective responses to reviewer questions about unclear points were short, specific, and to the point. This gave us confidence that the answers would be incorporated into the final version. In many cases, authors related the results of experiments they hadn’t had space for, or ran the analyses during the response period; this is much more effective than an ephemeral promise to add the content. Author responses could also be effective in indicating that reviewers had misunderstood key points of the paper or the background into which it fits, but only if they were written in the calm, confident tone mentioned above.

Many effective author responses also expressed gratitude for the reviewers’ feedback. This was nice to see, but it wasn’t a problem when it wasn’t there.

What makes an ineffective author response?

Ineffective author responses, on the other hand, seemed to be written from a place of anger. We understand where authors are coming from when this happens! Reviews, especially negative reviews, can sting. But an author response that comes across as angry, condescending, or combative is not effective at persuading the ACs & PCs that the reviewers have things the wrong way around, nor does it provide good evidence that the paper will be improved for the camera ready version.

Best practices for writing author responses

Here we try to distill our experience of reading the author responses for ~200 papers (not all papers had them, but most did) into some helpful tips.

For conference organizers

We definitely recommend setting up an author response process, but having the author responses go to the ACs (and PCs) only, not the reviewers.  Two ways to improve on what we did:

  • Clarify the word count constraints better than we did. We asked for no more than 400 words total, but the way START enforced that was no more than 400 words per review (since there were separate author response boxes for each review).
  • Don’t make the mistake we made of directing authors who wanted to submit a late author response to contact their ACs … in the very small number of cases where that happened, it compromised the anonymity of the authors to the ACs.

For authors

  • Read the reviews and write the angry version. Then set it aside and write a calmer one.
  • If you can, show your author response to someone who will read it for you and let you know where it sounds angry/arrogant/petty.
  • Try starting with “Thank you for the helpful feedback”—this isn’t necessary, and you can edit it out afterwards for space, but it might help you get off on the right foot regarding tone.
  • Don’t play the reviewers off each other (“R1 says this paper is hard to read, but that’s clearly wrong, because R2 said it was easy to follow.”) Rest assured that the ACs will read all of the reviews; they’ll have seen R2’s comments too.
  • Similarly, don’t feel obliged to reply to everything in the reviews. General negative comments (e.g. “I found this paper hard to read”) don’t require a response and there probably isn’t a response that would be helpful. Either the paper really is unclear or the reviewer doesn’t have sufficient background / didn’t leave enough time to read the paper carefully. Which scenario this is will likely be evident from the rest of the reviews and the author response.
  • Don’t promise the moon and the stars in the final version. It’s hard to accept a borderline paper based on promises alone.
  • Do indicate specific answers to key questions, in a way that is obviously easily incorporated in the final version. (And in that case it’s fine to say “We will add clarification along these lines”, or similar.)
  • Do concisely demonstrate mastery of the area, if reviewers probe issues you have considered during your research and you have the answers to hand.
  • Don’t play games with the word count. We saw two author responses where the authors got around the software’s restriction to 400 words (per box!) by_joining_whole_sentences_with_underscores. This does not make a good impression.

Ultimately, even a calm and confident author response doesn’t necessarily push a paper on the borderline over into accept. Sometimes the paper just isn’t ready and it’s not reasonable to try to fix what needs fixing or add what needs adding for the final version. Nonetheless, we found that the above patterns do make author responses more effective, and so we wanted to share them.

6 thoughts on “PC chairs report back: On the effectiveness of author response”

  1. Thanks for sharing this analysis!
    There’s one point of confusion for me – you’re encouraging authors to conduct quick experiments so they can answer reviewers’ concerns during the response period. So far, the guidelines I’ve seen have explicitly forbidden asking for new results in a response, including for the upcoming EMNLP cycle.
    http://emnlp2018.org/reviewform/
    Do you not share this view? Or is there some nuance I’m missing?

    • I think the nuance might be the difference between entirely different experiments v. numbers that can be quickly produced by the authors’ existing experimental set-up.

  2. Dear Emily and Leon,

    First of all, let me also thank you for all these great insights into the COLING organization process and the many thoughtful decisions you made along the way. As others have said before, some ideas (such as addressing the author response to the area chairs) will hopefully stay with the CL community over time.

    However, I’d also like to say a word of criticism, because I think it should at least be discussed. My criticism refers to the treatment of the reviewers’ scores suggested to the area chairs, as described in the “window into the decision process” (sorry that this comes a bit late):

    – In my view, one main idea of peer-reviewing is that the decision about a submission is shared over multiple people, thereby making it at least a bit more objective. An initial filtering/sorting of papers based on their scores actually supports that this idea is followed.

    – Yes, it’s not only about scores. Yes, of course several papers with medium overall scores should be looked at in more detail. And yes, scores depend on the subjective opinions of the reviewers. But after all, an area chair also makes a subjective decision. Agreed, he or she may even be more of an expert in the whole area – but maybe not in the topic of the paper at hand.

    – I know the area chairs are asked to use the reviews as evidence, but still for me the guidelines you gave on http://coling2018.org/a-window-into-the-decision-process/ sound like you counter the idea of a shared decision, giving the responsibility only to the area chair.

    – Naturally, area chairs can generally ignore the reviewers, but now they are somewhat encouraged to do so. And when I read that 27 papers with an average score lower than 3 made it (so, 4-2-2, 4-3-1, …), I’m happy for the authors, but I have doubts about whether it’s good to overrule the reviewers. Besides, from a reviewer’s perspective, how much should I care about giving reasonable scores then?

    Please note that this is not meant as a complaint, but rather as an attempt to trigger further discussion. It might be that I’m missing something, especially since the process you conducted was based on the experience of others. But I would be glad to hear whether you thought about these things!

    Thanks and best,
    Henning

    • Thank you for your thoughtful comments, Henning!

      It is not the case that we asked the area chairs to ignore the reviewers. Nor is it the case that the area chairs were ‘overruling’ anything. The reviewers make recommendations and record their opinion both in the form of text comments and in the form of numerical scores. The area chairs look at all of that, with the perspective not necessarily of greater expertise but rather of the context of their whole area and make recommendations to the PCs. The ultimate responsibility for decisions rests with the PCs, who again look things over (in our case, just for the borderline papers) with the broader context in view. (And still without the author names in view, it should be stressed!)

      The assumption here is not that the ACs have more relevant expertise than any given reviewer (though they will in some cases and not in others), but that they have more information. They can see: All the reviews for a paper, other reviews that that same reviewer wrote (were they just generally negative? did they tend to give high scores across the board?), the author response, and the same for all of the papers in their area. Furthermore, we didn’t tell the ACs not to look at the scores, but rather not to start with the papers ranked by score.

      I hope this response contributes to the discussion you are aiming to start!
