Comments on: PC chairs report back: On the effectiveness of author response

By: Maja Popović

Maja Popović — Thu, 14 Jun 2018 10:55:42 +0000

My point of view:

“The numbers should be part of the decision in my view.”

Definitely.

However, papers with high variance (as already mentioned 4-3-1 or 4-2-2) or papers in the very middle range (e.g. 3-3-3) should be inspected thoroughly by ACs: exact reviewers’ comments, reviewers’ confidence, other reviews by the involved reviewers, and the paper itself.

Simply rejecting a 4-2-2 or a 4-4-1 paper based on its rank by average score is not really a good approach.
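To make the arithmetic concrete, here is a minimal sketch of the kind of flagging described above. It is not how any conference actually triages papers. The function name `needs_close_inspection` and all three thresholds are assumptions chosen only so that the examples from this thread come out as discussed.

```python
from statistics import mean

def needs_close_inspection(scores, spread_threshold=2, mid_low=2.5, mid_high=3.5):
    """Flag a paper for a careful AC read instead of ranking it by average.

    All thresholds are illustrative assumptions, not conference policy.
    """
    spread = max(scores) - min(scores)  # disagreement between reviewers
    avg = mean(scores)                  # what a ranking by average would use
    high_variance = spread >= spread_threshold
    middle_range = mid_low <= avg <= mid_high
    return high_variance or middle_range

for scores in [(4, 3, 1), (4, 2, 2), (4, 4, 1), (3, 3, 3), (4, 4, 4), (2, 2, 2)]:
    print(scores, round(mean(scores), 2), needs_close_inspection(scores))
```

Under these assumed thresholds, 4-3-1, 4-2-2, 4-4-1 and 3-3-3 are all flagged, while clear cases such as 4-4-4 and 2-2-2 are not. Note that 4-3-1 and 4-2-2 have the same average, so a ranking by average alone cannot tell them apart.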

Best,
Maja

By: Henning Wachsmuth

Henning Wachsmuth — Sat, 02 Jun 2018 12:59:03 +0000

I tend to follow your argumentation on 4-3-1, although your interpretation of the scores is slightly (!) more positive than I would see them. Definitely, papers with a score range of four points (1–4) deserve a deeper inspection.

I don’t see how the argument you gave can also be applied to 4-2-2, though. But here, too: yes, when given conflicting scores, an AC should have a closer look.

Again, I don’t want to argue at all that it’s only about scores. Of course, an informed decision is always better than just following three numbers. Rather, my main point is:

The numbers should be part of the decision in my view.

Here is my reason (exaggerated): If we don’t believe in our reviewers, then why should the reviewers put effort into their job? Especially why should they think about reasonable scores? Or the other way round: If we believe in what they write in their reviews, why not in their scores? And if the tendency of the scores is clear (5-4-4, 4-4-4, 3-2-2, 2-2-2, …), why not just follow it then?

Best,
Henning

PS:
A side note, from an author’s perspective: I like the idea of responding only to the AC, partly because it avoids the stressful worry that a reviewer reduces his or her score afterwards. But regarding the scores: Before, I knew that certain scores most likely meant acceptance (say, 4-4-4) or rejection (say, 2-2-2). Even the latter can help when thinking about resubmission. Now, it seems like I need to worry until the final decision.

By: colingauthor

colingauthor — Mon, 28 May 2018 09:28:14 +0000

I would like to add to the discussion, especially with regard to the numbers you presented. Let’s say the paper with an average overall score below 3 that got accepted had 4-3-1. That means one reviewer thought the paper should definitely be accepted, another one couldn’t make up their mind even after rigorous thought, and the last one said the paper is a definite reject. In this case, if the area chair (who is a specialist in the area) decides to accept the paper after due deliberation, I fail to see how this amounts to overruling the peer review feedback. The majority of those who read the paper (including the AC) clearly decided not to reject, and hence, the paper was accepted. Maybe the reviewer who gave a score of 1 did not understand the paper, or did not have enough time to review it properly. In the case of 4-2-2, the same argument can be applied. And the same argument can be applied for rejecting papers with an average score of more than 3 (e.g., 4-3-3).

PS. I am not an author whose paper got accepted with a score of less than 3 🙂

By: Emily M. Bender

Emily M. Bender — Tue, 22 May 2018 22:17:35 +0000

Thank you for your thoughtful comments, Henning!

It is not the case that we asked the area chairs to ignore the reviewers. Nor is it the case that the area chairs were ‘overruling’ anything. The reviewers make recommendations and record their opinion both in the form of text comments and in the form of numerical scores. The area chairs look at all of that, with the perspective not necessarily of greater expertise but rather of the context of their whole area and make recommendations to the PCs. The ultimate responsibility for decisions rests with the PCs, who again look things over (in our case, just for the borderline papers) with the broader context in view. (And still without the author names in view, it should be stressed!)

The assumption here is not that the ACs have more relevant expertise than any given reviewer (though they will in some cases and not in others), but that they have more information. They can see: All the reviews for a paper, other reviews that that same reviewer wrote (were they just generally negative? did they tend to give high scores across the board?), the author response, and the same for all of the papers in their area. Furthermore, we didn’t tell the ACs not to look at the scores, but rather not to start with the papers ranked by score.

I hope this response contributes to the discussion you are aiming to start!

By: Henning Wachsmuth

Henning Wachsmuth — Tue, 22 May 2018 21:43:53 +0000

Dear Emily and Leon,

First of all, let me also thank you for all these great insights into the COLING organization process and the many thoughtful decisions you made within the process. As others have said before, some ideas (such as writing the author response to the area chairs) will hopefully stay with the CL community over time.

However, I’d also like to say a word of criticism, because I think it should at least be discussed. My criticism refers to the treatment of the reviewers’ scores suggested to the area chairs, as described in the “window into the decision process” (sorry that this comes a bit late):

– In my view, one main idea of peer reviewing is that the decision about a submission is shared among multiple people, thereby making it at least a bit more objective. An initial filtering/sorting of papers based on their scores actually helps ensure that this idea is followed.

– Yes, it’s not only about scores. Yes, of course several papers with medium overall scores should be looked at in more detail. And yes, scores depend on the subjective opinions of the reviewers. But after all, an area chair also makes a subjective decision. Agreed, he or she may even have more expertise in the area as a whole, but maybe not in the topic of the paper at hand.

– I know the area chairs are asked to use the reviews as evidence, but still for me the guidelines you gave on http://coling2018.org/a-window-into-the-decision-process/ sound like you counter the idea of a shared decision, giving the responsibility only to the area chair.

– Naturally, area chairs can generally ignore the reviewers, but now they are somewhat encouraged to do so. And when I read that 27 papers with an average score lower than 3 made it (so, 4-2-2, 4-3-1, …), I’m happy for the authors, but I have second thoughts about whether it’s good to overrule the reviewers. Besides, from a reviewer’s perspective, how much should I care about giving reasonable scores then?

Please note that this is not meant as a complaint, but rather to trigger further discussion. It might be that I’m missing something, also seeing that the conducted process was based on the experience of others. But I would be glad to hear whether you thought about these things!

Thanks and best,
Henning

By: Emily M. Bender

Emily M. Bender — Thu, 17 May 2018 19:30:48 +0000

I’m glad to hear it is helpful!

By: Emily M. Bender

Emily M. Bender — Thu, 17 May 2018 19:30:39 +0000

I think the nuance might be the difference between entirely different experiments v. numbers that can be quickly produced by the authors’ existing experimental set-up.

By: Abhirut Gupta

Abhirut Gupta — Thu, 17 May 2018 13:47:09 +0000

Thank you for this wonderful analysis! It is really detailed, and the list of best practices is indeed very helpful to new authors like myself.

By: Yuval Pinter

Yuval Pinter — Thu, 17 May 2018 13:28:36 +0000

Thanks for sharing this analysis!
There’s one point of confusion for me – you’re encouraging authors to conduct quick experiments so they can answer reviewers’ concerns during the response period. So far, guidelines I’ve seen explicitly forbade asking for new results in a response, including the upcoming EMNLP cycle.
http://emnlp2018.org/reviewform/
Do you not share this view? Or is there some nuance I’m missing?