Review statistics

Our review process at COLING has given us plenty to measure so far. Here are a few statistics.

Firstly, it’s interesting to see how many reviewers recommend the authors cite them. We can’t evaluate how appropriate this is, but it happened in 68 out of 2806 reviews (2.4%).

Best paper nominations are quite rare in general, which gives the best paper committee very little signal to work with. To gain more information, in addition to asking whether a paper warranted further recognition, we asked reviewers to say whether a given paper was the best of those they had reviewed. This worked well for 747 reviewers, but 274 reviewers (26.8%) said that no paper in their reviewing allocation was the best.
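For readers who want to check the two percentages quoted so far, here is a minimal Python sketch. The counts come from this post; the variable names and the `percent` helper are illustrative only, and the denominator used for the second figure (747 + 274 reviewers) is an assumption about how the 26.8% was computed.

```python
# Counts quoted in the post.
self_citation_reviews = 68   # reviews recommending the reviewer's own work
total_reviews = 2806
picked_a_best = 747          # reviewers who named a best paper
picked_none = 274            # reviewers who said none was the best


def percent(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


self_citation_pct = percent(self_citation_reviews, total_reviews)

# Assumption: the 26.8% is taken over all reviewers who answered the question.
no_best_pct = percent(picked_none, picked_a_best + picked_none)

print(self_citation_pct, no_best_pct)
```

Under that assumption, both results agree with the figures in the text.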

Mean scores and confidence can be broken down by type, as follows.

Paper type	Mean score	Mean confidence
Computationally-aided linguistic analysis	2.85	3.42
NLP engineering experiment paper	2.86	3.51
Position paper	2.41	3.36
Reproduction paper	2.92	3.54
Resource paper	2.76	3.50
Survey paper	2.93	3.58

We can see that reviewers were least confident with position papers, which also received the lowest mean score, and were both most confident and most pleased with survey papers, though reproduction papers came in a close second with regard to mean score. This fits the general expectation that position papers are hard to evaluate.
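The comparisons above can be read straight off the table. As a small sketch, the following Python snippet stores the table values and recovers the same rankings; the dictionary layout and variable names are illustrative choices, not part of our review tooling.

```python
# (mean score, mean confidence) per paper type, copied from the table above.
table = {
    "Computationally-aided linguistic analysis": (2.85, 3.42),
    "NLP engineering experiment paper": (2.86, 3.51),
    "Position paper": (2.41, 3.36),
    "Reproduction paper": (2.92, 3.54),
    "Resource paper": (2.76, 3.50),
    "Survey paper": (2.93, 3.58),
}

# Paper types ordered from highest to lowest mean score.
by_score = sorted(table, key=lambda t: table[t][0], reverse=True)

# Paper types with the lowest and highest mean reviewer confidence.
least_confident = min(table, key=lambda t: table[t][1])
most_confident = max(table, key=lambda t: table[t][1])

print(by_score[:2])       # top two by mean score
print(least_confident)    # lowest mean confidence
print(most_confident)     # highest mean confidence
```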

The overall distribution of scores follows.
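As clarified in the discussion below, the figure is cumulative: for each mean score on the x-axis, the y-axis gives the number of submissions whose mean overall recommendation is at or below that value. The following Python sketch shows how such a curve can be computed. The review scores and paper names are made up purely for illustration, and `count_at_or_below` is a hypothetical helper, not the code used to produce the figure.

```python
from bisect import bisect_right
from statistics import mean

# Hypothetical overall recommendation scores (1 to 5), three reviews per paper.
reviews = {
    "paper-a": [2, 3, 3],
    "paper-b": [4, 4, 5],
    "paper-c": [1, 2, 2],
    "paper-d": [3, 3, 4],
    "paper-e": [4, 4, 4],
}

# Mean overall recommendation per submission, sorted for cumulative counting.
mean_scores = sorted(mean(scores) for scores in reviews.values())


def count_at_or_below(threshold):
    """Number of submissions whose mean score is <= threshold (inclusive)."""
    return bisect_right(mean_scores, threshold)


# One point of the cumulative curve per x-axis mark.
curve = [(x, count_at_or_below(x)) for x in (1, 2, 3, 4, 5)]
print(curve)
```

Because the count is inclusive, a paper with a mean of exactly 4 is counted at the x-axis mark for 4, and every submission is counted at 5.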

29 thoughts on “Review statistics”

This graph is really hard to read. Can we get a clearer figure?

Emiel van Miltenburg on April 22, 2018 at 7:37 pm said:

It seems WordPress automatically resized the image, making it harder to read. The original image is clearer: http://coling2018.org/wp-content/uploads/2018/04/Untitled-1.png

(Thanks for the post, Leon!)

- Leon Derczynski on April 22, 2018 at 8:28 pm said:
  
  Thanks, Emiel. In fact Emily and I co-author all posts, and often one of us will start a particular post and the other will finish it, but we both appreciate the sentiment!
  
Leon Derczynski on April 22, 2018 at 8:27 pm said:

Ah, WP picked an unfortunate resolution. How about now? It’s not the interactive SVG we’d all hoped for, but I hope it is sufficient for the point.

- Soujanya Poria on April 23, 2018 at 2:50 am said:
  
  What’s the graph all about?
  
  - Leon Derczynski on April 24, 2018 at 10:19 am said:
    
    Cumulative distribution of mean overall recommendation; count on y, mean on x.
    

What’s the graph about? How come more than 900 submissions have a mean score of 5, when there are only 1000 submissions in total?

Arne Köhn on April 23, 2018 at 4:57 am said:

The graph is cumulative. More than 900 submissions have a mean score of 5 *or less*, i.e. all submissions have a score between 1 and 5 🙂

OTOH, you can see that fewer than 100 have a score of 4 or more.

- zhe on April 25, 2018 at 5:39 am said:
  
  Does it include a score of exactly 4?
  
  - Leon Derczynski on April 25, 2018 at 7:02 am said:
    
    If there’s an x-axis mark for it, it’s included.
    
    - Someone on April 28, 2018 at 5:40 pm said:
      
      In this figure, there is an x-axis mark for score 4. In other words, are there about 900 papers with a mean score of less than 4, rather than a mean score of less than or equal to 4?
      
      - Leon Derczynski on May 1, 2018 at 4:06 pm said:
        
        There are about 900 papers with a mean score of 4 or below (less than or equal).
        

Thank you for providing this interesting information. Is the mean score the mean of the overall recommendation scores from the three reviews?

Leon Derczynski on April 24, 2018 at 10:20 am said:

Yes, that’s correct. It’s only a proxy for quality – overall rec. is assessed subjectively for each paper by the ACs later.

Nice statistics. It would be better if you gave the detailed numbers for each mean score.

Thanks!

Leon Derczynski on April 24, 2018 at 10:20 am said:

Detailed how?

- Zakaa on April 24, 2018 at 5:08 pm said:
  
  For example, how many papers got a score of 4 and how many got a score of 3.67. This may give more information than the single graph.
  
  - Leon Derczynski on April 25, 2018 at 7:03 am said:
    
    Oh, no, the data is presented as a graph, not a table. The decision boundary is usually fuzzy with some “hinge loss” and so it’s not always worth going into too much detail here – instead, the general shape is shown.
    

I just want to thank you, Emily and Leon, for all these great initiatives:
– Having this “reproducibility track” is really, really great!
– Anonymity at the area chair level is also nice
– “Minimum author responses” to area chairs looks like the right way to go – let’s see afterwards if it went well or not 😉
Even this blog post (although not so original 😉), published live during rush time, is so nice

At this pace, I hope next year we’ll have mandatory source code attached to every submission; that would be another nice step forward, I think 🙂

Thanks a lot!

Leon Derczynski on April 24, 2018 at 10:27 am said:

Thank you!

We don’t want to lock out those working on unreleasable data (e.g. clinical records), so we can’t really enforce this. What we can do so far, and have done, is to give best paper awards only to papers that have already made their code/resources available, and to run a reproducibility track.

- nlpc on April 24, 2018 at 12:44 pm said:
  
  So, just to clarify, have decisions with regard to the best papers for this year’s conference already been made?
  
  - Leon Derczynski on April 25, 2018 at 7:04 am said:
    
    The acceptance decision process has not started, and best paper selections will be made about two months from now.
    
Emily M. Bender on April 25, 2018 at 9:52 am said:

Thank you! It’s very nice to know that these initiatives are appreciated.

What is the mean score? Is it the average over the overall recommendation scores?

Leon Derczynski on April 24, 2018 at 10:20 am said:

Yes, that’s right.

- tuzi on April 24, 2018 at 3:43 pm said:
  
  Does it include confidence or not?
  
  - Leon Derczynski on April 25, 2018 at 7:04 am said:
    
    There is only one variable reported, score.
    

We don’t see reviewer confidence scores in the reviews. Is that intentional?

Leon Derczynski on May 1, 2018 at 4:04 pm said:

Yes, that’s intentional.

COLING 2018

August 20-26, 2018, Santa Fe, New Mexico, USA
