Call for input: Paper types and associated review forms

In our opening post, we laid out our goals as PC co-chairs for COLING 2018. In this post, we present our approach to the subgoal (of goal #1) of creating a program with many different types of research contributions. As both authors and reviewers, we have been frustrated by the one-size-fits-all review form typical of conferences in our field. When reviewing, how do we answer the ‘technical correctness’ question about a position paper? Or the ‘impact of resources’ question on a paper that doesn’t present any resources?

We believe that a program that includes a wide variety of paper types (as well as a wide variety of paper topics) will be more valuable both for conference attendees and for the field as a whole. We hypothesize that more tailored review forms will lead to fairer treatment of different types of papers, and that fairer treatment will lead to a more varied program. Of course, if we don’t get many papers outside the traditional type (called “NLP engineering experiment paper” below), having tailored review forms won’t do us much good. Therefore, we aim to get the word out early (via this blog post) so that our audience knows what kinds of papers we’re interested in.

Furthermore, we’re interested in what kinds of papers you’re interested in. Below you will find our initial set of five categories, with drafts of the associated review forms. You’ll see some questions are shared across some or all of the paper types, but we’ve elected to lay them out this way (even though it might feel repetitive) so that you can look at each category, putting yourself in both the position of author and of reviewer, and think about what we might be missing/which questions might be inappropriate. Let us know in the comments!

As you answer, keep in mind that our goal with the review forms is to help reviewers structure their reviews in such a way that they are helpful for the area chairs in making final acceptance decisions, informative for the authors (so they understand the decisions that were made), and helpful for the authors (as they improve their work either for camera ready, or for submission to a later venue).

Computationally-aided linguistic analysis

The focus of this paper type is new linguistic insight.

  • Relevance: Is this paper relevant to COLING?
  • Readability/clarity: From the way the paper is written, can you tell what research question was addressed, what was done and why, and how the results relate to the research question?
  • Originality: How original and innovative is the research described? Originality could be in the linguistic question being addressed, in the methodology applied to the linguistic question, or in the combination of the two.
  • Technical correctness/soundness: Is the research described in the paper technically sound and correct? Can one trust the claims of the paper—are they supported by the analysis or experiments and are the results correctly interpreted?
  • Reproducibility: Is there sufficient detail for someone in the same field to reproduce/replicate the results?
  • Generalizability: Does the paper show how the results generalize, either by deepening our understanding of some linguistic system in general or by demonstrating methodology that can be applied to other problems as well?
  • Meaningful comparison: Does the paper clearly place the described work with respect to existing literature? Is it clear both what is novel in the research presented and how it builds on earlier work?
  • Substance: Does this paper have enough substance for a full-length paper, or would it benefit from further development?
  • Overall recommendation: There are many good submissions competing for slots at COLING 2018; how important is it to feature this one? Will people learn a lot by reading this paper or seeing it presented? Please be decisive—it is better to differ from other reviewers than to grade everything in the middle.

NLP engineering experiment paper

This paper type matches the bulk of submissions at recent CL and NLP conferences.

  • Relevance: Is this paper relevant to COLING?
  • Readability/clarity: From the way the paper is written, can you tell what research question was addressed, what was done and why, and how the results relate to the research question?
  • Originality: How original and innovative is the research described? Note that originality could involve a new technique or a new task, or it could lie in the careful analysis of what happens when a known technique is applied to a known task (where the pairing is novel) or in the careful analysis of what happens when a known technique is applied to a known task in a new language.
  • Technical correctness/soundness: Is the research described in the paper technically sound and correct? Can one trust the claims of the paper—are they supported by the analysis or experiments and are the results correctly interpreted?
  • Reproducibility: Is there sufficient detail for someone in the same field to reproduce/replicate the results?
  • Error analysis: Does the paper provide a thoughtful error analysis, which looks for linguistic patterns in the types of errors made by the system(s) evaluated and sheds light on either avenues for future work or the source of the strengths/weaknesses of the systems?
  • Meaningful comparison: Does the paper clearly place the described work with respect to existing literature? Is it clear both what is novel in the research presented and how it builds on earlier work?
  • Substance: Does this paper have enough substance for a full-length paper, or would it benefit from further work?
  • Overall recommendation: There are many good submissions competing for slots at COLING 2018; how important is it to feature this one? Will people learn a lot by reading this paper or seeing it presented? Please be decisive—it is better to differ from other reviewers than to grade everything in the middle.

Reproduction paper

The contribution of a reproduction paper lies in analyses of and in insights into existing methods and problems—plus the added certainty that comes with validating previous results.

  • Relevance: Is this paper relevant to COLING?
  • Readability/clarity: Is the paper well-written and well-structured?
  • Analysis: If the paper was able to replicate the results of the earlier work, does it clearly lay out what needed to be filled in in order to do so? If it wasn’t able to replicate the results of earlier work, does it clearly identify what information was missing/the likely causes?
  • Generalizability: Does the paper go beyond replicating the results on the original to explore whether they can be reproduced in another setting? Alternatively, in cases of non-replicability, does the paper discuss the broader implications of that result?
  • Informativeness: To what extent does the analysis reported in the paper deepen our understanding of the methodology used or the problem approached? Will the information in the paper help practitioners with their choice of technique/resource?
  • Meaningful comparison: In addition to identifying the experimental results being replicated, does the paper motivate why these particular results are an important target for reproduction and what the future implications are of their having been reproduced or been found to be non-reproducible?
  • Overall recommendation: There are many good submissions competing for slots at COLING 2018; how important is it to feature this one? Will people learn a lot by reading this paper or seeing it presented? Please be decisive—it is better to differ from other reviewers than to grade everything in the middle.

Resource paper

Papers in this track present a new language resource. This could be a corpus, but it could also be an annotation standard, a tool, and so on.

  • Relevance: Is this paper relevant to COLING? Will the resource presented likely be of use to our community?
  • Readability/clarity: From the way the paper is written, can you tell how the resource was produced, how the quality of annotations (if any) was evaluated, and why the resource should be of interest?
  • Originality: Does the resource fill a need in the existing collection of accessible resources? Note that originality could be in the choice of language/language variety or genre, in the design of the annotation scheme, in the scale of the resource, or still other parameters.
  • Resource quality: What kind of quality control was carried out? If appropriate, was inter-annotator agreement measured, and if so, with appropriate metrics? Otherwise, what other evaluation was conducted, and how satisfactory were the results?
  • Resource accessibility: Will it be straightforward for researchers to download or otherwise access the resource in order to use it in their own work? To what extent can work based on this resource be shared?
  • Metadata: Do the authors make clear whose language use is captured in the resource and to what populations experimental results based on the resource could be generalized? In the case of annotated resources, are the demographics of the annotators also characterized?
  • Meaningful comparison: Is the new resource situated with respect to existing work in the field, including similar resources it took inspiration from or improves on? Is it clear what is novel about the resource?
  • Overall recommendation: There are many good submissions competing for slots at COLING 2018; how important is it to feature this one? Will people learn a lot by reading this paper or seeing it presented? Please be decisive—it is better to differ from other reviewers than to grade everything in the middle.

Position paper

A position paper presents a challenge to conventional thinking or a futuristic new vision. It could open up a new area or novel technology, propose changes in existing research, or give a new set of ground rules.

  • Relevance: Is this paper relevant to COLING?
  • Readability/clarity: Is it clear what the position is that the paper is arguing for? Are the arguments for it laid out in an understandable way?
  • Soundness: Are the arguments presented in the paper relevant and coherent? Is the vision well-defined, with success criteria? (Note: It should be possible to give a high score here even if you don’t agree with the position taken by the authors)
  • Creativity: How novel or bold is the position taken in the paper? Does it represent well-thought-through and creative new ground?
  • Scope: How much scope for new research is opened up by this paper? What effect could it have on existing areas and questions?
  • Meaningful comparison: Is the paper well-situated with respect to previous work, both position papers (taking the same or opposing side on the same or similar issues) and relevant theoretical or experimental work?
  • Substance: Does the paper have enough substance for a full-length paper? Is the issue sufficiently important? Are the arguments sufficiently thoughtful and varied?
  • Overall recommendation: There are many good submissions competing for slots at COLING 2018; how important is it to feature this one? Please be decisive—it is better to differ from other reviewers than to grade everything in the middle.

 

So, those are the initial set of submission types. These types of paper aren’t limited to single tracks. That is to say, there won’t be a dedicated position paper track, with its own reviewers and chair. You might find a resource paper in any track, for example, and a multi-lingual embeddings track (if one appears—but that’s for a future post) might contain all five kinds of paper mixed together. This makes it even more important that the right questions are asked for a paper type, to help out hard-working reviewers with the task of judging each kind of paper in an appropriate light.

Our questions for you: Is there a type of paper you’d either like to submit to COLING or would like to see at COLING that you think doesn’t fit any of these five already? Should any of the review questions be dropped or refined for any of the paper types? Are there review questions it would be useful to add? Please let us know in the comments!

 

36 thoughts on “Call for input: Paper types and associated review forms”

  1. I miss a category for theoretical papers (new grammar formalism, new parsing algorithm with correctness/complexity analysis, proof that a grammar formalism has more generative power than another, etc.). They are not too common in CL conferences, but they exist, and I think they don’t fit too well into those five review forms. ACL has a review form for them (or at least has had it in many editions, I don’t remember specifically if it was there this year): http://www.acl2010.org/reviewforms.html

  2. Dear Emily and Leon,

    Can we have a track on the temporal dynamics of natural languages? With culturomics in fashion, we now have a lot of time-varying texts from various domains, which have their own interesting properties. This is not mainstream so far, but with the increasing number of papers in this area, analysis of diachronic text is soon going to become very popular, I would guess. Please feel free to get in touch with me if I could be of any help.

    • Dear Animesh,

      We plan to do tracks a little differently from the classic model, and won’t be compiling them from set lists of topics. But this is an interesting area, and if there are enough submissions to warrant it, I’m sure they will find an appropriate “track” together. Thank you for your offer to help!

  3. Greetings Emily and Leon,

    Nice work having this discussion in public and early! I think you’ve done good work on identifying interesting categories for papers. One other possible addition could be survey papers? There was a survey paper at ACL 2018 (Abend et al) on Semantic Representations that I thought was rather useful and memorable. I think survey papers are particularly useful now given our field is growing and there are many newcomers who are (I think) beaten into submission very quickly by a series of hyper-fine-grained papers.

    Cordially,
    Ted

  4. Thanks, Ted, for that idea. That was a good paper — and I was impressed with the crowd that their presentation drew!

  5. I second Carlos’s suggestion that theoretical papers should be a category that is reviewed on its own terms. If you’re looking for a class of papers that’s endangered by the field’s mania for the “NLP engineering experiment paper”, look no further: one of my students recently had a paper rejected on the grounds that it was theoretical, with the dismissive comment: “In my view, the theoretical rigor of this work is ill-suited for a conference presentation in the first place and out-of-place for the EMNLP audience.”

  6. This is a great idea, even if not brand new. It will particularly help the reviewers in the sense of giving them a direction for evaluating papers, where one size doesn’t fit all and there are many directions of evaluation [mixing metaphors].

    I have always felt that the category of position papers was too vague and many reviewers perhaps did not take it seriously.

    I like the name for the ‘traditional’ papers: ‘NLP engineering experiment paper’. That’s what they are indeed and all have to submit to this category’s dominance.

    There is one category missing from the list that you had mentioned in your first post: methodology papers. Can it come under theoretical papers or will it come under linguistic analysis papers? I think there can be methodology papers other than ones on linguistic analysis.

    Also, there are sometimes papers around software architecture or software design for NLP. Ultimately we need good software for NLP and, like building resources, it is a time-consuming job; having a category for this may encourage researchers to also think of development and spend time on it. (Not that they don’t do it now.)

    The risk, of course, is of proliferation of categories.

  7. I agree that the balancing act here is to get sufficient categories that we can accommodate all of the kinds of papers that we want, while not ending up with a confusing or overwhelming number of categories.

    Can you say more about what kind of paper you think would go under ‘methodology’? And what kinds of questions should be on its review form that aren’t on e.g. ‘NLP engineering experiment paper’? (I’m glad you like that term — we struggled to come up with something!)

    I don’t ask out of skepticism. Rather, as we (Leon and I) discussed this while developing the draft categories, we didn’t get to something that seemed sufficiently distinct for ‘methodology paper’ and so decided to go without it in the draft. So, I’m hoping to learn more here!

    Alternatively (and this goes for the ‘theory paper’ category too, if Adam and Carlos would like to say more), can you point me to a particularly good paper in the ‘methodology’ category? How about one of the ‘software design for NLP’ type?

    Thanks again!

  8. For both methodology and software design for NLP categories, I can think right now only of GATE-related papers or theses:

    https://gate.ac.uk/sale/hc-for-nlp-survey/TACL-methodology/paperACL_v11.pdf

    https://gate.ac.uk/sale/thesis/index.html

    NLTK is another example for software design.

    I have a feeling that methodology papers might become more important with the popularity of Deep Learning approaches, which are more opaque compared to traditional machine learning based approaches (from the linguistic or explanatory point of view).

  9. I also second the idea of survey papers. They can help in occasional taking of stock, so to say. Also good for beginners, as Ted said. But will they overlap with reproduction papers?

  10. Emily, some examples of interesting theoretical papers would be:

    Mildly Context-Sensitive Dependency Languages, Kuhlmann and Möhl
    http://aclweb.org/anthology/P07-1021

    Parsing Graphs with Hyperedge Replacement Grammars, Chiang et al.
    http://aclweb.org/anthology/P/P13/P13-1091.pdf

    The first paper characterizes formal properties of (sets of) dependency trees; the second describes and analyzes a parsing algorithm. The papers themselves are purely theoretical (Chiang’s paper mentions an implementation but the paper reports nothing about it other than its existence; there are no experiments), but they are nonetheless relevant to NLP and both inspired further work, including empirical work, that would naturally fall into your other categories. This kind of theoretical work is often necessary when figuring out how to work with new types of representations (as both the above papers do).

  11. Thanks Carlos and Adam for the comments regarding theory papers. This is absolutely something we wanted, and hoped to get it under the first category: “computationally-aided linguistic analysis”. There’s no doubt that these papers are critical to our field, and that review comment from EMNLP seems at best contentious.
    Looking at these examples and again at the description for analysis papers, it’s clear that these theory papers do not fit that category’s initial description. To help develop the idea, I’m curious: by what criteria ought a CL theory paper be judged?

    • Apart from the common criteria for all categories (like relevance, clarity, etc.) the fundamental “specific” criterion for theoretical papers should be soundness/correctness.

      I think the ACL review form for theoretical papers is a good starting point: http://www.acl2010.org/forms/theoretical_long.txt

      In that form, the ACL question for this was “First, is the theoretical approach sound and well-chosen? Second, can one trust the claims of the paper — for example are they supported by an appropriate proof or analysis?”. I think that is a good synthesis.

      Maybe it could be grouped in a common category with computationally-aided linguistic analysis by writing texts that would encompass both, but the problem is that theoretical papers have no reproducibility in the experimental sense (unless you count sitting down and re-doing the proofs as reproducibility) 🙂 and they do not necessarily have generalizability (typically they are language-agnostic per se), so I think they call for a separate category.

  12. Ted, Anil, I wholeheartedly agree about good surveys having a place. These are a great tool.
    On the other hand, one worries about exhausting reviewers (and authors!) by providing too many paper types. Six already seems to be pushing what is reasonable for a human to cope with, so we have to be careful to get this balance right.
    It might help to state explicitly that survey papers are welcome and encouraged at COLING, and make sure ACs know this too, to reduce the impact of “surprised” reviewers (i.e. those who didn’t manage to keep up with this year’s style). After all, we’re all humans with error rates and finite cognitive capacities. We hope to design COLING 2018 around these deficiencies.
    Another option, that reduces dependency on perfect ACing, is to include a checkbox for a “survey” paper type; this allows easy flagging of survey papers so that negative survey reviews can be double-checked.
    Of course, none of these goes so far as to describe what qualities a perfect survey paper might have – perhaps that’s another blog post!

    • Hi Leon,

      Yes, I very much understand the problem of too many categories. I also think survey papers might be relatively rare, since they aren’t always encouraged and so we might not even think to write one for a conference especially. I think the idea of encouraging surveys and perhaps allowing an author to indicate “we intend this to be a survey” is a nice balance, and would at least make it clear such papers are welcome while not chopping up the space of categories too finely. And of course one could have different sorts of surveys too – a survey of methods, or a survey of theories, or even a survey of experimental results, which might well morph into a new category of meta-analysis of experimental papers. 🙂 I think in fact that’s an interesting kind of paper we rarely see – while common in some areas like medicine (where someone takes stock of 100 previous experiments and draws some overall conclusions), it is not something we tend to do as much of, at least not that I have seen. So, I think encouraging surveys (and perhaps meta-analyses) while not creating a new category for them makes a great deal of sense. Thank you for this very interesting discussion.

  13. Could a “methodological” category be easily defined? It seems to me that this would include papers reconsidering existing validation methodologies (which would fit in reproduction) or proposing new ones, in which case the merits should be argued with respect to a linguistic/engineering experimental case study and should fit one of the first two categories, depending on the subject.

  14. I agree with most of the comments here. I think a good balance of paper types could then be better assessed. If paper types are going to be a major part of the program differentiation (as in tracks as well), perhaps it’s useful to have particular *dedicated* reviewers (in addition to general reviewers) assigned to each of the different types. I can imagine I would be confused if I had to review 3-5 submissions and each of them were considered as part of a different paper type.

    Of course this makes things more confusing with respect to the (domain) area of expertise, and the product of both would be very difficult to manage, but I think it might be good to have some people reviewing limited to areas and others reviewing for specific paper types.

    • Thanks, Min-Yen! We will consider having some dedicated reviewers per paper type (though I worry that will add a layer of complexity to the paper assignment process which is already quite complex).

      We are also planning to have very terse emails (rather than the traditional long instructions to reviewers that it seems no one reads), which basically say:

      (1) Read the review forms before you read each paper
      (2) Here’s your link to your assignment

      Emily

  15. Two sub-categories of methodology paper come to mind. One of them is a paper pointing out that a standard evaluation measure (think BLEU) gives unexpected and undesirable results under certain well-defined and important circumstances. That is publishable in its own right if the unexpected results are truly unexpected and the circumstances sufficiently important. Clearly this kind of paper is even better if it ALSO proposes a different or adapted evaluation measure that works better.

    The second type of methodology paper that I imagine is one that is like the http://raaijmakers.edu.fmg.uva.nl/PDFs/Raaijmakers%20et%20al%20MinF%20paper.pdf or the Clark 1973 paper it refers to. This type basically says “we are doing it wrong, and here’s how to do it right”. Papers like this are basically about experimental design.

  16. A genius idea, the combination of the review forms with the paper-type descriptions.

    A suggestion: in the description of the computationally-aided linguistic analysis paper type, add your definition of “linguistic.” I’m guessing that you didn’t do that in the first place because either (1) you think it’s too broad/fuzzy/variable to define concisely (could be true), or (2) you don’t want to limit people’s thinking about what sorts of things they might submit there (admirable). However, I suggest that you take a crack at it anyway, because a number of the suggestions in the Comments section–as well as Emily’s/Leon’s responses to some of those comments–seem like they fit squarely within the linguistic analysis section. At a minimum, perhaps consider adding their suggestions to the topic description as examples (if you agree with me that they fit there)?

  17. tl;dr — Concentrating on review forms is problematic, because the overall quality of a paper is not the sum (or mean) of the individual criteria.

    I like the idea of explicitly calling for different types of papers. I’m less thrilled about having separate review forms, with numerical scores for the individual criteria. Personally, I haven’t found those to be very useful as either a reviewer or an area chair.

    When I review a paper, I typically write my narrative review first; only later do I look at the criteria and try to assign a score for each. It is fairly rare that rating individual criteria will change my overall impression of a paper; more often, I find that my overall evaluation is not a sum of its parts (as defined by the review form): sometimes a paper hits all the points but is overall not very impressive, and conversely, I have encountered papers that fail on one or more criteria but are still very interesting. Perhaps because of my approach to reviewing, I also found the individual scores not so useful as area chair. The overall recommendation and the textual narrative were the most informative. The most frustrating reviews were the ones where the narrative text was very terse.

    I am also a bit troubled by the idea that each paper has to fit under one type. Some contributions cut across types (for instance, an experiment report with theoretical consequences); others might not fit in cleanly under any type, as mentioned above in the comments (for example, a meta-analysis is not really a survey, but also not an engineering experiment). How will the paper types be determined? Will the authors have to choose a review form for their paper? Since each review form is a collection of criteria, this amounts to asking the authors to choose the criteria by which their paper is evaluated. This is a strange burden to place on an author, and it might even influence the way authors write their papers, trying to check off review criteria, rather than concentrating on presenting a coherent message.

    So here’s an alternative suggestion for getting the diversity we want in submission and reviewing. Rather than devise separate review forms and paper categories, invite diversity in the call for papers and author instructions. Give examples of paper types and what kind of contributions they typically make, and instruct authors to explicitly state their contribution in the paper itself. I find the latter point extremely important: when I review a paper I try to judge it relative to the kind of contribution it is trying to make, and it is very frustrating (and not altogether uncommon) to find a paper where it is unclear until deep into the paper whether the key contribution is an experiment report, a theoretical discussion, an application and so forth. Stating the contribution explicitly in the paper is also important for future readers, who have no access to the review form…

    Reviewers would benefit from similar instructions as the author: identify the main contribution of the paper and judge it accordingly (with short examples). As for the review form, if individual criteria are still needed, there can always be a “not applicable” option for criteria that are not applicable to all paper types.

    Finally, a reaction to the idea of dedicated reviewers for different types of papers: personally, I like diversity in the papers that I review. It is fine if the same reviewer gets papers that are not directly comparable — reviewers shouldn’t be comparing papers in their sample anyhow, since each reviewer receives only a tiny sample of the total (could be all great papers, or all bad ones). The main acceptance criterion should be the same for all papers, namely that they make a good contribution that is interesting to the audience. How this criterion is broken down will vary by paper, and I’m happy to review diverse papers where it’s broken down in different ways.

    • Thank you, Ron, for this thoughtful reply! I see that we agree on the goal (more diverse papers) and disagree on the likelihood that the strategy we propose will work well.

      I think it’s a feature, not a bug, if authors look to the review forms for guidance on what makes a good paper (of different types).

      I recognize that there is a risk that some papers will be hard to categorize, but we are leaving it to the authors to choose which forms will be used for their papers.

      I also absolutely agree that terse narrative text reviews are next to useless — the idea here is not for the separate scores to obviate the need for remarks. On the contrary: Ideally, all of the scores should be supported by remarks. Also: We do not expect the overall evaluation to be a simple sum (or other formula) of the scores for other questions. I find that as a reviewer, thinking carefully about the component scores helps me to form my overall opinion.

      Thank you again,
      Emily

      • Hi Emily,

        I thought some more about the issue of paper diversity and review criteria, and I suspect it reflects a more fundamental question (also hinted at in your opening post), namely:

        How is COLING different from ACL/EACL/NAACL/EMNLP?

        Here’s one characterization which I hope is fairly uncontroversial: COLING is less focused on NLP engineering experiment papers (thanks for introducing this term!). As a consequence, reviewers for COLING are more open to papers without a rigorous engineering evaluation. So one way to interpret the different review forms is as a signal that COLING warmly accepts papers which are not experiments (and therefore do not require an experiment evaluation), while maintaining the same evaluation standards as other conferences when such evaluation is appropriate.

        If this is part of the intent, then I think it would be good to state it explicitly and clearly in the call for papers (these can get long and hard to read with all the boilerplate text). The idea would be to match the expectations of authors and reviewers already at the writing and submission stage.

        Regardless, I think it would be interesting to have a discussion of what makes COLING unique and different from ACL and its kin, since there is a large overlap in the communities and the distinction is not always clear. This might be worthy of a separate post.

        -Ron.

        • Ron – The original motivation for COLING was to provide a venue where researchers who couldn’t normally travel to the West could nevertheless stay “au courant” with the latest ideas in NLP. It is not something that ACL itself would undertake. Hence, Don Walker (the secretary/treasurer of ACL for many years) was also a key figure in organizing COLING conferences, but he did so separately from his ACL activities.

          What I see Emily and Leon doing is possibly creating a new way in which COLING meetings are distinct from ACL conferences.

        • Thanks, Ron and Bonnie for these thoughtful replies!

          Ron, I’m glad you like the term “NLP engineering experiment paper” (we thought long and hard to come up with it).

          We definitely want to make COLING distinct from the *CL conferences and see it as serving an important role for our community.

          And yes, the idea is to make sure that authors and reviewers are all fully informed about the expectations for COLING papers—which are in fact a diverse set of expectations for a diverse set of kinds of papers. We expect all of them to be rigorous in their argumentation, but the actual shape of that depends on the type of paper. So indeed, COLING warmly welcomes research contributions for which the kind of quantitative evaluations typical of NLP engineering experiment papers are not appropriate.

          • An important characteristic of COLING compared to other CL conferences has also been to attract papers that challenge our thinking (including those which address important topics that are no longer ‘sexy’ –at one point that was MT!) and ones which are likely to generate a lot of discussion. Indeed, the tradition of including a half-day excursion in the programme is intended precisely to provide opportunities for such discussions to take place in a fun and relaxed setting. Perhaps there’s a way to incorporate this in the review form.

            As others have said, I find — whether as author, co-reviewer, area chair or PC chair — reviews that provide little narrative comment to be particularly unhelpful. I wonder whether placing the request for an overall assessment and the ‘comments for authors’ box at the beginning of the review form, instead of at the usual end, might make a difference?

  18. Rather than a list of bullet-point criteria, each scored with a number (1-5), I’d like to suggest that for whichever of Emily and Leon’s paper categories are meant to support some hypothesis, we as reviewers be encouraged to assess how the hypothesis, methodology, results and discussion either support or refute that hypothesis (or fail to do either). More specifically, I would like to see authors and reviewers of such papers address “hypothesis-related criteria” of paper quality. With respect to review forms, I would like them to ask reviewers to answer these non-numeric questions about each paper they are reviewing:

    – Is it clear what the authors’ hypothesis is? What is it?
    – Is it clear how the authors have tested their hypothesis?
    – Is it clear how the results confirm/refute the hypothesis, or are the results inconclusive?
    – Do the authors explain how the results FOLLOW from their hypothesis (as opposed to, say, other possible confounding factors)?

    Too many of our authors don’t seem to realize that their results have to tie back to their hypothesis in a way that is licensed by their methodology and experimental design.

  19. ACL 2010 (Uppsala) had multiple categories of submissions. Problems occurred when authors submitted under (and hence got reviewed under) the wrong category, since reviewers rightly refused to review a paper in a different way after its authors realized their error.

    Having a single review form (as one does with journals) that relies more on a reviewer’s opinions and analysis expressed in text than on assigning a value between 1 and 5 to each of several criteria (whose explication rarely seems to correlate with what one is actually responding to) might avoid this problem.

    • Yes, we realize there is a risk in having more than one review form, but we are also committed to trying it out, because the status quo seems to focus everything on the NLP engineering experiment paper type. We want some of those, of course, but not only those.

      Our plans for avoiding the problems that came up with ACL 2010 include this blog (and other efforts at clearly publicizing what we have in mind) as well as keeping the range of review forms much smaller, so as not to have such an overwhelming menu.

  20. I think identifying different types of papers beyond the “NLP engineering experiment” sort that is widespread these days is a great idea. I think the theoretical type that was already suggested is necessary too. Beyond identifying types, though, it is essential that the papers of each type are assigned to reviewers who can appreciate that sort of work.

    • Thank you, Barbara. One way to ensure better matching of reviewers to paper types is to make the paper type very visible during the bidding phase, which we will endeavor to do!

  21. There was a similar attempt at ACL circa 2010. It was decided to have multiple types of papers (including resources, surveys, negative results). If I recall correctly, after a year or two, this experiment was abandoned. I have to look for the details. Maybe some of the PC chairs from that time can say more.
