COLING 2018 Submissions Overview

We’ve had a successful COLING so far, with over a thousand papers submitted, covering a variety of areas. In total, 1017 papers were submitted to the main conference, all full-length.

Each submitted paper had a type assigned by its authors, which affects how it is reviewed. The types were developed from our earlier blog post on paper types. The “NLP Engineering Experiment paper” was unsurprisingly the dominant type, though it made up only 65% of all papers. We were very happy to receive 25 survey papers, 31 position papers, and 35 reproduction papers, as well as a solid 106 resource papers and a strong showing of 163 computationally-aided linguistic analysis papers, the second largest contingent.
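As a quick sanity check on these figures, the following minimal sketch recomputes the share of the dominant type. It assumes that every paper not counted under the five listed types is an NLP Engineering Experiment paper; the post does not state that count directly, so the derived number is an assumption.

```python
# Counts by paper type as reported in the post. The engineering count is NOT
# given in the post; it is derived below by subtraction (an assumption).
TOTAL_SUBMISSIONS = 1017
reported_counts = {
    "survey": 25,
    "position": 31,
    "reproduction": 35,
    "resource": 106,
    "computationally-aided linguistic analysis": 163,
}


def remaining_share(total, other_counts):
    """Return (count, percentage) of papers not covered by the listed types."""
    remaining = total - sum(other_counts.values())
    return remaining, 100.0 * remaining / total


count, share = remaining_share(TOTAL_SUBMISSIONS, reported_counts)
print(f"{count} papers, {share:.1f}% of submissions")
```

Under that assumption the remainder is 657 papers, or about 64.6%, which agrees with the rounded 65% quoted above.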

Some papers were withdrawn or desk rejected before review began in earnest. In total, ACs and PC co-chairs rejected 32 papers without review. Excluding desk rejects, 41 papers have so far been withdrawn from consideration by their authors.

Allocating papers to areas gave each area a mean and median of 27 papers. The largest area has 31 papers and the smallest 19. We interpret this as indicating that area chairs will not be overloaded, leading to better review quality and interpretation.

Who gets to author a paper? A note on the Vancouver recommendations

At COLING 2018, we require submitted work to follow the Vancouver Convention on authorship – i.e. who gets to be an author on a paper. This guest post by Željko Agić of ITU Copenhagen introduces the topic.

One of the basic principles of publishing scientific research is that research papers are authored and signed by researchers.

Recently, the tenet of authorship has sparked some very interesting discussions in our community. In light of the increased use of preprint servers, we have been questioning the *ACL conference publication workflows. These mostly had to do with the peer review biases, but also with authorship: Should we enable blind preprint publications?

The notion of unattributed publications mostly does not sit well with researchers. We do not even know how to cite such papers, while we can invoke entire research programs in our paper narratives through a single last name.

Authorship is of crucial importance in research, and not just in writing up our related work sections. This goes without saying to all us fellow researchers. While in everyday language an author is simply a writer or an instigator of a piece of work, the question is slightly more nuanced in publishing scientific work:

  • What activities qualify one for paper authorship?
  • If there are multiple contributors, how should they be ordered?
  • Who decides on the list of paper authors?

These questions have sparked many controversies over the centuries of scientific research. An F. D. C. Willard, short for Felis Domesticus Chester, has authored a physics paper, similar to Galadriel Mirkwood, a Tolkien-loving Afghan hound versed in medical research. Others have built on the shoulders of giants such as Mickey Mouse and his prolific group.

Yet, authorship is no laughing matter: It can make and break research careers, and its (un)fair treatment can make a difference between a wonderful research group and an uneasy one at the least. A fair and transparent approach to authorship is of particular importance to early-stage researchers. There, the tall tales of PhD students might include the following conjectures:

  • The PIs in medical research just sign all the papers their students author.
  • In algorithms research the author ordering is always alphabetical.
  • Conference papers do not make explicit the individual author contributions.
  • The first and the last author matter the most.

The curiosities and the conjectures listed above all stem from the fact that there seems to be no awareness of any standard rulebook to play by in publishing research. This in turn gives rise to the many different traditions in different fields.

Yet, there is a rulebook!

One prominent attempt to put forth a set of guidelines for determining authorship is the Vancouver Group recommendations. The Vancouver Group is the International Committee of Medical Journal Editors (ICMJE), which in 1985 introduced a set of criteria for authorship. The criteria have seen many updates over the years, to match the latest developments in research and publishing. Their scope far surpasses the topic of authorship and spans the whole scientific publication process: reviewing, editorial work, publishing, copyright, and the like.

While the recommendations do stem from the medical field, they are nowadays broadened and thus widely adopted. The following is an excerpt from the recommendations in relation to authorship criteria.

The ICMJE recommends that authorship be based on the following 4 criteria:

1. Substantial contributions to the conception or design of the work; or the acquisition, analysis, or interpretation of data for the work; AND

2. Drafting the work or revising it critically for important intellectual content; AND

3. Final approval of the version to be published; AND

4. Agreement to be accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved.

(…)

All those designated as authors should meet all four criteria for authorship, and all who meet the four criteria should be identified as authors. Those who do not meet all four criteria should be acknowledged.

(…)

These authorship criteria are intended to reserve the status of authorship for those who deserve credit and can take responsibility for the work. The criteria are not intended for use as a means to disqualify colleagues from authorship who otherwise meet authorship criteria by denying them the opportunity to meet criterion #s 2 or 3.

Note that an AND operator ties the four criteria together, but there are some ORs within the individual entries. Thus, in essence, to adhere to the Vancouver recommendations for authorship, one has to meet all four requirements, while each of the four may be met minimally.

To take one example:

If you substantially contributed to 1) data analysis, and to 2) revising the paper draft, and then you subsequently 3) approved of the final version and 4) agreed to be held accountable for all the work, then congrats! You have met the authorship criteria!

One could take other routes through the four criteria, some arguably easier, some even harder.
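The AND/OR structure of the four criteria can be written out as a small boolean check. This is only an illustrative sketch: the class and field names are invented here, and each flag is assumed to already mean a substantial contribution in the ICMJE sense, which is a judgment no code can make.

```python
from dataclasses import dataclass


@dataclass
class Contribution:
    """Hypothetical record of one person's involvement in a paper."""
    # Criterion 1: any ONE of these suffices (assumed to be substantial).
    conception_or_design: bool = False
    data_acquisition: bool = False
    data_analysis: bool = False
    data_interpretation: bool = False
    # Criterion 2: drafting OR critical revision.
    drafted: bool = False
    revised_critically: bool = False
    # Criteria 3 and 4: no alternatives.
    approved_final_version: bool = False
    agreed_accountable: bool = False


def meets_authorship_criteria(c: Contribution) -> bool:
    """ORs inside criteria 1 and 2, AND across all four criteria."""
    criterion_1 = (c.conception_or_design or c.data_acquisition
                   or c.data_analysis or c.data_interpretation)
    criterion_2 = c.drafted or c.revised_critically
    return (criterion_1 and criterion_2
            and c.approved_final_version and c.agreed_accountable)


# The route taken in the example above: analysis, revision, approval, accountability.
example = Contribution(data_analysis=True, revised_critically=True,
                       approved_final_version=True, agreed_accountable=True)
print(meets_authorship_criteria(example))
```

Dropping any one of the four, for instance final approval, makes the check fail, which is the point of the AND between the criteria.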

In my own view, we as a field should hope for the Vancouver recommendations to have already been adopted in NLP research, if only implicitly through the way our research groups and collaborations work.

Yet, are they? What are your thoughts? In your view, are the Vancouver recommendations well-matched with the COLING 2018 paper types? In general, are there aspects of your work in NLP that are left uncovered by the authorship criteria? Might there be at least some controversy and discussion potential to this matchup? 🙂

Metadata and COLING submissions

As the deadline for submission draws near, we’d like to alert our authors to a few things that are a bit different from previous COLINGs and other computational linguistics/NLP venues in the hopes that this will help the submission process go smoothly.

Paper types

Please consider the paper type you indicate carefully, as this will affect what the reviewers are instructed to look for in your paper. We encourage you to read the description of the paper types and especially the associated reviewer questions carefully. Which set of questions would you most like to have asked of your paper? (And if reading the questions inspires you to reframe/edit a bit to better address them before submitting, that is absolutely fair game!)

Emiel van Miltenburg raised the point on Twitter last week that it can be difficult to categorize papers and in particular that certain papers might fall between our paper types, combining characteristics of more than one, or being something else entirely.

Emiel and colleagues wondered whether we could implement a “tagging” system where authors could indicate the range of paper types their paper relates to. That is an intriguing idea, but it doesn’t work with the way we are using paper types to improve the diversity and range of papers at COLING. As noted above, the paper types entail different questions on the review forms. We’re doing that because otherwise it seems that everything gets evaluated against the NLP Engineering Experiment paper type, which in turn means it’s hard to get papers of the other types accepted. And as we hope we’ve made blindingly clear, we really are interested in getting a broad range of paper types!

Keywords

The other aspect of our submission form that will have a strong impact on how your paper is reviewed is the keywords. Following the system pioneered by Ani Nenkova and Owen Rambow as PC co-chairs for NAACL 2016, we have asked our reviewers to all describe their areas of expertise along five dimensions:

  1. Linguistic targets of study
  2. Application tasks
  3. Approaches
  4. Languages
  5. Genres

(All five of these have a none of the above/not-applicable option.) The reviewers (and area chairs) all indicated all of the items on each of these dimensions they have the expertise and interest to review for. For authors, we ask you to indicate which items on each dimension best describe the paper you are submitting. Softconf will then match your paper to an area based on the assignment of papers to areas that best optimizes reviewer expertise for the papers submitted.
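To give a feel for how keywords on the five dimensions can drive matching, here is a minimal sketch that scores one paper against candidate areas by keyword overlap. It is not the Softconf algorithm, which optimizes the assignment over all papers at once; the dimension keys, the Jaccard scoring, and the example areas are all assumptions made for illustration.

```python
# Hypothetical short keys for the five keyword dimensions listed above.
DIMENSIONS = ("targets", "tasks", "approaches", "languages", "genres")


def jaccard(a, b):
    """Overlap of two keyword sets: |a & b| / |a | b|, or 0.0 if both are empty."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def match_score(paper, area):
    """Average keyword overlap across the five dimensions."""
    total = sum(jaccard(paper.get(d, ()), area.get(d, ())) for d in DIMENSIONS)
    return total / len(DIMENSIONS)


def best_area(paper, areas):
    """Name of the area whose keyword profile overlaps most with the paper."""
    return max(areas, key=lambda name: match_score(paper, areas[name]))


paper = {"targets": {"syntax"}, "tasks": {"parsing"},
         "approaches": {"neural networks"}, "languages": {"Czech"},
         "genres": {"news"}}
areas = {
    "area_a": {"targets": {"syntax", "morphology"}, "tasks": {"parsing"},
               "approaches": {"neural networks"},
               "languages": {"Czech", "English"}, "genres": {"news"}},
    "area_b": {"targets": {"semantics"}, "tasks": {"question answering"},
               "approaches": {"neural networks"}, "languages": {"English"},
               "genres": {"social media"}},
}
print(best_area(paper, areas))
```

Even in this toy version, a paper with vague or missing keywords scores low against every area, which is why filling them out carefully matters.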

In sum: To ensure the most informed reviewing possible of your paper, please fill out these keywords carefully.  We urge you to start your submission in the system ahead of time so you aren’t trying to complete this task in a hurry just at the deadline.

Dual submission policy

Our Call for Papers indicates the following dual submission policy:

Papers that have been or will be under consideration for other venues at the same time must indicate this at submission time. If a paper is accepted for publication at COLING, it must be immediately withdrawn from other venues. If a paper under review at COLING is accepted elsewhere and authors intend to proceed there, the COLING committee must be notified immediately.

We have added a field in the submission form for you to be able to indicate this information.

LRE Map

COLING 2018 is participating in the LRE Map, as described in this guest post by Nicoletta Calzolari. In the submission form, you are asked to provide information about the language resources your research has used and those it has produced. Do not worry about anonymity on this form. This information is not shared with reviewers.

Author responsibilities and the COLING 2018 desk reject policy

As our field experiences an upswing in participation, we have more submissions to our conferences, and this means we have to be careful to keep the reviewing process as efficient as possible. One tool used by editors and chairs is the “desk reject”. This is a way to filter out papers that clearly shouldn’t get through for whatever reason, without asking area chairs and reviewers to handle them, leaving our volunteers to use their energy on the important process of dealing with your serious work.

A desk reject is an automatic rejection without further review. This saves time, but is also quite a strong reaction to a submission. For that reason, this post clarifies possible reasons for a desk reject and the stages at which this might occur. It is the responsibility of the authors to make sure to avoid these situations.

Reasons for desk rejects:

  • Page length violations. The content limit at COLING is 9 pages. (You may include as many pages as needed for references.) Appendices, if part of the main paper, must fit within those nine pages. It’s unfair to judge longer papers against those that have kept to the limit, so exceeding the page limit means a desk reject.
  • Template cheating. The LaTeX and Word templates give a level playing field for everyone. Squeezing out whitespace, adjusting margins, and changing the font size all stop that playing field from being even and give an unfair advantage. If you’re not using the official template, you’ve altered that template, or the way a manuscript uses it goes beyond our intent, then the paper may be desk rejected.
  • Missing or poor anonymisation. It’s well-established that non-anonymised papers from “big name” authors and institutions fare better during review. To avoid this effect, and others, COLING is running double-blind; see our post on the nuances of double-blinding. We do not endeavour to be arbiters of what does or does not constitute a “big name”—rather, any paper that is poorly anonymised (or not anonymised at all) will face desk reject. See below for a few more comments on anonymisation.
  • Inappropriate content. We want to give our reviewers and chairs research papers to review. Content that really does not fit this will be desk rejected.
  • Plagiarism. Work that has already appeared, has already been accepted for publication at another venue, or has any significant overlap with other works submitted to COLING will be desk rejected. Several major NLP conferences are actively collaborating on this.
  • Breaking the arXiv embargo. COLING follows the ACL pre-print policy. This means that only papers not published on pre-print services or published on pre-print services more than a month before the deadline (i.e. before February 16, 2018) will be considered. Pre-prints published after this date (non-anonymously) may not be submitted for review at COLING. In conjunction with other NLP conferences this year, we’ll be looking for instances of this and desk rejecting them.
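Two of the reasons above are mechanical enough to check before you submit. The sketch below covers only the page limit and the pre-print cutoff; the function name and arguments are hypothetical, and treating a pre-print posted exactly on February 16 as too late is an assumption, since the policy only says "before" and "after" that date.

```python
from datetime import date

CONTENT_PAGE_LIMIT = 9                 # content pages, references excluded
PREPRINT_CUTOFF = date(2018, 2, 16)    # pre-prints must predate this day


def desk_reject_reasons(content_pages, preprint_date=None,
                        preprint_anonymous=False):
    """Return the mechanical desk-reject reasons that apply (possibly none)."""
    reasons = []
    if content_pages > CONTENT_PAGE_LIMIT:
        reasons.append("page length violation")
    # Only non-anonymous pre-prints posted on or after the cutoff count
    # (the "on" part is an assumption about the boundary).
    if (preprint_date is not None and not preprint_anonymous
            and preprint_date >= PREPRINT_CUTOFF):
        reasons.append("arXiv embargo broken")
    return reasons


print(desk_reject_reasons(9, preprint_date=date(2018, 1, 10)))
print(desk_reject_reasons(10, preprint_date=date(2018, 3, 1)))
```

The other reasons (template changes, anonymisation, content, plagiarism) need human judgment and are not captured here.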

The desk rejects are determined at four separate points. In order,

  1. Automatic rejection by the START submission system, which has a few checks at various levels.
  2. A rejection by the PC co-chairs, before papers are allocated to areas.
  3. After papers are placed in areas, ACs have the opportunity to check for problems. One response is to desk reject.
  4. Finally, during and immediately after allocation of papers to reviewers, an individual reviewer may send a message to invoke desk rejection, which will be queried and checked by at least two people from the ACs or PC co-chairs.

As an honest researcher trying to publish your important and exciting work, the above probably do not apply to you. But if they do, please think twice. We would prefer to send out no desk rejects and imagine it would be much more pleasant for our authors if none were to receive a desk reject. So, now you know what to avoid!

Postscript on anonymisation

Papers must be anonymised. This protects everybody during review. It’s a complex issue to implement, which is why we earlier had a post dedicated to double blindness in peer review. There are strict anonymisation guidelines in the call for papers and the only way to be sure that nobody takes exception during the review process is to follow these guidelines.

We’ve received several questions on what the best practices for anonymisation are. We realize that in long-standing projects, it can be impossible to truly disguise the group that work comes from. Nonetheless, we expect all COLING authors to follow these forms of anonymisation:

  1. Do NOT include author names/affiliations in the version of the paper submitted for review.  Instead, the author block should say “Anonymous”.
  2. When making reference to your own published work, cite it as if written by someone else: “Following Lee (2007), …” “Using the evaluation metric proposed by Garcia (2016), …”
  3. The only time it’s okay to use “anonymous” in a citation is when you are referring to your own unpublished work: “The details of the construction of the data are described in our companion paper (anonymous, under review).”
  4. Expanded versions of earlier workshop papers should rework the prose sufficiently so as not to turn up as potential plagiarism examples. The final published version of such papers should acknowledge the earlier workshop paper, but that should be suppressed in the version submitted for review.
  5. More generally, the acknowledgments section should be left out of the version submitted for review.
  6. Papers making code available for reproducibility or resources available for community use should host a version of that at a URL that doesn’t reveal the authors’ identity or institution.

We have been asked a few times about whether LRE Map entries can be done without de-anonymising submissions.  The LRE Map data will not be shared with reviewers, so this is not a concern.

Keeping resources anonymised is a little harder. We recommend you keep things like names of people and labs out of your code and files; for example, Java code uploaded that ran within an edu.uchicago.nlp namespace would be problematic. Similarly, if the URL given is within a personal namespace, this breaks double-blindness, and must be avoided. Google Drive, Dropbox and Amazon S3 – as well as many other file-sharing services – offer reasonably anonymous (and often free) file sharing URLs, and we recommend you use those if you can’t upload your data/code/resources into START as supplementary materials.
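One way to catch leftovers like the namespace mentioned above is to scan your files for identifying strings before uploading. The following is a minimal sketch with a hypothetical function name; the patterns you pass in (lab names, usernames, domains) are yours to choose, and a clean scan does not guarantee that nothing identifying remains.

```python
import re
from pathlib import Path


def find_identifying_strings(root, patterns):
    """Return (path, line number, pattern) for every line matching a pattern.

    `patterns` are regular expressions you consider identifying,
    matched case-insensitively against every file under `root`.
    """
    compiled = [re.compile(p, re.IGNORECASE) for p in patterns]
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        try:
            # Undecodable bytes are skipped so binary files do not abort the scan.
            text = path.read_text(encoding="utf-8", errors="ignore")
        except OSError:
            continue
        for line_no, line in enumerate(text.splitlines(), start=1):
            for pattern in compiled:
                if pattern.search(line):
                    hits.append((str(path), line_no, pattern.pattern))
    return hits
```

For example, `find_identifying_strings("release/", [r"edu\.uchicago", r"my_lab_name"])` would list every file and line that still mentions either string, where `release/` and `my_lab_name` stand in for your own directory and lab.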

LRE Map: What? Why? When? Who?

This guest post is by Nicoletta Calzolari.

Not-documented Language Resources (LRs) don’t exist!

The LRE Map of Language Resources (data and tools) (http://lremap.elra.info) is an innovative instrument introduced at LREC2010 with the aim of monitoring the wealth of data and technologies developed and used in our field. Why “Map”? Because we aimed at representing the relevant features of a large territory, also for the aspects not represented in the official catalogues of the major players of the field. But we had other purposes too: we wanted to draw attention to the importance of the LRs that are behind many of our papers and to map also the “use” of LRs, to understand the purposes of the developed LRs.

Its collaborative, bottom-up creation was critical: we conceived the Map as a means to influence a “change of culture” in our community, whereby everyone is asked to make a minimal effort to document the LRs that are used or created, and so comes to understand the need for proper documentation. By spreading the LR documentation effort across many people instead of leaving it only in the hands of the distribution centres, we also encourage awareness of the importance of metadata and proper documentation. Documenting a resource is the first step towards making it identifiable, which in turn is the first step towards reproducibility.

We kept the requested information at a simple level, knowing that we had to compromise between richness of metadata and willingness of authors to fill them in.

With all these purposes in mind we thought we could exploit the great opportunity offered by LREC and the involvement of so many authors from so many countries, from different modalities and working in so many areas of NLP. Afterwards the Map was used also in the framework of other major Conferences, in particular by COLING, and this provides another opportunity for useful comparisons.

The number of LRs currently described in the Map is 7453 (instances), collected from 17 different conferences. The major conferences for which we have data on a regular basis are LREC and COLING.

With initiatives such as the LRE Map and “Share your LRs” (introduced in 2014) we want to encourage in the field of LT and LRs what is already in use in more mature disciplines, i.e. ensure proper documentation and reproducibility as a normal practice. We think that research is strongly affected also by such infrastructural (meta-research) activities and therefore we continue to promote – also through such initiatives – a greater visibility of LRs, the sharing of LRs in an easier way and the reproducibility of research results.

Here is the vision: it must become common practice also in our field that when you submit a paper either to a conference or a journal you are offered the opportunity to document and upload the LRs related to your research. This is even more important in a data-intensive discipline like NLP. The small cost that each of us pays to document, share, etc. should be repaid by benefiting from others’ efforts.

What do we ask of colleagues submitting to COLING 2018? Please document all the LRs mentioned in your paper!

SemEval: Striving for Reproducibility in Research – Guest post

Being able to reproduce experiments and results is important to advancing our knowledge, but it’s not something we’ve always been able to do well. In a series of guest posts, we have invited perspectives and advice on reproducibility in NLP.

by Saif M. Mohammad, National Research Council Canada.

A shared task invites participation in a competition where system predictions are examined and ranked by an independent party on a common evaluation framework (common new training and test sets, common evaluation metrics, etc.). The International Workshop on Semantic Evaluation (SemEval) is a popular shared task platform for computational semantic analysis. (See SemEval-2017; participate in SemEval-2018!) Every year, the workshop selects a dozen or so tasks (from a competitive pool of proposals) and co-ordinates their organization: setting up task websites, releasing training and test sets, conducting evaluations, and publishing proceedings. It draws hundreds of participants, and publishes over a thousand pages of proceedings. It’s awesome!

Embedded in SemEval, but perhaps less obvious, is a drive for reproducibility in research: obtaining the same results again, using the same method. Why does reproducibility matter? Reproducibility is a foundational tenet of the scientific method. There is no truth other than reproducibility. If repeated data annotations provide wildly diverging labels, then that data is not capturing anything meaningful. If no one else is able to replicate one’s algorithm and results, then that original work is called into question. (See Most Scientists Can’t Replicate Studies by their Peers and also this wonderful article by Ted Pedersen, Empiricism Is Not a Matter of Faith.)

I have been involved with SemEval in many roles: from a follower of the work, to a participant, a task organizer, and co-chair. In this post, I share my thoughts on some of the key ways in which SemEval encourages reproducibility, and how many of these initiatives can easily be carried over to your research (whether or not it is part of a shared task).

SemEval has two core components:

Tasks: SemEval chooses a mix of repeat tasks (tasks that were run in prior years), new-to-SemEval tasks (tasks studied separately by different research groups, but not part of SemEval yet), and some completely new tasks. The completely new tasks are exciting and allow the community to make quick progress. The new-to-SemEval tasks allow for the comparison and use of disparate past work (ideas, algorithms, and linguistic resources) on a common new test set. The repeat tasks allow participants to build on past submissions and help track progress over the years. By drawing the attention of the community to a set of tasks, SemEval has a way of cleaning house. Literature is scoured, dusted, and re-examined to identify what generalizes well: which ideas and resources are truly helpful.

Bragging rights apart, a common motivation to participate in SemEval is to test whether a particular hypothesis is true or not. Irrespective of what rank a system attains, participants are encouraged to report results on multiple baselines, benchmarks, and comparison submissions.

Data and Resources: The common new (previously unseen) test set is a crucial component of SemEval. It minimizes the risk of highly optimistic results from (over)training on a familiar dataset. Participants usually have only two or three weeks from when they get access to the test set to when they have to provide system submissions. Task organizers often provide links to code and other resources that participants can use, including baseline systems and the winning systems from the past years. Participants can thus build on these resources.

SemEval makes a concerted effort to keep the data and the evaluation framework for the shared tasks available through the task websites even after the official competition. Thus, people with new approaches can continue to compare results with those of earlier participants, even years later. The official proceedings record the work done by the task organizers and participants.

Task Websites: For each task, the organizers set up a website providing details of the task definition, data, annotation questionnaires, links to relevant resources, and references. Since 2017, the tasks have been run on shared task platforms such as CodaLab. These platforms include special features such as phases and leaderboards. Phases often correspond to a pre-evaluation period (when systems have access to the training data but not the test data), the official evaluation period (when the test data is released and official system submissions are to be made), and a post-evaluation period. The leaderboard is a convenient way to record system results. Once the organizers set up the task website with the evaluation script, the system automatically computes results for every new submission and posts them to the leaderboard. There is a separate leaderboard for each phase. Thus, even after the official competition has concluded, one can upload submissions, and the auto-computed results are posted on the leaderboard. Anyone interested in a task can view all of the results in one place.
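The loop of evaluation script plus per-phase leaderboard can be pictured with a small sketch. This is not CodaLab's interface or any task's official scorer: the metric (plain accuracy), the class, and the phase and team names are all hypothetical stand-ins.

```python
from collections import defaultdict


def accuracy(gold, predicted):
    """Fraction of predictions that match the gold labels."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("empty test set")
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


class Leaderboard:
    """Keeps a separate list of scored submissions for each phase."""

    def __init__(self):
        self._entries = defaultdict(list)  # phase -> [(team, score), ...]

    def submit(self, phase, team, gold, predicted):
        """Score a submission with the evaluation script and record it."""
        score = accuracy(gold, predicted)
        self._entries[phase].append((team, score))
        return score

    def ranking(self, phase):
        """Submissions for one phase, best score first."""
        return sorted(self._entries[phase], key=lambda e: e[1], reverse=True)


gold = ["pos", "neg", "neg", "pos"]
board = Leaderboard()
board.submit("evaluation", "team_a", gold, ["pos", "neg", "pos", "pos"])
board.submit("evaluation", "team_b", gold, ["pos", "neg", "neg", "pos"])
board.submit("post-evaluation", "team_c", gold, ["neg", "neg", "neg", "pos"])
print(board.ranking("evaluation"))
```

Keeping the phases apart in the data structure mirrors the separate leaderboards described above, so post-evaluation scores never mix with the official ones.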

SemEval also encourages participants to make system submissions freely available and to make system code available where possible.

Proceedings: For each task, the organizers write a task-description paper that describes their task, data, evaluation, results, and a summary of participating systems. Participants write a system-description paper describing their system and submissions. Special emphasis is placed on replicability in the instructions to authors and in the reviewing process. For the task paper: “present all details that will allow someone else to replicate the data creation process and evaluation.” For the system paper: “present all details that will allow someone else to replicate your system.” All papers are accepted except for system papers that fail to provide clear and adequate details of their submission. Thus SemEval is also a great place to record negative results: ideas that seemed promising but did not work out.

Bonus article: Why it’s time to publish research “failures”

All of the above make SemEval a great sandbox for working on compelling tasks, reproducing and refining ideas from prior research, and developing new ones that are accessible to all. Nonetheless, shared tasks can entail certain less-desirable outcomes that are worth noting and avoiding:

  • Focus on rankings: While the drive to have the top-ranked submission can be productive, it is not everything. More important is the analysis to help improve our collective understanding of the task. Thus, irrespective of one’s rank, it is useful to test different hypotheses and report negative results. 
  • Comparing post-competition results with official competition results: A crucial benefit of participating under the rigor of a shared task is that one does not have access to the reference/gold labels of the test data until the competition has concluded. This is a benefit because having open access to the reference labels can lead to unfair and unconscious optimisation on the test set. Every time one sees the result of their system on a test set and tries something different, it is a step towards optimising on the test set. However, once the competition has concluded the gold labels are released so that the task organizers are not the only gatekeepers for analysis. Thus, even though post-competition work on the task–data combination is very much encouraged, the comparisons of those results with the official competition results have to pass a higher bar of examination and skepticism.

There are other pitfalls worth noting too—feel free to share your thoughts in the comments.

“That’s great!” you say, “but we are not always involved in shared tasks…”

How do I encourage reproducibility of *my* research?

Here are some pointers to get started:

  • In your paper, describe all that is needed for someone else to reproduce the work. Make use of provisions for Appendices. Don’t be limited by page lengths. Post details on websites and provide links in your paper.
  • Create a webpage for the research project. Briefly describe the work in a manner that anybody interested can come away understanding what you are working on and why that matters. There is merit in communicating our work to people at large, and not just to our research peers. Also:
    • Post the project papers or provide links to them.
    • Post annotation questionnaires.
    • Post the code on repositories such as GitHub and CodaLab. Provide links.
    • Share evaluation scripts.
    • Provide interactive visualisations to explore the data and system predictions. Highlight interesting findings.
    • Post tables with results of work on a particular task of interest. This is especially handy if you are working on a new task or creating new data for a task. Use tools such as CodaLab to create leaderboards and allow others to upload their system predictions.
    • If you are releasing data or code, briefly describe the resource, and add information on:
      • What can the resource be used for and how?
      • What hypotheses can be tested with this resource?
      • What are the properties of the resource — its strengths, biases, and limitations?
      • How can one build on the resource to create something new?
  • (Feel free to add more suggestions through your comments below.)

Sharing your work often follows months and years of dedicated research. So enjoy it, and don’t forget to let your excitement shine through! 🙂

Many thanks to Svetlana Kiritchenko, Graeme Hirst, Ted Pedersen, Peter Turney, and Tara Small for comments and discussions.

Area Chairs – and Areas

Traditionally, areas are prescribed by program chairs, in anticipation of the field’s interests. This can lead to last-minute scrambles as the number of submissions ends up varying widely across areas. To avoid this, we chose to follow the methodology developed by Ani Nenkova and Owen Rambow for NAACL 2016 in sunny San Diego, CA, USA. For COLING 2018 we have not defined areas directly, but rather will let them emerge from the interests of the area chairs, expressed in keywords. These keywords are then also used to allocate reviewers to areas, and later, papers. That’s why at COLING you won’t be directly asked to select an area for your paper at all; this is managed automatically. You will only be asked to select the type of your paper and describe its focus in keywords, to make sure it’s reviewed correctly. If you don’t know what paper types are available, we highly recommend you see the list of paper types and review questions. The keywords are exposed through the submission interface.

Each area has two area chairs, as previous experience has shown that it’s helpful to have a collaborator with whom to discuss decisions and to share the workload, but that larger groups can lead to lack of clarity in who’s doing what work.  We created the AC pairings automatically, keeping the following in mind:

  • We want to maximize similarity of AC research expertise (as captured by the keywords provided) in each pair, across the global pairing.
  • We want to minimize AC pairs where there is a large timezone difference, to foster quick troubleshooting and discussion (in the end, only one pair is not in the same global region).
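
To make the two criteria concrete, here is a minimal sketch of how such a pairing could be scored and searched. It is not the procedure the chairs actually used: the chair names, keywords, UTC offsets, the Jaccard similarity measure, and the `tz_weight` penalty are all assumptions made for illustration.

```python
# Hypothetical data: names, keywords, and UTC offsets are invented for illustration.
CHAIRS = {
    "A": ({"parsing", "syntax", "treebanks"}, 1),
    "B": ({"parsing", "syntax", "morphology"}, 2),
    "C": ({"dialogue", "pragmatics"}, -5),
    "D": ({"dialogue", "generation", "pragmatics"}, -8),
}


def jaccard(a, b):
    """Keyword overlap between two chairs, from 0 (disjoint) to 1 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def timezone_gap(x, y):
    """Hours between two UTC offsets, measured the short way round the clock."""
    diff = abs(x - y) % 24
    return min(diff, 24 - diff)


def pair_score(p, q, chairs, tz_weight=0.05):
    """Keyword similarity minus an assumed penalty per hour of timezone gap."""
    kw_p, tz_p = chairs[p]
    kw_q, tz_q = chairs[q]
    return jaccard(kw_p, kw_q) - tz_weight * timezone_gap(tz_p, tz_q)


def best_pairing(names, chairs):
    """Exhaustively search all perfect pairings of an even-sized list of names."""
    if not names:
        return 0.0, []
    first, rest = names[0], names[1:]
    best = (float("-inf"), [])
    for i, partner in enumerate(rest):
        remaining = rest[:i] + rest[i + 1:]
        sub_score, sub_pairs = best_pairing(remaining, chairs)
        total = pair_score(first, partner, chairs) + sub_score
        if total > best[0]:
            best = (total, [(first, partner)] + sub_pairs)
    return best


score, pairs = best_pairing(sorted(CHAIRS), CHAIRS)
print(pairs)
```

The exhaustive search is only practical for a handful of chairs; with a roster of the size listed below, a proper weighted matching algorithm would be needed to optimise the score across the global pairing.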

In addition, seven of the ACs have not been assigned to a specific area but designated as “Special Circumstances” chairs, who can be called on to troubleshoot or advise as necessary.

Our final AC roster is as follows:

  • Afra Alishahi
  • Alexandre Rademaker
  • Alexis Palmer
  • Aline Villavicencio
  • Alvin Grissom II
  • Andrew Caines
  • Ann Clifton
  • Anna Rumshisky
  • Antske Fokkens
  • Arash Eshghi
  • Aurelie Herbelot
  • Avirup Sil
  • Barry Devereux
  • Chaitanya Shivade
  • Dan Garrette
  • Daniel Lassiter
  • David Schlangen
  • Dekai Wu
  • Deyi Xiong
  • Eric Nichols
  • Francis Bond
  • Frank Ferraro
  • Georgiana Dinu
  • Gerard de Melo
  • Gina-Anne Levow
  • Harry Bunt
  • Hatem Haddad
  • Isabelle Augenstein
  • Jiajun Zhang
  • Jose Camacho Collados
  • Klinton Bicknell
  • Lilja Øvrelid
  • Maja Popovic
  • Manuel Montes-y-Gómez
  • Marcos Zampieri
  • Marie-Catherine de Marneffe
  • Meliha Yetisgen
  • Michael Tjalve
  • Miguel Ballesteros
  • Mike Tian-Jian Jiang
  • Mohammad Taher Pilehvar
  • Na-Rae Han
  • Naomi Feldman
  • Natalie Schluter
  • Nathan Schneider
  • Nikola Ljubešić
  • Nurit Melnik
  • Qin Lu
  • Roman Klinger
  • Sadid A. Hasan
  • Sanja Štajner
  • Sara Tonelli
  • Sarvnaz Karimi
  • Sujian Li
  • Sunayana Sitaram
  • Tal Linzen
  • Valia Kordoni
  • Vivek Kulkarni
  • Viviane Moreira
  • Wei Xu
  • Wenjie Li
  • Xiang Ren
  • Xiaodan Zhu
  • Yang Feng
  • Yonatan Bisk
  • Yue Zhang
  • Yun-Nung Chen
  • Zachary Chase Lipton
  • Zeljko Agic
  • Zhiyuan Liu

The following ACs serve as Special Circumstances chairs, spread across the world’s timezones:

  • Anders Søgaard
  • Andreas Vlachos
  • Asad Sayeed
  • Di Jiang
  • Karin Verspoor
  • Kevin Duh
  • Steven Bethard

We are grateful to these distinguished scholars for the time and effort they are committing to COLING 2018!

Tool Support for Low Overhead Reproducibility

Continuing our series on reproducibility in computational linguistics research, this guest post is from Prof. Kalina Bontcheva, from the Department of Computer Science at the University of Sheffield.

Tool Support for Low Overhead Reproducibility

The two previous guest posts on reproducibility in NLP did an excellent job of defining the different kinds of reproducibility in NLP, why it is important, and many of the stumbling points. Now let me try to provide some partial answers to the question of how we can have low overhead reproducibility through automated tool support.

Before I begin, and to motivate the somewhat self-centred nature of this post – reproducible and extensible open science has been the main goal and core focus of my research and that of the GATE team at the University of Sheffield for close to two decades now. One of my first papers on developing reusable NLP algorithms dates back to 2002 and argues that open source frameworks (and GATE in particular) offer researchers the much needed tool support that significantly lowers the overhead of NLP repeatability and reproducibility.

So, now 16 years on, let me return to this topic and provide a brief overview of how we address some of the technical challenges in NLP reproducibility through tool support. I will also share how researchers working on open NLP components have benefitted as a result, through high visibility and citation counts. As always, I will conclude with future work, i.e. outstanding repeatability challenges.

GATE Cloud: Repeatability-as-a-Service

As highlighted in the two previous guest blogs on reproducibility in NLP, there often are major stumbling blocks in repeating an experiment or re-running a method on new data. Examples include outdated software versions, differences in programming languages (e.g. Java vs Python), insufficient documentation and unknown parameter values. Add to this differences in input and output data formats and general software integration challenges, and it is no wonder that many PhD students (and other researchers) simply opt for citing results copied from the original publication.  

The quest for low overhead repeatability led us to implement GATE Cloud. It provides an ever-growing set of NLP algorithms (e.g. POS taggers and NE recognisers in multiple languages) through a unified, easy-to-use REST web service interface. Moreover, it allows any GATE-based NLP component or application to be deployed automatically as a service.

Algorithms + Parameter Values + Data = Auto-Packaged Self-Contained Applications

We also realised early on that repeatability needs more than an open source algorithm, so GATE Developer (the Eclipse of NLP as we like to call it) has the ability to auto-package an experiment by saving it as a GATE application. Effectively this makes a self-contained bundle of all software libraries and components, their parameters, and links to the data that they ran on. The latter is optional, as in some cases it is not possible to distribute copyright-protected datasets. Nevertheless, an application can still point to a directory where it expects the dataset, and if the dataset is available on the user’s computer, it will be loaded and used automatically.

Is My Algorithm Really Better?

A key strength of GATE is that it comes with a large number of reusable and repurposable open-source NLP components, e.g. named entity recognisers, POS taggers, tokenisers. (They aren’t always easy to spot in a vanilla GATE Developer, as they are packaged as optional plugins.) Many researchers not only re-run these as baselines, but also improve, extend, and/or repurpose them to new domains or applications. This then raises the question – is this new algorithm really better than the baseline, and in what ways? GATE aims to make such comparative evaluations and error analyses easier, through a set of reusable evaluation tools working on a document or corpus level.
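
As a rough illustration of what such a comparison involves, the sketch below scores a baseline and a new system against the same gold annotations and lists where they differ. It does not use GATE's evaluation tools or API; the function names, the exact-match criterion, and the toy `(start, end, label)` annotations are all assumptions made for this example.

```python
def prf(gold, predicted):
    """Precision, recall, and F1 over exact-match (start, end, label) annotations."""
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def error_analysis(gold, baseline, new):
    """Split the differences between two systems into gains and losses."""
    gold, baseline, new = set(gold), set(baseline), set(new)
    return {
        "fixed": (new - baseline) & gold,         # correct, found only by the new system
        "broken": (baseline - new) & gold,        # correct, lost by the new system
        "new_spurious": (new - baseline) - gold,  # wrong, introduced by the new system
    }


# Toy annotations, invented for illustration.
GOLD = {(0, 5, "PER"), (10, 16, "LOC"), (20, 27, "ORG"), (30, 34, "PER")}
BASELINE = {(0, 5, "PER"), (10, 16, "ORG"), (20, 27, "ORG")}
NEW = {(0, 5, "PER"), (10, 16, "LOC"), (20, 27, "ORG"), (40, 44, "LOC")}

baseline_scores = prf(GOLD, BASELINE)
new_scores = prf(GOLD, NEW)
diff = error_analysis(GOLD, BASELINE, NEW)
print(baseline_scores, new_scores, diff)
```

Here the new system has the higher F1, but the error analysis also shows that it introduces one spurious annotation the baseline did not produce, which a single score would hide.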

Open Reproducible Science and Research Impact Indicators

Now, when I advocate open and repeatable science, I sometimes get asked about the significant overhead it could incur. So firstly, as already discussed, the GATE infrastructure very significantly reduces this burden, but secondly – in our experience – the extra effort more than pays off in terms of paper citations. Please allow me to cut some corners, as I’ll take just two examples here to illustrate my point:

In other words – allegiance to open and repeatable science tends to translate directly into high paper citation counts, h-indexes for the authors, and consistently excellent research impact evaluations.

The Unreproducible Nature of NLP for Social Media

And now – let me conclude with a reproducibility challenge. As more and more NLP research addresses social media content, the creation of reusable benchmark datasets is becoming increasingly important, but also somewhat elusive, thanks to the ephemeral nature of tweets and forum posts, account deletions, and 404 URLs. How we can solve this in the most beneficial way for the NLP research community remains to be seen.

Thank you for reading!

Open Science

The post that follows is by our guest author Alice Motes, who is a Research Data and Preservation Manager at the University of Surrey, UK.

What’s Open science?

Great question! Open science refers to a wide range of approaches to doing science including open access to publications, open data, open software/code, open peer review, and citizen science (among others). It is driven by principles of transparency, accessibility, and collaboration, resulting in a very different model of production and dissemination of science than is currently practiced in most fields. There tends to be a gap between what scientists believe and how they actually behave. For example, most scientists agree that sharing data is important to the progress of science. However, fewer of those same scientists report sharing their data or having easily accessible data (Tenopir et al. 2011).

In many ways, open science is about engaging in full faith with the ideals of the scientific process, which prizes transparency, verification, reproducibility, and building on each other’s work to push the field forward. Open science encourages opening up all parts of the scientific process, but I want to focus on data. (Conveniently, the area I’m most familiar with! Funny that.) Open data is a natural extension of open access to academic publications.

Most scholars have probably benefited from open access to academic journals. (Or hit a paywall to a non-open access journal article. Did you know publishers have profit margins higher than Apple?) The strongest argument behind open access is that restrictions on who can get these articles slow scientific advancement and breakthroughs by disadvantaging scientists without access. Combine that with the fact that most research is partially or wholly funded by public money and it’s not a stretch to suggest that these outputs should be made available to the benefit of everyone, scientists and citizens alike.

Open data extends this idea into the realm of data, suggesting that sharing data for verification and reuse can catch errors earlier, foster innovative uses of data, and push science forward faster and more transparently to the benefit of the field. Not to mention the knock-on benefits of those advances to the public and broader society. Some versions of open data advocate for broadening access beyond scientific communities into the public sphere, where data may be examined and reused in potentially entrepreneurial ways to the benefit of society and the economy. You may also see the term open data applied in relation to government agencies at all levels releasing data that they hold as part of a push for transparency in governance and potential reuse by entrepreneurs, like using Transport for London’s API to build travel apps.

What are the potential benefits to open data?

You mean beyond the benefits to your scholarly peers and broader society? Well, there are lots of ways sharing data can be advantageous for you:

  • More citations – there’s evidence to suggest that papers with accompanying data get cited more (Piwowar and Vision 2013).
  • More exposure and impact – more people will see your work, which could lead to more collaborations and publications.
  • Innovative reuse – your data may be useful in ways you don’t anticipate outside your field, leading to interdisciplinary impact and more data citations.
  • Better reproducibility: The first reuser of your data is actually you! Plus, avoid a crisis. (Need more reasons? Check out selfish reproducibility).

Moreover, you’ll benefit from access to your peers’ shared data as well! Think about all the cool stuff you could do.

Great! I’m on board. How do I do it?

Well you just need to answer these three questions, really:

1. Can people get your data?

How are people going to find and download your files? Are you going to deposit the data into a repository?

2. Can people understand the data?

Ok so now they’ve got your data. Have you included enough documentation that they can understand your file organization, code, and supporting documents?

3. Can people use the data?

People have got a copy of your data and they know how to use it. Grand! But can they actually use it? Would someone have to buy expensive software to use it? Could you make a version of your data available in an open format? Have you supplied the code necessary to use the data? (Check out the Software Sustainability Institute for tips.)

For more, check out the FAIR principles (Findable, Accessible, Interoperable, and Reusable).

Of course, there are some very good ethical, legal, and commercial reasons why sharing data is sometimes not possible, but I think the goal should be to strive towards the European Commission’s ideal of “as open as possible, as closed as necessary”. You can imagine different levels of sharing, expanding outward: within your lab, within your department, within your university, within your scholarly community, and publicly. Most funders across North America and Europe see data as a public good, with the greatest benefit coming from sharing data with the widest possible audience, and they encourage publicly sharing data from their funded projects.

Make an action plan or a data management plan

Here are some things to do to help you get the ball rolling on sharing data:

  • Get started early and stay organized: document your research anticipating a future user. Check out the Center for Open Science’s tools and tips.
  • Deposit your data into a repository (e.g. Zenodo, Figshare). Many universities have their own repository. Some repositories integrate with GitHub, Dropbox, etc. to make it even easier!
  • Get your data a DOI so citations can be tracked. (Repositories or your university library can do this for you.)
  • Consider applying a license to your data. Don’t be too restrictive though! You want people to do cool things with your data.
  • Ask for help: Your university likely has someone who can help with local resources. Probably in the library. Look for “Research Data Management”. You might find someone like me!
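
One small way to document a dataset for a future user is to ship a manifest that lists every file with its size and checksum, so that anyone who downloads the deposit can check it is complete and unaltered. The sketch below is only one possible approach, not a repository requirement; the `build_manifest` function, the manifest fields, and the two demo files are assumptions made for illustration.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def build_manifest(data_dir):
    """List every file under data_dir with its relative path, size, and SHA-256."""
    root = Path(data_dir)
    entries = []
    for path in sorted(root.rglob("*")):
        if path.is_file():
            entries.append({
                "path": path.relative_to(root).as_posix(),
                "bytes": path.stat().st_size,
                # Reads the whole file at once; very large files would need chunked reads.
                "sha256": hashlib.sha256(path.read_bytes()).hexdigest(),
            })
    return entries


# Demo on a throwaway directory with two invented files.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "README.txt").write_bytes(b"Columns: id, text, label\n")
    Path(tmp, "data.csv").write_bytes(b"id,text,label\n1,hello,pos\n")
    manifest = build_manifest(tmp)

print(json.dumps(manifest, indent=2))
```

Saving the printed JSON alongside the data gives reusers a quick integrity check and an overview of what the deposit contains.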

But I don’t have time to do it!

Aren’t you already creating documentation for yourself? You know, in case someone questions your findings after publication, or if reviewer 2 (always reviewer 2 :::shakes fist:::) questions your methods, or in a couple of months when you’re trying to figure out why you decided to run one analysis over another. Surely, making it intelligible to other people isn’t adding much to your workflow…or your graduate assistant’s workflow? If you incorporate these habits early in the process you’ll cut down the time necessary to prepare data at the end. Also, if you consider how much time you spend planning, collecting, analyzing, writing, and revising, the amount of time it takes to prepare your data and share it is relatively small in the grand scheme of things. And why wouldn’t you want to have another output to share? Matthew Partridge, a researcher from the University of Southampton and cartoonist at Errant Science, has a great comic illustrating this:

Image by Matthew Partridge, Errant Science

In sum, open science and open data offer a model for a more transparent and collaborative type of scientific inquiry, one that lives up to the best ideals of science as a community effort moving towards discovery and innovation. Plus you get a cool new output to list on your CV and can track its impact in the world. Not a bad shake if you ask me.

Speaker profile – Fabiola Henri

We are proud to announce that Dr. Fabiola Henri will give one of COLING 2018’s keynote talks.

Fabiola Henri has been an Assistant Professor at the University of Kentucky since 2014. She received a Ph.D. in Linguistics from the University of Paris Diderot, France in 2010. She is a creolist who primarily focuses on the structure and complexity of morphology in creole languages from the perspective of recent abstractive models, with insights from both information-theoretic and discriminative learning. Her work examines the emergence of creole morphology as proceeding from a complex interplay between sociohistorical context, natural language change, input from the lexifier, substratic influence, and unguided second language acquisition, among other factors. Her main interests lie within French-based creoles, and more specifically Mauritian, a language which she speaks natively. Her publications and various presentations offer an empirical and explanatory view of morphological change in French-based creoles, with a view on morphological complexity which starkly contrasts with Exceptionalist theories of creolization.

https://linguistics.as.uky.edu/users/fshe223