WNUT-2020 Task 1 Overview: Extracting Entities and Relations from Wet Lab Protocols
Jeniya Tabassum, Sydney Lee, Wei Xu, Alan Ritter
2020-10-27

This paper presents the results of the wet lab information extraction task at WNUT 2020. The task consisted of two subtasks: (1) a Named Entity Recognition (NER) task with 13 participants and (2) a Relation Extraction (RE) task with 2 participants. We outline the task, the data annotation process, and corpus statistics, and provide a high-level overview of the participating systems for each subtask.

Wet lab protocols consist of natural language instructions for carrying out chemistry or biology experiments (for an example, see Figure 1). While there have been efforts to develop domain-specific formal languages in order to support robotic automation of experimental procedures (Bates et al., 2017), the vast majority of knowledge about how to carry out biological experiments or chemical synthesis procedures is documented only in natural language texts, including scientific papers, electronic lab notebooks, and so on. Recent research has begun to apply human language technologies to extract structured representations of procedures from natural language protocols (Kuniyoshi et al., 2020; Vaucher et al., 2020; Kulkarni et al., 2018; Soldatova et al., 2014; Vasilev et al., 2011; Ananthanarayanan and Thies, 2010). Extracting named entities and relations from these protocols is an important first step towards machine reading systems that can interpret the meaning of these noisy, human-generated instructions. However, the performance of state-of-the-art tools for extracting named entities and relations from wet lab protocols still lags behind that on well-edited text genres (Jiang et al., 2020).
This motivates the need for continued research, in addition to new datasets and tools adapted to this noisy text genre. In this overview paper, we describe the development and findings of a shared task on named entity and relation extraction from noisy wet lab protocols, which was held at the 6th Workshop on Noisy User-generated Text (WNUT 2020) and attracted 15 participating teams. In the following sections, we describe details of the task, including the training and development datasets in addition to the newly annotated test data. We briefly summarize the systems developed by selected teams, and conclude with results.

Wet lab protocols consist of guidelines for different lab procedures that involve chemicals, drugs, or other materials in liquid solutions or volatile phases. A protocol contains a sequence of steps that are followed to perform a desired task, and may also include general guidelines or warnings about the materials being used. The publicly available archive of protocols.io contains such guidelines for wet lab experiments, written by researchers and lab technicians around the world. This protocol archive covers a large spectrum of experimental procedures, including neurology, epigenetics, metabolomics, stem cell biology, etc. Figure 1 shows a representative wet lab protocol. Wet lab protocols, written by users from all over the world, contain domain-specific jargon as well as numerous nonstandard spellings, abbreviations, and unreliable capitalization. Such a diverse and noisy style of user-created protocols poses crucial challenges for entity and relation extraction systems.

Table 1: Corpus statistics.

             Train     Dev  Test-18  Test-20    Total
#protocols     370     122      123      111      726
#sentences    8444    2839     2813     3562    17658
#tokens     107038   36106    36597    51688   231429
#entities    48197   15972    16490   104654   185313
#relations   32158   10812    11242    70591   124803
Hence, off-the-shelf named entity recognition and relation extraction tools, tuned for well-edited texts, suffer severe performance degradation when applied to noisy protocol texts (Kulkarni et al., 2018). To address these challenges, there has been a growing body of work on adapting entity and relation extraction tools to noisy wet lab texts (Jiang et al., 2020; Luan et al., 2019; Kulkarni et al., 2018). However, different research groups have used different evaluation setups (e.g., training/test splits), making it challenging to perform direct comparisons across systems. By organizing a shared evaluation, we hope to help establish a common evaluation methodology (for at least one dataset) and also promote research and development of NLP tools for user-generated wet-lab text genres.

Our annotated wet lab corpus includes 726 experimental protocols from the 8-year archive of ProtocolIO (April 2012 to March 2020). These protocols are manually annotated with 15 types of relations among 18 entity types. The training and development data for our task was taken from a previous wet lab corpus (Kulkarni et al., 2018) that consists of 623 protocols. We excluded the eight duplicate protocols from this dataset and then re-annotated the 615 unique protocols in BRAT (Stenetorp et al., 2012). This re-annotation process allowed us to add 20,613 previously missing entities along with 10,824 previously missing relations, and to remove inconsistent annotations. The updated corpus statistics are provided in Table 1. This full dataset (Train, Dev, Test-18) was provided to the participants at the beginning of the task, and they were allowed to use any part of it to train their final models. For this shared task we added 111 new protocols (Test-20), which were used to evaluate the submitted models.
The Test-20 dataset consists of 100 randomly sampled general protocols and 11 manually selected COVID-related protocols from ProtocolIO (https://www.protocols.io/). These 111 protocols were double annotated by three annotators using a web-based annotation tool, BRAT (Stenetorp et al., 2012). Figure 1 presents a screenshot of our annotation interface. We also provided the annotators a set of guidelines containing the entity and relation type definitions. The annotation task was split into multiple iterations; in each iteration, an annotator was given a set of 10 protocols. An adjudicator then went through all the entity and relation annotations in these protocols and resolved the disagreements. Before adjudication, the inter-annotator agreement was 0.75, measured by Cohen's Kappa (Cohen, 1960).

We provided the participants with baseline models for both subtasks. The baseline model for the named entity recognition task was a feature-based CRF tagger developed using CRFSuite with a standard set of contextual, lexical, and gazetteer features. The baseline relation extraction system was a feature-based logistic regression model developed using scikit-learn with the same standard set of contextual, lexical, and gazetteer features.

Thirteen teams (Table 3) participated in the named entity recognition subtask, taking a wide variety of approaches. Table 2 summarizes the word representations, features, and machine learning approaches used by each team. A majority of the teams (11 out of 13) utilized contextual word representations. Four teams combined contextual word representations with global word vectors. Only two teams did not use any type of word representation and relied entirely on hand-engineered features and CRF taggers. The best performing teams utilized a combination of contextual word representations with ensemble learning. Below we provide a brief description of the approach taken by each team.
B-NLP (Lange et al., 2020) modeled NER as a parsing task using a biaffine classifier. A second classifier in their system used the predictions of the first classifier and then updated the labels of the predicted entities. Both classifiers utilized word2vec (Mikolov et al., 2013) and SciBERT.

DSC-IITISM (Gupta et al., 2020) developed a BiLSTM-CRF model that utilized a concatenation of CamemBERT-base (Martin et al., 2020), Flair (PubMed) (Akbik et al., 2018), and GloVe (en) (Pennington et al., 2014) word representations.

Fancy Man (Zeng et al., 2020) fine-tuned BERT-base (Devlin et al., 2019).

Two teams (Table 3) participated in the relation extraction subtask. Both teams fine-tuned contextual word representations and did not use any hand-crafted features. Table 5 summarizes the word representations and machine learning approaches followed by each team. Below we provide a brief description of the model developed by each team.

Big Green (Miller and Vosoughi, 2020) treated a protocol as a knowledge graph in which relationships between entities are edges. They trained a BERT-based (Devlin et al., 2019) system to classify edge presence and type between two entities, given entity text, label, and local context.

mgsohrab used PubMedBERT (Gu et al., 2020) as input to a relation extraction model that enumerates all possible pairs of arguments using a deep exhaustive span representation approach.

In this section, we present the performance of each participating system along with a description of the errors made by each model type.

Table 4: Results on extraction of 18 named entity types from the Test-20 dataset. Exact match reports the performance when the predicted entity type is the same as the gold entity type and the predicted entity boundary exactly matches the gold entity boundary.
Partial match reports the performance when the predicted entity type is the same as the gold entity type and the predicted entity boundary has some overlap with the gold entity boundary.

We observe that ensemble models with contextual word representations outperform all other approaches, achieving a 77.99 F1 score in exact match (Team: BITEM) and an 81.75 F1 score in partial match (Team: PublishInCovid19). In Figure 2, we present an error analysis of the different NER systems. Analysis of the errors in the different NER models' predictions demonstrates that BERT-based models make fewer false positive and incorrect-type errors than traditional neural networks and feature-based models. We also observed that these BERT models suffer from more false negative errors than the other approaches. To combine the advantages of these different approaches, we built a majority-voting ensemble classifier. Our ensemble NER tagger takes the predictions of all the submitted systems and assigns each word the most frequently predicted tag. This ensemble classifier performs better than all the single fine-tuned BERT models, and it outperformed the traditional neural and feature-based models, achieving 76.84 F1 (Table 4). However, our ensemble NER tagger performed 1.15 F1 below the neural ensemble models (Teams: BITEM, PublishInCovid19). We note that we did not have access to the participating models' predictions on the development and training sets; hence, it was not possible for us to fine-tune our ensemble classifier on the entity recognition task. Table 6 shows a comparison of precision (P), recall (R), and F1 score among the participating teams, evaluated on the Test-20 corpus.
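The majority-voting NER ensemble described above reduces to picking, for each token, the most frequently predicted tag across the submitted systems. A minimal sketch of this scheme (the three tag sequences below are invented for illustration; the real ensemble votes over the thirteen submissions):

```python
from collections import Counter

def majority_vote(system_predictions):
    """Per-token majority vote over aligned BIO tag sequences.

    system_predictions: a list of tag sequences, one per system, all
    aligned to the same tokens. Returns the most frequent tag per token.
    """
    ensemble = []
    for token_tags in zip(*system_predictions):
        # most_common(1) picks the highest-count tag;
        # ties are broken by first-seen order.
        ensemble.append(Counter(token_tags).most_common(1)[0][0])
    return ensemble

# Invented predictions from three hypothetical systems over three tokens.
preds = [
    ["B-Action", "O", "B-Reagent"],
    ["B-Action", "O", "O"],
    ["O",        "O", "B-Reagent"],
]
tags = majority_vote(preds)  # ["B-Action", "O", "B-Reagent"]
```

Note that this sketch votes on raw tags independently per token, so it can in principle emit inconsistent BIO sequences (e.g., an I- tag without a preceding B-); a production version would add a repair step.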
Both teams utilized the gold entities and predicted the relations among these entities by fine-tuning contextual word representations. We observed that fine-tuning the domain-related PubMedBERT provides significantly higher performance than fine-tuning general-domain BERT. While examining the relation predictions from both systems, we found that the system with PubMedBERT fine-tuning (Team: mgsohrab) produced significantly fewer errors in every category (Figure 3). The error analysis over the different participants' predictions revealed that general-domain BERT makes fewer false negative errors than the domain-related BERT. However, the domain-related PubMedBERT model makes significantly fewer false positive and incorrect-type errors than general-domain BERT. To combine the advantages of these different approaches, we built an ensemble classifier from the predictions of the submitted systems, in which we assign each entity pair the most frequently predicted relation. This ensemble classifier outperforms the winning system, achieving an 81.32 F1 score.

The task of information extraction from wet lab protocols is closely related to event trigger extraction, which has been studied extensively, mostly using the ACE data (Doddington et al., 2004) and the BioNLP data (Nédellec et al., 2013). Broadly, event trigger detection models fall into two classes: (1) rule-based methods using pattern matching and regular expressions to identify triggers (Vlachos et al., 2009), and (2) machine learning methods focusing on hand-crafted features used in classification models such as SVMs or maximum entropy classifiers. Kernel-based learning methods have also been utilized with embedded features from syntactic and semantic contexts to identify and extract biomedical event entities (Zhou et al., 2014).
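Returning to the relation-level ensemble described earlier: it works analogously to the token-level NER vote, except that votes are grouped per entity pair rather than per token. A minimal sketch (the system outputs, entity pair keys, and vote counts are invented for illustration; "Acts-on" and "Site" are relation types from the corpus):

```python
from collections import Counter, defaultdict

def ensemble_relations(system_outputs):
    """Majority vote over relation predictions, grouped per entity pair.

    system_outputs: one dict per system, mapping (head, tail) entity
    pairs to a predicted relation label. Returns the majority label
    for every pair that any system predicted.
    """
    votes = defaultdict(list)
    for output in system_outputs:
        for pair, label in output.items():
            votes[pair].append(label)
    return {pair: Counter(labels).most_common(1)[0][0]
            for pair, labels in votes.items()}

# Invented predictions from three hypothetical systems.
outputs = [
    {("Add", "Trypsin"): "Acts-on", ("Add", "tube"): "Site"},
    {("Add", "Trypsin"): "Acts-on", ("Add", "tube"): "Acts-on"},
    {("Add", "Trypsin"): "Acts-on", ("Add", "tube"): "Site"},
]
relations = ensemble_relations(outputs)
```

Since only two systems participated in the RE subtask, ties are common in practice; here most_common breaks them by first-seen order, which implicitly favors the first system listed.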
To counteract highly sparse representations, various neural models were proposed, utilizing dependency-based word embeddings with feed-forward neural networks (Wang et al., 2016b), CNNs (Wang et al., 2016a), and bidirectional RNNs (Rahul et al., 2017). Previous work has experimented on datasets of well-edited biomedical publications with a small number of entity types: for example, the JNLPBA corpus (Kim et al., 2004) with 5 entity types (CELL LINE, CELL TYPE, DNA, RNA, and PROTEIN) and the BC2GM corpus (Hirschman et al., 2005) with a single entity class for genes/proteins. In contrast, our dataset addresses the challenges of recognizing 18 fine-grained named entity types along with 15 types of relations in user-created wet lab protocols.

In this paper, we presented a shared task consisting of two subtasks: named entity recognition and relation extraction from wet lab protocols. We described the task setup and dataset details, and outlined the approaches taken by the participating systems. The shared task used a larger and improved dataset compared to the prior literature (Kulkarni et al., 2018). This improved dataset enables us to draw stronger conclusions about the true potential of different approaches. It also facilitates analysis of the participating systems' results, which helps us suggest potential research directions for both future shared tasks and noisy text processing in user-generated lab protocols.
References

KaushikAcharya at WNUT 2020 Shared Task-1: Conditional Random Field (CRF) based Named Entity Recognition (NER) for Wet Lab Protocols.
Contextual String Embeddings for Sequence Labeling.
BioCoder: A programming language for standardizing and automating biology protocols.
Wet Lab Accelerator: a web-based application democratizing laboratory automation for synthetic biology.
SciBERT: Pretrained Contextualized Embeddings for Scientific Text.
A Coefficient of Agreement for Nominal Scales.
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.
The Automatic Content Extraction (ACE) Program: Tasks, Data, and Evaluation.
Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing.
DSC-IITISM at WNUT 2020 Shared Task-1: Name Entity Extraction from Wet Lab Protocol.
Don't Stop Pretraining: Adapt Language Models to Domains and Tasks.
Overview of BioCreative: Critical Assessment of Information Extraction for Biology.
ClinicalBERT: Modeling Clinical Notes and Predicting Hospital Readmission.
SudeshnaTCS at WNUT 2020 Shared Task-1: Name Entity Extraction from Wet Lab Protocol.
Generalizing Natural Language Analysis through Span-relation Representations.
IITKGP at WNUT 2020 Shared Task-1: Domain Specific BERT Representation for Named Entity Recognition of Lab Protocol.
BIO-BIO at WNUT 2020 Shared Task-1: Name Entity Extraction from Wet Lab Protocol.
kabir at WNUT 2020 Shared Task-1: Name Entity Extraction from Wet Lab Protocol.
Introduction to the Bio-Entity Recognition Task at JNLPBA.
BiTeM at WNUT 2020 Shared Task-1: Named Entity Recognition over Wet Lab Protocols using an Ensemble of Contextual Language Models.
An Annotated Corpus for Machine Reading of Instructions in Wet Lab Protocols.
Annotating and Extracting Synthesis Process of All-Solid-State Batteries from Scientific Literature.
B-NLP at WNUT 2020 Shared Task-1: Name Entity Extraction from Wet Lab Protocol.
BioBERT: a Pre-trained Biomedical Language Representation Model for Biomedical Text Mining.
RoBERTa: A Robustly Optimized BERT Pretraining Approach.
A General Framework for Information Extraction using Dynamic Span Graphs.
CamemBERT: a Tasty French Language Model.
Efficient Estimation of Word Representations in Vector Space.
Big Green at WNUT 2020 Shared Task-1: Relation Extraction as Contextualized Sequence Classification.
Overview of BioNLP Shared Task 2013.
Comparisons of Sequence Labeling Algorithms and Extensions.
GloVe: Global Vectors for Word Representation.
Deep Contextualized Word Representations.
mahab at WNUT 2020 Shared Task-1: Name Entity Extraction from Wet Lab Protocol.
Event Extraction Across Multiple Levels of Biological Organization.
Biomedical Event Trigger Identification using Bidirectional Recurrent Neural Network Based Models.
IBS at WNUT 2020 Shared Task-1: Name Entity Extraction from Wet Lab Protocol.
PublishInCovid19 at WNUT 2020 Shared Task-1: Entity Recognition in Wet Lab Protocols using Structured Learning Ensemble and Contextualised Embeddings.
mgsohrab at WNUT 2020 Shared Task-1: Neural Exhaustive Approach for Entity and Relation Recognition Over Wet Lab Protocols.
EXACT2: The Semantics of Biomedical Protocols.
brat: a Web-based Tool for NLP-Assisted Text Annotation.
A Software Stack for Specification and Robotic Execution of Protocols for Synthetic Biological Engineering.
Biomedical Event Extraction Without Training Data.
Biomedical Event Trigger Detection Based on Convolutional Neural Network.
Biomedical Event Trigger Detection by Dependency-Based Word Embedding.
XLNet: Generalized Autoregressive Pretraining for Language Understanding.
Fancy Man Launches Zippo at WNUT 2020 Shared Task-1: A Bert Case Model for Wet Lab Entity Extraction.
Event Trigger Identification for Biomedical Events Extraction using Domain Knowledge.

Acknowledgments

We would like to thank Ethan Lee and Jaewook Lee for helping with data annotation. This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under Contract No. HR001119C0108. The views, opinions, and/or findings expressed are those of the author(s) and should not be interpreted as representing the official views or policies of the Department of Defense or the U.S. Government.