key: cord-0843566-o2yr7mk2
authors: Xiong, Ying; Chen, Shuai; Tang, Buzhou; Chen, Qingcai; Wang, Xiaolong; Yan, Jun; Zhou, Yi
title: Improving deep learning method for biomedical named entity recognition by using entity definition information
date: 2021-12-17
journal: BMC Bioinformatics
DOI: 10.1186/s12859-021-04236-y
sha: 86516a28569642eb1dfcc9873fd9470dbb66b298
doc_id: 843566
cord_uid: o2yr7mk2

BACKGROUND: Biomedical named entity recognition (NER) is a fundamental task of biomedical text mining that finds the boundaries of entity mentions in biomedical text and determines their entity type. To accelerate the development of biomedical NER techniques in Spanish, the PharmaCoNER organizers launched a competition to recognize pharmacological substances, compounds, and proteins. Biomedical NER is usually treated as a sequence labeling task, and almost all state-of-the-art sequence labeling methods ignore the meaning of different entity types. In this paper, we investigate methods that introduce the meaning of entity types into deep learning methods for biomedical NER and apply them to the PharmaCoNER 2019 challenge. The meaning of each entity type is represented by its definition information.

MATERIAL AND METHOD: We investigate how to use entity definition information in the following two methods: (1) SQuAD-style machine reading comprehension (MRC) methods that treat entity definition information as the query and biomedical text as the context, and predict answer spans as entities; (2) span-level one-pass (SOne) methods that predict entity spans one type at a time and introduce the entity type meaning, which is represented by entity definition information. All models are trained and tested on the PharmaCoNER 2019 corpus, and their performance is evaluated by strict micro-averaged precision, recall, and F1-score.

RESULTS: Entity definition information brings improvements to both SQuAD-style MRC and SOne methods by about 0.003 in micro-averaged F1-score.
The SQuAD-style MRC model using entity definition information as the query achieves the best performance, with a micro-averaged precision of 0.9225, a recall of 0.9050, and an F1-score of 0.9137. It outperforms the best model of the PharmaCoNER 2019 challenge by 0.0032 in F1-score. Compared with the state-of-the-art model that does not use manually crafted features, our model obtains a 1% improvement in F1-score, which is significant. These results indicate that entity definition information is useful for deep learning methods on biomedical NER.

CONCLUSION: Our entity definition information enhanced models achieve a state-of-the-art micro-averaged F1-score of 0.9137, which implies that entity definition information has a positive impact on biomedical NER. In the future, we will explore more entity definition information from knowledge graphs.

Biomedical named entity recognition (NER) is a fundamental task of biomedical text mining that identifies biomedical entity mentions of different types in biomedical text. Most biomedical NER studies focus on biomedical text in English. To accelerate the development of Spanish biomedical NER techniques, Martin Krallinger et al. organized a specific challenge for chemical & drug mention recognition in Spanish biomedical text, called PharmaCoNER, in 2019 [1]. Participants were required to recognize the entities in Spanish biomedical text, as shown in Fig. 1.

Biomedical NER is a typical sequence labeling problem, and many state-of-the-art methods have been proposed for it, such as BiLSTM-CRF [2]. Almost none of these methods consider the meaning of different entity types, which may benefit biomedical NER. The meaning of each entity type can be represented by its definition. For example, the definition of PROTEINAS in the guideline of PharmaCoNER 2019 is: "Las menciones de proteínas y genes incluyen péptidos, hormonas peptídicas y anticuerpos." (Protein and gene mentions include peptides, peptide hormones, and antibodies).

In this paper, we explore how to encode entity definition information in two kinds of deep learning methods for NER:
(1) SQuAD-style MRC methods designed to find a continuous span of entity mentions in a given text for each type. We use each type's entity definition as the query instead of a naive query generated by simple rules. For convenience, we use MRC to denote SQuAD-style MRC in the rest of this paper.
(2) Span-level one-pass (SOne) methods that predict entity spans one type at a time. We use entity definition information to represent each entity type's meaning and introduce the entity type meaning into SOne. The definition information of each type includes the original definition of the type in the guideline and the entity mentions in the text. We compare them in the SOne model.

To evaluate the performance of MRC and SOne, we conduct experiments on the PharmaCoNER 2019 corpus. Experiments show that entity definition information brings improvements to both MRC and SOne methods. The improvement in micro-averaged F1-score is about 0.003. The MRC method using entity definition information as the query achieves the best performance, with a micro-averaged precision of 0.9225, a recall of 0.9050, and an F1-score of 0.9137. It outperforms the best model of the PharmaCoNER 2019 challenge by 0.0032 in micro-averaged F1-score.

The natural language processing (NLP) community has made a great contribution to the development of NER in biomedical text through challenges such as I2B2 (Informatics for Integrating Biology and the Bedside) [3, 4], BioCreative (Critical Assessment of Information Extraction systems in Biology) [5, 6], SemEval (Semantic Evaluation) [7, 8], CCKS (China Conference on Knowledge Graph and Semantic Computing) [9, 10] and IberLEF [11]. A large number of methods have been proposed for biomedical NER.
Most of them can be classified into the following three categories:
(1) Rule-based methods that extract named entities using specific rules designed by experts. The earlier clinical NLP tools are rule-based systems relying on clinical dictionaries, such as MedLEE [12], KnowledgeMap [13] and MetaMap [14].
(2) Supervised machine learning methods with hand-crafted features, such as Maximum Entropy (ME) [15, 16], Support Vector Machines (SVM) [17], CRF [18, 19], Hidden Markov Models (HMM) [20, 21] and Structural Support Vector Machines (SSVM) [22]. They usually treat NER as a sequence labeling task, which tags a sentence with a label sequence. The common features used in supervised machine learning methods include orthographic information (e.g., capitalization, prefix, suffix and word shape), syntactic information (e.g., POS tags), dictionary information, n-gram information, discourse information (e.g., section information in EHRs) and some features generated by unsupervised learning methods [23].
(3) Deep learning methods that can learn features from large unlabeled data without costly feature engineering. Convolutional Neural Network (CNN) [24], Recurrent Neural Network (RNN) [25] and Long Short-Term Memory neural network (LSTM) [2] have been widely used for biomedical NER and show good performance.

Besides the methods mentioned above, there are also some other attempts. For example, to tackle the low-resource problem in the biomedical domain, researchers introduce multi-task learning methods to learn richer information from other tasks, such as NER from other sources, chunking, and POS tagging [26-28], and deploy transfer learning methods that first learn knowledge from related sources and then fine-tune on the target [29-33].

Nowadays, there is an upward trend in defining NLP tasks in the MRC framework. MRC models [34-36] extract answer spans from the context given a pre-defined question.
Generally, SQuAD-style MRC models can be formalized as predicting the start position and the end position of the answer. Li et al. [37] treat the entity-relation extraction task as multi-turn question answering and propose a unified MRC framework to recognize entities and extract relations. Li et al. [38] propose an MRC method to recognize both flat and nested entities.

In this study, all experiments are conducted on the PharmaCoNER 2019 corpus, which was annotated by medicinal chemistry experts according to a pre-defined guideline. The corpus contains 1000 clinical records with 24,654 chemical & drug mentions. It is divided into a training set of 500 records, a development set of 250 records and a test set of 250 records, where the test set was hidden in a background set of 3751 records during the test stage of the competition. In our experiments, we first split each record into sentences at sentence-ending symbols, including ' ', '.', ';', '?', and '!'. About 95% of sentences are no longer than 230 tokens. The corpus statistics, including the number of records, sentences, and chemical & drug mentions of different types, are listed in Table 1. It should be noted that UNCLEAR mentions were not considered during the competition.

Given a sequence $X = \{x_1, x_2, \ldots, x_n\}$ of length $n$, we need to assign a label sequence $Y = \{y_1, y_2, \ldots, y_n\}$ to $X$, where $y_i$ is the label of token $x_i$ ($1 \le i \le n$), chosen from the entity types (e.g., PROTEINAS, NORMALIZABLES, NO_NORMALIZABLES, UNCLEAR).

MRC definition: the sequence labeling problem can be redefined in the MRC framework as follows. For each label type $y$, its definition information is regarded as a query $q_y = \{q_1, q_2, \ldots, q_m\}$ of length $m$, a sentence $X$ is regarded as the context of $q_y$, and the span of an entity of type $y$, $x^y_{start:end} = \{x_{start}, x_{start+1}, \ldots, x_{end-1}, x_{end}\}$, is regarded as an answer. The original sequence labeling problem can then be represented by the triple $(q_y, X, x^y_{start:end})$.
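To make the MRC reformulation concrete, the following is a minimal Python sketch (not the authors' implementation) of how one annotated sentence could be expanded into one (query, context, answer spans) example per entity type. The names MRCExample and to_mrc_examples, the whitespace tokenization, the annotation format and the NORMALIZABLES query string are assumptions made for illustration; only the PROTEINAS query is taken from the guideline definition quoted earlier. Spans use inclusive token indices.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple

# (start, end, entity_type) with inclusive token indices; a hypothetical annotation format.
Annotation = Tuple[int, int, str]


@dataclass(frozen=True)
class MRCExample:
    entity_type: str
    query: List[str]                # q_y: tokens of the type's definition
    context: List[str]              # X: tokens of the sentence
    answers: List[Tuple[int, int]]  # gold spans of type y as (start, end)


def to_mrc_examples(
    tokens: Sequence[str],
    annotations: Sequence[Annotation],
    queries: Dict[str, str],
) -> List[MRCExample]:
    """Build one MRC example per entity type for a single sentence."""
    examples = []
    for entity_type, query_text in queries.items():
        spans = sorted(
            (start, end) for start, end, label in annotations if label == entity_type
        )
        # A type without mentions still yields an example with an empty
        # answer list (a design choice of this sketch).
        examples.append(
            MRCExample(entity_type, query_text.split(), list(tokens), spans)
        )
    return examples


queries = {
    "PROTEINAS": "Las menciones de proteínas y genes incluyen péptidos, "
                 "hormonas peptídicas y anticuerpos.",
    "NORMALIZABLES": "placeholder query for NORMALIZABLES",  # hypothetical text
}
tokens = ["Niveles", "de", "protrombina", "normales", "."]  # made-up sentence
annotations = [(2, 2, "PROTEINAS")]

for example in to_mrc_examples(tokens, annotations, queries):
    print(example.entity_type, example.answers)
```

With this toy input, the PROTEINAS example carries the single span (2, 2) covering "protrombina", while the NORMALIZABLES example has no answer span.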
The goal of MRC is to find the spans of all entity mentions of all types, given all sentences.

SOne definition: SOne takes the sequence $X$ as input and predicts the spans of all entities one type at a time using a multi-layer pointer network [39]. The number of network layers depends on the number of entity types. For each entity type, we add entity definition information $e$ to enhance SOne by concatenating it to all tokens.

Query generation is critical for MRC, since queries usually contain some prior knowledge (e.g., entity type definitions) about the task. Li et al. [40] introduce various kinds of query generation methods, including keywords, Wikipedia, rule-based template filling, synonyms, keywords combined with synonyms, and annotation guideline notes, and compare them. Their results show that annotation guideline notes are the best choice for query generation. Following Li et al. [40], we compare two kinds of query generation: annotation guideline and rule-based template filling. Table 2 shows our generated queries for each type of entity.

In this study, we utilize BERT (Bidirectional Encoder Representations from Transformers) [41] as our model backbone. Figure 2 shows the skeleton of the MRC model. Given query $q_y$ and sentence $X$, we need to predict the span of every entity of type $y$, including a start position $x^y_{start}$ and an end position $x^y_{end}$. The model first takes the query $q_y$ together with the sentence $X$ as input and encodes it with BERT, where $d$ is the dimension of the last-layer output of BERT. The model then predicts the probabilities of the start position and the end position from the BERT output, using the trainable parameters $W_{start}$ and $W_{end}$ and the biases $b_{start}$ and $b_{end}$. The predicted start index $I_{start}$ and end index $I_{end}$ are derived from these probabilities. We use MRC_rule and MRC_guideline to denote MRC using rule-based template filling for query generation and MRC using the annotation guideline as the query, respectively.

Figure 3 shows the skeleton of the SOne model.
In this model, we first use BERT to encode the input sentence $X$ as $Z \in \mathbb{R}^{n \times d}$ (i.e., the output of BERT's last layer), and then concatenate the entity definition information representation $e \in \mathbb{R}^{d_e}$ to all tokens, where $d_e$ is the dimension of the entity definition information representation. Here, we consider three kinds of entity definition information:
(1) Entity mention word embeddings. Each entity type's definition information is represented by the mean pooling of the word2vec embeddings of its entity mentions (denoted as SOne_w2v).
(2) Rule-based queries. We use BERT to encode each query generated by rules (denoted as SOne_rule).
(3) Annotation guideline encoded by BERT (denoted as SOne_guideline).
The entity definition information enhanced sentence representation is the concatenation $[Z; E] \in \mathbb{R}^{n \times (d + d_e)}$, where $E \in \mathbb{R}^{n \times d_e}$ consists of $n$ copies of $e$, and $[\,]$ denotes the concatenation operation. Finally, the SOne model makes the same prediction for the start position and end position as the MRC model. The only difference is that SOne has four input-shared span predictors with the same structure and different parameters, while MRC has four separate span predictors.

The overall objective function of MRC and SOne combines two terms, $L_{start}$ and $L_{end}$, where $L_{start}$ is the start position prediction loss and $L_{end}$ is the end position prediction loss.

The performance of all models is measured by micro-averaged precision (P), recall (R), and F1-score (F1) under the "exact-match" criterion:
$P = \frac{TP}{TP + FP}$, $R = \frac{TP}{TP + FN}$, $F1 = \frac{2 \cdot P \cdot R}{P + R}$,
where TP is the number of true positives, FP is the number of false positives, and FN is the number of false negatives. These measures can be calculated by the evaluation tool [43] released by the official organizers of the PharmaCoNER 2019 challenge.

Following Xiong's work [44], we first train our models on the training set and development set, and then further fine-tune the model for 20 epochs. The maximum sentence lengths of the MRC model and the SOne model are set to 250 and 230, respectively. The difference in maximum length is due to the query in the MRC model. The learning rate of BERT is set to 2e-5, and the batch size of all models is set to 20.
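As an illustration of the exact-match micro-averaged measures defined above, here is a minimal Python sketch. It is not the official evaluation tool [43]; the mention tuple format (record id, start offset, end offset, entity type) and the function names micro_prf and f1_from_pr are assumptions made for this example. A prediction counts as a true positive only when its boundaries and its type both match a gold mention.

```python
from typing import Iterable, Tuple

# (record_id, start_offset, end_offset, entity_type); a hypothetical mention format.
Mention = Tuple[str, int, int, str]


def f1_from_pr(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def micro_prf(
    gold: Iterable[Mention], pred: Iterable[Mention]
) -> Tuple[float, float, float]:
    """Micro-averaged P, R and F1 under exact match on boundaries and type."""
    gold_set, pred_set = set(gold), set(pred)
    tp = len(gold_set & pred_set)   # predicted mentions that match gold exactly
    fp = len(pred_set - gold_set)   # predicted mentions with no exact gold match
    fn = len(gold_set - pred_set)   # gold mentions that were not predicted exactly
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f1_from_pr(precision, recall)


# Toy data: the second prediction has the right type but a boundary that is too long.
gold = [
    ("r1", 0, 11, "PROTEINAS"),
    ("r1", 20, 28, "NORMALIZABLES"),
    ("r2", 5, 12, "NORMALIZABLES"),
]
pred = [
    ("r1", 0, 11, "PROTEINAS"),
    ("r1", 20, 35, "NORMALIZABLES"),
    ("r2", 5, 12, "NORMALIZABLES"),
]
print(micro_prf(gold, pred))
print(round(f1_from_pr(0.9225, 0.9050), 4))
```

In the toy data, the boundary error counts as both a false positive and a false negative, so precision, recall and F1 each equal 2/3. The last line checks that a precision of 0.9225 and a recall of 0.9050 give an F1-score that rounds to 0.9137, consistent with the MRC_guideline result reported in this paper.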
The dimension of the entity definition information representation $d_e$ is set to 300. Other parameters are left at their default values. The code is available at [45].

Table 3 presents the results of our proposed MRC and SOne models (lower part) and summarizes some reported results on the PharmaCoNER corpus (upper part). First, the micro-averaged precision, recall and F1-score are 0.915, 0.9055 and 0.9109 for MRC_rule, and 0.9225, 0.9050 and 0.9137 for MRC_guideline. Both MRC_rule and MRC_guideline outperform the baseline model SOne, by 0.44% and 0.72% in micro-averaged F1-score, respectively. MRC_guideline performs better than MRC_rule because of the expert knowledge captured in the guideline definitions. For the extended SOne models, all kinds of entity definition information representation bring improvements over the baseline model SOne. Compared with SOne, the micro-averaged F1-score increases to 0.912 for SOne_rule, 0.9128 for SOne_guideline, and 0.9094 for SOne_w2v. The overall micro-averaged F1-score improvements of the extended SOne models range from 0.29% to 0.63%.

Second, MRC_guideline outperforms all existing systems on the PharmaCoNER corpus, setting a new state of the art and pushing the micro-averaged F1-score of the benchmark to 0.9137. This amounts to a 0.32% absolute improvement over the top-1 system of the PharmaCoNER 2019 challenge, which we developed using many hand-crafted features, and a 1% absolute improvement over our previous system without hand-crafted features [44]. We perform a significance test comparing the model without any features against our MRC model or SOne model, and the results show that the improvement is significant (t-test, p < 0.05) [46]. This implies that entity definition information has a positive impact on entity recognition.

Third, Table 4 shows the detailed results of MRC_guideline and SOne_guideline for each entity type.
Both MRC_guideline and SOne_guideline perform best on NORMALIZABLES and worst on NO_NORMALIZABLES. Although MRC_guideline outperforms SOne_guideline in terms of micro-averaged F1-score, it wrongly predicts all mentions of the NO_NORMALIZABLES type. The probable reason is that the queries of NORMALIZABLES and NO_NORMALIZABLES are too similar, which may confuse our models. Overall, MRC_guideline performs better than SOne_guideline on micro-averaged precision but worse on micro-averaged recall.

Besides, we analyze all our proposed models and find that the SOne model can recognize NO_NORMALIZABLES entities, but the MRC model cannot. This may be because concatenating the entity definition representation benefits entity types with few samples. Compared with previous state-of-the-art models, our model can recognize more named entities thanks to the domain knowledge embedded in the entity definition information. For example, because of the introduction of the PROTEIN information, our model can recognize "timoglobulina (thymoglobulin)", "protrombina (prothrombin)" and so on, which are ignored by previous state-of-the-art models. To visualize the effect of the added domain knowledge, we calculate the cosine similarity of some words based on their word2vec embeddings. For example, the similarity between "protrombina" and "proteínas" is more than 0.5, whereas "protrombina" has a lower similarity with "normalizar" or with words in the question of the UNCLEAR type.

Though the MRC_guideline model outperforms the other models, it still makes errors, mainly of the following five kinds.
(1) About 20% of errors are predicted entities that are not included in the gold test set. Although these predicted entities, such as "vimentina (vimentin)", are ones that have appeared, they count as wrong because they are not officially annotated.
(2) About 30% of errors are entities that the model misses.
(3) About 16% of errors occur when the model predicts the correct entity type, but the boundary is too long.
For instance, the correct entity is "anticuerpos anticitoplasma (cytoplasmic antibodies)", but the model predicts "anticuerpos anticitoplasma de neutrófilo (antineutrophil cytoplasmic antibodies)", or the correct entity is "hormonas de crecimiento (growth hormones)", but the model predicts "hormonas de crecimiento y antidiurética (growth hormones and antidiuretics)".
(4) About 20% of errors occur when the model predicts the correct entity type, but the boundary is too short. For example, "tinción de auramina" is wrongly predicted as "auramina (auramine)", "anticuerpos antimembrana basal glomerular (glomerular basement membrane antibodies)" is wrongly predicted as "anticuerpos antimembrana basal (basal membrane antibodies)", and "(Ig)A-kappa" is wrongly predicted as "Ig".
(5) About 10% of errors occur when the model predicts the wrong entity type. In 70% of these cases, the "NO_NORMALIZABLES" entity type is mistakenly predicted as "NORMALIZABLES", for mentions such as "Viekirax", "Tobradex" and "Harvoni".

This paper proposes two kinds of entity definition information enhanced models, MRC and SOne, for biomedical NER.
Compared with the previous models, our methods do not require features and achieve state-of-the-art performance in terms of micro-averaged F1-score.

PharmaCoNER: Pharmacological Substances, Compounds and proteins Named Entity Recognition track
Long short-term memory RNN for biomedical named entity recognition
Evaluating temporal relations in clinical text: 2012 i2b2 Challenge
Automated systems for the de-identification of longitudinal clinical narratives: overview of 2014 i2b2/UTHealth shared task Track 1
Overview of BioCreative II gene mention recognition
Overview of BioCreAtIvE: critical assessment of information extraction for biology
SemEval-2015 Task 14: Analysis of Clinical Text
The Association for Computer Linguistics
Overview of CCKS 2018 Task 1: Named Entity Recognition in Chinese Electronic Medical Records
HITSZ_CNER: a hybrid system for entity recognition from Chinese clinical text
Automatic de-identification of medical texts in Spanish: the MEDDOCAN track, corpus, guidelines, methods and evaluation of results
Towards a comprehensive medical language processing system: methods and issues
Spickard 3rd A. The KnowledgeMap project: development of a concept-based medical school curriculum database
An overview of MetaMap: historical perspective and recent advances
Feature selection techniques for maximum entropy based biomedical named entity recognition
A Maximum Entropy approach to biomedical named entity recognition
Bio-medical entity extraction using support vector machines
Automatic de-identification of electronic medical records using token-level and character-level conditional random fields
Biomedical Named Entity Recognition using Conditional Random Fields and Rich Feature Sets
Effective adaptation of a hidden Markov model-based named entity recognizer for biomedical domain
Biomedical named entity recognition: a poor knowledge HMM-based approach
Clinical Entity Recognition Using Structural Support Vector Machines with Rich Features
Evaluating word representation features in biomedical named entity recognition tasks
CNN-based ranking for biomedical entity normalization
Biomedical named entity recognition based on extended recurrent neural networks
Cross-type biomedical named entity recognition with deep multi-task learning
A neural network multi-task learning approach to biomedical named entity recognition
Similarity Based Auxiliary Classifier for Named Entity Recognition
Transfer learning in biomedical natural language processing: an evaluation of BERT and ELMo on ten benchmarking datasets
Transfer learning for biomedical named entity recognition with neural networks
Effective use of bidirectional language modeling for transfer learning in biomedical named entity recognition
BioBERT: a pre-trained biomedical language representation model for biomedical text mining
Transfer learning in biomedical natural language processing: an evaluation of BERT and ELMo on ten benchmarking datasets
Bidirectional attention flow for machine comprehension
Attention-over-attention neural networks for reading comprehension
Gated self-matching networks for reading comprehension and question answering
Entity-relation extraction as multi-turn question answering
A Unified MRC Framework for Named Entity Recognition
Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems
A Unified MRC Framework for Named Entity Recognition
Pre-training of Deep Bidirectional Transformers for Language Understanding
A Deep Learning-Based System for PharmaCoNER
When Specialization Helps: Using Pooled Contextualized Embeddings to Detect Chemical and Biomedical Entities in Spanish
Transfer Learning in Biomedical Named Entity Recognition: An Evaluation of BERT in the PharmaCoNER task
Enhancing Neural Sequence Taggers with Attention and Noisy Channel for Robust Pharmacological Entity Detection
Biomedical Named Entity Recognition with Multilingual BERT
IxaMed at PharmacoNER Challenge
A Neural Pipeline Approach for the PharmaCoNER Shared Task using Contextual Exhaustive Models

We thank all anonymous reviewers for suggesting the update of a draft of this manuscript.

This article has been published as part of BMC Bioinformatics Volume 22, Supplement 1 2021: Recent Progresses with BioNLP Open Shared Tasks - Part 2. The full contents of the supplement are available at https://bmcbioinformatics.biomedcentral.com/articles/supplements/volume-22-supplement-1.

The work presented here was carried out in collaboration between all authors. Y.X., S.C. and B.T. designed the methods and experiments. Y.X. and S.C. conducted the experiments. Y.X. analyzed the data and interpreted the results. Y.X. and B.T. wrote the paper. Q.C., X.W., Y.J., and Y.Z. provided detailed edits and critical suggestions. All authors have approved the final manuscript.