key: cord-0488213-tcki6b7p
authors: Jeon, Sung Hwan; Cho, Sungzoon
title: Named Entity Normalization Model Using Edge Weight Updating Neural Network: Assimilation Between Knowledge-Driven Graph and Data-Driven Graph
date: 2021-06-14
journal: nan
DOI: nan
sha: a781d44e780ffe807bcb34989d2fe40849344e79
doc_id: 488213
cord_uid: tcki6b7p

Discriminating matched named entity pairs and identifying entities' canonical forms are critical in text mining tasks. More precise named entity normalization will benefit subsequent text analytic applications. We build a named entity normalization model with a novel Edge Weight Updating Neural Network. Tested on four different datasets, the proposed model achieves state-of-the-art results. We first verify the model's performance on the NCBI Disease, BC5CDR Disease, and BC5CDR Chemical datasets, which are widely used named entity normalization datasets in the bioinformatics field. We also test the model on our own financial named entity normalization dataset to validate its efficacy for more general applications; on this dataset, the task is to decide whether the two named entities in a pair share the same identity. Our model achieves the highest named entity normalization performance in terms of various evaluation metrics.

Text mining technology is undergoing a rapid evolution thanks to the exponential growth in the number of text-rich documents available online, and as a result, it is being widely applied in a range of domains such as finance and bioinformatics. Text mining aims to extract information from documents to derive valuable insights. Documents subject to analysis contain many named entities, which are proper names that denote unique objects such as organizations, products, persons, and locations. The technique used to extract named entities from documents is called named entity recognition (NER, henceforth). Named entity normalization (NEN, henceforth), in turn, involves matching extracted named entities that share the same identity and is pivotal for text mining tasks.

More specifically, in the biomedical domain, disease names and chemicals in drugs often have different surface forms while sharing the same concept. Named entities with different surface forms that share the same concept can be divided into the following categories: (1) synonyms, (2) abbreviations, (3) acronyms, (4) different combinations of punctuations and alphabets, (5) descriptive phrases, and (6) possible NER parsing errors. For example, "hepatomegaly" and "liver enlarged" do not have matching strings, but the two disease names have identical meanings, and thus these two named entities are synonyms. Biomedical named entities have a wide variety of different surface forms compared with entities from other text sources. More accurate named entity normalization techniques will potentially improve the quality of downstream tasks. Moreover, matching entity pairs such as "International Business Machines" and "IBM", which are examples of acronyms, is critical in financial text mining applications. Linking entities with the same identity enables accurate sentiment analysis on firms and products. Furthermore, evaluating the impact of news on the stock market requires a connection between news articles and related firms. Given the wide range of named entities in bioinformatics and finance documents, the total number of tokens to be processed for text clustering and classification is enormous.

arXiv:2106.07549v1 [cs.AI] 14 Jun 2021

The early NEN models explored knowledge-based approaches. Rules for named entity matching that are generated from domain knowledge are valid only for the dataset for which they were created, so rule-based models are not robust to neologisms. To overcome this lack of robustness, models based on machine learning have been introduced. However, machine learning models are limited to specific fields, such as bioinformatics NEN and chemical engineering NEN, due to the lack of NEN datasets in other domains. Our research aims to construct a fully automated NEN model that can be applied to various other domains. To test our model's robustness on a different domain, we also apply it to a NEN dataset in finance.

An automated named entity normalization model reduces the burden of hand-mined information extraction tasks. Clear linkage between entities with different forms, such as abbreviations and acronyms, aids in more accurate sentiment analysis. A named entity normalization model also makes document classification and clustering results more comprehensible. The primary contributions of our study are (1) constructing a better performing NEN model using an Edge Weight Updating Neural Network and (2) applying our proposed model to bioinformatics NEN and financial NEN tasks.

The proposed method, that is, the Edge Weight Updating Neural Network, consists of four parts: (1) ground truth entity graph construction, (2) similarity-based entity graph construction, (3) edge weight updating neural network training, and (4) edge weight updating neural network inferencing. The main concept behind the Edge Weight Updating Neural Network is to minimize the difference between the Ground Truth Entity Graph's edge weight distributions and the Similarity-Based Entity Graph's edge weight distributions. By minimizing the difference between the edge weight distributions of the two graphs, entity embeddings capture more accurate information on the semantic similarity between matching entities.

Our proposed model is evaluated on three widely used bioinformatics datasets (NCBI Disease, BC5CDR Disease, and BC5CDR Chemical), and its performance is compared with other cutting-edge models. Furthermore, to validate the efficacy of our proposed model in general NEN tasks, we construct a financial NEN dataset with state-of-the-art NER using BERT [1]. Using the constructed dataset, we propose a deep learning model to solve more practical financial NEN tasks. Our dataset incorporates major challenges in entity matching: (1) synonyms, (2) abbreviations, (3) acronyms, (4) different combinations of punctuations and alphabets, (5) descriptive phrases, and (6) possible NER parsing errors. Compared with other recent NEN models, our proposed model shows higher accuracies on all datasets used in the experiments. Because the model is tested not only on bioinformatics NEN datasets but also on a financial NEN dataset, the results support its efficacy in general NEN tasks.

The remainder of this paper is organized as follows. Section 2 describes related work. The structure of our proposed model is described in Section 3. Experiment settings for testing model performance are provided in Section 4. Within it, Section 4.1 presents an overview of the datasets we used for evaluation: it gives a brief explanation of the pre-constructed NEN datasets from the bioinformatics domain and describes, with examples, the data preprocessing and the construction of the financial NEN dataset. In Section 5, we present the details of the quantitative and qualitative analyses we conducted on the models. Finally, in Section 6, we present our conclusions.

The bioinformatics, chemical engineering, and materials science domains actively adopt cutting-edge deep learning frameworks for NEN tasks. According to Cho et al. [2], various products exist for recognizing and normalizing named entities in biomedical fields, such as ProMiner [3] and MetaMap [4]. DNorm [5] and TaggerOne [6] also used machine learning models, namely pairwise ranking scoring and semi-Markov models, respectively, for NEN processing. In genetic engineering, GenNorm [7] and GNAT [8] are used to normalize gene names. ChemSpot [9] uses a Conditional Random Field for NER and NEN tasks in chemical engineering. Weston et al. [10] developed the MatScholar Python repository to perform general NLP tasks on materials science texts, which includes entity normalization.

The studies and products above used NEN datasets concentrated on specific domains. ShARe/CLEF [11] is one of the widely used NEN datasets for bioinformatics and is made up of clinical notes. The NCBI [12] dataset contains PubMed abstracts for disease name normalization tasks. TAC2017ADR [13] aims to link identical drug labels. The BC2GM [14], BioNLP09 [15], and BioNLP-OST19 [16] datasets deal with genes, proteins, and bacteria, respectively. In chemical engineering, SCAI [17] and IUPAC [18] are available for research on chemical name matching. Similar to chemical names, Weston et al. [10] developed a dataset for materials engineering to normalize entities to a canonical form.

Applying machine learning algorithms in the financial domain is gaining increasing attention. One major branch is stock movement forecasting using various deep learning mechanisms [19, 20]. Thanks to the rapid development of unstructured data processing techniques, studies applying text mining techniques to financial fields have increased in number. Gupta et al. [21] illustrated the trends in applying text mining in finance. Among the many related text mining applications in finance, NEN can be applied to a variety of financial research and practice. When text mining techniques are applied to real-world problems, NER and NEN models are run first as preprocessing steps. However, NEN datasets for the financial domain are scarce, and there is a need to develop a dataset targeting financial NEN.

Many researchers have developed targeted datasets for more general NEN tasks in domains such as user comments, product description, and financial invoices. For example, in their study, Jijkoun et al. [22] used user comments from newspaper websites. Sun et al. [23] performed normalization of product entity names, for which the dataset was developed by the authors. The study conducted by Francis et al. [24] on financial invoices is the most relevant one to our study. However, Francis et al. focused on insurance, telecommunications, banking, and tax companies using the following entities: International Bank Account Number (IBAN) of the beneficiary, invoice number, invoice date, and due date [24] . The focus of our study is on more general financial entity normalization, which covers entities from all financial sectors. Previous studies using the datasets illustrated above used various machine learning and deep learning models.

There are similarities between the string matching methodologies in various other fields and NEN research. Sun et al. [23] proposed NEN for product names using a pre-constructed product entity linkage dictionary. In semantic string matching, Siamese Neural Networks are widely used [25, 26, 27]. Krivosheev et al. [28] used a Siamese Graph Neural Network for company name normalization. We need to extend NEN on company names to NEN on a wide range of product names and legal entities. A Siamese RNN model successfully captures the morphological similarity between strings [29]. Niu et al. [30] applied attention mechanisms for medical concept normalization. Furthermore, the evolution of Transformer-based models enables the adoption of pre-trained language models such as BERT [1] for entity linking problems [31].

The major developments in recent NEN research are as follows. D'Souza et al. [32] proposed an early NEN model using a rule-based approach, which requires comparatively more human input when generating the rules. The model is static, and thus new rules may need to be created when applying the model to other datasets. NEN models that use more advanced machine learning and deep learning techniques can be more effective. Leaman et al. [6] used a semi-Markov model, Li et al. [33] used a word-level CNN model, and Wright et al. [34] and Phan et al. [35] used models based on BiGRU and BiLSTM, respectively. Meanwhile, BERT achieved state-of-the-art performance in many general text mining and natural language processing (NLP) challenges. Compared with the four models above, the most recent studies, such as the BERT Ranking model [36] and BioSyn [37], take full advantage of BERT by training the model on BERT embeddings. The BERT Ranking model [36] used a ranking-based objective function, and BioSyn [37] used Synonym Marginalization as the objective function for training. Our proposed model optimizes BERT embedding vectors with an edge weight updating neural network over named entity graphs. Our proposed model successfully captures the ground truth linkage between named entity graphs, achieving the highest accuracies. Previous NEN research focuses mainly on NEN datasets from a specific domain. To test the efficacy of our model in more general NEN tasks, we evaluate our model with NEN datasets from both the bioinformatics domain and the financial domain.

Many NEN studies explore semi-supervised learning models. Our proposed model is motivated by one of the leading semi-supervised models on images, the Edge-Labeling Graph Neural Network for Few-shot Learning [38] (EGNN). The major difference between EGNN and our model is that EGNN labels an edge in each round of training, whereas our model updates the edge weights for the top K connected entities. By capturing more node and edge information simultaneously in each round of training, the proposed model shows better performance compared with other NEN models.

Our proposed model, the Edge Weight Updating Neural Network, consists of four major parts. The basic idea behind the Edge Weight Updating Neural Network is to minimize the difference between the Ground Truth Entity Graph's edge weight distributions and the Similarity-Based Entity Graph's edge weight distributions. Entity embeddings are trained with the Kullback-Leibler divergence [39] loss between the two graphs. Detailed steps for constructing the Ground Truth Entity Graph, building the Similarity-Based Entity Graph, and training and inferencing the Edge Weight Updating Neural Network are presented in Sections 3.1, 3.2, 3.3, and 3.4, respectively.

Ground Truth Entity Graph construction is based on the mentions (entities) in each dataset and their concept IDs. Figure 1 demonstrates the steps for building the graph.

For the NEN corpus, each entity is annotated with one or more concept IDs. For example, in Figure 1, entities A, B, and C share the same concept ID, ID_1. Then, entities A, B, and C are fully connected in the entity graph. Other entity pairs, D-E (concept ID: ID_2) and F-G (concept ID: ID_3), are linked. The training dataset for each NEN corpus has query entities with corresponding concept IDs. If query entity Q has a concept ID of ID_1, then query entity Q will be linked to entities A, B, and C in the pre-constructed graph. As the constructed graph is the ground truth graph, each edge weight in the graph is 1.

We iterate over all the entities in the training sets, which include the referencing dictionary entity table and the query entity table. The graph created by the steps above is the Ground Truth Entity Graph, which is the reference, or target, graph that the Similarity-Based Entity Graph will try to match.
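The construction above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `build_ground_truth_graph`, the dictionary-of-lists input format, and the toy annotations (which mirror the Figure 1 description) are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_ground_truth_graph(entity_to_concepts):
    """Return {frozenset({a, b}): 1.0} for every entity pair sharing a concept ID."""
    # Group entities by concept ID; an entity may carry several IDs.
    members = defaultdict(set)
    for entity, concept_ids in entity_to_concepts.items():
        for concept_id in concept_ids:
            members[concept_id].add(entity)

    # Fully connect each group; every ground-truth edge has weight 1.
    edges = {}
    for group in members.values():
        for a, b in combinations(sorted(group), 2):
            edges[frozenset((a, b))] = 1.0
    return edges


# Hypothetical annotations: dictionary entities A-G plus one query entity Q.
annotations = {
    "A": ["ID_1"], "B": ["ID_1"], "C": ["ID_1"],
    "D": ["ID_2"], "E": ["ID_2"],
    "F": ["ID_3"], "G": ["ID_3"],
    "Q": ["ID_1"],  # query entity from the training set
}
graph = build_ground_truth_graph(annotations)
```

Here Q ends up linked to A, B, and C, while pairs from different concept IDs (for example A and D) have no edge.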

For each query entity, the Similarity-Based Entity Graph is constructed as follows. Graph edges are calculated using BERT embedding vector similarities. We use BioBERT [40] for the initial BERT embeddings of the bioinformatics NEN corpora and the original BERT [1] for the initial BERT embeddings of the financial NEN corpus. As in Figure 2, let the embedding of query entity Q be a vector of size 768 (the fixed vector length of BERT embeddings), Embed_Q = (X_Q1, X_Q2, ..., X_Q768). Similarly, the BERT-based embedding of an entity in the dictionary set is denoted as Embed_entity = (X_entity1, X_entity2, ..., X_entity768).

To calculate the edge weights based on entity similarities, we calculate inner products between query entities and dictionary entities. Let < , > denote the inner product and let Sim_Q be the set of similarities between query entity Q and all the entities in a dictionary; the similarity between each query entity and each dictionary entity is then expressed as Equation 1.

We normalize each similarity score by dividing it by the maximum similarity score in the query entity's similarity score set, Sim_Q. For the Similarity-Based Entity Graph, the top K edges based on similarity score are selected. The highlighted blue region in the entity similarity table for query entity Q in Figure 2 demonstrates the edge weight determination steps when K = 5. Mathematically, edge weights are calculated using Equation 2.
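Under the definitions above, Equations 1 and 2 can be written as follows. This is a reconstruction from the surrounding text rather than a quotation; the symbol w_{Q,entity} for an edge weight and the case-split notation for top K selection are notational assumptions.

```latex
% Equation 1: similarity set of query entity Q
\mathrm{Sim}_Q = \left\{ \langle \mathrm{Embed}_Q, \mathrm{Embed}_{entity} \rangle \;:\; entity \in \text{dictionary} \right\},
\qquad
\langle \mathrm{Embed}_Q, \mathrm{Embed}_{entity} \rangle = \sum_{i=1}^{768} X_{Q i}\, X_{entity\, i}

% Equation 2: edge weight between Q and a dictionary entity
w_{Q,entity} =
\begin{cases}
\dfrac{\langle \mathrm{Embed}_Q, \mathrm{Embed}_{entity} \rangle}{\max \mathrm{Sim}_Q}
  & \text{if } entity \text{ is among the top } K \text{ entries of } \mathrm{Sim}_Q \\[2ex]
0 & \text{otherwise}
\end{cases}
```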

For each training epoch, which is illustrated in Section 3.3, edge weights are updated. Updated entity embedding vectors generate new similarity scores that alter the edge weights in the graph.
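The edge weight computation for one query entity can be sketched as below. It is a minimal pure-Python illustration that assumes embeddings are already available as plain lists of floats; the 3-dimensional toy vectors stand in for the 768-dimensional BERT embeddings, and the names `inner` and `similarity_edges` are hypothetical. It also assumes the maximum similarity is positive.

```python
def inner(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def similarity_edges(query_vec, dictionary, k):
    """Return {entity: edge weight} for the top-k dictionary entities of one query."""
    # Inner product between the query and every dictionary embedding.
    sims = {name: inner(query_vec, vec) for name, vec in dictionary.items()}
    # Normalise by the largest score in this query's similarity set
    # (assumed to be positive in this sketch).
    max_sim = max(sims.values())
    # Keep only the k most similar entities as edges.
    top = sorted(sims, key=sims.get, reverse=True)[:k]
    return {name: sims[name] / max_sim for name in top}


# Toy 3-dimensional embeddings (hypothetical values).
query = [1.0, 0.0, 1.0]
dictionary = {
    "A": [1.0, 0.0, 1.0],
    "B": [0.5, 0.5, 0.5],
    "C": [0.0, 1.0, 0.0],
    "D": [0.5, 1.0, 0.0],
}
edges = similarity_edges(query, dictionary, k=2)
```

Because the embeddings change after every training epoch, a function like this would be called again each epoch to refresh the edge weights.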

The main concept of the Edge Weight Updating Neural Network is to minimize the difference between the discrete distributions of edge weights for each query entity in the Ground Truth Entity Graph and in the Similarity-Based Entity Graph. As illustrated in Section 3.2, edge weights are calculated from the entities' embeddings. In each training epoch of the Edge Weight Updating Neural Network, the baseline BERT model's parameters are optimized to mimic the ground truth edge weight distributions. Figure 3 shows the training process of our proposed model when the number of connected edges in the Similarity-Based Entity Graph is 5 (K = 5). Following the example in Section 3.2, query entity Q is connected to dictionary entities A, B, C, D, and F, and the edge weights are 0.8, 0.9, 0.6, 0.7, and 0.5, respectively. Given the Ground Truth Entity Graph in Section 3.1, the true edge weights for the edges between query entity Q and dictionary entities A, B, C, D, and F are 1, 1, 1, 0, and 0, respectively. During training, BERT parameters are tuned to bring the edge weight distributions in the Similarity-Based Entity Graph closer to the ground truth edge weight distributions. We use the Kullback-Leibler Divergence Loss [39] (KL divergence loss, henceforth) for training our model. As the edge weight distribution is discrete, we normalize the edge weights using the Softmax function.

We denote a graph as G, an entity as V, and an edge as E. The Ground Truth Entity Graph and the Similarity-Based Entity Graph are denoted as G_GT = (V_GT, E_GT) and G_Sim = (V_Sim, E_Sim), respectively. The adjacency matrices for the Ground Truth Entity Graph and the Similarity-Based Entity Graph are denoted GT_A and Sim_A. P_Sim_Edge_Q is the discrete distribution of edge weights of Q in the Similarity-Based Entity Graph. P_GT_Edge_Q is the discrete distribution of edge weights of Q in the Ground Truth Entity Graph. Our KL divergence loss is calculated using Equation 3.
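With this notation, one plausible form of Equation 3 is given below. It is reconstructed from the surrounding definitions and should be read as an assumption: in particular, the direction of the divergence (ground truth distribution as the reference) and the summation over query entities are not stated explicitly in the text.

```latex
\mathcal{L}_{KL}
= \sum_{Q} D_{KL}\!\left( P_{GT\_Edge_Q} \,\middle\|\, P_{Sim\_Edge_Q} \right)
= \sum_{Q} \sum_{i} P_{GT\_Edge_Q}(i) \, \log \frac{P_{GT\_Edge_Q}(i)}{P_{Sim\_Edge_Q}(i)},
\qquad
P_{GT\_Edge_Q} = \mathrm{softmax}\!\left( GT\_A_Q \right), \quad
P_{Sim\_Edge_Q} = \mathrm{softmax}\!\left( Sim\_A_Q \right)
```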

Here, A_Q is the edge weight vector connected to Q for a given query entity node Q. We use an Adam optimizer with weight decay [41], and set the batch size to 16 and the number of connected edges in the Similarity-Based Entity Graph to 30 (K = 30) for all datasets we test. We train our model for 50 epochs. The best scores are reported in Section 4.
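As a worked example of the loss, the sketch below applies Softmax normalization and KL divergence to the K = 5 edge weights of query entity Q from Section 3.3. It is a numerical illustration only: the direction KL(ground truth || similarity) is an assumption, and no gradient computation or BERT fine-tuning is shown.

```python
import math


def softmax(weights):
    """Turn a vector of edge weights into a discrete probability distribution."""
    shift = max(weights)  # subtract the maximum for numerical stability
    exps = [math.exp(w - shift) for w in weights]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same edges."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Edge weights of Q to dictionary entities A, B, C, D, F (values from the text).
sim_weights = [0.8, 0.9, 0.6, 0.7, 0.5]
gt_weights = [1.0, 1.0, 1.0, 0.0, 0.0]

p_sim = softmax(sim_weights)
p_gt = softmax(gt_weights)
loss = kl_divergence(p_gt, p_sim)

# Hypothetical weights after further training: closer to the ground truth.
closer_loss = kl_divergence(p_gt, softmax([0.9, 0.9, 0.9, 0.2, 0.2]))
```

The loss is zero only when the two normalized distributions coincide, and it shrinks as the similarity-based weights of A, B, and C rise relative to those of D and F.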

First, the fine-tuned BERT embeddings illustrated in Section 3.3 are used to embed unseen query entities in test sets. With the newly computed BERT embedding vectors, we repeat the steps in Section 3.2 to construct a new Similarity-Based Entity Graph. For each query entity, the dictionary entity with the highest edge weight is returned as a synonym. Figure 4 demonstrates the inferencing process of the Edge Weight Updating Neural Network.

The three datasets summarized below contain bioinformatics-related mentions (entities) with unique concept IDs. The main goal of these datasets is to identify the mentions that share the same concept IDs. We follow the NEN preprocessing convention for the datasets below, in which mentions that do not exist in the concept dictionary are eliminated [35]. Bioinformatics NEN datasets usually consist of train, development, and test sets. Following previous studies, we use the train and development sets for training our model. Test sets are used for evaluation. Table 1 shows detailed statistics of the NCBI Disease corpus.

BioCreative V CDR Disease and BioCreative V CDR Chemical [42]. The BC5CDR corpus is organized for the challenging tasks of disease named entity recognition and chemical-induced disease relation extraction. The BC5CDR corpus consists of 1,500 PubMed articles with 4,409 annotated chemicals, 5,818 diseases, and 3,116 chemical-disease interactions [42]. The dataset contains a disease mention corpus and a chemical mention corpus. Disease mentions are mapped to MeSH IDs, similar to the NCBI Disease corpus. Chemical mentions are annotated using the Comparative Toxicogenomics Database (CTD) [43]. Mentions that share the same disease concept or chemical concept, based on MeSH ID and CTD ID, are considered synonyms. Detailed statistics of both the BC5CDR Disease corpus and the BC5CDR Chemical corpus are given in Table 1.

There are no publicly open financial NEN datasets available; therefore, we constructed our own financial NEN dataset to test the performance of our proposed model in NEN tasks other than the bioinformatics domain.

Overview. We construct the dataset for the financial NEN task from the annual reports (Form 10-K) of Standard and Poor's 500 listed companies. We aim to build a dataset that fulfills the need for financial NEN; the dataset includes (1) synonyms, (2) abbreviations, (3) acronyms, (4) different combinations of punctuations and alphabets, (5) descriptive phrases, and (6) possible NER parsing errors. Detailed explanations of the primary data sources, data preprocessing steps, and dataset construction procedures follow. Fig. 5 demonstrates the overall flow diagram for NEN dataset construction.

Data Source. We gather the year 2019 Form 10-Ks (published in early 2020) of S&P 500 companies from the U.S. Securities and Exchange Commission (SEC) website, which is open to the public. We parse the business section of each [44]. The outputs of the BERT NER model are WordPiece tokens that we have to link together with specified rules. There are four entity types in the CoNLL-2003 dataset, persons (PER), organizations (ORG), locations (LOC), and miscellaneous names (MISC), plus one tag for tokens outside any named entity (O). We detect entities with ORG and MISC tags. From the year 2019 10-Ks of S&P 500 firms, we parse a total of 41,593 named entities.
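Linking WordPiece tokens back into entity strings can be sketched as follows. The linking rules actually used for the dataset are not spelled out here, so this is only an illustration under common conventions: it assumes continuation pieces are marked with a leading "##", that each piece carries a plain type tag (no B-/I- prefixes), and that adjacent pieces with the same kept tag belong to one entity. The function name and the example tokens are hypothetical.

```python
def merge_wordpieces(tokens, tags, keep=("ORG", "MISC")):
    """Join WordPiece tokens into entity strings for the kept tags."""
    entities, current, current_tag = [], [], None
    for token, tag in zip(tokens, tags):
        if token.startswith("##") and current:
            # Continuation piece: glue onto the previous piece without a space.
            current[-1] += token[2:]
        elif tag in keep and tag == current_tag:
            # Same entity type continues: start a new word in the same entity.
            current.append(token)
        else:
            # Entity boundary: flush the entity collected so far, if any.
            if current:
                entities.append(" ".join(current))
            current, current_tag = ([token], tag) if tag in keep else ([], None)
    if current:
        entities.append(" ".join(current))
    return entities


tokens = ["Coca", "-", "Cola", "sells", "Sp", "##rite", "in", "Paris"]
tags = ["ORG", "ORG", "ORG", "O", "MISC", "MISC", "O", "LOC"]
entities = merge_wordpieces(tokens, tags)  # LOC is dropped, ORG and MISC are kept
```

Joining words with single spaces leaves spaces around punctuation tokens, which is one way surface variants such as differing punctuation can enter the entity list.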

With the named entities recognized as described in Section 4.1.2, we construct the financial named entity normalization dataset. As mentioned in Section 4.1.2, our focus is to build a NEN dataset that meets the need for general text mining in finance; the dataset includes (1) synonyms, (2) abbreviations, (3) acronyms, (4) different combinations of punctuations and alphabets, (5) descriptive phrases, and (6) possible NER parsing errors. We hand label a total of 7,155 unique named entities into 2,600 groups, with each group sharing the same identity. Table 2 shows three examples in our dataset for each type of named entity that needs to be normalized.

• Synonyms: There exist entities with the suffix "®" or "™". "Coca-Cola ®" and "Coca-Cola" are the same entity. In addition, "COVID-19 Pandemic" and "COVID-19" should be linked. We generalize over product model numbers, so that "iPhone 11 Pro Max" and "iPhone ®" are considered identical entities.

• Abbreviations: Most abbreviations occur for abridging "Company" to "Co.", "Corporation" to "Corp.", and "Incorporated" to "Inc.".

• Acronyms: Acronyms are one of the most challenging NEN tasks. Multiple acronyms appear in financial documents. We avoid matching an acronym if multiple original entities can be assigned to it. For example, "Advanced Development Programs ( ADP )" and "Automatic Data Processing, Inc. ( ADP )" share the same acronym, "ADP", but these should not be linked together.

• Combinations of punctuations: The problem of different combinations of punctuations can be solved using rule-based approaches. However, there are many entities with a combination of punctuations. ",", ".", and "&" are commonly found and used interchangeably.

• Descriptive phrases: Parsed named entities frequently include descriptive phrases. With or without descriptive phrases, the root, that is, the identified entity, is invariable.

• NER parsing errors: No NER model or entity concatenation model is perfect. If NER is conducted manually, human errors are possible too. In our dataset, one common error the model makes is appending the token that follows a "-" token. Correcting NER parsing errors is one of the important targets our NEN model aims to achieve.
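To make the rule-based side of these categories concrete, the sketch below maps an entity string to a crude canonical key using a few hand-written rules for trademark suffixes, the listed abbreviations, and punctuation. It is a hypothetical baseline, not part of the proposed model: the rule table, the function name `surface_key`, and the treatment of "&" as "and" are assumptions.

```python
import re

# Hypothetical rule table built from the abbreviation examples above.
ABBREVIATIONS = {"co": "company", "corp": "corporation", "inc": "incorporated"}


def surface_key(entity):
    """Map an entity string to a crude canonical key using hand-written rules."""
    # Drop trademark suffixes and ignore case.
    text = entity.replace("®", "").replace("™", "").lower()
    # Assumption: "&" and "and" are interchangeable.
    text = text.replace("&", " and ")
    # Treat "," and "." as separators.
    text = re.sub(r"[.,]", " ", text)
    # Expand the known abbreviations word by word.
    words = [ABBREVIATIONS.get(word, word) for word in text.split()]
    return " ".join(words)


same_trademark = surface_key("Coca-Cola ®") == surface_key("Coca-Cola")
same_abbreviation = surface_key("Apple Inc.") == surface_key("Apple, Incorporated")
same_synonym = surface_key("Paris Agreement") == surface_key("Paris Climate Accords")
```

Rules of this kind link the trademark and abbreviation variants but leave `same_synonym` false, which illustrates why synonyms, acronyms, and descriptive phrases call for a learned model.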

Hand-matched entity pairs are labeled positive. We also added negatively labeled pairs in which two entities have no relationship. A total of 25,000 pairs with 10,825 positive matching pairs and 14,175 negative pairs are created. We separate entity groups for a train set and test set in which there are no overlapping groups. This eliminates possible training bias, especially when training the model with entities' graph topology. Table 3 shows the statistics of our financial NEN dataset.
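The pair construction and the group-disjoint split can be sketched as below. This is an illustration under assumptions rather than the exact procedure used for the dataset: negatives are drawn here by randomly pairing entities from different groups, and the group names, split fraction, and function names are hypothetical.

```python
import random
from itertools import combinations


def split_groups(groups, test_fraction, seed=0):
    """Split group IDs so that train and test share no entity group."""
    ids = sorted(groups)
    random.Random(seed).shuffle(ids)
    n_test = max(1, int(len(ids) * test_fraction))
    return ids[n_test:], ids[:n_test]


def make_pairs(groups, group_ids, n_negative, seed=0):
    """Build (entity_a, entity_b, label) triples from the given groups only."""
    rng = random.Random(seed)
    # Positive pairs: every pair of entities inside the same group.
    positives = [(a, b, 1) for g in group_ids for a, b in combinations(groups[g], 2)]
    # Negative pairs: entities sampled from two different groups (assumption).
    negatives, attempts = set(), 0
    while len(group_ids) > 1 and len(negatives) < n_negative and attempts < 100 * n_negative:
        attempts += 1
        g1, g2 = rng.sample(group_ids, 2)
        negatives.add(tuple(sorted((rng.choice(groups[g1]), rng.choice(groups[g2])))))
    return positives + [(a, b, 0) for a, b in sorted(negatives)]


# Hypothetical hand-labeled groups; each group shares one identity.
groups = {
    "g1": ["IBM", "International Business Machines"],
    "g2": ["Coca-Cola", "Coca-Cola ®", "The Coca-Cola Company"],
    "g3": ["Apple Inc", "Apple Inc."],
    "g4": ["Paris Agreement", "Paris Climate Accords"],
}
train_ids, test_ids = split_groups(groups, test_fraction=0.5)
train_pairs = make_pairs(groups, train_ids, n_negative=3)
test_pairs = make_pairs(groups, test_ids, n_negative=2)
```

Splitting by group before generating pairs guarantees that no entity seen in training reappears in a test pair.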

We compare our proposed model's performance with seven different biomedical NEN models. The accuracy scores presented in this study are taken from the original papers. A summary of each model is given in Table 4.

The dataset we used is covered in Section 4.1.2. Table 5 describes how each model used for NEN in finance is tested. The experiments are conducted using an Intel Core i9-10940X CPU with 128 GB of memory and three NVIDIA GeForce Titan RTX GPUs. To avoid possible biases caused by exogenous variables, we use the same settings for all models where applicable.

We conduct both quantitative and qualitative analyses. For the NCBI Disease, BC5CDR Disease, and BC5CDR Chemical datasets, we compare our proposed model's scores with those of previous studies. Results on the bioinformatics datasets are reported as top-one recommendation accuracy. Given the biomedical entity in the train set, entities are matched with the most similar entities in datasets. If the query entity and the target entity share the same concept ID, the match is considered correct. The financial NEN dataset is a pairwise NEN matching corpus. For evaluations on the financial NEN dataset, the models decide whether the two named entities in a pair share an identical meaning or not. We also perform a qualitative analysis to assess the models' weaknesses. Table 6 shows a performance comparison between our proposed model and previous state-of-the-art models. For the three bioinformatics datasets, our proposed model achieved the highest accuracy. The largest performance increase, 0.6%, was on the NCBI Disease corpus. For the BC5CDR Disease and BC5CDR Chemical corpora, the performance increases compared with the previous state-of-the-art model are 0.2% and 0.1%, respectively.
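The top-one accuracy criterion described above can be sketched as follows. The concept IDs, the predictions, and the function name are hypothetical; the only rule taken from the text is that a recommendation counts as correct when the query entity and the returned entity share a concept ID.

```python
def top1_accuracy(predictions, query_concepts, dictionary_concepts):
    """Fraction of queries whose top-one recommendation shares a concept ID."""
    correct = 0
    for query, predicted in predictions.items():
        # An entity may carry several concept IDs, so compare as sets.
        if set(query_concepts[query]) & set(dictionary_concepts[predicted]):
            correct += 1
    return correct / len(predictions)


# Hypothetical top-one recommendations and concept annotations.
predictions = {"liver enlarged": "hepatomegaly", "htn": "hypotension"}
query_concepts = {"liver enlarged": ["ID_1"], "htn": ["ID_2"]}
dictionary_concepts = {"hepatomegaly": ["ID_1"], "hypotension": ["ID_3"]}

accuracy = top1_accuracy(predictions, query_concepts, dictionary_concepts)
```

In this toy case one of the two recommendations shares a concept ID with its query, so the accuracy is 0.5.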

The NCBI Disease corpus is a comparatively harder dataset, judging by the performance of other models. We conclude that it is significant to increase the accuracy on a relatively lower performing dataset. The previous model already performs excellently on the BC5CDR corpus, with accuracy scores ranging from 93.2% to 96.6%, so performance increases on these datasets can only be marginal.

Table 4: Summary of the biomedical NEN models used for comparison.
Sieve-based [32]: This is one of the earliest NEN papers. The research was conducted with 10 sieves, which is mostly a rule-based approach. Many studies published after this research follow similar preprocessing steps.
TaggerOne [6]: TaggerOne used the semi-Markov model for both NER and NEN tasks. TaggerOne was originally validated on the NCBI Disease and BC5CDR corpora.
CNN Ranking [33]: The CNN Ranking model used a word-level deep learning approach for NEN. This research did not perform better than the previous model, TaggerOne. However, it was the first study that applied deep learning to NEN tasks.
NormCo [34]: NormCo used BiGRU, which is considered to be a better performing deep learning model for text data. NormCo achieved similar accuracy scores with significantly fewer parameters.
BNE [35]: BNE introduced a two-level BiLSTM to capture both character-level and word-level information of biomedical entities, achieving increased NEN performance.
BERT Ranking [36]: The BERT Ranking model is based on Transformer-based embeddings and uses the pre-trained BERT [1], BioBERT [40], and ClinicalBERT [45] for its entity embeddings. For each entity, candidate concepts are retrieved, and three different BERT models are fine-tuned to rank them and capture the ground truth concepts.
TripletNet [46]: The concept of TripletNet [47] for semi-supervised learning was introduced for NEN tasks. This study uses a CNN for entity embedding, and the shared CNN parameters are trained with the TripletNet structure.
BioSyn [37]: BioSyn uses BioBERT for entity embeddings and is trained with Synonym Marginalization. Marginal Maximum Likelihood (MML) is the objective function for Synonym Marginalization.

Table 5: Models tested for NEN in finance.
Edit Distance [48]: Edit Distance is suitable for basic NEN tasks such as linking "Apple Inc" and "Apple Inc.". However, Edit Distance can only capture the superficial morphological similarity between two entities. In our experiment, we calculate the Edit Distance between the two entities of a pair and train a simple classifier to determine their equivalence.
BERT [1]: BERT is a state-of-the-art model for various NLP tasks. However, for our specific tasks, the BERT model is limited in capturing the morphological similarity between entity pairs. We use pre-trained BERT vectors of size 768 and train a simple MLP classifier with batch size 4096 to determine the linkage between entity pairs.
Siamese GCN [49]: We use the entity graph illustrated in Section 3.2, with a pre-trained BERT vector as each entity node vector. A 2-layer Siamese GCN is used in our experiment, with 256 hidden nodes for the first GCN layer and 16 hidden nodes for the second GCN layer. GCN requires more epochs for training, so we train for 120 epochs on the full dataset (full batch: 17,500 entity pairs). The learning rate of the Adam optimizer for GCN is 0.01.
Siamese BiLSTM [50]: For character-level Siamese BiLSTM model training, we one-hot encode the characters of entity strings using 85 unique tokens. We stack two BiLSTM layers. The BiLSTM cells in the first layer return 64-dimensional hidden state outputs, and the BiLSTM cells in the second layer return 16-dimensional hidden state outputs. To prevent overfitting, we train the BiLSTM model for 12 epochs. The BiLSTM model is trained with a learning rate of 0.001. The embedding dimension, 16, is the same as for GCN.

Table 7 shows the performance of each model we test. The evaluation metrics are precision, recall, F-score, and accuracy.
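Assuming the standard definitions, with TP, TN, FP, and FN denoting the numbers of true positives, true negatives, false positives, and false negatives, and taking the F-score to be the balanced F1 measure, these metrics can be written as:

```latex
\mathrm{Precision} = \frac{TP}{TP + FP},
\qquad
\mathrm{Recall} = \frac{TP}{TP + FN}

\mathrm{F\text{-}score} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}},
\qquad
\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
```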

False positive indicates that two entities should not be matched, but our proposed model decided to link two entities. False negative indicates that two entities should be matched, but our proposed model failed to link two entities.
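A small worked example shows how false positives and false negatives enter the pairwise scores. The labels below are made up for illustration, and the F-score is assumed to be the balanced F1 measure.

```python
def pairwise_metrics(y_true, y_pred):
    """Precision, recall, F-score, and accuracy for binary match labels (1 = match)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # linked, but should not be
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # should be linked, but is not
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_score = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    accuracy = (tp + tn) / len(y_true)
    return {"precision": precision, "recall": recall, "f_score": f_score, "accuracy": accuracy}


# Hypothetical labels for eight entity pairs.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0]
metrics = pairwise_metrics(y_true, y_pred)
```

This conservative predictor makes no false positives, so its precision is 1.0, while its two false negatives hold recall at 0.5.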

For practical use in the NEN model in the finance domain, a model with higher precision should be rewarded more.

In practice, a model with higher precision will reduce the burden on practitioners by giving more reliable entity-matching results. A model with higher precision will reduce the time spent double-checking the validity of entity pairs marked as matched.

Edit Distance had the lowest scores across all performance evaluation indicators. The Graph Convolutional Network we use for the experiments adopts the BERT vector as the entity node feature. BERT and GCN have a similar recall, but GCN has higher precision, which brings a higher F-score and accuracy compared with BERT. Our proposed model achieved the highest precision, F-score, and accuracy. Among all the models, our proposed model is the only one with a precision score over 90%. Therefore, our proposed model is the most suitable for practical use.

In error analysis, we report the entities for which accurate recommendations are not made, aiming to recognize the patterns of cases where recommendations fail. Table 8 lists the errors in the three bioinformatics NEN datasets. Our proposed model achieves approximately 90% accuracy on all three datasets. However, finding the synonyms of short abbreviations such as "cdm", "htn", and "dph" is relatively harder. In addition, the performance of the model degrades when longer overlapping strings exist.

Financial NEN datasets are constructed using entity pairs. Our model predicts whether the two entities in a pair match. Table 9 is divided into a false-positive list and a false-negative list. Examining the false-positive list shows that entities with similar meanings or with matching strings are often predicted as positive although the actual label is negative.

We also examine the false negatives. Matching named entities that contain parentheses or abbreviations is where our model's prediction is relatively unstable. Entity pairs such as "Paris Climate Accords" and "Paris Agreement" can be more difficult to predict as positive because understanding the intrinsic meaning of "Paris Agreement" requires common sense. Even though our model is based on BERT, which captures semantic meaning from the sentences in which the named entities appear, its ability to use common sense beyond the information presented in the surrounding sentences is limited.
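The split of prediction errors into false-positive and false-negative lists can be sketched as follows. The function name `split_errors`, the labels, the predictions, and the first two example pairs are hypothetical and only illustrate the bookkeeping; they are not taken from Table 9.

```python
from typing import List, Sequence, Tuple

Pair = Tuple[str, str]


def split_errors(
    pairs: Sequence[Pair],
    labels: Sequence[int],
    predictions: Sequence[int],
) -> Tuple[List[Pair], List[Pair]]:
    """Return (false_positives, false_negatives) for binary match labels.

    A label or prediction of 1 means "the two entities match".
    """
    false_positives: List[Pair] = []
    false_negatives: List[Pair] = []
    for pair, label, pred in zip(pairs, labels, predictions):
        if pred == 1 and label == 0:
            false_positives.append(pair)   # linked, but should not be
        elif pred == 0 and label == 1:
            false_negatives.append(pair)   # not linked, but should be
    return false_positives, false_negatives


# Hypothetical toy data for illustration only.
example_pairs = [
    ("Bank A", "Bank A Holdings"),                   # assumed non-match
    ("Apple Inc", "Apple Inc."),                     # assumed match
    ("Paris Climate Accords", "Paris Agreement"),    # match
]
example_labels = [0, 1, 1]
example_predictions = [1, 1, 0]

if __name__ == "__main__":
    fps, fns = split_errors(example_pairs, example_labels, example_predictions)
    print("False positives:", fps)
    print("False negatives:", fns)
```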

As the number of training epochs increases, recommendations become more accurate. We randomly selected entities from the four datasets we tested. The top 5 recommendations for the selected entities are provided for epoch 0, epoch 1, and the epoch with the best result in Section 5.1 and Section 5.2; bold-underlined entities are those with the same concept ID as the query entity. Across the datasets, at epoch 0, the concept IDs of the recommended entities differ greatly from that of the query entity. As the model is trained, the recommendations become more accurate at epoch 1. At the epochs in which the highest accuracy for each dataset is achieved, true synonyms of the query entities are successfully selected.
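The top-5 inspection described above can be sketched as ranking candidate entities by a similarity score and flagging those that share the query's concept ID. This sketch assumes cosine similarity over fixed toy vectors as a stand-in for the scores produced by our model; the function names, the vectors, and the concept IDs are all hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def top_k_recommendations(
    query: str,
    vectors: Dict[str, Sequence[float]],
    concept_ids: Dict[str, str],
    k: int = 5,
) -> List[Tuple[str, float, bool]]:
    """Return (entity, score, same_concept) for the k entities most
    similar to the query, best first."""
    scored = [
        (name, cosine(vectors[query], vec))
        for name, vec in vectors.items()
        if name != query
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return [
        (name, score, concept_ids[name] == concept_ids[query])
        for name, score in scored[:k]
    ]


# Hypothetical toy vectors and concept IDs, for illustration only.
toy_vectors = {
    "htn": [1.0, 0.1],
    "hypertension": [0.9, 0.2],
    "high blood pressure": [0.8, 0.3],
    "diabetes": [0.1, 1.0],
}
toy_concept_ids = {
    "htn": "C1",
    "hypertension": "C1",
    "high blood pressure": "C1",
    "diabetes": "C2",
}

if __name__ == "__main__":
    for name, score, same in top_k_recommendations("htn", toy_vectors, toy_concept_ids):
        marker = "*" if same else " "  # "*" plays the role of bold-underline
        print(f"{marker} {name}: {score:.3f}")
```

Running the same ranking on the vectors from successive epochs would show whether the flagged entities move toward the top of the list as training proceeds.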

Based on our experiments, our proposed model has the highest precision, recall, F1 score, and accuracy. Qualitative analysis shows that our proposed model also gives the most robust results. Our proposed model is most suitable for tasks such as financial named entity normalization automation and preprocessing for various financial NLP tasks.

We introduce the Edge Weight Updating Neural Network for NEN. NEN, which matches extracted named entities that share the same identity, is pivotal for many text mining tasks. We tested our model on three widely used NEN datasets: NCBI Disease, BC5CDR Disease, and BC5CDR Chemical. We also constructed an NEN dataset for the finance domain and used it to verify our model's performance for general NEN applications.

The main contributions of this study are as follows. Our proposed model successfully links named entities that have the same meaning but different surface forms. The proposed model outperforms previous NEN models. We test our model not only on bioinformatics datasets, a field in which NEN research is more active, but also on financial NEN datasets. Its performance on NEN corpora from two distinct fields demonstrates its efficacy for general NEN applications.

As with many other NEN models, the performance of linking named entities with abbreviations is comparatively lower. Matching abbreviations more accurately is left for future work. The neural network model with our proposed Edge Weight Updating objective function performs better than other models. Providing a more general guideline for the number of training epochs and increasing training stability are further topics for future research.

Acknowledgements: This work was supported by the National Research Foundation of Korea (2018R1D1A1A02045842).

Pre-training of deep bidirectional transformers for language understanding

A method for named entity normalization in biomedical articles: application to diseases and plants

Prominer: rule-based protein and gene entity recognition

Effective mapping of biomedical text to the umls metathesaurus: the metamap program

Dnorm: disease name normalization with pairwise learning to rank

Taggerone: joint named entity recognition and normalization with semi-markov models

Cross-species gene normalization by species inference

The gnat library for local and remote gene mention normalization

Chemspot: a hybrid system for chemical named entity recognition

Named entity recognition and normalization applied to large-scale information extraction from the materials science literature

Overview of the share/clef ehealth evaluation lab

Ncbi disease corpus: a resource for disease name recognition and concept normalization

A dataset of 200 structured product labels annotated for adverse drug reactions. Scientific data

Overview of biocreative ii gene mention recognition

Overview of bionlp'09 shared task on event extraction

Bacteria biotope at bionlp open shared tasks 2019

Chemical names: terminological resources and corpora annotation

Detection of iupac and iupac-like chemical names

An evaluation of equity premium prediction using multiple kernel learning with financial features

Ar-arch type artificial neural network for forecasting

Comprehensive review of textmining applications in finance

Named entity normalization in user generated content

A product named entity normalization method based on entity relations

Transfer learning for named entity recognition in financial and biomedical documents

Siamese recurrent architectures for learning sentence similarity

Semantic textual similarity with siamese neural networks

Matching long text documents via graph convolutional networks

Siamese graph neural networks for data integration

Learning text similarity with siamese recurrent networks

Multi-task character-level attentional networks for medical concept normalization

Evaluating the impact of knowledge graph context on entity disambiguation models

Sieve-based entity linking for the biomedical domain

Cnn-based ranking for biomedical entity normalization

NormCo: Deep disease normalization for biomedical knowledge base construction

Robust representation learning of biomedical names

Bert-based ranking for biomedical entity normalization

Biomedical entity representations with synonym marginalization

Edge-labeling graph neural network for few-shot learning

On information and sufficiency. The annals of mathematical statistics

Biobert: a pre-trained biomedical language representation model for biomedical text mining

Biocreative v cdr task corpus: a resource for chemical disease relation extraction

Comparative toxicogenomics database: a knowledgebase and discovery tool for chemical-gene-disease networks

Introduction to the conll-2003 shared task: Language-independent named entity recognition

Enhancing clinical concept extraction with contextual embeddings

Medical entity linking using triplet network

Deep metric learning using triplet network

Binary codes capable of correcting deletions, insertions, and reversals

Semi-supervised classification with graph convolutional networks

Bidirectional recurrent neural networks

All authors certify that they have no affiliations with or involvement in any organization or entity with any financial interest or non-financial interest in the subject matter or materials discussed in this manuscript.