1 Introduction

With the evolution of digital technologies and of the profile of internet users, the generation of unstructured data has increased enormously. According to [22], unstructured data are data that do not exhibit a clear syntactic and semantic, machine-processable structure.

Consequently, methods have been developed to identify and structure the information present in textual documents. These methods rely on Natural Language Processing, an area that uses concepts from Linguistics and Artificial Intelligence to process data and automate language-related tasks. Among the Natural Language Processing tasks are Machine Translation, Speech Recognition, Automatic Text Summarization, and Information Extraction, among others.

Within Information Extraction there is the Named Entity Recognition (NER) task, which consists of identifying Named Entities, such as Persons and Places, in texts written in natural language and categorizing them according to their nature. A Named Entity is a concept, formed by one or more words, that belongs to a previously defined semantic group [9,10,11,25,28,33].

Traditionally, NER approaches have relied on techniques from Linguistics, such as syntactic labels, word lemmas, and affixes, to extract information present in texts. These traditional techniques are quite laborious, as they involve several stages of data preparation. To ease this work, Machine Learning methods have been developed. Recently, numeric techniques that capture syntactic and semantic aspects of words have been gaining ground, as they achieve results similar to, and sometimes better than, classical techniques without requiring an extensive feature-engineering process.

As examples of Machine Learning approaches to the Named Entity Recognition task we can cite [5,6,12,14,15,26], among many others. Distributed representations of words are one of the reasons for the popularization of Machine Learning methods in NLP; examples of such representations include word2vec [23,24], GloVe [27], fastText [3], Flair [1], and BERT [7], among many others.

As stated by Ratinov and Roth [29], NER is a knowledge-intensive task. Therefore, some approaches still perform feature engineering alongside Machine Learning methods in order to include features that cannot be captured from word vectors or texts. Even though feature engineering solves part of the problem, there remains a semantic gap that cannot be bridged with lexical and syntactic attributes alone. In this sense, semantic repositories have been adopted.

In [5], external information is added to the CNN-LSTM model through the use of a lexicon. This lexicon is built from Wikipedia data and is used to add new features, in the shape of Inside-Outside-Beginning tags, to the input vectors.

A work using external knowledge from an ontology is presented in [19], where the authors perform Named Entity Recognition on bridge inspection reports. To do so, they use a semi-supervised Conditional Random Fields (CRF) model whose inputs consist of syntactic features, such as stems and Part-of-Speech tags, and semantic features from a domain ontology named BridgeOnto.

Liu et al. [20] add external information from gazetteers, but instead of using tags indicating whether a word belongs to one or more gazetteers, they use a Hybrid Semi-Markov CRF to generate a numeric value that expresses the degree to which the word belongs to a gazetteer. The goal is to find a better alternative to the hard-match representation commonly used when gazetteers are adopted.

In [18], the authors present an approach that uses external knowledge in the form of lexicons together with syntactic features such as Part-of-Speech tags and n-grams.

Ratinov and Roth [29] enrich the Named Entity Recognition task with external knowledge in the form of gazetteers on CoNLL03, MUC7, and a smaller dataset of webpages assembled and annotated by them. They also use other features, such as context aggregation and extended prediction history, to boost the NER performance of their model.

Seyler et al. [30] divide external features into four categories to quantify the impact of each one on the NER task, using a Linear-Chain CRF on the CoNLL03 and MUC7 datasets. The four categories are: i) Knowledge-Agnostic, using only local features; ii) Name-Based Knowledge, using a list of named entities; iii) Knowledge-Base-Based Knowledge, using features extracted from a Knowledge Base or an annotated corpus; and iv) Entity-Based Knowledge, using the results of Named Entity Disambiguation. The authors show that incrementally adding more categories of knowledge yields better effectiveness, but sometimes at the cost of efficiency, stating that there is a trade-off between them.

In order to add external knowledge to a neural model, Ding et al. [8] propose the use of an adapted Gated Graph Sequence Neural Network to capture the information contained in multiple gazetteers. The Gated Graph Sequence Neural Network serves as an embedding layer that learns how to combine the knowledge present in more than one gazetteer of the same or different types. The resulting embeddings are then fed to a standard BiLSTM-CRF to fulfill the NER task on Weibo-NER and OntoNotes 4, both in Chinese.

Gazetteer knowledge is aggregated into an Attentive Neural Network by Lin et al. [17] for the Nested Named Entity Recognition task. They leverage the knowledge contained in gazetteers by finding a representation for entity candidates through what they call a gazetteer network, which is concatenated with the representation learned by a region encoder. The experiments show that this strategy improves the model's performance on the ACE2005 dataset.

Xiaofeng et al. [32] propose a method to incorporate dictionary features into a BiLSTM-CRF model in order to evaluate their impact. Their differential is that the dictionaries used during the training phase are obtained from the train split, whereas the dictionaries used in the testing phase come from SENNA. Their experiments, conducted on the CoNLL2003 dataset, show that the size of the dictionaries (partial or oversized) may lead to inferior results in some cases.

The purpose of this work is to evaluate the aggregation of external knowledge, in the form of gazetteers built from YAGO and Knowledge Graph embeddings from Freebase, to the Named Entity Recognition task using BiLSTM, BiGRU, and CRF.

The paper is organized as follows: Sect. 2 presents our approach, details of the models, the input vectors and the external knowledge used. In Sect. 3 the experimental protocol and the datasets are explained, and the results are discussed. Section 4 presents conclusions and future work.

2 Approach

This work aims to investigate the use of external knowledge in some commonly used neural models for NER. The first step of our approach is to generate the embeddings for the datasets used by the methods and to add the sources of external knowledge. The next step is to define the neural networks, with an architecture inspired by [5].

This section introduces the neural models used, as well as the sources of the external knowledge and how these knowledge sources are added to our model.

2.1 Neural Models

In this work we used two neural models: Bidirectional Long Short-Term Memory (BiLSTM) combined with Conditional Random Fields (CRF) [6], and Bidirectional Gated Recurrent Units (BiGRU) combined with CRF.

The aim of the models is to find a sequence of labels \(\mathcal {Y}=\{\mathcal {Y}_1,\mathcal {Y}_2,...,\mathcal {Y}_n\}\) for a given sequence of inputs \(\mathcal {X}=\{\mathcal {X}_1,\mathcal {X}_2,...,\mathcal {X}_n\}\) of length n. In this work, each \(\mathcal {X}_i\) is a vector representation of a word and its features in a sentence. These vectors are used as input to the BiLSTM/BiGRU layers, whose purpose is to extract features and create a new feature vector for each word represented by \(\mathcal {X}_i\) while considering the surrounding words in the same sentence. The idea of bidirectional layers is to feed the same input vectors to two LSTM/GRU layers: one processes the word sequence from left to right, generating \(\overrightarrow{h_i}\), and the other from right to left, generating \(\overleftarrow{h_i}\). The two outputs are concatenated into one feature vector, \(h_i = [\overrightarrow{h_i}: \overleftarrow{h_i}]\), which is used as input by the classification layer.

Following [14] and other works, we chose CRF as the classification layer of our models because it takes into account the previously assigned labels. This way, by combining BiLSTM/BiGRU with CRF, the models make good use of the sequential nature of texts.
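The bidirectional pass described above can be sketched with a toy NumPy implementation of a GRU run in both directions. All dimensions, the parameter initialization, and the use of a GRU cell here are illustrative assumptions, not the paper's actual configuration:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h = 4, 3  # toy input and hidden sizes (hypothetical)

def make_params():
    # One set of GRU parameters per direction; shapes follow the standard GRU.
    return {k: rng.normal(scale=0.1, size=s) for k, s in {
        "Wz": (d_h, d_in), "Uz": (d_h, d_h), "bz": (d_h,),
        "Wr": (d_h, d_in), "Ur": (d_h, d_h), "br": (d_h,),
        "Wh": (d_h, d_in), "Uh": (d_h, d_h), "bh": (d_h,),
    }.items()}

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_pass(xs, p):
    """Run a GRU over the sequence xs, returning one hidden state per step."""
    h = np.zeros(d_h)
    out = []
    for x in xs:
        z = sigmoid(p["Wz"] @ x + p["Uz"] @ h + p["bz"])   # update gate
        r = sigmoid(p["Wr"] @ x + p["Ur"] @ h + p["br"])   # reset gate
        h_tilde = np.tanh(p["Wh"] @ x + p["Uh"] @ (r * h) + p["bh"])
        h = (1 - z) * h + z * h_tilde
        out.append(h)
    return out

sentence = [rng.normal(size=d_in) for _ in range(5)]    # 5 word vectors X_i
fwd = gru_pass(sentence, make_params())                 # left-to-right pass
bwd = gru_pass(sentence[::-1], make_params())[::-1]     # right-to-left, re-aligned
H = [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]  # h_i = [h_fwd : h_bwd]
```

Each concatenated vector in `H` is what the CRF layer would receive for one word; in the actual models these parameters are learned jointly with the classifier.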

2.2 Embeddings

Inspired by [5], this work makes use of three common embeddings: Character embeddings, Casing embeddings, and Word embeddings. In addition, we aggregate the embeddings whose impact we aim to validate: External Knowledge embeddings. All of these embedding techniques are explained below, except the External Knowledge embeddings, which are presented in the next subsection.

Character Embeddings: To generate the Character embeddings, we first create a vector, randomly initialized with \(U(-0.5,0.5)\), for each character in a set of 135 characters. These vectors are then retrieved for each character of each word and used as input to a Convolutional Neural Network with a max-pooling layer, which generates a single vector, named the Character embedding, containing information from all the characters of a word. To better capture character features, the character vectors are trained with the rest of the model, and we use a Dropout layer to prevent the Character embeddings from overfitting.
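The convolution-plus-max-pooling step can be sketched as follows. The alphabet, vector dimensions, filter count, window size, and padding scheme are all illustrative assumptions (the paper only fixes the character set size at 135 and the \(U(-0.5,0.5)\) initialization):

```python
import numpy as np

rng = np.random.default_rng(1)
CHARS = "abcdefghijklmnopqrstuvwxyz"   # toy alphabet (the paper uses 135 chars)
d_char, window, n_filters = 8, 3, 10   # hypothetical sizes

char_vecs = {c: rng.uniform(-0.5, 0.5, d_char) for c in CHARS}   # U(-0.5, 0.5)
W = rng.normal(scale=0.1, size=(n_filters, window * d_char))     # conv filters
b = np.zeros(n_filters)

def char_embedding(word):
    """Convolve over character windows, then max-pool into one vector per word."""
    mat = np.stack([char_vecs[c] for c in word])
    # Pad so every word yields at least one full window.
    pad = np.zeros((window - 1, d_char))
    mat = np.vstack([pad, mat, pad])
    convs = [np.tanh(W @ mat[i:i + window].ravel() + b)
             for i in range(len(mat) - window + 1)]
    return np.max(convs, axis=0)   # max-pooling over window positions

emb = char_embedding("word")
```

The result is a single fixed-size vector per word regardless of word length, which is the property that lets it be concatenated with the other embeddings.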

Casing Embeddings: To generate the Casing embeddings, each input word is assigned to one of eight possible categories according to its composition, such as its capitalisation and the presence of numbers and special characters. Each category is then initialized as a one-hot vector that is further trained with the model.
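A plausible categorization function is sketched below. The paper only states that there are eight composition-based categories; this particular set of category names and rules is an assumption:

```python
CATEGORIES = ["numeric", "mainly_numeric", "all_lower", "all_upper",
              "initial_upper", "contains_digit", "has_special", "other"]

def casing_category(word):
    """Map a word to one of eight casing categories (hypothetical set)."""
    n_digits = sum(c.isdigit() for c in word)
    if word.isdigit():
        return "numeric"
    if n_digits / len(word) > 0.5:
        return "mainly_numeric"
    if word.islower():
        return "all_lower"
    if word.isupper():
        return "all_upper"
    if word[0].isupper():
        return "initial_upper"
    if n_digits > 0:
        return "contains_digit"
    if not word.isalnum():
        return "has_special"
    return "other"

def casing_one_hot(word):
    """One-hot vector over the eight categories, later trained with the model."""
    vec = [0.0] * len(CATEGORIES)
    vec[CATEGORIES.index(casing_category(word))] = 1.0
    return vec
```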

Word Embeddings: As Word embeddings, we decided to use pre-trained GloVe [27] embeddings with 50 dimensions. Other candidates were word2vec, fastText, Flair, and BERT. Some of these might yield better final results, but as the objective of this work is to analyze the impact of external knowledge on the chosen neural architecture, results below the state of the art do not invalidate it.

2.3 External Knowledge

To aggregate external information into the input vectors, we chose two distinct sources: gazetteers built from version 3.1 of Yet Another Great Ontology (YAGO) [21], and Knowledge embeddings from Freebase, generated by the TransE method [4] using the OpenKE framework [13] with the latest Freebase dump.

To create the gazetteers, we picked all entities belonging to four types in YAGO (Person, Organization, Foundation, Place), which correspond to three of the CONLL2003 types (Person, Organization, Place). Besides the chosen types, we also picked their sub-types (e.g., Abstract painters and Presidents are sub-types of Person). By the end of this process, we had three gazetteer lists, whose numbers of entities are shown in Table 1.

Table 1. Number of entities in each Gazetteer list.

To add gazetteer information to the input vectors, we used a strategy similar to that of the Casing embeddings, but instead of the eight categories related to the composition of the word, we used another eight categories that express the membership of the word in one or more gazetteers. To do this, we performed a tagging stage in which each word of the dataset was tagged according to its gazetteer membership (e.g., Washington received the PER/LOC tag, while Kilimanjaro received only the LOC tag). One-hot vectors are then generated for each category and trained with the model, just as with the Casing embeddings.
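With three entity types, the possible membership combinations number exactly \(2^3 = 8\), which matches the eight categories. The tagging stage can be sketched as below; the gazetteer contents here are toy stand-ins for the YAGO-derived lists:

```python
# Toy gazetteer lists standing in for the YAGO-derived ones (contents hypothetical).
GAZETTEERS = {
    "PER": {"washington", "lincoln"},
    "ORG": {"google", "unicef"},
    "LOC": {"washington", "kilimanjaro"},
}
TYPES = ["PER", "ORG", "LOC"]

# Eight categories: one per subset of {PER, ORG, LOC}, "O" for no membership.
CATS = ["O", "PER", "ORG", "LOC", "PER/ORG", "PER/LOC", "ORG/LOC", "PER/ORG/LOC"]

def gazetteer_category(word):
    """Return the combination of gazetteer lists that contain the word."""
    w = word.lower()
    return "/".join(t for t in TYPES if w in GAZETTEERS[t]) or "O"

def gazetteer_one_hot(word):
    """One-hot vector over the eight membership categories."""
    vec = [0.0] * len(CATS)
    vec[CATS.index(gazetteer_category(word))] = 1.0
    return vec
```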

As for the Knowledge embeddings from Freebase, we chose to use only single-word topics, but to compensate we did not filter the topics by type. This way, we used the Knowledge embeddings of all words in the text, and if a word does not have a Knowledge embedding we used a vector of zeros.
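The lookup-with-zero-fallback can be sketched as follows; the embedding table and its dimension are illustrative placeholders for the actual TransE vectors:

```python
import numpy as np

d_ke = 5  # toy embedding dimension (hypothetical; real TransE vectors are larger)

# Stand-in for the Freebase/TransE single-word topic embeddings.
knowledge_embeddings = {
    "washington": np.full(d_ke, 0.1),
    "google": np.full(d_ke, 0.2),
}

def knowledge_vector(word):
    """Look up the knowledge embedding; fall back to a zero vector if absent."""
    return knowledge_embeddings.get(word.lower(), np.zeros(d_ke))
```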

Regardless of the source, the external knowledge is added by concatenating the semantic feature vector, generated by one of the methods described above, to the other vectors (Word, Casing, Character), and the resulting vector is presented as input to the neural models.
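The concatenation itself is straightforward; the dimensions below are illustrative (only the 50-dimensional GloVe vectors are stated in the paper):

```python
import numpy as np

# Toy per-word feature vectors (dimensions other than 50 are assumptions).
word_vec = np.zeros(50)       # pre-trained GloVe word embedding
casing_vec = np.zeros(8)      # one-hot casing category
char_vec = np.zeros(10)       # CNN character embedding
external_vec = np.zeros(8)    # gazetteer one-hot or knowledge embedding

# X_i, the full input vector for one word, fed to the BiLSTM/BiGRU layer.
x_i = np.concatenate([word_vec, casing_vec, char_vec, external_vec])
```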

3 Experiments

We conducted experiments on four datasets in order to assess the impact of external knowledge on the chosen neural models for the NER task. We executed each experimental setting a total of 10 times due to the stochastic elements present in each model's initialization. For evaluation we adopted the F1-score, shown in Eq. 1, which is the harmonic mean of Precision (P) and Recall (R), shown in Eq. 2 and Eq. 3, where TP stands for True Positives, FP for False Positives, and FN for False Negatives.

$$\begin{aligned} F_1\text {-}score = \frac{2 \cdot P \cdot R}{P + R} \end{aligned}$$
(1)
$$\begin{aligned} P = \frac{TP}{TP + FP} \end{aligned}$$
(2)
$$\begin{aligned} R = \frac{TP}{TP + FN} \end{aligned}$$
(3)

Thus, the F1-score results presented in this section are the mean of 10 executions for each model setting.
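Equations 1-3 translate directly into code; a minimal sketch from entity-level counts:

```python
def f1_score(tp, fp, fn):
    """Precision, recall and F1 from true-positive, false-positive and
    false-negative counts, as in Eqs. 1-3."""
    p = tp / (tp + fp) if tp + fp else 0.0   # Eq. 2
    r = tp / (tp + fn) if tp + fn else 0.0   # Eq. 3
    return (2 * p * r / (p + r)) if p + r else 0.0   # Eq. 1

# e.g. 80 correctly recognized entities, 20 spurious, 10 missed:
# P = 0.8, R = 80/90, F1 is their harmonic mean.
score = f1_score(80, 20, 10)
```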

3.1 Datasets

In this work we chose four distinct datasets, all in English, with different sizes and types of entities, in order to validate our approach in different scenarios. The chosen datasets are CONLL2003 [31], OntoNotes5, MIT Movies [18], and MIT Restaurants [18]. We only used the train/test splits, leaving validation sets out of our experiments and thus not conducting hyper-parameter optimization. All of the datasets use the Inside-Outside-Beginning entity annotation scheme. Table 2 details the number of sentences and entities contained in the datasets.

Table 2. Quantification of datasets.

3.2 Experimental Settings

The experiments consisted of using the external embeddings, described in the previous sections, and checking their impact on the results of our models. To verify this, we conducted experiments with the same model with and without the external knowledge. The model without external knowledge was used as the baseline for our modifications; that is, we executed both network models, BiLSTM-CRF and BiGRU-CRF, with and without the addition of the external embeddings and compared their F1-scores.

The parameters used are shown in Table 3. The parameters are the same for all datasets, except for the OntoNotes 5 dataset, for which we decided to use a lower learning rate and fewer epochs due to its large amount of data.

Table 3. Parameters values. Values with \(*\) symbol refer to parameters used only for OntoNotes5.

3.3 Experimental Results

Table 4 shows the results of the experiments carried out and the state-of-the-art F1-score (as stated by the authors in their papers) for each dataset. The results show an increase in all cases when we added gazetteer information to the BiLSTM-CRF. However, the addition of Knowledge embeddings to the BiLSTM-CRF decreased the F1-score in every situation. In the experiments using BiGRU, there are two scenarios with an increase in F1-score from the addition of gazetteer information, and the use of Knowledge embeddings also increased the F1-score in two scenarios. Regardless of whether the external knowledge increased or decreased the F1-score, the difference was not significant, averaging 0.4%, with the exception of the OntoNotes 5 dataset, where the differences were very pronounced, averaging 6.76%.

Even though other works show an increase in F1-score when adding external knowledge to the models, our addition of external knowledge does not always bring positive impacts on the F1-score, as shown in Table 4, and even when it does, it does not necessarily achieve the best result among our models. Furthermore, even our best results are far from the state of the art on some datasets.

Table 4. Comparison of F1-score between our models and state-of-art models, where Gaz stands for Gazetteer, and KE stands for Knowledge Embeddings. Each column represents the F1-score for that dataset, and the bold values represent the best result for that dataset.

When compared to the results of other works, such as [29,30], our strategy does not seem to bring gains; however, it is worth noting that the models we chose as baselines achieve good F1-scores (88.57% and 88.15% on CoNLL2003 using BiLSTM and BiGRU, respectively), which are very close, respectively, to their best and second-best results. This choice of a strong baseline may be one of the reasons behind the small gains.

4 Conclusion

This work aimed to evaluate the use of external knowledge in machine learning methods for Named Entity Recognition. The methodology was composed of two steps: i) generation of embeddings, and ii) definition and training of the Machine Learning methods.

The defined models were trained and tested on four English datasets of different sizes and with different types of entities in order to evaluate our methodology.

As our experiments show, in spite of an increase in F1-score in 17 of the 32 cases, the way external knowledge was integrated into the model did not bring much gain, most improvements being smaller than 0.5%, and on some datasets the results were well below the state-of-the-art methods (for the values stated by the authors in their papers). This may be explained by the fact that we did not perform any hyper-parameter optimization, which may have led the model to suffer from overfitting, and from underfitting in the case of OntoNotes5, where there was an increase of 11.8% in one scenario.

Another point to consider regarding the gap between our results and the state of the art is the choice of Word embeddings: while we chose GloVe Word embeddings, most recent works use embeddings that better capture the context of words.

It is important to note that we did not reproduce the state-of-the-art methods. Although we did not achieve state-of-the-art results, we were still able to assess the impact of adding external knowledge to the neural models used.

The results and discussion show that the outcome of adding external knowledge is strongly linked to what information is used, as well as how it is used. We conjecture that in some cases the information present in external bases may already be integrated into the word representations, especially when the embedding training set and the knowledge base share common data.

Thus, adding external knowledge to the models does not always improve the results, and can even lead to performance decreases. Therefore, in order to integrate external knowledge, a deep analysis is needed to capture all the semantics present in external knowledge bases.