1 Introduction

Active learning (AL) algorithms reduce the amount of labeled data required to train a machine learning model with supervised learning. These algorithms can generally be separated into three classes: (1) pool-based, (2) stream-based, and (3) query synthesis. In this paper, we focus on the pool-based class, which thrives in scenarios where large amounts of data are available but annotating the complete pool is costly. The objective of a pool-based AL technique is to identify, within a considerable pool of unlabeled data, a smaller subset of samples that represents the whole data distribution well. By annotating this smaller subset and training a model on it in a supervised fashion, it is possible to achieve good model performance with a significant decrease in data annotation costs.

In Active Learning for Named Entity Recognition (NER), most research proposes sentence-level querying strategies. Sentence-level querying means that the whole unlabeled sentence is queried to the oracle, which is expected to provide labels for all of its tokens. This provides more context and can lead to more accurate annotation decisions, but it can also be more expensive and time-consuming. Some works propose alternative strategies such as token-level [7] or subsentence-level [16] querying. These strategies aim to make the annotation process more efficient by reducing the number of tokens that need to be manually labeled.

This paper proposes cooperative approaches for reducing the cost of annotation by an oracle and speeding up the annotation process. For that, we focus on Active Self-Learning (ASL) algorithms. The most straightforward implementation of an ASL algorithm uses the machine learning model to predict classes for the unlabeled data. The predictions are then split into high-confidence and low-confidence, depending on the model’s confidence in them. High-confidence samples are labeled automatically using the model’s predictions, while low-confidence samples are queried to the human annotator to be labeled manually. Labeled samples are added to the labeled dataset, which is used to further train the machine learning model. This process is repeated until the desired level of model performance is achieved or a stopping criterion is reached. In ASL algorithms, the active learning part corresponds to the queries to the human annotator, while the self-learning part corresponds to the automatic labeling done by the trained model. ASL creates a cooperative scenario where humans and models annotate data together, potentially reducing annotation costs compared to using only active learning strategies [21].

Our two main contributions in this paper are: (1) we investigate the impact of different self-training techniques on Active Self-Learning algorithms; and (2) we propose a novel Active Self-Learning algorithm based on token-level querying that reaches peak performance with less hand-annotated data than previous works.

The overview of the paper is as follows. In Sect. 2, we present relevant works from the literature on both Active Learning and Active Self-Learning algorithms applied to named entity recognition tasks. Section 3 briefly reviews the Active Self-Learning algorithm from [2] based on sentence-level querying. In Sect. 4, we investigate the impact of different Self-Learning techniques on the Active Self-Learning algorithm from [2]. In Sect. 5, we propose a novel Active Self-Learning algorithm based on token-level querying, where the human annotator and the machine learning model cooperatively annotate tokens from the same sentence. In Sect. 7, we present the results of our investigations on the ASL algorithms proposed in Sects. 4 and 5. Finally, in Sect. 8, we summarize the results from both sentence-level and token-level Active Self-Learning algorithms.

2 Related Works

Many studies in the current literature on Active Learning algorithms for NER utilize sentence-level querying. Shen et al. [18] proposes, to the best of our knowledge, the first Active Learning algorithm based on neural networks for sequence tagging tasks (e.g. NER and Part-Of-Speech tagging). Their neural model uses convolutional layers for character- and word-level feature encoding and an LSTM layer with greedy decoding as the tag decoder. The authors also proposed the Maximum Normalized Log-Probability (MNLP) sentence-level query function, which normalizes the model’s confidence for a given unlabeled sentence. Their experiments show that the proposed algorithm trains the model to peak performance using only 25% of the training set of the OntoNotes5.0 dataset [15]. Siddhant and Lipton [19] extend the work of Shen et al. They use the Bayesian Active Learning through Disagreement [6] (BALD) framework, querying the unlabeled sentences that generate the most disagreement over multiple passes of the neural models. To introduce stochastic behavior in the neural models, they propose two solutions. The first is Monte Carlo dropout, where the model makes multiple predictions for the same sentence, each with a different dropout mask. The second is Bayes by Backprop, where some layers have their deterministic parameters replaced by stochastic parameters. Experiments showed that the proposed query strategies performed consistently better than MNLP, but at the cost of more complex and computationally expensive query functions.

In the literature on Active Learning applied to named entity recognition, sub-sentence-level querying strategies, including token-level querying strategies, are found less often. The main challenge is how to implement the model training routine with partially labeled sentences. Kobayashi and Wakabayashi [7] propose the use of a multi-class logistic regression model with point-wise predictions [12]. This method lets them train their models from token-level queries. Radmard et al. [16] proposes a sub-sentence-based query strategy, where only the most essential, non-overlapping sub-sentences are queried instead of the whole sentence. The strategy is to query sub-sentences without their surrounding context and store the annotations in a dictionary that associates each sub-sentence with its corresponding accurate labels. Annotations for a sub-sentence are propagated onto the unlabeled dataset, meaning that all occurrences of a particular sub-sentence are assigned the same labels. They then train a neural model using a loss function computed only on labeled tokens.

Active Self-Learning algorithms have received considerably less attention in the NER literature than purely Active Learning algorithms. Tran et al. [21] proposes an Active Self-Learning algorithm based on Conditional Random Fields (CRF) models. The Active Learning component uses a diversity measure to query the most informative samples to the oracle, while the Self-Learning component uses the trained CRF model from the previous iteration to annotate unlabeled sentences for which its prediction confidence is high. The dataset used was composed of sentences extracted from Twitter. The experiments compared uncertainty and diversity query strategies, and Active Learning algorithms with and without Self-Learning. Results showed that Active Self-Learning algorithms achieved, in general, better results than the purely Active Learning algorithm. Inspired by the work of Tran et al. [21], Cunha and Faleiros [2] presents another Active Self-Learning algorithm, with specific changes to address the use of deep neural models instead of CRF. The proposed ASL algorithm has been shown to be less sensitive to the quality of the labeled set in initial iterations when compared to the previous Active Self-Learning algorithm from the literature [1].

3 Deep Active Self-learning Algorithm

This section briefly describes the active self-learning algorithm proposed by Cunha and Faleiros [2]. The Active Self-Learning algorithms described in existing literature require collaboration between a human annotator and a trained model to annotate samples from an unlabeled database. However, this process is sensitive to the initial labeled set used to train the machine learning model [1]. We argue that this sensitivity arises from poorly annotated samples selected by the model in the early rounds of the active self-learning algorithm. This poorly annotated data may introduce permanent bias to the labeled set. To address this issue, the work of [2] proposed an active Self-Learning algorithm that distinguishes between samples labeled by the model and those labeled by the human annotator. The former has less impact on the model’s parameters during training. Additionally, samples labeled automatically by the model were returned to the unlabeled set after each iteration of the active self-learning algorithm. These modifications seem to mitigate the risk of introducing permanent bias to the trained model and labeled dataset.

Algorithm 1 presents the active self-learning algorithm in more detail.

Algorithm 1. Active self-learning algorithm

Note that in Algorithm 1, m represents the machine learning model, Q is the query budget, \(min\_confidence\) is the minimum confidence level required for the model to annotate an unlabeled sample, A.L. denotes the active labeled set that contains samples annotated by the oracle, S.L. contains samples labeled by the trained model, and U represents the set of unlabeled data. The active learning procedure, represented by the \(Active\_Learning\_Query(\cdot )\) function, identifies the most informative unlabeled samples for annotation by the oracle. The self-learning procedure, represented by the \(Self\_Learning\_Query(\cdot )\) function, corresponds to the different self-learning strategies that will be applied. We investigate the impact of different self-learning techniques based on sentence-level and word-level querying strategies.
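To make the control flow concrete, the partition performed in each iteration of Algorithm 1 can be sketched as follows. This is a minimal illustration in plain Python, not the authors' implementation: the function names mirror the notation above, and the per-sample confidence scores and the training step are assumed to come from the trained model m.

```python
def active_learning_query(unlabeled, confidence, budget):
    """Select the `budget` least-confident samples for the oracle to label."""
    return sorted(unlabeled, key=lambda s: confidence[s])[:budget]

def self_learning_query(unlabeled, confidence, min_confidence):
    """Select samples confident enough for the model to label automatically."""
    return [s for s in unlabeled if confidence[s] >= min_confidence]

def asl_iteration(U, confidence, Q, min_confidence):
    """One iteration of the active self-learning loop.

    Returns the oracle-labeled set (AL), the model-labeled set (SL), and the
    remaining unlabeled pool. Following the algorithm of [2], the SL samples
    would be returned to U after the model is trained on AL and SL.
    """
    AL = set(active_learning_query(U, confidence, Q))
    rest = [s for s in U if s not in AL]
    SL = set(self_learning_query(rest, confidence, min_confidence))
    U_rest = [s for s in rest if s not in SL]
    # ... train the model on AL (supervised) and SL (self-learning) here ...
    return AL, SL, U_rest
```

In this sketch, AL accumulates across iterations while SL is rebuilt each round, reflecting the design choice that only oracle labels are permanent.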

4 Sentence-Level Active Self-learning Algorithm

This section introduces our proposed Active Self-Learning algorithm that utilizes sentence-level querying. It builds upon the work of [2] by incorporating various Self-Learning techniques.

Figure 1 shows a diagram of the proposed Active Self-Learning algorithm. Two main characteristics make this algorithm different from previous algorithms in the literature. (1) The first change is that we now keep two separate labeled sets (shown in orange in Fig. 1). One stores sentences hand-annotated by the oracle, which therefore have highly reliable labels; the other stores sentences labeled by the machine learning model, with less reliable labels. This separation of highly reliable and less reliable labeled data allows us to use them differently during training: we employ hand-annotated data for traditional supervised learning and use data labeled by the model for self-learning. (2) The second change is that after the machine learning model is trained in a given iteration, the data self-labeled by the model returns to the unlabeled data pool. These alterations allow a better-trained model to re-annotate samples in later iterations.

Fig. 1. Annotation process for a sentence-level active self-learning algorithm. (Color figure online)

The algorithm proposed in [2] adopted pseudo-labeling as the self-learning strategy. Our alternate versions replace pseudo-labeling with other Self-Learning methods. We test three alternative techniques, namely: (1) word dropout, (2) virtual adversarial training, and (3) cross-view training. Next, we explain how our algorithm uses pseudo-labeling and how it integrates these three techniques:

  • The Pseudo-Labeling (PL) technique identifies highly reliable unlabeled samples to be automatically annotated by the model. In this case, highly reliable means that the model’s confidence in its predictions is above a predefined threshold. Our strategy consists of using the machine learning model’s predictions as labels and then applying supervised training with both the pseudo-labeled and the hand-labeled data.

  • The Cross-View Training (CVT) [3] aims to improve the representation capabilities of the model by forcing it to output similar predictions with different views of the same input data. The CVT algorithm uses neural models with auxiliary classification heads. Each head learns to predict tokens given a limited view of the input sentence (e.g., left-context only, right-context only). The auxiliary classification head comprises a fully-connected layer with ReLU [11] activation, followed by a softmax function to generate a distribution over predicted classes. The CVT loss is the Kullback-Leibler (KL) divergence between the outputs of the main classification head, which sees the complete input, and the auxiliary heads, which see partial views of the input.

  • The Word Dropout (WD) technique [3, 8] consists of replacing random words in a sentence with special tokens, such as \(<removed>\) or \(<UNK>\), and training the model to produce an output distribution similar to the one it produces when the word is unmasked.

  • Virtual Adversarial Training (VAT) is a Self-Learning technique proposed by Miyato et al. [10] for text classification and applied to NER by Clark et al. [3]. It extends Adversarial Training [4], where an input sample is perturbed with specially crafted noise designed to fool the model. This makes the model more robust to small perturbations in the input data.
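As a concrete illustration of the simplest of these techniques, a word-dropout view and the consistency loss that pulls the corrupted-view prediction toward the clean-view prediction can be sketched as follows. This is a minimal sketch in plain Python, not the authors' implementation; the token lists and class distributions are toy stand-ins for real model inputs and outputs.

```python
import math
import random

def word_dropout(tokens, p=0.15, mask="<UNK>", rng=None):
    """Produce a corrupted view: each token is replaced by `mask` with probability p."""
    rng = rng or random.Random(0)
    return [mask if rng.random() < p else tok for tok in tokens]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two class distributions -- the consistency loss that
    penalizes the corrupted-view prediction q for diverging from the clean one p."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))
```

During self-training, the model would predict a distribution for each token of the clean sentence and of its word-dropout view, and the KL term would be minimized on unlabeled data alongside the supervised loss on labeled data.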

5 Token-Level Active Self-Learning Algorithm

Sentence-level querying considers the model’s overall confidence for the entire sentence, i.e., an average of its confidence over all tokens within the sentence. However, hard-to-predict entities may be surrounded by many easy-to-predict tokens. We argue that this can make the model overconfident about complex tokens, leading to poor self-annotations that may hamper the Active Self-Learning process. We leverage this idea to propose a token-level Active Self-Learning algorithm.

The proposed token-level Active Self-Learning algorithm is a modified version of the Deep Active Learning (DAL) algorithm presented by Shen et al. [18]. The main difference between the original DAL and our proposed algorithm is the labeling process. The original algorithm queries the most uncertain sentences and asks the oracle to annotate them entirely. Our proposed algorithm takes the queried sentences, identifies the tokens with low confidence, and asks the oracle to annotate only those tokens. The remaining unlabeled tokens of the queried sentence are automatically labeled using the model’s output predictions. We further improve the accuracy of the self-labeling process by using the hand-annotated labels to refine the predictions made by the model. This cooperative scenario can alleviate the cost of manual annotation for the oracle and speed up the annotation process by highlighting the specific words in a sentence that must be hand-annotated. Figure 2 illustrates the annotation process of our proposed token-level Active Self-Learning algorithm.

Fig. 2. An illustration of the collaborative configuration where an oracle and a machine learning model jointly annotate the same sentence.

Our proposed algorithm has three main procedures: (1) identify the highest-confidence tokens for automatic annotation; (2) query low-confidence tokens to the oracle for manual annotation; and (3) refine the self-labels.

  1. Identifying low-confidence tokens: A neural network produces a probability distribution for each input token, representing the likelihood of the token belonging to each of the named entity classes. In this context, we define low-confidence tokens as those for which the model’s confidence in its prediction is lower than a predefined threshold. We empirically selected a confidence threshold of 99%, meaning that tokens with less than 99% confidence in their predicted class will be labeled by the oracle (i.e., the human annotator), while the remaining tokens will be self-labeled by the model.

  2. Query to the oracle: This is the traditional active learning technique implemented at the token level, where only the low-confidence tokens in the selected sentences are queried to the oracle.

  3. Self-labeling refinement: Once the oracle has labeled the low-confidence tokens, we use self-labeling to label the remaining unlabeled tokens of the queried samples. A simple approach is to predict the classes of all tokens and use the predictions as labels for the high-confidence ones. However, since the oracle has provided the proper labels for the low-confidence tokens, we can leverage them to improve the predictions. For example, the CNN-CNN-LSTM model [18] uses a greedy decoding approach that receives the predicted label of the previous token as an additional input to help predict the current token’s class. If the previous token had low confidence and was manually labeled by the oracle, we can modify this approach by feeding in the oracle-assigned label instead of the label previously predicted by the model. This process can be repeated iteratively a predefined number of times, with a decreasing number of token replacements in each iteration.
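The cooperative split at the heart of these procedures can be sketched as follows. This is an illustrative sketch, not the authors' implementation: `oracle_label` is a hypothetical callback standing in for the human annotator, and the iterative re-decoding of the refinement step is omitted.

```python
def cooperative_labels(predictions, confidences, oracle_label, threshold=0.99):
    """Split a queried sentence's tokens between the oracle and the model.

    `predictions` and `confidences` are per-token model outputs; `oracle_label`
    is a callback standing in for the human annotator (hypothetical interface).
    Returns the merged label sequence and the indices that were hand-annotated.
    """
    labels, queried = [], []
    for i, (pred, conf) in enumerate(zip(predictions, confidences)):
        if conf < threshold:
            labels.append(oracle_label(i))   # low confidence -> manual label
            queried.append(i)
        else:
            labels.append(pred)              # high confidence -> self-label
    return labels, queried
```

In the full algorithm, the model would then re-decode the sentence with the oracle-assigned labels fed in as previous-tag inputs, updating the self-labels over a few refinement passes.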

The refinement step was inspired by the iterative algorithm proposed by Park et al. [13]. They use iterative refinement to identify synonyms that can substitute specific tokens of a sentence with low impact on its coherence. The idea is to identify potential synonyms for specific tokens, replace them, and verify whether a masked language model predicts the synonyms with high confidence; the synonyms with the lowest masked language model scores are then replaced by other candidates. In our refinement step, however, we use reliable oracle-annotated tokens to enhance the model’s predictions for the unlabeled tokens.

6 Experimental Design

We use two consolidated English NER datasets, namely CoNLL03 [17] and OntoNotes5.0 [15], and one new legal-domain Portuguese NER dataset named Aposentadoria [2]. Aposentadoria contains named entities from 10 classes associated with retirement acts of public employees from the Diário Oficial do Distrito Federal (the Brazilian Federal District official gazette, in direct translation). The chosen datasets are listed in Table 1, along with relevant information such as their language and subject domain.

Table 1. Datasets description.

We use two neural models for sentence-level Active Self-Learning and one for token-level. For the sentence level, we use the CNN-CNN-LSTM proposed by Shen et al. [18] and the CNN-biLSTM-CRF proposed by Ma and Hovy [9]. We could not use the CNN-biLSTM-CRF in the token-level case because CRF classification layers output a distribution over entire tag sequences instead of a distribution of classes for each token. More details about the models’ and training algorithms’ hyperparameters are presented in the following subsections.

We evaluate four versions of our proposed Active Self-Learning algorithm using self-training techniques at the sentence level (Sect. 4). We also compare the results with the deep active learning algorithm proposed by Shen et al. [18]. Moreover, we compare the token-level Active Self-Learning algorithm with the subsequence-based active learning algorithm by Radmard et al. [16]. This algorithm was chosen as a baseline because its queries use sub-sentences instead of whole sentences.

For all proposed algorithms, we compute the maximum normalized log-probability (MNLP) measure proposed by Shen et al. [18] to generate queries for new samples. The MNLP value for a sentence X of length n can be calculated as

$$\begin{aligned} MNLP(X) = \max _{y_1, \ldots , y_n}\frac{1}{n}\sum ^n_{i=1}\log P(y_i\vert x_i, y_1, \ldots , y_{i-1}). \end{aligned}$$
(1)

where \(x_i\) is the i-th token of a sentence X, and \(y_i\) is the class predicted for the i-th token \(x_i\).

Querying for the oracle and for the trained model is based on the MNLP measure. For oracle annotation, we use least confidence: the unlabeled samples with the lowest MNLP scores in a given algorithm iteration are queried to the oracle. For automatic annotation, we use the exponentiated MNLP measure: the unlabeled samples whose exponentiated MNLP is higher than a predefined threshold are used for self-training in the next iteration of the Active Self-Learning algorithm. For all experiments, the empirically selected threshold was 0.99, meaning that the model’s prediction confidence must be at least 0.99 for a sample to be used for self-training.
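The two uses of the MNLP measure can be sketched as follows. This is a minimal sketch assuming the per-token probabilities of the model's best tag sequence are already available (the maximization over tag sequences is done by the decoder); note that the exponentiated MNLP is simply the geometric mean of the per-token confidences.

```python
import math

def mnlp(token_log_probs):
    """Length-normalized log-probability of the predicted tag sequence."""
    return sum(token_log_probs) / len(token_log_probs)

def qualifies_for_self_training(token_probs, threshold=0.99):
    """exp(MNLP) is the geometric mean of the per-token confidences; a sentence
    is self-labeled only when that mean reaches the threshold."""
    score = mnlp([math.log(p) for p in token_probs])
    return math.exp(score) >= threshold
```

For oracle queries, the same `mnlp` scores would instead be sorted in ascending order and the lowest-scoring sentences sent to the annotator.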

Similar to previous works in the literature [18, 19], we apply early stopping of the model training. The early stopping is based on the model’s performance on the validation set. We empirically chose a patience of 15 epochs, meaning that if the model’s performance does not improve in 15 consecutive epochs, the model’s training is interrupted.
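The early-stopping rule can be sketched as follows. This is a minimal sketch, assuming the validation metric is a score to maximize, such as the f1-score.

```python
class EarlyStopping:
    """Stop training when the validation score has not improved for
    `patience` consecutive epochs (we use a patience of 15)."""

    def __init__(self, patience=15):
        self.patience = patience
        self.best = float("-inf")
        self.bad_epochs = 0

    def should_stop(self, val_score):
        if val_score > self.best:
            self.best = val_score
            self.bad_epochs = 0        # improvement resets the counter
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```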

6.1 CNN-CNN-LSTM Model

The CNN-CNN-LSTM model consists of a character-level convolutional encoder, a word-level convolutional encoder, and an LSTM tag decoder.

The character-level CNN generates character-level vector representations of a word. It first transforms each character of a word into a vector representation, with embeddings initialized with uniform samples from \(\left[ -\sqrt{\frac{3}{dim}}, +\sqrt{\frac{3}{dim}} \right] \), as proposed by Ma and Hovy [9]. Dropout [20] is applied to the generated embeddings, as presented by Ma and Hovy [9]. Then, a one-dimensional convolutional layer extracts information from neighboring characters, followed by ReLU activation [11] and max-pooling to generate fixed-size character-level word embeddings.

We generate a word representation by concatenating a character-level vector with a vector representation from a lookup table, which we initialize with pre-trained word embeddings that we update throughout the training process. We used GloVe embeddings of 100 dimensions pre-trained on English newswire corpus [14], and GloVe embeddings of 300 dimensions pre-trained on multi-genre Portuguese corpus [5].

The LSTM tag decoder tags each word. Its input is the one-hot encoded vector of the previous tag concatenated with the encoded vector representation of the current word.
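The decoder input construction can be sketched as follows (an illustrative sketch with plain Python lists standing in for tensors):

```python
def decoder_input(word_encoding, prev_tag, num_tags):
    """Build the LSTM decoder input for the current word: the encoder's word
    vector concatenated with a one-hot encoding of the previous predicted tag."""
    one_hot = [0.0] * num_tags
    one_hot[prev_tag] = 1.0
    return word_encoding + one_hot
```

This is also where the token-level refinement of Sect. 5 hooks in: when the previous token was hand-annotated, `prev_tag` would be the oracle's label rather than the model's prediction.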

A thorough explanation of the hyperparameter tuning routine and hyperparameter values used in our experiments is presented in [1].

6.2 CNN-biLSTM-CRF Model

Similarly to the CNN-CNN-LSTM model, the CNN-biLSTM-CRF model uses a CNN to generate a character-level representation for each word. The full embedding, formed by the concatenation of character-level and word-level embeddings, is fed to a biLSTM layer that generates vector representations for each word in a sentence. A fully-connected layer then reduces the encoded vector’s dimension to the number of possible tags, and the reduced vector is fed to a CRF layer. Dropout layers are used before and after the biLSTM layer. All weight matrices for the biLSTM and fully-connected layers are randomly initialized with uniform samples from \([-\sqrt{\frac{6}{r+c}}, +\sqrt{\frac{6}{r+c}}]\), as proposed by Ma and Hovy [9], where r and c are the numbers of rows and columns in the weight matrix. We initialize the bias parameters of the biLSTM layer to 0.0 and the forget gate bias to 1.0.
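Both initialization schemes described above can be sketched as follows. This is a minimal sketch in plain Python; `dim`, `rows`, and `cols` stand for the embedding dimension and the weight-matrix shape.

```python
import math
import random

def char_embedding_init(dim, rng=None):
    """Character-embedding init: uniform samples from
    [-sqrt(3/dim), +sqrt(3/dim)], following Ma and Hovy."""
    rng = rng or random.Random(0)
    bound = math.sqrt(3.0 / dim)
    return [rng.uniform(-bound, bound) for _ in range(dim)]

def glorot_uniform(rows, cols, rng=None):
    """Weight-matrix init: uniform samples from
    [-sqrt(6/(r+c)), +sqrt(6/(r+c))], following Ma and Hovy."""
    rng = rng or random.Random(0)
    bound = math.sqrt(6.0 / (rows + cols))
    return [[rng.uniform(-bound, bound) for _ in range(cols)]
            for _ in range(rows)]
```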

The model’s hyperparameters for the English NER datasets are similar to those presented in the experiments of Siddhant and Lipton [19]. A grid search was employed for the Portuguese dataset to define the model’s hyperparameters. A thorough explanation of the hyperparameter tuning routine and hyperparameter values used in our experiments is presented in [1].

7 Results

We present and discuss the experimental results focusing on the performance overview (f1-score) of the models trained on each of the three datasets (see Table 1). Additionally, we compare the performance of neural models trained using various Active and Active Self-Learning algorithms. In separate sections, we present the results of sentence-level and token-level querying strategies.

7.1 Results of Sentence-Level Strategy

The ASL algorithm that uses pseudo-labeling as the default Self-Learning strategy, as proposed by [2], is referred to as DASL (Deep Active Self-Learning). To conduct a comprehensive experiment with various Self-Learning methods, we replace pseudo-labeling with three other techniques: DASL_WD (Deep Active Self-Learning with Word Dropout), DASL_VAT (Deep Active Self-Learning with Virtual Adversarial Training), and DASL_CVT (Deep Active Self-Learning with Cross-View Training). We also compare the Active Self-Learning strategies with Deep Active Learning (DAL). In Fig. 3, Supervised (dashed line) represents the best performance of the neural models when trained on the entire training set with true labels in a supervised fashion. DAL represents the baseline active learning algorithm by Shen et al. [18].

Figure 3 presents the performance (f1-score) achieved by the models against the percentage of oracle-annotated tokens (human-labeled tokens). It should be noted that Active Self-Learning algorithms use self-labeled data for training in addition to the hand-annotated data. Note also that the DASL_CVT technique is only implemented with the CNN-biLSTM-CRF model: CVT requires multiple views of the input, which is more computationally costly to implement in neural models with CNN-only encoders such as the CNN-CNN-LSTM.

From the experiments shown in Fig. 3, we notice that no technique performed consistently better than the others. However, for the Aposentadoria dataset, the DASL_VAT and DASL_CVT techniques help the model achieve its best performance faster than the baseline algorithms. We believe this behavior occurs because this dataset is less imbalanced than the others, meaning that most NER entities are present in most sentences. As a result, the model generalizes faster with less hand-annotated data and utilizes unlabeled data more effectively through self-training in the early iterations of the Active Self-Learning algorithms.

Fig. 3. Performance (f1-score) of the neural models as a function of the percentage of hand-annotated tokens in each iteration of the Active and Active Self-Learning algorithms.

7.2 Results of Token-Level Strategy

In this section, we present the results of the proposed token-level Active Self-Learning algorithm, which we call MDAL, a modified version of Deep Active Learning (DAL).

Figure 4 compares the original and the modified DAL algorithms, both using the validation f1-score for early stopping. The graphs show that the two algorithms achieve similar performance with the same total amount of labeled data, but our modified algorithm allows tokens to be self-labeled. From the graphs in the center, we observe that the models trained with the modified DAL algorithm reach peak performance with significantly fewer hand-annotated tokens. This is explained by the fact that, in the modified algorithm, most tokens are reliably self-annotated by the trained model, as shown in the graphs on the right of Fig. 4. Table 2 presents the percentage of hand-annotated tokens required for the trained model to reach 99% of its peak f1-score. We observe that the proposed method requires significantly fewer hand-annotated tokens. However, we note that these results do not account for the fact that many tokens in the queried sentences were not named entities and therefore did not require annotation. This is why our algorithm can train a model to peak performance with only 6.96% of human-labeled tokens on the CoNLL03 dataset, compared to 36.20% and 27% for the baseline algorithms. Nevertheless, for fine-grained NER datasets such as OntoNotes5.0, and for datasets with few tokens lacking a named entity class, such as Aposentadoria, the presented results are convincing evidence that our method reduces human annotation costs in the active learning process compared to the baselines.

Table 2. Percentage of the training set that was annotated by the oracle in order to train a model that reaches 99% of its peak performance.
Fig. 4. Comparison between the original and modified DAL algorithms. Graphs in the left column compare the performance of the trained model (y-axis) with the total amount of labeled data (x-axis), including tokens annotated by the oracle and, in the case of our modified algorithm, tokens self-annotated by the trained model. Graphs in the center column compare the f1-score of the trained model (y-axis) with the amount of data annotated by the oracle (x-axis). Graphs in the right column compare the percentages of the training set annotated by a human, annotated by the model through self-labeling, and mislabeled. In all graphs, DAL represents the original deep active learning algorithm from the literature, while MDAL indicates our modified algorithm with token-level self-labeling.

8 Conclusion

In this work, we presented an investigation of different types of Active Self-Learning algorithms for named entity recognition tasks. Across the many experiments performed, the sentence-level Active Self-Learning algorithms could not consistently outperform pure active learning. However, the proposed token-level Active Self-Learning algorithm could train a neural model to near-peak performance using fewer human-annotated tokens than the state-of-the-art algorithms used as baselines. Our proposed token-level algorithm is particularly effective for fine-grained datasets where most tokens are assigned a named entity class.

While our sentence-level Active Self-Learning algorithms did not overcome the current state of the art, future research may investigate how pretrained transformer models impact these algorithms. Bridging the current gaps in sentence-level Active Self-Learning research may be possible with models that carry substantial a priori information.