1 Introduction

Named entity recognition (NER) is an important task for extracting meaning from textual content. In recent years, advances in transfer learning with deep neural networks based on Transformers [15] have enabled substantial performance improvements for NLP models, especially in low-resource languages. In this context, transfer learning leverages previous knowledge of a particular language's grammatical structure - knowledge embedded in models previously trained on unlabeled data through self-supervision. Language models are the canonical examples of such pre-trained models.

Fig. 1.

Model soup - model creation by averaging weights of multiple models trained with different hyperparameter sets. Here, two pre-trained neural networks with identical architecture but different parameters are used to create a third model (“Model Soup”), whose weights are calculated by simply averaging the respective weights of Model 1 and Model 2, without any training.

In particular, for the Portuguese language, the most commonly used model is BERTimbau [12], a neural network that replicates, with minor changes, the architecture and training procedures of BERT [4], using brWaC [16] as the training corpus. BERTimbau was validated on specific tasks such as named entity recognition, textual similarity, and recognizing textual entailment, surpassing previously published results. Currently, a classical solution for constructing entity recognizers, classifiers, and other discriminative models for low-resource languages is fine-tuning a pre-trained language model - PLM (e.g., BERTimbau for Portuguese) - on the target task by adding linear layers and adjusting the network weights through retraining with a new optimization objective. Such an approach has achieved satisfactory results in most situations at a low training cost [7, 8, 11].

However, several techniques remain unexplored, at least for entity recognition tasks in the Portuguese language: domain adaptation, one-shot/few-shot learning, and recent techniques such as model soups [17].

This paper applies model soups and domain adaptation techniques to the named entity recognition (NER) task for the Brazilian Portuguese language. We assess the pertinence of such techniques for this specific language and evaluate multiple ways of applying them, finding, for example, that certain configurations of uniform model soups worked best. We present a study on reusing existing models for a relatively low-resource language, and we show that techniques proposed for other problems can effectively transfer to NER.

So, from the viewpoint of practical applications in natural language understanding, our main contributions can be summarized as follows: 1) experiments on a medium-to-low resource language (Brazilian Portuguese; most gold standard datasets, such as HAREM [10], were conceived for the European Portuguese variant); 2) further investigation of model soups in NLP, specifically for NER; 3) further investigation of domain adaptation focused on entity recognition, whereas previous research was more commonly conducted on document classification; and 4) evaluation of NER performance in few/zero-shot learning setups with a causal large language model (LLM).

This paper is organized as follows: Sect. 2 introduces the techniques to be analyzed and other related work, while Sects. 3 and 4 present the detailed experiments and results, respectively. Source code is available at https://github.com/monilouise/opt-bert-ner.

2 Related Work

Wortsman et al. [17] propose a technique that consists of generating a model by averaging the weights of two or more trained models, in contrast to the traditional approach, which is based on 1) training multiple models with several hyperparameter configurations and 2) choosing the model with the best performance on a validation set. The idea is also to combine multiple models without the additional inference and memory costs of traditional ensemble learning. The authors refer to the technique as model soups, illustrated in Fig. 1. It supports three variations: 1) construction of a model by averaging all models (uniform soup); 2) greedy soup, in which models are added sequentially to the soup as long as they improve accuracy on the validation dataset; and 3) learned soup, in which the interpolation weights for each model are optimized through gradient descent. According to the authors, the greedy strategy showed the best results, while learned soups require loading all the models into memory simultaneously, producing larger networks with little gain in accuracy. Their experiments were conducted on image and text classification. In this work, we evaluate the application of this technique to the NER task.
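The greedy variant can be sketched in a few lines. The snippet below is a minimal illustration, not the authors' implementation: models are represented as dictionaries of NumPy arrays, and `evaluate` stands in for validation-set accuracy.

```python
import numpy as np

def average_weights(models):
    """Uniformly average corresponding parameters of same-architecture models."""
    return {name: np.mean([m[name] for m in models], axis=0)
            for name in models[0]}

def greedy_soup(models, evaluate):
    """Greedy soup: rank models by validation score, then keep each
    candidate only if adding it to the average does not hurt that score."""
    ranked = sorted(models, key=evaluate, reverse=True)
    soup = [ranked[0]]
    best = evaluate(average_weights(soup))
    for candidate in ranked[1:]:
        score = evaluate(average_weights(soup + [candidate]))
        if score >= best:
            soup.append(candidate)
            best = score
    return average_weights(soup)
```

The uniform soup corresponds to calling `average_weights` on all candidate models directly, with no selection step.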

Sun, Qiu, Xu and Huang [13] conducted experiments on fine-tuning BERT pre-trained models for document classification. In particular, they analyzed domain adaptation - an approach that continues pretraining with data from the target task or the same domain. Here, we apply domain adaptation to the NER task.

Houlsby et al. [6] propose transfer learning with adapter modules. According to the authors, adapter modules add only a few trainable parameters per task. New tasks can be added without revisiting previous ones, keeping the parameters of the original network fixed, with a high degree of parameter sharing. They transferred the BERT Transformer model to 26 text classification tasks (including the GLUE benchmark), attaining near state-of-the-art performance. The following tasks were evaluated: document classification on several datasets, linguistic acceptability, sentiment analysis, paraphrase detection, semantic textual similarity, textual entailment, and question answering. Only English datasets were used, and the paper does not report performance on the named entity recognition task. Compared to our work, they focus on parameter efficiency and flexibility across different tasks, while we focus on entity recognition performance in a low-resource language. Even so, we intend to investigate the performance of the Adapters architecture on NER tasks in future research.

Regarding the Portuguese language, Rodrigues et al. [9] introduced Albertina, a Transformer-based foundation model for both the European (Portugal) and American (Brazil) Portuguese variants. Compared to BERTimbau, it uses 128-token sequence truncation instead of 512. The authors report results on tasks such as textual entailment, semantic textual similarity, and disambiguation, but to our knowledge, they did not evaluate the model on entity recognition.

Finally, BERT was one of the first Transformer-based models and, to our knowledge, the current main Portuguese PLMs are based on it. This raises some concerns about its potential compared to more recent PLMs. However, experiments conducted by Tänzer et al. [14] show that BERT can reach near-optimal performance even when many of the training set labels have been corrupted.

3 Methodology and Experiments

3.1 Model Soups Experiments

Creating a model by averaging other models’ weights was validated on image and document classification tasks [17].

Formally, let \(\theta = FineTune(\theta _{0}, h)\) be the set of parameters obtained by fine-tuning the pre-trained initialization \(\theta _{0}\) with the hyperparameter configuration h. The technique averages the \(\theta _i\), i.e., \(\theta _{S} = \frac{1}{|S|}\sum _{i \in S} \theta _{i}\), where \(S \subseteq \{1,...,k\}\) and k is the number of hyperparameter configurations (or models) to be used.

Initially, we reproduced the idea using a uniform average. The greedy approach was not analyzed due to the small number of available candidate models. We thus created a model whose weights are the average of the weights of the following models:

  (A) the entity recognition model developed to validate BERTimbau [12], trained on an adapted version of the First HAREM corpus [10];

  (B) a model analogous to (A), but trained with data augmentation; and

  (C) a model adapted to the same corpus used for the target task (First HAREM), as described in Sect. 3.2.

We denote this model as the average model or model soup.
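In code, building the uniform soup amounts to a single averaging pass over the parameter tensors. The sketch below uses small NumPy arrays as stand-ins for the actual checkpoints of models (A), (B), and (C); checkpoint loading and saving are omitted, and all parameter names are illustrative.

```python
import numpy as np

def uniform_soup(state_dicts):
    """Compute theta_S = (1/|S|) * sum_i theta_i over models that share
    the same architecture (same parameter names and shapes)."""
    if not state_dicts:
        raise ValueError("need at least one model")
    return {name: sum(sd[name] for sd in state_dicts) / len(state_dicts)
            for name in state_dicts[0]}

# Toy stand-ins for the three fine-tuned NER models (A), (B), and (C):
model_a = {"encoder.w": np.array([0.9, 1.1]), "classifier.b": np.array([0.0])}
model_b = {"encoder.w": np.array([1.1, 0.9]), "classifier.b": np.array([0.2])}
model_c = {"encoder.w": np.array([1.0, 1.0]), "classifier.b": np.array([0.1])}

soup = uniform_soup([model_a, model_b, model_c])
```

No gradient step is involved; the resulting state dict can be loaded directly into the shared architecture.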

For the combinations cited above, we evaluated the BERT base and large variations, which differ in the number of parameters. Also, we evaluated settings with and without an additional layer for conditional random fields (CRF), which we used to reduce the probability of generating invalid label sequences.

Fig. 2.

Model soup - the original idea (adapted to the setting of two models trained on a NER task from a common initialization - in this case, the language model).

There is a difference between our experiments and the original proposal by Wortsman et al. [17]: the authors assume that the models were independently optimized from the same initialization, which would lead them to the same region or valley of the error surface. Such a strategy is represented in Fig. 2. Here, however, as shown in Fig. 3, only models A and B start from the same initialization (the BERTimbau language model), while model C was fine-tuned on the First HAREM textual content.

The experiments’ results are shown in Sect. 4.1.

3.2 Domain Adaptation Experiments

Sun et al. [13] conducted experiments on different fine-tuning options, including multi-task learning. However, they concluded that the benefits from multi-task learning tend to be inferior to those obtained from additional pretraining. So, we did not experiment with multi-task learning.

To conduct the experiments on domain adaptation, we use the following starting points:

  (A) a NER model for (long) documents related to public accounts auditing;

  (B) the NER model trained for the BERTimbau evaluation [12];

  (C) an entity recognition model [11] trained on a public news dataset [7].

In (A), during domain adaptation, the original language model - BERTimbau - received additional training on a dataset different from the one used to train the target NER model. However, this dataset came from the same domain and origin (documents related to public accounts auditing), leading to an intermediary, domain-optimized language model. As the training task was masked language modeling (MLM), this dataset contains no labels and can be considered a kind of "superset" of the entity recognition dataset: for example, the dataset for training the domain-adapted language model has 52,912 documents, against 431 documents in the labeled dataset used to train the entity recognition model. The complete flow is described in Fig. 4.
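The MLM objective used for this additional pretraining masks part of each input and asks the model to reconstruct it, so no labels are required. The snippet below sketches BERT-style dynamic masking on a whitespace-tokenized sentence; the 15% / 80-10-10 proportions follow the original BERT recipe, and the mini-vocabulary is purely illustrative.

```python
import random

MASK = "[MASK]"
# Illustrative mini-vocabulary used for the random-replacement branch:
VOCAB = ["contrato", "licitação", "auditoria", "órgão", "valor"]

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Select mask_prob of the positions; of those, replace 80% with
    [MASK], 10% with a random vocabulary token, and keep 10% unchanged.
    Labels hold the original token at selected positions, None elsewhere."""
    rng = random.Random(seed)
    inputs, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            labels[i] = tok
            r = rng.random()
            if r < 0.8:
                inputs[i] = MASK
            elif r < 0.9:
                inputs[i] = rng.choice(VOCAB)
            # else: keep the original token (the model must still predict it)
    return inputs, labels
```

Because only `(inputs, labels)` pairs like these are needed, the unlabeled 52,912-document corpus can be used directly for domain adaptation.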

Fig. 3.

Model soup - alternative version using two models trained from a common parameter set (BERTimbau language model) and a third model trained on an intermediary language model.

For (B) and (C), the respective datasets were used to build the intermediary language models: First HAREM (the same train split used by Souza et al. [12]) and the media news dataset (the same train split as the original dataset [7]). For both datasets, the labels were discarded during MLM training.

Fig. 4.

Training flow for a NER specific to public accounts audit domain.

The three resulting language models were used as base models for retraining the three cited entity recognizers, in order to measure the impact of domain adaptation. The learning rate and batch size hyperparameters for training all intermediary language models were the same as reported by Souza et al. [12].

Section 4.2 shows the experiments’ results.

3.3 Causal Language Modeling - Few/zero-Shot Learning

We also conducted experiments on entity recognition with few and zero-shot learning by using a large language model (LLM) based on GPT-3.5, an improved version of GPT-3 [1], pre-trained with a causal language modeling objective. We used the same Brazilian public news dataset already described [7] and compared the following settings:

  (A) a NER model [8] based on fine-tuning BERTimbau on the news dataset;

  (B) GPT-3.5 few-shot learning with an instruction prompt and examples;

  (C) GPT-3.5 few-shot learning with no instruction prompt (only examples);

  (D) GPT-3.5 zero-shot learning.
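The three GPT-3.5 settings differ only in how the prompt is assembled. The helper below is an illustrative sketch of that assembly - the actual prompts and output format used in our experiments may differ - where `examples` carries few-shot (text, entities) pairs and `instruction` the optional entity-type description.

```python
def build_prompt(sentence, examples=None, instruction=None):
    """Assemble a NER prompt for a causal LLM.
    (B): instruction + examples; (C): examples only; (D): sentence only."""
    parts = []
    if instruction:
        parts.append(instruction)
    for text, entities in (examples or []):
        parts.append(f"Texto: {text}\nEntidades: {entities}")
    # The target sentence always comes last, with an open slot to complete.
    parts.append(f"Texto: {sentence}\nEntidades:")
    return "\n\n".join(parts)
```

In the zero-shot case, the model completes the open `Entidades:` slot with no format constraints, which is what produces the unstructured conversational output discussed in Sect. 4.3.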

The results are shown in Sect. 4.3.

4 Results and Discussions

4.1 Model Soup Results Analysis

Table 1 summarizes the results of the main experiments on the model soup technique. The bolded rows refer to the baseline (original) models, with mean and standard deviation over 5 training executions. Rows prefixed with “M1:”, “M2:”, ..., and “M12:” are the respective “best” models for each variant among the five training runs; these models are the components of the model soups. Finally, rows related to the model soups - labeled with + (plus) signs - do not involve training (only the averaging of two or three models), so they were not randomized; this is why standard deviation values are not shown in these rows, except for the setup with additional fine-tuning, as explained below. We report precision, recall, and F1 metrics for all the experiments.

In the first experiment, we evaluate the direct application of the model soup technique. We used a uniform average over a set composed of the respective best models (i.e., best training runs) among the variations described in Sect. 3.1.

Later, we evaluated a second setup, in which the model soup received additional fine-tuning for the NER task. As shown in Table 1, for the smaller model size (base), the first model, without additional fine-tuning, shows better results for the precision and F1 metrics.

Table 1. Experiments with model soups for BERTimbau NER (“\(\textrm{BERT}_\textrm{BASE}\)” or “\(\textrm{BERT}_\textrm{LARGE}\)”) trained on First HAREM

As already described in Sect. 3.1, for each model size (base/large), we added the following components to each combination: (A) the original BERTimbau NER; (B) BERTimbau NER retrained with data augmentation; and (C) BERTimbau NER retrained after domain adaptation to First HAREM (the original BERTimbau language model fine-tuned on the First HAREM text set), as described in Sect. 3.2.

Table 2. Domain adaptation for documents related to public accounts audit.

Later, the third variant (C) was removed from each combination, leading to the original schema shown in Fig. 2. We observed that the combinations based only on (A) (original NER) and (B) (data-augmented NER) led to better values for the precision and F1 metrics, confirming the original hypothesis of Wortsman et al. [17] of using independently optimized models from the same initialization.

Comparing our methodology with that of the authors [17], they use accuracy in most of their image and text classification experiments; in particular, they do not explicitly report recall, which shows worse results in our experiments.

Further investigation is therefore needed into why recall worsens under the model soup technique. For now, this leads us to believe that the method may be more suitable for situations in which precision is more important than recovering a high number of entities, i.e., where missing entities is preferable to producing false positives.

On the other hand, given known limitations in using precision-recall-F1 for entity recognition, better and more interpretable metrics for this task are a current research topic [5].

Finally, according to Wortsman et al. [17], preliminary experiments show that the improvement on textual corpora, although present, is less pronounced than in image classification. The authors stress the need for additional investigation into this subject. They used ImageNet and several variations, including illustrations and other representations beyond real photos (e.g., drawings, art, cartoons, tattoos, origami, sculptures). For textual classification, however, they used general-domain datasets for paraphrase detection, coherence vs. contradiction, linguistic acceptability, and sentiment analysis. A preliminary, qualitative analysis of the image vs. text data shows more variability and a larger data size in the first case (e.g., ImageNet contains millions of images), which could have led to the more significant impact on image classification.

4.2 Domain Adaptation Results Analysis

In this subsection, we report the results achieved by the NER models when trained on the intermediary language models described in Sect. 3.2.

The results of the experiment with the NER model trained on documents related to public accounts auditing are shown in Table 2.

As a comparison metric, Masked F1 [11] was used. This metric is F1 calculated over a post-processed output in which invalid transitions are corrected according to the IOB2 scheme, instead of using the raw output directly. Being based on in-domain adaptation, this setup led to the most pronounced improvements.
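One common way to implement this post-processing, which we sketch below (details may differ from the exact procedure in [11]), is to rewrite any I-X tag that does not continue a span of type X as B-X:

```python
def fix_iob2(tags):
    """Repair invalid IOB2 transitions before computing F1: an I-X tag
    whose predecessor is neither B-X nor I-X actually starts a new span,
    so it is rewritten as B-X."""
    fixed, prev = [], "O"
    for tag in tags:
        if tag.startswith("I-") and prev not in ("B-" + tag[2:], "I-" + tag[2:]):
            tag = "B-" + tag[2:]
        fixed.append(tag)
        prev = tag
    return fixed
```

F1 is then computed over `fix_iob2(predictions)` rather than over the raw model output.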

In the experiment conducted on the NER model for media news [7, 8, 11], the results are shown in Table 3.

Table 3. Domain adaptation for media news.
Table 4. Domain adaptation for First HAREM.

The experiment was conducted only with variants based on a pre-trained Brazilian Portuguese language model (BERTimbau), because multi-language models yielded inferior performance, according to Silva et al. [11] and Chen et al. [2]. Masked F1 was also used as the main comparison metric. Table 4 shows the results achieved with the NER model used in the BERTimbau evaluation [12].

Table 5. Domain adaptation for First HAREM - qualitative analysis.

The second column shows outputs generated by the baseline NER, while the third column shows outputs from the NER trained on the intermediary language model (domain-adapted NER). While the former misclassifies Portuguese expressions, the latter labels them correctly. All the examples belong to the First HAREM test dataset.

As can be noted in Tables 3 and 4, the experiments conducted with the media news NER and the BERTimbau NER did not reveal significant differences after domain adaptation. Such results confirm the observations of Sun et al. [13]: the domain adaptation performed on media news and First HAREM is “within-task” (the texts used are the same as the training texts for the target task). In general, “in-domain” pretraining - using texts from the same domain that are different from the texts in the target task's training dataset - gives superior results. We suspect that within-task pretraining could lead to overfitting because it reuses the texts of the target task dataset.

However, after error analysis, we realized that some European expressions, organizations, and local names could be correctly classified only after domain adaptation of the (Brazilian) BERTimbau language model on the (European) First HAREM corpus, while they were misclassified when the NER model was trained directly on the raw BERTimbau language model. The results are shown in Table 5, where we have the following examples:

Table 6. Sample outputs by a generic domain LM vs. specific domain LM.
Table 7. Few/zero-shot learning
  • 1st row: In Portugal, “Consoada” refers to Christmas Night, which should be classified as a temporal (TIME) mention.

  • 2nd row: “Soajo” is the name of a Portuguese village.

  • 3rd row: “Abade da Loureira” refers to both an organization (ORG) and a street (LOC). But given the specific context in the sentence, it should be classified as an organization (ORG).

  • 4th row: “Centro de Formação de Informática do Minho” is a local educational institution.

  • 5th row: “São João do Souto” refers to an extinct Portuguese parish.

This makes sense because First HAREM is a European Portuguese corpus, different from the Brazilian corpus used to train BERTimbau. These results show semantic gains from domain adaptation, although the quantitative performance differences are not statistically significant.

Furthermore, the results shown here were obtained in the context of the NER task; for document classification, within-task pretraining has been commonly used.

Finally, we showed that domain adaptation in experiment (A) - training an intermediary language model with a larger, same-domain dataset - had a higher impact on the F1 metric than the experiments with within-task domain adaptation ((B) and (C)). Furthermore, a qualitative analysis of the intermediary language model shows example outputs for the masked-term prediction task, i.e., the public accounts auditing language model can generate texts related to themes such as contracts and biddings, as seen in Table 6.

4.3 Causal Language Modeling - Few/zero-Shot Learning

This subsection summarizes the experiments conducted with GPT-3.5, a large language model pre-trained with a causal language modeling objective. The results are shown in Table 7.

In the few-shot setting, we gave the model a few examples, optionally accompanied by an instruction prompt describing the kinds of entities expected; we also tested sending only the examples, without any further instruction. In the zero-shot setting, we only asked the question, giving no examples and leaving the model free to return the information in any format. It thus returned the results in an unstructured, conversational format, making it difficult to measure precision and F1; therefore, we report only recall.

Despite the impressive reasoning skills recently shown by GPT-3.5 and “ChatGPT” models, it is interesting to note that their quantitative performance still lags behind traditionally fine-tuned models. We hypothesize that BERT's bidirectional masked language modeling objective may be more suitable for discriminative tasks, while causal language models - trained only to predict the next token - cannot exploit the context to the right of a token. On the other hand, we believe more research is necessary to investigate prompt engineering practices better suited to NER and similar tasks. Finally, we observed impressive qualitative results with GPT-3.5 zero-shot learning: although its recall is considerably worse than that of the BERT-based baseline, it returns both the entities and the relationships between them.

5 Conclusions

Among the analyzed techniques, model soups achieved the best results for the NER task. In the experiments conducted on domain adaptation, the best results were achieved with in-domain adaptation. We did not observe significant improvements from within-task domain adaptation, although we found that the model could learn domain-specific terms from the First HAREM corpus. The lack of both quantity and diversity of “golden” labeled datasets for Portuguese, when compared to English or Chinese, is a substantial limitation for research on several tasks or on multitask learning, as done, for example, by Houlsby et al. [6]. We believe advances in few-shot learning and prompt engineering could mitigate this limitation through synthetic data generation. So, besides investigating fine-tuning with Adapters, we expect to conduct further experiments on few-shot learning. Finally, we intend to investigate fine-tuning the recent Albertina LLM for the NER task and check whether its larger capacity can compensate for its smaller context window (128 tokens) compared to BERTimbau.