1 Introduction

Named entity recognition (NER) is an important task for extracting meaning from textual content. In recent years, advances in transfer learning with deep neural networks based on Transformers [15] have enabled substantial performance improvements for NLP models, especially in low-resource languages. In this context, transfer learning leverages previous knowledge of a particular language's grammatical structure - knowledge embedded in models previously trained on unlabeled data through self-supervision. Language models are the canonical examples of such pre-trained models.

Fig. 1.

Model soup - model creation by averaging weights of multiple models trained with different hyperparameter sets. Here, two pre-trained neural networks with identical architecture but different parameters are used to create a third model (“Model Soup”), whose weights are calculated by simply averaging the respective weights of Model 1 and Model 2, without any training.

In particular, for the Portuguese language, the most commonly used model is BERTimbau [12], a neural network that replicates, with minor changes, the architecture and training procedures of BERT [4], using brWaC [16] as the training corpus. BERTimbau was validated on specific tasks such as named entity recognition, textual similarity, and recognizing textual entailment, surpassing previously published results. Currently, a classical solution for constructing entity recognizers, classifiers, and other discriminative models for low-resource languages is fine-tuning a pre-trained language model - PLM (e.g., BERTimbau for Portuguese) - on the target task by adding linear layers and adjusting the network weights through retraining with a new optimization objective. Such an approach has achieved satisfactory results in most situations at a low training cost [7, 8, 11].

However, several techniques remain unexplored, at least for entity recognition tasks in the Portuguese language: domain adaptation, one-shot/few-shot learning, and recent techniques such as model soups [17].

This paper applies model soups and domain adaptation techniques to the named entity recognition (NER) task for the Brazilian Portuguese language. We assess the pertinence of such techniques for this specific language and evaluate multiple ways of applying them, finding, for example, that certain configurations of uniform model soups worked best. We present a study on reusing existing models for a relatively low-resource language, and we show that techniques proposed for other problems can effectively transfer to NER.

So, from the viewpoint of practical applications in natural language understanding, our main contributions can be summarized as follows: 1) experiments on a medium-to-low resource language (Brazilian Portuguese; most gold standard datasets, such as HAREM [10], were conceived for the European Portuguese variant); 2) further investigation of model soups in NLP, specifically for NER; 3) further investigation of domain adaptation focused on entity recognition, whereas previous research was more commonly conducted on document classification; and 4) evaluation of NER performance in few/zero-shot learning setups with a causal large language model (LLM).

This paper is organized as follows: Sect. 2 introduces the techniques to be analyzed and other related work, while Sects. 3 and 4 present the detailed experiments and results, respectively. Source code is available at https://github.com/monilouise/opt-bert-ner.

2 Related Work

Wortsman et al. [17] propose a technique that consists of generating a model by averaging the weights of two or more trained models, in contrast to the traditional approach, which is based on 1) training multiple models with several hyperparameter configurations and 2) choosing the model with the best performance on a validation set. The idea is also to combine multiple models without the additional inference and memory costs of traditional ensemble learning. The authors refer to the technique as model soups, illustrated in Fig. 1. It supports three variations: 1) construction of a model by averaging all models (uniform soup); 2) greedy soup, in which models are added sequentially to the soup as long as they improve accuracy on the validation dataset; and 3) learned soup, in which the interpolation weights for each model are optimized through gradient descent. According to the authors, the greedy strategy showed the best results, while learned soups require loading all the models into memory simultaneously, producing larger networks with little gain in accuracy. Their experiments were conducted on image and text classification. In this work, we evaluate the application of this technique to the NER task.
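The greedy variant can be sketched in a few lines. The snippet below is a minimal illustration, not the authors' implementation: models are represented as dictionaries of NumPy arrays, and `evaluate` stands in for validation-set accuracy.

```python
import numpy as np

def average_weights(models):
    """Uniformly average corresponding parameters of same-architecture models."""
    return {name: np.mean([m[name] for m in models], axis=0)
            for name in models[0]}

def greedy_soup(models, evaluate):
    """Greedy soup: rank models by validation score, then keep each
    candidate only if adding it to the average does not hurt that score."""
    ranked = sorted(models, key=evaluate, reverse=True)
    soup = [ranked[0]]
    best = evaluate(average_weights(soup))
    for candidate in ranked[1:]:
        score = evaluate(average_weights(soup + [candidate]))
        if score >= best:
            soup.append(candidate)
            best = score
    return average_weights(soup)
```

The uniform soup corresponds to calling `average_weights` on all candidate models directly, with no selection step.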

Sun, Qiu, Xu and Huang [13] conducted experiments on fine-tuning BERT pre-trained models for document classification. In particular, they analyzed domain adaptation - an approach that continues pretraining with data from the target task or the same domain. Here, we apply domain adaptation to the NER task.

Houlsby et al. [6] propose transfer learning with adapter modules. According to the authors, adapter modules add only a few trainable parameters per task. New tasks can be added without revisiting previous ones, keeping the parameters of the original network fixed, with a high degree of parameter sharing. They transferred the BERT Transformer model to 26 text classification tasks (including the GLUE benchmark), attaining near state-of-the-art performance. The following tasks were evaluated: document classification on several datasets, linguistic acceptability, sentiment analysis, paraphrase detection, semantic textual similarity, textual entailment, and question answering. Only English datasets were used, and the paper does not report performance on the named entity recognition task. Compared to our work, they focus on parameter efficiency and flexibility across different tasks, while we focus on entity recognition performance in a low-resource language. Even so, we intend to investigate the performance of the Adapters architecture on NER tasks in future research.

Regarding the Portuguese language, Rodrigues et al. [9] introduced Albertina, a Transformer-based foundation model for both the European (Portugal) and American (Brazil) Portuguese variants. Compared to BERTimbau, it uses 128-token sequence truncation instead of 512. The authors report results on tasks such as textual entailment, semantic textual similarity, and disambiguation, but to our knowledge, they did not evaluate the model on entity recognition.

Finally, BERT was one of the first Transformer-based models and, to our knowledge, the current main Portuguese PLMs are based on it. This raises some concerns about its potential compared to more recent PLMs. However, experiments conducted by Tänzer et al. [14] show that BERT can reach near-optimal performance even when many of the training set labels have been corrupted.

3 Methodology and Experiments

3.1 Model Soups Experiments

Creating a model by averaging other models’ weights was validated on image and document classification tasks [17].

Formally, let \(\theta = FineTune(\theta _{0}, h)\) be the set of parameters obtained by fine-tuning the pre-trained initialization \(\theta _{0}\) with the hyperparameter configuration h. The technique averages the \(\theta _i\), i.e., \(\theta _{S} = \frac{1}{|S|}\sum _{i \in S} \theta _{i}\), where \(S \subseteq \{1,...,k\}\) and k is the number of hyperparameter configurations (or models) to be used.

Initially, we reproduced the idea using a uniform average. The greedy approach was not analyzed due to the small number of available candidate models. We thus created a model whose weights are the average of the weights of the following models:

  (A) the entity recognition model developed to validate BERTimbau [12], trained on an adapted version of the First HAREM corpus [10];

  (B) a model analogous to (A), but trained with data augmentation; and

  (C) a model adapted to the same corpus used for the target task (First HAREM), as described in Sect. 3.2.

We denote this model as the average model or model soup.
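In code, building the uniform soup amounts to a single averaging pass over the parameter tensors. The sketch below uses small NumPy arrays as stand-ins for the actual checkpoints of models (A), (B), and (C); checkpoint loading and saving are omitted, and all parameter names are illustrative.

```python
import numpy as np

def uniform_soup(state_dicts):
    """Compute theta_S = (1/|S|) * sum_i theta_i over models that share
    the same architecture (same parameter names and shapes)."""
    if not state_dicts:
        raise ValueError("need at least one model")
    return {name: sum(sd[name] for sd in state_dicts) / len(state_dicts)
            for name in state_dicts[0]}

# Toy stand-ins for the three fine-tuned NER models (A), (B), and (C):
model_a = {"encoder.w": np.array([0.9, 1.1]), "classifier.b": np.array([0.0])}
model_b = {"encoder.w": np.array([1.1, 0.9]), "classifier.b": np.array([0.2])}
model_c = {"encoder.w": np.array([1.0, 1.0]), "classifier.b": np.array([0.1])}

soup = uniform_soup([model_a, model_b, model_c])
```

No gradient step is involved; the resulting state dict can be loaded directly into the shared architecture.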

For the combinations cited above, we evaluated the BERT base and large variations, which differ in the number of parameters. Also, we evaluated settings with and without an additional layer for conditional random fields (CRF), which we used to reduce the probability of generating invalid label sequences.

Fig. 2.

Model soup - the original idea (adapted to the setting of two models trained on a NER task from a common initialization - in this case, the language model).

There is a difference between our experiments and the original proposal by Wortsman et al. [17]: the authors assume that the models were independently optimized from the same initialization, which would lead them to the same region or valley of the error surface. Such a strategy is represented in Fig. 2. Here, however, as shown in Fig. 3, only models A and B start from the same initialization (the BERTimbau language model), while model C was fine-tuned on the First HAREM textual content.

The experiments’ results are shown in Sect. 4.1.

3.2 Domain Adaptation Experiments

Sun et al. [13] conducted experiments on different fine-tuning options, including multi-task learning. However, they concluded that the benefits from multi-task learning tend to be inferior to those obtained from additional pretraining. So, we did not experiment with multi-task learning.

To conduct the experiments on domain adaptation, we use the following starting points:

  (A) a NER model for (long) documents related to public accounts auditing;

  (B) the NER model trained for the BERTimbau evaluation [12];

  (C) an entity recognition model [11] trained on a public news dataset [7].

In (A), during domain adaptation, the original language model - BERTimbau - received additional training on a dataset different from the one used to train the target NER model. However, this dataset came from the same domain and origin (documents related to public accounts auditing), leading to an intermediary, domain-optimized language model. As the training task was masked language modeling (MLM), this dataset contains no labels and can be considered a kind of "superset" of the entity recognition dataset: for example, the dataset for training the domain-adapted language model has 52,912 documents, against 431 documents in the labeled dataset used to train the entity recognition model. The complete flow is described in Fig. 4.
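The MLM objective used for this additional pretraining masks part of each input and asks the model to reconstruct it, so no labels are required. The snippet below sketches BERT-style dynamic masking on a whitespace-tokenized sentence; the 15% / 80-10-10 proportions follow the original BERT recipe, and the mini-vocabulary is purely illustrative.

```python
import random

MASK = "[MASK]"
# Illustrative mini-vocabulary used for the random-replacement branch:
VOCAB = ["contrato", "licitação", "auditoria", "órgão", "valor"]

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Select mask_prob of the positions; of those, replace 80% with
    [MASK], 10% with a random vocabulary token, and keep 10% unchanged.
    Labels hold the original token at selected positions, None elsewhere."""
    rng = random.Random(seed)
    inputs, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            labels[i] = tok
            r = rng.random()
            if r < 0.8:
                inputs[i] = MASK
            elif r < 0.9:
                inputs[i] = rng.choice(VOCAB)
            # else: keep the original token (the model must still predict it)
    return inputs, labels
```

Because only `(inputs, labels)` pairs like these are needed, the unlabeled 52,912-document corpus can be used directly for domain adaptation.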

Fig. 3.

Model soup - alternative version using two models trained from a common parameter set (BERTimbau language model) and a third model trained on an intermediary language model.

For (B) and (C), the respective datasets were used to build the intermediary language models: First HAREM (the same train split used by Souza et al. [12]) and the media news dataset (the same train split as the original dataset [7]). For both datasets, the labels were discarded during MLM training.

Fig. 4.

Training flow for a NER specific to public accounts audit domain.

The three resulting language models were used as base models for retraining the three cited entity recognizers, in order to measure the impact of domain adaptation. The learning rate and batch size hyperparameters for training all intermediary language models were the same as reported by Souza et al. [12].

Section 4.2 shows the experiments’ results.

3.3 Causal Language Modeling - Few/zero-Shot Learning

We also conducted experiments on entity recognition with few and zero-shot learning by using a large language model (LLM) based on GPT-3.5, an improved version of GPT-3 [1], pre-trained with a causal language modeling objective. We used the same Brazilian public news dataset already described [7] and compared the following settings:

  (A) a NER model [8] based on fine-tuning BERTimbau on the news dataset;

  (B) GPT-3.5 few-shot learning with an instruction prompt and examples;

  (C) GPT-3.5 few-shot learning with no instruction prompt (only examples);

  (D) GPT-3.5 zero-shot learning.
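The three GPT-3.5 settings differ only in how the prompt is assembled. The helper below is an illustrative sketch of that assembly - the actual prompts and output format used in our experiments may differ - where `examples` carries few-shot (text, entities) pairs and `instruction` the optional entity-type description.

```python
def build_prompt(sentence, examples=None, instruction=None):
    """Assemble a NER prompt for a causal LLM.
    (B): instruction + examples; (C): examples only; (D): sentence only."""
    parts = []
    if instruction:
        parts.append(instruction)
    for text, entities in (examples or []):
        parts.append(f"Texto: {text}\nEntidades: {entities}")
    # The target sentence always comes last, with an open slot to complete.
    parts.append(f"Texto: {sentence}\nEntidades:")
    return "\n\n".join(parts)
```

In the zero-shot case, the model completes the open `Entidades:` slot with no format constraints, which is what produces the unstructured conversational output discussed in Sect. 4.3.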

The results are shown in Sect. 4.3.

4 Results and Discussions

4.1 Model Soup Results Analysis

Table 1 summarizes the results of the main experiments on the model soup technique. The bolded rows refer to the baseline (original) models, with mean and standard deviation over 5 training executions. Rows prefixed with “M1:”, “M2:”, ..., and “M12:” are the respective “best” models for each variant among the five training runs; these models are the components of the model soups. Finally, rows related to the model soups - labeled with + (plus) signs - do not involve training (only the averaging of two or three models), so they were not randomized; this is why standard deviation values are not shown in these rows, except for the setup with additional fine-tuning, as explained below. We report precision, recall, and F1 metrics for all the experiments.

In the first experiment, we evaluate the direct application of the model soup technique. We used a uniform average over a set composed of the respective best models (i.e., best training runs) among the variations described in Sect. 3.1.

Later, we evaluated a second setup, in which the model soup received additional fine-tuning for the NER task. As shown in Table 1, for the smaller model size (base), the first model, without additional fine-tuning, shows better results for the precision and F1 metrics.

Table 1. Experiments with model soups for BERTimbau NER (“\(\textrm{BERT}_\textrm{BASE}\)” or “\(\textrm{BERT}_\textrm{LARGE}\)”) trained on First HAREM

As already described in Sect. 3.1, for each model size (base/large), we added the following components to each combination: (A) the original BERTimbau NER; (B) BERTimbau NER retrained with data augmentation; and (C) BERTimbau NER retrained after domain adaptation to First HAREM (the original BERTimbau language model fine-tuned on the First HAREM text set), as described in Sect. 3.2.

Table 2. Domain adaptation for documents related to public accounts audit.

Later, the third variant (C) was removed from each combination, leading to the original schema shown in Fig. 2. We observed that the combinations based only on (A) (original NER) and (B) (data-augmented NER) led to better values for the precision and F1 metrics, confirming the original hypothesis of Wortsman et al. [17] of using independently optimized models from the same initialization.

Comparing our methodology with that of the authors [17], they use accuracy in most of their image and text classification experiments; in particular, they do not explicitly report recall, which shows worse results in our experiments.

Further investigation is therefore needed into why recall worsens under the model soup technique. For now, this leads us to believe that the method may be more suitable for situations in which precision is more important than recovering a high number of entities, i.e., where missing entities is preferable to producing false positives.

On the other hand, given known limitations in using precision-recall-F1 for entity recognition, better and more interpretable metrics for this task are a current research topic [5].

Finally, according to Wortsman et al. [17], preliminary experiments show that the improvement on textual corpora, although present, is less pronounced than in image classification. The authors stress the need for additional investigation into this subject. They used ImageNet and several variations, including illustrations and other representations beyond real photos (e.g., drawings, art, cartoons, tattoos, origami, sculptures). For textual classification, however, they used general-domain datasets for paraphrase detection, coherence vs. contradiction, linguistic acceptability, and sentiment analysis. A preliminary, qualitative analysis of the image vs. text data shows more variability and a larger data size in the first case (e.g., ImageNet contains millions of images), which could have led to the more significant impact on image classification.

4.2 Domain Adaptation Results Analysis

In this subsection, we report the results achieved by the NER models when trained on the intermediary language models described in Sect. 3.2.

The results of the experiment with the NER model trained on documents related to public accounts auditing are shown in Table 2.

As a comparison metric, Masked F1 [11] was used. This metric is F1 calculated over a post-processed output in which invalid transitions are corrected according to the IOB2 scheme, instead of using the raw output directly. Being based on in-domain adaptation, this setup led to the most pronounced improvements.
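One common way to implement this post-processing, which we sketch below (details may differ from the exact procedure in [11]), is to rewrite any I-X tag that does not continue a span of type X as B-X:

```python
def fix_iob2(tags):
    """Repair invalid IOB2 transitions before computing F1: an I-X tag
    whose predecessor is neither B-X nor I-X actually starts a new span,
    so it is rewritten as B-X."""
    fixed, prev = [], "O"
    for tag in tags:
        if tag.startswith("I-") and prev not in ("B-" + tag[2:], "I-" + tag[2:]):
            tag = "B-" + tag[2:]
        fixed.append(tag)
        prev = tag
    return fixed
```

F1 is then computed over `fix_iob2(predictions)` rather than over the raw model output.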

In the experiment conducted on the NER model for media news [7, 8, 11], the results are shown in Table 3.

Table 3. Domain adaptation for media news.
Table 4. Domain adaptation for First HAREM.

The experiment was conducted only with variants based on a pre-trained Brazilian Portuguese language model (BERTimbau), because multi-language models yielded inferior performance, according to Silva et al. [11] and Chen et al. [2]. Masked F1 was also used as the main comparison metric. Table 4 shows the results achieved with the NER model used in the BERTimbau evaluation [12].

Table 5. Domain adaptation for First HAREM - qualitative analysis.

The second column shows outputs generated by the baseline NER, while the third column shows outputs from the NER trained on the intermediary language model (domain-adapted NER). While the former misclassifies Portuguese expressions, the latter labels them correctly. All the examples belong to the First HAREM test dataset.

As can be noted in Tables 3 and 4, the experiments conducted with the media news NER and the BERTimbau NER did not reveal significant differences after domain adaptation. Such results confirm the observations of Sun et al. [13]: the domain adaptation performed on media news and First HAREM is “within-task” (the texts used are the same as the training texts for the target task). In general, “in-domain” pretraining - using texts from the same domain that are different from the texts in the target task's training dataset - gives superior results. We suspect that within-task pretraining could lead to overfitting because it reuses the texts of the target task dataset.

However, after error analysis, we realized that some European expressions, organizations, and local names could be correctly classified only after domain adaptation of the (Brazilian) BERTimbau language model on the (European) First HAREM corpus, while they were misclassified when the NER model was trained directly on the raw BERTimbau language model. The results are shown in Table 5, where we have the following examples:

Table 6. Sample outputs by a generic domain LM vs. specific domain LM.
Table 7. Few/zero-shot learning
  • 1st row: In Portugal, “Consoada” refers to Christmas Night, which should be classified as a temporal (TIME) mention.

  • 2nd row: “Soajo” is the name of a Portuguese village.

  • 3rd row: “Abade da Loureira” refers to both an organization (ORG) and a street (LOC). But given the specific context in the sentence, it should be classified as an organization (ORG).

  • 4th row: “Centro de Formação de Informática do Minho” is a local educational institution.

  • 5th row: “São João do Souto” refers to an extinct Portuguese parish.

This makes sense because First HAREM is a European Portuguese corpus, different from the Brazilian corpus used to train BERTimbau. These results show semantic gains from domain adaptation, although the quantitative performance differences are not statistically significant.

Furthermore, the results shown here were obtained in the context of the NER task; for document classification, within-task pretraining has been commonly used.

Finally, we showed that domain adaptation in experiment (A) - training an intermediary language model with a larger, same-domain dataset - had a higher impact on the F1 metric than the experiments with within-task domain adaptation ((B) and (C)). Furthermore, a qualitative analysis of the intermediary language model shows example outputs for the masked-term prediction task, i.e., the public accounts auditing language model can generate texts related to themes such as contracts and biddings, as seen in Table 6.

4.3 Causal Language Modeling - Few/zero-Shot Learning

This subsection summarizes the experiments conducted with GPT-3.5, a large language model pre-trained with a causal language modeling objective. The results are shown in Table 7.

In the few-shot setting, we gave the model a few examples, optionally accompanied by an instruction prompt describing the kinds of entities expected; we also tested sending only the examples, without any further instruction. In the zero-shot setting, we only asked the question, giving no examples and leaving the model free to return the information in any format. It thus returned the results in an unstructured, conversational format, making it difficult to measure precision and F1; therefore, we report only recall.

Despite the impressive reasoning skills recently shown by GPT-3.5 and “ChatGPT” models, it is interesting to note that their quantitative performance still lags behind traditionally fine-tuned models. We hypothesize that BERT's bidirectional masked language modeling objective may be more suitable for discriminative tasks, while causal language models - trained only to predict the next token - cannot exploit the context to the right of a token. On the other hand, we believe more research is necessary to investigate prompt engineering practices better suited to NER and similar tasks. Finally, we observed impressive qualitative results with GPT-3.5 zero-shot learning: although its recall is considerably worse than that of the BERT-based baseline, it returns both the entities and the relationships between them.

5 Conclusions

Among the analyzed techniques, model soups achieved the best results for the NER task. In the experiments conducted on domain adaptation, the best results were achieved with in-domain adaptation. We did not observe significant improvements from within-task domain adaptation, although we found that the model could learn domain-specific terms from the First HAREM corpus. The lack of both quantity and diversity of “golden” labeled datasets for Portuguese, when compared to English or Chinese, is a substantial limitation for research on several tasks or on multitask learning, as done, for example, by Houlsby et al. [6]. We believe advances in few-shot learning and prompt engineering could mitigate this limitation through synthetic data generation. So, besides investigating fine-tuning with Adapters, we expect to conduct further experiments on few-shot learning. Finally, we intend to investigate fine-tuning the recent Albertina LLM for the NER task and check whether its larger capacity can compensate for its smaller context window (128 tokens) compared to BERTimbau.