Bete: A Brazilian Portuguese Dataset for Named Entity Recognition and Relation Extraction in the Diabetes Healthcare Domain

Pavanelli, Lucas; Gumiel, Yohan Bonescki; Ferreira, Thiago; Pagano, Adriana; Laber, Eduardo

doi:10.1007/978-3-031-45392-2_17

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 14197)

Included in the following conference series:

Brazilian Conference on Intelligent Systems

Abstract

The biomedical NLP community has seen great advances in dataset development, mostly for the English language; other languages remain underrepresented, which has hindered progress in the field. This study introduces a dataset of Brazilian Portuguese annotated for named entity recognition and relation extraction in the healthcare domain. We compiled and annotated a corpus of health professionals’ responses to frequently asked questions in online healthcare forums on diabetes. We measured inter-annotator agreement and conducted initial experiments using up-to-date methods, such as BERT-based ones, to recognize entities and extract relations. Data, models, and results are publicly available at https://github.com/pavalucas/Bete.

1 Introduction

Named entity recognition (NER) identifies named entities in texts; these entities can be proper names, locations, and organizations. Relation extraction (RE) consists of predicting whether a predefined relation holds between two mentioned entities and, if so, which one. In the healthcare domain, named entities include mentions relevant to the clinical context, such as treatment plans, drugs, and symptoms, while a relation can state, for instance, that a certain drug treats a symptom. Both tasks are widely addressed with machine learning methods [2] and are considered a primary step for several other Natural Language Processing (NLP) tasks, such as question answering [20], document classification [6], and search engines [12].

Consultation time is not always sufficient for health professionals to answer all the questions of patients and their families, and patients may not have easy access to primary care centers [8]. Hence, patients often engage in dedicated public forums to search for answers to their queries. Among chronic diseases, diabetes stands out as a condition much in need of attention because it is an increasingly prevalent and severe long-term problem that is growing quickly worldwide, particularly in developing countries such as Brazil. In 2019, it was estimated that almost half a billion people worldwide (9.3% of adults aged 20–79 years) had diabetes [18]. Further, one in every two (50.1%) persons with diabetes is unaware of or has not been diagnosed with this condition. Hence, question answering (QA) systems could help relieve some of the burden on health care. In QA, an answer to a question is found by querying a corpus of text documents. A QA system incorporates NER and RE components so that entities and their relations can be detected and answers can be found more efficiently.

In this study, the main goal was to annotate a corpus of texts in order to develop a framework to automatically identify healthcare-related entities and relations. To that end, medical and nutrition science students produced texts as answers to queries posted by users in diabetes-related public forums. Our work is thus relevant for studies in the healthcare domain targeting specifically the general public and focusing on Brazilian Portuguese. The obtained dataset is expected to be used to build BeteQA, a community question answering system that provides fast and precise answers to questions about Diabetes Mellitus posed by the lay public [5].

To the best of our knowledge, this is the first study focusing on NER and relation extraction drawing on a novel corpus of real-life questions answered by professional healthcare specialists in Brazilian Portuguese.

1.1 Related Work

Among named entity recognition methods, contextual word representations are employed due to their capacity for modeling complex word usage characteristics and variations across linguistic contexts [17]. A method worth highlighting is BERT [7], a masked language model built on transformers. Such pre-trained language models are becoming the new NER paradigm because of their contextualized embeddings. Furthermore, they can be fine-tuned for several NLP tasks by adding an output layer [13].

As for relation extraction methods, BERT-based models are also dominant. Soares et al. [21] proposed a model that learns relation representations directly from the text using BERT-based models.

Considering the available BERT models for Brazilian Portuguese, there is a multilingual version [7]; a Brazilian Portuguese version focused on the general domain called BERTimbau [22]; and a Brazilian Portuguese version focused on the clinical domain named BioBERTpt [19].

2 Methodology

2.1 Corpus

We searched for questions posed by users regarding health issues in several online forums. In such forums, users answer each other’s questions drawing on their beliefs and understanding of the issues, with no help or supervision of any healthcare professional. To avoid this problem, we designed a study in which answers were produced by medical and nutrition science students relying on their domain knowledge. Answers were curated by expert professionals in our project who supervised students. This way, we ensured that our set of QA is reliable and can be used in a prospective QA system to be queried through a conversational agent. Our corpus comprises two sets of documents made up of diabetes-related general questions with their respective answers: a first set contains 304 real online forum questions answered by medical students under the supervision of medical professionals and a second one contains 201 real online forum questions answered by nutrition science students supervised by professionals. This way, we can have a QA system to provide accurate answers about health and nutrition issues, authored by professionals in both fields.

Annotation Setup. As an annotation tool, we used Webanno [4], an open-source and intuitive software/platform. After comparing several available entity tagging tools, we found Webanno to be the easiest and most efficient tool for our purposes. The system was set up on a web server, text data was uploaded, and entity/relation types were defined within the system.

Annotation Guidelines. The annotation guidelines were created in an iterative process. A first draft was created, containing general guidelines as well as specific examples of types of entities and relations. Specialists were consulted regarding annotators’ queries and their answers were used to update the guidelines, which were then tested again. In addition, during the annotation process, whenever one of the annotators ran into a dubious case, it was added to the guidelines. Figure 1 shows an example of a text annotated in Webanno following our guidelines (see Note 1). The annotation guidelines are publicly available for download as part of the dataset.

Annotation Process. We recruited undergraduate students pursuing their BA degree to complete the annotation task. The students are part of Empoder@, a multidisciplinary project engaging health sciences, statistics, computer science, and applied linguistics students. The project aims to empower researchers, professionals, and users of health services.

Annotators took part in a training session and were requested to read the guidelines and consult the project coordinator whenever they encountered problems during annotation. Two students annotated each document, and a third one performed the adjudication so that a gold standard was obtained upon completion. Annotators were presented with each text on a simple interface (see Fig. 1) and, using a mouse or a trackpad, they selected entities and dragged relationships between them.

2.2 Entity and Relation Extraction

We considered relation extraction as a multi-label classification problem in which, given a pair of entities, a label is assigned out of the relation types available. Also, we decoupled relation extraction from entity recognition, so we performed RE on the gold entities.
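The pair-based formulation above can be illustrated with a minimal sketch. It assumes token-level, non-overlapping gold spans and ordered (head, tail) pairs; the `Entity` type, the function names, and the `[E1]`/`[E2]` marker strings are hypothetical choices made for illustration (entity markers of this kind are one of the input variants discussed in [21]), not a description of the code released with the dataset.

```python
from itertools import permutations
from typing import List, NamedTuple, Tuple


class Entity(NamedTuple):
    start: int  # index of the first token (inclusive)
    end: int    # index after the last token (exclusive)
    label: str  # entity type, e.g. "DiabetesType"


def candidate_pairs(entities: List[Entity]) -> List[Tuple[Entity, Entity]]:
    """Return every ordered (head, tail) pair of distinct gold entities."""
    return list(permutations(entities, 2))


def mark_pair(tokens: List[str], head: Entity, tail: Entity) -> List[str]:
    """Wrap the head and tail spans in marker tokens (assumes they do not overlap)."""
    marked = []
    for i, token in enumerate(tokens):
        if i == head.start:
            marked.append("[E1]")
        if i == tail.start:
            marked.append("[E2]")
        marked.append(token)
        if i == head.end - 1:
            marked.append("[/E1]")
        if i == tail.end - 1:
            marked.append("[/E2]")
    return marked


# Illustrative sentence, not taken from the corpus.
tokens = ["Diabetes", "can", "cause", "neuropathy"]
entities = [Entity(0, 1, "DiabetesType"), Entity(3, 4, "Complication")]
for head, tail in candidate_pairs(entities):
    print(" ".join(mark_pair(tokens, head, tail)))
```

Each marked sequence would then be passed to a classifier that assigns one of the relation types or, under the further assumption that unrelated pairs are kept as candidates, a "no relation" label.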

The set of entity and relation types devised for our annotation was built drawing on an ontology proposed in [1]. The ontology labels diabetes mentions as “DiabetesType”, and diabetes-related diseases are classed with the “Complication” entity type. Diabetes-related temporal expressions, clinical tests, and treatments are also addressed in the ontology. Table 1 lists the 14 entity types, providing a brief explanation of each label with some examples. The 5 relation types are verbalized by the following verbs: “causes”, “diagnoses”, “has”, “prevents”, and “treats”.

Table 1. Entities description and examples.

Reliability. To measure dataset reliability, we computed inter-annotator agreement (IAA) considering exact matches. Following the work of [3], we computed pairwise F1 score and Cohen’s Kappa. The former is more reliable according to various studies [9, 11]. Because of the vast amount of unannotated tokens (labeled “O”), we calculated the scores without the O label, for both annotated entities and relations. Table 2 shows the obtained agreement. Annotated entities can be said to be fully reliable, achieving an IAA of 0.93. As regards relations, moderate agreement (0.58) was found.
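As a rough illustration of the two agreement measures, the sketch below computes an exact-match pairwise F1 over annotated spans and Cohen's Kappa over aligned token labels. The function names are hypothetical, and the way the "O" label is excluded (dropping tokens that both annotators left unannotated) is an assumption; the text does not specify the exact procedure.

```python
from collections import Counter


def pairwise_f1(spans_a, spans_b):
    """Exact-match F1 between two annotators; a span is (start, end, label)."""
    a, b = set(spans_a), set(spans_b)
    true_positives = len(a & b)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(b)  # annotator A taken as the reference
    recall = true_positives / len(a)
    return 2 * precision * recall / (precision + recall)


def cohen_kappa(labels_a, labels_b, ignore="O"):
    """Cohen's Kappa over aligned token labels.

    Assumption: tokens that BOTH annotators labeled `ignore` are dropped.
    """
    pairs = [(x, y) for x, y in zip(labels_a, labels_b)
             if not (x == ignore and y == ignore)]
    n = len(pairs)
    if n == 0:
        return 0.0
    observed = sum(x == y for x, y in pairs) / n
    count_a = Counter(x for x, _ in pairs)
    count_b = Counter(y for _, y in pairs)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1 - expected)


# Illustrative annotations, not taken from the corpus.
spans_a = [(0, 1, "DiabetesType"), (3, 4, "Complication")]
spans_b = [(0, 1, "DiabetesType"), (2, 4, "Complication")]
print(pairwise_f1(spans_a, spans_b))  # one of two spans matches exactly: 0.5
```

Because F1 is symmetric in precision and recall, swapping which annotator is treated as the reference does not change the score.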

Table 2. Inter-annotator agreement for entity recognition and relation extraction.

Table 3. Dataset information.

Table 4. Entities: Number of occurrences and percentage per entity type sorted in decreasing order.

Table 5. Relations: Number of occurrences and percentage per relation type sorted in decreasing order.

3 Dataset Information

Table 3 shows overall statistics of the whole dataset. Table 4 shows the number of annotations per entity type, while Table 5 covers the annotated relations.

4 Experiments Setup

We conducted initial experiments using methods to recognize entities and extract relations. Since the dataset is in Brazilian Portuguese, we chose deep learning models trained on multilingual and Brazilian Portuguese data. These models are multilingual BERT (mBERT), BERTimbau, and the three different versions of BioBERTpt: BioBERTpt-bio, trained on Portuguese biomedical texts; BioBERTpt-clin, trained on clinical narratives from electronic health records from Brazilian hospitals; and BioBERTpt-all, trained on both biomedical texts and clinical narratives.

Regarding the training setup, we randomly divided the 505 documents into train/dev/test using the split 0.8/0.1/0.1, respectively, tuning the hyperparameters on the development set and reporting the results on the test set. To run the experiments, we used one NVIDIA GeForce RTX 3090 GPU.
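A random 0.8/0.1/0.1 split at the document level can be sketched as follows. The function name, the seed, and the rounding rule (truncate the train and dev sizes, give the remainder to test) are assumptions; under that rule, 505 documents yield 404/50/51, which may differ from the partition actually released.

```python
import random


def split_documents(doc_ids, ratios=(0.8, 0.1, 0.1), seed=13):
    """Shuffle document ids and cut them into train/dev/test partitions."""
    ids = list(doc_ids)
    random.Random(seed).shuffle(ids)  # fixed seed keeps the split reproducible
    n_train = int(ratios[0] * len(ids))
    n_dev = int(ratios[1] * len(ids))
    train = ids[:n_train]
    dev = ids[n_train:n_train + n_dev]
    test = ids[n_train + n_dev:]  # remainder goes to the test set
    return train, dev, test


train, dev, test = split_documents(range(505))
print(len(train), len(dev), len(test))  # 404 50 51
```

Splitting by document rather than by sentence keeps all entities and relations of one answer in the same partition.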

As for models, we trained a baseline Conditional Random Field (CRF) for the NER task. We used a Portuguese model from the Python library spaCy [10] to extract a set of features. Table 6 shows the features used.

Table 6. Used CRF features.

For BERT models for the NER task, we used the Adam [14] optimizer with a learning rate of 1e-5 and a maximum length of 512. Moreover, we trained for 50 epochs with early stopping of 15 epochs and a batch size of 64.
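The NER hyperparameters can be gathered in a small configuration object, shown here together with a sketch of the stopping rule. Reading "early stopping of 15 epochs" as a patience of 15 epochs without improvement on the development set is an assumption, and the names `NerConfig` and `epochs_run` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NerConfig:
    learning_rate: float = 1e-5
    max_length: int = 512
    max_epochs: int = 50
    patience: int = 15  # assumption: epochs without dev improvement before stopping
    batch_size: int = 64


def epochs_run(dev_scores, config):
    """Return (epochs actually run, best epoch), both 1-based, for given dev scores."""
    best_score, best_epoch = float("-inf"), 0
    for epoch, score in enumerate(dev_scores[:config.max_epochs], start=1):
        if score > best_score:
            best_score, best_epoch = score, epoch
        elif epoch - best_epoch >= config.patience:
            return epoch, best_epoch  # patience exhausted
    return min(len(dev_scores), config.max_epochs), best_epoch


# Illustrative dev-set curve: peaks at epoch 2, then plateaus below the peak.
scores = [0.50, 0.60] + [0.55] * 20
print(epochs_run(scores, NerConfig()))  # (17, 2)
```

In this reading, the checkpoint from the best epoch would be the one evaluated on the test set.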

Regarding the relation extraction task, we experimented with a baseline Support Vector Machine (SVM) model, using the one-vs-the-rest (OvR) multiclass strategy from scikit-learn [16] Python library.

Considering BERT for relation extraction, we used 7e-5 as the learning rate with Adam optimizer and max length of 512, training for 11 epochs, and a batch size of 32.

We considered the following metrics for evaluation: precision, recall, and F1 score. We reported the weighted average F1 score, that is, the mean of the per-label F1 scores weighted by the support of each label. All metrics consider exact matches. In addition, we computed results considering all classes and, for the best-performing model, we reported metrics for each label.
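The support-weighted F1 can be written out as a short function. This is a minimal sketch: the input format (per-label true positive, false positive, and false negative counts) and the function names are assumptions, and the counts in the usage example are invented for illustration and are not results from this paper.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def weighted_f1(per_label):
    """per_label maps label -> (tp, fp, fn); support of a label is tp + fn."""
    total_support = sum(tp + fn for tp, _, fn in per_label.values())
    if total_support == 0:
        return 0.0
    weighted_sum = 0.0
    for tp, fp, fn in per_label.values():
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        weighted_sum += (tp + fn) * f1_score(precision, recall)
    return weighted_sum / total_support


# Invented counts for two labels: a frequent one and a rare one.
counts = {"Food": (8, 2, 2), "Test": (1, 1, 3)}
print(round(weighted_f1(counts), 4))  # 0.6667
```

Because each label contributes in proportion to its support, frequent labels dominate the weighted average, so a low score on a rare label moves the overall figure only slightly.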

5 Results

Table 7 shows the results for NER models. We trained baseline CRF and BERT-based models. The best one concerning F1 score is the BioBERTpt-clin model, outperforming BioBERTpt-all by 0.9 points. Also, BERTimbau did not perform well with an F1 score 5.3 points lower than the second-lowest.

To evaluate the relation extraction task, we experimented with two models: SVM, as the baseline, and BERT for relation extraction (BERT-RE) from [21], using mBERT as the BERT encoder.

Table 8 shows the results for both relation extraction models. We can observe that, although SVM has higher precision than BERT-RE, the latter performs better overall, with a difference of 17 points in F1 score.

We also provide an in-depth analysis of the best-performing model in NER and RE. Table 9 shows detailed results for the entity recognition BioBERTpt-clin model. The method produced good scores (>80%) for the three entities with the largest number of examples: Food, Complication, and Symptom. However, for some entities with few examples, the model was not able to make good predictions (<60%): Test, Time, and Set.

Table 10 describes an in-depth analysis of BERT-RE performance for each relation type. Similarly to NER results, the model performs well for the relation with most examples (has) and falls short in relations with few examples: causes and prevents.

Table 7. Experiments for entity recognition models. The best scores are highlighted in bold.

Table 8. Experiments for relation extraction models. The best scores are highlighted in bold.

Table 9. Detailed results for the best entity recognition model (BioBERTpt-clin).

Table 10. Detailed results for the best relation extraction model (BERT-RE).

6 Discussion

In this study, we introduced a novel dataset and models for NER and RE tasks, gathering answers written by health professionals about Diabetes Mellitus to online forum users.

Corpus. We found that the entities Food and Complication were the most annotated in the corpus. Also, nutrition is a prevalent topic in online forums, as people with diabetes frequently inquire whether they can or cannot consume specific foods, such as chocolate, or alcoholic drinks. Similarly, people are generally keen on finding further information on diabetes-related complications because they feel specific symptoms or want to know about frequent diabetes-related complications.

Regarding the main difficulties found in the process of annotating relations, despite annotators’ prior training and availability of guidelines, some entities were annotated as having a different extent in terms of words making up those entities. Thus, “type 2 Diabetes” and “Diabetes” are two different annotations that refer to the same entity type, “DiabetesType”. This ends up impacting the relation since a relation is defined by a pair of entities and a relation type, which can justify our lower agreement for relation annotation. So if two annotators annotated the same relation type holding between entities spanning different extents in words, the resulting relations did not match. Further, as evidenced in [15], temporal relation extraction, which is a specific type of relation extraction, usually has lower agreement than span annotation.
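The effect of span extent on relation agreement can be made concrete with a small sketch. A relation is represented here as (head span, tail span, relation type) with token offsets; this representation and the relaxed, overlap-based comparison are assumptions introduced only for contrast, since the agreement reported in this paper uses exact matches.

```python
def spans_overlap(span_a, span_b):
    """True if two (start, end) token spans share at least one token."""
    return span_a[0] < span_b[1] and span_b[0] < span_a[1]


def exact_match(rel_a, rel_b):
    """Relations match only if both spans and the relation type are identical."""
    return rel_a == rel_b


def relaxed_match(rel_a, rel_b):
    """Relations match if the types agree and both argument spans overlap."""
    (head_a, tail_a, type_a), (head_b, tail_b, type_b) = rel_a, rel_b
    return (type_a == type_b
            and spans_overlap(head_a, head_b)
            and spans_overlap(tail_a, tail_b))


# Tokens: Being(0) overweight(1) can(2) lead(3) to(4) type(5) 2(6) diabetes(7)
# Annotator A marks "type 2 diabetes"; annotator B marks only "diabetes".
relation_a = ((1, 2), (5, 8), "causes")
relation_b = ((1, 2), (7, 8), "causes")
print(exact_match(relation_a, relation_b))    # False
print(relaxed_match(relation_a, relation_b))  # True
```

Both annotators agree on the relation type and on the entities involved, yet under exact matching the pair counts as a disagreement, which is the mechanism described above.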

Models. For the NER task, among the BERT models, we found that BioBERTpt-clin, which leveraged clinical data, had superior performance compared to the other models. Both BioBERTpt-clin and BioBERTpt-all were trained on clinical data and were initialized with the weights from Multilingual BERT, so, as expected, these models achieved similar results. It is also worth noting that, although trained on Brazilian Portuguese data, BERTimbau did not perform well. A possible explanation is that BERTimbau was trained on brWaC [23], a large corpus extracted from the Web, which differs in vocabulary and context from the Bete medical data.

Considering relation extraction, the BERT-based model outperforms the baseline model, showing the dominance of context-aware embeddings compared to kernel-based methods that were dominant in the past.

7 Conclusion

To the best of our knowledge, this is the first study yielding an annotated corpus of NER in Brazilian Portuguese made up of diabetes-related answers authored by domain specialists in response to questions posed by lay users. Moreover, it contributes to the research field by introducing resources for a sensitive context, such as diabetes, and creating models for a low-resource language, such as Brazilian Portuguese.

The fine-tuning of models leveraging clinical data was found to improve the results. Hence, the vocabulary and the context from the clinical context boosted the model’s ability to predict entities.

We plan to expand our corpus in future work, especially for the categories of entities and relations that have a small number of instances.

8 Limitations

Among the limitations of our study is the size of our dataset, which is due to the fact that it was obtained through manual annotation thereby demanding human effort and more time to accomplish this task. Another limitation is that some entity types have few instances, making our dataset slightly imbalanced. For instance, only 1/3 of our entity types have percentages of occurrence higher than 3%. Hence, adding more training examples and making the dataset less imbalanced will certainly enhance our results.

Our dataset targets a particular domain, diabetes; future experiments targeting other chronic diseases, such as cardiovascular disease, will boost our potential. Additionally, in the aftermath of COVID-19, there is growing uncertainty about procedures, symptoms, and treatments, with several questions being asked over social media. Hence, addressing COVID-19-related frequently asked questions would be a valuable contribution.

Notes

1. Translation into English: “Being overweight can lead to type 2 diabetes. Therefore, intermittent fasting may be a way to prevent type 2 diabetes. Intermittent fasting can also be used as a treatment for people newly diagnosed with type 1 diabetes who need to lose weight to achieve a more stable health condition; these people should be advised and monitored by an endocrinologist and a nutritionist.”

References

Ben Abacha, A., Zweigenbaum, P.: MEANS: a medical question-answering system combining NLP techniques and semantic web technologies. Inf. Process. Manag. 51(5), 570–594 (2015). https://doi.org/10.1016/j.ipm.2015.04.006, https://www.sciencedirect.com/science/article/pii/S0306457315000515
Bose, P., Srinivasan, S., Sleeman, W.C., Palta, J., Kapoor, R., Ghosh, P.: A survey on recent named entity recognition and relationship extraction techniques on clinical texts. Appl. Sci. 11(18) (2021). https://doi.org/10.3390/app11188319, https://www.mdpi.com/2076-3417/11/18/8319
Brandsen, A., Verberne, S., Wansleeben, M., Lambers, K.: Creating a dataset for named entity recognition in the archaeology domain. In: Proceedings of the Twelfth Language Resources and Evaluation Conference, pp. 4573–4577. European Language Resources Association, Marseille, France (2020), https://aclanthology.org/2020.lrec-1.562
Eckart de Castilho, R., et al.: A web-based tool for the integrated annotation of semantic and syntactic structures. In: Proceedings of the Workshop on Language Technology Resources and Tools for Digital Humanities (LT4DH), pp. 76–84. The COLING 2016 Organizing Committee, Osaka, Japan (2016), https://www.aclweb.org/anthology/W16-4011
Castro Ferreira, T., et al.: Evaluating recognizing question entailment methods for a Portuguese community question-answering system about diabetes mellitus. In: Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021), pp. 234–243. INCOMA Ltd., Held Online (2021), https://aclanthology.org/2021.ranlp-main.28
Choudhary, A., Arora, A.: Linguistic feature based learning model for fake news detection and classification. Expert Syst. Appl. 169, 114171 (2021)
Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186 (2019). https://doi.org/10.18653/v1/N19-1423, https://www.aclweb.org/anthology/N19-1423
Gabarron, E., et al.: Social media for health promotion in diabetes: study protocol for a participatory public health intervention design. BMC Health Serv. Res. 18(1), 414 (2018). https://doi.org/10.1186/s12913-018-3178-7
Grouin, C., Rosset, S., Zweigenbaum, P., Fort, K., Galibert, O., Quintard, L.: Proposal for an extension of traditional named entities: from guidelines to evaluation, an overview. In: Proceedings of the 5th linguistic annotation workshop, pp. 92–100 (2011)
Honnibal, M., Montani, I.: spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing. To Appear 7(1), 411–420 (2017)
Hripcsak, G., Rothschild, A.S.: Agreement, the f-measure, and reliability in information retrieval. J. Am. Med. Inform. Assoc. 12(3), 296–298 (2005)
Lahav, D., et al.: A search engine for discovery of scientific challenges and directions. In: AAAI (2022)
Li, J., Sun, A., Han, J., Li, C.: A survey on deep learning for named entity recognition. IEEE Trans. Knowl. Data Eng. 34(1), 50–70 (2020)
Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. In: International Conference on Learning Representations (2019). https://openreview.net/forum?id=Bkg6RiCqY7
Nikfarjam, A., Emadzadeh, E., Gonzalez, G.: Towards generating a patient’s timeline: Extracting temporal relationships from clinical notes. J. Biomed. Inform. 46, S40–S47 (2013). https://doi.org/10.1016/j.jbi.2013.11.001, supplement: 2012 i2b2 NLP Challenge on Temporal Relations in Clinical Data
Pedregosa, F., et al.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)
Peters, M.E., et al.: Deep contextualized word representations. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 2227–2237. Association for Computational Linguistics, New Orleans, Louisiana (2018). https://doi.org/10.18653/v1/N18-1202, https://aclanthology.org/N18-1202
Saeedi, P., et al.: Global and regional diabetes prevalence estimates for 2019 and projections for 2030 and 2045: Results from the international diabetes federation diabetes atlas, 9th edition. Diabetes Res. Clin. Pract. 157, 107843 (2019). https://doi.org/10.1016/j.diabres.2019.107843
Schneider, E.T.R., et al.: BioBERTpt - a Portuguese neural language model for clinical named entity recognition. In: Proceedings of the 3rd Clinical Natural Language Processing Workshop, pp. 65–72. Association for Computational Linguistics (2020). https://doi.org/10.18653/v1/2020.clinicalnlp-1.7
Sharma, V., Kulkarni, N., Pranavi, S., Bayomi, G., Nyberg, E., Mitamura, T.: BioAMA: towards an end to end biomedical question answering system. In: Proceedings of the BioNLP 2018 workshop, pp. 109–117. Association for Computational Linguistics, Melbourne, Australia (2018). https://doi.org/10.18653/v1/W18-2312, https://aclanthology.org/W18-2312
Soares, L.B., FitzGerald, N., Ling, J., Kwiatkowski, T.: Matching the blanks: distributional similarity for relation learning. In: ACL 2019–57th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference, pp. 2895–2905 (2020). https://doi.org/10.18653/v1/p19-1279
Souza, F., Nogueira, R., Lotufo, R.: BERTimbau: pretrained BERT models for Brazilian Portuguese. In: Cerri, R., Prati, R.C. (eds.) BRACIS 2020. LNCS (LNAI), vol. 12319, pp. 403–417. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-61377-8_28
Wagner, J., Wilkens, R., Idiart, M., Villavicencio, A.: The brWaC corpus: a new open resource for Brazilian Portuguese (2018)

Author information

Authors and Affiliations

aiXplain Inc., Los Gatos, USA
Lucas Pavanelli & Thiago Ferreira
Universidade Federal de Minas Gerais, Belo Horizonte, Brazil
Yohan Bonescki Gumiel & Adriana Pagano
Pontifícia Universidade Católica do Rio de Janeiro (PUC-RJ), Rio de Janeiro, Brazil
Eduardo Laber

Corresponding author

Correspondence to Lucas Pavanelli.

Editor information

Editors and Affiliations

Federal University of São Carlos, São Carlos, Brazil
Murilo C. Naldi
Centro Universitario da FEI, São Bernardo do Campo, Brazil
Reinaldo A. C. Bianchi

Ethics declarations

Ethical Statement

Our study fully complies with ethical standards and did not require any submission to ethical boards, since no data collection with human subjects was carried out. Our dataset was created by our team and contains texts drafted by medical students under the supervision of healthcare professionals, all of whom are research members in our project.

Cite this paper

Pavanelli, L., Gumiel, Y.B., Ferreira, T., Pagano, A., Laber, E. (2023). Bete: A Brazilian Portuguese Dataset for Named Entity Recognition and Relation Extraction in the Diabetes Healthcare Domain. In: Naldi, M.C., Bianchi, R.A.C. (eds) Intelligent Systems. BRACIS 2023. Lecture Notes in Computer Science(), vol 14197. Springer, Cham. https://doi.org/10.1007/978-3-031-45392-2_17

DOI: https://doi.org/10.1007/978-3-031-45392-2_17
Published: 12 October 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-45391-5
Online ISBN: 978-3-031-45392-2
eBook Packages: Computer Science, Computer Science (R0)
