1 Introduction

Named entity recognition (NER) identifies named entities in texts; these entities can be proper names, locations, and organizations. Relation extraction (RE) consists of predicting whether a predefined relation holds between two mentioned entities and, if so, which one. In the healthcare domain, named entities include mentions relevant to the clinical context, such as treatment plans, drugs, and symptoms, while a relation can indicate, for instance, that a certain drug treats a symptom. Both tasks are commonly addressed with machine learning methods [2] and are considered the primary step for several other Natural Language Processing (NLP) tasks, such as question answering [20], document classification [6], and search engines [12].

Consultation time is not always sufficient for health professionals to answer all the questions of patients and their families, and patients may not have easy access to primary care centers [8]. Hence, patients often engage in dedicated public forums to search for answers to their queries. Among chronic diseases, diabetes stands out as a condition much in need of attention because it is a severe long-term problem whose prevalence is growing quickly worldwide, particularly in developing countries such as Brazil. In 2019, it was estimated that almost half a billion people (9.3% of adults aged 20–79 years) had diabetes [18]. Further, one in every two (50.1%) persons with diabetes is unaware of or has not been diagnosed with this condition. Hence, question answering (QA) systems could help relieve some of the burdens of health care. In QA, an answer to a question is found by querying a corpus of text documents. A QA system incorporates NER and RE components so that entities and their relations can be detected and answers can be more efficiently found.

In this study, the main goal was to annotate a corpus of texts in order to develop a framework to automatically identify healthcare-related entities and relations. To that end, medical and nutrition science students produced texts as answers to queries posted by users in diabetes-related public forums. Our work is thus relevant for studies in the healthcare domain targeting specifically the general public and focusing on Brazilian Portuguese. The obtained dataset is expected to be used to build BeteQA, a community question answering system that provides fast and precise answers to questions about Diabetes Mellitus posed by the lay public [5].

To the best of our knowledge, this is the first study focusing on NER and relation extraction drawing on a novel corpus of real-life questions answered by professional healthcare specialists in Brazilian Portuguese.

1.1 Related Work

Among named entity recognition methods, contextual word representations are employed due to their capacity for modeling complex word usage characteristics and variations across linguistic contexts [17]. A method worth highlighting is BERT [7], a masked language model that benefits from transformers. These pre-trained language models are becoming the new NER paradigm because of their contextualized embeddings. Furthermore, they are fine-tuned for several NLP tasks by adding an additional output layer [13].

As for relation extraction methods, BERT-based models are also dominant. Soares et al. [21] proposed a model that learns relation representations directly from the text using BERT-based models.

Considering the available BERT models for Brazilian Portuguese, there is a multilingual version [7]; a Brazilian Portuguese version focused on the general domain called BERTimbau [22]; and a Brazilian Portuguese version focused on the clinical domain named BioBERTpt [19].

Fig. 1. Annotation example from our corpus.

2 Methodology

2.1 Corpus

We searched for questions posed by users regarding health issues in several online forums. In such forums, users answer each other’s questions drawing on their beliefs and understanding of the issues, with no help or supervision of any healthcare professional. To avoid this problem, we designed a study in which answers were produced by medical and nutrition science students relying on their domain knowledge. Answers were curated by expert professionals in our project who supervised students. This way, we ensured that our set of QA is reliable and can be used in a prospective QA system to be queried through a conversational agent. Our corpus comprises two sets of documents made up of diabetes-related general questions with their respective answers: a first set contains 304 real online forum questions answered by medical students under the supervision of medical professionals and a second one contains 201 real online forum questions answered by nutrition science students supervised by professionals. This way, we can have a QA system to provide accurate answers about health and nutrition issues, authored by professionals in both fields.

Annotation Setup. As an annotation tool, we used Webanno [4], an open-source and intuitive software/platform. After comparing several available entity tagging tools, we found Webanno to be the easiest and most efficient tool for our purposes. The system was set up on a web server, text data was uploaded, and entity/relation types were defined within the system.

Annotation Guidelines. The annotation guidelines were created in an iterative process. A first draft was created, containing general guidelines as well as specific examples of types of entities and relations. Specialists were consulted regarding annotators’ queries and their answers were used to update the guidelines, which were then tested again. In addition, during the annotation process, whenever one of the annotators ran into a dubious case, this was added to the guidelines. Figure 1 shows an example of a text annotated in Webanno following our guidelines. The annotation guidelines are publicly available for download as part of the dataset.

Annotation Process. We recruited undergraduate students pursuing their BA degree to complete the annotation task. The students are part of Empoder@, a multidisciplinary project engaging health sciences, statistics, computer science, and applied linguistics students. The project aims to empower researchers, professionals, and users of health services.

Annotators took part in a training session and were requested to read the guidelines and consult the project coordinator whenever they encountered problems during annotation. Two students annotated each document, and a third one performed the adjudication so that a gold standard was obtained upon completion. Annotators were presented with each text on a simple interface (see Fig. 1) and, using a mouse or a trackpad, they selected entities and dragged relationships between them.

2.2 Entity and Relation Extraction

We considered relation extraction as a multi-label classification problem in which, given a pair of entities, a label is assigned out of the relation types available. Also, we decoupled relation extraction from entity recognition, so we performed RE on the gold entities.
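The decoupled setting described above can be illustrated with a minimal sketch. It assumes that every ordered pair of gold entities in a text becomes one classification example and that unrelated pairs receive a hypothetical `no_relation` label; the paper does not state how candidate pairs were built, so both choices, along with the entity offsets, are illustrative assumptions.

```python
from itertools import permutations

# Gold entities as (start_token, end_token, entity_type); offsets are illustrative.
gold_entities = [
    (0, 2, "DiabetesType"),
    (5, 5, "Complication"),
    (8, 9, "Food"),
]

# Gold relations keyed by (head_span, tail_span); pairs not listed are unrelated.
gold_relations = {((0, 2), (5, 5)): "causes"}


def candidate_pairs(entities, relations, negative_label="no_relation"):
    """Build one classification example per ordered pair of gold entities.

    `negative_label` is a hypothetical name for the "unrelated" class.
    """
    examples = []
    for head, tail in permutations(entities, 2):
        key = (head[:2], tail[:2])
        examples.append((head, tail, relations.get(key, negative_label)))
    return examples


examples = candidate_pairs(gold_entities, gold_relations)
```

Because the entities come from the gold annotation, errors of an entity recognizer do not propagate into the relation classifier in this setting.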

The set of entity and relation types devised for our annotation was built drawing on an ontology proposed in [1]. The ontology labels diabetes mentions as “DiabetesType”, and diabetes-related diseases are classed with the “Complication” entity type. Diabetes-related temporal expressions, clinical tests, and treatments are also addressed in the ontology. Table 1 lists the 14 entity types, providing a brief explanation of each label with some examples. The 5 relation types are verbalized by the following verbs: “causes”, “diagnoses”, “has”, “prevents”, and “treats”.

Table 1. Entities description and examples.

Reliability. To measure dataset reliability, we computed inter-annotator agreement (IAA) considering exact matches. Following the work of [3], we computed pairwise F1 score and Cohen’s Kappa. The former is more reliable according to various studies [9, 11]. Because of the vast amount of unannotated tokens (labeled “O”), we calculated the scores without the O label, for both annotated entities and relations. Table 2 shows the obtained agreement. Annotated entities can be said to be fully reliable, achieving an IAA of 0.93. As regards relations, moderate agreement (0.58) was found.
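As a rough illustration of the two agreement measures, the sketch below computes an exact-match pairwise F1 over annotated spans and Cohen's Kappa over token labels. How the O label was excluded is not detailed in the text; here it is assumed that tokens left unannotated by both annotators are dropped before computing Kappa. Function names and the span representation are hypothetical.

```python
from collections import Counter


def pairwise_f1(spans_a, spans_b):
    """Exact-match F1 between two annotators' (start, end, label) spans."""
    a, b = set(spans_a), set(spans_b)
    if not a or not b:
        return 0.0
    matched = len(a & b)
    precision = matched / len(b)
    recall = matched / len(a)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def cohen_kappa(labels_a, labels_b, ignore="O"):
    """Cohen's Kappa over token labels.

    Assumption: tokens that both annotators left as `ignore` are skipped.
    """
    pairs = [
        (x, y)
        for x, y in zip(labels_a, labels_b)
        if not (x == ignore and y == ignore)
    ]
    n = len(pairs)
    if n == 0:
        return 0.0
    observed = sum(x == y for x, y in pairs) / n
    count_a = Counter(x for x, _ in pairs)
    count_b = Counter(y for _, y in pairs)
    # Chance agreement from each annotator's label distribution.
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1:
        return 1.0
    return (observed - expected) / (1 - expected)
```

With exact matching, two spans that overlap but differ in extent count as a disagreement for both annotators, which lowers the pairwise F1.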

Table 2. Inter-annotator agreement for entity recognition and relation extraction.
Table 3. Dataset information.
Table 4. Entities: Number of occurrences and percentage per entity type sorted in decreasing order.
Table 5. Relations: Number of occurrences and percentage per relation type sorted in decreasing order.

3 Dataset Information

Table 3 shows overall statistics of the whole dataset. Table 4 shows the number of annotations per entity type, while Table 5 covers the annotated relations.

4 Experiments Setup

We conducted initial experiments using methods to recognize entities and extract relations. Since the dataset is in Brazilian Portuguese, we chose deep learning models trained on multilingual and Brazilian Portuguese data. These models are multilingual BERT (mBERT), BERTimbau, and the three different versions of BioBERTpt: BioBERTpt-bio, trained on Portuguese biomedical texts; BioBERTpt-clin, trained on clinical narratives from electronic health records from Brazilian hospitals; and BioBERTpt-all, trained on both biomedical texts and clinical narratives.

Regarding the training setup, we randomly divided the 505 documents into train/dev/test using the split 0.8/0.1/0.1, respectively, tuning the hyperparameters on the development set and reporting the results on the test set. To run the experiments, we used one NVIDIA GeForce RTX 3090 GPU.
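A document-level random split of this kind can be sketched as follows. The seed, the use of integer document ids, and the decision to assign any rounding remainder to the test set are assumptions made for illustration; the paper does not report them.

```python
import random


def split_documents(doc_ids, ratios=(0.8, 0.1, 0.1), seed=13):
    """Shuffle document ids and cut them into train/dev/test partitions.

    The seed is arbitrary; documents left over after rounding down the
    train and dev sizes go to the test partition (an assumption).
    """
    ids = list(doc_ids)
    random.Random(seed).shuffle(ids)
    n_train = int(len(ids) * ratios[0])
    n_dev = int(len(ids) * ratios[1])
    train = ids[:n_train]
    dev = ids[n_train:n_train + n_dev]
    test = ids[n_train + n_dev:]
    return train, dev, test


train_ids, dev_ids, test_ids = split_documents(range(505))
```

Splitting by document rather than by sentence keeps all sentences of one answer in the same partition.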

As for models, we trained a baseline Conditional Random Field (CRF) for the NER task. We used a Portuguese model from the Python library spaCy [10] to extract a set of features. Table 6 shows the features used.

Table 6. Used CRF features.

For BERT models for the NER task, we used the Adam [14] optimizer with a learning rate of 1e-5 and a maximum length of 512. Moreover, we trained for 50 epochs with early stopping of 15 epochs and a batch size of 64.
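The stopping rule can be sketched independently of any deep learning framework. The snippet below reads "early stopping of 15 epochs" as a patience of 15 epochs without improvement on a development-set score, which is an interpretation rather than something the text confirms; `dev_scores` is a hypothetical stand-in for the score obtained after each training epoch.

```python
def select_best_epoch(dev_scores, max_epochs=50, patience=15):
    """Return (best_epoch, best_score) under a patience-based stopping rule.

    Training stops once `patience` consecutive epochs pass without a new
    best development score, or after `max_epochs` epochs.
    """
    best_score = float("-inf")
    best_epoch = -1
    for epoch, score in enumerate(dev_scores[:max_epochs]):
        if score > best_score:
            best_score, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs
    return best_epoch, best_score
```

Under this rule, a late improvement that would only appear after the patience window has elapsed is never observed, so the checkpoint from the best epoch seen so far is the one kept.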

Regarding the relation extraction task, we experimented with a baseline Support Vector Machine (SVM) model, using the one-vs-the-rest (OvR) multiclass strategy from scikit-learn [16] Python library.

Considering BERT for relation extraction, we used 7e-5 as the learning rate with Adam optimizer and max length of 512, training for 11 epochs, and a batch size of 32.

We considered the following metrics for evaluation: precision, recall, and F1 score. We report the weighted average F1 score, that is, the mean of the per-label F1 scores weighted by each label's support. All metrics consider exact matches only. In addition, we computed results over all classes and, for the best-performing model, we report metrics for each label.
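For concreteness, the support-weighted F1 can be computed as in the sketch below, which operates on parallel lists of gold and predicted labels (as in the relation classification setting). The function name and the toy labels are illustrative and not taken from the experiments.

```python
from collections import Counter


def weighted_f1(gold, predicted):
    """Support-weighted mean of per-label F1 scores."""
    if not gold:
        return 0.0
    true_pos = Counter(g for g, p in zip(gold, predicted) if g == p)
    gold_count = Counter(gold)       # support of each label
    pred_count = Counter(predicted)
    total = 0.0
    for label, support in gold_count.items():
        precision = true_pos[label] / pred_count[label] if pred_count[label] else 0.0
        recall = true_pos[label] / support
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        total += f1 * support
    return total / len(gold)


gold = ["has", "has", "has", "treats"]
pred = ["has", "has", "treats", "treats"]
score = weighted_f1(gold, pred)
```

Because each label contributes in proportion to its support, frequent labels dominate this average, so weak performance on rare labels has limited effect on the overall score.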

5 Results

Table 7 shows the results for NER models. We trained baseline CRF and BERT-based models. The best one concerning F1 score is the BioBERTpt-clin model, outperforming BioBERTpt-all by 0.9 points. Also, BERTimbau did not perform well with an F1 score 5.3 points lower than the second-lowest.

To evaluate the relation extraction task, we experimented using the two models: SVM, as the baseline, and BERT for relation extraction (BERT-RE) from [21], using mBERT as the BERT encoder.

Table 8 shows the results for both relation extraction models. We can observe that, although SVM has higher precision than BERT-RE, the latter performs better overall, with an F1 score 17 points higher.

We also provide an in-depth analysis of the best-performing model in NER and RE. Table 9 shows detailed results for the entity recognition BioBERTpt-clin model. The method produced good scores (>80%) for the three entities with the largest number of examples: Food, Complication, and Symptom. However, for some entities with few examples, the model was not able to make good predictions (<60%): Test, Time, and Set.

Table 10 describes an in-depth analysis of BERT-RE performance for each relation type. Similarly to NER results, the model performs well for the relation with most examples (has) and falls short in relations with few examples: causes and prevents.

Table 7. Experiments for entity recognition models. The best scores are highlighted in bold.
Table 8. Experiments for relation extraction models. The best scores are highlighted in bold.
Table 9. Detailed results for the best entity recognition model (BioBERTpt-clin).
Table 10. Detailed results for the best relation extraction model (BERT-RE).

6 Discussion

In this study, we introduced a novel dataset and models for NER and RE tasks, gathering answers written by health professionals about Diabetes Mellitus to online forum users.

Corpus. We found that the entities Food and Complication were the most annotated in the corpus. Also, nutrition is a prevalent topic in online forums, as people with diabetes frequently inquire whether they can or cannot consume specific foods, such as chocolate, or alcoholic drinks. Similarly, people are generally keen on finding further information on diabetes-related complications because they feel specific symptoms or want to know about frequent diabetes-related complications.

Regarding the main difficulties found in the process of annotating relations, despite annotators’ prior training and availability of guidelines, some entities were annotated as having a different extent in terms of words making up those entities. Thus, “type 2 Diabetes” and “Diabetes” are two different annotations that refer to the same entity type, “DiabetesType”. This ends up impacting the relation since a relation is defined by a pair of entities and a relation type, which can justify our lower agreement for relation annotation. So if two annotators annotated the same relation type holding between entities spanning different extents in words, the resulting relations did not match. Further, as evidenced in [15], temporal relation extraction, which is a specific type of relation extraction, usually has lower agreement than span annotation.
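The effect of span extent on relation agreement can be shown with a small sketch. The token offsets and the "Complication" tail entity are made up for illustration; only the “type 2 Diabetes” versus “Diabetes” contrast comes from the discussion above.

```python
# Each entity is (start_token, end_token, entity_type); offsets are illustrative.
diabetes_long = (4, 6, "DiabetesType")    # annotator A marked "type 2 Diabetes"
diabetes_short = (6, 6, "DiabetesType")   # annotator B marked "Diabetes"
complication = (9, 9, "Complication")

# A relation is a (head_entity, relation_type, tail_entity) triple.
relations_a = {(diabetes_long, "causes", complication)}
relations_b = {(diabetes_short, "causes", complication)}

# Exact matching compares full triples, including the entity boundaries.
exact_matches = relations_a & relations_b

# Comparing only the entity types and the relation type ignores boundaries.
types_a = {(head[2], rel, tail[2]) for head, rel, tail in relations_a}
types_b = {(head[2], rel, tail[2]) for head, rel, tail in relations_b}
same_types = types_a == types_b
```

Here `exact_matches` is empty even though `same_types` is true: both annotators chose the same relation type between the same entity types, yet the one-boundary difference on the head entity turns it into a disagreement under exact matching.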

Models. For the NER task, among the BERT models, we found that BioBERTpt-clin, which leveraged clinical data, had superior performance compared to the other models. Both BioBERTpt-clin and BioBERTpt-all were trained on clinical data and were initialized with the weights from Multilingual BERT, so as expected, these models achieved similar results. It is also worth noting that, although trained on Brazilian Portuguese data, BERTimbau did not perform well. A possible explanation is that BERTimbau was trained on brWaC [23], a large corpus extracted from the Web, which differs in vocabulary and context from Bete medical data.

Considering relation extraction, the BERT-based model outperforms the baseline model, showing the dominance of context-aware embeddings compared to kernel-based methods that were dominant in the past.

7 Conclusion

To the best of our knowledge, this is the first study yielding an annotated corpus of NER in Brazilian Portuguese made up of diabetes-related answers authored by domain specialists in response to questions posed by lay users. Moreover, it contributes to the research field by introducing resources for a sensitive context, such as diabetes, and creating models for a low-resource language, such as Brazilian Portuguese.

The fine-tuning of models leveraging clinical data was found to improve the results. Hence, the vocabulary and the context from the clinical context boosted the model’s ability to predict entities.

We plan to expand our corpus in future work, especially for the categories of entities and relations that have a small number of instances.

8 Limitations

Among the limitations of our study is the size of our dataset, which results from manual annotation, a process that demands considerable human effort and time. Another limitation is that some entity types have few instances, making our dataset slightly imbalanced. For instance, only 1/3 of our entity types have percentages of occurrence higher than 3%. Hence, adding more training examples and making the dataset less imbalanced is likely to improve our results.

Our dataset targets a particular domain, diabetes; future experiments targeting other chronic diseases, such as cardiovascular disease, could broaden the applicability of our approach. Additionally, in the aftermath of COVID-19, there is growing uncertainty about procedures, symptoms, and treatments, with several questions being asked over social media. Hence, addressing COVID-19-related frequently asked questions would be a valuable contribution.