Abstract
Question answering (QA) systems aim to answer questions posed by humans in natural language. This functionality can be very useful in the most diverse application domains, such as the biomedical and clinical ones. In the clinical context, where a growing volume of information is stored in electronic health records, answering questions about the patient's status can improve decision-making and optimize patient care. In this work, we carried out the first experiments to develop a QA model for clinical texts in Portuguese. To overcome the lack of corpora for the required language and context, we used a transfer learning approach supported by pre-trained attention-based models from the Transformers library. We fine-tuned the BioBERTpt model with a translated version of the SQuAD dataset. The model showed promising results in different clinical scenarios, even without a clinical QA corpus to support the training process. The developed model is publicly available to the scientific community.
1 Introduction
Question answering (QA) is a task that seeks to automatically answer questions asked by humans in natural language, combining Information Retrieval (IR) and Natural Language Processing (NLP) techniques. QA systems have been applied in multiple domains of knowledge, as they can facilitate access to information by reasoning over and inferring from large volumes of data stored as natural language [1].
There has been growing interest in the use of QA in the biomedical context, including efforts to create biomedical and clinical QA corpora [8, 12, 18, 23] and the organization of shared tasks such as BioASQ 8 [9]. The possibility of finding answers to questions in the growing volume of data available in the scientific literature and in patients' medical records evidences the potential of these systems for practical use. QA systems can therefore assist both the medical research community (e.g., researching disease treatments and symptoms) and healthcare professionals (e.g., clinical decision support).
Specifically in the clinical context, where we work with patients' data, QA research has few initiatives compared to other domains, especially in languages other than English [10, 12]. The lack of studies may be due to the required medical expertise and the ethical/legal restrictions that limit data access [7]. Another aspect that hinders clinical QA research is the complexity of the data stored in the patient's Electronic Health Record (EHR). The longitudinal and unstructured nature of clinical notes, heavy use of medical terminology, temporal relations between multiple clinical concepts (e.g., medications, diseases, symptoms), misspellings, frequent redundancy, and implicit events make text comprehension much harder for the machine and complicate the application of baseline NLP tools [2].
Currently, health professionals manually browse the record or use a simple search algorithm to find answers about patients, and often they are unable to reach the information they need. A QA system can provide a much more intuitive interface for this task, allowing humans to "talk" to the EHR by asking questions about the patient's status and health condition, yielding a much more fluid and human interaction [12, 16, 18].
Nevertheless, how can we develop a clinical QA system for Portuguese without the large corpus of questions and answers that most current QA methods require? A well-established strategy in the NLP community is transfer learning, which consists of taking a language model pre-trained on a large dataset and then fine-tuning it for a specific task on a smaller dataset. For example, one can take a large attention-based model [19] available in the Transformers library, such as BERT [3], and fine-tune it for a specific NLP task, such as Named Entity Recognition [17]. We can go beyond that, like Jeong and colleagues [7], who sequentially fine-tuned a model on two different QA corpora (one open-domain and the other biomedical) to achieve superior performance in biomedical QA.
In this work, we present a first round of experiments on a QA system that answers questions about patients in Portuguese. We developed a biomedical QA model for the Portuguese language by fine-tuning the BioBERTpt model [15] on the SQuAD QA dataset translated into Portuguese. We evaluated its performance on a small set of different clinical contexts with multiple types of clinical notes. Finally, we discuss our preliminary results, main findings, and future work. The model is publicly available for use.
2 Related Work
In this section, we present important QA corpora development initiatives. Moreover, we introduce the transfer learning method, which is one of the solutions used to overcome the lack of domain-specific corpora. Furthermore, we highlight biomedical and clinical QA studies based on Transformers models and transfer learning.
2.1 Corpora
In the literature, we can find several QA datasets, most of them covering open-domain questions and answers, such as the Stanford Question Answering Dataset (SQuAD) [14]. SQuAD is a reading comprehension dataset consisting of English questions and answers over a set of Wikipedia articles. The dataset follows an extractive QA format, where the answer to each question is a segment of text from the reading passage (see Fig. 1). SQuAD 1.1 contains more than 100,000 question-answer pairs over approximately 500 articles, all of them manually annotated.
Fig. 1. Example of answers annotated within the text in the SQuAD corpus. Image from [14]
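In the SQuAD v1.1 distribution, each answer is stored as a text segment plus a character offset into the context. The sketch below illustrates that layout (the field names follow the public SQuAD JSON schema; the passage and question are our own toy example, not taken from the corpus):

```python
# Minimal illustration of the SQuAD v1.1 record layout: each answer is a
# text segment located in the context by a character offset (answer_start).
record = {
    "context": "The Amazon rainforest covers much of the Amazon basin of South America.",
    "qas": [
        {
            "question": "What does the Amazon rainforest cover?",
            "answers": [{"text": "much of the Amazon basin", "answer_start": 29}],
        }
    ],
}

def extract_answer(context: str, answer: dict) -> str:
    """Recover the answer span from its character offset."""
    start = answer["answer_start"]
    return context[start:start + len(answer["text"])]

qa = record["qas"][0]
span = extract_answer(record["context"], qa["answers"][0])
assert span == qa["answers"][0]["text"]  # the answer is a literal segment of the context
```

This extractive layout is what makes the transfer to other domains possible: any model trained to predict such spans can, in principle, be pointed at a clinical note instead of a Wikipedia paragraph.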
Regarding the biomedical context, we highlight PubMedQA, a corpus aimed at answering research questions with "yes", "no" or "maybe". It consists of 1,000 expert-annotated QA pairs and 211,300 artificially generated QA pairs. All the answers were annotated in PubMed scientific paper abstracts [8].
In the clinical domain, we highlight the following recent initiatives. Yue and colleagues [23] developed a corpus with 1,287 expert-annotated QA pairs drawn from 36 discharge summaries of the MIMIC-III dataset. The emrQA corpus [12] required minimal expert involvement, as its QA pairs were generated by an automated method that leverages annotations from shared i2b2 datasets, resulting in 400,000 QA pairs.
It is worth noting the discrepancy between the number of manually annotated QA pairs in SQuAD and in the biomedical and clinical corpora, which directly affects the development of QA algorithms for these specific domains.
2.2 Transfer Learning
Transfer learning is a method focused on the reuse of a model trained for a specific task, by applying and adjusting it (i.e., fine-tuning) to a different but related task. Several studies have already proven the efficacy of transfer learning when using deep learning architectures [13], especially the pre-trained attention-based models available in the Transformers library.
BioBERTpt [15] is an example of how transfer learning can produce models that are computationally cheap to build and, at the same time, extremely useful. BioBERTpt was generated by fine-tuning the multilingual version of BERT on clinical and biomedical documents written in Portuguese. The model was also adjusted for the named entity recognition (NER) task, achieving state-of-the-art results on the SemClinBr corpus [11].
2.3 Biomedical and Clinical QA
Given the limited availability of domain-specific corpora for the QA task, many studies rely on transfer learning to take advantage of the huge amount of annotated data in open-domain resources such as SQuAD. Furthermore, due to the state-of-the-art results obtained by attention-based architectures on several NLP tasks, most studies explore the fine-tuning of pre-trained models available in the Transformers library.
Wiese and colleagues [20] transferred knowledge from the SQuAD dataset to the BioASQ dataset. Yoon and colleagues [22] executed a sequential transfer learning pipeline combining BioBERT, SQuAD, and BioASQ. In [7], the authors also used sequential transfer learning, but first trained BioBERT on an NLI dataset. Finally, Soni and Roberts [16] experimented with multiple Transformers models and fine-tuned them with multiple QA corpora combinations, finding that a preliminary fine-tuning on SQuAD improves clinical QA results across all models. Due to the unavailability of a clinical QA corpus in Portuguese and the good results obtained with transfer learning from an open-domain QA corpus (i.e., SQuAD), we decided to follow a similar development protocol.
3 Materials and Methods
This section describes the development of our Transformer-based QA model, including all the steps of execution and details regarding the data used. Next, we present the evaluation setup used to assess our approach.
3.1 BioBERTpt-squad-v1.1-portuguese: A Biomedical QA Model for Portuguese
As seen in the previous section, transfer learning supported by contextual embedding models, pre-trained on large-scale unlabelled corpora and combined with the Transformer architecture, has proven to reach state-of-the-art performance on NLP tasks. Hence, our method focuses on these pre-trained models and the subsequent fine-tuning for the QA task.
The method is composed of the following general phases:
1. Obtain a model for biomedical Portuguese texts;
2. Obtain a dataset for QA in Portuguese;
3. Fine-tune the biomedical model to the QA task.
For the first phase, we needed a model capable of processing Portuguese biomedical texts, and for that we used BioBERTpt, a deep contextual embedding model for the Portuguese language that supports clinical and biomedical tasks. Its authors fine-tuned the multilingual BERT on clinical notes and biomedical literature, resulting in a Portuguese medical BERT-base model. In this paper, we used the BioBERTpt(all) version, which covers both biomedical and clinical texts.
For the second phase, we used the automatically translated version of the SQuAD dataset shared by the Deep Learning Brasil group, which we will call SQuADpt from now on. They used Google-powered machine translation to generate a Portuguese version of SQuAD v1.1, followed by human correction. It is worth noting the authors' warning that the dataset may contain a good amount of unfiltered content with potential bias.
In the final phase, we fine-tuned BioBERTpt on the question-and-answer dataset, SQuADpt, with the following steps:
1. Pre-processing of the data (removal of spaces and possible duplicate records);
2. Tokenization of the inputs and conversion of the tokens to their corresponding IDs in the pre-trained vocabulary, with the Transformers Tokenizer [21];
3. Splitting of long inputs (longer than the model's maximum sequence length) into several smaller features, with overlap between features in case the answer lies at a split point;
4. Mapping of the start and end positions of the answers onto the tokens, so the model can find where exactly the answer is in each feature;
5. Padding of the text to the maximum sequence length.
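Step 3 above can be sketched in plain Python. This is a simplified, word-level illustration of the sliding window with overlap that the Transformers tokenizer implements via `return_overflowing_tokens` and `stride`; the window and stride sizes below are arbitrary toy values, not the ones used in training:

```python
def split_into_features(tokens, max_len=8, stride=3):
    """Split a long token sequence into overlapping windows so that an
    answer lying on a split point still appears whole in some window."""
    if len(tokens) <= max_len:
        return [tokens]
    features, start = [], 0
    while start < len(tokens):
        features.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break
        start += max_len - stride  # step forward, keeping `stride` tokens of overlap
    return features

tokens = [f"t{i}" for i in range(20)]
features = split_into_features(tokens)
# consecutive windows share `stride` tokens of overlap
assert features[0][-3:] == features[1][:3]
# every token appears in at least one window
assert set(tokens) == {t for f in features for t in f}
```

During training, each such window becomes an independent feature; the answer positions of step 4 are then mapped within each window, and windows that do not contain the answer are labeled as unanswerable.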
Finally, we performed the fine-tuning for the QA task with the Transformers AutoModelForQuestionAnswering class [21], using the BioBERTpt pre-trained weights. We trained our Portuguese biomedical QA model for three epochs on a 12 GB GTX2080Ti Titan GPU, with a learning rate of 3e-5 and a batch size of 16 for both training and evaluation. All these parameters are recommended for the fine-tuning step in [3], Appendix A.3. Table 1 shows the main hyper-parameter settings used to train our model.
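As a back-of-the-envelope check, these hyper-parameters imply the following number of optimization steps (the dictionary keys mirror the Transformers `TrainingArguments` fields; the training-set size of roughly 87,599 QA pairs is the figure reported for SQuAD v1.1 [14], counted before long contexts are split into extra features, so the real step count is somewhat higher):

```python
import math

# Hyper-parameters used for fine-tuning, following [3], Appendix A.3.
hparams = {
    "learning_rate": 3e-5,
    "num_train_epochs": 3,
    "per_device_train_batch_size": 16,
    "per_device_eval_batch_size": 16,
}

# Approximate optimization steps: examples per epoch divided by batch size,
# times the number of epochs.
train_examples = 87_599
steps_per_epoch = math.ceil(train_examples / hparams["per_device_train_batch_size"])
total_steps = steps_per_epoch * hparams["num_train_epochs"]
assert steps_per_epoch == 5475
assert total_steps == 16425
```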
Figure 2 shows the BERT model trained for the QA task: given an input composed of the question and the context, the model predicts as output the start and end indices of the probable answer to the question. The resulting model is available in the Transformers repository.
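The decoding behind this prediction can be sketched as follows: the model emits one start score and one end score per token, and the answer is the highest-scoring valid (start, end) pair. The snippet is a simplified, pure-Python illustration with toy scores (a real implementation works on the model's logits and normalizes them into probabilities):

```python
def best_span(start_scores, end_scores, max_answer_len=5):
    """Pick the (start, end) token pair with the highest combined score,
    subject to start <= end and a maximum answer length."""
    best = (float("-inf"), 0, 0)  # (score, start, end)
    for s, s_score in enumerate(start_scores):
        for e in range(s, min(s + max_answer_len, len(end_scores))):
            score = s_score + end_scores[e]
            if score > best[0]:
                best = (score, s, e)
    return best

tokens = ["the", "patient", "denies", "chest", "pain"]
start = [0.1, 0.2, 0.1, 2.5, 0.3]  # toy scores: "chest" is the likely start
end   = [0.1, 0.1, 0.2, 0.4, 2.9]  # toy scores: "pain" is the likely end
score, s, e = best_span(start, end)
assert tokens[s:e + 1] == ["chest", "pain"]
```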
3.2 Evaluation Setup
Our model achieved an F1-score of 80.06 when measured on the SQuADpt annotations. However, the goal of our work is to produce a QA model for the clinical domain; therefore, we needed to define an assessment protocol that addresses this context. In the following subsections, we describe the selection of the clinical notes and the definition of the questions used to evaluate our model.
Clinical Notes Selection. There are many uses for a QA system in the clinical environment, as it can support the entire multidisciplinary health staff. Among the possibilities, the system could help medical and nursing professionals, who constantly need to retrieve information from the patient's medical record to carry out their interventions. With these cases in mind, we selected two main categories of clinical notes: nursing notes and medical notes.
The nursing note is a document that describes the general state of the patient and the actions performed by the nurses during care (e.g., physical examination, medications used, vital signs monitoring). This document is used to compare the patient's clinical condition over the last 24 hours and is very useful for determining optimal care strategies (e.g., new physical examinations) [4]. The medical note contains a narrative regarding the patient's signs, symptoms, comorbidities, diagnosis, prescribed medications, post-surgery data, etc. All this information is written in chronological order, to facilitate the understanding of the history of care and support the decisions of healthcare professionals [5].
The nursing and medical notes were selected from the SemClinBr corpus, prioritizing narratives that allow questions to be asked, so that the answers could support the elaboration of a Nursing Diagnosis by the nursing team during the shift, as well as quick searches by the multidisciplinary team to support the practices to be adopted in the patient's care. Three nursing notes and six medical notes were selected, the latter comprising three from medical assistance (i.e., hospital admission) and three from ambulatory care (i.e., outpatient service).
Questions Definition. After defining the clinical notes to be used as the context for our QA model, we defined the set of questions that we would ask the model. The questions were based on Nursing Diagnoses, risk factors, and defining characteristics, that is, a set of information that demonstrates the signs and symptoms of a given diagnosis. The selected criteria were: physical examination, level of consciousness, circulation and ventilation, procedures/care, nutrition, physiological eliminations, intravenous devices, pain complaints, and exams. Ten questions were elaborated and applied to the nursing notes, and eight to the medical notes.
Evaluation Process and Metrics. To generate the gold standard of answers, a specialist analyzed all selected clinical notes and annotated the text with the corresponding answers (examples are presented in Fig. 3 and Fig. 4). After the gold standard was generated, all questions and clinical notes were fed to our QA model, and we compared the model's answers with those produced by the expert, computing Precision, Recall, and F1. We used a partial-match strategy: once the QA system finds and marks the answer in the text, even partially, the user can easily identify the rest of the information. For example, in the gold standard, the excerpt that answers the question "Has previous diseases?" is "Previously Hypertension/Diabetes/Chronic obstructive pulmonary disease. Stroke with left hemiparesis 2 years ago.", but the model output was only a partial match: "Previously Hypertension/Diabetes/Chronic obstructive pulmonary disease.". In such cases, the partial-match approach considers the prediction a true positive.
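As an illustration of the partial credit involved, a SQuAD-style token-level metric over the example above looks as follows (a minimal sketch of our own; the paper's protocol simply counts such partial matches as true positives):

```python
def token_f1(prediction: str, gold: str):
    """Token-level Precision, Recall and F1 between a predicted answer
    and the gold answer, giving partial credit for partial overlap."""
    pred_tokens, gold_tokens = prediction.split(), gold.split()
    common, remaining = 0, gold_tokens.copy()
    for tok in pred_tokens:
        if tok in remaining:
            remaining.remove(tok)
            common += 1
    if common == 0:
        return 0.0, 0.0, 0.0
    precision = common / len(pred_tokens)
    recall = common / len(gold_tokens)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

gold = ("Previously Hypertension/Diabetes/Chronic obstructive pulmonary disease. "
        "Stroke with left hemiparesis 2 years ago.")
pred = "Previously Hypertension/Diabetes/Chronic obstructive pulmonary disease."
p, r, f1 = token_f1(pred, gold)
assert p == 1.0   # every predicted token is in the gold answer
assert 0 < r < 1  # but only part of the gold answer was found
```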
It is worth noting that our gold standard was not used for training purposes. We used it only for evaluation, as we exploited the use of transfer learning and already built resources to develop our model.
4 Results
The results of the evaluation process are summarized in Table 2, where we can see better results when processing nursing and outpatient notes. In Fig. 5, we plot all evaluated items and the confidence score given by the model for each prediction. In addition to the hits and misses, we also include the situation in which the question asked has no answer in the text, so no correct answer is possible. An analysis of the graph suggests that applying a threshold of 0.5 to the returned answer (score < 0.5: no answer; score ≥ 0.5: show answer) could avoid several false positives. However, it would also reduce the number of true positives, so a more in-depth analysis of this trade-off is needed.
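The thresholding suggested by Fig. 5 amounts to a simple post-processing rule, sketched below (the function name and the example answers are ours, purely illustrative):

```python
NO_ANSWER = None

def apply_threshold(answer: str, confidence: float, threshold: float = 0.5):
    """Suppress low-confidence predictions: below the threshold the system
    returns no answer instead of a likely false positive."""
    return answer if confidence >= threshold else NO_ANSWER

# A confident prediction is shown; an uncertain one is suppressed.
assert apply_threshold("sem sinais flogisticos", 0.87) == "sem sinais flogisticos"
assert apply_threshold("sem queixas", 0.21) is NO_ANSWER
```

The cost of this rule, as noted above, is that true positives with low confidence are suppressed as well, which is why the trade-off deserves a deeper analysis.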
5 Discussion
This work is an initial investigation into the development of a clinical QA system in Portuguese, which could be a powerful resource for clinical decision support and for optimizing patient care processes (e.g., discovering information about the patient's condition, streamlining shift changes). The aim of this article is to carry out preliminary experiments that help define the next steps towards a robust clinical QA model.
Our findings are in line with other studies [7, 16, 20, 22] regarding the use of transfer learning to overcome the lack of annotated resources for the desired language and context: we used BioBERTpt, itself the result of fine-tuning the multilingual BERT on clinical texts, and SQuAD, which, despite being an open-domain dataset, helped us adapt the model to the clinical QA task. However, even without quantitative benchmarks, the results suggest that this process of building a model for the clinical domain requires extra attention, given the large differences in vocabulary, question types, and the specificities of clinical notes. Models that followed a similar development process appear to work somewhat more cohesively when applied to questions about general medical topics [6].
A more detailed analysis of the use of a threshold on the model output should be carried out, as missing data are common in clinical notes. Furthermore, it is necessary to study the different types of possible questions (i.e., factoid, list, yes/no) in the clinical context, since supporting all of them considerably increases the complexity of the problem. Also out of the scope of this article is the handling of discontinuous answers, when the response lies in multiple parts of the clinical note. As the Transformers model can return the top-k answer candidates, it might be possible to perform some kind of inference over this list.
We recognize as limitations of this work the size of the evaluation dataset, as well as the search for answers about the patient's status in a single clinical note rather than in all of the patient's EHR records. As future work, we intend to expand both the dataset size and the exploration of the entire longitudinal history of the patient's clinical notes. Finally, since we intend to expand the gold standard for evaluation, a natural step would be to train a model directly on these data and make the corpus available to the scientific community, as was done with emrQA [12] and CliCR [18].
6 Conclusion
We developed a clinical QA model for the Portuguese language by sequentially fine-tuning attention-based models available in the Transformers library, and the model is publicly available to the scientific community. We combined the BioBERTpt model with the SQuADpt QA dataset to build a model capable of finding answers about a patient within clinical notes. Our experiments show promising results, even using only resources adapted (i.e., fine-tuned) to the target task, context, and language. We plan to expand our experiments in several directions: increasing the evaluation dataset size, using multiple clinical notes per patient, training a model on our own clinical QA corpus, and deepening the analysis of thresholds, discontinuous answers, and question types. We understand this work as the first step towards a robust clinical QA system in Portuguese, which could result in more intelligent systems for healthcare professionals and improved patient care.
References
Calijorne Soares, M.A., Parreiras, F.S.: A literature review on question answering techniques, paradigms and systems (2020). https://doi.org/10.1016/j.jksuci.2018.08.005
Dalianis, H.: Characteristics of patient records and clinical corpora. In: Clinical Text Mining, pp. 21–34. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-78503-5_4
Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186. Association for Computational Linguistics, Minneapolis, Minnesota (June 2019). https://doi.org/10.18653/v1/N19-1423, https://www.aclweb.org/anthology/N19-1423
Dias, L.B., Duran, E.C.M.: Análise das evoluções de enfermagem contextualizadas no processo de enfermagem. Revista de Enfermagem UFPE on line (2018). https://doi.org/10.5205/1981-8963-v12i11a234623p2952-2960-2018
Garritano, C.R.d.O., Junqueira, F.H., Lorosa, E.F.S., Fujimoto, M.S., Martins, W.H.A.: Avaliação do Prontuário Médico de um Hospital Universitário. Revista Brasileira de Educação Médica (2020). https://doi.org/10.1590/1981-5271v44.1-20190123
Guillou, P.: Portuguese bert base cased QA (question answering), finetuned on squad v1.1 (2021). https://huggingface.co/pierreguillou/bert-base-cased-squad-v1.1-portuguese
Jeong, M., et al.: Transferability of natural language inference to biomedical question answering. CoRR abs/2007.00217 (2020). https://arxiv.org/abs/2007.00217
Jin, Q., Dhingra, B., Liu, Z., Cohen, W.W., Lu, X.: PubMedQA: a dataset for biomedical research question answering. In: EMNLP-IJCNLP 2019–2019 Conference on Empirical Methods in Natural Language Processing and 9th International Joint Conference on Natural Language Processing, Proceedings of the Conference (2020). https://doi.org/10.18653/v1/d19-1259
Krallinger, M., Krithara, A., Nentidis, A., Paliouras, G., Villegas, M.: BioASQ at CLEF2020: large-scale biomedical semantic indexing and question answering. In: Jose, J.M., et al. (eds.) ECIR 2020. LNCS, vol. 12036, pp. 550–556. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-45442-5_71
Mutabazi, E., Ni, J., Tang, G., Cao, W.: A review on medical textual question answering systems based on deep learning approaches. Appl. Sci. 11(12) (2021). https://doi.org/10.3390/app11125456, https://www.mdpi.com/2076-3417/11/12/5456
e Oliveira, L.E.S., et al.: SemClinBr - a multi-institutional and multi-specialty semantically annotated corpus for Portuguese clinical NLP tasks (2020). https://arxiv.org/abs/2001.10071
Pampari, A., Raghavan, P., Liang, J., Peng, J.: emrQA: a large corpus for question answering on electronic medical records. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 2357–2368. Association for Computational Linguistics, Brussels, Belgium (October-November 2018). https://doi.org/10.18653/v1/D18-1258, https://aclanthology.org/D18-1258
Qiu, X.P., Sun, T.X., Xu, Y.G., Shao, Y.F., Dai, N., Huang, X.J.: Pre-trained models for natural language processing: a survey. Sci. Chin. Technol. Sci. 63(10), 1872–1897 (2020). https://doi.org/10.1007/s11431-020-1647-3
Rajpurkar, P., Zhang, J., Lopyrev, K., Liang, P.: SQuAD: 100,000+ questions for machine comprehension of text. In: EMNLP 2016 - Conference on Empirical Methods in Natural Language Processing, Proceedings (2016). https://doi.org/10.18653/v1/d16-1264
Schneider, E.T.R., et al.: BioBERTpt - a Portuguese neural language model for clinical named entity recognition (2020). https://doi.org/10.18653/v1/2020.clinicalnlp-1.7
Soni, S., Roberts, K.: Evaluation of dataset selection for pre-training and fine-tuning transformer language models for clinical question answering. In: LREC 2020–12th International Conference on Language Resources and Evaluation, Conference Proceedings (2020)
Souza, J.V.A.D., et al.: A multilabel approach to Portuguese clinical named entity recognition. J. Health Inf. 12 (2021). http://www.jhi-sbis.saude.ws/ojs-jhi/index.php/jhi-sbis/article/view/840. http://www.jhi-sbis.saude.ws/ojs-jhi/index.php/jhi-sbis/issue/view/98/showToc
Šuster, S., Daelemans, W.: CliCR: a dataset of clinical case reports for machine reading comprehension. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 1551–1563. Association for Computational Linguistics, New Orleans, Louisiana (June 2018). https://doi.org/10.18653/v1/N18-1140, https://aclanthology.org/N18-1140
Vaswani, A., et al.: Attention Is All You Need. In: Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS 2017, pp. 6000–6010 (2017)
Wiese, G., Weissenborn, D., Neves, M.: Neural domain adaptation for biomedical question answering. In: CoNLL 2017–21st Conference on Computational Natural Language Learning, Proceedings (2017). https://doi.org/10.18653/v1/k17-1029
Wolf, T., et al.: transformers: state-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp. 38–45. Association for Computational Linguistics (October 2020). https://www.aclweb.org/anthology/2020.emnlp-demos.6
Yoon, W., Lee, J., Kim, D., Jeong, M., Kang, J.: Pre-trained language model for biomedical question answering. In: Communications in Computer and Information Science (2020). https://doi.org/10.1007/978-3-030-43887-6_64
Yue, X., Zhang, X.F., Sun, H.: Annotated question-answer pairs for clinical notes in the mimic-iii database (2021). https://doi.org/10.13026/J0Y6-BW05, https://physionet.org/content/mimic-iii-question-answer/1.0.0/
Acknowledgements
This study was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brasil (CAPES) - Finance Code 001.
© 2021 Springer Nature Switzerland AG
Oliveira, L.E.S.e., Schneider, E.T.R., Gumiel, Y.B., Luz, M.A.P.d., Paraiso, E.C., Moro, C. (2021). Experiments on Portuguese Clinical Question Answering. In: Britto, A., Valdivia Delgado, K. (eds) Intelligent Systems. BRACIS 2021. Lecture Notes in Computer Science(), vol 13074. Springer, Cham. https://doi.org/10.1007/978-3-030-91699-2_10
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-91698-5
Online ISBN: 978-3-030-91699-2