Abstract
Value-based health care management models require precise accounting of health indexes such as risk event monitoring, clinical conditions, patient handling and outcomes. Currently, this accounting is performed by manually reading and searching electronic health records for these indexes. Our research proposes a way to automate this task using a concept classifier for Portuguese free text based on ontologies. To validate our model, we tested it on digital clinical records from 191 patients under ischemic stroke care. We selected 30 management indexes to be identified in these texts. Our model reached an average f1-score of 56.8%, varying from 5.83% to 94.78% across the management indexes.
Financially supported by the Brazilian Coordination of Superior Level Staff Improvement (CAPES), by the Portuguese Foundation for Science and Technology (FCT) under the projects CEECIND/01997/2017 and UIDB/00057/2020, and by the National Council for Scientific and Technological Development (CNPq) (project: 465518/2014-1).
1 Introduction
Value-based healthcare (VBHC) systems allow fair rewards and proper recognition for health providers based on the quality and results of the service provided [3, 5, 11]. In such a system, the bodies responsible for financing and rewarding these providers can be more confident about their investments, the main users get better services and results, and providers are encouraged to optimize their practice. The implementation of effective VBHC requires advances in computational intelligence to continually turn electronic health record (EHR) data into information [7, 10]. Services and results are evaluated through these records. This is an exhausting manual task that has to be performed even with cutting-edge digital health record software; therefore, it is essential to automate it. The task consists of measuring service indexes such as risk event monitoring, clinical conditions, patient handling and outcomes. To measure these indexes, it is necessary to find keywords and technical terms inside clinical records that are written as free text in Portuguese. Our research aims to automate this task and is focused on patients under ischemic stroke care. For this context, 30 indicator indexes were chosen to measure the services, grouped into: Clinical features, Evaluation measures and risk events, Clinical handling, and Patient condition.
This challenge has been addressed previously by Zanotto et al. (2021) [14] using machine learning techniques, with good results. However, these techniques often require vast amounts of annotated data and computational processing for model training. Our approach tries to avoid these issues by proposing a knowledge-based system combined with natural language processing (NLP) techniques. We use NLP methods to find keywords and match terms with an ontology, which then uses axioms to classify a given text from the clinical records according to the management indexes.
The evaluation of our model was made on digital clinical records from 191 patients under stroke care. Our model was able to identify and classify 28 of those indexes, with results ranging from an f1-score of 5.83% and an mcc-score of 8.01% to an f1-score of 94.78% and an mcc-score of 94.78%. Considering all 30 indexes, our model reached an average f1-score of 56.8% and an average mcc-score of 57.97%.
2 Related Work
In Wang et al. (2003) [12] and Zhou and El-Gohary (2015) [15] we see the use of machine learning algorithms as part of the classification process, either to learn the terms related to the domain or to make the classifications. A larger quantity of data was required to achieve good results. Wang et al. (2003) applied their approach to papers from the MEDLINE database, and Zhou and El-Gohary (2015) to construction regulation documents.
Other works follow a knowledge representation approach. Allahyari et al. (2014) [1] use an ontology, together with graph projection, to classify English news texts from the web. In Chi et al. (2014) [4] an ontology was created in a semi-automated way, but due to data scarcity the quality and range of terms were not optimal. They worked with job hazard reports. In Schwertner et al. (2019) [9] the authors built an ontology based on domain knowledge from specialists, defining relations between concepts and sentences, and used the ontology as a classification tool. They worked with clinical data in English. In Gayathri and Kannan (2020) [6] the authors face a challenge similar to ours: their goal was to detect and identify health-related information in English text documents. In Yehia et al. (2019) [13] domain ontologies are used to classify sentences using rules based on the relations between concepts. They worked with clinical documents. De Araujo et al. (2017) [2] present an ontology that classifies texts through inference processes based on linguistic rules defined by specialists. The authors worked with Portuguese documents and texts about judicial events.
These related works show that ontologies can be considered an alternative for text classification. Different approaches are applied to different domains, and results were in general positive. We also see few efforts towards techniques focused on the Portuguese language. Given these results, we considered that a model based on ontologies should be tested for our text classification task, in Portuguese, and compared with the previous work on the same problem by Zanotto et al. (2021) [14], which used machine learning. They present an evaluation of machine learning approaches on the same database used in the present work. Their classification considers almost the same set of classes, covering 24 of the indexes. A comparison with this work is presented in the results section.
3 Available Data and Challenges
The main goal of this research was to verify the applicability and the performance of a computational model, based on ontologies, in the task of automatically detecting and classifying Portuguese free text from electronic health records. To accomplish our goal, we focused on classifying clinical records from patients under ischemic stroke care. This decision is based on our proximity to a team of specialists in this domain. We also wanted to verify how this model compares against machine learning approaches on the same task and dataset [14].
The data available for this research contains Portuguese free-text clinical records of 191 patients who were treated for ischemic stroke incidents from 01/01/2019 to 07/23/2019. Our challenge is to find terms and keywords in a given text from these records and, given the detected words, classify the text into one of the 30 quality indexes shown in Table 1. This study was approved by the Hospital Ethics Committee (CAAE: 29694720000005330).
We aim to develop a computational model that integrates with current Electronic Health Record software, allowing the software to operate as originally designed, since it only has to provide the data it already stores. With this approach, healthcare practitioners can keep using the kind of software they are used to, with no need to input any new data. As the model outputs only the indexes into which the texts were classified, practitioners would also be able to keep the privacy and the details of their records and practices.
With the help of a team of domain specialists, we identified technical terms and keywords for the indexes that are to be found within the clinical records. We developed an ontology that plays three important roles in our model. The first is to provide a list of terms (keywords) that are to be found in the texts. The second is to define the relation between the indexes and the terms. The third is to act as the classifier itself, reasoning about the relation between texts and terms and classifying the texts into the appropriate index. A term detection algorithm was also developed to work along with our ontology.
3.1 Data Preprocessing and Annotation
All available clinical records were first anonymized and then split into sentences. This step generated 46,547 sentences in which the indexes would have to be detected. The sentence order was randomized to prevent annotation based on previous context; the idea was to analyse each sentence independently.
Two annotators, both domain specialists, read the sentences and reported all the indexes that could be identified in each sentence. The agreement between the two annotators was measured by kappa, which was higher than 0.61. Whenever there was a conflict, both annotators came together to discuss and resolve it. No conflict was left unresolved. At the conclusion of this step, only 17,471 out of the 46,547 sentences were found to be related to one or more of the indicators. The number of sentence occurrences for each indicator is detailed in Table 8.
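The agreement measure mentioned above can be sketched as follows. The text only says "kappa", so this sketch assumes Cohen's kappa for two annotators and one binary label per sentence and index; the function name and the example annotations are hypothetical and are not taken from the study data.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one identical label everywhere
    return (p_o - p_e) / (1 - p_e)

# Hypothetical annotations for a single index over eight sentences
# (1 = index present, 0 = index absent).
annotator_1 = [1, 1, 0, 0, 1, 0, 0, 0]
annotator_2 = [1, 0, 0, 0, 1, 0, 0, 1]
kappa = cohen_kappa(annotator_1, annotator_2)
```

Under this reading, a value above 0.61 would be computed per index or over all sentence and index pairs; the text does not say which.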
4 Methods
4.1 Ontology Based Classification Algorithm
Our ontology is a task ontology, developed for classifying sentences containing terms into 30 different indexes. It has three main concepts: the ‘Terms’ concept, whose sub-concepts are the keywords to be found in the texts; the ‘Sentences’ concept, which contains all the sentences processed by the term detection algorithm; and the ‘Index’ concept, which contains the description of all the indexes into which the sentences should be classified. An object property, ‘contain’, expresses the presence of a given term in a sentence. To the ‘Terms’ concept we added a few subsets of concepts. The ‘Values’ subset specifies the terms that require a numeric value in order to compose a classification. The ‘Negations’ subset contains expressions that indicate the negation of the occurrence of an index. The ‘PastTense’ subset contains terms that signal the occurrence of an index at some point in the past; these terms usually appear after the mention of the index. The ‘PastTenseRetroactive’ subset also identifies the past, but it holds the terms that appear before the mention of the index in the sentence. Whenever terms from these subsets are found, they create the appropriate relation, for instance ‘sentence negates index’ instead of ‘sentence contains index’.
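As an illustration of how these subsets could change the relation that is created, the sketch below picks a relation for an index term found at a given position in a tokenized sentence. The term lists, the relation name "past" and the assumption that a negation precedes the index term are all hypothetical; the paper does not list the actual terms or the exact matching rule.

```python
# Hypothetical members of the three subsets described above.
NEGATIONS = {"nega", "sem", "não"}
PAST_TENSE = {"prévio", "prévia"}            # expected after the index term
PAST_TENSE_RETROACTIVE = {"histórico"}       # expected before the index term

def relation_for(tokens, term_position):
    """Pick the relation linking a sentence to the index term at term_position."""
    before = set(tokens[:term_position])
    after = set(tokens[term_position + 1:])
    if before & NEGATIONS:                    # assumption: negation precedes the term
        return "negates"
    if (after & PAST_TENSE) or (before & PAST_TENSE_RETROACTIVE):
        return "past"
    return "contain"

# "paciente nega dor": the negation comes before the index term "dor".
example = relation_for(["paciente", "nega", "dor"], 2)
```

A rule of this kind handles one negation term per index mention, which matches the limitation the error analysis reports for sentences where a single negation covers several indexes.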
Our ontology plays the following roles: it serves as a knowledge model of the terminology of this domain, and it acts as a text classifier through its inference capabilities.
The inference process is based on assertions in logical form that together comprise the overall theory that the ontology describes in this domain. These assertions are called axioms, and in this ontology they state which terms a sentence must contain in order to be classified as an element of a given index. Table 2 shows a few examples of these axioms. The ontology was built in the OWL language using the Protégé tool.
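The effect of such axioms can be sketched in plain Python. This is only a stand-in for the OWL reasoner, assuming every axiom is a union of ‘contain some term’ restrictions, as in the thrombolysis axiom quoted in Sect. 4.2; the names AXIOMS and classify are hypothetical.

```python
# Each axiom lists terms whose presence is sufficient for an index,
# i.e. a union of 'contain some <term>' restrictions.
AXIOMS = {
    "thrombolysis": {"alteplase", "trombolisada", "trombolítico", "trombólise"},
}

def classify(contained_terms, axioms=AXIOMS):
    """Return the indexes whose axiom is satisfied by the contained terms."""
    return {index for index, terms in axioms.items() if terms & contained_terms}

# Terms linked to one sentence through the 'contain' relation.
indexes = classify({"trombólise"})
```

A real reasoner derives the same membership from the class hierarchy and property restrictions rather than from a set intersection, and it can also handle axioms that combine several restrictions.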
The term detection algorithm receives the instances of the class Terms. It then runs through each sentence and tries to match the given terms to the words in the sentence. Whenever a match is found, the algorithm registers it in the ontology as the triple (sentence, relation, term).
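A minimal sketch of this matching loop is given below, assuming exact matching on already normalized tokens and only the ‘contain’ relation. The function name, the sentence identifiers and the sample tokens are hypothetical.

```python
def detect_terms(sentences, ontology_terms):
    """Return (sentence_id, 'contain', term) triples for every matched term."""
    triples = []
    for sentence_id, tokens in sentences.items():
        for token in tokens:
            if token in ontology_terms:
                triples.append((sentence_id, "contain", token))
    return triples

# Hypothetical tokenized sentences keyed by an identifier.
sentences = {
    "sentence_1": ["paciente", "recebeu", "alteplase"],
    "sentence_2": ["sem", "queixas"],
}
triples = detect_terms(sentences, {"alteplase", "trombólise"})
```

In the described system the triples are written into the ontology as individuals and property assertions; here they are simply collected in a list.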
Our approach also uses ‘word embedding’ models. The ‘word embedding’ model used was developed from the electronic health records of a Brazilian hospital [8]; it was trained on 21 million sentences, resulting in a model with 63 thousand words. We chose this language model because it is based on Portuguese EHR. The application of this model has two main goals: to circumvent spelling and grammar errors that often occur in these clinical records, and to expand the list of terms initially defined in the ontology by bringing in new related words. For instance, the medical term ‘coronarina’ is not defined in the ontology, but the model relates it to the term ‘coronaria’, covering this terminology gap. Hence our algorithm uses this model to search for similar words. Using cosine similarity, the top 10 words are retrieved to expand the list of terms given by the ontology.
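The retrieval of similar words by cosine similarity can be sketched as follows. The three-dimensional vectors are invented purely for illustration and are not taken from the actual embedding model; the function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def most_similar(word, embeddings, top_n=10):
    """Return up to top_n vocabulary words closest to `word`."""
    if word not in embeddings:
        return []
    target = embeddings[word]
    scored = [(cosine(target, vec), other)
              for other, vec in embeddings.items() if other != word]
    scored.sort(reverse=True)
    return [other for _, other in scored[:top_n]]

# Toy vectors, invented for this example only.
TOY_EMBEDDINGS = {
    "coronarina": [0.90, 0.10, 0.00],
    "coronaria": [0.88, 0.12, 0.01],
    "febre": [0.00, 0.20, 0.90],
}
neighbours = most_similar("coronarina", TOY_EMBEDDINGS, top_n=1)
```

With a real model of 63 thousand words, a library routine for nearest-neighbour lookup would normally replace this linear scan.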
To optimize running time, the algorithm keeps two parallel lists. The first stores all words from the texts that have no match in the ontology. The second keeps track of the words that do not match any term in the ontology themselves but have at least one similar word, according to the word embedding model, that does. After all the sentences have been processed and placed under the ‘Sentences’ set in the ontology, the reasoning process starts and the classification is made. The results are then evaluated by comparing them with the annotated data.
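The two lists can be read as caches that avoid repeating the similarity search for words already seen. The sketch below follows that reading and uses a set and a dictionary instead of lists; make_matcher and the similar_words callback are hypothetical names, and the callback stands in for the word embedding lookup.

```python
def make_matcher(ontology_terms, similar_words):
    """Build a word matcher that remembers earlier lookups."""
    unmatched = set()          # words with no match, directly or via similar words
    matched_via_similar = {}   # word -> ontology term reached through a similar word

    def match(word):
        if word in ontology_terms:
            return word
        if word in unmatched:
            return None                      # already known to be a miss
        if word in matched_via_similar:
            return matched_via_similar[word]
        for candidate in similar_words(word):
            if candidate in ontology_terms:
                matched_via_similar[word] = candidate
                return candidate
        unmatched.add(word)
        return None

    return match, unmatched, matched_via_similar

# Hypothetical similarity lookup used only for this example.
def toy_similar(word):
    return {"trombolise": ["trombólise"]}.get(word, [])

match, unmatched, via_similar = make_matcher({"trombólise"}, toy_similar)
first = match("trombolise")
second = match("após")
```

Each distinct word triggers at most one similarity search, which matters when the same frequent words appear across tens of thousands of sentences.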
4.2 Example
To better understand how our model works, let us take the set of sentences shown in Table 3. Our model receives these sentences as input and classifies them. Consider sentence number 6. The first step is to tokenize it; Table 4 shows the result of this step. Next, every word is cleaned by removing characters that are neither letters nor digits and is changed to lowercase, as seen in Table 5. The first word from our sample is ‘após’. Our model uses the word embedding model to expand this word and get new similar words; Table 6 shows all the similar words found for this term. After that, they are compared with our defined terms: the term ‘após’ and all the similar terms are searched for in the list of terms defined in the ontology. In this instance the term ‘após’ is not defined, and hence no relation between ‘sentence #6’ and this term is created. Our model stores this term in a list of unmatched terms, so that it is not evaluated again the next time it appears. Next, we have the term ‘trombolise’. This one is also expanded with our word embedding model, and the similar terms are shown in Table 7. Again, the term and all the similar words are matched against the ontology.
In this scenario, the term given by the sentence is ‘trombolise’, whereas the definition contains ‘trombólise’. As expected, the term itself is not found in the ontology list, since it is defined there with the proper spelling. However, the word embedding model captures this misspelling. As one of the similar words is ‘trombólise’, our model correctly creates the relation between ‘sentence #6’ and the term ‘trombólise’. Our algorithm then writes the relation ‘Sentence#6 contain trombólise’ in the ontology. This relation is specified in the axiom ‘(contain some alteplase) or (contain some trombolisada) or (contain some trombolítico) or (contain some trombólise) SubClassOf thrombolysis’, so once the reasoning process is complete the sentence is classified as an element of the set ‘thrombolysis’, which is the set of all sentences that tell us that the index Thrombolysis is present.
These steps are then applied to every word in the sentence, and at the end the relations found are stored in the ontology, which will next reason about them.
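The tokenization and clean-up steps of this example can be sketched as below, assuming whitespace tokenization. The sample sentence is made up for illustration, since the actual sentences appear only in Table 3, and the function name is hypothetical.

```python
def normalize(sentence):
    """Split on whitespace, drop non-alphanumeric characters, lowercase."""
    tokens = sentence.split()
    cleaned = ("".join(ch for ch in token if ch.isalnum()).lower()
               for token in tokens)
    # Tokens made only of punctuation become empty and are discarded.
    return [token for token in cleaned if token]

# Made-up sentence, not taken from the study data.
words = normalize("Após Trombolise, paciente estável.")
```

Python's str.isalnum treats accented letters such as ‘ó’ as alphanumeric, so Portuguese diacritics survive the clean-up and spelling variants such as ‘trombolise’ are left for the word embedding step to resolve.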
5 Results
To evaluate the proposed model, the 46,547 sentences were processed and the output was compared to the manual annotation to measure ‘precision’, ‘recall’, ‘mcc-score’ and ‘f1-score’. The running time for the whole task was also measured. Table 8 shows the total occurrences of sentences annotated for each index and the results obtained.
The model took 532.43 s to process all 46,547 sentences, achieving on average an ‘f1-score’ of 56.8%, an ‘mcc-score’ of 57.97%, a ‘precision’ of 64.89% and a ‘recall’ of 57.97%. Some indexes, such as ‘Thrombolysis’ and ‘Atrial fibrillation’, achieved results over 80%; others, such as ‘Pain’ and ‘Mobility’, did not reach 20%.
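For reference, the four measures can be computed for a single index from binary gold and predicted labels as in the sketch below. It uses the standard definitions of precision, recall, F1 and the Matthews correlation coefficient (MCC); the function name and the sample labels are hypothetical.

```python
import math

def binary_metrics(gold, predicted):
    """Precision, recall, F1 and MCC for one index (labels are 0 or 1)."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g and p)
    fp = sum(1 for g, p in pairs if not g and p)
    fn = sum(1 for g, p in pairs if g and not p)
    tn = sum(1 for g, p in pairs if not g and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "mcc": mcc}

# Hypothetical labels for eight sentences.
gold = [1, 1, 1, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 1, 0, 0, 0, 0]
scores = binary_metrics(gold, predicted)
```

The averages reported above would then correspond to the mean of each measure over the indexes, assuming a macro average; the text does not state the averaging scheme explicitly.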
These results were compared to the ones obtained by Zanotto et al. (2021) [14], in which a machine learning approach was used for the same challenge. Several supervised machine learning (ML) methods, including recent neural and non-neural methods, were evaluated on the basis of a 5-fold cross-validation procedure. The best results were achieved by the W+C+SVM model, which is based on word-TFIDF and character-TFIDF for input representation and SVM for the classification.
Table 9 presents the f1-score results of our ontology-based approach and the machine learning approach. Recall that machine learning evaluation results are available for only 24 of the indexes, so only those can be compared.
Our approach performs well on indexes on which the machine learning model does not, and vice versa. This signals that, by delegating a given indicator to the most adequate classifier, a combined model could perform well on a larger range of indexes. For instance, classifications made by the ontology would benefit the analysis of the indicator ‘Dyslipidemia’, whereas the indicator ‘Infection’ would benefit from the machine learning approach. This decision could also be modeled in the ontology and reasoned by it. Further effort should be put into this matter as collaborative future work.
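One simple way to realise this delegation is to route each index to the classifier with the higher validation f1-score. The sketch below assumes such per-index scores are available; the numbers are invented placeholders rather than results from Table 9, and the function name is hypothetical.

```python
def pick_classifier(scores_by_index):
    """Map each index to the classifier with the highest f1-score."""
    return {index: max(scores, key=scores.get)
            for index, scores in scores_by_index.items()}

# Placeholder scores, invented for illustration only.
validation_f1 = {
    "Dyslipidemia": {"ontology": 0.80, "machine_learning": 0.60},
    "Infection": {"ontology": 0.40, "machine_learning": 0.70},
}
routing = pick_classifier(validation_f1)
```

To avoid an optimistic bias, the scores used for routing would need to come from data that is not reused for the final evaluation.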
5.1 Error Analysis
In Table 10 we present the most common cases of ‘false negatives’ and ‘false positives’. In general, the performance of the model is related to the coverage of terms in the ontology. Misspellings and technical slang played a big part in classification errors, since the word embedding model does not cover all the possible variations of terms.
Cases 3, 12 and 13 are related to differences in spelling that were not captured among the most similar words according to the word embedding (WE) model. Cases 1, 2 and 9 are ones in which the index is present in the sentence, but the specific terms in the sentence were not defined in our ontology.
Past tense and negation are also common sources of errors, as expected, since these elements of language require specific, sophisticated solutions of their own. In cases 4, 7 and 8 the past is indicated through the date of the event; neither our ontology nor our algorithm was prepared for this. The errors in cases 6, 10 and 11 show situations in which one ‘negation’ term negates more than one index, whereas our model expects one negation term per event.
6 Conclusion
In this paper, we proposed a Portuguese text classifier based on ontologies. Our results show that this approach achieved good results for at least 18 out of the 30 indexes. We believe that this research demonstrates that ontologies are a good alternative for the classification of Portuguese medical texts and, because of that, can contribute to the implementation of VBHC programs and to the transformation of health care systems. An advantage of this approach is its direct explainability.
For future work, we plan to evaluate the model on EHR from different institutions to validate the generality of the results shown here. We also plan to expand the coverage of terms and, to that end, to create another word embedding model, more tailored to the context of stroke patients. As a continuation of the project, we plan to align our task ontology with stroke domain ontologies.
References
1. Allahyari, M., Kochut, K.J., Janik, M.: Ontology-based text classification into dynamically defined topics. In: Proceedings of the 8th IEEE International Conference on Semantic Computing, Newport Beach, USA, pp. 273–278. IEEE (2014)
2. de Araujo, D.A., Rigo, S.J., Barbosa, J.L.V.: Ontology-based information extraction for juridical events with case studies in Brazilian legal realm. Artif. Intell. Law 25, 379–396 (2017)
3. Bessa, R.d.O.: Análise dos modelos de remuneração médica no setor de saúde suplementar brasileiro. Ph.D. thesis, FGV (2011)
4. Chi, N.W., Lin, K.Y., Hsieh, S.H.: Using ontology-based text classification to assist job hazard analysis. Adv. Eng. Inf. 28, 381–394 (2014)
5. Engel, G.L.: The clinical application of the biopsychosocial model. J. Med. Philos. Forum Bioeth. Philos. Med. 6(2), 101–124 (1981). https://doi.org/10.1093/jmp/6.2.101
6. Gayathri, M., Kannan, R.: Ontology based concept extraction and classification of ayurvedic documents. Procedia Comput. Sci. 172, 511–516 (2020)
7. Gonçalves, F.N.R.: Optimizing patients’ pathways in international cooperation, by doing value based healthcare (VBHC). Acta medica portuguesa 32(2), 167–168 (2019)
8. Dias Pereira dos Santos, H., D. P. S. Ulbrich, A.H., Woloszyn, V., Vieira, R.: An initial investigation of the Charlson comorbidity index regression based on clinical notes. In: 2018 IEEE 31st International Symposium on Computer-Based Medical Systems (CBMS), pp. 6–11 (2018). https://doi.org/10.1109/CBMS.2018.00009
9. Schwertner, M.A., Rigo, S.J., Araújo, D.A., Silva, A.B., Eskofier, B.: Fostering natural language question answering over knowledge bases in oncology EHR. In: Proceedings of the 32nd International Symposium on Computer-Based Medical Systems, Cordoba, Spain, pp. 501–506. IEEE (2019)
10. da Silva Etges, A.P.B., Ruschel, K.B., Polanczyk, C.A., Urman, R.D.: Advances in value-based healthcare by the application of time-driven activity-based costing for inpatient management: a systematic review. Value Health 23(6), 812–823 (2020). https://doi.org/10.1016/j.jval.2020.02.004. https://www.sciencedirect.com/science/article/pii/S1098301520301303
11. Uzuelli, F.H.d.P., Costa, A.C.D.d., Guedes, B., Sabiá, C.F., Batista, S.R.R.: Reforma da atenção hospitalar para modelo de saúde baseada em valor e especialidades multifocais. Ciência Saúde Coletiva 24, 2147–2154 (2019)
12. Wang, B.B., Mckay, R.I.B., Abbass, H.A., Barlow, M.: A comparative study for domain ontology guided feature extraction. In: Proceedings of the 26th Australasian Computer Science Conference, pp. 69–78. Australian Computer Society Inc., Adelaide, Australia (2003)
13. Yehia, E., Boshnak, H., Abdelgaber, S., Abdo, A., Elzanfaly, D.: Ontology-based clinical information extraction from physician’s free-text notes. J. Biomed. Inform. 98, 103–117 (2019)
14. Zanotto, B., et al.: Automatic classification of electronic health records for a value-based program through machine learning. Value Health (2021, to appear). Accepted for publication in Virtual ISPOR 2021
15. Zhou, P., El-Gohary, N.: Ontology-based multilabel text classification of construction regulatory documents. J. Comput. Civ. Eng. 30, 40–54 (2015)
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Bosco, A.D., Vieira, R., Zanotto, B., da Silva Etges, A.P.B. (2021). Ontology Based Classification of Electronic Health Records to Support Value-Based Health Care. In: Britto, A., Valdivia Delgado, K. (eds) Intelligent Systems. BRACIS 2021. Lecture Notes in Computer Science(), vol 13073. Springer, Cham. https://doi.org/10.1007/978-3-030-91702-9_24
DOI: https://doi.org/10.1007/978-3-030-91702-9_24
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-91701-2
Online ISBN: 978-3-030-91702-9
eBook Packages: Computer Science (R0)
