key: cord-0323415-ih75w4up
authors: Zhan, X.; Humbert-Droz, M.; Mukherjee, P.; Gevaert, O.
title: Structuring clinical text with AI: old vs. new natural language processing techniques evaluated on eight common cardiovascular diseases
date: 2021-01-29
journal: nan
DOI: 10.1101/2021.01.27.21250477
sha: 4a055b883e69ec3b2b776e5cb620668986420e36
doc_id: 323415
cord_uid: ih75w4up

Abstract

Mining the structured data in electronic health records (EHRs) enables many clinical applications, while the information in free-text clinical notes often remains untapped. Free-text notes are unstructured data that are harder to use in machine learning, while structured diagnostic codes can be missing or even erroneous. To improve the quality of diagnostic codes, this work extracts structured diagnostic codes from unstructured notes concerning cardiovascular diseases. Five old and new word embeddings were used to vectorize over 5 million progress notes from the Stanford EHR, and logistic regression was used to predict eight ICD-10 codes of common cardiovascular diseases. The models were interpreted through the words most important for prediction and through analyses of false positive cases. Having been trained on Stanford notes, model transferability was tested on the prediction of the corresponding ICD-9 codes for MIMIC-III discharge summaries. The word embeddings and logistic regression performed well on diagnostic code extraction, with TF-IDF as the best word embedding model, showing AUROC ranging from 0.9499 to 0.9915 and AUPRC ranging from 0.2956 to 0.8072. The models also transferred to the MIMIC-III data set, with AUROC ranging from 0.7952 to 0.9790 and AUPRC ranging from 0.2353 to 0.8084. Model interpretability was demonstrated by important words whose clinical meanings matched each disease. This study shows the feasibility of accurately extracting structured diagnostic codes, imputing missing codes and correcting erroneous codes from free-text clinical notes with models that clinicians can interpret, which helps improve the data quality of diagnostic codes for information retrieval and downstream machine-learning applications.

Introduction

The digitization of hospitals has enabled electronic health records (EHR) to become accessible to researchers for secondary usage that benefits healthcare research [1, 2, 3, 4]. The analysis of electronic health records contributes to a better understanding of the clinical trajectories of patients [5] and to improved patient stratification and risk evaluation [6, 7]. However, much of the information in the EHR is locked in free-text clinical notes [2, 4], and analyzing these notes is challenging [1, 2, 8]. Historically, the information in free-text clinical notes has been extracted mostly manually by clinical experts for archiving, retrieval and analysis; this has been particularly relevant to chronic disease, where clinical notes dominate over structured data. More recently, natural language processing (NLP) and machine learning methods have shown great promise for automatically analyzing clinical notes [1, 2, 9, 10]. EHR data enable researchers and clinicians to extract information and encode it for later information retrieval and secondary usage [4]. Based on these clinical notes, ICD-10 codes (the International Classification of Diseases, Tenth Revision) [11] are used by clinicians to encode diagnoses.
Typical research applications of EHR data have used these diagnostic codes in downstream tasks such as automatic information retrieval, risk prediction and the prediction of disease subtypes [1, 2, 9, 10]. As the ICD-10 diagnostic codes form the basis of such tasks, their quality determines downstream performance. Furthermore, EHR data in structured rather than free-text format can be more easily used in machine learning applications or combined with other data types.

Yet, diagnostic codes are frequently missing in the EHR, and the recorded diagnostic codes may be inaccurate. Misclassification and inaccuracy in diagnostic codes have been reported in an increasing number of papers, for instance in cases related to myocardial infarction and stroke [12, 13]. McCarthy et al. [12] reported that a substantial percentage of patients who had myocardial injury were miscoded as having type 2 myocardial infarction, which may have serious consequences. Next, Chang et al. [13] found disagreement in stroke coding, which may negatively influence stroke case identification in epidemiological studies and hospital-level quality metrics. Recent studies have focused on the problem of diagnostic code prediction [1, 9]. Although some good results have been shown, many of the previous diagnostic code prediction studies have applied deep-learning methods that make the models hard to interpret [2, 9, 3]. Because ICD-10 codes are usually the starting point for downstream tasks and clinicians attach great significance to interpretable information extraction systems [4], interpretable models may have advantages over less-interpretable models: they not only enable accurate ICD-10 code imputation but also enable clinicians to readily understand the models and control the quality of the diagnostic codes with their expertise.

In this study, we propose the use of NLP word vectorization algorithms and logistic regression (LR) to predict eight ICD-10 codes related to common cardiovascular diseases from free-text outpatient progress notes. We compared interpretable and less interpretable models with regard to their performance on the ICD-10 code prediction tasks. The proposed models show good classification performance on eight ICD-10 codes on two Stanford cohorts, and the models generalized well to the MIMIC-III data set. Additionally, the most interpretable models also showed the best performance on all data sets.

Methods

We used outpatient progress notes of 133,644 patients diagnosed with cardiovascular diseases at Stanford Health Care. The patients were partitioned into a training set (60%), validation set (20%) and test set (20%). All notes belonging to the same patient were assigned to the same data set to avoid information leakage across data sets.
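A minimal sketch of such a patient-level split, assuming scikit-learn; the variable names and the choice of GroupShuffleSplit are illustrative, not necessarily the authors' implementation:

```python
# Patient-level 60/20/20 split: grouping by patient id keeps all notes of one
# patient in the same partition (hypothetical toy data).
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

notes = np.array(["note a", "note b", "note c", "note d", "note e"])
patient_ids = np.array([1, 1, 2, 3, 3])  # all notes of a patient share an id

# Carve off ~60% of patients for training ...
gss = GroupShuffleSplit(n_splits=1, train_size=0.6, random_state=0)
train_idx, rest_idx = next(gss.split(notes, groups=patient_ids))

# ... then split the remaining patients 50/50 into validation and test sets.
gss2 = GroupShuffleSplit(n_splits=1, train_size=0.5, random_state=0)
val_rel, test_rel = next(gss2.split(notes[rest_idx], groups=patient_ids[rest_idx]))
val_idx, test_idx = rest_idx[val_rel], rest_idx[test_rel]
```

Splitting on patient ids rather than on individual notes is what prevents near-duplicate notes from the same patient leaking across partitions.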
The data set included 5,604,539 notes from 31,502 encounters dated from April 2000 to October 2016. The data were retrospectively collected and de-identified in accordance with approved IRB guidelines.

We focused on the following eight common cardiovascular diseases: acute myocardial infarction (I21), chronic ischemic heart disease (I25), other pulmonary heart disease (I27), cardiomyopathy (I42), atrial fibrillation and flutter (I48), heart failure (I50), atherosclerosis (I70) and esophageal varices (I85). As ICD-10 uses a hierarchy to organize its more than 69,000 diagnostic codes, we aimed at predicting the three-letter prefixes of the ICD-10 diagnosis codes. Notes with fewer than sixty words and notes without any labeled ICD-10 code were excluded, resulting in the removal of 63.2% of the notes and defining Cohort 2. For prototyping and for testing the scalability of the models, a smaller cohort, Cohort 1, was built with randomly selected notes from Cohort 2 (Fig. 1).

For W2V, the average of the term embeddings was taken as the embedding for an individual note. The progress notes we used can be divided into three general sections, describing patient history, description at presentation, and plan/billing. In addition to taking the average as a note embedding, a batched form of W2V was introduced in this study by splitting a note into several batches (n = 1, 3, 5), so that the embedding can reflect this sectioned structure.

D2V is based on W2V but additionally inputs the tagged document id when training the word vectors [22]. In the training process, a word vector is trained for each term and a document vector is generated for each document. In the inference process for prediction, all weights are fixed to calculate the document vector for a new document. In this study, to avoid overfitting, we used the 63.2% dropped notes (in neither Cohort 1 nor Cohort 2 because they were shorter than 60 words or lacked ICD-10 codes) to train our D2V model, with 40 epochs and an embedding dimension of 200. The number of terms modeled was 327,113.

To visualize the data, the nonlinear dimensionality reduction method t-distributed Stochastic Neighbor Embedding (t-SNE) [23] was used.

Once the note embeddings are obtained, the vectors become the input of a classification model that predicts the diagnostic code. We used logistic regression (LR) for ICD-10 code prediction because of its model interpretability. LR [24] applies the logistic function to a linear combination of the features for classification. In this study, we used the Python implementation of LR in the scikit-learn package [25]. L2 regularization was used, and the penalty strength C was tuned based on the average AUROC on the validation set of Cohort 1. A 1:50 class weight was added to deal with class imbalance, since the average prevalence of the eight I-codes was approximately 2%.

To assess the performance of the different word vectorization methods, the area under the receiver operating characteristic curve (AUROC) and the area under the precision-recall curve (AUPRC) were used as metrics to evaluate the word embeddings and the LR models on the eight diagnostic code classification tasks. On Cohort 1, bootstrapping [26] was done on the training set thirty times to test model robustness.
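Putting these pieces together, a minimal sketch of one such binary TF-IDF + LR task follows; the note texts and labels are placeholders, and C = 1.0 stands in for the tuned value:

```python
# One binary classifier per ICD-10 prefix: TF-IDF note embeddings fed into
# an L2-penalized logistic regression with a 1:50 class weight for the
# ~2%-prevalence positive class.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

train_notes = ["chest pain with troponin elevation ...",
               "routine follow-up, no cardiac complaints ..."]
y_train = [1, 0]  # does the note carry the target ICD-10 code?

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_notes)

clf = LogisticRegression(penalty="l2", C=1.0,
                         class_weight={0: 1, 1: 50},  # 1:50 imbalance weight
                         max_iter=1000)
clf.fit(X_train, y_train)

val_notes = ["acute myocardial infarction ruled in ...",
             "annual physical, unremarkable ..."]
y_val = [1, 0]
val_scores = clf.predict_proba(vectorizer.transform(val_notes))[:, 1]
print("validation AUROC:", roc_auc_score(y_val, val_scores))
```

In practice this loop would be repeated for each of the eight I-codes, sweeping C on the validation-set AUROC.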
As BOW and TF-IDF are directly interpretable word-based vectorization algorithms, the LR coefficients were analyzed to identify the words important for classification and thereby interpret the models. The top ten most important words were extracted after bootstrapping the training samples in thirty repeats. In each bootstrapping experiment, the thirty most important words were extracted as candidates, and the final top ten words were selected based on two metrics: 1) the ranking metric, the sum of the rankings of the important words over all bootstrapping results (smaller ranking sums mean higher importance); and 2) the coefficient metric, the sum of the LR coefficients of the important words over all bootstrapping results (larger coefficient sums mean higher importance). A sketch of this procedure is given below.

Because recorded diagnostic codes can be missing or inaccurate in clinical practice, several false positive cases were randomly selected and the corresponding notes were analyzed to test whether missing ICD-10 codes could be imputed from the model predictions.

Next, model transferability was tested on the MIMIC-III (Medical Information Mart for Intensive Care III) data set, de-identified health-related data of 40,000 intensive care unit stays at Beth Israel Deaconess Medical Center [27]. We directly applied the word embedding models (BOW, TF-IDF, W2V, W2V batch and D2V) and the corresponding LR classifiers trained on the training set of the larger Cohort 2 of Stanford notes to predict the diagnostic codes of the discharge summaries in the MIMIC-III data set (59,652 notes, 41,127 patients). No model fine-tuning on the MIMIC-III data set was done. As MIMIC-III uses ICD-9 diagnostic codes, the ground truth was set to the ICD-9 codes corresponding to the eight cardiovascular diseases. We matched each ICD-10 code to the corresponding ICD-9 codes by matching the three-letter prefix to the highest level of the ICD-9 hierarchy describing the same disease, following the matched ICD-9 and ICD-10 codes of [28].

Results

We first visualized the feature embeddings with TF-IDF using t-SNE to explore the data in the clinical notes of the Cohort 1 training set (Fig. 2). Due to limited space, we present the TF-IDF visualization as a demonstration because of its full interpretability. We found clusters related to several cardiovascular diseases. The selected clusters within the bounding boxes showed a high prevalence of I-codes, suggesting that the feature embeddings may be able to distinguish ICD-10 codes.

Secondly, on the larger Cohort 2, with more data, the results showed that the LR models trained on the word vectorization methods classified the I-codes with an improvement in both AUROC and AUPRC, particularly on the codes with lower prevalence (Fig. 3, Supplementary Fig. 2).
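Before turning to the extracted words, here is a minimal sketch of the coefficient-based ranking described in the Methods, continuing the pipeline sketch above (`vectorizer`, `X_train`, `y_train` and `clf` are assumed in scope; resampling details are assumptions):

```python
# Bootstrap the training set 30 times, refit the LR, and aggregate word
# importance by the two metrics described above. Words missing from a given
# repeat simply contribute nothing to either sum, a simplification of the
# paper's ranking.
import numpy as np
from collections import defaultdict
from sklearn.utils import resample

terms = vectorizer.get_feature_names_out()
rank_sum = defaultdict(float)  # ranking metric: smaller sum = more important
coef_sum = defaultdict(float)  # coefficient metric: larger sum = more important

y = np.array(y_train)
for seed in range(30):                            # thirty bootstrap repeats
    Xb, yb = resample(X_train, y, random_state=seed, stratify=y)
    clf.fit(Xb, yb)
    top30 = np.argsort(clf.coef_[0])[::-1][:30]   # 30 candidate words per repeat
    for rank, idx in enumerate(top30):
        rank_sum[terms[idx]] += rank
        coef_sum[terms[idx]] += clf.coef_[0][idx]

top10_by_rank = sorted(rank_sum, key=rank_sum.get)[:10]
top10_by_coef = sorted(coef_sum, key=coef_sum.get, reverse=True)[:10]
```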
To interpret the models, the ten most important words were extracted in thirty bootstrapping experiments on Cohort 1 (Table 1). The results showed not only that many of the important words overlapped across the bootstrapping experiments, but also that most words could be explained by meanings related to the diagnostic codes. For example, for acute myocardial infarction, non-ST-elevation myocardial infarction, myocardial, myocardial infarction, thrombus and infarction were found important; for chronic ischemic heart disease, coronary, coronary artery disease, artery/arterial and angina were found important; and for atrial fibrillation and flutter, fibrillation, atrial, atrial fibrillation and paroxysm were found important. Meanwhile, the results based on the two metrics were similar, indicating that the importance of words was relatively stable over the thirty bootstrapping experiments. To conclude, the models based on TF-IDF and LR not only predicted I-codes with high AUROC and AUPRC, but were also interpretable through the clinically meaningful terms driving the prediction.

Next, to test whether there were missing diagnostic codes in the data sets that could be imputed by the I-code prediction models, several randomly selected false positive cases were analyzed (Table 2). This analysis suggests that it is possible to impute missing I-codes based on the model predictions in a subset of cases. Additional manual curation efforts might be needed because TF-IDF, the most accurate embedding, is word-based and has difficulty handling negation and references to personal and family medical history. For instance, an I-code might be predicted due to a patient's medical history but not necessarily noted down as the diagnostic code for that specific encounter.

To test model transferability, we extracted the discharge summaries in the MIMIC-III data set and the ICD-9 diagnostic codes corresponding to each of the eight ICD-10 codes, and tested the pre-trained word embedding models and classification models on the MIMIC-III data set without any fine-tuning. The high AUROC and AUPRC values showed that all models (i.e. TF-IDF, W2V, W2V batch and D2V) could be transferred well to the classification of diagnostic codes in the MIMIC-III data set (Fig. 5, Table 3).
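The transfer protocol amounts to reusing the fitted vectorizer and classifier unchanged. A sketch, continuing the earlier pipeline example; the notes and ICD-9-derived labels below are hypothetical placeholders:

```python
# Apply the Stanford-trained TF-IDF vectorizer and LR classifier to MIMIC-III
# discharge summaries without re-fitting, scoring against labels mapped from
# ICD-9 codes.
from sklearn.metrics import average_precision_score, roc_auc_score

mimic_notes = ["discharge summary: NSTEMI, cardiac catheterization ...",
               "discharge summary: elective knee arthroplasty ..."]
mimic_labels = [1, 0]  # mapped from ICD-9 codes via the three-letter prefix

X_mimic = vectorizer.transform(mimic_notes)   # no fine-tuning on MIMIC-III
mimic_scores = clf.predict_proba(X_mimic)[:, 1]

print("MIMIC-III AUROC:", roc_auc_score(mimic_labels, mimic_scores))
print("MIMIC-III AUPRC:", average_precision_score(mimic_labels, mimic_scores))
```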
Discussion

In this work, natural language processing (NLP) methods were used to compare five different word embeddings of free-text outpatient clinical notes, and LR was shown to be effective in predicting the diagnostic codes of eight cardiovascular diseases. On both the smaller Cohort 1 and the larger Cohort 2 from the Stanford EHR data set, the best embedding according to AUROC and AUPRC was TF-IDF (Figs. 3, 4, Supplementary Figs. 1, 2). From Cohort 1 to Cohort 2, the scalability of the models was demonstrated: with more data, the prediction performance improved (Fig. 3, Supplementary Fig. 4). Additionally, the majority of the embedding models and classification models trained on the Stanford EHR data set also showed transferability when applied to the MIMIC-III data set (Table 3).

Although direct benchmarking cannot be done due to differences in the prevalence of ICD-10 codes and in the data sets selected in this study, the simple word vectorization models and LR showed good predictive performance (Fig. 3, Table 3) while maintaining interpretability, and could therefore contribute to diagnostic code prediction and quality control for clinicians.

Next, the false positive case analysis showed that some of the false positive predictions might be correct and could be applied to impute potentially missing codes in records without I-codes entered by clinicians (Table 2). Such false positive predictions might not be wrong but simply missing. However, among the false positive cases, we also observed that certain mistakes were caused by negation and by past medical and family history. Because the best model, TF-IDF, is word-based, modeling the contents of the free text by each individual word, these issues cannot be directly detected by the TF-IDF model. Therefore, to impute missing I-codes, the proposed classifiers could be used to complete records in combination with additional methods that assert negation, temporality and who the experiencer is.

More generally, an important use case of this work is to impute ICD-10 codes from unstructured free text, as diagnostic codes rich in clinical information can be missing and the recorded diagnostic codes may also be inaccurate. […]

Secondly, in this study, after the first step of data processing, 63.2% of the notes were removed because they either lacked a diagnostic code or were shorter than 60 words. We used these dropped notes to train the doc2vec embedding models. The unlabeled notes might still contain meaningful information for classification, and methods such as semi-supervised learning [29] and conformal prediction [30, 31] might hold potential to make use of these unlabeled data, which could further improve the prediction performance. Thirdly, this work focused on the prediction of ICD-10 codes, and the structured codes were not tested in downstream tasks such as phenotyping or outcome prediction with machine learning. This work might help such subsequent prediction tasks; for example, the structured diagnostic codes derived from clinical notes can be combined with other data sources, including imaging data, genomics data and laboratory test data, in data fusion tasks to predict prognosis, patient outcome and disease subtypes [32, 33, 34, 35].

References

[1] X. Wei, C. Eickhoff, Embedding electronic health records for clinical information retrieval, arXiv preprint arXiv:1811.05402.
[8] L. Kuhn, C. Eickhoff, Implicit negative feedback in clinical information retrieval, arXiv preprint arXiv:1607.03296.
[23] L. van der Maaten, G. Hinton, Visualizing data using t-SNE, Journal of Machine Learning Research 9 (2008) 2579–2605.
[24] D. R. Cox, The regression analysis of binary sequences, Journal of the Royal Statistical Society: Series B 20 (2) (1958) 215–242.
[25] F. Pedregosa, et al., Scikit-learn: machine learning in Python, Journal of Machine Learning Research 12 (2011) 2825–2830.
[26] B. Efron, Bootstrap methods: another look at the jackknife, The Annals of Statistics 7 (1) (1979) 1–26.
[27] A. E. W. Johnson, et al., MIMIC-III, a freely accessible critical care database, Scientific Data 3 (2016) 160035.
[28] The International Classification of Diseases: Ninth Revision (ICD-9).
[29] Semi-supervised learning with penalized probabilistic clustering.
[32] A. Cheerla, O. Gevaert, Deep learning with multimodal representation for pancancer prognosis prediction, Bioinformatics 35 (14) (2019) i446–i454.
[33] Development and validation of radiomic signatures of head and neck squamous cell carcinoma molecular features and subtypes.
[34] P. Mukherjee, et al., A shallow convolutional neural network predicts prognosis of lung cancer patients in multi-institutional computed tomography image datasets, Nature Machine Intelligence 2 (2020).
[35] CT-based rapid triage of COVID-19 patients: risk prediction and progression estimation of ICU admission, mechanical ventilation …

Data availability

The data used in this study are not shareable as they concern patient information.

Acknowledgments

This research used data or services provided by STARR, the STAnford medicine Research data Repository, a clinical data warehouse containing live Epic data from Stanford Health Care (SHC), the University Healthcare Alliance (UHA) and Packard Children's Health Alliance (PCHA) clinics and other auxiliary …