key: cord-0209796-e503oyqf authors: Caicedo-Torres, William; Gutierrez, Jairo title: ISeeU2: Visually Interpretable ICU mortality prediction using deep learning and free-text medical notes date: 2020-05-19 journal: nan DOI: nan sha: 43180f7a446ea362f009a9e0fb03baa32c93142d doc_id: 209796 cord_uid: e503oyqf
Accurate mortality prediction allows Intensive Care Units (ICUs) to adequately benchmark clinical practice and identify patients with unexpected outcomes. Traditionally, simple statistical models have been used to assess patient death risk, many times with sub-optimal performance. On the other hand, deep learning holds promise to positively impact clinical practice by leveraging medical data to assist diagnosis and prediction, including mortality prediction. However, as the question of whether powerful Deep Learning models attend to correlations backed by sound medical knowledge when generating predictions remains open, additional interpretability tools are needed to foster trust and encourage the use of AI by clinicians. In this work we show a Deep Learning model trained on MIMIC-III to predict mortality using raw nursing notes, together with visual explanations for word importance. Our model reaches a ROC AUC of 0.8629 (+/-0.0058), outperforming the traditional SAPS-II score and providing enhanced interpretability when compared with similar Deep Learning approaches.
Intensive Care Units (ICUs) are the last line of defense against critical conditions that require constant monitoring and advanced medical support. Their importance has been highlighted in recent times, when ICUs around the world have been overrun by the COVID-19 pandemic [1, 2]. It is in times like these that research into ways to adequately manage scarce critical care resources must be pursued even more vigorously, in order to offer additional tools that support medical decisions and allow for the effective benchmarking of clinical practice.
The issue of mortality prediction in the ICU has been approached from a statistical standpoint by means of risk prediction models like APACHE, SAPS, and MODS, among others [3]. These models use a set of physiological predictors, demographic factors, and the occurrence of certain chronic conditions to estimate a score that serves as a proxy for the likelihood of death of ICU patients. Because their results are relatively straightforward to interpret, simple statistical approaches such as logistic regression are the go-to modeling techniques used to estimate mortality probability and the importance of the predictors involved. On the other hand, the simplicity of these models also means that their limited expressiveness may not accurately represent the possibly non-linear dynamics of mortality prediction. Given this, high-capacity machine learning models might be useful to increase predictive performance. Concretely, the relevant literature shows that deep learning models trained on physiological time-series data can outperform the previously mentioned statistical models [4, 5].
One of the advantages of deep learning over other techniques is its ability to use multiple modes of data to train predictive models. In the biomedical domain, health records, images, and time-series data have been used for different tasks with success [6, 7]. This advantage is relevant for mortality prediction (and for many other clinical tasks as well), as a substantial amount of data is generated inside ICUs as free-text notes, which can be used as input to create Natural Language Processing (NLP) predictive models. The nature of NLP poses some challenges for which deep learning is uniquely suited via its ability to deal with high-dimensional data and its elegant way of taking temporal and spatial patterns into account.
Some works have used deep neural networks and free text to predict mortality [8] and length of stay (among others), showing that there is interesting potential for this type of model. On the other hand, a particularly important downside of deep learning is that, compared to the simpler logistic regression based models, feature importance is not as readily available. This in turn makes these models hard to interpret, as internally the model may transform the original input features to high-dimensional spaces via non-linear transformations, making it hard to establish the impact of each predictor on the predicted outcome. It has been documented that, given their large predictive capacity, deep learning models can easily fit spurious correlations in the datasets used for their training, leading to potential diagnostic issues [9]. However, some work has been done to interpret deep learning models in order to offer explanations intended to foster trust and further encourage their usage in the critical care setting. For instance, in our previous work we developed an interpretable deep learning mortality prediction model that uses physiological time-series data from the first 48 hours of patient ICU stay [5].
In this work, we present ISeeU2, a deep learning model that uses free-text medical notes from the first 48 hours of stay to predict patient mortality in the ICU. We use the MIMIC-III database [10] to train a convolutional neural network (ConvNet) that is able to use raw nursing notes with minimal preprocessing to efficiently generate a prediction, and we couple the prediction of mortality with word importance and sentence importance visualizations, in a way that annotates the original medical note to show which parts of it are more predictive of death or survival, according to the model.
In the past some works have used deep learning to predict ICU mortality using free text.
Grnarova et al [8] proposed the use of a convolutional neural network for ICU mortality prediction using free-text medical notes from MIMIC-III. They used all medical notes from each patient stay to predict mortality, and trained their model using a custom loss function that also included a cross-entropy term for mortality prediction at the sentence level, with promising results. Jo et al [11] used a hybrid Latent Dirichlet Allocation (LDA) + Long Short-Term Memory (LSTM) model for ICU mortality prediction trained on medical notes from MIMIC-III, in which the LSTM used the LDA topic features as input. Sushil et al [12] used stacked denoising autoencoders to create patient representations out of medical free-text notes, to be used for downstream tasks such as mortality prediction. Si et al [13] proposed the use of a ConvNet for multitask prediction (mortality, length of stay), using all available patient medical notes up until the time of discharge. Jin et al [14] proposed a multimodal neural network architecture and a Named Entity Recognition (NER) text pre-processing pipeline to predict in-hospital ICU mortality using all available types of free-text notes and a set of vital signs and lab results from the first 48 hours of patient stay, extracted from MIMIC-III.
Most of these works include some ad-hoc interpretability mechanism: Grnarova et al [8] included a sentence-based mortality prediction target, which is then used to score individual words according to their associated predicted mortality probability; Jo et al [11] used LDA-computed weights to provide word importance; and Sushil et al [12] used a gradient-based interpretability approach to compute the importance of words in the input notes.
Our work has key differences relative to the related literature.
As opposed to [8, 13], we only use notes from the first 48 hours of patient stay instead of all notes available up until the time of discharge/death, and as opposed to [14] we only use nursing notes and not the whole spectrum of notes available in MIMIC-III. Also, from an interpretability standpoint, we rely on a theoretically sound concept from coalitional game theory, known as the Shapley Value [17], instead of explainability heuristics. Finally, our visualization approach puts emphasis on presenting results in a way that is easily understood and useful for users.
The contributions of our work are summarized in the following:
• We present a model that is able to offer performance comparable to state of the art models that use physiological time series data, but only using raw nursing notes extracted from MIMIC-III.
• Our approach only uses data from the first 48 hours of patient stay, instead of using data from the entirety of the stay. That makes our model more usable in a real setting as a benchmark tool.
• Our approach to interpretability is based on a theoretically sound concept (the Shapley Value) and our visualizations provide a novel way to annotate clinical free-text notes to highlight the most informative parts for the prediction of mortality.
This paper is organized as follows: first we show the overall distribution of our patient cohort dataset and its corresponding distribution of medical free-text notes. Then we briefly describe our approach to interpretability using the Shapley Value, followed by a description of our convolutional architecture and experimental setting. Finally we present and discuss our results and end with our conclusions and suggested future work.
We used the Medical Information Mart for Intensive Care (MIMIC-III v1.4) to create a dataset for the training of our deep learning model.
MIMIC-III contains ICU records including vitals, laboratory, therapeutical and radiology reports, representing more than a decade of data from patients admitted to the ICUs of the Beth Israel Deaconess Medical Center in Boston, Massachusetts [10]. The median age of adult patients (those with age > 16y) is 65.8 years, and the median length of stay (LoS) for ICU patients is 2.1 days (Q1-Q3:
Our patient cohort was created using the following criteria: only stays longer than 48 hours were considered; in cases where patients were admitted multiple times to the ICU, only the first admission was considered; and patients had to have at least one free-text note recorded during their ICU stay. These criteria lead to a sample with n = 21415. Table 1 shows the different types of medical notes included in our dataset together with their respective counts. Given that a substantial number of patients in our dataset were missing more than one type of medical note, and that the nursing and nursing/other types were the most prevalent ones, we decided to only include patients that had some type of nursing note available (nursing, nursing/other), with no regard to the note word count. This reduced our patient sample to n = 16970, with 1659 recorded deaths (9.78%) and 15311 patients that survived (90.22%). The mean note length is 1252.59 words, with a standard deviation of 1087.48.
Our prediction model, called ISeeU2, is a convolutional neural network (ConvNet). ConvNets are a specialized neural network architecture that exploits the convolution operator and spatial pooling operations to detect local patterns and reduce input dimensionality, in order to learn a representation that is useful for predictive purposes [15]. ConvNets are used primarily and extensively for computer vision, but have found application in Natural Language Processing as well, given their ability to deal with patterns occurring at different scales in sequential inputs [16, 8].
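To make the convolution and pooling operations mentioned above concrete, the following is a minimal illustrative sketch in plain Python; it is not the ISeeU2 implementation. It applies a single hand-picked 1D kernel to a toy scalar sequence and then max-pools the result. The kernel values, the toy signal, and the function names `conv1d_relu` and `max_pool1d` are assumptions made for illustration only (in a trained ConvNet the kernels are learned, and the inputs are word-embedding vectors rather than scalars).

```python
def conv1d_relu(sequence, kernel):
    """Slide the kernel over the sequence (stride 1, no padding), then apply ReLU."""
    width = len(kernel)
    outputs = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        activation = sum(x * w for x, w in zip(window, kernel))
        outputs.append(max(0.0, activation))  # ReLU keeps only positive responses
    return outputs


def max_pool1d(sequence, pool_size, stride):
    """Keep the largest value inside each pooling window."""
    last_start = len(sequence) - pool_size
    return [max(sequence[i:i + pool_size]) for i in range(0, last_start + 1, stride)]


# Toy input with two local peaks (illustrative values only).
signal = [0.0, 1.0, 3.0, 1.0, 0.0, 0.0, 2.0, 0.0]
# Hand-picked kernel that responds when a value exceeds its two neighbours.
kernel = [-1.0, 2.0, -1.0]

features = conv1d_relu(signal, kernel)
pooled = max_pool1d(features, pool_size=2, stride=2)
print(features)
print(pooled)
```

In this toy case the kernel fires only at the two local peaks, and pooling shortens the feature sequence from six values to three while preserving those responses, which is the kind of local pattern detection and dimensionality reduction referred to above.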
The specific architecture of our model (Figure 4) includes a text embedding layer to convert a bag of words text representation into 10-dimensional dense word vectors. The output of the embedding layer is then fed to a convolutional layer with 32 channels and a kernel size of 5x10 (stride 1), followed by ReLU activations and a max-pooling layer with a pool size of 1x3 (stride 1). The obtained representation is then fed to a 50x1 dense layer with ReLU activations connected to a one-neuron final layer with sigmoid activation, which computes the mortality probability.
One argument that is used routinely against deep learning is its reduced interpretability when compared to other modeling techniques such as logistic regression [9]. In order to overcome that potential limitation we use the Shapley Value [17]. For a set of players N and a value function v, the Shapley Value of player i is
$$\phi_i(v) = \sum_{S \subseteq N \setminus \{i\}} \frac{|S|!\,(|N| - |S| - 1)!}{|N|!} \left( v(S \cup \{i\}) - v(S) \right)$$
The summation is taken over all possible subsets S ⊆ N that don't include player i. We approximate these values using DeepLIFT [19], as implemented in [20]. DeepLIFT is an algorithm specifically designed to compute feature importance in feed-forward neural networks. DeepLIFT overcomes the issues associated with competing methods such as Layerwise Relevance Propagation [19] and gradient-based attribution [21, 22], i.e. saturation, overlooking negative contributions, and gradient discontinuities [19]. DeepLIFT computes feature importance by comparing the network output to a reference output obtained by feeding the network with a designated input. The difference in outputs is back-propagated through the different layers of the network until the input layer is reached and feature importances are fully computed. A more detailed treatment of DeepLIFT in the context of interpreting deep learning models for critical care prognosis can be found in [5].
Our ConvNet was built using TensorFlow [23]. Since our dataset is highly unbalanced (patients who died represent just 9.78% of training examples), we used a weighted logarithmic loss assigning more importance to the positive class, i.e. patients that died in the ICU.
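As a sketch of what a class-weighted logarithmic loss looks like, the snippet below scales the positive-class term of the binary cross-entropy. The weight actually used to train ISeeU2 is not reported in the text; the inverse-frequency ratio computed here from the cohort counts (15311 survivors, 1659 deaths) is an assumption for illustration only, and the function name `weighted_log_loss` is hypothetical.

```python
import math


def weighted_log_loss(y_true, y_prob, pos_weight, eps=1e-7):
    """Mean binary cross-entropy with the positive-class term scaled by pos_weight."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total -= pos_weight * y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(y_true)


# Assumed weighting scheme: ratio of survivors to deaths in the cohort.
survivors, deaths = 15311, 1659
pos_weight = survivors / deaths

# A confidently missed death costs more than an equally wrong call on a survivor.
missed_death = weighted_log_loss([1], [0.1], pos_weight)
false_alarm = weighted_log_loss([0], [0.9], pos_weight)
print(round(pos_weight, 3), missed_death > false_alarm)
```

With such a weight, errors on the minority (death) class contribute roughly nine times more to the loss than equally confident errors on survivors, which counteracts the tendency of an unweighted loss to favour the majority class.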
We used 5-fold cross-validation to assess the model performance and place a confidence estimate on it. We did not perform any substantial hyperparameter optimization other than conservatively varying the number of channels of the convolutional layer and the number of neurons of the first fully connected layer of the network. Our choice of optimizer was Adam [24] with the default parameters provided by TensorFlow. Our model was trained for three epochs per training fold, and we kept the lowest-loss model of each run.
One of our goals is to show a deep learning model that needs little to no input pre-processing, in order for it to be as widely applicable as possible. In keeping with that, we used the NLTK library [25] to remove English stopwords and the default tensorflow.keras tokenizer to vectorize the text notes, keeping the 100k most frequent words; no further pre-processing was attempted. The tokenizer was fitted only on the training folds to avoid data leakage. Finally, we set the maximum note length to 500 words, so notes with a larger word count were truncated at the beginning and those with a smaller word count were padded at the beginning with zeroes.
Using this configuration we obtained a 5-fold cross-validation Receiver Operating Characteristic Area Under the Curve (ROC AUC) of 0.8629 (±0.0058), as seen in Figure 5. Using a 0.5 decision threshold, the model reaches 72% sensitivity at 83% specificity.
We also provide some baseline models to compare with our proposed model and better assess its performance. Concretely, we have included results for a traditionally used mortality risk score and a recurrent neural network.
As a baseline, we used a well-established ICU mortality risk score, SAPS-II [26]. SAPS-II uses data from the first 24 hours of ICU stay to calculate a numerical score, which in turn is converted into a mortality probability.
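The conversion from a SAPS-II score to a mortality probability can be sketched as below, assuming the logistic equation commonly cited from the original SAPS-II publication [26]. The coefficients are quoted from that commonly cited form and should be checked against the original paper; the function name `saps2_mortality_probability` is hypothetical, and this is not necessarily how the MIMIC code repository implementation [27] used in this work is structured.

```python
import math


def saps2_mortality_probability(score):
    """Map a SAPS-II score to a predicted hospital mortality probability.

    Assumed form: logit = -7.7631 + 0.0737 * score + 0.9971 * ln(score + 1).
    """
    logit = -7.7631 + 0.0737 * score + 0.9971 * math.log(score + 1)
    return 1.0 / (1.0 + math.exp(-logit))


# Illustrative scores only; they are not taken from the cohort in this paper.
for example_score in (10, 29, 50, 80):
    print(example_score, round(saps2_mortality_probability(example_score), 3))
```

Under this assumed equation the probability increases monotonically with the score, so ranking patients by score or by probability yields the same ROC curve; the conversion only matters when calibrated probabilities are needed.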
In order to compare our approach with SAPS-II predictions and performance, we trained our convolutional architecture using nursing notes from the first 24 hours only, while keeping training parameters the same. We used the SAPS-II implementation provided by the authors of the MIMIC-III code repository [27]. The 24-hour version of our model obtained a 0.8155 (±0.0102) ROC AUC 5-fold cross-validation score, against 0.7448 (±0.0117) for the SAPS-II model. Figures 6 and 7 show the corresponding ROC plots for the two models.
Long Short-Term Memory (LSTM). LSTM is a neural network model designed to handle sequential input data with temporal dependencies [28], and it has been used extensively in Natural Language Processing tasks. We trained a deep neural network with a bidirectional LSTM layer with 100 units, followed by an extra 100-unit LSTM layer, a 50-unit dense layer with ReLU activation, and a final sigmoid layer. As was the case for our original convolutional model, an embedding layer was used to create 10-dimensional dense vectors to feed the initial layer of the LSTM, and the same text preprocessing pipeline was used (save for a now 1000-word maximum note length). Finally, dropout with probability 0.5 was applied to control overfitting. With this particular architecture we were able to obtain a 0.7839 (±0.0076) ROC AUC 5-fold cross-validation score (Figure 8).
Using the DeepLIFT implementation provided by [20], which works appropriately with TensorFlow 2 models, we calculated word importances for our model, using the empirical mean of the input embedding vectors as the reference value. Using these values we designed and built visualizations to show (Figures 9 and 10). Word clouds are an interesting way to visualize words and their importance at the same time, but they don't capture the context in which words live, potentially leading to erroneous interpretations.
For example, the survival word cloud in Figure 9 shows melena as associated with survival, which is not readily understandable. However, when the word cloud is combined with the note heatmap, the reason becomes apparent given the context of the word (stable present wo melena stool). We also observe that certain phrases and words are flagged intuitively, e.g. guaic pos heme, and also the fact that for this particular patient occurrences of Plavix/Clopidogrel in the
Note length and mortality probability. High-capacity machine learning models such as deep neural networks have the ability to leverage subtle correlations and patterns to attain very low training error in learning tasks. As shown in Table 2 and Figure 1, there is a difference in our sample between the mean note length of patients who survived and that of those who had a negative outcome. A Mann-Whitney U test supplied further evidence, as we were able to reject the null hypothesis. Having established that, we decided to investigate whether our model was somehow attending to that difference in distributions. For this purpose we inspected the importance scores of the padding characters used by our preprocessing pipeline. Most of them were regarded as evidence for survival, which is consistent with our original conjecture that the model treats shorter notes as correlated with a survival outcome (shorter notes have more padding characters). Figure 11 shows the distribution of approximate Shapley Values for padding characters.
Our convolutional model shows interesting performance on the MIMIC-III dataset, with consistent results across validation folds, showing evidence
Our results are not directly comparable to those published by Grnarova et al [8], given that we restricted our input window to the first 48 hours of patient stay instead of using all available notes up until the time of discharge.
Results published by Jo et al [11] show their models performing under 0.84 ROC AUC for mortality prediction using MIMIC-III data (48 hour mark), which is below our results here. On the other hand, the model Vital + EntityEmb reported by [14] uses physiological data and a substantial text preprocessing pipeline that involves a second neural network for Named Entity Recognition.
Limitations of our study include the fact that we do not have access to some pre-admission data, and that we are using a retrospective, single-center cohort. Also, given the moderate size of our dataset, we are only reporting cross-validation results without a proper test set result. An additional limitation is that high-quality nursing notes may not be available for a substantial number of patients in other critical care settings, which could hurt the performance of our model. Finally, the common misspellings and other noise present in the medical notes may affect the quality of the explanations, giving rise to counterintuitive results.
In future work we intend to investigate the usage of a more robust preprocessing pipeline, and to assess whether there is any performance improvement attributable to its usage. We also intend to evaluate how our approach fares in a situation where limited-quality notes are the only training data available. Finally, we plan to explore the joint usage of physiological time series data and free-text medical notes to train a multi-modal deep learning model and compare its performance with our current approach.
In this paper we have presented ISeeU2, a convolutional neural network for the prediction of mortality using free-text nursing notes from MIMIC-III.
We showed that our model is able to offer performance competitive with that of much more complex models with little text pre-processing, while at the same time providing visual explanations of feature importance based on coalitional game theory that allow users to gain insight into the reasons behind predicted outcomes. Our visualizations also provide a way to annotate free-text medical notes with markers to flag parts correlated with predictions of survival and death. We have also shown that nursing notes could be rich enough to capture the concepts needed for mortality prediction at a level of accuracy higher than what we obtained with a traditional statistical score such as SAPS-II.
Critical Care Utilization for the COVID-19 Outbreak in
Fair Allocation of Scarce Medical Resources in the Time of Covid-19
Scoring systems in the intensive care unit: A compendium
Benchmarking Deep Learning Models on Large Healthcare Datasets
ISeeU: Visually interpretable deep learning for mortality prediction inside the ICU
Deep Learning in Medical Image Analysis
Deep EHR: A Survey of Recent Advances in Deep Learning Techniques for Electronic Health Record (EHR) Analysis
Neural Document Embeddings for Intensive Care Patient Mortality Prediction
An evaluation of machine-learning methods for predicting pneumonia mortality
MIMIC-III, a freely accessible critical care database
Combining LSTM and Latent Topic Modeling for Mortality Prediction
Patient representation learning and interpretable evaluation using clinical notes
Deep Patient Representation of Clinical Notes via Multi-Task Learning for Mortality Prediction
Kass-hout, Improving Hospital Mortality Prediction with
Gradient-Based Learning Applied to Document Recognition
Deep Learning
A Value for n-Person Games, Contributions to the Theory of Games II
An Efficient Explanation of Individual Classifications using Game Theory
Learning Important Features Through Propagating Activation Differences
A unified approach to interpreting model predictions
Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps
Striving for Simplicity: The All Convolutional Net
TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems
Adam: A Method for Stochastic Optimization
A New Simplified Acute Physiology Score (SAPS II
The MIMIC Code Repository: Enabling reproducibility in critical care research
Long Short-Term Memory
The Mythos of Model Interpretability, ICML Workshop on Human Interpretability in Machine Learning
Recurrent Neural Networks for Multivariate Time Series with Missing Values
Directly Modeling Missing Data in Sequences with RNNs: Improved Classification of Clinical Time Series
Temporal convolutional neural networks for diagnosis lab tests