key: cord-0444628-dmmcxndf
authors: Ozyurt, Yilmazcan; Kraus, Mathias; Hatt, Tobias; Feuerriegel, Stefan
title: AttDMM: An Attentive Deep Markov Model for Risk Scoring in Intensive Care Units
date: 2021-02-09
journal: nan
DOI: nan
sha: 9532f5aa454a7a6f5e5b5169b87abec277d1a7f1
doc_id: 444628
cord_uid: dmmcxndf

Clinical practice in intensive care units (ICUs) requires early warnings when a patient's condition is about to deteriorate so that preventive measures can be undertaken. To this end, prediction algorithms have been developed that estimate the risk of mortality in ICUs. In this work, we propose a novel generative deep probabilistic model for real-time risk scoring in ICUs. Specifically, we develop an attentive deep Markov model called AttDMM. To the best of our knowledge, AttDMM is the first ICU prediction model that jointly learns both long-term disease dynamics (via attention) and different disease states in the health trajectory (via a latent variable model). Our evaluations were based on an established baseline dataset (MIMIC-III) with 53,423 ICU stays. The results confirm that, compared to state-of-the-art baselines, our AttDMM was superior: AttDMM achieved an area under the receiver operating characteristic curve (AUROC) of 0.876, which yielded an improvement over the state-of-the-art method by 2.2%. In addition, the risk score from AttDMM provided warnings several hours earlier. Thereby, our model shows a path towards identifying patients at risk so that health practitioners can intervene early and save patient lives.

Intensive care units (ICUs) provide healthcare to patients with severe or life-threatening illnesses. ICUs receive patients directly from an emergency unit, from other wards if their condition deteriorates rapidly, or after surgery. In ICUs, patients require constant care to ensure normal body functions. Yet, due to the severity of the underlying illnesses, their health trajectories cannot always be stabilized. Owing to this, mortality rates in ICUs are among the highest across all hospital units [34] and are estimated to be 8-19 % [36].

To ensure normal body functions, patients in ICUs are subject to extensive monitoring [21]. Examples are the monitoring of body temperature, heart rate, and blood pressure. Based on this, clinical professionals determine a risk score that quantifies the probability of in-hospital mortality. The risk score is crucial for decision-making in ICU practice as it guides treatment plans [5, 26, 37]. In addition, it provides early warnings when a health condition is about to deteriorate so that preventive measures can be taken.

The clinical literature documents several methods for risk scoring in ICUs that are nowadays widely used in clinical practice. Among the most widely applied is the simplified acute physiology score (SAPS) [32, 35], which assesses the severity of the health condition as defined by the probability of patient in-hospital mortality. For this, the risk score makes use of measurements that indicate vital signs, such as body temperature, heart rate, and blood pressure. However, the aforementioned risk scores are computed through overly simple decision rules that operate only on a few measurements from selected timestamps. In other words, the complete time-series of measurements is ignored, and because of this, the prediction power concerning patient mortality is limited. Recent works have addressed mortality prediction in ICUs through the use of machine learning.
On the one hand, neural networks have been used so that long-term temporal dependencies are captured. Examples of neural networks that have been adapted for ICU predictions are long short-term memory (LSTM) [15, 44, 52] and gated recurrent unit (GRU) [6, 10]. These models have been powerful in representing complex interactions among (high-dimensional) measurements and thus represent the state of the art. On the other hand, latent variable models have been used for ICU prediction [16, 17]. In this case, the latent variables allow latent disease states in the health trajectory to be captured. However, we are not aware of any previous works that have combined the strengths of neural networks and latent variables into a joint model for predicting ICU mortality.

Proposed model: In this work, we propose a novel generative deep probabilistic model for predicting mortality risk in ICUs. Specifically, we develop an attentive deep Markov model called AttDMM. Based on this, our model allows us to jointly capture (1) long-term disease dynamics (via attention) and (2) different disease states in the health trajectory (via a latent variable model). To the best of our knowledge, our model is the first combination of a deep Markov model with an attention mechanism. In addition, AttDMM is closely aligned with the needs of clinical practice by providing a confidence interval for the real-time risk score, which further facilitates decision-making. Finally, we show how to estimate AttDMM via an end-to-end training task with a tailored evidence lower bound (ELBO).

Results: Our AttDMM was evaluated on an established baseline dataset from clinical practice, MIMIC-III [21], comprising 53,423 ICU stays. For each ICU stay, we evaluated the performance across two prediction tasks. First, we predicted the mortality risk from measurements spanning the first 48 hours after ICU admission. This prediction task is analogous to that of risk scoring from the clinical literature [e. g., 20, 32, 35, 53] and thus facilitates comparability with prior literature. In this prediction task, state-of-the-art baselines were consistently outperformed. Compared to the baselines, the proposed AttDMM achieved a performance improvement in the area under the receiver operating characteristic curve (AUROC) of 2.2 % and in the area under the precision-recall curve (AUPRC) of 2.4 %. On top of that, our model achieved the same AUROC (AUPRC) performance as the best baseline 12 hours (6 hours) earlier. Second, we assessed a prediction task in which we make long-term forecasts of mortality risk. This prediction task is demanded by clinical practice, as many conditions remain stable for a fairly long time window but afterwards deteriorate suddenly. We found that AttDMM outperforms state-of-the-art baselines in terms of AUROC by 2.2 % and AUPRC, remarkably, by 5.4 %.

Contributions: Our work advances machine learning for ICU predictions in the following ways: […]

Prediction models in healthcare differ, for instance, in their inputs, such as […] and patient sociodemographics (often termed "risk factors" in a clinical context). The underlying prediction models also differ, for instance, in whether they handle static and/or time-series data, warrant interpretability, and model prediction intervals (i. e., whether they output confidence intervals to assess the uncertainty around point estimates) [e. g., 8, 28, 51]. The use of machine learning in healthcare allows clinical practitioners to obtain predictions about the current (and future) health condition of a patient.
Based on this, clinical practitioners can adapt their treatment plans accordingly (e. g., by planning preventive interventions or choosing different treatments).

Risk prediction in ICUs: Machine learning in ICU settings has different objectives, such as predicting adverse events like sepsis [e. g., 22, 23, 40, 41, 48], while a predominant focus in the literature is on predicting mortality risk. For this, one uses various measurements of vital signs, such as body temperature, heart rate, and blood pressure. They are widely regarded as important indicators that describe a patient's health status in critical care [14, 25]. In practice, there is variability in which measurements are recorded for a specific patient, depending on her condition and reason for hospitalization. For instance, for some diseases, sodium and potassium levels are highly indicative of the future course, while for COVID-19, a focus might be placed on measuring respiratory-related thoracic movements. As such, some measurements might not be subject to recording (due to the setting) and, hence, are missing for the complete (or a partial) ICU stay. Hence, the missingness of ICU measurements must be appropriately modeled in the context of ICU prediction [e. g., 6].

Clinical practice in ICU risk scoring: In clinical practice, predictions of mortality risk are computed through simple decision rules. A key benefit of decision rules from clinical practice is that they directly output a dynamic risk score. Common examples are the so-called simplified acute physiology score (SAPS) [32, 35], the acute physiology and chronic health evaluation (APACHE) [53], and the mortality probability model (MPM) [20]. To allow for straightforward use, the decision rules rely only upon a few sensor measurements and, in particular, additional characteristics of patients (i. e., risk factors) are not considered. In essence, these decision rules determine the mortality risk through a linear combination of different vital signs. However, only the most critical value from the last 48 hours is considered, rather than the complete time-series of vital signs. Hence, the rich information embedded in high-resolution measurements is, to a large extent, ignored, which limits the prediction power.

Machine learning for ICU mortality risk scoring: Recent works have approached ICU mortality prediction using machine learning. For this, several sequential neural networks have been adapted specifically to ICU settings, namely long short-term memory networks [15, 18] and gated recurrent units [6, 10, 43]. For ICU predictions, state-of-the-art sequential networks are designed to capture long-term dependencies in high-resolution time-series. However, existing machine learning does not explicitly model different disease states in the health trajectory, which would require a latent variable approach.

Latent variable models have the ability to account for the fact that patient health trajectories undergo different disease states (e. g., acute, stable). These models explicitly consider the fact that such disease states cannot be directly observed and must thus be treated as latent [e. g., 4, 12, 42]. In principle, such models assume that the relation between health measurements and the disease states is stochastic. A naïve example is a hidden Markov model. However, these models are generally made for modeling a sequence of observations rather than for classification tasks.
To adapt latent variable models for prediction tasks, some works [1, 2, 49] encode the risk into the latent variable and adopt sequential hypothesis testing [46]. Others [16, 17] feed latent variables into another prediction algorithm, while yet others [30, 31, 50] match discrete latent variables to risk profiles. However, these approaches have only a limited ability to capture long-term temporal dynamics, rely on discrete latent variables, and assume linearity in the transition and emission components. The deep Markov model (DMM) [24] overcomes the above limitations by introducing continuous latent variables, non-linear transition networks, and non-linear emission networks. However, the DMM presents an unsupervised framework, because of which an application of the DMM to the prediction of mortality risk is precluded. Instead, a new model is needed, and for this reason, we develop an attentive deep Markov model to address the above-mentioned prediction tasks.

Research gap: To the best of our knowledge, no prior work has combined the strengths of neural networks and latent variable approaches for the purpose of ICU prediction. To fill this gap, we propose a novel attentive deep Markov model called AttDMM.

In the following, we develop AttDMM, which processes ICU stays in order to predict the ICU mortality label $y \in \{0, 1\}$. A label $y = 0$ denotes a discharge from the ICU, whereas $y = 1$ denotes that the patient has died during the ICU stay. The inputs to our model are ICU stays. Each stay is represented by static and time-series features: (1) Static features are denoted by $s$. These encode potential risk factors at the patient level (e. g., gender, age) and thus describe the between-patient heterogeneity. (2) Time-series features are denoted by $\{x_t\}_{t=1}^{T}$. These encode a high-resolution time-series with various measurements for each time step $t = 1, \dots, T$ throughout the ICU stay (e. g., in MIMIC-III, they are sampled at a 2-hour resolution). Some measurements might not be subject to recording and, therefore, are missing for the complete (or a partial) ICU stay. To address this issue, we first decompose $x_t$ into $\{x_t^j\}_{j=1}^{M}$, where $M$ is the number of features (i. e., types of measurements). Then, for each $x_t^j$, we derive the mask variable $m_t^j$ that denotes whether $x_t^j$ will be imputed ($m_t^j = 0$) or not ($m_t^j = 1$).

AttDMM is a generative deep probabilistic model for predicting mortality risk in ICUs. The model takes lab measurements recorded during an ICU stay as input. The measurements are processed to capture the latent disease state in a patient trajectory. This is achieved via a latent variable model which captures the latent disease state at time $t$ by a latent variable $z_t$. Based on the series of latent variables, our model infers the ICU mortality risk of a patient. This is achieved via an attention mechanism.

AttDMM has four model components, which are depicted in Figure 1. A (1) transition network specifies the transition probability among consecutive latent variables (i. e., $z_{t-1}$ and $z_t$). It further accommodates between-patient heterogeneity. An (2) emission network models the probability of a lab measurement given the latent variable from a certain time step. In this case, it adheres to the assumption that the lab measurements are stochastically linked to the latent variables. The combination of the transition network and emission network represents the generative part of the model, which is used as an ELBO regularization in the loss function. An (3) attention network outputs a summary representation $\tilde{z}$ of the patient trajectory based on all latent variables $z_{1:T}$. Finally, a (4) predictor network takes this summary representation as input and predicts the ICU mortality risk of a patient.
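To make the input representation above concrete, the following minimal Python sketch (not the authors' code) shows how one ICU stay could be encoded as static features $s$, time-series measurements $x_t$ on a 2-hour grid, and observation masks $m_t^j$. The array shapes, the random values, and the zero placeholder for unobserved entries are purely illustrative assumptions.

```python
import numpy as np

T, M = 24, 17                       # illustrative: 48 hours at a 2-hour resolution, 17 measurement types
rng = np.random.default_rng(0)

x_raw = rng.normal(size=(T, M))                   # time-series measurements x_t with gaps
x_raw[rng.random((T, M)) < 0.3] = np.nan          # simulate measurements that were never recorded

m = (~np.isnan(x_raw)).astype(np.float32)         # mask m_t^j: 1 = observed, 0 = imputed
x = np.nan_to_num(x_raw, nan=0.0)                 # placeholder fill; the paper instead uses
                                                  # forward-backward linear imputation, and imputed
                                                  # entries are masked out during training
s = np.array([65.0, 1.0, 0.0], dtype=np.float32)  # static risk factors, e.g., age and admission type
y = 1.0                                           # in-hospital mortality label: 0 = discharge, 1 = death
```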
In terms of notation, let ReLU denote the rectified linear unit, $\odot$ denote element-wise multiplication, and $[\,\cdot\,;\,\cdot\,]$ denote the concatenation of two vectors.

(1) Transition network: This component specifies the transition probability among consecutive latent variables. For this, we build upon the Markov property; i. e., the current latent variable only depends on the previous one. Further, it accommodates the between-patient heterogeneity by making use of the patient's static variable $s$. Formally, the transition network models the distribution of a latent variable $z_t$ conditional on the inputs $z_{t-1}$ and $s$, denoted by $p(z_t \mid z_{t-1}, s)$. This distribution is parametrized by a multivariate Gaussian distribution with a diagonal covariance, denoted by $\mathcal{N}(\mu_t, \Sigma_t)$, where $\Sigma_t$ is the covariance matrix with $\sigma_t^2$ on the diagonal and 0 otherwise. The mean $\mu_t$ is a sum of two terms: (1) a linear contribution of the inputs, denoted by $\bar{\mu}_t$, and (2) a non-linear contribution of the inputs, denoted by $\tilde{\mu}_t$. The weights of these two terms are controlled by a gate $g_t$. The diagonal $\sigma_t^2$ is derived from $\tilde{\mu}_t$ (plus a constant; see the note below). The mathematical formulation of the transition network is as follows:
$g_t = \mathrm{sigmoid}\big(W_g\, \mathrm{ReLU}(W_g' [z_{t-1}; s] + b_g') + b_g\big)$,
$\tilde{\mu}_t = \tilde{W}\, \mathrm{ReLU}(\tilde{W}' [z_{t-1}; s] + \tilde{b}') + \tilde{b}$,
$\bar{\mu}_t = \bar{W} [z_{t-1}; s] + \bar{b}$,
$\mu_t = (1 - g_t) \odot \bar{\mu}_t + g_t \odot \tilde{\mu}_t$,
$\sigma_t^2 = \mathrm{softplus}\big(W_\sigma\, \mathrm{ReLU}(\tilde{\mu}_t) + b_\sigma\big) + \mathrm{const.}$,
with matrices $W_g'$, $W_g$, $\tilde{W}'$, $\tilde{W}$, $\bar{W}$, and $W_\sigma$ and bias vectors $b_g'$, $b_g$, $\tilde{b}'$, $\tilde{b}$, $\bar{b}$, and $b_\sigma$.

(2) Emission network: The emission network outputs the probability of a lab measurement given the latent variable. Formally, the emission network models the distribution of $x_t$ given the latent variable $z_t$, denoted by $p(x_t \mid z_t)$. In this distribution, each feature of $x_t$ is modeled as conditionally independent of the others, i. e.,
$p(x_t \mid z_t) = \prod_{j=1}^{M} p(x_t^j \mid z_t)^{m_t^j}$,
where $m_t^j$ is leveraged to mask out unobserved entries. The distribution $p(x_t^j \mid z_t)$ is parametrized by a univariate Gaussian distribution, denoted by $\mathcal{N}(\nu_{t,j}, \varsigma_{t,j}^2)$, where $\nu_{t,j}$ and $\varsigma_{t,j}^2$ are the $j$-th entries of the vectors $\nu_t$ and $\varsigma_t^2$. With the same model assumptions as in the transition network, the mean $\nu_t$ is a sum of two terms: (1) a linear contribution of the input, denoted by $\bar{\nu}_t$, and (2) a non-linear contribution of the input, denoted by $\tilde{\nu}_t$. The weights of these two terms are controlled by a gate $e_t$. The variance $\varsigma_t^2$ is derived from $\tilde{\nu}_t$ (plus a constant as above). The mathematical formulation of the emission network is as follows:
$\tilde{\nu}_t = \tilde{V}\, \mathrm{ReLU}(\tilde{V}' z_t + \tilde{c}') + \tilde{c}$,
$\bar{\nu}_t = \bar{V} z_t + \bar{c}$,
$\nu_t = (1 - e_t) \odot \bar{\nu}_t + e_t \odot \tilde{\nu}_t$,
$\varsigma_t^2 = \mathrm{softplus}\big(V_\sigma\, \mathrm{ReLU}(\tilde{\nu}_t) + c_\sigma\big) + \mathrm{const.}$,
with matrices $\tilde{V}'$, $\tilde{V}$, $\bar{V}$, and $V_\sigma$ and bias vectors $\tilde{c}'$, $\tilde{c}$, $\bar{c}$, and $c_\sigma$, and with a gate $e_t$ defined analogously to $g_t$.

Note: A constant term is added to the diagonal covariance to ensure stability in the ELBO computation. This applies to all Gaussian distributions introduced in AttDMM.
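As an illustration of the gated parameterization shared by the transition and emission networks, the following PyTorch sketch mirrors the structure described above for the transition network: a linear and a non-linear mean contribution mixed by a gate, and a softplus variance derived from the non-linear term. It is a minimal sketch under assumed layer sizes and module names, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedTransition(nn.Module):
    """Sketch of p(z_t | z_{t-1}, s) = N(mu_t, diag(sigma_t^2)) with a gated mean."""

    def __init__(self, z_dim, s_dim, hidden_dim, min_var=1e-4):
        super().__init__()
        in_dim = z_dim + s_dim                       # input is the concatenation [z_{t-1}; s]
        self.gate = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.ReLU(),
                                  nn.Linear(hidden_dim, z_dim), nn.Sigmoid())
        self.nonlinear_mu = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.ReLU(),
                                          nn.Linear(hidden_dim, z_dim))
        self.linear_mu = nn.Linear(in_dim, z_dim)    # linear contribution of the inputs
        self.to_var = nn.Linear(z_dim, z_dim)        # variance derived from the non-linear mean
        self.min_var = min_var                       # constant added for numerical stability

    def forward(self, z_prev, s):
        inp = torch.cat([z_prev, s], dim=-1)
        g = self.gate(inp)                           # gate g_t in (0, 1)
        mu_tilde = self.nonlinear_mu(inp)            # non-linear mean contribution
        mu_bar = self.linear_mu(inp)                 # linear mean contribution
        mu = (1.0 - g) * mu_bar + g * mu_tilde       # gated combination of the two contributions
        var = F.softplus(self.to_var(F.relu(mu_tilde))) + self.min_var
        return mu, var                               # parameters of the diagonal Gaussian
```

An emission network could follow the same pattern, with $z_t$ alone as input and with output dimension equal to the number of measurement types $M$.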
(3) Attention network: The attention network allows the model to assign a different importance to each latent variable when making inferences of mortality risk. The importance is computed via the similarity between a query vector and the latent variable. Thereby, the attention network aggregates the latent variables based on their importance. The resulting vector is used as a summary representation of the patient health trajectory, which is fed into the predictor network. By going beyond the Markov property, this allows the model to capture long-term dependencies. Formally, the attention network outputs the aggregation $\tilde{z}$ of the latent variables based on the input sequence $z_{1:T}$. This is formalized via
$\tilde{z} = \sum_{t=1}^{T} \alpha_t\, z_t$ with $\alpha_t = \frac{\exp\big(\lambda\, K(q, W_a z_t + b_a)\big)}{\sum_{\tau=1}^{T} \exp\big(\lambda\, K(q, W_a z_\tau + b_a)\big)}$,
with matrix $W_a$, bias vector $b_a$, query vector $q$, and scalar $\lambda$, where $K(\cdot, \cdot)$ denotes the cosine similarity, i. e., $K(u, v) = \frac{u^\top v}{\lVert u \rVert\, \lVert v \rVert}$.

(4) Predictor network: The predictor network is the final step in making the inference of mortality risk. For this, it predicts $\hat{y}$ based on $\tilde{z}$ from the attention network. This yields
$\hat{y} = \mathrm{sigmoid}\big(W_2\, \mathrm{ReLU}(W_1 \tilde{z} + b_1) + b_2\big)$,
with matrices $W_1$ and $W_2$ and bias vectors $b_1$ and $b_2$. Here, $\hat{y}$ is the output of AttDMM.

The computational complexity present in AttDMM hinders the exact inference of the latent variables. Because of this, we develop a posterior approximation that builds upon stochastic variational inference to estimate the latent variables (see Figure 2). For AttDMM, we approximate the posterior distribution of the latent variables (denoted by $q$) based on the sequence of lab measurements and the static variables. It further makes use of the missingness of each lab measurement (i. e., the masks). Formally, the approximated posterior distribution of $z_t$ is based on the inputs $z_{t-1}$, $s$, $x_{t:T}$, and $m_{t:T}$. For this, we proceed as follows.

First, we use a recurrent neural network (RNN) for encoding the information carried by the future lab measurements and their missingness, i. e., a concatenation of $x_{t:T}$ and $m_{t:T}$. This yields the encoded information $h_t$. This is formalized via
$h_t = \tanh\big(W_h [x_t; m_t] + U_h h_{t+1} + b_h\big)$,
with matrices $W_h$ and $U_h$ and bias vector $b_h$.

Second, a combiner network is applied to the inputs $z_{t-1}$, $s$, and $h_t$. In this case, $z_{t-1}$ and $h_t$ encapsulate the past and future lab measurements. In addition, $s$ is leveraged to accommodate the between-patient heterogeneity in the posterior approximation. The combiner network models the approximated distribution of $z_t$, denoted by $q(z_t \mid z_{t-1}, s, x_{t:T}, m_{t:T})$. This distribution is parametrized by a multivariate Gaussian distribution with a diagonal covariance, denoted by $\mathcal{N}(\mu_t^q, \Sigma_t^q)$, where $\Sigma_t^q$ is the covariance matrix with $(\sigma_t^q)^2$ on the diagonal and 0 otherwise. The mathematical formulation of the combiner network is given by
$\bar{h}_t = \tfrac{1}{2}\big(\tanh(W_c [z_{t-1}; s] + b_c) + h_t\big)$,
$\mu_t^q = W_\mu \bar{h}_t + b_\mu$,
$(\sigma_t^q)^2 = \mathrm{softplus}(W_v \bar{h}_t + b_v) + \mathrm{const.}$,
with matrices $W_c$, $W_\mu$, and $W_v$ and bias vectors $b_c$, $b_\mu$, and $b_v$. Altogether, this approximates the posterior of the latent variables of AttDMM, which is leveraged to optimize the ELBO via stochastic variational inference. See Appendix A for the details of the estimation procedure.

AttDMM estimates the probability of in-hospital mortality at time $T$ as follows. The static features $s$, the lab measurements $x_{1:T}$, and the masking of measurements $m_{1:T}$ are fed into the posterior approximation network. Based on it, AttDMM constructs the posterior distribution $q(z_t \mid z_{t-1}, s, x_{t:T}, m_{t:T})$, from which the latent variables $z_{1:T}$ are sampled sequentially. This sampling procedure is repeated $S$ times, where the number $S$ is defined by the user. At the end of the sampling procedure, the set of sampled latent variables $\{z_{1:T}^{(i)}\}_{i=1}^{S}$ is processed sequentially by the attention network and the predictor network. AttDMM then produces samples of the in-hospital mortality prediction $\{\hat{y}^{(i)}\}_{i=1}^{S}$, whose mean is used as the estimate of the probability of in-hospital mortality.

The complete hospital stay of an ICU patient is tracked by our AttDMM, thereby producing the probability of in-hospital mortality over time. In this way, AttDMM can inform the physicians in the ICU about the risk scores of each patient. AttDMM yields a full posterior distribution of the prediction (rather than a point estimate). This distribution is acquired from the multiple samples of the mortality prediction $\{\hat{y}^{(i)}\}_{i=1}^{S}$. This allows us to compute a confidence interval for the prediction in order to quantify its uncertainty and, in practice, one can use it to decide whether the prediction is sufficiently reliable.
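The following PyTorch sketch illustrates how the attention and predictor networks could turn sampled latent trajectories into a risk estimate with an uncertainty band, as described above. The layer sizes, the number of samples, and the random tensors standing in for posterior samples are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionRiskHead(nn.Module):
    """Sketch: cosine-similarity attention over z_{1:T} followed by a small predictor."""

    def __init__(self, z_dim, hidden_dim, scale=5.0):
        super().__init__()
        self.key = nn.Linear(z_dim, z_dim)              # transforms each latent into a key
        self.query = nn.Parameter(torch.randn(z_dim))   # learned query vector
        self.scale = scale                              # scalar sharpening the attention weights
        self.predictor = nn.Sequential(nn.Linear(z_dim, hidden_dim), nn.ReLU(),
                                       nn.Linear(hidden_dim, 1))

    def forward(self, z_seq):                           # z_seq: (T, z_dim)
        keys = self.key(z_seq)
        sim = F.cosine_similarity(keys, self.query.expand_as(keys), dim=-1)
        alpha = torch.softmax(self.scale * sim, dim=0)  # importance of each time step
        z_summary = (alpha.unsqueeze(-1) * z_seq).sum(dim=0)     # summary representation
        return torch.sigmoid(self.predictor(z_summary)).squeeze(-1)  # mortality probability

# Monte Carlo risk estimate: average over S latent trajectories; random tensors stand in
# for samples drawn from the approximate posterior.
head, S, T, z_dim = AttentionRiskHead(z_dim=32, hidden_dim=16), 50, 24, 32
samples = torch.stack([head(torch.randn(T, z_dim)) for _ in range(S)])
risk = samples.mean()                                    # point estimate of mortality risk
lo, hi = samples.quantile(0.025), samples.quantile(0.975)  # simple uncertainty band
```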
Our evaluation was based on an established reference dataset with ICU measurements, namely, MIMIC-III [21]. MIMIC-III is one of the largest publicly available ICU datasets, comprising 38,597 distinct patients and a total of 53,423 ICU stays. Because of this, MIMIC-III has been used extensively in prior literature for benchmarking [e. g., 6, 10, 15, 18, 22, 23, 38, 40, 41]. For each ICU stay, the dataset includes a time-series of different measurements, such as temperature, heart rate, and blood pressure. In addition, the dataset also reports sociodemographic variables that describe the heterogeneity among patients, such as age and admission type (e. g., scheduled surgery, unscheduled surgery). The available features for prediction are reported in Appendix B. For details on the dataset, we refer the reader to [21].

We follow the preprocessing pipeline established by prior literature [6, 38]. We first remove ICU stays that are shorter than 2 days or longer than 30 days, as well as ICU stays of patients younger than 15 years. Further, we only consider the first ICU stay of patients with multiple stays. Afterwards, we apply further preprocessing steps consistent with the aforementioned references [6, 38]. For each ICU stay, we extract features that are sampled regularly (i. e., every 2 hours), and we fill missing values with a forward-backward linear imputation. The preprocessed dataset contains 31,895 ICU stays. Of these, 3,311 stays (i. e., 10.38 %) ended with in-hospital mortality. The remaining 28,584 stays (i. e., 89.62 %) resulted in a discharge from the ICU. The length of hospital stay after ICU admission had a mean of 189.21 hours (with a standard deviation of 136.44 hours). Of all the ICU stays, 58.78 % required a hospital stay of longer than five days. The distribution of hospitalization length is shown in the Appendix.

Mortality prediction (first 48 hours after ICU admission). In this task, we predict in-hospital mortality based on the measurement data from the first 48 hours after ICU admission. This represents the reference task in the literature for calibrating and evaluating mortality predictions in ICUs. To examine how the prediction performance changes throughout an ICU stay, we further report the prediction performance at different time periods after ICU admission, ranging from 12 to 48 hours after ICU admission. For this, all predictions have access only to the respective (limited) time frame of ICU measurements. This allows us to assess how precise predictions are at the beginning of the ICU stay.

Mortality prediction (complete hospital stay after ICU admission). A second prediction task is used to evaluate ICU mortality during the complete hospital stay. Typical use cases are diseases in which the health trajectory remains stable for a large amount of time and, only later, deteriorates suddenly. The relevance of the second prediction task is also seen in the summary statistics of our dataset, in which hospital stays with a duration of more than 48 hours are highly common and account for more than 80 % of all ICU admissions. Formally, we now use the complete time-series of the ICU data during both training and evaluation. We then report the prediction performance in intervals of 2 hours. In this case, we label time with respect to hours to discharge/death. We report the prediction performance for all lead times between −120 hours and 0 hours. Longer lead times were discarded, as there was not sufficient data. We later also report an overall performance score. For this, we aggregate the prediction performance across all time steps via a weighted micro-level average, in which we weigh each time step by the number of patients available for evaluation at that time step.
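As a minimal illustration of this aggregation, the sketch below computes the area under the ROC curve and the area under the precision-recall curve per time step with scikit-learn (average precision serving as the AUPRC estimate) and weights them by the number of patients available at each step. The toy labels and risk scores are made up.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

def weighted_overall_score(y_true_by_step, y_score_by_step):
    """Aggregate per-time-step performance, weighting each step by its number of patients."""
    aurocs, auprcs, weights = [], [], []
    for y_true, y_score in zip(y_true_by_step, y_score_by_step):
        if len(set(y_true)) < 2:          # skip steps without both outcomes present
            continue
        aurocs.append(roc_auc_score(y_true, y_score))
        auprcs.append(average_precision_score(y_true, y_score))
        weights.append(len(y_true))       # number of patients available at this step
    weights = np.asarray(weights, dtype=float)
    return (np.average(aurocs, weights=weights),
            np.average(auprcs, weights=weights))

# Toy example with two time steps:
auroc, auprc = weighted_overall_score(
    y_true_by_step=[[0, 0, 1, 1], [0, 1, 1]],
    y_score_by_step=[[0.1, 0.4, 0.35, 0.8], [0.2, 0.7, 0.9]],
)
```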
We compare our AttDMM with the baselines based on two performance metrics. (1) We report the area under the receiver operating characteristic curve (AUROC). This was also chosen in prior literature [6, 10, 15]. We tested whether the improvement in AUROC is statistically significant via DeLong's test [11]. (2) We further report the area under the precision-recall curve (AUPRC). The AUPRC is frequently used in the clinical literature as it focuses on the performance relative to detecting a negative event (i. e., mortality).

We compare our AttDMM against an extensive series of baselines that have been carefully crafted for mortality prediction in ICU settings. We first implemented SAPS-II, representing current practice [25]. Formally, SAPS-II provides predefined decision rules that make use of the most critical lab measurements of the last 48 hours. In addition, we implemented naïve baselines for performance comparisons. These are a multilayer perceptron (MLP) and a random forest (RF). Both are fed with summary statistics (mean, max, min, std. dev.) of the same features used by current practice. Hence, the naïve baselines are intended to reflect the power of the prediction heuristics from clinical practice. We further adapt latent variable models to ICU mortality prediction. A hidden Markov model (HMM) is utilized with two hidden states, referring to discharge and mortality. Additionally, we use an HMM with multiple hidden states and feed these states into a long short-term memory, denoted as HMM+LSTM. Finally, we crafted a DMM for the prediction task, in which the last latent variable is fed into the predictor network (the same network as in AttDMM). We use the following ICU-specific state-of-the-art models: a long short-term memory network for ICU prediction (ICU/LSTM) [15]; a gated recurrent unit for ICU prediction (ICU/GRU) [10]; and a gated recurrent unit with decay mechanism for ICU prediction (ICU/GRU-D) [6]. These models can, by nature of the underlying model, handle sequential input of variable length and are, thus, trained with the same data as AttDMM. We use implementations analogous to those in the prior literature that have been tailored to handle the specific time-series structure in ICU settings (e. g., with regard to missing values, sampling frequency). Appendix C describes details about the implementation and evaluation of all baseline models as well as AttDMM.

Table 1 lists the results of in-hospital mortality prediction based on the measurement data from the first 48 hours after ICU admission. The best baseline is given by the ICU/GRU-D with an AUROC of 0.857. In contrast, AttDMM achieves an AUROC of 0.876, which is an improvement over the best baseline by 2.2 %. The improvement is statistically significant (p<0.001). Similarly, AttDMM achieved the highest AUPRC with 0.465. Compared to the best baseline (ICU/GRU-D with an AUPRC of 0.454), this is an improvement of 2.4 %.

Table 1: In-hospital mortality prediction based on the first 48 hours after ICU admission. Higher is better.
Model              AUROC            AUPRC
ICU/LSTM [15]      0.831 ± 0.017    0.403 ± 0.028
ICU/GRU [10]       0.838 ± 0.019    0.412 ± 0.029
ICU/GRU-D [6]      0.857 ± 0.012    0.454 ± 0.014
Proposed AttDMM    0.876 ± 0.004    0.465 ± 0.010

We further provide a sensitivity analysis in which we study how the prediction performance varies after ICU admission. For this, Figure 3 depicts the evolution of both the AUROC and AUPRC from 12 to 48 hours (in steps of 2 hours) after ICU admission.
All the baselines are outperformed by the proposed AttDMM. The improvement of AttDMM over the baselines is particularly pronounced for time periods after the first 24 hours. We further compare the models in terms of the minimum time needed after ICU admission in order to achieve a specific prediction performance. To achieve an AUROC of 0.85, the ICU/GRU-D needs 44 hours of patient data, whereas the same AUROC is achieved by AttDMM 12 hours earlier. For the AUPRC, the ICU/GRU-D performs better shortly after ICU admission, but after 18 hours, it is outperformed by AttDMM. An AUPRC of 0.45 is achieved by the ICU/GRU-D 46 hours after ICU admission, whereas the same AUPRC is achieved by AttDMM after 40 hours, i. e., 6 hours earlier.

Table 2 lists the results of in-hospital mortality prediction based on the complete hospital stay. The best baseline is given by the random forest with an AUROC of 0.846. In contrast, AttDMM achieves an AUROC of 0.865. Therefore, AttDMM yields an improvement over the best baseline by 2.2 %. The improvement is statistically significant (p<0.001). For the AUPRC, our proposed model again achieves the best score, amounting to 0.545. In this case, AttDMM yields an improvement of 5.4 % over the best baseline in terms of AUPRC (ICU/GRU-D with 0.517).

Table 2: In-hospital mortality prediction based on the complete hospital stay. Higher is better.
Model              AUROC            AUPRC
ICU/LSTM [15]      0.829 ± 0.012    0.470 ± 0.026
ICU/GRU [10]       0.842 ± 0.016    0.473 ± 0.034
ICU/GRU-D [6]      0.834 ± 0.016    0.517 ± 0.025
Proposed AttDMM    0.865 ± 0.010    0.545 ± 0.029

Figure 4 depicts the evolution of prediction performance across different time periods. For both the AUROC and AUPRC, AttDMM yields favorable results. In comparison, the baseline models vary in their performance at different steps. The random forest has an AUROC similar to AttDMM until 84 hours to discharge/death. After that, AttDMM outperforms the random forest. The ICU/GRU and ICU/GRU-D show increases in AUROC for the last 24 hours of hospital stays, but are ranked consistently below AttDMM. For the AUPRC, AttDMM and the ICU/GRU-D show a similar performance for the last 24 hours of hospital stays. However, prior to that, the ICU/GRU-D and the other baselines are consistently outperformed by AttDMM. For both AUROC and AUPRC, the sequential networks (i. e., ICU/LSTM, ICU/GRU, and ICU/GRU-D) show a lower performance level compared to AttDMM when focusing on the time window of more than 24 hours to discharge/death (i. e., from −120 hours to −24 hours). Overall, we see consistent gains from using AttDMM over existing baselines.

Figure 4: Sensitivity of prediction performance across varying time windows, that is, when making predictions hours ahead of an ICU discharge/mortality. Shown are two performance metrics: AUROC (top) and AUPRC (bottom).

The risk score is crucial for the decision-making of clinical practitioners in ICUs. It provides early warnings of deterioration in a patient's health so that preventive measures can be taken. For this, our AttDMM outputs the probability of in-hospital mortality over time. This can inform physicians in the ICU regarding the risk scores of each patient and, thus, influence the treatment. Figure 5 presents risk scores for two example patients, namely one example patient with in-hospital mortality (top) and one example patient with a hospital discharge (bottom). For both patients, AttDMM produces appropriate risk scores several hours before death or discharge. For the patient who died in the hospital, AttDMM outputs a risk score that peaks quickly, thereby signaling that the patient is at high risk.
In this case, the risk score from AttDMM exceeds a mortality probability of 90 %, and this risk score is obtained as early as 52 hours before death. For the patient who was discharged from the hospital, the risk scores produced by AttDMM suggest that the patient had a moderate risk of mortality during the first 30 hours of the ICU stay. After that, the health condition stabilized further, as shown by the fact that the associated risk score approaches zero. In contrast, for the second patient, the score from clinical practice (SAPS-II) estimates a fairly constant risk throughout the entire hospital stay. As SAPS-II is computed based on the most critical measurements, it cannot properly model the temporal changes when the measurements indicate a better health condition. Therefore, SAPS-II does not capture such a recovery in the health condition of the patient.

Theory-informed model: We present a novel generative deep probabilistic model for predicting mortality risk in ICUs. Our AttDMM outperformed existing baselines due to the way the model is formalized. Specifically, our AttDMM is modeled so that it jointly captures (1) long-term disease dynamics (via an attention network) and (2) different latent disease states in patient trajectories (via a latent variable model). By combining these two characteristics, AttDMM presents a powerful tool for modeling health progression in ICUs, as confirmed by its superior prediction performance.

Clinical relevance: AttDMM outperformed both baselines from clinical practice (e. g., SAPS-II [25]) and state-of-the-art machine learning for ICU mortality prediction [6, 10, 15]. In practice, such an improvement in performance allows warnings to be triggered several hours earlier when health conditions deteriorate. Thereby, our model provides ICU practitioners with more time for early intervention in order to save patient lives.

Precision of real-time risk score: The real-time risk score provided by AttDMM tracks the health trajectory of an ICU patient over time. Thereby, ICU practitioners can follow the course of the disease and, by identifying patients at risk, adapt their decision-making concerning treatment planning. Owing to our probabilistic setting, AttDMM additionally outputs the confidence interval of the predicted risk score. Thus, ICU practitioners are informed about the precision of the prediction, and they might postpone a discharge until more measurements are available. This is a particular benefit over risk scoring in clinical practice (e. g., SAPS, APACHE), in which similar confidence intervals are lacking.

Generalizability: We developed AttDMM as a generative deep probabilistic model tailored to ICU mortality. Needless to say, AttDMM is not limited to an ICU setting and might facilitate other use cases in which inferences have to be made from time-series that have long-term dependencies and are driven by latent dynamics, such as clickstream analytics or churn prediction.

In intensive care units (ICUs), patients are subject to constant monitoring in order to guide clinical decision-making. While state-of-the-art models can capture long-term dependencies, they cannot formally account for latent disease states in health trajectories. To fill this gap, we developed a novel probabilistic generative model, specifically, an attentive deep Markov model (AttDMM). To the best of our knowledge, this is the first attentive deep Markov model. AttDMM was co-developed with clinical researchers who emphasized the benefit of real-time risk scoring.
In our numerical experiments, AttDMM yielded performance improvements in both AUROC and AUPRC over the state of the art by more than 2 %. Altogether, this enables more accurate risk scoring so that the lives of patients at risk of mortality can be saved.

Appendix A (details of the estimation procedure): The true posterior of the latent variables, $p(z_{1:T} \mid x_{1:T}, m_{1:T}, s)$, is computationally intractable. Because of this, we adopt stochastic variational inference and leverage the approximated posterior $q(z_{1:T} \mid x_{1:T}, m_{1:T}, s)$ as a proxy of the true posterior. Throughout the iterations, the approximated posterior moves closer, in terms of Kullback-Leibler (KL) divergence, to the true posterior. This is achieved by maximizing the ELBO. For this, we use stochastic optimization via unbiased Monte Carlo estimates of the gradient; details are found in [39]. For the ICU mortality prediction task, we incorporate the maximization of the ELBO into the loss function as a regularization term. Overall, in the following, we show how to estimate AttDMM via an end-to-end training task with a tailored ELBO.

Loss function: The loss function of AttDMM is given by
$\mathcal{L}(y, \hat{y}) = \ell(y, \hat{y}) - \mathrm{ELBO}$,
where the two terms $\ell(y, \hat{y})$ and the ELBO are described below.

Cross-entropy loss: The term $\ell(y, \hat{y})$ denotes the weighted cross-entropy loss between the observed label $y$ and the corresponding mortality prediction $\hat{y}$. It is given by
$\ell(y, \hat{y}) = -\,\omega\, y \log \hat{y} - (1 - y) \log(1 - \hat{y})$,
with a weight $\omega = \frac{|\{i \in D : y_i = \text{"discharge"}\}|}{|\{i \in D : y_i = \text{"death"}\}|}$ denoting the discharge-to-death ratio. Introducing such a weight further helps in discriminating the minority class (i. e., in-hospital mortality) in the imbalanced dataset.

ELBO: The ELBO is given by
$\mathrm{ELBO} = \mathbb{E}_{q(z_{1:T} \mid x_{1:T}, m_{1:T}, s)}\big[\log p(x_{1:T} \mid z_{1:T})\big] - \mathrm{KL}\big(q(z_{1:T} \mid x_{1:T}, m_{1:T}, s)\,\|\,p(z_{1:T} \mid s)\big)$.
The expectation term denotes the expected log-likelihood of $x_{1:T}$ given the latent variables $z_{1:T}$. The KL divergence measures the similarity between the posterior approximation and the prior formulation of the latent variables $z_{1:T}$. In this ELBO formulation, $q(z_{1:T} \mid x_{1:T}, m_{1:T}, s)$ is decomposed into further terms, which are produced by the posterior approximation, given by
$q(z_{1:T} \mid x_{1:T}, m_{1:T}, s) = \prod_{t=1}^{T} q(z_t \mid z_{t-1}, s, x_{t:T}, m_{t:T})$.
Similarly, $p(x_{1:T} \mid z_{1:T})$ is decomposed into individual terms of $x_t^j$, which are produced by the emission network. Specifically, we can rewrite
$p(x_{1:T} \mid z_{1:T}) = \prod_{t=1}^{T} p(x_t \mid z_t) = \prod_{t=1}^{T} \prod_{j=1}^{M} p(x_t^j \mid z_t)$.
We make further adjustments to handle missing measurements inside the vector $x_t$. For this, AttDMM is specified to train on actual measurements (i. e., ignoring imputed values), which is achieved by
$\log p(x_{1:T} \mid z_{1:T}) = \sum_{t=1}^{T} \sum_{j=1}^{M} m_t^j \log p(x_t^j \mid z_t)$.

Sampling: Computation of the above ELBO is analytically intractable because the latent variables are modeled by a continuous distribution and because of the non-linearity encoded in AttDMM. Therefore, we leverage Monte Carlo sampling and, thereby, estimate the ELBO via stochastic variational inference based on sampled latent variables.
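The following sketch illustrates how such a loss could be assembled from the weighted cross-entropy and a single-sample Monte Carlo estimate of the ELBO with a masked Gaussian reconstruction term. Function and argument names are illustrative, and the KL term is assumed to be computed elsewhere; this is a sketch of the idea, not the authors' Pyro implementation.

```python
import torch
import torch.nn.functional as F

def attdmm_style_loss(y, y_hat, x, mask, recon_mu, recon_var, kl_term, omega):
    """Sketch: weighted cross-entropy plus the negative ELBO as a regularizer.

    y, y_hat      : mortality label and predicted probability (tensors, scalar or batch)
    x, mask       : measurements (T, M) and observation mask (T, M); imputed entries have mask 0
    recon_mu/var  : emission means/variances for x under the sampled latents, shape (T, M)
    kl_term       : KL(q(z_1:T | ...) || p(z_1:T | s)), assumed precomputed and summed over time
    omega         : discharge-to-death ratio, up-weighting the minority (death) class
    """
    # Weighted binary cross-entropy between label and prediction.
    ce = F.binary_cross_entropy(y_hat, y, weight=omega * y + (1.0 - y))

    # Masked Gaussian log-likelihood of the observed measurements (imputed values are ignored).
    log_px = -0.5 * (torch.log(2 * torch.pi * recon_var) + (x - recon_mu) ** 2 / recon_var)
    elbo = (mask * log_px).sum() - kl_term

    return ce - elbo          # maximizing the ELBO corresponds to minimizing its negative
```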
Figure 6: Histogram of the length of hospital stay after ICU admission. Recall that hospital stays of less than 48 hours were removed during preprocessing, as in [6, 38].

The dataset is split randomly into 5 separate folds. Each fold has the same ratio of in-hospital mortality and hospital discharge. To quantify the uncertainty regions for the predictions, we used the following approach: each model was trained 5 times, where in each run one fold represents the test set and the remaining 4 folds were randomly split into a training set (3 folds) and a validation set (1 fold; used for early stopping). Afterwards, the performance was aggregated over the different folds via macro averaging. Later, we report the standard deviation across folds. The hyperparameters of the models are tuned according to the performance on the validation set. We use the hyperparameter set that provides the best overall validation score. Details on hyperparameter tuning are presented in Appendix D.

All models were implemented in Python. We employed the following libraries: Pyro for AttDMM; PyTorch for the LSTM, GRU, and MLP; TensorFlow for the GRU-D; and scikit-learn for the random forest. Training and evaluations were performed on a TITAN V GPU from NVIDIA with 12 GB of memory.

Due to the large hyperparameter space across all models, it was computationally expensive to perform a grid search. Therefore, we adopted a ceteris paribus strategy, which means that we tuned each parameter individually while keeping the other parameters fixed. We ran this procedure for a few loops, until the validation score converged. The tuning parameters with the corresponding tuning ranges are listed for AttDMM in Table 4 and for the baselines in Table 5. For both AttDMM and the sequential neural network baselines (ICU/LSTM, ICU/GRU, and ICU/GRU-D), we used the Adam optimizer. These models were trained using early stopping with the patience set to 20 epochs.
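A minimal sketch of such a ceteris paribus search is given below: each hyperparameter is varied in turn while the others stay fixed, and the loop is repeated until the validation score stops improving. The `train_and_validate` callable and the example search space are hypothetical stand-ins, not the configuration used in the paper.

```python
def ceteris_paribus_search(search_space, train_and_validate, max_loops=3):
    """Tune one hyperparameter at a time while keeping all others fixed.

    search_space       : dict mapping each hyperparameter to its candidate values
    train_and_validate : hypothetical callable(config) -> validation score (higher is better)
    """
    config = {name: values[0] for name, values in search_space.items()}  # starting point
    best_score = train_and_validate(config)

    for _ in range(max_loops):                 # a few loops, until the score stops improving
        improved = False
        for name, values in search_space.items():
            for value in values:               # vary this parameter, all others fixed
                candidate = dict(config, **{name: value})
                score = train_and_validate(candidate)
                if score > best_score:
                    best_score, config, improved = score, candidate, True
        if not improved:
            break
    return config, best_score

# Example usage with a made-up scoring function:
space = {"hidden_dim": [32, 64, 128], "learning_rate": [1e-3, 1e-4]}
best_cfg, best_score = ceteris_paribus_search(space, lambda cfg: 0.8)  # placeholder scorer
```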
References:
[1] Individualized risk prognosis for critical care patients: A multi-task Gaussian process model
[2] Personalized risk scoring for critical care prognosis using mixtures of Gaussian processes
[3] Patient similarity analysis with longitudinal health data
[4] Progression of liver cirrhosis to HCC: An application of hidden Markov model
[5] Severity scoring systems in the critically ill. Continuing Education in Anaesthesia
[6] Recurrent neural networks for multivariate time series with missing values
[7] Risk prediction with electronic health records: A deep learning approach
[8] Retain: An interpretable predictive model for healthcare using reverse time attention mechanism
[9] Medical concept representation learning from electronic health records and its application on heart failure prediction
[10] Deep ensemble tensor factorization for longitudinal patient trajectories classification
[11] Comparing the areas under two or more correlated receiver operating characteristic curves: A nonparametric approach
[12] Hidden Markov models for zero-inflated Poisson counts with an application to substance use
[13] Potentially avoidable 30-day hospital readmissions in medical patients: Derivation and validation of a prediction model
[14] Serial evaluation of the SOFA score to predict outcome in critically ill patients
[15] An interpretable ICU mortality prediction model based on logistic regression and recurrent neural networks with LSTM units
[16] Unfolding physiological state: Mortality modelling in intensive care units
[17] A multivariate timeseries modeling approach to severity of illness assessment and forecasting in ICU with sparse, heterogeneous clinical data
[18] Greg Ver Steeg, and Aram Galstyan. 2017. Multitask learning and benchmarking with clinical time series data
[19] Hospital readmission in general medicine patients: A prediction model
[20] Assessing contemporary intensive care unit outcome: An updated mortality probability admission model (MPM0-III)
[21] MIMIC-III, a freely accessible critical care database
[22] An attention based deep learning model of clinical events in the intensive care unit
[23] Learning representations for the early detection of sepsis with deep neural networks
[24] Structured inference networks for nonlinear state space models
[25] A new simplified acute physiology score (SAPS II) based on a European/North American multicenter study
[26] Understanding decision making in critical care
[27] HiTANet: Hierarchical time-aware attention networks for risk prediction on electronic health records
[28] Dipole: Diagnosis prediction in healthcare via attention-based bidirectional recurrent neural networks
[29] Maytal Saar-Tsechansky, and Thomas Zueger. 2021. Modeling longitudinal dynamics of comorbidities
[30] Multivariate hidden Markov models for disease progression
[31] Hierarchical hidden Markov jump processes for cancer screening modeling
[32] SAPS 3 - From evaluation of the patient to evaluation of the intensive care unit. Part 1: Objectives, methods and cohort description
[33] Deep patient: An unsupervised representation to predict the future of patients from the electronic health records
[34] Outcomes of direct and indirect medical intensive care unit admissions from the emergency department of an acute care hospital: A retrospective cohort study
[35] SAPS 3 - From evaluation of the patient to evaluation of the intensive care unit. Part 2: Development of a prognostic model for hospital mortality at ICU admission
[36] Risk factors for hospital and long-term mortality of critically ill elderly patients admitted to an intensive care unit
[37] ICU admission, discharge, and triage guidelines: A framework to enhance clinical operations, development of institutional policies, and further research
[38] Benchmark of deep learning models on large healthcare MIMIC datasets
[39] Black box variational inference
[40] Predicting sepsis with a recurrent neural network using the MIMIC III database
[41] Temporal probabilistic profiles for sepsis prediction in the ICU
[42] Hidden Markov models for alcoholism treatment trial data
[43] Modeling irregularly sampled clinical time series
[44] Dynamic and explainable machine learning prediction of mortality in patients in the intensive care unit: a retrospective study of high-frequency data in electronic patient records
[45] Estimating risk-adjusted hospital performance
[46] Sequential tests of statistical hypotheses
[47] Readmission prediction via deep contextual embedding of clinical concepts
[48] Dongdong Zhang, and Ping Zhang. 2020. Identifying sepsis subphenotypes via time-aware multi-modal auto-encoder
[49] ForecastICU: A prognostic decision support system for timely prediction of intensive care unit admission
[50] Real-time risk assessment based on hidden Markov model and security configuration
[51] Hang Chen, Yefeng Zheng, and Ian Davidson. 2020. INPREM: An interpretable and trustworthy predictive model for healthcare
[52] Using a LSTM-RNN based deep learning framework for ICU mortality prediction
[53] Acute physiology and chronic health evaluation (APACHE) IV: Hospital mortality assessment for today's critically ill patients