key: cord-0957241-5237ave3 authors: Zhang, Hanrui; Yi, Daiyao; Guan, Yuanfang title: Timesias: A machine learning pipeline for predicting outcomes from time-series clinical records date: 2021-07-02 journal: STAR Protoc DOI: 10.1016/j.xpro.2021.100639 sha: 3ccbeafc1f37a8cef27eacf921a395eba0065b4a doc_id: 957241 cord_uid: 5237ave3

The prediction of outcomes is a critical part of clinical surveillance for hospitalized patients. Here, we present Timesias, a machine learning pipeline that predicts outcomes from real-time sequential clinical data. The strategy implemented in Timesias is the first-place solution in the crowd-sourced DII (discover, innovate, impact) National Data Science Challenge involving more than 100,000 patients, achieving 0.85 as evaluated by AUROC (area under the receiver operating characteristic curve) in predicting the early onset of sepsis. Timesias is freely available via PyPI and GitHub. For complete details on the use and execution of this protocol, please refer to Guan et al. (2021).

The application of digital systems in clinical practice has stimulated exponential growth of Electronic Health Records (EHR) over the last decade, enabling the secondary use of such records to further improve clinical care.
Outcome prediction from EHR is one of the most active topics in clinical informatics and has attracted growing effort in algorithm development (Che et al., 2016; Shickel et al., 2018; Dash et al., 2019; Liu et al., 2019; Morawski et al., 2020; Shamout et al., 2021). More specifically, predicting clinical outcomes from real-time sequential surveillance data is crucial for preventing the sudden outburst of an emergency. A variety of machine learning models has been applied to EHR outcome prediction, including classical statistical models such as linear and logistic regression (Generalized Linear Models (GLMs), 2005; Chatterjee and Hadi, 2015). Machine learning models such as SVM and random forest have also been used (Breiman, 2001). Deep learning models, including Gated Recurrent Units (GRU), LSTM (Long Short-Term Memory) networks, CNNs (Convolutional Neural Networks), and attention networks, have been used to capture the longitudinal information in time-series records (Hochreiter and Schmidhuber, 1997; LeCun et al., 2015; Shickel et al., 2018). Recently, light-weight gradient boosting machines such as LightGBM have become top performers in dozens of data science challenges; the advantage of LightGBM over other models is its speed when training on large datasets (Ke et al., 2017).

One of the biggest challenges in employing EHR data is the unavoidable missing information in sequential records. However, previous observations provide substantial support that missing values themselves convey information about a patient's health status (Sharafoddini et al., 2019). Therefore, besides temporal changes in health status, the pattern of missing information may further assist clinical outcome prediction, and our feature design takes advantage of it.
Here we introduce Timesias, a LightGBM-based machine learning pipeline that uses time-series EHR data to predict clinical outcomes. Our algorithm, which placed first in the 2019 DII National Data Science Challenge (DII Challenge, 2019) and then in the ongoing COVID-19 EHR DREAM Challenge (Sage Bionetworks, 2020), is highly versatile and can easily be adapted to different time-series inputs. We implemented Timesias as a user-friendly command-line interface, allowing non-programmers to train predictive machine learning models for their own purposes. Timesias also allows users to visualize the top-ranked features selected by the models using SHAP analysis (Lundberg and Lee, 2017), improving the interpretability of black-box models.

Timesias is installed and run under Linux or macOS. Before launching the program, Python (>= v.3.6) and LightGBM (>= v.3.1.1) must be preinstalled. You should also prepare your own time-series EHR records with the clinical outcomes of interest. Examples of the prerequisites and input data format can be found in our GitHub repository: https://github.com/GuanLab/timesias.git. While this protocol can be applied to any time-series records, here we describe how to run our pipeline on randomly generated time-series EHR data to predict sepsis onset, a binary classification task. Timesias can also be used for multiclass classification or regression tasks. The program in this protocol was written in Python (>= v.3.6); all experiments were carried out and evaluated under Ubuntu Linux with the computational resources listed in Table 1.
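Before installing, the stated prerequisites (Python >= 3.6, LightGBM >= 3.1.1) can be verified with a short snippet. This is an illustrative sanity check, not part of the Timesias package:

```python
import sys

# Illustrative prerequisite check (not part of Timesias): the protocol
# requires Python >= 3.6 and LightGBM >= 3.1.1.
assert sys.version_info >= (3, 6), "Timesias requires Python >= 3.6"

try:
    import lightgbm
    print("lightgbm version:", lightgbm.__version__)
except ImportError:
    print("lightgbm not found; install it with: pip install 'lightgbm>=3.1.1'")
```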
Install Timesias with pip (pip install timesias), or clone the whole package from our GitHub repository to your local directory using the following command (Figure 1):

Timing: <10 min

Two types of data should be prepared for model training and prediction: 1) a gold standard file, which contains the names of the record files for each patient and the corresponding gold standard (the clinical outcome of that patient); and 2) time-series record files, which are '|'-delimited time-series records. The two file types should follow these formats:

2. Gold standard file (for example: example.gs.file): a ','-delimited table with two columns. Each row represents one sample (Table 2):
a. The first column: the path of the time-series record file (the format of these files is detailed below).
b. The second column: the gold standard, or label, for that record. In the example data, we use the binary label 1/0 to indicate failure/survival, i.e., the sepsis onset status of the patient.

3. Record files (for example: *.psv): '|'-delimited time-series records. The record files correspond to the first column of the gold standard file and hold the time-series records for each individual sample/patient. Each sample should have one record file, in the following format (Table 3):
a. The first row (header): the clinical measurements (features). The first header field is the time point; from the second column to the last, the headers are the clinical measurements recorded at each time point.
b. The first column (row names): the exact time points in ascending order.
c. From the second column to the last: the observed values for each clinical measurement at the nth time point. It is worth noting that, at some time points, some features might be missing.
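The two input formats above can be sketched as follows. The file names, measurement names (HR, Temp), and values below are made up for illustration and are not from the real dataset:

```python
# A hypothetical miniature of the two inputs: a ','-delimited gold standard
# file and a '|'-delimited record file; all names and values are illustrative.
gold_standard = "data/patient_001.psv,1\ndata/patient_002.psv,0\n"

# One record file: the first column is the time point, the remaining
# columns are measurements, and missing observations are written as 'NaN'.
record = (
    "time|HR|Temp\n"
    "1|80|36.6\n"
    "2|NaN|36.8\n"
    "3|92|NaN\n"
)

rows = [line.split("|") for line in record.strip().splitlines()]
header, body = rows[0], rows[1:]
assert header[0] == "time"                       # first header field is the time point
assert all(len(r) == len(header) for r in body)  # every row covers all columns
times = [int(r[0]) for r in body]
assert times == sorted(times)                    # time points in ascending order
```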
Therefore, it is recommended that researchers fill missing values with 'NaN' instead of leaving them blank. The demonstration datasets can be found in our Timesias GitHub repository (https://github.com/GuanLab/timesias/tree/master/data).

CRITICAL: 1. Both types of datasets should follow the formats described above. 2. Make sure that the first column of the gold standard file matches the file names of the records; that is, the total number of rows in the gold standard file always equals the total number of record files.

Timing: ~0.5 h (depending on your data)

Our package provides a one-line command that performs feature construction, model training, and five-fold cross-validation in one pass, followed by SHAP analysis with a visualization report of the top contributing features (Figure 2).

Missing values contain critical information for outcome prediction. Here we introduce the feature construction method implemented in Timesias to preserve missing-value information (Figure 3). First, for each feature at a given time point, we add an additional binary value annotating whether it is missing (1/0). This doubles the total number of features in the matrix: if we use the last i time points of a record with j features, the original feature matrix is an i × j matrix, and after the missing-value annotation step the processed feature matrix is i × 2j.

Figure 3. Overview of the feature construction process. First, missing information is replaced with distinct replacement values (missing feature values (gray): -3000; missing time points (green): -5000). Then four types of characteristics are computed for each feature: 'std', 'norm', 'missing portion', and 'baseline'. The final feature matrix is generated by concatenating all generated features together.
At the same time, the missing values in the original feature matrix are replaced with an arbitrary constant, set to -3000 by default. Since we designate a fixed length n for all records via the argument -t [LAST_N_RECORDS], if a provided record (with i time points) is shorter than n, we fill the missing earliest time points with another arbitrary constant, -5000 by default; otherwise, the record is cropped to the last n time points. The replacement values for missing information can be changed to suit your own data (Troubleshooting 2). The above constitute the feature preprocessing step, which generates an n × 2j matrix from which extra features can be generated, as described next.

c. -f [EXTRA_FEATURES]: additional features to include for prediction. We provide four additional feature types constructed from your original data: norm, std, missing portion, and baseline.

Note: These additional features capture the temporal changes as well as the patterns of missing values in the records. The extra features are defined as follows:

1) 'norm': normalized feature values. For each column q in the n × 2j matrix, we calculate the mean and standard deviation across all rows. The normalized feature value is

f'_{p,q} = (f_{p,q} - μ_q) / σ_q, for p ∈ {1, …, n} and q ∈ {1, …, 2j},

where f_{p,q} is the value at the pth time point in the qth column, μ_q is the mean of the qth column over the timeline, σ_q is the standard deviation of the qth column over the timeline, n is the total number of time points, and j is the total number of features.

2) 'std': as mentioned above, the standard deviation σ_q of each feature over the whole time course:

σ_q = sqrt( (1/n) Σ_{p=1}^{n} (f_{p,q} - μ_q)² ), for q ∈ {1, …, 2j},

where f_{p,q} is the value at the pth time point in the qth column, μ_q is the mean of the qth column over the timeline, and n is the total number of time points.
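The preprocessing steps above can be sketched in NumPy. This is a minimal illustration of the described logic (mask, fill with -3000, pad with -5000, derive 'norm' and 'std'), not the package's internal implementation:

```python
import numpy as np

# Illustrative sketch of the preprocessing described above, not Timesias'
# internal code. Replacement constants follow the stated defaults.
VAL_REP, T_REP = -3000.0, -5000.0

def build_features(record, last_n):
    """record: (i, j) array with np.nan marking missing values."""
    mask = np.isnan(record).astype(float)            # 1 = missing, 0 = observed
    filled = np.where(np.isnan(record), VAL_REP, record)
    x = np.concatenate([filled, mask], axis=1)       # (i, 2j) matrix
    i = x.shape[0]
    if i < last_n:                                   # pad the earliest time points
        pad = np.full((last_n - i, x.shape[1]), T_REP)
        x = np.vstack([pad, x])
    else:                                            # crop to the last n time points
        x = x[-last_n:]
    mu, sigma = x.mean(axis=0), x.std(axis=0)
    norm = (x - mu) / np.where(sigma == 0, 1, sigma) # the 'norm' feature
    return x, norm, sigma                            # sigma is the 'std' feature

# Toy record: 3 time points, 2 measurements, with two missing values.
record = np.array([[80.0, 36.6], [np.nan, 36.8], [92.0, np.nan]])
x, norm, sigma = build_features(record, last_n=4)
```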
3) 'missing portion': the percentage of missing points for each feature, calculated as

missing portion(f_q) = missing(f_q) / n,

where missing(f_q) is the number of missing values of the qth feature and n is the total number of time points.

4) 'baseline': for each feature, the earliest time point at which the feature was collected, together with the corresponding feature value.

d. -e [EVA_METRICS]: the evaluation metrics of your choice. Our program provides five options: AUROC, AUPRC, C-index, Spearman's r, and Pearson's r, covering different prediction tasks (binary/multiclass classification and regression). The metrics are detailed in Quantification and Statistical Analysis.

e. -shap: an optional argument. If used, SHAP analysis (Lundberg and Lee, 2017) is carried out with the trained models from five-fold cross-validation on the test set to visualize the top measurements and time points.

Note: If additional features were constructed from an original feature, we group the corresponding values of that feature to analyze its overall importance. By the additivity of SHAP values, the grouped SHAP value of a set of features or time points can be computed as

SHAP(X) = Σ_{x ∈ X} SHAP(x),

where X is the set of features of interest. For example, to obtain the aggregated SHAP importance of the last 6th time point, X is all features from the last 6th time point. The top features (clinical measurements and time points) are visualized in bar graphs.

The program then trains the prediction model with the arguments you specified. When training finishes, the models are automatically saved to './model' and the evaluation on the test set is performed automatically. If -shap is used, the program starts SHAP analysis after the prediction performance evaluation.
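The grouping by additivity can be sketched as follows. The per-sample SHAP matrix and the column-to-measurement mapping below are hypothetical, and the aggregation (mean absolute grouped value) is one common convention; the exact aggregation in Timesias may differ:

```python
import numpy as np

# Hypothetical per-sample SHAP values: 100 samples, 6 flattened features.
rng = np.random.default_rng(0)
shap_values = rng.normal(size=(100, 6))

# Suppose columns 0-2 derive from measurement 'HR' and 3-5 from 'Temp'
# (e.g., the raw value, its missingness flag, and its normalized value).
groups = {"HR": [0, 1, 2], "Temp": [3, 4, 5]}

# By additivity, a group's SHAP value per sample is the sum over its
# columns; here we summarize with the mean absolute grouped value.
grouped = {
    name: float(np.abs(shap_values[:, cols].sum(axis=1)).mean())
    for name, cols in groups.items()
}
```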
The above command generates the following results from your training data:

Saved models: the five models resulting from five-fold cross-validation are saved as ./models/finalized_model.sav.* and can be reloaded with the standard pickle module in Python.

Evaluation results (e.g., AUROC and AUPRC) from five-fold cross-validation are saved in './results/eva.tsv'.

If -shap is specified, the results of SHAP analysis are stored in the following files: './results/shap_group_by_measurment.csv': feature importance of all measurements resulting from SHAP analysis with the five models in five-fold cross-validation; './results/shap_group_by_timeslot.csv': feature importance of all time points resulting from SHAP analysis with the five models in five-fold cross-validation; './results/top_feature_report.html': the top features (measurements and time points) visualized in bar graphs (Figure 4).

Timesias provides the following metrics to evaluate model performance, which can be specified using the -e argument. The C-index, which also handles censored data, can be calculated as

C-index = (number of concordant pairs + 0.5 × number of tied pairs) / (number of comparable pairs).

Among the above, AUROC and AUPRC can be used for binary classification problems, while C-index, Pearson's r, and Spearman's r can be used for regression problems. Users can also implement other customized evaluation metrics.

One possible limitation of Timesias is memory usage. After filling all missing values, the feature matrix grows to more than twice the size of the original input, depending on how many time points you decide to use (Figure 3). Additionally, using the full set of extra features (norm, std, missing portion, baseline) further increases the feature count. Therefore, with many samples (for example, in the DII National Data Science Challenge, the training data comprised up to ~100,000 samples), CPU memory can be a limitation of our method.

Installation of Timesias fails due to uninstalled prerequisites (step 1).
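The pairwise C-index definition can be illustrated with a hand-rolled implementation. This sketch ignores censoring and is for illustration only; Timesias' own implementation may differ:

```python
# Hand-rolled C-index matching the standard pairwise definition:
# (concordant pairs + 0.5 * tied-prediction pairs) / comparable pairs.
def c_index(y_true, y_pred):
    concordant, comparable = 0.0, 0
    n = len(y_true)
    for a in range(n):
        for b in range(a + 1, n):
            if y_true[a] == y_true[b]:
                continue                 # tied outcomes are not comparable
            comparable += 1
            # concordant when predictions order the pair like the outcomes
            if (y_pred[a] - y_pred[b]) * (y_true[a] - y_true[b]) > 0:
                concordant += 1
            elif y_pred[a] == y_pred[b]:
                concordant += 0.5        # tied predictions count as half
    return concordant / comparable

print(c_index([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # → 0.75
```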
Potential solution: Install the required dependencies manually through the links provided in the key resources table, and then try installing Timesias again.

Different missing-value replacements can affect prediction performance. In our default setting, missing feature values are replaced with -3000 and missing time points with -5000. Different replacement values may have a significant influence on prediction accuracy (Table 4). (step 4)

Potential solution: Use other replacement values for missing feature values and missing time points. To change them, assign val_rep/t_rep in utils.py to the values you prefer instead of -3000 and -5000 (Figure 5). The replacement values should usually be lower than the lowest feature value. We also recommend that val_rep and t_rep not be equal, to retain more information in the dataset.

The program checks whether the gold standard file exists. If not, it halts and prints the following information in the terminal (Figure 6). (step 4)

Potential solution: Check that the gold standard file path (-g) is correct.

If the program fails to read the time-series records, the terminal prints the following assertion error (Figure 7). (step 4) This means the file paths of the time-series records (*.psv files) are wrong.

Potential solution: Check that the training record paths (first column) listed in the gold standard file are correct.

If a time-series record file is empty, the following errors can be raised (Figure 8). (step 4)

Potential solution: Check the file in the training data, and exclude it from the training data if it is not available.

Feature value error: we assume all feature values used for prediction are numeric. If a feature value is a string, the following error can occur (Figure 9). (step 4)

Potential solution: If categorical features are used, encode them as numeric values before training with our program.
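Encoding categorical features as numeric values can be as simple as an ordinal mapping. The column values below are made up for illustration:

```python
# Illustrative ordinal encoding of a categorical column before training;
# the category values are hypothetical, not from the real dataset.
values = ["male", "female", "female", "unknown", "male"]

# Map each distinct category to an integer code (sorted for determinism).
codes = {cat: idx for idx, cat in enumerate(sorted(set(values)))}
encoded = [codes[v] for v in values]
# 'encoded' is now numeric and can be written back into the '|'-delimited
# record file in place of the string column.
```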
While the default value of -t is 16, a larger -t usually retains more information while correspondingly increasing CPU memory usage and training time. We plot memory (CPU) and time versus the last n points using the deposited data (100 samples) (Figure 10). (step 4)

Potential solution: Choose a proper value for -t. We encourage users to try different values of -t and compare the performance and resource requirements to find the preferred choice.

We also encourage users to tune the hyperparameters of the LightGBM model for better performance. The default parameters of the LightGBM regression model are shown in Figure 11, and the default number of boosting rounds is 1000. The defaults may not be optimal for every dataset; the model parameters can be modified in model.py. If the model overfits, choose smaller 'num_leaves' and 'n_estimator' and fewer boosting rounds; if it underfits, choose larger values. A larger 'learning_rate' speeds up training, while a smaller 'learning_rate' with more boosting rounds may achieve better prediction accuracy.

Lead contact: Further information and requests for resources and reagents should be directed to and will be fulfilled by the lead contact, Yuanfang Guan (gyuanfan@umich.edu). This study did not generate new unique reagents.

The code generated in this study is available at https://github.com/GuanLab/timesias. The mock dataset used for display in this paper is deposited at https://github.com/GuanLab/timesias/tree/master/data.

This work is supported by the National Institutes of Health (NIH) grant R35-GM133346 and the National Science Foundation (NSF) grant #1452656. H.Z. packaged the code and wrote the manuscript. D.Y. prepared the figures. Y.G. developed the method and original code. All authors edited and agreed with this manuscript. The authors declare no competing interests.
Breiman (2001). Random forests.
Chatterjee and Hadi (2015). Regression Analysis by Example.
Che et al. (2016). Interpretable deep models for ICU outcome prediction.
Dash et al. (2019). Deep Learning Techniques for Biomedical and Health Informatics.
Generalized Linear Models (GLMs) (2005). Generalized, linear, and mixed models.
Guan et al. (2021). Assessment of the timeliness and robustness for predicting adult sepsis.
Hochreiter and Schmidhuber (1997). Long short-term memory.
Ke et al. (2017). LightGBM: a highly efficient gradient boosting decision tree.
LeCun et al. (2015). Deep learning.
Liu et al. (2019). Learning hierarchical representations of electronic health records for clinical outcome prediction.
Lundberg and Lee (2017). A unified approach to interpreting model predictions.
Morawski et al. (2020). Predicting hospitalizations from electronic health record data.
Shamout et al. (2021). Machine learning for clinical outcome prediction.
Sharafoddini et al. (2019). A new insight into missing data in intensive care unit patient profiles: observational study.
Shickel et al. (2018). Deep EHR: a survey of recent advances in deep learning techniques for electronic health record (EHR) analysis.