key: cord-0933632-mmakaaon authors: Yan, Yao; Schaffter, Thomas; Bergquist, Timothy; Yu, Thomas; Prosser, Justin; Aydin, Zafer; Jabeer, Amhar; Brugere, Ivan; Gao, Jifan; Chen, Guanhua; Causey, Jason; Yao, Yuxin; Bryson, Kevin; Long, Dustin R.; Jarvik, Jeffrey G.; Lee, Christoph I.; Wilcox, Adam; Guinney, Justin; Mooney, Sean title: A Continuously Benchmarked and Crowdsourced Challenge for Rapid Development and Evaluation of Models to Predict COVID-19 Diagnosis and Hospitalization date: 2021-10-11 journal: JAMA Netw Open DOI: 10.1001/jamanetworkopen.2021.24946 sha: bf7687f30b5df711832784236a79880644557d1c doc_id: 933632 cord_uid: mmakaaon IMPORTANCE: Machine learning could be used to predict the likelihood of diagnosis and severity of illness. Lack of COVID-19 patient data has hindered the data science community in developing models to aid in the response to the pandemic. OBJECTIVES: To describe the rapid development and evaluation of clinical algorithms to predict COVID-19 diagnosis and hospitalization using patient data by citizen scientists, provide an unbiased assessment of model performance, and benchmark model performance on subgroups. DESIGN, SETTING, AND PARTICIPANTS: This diagnostic and prognostic study operated a continuous, crowdsourced challenge using a model-to-data approach to securely enable the use of regularly updated COVID-19 patient data from the University of Washington by participants from May 6 to December 23, 2020. A postchallenge analysis was conducted from December 24, 2020, to April 7, 2021, to assess the generalizability of models on the cumulative data set as well as subgroups stratified by age, sex, race, and time of COVID-19 test. By December 23, 2020, this challenge engaged 482 participants from 90 teams and 7 countries. MAIN OUTCOMES AND MEASURES: Machine learning algorithms used patient data and output a score that represented the probability of patients receiving a positive COVID-19 test result or being hospitalized within 21 days after receiving a positive COVID-19 test result. Algorithms were evaluated using area under the receiver operating characteristic curve (AUROC) and area under the precision recall curve (AUPRC) scores. Ensemble models aggregating models from the top challenge teams were developed and evaluated. RESULTS: In the analysis using the cumulative data set, the best performance for COVID-19 diagnosis prediction was an AUROC of 0.776 (95% CI, 0.775-0.777) and an AUPRC of 0.297, and for hospitalization prediction, an AUROC of 0.796 (95% CI, 0.794-0.798) and an AUPRC of 0.188. Analysis on top models submitting to the challenge showed consistently better model performance on the female group than the male group. Among all age groups, the best performance was obtained for the 25- to 49-year age group, and the worst performance was obtained for the group aged 17 years or younger. CONCLUSIONS AND RELEVANCE: In this diagnostic and prognostic study, models submitted by citizen scientists achieved high performance for the prediction of COVID-19 testing and hospitalization outcomes. Evaluation of challenge models on demographic subgroups and prospective data revealed performance discrepancies, providing insights into the potential bias and limitations in the models. We also curated a synthetic dataset which was adapted from the SynPuf (Synthetic OMOP dataset) to accurately reflect the distribution and size of the UW COVID-19 patient dataset. 
We randomly sampled clinical terms and concepts that appeared in more than 100 patients' clinical records and populated the synthetic data with these terms. To better capture the record distribution, we created synthetic visit records and added them to individual patients until the synthetic record distribution resembled the real record distribution, ensuring that the numbers of patients with one visit, 10 visits, and 100 visits were similar between the two datasets.

We received valid submissions (models scoring an area under the receiver operating characteristic curve (AUROC) > 0.5) to Q1 from 18 teams. Over the course of the challenge, teams submitted multiple models that were scored against different versions of the COVID-19 patient datasets. For this analysis, we selected each team's best-performing model (highest AUROC), regardless of the data version on which the model achieved its highest score. We re-trained each of the models on the cumulative training dataset and evaluated them on the cumulative evaluation datasets to select the top 10 models used for the analysis. The Q1 model ranking is listed in eTable 4. We received valid submissions (models scoring AUROC > 0.5) to Q2 from 7 teams. The Q2 model ranking is listed in eTable 5.

AUROC was used as the primary scoring metric for assessing model performance. The Bayes factor, K (bootstrapped distributions, n = 1000), was computed to determine whether the AUROCs of two models were consistently different. If two models had a small Bayes factor (K < 19), we used the area under the precision-recall curve (AUPRC) as a tie-breaking metric. The Bayes factor was calculated as the number of bootstrap samples in which the current model won divided by the number in which the comparison model won. For example, during the bootstrapping for Home-Sweet-Home and UWisc-Madison-BMI, Home-Sweet-Home won 1000 times and UWisc-Madison-BMI won 0 times; so, in eTable 4, the Bayes factor of Home-Sweet-Home compared with UWisc-Madison-BMI is Inf.
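A minimal Python sketch of this bootstrapped win-ratio computation follows; the function name, the random-number handling, and the use of scikit-learn's roc_auc_score are our assumptions, not challenge infrastructure.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_bayes_factor(y_true, scores_a, scores_b, n_boot=1000, seed=0):
    """Win-ratio Bayes factor K between models A and B: the number of
    bootstrap resamples where A's AUROC beats B's, divided by the number
    where B beats A (1000 vs. 0 wins yields Inf, as in eTable 4).
    Inputs are 1-D NumPy arrays."""
    rng = np.random.default_rng(seed)
    wins_a = wins_b = 0
    n = len(y_true)
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)     # resample patients with replacement
        if len(np.unique(y_true[idx])) < 2:  # skip single-class resamples
            continue
        auc_a = roc_auc_score(y_true[idx], scores_a[idx])
        auc_b = roc_auc_score(y_true[idx], scores_b[idx])
        wins_a += auc_a > auc_b
        wins_b += auc_b > auc_a
    return np.inf if wins_b == 0 else wins_a / wins_b
```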
The methods developed by team Home Sweet Home include missing-value imputation, concept id ranking, feature extraction, a two-step feature selection, feature normalization, and classification. Information contained in the measurement, condition, observation, drug exposure, device exposure, procedure, and visit tables is used to derive features; age, gender, race, and ethnicity information is also included. Concept ids are ranked by their number of occurrences among patients who tested positive for COVID-19, which makes it possible to sort features by their importance for the pandemic. Home Sweet Home has been the best-performing team so far in Question 1 of the COVID-19 DREAM Challenge on the last six dataset versions. The software is available at https://aguedutrmy.sharepoint.com/:f:/g/personal/zafer_aydin_agu_edu_tr/EnA2BRYcJdpPoAk71tJcZ1EBz5Ck0KgnDhmKnocW4UwtPA?e=UDM6Qc.

We performed the following missing-value imputation steps on the data tables. Missing values in regular date fields or start date fields are filled with January 1st, 1900; missing values in end date fields are filled with January 1st, 2100; and missing values in the value_as_number field of the measurement table are filled with 0.0.

In each table of the training data, we ranked concept ids by their frequency of occurrence among positively labeled person_ids. In computing the frequencies, we eliminated duplicates caused by multiple entries for the same person.

We only considered data that could be related to the COVID-19 pandemic in the USA. For this purpose, we eliminated data whose measurement_date, condition_end_date, device_exposure_end_date, drug_exposure_end_date, procedure_date, or visit_date fields fall on or before January 1st, 2020. To meet the time quota restrictions of the challenge, we selected at most the first 100 concept ids from each table of the training set after the ranking process explained in Section 2.2. The same concept ids are also used to extract features for the test set, so that both sets have the same number of features.

Four types of features are extracted from the data tables: multi-instance learning based, count based, age, and one-hot-encoding based. These are explained in detail below.

For each measurement_concept_id selected from the measurement table, the minimum, maximum, and average of the value_as_number fields are computed as multi-instance learning (MIL) features. For a given person_id and measurement_concept_id, if measurement data exist after January 1st, 2020, the MIL features are computed using data in this timeframe only; otherwise, the MIL features are computed using data before this date, if available. For the measurement_concept_id 3003694, which represents blood and Rh group, an ordinal encoding approach is used that represents the feature values by integers from 0 to 7.

Count features are computed for the measurement, condition, observation, drug exposure, device exposure, procedure, and visit tables. For instance, the measurement count feature is the number of times a given measurement is present in the measurement table for a given person_id and measurement_concept_id. For drug exposure, the drug quantity is used as the count value; for the remaining tables, a count of 1 is used for each entry. No time-window constraint is applied when deriving the observation counts; for all other tables, only count data after January 1st, 2020 are used.

The age of each person is computed from the year-of-birth data in the person.csv file and is used as a single numeric feature. Gender, race, and ethnicity are represented using a one-hot encoding applied to each feature separately.

A wrapper-based feature selection strategy is applied to each data matrix of the training set separately (excluding age, gender, race, and ethnicity). A forward selection approach is used based on the ranking information derived as in Section 2.2: starting from the empty set, in each iteration one feature is added to the feature set, and a 2-fold cross-validation is performed on the data matrix using lightGBM as the classifier [1] and stratified sampling to assign samples to folds. The optimal feature set is the one that maximizes the AUPRC score. For MIL-based measurement features, once a feature is selected during the forward search, it is selected simultaneously from the three data matrices that contain the minimum, maximum, and average values. The optimal number of features found for each table of the training set is used to select features for the corresponding tables of the test set directly, using the same ranking information obtained for the training set.

The data matrices are then concatenated along the feature dimension (excluding the gender, race, and ethnicity matrices), and a second, embedded feature selection is performed on this concatenated matrix. For this purpose, the SelectFromModel module [2] of the scikit-learn library [3] for Python is employed, with the base learner set to lightGBM [1] with default settings. The same features selected for the training set are also selected for the test set. The data matrices, excluding those for gender, race, and ethnicity, are normalized to the interval [0, 1] using a min-max scaling strategy [4]. The data matrix obtained at the end of the second feature selection step is concatenated with the data matrices for gender, race, and ethnicity. In the training phase, a lightGBM classifier [1] with default settings is trained on this dataset, and its learned parameters are used to compute predictions in the evaluation phase.
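A condensed pandas sketch of the MIL measurement features described earlier in this section, assuming OMOP-style columns (person_id, measurement_concept_id, measurement_date, value_as_number); the helper is illustrative, not the team's code.

```python
import pandas as pd

CUTOFF = pd.Timestamp("2020-01-01")

def mil_features(measurement: pd.DataFrame, concept_ids: list) -> pd.DataFrame:
    """Min/max/mean of value_as_number per (person_id, measurement_concept_id),
    using post-cutoff rows when a person has them and older rows otherwise."""
    m = measurement[measurement["measurement_concept_id"].isin(concept_ids)].copy()
    m["measurement_date"] = pd.to_datetime(m["measurement_date"])
    m["recent"] = m["measurement_date"] > CUTOFF
    # True for every row of a (person, concept) group that has any recent row
    has_recent = m.groupby(["person_id", "measurement_concept_id"])["recent"].transform("max")
    m = m[m["recent"] | ~has_recent]          # prefer recent data, else fall back
    agg = (m.groupby(["person_id", "measurement_concept_id"])["value_as_number"]
             .agg(["min", "max", "mean"]))
    return agg.unstack("measurement_concept_id")   # one row per person_id
```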
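In scikit-learn and LightGBM terms, the two-step selection and final training stage might look roughly as follows; the function names, the tiny grid of prefixes, and the commented-out wiring are our assumptions, and "average_precision" stands in for the AUPRC objective.

```python
import numpy as np
from lightgbm import LGBMClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.preprocessing import MinMaxScaler

def forward_select(X, y, ranked_cols):
    """Greedy forward selection over columns pre-ranked by frequency among
    COVID-positive patients (Section 2.2); keeps the prefix that maximizes
    2-fold stratified CV AUPRC. X is a DataFrame; the MIL min/max/mean
    tie-in is omitted for brevity."""
    cv = StratifiedKFold(n_splits=2, shuffle=True, random_state=0)
    best_score, best_k = -np.inf, 0
    for k in range(1, len(ranked_cols) + 1):
        score = cross_val_score(LGBMClassifier(), X[ranked_cols[:k]], y,
                                scoring="average_precision", cv=cv).mean()
        if score > best_score:
            best_score, best_k = score, k
    return ranked_cols[:best_k]

# Second, embedded selection on the concatenated matrix, then the final model:
# X_sel = SelectFromModel(LGBMClassifier()).fit_transform(X_concat, y)
# X_sel = MinMaxScaler().fit_transform(X_sel)                  # scale to [0, 1]
# model = LGBMClassifier().fit(np.hstack([X_sel, X_demo]), y)  # + demographics
```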
Table 1 shows the leaderboard results of team Home Sweet Home for four dataset versions of Question 1; the team ranked first on these datasets.

Our proposed method for Question 2 of the EHR DREAM Challenge for COVID-19 prediction focuses on diverse model selection with the fewest domain assumptions. We find that gradient-boosted trees are competitive against more sophisticated embedding-based methods and achieve the highest performance on Q2 without consideration of data date ranges, imputation, or other feature engineering. Our method performs a hyperparameter grid search over the boosted tree models AdaBoost [1] and CatBoost [2]. We select the model with the maximum mean AUROC over k = 3 model instantiations using random training/validation partitions (p = 0.5) of the input training data. All models are evaluated over the same training/validation samples (i.e., exactly the same k = 3 partitions are used throughout model selection). We build a binary feature vector over all concepts, over all time within the patient history.

This methodology allows novel competing models to be added quickly to the model selection without loss of performance (e.g., the prior best model may still be selected). For example, we included neural network embedding methods in the model selection but did not see improved performance. We had limited visibility into the model selected on the real data, so we could not confirm that a gradient-boosting model was chosen. However, we found larger improvements from a wider gradient-boosting hyperparameter search and from more robust train/validation sampling at increased k. Without a compute budget, model selection over all of these options would strictly improve performance, at the cost of higher compute. We did not consider any temporal windowing to train only on "recent" visits. We started with the simplest model and found it surprisingly competitive without these filters. It is likely that the predictive features are rare within the entirety of the patient record, so the learned associations may not incur Type I errors due to prior events unrelated to recent COVID-19 treatments.
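A rough sketch of this selection loop, under our own naming and a deliberately tiny candidate grid; the real grid was wider, and X here is assumed to hold the binary per-concept feature vectors described above.

```python
import numpy as np
from catboost import CatBoostClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Illustrative candidate grid; the real search was wider.
candidates = [AdaBoostClassifier(n_estimators=n) for n in (100, 300)] + \
             [CatBoostClassifier(depth=d, verbose=0) for d in (4, 6)]

def select_model(X, y, k=3):
    """Return the candidate with the highest mean validation AUROC over k
    fixed random 50/50 train/validation partitions; every candidate sees
    exactly the same k partitions."""
    splits = [train_test_split(X, y, test_size=0.5, random_state=i)
              for i in range(k)]
    def mean_auroc(model):
        aucs = []
        for X_tr, X_va, y_tr, y_va in splits:
            model.fit(X_tr, y_tr)
            aucs.append(roc_auc_score(y_va, model.predict_proba(X_va)[:, 1]))
        return float(np.mean(aucs))
    return max(candidates, key=mean_auroc)
```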
An important aspect of this challenge was that participating teams were unable to see real patient data. The simulated data provided for model testing gave little information about which features might correlate with patient hospitalization risk, so we built a flexible framework for evaluating alternative models and automated feature engineering. We chose the ExtraTreesClassifier from the Scikit-Learn Python library (https://scikit-learn.org) based on empirical performance. Other models we considered included XGBoost [1] and LightGBM [2].

Training: Our model was trained by first performing the preprocessing and feature extraction described below; simple random oversampling of the existing positive examples was then used to achieve an overall positive ratio of 40%. This augmented dataset was used to train the ExtraTreesClassifier with the following parameters: 'n_estimators': 1800, 'max_depth': 4, 'max_features': 'sqrt', 'bootstrap': True, 'n_jobs': -1, 'oob_score': True, 'class_weight': 'balanced_subsample', 'random_state': 0. A secondary logistic regression model (LogisticRegression from Scikit-Learn) was fit to the predictions produced by the ExtraTreesClassifier. Its presence provided a consensus model for ensembles; it was active in this version even though only one ExtraTreesClassifier was employed.

We utilized information from the person, measurement, observation, condition_occurrence, and drug_exposure tables, as well as dates of contact from the procedure_occurrence, device_exposure, and visit_occurrence tables, to determine the last recorded date of contact with each patient. Features from the person table were race (5 features), ethnicity (2 features), and gender (2 features), plus an engineered age feature derived from the difference between a patient's date of birth and last contact date. We used 38 raw features from the condition_occurrence table, selected manually, and 11 hand-selected raw features from the drug_exposure table. From the measurement table, 207 raw features were used, selected by a combination of manual choice and random selection. The observation table provided four hand-selected features: "Cardiac rhythm", "Blood pressure method", "Tobacco user", and "History of alcohol use".

After the raw features were selected, we performed an automated feature engineering and expansion process. A Boolean feature valid_test_flag was added, indicating whether the patient had an entry for a COVID-19 test within 21 days of the last contact date. For each raw feature, we derived and added the following: a Boolean indicating whether the value was present or missing, the standardized feature (μ = 0, σ = 1), the standard deviation, a numeric measure of the feature's sparsity, the oldest available (baseline) value, and the relative age of that baseline value (see the naming pattern and the training sketch below).
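A minimal sketch of the training procedure described in this section: the ExtraTreesClassifier hyperparameters are those listed above, while the oversampling helper and the placeholder X_train/y_train names are ours.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.linear_model import LogisticRegression

def oversample_to_ratio(X, y, target_pos_ratio=0.40, seed=0):
    """Randomly duplicate positive rows until positives make up ~40% of rows."""
    rng = np.random.default_rng(seed)
    pos, neg = np.where(y == 1)[0], np.where(y == 0)[0]
    # n_pos / (n_pos + n_neg) = r  =>  n_pos = r * n_neg / (1 - r)
    n_needed = int(target_pos_ratio * len(neg) / (1 - target_pos_ratio))
    extra = rng.choice(pos, size=max(n_needed - len(pos), 0), replace=True)
    idx = np.concatenate([neg, pos, extra])
    return X[idx], y[idx]

X_bal, y_bal = oversample_to_ratio(X_train, y_train)  # X_train/y_train assumed
trees = ExtraTreesClassifier(
    n_estimators=1800, max_depth=4, max_features="sqrt", bootstrap=True,
    n_jobs=-1, oob_score=True, class_weight="balanced_subsample", random_state=0,
).fit(X_bal, y_bal)
# Secondary consensus model: logistic regression on the tree model's scores.
stacker = LogisticRegression().fit(trees.predict_proba(X_bal)[:, [1]], y_bal)
```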
After the feature expansion, our dataset contained 15,232 features, named according to the following pattern:

FEATURENAME_tX (X ∈ [0, 11]): the raw (12−X)th-newest measurement recorded for FEATURENAME
FEATURENAME_notna: flag indicating that the value of FEATURENAME was not missing
FEATURENAME_normed_tX (X ∈ [0, 11]): the standardized (μ = 0, σ = 1) (12−X)th-newest measurement recorded for FEATURENAME
FEATURENAME_std: standard deviation for this feature
FEATURENAME_sprc: a measure of sparsity for this feature
FEATURENAME_bl: oldest available value for FEATURENAME
FEATURENAME_tdelta: the relative age of FEATURENAME_bl (a larger value implies an older baseline)

The automated feature expansion process produces redundant features for the values "age", "gender", "race", "ethnicity", and "delta", since those values cannot re-occur but are still expanded into "timestep" and "baseline" columns, and some are Boolean in nature but still have a "normed" version and standard deviation computed and added to the dataset. For this reason, features such as race_8552_t8 and race_8552_t9 contain identical information and should be viewed as a single feature. Any of these might be selected to play a role in the underlying decision trees.

References: [1] XGBoost: A Scalable Tree Boosting System. [2] LightGBM: A Highly Efficient Gradient Boosting Decision Tree.

Contact: J.G. (jifan.gao@wisc.edu), G.C. (gchen25@wisc.edu). The task of Question 1 is to predict the risk that a patient's first SARS-CoV-2 test is positive given the patient's past EHR. We process diagnosis and measurement information and build a LightGBM model to make the prediction. The measurement concepts used include: '3023314': Hematocrit; '3013650': Neutrophils; '3004327': Lymphocytes; '3016502': SpO2; '4196147': Peripheral oxygen saturation; '3044938': Influenza virus A RNA [Presence] in Unspecified specimen by NAA with probe detection; '3044254': Respiratory syncytial virus RNA [Presence] in Unspecified specimen by NAA with probe detection; '3042596': Human coronavirus RNA [Presence] in Unspecified specimen by NAA with probe detection; '3042194': Human metapneumovirus RNA [Presence] in Unspecified specimen by NAA with probe detection; '3038297': Parainfluenza virus 4 RNA [Presence] in Unspecified specimen by NAA with probe detection.
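A skeletal illustration of this setup; the per-patient feature construction (X, y, X_new) is assumed rather than specified by the team, and the model is shown with default LightGBM settings.

```python
from lightgbm import LGBMClassifier

# Measurement concepts listed above (OMOP concept IDs).
CONCEPTS = [3023314, 3013650, 3004327, 3016502, 4196147,
            3044938, 3044254, 3042596, 3042194, 3038297]

# X: one row per patient with, e.g., the most recent value_as_number for each
# concept in CONCEPTS plus diagnosis indicators; y: first SARS-CoV-2 test result.
model = LGBMClassifier()                      # gradient-boosted trees
# model.fit(X, y)
# risk = model.predict_proba(X_new)[:, 1]     # probability of a positive test
```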