key: cord-0816053-cpjk5ibc authors: Tschoellitsch, Thomas; Dünser, Martin; Böck, Carl; Schwarzbauer, Karin; Meier, Jens title: Machine Learning Prediction of SARS-CoV-2 Polymerase Chain Reaction Results with Routine Blood Tests date: 2020-12-19 journal: Lab Med DOI: 10.1093/labmed/lmaa111 sha: 3d636891f7e753628a0d3cdabd49e817be64c842 doc_id: 816053 cord_uid: cpjk5ibc
OBJECTIVE: The diagnosis of COVID-19 is based on the detection of SARS-CoV-2 in respiratory secretions, blood, or stool. Currently, reverse transcription polymerase chain reaction (RT-PCR) is the most commonly used method to test for SARS-CoV-2. METHODS: In this retrospective cohort analysis, we evaluated whether machine learning could exclude SARS-CoV-2 infection using routinely available laboratory values. A Random Forests algorithm with 1353 unique features was trained to predict the RT-PCR results. RESULTS: Out of 12,848 patients undergoing SARS-CoV-2 testing, routine blood tests were simultaneously performed in 1528 patients. The machine learning model could predict SARS-CoV-2 test results with an accuracy of 86% and an area under the receiver operating characteristic curve of 0.74. CONCLUSION: Machine learning methods can reliably predict a negative SARS-CoV-2 RT-PCR test result using standard blood tests.
In this retrospective cohort analysis, we evaluated whether machine learning could exclude SARS-CoV-2 infection using routinely available laboratory values. To this end, we extracted demographic, clinical, and laboratory data together with concurrent (ie, within a 24-hour window) SARS-CoV-2 RT-PCR test results (Cobas SARS-CoV-2, Roche, Freiburg, Germany and Real-Time PCR Assay, BioProducts Genesig, Camberley, United Kingdom) from the electronic charts of patients in whom a SARS-CoV-2 test was performed at the Kepler University Hospital in Linz, Austria, from March 1, 2020, until April 30, 2020. Only laboratory results obtained within 24 hours of admission were used.
We trained a machine learning model (the Random Forests algorithm) 4 using R version 3.6.3 5 and the packages randomForest 4.6-14, Boruta 7.0.0, psych 2.0.9, pROC 1.16.2, ROCR 1.0-11, Amelia 1.7.6, caret 6.0-86, and ranger 0.12.1. 6 The model was trained on laboratory data with 1353 unique features, of which 28 were used in the final model. The following standard laboratory values were included: blood count, electrolytes, C-reactive protein, creatinine, blood urea nitrogen, liver enzymes, bilirubin, cholinesterase, and prothrombin time.
Before model training, the dataset underwent extensive preprocessing and cleaning. The data cleaning included detection of typos and out-of-range values and the imputation of missing values; features with more than 25% missing values were excluded. The remaining missing values were imputed using Strawman imputation, which replaces missing data with the median (continuous variables) or the most frequently occurring value (categorical variables). The Strawman imputation method yielded results comparable to other, more complicated methods (eg, the "missForest" technique 7 ). Censored numerical data were truncated (eg, "<0.1" was replaced by 0.1). Categorical features with more than 2 values were one-hot encoded (ie, one binary indicator per category). Ordinal features were encoded as positive integers. Binary and numerical features were included unchanged.
To determine model performance, we conducted nested cross-validation. The hyperparameter search was conducted via grid search in the inner five-fold cross-validation loop, and model performance was estimated in the outer five-fold loop. The study protocol was approved by the Ethics Committee of Upper Austria (No. 1104/2020).
Out of 12,848 patients undergoing SARS-CoV-2 testing, routine blood tests were performed concurrently in 1528 patients, who were then included in the statistical analysis (Table 1). Of the 1528 study participants, 65 tested positive for SARS-CoV-2.
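The preprocessing rules described in the methods (exclusion of features with more than 25% missing values, median or mode imputation, truncation of censored values, and one-hot encoding) can be illustrated with a minimal sketch. The study itself used R; the Python code below is only an illustration under stated assumptions, not the authors' implementation. All function names, the toy columns `crp` and `ward`, and their values are hypothetical.

```python
from collections import Counter
from statistics import median

def parse_censored(value):
    """Truncate a censored lab string such as "<0.1" to its numeric bound (0.1)."""
    if isinstance(value, str):
        return float(value.lstrip("<>= "))
    return value

def missing_fraction(column):
    """Fraction of entries in a column that are missing (None)."""
    return sum(v is None for v in column) / len(column)

def impute_strawman(column, categorical=False):
    """Replace None with the median (continuous) or the most frequent value (categorical)."""
    observed = [v for v in column if v is not None]
    if categorical:
        fill = Counter(observed).most_common(1)[0][0]
    else:
        fill = median(observed)
    return [fill if v is None else v for v in column]

def one_hot(column):
    """Return one 0/1 indicator list per distinct category."""
    return {c: [1 if v == c else 0 for v in column] for c in sorted(set(column))}

# Hypothetical toy columns (not study data).
crp = ["<0.1", 2.4, None, 0.8, 5.0, 1.2]        # continuous, one censored, one missing
ward = ["ER", "ICU", None, "ER", "general"]     # categorical with more than 2 values

# Keep a feature only if at most 25% of its values are missing.
keep_crp = missing_fraction(crp) <= 0.25

crp_numeric = [None if v is None else parse_censored(v) for v in crp]
crp_imputed = impute_strawman(crp_numeric)
ward_imputed = impute_strawman(ward, categorical=True)
ward_encoded = one_hot(ward_imputed)
```

In this sketch the imputation value is computed from the whole column for brevity; in a cross-validated setting one would normally derive it from the training folds only, so that no information leaks from the held-out fold.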
After data cleaning, 1357 study participants were analyzed. As calculated from the confusion matrix (Table 2), the machine learning model was able to predict SARS-CoV-2 test results with an accuracy of 81%, an area under the ROC curve of 0.74 (Figure 1A), a sensitivity of 60%, and a specificity of 82%. The positive and negative predictive values were 13% and 99%, respectively (see also Figure 1B).
Our results suggest that machine learning methods can predict SARS-CoV-2 RT-PCR results using routine blood values with fair accuracy. Although, from a bedside perspective, the value of such a model for predicting a positive SARS-CoV-2 test result was poor, the high negative predictive value of 99% allows clinicians to predict a negative SARS-CoV-2 test result with acceptable safety. The machine learning algorithm used, Random Forests, although not new, is a proven and effective method.
When evaluating the feature importance reported by the machine learning models, leukocyte count ranked as the most important feature. Elevated white blood cell counts have been observed early on in COVID-19 and have been linked to inflammation, similar to an increase in the neutrophil-to-lymphocyte ratio. 8 Another highly ranked feature, hemoglobin level, has been associated with mortality from COVID-19. 9 Changes in serum calcium are considered important for various viral functions, such as structure, gene expression, and release, and for promoting inflammation pathways linked to lung cell damage and edema formation. 10, 11
Our results may have relevant clinical implications, particularly for settings where SARS-CoV-2 RT-PCR testing is not readily available and/or personal protective equipment is in short supply. Although World Health Organization (WHO) considerations have defined acceptable and desirable price ranges for large-volume SARS-CoV-2 RT-PCR testing, given demand relative to general availability, currently reported prices commonly exceed these recommendations by a factor of 10 or more. 
12, 13 In contrast, commonly reported reference costs of the routinely ordered laboratory tests that were identified as features of high importance in our prediction model are well below the WHO-designated desirable range for SARS-CoV-2 RT-PCR tests. 14 From an economic point of view, it can therefore be considered beneficial to employ the presented model as support for clinical decision-making.
When interpreting the results of our analysis, two limitations must be considered. First, RT-PCR test results can be false-negative or false-positive. 15 This possibility impairs the validity of the model for predicting true-negative RT-PCR results. Second, although 1357 study patients were included in our analysis, the sample size may still be considered low for machine learning methods, especially given the class imbalance of the classification problem. Inclusion of more patients may therefore have yielded more valid results.
In conclusion, machine learning methods can reliably predict a negative SARS-CoV-2 RT-PCR test result using standard blood values.
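The performance measures reported in this study (accuracy, sensitivity, specificity, and the positive and negative predictive values) all follow from the four cells of a 2x2 confusion matrix. The sketch below shows the standard formulas in Python. The counts are hypothetical and chosen only to mimic a low-prevalence setting; they are not the values from Table 2, and the function name `diagnostic_metrics` is an assumption made for illustration.

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Standard diagnostic measures from confusion-matrix counts.

    tp/fn: PCR-positive patients predicted positive/negative
    fp/tn: PCR-negative patients predicted positive/negative
    """
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),   # share of PCR-positives detected
        "specificity": tn / (tn + fp),   # share of PCR-negatives recognized
        "ppv": tp / (tp + fp),           # P(PCR positive | predicted positive)
        "npv": tn / (tn + fn),           # P(PCR negative | predicted negative)
    }

# Hypothetical counts for a cohort with 5% prevalence (50 of 1000 positive).
metrics = diagnostic_metrics(tp=30, fp=200, fn=20, tn=750)

for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these illustrative counts, a moderate sensitivity and specificity still produce a low positive predictive value and a high negative predictive value, because negatives vastly outnumber positives. This is the same asymmetry that makes a model of this kind more suitable for ruling out than for ruling in a positive RT-PCR result.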
1. Detection of 2019 novel coronavirus (2019-nCoV) by real-time RT-PCR
2. The presence of SARS-CoV-2 RNA in the feces of COVID-19 patients
3. Diagnostic testing for severe acute respiratory syndrome-related coronavirus 2: a narrative review
4. Random forests
5. The R Project for Statistical Computing. The R Foundation
6. The Comprehensive R Archive Network. The R Foundation
7. MissForest - non-parametric missing value imputation for mixed-type data
8. Dysregulation of immune response in patients with coronavirus 2019 (COVID-19) in Wuhan, China
9. Leukocytosis and alteration of hemoglobin level in patients with severe COVID-19: association of leukocytosis with mortality
10. Viral calciomics: interplays between Ca2+ and virus
11. Low levels of total and ionized calcium in blood of COVID-19 patients
12. "Test, re-test, re-test": using inaccurate tests to greatly increase the accuracy of COVID-19 testing
13. COVID-19 target product profiles for priority diagnostics to support response to the COVID-19 pandemic v.1.0. World Health Organization website
14. Estimated costs of 51 commonly ordered laboratory tests in Canada
15. Estimating the false-negative test probability of SARS-CoV-2 by RT-PCR