key: cord-0835020-jdv4xa7x
authors: Farhang-Sardroodi, S.; Ghaemi, M.; Craig, M.; Ooi, H. K.; Heffernan, J. M.
title: A Machine Learning Approach to Differentiate Between COVID-19 and Influenza Infection Using Synthetic Infection and Immune Response Data
date: 2022-01-29
journal: nan
DOI: 10.1101/2022.01.27.22269978
sha: 06dcccd3e5d8e0e96eabf1b031a3b758dc2fdbe7
doc_id: 835020
cord_uid: jdv4xa7x

Data analysis is widely used to generate new insights into human disease mechanisms and to provide better treatment methods. In this work, we used mechanistic models of viral infection to generate synthetic data of influenza and COVID-19 patients. We then developed and validated a supervised machine learning model that can distinguish between the two infections. Influenza and COVID-19 are contagious respiratory illnesses that are caused by different pathogenic viruses but present with similar initial symptoms. Despite sharing the same primary signs, COVID-19 can produce more severe symptoms and illness, and higher mortality. The performance of the predictive model was externally evaluated by the ROC AUC metric (area under the receiver operating characteristic curve) on 100 virtual patients from each cohort, and the model achieved an AUC of at least ≈ 91% using our multiclass classifier. The current investigation highlights the ability of machine learning models to accurately identify two different diseases based on major components of viral infection and immune response. Through the feature selection process, the model predicted a dominant role for viral load and productively infected cells.

Figure 1: Schematic of viral infection. Each target cell, T, is infected by a virus, V, at a constant rate β. During the eclipse period, first-stage infected cells, I_1, become productively infected cells, I_2, at a constant rate k. Each productively infected cell, I_2, produces virus at rate p and IFNI at rate q, and dies at rate δ per cell.
IFNI hinders viral infection by converting target cells to a virus-resistant state at a constant rate ϕ and decays at rate d. Free virus particles, which can be influenza viruses or coronaviruses, are cleared at per-capita rate c.

2 Method

2.1 Mechanistic models

We employed a target-cell limited model of in-host viral dynamics using five differential equations that track susceptible target cells (T), infected cells in the eclipse phase (I_1), productively infected cells (I_2), virus (V), and interferon (F). Figure 1 presents a flow diagram of the model. The system of ordinary differential equations is as follows:

\[
\begin{aligned}
\frac{dT}{dt} &= -\beta T V - \phi F T, \\
\frac{dI_1}{dt} &= \beta T V - k I_1, \\
\frac{dI_2}{dt} &= k I_1 - \delta I_2, \\
\frac{dV}{dt} &= p I_2 - c V, \\
\frac{dF}{dt} &= q I_2 - d F.
\end{aligned}
\]

Productively infected cells produce new virus particles at rate p, and the virus particles are cleared from the system at rate c. We assumed that productively infected target cells have a death [...] Table 1. We assumed that the initial number of target cells, T_0, is equal to the total [...] with initial values estimated as in Table 2.

[...] (2 × t-value) and then multiplying by the square root of the sample size. Standard errors must be of means calculated from within each parameter confidence interval. The t-value for a 95% confidence interval from a sample size of N was then obtained in Mi-

The copyright holder for this preprint (this version posted January 29, 2022) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.

To distinguish patients who have COVID-19 from those who have influenza, we developed a predictive model based on a selection of biological features. Accordingly, we adopted logistic regression with ℓ1-regularization, referred to as Lasso (least absolute shrinkage and selection operator).
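The synthetic features that feed this classifier come from the mechanistic model of Section 2.1, so it may help to see how such a trajectory could be generated. The following is a minimal sketch, assuming the five-compartment system implied by the rates in the schematic caption (β, k, δ, p, c, q, ϕ, d) and integrating it with a hand-written fourth-order Runge-Kutta step. The parameter values and initial state are illustrative placeholders in normalized units, not the values of Table 1 or Table 2, and the function names (`rhs`, `rk4_step`, `simulate`) are hypothetical.

```python
def rhs(state, prm):
    """Right-hand side of the assumed target-cell limited model with interferon."""
    T, I1, I2, V, F = state
    infection = prm["beta"] * T * V          # new infections per unit time
    return (
        -infection - prm["phi"] * F * T,     # target cells: infected or made resistant
        infection - prm["k"] * I1,           # eclipse-phase cells
        prm["k"] * I1 - prm["delta"] * I2,   # productively infected cells
        prm["p"] * I2 - prm["c"] * V,        # free virus: production minus clearance
        prm["q"] * I2 - prm["d"] * F,        # interferon: production minus decay
    )


def rk4_step(state, prm, dt):
    """Advance the state by one classical Runge-Kutta (RK4) step of size dt."""
    def shifted(base, slope, h):
        return tuple(b + h * s for b, s in zip(base, slope))

    k1 = rhs(state, prm)
    k2 = rhs(shifted(state, k1, dt / 2), prm)
    k3 = rhs(shifted(state, k2, dt / 2), prm)
    k4 = rhs(shifted(state, k3, dt), prm)
    return tuple(
        s + dt / 6 * (a + 2 * b + 2 * c + d)
        for s, a, b, c, d in zip(state, k1, k2, k3, k4)
    )


def simulate(prm, state0, t_end=10.0, dt=0.01):
    """Return the list of states (T, I1, I2, V, F) from t = 0 to t_end."""
    steps = int(round(t_end / dt))
    traj = [tuple(state0)]
    for _ in range(steps):
        traj.append(rk4_step(traj[-1], prm, dt))
    return traj


# Illustrative placeholder values in normalized units (NOT taken from the paper).
ILLUSTRATIVE_PARAMS = {
    "beta": 5.0, "k": 4.0, "delta": 2.0, "p": 10.0,
    "c": 5.0, "q": 1.0, "phi": 1.0, "d": 1.0,
}
INITIAL_STATE = (1.0, 0.0, 0.0, 1e-3, 0.0)  # (T0, I1, I2, V0, F0), hypothetical

if __name__ == "__main__":
    trajectory = simulate(ILLUSTRATIVE_PARAMS, INITIAL_STATE)
    peak_virus = max(s[3] for s in trajectory)
    print("peak viral load (normalized units):", peak_virus)
```

Repeating such a simulation with parameters sampled per virtual patient would yield one feature vector (for example T, I_1, I_2, V, F at chosen time points) per patient; how the sampling was actually done is specified by the parameter tables rather than by this sketch.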
Logistic regression, a generalized linear model used for binary classification, is defined by the following sigmoid function

\[
h_\theta(X) = \frac{1}{1 + e^{-X\theta}} \qquad (3)
\]

in which X is the (n × p) model feature matrix of n = 100 patients and p = 5 biological hallmarks, and θ denotes the vector of regression coefficients. Defining the cost/objective function (C) of logistic regression in mean squared error form leads to a non-convex problem that is difficult to optimize. Therefore, it is represented by the following equations

\[
C\big(h_\theta(x), y\big) =
\begin{cases}
-\log\big(h_\theta(x)\big), & y = 1, \\
-\log\big(1 - h_\theta(x)\big), & y = 0,
\end{cases}
\]

where Y is a binary response vector of outcomes (COVID-19 vs. flu). Compressing the above two equations into a single function, with x_i the i-th row of X and y_i the i-th entry of Y, we have

\[
C(\theta) = -\frac{1}{n} \sum_{i=1}^{n} \Big[ y_i \log h_\theta(x_i) + (1 - y_i) \log\big(1 - h_\theta(x_i)\big) \Big].
\]

Replacing the sigmoid function from equation (3) and adding a penalty term with regularization strength λ ≥ 0 gives

\[
C(\theta) = -\frac{1}{n} \sum_{i=1}^{n} \Big[ y_i \, x_i \theta - \log\big(1 + e^{x_i \theta}\big) \Big] + \lambda \sum_{j=1}^{p} |\theta_j|.
\]

The penalty term, called the ℓ1-regularization term, is added to prevent over-fitting. The model objective is to find the solution that minimizes this cost function.

For model training and testing, we used a K-fold cross-validation strategy, a re-sampling method for evaluating machine learning models on a limited data sample. The procedure has a single parameter, K, which specifies the number of groups into which a given data sample is split; hence the name K-fold cross-validation. Therefore, our regression model is not tailored to a particular data set and is exposed to all available samples of a given subject in the training set. This approach implies that the training procedure was entirely blinded to the synthetic patient data sets, and ensures the presumed independence from any intra-subject correlations that is required for Lasso classification.
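To make the combination of ℓ1-penalized logistic regression and K-fold cross-validation concrete, the following is a minimal sketch in plain Python. It is not the authors' implementation: the solver used here is proximal gradient descent with soft-thresholding, which is one common way to minimize an ℓ1-penalized cross-entropy, and the source does not state which solver was used. The toy data generator, the names `w`, `b`, `lam`, and all numeric settings are assumptions made for illustration only.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(w, b, x):
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, x)))


def lasso_cost(w, b, X, y, lam):
    """Mean cross-entropy plus the l1 penalty lam * sum(|w_j|) (intercept unpenalized)."""
    eps = 1e-12
    total = 0.0
    for xi, yi in zip(X, y):
        h = predict_proba(w, b, xi)
        total -= yi * math.log(h + eps) + (1 - yi) * math.log(1 - h + eps)
    return total / len(X) + lam * sum(abs(wj) for wj in w)


def soft_threshold(v, t):
    """Proximal operator of t * |v|: shrink v toward zero by t."""
    return math.copysign(max(abs(v) - t, 0.0), v)


def fit_lasso_logistic(X, y, lam=0.05, lr=0.3, n_iter=300):
    """Proximal gradient descent: gradient step on the loss, then soft-threshold w."""
    n, p = len(X), len(X[0])
    w, b = [0.0] * p, 0.0
    for _ in range(n_iter):
        grad_w, grad_b = [0.0] * p, 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(w, b, xi) - yi
            grad_b += err / n
            for j in range(p):
                grad_w[j] += err * xi[j] / n
        b -= lr * grad_b
        w = [soft_threshold(wj - lr * gj, lr * lam) for wj, gj in zip(w, grad_w)]
    return w, b


def k_fold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and split into k disjoint folds of near-equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_validate(X, y, k=5, lam=0.05):
    """Mean held-out accuracy over k folds; each fold is the test set exactly once."""
    accuracies = []
    for test_idx in k_fold_indices(len(X), k):
        held_out = set(test_idx)
        train = [i for i in range(len(X)) if i not in held_out]
        w, b = fit_lasso_logistic([X[i] for i in train], [y[i] for i in train], lam)
        correct = sum((predict_proba(w, b, X[i]) >= 0.5) == (y[i] == 1) for i in test_idx)
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / k


def make_toy_data(n_per_class=50, p=5, seed=1):
    """Hypothetical two-class data: only feature 0 carries signal (NOT the paper's data)."""
    rng = random.Random(seed)
    X, y = [], []
    for label in (0, 1):
        shift = 2.0 if label == 1 else -2.0
        for _ in range(n_per_class):
            row = [rng.gauss(0.0, 1.0) for _ in range(p)]
            row[0] += shift
            X.append(row)
            y.append(label)
    return X, y


if __name__ == "__main__":
    X, y = make_toy_data()
    w, b = fit_lasso_logistic(X, y)
    print("coefficients:", [round(v, 3) for v in w])
    print("5-fold accuracy:", cross_validate(X, y))
```

Because the ℓ1 penalty shrinks weak coefficients toward exactly zero, the fitted coefficient vector doubles as a feature selector, which is the mechanism behind the feature-importance statements in the abstract. In practice a library solver would normally replace the hand-written loop.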
[...] is shown in three-dimensional scatter plots in Figure 6 of the ground truth and regression pre-

[...] can be cheaper to conduct, but these data can also be subject to inconsistencies and bias, affecting classification outcomes. In a future study, we will expand our analysis to a model of in-host measurements and observational data to determine whether the specific combinations of in-host and observational data that best classify influenza and COVID-19 infections differ.

medRxiv preprint (which was not certified by peer review): https://doi.org/10.1101/2022.01.27.22269978

[...] also be employed, and require only small changes to our method to include this.
We find that [...]