key: cord-0836310-p5atqdk0
authors: Kuzmin, Kiril; Adeniyi, Ayotomiwa Ezekiel; DaSouza, Arthur Kevin; Lim, Deuk; Nguyen, Huyen; Molina, Nuria Ramirez; Xiong, Lanqiao; Weber, Irene T.; Harrison, Robert W.
title: Machine learning methods accurately predict host specificity of coronaviruses based on spike sequences alone
date: 2020-09-18
journal: Biochem Biophys Res Commun
DOI: 10.1016/j.bbrc.2020.09.010
sha: 7b1a2ad1f0748fd50e89548e4ddb0aaa337f8054
doc_id: 836310
cord_uid: p5atqdk0

Coronaviruses infect many animals, including humans, due to interspecies transmission. Three of the known human coronaviruses (MERS, SARS-CoV-1, and SARS-CoV-2, the pathogen of the COVID-19 pandemic) cause severe disease. Improved methods to predict host specificity of coronaviruses will be valuable for identifying and controlling future outbreaks. The coronavirus S protein plays a key role in host specificity by attaching the virus to receptors on the cell membrane. We analyzed 1238 spike sequences for their host specificity. Spike sequences readily segregate in t-SNE embeddings into clusters of similar hosts and/or virus species. Machine learning with SVM, Logistic Regression, Decision Tree, and Random Forest gave high average accuracies, F1 scores, sensitivities, and specificities of 0.95–0.99. Importantly, sites identified by Decision Tree correspond to protein regions with known biological importance. These results demonstrate that spike sequences alone can be used to predict host specificity.

The COVID-19 pandemic has heightened public awareness of coronaviruses (CoVs) and our vulnerability to highly con-

CoVs only infect mammals, including humans, while delta and gamma CoVs are known to infect birds and some mammals, but have not yet been shown to infect humans [3]. Seven species are known to infect humans.
HCoV-NL63, HCoV-229E, HCoV-HKU1, HCoV-OC43, and BCoV-1 cause mild respiratory dis-

The overall distribution of the data with respect to viral species and hosts is shown in Figure 1. We used MEGA X software (https://www.megasoftware.net) to align the sequences. After alignment, all sequences were represented with an identical length of 2,396 residues. We applied the well-known one-hot encoding to convert the sequences into numerical vectors for input to machine learning algorithms. The amino acid sequences con-

However, the latter method demonstrated slightly worse results as compared to t-SNE.

The most important parameter of t-SNE, perplexity, con-

and their hosts. We note that t-SNE without special adjustments may distort distances between far apart clusters [11]. Therefore,

were similar to k = 10 and not shown in Table S2). Notably, on SVD-reduced inputs, the means of accuracy, F1-score, sensitivity, and specificity tend to increase as k grows, while on non-reduced inputs, they vary with no particular de-

Table 2: k-fold cross-validations for S_Hum performed by DT and SVM classifiers run on non-reduced and SVD-reduced inputs, respectively. The results are presented as mean (µ) ± standard deviation (σ) of 4 measures of performance: accuracy (Ac), F1-score (F1), sensitivity (Sn), and specificity (Sp). The best performances (with the greatest value of µ − σ) are shown in bold.

Table 3 shows good results for all performed classifications across the 4 statistical metrics used (accuracy, F1-score, sensitivity, and specificity), reaching more than 98% for S_Hum, 95% for H_A/S, and 94% for H_Mam.

Important sites. We used DT to identify important sites in the S_Hum classification (see Table 4). Only two sites (1483 and 2258) had a high importance, greater than 0.80. Remarkably, they appeared in each run of the DT classifier, independently of the number of splits k.
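The sites reported by DT are positions in the 2,396-residue alignment, whereas a classifier trained on one-hot vectors assigns importance to individual columns. The following is a minimal sketch of how the two could be related; it is not the authors' implementation. The 21-symbol alphabet (20 amino acids plus a gap), the position-major column order, the 1-based site numbering, and the helper names `one_hot`, `feature_to_site`, and `site_importance` are all assumptions made for illustration. Under this assumed alphabet, a 2,396-residue alignment would yield vectors of 2,396 × 21 = 50,316 columns.

```python
# Assumed alphabet: 20 standard amino acids plus '-' for alignment gaps.
# The paper does not state the alphabet or the column order it used.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"
INDEX = {aa: i for i, aa in enumerate(ALPHABET)}


def one_hot(seq):
    """Flatten an aligned sequence into a 0/1 list of len(seq) * len(ALPHABET)."""
    n = len(ALPHABET)
    vec = [0] * (len(seq) * n)
    for pos, aa in enumerate(seq):
        col = INDEX.get(aa)
        if col is not None:  # unknown symbols (e.g. 'X') stay all-zero
            vec[pos * n + col] = 1
    return vec


def feature_to_site(feature_idx):
    """Map a flattened column index to (1-based alignment site, residue)."""
    pos, col = divmod(feature_idx, len(ALPHABET))
    return pos + 1, ALPHABET[col]


def site_importance(feature_importances):
    """Sum per-column importances over the residues of each alignment site.

    `feature_importances` is any sequence of non-negative weights, one per
    one-hot column, for example the `feature_importances_` attribute of a
    fitted scikit-learn DecisionTreeClassifier.
    """
    n = len(ALPHABET)
    totals = {}
    for idx, weight in enumerate(feature_importances):
        if weight > 0:
            site = idx // n + 1
            totals[site] = totals.get(site, 0.0) + weight
    return totals


if __name__ == "__main__":
    # Toy alignment of length 3; real inputs would be the aligned spike sequences.
    vector = one_hot("AC-")
    print(len(vector), sum(vector))

    # Hypothetical importances: two columns at site 2, one column at site 3.
    weights = [0.0] * len(vector)
    weights[22], weights[23], weights[62] = 0.5, 0.25, 0.25
    print(feature_to_site(22), site_importance(weights))
```

Summing over the residues of a site is only one possible aggregation; whether the reported importances were computed per column or per site is not stated in the text.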
All other sites used in DT had an importance of less than 0.13. As k increases, the proportion of occurrences of the two sites shifts in favor of 2258, reaching 100% in the 10-fold split.

Table 3: 3-fold cross-validation of DT and SVM classifiers run on inputs without and with dimensionality reduction, respectively. The results are presented as mean (µ) ± standard deviation (σ) of 4 measures of performance: accuracy (Ac), F1-score (F1), sensitivity (Sn), and specificity (Sp).

Virus taxonomy: 2019 release
Origin and evolution of pathogenic coronaviruses
Discovery of seven novel mammalian and avian coronaviruses in the genus deltacoro
Coronavirus spike protein and tropism changes
Structure, function, and evolution of coronavirus spike proteins
Structure, function, and antigenicity of the SARS-CoV-2 spike glycoprotein
Coronavirus membrane fusion mechanism offers a potential target for antiviral develo
SVD-phy: improved prediction of protein functional associations through singular va
IUPAC-IUBMB joint commission on biochemical nomenclature (JCBN) and no Newsletter
Visualizing data using t-SNE
The art of using t-SNE for single-cell transcriptomics

The authors declare there is no conflict of interest in this submission.