key: cord-263008-w6twrjzr authors: Yin, Rui; Luo, Zihan; Kwoh, Chee Keong title: Alignment-free machine learning approaches for the lethality prediction of potential novel human-adapted coronavirus using genomic nucleotide date: 2020-07-15 journal: bioRxiv DOI: 10.1101/2020.07.15.176933 sha: doc_id: 263008 cord_uid: w6twrjzr

A novel coronavirus emerged and rapidly spread worldwide, and the World Health Organization declared a pandemic on March 11, 2020. The roles and characteristics of coronaviruses have captured much attention due to their ability to cause a wide variety of infectious diseases in humans, from mild to severe. Detecting the lethality of human coronaviruses is key to estimating viral toxicity and providing perspective for treatment. We developed alignment-free machine learning approaches for ultra-fast and highly accurate prediction of the lethality of potential human-adapted coronaviruses using genomic nucleotide sequences. We performed extensive experiments with six different feature transformations and machine learning algorithms in combination with digital signal processing to infer the lethality of possible future novel coronaviruses from existing strains. The results tested on SARS-CoV, MERS-CoV and SARS-CoV-2 datasets show an average prediction accuracy of 96.7%. We also provide a preliminary analysis validating the effectiveness of our models on other human coronaviruses. Our study achieves high levels of prediction performance based on raw RNA sequences alone, without genome annotations or specialized biological knowledge. The results demonstrate that, for any novel human coronavirus strain, this alignment-free machine learning-based approach can offer a reliable real-time estimation of its viral lethality.

[...] have caused high morbidity and mortality, and unfortunately, the fast and untraceable virus mutations take the lives of people before the immune system can produce the inhibitory antibody [17].
Currently, no miracle drug or vaccine is available to treat or prevent human coronavirus infections [18] [19]. Therefore, there is a desperate need to develop approaches for detecting the lethality of coronaviruses, not only for COVID-19 but also for potential new variants and species. This would facilitate the diagnosis of coronavirus clinical severity and provide decision-making support.

The detection of viral lethality has already been explored in influenza viruses [20]. Through a meta-analysis of predicting the virulence and antigenicity of influenza viruses, the lethality of a virus can be inferred in a timely manner to improve the current influenza surveillance system [21]. Regarding the risk of novel emerging coronavirus strains, much attention has been paid to investigating the lethality or clinical severity of newly emerging coronaviruses. Typically, epidemiological models are built to estimate the lethality and the extent of undetected infections associated with the new coronaviruses. Bastolla suggested an orthogonal approach based on a minimum number of parameters robustly fitted from the cumulative data easily accessible for all countries at the Johns Hopkins University database to extrapolate the death rate [22]. Bello-Chavolla et al. proposed a clinical score to evaluate the risk for complications and lethality attributable to COVID-19 regarding the effect of obesity and diabetes in Mexico [23]. The results provided a tool for quick determination of susceptible patients in a first-contact scenario. Wang et al. leveraged patient data in real time and devised a patient-information-based algorithm to estimate and predict the death rate caused by COVID-19 for the near future [24]. Aiewsakun et al. performed a genome-wide association study on the genomes of COVID-19 to identify genetic variations that might be associated with COVID-19 severity [25]. Moreover, Jiang et al.
established an artificial intelligence framework for data-driven prediction of coronavirus clinical severity [26]. The development of computational and physics-based approaches has relieved the labor of experiments by utilizing epidemiological and biological data to construct models. However, direct evaluation of potential novel coronavirus strains for their lethality is crucial when clinicians are forced to make difficult decisions without specific past experience to guide clinical acumen. Inferring the lethality of a novel coronavirus is possible by identifying patterns from a large number of coronavirus sequences.

In this paper, we propose alignment-free machine learning-based approaches to infer the lethality of potential novel human-adapted coronaviruses using genomic sequences. The main contribution is that we formulate the problem of estimating the lethality of human-adapted coronaviruses as a machine learning task. By leveraging appropriate feature transformations, we encode genomic nucleotides into numbers, which allows us to convert the problem into a prediction task. The experimental results suggest that our models deliver accurate prediction of lethality without prior biological knowledge. We also performed phylogenetic analysis validating the effectiveness of our models on other human coronaviruses.

Problem formulation

The pandemic of the novel coronavirus COVID-19 has caused thousands of fatalities, posing tremendous threats to public health worldwide. Society is deeply concerned about its spread and evolution, and about the emergence of any potential new variants that would increase the lethality. Typically, lethality refers to the capability of causing death. It is usually estimated as the cumulative number of deaths divided by the total number of confirmed cases.
Among all the human-adapted coronaviruses, MERS-CoV caused the highest fatality rate of 37% [12], followed by SARS-CoV with a 9.3% fatality rate [27]. In comparison, COVID-19 shows a lower mortality rate of 5.5% [13]. The lethality rate of COVID-19 is likely to decrease with better treatment and precautions. In this paper, we mainly focus on these three types of human-adapted coronavirus and define the degree of viral lethality in terms of historical fatality rates. As a result, MERS-CoV strains are labeled as highly lethal, while SARS-CoV and COVID-19 strains are labeled as being of middle and low lethality, respectively.

Data collection and preprocessing

Genomic nucleotide sequences of three different coronaviruses with a human host were downloaded from the National Center for Biotechnology Information on April 30, 2020 [28]. Duplicate sequences and incomplete genomes with a length smaller than 20000 nucleotides were removed from the collection to address possible issues arising from sequence length bias. Some laboratory SARS-CoV strains cultivated in Vero cell cultures are included to enrich the training samples. Finally, we end up with 321, 351 and 1638 samples for MERS-CoV, SARS-CoV and SARS-CoV-2, respectively. In addition, we collect the genomic data of four other human coronaviruses, with 27, 64, 32 and 142 strains for HCoV-HKU1, HCoV-NL63, HCoV-229E and HCoV-OC43, respectively. Apart from the four symbolic bases (A, C, T, G) of each strain, there are degenerate base symbols, an IUPAC representation [29] for a position on genomic sequences, which could [...] [31], mapping biological sequences into a real-valued vector space such that the information or pattern characteristic of the sequence is kept in order. This is important because existing machine learning approaches can only deal with vectors, not sequence samples.
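The filtering step described under data collection (removing duplicate sequences and genomes shorter than 20000 nucleotides) can be sketched as follows. This is only an illustration under stated assumptions: the function name `filter_genomes` and the `(identifier, sequence)` record layout are hypothetical, and duplicates are taken to mean exact, case-insensitive sequence matches.

```python
def filter_genomes(records, min_length=20000):
    """Drop incomplete genomes and exact duplicate sequences.

    `records` is an iterable of (identifier, sequence) pairs. The function
    name and the record layout are assumptions made for this sketch.
    """
    seen = set()
    kept = []
    for identifier, sequence in records:
        sequence = sequence.upper()
        if len(sequence) < min_length:
            continue  # incomplete genome: shorter than the length threshold
        if sequence in seen:
            continue  # exact duplicate of a sequence that was already kept
        seen.add(sequence)
        kept.append((identifier, sequence))
    return kept
```

The first occurrence of each sequence is kept, so the result depends on input order only in which identifier survives for a duplicated genome.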
Several methods have been proposed to convert genomic sequences into numerical vectors, e.g., a fixed mapping between nucleotides and real numbers without biological significance [32], mappings based on physicochemical properties [33], deduction from doublets or codons [34], and chaos game representation (CGR) [35]. To accommodate comprehensive analysis and comparison, we adopt different types of numerical representations for biological RNA sequences. Randhawa et al. [36] showed that the "Real", "Just-A" and "Purine/Pyrimidine (PP)" numerical representations yield better [...]

The real number representation is a fixed transformation technique in which the four bases take the values adenine (A) = -1.5, thymine (T) = 1.5, cytosine (C) = 0.5, and guanine (G) = -0.5 [37]. It is efficient in finding the complementary strand of a DNA/RNA sequence and preserves the complementary property. The "Just-A" method maps the four bases into a binary code in which the presence of adenine is labeled 1, while the others are labeled 0 [38]. The PP representation is a DNA-walk model in which a step is taken upwards if the nucleotide is a pyrimidine (T/C = 1) or downwards if it is a purine (A/G = -1) [39]. EIIP (electron-ion interaction pseudopotential) describes the distribution of free electron energies along the nucleotide sequence; a single EIIP indicator sequence is formed by replacing each nucleotide with its EIIP value, where A = 0.1260, C = 0.1340, G = 0.0806, and T = 0.1335 [40]. The sequence-to-signal mapping for the nearest-neighbor-based doublet representation is illustrated in [34], where the last position is followed by the first in the sequence. Lastly, CGR is a method proposed by Jeffrey [41] that has been successfully used for visual representation of genome sequence patterns and taxonomic [...] nucleotide of the sequence is plotted halfway between the center of the square and the vertex representing this nucleotide.
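As a minimal sketch of the fixed one-dimensional mappings listed above (Real, Just-A, PP and EIIP), the following Python snippet encodes a nucleotide string as a numeric signal using the values quoted in the text. The names `REPRESENTATIONS` and `encode` are hypothetical, and mapping degenerate IUPAC symbols to 0.0 is an assumption of this sketch, not necessarily the treatment used in this work.

```python
import numpy as np

# Per-base values as quoted in the text for each representation.
REPRESENTATIONS = {
    "real":   {"A": -1.5, "T": 1.5, "C": 0.5, "G": -0.5},
    "just_a": {"A": 1.0, "T": 0.0, "C": 0.0, "G": 0.0},
    "pp":     {"A": -1.0, "G": -1.0, "T": 1.0, "C": 1.0},
    "eiip":   {"A": 0.1260, "C": 0.1340, "G": 0.0806, "T": 0.1335},
}

def encode(sequence, representation="real", unknown=0.0):
    """Map a nucleotide string to a 1-D numeric signal.

    Symbols outside A/C/G/T (e.g. degenerate IUPAC codes) are mapped to
    `unknown`; this default is an assumption made for illustration.
    """
    table = REPRESENTATIONS[representation]
    return np.array(
        [table.get(base, unknown) for base in sequence.upper()], dtype=float
    )
```

For instance, `encode("ACGT", "pp")` yields one step per base, +1 for the pyrimidines and -1 for the purines, in sequence order.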
The next base is mapped to the point halfway between the previous point and the vertex corresponding to this base. The coordinates of the successive points in the CGR of a sequence are calculated as:

$$X_i = \frac{1}{2}\left(X_{i-1} + C_{ix}\right), \qquad Y_i = \frac{1}{2}\left(Y_{i-1} + C_{iy}\right)$$

where $C_{ix}$ and $C_{iy}$ denote the X and Y coordinates of the vertex matching the nucleotide at position $i$ of the sequence, respectively.

Model construction

Machine learning has been utilized in many aspects of viral genomic analysis, e.g., antigenicity prediction of viruses [21], genome classification of novel pathogens [43], reassortment detection [44], receptor binding analysis [45] and vaccine recommendation [46]. With increasingly available genomic sequences, it will play a more critical role in helping biologists analyze large, complex biological data for prediction and discovery. In this work, we provide a comprehensive analysis of the [...] implemented in comparison with the predictive performance of machine learning models. Traditional machine learning models consist of logistic regression (LR), random forest (RF), K-nearest neighbor (KNN) and neural network (NN) [47], while three variants of convolutional neural network (CNN) and two types of recurrent neural network (RNN) are also leveraged. The CNN models are AlexNet [48], VGG [49] and ResNet [50].

Following the choice of five one-dimensional numerical representations for viral sequences, digital signal processing is introduced through discrete Fourier transform (DFT) techniques. We assume that the number of input sequences is $n$ and that all the sequences have the same length $l$. For $1 \le i \le n$ and $0 \le k \le l-1$, the corresponding discrete numerical representation is formulated as

$$N_i = \left(f(S_i(0)), f(S_i(1)), \ldots, f(S_i(l-1))\right)$$

where $f(S_i(k))$ denotes the numerical value after mapping by function $f(\cdot)$ at position $k$ of nucleotide sequence $S_i$. The signal $N_i$ computed after DFT is represented as vector $F_i$.
The formulation of $F_i$ is presented below:

$$F_i(k) = \sum_{m=0}^{l-1} f(S_i(m))\, e^{-2\pi \sqrt{-1}\, k m / l}, \qquad 0 \le k \le l-1$$

We define the magnitude vector that corresponds to the signal $N_i$ as $M_i$, where $M_i(k) = |F_i(k)|$.

Typically, the length of the numerical digital signal $N_i$ is equal to that of the magnitude spectrum $M_i$, which is determined by the length of the genomic sequence. However, the input genome sequences have different lengths, so they need to be length-normalized after DFT. Median length-normalization is applied to the input digital signals using zero-padding. We employ anti-symmetric padding that begins from the last position; if the input sequences are shorter than the median length, these short signals are extended to the median length with zero-padding, while longer sequences are truncated after the median length.

As for the two-dimensional numerical representation, i.e., CGR, a point that corresponds to a sequence of length $l$ is contained within a square with a side of length $2^{-l}$. We assume a square CGR image is generated as a $2^k \times 2^k$ matrix, where $k$ is the parameter that determines the size of the image. The frequency of occurrence of any oligomer in a sequence can be obtained by partitioning the CGR space into small squares. Therefore, the number of CGR points in each unit square of the $2^k \times 2^k$ grid is equal to the number of occurrences of the corresponding k-mer in the sequence. By counting the frequency of CGR points, it is possible to calculate oligonucleotide frequencies at various grid resolutions. We define the element $a_j$ as the number of points located in the corresponding sub-square $j$, where $1 \le j \le 2^{2k}$. Each sequence is thus mapped into a $2^k \times 2^k$-dimensional vector space based on CGR.

Implementation and evaluation

We implement all the models with Scikit-learn [51] and PyTorch [52]. We utilize the [...] learning models. For deep learning-based models, we apply stochastic gradient descent with a mini-batch size of 64 for optimization.
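The two feature pipelines described above (DFT magnitude spectra with median length-normalization, and CGR point counts on a $2^k \times 2^k$ grid) can be sketched with NumPy as follows. All function names are hypothetical. Several choices are assumptions of this sketch rather than confirmed details of this work: signals are zero-padded or truncated before the DFT is taken, the vertex assignment is A = (0, 0), C = (0, 1), G = (1, 1), T = (1, 0), degenerate symbols restart the CGR walk, and only points preceded by at least $k$ unambiguous bases are counted so that each counted point corresponds to one k-mer.

```python
import numpy as np

def normalize_length(signal, target_length):
    """Zero-pad a short signal or truncate a long one to target_length."""
    signal = np.asarray(signal, dtype=float)
    if len(signal) >= target_length:
        return signal[:target_length]
    return np.pad(signal, (0, target_length - len(signal)))

def magnitude_spectrum(signal):
    """Return M(k) = |F(k)|, the magnitude of the DFT of a numeric signal."""
    return np.abs(np.fft.fft(np.asarray(signal, dtype=float)))

def spectra_matrix(signals):
    """Stack magnitude spectra of median-length-normalized signals row-wise.

    Normalizing before the DFT is an assumption of this sketch.
    """
    median_length = int(np.median([len(s) for s in signals]))
    return np.vstack([
        magnitude_spectrum(normalize_length(s, median_length)) for s in signals
    ])

# Assumed (x, y) vertex of the unit square for each nucleotide.
CORNERS = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}

def cgr_counts(sequence, k):
    """Count CGR points per cell of a 2^k x 2^k grid (requires k >= 1).

    The cell of a CGR point is floor(2^k * coordinate). Because each step
    halves the coordinate and adds half the vertex coordinate, that index
    can be updated exactly with integer shifts instead of floats.
    """
    size = 2 ** k
    grid = np.zeros((size, size), dtype=int)
    col = row = run = 0  # run = consecutive unambiguous bases seen so far
    for base in sequence.upper():
        if base not in CORNERS:
            col = row = run = 0  # restart the walk after a degenerate symbol
            continue
        cx, cy = CORNERS[base]
        col = (col >> 1) | (cx << (k - 1))
        row = (row >> 1) | (cy << (k - 1))
        run += 1
        if run >= k:  # the cell is now fully determined by the last k bases
            grid[row, col] += 1
    return grid
```

Flattening the grid with `cgr_counts(sequence, k).ravel()` gives a vector of length $2^{2k}$ whose entries play the role of the counts $a_j$ defined above.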
The dropout (rate = 0.9) strategy is carried out with a learning rate of 0.001, and all the models are fit for 50 training epochs. The predictive performance of all models in the coronavirus lethality prediction tasks is evaluated by accuracy, precision, sensitivity, and F1 score.

[...] [54]. The evidence shows that cytosine discrimination and deamination against CpG dinucleotides are the driving forces that have shaped the coronaviruses over evolutionary time [55]. It is indicated that the atypical nucleotide bias could reflect distinct biological functions that are the direct cause of the characteristic codon usage in these viruses [56]. Therefore, the analysis of nucleotide and codon usage in coronaviruses can not only provide clues on potential viral evolution but also improve the understanding of viral regulation and promote vaccine design.

We test the ability of our models to identify the lethality of other different human [...] Guangzhou, China, but few death cases are reported [61]. HCoV-229E is a close relative of HCoV-NL63 and leads to similar symptoms [62].

Figure 3 displays the CGR plots of different human coronavirus sequences with k = 6 for the k-mer frequency. The CGR plots visually indicate that the genomic signature of the SARS-CoV-2 isolate Wuhan-Hu-1 (Fig. 3c) is closest to the genomic signature of the SARS-CoV isolate Canada (Fig. 3a), followed by that of the MERS-CoV Betacoronavirus England 1 isolate (Fig. 3b). Moreover, the other four human [...] ACE2 receptor has been identified as the potential receptor for COVID-19 and serves as a potential target for treatment [70] [71]. Nevertheless, with the circulation of bat-related coronaviruses and their geographic coverage, it is critical to monitor the evolution of coronaviruses. Currently, seven known types of coronavirus can infect humans.
Novel strains of these coronaviruses are likely to arise and attack humans again through reassortment and mutation when two or more different strains co-infect the same host. Preparation is necessary to prevent potential epidemics and pandemics caused by a novel coronavirus. As a result, our work lays a basis for surveillance by inferring the lethality of any potential human coronaviruses that may emerge in the future.

This study is subject to a variety of limitations. The classification of the degree of coronavirus lethality is mainly based on the mortality rate. We assume that the higher the mortality, the more lethal the virus, and thus define three categories of lethality level for all viruses with different thresholds. However, our estimates of these values lie within the range of fatality rates from the literature, and we do not have sufficient data to parameterize a case-structured model, especially for viruses with few samples. We also do not build a benchmark for deaths caused directly by human coronaviruses, as the criteria of institutions and countries can differ. Besides, the limited number of data points for human coronaviruses tempers the high predictive accuracy, as most machine learning algorithms possess a strong generalization ability to discover inherent patterns from training samples, particularly on small datasets. Moreover, like typical machine learning approaches, our models cannot provide a direct and accessible explanation of why a certain coronavirus strain is more lethal to humans. Rule-based methods or clinical studies might provide a better rationale for their results.

Conclusion

We provide a comprehensive analysis through alignment-free machine learning-based methods for the prediction of the lethality of potential human-adapted coronavirus.
The results show that, on average, the CGR, EIIP, and Just-A representations perform better than the others. Interestingly, traditional machine learning methods show a clear advantage over deep learning models on this task in both computational efficiency and performance. Validation on other types of human coronavirus, in combination with phylogenetic analysis, further supports our predictive results. We hope this work will facilitate COVID-19 research for biologists and clinicians on the front line.

References

[1] Coronavirus infections and immune responses
[2] A case for the ancient origin of coronaviruses
[3] Evolutionary insights into the ecology of coronaviruses
[4] Origin and evolution of pathogenic coronaviruses
[5] Coronavirus diversity, phylogeny and interspecies jumping
[6] History and recent advances in coronavirus discovery
[7] Hosts and sources of endemic human coronaviruses
[8] Global outbreak of severe acute respiratory syndrome (SARS)
[9] Outbreak of Middle East respiratory syndrome coronavirus in Saudi Arabia: a retrospective study
[10] Genomic characterisation and epidemiology of 2019 novel coronavirus: implications for virus origins and receptor binding
[11] The challenge of emerging and re-emerging infectious diseases
[12] A comparative analysis of factors influencing two outbreaks of Middle Eastern respiratory syndrome (MERS) in Saudi Arabia and South Korea
[13] An interactive web-based dashboard to track COVID-19 in real time. The Lancet Infectious Diseases
[14] COVID-19: what is next for public health? The Lancet
[15] Tempel: time-series mutation prediction of influenza A viruses via attention-based recurrent neural networks
[16] Escaping Pandora's box - another novel coronavirus
[17] The SARS, MERS and novel coronavirus (COVID-19) epidemics, the newest and biggest global health threats: what lessons have we learned?
[18] Development and clinical application of a rapid IgM-IgG combined antibody test for SARS-CoV-2 infection diagnosis
[19] Comparative genetic analysis of the novel coronavirus (2019-nCoV/SARS-CoV-2) receptor ACE2 in different populations
[20] Meta-analysis on the lethality of influenza A viruses using machine learning approaches
[21] Predicting antigenic variants of H1N1 influenza virus based on epidemics and pandemics using a stacking model
[22] How lethal is the novel coronavirus, and how many undetected cases there are? The importance of being tested. medRxiv
[23] Predicting mortality due to SARS-CoV-2: A mechanistic score relating obesity and diabetes to COVID-19 outcomes in Mexico
[24] Real-time estimation and prediction of mortality caused by COVID-19 with patient information based algorithm
[25] Suradej Hongeng, and Arunee Thitithanyanont. SARS-CoV-2 genetic variations associated with COVID-19 severity. medRxiv
[26] Towards an artificial intelligence framework for data-driven prediction of coronavirus clinical severity
[27] Summary of probable SARS cases with onset of illness from
[28] Database resources of the National Center for Biotechnology Information
[29] Nomenclature for incompletely specified bases in nucleic acid sequences: recommendations 1984
[30] Computational identification of physicochemical signatures for host tropism of influenza A virus
[31] Protein-protein interaction site prediction through combining local and global features with deep neural networks
[32] Numerical representation of DNA sequences
[33] Identification of pathogenic viruses using genomic cepstral coefficients with radial basis function neural network
[34] Genomic signal processing methods for computation of alignment-free distances from DNA sequences
[35] Analysis of genomic sequences by chaos game representation
[36] ML-DSP: Machine learning with digital signal processing for ultrafast, accurate, and scalable genome classification at all taxonomic levels
[37] Autoregressive modeling and feature analysis of DNA sequences
[38] Evolution of long-range fractal correlations and 1/f noise in DNA base sequences
[39] Visualization and analysis of DNA sequences using DNA walks
[40] A coding measure scheme employing electron-ion interaction pseudopotential (EIIP)
[41] Chaos game representation of gene structure
[42] Additive methods for genomic signatures
[43] Machine learning using intrinsic genomic signatures for rapid classification of novel pathogens: COVID-19 case study
[44] Hopper: an adaptive model for probability estimation of influenza reassortment through host prediction
[45] Computational analysis of the receptor binding specificity of novel influenza A/H7N9 viruses
[46] Time series computational prediction of vaccines for influenza A H3N2 with recurrent neural networks
[47] Machine learning: an algorithmic perspective
[48] ImageNet classification with deep convolutional neural networks
[49] Very deep convolutional networks for large-scale image recognition
[50] Deep residual learning for image recognition
[51] Scikit-learn: Machine learning in Python
[52] Automatic differentiation in PyTorch
[53] Mutational patterns correlate with genome organization in SARS and other coronaviruses
[54] Genome structure and transcriptional regulation of human coronavirus NL63
[55] Coronavirus genomics and bioinformatics analysis. Viruses
[56] On the biased nucleotide composition of the human coronavirus RNA genome
[57] An open-source k-mer based machine learning tool for fast and accurate subtyping of HIV-1 genomes
[58] Projecting the transmission dynamics of SARS-CoV-2 through the postpandemic period
[59] Epidemiology, genetic recombination, and pathogenesis of coronaviruses
[60] Understanding human coronavirus HCoV-NL63. The Open Virology Journal
[61] Epidemiology and clinical characteristics of human coronaviruses OC43, 229E, NL63, and HKU1: a study of hospitalized children with acute respiratory tract infection in Guangzhou, China
[62] Human coronavirus NL63 and 229E seroconversion in children
[63] Host and infectivity prediction of Wuhan 2019 novel coronavirus using deep learning algorithm
[64] Functional assessment of cell entry and receptor usage for SARS-CoV-2 and other lineage B betacoronaviruses
[65] Bats are natural reservoirs of SARS-like coronaviruses
[66] Isolation and characterization of viruses related to the SARS coronavirus from animals in southern China
[67] Middle East respiratory syndrome coronavirus infection in dromedary camels in Saudi Arabia
[68] Beware of asymptomatic transmission: Study on 2019-nCoV prevention and control measures based on extended SEIR model
[69] Preliminary estimation of the basic reproduction number of novel coronavirus (2019-nCoV) in China, from 2019 to 2020: A data-driven analysis in the early phase of the outbreak
[70] Developing COVID-19 vaccines at pandemic speed
[71] Single-cell RNA expression profiling of ACE2, the putative receptor of Wuhan 2019-nCoV