key: cord-328069-a9fi9ssg
authors: Pathan, Refat Khan; Biswas, Munmun; Khandaker, Mayeen Uddin
title: Time Series Prediction of COVID-19 by Mutation Rate Analysis using Recurrent Neural Network-based LSTM Model
date: 2020-06-13
journal: Chaos Solitons Fractals
DOI: 10.1016/j.chaos.2020.110018
sha: 
doc_id: 328069
cord_uid: a9fi9ssg

SARS-CoV-2, the novel coronavirus that causes the disease known as COVID-19, has created a global pandemic. The world is now immobilized by this infectious RNA virus. As of May 18, 2020, more than 4.8 million people had been infected and 316,000 had died. This RNA virus is able to mutate within the human body. Accurate determination of mutation rates is essential to comprehend the evolution of this virus and to determine the risk of emergent infectious disease. This study explores the mutation rate of the whole genomic sequence gathered from patient datasets of different countries. The collected dataset is processed to determine the nucleotide mutation and codon mutation separately. Furthermore, based on the size of the dataset, the determined mutation rate is categorized for four regions: China, Australia, the United States, and the rest of the world. It has been found that a large amount of Thymine (T) and Adenine (A) is mutated to other nucleotides in all regions, but codons do not mutate as frequently as nucleotides. A recurrent neural network-based Long Short Term Memory (LSTM) model has been applied to predict the future mutation rate of this virus. The LSTM model gives a Root Mean Square Error (RMSE) of 0.06 in testing and 0.04 in training, which is an optimized value. Using this training and testing process, the nucleotide mutation rate of the 400th future patient has been predicted. An increase of about 0.1% in mutation rate is found for mutations from T to C and G, C to G, and G to T, while a decrease of 0.1% is seen for mutations from T to A and A to C. It is found that this model can be used to predict day-basis mutation rates if more up-to-date patient data become available.

The whole world is suffering from an ongoing pandemic of coronavirus disease caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) [1]. The outbreak began in Wuhan, the capital of Hubei province in China, during December 2019 [2]. The virus was identified on 7th January 2020, and it was observed to spread by human-to-human transmission via droplets or direct contact [3, 4]. The infection has an estimated mean incubation period of 6.4 days and a basic reproduction number of 2.24-3.58. Since its identification it has spread rapidly across the globe, and WHO therefore declared COVID-19 a global pandemic on 11th March 2020 [5].

SARS-CoV-2 is a pathogenic human coronavirus of the Betacoronavirus genus. In the past two decades, two other pathogenic species, SARS-CoV and MERS-CoV, caused outbreaks in 2002 and 2012 in China and the Middle East, respectively [6] [7] [8] [9]. The complete genomic sequence (Wuhan-HU1) of this large RNA virus (SARS-CoV-2) was first determined in a laboratory in China on 10th January [10] and deposited in the NCBI GenBank. SARS-CoV-2 is a positive-sense, single-stranded RNA virus with a non-segmented nucleic acid sequence [11]. Although it is an RNA virus, for simplicity the gene sequence is given in DNA form, which means that the nucleobase Uracil (U) is replaced by Thymine (T). The genomic sequence of SARS-CoV-2 shows about 79% and 50% similarity with SARS-CoV and MERS-CoV, respectively [6].

SARS-CoV-2 mutates during replication of its genomic information [12]. Mutations arise from errors made when the RNA is copied in a new cell. There are three main kinds of mutation: base substitution, insertion, and deletion. Base substitutions are further divided into silent, nonsense, missense, and frameshift [13]. Even micro-level alterations in mutation rate can be detected in a virus infecting a host immune system, and they can drastically change the characteristics and virulence of the virus. The mutation rate is one of the crucial parameters for understanding viral evolution [14]. Furthermore, it is one of the most important factors in assessing the risk of emergent infectious diseases such as that caused by SARS-CoV-2. An accurate estimate of this parameter is therefore of great significance [15, 16].

In this connection, and in response to the current pandemic, many researchers and scientists are working relentlessly to understand the evolution of SARS-CoV-2. Asim et al. performed a phylogenetic analysis of the SARS-CoV-2 virus based on the spike gene of the genomic sequence [17]. In that study, they described the genomic sequence of SARS-CoV-2 in detail, identified the factor of endemicity of SARS-CoV-2, and then sought the next reservoir of the virus. Based on the case study, the authors reported that all sequences of this virus fall in a single cluster without any branching during the ongoing pandemic, but they did not validate the findings with detailed statistical analysis. An analysis of the gene signature of SARS-CoV-2 was performed by Ranajit and Sudeep [18]. They estimated the ancestry rate of the European genome from the reference population using the statistical tool qpAdm. They then applied Pearson's correlation coefficient to the various ancestry rates of the European genome and performed statistical analysis of the death/recovery ratio using GraphPad Prism v8.4.0 (GraphPad Software). In that study, they developed different linear regression models. Finally, they performed genome-wide association analyses (GWAS) on European and East Asian genomes to examine single-nucleotide polymorphisms (SNPs) correlated with SARS-CoV-2 infection. From the SNP association, they observed a large difference in allele frequencies between European and East Asian countries. Debaleena et al. analyzed the statistical changes of signature across different variants of SARS-CoV-2 [19]. They calculated diversity and non-synonymous, synonymous, and substitution rates for each gene of the nucleotide sequence using DnaSP. They also employed Time zone software for phylogenetic analysis and mutation detection for each gene. After that, they compared the alignment of a protein sequence from Wuhan and India using multiple sequence alignment. Note that in their study the mutation rate was not calculated from patients' genomic sequences. The contemporary literature thus contains adequate studies of the genomic sequence but very few of the mutation rate. The present study is therefore designed to predict the mutation rate of SARS-CoV-2 as a time series.

Unfortunately, current data show that SARS-CoV-2 is more infectious than the other harmful pathogenic human coronaviruses [20]. The world's population is suffering and in great anxiety at the deadly effects of this virus. What can be done to stay healthy or avoid infection with the virus remains undiscovered. To stop SARS-CoV-2, there is a critical need to develop a suitable vaccine and antibody-based therapy [22]. Scientists and researchers are trying their best to discover suitable drugs or vaccines to neutralize the effect of this virus on the human body, or at least to help build effective resistance against its spread. Developing suitable drugs and vaccines against the COVID-19 virus requires genomic sequence and mutation analysis. Indeed, the viral mutation rate also plays a role in assessing possible vaccination strategies.

In this regard, we performed a detailed study of the mutation rate of this virus using the dataset available in the NCBI GenBank. From this dataset, we analyzed the genomic sequences of 3408 patients from different countries for the period from 12th January to 11th May 2020. We focus specifically on mutations that have arisen independently on different dates (homoplasies), as these are likely candidates for ongoing adaptation of SARS-CoV-2 to its novel human host. Specifically, we calculated the base substitution mutation rates. Because the information needed to identify insertions and deletions is lacking, we counted those as substitution mutations so that no nucleotide is left uncounted. The present analysis is expected to help in understanding the changing behavior of this virus in the human body and in setting up strategies to combat it at the epidemiological and evolutionary levels.

A sizeable gene dataset containing complete genome sequences of SARS-CoV-2 is currently available in the NCBI GenBank. From its many fields, we extracted the gene sequence, the collection date, and the country of the sample. All sequences were taken from humans affected by COVID-19. There are sequences from almost 33 countries, but only China, Australia, and the United States have a considerable amount of patient data. Although some countries such as England, Italy, France, Spain, and Brazil have very high mortality rates, the lack of data in the NCBI GenBank up to 15th May meant that we were unable to calculate mutation rates for these countries separately. We therefore grouped these countries, along with others that have few gene sequences in GenBank, into a rest-of-the-world category to cover as many regions as possible. Table 1 shows the information about the gene dataset.

The dataset also contains some partial genes, so we filtered them out and kept only sequences at the complete-genome level. The reference gene sequence has a length of 29903. We then filtered the dataset to a maximum gene length of 29903 and a minimum of 29161 and removed duplicate sequences. With this filtering, the total number of patients falls from 3408 to 3068: patients from China fall from 86 to 40, from the United States from 2103 to 1903, and from Australia from 925 to 918. Following the size of the available dataset, the mutation rate calculations are set up for four categories: China, the United States (USA), Australia, and the rest of the world. The dataset is further arranged so that nucleotide mutation and codon mutation can be calculated separately. The first filtered dataset is used to find the nucleotide mutation rate. We then converted the four raw nucleotides (A = Adenine, T = Thymine, C = Cytosine, and G = Guanine) into a codon set. A codon consists of three nucleotides and forms a unit of genetic code in DNA or RNA. The information given in table 2 is used to convert the gene sequence into codon index numbers. For example, the three consecutive nucleotides 'TTT' are converted to 1, 'GCT' is converted to 53, and so on. The conversion process is shown in figure 1. This process is important for understanding mutation in the codon sequence of COVID-19, and it also lowers the computational complexity.
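The length filter and the codon indexing can be sketched in a few lines of Python. This is only an illustration under stated assumptions: table 2 is not reproduced here, so the base order `TCAG` is inferred from the two examples in the text ('TTT' to 1, 'GCT' to 53), and the function names are hypothetical.

```python
# Assumed base order: it reproduces the two examples given in the text
# ('TTT' -> 1, 'GCT' -> 53), but table 2 itself is not available here.
BASES = "TCAG"


def codon_index(codon):
    """Map a three-letter codon to an index in 1..64."""
    first, second, third = (BASES.index(base) for base in codon)
    return 16 * first + 4 * second + third + 1


def to_codon_indices(sequence):
    """Convert a nucleotide string into a list of codon indices.

    Trailing nucleotides that do not fill a whole codon are dropped
    (an assumption; the text does not say how they are handled).
    Characters outside A, T, C, G raise ValueError.
    """
    usable = len(sequence) - len(sequence) % 3
    return [codon_index(sequence[i:i + 3]) for i in range(0, usable, 3)]


def keep_sequence(sequence, min_len=29161, max_len=29903):
    """Length filter with the bounds quoted in the text."""
    return min_len <= len(sequence) <= max_len
```

Under this scheme a 29903-nucleotide reference yields 29903 // 3 = 9967 whole codons, which agrees with the reference length quoted for the converted dataset.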

Genes mutate for many reasons. Errors made when RNA copies the genetic code from DNA can cause mutation, as can errors in DNA replication, recombination, and chemical damage to DNA. There are basically three types of mutation: base substitutions, deletions, and insertions. From this dataset, we can identify three kinds of substitution mutation: silent, missense, and nonsense. A silent mutation is a codon change that leaves the resulting amino acid unchanged. If the resulting amino acid changes, it is called a missense mutation. When the changed codon instead produces a stop signal for translation, yielding a nonfunctional protein, it is called a nonsense mutation. These three types of substitution mutation in the observed dataset are shown in figure 2, where the missense mutation rate is 34.3%, the nonsense mutation rate is 6.7%, and the silent mutation rate is 0.8%.
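The three definitions can be illustrated with a small classifier built on the standard genetic code. This is a sketch, not the procedure used to produce figure 2: the function name is hypothetical, and the use of the standard code (NCBI translation table 1) is an assumption.

```python
BASES = "TCAG"
# Standard genetic code (NCBI translation table 1), with codons ordered
# TTT, TTC, TTA, TTG, TCT, ... and '*' marking a stop codon.
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def classify_substitution(ref_codon, new_codon):
    """Label a codon change as 'none', 'silent', 'nonsense' or 'missense'."""
    if ref_codon == new_codon:
        return "none"
    ref_aa = CODON_TABLE[ref_codon]
    new_aa = CODON_TABLE[new_codon]
    if new_aa == ref_aa:
        return "silent"      # same amino acid (or stop) is still encoded
    if new_aa == "*":
        return "nonsense"    # a stop signal replaces an amino acid
    return "missense"        # a different amino acid is encoded
```

For example, TTT to TTC keeps phenylalanine (silent), TTT to TTA gives leucine (missense), and TAT to TAA introduces a stop codon (nonsense).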

If the mutation is missense, the nucleotide change has affected protein generation, which may change the behavior of the virus. It also makes the gene sequence for a cure hard to identify. The missense nucleotide mutation rate is calculated by the algorithm given in figure 3. After applying this algorithm, equation 1 is used to convert the values into percentages.

Here, MutationRate is the final output array; mutation is the 4×4 output array containing the raw values produced by the algorithm; lg is the number of sequences in the dataset (3068 for the full dataset, 40 for China, 918 for Australia, and 1903 for the USA); and gs is the length of the reference gene sequence, which is 29903 in this dataset.
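Since neither the algorithm of figure 3 nor equation 1 is reproduced in this text, the following Python sketch shows one plausible reading of the symbols defined above. Both the position-by-position comparison against the reference and the normalization MutationRate = mutation / (lg × gs) × 100 are assumptions, and the function names are hypothetical.

```python
NUCLEOTIDES = "ATCG"  # assumed row/column order of the 4x4 array


def count_substitutions(reference, sequences):
    """Count reference-base -> observed-base changes over all sequences.

    Assumption: each sequence is compared with the reference position by
    position, up to the shorter of the two lengths. Characters other than
    A, T, C, G are ignored.
    """
    index = {base: i for i, base in enumerate(NUCLEOTIDES)}
    mutation = [[0] * 4 for _ in range(4)]
    for sequence in sequences:
        for ref_base, obs_base in zip(reference, sequence):
            if ref_base != obs_base and ref_base in index and obs_base in index:
                mutation[index[ref_base]][index[obs_base]] += 1
    return mutation


def to_percentage(mutation, lg, gs):
    """Assumed form of equation 1: raw count / (lg * gs) * 100."""
    return [[100.0 * count / (lg * gs) for count in row] for row in mutation]


reference = "ATCG"
patients = ["ATCG", "TTCG"]          # toy data, not from the dataset
raw = count_substitutions(reference, patients)
rates = to_percentage(raw, lg=len(patients), gs=len(reference))
```

Because only differing positions are counted, the diagonal of the array stays at zero, so a 4×4 array carries 12 informative rates.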

In this way, we calculated the nucleotide mutation rate for the prepared dataset. The mutation rate for China is shown in figure 4(a). It shows that a large percentage of Thymine (T) is mutated to other nucleotides, while a comparable amount of T is not produced by mutation from other nucleotides. A large amount of Adenine (A) is also mutated to other nucleotides. Compared with T and A, Cytosine (C) and Guanine (G) changed little.

The mutation rate was then calculated for Australia and the USA, as shown in figures 4(b) and 4(c). It is clear that all regions share a high mutation rate for T and A, but there is a significant increase in the mutation rate compared with China. This indicates that the virus is very active in changing its gene sequence. Finally, the nucleotide mutation for the full dataset of 33 countries is shown in figure 4(d). The C and G mutation rates are almost equal to those of the USA because there are more data from the USA than from any other country, but some changes in T and A can be seen for the rest of the world. These values vary with the availability of data from different countries.

The second processed and converted dataset prepared earlier is used here to calculate the codon mutation rate, shown in figure 5. Changes in nucleotides cause changes in the codon set, which in turn directly affect the protein. We used the same algorithm shown in figure 3 to detect the codon mutation rate, with one small change to the receiving array: its size is 4×4 for nucleotides but 64×64 for codon mutation. After finding the codon mutations, equation 2 is used to obtain the rates in percentage.

Here, CodonMutation is the final output array; mutation is the 64×64 output array containing the raw values produced by the algorithm; lg is the number of sequences in the dataset, which is 3068; and gs is the length of the reference gene sequence, which is 9967 in this converted dataset.

The codon mutation rate for the full dataset is shown in figure 5. From the obtained values it is clear that codons do not mutate as frequently as nucleotides. The diagonal values are 0 because at those points the codons are unchanged relative to the reference gene sequence, and the highest codon mutation rate is 0.12%.

The processed nucleotide mutation dataset contains gene data from 12th January 2020 to 11th May 2020, with gaps between dates. The dates are sorted in ascending order, which makes it easy to treat this as a time series dataset. Each date in the dataset has one or more patients; in total there are 3068 patients over 62 days. Taking all the patients in order gives the per-patient time series shown in figure 6.

To obtain a day-basis time series, we averaged the mutation rates of the different patients on the same date. The dataset thus becomes smaller, with dates in increasing but non-consecutive order; the mutation rates for the 62 days are shown in figure 7. With so few dates available, it is difficult to train a model on such a small amount of data.
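The per-date averaging can be sketched as follows. The record layout (a date string paired with a list of rates) and the function name are assumptions made for illustration; ISO-formatted date strings are used so that sorting them also sorts them chronologically.

```python
from collections import defaultdict


def daily_average(records):
    """Average per-patient rate vectors that share a collection date.

    `records` is an iterable of (date, rates) pairs, where `date` is an
    ISO string such as '2020-01-12' and `rates` is a list of floats of
    fixed length. Returns a list of (date, mean_rates) sorted by date.
    """
    grouped = defaultdict(list)
    for date, rates in records:
        grouped[date].append(rates)

    averaged = []
    for date in sorted(grouped):
        rows = grouped[date]
        # zip(*rows) walks the rows column by column.
        mean_rates = [sum(column) / len(rows) for column in zip(*rows)]
        averaged.append((date, mean_rates))
    return averaged


# Toy values, not taken from the dataset.
records = [
    ("2020-01-15", [5.0, 6.0]),
    ("2020-01-12", [1.0, 2.0]),
    ("2020-01-12", [3.0, 4.0]),
]
series = daily_average(records)
```

Dates without any samples are simply absent from the result, which is why the resulting series is non-consecutive.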

A Long Short Term Memory (LSTM) network, a type of recurrent neural network (RNN) used in deep learning, has been applied. The data are organized as shown in table 3, where each set holds the mutation rates of 12 patients.
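Table 3 is not reproduced in this text, so the exact layout of each set is unknown. One common arrangement, assumed here, is a sliding window in which 12 consecutive patients form the input and the following patient is the target; the function name is hypothetical.

```python
def make_windows(series, window=12):
    """Split an ordered series into (input window, next value) pairs.

    Assumption: each input holds `window` consecutive entries and the
    target is the entry that immediately follows them.
    """
    inputs, targets = [], []
    for start in range(len(series) - window):
        inputs.append(series[start:start + window])
        targets.append(series[start + window])
    return inputs, targets


# Toy series of 15 entries; each entry could equally be a rate vector.
toy_series = list(range(15))
windows, next_values = make_windows(toy_series, window=12)
```

A series of n entries yields n - 12 such pairs, because the first 12 entries have no complete window before them.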

We divided the data 80/20 into training and testing sets, giving 2453 samples for training and 614 for testing. An LSTM model was created with Keras, a deep learning API for Python, and its structure is shown in figure 8. First, the input layer, with 500 neurons, receives the prepared training sets. The data then pass through a dense layer of 250 neurons with ReLU activation, followed by a dropout of 0.25. Another dense layer of 150 neurons with ReLU activation follows, again with a dropout of 0.25. Finally, a dense layer of 12 neurons serves as the output layer, and the model uses the Adam optimizer. This model gives a Root Mean Square Error (RMSE) of 0.06 in testing and 0.04 in training.
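The split, the error metric, and the layer stack described above might look roughly like this. Several details are assumptions rather than facts from the paper: that the split is chronological, that the 500-neuron first layer is an LSTM layer, that each patient contributes 12 features (the off-diagonal entries of the 4×4 array), and that mean squared error is the training loss. Figure 8 is not available here, and the function names are hypothetical.

```python
import math


def chronological_split(samples, train_fraction=0.8):
    """Keep the first 80% for training and the rest for testing (assumed order)."""
    cut = int(len(samples) * train_fraction)
    return samples[:cut], samples[cut:]


def rmse(actual, predicted):
    """Root mean square error between two equal-length sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def build_model(window=12, features=12):
    """Keras sketch of the described stack; shapes and loss are assumptions."""
    # Imported lazily so the helpers above work without TensorFlow installed.
    from tensorflow import keras

    model = keras.Sequential([
        keras.Input(shape=(window, features)),
        keras.layers.LSTM(500),                      # "500 neurons" first layer
        keras.layers.Dense(250, activation="relu"),
        keras.layers.Dropout(0.25),
        keras.layers.Dense(150, activation="relu"),
        keras.layers.Dropout(0.25),
        keras.layers.Dense(12),                      # 12 output rates
    ])
    model.compile(optimizer="adam", loss="mse")
    return model


train, test = chronological_split(list(range(3067)))
```

With 3067 samples, truncating 0.8 × 3067 gives 2453 training samples and leaves 614 for testing, which matches the counts reported in the text.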

After training and testing, the model appears to work well. We therefore use the mutation rates of the last 12 patients to predict the mutation rate of one future patient, then form a new set of 12 from the 11 most recent existing patients and the newly predicted one, and predict again. By this procedure, we predicted the mutation rates of 400 future patients, as shown in figure 9.
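This rolling procedure can be written independently of the model. In the sketch below, `predict_next` is a hypothetical stand-in for the trained network: any callable that maps a window of 12 entries to the next entry.

```python
def recursive_forecast(history, predict_next, steps=400, window=12):
    """Forecast `steps` future entries by feeding predictions back in.

    At each step the oldest entry is dropped from the window and the
    newest prediction is appended, so after `window` steps the input
    consists entirely of predicted values.
    """
    buffer = list(history[-window:])
    forecasts = []
    for _ in range(steps):
        next_value = predict_next(buffer)
        forecasts.append(next_value)
        buffer = buffer[1:] + [next_value]
    return forecasts


# Toy predictor: last value plus one (stands in for the trained model).
toy_history = list(range(12))
toy_forecasts = recursive_forecast(toy_history, lambda w: w[-1] + 1, steps=400)
```

Because each prediction becomes an input for later steps, any error made early is carried forward, so values far along the horizon should be read with more caution than the first few.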

The nucleotide mutation rate of the 400th future patient is shown in figure 10. A small increase in mutation rate can be seen. If more continuous data become available from different locations and dates, this method can be applied to find the mutation rate for a particular date in the future.

The COVID-19 pandemic has almost brought the world to a halt in this twenty-first century. Its great spreading power combined with mutation makes this virus powerful and deadly. Lockdown has temporarily limited the spread of the virus, but its mutation cannot yet be contained, as no reliable vaccine has been invented. In this research, we describe the nucleotide mutation rate and the pattern in the codon mutation set. An RNN-based LSTM model has been created to predict the future rate of mutation in the body of a person infected with COVID-19.

With this model, the mutation rate of the 400th future patient has been predicted. We have also described an LSTM-RNN model for time series prediction based on patients' nucleotide mutation rates. By analyzing more patient data as they are updated, this model can be used to predict day-basis mutation rates. The situation may change if a reliable cure is invented. In this paper, the mutation rate is limited to base substitution only; insertion and deletion rates can be determined in further research.

Fig. 1: Nucleotide to codon indexing.

Coronaviridae Study Group of the International Committee on Taxonomy of Viruses. The species Severe acute respiratory syndrome-related coronavirus: classifying 2019-nCoV and naming it SARS-CoV-2

Laboratory testing of 2019 novel coronavirus (2019-nCoV) in suspected human cases: interim guidance

A novel coronavirus outbreak of global health concern

WHO declares COVID-19 a pandemic

Clinical features of patients infected with 2019 novel coronavirus in Wuhan

Genomic characterisation and epidemiology of 2019 novel coronavirus: implications for virus origins and receptor binding

Genome composition and divergence of the novel coronavirus (2019-nCoV) originating in China

Recombination, reservoirs, and the modular spike: mechanisms of coronavirus cross-species transmission

Origin and evolution of pathogenic coronaviruses

Inhibition of SARS-CoV-2 Replication by Acidizing and RNA Lyase-Modified Carbon Nanotubes Combined with Photodynamic Thermal Effect

Coronaviruses: an overview of their replication and pathogenesis

Emerging SARS-CoV-2 mutation hot spots include a novel RNA-dependent-RNA polymerase variant

Mutations: Types and causes

Viral mutation rates

Increased fidelity reduces poliovirus fitness and virulence under selective pressure in mice

Quasispecies diversity determines pathogenesis through cooperative interactions in a viral population

Emergence of Novel Coronavirus and COVID-19: whether to stay or die out?

Investigating the likely association between genetic ancestry and COVID-19 manifestation

Emergence of multiple variants of SARS-CoV-2 with signature structural changes

Cryo-EM structure of the 2019-nCoV spike in the prefusion conformation

Models of RNA virus evolution and their roles in vaccine design

Technical support from the BGC University computer club is acknowledged

This research received no funding

The authors declare no competing financial interest