key: cord-255371-o9oxchq6
authors: Nguyen, Thanh Thi; Pathirana, Pubudu N.; Nguyen, Thin; Nguyen, Henry; Bhatti, Asim; Nguyen, Dinh C.; Nguyen, Dung Tien; Nguyen, Ngoc Duy; Creighton, Douglas; Abdelrazek, Mohamed
title: Genomic Mutations and Changes in Protein Secondary Structure and Solvent Accessibility of SARS-CoV-2 (COVID-19 Virus)
date: 2020-07-10
journal: bioRxiv
DOI: 10.1101/2020.07.10.171769
sha:
doc_id: 255371
cord_uid: o9oxchq6

Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is a highly pathogenic virus that has caused the global COVID-19 pandemic. Tracing the evolution and transmission of the virus is crucial to respond to and control the pandemic through appropriate intervention strategies. This paper reports and analyses genomic mutations in the coding regions of SARS-CoV-2 and their probable protein secondary structure and solvent accessibility changes, which are predicted using deep learning models. Prediction results suggest that mutation D614G in the virus spike protein, which has attracted much attention from researchers, is unlikely to change protein secondary structure or relative solvent accessibility. Based on 6,324 viral genome sequences, we create a spreadsheet dataset of point mutations that can facilitate the investigation of SARS-CoV-2 from many perspectives, especially in tracing the evolution and worldwide spread of the virus. Our analysis results also show that coding genes E, M, ORF6, ORF7a, ORF7b and ORF10 are the most stable and are potentially suitable targets for vaccine and drug development.

Biological investigations of the novel coronavirus SARS-CoV-2 are important to understand the virus and help to propose appropriate responses to the pandemic. Scientists have been able to obtain genomic sequences of SARS-CoV-2 and have started analysis of these data.
The reference genome of SARS-CoV-2 deposited in the National Center for Biotechnology Information (NCBI) GenBank sequence database (isolate Wuhan-Hu-1, accession number NC_045512) shows that SARS-CoV-2 is an RNA virus with a length of 29,903 nucleotides. Comparative genomic analysis results obtained in [1] [2] [3] suggest that the COVID-19 virus may have originated in bats. Other studies show that pangolins may have served as hosts for the virus [4; 5]. Andersen et al. [6] furthermore argue that SARS-CoV-2 was neither purposefully manipulated nor constructed in a laboratory but has a natural origin. A study in [7] using unsupervised machine learning clustering methods corroborates previous findings that SARS-CoV-2 belongs to the Sarbecovirus subgenus of the Betacoronavirus genus within the Coronaviridae family [8; 9]. The whole genome analysis results also indicate that bats are more likely than pangolins to be the reservoir hosts for the virus. Another study in [10] demonstrates that SARS-CoV-2 may have resulted from a recombination of a pangolin coronavirus and a bat coronavirus, and that pangolins may have acted as an intermediate host for the virus.

Since the first cases were detected, the COVID-19 virus has spread to almost every country in the world and has been linked to the deaths of more than 404,000 people among over 7 million confirmed cases [11]. Tracing the evolution and spread of the virus is important for developing vaccines and drugs as well as proposing appropriate intervention strategies. Monitoring and analysing the viral genome mutations can be helpful for this task. Owing to strong immunological pressure in humans, the virus may have mutated over time to circumvent responses of the human immune system. This leads to the creation of virus variants with possibly different virulence, infectivity, and transmissibility [25].
This paper reports all point mutations occurring so far in SARS-CoV-2 and presents example implications drawn from the analysis of these mutation pattern data. Four types of mutations are detected: synonymous, nonsynonymous, insertion and deletion. We use 6,324 SARS-CoV-2 genome sequences collected in 45 countries and deposited in the NCBI GenBank so far, and create a spreadsheet dataset of all mutations that occurred across different genes. Eleven protein coding genes of SARS-CoV-2 have been identified, namely ORF1ab, spike (S), ORF3a, envelope (E), membrane (M), ORF6, ORF7a, ORF7b, ORF8, nucleocapsid (N) and ORF10. The order of these genes and their corresponding lengths are illustrated in Fig. 1. The genes S, E, M, and N produce structural proteins that play important roles in the virus functions. For example, the receptor-binding domain (RBD) region of the S protein can bind to a receptor of a host cell, e.g. the human and bat angiotensin-converting enzyme 2 (ACE2) receptor, enabling the entrance of the virus into the cell [12].

Predictions of protein structures may help to understand the virus's functions and thus contribute to developing vaccines and therapeutics against the virus. In this paper, to evaluate the possible impacts of genomic mutations on the virus functions, we propose the use of the SSpro/ACCpro 5 methods to predict protein secondary structure and relative solvent accessibility [13]. These predictors were built using deep learning one-dimensional bidirectional recurrent neural networks incorporated in the SCRATCH-1D software suite (version 1.2, 2018) [14]. By comparing the prediction results obtained on the reference genome and mutated genomes, we are able to assess whether the detected mutations have the potential to change the protein structure and solvent accessibility, and thus lead to possible changes of the virus characteristics.
Because of the functional importance of structural proteins, we only report the prediction results of these proteins in this study. The next section reviews related works in the literature. We then present materials and methods for SARS-CoV-2 mutation detection, and protein secondary structure and solvent accessibility prediction. Next we summarize statistics of SARS-CoV-2 mutations so far and implications of these mutations. Details of mutations in nonstructural ORF genes and structural S, E, M and N genes are presented after that. A full SARS-CoV-2 mutation spreadsheet report is provided in the supplemental information.

Since the first genomes were collected in December 2019, there have been many findings on the mutations of SARS-CoV-2. For example, Phan [15] analysed 86 genomes of SARS-CoV-2 downloaded from the Global Initiative on Sharing All Influenza Data (GISAID) database (https://www.gisaid.org/) and found 93 mutations over the entire viral genome sequences. Among them, three mutations occur in the RBD region of the spike surface glycoprotein S: N354D, D364Y and V367F, with the numbers showing amino acid (AA) positions in the protein. That study also reveals three deletions in the genomes of SARS-CoV-2 obtained from Japan, USA and Australia. Two of these deletion mutations are in the ORF1ab polyprotein and one deletion occurs in the 3' end of the genome. Likewise, a study in [16] shows that the SARS-CoV-2 genomes may have undergone recurrent, independent mutations at 198 sites, 80% of which are of the nonsynonymous type. Tang et al. [17] investigated 103 genomes of COVID-19 patients and discovered mutations in 149 sites of these genomes. The study also shows that the spike gene S consistently has larger dS values (synonymous substitutions per synonymous site) than other genes. In addition, two major lineages of the virus, denoted as L and S, have been specified based on two tightly linked SNPs.
The L lineage is found to be more prevalent than the S lineage among the examined sequences. Korber et al. [18] tracked the mutations of spike protein S of SARS-CoV-2 because it plays an important role in mediating infection of targeted cells and is the focus of vaccine and antibody therapy development efforts [19]. They detected 14 mutations in the spike protein that are growing in prevalence, especially the mutation D614G, which rapidly becomes the dominant form once it spreads to a new geographical region. Likewise, Hashimi [20] analysed the mutation frequency in the spike protein S of 796 SARS-CoV-2 genomes downloaded from the GISAID and GenBank databases. The study found 64 mutations occurring in the S protein sequences obtained from multiple countries. It suggests that the virus is spreading in two forms: the D614 form (residue D at position 614 in the S protein) accounts for 68.5% of the examined isolates, while the G614 form accounts for 31.5%. Koyama et al. [21], on the other hand, found several variants of SARS-CoV-2 that may cause drifts and escape from immune recognition by using the prediction results of B-cell and T-cell epitopes in [22]. In particular, the mutation D614G in the spike protein is found to be prevalent in the European population. This mutation may have caused antigenic drift, resulting in vaccine mismatches that lead to a high mortality rate in this population.

A recent situation report [23] by Nextstrain [24] on the genomic epidemiology of the novel coronavirus, using 5,193 publicly shared COVID-19 genomes, shows that SARS-CoV-2 on average accumulates changes at a rate of 24 substitutions per year. This is approximately equivalent to 1 mutation per 1,000 bases in a year. This evolutionary rate of SARS-CoV-2 is typical for a coronavirus, and it is lower than that of influenza (average 2 mutations per 1,000 bases per year) and HIV (average 4 mutations per 1,000 bases per year). Shen et al.
[25] conducted metatranscriptome sequencing for bronchoalveolar lavage fluid samples obtained from 8 patients with COVID-19 and found no evidence for the transmission of intrahost variants, as well as a high evolution rate of the virus, with the number of intrahost variants ranging from 0 to 51 around a median of 4. Pachetti et al. [26] examined 220 genomic sequences of COVID-19 patients from the GISAID database and discovered 8 novel recurrent mutations at nucleotide locations 1397, 2891, 14408, 17746, 17857, 18060, 23403 and 28881. Mutations at locations 2891, 3036, 14408, 23403 and 28881 are mostly found in Europe, while those at locations 17746, 17857 and 18060 occur in sequences obtained from patients in North America. Likewise, a study in [27] on 95 SARS-CoV-2 complete genome sequences discovered 116 mutations. Among them, the mutations C8782T in the ORF1ab gene, T28144C in the ORF8 gene and C29095T in the N gene are common.

We use 6,324 sequence records downloaded from the NCBI GenBank database on 2020-06-17. The latest collection date for the samples from which the sequences were derived was 2020-06-05. The data, which were collected in 45 countries, include both nucleotide sequences and protein translations of coding genes. A proportion of the 6,324 records have sequences of only a few proteins, i.e. these records do not annotate all 11 proteins (ORF1ab, ORF3a, ORF6, ORF7a, ORF7b, ORF8, ORF10, S, E, M and N). The number of available sequences thus differs from one protein to another (see column "Avai Num" in Table 1). Genome sequences that do not specify a country, and AA sequences that contain the letter "X" representing an unknown AA, are excluded from our calculations. We use the genome obtained from the isolate Wuhan-Hu-1, accession number NC_045512, as the reference genome.
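The exclusion rule stated above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' code: the `ProteinRecord` class, its field names and the convention that a missing country is stored as an empty string are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ProteinRecord:
    # Hypothetical container for one protein translation from one GenBank record.
    accession: str    # GenBank accession of the genome record
    country: str      # empty string when the record does not state a country
    protein: str      # protein name, e.g. "S" or "ORF3a"
    aa_sequence: str  # amino-acid (AA) sequence of the protein


def is_usable(record: ProteinRecord) -> bool:
    """Keep a record only if it names a country and contains no unknown AA 'X'."""
    has_country = bool(record.country.strip())
    has_unknown_aa = "X" in record.aa_sequence.upper()
    return has_country and not has_unknown_aa


def filter_records(records):
    """Return the records that pass both exclusion criteria, in input order."""
    return [r for r in records if is_usable(r)]
```

Because the filter is applied per protein sequence, a genome record can contribute to the statistics of one protein while being excluded for another, which is consistent with the per-protein "Avai Num" counts.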
For mutation detection, we apply a dynamic programming algorithm to protein AA sequences to obtain global pairwise alignments between a reference sequence and a query sequence. Specifically, we use the Python Bio.pairwise2.align.globalms function (https://biopython.org/docs/dev/api/Bio.pairwise2.html), where a match scores 2 points, a mismatch costs 0.5 points, opening a gap costs 2 points and extending a gap costs 1 point. Gaps are then inserted into the nucleotide sequences according to the resulting protein sequence alignments. Using the resulting pairwise alignments, we are able to compare query sequences and the reference sequences at each position and identify locations of insertion, deletion, synonymous and nonsynonymous mutations.

The structure of a virus protein plays a key role in its functions, and a change in structure may affect its functions, virulence, infectivity and transmissibility, possibly resulting in non-functional proteins. Protein secondary structure is defined by hydrogen bonding patterns and forms an intermediate stage before the protein folds into the three-dimensional shape that constitutes its tertiary structure. The eight types of protein secondary structure defined by the Dictionary of Protein Secondary Structure (DSSP) are 3₁₀ helix (G), α helix (H), π helix (I), hydrogen bonded turn (T), extended strand in parallel and/or anti-parallel β-sheet conformation (E), residue in isolated β-bridge (B), bend (S) and coil (C). The DSSP tool assigns every residue to one of the eight possible states. In a reduced form, these 8 conformational states can be collapsed to 3 states: H = {H, G, I}, E = {E, B} and C = {S, T, C} [28]. The protein secondary structure represents interactions between neighboring or nearby AAs as the functional three-dimensional shape is created through polypeptide folding.
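Two of the steps described above can be sketched in plain Python: reading mutations off an already aligned reference/query pair, and collapsing the 8 DSSP states to 3. This is a hedged illustration rather than the authors' implementation. The function names are hypothetical, the alignment itself is assumed to have been produced beforehand (gaps written as "-"), and numbering an insertion by the last reference position plus its offset within the inserted run is an assumption. Only protein-level changes are handled; separating synonymous from nonsynonymous mutations additionally requires comparing the underlying codons.

```python
# DSSP 8-state to 3-state reduction as given in the text:
# H = {H, G, I}, E = {E, B}, C = {S, T, C}
DSSP_8_TO_3 = {
    "H": "H", "G": "H", "I": "H",
    "E": "E", "B": "E",
    "S": "C", "T": "C", "C": "C",
}


def reduce_dssp(states: str) -> str:
    """Map a string of 8-class DSSP states to the 3-class alphabet."""
    return "".join(DSSP_8_TO_3[s] for s in states)


def call_mutations(ref_aln: str, qry_aln: str) -> list:
    """List AA-level differences between two aligned sequences of equal length.

    Substitutions are written like D614G, deletions like G82-, and
    insertions like -62R (numbering of insertions is an assumption).
    """
    if len(ref_aln) != len(qry_aln):
        raise ValueError("aligned sequences must have the same length")
    mutations = []
    ref_pos = 0   # 1-based position in the ungapped reference
    ins_run = 0   # number of inserted residues since the last reference residue
    for ref_aa, qry_aa in zip(ref_aln, qry_aln):
        if ref_aa == "-":
            if qry_aa == "-":
                continue  # column with gaps on both sides carries no information
            ins_run += 1
            mutations.append(f"-{ref_pos + ins_run}{qry_aa}")
            continue
        ref_pos += 1
        ins_run = 0
        if qry_aa == "-":
            mutations.append(f"{ref_aa}{ref_pos}-")
        elif qry_aa != ref_aa:
            mutations.append(f"{ref_aa}{ref_pos}{qry_aa}")
    return mutations
```

For a toy pair such as `call_mutations("MKDG--", "MK-ART")`, the sketch reports one deletion (D3-), one substitution (G4A) and two insertions (-5R, -6T).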
We therefore register a change in protein secondary structure if the predicted structure of the mutated AA or of any of its 10 neighboring AAs differs from that of the reference sequence. In detail, we consider the 5 AAs ahead of and the 5 AAs behind the mutated AA. The same approach is applied when considering a change in the protein relative solvent accessibility. Solvent-exposed area is the surface area of a biomolecule that is accessible to a solvent. Accordingly, a residue is considered exposed, denoted as the "e" state, if at least 25% of that residue is exposed. Otherwise, the residue is considered buried, i.e. the "b" state.

Various protein secondary structure prediction programs have been proposed in the literature, and many of them were developed based on artificial intelligence models using protein AA sequences, such as JPred4 [29], Spider2 [30], Porter 5 [31], RaptorX [32], PSSpred [33], YASSPP [34] and SSpro [13]. In this paper, we use the protein secondary structure and relative solvent accessibility prediction methods SSpro/ACCpro 5 [13] within the SCRATCH-1D software suite (release 1.2, 2018) [14]. These predictors were built using bidirectional recursive neural networks and a combination of the sequence similarity and sequence-based structural similarity to sequences in the Protein Data Bank [35]. Prediction results of 8-class structure (SSpro8 predictor) and 25%-threshold relative solvent accessibility (ACCpro predictor) are used for statistics on protein secondary structure and accessibility changes. However, we also report in the spreadsheet supplemental information the prediction results of 3-class structure (SSpro predictor) and relative solvent accessibility on 20 thresholds, ranging from 0% to 95% with a 5% step (the ACCpro20 predictor within the SCRATCH-1D software).

Table 1 summarizes statistics of SARS-CoV-2 mutations so far.
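The window-based change criterion from the methods (the mutated AA plus 5 AAs on each side) and the 25% exposure threshold can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, the per-residue prediction strings are assumed to come from a predictor such as SSpro8 or ACCpro, and clipping the window at the sequence ends is an assumption. Reference and query strings must have the same length, which holds for substitutions.

```python
def window(states: str, pos: int, flank: int = 5) -> str:
    """Return the states of residue `pos` (1-based) and `flank` residues on each side.

    The window is clipped at the sequence ends (an assumption).
    """
    if not 1 <= pos <= len(states):
        raise ValueError("position is outside the sequence")
    start = max(0, pos - 1 - flank)
    end = min(len(states), pos + flank)
    return states[start:end]


def has_change(ref_states: str, qry_states: str, pos: int, flank: int = 5) -> bool:
    """True if any predicted state within the window around `pos` differs."""
    if len(ref_states) != len(qry_states):
        raise ValueError("reference and query predictions must have equal length")
    return window(ref_states, pos, flank) != window(qry_states, pos, flank)


def accessibility_state(rsa_percent: float, threshold: float = 25.0) -> str:
    """Label a residue exposed ('e') at or above the threshold, else buried ('b')."""
    return "e" if rsa_percent >= threshold else "b"
```

The same `has_change` call works for structure strings (8-class letters) and for accessibility strings made of "e" and "b", since both are compared position by position within the window.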
"AA Length" indicates the length of the protein AA sequence derived from the SARS-CoV-2 reference genome. "Avai Num" denotes the number of records among the 6,324 NCBI GenBank records that have the complete sequence of the corresponding protein. "No Mu" refers to the number of sequences that do not have any mutations compared to the reference sequence. "Delete" is the number of deletion mutations occurring in the AA sequences of the protein. This number may be larger than the number of sequences having deletion mutations because an AA sequence may have more than one deletion. Likewise, "Insert", "Nonsyn" and "Syn" show the numbers of insertion, nonsynonymous and synonymous mutations occurring in the protein AA sequences. "Nonsyn/Syn" is the ratio of the number of nonsynonymous mutations to the number of synonymous mutations. "Struct Change" is the number of nonsynonymous mutations that have the potential to change protein secondary structure based on the SSpro8 predictor of the SCRATCH-1D software. Similarly, "Acc Change" refers to the number of nonsynonymous mutations that have the potential to change the protein relative solvent accessibility based on the ACCpro predictor of the SCRATCH-1D software. Insertion and deletion mutations alter protein secondary structure and solvent accessibility by default, so they are not included in the structure and solvent accessibility change statistics.

Table 1 shows that in the ORF3a and ORF8 proteins the number of nonsynonymous mutations is significantly larger than the number of synonymous mutations. In contrast, this ratio in proteins E, M, ORF7b and ORF10 is very small (less than 1). These proteins could be targeted for vaccine and drug development as they show less variation than other proteins. These findings are supported by results presented in Figs (Fig. 3), entire regions before and after the spike at position 614 are almost unchanged. Fig. 4 presents variations of multiple proteins.
In addition to proteins E, M, ORF7b and ORF10, we find that proteins ORF6 and ORF7a are also relatively stable, without a large number of variations at any particular location. Protein N has 1,927 nonsynonymous mutations, of which 1,678 are likely to change protein secondary structure, a ratio of 87.08%. This is considerably larger than the ratios for protein S (4.42%), protein M (24.79%) and protein E (64.29%). The number of solvent accessibility changes of protein S is larger than the number of its structure changes: 184 vs 164. The opposite holds for the other structural proteins: E (7 vs 18), M (8 vs 30) and N (37 vs 1,678).

The ORF1ab polyprotein has 7,096 AAs. Among the 6,324 records deposited in the NCBI GenBank database, only 3,726 genomes have the complete CoDing Sequence (CDS) of protein ORF1ab, with 1,024 unique AA sequences. This is quite a large number compared to other proteins but understandable because ORF1ab is the longest protein of SARS-CoV-2 and thus has a large number Table 2. Table 2). The GenBank accession numbers are presented on the left while isolate names and collection dates are on the right. The numbers on top show the positions of AAs in the protein and isolates are ordered by collection date. The first isolate having these deletions is USA-CA6/2020 (record MT044258 in the second row), collected on 2020-01-27 in USA: CA. This is also the isolate having the largest number of deletions: five sequentially at G82-, H83-, V84-, M85-, V86- and three at K141-, S142-, F143-. The patients that followed were possibly infected by this first case, but more data such as travel history are needed to confirm this hypothesis.

(18) Germany (16) Taiwan (9)

The ORF3a protein has 275 AAs with its complete CDS appearing in 5,527 isolates (146 unique AA sequences). Among these, 2,321 sequences have no mutation or only synonymous mutations, and 3,206 sequences have insertion, deletion or nonsynonymous mutations. Table 4.
Notably, the mutation Q57H occurs in 2,795 sequences collected in many countries. This is an emerging and active mutation that requires further investigation, as the latest case of this mutation was on 2020-06-05, the same as the latest collection date of the entire downloaded dataset. The mutation G251V, occurring in 206 sequences, is also a prevalent mutation in the ORF3a protein.

The ORF6 protein has 61 AAs, appearing in 5,792 isolates with 25 unique AA sequences. Among these, 5,719 sequences have no mutation or only synonymous mutations and 73 sequences have insertion, deletion or nonsynonymous mutations. Two insertion mutations occur in record MT520188 at positions -62R and -63T (end of the sequence). The same nine consecutive deletions occur in 2 sequences: MT547814 (collected in Hong Kong on 2020-01-22 from an adult male patient [37]) and MT609561 (USA: Virginia in 2020-04). These deletions are F22-, K23-, V24-, S25-, I26-, W27-, N28-, L29- and D30-. Alignment of these sequences with the reference genome is displayed in Fig. 6. The isolate MT547814 thus may have transmitted the virus to MT609561, but this implication needs to be corroborated by the patients' travel history. There are 23 distinct nonsynonymous mutations, and those occurring in 2 or more sequences are presented in Table 5.

The ORF7a protein has 121 AAs, found in 5,321 isolates with 34 unique AA sequences. Among these, 5,215 sequences have no mutations or only synonymous mutations, while the rest ORF7a. There are 15 deletion mutations occurring in 2 records: MT520425 (collected in USA: Massachusetts on 2020-03-27) and MT507795 (USA on 2020-04-06). The MT520425 sequence has 1 deletion at position L77-, while the MT507795 sequence has 14 sequential deletions: F63-, A64-, F65-, A66-, C67-, P68-, D69-, G70-, V71-, K72-, H73-, V74-, Y75- and Q76-. Alignment of these sequences with that of the reference genome is shown in Fig. 7.
There are 32 distinct nonsynonymous mutations, and those occurring in 2 or more sequences are reported in Table 6.

The ORF7b protein has 43 AAs with its complete CDS appearing in 5,175 isolates, forming a set of 11 unique AA sequences. There are 5,151 sequences having no mutations or only synonymous mutations and 24 sequences having nonsynonymous mutations. No insertion or deletion mutations are found in gene ORF7b. This, along with the small number of nonsynonymous mutations, indicates that ORF7b is a stable gene. The 10 distinct nonsynonymous mutations are F19L, F28Y, F30L, S31L, L32F, T40I, C41F, C41S, H42Y and A43T. A summary of nonsynonymous mutations in gene ORF7b occurring in 2 or more sequences is shown in Table 7.

The ORF10 protein has 38 AAs, appearing in 5,891 isolates with only 9 unique AA sequences. Among them, 5,872 sequences have no mutation or only synonymous mutations and the remaining 19 sequences have nonsynonymous mutations. No insertion or deletion mutations are found in gene ORF10. Similar to ORF7b, this is a stable gene. There are 8 distinct nonsynonymous mutations: I4L, A8V, S23F, R24L, R24C, A28V, D31Y and V33I. Those occurring in 2 sequences or more are presented in Table 9. The virus transmission may have happened between these two isolates but this needs further investigation. Alignment of these sequences is shown in Fig. 9.

The number of nonsynonymous mutations in gene S is 3,711, with 240 distinct mutations. Mutations that occur in 10 or more cases are reported in Table 10. The number of synonymous mutations is 670, giving a ratio of nonsynonymous to synonymous mutations of 5.54. Among the nonsynonymous mutations, mutation D614G is extremely common as it occurs in 3,089 sequences, mostly collected in USA (2,340), India (210) and Australia (132).
The first collection date of the D614G mutation cannot be identified precisely because some sequences deposited in the NCBI GenBank did not record the full date details. The current data show that one of the following sequences carrying the D614G mutation was collected first: MT326173 in USA in 2020, or MT270104, MT270105, MT270108 and MT270109, all in Germany: Bavaria in 2020-01, or MT503006 in Thailand on 2020-01-04. It is however important to note that the first patient having the D614G mutation and his/her location may never be known because the genome of that patient might not have been sequenced and reported. The information reported here can therefore support further investigation. Our statistics show that among 4,434 sequences of the S protein, 3,089 sequences have the mutation D614G, accounting for 69.67%. This proportion has increased considerably compared to the 31.5% found in the previous analysis in [20] on a dataset downloaded on 2020-03-22.

On the other hand, there are 37 A829T mutations that all occur in Thailand. The first case of this mutation was collected on 2020-01-23 and its latest case was on 2020-04-07. This may indicate that the first case transmitted the virus to the other cases having the same mutation A829T in Thailand. Similarly, mutations H146Y (24 cases), V483A (11 cases), E554D (14 cases), P681L (16 cases) and S939F (11 cases) all occur only in USA, and mutation L8V (4 cases) occurs only in Hong Kong (refer to the attached spreadsheet). The "Latest Date" in Tables 2-15 may be used to infer which mutations are inactive or still active. For example, in gene S (Table 10), the latest date of D614G was 2020-06-05 (the same as the latest collection date of the entire dataset), which indicates that this mutation is still active. The latest date of P681L was 2020-04-03, indicating that this mutation may no longer occur.
This kind of information may be useful for further research on vaccine and drug development, as ongoing changes of the viral proteins need to be tracked and addressed.

We identify the RBD region within the residue range Arg319-Phe541 of protein S based on a study in [36]. In the RBD region only, the number of nonsynonymous mutations is 53 and that of synonymous mutations is 46, giving a ratio of 1.15. This is much smaller than the ratio of 5.54 for the entire gene S, suggesting that the RBD region may have been optimized for binding to a receptor of a host cell. This is complemented by Fig. 9, which shows all deletion mutations in gene S lying outside the RBD region. Note that the difference between these ratios is partly due to the large number of D614G mutations (3,089), which lie outside the RBD region. Table 11 summarizes nonsynonymous mutations in the RBD region occurring in 2 or more sequences. A notable mutation in this region is V483A, occurring in 11 isolates all collected in USA. The first and latest collection dates of these isolates were 2020-03-05 and 2020-04-05 respectively, suggesting that the first isolate may have spread to others having the same mutation V483A. Likewise, the mutation G476S occurs in 6 isolates all collected in USA: WA from 2020-03-10 to 2020-03-25. In contrast, the mutation Y453F occurs in 5 sequences all in the Netherlands, but the first collection date was 2020-04-25 and the latest collection date was 2020-04-29. These dates are too close together, indicating that all the reported Y453F cases may have been infected from another case, whose genome had not been sequenced and reported to the NCBI GenBank. It is important to note that all the transmission implications need further investigation with more data from other aspects such as travel history, physical contacts and so on.

For the entire protein S, 134 nonsynonymous mutations (48 unique) have both structure and solvent accessibility change potentials.
Those occurring in 2 or more sequences are reported in Table 12. Mutation H146Y occurs in 24 cases and mutation P681L occurs in 16 cases, all of which were collected in USA. The most common mutation, D614G, does not have the potential to change either protein secondary structure or relative solvent accessibility.

The envelope protein E has 75 AAs, found in 5,852 GenBank records with 15 unique AA sequences. Among them, 5,824 sequences have no mutation or only synonymous mutations while 28 sequences have nonsynonymous mutations. Gene E is thus relatively stable and could be targeted for vaccine and drug development. This is supported by the fact that no insertion or deletion mutations are found within gene E. There are 14 distinct nonsynonymous mutations in gene E, and those occurring in 2 or more sequences are presented in Table 13. Five distinct nonsynonymous mutations in gene E have the potential to change protein structure: S68C, S68F, P71L, D72Y and L73F. In addition, 4 distinct mutations have the potential to change relative solvent accessibility: L37H, L37R, D72Y and L73F. Therefore, D72Y and L73F are two mutations in gene E that have the potential to change both protein structure and solvent accessibility.

Table 12. Gene S - Nonsynonymous mutations that have both structure and solvent accessibility change potentials occurring in 2 or more sequences. The "Query Structure" (and "Query Accessibility") shows the unique structure (and accessibility) changes based on prediction results. The structure letter in parentheses is the predicted structure of the residue at the corresponding mutation position. The five letters before and after the parentheses are the structures of neighbouring residues. Likewise, the letter "b" or "e" in parentheses shows the accessibility status of the residue at the mutation position.
2 CCCCC(C)CBEEE CCCCC(C)CEEEE bebbb(b)bbbbb beebb(b)bbbbb
D253G 7 EECCT(T)CCCTC CCCCC(C)CCEEC CCCCT(C)CCEEC bbbeb(e)bbbeb bbbeb(b)bbbbb
S254F 2 ECCTT(C)CCTCC ECTTC(E)EEECC bbebe(b)bbebb bbbbb(b)bbbbb
W258L 4 TCCCT(C)CCCSE CCCEE(E)EEESE ebbbe(b)bbbeb bbbbb(b)bbbeb
G261D 4 CTCCC(C)SEEEE EECCC(C)SEEEE bebbb(b)ebbbb bbbbb(b)

The M protein has 222 AAs and its complete CDS appears in 5,677 GenBank records, with 37 unique AA sequences. There are 5,557 sequences having no mutation or only synonymous mutations, while the other 120 sequences have nonsynonymous mutations. No insertion or deletion mutations are found in gene M. The number of distinct nonsynonymous mutations in gene M is 37, with those occurring in 5 or more sequences shown in Table 14. Among these, 10 mutations are likely to change protein secondary structure: C64F, A69S, A69V, V70F, N113B, R158L, V170I, D190N, D209Y and S214I. In addition, 6 mutations have the potential to change solvent accessibility: N113B, P123L, P132S, H155Y, D190N and T208I. N113B and D190N are thus two mutations having the potential to change both protein structure and solvent accessibility in gene M.

The N protein has 419 AAs and its complete CDS appears in 5,281 isolates, with 178 unique AA sequences. Among them, 4,315 sequences have no mutation or only synonymous mutations, while the remaining 966 sequences have deletions or nonsynonymous mutations. There are no insertions in gene N. The sequence in MT434815 (collected in USA: NY on 2020-03-09) has three sequential deletions at Q390-, T391- and V392-, while the sequence in MT370992 (USA: NY on 2020-03-20) has six sequential deletions at T366-, E367-, P368-, K369-, K370- and D371-. Two other sequences, MT605818 and MT560525 (both collected in Turkey on 2020-04-16), have three sequential deletions at R195-, N196- and S197-. There are 1,927 nonsynonymous mutations with 156 distinct ones, and those occurring in 10 or more sequences are presented in Table 15.
Notable mutations are R203K, occurring in 871 sequences, and G204R, occurring in 433 sequences. There are 15 mutations in this protein having the potential to change both protein structure and solvent accessibility: G18V, D22Y, G34W, R40C, R40L, R185C, A211S, P365H, T391I, T393I, A398S, D399E, D399H, D401Y and D402Y.

Analysing the virus genome sequences and their proteins is crucial for understanding the virus and proposing appropriate approaches to respond to and control the pandemic. This paper has reported all point mutations of SARS-CoV-2 since the virus's first genomes were obtained in December 2019. A SARS-CoV-2 mutation database is built using a large number of genome sequences (6,324) obtained across 45 countries. This database can enable scientists to monitor the evolution and spread of the virus, although the use of these data needs to be corroborated with patients' clinical data and travel history before conclusions can be confirmed. We also predict the secondary structure and relative solvent accessibility of the virus proteins to evaluate whether the detected mutations have the potential to change the virus characteristics. These protein secondary structure and solvent accessibility change potentials are predictions from deep learning recurrent neural networks and need to be experimentally verified. They nevertheless provide important insights about the virus and prompt further experimental biochemistry and molecular biology research into the genomic regions of these mutations. Among the 3,089 D614G mutations, our prediction results show that none is likely to change the protein secondary structure or relative solvent accessibility. In addition, we have shown regions of the SARS-CoV-2 genomes that have small variations, such as those coding for proteins E, M, ORF6, ORF7a, ORF7b and ORF10. These regions could be targeted for vaccine and drug development.
USA (10) Australia (6) Bangladesh (3) Hong Kong (2) Taiwan (1) Germany (1) Kazakhstan (1) S202N 25 2020-01-30 China Australia (203) Greece (122) Bangladesh (88) Japan (38) Czech Republic (34) Poland (26) Germany (18) India (12) Taiwan (10) Turkey (8) France (8) Thailand (6) Serbia Australia (101) Greece (61) Bangladesh (44) Japan (19) Czech Republic ) Serbia (3) Italy (3) Spain (2) Russia (2) Sri Lanka (1) Puerto Rico (1) Peru (1) Nigeria (1)

[1] A new coronavirus associated with human respiratory disease in China
[2] Genomic characterisation and epidemiology of 2019 novel coronavirus: implications for virus origins and receptor binding
[3] A pneumonia outbreak associated with a new coronavirus of probable bat origin
[4] Identifying SARS-CoV-2 related coronaviruses in Malayan pangolins
[5] Probable pangolin origin of SARS-CoV-2 associated with the COVID-19 outbreak
[6] The proximal origin of SARS-CoV-2
[7] Origin of novel coronavirus (COVID-19): a computational biology study using artificial intelligence. bioRxiv
[8] A novel coronavirus from patients with pneumonia in China
[9] The species Severe acute respiratory syndrome-related coronavirus: classifying 2019-nCoV and naming it SARS-CoV-2
[10] Isolation of SARS-CoV-2-related coronavirus from Malayan pangolins
[11] WHO coronavirus disease (COVID-19) dashboard
[12] Characterization of the receptor-binding domain (RBD) of 2019 novel coronavirus: implication for development of RBD protein as a viral attachment inhibitor and vaccine
[13] SSpro/ACCpro 5: almost perfect prediction of protein secondary structure and relative solvent accessibility using profiles, machine learning and structural similarity
[14] SCRATCH: a protein structure and structural feature prediction server
[15] Genetic diversity and evolution of SARS-CoV-2. Infection
[16] Emergence of genomic diversity and recurrent mutations in SARS-CoV-2. Infection
[17] On the origin and continuing evolution of SARS-CoV-2
[18] Spike mutation pipeline reveals the emergence of a more transmissible form of SARS-CoV-2. bioRxiv
[19] A short review on antibody therapy for COVID-19
[20] Emergence of mutations and possible antigenic drift in the surface glycoprotein of SARS-CoV-2 (COVID-19)
[21] Emergence of drift variants that may affect COVID-19 vaccine development and antibody treatment
[22] A sequence homology and bioinformatic approach can predict candidate targets for immune responses to SARS-CoV-2
[23] Genomic analysis of COVID-19
[24] Nextstrain: real-time tracking of pathogen evolution
[25] Genomic diversity of SARS-CoV-2 in coronavirus disease 2019 patients
[26] Emerging SARS-CoV-2 mutation hot spots include a novel RNA-dependent-RNA polymerase variant
[27] Genomic characterization of a novel SARS-CoV-2
[28] Multi-output interval type-2 fuzzy logic system for protein secondary structure prediction
[29] JPred4: a protein secondary structure prediction server
[30] Spider2: A package to predict secondary structure, accessible surface area, and main-chain torsional angles by deep neural networks
[31] Porter 5: fast, state-of-the-art ab initio prediction of protein secondary structure in 3 and 8 classes. bioRxiv
[32] RaptorX: exploiting structure information for protein alignment by statistical inference
[33] A comparative assessment and analysis of 20 representative sequence alignment methods for protein structure prediction
[34] YASSPP: better kernels and coding schemes lead to improvements in protein secondary structure prediction
[35] The protein data bank
[36] Structure of the SARS-CoV-2 spike receptor-binding domain bound to the ACE2 receptor
[37] A rare deletion in SARS-CoV-2 ORF6 dramatically alters the predicted three-dimensional structure of the resultant protein