key: cord-0707024-r2ac5l0x authors: Zhou, Zhi-Jian; Qiu, Ye; Pu, Ying; Huang, Xun; Ge, Xing-Yi title: BioAider: an efficient tool for viral genome analysis and its application in tracing SARS-CoV-2 transmission date: 2020-08-28 journal: Sustain Cities Soc DOI: 10.1016/j.scs.2020.102466 sha: 8f2d01f7ddb366f4d53b11826abc58e300b4f15e doc_id: 707024 cord_uid: r2ac5l0x

The novel human coronavirus (SARS-CoV-2) causes the worldwide coronavirus disease 2019 (COVID-19) pandemic. Control of the COVID-19 pandemic is vital for public health and is a prerequisite for maintaining social stability. However, the origin and transmission route of SARS-CoV-2 are unclear, which makes the virus very difficult to control. Monitoring viral variation and screening functional mutation sites are crucial to the prevention and control of infectious diseases. In this study, we developed user-friendly software, named BioAider, for quick sequence annotation and mutation analysis of large-scale genome-sequencing data. Using it, we detected 14 substitution hotspots within 3,240 SARS-CoV-2 genome sequences, including 3 groups of potentially linked substitutions. NSP13-Y541C was a crucial substitution that might affect the unwinding activity of the helicase. In particular, we discovered an SR-rich region of SARS-CoV-2 distinct from that of SARS-CoV, indicating a more complex replication mechanism and a unique N-M interaction in SARS-CoV-2. Interestingly, the number of SRXX repeat fragments in SARS-CoV-2 provided further evidence of its animal origin. Overall, we developed an efficient tool for rapid identification of viral genome mutations that could facilitate viral genomic studies. Using this tool, we found critical clues to the transmission route of SARS-CoV-2, which would provide theoretical support for the epidemic control of pathogenic coronaviruses.

joint efforts of multiple industries (Xu, Luo, Yu, & Cao, 2020).
Coronaviruses are enveloped viruses with a non-segmented, positive-sense RNA genome, and their full-length genome is the largest among RNA viruses (Rota et al., 2003). The first ORF encodes polyprotein 1ab (pp1ab, ORF1ab) and occupies approximately the first two-thirds of the genome; the remainder encodes structural and non-structural proteins. pp1ab is usually hydrolyzed into 16 non-structural proteins (NSP1 ~ NSP16) by one or two papain-like proteases (PLPs) in NSP3 and the 3C-like protease (3CLpro) in NSP5, to promote viral RNA replication and transcription (Fehr & Perlman, 2015; Ziebuhr, 2005). Among these non-structural proteins in pp1ab, NSP2 can inhibit two host proteins, PHB1 and PHB2, which play an important role in viral infection (Cornillez-Ty, Liao, Yates, Kuhn, & Buchmeier, 2009). The PLPs of NSP3 hydrolyze the N-terminal part of pp1ab into NSP1 ~ NSP4, and the 3CLpro of NSP5 binds 11 conserved Q-S dipeptide sites in pp1ab to produce 12 mature non-structural proteins (NSP5 ~ NSP16) (Fehr & Perlman, 2015; Lei, Kusov, & Hilgenfeld, 2018). NSP12, also known as the RNA-dependent RNA polymerase (RdRp), is a key component and is crucial in the replication and transcription cycle of coronaviruses (Subissi et al., 2014). Currently, RdRp is considered to be the main target of antiviral drugs for SARS-CoV-2. NSP13 has NTPase/helicase activity and can unwind double-stranded RNA and DNA helices (Jang, Lee, Yeo, Jeong, & Kim, 2008). Because NSP13 is conserved and necessary in all coronavirus species, it is considered an ideal target for the development of antiviral drugs for SARS-CoV (Jia et al., 2019; Shum & Tanner, 2008). The four main structural proteins of coronaviruses are the spike protein (S), small envelope protein (E), membrane protein (M) and nucleocapsid protein (N) (Cui, Li, & Shi, 2019). The S protein interacts with molecular receptors to mediate cell membrane fusion, allowing viruses to enter host cells (Li, 2016).
The E and M proteins are involved in virus assembly, which is related to the formation and release of the viral envelope, and the N protein plays an important role in virus replication and pathogenesis (Li, 2016; Schoeman & Fielding, 2019). Based on the seven conserved replicase domains in pp1ab (polyprotein-1ab) used for coronavirus classification, SARS-CoV-2 belongs to the Sarbecovirus subgenus of the genus Betacoronavirus in the subfamily Orthocoronavirinae, together with human SARS-CoV and various bat SARSr-CoVs (SARS-related CoVs), and SARS-CoV-2 is genetically highly similar to the bat coronavirus Bat-CoV-RaTG13 (International Committee on Taxonomy of Viruses Executive, 2020; P. Zhou et al., 2020; Zhu et al., 2020). Similar to SARS-CoV, SARS-CoV-2 uses angiotensin-converting enzyme II (ACE2) as its receptor; in addition, the serine protease TMPRSS2 plays an important role in activating the spike (S) protein (Hoffmann et al., 2020). Some novel features of SARS-CoV-2 have been revealed, such as a furin protease cleavage site in the spike protein, which may be related to viral transmissibility (Q. . Recently, a novel bat-derived coronavirus, RmYN02, with a similar furin-site insertion in the S protein was found, which emphasized the bat origin of the virus (H. Zhou et al., 2020). In RNA viruses such as coronaviruses, the encoded RNA-dependent RNA polymerase (RdRp) lacks proofreading capability, which leads to a high misincorporation rate during replication even with the help of an exoribonuclease such as ExoN (Denison, Graham, Donaldson, Eckerle, & Baric, 2011). Together with genome recombination events, this can generate viral diversity and promote the transmission and disease of coronaviruses (Hu et al., 2017). At the same time, in order to better adapt to the host, the virus usually mutates continuously under the selection pressure of the host.
In particular, some high-frequency non-synonymous substitution sites often experience strong positive selection (Pond et al., 2006). Considering the continuous increase in infected people and the high variability of the virus, it is important to pay attention to the genomic changes of SARS-CoV-2. Recently, 7 substitution hotspots in ) and N-GGG608_609_610AAC (N-RG203_204KR) have been reported (Capobianchi et al., 2020; Ceraolo & Giorgi, 2020; Issa, Merhi, Panossian, Salloum, & Tokajian, 2020; . In addition, the non-synonymous substitutions ORF3a-G251V and ORF8-L84S both change amino acid (aa) polarity, which may affect the conformation of the protein and alter its function (Ceraolo & Giorgi, 2020). Despite these recent discoveries, in order to deal with the continuous variation of SARS-CoV-2 and the accumulation of sequencing data, methods for quick and efficient well through scripts, it is difficult for biological or clinical experts without bioinformatics and programming skills. Some tools such as MutPred can analyze variations from an input amino acid sequence, but they are unable to handle synonymous substitutions or consecutive nucleotide mutations such as dinucleotide and trinucleotide substitutions (Pejaver et al., 2017). In comparative genomic studies, it is very important to identify whether amino acid properties have changed when dealing with non-synonymous substitutions, because this can help to identify important mutations. For this purpose, we developed an interactive analysis tool, Bioinformatics Aider V1.0 (BioAider V1.0). BioAider proved efficient and convenient for gene annotation and mutation screening on multiple genome-sequencing datasets. In this research, we collected 3240 complete genome sequences of SARS-CoV-2 from 64 different countries and regions, with sampling dates from December 24, 2019 to April 1, 2020. We conducted a detailed genome mutation analysis using BioAider.
We identified 14 substitution hotspots and 3 groups of possibly linked substitutions. In particular, we found distinctive polymorphism in the SR-rich region of the N protein in SARS-CoV-2 and related coronaviruses in other animals. Our work provides a new tool for recognizing the variation and evolution of SARS-CoV-2, contributes to research on the replication and pathogenic mechanisms of SARS-CoV-2, and gives further evidence for the animal origin of SARS-CoV-2.

BioAider provides a convenient function for annotating homologous sequences. First, import the aligned complete genome sequence set, and move the reference sequence used for gene extraction to the front of the sequence set. Paste the gene information of the reference sequence into the input box as the gene name, start string and end string, separated by commas. BioAider can then batch extract genes and annotations from a large number of sequences. Note that the length of the start string or end string is not limited, but each must be unique in the reference sequence. Second, for mutation analysis, BioAider scans all the codons in the aligned sequences, using the first sequence in the data set as the reference sequence. BioAider can identify five different mutation types (synonymous, non-synonymous, insertion, deletion and termination) based on the standard codon method. It can distinguish changes in amino acid properties when dealing with non-synonymous substitutions. It then locates the position of the corresponding base in the mutated codon, so it can identify multiple types of mutation at the same site and also discriminate dinucleotide and trinucleotide substitutions. Next, it summarizes all the mutation sites with their corresponding frequencies and strains, and if the user chooses to generate the frequency distribution of synonymous or non-synonymous sites, BioAider directly generates the results as a vector-format image.
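The codon-by-codon scan described above can be illustrated with a minimal Python sketch. This is not BioAider's actual code: the function names (`classify_codon`, `scan_codons`, `property_changed`), the handling of partial gaps, and the amino acid property grouping are assumptions made for illustration, and only the standard genetic code (NCBI translation table 1) is used. Glycine is placed in the polar uncharged group so that the sketch agrees with the polarity changes reported in the text; textbooks differ on this point.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1) in canonical TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

# Assumed property grouping (not taken from the paper); textbooks differ on glycine.
PROPERTY_GROUPS = {
    "nonpolar": "AVLIPFWM",
    "polar uncharged": "GSTCYNQ",
    "positively charged": "KRH",
    "negatively charged": "DE",
}
AA_PROPERTY = {aa: name for name, group in PROPERTY_GROUPS.items() for aa in group}


def classify_codon(ref_codon, alt_codon):
    """Return the mutation type of alt_codon relative to ref_codon, or None if identical."""
    if ref_codon == alt_codon:
        return None
    if ref_codon == "---":
        return "insertion"
    if alt_codon == "---":
        return "deletion"
    if ref_codon not in CODON_TABLE or alt_codon not in CODON_TABLE:
        return "unclassified"  # partial gaps or degenerate bases (assumed handling)
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        return "termination"
    return "non-synonymous"


def scan_codons(ref_seq, alt_seq):
    """Compare two aligned, in-frame coding sequences codon by codon."""
    records = []
    for start in range(0, min(len(ref_seq), len(alt_seq)) - 2, 3):
        ref_codon, alt_codon = ref_seq[start:start + 3], alt_seq[start:start + 3]
        kind = classify_codon(ref_codon, alt_codon)
        if kind is None:
            continue
        # 1-based nucleotide positions that differ inside this codon
        changed = [start + i + 1 for i in range(3) if ref_codon[i] != alt_codon[i]]
        records.append((start // 3 + 1, changed, kind))
    return records


def property_changed(ref_aa, alt_aa):
    """True when a substitution moves the residue to a different property group."""
    return AA_PROPERTY[ref_aa] != AA_PROPERTY[alt_aa]
```

For example, `scan_codons("AGGGGA", "AAACGA")` reports changed bases at positions 2 and 3 in the first codon and position 4 in the second, which is how a trinucleotide substitution spanning two codons (such as N-RG203_204KR) shows up as two consecutive non-synonymous records.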
Of note, the frequency distribution map does not include sites that carry both synonymous and non-synonymous substitutions, because a common substitution frequency cannot be determined for such sites.

The 3240 complete genome sequences of SARS-CoV-2 with relatively high sequencing quality were downloaded from GISAID (https://www.gisaid.org/), and the reference genome sequence of SARS-CoV-2 (NC_045512.2) for ORF annotation was from GenBank (https://www.ncbi.nlm.nih.gov/genbank). All the viral strains used in this study are listed in Table S1. Multiple sequence alignment of the SARS-CoV-2 genomic sequences was performed using MAFFT v7.407 (Katoh & Standley, 2013; . The coding genes of these 3240 SARS-CoV-2 genome sequences were annotated and extracted using BioAider V1.0. We extracted 11 continuously coding genes based on the annotation information of NC_045512.2, including ORF1ab, S, ORF3a, E, M, ORF6, ORF7a, ORF7b, ORF8, N and ORF10. We then used the MUSCLE program in MEGA v7.0.14 to align these coding genes by codon (Kumar, Stecher, & Tamura, 2016). We combined these 11 continuously coding genes into a tandem sequence in BioAider. The tertiary protein structure was downloaded from the Protein Data Bank (PDB, http://www.rcsb.org), protein homology modeling was performed with the online tool SWISS-MODEL, and Chimera 1.10.2 was then used to compare protein models (Pettersen et al., 2004; Waterhouse et al., 2018).

BioAider V1.0 was developed with Python 3.7 and R 3.5, and uses PyQt5 for the interface. BioAider keeps complex algorithms and fault-tolerant processing mechanisms internal, presenting users with a very simple interface, and prompts are added to the interface controls. The functions of BioAider are divided into three main sections (Fig 1). SeqTools is used for common sequence processing, such as Split Sequence Fragment, Combine Gene, and sequence Fast Annotation (Fig 2A).
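As an illustration of the marker-based gene extraction used for annotation, the following Python sketch locates a gene in the reference sequence by a unique start string and a unique end string and cuts the same columns out of every aligned sequence. It is a hedged sketch only: `find_unique` and `extract_gene` are hypothetical names, and it assumes that the two markers occur in the aligned reference without internal gap characters, which BioAider itself may handle differently.

```python
def find_unique(reference, marker):
    """Return the 0-based index of marker in reference, requiring exactly one occurrence."""
    first = reference.find(marker)
    if first == -1:
        raise ValueError(f"marker {marker!r} not found in the reference sequence")
    if reference.find(marker, first + 1) != -1:
        raise ValueError(f"marker {marker!r} is not unique in the reference sequence")
    return first


def extract_gene(alignment, start_marker, end_marker):
    """Slice one gene out of every aligned sequence.

    alignment is a list of (name, sequence) pairs of equal length; the first
    entry is treated as the reference, as described in the text.
    """
    reference = alignment[0][1]
    start = find_unique(reference, start_marker)
    end = find_unique(reference, end_marker) + len(end_marker)
    if end <= start:
        raise ValueError("end marker must lie downstream of the start marker")
    return [(name, seq[start:end]) for name, seq in alignment]
```

Requiring uniqueness mirrors the constraint stated for the start and end strings: a marker that occurs twice in the reference would make the gene boundaries ambiguous, so the sketch raises an error instead of guessing.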
The Similar analysis section includes two functions: computing a sequence identity matrix and removing highly similar sequences according to a specified threshold. Notably, BioAider can compute nucleotide and amino acid identity simultaneously, and provides an optional gap compression feature in Sequence Identity Matrix. For the Mutation Tools, three nested functions are used for genome variation analysis. The Site Counter summarizes the type, count and proportion of nucleotides (or amino acids) at each site of the aligned sequence datasets, and the Site Scree is used to extract the sequence of the specified site (Fig 2B). In particular, in addition to highly summarized analysis results, the Mutation Analysis function also provides high-quality vector graphics of the synonymous or non-synonymous substitution frequency distribution for publication (Fig 2C).

In this study, 'substitution' refers to synonymous or non-synonymous substitution, and 'mutation' includes all genomic variation. The frequency of a mutation (or substitution) refers to the number of strains that are mutant relative to the reference strain sequence. Compared with the early sampled viral strain (EPI_ISL_402119) in the GISAID database, a total of 2152 mutation sites (regardless of insertions or deletions) were identified in 3239 sequenced SARS-CoV-2 whole genomes, accounting for 7.36% of the SARS-CoV-2 full-length gene sequence (Table 1). The numbers of synonymous and non-synonymous substitution sites were 784 (2.68%) and 1335 (4.57%), respectively. Among the non-synonymous substitution sites, 738 resulted in changes in amino acid properties. In addition, 12 sites contained both synonymous and non-synonymous substitutions depending on the strain, and a total of 21 termination mutation sites were also detected.
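The per-site summary produced by the Site Counter function described in this section can be sketched in a few lines of Python. The function names `site_counter` and `variable_sites` are hypothetical, and the output layout is an assumption chosen for readability, not BioAider's actual report format.

```python
from collections import Counter


def site_counter(sequences):
    """For each alignment column, map every residue to (count, proportion)."""
    if len({len(seq) for seq in sequences}) != 1:
        raise ValueError("all sequences must be aligned to the same length")
    total = len(sequences)
    summary = []
    for column in zip(*sequences):
        counts = Counter(column)
        summary.append({res: (n, n / total) for res, n in counts.most_common()})
    return summary


def variable_sites(sequences):
    """Return the 1-based positions of columns holding more than one residue type."""
    return [i + 1 for i, site in enumerate(site_counter(sequences)) if len(site) > 1]
```

The same functions work on nucleotide or amino acid alignments, since they only count characters per column.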
Among all the coding genes, the three with the most mutation sites were ORF1ab, spike (S) and nucleocapsid (N), with 1400, 296 and 168 mutation sites, accounting for 6.58%, 7.75% and 13.36% of their respective gene lengths. The gene with the highest proportion of mutation sites was ORF10, with 16 mutation sites in 114 bases (14.04% of ORF10). All the mutation sites and viral strains of SARS-CoV-2 detected by BioAider in this study are summarized in Tables S2 and S3. To assess the overall substitution frequency of these mutated sites, we divided the substitution frequency into six different groups and drew the frequency spectra of the 2119 substitution sites (784 synonymous and 1335 non-synonymous) of the 3239 sequenced strains (Fig 3). The result shows that non-synonymous substitution sites always outnumbered synonymous ones except in the fifth group. In addition, a large number of sites had substitution frequencies between 1 and 5, and more than half of the substitution sites were observed in only a single strain. We also found 60 sites with a substitution frequency greater than 20, including 40 non-synonymous and 20 synonymous substitution sites. The substitution frequency distributions of synonymous or non-synonymous sites for each coding gene are shown in Fig S1.

In total, 2119 substitution sites were detected in the 3239 sequenced strains, most of them at low frequency. We defined a site with a substitution frequency over 200 as a substitution hotspot; so far, 14 substitution hotspots were identified, distributed in the ORF1ab, S, ORF3a, ORF8 and N genes (Fig 4). Among these substitution hotspots, 10 sites were non-synonymous and 4 were synonymous. Of the non-synonymous substitution hotspots, 6 caused a change in the polarity or charge of the amino acid.
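The hotspot definition (substitution frequency over 200) and the per-gene proportions quoted above reduce to simple filtering and arithmetic. The Python sketch below shows one way to express them; the function names and the dictionary layout are assumptions, and the site names used in any example input are placeholders, not data from this study.

```python
def summarize_frequencies(freq_by_site, hotspot_threshold=200):
    """Summarize a {site_name: number_of_mutant_strains} mapping.

    A site counts as a hotspot when its frequency is strictly greater than
    hotspot_threshold, following the 'over 200' definition in the text.
    """
    singletons = sum(1 for freq in freq_by_site.values() if freq == 1)
    hotspots = sorted(site for site, freq in freq_by_site.items() if freq > hotspot_threshold)
    return {"sites": len(freq_by_site), "singletons": singletons, "hotspots": hotspots}


def percent_of_gene(n_sites, gene_length):
    """Percentage of positions in a gene that are mutation sites, to two decimals."""
    return round(100 * n_sites / gene_length, 2)
```

As a check against the figures in the text, `percent_of_gene(16, 114)` gives 14.04, matching the proportion reported for ORF10.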
Notably, there was a trinucleotide substitution hotspot in the N gene that caused two amino acid changes. For each substitution hotspot, we divided the strains into two geographical regions, China and outside of China, and studied the difference in the distribution of the mutant and referential types between the regions (Fig 5). As the results show, except for the substitution hotspots ORF1ab-10818 and ORF3a-752, the mutant and referential types at the other 12 substitution hotspots showed significant spatial distribution differences (p<0.01, chi-square test) between China and outside of China. Furthermore, at the substitution hotspots ORF1ab-8517 and ORF8-251, the mutant type (ORF1ab-8517T or ORF8-251C) had a higher ratio and was more prevalent in China than outside of China, opposite to the other 10 substitution hotspots. Going further, we found that some hotspots contained substitutions with similar patterns and frequencies (Table 2), indicating a potential connection among these substitution hotspots, and we then found 3 groups of possibly linked substitutions. In To test our hypothesis, we compared the numbers of strains of the referential type and of the mutant type that had mutated synchronously at the possibly linked substitution hotspots (Table 3). For each combination of possibly linked substitution hotspots, the mutant type and the referential type together accounted for more than 98% of the population, which implies that the genetic variants at these combined substitution hotspots were not independent but linked. Given that the 541st aa is one of the known key sites of NSP13 for binding to nucleic acids in SARS-CoV, the non-synonymous substitution hotspot ORF1ab-Y5865C (NSP13-Y541C) in NSP13 probably affects the function of NSP13. We predicted a protein model of SARS-CoV-2 NSP13 by homology modeling.
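The two statistical checks used in this part of the analysis, the chi-square test on mutant versus referential counts in two regions and the share of strains that are consistently mutant or consistently referential across a group of hotspots, can be sketched in plain Python. This is an illustrative sketch under stated assumptions: it uses the Pearson chi-square statistic for a 2x2 table without continuity correction (the text does not say whether a correction was applied), and the function names are hypothetical.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test for the 2x2 table [[a, b], [c, d]].

    For example, a and b could be mutant and referential counts in one region,
    c and d the same counts in the other. Returns (statistic, p_value) with one
    degree of freedom and no continuity correction.
    """
    n = a + b + c + d
    row1, row2, col1, col2 = a + b, c + d, a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column must contain at least one observation")
    statistic = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # With 1 degree of freedom, P(X > x) equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


def linkage_concordance(genotypes):
    """Fraction of strains that are all-mutant or all-referential across linked sites.

    genotypes is a sequence of tuples of booleans, one tuple per strain and one
    boolean per hotspot (True means the mutant base is present).
    """
    genotypes = list(genotypes)
    if not genotypes:
        raise ValueError("at least one strain is required")
    concordant = sum(1 for g in genotypes if all(g) or not any(g))
    return concordant / len(genotypes)
```

A hotspot would be called significantly different between regions when the returned p-value is below 0.01, and a group of hotspots would be treated as linked when the concordance exceeds the 98% level mentioned in the text.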
The result showed that the closest protein model to SARS-CoV-2 NSP13 was from SARS-CoV (PDB ID: 6JYT, 99.83% amino acid identity). As the result shows, the tertiary structures of NSP13 from SARS-CoV-2 and SARS-CoV overlapped almost completely (Fig 6A). In addition, the 541st aa was relatively conserved in SARS-CoV-2-related animal coronaviruses (Fig 6B). Two non-synonymous substitution sites near the furin cleavage region PRRA of the S protein were identified (Fig 7 and Table 2). There were 7 mutant strains with S-G2025C (S-Q675H) and one with S-A2024G (S-Q675R) among the 3239 sequenced strains. The 675th aa of the S protein is the sixth amino acid upstream of PRRA, and S-Q675H and S-Q675R changed the original polar amino acid from uncharged to positively charged. We found a continuous variable area with non-synonymous substitutions from the 183rd to the 204th aa of the N protein in SARS-CoV-2 (Fig 8), and there were 5 variable sites with a non-synonymous substitution frequency over 20, including a trinucleotide substitution that led to the two consecutive aa substitutions R203K and G204R (Table 2). Notably, 2 strains also carried substitutions in the 196th and 201st codons of SARS-CoV-2, but these were synonymous substitutions that did not cause amino acid changes (Table S2). The variable area was rich in Ser (S) and Arg (R) and contained SRXX repeat fragments. We compared this region among SARS-CoV-2, SARS-CoV-2-related coronaviruses, SARS-CoV and SARSr-CoV (Fig 8), and found that it was relatively conserved in SARS-CoV-2-related coronaviruses, SARSr-CoV and SARS-CoV, but showed distinctive polymorphism in SARS-CoV-2. Interestingly, we found 4 SRXX repeat fragments in most strains of SARS-CoV-2 and SARS-CoV-2-related coronaviruses, but SARSr-CoV and SARS-CoV lacked the third SRXX repeat fragment due to one amino acid substitution.
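Counting SRXX repeat fragments in an amino acid region can be illustrated with a short regular-expression sketch in Python. The motif definition used here (Ser, Arg, then any two residues, matched without overlap from left to right) is an assumption, since the text does not spell out how the fragments were delimited, and `find_srxx` is a hypothetical helper name. The sequences in the usage note are toy strings, not real N protein sequences.

```python
import re

# Assumed motif: S, R, followed by any two residues; matches do not overlap.
SRXX_PATTERN = re.compile(r"SR..")


def find_srxx(region):
    """Return (1-based start position, fragment) for each SRXX match in region."""
    return [(match.start() + 1, match.group()) for match in SRXX_PATTERN.finditer(region)]
```

On the toy string `"AASRGTAASRNSA"` the function finds two fragments, `SRGT` at position 3 and `SRNS` at position 9. Changing a single residue of a motif, as in `"AASKGT"`, removes the match, which is the kind of one-substitution loss of a repeat described above for SARSr-CoV and SARS-CoV.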
We found that the number of SRXX repeat fragments reflects well the evolutionary relationship among SARS-CoV-2, SARS-CoV-2-related coronaviruses, SARSr-CoV and SARS-CoV. In particular, the substitution of two consecutive amino acids in the last SRXX repeat fragment (203rd and 204th in SARS- To explore the polymorphism of the SR-rich region (aa 183-204) of the SARS-CoV-2 N protein, we extracted the amino acids in this region. After removing strain sequences containing degenerate bases in the region that could not be translated normally, we screened the SR-rich regions in the remaining 3233 strains. The results showed that the SR-rich regions in SARS-CoV-2 can be divided into 23 different types (Table 4), including the reference type. The two main types were the reference type (2561 strains) and the mutant type N-R203K_G204R (494 strains). The majority of the 22 mutant types harbored only a single amino acid substitution compared to the reference type, indicating that this region was relatively conserved among most SARS-CoV-2 strains. However, we could still find some distinctive polymorphism among different strains. The sampling dates showed that most mutations among the 3233 strains occurred in recently sequenced strains, implying constant evolution of the SR-rich region in the SARS-CoV-2 genome.

The COVID-19 pandemic poses a great challenge for healthy cities and societies; its prevention and control require cross-disciplinary cooperation and research, including biology, industry, medicine and computer information technology. Computational biology, big data and artificial intelligence are playing an increasingly important role in the daily monitoring, prevention and treatment of infectious diseases, contributing to the sustainable development of society (Liu & Li, 2020; Zhou, He, Cai, Wang, & Su, 2019).
With the continuous variation of SARS-CoV-2, software tools for high-throughput analysis of large genome sequencing datasets are of great importance for genomic and evolutionary studies. More importantly, simple operation and well-summarized analysis results can save researchers a lot of time. Some graphical-interface bioinformatics programs, such as MEGA, DNAMAN and Geneious, are commonly used to browse or align the sequences of genes and genomes, but they are not suitable for mutation analysis of large-scale datasets (Kearse et al., 2012; Kumar et al., 2016). For instance, SNP-sites can screen the mutant sites in DNA sequence datasets by alignment with the reference sequence, but it cannot tell the types of those mutations (e.g. synonymous or non-synonymous), which are important for viral genome analyses (Page et al., 2016). In addition, SNP-sites is a command-line program, which is not friendly to users unfamiliar with bioinformatics programs. Another program, MutPred, can analyze amino acid sequences but not nucleic acid sequences, so it cannot reveal variation in viral genomes (Pejaver et al., 2017). Here, we present BioAider, an efficient graphical tool for mutation analysis of viral genomes that can process nucleic acid and amino acid sequence analyses in parallel. After several simple steps including dataset input and parameter setup, BioAider runs the analyses automatically and outputs comprehensive, visualized graphs presenting all the information about the mutations in the input genomes, such as the mutation type (e.g. synonymous mutation, non-synonymous mutation, insertion, deletion or nonsense mutation), mutation frequency and changes in amino acid properties. Meanwhile, BioAider is designed for high-throughput analyses, and its interface makes the handling of big data more efficient.
BioAider greatly improves the efficiency of genome variation analysis of SARS-CoV-2, contributing to the study of SARS-CoV-2 and the development of healthy smart cities. In addition, BioAider has a built-in database of all 25 amino acid codon systems (https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi?) and can be used to analyze genomes from other organisms such as bacteria and archaea. Future expansion and optimization of BioAider will give the software more functions.

In this study, we conducted a high-resolution analysis of mutations within the SARS-CoV-2 genome based on 3240 sequenced strains using BioAider. In total, 2152 mutation sites were detected among different strains, accounting for 7.36% of the complete genome. These mutation sites included 1,335 non-synonymous substitution sites, implying abundant variation in the genome of SARS-CoV-2. However, more than half of the mutation sites were observed in only a single strain of SARS-CoV-2, which makes them difficult to distinguish from sequencing errors. Considering that there were only 60 sites with a substitution frequency above 20, we speculated that there were no large-scale mutations in the sequenced strains we analyzed. According to the spatial distribution analysis of the mutant and referential types at the substitution hotspots, 12 of 14 sites showed significant distribution differences between China and outside of China, indicating different virus prevalence in the two regions, where the virus may evolve in different directions in the future. We also noticed that most strains sampled in China were collected before March 2020, because the epidemic in China had been roughly brought under control by March. Considering these facts, we speculate that virus differentiation in and outside of China may be attributable to human intervention.
In addition, the non-synonymous substitution hotspots deserve more attention, as they are most likely related to the phenotype of the virus. A recent study found that the substitution S-D614G could increase the infectivity of SARS-CoV-2 (Korber et al., 2020). In SARS-CoV NSP13, the 541st aa was critical for protein function, and the double mutation S539A/Y541A showed higher unwinding activity for nucleic acids binding (Jia et al., 2019). The amino acid identity of NSP13 between SARS-CoV-2 and SARS-CoV was 99.83% and their tertiary structures overlapped almost completely, suggesting that the 541st aa is also a vital site for the function of SARS-CoV-2 NSP13. Notably, the mutant strain NSP13-541C first appeared in the United States (sampled on Feb 20, 2020), and the United States was the country with the largest number and an extremely high proportion of these mutant strains (316, 78% of all 405 mutant strains with NSP13-Y541C). We also noted that a total of 711 strains were sampled in the United States on or after Feb 20, 2020, and strains with NSP13-Y541C accounted for almost half of them. This suggests that NSP13-Y541C has undergone little negative selection pressure and that this variation may even be beneficial. Therefore, we speculate that NSP13-Y541C improves the unwinding activity of NSP13 and promotes the replication of SARS-CoV-2, contributing to its rapid spread in the United States. However, further studies are required to reveal the detailed effect of the substitution at the 541st aa of SARS-CoV-2 NSP13 and whether the possibly linked substitution site at the 504th aa plays a synergistic role in this process. Among the potentially linked substitution hotspots identified in this study, ORF1ab-8517 and ORF8-251 were observed in recent research (Ceraolo & Giorgi, 2020; Tang et al., 2020).
Among the triple linked substitution hotspots ORF1ab-2772, ORF1ab-14144 and S-1841, ORF1ab-C2772T is a synonymous substitution, while ORF1ab-C14144T caused the aa change NSP12-P323L in the interface region, which is a bridge section connecting the NiRAN (nidovirus RdRp-associated nucleotidyltransferase) domain and the Fingers of RdRp. The 614th aa of the S protein is located in the subdomain (SD) region downstream of the receptor-binding domain (RBD) on S1. We found that mutant strains with NSP12-323L & S-614G were prevalent worldwide and made up more than half (52%) of the population, indicating that this mutant was dominant in SARS-CoV-2. However, the functional impact of the linked substitutions at these two sites is still unclear, and whether NSP12-323 and S-614 are functionally related needs more experimental data to verify. Previous studies have reported that the SR-rich region in SARS-CoV is crucial for N protein multimerization and the interaction with the membrane (M) protein (He, Dobie, et al., 2004; He, Leeson, et al., 2004). In the SR-rich region of the N protein, there was only a 2-aa difference between SARS-CoV-2 (referential type) and SARS-CoV. Therefore, the SR-rich region may have a similar function in SARS-CoV-2, although no relevant research has been reported at present. Compared to SARS-CoV, SARS-CoV-2 harbors one more SRXX repeat fragment, at the 193rd-196th aa of the N protein. Previous research reported that the SR-rich region of SARS-CoV at the 184th-196th aa was crucial for N protein multimerization, and that complete deletion of this region abolished self-multimerization of the N protein (He, Dobie, et al., 2004). In addition, SRXX repeat fragments may play an important role in SARS-CoV infection (Luo et al., 2005).
We noted that SARS-CoV could be regarded as having a deletion of one SRXX in the SR-rich region compared to SARS-CoV-2, and the transmission rate of SARS-CoV-2 is higher than that of SARS-CoV; thus, whether the SR-rich region of SARS-CoV-2 plays an important role in this process is worth exploring. However, research on the SR-rich region in coronaviruses is still limited. In particular, unlike in SARS-CoV, the region in SARS-CoV-2 is still constantly evolving, implying that the N protein of SARS-CoV-2 may employ a more flexible replication mechanism and even interaction with the M protein. A study reported that the N protein of SARS-CoV can specifically bind to heterogeneous nuclear ribonucleoprotein (hnRNP) A1 and plays an important role in RNA replication and transcription; the key binding region is in the SR-rich region of the SARS-CoV N protein (aa 161-210), and the interaction between human hnRNP A1 and the SARS-CoV N protein may be key to SARS-CoV replication and transcription (Luo et al., 2005). Whether a similar mechanism exists in SARS-CoV-2 is still unclear. However, more than 20 hnRNPs have been discovered in human cells, and the SR-rich region of SARS-CoV-2 shows distinctive polymorphism, so whether SARS-CoV-2 can bind to one of these hnRNPs needs further verification. It might provide a new hint for understanding the process of SARS-CoV-2 replication in human cells. In addition, the potential phenotype changes related to the non-synonymous substitution hotspots in the third SRXX repeat fragment of SARS-CoV-2 may also be worth attention.

In the hotspot screening, we found several substitutions potentially critical for viral replication. For instance, ORF1ab-C794T is located in NSP2, and NSP2 was reported to inhibit the host proteins PHB1 and PHB2, which benefited viral replication, indicating that ORF1ab-C794T may affect viral replication (Cornillez-Ty et al., 2009).
ORF3a-G171T (Q57H) is located in the transmembrane domain of the 3a protein (Fig S2). The 3a protein was reported to form ion channels in the host cell membrane and enhance membrane permeability, which benefited the SARS-CoV life cycle (Minakshi, Padhan, Rehman, Hassan, & Ahmad, 2014). Thus, ORF3a-G171T (Q57H) may affect the formation of ion channels and subsequently influence viral replication. As for ORF3a-G752T (G251V), a recent analysis predicted that it lies in an important functional domain and might be related to virulence, infectivity, ion channel formation and virus release (Issa et al., 2020). We also found several sites important for virus entry, though they showed lower substitution frequencies in our data. For instance, S-Q675H and S-Q675R near the furin cleavage region may influence the cleavage of RRAR, a critical step for virus entry (Fig 7). We also detected a strain with the mutation S-R408I, located in the RBD (Table S2), which was reported by a recent study to play an important role in virus-receptor binding (Saha, Banerjee, Tripathi, Srivastava, & Ray, 2020).

BioAider greatly simplifies annotation and genome variation analysis of large-scale sequence data. Using BioAider, we made an initial characterization of the variation of SARS-CoV-2 and predicted the possible impact of some non-synonymous substitution hotspots, contributing to the study of the formation mechanism and evolution of SARS-CoV-2. The distinctive polymorphism of the SR-rich region of the N protein in SARS-CoV-2 provides a new clue for establishing antiviral strategies targeting viral replication. In this study, we could not include all the sequences of SARS-CoV-2 due to the rapid increase in genomic data. More research or real-time analysis with relevant clinical data may contribute to the viral epidemiology of SARS-CoV-2 and the treatment of COVID-19.
The BioAider software could help doctors and others who are not familiar with bioinformatics to analyze viral genomes, which is conducive to the prevention and control of COVID-19 and similar viral infectious diseases.

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

and naming it SARS-CoV-2
Origin and evolution of pathogenic coronaviruses
Coronaviruses: an RNA proofreading machine regulates replication fidelity and diversity
Coronaviruses: an overview of their replication and pathogenesis
Structure of the RNA-dependent RNA polymerase from COVID-19 virus
Analysis of multimerization of the SARS coronavirus nucleocapsid protein
Characterization of protein-protein interactions between the nucleocapsid protein and membrane protein of the SARS coronavirus
SARS-CoV-2 Cell Entry Depends on ACE2 and TMPRSS2 and Is Blocked by a Clinically Proven Protease Inhibitor
Discovery of a rich gene pool of bat SARS-related coronaviruses provides new insights into the origin of SARS coronavirus
Clinical features of patients infected with 2019 novel coronavirus in Wuhan
The new scope of virus taxonomy: partitioning the virosphere into 15 hierarchical ranks
SARS-CoV-2 and ORF3a: Nonsynonymous Mutations, Functional Domains, and Viral Pathogenesis. mSystems
Isolation of inhibitory RNA aptamers against severe acute respiratory syndrome
Delicate structural coordination of the Severe Acute Respiratory Syndrome coronavirus Nsp13 upon ATP hydrolysis
MAFFT multiple sequence alignment software version 7: performance and usability
Geneious Basic: an integrated and extendable desktop software platform for the organization and analysis of sequence data
Tracking changes in SARS-CoV-2 Spike: evidence that D614G increases infectivity of the COVID-19 virus
MEGA7: Molecular Evolutionary Genetics Analysis Version 7.0 for Bigger Datasets
Nsp3 of coronaviruses: Structures and functions of a large multi-domain protein
Structure, Function, and Evolution of Coronavirus Spike Proteins
Smart cities for emergency management
The nucleocapsid protein of SARS coronavirus has a high binding affinity to the human cellular heterogeneous nuclear ribonucleoprotein A1
Antivirus-built environment: Lessons learned from Covid-19 pandemic
The SARS Coronavirus 3a protein binds calcium in its cytoplasmic domain
SNP-sites: rapid efficient extraction of SNPs from multi-FASTA alignments
MutPred2: inferring the molecular and phenotypic impact of amino acid variants. bioRxiv
UCSF Chimera--a visualization system for exploratory research and analysis
Adaptation to different human populations by HIV-1 revealed by codon-based analyses
Characterization of a novel coronavirus associated with severe acute respiratory syndrome
A virus that has gone viral: amino acid mutation in S protein of Indian isolate of Coronavirus COVID-19 might impact receptor binding, and thus, infectivity
High Contagiousness and Rapid Spread of Severe Acute Respiratory Syndrome Coronavirus 2
Coronavirus envelope protein: current knowledge
Differential inhibitory activities and stabilisation of DNA aptamers against the SARS coronavirus helicase
From SARS to MERS, Thrusting Coronaviruses into the Spotlight
ORF1b-encoded nonstructural proteins 12-16: replicative enzymes as antiviral targets
On the origin and continuing evolution of SARS-CoV-2
The contribution of dry indoor built environment on the spread of Coronavirus: Data from various Indian states
The establishment of reference sequence for SARS-CoV-2 and variation analysis
A Unique Protease Cleavage Site Predicted in the Spike Protein of the Novel Pneumonia Coronavirus (2019-nCoV) Potentially Related to Viral Transmissibility
Characterization of an asymptomatic cohort of SARS-COV-2 infected individuals outside of Wuhan, China
SWISS-MODEL: homology modelling of protein structures and complexes
A new coronavirus associated with human respiratory disease in China
The 2019-nCoV epidemic control strategies and of building healthy smart cities
PhyloSuite: An integrated and scalable desktop platform for streamlined molecular sequence data management and evolutionary phylogenetics studies
Distribution of droplet aerosols generated by mouth coughing and nose breathing in an air-conditioned room
A Novel Bat Coronavirus Closely Related to SARS-CoV-2 Contains Natural Insertions at the S1/S2 Cleavage Site of the Spike Protein
Social inequalities in neighborhood visual walkability: Using street view imagery and deep learning technologies to facilitate healthy city planning
A pneumonia outbreak associated with a new coronavirus of probable bat origin
A Novel Coronavirus from Patients with Pneumonia in China

AG2827CC S943P (polar, uncharged) to (non-polar) 22/0
ORF3a G171T* Q57H (polar, uncharged) to (polar, positive-charge) 578/17
C296T A99V No
G587T G196V (polar, uncharged) to (non-polar) 48/1
uncharged) to (non-polar) 299/9
negative-charge) to (polar, uncharged) 41/1
C524T T175M (polar, uncharged) to (non-polar) 117/3
ORF7a C242T S81L (polar, uncharged) to (non-polar)
G184T V62L No
non-polar) to (polar, uncharged) 646/19
G578T S193I (polar, uncharged) to (non-polar)
C581T S194L (polar, uncharged) to (non-polar) 21/0
C590T S197L (polar, uncharged) to (non-polar) 49/1
G605A S202N No
G204R (polar, uncharged) to (polar, positive-charge)

This work was jointly funded by National Natural Science Foundation of China

Ziebuhr, J. (2005). The Coronavirus Replicase. In L. Enjuanes (Ed.), Coronavirus Replication and Reverse Genetics (pp. 57-94). Berlin, Heidelberg: Springer Berlin Heidelberg.