key: cord-0191684-f5wvxgc8 authors: Wang, Rui; Hozumi, Yuta; Yin, Changchuan; Wei, Guo-Wei title: Decoding asymptomatic COVID-19 infection and transmission date: 2020-07-02 journal: nan DOI: nan sha: ebeb56cd62f70db5d563028aeb6f69659ff69bde doc_id: 191684 cord_uid: f5wvxgc8 Coronavirus disease 2019 (COVID-19) is a continuously devastating public health and the world economy. One of the major challenges in controlling the COVID-19 outbreak is its asymptomatic infection and transmission, which are elusive and defenseless in most situations. The pathogenicity and virulence of asymptomatic COVID-19 remain mysterious. Based on the genotyping of 20656 Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2) genome isolates, we reveal that asymptomatic infection is linked to SARS-CoV-2 11083G>T mutation, i.e., leucine (L) to phenylalanine (F) substitution at the residue 37 (L37F) of nonstructure protein 6 (NSP6). By analyzing the distribution of 11083G>T in various countries, we unveil that 11083G>T may correlate with the hypotoxicity of SARS-CoV-2. Moreover, we show a global decaying tendency of the 11083G>T mutation ratio indicating that 11083G>T hinders SARS-CoV-2 transmission capacity. Sequence alignment found both NSP6 and residue 37 neighborhoods are relatively conservative over a few coronaviral species, indicating their importance in regulating host cell autophagy to undermine innate cellular defense against viral infection. Using machine learning and topological data analysis, we demonstrate that mutation L37F has made NSP6 energetically less stable. The rigidity and flexibility index and several network models suggest that mutation L37F may have compromised the NSP6 function, leading to a relatively weak SARS-CoV subtype. This assessment is a good agreement with our genotyping of SARS-CoV-2 evolution and transmission across various countries and regions over the past few months. The ongoing global pandemic of coronavirus disease 2019 (COVID-19) caused by Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2) has spread to more than 215 countries and territories with 8385440 positive cases and 450686 fatalities as of June 19, 2020 [40] . Unlike SARS-CoV that mainly infects the lower respiratory tract, SARS-CoV-2 is observed with a high level of shedding at the upper respiratory tract [41] . A wide variety of COVID-19 symptoms have been reported, including fever or chills, body or muscle aches, headache, stuffy or congested nose, dry cough, fatigue, sore throat, and loss of taste or smell, 2-14 days after exposure to the virus [41] . Severe symptoms like high fever, severe cough, and shortness of breath indicate the onset of pneumonia. Less common gastrointestinal symptoms like diarrhea, nausea, and vomiting have been listed at the Centers for Disease Control and Prevention. Currently, symptom-based testing, contact tracing, isolation, and quarantine are the main strategies for controlling and combating COVID-19. However, viable viruses have also been isolated from asymptomatic cases [31] . Asymptomatic and presymptomatic cases can play an important role in transmitting coronavirus [23] . Elusive asymptomatic transmission is regarded as the Achilles heel of current strategies to control COVID-19 [24] . Early studies with 235 cases of influenza virus infections indicate that the duration of influenza viral shedding was shorter and decayed faster and quantitative viral loads were lower in asymptomatic than in symptomatic cases. [27] The epidemiological and virological characteristics of COVID-19 asymptomatic pathogenicity remains a mystery. SARS-CoV-2 is a non-segmented positive-sense RNA virus that belongs to the β-coronavirus genus, coronaviridae family, and Nidovirales order. Many RNA viruses, such as the flu virus, are prone to mutations due to the lack of proofreading in their genetic evolutions. Viral mutations are driven by a variety of factors, including replication mechanism, polymerase fidelity, access to proofreading or post-replicative repair, sequence context, cellular environment, and host immune responses or gene editing [34] . Benefit from an error-correction mechanism common to the Nidovirales order, the replication of coronavirus is regulated by a bi-functional enzyme nonstructure protein 14 (NSP14) [19] . Therefore, SARS-CoV-2 maintains relatively high accuracy in virus replication and transcription compared with the flu virus. Nonetheless, SARS-CoV-2 has had 10620 single mutations for the genome collected on January 5, 2020 [38, 42] . Although the impacts of mutations on SARS-CoV-2 transmission and pathogenicity [2, 20] and COVID-19 diagnosis, vaccine, and medicine have been studied [38] , little is known about the connection between viral evolution and asymptomatic transmissions. This work reports the first association of a SARS-CoV-2 single nucleotide polymorphism (SNP) variant and COVID-19 asymptomatic cases based on the genotyping of 20656 SARS-CoV-2 genome isolates. We reveal a significant correlation between asymptomatic infections and SARS-CoV-2 single mutation 11083G>T-(L37F) on NSP6. NSP6 is a common protein of α and β-coronaviruses that locate at the endoplasmic reticulum (ER) [9] . As a multiple-spanning transmembrane protein, coronavirus NSP6 participates in viral autophagic regulation. Autophagy degrades alien components to provide an innate defense against viral infection and promote cell death and morbidity. [32] . In response to extreme cases of starvation, autophagy generates autophagosomes to transfer long-lived proteins, unnecessary or dysfunctional components to lysosomes for orderly degradation. Studies show that NSP6 undermines the capability of autophagosomes to transport viral components to lysosomes for degradation by rendering smaller diameter autophagosomes in nutrient-rich media and thus enhancing viral replication [9] . Additionally, NSP6 proteins induce a higher number of autophagosomes per cell compared with starvation to facilitate the assembly of replicase proteins [4, 9] . Although further studies are required to understand the molecular mechanism of the NSP6 regulation of autophagy, it is clear that coronavirus NSP6 is extremely important to viral protein folding, viral assembly, and the replication cycle. It is of paramount importance to understand how NSP6 mutation L37F leads to COVID-19 asymptomatic transmission and reduced virulence. We show that NSP6 is a relatively conservative protein and the region around residues is quite conservative over several SARS-CoV-2 related genomes to maintain the crucial regulative function of NSP6. Using artificial intelligence (AI), topological data analysis (TDA), and a variety of network models, we further demonstrate that mutation L37F disrupts the folding stability of NSP6. We uncover the correlation between NSP6 mutation L37F and weakened SARS-CoV-2 virulence. We also analyze the global transmission and find the decay tendency of NSP6 mutation L37F. We prove that the evolutionary dynamics of L37F is ageand gender-independent. We analyze 20656 SARS-CoV-2 complete genome sequences deposited in GISAID ( https://www.gisaid.o rg/) database up to June 19, 2020. Among them, 7243 samples have patient status information recorded as asymptomatic, symptomatic, hospitalized, ICU, deceased, and so on. In particular, 83 samples are labeled with asymptomatic (53) and symptomatic (30) cases. By genotyping 53 genome samples, we find that mutation 1083G>T is significantly relevant with asymptomatic infection with a p-value being much smaller than 0.05. Mutation 11083G>T changes leucine (L) residue at position 37 of NSP6 to phenylalanine (F), denoted as L37F. In the following sections, we investigate the relationship between mutation 11083G>T-(L37F) and asymptomatic infections using both gene-specific analysis and protein-specific analysis. The global evolution and transmission pathway of the mutation is studied as well. Our customized dataset generated from genotyping 20656 SARS-CoV-2 complete genome sequences downloaded from the GISAID database is summarized in Table 1 . Here, patient status referred to as incomplete records of asymptomatic, symptomatic, hospitalized, deceased, gender, etc. Additionally, our dataset contains data collection dates and locations of genome isolates. The detailed information of our dataset is available in the Supporting information. We first analyze 83 sequences that have either asymptomatic or symptomatic labels submitted from Japan. The statistic analysis shows that the Pearson correlation coefficient between asymptomatic records and 10083G>T is 0.606. Moreover, the p-value is 1.29×10 −9 , which reveals the significant relevance between asymptomatic and single mutation 11083G>T-(L37F). Next, we split our genotyping dataset of 20656 sequences into different countries and extract those records with the 11083G>T mutation. Table 2 summarizes the total number of sequences related to 11083G>T-(L37F), denoted as N L37F , the total number of sequences N S , the 11083G>T-(L37F) ratio, the number of total cases, the number of total deaths, and the death ratio of 25 countries up to June 19, 2020, respectively. A complete table for all countries/regions can be found in the Supporting information. Table 2 suggests that the 11083G>T-(L37F) ratio correlates with the death ratio. For example, Singapore has the highest 11083G>T-(L37F) mutation ratio as of 0.644. A piece of recent news reported that half of the COVID-19 cases in Singapore are symptomless [25] , which matches our finding that mutation 11083G>T-(L37F) is relevant to the asymptomatic infections. Moreover, Singapore has the second-lowest death ratio as listed in Table 2 , suggesting that single mutation 11083G>T-(L37F) may have weakened SARS-CoV-2 virulence. A similar deduction can be obtained from the records of Turkey, Jordan, Norway, Australia, and South Korea. The relatively high 11083G>T-(L37F) mutation ratio (greater than 0.200) with a correspondingly low death ratio (less than 3.00%) further validates that 11083G>T-(L37F) may be relevant to the asymptomatic-induced the hypotoxicity of SARS-CoV-2. Moreover, 11 out of 17 countries whose mutation ratios are less than 0.200 have a death ratio greater than 4.5%. The death ratios of the United Kingdom, Belgium, France, and Spain are even higher than 10%, which supports our assumption that mutation 11083G>T-(L37F) weakens SARS-CoV-2 virulence. Figure 1 illustrates the number of complete SARS-CoV-2 sequences in our dataset with 11083G>T-(L37F) detected versus the number of complete SARS-CoV-2 sequences without 11083G>T-(L37F) detected every 10 days in Singapore, Japan, India, China, United Kingdom, Italy, United States, Australia, and Spain, as well as two states in the United States: New York and Washington. The blue and red bar represents the 11083G>T-(L37F) counts and other mutation counts every ten-day period, respectively. Similar bar plots for all countries/regions involving this specific mutation can be found in the Supporting information. Singapore's plot shows that 11083G>T was widely found after March 24, 2020, which is consistent with the report saying that at least half of Singapore's newly discovered COVID-19 cases show no symptoms [25] . Almost all of the cases collected from February 12 to February 22 have 11083G>T mutation with asymptomatic infections recorded, which is the most robust evidence to associate the SARS-CoV-2 11083G>T or L37F mutation on NSP6 with the asymptomatic infections. India is one of the countries that also have a large number of 11083G>T mutation cases. Recent news reported that 80 percent of all coronavirus patients in India were asymptomatic or showed mild symptoms [12] , which supports our deduction that 11083G>T correlates with the asymptomatic infection. Moreover, one can see that mutation 11083G>T in China, the United Kingdom, and Australia is not as abundant as in Singapore, Japan, or India. Furthermore, as discussed before, the asymptomatic infection may associate with the hypotoxicity of SARS-CoV-2 as well. Note that from Figure 1 , China, the United States, Spain, and Italy have a relatively lower occurrence of 11083G>T, which may contribute to the relatively high death ratio. However, this specific mutation is not the only factor in determining the death ratio. The number of total infected cases, diagnostic testing, medical and health conditions, age structure, nursing-home population, etc. are also critical factors. For example, Japan has a second-highest ratio of 11083G>T-(L37F), while the death ratio is greater than 5 percent. Similarly, the United Kingdom has the third-highest death ratio, suggesting its ratio of mutation 11083G>T-(L37F) is not at the last echelon. One of the possible reasons is that the healthcare system was heavily broken by a sudden increase in infected cases. The lack of proper healthcare attention to a large number of mid/severe COVID-19 cases resulted in a high mortality ratio. Note that although Russia has the lowest 11083G>T-(L37F) mutation ratio lists in Table 2 , its 1.37% death ratio caused by COVID-19 is also relatively low. One possible reason is Russia's relatively low median age. In total, 53 genome samples were related to 11083G>T among 542 samples that were collected from May 14 to May 24 in New York. After that, fewer cases were collected in New York and the 11083G>T mutation ratio decreased. Also, the state of Washington had a small proportion of 11083G>T mutation before May 13, 2020. In the first week of June, 26 out of 27 genome sequences had mutation 11083G>T, which may indicate that the asymptomatic infection is becoming increasingly prevalent in Washington. During the last few months, the proportion of the 11083G>T mutation in the United Kingdom has gone down as shown in Figure 1 . Globally, the ratio of samples with the 11083G>T mutation over all samples, shown in the middle chart of the last row in Figure 1 decays over time, suggesting the mutation is unfavorable to the viral transmission. Two abnormal peaks appeared on February 22 and June 2 in the ratio were due to the L37F mutation counts from Japan and Washington, respectively. The relatively small numbers of total genome sequences at these dates induce the jumps. Unfortunately, the number of genome sequences decays rapidly after April as shown in Figure 1 , while globally, the number of SARS-CoV-2 infections increasing steadily according to the WHO's daily situation reports ( https://www.who.int/em ergencies/diseases/novel-coronavirus-2019/situation-reports). In this section, we track the global transmission pathways of the single mutation 11083G>T-(L37F) to understand its spread dynamics. We found that the first genome containing single mutation 11083G>T-(L37F) was sequenced in Chongqing, China, on January 18, 2020, which can be considered as the ancestor of 11083G>T in our dataset. However, the complete genome sequences released on the GISAID do not include all of the infected cases. Mutation 11083G>T also detected in a co-mutation record [8782C>T, 11083G>T, 28144T>C] in Yunnan, China one day earlier than Chongqing's sequence, which indicates that the true ancestor of 11083G>T might have occurred early than January 18, 2020. Note that the first SARS-CoV-2 sequence released in GISAID was collected in Wuhan, China on December 24, 2019. China's COVID19 epicenter, Wuhan, implemented a lockdown on January 23, 2020, which explains why China has a relatively low ratio of 11083G>T-(L37F) mutation. The first confirmed case reported by the Singapore government was on January 23, 2020. Interestingly, the first genome sequence related to mutation 11083G>T-(L37F) was found in Singapore on January 29,2020, only a few days after the first confirmed cases. Therefore, the transmission of COVID-19 in Singapore at the initial stage contains a large proportion of the11083G>T-(L37F) mutation. The prevalence of the11083G>T-(L37F) mutation in Singapore's SARS-CoV-2 infections explains its low death ratio. Mutation 11083G>T-(L37F) first appeared in the United States on January 22, 2020, in Arizona, a 26year-old male with co-mutations [8782C>T, 11083G>T, 28144T>C, 29095C>T], which is the descendant of [8782C>T, 11083G>T, 28144T>C] found in China. Since then, more records related to 11083G>T have appeared in the United States. At this stage, 236 genome isolates from the United States in our dataset are relevant to 11083G>T. France and Italy are the following countries that found the 11083G>T mutation at an early stage. Both of them detected this mutation on Jan 29, 2020. Although 11083G>T mutation has arrived in the United States and the United Kingdom at a very early stage, different subtypes of SARS-CoV-2 spread vastly due to their large population, which makes the 11083G>T mutation non-dominated. In Jordan, the earliest sequence released on GISAID is on March 16, 2020, with 11083G>T mutation, which indicates the high 11083G>T ratio in Jordan. Note that multiple reasons may lead to a low death ratio of 0.90% in Jordan. One may be due to the hypotoxicity of SARS-CoV-2 caused by mutation 11083G>T. The other is that the healthcare system is in good condition due to the small number of infected patients. The first confirmed COVID-19 cases reported by the Vietnamese government was on January 16, 2020. However, the first genome isolates related to mutation 11083G>T in our dataset was collected on March 17, 2020. The 11083G>T was found in Vietnam two months after the first confirmed cases, which result in a low 11083G>T mutation ratio. Surprisingly, none of the patients died in Vietnam. As a low-middle income country with nearly 100 million population and a much less-advanced healthcare system than other countries such as Italy, the United Kingdom, and France, Vietnam's success in tackling COVID-19 attracts attention. They immediately took action to ban the foreign nationals' entries and implement strict social distancing rules, which is the essential factor in keeping the coronavirus death toll at zero. mutations. The color of the dominated type of mutation decides the base color of each state. One can see that Oregon, Alaska, Texas, and Ohio have a higher proportion of 11083G>T than the other states, while the death ratio in Oregon, Alaska, and Texas is lower than the average death ratio (5.47%) in the United States as of June 19, 2020. Note that although the ratio of mutation 11083G>T in Nevada is 1, only one genome sequence was submitted from Nevada; hence, it is insignificant in the statistics. 20656 complete genome sequences, 9923 cases have age labels and 9437 cases (male cases: 4993 and female cases: 4444) have gender labels, which are used for our age-related and gender-related analysis, respectively. Overall, male cases have a slightly higher ratio of the 11083G>T-(L37F) mutation than female cases. In terms of age distribution, the ratio is significantly higher in the age group of 99-109 years. However, only 7 cases fall into this group and all data were collected from the United Kingdom (2 cases) and Singapore (5 cases). More data and further studies are required to draw any conclusion from this group. Therefore, we conclude that mutation 11083G>T-(L37F) is quite evenly distributed over age and gender groups, implying that the mutation does not have a host-dependent behavior. Figure 3 depicts the sequence and structural information of NSP6. Figure 3 (a) is generated by an online server [30] , Figure 3 (b) is plotted by PyMol [11] , and Figure 3 (c) is created by [1] . As shown in Figure 3 (a), NSP6 is a multi-span membrane protein. In mutation 11083G>T-(L37F), both leucine and phenylalanine are α-amino acids and non-polar. However, phenylalanine has a benzyl ring on the side chain, making the secondary structure of NSP6 more compact. As shown in Figures 3(a) and (c), from the NSP6 protein residue position 34 to 37, the residues are FFFL. Therefore, there will be four continuous phenylalanine amino acid FFFF after the L37F mutation. As a result, mutation L37F can stiffen the NSP6 structure by the aromatic-aromatic, hydrophobic, or π-stacking interactions. Such an additional interaction can significantly change NSP6 function [4] . Figure 3 (c) presents the alignment of the SARS-CoV-2 NSP6 sequence with those of the other four coronaviruses, including that of SARS-CoV. The sequence identities between SARS-CoV-2 NSP6 and the NSP6s of SARS-CoV [28] , RaTG13 [46] , ZC45 [26] , BM48-31 [13] are 87.2%, 99.7%, 97.9%, and 83.8%, respectively. Note that the overall sequence identity between SARS-CoV-2 and SARS-CoV is 83.8%. Therefore, NSP6 is relatively conservative. Additionally, NSP6 position 37 is in a relatively conservative environment among different species. Position 37 for all other species is valine, which is also hydrophobic and extremely similar to leucine. Two amino acids differ by one -CH 2 -unit. A large region from 25 to 45 has only three mutations among five species, indicating the region is potentially important for the protein function. There three-dimensional (3D) structure of NSP6 is displayed in Figure 3 (b). To further understand the impact of mutation L37F on NSP6, we carry several theoretical analyses using AI [6] , TDA [8] , flexibility and rigidity index (FRI) [43] , and a large number of other network theory models [15] . Our results are summarized in Table 3 . The protein folding stability change following mutations is defined by ∆∆G = ∆G m − ∆G w , where ∆G w is the free energy change of the wild type and ∆G m is the free energy change of the mutant type [6] . The folding stability change ∆G measures the difference between folded and unfolded states. Thus, a negative ∆∆G means the mutation causing destabilization and vice versa. We use a TDA-based deep learning method [6] to compute ∆∆G in this work. As shown in Table 3 , a negative folding stability change ∆∆G is found for the mutation, indicating a destabilizing mutation L37F for NSP6. Physically, a large hydrophobic residue (F) at the lipid bilayer and solvent interface reduces NSP6 stability. Considering the sequence alignment in Figure 3 (c), we have also carried a similar calculation for the mutation of L37V and found its ∆∆G = −0.71 kcal/mol. Therefore, leucine is favored at residue position 37 for SARS-CoV-2 NSP6. Next, the FRI rigidity index R η measures the geometric compactness and topological connectivity of the network consisting of C α at each residue and the heavy atoms involved in the mutation [43] . The scale parameter η determines the range of pairwise interactions. In this work, the parameter η is set to 8Å. As shown in Table 3 , the increase in the FRI rigidity index R η is found, indicating the increase in the rigidity of the first alpha helix of NSP6. It is known that protein flexibility is required to allow it to function through molecular interactions within the cell, among cells and even between organisms [37] . The quadruplet phenylalanine resulting from mutation L37F can significantly reduce NSP6's interactions with ER. Moreover, seven network models [15] are utilized to analyze the L37F mutation in NSP6 as shown in Table 3 . We consider the network of heavy atoms at residue 37 and C α atoms in NSP6. The connections are allowed between any pair of atoms within a cutoff distance of 8.0Å. The mutation-induced decrease in the edge density d is due to the factor that the mutation increases the edge on the surface of the protein. The mutation induced the decrease in the average path length L is owing to the mutation increases the edge with shorter distances. The mutation induced the decrease in average betweenness centrality C b is due to the increase in the crowd of vertices by the mutation residue. The mutation induced the increase in average eigencentrality C e is because increase in the number of the connected components will enlarge the maximal eigenvalues. The mutation induced the increase in the average subgraph centrality C s is due to the decrease in the average participating rate of each edge. The mutation induced the decrease in average communicability M is due to the decrease in the neighbor edges participating rate of each edge. Finally, the mutation induced the increase in average communicability angle Θ is due to the increase in the alignments among different edges. Together with the FRI rigidity index and the protein folding stability Table 3 : The folding stability change and graph network descriptor consisting of wild type and mutant type of NSP6 calculated by web server [6] . The folding stability change ∆∆G = ∆Gm − ∆Gw, where ∆Gm is free energy change of wild type and ∆Gw is free energy change of mutation type. R 8 : FRI rigidity index with η of 8; d: edge density; L : average path length; C b : average betweenness centrality; Ce : average eigencentrality; Cs : average subgraph centrality; M : average communicability; Θ : average communicability angle. While the asymptomatic infections of severe acute respiratory syndrome 2 (SARS-CoV-2) have been reported worldwide in the past few months, little is known about the formation mechanism and virological characteristics of asymptomatic infections. This work parses 20656 complete genome sequences of SARS-CoV-2, and for the first time, reveals the relationship between asymptomatic cases and a single nucleotide mutation 11083G>T-(L37F) on NSP6. After genotyping 83 sequences with asymptomatic and symptomatic records, we find 11083G>T is significantly relevant to the asymptomatic infection. By analyzing the distribution of 11083G>T in various countries, we unveil that the 11083G>T mutation correlates with the death ratio in Singapore, the United States, the United Kingdom, etc, inferring the correlation between the asymptomatic infection and the hypotoxicity of SARS-CoV-2. Moreover, we track the dynamics of 11083G>T mutation ratio globally and discover its decaying tendency, indicating that 11083G>T mutation hinders SARS-CoV-2 transmission. Furthermore, the analysis of the distribution of 11083G>T in different ages and genders unveils that 11083G>T-driven mutation does not have a host-dependent behavior. The protein-specific analysis is also taken into consideration. The 11083G>T mutation leads leucine (L) residue at position 37 of NSP6, changing to phenylalanine (F). By employing the graph network analysis and topology-based mutation predictor, we find that 11083G>T-(L37F) destabilizes the structure of NSP6 in protein-wise. As one of the most conservative proteins of SARS-CoV-2, NSP6 involves in the viral autophagic regulation. Therefore, this destabilized mutation may result in the inefficiency of NSP6 to participate in the viral protein folding, viral assembly, and replication cycle, which underpins our deduction that 11083G>T-(L37F) may be relevant to the asymptomatic infections and weaken the SARS-CoV-2 virulence. Bristol-Myers Squibb, and Pfizer. The authors thank The IBM TJ Watson Research Center, The COVID-19 High Performance Computing Consortium, NVIDIA, and MSU HPCC for computational assistance. Visualization of four coronaviral NSP6s in membrane Figure 4 is the visualization of four NSP6 proteoforms of SARS-CoV, Bat-SL-RaTG, Bat-SL-CoVZC45, and Bat-SL-BM48-31. These proteoforms are consistent with that of SARS-CoV-2 described in the main paper, indicating their similar function of regulating cell autophagy. (e) (f) Figure 5 : Network analysis of the SARS-CoV-2 NSP6 L37F mutation. The networks consist of heavy atoms of mutation site 37 and Cα atoms of SARS-CoV-2 NSP6. The differences of descriptors between the network with wild type, leucine, and the network with mutant type, phenylalanie, are displayed. (a) FRI rigidity index differences; (b) eigencentrality differences; (c) subgraph centrality differences; and (d) betweenness centrality differences. In (e) and (f), relative changes versus cutoff distances to the mutation site are studied. (e) relative changes of FRI rigidity index, path length, edge density, betweenness centrality, and eigencentrality; and (f) relative changes of subgraph centrality, communicability, and communicability angle. Figure 5 shows the network analysis of the SARS-CoV-2 L37F mutation. The networks are constructed by heavy atoms at the mutation site and at the C α atoms of SARS-CoV-2 NSP6. Figures 5 (a), (b) , (c), and (d) present the mutation-induced differences of FRI rigidity index differences, eigencentrality differences, subgraph centrality differences, and betweenness, respectively. Although these descriptors do not change much globally, their local changes around the mutation site reveal the L37F mutation-induced stress around the mutation site. Figures 5 (e) and (f) show the dependence of relative changes over cutoff distances. First of all, every descriptor converges as the cutoff distance increase. Relative changes do not fluctuate for cutoff distance greater than 24Å. The relative change of FRI rigidity index monotonically decreases in absolute value as the cutoff increases. Similarly, the edge density has a monotonically decreasing pattern. The two networks are close in the path length descriptor. Interestingly, betweenness centrality descriptor has the largest difference at 10Å cutoff distance, while the eigencentrality descriptor has the largest difference at 22Å cutoff distance. For betweenness centrality, it shows that the mutation has the largest impact in the network of C α within 10Å to any atoms of the target residue. In Figure 5 (f), three descriptors show a similar pattern -they increase in absolute value first and then decrease eventually as the cutoff increases. Overall, the relative change plots indicate that the mutation happening at L37 has the largest impact on the C α within 10Å to any atoms of the target residue. On January 5, 2020, the complete genome sequence of SARS-CoV-2 was first released on GenBank (Access number: NC 045512.2) by Zhang's group at Fudan University [42] . Since then, there has been a rapid accumulation of SARS-CoV-2 genome sequences. In this work, 20,656 complete genome sequences with high coverage of SARS-CoV-2 strains from the infected individuals in the world were downloaded from the GISAID database [35] ( https://www.gisaid.org/) as of June 19, 2020. All the records in GISAID without the exact submission date were not taken into considerations. To rearrange the 20,656 complete genome sequences according to the reference SARS-CoV-2 genome, multiple sequence alignment (MSA) is carried out by using Clustal Omega [36] with default parameters. The amino acid sequence of NSP6 is downloaded from GenBank [3] . The three-dimensional (3D) structure of nonstructure protein 6 (NSP6) in this work is generated by I-TASSER model [44] . The 3D structure graph is created by using PyMOL [11] . Single nucleotide polymorphism (SNP) genotyping measures the genetic variations between different members of a species. Establishing the SNP genotyping method to the investigation of the genotype changes during the transmission and evolution of SARS-CoV-2 is of great importance [38, 45] . By analyzing the rearranged genome sequences, SNP profiles, which record all of the SNP positions in teams of the nucleotide changes and their corresponding positions, can be constructed. The SNP profiles of a given SARS-CoV-2 genome isolated from a COVID-19 patient capture all the differences from a complete reference genome sequence and can be considered as the genotype of the individual SARS-CoV-2. In this work, the prediction of NSP6 folding energy changes upon mutation is computed by using the topology based mutation predictor (TML-MP) ( https://weilab.math.msu.edu/TML/TML-MP/) which is briefly reviewed as following and its detail can be found in the literature [6] . TML-MP applies element specific persistent homology, which reveals essential biological information [8, 14] . The method employs the element-specific persistent homology [7] and other biological and chemical properties as machine learning features to train gradient boosted regression tree (GBRT) models. The dataset includes 2648 mutations instances in 131 proteins provided by Dehouck et al [10] . The error analysis based on the dataset is given as Pearson correlations coefficient (R p ) of 0.79 and root mean square error (RMSE) of 0.91 kcal/mol from previous work [6] . As the persistent homology widely applied in a variety of practical feature generation problems, it is also successful in the implementation of predictions of protein folding energy changes upon mutation. The key idea in TML-MP is to use the element-specific persistent homology (ESPH) which distinguishes different element types of biomolecules when building persistent homology barcodes. For instance, commonly occurring protein element types include C, N, O, S, and H. However, hydrogen atoms are often absent from PDB data and sulfur atoms are too few in most proteins to be statistically important. Thus, C, N, and O elements are considered on the ESPH in protein characterization. As for persistent homology, barcodes generated based on ESPH provide a topological representation of molecular interactions. Features are extracted from the different dimensions of persistent homology barcodes by dividing barcodes into several equally spaced bins, which is called binned barcode representation. The auxiliary features such as geometry, electrostatics, amino acid types composition, and amino acid sequence are also included for machine learning training. The element specific persistent homology is built by adopting the distance function DI(A i , A j ) describing the distance between two atoms A i and A j defined as where Loc(·) denotes the location of an atom which is either in a mutant site or in the rest of the protein and DE(·, ·) is the Euclidean distance between the two atoms. Then, the persistent homology uses simplicial complexes with a specific rule such as Vietoris-Rips complex, Cech complex, or alpha complex. Vietoris-Rips complex (VC) is used for characterizing first-order interaction where alpha complex (AC) is used for characterizing higher-order patterns. Using ESPH to characterize interactions of different kinds, we construct persistent homology barcodes on the atom sets by selecting one certain type of atoms in mutation site and one other certain type of atoms in the rest of the protein. The set of barcodes from one persistent homology computation as V p,d,b γ,α,β where • p ∈ {VC, AC} is the complex rule, • d ∈ {DI, DE} is the distance function, • γ ∈ {M, W} is the protein of mutant type or wild type, • α ∈ {C,N,O} is the element type selected in proteins except in the mutation site, • β ∈ {C,N,O} is the element type selected in the mutation residue. These barcodes are capable of reflecting the molecular mechanism of protein stability. Features are extracted from the groups of persistent homology barcodes. For 18 groups of Betti-0 ESPH barcode such that 9 groups are from the mutant type and 9 groups are from the wild type, one can specify a fixed length interval to divide the ESPH barcodes into a number of equally spaced bins. For example, a length set, {[0, 0.5], (0.5, 1], ..., (5.5, 6]Å} would turn the 18 groups of Betti-0 ESPH barcode into 18*12 features with dimension of the number of atoms. The death and birth of bars are counted in each bin resulting in features. Therefore, this representation enables us to precisely characterize hydrogen bonds, van der Waals, electrostatic, hydrophilic, and hydrophobic interactions. For the higher-order Betti numbers, the emphasis is given on patterns of both short and long-distance scales. Statistics feature are computed for each group of barcodes for Betti-1 and Betti-2, which are sum, max, and the average of bar length, and max and min of birth and death values. Overall, 12*18 features are generated by Betti-0 on VC, and 7*2*18 features are generated by Betti-1 and Betti-2 on AC. In TML-MP, gradient boosted regression trees (GBRTs) [22] are employed to train the dataset according to the size of the training dataset, absence of model overfitting, non-normalization of features, and ability of nonlinear properties. The GBRT method produces a prediction model as an ensemble method which is a class of machine learning algorithms. It builds a popular module for regression and classification problems from weak learners. By the assumption that the individual learners are likely to make different mistakes, the method uses a summation of the weak learners to eliminate the overall error. Furthermore, a decision tree is added to the ensemble depending on the current prediction error on the training dataset. Therefore, this method is relatively robust against hyperparameter tuning and overfitting, especially for data with a moderate number of features. The GBRT is shown for its robustness against overfitting, good performance for moderately small data sizes, and model interpretability. The current work uses the package provided by scikit-learn (v 0.23.0) [33] . The graph network descriptors are briefly presented which are applied in this work. Graph networks can mimic interactions between pairs of units in molecules. The quantify features of the networks can reveal the biological and chemical properties measured by comparing descriptors on different networks. To detect the single residue impact following mutation, the network consists of a set S(r) of C α atoms from every residue of protein structure except the target mutation residue where r is the cutoff distance such that a C α atom is included if it is within rÅ to any atom of the target mutation. The total atom set T (r) is defined as the atoms (C, N, and O) of the target residue and C α atoms of S(r). Moreover, two vertices are connected in the network if their distance is less than 8Å. Thus the adjacency matrix A can be defined as well where A is a matrix containing 0 and 1 such that A(i, j) = 0 if i-th and j-th are disconnected and A(i, j) = 1 if i-th and j-th are connected. FRI rigidity index was introduced to reflect the flexibility between atoms for molecular interaction prediction [29, 43] . The single residue molecular rigidity index measures its influence on the set S(r) which is given as where N S is the number of C α atoms of the set S(r) and N is the number of atoms in total atom set T (r). Edge density is defined based on the adjacency matrix of the total atom set T (r) such as where I T is the index set of the mutation residue. Average path length measures the separation between two vertices of the whole network, which can be used to study infectious diseases spreading in the networks [39] . The average path length for the single mutation system of biomolecular is defined as where d(i, j) is the shortest path length between vertices v i and v j . Average betweenness centrality shows communications in a network [21] . The average betweenness centrality is given as where g k ij is defined as the number of the shortest path between vertices v i and v j that passes v k , and g ij is the number of shortest paths between v i and v j . Average egeincentrality is the average of elements of the eigenvector V max , which is corresponding to the largest eigenvalues of the adjacency matrix [5] such as Average subgraph centrality is built on the exponential of the adjacency matrix, E = e A . The subgraph centrality is the summation of weighted closed walks of all lengths starting and ending at the same node [15, 18] . Thus the average subgraph centrality reveal the average of participating rate of each vertex in all subgraph, which is given as Average communicability is defined in a similar way as the subgraph centrality on the exponential of the adjacency matrix [15] [16] [17] , which is Average communicability angle is given by [17] where θ(i, j) = cos −1 . Inkscape: guide to a vector drawing program SARS-CoV-2 viral spike g614 mutation exhibits higher case fatality rate Evolutionary analysis of SARS-CoV-2: how mutation of Non-Structural Protein 6 (NSP6) could affect viral autophagy Power and centrality: A family of measures Analysis and prediction of protein folding energy changes upon mutation by element specific persistent homology Integration of element specific persistent homology and machine learning for protein-ligand binding affinity prediction Topology and data Coronavirus NSP6 restricts autophagosome expansion Fast and accurate predictions of protein stability changes upon mutations using statistical potentials and neural networks: PoPMuSiC-2.0 Pymol: An open-source molecular graphics tool Asymptomatic infections: India's next big coronavirus challenge. The WEEK Genomic characterization of severe acute respiratory syndrome-related coronavirus in european bats and classification of coronaviruses based on partial rna-dependent RNA polymerase gene sequences Topological persistence and simplification Topological analysis of SARS-CoV-2 main protease Communicability in complex networks Communicability angle and the spatial efficiency of networks Subgraph centrality in complex networks Structural and molecular basis of mismatch correction and ribavirin excision from coronavirus RNA Phylogenetic network analysis of SARS-CoV-2 genomes Centrality in social networks conceptual clarification Greedy function approximation: a gradient boosting machine Evidence supporting transmission of severe acute respiratory syndrome coronavirus 2 while presymptomatic or asymptomatic Asymptomatic transmission, the achilles heel of current strategies to control COVID-19 Exclusive: Half of singapore's new COVID-19 cases are symptomless, taskforce head says Genomic characterization and infectivity of a novel SARS-like coronavirus in chinese bats Viral shedding and transmission potential of asymptomatic and paucisymptomatic influenza virus infections in the community A major outbreak of severe acute respiratory syndrome in Hong Kong Generalized flexibility-rigidity index Protter: interactive protein feature visualization and integration with experimental proteomic data Prevalence of asymptomatic SARS-CoV-2 Infection: A Narrative Review Programmed cell death pathways in cancer: a review of apoptosis, autophagy and programmed necrosis Scikit-learn: Machine learning in python Mechanisms of viral mutation. Cellular and molecular life sciences GISAID: Global initiative on sharing all influenza data-from vision to reality Clustal omega Functional aspects of protein flexibility Decoding SARS-CoV-2 transmission, evolution, and ramification on COVID-19 diagnosis, vaccine, and medicine Collective dynamics of small-worldnetworks WHO. Coronavirus disease 2019 (COVID-19) situation report 151. Coronavirus Disease (COVID-2019) Situation Reports Virological assessment of hospitalized patients with COVID-2019 A new coronavirus associated with human respiratory disease in china Multiscale multiphysics and multidomain modelsFlexibility and rigidity The i-tasser suite: protein structure and function prediction Genotyping coronavirus SARS-CoV-2: methods and implications A pneumonia outbreak associated with a new coronavirus of probable bat origin This work was supported in part by NIH grant GM126189, NSF Grants DMS-1721024, DMS-1761320, and IIS1900473, Michigan Economic Development Corporation, George Mason University award PD45722,