key: cord-0996700-02bwyi1w
authors: Hu, Y.; Riley, L. W.
title: Dissemination and co-circulation of SARS-CoV2 subclades exhibiting enhanced transmission associated with increased mortality in Western Europe and the United States
date: 2020-07-15
journal: nan
DOI: 10.1101/2020.07.13.20152959
sha: c832b068beb6645d83c981b9473af7b5872ec1bd
doc_id: 996700
cord_uid: 02bwyi1w

Mechanisms underlying the acute respiratory distress syndrome (ARDS)-like clinical manifestations leading to deaths in patients who develop COVID-19 remain uncharacterized. While multiple factors could influence these clinical outcomes, we explored if differences in transmissibility and pathogenicity of SARS-CoV2 variants could contribute to these terminal clinical consequences of COVID-19. We analyzed 34,412 SARS-CoV2 sequences deposited in the Global Initiative for Sharing All Influenza Data (GISAID) SARS-CoV2 sequence database to determine if regional differences in circulating strain variants correlated with increased mortality in Europe, the United States, and California. We found two subclades descending from the Wuhan HU-1 strain that rapidly became dominant in Western Europe and the United States. These variants contained nonsynonymous nucleotide mutations in the Orf1ab segment encoding RNA-dependent RNA polymerase (C14408T), the spike protein gene (A23403G), and Orf1a (G25563T), which resulted in non-conservative amino acid substitutions P323L, D614G, and Q57H, respectively. In Western Europe, the A23403G-C14408T subclade dominated, while in the US, the A23403G-C14408T-G25563T mutant became the dominant strain in New York and parts of California. The high cumulative frequencies of both subclades showed inconsistent but significant association with high cumulative CFRs in some of the regions. When the frequencies of the subclades were analyzed by their 7-day moving averages across each epidemic, we found co-circulation of both subclades to temporally correlate with peak mortality periods. We postulate that in areas with high numbers of these co-circulating subclades, a person may get serially infected. The second infection may trigger a hyperinflammatory response similar to the antibody-dependent enhancement (ADE) response, which could explain the ARDS-like manifestations observed in people with co-morbidity, who may not mount sufficient levels of neutralizing antibodies against the first infection. Further studies are necessary but the implication of such a mechanism will need to be considered for all current COVID-19 vaccine designs.

previously recognized and that even mixed populations (quasispecies) of the virus could occur in a patient with COVID-19 (7) . A study of only 11 viral isolates from 11 COVID-19 patients in Zhejiang, China identified over a 2-week period revealed 33 distinct mutations in the coding regions of the virus (7) . These strain variants included several founding subclades of SARS-CoV2 now circulating in different regions of the world. They further observed that these variants exhibited as much as 270fold differences in viral load and cytopathic effect (CPE) in Vero-E6 cells infected with the virus (7) .

Thus, it is beginning to be recognized that viral mutations can affect virus function, at least in vitro. In fact, a recent report described global dissemination of a SARS-CoV-2 variant with a mutation in the spike protein (D614G), suggesting that this strain gained fitness advantage for increased infectivity (3) .

The observed geographic differences in COVID-19 CFR led us to speculate that, in addition to increased transmissibility, viral mutations may contribute to differences in mortality. We analyzed 34,412 SARS-CoV2 sequences deposited in the Global Initiative for Sharing All Influenza Data (GISAID) SARS-CoV2 sequence database (https://www.gisaid.org/ ) (as of June 4, 2020) to assess if regional differences in circulating strain variants correlate with differences in mortality in Europe, United States, and California. We found several sequences with missense mutations in Orf1ab, spike protein, and Orf3a genes in strains that were observed to dominate in Europe and the United States, associated with high CFR regions, controlled for the timing of initiation of social-distancing measures, population age structure, number of tests/population performed, prevalence of co-morbidities, and number of hospital beds/population. These observations have implications for predicting the course of the pandemic as well as in vaccine design and safety.

We compared CFRs by 1) country in Western Europe, 2) state in the US, and 3) county in California (Table 1 ). The cumulative CFRs were calculated based on COVID-19 cumulative number of deaths divided by the cumulative number of cases as reported by Worldometer's database on June . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) 6 , 2020 (https://www.worldometers.info/coronavirus/ ). In Western Europe, we included countries that reported >150,000 cases (Italy, Spain, France, Germany, and UK), while in the US, we included states that reported >100,000 as of May 31 (New York, New Jersey, Illinois, California, Massachusetts). In California, we included counties from which > 90 SARS-CoV2 sequences were deposited in the GISAID database (San Francisco, Santa Clara, and San Diego county). In Western Europe, Germany had the lowest CFR at 4.7% compared to >10% in the other countries. In the US, California reported the lowest CFR at 3.4%, while New York had a CFR of 7.7%; the five boroughs of New York City reported a CFR of 10%. In California, Santa Clara county reported a CFR of 5.1%, while San Francisco reported a CFR of only 1.6% (Table 1) .

Each regions' CFR was compared to co-factors that could potentially affect CFRs-date of initiation of the social-distancing measures, median age of the region's population, prevalence of comorbidities (diabetes mellitus and hypertension), the number of hospital beds/1000 population (recent data on the number of ICU beds per population were not uniformly available for all regions included in this study), and the number of RT-PCR tests performed/1 million population (Table 1) We examined 34,412 SARS-CoV2 complete genome sequences deposited from the highincidence countries and regions in Western Europe, the United States, and California as of June 4, 2020 ( Table 2 ). All of the sequences were compared to that of one of the first SARS-CoV2 strains isolated in Wuhan, China (Wuhan HU-1; MN908947) (1). The number of sequences submitted from . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) each region varied (Table 1, Supplementary Table 1 ). The United Kingdom submitted the greatest number of sequences (Table 2 ).

In Western European countries, the United States, and in California, two of three subclades descending from the Wuhan HU-1 strain in China dominated. One variant had a nonsynonymous mutation in the spike protein gene (A23403G). This subclade contained two additional mutations--one in the leader sequence (C241T) and one synonymous mutation in the gene encoding nonstructural protein nsp3 (C3037T). Descending from this subclade was a variant that gained a missense mutation in a segment of Orf1ab encoding the RNA-dependent RNA polymerase (C14408T). The third subclade contained an additional missense nucleotide substitution in Orf3a (G25563T). These missense mutations led to non-conservative amino acid substitutions in the spike protein (D614G), RNA-dependent RNA polymerase (P323L), and Orf3a protein (Q57H). The remainder of this paper will thus focus on these subclades carrying these missense mutations--A23403G, A23403G-C14408T, and A23403G-C14408T-G25563T.

In Western Europe, the proportion of the A23403G-C14408T subclade varied from 39.7% in France to 90.8% in Italy ( Table 2 ). The cumulative proportions of A23403G-C14408T-G25563T mutant in high CFR Western European countries varied from 1.9% in Italy to 50.5% in France. Italy became the first epicenter of the COVID-19 epidemic in Europe. Of 109 sequences deposited from Italy, 99 (90.8%) carried the A23403G-C14408T mutations, while only 2 (1.9%) carried the A23403G-C14408T-G25563T mutation ( Table 2 ). The A23403G-C14408T subclade was associated with high CFR when the United Kingdom (OR=2.1; 95% CI: 1.6-2.7) and Italy (OR=8.8; 95% CI: 4.3-20) were compared to Germany, but not when Spain and France were compared to Germany ( Table 2 ). The A23403G-C14408T-G25563T subclade was associated with high CFR when France was compared to Germany (OR=3.8; 95% CI: 2.5-5.5), but not when the other Western European countries were compared to Germany (Supplementary Table 2A, 2B, 2C) .

. CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

In the US, most of the sequence submissions were from New York and California ( To determine where these subclades originated, we examined the sequences deposited from China. Of 742 sequences from China deposited mostly from January through March, we found 15 (2%) A23403G, 19 (2.6%) A23403G-C14408T and 8 (1.1%) A23403G-C14408T-G25563T strains (Table 2) . Thus, the latter two dominant nucleotide variants in Europe and the US most likely originated in China and greatly expanded in these regions.

The cumulative CFRs and the cumulative frequencies of the dominant subclades we compared above do not reflect temporal changes of these two variables over the course of each epidemic in a region. Thus, we compared the daily number and 7-day moving averages of the frequencies of these mutants to the daily number of cases and deaths across the course of each epidemic. This analysis had to be limited to samples from countries and regions that submitted a sufficient number of sequences from the early phase to shortly after the peak period of the epidemic.

In the United Kingdom, among 15,413 full genome sequences analyzed, the 7-day moving average of the frequencies of the A23403G-C14408T subclade increased from less than 30% at the beginning of the epidemic in late February to over 90% just past the peak of the epidemic (Fig. 1) .

. CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review)

The copyright holder for this preprint this version posted July 15, 2020. . https://doi.org/10.1101/2020.07.13.20152959 doi: medRxiv preprint

The 7-day moving average of the A23403G-C14408T-G25563T variant remained below 10% throughout most of the epidemic period. In Germany, from late February to mid-March, the 7-day moving average of the proportions of A23403G-C14408T mutant increased quickly to above 70%, dipped briefly to just above 40%, and then increased again to over 80% just past the peak of its epidemic in late April (Fig. 1) . The A23403G-C14408T-G25563T subclade appeared to compete with the other subclade, initially peaking at nearly 50%, then falling to less than 10% by the post-peak period. The two subclades' 7-day moving averages moved in opposite directions, where the doublemutant seemed to outcompete the triple mutant (Fig. 1) . The same phenomenon is observed in France ( Fig. 1) . In Germany, the number of deaths peaked in mid-April, about a month after the peak 7-day average proportion of the A23403G-C14408T-G25563T strain. In both UK and Spain, the A23403G-C14408T-G25563T subclade's 7-day moving average remained at less than 10% throughout their epidemics.

Interestingly, at the beginning (February) of the epidemic in Germany, more than 90% of the strains carried the A23403G mutation but not the C14408T or G25563T mutations. Strains with C14408T or G25563T mutation were not observed during this early period. The A23403G single nucleotide variant was completely replaced by mid-March by the other two subclades. In all the other regions in Western Europe and the US, this single nucleotide variant represented less than 1% of the sequences deposited in the GISAID database (Table 2 ).

In the United States, unlike Europe, it was the A23403G-C14408T-G25563T mutant that became the dominant strain (Fig. 2) . In New York, the state that had the highest cumulative CFR, the epidemic (cases and deaths) peaked in early April. Of 1342 complete sequences submitted from the state, the moving 7-day average of the proportions of the A23403G-C14408T mutant fluctuated between 0 and 20% during most of the epidemic course (Fig. 2 ). During the same period, the 7-day moving average of the A23403G-C14408T-G25563T subclade fluctuated around 90% (Fig. 2) .

. CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this preprint this version posted July 15, 2020. .

Interestingly, in California that had a relatively low cumulative CFR, the epidemic started with the triple mutant clade, but by mid-May, the A23403G-C14408T mutant took over.

Within California, Santa Clara county had low frequencies of both mutants, despite having a high CFR relative to San Francisco county. However, Fig. 3 shows that, as observed in Germany, Santa Clara county actually had high frequencies of both of these mutants during the early phase of the epidemic (March) and both mutants decreased rapidly to less than 5% by early April. Nearly 45% of the deaths in this county occurred in long-term care facilities (LTCFs). The peak number of deaths that were not associated with LTCFs occurred in late March (https://www.sccgov.org/sites/covid19/Pages/dashboard-cases.aspx ), shortly after the peak 7-day mean average period of the two mutants ( Fig. 3 ).

This genomic analysis of 34,412 SARS-CoV2 sequences deposited in the GISAID database found two dominant subclades of the virus co-circulating in Europe and the United States derived from the Wuhan HU-1 strain isolated in China in December, 2019 (1). These subclades descended from another subclade with a missense mutation in the spike protein gene A23403G (D614G). The A23403G, A23403G-C14408T, and A23403G-C14408T-G25563T variants accounted for only 2%, 2.6%, and 1.1%, respectively of 742 sequences deposited from China as of June 4 ( Table 2 ). The epidemic in Germany initially involved the A23403G variant, which was replaced during the epidemic by the A23403G-C14408T and A23403G-C14408T-G25563T variants. In Italy, which became the first epicenter of the epidemic in Europe, over 90% of the cases were due to the double mutant. However, this observation could have resulted from sampling bias. The small number of samples submitted from the country (109) may not be representative of Italy's cases. (Table 1, Supplementary Material).

. CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this preprint this version posted July 15, 2020. . https://doi.org/10.1101/2020.07.13.20152959 doi: medRxiv preprint

The dominant subclades were significantly associated with some of the high cumulative CFR regions in Western Europe, the United States, and in California, even after controlling for differences in the initiation of social distancing measures, population age structure, prevalence of diabetes mellitus or hypertension, number of hospital beds/1000 population, and number of RT-PCR tests/1 million population done. Among five Western European countries reporting >150,000 cases at the end of May 2020, Germany had the lowest cumulative CFR (4.7%). Yet, the lockdown in Germany was initiated after Italy, Spain, and France. Germany's median age (47.8 years) is the highest, and it has the highest prevalence of diabetes and second highest prevalence of hypertension among these countries. It does boast the highest number of hospital beds per 1000 population, but this factor did not correlate with low CFR in the other four European countries, the US or California counties ( Table 1) .

The cumulative frequencies of A23403G-C14408T and A23403G-C14408T-G25563T variants were not each consistently associated with the high cumulative CFR regions. The A23403G-C14408T mutant was significantly associated with high CFR when the United Kingdom and Italy were compared to Germany. On the other hand, the A23403G-C14408T-G25563T mutant was associated with high CFR only when France was compared to Germany. In the US, it was the triple nucleotide mutant that was associated with high CFRs. In early June, the CFR in New York was 7.7%--more than double that of California at 3.4%, and the A23403G-C14408T-G25563T mutant was significantly more prevalent in New York (Table 2) . Within California, this mutant was significantly associated with high CFR when San Diego county was compared to San Francisco county. However, this association was not observed with Santa Clara county, which had a high CFR at 5.1% but low frequencies of both subclades at less than 10%.

When the frequencies of the variants were examined by their 7-day moving averages, a more consistent pattern was observed. In Western Europe, interestingly, it was the A23403G-C14408T mutant that dominated in most of the countries. However, the peak periods of deaths . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this preprint this version posted July 15, 2020. . https://doi.org/10.1101/2020.07.13.20152959 doi: medRxiv preprint appeared to temporally correspond to the period of increased co-circulation of the A23403G-C14408T-G25563T mutant (Fig. 1) . In Germany, when we examined the frequency of the mutants by intervals of 7-day moving averages over the course of the epidemic, the A23403G-C14408T-G25563T mutant peaked at close to 50% during the early phase of the epidemic, which was about 3-4 weeks before the peak number of deaths (Fig. 2 ). Since the virus isolation is most likely made at the time of initial diagnosis of COVID-19, if the co-circulation of the two mutants is associated with increased number of deaths, it would be expected that the peak 7-day mean average of the co-circulating mutants will precede the peak number of deaths during the epidemic.

In the United Kingdom, the cumulative proportion of the A23403G-C14408T-G25563T mutant was low (6.4%) but the UK's CFR was one of the highest in Europe at 14%. The UK reported more than 255,000 cases and 40,000 deaths at the end of first week of June. Thus, even if only 6.4% of the people were infected with this strain, more than 16,000 people would have been infected, suggesting that both of the subclades were co-circulating at high numbers. Thus, in regions where the case numbers are high, the numbers of these circulating dominant mutants would also be high, even if the relative proportion of one may be low. Part of the high mortality rate of COVID-19 observed in the UK, therefore, could still be related to the high level of co-circulation of these dominant strains (Fig. 2, Table 2 ).

In California, Santa Clara county reported a high cumulative CFR (5.1%) relative to San Francisco at 1.6%. The two mutants together accounted for over 90% of the strains circulating in Santa Clara in late March but both fell rapidly to less than 5% by early April (Fig. 3) . It should be noted that this county reported less than 3000 cases and 150 deaths at the end of first week in June.

Nearly half of all deaths in the county through May occurred in long-term care facilities (LTCF). Thus, the CFR was greatly affected by these outbreaks at LTCFs. Interestingly, the peak 7-day average of the two mutants occurred 2-3 weeks prior to the peak number of non-LTCF deaths, again suggesting that the co-circulation of the two subclades may contribute to increased mortality.

. CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this preprint this version posted July 15, 2020. . https://doi.org/10.1101/2020.07. 13.20152959 doi: medRxiv preprint In San Francisco with a CFR of only 1.6%, the triple mutant dominated. However, it fluctuated at levels between 20 to 60%--less than what was observed in high CFR city like New York City, which had a CFR of 10%. The small number of samples available from San Francisco (154) could have also contributed to bias in the representativeness of the samples.

It is well recognized that most of the severe complications or deaths due to COVID-19 occur in those who are over 65 years of age or those with co-morbidity (8, 9) . But the mechanisms that lead to the respiratory failure accompanied by hyperinflammatory state in these patients are unknown. The co-circulation of dominant SARS-CoV2 subclades raises a disturbing possibility that people living in places with high prevalence of co-circulating strains may get serially infected with each variant. The second infection may trigger an immune response not unlike the antibodydependent enhancement (ADE) observed in diseases such as dengue. In settings where a population gets serially exposed to different strains of the dengue virus, the exposed individuals may mount varying levels of heterotypic antibody response directed at the different viral strain that do not neutralize but enhance binding of the strain to enter host cells (10) . Infected with SARS-CoV2, those with underlying medical conditions are likely to mount lower levels of neutralizing antibody response to the first infection, putting them at risk for ADE-like response during the second infection.

Indeed, evidence for spike protein antibody-mediated viral entry into cultured cells has been demonstrated for the Middle East respiratory syndrome (MERS) coronavirus (11) . A potential for ADE-like host response from natural serial infections with the dominant SARS-CoV2 mutants may need to be considered for COVID-19 vaccine development.

The non-conservative amino acid changes that occurred in the SARS-CoV2 clades involve proteins that have each been suggested to affect pathogenicity and transmissibility of the virus. The SARS-CoV2 Orf1ab encodes a number of non-structural proteins, including RNA dependent RNA polymerase (RdRp) or nsp12 (12) . The nonsynonymous nucleotide substitution C14408T in Orf1ab leads to P323L amino acid substitution in RdRp (12) . The G25563T mutation in Orf3a leads to a non-. CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this preprint this version posted July 15, 2020. . https://doi.org/10.1101/2020.07.13.20152959 doi: medRxiv preprint conservative amino acid change Q57H. In a study of 2,782 SARS-CoV-2 sequences downloaded from the GISAID database in early April, 51 different nonsynonymous amino acid substitutions in the Orf3a protein were found (13) . The isolates defined by these mutations were separated into distinct subpopulations according to six functional domains. These functional domains were suggested to be associated with infectivity, ion channel formation, and virus release (13) . In SARS-CoV1, the Orf3a protein was reported to be a virulence factor that binds to TRAF3 to activate NF-κB and the NLRP3 inflammasome (14) . If the SARS-CoV2 Orf3a protein can exhibit a similar host response, it is biologically plausible that the Q57H mutation in Orf3a could contribute to the severe clinical manifestations observed in COVID-19.

The A23403G mutation in the spike protein gene leads to a non-conservative amino acid change D614G. The spike proteins of SARS-CoV1 and CoV2 share a receptor-binding domain (RBD) that facilitates virus entry into mammalian cells via the angiotensin converting enzyme-2 (ACE2) receptor (15, 16) . ACE2 is also the receptor for a group 1 coronavirus NL63, which causes common cold (17) . Thus, cell entry, while required, does not appear to be sufficient for SARS-CoV2 to cause the severe hyperinflammatory clinical manifestations. Nevertheless, as recently reported by Korber et al, the mutation D614G in the spike protein could have provided a selective advantage that enabled these two subclades to become the dominant strains in Europe and the United States (3). Indeed, another report showed that the SARS-CoV2 G614 spike protein is more stable than the D614 protein and that a pseudovirus (based on Maloney murine leukemia virus) expressing G614 was able to infect human ACE2-expressing HEK293T cells more efficiently than that expressing the D614 protein (18).

Korber et al also showed that patients infected with the D614G mutant had a higher viral load and that infectious titers of vesicular stomatitis virus (VSV) and lentiviral particles expressing the spike protein with the G614 mutation was higher than those expressing the D614 mutation (3). In a sub-analysis of COVID-19 patients in the United Kingdom, whose viral sequences were available from the GISAID database, . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this preprint this version posted July 15, 2020. . they found that infection with the D614G mutant was not associated with increased hospitalization, although the mutant was slightly but not significantly enriched among the ICU patients (3) .

The impact of the mutation C14408T in RdRp on SARS-CoV2 viral function or virulence is unknown, but strains carrying this mutation were shown by others to have a higher number of comutations than strains without this mutation (12) . The mutation in RdRp may have affected the proofreading function of this enzyme. The dominant clade of this virus with this Orf1a mutation may thus continue to accumulate mutations as it spreads into new regions. In SARS-CoV1, Orf1a has also been reported to facilitate viral adaptation to new hosts (20) . If this function exists for SARS-CoV2, this could also explain its highly successful transmissibility in human populations.

The limitation of the findings of our study is that they are based on sequences deposited in the GISAID database, most of which do not have patient clinical data. Hence, the associations we found between high CFRs and co-circulation of the dominant SARS-CoV2 lineages are an ecologic observation. In addition, the number of sequences submitted from each country and region vary greatly and may not necessarily be representative of the strains circulating in these places.

Nevertheless, it is striking that the two subclades of the virus were consistently observed to dominate in Europe and the United States. A similar observation was reported by others based on analyses of the sequences deposited in the GISAID database from earlier periods (3, 12, 13) .

Over the course of 6 months of the COVID-19 pandemic, two subclades descending from the Wuhan HU-1 strain became dominant in Europe and the United States. The co-circulation of the two variants in high numbers correlated temporally with periods of high mortality in Western Europe, the United States, and California counties. We speculate that the severe COVID-19 clinical manifestations and deaths could result from aberrant host immune responses to sequential infections in the same person with these subclades. These observations will need to be confirmed with studies . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this preprint this version posted July 15, 2020. . linking patient clinical outcome data with the patient's viral sequences. As the COVID-19 pandemic expands into the southern hemisphere, we may be able to track these subclades of SARS-CoV2 to determine how they affect the shape and magnitude of the epidemic and mortality in each region.

Furthermore, in addition to the spike protein, the other viral proteins containing the observed mutations should be considered for detailed investigation for their role in pathogenicity and transmissibility, as well as potential targets for a new vaccine or anti-viral drugs.

Cumulative and daily number of COVID-19 cases and deaths in Western Europe and the United States:

Cumulative COVID-19 cases and deaths from Western Europe (United Kingdom, Germany, Spain, France, and Italy), the US (California, Illinois, New York, New Jersey, Michigan), and California counties (San Francisco, Santa Clara, and San Diego) reported in the Worldometer database (https://www.worldometers.info/coronavirus/) as of June 6, 2020 were used to calculate cumulative case fatality rates (CFR). Western Europe countries reporting >150,000 cases and the US states reporting >100,000 cases were selected. The selected California counties included those that submitted > 90 SARS-CoV2 genome sequences into the Global Initiative for Sharing All Influenza Data (GISAID) database. Information on co-factors potentially affecting CFRs in these regions (lockdown initiation date, median age of cases, prevalence of diabetes mellitus, prevalence of hypertension, number of hospital beds per population, and number of RT-PCR tests per population) were obtained from sources indicated under Table 1 . Correlation coefficients of these variables with CFR were calculated with R software, version 3.6.3.

Daily COVID-19 case and death numbers from Western European countries (United Kingdom, Germany, France, Spain, and Italy) and daily COVID-19 case and death numbers from the US states (New York and California) and California counties (San Francisco, Santa Clara, and San Diego) were obtained from the repository (https://github.com/CSSEGISandData/COVID-19) hosted by . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this preprint this version posted July 15, 2020. . https://doi.org/10.1101/2020.07.13.20152959 doi: medRxiv preprint the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University (19) . Data were accessed on June 4, 2020.

SARS-CoV2 genome sequence analysis:

We initially acquired full sequence alignments of 37,647 submissions to EpiCoV from the Global Initiative for Sharing All Influenza Data (GISAID) database ("msa_0604.fasta", downloaded on June 4, 2020). Duplicate and low-quality sequences (>5% NNNNs) were removed and alignment of 34,417 sequences was created based on the sequence multiple alignment program MAFFT (20) by GISAID. We analyzed only complete sequences (length >29000 bp). We trimmed the full-length alignment with trimAL3 v1.4.rev15 (21) , which has the automated option -automated1 to remove any spurious parts of the alignment, thus reducing noise introduced into the single nucleotide variant analysis. The aligned genomes were re-positioned according to the reference SARS-CoV-2 genome (GenBank access number: NC_045512.2; Wuhan HU-1; MN908947).

We extracted single nucleotide polymorphism (SNP) information with the program snp-sites with default option (22) . SNPs with prevalence less than 0.3% among all sequences were excluded from the analysis. All variant information was merged to metadata information based on gisaid_epi_isl number, downloaded on June 4, 2020 from GISAID ("metadata_2020-06-04_18-12.csv").

The statistical analysis was performed with R software, version 3.6.3. We compared the frequencies of the dominant viral subclade sequences in two categories based on CFR of the region from which they were submitted-high CFR vs low CFR region. In Western Europe, Germany was set as the low mortality region (CFR = 4.7%) and all other countries were classified as high mortality region (CFR > 10%). In the US, low mortality states included California, Illinois, Texas, Florida, and Georgia (case-fatality rates, CFR < 5%), and high mortality states included New York, New Jersey, . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this preprint this version posted July 15, 2020. . https://doi.org/10.1101/2020.07.13.20152959 doi: medRxiv preprint Michigan, Connecticut, and Louisiana (CFR). In California, the frequency of the sequences of the isolates from San Francisco (CFR of 1.6%) was compared to that of San Diego, which had a cumulative CFR of 4% as of June 6. Frequencies of dominant subclades were compared between high CFR and low CFR regions by Fisher's exact test. All p-values were calculated from 2-sided tests based on 8.8e-06 as the significance level based on Bonferroni correction for multiple comparisons.

Subclade analysis:

All analysis was conducted with R package dplyr (23) and graphically plotted with ggplot2 (24) . Odds ratios and their 95% confidence intervals of cumulative strain frequencies of the dominant strains were calculated in reference to a country or sub-region with the lowest CFR in the region, using Fisher's exact test. Since CFRs and strain frequencies fluctuate by time, we also examined daily frequencies and 7-day moving averages of the frequencies of the dominant subclades by plotting them against daily cases and deaths. Peak frequencies of the subclades were compared to peak numbers of death during each epidemic.

LR conceived of the idea to compare regional differences in CFR with viral variants, obtained cofactor data, and calculated regression coefficients. YH accessed the GISAID database, performed the genome sequence analyses and graphically presented the data. LR and YH interpreted the findings and both prepared the manuscript. . Daily numbers (filled circles) and seven-day moving average of the frequencies of A23403G-C14408T subclade (blue line) and A23403G-C14408T-G25563T subclade (red line) compared to cases (gray bar) and deaths (black bar) in three counties in California (San Francisco, San Diego, and Santa Clara) that submitted >90 SARS-CoV2 genome sequences. Santa Clara county had the highest cumulative CFR while the cumulative frequencies of the subclades were lowest of the three counties. Nearly half of the deaths in Santa Clara county involved residents of long-term care facilities (LTCFs). Note that the two subclades dominated at the beginning of the epidemic, which preceded the peak period of deaths in people who were not residents of LTCFs.

. CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this preprint this version posted July 15, 2020. . . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this preprint this version posted July 15, 2020. . A23403G-C14408T-G25563T cumulative strain frequency in high vs lowest CFR region.

. CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review)

The copyright holder for this preprint this version posted July 15, 2020. 

A new coronavirus associated with human respiratory disease in China

Washington State -nCo VCIT. 2020. First Case of 2019 Novel Coronavirus in the United States

Spike mutation pipeline reveals the emergence of a more transmissible form of SARS-CoV-2

The proximal origin of SARS-CoV-2

Genomic Epidemiology of SARS-CoV-2 in Guangdong Province

Evolution of the novel coronavirus from the ongoing Wuhan outbreak and modeling of its spike protein for risk of human transmission

Viral load dynamics and disease severity in patients infected with SARS-CoV-2 in Zhejiang province

China Medical Treatment Expert Group for C. 2020. Clinical Characteristics of Coronavirus Disease 2019 in China

Coronavirus (COVID-19): A Review of Clinical Features

Antibody-dependent enhancement of severe dengue disease in humans

Molecular Mechanism for Antibody-Dependent Enhancement of Coronavirus Entry

Emerging SARS-CoV-2 mutation hot spots include a novel RNA-dependent-RNA polymerase variant

SARS-CoV-2 and ORF3a: Nonsynonymous Mutations, Functional Domains, and Viral Pathogenesis

Severe acute respiratory syndrome coronavirus ORF3a protein activates the NLRP3 inflammasome by promoting TRAF3-dependent ubiquitination of ASC

International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) The copyright holder for this preprint this version posted

Cellular entry of the SARS coronavirus

Structure, Function, and Antigenicity of the SARS-CoV-2 Spike Glycoprotein

Human coronavirus NL63 employs the severe acute respiratory syndrome coronavirus receptor for cellular entry

The D614G mutation in the SARS-CoV-2 spike protein reduces S1 shedding and increases infectivity

An interactive web-based dashboard to track COVID-19 in real time

MAFFT multiple sequence alignment software version 7: improvements in performance and usability

trimAl: a tool for automated alignment trimming in large-scale phylogenetic analyses

SNP-sites: rapid efficient extraction of SNPs from multi-FASTA alignments

2020. dplyr: A Grammar of Data Manipulation

Ggplot2: Elegant Graphics for Data Analysis

International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity

The authors have no conflict.. CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review)