key: cord-312524-ee5xw1r8
authors: Moustafa, Ahmed M.; Planet, Paul J.
title: Rapid whole genome sequence typing reveals multiple waves of SARS-CoV-2 spread
date: 2020-06-08
journal: bioRxiv
DOI: 10.1101/2020.06.08.139055
sha: 
doc_id: 312524
cord_uid: ee5xw1r8

As the pandemic SARS-CoV-2 virus has spread globally its genome has diversified to an extent that distinct clones can now be recognized, tracked, and traced. Identifying clonal groups allows for assessment of geographic spread, transmission events, and identification of new or emerging strains that may be more virulent or more transmissible. Here we present a rapid, whole genome, allele-based method (GNUVID) for assigning sequence types to sequenced isolates of SARS-CoV-2 sequences. This sequence typing scheme can be updated with new genomic information extremely rapidly, making our technique continually adaptable as databases grow. We show that our method is consistent with phylogeny and recovers waves of expansion and replacement of sequence types/clonal complexes in different geographical locations. GNUVID is available as a command line application (https://github.com/ahmedmagds/GNUVID).

Rapid sequencing of the SARS-CoV-2 pandemic virus has presented an 40 unprecedented opportunity to track the evolution of the virus and to understand the 41 emergence of a new pathogen in near-real time. During its explosive radiation and 42 global spread, the virus has accumulated enough genomic diversity that we are now 43 able to identify distinct lineages and track their spread in distinct geographic locations 44 and over time [1] [2] [3] [4] [5] [6] . Phylogenetic analyses in combination with rapidly growing 45 databases [1, 7] have been instrumental in identifying distinct clades and tracing how 46 they have spread across the globe, as well as estimating calendar dates for the 47 emergence of certain clades [1] [2] [3] [4] . This information is extremely useful in assessing the 48 impact of early measures to combat spread as well as identifying missed opportunities 49

[3]. Going forward whole genome sequences will be useful for identifying emerging 50 clones or hotspots of reemergence. 51

In all of these efforts, identification of specific clones, clades, or lineages, is a 52 critical first step, and there are few systems available to do this [1] . As of June 1st there 53 are already 35,291 and 4,636 complete genomes (>29,000bp) available at GISAID [7] 54 and GenBank [8] , respectively. To address the problem of identifying sequence types in 55 SARS-CoV-2 and leverage these huge datasets, we took inspiration from a an 56 approach used widely in bacterial nomenclature, multilocus sequence typing (MLST) [9] . 57

Our panallelome approach to developing a whole genome (wgMLST) scheme for 58 SARS-CoV-2 uses a modified version of our recently developed tool, WhatsGNU [10], 59 to rapidly assign an allele number to each gene nucleotide sequence in the virus's 60 genome creating a sequence type (ST). The ST is codified as the sequence of allele 61 numbers for each of the 10 genes in the viral genome. 62

Here we show that this approach allows us to link STs into clearly defined clonal 63 complexes (CC) that are consistent with phylogeny. We show that assessment of STs 64 and CCs agrees with multiple introductions of the virus in certain geographical locations. 65

In addition, we use temporal assessment of STs/CCs to uncover waves of expansion 66 and decline, and the apparent replacement of certain STs with emerging lineages in 67 specific geographical locations. 68 69

We developed the GNU-based Virus IDentification (GNUVID) system as a tool 71 that automatically assigns a number to each unique allele of the 10 open reading 72 frames (ORFs) of SARS-CoV-2 ( Figure 1A) information. The majority of these alleles (65%) are for ORF1ab which represents 71% 77 of the genome length ( Figure 1A) . Strikingly, the most abundant alleles of each ORF 78 (except ORF1ab) were present in at least 79% of the 10,422 isolates, and for 8 ORFs 79 (ORF3a-10) the allele that was observed in the earliest genomes was also the most 80 prevalent, suggesting strong nucleotide level conservation over time. 81 Some widespread alleles corresponded to mutations that have been 82 hypothesized to be important to the evolution or pathogenesis of the virus. For instance, 83

for the S gene, the gene for the Spike protein, 64% (526/817) of unique alleles have the 84 A23403G (D614G) mutation ( Figure 1B ) that has been associated with the emergence 85 of increased transmission whether through increased transmissibility [11] isolates. The earliest sequenced, and most common, ORF3a allele that carries this 92 mutation (allele 25) was isolated in France on February 21st in a virus that also carries 93 also the A23403G mutation in the spike gene. 94

To create an ST for each isolate GNUVID automatically assigned 5510 unique 95

ST numbers based on their allelic profile (Supplementary Table 2 ). We then used a 96 minimum spanning tree (MST) to group STs into larger taxonomic units, clonal 97 complexes (CCs), which we define here as clusters of >20 STs that are single or double 98 allele variants away from a "founder". Using the goeBURST algorithm [13, 14] to build 99 the MST and identify founders, we found 24 CCs representing 79% (4352/5510) of all 100 unique STs ( Figure 1B) . 101

When the global region of origin for each genome sequence was mapped to 102 each CC there was a strong association of some CCs with certain geographical 103 locations. For instance, genomes from CCs 255, 300, 301, 317, 348, 355, 369, 399, 104 454, 498, 985, 1063, 1148 are predominantly from Europe while genomes from CCs 26, 105 800 and 927 are mainly from Asia ( Figure 1B) . Interestingly, genomes originating from 106 the US appear to be associated with 2 very divergent CCs, potentially reflecting two 107 major introductions. The first, CC256, is associated with locations on the West Coast, 108 specifically Washington state. The first two isolates belonging to CC256 are from China 109 followed by the first isolate from Washington (01/19/2020). The second predominant US 110 CC, CC258, is closely related to other CCs found predominantly in Europe ( Figure 1B  111 and 1C). Isolates of CC258 were initially found and sequenced in Europe, followed by 112 the US East Coast, and later in other US locations ( Figure 1B) . Interestingly, almost all 113 isolates (99%) from CC258 and its descendants CCs 768, 800, 844 and 1063 ( Figure  114 1B) carry the G25563T mutation in ORF3a, representing 88% of all isolates that carry 115 this mutation; the other 12% are from STs that were not assigned CCs. CC800 is 116

interesting for its geographic predominance in the Middle East (75% from Saudi Arabia 117

and Turkey) and its close relationship to ST338 and ST258, which are mostly found in 118 the US. This may signal a transmission event from the US to the Middle East. 119

To show that CCs are mostly consistent with whole genome phylogenetic trees 120

we produced a maximum likelihood tree and mapped the CC designations onto the tree. 121 Table 2) . 133

Because we included collection dates for each genomic sequence, we can use 134

STs and CCs to better understand the emergence and replacement of certain lineages 135 in certain geographical regions over time. Figure 2A shows temporal plots of the most 136 common 12 CCs around the world. This makes clear the emergence of new CCs over 137 time such as CC255, CC300 and CC258. CC4, the earliest CC, started by representing 138 60% of sequenced genomes in mid-January, but had dropped to only 5% by mid-March. 139

Of course, relative proportions of STs or CCs isolated and sequenced may be a 140 highly biased statistic that is contingent upon where the isolate comes from, the 141 decision to sequence its genome, and the local capacity to sequence a whole genome. Interestingly, Europe showed a general CC diversity over time resembling that of 159 the worldwide temporal plot, and then showed expansion of the local CC300 and 160 CC255 after mid-February (Figure 2D) . 161

The US plot (Figure 2E ) reflects the two possible introductions on the west and 162 east coasts from Asia and Europe, respectively, with the current dominance (more than 163 45%) of CC258. Focusing on Washington, it is interesting to note the possible 164 replacement of CC256 by CC258 perhaps by introduction from the East Coast or 165 Europe ( Figure 2F ) [2, 4] . In New York, a different pattern is seen with CC258 being 166 persistently dominant ( Figure 2G ). However, a more granular view of STs in New York, 167 not CCs, shows a shifting epidemiology with ST258 declining and the rise of closely 168 related SLVs and DLVs of ST258 ( Figure 2H) . 169

Because a change in any allele creates a new ST our method may accumulate and 171 count "unnecessary" STs that have been seen only once or may be due to a 172 sequencing error. This is partially ameliorated by the use of the CC definition that allows 173 some variability amongst the members of a group. A large number of STs also may 174 allow more granular approaches to tracking new lineages. Our method is also limited by 175 the quality and extent of the database. For this implementation we limited the database 176 to genomes that do not have any ambiguity or degenerate bases. However, these 177 genomes could be queried through our tool to be assigned to the closest ST/CC. 178

Another limitation is the stability of the classification system, some virus genomes may 179 be reassigned to new CCs as clones expand epidemiologically, but this may also reflect 180 a dynamic strength as circulating viruses emerge and replace older lineages. 181

The genomic epidemiology of the 10,422 SARS-CoV-2 isolates studied here 184

show six predominant CCs circulated/circulating globally. Our tool (GNUVID) allows for 185 fast sequence typing and clustering of whole genome sequences in a rapidly changing 186 pandemic. As illustrated above, this can be used to temporally track emerging clones or Table 1 

The compressed genomes from our quality controlled dataset are available from the 261 corresponding author and available online for download. The compressed database will 262 be updated weekly on https://github.com/ahmedmagds/GNUVID. Source code for 263 GNUVID can be found in its most up-to-date version here, 264 https://github.com/ahmedmagds/GNUVID, under the GNU General Public License. 265 266

The authors declare that they have no competing interests We would like to thank Lidiya Denu and Michael Silverman for helpful comments and 279 discussion. We would like to thank the Global Initiative on Sharing All Influenza Data 280 (GISAID) and thousands of contributing laboratories for making the genomes publicly 281 available. A full acknowledgements table is available as supplemental information. 282 

A 285 dynamic nomenclature proposal for SARS-CoV-2 to assist genomic epidemiology

A Genomic Survey of SARS-CoV-2 Reveals Multiple Introductions into 289 Northern California without a Predominant Lineage

The emergence of SARS-CoV-2 in Europe and the US

Cryptic transmission of SARS-CoV-2 in Washington 296 State

Comprehensive genome analysis of 298 6,000 USA SARS-CoV-2 isolates reveals haplotype signatures and localized 299 transmission patterns by state and by country

Global genetic diversity patterns and transmissions of 302 SARS-CoV-2

GISAID: Global initiative on sharing all influenza data -from vision 304 to reality

Multilocus sequence typing: A portable approach to the 309 identification of clones within populations of pathogenic microorganisms

WhatsGNU: a tool for identifying proteomic novelty

Spike mutation pipeline reveals the emergence of a 315 more transmissible form of SARS-CoV-2

Severe acute respiratory syndrome coronavirus ORF3a 318 protein activates the NLRP3 inflammasome by promoting TRAF3-dependent 319 ubiquitination of ASC

Global optimal eBURST analysis of 321 multilocus typing data using a graphic matroid approach

Evolutionary Descent among Clusters of Related Bacterial Genotypes from Multilocus 325 Sequence Typing Data

A 327 new coronavirus associated with human respiratory disease in China

Basic local alignment search tool

PHYLOViZ 2.0: 332 providing scalable data integration and visualization for multiple phylogenetic 333 inference methods

MAFFT: a novel method for rapid multiple 335 sequence alignment based on fast Fourier transform

IQ-TREE 2: New Models and Efficient Methods for Phylogenetic Inference in 339 the Genomic Era

Dating of the human-ape splitting by a molecular clock 341 of mitochondrial DNA

UFBoot2: Improving the

Interactive Tree Of Life (iTOL) v4: recent updates and new 345 developments