key: cord-0779794-yzb7g9ox authors: Sobkowiak, B.; Kamelian, K.; Zlosnik, J.; Tyson, J.; da Silva, A. G.; Hoang, L.; Prystajecky, N.; Colijn, C. title: Cov2clusters: genomic clustering of SARS-CoV-2 sequences date: 2022-03-13 journal: nan DOI: 10.1101/2022.03.10.22272213 sha: 7cdd72d4f379a360368482fab6ed740cb6df0c67 doc_id: 779794 cord_uid: yzb7g9ox

Background: The COVID-19 pandemic remains a global public health concern. Advances in rapid sequencing have led to an unprecedented level of SARS-CoV-2 whole genome sequence (WGS) data and sharing of sequences through global repositories, which has enabled almost real-time genomic analysis of the pathogen. WGS data has been used previously to group genetically similar viral pathogens to reveal evidence of transmission, including methods that identify distinct clusters on a phylogenetic tree. Identifying clusters of linked cases can aid in the regional surveillance and management of the disease. In this study, we present a novel method for producing stable genomic clusters of SARS-CoV-2 cases, cov2clusters, and compare the sensitivity and stability of our approach to previous methods used for phylogenetic clustering using real-world SARS-CoV-2 sequence data obtained from British Columbia, Canada.

Results: We found that cov2clusters produced more stable clusters than previously used phylogenetic clustering methods when adding sequence data through time, mimicking an increase in sequence data through the pandemic. Our method also showed high sensitivity when compared to epidemiologically informed clusters. These clusters often contained a high number of cases that were identical or near identical genetically.

Conclusions: The new approach presented here allows for the identification of stable clusters of SARS-CoV-2 from WGS data. Producing high-resolution SARS-CoV-2 clusters from sequence data alone can be a challenge and, where possible, both genomic and epidemiological data should be used in combination.
medRxiv preprint, this version posted March 13, 2022; https://doi.org/10.1101/2022.03.10.22272213. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.

The development of effective vaccines and regional containment strategies has allowed countries to mitigate the spread of SARS-CoV-2 and thereby reduce transmission, hospitalization, and death rates from COVID-19. Nevertheless, the threat posed by the disease is still a worldwide concern due to the emergence of Variants of Concern (VoCs) such as the Delta and Omicron variants that display increased transmissibility with lower vaccine effectiveness [6,7], delayed global vaccination deployment, vaccine hesitancy, and unequal access to vaccines and therapeutics.

We have seen an unparalleled effort in whole genome sequencing (WGS) of COVID-19 to identify new variants and mutations of concern. To date, there are over 9 million sequences publicly available through the open-source GISAID initiative [8]. Utilising these data to develop novel and easy-to-implement tools to detect growing or emerging transmission clusters can help control the spread of the virus locally. We can use genomic similarity to identify linked cases with shared demography or geography at a higher resolution than a shared lineage assignment or contact tracing alone. Inspecting clusters can reveal sources of common exposures or patterns of transmission through a population, which can be used to understand regional epidemiology and inform public health policy, such as implementing restrictions in certain settings with a high transmission risk.
Practically, we have also seen that SARS-CoV-2 lineage nomenclature, such as the widely used Pangolin system [9], has been dynamic through the pandemic and cannot provide sufficient resolution for epidemiological investigations. Thus, clustering sequences by genomic similarity provides the resolution and stability necessary for public health applications over the course of a dynamic pandemic.

Phylogenetic trees are an effective tool for summarizing evolutionary relationships among taxa, and tree reconstruction methods can be used to achieve realistic measures of genetic divergence. The information contained within a phylogeny can be used to define groups of closely related sequences that may indicate recent transmission between cases, either through identifying distinct clades on a tree or by using the pairwise patristic distance as a measure of divergence between tips. Phylogenetic clustering methods have been applied in many virological analyses [10-12], as well as early in the COVID-19 pandemic to define putative transmission clusters in SARS-CoV-2 [13-15]. However, clustering based solely on genetic variation may not be sufficient to effectively identify meaningful clusters in SARS-CoV-2 where there has [...]

Here, we present a novel method for constructing SARS-CoV-2 genomic clusters, using the pairwise probability of clustering under a logit regression model, and linking cases under a given probability threshold. The logit model incorporates genetic relatedness through phylogenetic distance and collection or symptom onset date; this method also allows for the inclusion of other covariates of interest that may result in meaningful clusters (e.g., contact data, exposure events).
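The pairwise approach described above can be illustrated with a minimal Python sketch. It is not the cov2clusters implementation: the coefficient values, the toy distances, and the function names `pairwise_probability` and `cluster_by_threshold` are all hypothetical, and linking pairs into connected components is one plausible reading of "linking cases under a given probability threshold".

```python
import math
from itertools import combinations

# Hypothetical coefficients (intercept, patristic distance, days apart),
# chosen for illustration only; they are not the values used in the study.
BETA = (3.0, -20000.0, -0.075)


def pairwise_probability(patristic_dist, days_apart, beta=BETA):
    """Inverse-logit of a linear predictor: probability that a pair is linked."""
    b0, b1, b2 = beta
    z = b0 + b1 * patristic_dist + b2 * days_apart
    return 1.0 / (1.0 + math.exp(-z))


def cluster_by_threshold(ids, dist, days, threshold, beta=BETA):
    """Link pairs whose probability is >= threshold, then return the
    connected components (singletons are non-clustered sequences)."""
    parent = {i: i for i in ids}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in combinations(ids, 2):
        pair = frozenset((a, b))
        if pairwise_probability(dist[pair], days[pair], beta) >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for i in ids:
        groups.setdefault(find(i), []).append(i)
    return sorted(sorted(g) for g in groups.values())


# Toy data: patristic distance (substitutions per site) and days between
# collection dates for each pair of four made-up sequences.
ids = ["s1", "s2", "s3", "s4"]
pairs = {
    ("s1", "s2"): (3.3e-5, 2),
    ("s2", "s3"): (6.7e-5, 5),
    ("s1", "s3"): (1.0e-4, 7),
    ("s1", "s4"): (5.0e-4, 20),
    ("s2", "s4"): (5.0e-4, 20),
    ("s3", "s4"): (5.0e-4, 20),
}
dist = {frozenset(k): v[0] for k, v in pairs.items()}
days = {frozenset(k): v[1] for k, v in pairs.items()}

print(cluster_by_threshold(ids, dist, days, threshold=0.8))
print(cluster_by_threshold(ids, dist, days, threshold=0.7))
```

With these toy values, a threshold of 0.8 links only s1 and s2, whereas a threshold of 0.7 also pulls s3 into the same cluster through its link to s2, even though the s1-s3 probability is below the threshold. This chaining is why a lower threshold tends to give fewer, larger clusters in a sketch of this kind.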
In contrast to previous clustering approaches that often rely solely on phylogenetic inference (e.g., TreeCluster), clustering isolates in this pairwise manner allows for greater cluster stability through time, as well as greater resolution by including epidemiological information without the need for time-consuming manual investigation. Previous cluster designations of sequences can also be specified a priori to further improve cluster stability. This also allows clustering to be performed on subsampled datasets where previously clustered sequences have been removed for ease of analysis. We provide this method as an R package, github.com/bensobkowiak/cov2clusters, for use within the research and public health community to investigate SARS-CoV-2 transmission dynamics.

Genomic clustering with cov2clusters (using the pairwise probability thresholds of 0.7 and 0.8) and TreeCluster 'single_linkage' found fewer, larger clusters than both cov2clusters at the 0.9 threshold and TreeCluster 'max_clade'. This occurred in both the pre-Delta dominance and Delta wave data. cov2clusters at the 0.9 probability threshold found many small clusters and a high number of sequences assigned as non-clustered, indicating this threshold may over-cluster the data.
Figure 2 shows the phylogenetic trees produced for the pre-Delta dominance and Delta waves and the resulting cluster assignments, with the largest five clusters found by each approach shown in colours, all sequences clustered in smaller clusters in grey, and non-clustered sequences in white. The largest clusters found using cov2clusters at the 0.7 and 0.8 probability thresholds were of similar size, though the number of clusters at the 0.7 probability threshold was significantly lower, with most sequences of the same sub-lineage assigned to a single, large cluster.

The sensitivity of the logit clustering tool, cov2clusters, for assigning sequences into seven epidemiologically well-defined clusters was tested using two pairwise probability thresholds (0.8, 0.9). These results were also compared to the sensitivity of TreeCluster using both the maximum clade distance threshold approach and the maximum pairwise linkage threshold approach (Table 1). We found that cov2clusters with a pairwise probability threshold of 0.8 assigned the highest number of sequences to epidemiologically informed clusters (92%), performing marginally better than the TreeCluster 'single_linkage' method (87%) and significantly better than TreeCluster 'max_clade' (66%) and cov2clusters at the 0.9 probability threshold (71%). We also tested cov2clusters with a probability threshold of 0.7, though at this threshold very large lineage-specific clusters were formed, which did not add any meaningful resolution to transmission clusters compared to simply using lineage classification.
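As a rough illustration of how a per-cluster sensitivity of the form "3/6 (50%)" could be computed, the Python sketch below counts how many sequences of an epidemiologically defined cluster fall into the single genomic cluster that contains most of them. This definition, the function name `cluster_sensitivity`, and the sequence and cluster labels are assumptions made for the example; the text does not spell out the exact calculation used in the study.

```python
from collections import Counter


def cluster_sensitivity(epi_members, genomic_assignment):
    """Return (matched, total) for one epidemiological cluster.

    genomic_assignment maps sequence id -> genomic cluster label, with
    None (or a missing id) meaning the sequence was left non-clustered.
    'matched' counts the sequences in the best-matching genomic cluster.
    """
    labels = (genomic_assignment.get(seq) for seq in epi_members)
    counts = Counter(label for label in labels if label is not None)
    matched = counts.most_common(1)[0][1] if counts else 0
    return matched, len(epi_members)


# Made-up epidemiological cluster of six sequences.
EPI_CLUSTER = ["a1", "a2", "a3", "a4", "a5", "a6"]

# Hypothetical output of a permissive method: all six share one cluster.
PERMISSIVE = {seq: "c1" for seq in EPI_CLUSTER}

# Hypothetical output of a stricter method: the cluster is split and one
# sequence is left non-clustered.
STRICT = {"a1": "c1", "a2": "c1", "a3": "c1", "a4": "c2", "a5": "c2", "a6": None}

for name, assignment in (("permissive", PERMISSIVE), ("strict", STRICT)):
    matched, total = cluster_sensitivity(EPI_CLUSTER, assignment)
    print(f"{name}: {matched}/{total} ({100 * matched / total:.0f}%)")
```

Counting only the best-matching genomic cluster penalises methods that split a known outbreak into several small clusters, which is the behaviour the comparison above is concerned with.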
[...] 'max_clade' scored the highest entropy, reflecting the more even distribution of the data into smaller clusters, though as shown, this can be at the cost of over-clustering.

We used patristic distance from phylogenetic trees as the measure for genetic divergence in our method to utilize the full information available in the sequence [...] SARS-CoV-2 trees, where many terminal branches are supported by low numbers of mutations, has been explored previously [21]. It was shown that variation in tree topology, which in turn will alter pairwise distances between tips, was driven by the sample set of sequences used to construct the tree, which changes through time. While this will impact the stability of any method that uses patristic distance to inform clustering, we have shown that our approach reduces this instability in genomic clustering.

[...] transmission is occurring within the sampling jurisdiction. Therefore, using only genomic divergence derived from a given phylogeny is unlikely to identify well-separated SARS-CoV-2 transmission clusters. Additional epidemiological data can be used to refine large clusters found using our genomic clustering approach.
For example, including information such as common exposures and contact tracing data may divide large clusters into operational units with public health relevance. One limitation of our study is that we do not have exposure, contact or location information to explore this application.

While COVID-19 remains at pandemic levels with high case numbers in many regions globally, it is anticipated that there will be a shift to endemicity characterized by persistent, lower levels of the disease interspersed with seasonal or occasional outbreaks [24]. In that context, it is likely that the viral population will have smaller and better-separated clusters. We suggest that the method presented here for clustering can be effectively utilized in both contexts.

Cluster G (P.1) (N = 6) 6/6 (100%) 3/6 (50%) 3/6 (50%) 5/6 (83%)
A Distinct Phylogenetic Cluster of Indian Severe Acute Respiratory Syndrome Coronavirus 2 Isolates
Low genetic diversity may be an Achilles heel of SARS-CoV-2
Molecular epidemiology surveillance of SARS-CoV-2: Mutations and genetic diversity one year after emerging
Establishment and lineage dynamics of the SARS-CoV-2 epidemic in the UK. Science
Exponential growth, high prevalence of SARS-CoV-2, and vaccine effectiveness associated with the Delta variant
Impacts and shortcomings of genetic clustering methods for infectious disease outbreaks
22. Cecco, L. Canada ski resort linked to largest outbreak of P1 Covid

The authors declare they have no competing interests.

Sequences in the largest five clusters found by each method are coloured, with those in the largest cluster in red, followed by green, blue, yellow and pink. All other clustered sequences are coloured grey, and non-clustered sequences are in white.

Figure S2. The pairwise patristic distance between P.1 sequences in clusters identified by each clustering approach.

Figure S3. The pairwise probability of linking two sequences by SNP distance and difference in collection date using the beta coefficients selected in this study (β0 = 3, β1 = -1.9736 x 10^-4, and β2 = 7.5 x 10^-2). Patristic distance has been converted to SNP distance by multiplying patristic distance by the genome length for easier interpretation of pairwise sequence distance.
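The conversion between patristic and SNP distance mentioned in the Figure S3 caption can be sketched in a few lines of Python. The sketch assumes that tree branch lengths are in substitutions per site and uses the 29,903-nucleotide length of the Wuhan-Hu-1 reference genome; neither the units nor the genome length used in the study is stated in the text, and the function names are hypothetical.

```python
# Assumption: length of the Wuhan-Hu-1 reference genome in nucleotides.
# The genome length actually used in the study is not stated in the text.
GENOME_LENGTH = 29_903


def patristic_to_snp(patristic_dist, genome_length=GENOME_LENGTH):
    """Convert substitutions per site into an approximate SNP count."""
    return patristic_dist * genome_length


def snp_to_patristic(snp_dist, genome_length=GENOME_LENGTH):
    """Inverse conversion: SNP count to substitutions per site."""
    return snp_dist / genome_length


# A patristic distance of 1e-4 substitutions per site is roughly 3 SNPs.
print(round(patristic_to_snp(1e-4), 2))
```

Expressed this way, a single SNP corresponds to a patristic distance of about 3.3 x 10^-5, which gives a sense of how small the distances separating closely related SARS-CoV-2 sequences are.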