key: cord-0984183-mrwnhpcl
authors: Banu, Sofia; Jolly, Bani; Mukherjee, Payel; Singh, Priya; Khan, Shagufta; Zaveri, Lamuk; Shambhavi, Sakshi; Gaur, Namami; Reddy, Shashikala; Kaveri, K; Srinivasan, Sivasubramanian; Gopal, Dhinakar Raj; Siva, Archana Bharadwaj; Thangaraj, Kumarasamy; Tallapaka, Karthik Bharadwaj; Mishra, Rakesh K; Scaria, Vinod; Sowpati, Divya Tej
title: A distinct phylogenetic cluster of Indian SARS-CoV-2 isolates
date: 2020-09-18
journal: Open Forum Infect Dis
DOI: 10.1093/ofid/ofaa434
sha: ac5e6f56be1400b6d8db42cea1e59a36cf3520db
doc_id: 984183
cord_uid: mrwnhpcl

MOTIVATION: From an isolated epidemic, COVID-19 has now emerged as a global pandemic. The availability of genomes in the public domain following the epidemic provides a unique opportunity to understand the evolution and spread of the SARS-CoV-2 virus across the globe.

RESULTS: We performed whole-genome sequencing of 303 Indian isolates and analyzed them in the context of publicly available data from India. We describe a distinct phylogenetic cluster (Clade I/A3i) of SARS-CoV-2 genomes from India, which encompasses 22% of all genomes deposited in the public domain from India. Globally, approximately 2% of genomes, which to date could not be mapped to any distinct known cluster, fall in this clade.

CONCLUSIONS: The cluster is characterized by a core set of 4 genetic variants and has a nucleotide substitution rate of 1.1 × 10^-3 variants per site per year, lower than that of the prevalent A2a cluster. Epidemiological assessments suggest that the common ancestor emerged at the end of January 2020 and possibly resulted in an outbreak followed by countrywide spread. To the best of our knowledge, this is the first comprehensive study characterizing this cluster of SARS-CoV-2 in India.
Since the emergence of the outbreak in the Chinese city of Wuhan in late 2019, the novel Coronavirus disease has spread widely to become a global pandemic, with nearly twenty-five million individuals infected worldwide and more than 800,000 deaths. 1 The causative virus, SARS-CoV-2, is a member of the genus Betacoronavirus. During its transmission, the virus has differentiated into at least 10 clades globally and is continuously evolving. 2 This has implications for genetic epidemiology, surveillance, contact tracing and the development of long-term strategies for mitigation of this disease. 3

The recent availability of whole-genome sequences of SARS-CoV-2 from across the world, deposited in public databases, provides an unprecedented opportunity to understand the dynamics and evolution of the pathogen. The availability of genomic data in a public repository like GISAID 4 also provides wider access to the resources and enables researchers across the globe to address pertinent hypotheses. Likewise, this gave us a unique opportunity to study the introduction, evolution, and spread of the virus in India in the context of the clades circulating globally.

In this manuscript, we report the sequences of SARS-CoV-2 isolates predominantly sampled from the states of Telangana and Tamil Nadu. Further, we systematically analysed the phylogenetic clusters of genomes from India and characterised a unique cluster of sequences (Clade I/A3i), which could not be classified into any of the previously annotated global clades. Isolates forming this cluster were predominant in a number of states and characterized by a shared set of four genetic variants. The cluster potentially arose from a single outbreak followed by a rapid spread across the country.
To the best of our knowledge, this is the first comprehensive report of the novel and predominant cluster of sequences from India, and it suggests the distribution of the cluster beyond India in many countries in South Asia, Oceania and America.

Samples were collected and processed as per the guidelines of the Institutional Ethics Committee. RNA was isolated from nasopharyngeal or oropharyngeal swabs collected in viral transport media as explained in Supplementary Methods. Purified RNA was sequenced using either a shotgun approach or the ARTIC v3 protocol, as detailed in Supplementary Methods. 5 Quality control of the FASTQ files was performed using FastQC v0.11.7, and adaptors and poor-quality bases were trimmed using Trimmomatic. 6, 7 Reads were aligned to the reference genome MN908947.3 using HISAT2. 8 The consensus sequence was derived from the BAM file using seqtk and bcftools. 9 The samtools depth command was used to calculate the coverage across the genome. 10 The sequences were deposited in GISAID with accessions detailed in Supplementary Data 1.

The datasets of Indian SARS-CoV-2 genomes deposited in GISAID (until 7th August 2020) were used for the analysis. In addition, 10 high-quality genomes from each of the 10 clades annotated by Nextstrain were retrieved from GISAID and used in the analysis. The datasets and acknowledgements are listed in Supplementary Data 2. We considered only high quality genomes for evaluation of the nucleotide substitution rates, molecular clock, and phylogenetic clustering, as these would be sensitive to the quality of Wuhan-Hu-1 genome (NC_045512) was used as reference wherever applicable.

The variants were also evaluated for their functional consequences using SIFT. 13 A SIFT score of 0.0 to 0.05 was interpreted as indicating a deleterious effect. The functional effects of protein variants identified in the clades were assessed using the PROVEAN web server, using a default threshold value of -2.5.
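The two score cutoffs stated above can be written down as a small classifier. The following is a minimal sketch, not the authors' pipeline: the `VariantScores` container and the function names are hypothetical, the example scores are made up, and the code assumes that SIFT and PROVEAN scores have already been obtained from the respective tools. It reads a SIFT score between 0.0 and 0.05 as deleterious, and a PROVEAN score at or below the default threshold of -2.5 as deleterious.

```python
from dataclasses import dataclass

# Thresholds taken from the text above.
SIFT_DELETERIOUS_MAX = 0.05   # SIFT scores of 0.0 to 0.05 are read as deleterious
PROVEAN_THRESHOLD = -2.5      # PROVEAN default threshold


@dataclass(frozen=True)
class VariantScores:
    """Hypothetical container for precomputed scores of one protein variant."""
    name: str
    sift: float
    provean: float


def sift_is_deleterious(score: float) -> bool:
    """Return True when a SIFT score falls in the deleterious range."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"SIFT scores lie in [0, 1], got {score}")
    return score <= SIFT_DELETERIOUS_MAX


def provean_is_deleterious(score: float) -> bool:
    """Return True when a PROVEAN score is at or below the threshold."""
    return score <= PROVEAN_THRESHOLD


def summarize(variant: VariantScores) -> dict:
    """Combine both calls into one record for a single variant."""
    return {
        "variant": variant.name,
        "sift_deleterious": sift_is_deleterious(variant.sift),
        "provean_deleterious": provean_is_deleterious(variant.provean),
    }


if __name__ == "__main__":
    # Made-up scores, for illustration only.
    example = VariantScores(name="hypothetical_variant", sift=0.02, provean=-3.1)
    print(summarize(example))
```

Keeping the two calls separate, rather than merging them into a single verdict, makes disagreements between the predictors visible.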
14 Additionally, PhyloP conservation scores and base-wise GERP Rejected Substitutions scores (RS scores) for the variants were computed. 15, 16 Sites having positive PhyloP scores were predicted to be conserved, while positive GERP scores were considered indicative of a site under evolutionary constraint. The variants were also checked for overlaps with immune epitope predictions as given on the UCSC Genome Browser for SARS-CoV-2.

The samples sequenced encompass 303 genomes in total, mostly collected from the states of Telangana and Tamil Nadu. The age of the patients ranged from 1.5 to 80 years, with >80% (275 out of 303) within the age bracket of 20 to 60 years (Figure S1A). A total of 294 samples were sequenced using an amplicon-based approach with a target of ~2 million paired-end reads per sample. We could achieve an average coverage of >1000x in all cases, with a

The second-largest cluster consisted of 160 genomes (11.6%). This cluster of sequences could not be classified into any of the 10 clades defined by Nextstrain, and did not share the nucleotide compositions that define any of the 10 clades. 17 This cluster was found to have diverged from the A1a and A3 clades, and most, but not all, of the sequences shared a variant (L3606F in ORF1a) with members of the A3 and A1a clades. We propose to call this the A3i clade in cognisance of this fact. To avoid potential conflict with the nomenclature followed by Nextstrain, we define this cluster of sequences as Clade I/A3i, for its unique occurrence as a dominant cluster amongst SARS-CoV-2 genome sequences from India, and also because this clade is largely formed by sequences from India (Figure S2, S3).

The other clusters encompassed the B4, A3, A1a, B, and B1 clades, with 52 and 17 genomes falling into the clusters A3 and B4 respectively, and clades A1a, B and B1 having one genome each.
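The idea of checking a genome against the variants that define each clade, as used above to conclude that the cluster matched none of the annotated clades, can be sketched as follows. This is an illustration under stated assumptions and does not reproduce Nextstrain's own clade assignment. The function name is hypothetical, `CLADE_X` is a placeholder with invented variants, and the I/A3i entry is deliberately partial: it lists only the two variants named in this text (C6312A and C13730T), not the full set of four.

```python
from typing import Dict, FrozenSet, Iterable, List


def matching_clades(
    genome_variants: Iterable[str],
    clade_definitions: Dict[str, FrozenSet[str]],
) -> List[str]:
    """Return the names of all clades whose defining variants are all present.

    An empty list means the genome matches none of the supplied definitions.
    """
    observed = set(genome_variants)
    return sorted(
        name
        for name, defining in clade_definitions.items()
        # Skip empty definitions, which would otherwise match every genome.
        if defining and defining <= observed
    )


# Illustrative definitions only. CLADE_X and its variants are invented;
# the I/A3i entry holds just the two variants named in the text.
EXAMPLE_DEFINITIONS: Dict[str, FrozenSet[str]] = {
    "CLADE_X (placeholder)": frozenset({"C100T", "G200A"}),
    "I/A3i (partial)": frozenset({"C6312A", "C13730T"}),
}

if __name__ == "__main__":
    genome = ["C6312A", "C13730T", "G11083T"]  # made-up variant list
    print(matching_clades(genome, EXAMPLE_DEFINITIONS))
```

Returning every matching clade, instead of forcing a single answer, leaves genomes that fit no definition clearly identifiable as an empty result.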
Of these, 23 were sampled from a date earlier than the earliest sample of this cluster from India and were from USA, Canada, Australia, Thailand, Saudi Arabia, Taiwan, Singapore, Malaysia, Japan, and Brazil (Figure 3B).

Mutation rates were calculated for the Indian sequences using BEAST, with the WH1 genome as the root. Our analysis suggests that the substitution rate is 1.76 × 10^-3 (95% HPD 1.57 × 10^-3 to 1.99 × 10^-3) per site per year for all Indian SARS-CoV-2 genomes taken together. This is consistent with previous estimates. 18 The substitution rate was also computed for the individual clades. The gene-wise substitution rates were similarly calculated for the major clusters. The analysis suggests that the I/A3i clade has a nucleotide substitution rate of 1.

Two of the variants, C6312A in ORF1a and C13730T in ORF1b, also mapped to immune epitope predictions (HLA-A0201 binding peptides) from NetMHC 4.0, as listed on the UCSC Genome Browser for SARS-CoV-2. 21 The potential consequences of the variants for the immune response could not be ascertained.

The presence of a short tree of Clade I/A3i with divergence from a single point suggests a single point of introduction. 22 The single point of divergence also suggests that the origin and spread of the cluster were possibly from a single outbreak (Figure 3). The clustering of samples around January 2020 suggests a rapid spread spanning multiple regions across the country. The first sequence from the cluster in India was GMC-KN443/2020 (Accession ID

Genomic evolution coupled with the appropriate tools like genome sequencing provides a unique opportunity to understand the spread and evolution of pathogens.
23, 24 The emergence of COVID-19 as a global pandemic and the availability of open data for SARS-CoV-2 genomes from across the globe, facilitated by genomic databases like GenBank and GISAID, have opened up new opportunities to understand the pathogen and its spread and evolution at an unprecedented rate. 4, 25 Whole-genome sequencing of SARS-CoV-2 has also been extensively used in understanding epidemics at the macro as well as micro level, for example within hospitals. 26

In this report, we describe a distinct cluster of sequences from genomes of SARS-CoV-2 sequenced and deposited from multiple laboratories across India, which we classify as the I/A3i clade. This distinct cluster could not be classified into any of the 10 clade annotations described by Nextstrain, and was characterized by a unique combination of four variants shared by over 95% of the isolates falling in the cluster. The cluster was predominantly found in genomes from India; though additional members could also be found among genomes deposited from other countries, they form a minor proportion of the genomes from the respective countries. As per Nextstrain, the Indian genomes constituted over 30% of the global genomes for this cluster.

In-depth analysis of the genome cluster suggests a rate of nucleotide substitutions comparable with other predominant clades, though a gene-wise estimate of substitution suggests a distinct mode of evolution, driven by the Nucleocapsid (N) and Membrane (M) genes and sparing the Spike (S) gene, in contrast to the predominant diversity in the Spike (S) gene in the A2a clade, the globally predominant clade. 27 However, it has not escaped our attention that host genetic factors could modulate the evolution of the virus genome, and without large-scale host-genomic studies, the causal relationships cannot be conclusively established.
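To make the per-site substitution rates discussed in this paper easier to compare, they can be converted into expected substitutions per genome. The sketch below is a back-of-the-envelope illustration, not part of the authors' BEAST analysis. It assumes a genome length of 29,903 nucleotides (the length of the MN908947.3 reference, a value not stated in this text) and a constant rate over time; the function names are hypothetical.

```python
# Assumption: length of the MN908947.3 (Wuhan-Hu-1) reference genome in nucleotides.
REFERENCE_GENOME_LENGTH_NT = 29_903

# Rates reported in this paper, in substitutions per site per year.
RATE_ALL_INDIAN_GENOMES = 1.76e-3
RATE_CLADE_I_A3I = 1.1e-3


def substitutions_per_genome_per_year(
    rate_per_site_per_year: float,
    genome_length_nt: int = REFERENCE_GENOME_LENGTH_NT,
) -> float:
    """Scale a per-site yearly rate to the whole genome."""
    if rate_per_site_per_year < 0 or genome_length_nt <= 0:
        raise ValueError("rate must be non-negative and genome length positive")
    return rate_per_site_per_year * genome_length_nt


def expected_substitutions(
    rate_per_site_per_year: float,
    days: float,
    genome_length_nt: int = REFERENCE_GENOME_LENGTH_NT,
) -> float:
    """Expected substitutions per genome over a period, assuming a constant rate."""
    yearly = substitutions_per_genome_per_year(rate_per_site_per_year, genome_length_nt)
    return yearly * days / 365.0


if __name__ == "__main__":
    for label, rate in [
        ("all Indian genomes", RATE_ALL_INDIAN_GENOMES),
        ("Clade I/A3i", RATE_CLADE_I_A3I),
    ]:
        per_year = substitutions_per_genome_per_year(rate)
        print(f"{label}: {per_year:.1f} substitutions per genome per year")
```

Under these assumptions, the two reported rates correspond to roughly 53 and 33 substitutions per genome per year, respectively.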
The cluster suggests a potential single introduction around February, followed by a country-

Written consent was obtained from the patients wherever applicable. The design and implementation of this work were approved by a local ethical committee. The authors declare no conflict of interest.

COVID-19: Epidemiology, Evolution, and Cross-Disciplinary Perspectives
Genotyping coronavirus SARS-CoV-2: methods and implications
Global initiative on sharing all influenza data - from vision to reality
nCoV-2019 sequencing protocol v2 v1 (protocols.io.bdp7i5rn)
Babraham Bioinformatics - FastQC: A Quality Control tool for High Throughput Sequence Data
Trimmomatic: a flexible trimmer for Illumina sequence data
Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype
A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data
The Sequence Alignment/Map format and SAMtools
Computational Analysis and Phylogenetic clustering of SARS-nCov-2 genomes U1Zq0/edit?usp=sharing&usp=embed_facebook
Nextstrain: real-time tracking of pathogen evolution
SIFT Missense Predictions for Genomes
PROVEAN web server: a tool to predict the functional effect of amino acid substitutions and indels
Detection of nonneutral substitution rates on mammalian phylogenies
Distribution and Intensity of Constraint in Mammalian Genomic Sequence
Temporal signal and the evolutionary rate of 2019 n-CoV using 47 genomes collected by
Clinical features of patients infected with 2019 novel coronavirus in Wuhan, China. The Lancet
Structure of the RNA-dependent RNA polymerase from COVID-19 virus
NetMHCpan-4.0: Improved Peptide-MHC Class I Interaction Predictions Integrating Eluted Ligand and Peptide Binding Affinity Data
Tracking virus outbreaks in the twenty-first century
The Establishment of Reference Sequence for SARS-CoV-2 and Variation Analysis
On the origin and continuing evolution of SARS-CoV-2
Rapid implementation of real-time SARS-CoV-2 sequencing to investigate healthcare-associated COVID-19 infections
Emergence of Drift Variants That May Affect COVID-19 Vaccine Development and Antibody Treatment
EMBOSS: the European Molecular Biology Open Software Suite
Issues with SARS-CoV-2 sequencing data
Recent developments in the MAFFT multiple sequence alignment program
IQ-TREE: a fast and effective stochastic algorithm for estimating maximum-likelihood phylogenies
Bayesian phylogenetic and phylodynamic data integration using BEAST 1.10
Phylodynamic estimation of incidence and prevalence of novel coronavirus (nCoV) infections through time
Posterior Summarization in