key: cord-0813475-sqyk8ois
authors: Rani, Pallavali Roja; Imran, Mohamed; Lakshmi, J. Vijaya; Jolly, Bani; Afsar, S.; Jain, Abhinav; Divakar, Mohit Kumar; Suresh, Panyam; Sharma, Disha; Rajesh, Nambi; Bhoyar, Rahul C.; Ankaiah, Dasari; Kumari, Sanaga Shanthi; Ranjan, Gyan; Lavanya, Valluri Anitha; Rophina, Mercy; Umadevi, S.; Sehgal, Paras; Devi, Avula Renuka; Surekha, A.; Sekhar, Pulala Chandra; Hymavathy, Rajamadugu; Vanaja, P.R.; Scaria, Vinod; Sivasubbu, Sridhar
title: Insights from Genomes and Genetic Epidemiology of SARS-CoV-2 isolates from the state of Andhra Pradesh
date: 2021-01-22
journal: bioRxiv
DOI: 10.1101/2021.01.22.427775
sha: affa835793f25289633a434fc3c7fddda5445c3e
doc_id: 813475
cord_uid: sqyk8ois

Coronavirus disease (COVID-19) emerged from a city in China and has now spread as a global pandemic affecting millions of individuals. The causative agent, SARS-CoV-2, is being extensively studied in terms of its genetic epidemiology using genomic approaches. Andhra Pradesh is one of the major states of India and has the third-largest number of COVID-19 cases, but the genetic epidemiology of the virus in the state remains poorly understood. In this study, we have sequenced 293 SARS-CoV-2 genome isolates from Andhra Pradesh with a mean coverage of 13,324X. We identified 564 high-quality SARS-CoV-2 variants, out of which 15 are novel. A total of 18 variants mapped to RT-PCR primer/probe sites, and 4 variants are known to be associated with an increase in infectivity. Phylogenetic analysis of the genomes revealed that the SARS-CoV-2 circulating in Andhra Pradesh clustered predominantly under the clade A2a (94%), while 6% fall under the I/A3i clade, a clade previously reported to be present in large numbers in India. To the best of our knowledge, this is the most comprehensive genetic epidemiological analysis performed for the state of Andhra Pradesh.

The emergence of COVID-19 as a global pandemic has necessitated approaches to understand the emergence and evolution of SARS-CoV-2.
Genome sequencing has emerged as one of the widely used approaches to understand the genetic epidemiology of SARS-CoV-2 1. The availability of the complete genome of the pathogen early in the epidemic, and the subsequent application of genomics on a global and unprecedented scale, has provided an immense opportunity to trace the introduction, spread and genetic evolution of SARS-CoV-2 across the globe 2. India is now a major country affected by COVID-19, with over 10 million people affected since the initial introduction of SARS-CoV-2 into the country in 2020 and subsequent introductions through travellers across major cities. The affected regions include states with large populations and many air travellers, such as Andhra Pradesh, which has a population of 49 million people and an estimated 1.5 to 2 million people in a diaspora spread across the world. While a number of genomes have been sequenced from different states in India 3, there is a paucity of genomic data and genetic epidemiology of SARS-CoV-2 isolates from the state of Andhra Pradesh, which motivated us to study the genomes from this state in detail. In the present study, we report a total of 293 SARS-CoV-2 genomes from the state of Andhra Pradesh. To the best of our knowledge, this is the first comprehensive report of the genetic epidemiology and evolution of SARS-CoV-2 from the state of Andhra Pradesh.

The study is in compliance with relevant laws and institutional guidelines, is in accordance with the ethical standards of the Declaration of Helsinki, and was approved by the Institutional Human Ethics Committee (RC.No.03/IHC/kmcknl/2020, dated 03/08/2020). Patient consent was waived by the ethics committee.

RNA samples isolated from nasopharyngeal/oropharyngeal swabs of patients from a tertiary care teaching hospital (Kurnool Medical College) were used in the study. Two protocols for RNA isolation were employed for the clinical samples.
The first protocol used GenoSens (SARS-CoV-2) PCR Viral RNA extraction reagents, and samples were processed as per the instructions provided by the manufacturer. The second protocol involved processing the samples

A total of 143,726 samples were tested between 21 April and 5 August, and 10,073 samples were identified as COVID-19 positive cases. Of these, 293 samples collected between 27 June and 3 August 2020 were selected for viral genome sequencing in the present study. Library preparation and sequencing were performed as per the COVIDSeq protocol (Illumina, USA) as described in a previous study 4. The samples were sequenced in technical replicates. The raw binary sequence files in bcl format were demultiplexed and converted to FASTQ files.

We followed a previously published protocol for data analysis 5. Briefly, raw FASTQ files underwent quality control (average Phred quality score of Q30 and read length of 30 bp) with adapter trimming using Trimmomatic (version 0.39) 6. The Wuhan-Hu-1 (NC_045512.2) genome was used as the reference. Replicate files were independently aligned and merged. Genomes with ≥ 99% coverage and ≤ 5% unassigned nucleotides were processed for variant calls. The variants were annotated by ANNOVAR 7 using custom database tables for annotating the SARS-CoV-2 genome. Table 1 ).

Phylogenetic analysis was performed as described in a previous protocol using the genomes sequenced in this study and the dataset of 3,058 genomes from India deposited in the GISAID database (Supplementary Table 2) 2,10. Briefly, the consensus sequences were aligned to the Wuhan-Hu-1 genome and the phylogenetic tree was constructed using the Augur implementation of Nextstrain 11,12. Genomes with ambiguous collection dates or more than 5% Ns were excluded from the analysis. The tree was loaded in Auspice for visualisation. Lineages were assigned to the genomes using the Phylogenetic Assignment of Named Global Outbreak LINeages (PANGOLIN) package 13.
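The genome-level filter described above (≥ 99% coverage and ≤ 5% unassigned nucleotides) can be illustrated with a minimal sketch. It assumes that coverage is measured as the fraction of reference positions with at least one aligned read and that unassigned nucleotides are counted as `N` characters in the consensus sequence; neither definition is spelled out in the text, and the function names and the `min_depth` parameter are hypothetical, not part of the authors' pipeline.

```python
from typing import Sequence


def genome_coverage(depths: Sequence[int], min_depth: int = 1) -> float:
    """Fraction of reference positions covered by at least `min_depth` reads."""
    if not depths:
        raise ValueError("depths must not be empty")
    covered = sum(1 for d in depths if d >= min_depth)
    return covered / len(depths)


def fraction_unassigned(consensus: str) -> float:
    """Fraction of consensus bases that are unassigned (N)."""
    if not consensus:
        raise ValueError("consensus must not be empty")
    return consensus.upper().count("N") / len(consensus)


def passes_genome_qc(
    depths: Sequence[int],
    consensus: str,
    min_coverage: float = 0.99,   # threshold stated in the text
    max_unassigned: float = 0.05,  # threshold stated in the text
) -> bool:
    """Apply both thresholds; a genome must satisfy both to be kept."""
    return (
        genome_coverage(depths) >= min_coverage
        and fraction_unassigned(consensus) <= max_unassigned
    )
```

In practice the per-position depths would come from the merged alignment of the technical replicates and the consensus from the same alignment; here both are plain Python inputs so that the thresholds can be checked in isolation.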
SARS-CoV-2 genetic variants with potential functional consequences, supported by experimental evidence or by computational predictions, were compiled from published studies and pre-prints. Variants filtered in this study were compared with this repository to assess their possible functional impacts.

The primer/probe sequences used in the molecular detection of SARS-CoV-2 were compiled from the literature and other sources 14. This compilation included a total of 132 primer and probe sequences. The primer and probe sequences were mapped to the Wuhan-Hu-1 reference genome using BLAST to obtain their genomic loci. Variants filtered in this study were mapped to these genomic loci using bespoke scripts. Table 4 ).

The distribution of the variants in the genomes and their annotations are summarised in Figure 1. A total of 145 variants were annotated as deleterious by SIFT 15. In addition, a total of 18 genetic variants mapped to diagnostic RT-PCR primer/probe sites (Supplementary Table 5). A total of 42 and 421 variants were predicted to map to potential B and T cell epitopes, respectively. A further 11 variants spanning potential homoplasic, hypermutable and sequencing error-prone regions of the genome were filtered. The detailed annotations of the variants are summarised in Supplementary Table 6.

Phylogenetic analysis was done for the dataset of 3,033 SARS-CoV-2 genomes from India, including 276 genomes from this study, with the genome Wuhan/WH01 (EPI_ISL_406798) as the root. Out of 276 genomes, 260 genomes (94%) clustered under the clade A2a while 16 (6%) were under the clade I/A3i 16. The phylogenetic reconstruction of the dataset of Indian genomes and the distribution of clades is summarised in Figure 2A previously reported for genomes from Gujarat 18. The cluster is characterised by an S194L (C28854T) mutation in the nucleocapsid protein of the virus, a mutation which was found to be significantly associated with disease mortality in Gujarat.
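The mapping of variants to primer/probe loci described above amounts to an interval-overlap check once BLAST has placed each oligonucleotide on the reference genome. The sketch below is not the authors' bespoke script; it assumes 1-based, inclusive genomic coordinates, and the `Locus` type, the function name and the example loci are hypothetical placeholders rather than real assay coordinates.

```python
from typing import Dict, Iterable, List, NamedTuple


class Locus(NamedTuple):
    """A primer or probe placed on the reference genome (1-based, inclusive)."""
    name: str
    start: int
    end: int


def variants_in_loci(
    variant_positions: Iterable[int], loci: Iterable[Locus]
) -> Dict[int, List[str]]:
    """Return, for each variant position, the names of the loci containing it.

    Positions that fall outside every locus are omitted from the result.
    """
    loci = list(loci)
    hits: Dict[int, List[str]] = {}
    for pos in variant_positions:
        names = [locus.name for locus in loci if locus.start <= pos <= locus.end]
        if names:
            hits[pos] = names
    return hits


# Hypothetical loci for illustration only; they are not real assay coordinates.
example_loci = [
    Locus("hypothetical_forward_primer", 28840, 28860),
    Locus("hypothetical_probe", 100, 120),
]

# 23403 and 28854 are variant positions mentioned in the text.
example_hits = variants_in_loci([23403, 28854], example_loci)
```

With 132 primer and probe sequences and 564 variants, the linear scan over loci is sufficient; an interval tree would only be worth the added complexity for much larger inputs.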
One genome from this study (CS0804) also forms a polytomy with other samples from the neighbouring state of Telangana in the phylogenetic tree of Indian genomes. This could be suggestive of multiple, simultaneous divergence events, although further data and analysis would be needed to confirm this hypothesis reliably.

The potential functional impact of the high-quality variants filtered in the study was assessed by mapping these variants back to a manually curated compilation of functionally relevant SARS-CoV-2 genetic variants. In our analysis, we were able to identify 4 genetic variants in the S gene which have been reported, through experimental validation, to be involved in increased infectivity. These mutations include 23403A>G (D614G) and three co-occurring mutations, 23403A>G+21575C>T (D614G+L5F), 23403A>G+24368G>T (D614G+D936Y) and 23403A>G+24378C>T (D614G+S939F), having a frequency of 94.20%, 0.725%, 5.435% and 0.362% respectively in the 276 genomes analysed in this study.

Our analysis using the COVIDSeq approach and downstream data analysis has provided detailed insights into the genetic epidemiology and evolution of SARS-CoV-2 isolates in the state of Andhra Pradesh. A total of 564 high-quality unique genetic variants were identified, out of which 15 variants are novel. Extensive analysis of the functional consequences of the filtered variants has provided insights into the impact of these genetic variants on current diagnostic practices. Phylogenetic analysis of the genomes highlights the potential shift in clade dominance from clade I/A3i to A2a in Andhra Pradesh, a trend also observed in the neighbouring state of Telangana. The lineages B.1.112 and B.1.104 were also reported for the first time from Indian genomes.
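The correspondence between the nucleotide positions and the amino-acid changes quoted above (for example 23403A>G and D614G) can be checked with simple coordinate arithmetic. The sketch below assumes that the S and N coding sequences start at positions 21563 and 28274 of NC_045512.2; these start coordinates are not given in the text and should be verified against the GenBank annotation. The function name is hypothetical.

```python
from typing import Tuple

# Assumed 1-based CDS start positions on NC_045512.2 (verify against GenBank).
CDS_START = {"S": 21563, "N": 28274}


def codon_of_position(genomic_pos: int, cds_start: int) -> Tuple[int, int]:
    """Map a 1-based genomic position to (codon number, position within codon).

    Both returned values are 1-based. The position must lie at or after the
    start of the coding sequence.
    """
    offset = genomic_pos - cds_start
    if offset < 0:
        raise ValueError("position lies upstream of the coding sequence")
    return offset // 3 + 1, offset % 3 + 1


# Variants named in the text, with the codon number implied by their labels.
reported = [
    ("S", 23403, 614),  # D614G
    ("S", 21575, 5),    # L5F
    ("S", 24368, 936),  # D936Y
    ("S", 24378, 939),  # S939F
    ("N", 28854, 194),  # S194L
]

for gene, pos, expected_codon in reported:
    codon, _ = codon_of_position(pos, CDS_START[gene])
    print(gene, pos, codon, codon == expected_codon)
```

Under the assumed start coordinates, every nucleotide position falls in the codon named by its amino-acid label, which is a quick consistency check on the variant annotations.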
In conclusion, our study highlights the utility of whole-genome sequencing to study the genetic landscape and evolution of SARS-CoV-2 isolates in major states like Andhra Pradesh and emphasises the use of such scalable technologies to gain better and timely insights into epidemics.

Table 6 Compilation of filtered unique genetic variants along with their corresponding annotations

1. A new coronavirus associated with human respiratory disease in China
2. Global initiative on sharing all influenza data - from vision to reality
3. Initial insights into the genetic epidemiology of SARS-CoV-2 isolates from Kerala suggest local spread from limited introductions. Cold Spring Harbor Laboratory
4. High throughput detection and genetic epidemiology of SARS-CoV-2 using COVIDSeq next generation sequencing. Cold Spring Harbor Laboratory
5. Computational Protocol for Assembly and Analysis of SARS-nCoV-2 Genomes. rr
6. Trimmomatic: a flexible trimmer for Illumina sequence data
7. ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data
8. EMBOSS: the European Molecular Biology Open Software Suite
9. SNP-sites: rapid efficient extraction of SNPs from multi-FASTA alignments
10. Computational Analysis and Phylogenetic clustering of SARS-nCov-2 genomes
11. Nextstrain: real-time tracking of pathogen evolution
12. TreeTime: Maximum-likelihood phylodynamic analysis
13. A dynamic nomenclature proposal for SARS-CoV-2 lineages to assist genomic epidemiology. Nature microbiology
14. Analysis of the potential impact of genomic variants in SARS-CoV-2 genomes from India on molecular diagnostic assays. Cold Spring Harbor Laboratory
15. SIFT missense predictions for genomes
16. A distinct phylogenetic cluster of Indian SARS-CoV-2 isolates. Open Forum Infect Dis
17. Integrated genomic view of SARS-CoV-2 in India
18. Genomic variations in SARS-CoV-2 genomes from Gujarat: Underlying role of variants in disease epidemiology. Cold Spring Harbor Laboratory

Authors acknowledge funding from CSIR India (MLP2005).
AJ, BJ, MD, and PS acknowledge research fellowships from CSIR-India. MI acknowledges a research fellowship from ICMR. DS acknowledges a research fellowship from Intel India. The funders had no role in the analysis of data, preparation of the manuscript or decision to publish.

Authors declare no conflict of interest.

The data that support the findings of this study are openly available at the NCBI Sequence Read Archive under Project ID PRJNA662193, with accession IDs from SAMN16707355 to SAMN16707555. The raw data for the remaining samples are available at the NCBI Sequence Read Archive under Project ID PRJNA655577.