key: cord-0300646-gmofbffn authors: Kumar, Sanjeet; Bansal, Kanika title: Cross-sectional genomic perspective of epidemic waves of SARS-CoV-2: a pan India study date: 2021-08-11 journal: bioRxiv DOI: 10.1101/2021.08.11.455899 sha: 4aa7342e2f6d8554b8bfc9a7953a1c8d12d31f91 doc_id: 300646 cord_uid: gmofbffn COVID-19 has posed unforeseen circumstances and throttled major economies worldwide. India has witnessed two waves affecting around 31 million people representing 16% of the cases globally. To date, the epidemic waves have not been comprehensively investigated to understand pandemic progress in India. In the present study, we aim for a cross-sectional analysis since its first incidence up to 26th July 2021. We have performed the pan Indian evolutionary study using 20,086 high-quality complete genomes of SARS-CoV-2. Based on the number of cases reported and mutation rates, we could divide the Indian epidemic into seven different phases. First, three phases constituting the pre-first wave had a very less average mutation rate (<11), which increased in the first wave to 17 and then doubled in the second wave (~34). In accordance with the mutation rate, variants of concern (alpha, beta, gamma and delta) and interest (eta and kappa) also started appearing in the first wave (1.5% of the genomes), which dominated the second (~96% of genomes) and post-second wave (100% of genomes) phases. Whole genome-based phylogeny could demarcate the post-first wave isolates from previous ones by the point of diversification leading to incidences of VOCs and VOIs in India. Nation-wide mutational analysis depicted more than 0.5 million events with four major mutations in ~97% of the total 20,086 genomes in the study. These included two mutations in coding (spike (D614G) and NSP 12b (P314L) of RNA dependent RNA polymerase), one silent mutation (NSP3 F106F) and one extragenic mutation (5’ UTR 241). Large scale genome-wide mutational analysis is crucial in expanding knowledge on evolution of deadly variants of SARS-CoV-2 and timely management of the pandemic. Coronavirus represents a large family of RNA viruses causing upper and lower respiratory tract infections to humans ranging from mild to lethal. Previously reported outbreaks of coronaviruses causing great public health threats include Severe Acute Respiratory Syndrome (SARS) and Middle East Respiratory Syndrome (MERS) (Peiris, Guan et al. 2004 , Memish, Cotten et al. 2014 ). In late December 2019, the ongoing outbreak caused by novel coronavirus epi-centered in Hubei province of People's Republic of China (Chen, Liu et al. 2020 , Wu, Zhao et al. 2020 . Patients were epidemiologically linked to a wet animal and seafood wholesale market in Wuhan (Bogoch, Watts et al. 2020 , Lu, Stratton et al. 2020 ). On the basis of phylogeny and taxonomic analysis Coronavirus Study Group of International Committee on Taxonomy of Viruses recognized this as a sister to SARS-CoV (Peiris, Guan et al. 2004 ) and named it as SARS-CoV-2 (Gorbalenya, Baker et al. 2020) . SARS-CoV-2 has the largest genome (26-4 to 31.7 kb) among all known RNA viruses with variable GC content ranging from 32% to 43% (Woo, Huang et al. 2010 ). On 30 th January 2020, a global health emergency was declared by the WHO Emergency Committee. Due to its substantial human to human transmissions, it has spread to many countries and till now it has affected more than 195 million people with more than 4 million casualties worldwide (Lu, Zhao et al. 2020 , Worldometer 2020 . First case of SARS-CoV-2 from India was reported in Kerala during 27-31 January, 2020 in individuals with travel history of Wuhan, China (Andrews, Areekal et al. 2020) . In order to contain further spread of SARS-CoV-2 strict actions were imposed including international travel restrictions from the affected countries. However, a surge of From 26th March to 11th May 2020 a nationwide lockdown was imposed to contain the nationwide spread. This was one of the most rigid lockdown restrictions in the world which resulted in a well-controlled infectivity rate in India (Maitra, Raghav et al. 2020 , Mitra, Misra et al. 2020 Second wave was then reported to end by June 2021. During both the waves, Delhi and Maharashtra were most badly affected states with several localized outbursts due to rampant community transmission. As of 26 th July 2021, India is reporting 30820 cases per day with total cases of 3,14,40,492 and the death toll has also reached 421,414. Yet, a nationwide third wave is also predicted which is concerning for the policy makers and public governance. We have witnessed generation of unprecedented genomic resources of SARS-CoV-2 worldwide such as by UK Consortium (Gorbalenya 2020) , African union (Salyer, Maeda et al. 2021) , Indian SARS-CoV-2 Genomic Consortia (INSACOG) (Maitra, Raghav et al. 2020 , Alai, Gujar et al. 2021 ) etc. GISAID (https://www.gisaid.org/), which is a global initiative for a public repository for the genomic data of SARS-COV-2 storage and analysis. Such a vast genomic resource has been investigated in detailed based on mutation in SARS-CoV-2 into various lineages (Rambaut, Holmes et al. 2020 ). On the basis of these studies, World Health Organisation have announced variants of concern (VOC) (alpha, beta, gamma and delta) and variants of interest (VOI) (eta, lota, kappa and lambda) (https://www.who.int/en/activities/tracking-SARS-CoV-2-variants/). These VOCs and VOIs are known to pose increased risk to the public health globally and will aid in monitoring the evolution of deadly variants worldwide. In order to understand the genetic diversity, transmission and cure in human, several large-scale genome-based studies have been conducted (Bajaj and Purohit 2020 , Helmy, Fawzy et al. 2020 , Phan 2020 , Salyer, Maeda et al. 2021 , Yadav, Nyayanit et al. 2021 . Pan India study of 1000 sequences across 10 states suggest the widespread presence of the several lineages of SARS-CoV-2 (Maitra, Raghav et al. 2020) . Pan India sero survey suggested average seropositivity to be 10.14% (among 10,427 subjects) (Naushin, Sardana et al. 2021) . Unfortunately, the spread of lineages across India and seropositivity rate is very complex due to the vast population and landmass. Genomic diversity of Indian isolates in comparison to the global lineages is supposed to evolve further which needs to be closely monitored (Alai, Gujar et al. 2021 ). Till date pan India genome-based studies are focused on evolution of SARS-CoV-2 only upto first wave (Maitra, Raghav et al. 2020 , Alai, Gujar et al. 2021 , Yadav, Nyayanit et al. 2021 This created a lacuna in understanding the evolution of deadly variants of SARS-CoV-2. Since the first epidemic wave there is an upsurge in public genomic resources of SARS-CoV-2 from India which is available from global GISAID initiative. Current scenario provides scope of a crosssectional genome-based monitoring of the deadly variants across India. In the present study, we have analysed 20,086 high quality genomes to understand the dominance of VOCs and VOIs in the second wave. Amongst 0.5 million mutations we could identify four major mutation events in more than 19,300 genomes in spike, RNA dependent RNA polymerase and extragenic 5'UTR. Our analysis depicted single nucleotide transitions as the major player in the evolution of SARS-CoV-2 in India. Pan India study based on the mutation can open a gate way to understand the hotspots of mutation in SARS-CoV-2. Since the first incidence of COVID-19 in India by the end of January 2020, the number of cases has increased abruptly twice, which we call first and second wave. In order to understand the peak and plateau of incidences, we have differentiated the time period from first incidence (January 2020) to 26th July 2021 in seven different phases (Figure 1 C). Here, phase I indicates the early days of infection i.e., introductory phase from January 2020 to 25th March 2020. Phase II refers to the nationwide lockdown from 26th March 2020 to 11th May 2020 which was imposed to contain the spread of COVID-19. Once the lockdown was relaxed the incidences of cases rose gradually, termed as pre-first wave period or phase III from 12th May 2020 to 31st June 2020. Phase IV or first wave of COVID-19 was demarcated for quite a long duration from 1st July 2020 to 31st December 2020. India had witnessed a peak incidence of 93,732 cases of COVID-19 on 17th September 2020 (https://www.worldometers.info/coronavirus/country/india/; https://www.covid19india.org/) during phase IV (Figure 1 A) . With the gradual decrease in number of cases across India, phase V was demarcated from 1st January 2021 to 15th March 2021 as post-first wave or pre-second wave. Once again major leap in incidences was observed resulting in second wave which we refer as phase VI from 16th March 2021 to 30th June 2021. Strikingly, the rise and fall of incidences during the second wave (phase VI) was very steep as compared to the first wave (phase IV). Currently India is undergoing phase VII with drastic reduction in overall incidences from 1st July 2021 up to now. India has witnessed two epidemic waves of SARS-CoV-2, the number of cases and genome resources have also increased accordingly. By the end of the first wave, India reported 6,525 high quality genomes which increased by another 13,531 by the end of the second wave and still continuing (Figure 1 B) Pan India phylogeny based on whole genome sequences (n=20,086) have revealed major lineages of SARS-CoV-2 in India (Table 1) In India second wave outbreak has been a record-buster with impulsive rise in infection from 0.7% to 1.06% of the total population within two months. During this time maximum per day cases reported were around 0.4 million which is way beyond than recorded worldwide. Such a widespread of virus imparts opportunities of mutation and provides fitness to its variants. Postfirst wave 83% of the genomes could be linked to the deadly variants (VOCs and VOIs) (Table 3 and Supplementary Table 2 ). Among them, delta, alpha and kappa were dominant in phases V, VI and VII (Figure 2 ). In addition to the confirmed cases, seven phases of the pandemic in India can also be distinguished on the basis of mutations detected in the rapidly evolving virus. For instance, the average mutation detected before the first wave was less than 11 which elevated to 17.2 and 33.6 in the first and second waves respectively (Figure 1 D) . Hence, average mutations were doubled in the second wave as compared to the first wave and is still in the rising trend. Strikingly, in accordance to the phylogeny, rise in mutations is directly correlated with the emergence of deadly variants (VOCs and VOIs) in India. For instance, based on genomic analysis, these deadly variants were reported during the first wave, yet cases due to them started alarmingly incriminating only after the first wave (Figure 1 E) . However, the second wave was dominated by these deadly viruses and the post-second wave is hauntingly related to these deadly variants only (this is based on just 30 genomes available till 26th July for phase VII). According to the pangolin lineages out of 20,086 stains used in the present study, 7421 were delta Further, nation-wide mutational analysis depicted more than 0.5 million mutations amongst the Indian SARS-CoV-2 population (Supplementary Table 3 ) from seven different phases (Supplementary Figure 2) . Out of which we have looked at the top twenty prevalent mutations (Table 3) . Here, top twenty mutations were basically found in spike, RNA dependent RNA polymerase, nucleocapsid, ORF3, ORF7 and extragenic regions (5' UTR and 3' UTR). Interestingly, there were four widespread mutations in ~97% of the genomes analysed essentially representing all over India since its emergence in the country. Two mutations in protein coding region i.e., D614G in spike and P314L in NSP 12b; one extragenic mutation at 241 position of 5'UTR and one silent mutation at F106 of NSP3 (Figure 3 , Table 3 ). Such prevalent nonsynonymous or silent mutations in spike protein and rdrp and 5' UTR may likely to improve pathogenicity of the virus, in evasion from host immune system and risk of human-to-human transmissions etc. We have considered 20,086 high quality genomes from India encompassing the country's length and breadth. In order to obtain a pan India phylogeny of SARS-CoV-2, we have used all high quality genomes (n=20,086) available by 26th July 2021 in the public repository of GISAID MSA obtained was used for the phylogenomic tree generation using fasttree v2.1.8 with double precision (Price, Dehal et al. 2010) with gamma time reversal method (gtr). Visualization of the phylogenomic tree was performed using a web server of iTOL v6 (https://itol.embl.de/) (Letunic and Bork 2019). The isolates were marked in accordance with their phase of isolation and pangolin lineage (Rambaut, Holmes et al. 2020) . In order to understand the rate of mutation across the genome procured for different phases (seven), we have first aligned all the sequences (n=20086) against NC 045512.2 (https://www.ncbi.nlm.nih.gov/genome/?term=NC_045512.2) (Wuhan-Hu-1) strain (reference strain) using nucmer v3.1 (Delcher, Phillippy et al. 2002) . Also, we have separately aligned all the strains from several phases against the reference sequence. In order to translate all the alignments scores into mutational events, we have implemented a well-documented method earlier described by Mecatelli and Giorgi (Mercatelli and Giorgi 2020) . This approach uses a gff3 annotation file and reference sequence of NC_045512.2 to extract the genomic coordinates of SARS-CoV-2 proteins. R library package seqinr (https://cran.rproject.org/web/packages/seqinr/index.html) and biostring package of bioconductor (https://bioconductor.org/packages/release/bioc/html/Biostrings.html) was used to get the list of mutational events in terms of nucleotide and protein. Frequency and rate of mutation per sample was also obtained for all samples and across each phase. Overall number of mutations, coordinates of mutations with respect to the reference strain were also calculated using same R script. The authors declare no competing interests. Authors whole heartedly acknowledge motivation and advice from Dr. Prabhu B. Patil -CSIR-Institute of Microbial Technology. We do gratefully acknowledge GISAID for sharing the genomic sequences in public domain and several contributors of SARS-CoV-2 genomic data. We would also like to convey our acknowledgement to Government of India for enriching the genomic resources through initiatives like INSACOG. Further, authors would like to acknowledge Ms. Anu Singh for tea time chit-chat over SARS-CoV-2. KB and SK have performed the data procurement and analysis. KB has conceived and drafted the manuscript by taking inputs from SK. Nil. All the metadata files generated in this study can be accessed through https://figshare.com/s/0a81433867e6e6df2cec. Distribution of variants of concern (VOC) and variants of interest (VOI) and other lineages defined in accordance with pangolin lineage. The total number of genomes included in each phase is marked in the center of the pie chart. Table 3 : Top twenty mutations amongst the pan Indian SARS-CoV-2 isolates. Supplementary Figure 1: A) . Total number of cases reported state wise in India until 26th July 2021. B). Genome sequence data available from each state from India until 26th July 2021. Overview of the mutation reports during seven different phases. Pan-India novel coronavirus SARS-CoV-2 genomics and global diversity analysis in spike protein First confirmed case of COVID-19 infection in India: A case report Understanding SARS-CoV-2: genetic diversity, transmission and cure in human Pneumonia of unknown aetiology in Wuhan, China: potential for international spread via commercial air travel RNA based mNGS approach identifies a novel human coronavirus from two individual pneumonia cases in 2019 Wuhan outbreak Fast algorithms for large-scale genome alignment and comparison Severe acute respiratory syndrome-related coronavirus: The species and its viruses-a statement of the Coronavirus Study Group An integrated national scale SARS-CoV-2 genomic surveillance network The COVID-19 pandemic: a comprehensive review of taxonomy, genetics, epidemiology, diagnosis, treatment, and control Interactive Tree Of Life (iTOL) v4: recent updates and new developments Outbreak of pneumonia of unknown etiology in Wuhan, China: The mystery and the miracle Genomic characterisation and epidemiology of 2019 novel coronavirus: implications for virus origins and receptor binding PAN-INDIA 1000 SARS-CoV-2 RNA genome sequencing reveals important insights into the outbreak Human infection with MERS coronavirus after exposure to infected camels, Saudi Arabia Geographic and genomic distribution of SARS-CoV-2 mutations COVID-19 pandemic in India: what lies ahead Parallelization of MAFFT for large-scale multiple sequence alignments Insights from a Pan India Sero-Epidemiological survey (Phenome-India Cohort) for SARS-CoV2 Severe acute respiratory syndrome Genetic diversity and evolution of SARS-CoV-2 FastTree 2-approximately maximum-likelihood trees for large alignments A dynamic nomenclature proposal for SARS-CoV-2 lineages to assist genomic epidemiology The first and second waves of the COVID-19 pandemic in Africa: a cross-sectional study Coronavirus genomics and bioinformatics analysis COVID-19 coronavirus pandemic A new coronavirus associated with human respiratory disease in China An Epidemiological Analysis of SARS-CoV-2 Genomic Sequences from Different Regions of India