key: cord-0280661-2wc7drn7 authors: Zhou, Hang-Yu; Cheng, Ye-Xiao; Xu, Lin; Li, Jia-Ying; Tao, Chen-Yue; Ji, Cheng-Yang; Han, Na; Yang, Rong; Li, Yaling; Wu, Aiping title: Genomic evidence for divergent co-infections of SARS-CoV-2 lineages date: 2021-09-04 journal: bioRxiv DOI: 10.1101/2021.09.03.458951 sha: b6988d6a42e9637c134c5d726e0016189f9f999c doc_id: 280661 cord_uid: 2wc7drn7
Recently, patients co-infected by two SARS-CoV-2 lineages have been sporadically reported. This raises concern because previous studies have demonstrated that co-infection may contribute to the recombination of RNA viruses and cause severe clinical symptoms. In this study, we estimated the compositional lineage(s), lineage preference, and frequency of co-infection events in the population from a large-scale genomic analysis of SARS-CoV-2 patients. The SARS-CoV-2 lineage(s) infecting each sample were recognized by assigning within-host site variations to lineage-defined feature variations using a hypergeometric distribution method. Of all 29,993 samples, 53 (~0.18%) co-infection events were identified. Apart from 52 co-infections with two SARS-CoV-2 lineages, one sample co-infected with three SARS-CoV-2 lineages was identified for the first time. As expected, the co-infection events mainly occurred in regions where two or more dominant SARS-CoV-2 lineages co-existed. However, co-infections of two sub-lineages within the Delta lineage were detected as well. Our results provide a useful reference framework for high-throughput detection of SARS-CoV-2 co-infection events in Next Generation Sequencing (NGS) data. Although low in average rate, co-infection events showed an increasing tendency with the increasing diversity of SARS-CoV-2. Considering the large number of SARS-CoV-2 infections globally, co-infected patients would be a non-negligible population. Thus, more clinical research on these patients is urgently needed.
Since it was first reported in late 2019, severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has rapidly developed into a global pandemic 1, 2. The widespread transmission and geographical isolation of SARS-CoV-2 have greatly increased its genetic diversity. As of Jul. 18th, 2021, thousands of lineages had been clearly defined by the Pangolin nomenclature 3. Viruses within a defined lineage often share several common mutations and have similar biological properties. Of all the identified lineages, four SARS-CoV-2 variants have been designated as "variants of concern (VoC)" by the World Health Organization (WHO). Among them, B.1.1.7 (Alpha, as named by WHO) was estimated to have >50% enhanced transmissibility 4, B.1.351 (Beta) and P.1 (Gamma) showed the capacity to evade inhibition by neutralizing antibodies 5, while B.1.617.2 (Delta) caused a great increase in infections in India and recently became the dominant epidemic strain in some countries 6, 7. Currently, re-infection with SARS-CoV-2 has been extensively discussed 8, 9. Meanwhile, increasing evidence indicates that co-infection events caused by different SARS-CoV-2 lineages have occasionally occurred 10-14. This phenomenon deserves more attention for at least two reasons. First, previous reports indicated that viral co-infection may cause severe clinical symptoms. For instance, human immunodeficiency virus (HIV) co-infection contributes to rapid disease progression 15 and increased viral load 16, and requires antiretroviral treatment effective against both HIV variants 17. Second, co-infection may contribute to SARS-CoV-2 recombination and accelerate the generation of recombinant viruses, since coronaviruses have relatively high recombination rates 18-20.
Thus, with the increasing diversity of SARS-CoV-2 and the co-existence of multiple regional lineages globally, it is important to clarify how often co-infection occurs in the population and what the exact compositional lineages of co-infection in individuals are. In theory, genomic evidence should be retained in the deep sequencing data if a patient has been co-infected with two or more SARS-CoV-2 lineages. Like other RNA viruses, the SARS-CoV-2 genomes identified in a patient form a quasi-species with many within-host variations 12, 21, 22. The identification of co-infecting lineages mainly relies on the presence of their lineage-defined feature mutations in the viral quasi-species. Recent studies have confirmed the high reliability of Illumina sequencing data in detecting within-host variations 23, and deep sequencing data generated by Illumina can be used for formal analysis 24, 25. Benefiting from the rapid worldwide accumulation and open sharing of SARS-CoV-2 genomes, the large-scale genomic dataset provides substantial support for detecting co-infection events even when they are very rare in the population. In this study, we collected and analyzed 29,993 paired-end deep-sequenced genomes generated on the Illumina platform from the National Center for Biotechnology Information (NCBI). All these samples had detailed metadata and were isolated in the USA from 1st Jan. 2021 to 31st Jul. 2021. By introducing a hypergeometric-distribution-based method, we decoded the within-host variations from these deep sequencing data and detected the compositional lineages of potential co-infection events. Of all 29,993 samples, we identified 53 (~0.18%) co-infection events of SARS-CoV-2. Apart from 52 samples with two co-infecting lineages, one sample was co-infected with three lineages. As expected, the co-infection events mainly occurred in regions where two or more dominant SARS-CoV-2 lineages co-circulated.
Overall, for the first time, we have captured genomic evidence for co-infection events in large-scale sequencing data and provided a robust method to identify the compositional lineages of SARS-CoV-2 in co-infected samples. Furthermore, the increasing number of co-infected samples may also raise serious concerns about possible viral recombination and decreased vaccine effectiveness.
When an individual has been infected by SARS-CoV-2 from a single lineage, most of the lineage-defined feature variations are expected to be detected at a similar level in deep sequencing data, and the frequency of each lineage-defined feature variation should be nearly 100% (Fig. 1). However, when an individual has been co-infected by two strains from different SARS-CoV-2 lineages, the situation is more complicated. Assuming that the co-infecting strains from two lineages propagate independently in a patient, three lines of evidence should be observable in the genomic sequencing data (Fig. 1). First, feature variations specific to the same lineage (A or B) have similar frequencies. Second, the sum of the mean frequencies of the feature variations specific to lineages A and B is nearly 100%. Third, the frequencies of the feature variations shared by lineages A and B are nearly 100% (Fig. 1). Based on these three lines of evidence, co-infection events can potentially be identified from genomic sequencing data. In this study, we introduced a hypergeometric-distribution-based method to decode the possible lineages in a sequencing sample (Fig. 2).
In total, 52 samples were clearly classified as co-infected by SARS-CoV-2 strains from two different lineages (Figs. 3C and S1). As shown in Figure 3C
Apart from co-infection by two lineages, we unexpectedly identified one sample co-infected with three lineages. As shown in Figure 4, the sample was collected in Connecticut, USA, on 17th May 2021. The three hypothesized lines of genomic evidence (Fig. 1) could be clearly observed in this sample (Fig. 4A). First, most lineage-specific feature variations of Alpha, Iota (B.1.526) and Gamma (P.1) could be identified at their own frequency levels. The Alpha lineage was estimated to account for ~55% of all strains, while Iota and Gamma accounted for ~25% and ~15%, respectively. Second, the three feature variations shared by Gamma and Alpha (Spike_N501Y, N_R203K and N_G204R) had frequencies of nearly 70%, almost equal to the sum of the mean frequencies of Alpha and Gamma. Third, the two feature variations shared by all three lineages (NSP12_P323L and Spike_D614G) had frequencies of nearly 100%. The detection of these three lineages was also consistent with the epidemiological patterns of SARS-CoV-2 lineages at the sampling location, Connecticut (Fig. 4B). The metadata of all the SARS-CoV-2 co-infected samples (Table 1)
Since 53 co-infected samples were obtained, we sought to determine whether the co-infecting SARS-CoV-2 lineages show lineage preference, by treating each pair of co-infecting lineages as a connection to build a comprehensive network (Fig. 5A). In the co-infection network,
For most of the SARS-CoV-2-positive samples, whether infected by one lineage or by multiple lineages, the pattern of mutations in the sequencing data fits well with their lineage-defined feature variations. In particular, we observed that the sum of the frequencies of lineage-unique variations was equal to the average frequency of their shared variations, demonstrating the co-existence of these lineages within the same sample. Moreover, the epidemiological background of the detected co-infected samples was highly consistent with the identified lineages, which co-circulated around the sampling locations. The consistency between the hypothesis (Fig. 1) and the observations (Figs. 3, 4 and S1) provides strong evidence for the detected co-infection events.
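As an illustration only, the three lines of evidence can be written as a small consistency check. The sketch below is not the authors' script: the function name, the tolerance values, and the individual frequencies in the example are assumptions, chosen to mimic the approximate levels reported for the three-lineage sample (Alpha ~55%, Iota ~25%, Gamma ~15%).

```python
from statistics import mean, pstdev


def check_coinfection_pattern(unique_freqs, shared_freqs, tol=0.10, max_sd=0.05):
    """Check the three frequency patterns expected under co-infection.

    unique_freqs: lineage -> frequencies (0-1) of its lineage-specific variations
    shared_freqs: tuple of lineages -> frequencies of variations shared by them
    tol and max_sd are hypothetical thresholds, not values from the paper.
    """
    # Mean frequency of the variations specific to each lineage.
    means = {lineage: mean(freqs) for lineage, freqs in unique_freqs.items()}
    # Evidence 1: variations specific to one lineage have similar frequencies.
    uniform = all(pstdev(freqs) <= max_sd for freqs in unique_freqs.values())
    # Evidence 2: the lineage-specific means add up to roughly 100%.
    sums_to_one = abs(sum(means.values()) - 1.0) <= tol
    # Evidence 3: variations shared by a group of lineages appear at roughly
    # the summed mean frequency of the lineages in that group.
    shared_consistent = all(
        abs(mean(freqs) - sum(means[lineage] for lineage in group)) <= tol
        for group, freqs in shared_freqs.items()
    )
    return {
        "means": means,
        "uniform": uniform,
        "sums_to_one": sums_to_one,
        "shared_consistent": shared_consistent,
    }


# Hypothetical per-variation frequencies around the reported lineage levels.
unique = {"Alpha": [0.54, 0.55, 0.56], "Iota": [0.24, 0.26], "Gamma": [0.14, 0.16]}
shared = {
    ("Alpha", "Gamma"): [0.69, 0.71],          # shared by Alpha and Gamma, ~70%
    ("Alpha", "Iota", "Gamma"): [0.99, 1.00],  # shared by all three, ~100%
}
result = check_coinfection_pattern(unique, shared)
print(result)
```

With these made-up inputs all three checks pass, whereas a sample whose lineage-specific variations sit at, say, 30% with no complementary lineage would fail the second check.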
Also, we believe the co-infection events are real rather than the result of contamination, for three reasons. First, the observed co-infection events show obvious lineage preference rather than a random distribution (Fig. 5A). Second, we found that co-infection events increased over time (Fig. 5B), which contradicts the expectation that contaminated samples would decrease as experience in SARS-CoV-2 sequencing accumulates. Third, co-infection events occurred at a relatively high rate (~0.18%) across all samples, which is unlikely to result from occasional sequencing contamination. One question is whether we can infer the sources of a co-infection event from its genomic characteristics. When we assigned variations to lineage(s), we found that there were always some undetermined variations. Our further analysis suggested that these undetermined variations could possibly be used to trace the origin of co-infection events. For instance, in a representative co-infected sample (SRR14391243) with two lineages (Fig. 3C)
However, during Jul. 2021, the dominant lineage changed from Alpha to Delta, and the co-infection center correspondingly switched from the Alpha lineage to the Delta lineage (Fig. 5A). To ensure the validity and reliability of the detected co-infection events, we set a series of stringent criteria (Fig. 2 and Methods). Apart from the 53 confirmed co-infected samples, 16 additional samples are potentially co-infected (Fig. S6) if the parameters are relaxed slightly. Therefore, we suppose that the co-infection rate of SARS-CoV-2 lineages in the population, ~0.18%, is an underestimate. If this co-infection rate is extrapolated to all SARS-CoV-2 infections in the USA or even the whole world, co-infected patients would be a non-negligible population.
Nevertheless, we cannot draw strong conclusions about how co-infecting SARS-CoV-2 lineages influence disease symptoms and the response to treatment, which requires more research in the future.
All 29,993 SRA runs in Project PRJNA716985 were collected from NCBI (https://www.ncbi.nlm.nih.gov). These samples were collected in the USA from Jan. 2021 to Jul. 2021 and sequenced on the Illumina platform. Samples in this project have complete metadata, including the collection date, region of isolation, and sex and age of the patients. To guarantee the accuracy of the identified intra-host amino acid variations, only paired-end sequenced samples were selected for further analysis. All the selected samples in the study met the above criteria. The collected samples were first converted into FASTQ files with sra-tools. Then the
The lineage-defined feature variations were defined as the lineage-specific signature variations shared by strains belonging to the same lineage. In general, these are taken to be the nonsynonymous mutations shared by >= 75% of the viral strains in a specific lineage (https://outbreak.info/situation-reports/methods#characteristic). However, given the rapid divergence of SARS-CoV-2, many sub-lineages have formed that share the same feature variations at the 75% level, so this level alone cannot distinguish viral strains belonging to similar lineages. Therefore, in this study, we further introduced the mutations shared by >= 10% of viruses to distinguish neighboring lineages with similar feature variations at the 75% level. In total, over 2.5 million SARS-CoV-2 consensus genomes were collected from GISAID 26, 27. All variations that caused nonsynonymous mutations were identified for each viral genome. The lineage of each virus was derived with the Pango nomenclature 3. A homemade Python script was applied to extract the mutations shared by >= 75% of all the viruses in one lineage as the 75% feature variations (FV-75).
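The threshold-based extraction of feature variations described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' homemade script: the function name, the input format (pairs of lineage and mutation set per consensus genome), and the toy mutation labels `m1` to `m3` are hypothetical.

```python
from collections import Counter, defaultdict


def feature_variations(genomes, threshold):
    """Return lineage -> set of mutations carried by >= threshold of its genomes.

    genomes: iterable of (lineage, mutations) pairs, one per consensus genome,
    where mutations is an iterable of nonsynonymous mutation labels.
    """
    counts = defaultdict(Counter)  # lineage -> mutation -> number of genomes
    totals = Counter()             # lineage -> number of genomes
    for lineage, mutations in genomes:
        totals[lineage] += 1
        counts[lineage].update(set(mutations))  # count each mutation once per genome
    return {
        lineage: {m for m, c in counts[lineage].items() if c / totals[lineage] >= threshold}
        for lineage in totals
    }


# Toy input: four genomes of a hypothetical lineage "A".
toy_genomes = [
    ("A", {"m1", "m2"}),
    ("A", {"m1", "m2"}),
    ("A", {"m1", "m2", "m3"}),
    ("A", {"m1"}),
]
fv75 = feature_variations(toy_genomes, 0.75)  # m1 (4/4) and m2 (3/4) qualify
fv10 = feature_variations(toy_genomes, 0.10)  # m3 (1/4) qualifies as well
print(fv75, fv10)
```

Running the same function at the two thresholds yields a stricter set and a more permissive set per lineage, which mirrors the idea of using the lower threshold to separate closely related lineages.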
Similarly, mutations shared by >= 10% of all the viruses in one lineage were extracted as the 10% feature variations (FV-10). To avoid overfitting, lineages with few viral genomes globally (<0.01% of all 2.5 million SARS-CoV-2 genomes, or < 250 genomes) were discarded.
Under the null hypothesis that each variation has an equal probability of being detected, the number of variations associated with a lineage that overlap with the set of variations detected in a sample follows a hypergeometric distribution. Hence, this process can be conducted using Fisher's exact test, which uses the hypergeometric distribution. First, all the intra-host variations in the NGS raw data were extracted with CLC Genomics Workbench, while the feature variations of each lineage were extracted from all the available SARS-CoV-2 consensus sequences in GISAID 26, 27. Second, the results were analyzed by a homemade Python script (https://github.com/wuaipinglab/SARS-CoV-2_co-infection). All the potential lineages in a sample were detected, with a p-value for each test. Third, a judgement process was performed to evaluate whether the variations assigned to a lineage have similar frequencies, and whether the summed mean frequencies of all lineage-defined feature variations are equal to ~100%. Finally, single-lineage infections, multi-lineage co-infections, and other situations were output as three separate files. The CLC workflow, parameters for the CLC modules, IDs of the screened samples, and the homemade Python script for identifying potential co-infection events are available online (https://github.com/wuaipinglab/SARS-CoV-2_co-infection).
Fig. 1 The co-infection pattern of SARS-CoV-2 lineages in a patient. In a single-lineage infection, the frequencies of lineage-defined feature variations are ~100%. When a sample is co-infected by two SARS-CoV-2 lineages A and B, the frequencies of the feature variations specific to each lineage (blue and pink) have their own levels.
The sum of the average frequencies of lineage-specific feature variations is ~100%, and the frequencies of the shared feature variations of lineages A and B (purple) are ~100%.
Fig. 2 Three modules were developed for identifying the SARS-CoV-2 lineage(s) in each sample: mutation calling, lineage filtration and pattern determination. In the mutation calling module, next-generation sequencing (NGS) raw data (FASTQ files) were imported into CLC Genomics Workbench for SNV calling. All available consensus sequences in GISAID were collected and their feature variations identified. In the lineage filtration module, the two files from the mutation calling module were passed to a homemade Python script for hypergeometric distribution analysis. The candidate within-host lineages were identified for the next analysis. By ranking candidate lineages according to statistical significance (p-value), mutation frequency uniformity (Frequency STD) and mutation proportion (P mutation / N mutation), the lineage with the highest confidence was recorded. In the last module, pattern determination, samples with one identified lineage were classified as One lineage infection samples, samples with at least two identified SARS-CoV-2 lineages were classified as Co-infection samples, and the remaining samples that could not be classified into either type were classified as Unclassified.
1. Clinical course and risk factors for mortality of adult inpatients with COVID-19 in Wuhan, China: a retrospective cohort study
2. An interactive web-based dashboard to track COVID-19 in real time
3. A dynamic nomenclature proposal for SARS-CoV-2 lineages to assist genomic epidemiology
4. Estimated transmissibility and impact of SARS-CoV-2 lineage B.1.1.7 in England
5. SARS-CoV-2 variants B.1.351 and P.1 escape from neutralizing antibodies
6. Reduced sensitivity of SARS-CoV-2 variant Delta to antibody neutralization
7. Reduced neutralization of SARS-CoV-2 B.1.617 by vaccine and convalescent serum
8. Genomic evidence for reinfection with SARS-CoV-2: a case study
9. Resurgence of COVID-19 in Manaus, Brazil, despite high seroprevalence
10. Pervasive transmission of E484K and emergence of VUI-NP13L with evidence of SARS-CoV-2 co-infection events by two different lineages in Rio Grande do Sul
11. Patterns of within-host genetic diversity in SARS-CoV-2
12. SARS-CoV-2 within-host diversity and transmission
13. Infection with different strains of SARS-CoV-2 in patients with COVID-19
14. Change of dominant strain during dual SARS-CoV-2 infection
15. Dual HIV-1 infection associated with rapid disease progression
16. Identifying HIV-1 dual infections
17. Update on HIV-1 and HIV-2 Dual Infection
18. Epidemiology, genetic recombination, and pathogenesis of coronaviruses
19. Genomic recombination events may reveal the evolution of coronavirus and the origin of SARS-CoV-2
20. Recombination should not be an afterthought
21. Inferring SARS-CoV-2 variant within-host kinetics
22. One viral sequence for each host? The neglected within-host diversity as the main stage of SARS-CoV-2 evolution
23. SARS-CoV-2 evolution during treatment of chronic infection
24. Respiratory viral co-infections among SARS-CoV-2 cases confirmed by virome capture sequencing
25. Analytical validity of nanopore sequencing for rapid SARS-CoV-2 genome analysis
26. Data, disease and diplomacy: GISAID's innovative contribution to global health
27. Global initiative on sharing all influenza data - from vision to reality
Table 1 Metadata of the SARS-CoV-2 co-infected samples (rows as recovered; some cells are missing)
Pennsylvania 36 female Alpha(B.1.1.7)/Delta(B.1.617.2)
SRR14812095 2021/05/17 Massachusetts 27 female 18.87 Alpha(B.1.1.7)/Delta(B.1.617.2)
SRR15383414 2021/07/19 Missouri 28 male 15 Eta(B.1.525)
SRR14812093 2021/05/17 Connecticut 18 female 23.75 Alpha(B.1.1.7)/Gamma(P.1)/Iota(B.1.526)
Iota(B.1.526)
SRR14811859 2021/05/15 Pennsylvania 68 female 22 Alpha(B.1.1.7)/Iota(B.1.526)
SRR14812107 2021/05/17 Massachusetts 40 male 17.20 Alpha(B.1.1.7)/Iota(B.1.526)
Alpha(B.1.1.7)/Iota(B.1.526)
Alpha(B.1.1.7)/Iota(B.1.526)
Alpha(B.1.1.7)/Iota(B.1.526)
SRR14452322 2021/04/13 California 21 female 19.48 Alpha(B.1.1.7)/Iota(B.1.526)
2)/Delta(AY.3)
SRR15432502 2021/07/26 Virginia 19 male Delta(AY.2)/Delta(AY.3)
SRR15432222 2021/07/26 Wisconsin 27 female 16.64 Delta(AY.2)/Delta(AY.3)
SRR15433019 2021/07/26 Michigan 11 male 11.87 Delta(AY.2)/Delta(AY.3)
SRR15433677 2021/07/17 New Jersey 50 male 22
Minnesota 29 female 23.15 Delta(AY.2)/Delta(AY.3)
2)/Delta(AY.3)
We acknowledge support from the CAMS Initiative for Innovative Medicine (
All authors declare no competing interests.
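For readers who want a concrete picture of the lineage filtration step described in the Methods, the following minimal sketch computes a one-sided hypergeometric (Fisher-type) p-value for the overlap between the variations detected in a sample and the feature variations of each lineage. It is an illustration under assumptions, not the authors' implementation: the function names, the toy variation labels, and the use of the union of all lineage feature variations as the background set are hypothetical choices.

```python
from math import comb


def hypergeom_sf(k, N, K, n):
    """P(X >= k) when n items are drawn without replacement from N items,
    of which K are 'successes' (here: feature variations of one lineage)."""
    upper = min(K, n)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)


def lineage_pvalues(sample_variations, lineage_features):
    """Return lineage -> p-value, sorted from most to least significant.

    Assumption: the background set is the union of all lineage feature
    variations; sample variations outside this set are ignored.
    """
    universe = set().union(*lineage_features.values())
    observed = set(sample_variations) & universe
    N, n = len(universe), len(observed)
    pvalues = {
        lineage: hypergeom_sf(len(observed & features), N, len(features), n)
        for lineage, features in lineage_features.items()
    }
    return dict(sorted(pvalues.items(), key=lambda item: item[1]))


# Toy feature sets for three hypothetical lineages; "s1" is shared by A and B.
toy_features = {
    "A": {"a1", "a2", "a3", "s1"},
    "B": {"b1", "b2", "b3", "s1"},
    "C": {"c1", "c2", "c3", "c4"},
}
toy_sample = {"a1", "a2", "a3", "s1", "b1", "b2", "b3"}
pvals = lineage_pvalues(toy_sample, toy_features)
print(pvals)
```

In this toy case lineages A and B receive the same, smaller p-value and lineage C receives a p-value of 1, so A and B would be the candidates passed on to the frequency-pattern check. With such a small background set the p-values are not very small; real feature-variation sets are far larger.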