key: cord-0886240-i001yyuh authors: Montoya, Vincent; McLaughlin, Angela; Mordecai, Gideon J.; Miller, Rachel L.; Joy, Jeffrey B. title: Variable routes to genomic and host adaptation among coronaviruses date: 2021-03-10 journal: J Evol Biol DOI: 10.1111/jeb.13771 sha: 7ca9773f9a458172a6c141b993e341390d4aed44 doc_id: 886240 cord_uid: i001yyuh Natural selection operating on the genomes of viral pathogens in different host species strongly contributes to adaptation facilitating host colonization. Here, we analyse, quantify and compare viral adaptation in genomic sequence data derived from seven zoonotic events in the Coronaviridae family among primary, intermediate and human hosts. Rates of nonsynonymous (d (N)) and synonymous (d (S)) changes on specific amino acid positions were quantified for each open reading frame (ORF). Purifying selection accounted for 77% of all sites under selection. Diversifying selection was most frequently observed in viruses infecting the primary hosts of each virus and predominantly occurred in the orf1ab genomic region. Within all four intermediate hosts, diversifying selection on the spike gene was observed either solitarily or in combination with orf1ab and other genes. Consistent with previous evidence, pervasive diversifying selection on coronavirus spike genes corroborates the role this protein plays in host cellular entry, adaptation to new hosts and evasion of host cellular immune responses. Structural modelling of spike proteins identified a significantly higher proportion of sites for SARS‐CoV‐2 under positive selection in close proximity to sites of glycosylation relative to the other coronaviruses. Among human coronaviruses, there was a significant inverse correlation between the number of sites under positive selection and the estimated years since the virus was introduced into the human population. Abundant diversifying selection observed in SARS‐CoV‐2 suggests the virus remains in the adaptive phase of the host switch, typical of recent host switches. A mechanistic understanding of where, when and how genomic adaptation occurs in coronaviruses following a host shift is crucial for vaccine design, public health responses and predicting future pandemics. A mechanistic understanding of how adaptation occurs following a shift from one host to another is crucial for understanding macroevolution, speciation, adaptation and the emergence of infectious diseases (Joy & Crespi, 2012; Longdon et al., 2014; Nosil et al., 2002; Sanchez-Flores et al., 2016; Streicker et al., 2012) . Host shifts and subsequent adaptation to the new host are particularly important in viral evolution (Forni et al., 2016; Mollentze et al., 2014) and in the emergence of viral infectious diseases in humans (Longdon et al., 2014) . The generally small genome sizes and high mutation rates exhibited by RNA viruses make viral host-switching events excellent models for investigating the genomics of adaptation (Duffy et al., 2007; Forni et al., 2016; Zhao et al., 2019) . These properties of RNA viruses also quickly lead to the accretion of significant genetic variation (Drummond et al., 2003) . The large pool of genetic variation combined with the generally large population sizes of RNA viruses provide ideal conditions for selection to optimize viral phenotypes (Krakauer & Komarova, 2003) . For a viral host switch to take place, there are numerous barriers to be overcome, including adequate ecological contact between the two host species for viral exchange to occur, structural barriers, sufficient host genetic similarity to enable initial infection and overcoming host immune defences (Olival et al., 2017; Parrish et al., 2008) . Following a host-switching event, there may either be repeated contact with the new host and viral transfer but no subsequent ongoing transmission (e.g. West Nile Virus infection in humans), limited transmission in novel host species before viral extinction (e.g. Ebolavirus in humans) or sustained transmission among the new host leading to endemic or epidemic disease in the novel host (e.g. HIV, hepatitis C virus, SARS-CoV-2) (Andersen et al., 2020; Boni et al., 2020; Worobey et al., 2004) . The amount of viral adaptation necessary for a host switch to result in an endemic disease in the novel host is thought to be dependent upon the extent of genetic overlap between the hosts as well as the frequency of contact between them (Olival et al., 2017) . Viral transfer between phylogenetically proximate hosts (i.e. between mammals) is thus expected to be associated with a shallower valley between fitness peaks in the fitness landscape (Gavrilets, 2004) , with fewer mutations required for colonization and adaptation to the novel host (Zhao et al., 2019) . In contrast, viral transfer between phylogenetically distant hosts is expected to be associated with a deep fitness valley requiring more extensive host-specific adaptation to facilitate colonization of the new host (Olival et al., 2017) . Due to habitat destruction, there is a higher propensity for ecological overlap between displaced animals and humans leading to increased opportunities for viral host switches and emergence of novel viruses (Leao et al., 2020; Wertheim et al., 2013) . Prior host jumps of coronaviruses to humans have involved an intermediate host (Chan & Chan, 2013) . For example, emergence of Middle East respiratory syndrome coronavirus (MERS-CoV) into humans involved viral exchange between bats and camels before transmission to humans was possible. The role of selection in the intermediate host in facilitating adaptation to humans has been poorly documented; however, MacLean et al. (2020) found evidence that the majority of selection of SARS-CoV-2 occurred in bats not humans. In this study, we test hypotheses concerning adaptation following a host shift across all known coronaviruses that have switched into humans and have available genomic sequence data. These include endemic human coronaviruses (HCOV-HKU1, HCOV-OC43, HCOV-229E, HCOV-NL63), which commonly infect people globally, and are usually associated with mild respiratory disease. Additionally, we include the three more recently introduced coronaviruses (SARS-CoV, MERS-CoV and SARS-CoV-2). It is thought that the dominant selective regime on viral genes is purifying selection, which maintains gene and protein functionality (Spielman et al., 2019) . This is supported by the observation that in the majority of viruses, the number of synonymous substitutions (d s ) exceeds the number of nonsynonymous substitutions (d N ). However, this observation is expected to be reversed on regions of a gene that are under positive selection (d N > d s ), such as an epitope in a viral protein (Hughes & Hughes, 2007) . For example, previous research on coronaviruses has typically identified extensive adaptive evolution in the spike protein (Berrio et al., 2020; Tagliamonte et al., 2020; Xu et al., 2020) . However, most investigations of coronavirus evolution and adaptation focus only on one or a limited number of genes and in the context of only a single virus-host system. Previous studies of selection in MERS-CoV have identified a large number of sites undergoing positive selection in genes involved in viral replication (Forni et al., 2016) . Nevertheless, a pan-genomic evolutionary analysis is required in order to assess which sites are undergoing adaptation across coronaviruses in order to help identify possible trends that may exist in previous host-switching events. These findings are valuable for vaccine design, managing public health responses and predicting future zoonotic events. Specifically, using molecular evolutionary analyses of historical selection we test the hypothesis that particular features of coronavirus genomes are imperative for transmission to novel hosts. We hypothesize that selection on coronavirus genes will vary with time depending upon the gene function. Specifically, we predict positive selection on the genomic regions associated with adaptation to the novel host immediately following the host shift followed by a transition to purifying selection ( Figure 1 ). We also predict that genes involved in host adaptation and immune evasion will reveal a correlation between positively selected sites and time since host switching. We further expect these relationships to be consistent across human coronaviruses. Secondly, we evaluate the role that viral selection in intermediate hosts plays in facilitating persistence in humans. Finally, we evaluate associations between the number of positively selected sites since viral establishment and time since emergence in each host type. Genome sequences for all SARS-CoV-2 isolates were downloaded from GISAID on 13 June 2020. All other SARS-CoV, MERS-CoV, HCOV-HKU1, HCOV-NL63, HCOV-229E and HCOV-OC43 genomes were obtained from NCBI Genbank on 28 May 2020. Viral isolates from bats immediately preceding the SARS-CoV-2 lineage were obtained from Boni et al. (2020) . Candidate genes encoding the spike protein from each respective coronavirus species were used to identify additional relatives using BLASTn alignments against the nonredundant database with relaxed stringencies. Coordinates for each gene were obtained from each respective Genbank reference sequence, and these identified regions were subsequently used in alignments to retrieve genes for those sequences without precise gene coordinates. To avoid artefacts associated with in vitro evolution, sequences were removed if they were sequenced from a cultured isolate as described in the respective Genbank record. We also removed sequences if they were derived from a recombinant virus, had greater than 5% unresolved nucleotides ("N") within a gene, or coverage for a particular gene was <50%. Multiple sequence alignments were performed for each species/ gene using MAFFT version 7.464, (Katoh et al., 2009) and the resulting alignments were visually inspected using AliView version 1.26 (Larsson, 2014 All tests for selection were performed using Hyphy version 2.5.8 (Pond, Poon, et al., 2020) . Prior to each analysis, recombination events among the sequences in all alignments were detected using GARD (Kosakovsky Pond et al., 2006) . Due to computational issues with the GARD analysis for the orf1ab and spike genes for SARS-CoV-2, the data sets for these genes were down-sampled into 20 groups by randomly selecting unique sequences and performing GARD on each of these groups. This was performed 10 times and the consensus of these analyses were used for all subsequent results. For all other GARD analyses, if any breakpoints were identified, all subsequent analyses were performed on each respective partitioned alignment. To identify positions in the genome under selection for each of the coronavirus species, two site-specific models were used [fixed effects likelihood, FEL (Pond & Frost, 2005) and mixed effects model of episodic selection, MEME (Murrell et al., 2012) ]. Differences in selective regimes among human and nonhuman hosts were assessed with an additional site-level model (CONTRAST-FEL) . To account for uncertainty in phylogenetic tree topologies, (Parker et al., 2008) these tests were performed across distributions of ten phylogenetic trees reconstructed for each gene. The default thresholds for each of the resulting p-values (www.hyphy.org) were used to determine significance, and the consensus of these results was used to identify the type of selection. In all comparisons of selection we controlled for the difference in number of sequences analyzed by dividing the number of positively selected sites by the log 10 of the sequence counts for each respective virus/gene group. Positions under selection were required to be identified within internal branches for both FEL and MEME. In order to compare the results from CONTRAST-FEL (multihost derived viral sequences) and FEL/MEME (single host derived viral sequences), the precise codon for each analysis was identified F I G U R E 1 Mock-phylogenetic tree depicting host-switching events for the coronaviruses in this study. Each virus evolved from within its presumed primary, natural host (bats or rats), subsequently spilled over into an intermediate host, and finally jumped into humans. The estimated time since introduction into the human population is shown at the node preceding each human viral lineage (years ago: YA). However, for HCOV-NL63 and HCOV-229E, these timeframes were estimated from the emergence from bats. At each of these instances of host switching, presumably there was a large increase in adaptive evolution (shown in red on the phylogenetic pathway from primary to human host) followed by purifying selection (shown in blue). Blue arrow signifies documented ongoing cross-species transmission events as exemplified in MERS-CoV [Colour figure can be viewed at wileyonlinelibrary.com] by aligning each codon-aware alignment to each other. The estimated time of zoonoses or host-switching events were calculated as the difference between the estimated years since emergence into humans and the most recent collection date for each coronavirus sequence obtained from a clinical isolate. For these results, the total FEL/MEME sites were used, respectively, in each analysis. The estimates were derived from the following studies: HCOV-OC43 (Vijgen et al., 2005) were analysed, annotated and visualized using the PyMOL molecular graphic system, version 2.4.0 (Schrodinger, LLC, 2015) . As there was no suitable template for HCOV-HKU1, HCOV-OC43 was used as a surrogate for visual purposes and excluded from further structural analyses. Glycosylation sites were modelled using GlyProt, (Bohne-Lang & Lieth, 2005) and sites in close proximity (10 angstroms) were identified using PyMOL (Schrodinger, LLC, 2015) . Structure-based alignments were generated using the TM-Align algorithm (Zhang & Skolnick, 2005) . Aligned amino acid positions for each pairwise protein structure alignment found to be in close proximity (5 angstroms, as indicated in the output of the TM-alignment) were then compared with positions that were positively selected. Neutralizing epitopes were obtained from the Immune Epitope Database (Vita et al., 2019) . Monte Carlo permutation tests (999 unrestricted permutations, p ≤ 0.05) were used to test the significance of glycosylation proximity to sites of selection using the Coin package in R (Zeileis et al., 2008) . A total of 50,354 sequences were obtained from global databases before quality filtering ( Due to the extraordinary efforts of researchers around the world, 95% of all genomic sequences in this study belong to SARS-CoV-2. Unsurprisingly, except for SARS-CoV and MERS-CoV, the majority of sequences in each data set were dominated by human sequences (median = 87.4%), whereas the number of sequences from primary and intermediate hosts were typically less well-represented (medians 6.3% and 18.1%, respectively). Rates of nonsynonymous (d N ) and synonymous (d S ) mutations on specific amino acid positions were quantified for each ORF from seven coronaviruses known to have a history of host-switching events: SARS-CoV-2, SARS-CoV, MERS-CoV, HCOV-229E, HCOV-OC43 and HCOV-NL63 as well as at least one of their nonhuman hosts. HCOV-HKU1 was also investigated; however, all related viruses derived from nonhuman hosts were too divergent. HCOV-OC43-like viruses isolated from rats were also not investigated, as only four suitable genomes were available to the best of our knowledge and most of these ORFs collapsed into identical sequences. Viruses from primary (Forni et al., 2016; Pond, 2020; Tang et al., 2009) . The majority of selection was revealed to be acting on the primary bat hosts Table S2 ). The large amount of selection in SARS-CoV-2 suggests that the virus is still in the adaptive phase of the host switch, typical of recent F I G U R E 3 Number of sites under positive ("Positive Sel") or negative selection ("Negative Sel") for each gene/ORF. Sites under selection are those determined to be significant for both FEL and MEME models. Counts of selection were normalized by the log 10 number of sequences used in each analysis. Results for civets, camels and bovine derived viruses were combined into the "intermediate" host category [Colour figure can be viewed at wileyonlinelibrary.com] host switches (Parrish et al., 2008) . To further explore this observation, we compared the estimated time since the introduction into the human population (Table S1, Figure 1 Viruses that frequently change hosts provide a unique opportunity to investigate evolutionary pathways critical for adaptation to novel host environments (Moncla et al., 2016; Morley et al., 2015) . In this study, positive selection was observed sporadically across the genomes of each coronavirus, affecting both exogenous and endogenous proteins. Although the pattern of selection in each host was primarily unique to each virus, we provided evidence in line with our hypotheses that the timing of adaptation to each new host is correlated with a decreasing number of sites undergoing positive selection. It is intuitive, and in accordance with both predictions derived from evolutionary theory and empirical evidence in other contexts (Braga et al., 2018; Ricklefs & Fallon, 2002) , that following host switching, viral populations undergo a period of rapid diversification to facilitate adaptation to the novel environment (Abbott et al., 2013; Joy & Crespi, 2007; Turner & Elena, 2000) . These adaptations may increase host receptor binding affinity, evade immune detection and promote viral establishment; subsequently, as viruses ascend a peak of optimality in the fitness landscape (Gavrilets, 2004) (i.e. high affinity binding of the spike protein to the ACE2 receptor in humans for SARS-CoV-2), evolutionary forces favour stabilization and purifying selection to maintain viral phenotypes at new optima in the novel hosts (Morley et al., 2015; Pashley, 1988) . This is also consistent with previous genomic investigations of epidemics where substitution rate estimates are inversely correlated with time of sampling (Holmes et al., 2016) . Parallel evolution, well documented throughout the tree of life (Boughman et al., 2005; Nosil et al., 2002; Schluter et al., 2004) , has been repeatedly observed in viruses during host shifts both in vitro and in vivo (Bedhomme et al., 2012; Remold et al., 2008; Wichman et al., 1999) . ditional explanation is that in our study sequences that had been cultured prior to sequencing were excluded. Similar to SARS, analyses of MERS-CoV have found a higher degree of positive selection for genes involved in replication relative to the Spike gene (Forni et al., 2016) , largely consistent with the results obtained in our study (e.g. Figures 2 and 3) . The implications of a larger degree of positive selection in replicase genes for both SARS-CoV and MERS-CoV are intriguing as it suggests genes other than spike are under strong diversifying selection immediately following host-switching events consistent with patterns observed in influenza (Bhatt et al., 2013) . An alternative and more simplified explanation, however, could simply be due to the significantly longer lengths for genes involved in replication. Studies investigating the evolution of SARS-CoV-2 have largely found that purifying selection has dominated its evolution (MacLean et al., 2020) . It was also demonstrated that extensive positive selection likely occurred prior to its introduction into the human population (MacLean et al., 2020) . Analyses using more recent genomes (December 2020) have revealed evidence for positive selection on 1,860 sites, again predominantly affecting replicationassociated genes (n = 1,222) compared with those sites selected in the spike gene (n = 237) (Pond, 2020) . The proximity between glycans and sites of positive selection in the SARS-CoV-2 spike protein are intriguing specifically within the context of other coronaviruses. Previous research has found that the SARS-CoV-2 spike protein has a relatively large glycan shield (Watanabe et al., 2020) , and when compared with other coronaviruses, there were few patterns concerning the precise sites of glycosylation (Xu et al., 2020) . In this study the proximity of glycans (Figures 3 and 6) . Nevertheless, this discrepancy may still influence the results of this study, and once further data arise for other coronaviruses, future work may profitably investigate the relationships outlined in this study. Concerning the moderate relationship identified between the number of sites under selection and estimated divergence times, it is interesting that this relationship was only identified in orf1a and orf1b ( Figure S2 ). In fact, the opposite trend was observed for the spike protein where the number of sequences was inversely correlated with the number of positively selected sites. This may imply that while endogenous proteins The relationships between the number of positively selected sites and estimated years since introduction into humans. Using estimated years from previous research, the inverse estimated years since introduction is plotted against the total number of unique sites under positive selection (FEL and MEME) and the same sites divided by the log 10 normalized counts of sequences used in each selection analysis for both the ORF1a and ORF1b genes. insightful comments which greatly improved our manuscript. The authors have no conflict of interest to declare. VM conceived of and coordinated the study, performed data analyses, generated figures and tables and wrote the first draft of the manuscript. JBJ conceived of and coordinated the study, critically examined analyses and results, generated figures and contributed to writing the first draft of the manuscript. GJM, RLM and AM critically examined the analyses and results and contributed to draft versions of the manuscript. All authors approved the final version. All data used in this analysis are freely available to access on GenBank, GISAID and the RSCB Protein Databank. The accession numbers for all sequences used in the analysis are available in Table S3 and all positively selected sites are in Table S4 , on GitHub. https://github. com/vmon5 813/Coron aviru sHost Adapt ations All alignment files are available on request. https://orcid.org/0000-0002-0615-4591 Hybridization and speciation The new SARS-CoV-2 strain shows a stronger binding affinity to ACE2 due to N501Y mutation Molecular epidemiology and evolutionary histories of human coronavirus OC43 and HKU1 among patients with upper respiratory tract infections in Kuala Lumpur The proximal origin of SARS-CoV-2 Multihost experimental evolution of a plant RNA virus reveals local adaptation and hostspecific mutations The Protein Data Bank and the challenge of structural genomics Positive selection within the genomes of SARS-CoV-2 and other Coronaviruses independent of impact on protein function The evolutionary dynamics of influenza A virus adaptation to mammalian hosts GlyProt: In silico glycosylation of proteins Evolutionary origins of the SARS-CoV-2 sarbecovirus lineage responsible for the COVID-19 pandemic Parallel evolution of sexual isolation in sticklebacks Unifying host-associated diversification processes using butterfly-plant networks Tracing the SARS-coronavirus High speed BLASTN: An accelerated MegaBLAST search tool Measurably evolving populations Evolution of host specificity drives reproductive isolation among RNA viruses Extensive positive selection drives the evolution of nonstructural proteins in lineage C betacoronaviruses Fitness landscapes and the origin of species The evolution of Ebola virus: Insights from the 2013-2016 epidemic More effective purifying selection on RNA viruses than in DNA viruses Evidence supporting a zoonotic origin of human coronavirus strain NL63 Adaptive radiation of gall-inducing insects within a single host-plant species Island phytophagy: Explaining the remarkable diversity of phytophagous insects Multiple alignment of DNA sequences with MAFFT Evidence for N-glycan shielding of antigenic sites during evolution of human influenza A virus hemagglutinin GARD: A genetic algorithm for recombination detection Levels of selection in positivestrand virus dynamics AliView: A fast and lightweight alignment viewer and editor for large datasets Severe Acute Respiratory Syndrome (SARS) coronavirus ORF8 protein is acquired from SARS-related coronavirus from greater horseshoe bats through recombination Coronaviridae-Old friends, new enemy! Early transmissibility assessment of the N501Y mutant strains of SARS-CoV-2 in the United Kingdom The impact of mutations in SARS-CoV-2 spike on viral infectivity and antigenicity The evolution and genetics of virus host shifts Natural selection in the evolution of SARS-CoV-2 in bats, not humans, created a highly capable human pathogen The role of viral evolution in rabies host shifts and emergence. Current Opinion in Virology Selective bottlenecks shape evolutionary pathways taken during mammalian adaptation of a 1918-like avian influenza virus Rate of novel host invasion affects adaptability of evolving RNA virus lineages Detecting individual sites subject to episodic diversifying selection Phylogenetic relationship of SARS-CoV-2 sequences from Amazonas with emerging Brazilian variants harboring mutations E484K and N501Y in the Spike protein Host-plant adaptation drives the parallel evolution of reproductive isolation Host and viral traits predict zoonotic spillover from mammals Correlating viral phenotypes with phylogeny: Accounting for phylogenetic uncertainty Crossspecies virus transmission and the emergence of new epidemic diseases Quantitative genetics, development, and physiological adaptation in host strains of fall armyworm Distant relatives of severe acute respiratory syndrome coronavirus and close relatives of human coronavirus 229E in bats Natural selection analysis of SARS-CoV-2/COVID-19 Not so different after all: A comparison of methods for detecting amino acid sites under selection HyPhy 2.5-A customizable platform for evolutionary hypothesis testing using phylogenies Contrast-FEL: A test for differences in selective pressures at individual sites among clades and sets of branches FastTree 2 -Approximately Maximum-Likelihood Trees for Large Alignments The PyMOL molecular graphics system Preliminary genomic characterisation of an emergent SARS-CoV-2 lineage in the UK defined by a novel set of spike mutations Evolutionary genomics of host adaptation in vesicular stomatitis virus Diversification and host switching in avian malaria parasites Genome evolution in three species of cactophilic Drosophila. G3 (Bethesda) The high infectivity of SARS-CoV-2 B. 1.1. 7 is associated with increased interaction force between Spike-ACE2 caused by the viral N501Y mutation Parallel evolution and inheritance of quantitative traits The evolution of HIV-1 and the origin of AIDS Evolution of viral genomes: Interplay between selection, recombination, and other forces Variable evolutionary routes to host establishment across repeated rabies virus host shifts among bats Multiple recombination events and strong purifying selection at the origin of SARS-CoV-2 spike glycoprotein increased correlated dynamic movements Differential stepwise evolution of SARS coronavirus functional proteins in different host species Cost of host radiation in an RNA virus Complete genomic sequence of human coronavirus OC43: Molecular clock analysis suggests a relatively recent zoonotic coronavirus transmission event The Immune Epitope Database (IEDB): 2018 update Site-specific glycan analysis of the SARS-CoV-2 spike SWISS-MODEL: Homology modelling of protein structures and complexes A case for the ancient origin of coronaviruses Different trajectories of parallel evolution during viral adaptation Origin of AIDS: Contaminated polio vaccine theory refuted How significant is a protein structure similarity with TM-score = 0.5? Variations in SARS-CoV-2 spike protein cell epitopes and glycosylation profiles during global transmission course of COVID-19 Implementing a class of permutation tests: The coin package TM-align: A protein structure alignment algorithm based on the TM-score Evolutionary dynamics of MERS-CoV: Potential recombination, positive selection and transmission Existing host range mutations constrain further emergence of RNA viruses Variable routes to genomic and host adaptation among coronaviruses