key: cord-265329-bsypo08l
authors: van Dorp, Lucy; Acman, Mislav; Richard, Damien; Shaw, Liam P.; Ford, Charlotte E.; Ormond, Louise; Owen, Christopher J.; Pang, Juanita; Tan, Cedric C.S.; Boshier, Florencia A.T.; Ortiz, Arturo Torres; Balloux, François
title: Emergence of genomic diversity and recurrent mutations in SARS-CoV-2
date: 2020-05-05
journal: Infect Genet Evol
DOI: 10.1016/j.meegid.2020.104351
sha: 
doc_id: 265329
cord_uid: bsypo08l

SARS-CoV-2 is a SARS-like coronavirus of likely zoonotic origin first identified in December 2019 in Wuhan, the capital of China's Hubei province. The virus has since spread globally, resulting in the currently ongoing COVID-19 pandemic. The first whole genome sequence was published on January 5 2020, and thousands of genomes have been sequenced since this date. This resource allows unprecedented insights into the past demography of SARS-CoV-2 but also monitoring of how the virus is adapting to its novel human host, providing information to direct drug and vaccine design. We curated a dataset of 7666 public genome assemblies and analysed the emergence of genomic diversity over time. Our results are in line with previous estimates and point to all sequences sharing a common ancestor towards the end of 2019, supporting this as the period when SARS-CoV-2 jumped into its human host. Due to extensive transmission, the genetic diversity of the virus in several countries recapitulates a large fraction of its worldwide genetic diversity. We identify regions of the SARS-CoV-2 genome that have remained largely invariant to date, and others that have already accumulated diversity. By focusing on mutations which have emerged independently multiple times (homoplasies), we identify 198 filtered recurrent mutations in the SARS-CoV-2 genome. Nearly 80% of the recurrent mutations produced non-synonymous changes at the protein level, suggesting possible ongoing adaptation of SARS-CoV-2. Three sites in Orf1ab in the regions encoding Nsp6, Nsp11, Nsp13, and one in the Spike protein are characterised by a particularly large number of recurrent mutations (>15 events) which may signpost convergent evolution and are of particular interest in the context of adaptation of SARS-CoV-2 to the human host. We additionally provide an interactive user-friendly web-application to query the alignment of the 7666 SARS-CoV-2 genomes.

On December 31 2019, China notified the World Health Organisation (WHO) about a cluster of pneumonia cases of unknown aetiology in Wuhan, the capital of the Hubei Province. The initial evidence was suggestive of the outbreak being associated with a seafood market in Wuhan, which was closed on January 1 2020. The aetiological agent was characterised as a SARS-like betacoronavirus, later named SARS-CoV-2, and the first whole genome sequence (Wuhan-HU-1) was deposited on NCBI Genbank on January 5 2020 (1) . Human-to-human transmission was confirmed on January 14 2020, by which time SARS-CoV-2 had already spread to many countries throughout the world. Further extensive global transmission led to the WHO declaring COVID-19 as a pandemic on March 11 2020.

Betacoronaviridae comprise a large number of lineages that are found in a wide range of mammals and birds (2), including the other human zoonotic pathogens SARS-CoV-1 and MERS-CoV. The propensity of Betacoronaviridae to undergo frequent host jumps supports SARS-CoV-2 also being of zoonotic origin. To date, the genetically closest-known lineage is found in horseshoe bats (BatCoV RaTG13) (3). However, this lineage shares 96% identity with SARS-CoV-2, which is not sufficiently high to implicate it as the immediate ancestor of SARS-CoV-2 (2). The zoonotic source of the virus remains unidentified at the date of writing (April 23 2020).

The analysis of genetic sequence data from pathogens is increasingly recognised as an important tool in infectious disease epidemiology (4, 5) . Genetic sequence data sheds light on key epidemiological parameters such as doubling time of an outbreak/epidemic, reconstruction of transmission routes and the identification of possible sources and animal reservoirs. Additionally, whole-genome sequence data can inform drug and vaccine design. Indeed, genomic data can be used to identify pathogen genes interacting with the host and allows characterization of the more evolutionary constrained regions of a pathogen genome, which should be preferentially targeted to avoid rapid drug and vaccine escape mutants.

There are thousands of global SARS-CoV-2 whole-genome sequences available on the rapid data sharing service hosted by the Global Initiative on Sharing All Influenza Data (GISAID; https://www.epicov.org) (6, 7) . The extraordinary availability of genomic data during the COVID-19 pandemic has been made possible thanks to a tremendous effort by hundreds of researchers globally depositing SARS-CoV-2 assemblies (Table S1 ) and the proliferation of close to real time data visualisation and analysis tools including NextStrain (https://nextstrain.org) and CoV-GLUE (http://cov-glue.cvr.gla.ac.uk).

In this work we use this data to analyse the genomic diversity that has emerged in the global population of SARS-CoV-2 since the beginning of the COVID-19 pandemic, based on a download of 7710 assemblies. We focus in particular on mutations that have emerged independently multiple times (homoplasies) as these are likely candidates for ongoing adaptation of SARS-CoV-2.

[...] measured via the site specific consistency index. For this analysis all ambiguous sites in the alignment were set to 'N'. To assess whether any particular Open Reading Frame (ORF) showed evidence of more homoplasies than expected given the length of the ORF, an empirical distribution was obtained by sampling, with replacement, equivalent length windows and recording the number of homoplasies detected (Table S3).
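The window-resampling test for ORFs described above can be sketched as follows. This is a minimal illustration and not the authors' script: the function names, the representation of homoplasies as a list of 1-based alignment positions, and the number of draws are all assumptions made for the example.

```python
import random
from bisect import bisect_left, bisect_right


def count_in_window(sorted_positions, start, end):
    """Count positions p with start <= p <= end (1-based, inclusive)."""
    return bisect_right(sorted_positions, end) - bisect_left(sorted_positions, start)


def orf_enrichment(positions, genome_length, orf_start, orf_end,
                   n_draws=10000, seed=0):
    """Compare the homoplasy count in one ORF with windows of equal length.

    Windows are drawn uniformly, with replacement, along the alignment.
    Returns (observed_count, fraction_of_windows_with_count >= observed).
    """
    rng = random.Random(seed)          # fixed seed only for reproducibility
    pos = sorted(positions)
    width = orf_end - orf_start + 1
    if width > genome_length:
        raise ValueError("ORF is longer than the alignment")
    observed = count_in_window(pos, orf_start, orf_end)
    last_start = genome_length - width + 1
    n_ge = 0
    for _ in range(n_draws):
        s = rng.randint(1, last_start)  # inclusive on both ends
        if count_in_window(pos, s, s + width - 1) >= observed:
            n_ge += 1
    return observed, n_ge / n_draws
```

An ORF whose observed count exceeds 95% of the sampled windows would fall in the upper tail of this empirical distribution. Short ORFs yield a coarse distribution, so the resulting fraction should be read as indicative only.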

HomoplasyFinder identified 1132 homoplasies (1042 excluding masked sites), which were distributed over the SARS-CoV-2 genome ( Figure S5, Table S4 ). Of these, 40 sites have a derived allele at >1% of the total isolates. However, homoplasies can arise due to convergent evolution (putatively adaptive), recombination, or via errors during the processing of sequence data. The latter is particularly problematic here due to the mix of technologies and methods employed by different contributing research groups. We therefore filtered identified homoplasies using a set of thresholds attempting to circumvent this problem (filtering scripts and figures are available at https://github.com/liampshaw/CoV-homoplasy-filtering).

In summary, for each homoplasy we computed the proportion of isolates with the homoplasy pnn where the nearest neighbouring isolate in the phylogeny also carried the homoplasy (excluding identical sequences). This metric ranges between pnn=0 (all isolates with the homoplasy present as singletons) and pnn=1 (no singletons i.e. clustering of isolates with the homoplasy in the phylogeny). We reasoned that artefactual sequencing homoplasies would tend to show up as singletons, so excluded all homoplasies with pnn<0.1 from further analysis.
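A minimal sketch of the pnn metric is given below. It assumes that pairwise patristic distances between isolates are already available as a nested dictionary, that "identical sequences" can be approximated by isolates at distance zero, and that a tie for nearest neighbour counts as a hit if any tied neighbour carries the homoplasy. These are choices made for the illustration; the filtering scripts in the repository linked above remain the reference.

```python
def pnn(dist, carriers):
    """Proportion of carriers whose nearest neighbour is also a carrier.

    dist     : dict mapping isolate -> {other isolate: patristic distance}
    carriers : iterable of isolates carrying the homoplasy
    """
    carriers = set(carriers)
    if not carriers:
        return 0.0
    hits = 0
    for iso in carriers:
        # Ignore the isolate itself and zero-distance (identical) isolates.
        others = {j: d for j, d in dist[iso].items() if j != iso and d > 0}
        if not others:
            continue
        d_min = min(others.values())
        nearest = [j for j, d in others.items() if d == d_min]
        if any(j in carriers for j in nearest):
            hits += 1
    return hits / len(carriers)
```

With this definition, a homoplasy carried only by scattered singletons gives pnn=0, and one carried by isolates that always sit next to another carrier gives pnn=1.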

To obtain a set of high confidence homoplasies, we then used the following criteria: ≥0.1% isolates in the alignment share the homoplasy (equivalent to >8 isolates), pnn>0.1, and derived allele found in strains sequenced from >1 originating lab and >1 submitting lab. We also required the proportion of isolates where the homoplasic site was in close proximity to an ambiguous base (±5 bp) to be zero. The application of these various filters reduced the number of homoplasies to 198 (Table S5). We also plotted the distributions of cophenetic distances between isolates carrying each homoplasy compared to the distribution for all isolates (Figure S6), and inspected the distribution of all identified homoplasies in the phylogenies from our own analyses and on the phylogenetic visualisation platform provided by NextStrain. Finally, we examined whether ambiguous bases were seen more often at homoplasic sites than at random bases (excluding masked sites), which was not the case (Figure S7).
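The filtering criteria listed above can be expressed as a single predicate. The sketch below is illustrative only: the `Homoplasy` record and its field names are hypothetical, and the thresholds are copied from the text rather than from the published filtering scripts.

```python
from dataclasses import dataclass


@dataclass
class Homoplasy:
    """Hypothetical per-site summary; field names are illustrative."""
    position: int
    n_isolates: int             # isolates carrying the derived allele
    pnn: float                  # nearest-neighbour proportion
    n_originating_labs: int
    n_submitting_labs: int
    prop_near_ambiguous: float  # carriers with an ambiguous base within +/-5 bp


def is_high_confidence(h, n_total,
                       min_fraction=0.001, min_pnn=0.1):
    """Apply the thresholds described in the text to one homoplasy."""
    return (
        h.n_isolates / n_total >= min_fraction
        and h.pnn > min_pnn
        and h.n_originating_labs > 1
        and h.n_submitting_labs > 1
        and h.prop_near_ambiguous == 0
    )
```

For an alignment of 7666 genomes, the 0.1% threshold corresponds to at least 8 carrier isolates under this reading.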

To further validate the homoplasy detection method applied to the alignment of the 7666 SARS-CoV-2 genome assemblies, we took advantage of the genome sequences for which raw reads were available on the Short Read Archive (SRA). A variant calling pipeline (available at https://github.com/DamienFr/CoV-homoplasy) was used to obtain high-confidence alignments for the 348 (out of 889 as of April 19 2020) SRA genomic datasets both meeting our quality criteria and matching GISAID assemblies. The topology of the Maximum Likelihood phylogeny of these 348 samples was compared to that of the corresponding samples from the GISAID genome assemblies using a Mantel test and the Phytools R package (16) (Figures S8-S9, see Supplementary text). [...] ≥1%), 10 and 11 homoplasies were kept in the SRA dataset and in the GISAID dataset, respectively. Nine sites were detected in both datasets. For sites which failed the filtering thresholds, this was largely due to the low number of studied accessions, which increases the probability of an isolated strain displaying a homoplasy, e.g. if n=2 isolates have a homoplasy, by definition they cannot be nearest neighbours, so pnn=0.

The alignment was translated to amino acid sequences using SeaView V4 (17) . Sites were identified as synonymous or non-synonymous and amino acid changes corresponding to these mutations were retrieved via multiple sequence alignment. We assessed the change in hydrophobicity and charge of amino acid residues arising due to homoplastic non-synonymous mutations using the hydrophobicity scale proposed by Janin (18) . The ten most hydrophobic residues on this scale were considered hydrophobic and the rest as hydrophilic. In addition, amino acid residues were either classified as positively charged, negatively charged or neutral at pH 7. The charge of each residue can either increase, decrease or remain the same (neutral mutation) due to mutation ( Figure S10 ).
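The charge classification can be illustrated with a short sketch. It assumes the common convention that lysine and arginine are positively charged and aspartate and glutamate negatively charged at pH 7, with histidine treated as neutral; the text does not state how histidine was handled, so that choice, like the function names, is an assumption of this example.

```python
# Net side-chain charge at pH 7 under the stated assumption (His = neutral).
CHARGE = {"K": 1, "R": 1, "D": -1, "E": -1}


def residue_charge(aa):
    """Return +1, -1 or 0 for a one-letter amino acid code."""
    return CHARGE.get(aa.upper(), 0)


def charge_change(ref_aa, alt_aa):
    """Classify a substitution as 'increase', 'decrease' or 'neutral'."""
    delta = residue_charge(alt_aa) - residue_charge(ref_aa)
    if delta > 0:
        return "increase"
    if delta < 0:
        return "decrease"
    return "neutral"
```

For example, `charge_change("D", "G")` is classified as an increase (negative to neutral), while `charge_change("L", "F")` is neutral. The hydrophobicity comparison follows the same pattern once a residue-level scale is supplied.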

SARS-CoV-1 and MERS-CoV are both zoonotic pathogens related to SARS-CoV-2 that previously underwent a host jump into humans. We investigated whether the major homoplasies we detect in SARS-CoV-2 affect sites which also underwent recurrent mutations in these related viruses as they adapted to their human host. All Coronaviridae assemblies were downloaded (NCBI TaxID:11118) on 8th of April 2020 and human associated MERS-CoV and SARS-CoV-1 assemblies extracted. This gave a total of 15 assemblies for SARS-CoV-1 and 255 assemblies for MERS-CoV. Following the same protocol (Augur align) as applied to SARS-CoV-2 assemblies, each species was aligned against the respective RefSeq reference genomes: NC_004718.3 for SARS-CoV-1 and NC_019843.3 for MERS-CoV. This produced alignments of 29,751bp (187 SNPs) and 30,119bp (1588 SNPs) respectively.

The 7666 SARS-CoV-2 genomes offer an excellent geographical and temporal coverage of the COVID-19 pandemic (Figure 1a-b). The genomic diversity of the 7666 SARS-CoV-2 genomes is represented as Maximum Likelihood phylogenies in a radial (Figure 1c) and linear layout (Figure S1-S2). There is a robust temporal signal in the data, captured by a statistically significant correlation between sampling dates and 'root-to-tip' distances for the 7666 SARS-CoV-2 genomes (Figure S3; R²=0.20, p<0.001). Such positive association between sampling time and evolution is expected to arise in the presence of measurable evolution over the timeframe over which the genetic data was collected. Specifically, more recently sampled strains have accumulated more mutations in their genome than older ones since their divergence from the Most Recent Common Ancestor (MRCA, root of the tree).

The origin of the regression between sampling dates and 'root-to-tip' distances (Figure S3) provides a cursory point estimate for the time to the MRCA (tMRCA) around late 2019. Using TreeDater (12), we observe an estimated tMRCA, which corresponds to the start of the COVID-19 epidemic, of 6 October 2019 to 11 December 2019 (95% CIs) (Figure S4). These dates for the start of the epidemic are in broad agreement with previous estimates performed on smaller subsets of the COVID-19 genomic data using various computational methods (Table 1), though they should still be taken with some caution. Indeed, the sheer size of the dataset precludes the use of some of the more sophisticated inference methods available.
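The cursory tMRCA estimate from a root-to-tip regression can be sketched as an ordinary least-squares fit whose x-intercept is the date at which the expected distance from the root is zero. The function below is a generic illustration with hypothetical names; it is not the TreeDater procedure, and it assumes sampling dates are expressed as decimal years.

```python
def root_to_tip_fit(dates, distances):
    """Least-squares fit of root-to-tip distance on sampling date.

    dates     : sampling dates as decimal years (e.g. 2020.25)
    distances : root-to-tip distances in the same order
    Returns (slope, intercept, r_squared, tmrca), where the slope is the
    apparent rate per year and tmrca is the x-intercept of the line.
    """
    n = len(dates)
    mean_x = sum(dates) / n
    mean_y = sum(distances) / n
    sxx = sum((x - mean_x) ** 2 for x in dates)
    syy = sum((y - mean_y) ** 2 for y in distances)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(dates, distances))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy ** 2) / (sxx * syy)
    tmrca = -intercept / slope
    return slope, intercept, r_squared, tmrca
```

Because tips in a phylogeny share ancestry and are therefore not independent observations, the R² and p-value from such a fit serve as a diagnostic of temporal signal rather than as a formal estimate of uncertainty.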

The SARS-CoV-2 global population has accumulated only moderate genetic diversity at this stage of the COVID-19 pandemic with an average pairwise difference of 9.6 SNPs between any two genomes, providing further support for a relatively recent common ancestor. We estimated a mutation rate underlying the global diversity of SARS-CoV-2 of ~6×10⁻⁴ nucleotides/genome/year (CI: 4×10⁻⁴ to 7×10⁻⁴) obtained following time calibration of the maximum likelihood phylogeny. This rate is largely unremarkable for an RNA virus (19, 20), despite Coronaviridae having the unusual capacity amongst viruses of proofreading during nucleotide replication, thanks to the non-structural protein nsp14 exonuclease, which excises erroneous nucleotides inserted by their main RNA polymerase nsp12 (21, 22).
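The average pairwise SNP difference is the mean number of differing sites over all pairs of aligned genomes. The sketch below assumes equal-length aligned sequences and skips any site where either sequence has a character other than A, C, G or T (for instance 'N' or a gap); how ambiguous sites were treated in the reported figure is not stated here, so this handling is an assumption.

```python
from itertools import combinations

VALID = frozenset("ACGT")


def snp_distance(seq_a, seq_b):
    """Number of aligned sites where both bases are unambiguous and differ."""
    return sum(
        1
        for a, b in zip(seq_a.upper(), seq_b.upper())
        if a != b and a in VALID and b in VALID
    )


def mean_pairwise_snps(sequences):
    """Mean SNP distance over all unordered pairs of aligned sequences."""
    pairs = list(combinations(sequences, 2))
    if not pairs:
        raise ValueError("need at least two sequences")
    return sum(snp_distance(a, b) for a, b in pairs) / len(pairs)
```

For thousands of genomes the number of pairs grows quadratically, so a practical implementation would restrict the comparison to variable sites of the alignment.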

Some of the major clades in the maximum likelihood phylogeny (Figure 1c and Figure S1 ) are formed predominantly by strains sampled from the same continent. However, this likely represents a temporal rather than a geographic signal. Indeed, the earliest available strains were collected in Asia, where the COVID-19 pandemic started, followed by extensive genome sequencing efforts first in Europe and then in the USA.

The SARS-CoV-2 genomic diversity found in most countries (with sufficient sequences) essentially recapitulates the global diversity of COVID-19 from the 7666-genome dataset. Figure 2 highlights the proportion of the global genetic diversity found in the UK, the USA, Iceland and China. In the UK, the USA and Iceland, the majority of the global genetic diversity of SARS-CoV-2 is recapitulated, with representatives of all major clades present in each of the countries (Figure 2a-c). The same is true for other countries such as Australia (Figure S2a).

This genetic diversity of SARS-CoV-2 populations circulating in different countries points to each of these local epidemics having been seeded by a large number of independent introductions of the virus. The main exception to this pattern is China, the source of the initial outbreak, where only a fraction of the global diversity is present (Figure 2d ). This is also to an extent the case for Italy (Figure S2b) , which was an early focus of the COVID-19 pandemic. However, this global dataset includes only 35 SARS-CoV-2 genomes from Italy, so some of the genetic diversity of SARS-CoV-2 strains in circulation likely remains unsampled. The genomic diversity of the global SARS-CoV-2 population being recapitulated in multiple countries points to extensive worldwide transmission of COVID-19, likely from extremely early on in the pandemic.

The SARS-CoV-2 alignment can be considered as broken into a large two-part Open Reading Frame (ORF) encoding non-structural proteins, four structural proteins: spike (S), envelope (E), membrane (M) and nucleocapsid (N), and a set of small accessory factors (Figure 3a). There is variation in genetic diversity across the alignment, with polymorphisms often found in neighbouring clusters (Figure S5). A simple permutation resampling approach suggests that both Orf3a and N exhibit SNPs which fall in the 95th percentile of the empirical distribution (Table S3). However, not all of these sites can be confirmed as true variant positions, due to the lack of accompanying sequence read data. We therefore closely inspected those sites that appear to have arisen multiple times following a maximum parsimony tree building step. We identified a large number of putative homoplasies (n=1042 excluding masked regions), which were filtered to a high confidence cohort of 198 positions (see Methods).

These 198 positions in the SARS-CoV-2 genome alignment (0.67% of all sites) were associated with 290 amino acid changes across all 7666 genomes. Of these amino acid changes, 232 comprised non-synonymous and 58 comprised synonymous mutations. Two non-synonymous mutations involving the introduction or removal of stop codons were found (*13402Y, *26152G). 53 of the remaining 101 non-synonymous mutations involved neutral hydrophobicity changes (Figure S10a). In addition, 79 of the remaining 101 non-synonymous mutations involved neutral changes (Figure S10b). Both Orf1ab and N had a four-fold higher frequency of hydrophilic → hydrophobic mutations than hydrophobic → hydrophilic mutations (Figure S10). In addition, neutral hydrophobic changes were clearly favoured in the S protein. Lastly, 87 of the remaining 110 non-synonymous mutations involved neutral charge changes.

Amongst the strongest filtered homoplasic sites (>15 change points on the tree), three are found within Orf1ab (nucleotide positions 11083, 13402, 16887) and one within S (21575). We exemplify the strongest signal and our approach using position 11083 in Figure 3 and provide a full list of homoplasic sites, both filtered and unfiltered, in Tables S4-S5. The strongest hit in terms of the inferred minimum number of changes required (Figure 3b-c) at Orf1ab (11083, Codon 3606) falls over a region encoding the non-structural protein, Nsp6, and is also observed in our analyses of the SRA dataset (Table S7).
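The correspondence between a nucleotide position and its codon can be checked with simple arithmetic. The sketch below assumes that alignment coordinates match the Wuhan-Hu-1 reference (NC_045512.2), where Orf1ab starts at position 266 and S at position 21563; the function name is illustrative.

```python
# ORF start coordinates on the Wuhan-Hu-1 reference (1-based).
ORF_START = {"Orf1ab": 266, "S": 21563}


def codon_of(position, orf):
    """Return (codon number, position within codon), both 1-based.

    Only valid where the reading frame is continuous from the ORF start,
    i.e. upstream of the Orf1ab ribosomal frameshift for that ORF.
    """
    offset = position - ORF_START[orf]
    if offset < 0:
        raise ValueError("position lies upstream of the ORF start")
    return offset // 3 + 1, offset % 3 + 1


print(codon_of(11083, "Orf1ab"))  # (3606, 3)
print(codon_of(21575, "S"))       # (5, 1)
```

Positions in Orf1ab downstream of the programmed ribosomal frameshift, such as 16887, need an additional one-nucleotide correction that this sketch deliberately leaves out.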

We note that some of the hits also overlap with positions identified as putatively under selection using other approaches (http://virological.org/t/selection-analysis-of-gisaid-sars-cov-2data/448/3, accessed April 23 2020), with Orf1ab consistently identified as a region comprising several candidates for non-neutral evolution. Orf1ab is an orthologous gene with other human-associated betacoronaviruses, in particular SARS-CoV-1 and MERS-CoV which both underwent host jumps into humans from likely bat reservoirs (23, 24). We performed an equivalent analysis on human-associated virus assemblies available on the NCBI Virus platform. We identified six putative homoplasic sites within SARS-CoV-1, two occurring within the 3c-like proteinase just upstream of Nsp6 (10384, 10793) and a further two homoplasies within Orf1ab at Nsp9 and Nsp13 (Figure S11). In addition, one homoplasy was identified in the spike protein and one in the membrane protein ORFs.

For MERS-CoV, multiple unfiltered homoplasies were detected, consistent with previous observations of high recombination in this species (25), though only one invoked more than a minimum number of 10 changes on the maximum parsimony tree (Figure S12). This corresponded to a further homoplasy identified in Orf1ab Nsp6 (position 11631). It is of note that this genomic region coincides with the strongest homoplasy in SARS-CoV-2 which also occurs in the Nsp6 encoding region of Orf1ab. Codon 3606 of Orf1ab shares a leucine residue in MERS-CoV and SARS-CoV-2, though a valine in SARS-CoV-1. The exact role of these and other homoplasic mutations in human associated betacoronaviruses represents an important area of future work, although it appears that the Orf1ab region may exhibit multiple putatively adapted variants across human betacoronavirus lineages.

The genome alignment of the 7666 SARS-CoV-2 genomes can be queried through an open access, interactive web-application (https://macman123.shinyapps.io/ugi-scov2-alignmentscreen/). It provides users with information on every SNP and homoplasy detected across our global SARS-CoV-2 alignment and allows visual inspection both within the sequence alignment and across the maximum likelihood tree phylogeny. Figure 3 illustrates some of the functionalities of the web application using position 11083 in the alignment as an example. This particular homoplasy was observed 1078 times across the genomes and requires a minimum of 37 character-site changes to become congruent with the observed SARS-CoV-2 phylogeny (Figure 3a and 3b ).
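The minimum number of changes reported for a site relates directly to its consistency index. Under the standard definition, the index is the smallest number of changes the site could require on any tree (the number of observed character states minus one) divided by the number of changes it requires on the tree at hand. The sketch below applies that definition; treating position 11083 as a two-state site is an assumption made for the example.

```python
def site_consistency_index(n_states, n_changes_on_tree):
    """Consistency index of one site: (n_states - 1) / changes on the tree.

    A value of 1 means the site fits the tree without homoplasy; values
    approaching 0 indicate many independent changes.
    """
    min_changes = n_states - 1
    if min_changes < 1:
        raise ValueError("an invariant site has no defined index")
    if n_changes_on_tree < min_changes:
        raise ValueError("a tree cannot require fewer than n_states - 1 changes")
    return min_changes / n_changes_on_tree


# Two alleles and 37 changes on the tree, as for position 11083 if biallelic.
ci_11083 = site_consistency_index(2, 37)
```

Here `ci_11083` equals 1/37, far below the value of 1 expected for a site that changed only once.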

Pandemics have been affecting humanity for millennia (26). Over the last century alone, several global epidemics have claimed millions of lives, including the 1957/58 influenza A (H2N2) pandemic, the sixth (1899-1923) and seventh 'El Tor' (1961-1975) cholera pandemics, as well as the HIV/AIDS pandemic (1981-today). COVID-19 acts as an unwelcome reminder of the major threat that infectious diseases represent in terms of deaths and disruption.

One positive aspect of the current situation, relative to previous pandemics, is the unprecedented availability of scientific and technological means to face COVID-19. In particular, the rapid development of drugs and vaccines has already begun. Modern drug and vaccine development are largely based on genetic engineering and an understanding of host-pathogen interactions at a molecular level. The mobilisation to address the COVID-19 pandemic by scientists worldwide has been remarkable. This includes the feat of the global scientific community, which has already produced and publicly shared well over 11,000 complete SARS-CoV-2 genome sequences at the time of writing (April 23 2020), which we have used here with gratitude. Further initiatives in the United Kingdom (https://www.cogconsortium.uk/data/) have to date produced over 10,000 genomes, some of which overlap with those already available on GISAID.

To put these numbers of SARS-CoV-2 genomes in context, it is interesting to consider parallels with the 2009 H1N1pdm influenza pandemic, the first epidemic for which genetic sequence data was generated in near-real time (27, 28) . The genetic data available at the time looks staggeringly small in comparison to the amount that has already been generated for SARS-CoV-2 during the early stages of the COVID-19 pandemic. For example, Fraser et al. considered 11 partial hemagglutinin gene sequences two months after the WHO had declared 2009 H1N1pdm influenza a pandemic (27) .

This unprecedented genomic resource has already provided strong conclusions about the pandemic. For example, analyses by multiple independent groups place the start of the COVID-19 pandemic towards the end of 2019 ( Table 1 ). This rules out any scenario that assumes SARS-CoV-2 may have been in circulation long before it was identified, and hence have already infected large proportions of the population.

Extensive genomic resources for SARS-CoV-2 should in principle also be key to informing on optimal drug and vaccine design, particularly when coupled with knowledge of human proteome and immune interactions (29). Ideally, drugs and vaccines should target relatively invariant, strongly constrained regions of the SARS-CoV-2 genome, to avoid drug resistance and vaccine evasion. Therefore ongoing monitoring of genomic changes in the virus will be essential to gain a better understanding of fundamental host-pathogen interactions that can inform drug and vaccine design.

The vast majority of mutations observed so far in SARS-CoV-2 circulating in humans are likely neutral (32, 33) or even deleterious (34). Homoplasies, such as those we detect here, can arise as a product of neutral evolution or as a result of ongoing selection. Of the 198 homoplasies we detect (after applying stringent filters), some proportion are very likely genuine targets of positive selection which signpost ongoing adaptation of SARS-CoV-2 to its new human host. Indeed, we do observe an enrichment for non-synonymous changes (80%) in our filtered sites. As such, our provided list (Table S5) contains candidates for mutations which may affect the phenotype of SARS-CoV-2 and virus-host interactions and which require ongoing monitoring. Conversely, the finding that 78% of the homoplasic mutations involve no polarity change could still reflect strong evolutionary constraints at these positions (35, 36). The remaining non-neutral changes to amino acid properties at homoplasic sites may be enriched in candidates for functionally relevant adaptation and could warrant further experimental investigation.

One of the strongest homoplasies lies at site 11083 in the SARS-CoV-2 genome in a region of Orf1a encoding Nsp6. This site passed our stringent filtering criteria and was also present in our analysis of the SRA dataset (Table S7). Interestingly, this region overlaps a putative immunogenic peptide predicted to result in both CD4+ and CD8+ T-cell reactivity (37). More minor homoplasies amongst our top candidates, identified within Orf3a (Table S5), also map to a predicted CD4 T cell epitope. While the immune response to SARS-CoV-2 is poorly understood at this point, key roles for CD4 T cells, which activate B cells for antibody production, and cytotoxic CD8 T cells, which kill virus-infected cells, are known to be important in mediating clearance in respiratory viral infections (38). Of note, we also identify a strong recurrent mutation in nucleotide position 21575, corresponding to the SARS-CoV-2 spike protein (codon 5). While the spike protein is the known mediator of host-cell entry, our detected homoplasy falls outside of the N-terminal and receptor binding domains.

Our analyses presented here provide a snapshot in time of a rapidly changing situation based on available data. Although we have attempted to filter out homoplasies caused by sequencing error with stringent thresholds, and also used available short-read data to validate a subset of homoplasic sites in a smaller dataset, our analysis nevertheless remains reliant on the underlying quality of the publicly available assemblies. As such, it is possible that some results might be artefactual, and further investigation will be warranted as additional raw sequencing data becomes available.

However, given the crucial importance of identifying potential signatures of adaptation in SARS-CoV-2 for guiding ongoing development of vaccines and treatments, we have suggested what we believe to be a plausible approach and initial list in order to facilitate future work and interpretation of the observed patterns. More data continues to be made available, which will allow ongoing investigation by ourselves and others. We believe it is important to continue to monitor SARS-CoV-2 evolution in this way and to make the results available to the scientific community. In this context, we hope that the interactive web-application we provide will help identify key recurrent mutations in SARS-CoV-2 as they emerge and spread.

Figure 1. Global sequencing efforts have contributed hugely to our understanding of the genomic diversity of SARS-CoV-2. a) Viral assemblies available from global regions as of 19/04/2020. b) Cumulative total of viral assemblies uploaded to GISAID included in our analysis. c) Radial Maximum Likelihood phylogeny for 7666 complete SARS-CoV-2 genomes. Colours represent continents where isolates were collected. Green: Asia; Red: Europe; Purple: North America; Orange: Oceania; Dark blue: South America according to metadata annotations available on NextStrain (https://github.com/nextstrain/ncov/tree/master/data).

• Phylogenetic estimates support that the COVID-19 pandemic started sometime around 6 October 2019 to 11 December 2019, which corresponds to the time of the host-jump into humans.

• The diversity of SARS-CoV-2 strains in many countries recapitulates its full global diversity, consistent with multiple introductions of the virus to regions throughout the world seeding local transmission events.

• 198 sites in the SARS-CoV-2 genome appear to have already undergone recurrent, independent mutations based on a large-scale analysis of public genome assemblies.

• Detected recurrent mutations may indicate ongoing adaptation of SARS-CoV-2 to its novel human host.

• Monitoring the build-up and patterns of genetic diversity in SARS-CoV-2 has potential to inform targets for drug and vaccine development.

A new coronavirus associated with human respiratory disease in China

The phylogenetic range of bacterial and viral pathogens of vertebrates

A pneumonia outbreak associated with a new coronavirus of probable bat origin

The genomic and epidemiological dynamics of human influenza A virus

Unifying the Epidemiological and Evolutionary Dynamics of Pathogens

Data, disease and diplomacy: GISAID's innovative contribution to global health

Global initiative on sharing all influenza data - from vision to reality

MAFFT Multiple Sequence Alignment Software Version 7: Improvements in Performance and Usability

RAxML-NG: a fast, scalable and user-friendly tool for maximum likelihood phylogenetic inference

GGTREE: an R package for visualization and annotation of phylogenetic trees with their covariates and other associated data

Bayesian inference of ancestral dates on bacterial phylogenetic trees

Scalable relaxed clock phylogenetic dating

MPBoot: fast phylogenetic maximum parsimony tree inference and bootstrap approximation

HomoplasyFinder: a simple tool to identify homoplasies on a phylogeny

Toward defining course of evolution -minimum change for a specific tree topology

phytools: an R package for phylogenetic comparative biology (and other things)

SeaView Version 4: A Multiplatform Graphical User Interface for Sequence Alignment and Phylogenetic Tree Building

Surface and inside volumes in globular proteins

An unusually high substitution rate in transplant-associated BK polyomavirus in vivo is further concentrated in HLA-C-bound viral peptides

The evolution of Ebola virus: Insights from the 2013-2016 epidemic

Unique and conserved features of genome and proteome of SARS-coronavirus, an early split-off from the coronavirus group 2 lineage

Discovery of an RNA virus 3 '-> 5 ' exoribonuclease that is critically involved in coronavirus RNA synthesis

Severe acute respiratory syndrome coronavirus-like virus in Chinese horseshoe bats

Middle East Respiratory Syndrome Coronavirus in Bats, Saudi Arabia

MERS-CoV recombination: implications about the reservoir and potential for adaptation

What are pathogens, and what have they done to and for us?

Pandemic Potential of a Strain of Influenza A (H1N1) : Early Findings

Origins and evolutionary genomics of the 2009 swine-origin H1N1 influenza A epidemic

A SARS-CoV-2-Human Protein-Protein Interaction Map Reveals Drug Targets and Potential Drug-Repurposing

Infectious Diseases of Humans

A dynamic nomenclature proposal for SARS-CoV-2 to assist genomic epidemiology

Computational inference of selection underlying the evolution of the novel coronavirus, SARS-CoV-2

A SARS-CoV-2 vaccine candidate would likely match all currently circulating strains

Synonymous mutations and the molecular evolution of SARS-Cov-2 origins

Looking for Darwin in all the wrong places: the misguided quest for positive selection at the nucleotide sequence level

Distribution of the strength of selection against amino acid replacements in human proteins

A Sequence Homology and Bioinformatic Approach Can Predict Candidate Targets for Immune Responses to SARS-CoV-2

Immunity to Respiratory Viruses

Transmission dynamics and evolutionary history of 2019-nCoV

The first two cases of 2019-nCoV in Italy: Where they come from

Genomic epidemiology of SARS-CoV-2 in Guangdong Province

Table 1. Estimates of the tMRCA of SARS-CoV-2 from previous studies (table body not recoverable; surviving entries refer to 95% CI, BCI and HPD intervals and to strict, relaxed and rate-estimated relaxed clock models, including BEAST v1.10).

O analysed data and performed computational analyses

L.v.D and F.B. acknowledge financial support from the Newton Fund UK-China NSFC initiative (grant MR/P007597/1) and the BBSRC (equipment grant BB/R01356X/1). Computational analyses were performed on UCL Computer Science cluster and the South Green bioinformatics platform hosted on the CIRAD HPC cluster. We thank Jaspal Puri for insights and assistance on the development of the alignment visualisation tool and Nicholas McGranahan and Rachel Rosenthal for their comments on the manuscript. We additionally wish to acknowledge the very large number of scientists in originating and submitting labs who have readily made available SARS-CoV-2 assemblies to the research community.