key: cord-342539-o004ggon
authors: Lam, Tommy Tsan-Yuk
title: Tracking the genomic footprints of SARS-CoV-2 transmission
date: 2020-05-28
journal: Trends Genet
DOI: 10.1016/j.tig.2020.05.009
doc_id: 342539
cord_uid: o004ggon

Abstract

There is considerable public and scientific interest in the origin, spread and evolution of SARS-CoV-2. A recent study by Lu et al. [1] conducted genomic sequencing and analysis of SARS-CoV-2 in Guangdong, revealing its early transmission out of Hubei and shedding light on the effectiveness of controlling local transmission chains.

The outbreak of SARS-CoV-2 (the virus causing the disease known as COVID-19) was initially reported in Wuhan city, in the Hubei province of China, in December 2019. It soon swept across the nation and then over the world within two months, leading to its declaration as a pandemic by the World Health Organisation (WHO) on 11th March 2020. As of 20th May 2020, the death toll of COVID-19 had reached 318,789, with a total of nearly five million confirmed cases across all continents [2]. The enormous health and economic impacts caused by this virus, at such an unprecedented speed, have raised considerable public interest, fuelling research on its origin, spread and evolution. Genomic data and their analysis have been key tools in studying such emerging pathogens [3]. The recent identification and genomic characterisation of coronaviruses in bats and pangolins that are evolutionarily related to human SARS-CoV-2 have suggested that the former could act as the zoonotic origins of the latter [4, 5]. Other pressing questions for understanding the ongoing pandemic include identifying the means by which SARS-CoV-2 spread across China and the world from its starting place, and whether disease control measures effectively prevent introduced infections from transmitting further.
Several studies using mathematical modelling of COVID-19 incidence and of other coronavirus infections have suggested that disease control measures such as social distancing and city lockdowns are effective [6, 7], but these methods seldom assess individual transmission. It is evident in the literature that genomic information of pathogens provides valuable empirical information about their transmission histories [2], such as the identification of transmission chains through phylogenetic analysis of genome sequences, as illustrated in the work by Lu et al. [1]. In addition to revealing the evolutionary process of the pathogen and acting as a proxy of disease transmission history, phylogenetic trees also serve as versatile frameworks for comparative analysis of virus genetics and phenotypes, disease epidemiology and clinical manifestations, and population demography and environment, thus facilitating identification of a possible interplay between these various aspects in disease dynamics (Figure 1). There has been active research into methodology for integrating multidimensional data related to pathogen and disease for statistical inference, especially in [...]

[...] structure could be easily distorted by including sequences that carry a high number of sequencing errors or mutations introduced over a longer passage history in virus culture. Another issue is the variable intensity of virus genome sampling in different regions. This is seen in the SARS-CoV-2 data set used by Lu et al., which has relatively few genomes from outside Guangdong province (e.g. 32 genomes from Hubei province, which had over 60,000 cases at the time). Such undersampling in regions with high disease incidence could result in phylogenetic analyses that underestimate virus exportation from these regions to other regions [10] such as Guangdong, and hence the extent of local transmission in Guangdong might be overestimated [1].
Notably, the SARS-CoV-2 genome data set publicly available at the time of this writing remains highly uneven across different countries. For instance, the GISAID database (https://www.gisaid.org) had only 5 full genome sequences from Iran, where over 120,000 cases had been reported; in contrast, it had about 1,300 genomes from around 7,000 cases reported in Australia. Therefore, any interpretation of phylogenetic analyses involving those undersampled regions has to be highly cautious. All these sequence limitations undermine confidence in interpretations based on phylogenetics, and particularly call for a conservative approach when explaining results estimated from complex phylogenetic models and multidimensional data.

With many countries investing their efforts in the genomic surveillance of SARS-CoV-2, the data shared on the GISAID public database have now reached 25,995 full genomes at an unprecedented speed. Although there is no doubt that the challenges of low phylogenetic resolution and biased sampling will remain, it is anticipated that future research on these SARS-CoV-2 genomes will take on these challenges with more robust statistical methods and cautious data interpretation, and will potentially provide important insights into SARS-CoV-2 transmission and evolution within countries and across the world, to aid more effective control of the disease.
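The unevenness of genome sampling discussed above can be made concrete with a small calculation. The following is only an illustrative sketch: it uses the approximate figures quoted in the text (the Hubei figure comes from the Lu et al. data set, the Iran and Australia figures from GISAID), treats "over" and "around" as point values, and introduces a hypothetical helper, `genomes_per_1000_cases`, that is not part of any cited analysis.

```python
# Approximate (genomes, reported cases) pairs as quoted in the text.
# Case counts are lower bounds or rounded values, so the results are rough.
samples = {
    "Hubei": (32, 60_000),        # Lu et al. data set
    "Iran": (5, 120_000),         # GISAID
    "Australia": (1_300, 7_000),  # GISAID
}


def genomes_per_1000_cases(genomes: int, cases: int) -> float:
    """Sampling intensity: sequenced genomes per 1,000 reported cases."""
    return 1000 * genomes / cases


coverage = {
    region: genomes_per_1000_cases(genomes, cases)
    for region, (genomes, cases) in samples.items()
}

# List regions from least to most intensively sampled.
for region, value in sorted(coverage.items(), key=lambda item: item[1]):
    print(f"{region}: {value:.2f} genomes per 1,000 reported cases")
```

Under these assumed figures, sampling intensity differs by several orders of magnitude between regions, which is why inferred directions of virus movement between them should be read cautiously.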
[1] Genomic Epidemiology of SARS-CoV-2 in Guangdong Province
[2] Coronavirus disease (COVID-2019) situation reports
[3] Evolutionary analysis of the dynamics of viral infectious disease
[4] A pneumonia outbreak associated with a new coronavirus of probable bat origin
[5] Identifying SARS-CoV-2 related coronaviruses in Malayan pangolins
[6] The effect of human mobility and control measures on the COVID-19 epidemic in China
[7] Projecting the transmission dynamics of SARS-CoV-2 through the postpandemic period
[8] Bayesian phylogenetic and phylodynamic data integration using BEAST 1.10
[9] A phylodynamic workflow to rapidly gain insights into the dispersal history and dynamics of SARS-CoV-2 lineages
[10] Genomic epidemiology reveals multiple introductions of Zika virus into the United States