key: cord-0942898-k8m19by9 authors: Bauer, Denis C.; Tay, Aidan P.; Wilson, Laurence O. W.; Reti, Daniel; Hosking, Cameron; McAuley, Alexander J.; Pharo, Elizabeth; Todd, Shawn; Stevens, Vicky; Neave, Matthew J.; Tachedjian, Mary; Drew, Trevor W.; Vasan, Seshadri S. title: Supporting pandemic response using genomics and bioinformatics: A case study on the emergent SARS‐CoV‐2 outbreak date: 2020-05-25 journal: Transbound Emerg Dis DOI: 10.1111/tbed.13588 sha: 582908c20f5d8f90b00ef984c1ee1efba554ac60 doc_id: 942898 cord_uid: k8m19by9 Pre‐clinical responses to fast‐moving infectious disease outbreaks heavily depend on choosing the best isolates for animal models that inform diagnostics, vaccines and treatments. Current approaches are driven by practical considerations (e.g. first available virus isolate) rather than a detailed analysis of the characteristics of the virus strain chosen, which can lead to animal models that are not representative of the circulating or emerging clusters. Here, we suggest a combination of epidemiological, experimental and bioinformatic considerations when choosing virus strains for animal model generation. We discuss the currently chosen SARS‐CoV‐2 strains for international coronavirus disease (COVID‐19) models in the context of their phylogeny as well as in a novel alignment‐free bioinformatic approach. Unlike phylogenetic trees, which focus on individual shared mutations, this new approach assesses genome‐wide co‐developing functionalities and hence offers a more fluid view of the ‘cloud of variances’ that RNA viruses are prone to accumulate. This joint approach concludes that while the current animal models cover the existing viral strains adequately, there is substantial evolutionary activity that is likely not considered by the current models. Based on insights from the non‐discrete alignment‐free approach and experimental observations, we suggest isolates for future animal models. 2020); two of these viruses (H1N1 and SARS-CoV2) have resulted in pandemics within 10 years (WHO, 2020) . The SARS outbreak of 2002-2004, the MERS outbreaks since 2012 and the current COVID-19 outbreak since 2019 demonstrate the potential of coronaviruses, especially bat-derived betacoronaviruses (Zhou et al., 2020) , to cause PHEICs, with COVID-19 having escalated to a global pandemic. Viruses in a new host (humans) have the potential to evolve rapidly and present quasispecies diversity (Eigen, McCaskill, & Schuster, 1988) , which is a hallmark of RNA viruses that exist as a 'cloud of variants' due to low fidelity, high polymorphism and viral polymerases lacking the capability to correct errors (Drew, 2011; Wilke, Wang, Ofria, Lenski, & Adami, 2001) . As a result, most variants are a random accumulation of errors, useful for tracing aetiology, but typically without substantial functional change (Grubaugh, Petrone, & Holmes, 2020) . Unlike most other RNA viruses, coronaviruses express a 3′-to-5′ exoribonuclease that enables the high-fidelity replication of their relatively large 26-32 kb ssRNA(+) genome (Minskaia et al., 2006; Snijder et al., 2003) . Coronaviruses have a moderate mutation rate (0.80-2.38 × 10 -3 nucleotide substitutions per site per year for the SARS-CoV genome; Zhao et al., 2004) allowing a wider evolutionary space to be explored more deliberately. This can complicate the outbreak response in terms of rapid development and evaluation of diagnostics, vaccines, antivirals and antibody therapies as many diverse strains with unknown functional differences exist (Figure 1 ). This is particularly exacerbated by increased movement of people (enabled by global air travel), animals and goods spreading new viruses across the world's population and exposing them to huge variations in environment, demographics, age structure, socio-economic status, co-morbidities and equitable access to health care. The sheer number of these inter-connected influencing factors often makes an unfolding situation hard to comprehend fully and challenges the traditional virology and public health disciplines by rendering them less effective in coping with the spread of the virus. Bioinformatic approaches may be able to better inform epidemiology and responses to trans-boundary viruses, by synthesizing complex information more effectively and systematically. Enabled by the advances in genomic sequencing technology (e.g. Oxford Nanopore, Illumina) and the willingness to share the sequenced information in the public domain (e.g. GenBank, GISAID), bioinformatic approaches developed in the human domain have enabled comparative analysis of the emerging genomic diversity of a virus as it spreads throughout a host population. For the ongoing COVID-19 outbreak, as illustrated in Figure 1 , applying in silico approaches can rapidly provide answers to questions like: Is the first sequenced SARS-CoV-2 genome (often called the 'reference genome' in GISAID, GenBank, etc.) closest to the original or 'true reference' strain which entered humans (which may not have been sequenced and/or which may still be circulating with minimal mutations)? Combined with epidemiological or experimental evidence, bioinformatics can help address questions like 'Are there one or multiple circulating strains and are they different to the most virulent ones' (if yes can we identify the molecular basis)? Finally, F I G U R E 1 Illustration of coronavirus spread while it accumulates mutations. The dark blue arrows represent the main volume of transmissions, while the nucleic acid symbol illustrates mutations acquired by the different viral strains as they enter humans from a primary/reservoir host (represented by the bat symbol) through an intermediate host (which is yet to be identified for SARS-CoV-2). The first human SARS-CoV-2 isolate sequenced (with orange and pink mutation) may not have been the original strain that first infected humans (grey). It is possible that a strain sequenced later (green) may be genetically closer to the original strain. In this scenario, the original strain has not been captured through sequencing at all. It also shows that there may be two currently circulating strains (orange-pink-purple and orange-pink-brown), which in turn might be different from the most virulent one (orange-pink-blue). In the absence of clinical data correlated with SARS-CoV-2 genome isolates, bioinformatic analysis (represented by the computer symbol) can identify clusters and consensus sequences to investigate the genetic diversity of the emerging SARS- Answers to these questions are not just of academic interest; they can help inform outbreak response and development and evaluation of diagnostics, vaccines and countermeasures. For instance, outbreak response efforts including vaccine evaluation should focus on the most prevalent circulating strain (lest we may end up rejecting acceptable candidates by raising the efficacy threshold too high), but it would be desirable to evaluate therapeutics against the most virulent strain. However, with the field not yet using bioinformatic tools to their fullest potential, current data collection practices are not capturing the information necessary for advanced analysis. For example, pathogen sequences derived from patients are typically not annotated with deidentified disease progression and other clinically relevant information, and this in turn hampers the identification of the most virulent strain or associating pathogen properties with genomic modifications. Here, we use bioinformatic analysis to identify emerging trends among the SARS-CoV-2 isolates. From this, we sought to identify the most representative strains for animal models and pre-clinical research, in particular, which strains among the emerging Australian isolates are good choices for animal model development and which ones are not representative. Two Australian isolates (BetaCoV/Australia/VIC02/2020 and BetaCoV/ Australia/SA01/2020) were sequenced with the MiSeq platform (Illumina, Inc). In brief, RNA was purified from each isolate using a Directzol RNA Miniprep kit (Zymo Research). Purified RNA was reverse transcribed using a TaqMan Reverse Transcription kit (Applied Biosystems) with random octamers linked to a specific primer sequence, followed by second-strand cDNA synthesis using Klenow DNA Polymerase I (Promega). Complementary DNA was further amplified (using primers specific to the sequences added to the random octamers used for reverse transcription) with a KAPA HiFi HotStart kit (Roche). Resulting DNA was purified using a DNA Clean and Concentrator kit (Zymo Research). Fragmentation and dual-index library preparation was conducted using a Nextera XT DNA Library Preparation Kit (Illumina, Inc), and denatured libraries were sequenced using a 300-cycle MiSeq Reagent kit v2 (Illumina, Inc). Sequencing reads were trimmed for quality and mapped to the published reference sequence (BetaCoV/Wuhan-Hu-1/2019; GenBank accession number NC_045512.2) using Geneious 11.1.4. Consensus genome sequences for the isolates were generated for analysis. All available viral sequences were downloaded from GISAID (on 05/03/2020, see Appendix S3 (Elbe, & Buckland-Merrett, 2017) ), filtering for complete sequences of human origin (187 genomes in total). Low-quality sequences (defined as sequences with an N content greater than 1%) were filtered out leaving 178 strains in total. We also included one recently reported viral sequence from the European Virus Archive global (Ref-SKU: 026V-03883) and the two Australian sequences in the full data set. This data set of 181 sequences was aligned against each other using Muscle (v3.8.31) (Madeira et al., 2019) . Based on the alignments, we could see significant variation at the ends of the sequences ( Figure S1 ), likely due to sequencing artefacts and errors preventing the full viral genomes from being sequenced. To reduce the impact of potential sequencing errors, we wrote a custom perl script to trim the sequences such that only positions with 95% coverage were retained. This trimming entailed removing the first 101 and last 72 bp from the alignments. Once trimmed and converting all 'U' to 'T', we identified 54 sequences that were identical. To limit the effect of duplicate sequences on subsequent analysis, we collapsed these into a single entry. These collapsed sequences are summarized in Table S1 . The maximum likelihood phylogenetic tree was generated from the above alignments using RAxML-NG (Kozlov, Darriba, Flouri, Morel, & Stamatakis, 2019 ). The evolutionary model used was a general time reversable model with gamma-distributed rate heterogeneity and invariant sites (GTR + G + I). We use this mode since it is the most general model and has been proposed to produce equal trees to optimally selected models (Abadi, Azouri, Pupko, & Mayrose, 2019) . The tree was visualized using iTOL (Letunic & Bork, 2019 ) as a midpoint rooted tree and shows the likely evolutionary relationships between the sampled strains. Every organism and potentially isolate can have a unique genomic signature based on the composition of their genomic sequence. To quantify this signature, we determined the K-mer frequency. Counting of all possible strings of length k in the sequence of the virus has emerged as an alternative to phylogenetic trees in other disciplines (Sims, Jun, Wu, & Kim, 2009 ). The conceptual distance between all isolates can then be visualized by running a principal component analysis (PCA) over all genomic signatures to reduce this high-dimensional K-mer frequency vector into a two-dimensional space (Jolliffe & Cadima, 2016) . Please see Appendix S1 for more details of this method. Custom scripts were used to calculate the K-mer frequency for each sequence using a k of 10 (Sims et al., 2009 ). K-mers containing ambiguous bases (i.e. N's) were removed. We then calculated the relative proportion of each K-mer, resulting in a frequency vector. We used the PCA implementation of Python scikit-learn to reduce the genomic signatures containing 1,048,576 10-mer proportions into a vector containing two principal components. Finally, custom scripts were used to compare the genomic signatures for all aforementioned coronavirus sequences. The evolutionary structure of the 181 isolates as determined by the phylogenetic tree reveals three major clusters (C1-C3) ( Figure 2 ). The C1 cluster mainly represents Wuhan and isolates captured early on; C2 and C3 contain later isolates, such as Sydney/3, Australia/VIC01 and France/IDF0372 in C2 and Australia/NSW01, Australia/QLD01-3, Australia/VIC02 and USA/WA1 in C3. The three clusters are separated by distinct mutations (Table 1) but contain a substantial number of other unique mutations, which we define as diversity within the cluster. These individual mutations are outside of the established hotspots of diversity as shown in Figure S4 , which are predominantly of concern for PCR primer design rather than aetiology. There may also be three additional clusters emerging (C4-6) with C4 capturing the suspected community spread from Lombardy ('Narrative: Genomic analysis of COVID-19 spread', n.d.), C5 regionally mixed (Asia and North Amerika) and C6 from Australia and Asia, notably Australia/NSW05-7 (see fully annotated tree in Figure S5 ). This finding is different to Tang et al. (2020) , who postulate two clusters (S and L). Their analysis was only on 103 GISAID isolates and includes betacoronaviruses from bats, which roots the tree differently and merges C1 and C2. However, as the aetiology is not fully demonstrated, especially with the intermediate vector yet unknown, F I G U R E 2 Phylogenetic tree highlighting isolates of interest with branch points of the six clusters labelled to indicate mature (orange) and emerging (yellow) disease clusters (full list of identical sequences for these branch points are in Table S1 and complete image in Figure S5 ) J a p a n /A I/ I-0 0 4 Irrespective of the root placement, both trees allow the assessment of individual isolates that are not part of major branches by being genetically divergent offshoots. For example, Sydney/2 appears to be an off-shoot from Wuhan-Hu-1. Upon further inspection of the Sydney/2 strand, we discovered a 41-bp deletion, which overlaps the infectious bronchitis virus' (IBV) motif at the 3′ end (Goebel, Cluster Mutations in common Note: Table lists the isolate of note for this paper and collects the information from Tables S1 and S2 for easy access. The third column counts the number of differences to Wuhan-HU-1 for the full and (core sequences), in a condensed (deletions count as 1)/and full way. Taylor, & Masters, 2004) . Inspecting other isolates, Australia/VIC01 has also a 10-bp deletion within the genomic location of the 41-bp deletion ( Figure S3 ). Despite this similarity, Australia/VIC01 was placed into C2 and Sydney/2 in C1 since Australia/VIC01 contains the G26144T mutation indicative of C2 which Sydney/2 lacks (See Table 2 for a list of referenced isolates). A more systematic sequence analysis revealed that many isolates have deletions in their core genome (Table S2) . While these deletions appear to be specific to each isolate ( Figure S4 ), their effect on virus structure or function might be pronounced. For example, Australia/ VIC01 and Sydney/2 may have the IBV motif disrupted with implications for replication, while USA/CA6 and Japan/AI-004 may have disruption in the non-structural protein 1 (nsp1), with implications for host gene expression ( Figure S3 ). While the full impact of these genomic variations can only be confirmed through functional genomics experiments, coronaviruses' proof-reading ability likely lifts them above random accumulation of errors. As such, having a methodology able to take deletions into consideration when calculating genomic distance is desirable. Aiming to overcome the limitations of phylogenetic tree approaches, we also investigated whether an alignment-free approach can be used to understand how the genomic content of different SARS-CoV-2 isolates changes over time and with respect to each other by also taking deletions into account. We therefore calculated the frequency of all possible 10-mers across each viral genome followed by principal component analysis (PCA) to reduce the high-dimensional K-mer vector space into a two-dimensional image for visual comparison. We first demonstrate the methods' ability to faithfully separate distant coronavirus strains by comparing all SARS-CoV-2 against 17 SARS and 6 MERS isolates. As shown in Figure 3 ( C3 G e rm a n y /B a v P a t + H u m a n 2 0 1 9 -n C o V phylogenetic results and reflects the fact that these sequences have only mutational changes in their core sequences compared to Wuhan-Hu-1 (Table S2) . However, this alignment-free method positions isolates with deletions (viz. Australia/VIC01 and Canada/ON-VIDO-01) further away from Wuhan-Hu-1 than in the phylogenetic tree, demonstrating the ability of the K-mer method to represent deletions accurately (see Supplemental Table 7) . Of the 181 isolates, we found that Singapore/4, Taiwan/NTU01, Finland/1, USA/IL1 and Shenzhen/SZTH-001 were among the furthest from Wuhan-Hu-1 (data not shown). For both Singapore/4 and Taiwan/NTU01, this is due to these being shorter than the core sequences, so the K-mer fingerprint accurately reflects the missing sequences. Meanwhile, USA/IL1 and Finland/1 contained several ambiguous bases effectively shortening the length of similar sequence, so the method correctly places them further from the other isolates. While the K-mer approach is not as suggestive as phylogenetic trees with respect to visualizing the potential transmission route (e.g. Lombardy), it may more accurately reflect the fluidity of changes ('cloud of variants') and capture recombination events (Graham & Baric, 2010 ). An example is Italy/INMI1, which has mutations in common with both Sydney/3 (C1 cluster, G26144T) and Chongqing/IVDC-CQ-001 (C3 cluster, G11083T) making it impossible to definitively place it in the discrete phylogenetic tree (it was placed in the C2 cluster, Figure 2) , while the PCA plot shows the fluid evolution placing it between the Italian and Australian isolate ( Figure 3 ). More generally, phylogenetic analysis is based on the presence of shared mutations; for example, two strains which share most SNPs are likely to be closely related and therefore exist on neighbouring branches in the phylogenetic tree. In comparison, an alignment-free method (such as K-mer signatures) is more concerned with global similarities and difference, for example changes averaged across the whole genome rather than at specific locations. This can be informative about high-level similarities between genomes, for example evolution of distinct genomic islands with common functions or recombination events. While this does not create visually clear clades, it offers a more holistic representation of pair-wise distances between all isolates. Together with clinical information, this could hence help determine the most virulent strains (Figure 1 ). Currently, the process of determining which isolate to use for animal models is less informed than it could be due to the lack of shared genomic information and readily available bioinformatic methodologies, especially towards the beginning of an outbreak. For example, while China published the genomic sequence of SARS-CoV-2 (Zhou et al., 2020) , patient samples or virus isolates were not made available, and with reverse genetics of a large RNA virus taking time (Thao et al., 2020) , countries had to wait for imported cases. Australia's Doherty Institute was the first in the world to isolate the virus (Australia/VIC01) and made it available for pre-clinical animal models to the authors (CSIRO, 2020), Public Health England, etc. This practical consideration dictated the initial choice, however, with more isolates to choose from subsequently a more informed approach can be taken (Figure 1 According to the phylogenetic tree shown in Figure 4a , the main clusters are represented by the current animal models. That is, C1 is represented by Germany/BavPat1 and Human 2019-nCoV (whose core sequence is identical), though they seem to be half-way to the C4 cluster (suspected Lombardy cluster). C2 is represented by Australia/VIC01 and France/IDF0372, and C3 is represented by USA/WA1. Note that Scotland/CVR01 was removed due to its high N content (2.3%). However, with phylogenetic methods focusing on the presence of shared mutations rather than the overall genomic similarity (e.g. shared evolved functionality), the assumption that current animal models adequately cover the evolutionary space of the actively circulating virus might be misleading as the PCA plot indicates ( Figure 4b) . Here, the top-left and bottom-right areas seem to be underrepresented. To join the strain-etiology of phylogenetic approaches with the more fluid distance measure of alignment-free methods, we created the consensus sequences of the major and emerging clusters (Appendix S2) and re-ran the PCA analysis to find more robust and future relevant isolates (Figure 4b ). (Callaway, 2020) . Moving away from purely practical considerations towards a more deliberate approach that assesses current and future characteristics of isolate choices will lead to a better coverage of actively circulating strains ( Figure 1 ). With more sharing of isolates internationally and information collected about patient-deidentified details of case severity and outcome, more sophisticated machine learning approaches can be generated to assist in triaging and treatment choices. Additional information, such as co-morbidities, socio-economic and smoking status, may also help in anticipating public health demand. Furthermore, releasing the full high-throughput sequencing data sets rather than the consensus sequences would allow a more detailed exploration of the existing quasispecies to further improve isolate selection. The two Australian isolates not yet in GISAID were kindly pro- The authors declare no conflict of interest. Ethical Statement is not applicable. The data that support the findings of this study are available in GISAID at https://www.gisaid.org/. Denis C. Bauer https://orcid.org/0000-0001-8033-9810 Seshadri S. Vasan https://orcid.org/0000-0002-7326-3210 Model selection may not be a mandatory step for phylogeny reconstruction The emergence and evolution of swine viral diseases: To what extend have husbandry systems and global trade contributed to their distribution and diversity? Revue Scientifique et Technique de l'OIE Molecular quasi-species Data, disease and diplomacy: GISAID's innovative contribution to global health Ebola public health emergency of international concern, democratic republic of the congo Note from the editors: World Health Organization declares novel coronavirus (2019-nCoV) sixth public health emergency of international concern The 3' cis-acting genomic replication element of the severe acute respiratory syndrome coronavirus can function in the murine coronavirus genome Recombination, reservoirs, and the modular spike: Mechanisms of coronavirus cross-species transmission We shouldn't worry when a virus mutates during disease outbreaks Principal component analysis: A review and recent developments RAxML-NG: A fast, scalable and user-friendly tool for maximum likelihood phylogenetic inference Interactive tree of life (iTOL) v4: Recent updates and new developments The EMBL-EBI search and sequence analysis tools APIs in 2019 Discovery of an RNA virus 3'->5' exoribonuclease that is critically involved in coronavirus RNA synthesis Alignment-free genome comparison with feature frequency profiles (FFP) and optimal resolutions Unique and conserved features of genome and proteome of SARS-coronavirus, an early split-off from the coronavirus group 2 lineage On the origin and continuing evolution of SARS-CoV-2 Rapid reconstruction of SARS-CoV-2 using a synthetic genomics platform The establishment of reference sequence for SARS-CoV-2 and variation analysis WHO Director-General's opening remarks at the media briefing on COVID-19 Evolution of digital organisms at high mutation rates leads to survival of the flattest Moderate mutation rate in the SARS coronavirus genome and its implications A pneumonia outbreak associated with a new coronavirus of probable bat origin Supporting pandemic response using genomics and bioinformatics: A case study on the emergent SARS-CoV-2 outbreak