key: cord-0278263-jtmjdcsq
authors: Campbell, Ellsworth M.; Boyles, Anthony; Shankar, Anupama; Kim, Jay; Knyazev, Sergey; Switzer, William M.
title: MicrobeTrace: Retooling Molecular Epidemiology for Rapid Public Health Response
date: 2020-07-24
journal: bioRxiv
DOI: 10.1101/2020.07.22.216275
sha: 05cd7d2cbcb7f33b7828cd63a7fd833213110387
doc_id: 278263
cord_uid: jtmjdcsq

Motivation: Outbreak investigations use data from interviews, healthcare providers, laboratories and surveillance systems. However, integrated use of data from multiple sources requires a patchwork of software that presents challenges in usability, interoperability, confidentiality, and cost. Rapid integration, visualization and analysis of data from multiple sources can guide effective public health interventions.

Results: We developed MicrobeTrace to facilitate rapid public health responses by overcoming barriers to data integration and exploration in molecular epidemiology. Using publicly available HIV sequences and other data, we demonstrate the analysis of viral genetic distance networks and introduce a novel approach to minimum spanning trees that simplifies results. We also illustrate the potential utility of MicrobeTrace in support of contact tracing by analyzing and displaying data from an outbreak of SARS-CoV-2 in South Korea in early 2020.

Availability and Implementation: MicrobeTrace is a web-based, client-side, JavaScript application (https://microbetrace.cdc.gov) that runs in Chromium-based browsers and remains fully operational without an internet connection. MicrobeTrace is developed and actively maintained by the Centers for Disease Control and Prevention. The source code is available at https://github.com/cdcgov/microbetrace.

Contact: ells@cdc.gov

MicrobeTrace handles a variety of file types and formats that are traditionally collected during public health investigations.
Pathogen genomic information can be integrated as raw genomic sequences, genetic distance matrices, pairwise genetic distances, or phylogenetic trees. Epidemiologic and other metadata about cases (node lists) and their high-risk contacts (edge or link lists) can be integrated as spreadsheets. Importable in a variety of file formats, these file types can be visualized independently or in concert to achieve different analytic goals (Fig. 1). Early in an outbreak investigation, high-risk contacts can be combined with other epidemiologic information to visualize and characterize a risk network. When genomic data become available later in the investigation, genetic networks can be integrated to visualize concordance between epidemiologic and laboratory data sources. Alternatively, all available data sources can be integrated to construct a more holistic visualization of an ongoing public health investigation.

The information processing technology within MicrobeTrace is well adapted for use in a public health setting because it prioritizes the confidential but effective use of sensitive data collected during an outbreak investigation. MicrobeTrace was developed as a client-side-only application that is incapable of transmitting any user data over the internet. In contrast, most web-based bioinformatic applications require that the user's data be submitted over the internet for processing by a remote server-side application before results can be returned to the user. Local processing is achieved through open source development and translations of traditional bioinformatic algorithms to align (Boyles, 2019a; Li, 2014; Smith, et al., 1981), compare (Boyles, 2019b; Pond, et al., 2018; Tamura and Nei, 1993), and evaluate genomic sequences and their relationships to one another (Boyles, 2019d; Fourment and Gibbs, 2006; Knyazev, 2020; Kruskal, 1956).
Importantly, the sequence (a) alignment, (b) comparison, (c) phylogeny, and (d) network evaluation methods are recapitulations of established methods and do not constitute novel development. Therefore, to the best of our knowledge, the results derived from these JavaScript methods are interchangeable with results derived from their respective native implementations. A novel extension of the network evaluation method is described below in section 3.4 as the 'Nearest Connected Neighbor'.

Visualizations must be generated with care during an outbreak investigation to ensure confidential and narrow use of sensitive data. Personally identifiable information (PII) and other sensitive information like geospatial coordinates, zip codes, and phone numbers should only be accessible to Disease Investigation Specialists conducting contact tracing interviews. However, an epidemiologist performing a retrospective analysis can use the same visualization layout with remapped labels, colors, shapes and sizes. Indeed, sensitive geocoordinates can still be used confidentially to produce informative maps by applying the random 'jitter' function in MicrobeTrace to reduce the precision of the displayed map marker. In concert, these diverse and accessible controls enable public health experts to safely and confidently leverage sensitive data without risk to the public's confidentiality.

To demonstrate the bioinformatics capacity of MicrobeTrace, we used a publicly available HIV-1 data set consisting of 1,164 sequences of the partial polymerase (pol) region (GenBank accession numbers KX465238-KX467180) from a recent study in Germany, in addition to associated metadata describing behavioral risk factors and gender (Pouran Yousef, et al., 2016). Partial pol sequences are typically collected to monitor antiretroviral drug resistance as part of care and treatment for persons living with HIV infection.
The bioinformatics workflow for genetic distance networks in MicrobeTrace begins with a pairwise alignment of each input sequence against a reference, according to the Smith-Waterman algorithm (Boyles, 2019a; Li, 2014; Smith, et al., 1981). Multiple sequence alignment is too time-consuming and is not used. A user can align to a curated reference, an arbitrary custom reference, or the first input sequence. For HIV-1, the strain HXB2 from the United States (U.S.) is a common reference sequence (GenBank accession number K03455). Once aligned, pairwise genetic distances are calculated according to either a raw Hamming distance or the Tamura-Nei substitution model (TN93) (Boyles, 2019d; Pond, et al., 2018; Tamura and Nei, 1993). When the TN93 substitution model is selected, handling of ambiguous bases can be configured as previously described (Pond, et al., 2018). Pairwise genetic distances can be easily filtered by a user-defined threshold, in this case 1.5% nucleotide substitutions per site (Fig. 2A). Notably, users are given the tools necessary to identify and select the distance threshold value that best fits their public health use case (Wertheim, et al., 2017). In some situations for HIV-1, a conservative threshold of 1.5% genetic distance might be appropriate to best understand the historical evolution of recent transmission events (Wertheim, et al., 2014). A more stringent TN93 threshold of 0.5% is often used to identify the most recent and rapid clusters of HIV-1 transmission (Fig. 2B). Threshold determinations are often informed by cluster size and growth rate criteria (Erly, et al., 2020; France and Oster, 2020; Oster, et al., 2018). MicrobeTrace offers the ability to filter by genetic distance and cluster size thresholds in the same 'Global Settings' menu.
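The distance, threshold, and cluster-size steps described above can be illustrated with a minimal Python sketch. It is not MicrobeTrace's JavaScript implementation: the function names are hypothetical, the raw Hamming distance is expressed as a fraction of aligned sites, every character mismatch (including gaps and ambiguity codes) counts as a difference, and a link is kept when its distance is at or below the threshold, which is an assumption about the comparison used.

```python
from itertools import combinations


def hamming_fraction(a, b):
    """Fraction of aligned positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    if not a:
        return 0.0
    # Simplification: gaps and ambiguity codes are compared as plain characters.
    return sum(x != y for x, y in zip(a, b)) / len(a)


def threshold_links(seqs, threshold):
    """Return (id1, id2, distance) for each pair at or below the threshold.

    seqs maps a sequence identifier to its aligned sequence string.
    """
    links = []
    for (i, s), (j, t) in combinations(sorted(seqs.items()), 2):
        d = hamming_fraction(s, t)
        if d <= threshold:  # assumption: inclusive comparison
            links.append((i, j, d))
    return links


def clusters(node_ids, links):
    """Group nodes into connected components with a small union-find."""
    parent = {n: n for n in node_ids}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    for a, b, _ in links:
        parent[find(a)] = find(b)
    groups = {}
    for n in node_ids:
        groups.setdefault(find(n), set()).add(n)
    return list(groups.values())


def visible_nodes(node_ids, links, min_size):
    """Nodes that belong to a cluster with at least min_size members."""
    return {n for c in clusters(node_ids, links) if len(c) >= min_size for n in c}


# Toy aligned sequences (illustrative only, not real HIV-1 data).
example = {
    "A": "ACGTACGTAC",
    "B": "ACGTACGTAT",
    "C": "ACGTACGTAC",
    "D": "TTTTTTTTTT",
}
example_links = threshold_links(example, threshold=0.15)
print(example_links)
print(visible_nodes(example, example_links, min_size=3))
```

Applying the distance threshold first and the cluster-size filter second mirrors the order described in the text: singletons and small clusters are hidden only after links above the threshold have been removed.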
Here, using the German HIV-1 dataset, we have filtered for clusters of size N ≥ 5 after the 1.5% genetic distance threshold is applied. This filter hides 73.1% (N = 851) of individuals that are too genetically distant to cluster with any other sequences in the sample as well as 17.9% (N = 208) of individuals whose HIV-1 sequences reside in clusters of size N ≤ 4. HIV-1 sequences from the remaining 9.0% (N = 105) of individuals are displayed as genetic distance networks in Figure 2. Variables of interest can be readily mapped to the nodes or links, including HIV-1 pol drug resistance mutations to identify clusters of transmitted drug resistance (Fig. 2C).

A simple nucleotide substitution model is not always suitable to understand phylogenetic relationships. Rather than require the use of a single model, MicrobeTrace supports the integration of precomputed distance matrices and pairwise distance lists. A user can provide any precomputed pairwise distances, regardless of the underlying nucleotide substitution model, as a list or a matrix in order to render those data as a network. For distance matrices, both full matrix and PHYLIP formats are accepted.

MicrobeTrace also provides a novel and simple filtering algorithm to render only the nearest connected genetic neighbor(s) for each node, while still maintaining cluster connectivity. Where a node has two genetically equidistant nearest neighbors, both links are rendered when the 'Nearest Connected Neighbor' filter is applied. This approach is particularly useful to understand the historical context of an entire cluster, while focusing on the part of the cluster exhibiting the most concerning and rapid growth. For example, an HIV cluster in rural southeastern Indiana grew rapidly in 2015 but underwent slow growth for nearly a decade prior (Campbell, et al., 2017).
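One way to keep each node's nearest link(s), retain ties, and preserve cluster connectivity is a Kruskal-style pass that keeps every link belonging to at least one minimum spanning tree. The Python sketch below illustrates that idea under stated assumptions; it is not taken from the MicrobeTrace source, the function name is hypothetical, and ties are detected by exact equality of the distance values.

```python
from itertools import groupby


def union_of_minimum_spanning_trees(nodes, links):
    """Keep every link that belongs to at least one minimum spanning tree.

    nodes: iterable of node identifiers.
    links: iterable of (u, v, distance) tuples.
    A link of distance d is kept if its endpoints are not already joined
    by links strictly shorter than d, so equidistant alternatives all survive.
    """
    parent = {n: n for n in nodes}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    kept = []
    ordered = sorted(links, key=lambda link: link[2])
    for _, tied in groupby(ordered, key=lambda link: link[2]):
        tied = list(tied)
        # Evaluate all tied links before merging any of them, so that
        # ties are treated symmetrically (assumption: exact equality).
        candidates = [(u, v, d) for u, v, d in tied if find(u) != find(v)]
        kept.extend(candidates)
        for u, v, _ in candidates:
            parent[find(u)] = find(v)
    return kept


# Toy network: A-C and B-C are tied, and A-D is redundant given C-D.
toy_links = [
    ("A", "B", 0.002),
    ("B", "C", 0.004),
    ("A", "C", 0.004),
    ("C", "D", 0.010),
    ("A", "D", 0.014),
]
print(union_of_minimum_spanning_trees("ABCD", toy_links))
```

Because the shortest link incident to any node always lies on some minimum spanning tree, this filter never removes a node's nearest neighbor link, and a distance threshold can still be applied beforehand by passing in only the links that satisfy it.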
The nearest connected neighbor method yields results similar to a non-exhaustive search for all minimum spanning trees, as has been previously described (Bbosa, et al., 2020; Campbell, et al., 2017). The threshold and nearest connected neighbor filters are not mutually exclusive and can therefore be applied simultaneously to ensure that genetically distant nodes remain disconnected. This enables the inclusion of related, but more distant sequences in a cluster visualization while minimizing the information overload that typically accompanies increased distance thresholds (as shown in Fig. 2A). HIV-1 genetic distance links that fell below the 1.5% threshold but were not included as a nearest connected neighbor link are shown at reduced opacity (Fig. 2C).

Phylogenies are ubiquitous in public health and bioinformatics, but a phylogeny may be difficult to integrate with more traditional contact tracing data. While powerful new tools are available to integrate taxa-level characteristics into phylogenies, integration of paired contacts is unavailable. Instead, the genetic distances encoded on the phylogeny must be measured and recast as pairwise patristic distances. Specifically, these are tip-to-tip measurements between individuals on an evolutionary tree that account for the most recent common ancestor. This step is necessary because it results in a pairwise genetic distance list that is readily integrated with pairwise contact data. Given a phylogenetic tree in Newick format, MicrobeTrace will traverse the phylogeny to calculate and render the pairwise patristic distance network corresponding to that phylogeny.

Importantly, phylogenies or pathogen genetic sequence data are not required to leverage MicrobeTrace to visualize public health data.
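The tip-to-tip patristic distance calculation described above can be sketched in Python as follows. This is an illustration rather than MicrobeTrace's own traversal: the parser is a hypothetical minimal one that handles only unquoted names and optional branch lengths, internal node labels are ignored, and branch lengths are assumed to be non-negative.

```python
from itertools import combinations


def parse_newick(text):
    """Parse a simple Newick string into parent links and branch lengths.

    Returns (parent, length, leaves), where parent and length are keyed by
    an integer node id and leaves maps each tip name to its node id.
    Assumption: no quoted labels, comments, or malformed input.
    """
    text = text.strip().rstrip(";")
    parent, length, leaves = {}, {}, {}
    pos = 0
    next_id = 0

    def read_node(par):
        nonlocal pos, next_id
        node = next_id
        next_id += 1
        parent[node] = par
        is_leaf = True
        if pos < len(text) and text[pos] == "(":
            is_leaf = False
            pos += 1  # consume "("
            while True:
                read_node(node)
                if text[pos] == ",":
                    pos += 1  # another child follows
                    continue
                pos += 1  # consume ")"
                break
        start = pos
        while pos < len(text) and text[pos] not in ",():":
            pos += 1
        name = text[start:pos].strip()
        length[node] = 0.0
        if pos < len(text) and text[pos] == ":":
            pos += 1
            start = pos
            while pos < len(text) and text[pos] not in ",()":
                pos += 1
            length[node] = float(text[start:pos])
        if is_leaf:
            leaves[name] = node
        return node

    read_node(None)
    return parent, length, leaves


def patristic_distances(newick):
    """Return {(tip_a, tip_b): distance} for every pair of tips."""
    parent, length, leaves = parse_newick(newick)

    def path_to_root(node):
        # Distance from the tip to itself and to each of its ancestors.
        dist, out = 0.0, {}
        while node is not None:
            out[node] = dist
            dist += length[node]
            node = parent[node]
        return out

    paths = {name: path_to_root(node) for name, node in leaves.items()}
    result = {}
    for a, b in combinations(sorted(leaves), 2):
        # With non-negative branch lengths, the shared ancestor that
        # minimizes the summed distance is the most recent common ancestor.
        shared = (n for n in paths[a] if n in paths[b])
        result[(a, b)] = min(paths[a][n] + paths[b][n] for n in shared)
    return result


print(patristic_distances("((A:0.1,B:0.2):0.05,C:0.3);"))
```

The resulting dictionary has the same shape as a pairwise distance list, which is what allows a phylogeny to be combined with pairwise contact data.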
MicrobeTrace supports the visualization of arbitrary networks, such as those collected during contact tracing in an outbreak or cluster investigation. Acceptable networks are not limited to person-to-person links but can include person-to-place or place-to-place links. To visually differentiate persons from places, MicrobeTrace can style the shape of any network node according to a node type column (e.g., nodeType = 'Person' or 'Place') defined in the data set. If additional metadata are available to describe a link, it can be colored according to user-defined categorical variables. Alternatively, an option is provided to scale link width according to a user-defined numeric variable or its reciprocal.

To demonstrate the generalized visualization capacity of MicrobeTrace, we present a publicly available data set describing clinical, demographic and contact tracing data derived from the Korean Centers for Disease Control (KCDC) investigation of the COVID-19 outbreak (Kim, 2020). The data set does not contain coronavirus sequence data, but instead details 383 transmission histories between 510 cases. It also contains an additional 1,627 cases of COVID-19 with no documented transmission histories. As before, using filtering capabilities unique to MicrobeTrace, we limit our visualizations to transmission clusters of size ≥ 5 cases (Fig. 3).

[...] transmission network, the symptom onset incidence curve, and a geospatial map with transmission network overlay (Fig. 3). Here, we perform the following visual manipulations within MicrobeTrace: (1) automatically calculate and map the number of contacts for each case to the label that is centered over each node (Fig. 3A), (2) map the node color to the case's province (Fig. 3A-D), (3) map link color to the mode of exposure (Fig. 3A-D), (4) map node shapes to the case's gender (Fig. 3A), (5) superimpose the network onto a high-resolution geospatial 2D map projection (Fig. 3B-C), (6) tailor color, size and transparency to desired values (Fig. 3B-C), and (7) generate an incidence curve according to the date of symptom onset (Fig. 3E).

Indeed, MicrobeTrace can be used to achieve rich visualizations using a list of nodes with a handful of variables like age, gender, province, city, exposure type, symptom onset date, test confirmation date and hospital release date. We demonstrate the construction of complex figures like a Flow Diagram, Gantt Chart, Cross-tabulation, Aggregation, and Histogram with simple dropdown menus (Fig. 4). Additional diagrams can be achieved with the 2D Network, 3D Network, Scatter Plot, Heatmap, Bubbles, Choropleth, and Globe Views, with relevant data types selected from simple dropdown menus.

When sequence data are available, a variety of additional diagrams and views are available. For example, the Sequences View can be used to export or check the quality of the pairwise alignment. The Phylogenetic Tree View will construct a tree via a neighbor-joining algorithm according to the provided pairwise distance calculations. The Phylogenetic Tree View has robust customization controls that have been modularized in a separate JavaScript library called TidyTree (Boyles, 2019c).

Public health investigations are iterative and the underlying data sources tend to grow over time. Once MicrobeTrace workspaces have been customized they can be saved in two ways: (1) as a custom [...] between collaborators and preserve confidentiality. Style files can also be used to ensure continuity between public health investigations, such that different investigations yield identically styled visualizations even with different underlying data.
Communicating data arising from public health investigations is a complex process that requires many fine adjustments, as messages are tuned to their audiences. To meet this need, MicrobeTrace is designed to provide users maximum control over visualization customization and export capabilities. For example, communication to academic and public health audiences often involves poster presentations that require images to be scaled up for large-format printing. We accommodate this requirement by enabling users to set specific export resolutions for PNG and JPEG formats. Alternatively, visualizations can be exported as Scalable Vector Graphics (SVGs) that can be enlarged to any arbitrary size without a loss of resolution. By default, a MicrobeTrace watermark is placed on images exported from MicrobeTrace; however, the transparency of the watermark can be increased using a menu slider to render it invisible. Taken together, these capabilities offer publication-ready image exports for scientific journals.

[...] opposed to during the weekend. The number of monthly weekday users is shown in red and the number of monthly weekend users in teal. Each month's mean daily user count is mapped to the size of the circle and colored by day type. A local regression for each day type is shown to smooth the month-to-month effects and highlight the increasing trend.

A notable influx of MicrobeTrace usage occurred in late April 2020 (data not included in figure), simultaneously across nine cities in Vietnam over a span of two local afternoon hours. This brief influx of traffic from a single country, spread across disparate geography, is suggestive of workforce development efforts. If true, this would represent the first clear evidence of a training webinar held by non-CDC staff.
Following this training event, the fraction of returning users was three times higher than MicrobeTrace's historical fraction of returning users (64% versus 21%). Further, the average session duration was also nearly three times higher (20.1 min versus 7.3 min) than the historical average session duration.

[...] broadly used to integrate genomic and epidemiologic data for tuberculosis outbreak investigations (Springer, 2020). It has also been used to integrate partner services, epidemiologic and whole genome data to better understand transmission during a retrospective public health investigation of Neisseria gonorrhoeae (Town, et al., 2020). Outside of its intended domain of sexually transmitted diseases, MicrobeTrace has also been applied to integrate epidemiologic and laboratory data in outbreaks of foodborne pathogens, such as Escherichia coli O157:H7 (Allen, 2020). It is currently being evaluated for integration and visualization of epidemiologic and genetic data from cases of Ebola and COVID-19 (S. Whitmer, personal communication; S. Tong, personal communication).

Leaflet: an open-source JavaScript library for mobile-friendly interactive maps Visualizing sequence data and epidemiological data together using MicrobeTrace Foodborne Outbreak Response and Management Conference Applied Maths.
BioNumerics version 5.10 Microreact: visualizing and sharing data for genomic epidemiology and phylogeography Gephi: an open source software for exploring and manipulating networks Phylogenetic and Demographic Characterization of Directed HIV-1 Transmission Using Deep Sequences from High-Risk and General Population Cohorts/Groups in Uganda Detailed Transmission Network Analysis of a Large Opiate-Driven Outbreak of HIV Infection in the United States Phylodynamic Analysis Complements Partner Services by Identifying Acute and Guest Editorial Special Section on Cloud Computing Computing, Internet of Things, and Big Data Analytics Applications for Healthcare Industry A data-supported history of bioinformatics tools MicrobeTrace : The Visualization Multitool for Molecular Epidemiology and Bioinformatics Rooftop Recommendations #02: MicrobeTrace. In.: Centers for Disease Control and Prevention Notes from the field: HIV diagnoses among persons who inject drugs-Northeastern The igraph software package for complex network research. InterJournal, complex systems Characterization of Molecular Cluster Detection and Evaluation of Cluster Investigation Criteria Using Machine Learning Methods and Statewide Surveillance Data in Washington State CLUSTERING OF HEPATITIS C VIRUS INFECTION AMONG PEOPLE Conference on Retroviruses and Opportunistic Infections PATRISTIC: a program for calculating patristic distances and graphically comparing the components of genetic change The Promise and Complexities of Detecting and Monitoring HIV Integrating Advanced Molecular Technologies Twenty years of West Nile virus spread and evolution in the Americas visualized by Nextstrain: real-time tracking of pathogen evolution Exploring network structure, dynamics, and function using Los Alamos National Lab.(LANL) BioEdit: a user-friendly biological sequence alignment editor and analysis program for Windows 95/98/NT.
In, Nucleic acids symposium series HIV TRANSMISSION POTENTIAL DUE TO INJECTION DRUG USE IN RURAL Conference on Retroviruses and Opportunistic Infections Matplotlib: A 2D Graphics Environment MOLECULAR SURVEILLANCE AS A MEANS TO EXPAND AN OUTBREAK INVESTIGATION: MA Data-Science-for-COVID-19 epsilon Minimal Spanning Trees (eMST) /4/2 date last accessed) On the Shortest Spanning Subtree of a Graph and the Traveling Salesman Problem MEGA: a biologist-centric software for evolutionary analysis of DNA and protein sequences Playful User Interfaces: Literature Review and Model for Analysis A review of bioinformatic pipeline frameworks bioseq-js GHOST: global hepatitis outbreak and surveillance technology Identifying Clusters of Recent and Rapid HIV Transmission Through Analysis of Molecular Surveillance Data APE: Analyses of Phylogenetics and Evolution in R language HIV-TRACE (TRAnsmission Cluster Engine): a Tool for Large Scale Molecular Epidemiology of HIV-1 and Other Rapidly Evolving Pathogens Inferring HIV-1 Transmission Dynamics in Germany From Recently Fortify Software MicrobeTrace User Manual Clusters of Diverse HIV and Novel Recombinants Identified Among Persons Who Inject Drugs in Kentucky and Ohio. In, 14th Annual International HIV Transmission Workshop. Virology Education Cytoscape 2.8: new features for data integration and network visualization SonarQube.org. 2020. SonarQube.
Release 7.9 Logically Inferred Tuberculosis Transmission (LITT) Algorithm User's Manual -Appendix 3 Building robust systems an essay Estimation of the number of nucleotide substitutions in the control region of mitochondrial DNA in humans and chimpanzees Phylogenomic analysis of Neisseria gonorrhoeae transmission to assess sexual mixing HIV transmission risk in England: a cross-sectional, observational, whole-genome sequencing study The global transmission network of HIV-1 Social and Genetic Networks of HIV-1 Transmission

[...] achievable with an array of software, tools, and custom scripts, and substantive computational experience. A prospective MicrobeTrace user, such as an epidemiologist or disease investigation specialist, typically achieves proficiency after one brief training session, aided by a cursory understanding of common browser interactions, such as 'dropdown menus', 'slider bars', and 'drag-and-drop'. Many standalone tools are available to calculate pairwise genetic distances with varying degrees of specificity to the pathogen of interest. MEGA is a bioinformatic tool broadly used in public health, but new users can be overwhelmed by dense interfaces with scores of options, many laden with jargon and required inputs (Kumar, et al., 2008). HIV-TRACE, which is specific to HIV sequence data, now offers rich visualization capabilities, but local installation and use require a keen understanding of Unix and the Git protocol (Pond, et al., 2018). A version of HIV-TRACE is available on the internet, but it runs on a web server, which has concomitant data security issues (Weaver, et al., 2015). Patristic distance calculations are available via the APE package in R or the Java application PATRISTIC, but these require [...] Python+NetworkX) or generated via opaque plug-ins in Gephi or Cytoscape that offer minimal customizations.
Anecdotally, use of MicrobeTrace and its network layout interface can be playful, a quality that has been shown to improve the user experience and increase users' motivation to use a tool (Kuts, 2009).

While MicrobeTrace has been developed for a public health user base, it also has many applications in academia. It is adept at integrating arbitrary networks with independent node- and edge-level characteristics that are necessary to evaluate social, behavioral, biochemical, cellular, technological and physical networks. MicrobeTrace also offers rich customizations that reduce the time and effort to achieve insights and discoveries when grappling with a novel data set. The MicrobeTrace development team is not aware of another tool that offers all of these capabilities in a secure, interoperable, and lightweight format that requires no installation prior to use.

We are thankful to the CDC's Advanced Molecular Detection initiative for providing intramural funding for this project.