key: cord-0311064-nrznkrxm
authors: Cantrell, Kalen; Fedarko, Marcus W.; Rahman, Gibraan; McDonald, Daniel; Yang, Yimeng; Zaw, Thant; Gonzalez, Antonio; Janssen, Stefan; Estaki, Mehrbod; Haiminen, Niina; Beck, Kristen L.; Zhu, Qiyun; Sayyari, Erfan; Morton, Jamie; Tripathi, Anupriya; Gauglitz, Julia M.; Marotz, Clarisse; Matteson, Nathaniel L.; Martino, Cameron; Sanders, Jon G.; Carrieri, Anna Paola; Song, Se Jin; Swafford, Austin D.; Dorrestein, Pieter C.; Andersen, Kristian G.; Parida, Laxmi; Kim, Ho-Cheol; Vázquez-Baeza, Yoshiki; Knight, Rob
title: EMPress enables tree-guided, interactive, and exploratory analyses of multi-omic datasets
date: 2020-10-08
journal: bioRxiv
DOI: 10.1101/2020.10.06.327080
sha: 51c22a9ef03af98cb743c037d823f851c55b15a2
doc_id: 311064
cord_uid: nrznkrxm

Standard workflows for analyzing microbiomes often include the creation and curation of phylogenetic trees. Here we present EMPress, an interactive tool for visualizing trees in the context of microbiome, metabolome, and other community data, scalable beyond modern large datasets like the Earth Microbiome Project. EMPress provides novel functionality, including ordination integration and animations, alongside many standard tree visualization features, and thus simplifies exploratory analyses of many forms of ‘omic data.

Main Text

The increased availability of sequencing technologies and automation of molecular methods have enabled studies of unprecedented scale [1], prompting the creation of tools better suited to store, analyze [2], and visualize [3] studies of this magnitude. Many of these tools, such as [4, 5, 6, 7], use phylogenies detailing the evolutionary relationships among features or dendrograms that organize features in a hierarchical structure (e.g. clustering of mass spectra) [8].
The challenge of enabling fully interactive analyses stems from the disconnect between feature-level tools and dataset-level tools; few can interactively integrate multiple representations of the data [9], and to our knowledge none scale to display large datasets. This is a key unresolved challenge for the field: to allow researchers to contextualize community-level patterns (groupings of samples) together with feature-level structure, i.e. which features lead to the groupings explained in a given sample set.

Here, we introduce EMPress (https://github.com/biocore/empress), an open-source (BSD 3-clause), interactive and scalable phylogenetic tree viewer accessible as a QIIME 2 [2] plugin. EMPress is built around the high-performance balanced parentheses tree data structure [10], and uses a hardware-accelerated WebGL-based rendering engine that allows EMPress to visualize trees with hundreds of thousands of nodes using a laptop's web browser (Methods

of their implementation (PHYLOViZ Online also uses WebGL), and/or use-cases (SigTree is mostly used to visualize differential abundance patterns, and iTOL supports the visualization of QIIME 2 tree artifacts). EMPress stands out in its scalability: iTOL considers trees with more than 10,000 tips to be "very large" (https://itol.embl.de/help.cgi), while EMPress readily supports trees with hundreds of thousands of tips, as shown in Fig. 1

Using the first data release of the Earth Microbiome Project (EMP), we demonstrate EMPress' scalability by rendering a 26,035 sample ordination and a 756,377 node tree (Figure 1A). To visualize the relative proportions of taxonomic groups at the phylum level, we use EMPress' feature metadata coloring to highlight the top 5 most prevalent phyla (see Methods).
Next, we add a barplot layer showing, for each tip in the tree, the proportions of samples containing each tip summarized by level 2 of the EMP ontology (Animal, Plant, Non-Saline, and Saline). Paired visualizations allow us to click on a tip in the tree and view the samples that contain that feature in the ordination. This functionality is useful when analyzing datasets with outliers or mislabeled metadata. Tip-aligned barplots summarize environmental metadata: for example, Figure 1B shows the subset of samples (4,002) with recorded pH information and a barplot layer with the mean pH where each feature was found. The barplot reveals a relatively dark section near many Firmicutes-classified features on the tree; in concert with histograms showing mean pH for each phylum (Figure 1C)

annotations and food types. Figure 2B shows a tree where each tip is colored by its chemical super class, and where barplots show the proportion of samples in the study containing each compound by food type. This representation reveals a clade of lipids and lipid-like molecules that are well represented in animal food types and seafoods. In contrast, salads and fruits are broadly spread throughout the cladogram.

Lastly, in Figure 2C, we compare three differential abundance methods in an oral microbiome dataset [18] as separate barplot layers on a tree. This dataset includes samples (n=32) taken before and after subjects brushed their teeth (see Methods). As observed across the three differential abundance tools' outputs, all methods agree broadly on which features are particularly "differential" (for example, the cluster of Firmicutes-classified sequences in the bottom-right of the tree; see Methods), although there are discrepancies due to different methods' assumptions and biases.
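
As a rough illustration of the tip-aligned barplot summary described above (for each tip, the proportions of the samples containing it that fall into each metadata category), the following minimal Python sketch computes those proportions from a presence table. It is not EMPress' implementation: the function name `tip_category_proportions`, the dictionary-based inputs, and the toy sample and tip IDs are all hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict


def tip_category_proportions(presence, sample_category):
    """For each tip, return the fraction of its samples in each category.

    presence: dict mapping sample ID -> iterable of tip IDs found in that sample
    sample_category: dict mapping sample ID -> category label
    (both input layouts are assumptions made for this sketch)
    """
    # counts[tip][category] = number of samples of that category containing tip
    counts = defaultdict(Counter)
    for sample, tips in presence.items():
        category = sample_category[sample]
        for tip in set(tips):  # count each sample at most once per tip
            counts[tip][category] += 1

    proportions = {}
    for tip, by_category in counts.items():
        total = sum(by_category.values())
        proportions[tip] = {cat: n / total for cat, n in by_category.items()}
    return proportions


# Hypothetical toy data: four samples, three tips, EMP-style category labels.
presence = {
    "s1": ["tipA", "tipB"],
    "s2": ["tipA"],
    "s3": ["tipA", "tipC"],
    "s4": ["tipB"],
}
sample_category = {"s1": "Animal", "s2": "Animal", "s3": "Saline", "s4": "Plant"}

result = tip_category_proportions(presence, sample_category)
for tip in sorted(result):
    print(tip, result[tip])
```

Each tip's proportions sum to one, so they can be drawn as a stacked bar of fixed length next to that tip. Normalizing per tip rather than per category is a design choice of this sketch; it keeps rare and common features visually comparable.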
We thank members of the Knight Lab and IBM AIHL Bioinformatics team for feedback during code reviews and presentations. We gratefully acknowledge the following Authors from the originating laboratories responsible for obtaining the specimens and the submitting laboratories where genetic sequence data were generated and shared via the GISAID Initiative, on which a portion of this research is based (Supplemental Table 1

EMPress and EMPeror are dynamically linked together. For example, clicking on a tip reveals the node's inspection menu, and highlights the 2 samples in the ordination that contain that microbial feature.

A communal catalogue reveals Earth's multiscale microbial diversity
Reproducible, interactive, scalable and extensible microbiome data science using QIIME 2
EMPeror: a tool for visualizing high-throughput microbial community data
UniFrac: a new phylogenetic method for comparing microbial communities
Phylogenetic factorization of compositional data yields lineage-level associations in microbiome datasets
A phylogenetic transform enhances analysis of compositional microbiota data
Chemically-informed Analyses of Metabolomics Mass Spectrometry Data with Qemistree
PHYLOViZ 2.0: providing scalable data integration and visualization for multiple phylogenetic inference methods
Simple and efficient fully-functional succinct trees
Using ggtree to Visualize Data on Tree-Like Structures
ETE 3: Reconstruction, Analysis, and Visualization of Phylogenomic Data
Anvi'o: an advanced analysis and visualization platform for 'omics data
SigTree: A Microbial Community Analysis Tool to Identify and Visualize Significantly Responsive Branches in a Phylogenetic Tree
Interactive Tree Of Life (iTOL) v4: recent updates and new developments
Dynamic Graphics for Data Analysis
Genomic Diversity of Severe Acute Respiratory Syndrome-Coronavirus 2 in Patients With Coronavirus Disease
Establishing microbial composition measurement standards with reference frames
Enzyme Annotation and Metabolic Reconstruction Using KEGG
Multi-Omics integration analysis of respiratory specimen characterizes baseline molecular determinants associated with COVID-19 diagnosis

We declare none.