key: cord-102613-hly07ne3 authors: Danko, David; Bezdan, Daniela; Afshinnekoo, Ebrahim; Ahsanuddin, Sofia; Bhattacharya, Chandrima; Butler, Daniel J; Chng, Kern Rei; Donnellan, Daisy; Hecht, Jochen; Kuchin, Katerina; Karasikov, Mikhail; Lyons, Abigail; Mak, Lauren; Meleshko, Dmitry; Mustafa, Harun; Mutai, Beth; Neches, Russell Y; Ng, Amanda; Nikolayeva, Olga; Nikolayeva, Tatyana; Png, Eileen; Ryon, Krista; Sanchez, Jorge L; Shaaban, Heba; Sierra, Maria A; Thomas, Dominique; Young, Ben; Abudayyeh, Omar O.; Alicea, Josue; Bhattacharyya, Malay; Blekhman, Ran; Castro-Nallar, Eduardo; Cañas, Ana M; Chatziefthimiou, Aspassia D; Crawford, Robert W; De Filippis, Francesca; Deng, Youping; Desnues, Christelle; Dias-Neto, Emmanuel; Dybwad, Marius; Elhaik, Eran; Ercolini, Danilo; Frolova, Alina; Gankin, Dennis; Gootenberg, Jonathan S.; Graf, Alexandra B; Green, David C; Hajirasouliha, Iman; Hernandez, Mark; Iraola, Gregorio; Jang, Soojin; Kahles, Andre; Kelly, Frank J; Knights, Kaymisha; Kyrpides, Nikos C; Łabaj, Paweł P; Lee, Patrick K H; Leung, Marcus H Y; Ljungdahl, Per; Mason-Buck, Gabriella; McGrath, Ken; Meydan, Cem; Mongodin, Emmanuel F; Moraes, Milton Ozorio; Nagarajan, Niranjan; Nieto-Caballero, Marina; Noushmehr, Houtan; Oliveira, Manuela; Ossowski, Stephan; Osuolale, Olayinka O; Özcan, Orhan; Paez-Espino, David; Rascovan, Nicolas; Richard, Hugues; Rätsch, Gunnar; Schriml, Lynn M; Semmler, Torsten; Sezerman, Osman U; Shi, Leming; Shi, Tieliu; Song, Le Huu; Suzuki, Haruo; Tighe, Scott W; Tong, Xinzhao; Udekwu, Klas I; Ugalde, Juan A; Valentine, Brandon; Vassilev, Dimitar I; Vayndorf, Elena; Velavan, Thirumalaisamy P; Wu, Jun; Zambrano, María M; Zhu, Jifeng; Zhu, Sibo; Mason, Christopher E title: Global Genetic Cartography of Urban Metagenomes and Anti-Microbial Resistance date: 2020-05-04 journal: bioRxiv DOI: 10.1101/724526 sha: doc_id: 102613 cord_uid: hly07ne3 Although studies have shown that urban environments and mass-transit systems have distinct genetic profiles, there are no systematic worldwide studies of these dense, human microbial ecosystems. To address this gap in knowledge, we created a global metagenomic and antimicrobial resistance (AMR) atlas of urban mass transit systems from 60 cities, spanning 4,728 samples and 4,424 taxonomically-defined microorganisms collected for three years. This atlas provides an annotated, geospatial profile of microbial strains, functional characteristics, antimicrobial resistance markers, and novel genetic elements, including 10,928 novel predicted viral species, 1302 novel bacteria, and 2 novel archaea. Urban microbiomes often resemble human commensal microbiomes from the skin and airways, but also contain a consistent “core” of 31 species which are predominantly not human commensal species. Samples show distinct microbial signatures which may be used to accurately predict properties of their city of origin including population, proximity to the coast, and taxonomic profile. These data also show that AMR density across cities varies by several orders of magnitude, including many AMRs present on plasmids with cosmopolitan distributions. Together, these results constitute a high-resolution, global metagenomic atlas, which enables the discovery of new genetic components of the built human environment, highlights potential forensic applications, and provides an essential first draft of the global AMR burden of the world’s cities. 1 Introduction limitations in sequencing depth, or by missing annotations in the reference databases used for taxonomic 112 classification, which is principally problematic with phages. It is worth noting that potentially prevalent presence and absence of species (which is robust to noise from relative abundance) and performed a 187 dimensionality reduction of the data using UMAP (Uniform Manifold Approximation and Projection, 188 McInnes et al. (2018)) for visualization (Figure 2A) . Jaccard distance was correlated with distance based 189 on Jensen-Shannon Divergence (which accounts for relative abundance) and k-mer distance calculated by 190 MASH (which is based on the k-mer distribution in a sample, so cannot be biased by a database) (Supp. Figure S10A , B, C). In principle, Jaccard distance could be influenced by read depth as low abundance 192 species drop below detection thresholds. However we expect this issue to be minor as the total number 193 of species identified stabilized at 100,000 reads (Supp. Figure S9B ) compared to an average of 6.01M 194 reads per sample. Samples collected from North America and Europe were distinct from those collected 195 in East Asia, but the separation between other regions was less clear. A similar trend was found in an 196 analogous analysis based on functional pathways rather than taxonomy (Supp. Fig S5D) , which indicates 197 geographic stratification of the metagenomes at both the functional and taxonomic levels. Subclusters 198 identified by UMAP roughly corresponded to city and climate but not surface type (Supp. Figure We quantified the degree to which metadata covariates influence the taxonomic composition of our 202 samples using MAVRIC, a statistical tool to estimate the sources of variation in a count-based dataset 203 (Moskowitz and Greenleaf, 2018). We identified covariates which influenced the taxonomic composition 204 of our samples: city, population density, average temperature in June, region, elevation above sea-level, 205 surface type, surface material, elevation above or below ground and proximity to the coast. The most 206 important factor, which could explain 19% of the variation in isolation, was the city from which a sample 207 was taken followed by region which explained 11%. The other four factors ranged from explaining 2% 208 to 7% of the possible variation in taxonomy in isolation (Supp . Table S2 ). We note that many of 209 the factors were confounded with one another, so they can explain less diversity than their sum. One 210 metadata factor tested, the population density of the sampled city, had no significant effect on taxonomic 211 variation overall. To quantify how the principle covariates, climate, continent, and surface material impacted the taxo-213 nomic composition of samples, we performed a Principal Component Analysis (PCA) on our taxonomic 214 data normalized by proportion and identified principal components (PCs) which were strongly associated 215 with a metadata covariate in a positive or negative direction (PCs were centered so an average direction 216 indicates an association). We found that the first two PCs (representing 28.0% and 15.7% of the variance 217 of the original data, respectively) associated strongly with the city climate while continent and surface 218 material associate less strongly (Figure 2B) . 219 Next, we tested whether geographic proximity (in km) of samples to one another had any effect on 220 the variation, since samples taken from nearby locations could be expected to more closely resemble one 221 another. Indeed, for samples taken in the same city, the average JSD (Jensen-Shannon distance) was 222 weakly predictive of the taxonomic distance between samples, with every increase of 1km in distance 223 between two samples representing an increase of 0.056% in divergence (p < 2e16, R 2 = 0.01073, Supp. Figure S1B ). This suggests a "neighborhood effect" for sample similarity analogous to the effect described 225 by Meyer et al. (2018) , albeit a very minor one. To reduce bias that could be introduced by samples Colored by the region of origin for each sample. Axes are arbitrary and without meaningful scale. The color key is shared with panel B. B) Association of the first 25 principal components of sample taxonomy with climate, continent, and surface material. C) Distribution of major phyla, sorted by hierarchical clustering of all samples and grouped by continent. D) Distribution of high-level groups of functional pathways, using the same order as taxa (C). E) Distribution of AMR genes by drug class, using the same order as taxa (C) . Note that MLS is macrolide-lincosamide-streptogramin. taken from precisely the same object we excluded all pairs of samples within 1km of one another. At a global level, we examined the prevalence and abundance of taxa and their functional profiles 228 between cities and continents. These data showed a fairly stable phyla distribution across samples, but 229 the relative abundance of these taxa is unstable (Figure 2C ) with some continental trends. In contrast 230 to taxonomic variation, functional pathways were much more stable across continents, showing relatively 231 little variation in the abundance of high level categories (Figure 2D ). This pattern may also be due to 232 the more limited range of pathway classes and their essential role in cellular function, in contrast to the 233 much more wide-ranging taxonomic distributions examined across metagenomes. Classes of antimicrobial 234 resistance were observed to vary by continent as well. Clusters of AMR classes were observed to occur 235 in groups of taxonomically similar samples ( Figure 2E ). 236 We quantified the relative variation of taxonomic and functional profiles by comparing the distribution 237 of pairwise distances in taxonomic and functional profiles. Both profiles were equivalently normalized 238 to give the probability of encountering a particular taxon or pathway. Taxonomic To facilitate characterization of novel sequences we created GeoDNA, a high-level web interface (Figure 246 3A) to search raw sequences against our dataset. Users can submit sequences to be processed against 247 a k-mer graph-based representation of our data. Query sequences are mapped to samples and a set of 248 likely sample hits is returned to the user. This interface will allow researchers to probe the diversity in 249 this dataset and rapidly identify the range of various genetic sequences. 250 We sought to determine whether a samples taxonomy reflected the environment in which it was 251 collected. To this end we trained a Random Forest Classifier (RFC) to predict a sample's city of origin 252 from its taxonomic profile. We trained an RFC with 100 components on 90% of the samples in our 253 dataset and evaluated its classification accuracy on the remaining 10%. We repeated this procedure with 254 multiple subsamples of our data at various sizes and with 5 replicates per size to achieve a distribution 255 (Fig. 3B ). The RFC achieved 88% on held out data which compares favorably to the 7.01% that would 256 be achieved by a randomized classifier. These results from our RFC demonstrate that city specific 257 taxonomic signatures exist and can be predictive. We expanded our analysis of environmental signatures in taxonomy to the prediction of features in 259 cities not present in our training set. To do this we collated a set of 7 features for each city: population, 260 surface material, elevation, proximity to the coast, population density, region, ave June temperature, 261 and Koppen climate classification. We trained a RFCs to predict each feature based on all samples that 262 were not taken from a given city then used the relevant RFC to predict the feature for samples from 263 the held out city and recorded the classification accuracy ( Figure 3D ). While not all features and cities 264 were equally predictable (in particular features for a number of British cities were roughly similar and 265 could be predicted effectively) in general the predictions exceeded random chance by a significant margin 266 (Supp. Figure S3A ). This suggests that certain features of cities generate microbial signatures that are 267 present globally and distinct from city specific signatures. The successful geographic classification of 268 samples demonstrates distinct city-specific trends in the detected taxa, that may enable future forensic 269 biogeographical capacities. However, unique, city-specific taxa are not uniformly distributed ( Figure 3B ). To quantify this, we 271 developed a score to reflect how endemic a given taxon is within a city, which reflects upon the forensic 272 usefulness of a taxon. We define the Endemicity Score (ES) of a taxa as term-frequency inverse document 273 frequency where the document consists of samples from some metadata defined group such as a city or 274 region. This score is designed to simultaneously reflect the chance that a taxon could identify a given 275 city and that that taxon could be found within the given city. A high ES for a taxon in a given city 276 could be evidence of the evolutionary advantage that the taxon has in a particular cities environment. However, neutral evolution of microbes within a particular niche is also possible and the ES alone does 278 not distinguish between these two hypotheses. Note that while the ES only considers taxa which are found in a city, a forensic classifier could also 280 take advantage of the absence of taxa for a similar metric. ES show a roughly bimodal distribution for 281 regions (Fig. 3C) . Each region possesses a number of taxa with ES scores close to 1 and a slightly larger samples for all cities are transformed into lists of unique k-mers (left). After filtration, the k-mers are assembled into a graph index database. Each k-mer is then associated with its respective city label and other informative metadata, such as geo-location and sampling information (top middle). Arbitrary input sequences (top right) can then be efficiently queried against the index, returning a ranked list of matching paths in the graph together with metadata and a score indicating the percentage of k-mer identity (bottom right). The geo-information of each sample is used to highlight the locations of samples that contain sequences identical or close to the queried sequence (middle right). B) Classification accuracy of a random forest model for assigning city labels to samples as a function of the size of training set. C) Distribution of Endemicity scores (term frequency inverse document frequency) for taxa in each region. D) Prediction accuracy of a random forest model for a given feature (rows) in samples from a city (columns) that was not present in the training set. Rows and columns sorted by average accuracy. Continuous features (e.g. Population) were discretized. number close to 0 (note that ES is not bounded in [0, 1]). Some cities, like Offa (Nigeria), host many unique taxa while others, like Zurich (Switzerland), host fewer endemic species (Supp. Figure S3B ). Large numbers of endemic species in a city may reflect geographic bias in sampling. However, some 285 cities from well sampled continents (e.g., Lisbon, Hong Kong) also host many endemic species which 286 would suggest that ES may indicate interchangeability and local pockets of microbiome variation for 287 some locations. Quantification of antimicrobial diversity and AMRs are key components of global antibiotic stewardship. Yet, predicting antibiotic resistance from genetic sequences alone is challenging, and detection accuracy 291 depends on the class of antibiotics (i.e., some AMR genes are associated to main metabolic pathways 292 while others are uniquely used to metabolize antibiotics). As a first step towards a global survey of 293 antibiotic resistance in urban environments, we mapped reads to known antibiotic resistance genes, 294 using the MegaRES ontology and alignment software. We quantified their relative abundance using 295 reads/kilobase/million mapped reads (RPKM) for 20 classes of antibiotic resistance genes detected in 296 our samples (Figure 4A B) . 2,210 samples had some sequence which were identified as belonging to an 297 AMR gene, but no consistent core set of genes was identified. The most common classes of antibiotic 298 resistance genes were for macrolides, lincosamides, streptogamines (MLS), and betalactams, yet the most 299 common class of antibiotic resistance genes, MLS was found in only 56% of the samples where AMR 300 sequence was identified. Despite being relatively common, antibiotic resistance genes were universally in low abundance com-302 pared to functional genes, with RPKM values for resistance classes typically ranging from 0.1 -1 com- pared to values of 10 -100 for typical housekeeping genes (AMR classes contain many genes so RPKM 304 values may be lower than they would be for individual genes). In spite of the low abundance of the genes 305 themselves, some samples contained sequences from hundreds of distinct AMR genes. Clusters of high 306 AMR diversity were not evenly distributed across cities ( Figure 4C ). Some cities had more resistance 307 genes identified on average (15-20X) than others (e.g. Bogota) while other cities had bimodal distribu-308 tions (e.g. San Francisco) where some samples had hundreds of genes while others very few. We note 309 that 99% of the cases where we detected an AMR genes had an average depth of 2.7x, indicating that 310 our global distribution would not dramatically change with altered read depth (Supp. Figure S6E ). As with taxa, AMR genes can be used to classify samples to cities -albeit with much less accuracy. A random forest model analogous to the one trained to predict city classification from taxonomic profiles 313 was trained to predict from profiles of antimicrobial resistance genes. This model achieved 37.6% accuracy 314 on held out test data (Supp. Figure S6A ). While poor for actual classification this accuracy far exceeds 315 the 7.01% that would be achieved by randomly assigning labels and indicates that there are possibly 316 weak, city specific signatures for antimicrobial resistance genes. Multiple AMR genes can be carried on a single plasmid and ecological competition may cause mul-318 tiple taxa in the same sample to develop antimicrobial resistance. As a preliminary analysis into these 319 phenomenons we identified clusters of AMR genes that co-occurred in the same samples ( Figure 4D ). We measured the Jaccard distance between all pairs of AMR genes found in at least 1% of samples and 321 performed agglomerative clustering on the resulting distance matrix. We identified three large clusters of 322 genes and numerous smaller clusters. Of note, these clusters often consist of genes from multiple classes 323 of resistance. At this point we do not posit a specific ecological mechanism for this co-occurrence, but 324 we note that the large clusters contain far more genes than are typically found on plasmids. 325 We performed a rarefaction analysis on the set of all resistance genes in the dataset, which we call 326 the "panresistome" (Figure (Supp. Figure S6B ). Similar to the rate of detected species, the panresistome 327 also shows an open slope with an expected rate of discovery of 1 previously unobserved AMR gene per 328 10 samples. Given that AMR gene databases are rapidly expanding and that no AMR genes were found 329 in some samples, it is likely that future analyses will identify many more resistance genes in this data. Additionally, AMR genes show a "neighbourhood" effect within samples that are geographically prox-331 imal analogous to the effect seen for taxonomic composition (Supp. Figure S6C ). Excluding samples 332 where no AMR genes were detected, the Jaccard distance between sets of AMR genes increases with 333 distance for pairs of samples in the same city. As with taxonomic composition. the overall effect is weak 334 and noisy, but significant. To examine these samples for novel genetic elements, we assembled and identified Metagenome Assembled 337 Genomes (MAGs) for viruses, bacteria, and archaea and analyzed them with several algorithms. This 338 includes thousands of novel CRISPR arrays that reflect the microbial biology of the cities and 1,304 339 genomes from our data, of which 748 did not match any known reference genome within 95% average 340 nucleotide identity (ANI). 1302 of the genomes were classified as bacteria, and 2 as archaea. Bacterial 341 genomes came predominantly from four phyla: the Proteobacteria, Actinobacteria, Firmicutes, and 342 Bacteroidota. Novel bacteria were evenly spread across phyla ( Figure 5A ). Assembled bacterial genomes were often identified in multiple samples. Several of the most prevalent 344 bacterial genomes were novel species ( Figure 5B ). Some assembled genomes, both novel and not, showed 345 regional specificity while others were globally distributed. The taxonomic composition of identifiable 346 genomes roughly matched the composition of the core urban microbiome (Section 2). The number of 347 identified bacterial MAGs was somewhat based on read depth and the sample count per city (Supp. 348 Figure S7A ). The number of bacterial MAGs discovered in a city which did not match a known species 349 was closely correlated to the total number of bacterial MAGs discovered in that city (Supp. Figure S7B ). Bacterial MAGs were roughly evenly distributed geographically with the notable exception of Offa, which 351 had dramatically more novel bacterial species than other cities. We investigated assembled contigs from our samples to identify 16,584 predicted uncultivated viral 353 genomes (UViGs). Taxonomic analysis of predicted UViGS to identify viral species yielded 2,009 clusters 354 containing a total of 6,979 UViGs and 9,605 singleton UViGs for a total of 11,614 predicted viral species. We compared the predicted species to known viral sequences in the JGI IMG/VR system, which 360 contains viral genomes from isolates, a curated set of prophages and 730k viral MAGs from other studies. Of the 11,614 species discovered in our data 94.1% did not match any viral sequence in IMG/VR (Paez-362 Espino et al., 2019) at the species level for a total of 10,928 novel viruses. We note that this number is 363 surprisingly high but was obtained using a conservative pipeline (99.6% precision) and corresponded well 364 with our identified CRISPR arrays (below). This suggests that urban microbiomes contain significant 365 diversity not observed in other environments. Next, we attempted to identify possible bacterial and eukaryotic hosts for our predicted viral MAGs. For the 686 species with similar sequences in IMG/VR, we projected known host information onto 2,064 368 MetaSUB viral MAGs. Additionally, we used CRISPR-Cas spacer matches in the IMG/M system to 369 assign possible hosts to a further 1,915 predicted viral species. Finally, we used a database of 20 million 370 metagenome-derived CRISPR spacers to provide further rough taxonomic assignments. Our predicted 371 viral hosts aligned with our taxonomic profiles, 41% of species in the core microbiome (Section 2) had 372 predicted viral-host interactions. Many of our viral MAGs were found in multiple locations ( Figure 5D ). Many viruses were found in South America, North America and Africa. Viral MAGs in Japan often 374 corresponded to those in Europe and North America. 375 We identified 838,532 CRISPR arrays in our data of which 3,245 could be annotated for specific 376 systems. The annotated CRISPR arrays were principally type 1-E and 1-F btu a number of type two 377 and three systems were identified as well ( Figure 5E , F). A number of arrays had unclear or ambiguous 378 type assignment. Critically the spacers in our identified CRISPR arrays closely matched our predicted 379 viral MAGs. We aligned spacers to both our viral MAGs and all viral sequences in RefSeq. The total 380 fraction of spacers which could be mapped to our viral MAGS and RefSeq was similar (Supp. Figure 381 S7C) but the mapping rate to our viral MAGs dramatically exceeded the mapping rate to RefSeq (Figure 382 5C). We present this as additional evidence supporting these novel viral MAGs. (Tables 1 and S1) , constituting the first large scale metagenomic study 387 of the urban microbiome. We also identified species that are geographically constrained and showed that 388 these can be used to determine a samples city of origin (Section 2.1). Many of these species are associated 389 with commensal microbiomes from human skin and airways, but we observed that urban microbiomes are 390 nevertheless distinct from both human and soil microbiomes. Notably, no species from the Bacteroidetes, a prominent group of human commensal organisms (Eckburg et al., 2005; Qin et al., 2010) , was identified 392 in the core urban microbiome. We conclude that there is a consistent urban microbiome core ( Figure 393 1, 2), which is supplemented by geographic variation (Figure 2 ) and microbial signatures based on the 394 specific attributes of a city ( Figure 3 ). Our data also indicates that significant diversity remains to be 395 characterized and that novel taxa may be discovered in the data (Figure 5 ), that environmental factors 396 affect variation, and that sequences associated with AMR are globally widespread but not necessarily 397 abundant ( Figure 4 ). In addition to these results, we present several ways to access and analyze our 398 data including interactive web based visualizations, search tools over raw sequence data, and high level 399 interfaces to computationally access results. Unique taxonomic composition and association with covariates specific to the urban environment 401 suggest that urban microbiomes should be treated as ecologically distinct from both surrounding soil 402 microbiomes and human commensal microbiomes. Though these microbiomes undoubtedly interact 403 with the urban environment, they nonetheless represent distinct ecological niches with different genetic 404 profiles. While our metadata covariates were associated with the principal variation in our samples, they 405 do not explain a large proportion of the observed variance. It remains to be determined whether variation 406 is essentially a stochastic process or if a deeper analysis of our covariates proves more fruitful. We have 407 observed that less important principal components (roughly PCs 10-100) are generally less associated 408 with metadata covariates but that PCs 1-3 do not adequately describe the data alone. This is a pattern 409 that was observed in the human microbiome project as well, where minor PCs (such as our Figure 2B ) 410 were required to separate samples from closely related body sites. Much of the urban microbiome likely represents novel diversity as our samples contain a significant In addition to general purpose data analysis tools essentially all analysis in this paper is available 505 as a series of Jupyter notebooks. Our hope is that these notebooks allow researchers to reproduce our 506 results, build upon our results in different contexts, and better understand precisely how we arrived at 507 our conclusions. By providing the exact source used to generate our analyses and figures, we also hope 508 to be able to quickly incorporate new data or correct any mistakes that might be identified. For less technical purposes, we also provide web-based interactive visualizations of our dataset (typ-510 ically broken into city-specific groups). These visualizations are intended to provide a quick reference 511 for major results as well as an exploratory platform for generating novel hypotheses and serendipitous 512 discovery. The web platform used, MetaGenScope, is open source, permissively licensed, and can be run 513 on a moderately powerful machine (though its output relies on results from the MetaSUB CAP). Our hope is that by making our dataset open and easily accessible to other researchers the scientific 515 community can more rapidly generate and test hypotheses. One of the core goals of the MetaSUB 516 consortium is to build a dataset that benefits public health. As the project develops we want to make 517 our data easy to use and access for clinicians and public health officials who may not have computational 518 or microbiological expertise. We intend to continue to build tooling that supports these goals. fields collected were the location of sampling, the material being sampled, the type of object being 634 sampled, the elevation above or below ground, and the station or line where the sample was collected. However, several cities were unable to use the provided apps for various reasons and submitted their 636 metadata as separate spreadsheets. Additionally, certain metadata features, such as those related to 637 sequencing and quality control, were added after initial sample collection. To collate various metadata sources, we built a publicly available program which assembled a large 639 master spreadsheet with consistent sample UUIDs. After assembling the originally collected data at- 100), which we have added to an empty sterile urine cup followed by swabbing for 3 min (Fig.S1 ). Furthermore, the working space has been swabbed for 1.5 min / 3 min before and after treatment with 701 10% bleach (Fig. S2) Figure S14 ). Distributions of DNA concentration and the number of reads were as expected. GC content 770 was broadly distributed for negative controls while positive controls were tightly clustered, expected since 771 positive controls have a consistent taxonomic profile. Comparing the number of reads before and after 772 quality control did not reveal any major outliers. 773 6.5.3 Batch effect appears minimal 774 A major concern for this low-biomass studies and large-scale studies are batch effects. The median flowcell 775 used in our study contained samples from 3 cities and 2 continents. However, two flowcells covered 18 776 cities from 5 or 6 continents respectively. When samples from these flowcells were plotted using UMAP 777 (see Section 2.1 for details) the major global trends we described were recapitulated (Supp. We used BLASTn to align nucelotide assemblies from case samples to control samples. We used a 788 threshold of 8,000 base pairs and 99.99% identity as a minimum to consider two sequences homologous. This threshold was chosen to be sensitive without solely capturing conserved regions. We identified all 790 connected groups of homologous sequences and found approximate taxonomic identifications by aligning 791 contigs to NCBI-NT using BLASTn searching for 90% nucleotide identity over half the length of the 792 longest contig in each group. Section 2). Our dilemma was that a microbial species that is common in the urban environment might 797 also reasonably be expected to be common in the lab environment. In general, negative controls had 798 lower k-mer complexity, fewer reads, and lower post PCR Qubit scores than case samples and no major Previous studies have reported that microbial species whose relative abundance is negatively cor-803 related with DNA concentration may be contaminants. We observed a number of species that were 804 negatively correlated with DNA concentration (Supp. Figure S13A ) but this distribution followed the 805 same shape (but had a greater magnitude) as a null distribution of uniformly randomly generated rela-806 tive abundances (Supp. Figure S13B ) leading us to conclude that negative correlation may simply be a 807 statistical artifact. We also plotted correlation with DNA concentration against each species mean rela-808 tive abundance across the entire data-set (Supp. Figure S13C ). Species that were negatively correlated 809 with DNA concentration were clearly more abundant than uncorrelated species, this suggests that there 810 may be a jackpot effect for prominent species in samples with lower concentrations of DNA but is not 811 generally consistent with contamination. We analyzed the total complexity of case samples in comparison to control samples. Case samples 813 had a significantly higher taxonomic diversity (Supp. Figure S12A ) than any type of negative control 814 sample. We also compared the confidence of taxonomic assignments to control assignments for prominent 815 taxa (Supp. Figure S12B ) using the number of unique marker k-mers to compare assignments. We found 816 that case samples had more and higher quality assignments than could be found in controls. One species, Bradyrhizobium sp. BTAi1, was not clearly better in case samples than controls but in this case we were 818 able to assemble genomes for this species in several unique samples so we feel it is ambiguous. Finally, we compared assemblies from negative controls to assemblies from our case samples searching 820 for regions of high similarity that could be from the same microbial strain. We reasoned that uncontam-821 inated samples may contain the same species as negative controls but were less likely to contain identical 822 strains. Only 137 case samples were observed to have any sequence with high similarity to an assem-823 bled sequence from a negative control (8,000 base pairs minimum of 99.99% identity). The identified 824 sequences were principally from Bradyrhizobium and Cutibacterium. Since these genera are core taxa 825 (See Section 2) observed in nearly every sample but high similarity was only identified in a few samples, 826 we elected not to remove species from these genera from case samples. We generated 31-mer profiles for raw reads using Jellyfish. All k-mers that occurred at least twice in 829 a given sample were retained. We also generated MASH sketches from the non-human reads of each 830 sample with 10 million unique minimizers per sketch. We calculated the Shannon's entropy of k-mers by sampling 31-mers from a uniform 10,000 reads per 832 sample. Shannon's entropy of taxonomic profiles was calculated using the CAPalyzer package (Section 833 4). 834 6.5.7 K-Mer based metrics correlate with taxonomic metrics 835 We found clear correlations between three pairwise distance metrics (Supp. Figure S10A , B, C): k-mer 836 based Jaccard distance (MASH), taxonomic Jaccard distance, and taxonomic Jensen-Shannon diver-837 gence. This suggests that taxonomic variation reflects meaningful variation in the underlying sequence 838 in a sample. We also compared alpha diversity metrics (Supp. Figure S10D ): Shannon entropy of k-mers, and 840 Shannon entropy of taxonomic profiles. As with pairwise distances these metrics were correlated though 841 noise was present. This noise may reflect sub-species taxonomic variation in our samples. and mapped these to a set of large database using BLASTn (see 2 for details). This resulted in 34.6% 860 reads which could not be mapped to any external database compared to 41.3% of reads mapped using 861 our approach with KrakenUniq. We note that our approach to estimate the fraction of reads that could 862 be classified using BLASTn does not account for hits to low quality taxa which would ultimately be 863 discarded in our pipeline, and so represents a worst-case comparison. Explanation (3) as it has been demonstrated to be comparable or having higher sensitivity than the best tools identified 873 in a recent benchmarking study (McIntyre et al. (2017) ) on the same comparative dataset. In addition, KrakenUniq allows for tunable specificity and identifies k-mers that are unique to particular taxa in a 875 database. Reads are broken into k-mers and searched against this database. Finally, the taxonomic 876 makeup of a sample is given by identifying the taxa with the greatest leaf to ancestor weight. KrakenUniq reports the number of unique marker k-mers assigned to each taxon, as well as the total 878 number of reads, the fraction of available marker k-mers found, and the mean copy number of those 879 k-mers. We found that requiring more k-mers to identify a species resulted in a roughly linear decrease At a minimum we required three reads assigned to a taxa with 64 unique marker k-mers. This setting 887 captures a group of taxa with low abundance but reasonable (∼ 10-20%) coverage of the k-mers in their 888 marker set (Supp. Figure S9C ). However, this also allows for a number of taxa with very high (105) 889 duplication of the identified marker k-mers and very few k-mers per read which we believe is biologically 890 implausible (Supp. Figure S9D ). We filtered these taxa by applying a further filter which required that 891 the number of reads not exceed 10 25 times the number of unique k-mers, unless the set of unique k-mers 892 was saturated (> 90% completeness). We include a full list of all taxonomic calls from all samples 893 including diagnostic values for each call. We do not attempt to classify reads below the species level in 894 this study. 895 We further evaluated prominent taxonomic classifications for sequence complexity and genome cov- Figure S9B ). We chose sub-sampling (sometimes referred to as rarefaction in the literature) based on the 916 study by Weiss et al. (2017) , showing that sub-sampling effectively estimates relative abundance. Note 917 that we use the term prevalence to describe the fraction of samples where a given taxon is found at any 918 abundance and we use the term relative abundance to describe the fraction of DNA in a sample from a 919 given taxon. 920 We compared our samples to metagenomic samples from the Human Microbiome Project and a 921 metagenomic study of European soil samples using MASH (Ondov et al., 2016) , a fast k-mer based 922 comparison tool. We built MASH sketches from all samples with 10 million unique k-mers to ensure 923 a sensitive and accurate comparison. We used MASH's built-in Jaccard distance function to generate 924 distances between our samples and HMP samples. We then took the distribution of distances to each particular human commensal community as a proxy for the similarity of our samples to a given human 926 body site. 927 We also compared our samples to HMP and soil samples using taxonomic profiles generated by 928 MetaPhlAn v2.0 (Segata et al., 2012) . We generated taxonomic profiles from non-human reads using 929 MetaPhlAn v2.0 and found the cosine similarity between all pairs of samples. 930 We used the Microbe Directory (Shaaban et al., 2018) We analyzed the metabolic functions in each of our samples by processing non-human reads with HU- MAnN2 (Franzosa et al., 2018) . We aligned all reads to UniRef90 using DIAMOND (v0.8.36, (Buchfink 936 et al., 2014) ) and used HUMAnN2 to produce estimate of pathway abundance and completeness. We 937 filtered all pathways that were less than 50% covered in a given sample but otherwise took the reported 938 pathway abundance as is after relative abundance normalization (using HUMAnN2's attached script). High level categories of functional pathways were found by grouping postively correlated pathways 940 and manually annotating resulting clusters. Figure S1: Ecological relationships with taxa. A) Correlation between species richness and latitude. Richness decreases significantly with latitude B) Neighbourhood effect. Taxonomic distance weakly correlates with geographic distance within cities. C) Fraction of reads assigned to different databases by BLAST for each region, at different levels of average nucleotide identity Figure S6 : Antimicrobial Resistance Genes, supplemental. A) Classification accuracy of a random forest model predicting city labels for held out samples from antimicrobial resistance genes. B) Rarefaction analysis of antimicrobial resistance genes. Curve does not flatten suggesting we would identify more AMR genes with more samples. C) Neighbourhood effect. Jaccard distance of AMR genes weakly correlates with geographic distance within cities. D) Number of AMR genes detected for samples in each region. E) Distribution of reads per gene (normalized by kilobases of gene length) for AMR gene calls. The vertical red line indicates that 99% of AMR genes have more than 9.06 reads per kilobase and would still be called at a lower read depth. Gaussian distributions to sampling locations where the taxa was found with standard deviations based on the geographic distance between observations. Top Row) Sampling sites in three major cities Rows 2-4) Estimated distribution of different example species in major cities Row 5) Estimated distribution of three species together in major cities Structure, function and diversity of the healthy human microbiome MetaSPAdes: A new versatile 1279 metagenomic assembler Nala An, Núria Andreu So-1368 mavilla A) Number of species detected as k-mer threshold increases for 100 randomly selected samples B) Number of species detected as number of sub-sampled reads increase C) k-mer counts compared to number of reads for species level annotations in 100 randomly selected samples, colored by coverage of marker k-mer set D) k-mer counts compared to number of reads for species level annotations in 100 randomly selected samples, colored by average duplication of k-mers E) Comparison of Mean Sequence Entropy and Coverage Equality for core and sub-core taxa A) Jensen-Shannon Divergence of taxonomic profiles vs MASH Jaccard distance of k-mers B) Divergence of taxonomic profiles vs Jaccard distance of taxonomic profiles. C) Jaccard distance of taxonomic profilesvs MASH Jaccard distance of k-mers D) Shannon's Entropy of taxonomic profiles vs Shannon's Entropy of k-mers E) Taxonomic richness (number of species) vs Shannon's Entropy of taxonomic profiles Figure S11: A) MASH k-mer Jaccard similarity to representative HMP samples MetaPhlAn v2.0 cosine similarity to representative HMP samples, colored by continent C) Fraction unclassified DNA by surface material D) Cosine similarity to MetaPhlAn v2.0 skin microbiome profile by surface E) Jensen-Shannon distance between pairs of taxonomic profiles vs Geographic Distance F) MASH k-mer Jaccard similarity to representative soil samples