key: cord-0302174-yiit6pwg
authors: Merrill, Bryan D.; Carter, Matthew M.; Olm, Matthew R.; Dahan, Dylan; Tripathi, Surya; Spencer, Sean P.; Yu, Brian; Jain, Sunit; Neff, Norma; Jha, Aashish R.; Sonnenburg, Erica D.; Sonnenburg, Justin L.
title: Ultra-deep Sequencing of Hadza Hunter-Gatherers Recovers Vanishing Microbes
date: 2022-03-31
journal: bioRxiv
DOI: 10.1101/2022.03.30.486478
sha: 4d6482a2d82a00cfd5b531d719e2d14d1defdae5
doc_id: 302174
cord_uid: yiit6pwg

The gut microbiome has been identified as a key to immune and metabolic health, especially in industrialized populations1. Non-industrialized individuals harbor more diverse microbiomes and distinct bacterial lineages2, but systemic under-sampling has hindered insight into the extent and functional consequences of these differences3. Here, we performed ultra-deep metagenomic sequencing and laboratory strain isolation on fecal samples from the Hadza, hunter-gatherers in Tanzania, and comparative populations in Nepal and California. We recover 94,971 total genomes of bacteria, archaea, bacteriophage, and eukaryotes, and find that 43% are novel upon aggregating with existing unified datasets4,5. Analysis of in situ growth rates, genetic pN/pS signatures, and high-resolution strain tracking reveal dynamics in the hunter-gatherer gut microbiome that are distinct from industrialized populations. Industrialized versus Hadza gut microbes are enriched in genes associated with oxidative stress, possibly a result of microbiome adaptation to inflammatory processes. We use phylogenomics to reveal that global spread of the spirochaete Treponema succinifaciens parallels historic human migration prior to its extinction in industrialized populations. When combined with a detailed definition of gut-resident strains that are vanishing in industrialized populations, our data demonstrate extensive perturbation in many facets of the gut microbiome brought on by the industrialized lifestyle. Recognition of work with indigenous communities Research involving indigenous communities is needed for a variety of reasons including to ensure that scientific discoveries and understanding appropriately represent all populations and do not only benefit those living in industrialized nations3,6. Special considerations must be made to ensure that this research is conducted ethically and in a non-exploitative manner. In this study we performed deep metagenomic sequencing on fecal samples that were collected from Hadza hunter-gatherers in 2013/2014 and were analyzed in previous publications using different methods2,7. A material transfer agreement with the National Institute for Medical Research in Tanzania specifies that stool samples collected are used solely for academic purposes, permission for the study was obtained from the National Institute of Medical Research (MR/53i 100/83, NIMR/HQ/R.8a/Vol.IX/1542) and the Tanzania Commission for Science and Technology, and verbal consent was obtained from the Hadza after the study’s intent and scope was described with the help of a translator. The publications that first described these samples included several scientists and Tanzanian and Nepali field-guides as co-authors for the critical roles they played in sample collection, but as no new samples were collected in this study, only scientists who contributed to the analyses described here were included as co-authors in this publication. It is currently not possible for us to travel to Tanzania and present our results to the Hadza people, however we intend to do so once the conditions of the COVID-19 pandemic allow it.

complete genome of a stingless bee (232 megabase pairs and 92.3% complete), the honey and larvae of which are known to be consumed by the Hadza 107 , and four are novel Amoebae (n=2) and Trepomonas (n=2) genomes (Fig. 1C) . While a comprehensive genome database does not yet exist for eukaryotes known to colonize the human gut, genomes from these species are not present in NCBI GenBank 108 (a repository of genomes sequenced from all environments).

Metagenomic reads generated here were mapped to three custom databases containing full genome sequences of species-level representatives for the bacteria/archaea (n=5,755) bacteriophage (n=16,899), and eukaryote (n=12) genomes (see methods for details). Over 80% of the metagenomic reads from Hadza, Nepali, and Californian samples map to these databases (Fig.   1E) . Notably, the Hadza have higher bacterial, bacteriophage, and archaeal diversity than other populations in this study, with the exception of Nepali Forager bacteriophage diversity (Fig. 1F) .

This increased diversity was not due to increased sequencing depth as an in-silico rarefaction analysis revealed more total and novel species of bacteria, archaea, and bacteriophage in Hadza samples compared to other populations across a range of sequencing depths ( Fig. 1G; Extended   Data Fig. 4) . Analysis of exceptionally deeply-sequenced samples (≥ 50 Gbp) suggests that the Hadza gut microbiome contains two-to-four times the number of bacterial species when compared to Californians at similar sequencing depths (Extended Data Fig. 5 ).

To explore the extent to which the Hadza microbiome differs from other populations, we curated a dataset of 1,800 human gut metagenomes from 21 published studies 11,15,29-44 (industrial, n=950; transitional, n=583; Hadza hunter-gatherers from this study, n=135; and other hunter-gatherers, n=132; Extended Data Fig. 6A-B, Supplementary Table 4) . Analysis of the hunter-gatherer samples demonstrates that much diversity and distinguishing taxa are recovered with deeper sequencing, so subsequent compositional analysis was focused on the deeply sequenced Hadza samples (Extended Data Fig. 6C-F) . The presence of each species within our bacterial/archaeal genome database was determined for each sample ( Fig. 2A Most VANISH taxa (n=120; 96%) and all BloSSUM taxa (n=63; 100%) are detected in "transitional" samples (taken from populations intermediate between hunter-gatherer/forager and industrialized lifestyles). These taxa are typically found at intermediate prevalence, consistent with the extent of lifestyle change corresponding to the magnitude of microbiome shifts 45 (Fig. 2C) .

Interestingly, BloSSUM taxa have higher in situ growth rates than VANISH taxa in transitional samples (Fig. 2D) and are anti-associated with the presence of Blastocystis, even within individuals from industrialized populations (Extended Data Fig. 8 ). Replication rate differences may indicate a competitive advantage of BloSSUM taxa in the industrialized gut versus the slower replicating VANISH taxa. Table 6 ). Pfams most associated with VANISH taxa point to a relatively outsized use of metal ions, peptidases, and RNA methylation. BloSSUM Pfams are associated with antioxidant and redox sensing functionality, perhaps reflecting increased oxygen tension associated with inflammation or an altered epithelial metabolic state in the industrialized gut 21, 47 . These differences demonstrate that VANISH and BloSSUM taxa are not functionally redundant.

Several species of the phylum Spirochaetota were identified as VANISH taxa in this study (Supplemental Table 5 ). Spirochaetota in general, and especially the most well-studied species Treponema succinifaciens, are known to be depleted in industrialized microbiomes 8 3A ). Hadza Spirochaetota genomes fall into three diverse families also found in other populations (colored boxes, Fig. 3B) suggesting that Spirochaetota are a core component of the nonindustrialized microbiome and highly susceptible to loss upon lifestyle change.

The MAGs recovered here increase the number of publicly available Treponema succinifaciens genomes from 125 to 346 (276% increase), enabling a robust phylogenomic analysis of the species (Fig. 3C) . We identified both Hadza-specific and globally distributed clades of T. succinifaciens and observed an association between phylogeny and continent of origin (delta statistic d=7.79, p-value<0.0001) 49 . To model the dispersal of T. succinifaciens between human populations, we performed stochastic character mapping on the phylogenetic tree of MAGs in which the population where each MAG was recovered is coded as a trait of the genome and the frequency of "transition events" between each pair of populations is quantified 50 (Fig. 3D) . The 4 most frequent transition events between populations are from the Hadza to other populations, accounting for 46.7% of all transition events, suggesting that T. succinifaciens was carried along the out-of-Africa human dispersal routes 51 . The congruence of T. succinifaciens phylogenomics with known patterns of past human migration is consistent with its dispersal being linked to close human contact (e.g., vertical transmission), as has been described for Helicobacter pylori 19, 52 .

The high sequencing depth and sample number achieved in this study provide an unprecedented opportunity to investigate in situ growth rates, microdiversity, and strain sharing within a huntergatherer population. The Hadza gut microbiome has been shown to undergo seasonal cycling in carbohydrate-active enzyme (CAZyme) and species composition 2,7 . Here we confirm these findings using deeper sequencing and updated metagenomic methods (Extended Data Fig. 9 ) and for the first time identify seasonal cycling in bacterial replication rates (Extended Data Fig. 10A ).

Analysis of intra-genic pN/pS ratios, a measure of bias towards non-synonymous mutations that suggests directional or diversifying selection, reveals a roster of functions with higher pN/pS ratios (p < 0.01; n=693); many are associated with extracellular or membrane-bound proteins, such as Ig-like folds, pilin motifs, and collagen-binding proteins (Extended Data Fig. 10B ).

Family relation and cohabitation are among the strongest factors associated with microbial strain sharing in industrial populations 53, 54 , but it is unknown whether these patterns hold for huntergatherer populations like the Hadza. We performed a high-resolution strain-tracking analysis (threshold for same strain = 99.999% popANI) and found that family members share more recently-transmitted strains than unrelated individuals among the Hadza (Fig. 4A , Supplementary Table 7) . Interestingly, strain sharing among members of the same bush camp approaches that between members of the same family (Fig. 4B) , and this effect is stronger in some bush camps ( Fig. 4B) . For example, individuals from the Hukamako camp share more strains with one another than family members share on average across all camps. Drinking water source (e.g., spring, stream, riverbed, etc.) and season (late dry, early dry, early wet, or late wet) have been previously linked to gut microbiome similarity 2,28 , and here we demonstrate that these factors are also linked with the sharing of identical microbial strains (Fig. 4A) . Overall, these results point to the importance of environmental factors, kinship, and bush camp membership (a social structure with no equivalent in the industrialized populations) in driving strain dispersal among hunter-gatherers.

Here we elucidated many novel facets of lifestyle differentiation in the gut microbiome using Data and materials availability: The authors declare that the data supporting the findings of this study are available within the paper and its supplementary information files. Metagenomic reads and de novo genomes are being submitted to the short read archive (SRA) and GenBank and this manuscript will be updated with additional accessions when the submission is complete. Bacteriophages without a host prediction are labeled "Uncharacterized".

(E) The percentage of metagenomic reads mapping to various domains averaged across all metagenomic samples from each population. The phyla "Bacteriodota" and "Firmicutes_A" are shown separated from other bacteria. "Unmapped" depicts the percentage of reads that do not map to any genomes, and "Low confidence" depicts the percentage of reads that map to genomes with less than 50% genome breadth. Wilcoxon rank-sum test).

The association of Pfams with VANISH or BloSSUM genomes. The x-axis displays the fraction of BloSSUM genomes a Pfam is detected in minus the fraction of VANISH genomes a Pfam is detected in (Pfam differential prevalence). The y-axis displays the p-value resulting from

Fisher's exact test with multiple hypothesis correction. 

Samples from Tanzania are from 2013-2014 and described previously 2, 7 . Permission was obtained from the National Institute of Medical Research and the Tanzania Commission for Science and Technology. For longitudinal samples, one sample from each individual was marked "high_prority" (Table S1 ) and used as noted in statistical analyses that are not robust to multiple samples from the same individual. Nepal samples were obtained previously 28 Metagenome quality control and assembly Raw sequencing reads were demultiplexed and data originating from the same libraries were concatenated prior to analysis. Raw reads were processed using BBtools suite 56 . Exact duplicate reads (subs=0) were marked (clumpify), adapters and low-quality bases were trimmed (bbduk;trimq=16 minlen=55), trimmed reads were mapped (BBmap) against the human genome (hg19) with masks over regions conserved broadly in eukaryotes, and duplicate reads were removed. FastQC 57 was used to ensure read quality. BBMerge was used to merge reads that could be joined unambiguously using the recommended settings (rem k=62 extend2=50 ecct vstrict) 58 . conditions. Individual colonies were re-streaked and then biotyped on a Bruker MALDI-TOF microflex to determine taxonomy. Colonies were grown in liquid media of the same type as the originating agar plate in anaerobic conditions. For isolating Treponema, 0.5% agar was added to the liquid media before making plates. Treponema strains were isolated after removing the top layer of agar to harvest colonies within the agar. Many of these isolated strains are not currently amenable to freezer storage and liquid-culture-based propagation in isolation.

Genomic DNA was extracted (Qiagen DNeasy Blood and Tissue). Libraries were prepared using half-reactions of the Nextera Flex kit, a minimum of 10 ng of DNA as input, 6 or 8 PCR cycles to minimize PCR amplification bias and a different 12 base pair unique dual-indexed barcode.

Libraries were quantified (Agilent Fragment Analyzer) and size-selected (AMPure XP beads; Beckman), targeting a fragment length of 450bp (insert size of 350 bp). Paired-end sequencing (2x140bp) was performed on a NovaSeq 6000 using S4 flow cells at Chan Zuckerberg Biohub.

Assembly of genomes was performed by trimming using BBduk (trimq=30), normalizing read depth using BBnorm (target=320, min=2), and assembled using SPAdes v3. 13 Bins were refined using MAGpurify v2 68 (using weighted mode for gc_content, tetra_freq, and coverage). The database used by Nayfach et al. 68 for conspecific analysis was augmented by adding all bins that were >=95% complete and <=5% contaminated (CheckM and anvi'o). For each species-level group, only the highest-quality genome bin for each individual was included.

Flagged contigs were removed. Rarely, a module suggested the removal of >25% of a bin's length, and in such cases that module was turned off. Genomes with ≥50% completeness and <10% contamination according to CheckM were retained, in accordance with MIMAG standards 69 .

Evaluation of self-and co-mapping relative to isolate genomes Isolate genomes from Hadza stool samples were de-replicated (dRep v3.2.2;-s 100000, -sa 0.99).

The highest scoring isolate as the representative when multiple isolates from the same secondary cluster were isolated from the same sample. 19 representative isolates were identified from samples that also had metagenome sequencing, assembly, and binning. Representative isolates and bins (>=50% complete, <5% contamination) generated using self-mapping and co-mapping were compared (MASH;-s 100000), selecting most similar bin MASH distance <0.05), with comapping and self-mapping recovered 17 and 10 bins representing isolates, respectively, with no significant differences in quality.

Bacterial and archael genomes sharing ≥ 95% average nucleotide identity (ANI) over 30% of their length were considered the same species 70 . Species-level groups were determined using dRep (v3.0.0 71 ;--S_algorithm fastANI --multiround_primary_clustering --clusterAlg greedy -ms 10000 -pa 0.9 -sa 0.95 -nc 0.30 -cm larger) based on the ANI between all genomes within each species-level group. Each genome was assigned a "centrality" score according to its average ANI to all other genomes in the group. The highest score genome was chosen as representative for each species-level group using the formula: score = (1*completeness) -(5*contamination) + (0.5*log10(ctg_N50)) + (1*log10(contig_bp)) + (2*(centrality-0.95)*100).

Centrality was calculated between all genomes in the UHGG genome database (v1.0) using the species-grouping 4 , and species representatives were chosen as above. Representatives from de novo genomes generated here and from the UHGG database (v1.0) were compared (dRep;--S_algorithm fastANI --multiround_primary_clustering --clusterAlg greedy -ms 10000 -pa 0.9 -sa 0.95 -nc 0.30 -cm larger). Representatives for each species-level group were chosen using the formula: score = (1*completeness) -(5*contamination) + (0.5*log10(ctg_N50)) + (1*log10(contig_bp)). Representatives were compared using the same dRep command, and winners were chosen using the same scoring criteria. Species-level group membership was backpropagated to the original bins.

Annotating bacteria / archaeal genomes and assessing genomic novelty Taxonomy was determined for all species-level representative genomes using GTDB (v95) 72 .

Novelty against UHGG were determined based on the species-level clustering described above.

Only genomes that pass both the MIMAG genomic standards used in this study (≥50% completeness and <10% contamination) and the standard used during UHGG creation (completeness -(5*contamination) > 50) were considered in comparisons against UHGG.

Species groups containing only genomes recovered from the Hadza were considered novel relative to UHGG.

A phylogenetic tree was made (GtoTree (v1.5.36) 73 with bacterial gene sets (-H Bacteria). All other settings were default. The tree was visualized using iTol 74 with taxonomy provided by GTDB.

EukRep (v0.6.6) 75 was employed on all assemblies (default settings) and if a genome bin was both >5 mega base pairs and >80% eukaryotic according to EukRep, it was called eukaryotic.

EukCC (v1.1) 76 was run on eukaryotic bigs using database eukcc_db_20191023_1

Proteins identified via EukCC were compared against UniRef100 77 (downloaded 3/5/2020) using DIAMOND 78 with a maximum e-value of 0.0001 (blastp -f 6 -e 0.0001 -k 1). The resulting taxonomy was parsed with tRep (https://github.com/MrOlm/tRep/tree/master/bin) 79 . Eukaryotic genomes with the same species-level taxonomy that originated from the same metagenomic sample were presumed to be from the same organism, were merged into a single file and reanalyzed using EukCC and tRep.

Phylogenetic tree was created (GToTree; v1.5.36) 73 "GToTree -H Universal_Hug_et_al -j 4 -Bc 1 -t") with a custom set of public reference genomes. Tree was visualized using iTol 74 .

To identify eukaryotic species that may be present in the metagenomics sequenced in this study and which did not have genomes recovered using the pipeline described above, we ran the program EukDetect 80 on all metagenomes sequenced in this study. Five species were detected in at least two samples with "perecent_observed_markers" ≥ 50, and reference genomes for these five species were included in the eukaryotic species-level genome database. In addition to these five genomes, the highest quality representative genome from each of the seven species of eukaryotes recovered in this study was included in the eukaryotic species-level genome database.

Metagenome reads were mapped onto the eukaryotic species-level genome database (Bowtie 2 65 ) and the resulting mappings were processed (inStrain quick_profile; v1.2.14 24 and CoverM v0.4.0 (https://github.com/wwood/CoverM)). A species was "present" if the breadth of coverage according to inStrain exceeded 0.1.

CheckV 81 (version 0.8.1, end-to-end mode, database v1.0) was run on all assembled contigs >=1500bp. Contigs predicted to contain one or more proviruses were run iteratively through CheckV (up to 5 rounds) until CheckV assumed the remaining region was viral. For provirus iterations only yielding an HMM-based completeness estimates, the most complete fragment was selected and excised from the parent contig. For provirus iterations with AAI (Average Amino acid Identity)-based completeness predictions, the fragment with the length closest to expected length was selected and excised from the parent contig. Viral contigs were passed through the MGV viral detection pipeline 5 and Bacphlip (v0.9.6) was run to assign a lytic and temperate score 82 .

Creating bacteriophage species-level genome database The 40,171 viruses recovered in this study were clustered into species-level groups as described previously 5 (blastn --min_ani 95 --min_qcov 0 --min_tcov 85, https://github.com/snayfach/MGV/tree/master/ani_cluster), and the longest viral contig in each cluster was selected as the representative. To measure novelty versus MGV, the 16,899 specieslevel representatives were subsequently clustered with the 54,118 MGV cluster representatives into species-level groups using the same method, and clusters without an MGV genome were considered novel.

Host prediction was performed on the 40,171 viruses as described previously 5 . Briefly, CRISPR spacers were identified (PILER-CR 83 and CRT 84 ) And BLASTN 85 used to search viruses for CRISPR spacers identified from bins reported here and UHGG v1 (-dust no -word_size 18).

CRISPR spacer hits were retained if there was a maximum of one mismatch or gap over >=95% of the spacer length. Additionally, hs-blastn 86 was used to identify >=1kb and >=96% DNA identity hits between all UHGG and newly-recovered genomes and viruses reported here. All viral connections to host genomes were aggregated, and host taxonomy was assigned based on the lowest host taxonomic rank that had >70% agreement across CRISPR or BLASTN.

Reads from all metagenomes generated here were mapped to the bacterial/archaeal, bacteriophage, and eukaryote species-level genome databases (Bowtie 2 65 ). Resulting mappings were processed (inStrain quick_profile; v1.2.14 24 and CoverM v0.4.0 (https://github.com/wwood/CoverM)). Prokaryotes where the representative genome was detected at ≥ 0.5 breadth (i.e. at least half of bases were covered by at least 1 read) were considered present. Bacteriophages and eukaryotes breadth thresholds were 0.75 and 0.1, respectively.

Relative abundance (% DNA) was calculated as (# reads mapping a genome / total # reads in metagenome). Shannon diversity was calculated based on relative abundance (% DNA) values (scikit-bio (http://scikit-bio.org)) .

In silico rarefaction was performed on samples sequenced to ≥ 50 Gbp using the InStrain auxiliary script "rarefaction_curve.py" (v0.3.0) (https://github.com/MrOlm/inStrain/blob/master/auxiliary_scripts/rarefaction_curve.py) on a .bam file of reads mapped with Bowtie 2 65 . For other rarefaction curves (Fig. 1G, Extended   Data Fig. 4C ) an alternative in silico rarefaction technique was used. Genomes with < 50% breadth were removed from the analysis, and for each rarefaction level 1) a scaling threshold was established based on the total sequencing depth (scaling factor = rarefaction depth / total sequencing depth), 2) scaled genome coverage was calculated by each genome by multiplying un-rarefied coverage by this scaling factor, and 3) genomes with scaled coverage ≥ 1 were considered detected.

Collating previously published human gut metagenomic samples Prevalence of microbial species across lifestyle was characterized using a curated collection of 2122 metagenomes including samples from industrial 29-32,34,40,87,88 , transitional 15,17,35-38,40,41,89 , and hunter-gatherer populations 36, 38, [42] [43] [44] Genomes in the 95th percentile or greater where Hadza prevalence was higher were "VANISH" taxa. Those in the 95th percentile or greater where industrial prevalence was higher were "BloSSUM".

Heatmaps displaying species prevalence data were created using the R package "pheatmap" (v1.0.12). Principal coordinate analysis was performed on the species prevalence data using the vegdist function in the package "vegan" 92 (v2.5-6) and the function cmdscale from the package "stats" (v4.0.4).

InStrain profile (v1.2.14) (26) was run on all .bam files created as described in the "species prevalence analysis" section. All iRep values for genomes with ≥ 50% genome breadth and with values < 5 were considered valid. Seasonality of iRep values was plotted using seaborn v0.11.1 93 "lineplot" with the default estimator (mean) and 95% confidence interval for error bars.

Presence or absence of each Blastocystis MAG was determined as described above. The top two most prevalent Blastocystis MAGs were most closely related to Blastocystis ST1 and Blastocystis ST4, respectively(tRep; https://github.com/MrOlm/tRep/tree/master/bin) 79 . Wilcoxon rank sum test was used to determine if presence of a Blastocystis genome was correlated with total relative abundance of VANISH taxa and BloSSUM taxa separately. Linear discriminant analysis was performed using the "lda" function from the package MASS (v7.3) to determine the effect size of each association.

Principal coordinate analysis was performed on the Bray-Curtis distance between all Hadza samples in our study. Relative abundance was aggregated at the taxonomic level of family to mirror initial analysis done in Smits, et al. 2 . The adonis function in the R package "vegan" was used to test significance by season. Subject ID was used as a sub-stratum. A Wilcoxon rank-sum test was used to determine whether samples varied in composition along the major axis of variation, aggregated by season.

The average relative abundance of each species-level group in our bacterial/archaeal specieslevel genome database was calculated for each sub-season. Taxa that observed cyclical abundance over the course of a year was determined (Kruskal-Wallis test; p-values were Bonferroni-adjusted to control for multiple hypothesis testing).

CAZyme annotation was performed using dbCAN_v9 HMMs 94 (http://bcb.unl.edu/dbCAN2/download/Databases/V9/dbCAN-HMMdb-V9.txt). Proteins were searched against the HMM collection using hmmscan 95 and filtered using the "hmmscanparser.sh" script provided with dbCAN2. Seasonal CAZyme analysis was performed using previously described seasonal delineations 2 .

Protein clustering and novelty assessment Predicted proteins were clustered at 95% identity (MMseqs2 96 ; v12.113e3; easy-linclust --covmode 1 -c 0.8 --kmer-per-seq 80 --min-seq-id 0.95 --compressed 1). Novelty relative to UHGP-95 (v1.0) 4 was determined by clustering together UHGP-95 with our de novo representative proteins (MMseqs2) and back-propagating to the initial de novo clustering to calculate the number of protein clusters assembled from each lifestyle (Extended Data Figure 2 ).

Representative proteins were also compared against UniRef100 using DIAMOND 78 . Novel proteins were defined when the representative protein was not related to any protein in the UniRef100 database with ≥ 95% amino acid identity.

Proteins were annotated (Pfams (v32) 97 ;hmmsearch 95 ), filtered (hmmsearch --cut_ga -domtblout), and protein domain overlap was resolved (cath-resolve-hits.ubuntu14.04 98 ; --inputformat hmmer_domtblout --hits-text-to-file).

For each Pfam, the number of VANISH and BloSSUM genomes with at least one gene containing a Pfam was recorded as "c1" and "c2", respectively. Pfams found more often in one genome set or the other were detected using a Fisher's exact test on the following contingency correction was performed using the FDR method 99 ). Pfam differential prevalence was calculated as (c2 / (# BloSSUM genomes)) -(c1 / (# VANISH genomes)).

Spirochaete genomes from the bacterial/archaeal species-level genome database and NCBI were de-replicated (dRep; --S_algorithm fastANI -ms 10000 -pa 0.9 -sa 0.95), and a phylogenetic tree was generated (GtoTree;v1.5.36) 61,73,95,100-102 ) from bacterial (-H Bacteria) gene sets. All other settings were default. The tree was visualized using iTol 74 and colored by taxonomy provided by GTDB.

A tree of Treponema succinifaciens in the bacterial/archaeal species-level genome database was generated using GtoTree; v1.5.36) 61,73,95,100-102 with IQ-Tree 103 from bacterial (-H Bacteria) gene sets (completeness threshold 75% with "-G 0.75"). We used country-of-origin information (recoded as continent-of-origin) as a trait of each genome to measure the degree of phylogenetic signal in the geographic spread of the MAGs ("delta" function from Borges, et al. 49 ). P-value of the delta statistic was performed using 100 calculations with randomly permuted tree tip labels.

Stochastic character mapping was performed using SIMMAP via the "make.simmap" function ("phytools" R package 104 ). We applied the character mapping on the marker-based tree of T.

succinifaciens GToTree generated MAGs (described above). "Country of origin" of each MAG served as a trait and inferred ancestral character states on phylogeny (equal rates model, repeated 100 times to calculate average # of character changes and direction of host transfer events).

The pN/pS was calculated using inStrain (v1.2.14) (inStrain profile --database_mode) 24 on mappings to the bacterial/archaeal species-level genome database, using the predicted genes. All genomes detected with < 80% breadth were excluded from analysis. For remaining genomes, genes with "SNV_count" < 5 were excluded. If <10 genes in a genome fit this criteria, the genome was excluded. Genes with ≥ 5 "SNV_count" and a blank "pNpS_variants" value were assigned a "pNpS_variants" of 100. Genes were sorted according to "pNpS_variants", and genes in the top and bottom 10% of "pNpS_variants" were recorded. How many times each Pfam was detected on any genes that passed the above filters ("trial_count") and how many times the Pfam was in genes in the top and bottom 10% of genes based on "pNpS_variants" ("top_success_count", "bottom_success_count") was noted.

To determine Pfams in the top or bottom 10% of "pNpS_variants" more often than expected by chance, genes detected in less than 5 samples were excluded, the number of times a gene was in the top 10%, bottom 10%, and seen total was scaled ("trial_count"/5), and the scaled "top_success_count", "bottom_success_count", and "trail_count" values were summed together.

Probability that the "top_success_count" or bottom_success_count" was due to random chance was calculated using binomial statistics (Python Scipy 105 ). P-values reported as 0 were set to 1E-300 and multiple hypothesis correction was performed (FDR 99 ). Mean Pfam pN/pS was calculated as the average "pNpS_variants'' of all genes on genomes with ≥ 80% breadth and a non-blank "pNpS_variants" value.

The procedure described above was repeated using "coverage" instead of "pNpS_variants" to detect Pfams associated with genes with higher or lower coverage than others. To avoid mismapping (recruiting genes from other populations), all Pfams with uncorrected p-values < 0.01 were excluded from the "pNpS_variants" analysis.

Genome detection was defined as minCov breadth ≥0.5 (i.e. at least half of bases were covered by at least 5 reads) as measured using "inStrain profile". Each species detected in more than one individual was compared using inStrain compare. Where a genome was detected in more than 120 samples, samples were divided into groups of equal size such that no group had more than 120 samples, and "inStrain compare" was then run on each group. A distance matrix was created for each species based on resulting popANI values and used to cluster each species into individual strains using "average" hierarchical clustering with a threshold of 99.999% popANI (Scipy cluster). Strains shared between sample pairs were calculated based on this strain definition, and P-values were calculated only considering pairs of samples in which both samples were from Hadza adults.

Gut-microbiota-targeted diets modulate human immune status

Seasonal cycling in the gut microbiome of the Hadza hunter-gatherers of Tanzania

Public human microbiome data are dominated by highly developed countries

A unified catalog of 204,938 reference genomes from the human gut microbiome

Metagenomic compendium of 189,680 DNA viruses from the human gut microbiome

Strategic vision for improving human health at The Forefront of Genomics

Links between environment, diet, and the hunter-gatherer microbiome

The ancestral and industrialized gut microbiota and implications for human health

The gut microbiota of rural papua new guineans: composition, diversity patterns, and ecological processes

Human gut microbiome viewed across age and geography

The microbiome of uncontacted

The infant microbiome development: mom matters

Antibiotics and the gut microbiota

The theory of disappearing microbiota and the epidemics of chronic diseases

US Immigration Westernizes the Human Gut Microbiome

Reconstruction of ancient microbial genomes from the human gut

The Prevotella copri Complex Comprises Four Distinct Clades Underrepresented in Westernized Populations

Rapid changes in the gut microbiome during human evolution

An African origin for the intimate association between humans and Helicobacter pylori

What are the consequences of the disappearing human microbiota?

Vulnerability of the industrialized microbiota

Hybrid, ultra-deep metagenomic sequencing enables genomic and functional characterization of low-abundance species in the human gut microbiome

Genome-resolved metagenomics of eukaryotic populations during early colonization of premature infants and in hospital rooms

inStrain profiles population microdiversity from metagenomic data and sensitively detects shared microbial strains

Measurement of bacterial replication rates in microbial communities

Robust Variation in Infant Gut Microbiome Assembly Across a Spectrum of Lifestyles

Gut microbiome transition across a lifestyle gradient in Himalaya

A metagenome-wide association study of gut microbiota in type 2 diabetes

A human gut microbial gene catalogue established by metagenomic sequencing

Personalized Nutrition by Prediction of Glycemic Responses

The Human Gut Microbiome as a Transporter of Antibiotic Resistance Genes between Continents

Erratum: Strains, functions and dynamics in the expanded Human Microbiome Project

Human Microbiome Project Consortium. Structure, function and diversity of the healthy human microbiome

Subspecies in the global human gut microbiome

Subsistence strategies in traditional societies distinguish gut microbiomes

Mobile genes in the human microbiome are structured from global to individual scales

Use of shotgun metagenomics for the identification of protozoa in the gut microbiota of healthy individuals from worldwide populations with various industrialization levels

Unique Features of Ethnic Mongolian Gut Microbiome revealed by metagenomic analysis

Interconnected microbiomes and resistomes in low-income human habitats

Differential human gut microbiome assemblages during soil-transmitted helminth infections in Indonesia and Liberia

Extensive Unexplored Human Microbiome Diversity Revealed by Over 150,000 Genomes from Metagenomes Spanning Age, Geography, and Lifestyle

Metagenome Sequencing of the Hadza Hunter-Gatherer Gut Microbiota

Short-and long-read metagenomics of urban and rural South African gut microbiomes reveal a transitional composition and undescribed taxa

Pfam: The protein families database in 2021

Colonocyte metabolism shapes the gut microbiota

Treponema peruense sp. nov., a commensal spirochaete isolated from human faeces

Measuring phylogenetic signal between categorical traits and phylogenies

Codiversification of gut microbiota with humans

A geographically explicit genetic model of worldwide human-settlement history

Traces of human migrations in Helicobacter pylori populations

The Long-Term Stability of the Human Gut Microbiota

Variation and transmission of the human gut microbiota across multiple familial generations

Gut Microbiota-Targeted Diets Modulate Human Immune Status. Cold Spring Harbor Laboratory

BBTools software package

FastQC: a quality control tool for high throughput sequence data

BBMerge--accurate paired shotgun read merging via overlap

A. metaSPAdes: a new versatile metagenomic assembler

QUAST: quality assessment tool for genome assemblies

Prodigal: prokaryotic gene recognition and translation initiation site identification

Using SPAdes DE Novo Assembler

CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes

Mash: fast genome and metagenome distance estimation using MinHash

Fast gapped-read alignment with Bowtie 2

MetaBAT 2: an adaptive binning algorithm for robust and efficient genome reconstruction from metagenome assemblies

Anvi'o: an advanced analysis and visualization platform for 'omics data

New insights from uncultivated genomes of the global human gut microbiome

Minimum information about a single amplified genome (MISAG) and a metagenome-assembled genome (MIMAG) of bacteria and archaea

Consistent Metagenome-Derived Metrics Verify and Delineate Bacterial Species Boundaries

dRep: a tool for fast and accurate genomic comparisons that enables improved genome recovery from metagenomes through de-replication

GTDB-Tk: a toolkit to classify genomes with the Genome Taxonomy Database

GToTree: a user-friendly workflow for phylogenomics

Interactive Tree Of Life (iTOL): an online tool for phylogenetic tree display and annotation

Genomereconstruction for eukaryotes from complex natural microbial communities

Estimating the quality of eukaryotic genomes recovered from metagenomic analysis with EukCC

UniRef: comprehensive and non-redundant UniProt reference clusters

Fast and sensitive protein alignment using DIAMOND

Necrotizing enterocolitis is preceded by increased gut bacterial replication, Klebsiella, and fimbriae-encoding bacteria

Accurate and sensitive detection of microbial eukaryotes from whole metagenome shotgun sequencing

CheckV assesses the quality and completeness of metagenome-assembled viral genomes

BACPHLIP: predicting bacteriophage lifestyle from conserved protein domains

PILER-CR: fast and accurate identification of CRISPR repeats

CRISPR recognition tool (CRT): a tool for automatic detection of clustered regularly interspaced palindromic repeats

BLAST+: architecture and applications

High speed BLASTN: an accelerated MegaBLAST search tool

Strains, functions and dynamics in the expanded Human Microbiome Project

Dynamics and Stabilization of the Human Gut Microbiome during the First Year of Life

Unique Features of Ethnic Mongolian Gut Microbiome revealed by metagenomic analysis

Elevated rates of horizontal gene transfer in the industrialized human microbiome

VEGAN, a package of R functions for community ecology

seaborn: statistical data visualization

dbCAN2: a meta server for automated carbohydrate-active enzyme annotation

Accelerated Profile HMM Searches

MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets

The Pfam protein families database in 2019

cath-resolve-hits: a new tool that resolves domain matches suspiciously quickly

Econometric and statistical modeling with python

MUSCLE v5 enables improved estimates of phylogenetic tree confidence by ensemble bootstrapping

TrimAl: a Tool for automatic alignment trimming

IQ-TREE 2: New Models and Efficient Methods for Phylogenetic Inference in the Genomic Era

phytools: an R package for phylogenetic comparative biology (and other things)

Open source scientific tools for Python

Recent developments in Blastocystis research

Honey, Hadza, hunter-gatherers, and human evolution

Database Resources of the National Center for Biotechnology Information

For each population in this study, the percentage of predicted proteins from recovered genomes that are present in the UniRef100 and UHGP-95 protein databases Extended Data Figure 3 . Phylogenies of strains isolated from Hadza stool samples. (A) A phylogenetic tree of all isolate genomes sequenced in this study. The tree is decorated with phylum of each species (inner ring), whether the species is newly isolated for the first time (middle ring) and whether the species is novel relative to UHGG (outer ring). Figure 4 . Increased sequencing depth results in the detection of novel and phylogenetically distinct taxa. (A) Taxonomic distribution of organisms present at different ranges of relative abundance levels (horizontal stacked bar plots) and the percentage of species that are novel according to GTDB (text percentages right of horizontal bars). Organisms detected at low relative abundance levels are more likely to be novel than those that are more abundant. (B) A depiction of how metrics would be different if 5 Gbp (approximately the average depth sequenced in previous large-scale genome binning studies) had been sequenced in the study rather than ~25 Gbp. 'Nepal For.' includes the Chepang foragers, and the 'Nepal Ag.' group includes Raute, Raji, and Tharu agrarians. The violin plots depict distribution of relative abundance for 6 SRGs that vary significantly over the 5 sub-seasons. The top three sub-panels depict specieslevel groups that have higher abundance in the wet seasons. The bottom three sub-panels depict species-level groups that have higher abundance in the dry seasons. (Bulleidia sp., P-adjusted = 7.5 x 10 -20 ; Dorea formicigenerans, P-adjusted = 1.2 x 10 -16 ; Holdemanella sp003436425, Padjusted = 6.5 x 10 -16 ; Prevotella copri_A, P-adjusted = 0.0054; Succinivibrio sp000431835, Padjusted = 4.7 x 10 -7 ; Treponema_D succinifaciens, P-adjusted = 0.012; Kruksal-Wallis test). 2013-LD (Late Dry); 2014-EW (Early Wet); 2014-LW (Late Wet); 2014-ED (Early Dry); 2014-LD (Late Dry). (C) For Hadza gut metagenomes sequenced in this study, genes present (≥ 80% breadth of coverage) on detected genomes (≥ 50% breadth of coverage) were annotated against the CAZyme database. CAZyme Pielou evenness (left), total richness (middle), and Shannon diversity (right) were calculated using the summed relative abundance of genomes containing each GH and PL CAZyme Family (for example, 'GH16') or (D) SubFamily (for example, 'GH16.7'). Values for samples collected from Hadza individuals in different seasons were compared using a two-sided Wilcoxon rank-sum test (* P < 0.01; ** P < 1 x 10 -5 ; *** P < 1 x 10 -10 ). 

Supplementary Tables   Supplementary Table 1 Description of Hadza, Nepali, and Californian cohortsSupplementary Table 2 Comprehensive genome information info (including representative genomes and other genomes)Supplementary Table 3 Roster of strains isolated from Hadza stool (including culturing information)Supplementary Table 4 Global metagenomics data set broken down by sample Supplementary Table 5 Prevalence/abundance data for each species-level representative genome in our bacterial/archaeal species-level genome database Supplementary Table 6 Pfam info (lifestyle-enrichment and pN/pS data) Supplementary Table 7 Strain sharing data between Hadza adult samples