key: cord-103733-blam1f4c
authors: Levade, Inès; Saber, Morteza M.; Midani, Firas; Chowdhury, Fahima; Khan, Ashraful I.; Begum, Yasmin A.; Ryan, Edward T.; David, Lawrence A.; Calderwood, Stephen B.; Harris, Jason B.; LaRocque, Regina C.; Qadri, Firdausi; Shapiro, B. Jesse; Weil, Ana A.
title: Predicting Vibrio cholerae infection and disease severity using metagenomics in a prospective cohort study
date: 2020-06-24
journal: bioRxiv
DOI: 10.1101/2020.02.25.960930
sha:
doc_id: 103733
cord_uid: blam1f4c

Background Susceptibility to Vibrio cholerae infection is impacted by blood group, age, and pre-existing immunity, but these factors only partially explain who becomes infected. A recent study used 16S rRNA amplicon sequencing to quantify the composition of the gut microbiome and identify predictive biomarkers of infection with limited taxonomic resolution.

Methods To achieve increased resolution of gut microbial factors associated with V. cholerae susceptibility and identify predictors of symptomatic disease, we applied deep shotgun metagenomic sequencing to a cohort of household contacts of patients with cholera.

Results Using machine learning, we resolved species, strains, gene families, and cellular pathways in the microbiome at the time of exposure to V. cholerae to identify markers that predict infection and symptoms. Use of metagenomic features improved the precision and accuracy of prediction relative to 16S sequencing. We also predicted disease severity, although with greater uncertainty than our infection prediction. Species within the genera Prevotella and Bifidobacterium predicted protection from infection, and genes involved in iron metabolism also correlated with protection.

Conclusion Our results highlight the power of metagenomics to predict disease outcomes and suggest specific species and genes for experimental testing to investigate mechanisms of microbiome-related protection from cholera.
SUMMARY In a prospective cohort, cholera infection and disease severity can be predicted using metagenomic sequencing of the pre-infection gut microbiome, which also suggests potentially protective bacterial species and genes.

Cholera is an acute diarrheal disease caused by Vibrio cholerae. It is a major public health threat worldwide that continues to cause large outbreaks, such as in Yemen, where over 1.7 million cases have been reported since 2016 (1,2). Transmission of V. cholerae between household members commonly occurs through shared sources of contaminated food or water or through fecal-oral spread (3,4). The clinical spectrum of disease ranges from asymptomatic infection to severe watery diarrhea that can lead to fatal dehydration (5). Host factors such as age, innate immune factors, blood group, or prior acquired immunity partially explain why some people are more susceptible to V. cholerae infection than others, but a substantial amount of the variation remains unexplained (6).

The gut bacterial community can protect against enteropathogenic infections (7)

Here we used shotgun metagenomics to analyze an expanded prospective cohort of persons exposed to V. cholerae in Bangladesh. Our metagenomic analysis yielded improved outcome predictions compared to 16S rRNA sequencing, and identified bacterial genes associated with remaining uninfected after exposure to V. cholerae. We were also able to predict disease severity among infected contacts, albeit with lower power and precision than susceptibility. Finally, we highlight several microbiome-encoded metabolic functions associated with protection against cholera.

Sample collection, clinical outcomes and metagenomic sequencing

As described in (15), household contacts were enrolled within 6 hours of the presentation of an index cholera case at the icddr,b (International Center for Diarrheal Disease Research, Bangladesh) Dhaka Hospital.
Index patients with severe acute diarrhea, a stool culture positive for V. cholerae, an age between 2 and 60 years, and no major comorbid conditions were recruited (4,6). A clinical assessment of symptoms in household contacts was conducted daily for the 10-day period after presentation of the index case, and repeated on day 30. We collected demographic information, rectal swabs, and blood samples for ABO typing and vibriocidal antibody titers as described in the Supplementary Methods. During the observation period, contacts were determined to be infected if any rectal swab culture was positive for V. cholerae and/or if the contact developed diarrhea and a 4-fold increase in vibriocidal titer during the follow-up period (4,6). Contacts with positive rectal swabs who developed watery diarrhea were categorized as symptomatic and those without diarrhea were considered asymptomatic (Figure 1). Contacts who were V. cholerae positive (by culture or deep 16S amplicon sequencing (15)) at the time of enrollment were excluded, as were contacts who reported antibiotic use or diarrhea during the week prior to enrollment. DNA was extracted from the selected samples and used for shotgun metagenomic sequencing. Details on cohorts, sequencing methods and sample processing are described in Supplementary Methods.

We used MetaPhlAn2 (version 2.9) (16) for taxonomic profiling and HUMAnN2 (17) to profile cellular pathways (from MetaCyc) and gene families (identified using the PFAM database). For identification of biomarkers of susceptibility and disease severity, we used MetAML (18) to apply a random forest (RF) classifier to species, pathway and gene-family relative abundances, as well as strain-specific marker presence/absence.
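For readers unfamiliar with stratified cross-validation, which is the evaluation scheme used for the classifiers in this section, the following is a minimal sketch of a stratified 3-fold split that keeps the infected:uninfected ratio constant across folds. It is an illustration only and is not the MetAML implementation; the function name `stratified_folds`, the cohort size, and the class counts are hypothetical.

```python
import random
from collections import Counter, defaultdict


def stratified_folds(labels, k=3, seed=0):
    """Split sample indices into k folds that each keep the overall class ratio."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    folds = [[] for _ in range(k)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal each class out round-robin so every fold receives its share.
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


# Hypothetical cohort: 12 infected and 18 uninfected contacts (made-up numbers).
labels = ["infected"] * 12 + ["uninfected"] * 18
folds = stratified_folds(labels, k=3, seed=1)

for fold_number, validation in enumerate(folds):
    # The two remaining folds form the training set (2/3 of samples).
    training = [
        idx for other, fold in enumerate(folds) if other != fold_number for idx in fold
    ]
    counts = Counter(labels[idx] for idx in validation)
    print(f"fold {fold_number}: train={len(training)} "
          f"validation={len(validation)} {dict(counts)}")
```

Calling `stratified_folds` with a different `seed` for each replicate would give repeated cross-validation runs of the kind compared by t-test in the text.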
Models constructed using each of these feature types were compared to a random dataset with shuffled labels, and to a model constructed with clinical/demographic data, using two-sample, two-sided t-tests over 20 replicate cross-validations (18). We used a stratified 3-fold cross-validation approach, splitting our dataset into validation and training sets (1/3 and 2/3 of samples, respectively) with the same infected:uninfected ratio. We used an embedded feature selection strategy to identify the most

Metagenomic sequencing of the gut microbiome in household contacts exposed to V.

cohort, upon which we base the majority of our analyses. We also performed exploratory analyses on the expanded cohort to determine the potential for predictive models to be generalized to larger samples. We used the shotgun metagenomic DNA sequence reads from these samples to characterize four features of the microbiome: 1) relative abundances of microbial species, 2) the presence/absence of sub-species-level strains, 3) metabolic pathway relative abundances, and 4) gene family relative abundances (Table 1)

(Table S4). However, such high AUC values should be treated with caution because models can be overfit when a supervised feature selection step is applied to the same data used to train the model (18). Because we did not have a fully independent validation cohort (e.g. from another continent) to test our model, we used the features selected from the Midani 2018 training dataset to make predictions on the expanded cohort, achieving AUCs between 0.89 and 0.93 for prediction of infection using the four types of features (Table S4).
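The overfitting risk from supervised feature selection can be illustrated with a small, self-contained simulation. The sketch below is not the analysis performed in this study: it uses pure-noise features, a nearest-centroid classifier as a simple stand-in for a random forest, and hypothetical sizes (42 samples, 1000 features). It compares selecting features on all samples before cross-validation with selecting them inside each training fold.

```python
import random


def top_features(X, y, rows, n_keep):
    """Rank features by absolute difference in class means, using only `rows`."""
    scores = []
    for j in range(len(X[0])):
        pos = [X[i][j] for i in rows if y[i] == 1]
        neg = [X[i][j] for i in rows if y[i] == 0]
        scores.append((abs(sum(pos) / len(pos) - sum(neg) / len(neg)), j))
    return [j for _, j in sorted(scores, reverse=True)[:n_keep]]


def centroid_predict(X, y, train, test_row, feats):
    """Nearest-centroid classifier (a simple stand-in, not a random forest)."""
    dist = {}
    for cls in (0, 1):
        members = [i for i in train if y[i] == cls]
        centroid = [sum(X[i][j] for i in members) / len(members) for j in feats]
        dist[cls] = sum((X[test_row][j] - c) ** 2 for j, c in zip(feats, centroid))
    return 0 if dist[0] <= dist[1] else 1


def cv_accuracy(X, y, folds, n_keep, leaky):
    """Cross-validated accuracy with feature selection outside or inside folds."""
    all_rows = [i for fold in folds for i in fold]
    correct = 0
    for fold in folds:
        train = [i for i in all_rows if i not in fold]
        # Leaky: features chosen using every sample, including the held-out fold.
        # Honest: features chosen from the training fold only.
        feats = top_features(X, y, all_rows if leaky else train, n_keep)
        correct += sum(centroid_predict(X, y, train, i, feats) == y[i] for i in fold)
    return correct / len(all_rows)


rng = random.Random(0)
n_samples, n_features = 42, 1000
# Pure noise: the labels carry no information about the features.
X = [[rng.gauss(0, 1) for _ in range(n_features)] for _ in range(n_samples)]
y = [i % 2 for i in range(n_samples)]
# Three folds of 14 samples, each with the same 7:7 class ratio.
folds = [[i for i in range(n_samples) if (i // 2) % 3 == f] for f in range(3)]

leaky = cv_accuracy(X, y, folds, n_keep=10, leaky=True)
honest = cv_accuracy(X, y, folds, n_keep=10, leaky=False)
print(f"selection on all data: {leaky:.2f}; selection inside folds: {honest:.2f}")
```

Because the labels are unrelated to the features, any accuracy well above chance in the first setting reflects information leaking from the held-out samples into the selection step, which is the reason for caution stated above.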
Again, because the expanded cohort partly overlaps with the Midani cohort, and includes some repeated samples from the same individuals over time, these results could also be prone to overfitting, but they demonstrate the potential for generalized predictions.

Finally, we repeated the RF analysis using all features in the expanded dataset, which increased predictive performance relative to the original Midani cohort (Figure S1). Once again, genes and pathways outperformed species and strains according to all metrics, with AUC reaching ~0.88 using cellular pathways (Table 1). This improvement in the expanded cohort also highlights the importance of using larger, more balanced datasets as input to predictive models.

To put the metagenomic predictions in context, we compared their predictive power and accuracy to those of clinical and demographic factors (Table S1a). Three of these factors (age, baseline vibriocidal antibodies and blood group) are known to impact susceptibility to V. cholerae infection (6,15) and we used them to train RF models (Table S5). As expected, contacts who became infected tended to be younger and have lower baseline antibody titers than those who remained uninfected (Table S1b), but these small differences were not sufficient to train a significantly predictive model. An RF model trained on the seven clinical and demographic factors did not perform better than a random model with shuffled labels (AUC=0.60, p=0.66;

To predict symptomatic disease among infected individuals (Figure 1), we divided samples into uninfected, symptomatic and asymptomatic groups and again applied the RF approach. We used the F1 score as a performance metric since it is well suited to the uneven class distributions in our uninfected/symptomatic/asymptomatic comparison.
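As a brief illustration of why the F1 score is informative for uneven classes, the sketch below computes per-class F1 (the harmonic mean of precision and recall) on a small made-up set of labels. The labels, predictions and the unweighted (macro) average are assumptions for illustration; the text does not state which averaging scheme was used.

```python
def per_class_f1(y_true, y_pred):
    """Return {class: F1}, where F1 = 2*TP / (2*TP + FP + FN)."""
    scores = {}
    for cls in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[cls] = 2 * tp / denom if denom else 0.0
    return scores


# Hypothetical labels: U = uninfected, S = symptomatic, A = asymptomatic.
y_true = ["U"] * 6 + ["S"] * 2 + ["A"] * 2
y_pred = ["U", "U", "U", "U", "U", "S", "S", "A", "A", "U"]

f1 = per_class_f1(y_true, y_pred)
# Macro average: every class counts equally, regardless of its size (assumption).
macro_f1 = sum(f1.values()) / len(f1)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print({cls: round(score, 3) for cls, score in f1.items()})
print(f"macro F1 = {macro_f1:.3f}, accuracy = {accuracy:.3f}")
```

In this toy example the accuracy is dominated by the large uninfected class, whereas the per-class F1 scores expose the weaker performance on the two small classes.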
Applied to the Midani 2018 cohort, this model predicted outcomes significantly better than random (shuffled labels) using species, strains or pathway data, but not gene families (Table 1; see Table S3 for p-values). However, the F1 scores for the symptomatic/asymptomatic predictions were systematically lower (mean scores ranging from 0.57 to 0.60) than for the infected/uninfected prediction (means ranging from 0.64 to 0.71). Using the expanded cohort, the scores improved only slightly (Table 1). These results suggest that disease severity is predictable in principle, but with greater uncertainty than the infection outcome.

(Figures 3A, S3A and S4A).

(Figure S5), but the overlap was poorer for the uninfected/symptomatic/asymptomatic prediction (Figure S6). This is consistent with the difficulty of predicting disease severity.

In general, the most important species were selected by the model because of differences in relative abundance at baseline among uninfected/symptomatic/asymptomatic outcomes (Figures S7 and S8). In rare cases, species presence/absence information was predictive. For example, Ruminococcus gnavus is absent (near or below the limit of detection) in most of the individuals who become infected, but present in many (but not all) of those who remain uninfected (Figure S7). Thus, there is no single, strong predictor of infection outcomes, but rather a probabilistic combination of many species, each of relatively modest predictive value.

Table S6

We also identified gene families in the gut microbiome of persons who remained uninfected during follow-up (Figures S9 and S10), with some of the top gene families involved in DNA repair, transmembrane transporter activity, iron metabolism (indicated with asterisks in Figure 4), and genes of unknown function (Table S8). Long-chain fatty acid biosynthesis pathways (e.g.
cis-vaccenate, gondoate and stearate) were associated with individuals who remained uninfected, while amino acid biosynthesis and catabolic pathways were associated with individuals who developed infection (Figures S11 and S12, Table S9). We identified three iron-related genes associated with remaining uninfected: (1) the ferric uptake regulator Fur, a major regulator of iron homeostasis, (2) thioredoxin, a redox protein involved in adaptation to oxidative and iron-deficiency stress, and (3) the TonB/ExbD/TolQR system, a ferric chelate transporter (19-21). In individuals who became infected but remained asymptomatic, two genes involved in the conversion of riboflavin into catalytically active cofactors, riboflavin kinase and FAD synthetase, were the first and third most discriminant features (Figure 4, Table S8).

We next asked which taxa in the microbiome likely encoded these genes. In some cases, specific taxonomic groups corresponded to discrete gene functions. For example, several iron metabolism-related gene families tend to be encoded by Prevotella genomes (Figure S14). In other cases, the major contributors to protective gene families were unclassified (Figures 5 and S13

here; see Table S8 for pathways. The contributions of each genus to encoding these pathways are shown as stacked colors within each bar, linearly scaled within the total. See Table S9 for the complete list of pathways.

The gut microbiome is a potentially modifiable host risk factor for cholera, and identification of specific genes and strains correlated with susceptibility is needed for experimental testing to understand the mechanisms of observed correlations. Compared to a previous study using a single marker gene, shotgun metagenomics provides this degree of resolution, potentially to the species and strain level, and to the level of individual genes and cellular functions.
We found that gene families in the gut microbiome at the time of exposure to V. cholerae were more predictive of susceptibility than taxonomic or clinical and demographic information. Selecting a subset of the most informative features further improved predictions, but using these selected features may lead to overfitting. This suggests an upper limit to predictive power that requires validation in larger, independent cohorts.

All three Bifidobacterium species associated with contacts that developed infection were also associated with asymptomatic rather than symptomatic disease, and prior work on this genus supports several hypotheses for this relationship. First, Bifidobacteria are known to produce the short-chain fatty acid (SCFA) acetate, which can protect against enteric infection in mice (30,33,34). SCFAs are also known to inhibit cholera toxin-related chloride secretion in the mouse gut, reducing water and sodium loss, and have been observed to increase cholera toxin-specific antibody responses (31-33). Bifidobacteria are also major producers of lactate, a metabolite that has been shown to impair V.

The authors declare that there are no conflicts of interest.
1. Updated Global Burden of Cholera in Endemic Countries
2. Cholera epidemic in Yemen, 2016-18: an analysis of surveillance data
3. Defining endemic cholera at three levels of spatiotemporal resolution within Bangladesh
4. Clinical Outcomes in Household Contacts of Patients with Cholera in Bangladesh
5. Cholera transmission: the host, pathogen and bacteriophage dynamic
6. Susceptibility to Vibrio cholerae Infection in a Cohort of Household Contacts of Patients with Cholera in Bangladesh
7. Roles of the intestinal microbiota in pathogen protection
8. Members of the human gut microbiota involved in recovery from Vibrio cholerae infection
9. Bile Salts Modulate the Mucin-Activated Type VI Secretion System of Pandemic Vibrio cholerae
10. A single gene of a commensal microbe affects host susceptibility to enteric infection
11. Probiotic strains detect and suppress cholera in mice
12. Anti-biofilm Properties of the Fecal Probiotic Lactobacilli Against Vibrio spp
13. Commensal-derived metabolites govern Vibrio cholerae pathogenesis in host intestine
14. Gut Microbial Succession Follows Acute Secretory Diarrhea in Humans
15. Human Gut Microbiota Predicts Susceptibility to Vibrio cholerae Infection
16. MetaPhlAn2 for enhanced metagenomic taxonomic profiling
17. Species-level functional profiling of metagenomes and metatranscriptomes
18. Machine Learning Meta-analysis of Large Metagenomic Datasets: Tools and Biological Insights
19. Fillat MF. The FUR (ferric uptake regulator) superfamily: diversity and versatility of key transcriptional regulators
20. Thioredoxin H (TrxH) contributes to adversity adaptation and pathogenicity of Edwardsiella piscicida
21. TonB-dependent transporters: regulation, structure, and function
22. The Prevotella copri Complex Comprises Four Distinct Clades Underrepresented in Westernized Populations
23. Dietary Fiber-Induced Improvement in Glucose Metabolism Is Associated with Increased Abundance of Prevotella
24. Utilisation of mucin glycans by the human gut symbiont Ruminococcus gnavus is strain-dependent
25. Mucin glycan foraging in the human gut microbiome
26. Regulation of bacterial pathogenesis by intestinal short-chain Fatty acids
27. From Dietary Fiber to Host Physiology: Short-Chain Fatty Acids as Key Bacterial Metabolites
28. Formation of propionate and butyrate by the human colonic microbiota. Environmental Microbiology
29. Butyrate Protects Mice from Clostridium difficile-Induced Colitis through an HIF-1-Dependent Mechanism
30. Bifidobacteria can protect from enteropathogenic infection through production of acetate
31. Potential beneficial effects of butyrate in intestinal and extraintestinal diseases
32. Facilitate Mucosal Adjuvant Activity of Cholera Toxin through GPR43. The Journal of Immunology
33. Short-Chain Fatty Acids Inhibit Fluid and Electrolyte Loss Induced by Cholera Toxin in Proximal Colon of Rabbit In Vivo
34. Overview on the Bacterial Iron-
35. Cholera toxin promotes pathogen acquisition of host-derived nutrients
36. Transcriptomics reveals a cross-modulatory effect between riboflavin and iron and outlines responses to riboflavin biosynthesis and uptake in Vibrio cholerae
37. The Human Gut Microbiome: From Association to Modulation

Supplementary tables S1-S9 are available at: https://figshare.com/articles/Supplementary_Tables_-_Levade_et_al_2020/12440417