key: cord-0260473-zom9e3pj
authors: Zhang, Lin; Chen, Likai; Yu, Xiaoqian; Duvallet, Claire; Isazadeh, Siavash; Dai, Chengzhen; Park, Shinkyu; Frois-Moniz, Katya; Duarte, Fabio; Ratti, Carlo; Alm, Eric J.; Ling, Fangqiong
title: Microbial Species Abundance Distributions Guide Human Population Size Estimation from Sewage Microbiomes
date: 2022-04-12
journal: bioRxiv
DOI: 10.1101/2020.12.15.390716
sha: 5b2712b4edc68eaaffea26dae6e6da32ef1dd3a4
doc_id: 260473
cord_uid: zom9e3pj

The metagenome embedded in urban sewage is an attractive new data source to understand urban ecology and assess human health status at scales beyond a single host. Analyzing the viral fraction of wastewater in the ongoing COVID-19 pandemic has shown the potential of wastewater as aggregated samples for early detection, prevalence monitoring, and variant identification of human diseases in large populations. However, using census-based population size instead of real-time population estimates can mislead the interpretation of data acquired from sewage, hindering assessment of representativeness, inference of prevalence, or comparisons of taxa across sites. Here, we show that taxon abundance and sub-species diversity in gut-associated microbiomes offer a new feature space for human population estimation. Using a population-scale human gut microbiome sample of over 1,100 people, we found that taxon-abundance distributions of gut-associated multi-person microbiomes exhibited generalizable relationships with respect to human population size. Here and throughout this paper, the human population size is the number of people represented by a wastewater sample, that is, its sample size. We present a new algorithm, MicrobiomeCensus, for estimating human population size from sewage samples. MicrobiomeCensus harnesses the inter-individual variability in human gut microbiomes and performs maximum likelihood estimation based on simultaneous deviation of multiple taxa's relative abundances from their population means. MicrobiomeCensus outperformed generic algorithms in data-driven simulation benchmarks and detected population size differences in field data. New theorems are provided to justify our approach. This research provides a mathematical framework for inferring population sizes in real time from sewage samples, paving the way for more accurate ecological and public health studies utilizing the sewage metagenome.
Author summary

Wastewater-based epidemiology (WBE) is an emerging field that employs sewage as aggregated samples of human populations. This approach is particularly promising for tracking diseases that can spread asymptomatically in large populations, such as COVID-19. As a new type of biological data, sewage poses its own unique challenges. While wastewater samples are usually assumed to represent large populations, this is not guaranteed because of stochasticity in toilet flushes. Unlike epidemiological experiments that collect data from individuals, the sample size of a sewage sample, i.e., the human population size it represents, is a fundamental yet difficult-to-characterize parameter. Researchers would need to aggregate data from large areas and week-long collections to stabilize the data, during which important spikes in small areas or on short time scales may be lost. It also remains challenging to turn viral titers into case prevalences, evaluate representativeness, or compare measurements across sites and studies. This study provides a framework to estimate human population size from sewage utilizing human gut-associated microorganisms. Through analysis, we demonstrate that the variance of taxon abundances and single-nucleotide polymorphisms are two variables that change with population size. We provide a new tool, MicrobiomeCensus, that performs population size estimation from microbial taxon abundances. MicrobiomeCensus outperforms generic algorithms in computational efficiency at comparable or better accuracy. Using MicrobiomeCensus, we detected population size differences in sewage samples taken in Cambridge, MA, under two sampling approaches, i.e., "grab" or "composite" sampling. This study provides a framework to utilize individual-level microbiomes to learn from sewage, paving the way to prevalence estimation and improved spatio-temporal resolution in WBE.

The metagenome embedded in urban sewage is an attractive new data source to understand urban ecology and assess human health status at scales beyond a single host [22, 2, 29]. Sewage microbiomes are found to share a variety of taxa with human gut microbiomes, where the baseline communities are characterized by a dominance of human-associated commensal organisms from the Bacteroidetes and Firmicutes phyla [29, 22, 24]. Human viruses like SARS-CoV-2 and polioviruses were detected in sewage samples during the pandemic and silent spreads, respectively, and found to correlate with reported cases, suggesting that sewage samples could be useful for understanding the dynamics of human-associated symbionts at a population level [25, 21]. Sewage has several advantages as samples of the population's collective symbionts. For instance, sewage samples are naturally aggregated, wastewater infrastructures are highly accessible, and data on human symbionts can be collected without visits to clinics; thus, utilizing sewage samples can reduce costs and avoid biases associated with stigma and accessibility [2, 28]. Consequently, SARS-CoV-2 surveillance utilizing sewage samples is underway globally and incorporated into the U.S. Centers for Disease [...] virus monitoring at finer spatial granularity, e.g., single university dorms and nursing homes, are informative for guiding contact tracing and protecting populations at higher risk, but real-time population size estimations at such fine granularity are not yet available. For a given area, the census population (de jure population) can be larger than the number of people who contributed feces to sewage at a given time (de facto population) [11]. Conversely, the de jure population can also be smaller than the de facto population due to the presence of undocumented individuals [13].
Population proxies that are currently used for monitoring at wastewater-treatment plants, such as the loading of pepper mild mottle viruses, likely have high error at the neighborhood level because of their large variability in human fecal viromes ($10^6$-$10^9$ virions per gram of dry weight fecal matter) [38]. Consequently, it is difficult to assess the representativeness of a sewage sample, infer the taxon abundance differences across time and space, or interpret errors. Lack of population size information could lead to false negatives in assessing virus eradication, because an absence of biomarkers might be caused by a sewage sample that under-represents the population size. Despite its importance, few studies have explicitly explored ways to estimate real-time human population size from sewage samples independent from census estimates [37].

Macroecological theories of biodiversity may offer clues to decipher and even enumerate the sources of a sewage microbiome. While we are only beginning to view sewage as samples of human symbionts beyond one person, generating multi-host microbiomes resembles a fundamental random additive process. Sizling et al. showed that lognormal SADs can be generated solely from summing the abundances from multiple non-overlapping sub-assemblages to form new assemblages [35]. Likewise, adding multiple sub-assemblages can also give rise to common Species-Area Relationships [35]. For microbial ecosystems, Shoemaker et al. examined the abilities of widely known and successful models of SADs in predicting microbial SADs and found that Poisson Lognormal distributions outperformed other distributions across environmental, engineered, and host-associated microbial communities, highlighting the underpinning role of lognormal processes in shaping microbial diversity [34].
In this study, we conceptualize a sewage microbiome as a multi-person microbiome, where the number of human contributors can vary. We hypothesize that the species abundance distribution in the multi-person microbiome will vary as a function of the human population size, which would arise from summing taxon abundances from multiple hosts analogous to the Central Limit Theorem. We use human gut microbiome data comprising over a thousand human subjects and machine learning algorithms to explore these relationships. Upon discovering a generalizable relationship, we develop MicrobiomeCensus, a nonparametric model that utilizes relative taxon abundances in the microbiome to predict the number of people contributing to a sewage sample.

MicrobiomeCensus utilizes a multivariate T statistic to capture the simultaneous deviation of multiple taxa's abundances from their means in a human population and performs maximum likelihood estimation. We provide proof of the validity of our approach. Next, we examine model performance through a simulation benchmark using human microbiome data. Last, we apply our model to data derived from real-world sewage. Our nonparametric method does not assume any underlying distributions of microbial abundances and can make inferences with just the computational power of a laptop computer.

[...] human-associated anaerobes as an "average gut microbiome" sampled from residents of a catchment area. Hence, our task becomes to find the underlying relationship between the number of contributors and the observed microbiome profiles in sewage samples. We define an "ideal sewage mixture" scenario to illustrate our case, where the sewage sample consists only of gut-associated microorganisms and is an even mix of $n$ different individuals' feces (Figure 1). We denote the gut microbiome profile of an individual as $X_i = (X_{i,1}, X_{i,2}, \ldots, X_{i,p})^\top$, where each $X_{i,j}$ represents the relative abundance of operational taxonomic unit (OTU) $j$ from individual $i$. Hence our ideal sewage mixture can be represented as

$$\bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i, \tag{1}$$

where vectors $X_1, X_2, \ldots, X_n \in \mathbb{R}^p$ are microbiome profiles from individuals $1, \ldots, n$.

Under the ideal sewage mixture scenario, if we can quantitatively capture the departure of the sewage microbiome profile from the population mean of the human gut microbiomes of people constituting the catchment area, we will be able to estimate the population size.

Using a dataset comprising 1,100 individuals' gut microbiome taxonomic profiles [39], we created synthetic mixture samples of different numbers of contributors through bootstrapping (Figure 1A). First, examined from an ecological perspective, the shape of the ranked abundance curves of the gut microbiomes differed when the means of multiple individuals were examined: when the number of contributors increased, a normal distribution appeared (Figure 1B [...] log-series or lognormal models, but as the population increased to over a hundred, the multi-person SADs were best described by only lognormal SADs (Table S1).

We explored the distributions of the relative abundances of gut bacteria as a function of population size. As expected, the distribution of a taxon's relative abundance changes with population size (Figure 1C). For instance, for OTU-2397, a Bifidobacterium taxon, the relative abundance distribution was approximately log-normal when the relative abundance in single-host samples was considered, yet converged to a normal distribution when mixtures of multiple hosts were considered. Although the means of the distributions of the same taxon under different population sizes were close, the variation in the data changed. A smaller variance was observed when the number of contributors increased (Figure 1D). Notably, different taxa varied in the rates at which their variances decreased with population size (Figure 1E), suggesting that a model that considers multiple features would be useful in predicting the number of contributors.
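The shrinking variance described above can be illustrated with a minimal sketch. The reference data here are hypothetical (log-normal draws normalized to relative abundances, not the Lifelines Deep profiles), the helper names `mixture_abundance` and `variance_by_population` are illustrative, and drawing individuals with replacement is an assumption about how the bootstrap was done.

```python
import random
import statistics

def mixture_abundance(reference, n, taxon, rng):
    """Relative abundance of one taxon in an even mix of n randomly drawn individuals."""
    drawn = [rng.choice(reference) for _ in range(n)]  # assumption: with replacement
    return sum(profile[taxon] for profile in drawn) / n

def variance_by_population(reference, sizes, taxon, reps, rng):
    """Variance of the mixture abundance across `reps` synthetic mixtures per size."""
    result = {}
    for n in sizes:
        values = [mixture_abundance(reference, n, taxon, rng) for _ in range(reps)]
        result[n] = statistics.variance(values)
    return result

rng = random.Random(0)

# Hypothetical reference set: 200 individuals x 3 taxa with skewed abundances.
reference = []
for _ in range(200):
    raw = [rng.lognormvariate(0.0, 1.0) for _ in range(3)]
    total = sum(raw)
    reference.append([v / total for v in raw])

var_by_n = variance_by_population(reference, sizes=(1, 10, 100), taxon=0, reps=500, rng=rng)
for n, v in var_by_n.items():
    print(f"n = {n:3d}: variance of mixture abundance = {v:.6f}")
```

For independent draws the variance of a mean scales roughly as 1/n, so the three printed values should fall by about an order of magnitude at each step.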

Classifiers utilizing microbial taxon abundance features alone detect single-person and multi-person microbiomes

Inspired by the distinct shapes of SADs in multi-person gut microbiomes from those of single-person microbiomes, we set up a classification task using the taxon relative abundances to separate synthetic communities constituting one, ten, and a hundred [...]

MicrobiomeCensus is a statistical model that estimates population size from microbial taxon abundances

While the classification tasks described above demonstrated the usefulness of taxa's relative abundances in predicting the population size, a complex model like random forest (RF) provided little explanatory power. We then ask, since the variance in the relative abundance of a given taxon decreases with population size, can we devise a statistic that captures the simultaneous deviation of several taxa's abundances from their means, and estimate population size utilizing the statistic? Further, will this new method perform well despite inter-personal variation in gut microbiomes?

Our new method, MicrobiomeCensus, involves a statistic $T_n$ to capture the simultaneous deviation of multiple taxa's abundances from their means in relation to the variance of those taxa in the population (Figure 3A). We denote $\Sigma_0 = (\sigma_{ij})_{1 \le i,j \le p}$ as the covariance matrix for the individual microbiome profile and let $\Lambda_0$ be a diagonal matrix with $\Lambda_0 = \mathrm{diag}(\sigma_{11}^{1/2}, \sigma_{22}^{1/2}, \ldots, \sigma_{pp}^{1/2})$. Our statistic is

$$T_n = \|\hat{\Lambda}_0^{-1}(\bar{X}_n - \hat{\mu})\|_2^2, \tag{2}$$

where $\bar{X}_n = \sum_{i=1}^{n} X_i / n$ denotes the observed microbiome profile in ideal sewage, $\mu$ represents the population mean for the catchment area and $\hat{\mu}$ is an estimator of it, $\hat{\Lambda}_0$ is an estimator of $\Lambda_0$, and $\|v\|_2 := (\sum_{i=1}^{p} v_i^2)^{1/2}$ for any vector $v \in \mathbb{R}^p$. This statistic is inspired by the classical Hotelling $T^2$ statistic [16]

$$\tilde{T}_n = n(\bar{X}_n - \mu)^\top \hat{\Sigma}_0^{-1} (\bar{X}_n - \mu),$$

where $\hat{\Sigma}_0$ is the sample covariance matrix, an estimator of $\Sigma_0$. If we assume the covariance matrix is diagonal (no correlations between different taxa), then the two are essentially the same statistic, since $\tilde{T}_n = nT_n$. We replace the covariance matrix $\Sigma_0$ by its diagonal $\Lambda_0$ because in high-dimensional situations it is very difficult to estimate the full covariance matrix. When $p > n$, the sample covariance matrix is singular and thus $\tilde{T}_n$ is not even well defined. Studies adapting Hotelling $T^2$ type statistics to the high-dimensional situation can be found, for example, in Bai and Saranadasa [1], Chen and Qin [10], and Xu et al. [36].

Our proposed statistic can handle the high-dimensional cases as well, since the diagonal entries of $\Lambda_0$ can be well estimated even when $p$ is large. We also extend its application beyond the problem of testing the significance of multivariate means.
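The equivalence with the Hotelling statistic in the diagonal case can be spelled out in one line. The sketch below assumes, as the Methods indicate, that $\Lambda_0$ holds the per-taxon standard deviations, so that a diagonal covariance matrix satisfies $\Sigma_0 = \Lambda_0^2$, and it uses the known $\mu$ and $\Lambda_0$ in place of their estimators.

```latex
\begin{align*}
\tilde{T}_n
  &= n(\bar{X}_n - \mu)^\top \Sigma_0^{-1} (\bar{X}_n - \mu)
   = n(\bar{X}_n - \mu)^\top \Lambda_0^{-2} (\bar{X}_n - \mu) \\
  &= n \sum_{j=1}^{p} \frac{(\bar{X}_{n,j} - \mu_j)^2}{\sigma_{jj}}
   = n \,\bigl\| \Lambda_0^{-1} (\bar{X}_n - \mu) \bigr\|_2^2
   = n T_n .
\end{align*}
```

Each taxon therefore contributes its squared deviation from the population mean, measured in units of its own variance, and no off-diagonal covariance term needs to be estimated.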

In developing this new method, we utilize the change in variance with population size, but without any a priori assumption about the gut bacterial taxon abundance distributions or the covariance between species. Our analysis showed that the statistic $T_n$ changed monotonically with increasing population size, indicating the promise of a population estimation model (Figure 3B).
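As a minimal sketch of how such a statistic could be evaluated, the function below computes the sum of squared, standard-deviation-scaled deviations of a mixture profile from the population mean. The function name `t_statistic` and the three-taxon numbers are hypothetical, and scaling by per-taxon standard deviations follows the description of $\Lambda_0$ in the Methods.

```python
def t_statistic(mixture, mu_hat, sd_hat):
    """Sum over taxa of ((mixture abundance - population mean) / population SD)^2."""
    if not (len(mixture) == len(mu_hat) == len(sd_hat)):
        raise ValueError("all inputs must have one entry per taxon")
    return sum(((x - m) / s) ** 2 for x, m, s in zip(mixture, mu_hat, sd_hat))

# Hypothetical three-taxon example.
mixture = [0.5, 0.3, 0.2]   # observed relative abundances in the sample
mu_hat = [0.4, 0.4, 0.2]    # estimated population means
sd_hat = [0.1, 0.2, 0.1]    # estimated per-taxon standard deviations

t_value = t_statistic(mixture, mu_hat, sd_hat)
print(round(t_value, 6))    # 1.0**2 + (-0.5)**2 + 0.0**2
```

Because the mixture mean concentrates around the population mean as more people contribute, values of this statistic tend to be smaller for larger populations.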

Leveraging our statistic $T_n$, we constructed an asymptotic maximum likelihood estimator to estimate the size of the sample without information on each individual; that is, we do not observe $X_1, \ldots, X_n$ but only their mixture $\bar{X}_n = \sum_{i=1}^{n} X_i / n$. Here, the parameter of interest is the population size $n$, the test statistic is $T_n$, and a point estimate is made by maximizing the estimated likelihood of $T_n$ with respect to $n$. We [...] Table S3), indicating that our model generalized well across different hosts. We then used all data and tuned the hyperparameter to acquire a final model. When applying the final model to the same testing data, our model achieved a testing error of 16.2% (Figure 3D).

It is worth noting that in this algorithm, for each size $n$, we only need to estimate the sampling distribution of the statistic $T_n$ once. Hence it is not time-consuming regardless of the true population size. We also note that an RF regression model could not be trained in a reasonable time on the same dataset, even with high-performance computing (Methods). Our model performed remarkably better than a ten-fold cross-validated RF regression model utilizing a reduced dataset, which gave a MAPE of 32%, while the training time for our model was only a fraction of that of the RF regression model (Figure S1).

MicrobiomeCensus detects human population size differences in sewage samples

With the newly developed population model, we set out to apply our model to sewage samples. Ideally, we would like to apply the model to samples generated from a fully controlled experiment with known human hosts contributing at a given time, yet such an experiment presents logistical challenges. Instead, we applied our model to sewage samples taken using one of two methods, either a snapshot (grab) sample taken from the sewage stream over 5 minutes, or an accumulative (composite) sample taken at a constant rate over 3 hours during morning peak human defecation [15] (Figure S2). We hypothesized that the composite samples would represent more people than snapshot samples. Taking grab samples, we sampled at 1-hr intervals at one manhole (n=25); using the accumulative method, we sampled at three campus buildings (classroom, dormitory, and family housing) multiple times over three months (n=76).

To remove sequences possibly contributed by the water, we applied a taxonomic filter to retain families associated with the gut microbiome and normalized the species abundance by the retained sequencing reads (Methods, [...] (Table S7), suggesting the potential that the SNV profiles of a wide range of gut species could be developed into a feature space for population size estimation. Our simulation further shows that the number of polymorphic sites increased with population size more slowly than nucleotide diversity, indicating its potential to reflect more subtle changes in population size (Figure 4G and H). Despite the need for further model development, the analysis here shows the potential of the sub-species diversity of gut anaerobes as a feature space to be developed into a population size estimation model, independent from the taxon abundance-based model described here.

The MicrobiomeCensus method we present here can, in theory, estimate the population size contributing to a sewage sample from the taxon abundance of multiple human gut microbiome taxa, using our T statistic and associated maximum likelihood estimation and application procedures. While the model is trained to perform accurate population estimation on a neighborhood scale, we expect the population range it can estimate to expand with increasing training gut microbiome data availability. We propose the MicrobiomeCensus model as a tool to drive further developments in quantitative sewage-based epidemiology. We have provided mathematical proof of the validity of our approach.

MicrobiomeCensus showed excellent performance in our simulation benchmark. In particular, the study subjects that we utilized in the training and testing sets are random samples out of 1,100 men and women across a wide range of ages without any stratification; hence the model's testing performance indicates its generalizability. Our study is founded on the observations that healthy gut microbiomes are resilient, with inter-individual variability outweighing variability within individuals over time [12, 20, 26]. There are caveats to our approach; potentially, diets and regional effects on human microbiome composition could introduce noise to the prediction [17, 14]. In [...]

Rationales. If the distribution of $X_i$ is known, then the distribution of $T_n$ is known and one can easily use a maximum likelihood estimator (MLE) to estimate the human population size. Here the human population size is essentially the sample size $n$ from the wastewater sample. However, for generality, we do not want to impose any specific distributional assumptions on taxon abundances; thus, we need to rely on asymptotic results to estimate the distribution of our statistic. Unlike the univariate case, where the asymptotic distribution of the statistic $T_n$ can be simply derived by the central limit theorem, we are dealing with a much more challenging situation due to high dimensionality.

In the following, we shall first introduce some notation and assumptions that will be needed for the theorems. Then, in Theorem 1, we derive the Gaussian approximation for the test statistic $T_n$, which implies that we can use simulated Gaussian vectors to approximate the distribution of our statistic. To apply this approximation, one needs to further estimate the covariance matrix, which is highly non-trivial due to high dimensionality. To get around this difficulty, we further propose a sub-sampling approach. Theorem 2 provides the theoretical foundation for this sub-sampling scheme.

Notation. Recall that vectors $X_1, X_2, \ldots, X_n$ are microbiome profiles from individuals $1, 2, \ldots, n$. Assume that they are independent and identically distributed (i.i.d.) random vectors in $\mathbb{R}^p$ with mean $\mu \in \mathbb{R}^p$ and variance $\mathrm{Var}(X_i) = \Sigma_0$.

To construct the Gaussian approximation, we shall first work with cases when both $\Lambda_0$ and $\mu$ are given, that is, consider the statistic $T_n^{\diamond} = \|\Lambda_0^{-1}(\bar{X}_n - \mu)\|_2^2$.

We will later extend all the results to cases when those parameters are unknown. For notational simplicity, consider the normalized version of $X_i$:

$$Y_i = \Lambda_0^{-1}(X_i - \mu), \quad 1 \le i \le n.$$

Then $T_n^{\diamond} = \|\bar{Y}_n\|_2^2$, where $\bar{Y}_n = \frac{1}{n}\sum_{i=1}^{n} Y_i$.

January 26, 2022 8/18

We need the following condition on Y i for the main theorem.

Assumption 1 Let s > 2. Assume 

where $c_1 > 0$ is independent of $p$. This justifies the $K_s$ part of condition (3). Again by Burkholder's inequality [6],

where $c_2 > 0$ is independent of $p$, and thus the $D_s$ part of condition (3) holds.

It is worth noting that under the settings in Remark 1, condition (4) holds. Based on the above theorem, if we can estimate the covariance matrix $\Sigma_Y$, then $\|Z\|_2^2$ can be generated and used to approximate our statistic. However, the estimation of $\Sigma_Y$ is difficult due to high dimensionality unless some additional conditions are imposed on the covariance matrix. To get around this difficulty, we shall consider a bootstrap approach. The main idea is that, for each $n$, we randomly draw $n$ samples from the reference data $X_i$, $1 \le i \le n_0$, and construct a generated statistic. We then use the empirical distribution of the generated statistic to approximate the distribution of our statistic.

In the following, we shall provide the theoretical justification for this procedure.

For some integer $J > 0$, let $A_1, A_2, \ldots, A_J$ be i.i.d. uniformly sampled from the class $\mathcal{A}_n = \{A : A \subset \{1, 2, \ldots, n_0\}, |A| = n\}$. Assume the sampling process is independent of our data $(X_i)_i$. Then for each $1 \le k \le J$, the set $\{X_i, i \in A_k\}$ is of size $n$ and can be used to construct one test statistic $n\|\Lambda_0^{-1}(\bar{X}_{A_k} - \bar{X}_{n_0})\|_2^2$, where $\bar{X}_{A_k} = \sum_{i \in A_k} X_i / |A_k|$ and $\bar{X}_{n_0}$ is the average of the reference data. After repeating this procedure $J$ times, we have $J$ realizations of our test statistic, which can be used to construct the empirical distribution $\hat{F}_n(t)$:

$$\hat{F}_n(t) = \frac{1}{J} \sum_{k=1}^{J} \mathbf{1}\left\{ n\|\Lambda_0^{-1}(\bar{X}_{A_k} - \bar{X}_{n_0})\|_2^2 \le t \right\}.$$

We shall later show that this empirical distribution can be adopted for approximating the distribution of the target statistic $T_n^{\diamond}$. The following result can be derived based on Theorem 3.5 in Xu et al. [36] and Theorem 1.

Theorem 2 Assume the conditions in Theorem 1 hold, and moreover assume that $n \to \infty$, $n = o(n_0)$ and (4) holds. Then for $J \to \infty$, we have

Theorem 2 implies that we can use the empirical distribution generated from our sub-samples to estimate the distribution of our target. As mentioned previously, we have so far assumed that the value of $\Lambda_0$ is known, which is not realistic in applications.

Therefore we need to further estimate $\Lambda_0$, that is, we need to estimate the standard deviation for each coordinate. This can be easily accomplished by considering the [...]

where $X_{i,j}$ represents the $j$th coordinate of $X_i$, $\Lambda_{0,j}$ is the $j$th diagonal entry of $\Lambda_0$, and $n_0$ is the size of the reference data. Also, one can use the average of the reference data to replace $\mu$. If $X_{i,j}$ has a heavy tail, we can also consider a robust M-estimator for $\Sigma_0$ and $\mu$; see, for example, Catoni [8].

Bootstrap procedure. Below we describe the bootstrap procedure we use to approximate the distribution of $T_n$ for different census counts. Recall that $X_1, \ldots, X_n$ represent arrays of taxon relative abundances in the gut microbiome of human subjects $1, \ldots, n$, and $T_n$ is defined in (Eq. 2).

Step 1. Estimate the population mean $\hat{\mu}$ and the diagonal matrix $\hat{\Lambda}_0$ from a reference sample of human gut microbiome data.

Step 2. For each census count $n$, generate $X_1^*, \ldots, X_n^*$ from the reference data to compute $T_1^*$.

Step 3. Repeat Step 2 $B$ times (herein 10,000 times) to get $T_1^*, \ldots, T_B^*$.

Step 4. Obtain the density function $\hat{f}_n$ of $T_n$ based on $T_1^*, \ldots, T_B^*$ using a Gaussian kernel.

Step 5. Repeat Steps 2-4 for all the census counts $n = 1, 2, \ldots, N$ considered, herein $N = 300$. It should be noted that per Theorem 2 we require the bootstrap sample size $n$ to be much smaller than the total reference sample size $n_0$; thus up to 300-person samples were simulated here because the gut microbiome reference dataset we utilized consisted of a total of 1,100 people. The range can be expanded if a larger dataset is available.
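Steps 1-5 can be sketched with the Python standard library as follows. This is an illustration under stated assumptions, not the authors' implementation: the reference data are hypothetical log-normal draws, the helper names are illustrative, individuals are drawn without replacement, the kernel bandwidth uses a normal-reference rule of thumb (the source does not state one), and $B$ and $N$ are scaled down from 10,000 and 300 to keep the example fast.

```python
import math
import random
import statistics

def reference_moments(reference):
    """Step 1: per-taxon mean and standard deviation of the reference profiles."""
    p = len(reference[0])
    columns = [[profile[j] for profile in reference] for j in range(p)]
    mu_hat = [statistics.fmean(col) for col in columns]
    sd_hat = [statistics.stdev(col) for col in columns]
    return mu_hat, sd_hat

def t_statistic(profile, mu_hat, sd_hat):
    """Squared norm of the SD-scaled deviation from the reference mean."""
    return sum(((x - m) / s) ** 2 for x, m, s in zip(profile, mu_hat, sd_hat))

def bootstrap_t(reference, n, n_boot, mu_hat, sd_hat, rng):
    """Steps 2-3: n_boot statistics, each from an even mix of n reference individuals."""
    p = len(mu_hat)
    stats = []
    for _ in range(n_boot):
        drawn = rng.sample(reference, n)  # assumption: drawn without replacement
        mixture = [sum(profile[j] for profile in drawn) / n for j in range(p)]
        stats.append(t_statistic(mixture, mu_hat, sd_hat))
    return stats

def gaussian_kde(samples):
    """Step 4: Gaussian-kernel density estimate; the bandwidth rule is an assumption."""
    m = len(samples)
    h = 1.06 * statistics.stdev(samples) * m ** (-0.2)
    scale = 1.0 / (m * h * math.sqrt(2.0 * math.pi))

    def density(t):
        return scale * sum(math.exp(-0.5 * ((t - s) / h) ** 2) for s in samples)

    return density

rng = random.Random(1)

# Hypothetical reference set: 300 individuals x 4 taxa.
reference = []
for _ in range(300):
    raw = [rng.lognormvariate(0.0, 1.0) for _ in range(4)]
    total = sum(raw)
    reference.append([v / total for v in raw])

mu_hat, sd_hat = reference_moments(reference)

# Step 5: one estimated density per census count n = 1..20 (scaled-down N).
t_samples = {n: bootstrap_t(reference, n, 200, mu_hat, sd_hat, rng) for n in range(1, 21)}
densities = {n: gaussian_kde(t_samples[n]) for n in t_samples}
```

The densities are computed once per census count and can then be reused for every new sample, which is why the estimation cost does not depend on the true population size.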

Maximum Likelihood Estimation. We use a maximum likelihood estimation (MLE) procedure to obtain point estimates of the population size from a new mixture sample, $W$. The MLE procedure first computes $T_0$ by replacing the sample average by $W$, that is,

$$T_0 = \|\hat{\Lambda}_0^{-1}(W - \hat{\mu})\|_2^2. \tag{6}$$

It then computes the likelihoods that $T_0$ was drawn from population sizes from 1 to $N$, respectively, using the estimated distributions generated from the bootstrap procedure described above. Next, the population size $\hat{n}$ that yields the highest likelihood is chosen.
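The selection step can be sketched as an argmax over candidate population sizes. To keep the example self-contained, the estimated densities are replaced by a hypothetical stand-in, the density of a chi-square variable with $p$ degrees of freedom divided by $n$ (what $T_n$ would follow for uncorrelated Gaussian taxa); in practice the kernel density estimates from the bootstrap procedure would be used instead. The names `chi2_scaled_density` and `mle_population_size` are illustrative.

```python
import math

def chi2_scaled_density(n, p):
    """Density of (chi-square with p d.o.f.) / n: a hypothetical stand-in for f_n."""
    def density(t):
        x = n * t
        return n * x ** (p / 2 - 1) * math.exp(-x / 2) / (2 ** (p / 2) * math.gamma(p / 2))
    return density

def mle_population_size(t0, densities):
    """Return the candidate n whose density assigns the highest likelihood to t0."""
    return max(densities, key=lambda n: densities[n](t0))

P_TAXA = 10   # hypothetical number of taxa
N_MAX = 300   # largest census count considered, as in the text

densities = {n: chi2_scaled_density(n, P_TAXA) for n in range(1, N_MAX + 1)}

t0 = 0.5      # hypothetical observed statistic for a new mixture sample
n_hat = mle_population_size(t0, densities)
print(n_hat)
```

With this stand-in the log-likelihood in $n$ is $5\ln n - nt_0/2$ up to a constant, which peaks at $n = 10/t_0$, so the estimate can be checked by hand.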

Confidence interval. By Theorem 1, the asymptotic distribution of $nT_n$ is the same as that of $\|Z\|_2^2$ and is therefore independent of the parameter $n$. Hence, for any confidence level $1 - \alpha$, we can first estimate the $1 - \alpha$ quantile of $nT_n$ based on the simulated data described above, denoted as $\hat{q}_{1-\alpha}$. Then, for any new mixture $W$ and the corresponding $T_0$ as in (6), our $1 - \alpha$ confidence interval is $[1, \hat{q}_{1-\alpha}/T_0]$.
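A minimal sketch of this interval, assuming the simulated values of $nT_n$ are already available as a list; the empirical-quantile convention, the function name `confidence_interval`, and the toy numbers are assumptions made for illustration.

```python
import math

def confidence_interval(scaled_stats, t0, alpha=0.05):
    """Return (1, q / t0), where q is the empirical (1 - alpha) quantile of n * T_n."""
    if t0 <= 0:
        raise ValueError("t0 must be positive")
    ordered = sorted(scaled_stats)
    # Smallest order statistic with at least a (1 - alpha) share of values at or below it.
    k = min(len(ordered) - 1, max(0, math.ceil((1 - alpha) * len(ordered)) - 1))
    return 1, ordered[k] / t0

# Hypothetical simulated values of n * T_n and an observed statistic T_0.
scaled_stats = [float(v) for v in range(1, 101)]
lower, upper = confidence_interval(scaled_stats, t0=0.5, alpha=0.05)
print(lower, upper)
```

The interval is one-sided because $nT_n \le \hat{q}_{1-\alpha}$ translates into an upper bound on $n$ once $T_n$ is replaced by the observed value.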

Model training, validation, and testing. We synthesized a mixed data set from a gut microbiome dataset of 1,135 healthy human hosts from the Lifelines Deep study [39], which was the largest study of population-level human microbiome variations from a single sequencing center at the time of this study. The data set consisted of 661 women and 474 men. We considered OTUs defined by 99% similarity of partial ribosomal RNA gene sequences (methods of OTU clustering are described in detail in Supplementary Methods). After quality filtering, we retained 1,100 samples that had more than 4,000 sequencing reads/sample. We split the entire dataset approximately in half, using 550 subjects to generate the training/validation set and the other 550 subjects to generate the test set. We then used the aforementioned ideal sewage mixture approach to generate synthetic populations of up to 300 individuals, which is the relevant range for population estimation in upstream sewage. The training error was computed using the entire training data set. Five repeated holdout validations using a 50-50 split in the training set were performed to tune the hyperparameter for feature selection. The training and cross-validation errors were evaluated at integers from 1 to 100, using the error definition:

and the model's performance across all the population sizes was characterized by the [...] single-person and multi-person microbiome data were drawn from a gut microbiome dataset of 1,135 healthy human hosts from the Lifelines Deep study [39], which was the largest study of population-level human microbiome variations from a single sequencing center at the time of this study. The data set consisted of 661 women and 474 men. We considered operational taxonomic units defined by 99% similarity of partial ribosomal RNA gene sequences. After quality filtering, we retained 1,100 samples that had more than 4,000 sequencing reads/sample. The rarefaction depth was chosen to balance sample size and sequencing depth.

16S rRNA gene amplicon sequencing data analysis. Operational taxonomic units defined at 99% sequence similarity were generated from the combined dataset by first denoising the samples with DADA2 [7], and then clustering the outputted exact sequence variants with the q2-vsearch plugin of QIIME2 [33]. Taxonomic assignments were performed using a multinomial naïve Bayes classifier against SILVA 132 [32, 4]. All 16S rRNA gene amplicon analyses were performed in the QIIME2 platform (QIIME 2019.10) [5]. [...] varied successes in predicting microbial SADs [34]. We examined the fit using a rank-by-rank approach as previously described by Shoemaker et al. [34]. First, maximum-likelihood coefficients for each of the SADs described above were estimated using the R package sads [30]. Next, SADs were predicted using each model, and [...]

Application to sewage data. The 16S rRNA gene amplicon sequencing data from the field sewage samples were trimmed to the same region, 16S V4 (534-786), as the LifeLines Deep data using Cutadapt 1.12 [23]. Forward reads were trimmed to 175 bp, and reverse reads were first trimmed to 175 bp and then further trimmed to 155 bp during quality screening. We created a taxonomic filter based on the composition of the gut microbiome data set, which consisted of the abundant family-level taxa that accounted for 99% of the sequencing reads in the human gut microbiome data set, and excluded those that might have an ecological niche in tap water (Enterobacteriaceae and Burkholderiaceae). This exclusion resulted in 25 bacterial families and one archaeal family in our taxonomic filter, including Lachnospiraceae, Ruminococcaceae, Bifidobacteriaceae, Erysipelotrichaceae, Bacteroidaceae, and others (Table S4). We applied our taxonomic filter to the sewage sequencing data, which retained 73.9% of the sequencing reads. This retention rate is consistent with our previous report of the human microbiome fraction in residential sewage samples [24]. We then normalized the relative abundance of taxa against the remaining sequencing reads in each sample.

Welch's two-sample t-tests were performed to retain the OTUs whose means did not differ significantly from the human microbiome data set (p > 0.05).
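The taxonomic filtering and renormalization steps above can be sketched as follows. The read counts and the short family list are hypothetical (the actual filter of 26 families is given in Table S4), the function name `filter_and_normalize` is illustrative, and the subsequent Welch's t-test screening is not shown.

```python
def filter_and_normalize(read_counts, gut_families):
    """Keep reads from gut-associated families and renormalize to relative abundances.

    Returns the relative abundances among retained reads and the retained read fraction.
    """
    kept = {family: count for family, count in read_counts.items() if family in gut_families}
    retained = sum(kept.values())
    if retained == 0:
        raise ValueError("no reads passed the taxonomic filter")
    abundances = {family: count / retained for family, count in kept.items()}
    return abundances, retained / sum(read_counts.values())

# Hypothetical filter: a few of the families named in the text.
gut_families = {"Lachnospiraceae", "Ruminococcaceae", "Bifidobacteriaceae"}

# Hypothetical family-level read counts for one sewage sample.
read_counts = {
    "Lachnospiraceae": 500,
    "Ruminococcaceae": 300,
    "Enterobacteriaceae": 150,  # excluded: may have a niche in tap water
    "Moraxellaceae": 50,        # hypothetical family outside the filter
}

abundances, retained_fraction = filter_and_normalize(read_counts, gut_families)
print(abundances, retained_fraction)
```

Renormalizing against the retained reads makes the sewage profile comparable to gut microbiome profiles, whose relative abundances sum to one over gut-associated taxa.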

Deployment of generic machine learning models. Logistic regression, support vector machine, and random forest classifiers were employed to perform the classification task for population sizes of 1, 10, and 100. Model training, cross-validation, and testing were performed using the R Caret platform with the default settings [18]. For the support vector machine, the radial basis function kernel was employed. Ten-fold cross-validation and five repeats were performed for all the models considered. Model performance was evaluated using accuracy, sensitivity, and specificity. Based on the classifier performance, the RF regression model was used for comparison with our new model's performance. Initially, we trained the model using the same training data set used in training our maximum likelihood model; however, the computation was infeasible, even with a 36-thread, 3TB-memory computing cluster. We then introduced gaps in the population size range, using populations from the vector [...] performed in R Caret, using 10-fold cross-validation. Ten variables were randomly sampled as candidates at each split (mtry=10). The performance was evaluated using the same testing set that was used to evaluate the maximum likelihood model. [...] We built the MicrobiomeCensus model using our T statistic and a maximum likelihood procedure. The training set consisted of 10,000 samples for population sizes ranging from 1-300, and 50% of the data were used to train and validate the model. Training [...]

Effect of high dimension: by an example of a two sample problem

Estimation of polio infection prevalence from environmental surveillance data

Wastewater-Based Epidemiology: Global Collaborative to Maximize Contributions in the Fight Against COVID-19

Optimizing Taxonomic Classification of Marker-Gene Amplicon Sequences with QIIME 2's Q2-Feature-Classifier Plugin

Reproducible, interactive, scalable and extensible microbiome data science using QIIME 2

Sharp inequalities for martingales and stochastic integrals

DADA2: high-resolution sample inference from Illumina amplicon data

Challenging the empirical mean and empirical variance: a deviation study

National Wastewater Surveillance System

A two-sample test for high-dimensional data with applications to gene-set testing

Real-time estimation of small-area populations with human biomarkers in sewage

Host lifestyle affects human microbiota on daily timescales

The number of undocumented immigrants in the United States: Estimates based on demographic modeling with data from 1990 to

Regional Variation Limits Applications of Healthy Gut Microbiome Reference Ranges and Disease Models

Defecation frequency and timing, and stool form in the general population: a prospective study

The Transformation of Statistics to Simplify their Distribution

Daily Sampling Reveals Personalized Diet-Microbiome Associations in Humans

Building Predictive Models in R Using the caret Package

Diversity, stability and resilience of the human gut microbiota

Intensified environmental surveillance supporting the response to wild poliovirus type 1 silent circulation in Israel

Patterns of protist diversity associated with raw sewage in New York City

Cutadapt Removes Adapter Sequences from High-Throughput

24-hour multi-omics analysis of residential sewage reflects human activity and informs public health

Presence of SARS-Coronavirus-2 RNA in Sewage and Correlation with Reported COVID-19 Prevalence in the Early Stage of the Epidemic in The Netherlands

Stability of the human faecal microbiome in a cohort of

Microbiomes as Metacommunities: Understanding Host-Associated Microbes

Letter to the Editor: Wastewater-Based Epidemiology Can Overcome Representativeness and Stigma Issues Related to COVID-19

Sewage Reflects the Microbiomes of Human Populations

Likelihood Models for Species Abundance Distributions

Computational methods for high-throughput

Methods in

Improved Data Processing and Web-Based Tools

VSEARCH: A Versatile Open Source Tool for

A macroecological theory of microbial biodiversity

Species abundance distribution results from a spatial analogy of central limit theorem

L2 Asymptotics for

Monitoring Genetic Population Biomarkers for

RNA Viral Community in Human Feces: Prevalence of Plant

Population-based metagenomics analysis reveals
