key: cord-0283819-bnqljz86
authors: Mangiola, S; Schulze, A; Trussart, M; Zozaya, E; Ma, M; Gao, Z; Rubin, AF; Speed, TP; Shim, H; Papenfuss, AT
title: Robust differential composition and variability analysis for multisample cell omics
date: 2022-03-06
journal: bioRxiv
DOI: 10.1101/2022.03.04.482758
sha: 7ac06dec128bfc05a39f655cd3df2aaa1ce03436
doc_id: 283819
cord_uid: bnqljz86

Cell omics such as single-cell genomics, proteomics and microbiomics allow the characterisation of tissue and microbial community composition, which can be compared between conditions to identify biological drivers. This strategy has been critical to unveiling markers of disease progression such as cancer and pathogen infection. For cell omic data, no method for differential variability analysis exists, and methods for differential composition analysis only take a few fundamental data properties into account. Here we introduce sccomp, a generalised method for differential composition and variability analyses able to jointly model data count distribution, compositionality, group-specific variability and proportion mean-variability association, with awareness against outliers. Sccomp is an extensive analysis framework that allows realistic data simulation and cross-study knowledge transfer. Here, we demonstrate that mean-variability association is ubiquitous across technologies showing the inadequacy of the very popular Dirichlet-multinomial modelling and provide mandatory principles for differential variability analysis. We show that sccomp accurately fits experimental data, with a 50% incremental improvement over state-of-the-art algorithms. Using sccomp, we identified novel differential constraints and composition in the microenvironment of primary breast cancer. Significance statement Determining the composition of cell populations is made possible by technologies like single-cell transcriptomics, CyTOF and microbiome sequencing. Such analyses are now widespread across fields (~800 publications/month, Scopus). However, existing methods for differential abundance do not model all data features, and cell-type/taxa specific differential variability is not yet possible. Increase in the variability of tissue composition and microbial communities is a well-known indicator of loss of homeostasis and disease. A suitable statistical method would enable new types of analyses to identify component-specific loss of homeostasis for the first time. This and other innovations are now possible through our discovery of the mean-variability association for compositional data. Based on this fundamental observation, we have developed a new statistical model, sccomp, that enables differential variability analysis for composition data, improved differential abundance analyses, with cross-sample information borrowing, outlier identification and exclusion, realistic data simulation, based on experimental datasets, cross-study knowledge transfer.

Compositional analyses are central in many fields of biology. Tissue composition analysis enabled seminal discoveries in cancer research (1) (2) (3) (4) (5) (6) and epidemiology (e.g. COVID19 (4)).

Compositional analysis of microbial communities is crucial for studying metabolic disease (7) and skin physiology (8) . Single-cell transcriptomics (9) and high-throughput flow cytometry (CyTOF) (10) enable the characterisation of cell groups measuring the abundance for thousands of transcripts and tens of proteins at the single-cell level. The 16S rRNA and whole microbiome DNA sequencing characterise bacterial taxonomic groups (8) by probing their genetics. The relative abundance of groups of cells or microorganisms can be compared between biological or clinical conditions to identify cellular or taxonomic drivers.

Despite the importance of such comparative analyses, no method for differential variability analyses is available. Differential variability analysis is an avenue for novel discoveries through single-cell transcriptomics, such as for T-cell response in cancer (11) .

Also, although a wide range of differential composition methods exists, they only consider some of the five fundamental properties of cell omics-derived data ( Table 1) . Well known properties are (i) data is observed as counts; (ii) group proportions are compositional and negatively correlated; and (iii) the underlying proportion variability is group-specific. No available method jointly models these three properties. Log-linear proportion regression methods such as scDC (12) , propeller (13) and diffcyt (14) model data compositionality (ii), and they are flexible to group-specific variability (iii) but do not model the data count distribution. Cluster-free methods such as milo (15) and DA-seq (6) are also based on log-linear models. Log-count-based methods such as MixMC (16) , Bach, K. et al. (17) , ANCOM-BC (18) model group-specific variability (iii) but do not model counts or data compositionality. Binomial-based methods such as Pal, Chen et al. 2021 (19) and corncob (20) model counts (i) and cell-group-specific variability (iii) but do not model the compositionality (ii). Multinomial-based methods such as ALDEx2 (21) , dmbvs (22) and scCODA (23) model count data (i) and compositional properties (ii) but assume the same variability for all groups.

Other important data properties have remained mostly uncharacterised. While Phipson et al. (13) introduced variability moderation through empirical Bayes (limma (24) ), a mathematical description of the proportion mean-variability association (4; Table 1 ) across datasets and technologies is not apparent. This description would allow differential variability analysis and have profound implications for the adequacy of the very popular Dirichlet multinomial models. Similarly, the extent and the effect of outliers (v) has never been characterised for cell omic-derived data. This knowledge would guide robust method developments for cell omics data. Currently, only Bach, K. et al. (17) deal with outliers using robust regression; however, the efficacy of such an approach was not benchmarked.

Here, we introduce sccomp, a generalised method for differential composition and variability analyses based on sum-constrained independent Beta-binomial distributions.

Sccomp takes into account the five main properties of cell omics compositional data.

Furthermore, sccomp can simulate realistic data with the properties of any experimental dataset. The simulated data can be used to assess the adequacy of the fitted model and for benchmarking purposes. Sccomp can also transfer knowledge across datasets to improve analyses in low-data regimens.

Using sccomp on 18 datasets, we characterise the mean-variability relationship of compositional data across cell omics technologies. Our findings suggest that the Dirichlet-multinomial is inadequate for models of differential composition analysis and that incorporating the mean-variability relationship is a requirement for differential variability analysis tools. Our results also show the ubiquitous presence of outlier observations in all datasets. Using realistic simulations, we show that sccomp significantly improves performance compared to other methods. Sccomp uncovered differential microenvironmental constraints of breast cancer subtypes and cell-type-specific differences involving lymphoid and myeloid cell populations. Uniquely, the sum-constrained Beta-binomial distribution allows modelling of the compositional properties of data with mean-variability association while allowing for outlier exclusion; we anticipate its adoption in other scientific fields. 

To take into account all five major properties of count-based compositional data (Table 1) , we developed a model based on sum-constrained independent Beta-binomial distributions.

Sccomp can simultaneously estimate differences in composition and variability ( Figure 1C ).

Sccomp is compatible with complex experimental designs, including discrete and continuous covariates. The estimation is done through Hamiltonian Monte-Carlo via the Bayesian inference framework Stan (25) . The hypothesis testing is performed by calculating the posterior probability of the composition and variability effects being larger than a fold-change threshold (26) . The estimation is made more stable with an adaptive shrinkage in the form of a prior distribution defining the association between proportion means and variabilities (see Figure 1B and Methods). Sccomp identifies outliers probabilistically through iterative fitting ( Figure 1E and Methods), which are excluded from later fits ( Figure   1F ).

The Bayesian framework allows sccomp to incorporate the mean-variability association from other datasets ( Figure 1A ). This prior knowledge is helpful when only a few groups or samples are present. After fitting the model, sccomp can simulate data recapitulating the learned properties ( Figure 1H ). The simulated data can help identify potential failings of the model and enables benchmarking based on more realistic simulations. variability (right-hand side) that sccomp can estimate (Differential variability analysis in Methods). D:

Representation of the process from cell clustering and counting that is the input for the differential composition analysis (User interface in Methods). E: Schematics of the iterative process of outlier identification and outlier-free data fitting that sccomp adopts for robust estimation (Iterative outlier detection in Methods). F: This check allows users to evaluate the ability of the model to fit a specific input dataset. L: Representation of benchmarking with realistic data that sccomp allows in a user-friendly way.

To develop a method able to share information across groups, we studied the association between group proportion means and variability for 18 single-cell RNA sequencing, CyTOF and microbiome datasets (Table S1 ). We first fitted our sum-constrained Beta-binomial model to the data with no linear association between the logit-multinomial proportion means (µ g ) and log-variabilities (-⍵ g ) built-in (see Methods for notation), and observed the correlation of the estimates. We observed consistent positive linear homoscedastic association for all three data types (Figure 2A -left and S1, dotted line and residuals). Comparing the estimates with the mean-variability log-linear association built-in, we observe shrinkage of the variability estimates of up to 4 fold for ( Figure 2A -right and S1D).

For single-cell RNA sequencing data, modelling this association had a shrinkage effect on the variability estimates (-⍵ g ; and means µ g to a lesser extent), something obvious for the BRCA1 dataset for cell types with low abundance (e.g. tumour associated macrophages, Tam1, Figure S1 ). For CyTOF data, the shrinkage effect is evident in the Bodenmiller and cytonorm datasets. Similarly, the most significant impact can be seen for rare cell types. Microbiome data is characterised by higher uncertainty and greater spread around the regression line (before shrinkage). The impact of shrinkage there is more dramatic than in the other data types, especially for the means.

The estimated slope of the linear relationship is fairly consistent across technologies. Figure S1A ). This pattern is observable in the resulting bimodal residual distribution ( Figure 2A -middle and second rows of S1A, S1B and S1C). Our model uses a Gaussian mixture distribution that accurately fits both modes (Figure S1A third row; dashed lines). (Table S1 ). Each line represents a cell group. The slope of fitted lines represents the match between observed and generated data for one group (paired by their ranks), which is expected to be 1 when two distributions are the same. The dashed grey line represents a perfect linear match. E: The distribution of slopes of the scatter plots (panel D). F: Association between the slopes of the scatter plots (y-axis) and the estimated proportion abundance of each group (x-axis). Sum-constrained Beta-binomial (scBb) and Dirichlet-multinomial (Dm) are compared. If data simulated from posterior predictive distribution is similar to observed data, we expect a straight horizontal line intersecting 1.

Our method can simulate realistic data based on the learned characteristics of experimental datasets ( Figure 2B ). This simulation is achieved by first estimating the posterior distribution from a given dataset and then generating data from the posterior predictive distribution. The posterior predictive check (27, 28) is helpful to assess the model's descriptive adequacy (29) to specific datasets and study designs. For example, the overlay of experimental to simulated data shows the descriptive adequacy of sccomp to the COVID19 dataset EGAS00001004481

(4) (replicating quartile ranges; Figure 2C ) and the absence of noticeable pathologies. To provide a more quantitative assessment, we regressed the observed and simulated data for each cell type of 18 publicly available datasets (1-5, 17, 30-33) across three cell omic technologies ( Figure 2D ). The fitted lines are tightly centred around the 45°reference line for all datasets ( Figure 2E ). This evidence suggests that proportion means and variability inference are descriptively adequate for and representative of experimental data across technologies. This trend is particularly significant considering that performing posterior predictive checks on datasets with a small sample size suffers from noise.

Considering the existence of a mean-variability association, we assessed the ability of our model to adequately estimate the variability of small and large groups. We analysed the relationship between fitted slopes between observed and simulated proportions ( Figure 2D ) and the baseline abundance (estimated intercept) across 18 datasets (Table S1) . We compared our model with the Dirichlet-multinomial model, a de facto standard for count-based compositional analyses (22, 23, (34) (35) (36) (37) .

We saw no bias in the fitted slopes of observed-simulated data across group abundance. These results present no evidence that our model underestimates or overestimates the data variability for any group, regardless of their relative abundance and the data source where more data is available, the consistent overestimation for small groups is mirrored by an underestimation for large ones.

To add dependence to a series of independent Beta-binomials, we impose a sum-to-one constraint on proportions analogously to the Dirichlet-multinomial. We hypothesised that our model would capture the compositional nature of data of a Dirichlet-multinomial, despite allowing for group-wise variability. Data was generated by a four-group Dirichlet-multinomial (with parameters 0.2, 0.6, 2.0, 4.0; Figure 3A and 3B), and the sccomp single-mean model was fitted to this data. To show the adequacy of our model of the data, we simulated data from the posterior predictive distribution. The overlay of the simulated data on the observed data shows that the densities match (red data points, Figure 3C and 5D).

We tested whether our model can capture the dependence structure across the proportion means, typical of compositional data, analysing the correlation among estimated means using pairs plots. We also compared the estimated means for a Dirichlet-multinomial (as a baseline) and an unconstrained (independent) Beta-binomial model. The estimated means of our model show a negative correlation structure similarly to the Dirichlet-multinomial model ( Figure 3E ). This correlation is strong for groups one and two (G1 and G2) and to a lesser extent for group three. On the contrary, the unconstrained Beta-binomial does not reproduce this dependence. This lack of dependence results in a higher uncertainty around the estimates, especially for low-abundance groups G1 and G2.

The differences between sum-constrained and unconstrained Beta-binomial models are reflected in the ability to simulate representative data to the Dirichlet-multinomial ( Figure   3F ). The marginal distributions of the predictive posterior and the weak dependence structure of the simulated data across the four groups, characteristic of the Dirichlet-multinomial, is accurately reproduced by the sum-constrained Beta-binomial. On the contrary, the unconstrained Beta-binomial generates visibly distinct data densities compared to the Dirichlet-multinomial. (65)). F: Overlap between the observed and generated data between each group across models.

To compare the performance of sccomp with publicly available methods for differential composition analysis (Table 1) , we performed a benchmarking on realistic simulated data based on the noise and outlier characteristics of the COVID19 dataset (4) ( Figure 4A ). The simulation was based on a logit-linear-multinomial model to ensure fairness across methods.

We built a receiving-operator characteristic curve for every run and evaluated the performance using the area under the curve (AUC, up to 0.1 false-positive rates; Figure 4B ).

Overall across simulation settings, sccomp shows significantly better performance than publicly available methods, including a generic logit-linear regression (lm in R). The gain in performance tracks with the increase of slopes until the 0.1 average AUC plateau.

Sccomp has the highest incremental performance gain within the method rank ( Figure 4D )

(1.4 folds and 1.9 fold using sum-constrained Beta-binomial-based simulations; Figure S2 ). It is the only method having a more-than-linear gain (i.e. > 1 fold). The method rlm and logit-linear are the second and third best performers with an incremental performance gain of 0.64 and 0.75, respectively.

Overall across simulation settings, the number of groups was the least impactful. An outlier-free benchmark ( Figure S3 ) shows a better performance of sccomp with a smaller incremental improvement.

Sccomp can further inform the estimates by transferring information from publicly available datasets (see Cross-dataset learning transfer subsection). To test the effectiveness of this technique to regularise estimates in low-data regimens, we compared the use of uninformative or informative hyperpriors. Our results show a noticeable impact on the performance for simulations with low sample or group size, up to 2 fold ( Figure S4 ). 

We used sccomp to analyse the microenvironment of primary breast cancer from data first described by Wu et al. (3) . This study analysed 26 breast cancer primary tumour tissues and identified 49 cell phenotypes. We analysed the difference in composition and variability of the triple-negative subtype (TNBC) compared to ER+ and HER2+. Our analysis led to a rich and diverse landscape of compositional and variability changes across subtypes ( Figure 5A ).

The main feature is the depletion of cytotoxic CD8 IFN-γ, compared to HER2+ and ER+ ( Figure 5B ). Compared with triple-negative, the HER2+ microenvironment is enriched in several other lymphocytic populations, including CD4 follicular helper (CD4 fh in Figure   5B ) Figure 5A and 4B), HER2+ shows a distinct cohort-level heterogeneity profile. A markedly smaller slope indicates a more similar relative variability across cell types and potentially distinct microenvironmental processes acting for this subtype. 

To further assess the ability of sccomp in generating novel results, we expanded our analysis on a time-resolved BRCA1 model of tumorigenesis (E-MTAB-10043 (17)). Using a robust log-linear model followed by a robust F test, this study estimated 17 significant differences along the tumour developmental timeline, including fibroblast, dendritic, monocyte, and T cells. We confirmed the majority of those associations and identified 15 new associations, such as tumour-associated fibroblasts (Fb7, Fb8) and macrophages (Tam1, Tam2), neutrophils and mig dendritic cells (migDC). Five associations proposed by the study were labelled as non-significant by sccomp ( Figure 5H ), two of those including outliers.

To assess the usefulness of sccomp more broadly, we analysed four other single-cell RNA sequencing public datasets (Table S1 ). Overall, sccomp was able to generate novel results, including differential composition and variability for all datasets ( Figure 5I , Figure   S5B ). We identified and excluded outlier observations in all datasets, with 19% of cell groups containing one or more outliers ( Figure 5J ). The 20% of those cell groups presented significant compositional differences after excluding outliers. The comparison between sccomp estimation and the estimation from the selected studies revealed that 15% of the calls that disagreed included one or more outliers.

We have introduced sccomp, a method for differential analysis of count-based compositional data. It is based on sum-constrained independent Beta-binomial distributions that share compositional characteristics with the Dirichlet-multinomial but allow group-specific variability and exclusion of outlier observation from the fit. Our model shares features with the generalised Dirichlet-multinomial (38) . However, it allows for missing observations and suits outlier exclusion.

The present study describes the proportion mean-variability association for cell omic compositional data. We tested such associations across 18 single-cell RNA sequencing, CyTOF and microbiome datasets. Our results have fundamental implications. They challenge the use of the Dirichlet-multinomial distribution, a standard in count-based compositional analysis, and the use of unconstrained, independent distributions. We showed that cell omic compositional data (e.g. EGAS00001004481) with N groups can be modelled with no more than N+1 degrees of freedom (N-1 for the means and 2 for the variability). This finding implies that such unconstrained models tend to be heavily overparameterised (using 2N degrees of freedom).

Our description of mean-variability association also has implications for differential variability testing. Ignoring the mean-variability association would result in biased estimates of the differential variability necessarily associated with the differential composition estimates. Defining the correlation line in log space allowed us to disentangle differential composition and variability and provide a meaningful estimate of how cell/taxonomic proportion variability varies across samples.

While the impact of outlier observations has been approached for metagenomic data (39) , our study proposes that single-cell compositional data are also outlier-rich. Our outlier identification approach overcomes the challenges of using residuals to identify outliers caused by the heteroscedastic nature of count-based compositional data and potential low sample size. We identified outliers in all single-cell RNA sequencing datasets that we have re-analysed, present in differentially and non-differentially large groups.

In real-world analyses, it is crucial to assess whether a statistical model is descriptively adequate for a specific query dataset. Sccomp offers a convenient functionality for posterior-predictive checks. Being able to generate data from a fitted model, sccomp offers a data simulation framework that reflects the properties of any target dataset while also allowing arbitrary simulation designs. Data simulation is possible using the sum-constrained Beta-binomial, Dirichlet-multinomial and logit-linear-multinomial distributions. Our realistic benchmarks (using a foreign distribution) show that sccomp confers an up to 2-fold incremental performance gain compared to previous methods.

Our reanalysis of public data demonstrates the practical application and efficacy of sccomp, which identified novel differential variability and compositional associations. We

show that some of the differential composition associations proposed by the respective studies might be false-negative due to the presence of outliers. For the breast cancer dataset introduced by Wu et al.

(3), we unveil differential constraints for the subtypes triple-negative, ER+ and HER2+.

This study, introducing several innovations in the field of cell omics compositional analyses such as differential variability analysis, a log-linear mean-variability relation, probabilistic outlier identification, cross-study information transfer, aims to enrich the field of single cell omics. Also, this study challenges established methodologies and provides a robust and flexible tool for the single-cell and microbiome scientific community. Being the first statistical model that fits data compositionality and group-wise variability while allowing the exclusion of outliers, we anticipate its wide adoption in other scientific fields. Sccomp is available as an R package via Bioconductor and GitHub.

The regression model underlying sccomp is based on sum-constrained independent

Beta-binomial distributions. The beta distribution is a continuous probability distribution defined between zero and one. The binomial distribution is the discrete probability distribution of the number of successes obtained in a specified number of mutually independent trials, each with the same probability of success. The Beta-binomial probability distribution is the compound binomial distribution obtained when the binomial success probability is given a beta distribution. While a single Beta-binomial distribution can model the distribution of the number of elements belonging to a particular group, it cannot model the compositional nature of the data from a number of independent Beta-binomials embodied in the constraint that the underlying expected proportions belonging to the different groups must sum to one. This dependence induces a small negative correlation among the observed proportions of elements across groups, similar to that seen in the multinomial distribution.

We impose this negative correlation by constraining the expected values of the group proportions (i.e. the means of our beta distributions) to sum to one.

We introduce here the common notation used in the mathematical formulation of the model. G is the number of groups, S is the number of samples, n s is the total number of cells probed for sample s, k g,s is the number of cells in sample s belonging to a group g. For clarity, we introduce our model in four steps. First, we describe the single-mean model; second, we describe the single-mean model with a log-linear constraint between variabilities and means; third, we introduce a two-mean model; fourth, we describe the linear model generalisation that is used in sccomp.

The Beta-binomial distribution is commonly defined using the (latent) shape parameters ɑ and β (Equation 1) from the Beta distribution. Here and elsewhere B (ɑ, β) denotes the classical Beta function with argument ɑ and β. Here, we use the mean and concentration (the inverse of variability) parameterisation (π,σ) with g representing the mean and g representing the concentration parameter of cell group g, this being the sum of the corresponding α and β. This parameterisation is convenient for our linear modelling. The mean is the average value of the underlying Beta distribution, while the concentration captures how concentrated the underlying Beta distribution is around its mean. The equivalence of the standard (ɑ, β) and the alternative (π,σ) parameterisation is shown in Equations 1-3.

Step 1, Single-mean model. The parameters of the single-mean model are elements = ( g ) ∈ S G+1 (simplex) of the sum-to-one-constrained vector of size G and a vector = ( g ) ∈ R + G of concentrations. The data are an G×S matrix K = (k g,s ) of counts, and a vector n = (n s ) of length S is the sum of k s ( ). The joint probability mass function is defined by = =1 ∑ , two observed quantities, K and n, depending on the parameters and , see (Equations 4-7). Statement 5 includes the sum constraint that induces the weak negative correlation of proportions characteristic of compositional data. The underlying assumption of this model is that the counts k g,s from the total counts n s are mutually independent Beta-binomially distributed random variables with the alternative parameters given.

means. For this model, we transform the parameters and to μ and ⍵ (see Equation 6 and below). The parameters and are suitable for an unconstrained single-mean model. Still, to permit a (log) linear relationship between our mean and concentration (the inverse of variability) parameters and the extension to more general linear models, we must use a different but equivalent set of parameters appropriate for linear subspaces of R G . The inverse-logit-multinomial (also known as softmax) function (Equation 6) takes a vector μ ∈ R G and converts it into a vector of G proportions that sum to 1, the components being proportional to the exponentials of the corresponding components of μ. However, this mapping is many-to-one. If inverse-logit-multinomial (μ) = , then also inverse-logit-multinomial (μ + c1 M ) = , where c is any real constant and 1 M is the G-vector of 1s. To make it one-to-one and so permit invertibility on its range, we need to restrict its domain. Write ℒ 0,G for the linear subspace of R G consisting of all μ = (μ g ) such that

. We will see that for every ∈ S G+1 there is a unique μ ∈ ℒ 0,G such that =1 ∑ µ = 0 inverse-logit-multinomial (μ) = . We call the µ the logit-multinomial proportion mean parameters, or just mean parameters when no confusion is likely. Letting GM denote the geometric mean, we write GM ( ) = . Then μ = (μ g ) where μ g = log ( g /GM 1 2 ...

( )) is readily checked to satisfy our requirements, i.e. μ ∈ ℒ 0,G and softmax (µ) = inverse-logit-multinomial (μ) = . This function of π is known as its center (ed) log-ratio (CLR). We see from (7) that ⍵ g = log ( g ), so our new parameter space is ℒ 0,G x R G . This process is also known as stick-breaking, which underlies the Dirichlet process (40, 41) .

Given µ, the parameter ⍵ will be given a normal prior distribution. The linear relation between μ and ⍵ which underlies our development is shown in Equation (8) . where λ 0 and λ 1 are scalars. The likelihood and priors for the single-mean model with log-linear concentration-mean relation are represented by the formulae 8-9. The complete parameter set is now μ ∈ ℒ 0 ⊂ R G , ⍵ ∈ R G , λ 0 ∈ R, λ 1 ∈ R, and the standard deviation ∈ R + going with the normal conditional distribution of the ω g s given the µ g s, see (11) below. The dataset is unchanged from the original single-mean model. Before generalising this model, we introduce and use the matrix M = (µ g,s ) of mean parameters, where µ g,s is the mean parameter for sample s and group g. The single-mean model is characterised by M having all its columns identical.

Step 3, Two-mean model. We now introduce the two-mean model. In this case, the matrix M = (µ g,s ) has two potentially distinct columns, one for each of two sets of samples. For simplicity, we will call these the control and treated samples and introduce the 2×S matrix X, whose 2 rows are the indicator vectors (i.e. vectors of zeros and ones) of the control and treated samples, respectively. If we now define a G×2 matrix Γ whose columns are any two mean parameter vectors, say μ c ∈ ℒ 0 , μ t ∈ ℒ 0 , then our two-mean model has matrix M = ΓX.

Step 4, Arbitrary linear model. The approach of the previous paragraph can easily be generalised to arbitrary linear models. For this generalisation, we replace the 2×S design matrix X above by an arbitrary C×S design matrix X, where C is the number of covariates associated with the samples (including one for an intercept if that is appropriate), and the G×2 matrix Γ above becomes a general G×C matrix whose C columns are all elements of ℒ 0 .

As before, M = ΓX.

We now define the full hierarchical linear model based on the sum-constrained Beta-binomial distribution. This model is defined through the G×C parameter matrix Γ; ⍵ of length G; , the scalars λ 0 and λ 1 ; and the dataset includes the G×S matrix K of counts, the S×1 vector n of totals, and the C×S design matrix X. The prior normal distributions are parameterised by their means and standard deviations. X s denotes the design vector (s th column) for sample s, and γ g indicates the coefficient vector (g th row) of Γ for cell-group g. Since M = ΓX , we must have µ g,s = γ g X s .

Inference. This set of sampling statements and the data ( The probability of the null hypothesis (i.e. no effect across conditions) for each group is obtained by calculating the posterior probability of γ g,c being larger (or smaller) than a fold-change threshold (0.2 by default). The false-discovery rate (FDR) is obtained by sorting in ascending order the probability of the null hypothesis (for any coefficient) and calculating the cumulative average as described by Stephens (26) .

By default, the data variability is modelled with one concentration (inverse of variability)

parameter ω g per group (variability independent of covariates). However, the user can provide a more general variability model using a variability design matrix. For example, the concentration can be estimated conditional on a factor of interest to perform differential variability analyses. We now introduce a two-group differential variability model. The following notation is the same as in the paragraph "Step 3, Two-mean model." of the Methods subsection "Statistical model". As ⍵ g is the log-concentration for the cell-group g, we introduce ⍵ g,i as the concentration for cell-group g and condition i. In this model, we increase the dimensionality of ⍵ from G to 2G, where each ⍵ g,1 and ⍵ g,2 represents the concentration of group g for two conditions (e.g. treatment and control). The expected value of ⍵ for a two-group differential model and the prior distribution is described in Equations 17 and 18.

Since group proportion means and variabilities are associated (see Proportion means and variabilities are log-linearly correlated in cell-omic data) differences in composition and variability will be associated. To test the biological effects that lead to differential variability that are not explained by differences in composition, we need to subtract the contribution of differential composition from the apparent differential variability. We compute the adjusted differential variability (independent of differential composition) using Formula (19) . The left side of the formula represents the (apparent) difference between variabilities, the right side of the formula represents the contribution of differential composition.

Using the dataset Wu et al., we show that without adjustment, the estimates of differential variability and composition would appear correlated ( Figure S6) . Often, when a cell group is differentially abundant seems also to be differentially variable. Again, this difference is the result of the mean-variability association in the first place. Without adjustment, the difference in variability would just indirectly inform us about the difference in composition without learning anything new. We show in Figure 5C and 5D that, using λ 1 to adjust for the contribution of differential composition, we obtain estimates for differences in variability that are uncorrelated with differences in composition.

These adjusted differential variability estimates are used to carry out a test along the lines of our testing for differential composition (see Method section, Statistical model subsection).

A robust iterative strategy for outlier identification was developed for negative-binomial data from bulk RNA sequencing (42) . Such a strategy is necessary because a fit that includes outliers makes the model biased by definition and produces skewed estimates. Sccomp implements a three-step approach; the first two aim to identify outliers, and the third aims to estimate associations. In theory, the outlier identification process should iterate until convergence (no other outlier detected); however, our analyses across seven datasets show that two iterations always reached convergence. In the first step, the model is fitted, and posterior predictive distribution is produced for each data point from the fitted parameters. 

Sccomp simulates data from a specific fit to observed data using posterior predictive distribution. This data can be overlaid to the observed data to assess the model descriptive adequacy. The probabilistic framework Stan (25) is used for data simulation. The posterior distribution draws for the parameters are used to draw from one to a maximum of 2000 datasets (by default, depending on the sampling size of the fit) from the sum-constrained Beta-binomial distributions.

By default, our model uses uninformative gaussian hyperpriors (see the Statistical model subsection) on the intercept (λ 0 ), slope (λ 1 ) and standard deviation ( ) of the prior for the concentration parameter ⍵. Sccomp offers the possibility to integrate prior knowledge about the mean-variability association from other, previously analysed datasets by setting informative hyperpriors. We also provide users with a set of hyperpriors for single-cell RNA sequencing, CyTOF and microbiome data, integrating the information from the 18 analysed datasets (Table S1) . We fit the model and calculate the posterior means and standard deviations of the three parameters (λ 0 , λ 1 , ) from these data sources and set them as the mean and standard deviation of the respective hyperpriors.

The function for linear modelling takes as input a table of cell counts ( Figure 1D 

To study the association between logit-multinomial mean µ g (where g is one cell type ) and log concentration ⍵ g (negative of variability) across cell omic technologies, we gathered 7 datasets from single-cell RNA sequencing (1-5, 17, 30-33) , 6 from CyTOF (45-50) and 6 from microbiome (51-56) studies (Table S1 ). The cell or taxonomic groups were defined in the respective studies. These datasets were analysed using the design suggested in the respective studies, assuming that the group-wise variability was independent of the covariates. For each dataset, the parameters µ g and ⍵ g were first estimated using sccomp without imposing any relationship between the two. This setting was obtained using flat, independent priors on the µ g and ⍵ g . We calculated the mean, 2.5% and 97.5% quantiles from the posterior distributions of µ g and ⍵ g . We then calculated the correlation between the posterior means of ⍵ g and µ g using a robust linear model (rlm, MASS (57, 58) ). The residuals of the robust regression (difference between estimated ⍵ g and regression line) were calculated, and their distribution was analysed.

To assess the shrinkage effect on the concentration ⍵ g of the modelling of its linear relationship with µ g , we used sccomp including the prior of the ⍵ g given the µ g . We calculated the posterior mean and quantiles as we did with the flat independent priors. We then calculated the shrinkage as the ratio of the estimated means of µ g and ⍵ g for the two runs with or without conditional priors. We model the bimodal distribution along the regression trend present in single-cell RNA sequencing data with a mixture regression model having Gaussian distributed errors. The mixture distribution assumes an ordering of the components. The component with a higher intercept (λ 0,high ) is given a 0.9 probability, and the smaller component (λ 0,low ) is given a probability of 0.1. The slope (λ 1 ) and the standard deviation ( ) are assumed to be the same for the two components (given our analyses on the single-cell RNA sequencing data with no linear association between the means and variabilities built-in).

The implementation of sccomp gives the option to model the mean-variability association using mixture distribution (suggested for single-cell RNA sequencing data).

To assess the adequacy (29) (Table S1 ). For comparison, we performed the inference and analyses with both the sum-constrained Beta-binomial and the Dirichlet-multinomial models. We first used sccomp on the cell or taxonomic groups using the designs defined in the respective studies, assuming the concentrations are independent of covariates. We then used the simulation feature of sccomp to replicate those 18 datasets (i.e. posterior predictive distribution). We calculated proportion from the observed and generated counts and compared their distributions (one element being the proportion for one sample-group pair) using linear regression (lm function from R). To assess the presence of any over-or under-estimation bias conditional on the relative abundance, we compared the slope of the association between observed and generated data with the baseline group abundance (intercept coefficient).

To assess the ability of sccomp to generate discoveries from publicly available datasets, we applied sccomp to 6 single-cell RNA sequencing datasets (1) (2) (3) (4) (5) 17) . The cell groups were defined in the respective studies, and the concentration was set to be conditional to the factor of interest (first covariate in the formulae in Table S1 ). We then compared the significant associations and the presence of outliers identified by sccomp with the findings presented in the respective studies. We estimated the differential composition for all datasets, while only for the datasets with a binary factor of interest and a sample size larger than 10, we estimated the differential variability. indicates that all groups (cell-types) are similarly variable; a larger slope indicates that the larger groups are relatively much more variable than small groups. We can compare these high-level properties across groups of samples, for example, from different breast cancer subtypes. This analysis can identify different high-level signatures that might underlie different biological constraints and processes. For that, we ran sccomp independently for the samples according to the factor of interest "subtype" and compared the posterior distribution of the estimated mean-variability association (λ 1 ) and the baseline variability (modelled in its negative form as concentration λ 0 ).

We base our benchmark on simulated data with realistic characteristics based on the COVID19 dataset EGAS00001004481 (4). Data was simulated based on a logit-linear-multinomial model to ensure fairness across methods. For the simulation, we calculated group proportions across samples and fitted a logit-linear regression model using Stan (25) . This model captures the mean-variability association similar to the default sccomp model. We use the posterior distribution to simulate the benchmark datasets, replacing the intercepts and slopes with given ones to establish ground truth. For each simulation, we randomly selected 40% of the groups to be compositionally different between conditions.

With the values of the specified parameters, we simulated data from a logit-linear-multinomial model to obtain counts. The total cell count for each sample was set to 1000.

We based the simulation on three variables: the magnitude of the difference, number of samples per condition, and number of groups. The outliers were injected with realistic frequency (10%) and magnitude (from 2 to 10 fold increase or decrease), observed in our data-integration analyses. For each simulation, we produced a receiving-operator characteristic (ROC) curve ranking groups by their statistics and comparing them with the ground truth of significant/non-significant differentially abundant groups. We calculated the area under the curve (AUC) from the receiving-operator characteristic curves to a 0.1 false-positive rate (grey shade in Figure 4B ). For each combination of the simulation parameters, we simulated 50 datasets and averaged across the areas under the curves. For comparative purposes, we performed a benchmark (as described above) simulating data without outliers and using the sum-constrained Beta-binomial distribution.

We assessed the leap of performance improvement that sccomp provides compared to its next-best in relation to the performance gain of all methods. We call this measure incremental performance gain. An incremental performance gain of one indicates that, compared to the average performance improvement that any method had with its second-best, a method had the same improvement. This can be thought of as linear incremental performance gain. For each of the 15 simulation settings, we calculated the average area under the ROC curve (AUC) to obtain a unique performance score. From this average, we excluded the slope ranges where the performance of most algorithms approached a plateau to calculate the difference in performance in the most informative simulation regimens. For each simulation setting, we ranked methods based on the performance score. We calculate the gain in performance as the difference in average AUC between each method and their next-best ranked (e.g. first against second, second against third). The fold gain in performance was calculated as the ratio between the gain in performance of each method and the average of all others.

The data analysis was performed in R (59) . Data wrangling was done through tidyverse (60).

Single-cell data analysis and manipulation were done through Seurat (43) Figure S2 . Benchmark on realistic simulated data from the COVID19 dataset EGAS00001004481 (4) using the sum-constrained Beta-binomial model. (Table S1 ). Cells are shaded according to the type of finding (e.g. green shades for novel differential composition associations). 

Single-cell analysis reveals new evolutionary complexity in uveal melanoma

Tumor and immune reprogramming during immunotherapy in advanced renal cell carcinoma

A single-cell and spatially resolved atlas of human breast cancers

COVID-19 severity correlates with airway epithelium-immune cell interactions identified by single-cell analysis

Defining T Cell States Associated with Response to Checkpoint Immunotherapy in Melanoma

Detection of differentially abundant cell subpopulations in scRNA-seq data

Gut microbiota in human metabolic health and disease

The human skin microbiome

A single-cell type transcriptomics map of human tissues

Screening: CyTOF-the next generation of cell detection

Differential Variation Analysis Enables Detection of Tumor Heterogeneity Using Single-Cell RNA-Sequencing Data

scDC: single cell differential composition analysis

propeller: testing for differences in cell type proportions in single cell data. bioRxiv

diffcyt: Differential discovery in high-dimensional cytometry via high-resolution clustering

Differential abundance testing on single-cell data using k-nearest neighbor graphs

MixMC: A Multivariate Statistical Framework to Gain Insight into Microbial Communities

Time-resolved single-cell analysis of Brca1 associated mammary tumourigenesis reveals aberrant differentiation of luminal progenitors

Analysis of compositions of microbiomes with bias correction

A single-cell RNA expression atlas of normal, preneoplastic and tumorigenic states in the human breast

Unifying the analysis of high-throughput sequencing datasets: characterizing RNA-seq, 16S rRNA gene sequencing and selective growth experiments by compositional data analysis

An integrative Bayesian Dirichlet-multinomial regression model for the analysis of taxonomic abundances in microbiome data

scCODA is a Bayesian model for compositional single-cell data analysis

limma: Linear Models for Microarray Data

Stan: A Probabilistic Programming Language

False discovery rates: a new deal

Posterior predictive checks: Principles and discussion

Posterior predictive checks can and should be Bayesian: comment on Gelman and Shalizi

Comparison of clustering tools in R for medium-sized 10x Genomics single-cell RNA-sequencing data

Systematic comparison of single-cell and single-nucleus RNA-sequencing methods

Single cell transcriptomics reveals opioid usage evokes widespread suppression of antiviral gene program

Single-cell transcriptomics of blood reveals a natural killer cell subset depletion in tuberculosis

Estimating error models for whole genome sequencing using mixtures of Dirichlet-multinomial distributions

Batch effects correction for microbiome data with Dirichlet-multinomial regression

Dirichlet-multinomial modelling outperforms alternatives for analysis of microbiome and other ecological count data

Bayesian and frequentist approaches to multinomial count models in ecology

Clustering of Count Data Using Generalized Dirichlet Multinomial Distributions

Robust regression with compositional covariates

Discrete random probability measures: a general framework for nonparametric Bayesian inference

A CONSTRUCTIVE DEFINITION OF DIRICHLET PRIORS

Probabilistic outlier identification for RNA sequencing generalized linear models

Integrating single-cell transcriptomic data across different conditions, technologies, and species

Orchestrating single-cell analysis with Bioconductor

Removing unwanted variation with CytofRUV to integrate multiple CyTOF datasets

Minimizing Batch Effects in Mass Cytometry Data

Multiplexed mass cytometry profiling of cellular states perturbed by small-molecule regulators

CytoNorm: A Normalization Algorithm for Cytometry Data

Comprehensive Immune Monitoring of Clinical Trials to Advance Human Immunotherapy

Systems-Level Immunomonitoring from Acute to Recovery Phase of Severe COVID-19

Gut metagenome in European women with normal, impaired and diabetic glucose control

The treatment-naive microbiome in new-onset Crohn's disease

Diet rapidly and reproducibly alters the human gut microbiome

Antibiotics, birth mode, and diet shape microbiome maturation during early life

Cohabiting family members share microbiota with one another and with their dogs

Diarrhea in young children from low-income countries leads to large-scale alterations in intestinal microbiota composition

Robust statistics john wiley & sons

Robust Statistics: The Approach Based on Influence Functions

The new S language

Welcome to the Tidyverse

Interfacing Seurat with the R tidy universe

tidybulk: an R tidy framework for modular transcriptomic data analysis

Makeflow: a portable abstraction for data intensive computing on clusters

Data analysis using regression and multilevel/hierarchical models

GGally: extension to "ggplot2

A distinct innate immune signature marks progression from mild to severe COVID-19

For panels A, B and C, the points are the posterior means of the parameters. The error bars are the 95% credible intervals. The first row refers to mean and variability estimates association with no constraints on their relationship. The dotted line is the line fitted by robust linear modelling (rlm (58)). The second row has rlm residuals vs fitted values with a (blue) lowess smoother superimposed. The third line represents the estimates with the mean-variability association modelled. The dashed lines are the correlation estimated by sccomp. A: Mean-variability association for the single-cell RNA sequencing data. For this data type, the bimodal association is modelled (see Methods, Study of mean-variability association) B: CyTOF data. The association is modelled as unimodal. C: Metagenomics data. The association is modelled as unimodal

We thank Davis MacCarthy, Gordon Smyth and Jeffrey Pullin for the fruitful discussions about the method. We thank all the Stan community for the constant support.

Authors have no competing interests to declare.