Accommodating site variation in neuroimaging data using hierarchical and Bayesian models ACCOMMODATING SITE VARIATION IN NEUROIMAGING DATA USING HIERARCHICAL AND BAYESIAN MODELS A PREPRINT Johanna M. M. Bayer Orygen Centre for Youth Mental Health, Melbourne, Australia The University of Melbourne, Melbourne, Australia bayerj@student.unimelb.edu.au Richard Dinga Donders Institute, Radboud University, Nijmegen, the Netherlands Radboud University Medical Centre, Nijmegen, the Netherlands Seyed Mostafa Kia Donders Institute, Radboud University, Nijmegen, the Netherlands Radboud University Medical Centre, Nijmegen, the Netherlands Akhil R. Kottaram Orygen Centre for Youth Mental Health, Melbourne, Australia Thomas Wolfers Radboud University Medical Centre, Nijmegen, the Netherlands Department of Psychology, University of Oslo, Norway Jinglei Lv School of Biomedical Engineering Brain and Mind Center, University of Sydney, Sydney, Australia Andrew Zalesky Melbourne Neuropsychiatry Centre, The University of Melbourne Melbourne Health, Melbourne, Australia Department of Biomedical Engineering, The University of Melbourne, Australia Lianne Schmaal ∗ Orygen Centre for Youth Mental Health,Melbourne, Australia The University of Melbourne, Melbourne, Australia Andre Marquand * Donders Institute, Radboud University, Nijmegen, the Netherlands Radboud University Medical Centre, Nijmegen, the Netherlands Institute of Psychiatry, Kings College London, London, UK February 9, 2021 ∗shared last author (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted February 11, 2021. ; https://doi.org/10.1101/2021.02.09.430363doi: bioRxiv preprint https://doi.org/10.1101/2021.02.09.430363 A PREPRINT - FEBRUARY 9, 2021 ABSTRACT The potential of normative modeling to make individualized predictions has led to structural neu- roimaging results that go beyond the case-control approach. However, site effects, often con- founded with variables of interest in a complex manner, induce a bias in estimates of normative models, which has impeded the application of normative models to large multi-site neuroimag- ing data sets. In this study, we suggest accommodating for these site effects by including them as random effects in a hierarchical Bayesian model. We compare the performance of a linear and a non-linear hierarchical Bayesian model in modeling the effect of age on cortical thickness. We used data of 570 healthy individuals from the ABIDE (autism brain imaging data exchange, http://preprocessed-connectomes-project.org/abide/) data set in our experiments. We compare the proposed method to several harmonization techniques commonly used to deal with additive and multiplicative site effects, including regressing out site and harmonizing for site with ComBat, both with and without explicitly preserving variance related to age and sex as biological variation of interest. In addition, we make predictions from raw data, in which site has not been accommodated for. The proposed hierarchical Bayesian method shows the best performance accord- ing to multiple metrics. Performance is particularly bad for the regression model and the ComBat model when age and sex are not explicitly modeled. In addition, the predictions of those models are noticeably poorly calibrated, suffering from a loss of more than 90 % of the original variance. From these results we conclude that harmonization techniques like regressing out site and ComBat do not sufficiently accommodate for multi-site effects in pooled neuroimaging data sets. Our results show that the complex interaction between site and variables of interest is likely to be underestimated by those tools. One consequence is that harmonization techniques removed too much variance, which is undesirable and may have unpredictable consequences for subsequent analysis. Our results also show that this can be mostly avoided by explicitly modeling site as part of a hierarchical Bayesian Model. We discuss the potential of z-scores derived from normative models to be used as site corrected variables and of our method as site correction tool. Keywords neuroimaging · normative modeling · site effects · Hierarchical Bayesian Modeling 1 Introduction The most prominent paradigm in clinical neuroimaging research has for a long time been case-control approaches which compare averages of groups of individuals on brain imaging measures. Case-control inferences can be clinically meaningful under some circumstances when the group mean is a good representation of each individual in the group. However, this pre-condition has been challenged recently, demonstrating that the biological heterogeneity within clinical groups can be substantially large [Marquand et al., 2016]. For example, the structure and morphology of the brain have been found to vary between individuals in dynamic phases like adolescence [Foulkes and Blakemore, 2018] and within clinical groups, such as bipolar disorder and schizophrenia [Wolfers et al., 2018a] and attention deficit disorder [Wolfers et al., 2019]. In addition, inter-individual differences have shown to not necessarily be in line with results obtained via the group comparison approach [Wolfers et al., 2019]. Such heterogeneity has been considered a potential cause for the lack of differences between clinical groups and controls within the standard group comparison approach [Feczko et al., 2019] and the failure to replicate findings between studies [Fried, 2017]. As a consequence, there has been a shift in focus towards taking into account variation at the individual level [Marquand et al., 2019]. This is in line with a trend towards personalized medicine or "precision medicine" [Mirnezami et al., 2012], where characteristics of the individual are used to guide the treatment of mental disorders. This shift has been accompanied by a trend towards approaches that go beyond comparing averages of distinctly labeled groups [Insel et al., 2010, Insel, 2014], for an overview of methods see [Marquand et al., 2016]. Among them, normative modeling has been successfully used to capture inter-individual variability and make predictions at the individual level. The strength of normative modeling lies within the ability to map variation along one dimension (e.g., brain volume) onto a second co-varying variable (e.g., age), redefining the variation in the first dimension as explained by this new covariate of interest. This concept allows to describe the normative variation, thus the range containing e.g., 95 % of all individuals, as a function of the covariate and considers each individual’s score in relation to the variation in the reference group defined by the covariate score. The concept is similar to the use of growth charts in pediatric medicine, in which height and weight are expressed as a function of age. Hence, in this setting, an individual’s height or weight is not considered by its absolute value, but expressed as a percentile score of deviation fluctuating with age, with the median line corresponding to the 50% percentile and defining the norm, or average height. 1 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted February 11, 2021. ; https://doi.org/10.1101/2021.02.09.430363doi: bioRxiv preprint http://preprocessed-connectomes-project.org/abide/ https://doi.org/10.1101/2021.02.09.430363 A PREPRINT - FEBRUARY 9, 2021 In neuroimaging, normative models have been applied to clinical and non-clinical problems using various covariates, statistical modeling approaches (for an overview see [Marquand et al., 2019]) and targeting a variety of response variables. In general, any variable can be used as a covariate in a normative model targeting neuroimaging measures, as long as the variation along the co-varying dimension is not zero. However, normative models with age and sex as covariates and brain volume as response variable are currently more frequently found in the litera- ture, [Wolfers et al., 2020, Wolfers et al., 2018b, Zabihi et al., 2019, Kessler et al., 2016]. These implement the growth charting idea applied to high dimensional brain imaging data. For example, a normative model of a brain structure can be created based on the variation of individuals in population based cohorts. The estimated norm can be used to infer where individuals with clinical symptoms can be placed with respect to the reference defined by the norma- tive model. This has been the recipe of many recently published studies using the normative modeling framework [Wolfers et al., 2020, Bethlehem et al., 2018, Wolfers et al., 2019, Lv et al., 2020]. Underlying this approach is the assumption that the individually derived patterns of deviation uncover associations to clinical/behavioral variables that would be obscured by averaging across groups of individuals. However, the amount of data necessary to create normative models poses a challenge to normative modeling in neuroimaging, as the cost and time factor associated with neuroimaging data impedes the collection of large neuroimaging samples in a harmonized way. One exceptional example where large scale data collection succeeded and included both harmonized scanners and scanning protocols, is the UK Biobank initiative, which, when launched in 2006, aimed to scan 100,000 individuals at four different scanning locations [https://www.ukbiobank.ac.uk/explore-your-participation/contribute-further/ imaging-study][Miller et al., 2016]. Other neuroimaging initiatives have also taken on the challenge to collect neuroimaging data in large scale quantities and have relied on harmonized scanning protocols, but did not collect the data using harmonized scanners (i.e. ADNI, [Mueller et al., 2005], ABCD study [Volkow et al., 2018]). Nonetheless, the restricted age ranges (e.g., 40-69 years in UK Biobank [Miller et al., 2016]), or focus on a particular (clinical) cohort (e.g. Alzheimer’s in ADNI, [Mueller et al., 2005]) limit their utility for estimating normative models mapping the normative association between, for example, age and brain structure or function. An alternative way to obtain large neuroimaging data sets and assess data from a large number of subjects is by pooling or sharing data that has already been collected. One example is the Enhancing NeuroImaging and Genetics through Meta-Analysis (ENIGMA) consortium [Thompson et al., 2020]. ENIGMA succeeded in pooling neuroimaging and genetics data of thousands of individuals, including healthy individuals and individuals with psychiatric or neurological disorders. The strategy of data sharing initiatives like ENIGMA is to collect already collected data from different cohorts and different scanning sites and harmonize preprocessing and statistical analysis with standardized protocols. However, a major disadvantage is the presence of confounding "scanner effects" [Fortin et al., 2018] (e.g., differences in field strength, scanner manufacturer etc. [Han et al., 2006])). These confounding effects present as site correlated biases that cannot be explained by biological heterogeneity between samples. An example of those effects on derived measures of cortical thickness can be found in Fig. 1a. They result from a complex interaction between site and variables of interest, manifesting in biases on lower and higher order properties of the distribution of interest, such as differences in mean and standard deviations, skewness and spatial biases Fig. (1a, 1b), and cannot be explained by e.g., differences in age or sex Fig. (1c). As the origin of these effects might not only be related to the scanner per se, but extend to various factors related to a single acquisition site [Gronenschild et al., 2012], we will refer to them as site effects from here on. As outlined in the previous paragraph, the effort to create large samples to capture between subject variability often induces site-driven variability. This issue of site-driven variability in shared neuroimaging data has been acknowledged and has led to the development of harmonization methods at a statistical level. A common approach to deal with site effects is through "harmonizing" by, e.g., confound regression. One example of this approach is a set of algorithms summarized under the name "ComBat" [Fortin et al., 2017]. The method had originally been developed by [Johnson et al., 2007], who used empirical Bayes to estimate "batch effects", referring to non-biological variation added due to the handling of petri-dishes in micro-array experiments on the results of gene expression data. Fortin and colleagues adapted the framework to apply to neuroimaging data [Fortin et al., 2017]. In ComBat, additive and multiplicative site effects on a particular target unit (e.g., a particular brain voxel for one participant) are estimated using empirical Bayes and by placing a prior distribution over estimates for these units. The etsimate of the scanner effect is then used to adjust the prediction. Newer versions also allow to preserve variance of interest in the model, for example for age, sex or diagnosis [Fortin et al., 2017, Fortin et al., 2018]. ComBat has been applied to several types of neuroimaging data, including diffusion tensor imaging data (DTI, [Fortin et al., 2017]) and structural magnetic resonance imaging data [Fortin et al., 2018]. However, the reliability of harmonization strategies is grounded on the condition that site effects are orthogonal to the effect of interest and uncorrelated with other covariates in the model [Chen et al., 2014]. In reality, however, data pooled from several sites is often confounded with co-linear effects. Many individual neuroimaging samples, 2 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted February 11, 2021. ; https://doi.org/10.1101/2021.02.09.430363doi: bioRxiv preprint https://www.ukbiobank.ac.uk/explore-your-participation/contribute-further/imaging-study https://www.ukbiobank.ac.uk/explore-your-participation/contribute-further/imaging-study https://doi.org/10.1101/2021.02.09.430363 A PREPRINT - FEBRUARY 9, 2021 for example, are restricted to a specific age range, leading to age being correlated with site effects. In this scenario, removing an estimate of the scanner effect can lead to excluding (biological) variation that would be of interest. With this paper we suggest an alternative approach to deal with site effects in neuroimaging, which is relatively generally applicable. However, here we focus in particular on normative modeling. We propose a hierarchical Bayesian approach in which we include site as a covariate into the model, avoiding the exclusion of meaningful variance correlated to site by predicting site effects as part of the model instead of removing them from the data. This approach is similar to the approach by [Kia et al., 2020], who used hierarchical Bayesian regression (HBR) in a similar way for multi-site modeling in a pooled neuroimaging data set, which contained 7499 participants that were scanned with 33 different scanners. [Kia et al., 2020]’s estimate of site variation is based on a partial pooling approach, in which the variation between site-specific parameters is bound by a shared prior. The approach showed better performance when evaluated with respect to metrics accounting for the quality of the predictive mean and variance compared to a complete pooling of site parameters and to ComBat harmonization, and similar performance to a no-pooling approach, with the benefit of reduced risk of over-fitting due to the shared site variance. Moreover, [Kia et al., 2020] also showed that the posterior distribution of site parameters from the training set can also be used as an informed prior to make predictions in an unseen, new test set, outperforming predictions from complete pooling and uninformed priors, and overcoming a weakness of ComBat. The method was also able to display heterogeneity between individuals with varying clinical diagnoses in associated brain regions of 1017 clinical patients of the study. The present paper is a replication and extension of the approach by [Kia et al., 2020]. Based on several successful attempts of using Gaussian Process Regression to map non-linearity in normative models [Kia and Marquand, 2018, Marquand et al., 2016, Marquand et al., 2014], we extend the normative model with the capacity to account for site effects by adding a Gaussian process to model non-linear effects between age and the brain structure. In addition, our model is fully Bayesian and entails a hierarchical structure, including priors and hyper priors for each parameter. We use data from the ABIDE (autism brain imaging data exchange, http://preprocessed-connectomes-project. org/abide/) data set to compare a non-linear, Gaussian version of the model, to a linear hierarchical Bayesian version accounting for site effects that does not include the Gaussian Process term. We show that the hierarchical Bayesian models including a site parameter perform better than existing methods for dealing with additive and multiplicative site effects, including ComBat and regressing out site. We discuss the normative hierarchical Bayesian methods with regard to their implications for neuroimaging data-sharing initiatives and their use as general technique to correct for site effects. 2 Methods In this section we will introduce the data used in this study and the pre-processing steps applied, followed by a conceptual and mathematical description of our approach to include site as predictor in a normative hierarchical Bayesian model. We will also illustrate other methods (than including site as predictor) to accommodate for site effects that will be used to validate our approach against. Lastly, we will outline which measures will be used for model comparison. 2.1 Data The following sub-section aims to give a description of the ABIDE data set, including a study on the scope of site effects in the data. 2.1.1 ABIDE data set The ABIDE consortium (http://preprocessed-connectomes-project.org/abide/) was founded to facilitate research and collaboration on autism spectrum disorders by data aggregation and sharing. The consortium provides a publicly available structural magnetic resonance imaging (MRI) data set and corresponding phenotypic information of 539 individuals with autism spectrum disorder and 573 age-matched typical controls. For this study, only data from healthy individuals were included. As those healthy controls are meant to be complementary to the autism branch in the data set, 403 out of 539 subjects in this study were male. The data was processed using a standardized protocol [Craddock et al., 2013] of the FreeSurfer standard pipeline (Desikan-Kiliany Atlas) as part of the Preprocessed Connectomes Project [Craddock et al., 2013] and has been made available for download on the preprocessed section of the ABIDE initiative. For the current study we focused on cortical thickness measures of the 34 bilateral regions of the Desikian Killiany atlas parcellation [Desikan et al., 2006] as a part of the FreeSurfer [Fischl et al., 2004] output and the average cortical thickness across all 34 regions. We chose to include cortical thickness measures since they show a strong (negative) association with age (unlike measures of surface area, which remain more stable across the life span [Storsve et al., 2014]). 3 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted February 11, 2021. ; https://doi.org/10.1101/2021.02.09.430363doi: bioRxiv preprint http://preprocessed-connectomes-project.org/abide/ http://preprocessed-connectomes-project.org/abide/ http://preprocessed-connectomes-project.org/abide/ https://doi.org/10.1101/2021.02.09.430363 A PREPRINT - FEBRUARY 9, 2021 (a) Distribution of average cortical thickness measures of 573 individuals, grouped by the 20 acquisition sites the data were collected at (each boxplot de- scribes the distribution of one site). (b) Average cortical thickness of 573 individuals regressed onto age, grouped by site (each regression line describes one site). (c) Thickness measures of all 34 cortical re- gions average cortical thickness grouped by individual, colored by site, sorted by age (each boxplot represents one individual). Displayed are 4 out of 20 sites from the ABIDE data set (d) Distribution of all 34 cortical regions average cortcial thickness per individual, summarized as boxplot (each boxplot represents one individual). Boxplots are coloured by site and ordered by age within site. Figure 1: Site effects in 573 healthy individuals from the ABIDE data set. 2.1.2 Site effects in the ABIDE data set The ABIDE data set has been obtained by aggregating data from 20 independent samples collected at 17 different scan- ning locations [Di Martino et al., 2014]. Although all data has been collected with 3 Tesla scanners and preprocessed in a harmonized way [Craddock et al., 2013], sequence parameters for anatomical and functional data, as well as type of scanner varied across sites [Di Martino et al., 2014]. In addition, sites differ in distribution of age and sex and in sample size. An overview of site-specific data is provided in Table 1 and in [Di Martino et al., 2014]. The ABIDE data set is affected by site specific effects that are unlikely to be explained by biological variation. They manifest as linear and non-linear interactions between scanning site, covariates (for example age and sex), and cortical measures. Similar to batch effects in genomics [Leek et al., 2010], those effects lead to a clustering of the data caused by external factors related to the scanning- and analysis process. With the aim to estimate to which extent the ABIDE data set is affected by site effects, we calculated an ANCOVA with age as covariate. It revealed that average cortical thickness differed between site (main effect site: F(19, 516) = 4.4, p < 0.1 × 10−8, sum contrast). In addition we tested for differences in variance between sites. Bartlett’s sphericity test [Bartlett, 1937] showed a difference in variance between sites even after regressing out variance that could be explained by age and sex (p < 0.001). The site effects in the ABIDE data set are visualized in Fig. 1. 2.2 Splitting the ABIDE data set into training and test sets To evaluate the performance of the models, we split the data into a training set (70% of data) and a test set (30% of data) using the R package caret and splitstackshape, while the distribution of age, sex and site was preserved between sets. Thus, training and test sets contained individuals from the same sites ("within-site-split"). An overview of the distribution of age and sex for the training and test sets can be found in Fig. 2. Subsequently, the training and test sets were standardized region-wise based on location and scale parameters of the training set. For the model estimation process, only complete pairs of observations (per region) were used. 4 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted February 11, 2021. ; https://doi.org/10.1101/2021.02.09.430363doi: bioRxiv preprint https://doi.org/10.1101/2021.02.09.430363 A PREPRINT - FEBRUARY 9, 2021 Figure 2: Overview over pheno- typic information in the ABIDE data set. // Age male subjects: M = 17.5. SD = 8.3. Age female subjects: M = 15.6, SD = 7.0. Range = 6.5-40 2.3 site as a predictor in a Hierarchical Bayesian Model With the aim to create reliable normative models in multi-site neuroimaging data, we developed and compared two versions of a hierarchical Bayesian models that include site as a predictor. In a hierarchical linear version of the model, site is modeled hierarchically, resulting in a random effect for site (Hierarchical Bayesian Linear Model, HBLM). In a non-linear version of the model, a Gaussian Process for age is added to test whether performance is increased if the model is also able to capture non-linear effects between age and thickness of the cortical region ("Hierarchical Bayesian Gaussian Process Model, HBGPM"). Both Hierarchical Bayesian models were trained and tested in a within site split (see section 2.2 on splitting the multi-site ABIDE data set.) 2.4 Comparison models To get a better understanding of the performance of our approach, we performed a second analysis, comparing the hierarchical Bayesian approach with site as predictor to predictions made from a data that other methods managing site effects had been applied on. In the following, those alternative models will be summarized under the term comparison models. Of note, the approach used to accommodate for site effects in the comparison models is fundamentally different from the approach used in the hierarchical Bayesian models. In the hierarchical Bayesian approach, multi-level modeling is used to account for site-variance without removing it, whereas different methods of harmonization are used on the data to remove variance related to site as part of the comparison models approach. In detail, the comparison model approach entailed a two-step procedure, in which site effects are first harmonized by three different common models of site harmonization, and then a simple Bayesian linear algorithm, with an additive term for age and sex, but without site as a predictor is used to make predictions in Stan [Stan Development Team, 2020b]. The harmonization procedures include i) regressing out site effect from the cortical thickness measures using linear regression and using the residuals as input to the simple Bayesian linear model (thus, removing additive variant components of site), ii) using ComBat [Johnson et al., 2007, Fortin et al., 2017] to clear the data from site effects (thus, harmonizing for additive and multiplicative effects of site, and iii) using ComBat as above, but explicitly preserving the variance associated with sex and age; an approach which will be referred to as modified ComBat in the following. Predictions made from raw data (thus, without any treatment of site effects) were used as a baseline model. An overview over all pipelines for all models can be found in Fig. 3. 2.5 Performance measures 2.5.1 Measures of model performance Model performance is assessed using several common performance metrics. The Pearson’s correlation coefficient ρ indicates the linear association between true and predicted value of cortical thickness measures. However, correlations 5 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted February 11, 2021. ; https://doi.org/10.1101/2021.02.09.430363doi: bioRxiv preprint https://doi.org/10.1101/2021.02.09.430363 A PREPRINT - FEBRUARY 9, 2021 Figure 3: Pipelines for hierarchical Bayesian and comparison models are not a sensitive error measure and cannot capture the "miss" between true and predicted value. Hence, we also calculate the standardized version of the root mean squared error (SRMSE) and the point-wise log-likelihood at each data point in the test set as a metric indicating deviance from the true value. However, these measures only take into account the estimate of the mean, and do not account for variations in the estimate of the variance. Thus we also compute the proportion of variance explained (EV) by the predicted values and a standardized version of the log-loss (mean standardized log-loss, MSLL [Rasmussen and Williams, 2006]). The latter does not only take into account the variance of the test set, but also standardizes it by the variance of the training set, making a comparison between the models possible. This step is necessary as various methods of correcting for site might also have an impact on the variance remaining in the data. 2.5.2 Measures of goodness of the simulation in Stan Parameters indicating the goodness of the model simulation process in Stan itself, like convergence, effective sample size, and trace plots can be found in the supplementary material. 2.6 Model specification In this section we show how normative models describing the association between age, and sex, and cortical thickness measures can be modeled on data comprising site effects using a hierarchical Bayesian linear mixed model with a Gaussian Process term, which allows to model non-linear association between age and cortical thickness measures. 6 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted February 11, 2021. ; https://doi.org/10.1101/2021.02.09.430363doi: bioRxiv preprint https://doi.org/10.1101/2021.02.09.430363 A PREPRINT - FEBRUARY 9, 2021 Following the notations of [Gelman, 2008, Rasmussen and Williams, 2006], we model a target vector y ∈ Rn×1 containing the the individual responses yi for each subject i = 1, . . . ,n and each region, using a latent function f = f(x). fi = f(xi) is the evaluation of the latent function for an input vector xi containing all p input variables of subject i, and is considered to differ from the true response variables by additive noise �i with the variance ηi and N(0,σ2) along the diagonal, with I being a n×n identity matrix: y ∼N(f,σ2I), (1) or, for the individual case: yi = f(xi) + �i, (2) with: �i ∼N (0,σ2). The ability of the model to deal with site effects is obtained by introducing a random effect for site s = 1, 2, . . . ,q so that the prediction for the ith subject is a combination of fixed and varying effects: f = Xβ + Zu + γ, (3) where γ is an additional non-linear component (defined in (5) below) and the estimate for one particular subject i is calculated the following: fi = p∑ j=1 xijβj + q∑ s=1 zisus + γi (4) with β ∼N(0, Σj) u ∼N(0, Σs). Here, β is a 1 × p vector containing the fixed regression weights corresponding to an n × p input matrix X with columns j = 1, . . . ,p. In case of non-centralized data one column of ones for an intercept offset has to be added. Similarly, u is a 1 × q vector containing the weights for random effects across subjects, corresponding to a dummy coded n×q matrix Z modeling site. For all linear models, in (3) we assume γi = 0. For the non-linear models we assume γ is a Gaussian Process with mean function m(x) and covariance function k(x,x′) to allow for non-linear dependencies between the predictors and the target variable: γ ∼ GP(m(x),k(x,x′)). (5) In our case, we set m(x) = 0 and define k(x,x′) as the additional non-linear component in the following squared-exponential form: k(x,x′) = σ2fexp(− 1 2l2j (x,x′))2, (6) with free parameters for the signal variance term σ2f and the length scale l. Note this allows to specify two sources of variance: The signal variance σ2f and the noise variance σ 2 as modeled in (1). From a hierarchical Bayesian point of view, random effects are equal to a hierarchical structure of sources of variation. For modeling site effects, introducing a hierarchical structure has the benefit that it allows to include structural dependencies between sites via partial pooling. Thus, instead of modeling site effects as an effect shared between sites or independently from each other, a semi-independent association between sites can be obtained via assuming that all site parameters originate from a shared first-order prior distribution. This concept has been used elsewhere [Kia et al., 2020, Gelman et al., 2013, Mathys et al., 2012]. 7 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted February 11, 2021. ; https://doi.org/10.1101/2021.02.09.430363doi: bioRxiv preprint https://doi.org/10.1101/2021.02.09.430363 A PREPRINT - FEBRUARY 9, 2021 We hence induce shared priors and hyper priors θ0 for site s, i.e. ∀s,us ∼ InvΓ(2, 2), and a uniform prior for the length scale l ∼ U(1, 8). We use Stan [Carpenter et al., 2017, Stan Development Team, 2020b] to estimate all free parameters θ = (βT , uT, lT,σ,σf ) performing Bayesian inference: p(θ|X,y,θ0) = p(θ,X,y,θ0) p(X,y,θ0) = p(θ|X,y,θ0) p(X,y,θ0) (7) where p(X,y,θ0) = ∫ p(θ)p(X,y,θ0|θ) dθ. 2.6.1 Posterior predictive distribution We obtain the posterior predictive distribution y∗ for a new sample x∗ via: p(y∗|y) = ∫ p(y∗,θ|x∗,X,y,θ0) dθ = ∫ p(y∗|θ,x∗,X,y,θ0) p(θ|X,y,θ0) dθ = ∫ p(y∗|θ) p(θ|X,y,θ0) dθ (8) as y and y∗ are considered to be conditionally independent given θ [Gelman et al., 2013]. Further, the predictive distribution can be computed exactly, writing the joint distribution of the known data y, X and the new sample x∗, with the variance being determined by sample variance σ2 and the Gaussian kernel k(x,x′): k(x,x′) = [ K + σ2I k∗ kT∗ k∗∗ ] (9) Here, K is an n×n covariance matrix of training data, k∗∗ denotes the variance at the test sample points and k∗ is the covariance between y∗ the known data. 2.6.2 Comparison models We compare the hierarchical Bayesian attempt to normative modeling to commonly used harmonization techniques in which site is controlled for by subtracting an estimate of the site effect from the data prior to fitting the normative model. These methods included: i) removing additive effects of site, by regressing out site effects via linear regression and using the residuals as input for the simple Bayesian linear model to obtain the normative scores, ii) harmonizing for additive and multiplicative effects of site using ComBat [Johnson et al., 2007, Fortin et al., 2017], iii) modified ComBat, thus, using ComBat as before, but preserving biological variance of interest i.e., sex and age. All these methods involve removing site effects prior to estimating the normative scores in contrast to our method in which we explicitly model site within the normative modeling framework. These harmonized data, obtained as output from the harmonization techniques, are subsequently used for normative modeling in a simple Bayesian linear model that does neither take into account site effects nor non-linear dependencies between age and measures of cortical thickness. Thus, equation (3) is reduced to f = Xβ with β ∼N(0, ∑ j). In addition we use this simple Bayesian linear model to make one set of predictions for each regions from data that was not in any way harmonized for site (raw data model). R [R Core Team, 2020] was used for preprocessing of all data and to create the data set where site was regressed out, and for preprocessing the data with ComBat [Johnson et al., 2007, Fortin et al., 2017]. 2.6.3 Implementation: Normative modeling in Stan Both the hierarchical Bayesian and the comparison model version of the normative models were implemented in Stan [Carpenter et al., 2017, Stan Development Team, 2020b], a probabilistic C++ based programming language to perform Bayesian Inference, and analyzed in R [R Core Team, 2020] using the package rstan [Stan Development Team, 2020a]. Stan allows to directly compute the log posterior density of a model given the known variables x and y. It uses the No-U-Turn Sampler (NUTS) [Hoffman and Gelman, 2014], a variation of Hamiltonian Monte Carlo Sampling [Duane et al., 1987, Neal et al., 2011, Neal, 1994] to generate representative samples from the posterior distribution of parameters and hyper parameters θ, each of which has the marginal distribution p(θ | y,x). This is achieved by first approximating the distribution of the data to a defined threshold in a warm up period and then randomly sampling 8 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted February 11, 2021. ; https://doi.org/10.1101/2021.02.09.430363doi: bioRxiv preprint https://doi.org/10.1101/2021.02.09.430363 A PREPRINT - FEBRUARY 9, 2021 from the model, generating new draws of parameters for each iteration and calculating the response of the model. This approach of sampling instead of fitting allows for the simulation of complex models for which the derivation of an analytical solution of the posterior is computationally costly or not possible. The Bayesian framework provides access to the full posterior distribution and to the distribution of all parameters. This allows to deduce the a variance estimate of each parameter, leading to a parameter estimate that is not only described by its mean, but also by the (un)-certainty around the mean estimation, providing information on its accuracy and reliability. Moreover, we can use the posterior distribution of each site-specific parameter from the training set as prior for the test set, allowing to make predictions for unfamiliar sites. The Stan code for the HBLM, the HBGPM and the simple Bayesian linear model without site as predictor can be found at https://github.com/likeajumprope/Bayesian_normative_models. 2.7 Model simulation process in Stan Parameters indicating the goodness of the model simulation process in Stan [Carpenter et al., 2017, Stan Development Team, 2020b] itself, like convergence, effective sample size and trace plots can be found in the supplementary material. 3 Results Both the HBLM and the HBGPM outperformed all other comparison models with respect to all performance measures considered in this study. In detail, the HBLM and the HBGPM showed higher average values of the Pearson’s correlation coefficient ρ (Table 2), lower average SRMSEs (Table 3), smaller average LL (Table 4) and higher average proportions of EV (Table 5) than all comparison models (p < 0.001 for all comparisons). For none of these comparisons did the non-linear HBGPM outperform the linear HBLM. In addition to the mean comparisons reported in Table 2 - 5, the distribution of all performance measures across all 34 regions and for average cortical thickness across the entire cortex per model can be found in Fig. 4. A detailed comparison of all models with respect to to ρ, SRMSE, EV and LL can be found in the supplementary material. 3.1 Mean standardized log loss To also account for the second order statistics of the posterior distributions created by each model, we calculated the mean standardized log loss (MSLL). This measure can only be calculated for the test set, as it is the log loss standardized by the mean loss of the training data set [Rasmussen and Williams, 2006]. Hence, the MSLL gives an indication of whether a model is able to predict the data better than the mean of the training set (with more negative values being better). An overview of the MSLL for all cortical thickness measures of all regions for all models is given in Fig. 5a. The only models that perform better for most regions than the mean of the training data set are the Hierarchical Bayesian models ( MSLLHBGP M < 0 for all regions; MSLLHBLM < 0 for all but one region), in contrast to prediction from the residuals and the ComBat model, where none of the predictions perform better than the mean of the training data set (MSLLresiduals > 0 for all regions; MSLLComBat > 0 for all regions, see Fig. 5a. The MSLL for the modified ComBat model and raw data model were region-dependent, with 45 % regions (16 out of 35) for the modified ComBat model and 17% of regions (six out of 35) for the raw data model performing better than predictions from the mean of the training set. It should also be mentioned that for some individual regions the comparison models performed very poorly (max MSLLComBat = 356, max MSLLmod.ComBat = 138, max MSLLraw = 1252; max MSLLresiduals = 517) and show measures that exceeded the plotted range of Fig. 5a. In contrast, the maximum MSLL for the hierarchical Bayesian models was max -0.056 for the HBGPM and max 0.08 for the HBLM. 3.2 Predictive Variance We also observed that the models differ in the variance of predicted values, as visualized in Fig. 5b for average cortical thickness. For the ComBat, the raw data and the residuals model the range of predicted values was severely restricted (range predicted values raw data, test set: [2.60 - 3.03], range predicted values residuals, test set [2.64 - 3.00]; range predicted values ComBat, test set: [2.73 - 2.97]. These intervals cover 9.2 %, 7.9 % and 8.0 % of the original test set variance, respectively. The modified ComBat model retained 29.0% of the original test set variance (range predicted value modified ComBat [2.55 = 3.01]. In other words, all harmonization techniques had a reduced predictive variance and were instead biased toward predicting the mean, sometimes severely. In contrast, this bias was substantially reduced 9 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted February 11, 2021. ; https://doi.org/10.1101/2021.02.09.430363doi: bioRxiv preprint https://github.com/likeajumprope/Bayesian_normative_models https://doi.org/10.1101/2021.02.09.430363 A PREPRINT - FEBRUARY 9, 2021 S it e M an uf ac tu re r P la tf or m V ox el S iz e T R T E n m al es ag e ra ng e (A bb re vi at io n) [m m ] [m s] [m s] [% ] [y ea rs ] C al if or ni a I. of Te ch no lo gy (C al te ch ) S IE M E N S T IM T R IO 1. 0× 1. 0× 1. 0 1 5 9 0 2 .7 3 1 5 0 .7 3 17 -3 9 C ar ne gi e M el lo n U .( C M U ) S IE M E N S T IM T R IO 1. 0× 1. 0× 1. 0 1 8 7 0 2 .4 8 1 3 0 .7 7 20 -4 0 K en ne dy K ri eg er I. (K K I) P hi li ps A ch ie va 1. 0× 1. 0× 1. 0 8 3 .7 3 3 0 .7 3 8- 12 L ud w ig M ax im il ia ns U .M un ic h (L M U ) S IE M E N S V E R IO 1. 0× 1. 0× 1. 0 1 8 0 0 3 .0 6 1 5 1 18 -2 9 N Y U L an go ne M ed ic al C en te r (N Y U ) S IE M E N S A L L E R G R A 1. 3× 1. 0× 1. 3 2 5 3 0 3 .2 5 2 0 0 .7 5 12 -1 6 O li n I. of L iv in g at H ar tf or d H os pi ta l (O L IN ) S IE M E N S A L L E G R A 1. 0× 1. 0× 1. 0 2 5 0 0 2 .7 4 3 0 0 .9 7- 35 O re go n H ea lt h an d S ci en ce U . (O H S U ) S IE M E N S T IM T R IO 1. 0× 1. 0× 1. 0 2 3 0 0 3 .5 8 1 0 5 0 .7 5 6- 31 S an D ie go S ta te U . (S D S U ) G E M R 75 0 1. 0× 1. 0× 1. 0 N A N A 1 5 1 8 12 S oc ia lB ra in L ab (S B L ) P hi li ps IN T E R A 1. 0× 1. 0× 1. 0 9 3 .5 1 6 0 .8 7 5 10 -2 3 S ta nf or d U . (S TA N F O R D ) G E S IG N A 0. 85 9 × 1. 50 0 × 0. 85 9 8 .4 1 .8 2 7 0 .8 5 9 33 T ri ni ty C en tr e fo r H ea lt h S ci en ce s (T R IN IT Y ) P hi li ps A ch ie va 1. 0 × 1. 0 × 1. 0 3 .9 8 .5 1 3 1 20 -3 9 U .o f C al if or ni a L os A ng el es 1 (U C L A 1 ) S IE M E N S T IM T R IO 1. 0 × 1. 0 × 1. 2 2 3 0 0 2 .8 4 2 2 0 .7 3 8- 16 U .o f C al if or ni a L os A ng el es 2 (U C L A 2 ) S IE M E N S T IM T R IO 1. 0 × 1. 0 × 1. 2 2 3 0 0 2 .8 4 2 0 0 .8 7- 12 U .o f L eu ve n 1 (L E U V E N 1 ) P hi li ps A ch ie va 0. 98 × 0. 98 × 1. 20 9 .6 4 .6 2 5 1 12 -2 5 U .o f L eu ve n 2 (L E U V E N 2 ) P hi li ps A ch ie va 0. 98 × 0. 98 × 1. 20 9 .6 4 .6 3 2 0 .8 8 9- 17 U .o f M ic hi ga n 1 (U M 1 ) G E S IG N A N A 2 5 0 5 .7 1 3 0 .8 5 9- 13 U .o f M ic hi ga n 2 (U M 2 ) G E S IG N A N A 2 5 0 5 .7 5 4 0 .6 9 8- 19 U .o f P it ts bu rg hS ch oo lo f M ed ic in e (P IT T ) S IE M E N S A L L E R G R A 1. 1× 1. 1× 1. 1 2 1 0 0 3 .9 3 2 1 0 .9 5 13 28 U .o f U ta h S ch oo lo f M ed ic in e (U S M ) S IE M E N S T IM T R IO 1. 0× 1. 0× 1. 2 2 3 0 0 2 .9 1 4 3 1 8- 39 Y al e C hi ld S tu dy C en te r (Y A L E ) S IE M E N S T IM T R IO 1. 0× 1. 0× 1. 2 1 2 3 0 1 .7 3 2 8 0 .7 1 7- 17 Ta bl e 1: T he sc an ne r pa ra m et er s an d sa m pl e sp ec ifi ca ti on s of th e A B ID E da ta se t. 10 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted February 11, 2021. ; https://doi.org/10.1101/2021.02.09.430363doi: bioRxiv preprint https://doi.org/10.1101/2021.02.09.430363 A PREPRINT - FEBRUARY 9, 2021 in the hierarchical Bayesian models, which retained 57.0 % (HBLM) and 65.0 % (HBGPM) of the original test variance (range predicted values HBLM, test set: [2.43 - 3.23]; range predicted values HBGPM, test set: [2.38 - 3.28]). Mean Correlation (STD) Post-hoc comparison ρ training set test set HBLM HBGPM mod. ComBat ComBat residuals raw data HBLM 0.734 (0.06) 0.694 (0.06) ns. *** *** *** *** HBGPM 0.752 (0.05) 0.705 (0.06) ns. *** *** *** *** mod. ComBat 0.541 (0.15) 0.568 (0.16) *** *** *** *** *** ComBat 0.289 (0.09) 0.343 (0.11) *** *** *** ns. *** residuals 0.267 (0.08) 0.329 (0.12) *** *** *** ns *** raw data 0.435 (0.14) 0.435 (0.16) *** *** *** * ** Table 2: Post-hoc tests of correlations between true and predicted values. Cell values indicate post-hoc comparison significance values (adjusted by tukey method for a comparing a family of 6 estimates). Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 ns. blue: test set. yellow: training set. Mean SRMSE (STD) Post-hoc comparison SRMSE training set test set HBLM HBGPM mod. ComBat ComBat residuals raw data HBLM 0.0608 (0.006) 0.066 (0.005) n.s *** *** *** *** HBGPM 0.0587 (0.006) 0.064 (0.006) ns. *** *** *** *** mod. ComBat 0.0763 (0.007) 0.075 (0.008) *** *** *** n.s ns. ComBat 0.0872 (0.003) 0.085 (0.005) *** *** *** *** *** residuals 0.0865 (0.003) 0.085 (0.004) *** *** ns. *** n.s raw data 0.0808 (0.006) 0.085 (0.008) *** *** *** *** *** Table 3: Post-hoc tests of SRMSE between true and predicted values. Cell values indicate post-hoc comparison significance values (adjusted by tukey method for a comparing a family of 6 estimates). Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 ns. blue: test set. yellow: training set. LL training set test set HBLM −1.050 −1.121 HBGPM −1.020 −1.109 ComBat mod. −1.225 −1.193 ComBat −1.374 −1.336 residuals −1.381 −1.394 raw −1.299 −1.335 Table 4: Averaged log loss for training and test set. EV training set test set HBLM 0.5674 0.5 HBGPM 0.5397 0.485 ComBat mod. 0.3146 0.338 ComBat 0.0918 0.122 residuals 0.0778 0.114 raw 0.2091 0.208 Table 5: Averaged explained variance for training and test set. 4 Discussion In this work, we aim to provide a method that allows the application of normative modeling to neuroimaging data sets that are affected by site effects resulting from pooling data between sites. In contrast to other methods of harmonizing for additive and multiplicative site effects in the data prior to the normative modeling (e.g., regressing out site effects, harmonization with ComBat), our approach is based on modeling site as predictor within the normative modeling framework. The benefit of this approach is that it does not entail removing variance and thus cannot lead to an overestimation of site variance and accidental removal of meaningful variation in case the latter is confounded with site variation. Using a hierarchical Bayesian approach, we propose two versions of normative models that were able to control for site effects. In both versions, site is modeled via a random intercept offset, but one version only models linear effects of age on cortical thickness (Hierarchical Bayesian Linear Model, HBLM), whereas the other version also includes a Gaussian process term in order to allow potential non-linear relationships between age and cortical thickness measures (Hierarchical Bayesian Gaussian Process Model; HBGPM). The normative models are trained on a training set consisting of healthy individuals from the ABIDE data set (70% of the data from 20 different sites, within-site split, preserving the distribution of age and sex across training and 11 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted February 11, 2021. ; https://doi.org/10.1101/2021.02.09.430363doi: bioRxiv preprint https://doi.org/10.1101/2021.02.09.430363 A PREPRINT - FEBRUARY 9, 2021 (a) Distribution of Pearson’s correlation coefficient ρ for 35 cortical regions, indicating the correlation between true and predicted values, training and test set. (b) SRMSE for 35 cortical regions, indicating the deviation true and predicted values of six different models for the training and the test set. (c) Explained variance for 35 cortical regions, training and test set. (d) Log likelihood distribution for 35 cortical regions, train- ing and test set. Figure 4: Performance measures test set) and we present results from generalization to a test set (the remaining 30% of the data from the same sites). We compare the performance of our hierarchical Bayesian normative models explicitly modeling site effects applied to cortical thickness measures derived from FreeSurfer [Fischl et al., 2004]) to other commonly used methods to deal with site effects. These alternative methods included: i) regressing out site via linear regression and using the residuals, removing additive site variation, ii) applying ComBat [Fortin et al., 2017, Fortin et al., 2018] to harmonize additive and multiplicative site effects in the data, and iii) modified ComBat, hence applying ComBat while preserving age and sex effects in the data. Cortical thickness measures cleared from site effects using these alternative methods are used as dependent variables in a normative model with age and sex as predictors but excluding site. For comparison reasons, we also include a fourth model where we made predictions from raw data uncorrected for any site effects. We report three main findings: (1) Our normative hierarchical Bayesian models (both the linear HBLM and non-linear HBGPM version), explicitly modeling site effects within the normative modeling framework, outperform all alternative harmonization models with respect to model fit, including correlations between true and predicted values (ρ), standardized root mean square error (SRMSE), explained variance scores (EV), log-likelihood (LL) and the mean standardized log loss (MSLL); (2) the non-linear model did not significantly improve prediction of cortical thickness 12 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted February 11, 2021. ; https://doi.org/10.1101/2021.02.09.430363doi: bioRxiv preprint https://doi.org/10.1101/2021.02.09.430363 A PREPRINT - FEBRUARY 9, 2021 (a) MSLL distribution for 35 cortical regions, test set. (b) Predicted variance vs. actual variance for average cortical thickness for each model derived from predictions of 573 individuals. Figure 5: Mean standardized log loss and predicted variance for 35 cortical regions based on age, sex and site compared to the linear model; (3) all methods, but in particular the harmonization methods lead to an undesirable shrinking of the variance in the predictions. We showed that when using neuroimaging structural data sets pooled across different sites and scanners for estimating normative models, better predictive performance can be achieved by including site as a predictor than using a two-step approach of first harmonizing the data with respect to site and subsequently creating a normative model using these “cleared” data. This conclusion is based on results showing that the hierarchical Bayesian models outperformed the harmonizing comparison models on all of the performance metrics we examined. This includes the predictions derived from data that was cleared from site effects by a version of ComBat [Fortin et al., 2017, Fortin et al., 2018] in which variation associated with age and sex was preserved, which was the best performing method across all harmonizing models. We observed a higher correlation between true and predicted values and LL values closer to zero for our hierarchical Bayesian models explicitly modeling site effects with a random intercept offset, indicating better model fit. As a key factor of normative models is that they are not only able to estimate the predictive mean, but also give an estimate of the predictive variance and variation around the mean [Marquand et al., 2019, Marquand et al., 2016], we also included explained variance scores and the MSLL as performance metrics. Our HBLM and HBGPM models showed higher explained variance than the alternative models. In addition, the HBLM and HBGPM showed a negative MSLL in the test set; a metric which contrasts the log loss between the true and predicted values by the loss that would be achieved using the mean and the variance of the training set [Rasmussen and Williams, 2006], thus capturing differences in variance in the data sets. This benefit in performance for the hierarchical Bayesian models is in line with previous literature using a similar paradigm [Kia et al., 2020]. [Kia et al., 2020] showed that a hierarchical Bayesian regression approach using site as a batch effect lead to a better performance than complete pooling, no pooling and ComBat. In detail, our findings match [Kia et al., 2020]’s findings with respect to the comparison between a normative model created from hierarchical Bayesian regression (HBR) and a modified ComBat version in a data set with the same sites in training and test set. Their findings are in line with ours with respect to ρ ([Kia et al., 2020]: HBR range: 0.4 - 0.9, modified ComBat range: 0.2 - 0.8), SMSE: ([Kia et al., 2020]: HBR range: 0.2 - 0.9, modified ComBat range: 0.4 - 1.0) and MSLL ([Kia et al., 2020]: HBR range: -0.7 - -1.0, modified ComBat range: -0.04 - 0.0), except that the MSLL for the modified Combat model was worse in our study (see Figs. 4a, 4b, 5a). Therefore, our findings replicate the findings of [Kia et al., 2020] using an independent data set and separate implementation and extend that method to model non-linear functions using a Gaussian process term. We anticipated that the non-linear version of the normative model, which included a Gaussian Process for age, would perform better than the linear version, as studies have shown that the association between age and regions of cortical thickness can be non-linear, especially for older age ranges [Storsve et al., 2014]. However, our results showed similar performance in predicting cortical thickness based on age, sex and site for both linear and non-linear models. This might be due to the fact that the the age range in our sample was restricted, ranging from 6-40 years, thus likely capturing an age range where the association between age and cortical thickness is still mostly linear [Wierenga et al., 2014]. As a consequence, the non-linear version of the model was not able to improve the overall 13 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted February 11, 2021. ; https://doi.org/10.1101/2021.02.09.430363doi: bioRxiv preprint https://doi.org/10.1101/2021.02.09.430363 A PREPRINT - FEBRUARY 9, 2021 performance. Nonetheless, since other structural brain measures, including sub-cortical volumes and cortical surface area [Wierenga et al., 2014, Raznahan et al., 2011], have shown stronger non-linear associations with age, non-linear normative models may outperform a linear model for other types of structural brain imaging measures. Despite an overall good performance of our models, it should also be mentioned that the performance showed substantial variation between regions, as reflected in the variation in ρ values, SRMSE, EV, LL and MSLL within models. We assume that this due to the fact that, although average cortical thickness shows a strong association with age, different cortical brain regions differ in their association with age and the magnitude of this correlation also changes across the life span ([Storsve et al., 2014]). All models, but in particular the comparison models, have a significant shrinkage effect on the variance of the predicted values, indicating that harmonization techniques remove variance that is useful in predicting the response variable. This is most extreme for regressing out site effects and leads to poor performance across all performance metrics. We also observe that the performance of the residuals model is similar to the ComBat model without the preservation of age and sex, which is particularly reflected in the similarities of predicted variance in Fig. 5b and in the SRMSE. Both models suffer a loss of more than 90% of their original test variance. In contrast, the performance improves when variables like age and sex are preserved, as demonstrated by an increase in performance measures when using the version of ComBat in which variation associated with age and sex was preserved. We argue that the similarity in performance between ComBat and the residuals model is an indicator of the same underlying process, showing a weakness of the harmonization approach: merely regressing out site effects leads to the removal of meaningful variation correlated with the predictors of interest (in this case age and sex), especially when these predictors of interest are correlated with the site effects, which subsequently led to worse predictions of cortical thickness based on age and sex. This can be partially prevented by preserving important sources of variation when regressing out site effects, as shown for the modified ComBat model, where specified sources of variance were preserved when regressing out site effects. However, our results show two additional flaws of the harmonization approach: 1) as already pointed out by [Kia et al., 2020], in order to specify sources of variance that should be retained, all those sources of variance have to be known, which is not always the case; 2) even with age and sex preserved the modified ComBat model only retains 40% of the original variance. Our hierarchical Bayesian models including the prediction-based approach, in contrast, preserves known and unknown interactions between site and biological covariates by specifically modeling site, thus overcoming this requirement. The result is reflected in larger proportions of variance retained (see Fig. 5b. The advantage of the hierarchical Bayesian approach becomes particularly clear when considering that the scores derived from normative models are relative scores describing the deviation from a predicted normative mean. Thus, the normative deviation score is not affected by the absolute value of the predicted mean, and the number of predictors in the model does not influence the normative score. Previous attempts to estimate the centiles of normative models have included polynomial regression [Kessler et al., 2016], support vector regression [Erus et al., 2015], quantile regression [Huizinga et al., 2018, Lv et al., 2020] and Gaussian process regression [Wolfers et al., 2018b], providing different degrees of the ability to separate between sources of variances and making individual predictions (for an overview see [Marquand et al., 2019]). We chose a hierarchical Bayesian framework for the implementation of our normative model as it has several advantages. The distribution-based structure based on posteriors allows for the separation and integration of different sources of variances, including epistemic (uncertainty in the model parameters), aleatoric (inherent variability in data) and prior variation, which are all considered when predicting cortical thickness based on age, sex and site. This allows for both the integration of already known information in the form of priors into the predictions, and for an adjustment of the precision of the estimate based on the uncertainty at each data point. In addition, the Bayesian framework, as implemented in software packages like Stan [Carpenter et al., 2017, Stan Development Team, 2020b], allows to draw samples from the full posterior distribution at the level of individual participants, which leads to an exact estimate of all parameters instead of an approximation. In particular in comparison to quantile regression, the distributional assumption entailed in the hierarchical Bayesian approach also allows to get more precise estimates of the underlying centiles, particularly in the outer centiles, which are usually of primary interest and where the data are sparsest. The proposed Bayesian framework also offers an elegant way to integrate site effects into normative models. site effects can be modeled via a hierarchical random effect structure, in which different sites are modeled semi-independently, sharing variation via a combined prior of higher order. This approach, also known as partial pooling, allows for including site- specific variance into the prediction for site, while at the same time constraining the amount of between-site variation to a maximum. Whilst the primary aim of this study was to develop a novel method for dealing with site effects specifically within a normative modeling framework, the method can be used as general approach to clear neuroimaging data from site, age and sex effects. This is due to the fact that a normative score describes an individual’s cortical thickness in relation to the variance explained by the predictor variables in the normative model (age, sex and site). Hence, they can be seen as “cleaned” cortical thickness measures that can be the basis for further analysis, for example to establish the 14 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted February 11, 2021. ; https://doi.org/10.1101/2021.02.09.430363doi: bioRxiv preprint https://doi.org/10.1101/2021.02.09.430363 A PREPRINT - FEBRUARY 9, 2021 association between cortical thickness measures and clinical or demographic information. Another potential clinical use of a normative model based on healthy controls could be, that, once established, it can be used to derive individualized deviation scores from individuals with a psychiatric or neurological disorder. Their individual deviation scores can be considered as the degree of deviation from the normative variation and be used for further analysis, for example to predict clinically useful information. Our proposed method has two potential disadvantages. The first one is related to the computational cost associated with estimating the covariance matrix within the Gaussian Process for the non-linear models, which in our analysis amounted to 25 hours per model per region and could only be mastered via parallel processing on a cluster. This is due to the fact that using the non-linear Gaussian Process term becomes very time and memory expensive with growing n (O(n2)). Thus, in cases in which the relationship between the predictor and the outcome is estimated to be close to linear, the need for the more complex non-linear model should be carefully considered. Secondly, the between-site split and the model at its current state only allow generalizations to a test set which includes individuals from the same sites as the training set, thus where the site variation is known. However, especially in clinical settings, generalizing the model and making predictions in data from new sites is an important additional goal. Despite the fact that we cannot use the posterior distribution of one particular site as a prior when applying the model to a new, unknown site, the hierarchical Bayesian framework still allows using the posterior parameter distributions of all sites as derived from the training data set as priors for site parameters when applying the model to a new site. This approach has already been successfully demonstrated in [Kia et al., 2020] where the posterior parameter distribution of site derived from the training data was fed as a informative prior for the site predictor in a normative model applied to the test data consisting of new (unknown) sites. This use of a so called informed priors leads to more accurate and precise predictions than the broad, unspecific prior that would have to be used in cases where the distribution of the data is unknown [Kia et al., 2020]. Thus, despite some loss in precision, the Bayesian framework can, in contrast to all other methods examined in this paper, be adapted to make predictions to new, unknown sites. 5 Conclusion We proposed an extended version of a normative modeling approach that is able to accommodate for site effects in neuroimaging data. The method is superior to previous approaches, including regressing out site and versions of ComBat [Fortin et al., 2017, Johnson et al., 2007] and facilitates the estimation of normative models based on neuroimaging data pooled across many different scan sites. A further extension of the model to make generalizations to new sites and the application to clinical data will be the objective of future work. 6 Online material The supplementary material and the Stan code for the HBLM, HBGPM and simple Bayesian linear model can be found at https://github.com/likeajumprope/Bayesian_normative_models. 15 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted February 11, 2021. ; https://doi.org/10.1101/2021.02.09.430363doi: bioRxiv preprint https://github.com/likeajumprope/Bayesian_normative_models https://doi.org/10.1101/2021.02.09.430363 A PREPRINT - FEBRUARY 9, 2021 References [Bartlett, 1937] Bartlett, M. S. (1937). Properties of sufficiency and statistical tests. Proceedings of the Royal Society of London. Series A-Mathematical and Physical Sciences, 160(901):268–282. [Bethlehem et al., 2018] Bethlehem, R., Seidlitz, J., Romero-Garcia, R., and Lombardo, M. (2018). Using normative age modelling to isolate subsets of individuals with autism expressing highly age-atypical cortical thickness features. bioRxiv, page 252593. [Carpenter et al., 2017] Carpenter, B., Gelman, A., Hoffman, M. D., Lee, D., Goodrich, B., Betancourt, M., Brubaker, M., Guo, J., Li, P., and Riddell, A. (2017). Stan: A probabilistic programming language. Journal of statistical software, 76(1). [Chen et al., 2014] Chen, J., Liu, J., Calhoun, V. D., Vasquez, A. A., Zwiers, M. P., Gupta, C. N., Frannke, B., and Turner, J. A. (2014). Exploration of scanning effects in multi-site structural MRI studies. Journal of Neuroscience Methods, 23(15):37–50. [Craddock et al., 2013] Craddock, C., Benhajali, Y., Chu, C., Chouinard, F., Evans, A., Jakab, A., Khundrakpam, B. S., Lewis, J. D., Li, Q., Milham, M., et al. (2013). The neuro bureau preprocessing initiative: open sharing of preprocessed neuroimaging data and derivatives. Neuroinformatics, 4. [Desikan et al., 2006] Desikan, R. S., Ségonne, F., Fischl, B., Quinn, B. T., Dickerson, B. C., Blacker, D., Buckner, R. L., Dale, A. M., Maguire, R. P., Hyman, B. T., Albert, M. S., and Killiany, R. J. (2006). An automated labeling system for subdividing the human cerebral cortex on MRI scans into gyral based regions of interest. NeuroImage, 31(3):968–980. [Di Martino et al., 2014] Di Martino, A., Yan, C. G., Li, Q., Denio, E., Castellanos, F. X., Alaerts, K., Anderson, J. S., Assaf, M., Bookheimer, S. Y., Dapretto, M., Deen, B., Delmonte, S., Dinstein, I., Ertl-Wagner, B., Fair, D. A., Gallagher, L., Kennedy, D. P., Keown, C. L., Keysers, C., Lainhart, J. E., Lord, C., Luna, B., Menon, V., Minshew, N. J., Monk, C. S., Mueller, S., Müller, R. A., Nebel, M. B., Nigg, J. T., O’Hearn, K., Pelphrey, K. A., Peltier, S. J., Rudie, J. D., Sunaert, S., Thioux, M., Tyszka, J. M., Uddin, L. Q., Verhoeven, J. S., Wenderoth, N., Wiggins, J. L., Mostofsky, S. H., and Milham, M. P. (2014). The autism brain imaging data exchange: Towards a large-scale evaluation of the intrinsic brain architecture in autism. Molecular Psychiatry, 19(6):659–667. [Duane et al., 1987] Duane, S., Kennedy, A. D., Pendleton, B. J., and Roweth, D. (1987). Hybrid monte carlo. Physics letters B, 195(2):216–222. [Erus et al., 2015] Erus, G., Battapady, H., Satterthwaite, T. D., Hakonarson, H., Gur, R. E., Davatzikos, C., and Gur, R. C. (2015). Imaging Patterns of Brain Development and their Relationship to Cognition. Cerebral Cortex, 25(6):1676–1684. [Feczko et al., 2019] Feczko, E., Miranda-Dominguez, O., Marr, M., Graham, A. M., Nigg, J. T., and Fair, D. A. (2019). The Heterogeneity Problem: Approaches to Identify Psychiatric Subtypes. Trends in Cognitive Sciences, 23(7):584–601. [Fischl et al., 2004] Fischl, B., Van Der Kouwe, A., Destrieux, C., Halgren, E., Ségonne, F., Salat, D. H., Busa, E., Seidman, L. J., Goldstein, J., Kennedy, D., Caviness, V., Makris, N., Rosen, B., and Dale, A. M. (2004). Automatically Parcellating the Human Cerebral Cortex. Cerebral Cortex, 14(1):11–22. [Fortin et al., 2018] Fortin, J. P., Cullen, N., Sheline, Y. I., Taylor, W. D., Aselcioglu, I., Cook, P. A., Adams, P., Cooper, C., Fava, M., McGrath, P. J., McInnis, M., Phillips, M. L., Trivedi, M. H., Weissman, M. M., and Shinohara, R. T. (2018). Harmonization of cortical thickness measurements across scanners and sites. NeuroImage, 167(June 2017):104–120. [Fortin et al., 2017] Fortin, J. P., Parker, D., Tunç, B., Watanabe, T., Elliott, M. A., Ruparel, K., Roalf, D. R., Sat- terthwaite, T. D., Gur, R. C., Gur, R. E., Schultz, R. T., Verma, R., and Shinohara, R. T. (2017). Harmonization of multi-site diffusion tensor imaging data. NeuroImage, 161:149–170. [Foulkes and Blakemore, 2018] Foulkes, L. and Blakemore, S.-J. (2018). Studying individual differences in human adolescent brain development. Nature neuroscience, 21(3):315–323. [Fried, 2017] Fried, E. (2017). Moving forward: how depression heterogeneity hinders progress in treatment and research. Expert Review of Neurotherapeutics, 17(5):423–425. [Gelman, 2008] Gelman, A. (2008). Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press. [Gelman et al., 2013] Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., and Rubin, D. B. (2013). Bayesian data analysis. CRC press. 16 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted February 11, 2021. ; https://doi.org/10.1101/2021.02.09.430363doi: bioRxiv preprint https://doi.org/10.1101/2021.02.09.430363 A PREPRINT - FEBRUARY 9, 2021 [Gronenschild et al., 2012] Gronenschild, E. H. B. M., Habets, P., Jacobs, H. I. L., Mengelers, R., Rozendaal, N., van Os, J., and Marcelis, M. (2012). The Effects of FreeSurfer Version, Workstation Type, and Macintosh Operating System Version on Anatomical Volume and Cortical Thickness Measurements. PLoS ONE, 7(6):e38234. [Han et al., 2006] Han, X., Jovicich, J., Salat, D., van der Kouwe, A., Quinn, B., Czanner, S., Busa, E., Pacheco, J., Albert, M., Killiany, R., Maguire, P., Rosas, D., Makris, N., Dale, A., Dickerson, B., and Fischl, B. (2006). Reliability of MRI-derived measurements of human cerebral cortical thickness: The effects of field strength, scanner upgrade and manufacturer. NeuroImage, 32(1):180–194. [Hoffman and Gelman, 2014] Hoffman, M. D. and Gelman, A. (2014). The no-u-turn sampler: adaptively setting path lengths in hamiltonian monte carlo. J. Mach. Learn. Res., 15(1):1593–1623. [Huizinga et al., 2018] Huizinga, W., Poot, D., Vernooij, M., Roshchupkin, G., Bron, E., Ikram, M., Rueckert, D., Niessen, W., and Klein, S. (2018). A spatio-temporal reference model of the aging brain. NeuroImage, 169:11–22. [Insel et al., 2010] Insel, T., Cuthbert, B., Garvey, M., Heinssen, R., Pine, D. S., Quinn, K., Sanislow, C., and Wang, P. (2010). Research domain criteria (rdoc): toward a new classification framework for research on mental disorders. [Insel, 2014] Insel, T. R. (2014). The NIMH Research Domain Criteria (RDoC) Project: Precision Medicine for Psychiatry. American Journal of Psychiatry, 171(4):395–397. [Johnson et al., 2007] Johnson, W. E., Li, C., and Rabinovic, A. (2007). Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics, 8(1):118–127. [Kessler et al., 2016] Kessler, D., Angstadt, M., and Sripada, C. (2016). Growth charting of brain connectivity networks and the identification of attention impairment in youth. JAMA psychiatry, 73(5):481–489. [Kia et al., 2020] Kia, S. M., Huijsdens, H., Dinga, R., Wolfers, T., Mennes, M., Andreassen, O. A., Westlye, L. T., Beckmann, C. F., and Marquand, A. F. (2020). Hierarchical bayesian regression for multi-site normative modeling of neuroimaging data. In Martel, A. L., Abolmaesumi, P., Stoyanov, D., Mateus, D., Zuluaga, M. A., Zhou, S. K., Racoceanu, D., and Joskowicz, L., editors, Medical Image Computing and Computer Assisted Intervention – MICCAI 2020, pages 699–709, Cham. Springer International Publishing. [Kia and Marquand, 2018] Kia, S. M. and Marquand, A. (2018). Normative Modeling of Neuroimaging Data Using Scalable Multi-task Gaussian Processes. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), volume 11072 LNCS, pages 127–135. [Leek et al., 2010] Leek, J. T., Scharpf, R. B., Bravo, H. C., Simcha, D., Langmead, B., Johnson, W. E., Geman, D., Baggerly, K., and Irizarry, R. A. (2010). Tackling the widespread and critical impact of batch effects in high-throughput data. Nature Reviews Genetics, 11(10):733–739. [Lv et al., 2020] Lv, J., Biase, M. D., Cash, R. F., Cocchi, L., Cropley, V., Klauser, P., Tian, Y., Bayer, J., Schmaal, L., Cetin-Karayumak, S., Rathi, Y., Pasternak, O., Bousman, C., Pantelis, C., Calamante, F., and Zalesky, A. (2020). Individual deviations from normative models of brain structure in a large cross-sectional schizophrenia cohort. bioRxiv, page 2020.01.17.911032. [Marquand et al., 2014] Marquand, A. F., Brammer, M., Williams, S. C., and Doyle, O. M. (2014). Bayesian multi-task learning for decoding multi-subject neuroimaging data. NeuroImage, 92:298–311. [Marquand et al., 2019] Marquand, A. F., Kia, S. M., Zabihi, M., Wolfers, T., Buitelaar, J. K., and Beckmann, C. F. (2019). Conceptualizing mental disorders as deviations from normative functioning. Molecular Psychiatry, 24(10):1415–1424. [Marquand et al., 2016] Marquand, A. F., Rezek, I., Buitelaar, J., and Beckmann, C. F. (2016). Understanding Heterogeneity in Clinical Cohorts Using Normative Models: Beyond Case-Control Studies. Biological Psychiatry, 80(7):552–561. [Mathys et al., 2012] Mathys, C. D., Prüssmann, K., Stephan, K. E., and Behrens, T. (2012). HIERARCHICAL GAUSSIAN FILTERING Construction and variational inversion of a generic Bayesian model of individual learning under uncertainty. [Miller et al., 2016] Miller, K. L., Alfaro-Almagro, F., Bangerter, N. K., Thomas, D. L., Yacoub, E., Xu, J., Bartsch, A. J., Jbabdi, S., Sotiropoulos, S. N., Andersson, J. L., et al. (2016). Multimodal population brain imaging in the uk biobank prospective epidemiological study. Nature neuroscience, 19(11):1523. [Mirnezami et al., 2012] Mirnezami, R., Nicholson, J., and Darzi, A. (2012). Preparing for Precision Medicine. New England Journal of Medicine, 366(6):489–491. [Mueller et al., 2005] Mueller, S. G., Weiner, M. W., Thal, L. J., Petersen, R. C., Jack, C. R., Jagust, W., Trojanowski, J. Q., Toga, A. W., and Beckett, L. (2005). Ways toward an early diagnosis in alzheimer’s disease: the alzheimer’s disease neuroimaging initiative (adni). Alzheimer’s & Dementia, 1(1):55–66. 17 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted February 11, 2021. ; https://doi.org/10.1101/2021.02.09.430363doi: bioRxiv preprint https://doi.org/10.1101/2021.02.09.430363 A PREPRINT - FEBRUARY 9, 2021 [Neal, 1994] Neal, R. M. (1994). An improved acceptance procedure for the hybrid monte carlo algorithm. Journal of Computational Physics, 111(1):194–203. [Neal et al., 2011] Neal, R. M. et al. (2011). Mcmc using hamiltonian dynamics. Handbook of markov chain monte carlo, 2(11):2. [R Core Team, 2020] R Core Team (2020). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. [Rasmussen and Williams, 2006] Rasmussen, C. E. and Williams, C. K. I. (2006). GAUSSIAN PROCESSES FOR MACHINE LEARNING. MIT Press, Cambridge. [Raznahan et al., 2011] Raznahan, A., Shaw, P., Lalonde, F., Stockman, M., Wallace, G. L., Greenstein, D., Clasen, L., Gogtay, N., and Giedd, J. N. (2011). How Does Your Cortex Grow? Journal of Neuroscience, 31(19):7174–7177. [Stan Development Team, 2020a] Stan Development Team (2020a). RStan: the R interface to Stan. R package version 2.21.2. [Stan Development Team, 2020b] Stan Development Team (2020b). Stan modeling language users guide and reference manual, version 2.25. [Storsve et al., 2014] Storsve, A. B., Fjell, A. M., Tamnes, C. K., Westlye, L. T., Overbye, K., Aasland, H. W., and Walhovd, K. B. (2014). Differential Longitudinal Changes in Cortical Thickness, Surface Area and Volume across the Adult Life Span: Regions of Accelerating and Decelerating Change. Journal of Neuroscience, 34(25):8488–8498. [Thompson et al., 2020] Thompson, P. M., Jahanshad, N., Ching, C. R. K., Salminen, L. E., Thomopoulos, S. I., Bright, J., Baune, B. T., Bertolín, S., Bralten, J., Bruin, W. B., Bülow, R., Chen, J., Chye, Y., Dannlowski, U., de Kovel, C. G. F., Donohoe, G., Eyler, L. T., Faraone, S. V., Favre, P., Filippi, C. A., Frodl, T., Garijo, D., Gil, Y., Grabe, H. J., Grasby, K. L., Hajek, T., Han, L. K. M., Hatton, S. N., Hilbert, K., Ho, T. C., Holleran, L., Homuth, G., Hosten, N., Houenou, J., Ivanov, I., Jia, T., Kelly, S., Klein, M., Kwon, J. S., Laansma, M. A., Leerssen, J., Lueken, U., Nunes, A., Neill, J. O., Opel, N., Piras, F., Piras, F., Postema, M. C., Pozzi, E., Shatokhina, N., Soriano-Mas, C., Spalletta, G., Sun, D., Teumer, A., Tilot, A. K., Tozzi, L., van der Merwe, C., Van Someren, E. J. W., van Wingen, G. A., Völzke, H., Walton, E., Wang, L., Winkler, A. M., Wittfeld, K., Wright, M. J., Yun, J.-Y., Zhang, G., Zhang-James, Y., Adhikari, B. M., Agartz, I., Aghajani, M., Aleman, A., Althoff, R. R., Altmann, A., Andreassen, O. A., Baron, D. A., Bartnik-Olson, B. L., Marie Bas-Hoogendam, J., Baskin-Sommers, A. R., Bearden, C. E., Berner, L. A., Boedhoe, P. S. W., Brouwer, R. M., Buitelaar, J. K., Caeyenberghs, K., Cecil, C. A. M., Cohen, R. A., Cole, J. H., Conrod, P. J., De Brito, S. A., de Zwarte, S. M. C., Dennis, E. L., Desrivieres, S., Dima, D., Ehrlich, S., Esopenko, C., Fairchild, G., Fisher, S. E., Fouche, J.-P., Francks, C., Frangou, S., Franke, B., Garavan, H. P., Glahn, D. C., Groenewold, N. A., Gurholt, T. P., Gutman, B. A., Hahn, T., Harding, I. H., Hernaus, D., Hibar, D. P., Hillary, F. G., Hoogman, M., Hulshoff Pol, H. E., Jalbrzikowski, M., Karkashadze, G. A., Klapwijk, E. T., Knickmeyer, R. C., Kochunov, P., Koerte, I. K., Kong, X.-Z., Liew, S.-L., Lin, A. P., Logue, M. W., Luders, E., Macciardi, F., Mackey, S., Mayer, A. R., McDonald, C. R., McMahon, A. B., Medland, S. E., Modinos, G., Morey, R. A., Mueller, S. C., Mukherjee, P., Namazova-Baranova, L., Nir, T. M., Olsen, A., Paschou, P., Pine, D. S., Pizzagalli, F., Rentería, M. E., Rohrer, J. D., Sämann, P. G., Schmaal, L., Schumann, G., Shiroishi, M. S., Sisodiya, S. M., Smit, D. J. A., Sønderby, I. E., Stein, D. J., Stein, J. L., Tahmasian, M., Tate, D. F., Turner, J. A., van den Heuvel, O. A., van der Wee, N. J. A., van der Werf, Y. D., van Erp, T. G. M., van Haren, N. E. M., van Rooij, D., van Velzen, L. S., Veer, I. M., Veltman, D. J., Villalon-Reina, J. E., Walter, H., Whelan, C. D., Wilde, E. A., Zarei, M., and Zelman, V. (2020). ENIGMA and global neuroscience: A decade of large-scale studies of the brain in health and disease across more than 40 countries. Translational Psychiatry, 10(1):100. [Volkow et al., 2018] Volkow, N. D., Koob, G. F., Croyle, R. T., Bianchi, D. W., Gordon, J. A., Koroshetz, W. J., Pérez-Stable, E. J., Riley, W. T., Bloch, M. H., Conway, K., et al. (2018). The conception of the abcd study: From substance use to a broad nih collaboration. Developmental cognitive neuroscience, 32:4–7. [Wierenga et al., 2014] Wierenga, L. M., Langen, M., Oranje, B., and Durston, S. (2014). Unique developmental trajectories of cortical thickness and surface area. NeuroImage, 87:120–126. [Wolfers et al., 2019] Wolfers, T., Beckmann, C. F., Hoogman, M., Buitelaar, J. K., Franke, B., and Marquand, A. F. (2019). Individual differences v. the average patient: mapping the heterogeneity in ADHD using normative models. Psychological Medicine, pages 1–10. [Wolfers et al., 2020] Wolfers, T., Beckmann, C. F., Hoogman, M., Buitelaar, J. K., Franke, B., and Marquand, A. F. (2020). Individual differences v. the average patient: mapping the heterogeneity in adhd using normative models. Psychological Medicine, 50(2):314–323. [Wolfers et al., 2018a] Wolfers, T., Doan, N. T., Kaufmann, T., Alnæs, D., Moberget, T., Agartz, I., Buitelaar, J. K., Ueland, T., Melle, I., Franke, B., Andreassen, O. A., Beckmann, C. F., Westlye, L. T., and Marquand, A. F. (2018a). 18 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted February 11, 2021. ; https://doi.org/10.1101/2021.02.09.430363doi: bioRxiv preprint https://doi.org/10.1101/2021.02.09.430363 A PREPRINT - FEBRUARY 9, 2021 Mapping the Heterogeneous Phenotype of Schizophrenia and Bipolar Disorder Using Normative Models. JAMA Psychiatry, 75(11):1146. [Wolfers et al., 2018b] Wolfers, T., Doan, N. T., Kaufmann, T., Alnæs, D., Moberget, T., Agartz, I., Buitelaar, J. K., Ueland, T., Melle, I., Franke, B., et al. (2018b). Mapping the heterogeneous phenotype of schizophrenia and bipolar disorder using normative models. JAMA psychiatry, 75(11):1146–1155. [Zabihi et al., 2019] Zabihi, M., Oldehinkel, M., Wolfers, T., Frouin, V., Goyard, D., Loth, E., Charman, T., Tillmann, J., Banaschewski, T., Dumas, G., et al. (2019). Dissecting the heterogeneous cortical anatomy of autism spectrum disorder using normative models. Biological Psychiatry: Cognitive Neuroscience and Neuroimaging, 4(6):567–578. 19 (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this preprintthis version posted February 11, 2021. ; https://doi.org/10.1101/2021.02.09.430363doi: bioRxiv preprint https://doi.org/10.1101/2021.02.09.430363 Introduction Methods Data ABIDE data set Site effects in the ABIDE data set Splitting the ABIDE data set into training and test sets site as a predictor in a Hierarchical Bayesian Model Comparison models Performance measures Measures of model performance Measures of goodness of the simulation in Stan Model specification Posterior predictive distribution Comparison models Implementation: Normative modeling in Stan Model simulation process in Stan Results Mean standardized log loss Predictive Variance Discussion Conclusion Online material