key: cord-0044814-k7ho1he7 authors: Guhaniyogi, Rajarshi title: High Dimensional Bayesian Regularization in Regressions Involving Symmetric Tensors date: 2020-05-16 journal: Information Processing and Management of Uncertainty in Knowledge-Based Systems DOI: 10.1007/978-3-030-50153-2_26 sha: 1e3e870abc08395beb3178868a9152ac356076f4 doc_id: 44814 cord_uid: k7ho1he7
Supported by Office of Naval Research and National Science Foundation.

This article develops a regression framework with a symmetric tensor response and vector predictors. The existing literature involving a symmetric tensor response and vector predictors proceeds by vectorizing the tensor response into a multivariate vector, thus ignoring the structural information in the tensor. A few recent approaches have proposed novel regression frameworks exploiting the structure of the symmetric tensor and assume the symmetric tensor coefficients corresponding to scalar predictors to be low-rank. Although low-rank constraints on coefficient tensors are computationally efficient, they may be restrictive in some real data applications. Motivated by this, we propose a novel class of regularization or shrinkage priors for the symmetric tensor coefficients. Our modeling framework a priori expresses a symmetric tensor coefficient as the sum of low-rank and sparse structures, with both structures suitably regularized using Bayesian regularization techniques. The proposed framework allows identification of tensor nodes significantly influenced by each scalar predictor. The framework is implemented using an efficient Markov chain Monte Carlo algorithm. Empirical results in simulation studies show competitive performance of the proposed approach over its competitors.

This article is motivated by a variety of applications in which a sample of symmetric tensors is available along with a few scalar variables of interest. Analogous to the rows and columns of a matrix, the various axes of a tensor are known as tensor modes, and the indices of a tensor mode are often referred to as "tensor nodes". A tensor is symmetric if interchanging modes results in the same tensor. Entries in a tensor are known as "tensor cells". In our motivating applications, each sample point is represented by its own symmetric tensor, and the tensor nodes are labeled and shared across all sample points through a registration process. The goals of such scientific applications are twofold. First, it is important to build a predictive model to assess the change of the symmetric tensor response as the predictors of interest vary. A more important scientific goal is to identify nodes of the symmetric tensor significantly impacted by each predictor. Although there is a gamut of applications involving symmetric tensors, our work is specifically motivated by scientific applications pertaining to brain connectomics. In such applications the dataset contains brain network information along with a few observable phenotypes (e.g., IQ, presence or absence of a neuronal disorder, age) for multiple subjects. Brain network information for each subject is encoded within a symmetric matrix of dimension V × V, with V the number of regions of interest (ROI) into which the human brain is parceled following a popular brain atlas. The (k, l)-th cell of the matrix contains the number of neuron connections between the k-th and l-th regions of interest (ROI).
Thus, each mode of this symmetric matrix (when viewed as a 2-D symmetric tensor) consists of V nodes (V = 68 when the Desikan-Killiany brain atlas is followed [1]), each corresponding to a specific ROI in the human brain. The most important scientific goal here boils down to making inference on brain regions of interest (ROIs) and their inter-connections significantly associated with each phenotypic predictor [9].

One approach is to vectorize the symmetric tensor and cast the modeling problem as a high dimensional multivariate reduced rank sparse regression with the vectorized tensor response and scalar predictors. There is an adequate literature on frequentist penalized optimization [15], as well as on Bayesian shrinkage [2], dealing with model fitting and computational issues in high dimensional multivariate reduced rank regression models. Although computationally efficient, these approaches treat the cells of the symmetric tensor coefficients as if they were fully exchangeable, ignoring the fact that coefficients involving common tensor nodes can be expected to be correlated a priori. Ignoring this correlation may be detrimental to model performance. Additionally, such a modeling framework does not directly lead to the identification of nodes significantly associated with each predictor.

We develop a symmetric tensor response regression model with a symmetric tensor response and scalar predictors. The symmetric tensor coefficient corresponding to each predictor in this regression is assigned a novel Bayesian shrinkage prior that combines ideas from low-rank parallel factor (PARAFAC) decomposition methods, spike-and-slab priors and Bayesian high dimensional regularization techniques to generate a model that respects the tensor structure of the response. These structures are introduced to achieve several inferential goals simultaneously. The low-rank structure is primarily assumed to capture the interactions between different pairs of tensor nodes, while the node-wise sparsity offers inference on the tensor nodes significantly associated with a predictor. The Bayesian regularization structure allows appropriate shrinkage of unimportant cell coefficients towards zero while minimally shrinking the important cell coefficients. Jointly, these structures achieve parsimony and deliver accurate characterization of uncertainty for estimating parameters and identifying significant tensor nodes. The proposed approach finds excellent synergy with the recent literature on bilinear relational data models [4, 7], multiway regression models [6] and other object oriented regression models [3], where low-rank and/or regularization structures are imposed on parameters.

The proposed framework is similar to, but distinct from, recent developments in high dimensional regressions with multidimensional arrays (tensors) and other object oriented data. For example, recent literature that builds regression models with a scalar response and tensor predictors [6] is less appealing in this context, since it does not incorporate the symmetry constraint in the tensor. In the same vein, [5] formulate a Bayesian tensor response regression approach built upon a novel multiway stick breaking shrinkage prior on the tensor coefficients. While [5] is able to identify important tensor cells, it does not allow detection of tensor nodes influenced by a predictor.
Moreover, these approaches have not been extended to accommodate scenarios other than a tensor response with continuous cell entries, and they do not directly incorporate the symmetry constraint in the tensor coefficient corresponding to a predictor. Also, unlike these approaches, our approach does not assume a low-rank representation of the tensor coefficients, hence allowing a more flexible structure to analyze the impact of the predictors on tensor cells and the interactions between tensor nodes. A work closely related to our framework develops shrinkage priors in a regression framework with a scalar response and an undirected network predictor, expressed in the form of a symmetric matrix [3]. However, they treat the tensor as a predictor, whereas we treat it as a response. This difference in the modeling approach leads to a different focus and interpretation. Symmetric tensor predictor regression focuses on understanding the change of a scalar response as the symmetric tensor predictor varies, while regression with a symmetric tensor response studies the change of the symmetric tensor as the predictors vary.

The rest of the article flows as follows. In Sect. 2 the model and prior distributions on parameters are introduced. Section 2 also briefly discusses posterior computation, whereas Sect. 3 presents simulation studies to validate our approach. Finally, Sect. 4 concludes the article with an eye to future work.

For $i = 1, \ldots, n$, let $y_i = ((y_{i,(k_1,\ldots,k_D)}))_{k_1,\ldots,k_D=1}^{V} \in \mathcal{Y} \subseteq \mathbb{R}^{V \times \cdots \times V}$ denote the D-way symmetric tensor response with dummy diagonal entries, and let $z_i = (z_{i1}, \ldots, z_{ip})'$ be the p predictors of interest corresponding to the i-th individual. The symmetry constraint in the tensor response implies $y_{i,(k_1,\ldots,k_D)} = y_{i,(P(k_1),\ldots,P(k_D))}$, with $P(\cdot)$ being any permutation of $\{k_1, \ldots, k_D\}$. Since the diagonal entries of the symmetric tensor response are dummies, it is enough to build a probabilistic model for the upper triangular portion of $y_i$, i.e., the cells $\boldsymbol{k} = (k_1,\ldots,k_D)$ with $k_1 < \cdots < k_D$. For the sake of this article, we assume $y_{i,\boldsymbol{k}} \in \mathbb{R}$ and propose

$$y_{i,\boldsymbol{k}} = \gamma_0 + \sum_{s=1}^{p} \Gamma_{s,\boldsymbol{k}}\, z_{is} + \epsilon_{i,\boldsymbol{k}}, \qquad \epsilon_{i,\boldsymbol{k}} \stackrel{iid}{\sim} N(0, \sigma^2), \qquad (1)$$

where $\Gamma_{1,\boldsymbol{k}}, \ldots, \Gamma_{p,\boldsymbol{k}}$ are the $\boldsymbol{k} = (k_1,\ldots,k_D)$-th cells of the $V \times \cdots \times V$ symmetric coefficient tensors $\Gamma_1, \ldots, \Gamma_p$ with dummy diagonal entries, respectively. Here $\Gamma_{s,\boldsymbol{k}}$, $s = 1, \ldots, p$, is the coefficient corresponding to the s-th predictor of interest on the $\boldsymbol{k}$-th cell of the symmetric tensor response. The coefficient $\gamma_0 \in \mathbb{R}$ is the intercept in the regression model (1).

To account for the association between the tensor response y and the predictors $z_1, \ldots, z_p$, we propose a shrinkage prior distribution on $\Gamma_{s,\boldsymbol{k}}$, $s = 1, \ldots, p$, represented as a location and scale mixture of normal distributions. In particular, the distribution of $\Gamma_{s,\boldsymbol{k}}$ is assumed to be conditionally independent normal,

$$\Gamma_{s,\boldsymbol{k}} \sim N\left(\sum_{r=1}^{R} \lambda_{s,r}\, \gamma^{(r)}_{s,k_1} \cdots \gamma^{(r)}_{s,k_D},\ \kappa_{s,\boldsymbol{k}}\right), \qquad (2)$$

where $\gamma^{(r)}_{s,v} \in \mathbb{R}$, $v = 1, \ldots, V$, are node-specific PARAFAC margins; i.e., the prior distribution of $\Gamma_s$ is centered on a rank-R PARAFAC [10] decomposed tensor. The PARAFAC or parallel factor decomposition is a multiway analogue of two-dimensional factor modeling of matrices. In particular, it provides a low-rank structure for the mean function of $\Gamma_s$. Note that, in the tensor regression literature, it is a fairly common practice to assume a low-rank structure for $\Gamma_s$ directly [6]. In contrast, the prior distribution in (2) only centers on a low-rank PARAFAC/CP representation [10], without imposing an exact low-rank structure on $\Gamma_s$ a priori. This allows more flexibility in the structure of the coefficients.
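The prior construction in (2) can be mimicked in a few lines of NumPy. The following is a minimal sketch, assuming D = 2, standard-normal draws for the PARAFAC margins purely for illustration, and hyper-parameter values of our own choosing; the variable names (`gamma`, `lam`, `kappa_u`) are illustrative and not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
V, R, zeta, c = 10, 3, 1.0, 2.0               # illustrative sizes and hyper-parameters

# Node-specific PARAFAC margins gamma^{(r)}_{s,v}; drawn from N(0, 1) purely for illustration.
gamma = rng.normal(size=(V, R))

# Rank-usage indicators: theta_{s,r} ~ Beta(1, r^c), lambda_{s,r} ~ Ber(theta_{s,r}).
theta = rng.beta(1.0, np.arange(1, R + 1) ** c)
lam = rng.binomial(1, theta)

# Low-rank PARAFAC prior mean: sum_r lambda_{s,r} gamma^{(r)}_{s,k1} gamma^{(r)}_{s,k2}.
prior_mean = (gamma * lam) @ gamma.T          # V x V symmetric matrix

# Cell-specific scales kappa_{s,k} ~ Exp(zeta^2 / 2) and the resulting draw of Gamma_s,
# restricted to off-diagonal cells (the diagonal is a dummy).
iu = np.triu_indices(V, k=1)
kappa_u = rng.exponential(scale=2.0 / zeta ** 2, size=iu[0].size)
Gamma_s = np.zeros((V, V))
Gamma_s[iu] = prior_mean[iu] + rng.normal(size=iu[0].size) * np.sqrt(kappa_u)
Gamma_s = Gamma_s + Gamma_s.T                 # symmetric matrix with zero diagonal
```

A draw with all `kappa_u` equal to zero reduces to the exact low-rank mean, which is the limiting case discussed in the remarks below.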
In (2), $\kappa_{s,\boldsymbol{k}}$ is the scale parameter corresponding to each $\Gamma_{s,\boldsymbol{k}}$, controlling the local variability of each coefficient a priori, and $\lambda_{s,r} \in \{0,1\}$, $r = 1, \ldots, R$, are introduced to assess the effect of the r-th summand on the mean of $\Gamma_{s,\boldsymbol{k}}$. In particular, $\lambda_{s,r} = 0$ implies that the r-th summand of the low-rank factorization is not informative for predicting the response. To develop data dependent learning of the nonzero $\lambda_{s,r}$'s, we propose $\lambda_{s,r} \sim \mathrm{Ber}(\theta_{s,r})$, $\theta_{s,r} \sim \mathrm{Beta}(1, r^c)$, $c > 1$, a priori. The hyper-parameters in the beta distribution are set so as to penalize the usage of a large number of summands in the PARAFAC decomposition, which protects the model from over-fitting.

Define $\gamma_{s,v} = (\gamma^{(1)}_{s,v}, \ldots, \gamma^{(R)}_{s,v})'$. In the course of identifying tensor nodes significantly associated with the s-th predictor, we note that the v-th node has minimal effect on the s-th predictor if $\gamma_{s,v} = 0$. Thus, in order to directly infer on $\gamma_{s,v}$, a spike-and-slab mixture prior [8] is assigned to $\gamma_{s,v}$ as below,

$$\gamma_{s,v} \sim \xi_{s,v}\, N_R(0, H_s) + (1 - \xi_{s,v})\, \delta_0,$$

where $\delta_0$ is the Dirac-delta function at 0 and $H_s$ is a covariance matrix of order $R \times R$. The rest of the hierarchy is completed by setting $\kappa_{s,\boldsymbol{k}} \sim \mathrm{Exp}(\zeta^2/2)$, $\xi_{s,v} \sim \mathrm{Ber}(\Delta_s)$, $H_s \sim IW(I, \nu)$, $\Delta_s \sim U(0,1)$, where IW stands for the inverse Wishart distribution. Finally, $\gamma_0$ and $\sigma^2$ are assigned $N(\mu_\gamma, \sigma^2_\gamma)$ and $IG(a, b)$ priors respectively, IG corresponding to the inverse gamma density. The posterior distribution of $\xi_{s,v}$ is monitored, and the v-th tensor node is declared related to the s-th predictor if the posterior probability of $\{\xi_{s,v} = 1\}$ turns out to be greater than 0.5.

A few remarks are in order. Note that, if $\kappa_{s,\boldsymbol{k}} = 0$ for all $\boldsymbol{k}$, then $\Gamma_s$ assumes an exact low-rank decomposition given by $\Gamma_{s,\boldsymbol{k}} = \sum_{r=1}^{R} \lambda_{s,r}\, \gamma^{(r)}_{s,k_1} \cdots \gamma^{(r)}_{s,k_D}$. Also, if $\gamma_{s,v} = 0$, then a priori $\Gamma_{s,\boldsymbol{k}}$ follows an ordinary Bayesian lasso shrinkage prior distribution [12] for all $\boldsymbol{k}$ with some $k_l = v$. In general, $\Gamma_{s,\boldsymbol{k}}$ a priori can be expressed as $\Gamma_{s,\boldsymbol{k}} = \sum_{r=1}^{R} \lambda_{s,r}\, \gamma^{(r)}_{s,k_1} \cdots \gamma^{(r)}_{s,k_D} + \Gamma_{s,2,\boldsymbol{k}}$, where $\Gamma_{s,2,\boldsymbol{k}}$ follows an ordinary Bayesian lasso shrinkage prior [12].

Although summaries of the posterior distribution cannot be computed in closed form, full conditional distributions for all the parameters are available and correspond, in most cases, to standard families. Thus, posterior computation can proceed through a Markov chain Monte Carlo algorithm. Details of the Markov chain Monte Carlo algorithm with the conditional posterior distributions are provided in the Appendix. We run the MCMC chain for 15000 iterations. With the first 5000 iterations discarded as burn-in, posterior inference is drawn on the L = 10000 post burn-in draws, suitably thinned. In order to identify whether the v-th tensor node is significantly related to the s-th predictor, we rely on the L post burn-in samples of $\xi_{s,v}$.
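The node-detection rule just described (declare the v-th node related to the s-th predictor when the posterior probability of $\{\xi_{s,v} = 1\}$, estimated from the post burn-in draws, exceeds 0.5) reduces to thresholding a Monte Carlo average. A minimal sketch in NumPy, using a hypothetical array `xi_draws` of stored indicator samples rather than the paper's actual implementation:

```python
import numpy as np

def detect_nodes(xi_draws, threshold=0.5):
    """xi_draws: (L, p, V) array of post burn-in 0/1 draws of xi_{s,v}.

    Returns the (p, V) posterior inclusion probabilities and a boolean
    matrix flagging the nodes whose probability exceeds `threshold`.
    """
    prob = xi_draws.mean(axis=0)
    return prob, prob > threshold

# Example usage with fake draws: node 0 mostly "on", node 1 mostly "off".
rng = np.random.default_rng(1)
xi_draws = rng.binomial(1, p=[[0.95, 0.05]], size=(10000, 1, 2))
prob, selected = detect_nodes(xi_draws)
print(prob, selected)
```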
This section illustrates the performance of our proposed approach, referred to as symmetric tensor regression (STR), along with some of its competitors under various simulation scenarios. In fitting our model, we fix the hyper-parameters at $a = 1$, $b = 1$, $\mu_\gamma = 0$, $\sigma_\gamma = 1$, $\nu = 10$ and $\zeta = 1$. We compare our approach to ordinary least squares (LS), which fits a cell-by-cell regression of the response on the predictors. Although a naive approach, LS is included due to its widespread use in neuro-imaging applications. Additionally, we employ Higher-Order Low-Rank Regression (HOLRR) [14] as a competitor. HOLRR provides a framework for higher order regression with a tensor response and scalar predictors. A comparative assessment of these three methods helps evaluate the possible gains in inference from taking the symmetry in the tensor response into account.

For the sake of simplicity, we work with $p = 1$ (and hence drop the subscript s hereon), with the scalar variable of interest $z_i$ drawn iid from $N(0,1)$. We also set $D = 2$. The response is simulated from the model $y_{i,\boldsymbol{k}} = \gamma^*_0 + \Gamma^*_{\boldsymbol{k}}\, z_i + \epsilon_{i,\boldsymbol{k}}$, where $\gamma^*_0$ is the true intercept and $\Gamma^* = ((\Gamma^*_{\boldsymbol{k}}))_{k_1,k_2=1}^{V}$ is the true symmetric tensor coefficient. The true intercept $\gamma^*_0$ is set to be 0.2. To simulate the true symmetric tensor coefficient $\Gamma^*$, we draw V tensor node specific latent variables $\gamma^*_v$, $v = 1, \ldots, V$, each set to zero with probability $(1 - \pi^*_1)$ and drawn from a Gaussian distribution otherwise. We then construct $\Gamma^*$ under two different simulation scenarios, referred to as Simulation 1 and Simulation 2, as described below.

Under Simulation 1, we set $\Gamma^*_{\boldsymbol{k}} = \sum_{r=1}^{R^*} \gamma^{*(r)}_{k_1} \cdots \gamma^{*(r)}_{k_D}$. Thus, the coefficient tensor assumes a symmetric rank-$R^*$ PARAFAC decomposition. Note that, if $\gamma^*_v = 0$, then the v-th tensor node specific variable has no impact in describing the relationship between y and z. Hence $(1 - \pi^*_1)$ is the probability of a tensor node not being related to $z_i$. We refer to it as the node sparsity parameter. In particular, the node sparsity parameter indicates the proportion of nodes in the tensor response (among the total of V tensor nodes) which are not related to the predictor. Notably, the data generation mechanism in Simulation 1 is quite similar (although not identical) to our fitted model. Hence, the goal of this first simulation is to evaluate the ability of the model to recover the true data-generating mechanism. In particular, we consider four cases under Simulation 1 (see Table 1) by varying the fitted rank of PARAFAC (R), the true rank of $\Gamma^*$ ($R^*$), the sample size (n), the number of tensor nodes (V) and the tensor node sparsity (defined before).

Under Simulation 2, we first simulate node specific latent variables, as in Simulation 1. If any of $\gamma^*_{k_1}, \ldots, \gamma^*_{k_D}$ is 0, we set $\Gamma^*_{\boldsymbol{k}} = 0$. Otherwise, $\Gamma^*_{\boldsymbol{k}}$ is simulated from the mixture distribution $\pi^*_2 N(0,1) + (1 - \pi^*_2)\delta_0$, where $(1 - \pi^*_2)$ is referred to as the cell sparsity parameter and $\delta_0$ refers to the Dirac delta function at 0. Unlike Simulation 1, Simulation 2 does not necessarily lead to a low-rank structure for $\Gamma^*$. Hence this simulation is ideal for investigating the performance under model mis-specification. Table 1 shows the different cases under Simulation 2 in which the model is investigated.

We compare the three competitors in terms of their accuracy in estimating $\Gamma^*$. The accuracy of estimating $\Gamma^*$ is measured by the scaled mean squared error (MSE), defined as $\|\Gamma^* - \hat{\Gamma}\|^2 / \|\Gamma^*\|^2$, where $\hat{\Gamma}$ corresponds to a suitable point estimate of $\Gamma$, e.g., the posterior mean of $\Gamma$ for STR, and $\|\cdot\|$ refers to the Frobenius norm of a matrix. Additionally, we quantify the uncertainty offered by each of these competitors through the coverage and length of 95% credible intervals of $\Gamma_{\boldsymbol{k}}$, averaged over all $\boldsymbol{k}$. The length and coverage of the posterior 95% credible intervals for each $\Gamma_{\boldsymbol{k}}$ are available empirically from the post burn-in MCMC samples of $\Gamma$ for our proposed approach. On the other hand, the 95% confidence intervals of the frequentist competitors are constructed using a bootstrap approximation. To infer on the performance of STR in terms of identifying tensor nodes significantly associated with $z_i$, we present the True Positive Rate (TPR) and False Positive Rate (FPR) with different choices of the cut-off t for all simulation cases.
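To make the Simulation 1 design and the scaled MSE metric concrete, the following sketch generates data under an illustrative version of that design (D = 2, with our own choices of R*, the node sparsity level and the noise standard deviation, none of which are values stated in the paper; only the intercept 0.2 is taken from the text) and evaluates $\|\Gamma^* - \hat{\Gamma}\|^2 / \|\Gamma^*\|^2$ for a placeholder estimate.

```python
import numpy as np

rng = np.random.default_rng(2)
n, V, R_star, pi1_star = 50, 20, 2, 0.6        # illustrative values only
gamma0_star, sigma_star = 0.2, 1.0             # intercept 0.2 as in the paper; noise sd assumed

# Node-specific latent vectors: zero with probability (1 - pi1_star), else Gaussian.
active = rng.binomial(1, pi1_star, size=V).astype(bool)
gamma_star = np.where(active[:, None], rng.normal(size=(V, R_star)), 0.0)

# True symmetric rank-R* coefficient tensor (D = 2) with dummy (zero) diagonal.
Gamma_star = gamma_star @ gamma_star.T
np.fill_diagonal(Gamma_star, 0.0)

# Responses y_{i,k} = gamma0* + Gamma*_k z_i + noise, symmetrized over the two modes.
z = rng.normal(size=n)
noise = np.triu(rng.normal(size=(n, V, V)) * sigma_star, k=1)
noise = noise + noise.transpose(0, 2, 1)
y = gamma0_star + z[:, None, None] * Gamma_star + noise

def scaled_mse(Gamma_true, Gamma_hat):
    """Scaled MSE ||Gamma_true - Gamma_hat||_F^2 / ||Gamma_true||_F^2."""
    return np.sum((Gamma_true - Gamma_hat) ** 2) / np.sum(Gamma_true ** 2)

# Example: the trivial estimate Gamma_hat = 0 gives a scaled MSE of exactly 1.
print(scaled_mse(Gamma_star, np.zeros_like(Gamma_star)))
```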
As mentioned earlier, TPR and FPR are not available for our competitors, since they are not designed to detect tensor nodes related to the predictor of interest.

Scaled MSE, coverage and length of 95% intervals for all competitors under Simulations 1 and 2 are presented in Tables 2 and 3, respectively. With no model mis-specification under Simulation 1, STR shows significantly better performance than LS and HOLRR in all four cases. The low-rank structure of HOLRR facilitates its superior performance over LS for a larger V/n ratio; however, as the V/n ratio increases further, HOLRR loses its edge over LS. All three competitors show over-coverage under Simulation 1, with STR producing substantially narrower credible intervals. Moreover, for a fixed n, the intervals tend to become narrower with increasing V for all competitors. Even under model mis-specification in Simulation 2, STR outperforms all competitors in the cases with smaller V and higher node sparsity (Cases 1 and 3), as seen in Table 3. However, with a larger V in Case 2, LS and STR are competitive with each other. Since HOLRR is built on a sparsity-within-low-rank principle, it loses its edge over LS in terms of MSE in these cases. Comparing the MSE of STR between Cases 3 and 4, we find that the MSE increases as the node sparsity decreases. Similarly, comparing Cases 1 and 3 reveals an adverse effect of decreasing cell sparsity on the MSE of STR. In general, the effect of node sparsity is more pronounced than the effect of cell sparsity on the performance. Moving on to uncertainty characterization, STR shows close to nominal coverage, along with its competitors, in Cases 1, 2 and 4, where V is small. With increasing V, the coverage of STR and HOLRR drops below 0.90. Similar to Simulation 1, STR demonstrates substantially narrower credible intervals than HOLRR in all cases. LS offers over-coverage with much wider 95% intervals in all cases.

Since LS and HOLRR are not designed to detect nodes significantly related to the predictor, we focus our inference on STR for node detection. To this end, we choose three cut-off values t = 0.1, 0.5, 0.9 and present the TPR and FPR values for our approach. STR yields posterior probabilities of a node being related to the predictor of interest that are very close to 1 or 0, depending on whether the tensor node is truly related to the predictor or not, for all reasonable values of the cut-off t. As a consequence, the TPR and FPR values (see Table 4) turn out to be close to 1 and 0, respectively, for all the simulation cases, indicating close to perfect active node detection. The posterior distribution of $\gamma_0$ also appears to be centered around the truth (not shown here).
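The evaluation summaries reported above can be computed directly from the MCMC output. The following is a minimal sketch with illustrative array names (`Gamma_draws`, `incl_prob`), not the code used to produce the tables: empirical coverage and average length of cell-wise 95% intervals, and TPR/FPR for node detection at a cut-off t.

```python
import numpy as np

def coverage_and_length(Gamma_draws, Gamma_star, level=0.95):
    """Gamma_draws: (L, V, V) post burn-in draws; Gamma_star: (V, V) truth."""
    alpha = 1.0 - level
    lo = np.quantile(Gamma_draws, alpha / 2, axis=0)
    hi = np.quantile(Gamma_draws, 1 - alpha / 2, axis=0)
    iu = np.triu_indices(Gamma_star.shape[0], k=1)       # off-diagonal cells only
    covered = (lo[iu] <= Gamma_star[iu]) & (Gamma_star[iu] <= hi[iu])
    return covered.mean(), (hi[iu] - lo[iu]).mean()

def tpr_fpr(incl_prob, truth_active, t=0.5):
    """incl_prob: (V,) posterior probabilities of {xi_v = 1}; truth_active: (V,) booleans."""
    flagged = incl_prob > t
    tpr = np.mean(flagged[truth_active]) if truth_active.any() else np.nan
    fpr = np.mean(flagged[~truth_active]) if (~truth_active).any() else np.nan
    return tpr, fpr
```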
The overarching goal of this article is to propose a symmetric tensor regression framework with a symmetric tensor response and scalar predictors. The model is aimed at identifying tensor nodes significantly related to each scalar predictor. Unlike the existing approaches, the proposed framework does not assume any low-rank constraint on the symmetric tensor coefficients. Rather, we propose a tensor shrinkage prior which decomposes the symmetric tensor coefficients into low-rank and sparse components a priori. The low-rank component is further assigned a novel hierarchical mixture prior to enable identification of tensor nodes related to each predictor. The sparse component is equipped with Bayesian regularization or shrinkage priors to enable accurate estimation of tensor cell coefficients. A detailed simulation study, with data generated both under the true model and under a mis-specified model, demonstrates superior performance of our approach compared to its competitors. Although there is considerable literature on the theoretical understanding of Bayesian shrinkage priors in high dimensional regression, the literature on theoretical aspects of shrinkage priors for tensor coefficients is limited. In the future, we will focus on establishing posterior consistency of the proposed approach under suitable conditions imposed on the tensor shrinkage priors. It is also instructive to employ other shrinkage priors on $\Gamma_{s,2,\boldsymbol{k}}$ from the class of global-local shrinkage prior distributions [13] and provide a comparative understanding of their performance. Finally, we would like to extend our approach to settings in which the entries $y_{i,\boldsymbol{k}}$ are categorical or counts. Some of these constitute our future work.

References:
- An automated labeling system for subdividing the human cerebral cortex on MRI scans into gyral based regions of interest
- Bayesian sparse reduced rank multivariate regression
- Bayesian regression with undirected network predictors with an application to brain connectome data
- Joint modeling of longitudinal relational data and exogenous variables
- Bayesian tensor response regression with an application to brain activation studies
- Bayesian tensor regression
- Bilinear mixed-effects models for dyadic data
- Spike and slab variable selection: frequentist and Bayesian strategies
- Tensor decompositions and applications
- Parsimonious tensor response regression
- The Bayesian lasso
- Sparse Bayesian regularization and prediction
- Higher-order low-rank regression
- Sparse multivariate regression with covariance estimation

Appendix: The full conditional distributions of the parameters for implementing the MCMC are given by the following. Define $J$ as a matrix of order $\#\mathcal{K}_v \times R$, where $\mathcal{K}_v$ denotes the set of off-diagonal cells containing node $v$, with a representative row $\big(\lambda_{s,1}\prod_{l: k_l \neq v}\gamma^{(1)}_{s,k_l},\ \ldots,\ \lambda_{s,R}\prod_{l: k_l \neq v}\gamma^{(R)}_{s,k_l}\big)$. Further, for $r = 1, \ldots, R$,

$$\cdots = \prod_{i,\boldsymbol{k}} N\Big(y_{i,\boldsymbol{k}} \,\Big|\, \sum_{r' \neq r}\lambda_{s,r'}\,\gamma^{(r')}_{s,k_1} \cdots \gamma^{(r')}_{s,k_D} + \gamma^{(r)}_{s,k_1} \cdots \gamma^{(r)}_{s,k_D},\ \sigma^2\Big), \qquad D_{s,r} = \prod_{i,\boldsymbol{k}} N\Big(y_{i,\boldsymbol{k}} \,\Big|\, \sum_{r' \neq r}\lambda_{s,r'}\,\gamma^{(r')}_{s,k_1} \cdots \gamma^{(r')}_{s,k_D},\ \sigma^2\Big).$$
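As an illustration of the kind of full conditional involved in the quantities above, the following is a generic Gibbs-type update for a Bernoulli rank-usage indicator such as $\lambda_{s,r}$: compare the Gaussian likelihood of the data with and without the r-th PARAFAC summand in the mean, on the log scale for numerical stability, and draw the indicator from the resulting Bernoulli conditional. This is our own illustrative sketch, not the paper's derivation, and all function and argument names are assumptions.

```python
import numpy as np

def update_lambda_sr(y, mean_without_r, summand_r, sigma2, theta_sr, rng):
    """y, mean_without_r, summand_r: arrays of matching shape over all subjects and cells.

    theta_sr is the prior inclusion probability of the r-th summand; returns a 0/1 draw.
    """
    def loglik(mean):
        # Gaussian log-likelihood up to a constant that cancels between the two cases.
        return -0.5 * np.sum((y - mean) ** 2) / sigma2

    log_on = np.log(theta_sr) + loglik(mean_without_r + summand_r)    # lambda_{s,r} = 1
    log_off = np.log1p(-theta_sr) + loglik(mean_without_r)            # lambda_{s,r} = 0
    p_on = 1.0 / (1.0 + np.exp(log_off - log_on))
    return rng.binomial(1, p_on)
```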