key: cord-0147291-l1g7sapx authors: Laves, Max-Heinrich; Tolle, Malte; Schlaefer, Alexander; Engelhardt, Sandy title: Posterior Temperature Optimization in Variational Inference for Inverse Problems date: 2021-06-11 journal: nan DOI: nan sha: 0c9fab7bcebe0add9cbe44b075e0daca78d091f7 doc_id: 147291 cord_uid: l1g7sapx

Bayesian methods feature useful properties for solving inverse problems, such as tomographic reconstruction. The prior distribution introduces regularization, which helps to solve the ill-posed problem and reduces overfitting. In practice, this often results in a suboptimal posterior temperature, and the full potential of the Bayesian approach is not realized. In this paper, we optimize both the parameters of the prior distribution and the posterior temperature using Bayesian optimization. Well-tempered posteriors lead to better predictive performance and improved uncertainty calibration, which we demonstrate for the task of sparse-view CT reconstruction.

Reconstructing a tomographic image from a finite number of X-ray projections requires solving an inverse problem. The unknown image $x$ can only be observed through projections $y = F[x]$, obtained via the forward Radon transform $F$, which is not directly invertible. The reconstruction can be found by minimizing the ill-posed objective $\hat{x} = \arg\min_x \{\mathcal{L}(y, F[x]) + \lambda R(x)\}$, with similarity measure $\mathcal{L}$ and regularization $R$, weighted by $\lambda$ [1]. Common regularizers are manually engineered, such as penalization of spatial derivatives, or implicitly learned from a large data set. However, obtaining ground-truth pairs $\{x, y\}$ is impossible in computed tomography (CT), especially in sparse-view CT, where only a limited number of projections are acquired to reduce radiation exposure. Deep image prior (DIP) has shown promising results in solving inverse problems by optimizing a randomly initialized convolutional network as a neural representation of the reconstruction [2, 3]. To overcome the overfitting behavior of DIP, different Bayesian approaches have been proposed [4, 5].

In Bayesian deep learning, a prior distribution $p(w \mid \alpha)$ is placed over the weights $w$ of a neural network, governed by a hyperparameter $\alpha$. After observing the data $\mathcal{D}$, we are interested in the posterior $p(w \mid \mathcal{D}, \alpha) = p(\mathcal{D} \mid w, \alpha)\,p(w \mid \alpha)/p(\mathcal{D})$. However, this distribution is not tractable in general, as the normalizing factor involves marginalization of the model likelihood over the prior, $p(\mathcal{D}) = \int p(\mathcal{D} \mid w, \alpha)\,p(w \mid \alpha)\,\mathrm{d}w$. A common way to approximate the posterior is variational inference (VI), which uses optimization to find the member $q_\phi(w)$ of a family of distributions, defined by the variational parameters $\phi$, that is close to the exact posterior. $q_\phi(w)$ is optimized w.r.t. $\phi$ such that the Kullback-Leibler (KL) divergence to the true posterior is minimized [6]. A practical implementation of VI is Bayes by backprop, which uses a fully factorized Gaussian distribution $w_{ij} \sim \mathcal{N}(\mu_{ij}, \sigma^2_{ij})$ as variational distribution $q_\phi(w)$, also known as a mean-field distribution, and treats the mean and variance of each weight as learnable parameters $\phi_{ij} = \{\mu_{ij}, \sigma^2_{ij}\}$ [7].
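As a concrete illustration of the mean-field parameterization, the following minimal sketch draws a weight sample with the reparameterization trick. All names, and the softplus parameterization of the standard deviation, are assumptions made for illustration rather than the implementation used in the paper.

```python
import torch
from torch.nn.functional import softplus

def sample_weights(mu: torch.Tensor, rho: torch.Tensor) -> torch.Tensor:
    """Draw one sample w ~ q_phi(w) from the fully factorized Gaussian
    variational distribution (Bayes by backprop). `mu` and `rho` are the
    learnable variational parameters; sigma = softplus(rho) keeps the
    standard deviation positive (this parameterization is an assumption)."""
    sigma = softplus(rho)
    eps = torch.randn_like(mu)  # reparameterization trick
    return mu + sigma * eps

# Hypothetical usage for a small weight tensor:
mu = torch.zeros(3, 3, requires_grad=True)
rho = torch.full((3, 3), -3.0, requires_grad=True)
w = sample_weights(mu, rho)  # differentiable w.r.t. mu and rho
```

Because the sample is a differentiable function of $\mu$ and $\rho$, gradients of the ELBO can flow back into the variational parameters during optimization.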
Cold Posteriors. Cold posteriors have been reported to perform better in practice in the context of Bayesian deep learning [8]. In order to bring the variational distribution $q_\phi(w)$ close to the true posterior, a lower bound on the log-evidence (ELBO) is derived and maximized. Graves [9] already suggested reweighting the complexity term in the ELBO with a factor $\lambda$ to balance both terms in case of a discrepancy between the number of weights and the number of training samples:

$\mathbb{E}_{q_\phi(w)}[\log p(\mathcal{D} \mid w)] - \lambda\, \mathrm{KL}[\, q_\phi(w) \,\|\, p(w) \,]$ .   (1)

It is common for Bayesian deep learning practitioners to employ values of $\lambda < 1$ to achieve better predictive performance [7]. While the original motivation was to qualitatively balance out discrepancies between the number of model parameters and the data set size, the reweighting has recently been studied in more detail and described as the "cold posterior" effect [10]. Wenzel et al. [8] derived the tempered Bayesian posterior $p(w \mid \mathcal{D}) \propto \exp(-U(w)/T)$ with posterior energy function $U(w) = -\log p(\mathcal{D} \mid w) - \log p(w)$ and showed empirically that cold posteriors with $T < 1$ perform considerably better. The authors also recover Eq. (1) and show that introducing $\lambda$ into the ELBO is equivalent to a partially tempered posterior, where only the likelihood term is scaled. In this paper, we do not argue whether cold posteriors invalidate Bayesian principles, as there is disagreement among researchers [8, 10, 11], but use them in a directed way to increase the predictive performance and uncertainty calibration of unsupervised sparse-view CT reconstruction with deep image prior. This workshop paper is based on our recent journal submission [12] and extends it by additional experiments on CIFAR-10/100 (see Appendix C).

The ELBO for a fully temperature-scaled posterior in VI is given by (derivation in Appendix B)

$\mathrm{ELBO}_T = \tfrac{1}{T}\,\mathbb{E}_{q_\phi(w)}[\log p(\mathcal{D} \mid w)] - \mathrm{KL}[\, q_\phi(w) \,\|\, p_T(w) \,]$ .   (2)

The KL term contains the scaled prior $p_T(w) \propto p(w)^{1/T}$, which has the same mean but a different variance than the unscaled prior. In case of a Gaussian prior $p(w) \propto \exp(-\|w\|^2/2\sigma^2)$, this is equivalent to a scaled prior variance [13]. Therefore, we set $p_T(w) = \mathcal{N}(0, \sigma^2 T I)$, which results in the following minimization criterion

$\arg\min_\phi \; -\mathbb{E}_{q_\phi(w)}[\log p(\mathcal{D} \mid w)] + T\, \mathrm{KL}[\, q_\phi(w) \,\|\, \mathcal{N}(0, \sigma^2 T I) \,]$ ,   (4)

which, in contrast to Eq. (1) and Wenzel et al. [8], optimizes the fully temperature-scaled $\mathrm{ELBO}_T$.

Instead of manually selecting the optimal posterior temperature using heuristics or inefficient grid search, we employ Bayesian optimization (BO) to jointly find the posterior temperature $T$ and prior scale $\sigma$. BO allows us to optimize functions that are expensive to evaluate, e.g., the training of a deep network [14]. It uses a computationally inexpensive surrogate model that maintains a distribution over functions. We optimize the posterior temperature to maximize the peak signal-to-noise ratio (PSNR) between the sparse-view reconstruction $\hat{x}$ and the dense-view image $x$ as a function of $T$ and $\sigma$, using a Gaussian process (GP) as surrogate $f \sim \mathcal{GP}$. In each step of the BO, we evaluate our objective function $f$ at the current candidates $T^\ast$ and $\sigma^\ast$ to extend the set of observations $\mathcal{D}_{\mathrm{BO}}$ and update the posterior of the surrogate model. Next, we maximize an acquisition function $a(T, \sigma; \mu_{\mathrm{GP}}, \sigma^2_{\mathrm{GP}})$ using the current GP posterior mean $\mu_{\mathrm{GP}}$ and variance $\sigma^2_{\mathrm{GP}}$. Its maximizing arguments $T^\ast, \sigma^\ast \leftarrow \arg\max a(T, \sigma; \mu_{\mathrm{GP}}, \sigma^2_{\mathrm{GP}})$ are used as candidates for the next iteration [15]. We choose the commonly accepted expected improvement (EI) as acquisition function

$a_{\mathrm{EI}}(T, \sigma) = \mathbb{E}\big[\max\big(0,\, f^\ast - f(T, \sigma)\big)\big]$ ,   (6)

where $f^\ast = f(T_{\mathrm{best}}, \sigma_{\mathrm{best}})$ is the minimal value of the objective function observed so far. Eq. (6) can be solved analytically, as shown in [16]. We utilize automatic differentiation from modern deep learning frameworks to optimize the acquisition function and obtain the next candidates $T^\ast$ and $\sigma^\ast$ [17].
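To make the minimization criterion in Eq. (4) concrete, the following sketch evaluates it for a fully factorized Gaussian variational posterior and the scaled Gaussian prior $\mathcal{N}(0, \sigma^2 T I)$, using the closed-form KL divergence between Gaussians. The function and variable names are illustrative assumptions; `exp_log_lik` stands for a Monte Carlo estimate of $\mathbb{E}_{q_\phi(w)}[\log p(\mathcal{D} \mid w)]$ obtained with reparameterized weight samples.

```python
import math
import torch

def tempered_elbo_loss(exp_log_lik: torch.Tensor,
                       q_mu: torch.Tensor,
                       q_logvar: torch.Tensor,
                       T: float,
                       sigma: float) -> torch.Tensor:
    """Negative fully tempered ELBO (a sketch of Eq. 4, up to constants):
    -E_q[log p(D|w)] + T * KL(q_phi(w) || N(0, sigma^2 * T * I)).
    q_mu / q_logvar hold the mean-field parameters of all weights."""
    prior_var = (sigma ** 2) * T
    q_var = q_logvar.exp()
    # Closed-form KL between the diagonal Gaussian q and the isotropic prior.
    kl = 0.5 * torch.sum(
        (q_var + q_mu ** 2) / prior_var - 1.0
        - q_logvar + math.log(prior_var)
    )
    return -exp_log_lik + T * kl
```

Minimizing this loss over the variational parameters with a stochastic optimizer yields the tempered posterior approximation for a given pair $(T, \sigma)$; the outer BO loop then proposes new candidate pairs.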
To evaluate posterior temperature optimization in Bayesian inversion, we simulate sparse-view CT by computing only 45 projections from dense-view lung CTs of COVID-19 patients using the forward Radon transform. We use mean-field VI (MFVI) as the Bayesian approach to DIP for solving the inverse task (see Fig. 2 in the appendix). The Bayesian network is used as parameterization of the reconstruction $\hat{x}$, and its variational parameters are optimized by minimizing Eq. (4) with the squared error $\|F[\hat{x}] - y\|^2$ as likelihood. BO is used to find optimal values for $\{T, \sigma\}$ as described below.

Finding the Optimal Posterior Temperature. The Gaussian process regressor from § 3 is implemented in GPyTorch [17] using a constant mean function with prior $\mathcal{N}(15, 4^2)$, a scaled radial basis function kernel as covariance function, and a prior length-scale of $\ell = 0.3$. The surrogate model is trained on observations $\{(\log T_i, \log \sigma_i),\ \mathrm{PSNR}(\hat{x}_{T_i,\sigma_i}, x)\}$ to impose a non-negativity constraint on $T$ and $\sigma$. A Gaussian likelihood with a homoscedastic noise model with prior $\Gamma(0.1, 100)$ is used. We limit the search space to $T \in [1e{-}12, 1e{-}2]$ and $\sigma \in [1e{-}10, 1]$ and initialize the BO with four candidate pairs with $T \in \{1e{-}7, 1e{-}4\}$ and $\sigma \in \{1e{-}6, 1e{-}1\}$. If the acquisition function from Eq. (6) has multiple local maxima, we select the best four candidates for the next iteration.

The results for a test image are summarized in Fig. 1. At the optimal temperature $T^\ast$, the Bayesian reconstruction outperforms filtered back-projection (FBP) and non-Bayesian DIP in terms of PSNR. From the GP mean, we see that the posterior temperature has a considerable effect on the reconstruction, with $T^\ast \ll 1$. The effect of the prior scale is less prominent, with optimal value $\sigma^\ast \approx 1e{-}2$. We observe similar findings for classification experiments on CIFAR-10/100 with Bayesian ResNets (see Appendix C). The uncertainty calibration is improved at the optimal temperature.

We optimized the ELBO for a fully tempered posterior to exploit the cold posterior effect in Bayesian deep learning. For ill-posed inverse problems, the optimized posterior temperature introduces the right amount of regularization to allow enough flexibility while avoiding overfitting. This can be used in many medical applications such as CT reconstruction, registration, denoising, or artifact removal.

B Derivation of the Fully Tempered ELBO

In the following, the ELBO for a fully temperature-scaled Bayesian posterior in variational inference is derived. Let $p_T(w \mid \mathcal{D})$ be the fully tempered posterior [8]:

$p_T(w \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid w)^{1/T}\, p(w)^{1/T}}{E_T} , \quad E_T = \int p(\mathcal{D} \mid w)^{1/T}\, p(w)^{1/T}\, \mathrm{d}w .$

The KL divergence between the variational distribution and this posterior is

$\mathrm{KL}[\, q_\phi(w) \,\|\, p_T(w \mid \mathcal{D}) \,] = \mathbb{E}_{q_\phi(w)}[\log q_\phi(w)] - \tfrac{1}{T}\,\mathbb{E}_{q_\phi(w)}[\log p(\mathcal{D} \mid w)] - \tfrac{1}{T}\,\mathbb{E}_{q_\phi(w)}[\log p(w)] + \log E_T ,$

where, up to additive constants that do not depend on $\phi$, the first three terms on the right-hand side equal $-\mathrm{ELBO}_T$ from Eq. (2). As the tempered evidence $E_T$ is constant, maximizing $\mathrm{ELBO}_T$ minimizes the KL, thus bringing the variational distribution $q_\phi(w)$ closer to the fully tempered posterior $p_T(w \mid \mathcal{D})$.

C CIFAR-10/100 Experiments

• Code for the training pipeline and evaluation is available at github.com/Cardio-AI/mfvi-dip-mia.
• For CT reconstruction, we use the same architecture as described by Lempitsky et al. [2] and optimize the network for 1e5 iterations.
• The final CT is sampled from the probabilistic neural representation using Monte Carlo integration $\hat{x} = \frac{1}{N}\sum_{i=1}^{N} \hat{x}_i$, where $\hat{x}_i$ is a sample from the posterior predictive $p(\hat{x} \mid w_i, y)$.
• We estimate the reconstruction uncertainty using the predictive variance of the Monte Carlo samples, $\hat{\Sigma} = \frac{1}{N}\sum_{i=1}^{N} (\hat{x}_i - \hat{x})^2$ (see the sketch after this list).
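The two Monte Carlo estimators from the list above can be sketched as follows. Here `model` stands for the Bayesian DIP network whose forward pass draws fresh weights $w_i \sim q_\phi(w)$, and `z` for its fixed input; both names and the sample count are assumptions about the interface, not the paper's actual code.

```python
import torch

@torch.no_grad()
def mc_reconstruction(model, z: torch.Tensor, num_samples: int = 128):
    """Monte Carlo integration over posterior weight samples: returns the
    mean reconstruction x_hat = 1/N * sum_i x_hat_i and the per-pixel
    predictive variance of the samples."""
    samples = torch.stack([model(z) for _ in range(num_samples)])
    x_hat = samples.mean(dim=0)    # reconstruction
    var_hat = samples.var(dim=0)   # uncertainty estimate
    return x_hat, var_hat
```

Each call to `model(z)` must resample the weights from $q_\phi(w)$; with a mean-field layer as sketched earlier, this happens automatically in the forward pass.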
References

[1] Deformable medical image registration: A survey
[2] Deep Image Prior
[3] Computed tomography reconstruction using deep image prior and learned reconstruction methods
[4] A Bayesian perspective on the deep image prior
[5] Uncertainty estimation in medical image denoising with Bayesian deep image prior
[6] Variational inference: A review for statisticians
[7] Weight uncertainty in neural network
[8] How good is the Bayes posterior in deep neural networks really?
[9] Practical variational inference for neural networks
[10] Bayesian deep learning and a probabilistic perspective of generalization
[11] What are Bayesian neural network posteriors really like?
[12] Posterior temperature optimized Bayesian models for inverse problems in medical imaging (under review)
[13] A statistical theory of cold posteriors in deep neural networks
[14] Scalable Bayesian optimization using deep neural networks
[15] A tutorial on Bayesian optimization
[16] Efficient global optimization of expensive black-box functions
[17] GPyTorch: Blackbox matrix-matrix Gaussian process inference with GPU acceleration
[18] Well-calibrated model uncertainty with temperature scaling for dropout variational inference

Acknowledgments. MT is supported by Informatics for Life, funded by the Klaus Tschira Foundation. ML and AS are partially funded by the Interdisciplinary Competence Center for Interface Research (ICCIR).