key: cord-0512242-mdy0xp09 authors: Wu, Dongxia; Gao, Liyao; Xiong, Xinyue; Chinazzi, Matteo; Vespignani, Alessandro; Ma, Yi-An; Yu, Rose title: Quantifying Uncertainty in Deep Spatiotemporal Forecasting date: 2021-05-25 journal: nan DOI: nan sha: aa07ea8d4fcd5c83ba26776182b80102c763fdd7 doc_id: 512242 cord_uid: mdy0xp09 Deep learning is gaining increasing popularity for spatiotemporal forecasting. However, prior works have mostly focused on point estimates without quantifying the uncertainty of the predictions. In high stakes domains, being able to generate probabilistic forecasts with confidence intervals is critical to risk assessment and decision making. Hence, a systematic study of uncertainty quantification (UQ) methods for spatiotemporal forecasting is missing in the community. In this paper, we describe two types of spatiotemporal forecasting problems: regular grid-based and graph-based. Then we analyze UQ methods from both the Bayesian and the frequentist point of view, casting in a unified framework via statistical decision theory. Through extensive experiments on real-world road network traffic, epidemics, and air quality forecasting tasks, we reveal the statistical and computational trade-offs for different UQ methods: Bayesian methods are typically more robust in mean prediction, while confidence levels obtained from frequentist methods provide more extensive coverage over data variations. Computationally, quantile regression type methods are cheaper for a single confidence interval but require re-training for different intervals. Sampling based methods generate samples that can form multiple confidence intervals, albeit at a higher computational cost. Forecasting the spatiotemporal evolution of time series data is at the core of data science. Spatiotemporal forecasting exploits the spatial correlations to improve the forecasting performance for multivariate time series [10] . Many works have used deep learning for spatiotemporal forecasting, applied to various domains from weather [62, 65] to traffic [20, 34] forecasting. However, the majority of existing works focus on point estimates without generating confidence intervals. The resulting predictions fail to provide credibility for assessing potential risks and assisting decision making, which is critical especially in high-stake domains such as public health and transportation. It is known that deep neural networks are poor at providing uncertainty and tend to produce overconfident predictions [3] . Therefore, uncertainty quantification (UQ) is receiving growing interests in deep learning [17, 47, 55, 60] . There are roughly two types of UQ methods in deep learning. One leverages frequentist thinking and focuses on robustness. Perturbations are made to the inference procedure in initialization [16, 32] , neural network weights [17] , and datasets [33, 45] . The other type is Bayesian, which aims to model posterior beliefs of network parameters given the data [24, 28, 43] . A common focal point of both approaches is the generalization ability of the neural networks. In forecasting, we not only wish to explain in-distribution examples, but also generalize to future data with model misspecifications and distribution shifts. Time series data provide the natural laboratory for studying this problem. Spatiotemporal sequence data pose unique challenges to UQ: (1) Spatiotemporal dependency: accurate forecasts require models that can capture both spatial and temporal dependencies. (2) Evaluation metrics: point estimates often use mean squared error (MSE) as a metric, which only measures the averaged point-wise error for the sequence. For uncertainty, common metrics such as held-out log-likelihood require exact likelihood computation. The number of possibilities grows exponentially with the length of the sequence. (3) Forecasting horizon: the sequential dependency in time series often leads to error propagation for long-term forecasting, so does uncertainty. Unfortunately, most existing deep UQ methods deal with scalar classification/regression problems with a few exceptions. [55] proposes to use concrete dropout for climate down-scaling but not for forecasting. [56] proposes a Bayesian neural network based on variational inference. Still, a systematic study of UQ methods for spatiotemporal forecasting with deep learning is missing in the data science community. In this paper, we conduct a benchmark study of deep uncertainty quantification for spatiotemporal forecasting. We investigate spatiotemporal forecasting problems on a regular grid as well as on a graph. We compare 2 types of UQ methods: frequentist methods including Bootstrap, Quantile regression, Spline Quantile regression and Mean Interval Score regression; and Bayesian methods including Monte Carlo dropout, and Stochastic Gradient Markov chain Monte Carlo (SG-MCMC). We analyze the properties of both frequentist and Bayesian UQ methods, as well as their practical performances. We further provide a recipe for practitioners when tackling UQ problems in deep spatiotemporal forecasting. We experimented with many real-world forecasting tasks to validate our hypothesis: road network traffic forecasting, epidemic forecasting, and air quality prediction. Our contributions include: • We conduct the first systematic benchmark study for deep learning uncertainty quantification (UQ) in the context of multi-step spatiotemporal forecasting. • We cast frequentist and Bayesian UQ methods under a unified framework and adapt 6 UQ methods for grid-based and graph-based spatiotemporal forecasting problems. • Our study reveals that posterior sampling excels at mean prediction whereas quantile and mean interval score regression obtain confidence levels that cover data variations better. • We also notice that the sample complexity of posterior sampling is lower than that of the bootstrapping method, potentially due to posterior contraction [61] . Spatiotemporal forecasting is challenging due to the higher-order dependencies in space, time, and variables. Classic Bayesian statistical models [10] often require strong modeling assumptions. There is a growing literature on using deep learning due to its flexibility. Depending on the geometry of the space, the existing methods fall into two categories: regular grid-based such as Convolutional LSTM [35, 58, [62] [63] [64] and graph-based such as Graph Convolutional LSTM [20, 34, 66] . The high-level idea behind these architectures is to integrate spatial embedding using (Graph) Convolutional neural networks with deep sequence models such as LSTM. However, most existing works in deep learning forecasting only produce point estimation rather than probabilistic forecasts with built-in uncertainty. Two types of deep UQ methods prevail in data science. Bayesian methods model the posterior beliefs of network parameters given the data. For example, Bayesian neural networks learning (BNN) uses approximate Bayesian inference to improve inference efficiency. Wang et al. [57] proposes natural parameter network using exponential family distributions. Shekhovtsov and Flach [50] proposes new approximations for categorical transformations. However, these BNNs are focused on classification problems rather than forecasting. [55] use concrete dropout for climate downscaling. [56] proposes a Bayesian deep learning model VI for weather forecasting. On the other hand, frequentist UQ methods emphasize the robustness against variations in the data. For instance, [19, 46] propose prediction intervals as the forecasting objective. [2] proposed a UQ method with bootstrapping using the influence function. This paper extends previous works and studies the efficacy and efficiency of various methods for UQ in spatiotemporal forecasts. We defer the detailed discussion of these methods to Section 4.3. To the best of our knowledge, there does not exist a systematic benchmark study of these UQ methods for deep spatiotemporal forecasting. Given multivariate time series X = (X 1 , · · · , X ) of time step, where X ∈ R × indicating features from locations. We also have a spatial correlation matrix ∈ R × indicating the spatial proximity. Deep spatiotemporal forecasting approximates a function with deep neural networks such that: : where is the forecasting horizon. Spatiotemporal forecasting exploits spatial correlations to improve the prediction performance. A popular idea behind existing deep learning models is to integrate a spatial embedding of X into a deep sequence model such as an RNN or LSTM. Mathematically speaking, that means to replace the matrix multiplication operator in an RNN with a convolution or graph convolution operator. For grid-based data we can use convolution, which is the key idea behind ConvLSTM [62] . As the locations are distributed as a grid, we reshape X into a tensor of shape × × with × = representing the width and the height of the grid. A convolutional RNN model becomes: where h are the hidden states of the RNN and * is the convolution For graph-based data we use graph convolution, leading to graph convolutional RNN, similar to the design of DCRNN [34] : Where * stands for graph convolution. There are many variations for graph convolution or graph embedding. One example is Here ∈ R × contains the diagonal element of . One can also replace the random walk matrix −1 with the normalized Laplacian matrix − 1/2 ( − ) 1/2 as in [29] . These deep learning models are developed only for point estimation. We will use them as base models to design and compare various deep UQ methods for probabilistic forecasts. For point estimation, mean squared error (MSE) or mean absolute error (MAE) are well-established metrics. But for probabilistic forecasts, MSE is insufficient as it cannot characterize the quality of the prediction confidence. While held-out likelihood is the gold standard in the existing UQ literature, it is difficult to use for deep learning as they often do not generate explicit likelihood outputs [27] . Instead, we revisit a key concept from statistics and econometric called Mean Interval Score [21] . Mean Interval Score (MIS) is a scoring function for interval forecasts. It is also known as Winkler loss [4] . It rewards narrower confidence or credible intervals and encourages intervals that include the observations (coverage). From a computational perspective, MIS is preferred over other scoring functions such as the Brier score [8] , Continuous Ranked Probability Score (CRPS) [21, 23, 41] as it is intuitive and easy to compute. MIS only requires three quantiles while CRPS requires integration over all quantiles. From a statistical perspective, MIS does not constrain the model to be parametric, nor does it require explicit likelihood functions. It is therefore better suited for comparison across a wide range of methods. We formally define MIS for estimated upper and lower confidence bounds. For a one-dimensional random variable ∼ P , if the estimated upper and lower confidence bounds are and , where and are the (1 − 2 ) and 2 quantiles for the (1 − ) confidence interval, MIS is defined using samples ∼ P : In the large sample limit, MIS converges to the following expectation: We prove in the following two novel propositions about the consistency of MIS as a scoring function, which demonstrates that minimizing MIS will lead to unbiased estimates for the confidence intervals. The first proposition is concerned with the consistency of MIS in the large sample limit and the second studies the finite sample consistency of it. Proofs for both propositions are deferred to Appendix A. Proposition 1. Assume that the distribution P of has a probability density function. Then for ( * , * ) = argmin > ; , ∈R MIS ∞ ( , ; ), [ * , * ] is the (1 − ) confidence level. Note from Proposition 2 that the optimal interval for MIS contains portion of the data points, because of the balance it strikes between the reward towards narrower intervals and the penalty towards excluding observations. Next, we describe a unified framework based on statistical decision theory to better understand uncertainty quantification. We cast uncertainty quantification from frequentist and Bayesian perspectives into a single framework. Then we describe both the Bayesian and frequentist techniques to enable deep spatiotemporal forecasting models to generate probabilistic forecasts. We start from a statistical decision theory point of view to examine uncertainty quantification. Assume that each datum X and its label Y are generated from a mechanism specified by a probability distribution with parameter : (Y|X; ). We further assume that the parameter is distributed according to ( ). The goal of a probabilistic inference procedure is to minimize the statistical risk, which is an expectation of the loss function over the distribution of the data as well as ( ), the distribution of the class of models considered. Concretely, consider an estimatorˆof , which is a measurable function of the dataset {X} = {X 0 , · · · , X }. We wish to minimize its risk: where in this paper, function is taken to be the prediction of the deep learning model given its parameters. When we design an inferential procedure, we can use equation 5 or equation 6. If we take the procedure of equation 5 and be a frequentist, we first seek the estimatorˆto minimize In practice, we minimize the empirical risk over the instead of the original expectation that is agnostic to the algorithms. If we take the procedure of equation 6 and be Bayesian, we first take the expectation over the distribution of the parameter space , conditioning on the training dataset {X}. We can find that For a proper prior distribution over , this boils down to the posterior mean of ( ). In this work, we also take function to be the output of the model. To quantify uncertainty of the respective approaches, it is important to capture variations in that determine the distribution of the data. There are two sources of in-distribution uncertainties corresponding to different modes of variations in : effective dimension of the estimatorˆbeing smaller than that of the true , oftentimes addressed to as the bias correction problem; and the variance of the estimatorˆ, often approached via parameter calibration. On top of that, we are also faced with distribution shift that poses additional challenges. Frequentist methods aim to directly capture variations in the data. For example, quantile regression explores the space of predictions within the model class by ensuring proper coverage of the data set. bootstrap method resamples the data set to target both the bias correction and parameter calibration problems, albeit possessing a potentially high resampling variance. Bayesian methods often leverage prior knowledge about the distribution of and use MCMC sampling to integrate different models to capture both variations in the parameter space. In practice, it is important to understand the sample complexity (e.g., number of resampling runs in bootstrap or number of MCMC samples) of all methods and how it scales with the model and data complexity. For deep learning models, it is also crucial to capture the trade-off between the variance incurred by the algorithms and their computation complexity. To understand the performance of the uncertainty quantification methods, we apply the statistically consistent metric MIS to measure how well the confidence intervals computed from either frequentist or Bayesian methods capture the uncertainty of the predictions given the data. In what follows, we introduce the computation methods for uncertainty quantification. We start by describing the frequentist methods. The general idea behind them is to directly capture variations in the dataset, either by resampling the data or by learning an interval bound to encompass the dataset. Bootstrap. The (generalized) bootstrap method [15] randomly generates in each round a weight vector over the index set of the data. The data are then resampled according to the weight vector. With every resampled dataset, we retrain our model and generate predictions. Using the predictions from different retrained models, we infer the (1 − ) confidence interval from the (1 − 2 ) and 2 quantiles based on Proposition 2. The quantiles can be estimated using the order statistics of Monte Carlo samples. MIS Regression For a fixed confidence level , we can directly minimize MIS to obtain estimates of the confidence intervals. Specifically, we use MIS as a loss function for deep neural networks, we use a multi-headed model to jointly output the upper bound ( ), lower bound ( ), and the prediction ( ) for a given input , and minimize the neural network parameter : Here 1 {·} is an indicator function, which can be implemented using the identity operator over the larger element in Pytorch. Quantile Regression We can use the one-sided quantile loss function [31] to generate predictions for a fixed confidence level . Given an input , and the output ( ) of a neural network, parameterized by , quantile loss is defined as follows: Quantile regression behaves similarly as the MIS regression method. Both methods generate one confidence interval per time. Similar ideas of directly optimizing the prediction interval using different variations of quantile loss have also been explored by Kivaranovic et al. [30] , Pearce et al. [46] , Tagasovska and Lopez-Paz [52] . One caveat of these methods is that different predicted quantiles can cross each other due to variations given finite data. This will lead to a strange phenomenon when the size of the data set and the model capacity is limited: the higher confidence interval does not contain the interval of lower confidence level or even the point estimate. One remedy for this issue is to add variations in both the data and the parameters to increase the effective data size and model capacity. In particular, during training, we can use different subsets of data and repeat the random initialization from a prior distribution to form an ensemble of models. In this way, our modified MIS and quantile regression methods have integrated across different models to quantify the prediction uncertainty and have taken advantage of the Bayesian philosophy. Another solution to alleviate the quantile crossing issue is to minimize CRPS by assuming the quantile function to be a piecewise linear spline with monotonicity [19] , a method we call Spline Quantile regression (SQ). In the experiment, we also included this method for comparison. We describe the Bayesian approaches to UQ. These methods impose a prior distribution over the class of models and average across them, either via MCMC sampling or variational approximations. Stochastic Gradient MCMC (SG-MCMC) To estimate expectations or quantiles according to the posterior distribution over the parameter space, we use SG-MCMC [38, 59] . We find in practice that the stochastic gradient thermostat method (SGNHT) [12, 49] is particularly useful in controlling the stochastic gradient noise. This is consistent with the observation in [24] where stochastic gradient thermostat method is applied to an i.i.d. image classification task. To generate samples of model parameters (with a slight abuse of notation) according to SGNHT, we first denote the loss function (or the negative log-likelihood) over a minibatch of data as L ( ). We then introduce hyper-parameters including the diffusion coefficients and the learning rate ℎ and make use of auxiliary variables ∈ R and ∈ R in the algorithm. We randomly initialize , , and and update according to the following update rule. Upon convergence of the above algorithm at -th step, follows the distribution of the posterior. We run parallel chains to generate different samples according to the posterior and quantify the prediction uncertainty. Approximate Bayesian Inference SG-MCMC can be computationally expensive. There are also approximate Bayesian inference methods introduced to accelerate the inference procedures [14, 40] . In particular, the Monte Carlo (MC) drop-out method sets some of the network weights to zero according to a prior distribution [18] . This method serves as a simple alternative to variational Bayes methods which approximate the posterior [7, 22, 37, 47, 48] . We use the popular MC dropout [18] method in the experiments. The above-mentioned UQ methods have different properties. Table 1 shows an overall comparison of different UQ methods for deep spatiotemporal forecasting. The relative computation budget is displayed in the "Computation" row. The numbers represent the numbers of parallel cores required to finish computation within the same amount of time. It is worth noting that when comparing computational budgets, we assume only one level of confidence (usually 95%) is estimated. If confidence levels are required, then the computation complexity of quantile regression and MIS regression multiplies by , whereas We benchmark the performance of both frequentist and Bayesian UQ methods on many spatiotemporal forecasting applications, including air quality PM2.5, road network traffic, and COVID-19 incident deaths. Air quality PM2.5 is on a regular grid, while traffic and COVID-19 are on graphs. Air Quality PM 2.5. The air quality PM 2.5 dataset contains hourly weather and air quality data in Beijing provided by the KDD competition of 2018 1 . The weather data is at the grid level (21 × 31), which contains temperature, pressure, humidity, wind direction, and wind speed, forming a 21 × 31 × 5 tensor at each timestamp. The air quality data is from 35 stations in Beijing. The air quality index we considered is PM2.5 measured in micrograms per cubic meter of air ( / 3 ). According to the air quality index provided by the U.S. Environmental Protection Agency [54] , the air quality is unhealthy from 101 to 300, and hazardous from 301 to 500. We follow the setting in [36] to map PM2.5 to grid-based data using the weighted average over stations: Here 1, is the distance between station i and the grid cell k. is introduced to prevent divide-by-zero error. METR-LA Road Network Traffic. The METR-LA [26] dataset contains 4 months vehicle speed information from loop detector sensors in Los Angeles County highway system. The maximum speed limit on most California highways is 65 miles per hour. The task is to forecast the traffic speed for 207 sensors simultaneously. We build the spatial correlation matrix using road network distances between sensors and and a Gaussian kernel: The COVID-19 dataset contains reported deaths from Johns Hopkins University [13] and the death predictions from a mechanistic, stochastic, and spatial metapopulation epidemic model called Global Epidemic and Mobility Model (GLEAM) [5, 6, 9, 11, 53, 69] . Both data are recorded for the 50 US states during the period from May 24th to Sep 12th, 2020. We use the residual between the reported death and the corresponding GLEAM predictions to train the model (http://covid19.gleamproject.org/). We construct the spatial correlation matrix using air traffic between different states. Each element in the matrix is the average number of passengers traveling between two states daily. The air traffic data were obtained from the Official Aviation Guide (OAG) and the International Air Transportation Association (IATA) databases (updated in 2021) [25, 44] . See Appendix B.2 for details. For all experiments, we start with a deep learning point estimate model (Point) based on Sec. 3.1. We use as deep learning models ConvLSTM [62] for grid-based data and DCRNN [34] for graphbased data. Then we modify and adapt the deep learning models to incorporate 6 different UQ methods as in Sec. 4 . For the ConvLSTM model [62] , we follow the best setting reported in [36] with 3 × 3 filter, 256 hidden states, and 2 layers. We applied early stopping, autoregressive learning, and gradient clipping [67] to improve the generalization. As PM2.5 is closely related to weather, we use a combined loss For the DCRNN model, we use the exact setting in [34] for the traffic dataset and use calibrated network distance to construct the graph. We applied early stopping, curriculum learning, and gradient clipping [67] to improve the generalization. For the COVID-19 dataset, we chose to forecast the residual rather than directly predicting the death number because the latter has lower accuracy due to limited samples, see Sec. 6.2 for quantitative comparisons. We report the mean absolute error (MAE) as a point estimate metric. For UQ metric, we use MIS suggested in Sec 3.2 for 95% confidence intervals. As MIS combines both the interval and coverage, we also report the interval score, which is the average difference between the upper and lower bounds. Table 2 : Performance comparison of 6 different UQ methods applied to grid-based air quality PM2.5, graph-based traffic and COVID-19 incident death forecasting. (F) indicates frequentist methods and (B) standards for Bayesian methods. We compare these methods using the mean absolute error (MAE), the mean interval score (MIS), and the width of the confidence bounds. We adapt ConvLSTM and DCRNN to implement various UQ methods with the following parameter settings. Bootstrap We randomly dropped 50% of training data while keeping the original validation and testing data. We obtain 25 samples for constructing mean prediction and confidence intervals. Quantile regression We apply the pinball loss function [31] with the corresponding quantile (0.025, 0.5, 0.975). The model and learning rate setup is the same as point estimation. SQ regression We use the linear spline quantile function to approximate a quantile function and use the CRPS as the loss function. For every point prediction, there are 11 trained parameters to construct the quantile function. The 1 st parameter is the intercept term. The next 5 can be transformed to the slopes of 5 line segments. The last 5 can be transformed to a vector of the 5 knots' positions. The model and learning rate setup is the same as point estimation. MIS regression We combine MAE with MIS (equation 7) and directly minimize this loss function. The model and learning rate setup is the same as [34] . MC Dropout We implement the algorithm provided by [70] and simplify the model by only considering the model uncertainty. We apply random dropout through the testing process with 5% drop rate and iterate 50 times to achieve a stable prediction. The result for comparison averages the performance of 10 trails. We set the learning rate of SG-MCMC ℎ = 5 −4 , as defined in Eqn. 9, and we selected a Gaussian prior ∼ N (0, 4.0) with random initialization as 0 ∼ N (0, 0.2). We selected different priors and initialization parameters around the choices above. The performances are fairly stable for different choices as long as the model would not diverge (e.g., random initialization variance too large). We note here a symmetric Gamma prior could also work in this case with ∼ Γ(0.1, 1). We apply early stopping at epoch 50, and it helps to improve the generalization on testing set. Our result is averaged from 25 posterior samples. We use V100 GPUs to perform all the experiments. In addition to the MAE and MIS scores discussed in section 3.2, we also include the widths of the confidence bounds to demonstrate the coverage. Table 2 compares the forecasting performances for 6 UQ methods across three different applications: (1) PM 2.5 forecasting for next 12 -48 hours on the hourly recorded weather and air quality dataset, (2) 15min -1 hour ahead traffic speed forecasting on METR-LA and (3) four weeks ahead mortality forecasting on COVID-19. Performances for air quality are aggregated for the entire sequence following the KDD competition standard. Others are reported with the corresponding timestamp. Across the applications, we observe Bayesian method (SG-MCMC) leads to better prediction accuracy in MAE, even outperforming the point estimation models. On the other hand, frequentist methods generally outperform Bayesian methods for the inference of 95% confidence intervals in MIS. In specific, directly minimizing MIS score function achieves the best performance in MIS, which is expected. Quantile regression has a similar performance to the MIS-regression, but under-performs in both MAE and MIS. SQ enjoys good prediction accuracy, but has a larger MIS score compared to MIS-regression. MC dropout has relatively good MAE, but the MIS scores are often the highest among all UQ methods. We note that bootstrap still under-performs most methods due to the limited amount of samples we have. SQ regression helps avoid the confidence interval crossing issue. However, it suffers from a large MIS due to the small number of knots for linear spline quantile function construction. For MC dropout, its prediction accuracy is relatively robust but its MIS is unsatisfactory. We visualize the forecasts of Quantile regression (F) and SG-MCMC (B) for three different tasks in Figure 1, 2, 3 . Black solid lines indicate the ground truth. Green lines are the quantile regression predictions and orange lines are the forecasts from SG-MCMC. The shades represent the estimated 95% confidence intervals. In general, we can see that Bayesian method SG-MCMC generates mean predictions much closer to the ground truth but often comes with narrower credible intervals. For example, in PM 2.5 prediction, the confidence intervals of SG-MCMC fail to cover the ground truth for both 24 and 48 hours prediction on 03/06 to 03/08. On the other hand, frequentist method quantile regression typically captures the true values with wider confidence interval bounds. This corresponds to its low MIS score reported in Table 2 . Among all the tasks, the confidence interval bounds of quantile regression generally capture ground truth with 95% probability. For COVID-19 forecasting, quantile regression suffers from limited amount of samples, occasionally resulting in poor performances in states including Texas (US-TX) and New Jersey (US-NJ). Both bootstrap and SG-MCMC provide asymptotic consistency. Yet the sample complexities are often large for complex datasets. It is clear from Figure 4 that the MIS score decreases as more Monte Carlo samples are drawn for both the bootstrap method and SG-MCMC posterior sampling method, indicating that sample variance is a major contribution to the inaccuracies. More importantly, we observe that the MIS score decreases faster with SG-MCMC than with bootstrap, signaling a smaller sample complexity for SG-MCMC. For COVID-19 mortality forecasting, we have limited amount of data, which is insufficient to train a deep model. Instead, we train our model on the residual between the reported death and the corresponding GLEAM predictions, leading to a hybrid method called DeepGLEAM. Essentially, DeepGLEAM is learning the correction term in the mechanistic GLEAM model. An alternative approach is to directly train our model on the raw incident death counts, which we refer as Deep. In Figure 5 , we visualize the predictions from these three different methods: Deep, GLEAM, and DeepGLEAM models in California. The input data (for training or validation) are from weeks before 34, and we start the prediction from week 34, labeled by the vertical dash line. It can be observed that the Deep model fails to predict the dynamics of COVID-19 evolution. One reason for this phe- We conclude from Table 2 that in all the experiments we perform, better performance in MIS is related to not only higher accuracy but also larger confidence bounds. This phenomenon indicates that the deep learning models we consider in this paper generally tend to be overconfident and benefit greatly from better capturing of all sources of variations in the data. We conduct benchmark studies on uncertainty quantification in deep spatiotemporal forecasting from both Bayesian and frequentist perspectives. Through experiments on air quality, traffic and epidemics datasets, we conclude that, with a limited amount of computation, Bayesian methods are typically more robust in mean predictions, while frequentist methods are more effective in covering variations in the data. It is of interest to understand how to best combine Bayesian credible intervals with frequentist confidence intervals to excel in both mean predictions and confidence bounds. Another future direction worth exploring is how to explicitly make use of the spatiotemporal structure of the data in the inference procedures. Current methods we analyze treats minibatches of data as i.i.d. samples. For longer time series with a graph structure, it would be interesting to leverage the spatiotemporal nature of the data to construe efficient inference algorithms [1, 2, 39] . One limitation of current method is that the confidence interval does not reflect the prediction accuracy. We hypothesize this is due to the limited samples and complexity of the dynamics. 2 ∑︁ We prove in the following that the minimum of MIS score is achieved when we sort { 1 , · · · , } in an increasing order and take = ⌈ · /2⌉ and = − ⌊ · /2⌋ , which define the quantile of the empirical distribution formed by the samples { 1 , · · · , }. We first note that MIS ( , ) is a continuous function with respect to and . Since MIS ( , ) is piece-wise linear, we simply need to check that The result holds similarly for . In what follows we prove that MIS ( , ) is decreasing in whenever ≤ ⌈ · /2⌉ . Whenever ≥ ⌈ · /2⌉ , ( , ) is increasing in . We can use similar reasoning to prove that the minimum is achieved when = − ⌊ · /2⌋ . Consider derivative of MIS ( , ) over Since When ≤ ⌈ · /2⌉ , 2 is decreasing in whenever ≤ ⌈ · /2⌉ . Whenever ≥ ⌈ · /2⌉ , Therefore, = ⌈ · /2⌉ is the minimum of MIS ( , ). The above two facts complete the proof and conclude that arg min > ; , ∈R MIS ( , ) = ( − ⌊ · /2⌋ , ⌈ · /2⌉ ), which define the quantile of the empirical distribution formed by the samples { 1 , · · · , }. □ For the ConvLSTM experiment on air quality dataset, we concatenate the interpolated PM2.5 at grid level with the weather data and use 24-hour sequence as input to forecast PM2.5 in the next 48 hours. The training data is from January 2nd 2017 to January 31st 2018. We split it into 72-hour sequences. For every sequence, the data is normalized for each channel (temperature, pressure, humidity, wind direction, wind speed, PM2.5). We use one month data from February 1st to February 28th in 2018 for validation and the following month data from March 1st to March 26th for testing, using the first 26 and 24 24-hour records respectively as the input to make the next 48 hour's PM2.5 forecast. For the DCRNN model used in traffic dataset, we use two hidden layers of RNN with 64 units. The filter type is a dual random walk and the diffusion step is 2. The learning rate of DCRNN is fixed at 1 −2 with Adam optimizer. We perform early stopping at 50 epochs. For the COVID-19 dataset, We use an encoder-decoder sequence to sequence learning framework in the DCRNN structure. The encoder reads as input a 7 × 50 × 4 tensor that comprises the daily residuals between the observed death number and the GLEAM forecasts for the 50 US states over 4 different prediction horizons. It encodes the information in 7 hidden layers. The decoder produces forecasts of the weekly residuals between the true death number and the GLEAM forecasts for each state for the following 4 weeks. We perform autoregressive weekly death predictions (from one week ahead to four weeks ahead). The DCRNN model only has one hidden layer of RNN with 8 units to overcome the overfitting problem. The filter type is Laplacian and the diffusion step is 1. The base learning rate of DCRNN is 1 −2 and decay to 1 −3 at epoch 13 with Adam optimizer. We have a strict early stopping policy to deal with the overfitting problem. The training stops as the validation error does not improve for three epochs after epoch 13. The Global Epidemic and Mobility model (GLEAM) is a stochastic, spatial, age-structured epidemic model based on a metapopulation approach which divides the world into over 3,200 geographic subpopulations constructed using a Voronoi tessellation of the Earth's surface. Subpopulations are centered around major transportation hubs (e.g. airports) and consist of cells with a resolution of approximately 25 x 25 kilometers [5, 6, 9, 11, 53, 69] . High resolution data are used to define the population of each cell. Other attributes of individual subpopulations, such as age-specific contact patterns, health infrastructure, etc., are added according to available data [42] . GLEAM integrates a human mobility layer -represented as a network -that uses both short-range (i.e., commuting) and longrange (i.e., flight) mobility data from the Offices of Statistics for 30 countries on 5 continents as well as the Official Aviation Guide (OAG) and IATA databases (updated in 2021). The air travel network consists of the daily passenger flows between airport pairs (origin and destination) worldwide mapped to the corresponding subpopulations. Where information is not available, the short-range mobility layer is generated synthetically by relying on the "gravity law" or the more recent "radiation law" both calibrated using real data [51] . The model is calibrated to realistically describe the evolution of the COVID-19 pandemic as detailed in [9, 11] . Lastly, GLEAM is stochastic and produces an ensemble of possible epidemic outcomes for each set of initial conditions. To account for the potentially different reporting levels of the states, a free parameter Infection Fatality Rate (IFR) multiplier is added to each model. To calibrate and select the most reasonable outcomes, we filter the models by the latest hospitalization trends and confirmed cases trends, and then we select and weight the filtered models using Akaike Information Criterion [68] . The forecast of the evolution of the epidemic is formed by the final ensemble of the selected models. Figure 6 visualizes the forecasts of Quantile regression (F) and SG-MCMC(B) for COVID-19 for the rest of the 41 states in U.S. Black solid lines indicate the ground truth. Green lines are the quantile regression predictions and orange lines are the forecasts from SG-MCMC. The shades represent the estimated 95%confidence intervals. In general, we can see that SG-MCMC generates mean predictions much closer to the ground truth but often comes with narrower credible intervals In addition to autoregressive forecasting, we also build an ensemble DeepGLEAM model by training based on the ensembled data along the forecasting horizon as the output. We use the ensemble DeepGLEAM model which predicts the next 4 weeks death prediction together from the first decoder. Table 4 shows the performance among the UQ methods, which shares similar results with the autoregressive model. Stochastic gradient MCMC for state space models Frequentist Uncertainty in Recurrent Neural Networks via Blockwise Influence Functions Concrete problems in AI safety On the comparison of interval forecasts Multiscale mobility networks and the spatial spreading of infectious diseases Modeling the spatial spread of infectious diseases: The GLobal Epidemic and Mobility computational model Weight Uncertainty in Neural Network Verification of forecasts expressed in terms of probability The effect of travel restrictions on the spread of the 2019 novel coronavirus (COVID-19) outbreak Statistics for spatio-temporal data Bayesian sampling using stochastic gradient thermostats An interactive web-based dashboard to track COVID-19 in real time Efficient and scalable Bayesian neural nets with rank-1 factors Deep ensembles: A loss landscape perspective Dropout as a Bayesian approximation: Representing model uncertainty in deep learning Concrete dropout Probabilistic forecasting with spline quantile function RNNs Spatiotemporal multi-graph convolution network for ride-hailing demand forecasting Strictly proper scoring rules, prediction, and estimation Practical variational inference for neural networks A probabilistic forecast contest and the difficulty in assessing short-range forecast uncertainty Bayesian Inference for Large Scale Image Classification International Air Transport Association Big data and its technical challenges Glow: Generative flow with invertible 1x1 convolutions Auto-encoding variational Bayes Semi-Supervised Classification with Graph Convolutional Networks Adaptive, Distribution-Free Prediction Intervals for Deep Networks Quantile Regression Simple and scalable predictive uncertainty estimation using deep ensembles Why M heads are better than one: Training a diverse ensemble of deep networks Diffusion Convolutional Recurrent Neural Network: Data-Driven Traffic Forecasting Preserving Dynamic Attention for Long-Term Spatial-Temporal Prediction Air quality forecasting using convolutional LSTM Structured and efficient variational deep learning with matrix gaussian posteriors A complete recipe for stochastic gradient MCMC Stochastic gradient MCMC Methods for hidden Markov models A simple baseline for Bayesian uncertainty in deep learning Scoring rules for continuous probability distributions Inferring high-resolution human mixing patterns for disease modeling Bayesian learning for neural networks Aviation Worlwide Limited Deep Exploration via Bootstrapped DQN Highquality prediction intervals for deep learning: A distribution-free, ensembled approach Quantifying Point-Prediction Uncertainty in Neural Networks via Residual Estimation with an I/O Kernel Stochastic backpropagation and approximate inference in deep generative models Covariance-Controlled Adaptive Langevin Thermostat for Large-Scale Bayesian Sampling Feed-forward propagation in probabilistic neural networks with categorical and max layers A universal model for mobility and migration patterns Single-model uncertainties for deep learning Real-time numerical forecast of global epidemic spreading: case study of 2009 A/H1N1pdm Revised air quality standards for particle pollution and updates to the Air Quality Index (AQI) Quantifying uncertainty in discretecontinuous and skewed data with Bayesian deep learning Deep uncertainty quantification: A machine learning approach for weather forecasting Natural-parameter networks: A class of probabilistic neural networks Predrnn: Recurrent neural networks for predictive learning using spatiotemporal lstms Bayesian learning via stochastic gradient Langevin dynamics A multi-horizon quantile recurrent forecaster Bayesian Deep Learning and a Probabilistic Perspective of Generalization Convolutional LSTM network: A machine learning approach for precipitation nowcasting Revisiting spatial-temporal similarity: A deep learning framework for traffic prediction Deep multi-view spatial-temporal network for taxi demand prediction Deep distributed fusion network for air quality prediction Spatio-temporal graph convolutional networks: a deep learning framework for traffic forecasting Why gradient clipping accelerates training: A theoretical justification for adaptivity Forecasting seasonal influenza fusing digital indicators and a mechanistic disease model Spread of Zika virus in the Americas Deep and confident prediction for time series at uber Proof of Proposition 1. Since the distribution P of has probability density ,We demonstrate in the following that the minimum of MIS ∞ ( , ; ) is achieved when and define the (1 − ) confidence level. For simplicity of exposition, we let be symmetric around 0 and let = − . Thenwe reach the conclusion that the upper bound * that achieves the minimum MIS satisfies:Proof of Proposition 2. We think about minimizing the MIS score over lower and upper bounds and , given samples 1 , · · · , from the posterior distribution: