title: Back2Future: Leveraging Backfill Dynamics for Improving Real-time Predictions in Future
authors: Kamarthi, Harshavardhan; Rodríguez, Alexander; Prakash, B. Aditya
date: 2021-06-08

In real-time forecasting in public health, data collection is a non-trivial and demanding task. Often, after a value is initially released, it undergoes several revisions (due to human or technical constraints); as a result, it may take weeks until the data reaches a stable value. This so-called 'backfill' phenomenon and its effect on model performance have barely been studied in the prior literature. In this paper, we introduce the multi-variate backfill problem using COVID-19 as the motivating example. We construct a detailed dataset composed of relevant signals over the past year of the pandemic. We then systematically characterize several patterns in backfill dynamics and leverage our observations to formulate a novel problem and a neural framework, Back2Future, that aims to refine a given model's predictions in real time. Our extensive experiments demonstrate that our method refines the predictions of top models for COVID-19 forecasting, yielding an 18% improvement over non-trivial baselines and enabling us to obtain a new SOTA performance. In addition, we show that our model improves model evaluation as well; hence policy-makers can better understand the true accuracy of forecasting models in real time.

The current COVID-19 pandemic has challenged our response capabilities to large disruptive events, affecting the health and economy of millions of people. A major tool in our response has been forecasting epidemic trajectories, which has provided lead time to policymakers to optimize and plan interventions (Holmdahl & Buckee, 2020). Broadly, two classes of approaches have been devised: traditional mechanistic epidemiological models (Shaman & Karspeck, 2012; Zhang et al., 2017), and newer statistical approaches (Adhikari et al., 2019; Osthus et al., 2019b), including deep learning models (Adhikari et al., 2019; Deng et al., 2020; Panagopoulos et al., 2021; Rodríguez et al., 2021a), which have become among the top-performing methods for multiple forecasting tasks (Reich et al., 2019). These also leverage newer digital indicators like search queries (Ginsberg et al., 2009; Yang et al., 2015) and social media (Culotta, 2010; Lampos et al., 2010). As noted in multiple previous works (Metcalf & Lessler, 2017; Biggerstaff et al., 2018), epidemic forecasting is a challenging enterprise because it is affected by weather, mobility, strains, and other factors. However, real-time forecasting also brings new challenges. As noted in multiple CDC real-time forecasting initiatives for diseases like flu (Osthus et al., 2019a) and COVID-19 (Cramer et al., 2021), as well as in macroeconomics (Clements & Galvão, 2019; Aguiar, 2015), the initially released public health data is revised many times afterwards, a phenomenon known as 'backfill'. The factors that affect backfill are multiple and complex, ranging from surveillance resources to human factors like coordination between health institutes and government organizations within and across regions (Reich et al., 2019; Altieri et al., 2021; Stierholz, 2017).
While previous works have addressed anomalies (Liu et al., 2017), missing data (Yin et al., 2020), and data delays (Žliobaite, 2010) in general time-series problems, the topic of data revisions has not received as much attention, with few exceptions. For example, in epidemic forecasting, a few papers have either (a) mentioned the 'backfill problem' and its effects on performance (Rodríguez et al., 2021b; Altieri et al., 2021; Rangarajan et al., 2019) and evaluation (Reich et al., 2019); (b) proposed to address the problem via simple models like linear regression (Chakraborty et al., 2014) or by 'backcasting' the observed targets; or (c) used data assimilation and sensor fusion from a readily available stable set of features to refine unrevised features (Farrow, 2016; Osthus et al., 2019a). However, these works focus only on revisions in the target and typically study it in the context of influenza forecasting, which is substantially less noisy and more regular than the novel COVID-19 pandemic, or they assume access to stable values for some features, which is not the case for COVID-19. In economics, Clements & Galvão (2019) survey several domain-specific (Carriero et al., 2015) or essentially linear techniques for the data revision/correction behavior of several macroeconomic indicators (Croushore, 2011). Motivated by the above, we study the more challenging problem of multi-variate backfill for both features and targets. We go beyond prior work and also show how to leverage our insights in a more general neural framework to improve both predictions (i.e., refinement of the model's predictions) and performance evaluation (i.e., rectification from the evaluator's perspective). Our specific contributions are the following:
• Multi-variate backfill problem: We introduce the multi-variate backfill problem using real-time epidemiological forecasting as the primary motivating example. In this challenging setting, which generalizes (the limited) prior work, the forecast targets, as well as exogenous features, are subject to retrospective revision. Using a carefully collected, diverse dataset for COVID-19 forecasting over the past year, we discover several patterns in backfill dynamics, show that there is a significant difference between real-time and revised feature measurements, and highlight the negative effects of using unrevised features for incidence forecasting in different models, both for model performance and evaluation. Building on our empirical observations, we formulate the problem BFRP, which aims to 'correct' given model predictions to achieve better performance on the eventual fully revised data.
• Spatial- and feature-level backfill modeling to refine model predictions: Motivated by the patterns in revision and the observations from our empirical study, we propose a deep-learning model Back2Future (B2F) to model backfill revision patterns and derive latent encodings for features. B2F combines Graph Convolutional Networks, which capture sparse cross-feature and cross-regional similarity in backfill dynamics, with deep sequential models that capture the temporal dynamics of each feature's backfill across time. The latent representation of all features is used along with the history of the model's predictions to improve diverse classes of models trained on real-time targets, so that they predict targets closer to the revised ground-truth values.
Our technique can be used as a 'wrapper' to improve the performance of any forecasting model (mechanistic or statistical).
• Refined top models' predictions and improved model evaluation: We perform an extensive empirical evaluation to show that incorporating backfill dynamics through B2F consistently and significantly improves the performance of diverse classes of top-performing COVID-19 forecasting models (from the CDC COVID-19 Forecast Hub, including the top-performing official ensemble). We also utilize B2F to help forecast evaluators and policy-makers better estimate the 'eventual' true accuracy of participating models (against revised ground truth, which may not be available until weeks later). This allows model evaluators to quickly identify models that perform better w.r.t. revised, stable targets instead of potentially misleading current targets. Our methodology can be adapted to other time-series forecasting problems in general. We also show the generalizability of our framework and model B2F to other domains by significantly improving the predictions of non-trivial baselines for US national GDP forecasting (Marcellino, 2008; Tkacz & Hu, 1999).
In this section, we study important properties of the revision dynamics of our signals. We introduce some concepts and definitions to aid the understanding of our empirical observations and method.
Real-time forecasting. We are given a set of signals F = Reg × Feat, where Reg is the set of all regions (where we want to forecast) and the set Feat contains our features and forecasting target(s) for each region. At prediction week t, x^{(t)}_{i,1:t} is the time series from week 1 to week t for feature i, and the set of all features results in the multi-variate time series X^{(t)}_{1:t}; similarly, y^{(t)}_{1:t} is the forecasting target(s) time series. (In practice, delays are possible too, i.e., at week t we may have data for some feature i only until t − δ_i. All our results incorporate these situations; we defer the minor notational extensions needed to the Appendix for clarity.) Further, let us call all data available at time t, D^{(t)}_{1:t} = {X^{(t)}_{1:t}, y^{(t)}_{1:t}}, the real-time sequence. For clarity, we refer to a 'signal' i ∈ F as a sequence of either a feature or a target, and denote it as d^{(t)}_{i,1:t}. Thus, at prediction week t, the real-time forecasting problem is: given D^{(t)}_{1:t}, predict the next k values of the forecasting target(s), i.e., ŷ_{t+1:t+k}. Typically for CDC settings and in this paper, our time unit is a week, k = 4 (up to 4 weeks ahead), and our target is COVID-19 mortality incidence (Deaths).
Revisions. Data revisions ('backfill') are common. At prediction week t+1, the real-time sequence D^{(t+1)}_{1:t+1} is available. In addition to the length of the sequences increasing by one (new data point), values already present in D^{(t)}_{1:t} may have been revised in D^{(t+1)}_{1:t+1}. Note that previous work has studied backfill limited to y^{(t)}, while we address it in both X^{(t)} and y^{(t)}. Also note that the data in the backfill is the same data used for real-time forecasting, just seen from a different perspective.
Backfill sequences: Another useful way we propose to look at backfill is by focusing on the revisions of a single value. Consider the value of signal i at an observation week t'. For this observation week, the value of the signal can be revised at any t > t', which induces a sequence of revisions. We refer to the revision week r ≥ 0 as the relative amount of time that has passed since the observation week t'.
Defn. 1. (Backfill sequence BSEQ) For signal i and observation week t', its backfill sequence is BSEQ(i, t') = [d^{(t')}_{i,t'}, d^{(t'+1)}_{i,t'}, d^{(t'+2)}_{i,t'}, ...], where d^{(t')}_{i,t'} is the initial value of the signal and d^{(∞)}_{i,t'} denotes its eventual stable value.
Defn. 2. (Backfill error BERR) For signal i, observation week t', and revision week r, BERR(r, i, t') measures the relative difference between the value observed at revision week r, d^{(t'+r)}_{i,t'}, and the stable value d^{(∞)}_{i,t'}.
Defn. 3.
(Stability time STIME) The stability time of a backfill sequence BSEQ is the minimum revision week r* for which the backfill error BERR < ε for all r > r*, i.e., the time when BSEQ stabilizes.
Note: We ensured that BSEQ length is at least 7, and found that in our dataset most signals stabilize before r = 20. For d^{(∞)}_{i,t'}, we use d^{(t_f)}_{i,t'}, the value at the final week t_f in our revisions dataset. In case we do not find BERR < ε in any BSEQ, we set STIME to the length of that BSEQ. We use ε = 0.
We collected important publicly available signals from a variety of trusted sources that are relevant to COVID-19 forecasting to form the COVID-19 Surveillance Dataset (CoVDS). See Table 1 for the list of 20 features (|Feat| = 21, including Deaths). Our revisions dataset contains signals that we collected every week from April 2020 to July 2021. Our analysis covers 30 observation weeks from June 2020 to December 2020 (to ensure all our backfill sequences have length at least 7) for all |Reg| = 50 US states. The rest of the unseen data, from January 2021 to July 2021, is used strictly for evaluation.
Patient line-list: traditional surveillance signals used in epidemiological models (Chakraborty et al., 2014), derived from line-list records, e.g., hospitalizations from CDC (CDC, 2020), and positive cases and ICU admissions from COVID Tracking (COVID-Tracking, 2020).
Testing: signals that measure changes in testing, from CDC and COVID-Tracking, e.g., tested population and negative tests, used by Rodríguez et al. (2021b).
Mobility: signals that track people's movement to several points of interest (POIs), from Google (2020) and Apple (2020), which serve as digital proxies for social distancing (Arik et al., 2020).
Exposure: a digital signal measuring closeness between people at POIs, collected from mobile phones (Chevalier et al., 2021).
Social Survey: the CMU/Facebook Symptom Survey data, previously used by Rodríguez et al. (2021b), which contains self-reported responses about COVID-19 symptoms.
We first study different facets of the significance of backfill in CoVDS. Using our definitions, we generate a backfill sequence for every combination of signal, observation week, and region (not all signals are available for all regions). In total, we generate more than 30,000 backfill sequences.
Backfill error BERR is significant. We computed BERR for the initial values, i.e., BERR(r = 0, i, t'), for all signals i and observation weeks t'.
Obs. 1. (BERR across signals and regions) Compute the average BERR for each signal; the median of all these averages is 32%, i.e., at least half of all signals are corrected by at least 32% of their initial value. Similarly, in at least half of the regions, the signal corrections are at least 280% of their initial value.
We also found large variation of BERR. For features (Figure 1a), compare the average BERR of 1743% for the five most corrected features with 1.6% for the five least corrected features. Also, in contrast to related work that focuses on traditional surveillance data (Yang et al., 2015), perhaps unexpectedly, we found that digital indicators also have a significant BERR (average of 108%). For regions (see Figure 1b), compare 1594% for the five most corrected regions with 38% for the five least corrected regions.
Stability time STIME is significant. A similar analysis for STIME found significant variation across signals (from 1 week to 21 weeks; see Figure 1c for STIME across feature types) and across regions (from 1.55 weeks for GA to 3.83 weeks for TX; see Figure 1d).
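To make Definitions 1-3 concrete, below is a minimal sketch (in Python with NumPy) of how BSEQ, BERR, and STIME could be computed from weekly data snapshots. The data layout (a mapping from release week to the values reported for each observation week), the normalization in `berr`, and the threshold `eps` are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def backfill_sequence(snapshots, signal, obs_week, final_week):
    """BSEQ(i, t') (Defn. 1): the values reported for (signal, obs_week) in each
    weekly data release from obs_week up to final_week. The layout
    snapshots[release_week][signal][obs_week] is a hypothetical data structure."""
    return np.array([snapshots[r][signal][obs_week]
                     for r in range(obs_week, final_week + 1)], dtype=float)

def berr(bseq, r):
    """BERR(r) (Defn. 2): relative deviation of the value at revision week r from
    the stable value, taken here as the last entry of the sequence. The exact
    normalization used in the paper may differ; this is an illustrative choice."""
    stable = bseq[-1]
    return abs(bseq[r] - stable) / max(abs(stable), 1e-8)

def stime(bseq, eps=0.05):
    """STIME (Defn. 3): the smallest revision week r* such that BERR(r) < eps for
    all r >= r*; set to len(bseq) if the sequence never stabilizes. The threshold
    eps is an assumed value for illustration."""
    errors = np.array([berr(bseq, r) for r in range(len(bseq))])
    for r_star in range(len(bseq)):
        if np.all(errors[r_star:] < eps):
            return r_star
    return len(bseq)
```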
Such large stability times also affect our target; thus, the actual accuracy of forecasts is not readily available, which undermines real-time evaluation and decision making.
Obs. 2. (STIME of features and target) Compute the average STIME for each signal; the average of all these averages is around 4 weeks for features and around 3 weeks for our target Deaths, i.e., on average it takes over 3 weeks to reach the stable values of features.
Backfill sequence BSEQ patterns. There is significant similarity among BSEQs. We cluster BSEQs via K-means using Dynamic Time Warping (DTW) as the pair-wise distance (since DTW can handle sequences of varying magnitude and length). We found five canonical categories of behaviors (see Figure 2), each of size roughly 11.58% of all BSEQs.
Obs. 3. (Cross-signal BSEQ similarity) Each cluster is not defined only by a single signal or region; similar backfill behaviors are shared across signals and regions.
Model performance vs. BERR. To study the relationship between model performance (via the Mean Absolute Error MAE of a prediction) and BERR, we use REVDIFFMAE: the difference between the MAE computed against the real-time target value and the MAE computed against the stable target value. We analyze the top-performing real-time forecasting models as per the comprehensive evaluation of all models in the COVID-19 Forecast Hub (Cramer et al., 2021). YYG and UMASS-MB are mechanistic models, while CMU-TS and GT-DC are statistical models. The top-performing ENSEMBLE is composed of all models contributing to the hub. We would expect a well-trained real-time model to have a higher REVDIFFMAE with larger BERR in its target (Reich et al., 2019). However, we found that higher BERR does not necessarily mean worse performance; see Figure 3: YYG even has better performance with more revisions. This may be due to the more complex backfill activity and dependencies in COVID-19 compared to the more regular seasonal flu.
Obs. 4. (Model performance and backfill) The relation between BERR and REVDIFFMAE can be non-monotonic and positively or negatively correlated, depending on the model and signal.
Real-time target values to measure model performance: Since targets undergo revisions (5% BERR on average), we study how this BERR affects the real-time evaluation of models. From Figure 4, we see that real-time and stable scores are not similar, with real-time scores overestimating model accuracy. The average difference in scores is positive, which implies that evaluators would overestimate models' forecasting ability.
Obs. 5. MAE evaluated in real time overestimates model performance by 9.6 on average, with the maximum for TX at 22.63.
Our observations naturally motivate improving the training and evaluation aspects of real-time forecasting by leveraging revision information. Thus, we propose the following two problems. Let the prediction of model M made at week t for week t+k be y(M, k)_t.
Backfill Refinement Problem, BFRP_k(M): For a model M trained on real-time targets, given the history of the model's predictions up to last week, y(M, k)_1, ..., y(M, k)_{t−1}, and its prediction for the current week, y(M, k)_t, our goal is to refine y(M, k)_t to better estimate the stable target y^{(t_f)}_{t+k}, i.e., the 'future' of our target value at t+k.
Leaderboard Refinement Problem, LBRP: At each week t, evaluators are given a current estimate of our target, y^{(t)}_t, and the forecasts for week t submitted by models at week t−k. Our goal is to refine y^{(t)}_t to ŷ_t, a better estimate of y^{(t_f)}_t, so that using ŷ_t as a surrogate for y^{(t_f)}_t to evaluate the predictions of models provides a better indicator of their actual performance (i.e., we obtain a refined leaderboard of models).
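The two problems share the same interface: a refinement operator takes whatever is known at week t and outputs a corrected estimate of the stable target. A minimal sketch of that interface is below; the function and argument names are hypothetical, and `refine` stands in for whatever refinement method (such as B2F, described next) is plugged in.

```python
from typing import Callable, Dict, Sequence

# A refiner maps (history of past estimates, the current estimate, and whatever
# revision information is available at week t) to a corrected estimate of the
# stable target. All names below are hypothetical.
Refiner = Callable[[Sequence[float], float, dict], float]

def solve_bfrp(refine: Refiner, pred_history: Sequence[float],
               current_pred: float, revision_data: dict) -> float:
    """BFRP_k(M): refine model M's current prediction toward the stable target."""
    return refine(pred_history, current_pred, revision_data)

def solve_lbrp(refine: Refiner, target_history: Sequence[float],
               current_realtime_target: float, revision_data: dict,
               submitted_forecasts: Dict[str, float]) -> Dict[str, float]:
    """LBRP: refine the current real-time target estimate, then score the
    submitted forecasts against the refined surrogate (absolute error here)."""
    y_hat = refine(target_history, current_realtime_target, revision_data)
    return {name: abs(pred - y_hat) for name, pred in submitted_forecasts.items()}
```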
Relating it to BFRP, assume a hypothetical model M_eval whose predictions are the real-time ground truth, i.e., y(M_eval, 0)_t = y^{(t)}_t for all t. Then refining M_eval is equivalent to refining y^{(t)}_t to better estimate y^{(t_f)}_t, which solves LBRP. Thus, LBRP is a special case of BFRP_0(M_eval).
Overview: We leverage the observations from Section 2 to derive Back2Future (B2F), a deep-learning model that uses revision information from BSEQs to refine predictions. Obs. 1 and 2 show that real-time values of signals are poor estimates of stable values. Therefore, we leverage patterns in the BSEQs of past signals and exploit cross-signal similarities (Obs. 3) to extract information from BSEQs. We also account for the fact that the relation of models' forecasts to the BERR of targets is complex (Obs. 4 and 5) when refining their predictions. B2F combines these ideas through its four modules; a high-level sketch of how they compose is given at the end of this overview:
• GRAPHGEN: Generates a signal graph (where each node maps to a signal in Reg × Feat) whose edges are based on BSEQ similarities.
• BSEQENC: Leverages the signal graph as well as the temporal dynamics of BSEQs to learn a latent representation of BSEQs using a Recurrent Graph Neural Network.
• MODELPREDENC: Encodes the history of the model's predictions, the real-time value of the target, and past revisions of the target through a recurrent neural network.
• REFINER: Combines the encodings from BSEQENC and MODELPREDENC to predict the correction to the model's real-time prediction.
In contrast to previous works that study target BERR (Reich et al., 2019), we simultaneously model all BSEQs available until the current week t using spatial and signal similarities in the temporal dynamics of BSEQs. Recent works that model spatial relations for COVID-19 forecasting need explicit structural data (like cross-region mobility) (Panagopoulos et al., 2020) to generate a graph, or use attention over the temporal patterns of regions' death trends. B2F, in contrast, directly models the structural information of the signal graph (containing features from each region) using BSEQ similarities. Thus, we first generate useful latent representations for each signal based on the BSEQ revision information of that signal as well as of signals that have shown similar revision patterns in the past. Due to the large number of signals covering all regions, we cannot model the relations between every pair of signals using fully-connected modules or attention, as in Jin et al. (2020). Therefore, we first construct a sparse graph between signals based on past BSEQ similarities. We then inject this similarity information using Graph Convolutional Networks (GCNs) and combine it with deep sequential models that capture the temporal dynamics of each signal's BSEQ while aggregating information from the BSEQs of neighboring signals in the graph. Further, we use these latent representations together with the history of a model M's predictions to refine its prediction. Thus, B2F solves BFRP_k(M) treating M as a black box, accessing only its past forecasts. Our training process, which involves pre-training on a model-agnostic auxiliary task, greatly reduces the training time for refining any given model M. The full pipeline of B2F is shown in Figure 5. Next, we describe each of the components of B2F in detail. For the rest of this section, we assume that we are forecasting k weeks ahead given data up to the current week t.
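As promised above, here is a minimal, hedged sketch of how the four modules could compose at week t (PyTorch-style pseudocode). All class and argument names are hypothetical, and the architectural details live in the module descriptions that follow; this is not the authors' exact code.

```python
import torch
import torch.nn as nn

class Back2Future(nn.Module):
    """Illustrative composition of the four B2F modules (a sketch, not the paper's code)."""
    def __init__(self, graphgen, bseq_enc, model_pred_enc, refiner):
        super().__init__()
        self.graphgen = graphgen              # builds sparse signal graph from past BSEQs
        self.bseq_enc = bseq_enc              # recurrent GNN over BSEQs on that graph
        self.model_pred_enc = model_pred_enc  # RNN over model M's prediction/target history
        self.refiner = refiner                # attention + feed-forward correction head

    def forward(self, bseqs, pred_history, current_pred):
        graph = self.graphgen(bseqs)                       # G_t from DTW similarities
        signal_enc = self.bseq_enc(bseqs, graph)           # one encoding per signal i
        pred_enc = self.model_pred_enc(pred_history)       # summary of prediction history
        gamma = self.refiner(signal_enc, pred_enc, current_pred)  # correction in [-1, 1]
        return (1.0 + gamma) * current_pred                # refined prediction
```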
GRAPHGEN. GRAPHGEN generates an undirected signal graph G_t = (V, E_t) whose edges represent similarity between the BSEQs of signals, with vertices V = F = Reg × Feat. We measure similarity using the DTW distance, for the reasons described in Section 2. GRAPHGEN leverages the similarities across BSEQ patterns irrespective of the exact nature of the canonical behaviors, which may vary across domains. For each pair of nodes, we compute the sum of DTW distances between their BSEQs over observation weeks t' ∈ {1, 2, ..., t−5}. We restrict t' to at most t−5 so that the BSEQs have a reasonable length (at least 5) to capture temporal similarity without discarding too many BSEQs. The top τ node pairs with the lowest total DTW distance are assigned edges.
BSEQENC. While we could model the backfill sequence of each signal independently using a recurrent neural network, this would not capture the behavioral similarity of BSEQs across signals. Using a fully-connected recurrent neural network that considers all possible interactions between signals may also fail to learn from the similarity information, due to the sheer number of signals (50 × 21 = 1050), while greatly increasing the number of parameters of the model. Thus, we utilize the structural prior of the graph G_t generated by GRAPHGEN and train an autoregressive model BSEQENC, a graph recurrent neural network that encodes a latent representation for each backfill sequence in B_t = {BSEQ(i, t') : i ∈ F, t' ≤ t}. At week t, BSEQENC is first pre-trained and then fine-tuned for a specific model M (more details later in this section). Our encoding process is illustrated in Figure 5. Let BSEQ_{t'+r}(i, t') be the first r+1 values of BSEQ(i, t') (i.e., up to week t'+r). For a past week t' and revision week r, we denote by h^{(t_r)}_{i,t'} ∈ R^m the latent encoding of BSEQ_{t_r}(i, t'), where t_r = t'+r and t' ≤ t_r ≤ t. We initialize h^{(t')}_{i,t'} and, at each revision week, update the encodings using a recurrent unit combined with graph convolutions over G_t; thus, h^{(t_r)}_{i,t'} contains information from BSEQ_{t_r}(i, t') as well as structural priors from G_t. Using h^{(t_r)}_{i,t'}, BSEQENC predicts the next value d^{(t_r+1)}_{i,t'} by passing the encoding through a 2-layer feed-forward network FFN_i, i.e., d̂^{(t_r+1)}_{i,t'} = FFN_i(h^{(t_r)}_{i,t'}). During inference, we only have access to the real-time values of signals for the current week, so we roll BSEQENC forward autoregressively for l steps to obtain the encodings {h^{(t+l)}_{i,t}}_{i∈F}, where l is a hyperparameter.
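A minimal sketch of a BSEQENC-style encoder is below. It assumes the GRAPHGEN output is supplied as a row-normalized adjacency matrix, and it interleaves a per-signal GRU update with one graph-convolution step per revision week; the exact recurrent cell, initialization, and per-signal heads FFN_i in the paper may differ, so treat the class and argument names as hypothetical.

```python
import torch
import torch.nn as nn

class BSeqEncoder(nn.Module):
    """Recurrent graph encoder over backfill sequences (illustrative sketch)."""
    def __init__(self, num_signals, hidden_dim):
        super().__init__()
        self.gru = nn.GRUCell(input_size=1, hidden_size=hidden_dim)  # per-revision update
        self.gcn_weight = nn.Linear(hidden_dim, hidden_dim)          # one GCN layer
        # One small prediction head per signal (FFN_i) for the autoregressive task.
        self.heads = nn.ModuleList(
            nn.Sequential(nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
                          nn.Linear(hidden_dim, 1))
            for _ in range(num_signals))

    def forward(self, bseqs, adj_norm):
        """bseqs: (num_signals, num_revisions) float tensor of revision values for
        one observation week. adj_norm: (num_signals, num_signals) row-normalized
        adjacency of the signal graph. Returns final encodings h and the
        next-revision prediction for every signal."""
        n, r = bseqs.shape
        h = torch.zeros(n, self.gru.hidden_size)
        for step in range(r):
            h = self.gru(bseqs[:, step:step + 1], h)       # temporal update per signal
            h = torch.relu(self.gcn_weight(adj_norm @ h))  # mix neighbors' encodings
        preds = torch.cat([head(h[i:i + 1]) for i, head in enumerate(self.heads)])
        return h, preds
```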
MODELPREDENC. To learn from the history of a model's predictions and its relation to target revisions, MODELPREDENC encodes the history of the model's predictions, the previous real-time targets, and the revised (up to the current week) targets using a Recurrent Neural Network. Given a model M, for each observation week t' ∈ {1, 2, ..., t−1−k}, we concatenate the model's prediction y(M, k)_{t'}, the real-time target y^{(t'+k)}_{t'+k} (the value for week t'+k as first reported), and the revised target y^{(t)}_{t'+k} (its value as of the current week t); the RNN encodes this sequence into a summary z^{(t)}_{t−k−1}.
REFINER. The REFINER leverages the information from the above three modules of B2F to refine model M's prediction for the current week, y(M, k)_t. Specifically, it receives the latent encodings of the signals {h^{(t+l)}_{i,t}}_{i∈F} from BSEQENC, the encoding z^{(t)}_{t−k−1} from MODELPREDENC, and the model's prediction y(M, k)_t for week t. The BSEQ encodings of different signals may have a variable impact on the refinement, since some signals may not be very useful for the current week's forecast (e.g., small revisions in mobility signals may not be important in some weeks). Moreover, because different models use the signals from CoVDS differently, we may need to focus on some signals over others to refine a given model's prediction. Therefore, we first take attention over the BSEQ encodings of all signals {h^{(t+l)}_{i,t}}_{i∈F} w.r.t. y(M, k)_t, using a multiplicative attention mechanism with parameter w ∈ R^m based on Vaswani et al. (2017); this yields an aggregated encoding h̄^{(t)}. Finally, we combine h̄^{(t)} and z^{(t)}_{t−k−1} through a two-layer feed-forward network FFN_RF that outputs a 1-dimensional value followed by a tanh activation to obtain the correction γ_t ∈ [−1, 1], i.e., γ_t = tanh(FFN_RF(h̄^{(t)} ⊕ z^{(t)}_{t−k−1})). The refined prediction is y*(M, k)_t = (γ_t + 1) y(M, k)_t. Note that we limit the correction made by B2F to at most the magnitude of the model's prediction because the average BERR of targets is 4.9% and fewer than 0.6% of them have BERR over 1; therefore, we restrict the refinement to this range.
Training: B2F is trained in two steps: 1) a model-agnostic autoregressive BSEQ prediction task to pre-train BSEQENC; and 2) model-specific training for BFRP.
Autoregressive BSEQ prediction: Pre-training on auxiliary tasks to improve the quality of latent embeddings is a well-known technique for deep learning methods (Devlin et al., 2019; Radford et al., 2018). We pre-train BSEQENC to predict the next values of the backfill sequences, {d^{(t_r+1)}_{i,t'}}_{i∈F}. Note that we only use the backfill sequences {BSEQ_t(i, t')}_{i∈F, t'<t} observed up to the current week t.
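The two training stages could be wired up roughly as follows. This is an illustrative sketch under stated assumptions (MSE losses, Adam, the `BSeqEncoder` sketched earlier, and hypothetical `pred_encoder` and `refiner` modules where `refiner` outputs the correction γ_t in [−1, 1]); it is not the authors' exact training code.

```python
import torch

def pretrain_bseq_encoder(encoder, bseq_batches, adj_norm, epochs=10, lr=1e-3):
    """Stage 1 (model-agnostic): teach the encoder to predict the next revision
    of each backfill sequence available up to the current week."""
    opt = torch.optim.Adam(encoder.parameters(), lr=lr)
    for _ in range(epochs):
        for bseqs, next_revision in bseq_batches:   # next_revision: (num_signals,)
            _, preds = encoder(bseqs, adj_norm)
            loss = torch.mean((preds.squeeze(-1) - next_revision) ** 2)
            opt.zero_grad()
            loss.backward()
            opt.step()

def finetune_for_model(encoder, pred_encoder, refiner, train_weeks, adj_norm, lr=1e-3):
    """Stage 2 (model-specific): fine-tune so that the corrected prediction
    (1 + gamma) * y_t matches the stable target on past weeks."""
    params = (list(encoder.parameters()) + list(pred_encoder.parameters())
              + list(refiner.parameters()))
    opt = torch.optim.Adam(params, lr=lr)
    for bseqs, pred_history, y_t, y_stable in train_weeks:
        h, _ = encoder(bseqs, adj_norm)
        z = pred_encoder(pred_history)
        gamma = refiner(h, z, y_t)                  # correction in [-1, 1] via tanh
        loss = ((1.0 + gamma) * y_t - y_stable) ** 2
        opt.zero_grad()
        loss.backward()
        opt.step()
```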