title: Back2Future: Leveraging Backfill Dynamics for Improving Real-time Predictions in Future
authors: Kamarthi, Harshavardhan; Rodríguez, Alexander; Prakash, B. Aditya
date: 2021-06-08

In real-time forecasting in public health, data collection is a non-trivial and demanding task. Often, after a value is initially released, it undergoes several revisions (due to human or technical constraints); as a result, it may take weeks until the data reaches a stable value. This so-called 'backfill' phenomenon and its effect on model performance have barely been studied in the prior literature. In this paper, we introduce the multi-variate backfill problem using COVID-19 as the motivating example. We construct a detailed dataset composed of relevant signals over the past year of the pandemic. We then systematically characterize several patterns in backfill dynamics and leverage our observations to formulate a novel problem and a neural framework, Back2Future, that aims to refine a given model's predictions in real time. Our extensive experiments demonstrate that our method refines the predictions of top models for COVID-19 forecasting, yielding an 18% improvement over non-trivial baselines and enabling us to obtain a new SOTA performance. In addition, we show that our model improves model evaluation as well; hence policy-makers can better understand the true accuracy of forecasting models in real time.

The current COVID-19 pandemic has challenged our response capabilities to large disruptive events, affecting the health and economy of millions of people. A major tool in our response has been forecasting epidemic trajectories, which has provided lead time to policymakers to optimize and plan interventions (Holmdahl & Buckee, 2020). Broadly, two classes of approaches have been devised: traditional mechanistic epidemiological models (Shaman & Karspeck, 2012; Zhang et al., 2017), and newer statistical approaches (Adhikari et al., 2019; Osthus et al., 2019b), including deep learning models (Adhikari et al., 2019; Deng et al., 2020; Panagopoulos et al., 2021; Rodríguez et al., 2021a), which have become among the top-performing methods for multiple forecasting tasks (Reich et al., 2019). These also leverage newer digital indicators like search queries (Ginsberg et al., 2009; Yang et al., 2015) and social media (Culotta, 2010; Lampos et al., 2010). As noted in multiple previous works (Metcalf & Lessler, 2017; Biggerstaff et al., 2018), epidemic forecasting is a challenging enterprise because it is affected by weather, mobility, strains, and other factors. However, real-time forecasting also brings new challenges. As noted in multiple CDC real-time forecasting initiatives for diseases like flu (Osthus et al., 2019a) and COVID-19 (Cramer et al., 2021), as well as in macroeconomics (Clements & Galvão, 2019; Aguiar, 2015), the initially released public health data is revised many times afterwards, a phenomenon known as 'backfill'. The factors that affect backfill are multiple and complex, ranging from surveillance resources to human factors like coordination between health institutes and government organizations within and across regions (Reich et al., 2019; Altieri et al., 2021; Stierholz, 2017).
While previous works have addressed anomalies (Liu et al., 2017), missing data (Yin et al., 2020), and data delays (Žliobaite, 2010) in general time-series problems, the topic of data revisions has not received as much attention, with few exceptions. For example, in epidemic forecasting, a few papers have either (a) mentioned the 'backfill problem' and its effects on performance (Rodríguez et al., 2021b; Altieri et al., 2021; Rangarajan et al., 2019) and evaluation (Reich et al., 2019); (b) proposed to address the problem via simple models like linear regression (Chakraborty et al., 2014) or by 'backcasting' the observed targets; or (c) used data assimilation and sensor fusion from a readily available stable set of features to refine unrevised features (Farrow, 2016; Osthus et al., 2019a). However, these works focus only on revisions in the target and typically study it in the context of influenza forecasting, which is substantially less noisy and more regular than the novel COVID-19 pandemic, or they assume access to stable values for some features, which is not the case for COVID-19. In economics, Clements & Galvão (2019) survey several domain-specific (Carriero et al., 2015) or essentially linear techniques for the data revision/correction behavior of several macroeconomic indicators (Croushore, 2011). Motivated by the above, we study the more challenging problem of multi-variate backfill for both features and targets. We go beyond prior work and also show how to leverage our insights in a more general neural framework to improve both predictions (i.e., refinement of the model's predictions) and performance evaluation (i.e., rectification from the evaluator's perspective). Our specific contributions are the following:
• Multi-variate backfill problem: We introduce the multi-variate backfill problem using real-time epidemiological forecasting as the primary motivating example. In this challenging setting, which generalizes (the limited) prior work, the forecast targets, as well as exogenous features, are subject to retrospective revision. Using a carefully collected, diverse dataset for COVID-19 forecasting over the past year, we discover several patterns in backfill dynamics, show that there is a significant difference between real-time and revised feature measurements, and highlight the negative effects of using unrevised features for incidence forecasting in different models, both for model performance and evaluation. Building on our empirical observations, we formulate the problem BFRP, which aims to 'correct' given model predictions to achieve better performance on the eventual fully revised data.
• Spatial- and feature-level backfill modeling to refine model predictions: Motivated by the patterns in revision and the observations from our empirical study, we propose a deep-learning model Back2Future (B2F) to model backfill revision patterns and derive latent encodings for features. B2F combines Graph Convolutional Networks, which capture sparse cross-feature and cross-regional similarity in backfill dynamics, with deep sequential models that capture the temporal dynamics of each feature's backfill across time. The latent representation of all features is used along with the history of the model's predictions to improve diverse classes of models trained on real-time targets, so that they predict targets closer to the revised ground-truth values.
Our technique can be used as a 'wrapper' to improve the performance of any forecasting model (mechanistic or statistical).
• Refined top models' predictions and improved model evaluation: We perform an extensive empirical evaluation to show that incorporating backfill dynamics through B2F consistently and significantly improves the performance of diverse classes of top-performing COVID-19 forecasting models (from the CDC COVID-19 Forecast Hub, including the top-performing official ensemble). We also utilize B2F to help forecast evaluators and policy-makers better estimate the 'eventual' true accuracy of participating models (against revised ground truth, which may not be available until weeks later). This allows model evaluators to quickly identify models that perform better w.r.t. revised, stable targets instead of potentially misleading current targets. Our methodology can be adapted to other time-series forecasting problems in general. We also show the generalizability of our framework and model B2F to other domains by significantly improving the predictions of non-trivial baselines for US national GDP forecasting (Marcellino, 2008; Tkacz & Hu, 1999).
In this section, we study important properties of the revision dynamics of our signals. We introduce some concepts and definitions to aid the understanding of our empirical observations and method.
Real-time forecasting. We are given a set of signals F = Reg × Feat, where Reg is the set of all regions (where we want to forecast) and the set Feat contains our features and forecasting target(s) for each region. At prediction week t, x^{(t)}_{i,1:t} is the time series from week 1 to week t for feature i, and the set of all features results in the multi-variate time series X^{(t)}_{1:t}; similarly, y^{(t)}_{1:t} is the forecasting target(s) time series. (In practice, delays are possible too, i.e., at week t we may have data for some feature i only until t − δ_i. All our results incorporate these situations; we defer the minor notational extensions needed to the Appendix for clarity.) Further, let us call all data available at time t, D^{(t)}_{1:t} = {X^{(t)}_{1:t}, y^{(t)}_{1:t}}, the real-time sequence. For clarity, we refer to a 'signal' i ∈ F as a sequence of either a feature or a target, and denote it as d^{(t)}_{i,1:t}. Thus, at prediction week t, the real-time forecasting problem is: given D^{(t)}_{1:t}, predict the next k values of the forecasting target(s), i.e., ŷ_{t+1:t+k}. Typically for CDC settings and in this paper, our time unit is a week, k = 4 (up to 4 weeks ahead), and our target is COVID-19 mortality incidence (Deaths).
Revisions. Data revisions ('backfill') are common. At prediction week t+1, the real-time sequence D^{(t+1)}_{1:t+1} is available. In addition to the length of the sequences increasing by one (new data point), values already present in D^{(t)}_{1:t} may have been revised in D^{(t+1)}_{1:t+1}. Note that previous work has studied backfill limited to y^{(t)}, while we address it in both X^{(t)} and y^{(t)}. Also note that the data in the backfill is the same data used for real-time forecasting, just seen from a different perspective.
Backfill sequences: Another useful way we propose to look at backfill is by focusing on the revisions of a single value. Consider the value of signal i at an observation week t'. For this observation week, the value of the signal can be revised at any t > t', which induces a sequence of revisions. We refer to the revision week r ≥ 0 as the relative amount of time that has passed since the observation week t'.
Defn. 1. (Backfill sequence BSEQ) For signal i and observation week t', its backfill sequence is BSEQ(i, t') = [d^{(t')}_{i,t'}, d^{(t'+1)}_{i,t'}, d^{(t'+2)}_{i,t'}, ...], where d^{(t')}_{i,t'} is the initial value of the signal and d^{(∞)}_{i,t'} denotes its eventual stable value.
Defn. 2. (Backfill error BERR) For signal i, observation week t', and revision week r, BERR(r, i, t') measures the relative difference between the value observed at revision week r, d^{(t'+r)}_{i,t'}, and the stable value d^{(∞)}_{i,t'}.
Defn. 3.
(Stability time STIME) The stability time of a backfill sequence BSEQ is the minimum revision week r* for which the backfill error BERR < ε for all r > r*, i.e., the time when BSEQ stabilizes.
Note: We ensured that BSEQ length is at least 7, and found that in our dataset most signals stabilize before r = 20. For d^{(∞)}_{i,t'}, we use d^{(t_f)}_{i,t'}, the value at the final week t_f in our revisions dataset. In case we do not find BERR < ε in any BSEQ, we set STIME to the length of that BSEQ. We use ε = 0.
We collected important publicly available signals from a variety of trusted sources that are relevant to COVID-19 forecasting to form the COVID-19 Surveillance Dataset (CoVDS). See Table 1 for the list of 20 features (|Feat| = 21, including Deaths). Our revisions dataset contains signals that we collected every week from April 2020 to July 2021. Our analysis covers 30 observation weeks from June 2020 to December 2020 (to ensure all our backfill sequences have length at least 7) for all |Reg| = 50 US states. The rest of the unseen data, from January 2021 to July 2021, is used strictly for evaluation.
Patient line-list: traditional surveillance signals used in epidemiological models (Chakraborty et al., 2014), derived from line-list records, e.g., hospitalizations from CDC (CDC, 2020), and positive cases and ICU admissions from COVID Tracking (COVID-Tracking, 2020).
Testing: signals that measure changes in testing, from CDC and COVID-Tracking, e.g., tested population and negative tests, used by Rodríguez et al. (2021b).
Mobility: signals that track people's movement to several points of interest (POIs), from Google (2020) and Apple (2020), which serve as digital proxies for social distancing (Arik et al., 2020).
Exposure: a digital signal measuring closeness between people at POIs, collected from mobile phones (Chevalier et al., 2021).
Social Survey: the CMU/Facebook Symptom Survey data, previously used by Rodríguez et al. (2021b), which contains self-reported responses about COVID-19 symptoms.
We first study different facets of the significance of backfill in CoVDS. Using our definitions, we generate a backfill sequence for every combination of signal, observation week, and region (not all signals are available for all regions). In total, we generate more than 30,000 backfill sequences.
Backfill error BERR is significant. We computed BERR for the initial values, i.e., BERR(r = 0, i, t'), for all signals i and observation weeks t'.
Obs. 1. (BERR across signals and regions) Compute the average BERR for each signal; the median of all these averages is 32%, i.e., at least half of all signals are corrected by at least 32% of their initial value. Similarly, in at least half of the regions, the signal corrections are at least 280% of their initial value.
We also found large variation of BERR. For features (Figure 1a), compare the average BERR of 1743% for the five most corrected features with 1.6% for the five least corrected features. Also, in contrast to related work that focuses on traditional surveillance data (Yang et al., 2015), perhaps unexpectedly, we found that digital indicators also have a significant BERR (average of 108%). For regions (see Figure 1b), compare 1594% for the five most corrected regions with 38% for the five least corrected regions.
Stability time STIME is significant. A similar analysis for STIME found significant variation across signals (from 1 week to 21 weeks; see Figure 1c for STIME across feature types) and across regions (from 1.55 weeks for GA to 3.83 weeks for TX; see Figure 1d).
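To make Definitions 1-3 concrete, below is a minimal sketch (in Python with NumPy) of how BSEQ, BERR, and STIME could be computed from weekly data snapshots. The data layout (a mapping from release week to the values reported for each observation week), the normalization in `berr`, and the threshold `eps` are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def backfill_sequence(snapshots, signal, obs_week, final_week):
    """BSEQ(i, t') (Defn. 1): the values reported for (signal, obs_week) in each
    weekly data release from obs_week up to final_week. The layout
    snapshots[release_week][signal][obs_week] is a hypothetical data structure."""
    return np.array([snapshots[r][signal][obs_week]
                     for r in range(obs_week, final_week + 1)], dtype=float)

def berr(bseq, r):
    """BERR(r) (Defn. 2): relative deviation of the value at revision week r from
    the stable value, taken here as the last entry of the sequence. The exact
    normalization used in the paper may differ; this is an illustrative choice."""
    stable = bseq[-1]
    return abs(bseq[r] - stable) / max(abs(stable), 1e-8)

def stime(bseq, eps=0.05):
    """STIME (Defn. 3): the smallest revision week r* such that BERR(r) < eps for
    all r >= r*; set to len(bseq) if the sequence never stabilizes. The threshold
    eps is an assumed value for illustration."""
    errors = np.array([berr(bseq, r) for r in range(len(bseq))])
    for r_star in range(len(bseq)):
        if np.all(errors[r_star:] < eps):
            return r_star
    return len(bseq)
```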
Such large stability times also affect our target; thus, the actual accuracy of forecasts is not readily available, which undermines real-time evaluation and decision making.
Obs. 2. (STIME of features and target) Compute the average STIME for each signal; the average of all these averages is around 4 weeks for features and around 3 weeks for our target Deaths, i.e., on average it takes over 3 weeks to reach the stable values of features.
Backfill sequence BSEQ patterns. There is significant similarity among BSEQs. We cluster BSEQs via K-means using Dynamic Time Warping (DTW) as the pair-wise distance (since DTW can handle sequences of varying magnitude and length). We found five canonical categories of behaviors (see Figure 2), each of size roughly 11.58% of all BSEQs.
Obs. 3. (Cross-signal BSEQ similarity) Each cluster is not defined only by a single signal or region; similar backfill behaviors are shared across signals and regions.
Model performance vs. BERR. To study the relationship between model performance (via the Mean Absolute Error MAE of a prediction) and BERR, we use REVDIFFMAE: the difference between the MAE computed against the real-time target value and the MAE computed against the stable target value. We analyze the top-performing real-time forecasting models as per the comprehensive evaluation of all models in the COVID-19 Forecast Hub (Cramer et al., 2021). YYG and UMASS-MB are mechanistic models, while CMU-TS and GT-DC are statistical models. The top-performing ENSEMBLE is composed of all models contributing to the hub. We would expect a well-trained real-time model to have a higher REVDIFFMAE with larger BERR in its target (Reich et al., 2019). However, we found that higher BERR does not necessarily mean worse performance; see Figure 3: YYG even has better performance with more revisions. This may be due to the more complex backfill activity and dependencies in COVID-19 compared to the more regular seasonal flu.
Obs. 4. (Model performance and backfill) The relation between BERR and REVDIFFMAE can be non-monotonic and positively or negatively correlated, depending on the model and signal.
Real-time target values to measure model performance: Since targets undergo revisions (5% BERR on average), we study how this BERR affects the real-time evaluation of models. From Figure 4, we see that real-time and stable scores are not similar, with real-time scores overestimating model accuracy. The average difference in scores is positive, which implies that evaluators would overestimate models' forecasting ability.
Obs. 5. MAE evaluated in real time overestimates model performance by 9.6 on average, with the maximum for TX at 22.63.
Our observations naturally motivate improving the training and evaluation aspects of real-time forecasting by leveraging revision information. Thus, we propose the following two problems. Let the prediction of model M made at week t for week t+k be y(M, k)_t.
Backfill Refinement Problem, BFRP_k(M): For a model M trained on real-time targets, given the history of the model's predictions up to last week, y(M, k)_1, ..., y(M, k)_{t−1}, and its prediction for the current week, y(M, k)_t, our goal is to refine y(M, k)_t to better estimate the stable target y^{(t_f)}_{t+k}, i.e., the 'future' of our target value at t+k.
Leaderboard Refinement Problem, LBRP: At each week t, evaluators are given a current estimate of our target, y^{(t)}_t, and the forecasts for week t submitted by models at week t−k. Our goal is to refine y^{(t)}_t to ŷ_t, a better estimate of y^{(t_f)}_t, so that using ŷ_t as a surrogate for y^{(t_f)}_t to evaluate the predictions of models provides a better indicator of their actual performance (i.e., we obtain a refined leaderboard of models).
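The two problems share the same interface: a refinement operator takes whatever is known at week t and outputs a corrected estimate of the stable target. A minimal sketch of that interface is below; the function and argument names are hypothetical, and `refine` stands in for whatever refinement method (such as B2F, described next) is plugged in.

```python
from typing import Callable, Dict, Sequence

# A refiner maps (history of past estimates, the current estimate, and whatever
# revision information is available at week t) to a corrected estimate of the
# stable target. All names below are hypothetical.
Refiner = Callable[[Sequence[float], float, dict], float]

def solve_bfrp(refine: Refiner, pred_history: Sequence[float],
               current_pred: float, revision_data: dict) -> float:
    """BFRP_k(M): refine model M's current prediction toward the stable target."""
    return refine(pred_history, current_pred, revision_data)

def solve_lbrp(refine: Refiner, target_history: Sequence[float],
               current_realtime_target: float, revision_data: dict,
               submitted_forecasts: Dict[str, float]) -> Dict[str, float]:
    """LBRP: refine the current real-time target estimate, then score the
    submitted forecasts against the refined surrogate (absolute error here)."""
    y_hat = refine(target_history, current_realtime_target, revision_data)
    return {name: abs(pred - y_hat) for name, pred in submitted_forecasts.items()}
```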
Relating it to BFRP, assume a hypothetical model M_eval whose predictions are the real-time ground truth, i.e., y(M_eval, 0)_t = y^{(t)}_t for all t. Then refining M_eval is equivalent to refining y^{(t)}_t to better estimate y^{(t_f)}_t, which solves LBRP. Thus, LBRP is a special case of BFRP_0(M_eval).
Overview: We leverage the observations from Section 2 to derive Back2Future (B2F), a deep-learning model that uses revision information from BSEQs to refine predictions. Obs. 1 and 2 show that real-time values of signals are poor estimates of stable values. Therefore, we leverage patterns in the BSEQs of past signals and exploit cross-signal similarities (Obs. 3) to extract information from BSEQs. We also account for the fact that the relation of models' forecasts to the BERR of targets is complex (Obs. 4 and 5) when refining their predictions. B2F combines these ideas through its four modules; a high-level sketch of how they compose is given at the end of this overview:
• GRAPHGEN: Generates a signal graph (where each node maps to a signal in Reg × Feat) whose edges are based on BSEQ similarities.
• BSEQENC: Leverages the signal graph as well as the temporal dynamics of BSEQs to learn a latent representation of BSEQs using a Recurrent Graph Neural Network.
• MODELPREDENC: Encodes the history of the model's predictions, the real-time value of the target, and past revisions of the target through a recurrent neural network.
• REFINER: Combines the encodings from BSEQENC and MODELPREDENC to predict the correction to the model's real-time prediction.
In contrast to previous works that study target BERR (Reich et al., 2019), we simultaneously model all BSEQs available until the current week t using spatial and signal similarities in the temporal dynamics of BSEQs. Recent works that model spatial relations for COVID-19 forecasting need explicit structural data (like cross-region mobility) (Panagopoulos et al., 2020) to generate a graph, or use attention over the temporal patterns of regions' death trends. B2F, in contrast, directly models the structural information of the signal graph (containing features from each region) using BSEQ similarities. Thus, we first generate useful latent representations for each signal based on the BSEQ revision information of that signal as well as of signals that have shown similar revision patterns in the past. Due to the large number of signals covering all regions, we cannot model the relations between every pair of signals using fully-connected modules or attention, as in Jin et al. (2020). Therefore, we first construct a sparse graph between signals based on past BSEQ similarities. We then inject this similarity information using Graph Convolutional Networks (GCNs) and combine it with deep sequential models that capture the temporal dynamics of each signal's BSEQ while aggregating information from the BSEQs of neighboring signals in the graph. Further, we use these latent representations together with the history of a model M's predictions to refine its prediction. Thus, B2F solves BFRP_k(M) treating M as a black box, accessing only its past forecasts. Our training process, which involves pre-training on a model-agnostic auxiliary task, greatly reduces the training time for refining any given model M. The full pipeline of B2F is shown in Figure 5. Next, we describe each of the components of B2F in detail. For the rest of this section, we assume that we are forecasting k weeks ahead given data up to the current week t.
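As promised above, here is a minimal, hedged sketch of how the four modules could compose at week t (PyTorch-style pseudocode). All class and argument names are hypothetical, and the architectural details live in the module descriptions that follow; this is not the authors' exact code.

```python
import torch
import torch.nn as nn

class Back2Future(nn.Module):
    """Illustrative composition of the four B2F modules (a sketch, not the paper's code)."""
    def __init__(self, graphgen, bseq_enc, model_pred_enc, refiner):
        super().__init__()
        self.graphgen = graphgen              # builds sparse signal graph from past BSEQs
        self.bseq_enc = bseq_enc              # recurrent GNN over BSEQs on that graph
        self.model_pred_enc = model_pred_enc  # RNN over model M's prediction/target history
        self.refiner = refiner                # attention + feed-forward correction head

    def forward(self, bseqs, pred_history, current_pred):
        graph = self.graphgen(bseqs)                       # G_t from DTW similarities
        signal_enc = self.bseq_enc(bseqs, graph)           # one encoding per signal i
        pred_enc = self.model_pred_enc(pred_history)       # summary of prediction history
        gamma = self.refiner(signal_enc, pred_enc, current_pred)  # correction in [-1, 1]
        return (1.0 + gamma) * current_pred                # refined prediction
```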
GRAPHGEN. GRAPHGEN generates an undirected signal graph G_t = (V, E_t) whose edges represent similarity between the BSEQs of signals, with vertices V = F = Reg × Feat. We measure similarity using the DTW distance, for the reasons described in Section 2. GRAPHGEN leverages the similarities across BSEQ patterns irrespective of the exact nature of the canonical behaviors, which may vary across domains. For each pair of nodes, we compute the sum of DTW distances between their BSEQs over observation weeks t' ∈ {1, 2, ..., t−5}. We restrict t' to at most t−5 so that the BSEQs have a reasonable length (at least 5) to capture temporal similarity without discarding too many BSEQs. The top τ node pairs with the lowest total DTW distance are assigned edges.
BSEQENC. While we could model the backfill sequence of each signal independently using a recurrent neural network, this would not capture the behavioral similarity of BSEQs across signals. Using a fully-connected recurrent neural network that considers all possible interactions between signals may also fail to learn from the similarity information, due to the sheer number of signals (50 × 21 = 1050), while greatly increasing the number of parameters of the model. Thus, we utilize the structural prior of the graph G_t generated by GRAPHGEN and train an autoregressive model BSEQENC, a graph recurrent neural network that encodes a latent representation for each backfill sequence in B_t = {BSEQ(i, t') : i ∈ F, t' ≤ t}. At week t, BSEQENC is first pre-trained and then fine-tuned for a specific model M (more details later in this section). Our encoding process is illustrated in Figure 5. Let BSEQ_{t'+r}(i, t') be the first r+1 values of BSEQ(i, t') (i.e., up to week t'+r). For a past week t' and revision week r, we denote by h^{(t_r)}_{i,t'} ∈ R^m the latent encoding of BSEQ_{t_r}(i, t'), where t_r = t'+r and t' ≤ t_r ≤ t. We initialize h^{(t')}_{i,t'} and, at each revision week, update the encodings using a recurrent unit combined with graph convolutions over G_t; thus, h^{(t_r)}_{i,t'} contains information from BSEQ_{t_r}(i, t') as well as structural priors from G_t. Using h^{(t_r)}_{i,t'}, BSEQENC predicts the next value d^{(t_r+1)}_{i,t'} by passing the encoding through a 2-layer feed-forward network FFN_i, i.e., d̂^{(t_r+1)}_{i,t'} = FFN_i(h^{(t_r)}_{i,t'}). During inference, we only have access to the real-time values of signals for the current week, so we roll BSEQENC forward autoregressively for l steps to obtain the encodings {h^{(t+l)}_{i,t}}_{i∈F}, where l is a hyperparameter.
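A minimal sketch of a BSEQENC-style encoder is below. It assumes the GRAPHGEN output is supplied as a row-normalized adjacency matrix, and it interleaves a per-signal GRU update with one graph-convolution step per revision week; the exact recurrent cell, initialization, and per-signal heads FFN_i in the paper may differ, so treat the class and argument names as hypothetical.

```python
import torch
import torch.nn as nn

class BSeqEncoder(nn.Module):
    """Recurrent graph encoder over backfill sequences (illustrative sketch)."""
    def __init__(self, num_signals, hidden_dim):
        super().__init__()
        self.gru = nn.GRUCell(input_size=1, hidden_size=hidden_dim)  # per-revision update
        self.gcn_weight = nn.Linear(hidden_dim, hidden_dim)          # one GCN layer
        # One small prediction head per signal (FFN_i) for the autoregressive task.
        self.heads = nn.ModuleList(
            nn.Sequential(nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
                          nn.Linear(hidden_dim, 1))
            for _ in range(num_signals))

    def forward(self, bseqs, adj_norm):
        """bseqs: (num_signals, num_revisions) float tensor of revision values for
        one observation week. adj_norm: (num_signals, num_signals) row-normalized
        adjacency of the signal graph. Returns final encodings h and the
        next-revision prediction for every signal."""
        n, r = bseqs.shape
        h = torch.zeros(n, self.gru.hidden_size)
        for step in range(r):
            h = self.gru(bseqs[:, step:step + 1], h)       # temporal update per signal
            h = torch.relu(self.gcn_weight(adj_norm @ h))  # mix neighbors' encodings
        preds = torch.cat([head(h[i:i + 1]) for i, head in enumerate(self.heads)])
        return h, preds
```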
MODELPREDENC. To learn from the history of a model's predictions and its relation to target revisions, MODELPREDENC encodes the history of the model's predictions, the previous real-time targets, and the revised (up to the current week) targets using a Recurrent Neural Network. Given a model M, for each observation week t' ∈ {1, 2, ..., t−1−k}, we concatenate the model's prediction y(M, k)_{t'}, the real-time target y^{(t'+k)}_{t'+k} (the value for week t'+k as first reported), and the revised target y^{(t)}_{t'+k} (its value as of the current week t); the RNN encodes this sequence into a summary z^{(t)}_{t−k−1}.
REFINER. The REFINER leverages the information from the above three modules of B2F to refine model M's prediction for the current week, y(M, k)_t. Specifically, it receives the latent encodings of the signals {h^{(t+l)}_{i,t}}_{i∈F} from BSEQENC, the encoding z^{(t)}_{t−k−1} from MODELPREDENC, and the model's prediction y(M, k)_t for week t. The BSEQ encodings of different signals may have a variable impact on the refinement, since some signals may not be very useful for the current week's forecast (e.g., small revisions in mobility signals may not be important in some weeks). Moreover, because different models use the signals from CoVDS differently, we may need to focus on some signals over others to refine a given model's prediction. Therefore, we first take attention over the BSEQ encodings of all signals {h^{(t+l)}_{i,t}}_{i∈F} w.r.t. y(M, k)_t, using a multiplicative attention mechanism with parameter w ∈ R^m based on Vaswani et al. (2017); this yields an aggregated encoding h̄^{(t)}. Finally, we combine h̄^{(t)} and z^{(t)}_{t−k−1} through a two-layer feed-forward network FFN_RF that outputs a 1-dimensional value followed by a tanh activation to obtain the correction γ_t ∈ [−1, 1], i.e., γ_t = tanh(FFN_RF(h̄^{(t)} ⊕ z^{(t)}_{t−k−1})). The refined prediction is y*(M, k)_t = (γ_t + 1) y(M, k)_t. Note that we limit the correction made by B2F to at most the magnitude of the model's prediction because the average BERR of targets is 4.9% and fewer than 0.6% of them have BERR over 1; therefore, we restrict the refinement to this range.
Training: B2F is trained in two steps: 1) a model-agnostic autoregressive BSEQ prediction task to pre-train BSEQENC; and 2) model-specific training for BFRP.
Autoregressive BSEQ prediction: Pre-training on auxiliary tasks to improve the quality of latent embeddings is a well-known technique for deep learning methods (Devlin et al., 2019; Radford et al., 2018). We pre-train BSEQENC to predict the next values of the backfill sequences, {d^{(t_r+1)}_{i,t'}}_{i∈F}. Note that we only use the backfill sequences {BSEQ_t(i, t')}_{i∈F, t'<t} observed up to the current week t.
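The two training stages could be wired up roughly as follows. This is an illustrative sketch under stated assumptions (MSE losses, Adam, the `BSeqEncoder` sketched earlier, and hypothetical `pred_encoder` and `refiner` modules where `refiner` outputs the correction γ_t in [−1, 1]); it is not the authors' exact training code.

```python
import torch

def pretrain_bseq_encoder(encoder, bseq_batches, adj_norm, epochs=10, lr=1e-3):
    """Stage 1 (model-agnostic): teach the encoder to predict the next revision
    of each backfill sequence available up to the current week."""
    opt = torch.optim.Adam(encoder.parameters(), lr=lr)
    for _ in range(epochs):
        for bseqs, next_revision in bseq_batches:   # next_revision: (num_signals,)
            _, preds = encoder(bseqs, adj_norm)
            loss = torch.mean((preds.squeeze(-1) - next_revision) ** 2)
            opt.zero_grad()
            loss.backward()
            opt.step()

def finetune_for_model(encoder, pred_encoder, refiner, train_weeks, adj_norm, lr=1e-3):
    """Stage 2 (model-specific): fine-tune so that the corrected prediction
    (1 + gamma) * y_t matches the stable target on past weeks."""
    params = (list(encoder.parameters()) + list(pred_encoder.parameters())
              + list(refiner.parameters()))
    opt = torch.optim.Adam(params, lr=lr)
    for bseqs, pred_history, y_t, y_stable in train_weeks:
        h, _ = encoder(bseqs, adj_norm)
        z = pred_encoder(pred_history)
        gamma = refiner(h, z, y_t)                  # correction in [-1, 1] via tanh
        loss = ((1.0 + gamma) * y_t - y_stable) ** 2
        opt.zero_grad()
        loss.backward()
        opt.step()
```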