title: Comparing trained and untrained probabilistic ensemble forecasts of COVID-19 cases and deaths in the United States
authors: Ray, Evan L.; Brooks, Logan C.; Bien, Jacob; Biggerstaff, Matthew; Bosse, Nikos I.; Bracher, Johannes; Cramer, Estee Y.; Funk, Sebastian; Gerding, Aaron; Johansson, Michael A.; Rumack, Aaron; Wang, Yijin; Zorn, Martha; Tibshirani, Ryan J.; Reich, Nicholas G.
date: 2022-01-28

The U.S. COVID-19 Forecast Hub aggregates forecasts of the short-term burden of COVID-19 in the United States from many contributing teams. We study methods for building an ensemble that combines forecasts from these teams. These experiments have informed the ensemble methods used by the Hub. To be most useful to policy makers, ensemble forecasts must have stable performance in the presence of two key characteristics of the component forecasts: (1) occasional misalignment with the reported data, and (2) instability in the relative performance of component forecasters over time. Our results indicate that in the presence of these challenges, an untrained and robust approach to ensembling using an equally weighted median of all component forecasts is a good choice to support public health decision makers. In settings where some contributing forecasters have a stable record of good performance, trained ensembles that give those forecasters higher weight can also be helpful.

Accurate short-term forecasts of infectious disease indicators (i.e., disease surveillance signals) can inform public health decision-making and outbreak response activities such as nonpharmaceutical interventions, site selection for clinical trials of pharmaceutical treatments, or the distribution of limited health care resources (Wallinga et al., 2010; Lipsitch et al., 2011; Dean et al., 2020). Epidemic forecasts have been incorporated into public health decision-making in a wide variety of situations, including outbreaks of dengue fever in Brazil, Vietnam, and Thailand (Lowe et al., 2016; Colón-González et al., 2021; Reich et al., 2016) and influenza in the U.S. These efforts frequently use ensemble forecasts that combine predictions from many models. In a wide array of fields, ensemble approaches have provided consistent improvements in accuracy and robustness relative to stand-alone forecasts (Gneiting & Raftery, 2005; Polikar, 2006). The usefulness of ensemble forecasts has also been demonstrated repeatedly in multiple infectious disease settings, including influenza, Ebola, dengue, RSV, and others (Yamana et al., 2016; Viboud et al., 2018; McGowan et al., 2019; Johansson et al., 2019; Reis et al., 2019; Reich et al., 2019). In light of this record of strong performance, ensembles are natural candidates for forecasts used as an input to high-stakes public health decision-making processes.

This paper describes ensemble modeling efforts at the U.S. COVID-19 Forecast Hub (https://covid19forecasthub.org/, hereafter the "U.S. Hub"), from spring 2020 through fall 2021. Starting in April 2020, the U.S. Hub created ensemble forecasts of reported incident deaths one through four weeks ahead in the 50 states, Washington, D.C., and 6 territories as well as at the national level by combining forecasts submitted by a large and variable number of contributing teams using different modeling techniques and data sources. In July 2020, forecasts of incident reported COVID-19 cases were added.
Of note, the U.S. Hub produces probabilistic forecasts in which uncertainty about future disease incidence is quantified through the specification of a predictive distribution. Since the inception of the U.S. Hub, these ensemble forecasts have been provided to the U.S. Centers for Disease Control and Prevention (CDC) and have been the basis of official CDC forecasting communications (CDC, 2021).

A wide variety of standalone methodological approaches have been shown to produce forecasts of short-term outbreak activity that are more accurate than naive baseline forecasts in various epidemiological settings. Some approaches have used existing statistical frameworks to model associations between outcomes of interest and known or hypothesized drivers of outbreaks, such as recent trends in transmission or environmental factors. To cite just a few examples, methods used include multiscale probabilistic Bayesian random walk models (Osthus & Moran, 2021), Gaussian processes (Johnson et al., 2018), kernel conditional density estimation (Ray et al., 2017; Brooks et al., 2018), and generalized additive models (Lauer et al., 2018). Other models have an implicit or explicit representation of a disease transmission process, such as variations on the susceptible-infectious-recovered (SIR) model (Shaman & Karspeck, 2012; Lega & Brown, 2016; Osthus et al., 2017; Pei et al., 2018; Turtle et al., 2021).

There is a large literature on ensemble forecasting, but of particular relevance to the present work is the research on combining, calibrating, and evaluating distributional forecasts (Ranjan & Gneiting, 2010; Claeskens et al., 2016). We note that prior work on forecast combination has mostly focused on combining forecasts represented as probability densities or probability mass functions rather than forecasts parameterized by a set of discrete quantile levels, which is the format of the forecasts in the present study. However, in psychological studies there is a long history of combining quantiles from multiple distributions as a mechanism for summarizing distributions of response times, error rates, and similar quantities across many subjects (Vincent, 1912; Ratcliff, 1979). More recently, this approach has also been used to combine probabilistic assessments from multiple subject matter experts or statistical models in fields such as security threat detection and economic forecasting (Hora et al., 2013; Lichtendahl Jr et al., 2013; Gaba et al., 2017; Busetti, 2017).

In the context of infectious disease forecasting, Bracher et al. (b) conducted an analysis similar to, but less extensive than, the one presented here, using data from a related forecast hub focusing on Germany and Poland. Taylor & Taylor (2021) recently explored several approaches to constructing quantile-based ensemble forecasts of cumulative deaths due to COVID-19 using data from the U.S. Hub, although they did not generate ensemble forecasts in real time or appear to have used the specific versions of ground truth data that were available for constructing ensembles in real time. As was mentioned earlier, ensemble forecasts have also been used in a variety of other applications in real-time forecasting of infectious diseases, often with seasonal transmission dynamics where many years of training data are available (Yamana et al., 2016; Reich et al., 2019; Reis et al., 2019; Colón-González et al., 2021).
In such applications, simple combination approaches have generally been favored over complex ones, with equally weighted approaches often performing similarly to trained approaches that assign weights to different models based on past performance (Bracher et al., b). These results align with theory suggesting that the uncertainty in weight estimation can pose a challenge in applications with a low signal-to-noise ratio (Claeskens et al., 2016).

This paper is focused on explaining the careful considerations that have gone into building a relatively simple "production" ensemble model for a difficult, high-stakes, real-time prediction problem: forecasting COVID-19 cases and deaths in the United States to support public health decision-making. We do not empirically investigate the performance of complex forecast combination strategies from the online prediction literature, which generally require richer and larger training data sets.

The goal of the U.S. Hub in developing an operational ensemble was to produce forecasts of the short-term trajectory of COVID-19 that had good performance on average and stable performance across time and different locations. Real-time forecasting for an emerging pathogen in an open, collaborative setting introduces important challenges that an ensemble combination method must be able to handle. First, teams occasionally submitted outlying component forecasts due to software errors, incorrect model assumptions, or a lack of robustness to input data anomalies (Figure 1(a)). Second, some component models were generally better than others, but the relative performance of different models was somewhat unstable across time (Figure 1(b), Supplemental Figures 1 and 2). In particular, some forecasters alternated between being among the best-performing models and among the worst-performing models within a span of a few weeks, which introduced a challenge for ensemble methods that attempted to weight component forecasters based on their past performance.

In this manuscript, we explore and compare variations on ensemble methods designed to address these challenges and produce real-time forecasts that are as accurate as possible to support public health decision-makers. We give detailed results from experiments that were run concurrently with the weekly releases of ensemble forecasts during 2020 and 2021, as documented in preliminary reports (Brooks et al., 2020; Ray et al., 2021). These experiments provided the evidence for decisions (a) to move from a mean-based to a median-based ensemble in July 2020, and (b) to switch to a trained ensemble for forecasts of deaths in November 2021. In a secondary analysis, we also consider the performance of these methods in the closely related setting of forecasting cases and deaths in Europe, to examine the generalizability of the results from our experiments using data from the U.S.

The following sections document the format and general characteristics of the COVID-19 forecasts under consideration, the ensemble approaches studied, and the results of comparing different approaches both during model development and during a prospective evaluation of selected methods. We give an overview of the U.S. and European Forecast Hubs and the high-level structure of our experiments in Sections 2.1 through 2.3, and then describe the ensemble methods that we consider in Section 2.4.
Starting in April 2020, the U.S. Hub collected probabilistic forecasts of the short-term burden of COVID-19 in the U.S. at the national, state/territory, and county levels (Cramer et al., 2021); a similar effort began in February 2021 for forecasts of disease burden in 32 European countries (European Covid-19 Forecast Hub, 2021). In this manuscript, we focus on constructing probabilistic ensemble forecasts of weekly counts of reported cases and deaths due to COVID-19 at forecast horizons of one to four weeks for states and territories in the U.S. and for countries in Europe. A maximum horizon of four weeks was set by collaborators at CDC as a horizon at which forecasts would be useful to public health practitioners while maintaining reasonable expectations of a minimum standard of forecast accuracy and reliability. Probabilistic forecasts were contributed to the Hubs in a quantile-based format by teams in academia, government, and industry. The Hubs produced ensemble forecasts each week on Monday using forecasts from teams contributing that week. In the U.S. Hub, seven quantile levels were used for forecasts of cases and 23 quantile levels were used for forecasts of deaths; in the European Hub, 23 quantile levels were used for both target variables.

Weekly reported cases and deaths were calculated as the difference in cumulative counts on consecutive Saturdays, using data assembled by the Johns Hopkins University Center for Systems Science and Engineering as the ground truth (Dong et al., 2020). Due to changes in the definitions of reportable cases and deaths, as well as errors in reporting and backlogs in data collection, there were some instances in which the ground truth data included outlying values or were revised. Most outliers and revisions were inconsequential, but some were quite substantial in the U.S. as well as in Europe (Figure 2). When fitting retrospective ensembles, we fit to the data that would have been available in real time. This is critical because the relative performance of different component forecasters may shift dramatically depending on whether originally reported or subsequently revised data were used to measure forecast skill. An ensemble trained using revised data can therefore have a substantial advantage over one trained using only data that were available in real time, and its performance is not a reliable gauge of how that ensemble method might have done in real time.

The U.S. Hub conducted extensive ensemble model development in real time from late July 2020 through the end of April 2021. We present results for the model development phase as well as a prospective evaluation of a subset of ensemble methods in the U.S. starting with forecasts created on May 3, 2021 and continuing through October 11, 2021. We note that we continued examining a wider range of methods to inform weekly operational forecasting tasks, but the methods that we chose to evaluate prospectively were selected by May 3, 2021, the beginning of the prospective evaluation period, with no alterations thereafter. Real-time submissions of the relative WIS weighted median ensemble described below are on record in the U.S. Hub for the duration of the prospective evaluation period. To examine how well our findings generalize, we also evaluated the performance of a subset of ensemble methods for forecasts of cases and deaths at the national level for countries in Europe from May 3, 2021 to October 11, 2021.
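As a concrete illustration of the ground-truth construction described above (differencing cumulative counts on consecutive Saturdays), the following minimal sketch shows one way this step could be implemented; the DataFrame `cumulative`, its daily index, and its per-location columns are hypothetical placeholders rather than the Hubs' actual data pipeline.

```python
import pandas as pd

def weekly_incidence(cumulative: pd.DataFrame) -> pd.DataFrame:
    """Difference cumulative counts on consecutive Saturdays to obtain weekly
    reported cases or deaths.

    `cumulative` is assumed to have a daily DatetimeIndex and one column per
    location. Negative values can appear after upstream data corrections;
    forecasts for such weeks were excluded from evaluation in this study.
    """
    saturdays = cumulative[cumulative.index.weekday == 5]  # Monday=0, ..., Saturday=5
    return saturdays.diff().iloc[1:]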
In the Forecast Hubs, not all forecasts from contributing models are available for all weeks. For example, forecasters may have started submitting forecasts in different weeks, and some forecasters submitted forecasts for only a subset of locations in one or more weeks. The ensemble forecast for a particular location and forecast date included all component forecasts with a complete set of predictive quantiles (i.e., 7 predictive quantiles for incident cases, 23 for deaths) for all 4 forecast horizons. Teams were not required to submit forecasts for all locations to be included in the ensemble. Some ensemble methods that we considered require historical forecasts to inform component model selection or weighting; for these methods, at least one prior submission was required. The Forecast Hubs enforced other validation criteria, including that predictions of incident cases and deaths were non-negative and that predictive quantiles were properly ordered across quantile levels.

To evaluate forecasts, we adopted the weighted interval score (WIS) (Bracher et al., a). Let q_k, k = 1, ..., K, be predictive quantiles at quantile levels τ_k (e.g., if τ_k = 0.5 then q_k is the predictive median) for the observed quantity y. The WIS is calculated as

WIS(q_{1:K}, y) = \frac{1}{K} \sum_{k=1}^{K} 2 \left[ \mathbb{1}_{(-\infty, q_k]}(y) - \tau_k \right] (q_k - y),

where \mathbb{1}_{(-\infty, q_k]}(y) is the indicator function that takes the value 1 when y ∈ (−∞, q_k] and 0 otherwise. This is a negatively oriented proper score, meaning that lower scores are better and that its expected value under a given data generating process is minimized by reporting the predictive quantiles from that process. WIS was designed as a discrete approximation to the continuous ranked probability score, and is equivalent to the pinball loss commonly used in quantile regression (Bracher et al., a).

To compare the skill of forecasters that submitted different subsets of forecasts, we used relative WIS, as done in Cramer et al. (b). The ensemble forecasters developed and evaluated in this manuscript provided all relevant forecasts; missingness pertains only to the component forecasters, and in the present work the relative WIS is primarily used to summarize component forecaster skill as an input to some of the trained ensemble methods described below. Relative WIS is calculated by computing, for each pair of models, the ratio of their mean WIS scores over the subset of forecasts (locations and dates) that both models submitted. A geometric mean of these pairwise ratios is calculated for each model, measuring how that model did relative to all other models on the forecasts they had in common. These geometric means are then scaled such that the baseline forecaster has a relative WIS of 1; a relative WIS less than 1 indicates forecast skill that was better than the baseline. The baseline forecaster used a predictive median equal to the most recent reported weekly count of cases or deaths, with a prediction interval based on differences in weekly counts over the observed history of the pandemic.

We also assessed probabilistic calibration of the models with the one-sided coverage rates of predictive quantiles, calculated as the proportion of observed values that were less than or equal to the predicted quantile value. For a well calibrated model, the empirical one-sided coverage rate is equal to the nominal quantile level. A method that generates conservative two-sided intervals would have an empirical coverage rate that is less than the nominal rate for quantile levels less than 0.5 and empirical coverage greater than the nominal rate for quantile levels greater than 0.5.
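To make these evaluation metrics concrete, here is an illustrative sketch (not the Hubs' evaluation code) of WIS in its quantile-score form, the pairwise relative WIS, and one-sided quantile coverage. The input structures, including the `mean_scores` dictionary of pairwise mean scores and the "baseline" model name, are assumptions made for the example.

```python
import numpy as np

def wis(quantiles, tau, y):
    """Weighted interval score in quantile (pinball) form; lower is better."""
    quantiles = np.asarray(quantiles, dtype=float)
    tau = np.asarray(tau, dtype=float)
    indicator = (y <= quantiles).astype(float)  # 1_{(-inf, q_k]}(y)
    return np.mean(2.0 * (indicator - tau) * (quantiles - y))

def relative_wis(mean_scores, baseline="baseline"):
    """Relative WIS from pairwise comparisons.

    mean_scores[(a, b)] is assumed to hold model a's mean WIS over the forecasts
    that models a and b have in common.  Each model is summarized by the
    geometric mean of its pairwise ratios, rescaled so that the baseline model
    has relative WIS 1.
    """
    models = sorted({m for pair in mean_scores for m in pair})
    skill = {}
    for a in models:
        ratios = [mean_scores[(a, b)] / mean_scores[(b, a)]
                  for b in models if b != a]
        skill[a] = np.exp(np.mean(np.log(ratios)))
    return {m: skill[m] / skill[baseline] for m in models}

def one_sided_coverage(pred_quantile, observed):
    """Proportion of observations at or below the predicted quantile values."""
    return np.mean(np.asarray(observed) <= np.asarray(pred_quantile))
```

A relative WIS below 1 from `relative_wis` corresponds to skill better than the baseline, matching the convention described above.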
We denote the reported cases or deaths for location l and week t by y_{l,t}. A single predictive quantile from component forecaster m is denoted by q^m_{l,s,t,k}, where s indexes the week the forecast was created, t indexes the target week of the forecast, and k indexes the quantile level (ranging from 1 to 7 for forecasts of cases in the U.S., and 1 to 23 otherwise). The forecast horizon is the difference between the target date t and the forecast date s. We denote the total number of available forecasters by M; this changes for different locations and weeks, but we suppress that in the notation. All of the ensemble formulations that we considered obtain a predictive quantile at level k by combining the component forecaster predictions at that quantile level: q^{ens}_{l,s,t,k} = f(q^1_{l,s,t,k}, ..., q^M_{l,s,t,k}).

We conceptually organize the ensemble methods considered according to two factors. First, trained ensemble methods use the past performance of the component forecasters to select a subset of components for inclusion in the ensemble and/or assign the components different weights, whereas untrained methods assign all component forecasters equal weight. Second, we varied the robustness of the combination function f to outlying component forecasts. Specifically, we considered methods based on either a (weighted) mean, which can be sensitive to outlying forecasts, or a (weighted) median, which may be more robust to these outliers.

The weighted mean calculates the ensemble quantiles as

q^{ens}_{l,s,t,k} = \sum_{m=1}^{M} w_m \, q^m_{l,s,t,k}.

The weighted median is defined to be the smallest value q for which the combined weight of all component forecasters with predictions less than or equal to q is at least 0.5; the ensemble forecast quantiles are calculated as

q^{ens}_{l,s,t,k} = \min \left\{ q : \sum_{m=1}^{M} w_m \, \mathbb{1}_{(-\infty, q]}\!\left(q^m_{l,s,t,k}\right) \geq 0.5 \right\}.

The equally weighted mean and median are obtained by setting w_m = 1/M for all m. Graphically, these ensembles can be interpreted as computing a horizontal mean or median of the CDFs of the component forecasters (Supplemental Figure 3).

In trained ensemble methods that weight the component forecasters, the weights were calculated as a sigmoidal (softmax-type) transformation of the forecasters' relative WIS (see Section 2.3) over a rolling window of weeks leading up to the ensemble forecast date s, denoted by rWIS^m_s:

w_{m,s} = \frac{\exp\!\left(-\theta_s \, rWIS^m_s\right)}{\sum_{m'=1}^{M} \exp\!\left(-\theta_s \, rWIS^{m'}_s\right)}.

This formulation requires estimating the nonnegative parameter θ_s, which was updated each week. If θ_s = 0, the procedure reduces to an equal weighting scheme; if θ_s is large, better-performing component forecasters (with low relative WIS scores) are assigned higher weight. We selected θ_s by using a grid search to optimize the weighted interval score of the ensemble forecast over the training window, summing across all locations l, forecast dates r in the training window, and target weeks t on or before time s:

\hat{\theta}_s = \arg\min_{\theta} \sum_{l} \sum_{r} \sum_{t \leq s} WIS\!\left(q^{ens,\theta}_{l,r,t,1:K}, \, y_{l,t}\right).

The size of the training window, a, is a tuning parameter that must be selected; we considered several possible values during model development, discussed further below. In this parameterization, the component forecaster weights are by construction nonnegative and sum to 1. When forecasts were missing for one or more component forecasters in a particular location and forecast date, we set the weights for those forecasters to 0 and renormalized the weights for the remaining forecasters so that they summed to 1.

Some trained ensembles that we considered used a preliminary component selection step, where the top few individual forecasters were selected for inclusion in the ensemble based on their relative WIS during the training window. The number of component forecasters selected is a tuning parameter that we explored during model development.
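The sketch below illustrates these combination mechanisms: the quantile-wise weighted mean and weighted median, performance-based weights, the preliminary top-k selection step, renormalization over available components, and the grid search for θ. It reuses the `wis` helper from the earlier sketch; the softmax form of the weight transformation and the structure of `training_forecasts` are assumptions consistent with the description above, not the Hub's exact implementation.

```python
import numpy as np

def weighted_median(values, weights):
    """Smallest value whose cumulative component weight is at least 0.5."""
    order = np.argsort(values)
    v, w = np.asarray(values, float)[order], np.asarray(weights, float)[order]
    cum = np.cumsum(w) / np.sum(w)
    return v[np.searchsorted(cum, 0.5)]

def combine_quantiles(component_q, weights, method="median"):
    """Combine component predictive quantiles one quantile level at a time.

    component_q : array of shape (M, K), one row per component forecaster.
    weights     : length-M nonnegative weights summing to 1.
    """
    weights = np.asarray(weights, float)
    if method == "mean":
        return weights @ component_q
    return np.array([weighted_median(component_q[:, k], weights)
                     for k in range(component_q.shape[1])])

def weights_from_relative_wis(rel_wis, theta, top_k=None, available_mask=None):
    """Weights as a decreasing transformation of relative WIS.

    theta = 0 gives equal weights; larger theta concentrates weight on
    components with low (good) relative WIS.  The softmax form is an
    assumption consistent with the text.  Components outside the top_k by
    relative WIS, or without a submitted forecast for this location and date
    (available_mask == 0), get weight 0 before renormalization.
    """
    rel_wis = np.asarray(rel_wis, float)
    w = np.exp(-theta * rel_wis)
    if top_k is not None:
        cutoff = np.sort(rel_wis)[min(top_k, len(rel_wis)) - 1]
        w = np.where(rel_wis <= cutoff, w, 0.0)
    if available_mask is not None:
        w = w * np.asarray(available_mask, float)
    return w / w.sum()

def select_theta(rel_wis, training_forecasts, theta_grid=np.linspace(0.0, 10.0, 101)):
    """Grid search for theta minimizing total ensemble WIS over the training window.

    training_forecasts is assumed to be a list of (component_q, tau, y) tuples,
    one per location / forecast date / target week in the window.
    """
    def total_wis(theta):
        return sum(
            wis(combine_quantiles(q, weights_from_relative_wis(rel_wis, theta)), tau, y)
            for q, tau, y in training_forecasts
        )
    return min(theta_grid, key=total_wis)
```

Setting theta = 0, or skipping the training step entirely, recovers the untrained, equally weighted mean or median of all available components described in the text.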
This component selection step may be used either in combination with the continuous weighting scheme described above, or with an equally weighted combination of the selected forecasters. Throughout the text below, we use the term "trained" ensemble to refer generically to a method that uses component selection and/or weighting based on historical component forecaster performance.

There are many other weighted ensembling schemes that could be formulated. For example, separate weights could be estimated for different forecast horizons, for different quantile levels, or for subsets of locations. As another example, the weights could be estimated by directly minimizing the WIS associated with look-ahead ensemble forecasts (Taylor & Taylor, 2021). We explored these and other ideas during model development, but our analyses did not show them to lead to substantial gains, and thus we settled on the simpler weighting schemes presented above. Further discussion of alternative schemes is deferred to the supplement.

All component model forecasts and code used for fitting ensemble models and conducting the analyses presented in this manuscript are available in public GitHub repositories (Cramer et al., a; Ray, b,a).

We discuss the decisions that we made during model development in Section 3.1 before turning to a more focused discussion of the impact of robust versus non-robust combination mechanisms on ensemble forecast skill in Section 3.2, and of trained versus untrained methods in Section 3.3. Results for the evaluation using forecasts in Europe are presented in Section 3.4. Throughout this section, scores are calculated using the ground truth data available as of December 5, 2021 unless otherwise noted. This allows four weeks of revisions to accrue between the last target end date evaluated and the date of the data used for evaluation. When reporting measures of forecast skill, we drop forecasts for which the corresponding reported value of weekly cases or deaths was negative. This can occur when an error in data collection is identified and corrected, or when the definition of a reportable case or death is changed. We include scores for all other outlying and revised data in the primary analysis because it is difficult to define objective standards for what should be omitted. However, a supplemental analysis indicates that the results about the relative performance of different ensemble methods are not sensitive to these reporting anomalies (Supplemental Section 3.3, Supplemental Figures 8 through 10).

During model development, we evaluated many variations on trained ensemble methods. In these comparisons we take the equally weighted median ensemble as a reference approach because this is the method used for the production ensemble produced by the U.S. Hub during most of the time that we were running these experiments. As measured by mean WIS over the model development phase, the equally weighted median ensemble was better than the equally weighted mean ensemble, but both were outperformed by the trained ensemble variations using component forecaster selection and/or weighting (Figure 3). The weighted approaches had stable performance no matter how many component forecasters were included. Approaches using an equally weighted combination of selected component forecasters were generally better than the reference only when the selection was limited to the top-performing components.
We also considered varying other tuning parameters such as the length of the training window and whether component forecaster weights were shared across different quantile levels or across forecast horizons. However, we did not find strong and consistent gains in performance when varying these other factors (Supplemental Figures 11 through 16).

Finally, we evaluated other possible formulations of weighted ensembles, with weights that were not directly dependent on the relative WIS of the component forecasters but were instead estimated by optimizing the look-ahead ensemble WIS over the training set. As measured by mean WIS, the best versions of these other variations on weighted ensembles had similar performance to the best versions of the relative WIS weighted median considered in the primary analysis. However, they were more sensitive to settings like the number of component forecasters included and the training set window size (Supplemental Figures 11 and 12).

Based on these results, on May 3, 2021 we selected the relative WIS weighted ensemble variations for use in the prospective evaluation, as these methods had mean WIS similar to the best of the other variations considered but were more consistent across different training set window sizes and numbers of component forecasters included. We used intermediate values for these tuning parameter settings, including 10 component forecasters with a training set window size of 12 weeks. We also included the equally weighted mean and median of all models in the prospective evaluation as reference methods. The following sections give a more detailed evaluation of these selected methods, describing how they performed during both the model development phase and the prospective evaluation phase.

We found that for equally weighted ensemble approaches, robust combination methods were helpful for limiting the effects of outlying component forecasts. In most combinations of evaluation phase (model development or prospective evaluation) and target variable (cases or deaths), the equally weighted median had better mean and worst-case WIS than the equally weighted mean, often by a large margin (Figure 3, Supplemental Figure 4). Results broken down by forecast date show that the methods achieved similar scores most of the time, but the equally weighted mean ensemble occasionally had serious failures (Supplemental Figure 6). These failures were generally associated with instances where a component forecaster issued extreme, outlying forecasts, e.g., forecasts of deaths issued the week of February 15th in Ohio (Figure 1).

There were fewer consistent differences between the trained mean and trained median ensemble approaches. This suggests that both trained approaches that we considered had similar robustness to outlying forecasts (if the outliers were produced by component forecasters that were downweighted or not selected for inclusion due to poor historical performance) or sensitivity to outlying forecasts (if they were produced by component forecasters that were selected and given high weight).

[Figure caption: Figure 4 shows the full distribution. A cross is displayed at the difference in overall mean scores for the specified combination method and the equally weighted median, averaging across all locations, forecast dates, and horizons. A negative value indicates that the given method outperformed the equally weighted median. The vertical axis of panel (b) shows the probabilistic calibration of the ensemble forecasts through the one-sided empirical coverage rates of the predictive quantiles. A well calibrated forecaster has a difference of 0 between the empirical and nominal coverage rates, while a forecaster with conservative (wide) two-sided intervals has negative differences for nominal quantile levels less than 0.5 and positive differences for quantile levels greater than 0.5.]

Panel (b) of Figure 3 summarizes probabilistic calibration of the ensemble forecasts with one-sided quantile coverage rates. The median-based ensemble approaches generally had lower one-sided quantile coverage rates than the mean-based approaches, indicating a downward shift of the forecast distributions. This was associated with poorer probabilistic calibration for forecasts of cases, where the ensemble forecast distributions tended to be too low. For forecasts of deaths, which were better centered but tended to be too narrow, the calibration of the median-based methods was not consistently better or worse than the calibration of the corresponding mean-based methods.

Averaging across all forecasts for incident cases and deaths in the model development phase, the weighted median was better than the equally weighted median and the weighted mean was better than the equally weighted mean (Figure 3). However, in the prospective evaluation, the trained methods showed improved mean WIS relative to untrained methods when forecasting deaths, but were worse when forecasting cases.

We believe that this difference in the relative performance of trained and untrained ensemble methods for cases and deaths is due to differences in component model behavior for forecasting cases and deaths. A fundamental difference between these outcomes is that cases are a leading indicator relative to deaths, so that trends in cases in the recent past may be a helpful input for forecasting deaths, but there are no clear candidates for a similar leading indicator for cases (e.g., see McDonald et al. (2021) for an investigation of some possibilities that were found to yield only modest and inconsistent improvements in forecast skill). Indeed, the best models for forecasting mortality generally do use previously reported cases as an input (Cramer et al., b), and it has previously been noted that deaths are an easier target to forecast than cases (Bracher et al., b). This is reflected in the performance of trained ensembles, which were often able to identify a future change in direction of trends when forecasting deaths, but generally tended to predict a continuation of recent trends when forecasting cases (Supplemental Section 5, Supplemental Figures 19 and 20). An interpretation of this is that the component forecasters with the best record of performance for forecasting deaths during the training window were able to capture changes in trend, but the best component forecasters for forecasting cases were often simply extrapolating recent trends. While all ensemble methods tended to "overshoot" at local peaks in weekly incidence, this tendency was more pronounced for forecasts of cases than for forecasts of deaths, and training tended to exacerbate the tendency to overshoot when forecasting cases but to mitigate it when forecasting deaths (Supplemental Figure 19).
Another difference in component behavior when forecasting cases and deaths is illustrated in Figure 4, which explores in more detail the relationships between component forecaster performance and the relative performance of trained and untrained ensemble methods. For deaths, the trained ensemble was able to identify and upweight a few component forecasters that had consistently good performance (e.g., Karlen-pypm and UMass-MechBayes). This led to consistently strong performance of the trained ensemble; it was always among the best models contributing to the U.S. Hub, and was better than the equally weighted median ensemble in nearly every week. For cases, the trained ensemble also had strong performance for many months while the LNQ-ens1 forecaster was contributing to the U.S. Hub. However, when LNQ-ens1 stopped contributing forecasts in June 2021, the trained ensemble shifted to weighting Karlen-pypm, which had less stable performance for forecasting cases. During July 2021, Karlen-pypm was the only forecaster in the U.S. Hub that predicted rapid growth at the start of the delta wave, and it achieved the best relative WIS by a substantial margin at that time. However, that forecaster predicted continued growth as the delta wave started to wane, and it had the worst relative WIS a few weeks later. In turn, the weighted median ensemble also had poor performance as the delta wave began to peak.

During the model development phase, the trained ensembles had better probabilistic calibration than their equally weighted counterparts (Figure 3, panel (b)). During the prospective evaluation, the trained median ensemble had generally higher one-sided coverage rates, corresponding to better calibration in the upper tail but slightly worse calibration in the lower tail. The trained mean ensemble had slightly better calibration than the equally weighted mean when forecasting deaths in the prospective evaluation phase, but inconsistent gains across different quantile levels when forecasting cases.

Figure 5 summarizes weighted interval scores and calibration for the four selected ensemble methods when applied prospectively to forecasts collected by the European Forecast Hub. Consistent with what we observed for the U.S. above, the equally weighted median ensemble was generally better than the equally weighted mean. However, in the European evaluation, the trained methods had worse performance than the equally weighted median for forecasting both cases and deaths. In a post hoc exploratory analysis, we noted that the patterns of missingness in forecast submissions are quite different in the U.S. and in Europe (Figure 6, Supplemental Figures 21 through 28). In the U.S. Hub, nearly all models submit forecasts for all of the 50 states, and many additionally submit forecasts for the District of Columbia and/or the territories. However, in the European Hub, roughly half of the contributing models submit forecasts for only a small number of locations. Because the trained ensembles selected for prospective evaluation included only the top 10 individual forecasters by relative WIS, in practice they drew on just a few component forecasters for many locations in Europe.

In this work, we have documented the analyses that have informed the selection of methods employed by the official U.S. Hub ensemble, which is used by the CDC for communication with public health decision makers and the public more generally.
In this context, our preference is for methods that have stable performance across different locations and different points in time, and good performance on average. Our most consistent finding is that robust ensemble methods (i.e., those based on a median) are helpful because they are more stable in the presence of outlying forecasts than methods using a mean. Ensemble methods based on means have repeatedly produced extreme forecasts that are dramatically misaligned with the observed data, but median-based approaches have not suffered from this problem as much. This stability is of particular importance in the context of forecasts that will be used by public health decision makers. These observations informed our decision to use an equally weighted median ensemble for the official U.S. Hub ensemble early on.

We have seen more mixed success for trained ensemble methods. Overall, trained ensemble methods did well when they were able to identify and upweight component forecasters with stable good performance, but struggled when component forecaster skill varied over time. In the U.S., trained ensembles have a long record of good performance when forecasting deaths, and the U.S. Hub adopted the relative WIS weighted median ensemble as its official method for forecasting deaths in November 2021. However, trained methods have been less successful at forecasting cases in the U.S., both near peaks in weekly incidence (when they tend to overshoot) and at points where the performance of the component forecasters is inconsistent. Additionally, the trained methods we adopted did not translate well to a setting with a large number of missing component forecasts, as in the European Hub. To preserve the prospective nature of our analyses, we have not examined additional ensemble variations in the European application, but we hypothesize that these problems might be mitigated by including all component forecasters rather than the top 10, or by performing weight estimation separately in clusters of locations where the same component forecasters are contributing.

[Figure caption: shows the full distribution. A cross is displayed at the difference in overall mean scores for the specified combination method and the equally weighted median of all models, averaging across all locations, forecast dates, and horizons. A negative value indicates that the given method had better forecast skill than the equally weighted median. Panel (b) shows the probabilistic calibration of the forecasts through the one-sided empirical coverage rates of the predictive quantiles. A well calibrated forecaster has a difference of 0 between the empirical and nominal coverage rates, while a forecaster with conservative (wide) two-sided intervals has negative differences for nominal quantile levels less than 0.5 and positive differences for quantile levels greater than 0.5.]

[Figure caption: The plot on the right shows the estimated weights that would be used if all of the top 10 models (each represented by a different color) were available for a given location (at left side), and the effective weights used in each location after setting the weights for models that did not provide location-specific forecasts to 0 and rescaling the other weights proportionally to sum to 1.]

In this manuscript, we have focused on relatively simple approaches to building ensemble forecasts. There are several opportunities for other directions that we have not considered here.
We have addressed the challenge posed by outlying component forecasts by using median-based combination mechanisms, but another approach would be to pre-screen the component forecasts and remove outlying forecasts. This is a difficult task because there are times when weekly cases and deaths grow exponentially, and occasionally only one or two models have captured this growth accurately.

A second challenge is that the ensemble forecasts have not always been well calibrated. We are actively developing approaches to address this by post hoc recalibration of the ensemble outputs.

A third challenge we have highlighted is the inconsistency of the relative performance of many component forecasters. For models with a relatively long history of performance over multiple epidemic waves, it might be possible to address this by using weights that depend on covariates such as recent trends in incidence. This might allow the ensemble to learn the conditions in which component forecasters have been more or less reliable, and to upweight models locally during phases similar to those in which they have done well in the past. Similar approaches have been used in the past for other infectious disease systems such as influenza, but they relied on a substantial amount of training data collected over multiple years.

We have used the WIS and probabilistic calibration to measure the extent to which forecasts are consistent with the eventually observed data. These summaries of performance are commonly used and provide useful insights into forecast performance, but it is worth noting that they do not necessarily reflect the utility of the forecasts for every particular decision-making context. Aggregated summaries of performance, such as overall quantile coverage rates, could obscure finer-scale details; for instance, a method with good coverage rates on average could have high coverage at times that are relatively unimportant and low coverage when it matters. Additionally, for some public health decision-making purposes, one or another aspect of a forecast may be more important; for example, some users may prioritize accurate assessments of when a new wave may begin, while other users may find accurate forecasts of peak intensity to be more important. Our evaluation metrics do not necessarily reflect the particular needs of those specific end users, and it is possible that different ensemble methods would be more or less appropriate for generating forecasts that serve different purposes.

Careful consideration and rigorous evaluation are required to support decisions about what ensemble methods should be used for infectious disease forecasting. As we discussed earlier, to obtain an accurate measure of a forecaster's performance, it is critical that the versions of ground truth data that would have been available in real time are used for parameter estimation. This applies as much to ensemble forecasters as it does to individual models. Additionally, it is important to be clear about what methods development and evaluation were done retrospectively and what forecasts were generated prospectively in real time. We believe that to avoid disruptions to public health end users, a solid evidence base of stable performance in prospective forecasts should be assembled to support a change in ensemble methods. We have followed these principles in this work, and have followed the EPIFORGE guidelines in describing our analysis (Supplemental Section 6).

The COVID-19 pandemic has presented a unique challenge for infectious disease forecasting.
The U.S. and European Forecast Hubs have collected a wealth of forecasts from many contributing teams, far more than have been collected in previous collaborative forecasting efforts for infectious diseases such as influenza, dengue, and Ebola. These forecasts have been produced in real time to respond to an emerging pathogen that has been one of the most serious public health crises in the last century. This setting has introduced a myriad of modeling difficulties, from data anomalies due to new reporting systems being brought online and changing case definitions, to uncertainty about the fundamental epidemiological parameters of disease transmission, to rapidly changing social factors such as the implementation and uptake of non-pharmaceutical interventions. The behavior of individual models in the face of these difficulties has in turn affected the methods that were suitable for producing ensemble forecasts. We are hopeful that the lessons learned about infectious disease forecasting will help to inform effective responses from the forecasting community in future infectious disease crises.

References

Evaluating epidemic forecasts in an interval format
A pre-registered short-term forecasting study of COVID-19 in Germany and Poland during the second wave
Nonmechanistic forecasts of seasonal influenza with iterative one-week-ahead distributions
Comparing ensemble approaches for short-term probabilistic COVID-19 forecasts in the U.S. International Institute of Forecasters blog
Quantile aggregation of density forecasts
COVID-19 mathematical modeling
The forecast combination puzzle: A simple theoretical explanation
Probabilistic seasonal dengue forecasting in Vietnam: A modelling study using superensembles
COVID-19 Forecast Hub Consortium (2021). The United States COVID-19 Forecast Hub dataset. medRxiv
COVID-19 Forecast Hub Consortium (a). reichlab/covid19-forecast-hub: release for zenodo
Ensemble forecast modeling for the design of COVID-19 vaccine efficacy trials
An interactive web-based dashboard to track COVID-19 in real time
European Covid-19 Forecast Hub (2021)
Combining interval forecasts
Probabilistic forecasts, calibration and sharpness
Weather Forecasting with Ensemble Methods
Strictly Proper Scoring Rules, Prediction, and Estimation
Median aggregation of distribution functions
An open challenge to advance probabilistic forecasting for dengue epidemics
Phenomenological forecasting of disease incidence using heteroskedastic Gaussian processes: A dengue case study
Prospective forecasts of annual dengue hemorrhagic fever incidence in Thailand
Data-driven outbreak forecasting with a simple nonlinear growth model
Is it better to average probabilities or quantiles? Management Science
Improving the evidence base for decision making during a pandemic: The example of 2009 influenza A/H1N1
Evaluating probabilistic dengue risk forecasts from a prototype early warning system for Brazil. eLife, 5, e11285
Can auxiliary indicators improve COVID-19 forecasting and hotspot prediction
Collaborative efforts to forecast seasonal influenza in the United States
Forecasting seasonal influenza with a state-space SIR model
Multiscale influenza forecasting
Forecasting the spatial transmission of influenza in the United States
Ensemble based systems in decision making
Combining probability forecasts
Group reaction time distributions and an analysis of distribution statistics
COVID-19 ensemble methods manuscript (b)
reichlab/covidEnsembles: pre-publication release
Challenges in training ensembles to forecast COVID-19 cases and deaths in the United States
Prediction of infectious disease epidemics via weighted density ensembles
Infectious disease prediction with kernel conditional density estimation
Challenges in Real-Time Prediction of Infectious Disease: A Case Study of Dengue in Thailand
Accuracy of real-time multi-model ensemble forecasts for seasonal influenza in the U.S.
On the predictability of COVID-19. International Institute of Forecasters blog
Superensemble forecast of respiratory syncytial virus outbreaks at national, regional, and state levels in the United States
Forecasting seasonal outbreaks of influenza
Combining probabilistic forecasts of COVID-19 mortality in the United States
Accurate influenza forecasts using type-specific incidence data for small geographic units
The RAPIDD Ebola forecasting challenge: Synthesis and lessons learnt
The Functions of the Vibrissae in the Behavior of the White Rat
Optimizing infectious disease interventions during an emerging epidemic
Superensemble forecasts of dengue outbreaks