key: cord-0692804-hc6nmz1m
authors: Krishnan, R.G.; Cenci, S.; Bourouiba, L.
title: Mitigating bias in estimating epidemic severity due to heterogeneity of epidemic onset and data aggregation
date: 2021-08-19
journal: Ann Epidemiol
DOI: 10.1016/j.annepidem.2021.07.008
sha: 8faeb10ffaedbbb7a6b5ba825289a5f6ef872a1d
doc_id: 692804
cord_uid: hc6nmz1m

Outbreaks of infectious diseases, such as influenza, are a major societal burden. Mitigation policies during an outbreak or pandemic are guided by the analysis of data of ongoing or preceding epidemics. The reproduction number, [Formula: see text] , defined as the expected number of secondary infections arising from a single individual in a population of susceptibles is critical to epidemiology. For typical compartmental models such as the Susceptible-Infected-Recovered (SIR) [Formula: see text] represents the severity of an epidemic. It is an estimate of the early-stage growth rate of an epidemic and is an important threshold parameter used to gain insights into the spread or decay of an outbreak. Models typically use incidence counts as indicators of cases within a single large population; however, epidemic data are the result of a hierarchical aggregation, where incidence counts from spatially separated monitoring sites (or sub-regions) are pooled and used to infer [Formula: see text]. Is this aggregation approach valid when the epidemic has different dynamics across the regions monitored? We characterize bias in the estimation of [Formula: see text] from a merged data set when the epidemics of the sub-regions, used in the merger, exhibit delays in onset. We propose a method to mitigate this bias, and study its efficacy on synthetic data as well as real-world influenza and COVID-19 data.

The burden of infectious diseases is felt around the world. Developed countries are often faced with increased health-care expenditure, while developing countries face increased mortality and detrimental long-term effects on the population [3] . Seasonal epidemics such as influenza infect millions of people annually since preventative measures are imperfect. In the 2017-2018 epidemic, there were 45 million influenza illnesses and 21 million influenza-as-sociated medical visits. In the 2018 − 2019 epidemic, the efficacy of the flu vaccine was reported to be as low as 47% [4] . Influenza severely affects vulnerable groups in particular, such as young children and the elderly [52] , and in 2017, the associated death toll was around 61, 000 in the United States (US) [18] . Moreover, the total economic burden of influenza epidemics within the US is projected to be in the realm of $87.1 billion [49] . The ongoing COVID-19 pandemic has already affected more than 206 millions worldwide with more a fatality of 4.3 millions [59] , and an economic contraction nearing that of the Great Depression of the 1930s and with projected injections of more than 5 trillions globally [42] . Economists estimate, under the optimistic assumption that the pandemic's effects will abate in two years, the total cost of the pandemic would lie at nearly 16 trillion dollars [27] and we are now seeing the emergence of new strains. Clearly, the severe human and economic costs of pandemics and regular seasonal epidemics motivate the need for accurate methods for risk-prediction and planning during outbreaks. To this end, Non Governmental Agencies (NGOs) and health care officials work in tandem to collect disease incidence data.

Epidemiological models can be used to turn the collected data into actionable insights [29, 40, 50] . The parameters in epidemiological models quantify the severity of the disease [48] and the parameters thus obtained are used to devise strategies of intervention [21, 32, 37] . The reproduction number, R 0 , is a critical quantity defined as the expected number of secondary cases caused by a single infected individual in a population of susceptibles [13, 30, 39] . The reproduction number characterizes the growth potential of an epidemic [54] . It also serves as a threshold parameter: if above one, an epidemic is expected to grow, while if below one, an epidemic is expected to decay. R 0 is widely used for risk-assessment, optimization of strategies for intervention or management of allocation of human and economic resources, particularly during the first stages of a crisis [23, 34, 41 ]. An accurate estimation of the reproduction number from observational data is thus critical. However, such assessment relies on the accuracy of collected epidemiological data, which is often noisy due to a variety of factors which we shall discuss next.

As exemplified by the ongoing COVID-19 pandemic, the collection of data during an outbreak is challenging because of logistical, cultural, political, and technical factors, particularly at early stages of an epidemic or pandemic. For example, under-reporting of cases, and slow transmission of data between organizations were important sources of heterogeneity that made their way into the final estimates collected by the World Health Organization (WHO) during the Ebola outbreak [53] . In the case of influenza in the United States, data are collected in a hierarchical structure: incidence is recorded at sub-regional organizations such as local hospitals, agglomerated by regional organizations (such as state level health-care agencies) before being transmitted to the Centers for Disease Control and Prevention (CDC) and aggregated to create national infection curves [16] . Modern multi-regional data collection and aggregation practices may no longer reflect the assumptions of typical epidemiological models. Is such aggregation approach valid when the epidemic has different dynamics across the regions monitored? In this study, we characterize bias in the estimation of R 0 from a merged dataset when the epidemics of the sub-regions, used in the merger, exhibit delays in onset. We propose a method to mitigate this bias, and study its efficacy on synthetic data as well as real-world influenza and COVID-19 data.

The paper is organized as follows: in Section 2, we showcase a real-world example of variation in epidemic onset, and discuss how such variation arises in epidemiological data. Section 3 provides background on the SIR model and how its parameters are interpreted. In particular, we discuss how temporal delays in the onset of an epidemic affect the epidemic curves and bias estimates of the reproduction number. Section 4 details the method we propose to reduce this bias by explicitly accounting for temporal offsets, introducing a new Shifted-SIR model (S-SIR). In Sections 4 and 5, we validate our method using synthetic data and realworld influenza data collected in the United States, respectively. Finally, in Section 6 we use our methodology to illustrate how the methodology can help reduce errors at early stages of epidemics even when understanding of the underlying disease dynamics is not yet clear. We On the x-axis is weeks elapsed since week 40 [17] , used by the CDC as a marker for the onset of the flu season. On the y-axis is the number of individuals who report to outpatient clinics with an influenza like illness. Note that the maximum number of infected individuals differs across sub-regions.

conclude with a discussion of related work and the implications of our method in Section 7.

The procedures for reporting cases of infectious diseases vary from country to country but generally follow a hierarchical structure. Starting from hospitals and clinics, data is collected, tabulated and aggregated before being passed onto regional authorities and NGOs. This process is repeated and the aggregated statistics are sent to national bodies where it is used for intervention policy, crisis response, planning and management. In resource-limited settings, some of this hierarchical infrastructure has to be created on-the-go during an outbreak [26] .

By way of example, consider a government agency in the United States that wishes to quantify the severity of the influenza epidemic in 2014. Typically, the severity of an epidemic is assessed by estimating the R 0 from data collected locally and then aggregated.

However, the aggregation process smoothes out local heterogeneity that are relevant for the assessment of the severity of the epidemic. For example, in Figure 1 we plot the incidence curves for the 2014 influenza outbreak in several major states in the United States; here we observe non-uniformity in the start times of the epidemic across the various (sub-regions) states.

Week 40 of the year is typically used by the CDC to denote the start of the flu season [17] .

Decisions on quantifying the severity of the disease are made using national level epidemic data and the aggregation infrastructure that creates the national data pools the data from the sub-regions. Typical compartmental models assume that incidence curves reflect a single large well-mixed population. This assumption is violated when sub-regional epidemics begin at different points in time. Failing to account for this heterogeneity in onset of the epidemic among sub-regions can bias estimates of R 0 inferred from the aggregated data. We pause to reflect on how such heterogeneity in onset time of epidemics could arise.

Reporting: An infection curve typically represents the number of infected individuals over time. However, the reported number is often a noisy estimate of the true number of individuals infected. Deviations from ground truth can occur due to under-reporting [1, 28] , or errors in tabulation.

Typical compartmental models assume a single, mean, rate of infection and recovery across the entire population. Aggregating incidence counts from sub-regions in which there is high variance in the rates of infectivity and recovery may result in inferring a global reproduction number, R 0 [31] that is not representative of the reproduction numbers of sub-regions. This is a case where ecological bias can arise in epidemiological studies [56] . Correcting ecological bias of this nature is difficult due to nonidentifiability in the data: namely that there exists an infinite set of sub-regional incidence curves, each corresponding to different values of R 0 , that can give rise to an observed aggregated infection curve.

Epidemics rarely begin simultaneously across spatially separated geographic regions. Due to factors such as the population density of sub-regions and the variation in disease vectors, there are often delays in the onset of the epidemic. These delays can be difficult to detect: for example, California in Figure 1 , has a non-zero number of cases of influenza as baseline before the onset of the epidemic, making it difficult to pinpoint exactly how to define the onset of the epidemic. Organizations like the CDC often use week 40 in the year as an empirical mark of the beginning of seasonal epidemics such as the flu. This is a choice that aligns well with the epidemic dynamics in some sub-regions but not others. In this work, we focus on studying, characterizing and mitigating this source of error.

We next focus on mitigating errors in estimating R 0 , and thereby the quantification of epidemic severity, due to the third scenario highlighted above. We expect that when delays in epidemic onset among sub-regions are comparable to the duration of the epidemic, the aggregated curve is fundamentally altered. We conjecture that the resulting bias in R 0 arises because parameters in compartmental models are forced to capture delays, in addition to the dynamics of the epidemic.

Hereafter, sub-regions refer to collection sites for epidemic data while regional data refers to the aggregation of incidence counts from such sub-regions to form the incidence curve from which R 0 is inferred. We work with the canonical epidemic compartmental model, the SIR model, described succinctly in the next section.

The Susceptible Infectious Recovered (SIR) model is ubiquitous for the analysis of epidemic data [45] . It splits a homogeneous population into three groups, or compartments, and represents the progression of the epidemic as the status of individuals transition through the compartments. Typically, nearly all individuals are initially considered to be Susceptible while a small number of infected individuals are introduced in the naive population. The rates of change of the proportion of individuals in each compartment are governed by coupled ordinary differential equations governing the number, or fraction, of the susceptible (S), infected (I) and recovered (R) individuals:

We work with the normalized model where S + I + R = N = 1. β is the mass action parameter encompassing contact and transmission rates between susceptible and infected individuals, and γ, the rate of recovery of infected individuals.

The reproduction number, the average number of secondary infections caused by a single infected individual introduced in the population, is derived as R 0 = βS0 γ where S 0 is the initial fraction of susceptible individuals in the population.

The primary use of R 0 is as a threshold parameter to quantify whether an outbreak is expected to die off or spread and grow into an epidemic. When R 0 > 1, the region experiences an epidemic with an increase in the number of cases, I. When R 0 < 1, the outbreak dies out (I decreases). The value of R 0 is therefore crucial for risk assessment and testing intervention strategies such as planning for the number of drugs to allocate, hospital beds to prepare, and healthcare workers to deploy during the epidemic. An under or over-estimation of the reproduction number can have disastrous public health and economic consequences.

Note that the SIR model makes a number of implicit assumptions such as homogeneity in the distribution of the recovery and infection rates. Anyone is equally likely to be infected and recover. The model also assumes homogeneity in the social contact network: i.e. any infected individual is equally likely to be the source of infection for any other individual, not accounting for distance, heterogeneity in pathogen shedding, residence in the same indoor space, or contributions from air contamination indoors for respiratory diseases for example [6, 7, 46] . Several extensions to the model have been proposed to account in various ways for homogeneity in spatial and temporal scales or shedding amounts and timescale competitions [5, [8] [9] [10] 12] , but there is no universal approach yet to best account for these effects in general.

We begin with a demonstration on synthetic data to show how delays in onset of epidemics in sub-regions can affect assessment of R 0 from the regional data. [E] Regional incidence curve

3 Sub-regional R0 [C] shows the results of inferring R 0 using the SIR model on all the sub-regional data, the aggregated curve along with the true R 0 = 2.3.

[D], [E], and [F] follow similarly but showcase instances where there is heterogeneity in epidemic onset time between the sub-regions with resulting values of R 0 estimated using the SIR model that are much further away from the ground truth.

The setup is as follows. We consider an epidemic with R 0 = 2.3 as manifested in ten different sub-regions. Each sub-regional epidemic is characterized by a choice of R 0 ∼ |N (2.3; 1)|. The incidence curves are simulated and visualized in Figure 2 [A]. The regional data is given by their aggregation in Figure 2 [B]. Although typically, R 0 would only be estimated from regional data, in this exercise, we use the SIR model to infer R 0 from each of the sub-regional curves and their aggregation -we visualize the results in Figure 2 [C]. Next, we repeat the exercise in Figure 2 [D-F] but this time simulate epidemics where the epidemic onset in the sub-regions is delayed, with a delay uniformly sampled from two to six weeks.

Without delays in onset of epidemics in subregions, even with noise in the observational data and a small degree of variation in the subregional R 0 , inferring R 0 from the aggregated data using the SIR model yields a reasonable estimate R 0 = 2.4 compared to the true value of 2.3. However, in the presence of delayed epidemic onsets in the sub-regions, the inferred R 0 = 1.8 from regional data is less accurate, further from the true value. This demonstrates how a naive approach to the inference of the reproduction number can lead to a serious un-derestimation of the severity of an outbreak.

To illustrate the importance of the problem, let us consider a population of a million individuals. In the absence of temporal offsets, This is an under-estimation of 10, 000 individuals or 1% of the population. In the presence of heterogeneity due to variation in epidemic onset (panel [F]), the number of underestimated infected individuals rises up to 200, 000. This is 20% of the population, and four times larger than the error incurred in the absence of temporal variation in disease onset. Such large errors in the prediction of the number of infected individuals can have dramatic consequences during an outbreak; for example: from an operational standpoint, care facilities may be underprepared for dealing with the additional influx of sick patients.

Note that in the experiments discussed here, we ensured overlap between the sub-regional epidemic curves in Figure 2 [D]. Clearly, in the extreme case in which the epidemic curves do not overlap, the use of an SIR model for the aggregated model can no longer hold. The aggregated curve would either be in the form of a plateau or appear as a series of epidemic curve peaks, possibly suggesting multiple epi-demic waves. These extreme cases imply obvious breakdowns of the SIR type model for the aggregated data, and so are not of high concern: a typical user would not typically try to extract SIR-type parameters from such multi-modal or "flat" epidemic data curves. However, more concerning is the intermediate regime for which delays in onset are important, but with sufficient remaining overlap between epidemic curves, such as shown in Figure 2 [D]. In such cases, there is no obvious distortion of the aggregated epidemic curve. As we have seen, in such cases, the R 0 inferred is typically biased and in particular, is underestimated. Indeed, in such intermediate cases, the SIR model is incapable of capturing time-delays, or offsets, in the onset of the epidemic in sub-regions, the source of errors we seek to avoid. To remedy this pathology, we next introduce an alternative epidemic model.

The shifted-SIR model uses an additional parameter to explicitly capture delays in the onset of the epidemic.

Denoting by τ the start time of the epidemic, and I the indicator function:

The S-SIR model has the following dynamics for each of the compartments of a population:

The key difference between Equations (1) and (2) is the incorporation of an additional parameter to the model, τ . When t > τ , the Shifted SIR, and the SIR model are equivalent. When t ≤ τ , the shifted SIR model preserves the initial conditions of the differential equation without temporal evolution. In this way, the model has a mechanism to explicitly account for the start time of each infection curve, and the new parameter has a rooting in a physical, measurable effect of delay in onset or reporting, or a combination of both. Consequently, this means that the parameters β, γ that govern the infection dynamics, and by proxy reproduction number R 0 , are not used to model the early portion epidemic data attributed to delays and noise; they can instead focus on modeling infection counts after τ time-steps.

Simulation and parameter estimation in the S-SIR model: For a given choice of τ , one can simulate from this model using an SIR model followed by right-shifting the simulated curve right by τ time-steps. However, the incorporation of τ to the compartmental model requires new methods to fit parameters from epidemiological data. We use a grid search for τ and nested within it, a least-squares based fitting procedure to estimate β, γ from infection counts I. The practitioner's prior estimates for what values τ can take for a given epidemic may be used to constrain the grid search. We defer the reader to the supplementary information (SI) for a detailed exposition on how the parameters of the model are inferred.

We first reflect upon how heterogeneity in epidemic onset manifests in aggregated incidence curves.

In Figure 3 we study eight different scenarios, where both R 0 , τ take high and low values (values listed in the figure-key labelled as Data).

In the top row, each of the ten sub-regional epidemic curves (faded curves) are assumed to be generated from an SIR model with the corresponding value of R 0 and shifted to the right by an offset drawn from Uniform(τ − 1, τ + 1). In the bottom row, each of the ten subregional epidemic curves (generated as above) are shifted to the right by an offset drawn from Uniform(τ − 3, τ + 3). The top row illustrates the case where there is a small degree of heterogeneity in onset times among the sub-regions and the bottom row illustrates the case where there is a large degree of heterogeneity. The aggregated curve (depicted with the label Data) is then used to infer the parameters of the S-SIR model. We simulate from the model and display the estimated parameter values (R 0 ,τ ) alongside the curve from the simulation, noted with Sim in the figure-key.

The results showcase a few important aspects of the inferred parameters. First, across the board, we find that simulation from the S-SIR model with the inferred parameters presents an accurate fit to the aggregated, epidemic curve. Second, when the epidemic onset delays in the underlying sub-regions have a small amount of variation (top row), when they differ greatly (bottom row) and the epidemics among the subregions have the same fraction of infected individuals, the value taken on byτ lies close to the smallest time-delay among the sub-regions. We conjecture that this is because the smallest time-delay among the sub-regions is the minimum delay that may be identified from the aggregated infection counts. In practice the degree of observational noise, the number of sub-regions, the severity of the epidemic in each sub-region and the heterogeneity in delay onset affect the values ofR 0 ,τ inferred.

Improving R 0 using the S-SIR model Next, we perform a more thorough quantitative study of scenarios to understand where the S-SIR model can improve the estimation of R 0 in the presence of offsets in epidemic onset in subregions.

The setup for the experiments are as follows. Across either ten or one hundred sub-regions, we first simulate an SIR epidemic and shift the epidemic curve to mimic a delayed onset. Folded random noise (with standard deviation of 0.005) is added to each point in every simulated curve. The curves are aggregated and the aggregation is used to infer the parameters of the SIR and the S-SIR, using O(X, 0) and Algorithm 1, respectively. We then study the error between the inferred valueR and the true value, R 0 . Each such experiment is repeated 1000 times.

We study two kinds of epidemics with different severities: R 0 = 1.8, R 0 = 3.1. We vary the way in which the delay in epidemic onset in sub-regions, τ , behaves among the sub-regions by selecting four distributions from which τ (in unit of weeks) is sampled:

The results of our analysis are depicted in Figure 4 when τ is drawn from a Gamma distribution among the sub-regions and in Figure 5 when τ is drawn from a Uniform distribution. Both violin plots indicate that the SIR and S-SIR models underestimate R 0 . However, across all results, we find that errors in the estimation of R 0 inferred from the S-SIR model are on average, lower than errors in R 0 from the SIR model, suggesting that explicitly accounting for time-delays does improve the accuracy of the inferred R 0 . We examine several follow up questions in turn. Figures 4 and 5 show the results of the inference with small time delays (two columns on the left of all the plots of the figures). We note that from the perspective of planning for epidemics, it is worrying that even small timedelays in epidemic onset in sub-regions result in an appreciable error in the estimation of R 0 from aggregated data, giving a lower value than reality on the ground; this showcases a scenario that calls for concern about bias in the estimates of R 0 . Such scenario is particularly concerning when the epidemic is severe (relatively high R 0 ). The Figures (4 and 5) show that both synthetic scenarios tested, we find some and on the right is 100. In each, we study the distribution of errors between the true R0 and inferred values when the time-delays and severity of infection within subregions are varied. We find that the S-SIR systematically obtains better results across all settings but does particularly well when there are larger time-delays in the sub-regions. gains from the use of the S-SIR model which has a lower (on average) error in the estimation of R 0 than the SIR model.

When time-delays in sub-regions are larger (right two columns in the subplots of Figures 4 and 5 , we continue to find that the S-SIR model outperforms the SIR model. The improvements yielded by the S-SIR model are particularly visible when all the sub-regions have consistent shifts in the start of the epidemic and when the epidemic is more severe.

Across all the results we find that there can be a degree of under-estimation of R 0 when the sub-regions have time-delays, regardless of the method used. This is problematic; an underestimation of the epidemic severity can result in worse consequences on affected population than an overestimation of the severity.

The results on synthetic data demonstrate that the S-SIR model does improve the estimation of R 0 realizing values closer to the ground truth. The results also characterize scenarios when we can expect the use of such models to be of benefit: namely, when the delays in onset are large across sub-regions. We next discuss the insights gained from the synthetic data on characterizing the severity of epidemics from real-world influenza data. 

Influenza is a respiratory, viral, infectious disease which spreads via air. Seasonal influenza is recurrent; the young and the aged are particularly vulnerable to developing serious complications. Characterizing the severity of an epidemic both when it is ongoing as well as a posteriori is vital for public health management. We study how offset in epidemic onset times in aggregated data gathered from subregions in the United States affects the estimation of the severity of epidemics nationally. Epidemic data is collected in a hierarchical structure that spans multiple levels. Here, we focus on two levels of the hierarchy. The sub-regions will contain incidence counts available at the state level. The regional incidence counts correspond to those at the national level.

We obtained incidence reports for influenza from the CDC [19] . The data contains dates and raw counts of individuals diagnosed with influenza-like-illness (ILI) each week. For each state and year, we define time zero using the CDC's classification of the beginning of the flu year, which is the 40th week in the calendar year [17] . We self-normalize the raw counts (for example in Figure 1 ) to create incidence curves where each point represents the fraction of the population in the Infected (I) compartment. The influenza data represents the incidence of the disease. Here, this is the number of new cases reported each week. Compartmental models typically involve prevalence: the number of infected individuals a given time.

The two do not always coincide but since recovery from influenza typically takes five to seven days, this duration maps well with the temporal granularity of the available weekly case data.

Thus, we make the assumption in this work that the two are interchangeable and learn the parameters of the compartmental models herein using the incidence data, similar to prior studies of influenza [61] .

The first value of the incidence curve is used as the initial condition of Equations (1) and (2). For the S-SIR model, the τ max was limited to 20 weeks.

For this data, we no longer have access to the ground truth values of R 0 . Therefore, as one proxy to measure how well the resulting model explains the data, we measure absolute error between the raw data and the curves simulated under each model with the learned parameters. Denoting I s as the (vector of) data simulated from a model with learned parameters and I d as the raw incidence data, we denote the absolute error E:

A smaller absolute error implies that the parameters fit by the model more closely align with the observed data.

Are there delays in sub-regional epidemics for influenza? We visually inspect the incidence curves among sub-regions in Figure 6 to answer this question.

We find a few instances where the epidemic onset is delayed. For example in populous states such as Illinois, California and Georgia, there is a large and sudden spike in the number of infected individuals over a short two or three week window. We can therefore suspect that the national aggregated level epidemic is affected as well.

In Figure 6 , we compare the results from SIR and S-SIR models learned on the incidence data both at the state level and at the aggregated national level. Here we verify that (a) the S-SIR model makes use of the parameter τ and (b) that doing so enables it to capture more accurately capture the peaks of the epidemic the national data and the sub-regional data. Both of these observations suggest that delays in sub-regional epidemics do play a role in how well R 0 is inferred on aggregated data. We conjecture this happens because the underlying epidemic data exhibits heterogeneity in the start of the epidemic among the sub-regions it is aggregated from (relative to the CDC's fixed point of observation for the start of the epidemic season). When the SIR model is fit directly to such data, it uses β, γ to fit early trends in the epidemic curve leaving it unable to capture the peak of the epidemic well. In contrast the S-SIR does not use β, γ to model the data until τ time has elapsed giving it the ability to model the peak of the epidemic curve more accurately.

Quantifying model performance: In Figure 7 , we quantify the absolute error obtained on national incidence curves for years ranging from 2012 to 2018. Across many of the years, we find that the S-SIR model has a lower error than the SIR model. In the supplemental material, we repeat this exercise at the sub-regional levelaveraging the results across years from 2012 to 2018 -where we continue to find that the S-SIR captures the data better than the SIR model. 

We visualize the values of R 0 estimated by both algorithms in Figure 8 . On the x-axis are the results from the SIR model and on the y-axis are the results from the S-SIR model. In green are the (paired) estimates from each sub-region while in red are the estimates obtained from the national level incidence curves. We see a large degree of correlation between the values of R 0 inferred by the two algorithms. Across several years however, the value of R 0 estimated using the SIR model is always smaller than than that estimated with the S-SIR model. Since we do not know the ground truth, it is impossible to know for certain which model is more accurate; however, the observation that the S-SIR model provides a better fit to the epidemic curve data ( Figure 7 ) along with Figure 8 , suggests that the SIR model underestimates R 0 . We provide further evidence to support this hypothesis by looking at Figure 12 where we visualized the learned τ from the S-SIR model. We see that years in which the S-SIR model performs better than the SIR model in Figure 7 correspond to years where the inferred value of τ is largest. Figure 8 : R0 estimated from regional and national Incidence curves: For six years, we visualize the values of R0 estimated from regional (green) and national (red) levels incidence curves. The data was extracted and processed based on the incidence counts released by the CDC [15] .

In the previous experiments, our analysis of the data was retrospective -i.e. the epidemic had come and gone. Using retrospective data, we can simulate an ongoing epidemic by limiting the length of the infection curve used to estimate the model parameters. For the national incidence curves for four years, we estimate the parameters of the SIR and S-SIR model using data up to weeks 10, 15, 20. For each choice, we calculate the value of R 0 and display it in Figure 10 . The top row depicts the results obtained from the SIR model while the bottom row contains the results from the S-SIR model. There are two observations of note -first, the S-SIR model captures best the peak of the epidemic; and second, the values of R 0 obtained, even during estimation of an ongoing epidemic, are higher than those obtained by the SIR model. Both observations suggest again that the S-SIR model yield more conservative, higher, estimates for R 0 . For two years, 2013 and 2014, we visualize the epidemic curve corresponding to the values of R0 estimated from each year's national level incidence curve while limiting the length of the curve to 10, 15, 20 weeks. In each case, we visualize the results from the SIR and S-SIR models to infer R0 and assess their model fit.

Having studied how the S-SIR improves estimates of R 0 from regional data, we study the spatial patterns in the inferred values of τ from sub-regional data. We group the sub-regions based on standard federal regions and average the values of τ for states in each group. We visualize the federal regions in Figure 9 and the results in Figure  11 . Across both time and space, we find that the S-SIR model infers a non-zero value of τ suggesting that the underlying epidemic is indeed beginning later than week 40, with large varia-tions from region to region, and with several regions showing an increase in epidemic onset time in the 2015-2018 window (e.g., regions I, V, VII, VIII, and X). This can result in delays in onset of the national epidemic curve, as witnessed in Figures 6 and 12 , which subsequently interferes with the accuracy of the estimation of R 0 from national aggregated data. Robustness of τ When only given access to the regional epidemic data, how robust are our estimates of τ to observational noise in the data?

To answer this question, we use the national level epidemic curves, perturb the raw counts using additive Gaussian noise (with standard deviation set to a varying percentage of the peak infection counts during that year) while restricting ourselves to valid epidemic curves by using the absolute value when the addition of noise results in negative counts. Then, we re-estimate τ over one-hundred random trials. We visualize the results in Figure 12 . We vary the percentage of the peak infection count between 1% and 5% both of which represent a significant amount of observational noise added to the epidemic curve. Note that the noise (which is a constant function of the peak infection count) results in larger variation during the early and late stages of the epidemic curve than in the middle.

This experiment serves to assess the feasibility of using τ to anticipate whether the model we propose can assess the existence of time-delays from regional epidemic data. When the observational noise lies around 1% of the peak epidemic curve, we find that the variation is distributed tightly around the value of τ estimated from the noise-free regional data, implying that the use of τ is within reasonable limits for a practitioner to assess whether or not timedelay is an important effect in the data. When the standard deviation of the noise is increased to around 5% of the peak infection counts, we continue to find that in over 50% of the trials, the value of τ inferred indicates that one can expect time-delays in the underlying data. COVID-19 has spread over the globe, has infected over 206 million people worldwide, more than 4.3 million of whom have died [59] . Here, we compare the fit obtained to data from an S-SIR model with an SIR model on infection data collected from Italy at the early stage of the epidemic. It is important to note that by now, we know that the dynamics of COVID-19 is more complex than an SIR, but here we use only this approach to illustrate how the methodology can help reduce errors at early stages of epidemics even when understanding of the underlying disease dynamics is not yet clear. Given that the SIR model is the one initially typically used with new epidemics of unknown pathogens, we use an SIR rather than other models to make this illustrative point. Of course as knowledge of the disease evolution improve, we can use the same time-shifted approaches with more complex epidemic models, and this is beyond the scope of this manuscript.

The data we use for this illustration was updated daily, at the time, from the National civil protection department and can be downloaded from the official GitHub repository github.com/ pcm-dpc/COVID-19.git. The data include information on those who were infected, recovered, and died. The epidemic began in Lombardy, a region in the north of Italy, and spread over the course of a month to the rest of the country. Figure 13 (large panel) shows how R 0 changes as a function of the amount of data used to infer its value (on the x-axis) in the early stage of the epidemic. The error bars around the values of R 0 quantify the uncertainty on the estimate which is obtained by perturbing the infection curve with observational noise, and repeatedly estimating the reproduction number. The figure illustrates the significantly higher R 0 compared to that of influenza.

The estimation is consistent with other, independent, estimates [33] . The figure also shows that the estimates start high, and decrease over time. The SIR and S-SIR estimates diverge in the early stages of the outbreak (large panel) but converge to consistent values further on (small panel). Furthermore, early estimate of R 0 from the S-SIR are significantly more robust than initial estimate from the SIR model. This suggest that the time-shift of the S-SIR is a more reliable model in the early stages of an epidemic when data are scarce and prone to miss-reporting errors and when the underlying dynamics of the disease is still not well understood, and first attempts of models usually start with an SIR.

Since the onset of COVID-19, our knowledge evolved and of course now we know that an SEIR may be more appropriate and that policy interventions over time have to also be accounted for to account for changes in the values of R 0 over time. Hence, we conjecture that in the early stage of the pandemic and onset of its various recurrent waves, the S-SIR could be a good model to estimate R 0 , keeping in mind its limitations when complex interventions are being used. pandemic: the large panel shows the estimation of the R0 using the national level curve data from the SIR versus S-SIR models in the early stage of the outbreak. The small inset panel shows the estimate in later stages. These estimates are not as reliable as in the later stages of the pandemic there were complex dynamics (e.g. policy intervention) in place.

There is a large body of work studying various sources of heterogeneity that arise in the estimation of R 0 from data using epidemiological models. Extensions of the SIR model, such as the SEIR model [12] , tackle heterogeneity due to differences in how diseases affect individuals in populations. Others focused on relaxing the assumptions made by typical compartmental models by incorporating variation in population sizes and the effect of vaccination [57] into the model of disease dynamics. In [38] , heterogeneity in removal from the Infected compartment is studied and its effect on R 0 is derived. [46] studies rates of infection in diseases where there is population heterogeneity (such as the level of social activity displayed by individuals), social-distancing and changes in hygiene. In the context of the Ebola outbreak in West Africa, [22] discusses the importance of accounting for uncertainty in the estimated model parameters. Variation in epidemiological curves and its parameters as a function of the variation in assumptions about the contact networks were also studied [36, 44] .

Biases incurred due to inferences made at the level of aggregate data is referred to as aggregation bias [25, 35] . For example, [25] examined changes in regression coefficients (for prediction problems) in the presence of data aggregation. Here, we identify a form of such bias in the estimation of R 0 that arises from the offset in temporal onset of epidemics in sub-regions used to produce aggregated epidemic data. Without fine-grained mobility information between sub-regions as in [55] , the offset captures some degree of spatial information about how the virus spreads -for example, given sub-regional data, sorting the regions by their values of τ may prove useful in understanding how a new disease is spreading due to mobility. Closely related to our work in spirit is that of [47] , who studies the effect of the incubation period for soilborne plant pathogens and the resulting differences in the understanding of the spread of plant diseases. Similarly, albeit with a different focus, [24] explore methods to model epidemic waves composed of overlapping sub-epidemics. They use generalizedlogistic growth models to forecast trajectories of emerging epidemics. Among statistical approaches, [25] examined change in regression coefficients used for prediction as a function of data aggregation. Note also that bayesian hierarchical models [14] have been used to capture knowledge about how epidemic curves in previous seasons might dictate the behavior of the epidemic in the following season. However, while [14] does experiment with a scaled and shifted SIR model, their analysis does not touch upon when and why such a model might be warranted, and the kinds of biases it corrects for.

Our work focuses on elucidating the effect of shift or delay between sub-units of data on the predictions done on the aggregated data. We showed how a mismatch between the generative assumptions of epidemiological models and how data is gathered can bias inferences made about the reproduction number and thus, the estimation of severity of epidemics. In particular, we quantified errors accrued when estimating R 0 in the presence of delayed epidemic onset as well as when the sub-regional epidemic data have delayed onsets. Our work illustrates how the lack of alignment between the assumptions made in a model and the data generating process can distort decisions made using the estimated parameters. To address and correct for this distortion, we introduced and validated the Shifted-SIR (S-SIR) model and provided an algorithm for parameter estimation, as a means to correct for the effect of delays in epidemic onset.

The subsequent analysis on synthetic data showcases the strength of the S-SIR model where we see that by explicitly accounting for timedelays within the compartmental model, we are able to correct and mitigate some of the bias in the estimation of R 0 .

In its current form however the S-SIR model is limited to modeling time-shifted dynamics arising from an SIR model. Extending the approach to more general classes of compartmental models (e.g. the Susceptible-Exposed-Infected-Recovered model) may be appropriate for infectious disease that are known to undergo more complex dynamics.

When should the S-SIR model be used? If one has access to sub-regional local data, then policy decisions should be made separately for each sub-region. However, sub-regions can be sparsely populated, their data can be noisy or otherwise unavailable. In such scenarios policy decisions must be made from aggregated regional data. Recall from our earlier example, that serious heterogeneity in onset can introduce a systematic under-estimation of R 0 . An under-estimation of 1.8 instead of 2.3 (figure 2) translates to a difference of 200,000 cases in a population of 1 million, in other words 20% error, which can be dramatic when estimating bed-capacity match to cases in time of pandemics. It is thus clear, that the best course of action would be to account for offsets with the S-SIR model when learning from regional -aggregated -epidemic curves. Doing so can mitigate some of the bias incurred in the estimation of R 0 with important implications for planning of response and management of cases.

Finally, as different modalities of data collection [20] are integrated for epidemiological modeling, it is vital to rethink our modeling framework to take into account the different levels of the data generation hierarchy [51] . Here, we presented one approach to mitigate errors, and systematic under-estimation in particular, in inferring R 0 due to heterogeneity in epidemic onset, particularly at early stages of epidemics when data is scarce and pooling data is common, while understanding of the underlying dynamics of the disease is not developed. Using our validated proposed S-SIR model and methodology, we hope that more accurate estimates of the reproduction number that matters on the ground can be used to improve planning and intervention, and mitigate mortality and morbidity rates in seasonal diseases such as influenza [58] and particularly at the early stages of newly emerging diseases such as those we are experiencing with COVID19 [59] .

The authors declare the following financial interests/personal relationships which may be considered as potential competing interests:

Rahul G. Krishnan reports financial support was provided by Microsoft Research. Simone Cenci reports financial support was provided by DCI, LLC. This author has no additional relationships to disclose. The authors have no patents to disclose. The corresponding author reports no additional activities to disclose.

The effect of underreporting on the apparent incidence and epidemiology of acute vhial hepatitis

Estimation by least squares and by maximum likelihood

Global burden, distribution, and interventions for infectious diseases of poverty

Update: influenza activity-United States

Understanding the transmission of H5N1. CAB Reviews: Perspectives in Agriculture, Veterinary Sciences, Nutrition and Natural Resources

Turbulent Gas Clouds and Respiratory Pathogen Emissions: Potential Implications for Reducing Transmission of COVID-19

Fluid Dynamics of Respiratory Infectious Diseases

Spatial dynamics of Bar-headed geese migration in the context of H5N1

Highly pathogenic avian influenza outbreak mitigated by seasonal low pathogenic strains: Insights from dynamic modeling

Avian influenza spread and transmission dynamics. Mathematical modeling of infectious diseases

A subspace, interior, and conjugate gradient method for large-scale boundconstrained minimization problems

Mathematical epidemiology: Past, present, and future

Flexible modeling of epidemics with an empirical Bayes framework

Centers for Disease Control (CDC). 2015. National Notifiable Diseases Surveillance System (NNDSS

Overview of Influenza Surveillance in the United States

Centers for Disease Control (CDC). 2018a

Centers for Disease Control (CDC). 2018b. The burden of flu disease 2017-2018 infographic

Centers for Disease Control (CDC). 2019. National, Regional, and State Level Outpatient Illness and Viral Surveillance

Using social media for actionable disease surveillance and outbreak management: a systematic literature review

A simple approximate mathematical model to predict the number of severe acute respiratory syndrome cases and deaths

Fitting dynamic models to epidemic outbreaks with quantified uncertainty: A primer for parameter uncertainty, identifiability, and forecasts

Comparative estimation of the reproduction number for pandemic influenza from daily case notification data

A novel sub-epidemic modeling framework for short-term forecasting epidemic waves

The effects of data aggregation in statistical analysis

The Ebola outbreak, 2013-2016: old lessons for new epidemics

The COVID-19 Pandemic and the $16 Trillion Virus

Unreported cases in the 2014-2016 Ebola epidemic: Spatiotemporal variation, and implications for estimating transmission

Mathematical epidemiology of infectious diseases: model building, analysis and interpretation

Daniel Bernoulli's epidemiological model revisited

Estimation of the basic reproduction number for infectious diseases from age-stratified serological survey data

Strategies for containing an emerging influenza pandemic in Southeast Asia

Impact of non-pharmaceutical interventions (NPIs) to reduce COVID-19 mortality and healthcare demand

Pandemic Potential of a Strain of Influenza A (H1N1): Early Findings

Certain effects of grouping upon the size of the correlation coefficient in census tract material

How heterogeneous susceptibility and recovery rates affect the spread of epidemics on networks

Malaria and its possible control on the island of Príncipe

Improving estimates of the basic reproductive ratio: using both the mean and the dispersal of transition times

Perspectives on the basic reproductive ratio. Journal of The Royal Society Interface

Mathematical modelling and prediction in infectious disease epidemiology

Ethical and Legal Considerations in Mitigating Pandemic Disease: Workshop Summary

Global Economic Effects of COVID-19

SciPy: Open source scientific tools for Python

Networks and epidemic models

Anderson G. 1927. A contribution to the mathematical theory of epidemics

Simple models of influenza progression within a heterogeneous population

Estimating the delay between host infection and disease (incubation period) and assessing its significance to the epidemiology of plant diseases

Six challenges in modelling for public health policy

The annual impact of seasonal influenza in the US: measuring disease burden and costs

The structure of epidemic models. Epidemic models: their structure and relation to data

Epidemiology in the era of big data

Global burden of respiratory infections due to seasonal influenza in young children: a systematic review and meta-analysis

Epidemiological data management during an outbreak of Ebola virus disease: key issues and observations from Sierra Leone

Estimating the basic reproductive number during the early stages of an emerging epidemic

A structured epidemic model incorporating geographic mobility among regions

Ecological inference

Global analysis of an SEIR model with varying population size and vaccination

Estimates of US influenzaassociated deaths made using four different methods

COVID-19 Dashboard by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University (JHU)

US Federal Regions

Modelling seasonal influenza: the role of weather and punctuated antigenic drift

We acknowledge the support of the MIT Alumni Class Fund in support of HST.537/2.250/1.631 Fluids and Diseases course that enabled onset of this research, in addition to the support of the Burroughs Wellcome Fund, the Richard and Susan Smith Family Foundation, and the Wellcome Trust.