key: cord-0150064-i70bmtsi
authors: Capistr'an, Marcos A.; Capella, Antonio; Christen, J. Andr'es
title: Filtering and improved Uncertainty Quantification in the dynamic estimation of effective reproduction numbers
date: 2020-12-03
journal: nan
DOI: nan
sha: 78a705af805d3b007bd67c02290ed6b33e17182e
doc_id: 150064
cord_uid: i70bmtsi

The effective reproduction number $R_t$ measures an infectious disease's transmissibility as the number of secondary infections in one reproduction time in a population having both susceptible and non-susceptible hosts. Current approaches do not quantify the uncertainty correctly in estimating $R_t$, as expected by the observed variability in contagion patterns. We elaborate on the Bayesian estimation of $R_t$ by improving on the Poisson sampling model of Cori et al. (2013). By adding an autoregressive latent process, we build a Dynamic Linear Model on the log of observed $R_t$s, resulting in a filtering type Bayesian inference. We use a conjugate analysis, and all calculations are explicit. Results show an improved uncertainty quantification on the estimation of $R_t$'s, with a reliable method that could safely be used by non-experts and within other forecasting systems. We illustrate our approach with recent data from the current COVID19 epidemic in Mexico.

The effective reproduction number R t is the proportion of new cases generated by active cases at calendar time t. Estimating R t has proved to be an important tool for monitoring ongoing epidemics, to monitor interventions or general changes in contagion patterns (e.g. Ferguson et al., 2006; Fraser, Riley, Anderson, and Ferguson, 2004) . This has been the case in the past and even more recently with the current COVID19 epidemics. In Germany, a calculation of R t is used as public monitor of the evolution of the COVID19 epidemic and has direct consequences on public life and decision making RKI (2020) . Regarding the estimation of R t , in basically all realistic cases, a proxy for incidence data has to be used, and along with the inherent stochastic variability in reported cases, proxies are subject to noise.

There are methods to estimate R t as a sub product of larger mechanistic compartmental models. The main limitation of these methods is precisely that an underlying mechanistic model must be postulated and calibrated beforehand, thus inheriting possible model drawbacks (see Bettencourt and Ribeiro, 2008; Cintrón-Arias et al., 2009, for example) . Recently, more generic methods for estimating R t , requiring incidence data only, have been proposed Cauchemez, Boëlle, Thomas, and Valleron (2006) ; Cori et al. (2013) ; Fraser (2007) ; Thompson et al. (2019) ; Wallinga and Teunis (2004) . Cauchemez et al. (2006) propose a hierarchical model and Bayesian estimation using Markov Chain Monte Carlo (MCMC) methods to estimate R t . MCMC is hardly recommended for non-experts and simpler analytic methods are sough.

Let I t be the incidence of cases, to be used for the estimation of R t , t = 1, 2, . . .. Using the renewal equation and other assumptions Fraser (2007) explains that the effective reproduction number R t may be defined and estimated with

Here w s is a probability mass function (i.e. w s ≥ 0 and ∞ s=1 w s = 1) that accounts for the infectious disease generation interval. It is assumed that infected individuals have a generation interval w s , dependent on the number on days since infection s but independent of current time. At time t, I t−s w s represents a "force of infection" of individuals infected s days ago. Then Λ t represents the "total infectiousness of infected individuals" Cori et al. (2013) . It may also be seen as an estimation of the current total of active cases. R t is the ratio of secondary cases I t produced by the current total of active cases. From here onwards we will assume that time t is measured in days.

Note that I t may be only a proxy of the total number of infected people, as for example, people arriving at hospitals to seek help, etc. In most common cases, the new infected individuals at time t, I * t , including asymptomatic infections, is impossible to measure. Ideally, as a proxy, we may regard instead I t = KI * t with K an unknown constant. However, R t is a unit less measure of the proportion of secondary infections and is indeed independent of K. Any estimation of R t should reflect this, providing the same results if a different K is used or if the proxy is given proportional to the total population etc. Moreover, the uncertainty we have on the actual R t should not be bound to the absolute values of I t but to the recent variation in the observed ratio I t /Λ t .

Moreover, epidemic data has far more caveats that need to be address. Reporting delays , along with right truncation are very common in epidemic data. Only under a thoughg investigation of the reporting system at hand, is that incidence data may be corrected to obtain a reasonable enough good proxy for estimating R t (McGough, Johansson, Lipsitch, and Menzies, 2020; Salmon, Schumacher, and Höhle, 2016) . As we say, only ideally is that I t = KI * t . Even after standard data pre processing techniques are applied, conditions may change during an epidemic with the net result that K changes over time, for example. Undoubtedly, the use of raw incidence data, without proper pre processing, most likely will lead to incorrect/biased estimates of R t (Gostic et al., 2020) . We do not discuss data preprocessing any further and assume I t = KI * t throughout, with an unknown fixed K. In Cori et al. (2013) the estimation of R t , based on the proxy incidence I t , uses (1) to form the statement that

A probability model is then proposed for the observed I t conditional on I t−1 , I t−2 , . . . , I 1 with the expected value R t Λ t . Cori et al. (2013) proposed the Poisson model I t |I t−1 , I t−2 , . . . , I 1 ∼ P o(R t Λ t ). Incidence data is commonly overdisperse and the immediate improvement on this Poisson model would be to use a more general count data model as a Negative Binomial (Lindén and Mäntyniemi, 2011) . Dispersion parameters may be fixed or estimated in the Negative Binomial (Flaxman et al., 2020) . However, directly postulating a model for incidence data may result in estimates that depend Incidence data for the ongoing COVID19 epidemic in two metropolitan areas of Mexico (left, confirmed cases, see Capistran et al., 2020) . The corresponding posterior of R t according to Cori et al. (2013) Poisson model (right). Also, the Mérida incidence data multiplied by 10, and its corresponding posterior (dark red). The uncertainty presented by the posterior responds to the incidence absolute values and not to the variation in reproduction numbers. The generation interval w s used here is explained in Figure 2 . on the unknown constant K. Moreover, as far as estimating R t is concern, the actual variability of I t is not in principle relevant, only the variability observed in the ratios I t /Λ t .

The resulting inference of Cori et al. (2013) , where they assume a constant R τ t of τ = 7 days (with a gamma Ga(5, 1) conjugate prior) is illustrated in Figure 1 . In particular, the coefficient of variation (CV) of the posterior distribution for R τ t is 1 Cori et al., 2013, p.3 web material) . The CV will decrease if we increase the arbitrary constant K. It does depend only on the absolute current values for I t and not on the intrinsic variation of the underlying R t 's, see Figure 1 . Further elaboration on this Poisson model may be found in recent literature. Thompson et al. (2019) improves this Bayesian inference of R t by also estimating w s and adding the corresponding variability in the posterior of R t . Nouvellet et al. (2018) uses Cori et al. (2013) embedded in a more complex forecast system. The need for a statistical estimation of R t is apparent, with a reasonable estimation of the uncertainty in all inferences.

Recently, and prompted by the current uses of R t estimates during the COVID19 epidemic, Gostic et al. (2020) make a review of current methods and recommend the use of Cori et al. (2013) , in contrast to other methods not based on (1), see Gostic et al. (2020) and references therein. Gostic et al. (2020) stress some other three points in the estimation of R t , namely: 1) Data for estimating the infectivity function w s is rarely available, and proxies for estimating w s should be used with care. 2) Correct incidence data should always be used, representing the number of new infections every day, taking special care to correct for reporting delays and other issues. 3) Smoothing of R t estimation is a matter of concern. Cori et al. (2013) uses a smoothed window of τ = 7 days, but this should be used judiciously depending on the data at hand, although no definite guidelines are suggested in Gostic et al. (2020) . However, Gostic et al. (2020) focus on other comparing issues in the estimation of R t and did not concentrate on the adequateness of the uncertainty expressed by the posterior distributions of Cori et al. (2013) .

Our aim here is to improve the Uncertainty Quantification (UQ) in the estimation of R t , from the definition in (1), and using an analytic Bayesian approach (i.e. non-MCMC) that may be robustly embedded in other systems or used by non-experts.

We follow the same basic idea of Cori et al. (2013) but interpret (1) differently. Namely, define ρ t = log(R t ), then

where i k = log(I k ) + log(K) and λ t = log(Λ t ) + log(K), for some unknown constant K. As usual, when taking logarithms of count data to postulate an observation model we assume that i t | · · · ∼ N (ρ t + λ t , σ), for some unknown variance σ, or equivalently

As recommended by Cori et al. (2013) , R t should only be estimated for incidence with I t ≥ 10. This further justifies the use of the above Normal model. Therefore, with the observed log R t 's, i.e. the y t s, we will try to estimate the ρ t = log(R t )s. Cori et al. (2013) assumed a constant R τ t over τ days, obtaining some smoothing. A further improvement on the approach Table 1 : Updating formulas for the parameters of the non-central student-t posterior distribution for ρ t , ρ t |D t ∼ T nt (m t , c t ). Cori et al. (2013) is that we explicitly and formally model smoothing, or coherence, among the ρ t s. This we do by postulating an autoregressive prior for the ρ t 's, modeling the fact that ρ t should be similar (but not necessarily equal) to ρ t−1 . This creates a Dynamic Linear Model, and using Bayesian updating, forms a filtering type inference for the ρ t sequence.

We will continue to assume the generation interval w s fixed and dedicate our effort to improving the UQ in the Bayesian estimation of R t . As mentioned above, we do not discuss data preprocessing and assume that I t is a correct proxy for estimating R t . We do not provide an analysis on the effects of incidence data anomalies on our estimation of R t . The paper is organized as follows. We present the details of our approach in the next section and in section 3 we present two examples using recent COVID19 incidence data from Mexico. Finally, in Section 4 we make a discussion of our results.

A Dynamic Linear Model (West and Harrison, 1997) is proposed for the estimation of the ρ t 's. Namely,

an autoregressive AR(1) (prior) model is proposed to estimate the log R t 's. Here we state that ρ t is equal to ρ t−1 plus some noise ω t . The precision φ t process, i.e. σ 2 t = φ −1 t , is the variance of the observed log R t 's, y t , and is assumed unknown. With γ 1 ∼ Beta(δn 0 /2, (1 − δ)n 0 /2) a second AR (1) is assumed to estimate these variances. n t are the discounted degrees of freedom n t = δn t−1 + 1, (see West and Harrison, 1997, sec. 10 .8 for details). Here we also assume that the precision φ t is similar to φ t−1 multiplied by the random innovation or shock γ t /δ. Since E(γ t ) = δ, γ t /δ has a shifted Beta distribution with mean 1. This keeps all φ t positive.

The discount factor δ and the hyperparameters m 0 , c * 0 , w * , n 0 are part of the modeling and prior definition (see below) and considered known.

The marginal posterior of ρ t may be calculated analytically and corresponds to a non-central student-t distribution with n t degrees of freedom

where D t = (y t , D t−1 ); see Table 1 for the recursive calculation of the central parameter m t and dispersion parameter c t . This constitutes a filtering type estimation (Kalman, 1960) of the ρ t 's, with discounted variance (West and Harrison, 1997, sec. 10.8) . The marginal posterior distribution of R t may be obtained from (2), namely f Rt (r|D t ) = r −1 f ρt (log(r)|D t ).

In particular, quantiles for this posterior may be calculated by transforming the corresponding quantiles of (2) since R t = exp(ρ t ) is monotonic.

The assumption of Cori et al. (2013) that R t remains constant over τ days is relaxed here with the AR(1) prior model for the ρ t 's, ρ t = ρ t−1 + ω t , ω t ∼ N (0, φ −1 t w * ). The variance discount factor δ is key for learning from recent variability in the ρ t 's and gradually cease to learn from more distant ρ t 's. It is proved that the degrees of freedom n t , which we may visualize as the "effective" sample size, the number of sample points in the past actually used to estimate the variance component, has the property n t → (1 − δ) −1 (West and Harrison, 1997, sec. 10.8) . We postulate that (1 − δ) −1 = 2τ , or δ = 1 − (2τ ) −1 . That is, the limit to learning from the past is 2τ days. Regarding the prior for ρ 0 ∼ N (m 0 , φ −1 0 c * 0 ), we center it at m 0 = 0 (i.e. R t = 1) with the same variance for the observed y t , that is c * 0 = 1. Regarding the prior for γ 1 ∼ Beta(δn 0 /2, (1 − δ)n 0 /2), note that by design it is centered The resulting posterior distributions of R t on 03JUL2020 for our method, for the Mérida COVID19 incidence data presented in Figure 1 .

at δ. Using n 0 = 2 provides a diffuse prior with large variance, and leads to the default non informative Beta(1/2, 1/2) in the case of δ = 1/2. The variance for the AR(1) model for ρ t is crucial, as it controls the memory and smoothness in our approach. Our a priori statement on this variance is the value of the variance multiplier w * . The prior model for ρ t is in fact a summation of Gaussian shocks (as in the Weinner process). It is clear then that ρ t |ρ t−k ∼ N (ρ t−k , φ −1 t kw * ). If the variance is similar to, or larger than, the variance of the observational process, that is kw * ≥ 1, then little smoothing is obtained at lag k since ρ t can vary as much as, or more, than y t−k . Taking k = τ , we set τ w * = 2, thus w * = 2/τ , to also limit the level of smoothing to less than τ days. We postulate τ w * = 2 as a pragmatic large enough variance multiplier, that leads to a quite similar smoothing obtained by Cori et al. (2013) using the same value for τ . This is illustrated in the examples presented in the next Section.

A key input to calculating R t is postulating w s through an infectious disease generation interval. w s may be seen as the probability ( ∞ s=1 w s = 1) that an infected person infects other people s days from infection (Fraser, 2007) .

A readily way to define w s is by defining a pdf f (s) and with its cdf let w s = F (s) − F (s − 1). For the COVID19 epidemic, we postulate an Erlang( 3, 8/3) pdf depicted in Figure 2(a) . The expected value is at 8 days and the maximum infectivity is at 5.5 days, decaying near zero after 20 days. This is based on early (Team et al., 2020; Verity et al., 2020) as well as recent reports on the viral load and implied infectiousness of the decease (see Cevik, Kuppalli, Kindrachuk, and Peiris, 2020, fig. 2 ). The use of the Erlang distribution in epidemics has been suggested elsewhere (Champredon, Dushoff, and Earn, 2018) .

To illustrate that our inferences are stable, we also present an Erlang generation interval shifted to the right, that may also be considered adequate for COVID19. The resulting posterior distributions for the R t using our method, for Mérida, México, on 03JUL2020 are presented in Figure 2 (b). Certainly, under broad regular circumstances, the Bayesian posterior operator is smooth with respect to changes in the prior (see Christen and Pérez-Garmendia, 2020, for example). We continue with an Erlang( 3, 8/3) as our generation interval distribution for COVID19 in both examples below.

We use the same data sets presented in Figure 1 to illustrate our approach, plotting the posterior ρ t time series from the last date in the time series to 60 days before, see Figure 3 . Also plotted in Figure 3 are the posterior for the R t s using Cori et al. (2013) and the observed R t = e yt s. Regarding the Mérida data, note how Cori et al. (2013) posterior is basically insensitive to the variability observed in the data, the posterior inter-quantile ranges remain basically equal. For our approach, note from 05.05 to 19.05 the large variability in e yt leads to large posterior ranges whereas from 02.06 to 09.06 a more stable observed R t results in far shorter posterior inter-quantile ranges. Later on, by the end of June, more spread posteriors are seen again given the observed second wave of large and unstable R t s. A sharp increase in e yt follows a sharp decrease, from above 2 to nearly 1, within 10 days until 03.07. Our AR(1) modeling results in a smooth decay on the position of the posterior for R t , and shorter spread, as a response to the short stable spell of observed e yt . For these same 10 days, a similar behavior is seen in the position of Cori et al. (2013) 's posterior, but this as a result of the moving window of τ days, as explained in the previous section. As already mentioned, the UQ resulting from the posterior of Cori et al. (2013) responds only to the absolute values of I t s, which in this case are around 25 to 100 in May and June.

Regarding the results from Mexico city, presented in Figure 3 , note how Figure 3 : R t 's for incidence data presented in Figure 1 for the last 60 days. From the posterior marginal of each ρ t in (2) we calculate its 10%, 25%, 50%, 75% and 90% quantiles and these are transformed back to obtain quantiles of R t . These are plotted vertically with blue shades and the median with a red line. The Poisson model posterior of R t of Cori et al. (2013) is also plotted, using the same quantiles with green shades and a black line for the median. The posterior of Cori et al. (2013) for Mexico city is so sharp that the quantile shades are hardly seen. The gray lines are the observed R t = e yt s.

the posterior of Cori et al. (2013) basically collapses around the median, given the large I t counts of this large metropolitan area (with a population of more than 21 million, I t varies around 1000 to 2000 in May and June). On the contrary, our UQ provided by the posterior of R t s presents larger spreads in early May, when an unstable spell is observed for e yt . Later on, after most of June with a more stable period, the uncertainty decreases with far sharper posteriors by the end of the month and the firsts days of July.

Regarding the smoothing obtained in the estimation of R t with respect to the observed values, that is, gray lines in Figure 3 , see how sharp changes in the latter are reasonably smoothed out by the former. Our strategy to setting the variance multiplier hyperparamenter w * in the AR(1) model for ρ t (i.e. w * = 2/τ , with τ = 7 days, see Section 2.1) results in a smoothing similar to that of Cori et al. (2013) 's, although with larger and adaptive interquantile ranges, as already commented. The smoothness obtained by Cori et al. (2013) is a result of assuming a fixed R t for τ = 7 days and calculating the posterior independently for each R t . Ours is the result of the DLM and the autoregressive prior model on ρ t .

We present a new approach for estimating the effective reproduction number R t with the following improvements: 1) It does not depend on a compartmental model, specific to a particular disease. 2) Using previous ideas (Cori et al., 2013; Fraser, 2007) we construct a novel statistical model and Bayesian inference approach to estimate R t . 3) The resulting UQ is better suited responding to the recent variation of observed R t s. 4) An AR(1) process is used to promote smoothness on the estimation and 5) all calculations are simple, leading to a Bayesian updating (filtering) type analysis that may be used by non-experts. In fact, a simple to use Python program with an implementation of our method (and also the original of Cori et al., 2013 ) may be downloaded from https://github.com/andreschristen/RTs. We are working also on an alternative implementation in a spreadsheet.

The generation interval w s is a crucial input to this and other similar methods to estimate R t . Note that based on preliminary data on infected patients, some basic knowledge can be gained to postulate a reasonable w s . This was the case since early stages of the COVID19 epidemic (Team et al., 2020) . It is easier to gain information on w s since it represents information on the infectivity, bound to an average infected person. Social contact patters, lockdown measures etc do not play a role in w s (from the underlying assumption that β(t, s), in the renewal equation I(t) = β(t, s)I(t − s)ds, has the form β(t, s) = R(t)w(s), see Fraser, 2007) . The social/epidemiological factors influencing contagion patters are coded in R t , an appealing feature of this type of approaches to estimate the latter, to monitor changes in epidemic data from the onset.

Estimating the infectious disease generation interval w s requires time consuming clinical studies that are commonly not available during an ongoing epidemic with a new virus as Sars-Cov-2. Thompson et al. (2019) present an attempt to include uncertainty in the estimation of w s into the posterior distribution of R t of Cori et al. (2013) . Including an estimate of w s in our method, along with its corresponding uncertainty, is formally straightforward within this Bayesian framework, either by including a likelihood with additional observables or by postulating a prior process for w s . Indeed, this will complicate the approach and most likely MCMC methods will need to be used.

As already mentioned, a reliable estimation of R t is quite relevant, to be used as such, or as part of other larger forecasting or monitoring systems. Bettencourt and Ribeiro (2008) elaborate on a simple relation of future incidence with predicted R t s. A reliable forecasting system of R t could then be used for short term forecasting of I t . A related idea is used in Nouvellet et al. (2018) for incidence forecast and other our purposes. Being at time t, the process ρ t+k , k = 1, 2, . . . , p can be easily simulated, by simulating ρ t and other parameters from the posterior at D t and in turn simulating ρ t+k |ρ t+k−1 until ρ t+p , to produce MC samples of the posterior predictive distribution for R t . A feasible path for incidence forecasting based on forecasts of R t is envisaged, but we need to leave this possibility for future research.

Real time bayesian estimation of the epidemic potential of emerging infectious diseases

Forecasting hospital demand in metropolitan areas during the current covid-19 pandemic and estimates of lockdown-induced 2nd waves

Estimating in real time the efficacy of measures to control emerging communicable diseases

Virology, transmission, and pathogenesis of sars-cov-2

Equivalence of the erlang-distributed seir epidemic model and the renewal equation

Weak and tv consistency in bayesian uncertainty quantification using disintegration

The estimation of the effective reproductive number from disease outbreak data

A new framework and software to estimate time-varying reproduction numbers during epidemics

Strategies for mitigating an influenza pandemic

Estimating the effects of non-pharmaceutical interventions on covid-19 in europe

Estimating individual and household reproduction numbers in an emerging epidemic

Factors that make an infectious disease outbreak controllable

A New Approach to Linear Filtering and Prediction Problems

Using the negative binomial distribution to model overdispersion in ecological count data

Nowcasting by bayesian smoothing: A flexible, generalizable model for real-time epidemic tracking

A simple approach to measure transmissibility and forecast incidence

Coronavirus disease 2019 (covid-19) -daily situation report of the robert koch institute

Monitoring count time series in r: Aberration detection in public health surveillance

Updated rapid risk assessment from ecdc on the novel coronavirus disease 2019 (covid-19) pandemic: increased transmission in the eu/eea and the uk

Improved inference of time-varying reproduction numbers during infectious disease outbreaks

Estimates of the severity of covid-19 disease

Different Epidemic Curves for Severe Acute Respiratory Syndrome Reveal Similar Impacts of Control Measures

Bayesian Forecasting and Dynamic Models

The authors are partially founded by CONACYT CB-2016-01-284451 and COVID19 312772 grants and a RDCOMM grant. AC was partially supported by UNAM PAPPIT-IN106118 grant.