Tracking disease outbreaks from sparse data with Bayesian inference
Bryan Wilder, Michael J. Mina, Milind Tambe
2020-09-12

The COVID-19 pandemic provides new motivation for a classic problem in epidemiology: estimating the empirical rate of transmission during an outbreak (formally, the time-varying reproduction number) from case counts. While standard methods exist, they work best at coarse-grained national or state scales with abundant data, and struggle to accommodate the partial observability and sparse data common at finer scales (e.g., individual schools or towns). For example, case counts may be sparse when only a small fraction of infections are caught by a testing program. Or, whether an infected individual tests positive may depend on the kind of test and the point in time when they are tested. We propose a Bayesian framework which accommodates partial observability in a principled manner. Our model places a Gaussian process prior over the unknown reproduction number at each time step and models observations sampled from the distribution of a specific testing program. For example, our framework can accommodate a variety of kinds of tests (viral RNA, antibody, antigen, etc.) and sampling schemes (e.g., longitudinal or cross-sectional screening). Inference in this framework is complicated by the presence of tens or hundreds of thousands of discrete latent variables. To address this challenge, we propose an efficient stochastic variational inference method which relies on a novel gradient estimator for the variational objective. Experimental results for an example motivated by COVID-19 show that our method produces an accurate and well-calibrated posterior, while standard methods for estimating the reproduction number can fail badly.

A key goal for public health is effective surveillance and tracking of infectious disease outbreaks. This work is motivated in particular by the COVID-19 pandemic, but the methods we describe are applicable to other diseases as well. A central question is to estimate the empirical rate of transmission over time, often formalized via the reproduction number R_t, t = 1...T. R_t describes the expected number of secondary infections caused by someone infected at time t. Accurate estimates of R_t are critical to detect emerging outbreaks, forecast future cases, and measure the impact of interventions imposed to limit spread.

R_t is typically estimated using daily case counts, i.e., the number of new infections detected via testing each day. Standard methods, including prominent dashboards developed for COVID-19, provide accurate estimates under idealized conditions for the observation of cases and have been successfully used at a national or state level, where many observations are available and sampling variation averages out (Flaxman et al. 2020; Systrom, Vladek, and Krieger 2020). However, successful reopening will require programs which track spread at the level of particular colleges, workplaces, or towns, where partial observability poses several challenges. First, only a small number of infected people may be tested. It is estimated that only about 10% of SARS-CoV-2 infections in the US result in a confirmed test (Havers et al. 2020), and we could expect even fewer in populations with a high prevalence of asymptomatic or mild infections (e.g., college students). Second, the biological properties of the test play an important role.
For example, a PCR test which detects viral RNA will show positive results at different times than an antibody or antigen test. Further, there can be substantial heterogeneity across individuals. Third, testing programs may collect samples in a particular way which impacts the observations. For example, one suggestion for schools and workplaces to reopen is to institute regular surveillance testing of a fraction of the population in order to detect outbreaks and catch asymptomatic carriers (Larremore et al. 2020). The observations will depend on the fraction of the population enrolled in testing (potentially small due to budget constraints) along with the sampling design (e.g., cross-sectional vs longitudinal).

This paper presents GPRt, a novel Bayesian approach to estimating R_t which accounts for partial observability in a flexible and principled manner (illustrated in Figure 1). This method yields well-calibrated probabilistic estimates (the posterior distribution). Our model places a Gaussian process (GP) prior over R_t, allowing it to be an arbitrary smooth function. Then, we explicitly model the sampling process which generates the observations from the true trajectory of infections. While this substantially improves accuracy (as we show experimentally), it creates a much more difficult inference problem than has been previously considered. Specifically, our model contains tens or hundreds of thousands of discrete latent variables, preventing the application of out-of-the-box methods. Moreover, the values of many variables are tightly correlated in the posterior distribution, further complicating inference. To make inference computationally tractable, we propose a novel stochastic variational inference method, enabled by a custom stochastic gradient estimator for the variational objective. Extensive experiments show that our method recovers an accurate and well-calibrated posterior distribution in challenging situations where previous methods fail.

Figure 1: Illustration of our GPRt method. Top row: the generative model GPRt posits for the observed data. Bottom row: the inference process to recover a posterior over R_t.

There is a substantial body of work which attempts to infer unknowns in a disease outbreak. A frequent target for inference is the basic reproduction number R_0 (Majumder and Mandl 2020; Riou and Althaus 2020; Wilder et al. 2020); by contrast, we attempt the more challenging task of estimating a reproduction number which can vary arbitrarily over time. There is a literature of both classic methods for estimating R_t (Wallinga and Teunis 2004; Cori et al. 2013; Campbell et al. 2019) and newer methods developed for the COVID-19 pandemic and implemented in popular dashboards (Systrom, Vladek, and Krieger 2020). None of these methods incorporate partial observability, and we empirically demonstrate that GPRt improves substantially over methods in each category. Another strand of literature develops maximum likelihood or particle filter estimates of the parameters of an epidemiological model (King et al. 2008; Dureau, Kalogeropoulos, and Baguelin 2013; Cazelles, Champagne, and Dureau 2018). However, that work focuses on accommodating a complex model of the underlying disease dynamics; by contrast, we develop methods for probabilistically well-grounded inference under a complex observation model. There is also a great deal of computational work more broadly concerned with disease control.
Examples include optimization problems related to vaccination or quarantine decisions (Saha et al. 2015; Zhang and Prakash 2014; Zhang et al. 2015), machine learning methods for forecasting (without recovering a probabilistic view of R_t) (Chakraborty et al. 2014; Rekatsinas et al. 2015), and agent-based simulations of disease dynamics (Barrett, Eubank, and Marathe 2008). Our work complements this literature by allowing inference of a distribution over R_t from noisy data, which can serve as the input that parameterizes an optimization problem or simulation.

We now introduce a model for a disease process with a time-varying reproduction number. Subsequently, we introduce example models, which our framework can support, for how the observations are generated from the disease process. We use a standard stochastic model of disease transmission similar to other R_t estimation methods (Wallinga and Teunis 2004; Cori et al. 2013; Campbell et al. 2019); our contribution is a more powerful inference methodology which can accommodate complex observation models alongside a GP prior.

Let R = [R_1 ... R_T] be the vector with R_t for each time. From R, the disease model defines a distribution over a vector n = [n_1 ... n_T] with the number of people newly infected each day. The main idea is that, over the course of a given individual's infection, they cause a Poisson-distributed number of new infections with mean determined by R. Specifically, if on day t individual i has been infected for h_i days, the expected number of new infections caused by i that day is

    λ_t^i = R_t w_{h_i}.

Here w_{h_i} gives the level of infectiousness of an individual h_i days post-infection. w is normalized so that Σ_h w_h = 1. For later convenience, we will define φ_t = Σ_{i=1}^N w_{h_i} to be the total infectiousness in the population before scaling by R_t (N is the total population size). Each day, each infected individual i draws n_t^i ~ Poisson(λ_t^i) other individuals to infect. We also incorporate infections from outside the population, with a mean of γ such infections per day. We assume the rate of external infection is constant over time, but our model could be extended to a time-varying γ. We treat γ mostly as a nuisance parameter: our true objective is to infer R, but doing so requires accounting for the potential that some detected cases are due to infections from outside the population. We define n_t = Σ_{i=1}^N n_t^i + Poisson(γ) to be the total number of new infections. Since φ is fixed given the time series n, we denote it as a function φ(n). We denote the probabilistic disease model induced by a specific choice of R and γ as M(R, γ) and the draw of a time series of infections from the model as n ~ M(R, γ).
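To make the disease model concrete, the following is a minimal NumPy sketch of drawing a trajectory n ~ M(R, γ). It is illustrative only: the function name, the seeding of initial infections, and the truncation of the infectiousness profile to a fixed horizon are our own simplifying choices, not details from the paper.

```python
import numpy as np

def sample_infections(R, w, gamma, n_seed=5, rng=None):
    """Draw one trajectory n ~ M(R, gamma) from the stochastic disease model.

    R: length-T array of reproduction numbers R_t.
    w: infectiousness profile over days since infection (normalized to sum to 1).
    gamma: mean number of imported (external) infections per day.
    n_seed: infections on day 0 used to start the process (arbitrary choice for this sketch).
    """
    rng = np.random.default_rng() if rng is None else rng
    T, H = len(R), len(w)
    n = np.zeros(T, dtype=int)
    n[0] = n_seed
    for t in range(1, T):
        # phi_t: total infectiousness of previously infected individuals, i.e. the sum of
        # w evaluated at each individual's number of days since infection.
        ages = np.arange(1, min(t, H) + 1)
        phi_t = np.sum(n[t - ages] * w[ages - 1])
        # By Poisson superposition, new infections on day t are Poisson(R_t * phi_t + gamma).
        n[t] = rng.poisson(R[t] * phi_t + gamma)
    return n
```

For example, a seven-day uniform profile `w = np.ones(7) / 7` with a constant `R = np.full(100, 1.5)` produces a growing epidemic curve; these values are placeholders, not parameters from the paper.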
We now depart from the standard disease model used in previous work and describe a wide-ranging set of examples for how our framework can accommodate models of the process which generates the observed data from the latent (unknown) true infections. Previous work assumes either perfect observability or else the simplest of the three observation models we describe below (uniform undersampling). Our focus is on settings where observations are generated by a medical test which confirms the presence of the pathogen of interest. Individuals have some probability of being tested at different times (depending on the testing policy adopted) and then test positive with a probability which depends on the biological characteristics of the disease and the test in question.

The kind of test employed determines when an individual is likely to test positive during the course of infection. For example, for COVID-19, PCR tests are commonly used to detect SARS-CoV-2 RNA. They are highly sensitive and can detect early infections. Most infected individuals become PCR-negative within a week or two following infection as viral RNA is cleared (Kucirka et al. 2020). By contrast, serological tests detect the antibodies produced after infection. An individual is not likely to test serologically positive until a week or more post-infection, but may then continue to test serologically positive for months afterwards (Iyer et al. 2020). The observable data generated by a serological testing program is likely quite different from that of a PCR testing program, since the time frame and variance of when individuals test positive differ strongly. A range of other examples are possible (e.g., antigen tests) and can be easily incorporated into our framework.

Our model adopts a generic representation of a particular test as a distribution D over t_convert, the number of days post-infection when an individual begins to test positive, and t_revert, the number of days post-infection when they cease to test positive. For an infected individual i, we write t^i_convert, t^i_revert ~ D. Our method only assumes the ability to sample from D, meaning that we can directly plug in the results of lab studies assessing the properties of a test. t^i_convert and t^i_revert are unobserved: we only get to see if an individual tests positive at a given point in time, not the full range of times that they would have tested positive.

Next, we describe a series of example models for how and when individuals are tested, which reveal observations depending on the status (t^i_convert, t^i_revert) for each person who is tested. For convenience, we let t^i_convert = ∞ for an individual who is never infected. Note however that we can model false negatives by having D sometimes set t^i_convert = ∞ for an infected individual, or false positives by returning a finite t^i_convert for an uninfected individual. We denote the number of observed positive tests on day t as x_t. An observation model is a distribution over x given n, denoted x ~ Obs(n). Each setting below describes one such distribution.

Uniform undersampling. In this setting, each individual who is infected is tested independently with some probability p_test (e.g., if they individually decide whether to seek a test). To model this process, we introduce two new sets of latent random variables. First, a binary variable z_i ~ Bernoulli(p_test), indicating whether individual i is tested. Second, a delay c_i, giving the number of days between t^i_convert and when individual i is tested. We can integrate out the z_i and obtain the following conditional distribution for the observed number of positive tests x_t:

    x_t | t_convert, c ~ Binomial( Σ_{i=1}^N 1[t^i_convert + c_i = t], p_test ),

where 1[·] denotes the indicator function of an event. However, we cannot analytically integrate out t_convert and c.

Cross-sectional testing. Here, a uniformly random sample of s_t individuals is tested on each day t. This models a random screening program (e.g., testing random employees each day as they enter a workplace). In this case, x_t counts the test-positive individuals (those with t^i_convert ≤ t ≤ t^i_revert) who fall into the day-t sample of s_t people drawn without replacement, so it follows a hypergeometric distribution. This expression provides the likelihood of x after conditioning on the latent variables t^i_convert, t^i_revert, though there is no closed-form expression conditioning only on n.
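To illustrate how a test distribution D and a sampling scheme combine to produce observations x, here is a sketch of the cross-sectional observation model given a trajectory n. The explicit loop over infections, the hypergeometric draw for screening without replacement, and all names are our own illustrative choices under the model as described, not the authors' implementation.

```python
import numpy as np

def sample_cross_sectional_obs(n, N, s, sample_D, rng=None):
    """Sample observed positive counts x under cross-sectional screening.

    n: length-T array of new infections per day (e.g., drawn from the disease model above).
    N: total population size.
    s: length-T array of the number of individuals screened on each day.
    sample_D: callable(rng) -> (t_convert, t_revert), days post-infection at which an
              individual starts and stops testing positive (one draw from D).
    """
    rng = np.random.default_rng() if rng is None else rng
    T = len(n)
    positive = np.zeros(T, dtype=int)  # number of people who would test positive each day
    for day in range(T):
        for _ in range(int(n[day])):
            t_conv, t_rev = sample_D(rng)
            start = day + t_conv
            end = min(day + t_rev, T - 1)
            if start < T:
                positive[start:end + 1] += 1
    # On day t, s[t] people are drawn uniformly without replacement from the population,
    # so the number of positives among them is hypergeometric.
    x = np.array([rng.hypergeometric(positive[t], N - positive[t], s[t]) for t in range(T)])
    return x
```

A PCR-like test could be mimicked for testing purposes by, say, `sample_D = lambda rng: (rng.integers(2, 4), rng.integers(10, 21))`; in practice the distribution would come from lab studies such as those cited above, and these numbers are placeholders only.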
Longitudinal testing. In this setting, a single sample from the population is chosen up front and every individual in the sample is tested every d days. We again denote the total number of individuals tested on day t as s_t, but note that now the group of individuals who are tested repeats every d days. Longitudinal testing offers different (and potentially more revealing) information than cross-sectional testing, since when an individual first tests positive, we know that they did not test positive d days ago. However, it complicates inference by introducing correlations between the test results at different time steps. Let A_t denote the set of individuals who are tested at time t. We assume that the complete sample ∪_{t=1}^d A_t is chosen uniformly at random from the population, with the chosen individuals then randomly partitioned between the d days. We have

    x_t = Σ_{i ∈ A_t} 1[t^i_convert ≤ t ≤ t^i_revert] · 1[t^i_convert > t − d],

where t^i_convert > t − d captures that i was not positive on their previous test. This introduces correlations between x_t and x_{t−d}, so there is not a simple closed-form expression for the distribution of the time series x even after conditioning on t^i_convert and t^i_revert (as there is in the cross-sectional case). We will instead build a flexible framework for inference which can work just as well with a stochastic estimate of the log-likelihood.

We will place a Gaussian process (GP) prior over R, resulting in the following generative model:

    R ~ GP(1, K),    γ ~ Exp(γ̄),    n ~ M(R, γ),    x ~ Obs(n),

where GP(1, K) denotes a Gaussian process with constant mean 1 and kernel K, and Exp(γ̄) is an exponential prior on γ with mean γ̄. Given the observation x, our goal is to compute the resulting posterior distribution over R and γ. However, this is complicated by the fact that x is determined by a large number of discrete latent variables, primarily n (the time series of infections) and {t^i_convert, t^i_revert}_{i=1}^N, the times when each individual tests positive.
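Putting the pieces together, the sketch below draws one sample from this generative model, reusing the two samplers sketched earlier. The squared-exponential kernel, the jitter term, the clipping of R at zero before simulation, and the choice of cross-sectional observations are all our own assumptions for illustration; the paper does not commit to these specifics.

```python
import numpy as np

def sample_from_prior(T, N, s, w, sample_D, gamma_bar=1.0, length_scale=10.0, rng=None):
    """One draw of (R, gamma, n, x) from the generative model:
    R ~ GP(1, K), gamma ~ Exp(gamma_bar), n ~ M(R, gamma), x ~ Obs(n)."""
    rng = np.random.default_rng() if rng is None else rng
    times = np.arange(T)
    # Example kernel K: squared-exponential, with a small jitter for numerical stability.
    K = np.exp(-0.5 * (times[:, None] - times[None, :]) ** 2 / length_scale ** 2)
    R = rng.multivariate_normal(np.ones(T), K + 1e-6 * np.eye(T))
    gamma = rng.exponential(gamma_bar)
    # Clip R at zero before simulation, since the GP prior can produce negative values.
    n = sample_infections(R.clip(min=0.0), w, gamma, rng=rng)
    x = sample_cross_sectional_obs(n, N, s, sample_D, rng=rng)
    return R, gamma, n, x
```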
A typical strategy for inference in complex Bayesian models is Markov Chain Monte Carlo (MCMC). However, MCMC is difficult to apply because of tight correlations between the values of variables over time: due to the GP prior, we expect values of R to be closely correlated between timesteps, and successive values of n are highly correlated via the model M. Formulating good proposal distributions for high-dimensional, tightly correlated random variables is notoriously difficult and has presented problems for GP inference via MCMC in other domains (Titsias, Lawrence, and Rattray 2008). The other main approach to Bayesian inference is variational inference, where we attempt to find the best approximation to the posterior distribution within some restricted family. Modern variational inference methods, typically intended for deep models such as variational autoencoders (Kingma and Welling 2013), use a combination of autodifferentiation frameworks and the reparameterization trick to differentiate through the variational objective (Kucukelbir et al. 2017). This process is highly effective for models with only continuous latent variables. However, our model has many thousands of discrete latent variables which cannot be reparameterized in a differentiable manner. Typical solutions to this problem would be to either integrate out the discrete variables or to replace them with a continuous relaxation (Jang, Gu, and Poole 2017; Vahdat et al. 2018).

Neither solution is attractive in our case: the structure of the model does not allow us to integrate out the discrete variables analytically, while a continuous relaxation is infeasible because our latent variables have a strict integer interpretation (every infection corresponds to a particular individual becoming test-positive at particular points in time). The remaining option for differentiating through discrete probabilistic models is the score function estimator (Paisley, Blei, and Jordan 2012), which is often difficult to apply due to high variance. GPRt uses a combination of techniques which exploit the structure of infectious disease models to develop an estimator with controlled variance. First, we develop a more tractable variational lower bound which is amenable to stochastic optimization. Second, we hybridize the reparameterization and score function estimators across different parts of the generative model to take advantage of the properties of each component. Third, we develop techniques to sample low-variance estimates of the log-likelihood for each of the observation models introduced earlier. These techniques are introduced in the next section.

We now derive GPRt, a novel variational inference method for R_t estimation. GPRt approximates the true (uncomputable) posterior over (R, γ) via a multivariate normal distribution with mean µ and covariance matrix Σ. µ_t is the posterior mean for R_t, while Σ_{t,t'} gives the posterior covariance between R_t and R_{t'}. µ_γ is the mean for γ, and Σ_{γ,·} gives its covariance with R. The diagonal Σ_{t,t} gives the variance of the posterior over R at each time, capturing the overall level of uncertainty. The aim is to find a µ and Σ which closely approximate the true posterior.

Let q(R, γ|µ, Σ) denote the variational distribution. Let p be the true generative distribution, where p(R, γ, x) is the joint distribution over x and (R, γ), p(R, γ) is the prior over (R, γ), and p(R, γ|x) is the posterior over (R, γ) after conditioning on x. The aim of variational inference is to maximize a lower bound on the total log-probability of the evidence x:

    log p(x) ≥ E_{R,γ~q}[log p(R, γ)] + E_{R,γ~q}[log p(x|R, γ)] + H(q),

where the right-hand side is referred to as the Evidence Lower Bound (ELBO) and H(q) denotes the entropy of q. Our goal is to maximize the ELBO via gradient ascent on the parameters µ and Σ. This requires us to develop an estimator for the gradient of each term in the ELBO. The first term is the negative cross-entropy between q and the prior p(R, γ). Because both q and p have simple parametric forms, this can be easily computed and differentiated. The last term is the entropy, which is similarly tractable. The middle term is the expected log-likelihood. Developing an estimator for the gradient of this term is substantially more complicated and will be our focus. In fact, for computational tractability we will actually develop an estimator for a lower bound on the expected log-likelihood; substituting this lower bound into the ELBO still gives a valid lower bound on log p(x) and so is a sensible objective. The essential problem is that computing the log-likelihood of R requires integrating out the discrete latent variable n induced by the disease spread model, which is computationally intractable. The aim of this section is to develop the following stochastic estimator.

Theorem 1. Let L be the Cholesky factor of Σ, ξ ~ N(0, I), (R(ξ), γ(ξ)) = µ + Lξ, n ~ M(R(ξ), γ(ξ)), and ∇̂ = ∇_{µ,L} log M(n | R(ξ), γ(ξ)) · log p(x|n). There exists a function g(µ, Σ) with E_{R,γ~q}[log p(x|R, γ)] ≥ g(µ, Σ) for all µ, Σ, and E[∇̂] = ∇g(µ, Σ).
Essentially, Theorem 1 states that ∇̂ is an unbiased estimator for a lower bound on the expected log-likelihood, exactly what we need to optimize a lower bound on log p(x) by stochastic gradient methods. Moreover, as we will highlight below, we can efficiently compute the terms of ∇̂ via a combination of leveraging the structure of the disease model to apply autodifferentiation tools and novel sampling methods for the observation model. We now derive ∇̂.

Proof. Expanding the dependence of x on n, we can rewrite the log-likelihood as

    E_{R,γ~q}[log p(x|R, γ)] = E_{R,γ~q}[ log E_{n~M(R,γ)}[ p(x|n) ] ].

It is not clear how to develop a well-behaved gradient estimator for this expression because we wish to differentiate with respect to the parameters governing two nested expectations, one within the log. However, via Jensen's inequality, we can derive the lower bound

    E_{R,γ~q}[log p(x|R, γ)] ≥ E_{R,γ~q}[ E_{n~M(R,γ)}[ log p(x|n) ] ],

pushing the log inside the expectation. We will substitute this bound into the ELBO, obtaining a valid lower bound to maximize. The key advantage is that our new lower bound admits an efficient stochastic gradient estimator. We start with the inner expectation and attempt to compute a gradient with respect to R (which controls the distribution of the simulation results n). Using the score function estimator gives

    ∇_{R,γ} E_{n~M(R,γ)}[ log p(x|n) ] = E_{n~M(R,γ)}[ ∇_{R,γ} log M(n|R, γ) · log p(x|n) ],

which expresses the gradient with respect to (R, γ) in terms of the gradient of the probability density of the disease model M with respect to (R, γ). It turns out that log M(n|R, γ) can be easily computed. Recall that n_t = Σ_{i=1}^N n_t^i + Poisson(γ). Using the Poisson superposition theorem, we have that n_t ~ Poisson(R_t φ_t + γ) (while φ_t is a deterministic function of n_1...n_{t−1}). Accordingly, we have that

    log M(n|R, γ) = Σ_{t=1}^T log p(n_t | n_1...n_{t−1}, R, γ)
                  = Σ_{t=1}^T [ n_t log(R_t φ_t + γ) − (R_t φ_t + γ) − log(n_t!) ],

where the second line substitutes the Poisson log-likelihood. This expression can be easily differentiated with respect to R and γ in closed form. Accordingly, we obtain an unbiased estimate of the gradient of our lower bound by sampling n ~ M(R, γ) and computing ∇_{R,γ} log M(n|R, γ) · log p(x|n).

This suffices to estimate the gradient with respect to (R, γ). However, our goal is to differentiate with respect to µ and Σ, which control the distribution over (R, γ). Fortunately, R and γ are continuous. So, we can exploit the reparameterization trick by writing (R, γ) as a function of a random variable whose distribution is fixed. Specifically, since Σ is positive semi-definite, it has a Cholesky decomposition Σ = LL^T (for convenience, we actually optimize over L instead of Σ). Sampling a standard normal variable ξ ~ N(0, I) and letting (R, γ) = µ + Lξ is equivalent to sampling R, γ ~ N(µ, Σ). We rewrite the lower bound as

    E_{ξ~N(0,I)}[ E_{n~M(R(ξ),γ(ξ))}[ log p(x|n) ] ],

where µ and L appear only as parameters of the deterministic function expressing R and γ in terms of ξ, instead of in the distribution of a random variable. Taking a sample from each of the expectations and substituting the score function estimator now gives the desired expression for ∇̂.

Using Theorem 1, our final gradient estimator will sample b values for ξ, run the model M once for each of the resulting values of R to sample n, and then compute

    (1/b) Σ_{k=1}^b ∇_{µ,L} log M(n(k) | R(ξ(k)), γ(ξ(k))) · log p(x|n(k)),

which is easily accomplished with standard autograd tools given the closed-form expressions for R(ξ), γ(ξ), and log M(n|R, γ). In practice, we also use the mean of log p(x|n(k)) as a simple control variate to reduce variance (Sutton and Barto 1998).
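The following PyTorch sketch shows one reading of the resulting gradient step: reparameterize (R, γ), simulate n with gradients blocked, score the simulation under the Poisson density of M (which is differentiable in µ and L), and weight it by an estimate of log p(x|n) minus the batch-mean control variate. The seeding of the simulation, the clamping of R and γ for numerical safety, and the `log_obs_likelihood` argument (standing in for the observation-model estimates developed next) are our own choices; the prior cross-entropy and entropy terms of the ELBO, which are available in closed form, would be added to this surrogate in a full implementation.

```python
import torch

def likelihood_surrogate(mu, L_raw, x, w, log_obs_likelihood, b=8, n_seed=5.0):
    """Monte Carlo surrogate whose gradient w.r.t. (mu, L) follows the estimator above.

    mu: length-(T+1) tensor (first T entries drive R, the last drives gamma).
    L_raw: (T+1, T+1) tensor; its lower triangle is used as the Cholesky factor L.
    x: length-T tensor of observed positive tests.
    w: infectiousness profile tensor (sums to 1).
    log_obs_likelihood: callable(x, n) -> float, a (possibly sampled) estimate of log p(x|n).
    """
    T = len(x)
    L = torch.tril(L_raw)
    samples = []
    for _ in range(b):
        xi = torch.randn(T + 1)
        theta = mu + L @ xi                      # reparameterization: (R, gamma) = mu + L xi
        R = theta[:T].clamp(min=1e-3)            # clamp only for numerical safety in this sketch
        gamma = theta[T].clamp(min=1e-3)
        with torch.no_grad():                    # discrete simulation; no gradient flows through n
            n = torch.zeros(T)
            phi = torch.zeros(T)
            n[0] = n_seed
            H = len(w)
            for t in range(1, T):
                ages = torch.arange(1, min(t, H) + 1)
                phi[t] = (n[t - ages] * w[ages - 1]).sum()
                n[t] = torch.poisson(R[t] * phi[t] + gamma)
        rate = R[1:] * phi[1:] + gamma           # Poisson rates under M(R, gamma)
        log_M = (n[1:] * torch.log(rate) - rate).sum()   # differentiable in mu and L_raw
        samples.append((log_M, float(log_obs_likelihood(x, n))))
    baseline = sum(ll for _, ll in samples) / b  # mean log-likelihood as a control variate
    surrogate = sum(log_M * (ll - baseline) for log_M, ll in samples) / b
    return -surrogate                            # minimize the negative surrogate

# Illustrative usage: mu and L_raw are torch parameters optimized with, e.g., Adam;
# loss = likelihood_surrogate(...); loss.backward(); optimizer.step().
```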
We now turn to the task of computing the log-likelihood function log p(x|n), which measures the log-likelihood of observing the sequence of positive test results x given n new infections per day. Unfortunately, the log-likelihood is not available in closed form for any of the settings that we consider because it depends on additional latent variables (e.g., t_convert, t_revert, c, or A). We will show that it suffices to develop an estimator which lower-bounds the log-likelihood and that such estimators can be efficiently implemented for each of the observation models we consider. Specifically, denote the collection of latent variables used in a particular observation model as α. Then, we have

    log p(x|n) = log E_α[ p(x|n, α) ],

which presents a similar difficulty as in developing our earlier lower bound: sampling α to approximate the inner expectation does not result in an unbiased estimator due to the outer log. Using Jensen's inequality in the same way gives

    log p(x|n) ≥ E_α[ log p(x|n, α) ],

and so substituting the right-hand side into our variational objective preserves validity of the lower bound. The RHS has the crucial advantage that we can now develop an unbiased estimator by drawing a single sample of α, which can then be substituted into the stochastic gradient estimator of Theorem 1. That is, for each of the simulation results n(1)...n(b), we sample a corresponding value for the latent variables, α(1)...α(b), and use the gradient estimator ∇_{µ,L} log M(n(k) | R(ξ(k)), γ(ξ(k))) · log p(x|n(k), α(k)).

This works without issue for the uniform undersampling and cross-sectional models, where we can obtain a closed form for the log-likelihood after conditioning on the appropriate latent variables. However, the longitudinal testing model presents additional complications. In particular, after sampling the latent variables t_convert, t_revert, and A_t, the number of positive tests becomes a deterministic quantity. Denote this simulated trajectory of positive tests x̂. If x̂ = x, then p(x | t_convert, t_revert, A) = 1, and otherwise p(x | t_convert, t_revert, A) = 0. This renders the above gradient estimator useless because log p(x|n(k), α(k)) = −∞ unless the simulated trajectory exactly matches the observed data (a very low-probability event). While −∞ is technically a valid lower bound for the variational objective, it is not very useful for optimization. Essentially, we need to develop a lower-variance estimator where the lower bound is more useful.

We now present one such improved estimator. The intuition is that we can marginalize out a great deal of the randomness in the naive estimator by only revealing the results of the random draws determining A_t one individual at a time. We start by sampling t_convert and t_revert. Note that we can expand log p(x|n, t_convert, t_revert) = Σ_{t=1}^T log p(x_t | x_1...x_{t−1}, n, t_convert, t_revert) and consider the likelihood at each day t after conditioning on the results observed on previous days. To compute an estimate for this sum, we introduce a new object, the series of matrices C_t. At each time t, C_t[t_1, t_2] denotes the number of individuals who have t_convert = t_1, t_revert = t_2, and have not yet actually tested positive by time t.
Since A_t is selected uniformly at random from the population, independent of the infection process, the x_t individuals who test positive on day t are drawn uniformly at random from the set of all individuals who converted between days t − d and t, and who have not yet reverted. Let n_draws denote the number of individuals in A_t who have not yet tested positive by time t, and let

    n_conv = Σ_{t_1 = t−d}^{t} Σ_{t_2 = t+1}^{T} C_t[t_1, t_2]

denote the number of individuals who are "eligible" to test positive at time t. Now x_t | x_1...x_{t−1}, n, C_t follows a binomial distribution with n_draws draws and success probability n_conv / (N − Σ_{i=1}^{t−1} x_i). Accordingly, the log-likelihood log p(x_t | x_1...x_{t−1}, n, C_t) can be computed in closed form. After this, we can sample C_{t+1} given C_t by removing a uniformly random eligible individual from C_t for each of the x_t positive tests at time t. We can view this as iteratively revealing the test-positive members of A_t after conditioning on the sequence of previous test results, instead of sampling the entire set up front as in the naive method.
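As an illustration of this sequential estimator, the sketch below computes one sample of the lower bound on log p(x | n, t_convert, t_revert) for the longitudinal model: a binomial log-likelihood term for each day, followed by removal of the revealed positives from the count matrix. Representing individuals by a count matrix C indexed by (conversion day, reversion day), and passing in the per-day number of not-yet-positive tested individuals n_draws, are our own simplifications of the bookkeeping described above.

```python
import numpy as np
from scipy.stats import binom

def longitudinal_loglik_estimate(x, C_init, n_draws, N, d, rng=None):
    """One-sample estimate of log p(x | n, t_convert, t_revert) under longitudinal testing.

    x: length-T array of observed positive tests per day.
    C_init: count matrix; C_init[t1, t2] = number of individuals whose sampled
            t_convert = t1 and t_revert = t2 (shape at least (T+1) x (T+2)).
    n_draws: length-T array; tested individuals on day t who have not yet tested positive.
    N: population size.  d: days between successive tests of the same individual.
    """
    rng = np.random.default_rng() if rng is None else rng
    C = C_init.astype(float).copy()
    T = len(x)
    revealed = 0            # total positives revealed on earlier days
    loglik = 0.0
    for t in range(T):
        lo = max(t - d, 0)
        block = C[lo:t + 1, t + 1:]          # converted in [t-d, t], not yet reverted
        n_conv = block.sum()
        p = min(n_conv / (N - revealed), 1.0)
        loglik += binom.logpmf(int(x[t]), int(n_draws[t]), p)
        # Reveal the x_t positives: remove uniformly random eligible individuals from C.
        for _ in range(int(x[t])):
            flat = block.flatten()
            if flat.sum() <= 0:
                break
            idx = rng.choice(len(flat), p=flat / flat.sum())
            i1, i2 = np.unravel_index(idx, block.shape)
            block[i1, i2] -= 1               # block is a view, so this updates C
        revealed += int(x[t])
    return loglik
```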
We test the performance of GPRt against standard baselines on a wide variety of settings. We choose three baselines which have been recommended by leading epidemiologists as methods of choice for COVID-19 (Gostic et al. 2020). First is the Wallinga-Teunis (WT) method (Wallinga and Teunis 2004), which uses the distribution of the time between an infected person and their secondary infections to simulate possible who-infected-who scenarios, each of which induces a particular R_t. WT assumes that cases are observed exactly and that there is no delay in observation. Second is the method of Cori et al. (2013), and third is EpiNow.

We test each method in an array of settings, with different distributions for both the true value of R_t and the observations. We include two different settings for the ground truth R_t. First, the outbreak setting, where R starts below 1 and rises above 1 at a random time. Second, the random trend setting, where R follows a linear trend which changes randomly at multiple points in time. Details of the settings and other experimental parameters are in the appendix. We also include different observation models, characterized by the test used, the sampling method, and the sample size. We include both PCR and serological tests, using previously estimated distributions for D (Kucirka et al. 2020; Iyer et al. 2020). We also include the three sampling models introduced earlier: uniform undersampling, cross-sectional, and longitudinal. Finally, for each combination of test and sampling method, we include four different sample sizes. These sizes model a challenging setting with sparse observations, representing highly limited testing capacity. Note that the sample sizes evaluated are different for each sampling method because they have different interpretations, e.g., 1% in the cross-sectional case means sampling 1% of the population each day, while in the longitudinal case it would mean 1% every d days.

Table 1: Mean absolute error of each method averaged over instances and time points for each setting, along with standard deviation of the absolute error. "PCR" and "Serological" denote settings where the observations are generated by the respective testing method. Individual column headings give the percentage of the population enrolled in testing.

For each setting, Table 1 shows the mean absolute error between the posterior mean R produced by each method and the ground truth. Each entry averages over 100 instances. For longitudinal testing we use d = 14; results for other values are very similar (see appendix). Across almost all settings, our method has lower MAE than any baseline, often by a substantial margin (reducing error by a factor of 2-10). Notably, GPRt performs well even with extremely limited data (e.g., when testing 0.05% of the population per day or when 1% of infections are observed). Performance improves with more data, but the gains are limited (e.g., 0.02-0.04 MAE), indicating that our method is able to make effective use of even very sparse data.

Figure 2: Observed x_t and the distribution over R_t returned by each method on an example in the outbreak setting with longitudinal sampling, d = 14, and a 1% sample. The green line gives the ground truth R_t.

Figure 2 shows a representative example. The observed data is quite sparse, with 0-8 positive tests observed per day. Our method recovers a posterior which closely tracks the ground truth. WT produces an estimate which is correlated with the ground truth but has many fluctuations and overly tight credible intervals. Cori is not appropriate for data this sparse and produces a widely fluctuating posterior. EpiNow does not return an estimate for much of the time series, only estimating the part with denser observations (we gave the baselines an advantage by only evaluating their MAE where they returned an estimate). Moreover, even in the higher-observation portion, it is less accurate than GPRt.

Finally, Figure 3 shows calibration, a metric which evaluates the entire posterior (not just the mean). Intuitively, calibration reflects that, e.g., 90% of the data should fall into the 90%-credible interval of the posterior. Calibration is critical for the posterior distribution to be interpretable as a valid probabilistic inference, and for it to be useful in downstream decision making. Figure 3 shows the fraction of the data which is covered by the credible intervals of each method. This figure shows cross-sectional testing in the outbreak setting, but results for other settings are very similar (see appendix). GPRt is close to perfectly calibrated (the dotted diagonal line), while the baseline methods are not well calibrated. The baselines suffer from two problems. First, as to be expected from their higher MAE, they are biased, and so their credible intervals often exclude the truth. Second, they are over-confident: paradoxically, their calibration worsens with increased data, since the larger sample size makes them more confident in their erroneous predictions. We conclude that GPRt offers uniquely well-calibrated inferences.
Estimating the time-varying reproduction number of SARS-CoV-2 using national and subnational case counts
An Interaction-Based Approach to Computational Epidemiology
Bayesian inference of transmission chains using timing of symptoms, pathogen genomes and contact data
Accounting for non-stationarity in epidemiology by embedding time-varying parameters in stochastic models
Forecasting a moving target: Ensemble models for ILI case count predictions
A new framework and software to estimate time-varying reproduction numbers during epidemics
Capturing the time-varying drivers of an epidemic using stochastic dynamical systems
Estimating the effects of non-pharmaceutical interventions on COVID-19 in Europe
Practical considerations for measuring the effective reproductive number
Seroprevalence of antibodies to SARS-CoV-2 in 10 sites in the United States
Dynamics and significance of the antibody response to SARS-CoV-2 infection
Categorical reparameterization with gumbel-softmax
Inapparent infections and cholera dynamics
Auto-encoding variational bayes
Variation in false-negative rate of reverse transcriptase polymerase chain reaction-based SARS-CoV-2 tests by time since exposure
Automatic differentiation variational inference
Test sensitivity is secondary to frequency and turnaround time for COVID-19 surveillance
Early in the epidemic: impact of preprints on global discourse about COVID-19 transmissibility
Variational Bayesian inference with stochastic search
Sourceseer: Forecasting rare disease outbreaks using multiple data sources
Pattern of early human-to-human transmission of Wuhan
Approximation algorithms for reducing the spectral radius to control epidemic spread
Introduction to reinforcement learning
Improved inference of time-varying reproduction numbers during infectious disease outbreaks
Markov chain Monte Carlo algorithms for Gaussian processes
DVAE++: Discrete Variational Autoencoders with Overlapping Transformations
Different epidemic curves for severe acute respiratory syndrome reveal similar impacts of control measures
Modeling Between-Population Variation in COVID-19
Controlling propagation at group scale on networks
Dava: Distributing vaccines over networks under prior information