key: cord-0113563-uecgoe0l
authors: Meagher, Joe; Friel, Nial
title: Assessing epidemic curves for evidence of superspreading
date: 2021-06-22
journal: nan
DOI: nan
sha: deef373c985e63fa22a70234c31c0ddbe3d2a93c
doc_id: 113563
cord_uid: uecgoe0l

The expected number of secondary infections arising from each index case, referred to as the reproduction or $R$ number, is a vital summary statistic for understanding and managing epidemic diseases. There are many methods for estimating $R$; however, few explicitly model heterogeneous disease reproduction, which gives rise to superspreading within the population. We propose a parsimonious discrete-time branching process model for epidemic curves that incorporates heterogeneous individual reproduction numbers. Our Bayesian approach to inference illustrates that this heterogeneity results in less certainty on estimates of the time-varying cohort reproduction number $R_t$. We apply these methods to a COVID-19 epidemic curve for the Republic of Ireland and find support for heterogeneous disease reproduction. Our analysis allows us to estimate the expected proportion of secondary infections attributable to the most infectious proportion of the population. For example, we estimate that the 20% most infectious index cases account for approximately 75-98% of the expected secondary infections with 95% posterior probability. In addition, we highlight that heterogeneity is a vital consideration when estimating $R_t$.

Superspreading, where some individuals give rise to large numbers of secondary infections whilst the majority result in very few or none, is a feature of many epidemics (May and Anderson, 1987; Shen et al., 2004; Wong and Collins, 2020) . This phenomenon is a consequence of heterogeneous disease reproduction, whereby the expected number of secondary infections arising from each index case varies from one individual to the next (Lloyd-Smith et al., 2005) .

Factors that drive heterogeneity separate into two broad categories (Becker and Britton, 1999) . The first affects the infectiousness of each individual, making their infection more or less transmissible. The second is based on the underlying community structure, influencing the number of infectious individuals' contacts. Within these categories, there is a myriad of host, pathogen, and environmental factors that contribute to the expected number of secondary infections from each index case (Anderson and May, 1992) . Thus, many studies think of the infectiousness of individuals within a population as distributed along a continuum (Lloyd-Smith et al., 2005; Grassly and Fraser, 2008; Britton, 2010) .

Heterogeneous disease reproduction is often manifest empirically as a 20/80 rule, whereby the 20% most infectious index cases are typically responsible for at least 80% of disease transmission (Woolhouse et al., 1997) . This phenomenon has significant consequences for the design of control measures to curb the spread of infection. Theoretical analyses show that interventions targeting "core" groups of the most infectious individuals within a population yield significant reductions in overall transmission (Anderson and May, 1992; Hadeler and Castillo-Chávez, 1995; Woolhouse et al., 1997; Lloyd-Smith et al., 2005; Wallinga et al., 2010; Britton et al., 2020) . In practice, however, identifying core groups, a priori, is difficult, if not impossible, and so rigorous contact tracing is required to inform decision making during ongoing epidemics (Koopman, 2004; Wallinga et al., 2010) .

Mathematical and statistical modelling offering both "nowcasts" describing current disease dynamics and "forecasts' projecting the trajectory of epidemic curves into the future informed public health interventions during the Coronavirus Disease 2019 (COVID-19) pandemic (Flaxman et al., 2020) . Heterogeneity poses enormous challenges to these efforts, even for simple models based on a constant reproduction number, denoted R (Forsberg White and Pagano, 2008; Cori et al., 2013; Johnson et al., 2021; Donnat and Holmes, 2021) . Although the definition of R as the expected number of secondary infections for each index case is unchanged (Fraser, 2007) , heterogeneity can drastically widen credible intervals associated with nowcasts of R (Johnson et al., 2021) . Furthermore, heterogeneity will affect the magnitude of both best-and worst-case scenario forecasts (Lloyd-Smith et al., 2005; Donnat and Holmes, 2021) . These effects can be significant even before considering incomplete and delayed reporting of disease incidence.

Interestingly, many state-of-the-art methods for estimating time-varying reproduction numbers do not allow for heterogeneous disease reproduction (Wallinga and Teunis, 2004; Forsberg White and Pagano, 2008; Cori et al., 2013; Thompson et al., 2019; Gostic et al., 2020; Cori, 2020; Bertozzi et al., 2020) . Recently Donnat and Holmes (2021) , Johnson et al. (2021) and Schmidt (2020) have explored the impact of heterogeneity on estimates for R, a concept that extends directly to the instantaneous time-varying reproduction number (Fraser, 2007; Cori et al., 2013) .

In this report, we extend parsimonious branching process models for the spread of infectious disease developed by Wallinga and Teunis (2004) and Bertozzi et al. (2020) to allow for the heterogeneous disease reproduction described by Lloyd-Smith et al. (2005) . Furthermore, Bayesian inference for this model offers estimation and uncertainty quantification for R t (Hoffman and Gelman, 2014; Carpenter et al., 2017) . These methods allow us to assess epidemic curves for evidence of superspreading and explore the implications of heterogeneous disease reproduction. As a case study, we consider the COVID-19 epidemic in the Republic of Ireland (Ireland) between July and November, 2020. Our analysis allows us to draw conclusions about the expected proportion of secondary infections attributable to the most infectious individuals. For example, we estimate that the 20% most infectious individuals in this epidemic give rise to 75-98% of expected secondary infections with 95% posterior probability, while 62-82% of individuals are not expected to pass on the infection, again with 95% posterior probability.

In section 2 we present theory underpinning the current state-of-the-art branching process models for the spread of disease. We then extend these models to heterogeneous disease reproduction in section 3 before outlining a Bayesian framework for inference and model comparison. This approach allows us to estimate R t and explore the impact of heterogeneity on the distribution of secondary infections. Our case study is described in section 4. In section 5 we present the results of our analysis and we conclude in section 6.

Branching process epidemic models offer a flexible set of tools for the analysis of epidemics. Unlike compartmental epidemic models, which separate a population into groups depending on their disease and risk status (Kermack and McKendrick, 1927; Keeling and Rohani, 2011; Bjørnstad et al., 2020) , these stochastic processes model individual infections within a population (Keeling and Eames, 2005) . Such epidemic models offer robust, parsimonious approaches to estimating time-varying reproduction numbers and quantifying heterogeneity in the reproduction of disease from one individual to the next (Wallinga and Teunis, 2004; Lloyd-Smith et al., 2005; Bertozzi et al., 2020) .

Here, we define the time-varying reproduction number R t , referred to as the cohort reproduction number, as the expected number of secondary infections arising from each index case within the cohort of infections recorded at time t. The critical value for the reproduction number lies at R t = 1. The epidemic grows exponentially when R t > 1 while sustaining R t < 1 for a sufficiently long period ensures that the epidemic will eventually die out, provided no new infections come from outside the population. R t , also known as the case reproduction number, is distinct from the instantaneous reproduction number, which we denote R * t , although R * t has the same critical value at 1. R * t is the expected number of secondary infections arising from an index case at time t should conditions remain unchanged for the duration of their infection (Fraser, 2007) . This quantity can be estimated in real-time, as described by Cori et al. (2013) ; Thompson et al. (2019) , and Johnson et al. (2021) , as it is a function of infections occurring up to time t. Reliable estimates for R t , on the other hand, depend on infections occurring after time t and so must be computed retrospectively. This is discussed in detail by Gostic et al. (2020) .

When modelling the spread of disease within a population as a continuous-time branching process, consider the epidemic t = (t 0 , t 1 , . . . t N ) , where t 0 = 0 is the index case seeding the epidemic and t i ∈ (0, T ] is the time at which the i th infection is recorded such that t i < t i+1 for i = 1, . . . , N . The branching process allows for two types of index case. The first type contracts their infection from outside the population of interest and is said to have been imported. For simplicity, we let imported infections follow a Poisson process with a constant intensity µ > 0. All remaining index cases are secondary infections that arise locally within the population. In the general case, we model secondary infections from index case i as a Poisson process with intensity β (t, τ i | θ i ) ≥ 0, a function of calendar time t and time since infection t − t i = τ i > 0 parameterised by θ i ∈ Θ. Here, t i corresponds to the time case i is infected and so β (t, τ i | θ i ) models the generation interval between infector-infectee pairs while θ i is a set of marks for case i allowing the intensity function adapt to environmental changes or individual characteristics. For this branching process, the conditional intensity

Expectation-Maximisation offers a well established approach to maximum likelihood estimation for branching process models (Dempster et al., 1977; Veen and Schoenberg, 2008; Bertozzi et al., 2020) . These methods rely on the fact that for any branching process epidemic model there exists a unique transmission network linking infector-infectee pairs such that

is the relative likelihood that i is a secondary infection of j for all j = 0, . . . , i − 1 and

is the relative likelihood that i was imported (see Appendix A for more detail). This insight allows the relative likelihoods defined by equations (2) and (3) to serve as a basis for estimating R t by counting the expected number of secondary infections associated with each index case. The approach to this problem adopted by Wallinga and Teunis (2004) (henceforth referred to as W&T) is to first assume that imported cases have been identified a priori. This means that the contribution of µ in equation (2) can be ignored and p ii = 0 for all remaining index cases. A further assumption made by W&T is that β (t, τ i | θ i ) = ω (τ i | θ) where ∞ 0 ω (τ | θ) dτ = 1 is the generation interval density. This is to say that all index cases share the same generation interval density, which is a function of the time since infection τ i parameterised by θ. The parameters θ are specified, a priori, based on separate analyses of generation intervals for the epidemic in question. With these assumptions in place, p ji can be computed for each index case as required. Letting R j denote the expected number of secondary infections for the j th index case, we have that

given that p ji = 0 if case i is known to have been imported. Finally, R t is estimated by the arithmetic mean of R j for all index cases in the cohort at time t. The W&T method is implemented within the EpiEstim R package (Cori, 2020; R Core Team, 2020) . Bertozzi et al. (2020) developed an approach to estimating R t which extends W&T. Allowing for imported cases with intensity µ, they assume that β (t,

is the generation interval density adopted in W&T. Substituting these quantities into equations (2) and (3) provides the required relative likelihoods. Bertozzi et al. (2020) propose an iterative expectation-maximisation approach to maximum likelihood estimation for R (t). To this end, they adopt a histogram estimator for R (t) such that

where I 1 , . . . I B are a set of disjoint intervals that must be specified a priori and 1 {t ∈ I k } is the usual indicator function taking the value 1 if t ∈ I K and 0 otherwise. Each interval defines a distinct cohort for which a constant value r k is estimated as

where N k is the total number of index cases on the interval I k . Thus, initialising R (t) and iteratively updating equations (2), (4), and (5) allows Bertozzi et al. to estimate the cohort reproduction number R t . The methods presented above represent the current state-of-the-art approaches to estimating R t within a branching process framework. Based on simple, parsimonious models which minimise the assumptions we must make, they offer informative results even with limited data. However, the models proposed by W&T and Bertozzi et al. (2020) could be considered as restrictive in the following sense. Consider the number of secondary infections associated with an index case at time t, which we denote by Z t . W&T implies that Z t ∼ Pois (1) while Bertozzi et al. (2020) assume that Z t ∼ Pois (R t ). This is a consequence of the definition of β (t, τ i | θ i ) as the intensity function of a Poisson process. This observation reveals that neither model allows for reproduction numbers that vary from one individual to the next within each cohort. Adopting a hierarchical model for Z t offers a framework for tackling this problem. Lloyd-Smith et al. (2005) provide a useful starting point, which assumes that

where a Gamma distributed individual reproduction number ν t , parameterised by shape α and rate β, allows for over-dispersed secondary infections and heterogeneous disease reproduction. Defining α and β in terms of the reproduction number R and a dispersion parameter k, such that α = k and β = k/R, implies that Z t ∼ Neg-Bin (R, k), where E [Z t ] = R and Var (Z t ) = R + R 2 /k. The continuous latent variable ν t models the complex combination of factors governing the infectiousness of each individual, assuming that infectiousness and susceptibility to infection are uncorrelated. Extending this framework to the time-varying reproduction number R t offers an approach to modelling heterogeneity within branching process epidemic models.

A second issue is that empirical data on epidemics are not typically available in continuous time. Instead, the disease incidence is generally reported as a daily case count, i.e. at a series of discrete-time steps. The methods developed by Wallinga and Teunis (2004) or Bertozzi et al. (2020) are readily applied to such data; however, making the discrete-time formulation explicit allows for more efficient inference, given that computation of relative likelihoods (2) and (3) for all i, j scales with O (N 2 ). When daily cases number in the tens to hundreds of thousands, this can present a significant computational burden, particularly if we are to extend these methods to heterogeneous disease reproduction.

Here, we extend these branching process epidemic models via a discrete-time model for epidemic curves where disease reproduction is heterogeneous within cohorts. This model retains the appealing parsimony of Wallinga and Teunis (2004) and Bertozzi et al. (2020) while allowing us to investigate the impact of heterogeneity. We develop a Bayesian approach to inference, applying state-of-the-art sampling techniques to provide a coherent approach to uncertainty quantification for R t and k (Carpenter et al., 2017) .

Let Y t denote the number of index cases recorded on day t = . . . , −1, 0, 1, 2, . . . such that the index cases up to and including day 0 seed the epidemic. We distinguish between imported and locally generated index cases, and so

We let the number of cases imported on day t follow a Poisson distribution parameterised by rate µ t , such that

This implies that the imported case count is conditionally independent of those generated locally, given µ t . In order to model the local incidence of disease Y loc t , that is the number of secondary infections recorded within the population on day t, we let Z t,i denote the number of secondary infections arising from the i th index case on day t. Adopting a hierarchical model for heterogeneous disease reproduction (Lloyd-Smith et al., 2005) , we have

where the Gamma shape α and time-varying rate β t are defined in terms of the daily cohort reproduction number R t and dispersion parameter k. The total reproduction number for all index cases on day t is then

Adopting the nomenclature proposed by Johnson et al. (2021) , we refer to the latent variable η t as the disease momentum on day t. The disease momentum describes the total infectiousness of all index cases recorded at time t when we have heterogeneous individual reproduction numbers, and defines the rate at which the epidemic spreads through the population. That η t is Gamma distributed follows from the fact that the sum of independent Gamma random variables with a common rate parameter is itself Gamma distributed. Given the generation interval probability mass function (pmf) ω = (ω 1 , ω 2 , . . . ) , the distribution of secondary infections in time is modelled as a set of independent Poisson random variables. For the i th index case on day t, the number of secondary infections arising s days after infection can be expressed as

This ensures that the model for secondary infections in equation (6) holds. Locally generated cases on day t are simply the sum over all secondary infections arising on day t from existing index cases, that is

Thus, given ω and the disease momentum up to time t, denoted η t = (η t−1 , η t−2 , . . . ) , locally generated cases are Poisson distributed such that

Adding imported cases to those generated locally, the likelihood for this generative model is

Figure 1 presents a graphical representation of the dependence structure within this model for epidemic curves. Figure 1 : A plate diagram of the conditional dependence structure within the generative model for epidemic curves described in section 3.1, where only the epidemic curve . . . , Y −1 , Y 0 , Y 1 . . . (shaded nodes) is observed. This figure highlights model parameters that are non-identifiable from the epidemic curve alone. Even if the disease momentum η t were observed, joint inference for R t and k depends on prior assumptions restricting the day-to-day variation in reproduction numbers.

Here we outline issues relating to the identifiability of the model. In particular, suppose that we allow ω to be a unknown parameter in the model.

It can be shown that for any ω = ω there exists a unique η t = η t satisfying the linear equations defined by

This implies, following equation (10), that jointly inferring ω and η t leads to a non-identifiable model. For this reason we specify a fixed ω for our model. In Section 4 we discuss our choice of ω based on information drawn from the literature on generation intervals in the context of COVID-19. Similarly, inferring both µ t and η t without observing Y imp t also leads to a non-identifiable model. To see this, suppose we have some constant 0 < c < µ t , Then

This shows that subtracting a constant from the import rate and adding it to each momentum variable leads to a non-identifiable likelihood in equation (10). For this reason we also specify a fixed µ t , as discussed in Section 4.

The model for individual reproduction numbers in equations (6) and (7) allows us to estimate the expected proportion of secondary infections associated with the most infectious index cases, as described by Lloyd-Smith et al. (2005) .

We consider index cases within each cohort separately, such that a single cohort reproduction number parameterises the distribution of individual reproduction numbers. That is, we estimate the expected proportion of secondary infections associated with the most infectious individuals within the y t index cases recorded on day t, for whom the cohort reproduction number is R t . In this case, we model

without any loss of generality. That is, within each cohort, the degree of heterogeneity in the distribution of individual reproduction numbers depends on k only. Given p (x | α, β) and F (x | α, β), the probability density and cumulative distribution functions of a Gamma distributed random variable parameterised shape α and rate β, we define the cumulative distribution function for transmission of the disease within a single cohort as

where we have the identity Γ (k + 1) = kΓ (k). Here, F trans (x | k) is the expected proportion of secondary infections from index cases recorded day t attributable to individuals with u t,i < x. The expected proportion of secondary infections due to individuals with u t,i > x is therefore -Smith et al., 2005) . Thus, if T k (q) denotes the expected proportion of secondary infections from the proportion q ∈ [0, 1] of index cases within a single cohort that are most infectious, then

Written as a single expression, we have that

As no closed form expression for F −1 (1 − q | k, k) exists, we estimate T k (q) numerically for a given q and k.

The generative model presented above offers a framework for learning about the spread of disease within a population given the observed epidemic curve y = (y 0 , y 1 , . . . , y N ) and a set of prior beliefs. In the following, we present a Bayesian approach to inference, assuming that y imp t , the number of imported cases on day t = 1, . . . , N , is unknown. Following the previous subsection, we assume that both the rate at which cases are imported µ = (µ 0 , µ 1 , . . . , µ N ) and the generation interval pmf ω are fixed, a priori. In the following subsection, we outline how we specify a prior distribution for the daily cohort reproduction numbers R = (R 0 , R 1 , . . . , R N ) and dispersion parameter k. For ease of exposition, we assume that the epidemic is seeded by y 0 only, although in practice we include y −(N 0 −1) , . . . , y −1 to seed the epidemic with N 0 days.

We first address prior specification for the dispersion parameter k. Ideally, a prior for k would rely on detailed contact tracing data which reconstructs the underlying transmission networks (Lloyd-Smith et al., 2005; Arinaminpathy et al., 2021; Sun et al., 2021) ; however, this information is often unavailable. Our approach exploits the relationship between k and the expected proportion of disease transmission attributable to the most infectious individuals, as set out in Section 3.1.2. Suppose we assume, for example, that the most , and 99% equal-tailed interval (light shaded region). As an example, we see that this prior on k provides support for 30-95% of expected secondary infections arising from the most infectious 20% of index cases.

infectious 20% of index cases give rise to at least 30% of expected secondary infections. By equation (13), this implies that k < 10. Similarly, if we assume that no more than 95% of expected secondary infections arise from the most infectious 20% of index cases, then equation (13) implies that k > 0.1. Thus, if our prior belief is that the most infectious 20% of index cases give rise to 30-95% of expected secondary infections, this leads to a prior distribution for k with positive support in the interval (0.1, 10). Note that this interval covers both high and low levels of heterogeneous disease transmission within the population. In addition, we note that for a simple SIR compartmental model, secondary infections follow a Geometric distribution, which corresponds to the case where k = 1 (Lloyd-Smith et al., 2005) . This suggests that a prior distribution with some central tendency towards 1 is sensible and leads us to propose a log-Normal prior for k, such that log k ∼ N µ log k , σ 2 log k , and set µ log k = 0 and σ log k = 1. Under this prior, presented in terms of T k (q) in Figure 2 , the median for k is 1 and the same proportion of prior density is assigned to the interval (0.1, 1) as is to (1, 10). This prior distribution can be easily adapted to a more concentrated range of values for k as required.

The cohort reproduction number R t is defined in equations (6) and (7) as the expected number of secondary infections arising from each index case in the cohort recorded on day t. As described by Fraser (2007) and Gostic et al. (2020) , this expectation should vary smoothly in time. For example, if control measures introduced at t c restrict the transmission of disease, some cohorts infected at time t < t c will spend part of their infectious period both before and after the introduction of control measures. Thus, R t will transition from high to low values smoothly at a rate which depends on ω.

We model R t as a smooth function of time via a Gaussian process (GP) prior (Williams and Rasmussen, 2006) . Given that R t > 0, let

where the latent function f (t) ∈ R is a GP such that

with amplitude σ f > 0 and length-scale > 0. The exponentiated quadratic covariance covariance function k (·, ·) implies that f (t) is infinitely differentiable, where σ f defines the expected range of f (·) and governs the functions rate of change, which we loosely describe as its "wiggliness". Our approach to specifying a prior distribution for R involves relating the expected number of zero-upcrossings of f (·) to the expected number of surges in case numbers per year, as we now outline. If we assume that R t is a slowly-varying function such that sustained increases in daily case numbers occur when R t > 1, sustained decreases imply that R t < 1, and short term variation is driven by heterogeneous disease reproduction, then distinct surges in case numbers are associated with R t crossing 1 from below. Thus, given that f (t) = 0 when R t = 1, each surge is associated with a zero-upcrossing of f (·). If we let E [n 0 ] denote the expected number of zero-upcrossings of f (·) per year, then it can be shown for (15) that E [n 0 ] = 365 * (2π ) −1 (Williams and Rasmussen, 2006) , where the length-scale is in units of days. Thus, we can relate the number of surges in case numbers we expect over a given period to . For example, if we expect 3-4 surges in case numbers per year, then should lie on the interval (15, 20) . This insight allows us to specify a prior distribution for , which we outline in Section 4. We refer the reader to Appendix B for further details on prior elicitation for the GP. When inferring cohort reproduction numbers for an ongoing epidemic, a full Bayesian treatment of might not be practical, due to the computational cost incurred (evaluation of the Gaussian density scales with O (N 3 )). In this case, may be fixed, a priori. Inference for R is not particularly sensitive to the value chosen for , although smaller values tend to inflate estimates for k. The approach which we take subsequently is to place a zero-truncated Gaussian hyper-prior on such that ∼ N (µ , σ 2 ). Note that we are typically only interested values for that are far from 0, and so the effect of truncation on this hyper-prior can be safely ignored.

Our primary objective is to learn about the joint posterior over R and k given the epidemic curve y. To do this, we integrate over the unknown momentum variables η = (η 0 , η 1 , . . . , η N ) , latent GP f = (f 0 , f 1 , . . . , f N ) , and the GP length-scale . As detailed in Section 3.1.1, we treat µ and ω as fixed parameters, and specify the hyper-parameters σ f , µ log k , σ log k , µ , σ . Thus, we wish to infer the marginal posterior distribution

The joint posterior distribution of all unknown quantities can be written as

where the complete-data likelihood is expressed as

For ease of expression we have suppressed notation conditioning on the fixed parameters and hyper-parameters. Note that this epidemic model is seeded by y 0 and so these cases are omitted from the likelihood in equation (17). This hierarchical model for y is defined by

where K is the Gram matrix of the covariance function in (15). The probabilistic model described above is coded in Stan (Carpenter et al., 2017 ; Stan Development Team, 2020; R Core Team, 2020) and can be implemented using the R package assessEpidemicCurves which can found at https://github.com/jpmeagher/assessEpidemicCurves.

We consider the COVID-19 epidemic in Ireland as a case study. Note that an alternative analysis of COVID-19 in Ireland, developed by the Irish Epidemiological Modelling Advisory Group (IEMAG) to inform the Irish Government's response to the epidemic, is presented by Gleeson et al. (2022) . We assess the 7-day moving average of confirmed cases, ordered by epidemiological date. The epidemiological date is the earliest recorded date associated with a confirmed case of COVID-19. This is either the date of onset of symptoms, date of diagnosis, the laboratory specimen collection date, the laboratory received date, the laboratory reported date or the notification date. Sorting cases by their epidemiological date strips out some random effects on the epidemic curve introduced by reporting delays, while taking the 7day moving average of case counts smooths over other day-of-the-week effects. This data was extracted from the Computerised Infectious Disease Reporting database hosted by the Health Protection Surveillance Centre (HPSC) and provided to the authors by the Central Statistics Office of Ireland.

Our analysis focuses on the period from July to November 2020, covering Ireland's second surge in coronavirus infections up to the easing of restrictions in December. We have restricted our analysis to this period as we believe that the Irish testing and contact tracing system functioned in a relatively consistent manner throughout. Figure 3 presents the 7-day average test positivity rate reported by the HPSC up to June 2021. We see that the positivity rate did not exceed 10% in our analysis period (the shaded region in the figure), unlike during the first surge in March/April 2020 and the third surge in December 2020/January 2021. This suggests that testing capacity was better able to cope with demand in this period, which we expect to result in more consistent testing and tracing procedures. Thus, we expect recorded data from this period to reflect the ongoing epidemic more accurately than at other points in time, while still recording a growing epidemic. Our assumption is that consistent testing and tracing procedures will allow us to more accurately assess the epidemic curve for evidence of superspreading.

Estimates for the daily count of imported cases y imp are not available for this dataset. As such, we simply assume that the rate at which cases are imported is 1 for all t such that µ is a vector of ones. This assumption may be unrealistic. We might expect that µ t depends on the incidence of COVID-19 across the United Kingdom and European Union, for example, and the rate at which individuals travel between Ireland and these jurisdictions. However, given the absence of relevant data, our assumption seems reasonable. A small value for µ t implies that the vast majority of cases are due to local transmission and provides a similar approach to that taken by the IEMAG, where imported cases were omitted from the model entirely (Gleeson et al., 2022 ). An additional, practical consideration is that non-zero values for µ t avoid numerical issues within our inference scheme, although seeding the epidemic with µ 0 ≈ y 0 is generally sufficient to avoid any difficulties.

Letting γ τ and σ τ denote a mean and standard deviation for the distribution of generation intervals, we assume that ω follows a discretised Gamma distribution with mean γ τ = 5 and Note that the test positivity never exceeds 10% in this period while it exceeds 20% at in April 2020 and January 2021. Our assumption is that a high positivity rate is indicative of a testing system that has been overwhelmed by cases, resulting in less reliable daily case counts. standard deviation σ τ = 2.5, truncated at S = 21 days. This is achieved by setting

such that S s=1 ω s = 1, where we have matched γ τ and σ τ to the shape α and rate β of the Gamma distribution. In the absence of detailed contact tracing information for infectorinfectee pairs in Ireland, these values for γ τ and σ τ have been chosen from the middle of the range for COVID-19 reported in the literature Ganyani et al., 2020; Griffin et al., 2020; Rai et al., 2021) . In addition, truncating ω at 21 days implies that the maximum generation interval is three weeks, an assumption is reasonably consistent with empirical data . Although our analysis considers a single, fixed parameterisation for ω, the posterior distribution is quite robust to changes in γ τ and σ τ . In Appendix C, we consider models parameterised by (γ τ , σ τ ) ∈ {(4, 2), (6, 3)}, which represent lower and upper estimates of γ τ respectively, as reported by Rai et al. (2021) . This analysis shows that each model provides broadly similar inference for R and k.

We specify the GP prior for R with hyper-parameters σ f = 1, µ = 17.5 and σ = 2.5. Under this prior, the marginal prior distribution for log R t is normally distributed with mean 0 and standard deviation 1. Setting σ f = 1 implies that the 95% credible interval for R t under this marginal prior distribution is (0.14, 7.10) with mean 1.65 and a median at 1. This marginal distribution covers the range of values we expect R t to take, a priori. Assuming that ∼ N (17.5, 2.5 2 ), left truncated at zero, implies that we expect 2-5 surges in case numbers per year. Given that Ireland experienced 3 surges over the first year of its coronavirus epidemic, this hyper-prior is not overly restrictive and covers a reasonable range of possible values for . See Appendix B for further analysis supporting this hyper-prior distribution. Finally, we remind the reader that we have proposed a weakly informative prior for k in Section 3.2.1, specified as log k ∼ N (µ log k , σ 2 log k ) where µ log k = 0 and σ log k = 1.

We present our analysis of the Irish COVID-19 epidemic curve as follows. We first fit the model defined by (18) and specified in Section 4, evaluate the fit of our model via the posterior predictive distribution, and assess the epidemic curve for evidence of heterogeneous disease transmission. Following this, we compare the posterior inference for R to those obtained under the assumption of homogeneous disease transmission.

We fit our model to the Irish epidemic curve for the N = 150 days from 4 July to 30 November 2020, inclusive, allowing the N 0 = 5 days up to 4 July to seed the epidemic. We draw 4 chains of 5000 samples from (16) after a warm-up of 2000 samples and thin our chains by retaining every fifth sample. Posterior samples satisfy standard diagnostic checks and tests for convergence (Vehtari et al., 2021) . The posterior predictive distribution for our model is

which we obtain by integrating over the sampled posterior distribution of the momentum variables. In Figure 4 we see that the posterior predictive distribution tracks the empirical epidemic curve closely and provides good coverage of daily cases counts. All observed daily counts fall within the 95% credible interval while 86% fall within the 50% credible interval. This reasonably well balanced posterior predictive distribution suggests that our model offers a good fit to the data. Figure 5a presents the sampled marginal and joint posterior distributions over the dispersion parameter k and GP length-scale hyper-parameter . This figure illustrates that there is a weak inverse relationship between and k, with large values for (i.e. very slowly varying cohort reproduction numbers) tending to be associated with smaller values for k (i.e. more heterogeneous disease transmission). Based on this analysis, we infer 95% credible intervals of approximately (0.07, 0.33) for k and (13, 18) for .

The uncertainty in k, a posteriori, allows us assess the posterior uncertainty in the estimated proportion of expected secondary infections associated with the proportion q of most Note that the predictive model is seeded by N 0 = 5 days. All observed daily case counts fall within the 95% credible interval of the posterior predictive distribution with 86% of counts within the 50% credible interval.

infectious individuals, using the approach developed in Section 3.1.2. To do this, we sample from the posterior distribution of k and solve (13) numerically for each sampled value. This gives rise to a distribution for the expected proportion of secondary infections for a given proportion q. We present this for q ∈ [0, 1] in Figure 5b . If we consider the 20% most infectious individuals as an example, this analysis suggests that 75-98% of expected secondary infections can be attributed to these individuals with 95% posterior probability, while 62-82% of infected individuals are not expected to pass on the infection, again with 95% posterior probability. This represents a high degree of heterogeneity in the spread of COVID-19.

Given that our model provides evidence for heterogeneous disease transmission, our next objective is to compare inference for R t under this model with those assuming homogeneous disease reproduction. We adapt the model defined in (18) to homogeneous disease reproduction by fixing η t = R t y t for all t. In effect, this assumes that k → ∞. As above, we draw 4 chains of 5000 samples after a warm-up of 2000 samples and retain every fifth sample to thin our chains. Once again, posterior samples satisfy our diagnostic checks and tests for convergence. In addition, we fit Wallinga and Teunis' (2004) model (W&T) to our epidemic curve using the EpiEstim package (Cori, 2020) . When estimating and quantifying uncertainty on R t by W&T, the cohort at time t is defined by the 3-day window such that the estimate for Figure 5 : (a) The sampled joint and marginal posterior distributions over k and . Each opaque point represents a sample, while contours are set to have a width of 0.3. We report 95% credible intervals for k of (0.07, 0.33) and for of (13, 18). (b) The proportion of expected secondary infections attributable to the proportion q of most infectious individuals. The mean (solid line), 50% (dark shaded region), and 95% credible interval (light shaded region) for this proportion over the interval q ∈ [0, 1] is presented. T k (q) is estimated numerically by equation (13) given the posterior distribution for k. Based on this analysis we estimate, for example, that the 20% most infectious individuals give rise to 75-98% of expected secondary infections with 95% posterior probabililty, while 62-82% of individuals are not expected to pass on the infection, again with 95% posterior probability.

R t is the arithmetic mean number of secondary infections arising from index cases on days t − 1, t, and t + 1.

This analysis demonstrates that heterogeneous disease reproduction has important consequences when estimating R t . While estimates of the posterior mean show a general agreement across all three approaches, credible intervals around these estimates behave very differently when we assume heterogeneous disease reproduction within cohorts. Empirically, we observe that credible intervals under homogeneous disease transmission and W&T tend to be only 60% as wide as those under heterogeneous disease transmission at time t. If our objective was to establish whether or not R t = 1, then allowing for heterogeneous disease transmission is a crucial consideration.

It is also worth noting that, under homogeneous disease transmission, the GP lengthscale hyper-parameter has a 95% credible interval of (7, 13), despite the ∼ N (17.5, 2.5) hyper-prior distribution. This inference reflects the fact that homogeneous disease transmission requires a flexible model for R in order to fit empirical data well. This model for R In addition, note that our estimate for R t is less "wiggly" under heterogeneous disease transmission than in the homogeneous case. This behaviour illustrates that a more flexible model for R is required to fit data when we assume homogeneous disease transmission.

implies that we should expect 5-8 surges in case numbers per year, far more than we observe empirically.

In this report, we have developed a parsimonious model for epidemic curves in the presence of heterogeneous disease reproduction, offering a Bayesian extension to the work of Wallinga and Teunis (2004) and Bertozzi et al. (2020) . This model treats superspreading as a feature of the epidemic rather than a phenomenon that occurs rarely. We develop a Bayesian inference scheme based on a GP prior for the cohort reproduction number R t and a log-Normal prior on the dispersion parameter k. This hierarchical model allows us to assess the degree of heterogeneity in individual reproduction numbers supported by any given epidemic curve, providing insight into the distribution of secondary infections within the population. An R package implementing these methods is available at github.com/jpmeagher/assessEpidemicCurves. Our analysis of the COVID-19 epidemic in Ireland provides support for heterogeneous disease reproduction. This result, alongside mounting evidence from other jurisdictions (see, e.g. Endo et al. (2020) ; Arinaminpathy et al. (2021) ; Sun et al. (2021) ), leads us to conclude that superspreading is a salient feature of this epidemic. A useful output of our analysis is that it allows us to estimate the expected proportion of secondary infections attributable to the most infectious proportion of the population. For example, we estimate that the 20% most infectious individuals give rise to 75-98% of the expected secondary infections with 95% posterior probability, while 62-82% of individuals did not pass on the infection, also with 95% posterior probability. This finding has important implications for public health policy. In particular, our analysis suggests that heterogeneity should be accounted for when quantifying uncertainty on R t . We observed that credible intervals for R t are typically much wider when k is small than is the case for homogeneous disease reproduction. This uncertainty is an important consideration when deciding whether or not to implement public health interventions based on R t . Secondly, when disease reproduction is heterogeneous, control measures targeting the most infectious individuals will have a disproportionate impact on the overall disease momentum (Woolhouse et al., 1997; Lloyd-Smith et al., 2005; Wallinga et al., 2010) . In this instance, backward tracing, which looks to identify the source of each infection, could play a crucial role in bringing the epidemic under control while minimising the broader societal and economic impact. Finally, the uncertainty introduced by heterogeneous disease transmission should be accounted for when forecasting the trajectory of an epidemic.

The proposed framework comes with important caveats. The parsimonious model ignores several features of empirical epidemic curves. In reality, the epidemic curve is never observed under ideal conditions, the rate at which cases are imported is unknown, and the distribution of generation intervals is likely to change in response to both control measures and the growth rate of the epidemic (Ali et al., 2020; Park et al., 2021) . Misspecification of these parameters will effect the joint distribution over R t and k (Wallinga and Lipsitch, 2007; Gostic et al., 2020; Knight and Mishra, 2020; Donnat and Holmes, 2021) . That said, in our analysis of the Irish COVID-19 epidemic, we have sought to mitigate the worst of these effects by modelling an epidemic curve where cases have been ordered by epidemiological date rather than considering the daily count of confirmed cases. We have also restricted our analysis to a period in which we have reason to believe that testing and tracing systems could cope with demand and procedures were executed in a reasonably consistent manner.

This paper should provide a starting point for research in several directions. For example, if transmission networks are available, then it would be possible to incorporate this type of branching process data into the framework developed here to provide enhanced inference on the heterogeneity of disease spread. Additionally, extending our model to allow joint inference over µ t , R t , k and ω is also possible, if richer epidemic data are available.

The branching process model defined by equation (1) has the likelihood

where θ = (θ 1 , . . . , θ N ) and

This branching process model implies that there exists an underlying transmission network linking infector-infectee pairs where nodes represent individual cases and directed edges the chain of transmission. Given that each secondary infection is associated with one index case only, each node in this transmission network has a single incoming edge. As such, the network can be uniquely characterised by the vector 

The transmission network of an epidemic is generally unknown; however, by the chain rule of probability, we have that the ratio of (21) to (20) yields a conditional distribution for A such that

This probability mass function (pmf) is equivalent to the product of independent multinomial pmfs

where, for i = 1, . . . , N , a i = (A 0i , A 1i , . . . , A ii ) is the outcome vector identifying the source of index case i and p i = (p 0i , p 1i , . . . , p ii ) is a vector of probabilities such that

is the relative likelihood that i is a secondary infection of j for all j = 0, . . . , i − 1 and

is the relative likelihood that i was imported. Thus, the distribution of edges in the transmission network follows a series of independent trials with multinomial outcomes.

The choice of hyper-parameters in the GP prior for the time-varying reproduction number is a crucial aspect of the analysis presented here. The prior for R = (R 0 , R 1 , . . . , R N ) , defined by equations (14) and (15), is parameterised by amplitude σ f > 0 and length-scale > 0. As discussed in the main text, specification of is an important consideration. In order to understand the behaviour of R different values for , we examine samples drawn from the prior distribution on R conditioned on σ f = 1, R 0 = R N = 1, and N = 60 for ∈ {10, 17.5, 25}. Figure 7 illustrates the fact that a shorter length-scale results in a more flexible model for R, manifest as rapid transitions from high to low values of R t , and vice versa.

Specifically, when = 10 we observe that trends in R t can change dramatically in only a few days, allowing for shifts in the reproduction number that are reversed within as little as ten days (see Figure 7a ). Such a model could accommodate an epidemic whereby cases surge twice within a two month period. Empirical data on COVID-19 in Ireland, where case numbers surged three times during the first 12 months of the COVID-19 epidemic, does not behave in this manner. Whether this is a consequence of human behaviour or some intrinsic property of the virus itself, it reflects the time-scale over which we expect to observe changes in the time-varying reproduction number. By this reasoning, = 17.5 is more consistent with empirical data (see Figure 7b ). Within this model, R t remains stable over any given 2-3 week period, reflecting the time-scale on which non-pharmaceutical interventions are enforced. A longer length-scale, such as = 25, sees similar conditions persist over the entire 60 days considered (see Figure 7c ). Such a model is unlikely to be flexible enough to fit R t as epidemic conditions change. Thus, specifying a hyper-prior such that ∼ N (17.5, 2.5) seems a reasonable choice.

In this Appendix, we consider how changes in the generation interval pmf ω may effect our inference for R and k. We consider three candidate generation intervals chosen from the range for COVID-19 reported in the literature Ganyani et al., 2020; Griffin et al., 2020; Rai et al., 2021) , such that (19) is parameterised by (γ τ , σ τ ) ∈ {(4, 2), (5, 2.5), (6, 3)}. We sample from (16) for each model in the manner described in the main text and verify our posterior samples satisfy diagnostic checks and tests for convergence. The results of this analysis are presented in Table 1 and Figure 8 . In each case the solid black line represents the conditional mean alongside the shaded 95% credible interval. Coloured lines represent samples from the conditional distribution. We note that smaller values for (shorter length-scales) imply more flexible GP models.

(γ τ , σ τ ) (4, 5) (5, 2.5) (6, 3) k 0.33 (0.15) 0.18 (0.8) 0.12 (0.06) Table 1 : The posterior mean (standard deviation) for k under each candidate generation interval. We observe an inverse relationship between k and γ τ .

As described by Gostic et al. (2020) , Figure 8b illustrates that larger mean generation intervals force estimates for R t away from 1; however, for each specification of ω, 95% credible intervals overlap. Of more interest is the impact of changes in ω on inference for k, presented in Figure 8a and Table 1 . We observe that smaller values of γ τ are associated with larger k. This analysis suggests that, for example, the 20% most infectious individuals give rise to 65-85%, 80-95%, and 90-99% of expected secondary infections when γ τ is 4, 5, and 6 days, respectively. All three candidate generation intervals imply a high level of heterogeneity and lead us to broadly similar conclusions, although longer generation intervals are associated with more heterogeneous disease transmission. As such, we have presented only the model where γ τ = 5 in the main text, although future work may consider model averaging to allow for uncertainty in ω. This analysis suggests that an inverse relationship exists between k and γ τ , although a more detailed study lies beyond the scope of this report and is left for future work. 

Serial interval of SARS-CoV-2 was shortened over time by non-pharmaceutical interventions

Infectious diseases of humans: dynamics and control

Quantifying heterogeneity in SARS-CoV-2 transmission during the lockdown in India

Statistical studies of infectious disease incidence

The challenges of modeling and forecasting the spread of COVID-19

Modeling infectious epidemics

Stochastic epidemic models: a survey

A mathematical model reveals the influence of population heterogeneity on herd immunity to SARS-CoV-2

Stan: A probabilistic programming language

EpiEstim: Estimate time-varying reproduction numbers from epidemic curves

A new framework and software to estimate time-varying reproduction numbers during epidemics

Maximum likelihood from incomplete data via the EM algorithm

Modeling the heterogeneity in COVID-19's reproductive number and its impact on predictive scenarios

Serial interval of COVID-19 among publicly reported confirmed cases

Estimating the overdispersion in COVID-19 transmission using outbreak sizes outside China

Estimating the effects of nonpharmaceutical interventions on COVID-19 in Europe

A likelihood-based method for real-time estimation of the serial interval and reproductive number of an epidemic

Estimating individual and household reproduction numbers in an emerging epidemic

Estimating the generation interval for coronavirus disease (COVID-19) based on symptom onset data

Calibrating covid-19 susceptible-exposed-infected-removed models with time-varying effective contact rates

Practical considerations for measuring the effective reproductive number

Mathematical models of infectious disease transmission'

Rapid review of available evidence on the serial interval and generation time of COVID-19

A core group model for disease transmission

The No-U-Turn sampler: adaptively setting path lengths in Hamiltonian Monte Carlo

Disease momentum: estimating the reproduction number in the presence of superspreading

Networks and epidemic models

Modeling infectious diseases in humans and animals

A contribution to the mathematical theory of epidemics

Estimating effective reproduction number using generation time versus serial interval, with application to COVID-19 in the Greater Toronto Area

Modeling infection transmission

Superspreading and the effect of individual variation on disease emergence

Transmission dynamics of HIV infection

Forward-looking serial intervals correctly link epidemic growth to reproduction numbers

R: A language and environment for statistical computing, R Foundation for Statistical Computing

Estimates of serial interval for COVID-19: A systematic review and meta-analysis

Inference under superspreading: Determinants of SARS-CoV-2 transmission in Germany

RStan: the R interface to Stan

Transmission heterogeneities, kinetics, and controllability of SARS-CoV-2

Improved inference of timevarying reproduction numbers during infectious disease outbreaks

Estimation of space-time branching process models in seismology using an EM-type algorithm

Ranknormalization, folding, and localization: An improved R for assessing convergence of MCMC'

How generation intervals shape the relationship between growth rates and reproductive numbers

Different epidemic curves for severe acute respiratory syndrome reveal similar impacts of control measures

Optimizing infectious disease interventions during an emerging epidemic

Gaussian processes for machine learning

Evidence that coronavirus superspreading is fat-tailed

Heterogeneities in the transmission of infectious agents: implications for the design of control programs

The Insight Centre for Data Analytics is supported by Science Foundation Ireland under Grant Number 12/RC/2289 P2.