key: cord-0640663-vjpphufj
authors: Ballard, P. G.; Black, A. J.; Ross, J. V.
title: Inference of population-level disease transmissibility from household-structured symptom onset data
date: 2021-11-19
journal: nan
DOI: nan
sha: 83a3a73b380e82c40cc9068959d5a4b9ce910d53
doc_id: 640663
cord_uid: vjpphufj

First Few X (FFX) studies collect household-stratified data in the early stages of a pandemic, in order to infer severity and transmissibility of an emerging disease. We present a Bayesian method to approximately infer population-level transmissibility for the first time from such data; previous studies have only inferred household-level transmissibility. To do this we perform the inference at two levels, assuming one transmission rate parameter for within-household infection, and another transmission rate parameter for infection between different households. We use a simplifying assumption: that between-household infections always occur in naive households; while still performing full joint inference on the within-household infection parameters. In addition, a novel technique is used to remove systematic bias when the number of new infections per day is growing or decaying, as is common in real outbreaks, without the need for contemporaneous estimates of the serial interval. The method is validated on simulated data and is shown to perform well, even when the number of infected households is relatively small.

The threat from pandemics is ongoing [44] , and this threat has been underscored by the outbreak of COVID-19 [45] . An important part of dealing with this threat is the ability to infer the characteristics of an emerging strain of infection, in particular its transmissibility and severity, which are strong determinants of impacts of the disease [25, 35, 36] .

The accurate inference of transmissibility is particularly challenging, due to the inherently heterogeneous mixing present in larger populations. Models with two levels of mixing, or so-called household models [4, 3] , are an established way to capture some of this heterogeneity [18, 37, 17] . These assume stronger mixing within a household, and weaker mixing in the overall population between individuals in different households [4, 37, 7] . Households are a convenient unit to monitor because contacts are easy to track; and although different rates of mixing have been shown to exist within households [29, 14] , within-household mixing can be assumed to be homogeneous as a first order approximation.

As one part of the response to the threat of a pandemic, governments around the world have invested in advanced plans for data collection in its early stages, so-called "First Few X", or FFX, studies [2, 16, 26, 41] . As this data is already at the household level, it is natural to consider its use in the inference of transmissibility in a two-level model of the entire population. Previous work has shown how mechanistic models can be used to accurately infer transmissibility within a household using FFX data [8] . In this paper, we build on this and demonstrate how FFX data can also be used to infer population level transmissibility.

The method presented here is effective even if some cases are asymptomatic and hence some infected households are not observed.

The method is a three-step process. The first step, the inference of within-household parameters, uses similar methods to our previous work [8, 6], and so is not described in detail in this paper.

The second step infers the between-household component of transmissibility. This uses the within-household parameters inferred during the first step.

By using these inferred within-household parameters, we avoid the need to contemporaneously estimate the generation time or serial interval (for instance via contact tracing), as is used in established methods [11, 39] but has its difficulties (footnote 2).

The third and final step of our method is to combine the within-household and between-household parameters to obtain the posterior distribution of whichever reproduction numbers are desired. Those most commonly reported are R_0 (the mean number of secondary infections per infectious individual, in an otherwise uninfected population), or R_eff (the mean number of secondary infections per infectious individual, during an outbreak). There is some question over what is the most meaningful way to precisely define R_0 or R_eff in a two-level model [4, 30, 15], or even whether the between-household reproduction number R_* (the mean number of secondary households infected per infectious household) [37] is a more useful measure of transmissibility. In any case, any of these reproduction numbers can be derived from the within-household and between-household parameters. It is also possible to infer the posterior distribution of the growth rate, which can then be used for forecasting, if exponential growth or decay is assumed [22].

Due to the small amount of FFX data in Australia during the COVID-19 pandemic (partly due to the fortunate situation of there not being a large number of COVID-19 cases), this study uses synthetic data. This use of synthetic data allows us to make a number of simplifying assumptions. We present these assumptions, and the details of the data and model, in Section 2. We outline the method and its theory in Section 3, including two alternative methods in Sections 3.4 and 3.5. We present sample results in Section 4 and a conclusion in Section 5.

Footnote 2: For instance: difficulties in determining the chain of transmission [21]; difficulties in obtaining enough data points to be a representative sample [21]; subjects having faulty recollection of events [21]; or failing to account for the fact that the average serial interval will be shorter within a household, due to exhaustion of susceptibles [24].

For every household which has had a symptomatic infection, the number of individuals in that household is recorded, as well as a daily record of the number of newly symptomatic individuals each day. This corresponds to the Australian pandemic response plan [2] , which specifies that this daily data should be obtained from all of the FFX infected households. We assume that every symptomatic case will be detected and recorded and that the day of first symptoms can be accurately determined. We also assume that symptoms always correspond to the disease of interest.

The model consists of a set of households, each represented by a stochastic compartmental model (Section 2.2.1), with the addition of between-household infection events (Section 2.2.2).

As synthetic data is used, we have made some simplifying assumptions. We assume that there is effectively an infinite supply of naive households, meaning that the spread between households can be approximated as a branching process [42], which is a reasonable assumption in a population containing a large number of households. Following on from this, we also assume that every between-household infection infects a naive household; in other words, that no household contains two or more individuals who have been infected by someone outside of the household.

We also assume that, apart from the initial "seed" infection into a population, there are no further infections from outside the population. This is a less realistic assumption, and the tracing associated with the spread of COVID-19 has shown that populations continually receive external infections, at some rate.

Some analyses have this external infection as a separate rate, splitting infection into "local" and "imported" [39] . We ignore this factor, though in principle it would not be difficult to account for it.

The methods in this paper can be adapted to a number of different compartmental household models, but we focus here on a single model to illustrate the methods: a model designed to emulate the progress of a COVID-19 infection. The compartments, which are the possible states of an individual, are illustrated in Figure 1 . The process is modelled as a continuous-time Markov chain (CTMC) and the events, rates and parameters are given in Table 1 . The model is an extension of the common SEEIIR model with partial observation [8, 28, 33] ; in that it has two "pre-symptomatic" compartments to account for the time in which an infected individual is infectious before symptoms appear, as occurs with COVID-19. The model assumes that infectiousness is equal in all of the "infectious" compartments (P 1 , P 2 , I s1 , I a1 and I 2 ).

Individuals are symptomatic with probability p s , and the I a1 compartment denotes cases which are never symptomatic. 

Transition                                      Rate               Observed
Exposed to infection
Pre-symptomatic state progresses
Becomes infectious with symptoms, P_2 → I_s1    2 P_2 p_s / t_P    yes
Infectious state progresses
Becomes infectious, no symptoms

Table 1: Within-household transition rates for the COVID-19 model. In the "Rate" column, by a slight abuse of notation, the compartment name refers to the number of individuals in that compartment. β is the within-household infection rate parameter; I = (P_1 + P_2 + I_s1 + I_a1 + I_2) is the total number of infectious individuals in the household; m is the number of individuals in a household; t_E, t_P and t_I are the mean times in the exposed (but not infectious), pre-symptomatic (and infectious) and symptomatic states respectively; and p_s is the probability of an infectious individual becoming symptomatic.

For the process of between-household infection, we assume that the population of naive (that is, fully susceptible) households is large, so new infections can be modelled as a branching process. Thus the rate at which new households become infected is given by α, the rate of between-household infections per infectious individual, multiplied by the total number of infectious individuals (that is, the sum of individuals in the compartments P_1, P_2, I_s1, I_a1 and I_2) across all households. As the supply of susceptible households is effectively infinite, this term does not need to be reduced as the outbreak progresses, unlike the within-household infection rate.

Since it is assumed that no household is infected from another household more than once, any newly infected household initially has one individual in the E 1 compartment, and all other individuals in the S compartment. However the household is not observed, if it is observed at all, until one of the individuals in the household becomes symptomatic; that is, until a P 2 → I s1 event occurs for one of the individuals in the household.

We infer the reproduction numbers by a three-step process: (1) treat households as independent populations, and infer the within-household model parameters; (2) treat households as units, and infer α, the between-household infection rate per infectious individual; and (3) combine the parameters obtained in Steps (1) and (2), to obtain the reproduction numbers. These are described in Sections 3.1, 3.2 and 3.3 respectively.

In addition, we describe two alternative methods in Sections 3.4 and 3.5, which mainly differ at Step (2). The method in Section 3.2 assumes that only data for the first symptomatic infection in each household can be reliably obtained; that is, the data records the day on which each household first has a symptomatic case. The alternative method in Section 3.4 assumes that the day of first symptoms is also available for every individual. This is discussed after Section 3.2 because infector individuals can be treated as households of size 1, making this in most ways a special case of the method discussed in Section 3.2. Section 3.5 describes a simpler method which bypasses Step (2), at some cost in accuracy, by first assuming an exponential growth rate r, and using that to find the between-household infection rate α.

3.1. Step 1: Infer the within-household parameters

The first step of the method is to infer within-household parameters from the FFX data. We use data from within the observed households and, assuming independence between these households (consistent with the assumption of branching dynamics), perform inference and hence sample from the joint posterior distribution of the within-household parameters described in Table 1: β, t_E, t_P, t_I and p_s. Instead of β, however, we infer R_0i = β(t_P + t_I), the within-household component of the reproduction number R_0 or R_eff.

This step is essentially the same as described in our previous paper [8] , using a Markov chain Monte Carlo scheme to sample from the posterior distribution of the parameters, so it is not described in detail here. The only difference is that instead of an exact matrix exponential approach to calculate the likelihood, we use a particle-marginal approach to estimate the likelihood [43] , using importance sampling for the household model [6] . Despite being an estimate, this still targets the correct posterior distribution [38] . This has the advantage of being able to handle more complex household models and larger household sizes.

The output from this step is N_S samples from the joint posterior of the within-household parameters. We typically use N_S = 10000, thinning down from a larger number of posterior points obtained during the Markov chain Monte Carlo run.

Although it is not our direct aim in this paper, we note that the inference of within-household parameters other than the transmission rate provides other valuable information on the epidemic: the time-related parameters (t E , t P and t I ) provide information such as the latent period (time from infection to infectiousness), incubation period (time from infection to symptoms) and infectious period; while p s gives a measure of the severity of the disease; and this all arises due to the collection of FFX data. More detailed information on severity can be obtained if extra FFX data is collected, such as which individuals needed to visit a doctor or hospital.

The second step of inference is to infer α, the between-household infection rate per infectious individual, from the daily information of newly infected households. By assuming an effectively infinite supply of households, the spread of infection between households can be modelled as a branching process.

To simplify the discussion, we first consider the case in which all households are the same size (that is, all households contain the same number of individuals), and all households contain at least one symptomatic case. The extensions to different household sizes and households without symptomatic cases, which add several terms to the equations, but are not difficult conceptually, are given in Section 3.2.2.

In order to estimate E[y_j], the expected number of households that are first symptomatic on day j, we need to estimate the infection potential of a household: the sum of the infectious times of all infected individuals in that household. Then Ψ, the expected value of the infection potential, is given by:

$$\Psi = E\left[\int_0^\infty I(t)\,dt\right] \qquad (1)$$

where I(t) is the number of infectious individuals in a household at time t.

This means that each household generates, on average, αΨ between-household infections. The value of Ψ can be estimated from simulations of the household model.
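As an illustration of estimating Ψ by simulation, the following is a minimal sketch only. It swaps in a much simpler within-household model than the one in Table 1: a Markovian SIR household with a single infectious stage, within-household infection rate β S I/(m − 1) and mean infectious period t_inf. That rate form, the model itself, and the names `infection_potential` and `estimate_psi` are assumptions made for this example, not the authors' implementation.

```python
import random

def infection_potential(m, beta, t_inf, rng):
    """One realisation of the integral of I(t) dt for a household of size m.

    Assumed (simplified) model: SIR household seeded with one infectious
    individual, infection rate beta*S*I/(m-1), recovery rate I/t_inf.
    """
    s, i = m - 1, 1
    total = 0.0  # person-time spent infectious so far
    while i > 0:
        infect_rate = beta * s * i / (m - 1) if m > 1 else 0.0
        recover_rate = i / t_inf
        rate = infect_rate + recover_rate
        dt = rng.expovariate(rate)   # time to the next event
        total += i * dt              # I(t) is constant between events
        if rng.random() < infect_rate / rate:
            s, i = s - 1, i + 1      # within-household infection
        else:
            i -= 1                   # recovery
    return total

def estimate_psi(m, beta, t_inf, n_sims=1000, seed=1):
    """Monte Carlo estimate of the expected infection potential."""
    rng = random.Random(seed)
    draws = (infection_potential(m, beta, t_inf, rng) for _ in range(n_sims))
    return sum(draws) / n_sims
```

For a household of size 1 the estimate should approach t_inf, since the only contribution is the seed individual's own infectious period; larger households give larger values because within-household infections add infectious person-time.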

We also need the distributions of A 1 , A 2 and G, as shown in Figure 2 .

That is, for any given pair of infector and infectee households, we define: A_1 to be a random variable denoting the time from infection to first symptoms in the infector household (the household-wise equivalent of the incubation period [13]), rounded to the nearest day; A_2 to be a random variable denoting the time from infection to first symptoms in the infectee household, rounded to the nearest day; and G to be a random variable denoting the time from first infection in the infector household to first infection in the infectee household (the household-wise equivalent of the generation time [9]), rounded to the nearest day. We also define S_H to be a random variable denoting the time from first symptoms in the infector household to first symptoms in the infectee household (the household-wise equivalent of the serial interval [12]); and C to be the sum of G and A_2.

An important part of our method is that, for a given sample of the within-household parameters, we can obtain unbiased estimates for the distributions of A_1, A_2 and G (and hence C) from simulation of the within-household model, without the need for contact tracing.

The analysis would be simpler if we could use S H , the household-wise equivalent of the serial interval, to estimate E[y j ]. However an unbiased distribution of S H cannot be obtained without accounting for the growth or decay in the outbreak [10] , in terms of the number of newly infected households per day.

This is because a sample of S H is obtained by subtracting a sample of A 1 from a sample of C; and this subtraction of A 1 , also referred to as looking backwards in time [9] , biases the distribution depending on whether the outbreak is growing or decaying, as has been pointed out previously [40, 9, 10] .

However it is possible to estimate E[y_j] without S_H and without contact tracing, using the other variables in Figure 2, because these are estimated forward from the day the infector household is exposed (day l in Figure 2). To do this, it is useful to define w_(j,k,l), the expected number of between-household infections which are caused by an infector household which was infected on day l and first symptomatic on day k, and which are first symptomatic in an infectee household on day j. Then consideration of each day j gives:

$$E[y_j] = \sum_k \sum_l w_{(j,k,l)} \qquad (2)$$

Since every infector household generates on average αΨ between-household infections, it follows from consideration of each day k:

$$\alpha \Psi \, E[y_k] = \sum_j \sum_l w_{(j,k,l)} \qquad (3)$$

and considering infections generated from households initially infected on day l gives:

where T l is the true number of households infected on day l. We cannot measure T l , but a reasonable approximation is that it is proportional to y l , as this would be the case if the rate of growth or decay was constant and if there was no stochasticity. So we estimate T l to be y l multiplied by some value K k which is constant for each value of k, giving:

We estimate K k by substituting (5) into (3) and replacing E[y k ] with the observed value y k , giving:

Since all days l will be the same as or earlier than day k, it follows that K_k > 1 when the number of new infections per day is growing at day k, K_k < 1 when it is falling, and K_k ≈ 1 when it is approximately constant. So the K_k factor provides a correction to account for the growth or decay of the outbreak. Appendix B illustrates how inference results can be biased if this term is not included.

Then substituting (5) into (2) gives:

As shorthand, we define the quantity ξ j which is equal to E[y j ]/α, that is:

We use a gamma distribution as the prior of α, meaning that if the data is Poisson distributed, then the posterior distribution of α is also a gamma distribution [11, 39]. Using c days of data up to and including day d, it can be shown (Appendix A) that the posterior of α is

$$\alpha \mid y \sim \mathrm{Gamma}\left(a + \sum_{j=d-c+1}^{d} y_j,\; b + \sum_{j=d-c+1}^{d} \xi_j\right) \qquad (8)$$

where the first parameter is the shape and the second parameter is the rate, and a and b are the prior's shape and rate respectively. This equation gives the distribution of α as calculated on day d.
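The conjugate update can be sketched as follows, assuming the usual gamma-Poisson form: with y_j Poisson distributed with mean α ξ_j and a Gamma(a, b) prior (shape a, rate b), the posterior shape is a plus the sum of y_j over the window and the posterior rate is b plus the sum of ξ_j over the window. The function names and the example values of y and ξ are hypothetical; in the method described here, ξ_j would come from simulations of the household model.

```python
import random

def alpha_posterior(y, xi, a, b, c, d):
    """Shape and rate of the gamma posterior of alpha.

    Uses the c days ending on day d (days are 1-indexed in the text,
    lists are 0-indexed here). a and b are the prior shape and rate.
    """
    window = range(d - c, d)  # zero-based indices of days d-c+1 .. d
    shape = a + sum(y[j] for j in window)
    rate = b + sum(xi[j] for j in window)
    return shape, rate

def sample_alpha(shape, rate, rng):
    """Draw one posterior sample; gammavariate expects a scale, not a rate."""
    return rng.gammavariate(shape, 1.0 / rate)

# Hypothetical daily counts of newly symptomatic households and xi values.
y = [2, 3, 5, 4, 6, 8, 7, 9, 12, 10]
xi = [10.0] * 10
shape, rate = alpha_posterior(y, xi, a=1.0, b=1.0 / 0.3, c=7, d=10)
alpha_draw = sample_alpha(shape, rate, random.Random(0))
```

Because the update only involves two running sums, it can be recomputed cheaply for each new day of data and for each posterior sample of the within-household parameters.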

Aside from the observed data values in y, all quantities on the right hand sides of (6) and (7) can be estimated from simulations of the household model.

Thus this step gives samples from the joint posterior distribution of α and the within-household parameters.

By making K k constant for any value of k, we are assuming a constant rate of growth or decay. A more sophisticated calculation may be possible which accounts for change in this rate. But since we are only using the last c days of data for inference in (8), this should not be necessary. However, it may be a point for investigation in the future.

We modify the discussion in the previous section to account for households of different sizes, and asymptomatic cases.

To account for asymptomatic cases, we define p h to be the probability of an infected household being symptomatic, and p i to be the probability of a randomly chosen infectious individual being a member of a symptomatic household.

We define a symptomatic household to be an infected household in which at least one individual is eventually symptomatic; and an asymptomatic household to be an infected household in which all infected individuals remain asymptomatic.

To account for different household sizes, we require the observed data to be a 2-dimensional array y, with y_(j,n) being the number of households of size n which are first symptomatic on day j. We replace Ψ with Ψ_m, the expected infection potential of a symptomatic household of size m, given by:

$$\Psi_m = E\left[\int_0^\infty I_m(t)\,dt\right] \qquad (9)$$

where I_m(t) is the number of infectious individuals in a symptomatic household of size m at time t. In addition, M and N are random variables denoting the sizes of the infector and infectee households respectively.

We define the quantity w_(j,k,l,m,n), which is the expected number of between-household infections which are caused by an infector household of size m which was infected on day l and first symptomatic on day k, and which are first symptomatic in an infectee household of size n on day j.

Then consideration of each day j also needs to account for the fact that on average only p i of all infections come from infectious households, and only p h of infections are in infectee households which become symptomatic. So (2), (3) and (4) are modified to:

$$\alpha \Psi_m \, E[y_{(k,m)}] = \sum_j \sum_l \sum_n w_{(j,k,l,m,n)}; \text{ and} \qquad (11)$$

where T (l,m) is the true number of symptomatic households of size m infected on day l, and p * means "the probability, given that both the infector and infectee households are symptomatic".

Then following a similar analysis to Section 3.2.1 gives:

and

where ξ (j,n) = E[y (j,n) ]/α.

We again use a gamma distribution as the prior of α, and infer using c days of data up to and including day d. This means (8) is modified to:

where the first parameter is the shape and the second parameter is the rate, and a and b are the prior's shape and rate respectively.

Once again, aside from the observed data values in y, all quantities on the right hand sides of (13) and (14) can be estimated from simulations of the household model. Therefore, for each of the N S samples from the joint posterior of the within-household parameters obtained in Step 1, we run a number of simulations of the household model (typically 1000) per household size, allowing ξ (j,n) to be estimated for all values of n, for each day j; and then a single sample of α is taken from (15) . Thus this step gives N S samples from the joint posterior distribution of α and the within-household parameters.

Once samples of all the model parameters are obtained, the population reproduction numbers are fairly straightforward to evaluate.

To account for different household sizes, we define h_m to be the probability that a randomly chosen household will be of size m. Therefore a newly infected household has size m with probability π_m, where π_m is the size-biased distribution [7], and is given by the equation,

$$\pi_m = \frac{m\,h_m}{\sum_k k\,h_k} \qquad (16)$$
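As a small illustration of the size-biased distribution, the sketch below computes π_m from h_m assuming the standard size-biased form, π_m = m h_m / Σ_k k h_k, in which a household is reached in proportion to the number of individuals it contains. The household-size probabilities and the function name `size_biased` are hypothetical, chosen only for the example.

```python
def size_biased(h):
    """Map household-size probabilities h_m to size-biased probabilities pi_m.

    h is a dict {m: h_m}; the result weights each size by m and renormalises.
    """
    mean_size = sum(m * p for m, p in h.items())
    return {m: m * p / mean_size for m, p in h.items()}

# Hypothetical household size distribution (not taken from the paper).
h = {1: 0.25, 2: 0.35, 3: 0.15, 4: 0.25}
pi = size_biased(h)
```

With these example values the mean household size is 2.4, so households of size 4 make up a quarter of all households but receive a larger share (1.0/2.4) of new between-household infections.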

The household reproduction number R * [7] can then be evaluated using previously obtained values:

where Ψ m is given in (9) .

We 

This equation can also be used to calculate R eff .

Other definitions of R 0 and R eff have been proposed for two-level models [4, 15, 30] . These can also be calculated using the raw data from the previous steps. For instance, an alternative might be to use R HI [15] , the expected number infected by a randomly chosen infectious individual. In that case, we 

An estimate of the exponential growth rate r can also be obtained [7] , by finding the value of r for which

Equation (20) Another advantage is that, by incorporating the summation into an existing simulation model, the code is relatively easy to verify.

In all cases, these parameters are evaluated as samples from the distribution of the posterior. The output from Step (2) is N S samples from the joint posterior of α and the within-household parameters; we typically use N S = 10000. The method in Step (3), to evaluate any of R * , R 0 , R eff or r, is repeated for each of these samples. Therefore the method gives N S joint samples of all parameters.

In this section and in Section 3.5, we present two alternate methods, which are relevant if different data is available.

The analysis in Section 3.2 assumes that the only observed data is the number of newly symptomatic households of each household size on each day, which is stored in the 2-dimensional array y. In this section we consider the situation in which the number of newly symptomatic individuals is also available. We denote this as z = (z 1 . . . z d ), where z j is the number of individuals who are first symptomatic on day j.

In that case, there is no need to use the infector households to estimate the number of newly infectious individuals, because that quantity is available directly as z. The two-dimensional array y is still used as the count of new infectee households.

Infector individuals can be treated identically to infector households of size 1, which are symptomatic with probability p s (Table 1) ; (21) and (14) is modified to:

The rest of the analysis proceeds as in Sections 3.2.2 and 3.3, including the use of (15) to find the distribution of α.

Another alternative method, which builds on a previously published method [37, 7], is to first infer the growth rate r, and then use r to find α. This estimate of α, along with the estimates of the within-household parameters obtained in Step 1, is then used to obtain estimates of R_*, R_0 or R_eff. The inference of r is an extra step in this inference, so this method is usually not preferred. However it might be preferable in situations where the exponential growth rate is relatively easy to obtain. For instance, the distribution of r could be approximately inferred by performing a Bayesian linear regression of log(y) versus time, where y is the number of newly infected households per day.
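To make the regression idea concrete, the sketch below fits a straight line to log(y) against time by ordinary least squares. This is a deliberately simpler stand-in for the Bayesian linear regression mentioned above: it returns only a point estimate of r rather than a posterior distribution, and it assumes every daily count is strictly positive. The function name `growth_rate` and the example data are hypothetical.

```python
import math

def growth_rate(y):
    """Least-squares slope and intercept of log(y) against day index.

    The slope is a point estimate of the exponential growth rate r
    (per day); all counts in y must be strictly positive.
    """
    n = len(y)
    t = list(range(n))
    log_y = [math.log(v) for v in y]
    t_mean = sum(t) / n
    l_mean = sum(log_y) / n
    sxx = sum((ti - t_mean) ** 2 for ti in t)
    sxy = sum((ti - t_mean) * (li - l_mean) for ti, li in zip(t, log_y))
    slope = sxy / sxx
    intercept = l_mean - slope * t_mean
    return slope, intercept

# Noise-free example: counts growing at exactly 10% per day on the log scale.
y_example = [5.0 * math.exp(0.1 * day) for day in range(14)]
r_hat, log_y0_hat = growth_rate(y_example)
```

On real count data the residuals would not be Gaussian on the log scale, which is one reason a full Bayesian treatment, or the method of Section 3.2, may be preferable.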

Given samples from the posterior distribution of r obtained by this or another method, we can take samples from the posterior of the within-household parameters, and calculate corresponding samples of the between-household infection rate α using (20), but with α rather than r being the unknown quantity.

As in Section 3.3, usually the simplest and most efficient way to find α is to estimate the integral in (20) by simulation.

With samples of α generated, corresponding samples of R * , R 0 or R eff can then be generated using (17), (18) or (19) as appropriate.

This method has the appeal of bypassing the Section 3.2 calculation of α.

However a disadvantage is that the samples of r are not joint with the samples of the within-household parameters. This is likely to contribute to an overestimation in the variances in the distributions of α, R * , R 0 and R eff .

For the first two sets of tests, in Figures 3 and 4, data was generated using the parameters t_E = 2, t_P = 1.8, t_I = 1.5 and p_s = 0.8 (see also Table 2). Inference was performed using the method described in Sections 3.1 to 3.3, using (18) to evaluate R_eff and (17) to evaluate R_*. (All tests were repeated using the method described in Section 3.4, with very similar results.) The following priors were used for the within-household parameters: R_0i had a gamma distribution with shape 3 and rate 5/3 (mean = 1.8); t_E had a gamma distribution with shape 3 and rate 0.5 (mean = 6); t_P and t_I both had a gamma distribution with shape 3 and rate 2 (mean = 1.5); and p_s had a beta distribution with first and second shape parameters 2.8 and 1.2 (mean = 0.7). For the between-household infection parameter α, we followed previous work [39] and chose a fairly broad prior: a gamma distribution with shape 1 and scale 0.3 (mean = 0.3). With a prior mean infectious time (t_P + t_I) of 3, this corresponds to a mean of 0.9 for the between-household component of R_eff.
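The prior means quoted above can be checked with a few lines of arithmetic, using the standard results that a gamma distribution with a given shape and rate has mean shape/rate (and scale = 1/rate), and that a beta distribution with shape parameters a and b has mean a/(a + b). The helper names below are hypothetical.

```python
def gamma_mean(shape, rate):
    """Mean of a gamma distribution parameterised by shape and rate."""
    return shape / rate

def beta_mean(a, b):
    """Mean of a beta distribution with shape parameters a and b."""
    return a / (a + b)

prior_means = {
    "R_0i": gamma_mean(3, 5 / 3),
    "t_E": gamma_mean(3, 0.5),
    "t_P": gamma_mean(3, 2),
    "t_I": gamma_mean(3, 2),
    "p_s": beta_mean(2.8, 1.2),
    "alpha": gamma_mean(1, 1 / 0.3),  # shape 1, scale 0.3
}

# Prior mean of the between-household component of R_eff:
# alpha multiplied by the prior mean infectious time t_P + t_I.
between_component = prior_means["alpha"] * (
    prior_means["t_P"] + prior_means["t_I"]
)
```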

For the first set of tests, we ran the within-household inference only once, inferring the within-household parameters from the full FFX data for the first 50 households, but used daily updates of data for the between-household inference. This was partly a decision of practicality: the between-household inference is far faster to run than the within-household inference, due to the former's use of conjugate priors. It also reflects the authors' experience of FFX data collection in Australia during the COVID-19 outbreak, in which updates of within-household data tended to occur slowly. But it is also based on a reasonable assumption: that as countermeasures such as social distancing and working from home are brought in to restrict an outbreak, primarily the between-household infection rate will be affected, and the within-household parameters will be impacted less.

We performed inference of R_eff and R_* for each day of the simulated outbreak, using the most recent 7 days of data for the between-household inference; the results are shown in Figure 3. The inference works well, even with only 50 households of FFX data for within-household inference: it tracks changes to the transmission rate, and the confidence interval reduces as the amount of data grows. In order to determine whether the within-household or between-household inference was the main source of variance, the inference was repeated with the within-household parameter values perfectly known (that is, every sample uses the within-household parameter values used to generate the data: R_0i = 1.4, t_E = 2, t_P = 1.8, t_I = 1.5 and p_s = 0.8), with the results in Figure 4.

This shows a substantial reduction in variance, although this is much more pronounced in R_eff than R_*. This is because R_eff, by its very definition in (18), has a substantial within-household component: both within-household and between-household infections are counted. In contrast, the evaluation of R_* only directly considers between-household transmission, so changes to the within-household inference method have a less dramatic effect on R_*.

Table 2; corresponding to R_eff = 1.654 and R_* = 1.766. As in Figures 3 and 4

We do notice that both R_eff and R_* are overestimated when there is a small amount of data, in particular for lower values of H_b. In the case of lower values of H_b, we believe that the main cause is that the initial fade-out of some realisations causes a "selection bias", leading to an overestimation of R_eff and R_* [27, 34], which is not simple to correct.

The inferred values of the exponential growth rate r can also be used for simple forecasting, assuming a continued exponential growth or decay.

In addition to forecasting with the inferred parameters for a given day, forecasts may account for reduced infection rates, re-calculating r using the method in Section 3.3. An example is shown in Figure 6 . The true (simulated) progress of the outbreak for the next 10 days is also included, showing that the exponential approximation works reasonably well for a short-term forecast. 
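A simple exponential forecast of this kind can be sketched as follows, assuming that the expected number of newly symptomatic households k days ahead is the current daily count multiplied by exp(r k), and that posterior samples of r are available. Uncertainty is summarised by pointwise quantiles across the sampled trajectories. The function names and the example samples of r are hypothetical, and this sketch ignores the stochastic variation of daily counts around the exponential trend.

```python
import math

def forecast_paths(y_today, r_samples, horizon):
    """One exponential trajectory per sampled growth rate r.

    Each path gives the projected daily count for days 1..horizon ahead.
    """
    return [
        [y_today * math.exp(r * k) for k in range(1, horizon + 1)]
        for r in r_samples
    ]

def pointwise_quantile(paths, q):
    """Empirical q-quantile across paths, computed separately for each day."""
    result = []
    for day_values in zip(*paths):
        ordered = sorted(day_values)
        index = min(len(ordered) - 1, int(q * len(ordered)))
        result.append(ordered[index])
    return result

# Hypothetical posterior samples of r (per day) and a current daily count.
r_samples = [0.0, 0.1, 0.2]
paths = forecast_paths(10.0, r_samples, horizon=2)
median_forecast = pointwise_quantile(paths, 0.5)
```

To explore a scenario with reduced infection rates, the samples of r would first be re-calculated under the reduced rates and then passed to the same projection.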

We have presented a method for inference of infection parameters in a two-level infection model, with different infection rates within households and between different households. By separating the parameters into two components, we split a complicated inference problem into two relatively well understood problems.

The within-household infection parameters can be inferred using known techniques such as a particle-marginal Metropolis-Hastings algorithm [38, 43] . The inferred within-household parameters can then be used to estimate the effective number of infector households on a given day, and hence estimate α, the between-household infection rate. The technique is similar to other methods which infer the parameters dynamically during an outbreak [11, 39] , but also uses knowledge of within-household structure and parameters, and uses it to eliminate a systematic bias when the number of new infections per day is changing.

The within-household and between-household parameters can be combined to give the reproduction number in a number of forms. An exponential growth rate r can also be estimated and used for forecasting; in this case, the required path integral can be estimated efficiently using simulation. In all cases, testing on simulated data confirms that the method works well.

This work could be extended in a number of directions. First, we emphasise that the method is not specific to any particular within-household model, so it can be adapted to a model for a different type of pathogen, or to a more or less detailed model of the internal structure of a household. The method also scales well to larger household groups, unlike earlier methods [37, 7], which strained computer resources once households reached even a moderate size.

For the between-household inference, we believe a household-based model can be useful for estimating the spread of an infection through a population.

Our technique for overcoming the bias when the number of new infections per day is growing or decaying (Section 3.2) is not limited to a household-based model, so may be applicable in other situations. The technique presented here assumes a constant rate of change, and it may also be useful to refine this technique to account for changes in the rate of growth or decay.

There is also scope for extending the method for forecasting. We present a very simple method: calculating the exponential growth rate and using it to project forward. However, with all lower-level parameters inferred, such as those relating to transmission of the infection and recovery of individuals, it may be possible to make forecasts using those parameters directly, without assuming a constant exponential growth rate.

In conclusion, we believe the stratification of data into households is an effective method for inference of outbreak parameters, and there is much scope for using these or similar methods in the future.

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Simulation and inference software was written in Julia [5], with post-processing in Python and Matplotlib [19].

[1] Root finding functions for Julia. https://github.com/JuliaMath/Roots.jl. 

To simplify the logic, we define S = a + ..., which is equivalent to (8).

This appendix concerns the correction to account for the growth or decay of an outbreak, which is discussed in Section 3.2.1, and appears as the K term in (7), (14) and (22). We illustrate how, without this term, we see a bias due to the growth or decay of the outbreak, as has previously been reported [40, 9, 10]; and that this K term provides an appropriate correction.

Figures B.7 and B.8 show inference results, of R eff and R * respectively, from the same simulation. This simulation uses the distribution of household sizes given in Table 2 ; within-household parameters t E = 2, t P = 1.8, t I = 1.5, p s = 0.8, R 0i = β(t P + t I ) = 1.4; and α initially 0.242, then reducing to 0.0727 on day 70. For inference, the within-household parameters are assumed to be known exactly, while α has a gamma distribution with shape 1 and scale 0.3.

That is, all parameters and settings are identical to those used for Figure 4 , except that α reduces more sharply, to exaggerate the rate of decay of R eff and R * .
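The stated settings imply a few derived quantities that can be checked directly. The short Python sketch below is illustrative only: the variable names are assumptions introduced here, and it uses just the relation R 0i = β(t P + t I ) given above, together with the standard mean and variance of a gamma distribution in the shape and scale parameterisation.

```python
# Within-household settings quoted above.
t_P = 1.8
t_I = 1.5
R_0i = 1.4

# R_0i = beta * (t_P + t_I), so beta follows by division.
beta = R_0i / (t_P + t_I)

# Gamma prior on alpha with shape 1 and scale 0.3.
shape, scale = 1.0, 0.3
prior_mean = shape * scale        # mean of a gamma(shape, scale)
prior_var = shape * scale ** 2    # variance of a gamma(shape, scale)
prior_rate = 1.0 / scale          # equivalent rate parameter

# Relative size of alpha after the reduction on day 70.
alpha_before, alpha_after = 0.242, 0.0727
reduction_ratio = alpha_after / alpha_before

print(f"beta = {beta:.4f}")
print(f"prior mean = {prior_mean:.2f}, variance = {prior_var:.2f}, rate = {prior_rate:.3f}")
print(f"alpha_after / alpha_before = {reduction_ratio:.3f}")
```

This gives β of roughly 0.424, a prior mean for α of 0.3, and a between-household rate after day 70 that is about 30% of its initial value.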

Without the correction of the K term, the inference method overestimates α when the outbreak is growing, and underestimates α when the outbreak is decaying, causing bias in R eff (Figure B.7(a)) and R * (Figure B.8(a)).

When the K term is included, the inference accounts for this growth and decay, as is illustrated in Figures B.7 

A general model for stochastic SIR epidemics with two levels of mixing

Epidemics with two levels of mixing

Julia: A fresh approach to numerical computing

Importance sampling for partially observed temporal epidemic models

Epidemiological consequences of household-based antiviral prophylaxis for pandemic influenza

Characterising pandemic severity and transmissibility from data collected during first few hundred studies

Estimation in emerging epidemics: biases and remedies

Intrinsic and realized generation intervals in infectious-disease transmission

A new framework and software to estimate time-varying reproduction numbers during epidemics

Practical considerations for measuring the effective reproductive number, R t . medRxiv

Principles of Epidemiology in Health Practice, Third Edition. U.S. Department of Health and Human Services

Synthetic population dynamics: a model of household demography

Reproductive numbers, epidemic spread and control in a community of households

"First Few Hundred" project, epidemiological protocols for comprehensive assessment of early swine influenza cases in the United Kingdom

Incorporating household structure and demography into models of endemic disease

Deterministic epidemic models with explicit household structure

Matplotlib: A 2D graphics environment

Australia: Household size

Rapid review of available evidence on the serial interval and generation time of COVID-19

1,000,000 cases of COVID-19 outside of China: the date predicted by a simple heuristic

On thinning of chains in MCMC

Measurability of the epidemic reproduction number in data-driven contact networks

Pandemic controllability: a concept to guide a proportionate and flexible operational response to future influenza pandemics

Pandemic (H1N1) 2009 influenza in the UK: clinical and epidemiological findings from the first few hundred (FF100) cases

Effective reproduction numbers are commonly overestimated early in a disease outbreak

A data-driven model for influenza transmission incorporating media effects

Social contacts and mixing patterns relevant to the spread of infectious diseases

Reproduction numbers for epidemic models with households and other social structures I: Definition and calculation of R0

Integrals for continuous-time Markov chains

Path integrals for continuous-time Markov chains

Early analysis of the Australian COVID-19 epidemic. eLife

Estimating the basic reproductive number during the early stages of an emerging epidemic

Novel framework for assessing epidemiologic effect of influenza epidemics and pandemics

Early characterization of the severity and transmissibility of pandemic influenza using clinical episode data from multiple populations

Calculation of disease dynamics in a population of households

Probabilistic learning of nonlinear dynamical systems using sequential Monte Carlo

Improved inference of time-varying reproduction numbers during infectious disease outbreaks

Some model based considerations on observing generation times for communicable diseases

Utility of the first few100 approach during the 2009 influenza A(H1N1) pandemic in the Netherlands

Inference of epidemiological parameters from household stratified data

Bayesian model discrimination for partially-observed epidemic models

World Health Organization. WHO guidance for surveillance during an influenza pandemic

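Consider a set of event rates αξ 1 , ..., αξ c , where ξ 1 , ..., ξ c are known but α is not known, and corresponding observed daily event counts y = (y 1 , ..., y c ). Given that the prior distribution of α is a gamma distribution with shape a and rate b, meaning

$$ f(\alpha) = \frac{b^{a}}{\Gamma(a)} \, \alpha^{a-1} e^{-b\alpha}, $$

we wish to infer the posterior distribution f (α|y). The number of daily events Y j is Poisson distributed, with

$$ Y_j \sim \mathrm{Poisson}(\alpha \xi_j), \qquad j = 1, \ldots, c, $$

which means the likelihood f (y|α) is given by:

$$ f(y \mid \alpha) = \prod_{j=1}^{c} \frac{(\alpha \xi_j)^{y_j} \, e^{-\alpha \xi_j}}{y_j!}. $$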
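Under the assumptions stated above (Poisson counts with rates αξ j , taken here to be independent given α, and a gamma prior with shape a and rate b), the standard gamma-Poisson conjugacy result gives a gamma posterior with shape a + Σ y j and rate b + Σ ξ j . The Python sketch below illustrates this plain conjugate update only. It is not the authors' code, the function names and example numbers are hypothetical, and any correction for growth or decay is assumed to be already absorbed into the known multipliers ξ j .

```python
def gamma_poisson_posterior(a, b, xi, y):
    """Return posterior (shape, rate) for alpha.

    Prior: alpha ~ Gamma(shape=a, rate=b).
    Data:  y[j] ~ Poisson(alpha * xi[j]) independently, with xi[j] known.
    """
    if len(xi) != len(y):
        raise ValueError("xi and y must have the same length")
    if a <= 0 or b <= 0:
        raise ValueError("prior shape and rate must be positive")
    post_shape = a + sum(y)    # prior shape plus total observed events
    post_rate = b + sum(xi)    # prior rate plus total exposure
    return post_shape, post_rate


def gamma_mean_var(shape, rate):
    """Mean and variance of a gamma distribution in shape/rate form."""
    return shape / rate, shape / rate ** 2


# Hypothetical example: three days of counts with known multipliers xi,
# and a prior with shape 1 and scale 0.3 (rate 1 / 0.3).
shape, rate = gamma_poisson_posterior(a=1.0, b=1.0 / 0.3, xi=[10.0, 12.0, 15.0], y=[2, 3, 4])
mean, var = gamma_mean_var(shape, rate)
```

Because the update only needs running totals of y j and ξ j , the posterior can be refreshed each day as new counts arrive without revisiting earlier data.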