key: cord-0582853-vt5snac5
authors: Lega, Joceline
title: Parameter Estimation from ICC curves
date: 2020-05-17
journal: nan
DOI: nan
sha: 03ba6e3537d0dda29415cd6a068f292d7004b836
doc_id: 582853
cord_uid: vt5snac5

Incidence - Cumulative Cases (ICC) curves are introduced and shown to provide a simple framework for parameter identification in the case of the most elementary epidemiological model, consisting of susceptible, infected, and removed compartments. This novel methodology is used to estimate the basic reproductive number of recent outbreaks, including the ongoing COVID-19 epidemic.

This article develops the theoretical underpinnings of the forecasting approach put forward in [1] , which was shown to lead to useful estimates of the characteristics (expected duration, final number of cases, and peak intensity) of ongoing outbreaks. The present analysis additionally provides a method for the practical identification of epidemiological parameters appearing in the SIR (Susceptible -Infected -Removed) compartmental model. Because the proposed approach is robust to under-reporting, it provides a simple way to estimate the basic reproductive number of an outbreak from reported incidence data. The classical SIR model (see for instance [2] ) is the simplest compartmental model that captures the basic phenomenon of disease transmission: before recovering (or being removed through diseaseinduced death), infected individuals transmit the disease to susceptible individuals, who in turn become infected. The process repeats itself until the epidemic has followed its course and no additional infections occur. Many outbreaks have a time scale that is much shorter than the scale at which the total population of N individuals changes, leading to the assumption that N is constant. Defining the time-dependent quantities s = S/N, i = I/N, and r = R/N as the expected sizes of the susceptible (S), infected (I), and recovered (R) compartments relative to the total population N, so that s + i + r = 1, the SIR model is given by

where β is the contact rate of the disease and γ is its recovery rate. Scaling time by τ = 1/γ turns the SIR model into a one-parameter (R 0 ) model, where R 0 = β/γ is the basic reproductive number [3, 4] of the modeled epidemic. The disease spreads if R 0 > 1 and dies out when R 0 < 1.

We also see from the right-hand side of the equation for di/dt that for i = 0, the proportion of infected individuals grows as a function of time as long as s > γ/β = 1/R 0 . The forecasting method proposed in [1] relies on a general property of epidemiological compartment models, which was both empirically observed in epidemiological data sets and confirmed through numerical simulations: in the plane of coordinates C (cumulative cases) and I (incidence), the trajectory of an outbreak is a concave-down curve resembling a parabola (see [1] for examples). We call such a curve an ICC (incidence -cumulative cases) curve. In the present article, we provide an exact expression for the ICC curve of the SIR model, in the form of the following theorem. Theorem 1. For a total susceptible population of size N and each initial condition κ = S(0) N = 1 − C(0) N , the ICC curve of the SIR model is given by

where C = I + R ∈ [0, C ∞ ] and C ∞ , the final number of cases, is the positive solution of the transcendental equation

Details of the derivation of this theorem are given in Section 2.1. Its significance is illustrated in Figure 1 , which shows a numerical simulation of the SIR model and a plot of the corresponding ICC curve for R 0 = 2. Instead of considering the time dependence of S, I, and R as shown in the left panel of Figure 1 , the course of the outbreak is captured by a single curve: a plot of incidence versus cumulative cases. As explained in Section 2.1, knowledge of the ICC curve is sufficient to reconstruct the time-dependence of S, I, and R. In other words, the ICC curve contains the entire information associated with any outbreak described by the SIR model. (Color online). Simulation of the SIR model showing the timedependence of the S, I, and R compartments (left), for R 0 = 2. The corresponding ICC curve is shown on the right. The exact expression given by Equation (1) is plotted as a solid black line; the numerically estimated ICC curve is shown as a thick cyan line.

An important motivation for studying ICC curves is that they can easily be fitted to case data. Specifically, we show in Section 3.1 that, for fixed N , there exists a single set of parameters (β, γ, κ) that minimizes the root mean square error between the simulated outbreak data and the ICC curve I = G κ,N (C) given by Equation (1). Moreover, such an estimation is robust to noise. As a consequence, the present approach provides a method to estimate the parameters of the SIR model, thereby making this model practically identifiable.

The rest of this manuscript is organized as follows. Section 2 presents the derivation of Equations (1) and 2, and discusses the robustness of ICC curves to systemic under-(or over-) reporting. Section 3 explains how model parameters may be retrieved from ICC curves and assesses the robustness of the proposed procedure to reporting noise. Section 4 briefly discusses parameter estimation as an outbreak unfolds. Section 5 shows a few examples of application on real outbreak data. Future extensions and applications of the methodology proposed in this article are reviewed in Section 6.

2. ICC curves of the SIR Model 2.1. Exact form. Epidemiologists use the epidemiological (EPI) curve, a plot of incidence as a function of time, to visualize the course of an epidemic. In the SIR model, incidence is defined as

where C = R + I is the cumulative number of cases at time t. Note that C includes those who have recovered (and thus were a case at some point in time), as well as those who are currently infected. To derive Equation (1), write ds/dr = −βs/γ = −R 0 s, which can be integrated to give s = κ exp(−R 0 r), where κ ∈ (0, 1) is a constant that depends on initial conditions. Typically, κ 1 (see e.g. [5] ) since for an outbreak in a naive population we normally have r(0) = 0, i(0) << 1, and thus κ = s(0)

1. However here we do not make this assumption and keep the parameter κ in the following discussion. With c = C/N and s = 1 − (i + r) = 1 − c, we obtain

Moving back to dimensional variables, we see that I = dC/dt = G κ,N (C). The ICC curve of the SIR model with initial conditions described by κ is thus given by Equation (1). Since G κ,N ≥ 0 on [0, C ∞ ], for a given initial condition κ = 1 − C(0)/N , the part of the ICC curve traced out as the outbreak unfolds will be such that C(t) ≥ C(0) > 0. However, the ICC curve is well defined for C in the range [0, C ∞ ], which is the definition we adopt here. Even though the time variable does not appear in the ICC curve, the value of κ is not arbitrary: C(0) is the value of C when R = 0, where R represents the removed compartment. An approximate expression of the SIR ICC curve, assuming κ = 1 and C/N sufficiently small was given in [1] . ICC curves are particularly useful for forecasting the course of an outbreak because they contain information on quantities that are of interest to public health practitioners, namely the final number of cases (C ∞ ) and peak incidence (I max ). The latter is simply the maximum of G κ,N ; the former may be found as follows. Since C is the cumulative number of cases, C is a monotonic function of time and its derivative, G κ,N (C), must remain non-negative. For C/N small,

where G κ,N (0) = −N γ ln(κ) > 0. By continuity, G κ,N (C) will remain non-negative on the interval [0, C ∞ ], where C ∞ is the unique positive solution of Equation (2) . By writing Equation (2) as N = C ∞ +κN exp(−R 0 C ∞ /N ), where the right-hand-side is the sum of two non-negative monotonic functions of C ∞ , one increasing and the other decreasing, and since 0 < κ < 1, one can see that such a solution exists and is unique. Asymptotically, when the disease has followed its course and all infections have occurred, I = 0 and C = R. Then, S = κN exp(−R 0 R/N ) = κN exp(−R 0 C/N ) and the conservation law N = S + I + R becomes N = κN exp(−R 0 C/N ) + C. In other words, C ∞ defined in Equation (2) is simply the final (asymptotic) number of cases associated with the outbreak. This concludes the proof of Theorem 1.

To see that the entire dynamics of an outbreak in the SIR model is captured by the associated ICC curve, note that knowledge of C(t) leads to knowledge of

In other words, the time course of S, I, and R may be obtained from the solution of the differential equation dC/dt = G κ,N (C) prescribed by the ICC curve, with appropriate initial conditions (which in turn define the value of κ).

Robustness to under-or over-reporting. The quantity N appearing in Equation (1) refers to the total population taken into account in the model, which is conserved by the dynamics. In case of over-or under-reporting, the observed incidence expectation I o is a fraction of the actual incidence I, so that I o = αI, where we assume that α is constant or varying slowly enough for its derivative to be negligible. The ICC curve estimated from case report data will then be a graph of

where N o = αN . Therefore, the ICC curve in the presence of over-reporting (α > 1) or underreporting (α < 1) has the same functional form, with the same parameter values, as the true ICC curve, provided I, C, and N are rescaled by the factor α. For the same reason, κ = S(0)/N is also independent of the value of α.

While systematic under-or over-reportding does not affect the ICC curve provided the size of each compartment is appropriately rescaled, the reproductive number depends on N through the ratio C ∞ /N , as can be seen from Equation (2), in which setting κ = 1 leads to

Finally, rescaling time amounts to rescaling β and γ, and consequently the ICC curve.

Structural identifiability (see e.g. [6] for a review) relates to the ability of uniquely identifying model parameters from knowledge of the model output. In this context, N is assumed to be known since it represents the population of the system to which the model is applied. Given C(t), one can calculate I(t) = dC/dt = G κ,N (C(t)). From Equation (1), it is clear that knowledge of C(t) and I(t) uniquely determines β, γ, and κ. Indeed, if two sets of parameters, (β (1) , γ (1) , κ (1) ) and (β (2) , γ (2) , κ (2) ), led to the same output C(t) (and thus to the same values of its derivative I(t)), we would have

0 , and κ (1) = κ (2) . Therefore, all of the parameters of the SIR model, including initial conditions, are structurally identifiable from knowledge of C(t), the cumulative number of cases, and N , the size of the susceptible population. Similar results were obtained in [7] [8] [9] . If N is unknown, setting

for all values of C N ∈ 0, C∞ N , additionally leads to N (1) = N (2) , making the size of the population N a structurally identifiable parameter as well.

Whereas structural identifiability depends on the nature of the model itself, practical identifiability relates to parameter identifiability in the presence of measurement noise, as well as (hopefully small) deviations between model and reality (see e.g. [6] ). This question may be addressed numerically by simulating an exact solution of the model, adding noise to its output, and estimating the model parameters that best fit the perturbed output. Two recent articles [9, 10] discuss the practical identifiability of a variety of compartmental models and reach different conclusions regarding parameter identification of the SEIR model from cumulative data. In [9] , the cumulative data is multiplied by 1 + , where is normally distributed with zero mean, whereas in [10] , Poissondistributed noise of mean equal to the current incidence value [11] is used as a substitute for the incidence data. Whether or not parameters are identifiable in practice therefore depends on the type of observational noise that affects surveillance reports and thus the EPI curve.

Adding independently-distributed noise to the incidence rather than the cumulative data is more realistic since errors on C are considered to accumulate and not be independent [12] . On the one hand, assuming Poisson-type of uncertainty is natural for data that represent counts, but the effect of replacing incidence data by a Poisson random variable of same mean will lead to significant relative deviations only if N is small (a few hundreds). For larger values of N and thus of I, the relative change δI = |I − Poisson(I)|/I scales like 1/ √ I. Indeed,

With Stirling's formula, we obtain

If it is believed that the diminishing relative influence of the Poisson-distributed noise for large values of N is not realistic for a specific epidemic, then multiplying I by (1 + ) where is normally distributed with mean zero, circumvents this issue. We consider both types of noise below. . Empirical ICC curves for two hypothetical outbreaks described by the SIR model with Poisson-distributed reporting noise. Incidence information is assumed to be available every δ t , with δ t = 1 for R 0 = 2 and δ t = 2 for R 0 = 10.5. In each panel, stars correspond to the simulated outbreak data plotted in the (C, I) plane. Data points without added noise (circles) and the exact ICC curve (solid line) are also shown.

To test practical identifiability, we simulate the SIR model and generate a discrete time series for the cumulative number of cases of the simulated outbreak. This time series is used as a reference and represents the unperturbed (or true) data. We then add noise to the associated discrete incidence data (defined as the difference of cumulative values between consecutive time points), following the procedure described in [11] , with the additional realistic constraint that the final number of cases is the same for all noisy realizations of the same outbreak. Two examples of simulated outbreaks generated in this fashion are shown in Figure 2 for different values of R 0 (associated with different orders of magnitude of the incidence I) and different discretization time steps δ t . For R 0 = 2, we use δ t = 1. For R 0 = 10.5, we use δ t = 2. As shown in Figure 2 , this latter choice leads to a discretized ICC curve with very few points in regions of large incidence, and simulates a situation where incidence data is of poorer frequency. Each time series, consisting of noisy incidence data, is used to calculate an associated time series for the corresponding cumulative data. The resulting set of points is plotted in the (C, I) plane, and defines an empirical ICC curve. In both panels of Figure 2 , C and I are estimated from discrete incidence data as described in Equation (5) below.

3.1. Fitting ICC curves to data. Given a simulated time series for the number of cases of an outbreak in a population of size N , we find the parameters of the ICC curve that best fits the data by least square minimization. This approach is different from parameter identification methods that have been proposed in the literature until now. Indeed, traditional methods aim to find the best combination of model parameters consistent with observed time series; they therefore require numerical integration of the compartmental model whose parameters need to be identified, in order to estimate and then minimize an associated cost function (see for instance [9] [10] [11] ). Instead, the approach we put forward here consists in finding model parameters by directly fitting the relevant ICC curve to the data. We claim that this is a more robust approach because the functional that needs to be minimized has a unique extremizer in parameter space, which can be calculated exactly from the data points.

Given a time series of discrete incidences {I k , k = 0, · · · , M }, or of discrete cumulative cases {C k = k j=0 I j , k = 0, · · · , M } recorded every δ t units of time, we introduce a cost function that measures how much the data points depart from the ICC curve of the SIR model parametrized by β, γ, and p = ln(κ). This function, E e , is defined by

A Maple [13] calculation shows that the cost function E e has a unique critical point, whose expression is given in Appendix A. For outbreaks that start in a naive population, the formulas are hugely simplified by setting p = 0. The resulting cost functional then has a unique minimum, whose value may easily be calculated. With the following notation,

we have (after setting p = 0),

Setting the right-hand sides to zero and solving the resulting linear system for β and γ leads to the following expression for the unique global minimizer of E e when p = 0:

In what follows, we use the expressions given in Appendix A, which can easily be coded up. We checked that for time series generated from an SIR model with κ 1, the simplified expressions above provide good estimates of the actual values of β and γ that were used to generate the reference outbreak. Importantly, we now have a means of associating a single set of values for the parameters β, γ, and p = ln(κ) to each outbreak time series.

Parameter identification in the presence of noise. We now use the expressions found in Section 3.1 to assess the robustness of the proposed parameter identification method to noise. As discussed above, the SIR model will be declared practically identifiable provided good estimates of the parameters may be obtained from noisy data. We therefore simulate 100,000 noisy time series from a "reference" numerical integration of the SIR model, use the formulas of Appendix A to calculate p c , β c , and γ c for each time series, and plot histograms of the estimated values of β and γ. We consider the two types of noise discussed above, both affecting the reported discrete incidence Figure 3 shows the distributions (in the form of properly scaled histograms) of β c , γ c , and R 0 = βc γc obtained from 100,000 noisy realization from a reference simulation of an SIR model with parameters β = 0.5 and γ = 0.25 (R 0 = 2). Figure 4 shows similar plots for β = 0.5 and γ 0.0476, so that R 0 = 10.5. In each case, the size of the total population is N = 10000 individuals. The corresponding sample means and standard deviations are displayed in Table 1 . These results indicate that the respective means of β c and γ c provide very good estimates of the parameters β and γ. Accuracy is not as good for the larger value of R 0 , whose distribution shows a fairly long tail. This is to be expected because the actual value of γ in this case is close to zero. Moreover, the number of data points away from the end points of the outbreak is much smaller for R 0 = 10.5 than for R 0 = 2 (see Figure 2 ), leading to a potential loss of information. Table  1 . Table 1. 3.2.2. Case B: Normally-distributed noise. At each time point, we now multiply the reference incidence by 1 + s, where s is standard normal and the amplitude takes on one of 3 values: 0.05, 0.15, or 0.25. This type of noise is less realistic for incidence reports, whose variability is expected to be well described by a Poisson distribution, as in Case A above. We however perform the present tests to illustrate the robustness of the parameter estimation method introduced in this article. Table 2 . Empirical mean (µ) and standard deviation (σ) of estimated parameter values in the presence of normally-distributed noise of mean zero and amplitude , for two values of R 0 .

For R 0 = 2, the parameter estimation procedure is robust at all values of considered. Means and standard deviations of β c , γ c , and R 0 = β c /γ c are displayed in Table 2 . The corresponding distributions are shown in the top row of Figure 5 . As expected, estimates of β and γ are not as accurate for simulations with R 0 = 10.5, but return mean values for β and γ that are close to the exact values. The distributions of β c /γ c however shows very long tails (see bottom row of Figure  5 ), due to values of γ c close to zero. In such a case, estimating R 0 as the ratio < β c > / < γ c > however gives acceptable values, equal to 9.96, 9.19, and 7.53, for equal to 0.05, 0.15, and 0.25 respectively. Table 2 .

In summary, the above numerical simulations indicate that the ICC-based parameter estimation method presented here provides good results in the presence of reporting noise. Not surprisingly, the quality of the estimates depends on the rate at which incidence is sampled and, in the case of normally distributed noise, on the amplitude of the noise.

In principle, the formulas for β c and γ c given in Appendix A may be used with any number of points M , as long as the data points represent the entire ICC curve. Is is natural to ask how the method performs when applied to a developing outbreak. To test this, we simulate 50,000 possible outbreaks using the same reference simulations as before, and with the two different types of noise introduced above to represent reporting error. We then follow the evolution of the distributions of the estimated parameters as time increases. For R 0 = 2, the outbreak has run its course after 60 units of time, and we include up to M max = 40 data points, with consecutive values separated by 1 unit of time (which could correspond to a day or a week depending on the disease). For R 0 = 10.5, we pick M max = 80, with data points separated by 2 units of time. Figure 6 shows estimated values of β, γ, and R 0 = 2 for the two types of reporting noise. In both cases, estimates of β and γ are close to their actual values near t = 20, which is before the peak of the outbreak (see left panel of Figure 1 ). The value of R 0 converges more slowly (circles in the bottom row of Figure 6 ) but the ratio of the estimated values of β and γ (stars) is close to the actual value of R 0 near t = 12, fairly early in the outbreak.

Similar results for R 0 = 10.5, together with plots showing the evolution of the distributions of β c , γ c , and R 0 , are provided as supplementary material. Parameter estimates for β and γ are close to their true values near t = 22, which is near the peak of the outbreak, with the ratio < β > / < γ > giving a reasonable estimate of R 0 starting at t = 18, even in the presence of large uncertainty.

So far, we have assumed the size of the susceptible population, N , to be known. In practice, the value of N that should be used to fit the ICC curve to outbreak data is often unknown since it may reflect under-reporting (as discussed in Section 2.2) or spatial effects (existence of localized clusters to which the spread is restricted) associated with the outbreak. Moreover, it could be argued that the SIR model is too simplistic to represent real-life outbreaks. The examples of [1] and those discussed below however show there is merit to the present approach.

In what follows, we fit ICC curves to surveillance reports by picking a range of values of N for which the root mean square error (RMSE) between the ICC curve and the data is within 2% of the minimum RMSE value. The data points used in the evaluation of the RMSE are found from the reported cumulative data after smoothing and interpolation. This latter procedure estimates missing incidence values and removes negative incidence reports, if any. The resulting time series is expected to be a better approximation of the "true" (i.e. as close as possible to the output of a compartmental model) evolution of the cumulative data, and is used for the parameter estimation procedure discussed in this article. Our first example is for a gastroenteritis outbreak in a nursing home in Majorca, Spain, and is therefore spatially localized. The other examples are estimations of the basic reproductive number for COVID-19 outbreaks in Hubei Province (China), the Republic of Korea, and France. 5.1. Gastroenteritis outbreak in Majorca, Spain. We use the same data set as in [1] , estimated from the epidemiological curve provided in [14] . We fit the ICC curve to the interpolated data (i.e. find β c (N ), γ c (N ), and p c (N )) for a range of values of N , identify the value N m of N that best fits the data (i.e. for which the RMSE is minimum), and select a range of values of N near N m that give a RMSE within 2% of its minimum. Then, for each value of N in the selected range, we apply the parameter estimation procedure discussed in this article, with 10,000 oubreak realizations, obtained from adding Poisson-distributed noise to the smoothed and interpolated data. Figure 7 shows the resulting parameter distributions (top panel; N ∈ [96, 147]) and a plot of the ICC curve for the average parameter values (bottom panel; N = 121, β = 1.37, and γ = 0.91). The value of R 0 for the ICC curve displayed is 1.52 (whereas the average value of R 0 is 1.55), and corresponds to an attack rate of 59.4%. This is near the upper boundary of the 95% confidence interval (38.5 % -61.3 %) for the overall attack rate, but well within the 95% confidence interval of the attack rate among nursing home residents (42.1 % -72.2 %) reported in [14] . Figure 8 shows ICC curves for COVID-19 outbreaks in a few countries. The data was obtained from the World Health Organization daily situation reports [15] as well as from the Wuhan Health Municipal Commission [16] . For Hubei Province, only laboratory confirmed cases were used. Consequently, daily increment values for 2/17-19/2020 (days 65 to 67 in the data set, when China combined laboratory and clinically confirmed cases in its reports for a brief period of time) were set to half of the reported number of confirmed cases; the number of cumulative cases was then obtained by summing daily reports of laboratory confirmed cases. For the Republic of Korea, only the first 58 points in the data set were used, since they correspond to the first wave of the outbreak. The third example is for a country (France), where the outbreak appears to have passed its peak (as of 4/15/2020), but is not over. In all of these examples, the figure shows the ICC curve for the set of parameters that best fits the interpolated data. The optimal values of N are N = 51160 for Hubei Province, N = 10282 for the Republic of Korea, and N = 161138 for France. The corresponding optimal values of the basic reproductive number are R 0 = 2.74 for Hubei Province, R 0 = 2.13 for the Republic of Korea, and R 0 = 1.96 for France. A histogram of estimated values of R 0 , scaled to represent a probability distribution function, is shown in the right panel of each row of Figure 8 . These were obtained as described above in the case of the outbreak in Majorca, except that no Poisson-distributed noise was added to the data. This is because incidence values are generally large and that, as discussed in Section 3, the relative effect of the added noise is minimal, leading to a level of variability of R 0 that is insignificant compared to the effect of changing N . Values of R 0 in the 2-3 range are consistent with early estimates of the basic reproductive number of COVID-19 published in the literature [17, 18] . The time course of the COVID-19 outbreak in Hubei Province, as described by the SIR model with the optimal parameters identified by the present method (β = 0.401, γ = 0.146 (R 0 = 2.74), and N = 51160) is shown in Figure 9 , together with the reported data. The solid curves are obtained by integration of the ICC curve, with initial conditions corresponding to day 10 of the outbreak, and alignment with the data at day 45. Similar plots for the other outbreaks discussed in this section are provided as supplementary material. The good visual agreement between model and data indicates that the SIR model, and therefore of the ICC curve approach presented in this manuscript, are able to capture the overal dynamics of real-life outbreaks.

This article introduces a parameter estimation method for disease outbreaks that bypasses the numerical integration of epidemiological models. The approach relies on the use of ICC curves, which relate incidence to the cumulative number of cases C. They may be viewed as nonlinear transformations of the traditional epidemiological curves, in which the time variable is replaced by C, a monotonically increasing but nonlinear function of time. For each single-wave outbreak, the ICC curve has a simple concave-down shape that crosses the horizontal axis at the origin and at the final value of C.

The formulas presented in Section 2, which extend the parabolic approximation suggested in [1] , are exact for the SIR model. The numerical experiments of Sections 3 and 4 show the method is robust to noise and may be used for parameter estimation as an outbreak unfolds, with the understanding that reasonably accurate estimates can only be reached shortly before, or after, incidence peaks. This is not a shortcoming of the present approach and was also observed in [9] when fitting synthetic prevalence time series.

Because of the simplicity of Equation 1, and the existence of a unique set of parameters that best fits any collection of data points, parameter distributions may be generated directly from epidemiological data. In the case of the SIR model, this methodology presents a fast and novel alternative to more traditional parameter estimation strategies, such as particle Makov Chain Monte Carlo methods (PMCMC; see [19] for a review). This may be beneficial in pandemic situations where epidemiological estimates need to be updated daily and in many locations simultaneously. For instance, recent work on COVID-19 data from Wuhan suggests that an SIR model can better capture the information contained in case reports than an SEIR model [20] . For more complex compartmental models, distributions generated by the present approach may be used as priors for PMCMC estimations of the contact and recovery rates of a disease.

When applying the proposed method to surveillance data, an estimate of N may not always be available. The examples of Section 5 suggest that this parameter may be identified by optimizing the fit between the ICC curve and a smoothed and interpolated version of the reported data. However, any variability in the selected value of N will be associated with variability in estimates of the basic reproduction number R 0 . Indeed, for an outbreak that has completed its course, any good fit of the reported data by an ICC curve will produce consistent values of C ∞ , the final number of cases for the outbreak (see for instance the plots of Figure 8 ). Because of Equation 3, if C ∞ is known, the estimated value of R 0 is only a function of N . This uncertainty is inherent to the SIR model and cannot be avoided. The existence of a value N m of N that best fits the data is encouraging, but more robust estimates are likely to be obtained if additional information is available. In the absence thereof, a range of values of N close to the optimal value N m should be used to produce credible intervals for the basic reproductive number R 0 .

As previously mentioned, empirical (see [1] ) and numerical observations by the author suggest the concave-down shape of the ICC curve is ubiquitous in outbreak data and in compartmental epidemiological models. It would therefore be interesting to explore methods which, like the discussion of Section 2 of this article, lead to exact formulations of ICC curves. In particular, knowledge of how model parameters affect the shape of ICC curves could provide simple means to visualize the consequences of mitigation effects. Combined with data assimilation, ICC curves may thus present a convenient paradigm for forecasting the course of an outbreak.

,

and we have used the notation

We show below how the distributions of β c , γ c , and the basic reproductive number R 0 = βc γc defined in the main body of the article evolve as functions of time during the course of an outbreak. We consider two examples, with values of R 0 equal to 2 and 10.5 respectively. For the former, Figure  S1 shows how histograms (scaled to approximate distribution functions) of parameter estimates in the presence of Poisson-distributed noise narrow and center on the actual parameter values as the number of available data points increases. Figure S2 shows similar results for normally-distributed noise of amplitude 0.15. The corresponding time-dependence of S, I, and R is displayed in the left panel of Figure 1 , and the temporal evolution of the means and standard deviations of the parameter estimates are shown in Figure 6 , both in the main article. For R 0 = 10.5, the time course of S, I, and R (in the absence of noise) together with the corresponding ICC curve, are shown in Figure S3 . The following pages show the temporal evolution of the histograms of β c , γ c , and β c /γ c for Poisson-distributed noise ( Figure S4 ) and normallydistributed noise of amplitude 0.05 ( Figure S5 ) and 0.15 ( Figure S6 ). Finally, Figure S7 and Figure  S8 show the temporal evolution of the parameter distribution means, together with error bars of one standard deviation about the mean. Estimates of β, γ and R 0 as the outbreak unfolds for Poissondistributed noise (left) and normally-distributed noise of size 0.05 (right), and with R 0 = 10.5. Estimates of R 0 show the evolution of < β/γ > (circles) and < β > / < γ > (stars), where < · > indicates sample mean. Error bars correspond to one standard deviation on each side of the mean. Figure S8 . Estimates of β, γ and R 0 as the outbreak unfolds for normallydistributed noise of size 0.15, and with R 0 = 10.5. Estimates of R 0 show the evolution of < β/γ > (circles) and < β > / < γ > (stars), where < · > indicates sample mean. Error bars correspond to one standard deviation on each side of the mean.

Figures S1, S2, and S3 compare the time course of the gastroenteritis outbreak in Majorca Spain (Reference [14] of the main article) and of the COVID-19 outbreaks in the Republic of Korea and France, to the predictions of the ICC curves discussed in the main article. Figure S1 . Cumulative cases (left) and incidence (right) as functions of time for the gastroenteritis outbreak in a nursing home in Majorca, Spain, described in Reference [14] of the main article. The solid curves represent the predictions of the ICC curve with averaged parameter values β = 1.375, γ = 0.906 (R 0 = 1.518), and N = 121, and the open circles are the data points. 

Data-driven outbreak forecasting with a simple nonlinear growth model

The Mathematics of Infectious Diseases

Reproduction numbers and sub-threshold endemic equilibria for compartmental models of disease transmission

The construction of next-generation matrices for compartmental epidemic models

Springer Undergraduate Mathematics Series

On Identifiability of Nonlinear ODE Models and Applications in Viral Dynamics

The structural identifiability of the susceptible infected recovered model with seasonal forcing

Identifiability and estimation of multiple transmission pathways in cholera and waterborne diseases

Structural and practical identifiability analysis of outbreak models

Assessing parameter identifiability in compartmental dynamic models using a computational approach: Application to infectious disease transmission models

Fitting dynamic models to epidemic outbreaks with quantified uncertainty: A primer for parameter uncertainty, identifiability, and forecasts, Infectious Disease Modelling 2

Avoidable errors in the modelling of outbreaks of emerging pathogens, with special reference to Ebola

Maplesoft, a division of

Cohort study of an outbreak of viral gastroenteritis in a nursing home for elderly

WHO Coronavirus disease (COVID-2019) situation reports

The page that reported COVID-19 data

Nowcasting and forecasting the potential domestic and international spread of the 2019-nCoV outbreak originating in Wuhan, China: a modelling study

See also the MIDAS Network dashboard, which lists estimates of R0 in a range of preprints

Introduction to particle Markov-chain Monte Carlo for disease dynamics modellers

Why is it difficult to accurately predict the COVID-19 epidemic?

The function E e defined in Equation (4) has a unique extremizer (β c , γ c , p c ) given by the following expressions: