key: cord-340375-lhv83zac
title: A Bayesian Logistic Growth Model for the Spread of COVID-19 in New York
authors: Bliznashki, Svetoslav
date: 2020-04-07
DOI: 10.1101/2020.04.05.20054577
doc_id: 340375
cord_uid: lhv83zac

We use Bayesian estimation for the logistic growth model in order to estimate the spread of the coronavirus epidemic in the state of New York. Models weighting all data points equally, as well as models with a normal error structure, prove inadequate to model the process accurately. On the other hand, a model with larger weights for more recent data points and with t-distributed errors seems reasonably capable of making at least short-term predictions.

1. Introduction. The logistic growth model is frequently used in order to model the spread of viral diseases, and of COVID-19 in particular (e.g. Batista, 2020; Wu et al., 2020). The differential equation is given in (1):

dC/dt = r*C*(1 - C/K)   (1)

where C is the cumulative number of infected individuals, r is the infection rate, and K is the upper asymptote (i.e. the upper limit of the number of individuals infected during the epidemic). Unlike other models, such as SIR, Eq. 1 has an explicit analytical solution:

C(t) = K/(1 + A*exp(-r*t))   (2)

where A = (K - C0)/C0 and C0 is the initial number of infectees. The parameters of Eq. 2 can easily be estimated via Least Squares (LS), but in this note we use a Bayesian approach, which allows us to make use of explicit posterior distributions in order to make probabilistic predictions. We apply the above model to the state of New York, which represents a relatively geographically homogeneous population with sufficient data points to build a reliable model that is not affected by the different trends present in different regions.

2. Simulation 1. We begin with a simple model which estimates the parameters of Eq. 2 under the assumption of normally distributed homoscedastic errors. We use the data for 28 consecutive days of the epidemic, beginning with the 4th of March (11 infectees) and ending with the 31st of March (75832 infectees). Prior to estimation we standardized our data by dividing all data points by 70000 in order to avoid numerical problems; after the posteriors were obtained, we back-transformed the results to their original scale. We assumed that the errors are normally distributed with mean equal to 0 and standard deviation (σ) estimated by the model. We used the blockwise random walk Metropolis algorithm in order to sample from the joint posterior distribution of the four parameters of the model (K, A, r, and σ); a code sketch of such a sampler is given at the end of this section. The proposal distribution was multivariate normal with a scaled variance-covariance matrix estimated on the basis of pilot runs. Uninformative improper uniform priors ranging from 0 to +∞ were employed for all parameters in the model. A pilot chain showed an acceptance rate within the optimal range of 23% (e.g. Chib & Greenberg, 1995).

The LS estimates (e.g. r = 0.34, 95% CI = [0.31; 0.37]) are very similar to the posterior EAPs reported above, which is to be expected given the uninformative nature of our priors. Still, the Bayesian analysis gives slightly wider intervals for the estimates which, as we will see below, is a positive, since the point estimates themselves prove too conservative. This is a commonly observed situation for phenomenological (i.e. purely data-driven) models of this type.
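For concreteness, the following is a minimal MATLAB sketch of such a blockwise random walk Metropolis sampler. It is our illustration rather than the code actually used in the paper; the names y (scaled cumulative counts), t (day indices), S (scaled proposal covariance from the pilot runs), and nit (number of iterations) are assumptions.

% Minimal sketch (not the authors' exact code) of a blockwise random walk
% Metropolis sampler for (K, A, r, sigma) under the normal-error model.
% Assumed inputs: y (scaled counts), t (day indices), S (proposal covariance).
npdf  = @(x, m, s) exp(-0.5*((x - m)./s).^2)./(s*sqrt(2*pi));  % normal density
llik  = @(p) sum(log(npdf(y, p(1)./(1 + p(2)*exp(-p(3)*t)), p(4))));
cur   = [1.2; 7000; 0.3; 0.05];          % illustrative starting values: K, A, r, sigma
Lcur  = llik(cur);
Lchol = chol(S, 'lower');                % turns iid normals into correlated proposals
for it = 1:nit
    prop = cur + Lchol*randn(4, 1);      % blockwise multivariate normal proposal
    if all(prop > 0)                     % improper uniform priors on (0, +Inf)
        Lprop = llik(prop);
        if log(rand) < Lprop - Lcur      % symmetric proposal: likelihood ratio suffices
            cur = prop; Lcur = Lprop;
        end
    end
    % (store cur here, with thinning, to build the posterior sample)
end

Because the priors are improper uniform on (0, +∞), the posterior is proportional to the likelihood, so the acceptance ratio reduces to a likelihood ratio whenever the proposal stays in the support.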
At the time of writing this note, information is available for the number of infectees for three days after the 28 days used to fit the model. We used the posterior estimates in order to predict the number of future infectees. More precisely, we used the posterior samples (including σ) to simulate data for future values of t, thereby constructing what are known as posterior predictive distributions (a code sketch of this construction is given at the end of this section). For example, for a given future day (e.g. the 28th day) we took every posterior sample of the Eq. 2 parameters and plugged in the value t = 28 to obtain a mean prediction; we then added a random number generated from N(0, σ), where σ is sampled from the posterior alongside the other parameters at the given step. The resulting predictive distribution has an observed mean, variance, etc., and can be used to make point and/or interval predictions (HDIs) as usual. Some results are shown in Table 2.

We see that the predictive distributions fail to capture even the immediate true value, which once again suggests that the model is indeed inadequate and fails to capture the true trends in the data. Note, however, that the ranges of the prediction intervals increase for later data points, which is a desirable quality of a model and is intrinsic to the Bayesian approach employed here. As Fig. 2 (upper right portion) suggests, the model converges too quickly to its upper asymptote, and hence its predictions are too low and probably too narrow. This is not surprising, given that the simple logistic model is known to be applicable only during specific stages of an outbreak and/or when enough data are available (see Wu et al., 2020 for a review). Possible solutions include: improving the model (e.g. Richards, 1959) by adding parameters which can better account for the deviation of the observed data points from the symmetric S-shaped curve implied by the logistic growth model; adjusting the prior distributions so as to reflect our expectations of a much higher upper asymptote (K); or switching to a different, preferably more mechanistic, approach altogether. Instead, we attempted to construct a more accurate model within the same logistic growth paradigm in a different way: we introduced weights for the data points, with later data points receiving higher weights than earlier ones, in the hope that this would alleviate some of the problems observed above. Specifically, we weighted the points according to a Rectified Linear-like function (e.g. Glorot et al., 2011) whereby the first 20 observations received constant low weights (0.008) and the last 8 observations received linearly increasing higher weights (last 8 weights = [0.77 1.55 2.32 3.09 3.87 4.64 5.41 6.19]). The idea behind this scheme was to force the model to better account for the observations following the approximately linear trend observed in the upper half of Fig. 2. Note also that the weights sum to the number of original observations (28). The weight pattern is shown in Fig. 3. In the subsequent simulations we used the proposed weights to weight the likelihood function of the model.
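Before turning to the weighted models, here is the promised sketch of how the posterior predictive distributions described above can be simulated; it is an illustration under the assumption that a matrix PAR holds the retained posterior draws (columns: K, A, r, σ, on the standardized scale).

% Minimal sketch of a posterior predictive distribution for a future day tNew,
% assuming PAR holds the retained posterior draws (columns: K, A, r, sigma).
tNew = 28;
mu   = PAR(:,1)./(1 + PAR(:,2).*exp(-PAR(:,3)*tNew));   % Eq. 2 mean, one per draw
pred = 70000*(mu + PAR(:,4).*randn(size(mu)));          % add N(0, sigma) noise; rescale
s    = sort(pred); n = numel(s);
pointPrediction = mean(pred);                           % point prediction
interval95 = [s(max(1, round(0.025*n))), s(round(0.975*n))];  % equal-tailed 95% interval

The sketch reports an equal-tailed interval for simplicity; an HDI, as used in the paper, can be obtained from the same sorted draws by scanning for the shortest interval containing 95% of them.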
Following Simeckova (2005), assuming we have observations Y1, ..., Yn, where Yi has density fi(Yi|θ) and θ is the vector of parameters (see Eq. 2), we apply a weight vector w = [w1, ..., wn]. If we let li(θ) = log(fi(Yi|θ)), the weighted log-likelihood function of our model becomes:

lw(θ) = Σi wi*li(θ), i = 1, ..., n   (3)

3. Simulation 2. We used the above weighting scheme and repeated the previous simulation. In that sense we altered the likelihood function while leaving the prior distributions intact. Everything else (including simulation details such as the number of posterior draws, thinning, etc.) was the same as reported in Section 2. Again, we observed good convergence for all parameters (see Fig. 4, which depicts a traceplot and a histogram for the K parameter). Table 3 gives a summary of the posterior estimates. We see that our weighting scheme appears to give more reasonable results and that the estimates for the upper asymptote have increased. Plotting the fitted equation against the observed data (Fig. 5), it is clear that the fitted curve is much more strongly affected by the later points, and consequently the upper asymptote is higher than before (compare also Tables 3 and 1).

Table 3. Means, medians, standard deviations (SD), and 95% HDIs for the parameters of Eq. 2 obtained from the posterior of our second (weighted) model.

We see that this time the model accurately predicts two consecutive data points and fails to predict the third. This is still not a satisfactory performance, however, and hints at the possibility that the actual process rises more steeply than the model suggests. Likewise, it appears that the HDIs are not wide enough to accommodate the actual uncertainty.

4. Simulation 3. Looking at Figs. 2 and 5, we see that the errors, in all likelihood, both lack homoscedasticity and possess an auto-correlated structure. In order to (partially) alleviate these problems, we removed the normality assumption present above and replaced it with the assumption that the errors follow a t-distribution with location parameter equal to 0, and with scale (similar to the standard deviation used above) and degrees of freedom (df) parameters estimated from the data (see Kruschke, 2012 for the same approach in the context of a linear model). We used the same weighting scheme as above and introduced the two new parameters (scale and df) describing the t-distribution governing the model's errors. We again proposed the first four parameters of the model (i.e. K, A, r, scale) via a multivariate normal distribution centered on the previous values of the chain, with a scaled variance-covariance matrix estimated from pilot chains; the df parameter was proposed separately based on a lognormal distribution (see footnote 4) and was transformed back to the original scale after the end of the simulation. 30 million samples from the posterior were obtained, with every 300th step retained (i.e. we had a thinning parameter of 300).
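As an illustration, the two ingredients just described, the weighted t log-likelihood of Eq. 3 and the separate lognormal-proposal step for df, might look as follows in MATLAB. This is a sketch, not the authors' code: gentdst is the t-density function listed in Appendix B, while logpost (a handle for the conditional log-posterior of df) and the proposal scale 0.2 are our assumptions.

% Weighted log-likelihood (Eq. 3) under t-distributed errors; wei is the
% ReLU-like weight vector from Fig. 3, renormalized to sum to n = 28.
wei = [0.008*ones(20, 1); [0.77 1.55 2.32 3.09 3.87 4.64 5.41 6.19]'];
wei = wei/sum(wei)*28;
mu  = K./(1 + A*exp(-r*t));                          % logistic mean curve, Eq. 2
lw  = sum(wei .* log(gentdst(y, mu, scale, df)));    % weighted log-likelihood

% Separate Metropolis-Hastings step for df with a lognormal random-walk
% proposal (cf. footnote 4); logpost is an assumed handle returning the
% weighted log-posterior of df with the remaining parameters held fixed.
dfProp = curdf*exp(0.2*randn);                       % lognormal proposal around curdf
logQ   = log(dfProp) - log(curdf);                   % Hastings correction for asymmetry
if log(rand) < logpost(dfProp) - logpost(curdf) + logQ
    curdf = dfProp;                                  % accept proposed df
end

Since the lognormal proposal is asymmetric, the proposal density ratio q(curdf|dfProp)/q(dfProp|curdf) reduces to dfProp/curdf, which is the logQ term above.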
We used improper uniform priors for all parameters except the df parameter, for which a shifted exponential with mean equal to 30 was specified, as suggested by Kruschke (2012). The results indicated good convergence (a traceplot for the K parameter is shown in Fig. 6 below; histograms from the posterior for all parameters are shown in Appendix A).

Footnote 4: The lognormal distribution is not symmetric, and hence we use the actual Metropolis-Hastings acceptance probability (e.g. Chib & Greenberg, 1995) during the step sampling from the df posterior.

Table 5. Means, medians, standard deviations (SD), and 95% HDIs for the parameters of Eq. 2 obtained from the posterior of our last model.

Consistent with our expectations, the 95% HDI for the df parameter suggests a noticeable deviation from normality. Fig. 7 shows the predicted trend based on the EAP estimates shown in Table 5. Finally, Table 6 gives the predictive distributions for the next 7 days and the estimate of the final cumulative number of infectees. We see that this model accurately predicts at least three future data points. In the next several days we should be able to observe how the model deals with data points further away in time.

As can be seen in Appendix A, the posteriors for the different parameters no longer resemble parametric distributions (for the previous two models, the posteriors for K, A, and r clearly resemble normal/t-distributions, while the σ parameter is pronouncedly positively skewed and thus resembles a gamma distribution). Nevertheless, this model appears to be the best suited to the modeled phenomenon.

5. Discussion. It appears that a logistic growth model with a weighted likelihood function and a t-distribution imposed on the error structure is able to make accurate short-term predictions of the spread of a disease. Bayesian estimation yields estimates comparable to those of traditional Least Squares and Maximum Likelihood approaches, but with more accurate interval estimates. Moreover, the Bayesian posteriors (including the predictive distributions) have a straightforward probabilistic interpretation, which cannot be said of traditional frequentist Confidence Intervals. As a rule, the posterior distributions show high correlations between the parameters, which makes algorithms like blockwise Metropolis-Hastings more effective in general than algorithms which explore a single conditional posterior at a time, such as the Gibbs sampler and componentwise Metropolis-Hastings. The fact that some conditional posteriors lack closed-form solutions is another impediment for the Gibbs sampler, but not necessarily for componentwise Metropolis-Hastings, as demonstrated in Section 4. The weighting scheme employed here proves beneficial relative to modeling the raw data by forcing the model to pay more attention to more recent observations. Other weighting schemes are certainly possible, and investigating the properties of different approaches seems a potentially fruitful future enterprise.
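For example, a geometrically decaying scheme, a hypothetical alternative sketched below rather than one examined in this note, would spread the emphasis more smoothly across the recent past:

% Hypothetical alternative weighting scheme (not examined in this note):
% geometric decay with the age of the observation, renormalized so the
% weights again sum to n, as required by the scheme used with Eq. 3.
n = 28; lambda = 0.85;               % lambda closer to 1 gives flatter weights
wei = lambda.^((n-1):-1:0)';         % the most recent observation gets weight 1
wei = wei/sum(wei)*n;                % renormalize to sum to n, as in Simulation 2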
As a whole, it appears that the combination of Bayesian estimation, differentially weighting the observations, and employing a more robust approach towards modeling the errors (i.e. assuming a t-distribution with scale and df parameters estimated from the data) results in more reliable HDIs and prediction intervals than more traditional approaches. Far from being perfect, the proposed model appears to be somewhat useful. Of course, such a model can be continuously augmented by including new data points and applying the same or a similar weighting procedure. Presumably, continuously adjusting the model by adding new observations as they become available would improve its accuracy. That being said, our simulations suggest that we should be somewhat skeptical towards logistic growth models applied to the raw data describing an outbreak, especially when the number of available data points is relatively small and the upper asymptote appears not to have been reached yet.

Appendix A. Histograms of the posterior distributions for the five parameters (K, A, r, scale, and df) of our last model.

Appendix B. Excerpts from the MATLAB code for the last model (only the recoverable fragments are reproduced; the columns of PAR correspond to the posteriors for K, A, r, scale, and df respectively).

nit = 30000000;                                          % number of MCMC iterations
wei = wei/sum(wei)*28;                                   % weights renormalized to sum to n = 28

% Blockwise step: reject (K, A, r, scale) proposals outside the prior support;
if prop(1)<=0 || prop(2)<=0 || prop(3)<=0 || prop(4)<=0
    alp = 0;
else
    Pred = prop(1)./(1 + prop(2)*exp(-prop(3)*t));       % logistic mean curve, Eq. 2
end

% Componentwise step for df (propdf and curdf live on the log scale):
if propdf < 0
    alp = -99999;                                        % effectively reject
else
    Pred = cur(1)./(1 + cur(2)*exp(-cur(3)*t));          % logistic mean curve, Eq. 2
    Ln = gentdst(y, Pred, cur(4), exp(propdf));          % t-density of the observations
    Ln = sum(wei.*log(Ln));                              % weighted log-likelihood, Eq. 3
    Ln = Ln + log((1/29)*exp(-(1/29)*(exp(propdf)-1)));  % shifted exponential prior on df (mean 30)
    alp = (Ln - Lp) + (log(propdf) - log(curdf));        % accounting for the log-normal proposal
end

function y = gentdst(x, m, s, v)
% x - data points, m - location, s - scale, v - degrees of freedom;
c = (1/sqrt(v))*(1/(beta(v/2, 0.5)));
y = (c/s)*(1 + ((x - m).^2)/(v*(s^2))).^(-0.5*(v + 1));

References

Batista, M. (2020). Estimation of the Final Size of the Coronavirus Epidemic by the Logistic Model. medRxiv.
Chib, S., & Greenberg, E. (1995). Understanding the Metropolis-Hastings Algorithm. The American Statistician, 49(4), 327-335.
Glorot, X., Bordes, A., & Bengio, Y. (2011). Deep Sparse Rectifier Neural Networks. Proceedings of the 14th International Conference on Artificial Intelligence and Statistics.
Kruschke, J. K. (2012). Bayesian Estimation Supersedes the t Test.
Richards, F. J. (1959). A Flexible Growth Function for Empirical Use. Journal of Experimental Botany, 10(2), 290-301.
Simeckova, M. (2005). Maximum Weighted Likelihood Estimation in Logistic Regression.
Wu, K., Darcet, D., Wang, Q., & Sornette, D. (2020). Generalized Logistic Growth Modeling of the COVID-19 Outbreak in 29 Provinces in China.