key: cord-132307-bkkzg6h1 authors: Blanco, Natalia; Stafford, Kristen; Lavoie, Marie-Claude; Brandenburg, Axel; Gorna, Maria W.; Health, Matthew Merski Center for International; Education,; Biosecurity,; Medicine, Institute of Human Virology -University of Maryland School of; Baltimore,; USA, Maryland; Epidemiology, Department of; Health, Public; Medicine, University of Maryland School of; Nordita,; Technology, KTH Royal Institute of; University, Stockholm; Stockholm,; Sweden,; Biological,; Centre, Chemical Research; Chemistry, Department of; Warsaw, University of; Warsaw,; Poland, title: Prospective Prediction of Future SARS-CoV-2 Infections Using Empirical Data on a National Level to Gauge Response Effectiveness date: 2020-07-06 journal: nan DOI: nan sha: doc_id: 132307 cord_uid: bkkzg6h1 Predicting an accurate expected number of future COVID-19 cases is essential to properly evaluate the effectiveness of any treatment or preventive measure. This study aimed to identify the most appropriate mathematical model to prospectively predict the expected number of cases without any intervention. The total number of cases for the COVID-19 epidemic in 28 countries was analyzed and fitted to several simple rate models including the logistic, Gompertz, quadratic, simple square, and simple exponential growth models. The resulting model parameters were used to extrapolate predictions for more recent data. While the Gompertz growth models (mean R2 = 0.998) best fitted the current data, uncertainties in the eventual case limit made future predictions with logistic models prone to errors. Of the other models, the quadratic rate model (mean R2 = 0.992) fitted the current data best for 25 (89 %) countries as determined by R2 values. The simple square and quadratic models accurately predicted the number of future total cases 37 and 36 days in advance respectively, compared to only 15 days for the simple exponential model. 
The simple exponential model significantly overpredicted the total number of future cases while the quadratic and simple square models did not. These results demonstrated that accurate future predictions of the case load in a given country can be made significantly in advance without the need for complicated models of population behavior and generate a reliable assessment of the efficacy of current prescriptive measures against disease spread. On March 11, 2020 the World Health Organization (WHO) declared the novel coronavirus outbreak (SARS-CoV-2 causing COVID-19) as a pandemic 1 more than three months after the first cases of pneumonia were reported in Wuhan, China in December, 2019 1 . From Wuhan the virus rapidly spread globally, currently leading to ten million confirmed cases and half a million deaths around the world. Although coronaviruses have a wide range of hosts and cause disease in many animals, SARS-CoV-2 is the seventh named member of the Coronaviridae known to infect humans 2 . An infected individual will start presenting symptoms an average of 5 days after exposure 3 but approximately 42% of infected individuals remain asymptomatic 4, 5 . Furthermore, almost six out of 100 infected patients die globally due to COVID-19 6 . Currently, treatment and vaccine options for COVID-19 are limited 7 . There is currently no effective or approved vaccine for SARS-CoV-2 although a report from April 2020 noted 78 active vaccine projects, most of them at exploratory or pre-clinical stages 8 . As the virus is transmitted mainly from person to person, prevention measures include social distancing, self-isolation, hand washing, and use of masks. Strict measures of quarantine have been shown as the most effective mitigation measures, reducing up to 78% of expected cases compared to no intervention 9 . 
Nevertheless, to evaluate the actual effectiveness of any mitigation measure it is necessary to accurately predict the expected number of cases in the absence of intervention. While there has been some early concern about the ability of SARS-CoV-2 to spread at an apparently near-exponential rate 10 , real limitations in available resources (i.e. the susceptible population) will reduce the spread to a logistic growth rate 11 . Logistic growth produces a sigmoidal curve (Figure 1) where the total number of cases (N) asymptotically approaches the population carrying capacity (NM), which for viral epidemics is analogous to the fraction of the population that will be infected before "herd immunity" is achieved 12, 13 . This is represented in derivative form by the generalized logistic function (Equation 1), where α, β, and γ are mathematical parameters that define the shape of the curve, and r is the general rate term, analogous to the standard epidemiological parameter R0, the reproductive number, which is a measure of the infectivity of the virus itself 13, 14 . For a logistic curve where α = ½ and β = γ = 0, one gets quadratic growth 15 with N = (rt/2)², while for α = β = γ = 1, this equation can be rearranged to quadratic form (Equation 2) 11 . Traditionally the number of cases that will occur in an epidemic like COVID-19 is modeled with an SEIR model (Susceptible, Exposed, Infectious, Recovered/Removed), in which the total population is divided into four categories: susceptible (those who can be infected), exposed (those who are in the incubation period but not yet able to transmit the virus to others), infectious (those who are capable of spreading disease to the susceptible population), and recovered/removed (those who have finished the disease course and are not susceptible to re-infection or have died).
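For reference, the generalized logistic function and the two special cases described above can be written out explicitly. The block below is a reconstruction under an assumption: it uses the standard textbook form of the generalized logistic rate equation, chosen because it reproduces both limiting cases stated in the text. The exact notation of the original Equations 1 and 2 may differ.

```latex
\begin{align}
% Assumed general form (rate of change of the total number of cases N)
\frac{dN}{dt} &= r\,N^{\alpha}\left[1-\left(\frac{N}{N_M}\right)^{\beta}\right]^{\gamma} \\
% alpha = 1/2, beta = gamma = 0: the bracketed factor is taken as 1
\frac{dN}{dt} &= r\sqrt{N}
  \;\Longrightarrow\; \frac{d\sqrt{N}}{dt} = \frac{r}{2}
  \;\Longrightarrow\; N = \left(\frac{rt}{2}\right)^{2} \quad \text{for } N(0)=0 \\
% alpha = beta = gamma = 1: the rate is a quadratic polynomial in N
\frac{dN}{dt} &= rN\left(1-\frac{N}{N_M}\right) = rN - \frac{r}{N_M}\,N^{2}
\end{align}
```

In the second case the square root of N grows linearly in time, which is why plots of √N against t are a natural linearity check for quadratic growth.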
For a typical epidemic, the ability for infectious individuals to spread the disease is proportional to the fraction of the population in the susceptible category with "herd immunity" 12, 13 and extinction of the epidemic occurs once a limiting fraction of the population has entered into the Recovered/Removed category 13 . However, barriers to transmission, either natural 18 before knowing the actual outcome) are preferable to retrospective analysis in which effectiveness is gauged after the results of the prescriptive actions are known 24, 25 . This study aimed to evaluate if a simple model was able to correctly prospectively predict the total number of cases at a future date. We found that fitting the case data to a quadratic (parabolic) rate curve 15 for the early points in the epidemic curves (before the mitigation efforts began to have effects) was easy, efficient, and made good predictions for the number of cases at future dates despite significant national variation in the start of the infection, mitigation response, or economic condition. Data on the number of COVID-19 cases was downloaded from the European Centre for Disease Prevention and Control (ECDC) on June 1, 2020 26 . Countries that had reported the highest numbers of cases in mid-March 2020 (and Russia) were chosen as the focus of our analysis to minimize statistical error due to small numbers. The total number of cases for each country was calculated as a simple sum of that day plus all previous days. Days that were missing from the record were assigned values of zero. 
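The running-total calculation described above (a simple sum of each day plus all previous days, with missing days counted as zero) can be sketched as follows. This is a minimal illustration, not the authors' actual processing script; the function name `cumulative_cases` and the example counts are hypothetical.

```python
from datetime import date, timedelta
from itertools import accumulate


def cumulative_cases(daily_reports, first_day, last_day):
    """Return (days, totals), treating days missing from the record as zero."""
    n_days = (last_day - first_day).days + 1
    days = [first_day + timedelta(days=i) for i in range(n_days)]
    # A day absent from the record contributes zero new cases.
    daily = [daily_reports.get(d, 0) for d in days]
    # Running total: each entry is that day's count plus all previous days.
    totals = list(accumulate(daily))
    return days, totals


# Hypothetical daily counts; 3 March is missing from the record.
reports = {
    date(2020, 3, 1): 2,
    date(2020, 3, 2): 5,
    date(2020, 3, 4): 11,
}
days, totals = cumulative_cases(reports, date(2020, 3, 1), date(2020, 3, 4))
print(totals)
```

A missing day leaves the running total unchanged, so reporting gaps appear as flat steps in the cumulative curve.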
The early part of the curve was fit and statistical parameters were generated with the non-linear regression module of Prism 8 (GraphPad), using the program's standard centered second-order polynomial (quadratic), exponential growth, and Gompertz growth models as defined by Prism 8, and a user-defined simple square model (N = At² + C), where N is the total number of cases, A and C are the fitting constants, and t is the number of days from the beginning of the epidemic curve. The beginning of the curve (SI Table 1) was defined empirically as falling among the first days in which the number of cases began to increase regularly. Typically, this occurred when the country had reported fewer than 100 total cases. The early part of the curve was defined by manual examination, looking for changes in the curve shape, and later confirmed by R² values for the quadratic model. Prospective predictions were made by fitting the total number of COVID-19 cases for each day, starting with day 5, and then using the estimated model parameters to extrapolate the number of cases for the final day for which data was available (June 1, 2020) or for the last day before a significant decrease in the R² value for the quadratic fit. Fit parameters for the Gompertz growth model were not used to make predictions if the fit itself was ambiguous. Acceptable predictions were defined as being within a factor of two of the actual number (i.e. predictions within 50-200% of the actual total).

A simple exponential growth model is a poor fit for the SARS-CoV-2 pandemic: The total number of cases for each of 28 countries was plotted against time and several model equations were fit to the early part of the data, before mitigating effects from public health policies began to change the rate of disease spread. In total, 20 (71 %) countries showed mitigation of disease spread by June 1 (Figure 2).
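The prospective prediction procedure described in the methods can be sketched in a few lines. This is an illustration under stated assumptions, not the Prism 8 workflow used in the study: it uses an ordinary (uncentered) least-squares fit solved through the normal equations, synthetic case totals, and hypothetical helper names (`solve_linear`, `fit_terms`, `predict`, `is_acceptable`). The exponential and Gompertz models are omitted because they need an iterative non-linear solver.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_terms(t, cases, powers):
    """Least-squares coefficients for N = sum(coef_k * t**powers[k])."""
    basis = [[ti ** p for p in powers] for ti in t]
    k = len(powers)
    ata = [[sum(row[i] * row[j] for row in basis) for j in range(k)]
           for i in range(k)]
    atb = [sum(row[i] * ni for row, ni in zip(basis, cases)) for i in range(k)]
    return solve_linear(ata, atb)


def predict(coefs, powers, t):
    """Evaluate the fitted model at day t."""
    return sum(c * t ** p for c, p in zip(coefs, powers))


def is_acceptable(predicted, actual):
    """Within a factor of two: 50-200% of the actual total."""
    return 0.5 * actual <= predicted <= 2.0 * actual


QUADRATIC = (2, 1, 0)      # N = A t^2 + B t + C
SIMPLE_SQUARE = (2, 0)     # N = A t^2 + C

# Synthetic (hypothetical) cumulative totals for days 1..40.
days = list(range(1, 41))
totals = [3 * d * d + 10 * d + 40 for d in days]
target_day, actual = days[-1], totals[-1]

# Fit on the first n_used days only, then extrapolate to the final day.
for n_used in (5, 10, 20):
    for name, powers in (("quadratic", QUADRATIC),
                         ("simple square", SIMPLE_SQUARE)):
        coefs = fit_terms(days[:n_used], totals[:n_used], powers)
        pred = predict(coefs, powers, target_day)
        print(n_used, name, round(pred), is_acceptable(pred, actual))
```

Centering the time axis, as Prism's centered polynomial does, changes the fitted coefficients but not the predictions, so it is left out here for brevity.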
When the early, pre-mitigation portion of the data was examined for all 28 countries, the Gompertz growth model had the best statistical parameters (mean R² = 0.998 ± 0.0028, Table 1), although a fit could not be obtained for the data from 2 countries and many of the fit values for NM were unrealistic compared to national populations (e.g. China and India had predicted NM values corresponding to 0.014 % and 0.33 % of their populations respectively 26 (SI Table 2)). Fitting was also incomplete for the generalized logistic model for all 28 countries, underlining the difficulty in applying this model. On the other hand, the simple models were able to robustly fit all the current data, with the quadratic (parabolic) model performing the best (mean R² = 0.992 ± 0.004) and the exponential model the worst (mean R² = 0.957 ± 0.022) (Table 1). In only three (11 %) countries did the exponential model have the best overall R² value among the simple models. Furthermore, the overall superiority of the Gompertz model, followed by the quadratic model, was also observed in the standard error of the estimate. The mean standard error of the estimate (Sy.x, analogous to the root mean squared error for fits of multiple parameters) for the 28 countries was 1699 for the Gompertz model, 5613 for the quadratic model, 8572 for the simple square model, and 11257 for the exponential model (Table 1). Likewise, plots of the natural log of the total number of cases in the early parts of the epidemic (ln N) against time are significantly less linear (as determined by R²) than equivalent plots of the square root of the total number of cases (√N) (SI Table 3, SI Figs 1, 2). While logistic growth models have been widely used to model epidemics 16, 27 , uncertainties in estimates of R0 (and therefore the population carrying capacity NM) make prospective predictions of the course of the epidemic difficult 14, 27 . (Figure 3 , Table 2 , SI Table 4 ).
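The linearity comparison between ln N and √N described above can be illustrated with a short sketch. The data here are synthetic and chosen to grow quadratically, so the outcome only demonstrates the diagnostic itself, not the paper's results; `r_squared_linear` is a hypothetical helper name.

```python
import math


def r_squared_linear(x, y):
    """R^2 of the ordinary least-squares straight line through (x, y)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    # For a straight-line fit, R^2 equals the squared correlation coefficient.
    return sxy * sxy / (sxx * syy)


# Synthetic (hypothetical) cumulative totals that grow quadratically.
days = list(range(1, 31))
totals = [4 * d * d + 50 for d in days]

# Exponential growth would make ln N linear in time;
# quadratic growth makes sqrt(N) approximately linear in time.
r2_log = r_squared_linear(days, [math.log(v) for v in totals])
r2_sqrt = r_squared_linear(days, [math.sqrt(v) for v in totals])
print(f"ln N vs t:   R^2 = {r2_log:.4f}")
print(f"sqrt N vs t: R^2 = {r2_sqrt:.4f}")
```

For real case data, whichever transform gives the straighter line indicates which growth law describes the early curve better.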
Here we define predictions as accurate when they are within a factor of two (50-200%) of the actual outcome. For most countries, the simple exponential model massively overpredicts the number of future cases. Predictions generated more than 14 days in advance were more than double the actual number of cases for 17 (61 %) of the countries examined. In fact, for 15 (54 %) countries, the exponential model made at least one overprediction by more than a factor of 10,000, while the quadratic and simple square models made no overprediction by more than a factor of 3.3 and 2.1, respectively (e.g. using the first 10 days of data from Portugal, the exponential model predicts 34 million cases, while the quadratic, simple square, and Gompertz growth models predict 24957, 20358, and 18953 cases respectively; 23683 total cases were observed, and the total population of Portugal in 2018 was 10.3 million 26 ). Predictions using the quadratic and simple square models were much more accurate. In only four (14 %) countries does the quadratic model ever overpredict the final number of cases by more than a factor of two, while the simple square model overpredicts by a factor of two for only one (4 %) country (SI Table 4). For the quadratic model, the mean maximum daily overprediction was 1.6-fold (median 1.3-fold), while for the simple square model the mean maximum daily overprediction was 1.3-fold (median 1.1-fold). Both of these models produced much more accurate predictions than the simple exponential model (Table 2). The (Table 1) , and this may account for the conflation of the course of the SARS-CoV-2 pandemic with truly exponential growth. That the exponential growth constant term, k, is constantly decreasing after day 10 in 10 (68 %) countries (SI Fig. 3) further indicates the overall utility of logistic models, which were explicitly developed to model a constantly decreasing rate of growth due to consumption of the available resource (i.e.
the susceptible population pool of the SIR model) 16 . However, while logistic models are in principle the correct model, they are difficult to fit accurately during the early portion of an epidemic due to inherent uncertainties in the mathematical shape parameters (Equation 1) of the curve itself and in the population carrying capacity for SARS-CoV-2, NM, which is still significantly uncertain because the virus has only recently moved into the human population. The herd immunity threshold is defined as 1 - 1/R0, and since current estimates for R0 vary from 1.5 to 6.5 14 , this implies that 33-85% of the population will need to have contracted the disease and developed immunity in order to terminate the epidemic. A discrepancy of this size will significantly affect predictions based on logistic growth models. Here we note the utility of the quadratic (parabolic) and simple square models in predicting the course of the pandemic more than a month in advance. The simple exponential model vastly overpredicts the number of cases (Fig. 2, Table 2). The Gompertz growth model, while often making largely correct predictions, frequently generates wildly inaccurate estimates of the population carrying capacity NM (SI Table 2), and the generalized logistic model simply fails to produce a statistically reliable result with the currently available data. Overestimation of the future number of cases will cause problems because the failure of the predicted cases to materialize may be erroneously used as evidence that poorly implemented and ineffective policy prescriptions are reducing the spread of SARS-CoV-2. This may lead to political pressure for premature cessation of all prescriptive measures and inevitably to an increase in the number of cases and excess, unnecessary morbidities. Fortunately, the quadratic model produces accurate, prospective predictions of the number of cases (Fig. 3, Table 2).
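The herd immunity range quoted above follows directly from the 1 - 1/R0 relation. A minimal worked example, using only the R0 bounds given in the text (the function name and the handling of R0 ≤ 1 are assumptions made for illustration):

```python
def herd_immunity_threshold(r0):
    """Fraction of the population that must be immune: 1 - 1/R0."""
    if r0 <= 1:
        # Assumption for this sketch: an epidemic with R0 <= 1 does not
        # sustain itself, so no immune fraction is required.
        return 0.0
    return 1.0 - 1.0 / r0


# R0 bounds quoted in the text.
for r0 in (1.5, 6.5):
    print(f"R0 = {r0}: threshold = {herd_immunity_threshold(r0):.0%}")
```

The width of this range is what makes the carrying capacity NM so uncertain early in an epidemic.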
Use of this model is simple, as it is directly available in common spreadsheet programs and can be applied without much difficulty or technical modeling expertise. In theory, this model can also be applied to smaller, sub-national populations, although the smaller number of total cases in these regions will undoubtedly give rise to larger statistical errors. In no way does the agreement between the quadratic model and the empirical data negate the fact that the growth of the SARS-CoV-2 epidemic is logistic in nature in all 28 countries (Table 1, SI Table 2). We expect that the suitability of these empirical quadratic fits is related to the quadratic form of the slope of the generalized logistic function, to the limitation of the virus to a physical radius of infectivity around infectious individuals, to the fact that it is still early in the pandemic (no country has yet officially logged even 1% of its population as having been infected), or to all three. Of course, the true number of COVID-19 cases is a matter of debate, as there is speculation that a significant fraction of infections are not being identified 34 . However, because this method is focused on the rate of case growth over time, the errors that lead to any undercounting within a given country are likely to remain largely unchanged over the short time periods observed here, so the method should still provide a reasonable estimate of the number of positively identified cases. Despite their similar predictive power, we largely focus on the quadratic model rather than the simple square model for the aforementioned reasons; we also note that quadratic curve fitting is natively implemented in most common spreadsheet software while the simple square model is not. By monitoring the R² values for the quadratic models, it is a simple task to identify when the epidemic is beginning to subside within a country (i.e. "bending the curve").
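Monitoring the quadratic R² over time can be sketched as below. This is an illustrative reading of the idea, not the authors' procedure: it refits the quadratic each day on all data so far and reports the first day on which R² falls below a threshold, here defaulting to the 0.985 value recommended in this paper. The data are synthetic, and the helper names (`det3`, `quadratic_r_squared`, `first_day_below`) and the 5-day minimum are assumptions.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


def quadratic_r_squared(t, cases):
    """R^2 of the least-squares fit N = A x^2 + B x + C, with x = t - mean(t)."""
    tm = sum(t) / len(t)
    x = [ti - tm for ti in t]          # centre time to keep the sums well scaled
    s = [sum(xi ** p for xi in x) for p in range(5)]
    rhs = [sum(ni * xi ** p for xi, ni in zip(x, cases)) for p in (2, 1, 0)]
    mat = [[s[4], s[3], s[2]], [s[3], s[2], s[1]], [s[2], s[1], s[0]]]
    d = det3(mat)
    coefs = []
    for k in range(3):                 # Cramer's rule on the normal equations
        mk = [row[:] for row in mat]
        for r in range(3):
            mk[r][k] = rhs[r]
        coefs.append(det3(mk) / d)
    a, b, c = coefs
    fitted = [a * xi * xi + b * xi + c for xi in x]
    mean_n = sum(cases) / len(cases)
    ss_res = sum((ni - fi) ** 2 for ni, fi in zip(cases, fitted))
    ss_tot = sum((ni - mean_n) ** 2 for ni in cases)
    return 1.0 - ss_res / ss_tot


def first_day_below(totals, threshold=0.985, min_days=5):
    """First day count at which the quadratic R^2 drops below the threshold."""
    for n_used in range(min_days, len(totals) + 1):
        t = list(range(1, n_used + 1))
        if quadratic_r_squared(t, totals[:n_used]) < threshold:
            return n_used
    return None


# Synthetic (hypothetical) totals: 30 days of quadratic growth, then no new cases.
growth = [5 * d * d + 20 for d in range(1, 31)]
totals = growth + [growth[-1]] * 30
bend_day = first_day_below(totals)
print("quadratic fit degrades on day", bend_day)
```

The detected day necessarily lags the true change in growth, because several flat days must accumulate before the fit over the whole record degrades past the threshold.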
Here we recommend the use of an R² value of 0.985 for identifying when the rate of infection is beginning to subside, but more conservative estimates can also be made by lowering this threshold. Examination of the data collected here suggests that early, aggressive measures have been most effective at reducing disease burden within a country. Countries that initially adopted less stringent measures (such as the US, UK, Russia, and Brazil) are currently more heavily burdened than those countries that started with more intense prescriptions (such as China, South Korea, Australia, Denmark, and Vietnam) 35 .

both logistic curves is the same, the Gompertz curve reaches the population carrying capacity more slowly, resulting in a long tailed epidemic. The initial part of the Gompertz curve (including time points until 5% of the population has been infected) was fit to the simple exponential (red dashes), quadratic (blue dashes) and simple square (green dashes) models. It is apparent from these curves how quickly the exponential curve overestimates the rate of growth for the epidemic as compared to the quadratic and simple square fit curves, and how the quadratic model more closely follows the Gompertz growth curve, as evidenced by the smaller Sy.x value for the quadratic fit in Table 1.

for each model using only data up to that day are used to predict the number of expected cases for the last day for which data is available (or the last day before significant curve deviation is observed, see Figure 2). Days on which the fit was not statistically sound were omitted from the graph.

Table 2: The fit parameters for the development of the early portion of the SARS-CoV-2 epidemic in 28 countries for the exponential, quadratic, simple square, simple exponential, and Gompertz growth models as calculated for each individual day during the early portion of the epidemic.
a The fit equations for each are as follows: Simple exponential: where N is the total number of cases, t is the time in days, N0 is the initial seeding population of the epidemic, NM is the population carrying capacity (the amount of the population that must be infected to achieve herd immunity), and A, B, and C are the standard quadratic terms (or the terms of the simple square model equation). Additionally, the number of days of data used in the fitting and the R², sum of squares, and Sy.x statistical values are given. For the Gompertz growth model, an adequate fit could not be achieved for Brazil or Denmark and this is indicated by

Figure 3: The change of the exponential rate term (k) over time for each of the 28 countries. It can be clearly seen that k is generally decreasing over time, often on each day but sometimes after an initial period of increase. This indicates that the exponential rate is regularly decreasing, as expected for a situation where the growth resource is being depleted, and as expected for the logistic family of models, including the generalized logistic and Gompertz growth models.
Coronavirus disease 2019 (COVID-19) Situation Report -51 ?sfvrsn=1ba62e57_10: World Health Organization
COVID-19: Epidemiology, Evolution, and Cross-Disciplinary Perspectives
The epidemiology and pathogenesis of coronavirus disease (COVID-19) outbreak
Estimating the asymptomatic proportion of coronavirus disease 2019 (COVID-19) cases on board the Diamond Princess cruise ship
Estimation of the asymptomatic ratio of novel coronavirus infections (COVID-19)
Real estimates of mortality following COVID-19 infection
Impact assessment of non-pharmaceutical interventions against coronavirus disease 2019 and influenza in Hong Kong: an observational study
The COVID-19 vaccine development landscape
Interventions to mitigate early spread of SARS-CoV-2 in Singapore: a modelling study
Real-time forecasts of the COVID-19 epidemic in China from February 5th to February 24th
Herd-Immunity to Helminth Infection and Implications for Parasite Control
"Herd Immunity": A Rough Guide
The reproductive number of COVID-19 is higher compared to SARS coronavirus
Piecewise quadratic growth during the 2019 novel coronavirus epidemic
The use of Gompertz models in growth analyses, and new Gompertz-model approach: An addition to the Unified-Richards family
Dynamics of Tumor Growth
The impact of a physical geographic barrier on the dynamics of measles
Can China's COVID-19 strategy work elsewhere?
Effective containment explains subexponential growth in recent confirmed COVID-19 cases in China
Covid-19 National Emergency Response Center E CMTK, Prevention. Contact Transmission of COVID-19 in South Korea: Novel Investigation Techniques for Tracing Contacts
Using social and behavioural science to support COVID-19 pandemic response
19 Image Data Collection: Prospective Predictions Are the Future
Cohort studies: prospective versus retrospective
Rational evaluation of various epidemic models based on the COVID-19 data of China
Model Selection and Evaluation Based on Emerging Infectious Disease Data Sets including A/H1N1 and Ebola
National Response to COVID-19 in the Republic of Korea and Lessons Learned for Other Countries
The French response to COVID-19: intrinsic difficulties at the interface of science, public health, and policy
COVID-19 healthcare demand and mortality in Sweden in response to non-pharmaceutical (NPIs) mitigation and suppression scenarios
What policy makers need to know about COVID-19 protective immunity
High population densities catalyse the spread of COVID-19
High Temperature and High Humidity Reduce the Transmission of COVID-19
Substantial undocumented infection facilitates the rapid dissemination of novel coronavirus (SARS-CoV-2)
Whose coronavirus strategy worked best? Scientists hunt most effective policies

and Gompertz growth models based on days of good predictions before the target, last day of observed data, inclusive. b

b Good predictions are defined as the predicted result being within a factor of 2 (predictions from 50-200% of the actual total number of cases). Thus, the quadratic model was able to predict the total number of cases in the United States in each of the 45 days before that day, while the exponential model was only within the defined good range for the 20 days preceding that day. The range of minimum predictions (underpredictions) and maximum predictions (overpredictions) is also given.

SI Figure 1: Plots of the square root of the total number of cases (√N) for the early portion of the COVID-19

SI Figure 2: Plots of the natural log of the total number of cases (lnN) for the early portion of the COVID-19