key: cord-0787722-ssdx8xwi authors: Anastassopoulou, Cleo; Russo, Lucia; Tsakris, Athanasios; Siettos, Constantinos title: Data-Based Analysis, Modelling and Forecasting of the COVID-19 outbreak date: 2020-02-13 journal: nan DOI: 10.1101/2020.02.11.20022186 sha: 2858f25f5364b0ef37ceeaa370471ee6b3fac29d doc_id: 787722 cord_uid: ssdx8xwi Since the first suspected case of coronavirus disease-2019 (COVID-19) on December 1st, 2019, in Wuhan, Hubei Province, China, a total of 40,235 confirmed cases and 909 deaths have been reported in China up to February 10, 2020, evoking fear locally and internationally. Here, based on the publicly available epidemiological data for Hubei, China from January 11 to February 10, 2020, we provide estimates of the main epidemiological parameters. In particular, we provide an estimation of the case fatality and case recovery ratios, along with their 90% confidence intervals as the outbreak evolves. On the basis of a Susceptible-Infected-Recovered-Dead (SIDR) model, we provide estimations of the basic reproduction number (R0), and the per day infection mortality and recovery rates. By calibrating the parameters of the SIRD model to the reported data, we also attempt to forecast the evolution of the of the outbreak at the epicenter three weeks ahead, i.e. until February 29. As the number of infected individuals, especially of those with asymptomatic or mild courses, is suspected to be much higher than the official numbers, which can be considered only as a subset of the actual numbers of infected and recovered cases in the total population, we have repeated the calculations under a second scenario that considers twenty times the number of confirmed infected cases and forty times the number of recovered, leaving the number of deaths unchanged. Based on the reported data, the expected value of R0 as computed considering the period from the 11th of January until the 18th of January, using the official counts of confirmed cases was found to be ~4.6, while the one computed under the second scenario was found to be ~3.2. Thus, based on the SIRD simulations, the estimated average value of R0 was found to be ~2.6 based on confirmed cases and ~2 based on the second scenario. Our forecasting flashes a note of caution for the presently unfolding outbreak in China. Based on the official counts for confirmed cases, the simulations suggest that the cumulative number of infected could reach 180,000 (with lower bound of 45,000) by February 29. Regarding the number of deaths, simulations forecast that on the basis of the up to the 10th of February reported data, the death toll might exceed 2,700 (as a lower bound) by February 29. Our analysis further reveals a significant decline of the case fatality ratio from January 26 to which various factors may have contributed, such as the severe control measures taken in Hubei, China (e.g. quarantine and hospitalization of infected individuals), but mainly because of the fact that the actual cumulative numbers of infected and recovered cases in the population most likely are much higher than the reported ones. Thus, in a scenario where we have taken twenty times the confirmed number of infected and forty times the confirmed number of recovered cases, the case fatality ratio is around 0.15% in the total population. Importantly, based on this scenario, simulations suggest a slow down of the outbreak in Hubei at the end of February. 2.89-4.39) in the early phase of the outbreak [10] . Very similar estimates, 2.2 (95% CI: 27 1.4-3.9), were obtained for R 0 at the early stages of the epidemic by Imai et al. 2.6 (95% 28 CI: 1.5-3.5) [11] , as well as by Li weeks [12] . Amidst such an important ongoing public health crisis that also has severe 33 economic repercussions, we reverted to mathematical modelling that can shed light to 34 essential epidemiologic parameters that determine the fate of the epidemic [13] . Here, 35 we present the results of the analysis of time series of epidemiological data available in 36 the public domain [14] [15] [16] (WHO, CDC, ECDC, NHC and DXY) from January 11 to 37 February 10, 2020, and attempt a three-week forecast of the spreading dynamics of the 38 emerged coronavirus epidemic in the epicenter in mainland China. 39 Methodology 40 Our analysis was based on the publicly available data of the new confirmed daily cases 41 reported for the Hubei province from the 11th of January until the 10th of 42 February [14] [15] [16] . Based on the released data, we attempted to estimate the mean 43 values of the main epidemiological parameters, i.e. the basic reproduction number R 0 , 44 the case fatality (hatγ) and case recovery (β) ratios, along with their 90% confidence 45 intervals. However, as suggested [17] , the number of infected, and consequently the 46 number of recovered, people is likely to be much higher. Thus, in a second scenario, we 47 have also derived results by taking twenty times the number of reported cases for the number of deaths that is more likely to be closer to the real number. Furthermore, by 50 calibrating the parameters of the SIRD model to fit the reported data, we also provide 51 tentative forecasts until the 29th of February. 52 The basic reproduction number (R 0 ) is one of the key values that can predict 53 whether the infectious disease will spread into a population or die out. R 0 represents 54 the average number of secondary cases that result from the introduction of a single 55 infectious case in a totally susceptible population during the infectiousness period. 56 Based on the reported data of confirmed cases, we provide estimations of the R 0 from 57 the 16th up to the 20th of January in order to satisfy as much as possible the 58 hypothesis of S ≈ N that is a necessary condition for the computation of R 0 . 59 We also provide estimations of the case fatality (γ) and case recovery (hatβ) ratios 60 over the entire period using a rolling window of one day from the 11th of January to the 61 16th of January to provide the very first estimations. 62 Furthermore, we calibrated the parameters of the SIRD model to fit the reported 63 data. We first provide a coarse estimation of the recovery (β) and mortality rates (γ) of 64 the SIRD model using the first period of the outbreak. Then, an estimation of the 65 infection rate is accomplished by "wrapping" around the SIRD simulator an 66 optimization algorithm to fit the reported data from the 11th of January to the 10th of 67 February. We have start our simulations with one infected person on the 16th of 68 November, which has been suggested as a starting date of the epidemic, run the SIR 69 model until the 10th of February . Below, we describe analytically our approach. 70 Let us start by denoting with S(t), I(t), R(t), D(t), the number of susceptible, 71 infected, recovered and dead persons respectively at time t in the population of size N . 72 For our analysis, we assume that the total number of the population remains constant. 73 Based on the demographic data for the province of Hubei N = 59m. Thus, the discrete 74 SIRD model reads: The above system is defined in discrete time points t = 1, 2, . . ., with the 76 corresponding initial condition at the very start of the epidemic: S(0) = N − 1, 77 I(0) = 1, R(0) = D(0) = 0. Here, β and γ denote the "effective/apparent" the per day 78 recovery and fatality rates. Note that these parameters do not correspond to the actual 79 per day recovery and mortality rates as the new cases of recovered and deaths come 80 from infected cases several days back in time. However, one can attempt to provide 81 some coarse estimations of the "effective/apparent" values of these epidemiological 82 parameters based on the reported confirmed cases using an assumption and approach 83 described in the next section. parameters of the SIRD model as: March 5, 2020 3/28 Lets us denote with ∆I(t) = I(t) − I(t − 1), where, X = I, R, D. Let us also denote by ∆X(t) the t × 1 column vector containing all the reported new 96 cases up to time t. Let us also denote by 97 C∆X(t) = [C∆X(1), C∆X(2), · · · , C∆X(t)] T , the t × 1 column vector containing the 98 corresponding cumulative numbers up to time t. On the basis of Eqs.(2), (3), (4), one 99 can provide a coarse estimation of the parameters R 0 , β and γ as follows. Starting with the estimation of R 0 , we note that as the province of Hubei has a 101 population of 59m, one can reasonably assume that for any practical means, at least at 102 the beginning of the outbreak, S ≈ N . By making this assumption, one can then 103 provide an approximation of the expected value of R 0 using Eq.(5) and Eq.(2), Eq.(3), 104 Eq.(4). In particular, substituting in Eq.(2), the terms βI(t − 1) and (3), and ∆D(t) = D(t) − D(t − 1) from Eq.(4) and 106 bringing them into the left-hand side of Eq.(2), we get: Adding Eq.(3) and Eq.(4), we get: Finally, assuming that for any practical means at the beginning of the spread that 109 S(t − 1) ≈ N and dividing Eq.(7) by Eq.(8) we get: Note that one can use directly Eq.(9) to compute R 0 with regression, without the 111 need to compute first the other parameters, i.e. β, γ and α. At this point, the regression can be done either by using the differences per se, or by 113 using the corresponding cumulative functions (instead of the differences for the 114 calculation of R 0 using Eq.(9)). Indeed, it is easy to prove that by summing up both 115 sides of Eq.(7) and Eq.(8) over time and then dividing them, we get the following 116 equivalent expression for the calculation of R 0 . Here, we used Eq. (10) to estimate R 0 in order to reduce the noise included in the 118 differences. Note that the above expression is a valid approximation only at the 119 beginning of the spread of the disease. Thus, based on the above, a coarse estimation of R 0 and its corresponding 121 confidence intervals can be provided by solving a linear regression problem using 122 least-squares problem as: March 5, 2020 4/28 . CC-BY-NC-ND 4.0 International license It is made available under a author/funder, who has granted medRxiv a license to display the preprint in perpetuity. is the (which was not peer-reviewed) The copyright holder for this preprint . https://doi.org/10.1101/2020.02. 11.20022186 doi: medRxiv preprint The prime ( ) is for the transpose operation. China [18] for estimating the mortality ratio for the COVID-19 and also the discussion 130 in [19] ). Here, we adopt the one used by the NHC. Thus, a coarse estimation of case fatality and recovery ratios for the period under 132 study can be calculated using the reported cumulative infected, recovered and death 133 cases, by solving a linear regression problem, which for the case fatality ratio reads: and in a (loose) analogy for the case recovery ratio reads: As the reported data are just a subset of the actual number of infected and 136 recovered cases including the asymptomatic and/or mild ones, we have repeated the the "effective" per day recovery rate β and the "effective" per day mortality rate γ were 150 computed by solving the least squares problems (see Eq. (2, 4) : and March 5, 2020 5/28 . CC-BY-NC-ND 4.0 International license It is made available under a author/funder, who has granted medRxiv a license to display the preprint in perpetuity. is the (which was not peer-reviewed) The copyright holder for this preprint . https://doi.org/10.1101/2020.02.11.20022186 doi: medRxiv preprint As noted, these values do not correspond to the actual per day mortality and 153 recovery rates as these would demand the knowledge of the corresponding time delays. 154 Having estimated the above "effective" approximate values of the parameters β and γ, 155 an approximation of the "effective" infected rate α, that is not biased by the 156 assumption of S = N can be obtained by using the SIRD simulator. In particular, in 157 the SIRD model, the values of the β and γ parameters were set equal to the ones found 158 using the reported data solving the corresponding least squares problems given by 159 Eq.(14), (15) . As initial conditions we have set one infected person on the 16th of 160 November and run the simulator until the last date for which there are available data 161 (here up to the 10th of February). Then, the optimal value of the infection rate α that 162 fits the reported data was found by "wrapping" around the SIRD simulator an 163 optimization algorithm (such as a nonlinear least-squares solver) to solve the problem: 164 where where, C∆X SIRD (t), (X = I, R, D) are the cumulative cases resulting from the 165 SIRD simulator at time t; w 1 , w 2 , w 3 correspond to scalars serving in the general case as 166 weights to the relevant functions. For the solution of the above optimization problem we 167 used the function "lsqnonlin" of matlab [20] using the Levenberg-Marquard algorithm. 168 As discussed, we have derived results using two different scenarios (see in Methodology). 170 For each scenario, we first present the results for the basic reproduction number as well 171 as the case fatality and case recovery ratios as obtained by solving the least squares 172 problem using a rolling window of an one-day step. For their computation, we used the 173 first six days i.e. from the 11th up to the 16th of January to provide the very first 174 estimations. We then proceeded with the calculations by adding one day in the rolling 175 window as described in the methodology until the 10th of February. We also report the 176 corresponding 90% confidence intervals instead of the more standard 95% because of the 177 small size of the data. For each window, we also report the corresponding coefficients of 178 determination (R 2 ) representing the proportion of the variance in the dependent 179 variable that is predictable from the independent variables, and the root mean square of 180 error (RMSE). The estimation of R 0 was based on the data until January 20, in order 181 to satisfy as much as possible the hypothesis underlying its calculation by Eq.(9). Then, as described above, we provide coarse estimations of the "effective" per day 183 recovery and mortality rates of the SIRD model based on the reported data by solving 184 the corresponding least squares problems. Then, an estimation of the infection rate α 185 was obtained by "wrapping" around the SIRD simulator an optimization algorithm as 186 described in the previous section. Finally, we provide tentative forecasts for the 187 evolution of the outbreak based on both scenarios until the end of February. March 5, 2020 6/28 . CC-BY-NC-ND 4.0 International license It is made available under a author/funder, who has granted medRxiv a license to display the preprint in perpetuity. is the (which was not peer-reviewed) The copyright holder for this preprint . https://doi.org/10.1101/2020.02.11.20022186 doi: medRxiv preprint also reflected in the corresponding confidence intervals. As more data are taken into 201 account, this variation is significantly reduced. Thus, using all the available data from 202 the 11th of January until the 10th of February, the estimated value of the case fatality 203 ratioγ is ∼ 3.2% (90% CI: 3.1%-3.3%) and that of the case recovery ratioβ is ∼ 0.054 204 (90% CI: 0.049-0.060). It is interesting to note that as the available data become more, 205 the estimated case recovery ratio increases significantly from the 31th of January (see . CC-BY-NC-ND 4.0 International license It is made available under a author/funder, who has granted medRxiv a license to display the preprint in perpetuity. is the (which was not peer-reviewed) The copyright holder for this preprint . https://doi.org/10.1101/2020.02.11.20022186 doi: medRxiv preprint rates of the SIRD model were γ ∼ 0.01 and β ∼ 0.063 (corresponding to a recovery 211 period of ∼ 15 d). Note that because of the extremely small number of the data used the 212 confidence intervals have been disregarded. Instead, for the our calculations to construct 213 lower and upper bounds, we have considered intervals of 20% around the expected least 214 squares solution. Hence, for γ we have taken the interval (0.008 and 0.012) and for β, 215 we have taken the interval between (0.05 0.076) corresponding to recovery periods from 216 13 to 20 days. As described in the methodology, we have also used the SIRD simulator 217 to provide an estimation of the "effective" infection rate α by optimization with w 1 =1, 218 w 2 =2, w 3 =2. Thus, we performed the simulations by setting β=0.063 and γ=0.01, and 219 as initial conditions one infected, zero recovered and zero deaths on November 16th 220 2019, and run until the 10th of February. The optimal, with respect to the reported 221 confirmed cases from the 11th of January to the 10th of February, value of the infected 222 rate (α) was ∼ 0.193 (90% CI: 0.191-0.195). This corresponds to a mean value of the 223 basic reproduction numberR 0 ≈ 2.6. Note that this value is different compared to the 224 value that was estimated using solely the reported data. However, as more data are released it appears that the mortality rate is lower than 239 the predicted with the current data and thus the death toll is expected, to be 240 significantly less compared with the predictions at least compared to the expected and 241 upper bounds. It is interesting to note that the above estimation of R 0 is close enough to the one 262 reported in other studies (see in the Introduction for a review). andγ should be attributed to the small size of the data and data uncertainty. This is 267 also reflected in the corresponding confidence intervals. As more data are taken into 268 account, this variation is significantly reduced. Thus,using all the (scaled) data from the 269 11th of January until the 10th of February, the estimated value of the case fatality ratio 270 γ now drops to ∼ 0.163% (90% CI: 0.016%-0.0167%) while that of the case recovery 271 ratio is ∼ 0.11 (90% CI: 0.099-0.12). It is interesting also to note, that as the available 272 data become more, the estimated case recovery ratio increases slightly (see Fig 10) , 273 while the case fatality ratio seems to be stabilized at a rate of ∼ 0.163%. interval of recovery periods from 6 to 8 days. Again, we used the SIRD simulator to provide estimation of the infection rate by 285 optimization setting w 1 = 1, w 2 = 400, w 3 = 1 to balance the residuals of deaths with 286 the scaled numbers of the infected and recovered cases. Thus, to find the optimal 287 infection run, we used the SIRD simulations with β = 0.15d −1 , and γ = 0.0005 and as 288 initial conditions one infected, zero recovered, zero deaths on November 16th 2019, and 289 run until the 10th of February. The optimal, with respect to the reported confirmed cases from the 11th of January 291 to the 10th of February value of the infected rate (α) was found to be ∼ 0.3109 (90% March 5, 2020 23/28 . CC-BY-NC-ND 4.0 International license It is made available under a author/funder, who has granted medRxiv a license to display the preprint in perpetuity. is the (which was not peer-reviewed) The copyright holder for this preprint . https://doi.org/10.1101/2020.02. 11.20022186 doi: medRxiv preprint cases (corresponding to ∼200,000 unscaled reported cases) in the total population with 303 a lower bound at ∼362,000 (corresponding to ∼18,000 unscaled reported cases) and an 304 upper bound at ∼ 15m cases. Similarly, for the recovered population, simulations result 305 in an expected actual number of ∼4.9m (corresponding to ∼122,000 unscaled reported 306 cases), while the lower and upper bounds are at ∼500,000 and ∼22m, respectively. Finally, regarding the deaths, simulations under this scenario result in an average 308 number of ∼16,000, with lower and upper bounds, ∼1,154 and ∼116,000. 309 Table 1 summarizes the above results for both scenarios. 310 We note, that the results derived under Scenario II seem to predict a slow down of 311 the epidemic around the end of February as the cumulative numbers of infected 312 approaches a plateau. By the time of the acceptance of our paper, according to the official data released on 319 the 29th of February, the cumulative number confirmed infected cases in Hubei was 320 ∼67,000, that of recovered was ∼31,300 and the death toll was ∼2,800. These numbers 321 are within the lower bounds and expected trends of our forecasts from the 10th of 322 February that are based on Scenario I. In Scenario II, by assuming a 20-fold scaling of 323 the confirmed cases of the infected cases and a 40-fold scaling of the confirmed number 324 of the recovered cases in the total population, forecasts show a significant decline of 325 outbreak in Hubei by 29th of February. This fact has been lately reported. Based on 326 this scenario the casefatality rate in the total population is of the order of ∼0.15%. At 327 this point we should note that our SIRD modelling approach did not take into account 328 many factors that play an important role in the dynamics of the disease such as the 329 effect of the incubation period in the transmission dynamics, the heterogeneous contact 330 transmission network, the effect of the measures already taken to combat the epidemic, 331 the characteristics of the population (e.g. the effect of the age, people who had already 332 health problems). Also the estimation of the model parameters is based on an 333 assumption, considering just the first period in which the first cases were confirmed and 334 reported. Of note, COVID-19, which is thought to be principally transmitted from 335 person to person by respiratory droplets and fomites without excluding the possibility of 336 the fecal-oral route [21] had been spreading for at least over a month and a half before 337 the imposed lockdown and quarantine of Wuhan on January 23, having thus infected 338 unknown numbers of people. The number of asymptomatic and mild cases with 339 subclinical manifestations that probably did not present to hospitals for treatment may 340 be substantial; these cases, which possibly represent the bulk of the COVID-19 341 infections, remain unrecognized, especially during the influenza season [22] . This highly 342 likely gross under-detection and underreporting of mild or asymptomatic cases 343 inevitably throws severe disease courses calculations and death rates out of context, 344 distorting epidemiologic reality. Another important factor that should be taken into 345 consideration pertains to the diagnostic criteria used to determine infection status and 346 confirm cases. A positive PCR test was required to be considered a confirmed case by 347 China's Novel Coronavirus Pneumonia Diagnosis and Treatment program in the early 348 phase of the outbreak [14] . However, the sensitivity of nucleic acid testing for this novel 349 viral pathogen may only be 30-50%, thereby often resulting in false negatives, 350 particularly early in the course of illness. To complicate matters further, the guidance 351 changed in the recently-released fourth edition of the program on February 6 to allow 352 March 5, 2020 24/28 . CC-BY-NC-ND 4.0 International license It is made available under a author/funder, who has granted medRxiv a license to display the preprint in perpetuity. is the (which was not peer-reviewed) The copyright holder for this preprint . https://doi.org/10.1101/2020.02.11.20022186 doi: medRxiv preprint for diagnosis based on clinical presentation, but only in Hubei province [14] . The swiftly 353 growing epidemic seems to be overwhelming even for the highly efficient Chinese 354 logistics that did manage to build two new hospitals in record time to treat infected 355 patients. Supportive care with extracorporeal membrane oxygenation (ECMO) in 356 intensive care units (ICUs) is critical for severe respiratory disease. Large-scale 357 capacities for such level of medical care in Hubei province, or elsewhere in the world for 358 that matter, amidst this public health emergency may prove particularly challenging. 359 We hope that the results of our analysis contribute to the elucidation of critical aspects 360 of this outbreak so as to contain the novel coronavirus as soon as possible and mitigate 361 its effects regionally, in mainland China, and internationally. In the digital and globalized world of today, new data and information on the novel 364 coronavirus and the evolution of the outbreak become available at an unprecedented 365 pace. Still, crucial questions remain unanswered and accurate answers for predicting the 366 dynamics of the outbreak simply cannot be obtained at this stage. We emphatically 367 underline the uncertainty of available official data, particularly pertaining to the true 368 baseline number of infected (cases), that may lead to ambiguous results and inaccurate 369 forecasts by orders of magnitude, as also pointed out by other investigators [1, 17, 22] . We did not receive any specific funding for this study. Table 1 . Model parameters, their computed values and forecasts for the Hubei province under two scenarios: (I) using the exact values of confirmed cases or (II) using estimations for infected and recovered (twenty and forty times the number of confirmed cases, respectively). March 5, 2020 28/28 . CC-BY-NC-ND 4.0 International license It is made available under a author/funder, who has granted medRxiv a license to display the preprint in perpetuity. is the (which was not peer-reviewed) The copyright holder for this preprint . https://doi.org/10.1101/2020.02.11.20022186 doi: medRxiv preprint Early Transmission Dynamics in Wuhan, China, of Novel Coronavirus-Infected Pneumonia WHO Statement Regarding Cluster of Pneumonia Cases in Wuhan, China Novel coronavirus(2019-nCoV) Genomic characterisation and epidemiology of 2019 novel coronavirus: implications for virus origins and receptor binding A pneumonia outbreak associated with a new coronavirus of probable bat origin Epidemiological and clinical characteristics of 99 cases of 2019 novel coronavirus pneumonia in Wuhan, China: a descriptive study Initial Public Health Response and Interim Clinical Guidance for the 2019 Novel Coronavirus Outbreak -United States Clinical features of patients infected with 2019 novel coronavirus in Wuhan Clinical Characteristics of 138 Hospitalized Patients With 2019 Novel Coronavirus-Infected Pneumonia in Wuhan, China Preliminary estimation of the basic reproduction number of novel coronavirus (2019-nCoV) in China, from 2019 to 2020: A data-driven analysis in the early phase of the outbreak Transmissibility of 2019-nCoV Nowcasting and forecasting the potential domestic and international spread of the 2019-nCoV outbreak originating in Wuhan, China: a modelling study Mathematical modeling of infectious disease dynamics The Johns Hopkins Center for Health Security. Daily updates on the emerging novel coronavirus from the Johns Hopkins Center for Health Security The Johns Hopkins Center for Health Security. Coronavirus COVID-19 Global Cases by Johns Hopkins CSSE An interactive web-based dashboard to track COVID-19 in real time. The Lancet Infectious Diseases Real-time tentative assessment of the epidemiological characteristics of novel coronavirus infections in Wuhan, China, as at 22 -National Health Commission (NHC) of the People's Republic of China Methods for Estimating the Case Fatality Ratio for a Novel, Emerging Infectious Disease Coronavirus May Transmit Along Fecal-Oral Route, Xinhua Reports 2019-novel Coronavirus (2019-nCoV): estimating the case fatality rate -a word of caution