key: cord-0893876-m9icky9z
authors: Song, Peter X; Wang, Lili; Zhou, Yiwang; He, Jie; Zhu, Bin; Wang, Fei; Tang, Lu; Eisenberg, Marisa
title: An epidemiological forecast model and software assessing interventions on COVID-19 epidemic in China
date: 2020-03-03
journal: nan
DOI: 10.1101/2020.02.29.20029421
sha: c12ad8b411c24fe215be91d1f990645743e55ab5
doc_id: 893876
cord_uid: m9icky9z

We develop a health informatics toolbox that enables public health workers to timely analyze and evaluate the time-course dynamics of the novel coronavirus (COVID-19) infection using the public available data from the China CDC. This toolbox is built upon a hierarchical epidemiological model in which two observed time series of daily proportions of infected and removed cases are emitted from the underlying infection dynamics governed by a Markov SIR infectious disease process. We extend the SIR model to incorporate various types of time-varying quarantine protocols, including government-level macro isolation policies and community-level micro inspection measures. We develop a calibration procedure for under-reported infected cases. This toolbox provides forecast, in both online and offline forms, of turning points of interest, including the time when daily infected proportion becomes smaller than the previous ones and the time when daily infected proportions becomes smaller than that of daily removed proportion, as well as the ending time of the epidemic. An R software is made available for the public, and examples on the use of this software are illustrated. Some possible extensions of our novel epidemiological models are discussed.

The outbreak of the coronavirus disease or COVID-19, originated in Wuhan, the capital city of Hubei province, spreads out quickly in Hubei and then in China, South Korea, Japan, Iran and Italy, as well as many other cities across the world. Being one of the great epidemics in the 21st century, till February 25, 2020, in China it has caused a total of 78,195 confirmed infections, 2,718 deaths and 30,078 recovered cases, and 2,491 suspected cases still remained to be tested. Since the outbreak of the epidemic, many clinical papers [15, 3, 37, 38, 12, 6, 20, 5, 11, 8, 7, 28, 10, 42, 34] have been published to unveil limited but important knowledge of COVID-19, including that (i) COVID-19 is an infectious disease caused by SARS-CoV-2, a virus closely related to the SARS series of pY I t , Y R t q J follows a state-space model with the beta distributions at time t:

where θ I t and θ R t are the respective prevalence of infection and removal at time t, and λ I and λ R are the parameters controlling the respective variances of the observed proportions. Figure 1 : A conceptual framework of the proposed epidemiological state-space SIR model.

As shown in Figure 1 , these observed time series are emitted from the underlying latent dynamics of COVID-19 infection characterized by the latent Markov process θ t . It is easy to see that the expected proportions in both models (1) and (2) are equal to the prevalence of infection and the probability of removal at time t, namely EpY I t |θ t q " θ I t and EpY R t |θ t q " θ R t . See Appendix B. Moreover, the latent population prevalence θ t " pθ S t , θ I t , θ R t q J is a three-dimensional Markov process, in which θ S t is the probability of a person being susceptible or at risk at time t, θ I t is the probability of a person being infected at time t, and θ R t is the probability of a person being removed from the infectious system (either recovered or dead) at time t. Obviously, θ S t`θ I t`θ R t " 1. We assume that this 3-dimensional prevalence process θ t is governed by the following Markov model:

where parameter κ scales the variance of the Dirichlet distribution and function f p¨q is a 3dimensional vector that determines the mean of the Dirichlet distribution. The function f is the engine of the infection dynamics, which operates according to the classical infectious disease SIR model of the form:

dθ I t dt " βθ S t θ I t´γ θ I t , and

Note that the explicit solution to the above system (4) of ordinary differential equations is unavailable. Following [23] , we invoke the fourth-order Runge-Kutta(RK4) approximation, resulting in an approximate of f pθ t´1 , β, γq as follows:

f pθ t´1 , β, γq "¨θ S t´1`1 {6rk S 1 t´1`2 k S 2 t´1`2 k S 3 t´1`k S 4 t´1 s θ I t´1`1 {6rk I 1 t´1`2 k I 2 t´1`2 k I 3 t´1`k

where all these k t terms are given in the appendix A. Denote the set of model parameters by τ " pβ, γ, θ 0 , λ, κq J , which will be estimated using the MCMC method [2] .

The basic epidemiological model with both constant transmission and removal rates in SIR model (4) does not reflect the reality in China, where all levels of quarantines have been enforced. Many forms of human interventions that are altering the transmission rate over time include (i) individuallevel protective measures such as wearing masks and safety glasses, using hygiene, and taking inhome isolation, and (ii) community-level quarantines such as hospitalization for infected cases, city blockade, traffic control and restricted social activities, and so on. In addition, the virus itself may mutate to evolve, so to increase the potential rate of false negative in the disease diagnosis. As a result, some individual virus carriers are not contained. Thus, the transmission rate β indeed varies over time, which should be accounted in the modeling. One extension to the above basic epidemiological model is to allow a time-varying probability that a susceptible person meets an infected person or vice versa. Suppose at a time t, q S ptq P r0, 1s is the chance of an at-risk person being in-home isolation, and q I ptq P r0, 1s is the chance of an infected person being in-hospital quarantine. Thus, the chance of disease transmission when an at-risk person meets an infected person is modified as: βt1´q S ptquθ S t t1´q I ptquθ I t :" βπptqθ S t θ I t , with πptq :" t1´q S ptqut1´q I ptqu P r0, 1s. In effect, this πptq modifies the chance of a susceptible person meeting with an infected person or vice versa, which is termed as a transmission modifier due to quarantine in this paper. Obviously, with no quarantine in place, πptq " 1 for all time. See Figure 2 . This results in a new SIR model with a time-varying transmission rate modifier:

where the product term βπptq defines a realistic transmission rate reflective to the currently enforced quarantine measures of all levels in China. Note that the above extended SIR model assumes primarily that both population-level chance of being susceptible and population-level chance of being infected remain the same, but the chance of a susceptible person meeting with an infected person is reduced by πptq.

The transmission rate modifier πptq needs to be specified according to actual quarantine protocols in a given region. A possible choice of πptq may be a step function that reflects governmentinitiated macro isolation measures in Wuhan, Hubei province:

π 01 , if t ď Jan 23, no concrete quarantine protocols; π 02 , if t P pJan 23, Feb 4s, city blockade; π 03 , if t P pFeb 4, Feb 8s, enhanced quarantine; π 04 , if t ą Feb 8, opening of new hospitals.

When π 0 " pπ 01 , π 02 , π 03 , π 04 q are chosen with different values, as shown in Figure 3 , we obtain different types of transmission rate modifiers aligned with different quarantine protocols. Subfigures from A to C indicate π 0 " pπ 01 , π 02 , π 03 , π 04 q equal to p1, 1, 1, 1q, p1, 0.9, 0.8, 0.5q and p1, 0.9, 0.5, 0.1q with change time points at (Jan 23, Feb 4, Feb 8) respectively.

Alternatively, the modifier πptq may be specified as a continuous function that reflects steadily increased community-level awareness and responsibility of voluntary quarantine and preventive measures, which may be regarded as a kind of micro isolation measure initiated by individuals or local self-organized inspections. For example, as shown in Figure 4 , we may choose the following exponential functions:

πptq " expp´λ 0 tq or πptq " expt´pλ 0 tq ν u, λ 0 ą 0, ν ą 0. Technically, the RK's approximate of f function in Appendix A may be easily obtained by 6 All rights reserved. No reuse allowed without permission. author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this preprint (which was not peer-reviewed) is the . https://doi.org/10.1101/2020.02. 29.20029421 doi: medRxiv preprint replacing β by βπptq in the specification of the latent prevalence model (3) , and moreover in all quantities and steps in the MCMC implementation. See the detailed in Section 3.

An alternative way to incorporate quarantine measures into the basic epidemiological model (4) is to add a new quarantine compartment that collects quarantined individuals who would have no chance of meeting any infected individuals in the infection system, as shown in Figure 5 . This model allows to characterize time-varying proportions of susceptible cases due largely to the governmentenforced stringent in-home isolation outside of Hubei province. The basic SIR model in equation (4) is then extended by adding a quarantine compartment with a time-varying rate of quarantine φptq, which is the chance of a susceptible person being willing to take in-home isolation at time t. The extended SIR takes the following 4-dimensional latent process pθ S t , θ Q t , θ I t , θ R t q J :

where We suppose that the quarantine rate φptq is a Dirac delta function with jumps at times when major macro quarantine measures are enforced. For example, we may specify the φptq function as follows:

φptq " Here we show several examples of multi-point instantaneous quarantine rates in Figure 6 with jump sizes equal to φ 0 " pφ 01 , φ 02 , φ 03 q that occur respectively at dates of (Jan 23, Feb 4, Feb 8). In particular, we plot three scenarios, e.g., no intervention (φ 0 " p0, 0, 0q), multiple moderate jumps (φ 0 " p0.1, 0.4, 0.3q), and only one large jump (φ 0 " p0, 0.9, 0q). Note that at each jump, the respective proportion of the susceptible population would move to the quarantine compartment. For example, with φ 0 " p0.1, 0.4, 0.3q, the quarantine compartment will be enlarged accumulatively over three time points as 0.1 θ S t 1`0 .4θ S t 2`0 .3θ S t 3 . The f pθ t´1 , β, γq function determined by the above extended SIR model (6) can be solved by applying the fourth-order Runge-Kutta approximation, and the resulting solution is given in Figure 6 : Three examples of multi-point instantaneous quarantine rates. Subfigures from A to C denote φ 0 " p0, 0, 0, 0q, φ 0 " p0.1, 0.4, 0.3q and φ 0 " p0, 0.9, 0q at three time points of (Jan 23, Feb 4, Feb 8) respectively. Appendix A. To deal with the Dirac delta function φptq, we develop a two-step approximation for model (6) . In brief, we first solve a continuous function without change points via the differential equations in (5) , and then we directly move φptqθ S t of the susceptible compartment to the quarantine compartment. From our experience, this approach largely improves the approximation accuracy in the presence of discontinuities.

3 Implementation: Markov Chain Monte Carlo Algorithm

We implemented the MCMC algorithm to collect draws from the posterior distributions, and further obtain posterior estimates and credible intervals of the unknown parameters in the above models specified in Section 2. Because of the hierarchical structure in the state-space model considered in this paper, the posterior distributions can be obtained straightforwardly. The R package rjags [25] can be directly applied to draw samples from the posterior distributions, so we omit the technical details. The latent Markov θ t processes are sampled and forecasted by the MCMC sampler, particularly for the prevalence of infection and the probability of removal, θ I t and θ R t , which enables us to determine turning points of interest and the reproduction number R 0 .

The first turning point of interest is the time when the daily number of new infected cases stops increasing. Mathematically, this corresponds to the time t at which : θ I t " 0 or the gradient of 9 θ I t is zero. As illustrated by Panel A in Figure 7 , the peak of 9 θ I t , denoted by the vertical green line, is the first turning point of interest. The second turning point is the time when the cumulative infected cases reaches its maximum, meaning 9 θ I t " 0. In principle, the second turning point occurs only after the first one, as shown in Panel B in Figure 7 .

The reproduction number R 0 of an infectious disease is estimated by the ratio R 0 " β{γ, where β and γ are both estimated from their posterior distributions. Because our models consider the quarantine compartment, R 0 might change according to the forms of quarantine protocols. We adopt a standard MCMC algorithm to draw the latent process. Let t 0 be the current time up to which we have observed data pY I 0:t 0 , Y R 0:t 0 q. To perform M draws of Y I t , Y R t for t P rt 0`1 , T s, we 8 All rights reserved. No reuse allowed without permission.

author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this preprint (which was not peer-reviewed) is the . , τ pmq s of the prevalence process, at t " t 0`1 , . . . , T ;

pmq t , τ pmq s and rY R t |θ pmq t , τ pmq s according to the observed process, at t " t 0`1 , . . . , T , respectively;

The prior distributions are specified with some of the hyper-parameters being set according to the SARS data from Hong Kong [21] . They are,

In the default setting the variances of the above prior distributions are set at relatively large values to reflect the fact that limited prior knowledge of these epidemiological model parameters is available. When more information becomes accessible during the course of the epidemic, smaller prior variance values may be used, leading to tighter credible intervals for the model parameters and turning points.

We deliver an R software package that is able to output the MCMC estimation, inference and prediction under the epidemiological model with two proposed extended SIR models that incorporate time-varying quarantine protocols. These new models have been discussed in detail in Sections 2.2 and 2.3. Our R package, named eSIR, uses daily-updated time series of infected and removed proportions as input data. The R package is available at GitHub lilywang1988/eSIR, and its user manual is appended as the supplementary material of this paper. The quarantine functions are predefined and will be treated as known functions of protocols/policies in the estimation and 9 All rights reserved. No reuse allowed without permission.

author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this preprint (which was not peer-reviewed) is the . prediction steps. We let the transmission rate modifier πptq be either a step function or an exponential function, and let the quarantine rate φptq follow a Dirac delta function with pre-specified points of jump and sizes of jumps. The package provides various plots for users to visualize the MCMC results, including the estimated prevalence of infection and the estimated probability of removal, and predicted turning points of interest. Various summary statistics are listed in the output, including posterior mean estimates of the transmission and removal rates, estimate of the reproduction number, and forecasts of turning points and their 95% credible intervals. Moreover, the package gives multiple options to users who can save the entire MCMC results, including the output tables and summary plots, traceplots for MCMC quality control, and full MCMC draws for user's own summary analyses. Some illustrations on the use of this software package is given in Section 4. In addition, we developed an online R Shiny App that can automatically update the results whenever the China CDC updates the daily COVID-19 data.

Under-reporting of infections is a common issue in the data collection of infectious disease, especially at the beginning of an outbreak. When medical diagnostic tools become more accurate and reliable, as well the transparency of preventive measures gets improved for an exchange of voluntary in-home quarantine, certain adjustments in data typically occur. It is shown in Figure 8 that on Feb 12 the cumulative and daily added number of infected cases in Hubei had clear jumps with significantly large sizes. Such sizable jumps cannot happen within one day, rather they represent an accumulation of cases that have not been reported in previous dates prior to Feb 12. To fix this under-reporting issue, we develop a calibration procedure with the detail in Appendix C. Below we briefly describe our approach for the calibration of the infected cases.

We assume an exponential growth curve for the cumulative number of infected cases in Hubei before Feb 12 of the form yptq " ae λt`b , where parameters λ, a, b are to be estimated. Under the boundary conditions ypt " Jan 12q " 0 and ypt " Feb 12q " a expp31λq`b, we would like to minimize the one-step ahead extrapolation error on Feb 13. The constrained optimal solution can be obtained by the means of Lagrange Multipliers; the estimates areλ " 0.06605,â " 7142.80,b " 7142.80. The resulting calibration curves for the cumulative and daily added number of infected cases are shown as the solid red curves in Figure 8 . For example, on Jan 31, the reported cumulative number of infected cases is 7,153, but the calibrated number of infected cases is 17,911, with an increment of 10,758 cases. As shown later, this data quality control (QC) step helps improve the performance of MCMC.

We applied our proposed models, algorithms and R package eSIR to analyze the COVID-19 data collected from the public website DXY.cn. The earliest public records for the provincial data are available since Jan 20, 2020. According to an existing R package on GitHub GuangchuangYu/nCov2019

[39], the total counts of confirmed infections and deaths are dated back on Jan 13, 2020. We assumed that before Jan 17 all the reported cases and deaths were from Hubei. We imputed by the linear interpolation the missing cases on Jan 18-19. Therefore, the data used in our analyses starts from Jan 13. The data used in analyses for the other provinces starts on Jan 23, which is the earliest time with non-zero values in the removed compartment. Note that there exist some minor discrepancies between different data sources, and the under-reporting issue is addressed in Appendix C by a calibration procedure. This section aims to provide a demonstration of our software to analyze the current public COVID-19 data, through which users may understand the proposed methods. We focus on illustrating ways to export and interpret the MCMC results. The R package may be applied to analyze infectious data from other countries.

First, we show the analysis of the Hubei COVID-19 data. Note that option dic=T enables to calculate the deviance information criterion (DIC) for model selection, while options, save_files=T and save_mcmc, allow the storage of MCMC output tables, plots, summary statistics and even full MCMC draws, which may be saved via the path of file_add, or otherwise via the current working directory. The major results returned from the package include predicted cumulative proportions, predicted turning points of interest, two ggplot2 [36] objects of the summary plots related to both infection and removed compartments, a summary output table containing all the posterior means, median and credible intervals of the model parameters, and DIC if opted. The traceplots and other useful diagnostic plots are also provided and automatically saved if save_files=T is opted. In the package, function tvt.eSIR() works on the epidemiological model with timevarying transmission rate in Section 2.2, and qh.eSIR() for the other epidemiological model with a quarantine compartment in Section 2.3. Note that in function tvt.eSIR(), with a choice of exponential=F, a step function is run in the MCMC when both change_time and pi0 are specified. To fit the model with a continuous transmission rate modifier function, user may set exponential=T and specify a value of lambda0. The default is to run the basic epidemiological model with no quarantine or πptq " 1 in Section 2.1. death_in_R is usually set to be the average ratio of death and removed proportions at each observation time point, which is used to estimate the death curve in the forecast plot of the removed compartment. Below are the R scripts used in the analysis.

Step function pi(t) ### Y and R are observed proportions of infected and removed compartments change_time <-c("01/23/2020","02/04/2020","02/08/2020") pi0<-c(1.0,0.9,0.5,0.1) res.step <-tvt.eSIR(Y,R,begin_str="01/13/2020",death_in_R = 0.4, T_fin=200,pi0=pi0,change_time=change_time,dic=T, casename="Hubei_step",save_files = T,

All rights reserved. No reuse allowed without permission.

author/funder, who has granted medRxiv a license to display the preprint in perpetuity. In the above R coding scripts, only very short MCMC chains are specified for the consideration of running time. In practice, it is recommended to set M=5e5 and nburnin=2e5 to achieve desirable burn-ins and yield stable MCMC draws. We tried the step function given by Panel C in Figure 3 with pi0=c(1,0.9,0.5,0.1), an exponential function given by Panel B in Figure 4 with rate lambda0=0.05, both of which were compared with the basic model with πptq " 1. The results of the three modifier functions obtained from the tvt.eSIR function are summarized in Figures 9-11 . In these forecasting plots of the infected and removed compartments (Panel A and Panel C), the black dots left to the blue vertical line denote the observed proportions of the infected and removed compartments on the last date of available observations or before. That is, the blue vertical marks time t 0 as defined in Section 3. The green and purple vertical lines denote the first and second turning points, respectively. The salmon color area denotes the 95% credible interval of the predicted proportions Y I t and Y R t after t 0 , respectively, while the cyan color area represents either the 95% credible intervals of the prevalence (θ I t ) and or that of the probability of removal (θ R t ) prior to time t 0 . The gray and red curves are the posterior mean and median curves. The black curve in the plot (Panel C) is the proportion of deaths estimated from a pre-specified ratio death_in_R. The middle Panel B displays important dynamic features of the infection via a spaghetti plot, in which 20 randomly selected MCMC draws of the first-order derivative of the posterior prevalence of infection, namely 9

θ I t . The black curve is the posterior mean of the derivative, and the vertical lines mark times of turning points corresponding respectively to those shown in Panel A and Panel C. Moreover, the 95% credible intervals of these turning points are also highlighted by semi-transparent rectangles in Panel B. As shown in these figures, the transmission rate modifier πptq played an important roles in shortening the key turning points of the epidemic, and its effect on both estimation and prediction of the COVID-19 infection dynamics has been clearly demonstrated .

Next, we analyzed the data from the rest of the Chinese population (i.e. the provinces outside Hubei) starting on Jan 23. We set begin_str="01/23/2020"in tvt.eSIR. To address the delayed starting time, we included two change points for the step function πptq at [Feb 4, Feb 8] with π 0 " p0.8, 0.1q. The exponential function remained the same. It is noted that the spread of 12 All rights reserved. No reuse allowed without permission. author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this preprint (which was not peer-reviewed) is the . https://doi.org/10.1101/2020.02.29.20029421 doi: medRxiv preprint Figure 9 : Prediction plots of θ I t and Y I t (Panel A), 9 θ I t (Panel B), θ R t and Y R t (Panel C) for Hubei without intervention, πptq " 1. author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this preprint (which was not peer-reviewed) is the . https://doi.org/10.1101/2020.02.29.20029421 doi: medRxiv preprint Figure 11 : Prediction plots of θ I t and Y I t (Panel A), 9 θ I t (Panel B), θ R t and Y R t (Panel C) for Hubei with a step-function transmission rate modifier specified by π 0 " p1, 0.9, 0.5, 0.1q at change points [Jan 23, Feb 4, Feb 8].

COVID-19 outside Hubei has been so far much less severe. Possible reasons for such low proportions of infection and deaths include (i) discontinuing the traffic connections between Hubei and the provinces, (ii) more timely caution and preventative measures taken, and (iii) a comparatively large population that dilutes the exposed group. When Panel A in Figure 12 is zoomed in, some of the observed proportions (black dots) are deviated from the posterior mean or median of the fitted prevalence albeit they all fall in the 95% credible intervals, as shown by Panel A in Figures 13 and  14 . Since the latent process follows the SIR differential equations, there may be a lack of fit for the SIR model to accommodate a very large and complex population of 1.3 billion people, in which most of the subjects are not at risk. The proposed models should work much better for individual provinces, but we did not perform such analyses. The other epidemiological model with an added quarantine compartment as an absorbing state was fitted via our R function qh.eSIR in the package eSIR. The arguments used in qh.eSIR() are almost identical to those in tvt.eSIR(). Note that if the quarantine rate function is set at constant 0, this model will be reduced to a basic epidemiological SIR model. ### Example 4: Dirac delta function of the quarantine process change_time <-c("01/23/2020","02/04/2020","02/08/2020") phi <-c(0.1,0.4,0.4) res.q <-qh.eSIR (Y,R,begin_str="01/13/2020",death_in_R = 0.4, phi0=phi0,change_time=change_time, casename="Hubei_q",save_files = T,save_mcmc = F, M=5e2,nburnin = 2e2) res.q$plot_infection 14 All rights reserved. No reuse allowed without permission. author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this preprint (which was not peer-reviewed) is the . https://doi.org/10.1101/2020.02.29.20029421 doi: medRxiv preprint Figure 12 : Prediction plots of θ I t and Y I t (Panel A), 9 θ I t (Panel B), θ R t and Y R t (Panel C) for the other provinces outside Hubei with πptq " 1. author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this preprint (which was not peer-reviewed) is the . https://doi.org/10.1101/2020.02.29.20029421 doi: medRxiv preprint Figure 14 : Prediction plots of θ I t and Y I t (Panel A), 9 θ I t (Panel B), θ R t and Y R t (Panel C) for the other provinces outside Hubei with a step-function transmission rate modifier specified by π 0 " p1, 0.9, 0.5, 0.1q at change points [Jan 23, Feb 4, Feb 8].

#res.q$plot_removed ### Example 5: basic state-space SIR model res.noq <-qh.eSIR (Y,R,begin_str="01/13/2020",death_in_R = 0.4, T_fin=200,casename="Hubei_noq", M=5e2,nburnin = 2e2) res.noq$plot_infection

We applied the R function qh.eSIR in analyses of the data within and outside Hubei. Results are summarized in Figures 15-16 . Our analyses once again clearly indicated that stringent quarantine protocols can largely reduce the spread of COVID-19 both within Hubei and outside Hubei. Yet, it is known that too strict quarantine can cause backfire; people may lose their trust and patience in their committed system, and consequently may try to reduce compliance or even avoid quarantine. We also present the posterior mean probability of staying quarantine compartment in Figure 17 within Hubei and outside Hubei. Note that Jan 23 was not set as a change point for the case of outside Hubei, leading only to two jumps. It is evident that by Feb 8, more than 90% of the Chinese population have taken in-home isolation or as such, reflective to a very strict quarantine protocol enforced in the entire country.

As described in Subsection 4.1, we corrected the under-reported proportion of infections in Hubei province prior to Feb 12, when a big jump occurred on one day. We repeated the same analyses using the calibrated data of infections, and the corresponding results are shown in It is interesting to see that the abrupt rise in the infection proportion on Feb 12 disappeared in all Panel A in these new analyses, and the observed data (i.e. the black dots) align better with the 16 All rights reserved. No reuse allowed without permission. author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this preprint (which was not peer-reviewed) is the . https://doi.org/10.1101/2020.02. 29.20029421 doi: medRxiv preprint 17 All rights reserved. No reuse allowed without permission. author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this preprint (which was not peer-reviewed) is the . https://doi.org/10.1101/2020.02. 29.20029421 doi: medRxiv preprint Figure 17 : The posterior mean probability of staying in quarantine compartment within and outside Hubei. credible intervals of all the latent processes. We also notice that after the correction, the estimated reproduction numbers R 0 became smaller. The reproduction numbers estimated from different models for within and outside Hubei, with and without the data calibration, together with their 95% credible intervals are summarized in Table 1 .

It is worth pointing out that the estimates of the reproduction numbers obtained from the epidemiological models with time-varying transmission or quarantine rates appear larger than those obtained from the basic model with no quarantine. This is no surprising as our new models explicitly incorporate interventions, so that the estimated R 0 is an adjusted number with the influence of interventions be removed. In contrast, the basic model with no use of the quarantine modifier implicitly integrates the effect of interventions into the transmission rate β, and consequently the estimated R 0 is reduced due to the contribution from interventions. Our analyses suggest that reproduction numbers R 0 of COVID-19 without public health interventions would be around 4-5 within Hubei and around 3-3.5 outside Hubei with relatively big credible intervals. These findings are in agreement with findings from [17] . As pointed out above, the size of credible interval may be reduced with more accessible data of COVID-19, which permit users to specify smaller variances in the prior distributions given in section 3.1.

We develop an epidemiological forecast model with an R software package to assess effects of interventions on the COVID-19 epidemic within Hubei and outside Hubei in China. Since our 18 All rights reserved. No reuse allowed without permission. author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this preprint (which was not peer-reviewed) is the . https://doi.org/10.1101/2020.02.29.20029421 doi: medRxiv preprint Figure 18 : Prediction plots of θ I t and Y I t (Panel A), 9 θ I t (Panel B), θ R t and Y R t (Panel C) for Hubei province with πptq " 1 after calibration. Note that the second turning point is beyond the time-axis limit in the plots. Figure 19 : Prediction plots of θ I t and Y I t (Panel A), 9 θ I t (Panel B), θ R t and Y R t (Panel C) for Hubei with an exponential transmission rate modifier πptq " expp´0.05tq after data calibration.

19 All rights reserved. No reuse allowed without permission. author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this preprint (which was not peer-reviewed) is the . https://doi.org/10.1101/2020.02. 29.20029421 doi: medRxiv preprint Figure 20 : Prediction plots of θ I t and Y I t (Panel A), 9 θ I t (Panel B), θ R t and Y R t (Panel C) for Hubei with a step-function transmission rate modifier specified by π 0 " p1, 0.9, 0.5, 0.1q at change points [Jan 23, Feb 4, Feb 8] after data calibration. author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this preprint (which was not peer-reviewed) is the . https://doi.org/10.1101/2020.02. 29.20029421 doi: medRxiv preprint proposed model utilizes the strength of the SIR's dynamic system to capture the primary mechanism of the COVID-19 infectious disease, we are able to predict future episodes of the disease spread patterns over a window of 200 days from the last date of data availability. Some turning points of interest are obtained from these forecasting curves as part of the deliverable information, including the predicted time when daily proportion of infected cases becomes smaller than the previous ones and the predicted time when daily proportion of removed cases (i.e. both recovered and dead) becomes larger than that of infected cases, as well as the time when the epidemic ends. Our informatics toolbox provides quantification of uncertainty on the prediction, rather than only point prediction values, which are valuable to see the best versus the worst. The key novel contribution is the incorporation of time-varying quarantine protocols to expand the basic epidemiological model to accommodate changing transmission rates over time in the population. The toolbox can be used by practitioners who have better knowledge of quarantine and better quality data to perform their own analyses. Practitioners can use the toolbox to evaluate different types of quarantine strategies in practice. All summary statistics obtained from the toolbox are of great importance for public health workers and government policy makers to take proper actions on stop spreading of COVID-19.

We chose the MCMC algorithm to implement the statistical estimation and prediction because of the consideration on the prediction uncertainty. Given the considerable complexity in the COVID-19 virus spread dynamics and potentially inaccurate information of quarantine measures as well as likely under-reported proportions of infected and recovered cases and deaths, it is of critical importance to quantify and report uncertainty in the forecast. Note that the publicly reported data of recovery and death of COVID-19 are mostly collected from hospitals where accessibility to such information is warrant. In contrast, it is very difficulty, if not impossible, to collect the data of infected individuals with light symptoms who had in-home isolation and recovered, in spite of serious efforts made by the government for a door-to-door inspection to identify suspected cases.

This toolbox is indeed so general that it may be applicable to analyze and evaluate the COVID-19 epidemic in other countries, as well as the future outbreak of other types of infectious diseases. As noted in the paper, our proposed method does need prior data of similar infectious disease to set up initial conditions of the infection dynamics. For this, we analyzed the complete SARS data from Hong Kong given some similarity of COVID-19 to SARS. From this perspective, what we learned from this COVID-19 epidemic in this paper is extremely valuable to form initial conditions in the analysis of any future outbreak of similar infectious disease. In addition, understanding forms and strengths of quarantines for the controlling of disease spread is an inevitable path to making effective preventive policies, which is the key analytic capacity that our toolbox offers.

The proposed epidemiological models can be further extended to accommodate more data reported by the China CDC, which are worth future exploration. Two types of data that may be used in the future extension are the daily number of suspected cases and the daily number of hospitalized cases. We did not use such data due to the concern of data accuracy. For example, the number of suspected cases is largely dependent on the diagnostic protocols, which have been revised a few times since the outbreak of the disease, and the sensitivity of the RNA test. Given such concerns, our strategy in the proposed model was to only use necessary data for analysis, and over the course of improved data quality in the near future, our toolbox may be extended to enjoy greater statistical power and more accurate predictions.

Using the RK4 approximation, f pθ t´1 , β, γq in the extended SIR model (6) with a quarantine compartment can be approximated following the two iterative steps:

1. Solve the f pθ t´1 , β, γq in Appendix A without considering the quarantine with f p¨q f pθ t´1 , β, γq " rα 1pt´1q , α 2pt´1q , α 3pt´1q s T .

2. Due to the quarantine, we deduct the susceptible by α1 pt´1q " α 1pt´1q´φ ptqθ S t´1 , and let

Let αt´1 " rα1 pt´1q , α 2pt´1q , α 3pt´1q s T , and it is easy to show that the sum ř 3 k"1 αk pt´1q " 1´θ Q t . Thus we can regenerate the next day's θ t following a Dirichlet distribution adjusted by the prevalence of the quarantine compartment αt " Dirichletpκαt´1{p1´θ Q t qq. The estimated prevalence values become θ t " p1´θ Q t qαt . We follow above two steps and finish the complete prevalence processes. Note that the deduction of susceptible compartments might cause θ S t ď 0, we will bound such prevalence value to be consistently 0, which is equivalent to terminating transmission among susceptible subjects.

For the sake of being self-contained, we list the moments of both Beta and Dirichlet distributions. The mean and variance of Beta distribution Betapα, βq are respectively:

Mean " α α`β , Var " αβ pα`βq 2 pα`β`1q .

While to Dirchlet distribution Dirpκαq, we have

where α " pα 1 , α 2 , α 3 , α 4 q T with α 1`α2`α3`α4 " 1.

As is mentioned in the Introduction, the issue of under-reporting may cause bias in prediction. In order to adjust the under-reported number of infected cases, we apply the following algorithm to calibrate the number of infections before Feb 12, during which time the Chinese government only relies on the RNA test for diagnosis, which was realized later with low sensitivity leading to many false negatives.

We assume that the cumulative number of infected cases between Jan 13 and Feb 12 when a sudden big jump occurs follows an exponential function, yptq " ae λt`b , 26 All rights reserved. No reuse allowed without permission. author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this preprint (which was not peer-reviewed) is the . https://doi.org/10.1101/2020.02. 29.20029421 doi: medRxiv preprint where t P t1, 2, . . . u and a, b, λ are parameters to be estimated. Here, t " 1 stands for Jan 13 and t " 31 stands for Feb 12. Under the condition of yp0q " 0, we can easily get that yptq " ae λt´a .

To estimate parameter λ and a, we want to minimize the least square error of the estimated number yptq of infected cases at t " 32 (Feb 13), which is one day after the Chinese government changed the diagnosis protocol. It is assumed that the difference between the predicted and observed number of infections on Feb 13 would not be big if the model were established well, although the long term difference might be large due to other interventions. Therefore, the optimization problem we want to solve is, min a,λ typ32q´ae 32λ`a u 2 s.t. ae 31λ´a " yp31q.

The constraint ae 31λ´a " yp31q is used to ensure that the cumulative number of infected cases till Feb 12 equals to the observed value yp31q. The optimization problem can be solved using the method of Lagrange Multipliers. Obtainedλ " 0.06605,â " 7142.80. The calibrated number of infected cased between Jan 13 and Feb 12 is shown in Figure 8 . The proposed calibration method corrected the under-reporting issue, at least partially.

27 All rights reserved. No reuse allowed without permission. author/funder, who has granted medRxiv a license to display the preprint in perpetuity. author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this preprint (which was not peer-reviewed) is the . https://doi.org/10.1101/2020.02. 29.20029421 doi: medRxiv preprint qh.eSIR 3

In this function we allow it to characterize time-varying proportions of susceptible due to governmentenforced stringent in-home isolation. We expanded the SIR model by adding a quarantine compartment with a time-varying rate of quarantine φ t , the chance of a susceptible person being willing to take in-home isolation at time t.

Value casename the predefined casename.

incidence_mean mean incidence.

incidence_ci 2.5%, 50%, and 97.5% quantiles of the incidences. plot_removed. The latter indicate stationary t at which the first-order derivative of the averaged posterior of θ I t equals zero. second_tp_ci with second_tp_mean, it reports the corresponding credible interval and median.

dic_val the output of dic.sample() in dic.sample, computing deviance information criterion for model comparison. change_time <-c("01/23/2020","02/04/2020","02/08/2020") phi0 <-c(0.1,0.4,0.4) res.q <-qh.eSIR (Y,R,begin_str="01/13/2020",death_in_R = 0.4, phi0=phi0,change_time=change_time, casename="Hubei_q",save_files = T,save_mcmc = F, M=5e2,nburnin = 2e2) res.q$plot_infection #res.q$plot_removed res.noq <-qh.eSIR (Y,R,begin_str="01/13/2020",death_in_R = 0.4, T_fin=200,casename="Hubei_noq", M=5e2,nburnin = 2e2) res.noq$plot_infection tvt.eSIR Fit extended state-space SIR model with time-varying transmission rates

Fit extended state-space SIR model with prespecified changes in the transmission rate, either stepwise or continuous, accomodating time-varying quaratine protocols. author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this preprint (which was not peer-reviewed) is the . https://doi.org/10.1101/2020.02. 29 Step function of pi(t) change_time <-c("01/23/2020","02/04/2020","02/08/2020") pi0<-c(1.0,0.9,0.5,0.1) res.step <-tvt.eSIR(Y,R,begin_str="01/13/2020",death_in_R = 0.4, T_fin=200,pi0=pi0,change_time=change_time,dic=T, casename="Hubei_step",save_files = T, save_mcmc=F,M=5e2,nburnin = 2e2) res.step$plot_infection res.step$plot_removed res.step$dic_val ### continuous exponential function of pi(t) res.exp <-tvt.eSIR(Y,R,begin_str="01/13/2020",death_in_R = 0.4, T_fin=200,exponential=TRUE,dic=F,lambda0=0.05, casename="Hubei_exp",save_files = F,save_mcmc=F, M=5e2,nburnin = 2e2) res.exp$plot_infection #res.exp$plot_removed ### without pi(t), the standard state-space SIR model without intervention res.nopi <-tvt.eSIR(Y,R,begin_str="01/13/2020",death_in_R = 0.4, T_fin=200,casename="Hubei_nopi",save_files = F, M=5e2,nburnin = 2e2) res.nopi$plot_infection #res.nopi$plot_removed All rights reserved. No reuse allowed without permission. author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this preprint (which was not peer-reviewed) is the . https://doi.org/10.1101/2020.02. 29 author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this preprint (which was not peer-reviewed) is the . https://doi.org/10.1101/2020.02. 29.20029421 doi: medRxiv preprint

Song Lab at UM. eSIR: Extended state-space SIR models

A monte carlo approach to nonnormal and nonlinear state-space modeling

Qing Gong, et al. Clinical characteristics and intrauterine vertical transmission potential of covid-19 infection in nine pregnant women: a retrospective review of medical records

World Health Organization, et al. Plague manual: epidemiology, distribution, surveillance and control

Bat coronaviruses in china

Return of the coronavirus: 2019-ncov

Clinical characteristics of 2019 novel coronavirus infection in china. medRxiv

Ahmet Tural, et al. First case of 2019 novel coronavirus in the united states

Artificial intelligence forecasting of covid-19 in china

Clinical features of patients infected with 2019 novel coronavirus in wuhan, china. The Lancet

The continuing 2019-ncov epidemic threat of novel coronaviruses to global health-the latest 2019 novel coronavirus outbreak in wuhan, china

A state space model for multivariate longitudinal count data

Stationary state space models for longitudinal data

Real-time estimation of the risk of death from novel coronavirus (covid-19) infection: Inference using exported cases

Containing papers of a mathematical and physical character

Estimation of the epidemic properties of the 2019 novel coronavirus: A mathematical modeling study

Trend and forecasting of the covid-19 outbreak in china

Assessing the tendency of 2019-ncov (covid-19) outbreak in china. medRxiv

Molecular epidemiology, evolution and phylogeny of sars coronavirus

Modeling super-spreading events for infectious diseases: case study sars

Summary of probable sars cases with onset of illness from

Forecasting seasonal influenza with a state-space sir model. The annals of applied statistics

Epidemic analysis of covid-19 in china by dynamical modeling

rjags: Bayesian Graphical Models using MCMC

Transmission of 2019-ncov infection from an asymptomatic contact in germany

Responding to global infectious disease outbreaks: lessons from sars on the role of risk perception, communication and management. Social science & medicine

Correlated Data Analysis: Modeling, Analytics, and Applications

Monte carlo kalman filter and smoothing for multivariate discrete state space models

One severe acute respiratory syndrome coronavirus protein complex integrates processive rna polymerase and exonuclease activities

Tracking and predicting covid-19 epidemic in china mainland. medRxiv

A novel coronavirus outbreak of global health concern. The Lancet

Emergencies preparedness, response. pneumonia of unknown origin -china. disease outbreak news

ggplot2: Elegant Graphics for Data Analysis

Timely research papers about covid-19 in china. The Lancet

Clinical findings in a group of patients infected with the 2019 novel coronavirus (sars-cov-2) outside of wuhan, china: retrospective case series

Estimation of the reproductive number of novel coronavirus (covid-19) and the probable outbreak size on the diamond princess cruise ship: A data-driven analysis

Signal extraction and breakpoint identification for array cgh data using robust state space model

A novel coronavirus from patients with pneumonia in china

Software website: https://github.com/lilywang1988/eSIR A Runge-Kutta Approximation A.1 Approximation in the Basic SIR modelThe forth order Runge-Kutta(RK4) method gives an approximate of f pθ t´1 , β, γq in equation (4) as follows:the time series of daily observed infected compartment proportions.R the time series of daily observed removed compartment proportions, including death and recovered.phi0 a vector of values of the dirac delta function φ t . Each entry denotes the proportion that will be qurantined at each change time point. Note that all the entries lie between 0 and 1, its default is NULL. change_time the change points over time corresponding to phi0, to formulate the dirac delta function φ t ; its defalt value is NULL.begin_str the character of starting time, the default is "01/13/2020".T_fin the end of follow-up time after the beginning date begin_str, the default is 200.nchain the number of MCMC chains generated by rjags, the default is 4.nadapt the iteration number of adaptation in the MCMC. We recommend using at least the default value 1e4 to obtained fully adapted chains.M the number of draws in each chain, with no thinning. The default is M=5e2 but suggest using 5e5.thn the thinning interval between mixing. The total number of draws thus would become round(M/thn)*nchain. The default is 10.nburnin the burn-in period. The default is 2e2 but suggest 2e5.dic logical, whether compute the DIC (deviance information criterion) for model selection.death_in_R the numeric value of average of cumulative deaths in the removed compartments. The default is 0.4 within Hubei and 0.02 outside Hubei.casename the string of the job's name. The default is "qh.eSIR".beta0 the hyperparameter of average transmission rate, the default is the one estimated from the SARS first-month outbreak (0.2586).gamma0 the hyperparameter of average removed rate, the default is the one estimated from the SARS first-month outbreak (0.0821).

the hyperparameter of the mean reproduction number R0. The default is thus the ratio of beta0/gamma0, which can be specified directly.gamma0_sd the standard deviation for the prior distrbution of the removed rate γ, the default is 0.1.

the standard deviation for the prior disbution of R0, the default is 1.file_add the string to denote the location of saving output files and tables. 3] for θ R t . The posterior draws of the prevalence process of the quarantine compartment can be obtained via thetaQ_p and thetaQ_pp. Moreover, the input and predicted proportions Y, Y_pp, R and R_pp can also be retrieved. The prevalence and prediceted proportion matrices (ci); the respective means and ci's of the reporduction number (R0), removed rate (γ), transmission rate (β).plot_infection plot of summarizing and forecasting for the infection compartment, in which the vertial blue line denotes the last date of data collected (t ), the vertial darkgray line denotes the deacceleration point (first turning point) that the posterior mean first-derivative of infection prevalenceθ I t achieves the maximum, the vertical purple line denotes the second turning point that the posterior mean firstderivative infection proportionθ I t equals zero, the darkgray line denotes the posterior mean of the infection prevalence θ I t and the red line denotes its posterior median.plot_removed plot of summarizing and forecasting for the removed compartment with lines similar to those in the plot_infection. The vertical lines are identical, but the horizontal mean and median correspond to the posterior mean and median of the removed process θ R t . An additional line indicates the estimated death prevalence from the input death_in_R. spaghetti_plot 20 randomly selected MCMC draws of the first-order derivative of the posterior prevalence of infection, namelyθ I t . The black curve is the posterior mean of the derivative, and the vertical lines mark times of turning points corresponding respectively to those shown in plot_infection and plot_removed. Moreover, the 95% credible intervals of these turning points are also highlighted by semitransparent rectangles.first_tp_mean the date t at whichθ I t = 0, calculated as the average of the time points with maximum posterior first-order derivativesθ I t ; this value may be slightly different from the one labeled by the "darkgreen" lines in the two plots plot_infection and plot_removed, which indicate the stationary point such that the first-order derivative of the averaged posterior of θ I t reaches its maximum. first_tp_mean the date t at whichθ I t = 0, calculated as the average of the time points with maximum posterior first-order derivativesθ I t ; this value may be slightly different from the one labeled by the "darkgreen" lines in the two plots plot_infection and plot_removed, which indicate the stationary point such that the first-order derivative of the averaged posterior of θ I t reaches its maximum. first_tp_ci fwith first_tp_mean, it reports the corresponding credible interval and median.second_tp_mean the date t at which θ I t = 0, calculated as the average of the stationary points of all of posterior first-order derivativesθ I t ; this value may be slightly different from the one labeled by the "pruple" lines in the plots of plot_infection and the time-dependent transission rate modifier π(t) between 0 and 1. change_time the change points over time for step function pi, defalt value is NULL. exponential logical, whether π(t) is exponential exp(−λ 0 t) or not; the default is FALSE. lambda0the rate of decline in the exponential survival function in exp(−λ 0 t). begin_str the character of starting time, the default is "01/13/2020". the hyperparameter of average removed rate, the default is the one estimated from the SARS first-month outbreak (0.0821).

the hyperparameter of the mean reproduction number R0. The default is thus the ratio of beta0/gamma0, which can be specified directly. gamma0_sd the standard deviation for the prior distrbution of the removed rate γ, the default is 0.1.

the standard deviation for the prior disbution of R0, the default is 1. casename the string of the job's name. The default is "tvt.eSIR". file_add the string to denote the location of saving output files and tables. save_files logical, whether save (TRUE) results or not (FALSE). This enables to save summary tables, trace plots, and plots of the posterior means of the first-order derivatives of the infection prevalence process θ I t . save_mcmc logical, whether save (TRUE) all the MCMC outputs or not (FALSE).The output file will be an .RData file named by the casename. We include arrays of prevalence values of the three compartments with their matrices of posterior draws up to the last date of the collected data as theta_p[,,1] and afterwards as theta_pp[,,1] for θ S t , theta_p[,,2] and theta_pp[,,2] for θ I t , and theta_p[,,3] and theta_pp[,,3] for θ R t . Moreover, the input and predicted proportions Y, Y_pp, R and R_pp can also be retrieved. The prevalence and prediceted proportion matrices have rows for MCMC replicates, and columns for days. The MCMC posterior draws of other parameters including beta_p, gamma_p, R0_p, and variance controllers k_p, lambdaY_p, lambdaR_p are also available. save_plot_data logical, whether save the plotting data or not. 6 tvt.eSIR

We fit a state-space model with extended SIR, in which a time-varying transmission rate modifier π(t) (between 0 and 1) is introcuded to model. This allows us to accommodate quarantine protocol changes and ignored resources of hospitalization. The form of reducing rate may be a step-function with jomps at times of big policy changes or a smooth exponential survival function exp(−λ 0 t). The parameters of the function and change points, if any, should be predefined.Value casename the predefined casename.incidence_mean mean incidence.incidence_ci 2.5%, 50%, and 97.5% quantiles of the incidences.out_table summary tables including the posterior mean of the prevalance processes of the 3 states compartments (θ S t , θ I t , θ R t ) at last date of data collected ((t ) decided by the lengths of your input data Y and R), and their respective credible inctervals (ci); the respective means and ci's of the reporduction number (R0), removed rate (γ), transmission rate (β).plot_infection plot of summarizing and forecasting for the infection compartment, in which the vertial blue line denotes the last date of data collected (t ), the vertial darkgray line denotes the deacceleration point (first turning point) that the posterior mean first-derivative of infection prevalenceθ I t achieves the maximum, the vertical purple line denotes the second turning point that the posterior mean firstderivative infection proportionθ I t equals zero, the darkgray line denotes the posterior mean of the infection prevalence θ I t and the red line denotes its posterior median.plot_removed plot of summarizing and forecasting for the removed compartment with lines similar to those in the plot_infection. The vertical lines are identical, but the horizontal mean and median correspond to the posterior mean and median of the removed process θ R t . An additional line indicates the estimated death prevalence from the input death_in_R. spaghetti_plot 20 randomly selected MCMC draws of the first-order derivative of the posterior prevalence of infection, namelyθ I t . The black curve is the posterior mean of the derivative, and the vertical lines mark times of turning points corresponding respectively to those shown in plot_infection and plot_removed. Moreover, the 95% credible intervals of these turning points are also highlighted by semitransparent rectangles.first_tp_mean the date t at whichθ I t = 0, calculated as the average of the time points with maximum posterior first-order derivativesθ I t ; this value may be slightly different from the one labeled by the "darkgreen" lines in the two plots plot_infection and plot_removed, which indicate the stationary point such that the first-order derivative of the averaged posterior of θ I t reaches its maximum. first_tp_ci fwith first_tp_mean, it reports the corresponding credible interval and median.second_tp_mean the date t at which θ I t = 0, calculated as the average of the stationary points of all of posterior first-order derivativesθ I t ; this value may be slightly different from the one labeled by the "pruple" lines in the plots of plot_infection and plot_removed. The latter indicate stationary t at which the first-order derivative of the averaged posterior of θ I t equals zero. second_tp_ci with second_tp_mean, it reports the corresponding credible interval and median.