key: cord-0591515-0ex39yxr
authors: Wang, Li; Wang, Guannan; Gao, Lei; Li, Xinyi; Yu, Shan; Kim, Myungjin; Wang, Yueying; Gu, Zhiling
title: Spatiotemporal Dynamics, Nowcasting and Forecasting of COVID-19 in the United States
date: 2020-04-29
journal: nan
DOI: nan
sha: 624c41048c64f04917f460f0e2a36da5f4ecfb1e
doc_id: 591515
cord_uid: 0ex39yxr

In response to the ongoing public health emergency of COVID-19, we investigate the disease dynamics to understand the spread of COVID-19 in the United States. In particular, we focus on the spatiotemporal dynamics of the disease, accounting for the control measures, environmental effects, socioeconomic factors, health service resources, and demographic conditions that vary from different counties. In the modeling of an epidemic, mathematical models are useful, however, pure mathematical modeling is deterministic, and only demonstrates the average behavior of the epidemic; thus, it is difficult to quantify the uncertainty. Instead, statistical models provide varieties of characterization of different types of errors. In this paper, we investigate the disease dynamics by working at the interface of theoretical models and empirical data by combining the advantages of mathematical and statistical models. We develop a novel nonparametric space-time disease transmission model for the epidemic data, and to study the spatial-temporal pattern in the spread of COVID-19 at the county level. The proposed methodology can be used to dissect the spatial structure and dynamics of spread, as well as to forecast how this outbreak may unfold through time and space in the future. To assess the uncertainty, projection bands are constructed from forecast paths obtained in bootstrap replications. A dashboard is established with multiple R shiny apps embedded to provide a 7-day forecast of the COVID-19 infection count and death count up to the county level, as well as a long-term projection of the next four months. The proposed method provides remarkably accurate short-term prediction results.

An essential question for developing a defense against COVID-19 is how far the virus will spread and how many lives it will claim. It is not clear to anyone where this crisis will lead us. Understanding the dynamics of the disease is therefore undoubtedly critical. One way to answer these questions is through scientific modeling. Several attempts have been made to model and forecast the spread and mortality of COVID-19 (Elmousalami and Hassanien, 2020; Fanelli and Piazza, 2020; Kucharski et al., 2020; Pan et al., 2020; Sun et al., 2020; Wang et al., 2020d; Zhang et al., 2020) .

In epidemiology, the fundamental concept of infectious disease is the investigation of how infections spread. Mathematical methods, such as the class of susceptible-infectious-recovered (SIR) models (Allen et al., 2008; Chen et al., 2020; Lawson et al., 2016; Pfeiffer et al., 2008; Wakefield et al., 2019; Weiss, 2013) , are widely used in epidemics to capture the dynamic process of the spread of the infectious disease. However, pure mathematical modeling is deterministic, and only demonstrates the average behavior of the epidemic. In addition, its focus is often on the form of models, not the parameter estimation for observed data; thus, it is difficult to quantify the uncertainty. Instead, statistical models provide a varying characterization of different types of errors. When it comes to analyzing the reported numbers of infectious diseases, other factors may also be responsible for temporal or spatial patterns. The spread of the disease varies a lot across different geographical regions. Local area features, like socioeconomic factors and demographic conditions, can dramatically influence the course of the epidemic. These data are usually supplemented with the population information at the county level. In addition, the capacity of the health care system, and control measures, such as governmentmandated social distancing, also have a significant impact on the spread of the epidemic. SIR models with assumptions of random mixing can overestimate the health service needed by not taking into account the behavioral change and government-mandated action. In this paper, we propose a class of novel nonparametric dynamic epidemic models to analyze the infectious disease data by incorporating the spatiotemporal structure and the effect of explanatory variables.

In this paper, we borrow the mechanistic rules from the SIR model by including three compartments: infected, susceptible and removed states, and develop a class of data-driven statistical models to reconstruct the spatiotemporal dynamics of the disease transmission. We build a novel space-time epidemic modeling framework for the infected count data, to study the spatial-temporal pattern in the spread of COVID-19 at the county level. The proposed methodology can be used to dissect the spatial structure and dynamics of spread, as well as to assess how this outbreak may unfold through time and space.

Given an parametric epidemic model, the typical inference problem involves estimating the parameters associated with the parametric models from the data to hand. Such specifications are ad hoc, and if misspecified, can lead to substantial estimation bias problems. In practice, this question might be addressed by considering alternative parametric models, or sensitivity analyses if some of the underlying model parameters are assumed to be known. Nonparametric approaches to fitting epidemic models to the data have received relatively little attention in the literature possibly due to the lack of data.

By allowing the infection to depend on time and location, we consider a generalized additive varying coefficient model to estimate the unobserved process of the disease transmission. By adopting a nonparametric approach, we do not impose a particular parametric structure, which significantly enhances the flexibility of the epidemic models that practitioners use. For our model estimation, we propose a quasi-likelihood approach via the penalized spline approximation and the iteratively reweighted least squares technique.

Prediction models for COVID-19 at the county-level that combine local characteristics and actions are very beneficial for the community to understand the dynamics of the disease spread and support decision making at a time when they are urgently needed. Models can help predict rates of new infections, and estimate when the strain on the hospital system could peak. In this paper, we consider both the short-term and long-term impact of the virus. To assess the uncertainty associated with the prediction, we develop a projection band constructed based on the envelope of the bootstrap forecast paths, which are closest to the forecast path obtained on the basis of the original sample. Based on our research findings, we develop multiple R shiny apps embedded into a COVID-19 dashboard, which provides a 7-day forecast and a 4-month forecast of COVID-19 infected and death count at both the county level and state level.

The rest of the paper is organized as follows. Section 2 introduces our case study on COVID-19, including a detailed description of the data. Section 3 outlines the nonparametric spatiotemporal modeling framework and describes how to incorporate additional covariates. Section 4 introduces our estimation method, presents our algorithms, and discusses the details of the implementation. Section 5 starts with a description of the prediction of the infection count and provides the uncertainty with the band of the forecast path. Section 6 shows the results and findings of the case study. Section 7 concludes the paper with a discussion. The supplementary materials (Wang et al., 2020c) contain additional figures, and an animation of the estimation results.

The goal of this study is threefold. First, we develop a new dynamic epidemic modeling framework for public health surveillance data to study the spatial-temporal pattern in the spread of COVID-19. We aim to investigate whether the proposed model could be used to guide the modeling of the dynamic of the spread at the county level by moving beyond the typical theoretical conceptualization of context where a county's infection is only associated with its own features. Second, to understand the factors that contribute to the spread of COVID-19, we model the daily infected cases at the county level in consideration of the demographic, environmental, behavioral, socioeconomic factors in the U.S. Third, we project the spatial-temporal pattern of the spread of the virus in the U.S. For the short-term forecast, we provide the prediction of the daily infection count and death count up to the county level. As for the long-term forecast, we project the total infected and death cases in the next three months.

2.2 Epidemic Data from the COVID-19 Outbreak in the U.S.

This study analyzes data from the reported confirmed COVID-19 infections and deaths at the county level, which are reported by the 3,104 counties from the 48 mainland U.S. states and the District of Columbia. The aggregated COVID-19 cases are from January 20 until April 25, 2020. The data are collected, compiled and cleaned from a combination of public sources that aim to facilitate the research effort to confront COVID-19, including Health Department Website in each state or region, the New York Times (NYT, 2020), the COVID-19 Data Repository by the Center for Systems Science and Engineering at Johns Hopkins University (CSSE, 2020), and the COVID Tracking Project (Atlantic, 2020).

These data sources automatically updated every day or every other day. We have created a dashboard https://covid19.stat.iastate.edu/ to visualize and track the infected and death cases, which was launched on March 27, 2020.

We consider a variety of county-level characteristics as covariate information in our study, which can be divided into six groups. The data sources and the operational definitions of these features are discussed as follows.

Policies. Government declarations are used to identify the dates that different jurisdictions implemented various social distancing policies (emergency declarations, school closures, bans on large gatherings, limits on bars, restaurants and other public places, the deployment of severe travel restrictions, and "stay-at-home" or "shelter-in-place" orders). President Trump declared a state of emergency on March 13, 2020, to enhance the federal government response to confront the COVID-19. By March 16, 2020, every state had made an emergency declaration. Since then, more severe social distancing actions have been taken by the majority of the states, especially those hardest hit by the pandemic. We compiled the dates of executive orders by checking national and state governmental websites, news articles and press releases. Demographic Characteristics. To capture the demographic characteristics of a county, five variables are considered in the analysis to describe the racial, ethnic, sexual and age structures: the percent of the population who identify as African American, the percent of the population who identify as and no husband present (factor loading = 0.81), and civilian labor force unemployment rate (factor loading = 0.56). These two factors, affluence and disadvantage, explain more than 60% of the variation.

We also incorporate the Gini coefficient to measure income inequality. The Gini coefficient, also known as Gini index, is a well-known measure for income inequality and wealth distribution in economics, with value ranging from zero (complete equality, where everyone has exactly the same income)

to one (total inequality, where one person occupies all of the income 

In this section, we propose a class of nonparametric space-time models to estimate the infection count at the area level. In the following, let Y it be the number of new cases at time t for area i, i = 1, . . . , n.

Also for area i, let I it , D it and R it be the cumulative number of active infectious, death and recovered cases at time t, and let C it be the number of cumulative confirmed cases up to time t. Then, it is clear

Further, denote N i the population for area i, and the number of susceptible subjects at time t would be

We denote U i = (U i1 , U i2 ) be the GPS coordinates of the geographic center of area i, which ranges over a bounded domain Ω ⊆ R 2 of the region under study. Let X i = (X i1 , . . . , X iq ) be the covariates of area i that is not varying with time, see the description in Section 2.3. For example, the socioeconomic factors, health service resources, and demographic conditions. Let A ijt denotes the jth dummy variable of actions or measures taken for area i at time t, and let A it = (A i1t , . . . , A ipt ) , which varies with the time.

In this paper, we consider the exponential families of distributions. The conditional density of Y

for some known functions B and C, dispersion parameter σ 2 and the canonical parameter ζ. Let

We assume that the determinants of the daily new cases of a certain area can be explained not only by the features of this area but also by the characteristics of the surrounding areas. Based on the idea of the SIR models, we propose a discrete-time spatial epidemic model comprising the susceptible, infected and removed states, and area-level characteristics. At time point t, we assume

which is modeled via a link function g as follows:

where α jt 's are unknown time-varying coefficients, β 0t (·) and β 1t (·) are unknown bivariate coefficient functions, γ kt (·), k = 1, . . . , q, are univariate functions to be estimated. The parameter r in A ij,t−r 's denotes a small delay time allowing for the control measure to be effective. For model identifiability, we assume E(γ kt ) = 0, k = 1, . . . , q. Note that exp{β 0t (u)} illustrates the transmission rate at location u, β 1t , α 0t are the mixing parameters of the contact process. The rationale for including β 1t (·) (0 < β 1t < 1) is to allow for deviations from mass action and to account for the discrete-time approximation to the continuous time model; see Finkenstädt and Grenfell (2000) ; Wakefield et al. (2019) . In many cases, the standard bilinear form may not necessarily hold. The above proposed epidemic model incorporates the nonlinear incidence rates, which represents a much wider range of dynamical behavior than those with bilinear incidence rates (Liu et al., 1987) . These dynamical behaviors are determined mainly by β 0t and β 1t . When β 1t and α 0t are both 1, it is corresponding to the standard assumption of homogeneous mixing in De Jong et al. (1995) .

Since Y it is the number of new cases at time t for area i, i = 1, . . . , n, Poisson or negative binomial (NB) might be an appropriate option for random component; see Yu et al. (2020) , and Kim and Wang (2020) . We assume that

where µ it can be modeled via the same log link as follows:

At the beginning of the outbreak, infected and death cases could be rare, so "Poisson" might be a reasonable choice of the random component to describe the distribution of rare events in a large population. As the disease progresses, the variation of infected/death count increases across counties and states. So, at the acceleration phase of the disease, the negative binomial random component might be an appropriate option to account for the presence of over-dispersion.

The above spatiotemporal epidemic model (STEM) is developed based on the foundation of epidemic modeling, but it is able to provide a rich characterization of different types of errors for modeling the uncertainty. In addition, it accounts for both spatiotemporal nonstationarity and area-level local features simultaneously. It also offers more flexibility in assessing the dynamics of the spread at different times and locations than various parametric models in the literature.

In this section, we describe how to estimate the parameters and nonparameteric components in the proposed STEM model (2).

To capture the temporal dynamics, we consider the moving window approach. For the current time t, and nonnegative smoothness parameters λ for = 0, 1, we consider the following penalized quasi-likelihood problem:

where t 0 + 1 is the window width for the model fitting, and it can can be selected by minimizing the prediction errors or maximizing the correlation between the predicted and observed values. The energy functional is defined as follows:

where ∇ q u j β(u) is the qth order derivative in the direction u j , j = 1, 2, at any location u = (u 1 , u 2 ) . Note that, except for parameters {α jt } p j=0 , other functions are related to curse of dimensionality due to the nature of functions. To overcome this difficulty, we introduce the basis expansion for univariate and bivariate functions discussed below.

The univariate additive components {γ kt (·)} q k=1 and the spatially varying coefficient components (2) are approximated using univariate polynomial spline and bivariate penalized splines over triangulation (BPST), respectively. The BPST method is well known to be computationally efficient to deal with data distributed on complex domains with irregular shape or with holes inside; see the details in Lai and Schumaker (2007) , Lai and Wang (2013) and Sangalli et al. (2013) . We introduce a brief review of univariate splines and bivariate splines in the following.

, δk) be the space of the polynomial splines of order + 1, which are polynomial functions with -degree (or less) on intervals [δk, j, δ k,j+1 ), j = 0, . . . , J n − 1, and [δk, J n , δ k,Jn+1 ], and have − 1 continuous derivatives globally. Next, let U 0 k = {φ ∈ U k : Eφ(X k ) = 0}, which ensures that the spline functions are centered; see Yu et al. (2020) .

then Eφ kj (X k ) = 0 and Eφ 2 kj (X k ) = 1. Suppose the nonlinear component can be well approximated by a spline function so that, for all

and ξ kt = (ξ kt1 , . . . , ξ kt,Jn+ +1 ) is a vector of coefficients.

For the bivariate coefficient functions β 0t (·) and β 1t (·) in the STEM model (2), we introduce bivariate spline over triangulation. The spatial domain Ω with either an arbitrary shape or holes inside can be partitioned into finitely many M triangles, T 1 , . . . , T M , that is, Ω = ∪ M m=1 T m , and any nonempty intersection between a pair of triangles in is either a shared vertex or a shared edge. A collection of these triangles, := {T 1 , . . . , T M }, is called a triangulation of the domain Ω Lai and Schumaker (2007); Lai and Wang (2013) . For a triangle T ∈ in R 2 with vertices v i , for i = 1, 2, 3, numbered in counter-clockwise order, we can write T :

are called the "barycentric coordinates" of point v ∈ T . The Bernstein basis polynomials of degree 

called the B-form of P relative to T . Let C r (Ω) be the space of rth continuously differentiable functions over the domain Ω. Given 0 ≤ r < d and a triangulation , the spline space of degree d and smoothness r over is defined as

For triangulation with M triangles, denote a set of bivariate Bernstein basis polynomials for In practice, the triangulation can be obtained through varieties of software; see for example, the "Delaunay" algorithm (delaunay.m in MATLAB or DelaunayTriangulation in MATHEMATICA), the R package "Triangulation" , and the "DistMesh" Matlab code. The bivariate spline basis are generated via the R package "BPST" .

Considering the basis expansion, for the current time t, the maximization problem (3) 

In addition, we consider the energy functional E(β ) in (4) can be approximated by E(B θ ) = θ Pθ , for = 0, 1, where P is the block diagonal penalty matrix. Introducing the constraint matrix H which satisfies Hθ = 0, = 0, 1, is a common strategy to reflect global smoothness in S r d ( ) in (5).

Directly solving the optimization problem in (6) is not straightforward due to the smoothness constraints inside. Instead, suppose that the rank r matrix H is decomposed into QR = (Q 1 Q 2 ) R 1 R 2 , where Q 1 is the first r columns of an orthogonal matrix Q, and R 2 is a matrix of zeros, which is a submatrix of an upper triangle matrix R. Then, reparametrization of θ = Q 2 θ * for some θ * , = 0, 1, enforces Hθ = 0. Thus, the constraint problem in (6) can be changed to an unconstrained optimization problem as follows:

Let ( θ * 0t , θ * 1t ) , ( α 0t , α 1t , . . . , α pt ) , and ( ξ 1t , . . . , ξ qt ) be the maximizers of (7) at time point t. We obtain the estimators of β t (·):

the estimator of α jt is α jt , j = 1, . . . , p, and the spline estimator

In addition, let the mean vector µ(β * ) = {µ is } n,t i,s=1 = {g −1 (η is )} n,t i,s=1 , the variance function matrix V = diag{V (µ is )} n,t i,s=1 , the diagonal matrix G = diag{g (µ is )} n,t i,s=1 with the derivative of link function as element, and the weight matrix W = diag[{V (µ is )g (µ is ) 2 } −1 w st , i = 1, . . . , n, s = 1, . . . , t], where w st = I(t − s ≥ t 0 ).

In order to numerically solve the minimization in (7), we design the penalized iteratively reweighted least squares (PIRLS) algorithm as described below. Suppose at the jth iteration, we have µ (j) = µ(α (j) , ξ (j) , θ * (j) ), η (j) = η(α (j) , ξ (j) , θ * (j) ) and V (j) . Then at (j + 1)th iteration, we consider the following objective function:

Take the first order Taylor expansion of µ(α, ξ, θ * ) around (α (j) , ξ (j) , θ * (j) ), then

Algorithm 1 The PIRLS Algorithm.

Step 1. Start with the initial values η (0) and µ (0) . Calculate weight matrix W (0) and working variable

is ), i = 1, . . . , n, and s = 1, . . . , t, .

Step 2. Set step j = 0.

while {α, ξ, θ * } not converge do (i) Obtain α (j+1) , ξ (j+1) , θ * (j+1) by minimizing the (8) with respect to α, ξ, θ * , and update η (j+1) = η(α (j+1) , ξ (j+1) , θ * (j+1) ) and µ (j+1) = µ(α (j+1) , ξ (j+1) , θ * (j+1) ).

. . , n, s = 1, . . . , t, using η (j+1) and µ (j+1) .

is for s = 1, . . . , t. The PIRLS procedure is represented in Algorithm 1. In the numerical analysis, we set µ

is ) as the initial values to start the iteration.

To fit the proposed STEM and make predictions for cumulative positive cases, one obstacle is the lack of direct observations for the number of active cases, I it . Instead, the most commonly reported number is the count of total confirmed cases, C it . Some departments of public health also release information about fatal cases D it and recovered cases R it , while such kind of data tends to suffer from missingness, large error and inconsistency due to its difficulty in data collection; see the discussions in KCRA (2020).

Based on the fact that I it = C it − R it − D it , we attempt to modeling D it and R it in order to facilitate the estimation and prediction of newly confirmed cases Y it based on the proposed STEM model. Let ∆D it = D it − D i,t−1 be the new fatal cases on day t, and following similar notations in the STEM model (2), we assume that

where Due to the lack of data, we are no longer able to use all the explanatory variables discussed above to model daily new recovered cases. Instead, we mimic the relationship between the number of recovered and active cases from some Compartmental models in epidemiology (Anastassopoulou et al., 2020; Siettos and Russo, 2013) . At current time point t, we assume that ∆R is = ν t I i,s−1 + ε is , s = t−t 0 , . . . , t, in which the recovery rate ν t enables us to make reasonable predictions for future recovered patients counts and provide researchers with the foresight of when the epidemic will end. The rate ν t can be either estimated from available state-level data, or obtained from prior medical studies due to the under-reporting issue in actual data.

Early in an epidemic, the quality of data on infections, deaths, tests, and other factors often are limited by underdetection or inconsistent detection of cases, reporting delays, and poor documentation, all of which affect the quality of any model output. There are many counties with zero daily counts at the early stage of disease spread. Therefore, we consider zero-inflated models based on a zero-inflated probability distribution, which allows for frequent zero-valued observations. Following the works by Arab et al. (2012) , Beckett et al. (2014) and Wood et al. (2016) , we assume the observed counts Y it contributes to a zero-inflated Poisson distribution

where µ it follows (2), and p it = logit(η it ) with η it = a 1 + {b + exp(a 2 )} log(µ it ). Here we take b = 0 and a 1 , a 2 are estimated with the roughness parameters. See Wood et al. (2016) for more details in the estimation of a 1 and a 2 .

Similarly, we also consider zero-inflated models, in which we assume the observed count ∆D it contributes to a zero-inflated Poisson distribution parallel fashion to (a 1 , a 2 ) .

To understand the impact of COVID-19, it requires accurate forecast for the spread of infectious cases along with analysis of the number of death and recovery cases. In this section, we describe our prediction procedure of these counts, specifically, we are interested in predicting Y it , I it and D it . We also provide the prediction intervals to quantify the uncertainty of the prediction.

We consider an h-step ahead prediction. As described in Section 3, if we observe C is , I is , R is , D is for s = 1, . . . , t, then the infection model and fatal cases model can be fitted by regressing

,s=1 , respectively. The predictions of infectious count at time t + 1 and iteratively at t + h are

Then, the predicted number of active cases and susceptible cases are

. The above one-step predicted values can be thus plugged back into equation (10) to obtain the predictions for the following days by repeating the same procedure.

There is substantial interest in the problem of how to quantify the uncertainty for the forecasts with a succession of periods. To construct the band for forecast path {Y i,t+h , h = 1, . . . , H}, we consider the bootstrap method (Staszewska-Bystrova, 2009 ), in which the bootstrap samples are generated using the bias-corrected bootstrap procedure; see Algorithms 2 and 3 for the details.

In this section, we present our analysis results and findings for the COVID-19 study.

For the model estimation, we consider the data collected from March 23 to April 25. Based on the data described in Section 2, we consider the following model for the infection count:

Algorithm 2 A bootstrap procedure to correct the bias.

Step 1.

Fit models (2) and (9) using (Y is ,

and

Step 2. Generate bootstrap samples to correct the bias in the estimator of the coefficients. (2) and (9) 

Step 3. Calculate the bias of the coefficients based on the above bootstrap samples. For example, for = 0, 1, let bias( β ) = B −1 B b=1 β b − β , and let β c = β − bias( β ) be the corrected coefficient function. Similarly, we obtain the bias-corrected coefficients of α t and γ t , denoted by α c t , γ c t , respectively.

Algorithm 3 A bootstrap procedure to calculate the prediction band.

Step 1. Generate bootstrap samples to construct prediction band.

Step 2. Construct the 100(1 − α)% prediction band by the above B bootstrap paths with the most extreme αB paths discarded. Start with setting κ = 0. while κ < αB do (i) For each forecast time point h = 1, . . . , H (there are in total B − κ constructed paths available), identify the largest and the smallest bootstrap forecast values, and the associated paths. Notice there are 2H extreme values and at most corresponding 2H paths.

(ii) Compute the distances from each of the bootstrap path (at most 2H) to the bootstrap sample, based on:

Remove the path with the largest distance, and set κ = κ + 1. end

Step 3. Obtain the 100(1 − α)% prediction band from the envelope of the remaining (1 − α)B bootstrap paths.

where i = 1, . . . , 3104. For the death count, we consider the following semiparametric model:

We use 14 days as an estimation window to examine how the covariates affect the new infected cases and fatal cases. The roughness parameters are selected by the generalized cross-validation (GCV).

The performance of the univariate/bivariate splines is dependent upon the choice of the knots/triangulation.

Knots selection and triangulation selection are one of the key ingredients for obtaining satisfactory results. We use cubic splines with 2 interior knots for the univariate spline smoothing. We generate the triangulations according to "max-min" criterion, which maximizes the minimum angle of all the angles of the triangles in the triangulation. Figure 2 shows the triangulations adopted by our method: 1 (119 triangles with 87 vertices) and 2 (522 triangles with 306 vertices). By the "max-min" criterion, 2 is better than 1 , but it also significantly increases the number of parameters to estimate. As a trade-off, for the estimation of β 0 (·) and β 1 (·), we adopt the finer triangulation 2 , and use the rough triangulation 1 to estimate β D 0 (·).

First, we report our findings from modeling the infection count using model (11). To examine the effect of the control measures ("shelter-in-place" or "stay-at-home" order) after 7 days, we test the hypothesis:

H 0 : α 2t = 0 in model (11). We found that the p-values are smaller than 0.0001 at almost all the time points.

The estimated coefficient functions of β 0t (·) and β 1t (·) in model (11) using the data from March 23 to April 25, 2020, are shown in the supplementary materials (Wang et al., 2020c) . We can see that the transmission rate varies at different locations and in different phases of the outbreak, and β 1t (·) is also varying, which indicates that the homogeneous mixing assumption of the simple SIR models does not hold. Transmission rate is high in the majority of states at the end of March, however, in many states, it becomes much lower in the middle of April or late of April.

Next, we examine the effect of the predictors and test the following hypothesis of the individual functions H 0 : γ kt (·) = 0, k = 1, . . . , 12. The figures on Pages 2-14 in the supplementary materials (Wang et al., 2020c) show the estimate and the SCB of the nonparametric functions γ kt , k = 1, . . . , 12, at different time point in the STEM model (11).

From these figures, we can find the effect of the county-level predictors on the spread. Healthcare coverage is essential for a person's health status, and sometimes, a self-selection process. After controlling to social-economic factors, the percent of persons under 65 years without health insurance has a significant impact on the COVID-19 breakout in the community. We can observe a sharp increasing pattern between the non-healthy-coverage rate and the COVID-19 infection rate. An under-covered population is much easier to be infected with the virus. Because there are more uninsured people in the urban area, an increasing pattern is observed in the urban rate impact analysis. "PD" is often considered to have a linear relationship with COVID-19 infection cases in most studies and news reports.

Our results are consistent with the intuition. The higher the "PD" is, the higher the logarithm of new COVID-19 cases is. The local healthcare expenditure, "EHPC", has a similar impact on COVID-19

infections. The elderly population's impact pattern is an inverse U-shape. This pattern is because they are easier to be infected. However, when the older population dominates the community, people are less active and more risk-averse, thus stay home more often, so that it hinders the spread of the virus.

We report our findings from estimating the death count using model (12).

To examine the effect of the infection count, control measures and the county-level predictors, we test the following hypothesis: . . . , 12, in model (12) . Figure 1 plots the p-values of the above tests. From Figure 1 , we find that the "Infection", "AA", "HL", "Disadvantage", and "Old" are very significant with p-values are smaller than 0.05 all the time.

The rest of the predictors are significant on some days, but insignificant on other days.

Page 17 on the supplementary materials (Wang et al., 2020c) shows the pattern of β D 0t (·) in model (12). From this animation, we observe a general decrease pattern in the entire U.S. from March 23 to April 25, 2020.

In this section, we investigate the short-term prediction performance of the proposed method. In the following, we consider h-day ahead prediction based on the forecasting method described in Section 5.

An R shiny app (Wang et al., 2020a) is developed to provide a 7-day forecast of COVID-19 infection and death count at both the county level and state level, in which the state level forecast is obtained by aggregating forecasts across counties in each state. This app was launched on 03/27/2020 for displaying results of our forecasting.

We demonstrate the accuracy of the STEM for h-day ahead predictions, h = 1, . . . , 7. For comparison, we also consider the two naive models that assume a linear or exponential growth pattern for total confirmed cases for each county:

• (Linear) E(C it |t) = β i0 + β i1 t, Var(C it |t) = σ 2 i , i = 1, . . . , n;

• (Exponential, Poisson) log{E(C it |t)} = β i0 + β i1 t, Var(C it |t) = exp(β i0 + β i1 t), i = 1, . . . , n; and the following simple epidemic method (EM):

• (EM) log(µ it ) = β 0 + β 1 log(I i,t−1 ), log(µ D it ) = β D 0 + β D 1 log(I i,t−1 ), i = 1, . . . , n.

We consider the data collected from March 23 to April 18. To predict the counts in the next 7 days, we use the previous 9 days as a training set for model fit. To show the accuracy of different methods, we compute the following root mean-squared prediction errors (RMSPEs):

where T = 18. 

There has been an increasing public health concern regarding the adequacy of resources to treat infected cases. It is well known that hospital beds, intensive care units (ICU), and ventilators are critical for the treatment of patients with severe illness. To project the timing of the outbreak peak and the number of health resources required at a peak, in this section, we also provide the long-term forecast of the infection count and death count.

In Figure 3 , we show the reported COVID-19 confirmed infectious cases and deaths, and the corresponding predicted counts for the next four months in the State of New York based on the observed data from April 16-22, 2020. Given the lack of reliable recovered data, we consider two different daily recovery rates: 0.10 and 0.15.

Based on our research results, we develop an R shiny app Wang et al. (2020b) to provide a forecast of COVID-19 infection count and death count for the next four months. The forecast for other states can be found from Wang et al. (2020b) , which is updated every week.

This work has aimed to bridge the gap between mathematical models and statistical analysis in the infectious disease study. In this paper, we created a state-of-art interface between mathematical models and statistical models for understanding and forecasting the dynamic pattern of the spread of infectious diseases. Our proposed model enhances the dynamics of the SIR mechanism by means of spatiotemporal analysis.

When it comes to analyzing the reported numbers of COVID-19 cases, other factors may also be responsible for temporal or spatial patterns. We investigated the spatial associations between the infection count, death count, and factors or characteristics of the counties across the U.S. by modeling the daily infected/fatal cases at the county level in consideration of the county-level factors. To examine spatial nonstationarity in transmission rate of the disease, we proposed a spatially varying coefficient model, which allows the transmission to vary from one area to another area. The proposed method can be used as an important tool for understanding the dynamic of the disease spread, as well as to assess how this outbreak may unfold through time and space.

From our empirical studies, we found that our method provides a very accurate short-term forecast in the COVID-19 study. Since our model incorporates the epidemiological mechanism, it can also be used for long-term prediction. We also provided a projection band to quantify the uncertainty of the long-term forecast path.

Based on our results, a disease mapping can easily be implemented to illustrate high-risk areas, and thus help policy making and resource allocation. Our method can also be extended to other situations, including epidemic models in which there are several types of individuals with potentially different area characteristics, or more complex models that include features such as latent periods or more realistic population structure.

Our paper did not take the under-reported issue into account. Assuming that the data used is reliable and that the future will continue to follow the past pattern of the disease, our forecasts suggest a continuing increase in the confirmed COVID-19 cases with sizable associated uncertainty. Our prediction method helps to understand where the state stands in combatting COVID-19 and give a sense of what to expect going forward. In predicting the future of the COVID-19 pandemic, many key assumptions have been based on limited data. Models may capture aspects of epidemics effectively while neglecting to account for other factors, such as the accuracy of diagnostic tests; whether immunity will wane quickly; and if reinfection could occur.

• A full list of data citations are available by contacting the corresponding author.

• The R package "STEM" of the proposed method can be downloaded from the Github Repository:

https://github.com/covid19-dashboard-us/covid19.

• The R shiny apps demonstrating the proposed methods can be found from https://covid19. stat.iastate.edu/. dummy variable for declaration of "shelter-in-place" or "stay-at-home" order Geographic Information Lat, Lon Latitude and longitude of the approximate geographic center of the county Note: The covariates with * represent that they are transformed from the original value by f (x) = log(x + δ). For example, PD * = log(PD + δ), where δ is a small number. 

Data-based analysis, modelling and forecasting of the COVID-19 outbreak

Semiparametric bivariate zeroinflated Poisson models with application to studies of abundance for multiple species

The COVID Tracking Project Data

Zeroinflated Poisson (ZIP) distribution: parameter estimation and applications to model data from natural calamities

Epidemiological and clinical characteristics of 99 cases of 2019 novel coronavirus pneumonia in Wuhan, China: a descriptive study

Most people recover from Covid-19. Here's why it's hard to pinpoint exactly how many

Novel Coronavirus COVID-19 (2019-nCoV) Data Repository

How does the transmission depend on population size

Day level forecasting for Coronavirus disease (COVID-19) spread: Analysis, modeling and recommendations

Analysis and forecast of COVID-19 spreading in China

Time series modelling of childhood diseases: a dynamical systems approach

COVID-19: Why patient recovery data is scarce

Generalized spatially varying coefficient models

Early dynamics of transmission and control of COVID-19: A mathematical modelling study

Spline Functions on Triangulations

Bivariate penalized splines for regression

Handbook of spatial epidemiology

Dynamical behavior of epidemiological models with nonlinear incidence rates

Coronavirus (Covid-19) Data in the United States

Association of public health interventions with the epidemiology of the COVID-19 outbreak in Wuhan

Spatial analysis in epidemiology

Spatial Spline Regression Models

Mathematical modeling of infectious disease dynamics

Bootstrap Confidence Bands for Forecast Paths

Tracking Reproductivity of COVID-19 Epidemic in China with Varying Coefficient SIR Model

Spatio-temporal analysis of surveillance data

BPST: Bivariate Spline over Triangulation

Triangulation

An R shiny app to visualize, track, and predict real-time infected cases of COVID-19 in the United States

An R Shiny App to predict the infected and death cases of COVID-19 in the U.S. in the next three months

Supplementary materials for 'Spatiotemporal Dynamics, Nowcasting and Forecasting of COVID-19 in the United States

An Epidemiological Forecast Model and Software Assessing Interventions on COVID-19 Epidemic in China

The SIR model and the foundations of public health

WHO Director-General's opening remarks at the media briefing on COVID-19 -11

Smoothing parameter and model selection for general smooth models

Estimation and inference for generalized geoadditive models

Prediction of the COVID-19 outbreak based on a realistic stochastic model