key: cord-1024810-4f1gg6x8
authors: Matuk, James; Guo, Xiaohan
title: Shape-restricted estimation and spatial clustering of COVID-19 infection rate curves
date: 2021-10-22
journal: Spat Stat
DOI: 10.1016/j.spasta.2021.100546
sha: 6378e0b04006f252f5788e268b9b97086484d2a8
doc_id: 1024810
cord_uid: 4f1gg6x8

The study of regional COVID-19 daily reported cases is used to understand pattern of spread and disease progression over time. These data are challenging to model due to noise that is present, which arises from failures in reporting, false positive tests, etc., and the spatial dependence between regions. In this work, we extend a recently developed Bayesian modeling framework for inference of functional data to jointly estimate and cluster daily reported cases data from US states, while accounting for spatial dependence between US states. Shape-restriction allows us to directly infer the number of extrema of a smooth infection rate curve that underlies noisy data. Other parameters in the model account for the relative timing of extrema, and the magnitude and severity of infection rates. We incorporate mobility behavior of each US state’s population into an informative prior model to account for the spatial dependence between US states. Our model corroborates past work that shows that different US states have indeed experienced COVID-19 differently, but that there are regional patterns within the US. The modeling results can be used to assess severity of infection in individual US states and trends of neighboring US states to aid pandemic planning. Retrospectively, this model can be used to see which factors (governmental, behavioral, etc.) are associated with the varying shapes of infection rate curves, which is left as future work.

In late 2019, the outbreak of Coronavirus Disease -2019 was reported in Wuhan, China, and the disease quickly spread worldwide in early 2020 (Wu, Chen and Chan, 2020) . The United States (US), which is the focus of our work, has been one of the most affected countries in the world. Within months of the first reported cases in the US, more infections were reported than in any other country, and the disease has remained a major problem for the health and safety of its citizens (Omer, Malani and del Rio, 2020) . Unmitigated growth of infections could potentially over-burden hospital resources, and poses a severe threat to individuals in populations associated with increased risk of serious illness and mortality (Wolff, Nee, Hickey and Marschollek, 2020) . These facts have brought epidemiological modeling of COVID-19 to the foreground in hopes that understanding disease progression, transmission, and pervasiveness will inform public health interventions (Thompson, 2020; Pei, Kandula and Shaman, 2020) .

Bertozzi, Franco, Mohler, Short and Sledge (2020) state that the novelty and dynamic nature of COVID-19 are the primary challenges in effective modeling, and offer a survey of modeling approaches. Many current models rely on the compartmental Susceptible-Infected-Recovered (SIR) models and its variants (Cooper, Mondal and Antonopoulos, 2020; Arenas, Cota, Gómez-Gardeñes, Gómez, Granell, Matamalas, Soriano and Steinegger, 2020; Giordano, Blanchini, Bruno, Colaneri, Filippo, Matteo and Colaneri, 2020; Sharma, Volpert and Banerjee, 2020; Xue, Jing, Miller, Sun, Li, Estrada-Franco, Hyman and Zhu, 2020) . SIR models directly model the mechanism of infection and spread, and consequently, are able to explicitly infer important quantities, such as reproductive numbers, which quantify the contagiousness of a disease. As the pandemic progresses, there is a need to understand dynamic changes in infection rates as well as differences in geographic regions. Traditional SIR models do not specialize in capturing the flexible dynamic trend of pandemic curves and can ignore important spatial information, resulting in limited applicability in spatio-temporal modelling and potential biased inference due to model misspecification, especially for a novel disease.

In this work, we focus on modelling spatial patterns and temporal trends of infection rate curves. In contrast with SIR models, Functional Data Analysis (FDA) offers more flexible approaches for understanding trends (Boschi, Iorio, Testa, Cremona and Chiaromonte, 2020; Srivastava and Chowell, 2021) . The flexibility of FDA enables the incorporation of spatial information into modeling, and allows for spatial information to support various statistical analysis tasks (Pan, Shen and Hu, 2020) . Consequently, FDA approaches can be used to account for important aspects of modeling the disease, such as spatial heterogeneity (Thomas, Huang, Yin, Luo, Almquist, Hipp and Butts, 2020) and spatial dependence (Guliyev, 2020) . COVID-19 has manifested differently in regions within the same country which has influenced awareness and response to the disease. The dependence among transmission dynamics of states is related to inter-state transportation and similarity in socioeconomic factors, climate, and public health policy, which depends on the spatial distribution of states (Badr, Du, Marshall, Dong, Squire and Gardner, 2020; Cintia, Pappalardo, Rinzivillo, Fadda, Boschi, Giannotti, Chiaromonte, Bonato, Fabbri, Penone et al., 2020) . We propose a FDA model that accounts for both spatial heterogeneity and dependence.

Approaches from FDA use smooth infection rate curves based on daily reported cases to study regional infection rate dynamics through cluster analysis. With the exception of Srivastava and Chowell (2021) , these works tend to overlook the distinct forms of variability in functional data: amplitude, which quantifies features of functions, e.g. magnitude and number of extrema, and phase, which quantifies the relative timing of the amplitude features. Within the context of modeling US states' infection rate curves, amplitude and phase components provide complementary information about how each US state has experienced the pandemic. The study of amplitude enables an interpretation of the magnitude and pattern of spread of the infection in each US state. On the other hand, the study of phase quantifies differences in when US states experienced ebbs and flows of infections. Matuk, Bharath, Chkrebtii and Kurtek (2021) recently developed a Bayesian modeling framework for estimation of different types of functional data in the presence of phase variation. In this work, we extend their framework to model local extrema information of infection rate curves from daily reported COVID-19 cases while accounting for inter-state spatial dependence. Our primary contributions are summarized as follows:

i. Our use of the Bayesian paradigm allows us to jointly model three types of variation in COVID-19 infection rate data: amplitude, phase, and spatial variation. The latent amplitude and phase components are modelled as functional parameters, through a formulation rooted in the Elastic Functional Data Analysis (EFDA) framework (Srivastava, Wu, Kurtek, Klassen and Marron, 2011) . Spatial dependence is introduced through judicious choice of prior distributions for model parameters. Moreover, the Bayesian framework allows for structured uncertainty quantification of all of these aspects of the model. ii. The proposed spatial prior not only includes information about the neighborhood structure among US states, but also important covariate information in terms of community mobility data. We implement amplitude-phase separation in defining a variogram to estimate a spatial correlation measure. iii. We take a shape-restricted perspective for inference which enables the analysis of amplitude and phase variability in the presence of noisy data. Further, the model allows us to infer the number and pattern of extrema of infection rate curves. As we show in Section 4, extrema are useful in understanding the pattern of infections within US states, and heterogeneity of infections between US states. We use extrema to inform spatial clustering, which unveils regional patterns of COVID-19 infection rates throughout the US. The parameters for phase quantify differences in timing of extrema of infection rate curves and also contribute to our spatial clustering results.

The remainder of this paper is organized as follows. In Section 2, we discuss the data that we use to inform our model. In Section 3, we briefly discuss the EFDA framework which serves as the theoretical foundation of our model, and formulate our model. In Section 4, we discuss our modeling results. In Section 5, we summarize our contributions and discuss future work.

We model daily reported cases data calculated from cumulative reported cases data compiled by The COVID Tracking Project. The COVID Tracking Project is a volunteer organization that is dedicated to manually compiling data from US states' COVID-19 dashboards and making them publicly available (Covid Volunteer Team, 2020). Figure  1 panel (a) shows the cumulative reported cases data for the 50 US states and Washington DC, which we simply refer to as 'US states' in our study, and panel (b) shows daily reported cases data for these US states, which are used to inform our model. As is shown in Section 4, the infection rate curves that underlie the noisy data in panel (b) have different patterns of extrema in terms of number and location, with regional similarities in the shapes of these curves. We incorporate spatial dependence into our model through the use of informative prior probability modeling that accounts for correlation between neighboring US states. In our model, the spatial correlations between US states is determined by geographic distance between states and also the similarity in US states' mobility behavior. This information is important to be included in our model, since early studies of the COVID-19 pandemic in the US and other countries (Badr et al., 2020; Kraemer, Yang, Gutierrez, Wu, Klein, Pigott, Du Plessis, Faria, Li, Hanage et al., 2020; Yang, Sha, Liu, Li, Lan, Guan, Hu, Li, Zhang, Thompson et al., 2020) suggested that mobility is an essential factor associated with infection rates. To quantify mobility, we use the community mobility data collected by Google (Google LLC, 2020). The mobility reports track daily percentage changes in the duration of time spent in residential areas relative to a pre-pandemic baseline for the population in each US state. We display smoothed versions of these data in Figure 1 panel (c). These mobility data take the form of functional observations that capture similarities in how US states' populations responded to the pandemic and the ensuing public health policies. We find them useful for determining the spatial dependence in our model as discussed in Section 3.3.

Throughout this work, all data was recorded starting from March 12 th , 2020, the day after the pandemic was declared by the World Health Organization (Ghebreyesus, 11 March 2020), through October 25 th , 2020. As is often done in practice (Ramsay and Silverman, 2005; Srivastava and Klassen, 2016) , the different days throughout this span are represented as time points on the unit interval, ∈ [0, 1]. Further discussion of these data and pre-processing are discussed in Section 1 of the Supplemental Material.

In this section, we briefly summarize the EFDA framework, which serves as the theoretical background of our model, followed by our model specification for COVID-19 infection rate curves.

As briefly mentioned in Section 1, one of the primary challenges in working with functional data is the presence of amplitude and phase variability (Marron, Ramsay, Sangalli and Srivastava, 2015) . Within the context of COVID-19 infection rate curves, there are likely many factors associated with the presence of these variabilities, including the date when cases were first reported, the latency of reporting due to limited testing capacity in early stages of the pandemic, governmental response to perceived surges in infections, etc. While there are many approaches to addressing the issues associated with these types of variabilities (Ramsay and Silverman, 2005; Srivastava and Klassen, 2016) , we choose to work within the EFDA framework (Srivastava et al., 2011) , which accounts for amplitude and phase variability through a registration procedure with desirable properties.

The registration procedure decomposes a functional dataset, 1 , … , , into a set of amplitude functions,̃ 1 , … ,̃ , and phase functions, 1 , … , , such that their composition is equal to the original functional dataset , =̃ • = 1, … , . The phase functions are elements of the group of diffeomorphisms on the unit interval, Γ = { ∶ [0, 1] → [0, 1] | (0) = 0, (1) = 1,̇ > 0}, wherė represents the time derivative of a function . For identifiability, the sample Karcher mean of the phase functions is forced to be the identity warping, ( ) = . The optimality criterion used to determine the registration is based in the Fisher-Rao Riemannian metric on the space of absolutely continuous functions. Square root velocity function (SRVF) representation is a crucial ingredient of the EFDA framework, since it enables simple computation of the Fisher-Rao Riemannian metric. The SRVF of a function, , is defined as, = ( ) ∶=̇ ∕ √ |̇ |. This representation is invertible, if in addition to the SRVF, , one also records the starting point on the function, f(0), −1 ( , (0))( ) ∶= (0) + ∫ 0 ( )| ( )| . Our model, presented in the next subsection, relies on SRVF representation to decompose amplitude and phase of infection rate curves through model parameters.

Let denote the daily reported cases, as discussed in Section 2, and let represent days where non-missing values were recorded for the th US state, = 1, … , 51. Throughout this section, we use the notation ( ) to denote a vector of function evaluations ( ( ,1 ), … , ( , )) ⊤ , where is the length of . We assume the following observation model,

This specifies the daily reported cases data for the th US state as perturbations of a smooth infection rate curve,

, whose components ( , ) , , denote amplitude, phase, and translation, respectively. The amplitude component ( , ) is indexed by the parameters and that determine the number and ordering of extrema of the function −1 ( ( , ) , ). We represent the amplitude component ( , ) via an expansion of shape-restricted basis functions (Wheeler, Dunson and Herring, 2017) 

The basis elements are defined as In this work, we select a value of = 10 based on exploratory data analysis to reflect a flexible, yet parsimonious basis. As discussed in Matuk et al. (2021) , the shape-restricted basis is robust to the choice of for fixed values of ( , ). This basis representation forces the amplitude component of the model, ( , ) , to be zero at the values of ( ) , which is a vector of evenly spaced points throughout the domain. The zeros of ( , ) correspond to extrema for the function −1 ( ( , ) , ). The parameter determines the ordering of extrema. For example, when = 2, = 1 enforces −1 ( ( , ) , ) to have a local maximum followed by a local minimum, while = −1 enforces −1 ( ( , ) , ) to have a local minimum followed by a local maximum. Since infection curves for states exhibit variations in numbers of peaks and valleys, we extend the model by Matuk et al. (2021) , where and are fixed a-priori, to allow for and to take on a range of values. We specify a novel prior for these parameters as,

The number of extrema for the infection rate curve of the th US state, , is able to vary between 0 to , and its probable values depend on nearby US states. The value of = 5 enforces the assumption that no state has experienced more than 3 distinct waves of infections throughout the time that the data has been collected, which was chosen based on exploratory data analysis of the observations. In general, specification of the parameter too small can result in oversmoothing features of the data, and too large can result in mistaking noise as extrema. is specified so that −1 ( ( , ) , ) is increasing in the beginning of the domain.

Journal Pre-proof

The dependence between and is measured by a spatial weight, , , where the weights are truncated so that only bordering US states have a positive weight. These weights depend both on geographic distance and similarity in the changes of US states populations' mobility behavior during the pandemic. The methodology for determining the weights is discussed in Section 3.3. The constant = 8 corresponds to the largest number of bordering US states for any US state, and normalizes the spatial component of the prior to a value between 0 and 1. The regularization parameter, ∈ [0, 1], determines the strength of regularization. When = 0, the number of extrema in the smooth daily infection rate curve for each US state follows a discrete uniform prior that does not depend on neighboring US states, and as approaches 1, the prior on , the number of extrema in the smooth daily infection rate curve for state i, will favor the numbers of extrema that are similar to those of its neighbors. In this work, we select a moderate value for = 0.99 using sensitivity analysis to strike a balance between incorporating spatial information into the model and over-regularization. Section 2 in the Supplemental Material discusses the sensitivity analysis and corresponding selection of this parameter.

The phase component of the th US state, , shifts the extrema of the function −1 ( ( , ) , ) to fit the the daily reported case data. The prior is defined through the finite difference of a discretized phase function,

proposed by Bharath and Kurtek (2020) , where is a grid of evenly space points along the domain and is a concentration hyperparameter that we choose to correspond to a diffuse prior. We assume an independent and identically distributed ( ) ∼ ( , 2 )), which elicits a normal likelihood, and we model the translation and noise variance for each US state with weakly-informative normal and inverse-gamma conjugate priors, respectively.

Posterior inference is based on Markov chain Monte Carlo (MCMC) samples from an adaptive parallel tempering algorithm (Strait, Chkrebtii and Kurtek, 2019) , where the temperature scheme (Geyer, 1991) and the proposal parameters of the Metropolis-within-Gibbs algorithm (Roberts and Rosenthal, 2009 ) are automatically tuned for efficiency. We summarize our hierarchical model and discuss MCMC implementation in Section 3 of the Supplemental Material.

Motivated by the association between infection rate and community mobility, spatial correlation between US states' mobility data is used to inform the dependence between the number of extrema in infection rate curves. Modeling spatial dependence of mobility data for constructing a prior distribution of is challenging, since the state-level weekly community mobility records are functional data that exhibit phase variation. The presence of phase variation can blur the inter-state dependence structure, and can cause biased inference when ignored. The shape parameter is only related to the fluctuation of infection rate curves but invariant to phase variation, so we desire that the estimated inter-state correlation of mobility data is also invariant to potential phase variation. To satisfy this requirement, the amplitude trace-variogram approach proposed by Guo, is implemented to measure the spatial variation between the amplitudes of mobility curves.

The amplitude trace-variogram is defined based on amplitude-phase decomposition of a trace-variogram, which is a popular spatial variation measure for functional data (Giraldo, Delicado and Mateu, 2011) . We model a random field that takes on functional values as a coupling of a latent amplitude random field and phase random field. We estimate the amplitude trace-variogram through incorporating function registration into variogram estimation. This procedure provides spatial weights that capture correlation between magnitudes of mobility records without interference of phase variation potentially caused by various factors, such as lock-down policies among states.

In the functional space  = { ∶ [0, 1] ↦ | is absolutely continuous}, consider the functional random field { | ∈ } ⊂  on a spatial coordinate domain  ⊂ 2 , where at each ∈ , is a random function in . Suppose the mobility data is one realization of the functional random field in the US state located at , for = 1, … , 51. Using the SRVF representation, the model for the mobility data is,

where ( ) is the unobserved deterministic mean amplitude, constant in space; ( ) is the random process with mean 0 and covariance function ( 1 , 2 ) s.t. ∫ 1 0 ( , ) < ∞; ∈ Γ is the unobserved random phase component with ( ) = ∈ Γ for all ∈ . The inter-state dependence structure of interest is contained in amplitude error ( ), which is not directly estimable due to the existence of unknown warping ( ). Equivalently, the amplitude model is, 

Shape-Restricted Estimation and Spatial Clustering for COVID-19 Infection Rate Curves The amplitude random field is defined as { ( • ) | ∈ }. Under the second-order stationary and isotropic assumptions on { ( • ) | ∈ }, the amplitude trace-varioram is defined as a function of spatial distance,

where ‖ ⋅ ‖ is the 2 -norm. The corresponding empirical amplitude trace-variogram (Guo et al., 2020), based on an amplitude distance defined by optimising over Γ (Srivastava and Klassen, 2016) , is defined as,

To guarantee the estimated variogram is conditionally negative definite, we further fit a parametric Matérn variogram to the empirical variogram̃ ( ), resulting in a parametric estimatê ( ). In model fitting, the smoothness parameter is fixed at 0.5, while the range and sill are estimated through least squares. Finally, we specify the spatial weight as,

wherê is the estimated scale parameter of the Matérn family. We note that is in [0, 1] and is 0 when ‖ − ‖ ≥̂ .

In this section, we present our modeling results for infection rate curves based on US states' daily reported cases data. Throughout this section,̂ is used to denote the estimated marginal posterior mode of the number of extrema for the th US state, and functional posterior samples and summaries are shown given this value. Clusters are determined by the estimated number of extrema. Analysis of estimated posterior phase functions, used to quantify the relative timing of extrema, are useful for identifying sub-clusters. Figure 2 shows results for some selected US states that are representative of the patterns seen in the estimated infection rate curves as determined by our model. In each of the panels, the dots represent the daily reported cases data for each of the US states, and the gray lines represent marginal posterior draws of functional parameters from our model. The top row shows the daily reported cases data superimposed on posterior draws of the infection rate curve Regardless of the pattern that is present for each of the US states, the model appears to reasonably estimate a smooth infection rate curve. These results represent the heterogeneity of shapes that are present in estimated infection rate curves. The infection rate for North Dakota has only steadily increased throughout the time of the study; US states like New York and Texas have experienced one prominent peak followed by increasing cases; Georgia and Colorado have both experienced two distinct peaks in their infection rate curves, but Colorado's current infection rate is increasing, while Georgia's is currently decreasing. The middle and bottom rows of the figure display the amplitude and phase decomposition of the infection rate curves. The middle row shows marginal posterior amplitude functions, which are aligned for US states with the same number of extrema,̂ . The bottom row of the figure displays posterior draws for the phase component, which is able to quantify the relative timing of extrema. This is apparent when comparing the results in panels (b) and (c). The model estimates that both New York and Texas have two extrema, a peak followed by a valley. However, the marginal posterior phase functions for New York are generally higher than that of Texas, indicating that the infection peak of New York was experienced much sooner. Refer to Section 2 of the supplemental material to view the effect of the regularization parameter, , on estimation results. The parameter ensures that the pattern of extrema for estimated functions are influenced by their neighbors, which in some cases can prevent overfitting. Figure 3 shows a map of US states colored by cluster memberships determined by the estimated number and timing of extrema in panel (a) along with estimated marginal posterior mean amplitude and phase functions of US states within each cluster in panels (b)-(f). These results indicate that there is strong regional behavior for the shape of infection rate curves in the US. The most prominent regional clusters are the maroon US states in the Northwest, the brown US states in the Atlantic and Midwest region, the yellow US states in the south, and US states in different shades of orange in the Northeast and West. The US states in different shades of orange have the same number of extrema, however, the times at which the extrema occur separates the groups, characterized by marginal posterior phase mean functions. The group colored in light orange has peaks concentrated around the end of March and early-April and the group colored in dark orange has peaks concentrated in mid-June. This subclustering based on phase is investigated further in Figure 4 . The figure again shows the marginal posterior mean amplitude and phase functions for the US states in the orange cluster to highlight the differences in the phase functions. A dendrogram visualizing an agglomerative hierarchical clustering with Ward linkage (Köhn and Hubert, 2015) , based on all of the posterior mean phase functions for US states with two extrema, confirms that there are two prominent groups. Of all of the clusters based solely on the number of extrema, we have determined that this is the most appropriate to subcluster based on phase.

In this paper, we have presented a model to infer infection rate curves in the US from noisy daily reported cases data. Importantly, this approach provides a flexible model for the spatially correlated, noisy data that directly infers the number and timing of extrema of underlying smooth functions, which are crucial in visualizing and understanding the regional patterns of COVID-19 within the US.

In comparison with Srivastava and Chowell (2021) , which also uses EFDA for clustering of infection rate curves, our approach has some key methodological differences. We are able to incorporate spatial information for inference through prior modeling of parameters. We base clustering on the quantity and relative timing of extrema, which aids the interpretation of clusters compared to unsupervised learning of latent groups based on pairwise distances between functions. Our model performs simultaneous estimation, registration, and clustering of observations in contrast with performing these steps sequentially, which fails to propagate uncertainty from the estimation step to the clustering step.

The proposed model can be used to assess current severity of infection in individual US states and trends of neighboring US states to help with pandemic planning. In future work, we believe this model could be adapted within a functional regression framework to determine which factors (governmental, behavioral, etc.) are associated with the varying shapes of infection rate curves. This model could be modified for the analysis of county-level data, where the high number of zero reported cases in rural counties would need to be explicitly accounted for through the use of a zero-inflated likelihood model.

A mathematical model for the spatiotemporal epidemic spreading of covid-19

Association between mobility patterns and covid-19 transmission in the usa: a mathematical modelling study

The challenges of modeling and forecasting the spread of covid-19

Distribution on warp maps for alignment of open and closed curves

The shapes of an epidemic: using functional data analysis to characterize covid-19 in italy

The relationship between human mobility and viral transmissibility during the covid-19 epidemics in italy

A sir model assumption for the spread of covid-19 in different communities

Covid tracking project

Markov chain Monte Carlo maximum likelihood

Who director-general's opening remarks at the media briefing on covid-19

Modelling the covid-19 epidemic and implementation of population-wide interventions in italy

Ordinary kriging for function-valued spatial data

Determining the spatial effects of covid-19 using the spatial panel data model

The effect of human mobility and control measures on the covid-19 epidemic in china

Hierarchical Cluster Analysis

Functional data analysis of amplitude and phase variation

Bayesian framework for simultaneous registration and estimation of noisy, sparse and fragmented functional data

The covid-19 pandemic in the us

Spatial homogeneity learning for spatially correlated functional data with application to covid-19 growth rate curves

Differential effects of intervention timing on covid-19 spread in the united states

Functional Data Analysis

Examples of adaptive mcmc

Extended seiqr type model for covid-19 epidemic and data analysis

Title: Modeling study: Characterizing the spatial heterogeneity of the covid-19 pandemic through shape analysis of epidemic curves

Functional and Shape Data Analysis

Parallel tempering strategies for model-based landmark detection on shapes

Spatial heterogeneity can lead to substantial local variations in covid-19 timing and severity

Risk factors for covid-19 severity and fatality: a structured literature review

The outbreak of covid-19: An overview

A data-driven network model for the emerging covid-19 epidemics in wuhan, toronto and italy

Taking the pulse of covid-19: A spatiotemporal perspective

The authors would like to thank Dr. Karthik Bharath (University of Nottingham), Dr. Oksana Chkrebtii (The Ohio State University), and Dr. Sebastian Kurtek (The Ohio State University) for their helpful conversations. This research was partially supported by grants NSF CCF-1740761 and NSF CCF-1839252.