key: cord-0104874-y2c6tlgv
authors: Shi, Xiaoping; Chen, Meiqian; Dong, Yucheng
title: Exploring the space-time pattern of log-transformed infectious count of COVID-19: a clustering-segmented autoregressive sigmoid model
date: 2021-02-26
journal: nan
DOI: nan
sha: 12285705d94bdb81af503f1e5652b5d6655df113
doc_id: 104874
cord_uid: y2c6tlgv

At the end of April 20, 2020, there were only a few new COVID-19 cases remaining in China, whereas the rest of the world had shown increases in the number of new cases. It is of extreme importance to develop an efficient statistical model of COVID-19 spread, which could help in the global fight against the virus. We propose a clustering-segmented autoregressive sigmoid (CSAS) model to explore the space-time pattern of the log-transformed infectious count. Four key characteristics are included in this CSAS model, including unknown clusters, change points, stretched S-curves, and autoregressive terms, in order to understand how this outbreak is spreading in time and in space, to understand how the spread is affected by epidemic control strategies, and to apply the model to updated data from an extended period of time. We propose a nonparametric graph-based clustering method for discovering dissimilarity of the curve time series in space, which is justified with theoretical support to demonstrate how the model works under mild and easily verified conditions. We propose a very strict purity score that penalizes overestimation of clusters. Simulations show that our nonparametric graph-based clustering method is faster and more accurate than the parametric clustering method regardless of the size of data sets. We provide a Bayesian information criterion (BIC) to identify multiple change points and calculate a confidence interval for a mean response. By applying the CSAS model to the collected data, we can explain the differences between prevention and control policies in China and selected countries.

During the COVID-19 outbreak, multiple complex factors resulted in the space-time pattern of spread. Fig. 1 shows the log-transformed infectious counts in each region in China, and in 33 selected countries at the end of April 20, 2020.

From Fig. 1 , we can see two main characteristics of the spread: (i) the spread of COVID-19 has a space-time characteristic determined by different intervention policies, incomplete information, geographical locations, transport, climate, and so on; (ii) along the time, the log-transformed infectious counts presented different sigmoid (stretched S-shaped) curves.

This phenomenon often happens in the life cycles of plants, animals, and viruses, which can rise and fall periodically. In each cycle, the sigmoid curve experiences three phases: slow rising, sharp rising, and slow falling.

Modeling the spread of COVID-19 in many regions over a long period of time is proven to be challenging. That is because many regions may not share the same spread pattern and different regions may exhibit various intervention policies that may cause instability in the models. The model for each region may have a large degree of noise, but a common cluster of all regions could have less noise by the law of large numbers. Thus, clustering is of importance to increase model fit. We may have to cluster all regions, even if the number of clusters is unknown. In addition, we should allow the model to incorporate unknown change points to further enhance the fitting performance. Ignoring the existence of change points may lead to poor model fitting and misleading model interpretation (Shi, Wang, Wei & Wu, 2016) . Furthermore, it is often necessary to apply the model to updated data from an extended period of time. In the extended period, old clusters need to be updated and new change points may occur. Models with incorporated clusters and change points should be flexible and adaptive to the new data. In the next step, we shall consider the nonlinear characteristics of the models.

Logarithmic transformation is often used for transforming count data, which includes zero values (Jin et al. , 2020) and grows exponentially over time. The simplest formula for exponential growth of a function y at the growth rate r, as time t goes on, is y(t) = y(0)(1 + r) t , which satisfies the linear differential equation dy(t) dt = log(1 + r)y(t). A nonlinear variation of this differential equation may lead y(t) to a sigmoid function. For example, the solution of a nonlinear differential equation dy(t) dt = log(1 + r)y(t) − y 2 (t) is the logistic function (Murray, p.308 , 1989; Liu & Stechlinski, p.84 , 2017) . The exponential growth model has shown numerous applications in the modeling and controlling of complex systems. For example, the number of cells in a culture will increase exponentially until an essential nutrient is exhausted. A virus, for example SARS or COVID-19, has been found to spread exponentially (Katul et al. , 2020) . The speed of spread slows down when an artificial immunization becomes available or intervention policies take effect. Other applications of the exponential growth model can be found in Physics (e.g., radioactive decay), Economics (e.g., a country's gross domestic product), Finance (e.g., investments), Computer science (e.g., computing growth and internet phenomena), and so on.

When systems have short-term memories and become more complex, it is extremely difficult to find a differential equation to describe the growth curve. In contrast, we may add some autoregressive terms in a regression function to adapt to the complex system. Kowsar et al. (2017) shows that an autoregressive logistic model was more accurate than a logistic model when it comes to predicting the behaviors of complex biological systems.

The reason is that the added autoregressive terms, which behave like short-term memory, can make an appropriate adjustment to better fit the complex system. In the same spirit, we propose the clustering-segmented autoregressive sigmoid (CSAS) model with four key characteristics including unknown clusters, change points, stretched S-shaped curves, and autoregressive terms. With the help of the CSAS model, we expect to understand how an outbreak is spreading in time and in space, to understand how the spread is affected by epidemic control strategies, and to apply the model to updated data from an extended period of time.

To identify this CSAS model, we first identify unknown clusters. There are many popular methods, such as K-means (Wang & Hartiganm , 1979 ) (implemented in the R function kmeans), Expectation-Maximization clustering for Gaussian Mixture Models (GMM-EM) (Akaho , 1995) (implemented in the R package mclust), Density-Based Spatial Clustering of Applications with Noise (DBSCAN) (Ester et al., 1996) (implemented in the R function fpc::dbscan), and Hierarchical clustering (Murtagh & Legendre , 2014) (implemented in the R function hclust). Except for GMM-EM, which can be considered to be parametric, all other methods need to predetermine the number of clusters or distance related parameters.

To compare the dissimilarity of the curve time series, we need a nonparametric method that does not require predetermined parameters. Then, we can separate different regions from China and the selected 33 countries into clusters that share common patterns, segment the curve time series, and provide accurate fittings.

Our contributions include the following: (1) we propose the CSAS model to help understand how an outbreak is spreading in time and in space, to understand how the spread is affected by epidemic control strategies, and to apply the model to updated data from an extended period of time; (2) we provide a nonparametric graph-based clustering method with theoretical support, which furthermore proposes a very strict purity score that penalizes the overestimation of clusters. Simulations show that our method is fast and efficient for different sizes of data sets; (3) we give practical methods for segmentation and provide a confidence interval estimation for mean response; (4) we analyze the COVID-19 data in regions in China and selected countries, and explain the differences among the epidemic prevention and control policies.

We assume the clustering-segmented autoregressive sigmoid (CSAS) model:

where

x −∞ e −u 2 /2 du is a cumulative distribution function (CDF) of the standard normal distribution representing the sigmoid curve; β i,t 's are independent random errors with a mean of zero and constant variance of (σ

The CSAS model has four key characteristics: (1) it is implemented with unknown N different clusters among K regions. Due to the epidemic mechanism, human mobility and control strategy, the spread of epidemics displays a spatial propagation. We will propose a nonparametric method to cluster the regional data by applying the characteristic of sigmoid curve. This method does not introduce any factors and hence can be considered nonparametric; (2) the multiple S-shaped curves are described by change points. The change points τ (m) i for 1 ≤ m ≤ M i − 1 are unknown and are related to the cluster (i). This is because different intervention policy releases such as lockdown, maintaining social distance, cancelling large events, closing schools, and so on, result in different segmented sigmoid curves among unknown clusters; (3) the regression function is mainly determined by the stretched S-curve β (m)

4,i t) and it allows a slight adjustment through the autoregressive terms,

q+4,i Z i,t−q which can be considered as a shortterm memory for the response variable; and (4) after specifying both clusters and change points, we use the corresponding data sets to answer three questions. How do we estimate the regression coefficients? Are those coefficients significantly different from zero? How do we give a confidence interval for the mean response?

We give the following five remarks for the logarithmic transformation, the CDF function Φ(x), and the random error in the CSAS model. Remark 1. log(1 + x) transformation is often used for transforming count data that include zero values (Jin et al. , 2020) . When Y i,t is much smaller or larger than 1 in

which may grow exponentially over time, has two patterns, slow rises and slow falls, and hence can often be modeled by a stretched S-curve.

Remark 2. The nonlinear function Φ(x) is used to describe the stretched S-shaped curve. Other similar functions may be considered. For example, if we apply the approximation of Φ(x) by (Tocher , 1963) 

which is an extended logistic function of time t and is commonly used in logistic regression.

Remark 3. Mathematical modelling may provide an understanding of spread mecha-nisms. The original mathematical model was proposed and solved by Daniel Bernoulli in 1760; see (Dietza & Heesterbeek , 2002) . Recent developments and applications are mainly focused on the susceptible-infectious-recovered (SIR) model and its variants. The logistic function derived from a nonlinear differential equation may explain why we should apply the sigmoid curve to model the spread of disease (Murray, p.308 , 1989; Liu & Stechlinski, p.84 , 2017; Katul et al. , 2020) . 

To find clusters of all K regions, we consider the T dimensional series {Z j , 1 ≤ j ≤ K}, where its tth component is Z j,t for 1 ≤ t ≤ T , and define the Euclidean distance between Z j 1 and Z j 2 as follows:

We construct an approximate shortest Hamiltonian path (SHP) based on a heuristic

Kruska algorithm (HKA), which was proposed by (Biswas et al. , 2014) for a two-sample test. This was successfully applied into change point detection in Rao, 2017, 2018) . The HKA first sorts all edges in order of increasing distance defined in (2). First and foremost, the edge with a minimum distance must be selected. Then subsequent edges are chosen one-by-one from the remaining list of sorted edges according to the requirement of a path. If this current edge does not form a cycle with the previously selected edges, and every vertex connected by this current edge, or previously selected edges, has a degree not greater than 2, then this current edge must be selected. The HKA terminates when K − 1 edges have been chosen. The approximate SHP is formed by chosen K − 1 edges denoted as P = (j 1 , . . . , j K ). The next step is to find clusters based on P. We define the edge set of P as E(P), and consider a subset of E(P):

We create a graph from the edge set E * (P, θ) and define the connected components of this graph as a set of clusters A = {A , 1 ≤ ≤ L}. We note that the R function components in the R package igraph (Csardi & Nepusz , 2006) can calculate the connected components given the edge set.

Suppose that there is a set of classes

We need to measure how close the set of clusters A is to the predetermined set of classes C.

Purity (Manning, Raghavan and Schütze , 2008 ) is a measure of this extent defined as:

In most cases, a bad clustering has a purity value close to 0 and a perfect clustering has a purity of 1. However, this measure may not give a realistic evaluation for overestimated clusters. For example, a purity score of 1 could happen by putting A = , L = K and N = 1. In this case, one whole class is mis-clustered to K separate clusters with a purity score of 1.

We propose a very strict purity score to penalize overestimated clusters:

Users may add additional weight on the second penalty term according to different requirements. Based on this very strict purity evaluation, a very bad clustering would have a purity value close to -1, and a perfect clustering will still have a purity of 1. If A = , L = K and N = 1, then S * (A, C) = 1/L, decreasing as L increases. Overestimated clusters may have a very strict purity score close to 0. A natural question comes: does our clustering have a very strict purity score of 1? To answer this question, we make the following assumptions.

In Assumption 2, we require K to be quite small compared to T . Note that d(

We have the following Theorem 1.

Theorem 1. Suppose Assumptions 1-2 hold. Choose θ = η(T ) as in (3). As T → ∞,

where c is a constant not related to either K or T . By Assumption 2, this upper bound converges to zero. Next, we prove that P {min j 1 =j 2 ,δ(j 1 ) =δ(j 2 )

By the Minkowski inequality and Assumption 2,

which converges to zero by (6).

By the HKA, for any A , there exists C i such that C i = A in probability, which implies that max 1≤i≤N |A ∩ C i | = |A | and L = N hold in probability. So, P (S * (A, C) = 1) converges to 1 as T → ∞. The proof of Theorem 1 is finished.

To apply Theorem 1, we need to set the right value for θ. In real problems, θ could be unknown. We shall propose a data driven method to select the threshold value of θ. A naive choice of θ based on outlier detection iŝ θ = median s=1,...,K−1 (x s )

where x s = d(Z js , Z j s+1 ) for s = 1, . . . , K−1, median s=1,...,K−1 (x s ) and 1. with the exception of outliers, is a mixture of two or more probability distributions which commonly occurs in multiple clusters, thenθ may not be consistent to θ. Therefore, we

propose Algorithm 1 based on Bayesian information criterion (BIC) to choose θ.

Denote a set of change points as C i = {τ 

where 1 ≤ t − < t 0 < t + ≤ T and f (t; β) = β 1 + β 2 Φ(β 3 + β 4 t) + β 5 Z i,t−1 + β 6 Z i,t−2 .

Here, we consider two autoregressive terms. Then, the estimated change point is denoted

where ∆ is the minimum distance between two adjacent change points and t + − t − > ∆.

In light of Bai & Perron (2003) , we apply the BIC method for model comparison.

where ν = 0 or 1, 6(ν + 1) is the number of parameters andσ 2

. Combined with the Iterated Cumulative Sums of Squares Algorithm (ICSS) (Inclán & Tiao , 1994) , we propose Algorithm 2 to estimate multiple change points.

In Algorithm 2, there are two main steps that include finding candidate change points and refining them. We set the minimum distance between two adjacent change points, ∆, to be 10 for real data analysis.

First, we use the well-known nls function in the R package stats (R Core Team , 2020) to find the minimum value as shown in (8) and give t tests on regression coefficients, where initial values of parameters are given by grid search. Second, we give a confidence interval of regression function, denoted as

, by the delta method as follows. By first-order Taylor expansion at the solutionβ, we have

where t * α/2 (t + − t − − 6) is the α/2 lower quantile of a t distribution with degrees of freedom t + − t − − 6 and Var(β) can be estimated by the nls function in R. Here, we consider α = 0.05.

We consider three classes N = 3. Let C 1 , C 2 , and C 3 be randomly seperated classes with

Denote the number of elements in ith class as n i with 3 i=1 n i = K. We produce Z i,t from the following model

where i = 1, 2, 3, t = 1, . . . , T and ε t 's are independent Normal errors with mean zero and variance σ 2 .

For the first class, let M 1 = 1, β

2,1 = 10, β

3,1 = −4, and β 

,2 = 0 for = 1, . . . , 4, τ

1,2 = 0, β

2,2 = 20, β (2) 3,2 = −3, and β

(2) 4,2 = 0.03. For the third class, let M 3 = 2, β (1) ,3 = 0 for = 1, . . . , 64, τ

3,3 = −2, and β

(2) 4,3 = 0.07. Fig. 2 plots different S-curves of Z i,t for i = 1, 2, 3 and T = 150. Next, we compare our graph-based clustering method as shown in Algorithm 1 with the model-based clustering method (Fratey & Raftery , 2002) implemented in the R function Mclust (Fraley et al. , 2020 ). Fig. 3 shows the averaged strict purity score as in (5) for estimated clusters based on these two methods for σ = 0.1, 0.2, . . . , 1, different n i 's and 100 replications.

It can be seen from Fig. 3 that our method is very accurate for different σ and sizes of classes because its very strict purity scores are close to 1. This agrees with the conclusion in Theorem 1. In contrast, the performance of the model-based clustering method is affected by large σ and large sizes of classes. Specially, when n 1 = 20, n 2 = 100 and n 3 = 200, the computation is significantly slower compared with our method. The comparison of computing time is not presented here.

We continue to use the data set of log-transformed infection counts from December 1, 2019 to April 20, 2020 from Chinese provinces/regions and the 33 countries, and present the clustering and S-shaped fitting with change points. Here, two autoregressive components (p = 2) in (1) are suggested.

Based on the graph-based clustering Algorithm, the clusters of COVID-19 in China and the rest of the world are presented in Fig. 4 , where the optimal path is presented as a cycle with vertexes representing clusters in different colors and overextended curves. This way of presentation is to transmit three aspects of information: (i) this analysis is for virus data, therefore, we should use the cycle and the sharp nodes to describe the structure of the virus;

(ii) the optimal graph is a path connecting all nodes where nodes can be provinces/regions in China or countries in the world; and (iii) readers can quickly find the different clusters and where to separate them from the path.

From Fig. 4 , we observe the following. (Fratey & Raftery , 2002) suggests both HK and TW are to be in one cluster, which may not be correct.

(2) As shown in COVID-19 cases in the world, the 33 selected countries are clustered in those countries. For example, the first large-scale outbreak was in CN, followed by KR and JP. After that, infections in IR and IT experienced rapid growth, followed by the outbreaks in European countries and the US. Finally, the epidemic spread worldwide. In addition, the clustering is also based on the epidemic control strategies in each country.

For example, in KR and JP, even while the epidemic broke out around the same time, the two countries had taken different strategies: JP adopted a "defensive strategy"' to ensure the health care system operated normally as usual, while KR used an "aggressive attack strategy" to comprehensively detect infections. We can obtain that all sigmoid curves share the form of multiple stages and multiple change points, with the exception of Cluster 7 (XZ) in China, with only one infection; the calculated change points of each cluster can still be explained by the differences in epidemic control strategies. See the details below.

(1) As shown in Fig. 5 A and Fig. 6 A, the sigmoid curves and change points are almost the same because HB province was the center of the COVID-19 outbreak in CN.

In Fig. 6 A in CN, the first segment (19/12/01 to 19/12/13) was the germination period of the outbreak. In the second segment (19/12/13 to 20/01/16), COVID-19 seemed to have been controlled in CN. However, because many COVID-19 cases had not been found due to varied epidemic control strategies in the previous two stages, COVID-19 broke out in the third segment (20/01/16 to 20/01/26) and fourth segment (20/01/26 to 20/02/11) in CN. This coincided with Chunyun (the annual massive movement of people during Chinese Lunar New Year), which particularly accelerated the outbreak. Finally, in the last two segments (20/02/11 to 20/02/27 and 20/02/27 to 20/04/20), COVID-19 was controlled and stabilized once the CN government implemented very strict epidemic control strategies, such as traffic control and home quarantine.

(2) As shown in Fig. 5 C, the sigmoid curves in HK and TW seem similar because they were both strongly affected by COVID-19 cases from mainland China. However, we find that the change points of COVID-19 in TW are about a week delayed compared to those in HK after COVID-19 started to break out in both regions. This is because TW responded in a timely manner to the COVID-19 outbreak and controlled it more quickly and effectively than HK, while the implementation of epidemic control strategies in HK lagged behind.

(3) As shown in Fig. 6 , the number of new cases in China had tentatively stabilized since the last change point, 20/02/27, which was delayed by about one week in other clusters. In Fig. 6 A and B , the infections in CN and KR are mostly stable, but the epidemic situations in other countries have not been controlled effectively. Take the fifth cluster ( Fig. 6 C) as an illustration, considering that this cluster had the fastest growth.

The four segments can be explained as follows: (i) the infections in the first segment were mainly from oversea imports; (ii) in the second segment, COVID-19 seemed to have been controlled; (iii) COVID-19 broke out because of many unfound COVID-19 cases in previous segments; and (iv) in the last segment, COVID-19 began to come under control as governments declared states of emergency and started implementing strict measures to control the spread of the virus.

(4) As shown in Fig. 6 B, confidence intervals for KR tended to be quite narrow in width when the number of new cases had tentatively stabilized, resulting in more precise estimates of mean response, whereas confidence intervals for JP tended to be wide since JP had adopted a "defensive strategy". In most of cases, confidence intervals produced precise results. can see that the fittings continue to work well. We provide an R package, GraphCpClust, which can be accessed from https://github.com/Meiqian-Chen/GraphCpClust. From this R package, users can obtain the same results presented in this paper and can model data for another extended period of time. In addition, the data and code for another two papers Rao, 2017, 2018) are included in this R package.

Regarding the dataset used in this article, Wuhan-2019-nCoV, we make the follow-ing additional remarks: 1. Back in early March 2020, there were very few datasets on COVID-19, and especially few datasets containing timely epidemic data from each Chinese province. This dataset, Wuhan-2019-nCoV, collects national outbreak reports from WHO, as well as daily outbreak reports from provincial health and family planning commissions in China; 2. The Wuhan-2019-nCoV dataset is very timely updated and has been included in the "Open Source Wuhan" data resource. Therefore, we believe that the data quality of the Wuhan-2019-nCoV dataset is trustworthy. There are now more and more COVID-19 data resources available, such as WHO data (https://covid19.who.int/) and Our World in Data (https://ourworldindata.org/covid-data-switch-jhu). For these two datasets, we find that our model still works very well. Please see this webpage, http://graph-clusteringsystem.com/, for the three data analyses described above. Algorithm 2: BIC-based ICSS Algorithm

Result: Output the estimated change pointsĈ i . 1 Notations:Ĉ i,(s) and |Ĉ i | are the sth smallest element and number of elements of setĈ i , respectively; 

Mixture model for image understanding and the EM algorithm

Computation and analysis of multiple structural change models

A distribution-free twosample run test applicable to high-dimensional data

The igraph software package for complex network research. InterJournal, Complex Systems 1695

Daniel Bernoulli's epidemiological model revisited

A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise

2020) mclust: Gaussian Mixture Modelling for Model-Based Clustering, Classification, and Density Estimation

Model-based clustering, discriminant analysis and density estimation

Use of cumulative sums of squares for retrospective detection of changes of variance. Publications of the American Statistical Association

Estimation and model selection in general spatial dynamic panel data models

Global convergence of COVID-19 basic reproduction number and estimation from earlytime SIR dynamics

An autoregressive logistic model to predict the reciprocal effects of oviductal fluid components on in vitro spermophagy by neutrophils in cattle

Infectious Disease Modeling. A Hybrid System Approch

Introduction to Information Retrieval

Mathematical Biology

Ward's hierarchical agglomerative clustering method: which algorithms implement ward's criterion?

R: A language and environment for statistical computing. R Foundation for Statistical Computing

A sequential multiple changepoint detection procedure via VIF regression

Consistent and powerful graph-based changepoint test for high-dimensional data

Consistent and powerful non-Euclidean graph-based change-point test with applications to segmenting random interfered video data

The Art of Simulation

Algorithm as 136: a k-means clustering algorithm

Algorithm 1: Graph-based clustering Algorithm Result: Output the optimal clusters A

Notations: x(θ) = {x s |x s ≤ θ, s = 1, . . . , K − 1} where x s is defined

2 θ (s) is the s'th largest element in set {x s , s = 1

3σ 2 (θ) is the sample variance of x(θ)

A) = (K − 1) log(σ 2 (θ)) + 2L(A) log(K − 1), where L(A) is the number of clusters in A

Initialize: Let i = 1, L = 1, and A = {A 1 } where A 1

Cluster 1(CN) and Cluster 2(IR,IT)

Cluster 5