key: cord-0684997-mlkx9c90 authors: D’Urso, Pierpaolo; De Giovanni, Livia; Vitale, Vincenzina title: Spatial robust fuzzy clustering of COVID 19 time series based on B-splines date: 2021-05-15 journal: Spat Stat DOI: 10.1016/j.spasta.2021.100518 sha: cbd095f252580ed90df4449a3c46af62eb983a1e doc_id: 684997 cord_uid: mlkx9c90 The aim of the work is to identify a clustering structure for the 20 Italian regions according to the main variables related to COVID-19 pandemic. Data are observed over time, spanning from the last week of February 2020 to the first week of February 2021. Dealing with geographical units observed at several time occasions, the proposed fuzzy clustering model embedded both space and time information. Properly, an Exponential distance-based Fuzzy Partitioning Around Medoids algorithm with spatial penalty term has been proposed to classify the spline representation of the time trajectories. The results show that the heterogeneity among regions along with the spatial contiguity is essential to understand the spread of the pandemic and to design effective policies to mitigate the effects. Severe acute respiratory syndrome Coronavirus-2 (SARS-CoV-2) is the name given to the new 2019 coronavirus while COVID-19 is that given to the disease associated with the virus. This new strain of coronavirus, not previously identified in humans, has exponentially spread out from the capital of the Chinese province of Hubei, Wuhan, the epicentre of the contagion, worldwide causing a pandemic with serious effects on national health systems [1] . In Italy the outbreak of COVID-19 started at the end of February in Lombardy and then hit Northern Italy. As a consequence the Italian government The lockdown was gradually suspended at the beginning of May 2020 as the number of new daily active cases and the number of deaths due to COVID-19 steadily declined, also considering its economic cost. People could gather provided they maintained social distancing. Nonetheless, it was not possible to move freely between regions until 3 June 2020. The count of the new daily infections remained stable until August 2020, when the number of new daily cases, on a national level, increased to over 1200. In October, with an average of 10000 new daily cases and the exponential growth of the incidence rate, Italy faced its second pandemic wave involving all territories, from the North to the South. It is worth noting that the higher number of new dalily infections in the second wave was also strictly related to the considerably higher number of tested people by day. New restrictive measures were introduced by the Italian authorities that varied based on the classification of Italian provinces into three areas -red, orange and yellow -corresponding to three risk profiles on the basis of the value of 21 indicators. The 21 indicators are divided into three different categories: process indicators on monitoring capacity; process indicators on the ability to diagnose, investigate and manage contacts; result indicators relating to transmission stability and the resilience of health services. The red colour corresponds to higher risk. They contributed to control the pandemic spread in Italy and are still adopted to this day. Taking into account gender, out of the 85, 418 patients died for SARS-CoV-2, 43.7% was female. Based on an opportunistic sample of only 6, 381 patients died in hospital, the average number of diseases was 3.6 and the 84.9% of the sample had 2 or more comorbidities. In this study, we focused on the daily time-series of the cumulative cases over population (per 10000 inhabitants), of the cumulative cases over monitored cases and of the cumulative deaths over population (per 10000 inhabitants), spanning from 2020-02-24 to 2021-02-08. From a methodological point of view, one deals with spatial units observed at several time occasions. The corresponding spatial time data array X is a two-way data array, therefore an array of the type: spatial objects × occasions. Several space-time series clustering techniques have been proposed in the literature: some of them take into account the spatial dependence in the nonspatial time series clustering models by defining a suitable spatial dissimilarity measure [3] , while others are density-based [4] [5] [6] [7] [8] or model-based [9] [10] [11] [12] [13] clustering techniques. The approach followed in this work belongs to the broad group of the spatially constrained time series clustering techniques [14] [15] [16] . More specifically, it is based on the fuzzy clustering algorithm for spatial-time data proposed by [17] that, in turn, is an extension of the fuzzy C-Means algorithm for time series with spatial information developed in [15] . Both fuzzy algorithms proposed in [15, 17] include a spatial penalization term in the objective function. The role of this term and of the related tuning parameter is to smooth the membership degrees of all units contiguous to the generic i-th unit in all clusters to which i-th unit does not belong. The time series are transformed onto (finite dimensional) vectors of coefficients for dimension reduction purposes. This is obtained by projecting each time series onto a finite dimensional functional basis. Afterwards, robust clustering is applied to the resulting basis coefficients. Projection onto a suitable functional basis is considered, in particular the cubic B-spline basis. The need for robust clustering methods is naturally justified by the possible 4 J o u r n a l P r e -p r o o f and frequent presence of outlier time series. Notice that outliers are likely to occur in large, and sometimes fully automatized, data sets. Part of the variability of data can be removed by the regularization or smoothing procedure considered, but other sources of variability could be due to particularities of the data themselves. To achieve robustness, in the paper the Exponential distance is considered ( [18] ; [19] ). In this study, the Exponential distance -based Fuzzy C-Medoids clustering algorithm based on B-splines with spatial penalty term (BS-Exp-FCMd-S) is applied to clustering of time series related to COVID-19 pandemic. Summing up the proposed clustering method used for analysing the Covid time series presents different benefits inherited by its structural features. In particular: Clustering procedure: adopting the PAM (Partitioning Around Medoids) approach, the cluster prototypes (i.e., medoids) are units actually observed and not "virtual" units like the "centroids" derived with a fuzzy c-means [20] . Overall, having non-fictitious representative units available makes interpreting the obtained clusters easier [21] . In addition, PAM procedure provides a "timid robustification" of the c-means clustering ( [22] ; [23] ). Robustness: we consider in the clustering process a suitable transformation (exponential transformation) of the distance measure (exponential distance) capables to smooth the effect of the outliers and to tune their influence properly. In particular, the exponential distance gives less importance to outliers and more importance to those data points laying close to the bulk of the dataset (for more details see [24] ; [25] ). Fuzziness: fuzzy clustering appears more attractive than the traditional clustering methods when it is difficult to identify a clear boundary among clusters ( [26] ; [27] ). In addition, the memberships indicate whether there is a second-best cluster almost as good as the best cluster, a scenario which standard clustering methods cannot uncover [28] . Furthermore, fuzzy clustering is attractive because it is easily compatible with distribution free methods [29] and it is computationally efficient ( [26] ; [30] ). For more details, see [31] . J o u r n a l P r e -p r o o f Journal Pre-proof Temporal information: the time series are "re-parametrized" by means of the B-splines; in this manner, the temporal information of the time series is preserved considering a reduced number of values (the estimates of the splines coefficients) instead of the numerous actual observations of the time series. In this way we are able to preserve time information in the clustering process in a more parsimonious way from a computational point of view. Furthermore, our method inherits all the theoretical benefits of the B-splines in the time series clustering process [32] . Spatial information: our clustering method is capable to take into account the spatial information by means of a spatial penalty term defined on the basis of the following assumption: "...when a spatial unit belongs to a cluster with a high membership degree, then the penalty term forces the neighbouring spatial units to have high membership degrees in the cluster, as much as possible. In other words, it is expected that a spatial unit with high (low) membership degree in a cluster, will have neighbouring areas with low (high) membership degrees in the remaining clusters. It follows that the spatial penalty term attempts to determine a spatially smoothed membership degrees under the empirical evidence that neighbouring spatial units are often characterized by approximately similar features. Nonetheless, it may also occur that neighbouring spatial units are described by pretty diverse profiles. In this respect, there is a parameter which plays the role of increasing or decreasing the emphasis of the spatial penalty term in the clustering process" [15] . For this purpose, the parameter is chosen on the basis of an appropriate spatial autocorrelation measure. The paper is structured as follows. Section 2 describes data formalization A two-way data array of type "same objects × same quantitative variable × occasions", in which the objects are spatial units (geographical areas, pixels, etc.) and the occasions are times, is called spatial-time data array. In this study, the Italian regions are analyzed with respect to variables related to COVID-19 pandemic, collected daily over one year. Therefore, the corresponding objects of the spatial-time data array are represented by the Italian regions, the corresponding variable by a COVID-19 variable and the occasions by the days under consideration. In a formal way, a spatial-time data matrix can be algebraically formalized as [33] [34] [35] : where i indicates the generic spatial unit and t the generic time; x i (t) is the value of the variable observed for the i-th unit at time t. The time data matrix X can be conveniently represented as the set of vectors: The spatial proximity between the I objects is described by means of the contiguity matrix P I×I . It is a symmetric matrix with zero diagonal elements and with off-diagonal elements given by (i, i = 1, . . . , I, i = i ): Contiguity, on its turn, can be specified in several ways, for instance, two spatial units can be contiguous: if they are adjacent (neighbours); if they belong to the same macro-area, even if they are not adjacent. In both cases, a binary index 0 − 1 can be created where 1 is assigned to contiguous spatial units, 0 otherwise. Different definitions of connectivity and contiguity can be embedded in the clustering procedure. A time series {(t, x i (t))} is the result of collecting a variable X on unit i at the T times {t = 1, . . . , T }. Given a p-dimensional functional basis {B s (·)} p s=1 , we can model that time series by a simple linear least-squares fit as: Thus, for I time series x i , i = 1, . . . , I, we will obtain I vectors of fitted coef- . . , b p i ) , i = 1, · · · , I. Notice that these b s i parameters can be determined by using ordinary least squares regression. We consider the cubic B-spline bases. Notice that, considering S interior knots ξ 1 , ξ 2 , . . . , ξ S , a cubic B-spline basis will be made up of p = S + 4 basis elements. To fit a cubic spline to a data set with S knots we perform least square regression with an intercept and S + 3 predictors, of the form X(t), X 2 (t), X 3 (t) and S truncated cubic functions (one per knot). Other functional bases may be used, Fourier basis when the data curves exhibit periodical nature, wavelets basis that allows multiresolution analysis (in time and frequency scale). Cubic B-spline bases have good mathematical properties and are easy to implement [32] . Notice also that cubic B-spline bases have an attractive property in that they are locally sensitive to data (each parameter is nonzero over a span of at most five distinct nodes). One option for the choice of the number of knots is to try out different numbers of knots and see which produces the best looking curve. A more objective approach is to assess the accuracy of the model using the Residual of the data is removed, a a spline with a certain number of knots is fitted to the remaining data, and then the splines is used to make predictions for the held-out portion. The process is repeated multiple times until each observation has been left out once, and then RSS is computed. The procedure is repeated for different number of knots and the value corresponding to the smallest RSS is chosen. is defined as ( [24] , [36] ): The exponential distance gives a small weight to those noisy points or outliers and a large weight to those compact points in the data set. First, it should be noted that the exponential distance is bounded by 1. Second, as the value of β increases, the distance tends more rapidly to its maximum value. The parameter β plays an important role in the distance. Hence, if β is too high, in the classification process each time series is a singleton, since it has no neighbours. Figure 1 shows the effect of increasing values of β on the exponential distance. Following [24] , β is set as the inverse of the variability of 9 J o u r n a l P r e -p r o o f the data: the vector of coefficients closest to all other vectors. The value of β appropriately tunes the distance according to the variability of the data so that in the presence of low variability moderate distances are heavily magnified. See [24] for further insights on the robustness of the exponential distance. In this study, a suitable version of the well-known fuzzy C-Medoids (FCMd) algorithm is proposed. More specifically, the FCMd with spatial penalty term and squared Exponential distance is adopted, after transforming the times series onto the coefficients of the projection on a B-splines functional basis. As known, the FCMd technique allows to identify non-fictitious representative time series, the so called "medoids" favouring the interpretation of the final partition [37] . Its main advantage is related to a series of computational aspects: it is more efficient since the distance matrix needs to be computed once at the beginning of the iterative process and it is less affected by getting stuck in a local optima or by convergence problems [38, 39] . The BS-Exp-FCMd-S clustering model can be defined as follows: min : where b i and b c are the vectors of coefficients of the B-spline representation of the i-th spatial time series and of the c-th spatial medoid (c=1,. . . ,C) respectively, while m > 1 is well-known fuzziness parameter. J o u r n a l P r e -p r o o f As far as the spatial penalty term is concerned: it is worth noting that γ has to be seen as the tuning parameter of spatial information; p ii is the generic element of the "contiguity" matrix P I×I while C c is the set of the C clusters but cluster c. The u ic is the membership degree of the unit i belonging to the cluster c, that is equal to: We argue again that the role of the spatial penalty term is of increasing, for all spatial units contiguous to i-th unit, their final membership degrees to the cluster to which i belongs. In the next paragraph, we introduce the validity measure used in the application to choose the best partition. In order to choose the best solution in terms of number of groups, in this study we adopt the Fuzzy Silhouette (F S) index [40] , one of the most known cluster internal validity criteria based on the weighted average of the individual silhouettes width, λ i , as follows: where a pi is the average distance of object i to all other objects belonging to the same cluster p (p = 1,...,C) and l pi is the minimum (over clusters) average distance of the i-th unit to all units belonging to the cluster q with q = p. An higher value of F S means a better assignment of the units to the clusters implying that, simultaneously, the intra-cluster distance is minimized while the intercluster distance is maximized. As deeply analized in [15] , the optimal choice of the value of the parameter γ is a very complex issue. It has to be set exogenously by means of an heuristic procedure based on the spatial autocorrelation measure introduced in [15] , that could be seen as a generalization of the Moran's index [41] . For a chosen value of C and m, the algorithm is run for increasing values of γ (chosen in a suitable interval): the optimal γ value is that maximizes the within cluster spatial autocorrelation. Properly, it maximizes the Global Moran overall spatial autocorrelation measure ρ overall that, for a given partition, is computed as follows: where s c = I i=1 u ic . The ρ c , the spatial autocorrelation measure for the c-th cluster, is computed as: wherex is the centred "compromise" vector (mean of the T vectors x i ); U c is the square diagonal matrix (of order I) of the membership degrees of cluster c, and P is the spatial contiguity matrix. The operator diag(·) creates a diagonal matrix whose elements in the main diagonal are the same as those of the square matrix in the argument. If P is a contiguity matrix with 0/1 values, every diagonal element contains the number of neighboring units for the associated spatial unit. It is important to observe that (9) is very similar to the measure J o u r n a l P r e -p r o o f Journal Pre-proof proposed by Moran [42] . The main difference concerns the matrix U c , which tunes the contribution of the neighbours. The vectorb (mean of the p vectors b i ) was also used to compute (9) with the same results. As for Moran's index, also for ρ overall , a value of 1 (−1) identifies a perfect positive (negative) autocorrelation, while 0 indicates the absence of autocorrelation. Therefore, to higher values of the ρ overall corresponds a better spatial assignment of the units to the clusters. With the choice γ = 0 the spatial regularization term is not taken into account. In this case only the temporal information is taken into account in the clustering process. Then, we obtain the BS-Exp-FCMd clustering model formalized as follows: The membership degree u ic of the unit i belonging to the cluster c, is equal to: In this study, we proposed the BS-Exp-FCMd clustering model with spatial penalty term (BS-Exp-FCMd-S) to COVID-19 data collected during the pandemic in order to cluster Italian regions. Specifically, the data refer to variables observed on I=20 Italian regions 2 during T =351 times represented by the days erature; see, e.g., [43] , [44] , [45] , [46] , [47] . In the following sections, the clustering results will be described in detail with reference to the total cases over population (per 10000 inhabitants), the total deaths over population (per 10000 inhabitants) and the total cases over swabs, respectively, spanning from 2020-02-24 to 2021-02-08. One notices that total cases include active cases (infected patients) and closed cases (deaths and recovery/discharged). As far as the cumulative cases over monitored cases (over swabs) are concerned, we argue that the quality of clustering results is strictly Bolzano). Here, the data of the two provinces have been summed up to recover data at a regional level. The total number of cases over population (per 10000 inhabitants), by region, is shown in Figure 2 while the daily number of new infections over population (per 10000 inhabitants), by region, is shown in Figure 3 , spanning from 2020-02-24 to 2021-02-08. Both graphs well highlight regional heterogeneity and, looking at the second graph, in particular, we can clearly distinguish two peaks in correspondence of the first and second wave; the latter have soon surpassed the incidence rate of the former essentially because contact tracing has been more accurate and faster, carrying out more swab tests daily. Table 1 ; the latter is also reported in the ternary plot of Figure 5 . noticing that while Basilicata is clearly a fuzzy unit sharing features belonging to the clusters 2 and 3, the low membership of Sicily is due to an anomalous increase of infections during the second wave, as we can see by looking at Figure 3 showing the twofold sharp increase of new infections in Sicily, that exhibits amongst the highest values. With respect to the partition without spatial penalty term, the Molise region moves from the cluster with medoid Basilicata, not contiguous with the former, to the cluster with medoid Lazio, contiguous instead. As far as Sicily is concerned, with the spatial penalty term, its membership in cluster 1 is fuzzier than it was without penalty term (0.376 versus 0.414) and it is also comparable to that of the cluster 3, whose medoid is Calabria, its contiguous region. The memberships of the regions considering the spatial penalty term are improved. The total deaths over population due to COVID-19 desease, by region, is shown in Figure 8 , useful. In the report, one has been confirmed that the fatality rate for people aged under 50 was about 1% for both males and females while people aged over 80 were the most hit, males more than females, representing the 60% of all COVID-19 deaths. In general, COVID-19 deaths weighted the 13% out of the total mortality, in the first wave (February -May 2020), to rise to 16% in the second wave (end of September-November 2020). During the first pandemic wave, from March to May 2020, the number of total deaths was more than 211000, 50000 more than the average recorded in Table 2 ; the latter is also reported in the ternary plot of Figure 9 . The corresponding partition is plotted in both Figures 10 and 11 to emphasize the clustering results in time and space, respectively. Focusing on the partition based on spatial penalty, the first cluster, with The three clusters identified three profile, from that which paid the highest price in terms of deaths to that which paid the least price; indeed, the first cluster included all regions with the highest number, and also the highest increase in time, of the mortality due COVID-19 pandemic. The model with the spatial penalty had a substantially different partition from that for the model without the spatial penalty; all fuzzy units in the latter The added value produced by the robust spatial model is evident when we consider Sicily. Its membership degree (0.923), associated to the first cluster with medoid Veneto, had been considerably reduced so that it became a fuzzy unit in the spatial model. In other words, the model with penalty term and squared exponential distance is able to detect the presence of local outliers of particular interest in spatial statistics. In october 2020 Health Minister announced quick swabs in pharmacy as already carried out in some regions and an agreement with general practitioners to carry out the rapid swabs. The total cases over swabs are presented by region in Figure 12 spanning from 2020-02-24 to 2021-02-08. The regional heterogeneity on the onset of an epidemic outbreak, the occurrence of possible recurrent epidemic waves and the policy about swabs can be anlysed in terms of the range of the variable and the time of occurrence of changes. The Fuzzy Shilouette index selected three groups showing a value of 0.42. The optimum value of the spatial correlation index (9) over γ is 0.10 related to a value of Global Moran index ρ overall = 0.30. The membership degrees for the partition without and with penalty term are shown in Table 3 ; the latter is also reported in the ternary plot of Figure 13 . The corresponding partition is plotted in both Figures 14 and 15 to emphasize the clustering results in time and space, respectively. Focusing on the partition with spatial penalty, three clusters are clearly identified. The first cluster, with medoid Piedmont, collects the Northern regions Table 3 : total cases over swabs by region -3 clusters memberships In this study, the Exponential distance -based Fuzzy C-Medoids clustering algorithm based on B-splines with spatial penalty term (BS-Exp-FCMd-S) is applied to clustering of time series related to COVID-19 pandemic. Data reduction is obtained by the use of B-splines, robustness is obtained by the use of the exponential distance while spatial information is taken into account by the use of a penalty term in the objective function. We obtain on the entire period almost the same partition into yellow, or- Early transmission dynamics in wuhan, china, of novel coronavirus-infected pneumonia Characteristics of sars-cov-2 patients dying in italy report based on available data on Clustering spatiotemporal data: An augmented fuzzy C-means A density-based algorithm for discovering clusters in large spatial databases with noise Mining spatial-temporal clusters from geodatabases An algorithm for clustering spatialtemporal data Fuzzy extensions of the DBScan clustering algorithm Robust clustering by detecting density peaks and assigning points based on fuzzy weighted K-nearest neighbors The mixture method of clustering applied to three-way data Finite mixtures of matrix normal distributions for classifying three-way data Spatial generalized linear mixed models with multivariate CAR models for areal data Hierarchical multivariate mixture generalized linear models for the analysis of spatial data: An application to disease mapping Copula-based fuzzy clustering of spatial time series A hybrid EM approach to spatial clustering A fuzzy clustering model for multivariate spatial time series Fuzzy C-means with spatiotemporal constraints, Proceedings -2016 IEEE International Symposium on Computer, Consumer and Control Fuzzy clustering with spatial-temporal information Alternative c-means clustering algorithms Robust clustering of imprecise data, Chemometrics and Intelligent Laboratory Systems Pattern Recognition with Fuzzy Objective Function Algorithms Finding groups in data: An introduction to cluster analysis Robustness properties of k means and trimmed k means Advances in Data Analysis and Classification Alternative c-means clustering algorithms Time series clustering by a robust autoregressive metric with application to air pollution Application of fuzzy sets to climatic classification Market segmentation: conceptual and methodological foundations Cluster analysis Fuzzy clusterwise generalized structured component analysis Cluster differences scaling with a within-clusters loss component and a fuzzy successive approximation strategy to avoid local minima Handbook of Cluster Analysis A Practical Guide to Splines Dissimilarity measures for time trajectories Fuzzy C-means clustering models for multivariate time-varying data: Different approaches Fuzzy clustering for data time arrays with inlier and outlier time trajectories Alternative c-means clustering algorithm Finding Groups in Data: An Introduction to Cluster Analysis Cluster Analysis Fuzzy clusterwise generalized structured component analysis A fuzzy extension of the silhouette width criterion for cluster analysis A test for the serial independence of residuals A test for the serial independence of residuals Autocorrelation-based fuzzy clustering of time series Wavelet-based fuzzy clustering of time series Fuzzy clustering of time series in the frequency domain Fuzzy c-means method for clustering microarray data Robust fuzzy clustering based on quantile autocovariances Impatto dell'epidemia covid-19 sulla mortalità totaledella popolazione residente periodo gennaio-novembre 2020