key: cord-0044743-4sep0e5y
authors: Klassen, Gerhard; Tatusch, Martha; Himmelspach, Ludmila; Conrad, Stefan
title: Fuzzy Clustering Stability Evaluation of Time Series
date: 2020-05-18
journal: Information Processing and Management of Uncertainty in Knowledge-Based Systems
DOI: 10.1007/978-3-030-50146-4_50
sha: 2ea88272e9e9014c25abbf6a15fb12f59fb5ff42
doc_id: 44743
cord_uid: 4sep0e5y
The discovery of knowledge by analyzing time series is an important field of research. In this paper we investigate multiple multivariate time series, because we assume that analyzing them jointly yields a higher information value than regarding only one time series at a time. There are several approaches which make use of Granger causality or cross-correlation in order to analyze the influence of time series on each other. In this paper we extend the idea of mutual influence and present FCSETS (Fuzzy Clustering Stability Evaluation of Time Series), a new approach which makes use of the membership degrees produced by the fuzzy c-means (FCM) algorithm. We first cluster time series per timestamp and then compare the relative assignment agreement (introduced by Eyke Hüllermeier and Maria Rifqi) of all subsequences. This leads us to a stability score for every time series, which itself can be used to evaluate single time series in the data set. It is then used to rate the stability of the entire clustering. The stability score of a time series is higher the more the time series sticks to its peers over time. This not only reveals a new notion of mutual time series impact but also enables the identification of an optimal number of clusters per timestamp. We applied our model to different data, such as financial, country-related economic and generated data, and present the results.
The analysis of sequential data - so called time series (TS) - is an important field of data mining and already well researched. There are many different tasks, but the identification of similarities and outliers is probably among the most important ones. Clustering algorithms try to solve exactly these problems. There are various approaches for extracting information from time series data with the help of clustering. While some methods deal with parts of time series, so called subsequences [2], others consider the whole sequence at once [9, 28], or transform them to feature sets first [17, 34]. In some applications clusters may overlap, so that membership grades are needed, which enable data points to belong to more than one cluster to different degrees. These methods fall into the field of fuzzy clustering and they are used in time series analysis as well [24]. However, in some cases the exact course of time series is not relevant but rather the detection of groups of time series that follow the same trend. Additionally, time-dependent information can be meaningful for the identification of patterns or anomalies. For this purpose it is necessary to cluster the time series data per time point, as the comparison of whole (sub-)sequences at once leads to a loss of information. For example, in case of the Euclidean distance the mean distance over all time points is considered. In case of Dynamic Time Warping (DTW) the smallest distance is relevant. The information at one timestamp therefore has barely any impact. The approach of clustering time series per time point enables an advanced analysis of their temporal correlation, since the behavior of sequences with respect to their cluster peers can be examined.
Fig. 1 (cf. [32]): The blue clusters are more stable over time than the red ones.
In the following this procedure will be called over-time clustering. An example is shown in Fig. 1. Note that, for simplicity, only univariate time series are illustrated. However, over-time clustering is especially valuable for multivariate time series analysis. Unfortunately, new problems arise, such as the right choice of parameters. The comparison of clusterings with different parameter settings is often difficult, since there is no evaluation function which properly distinguishes the quality of clusterings. In addition, some methods, such as outlier detection, require a good clustering as a basis, whereby the quality can contextually be equated with the stability of the clusters. In this paper, we focus on multiple multivariate time series with the same length and equivalent time steps. We introduce an evaluation measure named FCSETS (Fuzzy Clustering Stability Evaluation of Time Series) for the over-time stability of a fuzzy clustering per time point. For this purpose our approach rates the over-time stability of all sequences considering their cluster memberships. To the best of our knowledge this is the first approach that enables the stability evaluation of clusterings and sequences regarding the temporal linkage of clusters. Over-time clustering can be helpful in many applications. For example, the development of relationships between different terms can be examined when tracking topics in online forums. Another application example is the analysis of financial data. The over-time clustering of different companies' financial data can be helpful for the detection of anomalies or even fraud. If the courses of different companies' financial data can be divided into groups, e.g. regarding their success, the investigation of clusters and their members' transitions might be a fundamental step for further analysis. As probably not all fraud cases are known (some may remain undetected), this problem cannot be solved with fully supervised learning. The stability evaluation of temporal clusterings offers a great benefit, as it not only enables the identification of suitable hyper-parameters for different algorithms but also ensures a reliable clustering as a basis for further analysis. In the field of time series analysis, different techniques for clustering time series data have been proposed. However, to the best of our knowledge, there does not exist any approach similar to ours. The approaches described in [8, 19, 28] cluster entire sequences of multiple time series. This procedure is not well suited for our context because potential correlations between subsequences of different time series are not revealed. Additionally, the exact course of the time series is not relevant, but rather the trend they show. The problem of not recognizing interrelated subsequences also persists in a popular method where the entire sequences are first transformed to feature vectors and then clustered [17]. Methods for clustering streaming data like the ones proposed in [14] and [25] are not comparable to our method because they consider only one time series at a time and deal with other problems such as high memory requirements and time complexity. Another area related to our work is community detection in dynamic networks. While approaches presented in [12, 13, 26, 36] aim to detect and track local communities in graphs over time, the goal of our method is finding a stable partitioning of time series over the entire period so that time series following the same trend are assigned to the same cluster.
In this section, we first briefly describe the fuzzy c-means clustering algorithm that we use for clustering time series objects at different time points. We then review related work with regard to time-independent evaluation measures for clusterings. Finally, we describe a resampling approach for cluster validation and a fuzzy variant of the Rand index that we use in our method. Fuzzy c-means (FCM) [4, 7] is a partitioning clustering algorithm that is considered a fuzzy generalization of the hard k-means algorithm [22, 23]. FCM partitions an unlabeled data set X = {x_1, ..., x_n} into c clusters represented by their prototypes V = {v_1, ..., v_c}. Unlike k-means, which assigns each data point to exactly one cluster, FCM assigns data points to clusters with membership degrees u_ik ∈ [0, 1], 1 ≤ i ≤ c, 1 ≤ k ≤ n. FCM is a probabilistic clustering algorithm, which means that its partition matrix U = [u_ik] must satisfy the two conditions given in (1): the membership degrees of each data point sum to one over all clusters (Σ_{i=1..c} u_ik = 1 for all k), and no cluster is empty (Σ_{k=1..n} u_ik > 0 for all i). Since we focus on partition matrices produced by arbitrary fuzzy clustering algorithms, we skip further details of FCM and refer to the literature [4]. Many different external and internal evaluation measures for evaluating clusters and clusterings have been proposed in the literature. In the case of external evaluation, the clustering results are compared with a ground truth which is already known. In internal evaluation, no information about the actual partitioning of the data set is known, so that the clusters are often evaluated primarily on the basis of characteristics such as compactness and separation. One metric that evaluates the compactness of clusters is the Sum of Squared Errors. It calculates the overall distance between the data points and the cluster prototype. In the case of fuzzy clustering, these distances are additionally weighted by the membership degrees. The better the data objects are assigned to clusters, the smaller the error and the greater the compactness. However, this measure does not explicitly take the separation of different clusters into account. There are dozens of fuzzy cluster validity indices that evaluate the compactness as well as the separation of different clusters in the partitioning. Some validity measures use only membership degrees [20, 21], others include the distances between the data points and cluster prototypes [3, 5, 11, 35]. All these measures cannot be directly compared to our method because they lack a temporal aspect. However, they can be applied in FCSETS for producing an initial partitioning of a data set at different time points. The idea of the resampling approach for cluster validation described in [30] is that the choice of parameters for a clustering algorithm is optimal when different partitionings produced for these parameter settings are most similar to each other. The unsupervised cluster stability value s(c), c_min ≤ c ≤ c_max, that is used in this approach is calculated as the average pairwise distance between m partitionings, i.e. s(c) = 2/(m(m-1)) Σ_{1 ≤ i < j ≤ m} d(U_ci, U_cj), where U_ci and U_cj, 1 ≤ i < j ≤ m, are two partitionings produced for c clusters and d(U_ci, U_cj) is an appropriate similarity index of partitionings. Our stability measure is similar to the unsupervised cluster stability value, but it includes the temporal dependencies of clusterings. Since we deal with fuzzy partitionings, in our approach we use a modified version of the Hüllermeier-Rifqi Index [18].
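As background for the following sections, the per-timestamp clustering step can be made concrete with a minimal sketch: a small probabilistic FCM in NumPy that is applied independently to the data objects of every timestamp. The function names, the fuzzifier value and the convergence threshold are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def fuzzy_c_means(X, c, m=2.0, max_iter=100, tol=1e-5, seed=0):
    """Minimal probabilistic fuzzy c-means.

    X: (n_objects, n_features) data at one timestamp.
    Returns a membership matrix U of shape (c, n_objects) whose columns sum to 1.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    U = rng.random((c, n))
    U /= U.sum(axis=0, keepdims=True)              # probabilistic constraint
    for _ in range(max_iter):
        Um = U ** m
        centers = (Um @ X) / Um.sum(axis=1, keepdims=True)
        # squared Euclidean distances between prototypes and objects
        d2 = ((centers[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
        d2 = np.fmax(d2, 1e-12)                    # avoid division by zero
        inv = d2 ** (-1.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=0, keepdims=True)
        if np.abs(U_new - U).max() < tol:
            U = U_new
            break
        U = U_new
    return U

def cluster_over_time(data_per_timestamp, k_per_timestamp):
    """Over-time clustering: cluster each timestamp separately.

    data_per_timestamp: list of (n_series, n_features) arrays, one per timestamp.
    k_per_timestamp:    list of cluster counts k_{t_i}, one per timestamp.
    Returns the ordered list of membership matrices [U_{t_1}, ..., U_{t_n}].
    """
    return [fuzzy_c_means(X, k) for X, k in zip(data_per_timestamp, k_per_timestamp)]
```

The second helper simply collects the membership matrices U_{t_1}, ..., U_{t_n} that the stability measure later operates on.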
There are other similarity indices for comparing fuzzy partitions, like Campello's Fuzzy Rand Index [6] or the Frigui Fuzzy Rand Index [10], but they are not reflexive. The Hüllermeier-Rifqi Index (HRI) is based on the Rand Index [29], which measures the similarity between two hard partitions. The Rand index between two hard partitions U_{c×n} and Ũ_{c̃×n} of a data set X is calculated as the ratio of all concordant pairs of data points to all pairs of data points in X. A data pair (x_k, x_j), 1 ≤ k, j ≤ n, is concordant if either the data points x_k and x_j are assigned to the same cluster in both partitions U and Ũ, or they are in different clusters in both U and Ũ. Since fuzzy partitions allow a partial assignment of data points to clusters, the authors of [18] proposed an equivalence relation E_U(x_k, x_j) on X (Formula (3)) for the calculation of the assignment agreement of two data points to clusters in a partition. Using the equivalence relation E_U(x_k, x_j) given in Formula (3), the Hüllermeier-Rifqi index is defined as a normalized degree of concordance between two partitions U and Ũ (Formula (4)). In [31], Runkler has proposed the Subset Similarity Index (SSI), which is more efficient than the Hüllermeier-Rifqi Index. The efficiency gain of the Subset Similarity Index is achieved by calculating the similarity between cluster pairs instead of the assignment agreement of data point pairs. We do not use it in our approach because we evaluate the stability of a clustering over time regarding the team spirit of time series. Therefore, in our opinion, the degree of assignment agreement between pairs of time series to clusters at different timestamps contributes more to the stability score of a clustering than the similarity between cluster pairs. In this section we clarify our understanding of some basic concepts regarding our approach. For this purpose we supplement the definitions from [32]. Our method considers multivariate time series, so instead of a definition over real values we regard each time series T_l as a sequence of multivariate data objects o_{t_1,l}, ..., o_{t_n,l}, one per timestamp. The membership degree u_{C_{t_i,j}}(o_{t_i,l}) ∈ [0, 1] expresses the relative degree of belonging of the data object o_{t_i,l} of time series T_l to cluster C_{t_i,j} at time t_i, and U_{t_i} = [u_{C_{t_i,j}}(o_{t_i,l})] denotes the membership matrix at time t_i. Concretely, a fuzzy over-time clustering is the ordered set ζ = (U_{t_1}, ..., U_{t_n}) of all membership matrices. An obvious disadvantage of creating clusters for every timestamp is the missing temporal link. In our approach we assume that clusterings with different parameter settings show differences in the connectedness of clusters and that this connection can be measured. In order to do so, we make use of a stability function. Given a fuzzy clustering ζ, we first analyze the behavior of every subsequence of a time series T = o_{t_1}, ..., o_{t_i}, with t_i ≤ t_n, starting at the first timestamp. In this way we rate a temporal linkage of time series to each other. Time series that are clustered together at all timestamps have a high temporal linkage, while time series which often separate from their clusters' peers indicate a low temporal linkage. One could say we rate the team spirit of the individual time series and therefore their cohesion with other sequences over time. In the example shown in Fig. 2, the time series T_a and T_b show a good team spirit because they move together over the entire period of time. In contrast, the time series T_c and T_d show a lower temporal linkage. While they are clustered together at time points t_i and t_k, they are assigned to different clusters in between at time point t_j.
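For reference, the Hüllermeier-Rifqi index can be sketched directly from two fuzzy partition matrices. The sketch assumes the common formulation in which the equivalence degree of two objects is one minus the normalized L1 distance between their membership vectors; this is an assumption and may differ in detail from Formulas (3) and (4) of the paper. The same building block is reused for the stability sketch further below.

```python
import numpy as np

def equivalence_matrix(U):
    """E_U(x_k, x_j) = 1 - 0.5 * sum_i |u_ik - u_ij| for a (c, n) partition matrix.

    For probabilistic partitions the L1 distance between two membership
    columns lies in [0, 2], so the factor 0.5 normalizes E to [0, 1].
    """
    diff = np.abs(U[:, :, None] - U[:, None, :]).sum(axis=0)   # (n, n) pairwise L1 distances
    return 1.0 - 0.5 * diff

def huellermeier_rifqi_index(U, V):
    """Normalized degree of concordance between two fuzzy partitions of the same n objects."""
    E_U, E_V = equivalence_matrix(U), equivalence_matrix(V)
    n = U.shape[1]
    pairs = np.triu_indices(n, k=1)                             # all unordered object pairs
    return 1.0 - np.abs(E_U[pairs] - E_V[pairs]).mean()
```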
After the evaluation of the individual sequences, we assign a score to the fuzzy clustering ζ, depending on the over-time stability of every time series. Let U_{t_i} be a fuzzy partitioning of the data objects O_{t_i} of all time series into k_{t_i} clusters at time t_i. Similar to the equivalence relation in the Hüllermeier-Rifqi Index, we compute the relative assignment agreement of the data objects o_{t_i,l} and o_{t_i,s} of two time series T_l and T_s, 1 ≤ l, s ≤ m, to all clusters in the partitioning U_{t_i} at time t_i. Having the relative assignment agreement of time series at timestamps t_i and t_r, t_1 ≤ t_i < t_r ≤ t_n, we calculate the difference D_{t_i,t_r}(T_l, T_s) between the relative assignment agreements of time series T_l and T_s by subtracting the relative assignment agreement values. We calculate the stability of a time series T_l, 1 ≤ l ≤ m, over all timestamps as an averaged weighted difference between the relative assignment agreements to all other time series (Formula (7)). In Formula (7) we weight the difference between the assignment agreements D_{t_i,t_r}(T_l, T_s) by the assignment agreement between the pair of time series at the earlier time point, because we want to damp the large differences for stable time series caused by the arrival of new peers. On the other hand, we aim to penalize time series that leave their cluster peers when changing cluster membership at a later time point. Finally, we rate the over-time stability of a clustering ζ as the averaged stability of all time series in the data set. As we have already stated, the over-time stability of the entire clustering depends on the stability of all time series with regard to staying together in a cluster with time series that follow the same trend. In the following, we present the results on an artificially generated data set that demonstrates a meaningful usage of our measure and shows the impact of the stability evaluation. Additionally, we discuss experiments on two real-world data sets. One consists of financial figures from balance sheets and the other contains country-related economic data. In all cases fuzzy c-means was used with different parameter combinations for the number of clusters per time point. In order to show the effects of a rating based on our stability measure, we generated an artificial data set with time series that move between two separated groups. To this end, at first, three random centroids with two features in [0, 1] were placed for time point 1. These centroids were randomly shifted for the next timestamps, whereby the maximal distance of a centroid at two consecutive time points could not exceed 0.05 per dimension. Afterwards 3, 4 and 5 time series were assigned to these centroids, respectively. This means that the data points of a time series for each time point were placed next to the assigned centroid with a maximal distance of 0.1 per feature. Subsequently, sequences with random transitions between two of the three clusters were inserted. To this end, three time series (namely 1, 2 and 3) were generated, which were randomly assigned to one of these two clusters at every time point. Altogether, a total of 4 time points and 15 time series were examined. To find the best stability score for the data set, FCM was used with various settings for the number of clusters per time point. All combinations with k_{t_i} ∈ {2, 3, 4, 5} were investigated. Figure 3 shows the resulting fuzzy clustering with the highest FCSETS score of 0.995. For illustration reasons the clustering was defuzzified.
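Before turning to the results, the stability measure described above can be sketched in code. This is only one plausible reading of the prose: the pairwise agreement is assumed to follow the Hüllermeier-Rifqi-style equivalence, the drop in agreement between an earlier and a later timestamp is weighted by the agreement at the earlier timestamp, and the result is turned into a score in [0, 1]. The exact Formulas (5)-(8) are defined in the original paper and may differ from this sketch.

```python
import numpy as np

def relative_assignment_agreement(U):
    """Pairwise agreement of all time series under one membership matrix U_{t_i} of shape (k, m).

    Assumed form: 1 - 0.5 * L1 distance between membership columns, mirroring the
    Hüllermeier-Rifqi equivalence relation.
    """
    return 1.0 - 0.5 * np.abs(U[:, :, None] - U[:, None, :]).sum(axis=0)

def fcsets_score(memberships):
    """Illustrative FCSETS-style stability of an over-time clustering.

    memberships: list of membership matrices [U_{t_1}, ..., U_{t_n}], each of shape (k_{t_i}, m).
    Returns the mean over all time series of a weighted-average agreement change.
    """
    agreements = [relative_assignment_agreement(U) for U in memberships]
    n, m = len(agreements), agreements[0].shape[0]
    stabilities = []
    for l in range(m):
        num, den = 0.0, 0.0
        for i in range(n - 1):
            for r in range(i + 1, n):
                for s in range(m):
                    if s == l:
                        continue
                    w = agreements[i][l, s]                        # agreement at the earlier time point
                    d = agreements[i][l, s] - agreements[r][l, s]  # D_{t_i,t_r}(T_l, T_s)
                    num += w * abs(d)
                    den += w
        stabilities.append(1.0 - num / den if den > 0 else 1.0)
    return float(np.mean(stabilities))
```

Under this reading, a series that stays with its strongly agreeing peers keeps the weighted differences small and scores close to 1, while a series that leaves peers it was strongly tied to is penalized, matching the behavior described above.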
Although it might seem intuitive to use a partitioning with three clusters at time points 1 and 2, regarding the over-time stability it is beneficial to choose only two clusters. This can be explained by the fact that there are time series that move between the two apparent groups of the upper (blue) cluster. The stability is therefore higher when these two groups are clustered together. In Table 1 a part of the corresponding scores for the different parameter settings of k_{t_i} is listed. As shown in Fig. 3, the best score is achieved with k_{t_i} set to 2 for all time points. The worst score results from the setting k_{t_1} = 2, k_{t_2} = 3, k_{t_3} = 4 and k_{t_4} = 5. The score is not only decreased because the upper (blue) cluster is divided in this case, but also because the number of clusters varies and therefore sequences get separated from their peers. It is obvious that the stability score is negatively affected if the number of clusters changes significantly over time. This influence is also expressed by the score of 0.577 for the extreme example in the last row. The first data set was released by Refinitiv (formerly Thomson Reuters Financial & Risk) and is called EIKON. The database contains structured financial data of thousands of companies for more than the past 20 years. For ease of demonstration, two features and 23 companies were chosen randomly for the experiment. The selected features are named TR-NetSales and TR-TtlPlanExpectedReturn by Thomson Reuters and correspond to the net sales and the total plan expected return, which are figures taken from the companies' balance sheets. As it is a common procedure in economics, we divided the features by each company's total assets and normalized them afterwards with a min-max normalization. We generated the clusterings for all combinations of k_{t_i} from two to five clusters per timestamp. Selected results can be seen in Table 2. The actual maximum retrieved from the iterations (in the third row) is printed in bold. The worst score can be found in the last row and represents an unstable clustering. It can be seen that the underlying data is well separated into three clusters at the first point in time and into two clusters at the following timestamps. This is actually a rare case, but it can be explained by the selection of features and companies: TR-TtlPlanExpectedReturn is rarely provided by Thomson Reuters, and we only chose companies with complete data for all regarded points in time. This may have diminished the number of companies which might have had lower membership degrees. The next data set originates from www.theglobaleconomy.com [1], a website that provides economic data of the past years for different countries. Again, two features were selected randomly for this experiment and were normalized with a min-max normalization. Namely, the features are the "Unemployment Rate" and the "Public spending on education, percent of GDP". For illustration reasons, we considered only a subset of 28 countries for the years from 2010 to 2017. The results are shown in Table 3. It can be seen that the best score is achieved with two clusters at every point in time. Evidently the chosen countries can be well separated into two groups at every point in time. More clusters or different numbers of clusters for different timestamps performed worse. In this experiment we also iterated over all combinations of k_{t_i} for the given points in time.
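The exhaustive search over cluster counts used in these experiments can be sketched as follows; `cluster_over_time` and `fcsets_score` refer to the illustrative helpers introduced earlier, and the default range of two to five clusters mirrors the settings reported above.

```python
from itertools import product

def best_over_time_clustering(data_per_timestamp, k_min=2, k_max=5):
    """Try every combination of cluster counts k_{t_i} per timestamp and keep the
    combination with the highest FCSETS score (illustrative grid search)."""
    n_timestamps = len(data_per_timestamp)
    best_score, best_ks, best_clustering = -1.0, None, None
    for ks in product(range(k_min, k_max + 1), repeat=n_timestamps):
        memberships = cluster_over_time(data_per_timestamp, list(ks))  # sketch defined earlier
        score = fcsets_score(memberships)                              # sketch defined earlier
        if score > best_score:
            best_score, best_ks, best_clustering = score, ks, memberships
    return best_ks, best_score, best_clustering
```

Note that the number of combinations grows exponentially with the number of timestamps, so such a brute-force search is only feasible for short horizons like the four to eight timestamps considered in the experiments.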
The maximum printed in bold and the minimum, which can be found in the last row of the table, represent the actual maximum and minimum within the range of the iterated combinations. In this paper we presented a new method for analyzing multiple multivariate time series with the help of fuzzy clustering per timestamp. Our approach defines a new target function for sequence-based clustering tasks, namely the stability of sequences. In our experiments we have shown that this enables the identification of an optimal k_{t_i} per timestamp and that our measure can not only rate time series and clusterings but can also be used to evaluate the stability of data sets. The latter is possible by examining the maximum achieved FCSETS score. Our approach can be applied whenever similar behavior for groups of time series can be assumed. As it is based on membership degrees, clusterings with overlapping clusters and soft transitions can be handled. With the help of our evaluation measure a stable over-time clustering can be achieved, which can be used for further analysis such as outlier detection. Future work could include the development of a fuzzy clustering algorithm which is based on our formulated target function. The temporal linkage could then already be taken into account when determining groups of time series. Another interesting field of research could be the examination of other fuzzy clustering algorithms like the Possibilistic Fuzzy c-Means algorithm [27]. This algorithm can also handle outliers, which can be handy for certain data sets. In the experiment with the GlobalEconomy data set we faced the problem that one outlier would form a cluster of its own at every point in time. This led to very high FCSETS scores. The handling of outliers could overcome such misbehavior. Future work should also include the application of our approach to incomplete data, since appropriate fuzzy clustering approaches already exist [15, 16, 33]. We faced this problem when applying our algorithm to the EIKON financial data set. Also, the identification of time series that show a good team spirit for a specific time period could be useful in some applications and might therefore be investigated. Finally, the examination and optimization of FCSETS' computational complexity would be of great interest, as it currently seems to be fairly high.
References
[1] Global economy (www.theglobaleconomy.com)
[2] Clickstream clustering using weighted longest common subsequences
[3] Adaptive optimization of the number of clusters in fuzzy clustering
[4] Pattern Recognition with Fuzzy Objective Function Algorithms
[5] An objective approach to cluster validation
[6] A fuzzy extension of the Rand index and other related indexes for clustering and classification assessment
[7] A fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters
[8] Clustering short time series gene expression data
[9] Time series clustering via community detection in networks
[10] Clustering and aggregation of relational data with applications to image database categorization
[11] A new method of choosing the number of clusters for the fuzzy c-means method
[12] Benchmark model to assess community structure in evolving networks
[13] Tracking the evolution of communities in dynamic social networks
[14] Clustering data streams: theory and practice
[15] Fuzzy c-means clustering of incomplete data
[16] Fuzzy c-means clustering of incomplete data using dimension-wise fuzzy variances of clusters
[17] Time series k-means: a new k-means type smooth subspace clustering for time series data
[18] A fuzzy variant of the Rand index for comparing clustering structures
[19] Fuzzy clustering of time series data using dynamic time warping distance
[20] A cluster validation index for GK cluster analysis based on relative degree of sharing
[21] A cluster-validity index combining an overlap measure and a separation measure based on fuzzy-aggregation operators
[22] Least squares quantization in PCM
[23] Some methods for classification and analysis of multivariate observations
[24] Fuzzy clustering of short time-series and unevenly distributed sampling points
[25] Streaming-data algorithms for high-quality clustering
[26] The rise and fall of spatio-temporal clusters in mobile ad hoc networks
[27] A possibilistic fuzzy c-means clustering algorithm
[28] k-Shape: efficient and accurate clustering of time series
[29] Objective criteria for the evaluation of clustering methods
[30] A resampling approach to cluster validation
[31] Comparing partitions by subset similarities
[32] Show me your friends and I'll tell you who you are: finding anomalous time series by conspicuous cluster transitions
[33] Different approaches to fuzzy clustering of incomplete datasets
[34] A novel clustering-based method for time series motif discovery under time warping measure
[35] A validity measure for fuzzy clustering
[36] A dynamic algorithm for local community detection in graphs
Acknowledgement. We thank the Jürgen Manchot Foundation, which supported this work by funding the AI research group Decision-making with the help of Artificial Intelligence at Heinrich Heine University Düsseldorf.