Word Embeddings and Validity Indexes in Fuzzy Clustering
Danial Toufani-Movaghar, Mohammad-Reza Feizi-Derakhshi
2022-04-26

Abstract

In the new era of internet systems and applications, detecting distinguished topics in huge amounts of text has gained a lot of attention. Such methods rely on numerical representations of text, called embeddings, to imitate human judgments of semantic similarity between words. In this study, we perform a fuzzy-based analysis of various vector representations of words, i.e., word embeddings. We also introduce new fuzzy clustering methods based on a hybrid implementation of fuzzy clustering with an evolutionary algorithm named Forest Optimization. We apply two popular fuzzy clustering algorithms to word embeddings produced with different methods and dimensionalities. Words about COVID were gathered from a Kaggle dataset, converted into vectors, and clustered. The results indicate that fuzzy clustering algorithms are very sensitive to high-dimensional data and that parameter tuning can dramatically change their performance. We evaluate the experimental results with various cluster validity indexes to compare the algorithm variants across embeddings.

I Introduction

Word embeddings have become an essential element of methods that focus on the analysis and comparison of texts. Some of the most popular embeddings are Word2Vec, GloVe, T5, and BERT. An embedding is obtained via analysis of word-word co-occurrences in a text corpus. A natural question concerns the ability of embeddings to represent human-based semantic similarity of words. We showed previously that the choice of embedding has a significant effect on clustering results [1].

Data clustering is the process of grouping objects such that the similarity between data points belonging to the same group (cluster) is as high as possible, while the similarity between points from different groups is as small as possible. It is an important analysis task and has been successfully applied to pattern recognition, image segmentation, fault diagnosis, and search engines [1]. Fuzzy clustering, which allows data points to belong to several clusters with different membership grades, has proved useful in many areas. Specifically, fuzzy C-Means (FCM) clustering and its augmented version, fuzzy Gustafson-Kessel (FGK) clustering, are the most popular fuzzy clustering techniques. Both methods can terminate in local optima, and combining them with an evolutionary meta-heuristic search method such as the Forest Optimization Algorithm helps the search avoid such situations.

High-dimensional spaces often have a devastating effect on data clustering in terms of performance and quality; this issue is known as the curse of dimensionality. The fuzzy Gustafson-Kessel algorithm in particular performs poorly on high-dimensional data. Our study evaluates four clustering methods: fuzzy C-Means; fuzzy Gustafson-Kessel; FOA fuzzy C-Means, which is a hybrid implementation of fuzzy C-Means and Forest Optimization; and FOA Gustafson-Kessel, which is a hybrid implementation of fuzzy Gustafson-Kessel and Forest Optimization. We evaluated the results on a text dataset of 500 items, embedded with four different algorithms that produce the same output dimension.
Additionally, we illustrate the 'usefulness' of fuzzy clustering via an analysis of the degrees of belonging of words to different clusters. To compare the embeddings and the various algorithms involved, we used 7 cluster validity indexes, which are described in the following sections.

The paper is structured as follows: Section II contains the description of the Fuzzy C-Means (FCM) and Fuzzy Gustafson-Kessel (FGK) algorithms and their evolutionary hybrid versions with Forest Optimization (FOA). In Section III, we provide an overview of the validity indices applied to fuzzy clustering. In Section IV we briefly describe the embedding methods used in this paper. Section V presents our experimental results. Finally, the obtained conclusions are presented in Section VI.

II Fuzzy Clustering

In contrast to hard clustering techniques, where each point is assigned to exactly one cluster, fuzzy clustering allows data points to pertain to several clusters with different grades of membership [1]. We have analyzed the behavior of FCM and FGK in clustering vector representations of words of different dimensions. The details of these clustering methods are described in the following subsections.

A. Fuzzy C-Means

The Fuzzy C-Means algorithm [1] allows an observation to belong to multiple clusters with varying grades of membership. With D the number of data points, N the number of clusters, m the fuzzifier parameter, x_i the i-th data point, c_j the center of the j-th cluster, and \mu_{ij} the membership degree of x_i in the j-th cluster, FCM aims to minimize

J = \sum_{i=1}^{D} \sum_{j=1}^{N} \mu_{ij}^{m} \, \lVert x_i - c_j \rVert^{2} .

FCM clustering proceeds in the following way [1]:
1. The membership values \mu_{ij} and the initial cluster centers are initialized randomly.
2. The cluster centers are computed according to the formula

c_j = \frac{\sum_{i=1}^{D} \mu_{ij}^{m} x_i}{\sum_{i=1}^{D} \mu_{ij}^{m}} .

3. The membership grades \mu_{ij} are updated via

\mu_{ij} = \left( \sum_{k=1}^{N} \left( \frac{\lVert x_i - c_j \rVert}{\lVert x_i - c_k \rVert} \right)^{\frac{2}{m-1}} \right)^{-1} ,

and the objective function J is calculated.
4. Steps 2 and 3 are repeated until the decrease of the objective function falls below a specified threshold.
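To make the update rules above concrete, the following is a minimal NumPy sketch of the FCM loop. It is an illustrative implementation under our own assumptions: the array layout, the iteration cap, the tolerance tol, and the random seed are not details from the paper.

```python
import numpy as np

def fcm(X, n_clusters, m=2.0, tol=1e-5, max_iter=300, seed=0):
    """Minimal FCM sketch: X is (D, dim); returns (centers, memberships)."""
    rng = np.random.default_rng(seed)
    D = X.shape[0]
    # Step 1: random membership matrix whose rows sum to 1.
    U = rng.random((D, n_clusters))
    U /= U.sum(axis=1, keepdims=True)
    J_prev = np.inf
    for _ in range(max_iter):
        Um = U ** m
        # Step 2: centers as membership-weighted means of the data.
        C = (Um.T @ X) / Um.sum(axis=0)[:, None]
        # Step 3: update memberships from distances to the new centers.
        dist = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2) + 1e-12
        U = 1.0 / (dist ** (2.0 / (m - 1.0)))
        U /= U.sum(axis=1, keepdims=True)
        J = np.sum((U ** m) * dist ** 2)
        # Step 4: stop once the objective no longer decreases meaningfully.
        if abs(J_prev - J) < tol:
            break
        J_prev = J
    return C, U
```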
B. Fuzzy Gustafson-Kessel

Fuzzy Gustafson-Kessel (FGK) extends FCM by introducing an adaptive distance norm that allows the algorithm to identify clusters with different geometrical shapes [1]. The distance metric is defined as

d_{ij}^{2} = (x_i - c_j)^{T} A_j (x_i - c_j), \quad A_j = \left( \rho_j \det(F_j) \right)^{1/n} F_j^{-1},

where F_j is the fuzzy covariance matrix of each cluster:

F_j = \frac{\sum_{i=1}^{D} \mu_{ij}^{m} (x_i - c_j)(x_i - c_j)^{T}}{\sum_{i=1}^{D} \mu_{ij}^{m}} .

Allowing the matrix A_j to change while its determinant stays fixed optimizes the shape of each cluster while keeping the cluster's volume constant [1]. Gustafson-Kessel clustering minimizes the criterion

J = \sum_{i=1}^{D} \sum_{j=1}^{N} \mu_{ij}^{m} \, d_{ij}^{2} .

C. Forest Optimization Algorithm

The Forest Optimization Algorithm (FOA) (Ghaemi and Feizi-Derakhshi, 2014) is suited to continuous non-linear optimization problems. It is inspired by the survival of ancient trees: while many trees are short-lived, a number of trees remain in existence even after decades. The algorithm simulates seed dispersal, in which a number of seeds fall just under their trees while others are spread across a vast area of the forest by natural events such as wind. Reported results suggest that it finds optimal positions more accurately than genetic algorithms and particle swarm optimization [2]. The algorithm has three main steps: (1) local seed production, (2) removal of some members of the population, and (3) global seed production. As in other evolutionary algorithms, the forest of trees is created in an initialization step. A tree, in addition to the values of the variables, carries a field that represents its age; the age of every tree is zero at first.

After initialization of the trees, in the local seed production step, some new trees of age zero (new seeds) are created, and the age of every previous tree is then increased by one. In the second step, trees in excess of the pre-defined population size must be removed; removal is based on the fitness function values, and the omitted trees form the candidate population for the global seeding step. In global seed production, a percentage of the candidate population is chosen to move far away in the forest; this step adds new potential solutions to the forest so that the search moves away from local optima. Finally, after sorting the trees according to their fitness values, the tree with the highest fitness is selected as the best tree, and its age is reset to 0 so that the best tree does not age out as a result of the local seeding step.

D. Proposed Hybrid Method

In this section we introduce our new clustering method, a combination of fuzzy C-Means and fuzzy Gustafson-Kessel with the Forest Optimization Algorithm. In both FCM and FGK, some random vectors are initialized as cluster centers, and a local search process then minimizes the objective function to find the best centers. In our proposed hybrid algorithm, Forest Optimization instead manages both the local and the global search and initializes the cluster centers, while the cluster-center calculation is delegated to the original algorithm run for a single iteration. Because the objective functions of both algorithms target the minimization of a criterion, we inject that criterion as the fitness function of Forest Optimization. Therefore, Forest Optimization performs the local and global search, while the calculations are done by fuzzy C-Means or Gustafson-Kessel.
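A condensed sketch of how the hybrid couples the two components is shown below: each FOA tree encodes a candidate set of cluster centers, and the fitness of a tree is the FCM objective obtained from a single membership update. The function foa_optimize and its parameters are a hypothetical interface standing in for the seeding loop described above, not a library call.

```python
import numpy as np

def fcm_fitness(flat_centers, X, n_clusters, m=2.0):
    """One FCM iteration as the FOA fitness: given candidate centers (a 'tree'),
    compute memberships and return the FCM objective J (lower is better)."""
    C = flat_centers.reshape(n_clusters, X.shape[1])
    dist = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2) + 1e-12
    U = 1.0 / (dist ** (2.0 / (m - 1.0)))
    U /= U.sum(axis=1, keepdims=True)
    return float(np.sum((U ** m) * dist ** 2))

# Hypothetical FOA driver (local seeding, population pruning, global seeding).
# Since FOA keeps the highest-fitness tree and J is minimized, we negate J:
# best_tree = foa_optimize(lambda t: -fcm_fitness(t, X, k),
#                          dim=k * X.shape[1],
#                          life_time=6, local_seeds=2, global_seeds=1)
```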
III Validity Index

There are several validity indices for analyzing the performance of fuzzy clustering algorithms, each targeting one or more aspects of clustering quality. Our study uses 7 validity indexes [3, 4]; brief descriptions include:
• an index that recognizes clusters of different sizes independently of their expansion, their location in the feature space, and their closeness to each other;
• an index whose compactness measure uses the distances between data items and cluster prototypes directly, while its separation measure uses the distances between cluster centers and the grand mean of the data set;
• XB (Xie and Beni): analyzes partitionings of the data set particularly with regard to the overlaps between clusters and the variations in cluster density, orientation, and shape.

IV Embedding Methods

A. Word2Vec

Word2Vec represents words in a vector space: words are represented as vectors and placed such that words with similar meanings appear close together while dissimilar words are located far apart; this is also termed a semantic relationship. Neural networks do not understand text, only numbers, and word embedding provides a way to convert text to a numeric vector. Word2Vec reconstructs the linguistic context of words. What is linguistic context? In everyday communication, when we speak or write, other people try to figure out the objective of the sentence. For example, in "What is the temperature of India", the context is that the user wants to know the "temperature of India". In short, the main objective of a sentence is its context. The spoken or written language surrounding a word (its discourse) helps determine that context, and Word2Vec learns vector representations of words through these contexts [1].

B. BERT

BERT stands for Bidirectional Encoder Representations from Transformers. Unlike earlier language representation models, BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications.

C. GloVe

GloVe is essentially a log-bilinear model with a weighted least-squares objective. The main intuition underlying the model is the simple observation that ratios of word-word co-occurrence probabilities have the potential to encode some form of meaning. For example, consider the co-occurrence probabilities of the target words ice and steam with various probe words from the vocabulary, computed on a 6-billion-word corpus. As one might expect, ice co-occurs more frequently with solid than with gas, whereas steam co-occurs more frequently with gas than with solid. Both words co-occur frequently with their shared property water, and both co-occur infrequently with the unrelated word fashion. Only in the ratio of probabilities does noise from non-discriminative words like water and fashion cancel out, so that large values (much greater than 1) correlate well with properties specific to ice, and small values (much less than 1) correlate well with properties specific to steam. In this way, the ratio of probabilities encodes some crude form of meaning associated with the abstract concept of thermodynamic phase.

The training objective of GloVe is to learn word vectors such that their dot product equals the logarithm of the words' probability of co-occurrence. Because the logarithm of a ratio equals the difference of logarithms, this objective associates (the logarithm of) ratios of co-occurrence probabilities with vector differences in the word vector space. Because these ratios can encode some form of meaning, this information gets encoded as vector differences as well. For this reason, the resulting word vectors perform very well on word analogy tasks, such as those examined in the word2vec package.

D. T5

T5 is a recently released encoder-decoder model that reaches state-of-the-art results by solving NLP problems with a text-to-text approach, where text is used as both the input and the output for all types of tasks. It was introduced in the paper "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer". Some distinctive features of T5 are:
• an encoder-decoder Transformer;
• relative positional self-attention, a simplified positional embedding in which each embedding is a scalar determined by the offset between the key and query being compared in the self-attention mechanism, with the position parameters shared across the model's layers.
Some pros and cons of the model are:
• a unified "text-to-text" format;
• a consistent training objective (maximum likelihood);
• a task-specific (text) prefix;
• a label-mismatch issue.

V Experiments

For the evaluation of the defined algorithms, we chose 2 datasets. For testing and proof of concept we used the IRIS dataset, which led to the results in Figure 1.
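As an illustration of how a text item can be turned into the fixed-size vectors fed to the clustering step, here is a sketch using the Hugging Face transformers library. The model name, the mean-pooling strategy, and the example sentences are our own assumptions; the paper does not specify how its BERT vectors were pooled.

```python
import numpy as np
import torch
from transformers import AutoTokenizer, AutoModel

def bert_embed(texts, model_name="bert-base-uncased"):
    """Mean-pooled last-hidden-state embeddings; pooling choice is ours."""
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    enc = tok(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state      # (batch, seq, dim)
    mask = enc["attention_mask"].unsqueeze(-1)       # ignore padding tokens
    return ((hidden * mask).sum(1) / mask.sum(1)).numpy()

# Hypothetical example inputs, producing one 768-d vector per text.
vectors = bert_embed(["vaccination rates are rising",
                      "new covid variant found"])
print(vectors.shape)   # (2, 768) for bert-base
```

For Word2Vec or GloVe, an analogous per-text vector could be obtained by averaging pretrained per-word vectors, e.g., via gensim's KeyedVectors.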
The results of the 4 fuzzy algorithms explained in Section II are shown in the table. For clustering validity we used the 7 cluster validity indexes listed in Section III. We ran each algorithm for cluster counts 1 to 5 and gathered the validity index values for each count. Finally, depending on whether a given validity index should be minimized or maximized, we chose the best value for it. Regarding running time, we ran the FOA versions with an iteration count of 10 (epochs); each iteration takes about as long as one run of the corresponding base algorithm, so the FOA version of an algorithm is roughly N times slower than the basic version, where N is the number of FOA iterations.

The text data contain the tweet content, the tweeting account, hashtags, and location. The retweet argument is set to FALSE, so retweets are ignored. The dataset language is set to 'EN', so only tweets in English are considered. For folding the data, we take only tweets from the Mondays of each week in our survey period, choosing 100 tweets per Monday; over these 5 weeks we therefore have 500 records for analysis. In text pre-processing, all emojis, mentions, links, and hashtags are removed, and if a tweet is left with just 4 words after tokenization, it is removed and replaced with a new one. Finally, the data are embedded with the 4 algorithms named Word2Vec, GloVe, BERT, and T5. The results are then fed into both the fuzzy C-Means and fuzzy Gustafson-Kessel algorithms, and the outcomes are gathered in Figure 2.

When executing the algorithms defined in Section II, the fuzzy Gustafson-Kessel algorithm is about 10 times slower than fuzzy C-Means because of the covariance-matrix computation. In the proposed hybrid form, the evolutionary algorithm searches across the whole dataset, so the execution time exceeds that of the plain fuzzy algorithms. With high-dimensional data, the covariance-matrix computation of fuzzy Gustafson-Kessel is so time-consuming that it is not recommended; because of the high dimension of text-embedding results, we do not propose fuzzy Gustafson-Kessel for this application, and we also omitted the results for the hybrid version of fuzzy Gustafson-Kessel.

Table 2 shows the evaluation of the cluster validity indexes for the 4 embeddings described in Section IV under both algorithms. As the results show, all embeddings except GloVe perform well and help in clustering a large text dataset; moreover, the T5 and Word2Vec embeddings lead both algorithms to detect the correct number of clusters. Among the validity indexes, only the Bouguessa-Wang-Sun (BWS) index out of the 7 was able to detect the right number of clusters in big text data. Our proposed choice for text clustering is therefore the Bouguessa-Wang-Sun (BWS) index.
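The sweep over cluster counts can be sketched as below, reusing the fcm() function from Section II and the tweet vectors from the previous sketch. Since the paper does not reproduce the BWS formula, we illustrate the selection step with the Xie-Beni index instead, whose standard form (compactness over minimum center separation, to be minimized) is shown here.

```python
import numpy as np

def xie_beni(X, C, U, m=2.0):
    """Xie-Beni index: fuzzy compactness divided by D times the minimum
    squared distance between cluster centers; smaller is better."""
    dist2 = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2) ** 2
    compactness = np.sum((U ** m) * dist2)
    sep = min(np.sum((C[j] - C[k]) ** 2)
              for j in range(len(C)) for k in range(len(C)) if j != k)
    return compactness / (len(X) * sep)

# Sweep cluster counts and keep the count with the best (smallest) value.
# The paper sweeps 1 to 5; XB needs at least 2 centers, so we start at 2.
scores = {}
for k in range(2, 6):
    C, U = fcm(vectors, n_clusters=k)
    scores[k] = xie_beni(vectors, C, U)
best_k = min(scores, key=scores.get)
```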
VI Conclusion

In topic clustering, in addition to accuracy, the time needed to evaluate results is a main factor. Fuzzy C-Means, as a basic and useful algorithm, is our proposed algorithm for the job. The embedding is also a main factor leading to more accurate results and plays an important role; any of the T5, BERT, or Word2Vec embeddings can be used, and our proposed embedding is T5. Among the validity indexes, the Bouguessa-Wang-Sun (BWS) index is the best cluster validity index according to our results.

References

Analysis of Word Embeddings Using Fuzzy Clustering.
Efficiency of Cluster Validity Indexes in Fuzzy Clusterwise Generalized Structured Component Analysis.
Fuzzy Clustering of Incomplete Data Based on Cluster Dispersion.
A Validity Index for Fuzzy Clustering Based on Bipartite Modularity.