title: Dynamic clustering for short text stream based on Dirichlet process
authors: Xu, Wanyin; Li, Yun; Qiang, Jipeng
date: 2021-07-26
journal: Appl Intell
DOI: 10.1007/s10489-021-02263-z

Due to the explosive growth of short text on various social media platforms, short text stream clustering has become an increasingly prominent issue. Unlike traditional text streams, short text stream data present the following characteristics: short length, weak signal, high volume, high velocity, topic drift, etc. Existing methods cannot simultaneously address two major problems very well: inferring the number of topics and topic drift. Therefore, we propose a dynamic clustering algorithm for short text streams based on the Dirichlet process (DCSS), which can automatically learn the number of topics in documents and solve the topic drift problem of short text streams. To solve the sparsity problem of short texts, DCSS considers the correlation of the topic distribution at neighbouring time points and uses the inferred topic distribution of past documents as a prior of the topic distribution at the current moment, while simultaneously allowing newly streamed documents to change the posterior distribution of topics. We conduct experiments on two widely used datasets, and the results show that DCSS outperforms existing methods and has better stability.

Short texts are prevalent on the Web, including on traditional websites, e.g., news titles and search snippets, and emerging social media, e.g., microblogs and tweets. In recent years, these data have swept the world at an alarming rate and have produced large quantities of data streams, also called short text streams. Short text stream clustering [1] is challenging due to the inherent characteristics of short text streams, such as the short length, weak signal and high ambiguity of each short text, and the explosive growth and popularity of short textual content. Considering the characteristics of short text streams, it is difficult to apply traditional short text clustering approaches to model building because of two challenges: 1) each short text carries insufficient information or statistical signals [2] to make the analysis meaningful; and 2) topic distributions change over time [3], with previously salient topics "fading off" and vice versa. To address these challenges, Liang et al. [4] proposed a dynamic clustering topic model (DCT) that uses the relevance of topics between different time points to alleviate the sparsity problem of the short text stream. However, DCT requires manual specification of the number of topics, which is a major limitation when addressing real data streams. Later, Yin et al. [5] proposed a short text stream clustering algorithm based on the Dirichlet process multinomial mixture model, called MStream, which can handle concept drift. However, MStream does not consider the correlation of the topic distribution at neighbouring time points, and it ignores the fact that documents with similar time points might have a higher probability of belonging to the same topic, that is, the phenomenon that "the discussion of events from the previous day may continue to the next day". Therefore, to better address the sparsity and topic drift of the short text stream, we propose dynamic clustering for the short text stream based on the Dirichlet process [6], called DCSS.
Our main contributions in this paper are as follows:

- DCSS can detect topic drift in short text streams. DCSS uses the Dirichlet process to initialize the topic of each text and determines whether to add topics by calculating the probability of the existing topics and of a potential new topic, so that there is no need to initialize the number of topics in advance.
- DCSS can alleviate the sparsity problem of the short text stream. DCSS infers the hidden topic of the current short text based not only on the content of the text but also on the topic distribution of the previous time point, used as prior information.
- DCSS is more efficient than baseline methods. We compare DCSS with existing algorithms on public datasets and verify that DCSS outperforms the state-of-the-art baselines. The code to reproduce the results is available at https://github.com/SilverXWY/DCSS.

The remainder of this paper is organized as follows: Section 2 describes the related work; Section 3 details the formulation and procedure of DCSS; Section 4 presents the experimental results; Section 5 summarizes the work of this paper.

Two main types of methods are used to analyse text streams: text stream clustering based on similarity and text stream clustering based on topic models.

Most text stream clustering methods based on similarity use vector space models to represent documents and calculate the similarity between documents or clusters. CluStream [7] is one of the most classic stream clustering methods; it consists of online micro-clustering and offline macro-clustering, and uses a pyramidal time frame to store micro-clusters at different times in the past for future analysis. DenStream [8] combines micro-clustering with the density-estimation process of stream clustering to discover clusters of arbitrary shape and handle outliers. Yoo et al. [9] proposed a streaming spectral clustering method that maintains an approximation of the normalized Laplacian of the data stream over time and efficiently updates the corresponding eigenvectors as the stream evolves. Zhong et al. [10] proposed an efficient text stream clustering algorithm built upon well-known winner-take-all competitive learning that updates cluster centres online. Aggarwal and Yu [11] proposed a text and categorical data stream clustering method that summarizes data streams into fine-grained clusters. Shou et al. [12] proposed Sumblr, a prototype system for continuous summarization of Twitter text streams, which compresses tweets into tweet cluster vectors (TCVs) and processes them online. Kalogeratos et al. [13] proposed a method for clustering text streams using bursty word information; this approach exploits the fact that most of the important documents of a topic are published during the burst period of its main terms. The limitations of similarity-based clustering methods for text streams are the need to manually select a similarity threshold to determine whether documents are assigned to a new cluster, and the fact that they ignore correlations between different time points.

The topic model is the most typical unsupervised text clustering model. It assumes that text is generated by repeatedly selecting a topic with a certain probability and then selecting a word from that topic with a certain probability. Generally, the parameters are estimated by Gibbs sampling [14] or the EM algorithm [15]. Traditional static topic models are built on an entire corpus and cannot be applied directly to text streams.
Therefore, many extensions of LDA [16], such as the dynamic topic model (DTM [17]), the topic over time model (TOT [18]), the dynamic mixture model (DMM [19]), the online latent Dirichlet allocation model (OnlineLDA [20]), the topic tracking model (TTM [21]), the streaming latent Dirichlet allocation model (S-LDA [22]), and the Dirichlet mixture model with feature partition (DPMFP [23]), have been proposed to handle text streams. These models have been applied to topic mining for long text streams, following the idea that the words in a document may come from different topics. When a document is sufficiently long, its content may indeed be composed of words from distinct topics, but for short text streams this idea is contrary to reality. Short texts such as microblogs often have only one or two sentences, and their main words often belong to the same topic. If the topic model generates each word from a different topic, the clustering performance and computation speed are greatly affected.

To apply the topic model to short text streams, Yin et al. proposed a dynamic GSDMM [24] model that assumes all the words in a text belong to a single topic, which effectively alleviates the sparsity problem of short text. The DCT [4] model proposed by Liang et al. assumes that topics at the previous time point may have a guiding effect on topics at a later time point; therefore, a dependency relationship of the topic distribution between different time points is introduced, and the topic distribution at the previous time point is taken as a prior of the topic distribution of documents at the next moment. The oBTM [25] model proposed by Cheng et al. assumes that the two words in a biterm belong to the same topic, which makes the grouping of document topics more realistic, and uses time slices to process the text stream. All the text in a time slice can be iterated over multiple times; after processing all the text in a time slice, the topic information of the biterms is used to update the model parameters. Although the above algorithms can infer the topics of a text stream during the iterative process, the number of topics must be specified in advance; if the manually specified number of topics is too large or too small, the clustering time and results can be affected substantially. Therefore, the TDPM [26] model proposed by Ahmed et al., which is based on the Dirichlet process, automatically determines the number of topics to address the fact that the number of topics in real streaming documents is not known beforehand; however, this method needs the entire sequence of the text stream. The MStream [5] algorithm proposed by Yin et al. uses a forgetting rule to update the topic distribution based on the Dirichlet process to solve the problem of topic drift. The NPMM [27] model is a recently introduced model that uses word embeddings to eliminate a cluster-generating parameter from the model. The above algorithms address the problems of specifying the number of topics and of topic drift, but they ignore the fact that documents at neighbouring time points may have a higher probability of belonging to the same topic. For example, during the COVID-19 pandemic, a text containing the word "mask" has a good probability of belonging to the same "epidemic" topic as a subsequent text containing the words "Thunder God Mountain".
Therefore, the algorithm in this paper uses the Dirichlet process to automatically obtain the number of topics and adopts a temporal dependency to fully consider the topic relevance of the text stream at neighbouring time points. DCSS updates the topics at each point in time and only saves the text statistics of the previous moment, which addresses topic drift without occupying excessive space resources.

We propose dynamic clustering for short text streams based on the Dirichlet process (DCSS). DCSS can automatically generate new topics based on the Dirichlet process without requiring the number of topics as input. To account for topic drift, a fundamental challenge in short text streams, we adopt topic feature tuples to update topics at each time point and delete outdated topic information. The update of topic feature tuples plays a major role in handling the dynamic nature of text streams. In the clustering task at each time point, according to the temporal dependency strategy, we refer to the statistical information of the clustering result at the previous time point, which largely alleviates the sparsity problem of short texts. When processing the streaming text at each time point, Dirichlet process clustering is applied to obtain a topic distribution Θ and a word distribution Φ at the current time; in reality, the topic distribution of text may evolve over time. Since the probability that a text belongs to a topic can be inferred from the topic distribution and the word distribution, we discuss the text generation process based on the Dirichlet process in detail and describe how the temporal dependency is adopted in this process.

The Dirichlet process [6] is a stochastic process that is widely used in nonparametric Bayesian models, most commonly as a prior for mixture models. In the generative process, topics and words are drawn from the multinomial distributions of the mixture model, and the Dirichlet distribution is the prior of these multinomial distributions. Table 1 summarizes the main notation used in our model: φ_{t,z} is the multinomial distribution over all words corresponding to topic z at time t, V represents the size of the current vocabulary, and φ_{t,z,w} = P(w | t, z) > 0 with Σ_{w=1}^{V} φ_{t,z,w} = 1. When processing static text, the topic distribution at the current time is assumed to be independent of past distributions, so the Dirichlet prior of the topic distribution of each document is determined only by the static hyperparameter α. Because texts at neighbouring time points in the text stream have a higher probability of belonging to the same topic, we introduce a time dependence, which makes the Dirichlet prior also depend on the topic distribution and word distribution of the previous time point. First, according to the parameter α and the topic distribution Θ_{t−1} at the previous time point, the topic distribution Θ_t at the current time point is generated, and the topic of each text is drawn from it. Then, all the words of the current text are drawn according to its topic; the distribution of words over the topic is generated by the parameter β and the word distribution Φ_{t−1} at the previous time point. Since the topic distribution θ_{t,z} at each time point is related to m_{t,z}, the number of documents belonging to topic z, and the distribution of words over a topic φ_{t,z} is related to n^w_{t,z}, the frequency of word w occurring in topic z, we change (1) and (2) to (3) and (4), respectively.
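Equations (1)-(4) themselves are not reproduced in this version of the text. As a hedged sketch only, assuming that the counts of the previous time point simply enrich the symmetric Dirichlet priors as described above (the exact weighting in the original formulation may differ), the temporally dependent priors (3) and (4) would take a form such as

$$\Theta_t \sim \operatorname{Dir}\big(\alpha + m_{t-1,1}, \ldots, \alpha + m_{t-1,K}\big), \qquad \phi_{t,z} \sim \operatorname{Dir}\big(\beta + n^{1}_{t-1,z}, \ldots, \beta + n^{V}_{t-1,z}\big).$$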
Since the probability of a text's topic calculated at the current time point is related to that of the previous time point, the sparsity problem of short text is alleviated by providing more text information. However, the problem that topic models often need a fixed number of topics K has not yet been solved. Therefore, in the text generative process of our proposed algorithm, new topics are acquired dynamically based on the Dirichlet process, and the number of topics K is initially set to 0. The graphical representation of DCSS is illustrated in Fig. 1, and the parameterization uses the following notation: Dir is a Dirichlet distribution, Mult is a multinomial distribution, z_d represents the topic assigned to text d, and B_t represents the number of texts arriving at time t. Note that the number of topics in the text generative process is not fixed, as shown in Fig. 1; that is, in the Dirichlet process, there is no need to initialize the number of topics in advance. The Chinese restaurant process (CRP) [28] is a common constructive view of the Dirichlet process, and we return to this metaphor when interpreting the hyperparameters in the experiments.

This section presents a brief discussion of the formulation of DCSS. Defining the relationship between documents and clusters is the most crucial task when addressing the text stream clustering problem. Similarity-based stream clustering methods use metrics such as cosine similarity to define the similarity between a document and a cluster; if the dissimilarity between the existing clusters and a newly arriving text exceeds a threshold, a new cluster is created. However, the similarity threshold is very difficult to define manually. In contrast, based on the parameter estimation of the topic distribution Θ_t and word distribution φ_{t,z}, we can calculate the probability that a text belongs to each existing topic. Moreover, we can calculate the probability that the text belongs to a new topic and finally choose the topic with the largest probability. Since each text has a possibility of being assigned to a new topic, DCSS can solve the topic drift problem.

At time t, the text d is the observable known variable, α and β are given prior parameters, and the topic z is a hidden variable. According to the graphical model of DCSS, we define the joint distribution of text topics in (5), where B_t represents the number of documents arriving at time t and Z_t represents the topic assignments for the processed documents in B_t. We account for the temporal dependency by adding the topic distribution Θ_{t−1} and the word distribution over topics Φ_{t−1} at the previous time to the joint distribution; that is, we use the inferred topic distribution of past documents as the prior of the topic distribution of documents at the current moment. However, it is intractable to integrate out φ_{t,z} and Θ_t directly. Therefore, we employ collapsed Gibbs sampling [14] for approximate inference and adopt a conjugate prior (Dirichlet) for the multinomial distributions, so that the conditional distribution can be calculated easily. Applying the chain rule, we can obtain the conditional probability in (6) according to (3) and (4). Because document d is associated with its own topic z, and because of the recurrence Γ(x + 1) = xΓ(x) of the Gamma function, we can simplify the conditional probability in (6).
As a result, the derived equation for calculating the probability of document d choosing existing topic k at time t is given in (7), where z_d represents the topic selected by text d, m_{t,k} represents the number of texts in topic k at time t, n^w_{t,k} represents the frequency of word w occurring in topic k at time t, n_{t,k} represents the number of words in topic k at time t, B_t represents the number of texts arriving at time t, all "t − 1" terms are the corresponding parameters of the previous time point, and V represents the size of the vocabulary of the currently recorded documents. Following the Dirichlet process, which allows an unbounded number of clusters, the probability of text d choosing a new topic K + 1 is given in (8), where K is the number of existing topics. Notably, when initializing the text at the first moment there is no time point t − 1, so the corresponding parameters at time t − 1 are 0. The first term of (7) and (8) represents the completeness of the cluster, and α is the concentration parameter of the model: a new document has a higher probability of choosing a topic with more documents, so the number of topics remains limited. The second term defines the homogeneity between a cluster and a document, and β is the pseudo-weight of similar words in a cluster: when a topic has more documents that share the same words with document d, the second term becomes larger, and document d is more likely to belong to that topic.

To address the dynamic nature of text streams, DCSS adopts a strategy of updating topics at each point in time, assuming that the streaming texts at each moment can be clustered and iterated over multiple times. Since the entire corpus is no longer loaded statically, we define a feature tuple {m_{t,z}, n_{t,z}, n^w_{t,z}} for the cluster corresponding to each topic z at the current time, where m_{t,z} represents the number of texts in topic z, n_{t,z} represents the number of words in topic z, and n^w_{t,z} represents the frequency of word w occurring in topic z at time t. The feature tuple of each topic z adds or deletes the information of a text in preparation for calculating the topic probability of the text. We only save the feature tuples of topics for one moment, so DCSS does not occupy excessive space resources. Therefore, after processing all the documents at the current time, we delete the feature tuples of the previous moment and take the feature tuples of the current time as the prior for the next time point; outdated topics are deleted in this process.

Algorithm 1 shows the procedure of the algorithm at time t. As an additional note, at time point 0, a new cluster is created for the first document, and the document is assigned to the newly created topical feature tuple. Afterward, each arriving document in the stream at time t is assigned to either an existing topic or a new topic; the corresponding probabilities are calculated using (7) and (8), and the topic with the highest probability is selected as the label of the current document. Lines 7-10 reflect the add operation on feature tuples and lines 13-15 reflect the delete operation, where N_d represents the number of words contained in document d and N^w_d represents the frequency with which each word w appears in document d. In the iteration process, the information of the current document must first be removed from its topical feature tuple.
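Because (7), (8), and Algorithm 1 are only described verbally in this text, the following minimal Python sketch illustrates the per-batch procedure. The scoring functions follow the verbal description of (7) and (8) in an MStream-style form, with the counts of the previous time point acting as a temporal prior; they are an illustrative approximation rather than the paper's exact equations, and names such as TopicTuple and process_batch are ours, not from the released code.

```python
from collections import defaultdict

class TopicTuple:
    """Feature tuple {m_z, n_z, n_z^w} of one topic at the current time point."""
    def __init__(self):
        self.m = 0                   # number of documents assigned to the topic
        self.n = 0                   # number of word tokens in the topic
        self.n_w = defaultdict(int)  # frequency of each word in the topic

def add_doc(t, doc):
    """Add a document (a {word: frequency} dict) to a topic's feature tuple."""
    t.m += 1
    for w, f in doc.items():
        t.n_w[w] += f
        t.n += f

def remove_doc(t, doc):
    """Remove a document's counts from a topic's feature tuple."""
    t.m -= 1
    for w, f in doc.items():
        t.n_w[w] -= f
        t.n -= f

def score_existing(doc, cur, prev, alpha, beta, B_t, V):
    """Unnormalised probability of choosing an existing topic; the counts of the
    previous time point (prev) act as the temporal prior. Illustrative only."""
    pm = prev.m if prev else 0
    pn = prev.n if prev else 0
    score = (cur.m + pm + alpha) / (B_t - 1 + alpha)          # cluster "completeness" term
    i = 1
    for w, f in doc.items():
        pw = prev.n_w.get(w, 0) if prev else 0
        for j in range(1, f + 1):                             # cluster-document "homogeneity" term
            score *= (cur.n_w[w] + pw + beta + j - 1) / (cur.n + pn + V * beta + i - 1)
            i += 1
    return score

def score_new(doc, alpha, beta, B_t, V):
    """Unnormalised probability of opening a new topic K+1 (cf. the description of (8))."""
    score = alpha / (B_t - 1 + alpha)
    i = 1
    for w, f in doc.items():
        for j in range(1, f + 1):
            score *= (beta + j - 1) / (V * beta + i - 1)
            i += 1
    return score

def process_batch(docs, prev_tuples, alpha, beta, V, n_iter=10):
    """Cluster the documents of one time point: the first pass assigns every
    document, later passes remove and re-assign it (cf. Algorithm 1)."""
    topics = {k: TopicTuple() for k in prev_tuples}           # previous topics remain selectable
    labels = {}
    B_t = len(docs)
    for it in range(n_iter + 1):
        for d, doc in enumerate(docs):
            if it > 0:                                        # delete before re-assigning
                remove_doc(topics[labels[d]], doc)
            scores = {k: score_existing(doc, t, prev_tuples.get(k), alpha, beta, B_t, V)
                      for k, t in topics.items()}
            new_k = max(topics, default=-1) + 1
            scores[new_k] = score_new(doc, alpha, beta, B_t, V)
            labels[d] = max(scores, key=scores.get)           # most probable topic wins
            if labels[d] == new_k:
                topics[new_k] = TopicTuple()
            add_doc(topics[labels[d]], doc)
    topics = {k: t for k, t in topics.items() if t.m > 0}     # drop outdated, empty topics
    return labels, topics
```

In this sketch, the feature tuples returned for time t replace those of time t − 1 before the next batch is processed, mirroring the add and delete behaviour of Algorithm 1.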
After all the documents at the current time point have been assigned to the most suitable topic, the iteration ends, and the streaming texts of the next time point are processed; meanwhile, the feature tuples of the previous moment are deleted.

In this section, we evaluate the performance of our proposed model by comparison with state-of-the-art models. We choose two public datasets and their variants for the experiments.

- News: This dataset comes from the GoogleNews dataset used in GSDMM [29]. News contains 11109 news titles belonging to 152 topics, with an average length of 6.23.

These datasets have been preprocessed by word segmentation, stop word removal, lowercase conversion, etc., and the average length matches the size of a short text. Furthermore, after dividing them by time points, these static datasets are suitable for streaming text clustering.

We employ four widely used metrics to evaluate the clustering performance: normalized mutual information (NMI), homogeneity, purity and accuracy [30-32]. In addition, we adopt completeness to measure the performance of the proposed algorithm in the parameter adjustment experiments.

NMI measures the amount of statistical information shared by two random variables, which here represent the cluster assignment and the ground-truth classes of the documents. NMI is formally defined as

$$\mathrm{NMI} = \frac{\sum_{c,k} n_{c,k}\log\frac{N \cdot n_{c,k}}{n_{c}\, n_{k}}}{\sqrt{\Big(\sum_{c} n_{c}\log\frac{n_{c}}{N}\Big)\Big(\sum_{k} n_{k}\log\frac{n_{k}}{N}\Big)}},$$

where n_c is the number of documents in class c, n_k is the number of documents in cluster k, n_{c,k} is the number of documents in class c as well as in cluster k, and N is the number of documents in the dataset.

Purity calculates the proportion of correctly clustered samples to the total number of samples.

Accuracy compares the clustering results with the real classes of the data and measures the percentage of documents assigned to the correct clusters:

$$\mathrm{ACC} = \frac{1}{N}\sum_{i=1}^{N}\delta\big(c_i, \mathrm{map}(k_i)\big),$$

where k_i and c_i represent the clustering result and the real label corresponding to data point x_i, respectively, map(k_i) denotes the optimal class label mapping, and the Hungarian algorithm [33] is used to achieve the optimal mapping. In addition, δ(a, b) is the indicator function: if a = b, the value is 1; otherwise, it is 0.

Homogeneity represents the proportion of members in a cluster obtained by the algorithm that come from the same class in the ground truth:

$$h = 1 - \frac{H(C \mid K)}{H(C)},$$

where H(C|K) is the conditional entropy of the classes given the cluster assignment, and H(C) is the class entropy [34].

Completeness measures the proportion of members of the same ground-truth class that are assigned to the same cluster:

$$c = 1 - \frac{H(K \mid C)}{H(K)},$$

where H(K|C) is the conditional entropy of the clusters given the class assignment, and H(K) is the cluster entropy.

The value range of the above metrics is [0, 1], and the higher the score, the better the clustering performance.
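These metrics are standard; the sketch below computes them with scikit-learn and SciPy. The paper does not state which implementation it used, so this is only one convenient way to obtain the same quantities; integer labels and the example values are assumptions for illustration.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import (normalized_mutual_info_score,
                             homogeneity_score, completeness_score)

def purity(y_true, y_pred):
    """Fraction of documents that fall into the majority true class of their cluster."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    correct = 0
    for k in np.unique(y_pred):
        members = y_true[y_pred == k]
        correct += np.bincount(members).max()
    return correct / len(y_true)

def accuracy(y_true, y_pred):
    """Clustering accuracy under the optimal cluster-to-class mapping (Hungarian algorithm)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    D = max(y_true.max(), y_pred.max()) + 1
    cost = np.zeros((D, D), dtype=int)
    for p, t in zip(y_pred, y_true):
        cost[p, t] += 1
    rows, cols = linear_sum_assignment(cost.max() - cost)   # maximise matched documents
    mapping = dict(zip(rows, cols))                         # map(k_i): cluster -> class
    return float(np.mean([mapping[p] == t for p, t in zip(y_pred, y_true)]))

# Small illustrative example with integer labels; the "geometric" average
# reproduces the sqrt-normalised NMI written above.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [1, 1, 0, 0, 2, 2]
print(normalized_mutual_info_score(y_true, y_pred, average_method="geometric"))
print(homogeneity_score(y_true, y_pred), completeness_score(y_true, y_pred))
print(purity(y_true, y_pred), accuracy(y_true, y_pred))
```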
We compare DCSS with the following state-of-the-art models for document clustering:

- DTM: The dynamic topic model [17] is an extension of LDA that can be used to analyse evolving topics in a document stream. We set α = 0.01 for DTM.
- Sumblr: Sumblr [12] is an online stream clustering algorithm for tweets. With only one pass, it clusters tweets efficiently while maintaining cluster statistics. We set β = 0.02 for Sumblr.
- oBTM: oBTM [25] uses BTM to train the documents in each time slice and updates the parameters at each time point. We set α = 50/K, β = 0.01, and λ = 1 for oBTM.
- DCT: DCT [4] enables tracking of the time-varying distributions of topics over documents and of words over topics. We initialize α and β to 1 and 0.1, respectively.
- MStream: MStream [5] clusters text streams by time point with forgetting rules; only texts within a limited time range are stored in memory. We set α = 0.03 and β = 0.03, set the maximum number of stored batches to 2, and set the number of iterations to 10.
- CTFWPO: CTFWPO [35] performs initial clustering assignments based on frequent word pairs in texts, then removes outliers from clusters and reassigns them to more appropriate clusters using semantic similarity. Since the probability calculation in this method is based on MStream, its parameters are set to the same values as those used for MStream.

DTM, Sumblr, oBTM and DCT require a fixed number of topics as input, so we set K = 300 and K = 170 for the Tweets-T and News-T datasets, respectively. The smaller α is, the more likely a text is to be assigned to a topic with more documents; the larger β is, the more likely a text is to be assigned to a topic with more words similar to its own. To enable the model to generate new topics and to take into account the rarity of words in each short text, the parameters of DCSS are set to α = 0.2 and β = 0.04, and the number of iterations is set to 10.

In this part, we compare the performance of the proposed model with that of the state-of-the-art algorithms. Because the resorted Tweets-T and News-T datasets are more representative of real-life text streams, we compare the clustering results of DCSS and the other baselines on these two datasets. Table 2 presents the overall results. As shown in Table 2, DCSS outperforms all the baselines on both datasets in terms of all measures. The methods that can infer the number of topics (MStream, CTFWPO and DCSS) outperform those that require the number of topics to be specified beforehand, which verifies the importance of inferring the number of clusters in short text stream clustering. Compared with CTFWPO and MStream, DCSS makes full use of the correlation of texts at neighbouring time points, which effectively alleviates the sparsity of short text and increases the probability of a text being assigned to the correct topic.

To verify the rationality of combining the temporal dependency with the Dirichlet process, we compare the performance of DCSS on the original and resorted datasets in detail. Table 3 shows the average value and deviation of all measurement indicators of DCSS on the four datasets. From Table 3, we can see that DCSS performs much better on both resorted datasets than on the original datasets. In the resorted datasets, texts belonging to the same topic appear continuously for a certain period of time; when the next hot topic arrives, these texts appear less frequently, but the old topic still appears periodically for a while with the relevant discussions. However, the overall trend is that topics are constantly changing, and generally an older topic will eventually disappear over time. Therefore, in News-T and Tweets-T, documents at neighbouring time points have a strong correlation, which is also true in real life. DCSS makes reasonable use of this temporal correlation of streaming texts by referring to the clustering result of the last time point to obtain more statistical information and alleviate the data sparsity of streaming short documents. In conclusion, DCSS can address the sparsity problem and obtain better clustering results by accounting for the temporal dependency of short text streams.
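As a purely illustrative usage sketch, the hypothetical process_batch and metric functions from the earlier sketches could be combined into an evaluation loop with the DCSS settings reported above (α = 0.2, β = 0.04, 10 iterations); the batch list and vocabulary size are assumed to be prepared by the user.

```python
# Hypothetical evaluation loop combining the sketches above; `batches` is a
# user-prepared list of (docs, true_labels) pairs, one pair per time point,
# and V is the vocabulary size. Parameter values follow the settings above.
alpha, beta, n_iter = 0.2, 0.04, 10
prev_tuples = {}
for docs, true_labels in batches:
    pred, tuples = process_batch(docs, prev_tuples, alpha, beta, V, n_iter)
    pred_labels = [pred[i] for i in range(len(docs))]
    print(normalized_mutual_info_score(true_labels, pred_labels,
                                       average_method="geometric"))
    prev_tuples = tuples   # feature tuples of time t become the prior for time t+1
```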
We conducted experiments on the number of iterations: we varied the number of iterations from 0 to 15 while keeping the other parameters unchanged, and selected the NMI and the inferred number of topics as references. Figure 2a shows the NMI of the DCSS clustering results on the two datasets with different numbers of iterations; once the number of iterations increases from 0 to 1, the NMI tends to be stable. Figure 2b shows the number of clusters inferred on the two datasets under different numbers of iterations; as observed in the figure, when the number of iterations is greater than 4, the number of clusters tends to remain unchanged. Streaming texts at a time point can be processed multiple times, which allows texts with similar content to be effectively grouped into the same topic. As the number of iterations increases further, the topic distribution tends to be stable, and the number of topics does not change.

In this subsection, we explore the influence of α on DCSS. While fixing the other parameters, we vary α from 0.1 to 1.0 and select the NMI, homogeneity, completeness, and inferred number of topics for reference. Figure 3a shows the change in the NMI under varying α: the NMI is relatively stable. According to the homogeneity and completeness indicators of the Tweets-T dataset in Fig. 3d, the algorithm in this paper is relatively stable under different α values. Figures 3b and 3c show that as α increases, the number of clusters obtained by clustering also increases. Moreover, Fig. 3c indicates that when α is sufficiently small, the inferred number of topics approaches the true value. According to the CRP [28], this result is reasonable. The hyperparameter α can be understood as the number of people sitting at a virtual table, where the virtual table stands for a new table that may be opened at any moment. As α gradually increases, the number of people at this virtual table also increases, and the greater the number of people, the more likely newcomers are to choose this virtual table. Therefore, when processing the text stream at each point in time, each text is more likely to be assigned to a new topic as α increases, which leads to more topics being inferred by clustering.

In this subsection, we explore the influence of β on DCSS. We vary β from 0.01 to 0.20 while fixing the other parameters; the NMI, homogeneity, completeness, and inferred number of topics are selected as references. Figure 4a shows the change in the NMI with different β values: when β is greater than 0.02, the clustering performance of DCSS on both datasets is relatively stable. Figure 4c shows that the homogeneity and completeness of the Tweets-T dataset fluctuate under different β values, but from a global perspective the fluctuations are not large. Figure 4b indicates that the number of clusters inferred by clustering decreases as β increases; this can also be explained by the CRP. β can be regarded as the number of dishes on each table that new customers may be interested in. When β is relatively small, new customers choose the table to sit down at according to the number of dishes; however, when β is relatively large, even a customer who likes only one dish on a table may choose to sit down, which makes people tend to choose tables with more dishes and larger amounts of each dish, so the total number of tables will be relatively small.
Therefore, when processing the text stream at each time point, each text is more likely to be assigned to an existing topic as β increases, which leads to fewer clusters.

This paper proposes a dynamic clustering method for short text streams based on the Dirichlet process (DCSS) that can cope with the sparsity problem of short text and solve the problems of dynamics and topic drift in text streams. DCSS dynamically assigns a batch of arriving documents to existing clusters or generates a new cluster based on the Dirichlet process. More importantly, DCSS incorporates the semantic information of the temporal dependence of the streaming texts into the proposed graphical representation model to alleviate the sparsity problem in short text clustering. The experimental results on the resorted datasets show that the past topic distribution can be used as a prior of the topic distribution of the current time to cope with the sparsity of short texts. We compare the clustering performance with state-of-the-art baselines on public datasets and verify that DCSS achieves better performance. In the future, we will incorporate a pre-trained language model to improve the performance of short text clustering.

Acknowledgements This research is partially supported by the National Natural Science Foundation of China under grants 62076217 and 61703362.

[1] An online semantic-enhanced Dirichlet model for short text stream clustering
[2] A robust user sentiment biterm topic mixture model based on user aggregation strategy to avoid data sparsity for short text
[3] Adapting dynamic classifier selection for concept drift
[4] Dynamic clustering of streaming short documents
[5] Model-based clustering of short text streams
[7] A framework for clustering evolving data streams
[8] Density-based clustering over an evolving data stream with noise
[9] Streaming spectral clustering
[10] Efficient streaming text clustering
[11] On clustering massive text and categorical data streams
[12] Sumblr: continuous summarization of evolving tweet streams
[13] Improving text stream clustering using term burstiness and co-burstiness
[14] Asynchronous Gibbs sampling
[15] Route identification in the National Football League: an application of model-based curve clustering using the EM algorithm
[16] Latent Dirichlet allocation
[17] Dynamic topic models
[18] Topics over time: a non-Markov continuous-time model of topical trends
[19] Dynamic mixture models for multiple time-series
[20] Online learning for latent Dirichlet allocation
[21] Topic tracking model for analyzing consumer purchase behavior
[22] Streaming-LDA: a copula-based approach to modeling topic dependencies in document streams
[23] Explainable user clustering in short text streams
[24] A text clustering algorithm using an online clustering scheme for initialization
[25] BTM: topic modeling over short texts
[26] Dynamic non-parametric mixture models and the recurrent Chinese restaurant process: with applications to evolutionary clustering
[27] A nonparametric model for online topic discovery with word embeddings
[28] The nested Chinese restaurant process and Bayesian nonparametric inference of topic hierarchies
[29] A Dirichlet multinomial mixture model-based approach for short text clustering
[30] Hybrid clustering analysis using improved krill herd algorithm
[31] A new feature selection method to improve the document clustering using particle swarm optimization algorithm
[32] Feature selection and enhanced krill herd algorithm for text document clustering
[34] V-measure: a conditional entropy-based external cluster evaluation measure
[35] Short text stream clustering via frequent word pairs and reassignment of outliers to clusters