title: Graph-of-Tweets: A Graph Merging Approach to Sub-event Identification
authors: Jing, Xiaonan; Rayz, Julia Taylor
date: 2021-01-08

Graph structures are powerful tools for modeling the relationships between textual elements. Graph-of-Words (GoW) has been adopted in many Natural Language Processing tasks to encode the association between terms. However, GoW provides few document-level relationships in cases where the connections between documents are also essential. For identifying sub-events on social media such as Twitter, features from both the word and the document level can be useful, as they supply different information about the event. We propose a hybrid Graph-of-Tweets (GoT) model which combines word- and document-level structures for modeling tweets. To compress the large amount of raw data, we propose a graph merging method which utilizes FastText word embeddings to reduce the GoW. Furthermore, we present a novel method to construct the GoT from the reduced GoW and a Mutual Information (MI) measure. Finally, we identify maximal cliques to extract popular sub-events. Our model showed promising results on condensing lexical-level information and capturing keywords of sub-events.

With Twitter and other types of social networks being mainstream platforms of information sharing, an innumerable amount of textual data is generated every day. Social-network-driven communication has made it easier to learn user interests and discover popular topics. An event on Twitter can be viewed as a collection of sub-events as users post new updates over time. Trending sub-events can provide information on group interests, which can assist with learning group behaviours. Previously, a Twitter event has been described as a collection of hashtags [1], [2], a (set of) named entities [3], a Knowledge Graph (KG) triplet [4], or a tweet embedding [5]. While these representations can illustrate the same Twitter event from various aspects, it can be argued that a KG triplet, which utilizes a graph structure, exhibits richer features than the other representations. In other words, the graph structure allows more semantic relationships between entities to be preserved. Besides KG, other NLP tasks such as word disambiguation [6], [7], text classification [8], [9], summarization [10], [11], and event identification [12], [13] have also widely adopted graph structures.

A graph G = (V, E) typically consists of a set of vertices V and a set of edges E which describe the relations between the vertices. The main benefit of a graph structure lies in its flexibility to model a variety of linguistic elements. Depending on the needs, "the graph itself can represent different entities, such as a sentence, a single document, multiple documents or even the entire document collection. Furthermore, the edges on the graphs can be directed or undirected, as well as associated with weights or not" [14]. Following this line of reasoning, we utilize a graph structure to combine both token- and tweet-level associations in modeling Twitter events.

Graph-of-Words (GoW) is a common method inspired by the traditional Bag-of-Words (BoW) representation. Typically, the vertices in a GoW represent the BoW from a corpus, and the edges encode the co-occurrence association (i.e., the co-occurrence frequency) between the words in the BoW.
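To make the structure concrete, the following is a minimal toy sketch of a traditional GoW, assuming Python with the networkx library (an illustrative choice, not code from the paper): nodes are the unique words of each document and edge weights count how often two words co-occur in the same document.

from itertools import combinations
import networkx as nx

docs = [["virus", "test", "positive"],
        ["corona", "test"],
        ["virus", "test"]]

gow = nx.Graph()
for doc in docs:
    for u, v in combinations(sorted(set(doc)), 2):
        # increment the co-occurrence frequency for this word pair
        w = gow.get_edge_data(u, v, default={"w_co": 0})["w_co"]
        gow.add_edge(u, v, w_co=w + 1)

print(gow.edges(data=True))
# the pair ("test", "virus") ends up with w_co = 2, since it co-occurs in two documents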
Although the traditional GoW improves upon BoW by including word association information, it still fails to incorporate semantic information. One may argue that, as previously mentioned, using a KG can effectively incorporate both semantic information and corpus-level associations into the graph. However, any pre-existing KG, such as WordNet [15] or FreeBase [16], cannot guarantee an up-to-date lexicon/entity collection. Therefore, we propose a novel, vocabulary-rich graph structure to cope with modeling constantly changing real-time events.

In this paper, we employ a graph structure to model tokens, tweets, and their relationships. To the best of our knowledge, this is the first work to represent a document-level graph with a token-level graph for tweet representation. Our main contributions are the developments of 1) a novel GoT; 2) an unsupervised graph merging algorithm to condense token-level information; and 3) an adjusted mutual information (MI) measure for conceptualized tweet similarities.

Various studies have adopted graph structures to assist with unsupervised modeling of diverse entity relationships.

Event Clustering. Jin and Bai [17] utilized a directed GoW for clustering long documents. Each document was converted to a graph with nodes, edges, and edge weights representing word features, co-occurrence, and co-occurrence frequencies, respectively. The similarity between documents was subsequently converted to the similarity between their maximum common sub-graphs. With the graph similarity as a metric, K-means clustering was applied to the maximum common document graphs to generate the clusters. Jinarat et al. [18] proposed a GoW edge-removal approach for tweet clustering based on pretrained Word2Vec embeddings [19]. They constructed an undirected graph with nodes being the unique words in all tweets and edges being the similarity between the words. Token and tweet clusters were created by removing edges below a certain similarity value. However, pretrained embeddings can perform poorly on the rare words in tweets, where abbreviations and tags are a popular means of delivering information.

Event Stream Detection. Meladianos et al. [20] presented a similar graph-of-words approach for identifying sub-events of a World Cup match on Twitter. They improved the edge weight metric by incorporating tweet length into the global co-occurrence frequency. The sub-events were generated by selecting tweets which contain the top k-degenerate subgraphs. Another effort by Fedoryszak et al. [13] considered an event stream as a cluster chain consisting of trending Twitter entities in time order. The clusters were treated as nodes and the similarities between them were labeled as edge weights. While entity comparison commonly suffers from a lack-of-coverage limitation [3], Fedoryszak et al. [13] were able to overcome this issue with the help of an internal Twitter KG. However, potential synonyms were not taken into account in the weight assignment.

Summarization. Parveen and Strube [21] proposed an entity- and sentence-based bipartite graph for multi-document summarization, which utilized the Hyperlink-Induced Topic Search algorithm [22]. However, they only considered nouns in the graph and ignored other parts of speech from the documents. In another attempt, Nayeem and Chali [10] adopted the TextRank algorithm [23], which employs a graph of sentences to rank similar sentence clusters.
To improve on the crisp token-matching metric used by the algorithm, Nayeem and Chali [10] instead used pretrained Word2Vec embeddings [19] for sentence similarities.

Graph Construction. Glavas et al. [24] built a semantic relatedness graph for text segmentation, where the nodes and edges denote sentences and the relatedness of the sentences, respectively. They showed that extracting maximal cliques was effective in exploiting the structure of semantic relatedness graphs. In an attempt at automated KG completion, Szubert and Steedman [25] proposed a word-embedding-based graph merging method that improves on AskNET [26]. Similar to our merging approach, tokens were merged incrementally into a pre-constructed KG based on word embedding similarities. The difference is that AskNET used a graph-level global ranking mechanism, while our approach considers a neighbor-level local ranking for the tokens. Additionally, Szubert and Steedman limited their scope to named-entity-related relations during the graph construction.

Evaluations of Event Identification. It should be noted that there is no benchmark dataset for event identification evaluation, as many event identification approaches are unsupervised and the event dataset varies with the research interests. Among the previously mentioned studies, Jin and Bai [17] conducted their clustering on a dataset with category labels, which allowed their unsupervised approach to be evaluated with precision, recall, and F-scores. Meladianos et al. [20] were able to generate labels through the sports highlights website ESPN, as the sub-events contained in a soccer game exhibit a simpler structure than most real-life events. Fedoryszak et al. [13] created an evaluation subset from scratch with the help of the Twitter internal KG and manually examined their clustering results. Both of the event summarization approaches adopted the classic ROUGE score [27] as well as the DUC benchmark dataset.

We propose the following steps (1)-(8) for sub-event identification (Figure 1). The dataset used in this paper was collected from Twitter for a particular event (Section III-A). Step (1) includes tokenization and lemmatization with the Stanford CoreNLP and NLTK toolkits, as well as removing stopwords, punctuation, links, and @ mentions. All processed texts are converted to lowercase. Steps (2)-(3) contribute to GoW construction (Section III-B). Step (4) performs our proposed GoW reduction method (Section III-C). Step (5) elaborates on Graph-of-Tweets (GoT) construction using MI (Section III-D). Steps (6)-(8) finalize the sub-event extraction from the GoT (Section III-E).

Following a recent trend on social media, we collected data on "COVID-19" from Twitter as our dataset. We define tweets containing the case-insensitive keywords {"covid", "corona"} as information related to the "COVID-19" event. We fetched 500,000 random tweets containing one of the keywords every day over the one-month period from Feb 25 to Mar 25 and kept only original tweets as our dataset. More specifically, retweets and quoted tweets were filtered out during the data cleaning process. The statistics of the processed dataset used for FastText model training can be found in Table I. Aside from the FastText training, which requires a large corpus to achieve accurate vector representations, we focused on a single day of data, the Feb 25 subset, for the rest of the experiments in this paper.
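As an illustration of the Step (1) preprocessing described above, the sketch below assumes NLTK only (the paper also uses Stanford CoreNLP); the regular expressions, tokenizer, and stopword list are illustrative choices rather than the exact pipeline, and the NLTK "stopwords" and "wordnet" corpora are assumed to be downloaded.

import re
import string

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import TweetTokenizer

tokenizer = TweetTokenizer()
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))

def preprocess(tweet):
    tweet = re.sub(r"https?://\S+", " ", tweet)   # remove links
    tweet = re.sub(r"@\w+", " ", tweet)           # remove @ mentions
    tokens = tokenizer.tokenize(tweet.lower())    # tokenize the lowercased text
    tokens = [t for t in tokens
              if t not in stop_words and t not in string.punctuation]
    return [lemmatizer.lemmatize(t) for t in tokens]

print(preprocess("US CDC warns of disruption to 'everyday life' "
                 "with spread of coronavirus https://t.co/zO2IovjJlq"))
# expected to resemble the example given later in the paper:
# ['u', 'cdc', 'warns', 'disruption', 'everyday', 'life', 'spread', 'coronavirus']
# (the exact output depends on the tokenizer and stopword list used)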
We trained a word embedding model on the large 30-day dataset as our word-level similarity measure for GoW construction. Pretrained Word2Vec models [19] have previously been applied to graphs as edge weights [8], [11], [18]. Trained on the Google News corpus, Word2Vec is powerful at finding context-based word associations. Words appearing in similar contexts will receive a higher cosine similarity score, based on the hypothesis that "a word is characterized by the company it keeps" [28]. However, a pretrained Word2Vec model can only handle words within its vocabulary coverage. In other words, if any low-frequency words are ignored during the training, they will not be captured by the model. In our case, the Twitter data contain many informal words, spelling mistakes, and new COVID-19 related words, which make a pretrained model unsuitable for our task. On the other hand, the FastText model [29] uses character n-grams to enrich word representations when learning the word embeddings. An informal word such as "likeee" can be denoted as a combination of the 4-grams {"like", "ikee", "keee"}, and its vector representation, composed from the subword vectors, will be closer to that of the intended word "like", given that both words are used in similar contexts. Thus, we employ FastText word embeddings as the word-level similarity measure in our GoW construction. A skip-gram model was trained on the large 30-day dataset using the Gensim implementation.

For the basic structure of the GoW, we adopt the approach from Meladianos et al. [20], where the vertices V represent the tokens and the edges E_co represent the co-occurrence of two tokens in a tweet. For k tweets in the corpus T = {t_1, t_2, ..., t_k}, the co-occurrence weight w_co between the unique tokens v_1 and v_2 is accumulated over the tweets in which the pair co-occurs, where n_i (n_i > 1) denotes the number of unique tokens in a tweet t_i that contains both v_1 and v_2. An edge e_co is only drawn when the token pair co-occurs at least once. In addition to the base graph, we add another set of edges E_s denoting the cosine similarity w_s between the token embeddings obtained from FastText. Figure 2 illustrates an example GoW for two processed tweets. Fig. 2: An example GoW for the processed tweets {"virus", "test", "positive"} and {"corona", "test"}, with nodes "positive" (a), "test" (b), "virus" (c), and "corona" (d). Solid and dotted edges denote E_co and E_s, respectively.

A raw GoW constructed directly from the dataset often contains a large number of nodes. Many of these nodes carry repeating information, which increases the difficulty of the subsequent tasks. To reduce the number of nodes, we propose a two-phase node merging method for the proposed GoW:
• Phase I: linear reduction by node merging based on word occurrence frequency.
• Phase II: semantic reduction by node merging based on token similarity.

Phase I. The goal of this phase is to reduce the number of nodes in the raw graph in a fast and efficient manner. For tokens that occur in fewer than 5 tweets, we merge each of them into its most similar token node in the graph. This phase is performed on nodes in order from the least frequently appearing to the most frequently appearing.

Phase II. The goal of this phase is to combine frequent tokens in the graph. Algorithm 1 describes this process.
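The following is a minimal sketch of this Phase II merging step, assuming the GoW is stored as a networkx graph whose edges carry the weight w_co and that a trained Gensim FastText model supplies the top-N similar tokens; the function names, the value of N, and the handling of the similarity edges E_s (omitted here for brevity) are illustrative assumptions rather than the paper's exact implementation.

def node_merge(gow, src, dst):
    # `gow` is assumed to be a networkx.Graph.
    # Merge node `src` into `dst`: move src's tokens to dst, reconnect src's
    # neighbors to dst, and accumulate the co-occurrence weights w_co.
    # The leading token of `dst` is kept as the node key.
    gow.nodes[dst].setdefault("tokens", {dst}).update(
        gow.nodes[src].get("tokens", {src}))
    for nbr in list(gow.neighbors(src)):
        if nbr == dst:
            continue
        w = gow[src][nbr]["w_co"]
        if gow.has_edge(dst, nbr):
            gow[dst][nbr]["w_co"] += w
        else:
            gow.add_edge(dst, nbr, w_co=w)
    gow.remove_node(src)

def phase_two_merge(gow, fasttext_model, top_n=10):
    # `fasttext_model` is assumed to be a trained gensim FastText model.
    # Visit nodes from lowest to highest degree (the order is computed once,
    # a simplification); for each node, visit its neighbors from lowest to
    # highest w_co and merge a neighbor into another neighbor that appears
    # among its top-N most similar tokens.
    for v in sorted(gow.nodes, key=gow.degree):
        if v not in gow:                     # v may already have been merged away
            continue
        nbrs = sorted(gow.neighbors(v), key=lambda n: gow[v][n]["w_co"])
        for v_ij in nbrs:
            if v_ij not in gow or not gow.has_edge(v, v_ij):
                continue
            sim = dict(fasttext_model.wv.most_similar(v_ij, topn=top_n))
            candidates = [(n, sim[n]) for n in gow.neighbors(v)
                          if n != v_ij and n in sim]
            if candidates:
                # pick the most similar co-neighbor as the parent node
                parent = max(candidates, key=lambda c: c[1])[0]
                node_merge(gow, src=v_ij, dst=parent)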
For a node in the GoW, we merge its lower-weighted neighbor into another neighbor if the top N most similar tokens of the lower-weighted neighbor contain another neighbor. It should be noted that the ordering of the nodes and the direction of merging matter in this process. For token nodes in the GoW, we perform this phase on nodes in order from lowest degree to highest degree; and for neighbors of the same node, we perform the phase on neighbors in order from lowest weight to highest weight. When the top N most similar tokens contain more than one other neighbor, we select the node with the highest similarity value as the parent node.

The node merging process consists of removing original nodes, reconnecting edges between neighbors, and migrating the co-occurrence weights w_co and the merged tokens. Essentially, if a target node is merged into a parent node, the target node is removed from the graph and the neighbors of the target node are reconnected to the parent. It should be noted that when migrating neighbors and weights, we only consider the co-occurrence edges E_co and only add the weights w_co onto the edges of the parent node, while w_s remains as determined by the leading token of the merged node. For a merged node with multiple tokens, we define the leading token as the single token that the node carried in the original GoW. More precisely, suppose the target node "corona" is to be merged into the parent node "virus" in Figure 2. Since node "corona" only neighbors node "test", we add the weights together so that the new weight between nodes "test" and "virus" is w_co(b, c) = w_co(b, d) + w_co(b, c), and we remove node "corona" from the graph. Furthermore, for the new node "virus", which contains both tokens "virus" and "corona", the leading token is "virus", and the weight w_s(b, c) remains the cosine similarity between the word vectors of "virus" and "test".

Similarly to the GoW, we introduce a novel GoT which maps tweets to nodes. In addition, to compare the shared information between the tweet nodes, we construct the edges with an adjusted MI metric. Each tweet is represented as a set of unique merged nodes obtained from the previous two-phase GoW reduction, and tweets with identical token node representations are treated as one node. For instance, after merging token node "corona" into "virus" in Figure 2, a processed tweet {"corona", "virus", "positive", "test"} can be represented as a tweet node t: {"virus", "positive", "test"}, which contains the set of unique merged token nodes "virus", "positive", and "test".

Originally from Information Theory, MI is used to measure the information overlap between two random variables X and Y. Pointwise MI (PMI) [30] in Equation 2 was introduced to computational linguistics to measure the associations between bigrams/n-grams. PMI uses unigram frequencies and co-occurrence frequencies to compute the marginal and joint probabilities, respectively (Equation 3), in which W denotes the total number of words in a corpus where words x and y co-occur:

i(x, y) = log [ p(x, y) / (p(x) p(y)) ]    (2)

p(x) = f(x)/W, p(y) = f(y)/W, p(x, y) = f(x, y)/W    (3)

The drawbacks of PMI are: 1) low-frequency word pairs tend to receive relatively higher scores; 2) PMI suffers from the absence of boundaries in comparison tasks [31].
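The first drawback can be illustrated with a small numeric sketch using made-up counts (purely for illustration, not data from this paper): two word pairs that always co-occur, one frequent and one rare, in a corpus of W = 10,000 words.

from math import log

W = 10_000

def pmi(f_x, f_y, f_xy):
    # Equation 2 with the probabilities of Equation 3 estimated from counts
    return log((f_xy / W) / ((f_x / W) * (f_y / W)))

print(pmi(100, 100, 100))  # frequent, perfectly correlated pair: log(100), about 4.61
print(pmi(2, 2, 2))        # rare, perfectly correlated pair: log(5000), about 8.52

The rarer pair receives the higher score even though both pairs are perfectly correlated.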
In an extreme case when x and y are perfectly correlated, the marginal probabilities p(x) and p(y) take the same value as the joint probability p(x, y). In other words, when x and y only occur together, p(x, y) = p(x) = p(y), which causes i(x, y) in Equation 2 to reduce to -log p(x, y). Therefore, with W remaining the same, a lower f(x, y) will result in a higher PMI value. Additionally, it can be noted that PMI in Equation 2 suffers from the absence of boundaries when applied to comparisons between word pairs [31]. To mitigate the scoring bias and the comparison difficulty, a common solution is to normalize the PMI values with -log p(x, y), or with a combination of -log p(x) and -log p(y), to the range [-1, 1] to smooth the overall distribution.

Apart from measuring word associations, MI is also widely applied in clustering evaluations. In the case when the ground truth clusters are known, MI can be used to score the "similarity" between the predicted clusters and the labels. A contingency table (Figure 3) is used to illustrate the number of overlapping elements between the predicted clustering A and the ground truth clustering B. Fig. 3: Contingency table between clusterings A and B [32]. One disadvantage of using MI for clustering evaluation is the existence of selection bias, which Romano et al. [33] described as "the tendency to choose inappropriate clustering solutions with more clusters or induced with fewer data points." However, normalization can effectively reduce the bias as well as add an upper bound for easier comparison. Romano et al. [33] proposed a variance- and expectation-based normalization method. Other popular normalization methods include using the joint entropy of A and B, or some combination of the entropies of A and B, as the denominator [32].

In our GoT case, since tweet nodes are represented by different sets of token nodes, we can treat the tweet nodes as clusters which allow repeating elements. Thus, the total number of elements is the number of token nodes obtained from the reduced GoW. Correspondingly, the intersection between two tweet nodes can be represented by the overlapping elements (token nodes). Following this line of logic, we define the normalized MI between two tweets in Algorithm 2. Note that when p(t_i, t_j) = 0 (indicating no intersection), NMI takes the boundary value -1.

It should be noted that, as the fetching keywords {"corona", "covid"} appear in every tweet in the dataset, tokens containing these words are removed when the GoT is constructed. Consequently, two tweet nodes with only "corona" or "covid" in common will result in an NMI value of -1, while the outcomes of sub-event identification are not affected by removing the main event "COVID-19".

We hypothesize that popular sub-events are contained in subgraphs of the GoT which are fully connected and highly similar in content. Following this assumption, we extract a GoT subgraph with only the tweet nodes included in the top n NMI values. Subsequently, we identify all maximal cliques of size greater than 2 from the subgraph for further analysis. As cliques obtained this way consist only of nodes with large NMI values, which indicates that the tweet nodes are highly similar, a clique can be represented by the token nodes included in its tweet nodes. Thus, we treat a clique as the set of all token nodes contained in its tweet nodes, and each clique represents a popular sub-event.
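A minimal sketch of the GoT edge weighting and the clique extraction of Steps (6)-(8) is given below, assuming Python with networkx and representing each tweet node as a frozenset of token nodes. Since Algorithm 2 is not reproduced here, the sketch uses max(-log p(t_i), -log p(t_j)) as the normalization denominator; this choice is consistent with the worked NMI values reported later (0.829 and 0.598) but should be read as an assumption rather than the paper's exact definition.

from itertools import combinations
from math import log

import networkx as nx

def nmi(t_i, t_j, m):
    # normalized MI between two tweet nodes (sets of token nodes);
    # m is the total number of token nodes in the reduced GoW
    inter = len(t_i & t_j)
    if inter == 0:
        return -1.0                                   # boundary value for disjoint tweets
    p_i, p_j, p_ij = len(t_i) / m, len(t_j) / m, inter / m
    return log(p_ij / (p_i * p_j)) / max(-log(p_i), -log(p_j))

def top_cliques(tweet_nodes, m, top_n=1000, min_size=3):
    # score every tweet-node pair (quadratic in the number of tweet nodes;
    # a sketch, not an efficient implementation), keep the top_n edges, and
    # return the maximal cliques of size > 2, each summarized as the union of
    # the token nodes carried by its tweet nodes
    scored = [(nmi(a, b, m), a, b) for a, b in combinations(tweet_nodes, 2)]
    scored.sort(key=lambda x: x[0], reverse=True)
    got = nx.Graph()
    got.add_edges_from((a, b, {"nmi": s}) for s, a, b in scored[:top_n])
    return [set().union(*clique)
            for clique in nx.find_cliques(got) if len(clique) >= min_size]

# toy check against the worked example discussed later: tweet nodes of sizes
# 5, 5, and 10, where each of the pairs (t1, t2) and (t1, t3) shares 3 token
# nodes, with m = 100 token nodes in total
t1, t2, t3 = frozenset("abcde"), frozenset("abcxy"), frozenset("abcfghijkl")
print(round(nmi(t1, t2, 100), 3), round(nmi(t1, t3, 100), 3))   # 0.829 0.598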
To summarize, the raw GoW consisted of 29,493 unique token nodes for tweets from the Feb 25 data division. The two-phase GoW reduction reduced the graph by 83.8%, with 24,711 nodes merged and 4,782 token nodes left in the GoW. On the other hand, the raw GoT consisted of 31,383 unique tweet nodes. The extracted subgraph of the top 1000 MI values consisted of 1,259 tweet nodes. Finally, 83 maximal cliques were identified from the subgraph.

Phase I reduction merged 19,663 token nodes that appeared in fewer than 5 tweets, which is roughly 66.7% of all token nodes contained in the raw GoW. This phase primarily contributes to reducing uncommon words within the corpus. It can be argued that rare words may provide non-trivial information. However, because our goal is to identify popular sub-events, the information loss brought by rare words does not heavily affect the results. Furthermore, the word similarity provided by FastText can effectively map rare terms to more common words with a similar meaning. For instance, in our FastText model, the rare word "floridah" has the intended word "florida" as its most similar word. Another fun fact is that the emoji " " was merged into the node "coffee".

There were 5,048 token nodes merged during Phase II reduction. This phase mainly contributes to information compression within common tokens. By merging the neighbors that carry similar information, the same target node can appear in a less complex network without information loss. Table II presents statistics of the resulting GoW. Table III shows some example merged nodes from the reduced GoW. The largest node (248 tokens) is not shown here as it contains many expletives. The second largest node, "lmaoooo" (221 tokens), contains mostly informal terms like "omgggg" and emojis.

We identify roughly three patterns of merges during Phase II reduction, namely 1) merges of words with the same stem or of synonyms, 2) merges of common bi-grams or fixed expressions, and 3) merges of topically related but semantically different words. Table IV illustrates some example merges from source node to destination node, with the green, blue, and red columns corresponding to type-1, type-2, and type-3, respectively.

TABLE IV: Example merges from Phase II reduction (src node → dst node).
Type-1 (green): covid19 → covid; covid → coronavirus; #iphone → #apple; complains → complaint; btc → bitcoin; dead → death.
Type-2 (blue): positive → test; department → health; cruise → ship; valley → silicon; long → term; conspiracy → theory.
Type-3 (red): buy → sell; always → never; masked → unmasked; men → woman; latter → splatter; undo → mundo.

Among the type-1 merges (green), it can be seen that common abbreviations, such as "btc" for "bitcoin", are also captured in the merging process. In the type-2 merges (blue), examples such as "test positive" and "health department" are frequent bigrams in the context of our data, and other examples such as "silicon valley" and "long term" are fixed expressions. One general drawback of word embedding models is that, rather than being semantically similar, words that frequently occur in the same contexts will be considered very similar, as noted in the distributional hypothesis [28]. An improvement can be made by combining named entities, fixed expressions, and frequent bi-grams in the data processing stage so that a node can also represent a concept in the GoW. Finally, the type-3 merges (red) also suffer from this drawback of word embedding models. Word pairs like "buy" and "sell", "always" and "never", and "masked" and "unmasked" are antonyms/reversives in meaning. However, these word pairs tend to be used in the same contexts, so they are considered highly similar during the merging process.
Word pairs like "latter" and "sp[latter]", and "undo" and "m[undo]" ("mundo" means "world" in Spanish), are subwords of one another. Recall that the FastText model uses character n-grams in training. The subword mechanism leads rare words to be considered similar to the common words that share a part with them. It should be noted that in Phase II, the merging is performed from the lower-weighted neighbors to the higher-weighted neighbors and from lower-degree nodes to higher-degree nodes. It is plausible for a common word like "undo" to be a lower-weighted neighbor compared to the uncommon word "mundo" if the target token is also a rare Spanish word and "undo" does not co-occur frequently with the target token.

Using the reduced GoW, we constructed a GoT, where each tweet node was represented as a set of leading tokens from the token nodes. Of the original 38,508 tweets, 31,383 tweets were represented as unique tweet nodes. Take the tweet node tweet_18736, which was repeated the most with a frequency of 221, as an example. The original text of tweet_18736 reads "US CDC warns of disruption to 'everyday life' with spread of coronavirus https://t.co/zO2IovjJlq", which is an event created by Twitter on Feb 25. After preprocessing, we obtained {"u", "cdc", "warns", "disruption", "everyday", "life", "spread", "coronavirus"}. Finally, the tweet node represented by GoW leading tokens became {"probably", "cdc", "expects", "destroy", "risking", "spreading", "coronavirus"}, with both "everyday" and "life" merged into "risking". One may notice that the word "US" is converted to "u" after preprocessing, which consequently affected the FastText training, the GoW reduction, and the GoT representation. This is due to a limitation of the WordNet lemmatizer, which maps the lowercase "us" to "u". In future work, we would consider keeping the original casing of the texts and employing named entity recognition to identify cases like this and preserve the correct word.

Subsequently, the tweet nodes constituting the top 1000 NMI weights were extracted to construct a new GoT subgraph for further analysis. The top 1000 NMI subgraph contained 1,259 tweet nodes, which consisted of 1,024 token nodes. The minimum NMI value within the subgraph was 0.9769. In Figure 4, edges are only drawn between node pairs with intersecting token nodes. The edges with an NMI value of -1 are not shown in the figure. It should be noted that if all the tweet node pairs extracted from the top 1000 MI edges were unique, there would be 2,000 nodes in the subgraph. After examining the tweet nodes with the top 10 total degrees (the sum of all MI weights between the node and its neighbors), we found that some of the nodes are subsets of each other. For instance, tweet_23146 is represented as: {"news", "epidemiologist", "nation", "probably", "confirmed", "know", "spreading", "humping", "#mopolitics", "center", "kudla", "called", "contagious", "#sarscov2", "#health", "govt", "friday", "#prevention", "update", "expects", "rrb"}, of which two other tweets, tweet_20023 and tweet_11475, are subsets with 2 and 1 token node(s) difference, respectively.

Further statistics on the associations between total node degrees, average node degrees, and node size (number of leading tokens) are shown in Figure 5. It can be seen in Figure 5a that the total node degree is positively correlated with the number of neighbors. We also examined the correlation between total node degree and average neighbor degree (the average NMI weight over all neighbors), but found no correlation.
On the other hand, in Figure 5b, the average neighbor degree is negatively correlated with the number of leading tokens in a tweet node. This indicates that NMI relatively favors tweets with fewer elements. Fig. 5: (a) Total degrees (top) and number of neighbors (bottom) of tweet nodes, ordered by total degree in descending order; (b) average neighbor degrees (top) and number of leading tokens of tweet nodes, ordered by average degree in descending order.

Imagine a random case where only independent tokens are present in a tweet node. Larger nodes (those with more tokens) have less chance to share information with other nodes. More precisely, for two pairs of tweets that share the same intersection size, the sizes of the tweets will determine the NMI values. For instance, t_1 (5 elements) and t_2 (5 elements) share 3 elements, while t_1 and t_3 (10 elements) also share 3 elements. Assuming there is a total of m = 100 elements, the NMI values as defined in Algorithm 2 are NMI(t_1, t_2) = 0.829 and NMI(t_1, t_3) = 0.598. It should be noted that different normalization methods can cause the NMI values to vary. The normalization metric we chose largely normalizes the upper bound of the MI values. When a different normalization metric is applied, for example one using the joint probability, both the upper and lower bounds of the MI values can be normalized.

Finally, we identified 83 maximal cliques of size greater than 2 from the top 1000 NMI subgraph. While the largest clique contained 14 tweet nodes, there were 21 cliques consisting of 4 to 6 tweet nodes, and the rest contained 3 tweet nodes. We observed that the tweet nodes contained in the same clique shared highly similar content. Table V illustrates the shared token nodes in all 14 tweet nodes from the largest clique. We can derive the information contained in this clique as "the stock and/or bitcoin market has a crash/slide". Further investigation indicated that the nodes in this clique are associated with the same tweet: "The Dow just logged its worst 2-day point slide in history - here are 5 reasons the stock market is tanking, and only one of them is the coronavirus", which elaborates on the stock market dropping. We marked this type of clique as "strong representative" of the sub-event. However, not all cliques represented the original tweet contents well. Another clique in Table V, which contained the tweets regarding "US HHS says there's 'low immediate risk' of coronavirus for Americans", only suggested the "likelihood of" the main "COVID-19" event. We marked this type of clique as "somewhat representative" of the sub-event. Other types of cliques are marked as "not representative". Table VI shows the content validity evaluation on the 83 maximal cliques. It should be noted that the labels are annotated in acknowledgement of the meaningful token nodes, which on average make up 64.3% of each clique.

We also analyzed the event categories among the generated cliques and manually annotated the cliques with type labels. Figure 6 shows the content validity distribution by different event types. We can see that our proposed model performed very well at generating meaningful clique content for the categories "health department announcement" and "stock, economy", and not very well for the "politics related" category. The difference in performance may be due to variation in the amount of information contained in the described events. Fig. 6: Clique validity distribution by manually labeled event type categories. Deep blue, light blue, and grey correspond to the "strong representative", "somewhat representative", and "not representative" categories, respectively.
It should be addressed that the dataset used in this paper was collected directly from Twitter and that the approach proposed in this paper is fully automated and unsupervised. As mentioned previously, there is no benchmark dataset for event identification, and the absence of a gold standard makes it difficult to conduct quantitative evaluations of the results. Our best effort, as presented in Table VI and Figure 6, was to manually analyze the content of the generated cliques for validation.

In this paper, we proposed a hybrid graph model which uses conceptualized GoW nodes to represent tweet nodes for sub-event identification. We developed an incremental graph merging approach to condense the raw GoW by leveraging word embeddings. In addition, we outlined how the reduced GoW is connected to the GoT and developed an adjusted NMI metric to measure node similarity in the GoT. Finally, we utilized a fundamental graph structure, cliques, to assist with identifying sub-events. Our approach showed promising results on identifying popular sub-events in a fully unsupervised manner on real-world data. There remain adjustments that can be made to the GoT to improve the robustness of the model so that more detailed and less noisy events can be captured. In the future, we plan to employ named entity recognition in raw data processing to identify key concepts. Additionally, frequent bi-grams/n-grams will be examined and combined prior to FastText training to improve the similarity metrics obtained from word embeddings. We will also compare different MI normalization methods to neutralize the bias from sequence length. Finally, we plan to improve the conceptualization method of the token nodes so that, instead of by the leading token, a node can be represented by the concept it is associated with.
[1] Streamcube: hierarchical spatio-temporal hashtag clustering for event exploration over the twitter stream.
[2] An event detection approach based on twitter hashtags.
[3] Real-time entity-based event detection for twitter.
[4] Frame-based representation for event detection on twitter.
[5] Tweet2vec: Character-based distributed representations for social media.
[6] Embedding senses for efficient graph-based word sense disambiguation.
[7] Breaking through the 80% glass ceiling: Raising the state of the art in word sense disambiguation by incorporating knowledge graph information.
[8] Fusing document, collection and label graph-based representations with word embeddings for text classification.
[9] Graph convolutional networks for text classification.
[10] Extract with order for coherent multi-document summarization.
[11] Graph-based neural multi-document summarization.
[12] Armatweet: detecting events by semantic tweet analysis.
[13] Real-time event detection on social data streams.
[14] Graphrep: boosting text mining, nlp and information retrieval with graphs.
[15] Wordnet: a lexical database for english.
[16] Freebase: a collaboratively created graph database for structuring human knowledge.
[17] Text clustering algorithm based on the graph structures of semantic word co-occurrence.
[18] Short text clustering based on word semantic graph with word embedding model.
[19] Distributed representations of words and phrases and their compositionality.
[20] Degeneracy-based real-time sub-event detection in twitter stream.
[21] Multi-document summarization using bipartite graphs.
[22] Authoritative sources in a hyperlinked environment.
[23] Textrank: Bringing order into text.
[24] Unsupervised text segmentation using semantic relatedness graphs.
[25] Node embeddings for graph merging: Case of knowledge graph construction.
[26] Asknet: Creating and evaluating large scale integrated semantic networks.
[27] Rouge: A package for automatic evaluation of summaries.
[28] A synopsis of linguistic theory.
[29] Enriching word vectors with subword information.
[30] Word association norms, mutual information, and lexicography.
[31] Normalized (pointwise) mutual information in collocation extraction.
[32] Correction for closeness: Adjusting normalized mutual information measure for clustering comparison.
[33] Standardized mutual information for clustering comparisons: one step further in adjustment for chance.