key: cord-0668106-jrgdk0hj
authors: Asgari-Chenaghlu, Meysam; Nikzad-Khasmakhi, Narjes; Minaee, Shervin
title: Covid-Transformer: Detecting COVID-19 Trending Topics on Twitter Using Universal Sentence Encoder
date: 2020-09-08
journal: nan
DOI: nan
sha: 0530953eee9e3f4bb0c34c66a3c798aae94ecbc6
doc_id: 668106
cord_uid: jrgdk0hj

The novel coronavirus disease (also known as COVID-19) has led to a pandemic, impacting more than 200 countries across the globe. With its global impact, COVID-19 has become a major concern of people almost everywhere, and therefore a large number of tweets about COVID-19 related topics are posted from every corner of the world. In this work, we analyze these tweets and detect the trending topics and major concerns of people on Twitter, which can enable us to better understand the situation and devise better planning. More specifically, we propose a model based on the Universal Sentence Encoder to detect the main topics of tweets in recent months. We use the Universal Sentence Encoder to derive the semantic representation and the similarity of tweets. We then feed the sentence embeddings and their similarities to the K-means clustering algorithm to group semantically similar tweets. After that, the summary of each cluster is obtained using a deep learning based text summarization algorithm, which uncovers the underlying topics of the cluster. Through experimental results, we show that our model can detect very informative topics by processing a large number of tweets at the sentence level (which preserves the overall meaning of the tweets). Since this framework places no restriction on a specific data distribution, it can be used to detect trending topics from any other social media platform and in any context other than COVID-19. Experimental results show the superiority of our proposed approach over other baselines, including TF-IDF and latent Dirichlet allocation (LDA).

COVID-19 has led to a global pandemic, impacting more than 200 countries across the globe, infecting more than 20 million people, and causing more than 750,000 deaths as of Aug 12, 2020 [1], and a large number of research works have emerged around it. With the growing scale of COVID-19 (after March 2020), there has been a shift in the distribution of tweets posted on Twitter, reflecting the fact that COVID-19 has become a major concern of people across the world. To illustrate this, in Figure 1 we show the most frequent words used on Twitter during April 2020 (the second month of the global COVID-19 pandemic). As we can see, words like "death", "help", and "COVID-19" are among the popular ones. Analyzing people's opinions and concerns on social media can help us better understand the public's concerns and expectations, enabling the government and health officials to plan better for managing the situation.

There have already been a few works analyzing COVID-19 related tweets. As an example, in [2], Kleinberg et al. presented a ground-truth dataset of emotional responses to COVID-19 and a framework to detect the main concerns of people in different countries on the COVID-19 subject. In [3], Ordun and colleagues analyzed five different techniques to assess the distinctiveness of topics, key terms and features, the speed of information dissemination, and network behaviors for COVID-19 related tweets. In [4], Singh et al. looked at the information and misinformation shared on Twitter.
They tried to see whether the discussion emerges from myths shared about the virus, and how much of it is connected to other high- and low-quality information on the Internet. So far, there has not been a solid framework that leverages the state of the art in natural language processing to detect trending COVID-19 related topics on social media. Besides works analyzing textual data about COVID-19, there are many works analyzing other types of data (images, time series, clinical information) to gain insight and build predictive models about different aspects of COVID-19 [5, 6, 7, 8].

In this work, we propose a deep learning based framework to detect the trending topics people are talking about on Twitter, using a combination of transfer learning and clustering algorithms. Deep learning based models have achieved state-of-the-art results on many NLP problems in recent years, including word embedding, sentiment analysis, question answering, and machine translation [9, 10, 11]. We first extract the representation (embedding) of the sentences in tweets using a sentence Transformer, which captures the semantic information of the sentences. We then use a clustering algorithm to group similar sentences (based on their embeddings) into the same groups. Ideally, different clusters contain different semantic topics. After that, text summarization is used to obtain the summary of the sentences in each cluster, which uncovers their trending topics. Compared to classical topic modeling approaches (such as latent Dirichlet allocation, LDA), this work better exploits the semantic information and meaning of the tweets, by first computing a sentence-level embedding of the tweets and then using those embeddings to group them. In this way, we can find similar topics by directly analyzing sentences, rather than considering word-level similarity (as used in LDA). Figure 2 shows the word cloud of the sentences belonging to one of the topic clusters of our model.

Here are the main contributions of this work:

• We provide a novel framework that can detect trending COVID-19 related topics on social media, such as Twitter, in an unsupervised fashion. We do so by first extracting a sentence-level embedding using a sentence Transformer, then grouping similar sentences into the same cluster using K-means clustering, and finally extracting each cluster's summary with a deep learning based text summarization method. This summary contains the major topics of each cluster.

• We provide a detailed experimental study showing the promise of this work and its advantages over simple baselines such as TF-IDF and latent Dirichlet allocation (LDA).

The structure of the rest of this paper is as follows: In Section 2, we provide the details of the proposed algorithm. Section 3 gives an overview of the dataset used in our experiments. In Section 4, we provide the experimental analysis of the proposed algorithm in terms of detected topics, trending words in each topic cluster, and sentence representation similarity. Finally, the paper is concluded in Section 5.

This work proposes a new framework for detecting trending COVID-19 related topics on Twitter, using the Universal Sentence Encoder and text summarization. The overall structure of our proposed approach is presented in Figure 3. As can be seen from this figure, we first obtain the relevant data from Twitter by collecting tweets through the Twitter API. After that, data cleaning needs to be performed on the collected tweets.
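As a rough illustration of this cleaning step, the sketch below strips URLs, splits tweets into sentences, and keeps only sentences without hashtags or mentions, in line with the filtering described later in this section. The regular expressions and the clean_tweets helper are our own illustrative choices, not the authors' exact pipeline.

```python
import re

URL_RE = re.compile(r"https?://\S+")

def clean_tweets(raw_tweets):
    """Split tweets into sentences and keep only the 'sensible' ones.

    Hypothetical implementation of the cleaning step: URLs are stripped,
    and any sentence containing a hashtag or a mention is discarded,
    as described in Section 2.
    """
    kept = []
    for tweet in raw_tweets:
        text = URL_RE.sub("", tweet)
        # naive sentence split; the paper does not specify the splitter used
        for sentence in re.split(r"(?<=[.!?])\s+", text):
            sentence = sentence.strip()
            if not sentence:
                continue
            if "#" in sentence or "@" in sentence:
                continue  # drop sentences with hashtags or mentions
            kept.append(sentence)
    return kept

# small usage example with made-up tweets
print(clean_tweets(["Stay home! More info: https://example.org #covid19",
                    "We don't have enough ventilators, the governor says."]))
```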
Then, the Universal Sentence Encoder is used to extract the feature representation (embedding) of the tweets' sentences. The sentence embeddings from different tweets are then fed into the K-means clustering algorithm to group semantically similar ones into the same cluster. In the end, the TextRank summarization technique is applied to the sentences of each cluster to generate a summary of it, which contains its most representative topic. A more detailed description of each step is provided in the following subsections.

In recent years, there have been many efforts to create semantic representations of textual sequences (such as sentences, paragraphs, or documents). These methods cover a wide range of techniques, from word mover's distance to recent state-of-the-art methods such as Sentence-BERT and the Universal Sentence Encoder [12, 13, 14, 15, 16]. All of these methods aim to provide a vector representation of a sentence that captures its semantic meaning and yields similar representations for similar sentences. For example, Sentence-BERT (also known as SentBERT) uses a Siamese network composed of two identical instances of BERT. Like any Siamese neural network, this model aims to find the similarity between the two input sentences. The Universal Sentence Encoder is another text embedding technique that comes in different versions for different use cases. One variant, called USE_DAN, was trained with a Deep Averaging Network (DAN): it first averages the embeddings of words and bi-grams, and then applies a feed-forward neural network to the averaged representation. A newer version of the Universal Sentence Encoder, called USE_T, is based on Transformers; it has higher accuracy but is computationally more intensive than USE_DAN. In this research, we employ the Transformer-based version of the Universal Sentence Encoder. USE_T can handle words, sentences, and documents as input. Figure 4 shows the architecture of this model and how it encodes text into high-dimensional vectors. As can be seen from the figure, USE_T has been trained on various downstream tasks. The encoder blocks in this figure are based on the Transformer model proposed by Vaswani et al. [17].

Figure 4: Architecture of USE_T and its multi-task/multi-lingual learning paradigm with shared parameters [14].

The multi-task and multi-lingual training paradigm of USE_T makes it well suited for tasks such as semantic sentence-pair retrieval, and in our study it is particularly useful for relating semantically close tweets. The most powerful aspect of this architecture is its ability to find similar texts without requiring pairwise joint encoding: it provides a dense vector representation of each text unit, and these dense vectors can be used to compute distances or similarities between different tweets or sentences [18, 19]. Based on the above reasons, we employ the Universal Sentence Encoder to obtain the sentence representation of tweets. The tweet embeddings obtained from USE_T enable us to calculate the similarity of sentences in a semantic way, which is used in the clustering phase. It should be noted that before feeding tweets into USE_T, text cleaning is a necessary step; we performed it by selecting the most sensible and usable sentences from the pile of tweets. Sentences that contain hashtags or mentions are removed from our corpus.
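A minimal sketch of the embedding step is shown below, assuming TensorFlow Hub is available. The hub handle universal-sentence-encoder-large/5 is our assumption for the Transformer-based, 512-dimensional USE_T model described in Section 4, and the example sentences are made up.

```python
import tensorflow_hub as hub

# Transformer-based Universal Sentence Encoder (USE_T); the exact hub handle
# is our assumption -- the paper only says "transformer based version 5".
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder-large/5")

sentences = [
    "We don't have enough ventilators and hospital beds.",
    "Hospitals are running out of beds and ventilators.",
    "Here is a short song about the impacts of the corona virus.",
]

# Each sentence is mapped to a 512-dimensional vector.
embeddings = embed(sentences).numpy()

# Dot-product similarity between sentence embeddings, as used in the paper.
similarity = embeddings @ embeddings.T
print(similarity.round(2))
```

The first two (paraphrased) sentences should score noticeably higher against each other than against the third, which is the behavior the clustering step relies on.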
As can be seen from Figure 3, once the embeddings are extracted, a clustering algorithm is used to cluster the tweets' sentences based on the embeddings obtained in the previous step. The distance between two sentences' embeddings, in this case, measures the dissimilarity of the tweets. Different clustering algorithms can be used for this purpose, such as K-means, spectral clustering, mean-shift, and density-based spatial clustering (DBSCAN). We use the K-means clustering algorithm here for its simplicity, speed, and the ability to pre-define the number of clusters. The clustering step provides several groups of semantically similar sentences. At a high level, the tweets in the same group should share more similar topics than those in different groups. Although the centroid of each cluster contains the average embedding (and therefore the average topic/concept of that cluster), it does not necessarily capture all topics of that cluster; it could, however, serve as a simple baseline. A better solution for finding the topic of each cluster is to use a text summarization technique that provides a meaningful and sensible summary of the cluster, capturing its key topics. Here we use the TextRank summarization framework. More details on TextRank summarization are provided in [20, 21].

To evaluate the performance of the proposed framework, we used a dataset of tweets posted between 2020-03-29 and 2020-04-30, collected via the Twitter API. This dataset includes more than 8 million English tweets in total. We only used a random sample of 20 percent of this dataset, which contains 1.6M tweets. Figure 5 shows the word cloud of the most frequent words for each day, after removing stopwords. One interesting observation from this plot is that some words are always present among the popular ones across different days. To make the visualization in this figure more informative, we removed words that had already appeared on previous days.

Figure 5: Word cloud of the Twitter corona dataset for each day; each row presents a week.

Figure 6 shows the frequency of 8 popular words over different days, so we can see how the temporal trend of those words changes over time. A box-plot visualization is also presented in Figure 7, which provides information about the most frequent words for the entire dataset. The distribution shown in this plot is acquired over 33 days. For some words, such as "singing", only the first days show a dramatic rise in frequency, after which the frequency drops. Other words, such as "help", are present throughout the entire dataset, with a distribution close to normal. We also labeled the dataset using the opinions of six experts: the labels were collected as a group of words for each topic, and major news media headlines were used as a reference.

In this section we provide the details of the experimental study and the model results. Before going into the model analysis, we give the hyper-parameter values and the experimental setup. We used the Transformer-based Universal Sentence Encoder, version 5, which is available on Google TensorFlow Hub. This version has been trained on a monolingual dataset, with an embedding dimension of 512. We used the dot product as the similarity of embeddings. For the K-means clustering algorithm, the number of clusters is set to 30 and the maximum number of iterations to 300. The algorithm is initialized with K-means++. After clustering, we merged smaller clusters with the larger ones using the similarity of their centroids, since the clusters with smaller sizes are assumed to be noisy.
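A minimal sketch of the clustering and merging steps follows, assuming scikit-learn's KMeans with the hyper-parameters stated above (30 clusters, 300 iterations, k-means++ initialization). The min_size threshold that decides which clusters count as "small" is a hypothetical parameter, since the paper does not state it.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_and_merge(embeddings, n_clusters=30, max_iter=300, min_size=50):
    """Cluster sentence embeddings and fold small clusters into large ones.

    min_size is an illustrative threshold; the paper only says that smaller
    clusters are merged with larger ones via the similarity of their centroids.
    """
    km = KMeans(n_clusters=n_clusters, max_iter=max_iter,
                init="k-means++", random_state=0)
    labels = km.fit_predict(embeddings)

    sizes = np.bincount(labels, minlength=n_clusters)
    large = np.where(sizes >= min_size)[0]
    small = np.where(sizes < min_size)[0]
    if large.size == 0 or small.size == 0:
        return labels

    for c in small:
        # dot-product similarity between the small centroid and large centroids
        sims = km.cluster_centers_[large] @ km.cluster_centers_[c]
        target = large[np.argmax(sims)]
        labels[labels == c] = target  # merge into the most similar large cluster

    return labels

# labels = cluster_and_merge(embeddings)  # embeddings from the encoding step
```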
The clustering algorithm is applied to each day (out of the 33 days), and summarization is performed on the largest clusters of each day. The maximum summary length is set to 20 words, and the TextRank-based method is used for this task. The resulting summary for each day contains two or three sentences.

This section presents the experimental results. Before presenting the outputs, we should note a few points. First, the evaluation is performed on the 20% subset of the samples. Second, we use TF-IDF as a baseline for comparison. Finally, the same clustering method (K-means) and summarization technique (the same TextRank approach described in Section 2.3) are applied to the TF-IDF baseline; for LDA, its default setup is used.

Figure 8 shows the UMAP visualization of the tweet embeddings of the most representative clusters for four days in our dataset. This figure shows five major clusters for each day. The UMAP dimensionality reduction illustrates the separability of the clusters. Considering that all clusters overlap on one major topic (COVID-19), this degree of topic separation is an advantage. Figures 9(a) and (b) illustrate the visualization results of TF-IDF and our approach for the first day of the dataset. As we can see, our approach provides more distinctive groups. For the sake of comparison and to produce a baseline for this dataset, we keep the original pipeline and use TF-IDF instead of the Universal Sentence Encoder. As can be seen from Figure 9(a), the clusters created by TF-IDF are more separable, which is a consequence of the nature of TF-IDF. However, it did not produce better results than our approach, because of its inability to capture the semantic relevance of texts within a topic and their irrelevance to texts outside the topic. In other words, TF-IDF separates topics completely, which we know is not correct, since there are significant overlaps among topics. Figure 9(b) shows that our approach separates topics while taking their overlap into account; that is, our approach obtains overlapping communities in which the tweets share the same topics. The summarized text obtained by our method for each topic on all days is presented in Table 4. Table 1 shows the resulting topics and extracted keywords using TF-IDF, which can be compared to Table 2 for the first day of the dataset. These tables show that using the Universal Sentence Encoder not only enables a better clustering result, but also produces a better summary of each cluster than the TF-IDF approach. The obtained keywords are further evidence that USE_T works better than TF-IDF: USE_T suggests meaningful keywords that uncover the main topics in the tweets, while TF-IDF cannot capture semantics and fails to detect the most relevant keywords representing the topic of each cluster.

Figure 9: Cluster results using UMAP dimensionality reduction for the first day of the dataset: a. TF-IDF; b. our approach.

Table 3 shows the quantitative results of our approach compared to LDA and TF-IDF. We used the labeled version of the dataset for this comparison, with top-10 precision, recall, and F1 as metrics. From this table, it is clearly seen that our approach achieves improvements of over 11 and 27 percent compared to LDA and TF-IDF, respectively.
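The sketch below shows one plausible reading of these metrics for a single topic cluster: the top-10 extracted keywords are compared against the expert-provided group of words for the matching topic. The exact matching protocol is not spelled out in the paper, so this is an illustrative approximation rather than the authors' evaluation code, and the example keyword lists are made up.

```python
def top10_prf(predicted_keywords, gold_words, k=10):
    """Top-k precision/recall/F1 for one topic cluster.

    Assumed protocol: the top-k predicted keywords are matched (exact,
    case-insensitive) against the expert-provided group of words.
    """
    top_k = [w.lower() for w in predicted_keywords[:k]]
    gold = {w.lower() for w in gold_words}
    hits = sum(1 for w in top_k if w in gold)
    precision = hits / k
    recall = hits / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# example with hypothetical keyword lists
print(top10_prf(["lockdown", "ventilators", "testing", "deaths", "vaccine",
                 "masks", "quarantine", "economy", "hospital", "cases"],
                {"lockdown", "testing", "vaccine", "cases", "schools"}))
```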
As is clear from the visualization outputs of TF-IDF, its clusters tend to collapse onto the very same center or appear as noisy points scattered around a centroid. In the case of our approach, the clusters are divided into more clearly separated sub-groups. The reason is that our approach captures semantic features by utilizing USE_T to generate embeddings, while TF-IDF produces a raw vectorization. Moreover, the results of TF-IDF in Table 1 show that it merges topics that should not be combined. In contrast, in our approach clusters can be merged meaningfully, because of their semantic relatedness and the dot-product similarity metric; no comparable mechanism is applicable in the case of TF-IDF. In short, the partitions produced by our proposed method are more meaningful than those of TF-IDF, because it extracts more semantic features from the text and accounts for the overlaps between topics. Furthermore, the semantically close points in the output of USE_T form groups of nearby text representations centered around specific centroids, with some noise around them; in the case of TF-IDF, the noise is spread all over the reduced-dimension output and no visually detectable centers are available (apart from identical points mapped to the same location, which correspond to sentences that were essentially the same, with very close or identical words). Another conclusion from the results is that, owing to the emphasis of USE_T on the semantic representation of tweets, it produces higher-quality and more meaningful summaries of each topic cluster than the TF-IDF method, which only considers word statistics of the tweets. Our approach also finds keywords that better represent the most relevant topics contained in each cluster. Clustering or topic detection based on bag-of-words methods and sparse term-frequency vectorization such as TF-IDF cannot capture semantic relations, and LDA lacks the same capability; the results in Table 3 clearly show the gap between our approach and these two methods.

In this work, we proposed a trending topic detection framework that combines Transformers with text summarization, and applied it to COVID-19 related tweets on Twitter. First, sentence embeddings are extracted using a Transformer. These embeddings are then fed into a clustering algorithm to group similar tweets, and finally text summarization is applied to all sentences of each cluster to provide a short summary of it. This framework presents the resulting topics in the form of concise sentences that are easy for humans to read and comprehend. The approach is applied to a COVID-19 pandemic dataset collected from Twitter, and several experimental studies are performed to assess its performance. Through experimental comparison, we showed that this model outperforms other popular topic detection approaches based on TF-IDF and LDA.

Table 3: Quantitative comparison of the performance of the proposed model and two popular baselines for topic extraction.

References

[2] Measuring emotions in the COVID-19 real world worry dataset.
[3] Exploratory analysis of COVID-19 tweets using topic modeling.
[4] A first look at COVID-19 information and misinformation sharing on Twitter.
[5] Covid-19 and Italy: what next? The Lancet.
[6] Covid TV-UNet: Segmenting COVID-19 chest CT images using connectivity imposed U-Net.
[7] Covid-19: navigating the uncharted.
[8] Covid-19 in Iran: A deeper look into the future. medRxiv.
[9] Recent trends in deep learning based natural language processing. IEEE Computational Intelligence Magazine.
[10] Neural machine translation by jointly learning to align and translate.
[11] Automatic question-answering using a deep similarity neural network.
[12] From word embeddings to document distances.
[13] Sentence-BERT: Sentence embeddings using Siamese BERT-networks.
[14] Chris Tar, et al. Universal sentence encoder.
[15] Deep learning based text classification: A comprehensive review.
[16] TopicBERT: A transformer transfer learning based memory-graph approach for multimodal streaming social media topic detection.
[17] Attention is all you need.
[18] Universal sentence encoder for English.
[19] Chris Tar, Yun-Hsuan Sung, et al. Multilingual universal sentence encoder for semantic retrieval.
[20] Variations of the similarity function of TextRank for automated summarization.
[21] TextRank: Bringing order into text.

A Appendix

Table 4: Summarized text obtained by our method for each topic for all days (each entry: day/index, followed by the generated summary).

Here are yesterday's state by state numbers: This might be relevant information for a certain world leader in a country called the "United" states of America.
2020-03-29/2 People may get infected, possibly even die because of selfish, stupid idiots like this Six years back, I put my brain to thinking about what the world would really be like at the end.
2020-03-30/4 I'm 5ft 8' altho' the wobble base might yield a more energetic game, it looks like it's shorter than the Pro. Help!
2020-03-31/5 A municipal worker sprays disinfectant to prevent the spread of LIVE UPDATES: Cape May closes city beaches amid Looks like I'm gonna need a letter showing it's ok for me to be on my way to work We don't have enough ventilators, health care workers, personal protective equipment and hospital beds, the governor says Not ready for primetime...dont want to give people false hope" A lot of them probably don't know they've even had it Is it just droplets?
2020-04-04/14 From "oh, hopefully no big deal", to "uh, that looks suspicious" slowly creeping to "holy shit, I'm gonna lock myself away from all this" 2020 is definitely going to change what normal life means.
2020-04-04/15 Sopore Adminstration discharged 23 students after completing 14-days of quarantine at Sopore Hospital during government imposed nationwide lockdown as a preventive measures against the spread of the COVID-19 coronavirus I've been in my house WAY too long.
2020-04-05/18 Way people expect State to provide for them currently is a communist concept Focusing on moving from disorder in today's world The Cynefin framework significantly helps us to determine what particular parts we are dealing with, in the decisions needed Hard days, weeks, months, years, lives. Coronavirus world update Today 8 April Stay at home Like government spent millions of money to facilitate the huduma number process, why can't government use the data which was collected to disburse funds to cater for kenyas daily needs Whatever govt decides will be in favour of country citizens.
2020-04-09/27 Q He's what a real leader is like. Coronavirus (COVID-19): How Does It Impact Commercial Leases?
2020-04-09/28 Yeah, little ol' me, the negative one Just wanted to shout out to all the people who feel like they are losing their shit atm.
2020-04-10/29 Such a great idea -not only it would help protect the babies but also making them look so damn cute!
2020-04-10/30 In our latest Marketing Matters video series we uncover some top tips which can help open up new opportunities for your business Here's all you need to know The Way It's Going Right Now I See It As 50-50! People are in need!!
2020-04-11/32 Indigenous Groups Isolated by Coronavirus Face Another Threat: Hunger Uncomfortable to say the least, a human rights violation IMO!
2020-04-12/33 We need help with groceries bills.
2020-04-12/34 Steph Curry Mix -"Baby Pluto" Watch our full episode with Dr Earlier this week, Abbott closed all state parks and historic sites and announced new drive-thru testing efforts to battle the COVID-19 crisis This thing is seriously infectious Wish more people would listen to this guy. I can not believe what I'm watching.
2020-04-14/39 It rained all day, I was feeling a little under the weather, so here are pictures of some tchotchke, as well as a few great albums I listened to today discover things which are of no use to human race -like solar system -rather we should have channelise to understand human body better 4S has implemented several protective measures to enable safe delivery of classroom-based training to meet your workplace H S needs.
2020-04-16/43 Congress needs to make sure patients/others have access to health ins, cancer meds at home lower-income patients/survivors can enroll in Medicaid News 12 featured story about COVID-19 plasma therapy trials being offered at Montefiore Nyack Hospital.
2020-04-16/44 (I ask b/c I had a nightmare he nuked all the "uncooperative" states to "slow the spread") "Does Israel Have the Right to Cage Two Million People in a Coronavirus-Ravaged Prison Camp? The latest The Social Media World!
2020-04-18/48 The quarantine complainers are protesting for the right to spread a deadly virus that has no vaccine or cure and killed 100k people in a few months.
2020-04-18/49 Listen, I think all politicians, countries, governments shouldn't point fingers, it's not the time to blame make excuses.
2020-04-19/50 I need time to grieve People have to find a way to get back to normalcy but with caution.
2020-04-19/51 Estimating COVID-19 Case Fatality Rates (CFR) Update 9th April: the CFR is 0.72% -the lowest end of the current prediction interval and in line with several other estimates It's happening guys.
2020-04-20/54 Tag us and let us know about your co-workers at home:) Speaking to Children about Coronavirus: The new book "Coronavirus -A Book for Children" has been thoughtfully read aloud by Nurse Jane Ferrara.
2020-04-20/55 I just wish I saw more people focusing their time on valuable things like saving the health of the Earth rather than protesting against a decision our government has made to save our lives and the other around us.
2020-04-21/56 If you are a manager who has employees working from home, make sure you reach out to them at least once a day, make them feel valued during this stressful time say thank you for your support during these difficult times!
2020-04-24/64 Life in South Africa will gradually begin to return to normal from next month, with government steadily easing the COVID-19 lockdown regulations, albeit under stringent stipulations.
2020-04-25/65 now its a word I type of say several times an hour "I couldn't even take steps for probably the first five or six days with two people helping me with a walker Brazil reports 128 new cases and 12 new deaths bringing total confirmed cases there to 59,324 and 4,057 total deaths By the way, that's how real sarcasm works.) World Needs To Know This. Who knew World War III would look like this?
2020-04-28/74 When is the right time to open the economy back up?
2020-04-28/75 Here are the ways they see the pandemic transforming our societies: Here's a short song about the impacts of the Corona virus. That's what viruses do.
2020-04-29/76 WestSidewalks (BchSide) Closed-Off. We have considerations for auditors and audit committees to ensure continued high quality financial reporting: Thanks to RZA and the Children's Literacy Society. . . I'm Canadian, but I like Gov. Kristi Noem's plan!
2020-04-29/77 It's like one continuous day right? I got an idea Rest in Peace Signed Coronavirus, what coronavirus? Day 46.
2020-04-29/78 I think care homes get money from the Government for people dying of COVID 19 so they have an incentive.
2020-04-30/79 This is what good government looks like. This is what good planning and information looks like