title: Content analysis of Persian/Farsi Tweets during COVID-19 pandemic in Iran using NLP
authors: Hosseini, Pedram; Hosseini, Poorya; Broniatowski, David A.
date: 2020-05-17

Iran, along with China, South Korea, and Italy, was among the countries hit hard in the first wave of the COVID-19 spread. Twitter is one of the online platforms widely used by Iranians inside and outside the country for sharing their opinions, thoughts, and feelings about a wide range of issues. In this study, using more than 530,000 original tweets in Persian/Farsi on COVID-19, we analyzed the topics discussed among users, who are mainly Iranians, to gauge and track the response to the pandemic and how it evolved over time. We applied a combination of manual annotation of a random sample of tweets and topic modeling tools to classify the content and the frequency of each category of topics. We identified the top 25 topics, among which the experience of living under home quarantine emerged as a major talking point. We additionally categorized the broader content of tweets, which shows that satire, followed by news, is the dominant tweet type among Iranian users. While this framework and methodology can be used to track the public response to ongoing developments related to COVID-19, a generalization of it can also serve as a useful tool for gauging the Iranian public's reaction to policy measures or events, both local and international.

As COVID-19 has spread rapidly and widely in the United States and globally, this pandemic is shaking up all aspects of daily life in every affected country in an unforeseen manner. Economic activity has been disrupted globally on an unprecedented scale, and governments are resorting to a range of policies and measures to manage the primary and subsequent health, economic, and financial aspects of this crisis. While each country will have its own unique experience of dealing with the pandemic, there are shared aspects in how different societies deal with and react to the spread of the virus. These commonalities stem from the nature of the virus itself and the biological and psychological similarities of humankind regardless of geographical boundaries. Additionally, there is commonality in the policies that have been devised for mitigation and control of the virus across borders. Iran, along with China, South Korea, and Italy, has been among the countries hit hard in the first wave of the viral spread, for reasons that have yet to be fully explained. Iranians have been using social media outlets such as Telegram, WhatsApp, Twitter, Instagram, and Facebook to receive a large portion of their daily news, to spread information to one another, and to express their opinions about developments in the country, such as social unrest or, in this case, events and issues related to the spread of COVID-19. Leveraging machine learning and Natural Language Processing (NLP) techniques, we are conducting an ongoing analysis of the reaction of Persian/Farsi-speaking users 1 on social media, starting with Twitter. In this work, we applied topic modeling to find the themes of tweets posted in Persian/Farsi about COVID-19, followed by manual annotation of a random subset of tweets to assess the distribution of various content types among all tweets.
We believe our framework can be valuable for monitoring the public reaction to ongoing developments, local and international, related to the COVID-19 pandemic, and additionally as a tool and platform for studying future major economic, political, or health-related events among Persian/Farsi-speaking users.

This paper is organized as follows. We begin by reporting the insights and results of our data collection efforts and analysis. In the subsequent section, we outline the methods and experiments applied to the raw data to obtain these results and insights. In section 4, we briefly outline the data collection and preprocessing steps on the raw data. We conclude the paper in section 5 by summarizing the results and discussing future directions and next steps.

In this round of analysis, we report our findings and insights from analyzing tweets in three areas. We first share our insights about the level of activity of users who tweeted in Persian about COVID-19 over time. Next, we look at the correlation between the volume of COVID-related tweets and the official number of confirmed cases in Iran. Lastly, we outline the results of our analysis of the content of the tweets to show which topics were discussed most among users during this ongoing crisis. Related to topic and content, we additionally broke down the tweets in terms of the type of language users used when responding to COVID-19.

As the first step in our analysis, we looked at the volume of tweets on the COVID-19 pandemic over time. To achieve this, we extracted the tweets in Persian that had hashtags related to COVID-19 [detailed in section 4]. Figure 1 shows the number of all COVID-19 related tweets over the span of nearly five weeks since the onset of the crisis in Iran. Figure 2 shows only the number of Persian COVID-19 related original tweets in the same time period. Looking at Figures 1 and 2, we see a dramatic decrease in the number of tweets as we get closer to March 20, Nowruz or the Persian New Year, which is also the first day of spring. This could indicate an increase in travel and trips around the Persian New Year, as users tend to be less active on Twitter around this time. At that point, there was no official enforceable policy in place to limit such road trips and travel across the country.

We additionally looked into the correlation between the volume of COVID-related tweets and the number of confirmed COVID-19 cases in Iran. To do that, we extracted the number of confirmed, death, and recovered cases in Iran from the official website of the Ministry of Health and Medical Education. 2 Figure 3 shows the number of confirmed, death, and recovered cases in Iran. When we compare the number of posted tweets and confirmed cases over the same period in Figures 1 and 3, respectively, we notice that even though the situation was not improving in terms of the number of confirmed COVID-19 cases and the country had not yet reached the peak of the pandemic, conversation and tweets on COVID-19 had already started to decrease. The reasons behind this phenomenon are worth further study. For example, is the decline in conversation about COVID-19 simply due to losing interest in the topic, or is it due to an underestimation of the pandemic by Iranians and a lack of understanding of the concept of flattening the curve?
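To make this comparison concrete, one simple check is the correlation between daily tweet counts and daily confirmed-case counts. The following is a minimal sketch, assuming two hypothetical pandas DataFrames that stand in for our collected tweets and the Ministry of Health figures; the DataFrame and column names are illustrative, not part of any released dataset.

```python
# Minimal sketch: correlate daily Persian COVID-19 tweet volume with the
# official confirmed-case counts. DataFrame and column names are illustrative.
import pandas as pd


def tweet_volume_vs_cases(tweets: pd.DataFrame, cases: pd.DataFrame) -> float:
    """tweets: one row per tweet with a 'created_at' timestamp column.
    cases:  one row per day with 'date' and 'confirmed' columns.
    Returns the Pearson correlation between daily tweet counts and daily cases."""
    daily_tweets = (
        tweets.assign(date=pd.to_datetime(tweets["created_at"]).dt.date)
        .groupby("date")
        .size()
        .rename("n_tweets")
    )
    daily_cases = (
        cases.assign(date=pd.to_datetime(cases["date"]).dt.date)
        .set_index("date")["confirmed"]
    )
    merged = pd.concat([daily_tweets, daily_cases], axis=1, join="inner")
    return merged["n_tweets"].corr(merged["confirmed"])
```

A negative value over the window covered by Figures 1 and 3 would be consistent with the divergence described above.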
We should further add that the number of confirmed cases may differ from the actual number of infected people at any point in time, due to inefficiencies in testing procedures, especially during the early phases of the outbreak.

We conducted Latent Dirichlet Allocation (LDA) [1] topic analysis on the collection of tweets to identify major COVID-19-related topics among Persian-speaking users. In Figure 4, we list the top 25 topics as well as the top words associated with each topic. By looking at the top words and the tweets most strongly associated with each topic in our LDA model, we further analyzed the most popular topics, among which the experience of living under home quarantine was the dominant one. In the following paragraphs, we summarize our findings and share our insights on a few selected topics.

As alluded to earlier, life experience as a result of living under quarantine is a major discussion topic. Users mainly talk about what they wish they could do but now cannot because of their new lifestyle. There are clear signs of frustration and fatigue, as well as complaints about life under quarantine. Some users blame their fellow citizens for not taking the situation seriously. For instance, some blame those who did not follow the quarantine and celebrated Chaharshanbe Suri, the Iranian festival of fire celebrated on the eve of the last Wednesday before Nowruz. Another significant theme among such tweets is a feeling of helplessness and despair. There are also tweets about how individuals miss visiting their parents, siblings, or relatives. The feeling of depression is not specific to the users in this study and is a common problem in many countries facing the crisis; however, it is still not clear whether such feelings are being taken seriously or whether there is any plan to help people facing depression with psychotherapy or counseling. The last major theme is a significant number of satirical tweets, which in many cases are combined with blame and complaints.

Another major theme is, not surprisingly, tweets about news and reports. Many such tweets concern the latest numbers of confirmed, death, and recovered cases both in Iran and abroad, with Italy, Spain, and the United States among the top countries. Interestingly, there is a discussion around lifting the US sanctions against Iran. Tweets associated with this topic come from users who are both for and against lifting the sanctions. Some argue that the US should lift the sanctions to help the Iranian people overcome the COVID-19 crisis, while a majority of tweets are about why the US should not lift the sanctions. Another major topic is tweets about Afghanistan, a country that has recently been experiencing a growing number of COVID-19 cases, with the pandemic affecting it with some delay relative to Iran.

Lastly, there are both pro- and anti-regime tweets. On the pro-regime side, for example, users praise the Basij, a paramilitary force of the Islamic Revolutionary Guard Corps, for taking action to mitigate the crisis, or quote the supreme leader of Iran praising the Iranian people's effort and cooperation in fighting COVID-19. On the anti-regime side, some users blame the Ayatollahs and the regime for taking actions that helped the virus spread and for not taking adequate measures to handle the crisis once it had spread.
We additionally see tweets from anti-regime opposition groups who see this crisis as an opportunity to overthrow the Iranian regime, a goal they have been pursuing in recent years, prior to the COVID-19 pandemic.

In addition to the specific topics discussed in the previous section, we were interested in the broader categories of content among the COVID-related tweets. This required manual labeling of a representative sample of tweets by two individuals [more details in section 3.2]. We started by manually reading tweets prior to the analysis and topic modeling, and observed that certain categories such as satire and complaint are fairly dominant. Our initial observation and hypothesis was that users, representing Iranians, mainly blame different entities (e.g., the government or fellow citizens), complain about the situation, or make jokes, more frequently than they discuss measures for fighting COVID-19. Figure 5 shows the distribution of the different content types over all tweets. To a certain degree, the results validate our hypothesis that users make jokes about the situation rather than talking about how they can fight the virus. Even though we certainly need to annotate and analyze more tweets to better generalize these results, it is still important to think about the reasons behind such a phenomenon. It would be interesting to know why satire is the top content category. Here are two possible explanations: (1) an underestimation of the scale and seriousness of the COVID-19 pandemic, which was also reflected in the tone of officials in the early stages of the crisis; (2) nuances of Iranian culture in using satire as a way of coping with unpleasant realities.

In this round of analysis we use topic modeling, Latent Dirichlet Allocation (LDA) in particular, to analyze the topics of original tweets. 3 The main goal of using LDA here is to find the topics being discussed in Persian tweets. We discuss this analysis in section 3.1. We also annotated a random sample of tweets from two days to find out what type of content the tweets are mostly about. We defined a set of categories including "satire", "news", "opinion", etc., and manually assigned a label to a random sample of tweets to find the theme of the tweets. We discuss the details of annotation in section 3.2.

We used Mallet for the LDA analysis [3]. 4 Using the bag-of-words (BoW) method, we first built a dictionary and corpus from all the cleaned tweets. Then we set k=50, where k is the number of topics, and generated an LDA model using the dictionary and the corpus. We also enabled Mallet's hyper-parameter optimization by setting optimize_interval=10. Details of the implementation can be found in our GitHub repository. In an LDA model, each document 5 is a distribution over topics. We first find the dominant topic in a document; by dominant topic, we mean the topic with the largest association value for that document. Then, we group documents over the entire corpus by their dominant topic. In this way, we find the topics that are dominant among the majority of documents in the corpus, and we call these the top topics. We were also interested in the word-level distribution of topics in the corpus. Each topic in the LDA model is associated with each word in the LDA dictionary to a certain degree. In this step, for each topic, we listed all the words that have a non-zero association with that topic and then counted the number of such words per topic.
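As an illustration of this procedure, the sketch below builds the bag-of-words dictionary and corpus and groups tweets by their dominant topic. It assumes gensim's pre-4.0 Mallet wrapper and a local Mallet installation; mallet_path and tokenized_tweets are placeholders, and this is a simplified stand-in for the implementation in our GitHub repository rather than the exact code.

```python
# Sketch of the dominant-topic grouping described above.
# Requires gensim < 4.0 (which ships the Mallet wrapper) and a local Mallet install.
from collections import Counter

from gensim.corpora import Dictionary
from gensim.models.wrappers import LdaMallet


def dominant_topic_counts(tokenized_tweets, mallet_path, num_topics=50):
    """tokenized_tweets: list of token lists (already cleaned and normalized).
    Returns a Counter mapping topic id -> number of tweets whose dominant
    topic is that topic."""
    # Bag-of-words dictionary and corpus, as described in the text.
    dictionary = Dictionary(tokenized_tweets)
    corpus = [dictionary.doc2bow(tokens) for tokens in tokenized_tweets]

    # k topics, with Mallet's hyper-parameter optimization enabled.
    lda = LdaMallet(
        mallet_path,            # e.g. "/path/to/mallet/bin/mallet" (placeholder)
        corpus=corpus,
        num_topics=num_topics,
        id2word=dictionary,
        optimize_interval=10,
    )

    counts = Counter()
    for doc_topics in lda[corpus]:
        # Dominant topic = the topic with the largest weight for this tweet.
        dominant = max(doc_topics, key=lambda pair: pair[1])[0]
        counts[dominant] += 1
    return counts
```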
The top 10 topics and the words most strongly associated with each are marked in Figure 4 with a * next to the topic ID.

For our content analysis, we first used the MiniBatchKMeans algorithm to cluster the tweets from March 12, 2020 to March 14, 2020 (45,234 in total) into multiple categories. Using the elbow method, we found k=8 to be the optimal number of clusters for our analysis. We then fitted the k-means algorithm with this optimal number of clusters to the TF-IDF vectors of the tweets. This process assigned each tweet one cluster label from {0,...,7}. We randomly sampled 30 tweets from each cluster. We defined a set of categories, {"opinion", "news/quotes", "satire/jokes", "complaint/blame", "solution", "neutral"}, and two annotators manually assigned a label from this set to each sampled tweet. These categories were chosen based on the themes we observed by manually reading and pre-annotating a sample of tweets, and we tried to define them so that they cover a variety of content types. The solution category covers tweets that are constructive and talk about ways of fighting the spread of COVID-19, raising awareness, or giving hope to other users. Neutral tweets are mainly tweets that did not belong to any of the other categories or could not be easily understood (e.g., using the Farsi script to write in local languages). To estimate the overall representation of each category of tweets across all clusters, we multiplied the ratio of each label within each cluster by the cluster's weight (ratio), where a cluster's ratio is the number of tweets in the cluster divided by the total number of tweets.

We used Cohen's kappa from the scikit-learn package [4] to compute inter-annotator agreement. The agreement between annotators is 0.47. One reason for the fairly low agreement is the challenging nature of the task and the many borderline samples. Figure 6 shows some examples of disagreement. To resolve the disagreement cases, the two annotators discussed them together, reading the tweets and looking at the labels that had already been assigned, but without knowing which label had been assigned by whom. In many cases, the final label was chosen from one of the two different assigned labels; in a few cases, the annotators agreed on a label different from those already assigned.

We used the Social Feed Manager (SFM) [2] platform to collect tweets. 6 SFM is software developed at the George Washington University that uses the Twitter Developer API to help researchers collect tweets. We listed a group of hashtags associated with COVID-19 related Persian tweets on Twitter. The hashtags and their English translations are shown in Figure 7. We chose these hashtags based on the trends on Twitter at the time we started the data collection. We have also added new hashtags associated with Persian COVID-19 tweets, which we will use in our future analyses. 7 We started the tweet collection process on March 12, 2020, and we are still collecting tweets in real time so that we can run future analyses on the newly collected data.

We performed some preprocessing on the text of the tweets before using them for our topic analysis. We only considered original tweets. The reason is that, in many cases, replies and quotes either contain little text from the replying user or are shorter and less informative than original tweets.
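To make the content-analysis procedure of section 3.2 concrete, the sketch below outlines the cluster-then-sample pipeline: TF-IDF features, MiniBatchKMeans with k chosen by the elbow method, sampling 30 tweets per cluster for annotation, cluster-ratio weighting of the annotated labels, and Cohen's kappa for inter-annotator agreement. Variable names and the aggregation helper are illustrative assumptions; only the algorithm choices follow the text.

```python
# Sketch of the content-analysis pipeline in section 3.2: TF-IDF features,
# MiniBatchKMeans clustering, per-cluster sampling for manual annotation,
# cluster-ratio weighting of annotated labels, and Cohen's kappa.
import random
from collections import Counter

from sklearn.cluster import MiniBatchKMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import cohen_kappa_score


def elbow_inertias(X, k_values):
    """Inertia for candidate k values, used to pick k via the elbow method."""
    return {k: MiniBatchKMeans(n_clusters=k, random_state=0).fit(X).inertia_
            for k in k_values}


def cluster_and_sample(cleaned_tweets, k=8, per_cluster=30, seed=0):
    """Cluster cleaned tweet texts and sample tweets from each cluster."""
    X = TfidfVectorizer().fit_transform(cleaned_tweets)
    labels = MiniBatchKMeans(n_clusters=k, random_state=seed).fit_predict(X)
    rng = random.Random(seed)
    samples = {}
    for c in range(k):
        members = [i for i, lab in enumerate(labels) if lab == c]
        samples[c] = rng.sample(members, min(per_cluster, len(members)))
    return labels, samples


def weighted_category_shares(labels, annotations):
    """annotations: {cluster_id: list of category labels for its sampled tweets}.
    Each cluster's within-sample category shares are weighted by the cluster's
    share of all tweets (cluster size / total number of tweets)."""
    n_total = len(labels)
    cluster_sizes = Counter(labels)
    shares = Counter()
    for c, cats in annotations.items():
        weight = cluster_sizes[c] / n_total
        for cat, n in Counter(cats).items():
            shares[cat] += weight * (n / len(cats))
    return dict(shares)


def agreement(labels_annotator_1, labels_annotator_2):
    """Cohen's kappa between the two annotators over the same sampled tweets."""
    return cohen_kappa_score(labels_annotator_1, labels_annotator_2)
```

In our setting, k=8 and per_cluster=30, matching the numbers reported above.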
For language filtering, we used the lang=fa attribute and kept only tweets written in Persian. We then removed URLs, emojis, punctuation marks, and English numerals, as well as any mentions of user screen names. Finally, we normalized the tweets' text using the normalizer in the Hazm 8 library. We also created a list of Persian stop words specifically for the analysis of this corona-related collection of tweets. Even though some lists of Persian stop words are already available, we created a new list because the definition of a stop word can differ across tasks and domains. For example, words such as "hard" or "people" are listed as stop words for topic modeling in some of the available Persian stop word lists, while these words can potentially help us understand the theme of a topic among corona-related tweets. The preprocessing step resulted in 530,249 unique tweets and 43,566 unique tokens. These tweets are the input to our analysis in the next step.

In this paper, we collected more than 530,000 original tweets in Persian/Farsi related to the COVID-19 pandemic over time and analyzed their content in terms of the major topics of discussion and the broader categories of tweets. We applied a combination of manual annotation of a random sample of tweets and topic modeling tools to classify the content and the frequency of each category of topics. We identified the top 25 topics, among which the experience of living under home quarantine emerged as a major talking point. We additionally categorized the broader content of tweets, which shows that satire, followed by news, is the dominant tweet type among Iranian users.

There are a few next steps and directions that we are actively working on. (1) We are interested in understanding how COVID-19 related conversations among Iranian users shifted and evolved over time by tracking discussion topics within specific time windows. (2) We will continue the manual annotation of tweets, as a continuation of the content analysis described in section 3.2, for the newly collected data points. We expect these annotations to help us better understand and measure the public's reaction to the pandemic and to the specific events that unfolded over time. (3) We are interested in conducting a deeper analysis of the factuality of news and information shared on Twitter and in categorizing the different types of information in terms of validity and accuracy. While we briefly mentioned examples of false information in the collected tweets in section B, we aim to find additional types of mis/disinformation. For example, we see false claims about vaccines or treatments for the coronavirus in the early days, creating a black market for the sale of such products. We are running further analyses to identify the themes of such false information and the different strategies being used to spread it.

In this section, we explain some technical aspects of our LDA analysis in more detail. We also report some results that we did not include in the main body of the paper for brevity. For choosing the final value of k, the number of topics, we first created multiple LDA models with different values of k, including {50, 100}. We then manually checked the output of these models. Specifically, we looked at the top words in each generated topic to get an understanding of the theme of each topic and to compare the overlap and similarity of the topics.
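A minimal sketch of this inspection step is shown below. It uses gensim's built-in LdaModel as a stand-in for the Mallet models used in our analysis; tokenized_tweets stands for the cleaned, tokenized tweets described in section 4, and all parameters other than the candidate values of k are illustrative.

```python
# Train candidate LDA models and print their top words for manual inspection.
# gensim's LdaModel is used here as a stand-in for the Mallet models.
from gensim.corpora import Dictionary
from gensim.models import LdaModel


def inspect_topic_candidates(tokenized_tweets, k_values=(50, 100), num_words=10):
    """Print the top words of every topic for each candidate number of topics k."""
    dictionary = Dictionary(tokenized_tweets)
    corpus = [dictionary.doc2bow(tokens) for tokens in tokenized_tweets]
    for k in k_values:
        model = LdaModel(corpus=corpus, id2word=dictionary,
                         num_topics=k, passes=5, random_state=0)
        print(f"--- k = {k} ---")
        for topic_id, words in model.show_topics(num_topics=k, num_words=num_words):
            print(topic_id, words)
```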
To be on the safe side and not lose any useful information from the generated topics, we decided to set k large enough to cover all possible topics. The downside of choosing a larger value of k is ending up with more overlapping topics, i.e., topics that are similar to one another to a high degree. The benefit, however, is that we will not miss any topic among the tweets. In the end, between k=50 and k=100 we chose 50, since the topics in the k=100 model were too specific, not informative, and in some cases almost completely overlapping. With k=50, we still have some partially overlapping topics, but far fewer than with k=100. We listed the top 25 topics in the LDA model earlier; here, in Figure 8, we also list the bottom 25 topics of our LDA model.

This is an ongoing study, and we expect to discover new results and insights as we add more and newer data to our pipeline. There are some observations, though, that we find interesting to share here. It is important to mention that these observations may not necessarily generalize across all social media platforms. We are conducting experiments to support the validity of these observations; however, we still find it insightful to share our findings.

One of the challenges for the Iranian government from the early days, as in many other countries hit by COVID-19, was to persuade people to stay home and follow quarantine and social distancing rules. Even though some may argue that the Iranian government did not take quarantine into consideration as a serious option in the beginning, it is undeniable that the government wanted people to stay at home to mitigate the spread of the coronavirus. It was especially challenging to persuade people to follow the quarantine since Nowruz fell in the middle of the COVID-19 crisis, which meant many road trips for the new year holidays that could make the situation even worse.

WhatsApp is a popular messaging platform among Iranian users. There are many groups on WhatsApp that share information about COVID-19, which by itself could be the topic of a study, since the type of messages on WhatsApp can differ from what we usually see on a platform such as Twitter. One phenomenon we observed on WhatsApp among Persian COVID-19 related messages was the use of false information not to mislead people but to promote public health. Here, we describe two sample messages we observed.

Message 1: The message contains a news item claimed to have been broadcast by a well-known TV network outside Iran. It describes Ramsar, a small city in the north of Iran that was among the cities hit early by COVID-19, as the "Wuhan of Iran," because people in Ramsar were so responsible about staying home, following the quarantine, and stopping the spread of the coronavirus. The message uses many positive and encouraging words and adjectives. It specifically asks users to spread the message to acknowledge the good work that the people of Ramsar did in slowing down the number of infections, and it tries to encourage people in other cities to learn from them.

Message 2: The message is written in an image that starts by saying "this is an urgent message, read it immediately." It then refers to the leader [president] of China and his announcement of how the coronavirus was created, quoting him as saying that COVID-19 was a type of gas generated by the USA, and that the USA was unable to control its spread.
The message continues: "the gas was tested in Syria and Afghanistan. US soldiers were also infected by the gas and then were sent to Wuhan in China for a military exercise." It then goes on for two paragraphs stating a great deal of inaccurate information and conspiracy theories. The interesting point is that at the very end of the message, another line is written in red saying, "so if you would like to live longer, please let's stay at home." It is important to note that for both of these messages there is no way, at least as of writing this paper, to know where they originated from; for example, whether they were published directly or indirectly by the government, or whether they were simply created by users with various intentions. However, it is interesting that even though the content and the main body of these messages are full of inaccurate information, the actual message, or the action people are asked to take, is not only not false but fully in line with advice from authorities such as the World Health Organization (WHO).

When looking at the top tweets associated with each topic in our topic modeling analysis, we realized that some tweets were duplicated. We were curious to see why this happened, since in the data collection process we had already removed all duplicate entries from our input JSON files. Because such activity seemed suspicious, we wanted to find out whether these duplication cases were accidental or intentional, and potentially part of an organized effort to spread inaccurate information. By looking at some samples, we found that these duplicated tweets are mainly due to copy-pasting: for example, a tweet posted by one user was copied and pasted a couple of hours later on the same day by one or more different users. At a glance, there is no sign that these cases are associated with a specific content type (e.g., dis- or misinformation, rumor, or news); however, they can be studied in more detail. There is also no sign that such tweets are necessarily copied from accounts with more followers.

References
[1] Latent Dirichlet Allocation.
[2] George Washington University (GWU) Libraries. SFM: Social Feed Manager.
[3] MALLET: A Machine Learning for Language Toolkit.
[4] Scikit-learn: Machine Learning in Python.