key: cord-0667452-w61lsdml
authors: Gupta, Raj Kumar; Vishwanath, Ajay; Yang, Yinping
title: COVID-19 Twitter Dataset with Latent Topics, Sentiments and Emotions Attributes
date: 2020-07-14
journal: nan
DOI: nan
sha: 3c2ea594e9935ee57171e32c8a5c997a00b08584
doc_id: 667452
cord_uid: w61lsdml

This paper describes a large global dataset on people's social media responses to the COVID-19 pandemic over the Twitter platform. From 28 January 2020 to 1 September 2021, we collected over 198 million Twitter posts from more than 25 million unique users using four keywords: "corona", "wuhan", "nCov" and "covid". Leveraging topic modeling techniques and pre-trained machine learning-based emotion analytic algorithms, we labeled each tweet with seventeen semantic attributes, including a) ten binary attributes indicating the tweet's relevance or irrelevance to the top ten detected topics, b) five quantitative emotion attributes indicating the degree of intensity of the valence or sentiment (from 0: very negative to 1: very positive), and the degree of intensity of fear, anger, happiness and sadness emotions (from 0: not at all to 1: extremely intense), and c) two qualitative attributes indicating the sentiment category (very negative, negative, neutral or mixed, positive, very positive) and the dominant emotion category (fear, anger, happiness, sadness, no specific emotion) the tweet is mainly expressing. We report the descriptive statistics around these new attributes, their temporal distributions, and the overall geographic representation of the dataset. The paper concludes with an outline of the dataset's possible usage in communication, psychology, public health, economics, and epidemiology.

The dataset license can be found in the OpenICPSR download folder. Essentially, the dataset license is based on the CC BY-NC 2.0 template and considers the need to be consistent with Twitter's terms of service, as the dataset is built upon the content provided by Twitter standard API.

Hence, you should read and agree with Twitter's Terms of Service, Privacy Policy, Developer Agreement, and Developer Policy if you intend to use the dataset. The user should also read the restricted uses from Twitter to avoid using the dataset for any potentially inappropriate use.

Inquiries can be addressed to the corresponding author via email.

The pandemic presents complex and evolving issues that warrant multidisciplinary research and a globally concerted effort. The complexity comes from the disease itself and the surge of the medical, scientific, social, behavioral, and economic issues that the disease has brought about. These issues include reports on daily counts of new cases and mortality rates, scientific discoveries, government responses, news reporting of social behaviors such as panic buying and food hoarding, impact on businesses and economic outlook, and changes in people's everyday lives. The challenges are multi-faceted and unprecedented. There is a growing recognition of the need for multidisciplinary research efforts to support the COVID-19 pandemic response, including disciplines such as social and behavioral science [2] and mental health science [3] .

Twitter is a popular microblogging site widely used by Internet users. According to Statista, as of the fourth quarter of 2019, Twitter had 152 million active users worldwide, and as of the second quarter of 2021, the number is 206 million [4]. Twitter provides the research community a rich source of information about when, where, and what people have to say in their posts (known as "tweets") through its free, publicly accessible standard application programming interface (API) service. However, the raw tweet content is mainly in textual format and is not readily analyzable. When there are many tweets, it takes a significant time to accurately extract information about people's concerns, feelings, and emotions for human analysts and researchers to process and analyze for in-depth patterns and insights.

Twitter has opened up real-time, full-fidelity data streams related to COVID-19 tweets since late April 2020 [5], and a few recent studies have leveraged Twitter for COVID-19 research (e.g., [6, 7]). However, to the best of our knowledge, no others have provided a tweet-by-tweet research resource with rich, semantically and psychologically meaningful attributes surrounding the topics, sentiments, and emotions from the linguistic content of the tweets. The dataset described in this paper provides tweet-by-tweet tagging of the topic clusters that a tweet is semantically related to, the sentiment the tweet is expressing, and the emotional properties associated with the tweet. An initial analysis of the world's emotion trends using a part of this dataset has shed light on the significant change in people's emotional responses to the pandemic from late January to early April 2020 [8]. That study focused on the "emotion" attribute and found that anger overtook fear as the dominant emotion in tweets from January to April 2020.

This paper describes the source and data processing methods that enable the full dataset with tweet-by-tweet topics, sentiments, and emotions available to the research communities. The purpose is to allow more researchers to perform in-depth investigations in all possible areas, such as discovering the correlational patterns between other variables such as government measures and communication of these measures, demographics, economic indicators, and epidemiological markers.

We describe the data collection and data processing methods with a focused interest in tracking and understanding the latent topics, sentiments, and emotions surrounding the COVID-19 pandemic. We applied natural language processing (NLP) techniques, particularly statistical topic clustering techniques, that detect tweets surrounding similar topic clusters. We also applied pre-trained algorithms to tag each tweet with a sentiment valence (unpleasantness/pleasantness) and intensity scores of four different emotions: fear, anger, happiness, and sadness.

We set up our data collection app in early February 2020 by querying Twitter's standard search API [9]. We first used three keywords: "corona", "wuhan" (many people referred to the virus as "wuhan virus" at the initial stages before the official name was announced), and "nCov" (WHO first named the virus "2019-nCov"). On 11 February 2020, after WHO officially named the disease "COVID-19", we added "covid" as a new search keyword.

We focused on English tweets and used the language filter with the Twitter API. Simple sharing of these tweets (i.e., retweets) is not collected for the dataset to ensure the conciseness of the data.

The Twitter API returns the tweet text content with a rich range of attributes. For example, we were able to download the following 12 attributes in our local database. 

Although all the retrieved tweets are relevant to at least one of the four COVID-19 related keywords, many facets or subtopics have been covered in the tweets' "text" content. We applied an unsupervised topic clustering technique called Latent Dirichlet Allocation (LDA) to understand the subtopics. LDA is a probabilistic generative model which learns a multinomial distribution of latent topics in a given document [10] . The advantage of LDA is that it is independent of the corpus size, making it algorithmically efficient to learn topic clusters within a corpus with many tweets such as ours.

First, we pre-processed each raw tweet by converting it to ASCII characters, removing accented characters, forming bigrams and trigrams, filtering out stop words (including most rare and most frequent words), and performing text tokenization. These pre-processed tweets were then converted into a bag-of-words (BoW) corpus. The training data's date range is 28 January 2020 to 27 May 2020, consisting of 51 million tweets.
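The pre-processing steps above can be illustrated with a minimal sketch that uses only the Python standard library. It is not the authors' pipeline: the stop-word list is a tiny hypothetical stand-in, rare and frequent word filtering is omitted, the bigram step joins every adjacent pair without any frequency criterion, and the function names (`to_ascii`, `tokenize`, `with_bigrams`, `bag_of_words`) are introduced here purely for illustration.

```python
import re
import unicodedata
from collections import Counter

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"a", "an", "and", "the", "is", "was", "in", "of", "to", "just"}


def to_ascii(text):
    """Decompose accented characters, then drop anything outside ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def tokenize(text):
    """Lower-case the ASCII-folded text, split on non-alphanumerics, drop stop words."""
    words = re.findall(r"[a-z0-9]+", to_ascii(text).lower())
    return [word for word in words if word not in STOP_WORDS]


def with_bigrams(tokens):
    """Append each adjacent token pair as a single joined token (no frequency filter)."""
    return tokens + [f"{left}_{right}" for left, right in zip(tokens, tokens[1:])]


def bag_of_words(text):
    """Map a raw tweet to token counts (unigrams plus naive bigrams)."""
    return Counter(with_bigrams(tokenize(text)))


tokens = tokenize("Remember when doja cat said corona was just the flu")
print(tokens)
```

A production pipeline would more plausibly learn which bigrams and trigrams to keep from corpus statistics; the sketch only shows the shape of the transformation from raw text to a bag-of-words representation.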

Next, we randomly sampled 1% of the BoW corpus and trained an LDA model whose inference was performed using online variational Bayes [11] . Using the trained LDA-based topic model, we obtained 100 topic clusters. The following list illustrates the top ten topics detected in the dataset (e.g., "topic 1") and the ten most representative words associated with each detected topic (e.g., "people, cases, new, deaths, time, china, realdonaldtrump, lockdown, trump" for "topic 1"), respectively. Lastly, for each tweet in the entire dataset, we assigned a relevance label ("1" or "0") using the trained LDA model based on the contribution of each topic (over a total of 100 topic clusters) to the tweet ("1" indicates if the contribution is > 1%, where "0" indicates otherwise). Table 1 shows example tweets that are tagged with the corresponding topic clusters, respectively. The first ten examples show tweets that are solely relevant to each of the ten topic clusters. The last example shows that a tweet can be relevant to multiple topic clusters simultaneously.
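The relevance labeling rule described above (a tweet is tagged "1" for a topic when that topic's contribution exceeds 1%) can be sketched as follows. The sketch assumes the per-tweet topic contributions are already available as a mapping from topic id to a proportion between 0 and 1, and that the top ten topics carry model ids 0 to 9; both are assumptions made for illustration, as are the names `TOP_TOPIC_IDS` and `relevance_labels`.

```python
# Hypothetical: model ids of the top ten topics, in the order t1..t10.
TOP_TOPIC_IDS = list(range(10))
# Contribution threshold stated in the text (1%).
THRESHOLD = 0.01


def relevance_labels(topic_contributions, top_topic_ids=TOP_TOPIC_IDS, threshold=THRESHOLD):
    """Return {"t1": 0 or 1, ..., "t10": 0 or 1} for one tweet.

    topic_contributions maps a topic id to its contribution (0..1) over all
    100 topic clusters; topics missing from the mapping count as 0.
    """
    return {
        f"t{rank}": int(topic_contributions.get(topic_id, 0.0) > threshold)
        for rank, topic_id in enumerate(top_topic_ids, start=1)
    }


# Made-up contributions: topics 0 and 3 dominate, topic 57 is outside the top ten.
labels = relevance_labels({0: 0.62, 3: 0.30, 57: 0.08})
print(labels)
```

Because each topic is thresholded independently, a single tweet can be flagged as relevant to several topic clusters at once, which matches the multi-topic example mentioned in the text.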

Processing for the sentiment and emotion intensity attributes

As sentiments and emotions are subjective information embedded in the unstructured "text" content, extracting such information with targeted tools is necessary.

We used CrystalFeel [12], a collection of five machine-learning-based algorithms, to extract the sentiment and emotion scores (CrystalFeel is accessible via https://socialanalyticsplus.net/crystalfeel). The development of CrystalFeel involved training and experimental evaluations of features derived from affective lexicons, parts-of-speech, and word embeddings [13], using tweets manually annotated with ground truth values [14]. Table 2 shows five example tweets tagged with these five attributes. The first example shows a tweet with a moderate (i.e., neither very negative nor very positive) sentiment.

[Table 1: example tweet text with topic attributes t1 to t10; first example: "Remember when doja cat said corona was just the flu"]

Note that, in some instances, such as the fourth example ("Being higher risk of covid has me all over the place. Appt is in about 90 mins. Im scared, worried and anxious"), the intensity score may exceed the 0-1 range, which indicates that these cases represent extreme intensities beyond the algorithms' original training samples.

To facilitate a more straightforward interpretation, we used the CrystalFeel algorithm's qualitative output "sentiment" for the dataset. The "sentiment" attribute is derived from the "valence_intensity" score using the following logic.

    # Initialize the sentiment category in the "neutral or mixed" class
    sentiment = "neutral or mixed"
    # Assign the sentiment category based on the degree of the valence intensity
    if valence_intensity <= 0.30:
        sentiment = "very negative"
    elif valence_intensity < 0.48:
        sentiment = "negative"
    elif valence_intensity > 0.70:
        sentiment = "very positive"
    elif valence_intensity > 0.52:
        sentiment = "positive"

Table 3 shows the five tweet examples tagged with their corresponding sentiment categories, qualitatively indicating the sentiment each tweet is mainly expressing.

The underlying dominant emotion behind the sentiments carries more information than the overall valence or sentiment. We used CrystalFeel's "emotion" output to facilitate interpretation of the COVID-19 tweets.

The "emotion" output was obtained using the following logic, which leverages all the valence and emotion intensity scores from CrystalFeel's outputs. It first uses "valence_intensity" as the first-line criterion, as this dimension has very high accuracy, i.e., 0.816 in terms of Pearson correlation with human annotated ground truth values. It then compares the relative intensities of the three primary negative emotions (anger, fear and sadness) to assign a corresponding dominant emotion category. The following script describes the conversion logic:

    # Initialize the emotion category in the "no specific emotion" class
    emotion = "no specific emotion"
    # Assign the emotion category when the valence intensity score exceeds 0.52
    if valence_intensity > 0.52:
        emotion = "happiness"
    # Assign the emotion category when the valence intensity score falls below 0.48
    elif valence_intensity < 0.48:
        emotion = "anger"
        if fear_intensity > anger_intensity and fear_intensity >= sadness_intensity:
            emotion = "fear"
        elif sadness_intensity > anger_intensity and sadness_intensity > fear_intensity:
            emotion = "sadness"

Table 4 shows the five tweet examples tagged with the dominant emotion categories. It is helpful to note that the conversion logic mentioned above is based on application assumptions where CrystalFeel is used for processing short informal text (e.g., tweets, Facebook posts, and comments). The conversion thresholds are derived from heuristics and social media corpora we continuously monitor in our research (see more in [12]).

Users may define and adjust their conversion logic as far as it is appropriate or suitable for different applications. For example, for converting the emotional intensity scores to meaningful categories on short formal text (e.g., news headlines), the conversion logic shall be adjusted accordingly.
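As one way to make the conversion logic adjustable, the sketch below wraps the valence-to-sentiment rule in a function whose thresholds are parameters. The default values are the ones given in this paper; the alternative values in the usage lines are purely hypothetical and are not recommended settings for any particular text type. The function name `sentiment_category` and its parameter names are introduced here for illustration.

```python
def sentiment_category(valence_intensity, very_neg=0.30, neg=0.48, pos=0.52, very_pos=0.70):
    """Convert a valence intensity score into one of five sentiment classes.

    Defaults reproduce the thresholds stated in the paper; callers may pass
    different cut-offs for other kinds of text.
    """
    if valence_intensity <= very_neg:
        return "very negative"
    if valence_intensity < neg:
        return "negative"
    if valence_intensity > very_pos:
        return "very positive"
    if valence_intensity > pos:
        return "positive"
    return "neutral or mixed"


# Default thresholds (short informal text such as tweets).
print(sentiment_category(0.46))
# Hypothetical wider neutral band, e.g. when experimenting with formal text.
print(sentiment_category(0.46, neg=0.45, pos=0.55))
```

Keeping the thresholds as keyword arguments means the same scores can be re-labeled under different assumptions without re-running the intensity models.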

We processed the original "tweet_created_at" (in unix format) and obtained a "tweet_timestamp" attribute (in YYYY-MM-DD HH-MM-SS format). The timezone is maintained as UTC time.
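A minimal sketch of this timestamp conversion is shown below. It assumes that "unix format" means seconds since 1 January 1970 (UTC) and that the target layout is the YYYY-MM-DD HH-MM-SS form given in the data records description; the function name `to_tweet_timestamp` is hypothetical.

```python
from datetime import datetime, timezone


def to_tweet_timestamp(unix_seconds):
    """Format a unix timestamp (seconds) as a UTC string in YYYY-MM-DD HH-MM-SS layout."""
    moment = datetime.fromtimestamp(unix_seconds, tz=timezone.utc)
    return moment.strftime("%Y-%m-%d %H-%M-%S")


# 1584057600 seconds after the epoch is midnight UTC on 13 March 2020.
print(to_tweet_timestamp(1584057600))
```

Passing `tz=timezone.utc` explicitly avoids any dependence on the local timezone of the machine doing the conversion.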

Processing for country/region attribute

As our COVID-19 data collection is keyword-based, the tweets returned appear to come from users from different geographic regions worldwide. In order to facilitate the assessment of the geographic representativeness of the dataset, we converted the "location" attribute from the original Twitter results into a "country/region" attribute. This is done using GeoNames' cities15000 geographic database [15], which contains a mapping between all cities with a population > 15,000 or capitals and a country code.

For example, the original location "Ontario, Canada" is converted to the country/region "Canada", "India" is converted to "India", "Shanghai" is converted to "China", and "London" is converted to "United Kingdom".

If the location is indicated as "online", "The Entire Universe!" or left blank (i.e., no match can be found using the GeoNames database), the country_region is coded as "-", indicating that there is no country or region identifiable information associated with the tweet.

If the user does not indicate any location information, the country/region is maintained as an empty field.
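The conversion rules above can be sketched with a small lookup. The dictionary below is a hypothetical miniature stand-in for the GeoNames table, populated only with the examples from the text, and the matching strategy (splitting on commas and trying the right-most part first) is an assumption rather than the authors' procedure. The sketch follows the rule that an empty location stays empty and an unmatched location becomes "-".

```python
# Hypothetical miniature lookup standing in for the GeoNames-based mapping.
PLACE_TO_COUNTRY = {
    "canada": "Canada",
    "india": "India",
    "shanghai": "China",
    "london": "United Kingdom",
}


def country_region(location):
    """Map a free-text profile location to a country/region label."""
    if location is None or not location.strip():
        return ""  # no location declared: keep the field empty
    parts = [part.strip().lower() for part in location.split(",")]
    # Try the right-most part first, since it is usually the broadest place name.
    for part in reversed(parts):
        if part in PLACE_TO_COUNTRY:
            return PLACE_TO_COUNTRY[part]
    return "-"  # a location was given but no match was found


for raw in ["Ontario, Canada", "Shanghai", "The Entire Universe!", ""]:
    print(repr(raw), "->", repr(country_region(raw)))
```

Distinguishing "" from "-" preserves the difference between users who declared nothing and users whose declared location could not be resolved.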

The data records are constructed as comma-separated value (CSV) files.

As of 1 September 2021, our system collected a total of 198,378,184 tweets worldwide using the four COVID-19 related keywords, with the first retrievable date being 28 January 2020. All the data record files have at least the following three columns or attributes.

• tweet_ID: the unique identifier for this tweet
• user_ID: the unique identifier for the user
• keyword: one of the four keywords ("corona", "wuhan", "nCov" or "covid") we used to query the Twitter API, which returned the tweet

1. tweetid_userid_keyword_topics.csv

This file contains the entire tweets CSV file with the following ten attributes of the processed topic. The file is very large. We recommend that users use python+pandas to view and retrieve data records in this file.

• t1: A binary value of 0 or 1, where 0 - this tweet is not relevant to this topic; 1 - this tweet is relevant to this topic
• t2: A binary value of 0 or 1, where 0 - this tweet is not relevant to this topic; 1 - this tweet is relevant to this topic
• t3: A binary value of 0 or 1, where 0 - this tweet is not relevant to this topic; 1 - this tweet is relevant to this topic
• t4: A binary value of 0 or 1, where 0 - this tweet is not relevant to this topic; 1 - this tweet is relevant to this topic
• t5: A binary value of 0 or 1, where 0 - this tweet is not relevant to this topic; 1 - this tweet is relevant to this topic
• t6: A binary value of 0 or 1, where 0 - this tweet is not relevant to this topic; 1 - this tweet is relevant to this topic
• t7: A binary value of 0 or 1, where 0 - this tweet is not relevant to this topic; 1 - this tweet is relevant to this topic
• t8: A binary value of 0 or 1, where 0 - this tweet is not relevant to this topic; 1 - this tweet is relevant to this topic
• t9: A binary value of 0 or 1, where 0 - this tweet is not relevant to this topic; 1 - this tweet is relevant to this topic
• t10: A binary value of 0 or 1, where 0 - this tweet is not relevant to this topic; 1 - this tweet is relevant to this topic

It is useful to note that the current release includes topic information covering the time range of 28 January 2020 to 1 January 2021.
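Because the topics file is very large, it is usually preferable to stream it rather than load it whole. The sketch below uses only the standard library `csv` module so that it is self-contained; with pandas, the equivalent idea would be to read the file in chunks via the `chunksize` argument of `read_csv`. It assumes the file has a header row whose column names match the attribute names listed above, and the sample rows are made up for illustration.

```python
import csv
import io


def count_topic_tweets(csv_file, topic="t1"):
    """Stream a topics CSV and return (relevant, total) counts for one topic column."""
    reader = csv.DictReader(csv_file)
    total = 0
    relevant = 0
    for row in reader:
        total += 1
        if row[topic] == "1":
            relevant += 1
    return relevant, total


# Made-up sample rows; real IDs and labels come from the released file.
SAMPLE = io.StringIO(
    "tweet_ID,user_ID,keyword,t1,t2,t3,t4,t5,t6,t7,t8,t9,t10\n"
    "1001,501,covid,1,0,0,0,0,0,0,0,0,0\n"
    "1002,502,corona,0,1,0,0,0,0,0,0,0,0\n"
    "1003,503,covid,1,1,0,0,0,0,0,0,0,0\n"
)
print(count_topic_tweets(SAMPLE, topic="t1"))
```

For the released file, the same function could be called on `open("tweetid_userid_keyword_topics.csv", newline="")`, so that only one row is held in memory at a time.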

This file contains the entire tweets file with the processed seven sentiments and emotions attributes. The file is very large. We recommend that users use python+pandas to view and retrieve data records in this file.

• valence_intensity: A continuous variable ranging from 0 to 1, where 0 indicates that this text expresses extremely negative or unpleasant feelings, and 1 indicates that this text expresses extremely positive or pleasant feelings
• fear_intensity: A continuous variable ranging from 0 to 1, where 0 indicates that this text does not express the fear emotion at all, and 1 indicates that this text expresses an extremely high intensity of the fear emotion
• anger_intensity: A continuous variable ranging from 0 to 1, where 0 indicates that this text does not express the anger emotion at all, and 1 indicates that this text expresses an extremely high intensity of the anger emotion
• happiness_intensity: A continuous variable ranging from 0 to 1, where 0 indicates that this text does not express the happiness emotion at all, and 1 indicates that this text expresses an extremely high intensity of the happiness emotion
• sadness_intensity: A continuous variable ranging from 0 to 1, where 0 indicates that this text does not express the sadness emotion at all, and 1 indicates that this text expresses an extremely high intensity of the sadness emotion
• sentiment: A categorical variable that indicates the text mainly expresses one of the five sentiment classes: very negative, negative, neutral or mixed, positive, and very positive
• emotion: A categorical variable that indicates the text mainly expresses one of the five emotion classes: fear, anger, happiness, sadness, and no specific emotion

In addition, the data file includes the following processed attributes.

• Tweet_timestamp: A timestamp in YYYY-MM-DD HH-MM-SS format (in UTC time) processed based on "time_created_at" retrieved from Twitter API
• country/region: A text field that indicates the country or region processed based on the "location" declared in the Twitter author's profile

Individual CSV files extracted for 29 representative countries are also included in the latest release.

Hydrating other attributes. In compliance with Twitter's content redistribution terms, our released dataset contains only two original Twitter data attributes: "tweet_ID" and "user_ID".

Users may use "tweet_ID" or "user_ID" to retrieve or "hydrate" the other attributes (such as the actual "text", "retweet_count", "location", "followers_count") through the standard search API from Twitter directly [9] .

The following provides a simple step-by-step guide to hydrate Twitter data using Twitter standard search API and Python.

1) Request access to the Twitter API via Twitter's developer site.
   a. Apply for a developer account on Twitter
   b. Choose between product tracks (we recommend Standard)
   c. Get approval from Twitter
2) After getting the approval to use the API, a Project and App must be created, which will have its designated API Key, API Key Secret, Access Token, Access Token Secret, and Bearer Token.
3) If Python is the language of choice, we recommend using tweepy to access Twitter data. A simple Python implementation guide provides further details.
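Once credentials are in place, hydration typically proceeds in batches of tweet IDs. The sketch below shows only the batching step in plain Python. The batch size of 100 is an assumption about the per-request limit of Twitter's tweet lookup endpoints and should be checked against the current Twitter documentation; the actual lookup call depends on the client library and is therefore left as a comment rather than written out.

```python
def batched(ids, size=100):
    """Yield consecutive slices of `ids`, each containing at most `size` items."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


# Made-up tweet IDs for illustration; real IDs come from the "tweet_ID" column.
tweet_ids = [str(1_000_000 + offset) for offset in range(250)]

for batch in batched(tweet_ids):
    # Here each batch would be passed to a tweet lookup call of the chosen
    # client library (for example, tweepy) using the credentials from step 2.
    print(len(batch))
```

Tweets that have been deleted or made private since collection are generally not returned by a lookup, so the hydrated set may be smaller than the list of released IDs.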

Apart from the steps mentioned here, there are several third-party tools one could use to hydrate Twitter data, such as Hydrator and twarc. As they are third-party tools, users should carefully check the terms of use before deciding to use them.

Overall raw tweets coverage

As of 1 September 2021, our system has collected a total of 198,378,184 tweets worldwide using the four COVID-19 related keywords, with the first retrievable date being 28 January 2020. Table 6 presents the data overview. Most of the tweets were retrieved based on the "covid" keyword, which returned 169,892,566 tweets, or 85.6%. On average, 14,178 COVID-related tweets were posted every hour, or 340,271 tweets every day. In total, 25,765,886 unique users posted these tweets, based on the "user_ID" attribute. It is helpful to note the following limitations associated with the dataset.
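The averages and the keyword share reported above can be reproduced with a few lines of arithmetic. The sketch assumes that the collection window is counted inclusively, from 28 January 2020 through 1 September 2021; the variable names are introduced here for illustration.

```python
from datetime import date

TOTAL_TWEETS = 198_378_184
COVID_KEYWORD_TWEETS = 169_892_566

# Inclusive number of days between the first and last collection dates.
days = (date(2021, 9, 1) - date(2020, 1, 28)).days + 1

tweets_per_day = TOTAL_TWEETS / days
tweets_per_hour = tweets_per_day / 24
covid_share = 100 * COVID_KEYWORD_TWEETS / TOTAL_TWEETS

print(days, round(tweets_per_day), round(tweets_per_hour), round(covid_share, 1))
```

Under the inclusive-window assumption the results agree with the figures in the text: 583 days, about 340,271 tweets per day, about 14,178 tweets per hour, and an 85.6% share for the "covid" keyword.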

1. Twitter's standard search API has a known limitation because it does not guarantee the retrieved tweets are exhaustive due to indexing and other reasons. In other words, the search API retrieves relevant but not all the tweets that match the search keywords.
2. We were able to retrieve only the first 144 characters of the tweet "text" from the Twitter standard search API, from 28 January 2020 to 18 March 2021. After 19 March 2021, we were able to retrieve and hence have processed full tweet content that may exceed 144 characters. The length of tweets may affect the topic and emotion analysis results before and after 18 March 2021. Re-processing is recommended for applications that require comparison before or after the date.

Validity of the processing methods

Topic identification. The quality of the topic model was evaluated using metrics including perplexity and coherence scores based on suggestions from the literature [16]. We obtained the top ten topics, i.e., "t1", "t2", …, "t10", that received relatively high coherence scores (c_v measure, mean = 0.575) from a model optimized by learning 100 topics and hyperparameters α as a fixed normalized asymmetric Dirichlet prior (1/topic_number) and η = 0.909. We obtained ten topics out of 100 extracted from 500,000-odd data points, a 1% sample from the entire dataset.

Conceivably, training an LDA-based topic model with data from specific Twitter accounts, smaller and more focused date ranges, particular countries of interest, or specific hashtags, would yield more targeted and meaningful results. Hence, we provide our Python source code to help researchers quickly apply and adapt the model for different use scenarios.

The accuracy in determining "valence_intensity", "fear_intensity", "anger_intensity", "happiness_intensity", and "sadness_intensity" was systematically validated in prior research [13], and the scores were subsequently tested for predictive validity in other NLP tasks [17, 18, 19].

The descriptive validity of CrystalFeel is reported in Gupta and Yang's original evaluation experiments using out-of-training-sample test data: the CrystalFeel algorithms' accuracies in terms of Pearson correlation coefficient (r) with manually annotated test data are 0.816 on valence intensity, and 0.708, 0.740, 0.700 and 0.720 on happiness intensity, anger intensity, fear intensity and sadness intensity, respectively [13].

The predictive validity of the valence, happiness, anger, fear, and sadness intensity scores on other tasks has been studied and demonstrated in the context of predicting news social popularity in Facebook and Twitter [17] , in predicting the ingredients of happy moments [18] , and in detecting propaganda techniques in news articles [19] . Hence, researchers may examine the use of the sentiment and emotion intensity scores directly without conversions.

Sentiment and emotion labels. The "sentiment" and "emotion" attributes are obtained based on a conversion logic presented in the "Methods" section. The conversion principle that allows each tweet to be labeled with one of the five emotion categories (i.e., "fear", "anger", "happiness", "sadness", "no specific emotion") follows a conceptual simplification that a single dominant emotion exists for each tweet.

However, some tweets may express "mixed emotions" [20], such as expressing anger and fear simultaneously. Other conversion logic may be explored in future research. For example, Mohammad et al. [14] suggest using the mid-scale threshold, i.e., 0.5, to differentiate high-intensity vs. non-high-intensity emotions. Researchers should examine the intended applications and determine the conversion threshold accordingly.
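One possible alternative conversion, sketched below, labels a tweet with every emotion whose intensity reaches a mid-scale threshold, so that mixed emotions are retained instead of being collapsed into a single dominant category. This is an illustration of the idea rather than a method used for the released dataset; the choice of an inclusive comparison (>= 0.5) and the name `high_intensity_emotions` are assumptions.

```python
EMOTIONS = ("fear", "anger", "happiness", "sadness")


def high_intensity_emotions(scores, threshold=0.5):
    """Return every emotion whose intensity score is at or above the threshold.

    `scores` maps an emotion name to its intensity; the result keeps the fixed
    order of EMOTIONS and may be empty or contain several emotions.
    """
    return [name for name in EMOTIONS if scores.get(name, 0.0) >= threshold]


# Made-up scores for a tweet that expresses fear and anger at the same time.
example = {"fear": 0.62, "anger": 0.55, "happiness": 0.20, "sadness": 0.48}
print(high_intensity_emotions(example))
```

Compared with the single-label "emotion" attribute, this multi-label view trades simplicity of aggregation for a more faithful picture of tweets that carry more than one strong emotion.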

We checked the volume of the tweets related to the top ten identified topic clusters. A majority of tweets (60% of the total) were related to two or more topics. The tweets that solely pertained to "t1" have the highest volume, consisting of 33,680,867 tweets or 23% of the total data volume. Table 7 presents the overall tweet topics statistics. Figure 2 depicts a visualization of the topic clusters in the context of the volume of the tweets. (Note that for the current release, topic data is updated until 1 January 2021.)

The quantitative "valence_intensity" averaged over the whole dataset is 0.455, with the most negative tweet having a valence intensity score of -0.058 and the most positive tweet having a valence intensity score of 1.005 (Table 8). Qualitatively, the counts and distributions of the valence intensity scores converted into sentiment categories are presented in Figure 3. The results indicate that most of the tweets are "negative" or "very negative": together they account for 118,832,924 tweets, or 59.9% of the total.

Plotting the "sentiment" values over daily aggregated tweet counts suggested more subtle patterns (see Figure 4). For example, the single-day peak during this period was 1,075,087 tweets (629,938 were "negative" tweets and 65,165 were "very negative" tweets), which took place on 13 March 2020, one day immediately following the WHO's announcement of the disease as a "pandemic". Further analysis may look into, for example, the sentiment changes before and after more targeted periods based on critical announcements (e.g., to study a week before and after 13 March 2020). The dataset may also allow further research to explore the correlations and predictive values based on the sentiment and emotion scores when overlaid with economic indicators (e.g., stock market changes).

Using the four quantitative emotion intensity attributes, overall statistics show that "anger_intensity" and "fear_intensity" have the highest mean values of 0.445. Table 9 reports the descriptive statistics for the four emotion intensity scores. Qualitatively, the counts and distribution of the most dominant emotion categories based on the "emotion" attribute are presented in Figure 5. The results suggest that, over the 19 months, tweets that are dominantly expressing "anger" (57,764,647 tweets, 29.1%) and tweets that are dominantly expressing "fear" (49,613,158 tweets, 25.0%) together formed the majority of the total tweets. We checked the daily counts of the four emotions for the 19 months (see Figure 6). The significance of the change can be illustrated using the contrast of results at the start and the end of our data range.

For example, on 28 January 2020, a total of 23,405 tweets were posted for the day, and the tweets with "anger" as the dominant emotion formed 15% of the total 23,405 tweets, far less than those tweets with "fear" as the most dominant emotion, which formed 53% of the total 23,405 tweets. In contrast, on 5 October 2020, a total of 602,206 tweets were posted for the day, and the tweets with "anger" as the most dominant emotion formed 39.1%, exceeding those tweets with "fear" as the most dominant emotion, which formed 20.1%.

The trends surfaced some interesting patterns: while both "fear" and "anger" dominated in the overall counts, the trend plot shows that over time, the relative distribution of "fear" has been decreasing, and the relative distribution of "anger" has been increasing; meanwhile, "happiness and other positive expressions" have been increasing, though at a slower rate (see [8], which provides an interpretation based on analysis of early time coverage of this dataset).

We checked the "country/region" attribute converted from "location" attribute to understand the geographic coverage and representativeness of the dataset. The geographical coverage of the tweets is estimated to contain users coming from more than 170 countries, regions, or territories worldwide. Figure 7 shows a visualization of the dataset's geographical representativeness. 

This paper presents a very large COVID-19 Twitter dataset with psychologically meaningful attributes. This dataset may create opportunities to understand both global and local conversations and social sentiments in real-time, at a large scale, potentially leading to rich insights on human behaviors and behavioral changes surrounding the unprecedented pandemic. We envisage its potential usage in five broad areas.

First, for media and communication research, the dataset can be helpful for communication scientists and professionals to evaluate and improve government response, policies, and media communications towards the unprecedented pandemic crisis. For example, a recent study compared communications efforts of health authorities in the United States, the United Kingdom, and Singapore on Facebook during the early period of COVID-19 [22] . As the virus continuously hit different countries in different timeframes and governments implemented different response strategies and policies, one direction of an ongoing related work overlays the location attribute, examining and comparing sentiments, emotions, and topics associated with different countries. The dataset may also help to study how media's topical and emotional framing in their headlines and titles are different from those expressed by the general public.

Second, the dataset is of inherent interest for psychology research. The granularity of the tweet's metadata may allow researchers to dive deeper into more nuanced trends with more profound psychological accounts and insights. One possibility is to look into the sentiments and emotional differences over more fine-grained timelines, examine cultural differences and segregate the users who are influencers vs. the general public. Future research may also look into leveraging user characteristics inference techniques (e.g., [23] ) and the present dataset to investigate user community-specific tendencies and issues.

Third, as the pandemic escalates in its severity and geographical span and is likely to last for a prolonged period, public mental health issues (e.g., [24] ) are more prevalent. The dataset may be used to examine public mental wellbeing. Prior literature (e.g., [25, 26] ) has established the linkage between fear (as an emotion) and anxiety (as a mental disorder) and between sadness (as an emotion category) and depression (as a mental disorder). Hence, it can be fruitful to study the value of the emotion intensity scores and their trends in the context of its duration, frequency, and various user communities.

Fourth, it is potentially valuable to overlay publicly available economic indicators (e.g., daily stock market data, monthly unemployment rates reports) and investigate how the topics, sentiments, and emotional trends present predictive value in future research.

Last but not least, data scientists and epidemiology researchers may find the dataset useful. For example, prior research in Zika [27] and other infectious disease outbreaks [28] have studied and found insights in overlaying with air travel networks and virus genome. Hence the dataset may reveal more hidden patterns and relationships of the large-scale social media content and other pandemic-related data streams.

The dataset described in this paper is available for download at Open ICPSR: https://doi.org/10.3886/E120321. The dataset license is also available at the Open ICPSR download folder. Essentially, the dataset license is based on the CC BY-NC 2.0 template and considers the need to be consistent with Twitter's terms of service as the dataset is built upon the content provided by Twitter standard API.

reviewed the content, and agreed with the submission.

References

World Health Organization, WHO Coronavirus Disease (COVID-19) Dashboard

Using social and behavioural science to support COVID-19 pandemic response

Multidisciplinary research priorities for the COVID-19 pandemic: a call for action for mental health science

Number of monetizable daily active Twitter users (mDAU) worldwide from 1st quarter

Top Concerns of Tweeters During the COVID-19 Pandemic: Infoveillance Study

Creating COVID-19 Stigma by Referencing the Novel Coronavirus as the "Chinese virus" on Twitter: Quantitative Analysis of Social Media Data

Global Sentiments Surrounding the COVID-19 Pandemic on Twitter: Analysis of Twitter Trends

Twitter standard search API

Latent dirichlet allocation

Online learning for latent dirichlet allocation

Emotion Intensity Analysis from Natural Language. Institute of High Performance Computing, A*STAR

CrystalFeel at SemEval-2018 Task 1: Understanding and detecting emotion intensity using affective lexicons

Task 1: Affect in tweets

Reading tea leaves: How humans interpret topic models

Predicting and Understanding News Social Popularity with Emotional Salience Features

What constitutes happiness? Predicting and characterizing the ingredients of happiness using emotion intensity analysis

SocCogCom at SemEval-2020 Task 11: Characterizing and Detecting Propaganda Using Sentence-Level Emotional Salience Features

Eliciting mixed emotions: a meta-analysis comparing models, types, and measures

Leading countries based on number of Twitter users as of

Measuring the Outreach Efforts of Public Health Authorities and the Public Response on Facebook During the COVID-19 Pandemic in Early 2020: Cross-Country Comparison

Inferring latent user properties from texts published in social media

The Psychological and Mental Impact of Coronavirus Disease 2019 (COVID-19) on Medical Staff and General Public -A Systematic Review and Meta-analysis

Fear and anxiety. Handbook of Emotions

Speaking of sadness: Depression, disconnection, and the meanings of illness

Travel Surveillance and Genomics Uncover a Hidden Zika Outbreak during the Waning Epidemic

Global transport networks and infectious disease spread

Kum Seong, Wong Chi Kit, and Zhang Mila, for helpful discussions. We are grateful for the help from Nur Atiqah Othman for her proofreading, which helped to enhance the clarity of the paper. All errors that remain are our sole responsibility.

The source scripts for the trained LDA-based topic model are available at our GitHub page: https://github.com/ajvish91/covid_twitter_scripts. A visualization dashboard on the COVID-19 tweets with hourly refreshed sentiment and emotion trend results is available at: https://socialanalyticsplus.net/corona2019. CrystalFeel is accessible via: https://socialanalyticsplus.net/crystalfeel. Access to CrystalFeel API is available upon request from the corresponding author.

RG acquired the data and extracted sentiment and emotion features. AV extracted topic clusters features. YY initiated, conceptualized, and led the manuscript. All authors performed

The authors declare the following competing interests: RG and YY are co-inventors of the CrystalFeel tool used to extract the sentiment and emotion-related attributes. No other conditions or circumstances present a potential conflict of or competing interest for the other authors.