key: cord-0171710-o8b1rtux
authors: Gao, Zhiwei; Yada, Shuntaro; Wakamiya, Shoko; Aramaki, Eiji
title: NAIST COVID: Multilingual COVID-19 Twitter and Weibo Dataset
date: 2020-04-17
journal: nan
DOI: nan
sha: a946a3147ddc1ad7d4180fe3a7608c7a8632327c
doc_id: 171710
cord_uid: o8b1rtux

Since the outbreak of coronavirus disease 2019 (COVID-19) in late 2019, it has affected over 200 countries and billions of people worldwide. It has disrupted people's social lives through measures such as "social distancing" and "stay at home." This has resulted in increased interaction through social media. Given that social media can provide valuable information about COVID-19 on a global scale, it is important to share the data and encourage social media studies on COVID-19 and other infectious diseases. Therefore, we have released a multilingual dataset of social media posts related to COVID-19, consisting of microblogs in English and Japanese from Twitter and those in Chinese from Weibo. The data cover microblogs from January 20, 2020, to March 24, 2020. This paper also provides quantitative and qualitative analyses of these datasets, the latter by creating daily word clouds as an example of text-mining analysis. The dataset is now available on GitHub. This dataset can be analyzed in a multitude of ways and is expected to help in efficient communication of precautions related to COVID-19.

The outbreak of coronavirus disease 2019 (COVID-19) was observed at the end of 2019 in Wuhan, Hubei Province, China. Since January 2020, it has rapidly spread worldwide. On March 11, 2020, the World Health Organization (WHO) announced that COVID-19 can be characterized as a pandemic. The virus causing COVID-19, severe acute respiratory syndrome coronavirus-2 (SARS-CoV-2), has infected more than 1.2 million people worldwide, and 60,000 people have lost their lives. WHO strongly recommends maintaining "social distancing" measures, and several countries with severe epidemics are further requesting citizens to stay home.

In this scenario, online social media, such as Twitter, Weibo, and Instagram, are playing an important role in sharing information and perceptions about COVID-19. Social media is recognized as a valuable source of data that can support the prediction of various phenomena related to an event. For example, Lampos and Cristianini (2010) showed that microblog data facilitated better public-health surveillance, such as the prediction of the number of patients suffering from influenza.

To encourage and support social media studies on COVID-19, it is crucial to make relevant datasets available to the public. Here, we publish a multilingual dataset that contains over 20 million microblogs related to COVID-19 in English, Japanese, and Chinese from Twitter and Weibo, posted from January 20, 2020, to March 24, 2020. Chen et al. (2020) and Lopez et al. (2020) have already released multilingual datasets collected from Twitter. Given that China was the very first country to face a COVID-19 outbreak, we further collected microblogs about COVID-19 from Weibo, one of the most popular social media platforms in China and similar to Twitter.

The remainder of the paper is organized as follows. In Section 2, we elaborate on the method of data collection. In Section 3, we provide a quantitative analysis of the dataset, such as the character count per microblog and the microblog count per day. In Section 4, we present the daily word cloud images created from microblogs of each language as an example of text-mining analysis. Finally, in Section 5, we present the conclusion with our future work.

To collect the microblogs related to COVID-19, we adopted keyword-based search. For English and Japanese, we collected microblogs related to COVID-19 from Twitter, while we obtained Chinese microblogs from Weibo. We used the Twitter Search API for tweets and a web crawler to retrieve Weibo posts.

We developed three sets of query keywords as shown in Table 1 according to the stages of COVID-19 spread. Corresponding to these sets, our dataset can be divided into three phases:

Phase 1 (January 20 to February 23, 2020): In combination with the term "Wuhan," we used the keywords "pneumonia" and "coronavirus" in English and their translations in Japanese and Chinese. We included the Chinese city name "Wuhan" as the primary keyword, because Wuhan ("武漢" in Japanese and "武汉" in Chinese) observed the earliest outbreak with the maximum number of confirmed cases. Note that in the said period, the official disease name "COVID-19" was yet to be defined.

Phase 2 (February 24 to 29, 2020): WHO assigned the official name "COVID-19" on February 11. We added it to the keywords in combination with "Wuhan," although this yielded fewer retrieved microblogs because every retrieved microblog had to include "Wuhan."

Phase 3 (March 1-24, 2020): To obtain more data, we relaxed the search conditions by querying each set of keywords separately.
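The three phases above can be sketched as a small query builder. This is only an illustration under stated assumptions: the English keywords are taken from the phase descriptions, the full lists are in Table 1, the boolean query syntax is generic rather than the exact syntax sent to either service, and reading "each set of keywords separately" as one query per keyword is our guess. The names `phase_for` and `build_queries` are hypothetical.

```python
from datetime import date
from typing import List

# English keywords named in the phase descriptions; Table 1 holds the full lists.
PLACE = "Wuhan"
DISEASE_TERMS = ["pneumonia", "coronavirus"]
OFFICIAL_NAME = "COVID-19"

# Phase boundaries as stated in the text (inclusive).
PHASES = [
    (date(2020, 1, 20), date(2020, 2, 23), 1),
    (date(2020, 2, 24), date(2020, 2, 29), 2),
    (date(2020, 3, 1), date(2020, 3, 24), 3),
]


def phase_for(day: date) -> int:
    """Return the collection phase that a given day belongs to."""
    for start, end, phase in PHASES:
        if start <= day <= end:
            return phase
    raise ValueError(f"{day} is outside the collection period")


def build_queries(phase: int) -> List[str]:
    """Build illustrative boolean queries for a phase (generic syntax, assumed)."""
    if phase == 1:
        # "Wuhan" combined with either disease term.
        return [f"{PLACE} ({' OR '.join(DISEASE_TERMS)})"]
    if phase == 2:
        # The official name joins the disease terms, still combined with "Wuhan".
        terms = DISEASE_TERMS + [OFFICIAL_NAME]
        return [f"{PLACE} ({' OR '.join(terms)})"]
    if phase == 3:
        # Relaxed conditions: one query per keyword (assumed interpretation).
        return [PLACE, *DISEASE_TERMS, OFFICIAL_NAME]
    raise ValueError(f"unknown phase: {phase}")
```

For example, `build_queries(phase_for(date(2020, 2, 25)))` yields the single Phase 2 query, which explains why that phase retrieves fewer posts: every match must still contain "Wuhan."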

As shown in Table 2, we have collected over 16 million microblogs in English, 9 million in Japanese, and 180 thousand in Chinese from January 20 to March 24, 2020. For both Twitter and Weibo, we adopted a uniform daily schedule, collecting the microblogs posted from 0:00 to 23:59 (JST) of the previous day. To ensure the uniqueness of the data, for Twitter, we filtered out all retweets by adding the "-filter:retweets" operator; for Weibo, we searched for "original microblogs" only. Note that we have collected a smaller amount of data from Weibo than from Twitter because the anti-crawling mechanism in Weibo limits our web crawler to the first 50 pages of the search results.
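The daily collection window and the retweet filter could be expressed as follows. This is a minimal sketch, not the collection script itself: the helper names are hypothetical, and only the "-filter:retweets" operator and the 0:00 to 23:59 JST window come from the description above.

```python
from datetime import datetime, timedelta, timezone
from typing import Tuple

# Japan Standard Time is UTC+9 with no daylight saving time.
JST = timezone(timedelta(hours=9))


def previous_day_window_jst(now: datetime) -> Tuple[datetime, datetime]:
    """Return (start, end) covering 0:00 to 23:59 JST of the day before `now`.

    `now` must be timezone-aware; a naive datetime would be interpreted
    in the machine's local time zone by `astimezone`.
    """
    yesterday = now.astimezone(JST).date() - timedelta(days=1)
    start = datetime(yesterday.year, yesterday.month, yesterday.day, tzinfo=JST)
    end = start.replace(hour=23, minute=59, second=59)
    return start, end


def exclude_retweets(query: str) -> str:
    """Append the operator named in the text so that retweets are dropped."""
    return f"{query} -filter:retweets"
```

Anchoring the window to JST rather than to the server's local time keeps the daily counts comparable across languages, since every post is assigned to a day on the same clock.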

We released the first version of the dataset on GitHub at https://github.com/sociocom/covid19_dataset. Following the terms of service of Twitter and Weibo, we published microblog IDs instead of exposing the original text and metadata. The dataset consists of lists of microblog IDs with two fields of metadata: their timestamps and the query keywords, among our search queries, that are mentioned in the microblogs. This helps users build subsets suitable for subsequent applications and tasks. Since a Weibo microblog is uniquely determined by the combination of user ID and microblog ID, we share the corresponding user ID and microblog ID for each microblog in the form of "user ID/microblog ID."
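As a rough illustration of how such records might be handled, the sketch below splits a Weibo reference of the form "user ID/microblog ID" and selects records by query keyword. The `Record` layout, the function names, and the example IDs used with them are hypothetical; the actual file layout in the repository may differ.

```python
from typing import Iterable, List, NamedTuple, Tuple


class Record(NamedTuple):
    """One dataset entry: an ID plus the two metadata fields (assumed layout)."""
    post_id: str               # tweet ID, or "user ID/microblog ID" for Weibo
    timestamp: str             # posting time as published in the dataset
    keywords: Tuple[str, ...]  # query keywords mentioned in the microblog


def split_weibo_ref(ref: str) -> Tuple[str, str]:
    """Split "user ID/microblog ID" into its two parts."""
    user_id, sep, microblog_id = ref.partition("/")
    if not sep or not user_id or not microblog_id:
        raise ValueError(f"not a 'user ID/microblog ID' reference: {ref!r}")
    return user_id, microblog_id


def with_keyword(records: Iterable[Record], keyword: str) -> List[Record]:
    """Keep the records whose keyword field contains `keyword` (case-insensitive)."""
    wanted = keyword.lower()
    return [r for r in records if wanted in (k.lower() for k in r.keywords)]
```

Filtering on the keyword field is one way to build the task-specific subsets mentioned above without retrieving the original text first.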

We provide basic statistics of our dataset in terms of its volume. First, we show the number of characters in microblogs. Next, we plot the daily number of microblogs as a time series.

While microblogs contain multimodal data (e.g., images and videos), their core content is text. We report the number of characters to quantify the total size of our dataset. Table 3 shows the sum, mean, and standard deviation of the number of characters for each language in our dataset. We removed URLs and punctuation from each microblog so that the counts reflect the characters that constitute the essential content.

In Figure 1(a), a sudden and dramatic increase in the number of English microblogs can be observed on January 28, 2020. According to news reports, that day saw discussion of the death toll in mainland China reaching 100. On the same day, the number of relevant Japanese microblogs also rose sharply, as shown in Figure 1(c). This was a result of many users tweeting extensively about the three newly confirmed cases in Japan, which included people who had not been to Wuhan. Subsequently, there was a substantial increase in the English microblogs on February 25, 2020, as shown in Figure 1(a). On that day, there were reports that "Trump privately vents over his team's response to coronavirus - even though he says that the virus is under control," leading to many microblogs against Trump on Twitter.

In March, as Figure 1(b) shows, the number of microblogs in major English-speaking countries showed an upward trend as the number of confirmed cases increased, and the largest number of microblogs exceeded 9 million a day. Meanwhile, in Japan, the number of daily confirmed cases was relatively small, as shown in Figure 1(d). Therefore, we assume that Japanese Twitter users were not as interested in COVID-19 as users in the major English-speaking countries. In particular, there was a decline in the number of microblogs from March 12 to March 15, 2020. March 12, 2020, was the day of the Olympic flame lighting ceremony and the start of the torch relay for the Tokyo 2020 Olympics. Therefore, we speculate that this sudden decrease was caused by a shift in attention from COVID-19 to the torch relay among many Japanese users.

With regard to the Chinese microblogs, the trends of the numbers are shown in Figures 1(e) and 1(f). These do not fully reflect the quantitative trends of the confirmed cases owing to the limited amount of the microblogs we could collect on a daily basis.
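Both statistics in this section, the per-language character counts and the daily microblog counts, can be sketched with the standard library alone. Several details here are assumptions rather than the procedure actually used: punctuation is taken to mean any Unicode punctuation character, remaining whitespace is counted, the population standard deviation is used, and timestamps are assumed to be ISO 8601 strings with a UTC offset.

```python
import re
import unicodedata
from collections import Counter
from datetime import datetime, timedelta, timezone
from statistics import mean, pstdev
from typing import Dict, Iterable

JST = timezone(timedelta(hours=9))
URL_RE = re.compile(r"https?://\S+")


def content_length(text: str) -> int:
    """Count characters after removing URLs and Unicode punctuation."""
    no_urls = URL_RE.sub("", text)
    kept = [ch for ch in no_urls if not unicodedata.category(ch).startswith("P")]
    return len("".join(kept).strip())


def character_stats(texts: Iterable[str]) -> Dict[str, float]:
    """Sum, mean, and (population) standard deviation of the content lengths."""
    lengths = [content_length(t) for t in texts]
    return {"sum": sum(lengths), "mean": mean(lengths), "std": pstdev(lengths)}


def daily_counts(timestamps: Iterable[str]) -> Counter:
    """Count microblogs per calendar day, with days taken in JST."""
    days = (datetime.fromisoformat(ts).astimezone(JST).date() for ts in timestamps)
    return Counter(days)
```

Because punctuation is detected by Unicode category, the same function handles English, Japanese, and Chinese text, including full-width marks such as "、" and "。".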

(a) The number of English microblogs and the daily confirmed cases in major English-speaking countries in January and February.

(b) The number of English microblogs and the daily confirmed cases in major English-speaking countries in March.

(c) The number of Japanese microblogs and the daily confirmed cases in Japan in January and February.

(d) The number of Japanese microblogs and the daily confirmed cases in Japan in March.

(e) The number of Chinese microblogs and the daily confirmed cases in China in January and February.

In addition to the quantitative analysis, we show an example of qualitative analysis based on our dataset. As an initial attempt, we adopted a word cloud, which is "an electronic image that shows words used in a particular piece of electronic texts or series of texts." In word clouds, the font size of each word is proportional to its term frequency in the corpus, which enables us to grasp the topics of the corpus visually. Daily word cloud images of our dataset for each language are available at https://aoi.naist.jp/2020-covid/wordcloud. Below, we briefly interpret the word clouds shown in Figure 2 to demonstrate a possible text-mining approach that can be applied to our dataset.
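The relationship between term frequency and font size in a word cloud can be illustrated with a short sketch. The function name and the `max_size` parameter are hypothetical, and real word cloud tools typically add layout and clipping rules on top of this linear scaling.

```python
from collections import Counter
from typing import Dict, Iterable


def font_sizes(tokens: Iterable[str], max_size: float = 100.0) -> Dict[str, float]:
    """Scale each word's font size linearly with its term frequency.

    The most frequent word gets `max_size`; all others get a proportional share.
    """
    counts = Counter(tokens)
    if not counts:
        return {}
    top = max(counts.values())
    return {word: max_size * count / top for word, count in counts.items()}
```

With the tokens `["us", "us", "citizen", "die"]`, the word "us" is drawn at the full size and the other two words at half of it, which is why the dominant topic of a day stands out visually.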

Note that we tokenized the microblogs and removed stop words before generating the word clouds. For the Chinese and Japanese tokenization, we used Jieba 10 and MeCab 11 , respectively. We also filtered out the search keywords from each microblog so that these keywords do not dominate the images.
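The preprocessing described above can be sketched as a small function with a pluggable tokenizer. This is an assumption-laden illustration: the default regular-expression tokenizer only suits space-delimited text such as English, a Jieba- or MeCab-based tokenizer would be passed in for Chinese or Japanese, and the stop-word and keyword lists shown in the usage note are placeholders.

```python
import re
from typing import Callable, Iterable, List


def simple_tokenize(text: str) -> List[str]:
    """Lowercase and split on non-word characters, keeping hyphenated terms whole."""
    return re.findall(r"\w+(?:-\w+)*", text.lower())


def preprocess(
    text: str,
    tokenize: Callable[[str], List[str]] = simple_tokenize,
    stop_words: Iterable[str] = (),
    search_keywords: Iterable[str] = (),
) -> List[str]:
    """Tokenize, then drop stop words and the search keywords themselves."""
    drop = {w.lower() for w in stop_words} | {k.lower() for k in search_keywords}
    return [token for token in tokenize(text) if token.lower() not in drop]
```

For instance, `preprocess("The virus spreads in Wuhan", stop_words={"the", "in"}, search_keywords={"Wuhan"})` keeps only "virus" and "spreads". Removing the search keywords matters because every collected post contains at least one of them, so they would otherwise be the largest words in every image.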

A US citizen who lived in Wuhan passed away there because of COVID-19 on February 8, 2020. 12 This was the first reported death of a US citizen. The word cloud of this day, shown in Figure 2(a), contains the related words, e.g., "American," "US," "citizen," and "die." Figure 2(b) is the word cloud on March 16, 2020, in which "social distancing," an important phrase in the fight against the epidemic, appears prominently. We can also notice that another socially important phrase, "stay home," has increased in size in our word cloud series from March 20, 2020.

The first local transmission of COVID-19 inside Japan was reported on January 28, 2020, as described in Section 3.2. Figure 2(c) shows the word cloud on that day. It reflects the fact that the infected patient lived in Nara prefecture and drove a sightseeing-tour bus that carried travelers from Wuhan. We can observe the relevant keywords, such as "奈良 (Nara)," "バス (bus)," and "運転 (drive)."

On March 24, 2020, Japan and the International Olympic Committee (IOC) officially agreed to postpone the planned 2020 Tokyo Olympics until 2021. 13 A notable change in the Japanese word cloud series is the first appearance of the words "オリンピック (Olympics)" and "延期 (postponing)" in that day's figure (i.e., Figure 2(d)).

We can also notice, by observing the corresponding word clouds, that a YouTube video went viral on Japanese Twitter from around January 29 to February 6, 2020. The video was originally made by a Wuhan citizen and later subtitled in Japanese by another YouTuber, 14 and it shows the situation in Wuhan under lockdown. In addition to the word "YouTube," the corresponding word clouds contain the tokens of the video title, i.e., "震源 (hypocenter)," "動画 (video)," and "和訳 (Japanese translation)."

10 https://github.com/fxsjy/jieba
11 https://taku910.github.io/mecab
12 February 8, 2020; CNBC, https://cnb.cx/2R4uYZ1
13 March 24, 2020; The Washington Post, https://wapo.st/2UYXEnG
14 January 29, 2020; YouTube, https://youtu.be/Mcfn5Eh5OVE

Figure 2(e) shows the word cloud on January 20, 2020, in which the term "钟南山 (Zhong Nanshan)" has a larger weight. It was on January 20 that Dr. Zhong indicated the existence of human-to-human transmission of COVID-19, which triggered extensive discussion on Weibo.

Figure 2(f) shows the word cloud on March 10, 2020, in which the word "方舱医院 (mobile cabin hospital)" is conspicuous. According to China's National Health Commission, all of Wuhan's mobile cabin hospitals were closed on March 10. The mobile cabin hospitals, which were instrumental in preventing the spread of the epidemic, had also attracted much attention.

We published a multilingual dataset of microblogs related to COVID-19, collected using relevant query keywords, at https://github.com/sociocom/covid19_dataset. The dataset covers English and Japanese tweets from Twitter and Chinese posts from Weibo. The present version of the dataset (April 20, 2020) encompasses microblogs from January 20 to March 24, 2020.

We then showed possible uses of our dataset through the daily microblog count analysis as an example of quantitative analysis and the word cloud-based analysis as an example of qualitative analysis. The results of the analyses are summarized as follows. For China, which was the first country to face a full-blown outbreak of COVID-19, we can observe from social media that people took the situation and prevention seriously. As the number of confirmed cases in China decreased, the trend in social media shifted toward concern for the global situation. In the UK and the US, the main English-speaking countries, there was initially less social media interest owing to fewer confirmed cases. The subsequent outbreaks sparked discussion about COVID-19 on social media, including the promotion of precautionary measures and recommendations to maintain "social distancing." Meanwhile, Japan showed relatively sluggish growth. However, on March 24, 2020, the announcement of the postponement of the 2020 Olympic Games in Tokyo, along with a relatively rapid growth of confirmed cases, was reflected in increased social media activity. This was accompanied by microblogs expressing concerns about the epidemic and dissatisfaction with government measures.

We believe that this dataset can be analyzed further in many ways, such as sentiment-based analysis and comparison with web search queries or mobility logs 18,19 . Various combinations of data can enable deeper analyses of social media communication. Furthermore, our dataset could help extract useful clinical information from social media and offer hints for efficiently broadcasting clinical information. We continue to collect the microblog data while keeping the repository up to date.

ity/ 19 https://dataforgood.fb.com/tools/disease-prevention-maps

Chen et al. (2020). Covid-19: The first public coronavirus Twitter dataset.

Lampos and Cristianini (2010). Tracking the flu pandemic by monitoring the social web.

Lopez et al. (2020). Understanding the perception of COVID-19 policies by mining a multilanguage Twitter dataset.

This study was supported in part by JSPS KAKENHI Grant Number JP19K20279 and Health and Labor Sciences Research Grant Number H30shinkougyousei-shitei-004.