key: cord-0619599-33m59ajn authors: Alqurashi, Sarah; Alhindi, Ahmad; Alanazi, Eisa title: Large Arabic Twitter Dataset on COVID-19 date: 2020-04-09 journal: nan DOI: nan sha: e680b8b12cb5f5936a266ab76935b6071242b81d doc_id: 619599 cord_uid: 33m59ajn

The 2019 coronavirus disease (COVID-19), which emerged in late December 2019 in China, is now rapidly spreading across the globe. At the time of writing, the number of confirmed cases worldwide has passed two and a half million, with over 180,000 fatalities. Many countries have enforced strict social distancing policies to contain the spread of the virus. This has changed the daily lives of tens of millions of people and pushed them to move their discussions online, e.g., to social media sites such as Twitter. In this work, we describe the first Arabic tweets dataset on COVID-19, which we have been collecting since January 1, 2020. The dataset can help researchers and policy makers study different societal issues related to the pandemic. Tasks related to behavioral change, information sharing, and the spread of misinformation and rumors can also be analyzed.

On December 31, 2019, Chinese public health authorities reported several cases of a respiratory syndrome caused by an unknown disease in the city of Wuhan, China, which subsequently became known as COVID-19. This highly contagious disease continued to spread worldwide, leading the World Health Organization (WHO) to declare a global health emergency on January 30, 2020. On March 11, 2020, the WHO characterized the disease as a pandemic, and many countries around the world, including Saudi Arabia, the United States, the United Kingdom, Italy, Canada, and Germany, have continued to report more cases (World Health Organization and others 2020). At the time of writing, the pandemic is affecting more than 208 countries around the globe, with more than one and a half million confirmed cases (World Health Organization 2020).
Since the outbreak of COVID-19, many governments around the world have enforced different measures to contain the spread of the virus. The measures include travel restrictions, curfews, bans on mass gatherings, social distancing, and possibly city lockdowns. This has disrupted people's routines around the globe, and many have turned to social media platforms for both news and communication. Since the emergence of COVID-19, Twitter has played a significant role in crisis communication, with millions of tweets related to the virus posted daily.

Arabic is the official language of more than 22 countries with nearly 300 million native speakers worldwide. Furthermore, there is a large volume of daily Arabic content on Twitter, as millions of Arabic-speaking users use the network to communicate. For instance, Saudi Arabia alone has nearly 15 million Twitter users as of January 2020 (Statista 2020). Hence, it is important to analyze the behavior and sentiment of Arabic users during this pandemic. Other Twitter COVID-19 datasets have recently been proposed (Chen, Lerman, and Ferrara 2020; Lopez, Vasu, and Gallemore 2020), but with no significant content in Arabic.

In this work, we provide the first dataset dedicated to Arabic tweets related to COVID-19. The dataset is available at https://github.com/SarahAlqurashi/COVID-19-Arabic-Tweets-Dataset. We have been collecting data in real time from the Twitter API since January 1, 2020, by tracking COVID-19 related keywords, which has resulted in more than 3,934,610 Arabic tweets so far. We believe the dataset will be helpful to both researchers and policy makers in studying the pandemic from a social perspective and in analyzing human behavior and information spreading during pandemics. In what follows, we describe the dataset and the collection methods, present initial data statistics, and provide information about how to use the dataset.

* Corresponding author: s43980127@st.uqu.edu.sa
We collected COVID-19 related Arabic tweets from January 1, 2020 until April 15, 2020, using the Twitter streaming API and the Tweepy Python library. We have collected more than 3,934,610 tweets so far. In our dataset, we store the full tweet object, including the ID of the tweet, username, hashtags, and geolocation of the tweet. We created a list of the most common Arabic keywords associated with COVID-19. Using Twitter's streaming API, we searched for any tweet containing the keyword(s) in its text. Table 1 shows the list of keywords used along with the starting date of tracking each keyword. Furthermore, Table 2 shows the list of hashtags we have been tracking along with the number of tweets collected.

Table 1: The list of keywords that we used to collect the tweets.

Keyword                  Start date
One and a half meters    2020-04-01
quarantine activities    2020-04-01
Quarantine               2020-04-01

A summary of the dataset is given in Table 3. While collecting data, we observed that the number of retweets increased significantly in late March. This is likely due to the exponential increase in confirmed COVID-19 cases worldwide, including in Arabic-speaking countries. A relatively small percentage of tweets were geotagged.

The dataset is accessible on GitHub at https://github.com/SarahAlqurashi/COVID-19-Arabic-Tweets-Dataset. However, to comply with Twitter's content redistribution policy, we are distributing only the IDs of the collected tweets. Several tools (such as Hydrator) can be used to retrieve the full tweet object from these IDs. We also plan to provide more details on the pre-processing phase on the GitHub page.

We are continuously updating the dataset to cover more aspects of the COVID-19 Arabic conversations and discussions happening on Twitter. We also plan to study how different groups respond to the pandemic and to analyze information-sharing behavior among users.
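Because only tweet IDs are released, anyone using the dataset first has to turn an ID file into requests for the full tweet objects. The following is a minimal Python sketch of that preparation step, not the authors' tooling. It assumes the ID files contain one numeric tweet ID per line, and that IDs are looked up in batches of at most 100, which is the per-request limit of Twitter's v1.1 statuses/lookup endpoint. The function names `parse_tweet_ids` and `batch_ids` are hypothetical, and the sample IDs are made-up placeholders.

```python
from typing import Iterable, Iterator, List

# Assumption: at most 100 IDs per lookup request (Twitter v1.1 statuses/lookup).
BATCH_SIZE = 100


def parse_tweet_ids(lines: Iterable[str]) -> List[str]:
    """Keep well-formed numeric IDs, dropping blanks, junk, and duplicates.

    The order of first appearance is preserved.
    """
    seen = set()
    ids = []
    for line in lines:
        tweet_id = line.strip()
        if tweet_id.isascii() and tweet_id.isdigit() and tweet_id not in seen:
            seen.add(tweet_id)
            ids.append(tweet_id)
    return ids


def batch_ids(ids: List[str], size: int = BATCH_SIZE) -> Iterator[List[str]]:
    """Yield consecutive chunks of at most `size` IDs."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


if __name__ == "__main__":
    # Made-up placeholder IDs; a real run would iterate over an opened ID file.
    sample = [
        "1245000000000000001\n",
        "\n",
        "1245000000000000002\n",
        "1245000000000000001\n",  # duplicate, dropped by parse_tweet_ids
    ]
    for batch in batch_ids(parse_tweet_ids(sample), size=2):
        # Each batch would be sent as one lookup request by a hydration client.
        print(",".join(batch))
```

A tool such as Hydrator handles this step for you, so the sketch is mainly useful when scripting the lookup yourself.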
[Lopez, Vasu, and Gallemore 2020] Lopez; Vasu; and Gallemore. 2020. Understanding the perception of COVID-19 policies by mining a multilanguage Twitter dataset.

[World Health Organization and others 2020] World Health Organization, et al. 2020. Coronavirus disease 2019 (COVID-19): situation report, 67.

[World Health Organization 2020] World Health Organization. 2020. Coronavirus disease (COVID-19) pandemic.

The authors wish to express their thanks to Batool Mohammed Hmawi for her help in data collection.