key: cord-0458957-9sw4oyls authors: Ameur, Mohamed Seghir Hadj; Aliane, Hassina title: AraCOVID19-SSD: Arabic COVID-19 Sentiment and Sarcasm Detection Dataset date: 2021-10-05 journal: nan DOI: nan sha: 229f582529e812cbe9f5f2d82a3ebd0f6b0af5e8 doc_id: 458957 cord_uid: 9sw4oyls

Coronavirus disease (COVID-19) is an infectious respiratory disease that was first discovered in late December 2019 in Wuhan, China, and then spread worldwide, causing widespread panic and death. Users of social networking sites such as Facebook and Twitter have been reading, publishing, and sharing news, tweets, and articles about the newly emerging pandemic. Many of these users employ sarcasm to convey their intended meaning in a humorous and indirect way, which makes it hard for computer-based applications to automatically identify their goal and the level of harm that they can inflict. Motivated by the emerging need for annotated datasets that tackle these kinds of problems in the context of COVID-19, this paper builds and releases AraCOVID19-SSD, a manually annotated Arabic COVID-19 sarcasm and sentiment detection dataset containing 5,162 tweets. To confirm the practical utility of the built dataset, it has been carefully analyzed and tested using several classification models.

COVID-19 is a highly infectious respiratory disease [1] that was first identified in Wuhan, China, in late December 2019, and then declared a global pandemic in March 2020 by the World Health Organization (WHO) [2]. Since its appearance, governments around the world have adopted urgent protection measures to stop the spread of the virus, such as closing borders, travel restrictions, self-isolation, quarantine, containment, and social distancing. As of late July 2021, COVID-19 has caused more than 170 million confirmed cases and 3 million deaths worldwide.
The severity of these measures, along with the increased number of cases and deaths, has significantly affected people's morale, causing a lot of uncertainty, grief, fear, stress, mood disturbances, and mental health issues [3]. Many people relied on social networking sites such as Facebook and Twitter to express their feelings, thoughts, and opinions by publishing and sharing content related to this newly emerging pandemic. The content that they shared often employed sarcasm to convey their intended meaning in a humorous and indirect way, making it hard for computer-based applications to automatically identify their goal and the level of harm that they can cause. The presence of sarcastic phrases also makes sentiment analysis more difficult, since the literal wording of a sarcastic tweet does not match its intended meaning. This led the research community to devote a lot of attention to the task of automatic sarcasm and sentiment detection. As part of the efforts that are being made to create and share COVID-19 related datasets and tools [4, 5, 6, 7], this paper builds and releases a manually annotated Arabic COVID-19 sarcasm and sentiment detection dataset containing 5,162 tweets. The built dataset is carefully analyzed and tested using several classification models. The main contributions of this paper can be summarized as follows:
• We collected, processed, and made available a large annotated Arabic COVID-19 Twitter sentiment and sarcasm detection dataset, which can be very helpful to the research community.
• To the best of our knowledge, this is the first paper that shares an annotated dataset for both Arabic sentiment and sarcasm detection in the context of the COVID-19 pandemic.
• We compared multiple bag-of-words and pretrained transformer baselines on the two considered tasks (sentiment analysis and sarcasm detection) and reported the obtained results.
The remainder of this paper is organized as follows: Section 2 presents the sentiment and sarcasm detection research studies that have been published in the context of the COVID-19 pandemic. The details of our dataset collection, construction, and statistics are then provided in Section 3. Then, in Section 4, we present and discuss the tests we have done and the results we have obtained. Finally, in Section 5, we conclude our work and highlight some possible future work.

In the last decade, the number of research studies on Arabic sentiment analysis and sarcasm detection has increased significantly. As a result, a large number of datasets have been built and shared with the research community. Among the major studies that created Arabic sentiment analysis datasets, Rushdi et al. [8] presented an Arabic opinion mining dataset containing 500 movie reviews gathered from several blogs and web pages. Their dataset contains the same number of positive and negative instances, 250 each. The authors used several machine learning algorithms to provide baseline results for their annotated dataset. Nabil et al. [9] presented the Arabic Sentiment Tweets Dataset (ASTD). It contains 10,000 Arabic tweets manually annotated with four labels: "objective", "subjective positive", "subjective negative", and "subjective mixed". Their paper also presented the statistics of their constructed dataset as well as its baseline results. Al-Twairesh et al. [10] created two Arabic sentiment lexicons using a large dataset containing 2.2 million tweets. Their lexicons were generated using two methods and evaluated using internal and external datasets. Aly et al. [11] created an Arabic sentiment analysis dataset containing over 63,000 book reviews, where each review is rated on a scale of 1 to 5 stars.
They provided baseline results for their dataset by testing it on the tasks of sentiment polarity and rating classification. Abu Kwaik et al. [12] presented an Arabic sentiment analysis dataset containing 36,000 annotated tweets. The authors employed distant supervision and self-training approaches to annotate the collected tweets. They also released 8,000 tweets that were manually annotated as a gold standard.

For the task of sarcasm detection, several datasets have been published. Karoui et al. [13] created a sarcasm and irony dataset using political tweets from Twitter. They gathered the tweets using politicians' names as keywords and classified them as ironic or non-ironic. Their dataset contains a total of 5,479 tweets, 1,733 of which are ironic while the rest are non-ironic. Ghanem et al. [14] created a shared task for Arabic irony detection consisting of the binary classification of tweets as ironic or non-ironic. They released a dataset composed of 5,030 tweets about political events in the Middle East and Maghreb regions. Their tweets were written in Modern Standard Arabic (MSA) as well as different Arabic regional dialects. Abbes et al. [15] created an irony detection corpus (DAICT) that includes a total of 5,358 annotated MSA and dialectal Arabic tweets. The tweets were collected on the basis of different hashtags related to irony and sarcasm. Their classification included 3 labels: "Ironic", "Not Ironic", and "Ambiguous". Farha et al. [16] presented "ArSarcasm", an Arabic sarcasm detection dataset built by re-annotating an existing sentiment analysis dataset. Their dataset contains 10,547 tweets, 16% of which are sarcastic. They used different baselines to test the utility of their dataset and reported the obtained results.
To the best of our knowledge, no research study has attempted to build a sentiment analysis and sarcasm detection dataset devoted to the COVID-19 pandemic. We therefore believe that our dataset will be an important addition to the efforts being made to make more COVID-19 related datasets available to the research community.

This section first presents the "AraCOVID19-SSD" dataset, its design goals, and the different classes that it contains. Then, it explains the process of tweet collection and annotation that has been adopted and provides the dataset's statistics. The "AraCOVID19-SSD" dataset considers two tasks: sarcasm detection and sentiment analysis. The tasks' descriptions and their annotation details are provided in Table 1. All the 5,162 Arabic tweets of the "AraCOVID19-SSD" dataset are annotated for the two aforementioned tasks (Table 1). A small sample of the "AraCOVID19-SSD" annotated tweets is shown in Table 2.

The first step that we followed to build the dataset was to prepare a set of keywords; we then retrieved the tweets based on those keywords. The keywords were chosen to retrieve the largest possible number of tweets related to COVID-19, a portion of the keywords that we used are [...]

The retrieved tweets were filtered in the following way:
• All the retweets of a given tweet were removed.
• Identical tweets that share the same textual content (when ignoring the tweets' links) were removed. This is done to ensure that the text of each considered tweet is unique.
• Very short tweets that contain fewer than 5 Arabic words were removed.
• Only tweets published within the period from December 15, 2019, to December 15, 2020, were gathered.
After the filtering step, we ended up with a total of 300k unique Arabic tweets related to the COVID-19 pandemic. Due to the high cost of the annotation task, we only required each tweet to be annotated by one expert annotator.
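The filtering rules listed above (retweet removal, deduplication when ignoring links, and a minimum of 5 Arabic words) could be sketched in Python as follows. This is a minimal illustration and not the authors' implementation: the `text` and `is_retweet` field names, the `filter_tweets` helper, and the definition of an "Arabic word" as a run of characters from the basic Arabic Unicode range are all assumptions made for the example.

```python
import re

# Assumption: links are plain http(s) URLs inside the tweet text.
URL_RE = re.compile(r"https?://\S+")
# Assumption: an "Arabic word" is a run of Arabic letters and diacritics
# (U+0621 to U+0652); the paper does not define the exact criterion.
ARABIC_WORD_RE = re.compile(r"[\u0621-\u0652]+")


def filter_tweets(tweets, min_arabic_words=5):
    """Keep unique, non-retweet tweets with enough Arabic words.

    `tweets` is an iterable of dicts with the hypothetical keys
    'text' (str) and 'is_retweet' (bool).
    """
    seen_texts = set()
    kept = []
    for tweet in tweets:
        # Rule 1: drop retweets.
        if tweet.get("is_retweet"):
            continue
        # Rule 2: compare texts with links removed and whitespace collapsed.
        key = " ".join(URL_RE.sub(" ", tweet["text"]).split())
        if key in seen_texts:
            continue
        # Rule 3: drop tweets with fewer than `min_arabic_words` Arabic words.
        if len(ARABIC_WORD_RE.findall(key)) < min_arabic_words:
            continue
        seen_texts.add(key)
        kept.append(tweet)
    return kept
```

The date-range rule is not shown because it depends on how the collection API exposes timestamps; it would be one more condition inside the same loop.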
Relying on a single annotator per tweet allowed us to annotate a total of 5,162 Arabic tweets from the 300k gathered tweets. We plan to continue annotating the remaining gathered tweets gradually, as our funding allows. The manual annotation task was carried out by providing the annotator with the full text of the tweet, including the links, and asking him/her to read the tweet, check the tweet's links if necessary, and annotate it for each one of the 2 labels (tasks). This results in a dataset in which each tweet is labeled for each one of the 2 tasks (as shown in Table 2). The statistics of our "AraCOVID19-SSD" dataset are provided in Table 3. As shown in Table 3, every label value of the two considered tasks has more than 1,000 instances, which helps train robust classification models.

Our tests aim to evaluate the quality of our annotated dataset and to provide baseline results for the two tasks that it includes. To this end, for both the sarcasm detection and sentiment analysis tasks, several deep learning and bag-of-words models were trained and tested. In the following, we first present the Arabic preprocessing that we performed and the different models that we considered. Then, we report and discuss the results of our tests.

We applied a basic preprocessing to all the collected Arabic tweets, which includes:
• The removal of diacritical marks.
• The removal of elongated and repeated characters.
• Arabic character normalization.
• The removal of links and user references (user mentions).
• Tweet tokenization, in which punctuation, words, and numbers are separated.
We note that this preprocessing has been used only when training the baseline models (Section 4.2); it has not been used for the annotation task nor in the final dataset.

Aside from the classical bag-of-words models, pretrained transformer models have recently been used in many NLP tasks and have repeatedly achieved new state-of-the-art results [17].
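As an illustration of the preprocessing steps listed earlier in this section, the following Python sketch applies them in sequence. It is a hedged example rather than the authors' code: the paper does not specify the normalization rules, so the mappings used here (alef variants to bare alef, alef maqsura to ya, ta marbuta to ha) and the choice to collapse runs of three or more identical letters into one are assumptions, as are the `normalize_chars` and `preprocess` names.

```python
import re

DIACRITICS_RE = re.compile(r"[\u064B-\u0652]")   # fathatan through sukun
TATWEEL_RE = re.compile(r"\u0640")               # elongation character
REPEAT_RE = re.compile(r"([^\W\d])\1{2,}")       # 3+ repeats of a letter
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
# Tokens: a run of digits, a run of letters, or one punctuation mark.
TOKEN_RE = re.compile(r"\d+|[^\W\d]+|[^\w\s]")


def normalize_chars(text):
    """Assumed normalization rules; the paper does not list its own."""
    text = re.sub("[\u0622\u0623\u0625]", "\u0627", text)  # alef variants
    text = text.replace("\u0649", "\u064A")                # alef maqsura
    text = text.replace("\u0629", "\u0647")                # ta marbuta
    return text


def preprocess(text):
    """Return the list of tokens of a cleaned tweet."""
    text = URL_RE.sub(" ", text)        # remove links
    text = MENTION_RE.sub(" ", text)    # remove user mentions
    text = DIACRITICS_RE.sub("", text)  # remove diacritical marks
    text = TATWEEL_RE.sub("", text)     # remove elongation
    text = REPEAT_RE.sub(r"\1", text)   # collapse repeated letters
    text = normalize_chars(text)
    return TOKEN_RE.findall(text)
```

Calling `preprocess(tweet_text)` yields a token list that could be joined with spaces before being passed to a bag-of-words vectorizer. Digits are deliberately excluded from the repeated-character rule so that numbers such as 1000 are left intact.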
In the following, we will highlight both the bag-of-words and the transformer models that we considered in our tests.

In our experiments we used three pretrained transformer models:
• AraBERT: A BERT (Bidirectional Encoder Representations from Transformers) model [17] pretrained on 200 million Arabic MSA sentences gathered from different sources [18].
• Multilingual BERT (mBERT): A BERT-based model [17] pretrained on Wikipedia text from the 104 languages with the largest Wikipedias.
• XLM-RoBERTa: A large multilingual language model trained on 2.5TB of filtered Common Crawl data [19].

In our experiments we considered three bag-of-words models:
• Support Vector Machines (SVMs) [20]: discriminative classifiers that use maximum-margin hyperplanes, defined by support vectors, to classify high-dimensional data into a set of predefined categories.
• Random Forests [21]: an extension of the standard decision tree [22] introduced to tackle the overfitting problem that usually occurs when a decision tree grows too deep and learns highly irregular patterns. It constructs multiple trees from random sub-samples of the same training data. The final prediction is then made by averaging the predictions of all the trained trees.
• Logistic Regression [23]: a method for modeling the probability of an outcome given an input variable. It is useful for classification problems, where the goal is to determine whether a new instance fits best into a given category.

The different models were implemented using the following libraries:
• Scikit-learn [24]: a Python-based machine learning library. We used it to train the bag-of-words baselines and to evaluate the performance of all the considered models.
• Flair [25]: a framework for building state-of-the-art NLP models. We used it to train our classification models.
• Huggingface-transformers [26]: a framework for building and pretraining different state-of-the-art NLP models.
We used it to test our pretrained models.
• PyTorch [27]: an open-source library designed for implementing deep neural networks. We used it as a backend for both the Huggingface-transformers and the Flair frameworks.

To evaluate the performance of the considered classification models, we used stratified 5-fold cross-validation. The instances of each one of our dataset's tasks are randomly partitioned into 5 disjoint sets of equal size, each preserving the class proportions of the full dataset. Five experiments are then performed; in each one, one of the five sets is selected for testing and the remaining four are used for training. For each experiment, the weighted F-score is calculated, and finally, the average F-score over the five experiments (the 5 folds) is reported.

The results of the experiments that we performed for the tasks of sarcasm detection and sentiment analysis are provided in Table 4. These experiments show that high classification scores are achieved for both the sentiment and sarcasm detection tasks. Indeed, all the tested models surpassed an F-score of 0.89. We believe that these high F-scores are mainly due to the richness of the dataset (the high number of instances in each class of the considered tasks). We can also observe that the SVM model and the AraBERT transformer model gave the best performance, reaching an F-score of more than 0.95 on the sarcasm detection task and more than 0.92 on the sentiment analysis task. The quality of the obtained results reflects the importance of having a large annotated dataset and confirms the practical utility of our adopted annotation schema.

In this paper, we have presented and published "AraCOVID19-SSD", an Arabic COVID-19 sentiment analysis and sarcasm detection dataset. The dataset contains 5,162 Arabic tweets; each tweet is annotated for two tasks: sentiment analysis and sarcasm detection.
All the dataset's tweets have been manually annotated and validated by human annotators. The quality of the final annotated dataset has been examined via several bag-of-words and transformer models. The considered models were trained and tested using the developed dataset and the obtained results were reported. As future work, we plan to continue enriching the annotated dataset with new tweets to keep it up to date with the latest events and discussions that are being shared on Twitter regarding the COVID-19 pandemic.

[1] Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) and coronavirus disease-2019 (COVID-19): The epidemic and the challenges
[2] Coronavirus diseases (COVID-19) current status and future perspectives: a narrative review
[3] Increased generalized anxiety, depression and distress during the COVID-19 pandemic: a cross-sectional study in Germany
[4] FakeCovid - a multilingual cross-domain fact check news dataset for COVID-19
[5] COVID-19-Fakes: A Twitter (Arabic/English) dataset for detecting misleading information on COVID-19
[6] Large Arabic Twitter dataset on COVID-19
[7] AraCOVID19-MFH: Arabic COVID-19 multi-label fake news & hate speech detection dataset
[8] OCA: Opinion corpus for Arabic
[9] ASTD: Arabic sentiment tweets dataset
[10] AraSenti: large-scale Twitter-specific Arabic sentiment lexicons
[11] LABR: A large scale Arabic book reviews dataset
[12] An Arabic tweets sentiment analysis dataset (ATSAD) using distant supervision and self training
[13] Soukhria: Towards an irony detection system for Arabic in social media
[14] IDAT at FIRE2019: Overview of the track on irony detection in Arabic tweets
[15] DAICT: A dialectal Arabic irony corpus extracted from Twitter
[16] From Arabic sentiment analysis to sarcasm detection: The ArSarcasm dataset
[17] BERT: Pre-training of deep bidirectional transformers for language understanding
[18] AraBERT: Transformer-based model for Arabic language understanding
[19] Unsupervised cross-lingual representation learning at scale
[20] Support vector machines. IEEE Intelligent Systems and their applications
[21] Random forests
[22] An introduction to decision tree modeling
[23] Logistic regression
[24] Scikit-learn: Machine learning in Python
[25] Contextual string embeddings for sequence labeling
[26] Transformers: State-of-the-art natural language processing
[27] PyTorch: An imperative style, high-performance deep learning library