key: cord-0573275-mxq1kb2a authors: Husain, Fatemah; Uzuner, Ozlem title: Transfer Learning Approach for Arabic Offensive Language Detection System -- BERT-Based Model date: 2021-02-09 journal: nan DOI: nan sha: 58cc5f59f651fef5901d1c57b65f6e76084b2651 doc_id: 573275 cord_uid: mxq1kb2a Developing a system to detect online offensive language is very important to the health and the security of online users. Studies have shown that cyberhate, online harassment and other misuses of technology are on the rise, particularly during the global Coronavirus pandemic in 2020. According to the latest report by the Anti-Defamation League (ADL), 35% of online users reported online harassment related to their identity-based characteristics, which is a 3% increase over 2019. Applying advanced techniques from the Natural Language Processing (NLP) field to support the development of an online hate-free community is a critical task for social justice. Transfer learning enhances the performance of the classifier by allowing the transfer of knowledge from one domain or one dataset to others that have not been seen before, thus helping the classifier to be more generalizable. In our study, we apply the principles of transfer learning across multiple Arabic offensive language datasets to compare the effects on system performance. This study aims at investigating the effects of fine-tuning and training a Bidirectional Encoder Representations from Transformers (BERT) model on multiple Arabic offensive language datasets individually and testing it on the other datasets individually. Our experiment starts with a comparison among multiple BERT models to guide the selection of the main model that is used for our study. The study also investigates the effects of concatenating all datasets for fine-tuning and training the BERT model. Our results demonstrate the limited effects of transfer learning on the performance of the classifiers, particularly for highly dialectic comments.
Developing a system to detect online offensive language is very important to the health and the security of online users. Studies have shown that cyberhate, online harassment and other misuses of technology are on the rise, particularly during the global Coronavirus pandemic in 2020. According to the latest report by the Anti-Defamation League (ADL), 35% of online users reported online harassment related to their identity-based characteristics, which is a 3% increase over 2019 1 . Applying advanced techniques from the Natural Language Processing (NLP) field to support the development of an online hate-free community is a critical task for social justice. Transfer learning enhances the performance of the classifier by allowing the transfer of knowledge from one domain or one dataset to others that have not been seen before, thus helping the classifier to be more generalizable. In our study, we apply the principles of transfer learning across multiple Arabic offensive language datasets to compare the effects on system performance. This study aims at investigating the effects of fine-tuning and training a Bidirectional Encoder Representations from Transformers (BERT) model on multiple Arabic offensive language datasets individually and testing it on the other datasets individually. Our experiment starts with a comparison among multiple BERT models to guide the selection of the main model for our study. The study also investigates the effects of concatenating all datasets for fine-tuning and training the BERT model.
1 https://www.adl.org/
The scope of this study covers Arabic text from online user-generated content. While there are multiple forms of the Arabic language, the majority of the content on user-generated platforms is written in dialectic Arabic. The dialectic form of Arabic is the actual spoken Arabic, and it has several categories depending on social and geographical factors. Habash [1] divides the Arabic dialects into seven categories: Egyptian, Levantine, Gulf, North African, Iraqi, Yemenite, and Maltese.
The diversity among Arabic dialects adds difficulties to the process of developing an NLP system that can understand online Arabic content at a level similar to human understanding.
technique is performed to use the same model for a new purpose-specific task. The main feature that distinguishes BERT from other language modeling techniques is the use of a bidirectional language model rather than a unidirectional language model during the fine-tuning process. This bidirectional learning technique consists of a Masked Language Model (MLM) with a pre-training objective that randomly masks some of the tokens from the input, with the objective of predicting the original vocabulary id of the masked word based only on its context [5] .
Multilingual BERT (M-BERT) 2 was proposed by Google Research and has 2 versions: the BERT-base-multilingual-uncased model, which covers 102 languages, and the BERT-base-multilingual-cased model, which covers 104 languages. Wikipedia dumps of each language (excluding user and talk pages) were used to train the models with a shared word piece vocabulary. Previous studies report that M-BERT outperforms other tools when applied to multilingual text; however, M-BERT shows some limitations in tokenizing Arabic sentences, which could degrade the performance of the classifier [6] . This finding is in line with other experiments conducted by Hasan et al. [7] , Saeed et al. [8] , and Keleg et al. [9] , which reported poor performance of M-BERT in Arabic offensive language and hate speech detection in comparison to other word embeddings, machine learning classifiers, and deep learning classifiers. In addition, Abu Farha and Magdy [10] tried M-BERT with the Adam optimiser, training the model for 4 epochs with a learning rate of 1e−5 and setting the maximum sequence length to the maximum length seen in the training set.
However, the results were not as good as the results obtained from the Bidirectional Long Short-Term Memory (BiLSTM) model, the Convolutional Neural Network-Bidirectional Long Short-Term Memory (CNN-BiLSTM) model, and multitask learning models. In [11] , multiple M-BERT-based classifiers were used with different fine-tuning settings for offensive language and hate speech detection tasks, and in both tasks the reported macro F1 score was not better than what has been reported by other studies using simple traditional machine learning methods [12] .
While M-BERT supports various languages, Arabic-specific BERT models have been used as well for Arabic offensive language detection, such as AraBERT and BERT-base Arabic. AraBERT 3 is an Arabic version of the BERT model that shows state-of-the-art performance in multiple downstream tasks [13] . It uses the BERT-base configuration and has pre-training settings similar to those used for the original BERT model, consisting of the Masked Language Modeling (MLM) task and the Next Sentence Prediction (NSP) task. Multiple Modern Standard Arabic (MSA) corpora are used to train the model. These include articles manually scraped from Arabic news websites, 1.5 billion words extracted from news articles from ten major news sources, and OSIAN, the Open Source International Arabic News Corpus. Evaluated on sentiment analysis, question answering, and Named Entity Recognition (NER) tasks, AraBERT outperforms M-BERT and the previous state-of-the-art models achieved by Dahou et al. [14] and Eljundi et al. [6] on all tasks. This finding demonstrates that a pre-trained language model trained on a single language performs better than a multilingual model. Djandji et al. [15] apply AraBERT to the Open-Source Arabic Corpora and Corpora Processing Tools (OSACT) dataset with a multitask learning approach and a multilabel classification approach.
Multitask learning solves the data imbalance problem in the OSACT dataset by leveraging information from multiple tasks simultaneously. The same study also applies a multilabel classification approach using AraBERT, in which all labels of the 2 labeling hierarchies in the OSACT dataset, offensive and hate, are merged under a broad task of violence detection. Results report 90.15% as the highest macro F1 score for offensive language detection when adopting the multitask learning approach with AraBERT. Findings from this study demonstrate the superiority of a multitask learning approach over a multilabel classification approach when using AraBERT for offensive language detection. The error analysis reveals that confusion occurs in tweets that contain offensive words in a non-offensive context. It also shows that most of the errors are related to mockery, sarcasm, or mentioning other offensive and hateful statements within tweets. The Arabic-base-BERT 4 model is another Arabic monolingual BERT model [16] . The
We use four publicly available Arabic offensive language datasets: Aljazeera.net Deleted Comments [17] , the YouTube dataset [18] , the Levantine Twitter Dataset for Hate Speech and Abusive Language (L-HSAB) [19] , and the OSACT offensive and not offensive classification samples [20] . Table 1 provides a summary of the characteristics of each dataset. We use only binary classes: offensive or not offensive. Thus, we convert the different types of offensive language to the offensive class. For example, the L-HSAB dataset differentiates between hate and abusive language classes, which were both converted to the offensive language class. For datasets that are provided in a train/evaluation/test split format, we merge all parts together into one dataset and then randomly apply an 80%-20% train-test split. This supports consistency in the setting among all datasets used in this study, as most of them are provided in one part.
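The label conversion and re-splitting described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the label strings (`hate`, `abusive`, `normal`), the helper names `to_binary` and `merge_and_split`, the fixed seed, and the toy examples are all hypothetical and are not taken from the datasets or from the authors' code.

```python
import random

# Hypothetical label names; the real label strings differ per dataset.
OFFENSIVE_LABELS = {"offensive", "hate", "abusive"}


def to_binary(label):
    """Map any offensive sub-type to 1 and everything else to 0."""
    return 1 if label.strip().lower() in OFFENSIVE_LABELS else 0


def merge_and_split(parts, train_fraction=0.8, seed=42):
    """Merge pre-split parts into one dataset, then re-split it at random."""
    merged = [(text, to_binary(label)) for part in parts for text, label in part]
    rng = random.Random(seed)  # fixed seed (an assumption) for repeatability
    rng.shuffle(merged)
    cut = int(len(merged) * train_fraction)
    return merged[:cut], merged[cut:]


# Toy stand-in for a dataset shipped as train/evaluation/test parts.
train_part = [("t1", "normal"), ("t2", "hate"), ("t3", "abusive"), ("t4", "normal")]
dev_part = [("t5", "normal"), ("t6", "abusive"), ("t7", "normal")]
test_part = [("t8", "hate"), ("t9", "normal"), ("t10", "normal")]

train_set, test_set = merge_and_split([train_part, dev_part, test_part])
print(len(train_set), len(test_set))
```

A plain random shuffle is used here because the text mentions a random split; whether the original split was stratified by class is not stated.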
All datasets were used without performing any preprocessing procedures on the text.
Our experiments depend mainly on the AraBERT model from the Hugging Face 6 library. To ensure that we select the best available BERT model for our task, we use the OSACT and the L-HSAB datasets to perform a quick experiment that evaluates the performance of multiple BERT models supporting Arabic. We use XLM-Roberta 7 (also called XLM-R), M-BERT, Arabic-Base-BERT, and AraBERT. Table 2 shows the resulting macro F1 scores of the experiments. As can be seen from Table 2, AraBERT outperforms the other models. Thus, we select AraBERT for our main experiments. Moreover, Table 2 also demonstrates that Arabic monolingual models perform better than multilingual models.
In all experiments, we apply the same experiment settings: maximum length = 128, batch size = 16, epochs = 5, epsilon = 1e-8, and learning rate = 2e-5. We feed the pooled output from the encoder into a simple Feed Forward Neural Network (FFNN) layer to build the classifier. Experiments were developed in Python using the PyTorch-Transformers library, and evaluation metrics were developed using the Scikit-Learn Python library. Google Colab Pro was used to conduct all experiments. We use macro measurements of precision, recall, and F1, in addition to the accuracy score, to evaluate the performance of the classifiers.
Table 3 shows performance results for the four individual models, each fine-tuned and trained using one dataset and tested on all datasets individually. As can be seen from Table 3, the highest recorded macro recall, macro F1, and accuracy scores are obtained for the OSACT dataset when it is used in both training and testing, while the highest recorded macro precision is reported by the classifier that has been trained using the Aljazeera dataset and tested on the YouTube dataset. It is also noticeable that the Aljazeera dataset is the one with the lowest overall performance scores.
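For readers who want to see what the macro-averaged scores in these tables measure, the following is a dependency-free sketch. The paper computes its metrics with Scikit-Learn; the function `macro_scores` and the toy labels below are hypothetical and only illustrate the definition: precision, recall, and F1 are computed per class and then averaged with equal weight per class, so the minority offensive class counts as much as the majority class.

```python
def macro_scores(y_true, y_pred, labels=(0, 1)):
    """Return accuracy and macro-averaged precision, recall and F1."""
    precisions, recalls, f1s = [], [], []
    pairs = list(zip(y_true, y_pred))
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        # Undefined ratios are set to 0.0 (an assumption of this sketch).
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    return {
        "accuracy": sum(1 for t, p in pairs if t == p) / len(pairs),
        "macro_precision": sum(precisions) / n,
        "macro_recall": sum(recalls) / n,
        "macro_f1": sum(f1s) / n,
    }


# Toy labels: 1 = offensive, 0 = not offensive.
y_true = [1, 0, 0, 0, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 0, 0, 0, 1]
scores = macro_scores(y_true, y_pred)
print(scores)
```

Macro F1 here is the unweighted mean of the per-class F1 scores, not the harmonic mean of macro precision and macro recall.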
Results from the individual dataset experiments show the highest performance when the model is trained and tested using the same dataset, which indicates the limited improvements of the transfer learning approach.
Table 4 shows the results after concatenating all datasets into one corpus and using it to fine-tune and train the classifier. Comparing the best results for each dataset from Table 3 with the results from the concatenated model, overall no improvement is achieved. The OSACT dataset still records the highest performance scores, which are exactly the same as when the model is fine-tuned and trained using the OSACT dataset only. However, the L-HSAB dataset shows a 3% lower macro F1 score. This decrease could be a result of the highly dialectic text of L-HSAB, as the other datasets might not share much Levantine vocabulary.
Looking over samples of the misclassified comments, we calculate the percentage of misclassified comments per class for each experiment that was conducted using the model trained on the concatenated corpus. Table 5 presents a summary of the error analysis. The percentages are calculated per class based on the total instances of each class for each of the four testing datasets. The 5 most common tokens are presented in the table in order of frequency. As can be seen from Table 5, the offensive and not offensive misclassification percentages vary among the datasets. Investigating the top tokens among the misclassified samples shows names of countries (e.g. Saudi Arabia, Qatar, Iraq) and names of famous people (e.g. Kadim, Ahlam, Gibran), which in some cases are the first name and the last name of the same person appearing as two separate tokens. For example, 'Kadim' and 'Sahir' are the first and the last name of the same singer, and 'Gibran' and 'Basil' are the first name and the last name of the same minister.
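The per-class error percentages and the most frequent tokens among misclassified comments can be computed along the following lines. This is a minimal sketch, not the authors' script: the function name `error_summary`, the whitespace tokenization, and the toy English placeholder comments are assumptions made for illustration.

```python
from collections import Counter


def error_summary(texts, y_true, y_pred, top_k=5):
    """Per-class misclassification percentages and top tokens among errors."""
    totals, errors, tokens = Counter(), Counter(), Counter()
    for text, truth, pred in zip(texts, y_true, y_pred):
        totals[truth] += 1
        if truth != pred:
            errors[truth] += 1
            # Whitespace tokenization is an assumption of this sketch.
            tokens.update(text.split())
    # Percentage of each class's instances that were misclassified.
    percentages = {label: 100.0 * errors[label] / totals[label] for label in totals}
    top_tokens = [token for token, _ in tokens.most_common(top_k)]
    return percentages, top_tokens


# Toy placeholder comments: 1 = offensive, 0 = not offensive.
texts = [
    "kadim sahir sings well",
    "kadim sahir concert tonight",
    "nice song",
    "bad bad comment",
    "fine day",
]
y_true = [0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 1, 0]

percentages, top_tokens = error_summary(texts, y_true, y_pred)
print(percentages, top_tokens)
```

In this toy run the two parts of one name surface as separate top tokens, which mirrors the split-name effect seen in the error analysis.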
Multi-token terms of this type need to be processed as one term rather than as separate parts, because the semantic meaning of the parts might not be equivalent to the semantic meaning of the whole term. As a result, some preprocessing procedures are needed to ensure proper understanding and processing of multi-token terms.
4. Assafir Lebanese news articles 11
5. Manually crawled news websites (Al-Akhbar, Annahar, AL-Ahram, and AL-Wafd).
Our datasets come mostly from user-generated content, which differs from the Arabic text used in news articles and books. The type of Arabic used in our datasets is dialectic Arabic, while the one used in training AraBERT is MSA. Thus, a simple fine-tuning process might not be enough to adjust the weights of the AraBERT vocabulary toward our task of offensive language detection, especially if most of the tokens in our datasets are treated as out-of-vocabulary tokens by the AraBERT tokenizer. Increasing the dataset size for training and fine-tuning the AraBERT model does not always improve the performance of the system. Thus, other methods to improve the performance are required. One example is creating a more advanced classifier architecture on top of the BERT model that can give better results than a simple FFNN. Another method could focus on the AraBERT model itself and try to adjust its vocabulary to support the offensive language classification task. A costlier approach could be to train a new BERT model that is customized for the online Arabic offensive language detection task.
In this paper, we present our work on applying transfer learning across several Arabic offensive language detection datasets using the AraBERT model. Our results show that Arabic monolingual BERT models outperform multilingual BERT models.
The results also show poor performance when applying transfer learning across individual datasets from heterogeneous sources and themes, such as YouTube comments from musicians' channels and Aljazeera News comments from political articles. While aggregating knowledge from multiple datasets at the same time shows no effect on the performance when tested on individual datasets, it lowers the performance on the highly dialectic dataset, L-HSAB, by 3% in macro F1 score. The overall findings from our experiments demonstrate the
[1] Synthesis Lectures on Human Language Technologies
[2] Transfer learning from lda to bilstm-cnn for offensive language detection in twitter
[3] Survey on hate speech detection using natural language processing
[4] A unified deep learning architecture for abuse detection
[5] BERT: Pre-training of deep bidirectional transformers for language understanding
[6] hULMonA: The universal language model in Arabic
[7] ALT submission for OSACT shared task on offensive language detection
[8] OSACT4 shared tasks: Ensembled stacked classification for offensive and hate speech in Arabic tweets. European Language Resource Association
[9] ASUOPTO at OSACT4 - offensive language detection for Arabic text
[10] Multitask learning for Arabic offensive language and hate-speech detection
[11] Leveraging affective bidirectional transformers for offensive language detection
[12] OSACT4 shared task on offensive language detection: Intensive preprocessing based approach, in The 4th Workshop on Open-Source Arabic Corpora and Processing Tools (OSACT4)
[13] Arabert: Transformer-based model for arabic language understanding
[14] Arabic Sentiment Classification Using Convolutional Neural Network and Differential Evolution Algorithm
[15] Multi-task learning using arabert for offensive language detection
[16] Kuisail at semeval-2020 task 12: Bert-cnn for offensive speech identification in social media
[17] Abusive language detection on Arabic social media, in Proceedings of the First Workshop on Abusive Language Online
[18] Dataset construction for the detection of anti-social behaviour in online communication in arabic
[19] L-HSAB: A Levantine twitter dataset for hate speech and abusive language
[20] Overview of osact4 arabic offensive language detection shared task