key: cord-0582807-mi5ekctq
title: A transformer based approach for fighting COVID-19 fake news
authors: Shifath, S. M. Sadiq-Ur-Rahman; Khan, Mohammad Faiyaz; Islam, Md. Saiful
date: 2021-01-28
doc_id: 582807
cord_uid: mi5ekctq

The rapid outbreak of COVID-19 has brought humanity to a standstill and created a plethora of other problems. COVID-19 is the first pandemic in history to strike when humanity is at its most technologically advanced and relies heavily on social media platforms for connectivity and other benefits. Unfortunately, fake news and misinformation about the virus also circulate among people and cause serious harm, so fighting this infodemic has become a significant challenge. In this work, we present our solution for the "Constraint@AAAI2021 - COVID19 Fake News Detection in English" challenge. After extensive experimentation with numerous architectures and techniques, we fine-tune eight different transformer-based pre-trained models with additional layers and combine them into a stacking ensemble classifier. We achieve 0.979906542 accuracy, 0.979913119 precision, 0.979906542 recall, and 0.979907901 f1-score on the test dataset of the competition.

The Coronavirus disease 2019 (COVID-19) is an infectious disease caused by SARS coronavirus 2. It has impacted almost every country and changed the social, economic, and psychological state of people worldwide. Especially in the past few months, people have become hungry for information on this topic and are exposed to a large volume of coronavirus-related content through various platforms. In parallel, an infodemic of false and misleading information about the virus has been rising. As staying at home has proved to be the most effective precaution against the virus, online and social media platforms have become people's preferred means of communication and entertainment. Hence, the number of people exposed to rumors and misinformation is greater than ever before. In [13], Facebook advertisements across 64 countries were examined, and 5% of the advertisements were found to contain possible errors or misinformation. Besides, many online news portals purposefully present misleading news to increase their popularity. In recent times, fake news about lockdowns, possible remedies, and vaccinations has caused panic among people, who started stockpiling sanitizers and masks out of fear and disrupted the supply chain. Fighting this ever-increasing amount of fake or misleading news has proven to be as crucial to overcoming the pandemic as finding remedies against the virus. Fake news detection is a critical task, as it requires identifying various news types such as clickbait, propaganda, satire, misinformation, falsification, sloppy journalism, and many more. Traditionally, machine learning-based classifiers have been the go-to solution in this domain. In recent years, sequence models like RNN, LSTM, and CNN have shown good competency, and the introduction of transformers has brought a considerable performance gain. In this work, we propose a solution to the "Constraint@AAAI2021 - COVID19 Fake News Detection in English" task on the provided dataset [16].
Our contributions to this task are the following:
- We perform extensive experimentation, training classic machine learning models and traditional text classification models such as Bidirectional LSTM and one-dimensional CNN on the competition dataset, and incorporating publicly available external datasets into the training of our models.
- We fine-tune state-of-the-art transformer-based pre-trained models as required by our task. We also experiment with additional R-CNN [8], multichannel CNN with attention [12], and multilayer perceptron (MLP) modules to increase model performance. Finally, we construct a stacking ensemble classifier of eight transformer models: BERT [3], GPT-2 [20], XLNet [24], RoBERTa [11], DistilRoBERTa, ALBERT [9], BART [10], and DeBERTa [6], each with additional MLP layers.
- We also present our overall workflow, comparisons, and a performance analysis of our experiments on the validation set.

The issue of fake news detection has been well studied in various fields. Existing machine learning-based methods such as support vector machines (SVM) [5] and decision trees [2] have served as baselines for this task. In [21], a simple classifier leveraging Term Frequency (TF), Term Frequency-Inverse Document Frequency (TF-IDF), and the cosine similarity between vectors as features provided a baseline for fake news detection on the Fake News Challenge (FNC-1) dataset. Deep learning linguistic models like Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNN), and Long Short-Term Memory (LSTM) can analyse variable-length sequential data and discover hidden complex patterns in text. In [1], a fake news detection model based on Bidirectional LSTM [22] is presented. In [14], Bidirectional LSTM along with GloVe [17] and ELMo [18] embeddings was used to encode claims and sentences. Despite their efficacy, these language models still fell short in accuracy. In recent years, transformers and their various modifications have brought considerable performance improvements in many natural language processing tasks. In [7], the contextual relationship between the headline and main text of a news article was analysed using BERT [3]; the authors showed that fine-tuning the pre-trained BERT for specific tasks like fake news detection can outperform existing sequence models. Research efforts focusing on COVID-19 fake news detection have also been made. In [4], a detection model based on information provided by the World Health Organization, UNICEF, and the United Nations was proposed, using ten machine learning algorithms along with a voting ensemble classifier. In [23], a Twitter dataset containing 7623 tweets and corresponding labels was introduced.

We use an ensemble of multiple varieties of pre-trained transformers for the task at hand. Transformers are state-of-the-art tools for natural language processing; they consist of stacked blocks of identical encoders and decoders with self-attention. Short descriptions of the transformer varieties used follow:
1. BERT [3]: BERT (Bidirectional Encoder Representations from Transformers) is a bidirectional language transformer. It is trained with masked language modeling and next-sentence prediction objectives on a sizeable unlabelled text corpus.
2. GPT-2 [20]: It is a modified transformer with larger context and vocabulary sizes. In contrast to the original transformer, it has an additional normalization layer after the final self-attention block.
3. XLNet [24]: It is a modified version of Transformer-XL, trained to learn bidirectional context with an autoregressive method.
4. RoBERTa [11]: It is an optimized BERT, trained on more data and without the next-sentence prediction objective of the original BERT.
5. DistilRoBERTa: It is a distilled, faster version of the RoBERTa base model.
6. ALBERT [9]: It splits the embedding matrix into two smaller matrices and uses two parameter-reduction techniques to increase training speed and decrease memory consumption.
7. BART [10]: It was introduced by Facebook as a sequence-to-sequence model and can be seen as a fusion of BERT [3] and GPT [19].
8. DeBERTa [6]: It builds on RoBERTa with modifications such as disentangled attention and an enhanced mask decoder.

Multilayer Perceptron (MLP): Each of the previous models is followed by a multilayer perceptron module. It consists of a fully connected layer with 64 hidden units, followed by a normalization layer, a linear layer with tanh activation, a dropout layer, and a softmax layer that generates the two-class classification probabilities. All weights are initialized with Xavier initialization.

Ensemble Module: For ensembling, we train a meta-learner consisting of the following units: a fully connected layer with 64 hidden units, a linear layer with tanh activation, a fully connected layer with 128 hidden units, a linear layer with ReLU activation, and a softmax classifier. Figure 1 contains an overview of our proposed architecture.

The dataset [16] provided for the competition contains social media posts related to COVID-19 and corresponding labels indicating whether each post is fake or real. The dataset is divided into three sets: train, validation, and test. The train set contains 6420 samples, and the test and validation sets each contain 2140 social media posts with their corresponding labels. The train set consists of 3360 real and 3060 fake posts, whereas the test and validation sets each consist of 1120 real and 1020 fake posts. On average, a post contains 27.0, 26.79, and 27.46 words in the train, validation, and test sets, respectively. As an experiment, we also used additional data from [23] for training our models. It contains 7623 labeled samples. We used each item's title as a replacement for the 'tweet' field in the competition data because both have similar lengths, and we converted the original multi-class labels to binary ones according to the label type; as a result, 7501 samples were labeled fake and 122 were labeled real. A sketch of this conversion is given below.
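The following is a minimal, illustrative sketch of that conversion; the file name and the column names ('title', 'class') are assumptions on our part rather than the external dataset's documented schema.

```python
# Illustrative only: file and column names below are assumed, not taken from [23].
import pandas as pd

ext = pd.read_csv("fakecovid.csv")             # hypothetical export of the external data
ext = ext.rename(columns={"title": "tweet"})   # titles stand in for tweets

# Collapse the original labels into the competition's binary real/fake scheme:
# only explicitly true/real items are kept as "real"; everything else becomes "fake".
ext["label"] = ext["class"].apply(
    lambda c: "real" if str(c).strip().lower() in {"true", "real"} else "fake"
)
extra_train = ext[["tweet", "label"]]
```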
We perform experiments primarily on traditional language models such as Bidirectional LSTM (Bi-LSTM) [22] with attention, one-dimensional CNN (1D-CNN), Hierarchical Attention Networks (HAN) [25], Recurrent Convolutional Neural Networks (RCNN) [8], and Multichannel CNN with Attention (AMCNN) [12] on the competition dataset. We also experiment with transformer-based pre-trained models like BERT and RoBERTa. The results of these experiments are shown in Table 1. Table 1 makes it evident that the transformer-based pre-trained models score substantially better than the traditional models. However, in these experiments we use just a dense layer with a softmax classifier on top of the transformers for training and prediction, so there is still a lot of scope for improvement.

Table 1 also shows that, among the traditional models, RCNN scores better than the others. So, we add RCNN on top of BERT and RoBERTa and train these combined models. In both cases, we find that the addition does not improve on the transformer-based models with a simple classification layer. The results of these experiments are shown in Table 2 with the corresponding model names. We also experiment with a few classic machine learning classifiers, such as Decision Tree (DT) [2] and support vector machines (SVM) [5] with different kernels, on top of the transformer-based models. Among these, SVM with a Radial Basis Function (RBF) kernel achieves the highest score, which is almost the same as that of the simple classifier. Since, among all these additions, only an MLP on top of RoBERTa performs better than the transformer-based models with a simple classifier, we add a Multi-Layer Perceptron (MLP) on top of the transformer-based models. We test different combinations of MLPs and choose the best-performing one.

Initially, we experiment with only BERT and RoBERTa among the transformer-based pre-trained models. To add diversity and capture different hidden information in the data, we then select the eight most suitable transformer-based models based on their performance and low resource requirements: BERT, RoBERTa, XLNet, GPT-2, ALBERT, DistilRoBERTa, BART, and DeBERTa. We use the base architecture of each, since we do not have the resources for larger variants, and add the best-performing MLP on top. These models are tuned individually on the train and validation sets. The individual results are shown in Table 3.

For ensembling, we build a dataset from the predictions of the individual models of Table 3: for each training sample, we form a feature vector whose features are the predictions of each individual model, and we use it to train a meta-learner that maps the individual predictions to a more generalized final output, as sketched below. We experiment with random forest and SVM classifiers as meta-learners and achieve accuracies of 0.9785 and 0.9789, respectively, on the validation set. Since these methods do not yield a significant improvement, we introduce a more robust meta-learner consisting of fully connected layers and achieve a considerable performance gain. We also try different combinations of transformer-based models in the ensemble. The results of the best three ensembles are summarised in Table 4: Ensemble-v1 consists of RoBERTa, BERT, XLNet, and GPT-2; Ensemble-v2 adds ALBERT, BART, and DeBERTa; and Ensemble-v3 adds DistilRoBERTa to the previous seven models. Table 4 shows that including more individual models in the final ensemble classifier usually yields better generalisation and higher accuracy.

One crucial observation is that, despite adding the additional data for training, the performance of all our models on the validation set decreases. A probable reason for this decline is the sizeable imbalance between the fake and real news counts in the additional data. So, we discard the additional data for our final model training.
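To make the per-model configuration concrete, here is a minimal PyTorch sketch of one base transformer with the MLP head described earlier (a 64-unit fully connected layer, normalization, a tanh-activated linear layer, dropout, and a softmax over the two classes). The class name, the choice of roberta-base, the first-token pooling, and the dropout rate are illustrative assumptions, not the authors' released code.

```python
# Minimal sketch, not the authors' exact implementation.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class TransformerWithMLPHead(nn.Module):             # hypothetical class name
    def __init__(self, model_name="roberta-base", dropout=0.1):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        hidden = self.encoder.config.hidden_size
        self.head = nn.Sequential(
            nn.Linear(hidden, 64),   # fully connected layer, 64 hidden units
            nn.LayerNorm(64),        # normalization layer
            nn.Linear(64, 64),       # linear layer ...
            nn.Tanh(),               # ... with tanh activation
            nn.Dropout(dropout),     # dropout layer
            nn.Linear(64, 2),        # two-class logits
        )
        for m in self.head:          # Xavier initialization of the added layers
            if isinstance(m, nn.Linear):
                nn.init.xavier_uniform_(m.weight)
                nn.init.zeros_(m.bias)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        pooled = out.last_hidden_state[:, 0]              # first-token representation (our pooling choice)
        return torch.softmax(self.head(pooled), dim=-1)   # two-class probabilities

# Example usage (illustrative):
tok = AutoTokenizer.from_pretrained("roberta-base")
batch = tok(["Drinking hot water cures COVID-19."], return_tensors="pt",
            truncation=True, padding=True)
probs = TransformerWithMLPHead()(batch["input_ids"], batch["attention_mask"])
```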
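The stacking step can be sketched in a similar spirit, under our own assumptions about tensor shapes and with one possible reading of the meta-learner layout described in the architecture section; the class, function, and variable names are ours.

```python
# Minimal stacking sketch: each sample is represented by the eight base models'
# predicted probabilities of the "fake" class, and the meta-learner maps this
# 8-dimensional vector to a final two-class prediction.
import torch
import torch.nn as nn

class MetaLearner(nn.Module):                      # hypothetical class name
    def __init__(self, n_base_models=8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_base_models, 64),  # fully connected layer, 64 hidden units
            nn.Tanh(),                     # tanh-activated linear layer
            nn.Linear(64, 128),            # fully connected layer, 128 hidden units
            nn.ReLU(),                     # ReLU-activated linear layer
            nn.Linear(128, 2),             # feeds the softmax classifier
        )

    def forward(self, base_preds):
        return torch.softmax(self.net(base_preds), dim=-1)

def train_meta_learner(base_probs, labels, epochs=50, lr=1e-3):
    """base_probs: (n_samples, 8) tensor of base-model probabilities;
    labels: (n_samples,) tensor of 0/1 class indices."""
    model = MetaLearner(base_probs.shape[1])
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()                # expects raw logits
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = loss_fn(model.net(base_probs), labels)
        loss.backward()
        optimizer.step()
    return model
```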
We test different hyper-parameters, such as the number of layers, the number of units in a layer, the learning rate, weight decay, dropout, and normalization, within a feasible range. Across the models we use learning rates between 1e-3 and 2e-6; a learning rate of 2e-6 is used in most cases since the dataset is not large, and using it helps the models converge easily. For the traditional models, we test different combinations of layers and unit counts and find that the Bidirectional LSTM configuration with 256 layers of 128 hidden units performs best. For the 1D-CNN models, we use a layer with 256 filters and filter sizes of 1-6. All traditional models use a similar structure with little variation. For the transformer-based pre-trained models, we use the base architectures, both because using large models on small datasets can cause over-fitting and because we lack the resources to experiment with larger models.

In this work, we have presented our overall workflow for the fake news detection task. We have conducted a number of experiments and provided a comprehensive solution based on modified transformers with additional layers and an ensemble classifier. Our method achieves competitive accuracy on the test dataset and places us 20th on the leaderboard of the "CONSTRAINT 2021 Shared Tasks: Detecting English COVID-19 Fake News and Hindi Hostile Posts" [15] competition. Based on our experience from this task, we feel that more balanced additional training data may help our model perform better. Besides, extracting more information from social media posts, such as parts of speech, named entities, word counts, punctuation, hashtags, and website links, and using it as metadata during training could further improve the methodology's performance.

References
1. Fake news detection using bi-directional LSTM recurrent neural network
2. Classification and regression trees
3. BERT: Pre-training of deep bidirectional transformers for language understanding
4. Detecting misleading information on COVID-19
5. Comparing support vector machines with Gaussian kernels to radial basis function classifiers
6. DeBERTa: Decoding-enhanced BERT with disentangled attention
7. exBAKE: Automatic fake news detection model based on bidirectional encoder representations from transformers (BERT)
8. Recurrent convolutional neural networks for text classification
9. ALBERT: A lite BERT for self-supervised learning of language representations
10. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension
11. RoBERTa: A robustly optimized BERT pretraining approach
12. Multichannel CNN with attention for text classification
13. Online health monitoring using Facebook advertisement audience estimates in the United States: evaluation study
14. Combining fact extraction and verification with neural semantic matching networks
15. Overview of CONSTRAINT 2021 shared tasks: Detecting English COVID-19 fake news and Hindi hostile posts
16. Fighting an infodemic: COVID-19 fake news dataset
17. GloVe: Global vectors for word representation
18. Deep contextualized word representations
19. Improving language understanding by generative pre-training
20. Language models are unsupervised multitask learners
21. A simple but tough-to-beat baseline for the fake news challenge stance detection task
22. Bidirectional recurrent neural networks
23. FakeCovid - A multilingual cross-domain fact check news dataset for COVID-19
24. XLNet: Generalized autoregressive pretraining for language understanding
25. Hierarchical attention networks for document classification