CIA_NITT at WNUT-2020 Task 2: Classification of COVID-19 Tweets Using Pre-trained Language Models

Yandrapati Prakash Babu and Rajagopal Eswari

2020-09-12

Abstract

This paper presents our models for the WNUT-2020 shared task 2, which involves the identification of COVID-19 related informative tweets. We treat this as a binary text classification problem and experiment with pre-trained language models. Our first model, based on CT-BERT, achieves an F1-score of 88.87%, and our second model, an ensemble of CT-BERT, RoBERTa and SVM, achieves an F1-score of 88.52%.

1 Introduction

As of September 07, 2020, the COVID-19 coronavirus had infected 27.3M people and caused 887K deaths [1]. Real-time updates on the numbers of infected and death cases are given in dashboards, which make use of information from social networking sites like Twitter. As the majority of tweets posted online are uninformative, it is necessary to identify the informative tweets, which carry useful information about recovered, suspected, confirmed and death cases as well as the location or travel history of the cases. The WNUT-2020 shared task 2 involves the identification of such informative tweets, and we treat it as a binary text classification problem.

[1] https://www.worldometers.info/coronavirus/

Prior to 2018, most text classification models were based on Convolutional Neural Networks (CNN) or Recurrent Neural Networks (RNN). These models are shallow in nature and cannot learn richer features from the input. Moreover, as they are trained from scratch, they require a larger number of training instances (Kalyan and Sangeetha, 2020a,b). Recently, pre-trained language models like BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019) have achieved significant improvements on many natural language processing tasks (Qiu et al., 2020).

BERT is a transformer encoder based language model trained on a 16 GB text corpus using the masked language modeling and next sentence prediction objectives. The 16 GB corpus includes 3.5B words from Wikipedia articles and 0.8B words from books. BERT is available in two versions: BERT-base (12 transformer encoder layers with hidden vector size 768) and BERT-large (24 transformer encoder layers with hidden vector size 1024). As BERT models are trained on generic, less noisy text, they may not be effective for noisy text like tweets; moreover, they do not include any domain-specific information. A common strategy to adapt BERT to a specific domain is to further pre-train the model, or to train it from scratch, on domain-specific text.

In this paper, we propose two models to identify informative COVID-19 tweets. The first model is based on COVID-Twitter-BERT (CT-BERT), a BERT-large based model further pre-trained on 160M coronavirus related tweets (Müller et al., 2020). The second model is an ensemble of CT-BERT, RoBERTa and SVM (Islam et al., 2017). As CT-BERT is initialized from BERT-large weights and further pre-trained on COVID tweets, it has two advantages over BERT-large, which is pre-trained on generic, less noisy text: first, CT-BERT includes domain-specific information, and second, CT-BERT can better handle noisy text like tweets. Our CT-BERT based model achieves an F1-score of 88.87% and our ensemble model achieves an F1-score of 88.52%.
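To make the continued pre-training strategy concrete, the sketch below shows how a BERT checkpoint could be further pre-trained on in-domain text with masked language modeling using the Hugging Face transformers library. The placeholder tweets, the bert-large-uncased starting checkpoint, the TweetCorpus wrapper and the training settings are illustrative assumptions; this is not the procedure Müller et al. (2020) used to build CT-BERT.

```python
import torch
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

# Placeholder in-domain corpus; in practice this would be millions of tweets.
domain_texts = [
    "confirmed cases rise in the city as hospitals report new admissions",
    "officials trace the travel history of the suspected case",
]

tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-large-uncased")


class TweetCorpus(torch.utils.data.Dataset):
    """Wraps raw tweet strings as token ids for masked language modeling."""

    def __init__(self, texts):
        self.encodings = tokenizer(texts, truncation=True, max_length=128)

    def __len__(self):
        return len(self.encodings["input_ids"])

    def __getitem__(self, idx):
        return {key: val[idx] for key, val in self.encodings.items()}


# Randomly mask 15% of the tokens, the standard BERT masking rate.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True,
                                           mlm_probability=0.15)
args = TrainingArguments(output_dir="bert-continued-pretraining",
                         per_device_train_batch_size=8, num_train_epochs=1)
Trainer(model=model, args=args, train_dataset=TweetCorpus(domain_texts),
        data_collator=collator).train()
```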
2 Methodology

The dataset contains 20K tweets, each of which is labeled as 0 (uninformative tweet) or 1 (informative tweet). The dataset statistics are reported in Table 1, and the 80%/20% split is reported in Table 2. As tweets are noisy in nature, we apply the following pre-processing steps (a small code sketch is given after the Experiments section):

• remove unnecessary punctuation and non-ASCII characters
• standardize words with repeating characters (e.g., coooool → cool)
• replace emoji characters with their text descriptions (gathered from https://emojipedia.org/)
• replace interjection words with their meanings (e.g., oww → pain)
• replace contractions with their full forms (e.g., I'm → I am)
• replace Twitter slang words with related words (e.g., 2morrow → tomorrow)

We treat the identification of informative tweets as binary text classification. Following the recent trend of using pre-trained language models in NLP, we propose models based on BERT and RoBERTa.

Model-1. This model is based on COVID-Twitter-BERT (CT-BERT). CT-BERT is initialized from BERT-large weights and further pre-trained on 160M coronavirus related tweets. As the task is binary classification, a fully connected sigmoid layer is added on top of CT-BERT, and the entire model (CT-BERT + fully connected sigmoid layer) is fine-tuned on the training dataset. The original tweet is wrapped with the special tokens [CLS] and [SEP] and tokenized with the WordPiece tokenizer. The embedding of each token is obtained as the sum of its WordPiece, position and segment embeddings. A stack of 24 transformer encoder layers is applied to these token embeddings to obtain the final hidden state vectors. Following Devlin et al. (2019), we treat e_t ∈ R^h, the final hidden state vector of the [CLS] token, as the representation of the tweet. Then e_t is passed through the fully connected sigmoid layer to obtain the predicted label p̂ ∈ [0, 1], as shown in Figure 1.

Figure 1: Overview of Model-1.

Model-2. This model is an ensemble of CT-BERT, RoBERTa and TF-IDF with SVM. We use the base version of RoBERTa, and TF-IDF is used to extract the features from the tweets that are fed to the SVM. Each model is trained individually on the training set. For CT-BERT and RoBERTa, a task-specific classifier consisting of a fully connected sigmoid layer is added and the entire model is fine-tuned. The SVM is trained on the TF-IDF vectors of the training tweets with a sigmoid kernel. The final prediction is obtained by averaging the predictions of all three models, as shown in Figure 2.

3 Experiments

The models are officially evaluated using the precision, recall and F1-score metrics. The task organizers provided training and validation sets with labels. We merged the two sets and re-split them into training and validation sets containing 80% and 20% of the instances, respectively. We set the batch size to 32, the learning rate to 3e-5 and the number of epochs to 3 after a random search over the hyperparameter space. All our models are implemented using the transformers library in PyTorch (Wolf et al., 2019).
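The snippet below is a minimal sketch of the pre-processing steps listed in Section 2. The slang, interjection and contraction dictionaries and the regular expressions are illustrative placeholders, not the exact resources used in the paper, and emoji replacement (which could be done with the emoji package's demojize) is omitted to keep the sketch dependency-free.

```python
import re

# Illustrative lookup tables; the paper does not publish its full slang,
# interjection, or contraction dictionaries, so these are placeholders.
SLANG = {"2morrow": "tomorrow", "oww": "pain", "u": "you"}
CONTRACTIONS = {"i'm": "i am", "can't": "cannot", "don't": "do not"}


def preprocess(tweet: str) -> str:
    # 1. drop non-ASCII characters (emoji would normally be converted to their
    #    text descriptions first; omitted in this sketch)
    text = tweet.encode("ascii", "ignore").decode().lower()
    # 2. expand contractions, interjections and slang word by word
    text = " ".join(CONTRACTIONS.get(w, SLANG.get(w, w)) for w in text.split())
    # 3. remove unnecessary punctuation
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    # 4. standardize characters repeated three or more times: "coooool" -> "cool"
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)
    return re.sub(r"\s+", " ", text).strip()


print(preprocess("Coooool!! I'm staying home 2morrow 😷"))
# -> "cool i am staying home tomorrow"
```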
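The following sketch illustrates Model-1: a fully connected sigmoid layer on top of the [CLS] representation of CT-BERT, fine-tuned with the hyperparameters reported above. The Hugging Face checkpoint name, the class and function names, and the training-loop details are assumptions for illustration; the paper only specifies the model components, the batch size, the learning rate and the number of epochs.

```python
import torch
from torch import nn
from transformers import AutoModel, AutoTokenizer

# Assumed Hugging Face checkpoint name for CT-BERT.
CT_BERT = "digitalepidemiologylab/covid-twitter-bert"

tokenizer = AutoTokenizer.from_pretrained(CT_BERT)


class InformativeTweetClassifier(nn.Module):
    """CT-BERT encoder with a fully connected sigmoid layer on the [CLS] vector."""

    def __init__(self, checkpoint=CT_BERT):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(checkpoint)
        self.classifier = nn.Linear(self.encoder.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        cls_vector = hidden[:, 0]  # final hidden state of the [CLS] token
        # predicted probability p in [0, 1]
        return torch.sigmoid(self.classifier(cls_vector)).squeeze(-1)


model = InformativeTweetClassifier()
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)  # learning rate from the paper
loss_fn = nn.BCELoss()


def training_step(tweets, labels):
    """One fine-tuning step; the paper uses batch size 32 and 3 epochs."""
    batch = tokenizer(tweets, padding=True, truncation=True, return_tensors="pt")
    probs = model(batch["input_ids"], batch["attention_mask"])
    loss = loss_fn(probs, torch.tensor(labels, dtype=torch.float))
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```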
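This last sketch illustrates the ensemble of Model-2: the TF-IDF + SVM branch (sigmoid kernel) is trained with scikit-learn, and its probabilities are averaged with the sigmoid outputs of the fine-tuned CT-BERT and RoBERTa classifiers. The 0.5 decision threshold, the function names and the scikit-learn based implementation are assumptions; the paper only states that the final prediction is the average of the three models' predictions.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

# TF-IDF + SVM branch (sigmoid kernel, as in the paper); probability=True makes
# the SVM output class probabilities that can be averaged with the neural models.
vectorizer = TfidfVectorizer()
svm = SVC(kernel="sigmoid", probability=True)


def fit_svm_branch(train_tweets, train_labels):
    """Fit the TF-IDF vectorizer and the SVM on the training tweets."""
    svm.fit(vectorizer.fit_transform(train_tweets), train_labels)


def ensemble_predict(tweets, ct_bert_probs, roberta_probs, threshold=0.5):
    """Average the three models' probabilities and threshold the result.

    ct_bert_probs and roberta_probs are the sigmoid outputs of the fine-tuned
    CT-BERT and RoBERTa classifiers (see the Model-1 sketch above).
    """
    svm_probs = svm.predict_proba(vectorizer.transform(tweets))[:, 1]
    avg = (np.asarray(ct_bert_probs) + np.asarray(roberta_probs) + svm_probs) / 3.0
    return (avg >= threshold).astype(int)  # 1 = informative, 0 = uninformative
```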
4 Related Work

Text classification is one of the core NLP tasks. It involves assigning labels to text sequences such as phrases, sentences or documents, and it has applications in various NLP problems like sentiment analysis, spam classification and abusive text detection (Minaee et al., 2020). The use of deep learning models for text classification started with models like Convolutional Neural Networks and Recurrent Neural Networks (Kim, 2014; Nowak et al., 2017), applied on top of word embeddings. To overcome the issue of out-of-vocabulary (OOV) words, character-level CNNs and RNNs are used (Zhang et al., 2015). As these models are shallow in nature and need to be trained from scratch, they require a larger number of training instances. Recently, with the introduction of deep pre-trained language models like BERT and RoBERTa, there is no need to train the downstream model from scratch: it is enough to add task-specific layers and fine-tune the model for a few epochs (Devlin et al., 2019; Liu et al., 2019).

5 Results

To identify informative tweets related to the coronavirus, we experimented with two models. The first model is based on CT-BERT and the second is an ensemble of CT-BERT, RoBERTa and SVM. The results are reported in Table 3. The CT-BERT based model achieved an F1-score of 88.87% and the ensemble model achieved an F1-score of 88.52%. From Table 3, it is clear that the CT-BERT based model achieved slightly better results than the ensemble model.

6 Conclusion

In this work, we presented our models for identifying COVID-19 related informative tweets. We treated this as a binary text classification problem and proposed two models based on pre-trained language models. Our model based on CT-BERT achieved an F1-score of 88.87%.

References

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding.

Islam et al. 2017. A support vector machine mixed with TF-IDF algorithm to categorize Bengali document.

Kalyan and Sangeetha. 2020. BertMCN: Mapping colloquial phrases to standard medical concepts using BERT and highway network.

Kalyan and Sangeetha. 2020. SECNLP: A survey of embeddings in clinical natural language processing.

Yoon Kim. 2014. Convolutional neural networks for sentence classification.

Yinhan Liu et al. 2019. RoBERTa: A robustly optimized BERT pretraining approach.

Shervin Minaee, Nal Kalchbrenner, Erik Cambria, Narjes Nikzad, Meysam Chenaghlu, and Jianfeng Gao. 2020. Deep learning based text classification: A comprehensive review.

Martin Müller, Marcel Salathé, and Per E. Kummervold. 2020. COVID-Twitter-BERT: A natural language processing model to analyse COVID-19 content on Twitter.

Dat Quoc Nguyen et al. 2020. WNUT-2020 Task 2: Identification of informative COVID-19 English tweets.

Nowak et al. 2017. LSTM recurrent neural networks for short text and sentiment classification.

Xipeng Qiu et al. 2020. Pre-trained models for natural language processing: A survey.

Thomas Wolf et al. 2019. Transformers: State-of-the-art natural language processing.

Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification.