CrisisBERT: a Robust Transformer for Crisis Classification and Contextual Crisis Embedding
Junhua Liu, Trisha Singhal, Lucienne T.M. Blessing, Kristin L. Wood, Kwan Hui Lim
2020-05-11

Classification of crisis events, such as natural disasters, terrorist attacks and pandemics, is a crucial task for creating early signals and informing relevant parties to take spontaneous action and reduce overall damage. Although crises such as natural disasters can be predicted by professional institutions, certain events are first signaled by civilians, as was the case for the recent COVID-19 pandemic. Social media platforms such as Twitter often expose firsthand signals of such crises through high-volume information exchange, with over half a billion tweets posted daily. Prior works proposed various crisis embeddings and classifiers using conventional Machine Learning and Neural Network models; however, none performed crisis embedding and classification using state-of-the-art attention-based deep neural network models, such as Transformers, or document-level contextual embeddings. This work proposes CrisisBERT, an end-to-end transformer-based model for two crisis classification tasks, namely crisis detection and crisis recognition, which shows promising results in both accuracy and F1 score. The proposed model also demonstrates superior robustness over the benchmarks, showing only a marginal performance compromise when extended from 6 to 36 events with only 51.4% additional data points. We also propose Crisis2Vec, an attention-based, document-level contextual embedding architecture for crisis embedding, which achieves better performance than conventional crisis embedding methods such as Word2Vec and GloVe. To the best of our knowledge, this work is the first in the literature to propose transformer-based crisis classification and document-level contextual crisis embedding.

Crisis-related events, such as earthquakes, hurricanes and train or airliner accidents, often stimulate a sudden surge of attention and action from both the media and the general public. Despite the fact that crises such as natural disasters can be predicted by professional institutions, certain events are first signaled by everyday citizens, i.e., civilians. For instance, the recent COVID-19 pandemic was first reported by the general public in China via Weibo, a popular social media site, before pronouncements by government officials. Social media sites have become centralized hubs that facilitate timely information exchange across government agencies, enterprises, working professionals and the general public. As one of the most popular social media sites, Twitter enables users to asynchronously communicate and exchange information with tweets, which are mini-blog posts limited to 280 characters. On average, over half a billion tweets are posted daily [1]. One can therefore leverage such high-volume and frequent information exchange to expose firsthand signals of crisis-related events, enabling early detection and warning systems that reduce overall damage and negative impact. Event detection from tweets has received significant research attention as a means to analyze crisis-related messages for better disaster management and increased situational awareness. Several recent works studied various natural crisis events, such as hurricanes and earthquakes, as well as man-made disasters, such as terrorist attacks and explosions [2, 3, 4, 5].
These works focus on binary classification of various crisis attributes, such as classifying source type, predicting the relatedness between tweets and crises, and assessing informativeness and applicability [6, 7, 8]. Other works proposed multi-label classifiers covering affected individuals, infrastructure, casualties, donations, caution and advice, among others [9, 10]. Crisis recognition tasks, such as identifying crisis types (e.g., hurricanes, floods and fires), have likewise been conducted [11, 12]. Machine Learning models are commonly used to perform the above-mentioned tasks. Conventional linear models such as Logistic Regression, Naive Bayes and Support Vector Machines (SVM) have been reported for automatic binary classification of informativeness [13] and relevancy [8], among others. These models were implemented with pre-trained word2vec embeddings [14]. Several unsupervised approaches have also been proposed for classifying crisis-related events, such as the CLUSTOP algorithm, which utilizes Community Detection for automatic topic modelling [15]. A transfer-learning approach has also been proposed [16], though its classification is limited to only two classes, leaving its ability to generalize across crises questionable. More recently, numerous works proposed Neural Network (NN) models for crisis-related data detection and classification. For instance, ALRashdi and O'Keefe investigated two deep learning architectures, namely Bidirectional Long Short-Term Memory (BiLSTM) and Convolutional Neural Networks (CNN), using domain-specific and GloVe embeddings [17]. Nguyen et al. proposed a CNN-based classifier with Word2Vec embeddings pre-trained on Google News [14] and domain-specific embeddings [18]. Lastly, a parallel CNN architecture was proposed to detect disaster-related events using tweets [19, 20]. While prior works report remarkable performance on various crisis classification tasks using NN models and word embeddings, no studies have leveraged the most recent Natural Language Understanding (NLU) techniques, such as attention-based deep classification models [21] and document-level contextual embeddings [22], which reportedly improve state-of-the-art performance on many challenging natural language problems, from upstream tasks such as Named Entity Recognition and Part-of-Speech Tagging to downstream tasks such as Machine Translation and Neural Conversation. This work focuses on deep attention-based classification models and document-level contextual representation models to address two important crisis classification tasks. We study recent NLU models and techniques that have reportedly demonstrated drastic improvements over the state of the art, and adapt them to domain-specific crisis-related tasks. Overall, the main contributions of this work include:
• proposing CrisisBERT, an attention-based classifier that improves state-of-the-art performance for both crisis detection and recognition tasks;
• demonstrating superior robustness over various benchmarks, where extending CrisisBERT from 6 to 36 events with only 51.4% additional data points results in only a marginal performance decline, while increasing the number of crisis classes by 500%;
• proposing Crisis2Vec, a document-level contextual embedding approach for crisis representation, and showing substantial improvement over conventional crisis embedding methods such as Word2Vec and GloVe.
To the best of our knowledge, this work is the first to propose a transformer-based classifier for crisis classification tasks.
We are also the first to propose a document-level contextual crisis embedding approach. In this section, we discuss recent works that propose various machine learning approaches for crisis classification tasks. While these works report substantial performance improvements over prior works, none of them uses state-of-the-art attention-based models, i.e., Transformers [21], to perform crisis classification tasks. We propose CrisisBERT, a transformer-based architecture that builds upon a distilled BERT model and is fine-tuned via a large-scale hyper-parameter search. Various works propose linear classifiers for crisis-related events. For instance, Parilla-Ferrer et al. proposed automatic binary classification of informative versus uninformative tweets using Naive Bayes and Support Vector Machines (SVM) [13]. An SVM approach with pre-trained word2vec embeddings was also proposed [14]. Besides linear models, recent works also propose deep learning methods with different neural network architectures. For instance, ALRashdi and O'Keefe investigated Bidirectional Long Short-Term Memory (BiLSTM) and Convolutional Neural Network (CNN) models using domain-specific and GloVe embeddings [23]. Nguyen et al. proposed a CNN model that classifies tweets into information types using Google News and domain-specific embeddings [18].

[Figure: CrisisBERT architecture. Tweets are tokenized and passed into a DistilBERT model. Since we perform a classification task, the [CLS] token vector, i.e., the first output vector, is passed into a linear classifier for the detection or recognition task, while the remaining output vectors are average-pooled to create Crisis2Vec embeddings.]

In 2017, Vaswani et al. from Google introduced the Transformer [21], a new category of deep learning models that are solely attention-based, without convolutional or recurrent mechanisms. Later, Google proposed the Bidirectional Encoder Representations from Transformers (BERT) model [24], which drastically improved state-of-the-art performance on multiple challenging Natural Language Processing (NLP) tasks. Since then, multiple transformer-based models have been introduced, such as GPT [25] and XLNet [26], among others. Transformer-based models have also been deployed to solve domain-specific tasks, such as medical text inference [27] and occupational title embedding [28], and have demonstrated remarkable performance. BERT, for instance, is a multi-layer bidirectional Transformer encoder with an attention mechanism [24]. The BERT model has two variants, namely (a) BERT Base, which has 12 transformer layers, a hidden size of 768, 12 attention heads, and 110M total parameters; and (b) BERT Large, which has 24 transformer layers, a hidden size of 1024, 16 attention heads, and 340M total parameters. BERT is pre-trained with self-supervised approaches, i.e., Masked Language Modeling (MLM) and Next Sentence Prediction (NSP). While Transformers such as BERT are reported to perform well on natural language processing, understanding and inference tasks, to the best of our knowledge, no prior works propose and examine the performance of transformer-based models for crisis classification. In this work, we investigate the transformer approach for crisis classification tasks and propose CrisisBERT, a transformer-based classification model that surpasses conventional linear and deep learning models in performance and robustness. We fine-tune CrisisBERT via a large-scale hyper-parameter search over transformer variants, including distilled models [29], as well as optimizers, learning rates, and batch sizes; a minimal sketch of such a search loop follows.
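The sketch below illustrates one plausible random search loop over these dimensions. It is not the authors' actual training harness: the candidate values (other than the final reported choices of AdamW, a learning rate of 5e-5 and a batch size of 32) are illustrative, and train_and_evaluate is a hypothetical stand-in for a real fine-tuning run.

```python
import itertools
import random

# Illustrative search space; the dimensions mirror those named in the paper,
# but only AdamW, lr=5e-5 and batch size 32 are confirmed final choices.
SEARCH_SPACE = {
    "model": ["bert-base-uncased", "distilbert-base-uncased"],
    "optimizer": ["adam", "adamw"],
    "learning_rate": [2e-5, 3e-5, 5e-5],
    "batch_size": [16, 32, 64],
}

def train_and_evaluate(config: dict, epochs: int = 3) -> float:
    """Hypothetical stand-in: fine-tune the configured transformer for
    `epochs` epochs and return validation F1. Replace with a real loop."""
    return random.random()

keys, values = zip(*SEARCH_SPACE.items())
candidates = [dict(zip(keys, combo)) for combo in itertools.product(*values)]
random.shuffle(candidates)  # parameter sets are "randomly chosen", per the paper

best_score, best_config = -1.0, None
for config in candidates:
    # Each configuration is run for 3 epochs and two trials, as described.
    score = max(train_and_evaluate(config) for _ in range(2))
    if score > best_score:
        best_score, best_config = score, config
print(best_config, best_score)
```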
Table 1 shows the breakdown of the search space and the final hyper-parameters for CrisisBERT. Each parameter set is randomly chosen and run for 3 epochs over two trials. In total, we evaluate over 300 hyper-parameter sets on an Nvidia Titan-X (Pascal), for over 1,000 GPU hours. Taking the performance-efficiency trade-off into consideration, we select the DistilBERT model for our Transformer LM layer. DistilBERT is a compressed version of BERT Base obtained through Knowledge Distillation. Using only 50% of the layers of BERT, DistilBERT runs 60% faster while preserving 97% of BERT's language understanding capabilities. The optimal set of hyper-parameters for DistilBERT includes an AdamW [30] optimizer, an initial learning rate of 5e-5, and a batch size of 32.

Output Layer. The output layer of the DistilBERT LM is a sequence of 768-dimensional vectors, led by the class header ([CLS]) vector. Since we are conducting classification tasks, only the [CLS] token vector is used as the aggregate sequence representation for classification with a linear classifier. The remainder of the output vectors are processed into Crisis2Vec embeddings using a mean-pooling operation. As discussed in Section 2.3, the Crisis2Vec embedding is a byproduct of CrisisBERT: the embeddings are constructed from a pre-trained BERT model and subsequently fine-tuned on three corpora of crisis-related tweets [6, 31, 32] to make them domain-specific for crisis-related tweet representation. Crisis2Vec leverages the advantages of Transformers, including (1) a self-attention mechanism that incorporates sentence-level context bidirectionally, (2) both word-level and positional information for creating contextual representations of words, and (3) models pre-trained on large relevant corpora. To the best of our knowledge, we are the first to propose a document-level contextual embedding approach for crisis-related document representation. Upon convergence, we construct the fixed-length tweet vector using a mean-pooling strategy [22], where we compute the mean of the output vectors, as illustrated in Algorithm 1.
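As a concrete illustration, below is a minimal sketch of the classification head and the mean-pooling step using the Hugging Face transformers API. This is our reconstruction, not the authors' released code; note also that the figure pools the output vectors other than [CLS], while Algorithm 1 is described as averaging all output vectors, so the sketch follows the figure and excludes the [CLS] position. The example tweet and the 36-class head are illustrative.

```python
import torch
from transformers import DistilBertModel, DistilBertTokenizerFast

tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
encoder = DistilBertModel.from_pretrained("distilbert-base-uncased")
classifier = torch.nn.Linear(encoder.config.dim, 36)  # e.g., 36 classes for C36

tweets = ["Flood waters are rising fast near the river, stay safe everyone"]
batch = tokenizer(tweets, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    hidden = encoder(**batch).last_hidden_state  # (batch, seq_len, 768)

# Detection/recognition head: only the [CLS] vector (first position) is used.
logits = classifier(hidden[:, 0])

# Crisis2Vec: mean-pool the remaining output vectors, ignoring padding tokens.
mask = batch["attention_mask"][:, 1:].unsqueeze(-1).float()
crisis2vec = (hidden[:, 1:] * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
print(logits.shape, crisis2vec.shape)  # (1, 36) and (1, 768)
```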
In this work, we conduct two crisis classification tasks, namely Crisis Detection and Crisis Recognition. We formulate Crisis Detection as a binary classification task that identifies whether a tweet is relevant to a crisis-related event. The Crisis Recognition task extends the problem to multi-class classification, where the output is a probability vector indicating the likelihood of a tweet referring to specific events. Both tasks are modelled as Sequence Classification problems, formally defined below. We define the Crisis Detection task D = (S, Φ), which is specified by a finite sample space of tweets S = {s_1, ..., s_n} of size n. Each sample s_i is a sequence of tokens over T time steps, i.e., s_i = {s_i^1, ..., s_i^T}. Φ denotes the set of labels, in the same order as the sample set, Φ = {φ_1, ..., φ_n}, with φ_i ∈ {0, 1}, where φ_i = 1 indicates that sample s_i is relevant to a crisis and φ_i = 0 indicates otherwise. A deterministic classifier C_D : S → Φ specifies the mapping from sample tweets to their flags. Our objective is to train a crisis detector on the provided tweets and labels that minimizes the difference between predicted and true labels, i.e.,

C_D^* = argmin_{C_D} J_D(C_D(S), Φ),

where J_D denotes some cost function. Similarly, we define the Crisis Recognition task R = (S, L), where the sample space S is identical to that of Crisis Detection. L denotes the sequence of multi-class labels, in the same order as S, i.e., L = {l_1, ..., l_n}, with l_i ∈ R^m, where m is the number of classes. A deterministic classifier C_R : S → L specifies the mapping from sample tweets to crisis classes. The objective of the crisis classification task is to train a sequence classifier on the provided tweets and labels that minimizes the difference between predicted and true labels, i.e.,

C_R^* = argmin_{C_R} J_R(C_R(S), L),

where J_R denotes some cost function for classifier C_R. In this section, we discuss the experiments performed and their results, with the aim of establishing a highly effective and efficient approach for text classification. Three datasets of labelled crisis-related tweets [6, 31, 32] are used to conduct the crisis classification tasks and evaluate the proposed methods against benchmarks. In total, these datasets contain close to 8 million tweets, of which 91.6k are labelled. The datasets take the form of: (1) 60k labelled tweets on 6 crises [6], (2) 3.6k labelled tweets on 8 crises [32], and (3) 27.9k labelled tweets on 26 crises [31]. Table 2 provides more details about each dataset and its respective classes. For our experimental evaluation, the 91.6k labelled crisis-related tweets are organized into two datasets, denoted C6 and C36. In particular, C6 consists of 60k tweets from 6 classes of crises, whereas C36 comprises all 91.6k tweets across 36 classes. Both datasets are split into training, validation and test sets consisting of 90%, 5% and 5% of the original sets, respectively.

CrisisBERT. We evaluate the performance of CrisisBERT against multiple benchmarks, comprising recently proposed crisis classification models from the literature. These include linear classifiers, such as Logistic Regression (LR), Support Vector Machines (SVM) and Naive Bayes [33], and non-linear neural networks, such as Convolutional Neural Networks (CNN) [20] and Long Short-Term Memory (LSTM) networks [34]. Furthermore, we investigate the robustness of CrisisBERT on both detection and recognition tasks. This is achieved by extending the experiments from C6 to C36, which comprise 6 and 36 classes respectively, with only 51.4% additional data points. We evaluate the robustness of the proposed models against the benchmarks by observing the performance compromise incurred as the classification problem grows drastically harder. As described in Section 2.3, we use the optimal set of hyper-parameters for CrisisBERT in the experiments, which includes a BERT model with distillation (i.e., DistilBERT), an AdamW [30] optimizer, an initial learning rate of 5e-5, a batch size of 32, and a word dropout rate of 0.25.

Crisis2Vec. To evaluate Crisis2Vec, we choose two classifiers that represent the traditional Machine Learning and NN approaches respectively: (1) a linear Logistic Regression model, denoted LR_c2v, and (2) a non-linear LSTM model, denoted LSTM_c2v. We evaluate the performance of Crisis2Vec with these two models by replacing their original embeddings with Crisis2Vec, ceteris paribus. We use two common evaluation metrics, namely Accuracy and F1 score, which are functions of the True-Positive (TP), False-Positive (FP), True-Negative (TN) and False-Negative (FN) predictions. Accuracy is calculated as:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

For the F1-score over multiple classes, we calculate the unweighted mean of the per-label F1 scores, i.e., for n classes:

F1_macro = (1/n) Σ_{i=1}^{n} (2 · Precision_i · Recall_i) / (Precision_i + Recall_i)

where Precision = TP / (TP + FP) and Recall = TP / (TP + FN).
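For clarity, the following is a short sketch of these two metrics in plain Python, computing per-class precision and recall from TP/FP/FN counts and averaging F1 uniformly across classes; the label values in the usage example are illustrative.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred):
    """Unweighted (macro) mean of per-class F1 scores."""
    f1_scores = []
    for c in set(y_true) | set(y_pred):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1_scores.append(2 * precision * recall / (precision + recall)
                         if precision + recall else 0.0)
    return sum(f1_scores) / len(f1_scores)

# Illustrative labels: 0 = not crisis-related, 1 = crisis-related.
y_true = [1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1]
print(accuracy(y_true, y_pred))  # 0.666...
print(macro_f1(y_true, y_pred))
```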
We select and implement several crisis classifiers proposed in recent works to serve as benchmarks for evaluating our proposed methods. Concretely, we compare CrisisBERT with the following models (a sketch of the CNN benchmark follows the list):
• LR_w2v: Logistic Regression model with Word2Vec embeddings pre-trained on the Google News corpus [33]
• SVM_w2v: Support Vector Machine model with Word2Vec embeddings pre-trained on the Google News corpus [33]
• NB_w2v: Naive Bayes model assuming a Gaussian distribution of features, with Word2Vec embeddings pre-trained on the Google News corpus [33]
• CNN_gv: Convolutional Neural Network model with 2 convolutional layers of 128 hidden units, a kernel size of 3, a pool size of 2, 250 filters, and GloVe word embeddings [20]
• LSTM_w2v: Long Short-Term Memory model with 2 layers of 30 hidden states and a Word2Vec-based crisis embedding [34]
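The listed CNN_gv configuration leaves some ambiguity about how the 128 hidden units relate to the 250 filters, so the following PyTorch sketch is only one plausible reading: the 250 filters serve as the convolutional channels and the 128 units as a dense layer, over a hypothetical 100-dimensional GloVe input. It is not the benchmark authors' implementation.

```python
import torch
import torch.nn as nn

class BenchmarkCNN(nn.Module):
    """One plausible reading of CNN_gv: two Conv1d blocks over GloVe
    vectors, each with 250 filters and kernel size 3, followed by a
    128-unit dense layer and a linear output over the crisis classes."""
    def __init__(self, embed_dim=100, num_classes=6, filters=250,
                 kernel_size=3, pool_size=2, hidden=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(embed_dim, filters, kernel_size), nn.ReLU(),
            nn.MaxPool1d(pool_size),
            nn.Conv1d(filters, filters, kernel_size), nn.ReLU(),
            nn.AdaptiveMaxPool1d(1),
        )
        self.head = nn.Sequential(
            nn.Flatten(), nn.Linear(filters, hidden), nn.ReLU(),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, x):  # x: (batch, seq_len, embed_dim) GloVe vectors
        return self.head(self.conv(x.transpose(1, 2)))

model = BenchmarkCNN()
logits = model(torch.randn(8, 40, 100))  # 8 tweets, 40 tokens each
print(logits.shape)                      # torch.Size([8, 6])
```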
Overall, the experimental results show that both proposed models achieve significant improvements in performance and robustness over the benchmarks across all tasks. The experimental results for CrisisBERT and Crisis2Vec are tabulated in Table 4.

Robustness. Comparing the Crisis Detection task between C6 and C36, CrisisBERT shows declines of 1.7% and 1.3% in F1-score and Accuracy, which is better than most benchmarks (declines of 1.7% to 6.3%), with the exception of CNN. When we compare the more challenging Crisis Recognition task between C6 and C36, however, the performance of CrisisBERT is compromised only marginally, i.e., 1.6% in F1-score and 0.7% in Accuracy. In contrast, all benchmark models record significant declines of 6.0% to 67.2%.

Discussion. Based on the experimental results discussed above, we observe that: (1) CrisisBERT exceeds state-of-the-art performance on both detection and recognition tasks, by up to 8.2% and 25.0% respectively; (2) CrisisBERT demonstrates higher robustness, with only a marginal performance decline (less than 1.7% in F1-score and Accuracy); and (3) Crisis2Vec shows superior performance compared to conventional Word2Vec embeddings, for both the LR and LSTM models across all experiments.

5 Related Work

Event detection from tweets has received significant research attention as a means to analyze crisis-related messages for better disaster management and increased situational awareness [4, 5, 2, 3]. Parilla-Ferrer et al. proposed automatic binary classification of informativeness using Naive Bayes and Support Vector Machines (SVM) [13]. Stowe et al. presented an annotation scheme for classifying tweets by relevancy and six fine-grained categories [8]. Furthermore, the use of pre-trained word2vec embeddings reportedly improved SVM performance for crisis classification [14]. Lim et al. proposed the CLUSTOP algorithm, which utilizes a community detection approach for automatic topic modelling [15]. Pedrood et al. proposed transfer-learning the classification of one event to another using a sparse coding model [16], though its scope was limited to only two events, i.e., Hurricane Sandy (2012) and Super Typhoon Yolanda (2013). A substantial number of works focuses on using Neural Networks (NN) with word embeddings for crisis-related data classification. Manna et al. compared NN models with conventional ML classifiers [33]. ALRashdi and O'Keefe investigated two deep learning architectures, namely Bidirectional Long Short-Term Memory (BiLSTM) and Convolutional Neural Networks (CNN), with domain-specific and GloVe embeddings, and showed good performance [17]; however, the study did not validate the models on a different crisis type. Nguyen et al. applied a CNN to classify information types using Google News and domain-specific embeddings [18]. Kersten et al. [20] implemented a parallel CNN, an architecture proposed earlier by Kim et al. [19], to detect two disasters, namely hurricanes and floods, reporting an F1-score of 0.83. Word-level embeddings such as Word2Vec [35] and GloVe [23] are commonly used to form the basis of crisis embeddings [18, 36] in various crisis classification works to improve model performance. For context, Word2Vec uses a Neural Network Language Model (NNLM) that represents latent information at the word level. GloVe achieved better results with a simpler approach, constructing global vectors that represent contextual knowledge of the vocabulary. More recently, a series of high-quality embedding models, such as FastText [37] and Flair [38], have been proposed and reported to improve the state of the art on multiple NLP tasks. These works commonly use both word-level contextualization and character-level features. Models pre-trained on large corpora of news and tweet collections have also been made publicly available to assist downstream tasks. Furthermore, Transformer-based models have been proposed for sentence-level embedding tasks [22].

Social media sites such as Twitter have become hubs of crowd-generated information for early crisis detection and recognition. In this work, we presented CrisisBERT, a transformer-based crisis classification model, and Crisis2Vec, a contextual crisis-related tweet embedding model. We examined the performance and robustness of the proposed models through experiments on three datasets and two crisis classification tasks. Experimental results show that CrisisBERT improves the state of the art for both detection and recognition tasks, and further demonstrates robustness when extended from 6 to 36 classes with only 51.4% additional data points. Finally, our experiments with two classification models show that Crisis2Vec enhances classification performance compared to the Word2Vec embeddings commonly used in prior works.
References

Natural disasters detection in social media and satellite imagery: a survey
Situational awareness enhanced through social media analytics: a survey of first responders
Processing social media messages in mass emergency: a survey
Earthquake shakes Twitter users: real-time event detection by social sensors
CrisisLex: a lexicon for collecting and filtering microblogged communications in crises
Semi-supervised discovery of informative tweets during the emerging disasters
Identifying and categorizing disaster-related tweets
Extracting information nuggets from disaster-related messages in social media
Online public communications by police & fire services during the 2012 Hurricane Sandy
On semantics and deep learning for event detection in crisis situations
Verifying baselines for crisis event information classification on Twitter
Automatic classification of disaster-related tweets
Distributed representations of words and phrases and their compositionality
CLUSTOP: a clustering-based topic modelling algorithm for Twitter using word networks
Mining help intent on Twitter during disasters via transfer learning with sparse coding
Deep learning and word embeddings for tweet classification for crisis response
Robust classification of crisis-related data on social networks using convolutional neural networks
Convolutional neural networks for sentence classification
Robust filtering of crisis-related tweets
Attention is all you need
Sentence-BERT: sentence embeddings using Siamese BERT-networks
GloVe: global vectors for word representation
BERT: pre-training of deep bidirectional transformers for language understanding
Language models are unsupervised multitask learners
XLNet: generalized autoregressive pretraining for language understanding
NCUEE at MEDIQA 2019: medical text inference using ensemble BERT-BiLSTM-Attention model
IPOD: an industrial and professional occupations dataset and its applications to occupational data mining and analysis
Distilling the knowledge in a neural network
What to expect when the unexpected happens: social media communications across crises
Analysing how people orient to and spread rumours in social media by looking at conversational threads
Effectiveness of word embeddings on classifiers: a case study with tweets
A deep multi-modal neural network for informative Twitter content classification during emergencies
A neural probabilistic language model
Applications of online deep learning for crisis response using social media information
Enriching word vectors with subword information
Contextual string embeddings for sequence labeling