key: cord-0208072-houc3327
title: On Informative Tweet Identification For Tracking Mass Events
authors: Joao, Renato Stoffalette (https://orcid.org/0000-0003-4929-4524)
date: 2021-01-14
doc_id: 208072
cord_uid: houc3327

Abstract. Twitter has been heavily used as an important channel for communicating and discussing events in real time. During such major events, however, many uninformative tweets are also published rapidly by many users, making it hard to follow the events. In this paper, we address this problem by investigating machine learning methods for automatically identifying informative tweets among those that are relevant to a target event. We examine both traditional approaches with a rich set of handcrafted features and state-of-the-art approaches with automatically learned features. We further propose a hybrid model that leverages both the handcrafted features and the automatically learned ones. Our experiments on several large datasets of real-world events show that the latter approaches significantly outperform the former and that our proposed model performs best, suggesting highly effective mechanisms for tracking mass events.

1 Introduction

Lately, Twitter has become an important channel for communication and information broadcasting. A large number of its users use the platform to seek and share information about events. In particular, during undesired mass events such as natural disasters or terrorist attacks, Twitter users post tweets, share updates, and inform other users about the current situation. However, in addition to this information, many tweets merely discuss or express opinions and emotions towards the events, which makes it challenging for professionals involved in crisis management to collect the relevant information needed to better understand the situation and respond more rapidly (Vieweg et al., 2010). Given the large volume of tweets published by Twitter users, manually sifting for useful information is inherently impractical (Meier, 2013). Automatic mechanisms for identifying informative tweets are therefore required, to help not only the average citizen become aware of the situation but also professionals take measures immediately and potentially save lives.

In this work, we investigate the viability of machine learning approaches for developing such an automatic mechanism. We study both traditional approaches that use handcrafted features and the state-of-the-art representation learning approach, BERT-based models (Devlin et al., 2019), to classify tweets according to their informativeness. We implement a rich set of features for the former, and examine different usages of the latter as well as combinations of both. Furthermore, we propose a hybrid model that leverages both the BERT-based models and the handcrafted features. We evaluate all these models on large datasets collected during several natural and man-made disasters. In summary, we make the following contributions.

• We investigate a rich set of features that includes Bag-of-Words, text-based, and user-based features for traditional models, and examine the performance of BERT-based models on the informative tweet classification problem.
• We further propose a hybrid model that combines a BERT-based model with handcrafted features for the problem.
• We conduct comprehensive experiments to evaluate the performance of these diverse models.
• Empirically, we demonstrate that deep BERT-based models outperform the traditional ones on this task without requiring complicated feature engineering, while our proposed model performs best.

The remainder of this paper is organized as follows. We first review related work in Section 2, then describe the methods and features in Section 3. Section 4 describes our experiments and datasets and gives details about our implementation; in Section 4.4 we report the results of our experiments. Finally, we draw some conclusions and point out future directions in Section 5.

2 Related Work

Social media platforms such as Twitter and Facebook have become valuable communication channels over the years. Twitter enables people to share all kinds of information by posting short text messages, called tweets. Although social media services are full of conversational messages, they are also environments where users post newsworthy information related to natural or human-induced disasters. Identifying such information can help not only the ordinary citizen; it can also assist professionals and organizations in coordinating their response, potentially saving lives and diminishing catastrophic losses (Imran et al., 2015).

A number of automated systems have been proposed to extract and classify crisis-related information from social media channels, for example CrisisTracker (Rogstadius et al., 2013), Twitcident (Abel et al., 2012), and AIDR (Imran et al., 2014), among others. For a more complete list of systems, please refer to the survey by Imran et al. (Imran et al., 2015).

Machine learning and natural language processing play an important role in automatically classifying crisis-related tweets, and the approach used to extract textual features can determine the performance of an automated classifier. Castillo et al. (Castillo et al., 2011) proposed automatic techniques to assess the credibility of tweets related to specific topics or events, using features extracted from users' posting behavior and the tweets' text. Verma et al. (Verma et al., 2011) used Naive Bayes and MaxEnt classifiers to find situational awareness tweets from several crises, and Cameron et al. (Cameron et al., 2012) described a platform for emergency situation awareness in which interesting tweets are classified with an SVM classifier. With the recent advances in natural language processing and the emergence of techniques such as word2vec (Mikolov et al., 2013a; Mikolov et al., 2013b) and GloVe (Pennington et al., 2014), deep neural networks have been applied successfully to similar tasks. Caragea et al. (Caragea et al., 2016), for instance, demonstrated that convolutional neural networks outperform traditional classifiers in tweet classification, and Nguyen et al. (Nguyen et al., 2017) also used a convolutional neural network based model to classify crisis-relevant tweets. These results suggest a promising direction for the informative tweet classification task.

3 Methods

Identifying informative tweets is a critical task, particularly during catastrophic events. There are, however, no simple rules that can be applied to the task. We therefore approach the problem of informative tweet identification as a supervised learning problem. In the following subsections, we discuss several models for the task. We start with some conventional classification models that make use of features engineered from the tweets as well as from the users who posted them.
Next, we present the deep learning approaches for the task and describe our proposed model.

Several machine learning approaches have been proposed for the task of automatically detecting crisis-related tweets, for example Naive Bayes (Li et al., 2018), Support Vector Machines (Caragea et al., 2016), and Random Forests (Kaufhold et al., 2020). As baselines, we therefore trained the following traditional classifiers to automatically classify a tweet as either Informative or Not Informative.

• LOGISTIC REGRESSION (LR): a classifier that models the probability of a label based on a set of independent features.
• DECISION TREE (DT): a classifier that successively divides the feature space to maximise a given metric (e.g., information gain).
• RANDOM FOREST (RF): a classifier that utilises an ensemble of uncorrelated decision trees.
• NAIVE BAYES (NB): a Gaussian Naive Bayes classifier.
• MULTILAYER PERCEPTRON (MLP): a network of linear classifiers (perceptrons) trained with the backpropagation technique.
• SUPPORT VECTOR MACHINE (SVM): a discriminative classifier formally defined by a separating hyperplane.

All the classifiers deployed in this work were implemented in Python using the machine learning library scikit-learn (Pedregosa et al., 2011). The source code of our model implementations is freely available at https://github.com/renatosjoao/infotweets.git.

Inspired by previous work, we investigated a set of features based on the tweets' contents as well as on the users who posted the tweets (Acerbo and Rossi, 2017; Graf et al., 2018; Imran et al., 2013; Verma et al., 2011). These features are described as follows (a sketch of their computation is given after the list).

• Text-based features, calculated from the content of a tweet:
- n_chars: the number of characters the tweet contains.
- n_words: the number of words the tweet contains, counted after removing symbols and patterns.
• User-based features, calculated from the profile of the user who posted the tweet:
- n_followers: the number of accounts following the user; since this number may vary considerably, we calculate it as log10(n_followers + 1).
- n_followees: the number of accounts the user follows, calculated as log10(n_followees + 1).
- n_tweets: the total number of tweets posted by the user; since some users post rarely while influential users post very frequently, we calculate this feature as log10(n_tweets + 1).
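To make the feature set concrete, the following is a minimal Python sketch of how these features could be computed and fed to one of the scikit-learn classifiers listed above. The function name, the word-splitting regex, and the toy tweets are our illustrative assumptions, not taken from the paper's released code.

```python
import math
import re

from sklearn.linear_model import LogisticRegression

def handcrafted_features(text, n_followers, n_followees, n_tweets):
    """Compute the text- and user-based features described above."""
    words = re.findall(r"[A-Za-z']+", text)   # crude word split after stripping symbols
    return [
        len(text),                            # n_chars
        len(words),                           # n_words
        math.log10(n_followers + 1),          # log-scaled follower count
        math.log10(n_followees + 1),          # log-scaled followee count
        math.log10(n_tweets + 1),             # log-scaled number of tweets by the user
    ]

# Toy example: train one of the six conventional classifiers on these vectors.
X = [
    handcrafted_features("Flood waters rising near Main St, bridge closed", 1500, 300, 12000),
    handcrafted_features("so bored today lol", 80, 90, 5000),
]
y = [1, 0]  # 1 = Informative, 0 = Not Informative
clf = LogisticRegression().fit(X, y)
print(clf.predict(X))
```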
We now discuss deep learning based approaches, which are widely used in recent work (Nguyen et al., 2017; Neppalli et al., 2018). Traditional representations such as Bag-of-Words treat each word as a separate feature and do not capture the meaning of words well. Word embeddings are widely used neural models that map words to real-valued vectors such that similar words are close to each other in the vector space. Word embeddings capture the semantic and syntactic information of words by taking the surrounding context into consideration. In this work, we examine the following typical word embedding methods.

• Word2vec (Mikolov et al., 2013b) is a well-known neural word embedding method, initially proposed in two variants: (i) a Bag-of-Words model that predicts the current word based on its context words, and (ii) a skip-gram model that predicts the surrounding words given the current word.
• GloVe (Pennington et al., 2014) is a method for efficiently learning word vectors that uses global corpus statistics for word representations and learns the embeddings by dimensionality reduction of the word co-occurrence count matrix.
• Fasttext (Bojanowski et al., 2016) is an extension of the skip-gram model from the original Word2vec that takes subword information into account: it learns representations for character n-grams and represents a word as the sum of its n-gram vectors, thereby capturing morphological characteristics of words.

We make use of publicly available pre-trained word vectors for the above models. The feature vector of each tweet is then obtained by averaging the embedding vectors of its words.

Generalizing from word embeddings, text embedding methods compute a vector for a group of words taken collectively as a single unit, e.g., a sentence, a paragraph, or a whole document. In this work, we examine a typical method for text embedding, namely Doc2vec, and state-of-the-art ones, namely BERT-based models. Doc2vec is a generalization of the Word2vec model (Mikolov et al., 2013b); its main objective is to convert a sentence (or paragraph) into a vector, generating efficient, high-quality distributed vectors for a complete document.

BERT is a model built on a multi-layer bidirectional Transformer encoder (Vaswani et al., 2017; Devlin et al., 2019). It makes use of an attention mechanism that learns contextual relations between words in texts. In its generic form, the Transformer includes two separate mechanisms: an encoder that reads the input text and a decoder that produces the task prediction. The encoder is composed of a stack of multiple layers, each with two sub-layers: the first is a multi-head self-attention mechanism, and the second is a simple, position-wise fully connected feed-forward network. The decoder is also composed of a stack of multiple identical layers, with the addition of a third sub-layer that performs multi-head attention over the output of the encoder stack. One key component of the Transformer encoder is the multi-head self-attention layer, i.e., a function that can be formulated as querying a dictionary with key-value pairs.

The most straightforward usage of BERT is to employ it as a black box for feature engineering, combining the default BERT model with conventional classifiers: the final hidden state of the first token ([CLS]) from BERT is taken as the encoded sentence representation and fed to conventional classifiers for the prediction task, as sketched below.
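As an illustration of this black-box usage, here is a minimal sketch that extracts the [CLS] sentence vector using the Hugging Face transformers library (the paper itself used the older pytorch-pretrained-BERT repository); the model name and the helper function are our assumptions.

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

def cls_embedding(text):
    """Return the final hidden state of the [CLS] token as a sentence vector."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=80)
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state[0, 0]  # [CLS] is the first token

vec = cls_embedding("Flood waters rising near Main St")  # 768-dim tensor
# `vec` can now serve as the input feature vector for a conventional classifier.
```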
The original BERT model is pre-trained on a general-domain corpus, so the data distribution of a text classification task in a specific domain may differ. To obtain improved results, we therefore need to further train BERT on domain-specific data. There are several ways to do this. The first is to train the entire pre-trained model on the new corpus and feed the output into a softmax function; the error is then backpropagated through the entire architecture, and all weights are updated for the domain-specific corpus. Alternatively, we can train some of BERT's layers while freezing others, or freeze all the layers, attach extra neural network layers, and train only the weights of the attached layers. These are so-called fine-tuning procedures. In this work we fine-tune BERT by encoding tweets with the BERT encoder, running further training iterations, and backpropagating the error through the entire model.

We now describe our hybrid model, called BERT_Hyb, which combines the handcrafted features with the features learned by BERT. The BERT_Hyb model feeds a vector of handcrafted features from the tweet through a linear layer, and feeds the vector produced by BERT for the first token ([CLS]) of the tweet through another linear layer. The outputs of these two layers are concatenated and fed through a third linear layer, whose output is passed through a softmax layer to predict whether the tweet is Informative or Not Informative. A sketch of this architecture is given below.
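The following is a minimal PyTorch sketch of an architecture matching this description: one linear layer over the handcrafted features, one over BERT's [CLS] vector, concatenation, a third linear layer, and a softmax. The hidden layer size and other details are our assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
from transformers import BertModel

class BertHyb(nn.Module):
    """Two branches -- handcrafted features and BERT's [CLS] vector --
    merged by a third linear layer, as described above."""
    def __init__(self, n_handcrafted=5, hidden=128, n_classes=2):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.feat_proj = nn.Linear(n_handcrafted, hidden)                # handcrafted-feature branch
        self.cls_proj = nn.Linear(self.bert.config.hidden_size, hidden)  # [CLS] branch
        self.out = nn.Linear(2 * hidden, n_classes)                      # after concatenation

    def forward(self, input_ids, attention_mask, handcrafted):
        cls_vec = self.bert(input_ids=input_ids,
                            attention_mask=attention_mask).last_hidden_state[:, 0]
        merged = torch.cat([self.feat_proj(handcrafted), self.cls_proj(cls_vec)], dim=-1)
        logits = self.out(merged)
        # Softmax yields P(Informative) vs. P(Not Informative); during training one
        # would typically apply a cross-entropy loss to `logits` and backpropagate
        # through the whole model, i.e., fine-tune BERT end to end.
        return torch.softmax(logits, dim=-1)
```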
4 Experiments

We use the following datasets to evaluate the models.

• CRISISLEXT26 (Olteanu et al., 2015): a dataset of tweets collected during twenty-six large crisis events in 2012 and 2013, with about 1,000 tweets per crisis labeled for informativeness, information type, and source.
• CRISISLEXT6 (Olteanu et al., 2014): English tweets posted during six large events in 2012 and 2013, with about 60,000 tweets labeled by relatedness as on-topic or off-topic with respect to each event. We treat the tweets labeled on-topic as Informative and those labeled off-topic as Not Informative.
• CRISISMMD (Alam et al., 2018): a dataset of tweets with both text and image content, comprising 16,000 tweets collected from seven events that took place in 2017 in five countries.
• COVID (Nguyen et al., 2020): 10K English tweets collected during the COVID-19 pandemic, split into a training set with 3,303 Informative and 3,697 Uninformative tweets and a validation set with 472 Informative and 528 Uninformative tweets.

In their original form, the above datasets provide only the tweets' content together with their IDs and labels. To compute the user-based features, we crawled the full information of all tweets from Twitter; however, some tweets are no longer available. We therefore created a version of each dataset consisting of the subset of tweets for which the full information could be crawled: COVID_SUBSET, CRISISLEXT6_SUBSET, CRISISLEXT26_SUBSET, and CRISISMMD_SUBSET. The basic statistics of all the datasets and their subsets are shown in Tables 1 and 2 respectively.

To evaluate the informative tweet classification task we employ the following performance metrics. Precision (P): the fraction of correctly classified instances among the instances assigned to the class, P = TP / (TP + FP). Recall (R): the fraction of correctly classified instances among all instances of the class, R = TP / (TP + FN). F-score (F1): the harmonic mean of precision and recall, F1 = 2PR / (P + R). We compute each metric independently for each class and then take the average, i.e., macro precision, macro recall, and macro F-score.

We normalized all characters in the tweets to their lower-cased forms, removed punctuation, non-ASCII characters, and non-English words, and then calculated the text-based and user-based features. The Bag-of-Words features were calculated over the entire corpus of tweets, but restricted to words appearing at least 5 times and at most 10,000 times in the corpus; words shorter than two characters were also pruned. In parallel, we tokenized the sentences and encoded the tokens using the BERT encoder.

Each dataset was randomly split into 10 mutually exclusive subsets, and 10-fold cross-validation was used to measure the performance of the models. For the conventional classifiers we used the implementations from scikit-learn (Pedregosa et al., 2011), with all algorithms set to their default parameter values. For BERT fine-tuning, we used the stochastic gradient descent optimizer with a learning rate of 0.001 and momentum of 0.9, and ran the training process for 20 epochs with a batch size of 16, limiting the BERT sentence encoding to a maximum length of 80. The BERT models were built on the pytorch-pretrained-BERT repository: https://github.com/huggingface/pytorch-pretrained-BERT.
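As an illustration of this evaluation protocol, here is a sketch of a 10-fold cross-validated macro F-score computation for a Bag-of-Words baseline with scikit-learn. The toy corpus is a placeholder for the real datasets, and interpreting the 10,000 limit as a vocabulary cap is our assumption.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Placeholder corpus and labels; in the paper these come from the crisis datasets.
texts = ["flood waters rising near the bridge", "lol so bored today"] * 50
labels = [1, 0] * 50  # 1 = Informative, 0 = Not Informative

pipeline = make_pipeline(
    # Words must occur at least 5 times; the 10,000 limit is taken here as a
    # cap on vocabulary size (one possible reading of the paper's setup).
    CountVectorizer(lowercase=True, min_df=5, max_features=10000),
    LogisticRegression(),  # scikit-learn default parameters, as in the paper
)
scores = cross_val_score(pipeline, texts, labels, cv=10, scoring="f1_macro")
print(f"macro F1: {scores.mean():.4f} (+/- {scores.std():.4f})")
```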
4.4 Results

We report the results in terms of macro-averaged F-score. Table 3 shows the performance of the implemented models on all the datasets used in this work; the two best results for each dataset are highlighted in bold face. Only the COVID and CRISISMMD datasets are split into training and validation sets by default, so to make the results fair and comparable across all datasets and approaches we performed 10-fold cross-validation on the entire datasets (training and validation sets combined).

The first six rows show the classification performance of the conventional classifiers using the handcrafted features proposed in this work. For the full datasets only the text-based features can be calculated, since the user-based features depend on the complete tweet information; when crawling Twitter to obtain this information, we found that many tweets had been deleted. The performance of the classifiers varies on a per-dataset basis. For the COVID dataset, we observed that the LOGISTIC REGRESSION classifier performed best in terms of macro F1.

The following six rows show the classification performance using Bag-of-Words input features. Here again the performance of the classifiers varies per dataset, but we observed considerable improvement across all datasets, which demonstrates that Bag-of-Words is a stronger feature encoding than the handcrafted features alone.

The next six rows show the results of combining the handcrafted features with the Bag-of-Words features. Interestingly, for the majority of the classifiers this combination does not improve the results on the COVID and CRISISLEXT6 datasets; only NAIVE BAYES demonstrated considerable improvement over the previous approach on COVID. On CRISISLEXT26, however, all classifiers improved compared with the Bag-of-Words-only approach, while on CRISISMMD again only NAIVE BAYES showed an improvement.

The next six rows show the results of the conventional classifiers using Fasttext word embeddings. For the COVID and CRISISLEXT6 datasets MLP produced the best results, while for the CRISISLEXT26 and CRISISMMD datasets LOGISTIC REGRESSION achieved the best macro F-score. The following six rows show the classification results using GloVe word embeddings; the results are similar to those with the Fasttext embeddings and do not vary much across datasets.

The following six rows show the results of the conventional classifiers using BERT-encoded features combined with the handcrafted features. We did not observe improvements from this combination on the COVID and CRISISMMD datasets, but we did see some improvements on the CRISISLEXT6 and CRISISLEXT26 datasets for the majority of the classifiers.

Finally, the last row shows the results of our proposed approach, BERT_Hyb. Our model outperforms all the previously described methods on all datasets used in this work. On COVID it produced a macro F-score of 84.41, a 2.5 percentage point improvement over the best result among the previous approaches (LR using Bag-of-Words features). On CRISISLEXT6 we observed a macro F-score of 95.96; on CRISISLEXT26 we obtained 79.09, the largest improvement (7 percentage points over SVM using handcrafted features combined with Bag-of-Words); and on CRISISMMD our model produced a macro F-score of 77.66.

Several factors can explain why our hybrid model performs much better than the other models tested in this paper. First, the BERT encoder uses a contextual representation, processing each word in relation to all the other words in the sequence rather than one by one in isolation. Second, we ran several training iterations while adjusting the weights, using different optimization functions to minimise the training loss.

We also evaluated the proposed approach on the subsets of the original datasets, which, as mentioned before, were created so that we could also calculate the features related to the user who posted the message. We noticed again that the handcrafted features alone did not produce satisfactory results. Using the Fasttext, GloVe, and BERT embeddings as input features to the conventional classifiers showed considerable improvements across all datasets, especially with LOGISTIC REGRESSION as the base classifier, although this pattern did not hold for the other classification methods. Our hybrid model BERT_Hyb produced the best performance on almost all datasets, the exception being CRISISLEXT6_SUBSET, where the difference is marginal: the best macro F-score (93.22) was obtained with RANDOM FOREST on Bag-of-Words features, while our hybrid approach scored 93.05. On COVID_SUBSET our model achieved a macro F-score of 84.64, a 2.3 percentage point improvement over the second best result (Bag-of-Words with LR, 82.35). Our model scored 76.68 and 76.54 macro F-score on CRISISLEXT26_SUBSET and CRISISMMD_SUBSET respectively. These are the two datasets on which the performance of the models stayed below 80%; further and more in-depth analysis is required, as there is still room for improvement.
5 Conclusions

Social media has drawn attention from different sectors of society, and the information available during catastrophic events is extremely useful both for the ordinary citizen and for the professionals involved in humanitarian work. However, the sheer overload of information calls for automated filtering methods for real-time processing of relevant content. In this work we designed a set of handcrafted features derived from both the Twitter posts and the users who posted them, and experimentally evaluated the performance of six conventional classifiers on the informative tweet classification task. We also trained classifiers with several word embeddings, namely Fasttext, GloVe, and BERT, as input features. Moreover, we showed that our proposed deep neural model BERT_Hyb is more effective at identifying informative tweets than conventional classifiers across different crisis-related Twitter corpora. As future work, we intend to further investigate combinations of different deep learning models and to implement a complete pipeline in which tweets are crawled and classified in real time based on crisis-related trending topics.

References

Abel et al. (2012). Semantics + filtering + search = Twitcident: Exploring information in social web streams.
Acerbo and Rossi (2017). Filtering informative tweets during emergencies: A machine learning approach.
Alam et al. (2018). CrisisMMD: Multimodal Twitter datasets from natural disasters.
Bojanowski et al. (2016). Enriching word vectors with subword information.
Cameron et al. (2012). Emergency situation awareness from Twitter for crisis management.
Caragea et al. (2016). Identifying informative messages in disaster events using convolutional neural networks.
Castillo et al. (2011). Information credibility on Twitter.
Devlin et al. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding.
Graf et al. (2018). Cross-domain informativeness classification for disaster situations.
Imran et al. (2015). Processing social media messages in mass emergency: A survey.
Imran et al. (2014). AIDR: Artificial intelligence for disaster response.
Imran et al. (2013). Extracting information nuggets from disaster-related messages in social media.
Kaufhold et al. (2020). Rapid relevance classification of social media posts in disasters and emergencies: A system and evaluation featuring active, incremental and online learning.
Li et al. (2018). Disaster response aided by tweet classification with a domain adaptation approach.
Meier (2013). Crisis maps: Harnessing the power of big data to deliver humanitarian assistance.
Mikolov et al. (2013a). Efficient estimation of word representations in vector space.
Mikolov et al. (2013b). Distributed representations of words and phrases and their compositionality.
Neppalli et al. (2018). Deep neural networks versus Naive Bayes classifiers for identifying informative tweets during disasters.
Nguyen et al. (2020). WNUT-2020 Task 2: Identification of informative COVID-19 English tweets.
Nguyen et al. (2017). Robust classification of crisis-related data on social networks using convolutional neural networks.
Olteanu et al. (2014). CrisisLex: A lexicon for collecting and filtering microblogged communications in crises.
Olteanu et al. (2015). What to expect when the unexpected happens: Social media communications across crises.
Pedregosa et al. (2011). Scikit-learn: Machine learning in Python.
Pennington et al. (2014). GloVe: Global vectors for word representation.
Rogstadius et al. (2013). CrisisTracker: Crowdsourced social media curation for disaster awareness.
Vaswani et al. (2017). Attention is all you need.
Verma et al. (2011). Natural language processing to the rescue? Extracting "situational awareness" tweets during mass emergency.
Vieweg et al. (2010). Microblogging during two natural hazards events: What Twitter may contribute to situational awareness.