key: cord-0609618-hanblbgh
authors: Khan, Pervaiz Iqbal; Razzak, Imran; Dengel, Andreas; Ahmed, Sheraz
title: Understanding Information Spreading Mechanisms During COVID-19 Pandemic by Analyzing the Impact of Tweet Text and User Features for Retweet Prediction
date: 2021-05-26
journal: nan
DOI: nan
sha: 242e84d05e65054656a72fbf8f11d8e5d92e17c6
doc_id: 609618
cord_uid: hanblbgh

COVID-19 has affected the world economy and the daily routine of almost everyone. It has been a hot topic on social media platforms such as Twitter and Facebook. These platforms enable users to share information with other users, who can reshare it, thus causing the information to spread. Twitter's retweet functionality allows users to share existing content with other users without altering the original content. Analysis of social media platforms can help in detecting emergencies during pandemics, which in turn supports taking preventive measures. One such type of analysis is predicting the number of retweets for a given COVID-19 related tweet. Recently, CIKM organized a retweet prediction challenge for COVID-19 tweets that focused on using numeric features only. However, we hypothesize that tweet text may play a vital role in accurate retweet prediction. In this paper, we combine numeric and text features for COVID-19 related retweet prediction. For this purpose, we propose two models, one CNN-based and one RNN-based, and evaluate their performance on the publicly available TweetsCOV19 dataset using seven different evaluation metrics. Our evaluation results show that combining tweet text with numeric features improves the performance of retweet prediction significantly.

Coronavirus disease 2019 (COVID-19), which originated in Wuhan, China, in December 2019, has affected many countries across the world. On March 11, 2020, the World Health Organization (WHO) declared it a pandemic, as it had spread to 114 countries [1].
As of April 21, 2021, 143,589,434 cases of the virus had been reported worldwide, including 3,058,632 deaths [2]. The total number of COVID-19 cases reported in the USA alone was 32,536,470, including 25,105,535 recoveries and 582,456 deaths, as of April 21, 2021 [2]. The virus has also affected various public and private sectors such as tourism, airlines and transportation, and private businesses, which has hugely impacted the economy worldwide [3]. To stop the spread of the virus, medical treatments have been accompanied by non-pharmaceutical interventions such as lock-downs, closure of educational institutions, and local and international travel bans [4]. To apply non-pharmaceutical interventions in a region, it is important to know the situation of the disease in that region. Analysis of social media platforms such as Twitter, Facebook, and YouTube can be useful for assessing the COVID-19 situation in a region. For example, a COVID-19 related tweet may indicate how serious the situation is.

Twitter is an online social network service that allows users to share information with other users through short textual messages called tweets. These tweets are a rich source of user communication. Their popularity has made information propagation a fundamental function of online social networks [5, 6]. Twitter also provides a popular functionality known as retweeting, which enables a user to share the content of an existing tweet from another user with their friends and followers without altering the original content. Retweeting is viewed as an atomic behavior and drives the widespread propagation of information on the internet. Besides, a retweet is also an indicator that the user is interested in a particular tweet. Recently, the prediction of retweets has received significant attention. Retweet analysis has many applications, such as detecting and tracking the spread of fake news [7] and emergency management [8] in case of a pandemic.
A larger number of COVID-19 related retweets may indicate the prevalence of the disease. Hence, accurately predicting the number of retweets is vital for analyzing the situation of the virus.

Information spread through the retweet functionality depends on many factors, such as the number of users mentioned in a tweet and the number of friends and followers of the users tweeting and retweeting the information. It also depends on the sentiment present in a tweet and the time at which the tweet is posted. For example, a tweet posted early in the morning or late at night is less likely to be retweeted, as most users may not be active on Twitter during this time. Tweet content such as text, an image, or a video itself plays a role in whether a tweet is retweeted. For example, the tweet "how's self-quarantine going?", which has the highest number of retweets in the TweetsCOV19 dataset, contains a video that is the main factor behind its retweets. However, the presence of videos and images in the tweet content poses additional challenges for retweet analysis.

Most of the existing retweet prediction methods model user preference using information such as user post history, user following relationships, and user profiles [9, 10, 11, 12]. Recently, CIKM-20 organized the COVID-19 retweet prediction challenge. The challenge focused on predicting the retweet frequency based on eleven features: tweet ID, user name, timestamp, followers, friends, favorites, entities, mentions, URLs, hashtags, and sentiment. Although these features, especially sentiment, help predict the number of retweets, we believe that the content of a tweet also has an impact on the retweet count. Fig. 1 shows some of the retweeted tweets.

To understand the spread of information related to the COVID-19 pandemic, in this work we focus on the social media platform Twitter and try to predict the number of retweets for a given tweet.
Along with the numeric features present in the AnalytiCup competition dataset, we additionally utilize the tweet text, using deep learning methodologies to predict the retweet count. For this purpose, we propose two regression models, one CNN-based and one RNN-based. We perform experiments on these models in three different ways:
• We feed these models with textual information only. To get a statistical representation of the tweet text, we initialize random embeddings that are learned during training.
• We feed these models with pre-processed numeric feature vectors only and remove the embedding layer.
• We reap the benefits of both textual input and pre-processed feature vectors. Here, we take both inputs in two branches, and after extracting informative features from them, we concatenate these features and pass them to a dense layer for retweet prediction.

We evaluate the performance of the models on the TweetsCOV19 dataset using seven different evaluation metrics. Experimental results show that combining tweet text with numeric features improves retweet prediction results significantly. The key contributions of this work are:
• We propose a framework for predicting retweets of a COVID-19 related tweet based on CNN and RNN regression models.
• Unlike the CIKM retweet challenge (AnalytiCup), we utilize both tweet text and retweet dataset features (6 features) to improve the prediction.
• As the understanding of COVID-19 related retweet behavior has many practical applications, we analyze the importance of features for the retweet prediction task.
• We conduct extensive experiments using various features contributing to retweets to demonstrate the effectiveness of tweet text for predicting retweet behavior.

The structure of the rest of the paper is as follows: Section 2 briefly reviews related work, and Section 3 presents the proposed retweet prediction framework.
In Section 4, we provide the dataset and experimental details and present the results and their discussion, whereas Section 5 concludes this paper.

This section briefly discusses prior work on predicting the number of retweets for a given tweet. Zaman et al. [13] trained a probabilistic collaborative model called Matchbox [14], originally developed for predicting the movie preferences of a user, for retweet prediction. To collect data for training and evaluating the method, they crawled Twitter. As features for training the model, they used tweeter and retweeter information, such as the name of the user and the number of followers, as well as the tweet content itself. They introduced another feature called binary feedback, which was set to 1 if the tweet was retweeted within one hour of its original posting time and to 0 otherwise. To evaluate the performance of the model, they used calibration plots as well as a negative log-score. However, they did not report the distribution of the training and test data.

Firdaus et al. [15] predicted the number of retweets for a given tweet based on an analysis of how a user behaves differently as an author and as a retweeter. They considered the user's personality and topics of interest based on their tweet and retweet history. As a personality measure, they used the Big Five traits and their thirty lower-level facet scores for each user, resulting in a 35-dimensional vector per user based on their past tweets and retweets. They analyzed the frequent words used by users in their tweets and retweets. They also used the topic of interest as another feature, computed separately from past tweets and past retweets. For every user, they calculated a similarity score with the topics of interest of all other users. A similarity score below a threshold meant that the user's interests as an author of tweets differed from their interests as a retweeter.
Along with generating user profiles based on user interests, they also created a profile for a given tweet based on its text and then compared the tweet text profile with the user profile to generate a new feature. Finally, machine learning classification was performed on these features to classify whether the tweet would be retweeted by the users or not. Their results showed that treating user behavior as an author and as a retweeter separately outperformed conventional methods.

Can et al. [16] predicted retweet count using visual cues in the tweet. They crawled Twitter for data collection and used only tweets that contained images for training and evaluating the model. They experimented with structure-based features, such as the number of friends, the number of followers, and the number of favorites, and with image-based features, such as color histograms, GIST descriptors [17], and object detectors [18]. For retweet prediction, they used machine learning models such as linear regression, Support Vector Machines (SVM) with a Gaussian kernel, and random forest regression, with the Root Mean Squared Error (RMSE) as the evaluation metric. The experimental results showed that using image-based features with the random forest regression model achieved the best performance.

Wang et al. [12] studied the problem of retweet prediction using multimodal regression. They combined the visual and textual data of tweets with the author's social features, such as the number of friends and the number of followers, to predict the number of retweets for a given tweet. They used an Inception-ResNet CNN [19] and LSTM-RNNs [20] to model the visual and textual features, respectively. Word embeddings trained specifically on tweet-style language were used as input to the LSTM-RNNs. A joint embedding model was trained to learn the semantic relationship between tweet images and texts.
The learned visual, textual, and author's social features were used as input to a Poisson regression model to predict the number of retweets. They trained and evaluated the model on the existing MBI-1M dataset [21] and two other internal datasets containing tweets from 2015 and 2016. They used two evaluation metrics for their model: Spearman's rank correlation coefficient and mean absolute percentage error (MAPE). The results showed that naively combining visual, textual, and author's social features did not improve the performance of the regression model, whereas combining them via the joint embedding model did.

Since a user's exposure to posts from followees can be used for retweet prediction, Ma et al. [9] modeled the hot topics discussed by followees using a self-attentive model. In addition, the authors considered users' posting histories with external memory and utilized hierarchical attention modeling to construct users' interests. Zhang et al. presented a non-parametric statistical approach for predicting retweet behavior that combines textual, temporal, and structural information from a large number of microblogs and their social networks [22]. Microblog posts are aggregated to identify the topics that users concentrate on, and the topic distribution of each cluster of microblogs is estimated. As topics may vary over time, a weighted approach is used to increase the role of hot topics. Firdaus et al. [23] considered emotion as an influential latent factor of retweet behavior and used topic-specific emotions for retweet prediction. Their results showed that user profiles coupled with user emotion performed better than user profiles alone.

Recently, CIKM organized a retweet prediction challenge (AnalytiCup) for predicting the popularity of COVID-19 related tweets in terms of retweet frequency. Retweet prediction can be helpful during a crisis such as COVID-19.
The challenge focused on eleven features for retweet prediction, namely tweet ID, user name, timestamp, followers, friends, favorites, entities, mentions, URLs, hashtags, and sentiment. Although the tweet sentiment conveys the nature of the tweet, we think that the tweet text may also play a vital role in retweet prediction. In this work, we aim to analyze the impact of combining the tweet text with numeric features for the retweet prediction task, and we propose a CNN-based and an RNN-based regression model for experimentation.

Modeling retweeting behavior is very important in times of crisis, and it has been an active area of research. The recently organized CIKM challenge AnalytiCup considers only numeric features for retweet prediction; however, we believe that the tweet text also plays a significant role. This work explores the behavior of tweet text and numeric features for predicting the retweets of a tweet.

Let $T = \{(t_i, y_i)\}_{i=1}^{n}$ be the tweets, where $t_i$ and $y_i$ represent the $i$-th tweet and its number of retweets, respectively, and $n$ represents the total number of tweets. Let $F$ be the features from the TweetsCOV19 dataset, and let $F' = (f_{\text{text}})$ be the textual feature for every tweet $t_i$. Let $F^{*} = F + F'$ represent the combination of the TweetsCOV19 dataset features and the textual features. The task is to predict $y_i$ for a given tweet $t_i$ using the features $F$, $F'$, and $F^{*}$, and to analyze the impact of these features.

Deep learning methodologies are achieving state-of-the-art performance on diverse natural language processing tasks such as text classification [24] [25] [26], sentiment analysis [27], information retrieval [28], and text summarization [29]. Generally, deep learning methodologies are classified into two categories, namely Convolutional Neural Networks (CNNs) [30] and Recurrent Neural Networks (RNNs) [31].
Initially, researchers believed that CNNs perform better on diverse computer vision (CV) tasks, whereas RNNs achieve better performance on natural language processing (NLP) tasks. However, this view does not hold in general, as many researchers have concluded that CNN models also perform well on NLP tasks [32]. To analyze retweet behaviour, we propose a CNN-based and an RNN-based method, each of which takes three different inputs: numeric-only, text-only, and combined numeric and text features. The building blocks of the proposed methods are as follows:

1. One-dimensional Convolution: The one-dimensional convolution (Conv-1D) operation computes the dot product between a weight vector $\mathbf{w} \in \mathbb{R}^{w}$ and a vector of inputs $\mathbf{x} \in \mathbb{R}^{x}$. Concretely, Conv-1D computes the dot product of the weight vector $\mathbf{w}$ with each window of $w$ consecutive values in the input $\mathbf{x}$ to obtain another vector $\mathbf{y}$. The vector of weights $\mathbf{w}$ is called the filter or kernel and is learned during the training of the network. In Conv-1D, the features residing at the margins of the input participate less than the features residing in the center. To prevent this, zero padding is applied at the input and intermediate layers, which ensures that all the weights in the filter reach the entire input sequence, including the words at the margins.

2. k-Max Pooling: Given a sequence $\mathbf{s} \in \mathbb{R}^{s}$ and a value $k$ (where $s \geq k$), k-max pooling selects the subsequence $\mathbf{s}^{k}_{\max}$ of the $k$ highest values of $\mathbf{s}$. The order of the values in $\mathbf{s}^{k}_{\max}$ corresponds to their original order in $\mathbf{s}$. k-max pooling thus selects the $k$ most active features from the input sequence $\mathbf{s}$.

3. Non-linear Function: A non-linear activation function $g$ is applied element-wise to the input vector. Let $M \in \mathbb{R}^{f \times d}$ be a matrix, where $f$ is the number of filters and $d$ is the output dimension of the pooling operation. Then the $i$-th activation vector $a_i$ for the $i$-th filter is obtained as follows:

$$a_i = g(M_i), \quad i = 1, \dots, f,$$

where $M_i$ denotes the $i$-th row of $M$.

4. Folding: After applying the 1D convolution, k-max pooling, and a non-linearity to the input, a first-order feature map is obtained. These operations are repeated in each layer of the network to obtain more feature maps. Let $F^{i}$ denote the feature map in the $i$-th layer. Multiple feature maps $F^{i}_{1}, F^{i}_{2}, \dots, F^{i}_{n}$ are computed in parallel in each layer, where $n$ denotes the number of filters in the $i$-th layer. These feature maps are independent of each other until they reach a fully connected layer. A simple method called folding sums every two rows in a feature map $F$. For a feature map of $m$ rows, folding returns $m/2$ rows, i.e., half of the feature map rows. With folding, a feature map in the $i$-th layer depends on two rows of feature values in that layer.

5. Recurrent Neural Networks: At each time step $t$, a Recurrent Neural Network (RNN) takes an input $x_t$ and outputs a hidden state $h_t$. The hidden state is computed from $x_t$ and the previous hidden state $h_{t-1}$.

The CNN regressor uses Conv-1D, k-max pooling, and a non-linearity in the first layer, whereas it uses Conv-1D, folding, k-max pooling, and a non-linearity in the second layer. The output of the second layer is flattened and passed to the output layer. The RNN regressor, on the other hand, consists of a simple RNN layer with 32 hidden units and a non-linear activation function. The output of the RNN layer is flattened and passed to the output layer for prediction.

Here, we present the dataset, evaluation metrics, and parameters used for training the models. Further, we provide the results of our experimentation and briefly discuss them.

For our experimentation, we downloaded the TweetsCOV19 dataset, which contains tweets from the Twitter platform. The dataset contained a total of 8,151,524 tweets.
Every tweet example had eleven features, namely Tweet ID, Username, Timestamp, Number of Followers, Number of Friends, Number of Favorites, Entities, Sentiment, Mentions, Hashtags, and URLs, together with a label "No. of Retweets". As the tweet text might contain important content, we further crawled Twitter to obtain the tweet text in addition to these eleven features. However, at the time of crawling, not all tweets were available for download, which resulted in only 6,955,124 tweets with text.

For our experimentation, we used the timestamp, number of followers, number of friends, number of favorites, sentiment, mentions, and tweet text features instead of using all the features. We split the timestamp feature into six features: month, week, day, hour, minute, and day-of-week. The sentiment feature contained a score for positive sentiment ranging from 1 to 5 and a score for negative sentiment ranging from -1 to -5. Every tweet example contained the positive and negative sentiment scores separated by a white-space character " ". For example, the sentiment value "3 -1" means that the tweet has a positive sentiment score of 3 and a negative sentiment score of -1. Instead of using the sentiment feature directly, we split it into two features: positive sentiment and negative sentiment. The user mention feature contained the names of the users mentioned in the tweet, separated by a white-space character " ". We converted it into the total number of users mentioned in the tweet. For example, if the user mention feature contained the names of 3 users, the feature value became 3.

Table 1 shows the top 5 most retweeted tweets from the dataset. As shown in Table 1, the tweet with the text "how's self-quarantine going?" was the most retweeted tweet, with a total of 275,529 retweets. This tweet also contained a video along with the text. The text of the second tweet in the table was not available at the time of download.
The third most retweeted tweet also contained a video along with the plain text on Twitter. Instead of using all the tweets for our experimentation, we randomly took 60K tweets from the dataset and divided them into 40K, 10K, and 10K samples for the train, evaluation, and test sets, respectively.

To evaluate the performance of the models using numeric-only, text-only, and combined numeric and text features, we used seven different evaluation metrics [33]. Each metric is detailed in the following subsections. Throughout, $y_i$ denotes the actual number of retweets of the $i$-th tweet, $\hat{y}_i$ denotes the predicted number, and $n$ denotes the number of evaluated tweets.

Mean Absolute Error (MAE) calculates the average magnitude of the errors in a set of predictions, ignoring their direction [34]. The absolute differences between actual observations and predictions are computed and then averaged, with all differences weighted equally. The mathematical formula for MAE can be expressed as:

$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|$$

Relative error helps to determine the magnitude of the absolute error relative to the actual size of the measurement. Relative Mean Absolute Error (rMAE) is the normalized form of MAE. It is computed by dividing MAE by the mean of the predictions. The mathematical formula for rMAE can be expressed as:

$$\mathrm{rMAE} = \frac{\mathrm{MAE}}{\frac{1}{n} \sum_{i=1}^{n} \hat{y}_i}$$

Mean Bias Error (MBE) calculates the average signed error in a set of predicted values, so that errors in opposite directions can cancel. The differences between actual observations and predictions are computed and then averaged, with all differences weighted equally. The mathematical formula for MBE can be expressed as:

$$\mathrm{MBE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)$$

Relative Mean Bias Error (rMBE) is the normalized form of MBE. It is computed by dividing MBE by the mean of the predictions. The mathematical formula for rMBE can be expressed as:

$$\mathrm{rMBE} = \frac{\mathrm{MBE}}{\frac{1}{n} \sum_{i=1}^{n} \hat{y}_i}$$

Root Mean Square Error (RMSE) is a quadratic scoring technique that computes the average magnitude of the errors. First, the differences between actual observations and predictions are calculated and squared. Then the mean of these squared differences is computed, and the square root of this mean is taken.
With $y_i$ the actual number of retweets, $\hat{y}_i$ the predicted number, and $n$ the number of evaluated tweets, the mathematical formula for RMSE can be expressed as:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}$$

Relative Root Mean Square Error (rRMSE) is the normalized form of RMSE. It is computed by dividing RMSE by the mean of the predictions. The mathematical formula for rRMSE can be expressed as:

$$\mathrm{rRMSE} = \frac{\mathrm{RMSE}}{\frac{1}{n} \sum_{i=1}^{n} \hat{y}_i}$$

4.2.7. $R^2$ Score

The $R^2$ score is also termed the coefficient of determination. It measures how closely the data fit the regression line. To compute this metric, the residual sum of squares (RSS) is divided by the total sum of squares (TSS), and the result is subtracted from 1. The mathematical formula for $R^2$ can be expressed as:

$$R^2 = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}} = 1 - \frac{\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}{\sum_{i=1}^{n} \left( y_i - \bar{y} \right)^2}, \quad \bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i$$

Choosing an optimization approach for a deep learning model is important, since a good choice saves time and resources by delivering results in minutes instead of hours, or in hours instead of days. Adam [35] is one of the most commonly used optimization algorithms. It can be used in place of classical stochastic gradient descent to update the network weights iteratively on the training data. Adam was proposed by Diederik Kingma of OpenAI and Jimmy Ba of the University of Toronto in an ICLR (2015) paper titled "Adam: A Method for Stochastic Optimization" [35]. Specifically, Adam computes exponential moving averages of the gradient and of the squared gradient. The decay rates of these moving averages are controlled by two parameters, beta1 and beta2. The configuration parameters of Adam are as follows:
• alpha: This parameter is the learning rate or step size. It is the proportion by which the weights are updated. Larger values result in faster initial learning before the rate is updated, whereas smaller values slow down learning during training.
• beta1: The exponential decay rate for the first moment estimates.
• beta2: The exponential decay rate for the second moment estimates.
• epsilon: A very small number used to avoid division by zero in the implementation.
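As a minimal sketch of how these four parameters interact, the following NumPy function applies a single Adam update as described in [35]. The helper name `adam_step`, the toy objective $(w - 3)^2$, the learning rate of 0.01 used in the loop, and the number of steps are illustrative assumptions and are not part of the experiments in this paper.

```python
import numpy as np

def adam_step(w, grad, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, epsilon=1e-7):
    """Apply one Adam update (hypothetical helper); t is the 1-based step counter."""
    m = beta1 * m + (1.0 - beta1) * grad        # moving average of the gradient
    v = beta2 * v + (1.0 - beta2) * grad ** 2   # moving average of the squared gradient
    m_hat = m / (1.0 - beta1 ** t)              # bias-corrected first moment
    v_hat = v / (1.0 - beta2 ** t)              # bias-corrected second moment
    w = w - alpha * m_hat / (np.sqrt(v_hat) + epsilon)  # epsilon avoids division by zero
    return w, m, v

# Toy objective (assumption): f(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w = np.array([0.0])
m = np.zeros_like(w)
v = np.zeros_like(w)
w_first = None
for t in range(1, 2001):
    grad = 2.0 * (w - 3.0)
    w, m, v = adam_step(w, grad, m, v, t, alpha=0.01)
    if t == 1:
        w_first = w.copy()  # after bias correction the first step has size close to alpha
```

Because of the bias correction, the very first update moves each weight by approximately alpha regardless of the gradient scale, which is why alpha is often read as an upper bound on the initial step size.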
The proposed CNN regressor takes three types of input. For textual input, we first map every unique word to a number. Given a text-only input, each word is replaced with its number, which generates a sequence $s$. Then, we take an embedding $e \in \mathbb{R}^{d}$ for each token in the sequence $s$, where $d = 100$ in our settings. This yields an embedding matrix $E \in \mathbb{R}^{d \times l}$, where $l = 30$ is the maximum sequence length of our input. We apply zero-padding of size 49 on both sides, which makes the size equal to $100 \times 128$ and ensures that every token in the sequence participates actively in the convolution operation. This is followed by Conv-1D with 64 filters, k-max pooling with $k = 5$ (i.e., a pooled output dimension of 5), and then a non-linearity. Next, we zero-pad the output of the first layer and apply Conv-1D with 64 filters, followed by a folding layer, k-max pooling, and a non-linearity. We then flatten the output of this layer, and finally a dense layer predicts the output. For the numeric-only input, the same configuration is applied, except that there is no embedding layer. For the combined numeric and text features, the numeric and textual features are processed separately, and the flattened outputs of both branches are combined just before the final dense layer, which serves as the prediction layer.

We also propose an RNN-based model to predict the number of retweets for a given tweet. This model consists of a simple RNN layer that has 32 hidden units and uses tanh as the activation function. In the case of text-only features, the input features are followed by an embedding layer similar to that of the CNN regressor. The output of the embedding layer is passed to the RNN layer. The output of the RNN layer is flattened and passed to a dense layer for prediction. In the case of numeric-only features, there is no embedding layer and the input features are passed directly to the RNN layer. For the combined numeric and text features, both types of features are processed separately and combined before the prediction layer.
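The building blocks used by the CNN regressor (Conv-1D with zero padding, k-max pooling, and folding) can be sketched in plain NumPy as follows. This is an illustrative sketch under stated assumptions, not the implementation used in the experiments: the helper names `conv1d`, `k_max_pooling`, and `fold` are hypothetical, a single fixed filter and toy inputs are used, and no weights are learned.

```python
import numpy as np

def conv1d(x, w, pad):
    """Dot product of filter w with every window of len(w) values of the zero-padded input."""
    xp = np.pad(x, pad)  # zero padding lets the filter reach the values at the margins
    return np.array([xp[i:i + len(w)] @ w for i in range(len(xp) - len(w) + 1)])

def k_max_pooling(x, k):
    """Keep the k largest values along the last axis, preserving their original order."""
    idx = np.argpartition(x, -k, axis=-1)[..., -k:]  # positions of the k largest values
    idx = np.sort(idx, axis=-1)                      # restore the original ordering
    return np.take_along_axis(x, idx, axis=-1)

def fold(feature_map):
    """Sum every two adjacent rows, mapping m rows to m / 2 rows."""
    if feature_map.shape[0] % 2 != 0:
        raise ValueError("folding needs an even number of rows")
    return feature_map[0::2] + feature_map[1::2]

# Toy inputs (assumptions chosen only to make the behaviour visible).
conv_out = conv1d(np.array([1.0, 2.0, 3.0]), np.array([1.0, 1.0]), pad=1)
pooled = k_max_pooling(np.array([[1.0, 9.0, 3.0, 7.0, 5.0],
                                 [4.0, 1.0, 8.0, 2.0, 6.0]]), k=3)
folded = fold(np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]))
```

In the first pooled row, the values 9, 7, and 5 are kept in the order in which they appear in the input rather than in sorted order, which is the property that distinguishes k-max pooling from a plain top-k selection.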
Figure 2 shows the block diagram of the proposed method, whereas Figure 3 presents the architecture of the CNN and RNN regressors.

In our experimentation, we set the learning rate to 0.001. We used Adam as the optimizer with the default values of beta1 and beta2, which were 0.9 and 0.999, respectively, and an epsilon of 1e-07. We used a batch size of 64 and trained both models for 100 epochs.

Here, we present the experimental results of the models we proposed to analyze the impact of combining tweet text with numeric features. We passed numeric-only, text-only, and combined numeric and text features to each of the CNN and RNN models. We evaluated the performance of both models using seven different metrics. The results show that the CNN-based model performed significantly better on all seven evaluation metrics with combined numeric and text features. Numeric-only features performed better than text-only features, on which the model performed worst. The RNN-based model performed best with numeric-only features on three evaluation metrics, namely mean absolute error, root mean squared error, and $R^2$ score. With text-only features, it performed best on only one metric. With combined numeric and text features, it performed best on three evaluation metrics: relative mean absolute error, mean bias error, and relative mean bias error. Table 2 presents the experimental results of the CNN and RNN models for numeric-only, text-only, and combined numeric and text features. Based on the experimental evaluation, we make the following key observations:
• Retweet prediction performance improved for the CNN-based model when numeric and text features were combined.
• For the RNN-based model, text played an important role in retweet prediction, either as a standalone feature or in combination with numeric features.

Based on these results, we can conclude that the tweet text plays a significant role in retweet prediction.
Therefore, it should be considered for retweet prediction instead of being ignored. Figure 4 shows the predicted and actual retweet counts of both models for 50 tweets taken randomly from the test set. The predicted retweet count is plotted for numeric-only, text-only, and combined numeric and text features. The figure shows that the retweet counts predicted by combining numeric and text features are close to the actual number of retweets, whereas the retweet counts predicted using text-only features are far from the actual number.

In this paper, we predicted the number of retweets for a given COVID-19 related tweet using numeric features combined with tweet text. For this purpose, we proposed two models, one CNN-based and one RNN-based. We passed numeric-only, text-only, and combined numeric and text representations to each of these models separately. We evaluated the performance of these models on a subset of the TweetsCOV19 (AnalytiCup) dataset. Results showed that the CNN regressor achieved the highest $R^2$ of 0.7427 and the lowest MAE of 25.3887 for combined numeric and text features. On the other hand, the RNN regressor achieved the lowest rMAE of 142.6813, MBE of 0.4204, and rMBE of -1.2255 for combined numeric and text features. Experimental results showed that combining the numeric and text features improved retweet prediction compared to using numeric or text features individually.
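For readers who wish to compute the reported metrics on their own predictions, the following NumPy sketch implements the seven evaluation metrics as they are described in the text. It assumes that errors are taken as actual minus predicted and that the relative variants divide by the mean of the predictions; any further scaling, for example to percentages, is not applied. The function name `regression_metrics` and the toy values are hypothetical.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return MAE, rMAE, MBE, rMBE, RMSE, rRMSE and R2 (hypothetical helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred            # sign convention is an assumption: actual - predicted
    mean_pred = y_pred.mean()        # normaliser for the relative metrics
    mae = np.abs(err).mean()
    mbe = err.mean()
    rmse = np.sqrt((err ** 2).mean())
    rss = (err ** 2).sum()                          # residual sum of squares
    tss = ((y_true - y_true.mean()) ** 2).sum()     # total sum of squares
    return {
        "MAE": mae, "rMAE": mae / mean_pred,
        "MBE": mbe, "rMBE": mbe / mean_pred,
        "RMSE": rmse, "rRMSE": rmse / mean_pred,
        "R2": 1.0 - rss / tss,
    }

# Toy retweet counts (assumption), not taken from the TweetsCOV19 dataset.
scores = regression_metrics([2, 4, 6, 8], [1, 5, 6, 8])
```

With this convention, a positive MBE means the model under-predicts on average; if the opposite sign convention is preferred, only the definition of `err` needs to change.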
A deep learning-based social distance monitoring framework for COVID-19
Sociodemographic determinants of COVID-19 incidence rates in Oman: Geospatial modelling using multiscale geographically weighted regression (MGWR)
Impact of the COVID-19 pandemic on travel behavior in Istanbul: A panel data analysis
Predicting individual retweet behavior by user similarity: A multi-task learning approach, Knowledge-Based Systems
Event detection in Twitter stream using weighted dynamic heartbeat graph approach
The spread of true and false news online
Think local, retweet global: Retweeting by the geographically-vulnerable during Hurricane Sandy
Hot topic-aware retweet prediction with masked self-attentive model
Prediction of likes and retweets using text information retrieval
Who will retweet this? Automatically identifying and engaging strangers on Twitter to spread information
Retweet wars: Tweet popularity prediction via dynamic multimodal regression
Predicting information spreading in Twitter
Proceedings of the 18th international conference on World wide web
Retweet prediction considering user's difference as an author and retweeter
Predicting retweet count using visual cues
Modeling the shape of the scene: A holistic representation of the spatial envelope
Object bank: A high-level image representation for scene classification & semantic feature sparsification
Inception-v4, Inception-ResNet and the impact of residual connections on learning
Long short-term memory
Latent factors of visual popularity prediction
Retweet behavior prediction using hierarchical Dirichlet process
Topic specific emotion detection for retweet prediction
Benchmark performance of machine and deep learning based methodologies for Urdu text document classification
Two stream deep network for document image classification
A robust hybrid approach for textual document classification
A precisely xtreme-multi channel hybrid approach for Roman Urdu sentiment analysis
Natural language processing in information retrieval
Text summarization techniques: a brief survey
An introduction to convolutional neural networks
Fundamentals of recurrent neural network (RNN) and long short-term memory (LSTM) network
Understanding convolutional neural networks for text classification
The evaluation of reanalysis and analysis products of solar radiation for Sindh province
Mean Absolute Error
A method for stochastic optimization