key: cord-0856134-2xotzmo4
authors: Wu, Zhiyuan; Pi, Dechang; Chen, Junfu; Xie, Meng; Cao, Jianjun
title: Rumor Detection Based On Propagation Graph Neural Network With Attention Mechanism
date: 2020-06-05
journal: Expert Syst Appl
DOI: 10.1016/j.eswa.2020.113595
sha: d5b066c9081a6300baa8cf9403bd58260658e7f9
doc_id: 856134
cord_uid: 2xotzmo4

Rumors on social media have long been an important issue that seriously endangers social security. Research on the timely and effective detection of rumors has attracted considerable interest in both academia and industry. At present, most existing methods identify rumors based solely on linguistic information, without considering temporal dynamics and propagation patterns. In this work, we address the rumor detection task under the framework of representation learning. We first propose a novel way to construct the propagation graph by following the propagation structure (who replies to whom) of posts on Twitter. We then propose a gated graph neural network based algorithm called PGNN, which generates powerful representations for each node in the propagation graph. The proposed PGNN algorithm repeatedly updates node representations by exchanging information between neighbor nodes via relation paths within a limited number of time steps. On this basis, we propose two models, namely GLO-PGNN (rumor detection model based on the global embedding with propagation graph neural network) and ENS-PGNN (rumor detection model based on the ensemble learning with propagation graph neural network). They adopt different classification strategies for the rumor detection task, and further improve performance by including an attention mechanism that dynamically adjusts the weight of each node in the propagation graph. Experiments on a real-world Twitter dataset demonstrate that our proposed models achieve much better performance than state-of-the-art methods on both the rumor detection task and the early detection task.

neural network.
The proposed algorithm can embed textual and structural features into high-level representations by propagating information between neighbor nodes in the propagation graph.

On the basis of PGNN, we propose two rumor detection models, GLO-PGNN (rumor detection model based on the global embedding with propagation graph neural network) and ENS-PGNN (rumor detection model based on the ensemble learning with propagation graph neural network), which mainly differ in their classification approaches. We also incorporate an attention mechanism into the proposed models to achieve a significant performance improvement.

The experimental results on the public dataset demonstrate that our models are superior to state-of-the-art baselines on both the rumor detection task and the early detection task.

The organization of this paper is as follows: Section 1 is the introduction. Section 2 introduces related work. Section 3 first presents the definitions and notation used in this paper, then describes how to construct the propagation graph. Section 4 explains the proposed algorithms in detail. Section 5 presents and analyzes the experimental results. Section 6 gives concluding remarks and future work.

2 RELATED WORK

Currently, most rumor detection and fake news detection methods are supervised. The most common type is content-based methods, which classify rumors or fake news depending on the veracity of text or images. These works assume that the content of different types of rumors (or news) differs in some quantifiable way. (Fuller et al. 2009) proposed the concept of cue sets based on several semantic cues, including the percentage of first-person singular and plural pronouns in the text, the average length of words in the text, and imagery. (Zhao et al.
2015) tried a more elaborate set of cues to make it more suitable for fake news detection, and included several platform-specific features, such as counts of hashtags ("#") and mentions ("@") in tweets. (Mihalcea et al. 2009) applied the n-gram approach to the lie detection task and represented the text content as word frequency vectors, which can be used as inputs to machine learning classifiers. (Sicilia et al. 2018) used content-based features, together with fine-grained features inspired by graph theory and social influence models, to detect rumors in each post of a single topic domain related to health news. However, the cue-based and feature-based methods mentioned above require a lot of time and effort to design a non-trivial linguistic cue set, and the hand-crafted features rely heavily on the specific social platform, making it hard to generalize across different domains, languages and topics.

On social networks, news or rumors often contain pictures. A fake news authentication system has been proposed that analyzes the veracity of platform-independent information available in the form of images. The model identifies news in 4 steps: first, it extracts text from images; second, it recognizes entities in the text; third, it scrapes the web for related content according to the extracted entities; finally, a processing unit performs the classification.

To alleviate the main shortcoming of traditional content-based methods, namely that they require manual feature engineering, some researchers extract features automatically with deep learning methods. (Wang et al. 2017) processed the statement text and the speaker metadata with two different embedding layers to obtain continuous low-dimensional feature representations. (Shu et al. 2018) assumed that publisher-news relations and user-news interactions are inherently associated, and considered the fake news detection task in a tri-relationship embedding framework. (Kaliyar et al.
2020) proposed a deep convolutional neural network (FNDNet) for content-based fake news detection. FNDNet is designed to automatically learn discriminative features for fake news classification through multiple hidden layers built into the deep neural network. However, because some rumors or fake news are deliberately crafted to imitate real stories for a variety of motivations, it is challenging for machine learning algorithms to determine veracity by analyzing content alone.

Some methods sort a series of relevant posts in chronological order and then identify rumors by capturing the temporal dynamics of the time series. (Ma et al. 2015) divided the time series into fixed time intervals, and captured the temporal characteristics by comparing the difference in social context information between adjacent time intervals. (Ma et al. 2016) introduced recurrent neural networks into time series modeling and automatically generated representations from tweet content in the time series. More recently, they considered rumor classification and stance classification simultaneously under a multi-task learning framework, in which the two tasks reinforce each other (Ma et al. 2018a). (Ruchansky et al. 2017) further enriched the features of each time interval and obtained users' credibility scores by matrix decomposition. However, these approaches typically focus on temporal dynamics alone, ignoring the internal topology among posts on social networks (who replies to whom and who comments on whom).

Information on social media with different content differs in temporal patterns and diffusion speeds (Foroozani & Ebrahimi 2019). Rumors and non-rumors, corresponding to unverified and verified information respectively, spread through social media in the form of shares and re-shares of the source and shared posts, resulting in a diffusion cascade or tree with the source post at the root.
To figure out how information is misused and disseminated to fulfil spiteful motives, (Meel et al. 2019) systematically studied the prevalent technologies used to detect and contain malicious information, and further gave a taxonomy that classifies spiteful information into different stages. Some works have designed hand-crafted features to exploit the propagation tree of rumors. (Wu et al. 2015) and (Ma et al. 2017) calculated the similarity between different propagation trees based on well-designed random walk graph kernels and tree kernels, respectively. (Vosoughi et al. 2017) proposed a system to automatically predict the veracity of rumors; the system identifies rumors based on three aspects, namely linguistic style, characteristics of the people involved in propagating the information, and network propagation dynamics. To reduce the time and cost of manually designing features, (Ma et al. 2018b) proposed two recursive neural models for rumor detection. The models automatically extract features by recursively traversing the tree structure in either a top-down or a bottom-up manner.

The aforementioned approaches belong to supervised learning. In addition, some scholars have worked on unsupervised methods to identify the veracity of unconfirmed news or rumors. However, research on unsupervised methods has proven difficult, and few findings have been reported in this field. For example, (Yang et al. 2019) captured the conditional dependencies among the veracity of news, users' opinions and users' credibility with a Bayesian network model, and proposed an efficient sampling method that helps the model determine the veracity of unlabeled news.

Inspired by these ideas, we propose a novel approach to tackle the problem of rumor detection.
In our proposed model, we use word embedding techniques to obtain continuous low-dimensional representations of text content, which is common practice in deep learning-based rumor detection approaches. Unlike works that consider only the temporal characteristics of rumors, we exploit the topological characteristics by taking the who-replies-to-whom relationship into consideration. Instead of manually designing features for the propagation tree of rumors, we construct the propagation graph of rumors and automatically extract features with a graph neural network. To the best of our knowledge, our work is the first attempt to tackle the rumor detection problem by exploiting the propagation graph of rumors.

In this section, we first introduce the definitions and notation used in the paper. Then we discuss how to construct the propagation graph according to the who-replies-to-whom relationship. Finally, we formally give the problem statement of the rumor detection task.

When someone posts a tweet on Twitter, other users can express their stance by sharing, replying, commenting or retweeting, so that information about the source post can spread widely on the Internet. For a better understanding of the propagation structures of posts, the definitions and notation used in this paper are described as follows. A post set consists of a source tweet and its responsive tweets, and the Twitter rumor detection dataset can be defined as a collection of post sets.

Each post set can form a propagation tree structure based on who-replies-to-whom relationships (Wu et al. 2015, Ma et al. 2017). Figure 1(a) shows the connection structure based on who-replies-to-whom relationships after the ACM official Twitter account announced the winner of the 2019 Turing Award. Figure 1(b) shows the propagation tree structure derived from Figure 1(a): the root node 0 represents the source tweet, the nodes with indices from 1 to 3 represent the responsive tweets, and all the nodes in the propagation tree form a post set.
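As a minimal sketch of how who-replies-to-whom pairs give rise to a propagation tree, the following Python snippet groups replies under the tweet they answer and records each tweet's depth below the source tweet. The function name, the `(replier, replied_to)` pair format and the toy post set are assumptions made for illustration; they are not taken from the paper or from Figure 1.

```python
from collections import defaultdict

def build_propagation_tree(reply_pairs, root=0):
    """Build a propagation tree from (replier, replied_to) index pairs.

    Returns a dict mapping each tweet index to the sorted list of tweets
    that reply to it, plus the depth of every tweet below the root.
    """
    children = defaultdict(list)
    for replier, replied_to in reply_pairs:
        children[replied_to].append(replier)

    # Walk down from the source tweet to assign a depth to every reply.
    depth = {root: 0}
    stack = [root]
    while stack:
        node = stack.pop()
        for child in sorted(children[node]):
            depth[child] = depth[node] + 1
            stack.append(child)

    tree = {node: sorted(kids) for node, kids in children.items() if kids}
    return tree, depth

# Hypothetical post set: tweets 1 and 2 reply to the source tweet 0,
# and tweet 3 replies to tweet 2.
tree, depth = build_propagation_tree([(1, 0), (2, 0), (3, 2)])
```

Here `tree` lists the direct replies of every tweet that has at least one, and `depth` gives the number of reply hops separating each tweet from the source tweet.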
In this paper, we extend the original definition of the propagation tree to form a propagation graph structure. Each propagation graph consists of several nodes and two different types of relation paths. Each node in the propagation graph corresponds to a tweet in the post set. Relation paths are classified into two types, namely explicit relation paths and implicit relation paths.

For two tweets in a post set, if there exists a reply relationship between them, then the propagation graph has an explicit relation path directed from the node of the replied-to tweet to the node of the replying tweet. For example, as shown in Figure 1(c), there is an explicit relation path from node 2 to node 4 since user 4 replies to user 2.

Definition 5 (Implicit relation path): In order to form a graph structure, we define that each explicit relation path in the propagation graph corresponds to an implicit relation path with the opposite direction. For example, as shown in Figure 1(c), the solid arrows and dotted arrows represent explicit and implicit relation paths, respectively.

The propagation graph is thus defined by the set of all nodes, where each node corresponds to a tweet in the post set, the set of all edges, and the associated mapping functions.

Rumor detection can commonly be seen as a multi-class classification task under the framework of supervised learning. Traditional methods simply classify source tweets as non-rumors or rumors by distinguishing between verified and unverified information. Rumors belong to the unverified information category; nevertheless, they are not necessarily false, and may turn out to be true or false, or may remain unresolved. In this paper, we consider a more fine-grained classification.
We assume that the source tweets can be divided into four classes: 1) non-rumors, namely verified information; 2) true rumors, namely unverified information that turns out to be true; 3) false rumors, namely unverified information that turns out to be false; 4) unconfirmed rumors, namely unverified information that remains unresolved. These four categories are abbreviated as N, T, F and U, respectively (Ma et al. 2017). Take tweets related to the coronavirus disease COVID-19 as an example. The tweet stating that "person-to-person contact is the main method of transmission of the novel coronavirus" is a non-rumor, since many authoritative news media have reported it and the WHO has acknowledged it. The tweet stating that "coronavirus is caused by 5G technology" is a false rumor: it not only lacks factual support, but also contradicts scientific principles.

Therefore, our goal is to learn a classifier from the labeled propagation graph set, where each label takes one of four finer-grained classes: N, T, F and U. Given a propagation graph, the classifier outputs the classification result for its source tweet.

In this section, we elaborate on each module of our proposed models. We first detail the text embedding algorithm adopted in our models in Section 4.1. Then we discuss how to aggregate feature information from the neighborhood with the proposed PGNN algorithm in Section 4.2. Finally, we discuss how to identify rumors based on the representations with two different strategies in Section 4.3.

It is hard to deal with the discrete words in tweets directly, so we embed the content of tweets into a low-dimensional space. Our goal is to learn a mapping function $V \rightarrow H$ in order to generate a representation for each node according to its corresponding tweet content.
A common practice is to represent each word as a vector using the word2vec algorithm, and then take the mean vector as the representation of the whole tweet (Mikolov et al. 2013). However, social platforms generally limit the length of a single post, for example, 280 characters on Twitter and 140 characters on Weibo, and posts on social media often consist of only a few sentences or even a few words, so the word2vec algorithm generally does not work well in such scenarios (Le & Mikolov 2014). In this paper, we use Doc2vec, an extension of the word2vec algorithm, to generate the representations of tweet content (Le & Mikolov 2014). Doc2vec, also known as paragraph2vec or sentence embedding, can generate a powerful representation for a short text or even a single sentence. With the mapping function learned by the Doc2vec algorithm, we obtain an initial representation for each node in the propagation graph. However, such a node representation only considers the textual content, ignoring the information of neighboring nodes and the topological structure of the propagation graph.

In this section, we discuss our proposed PGNN algorithm step by step. The algorithm is based on the framework of the Graph Neural Network (GNN), which can learn useful node representations by encoding local graph structures and node attributes (Kipf & Welling 2017). Our algorithm updates the node representations iteratively. In a single iteration step, neighbor nodes first exchange information via different types of relation paths, and then update their representations by aggregating the neighborhood information and their own information. As a result, the newly obtained node representation contains textual and contextual information simultaneously.
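The exchange of information over the two types of relation paths can be sketched as follows. This is only an illustration under stated assumptions: the helper names, the `(replier, replied_to)` pair format, the per-type weight matrices and the plain summation are hypothetical choices, and the sketch leaves out the gating and attention that the paper adds later.

```python
import numpy as np

def typed_edges(reply_pairs):
    """Turn (replier, replied_to) pairs into explicit and implicit edges.

    An explicit path runs from the replied-to tweet to the replying tweet;
    the matching implicit path runs in the opposite direction.
    """
    explicit = [(replied_to, replier) for replier, replied_to in reply_pairs]
    implicit = [(dst, src) for src, dst in explicit]
    return {"explicit": explicit, "implicit": implicit}

def aggregate_step(h, edges, weights):
    """Sum, for every node, the transformed states of its incoming neighbors.

    h:       (num_nodes, dim) array of current node states
    edges:   dict mapping path type to a list of (source, target) pairs
    weights: dict mapping path type to a (dim, dim) matrix (hypothetical)
    """
    merged = np.zeros_like(h)
    for path_type, pairs in edges.items():
        for src, dst in pairs:
            merged[dst] += h[src] @ weights[path_type]
    return merged

rng = np.random.default_rng(0)
h0 = rng.normal(size=(4, 3))                      # toy initial node states
edges = typed_edges([(1, 0), (2, 0), (3, 2)])     # hypothetical reply pairs
weights = {"explicit": np.eye(3), "implicit": 0.5 * np.eye(3)}
merged = aggregate_step(h0, edges, weights)
```

With these toy weights, a reply receives the full state of the tweet it answers through the explicit path, while the answered tweet receives half of each reply's state through the implicit path.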
For each node $v$ in the propagation graph $G$, its representation $h_v$ is iteratively updated until convergence according to Equation (1); here we omit the subscripts and superscripts of the symbols $v$ and $G$ for simplicity, since the update approach is common to different nodes in different propagation graphs.

Although this iterative scheme converges exponentially fast to the final result under a constraint on the parameters (Zhou et al. 2018), node representations still need to be updated many times following the recurrence in Equation (1) before convergence, which takes a lot of computation time. Inspired by (Li et al. 2016), we introduce a gating mechanism into the update process in Equation (1). The benefit is that the constraint on the parameters that ensures convergence can be removed, which solves the problem of slow convergence and reduces the time complexity. The gated recurrent neural network has two common variants: 1) long short-term memory (LSTM) (Hochreiter & Schmidhuber 1997), and 2) gated recurrent units (GRU). In this work, we use GRU as the hidden unit rather than LSTM for efficiency, since GRU has a simpler architecture and achieves almost the same performance with fewer parameters.

The modified update process from step $t-1$ to step $t$ is given by Equations (2)~(6), where $h_v^{(t-1)}$ and $h_v^{(t)}$ refer to the previous state at step $t-1$ and the current state at step $t$ of node $v$, respectively. The merged information from the neighbors of node $v$ via incoming edges, with parameters dependent on the relation path, is used as the input of the GRU unit at step $t$. The candidate state is a provisional value of the hidden state $h_v^{(t)}$. The weight matrices of the GRU unit are trainable parameters, and $\odot$ denotes element-wise multiplication. The reset gate $r_v^{(t)}$ determines how much of the previous hidden state is included in the current candidate state. The update gate determines how to obtain the current hidden state from the candidate state and the previous state.
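The gating described above can be sketched with the standard GRU equations. The parameter names (`Wz`, `Uz`, `Wr`, `Ur`, `Wh`, `Uh`), the omission of bias terms and the convention used for the update gate are assumptions of this sketch; the paper's own Equations (2)~(6) may be parameterized differently.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_node_update(a, h_prev, params):
    """One GRU-style update of all node states.

    a:      (num_nodes, dim) merged neighbor information at step t
    h_prev: (num_nodes, dim) hidden states at step t-1
    params: dict of (dim, dim) matrices Wz, Uz, Wr, Ur, Wh, Uh (hypothetical names)
    """
    z = sigmoid(a @ params["Wz"] + h_prev @ params["Uz"])             # update gate
    r = sigmoid(a @ params["Wr"] + h_prev @ params["Ur"])             # reset gate
    h_cand = np.tanh(a @ params["Wh"] + (r * h_prev) @ params["Uh"])  # candidate state
    return (1.0 - z) * h_prev + z * h_cand                            # new hidden state

rng = np.random.default_rng(0)
dim = 3
params = {name: rng.normal(scale=0.1, size=(dim, dim))
          for name in ("Wz", "Uz", "Wr", "Ur", "Wh", "Uh")}
h_prev = rng.normal(size=(4, dim))   # toy states of 4 nodes
a = rng.normal(size=(4, dim))        # toy merged neighbor information
h_new = gru_node_update(a, h_prev, params)
```

Because the new state is an element-wise blend of the previous state and the candidate state, the update gate directly controls how quickly neighborhood information accumulates in each node.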
GRU uses two different types of activation functions. The sigmoid function $\sigma(x) = 1 / (1 + e^{-x})$ is used to limit the values of the reset gate and the update gate between 0 and 1. The hyperbolic tangent $\tanh(x) = (e^{x} - e^{-x}) / (e^{x} + e^{-x})$ is used to limit the value of the candidate state between -1 and 1. To obtain the current hidden state of node $v$, the GRU unit selectively adds the information aggregated from the neighbors of node $v$ and selectively forgets the previous hidden state of node $v$. There are two benefits of including the gating mechanism: 1) the speed at which information accumulates can be controlled, and 2) the noise due to excessive iterations can be reduced.

Figure 2 shows the update process of node $v$ (the black node) according to Equations (2)~(6); the direction and type of the relation paths are not discussed here for the sake of simplicity. In the 0-th step, all nodes contain only their own content information. From the 0-th step to the first step, node $v$ aggregates the information from its first-order neighbors and uses the GRU unit to update its hidden state; the newly obtained hidden state $h_v^{(1)}$ then contains the information of the 1-hop neighborhood of node $v$, and the same is true for the other nodes. From the first step to the second step, node $v$ repeats the update process; since each node in $IN(v)$ (grey nodes) also contains information of its first-order neighbors, node $v$ extends its receptive field from 1-hop to 2-hop. By induction, we conclude that the hidden state $h_v^{(t)}$ contains the information of the $t$-order neighbors. The dash-dotted circle in Figure 2 shows the receptive field of node $v$ at step $t$.

A parameter in Equation (2) is used to balance the weights of different types of relation paths. However, we argue that the weights of relation paths of the same type should also differ. For example, after a tweet is posted, users reply to it to express their stance on its veracity. Obviously, comments with different stances should not be treated equally.
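The receptive-field argument illustrated by Figure 2 can be checked on a toy graph: after each update step, a node has seen exactly the nodes that lie one hop further away. The function name and the chain-shaped graph below are hypothetical and serve only to make the induction concrete.

```python
def receptive_field(neighbors, node, steps):
    """Return the set of nodes whose information reaches `node` after `steps` updates.

    neighbors: dict mapping each node to the nodes it receives messages from
    """
    reached = {node}
    for _ in range(steps):
        # Every node already reached pulls in its own incoming neighbors.
        reached = reached | {u for v in reached for u in neighbors.get(v, ())}
    return reached

# Hypothetical chain 0 - 1 - 2 - 3 with messages flowing in both directions.
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
fields = [receptive_field(neighbors, 0, t) for t in range(4)]
```

On this chain, node 0 sees only itself at step 0 and gains one additional node per step, so the number of update steps bounds how far information can travel in the propagation graph.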
We believe that the weight of a relation path is related to the hidden states of the nodes at both ends of the path, so we include an attention mechanism to dynamically adjust the weights of relation paths. The attention mechanism was originally proposed to solve the problem of machine translation. Nowadays, it is widely used in various deep learning models. An attention mechanism forces a model to focus selectively on specific information while ignoring irrelevant information. Take machine translation as an example: when translating a word, only certain words around the target word may be relevant, so there is no need to pay attention to all the words in the paragraph. We include the attention mechanism to adjust the weights when aggregating the information from neighbors following Equation (2). The modified equation incorporates an attention score, which is calculated according to Equation (8).

After updating $T$ times according to the above process, the final representation of each node in the propagation graph is obtained. However, these representations have limited expressive capability, since they are learned by a graph neural network with a single layer. (LeCun et al. 1998) used a multi-layer convolutional neural network to extract more abstract information from images. Inspired by this idea, our proposed algorithm also uses a multi-layer architecture. The update procedure in each layer is the same and follows Equations (2)~(6), but the parameters in different layers may be entirely different, which forces each layer to focus on different information when updating node representations. The relation between node representations in adjacent layers is that the representations obtained in one layer are used as the initial states of the next layer.

In the PGNN algorithm, the number of layers and the number of update times in each layer are given, and the output is the set of all node representations after the updates of all layers.
Use the attention mechanism to adjust the weights of different incoming edges of node v with the same type of relation path; the attention score is calculated according to Equation (8)
15. End for
16. Aggregate the information from the neighbors of node v according to Equation (2)

Steps 2~3 are initialization settings: the text representations obtained by the Doc2vec algorithm are used as the initial states of the nodes in the propagation graph. Steps 7~19 form a single update process. Steps 7~9 show that the node representations obtained in the previous layer are used as the initial states of the nodes in the current layer. Steps 13~15 calculate the attention scores based on the nodes at both ends of the relation paths. Steps 16~17 aggregate information from neighbors via different types of relation paths and update the hidden states of the nodes in the propagation graph according to the gating mechanism.

In this section, we discuss how to identify different types of rumors based on the representations calculated by the PGNN algorithm. We propose two ideas for rumor classification:

1) Detecting rumors based on the propagation graph can be regarded as a graph classification task. A common practice is to represent the whole graph with a single embedding (Bacciu et al. 2019), to which a standard classifier can then be applied to output a graph prediction. Based on this, we come up with the first idea: we first aggregate the representations of all nodes from a global perspective to obtain a global graph embedding for the whole propagation graph, and then use the global embedding as the input of a fully connected neural network to get the prediction result.

2) Our second idea comes from ensemble learning (Friedman et al. 2001). The main idea of ensemble learning is to predict separately with multiple individual learners and then combine the results with a certain strategy.
As discussed in Section 4.2, the representation of each individual node contains the information of its neighbors, so we can first calculate the prediction probability for each node separately, and then combine these probabilities to obtain the final classification result.

For the first idea, the simplest approach is to take the mean of all node representations as the global embedding for the entire propagation graph, as shown in Equation (10), where $g$ denotes the global representation for the entire propagation graph, $W_g$ and $b_g$ are weights and bias respectively, and $T$ is the symbol denoting transpose.

As shown in Equation (11), the obtained global representation $g$ is fed into a fully-connected layer with a softmax activation function to generate the prediction for the source tweet:

$$\hat{y} = \mathrm{softmax}(F(g)) \qquad (11)$$

where $F$ is a fully-connected layer whose input and output dimensions are $d$ and $k$ respectively; $d$ is determined by the dimension of the global graph embedding $g$, and $k$ is determined by the number of rumor categories. In this paper, we assume that the source tweets can be divided into four classes: 1) non-rumors, 2) true rumors, 3) false rumors, 4) unconfirmed rumors; therefore the value of $k$ is 4. If the source tweets are simply classified as rumors and non-rumors, the value of $k$ is 2. $\hat{y}$ denotes the prediction result.

For the second idea, a simple way is to first calculate the individual prediction probability from each node representation, and then use linear summation to obtain the final result. The second idea can be formalized as Equation (12), where $\sigma(\cdot)$ is the sigmoid function, which is used to limit the output between 0 and 1, and the remaining symbols are the weight and bias, respectively.

There is room for improvement in the above two methods. Intuitively, the nodes in the propagation graph should not be treated equally. For the first idea, Equation (10) is based on the premise that each node representation makes an equal contribution to the graph embedding.
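The two unweighted classification strategies can be sketched side by side. This is a simplified illustration, assuming a single linear layer for each strategy and hypothetical names (`global_mean_predict`, `node_average_predict`, `W`, `b`); the exact form of the paper's Equations (10) and (12) is not reproduced here.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def global_mean_predict(H, W, b):
    """Mean-pool node states into one graph embedding, then classify it."""
    g = H.mean(axis=0)                 # global graph embedding
    return softmax(g @ W + b)          # fully-connected layer + softmax

def node_average_predict(H, W, b):
    """Score every node separately, then average the per-node scores."""
    per_node = sigmoid(H @ W + b)      # (num_nodes, k) individual predictions
    return per_node.mean(axis=0)

rng = np.random.default_rng(0)
H = rng.normal(size=(5, 8))            # toy graph: 5 nodes, 8-dimensional states
W = rng.normal(size=(8, 4))            # k = 4 classes: N, T, F, U
b = np.zeros(4)
y_global = global_mean_predict(H, W, b)
y_nodes = node_average_predict(H, W, b)
```

Both functions give every node the same influence on the result, which is exactly the limitation that motivates weighting the nodes.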
The global graph embedding is obtained by aggregating all nodes linearly. However, a more sophisticated aggregation strategy may improve the classification performance (Bacciu et al. 2019). In our work, we aggregate the node representations into the graph embedding with an attention mechanism. For the second idea, Equation (12) simply averages the prediction results, which can be seen as a majority voting strategy. In contrast, the weighted voting strategy in ensemble learning assigns each classifier a weight so that the prediction results can be treated differently. In our work, we use the attention mechanism to dynamically adjust the contribution that each node makes to the classification result, so that the prediction based on each individual node representation can be treated differently.

For the first idea, the weight of each node is calculated based on the difference between the single node representation and the mean of all node representations. We replace Equation (10) with Equations (13) and (14).

For the second idea, we first calculate the individual prediction probability from each node representation, and then compare the difference between the single node representation and the mean in order to determine the contribution each node makes to the final classification result. We modify Equation (12) to Equation (15).

In this paper, we propose two models for rumor detection. We name the first model GLO-PGNN, where PGNN stands for the propagation graph neural network and GLO stands for global graph embedding. GLO-PGNN first obtains the node representations using the propagation graph neural network, and then identifies rumors based on the global graph embedding. We name the second model ENS-PGNN, where PGNN again stands for the propagation graph neural network and ENS stands for ensemble learning.
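The idea of weighting each node by how much its representation differs from the mean can be illustrated with a small sketch. The scoring rule used here, a softmax over Euclidean distances to the mean, is an assumption chosen for simplicity; it is not the paper's attention formula, which is not reproduced in this text.

```python
import numpy as np

def deviation_attention(H):
    """Weight each node by how far its state lies from the mean state.

    The distance-then-softmax scoring is an illustrative assumption.
    """
    mean = H.mean(axis=0)
    scores = np.linalg.norm(H - mean, axis=1)   # one score per node
    e = np.exp(scores - scores.max())
    return e / e.sum()

def attention_pool(H):
    """Attention-weighted sum of node states (a weighted global embedding)."""
    alpha = deviation_attention(H)
    return alpha @ H, alpha

H = np.array([[0.0, 0.0],
              [0.1, 0.0],
              [0.0, 0.1],
              [3.0, 3.0]])     # toy states; the last node is an outlier
g, alpha = attention_pool(H)
```

In this toy example the outlying node receives the largest weight, so it dominates the pooled embedding instead of being averaged away.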
ENS-PGNN uses the propagation graph neural network to learn the node representations, and then classifies rumors with the weighted voting strategy commonly used in ensemble learning. Both proposed models can be trained in an end-to-end fashion. Experiments show that, with the attention mechanism, the two models pay more attention to the nodes whose representations are very different from the mean value, which can significantly improve the classification performance.

In this section, we discuss the time complexity and space complexity of the two proposed models. For deep learning algorithms, the prediction time matters more than the training time. Therefore, we only estimate how long it takes the proposed models to detect one rumor. The GLO-PGNN and ENS-PGNN algorithms differ only in the approach used for classification.

For the space complexity, we only consider the number of trainable parameters. In the PGNN algorithm, Equation (2) shows that each type of relation path has its own parameters. There are two types of relation paths, which together determine the number of trainable parameters in Equation (2).

The loss function of our proposed algorithm consists of two parts: 1) the cross-entropy loss between the ground truth and the predicted probability distributions; 2) the regularization term on the parameters. The loss function is calculated as shown in Equation (16). (Goller & Kuchler 1996).

There are many optimizers for training neural network based models, such as SGD, AdaGrad, RMSProp and Adam (Yu et al. 2020, Kingma & Ba 2014, Shao et al. 2019). SGD and some of its variants, such as batch gradient descent and mini-batch gradient descent, are very popular for optimizing neural networks.
However, they all suffer from some drawbacks: 1) it is hard to choose a proper learning rate at the beginning; 2) all parameters are updated with the same learning rate, although we may not want to update all parameters to the same extent, especially when the data is sparse and the features have very different frequencies; 3) they all face the problem of getting trapped in the numerous suboptimal local minima and saddle points. To solve the first and second problems, optimization algorithms that adaptively adjust the learning rate have been proposed. Take AdaGrad as an example: it adapts the learning rate to different parameters, using a small learning rate for frequently occurring features and a large learning rate for infrequently occurring features. To alleviate the third problem, optimization algorithms based on moving averages of gradient statistics have been proposed, such as RMSProp. RMSProp adjusts the learning rate according to the exponential moving average of squared gradients, which can accelerate convergence and reduce oscillation near saddle points or local minima. To combine the advantages of AdaGrad and RMSProp, the Adam algorithm was proposed. It addresses the above drawbacks of SGD by jointly considering the first-moment and second-moment estimates of the gradients. Adam can adjust the learning rate adaptively and, to some extent, avoid suboptimal local minima, which makes it well suited to our models. In our work, we use the Adam optimizer to speed up the training process.

In order to verify the performance of the proposed models, we performed a comparative analysis on a publicly available Twitter dataset. We evaluated the detection performance, prediction time and early detection performance of our models, and compared them with other models to illustrate the superiority of GLO-PGNN and ENS-PGNN.
We also discussed how to select particular values for different parameters and how to select the optimal models.

For experimental evaluation, we use the recently released Twitter dataset PHEME (Kochkina et al. 2018), which consists of tweet data and user data from the Twitter platform. The dataset contains 6425 post sets related to 9 events; each post set consists of a source tweet and several responsive tweets. Each post set is annotated as either rumor or non-rumor, and rumors in the dataset are further labeled as true, false or unverified by professional journalists. As shown in Table 1, the total number of tweets in the PHEME dataset is 105354. The numbers of non-rumors, false rumors, true rumors and unverified rumors are 4023, 638, 1067 and 697 respectively. The average number of tweets in each post set is 16. In the original PHEME dataset, the number of non-rumors is much larger than the number of rumors, and some post sets contain very few tweets. Therefore, we filter out post sets containing fewer than 4 tweets, and then randomly sample 800, 400, 600 and 500 post sets respectively from the four finer-grained classes: non-rumor, false rumor, true rumor and unverified rumor. We use the sampled post sets to form a new dataset. During training and testing, we construct each post set in the new dataset into a propagation graph in the way proposed in section 3.1.

In this section, we first detail the evaluation metrics and the algorithms selected for comparison, and then we discuss how to select particular values for different parameters and how to select the optimal models. Owing to the imbalanced class prevalence, it is unreasonable to evaluate methods based solely on accuracy, because methods that tend to classify unknown examples into the majority class would be considered to have better performance. In this work, we use accuracy, Micro-F1 and Macro-F1 to evaluate the performance.
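The three metrics named above can be illustrated with a short sketch. It assumes the standard definitions: per-class F1 is 2PR/(P+R), Macro-F1 is the unweighted mean of the per-class F1 scores, and Micro-F1 is computed from counts pooled over all classes. The function name `f1_scores` and the toy labels are hypothetical and are not taken from the paper.

```python
from collections import Counter

def f1_scores(y_true, y_pred):
    """Return (accuracy, micro_f1, macro_f1) for single-label predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # class p was predicted wrongly
            fn[t] += 1  # class t was missed

    def f1(tp_, fp_, fn_):
        # F1 = 2PR / (P + R) rewritten in terms of raw counts.
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    accuracy = sum(tp.values()) / len(y_true)
    return accuracy, micro, macro

# Toy imbalanced example: a classifier that always predicts the majority class.
y_true = ["non-rumor"] * 8 + ["false rumor"] * 2
y_pred = ["non-rumor"] * 10
accuracy, micro_f1, macro_f1 = f1_scores(y_true, y_pred)
```

In this toy case accuracy and Micro-F1 are both 0.8, while Macro-F1 drops to about 0.44 because the minority class is never predicted. This is why Macro-F1 is informative when classes are imbalanced.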
F1 score can be calculated following Equation (17).

We have made a detailed comparison of our proposed methods and some of the state-of-the-art baselines on the rumor classification task. The methods involved are as follows:
RFC: Kwon et al. (2013) used the random forest classifier to classify rumors, with inputs being several temporal properties and handcrafted features based on user information, tweet content and propagation characteristics.
SVM-BOW: Mihalcea & Strapparava (2009) used the Bag-of-Words model to represent text based on the frequency of words appearing in the text. The obtained representations were then used as the inputs of a linear SVM model for the lie detection task.
SVM-TS: Ma et al. (2015) divided the time series into fixed time intervals, and captured the temporal characteristics by comparing the difference of social context information between adjacent time intervals. The temporal characteristics, together with several handcrafted features based on tweet content, user information and propagation structure, were used as inputs of an SVM model for the rumor detection task.
GRU-RNN: The method proposed by Ma et al. (2016) used the recurrent neural network with GRU units to obtain the representation of the post set by modeling the sequential dynamics of responsive tweets. This method is also a simplified form of the MT-ES method proposed by Ma et al. (2018a) (with the user stance classification module removed).
BU-RvNN and TD-RvNN: Two variants of the RvNN model proposed by Ma et al. (2018b). Both methods construct the propagation tree following the non-sequential propagation structure of tweets in each post set. BU-RvNN captures the structural properties by visiting all nodes recursively in a bottom-up manner, and then classifies rumors based on the representation of the single root. TD-RvNN visits nodes in a top-down manner, and identifies rumors based on the representations of leaf nodes via max pooling.
GLO-PGNN(basic) and ENS-PGNN(basic): These two models are simplifications of GLO-PGNN and ENS-PGNN. They both use the PGNN algorithm to obtain the representations of all nodes in the propagation graph. GLO-PGNN(basic) calculates the global graph embedding by simply averaging the node representations and then uses the fully connected network for classification, which is shown in Equation (10) and Equation (11). ENS-PGNN(basic) calculates the prediction probability for each node based on its representation and then combines the prediction results with the majority vote strategy, which is shown in Equation (12).
GLO-PGNN and ENS-PGNN: GLO-PGNN is an improvement on GLO-PGNN(basic). The difference is that GLO-PGNN aggregates the node representations with the attention mechanism to obtain the global graph embedding, which is shown in Equation (13) and Equation (14). ENS-PGNN is an improvement on ENS-PGNN(basic). The difference is that ENS-PGNN combines the prediction results with the weighted voting strategy, which is shown in Equation (15).

For the algorithms selected for comparison, we use the default values of the parameters given in the cited papers. If the values of the parameters are not given explicitly, we tune the parameters following the suggestions in the cited papers to find the optimal values. For the SVM-BOW method, when generating representations for text content, only the top 1000 words with the highest frequency in the vocabulary are considered. Therefore, the dimension of the word frequency vector is 1000. We use the RBF kernel as the kernel function of the SVM-BOW method. The penalty coefficient is set to 1.0 and the gamma value of the RBF kernel function is set to 0.1. For the RFC method, we use 500 decision trees to construct the Random Forest classifier. The max depth of a single tree is set to 3. The minimum number of samples in each leaf node is set to 1. The minimum number of samples required for splitting is set to 2.
For the SVM-TS method, we divide the entire time series into 10 time intervals, since the average number of tweets in each post set is 16. For the sake of fairness, all neural network based methods use the doc2vec algorithm to learn representations of tweet content. The doc2vec algorithm has two variants: 1) the Distributed Memory Model of Paragraph Vectors (PV-DM); 2) the Distributed Bag of Words version of Paragraph Vector (PV-DBOW). We use both variants to learn 20-dimensional representations respectively, and then concatenate them to get the 40-dimensional representation of the tweet content. It should be noted that tweets contain some uninformative characters (e.g., the mention marker "@" and the hashtag marker "#") and hyperlinks. We filter out this uninformative content in the data cleaning phase. The GRU-RNN, BU-RvNN, TD-RvNN and our proposed methods all include the GRU network; we set the number of layers of the GRU network to 1 and the size of the GRU unit to 40. The balance parameter and the learning rate of all neural network based methods are set to 0.01 and 0.05 respectively. The optimization techniques commonly used in deep learning methods, such as dropout (Srivastava et al. 2014) and normalization, are also used to implement the algorithms. proposed methods respectively as M increases. For the GLO-PGNN method, the optimal value of M is 2, and for the ENS-PGNN method, the optimal value of M is 3. We assume that the number of updates per layer is the same for the sake of simplicity. However, the number of updates can be different in different layers.

We implement the doc2vec algorithm using gensim. We implement the RFC, SVM-BOW and SVM-TS models using scikit-learn. We implement all neural network based models using TensorFlow. For the RFC, SVM-BOW and SVM-TS models, we hold out 90% and 10% of the examples in the dataset for training and testing respectively. We perform 10-fold cross-validation throughout all experiments.
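As a rough illustration of the 10-fold protocol, the sketch below shuffles example indices once, cuts them into 10 folds, and lets each fold serve exactly once as the evaluation part. The function name `k_fold_indices`, the fixed seed and the example size of 2300 (the total number of sampled post sets, used here only for illustration) are assumptions rather than details confirmed by the paper.

```python
import random

def k_fold_indices(n_examples, k=10, seed=0):
    """Yield (train_idx, eval_idx) pairs; every example is held out exactly once."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)       # reproducible shuffle
    folds = [indices[i::k] for i in range(k)]  # k folds of near-equal size
    for i in range(k):
        eval_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, eval_idx

# Illustrative use: 10 rounds, each training on 9 folds and evaluating on 1.
splits = list(k_fold_indices(2300, k=10))
```

The final score would then be the average of the per-round scores, which matches the averaging of F1 scores over the 10 tests.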
We randomly split the training set into 10 distinct subsets called folds, then train and evaluate the models 10 times, picking a different fold for evaluation every time and training on the other 9 folds. Each time we first train the model against the training set and select the optimal model against the validation set, then we test the optimal model against the testing set. We take the average of the F1 scores in the 10 tests as the final result to measure the performance of each model. For neural network based methods, we hold out 80%, 10% and 10% of the examples in the dataset for training, validation and testing respectively. We select the optimal model as follows: at the beginning of each epoch, the training set is randomly divided into a number of batches, which are used as inputs to the model successively. The parameters of the model are updated by using back-propagation to minimize the loss. At the end of each epoch, we evaluate the model against the validation set. We repeat this procedure until the best validation performance has not improved for 10 consecutive epochs. When the training phase is over, we measure the performance of the optimal model against the testing set.

We classify all the compared methods into three groups. The first group consists of traditional methods, including SVM-BOW, RFC and SVM-TS. The second group contains some state-of-the-art baselines based on neural networks, including GRU-RNN, BU-RvNN and TD-RvNN. Our proposed methods are in the third group. As shown in Table 2, our proposed methods achieve much better performance than the other methods. It is observed that the 3 methods in the first group achieve poor performance, fluctuating around 0.5 in accuracy, which is far lower than the methods in the second and third groups. This disappointing result demonstrates that the traditional methods based on handcrafted features fail to identify rumors effectively.
RFC performs relatively better than SVM-BOW and SVM-TS, since RFC comprehensively considers the temporal dynamics of user information, linguistic properties and propagation structure. SVM-TS additionally considers the difference of social context information between adjacent intervals; however, the average number of tweets per post set in the PHEME dataset is only 16, which hinders the SVM-TS model from capturing the temporal dynamics of features. SVM-BOW performs slightly worse than RFC since it classifies rumors based solely on the linguistic information captured from tweet content.

In the second group, the neural network based methods improve on the traditional methods in the first group by more than 25%. This significant improvement indicates that neural network based methods do have better generalization capabilities and learn discriminative features automatically. Among the 3 methods in the second group, GRU-RNN is inferior to the two recursive models. This is because GRU-RNN only considers the temporal dynamics of tweet content, while TD-RvNN and BU-RvNN additionally take into account the non-sequential propagation structure of tweets. As shown in Table 2, TD-RvNN is superior to BU-RvNN, which is consistent with the experimental results in the original paper (Ma et al. 2018b). Intuitively, BU-RvNN suffers from much larger information loss because it classifies rumors based on the representation of the single root, while TD-RvNN comprehensively considers the representations of all leaf nodes. Although TD-RvNN and BU-RvNN achieve decent performance, they are still inferior to our proposed methods.
TD-RvNN and BU-RvNN can only disseminate information in one direction, in either a top-down or a bottom-up manner. To overcome this shortcoming, our proposed methods extend the propagation tree to the propagation graph by adding relation paths, and use a graph neural network to learn more powerful representations.

In the third group, GLO-PGNN(basic) and ENS-PGNN(basic) both achieve better performance than TD-RvNN and BU-RvNN, suggesting that the propagation graph neural network effectively captures the temporal and topological characteristics of rumors. The two methods differ only in the strategy used for classification. GLO-PGNN(basic) first calculates the global graph embedding of the whole propagation graph by averaging all the node representations, and then uses the fully-connected network for classification. ENS-PGNN(basic) first calculates the prediction probability for each node based on its representation, and then combines the prediction results with the majority vote strategy. GLO-PGNN and ENS-PGNN yield the highest and second-highest F1 scores, suggesting that our proposed methods are more effective on the rumor detection task. GLO-PGNN is an improvement on GLO-PGNN(basic). The difference is that GLO-PGNN takes a weighted average of the node representations to obtain the global graph embedding. Compared with GLO-PGNN(basic), GLO-PGNN achieves a 1.6% improvement in terms of Micro-F1 and a 1.4% gain in terms of Macro-F1. ENS-PGNN is an improvement on ENS-PGNN(basic). The difference is that ENS-PGNN combines the prediction results with a weighted voting strategy. Compared with ENS-PGNN(basic), ENS-PGNN achieves a 1.7% improvement in terms of Micro-F1 and a 1.8% gain in terms of Macro-F1. Both GLO-PGNN and ENS-PGNN assign different weights to nodes in the propagation graph. The value of the weight is determined by the attention mechanism, as shown in Equations (13)-(15).
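The four classification strategies compared above can be contrasted in a small sketch. Equations (10)-(15) are not reproduced in this part of the text, so the code below only assumes that each node receives an unnormalized attention score and that the scores are normalized with a softmax; the `scores` argument and all function names are hypothetical, and the real scoring function of the paper may differ.

```python
import math

def softmax(scores):
    """Normalize raw scores into weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def mean_pool(node_reprs):
    """GLO-PGNN(basic)-style readout: plain average of node representations."""
    n = len(node_reprs)
    return [sum(col) / n for col in zip(*node_reprs)]

def attention_pool(node_reprs, scores):
    """GLO-PGNN-style readout: attention-weighted average of node representations."""
    weights = softmax(scores)
    return [sum(w * x for w, x in zip(weights, col)) for col in zip(*node_reprs)]

def majority_vote(node_probs):
    """ENS-PGNN(basic)-style decision: each node votes for its most likely class."""
    votes = [max(range(len(p)), key=p.__getitem__) for p in node_probs]
    return max(set(votes), key=votes.count)  # ties are broken arbitrarily

def weighted_vote(node_probs, scores):
    """ENS-PGNN-style decision: attention-weighted sum of node class probabilities."""
    weights = softmax(scores)
    n_classes = len(node_probs[0])
    combined = [sum(w * p[c] for w, p in zip(weights, node_probs))
                for c in range(n_classes)]
    return max(range(n_classes), key=combined.__getitem__)

# Toy example: two lukewarm nodes favor class 0, one highly weighted node favors class 1.
node_probs = [[0.60, 0.40], [0.55, 0.45], [0.05, 0.95]]
scores = [0.0, 0.0, 2.0]
basic_label = majority_vote(node_probs)             # -> 0
weighted_label = weighted_vote(node_probs, scores)  # -> 1
```

The toy example shows how weighting can change the outcome: the unweighted vote follows the two weakly confident nodes, while the weighted vote follows the single node that receives most of the attention.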
The improvement in terms of both Macro-F1 and Micro-F1 scores justifies the inclusion of the attention mechanism. We can conclude that using the attention mechanism to dynamically adjust the node weights contributes to improving the detection performance. Besides, it is observed that GLO-PGNN is superior to ENS-PGNN in terms of most evaluation metrics. The Micro-F1 score achieved by GLO-PGNN is about 1.1% higher than that of ENS-PGNN. The Macro-F1 score achieved by GLO-PGNN is about 1.5% higher than that of ENS-PGNN. However, ENS-PGNN has fewer trainable parameters, so it requires less storage space, especially when the representation has a large dimension.

According to the analysis, we conclude that taking into consideration the temporal and topological characteristics of rumors contributes to improving the detection performance significantly. The effectiveness of the attention mechanism indicates that rumors have different influences at different stages. In this work, we adjust the weight of each node with the attention mechanism to quantify the influence, which improves the detection performance slightly. In addition, for researchers who use a large dimension for the node representation or have limited computing resources, we recommend ENS-PGNN. Compared to GLO-PGNN, ENS-PGNN has slightly inferior performance, but requires less storage space.

To demonstrate the efficiency of our proposed models, we compared the prediction time of all deep learning based models. We randomly selected 400 examples from the testing data and recorded the running time taken by each model to identify these examples. As shown in Figure 6, the X-axis represents the names of the compared models and the Y-axis represents the prediction time. BU-RvNN and TD-RvNN have more complex structures and therefore take more time to predict. It is observed that our proposed models have the lowest prediction time. In GRU-RNN, the input of a GRU unit depends on its predecessor.
In BU-RvNN and TD-RvNN, the input of a GRU unit depends on its parent node or child nodes. The GRU cells are highly dependent on each other, so they cannot be parallelized. However, our models can update the node representations in parallel because the representation of each node is only related to its k-hop neighborhood. In each layer of the PGNN algorithm, all nodes can use the GRU unit to update their representations simultaneously. In summary, the prediction time of our proposed models is very short due to parallelization, indicating that GLO-PGNN and ENS-PGNN identify rumors very efficiently.

Detecting rumors at an early stage of propagation is very important, so that timely mitigation can be made to prevent the adverse impact caused by rumors from further expanding. We evaluate the early stop performance of different methods in terms of different time delays, measured by either the tweet count received or the time elapsed since the source tweet was posted. For each method, we first obtain the optimal model following the training approaches mentioned in section 5.2.4. Then we evaluate the performance of each optimal model by accuracy as we incrementally increase the tweet count or elapsed time. Figure 7 shows the statistics of all post sets in the testing set with respect to tweet count and elapsed time. The X-axis in Figure 7 (a) represents the tweet count and the Y-axis represents the percentage. The X-axis in Figure 7 (b) represents the elapsed time and the Y-axis represents the percentage. As shown in Figure 7 (a), more than 90% of the post sets contain fewer than 60 tweets. As shown in Figure 7 (b), around 90% of the post sets have a time span not exceeding 25h. Therefore, we increase the tweet count from 2 to 50 and increase the elapsed time from 0.02h to 20h when evaluating the early stop performance.
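The checkpoint-based evaluation described above amounts to showing each trained model only the part of a post set that is visible at a given delay. The sketch below is one possible way to build such truncated inputs; the tuple layout `(hours_since_source, parent_index, text)`, the function name `truncate_post_set` and the toy post set are hypothetical and are not taken from the paper or the PHEME format.

```python
def truncate_post_set(tweets, max_count=None, max_hours=None):
    """Return the tweets visible at a checkpoint.

    `tweets` is a list of (hours_since_source, parent_index, text) tuples sorted
    by time, with the source tweet first (its parent_index is None).
    A reply is always posted after its parent, so a time-ordered prefix keeps
    every remaining parent_index valid.
    """
    kept = []
    for hours, parent, text in tweets:
        if max_hours is not None and hours > max_hours:
            break
        if max_count is not None and len(kept) >= max_count:
            break
        kept.append((hours, parent, text))
    return kept

# Toy post set (contents are illustrative only).
post_set = [
    (0.0, None, "source tweet"),
    (0.1, 0, "reply A"),
    (0.3, 0, "reply B"),
    (1.5, 1, "reply to A"),
    (26.0, 2, "late reply to B"),
]
first_two = truncate_post_set(post_set, max_count=2)     # tweet-count checkpoint
first_hour = truncate_post_set(post_set, max_hours=1.0)  # elapsed-time checkpoint
```

Each truncated post set would then be turned into a propagation graph and classified by the already trained model, and accuracy is recorded per checkpoint.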
Figure 8 (a) and Figure 8 (b) show the early stop performance of several neural network based methods at different checkpoints in terms of tweet count and elapsed time respectively. As shown in Figure 8 (b), the accuracy curves are hard to distinguish at a very early stage. For a better observation, Figure 8 (c) takes 1 hour as the cutoff point and shows the early stop performance of each model in two different time periods. We only consider the neural network based methods since they perform much better than traditional methods on the rumor detection task. It is observed that our proposed models demonstrate early detection performance superior to that of the other models. As shown in Figure 8 (a), the accuracy of each model climbs rapidly as the tweet count incrementally increases from 1 to 10. When the number of tweets exceeds 15, the growth rate of accuracy gradually decreases. The accuracy of models considering the propagation structure tends to be stable when the tweet count exceeds 30, while the accuracy of GRU-RNN tends to be stable much earlier. As shown in Figure 8 (c), the performance of each model improves rapidly within the first 0.3h after the source tweet is posted and tends to be stable when the elapsed time exceeds 1h. TD-RvNN, GLO-PGNN and ENS-PGNN perform much better than GRU-RNN because they take into consideration the non-sequential propagation structure of tweets. TD-RvNN, GLO-PGNN and ENS-PGNN only need around 10 tweets or about 0.4h to reach the best performance of GRU-RNN. When the tweet count is very small, the performance of TD-RvNN can catch up with that of ENS-PGNN. However, the superiority of GLO-PGNN and ENS-PGNN gradually manifests as the tweet count increases. In summary, our proposed models achieve the highest prediction accuracy even when the tweet count is very small or the elapsed time is very short, indicating that GLO-PGNN and ENS-PGNN possess excellent early detection performance.
In this section, a case is analyzed in detail to give an intuitive explanation of how our proposed models identify rumors. As an example, we take the source tweet whose tweet id is 544307028815253504, which is correctly classified as a false rumor by our proposed models. Figure 9 (a) shows the partial propagation structure of the source tweet and its responsive tweets. Figure 9 (b) shows the tweet content. Each node in Figure 9 (a) corresponds to the textual content with the same index in Figure 9 (b). Since the user information in the dataset is desensitized due to privacy issues, we use user 0, 1, … to represent the user id. As shown in Figure 9, user 0 released a source tweet whose tweet id is 544307028815253504, and users 1, 2, …, 16 express different kinds of attitudes towards the source tweet, which include supporting, denying, querying and commenting. We use S, D, Q and C as the abbreviations of these stances respectively. As shown in Figure 9, we manually label the user stance by inferring it directly from the tweet content for simplicity. However, a better approach is to classify users' sentiments automatically with the help of a Convolutional Neural Network (Yoo et al. 2018), especially when analyzing the entire propagation graph. We construct the propagation graph based on the propagation structure in Figure 9. We use the PGNN algorithm to generate representations for all nodes and calculate the similarity between each node representation and the mean. The similarity is measured by cosine similarity and the results are shown in Figure 10. The X-axis in Figure 10 represents the node index and the Y-axis represents the similarity. We find that the representations of node 8 and node 14 (the diamond points in Figure 10) differ greatly from the mean.
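The comparison of each node representation with the mean can be sketched as follows. The code assumes that the similarity is the cosine similarity between a node vector and the element-wise mean of all node vectors; the function names and the toy 2-dimensional vectors are hypothetical and do not come from actual PGNN output.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def similarity_to_mean(node_reprs):
    """Similarity of every node representation to the mean representation."""
    n = len(node_reprs)
    mean = [sum(col) / n for col in zip(*node_reprs)]
    return [cosine_similarity(r, mean) for r in node_reprs]

# Toy representations: the last node points in a very different direction.
reprs = [[1.0, 0.1], [0.9, 0.2], [1.0, 0.0], [-0.2, 1.0]]
sims = similarity_to_mean(reprs)
least_similar = min(range(len(sims)), key=sims.__getitem__)  # index of the outlier node
```

Nodes with the lowest similarity play the role of node 8 and node 14 in Figure 10, that is, the nodes whose representations deviate most from the mean.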
As mentioned in section 4.3, our models pay more attention to the nodes whose representations differ greatly from the mean value, so node 8 and node 14 ought to contribute more than other nodes when identifying the type of the source tweet. It is apparent in Figure 9 that the local contexts where node 8 and node 14 are located encompass many conflicting views. A previous study found that there are many conflicting views in the local context of tweets related to false rumors (Jin et al. 2016). As mentioned in section 4.2, our proposed models exchange information between different nodes via relation paths, so that the learned representations contain discriminative features of the context information, which is beneficial for the downstream rumor detection task and early detection task.

In this paper, we propose two graph neural network based algorithms, GLO-PGNN and ENS-PGNN, aimed at the rumor detection task on social networks. Both algorithms have two stages: 1) in the first stage, our algorithms learn representations for each node in the propagation graph; 2) in the second stage, our proposed algorithms classify rumors based on the representations obtained in the first stage. GLO-PGNN and ENS-PGNN differ only in the strategies used in the second stage. In the first stage, we first use a novel way to construct the propagation graph based on the propagation structure (who replies to whom), and then use a gated graph neural network to learn powerful node representations by exchanging information between different nodes via relation paths. The learned node representations therefore contain the context information. In the second stage, GLO-PGNN first calculates the global graph embedding of the whole propagation graph from a global perspective, and then uses the fully-connected network for classification.
ENS-PGNN first calculates the prediction probability for each node based on its representation, and then combines the node-level predictions to obtain the final classification result. The attention mechanism is included in both algorithms to improve the performance by dynamically adjusting the weight of each node in the propagation graph. The experiments on a public Twitter dataset demonstrate that our proposed algorithms achieve significant improvements over several state-of-the-art baselines in recent literature on both the rumor detection task and the early detection task. We performed a comparative analysis of different methods. We discussed how to select particular values for different parameters and how to find the optimal model. Our proposed methods achieve 25%-30% improvements over traditional content-based rumor detection methods. Compared to methods that merely exploit the temporal characteristics of rumors, our methods take into consideration the topological characteristics and achieve around 3%-5% improvements. In order to demonstrate the effectiveness of the attention mechanism, we compared the methods with and without the attention mechanism. Our experimental results indicate that the methods with the attention mechanism achieve around 1.5%-2% improvements in terms of Macro-F1 score and Micro-F1 score over the methods without it. To evaluate the early stop performance of our proposed models in terms of different time delays, we performed a comparative analysis from different aspects. The experimental results demonstrate that our proposed models achieve better performance even when very little data is available. Besides, the accuracy of our proposed models climbs rapidly as the tweet count or the elapsed time increases, suggesting that the proposed models possess strong learning ability.

We list three limitations of the methodology adopted in this work; researchers who want to study the problem further can improve on these shortcomings.
First, we construct the propagation graph based on the who-replies-to-whom structure; the follower-followee relationships and the forward relationships are ignored. Second, the neural network has many trainable parameters, and parameter tuning is tedious and requires certain skills. As we discussed in section 5.2.3 and section 5.2.4, it is non-trivial work to tune parameters and select the optimal model. Third, when we embed the information of tweet content into a low-dimensional space, we merely consider the textual information, whereas tweets sometimes include pictures, emoticons and videos in addition to text. Taking into consideration various forms of information may improve the performance of the models.

It is worth mentioning that each module of our proposed model has good scalability and low coupling. Other researchers may easily change the methodology used in each single module to achieve almost the same objective as in this work. In section 4.1, we have discussed how to embed the textual information into a low-dimensional space with the doc2vec algorithm. Other word embedding methods, such as word2vec and GloVe, can also be adopted to extract the linguistic features of the tweet content. In section 4.2, we have proposed a GRU-based propagation graph neural network to exploit the propagation graph. Recently, more and more graph neural networks have been proposed, such as GAT and GCN (Zhou, J. et al. 2018). These algorithms are powerful in modeling non-Euclidean data and capturing the internal dependencies of the graph, and are therefore very suitable for extracting the features of the propagation graph. We believe that some of these graph neural networks can achieve the same objective as PGNN does.

In future work, we will try to integrate more social network information of users, such as the follower-followee relationships and co-participation relationships in the community.
The social relationship inherently constitutes a graph structure which can be used to further enrich the propagation graph. In addition, most of the rumor detection algorithms proposed so far are under the framework of supervised learning, but data in the real world is mostly unlabeled. Therefore, we plan to solve the rumor detection task with the help of transfer learning methods.

A Gentle Introduction to Deep Learning for Graphs
Neural machine translation by jointly learning to align and translate
On the Properties of Neural Machine Translation: Encoder-Decoder Approaches
Anomalous information diffusion in social networks: Twitter and Digg
Greedy Function Approximation: A Gradient Boosting Machine
Decision support for determining veracity via linguistic based cues
Learning task-dependent distributed representations by backpropagation through structure
Long short-term memory
News Verification by Exploiting Conflicting Social Viewpoints in Microblogs
FNDNet - A deep convolutional neural network for fake news detection
Social Media Trends
Adam: a method for stochastic optimization
Semi-supervised classification with graph convolutional networks
All-in-one: Multi-task Learning for Rumour Verification
Prominent Features of Rumor Propagation in Online Social Media
Gradient-based learning applied to document recognition
Distributed representations of sentences and documents
Gated graph sequence neural networks
Efficient Estimation of Word Representations in Vector Space
Detecting rumors from microblogs with recurrent neural networks
Detect rumors using time series of social context information on microblogging websites
Detect rumors in microblog posts using propagation structure via kernel learning
Detect rumor and stance jointly by neural multi-task learning
Rumor Detection on Twitter with Tree-structured Recursive Neural Networks
Fake News, Rumor, Information Pollution in Social Media and Web: A Contemporary Survey of State-of-the-arts, Challenges, and Opportunities. Expert Systems with Applications
The Lie Detector: Explorations in the Automatic Recognition of Deceptive Language
CSI: A Hybrid Deep Model for Fake News Detection
A Pareto-based estimation of distribution algorithm for solving multiobjective distributed no-wait flow-shop scheduling problem with sequence-dependent setup time
Combating Fake News: A Survey on Identification and Mitigation Techniques
Beyond news contents: The role of social context for fake news detection
Twitter rumour detection in the health domain
Dropout: a simple way to prevent neural networks from overfitting
Detection and veracity analysis of fake news via scrapping and authenticating the web search
Rumor gauge: predicting the veracity of rumors on twitter
"Liar, Liar Pants on Fire": A New Benchmark Dataset for Fake News Detection
False rumors detection on sina weibo by propagation structures
Unsupervised Fake News Detection on Social Media: A Generative Approach
Social media contents based sentiment analysis and prediction system
Hyper-Parameter Optimization: A Review of Algorithms and Applications
Enquiring minds: Early detection of rumors in social media from enquiry posts
Graph Neural Networks: A Review of Methods and Applications

The research work is supported by National Natural Science Foundation of China (U1433116) and the Fundamental Research Funds for the Central Universities (NP2017208).

A novel way is proposed to explicitly construct the propagation graph of rumors.
A representation learning algorithm based on gated graph neural network is proposed.
Two rumor detection models with different classification strategies are proposed.
Attention mechanism is included to improve the detection performance.