key: cord-0295054-mq7b54lw
authors: Nguyen, Van-Hoang; Sugiyama, Kazunari; Nakov, Preslav; Kan, Min-Yen
title: FANG: Leveraging Social Context for Fake News Detection Using Graph Representation
date: 2020-08-18
journal: nan
DOI: 10.1145/3340531.3412046
sha: 197c628a1fed591469e7ee6ff1c2d33074979c82
doc_id: 295054
cord_uid: mq7b54lw

We propose Factual News Graph (FANG), a novel graphical social context representation and learning framework for fake news detection. Unlike previous contextual models that have targeted performance, our focus is on representation learning. Compared to transductive models, FANG is scalable in training as it does not have to maintain all nodes, and it is efficient at inference time, without the need to re-process the entire graph. Our experimental results show that FANG is better at capturing the social context into a high-fidelity representation than recent graphical and non-graphical models. In particular, FANG yields significant improvements for the task of fake news detection, and it is robust in the case of limited training data. We further demonstrate that the representations learned by FANG generalize to related tasks, such as predicting the factuality of reporting of a news medium.

Social media have emerged as an important source of information for many people worldwide. Unfortunately, not all of the information they publish is true. During critical events such as a political election or a pandemic outbreak, disinformation with malicious intent [38], commonly known as "fake news", can disturb social behavior, public fairness, and rationality. As part of the fight against COVID-19, the World Health Organization also addressed the infodemic caused by fatal disinformation related to infections and cures [41]. Many websites and social media platforms have devoted efforts to identifying disinformation. For example, Facebook encourages users to report non-credible posts and employs professional fact-checkers to expose questionable news. Manual fact-checking is also used by fact-checking websites such as Snopes, FactCheck, PolitiFact, and Full Fact. In order to scale with the increasing amount of information, automated news verification systems consider external knowledge databases as evidence [13, 34, 42]. Evidence-based approaches achieve high accuracy and offer potential explainability, but they also require considerable human effort. Moreover, fact-checking approaches for textual claims based on textual evidence are not easily applicable to claims about images or videos.

Some recent work has taken another turn and has explored contextual features of the news dissemination process. These studies observed distinctive engagement patterns when social users face fake versus factual news [17, 25]. For example, the fake news article shown in Table 1 attracted many engagements shortly after its publication. These are mainly verbatim re-circulations of the original post with negative sentiment, which can be explained by the typically appalling content of fake news. After that short time window, we see denial posts questioning the validity of the news, and the stance distribution stabilizes afterwards with virtually no support. In contrast, the real news example in Table 1 invokes moderate engagement, mainly comprising supportive posts with neutral sentiment, and its stance distribution stabilizes quickly. Such temporal shifts in user perception serve as important signals for distinguishing fake from real news.
Previous work proposed partial representations of the social context, with (i) news, sources, and users as the major entities, and (ii) stances, friendship, and publication as the major interactions [16, 32, 33, 39]. However, these studies placed little emphasis on the quality of the representation, on the modeling of entities and their interactions, or on minimally supervised settings.

Table 1: Engagement of social media users with respect to fake and real news articles. Column 2 shows the time since publication, and columns 4-7 show the distribution of stances (S: Support, D: Deny, C: Comment, and R: Report).

News | Time | # | S | D | C | R | Example posts
"… Before Using Bathroom" (Fake) |  |  | 0.00 | 0.03 | 0.19 | 0.78 | "DISGUSED SO TRASNPHOBIC", "FOR GODS SAKE GET REAL GOP", "You cant make this up folks"
 | 3h-6h | 21 | 0.00 | 0.10 | 0.10 | 0.80 | "Ok This cant be real", "WTF IS THIS BS", "Rediculous RT"
 | 6h+ | 31 | 0.00 | 0.10 | 0.14 | 0.76 | "Cant make this shit up", "how is this real", "small government", "GOP Cray Cray Occupy Democrats"
"1,100,000 people have been killed by guns in the U.S.A. since John Lennon was shot and killed on December 8, 1980" (Real) | 3h | 9 | 0.56 | 0.00 | 0.00 | 0.44 | "#StopGunViolence", "guns r the problem"
 | 3h+ | 36 | 0.50 | 0.00 | 0.11 | 0.39 | "Some 1.15 million people have been killed by firearms in the United States since Lennon was gunned down", "#StopGunViolence"

Naturally, the social context of news dissemination can be represented as a heterogeneous network where nodes and edges represent the social entities and the interactions between them, respectively. Network representations have several advantages over existing Euclidean-based methods [23, 35] in terms of their ability to model structural phenomena such as echo chambers of users or polarized networks of news media. Graphical models also allow entities to exchange information via (i) homogeneous edges, i.e., user-user relationships and source-source citations; (ii) heterogeneous edges, i.e., user-news stance expressions and source-news publications; and (iii) high-order proximity, e.g., between users who consistently support or deny certain sources, as illustrated in Figure 1. This allows the representations of heterogeneous entities to depend on one another, benefiting not only fake news detection but also related social analysis tasks such as malicious user detection [7] and source factuality prediction [3].

Our work focuses on improving contextual fake news detection by enhancing the representations of social entities. Our main contributions can be summarized as follows:
(1) We propose a novel graph representation that models all major social actors and their interactions (Figure 1).
(2) We propose the Factual News Graph (FANG), an inductive graph learning framework that effectively captures social structure and engagement patterns, thus improving representation quality.
(3) We report a significant improvement in fake news detection when using FANG, and we further show that our model is robust in the case of limited training data.
(4) We show that the representations learned by FANG generalize to related tasks such as predicting the factuality of reporting of a news medium.
(5) We demonstrate FANG's explainability thanks to the attention mechanism of its recurrent aggregator.

In this section, we first review the existing work on contextual fake news detection and the way the social context of news is represented in such work. We then detail recent advances in the Graph Neural Network (GNN) formalism, which forms the premise of our proposed graph learning framework.
Previous work on contextual fake news detection can be categorized based on the approach used to represent and learn the social context. Euclidean approaches represent the social context as a flat vector or a matrix of real numbers. They typically learn a Euclidean transformation of the social entity features that best approximates the fake news prediction [32]. The complexity of such a transformation varies from traditional shallow (as opposed to "deep") models, e.g., Random Forest or Support Vector Machines (SVM) [6, 44], to probabilistic graphical models [33] and deep neural networks such as Long Short-Term Memory (LSTM) networks [15] that model engagement temporality [35]. However, given our formulation of the social context as a heterogeneous network, Euclidean representations are less expressive [5]. Although pioneering work used user attributes such as demographics, news preferences, and social features, e.g., the number of followers and friends [26, 38], such features do not capture the user interaction landscape, i.e., what kind of social figures the users follow, which news topics they favor or oppose, etc. Moreover, in a graphical representation, node variables are no longer constrained by the independent and identically distributed assumption, and thus they can reinforce each other's representations via edge interactions.

Having acknowledged the above limitations, some researchers have started exploring non-Euclidean or geometric approaches. They generalized the idea of using the social context by modeling an underlying user or news source network and by developing representations that capture structural features about the entity. The Capture, Score, and Integrate (CSI) model [35] used linear dimensionality reduction on the user co-sharing adjacency matrix, combining it with news engagement features obtained from a recurrent neural network (RNN). The Tri-Relationship Fake News (TriFN) detection framework [39], although similar to our approach, neither differentiated user engagements in terms of stance and temporal patterns nor modeled source-source citations. Also, matrix decomposition approaches, including CSI [35], can be expensive as the number of graph nodes grows, and they are ineffective at modeling high-order proximity. Other work on source citation networks [21], propagation networks [29], and rumor detection [10, 45] used recent advances in GNNs and multi-head attention to learn both local and global structural representations. These models were optimized solely for the objective of fake news detection, without accounting for representation quality. As a result, they are not robust when presented with limited training data and cannot be generalized to other downstream tasks, as we show in Section 5. Table 2 compares these approaches.

GNNs have successfully generalized deep learning methods to model complex relationships and inter-dependencies on graphs and manifolds. Graph Convolutional Networks (GCNs) are among the first methods that effectively approximate convolutional filters on graphs [19]. However, GCNs impose a substantial memory footprint, as they store the entire adjacency matrix. They are also not easily adaptable to our heterogeneous graph, where nodes and edges with different labels exhibit different information propagation patterns. Furthermore, GCNs do not guarantee generalizable representations, and they are transductive, requiring the inferred nodes to be present at training time.
This is especially challenging for contextual fake news detection or general social network analysis, as their structure is constantly evolving. With these points in mind, we build our work on GraphSage, which generates embeddings by sampling and aggregating features from a node's local neighborhood [12]. GraphSage provides significant flexibility in defining the information propagation pattern with parameterized random walks and recurrent aggregators. It is well suited for representation learning with an unsupervised node proximity loss, and it generalizes well in minimal supervision settings. Moreover, it uses a dynamic inductive algorithm that allows the creation of unseen nodes and edges at inference time.

We first introduce the notation, and then formally define the problem of fake news detection. Afterwards, we discuss our methodology, namely the process of constructing our social context graph, FANG, as well as its underlying rationale. Finally, we describe the process of feature extraction from social entities as well as the modeling of their interactions.

Let us first define the social context graph G = (A, S, U, E), with its entities and interactions shown in Figure 1: A, S, and U are the sets of news articles, news sources, and social users, respectively, and E is the set of interactions. Each interaction e = {v_1, v_2, t, x_e} is modeled as a relation between two entities v_1, v_2 ∈ A ∪ S ∪ U at time t; t is absent in time-insensitive interactions. The interaction type of e is defined by the label x_e. Table 3 summarizes the characteristics of the different types of interactions, both homogeneous and heterogeneous.

Stances are a special type of interaction, as they are characterized not only by edge labels and source/destination nodes, but also by temporality, as shown in the earlier examples in Table 1. Recent work has highlighted the importance of incorporating temporality not only for fake news detection [26, 35], but also for modeling online information dissemination [14]. We use the following stance labels: neutral support, negative support, deny, and report. The major support and deny stances are consistent with prior work (e.g., [28]), whereas the two types of support, neutral support and negative support, are based on the reported correlation between news factuality and invoked sentiment [1]. We assign the report stance label to a user-news engagement when the user simply spreads the news article without expressing any opinion. Overall, we use stances to characterize news articles based on the opinions expressed about them, and social users by their views on various news articles.

Table 4: Some examples from the stance-annotated dataset, all concerning the same event.

Text | Type | Annotated stance
Greta Thunberg tops annual list of highest-paid Activists! | reference headline | -
Greta Thunberg is the 'Highest Paid Activist' | related headline | support (neutral)
No, Greta Thunberg not highest paid activist | related headline | deny
Can't speak for the rest of 'em, but as far as I know, Greta's just a schoolgirl and has no source of income. | related post | deny
The cover describes Greta Thunberg to be the highest paid activist in the world | related tweet | support (neutral)
A very wealthy 16yo Fascist at that! | related post | support (negative)

We can now formally define our task as follows:

Definition 3.1. Context-based fake news detection: Given a social context G = (A, S, U, E) constructed from news articles A, news sources S, social users U, and social engagements E, context-based fake news detection is defined as the binary classification task of predicting whether a news article a ∈ A is fake or real.
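To make these definitions concrete, the following is a minimal Python sketch of how the social context graph and its interactions could be organized in code. All class, field, and label names here are illustrative assumptions and are not taken from the released FANG implementation.

```python
# Minimal sketch of the social context graph G = (A, S, U, E) defined above.
# Names are illustrative; the released FANG code may organize this differently.
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Stance(Enum):
    NEUTRAL_SUPPORT = "support_neutral"
    NEGATIVE_SUPPORT = "support_negative"
    DENY = "deny"
    REPORT = "report"


@dataclass
class Interaction:
    src: str                            # id of the source entity (user, source, or article)
    dst: str                            # id of the destination entity
    label: str                          # "following", "publication", "citation", or a Stance value
    timestamp: Optional[float] = None   # hours since article publication; None if time-insensitive


@dataclass
class SocialContextGraph:
    articles: List[str] = field(default_factory=list)              # A
    sources: List[str] = field(default_factory=list)               # S
    users: List[str] = field(default_factory=list)                 # U
    interactions: List[Interaction] = field(default_factory=list)  # E

    def engagements(self, article_id: str) -> List[Interaction]:
        """User-article stance interactions for one article, ordered by elapsed time."""
        stance_labels = {s.value for s in Stance}
        engaged = [e for e in self.interactions
                   if e.dst == article_id and e.label in stance_labels]
        return sorted(engaged, key=lambda e: (e.timestamp is None, e.timestamp or 0.0))
```

Keeping the timestamp optional mirrors the distinction between time-sensitive interactions (publication, stance) and time-insensitive ones (following, citation).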
News Articles. Textual [6, 34, 39, 44] and visual [18, 43] features have been widely used to model news article content, either through feature extraction, unsupervised semantic encoding, or learned representations. We use unsupervised textual representations, as they are relatively efficient to construct and to optimize. For each article a ∈ A, we construct a TF.IDF [36] vector from the text body of the article. We enrich the news representation by weighting the pre-trained GloVe embedding [30] of each word with its TF.IDF value, forming a semantic vector. Finally, we concatenate the TF.IDF vector and the semantic vector to form the news article feature vector x_a.

News Sources. We focus on characterizing news media sources using the textual content of their websites [3, 21]. Similarly to the article representations, for each source s, we construct the source feature vector x_s as the concatenation of its TF.IDF vector and its semantic vector, both derived from the words in its Homepage and its About Us section; we include the latter because some fake news websites openly declare their content to be satirical or sarcastic there.

Social Users. Online users have been studied extensively as the main propagators of fake news and rumors in social media. As discussed in Section 2, previous work [6, 44] used attributes such as demographics, information preferences, social activity, and network structure, e.g., the number of followers or friends. Shu et al. [39] conducted a feature analysis of user profiles and pointed out the importance of signals derived from the profile description and the timeline content. A text description such as "American mom fed up with anti american leftists and corruption. I believe in US constitution, free enterprise, strong military and Donald Trump #maga" strongly indicates the user's political bias and suggests a tendency to promote certain narratives. We calculate the user feature vector x_u as the concatenation of a TF.IDF vector and a semantic vector, both derived from the text description in the user's profile.

Social Interactions. For each pair of social actors linked via an interaction of type x_e, we add an interaction e = {v_1, v_2, t, x_e} to the list of social interactions E. Specifically, for following, we examine whether user u_i follows user u_j; for publication, we check whether news article a_i was published by source s_j; and for citation, we examine whether the Homepage of source s_i contains a hyperlink to source s_j. In the case of time-sensitive interactions, i.e., publication and stance, we record their relative timestamp with respect to the article's earliest time of publication.
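As a concrete illustration of the feature construction above (a TF.IDF vector concatenated with a TF.IDF-weighted combination of GloVe word vectors), here is a hedged sketch. The GloVe file path and the normalization of the weighted sum are assumptions made for illustration; the paper only states that GloVe embeddings are weighted by their TF.IDF values.

```python
# Sketch: entity features as [TF.IDF | TF.IDF-weighted GloVe], as described above.
# The GloVe path and the averaging step are illustrative assumptions.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer


def load_glove(path="glove.6B.100d.txt"):
    """Load GloVe vectors into a {word: vector} dictionary (path is a placeholder)."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors


def build_features(texts, glove, dim=100):
    """Return one feature vector per text, usable as x_a, x_s, or x_u."""
    vectorizer = TfidfVectorizer(stop_words="english")
    tfidf = vectorizer.fit_transform(texts)               # (n_texts, vocab_size), sparse
    vocab = vectorizer.get_feature_names_out()

    semantic = np.zeros((len(texts), dim), dtype=np.float32)
    for i in range(len(texts)):
        row = tfidf.getrow(i)
        total_weight = 0.0
        for j, weight in zip(row.indices, row.data):
            word = vocab[j]
            if word in glove:
                semantic[i] += weight * glove[word]        # TF.IDF-weighted GloVe embedding
                total_weight += weight
        if total_weight > 0:
            semantic[i] /= total_weight                    # normalization is an assumption
    return np.hstack([tfidf.toarray(), semantic])
```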
Stance Detection. The task of obtaining the viewpoint of a piece of text with respect to another one is known as stance detection. In the context of fake news detection, we are interested in the stance of a user reply with respect to the title of a questionable news article. We consider four stances: support with neutral sentiment (neutral support), support with negative sentiment (negative support), deny, and report. We classify a post as a verbatim report of the news article if it matches the article title after removing emojis, punctuation, stop words, and URLs from the text. We train a stance detector to classify the remaining posts as support or deny. Popular stance detection datasets either do not explicitly describe the target text [8], have a limited number of targets [27, 40], or define the source/target texts differently, as in the Fake News Challenge.

In order to overcome this difficulty, we constructed our own dataset for stance detection between social media posts and news articles, which contains 2,527 labeled source-target sentence pairs from 31 news events. For each event with a reference headline, the annotators were given a list of related headlines and posts, and they labeled whether each related headline or post supports or denies the claim made by the reference headline. Aside from the reference headline-related headline and reference headline-related post sentence pairs, we further made second-order inferences for related headline-related post pairs: if such a pair expressed a similar stance with respect to the reference headline, we inferred a support stance for the related headline-related post pair, and a deny stance otherwise. Tables 4 and 5 show example annotations and statistics about the dataset. The inter-annotator agreement evaluated with Cohen's Kappa is 0.78, indicating substantial agreement.

In order to choose the best stance classifier, we fine-tuned various pre-trained large-scale Transformers [9, 22] on our dataset. RoBERTa [22] worked best, achieving an accuracy of 0.8857, an F1 score of 0.8379, a precision of 0.8365, and a recall of 0.8395, and thus we chose it as our stance classifier. In order to further divide support posts into those with neutral and those with negative sentiment, we fine-tuned a similar architecture on the Yelp Review Polarity dataset to obtain a sentiment classifier. Altogether, the stance prediction for a user-article engagement e is given as stance(e).
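The following sketch shows how this stance labeling pipeline could be wired together at inference time. The model paths are placeholders for classifiers fine-tuned as described above (RoBERTa on our stance dataset, and a similar model on Yelp Review Polarity for sentiment), and the label-index conventions are assumptions.

```python
# Sketch of the stance labeling pipeline: verbatim matching -> "report",
# otherwise support/deny classification, with support split by sentiment.
# Model paths and label indices are placeholders/assumptions.
import re
import string
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

STANCE_MODEL = "path/to/roberta-finetuned-stance"            # placeholder checkpoint
SENTIMENT_MODEL = "path/to/roberta-finetuned-yelp-polarity"  # placeholder checkpoint

stance_tok = AutoTokenizer.from_pretrained(STANCE_MODEL)
stance_clf = AutoModelForSequenceClassification.from_pretrained(STANCE_MODEL)
sent_tok = AutoTokenizer.from_pretrained(SENTIMENT_MODEL)
sent_clf = AutoModelForSequenceClassification.from_pretrained(SENTIMENT_MODEL)


def clean(text):
    """Strip URLs, punctuation, and extra whitespace (emoji/stop-word removal omitted)."""
    text = re.sub(r"https?://\S+", "", text.lower())
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


@torch.no_grad()
def stance(post, title):
    # Verbatim re-sharing of the headline is labeled "report".
    if clean(post) == clean(title):
        return "report"
    # Otherwise classify the (title, post) pair as deny vs. support.
    inputs = stance_tok(title, post, return_tensors="pt", truncation=True)
    label = stance_clf(**inputs).logits.argmax(-1).item()        # assume 0 = deny, 1 = support
    if label == 0:
        return "deny"
    # Split support into neutral vs. negative support by sentiment.
    s_inputs = sent_tok(post, return_tensors="pt", truncation=True)
    polarity = sent_clf(**s_inputs).logits.argmax(-1).item()     # assume 0 = negative
    return "support_negative" if polarity == 0 else "support_neutral"
```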
We now describe the FANG learning framework over the social context graph described in Section 3.2. Figure 2 shows an overview of FANG. While optimizing for the fake news detection objective, FANG also learns generalizable representations for the social entities. This is achieved by optimizing three concurrent losses: (i) an unsupervised Proximity Loss, (ii) a self-supervised Stance Loss, and (iii) a supervised Fake News Detection Loss.

Representation Learning. We first discuss how FANG derives the representation of each social entity. Previous representation learning frameworks such as DeepWalk [31] and node2vec [11] compute a node embedding by sampling its neighborhood and then optimizing a proximity loss, similarly to word2vec. However, the neighborhood is defined by the graph structure alone. These methods use only the neighborhood structure, and they are suitable when auxiliary node features are unavailable or incomplete, i.e., when each entity's structural representation is optimized separately. Recently, GraphSage [12] was proposed to overcome this limitation by allowing auxiliary node features to be used jointly with proximity sampling as part of the representation learning. Let GraphSage(·) be GraphSage's node encoding function. We can then obtain the structural representation z_r ∈ R^d of any user or source node r as z_r = GraphSage(r), where d is the structural embedding dimension.

For news nodes, we further enrich the structural representation with user engagement temporality, which we showed to be distinctive for fake news detection in Section 1 above. This can be formulated as learning an aggregation function F(a, U) that maps a questionable news article a and its engaged users U to a temporal representation v_a^temp that captures a's engagement pattern. Therefore, the aggregating model (i.e., the aggregator) has to be time-sensitive. RNNs fulfill this requirement: specifically, the Bidirectional LSTM (Bi-LSTM) can capture long-term dependencies in a sequence in both the forward and the backward direction [15]. On top of the Bi-LSTM, we further incorporate an attention mechanism that focuses on essential engagements during the encoding process. Attention is expected to improve not only the model quality but also its explainability [9, 24]. By examining the model's attention, we learn which social profiles influence the decision, mimicking human analytic capability.

The input to our LSTM is a user-article engagement sequence {e_1, e_2, ..., e_|U|}. Let meta(e_i) = (time(e_i), stance(e_i)) ∈ R^l be the concatenation of e_i's elapsed time since the news publication and a one-hot stance vector. Each engagement e_i has the representation x_{e_i} = (z_{u_i}, meta(e_i)), where z_{u_i} = GraphSage(u_i). A Bi-LSTM encodes the engagement sequence and outputs two sequences of hidden states: (i) a forward sequence H_f = {h_1^f, h_2^f, ..., h_|U|^f}, and (ii) a backward sequence H_b = {h_1^b, h_2^b, ..., h_|U|^b}.

Let w_i be the attention weight paid by our Bi-LSTM encoder to the forward (h_i^f) and backward (h_i^b) hidden states. This attention should be derived from the similarity of the hidden state and the news features, i.e., how relevant the engaging users are to the discussed content, as well as from the particular time and stance of the engagement. Therefore, we formulate the attention weight w_i as

w_i = softmax_i( GraphSage(a)^T M_e h_i + meta(e_i)^T M_m ),   (1)

where h_i denotes the corresponding hidden state, l is the meta dimension, e is the encoder dimension, and M_e ∈ R^{d×e} and M_m ∈ R^{l×1} are optimizable projection matrices for the engagement and the meta features, shared across all engagements. The weight w_i is then used to compute the forward and backward weighted feature vectors v_a^f = Σ_i w_i h_i^f and v_a^b = Σ_i w_i h_i^b. Finally, we concatenate the forward and the backward vectors to obtain the temporal representation v_a^temp ∈ R^{2e} for article a. By explicitly setting 2e = d, we can combine the temporal and the structural representations of a news article a into a single representation:

z_a = v_a^temp + GraphSage(a).   (2)

Unsupervised Proximity Loss. We derive the Proximity Loss from the hypothesis that closely connected social entities often behave similarly. This is motivated by the echo chamber phenomenon, where social entities tend to interact with other entities of common interest in order to reinforce and promote their narrative. This phenomenon encompasses inter-cited news media sources publishing news of similar content or factuality, as well as social friends expressing similar stances with respect to news articles of similar content. Therefore, FANG should assign such nearby entities to a set of proximal vectors in the embedding space. We also hypothesize, based on our observation that social entities are highly polarized, especially in left-right politics [4], that loosely connected social entities often behave differently. FANG should enforce that the representations of such disparate entities be distinctive. The social interactions that best capture the above characteristics are user-user friendship, source-source citation, and news-source publication. As these interactions are either (a) between news articles and sources or (b) between users, we divide the social context graph into two sub-graphs, namely a news-source sub-graph and a user sub-graph. Within each sub-graph G', we formulate the following Proximity Loss:

L_prox = − Σ_{r ∈ G'} [ Σ_{r_p ∈ P_r} log σ(z_r^T z_{r_p}) + Σ_{r_n ∈ N_r} log σ(−z_r^T z_{r_n}) ],   (3)

where z_r ∈ R^d is the representation of entity r, P_r is the set of nearby nodes (the positive set) of r, and N_r is the set of disparate nodes (the negative set) of r. P_r is obtained using our fixed-length random walk, and N_r is derived using negative sampling [12].
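Below is a PyTorch sketch of this attention-based Bi-LSTM aggregator. The way the forward and backward states enter the attention score is an assumption kept consistent with the dimensions stated above (M_e in R^{d×e}, M_m in R^{l×1}); the released FANG code may parameterize the attention differently.

```python
# Sketch of the Bi-LSTM engagement aggregator with attention.
# The attention parameterization is an assumption consistent with the stated dimensions.
import torch
import torch.nn as nn


class EngagementAggregator(nn.Module):
    def __init__(self, d, l, e):
        super().__init__()
        self.lstm = nn.LSTM(input_size=d + l, hidden_size=e,
                            batch_first=True, bidirectional=True)
        self.M_e = nn.Parameter(torch.randn(d, e))   # projects news features onto hidden states
        self.M_m = nn.Parameter(torch.randn(l, 1))   # projects meta features (time, stance)

    def forward(self, z_users, meta, z_news):
        """
        z_users: (T, d) GraphSage embeddings of engaging users, in time order
        meta:    (T, l) elapsed time and one-hot stance per engagement
        z_news:  (d,)   structural representation of the article, GraphSage(a)
        returns: (2e,)  temporal representation v_a^temp
        """
        x = torch.cat([z_users, meta], dim=-1).unsqueeze(0)       # (1, T, d+l)
        h, _ = self.lstm(x)                                       # (1, T, 2e)
        h_f, h_b = h[0].chunk(2, dim=-1)                          # forward/backward states, (T, e)
        # Attention score: news/hidden-state similarity plus a meta-feature term.
        scores = (z_news @ self.M_e) @ (h_f + h_b).T + (meta @ self.M_m).squeeze(-1)
        w = torch.softmax(scores, dim=0)                          # attention weights over engagements
        v_f = (w.unsqueeze(-1) * h_f).sum(dim=0)                  # weighted forward vector
        v_b = (w.unsqueeze(-1) * h_b).sum(dim=0)                  # weighted backward vector
        return torch.cat([v_f, v_b], dim=-1)                      # v_a^temp in R^{2e}
```

With 2e = d, the returned vector can be combined with GraphSage(a) to form z_a, as in Equation (2).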
Self-supervised Stance Loss. We also propose an analogous hypothesis for the user-news interaction in terms of stance: if a user expresses a stance with respect to a news article, their respective representations should be close. For each stance c, we first learn a user projection function α_c(u) = A_c z_u and a news article projection function β_c(a) = B_c z_a that map a node representation in R^d to a representation in the stance space c of R^{d_c}. Given a user u and a news article a, we compute their similarity score in the stance space c as α_c(u)^T β_c(a). If u expresses stance c with respect to a, we maximize this score, and we minimize it otherwise. This is the stance classification objective, optimized using the Stance Loss:

L_stance = − Σ_{u,a,c} [ y_{u,a,c} log σ(f(u, a, c)) + (1 − y_{u,a,c}) log(1 − σ(f(u, a, c))) ],   (4)

where f(u, a, c) = α_c(u)^T β_c(a), and y_{u,a,c} = 1 if u expresses stance c over a, and 0 otherwise.

Supervised Fake News Loss. We directly optimize the main learning objective of fake news detection via the supervised Fake News Loss. In order to predict whether article a is fake, we obtain its contextual representation as the concatenation of its own representation and the structural representation of its source, i.e., v_a = (z_a, z_s). This contextual representation is then input into a fully connected layer whose output is computed as o_a = W v_a + b, where W ∈ R^{2d×1} and b ∈ R are the weights and the bias of the layer. The output value o_a ∈ R is finally passed through a sigmoid activation function σ(·), and trained using the following cross-entropy Fake News Loss L_news:

L_news = − (1/T) Σ_a [ y_a log σ(o_a) + (1 − y_a) log(1 − σ(o_a)) ],   (5)

where T is the batch size, y_a = 0 if a is a fake article, and y_a = 1 otherwise. We define the total loss by linearly combining these three component losses: L_total = L_prox + L_stance + L_news. We provide detailed instructions for training FANG in Algorithm 1.

Algorithm 1: FANG training.
Input: The social context graph G = (A, S, U, E); the news labels Y_A; the stance labels Y_{U,A,C}
Output: FANG-optimized parameters θ
Initialize θ;
while θ has not converged do
    for each news batch A_i ⊂ A do
        for each news article a ∈ A_i do
            U_a ← users who have engaged with a;
            z_a ← Equation (2);
            z_s ← GraphSage(s);
            for each user u ∈ U_a do
                z_u ← GraphSage(u);
                L'_stance ← Equation (4);
            end
        end
        L'_news ← Equation (5);
    end
    for each news-source or user sub-graph G' do
        for each entity r ∈ G' do
            P_r ← positive samples of r in G';
            ...
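To make the interplay of the three objectives concrete, here is a schematic PyTorch training step mirroring Algorithm 1 and the total loss L_total = L_prox + L_stance + L_news. All attribute and function names (encode_news, encode_source, encode_user, alpha, beta, classifier, and the batch fields) are placeholders rather than the released FANG API, and the loss forms follow the reconstructions in Equations (3)-(5).

```python
# Schematic training step combining the proximity, stance, and fake news losses.
# Model/batch attribute names are placeholders, not the released FANG API.
import torch
import torch.nn.functional as F


def training_step(batch, model, optimizer):
    total_loss = 0.0

    # Supervised fake news loss (Equation (5)) over a batch of articles.
    v_a = torch.stack([torch.cat([model.encode_news(a), model.encode_source(a.source)])
                       for a in batch.articles])                        # (T, 2d)
    logits = model.classifier(v_a).squeeze(-1)                          # o_a for each article
    total_loss = total_loss + F.binary_cross_entropy_with_logits(logits, batch.labels.float())

    # Self-supervised stance loss (Equation (4)) over user-article engagements.
    for u, a, c, y in batch.engagements:                                # y = 1 if u expresses stance c on a
        score = model.alpha[c](model.encode_user(u)) @ model.beta[c](model.encode_news(a))
        total_loss = total_loss + F.binary_cross_entropy_with_logits(score, torch.tensor(float(y)))

    # Unsupervised proximity loss (Equation (3)) with random-walk positives and sampled negatives.
    for r, pos, neg in batch.proximity_samples:
        z_r, z_p, z_n = model.encode(r), model.encode(pos), model.encode(neg)
        total_loss = total_loss - torch.log(torch.sigmoid(z_r @ z_p)) \
                                - torch.log(torch.sigmoid(-z_r @ z_n))

    optimizer.zero_grad()
    total_loss.backward()
    optimizer.step()
    return float(total_loss)
```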
We conducted our experiments on a Twitter dataset collected by related work on rumor classification [20, 25] and fake news detection [37]. For each article, we collected its source, a list of engaged users, and their tweets if they were not already available in the previous datasets. The dataset also includes the Twitter profile description of each user and the list of Twitter profiles each user follows. We further crawled additional data about the media sources, including the content of their Homepage and their About Us page, together with the sources they frequently cite on their Homepage. The truth value of the articles, namely whether they are fake or real news, is based on two fact-checking websites: Snopes and PolitiFact. We release the source code of FANG and the stance detection dataset at https://github.com/nguyenvanhoang7398/FANG. Table 6 shows some statistics about our dataset.

We benchmark the performance of FANG on fake news detection against several competitive models: (i) a content-only model, (ii) a Euclidean contextual model, and (iii) another graph learning model. For the content-only model, we use a Support Vector Machine (SVM) on the TF.IDF feature vectors constructed from the news content (see Section 3.2). We also compare our approach against a recent Euclidean model, CSI [35], a fundamental yet effective recurrent encoder that aggregates the user features, the news content, and the user-news engagements. We re-implement CSI with source features by concatenating the overall user score and the article representation with our source description features to obtain the input vector for CSI's Integrate module described in the original paper. Lastly, we compare against the GCN graph learning framework [19]: we first represent each of the k social interaction types in a separate adjacency matrix, and we then concatenate the GCN outputs over the k adjacency matrices as the final representation of each node, before passing this representation through a linear layer for classification. We also verify the importance of modeling temporality by experimenting with two variants of CSI and FANG: (i) the time-insensitive variants CSI(-t) and FANG(-t), which exclude time(e) from the representation x_e of engagement e, and (ii) the time-sensitive variants CSI and FANG, which include it.

Table 7: Comparison between FANG and baseline models on fake news detection, evaluated with AUC score.

Table 7 shows the macroscopic results. For evaluation, we use the area under the Receiver Operating Characteristic curve (AUC), as is standard. All context-aware models, i.e., CSI(-t), CSI, GCN, FANG(-t), and FANG, improve over the context-unaware baseline, by 0.1153 absolute for CSI(-t) and up to 0.1993 absolute for FANG in terms of AUC. This demonstrates that considering the social context is helpful for fake news detection. We further observe that both time-sensitive models, CSI and FANG, improve over their time-insensitive variants, CSI(-t) and FANG(-t), by 0.0233 and 0.0339, respectively. These results demonstrate the importance of modeling the temporality of news spreading. Finally, the two graph-based models, FANG(-t) and GCN, are consistently better than the Euclidean CSI(-t), by 0.0501 and 0.0386, respectively, which demonstrates the effectiveness of our social graph representation described in Section 3.2. Overall, we can observe that FANG outperforms the other context-aware, temporally-aware, and graph-based models.
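For reference, the content-only baseline and the AUC evaluation can be sketched in a few lines with scikit-learn, as below. The data loader and the split are illustrative; the numbers in Table 7 come from the paper's own experimental setup.

```python
# Sketch of the content-only baseline: an SVM over TF.IDF vectors, scored with AUC.
# load_articles() is a placeholder returning article texts and fake/real labels.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

texts, labels = load_articles()                          # placeholder data loader
X = TfidfVectorizer(stop_words="english").fit_transform(texts)
X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.1,
                                          stratify=labels, random_state=0)

svm = SVC(kernel="linear", probability=True).fit(X_tr, y_tr)
auc = roc_auc_score(y_te, svm.predict_proba(X_te)[:, 1])
print(f"Content-only SVM AUC: {auc:.4f}")
```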
We now answer the following research questions (RQs) to better understand FANG's performance under different scenarios:
• RQ1: Does FANG work well with limited training data?
• RQ2: Does FANG differentiate between fake and real news based on their contrastive engagement temporality?
• RQ3: How effective is FANG's representation learning?

In order to address RQ1, we conducted the experiments described in Section 4.2 using different sizes of the training dataset. We found consistent improvements over the baselines under both limited and sufficient data conditions. Table 8 shows the experimental results, and Figure 3 (left) further visualizes them. We can see that FANG consistently outperforms the two baselines for all training sizes: 10%, 30%, 50%, 70%, and 90% of the data. In terms of AUC at decreasing training size, among the graph-based models, GCN's performance drops by 16.22%, from 0.7064 at 90% to 0.5918 at 10%, while FANG's performance drops by 11.11%, from 0.7518 at 90% to 0.6683 at 10%. We further observe that CSI's performance drops the least, by only 7.93%, from 0.6911 at 90% to 0.6363 at 10%. Another result, from an ablated baseline FANG(-s) in which we removed the stance loss, highlights the importance of this self-supervised objective: when training on 90% of the data, the relative margin by which FANG(-s) underperforms FANG is only 1.42% in terms of AUC, but this relative margin increases as the availability of training data decreases, reaching at most 6.39% at 30% training data. Overall, the experimental results emphasize our model's effectiveness at low training data availability compared to the ablated, GNN, and Euclidean baselines, which confirms RQ1.

To address RQ2 and to verify whether our model makes its decisions based on the distinctive temporal patterns of fake versus real news, we examined FANG's attention mechanism. We accumulated the attention weights produced by FANG within each time window and compared them across time windows. Figure 3 (right) shows the attention distribution over time for fake and for real news. We can see that FANG pays 68.08% of its attention to the user engagements occurring in the first 12 hours after a news article has been published when deciding whether it is fake. Its attention then sharply decreases to 18.83% for the next 24 hours and to 4.14% for the period from 36 hours to two weeks after publication, and it is approximately 9.04% from the second week onward. For real news, on the other hand, FANG places only 48.01% of its attention on the first 12 hours, which then decreases to 17.59% and 12.85% in the time windows of 12 to 36 hours and of 36 hours to two weeks, respectively. We also observe that FANG maintains 21.53% of its attention on engagements occurring more than two weeks after the real news has been published. These characteristics are consistent with the general observation that the appalling nature of fake news generates most engagements within a short period of time after publication, and thus it is reasonable for the model to place much emphasis on these crucial early engagements. Genuine news, in contrast, attracts fewer engagements but is circulated for a longer period of time, which explains FANG's persistent attention even beyond two weeks after publication. Overall, this temporality study highlights the transparency of our model's decisions, largely thanks to the incorporated attention mechanism.
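Given per-article engagement times and the corresponding attention weights, this analysis reduces to a small amount of bookkeeping, as sketched below. The window boundaries follow the text (12 hours, 36 hours, two weeks); the data access is assumed.

```python
# Sketch: sum attention weights per time window and normalize, separately for
# fake and for real news, to reproduce the distribution discussed above.
import numpy as np

WINDOWS = [(0, 12), (12, 36), (36, 14 * 24), (14 * 24, float("inf"))]  # hours since publication


def attention_by_window(examples):
    """examples: iterable of (hours, weights) numpy array pairs, one pair per article."""
    totals = np.zeros(len(WINDOWS))
    for hours, weights in examples:
        for k, (lo, hi) in enumerate(WINDOWS):
            mask = (hours >= lo) & (hours < hi)
            totals[k] += weights[mask].sum()
    return totals / totals.sum()         # fraction of attention per window


# Usage: compare attention_by_window(fake_examples) with attention_by_window(real_examples).
```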
Our core claim is that FANG improves the quality of the representations, and we verify it with an intrinsic and an extrinsic evaluation. In the intrinsic evaluation, we verify how generalizable the minimally supervised news representations are for the fake news detection task. We first optimize both GCN and FANG on 30% of the training data to obtain news representations. We then cluster these representations using an unsupervised clustering algorithm, OPTICS [2], and we measure the homogeneity score, i.e., the extent to which clusters contain a single class. The higher the homogeneity score, the more likely the news articles with the same factuality label (i.e., fake or real) are to be close to each other, which indicates higher representation quality. Figure 4 visualizes the representations obtained from the two approaches with their factuality labels and OPTICS clustering labels.

In the extrinsic evaluation, we verify how generalizable the supervised source representations are for a new task: source factuality prediction. We first train FANG on 90% of the training data to obtain the representation of each source s as z_s = GraphSage(s), and its total representation as v_s = (z_s, x_s, Σ_{a ∈ publish(s)} x_a), where x_s denotes the content representation of source s, publish(s) denotes the list of all articles published by s, and x_a denotes their content representations. We also consider a baseline representation that does not take the source's published content into account, v'_s = (z_s, x_s). Finally, we train two separate SVM models, for v_s and for v'_s, on a source factuality dataset consisting of 129 sources of high factuality and 103 sources of low factuality, obtained from Media Bias/Fact Check and PolitiFact.

For the intrinsic evaluation, the Principal Component Analysis (PCA) plot of the labeled FANG representations (Figure 4, top left) shows moderate co-location of the groups of fake and of real news, while the PCA plot of the labeled GCN representations (Figure 4, bottom left) shows little co-location within either the fake or the real news group. Quantitatively, FANG's OPTICS clusters (Figure 4, top right) achieve a homogeneity score of 0.051 with respect to the news factuality labels, compared to a homogeneity score of 0.0006 for the GCN OPTICS clusters. This intrinsic evaluation demonstrates FANG's strong representation closeness within both the fake and the real news groups, indicating that FANG yields improved representation learning over another fully supervised graph neural framework.

For the extrinsic evaluation on downstream source factuality classification, our context-aware model achieves an AUC score of 0.8049, compared to 0.5842 for the baseline. We further examined the FANG source representations to explain this 0.2207 absolute improvement. Figure 5 shows the source representations obtained from the textual features, GCN, and FANG, with their factuality labels (high, low, mixed) and their citation relationships. In the left sub-figure, we can observe that the textual features are insufficient to differentiate the factuality of media, as a site spreading fake news, such as cnsnews, could mimic factual media in terms of web design and news content. However, citations between low-factuality websites and high-factuality sites are not as frequent, and this signal is effectively used by the two graph learning frameworks, GCN and especially FANG. Yet, GCN fails to differentiate low-factuality sites with higher citation counts, such as jewsnews.co.il and cnsnews, from high-factuality sites. On the other hand, sources such as news.yahoo, despite being textually different as shown in Figure 5 (left), should still cluster with other credible media due to their high inter-citation frequency. FANG, with its much stronger emphasis on contextual representation learning, makes these sources more distinguishable. Its representation space gives us a glimpse into the landscape of news media, with a large central cluster of high-factuality, inter-cited sources such as nytimes, washingtonpost, and news.yahoo, while at the periphery lie less connected media, including both high- and low-factuality ones. We also see cases where all models fail to differentiate mixed-factuality media, such as buzzfeednews and nypost, which share many citations with high-factuality media. Overall, the results from the intrinsic and the extrinsic evaluation, as well as the above observations, confirm RQ3 regarding the improvement in FANG's representation learning.
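A minimal sketch of this intrinsic evaluation with scikit-learn is shown below; the input files and the OPTICS parameters are illustrative rather than the paper's exact configuration.

```python
# Sketch of the intrinsic evaluation: OPTICS clustering of learned news
# representations, scored against factuality labels with the homogeneity score.
import numpy as np
from sklearn.cluster import OPTICS
from sklearn.decomposition import PCA
from sklearn.metrics import homogeneity_score

reps = np.load("news_representations.npy")       # (n_articles, d), e.g., from FANG or GCN
labels = np.load("news_factuality.npy")          # 0 = fake, 1 = real (placeholder files)

cluster_ids = OPTICS(min_samples=5).fit_predict(reps)
print("homogeneity:", homogeneity_score(labels, cluster_ids))

# 2-D view of the representation space, as in the PCA plots discussed above.
coords = PCA(n_components=2).fit_transform(reps)
```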
FANG overcomes the transductive limitation of previous approaches when inferring the credibility of unseen nodes. MVDAM [21] has to randomly initialize an embedding and optimize it iteratively with node2vec [11] for any unseen node, whereas FANG directly infers the embedding with its learned feature aggregator. Other graphical approaches using matrix factorization [39] or graph convolutional layers [10, 29] learn parameters whose dimensionality is fixed to the network size N, and their inference time complexity can be as high as O(N^3) [10]. FANG infers the embeddings of unseen nodes without reconstructing the adjacency matrix, and its inference time complexity depends only on the size of the neighborhood of the unseen nodes.

It is also helpful to analyze FANG's predictions by examining specific test examples. The first example is shown in Figure 6, where we can see that FANG pays most of its attention to a tweet by user B. This can be explained by B's Twitter profile description, which identifies a fact-checking organization and thus indicates high reliability. In contrast, a denying tweet from user A receives much less attention, due to the uninformative description in its author's profile. Our model bases its prediction on the support stance from the fact-checker, which is indeed the correct label. In the second example, shown in Figure 7, FANG pays most of its attention to a tweet by user C. Although this profile does not provide any description, it has a record of correctly denying the fake news about the dead NFL lawyer. Furthermore, the profiles that follow Twitter user C, namely user D and user E, have credible descriptions of a proofreader and of a tech community, respectively. This explains why our model bases its prediction that the news is fake on this reliable denial, which is again the correct label.

We note that the entity and interaction features are constructed before being passed to FANG, and thus errors from upstream tasks, such as textual encoding or stance detection, can propagate to FANG. Future work can address this in an end-to-end framework, where textual encoding [9] and stance detection are jointly optimized. Another limitation is that a dataset for contextual fake news detection can quickly become obsolete, as hyperlinks and social media traces from the time of publication might no longer be retrievable.

We have demonstrated the importance of modeling the social context for the task of fake news detection. We further proposed FANG, a graph learning framework that enhances representation quality by capturing the rich social interactions between users, articles, and media, thereby improving both fake news detection and source factuality prediction. We have demonstrated the effectiveness of FANG with limited training data and its capability to capture distinctive temporal patterns between fake and real news with a highly explainable attention mechanism. In future work, we plan to further analyze the representations of social users. We also plan to apply multi-task learning to jointly address the tasks of fake news detection, source factuality prediction, and echo chamber discovery.

References
Sentiment Aware Fake News Detection on Online Social Networks
OPTICS: Ordering Points to Identify the Clustering Structure
Predicting Factuality of Reporting and Bias of News Media Sources
Is the Internet Causing Political Polarization? Evidence from Demographics
Geometric Deep Learning: Going beyond Euclidean data
Information Credibility on Twitter
Seminar Users in the Arabic Twitter Sphere
SemEval-2017 Task 8: RumourEval: Determining rumour veracity and support for rumours
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Multiple Rumor Source Detection with Graph Convolutional Networks
node2vec: Scalable Feature Learning for Networks
Inductive Representation Learning on Large Graphs
ClaimBuster: The First-ever End-to-end Fact-checking System
Predicting the Popularity of Web 2.0 Items based on User Comments
Long Short-Term Memory
News Credibility Evaluation on Microblog with a Hierarchical Propagation Model
News Verification by Exploiting Conflicting Social Viewpoints in Microblogs
MVAE: Multimodal Variational Autoencoder for Fake News Detection
Semi-Supervised Classification with Graph Convolutional Networks
PHEME dataset for Rumour Detection and Veracity Classification
Multiview Models for Political Ideology Detection of News Articles
RoBERTa: A Robustly Optimized BERT Pretraining Approach
Early Detection of Fake News on Social Media Through Propagation Path Classification with Recurrent and Convolutional Networks
Effective Approaches to Attention-based Neural Machine Translation
Detecting Rumors from Microblogs with Recurrent Neural Networks
Detect Rumors Using Time Series of Social Context Information on Microblogging Websites
SemEval-2016 Task 6: Detecting Stance in Tweets
Automatic Stance Detection Using End-to-End Memory Networks
Fake News Detection on Social Media using Geometric Deep Learning
GloVe: Global Vectors for Word Representation
DeepWalk: Online Learning of Social Representations
Credibility Assessment of Textual Claims on the Web
Where the Truth Lies: Explaining the Credibility of Emerging Claims on the Web and Social Media
CredEye: A Credibility Lens for Analyzing and Explaining Misinformation
CSI: A Hybrid Deep Model for Fake News Detection
Introduction to Modern Information Retrieval
FakeNewsNet: A Data Repository with News Content, Social Context and Dynamic Information for Studying Fake News on Social Media
Beyond News Contents: The Role of Social Context for Fake News Detection
A Dataset for Multi-Target Stance Detection
WHO says fake coronavirus claims causing 'infodemic'
Automated Fact Checking: Task Formulations, Methods and Future Directions
EANN: Event Adversarial Neural Networks for Multi-Modal Fake News Detection
Automatic Detection of Rumor on Sina Weibo
Jointly Embedding the Local and Global Relations of Heterogeneous Graph for Rumor Detection