key: cord-0600324-d5xvbtdi authors: Silva, Amila; Luo, Ling; Karunasekera, Shanika; Leckie, Christopher title: Embracing Domain Differences in Fake News: Cross-domain Fake News Detection using Multi-modal Data date: 2021-02-11 journal: nan DOI: nan sha: 9e6de281ad0d2ac8c475d5802ce15004cbfbaf35 doc_id: 600324 cord_uid: d5xvbtdi With the rapid evolution of social media, fake news has become a significant social problem, which cannot be addressed in a timely manner using manual investigation. This has motivated numerous studies on automating fake news detection. Most studies explore supervised training models with different modalities (e.g., text, images, and propagation networks) of news records to identify fake news. However, the performance of such techniques generally drops if news records come from different domains (e.g., politics, entertainment), especially for domains that are unseen or rarely-seen during training. As motivation, we empirically show that news records from different domains have significantly different word usage and propagation patterns. Furthermore, due to the sheer volume of unlabelled news records, it is challenging to select news records for manual labelling so that the domain-coverage of the labelled dataset is maximized. Hence, this work: (1) proposes a novel framework that jointly preserves domain-specific and cross-domain knowledge in news records to detect fake news from different domains; and (2) introduces an unsupervised technique to select a set of unlabelled informative news records for manual labelling, which can be ultimately used to train a fake news detection model that performs well for many domains while minimizing the labelling cost. Our experiments show that the integration of the proposed fake news model and the selective annotation approach achieves state-of-the-art performance for cross-domain news datasets, while yielding notable improvements for rarely-appearing domains in news datasets.
Motivation. Today, social media is considered one of the leading and fastest media for seeking news information online. Thus, social media platforms provide an ideal environment to spread fake news (i.e., disinformation). The cost and damage caused by fake news are often high, and early detection that stops such information from spreading is important. For example, it has been estimated that at least 800 people died and 5,800 were admitted to hospital as a result of false information related to the COVID-19 pandemic, e.g., believing alcohol-based cleaning products are a cure for the virus 1 . Due to the high volume of news generated on a daily basis, it is not practical to identify fake news using manual fact checking. Therefore, automatic detection of fake news has recently become a significant problem attracting immense research effort.
Challenges. Nevertheless, most existing fake news detection techniques fail to identify fake news in a real-world news stream for the following reasons. First, most existing techniques (Shu et al. 2020b; Ruchansky et al. 2017) are trained and evaluated using datasets (Shu et al. 2020a; Cui et al. 2020) that are limited to a single domain such as politics, entertainment, or healthcare. However, a real news stream typically covers a wide variety of domains. We have empirically found that existing fake news detection techniques perform poorly for such a cross-domain news dataset despite yielding good results for domain-specific news datasets.
This observation may be due to two reasons: (1) domain-specific word usage; and (2) domain-specific propagation patterns. For example, Figure 1 adopts two datasets from different domains, PolitiFact for politics and GossipCop for entertainment, which are two widely used labelled datasets to train fake news detection models. Fig. 1 shows that there are significant differences in the frequently used words and propagation patterns of these two datasets. To address this challenge, some previous works (Castelo et al. 2019) learned models to overlook such domain-specific information and rely only on cross-domain information (e.g., web-markup and readability features) for fake news detection. However, domain-specific knowledge could be useful for accurate identification of fake news. As a solution, this work aims to address how to preserve domain-specific and cross-domain knowledge in news records to detect fake news in cross-domain news datasets. Second, the studies in (Janicka et al. 2019) show that most fake news detection techniques are not good at identifying fake news records from domains that are unseen or rarely seen during training. As a solution, fake news detection models can be learned using a dataset that covers as many domains as possible. Here we assume that the fake news detection model requires supervision, as supervised techniques are known to be substantially better at identifying fake news compared to unsupervised methods (Yang et al. 2019a). In such a supervised learning setting, each training (i.e., labelled) data point has an associated labelling cost. Thus, the total labelling budget constrains the number of data instances that can be selected for manual labelling. Due to the sheer volume of unlabelled news records available, there is a need to identify informative news records to annotate such that the labelled dataset ultimately covers many domains while avoiding any selection biases.
Contribution. To address the aforementioned challenges, this work makes the following contributions:
• We propose a multimodal 2 fake news detection technique for cross-domain news datasets that learns domain-specific and cross-domain information of news records using two independent embedding spaces, which are subsequently used to identify fake news records. Our experiments show that the proposed framework outperforms state-of-the-art fake news detection models by as much as 7.55% in F1-score.
• We propose an unsupervised technique to select a given number of news records from a large data pool such that the selected dataset maximizes the domain coverage. By using such a dataset to train a fake news detection model, we show that the model achieves around 25% F1-score improvements for rarely-appearing domains in news datasets.
Fake news detection methods mainly rely on different attributes (text, image, social context) of news records to determine their veracity. Text content-based approaches (Volkova et al. 2017; Pérez-Rosas et al. 2018) mainly explore word usage and linguistic styles in the headline and body of news records to identify fake news. Some works analyse the images in news records along with the text content for fake news detection. For example, the studies in (Jin et al. 2017; Khattar et al. 2019) use pre-trained image models (e.g., VGG-19, ResNet) to extract features from images, which are integrated with text features to identify fake news.
Also, some works consider the social context of a news record, i.e., how the record is propagated across social media, as another modality to differentiate fake news records from real ones. Existing work in this line mostly applies various machine learning techniques to extract features from propagation patterns, including Propagation Tree Kernels (Ma et al. 2017), Recurrent Neural Networks (Liu et al. 2018), and Graph Neural Networks (Monti et al. 2019). However, all these modalities (i.e., text, propagation patterns) generally show notable differences (see Figure 1) for news records in different domains. Thus, most existing techniques perform poorly for cross-domain news datasets due to their inability to capture such domain-specific variations. Our model also relies on the text content and social context of news. However, the main objective of our model is to capture such domain-specific variations of news records.
Domain-agnostic Fake News Detection. Several previous works have attempted to perform fake news detection using cross-domain datasets. In one previous approach, an event discriminator is learned along with a multimodal fake news detector to overlook domain-specific information in news records. The study in (Castelo et al. 2019) carefully selects a set of features (e.g., psychological features, readability features) from news records that are domain-invariant. These techniques rely only on cross-domain information in news records. In contrast, Han et al. (2020) consider cross-domain fake news detection as a continual learning task, which learns a model for a large number of tasks sequentially. This work adopts Graph Neural Networks to detect fake news using their propagation patterns and applies well-known continual learning approaches, Elastic Weight Consolidation (Kirkpatrick et al. 2017) and Gradient Episodic Memory (Lopez-Paz et al. 2017), to address the cross-domain fake news detection problem. This approach has two limitations: (1) it assumes that the news records from different domains arrive sequentially, though this is not always true for real-world streams; and (2) it requires the domain of news records to be known, which is not generally available. In contrast, our approach exploits both domain-specific and cross-domain knowledge of news records without knowing the actual domain of news records.
Active Learning for Fake News Detection. Almost all the aforementioned models are supervised. Although there are unsupervised fake news detection techniques (Yang et al. 2019b; Hosseinimotlagh et al. 2018), they are generally inferior to the supervised approaches in terms of accuracy. However, the training of supervised models requires large labelled datasets, which are costly to collect. Therefore, how to obtain fresh and high-quality labelled samples for a given labelling budget is challenging. Some works (Wang et al. 2020; Bhattacharjee et al. 2017) adopt conventional active learning frameworks to select high-quality samples, in which the model is initially trained using a small randomly selected dataset. Then, the beliefs derived from the initial model are used to select subsequent instances to annotate. This approach has two limitations: (1) it requires an initial model trained on a small, randomly selected labelled dataset; and (2) it is known to be highly vulnerable to the biases introduced by the initial model.
Figure 2: Overview of the proposed framework. In the illustrated embedding spaces, each data point's colour and shape denote its domain label and veracity label (i.e., triangle for fake news and circle otherwise) respectively.
In contrast, our instance selection approach does not depend on such an initial model. Also, none of the previous works attempted to explicitly maximize the domain-coverage of the labelled dataset, which is vital to train a model that performs equally well for multiple domains.
Let R be a set of news records. Each record r ∈ R is represented as a tuple (t_r, W_r, G_r), where (1) t_r is the timestamp when r is published online; (2) W_r is the text content of r; and (3) G_r is the propagation network of r for time bound ∆T. We keep ∆T low (= five hours) for our experiments to evaluate early detection performance. Each propagation network G_r is an attributed directed graph (V_r, E_r, X_r), where the nodes V_r represent the tweets/retweets of r and the edges E_r represent the retweet relationships among them. X_r is the set of attributes of the nodes (i.e., tweets) in G_r. More details about E_r and G_r are given in (Silva et al. 2021). Our problem consists of two sub-tasks: (1) select a set of instances R_L from R to label while adhering to the given labelling budget B, which constrains the number of instances in R_L; the labelling process assigns a binary label y_r to each record r, where y_r is 1 if r is false and 0 otherwise; and (2) learn an effective model using R_L to predict the label y_r (false or real) for unlabelled news records r ∈ R_U.
In this work, R (= R_L ∪ R_U) is not constrained to a specific domain. To emulate such a domain-agnostic dataset, we combine three publicly available datasets: (1) PolitiFact (Shu et al. 2020a), which consists of news related to politics; (2) GossipCop (Shu et al. 2020a), a set of news related to entertainment stories; and (3) CoAID (Cui et al. 2020), a news collection related to COVID-19. All three datasets provide labelled news records and all the tweets related to each news item. The statistics of the datasets are shown in Table 1.
As shown in Fig. 2, the proposed fake news detection model consists of two main components: (1) unsupervised domain embedding learning (Module A); and (2) supervised domain-agnostic news classification (Module B). These two components are integrated to identify fake news while exploiting domain-specific and cross-domain knowledge in the news records. In addition, the proposed instance selection approach (Module C) adopts the same domain embedding learning component to select informative news records for labelling, which eventually yields a labelled dataset that maximizes the domain-coverage.
For a given news record r, assume that its domain label is not available. The proposed unsupervised domain embedding learning technique exploits the multimodal content (e.g., text, propagation network) of r to represent the domain of r as a low-dimensional vector f_domain(r). Our approach is motivated by: (1) the tendency of users to form groups containing people with similar interests (i.e., homophily) (McPherson et al. 2001), which results in different domains having distinct user bases; and (2) the significant differences in domain-specific word usage as shown in Figure 1a. We exploit the aforementioned motivations by constructing a heterogeneous network, whose nodes are both the users tweeting the news items and the words in the news titles, using the following steps (Lines 1-9 in Algo. 1):
(1) create a set S_r for each news record r by adding all the users U_r in the propagation network G_r and all the words appearing in the news title W_r (tokenized using whitespaces); (2) for each pair of items in S_r, build a weighted edge e linking the two items in the graph, incrementing its weight by 1 if the edge already exists; and (3) repeat Steps 1 and 2 for all the news records, until we obtain the final network G.
Algorithm 1: Unsupervised domain embedding learning. Input: a collection of news records R. Output: domain embeddings f_domain(r) of r ∈ R. // Network construction (Lines 1-9): initialize an empty graph G; for each record r, add an edge between every pair of items in S_r, incrementing the edge weight by 1 if the edge already exists in G. // Community detection (Line 10): C ← find communities in G using Louvain. // Embedding learning: compute f_domain(r) from the soft memberships p(r ∈ c), c ∈ C.
Then, we adopt the Louvain algorithm 3 to identify communities in G. Here, we select the Louvain algorithm as it was shown to be one of the best performing parameter-free community detection algorithms in (Fortunato 2010). At the end of this step, we obtain a set of communities/clusters C, each having either a highly connected set of users or of words. As the nodes of G contain both users and words, such communities may have formed either due to a set of users engaging with similar news records or due to a set of words appearing only within a fraction of news records. Following the aforementioned motivations, this work assumes each community in C belongs to a single domain. In the next step, we compute the soft membership p(r ∈ c) of r in a cluster c as:
p(r ∈ c) ∝ Σ_{v ∈ S_r ∩ c} v_deg   (1)
Here p(r ∈ c) is proportional to the number of common users or words that r and c have. Each node (i.e., user or word) v is weighted using its degree v_deg in G (i.e., its number of occurrences) to reflect its varying importance for the corresponding community. Finally, we produce the domain embedding f_domain(r) ∈ R^|C| of r as the concatenation of r's likelihoods of belonging to the communities in C:
f_domain(r) = p(r ∈ c_1) ⊕ p(r ∈ c_2) ⊕ ... ⊕ p(r ∈ c_|C|)   (2)
where ⊕ denotes concatenation.
In Figure 3, we adopt t-SNE (Maaten et al. 2008) to visualize the domain embedding space of the proposed approach and of the user-based domain discovery algorithm proposed in (Chen and Freire 2020). Due to space limitations, we present more details about the baseline in (Silva et al. 2021). As can be seen in Figure 3, the proposed approach yields a clearer separation between the domains compared to the baseline. This may be mainly due to the ability of our approach to jointly exploit multiple modalities, both the users and the text of news records, to discover their domains. In addition, most previous works on domain discovery ultimately assign hard domain labels to news records, which could lead to substantial information loss. For example, some news records may belong to multiple domains, which cannot be captured using hard domain labels. Hence, by using a low-dimensional vector to represent the domain, our approach can preserve such knowledge about the domains of news records.
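A minimal sketch of this domain embedding step (heterogeneous graph construction, Louvain community detection, and soft membership computation), assuming networkx and the python-louvain package; the function name build_domain_embeddings, the input record format, and the normalization of the membership scores are illustrative assumptions rather than details from the released code:

import networkx as nx
import community as community_louvain  # python-louvain package
from itertools import combinations

def build_domain_embeddings(records):
    """records: list of dicts with 'users' (user ids in G_r) and 'title_words' (tokenized title W_r)."""
    G = nx.Graph()
    item_sets = []
    for rec in records:
        S_r = set(rec["users"]) | set(rec["title_words"])   # S_r: users + title words of record r
        item_sets.append(S_r)
        for u, v in combinations(S_r, 2):                    # weighted edge for every co-occurring pair
            w = G[u][v]["weight"] + 1 if G.has_edge(u, v) else 1
            G.add_edge(u, v, weight=w)

    # Louvain community detection over the heterogeneous user/word graph
    partition = community_louvain.best_partition(G, weight="weight")
    comm_index = {c: i for i, c in enumerate(sorted(set(partition.values())))}

    # Soft membership p(r in c): degree-weighted overlap between S_r and community c
    embeddings = []
    for S_r in item_sets:
        scores = [0.0] * len(comm_index)
        for v in S_r:
            if v in partition:
                scores[comm_index[partition[v]]] += G.degree(v, weight="weight")
        total = sum(scores) or 1.0                            # normalization is an assumption; Eq. 1 only states proportionality
        embeddings.append([s / total for s in scores])        # f_domain(r) in R^|C|
    return embeddings

In this sketch, each record's embedding dimension equals the number of discovered communities, matching f_domain(r) ∈ R^|C| above.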
In our news classification model, each news record r is represented as a vector f_input(r) using the textual content W_r and the propagation network G_r of r (elaborated in the Experiments section). Then, our classification model maps f_input(r) into two different subspaces such that one preserves the domain-specific knowledge, f_specific: f_input(r) → R^d, and the other preserves the cross-domain knowledge, f_shared: f_input(r) → R^d, of r. Here d is the dimension of the subspaces. Then, the concatenation of f_specific(r) and f_shared(r) is used to recover the label y_r and the input representation f_input(r) of r during training via two decoder functions g_pred and g_recon respectively:
ŷ_r = g_pred(f_specific(r) ⊕ f_shared(r)),  L_pred = BCE(ŷ_r, y_r)   (3)
f̂_input(r) = g_recon(f_specific(r) ⊕ f_shared(r)),  L_recon = ||f̂_input(r) − f_input(r)||²   (4)
where ŷ_r and f̂_input(r) denote the predicted label and the predicted input representation respectively. BCE stands for the Binary Cross-Entropy loss function. We minimize L_pred and L_recon to find the optimal parameters of (f_specific, f_shared, g_pred, g_recon). However, L_pred and L_recon do not leverage domain differences in news records. Hence, we now discuss how the mapping functions for the subspaces, f_specific and f_shared, are further learned to preserve the domain-specific and cross-domain knowledge in news records.
Leveraging Domain-specific Knowledge. To preserve the domain-specific knowledge, we introduce an auxiliary loss term L_specific to learn a new decoder function g_specific to recover the domain embedding f_domain(r) of r using the domain-specific representation f_specific(r). We minimize L_specific to find the optimal parameters for (f_specific, g_specific) so that f_specific captures the domain-specific knowledge, and this process can be defined as follows:
L_specific = ||g_specific(f_specific(r)) − f_domain(r)||²   (5)
Leveraging Cross-domain Knowledge. In contrast, we learn f_shared to overlook domain-specific knowledge of the news records. Consequently, f_shared preserves the cross-domain knowledge in the news records. Here, we train a decoder function g_shared to accurately predict the domain of r using f_shared(r). Meanwhile, we learn f_shared to fool the decoder g_shared by maximizing the loss of g_shared. Such a formulation forces f_shared to rely only on cross-domain knowledge, which is useful to transfer knowledge across domains. This process can be defined as a minimax game between g_shared and f_shared over the loss
L_shared = ||g_shared(f_shared(r)) − f_domain(r)||²   (6)
which g_shared is trained to minimize while f_shared is trained to maximize.
Integrated Model. Then the final loss function of the model is formulated as:
L_final = L_pred + λ_1 L_recon + λ_2 L_specific − λ_3 L_shared   (7)
where λ_1, λ_2 and λ_3 control the importance given to each loss term compared to L_pred (i.e., the main task). To learn the minimax game in L_shared, the final loss function L_final is sequentially optimized using the following two steps:
Step 1: θ̂_1 = argmin_{θ_1} L_final   (8)
Step 2: θ̂_2 = argmin_{θ_2} L_shared   (9)
where θ_1 and θ_2 denote the parameters in (f_specific, f_shared, g_specific, g_pred, g_recon) and in g_shared respectively. The empirically studied convergence properties of the proposed optimization scheme are presented in (Silva et al. 2021).
The aforementioned model is able to exploit the domain-specific and cross-domain knowledge in news records to identify their veracity. Nevertheless, if the model is used to identify fake news records in domains that are unseen or rarely appear during training, we empirically observe that the performance of the model substantially drops. This observation is expected and is consistent with the findings in (Castelo et al. 2019), which could be due to the domain-specific word usage and propagation patterns shown in Fig. 1. Hence, we propose an unsupervised technique to construct a labelled training dataset for a given labelling budget B such that it covers as many domains as possible. The ultimate objective of this technique is to learn a model using such a dataset that performs well for many domains. Our approach initially represents each news record r ∈ R using its domain embedding f_domain(r). Then, we propose a Locality-Sensitive Hashing (LSH) algorithm based on random projection to select a set of records in R that are distant in the domain embedding space, which can be elaborated using the following steps:
1. Create |H| different hash functions of the form H_i(r) = sgn(h_i · f_domain(r)), where i ∈ {0, 1, ..., |H|−1}, h_i is a random vector, and sgn(.) is the sign function. The random vectors h_i are generated using the following probability distribution, as such a distribution was shown to perform well for random projection-based techniques (Achlioptas 2001):
each component of h_i = +1 with probability 1/6, 0 with probability 2/3, −1 with probability 1/6   (10)
2. Construct an |H|-dimensional hash value for each news record r as H_0(r) ⊕ H_1(r) ⊕ ... ⊕ H_{|H|−1}(r), where ⊕ defines the concatenation operation. According to the Johnson-Lindenstrauss lemma (Johnson et al. 1984), such hash values approximately preserve the distances between the news records in the original embedding space with high probability. Hence, neighbouring records in the domain embedding space are mapped to similar hash values.
3. Group the news records with similar hash values to construct a hash table.
4. Randomly pick a record from each bin in the hash table and add it to the selected dataset pool.
5. Repeat steps (1), (2), (3) and (4) until the size of the dataset pool reaches the labelling budget B.
In Figure 4a, we compare 10% of the original dataset selected using the proposed approach and using random selection. As can be seen, random selection follows the empirical distribution of the datasets in Table 1 and picks few instances from rarely appearing domains (e.g., fake/real news in PolitiFact, fake news in CoAID). Thus, a model trained on such a dataset may perform poorly on rarely appearing domains. In contrast, the proposed approach provides a significant number of samples even from rarely occurring domains. In addition, the proposed approach is efficient (O(|H||R|) complexity) compared to naive farthest point selection algorithms (e.g., k-Means (Lloyd 1982) with O(|R|²) complexity, where |R| >> |H|).
To measure the domain coverage of the instances selected by the proposed instance selection approach, we adopt the metric introduced in (Laib et al. 2017), which can be computed as follows for a given set of records r_1, r_2, ..., r_n represented using their domain embeddings:
λ = (1/δ̄) · sqrt( (1/n) Σ_i (δ_i − δ̄)² ),  where δ_i = min_{k ≠ i} dist(f_domain(r_i), f_domain(r_k)) and δ̄ = Σ_i δ_i / n.
If the coverage is high, λ is small. Hence, the proposed approach yields a better domain-coverage compared to random instance selection, as shown in Figure 4b.
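A compact sketch of this LSH-based selection, assuming NumPy; the function and argument names (select_instances, embeddings, budget, n_hashes) are illustrative rather than taken from the released code:

import numpy as np

def select_instances(embeddings, budget, n_hashes=10, seed=0):
    """embeddings: (n_records, |C|) matrix of domain embeddings f_domain(r); returns selected record indices."""
    rng = np.random.default_rng(seed)
    n, dim = embeddings.shape
    budget = min(budget, n)
    selected = set()
    while len(selected) < budget:
        # Sparse random projection vectors (Achlioptas 2001): +1, 0, -1 with prob. 1/6, 2/3, 1/6
        H = rng.choice([1.0, 0.0, -1.0], size=(dim, n_hashes), p=[1/6, 2/3, 1/6])
        codes = np.sign(embeddings @ H)                      # |H|-dimensional sign hash code per record
        buckets = {}
        for idx, code in enumerate(map(tuple, codes)):       # group records sharing the same hash code
            buckets.setdefault(code, []).append(idx)
        # Pick one not-yet-selected record per bucket, until the labelling budget B is reached
        for members in buckets.values():
            if len(selected) >= budget:
                break
            members = [m for m in members if m not in selected]
            if members:
                selected.add(int(rng.choice(members)))
    return sorted(selected)

Because a fresh set of projections is drawn on every pass, repeated iterations keep drawing records from different regions of the embedding space until the budget is met, mirroring steps 1-5 above.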
Encoding and Decoding Functions. In our model, each record r is initially represented as a low-dimensional vector f_input(r) using its text content and propagation network. We adopt RoBERTa-base, a robustly optimized BERT pre-training model (Liu et al. 2019), to learn the text-based representation f_text(r) of r. The propagation network-based representation f_network(r) of r is computed using an unsupervised network representation learning technique (see the supplementary material for details). Then, the final input representation f_input(r) is constructed as f_text(r) ⊕ f_network(r), where ⊕ denotes concatenation. All the other encoding and decoding functions, (f_specific, f_shared, g_specific, g_shared, g_pred, g_recon), are modelled as 2-layer feed-forward networks with sigmoid activation 4 .
Dataset. We combine three disinformation datasets: (1) PolitiFact; (2) GossipCop; and (3) CoAID, to produce a cross-domain news dataset 5 . Then, we randomly choose 75% of the dataset as the candidate data pool R_pool for training and the remaining 25% for testing. For a given labelling budget B, we select B instances from R_pool to train the model. The same process is performed for 3 different training and test splits and the average performance is reported. We evaluate the performance for each domain separately using the testing instances from that domain. For the evaluation, we adopt four metrics: (1) Accuracy (Acc); (2) Precision (Prec); (3) Recall (Rec); and (4) F1 Score (F1).
Baselines. In Table 2, we compare our approach with seven widely used fake news detection techniques and their variants 4 .
Parameter Settings. After performing a grid search, we set the hyper-parameters of our model as 4 : λ_1 = 1, λ_2 = 10, λ_3 = 5, d = 512. To satisfy the Johnson-Lindenstrauss lemma, we set |H| = 10 (>> log(|R|)). For the specific parameters of the baselines, we use the default parameters mentioned in their original papers.
Quantitative Results for Fake News Detection. As shown in Table 2, the proposed approach yields substantially better results for all three domains, outperforming the best baseline by as much as 7.55% in F1-score. The best baseline, EANN-Multimodal, also adopts domain information when determining fake news. This observation shows the importance of having domain knowledge of news records when identifying fake news in cross-domain datasets. In addition to the architectural differences of the models, EANN-Multimodal differs from our approach for two reasons: (1) EANN-Multimodal only preserves cross-domain knowledge in news records; thus, it overlooks domain-specific knowledge, which is shown to be useful in our ablation study in Table 2; and (2) EANN-Multimodal adopts a hard label (i.e., exclusive membership) to represent the domain of a news record, whereas our approach uses a vector to represent the domain of a news record. Thus, our approach can accurately represent the likelihood of each record belonging to different domains. These differences may explain the advantage of our approach over the best baseline. Among the baselines, the multimodal approaches (except HPNF+LIWC) generally achieve better results compared to the uni-modal approaches. Thus, we can conclude that each modality (i.e., propagation network and text) of news records provides unique knowledge for fake news detection. In HPNF+LIWC, each news record is represented using a set of hand-crafted features. In contrast, the other multimodal approaches, including our approach, learn data-driven latent representations for news records, which may be able to capture latent and complex information in news records that is useful for determining fake news. These observations further support two main design decisions in our model: (1) to exploit the multiple modalities of news records; and (2) to adopt a representation learning-based technique.
Table 2 shows that without the domain-specific loss (Eq. 5) and the cross-domain loss (Eq. 6), the F1-score of the model substantially drops by around 6% and 3% respectively for the PolitiFact dataset, which is the smallest domain in the training dataset. Hence, it is important to have a domain-specific layer to preserve the domain-specific knowledge and a separate cross-domain layer to transfer common knowledge between domains. To check whether our model actually learns the aforementioned intuition behind each embedding layer, we visualize each embedding layer using t-SNE in Figure 5. As can be seen, the domain-specific embedding layer preserves the domain of the news records by mapping different domains into different clusters. In contrast, we cannot identify the domain labels of news records from the cross-domain embedding space.
Hence, this embedding space is useful to share common knowledge between records from different domains. Furthermore, we analyse the contribution of each modality. It can be seen that the network modality is more useful to determine fake news in GossipCop, while the text modality is the most informative one for CoAID. This observation further signifies the importance of multimodal approaches to train models that generalize to multiple domains.
Evaluation of LSH-based Instance Selection. As shown in Table 2, our model outperforms the baselines even with a constrained budget B (50% of |R_pool|) used to select training data with the LSH-based instance selection technique. To verify its significance further, Figure 6 compares the proposed LSH-based instance selection approach with random instance selection for different B values. The proposed approach substantially outperforms random instance selection for the rarely-appearing or highly imbalanced domains. It increases the F1-score by 24% for PolitiFact and 27% for CoAID when B/|R_pool| = 0.1. This may be due to the ability of our approach to maximize the coverage of domains when selecting instances (see Figure 4), instead of biasing towards a domain with a larger number of records.
In this work, we proposed a novel fake news detection framework, which exploits domain-specific and cross-domain knowledge in news records to determine fake news from different domains. Also, we introduced a novel unsupervised approach to select informative instances for manual labelling from a large pool of unlabelled news records. The selected data pool is subsequently used to train a model that can perform equally well for different domains. The integration of the aforementioned two contributions yields a model that, even with low labelling budgets, outperforms existing fake news detection techniques by as much as 7.55% in F1-score. For future work, we intend to extend our model as an online learning framework to determine fake news in a real-world news stream, which typically covers a large number of domains. This setting introduces new challenges such as capturing newly emerging domains and handling temporal changes in domains. Also, how to use the alignment in multimodal information to weakly guide the learning process of the proposed model is another interesting direction to explore, which may further reduce the labelling cost in a conventional supervised learning setting.
In this work, the text content of a news record is represented using RoBERTa (Liu et al. 2019), a robustly optimized BERT pre-training model. For a given textual content {w_1, w_2, w_3, ..., w_n} of a news record r, the RoBERTa model returns the text-based latent representation f_text(r) ∈ R^{d_t} of r. Out of the different variants of pretrained RoBERTa models, we adopt the roberta-large model available at https://pytorch.org/hub/pytorch_fairseq_roberta/, where d_t = 1024.
We explore two types of features of the propagation network G_r = (V_r, E_r, X_r), global-level features (global) and node-level features (local), to generate the network-based representation f_network(r) of a record r. We consider all the tweets/retweets related to r as the nodes V_r of G_r. There is an extra node (i.e., source node) in G_r to represent the news, which links the different information cascades of r. The edges E_r of G_r represent how a news item spreads from one person to another, as shown in Fig. 1.
Specifically, there is an edge from node i to node j if (1) the user of tweet i mentions the user of tweet j; or (2) tweet i is public and tweet j is posted within the detection deadline (= five hours) after tweet i. We use the following global-level features: (1) Wiener index (g_1); (2) number of nodes (g_2); (3) network depth (g_3); (4) number of nodes at different hops (g_5); and (5) branching factor at different levels (g_6). Finally, all these features are concatenated together to formulate the global-level network representation f_global(r) of a record r.
The node-level features (Table 1) fall into three groups:
user: whether the user is verified (n_1), the number of followers (n_2), the number of friends (n_3), the number of lists (n_4), and the number of favourites (n_5)
text: the sentiment score computed using VADER on the text content of the tweet (n_6), the proportion of positive words (n_7), the proportion of negative words (n_8), the number of mentions (n_9), and the number of hashtags (n_10)
temporal: the time difference with the source node (n_11), the time difference with the immediate predecessor (n_12), the average time difference with the immediate successors (n_13), and the user account timestamp (n_14)
All encoding and decoding functions are modelled as 2-layer feed-forward networks with sigmoid activation. Formally, we can define an encoding/decoding function f that maps an input x ∈ R^{d_input} to an output z ∈ R^{d_output} as:
z = σ(W_2 σ(W_1 x + b_1) + b_2)
where W_1 ∈ R^{d_hidden × d_input}, W_2 ∈ R^{d_output × d_hidden}, b_1 ∈ R^{d_hidden}, and b_2 ∈ R^{d_output} are trainable parameters, and σ denotes the sigmoid activation. We set d_hidden as max(d_input, d_output)/2. For example, assume that f takes inputs of 1024 dimensions and produces outputs of 128 dimensions; then the size of the hidden layer is 512. We leave the optimal neural architecture search for each encoding and decoding function in our model as future work.
We compare our domain discovery approach with the baseline proposed in (Chen and Freire 2020), which assigns hard domain labels to news records based on the users engaged with each news record. For visualization purposes, we convert these hard domain labels (i.e., one-hot vectors) to domain embeddings, as they preserve the pairwise domain similarity between records. We elaborate the steps that we followed to generate the domain embeddings using this baseline as follows (a code sketch of these steps is given after the list):
1. Initially, we construct a network by considering each news record as a node.
2. Each news record r (i.e., node) is represented using the list of the users U_r tweeting the particular news record.
3. The pairwise similarity of two nodes r_1 and r_2 is computed as: similarity(r_1, r_2) = |U_{r_1} ∩ U_{r_2}| / |U_{r_1} ∪ U_{r_2}|. Then r_1 and r_2 are connected in the graph if similarity(r_1, r_2) > α. α is set to 0.4 following the original paper (Chen and Freire 2020).
4. The Louvain algorithm is used to identify the communities C = {c_1, c_2, ...} in the constructed graph, which yields a hard cluster (considered as a domain) assignment for each node.
5. Then each node r can be represented as a one-hot vector I_r ∈ R^{|C|}, in which I_r^i := 1 if r ∈ c_i and 0 otherwise.
6. Finally, we construct the domain embedding f_domain(r) ∈ R^{|R|} of r by concatenating the cosine similarity scores of I_r with those of the other news records: f_domain(r) = (I_r · I_{r_0}) ⊕ (I_r · I_{r_1}) ⊕ ... ⊕ (I_r · I_{r_{|R|−1}}), where ⊕ denotes the concatenation operation.
Since this approach considers news records as the nodes of the constructed graph, it is difficult to extend such an approach to learn domain embeddings for new records.
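A minimal sketch of this baseline construction, assuming networkx, python-louvain, and NumPy; the helper name baseline_domain_embeddings and the input format (a list of per-record user sets) are illustrative assumptions:

import numpy as np
import networkx as nx
import community as community_louvain  # python-louvain package

def baseline_domain_embeddings(user_sets, alpha=0.4):
    """user_sets: list of sets, user_sets[i] = users U_r tweeting record i. Returns an (|R|, |R|) embedding matrix."""
    n = len(user_sets)
    G = nx.Graph()
    G.add_nodes_from(range(n))
    # Step 3: connect records whose user overlap (Jaccard similarity) exceeds alpha
    for i in range(n):
        for j in range(i + 1, n):
            union = user_sets[i] | user_sets[j]
            if union and len(user_sets[i] & user_sets[j]) / len(union) > alpha:
                G.add_edge(i, j)
    # Step 4: Louvain communities give a hard domain label per record
    partition = community_louvain.best_partition(G)
    comm_index = {c: k for k, c in enumerate(sorted(set(partition.values())))}
    # Step 5: one-hot indicator I_r over the discovered communities
    I = np.zeros((n, len(comm_index)))
    for i in range(n):
        I[i, comm_index[partition[i]]] = 1.0
    # Step 6: embedding of r = similarities of I_r with every record's indicator
    # (for one-hot vectors the dot product coincides with cosine similarity)
    return I @ I.T

As the closing remark above notes, this baseline cannot embed a record unseen at graph-construction time, because records themselves are the graph nodes.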
In contrast, the proposed approach in this paper constructs its knowledge network using words and users as nodes. Thus, we can generate the domain embedding for a new record using the words and users related to that record. Also, our approach considers both the text and user information of news records to identify their domain labels.
We compare our fake news detection model with seven widely used baselines and their variants:
• LIWC (i.e., Linguistic Inquiry and Word Count) learns feature vectors from the text content of news records by counting the number of lexicons falling into different psycho-linguistic categories 2 . Then, a logistic regression model 3 is used as the classifier to predict fake news using the LIWC feature vectors.
• text-CNN uses a convolutional neural network (CNN) to model the text content of news records at different granularity levels with the help of multiple convolutional filters and multiple CNN layers 4 .
• HAN adopts a hierarchical attention neural network framework to model the text content of news records, which can assign varying importance to words and sentences when making final predictions via word-level and sentence-level attention 5 .
• EANN produces a latent representation for each news record using its different modalities (e.g., text, network) such that the domain-specific knowledge in news records is ignored in the latent space. Subsequently, the latent representation is used to predict the label of the news record. We compare our model with two variants of EANN:
- EANN-Unimodal only considers the text modality of a news record to generate the latent representation; and
- EANN-Multimodal considers both the text and network modalities of a news record to produce the latent embedding.
For a fair comparison of the models, we adopt the same text and network representation techniques as in our model to encode the input modalities of EANN.
• HPNF extracts various features (e.g., structural features, temporal features) from the propagation network of a news record to generate its feature representation. Then, a logistic regression model is used to classify news records using the extracted propagation network-based features. In HPNF+LIWC, we concatenate the feature vectors from HPNF and LIWC to construct the feature representation for news records.
• AE adopts an auto-encoder architecture to learn a latent representation for each news record based on its propagation network. Subsequently, the latent representations are used to determine fake news records.
• SAFE proposes a multimodal approach for fake news detection. For a given news record, this model learns separate latent representations for each modality. Also, it jointly learns another representation to capture cross-modality knowledge, which is consistent across modalities. Finally, all three representations are concatenated and fed to a classifier to predict the label of the record. The original work considers the text and image modalities of news records. For a fair comparison with our model, here we use the text and network modalities of news records for this baseline too, and we adopt the same text and network representation techniques as in our model to encode the input modalities.
This section evaluates how changes to the hyper-parameters of the model affect its performance on the fake news detection task. In Figure 2, we analyse the performance of our model for different λ_1, λ_2 and λ_3 values (see Eq. 7 in the paper), which vary the importance assigned to each loss term in our model.
Setting a very high value (> 2^2) or a very low value (< 2^(-1)) for λ_1 tends to drop the performance consistently for all three datasets. This means that the L_recon loss term should be included in our model with moderate importance compared to the other loss terms. The performance of the model for the PolitiFact and CoAID domains drops substantially for low λ_2 (< 5) and high λ_3 (> 5) values. By setting a low λ_2 (< 5) or a high λ_3 (> 5) value, our model assigns more importance to the cross-domain embedding space. The cross-domain embedding space could be dominated by frequently appearing domains (GossipCop in this dataset). Thus, by assigning more importance to the cross-domain embedding space, the model could perform poorly for small domains, e.g., PolitiFact and CoAID in this dataset, as shown in Fig. 2. This observation further signifies the importance of having domain-specific knowledge of news items to identify fake news. We also examine the sensitivity of the model's performance to other parameters: the latent dimension (d), the number of epochs, and the batch size. Overall, the model yields consistent performance for d > 256, epochs > 300, and batch size < 128.
There is only one hyper-parameter in the proposed LSH-based instance selection approach, which is the number of hash functions (|H|) used for the random projections. As shown in Figure 4, the domain coverage of the proposed approach reduces (i.e., the λ measure increases) for high |H| (> 20) values. This is intuitive because a high |H| (lengthy hash codes) could map even very close neighbours in the embedding space into different bins; thus, the instances selected from different bins could be close neighbours. In contrast, low |H| values increase the domain coverage. Nevertheless, having a very low |H| value increases the time complexity, as it requires many iterations of the hashing step to meet a given labelling budget.
In summary, we adopt the following hyper-parameter values for the results reported in the paper: (1) λ_1 = 1; (2) λ_2 = 10; (3) λ_3 = 5; (4) |H| = 10; (5) d = 512; (6) epochs = 300; (7) batch size = 64. We use the Adam optimizer for the optimization. For the parameters of the optimizer (e.g., learning rate, moments), the default parameters in Keras 6 are used. Due to the randomness involved in splitting the training and testing datasets, we conducted all our experiments using three random state values, {0, 1, 2}, and the average performance is reported in the paper.
In Figure 5, we examine the convergence properties of the loss function of our model. Our loss function consists of four terms: the prediction loss (L_pred); the reconstruction loss (L_recon); the domain-specific loss (L_specific); and the cross-domain loss (L_shared). As can be seen in Fig. 5, each loss term converges around 250 epochs. Since L_shared is trained as a minimax game, the converging L_shared in Fig. 5 empirically verifies the convergence of the proposed minimax game to exploit cross-domain knowledge in news records. Moreover, L_recon, L_specific and L_shared are mean-squared-error based loss terms, whereas L_pred is based on binary cross-entropy. Hence, the typical value range of the non-converged L_pred differs from that of the other loss terms. This also shows the importance of having λ_1, λ_2, and λ_3 to compensate for such differences between the loss functions.
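A minimal PyTorch-style sketch of the two-step optimization in Eqs. 8-9, assuming the encoders/decoders are torch.nn modules with sigmoid outputs; the module and argument names, the use of mean-squared error for the reconstruction and domain losses, and the sign convention for the adversarial term are assumptions rather than details taken from the released code:

import torch
import torch.nn.functional as F

def train_step(batch, enc_specific, enc_shared, dec_specific, dec_shared, dec_pred, dec_recon,
               opt_main, opt_disc, lam1=1.0, lam2=10.0, lam3=5.0):
    # batch: dict with f_input (float), y (float, shape (B, 1)), f_domain (float)
    x, y, d = batch["f_input"], batch["y"], batch["f_domain"]

    # Step 1 (Eq. 8): update theta_1 = (f_specific, f_shared, g_specific, g_pred, g_recon)
    z_sp, z_sh = enc_specific(x), enc_shared(x)
    z = torch.cat([z_sp, z_sh], dim=1)
    l_pred = F.binary_cross_entropy(dec_pred(z), y)      # L_pred (dec_pred ends with a sigmoid)
    l_recon = F.mse_loss(dec_recon(z), x)                # L_recon
    l_specific = F.mse_loss(dec_specific(z_sp), d)       # L_specific (Eq. 5)
    l_shared = F.mse_loss(dec_shared(z_sh), d)           # L_shared (Eq. 6)
    loss_main = l_pred + lam1 * l_recon + lam2 * l_specific - lam3 * l_shared  # Eq. 7 (assumed sign)
    opt_main.zero_grad()
    loss_main.backward()
    opt_main.step()

    # Step 2 (Eq. 9): update theta_2 = g_shared to predict the domain from f_shared(r)
    l_disc = F.mse_loss(dec_shared(enc_shared(x).detach()), d)
    opt_disc.zero_grad()
    l_disc.backward()
    opt_disc.step()
    return loss_main.item(), l_disc.item()

Here opt_main holds the parameters of θ_1 and opt_disc those of g_shared only, so the adversarial gradient from Step 1 never updates the discriminator directly.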
References
Database-friendly Random Projections
Active Learning Based News Veracity Detection with Feature Weighting and Deep-shallow Fusion
Fast Unfolding of Communities in Large Networks
A Topic-agnostic Approach for Identifying Fake News Pages
Proactive Discovery of Fake News Domains from Real-Time Social Media Feeds
Community Detection in Graphs
Unsupervised Content-based Identification of Fake News Articles with Tensor Decomposition Ensembles
Cross-Domain Failures of Fake News Detection
Multimodal Fusion with Recurrent Neural Networks for Rumor Detection on Microblogs
Extensions of Lipschitz Mappings into a Hilbert Space
MVAE: Multimodal Variational Autoencoder for Fake News Detection
Convolutional Neural Networks for Sentence Classification
Overcoming Catastrophic Forgetting in Neural Networks
All-in-one: Multi-task Learning for Rumour Verification
Unsupervised Feature Selection Based on Space Filling Concept
RoBERTa: A Robustly Optimized BERT Pretraining Approach
Early Detection of Fake News on Social Media Through Propagation Path Classification with Recurrent and Convolutional Networks
Least Squares Quantization in PCM
Gradient Episodic Memory for Continual Learning
Detect Rumors in Microblog Posts Using Propagation Structure via Kernel Learning
Visualizing Data using t-SNE
Birds of a Feather: Homophily in Social Networks
The Development and Psychometric Properties of LIWC2015
Automatic Detection of Fake News
CSI: A Hybrid Deep Model for Fake News Detection
dEFEND: Explainable Fake News Detection
FakeNewsNet: A Data Repository with News Content, Social Context, and Spatiotemporal Information for Studying Fake News on Social Media
Hierarchical Propagation Networks for Fake News Detection: Investigation and Exploitation
Embedding Partial Propagation Network for Fake News Early Detection
Supplementary Materials for Embracing Domain Differences in Fake News: Cross-domain Fake News Detection using Multi-modal Data
Separating Facts from Fiction: Linguistic Models to Classify Suspicious and Trusted News Posts on Twitter
EANN: Event Adversarial Neural Networks for Multi-Modal Fake News Detection
Weak Supervision for Fake News Detection via Reinforcement Learning
Tracing Fake-News Footprints: Characterizing Social Media Messages by How They Propagate
Unsupervised Fake News Detection on Social Media: A Generative Approach
Hierarchical Attention Networks for Document Classification
ReAct: Online Multimodal Embedding for Recency-aware Spatiotemporal Activity Modeling
SAFE: Similarity-Aware Multi-modal Fake News Detection
Clustop: A Clustering-based Topic Modelling Algorithm for Twitter using Word Networks
Near Linear Time Algorithm to Detect Community Structures in Large-scale Networks
Maps of Random Walks on Complex Networks Reveal Community Structure
Beyond News Contents: The Role of Social Context for Fake News Detection
Acknowledgements. This research was financially supported by a Melbourne Graduate Research Scholarship and a Rowden White Scholarship. We would like to specially thank Yi Han for his insightful comments and suggestions for this work. We are also grateful for the time and effort of the reviewers in providing valuable feedback on our manuscript.
This section presents more details about the Louvain algorithm, which is used in the proposed domain embedding learning approach to identify communities in a network. As shown in Algorithm 1, the Louvain algorithm identifies the communities in a network using the following steps:
1. Each vertex is placed in its own community (Line 1 in Algo. 1);
2. Each vertex is either retained in its own cluster or merged with an immediate neighbour such that the modularity score of the network is maximised (Lines 3-15 in Algo. 1). The modularity gain of moving vertex i into a community/cluster is computed as:
ΔQ = [ (Σ_in + k_{i,in})/2m − ((Σ_tot + k_i)/2m)² ] − [ Σ_in/2m − (Σ_tot/2m)² − (k_i/2m)² ]
where Σ_in and Σ_tot represent the total weight of all links inside the community/cluster and the total weight of all links incident to the community/cluster, respectively. Similarly, the terms k_i and k_{i,in} denote the total weight of all links to i and the total weight of links from i to nodes within the community/cluster. Lastly, m denotes the total weight of all links in the network graph;
3. Build a new network where vertices in the same community are combined into a single vertex. Repeat Steps 2 and 3 until there are no more mergings between communities.
At the end of this algorithm, we obtain a set of communities of the provided network such that the modularity score of the network is maximised. We selected this algorithm in our model because it is known (Lim, Karunasekera, and Harwood 2017) to generate a relatively small number of communities compared to other parameter-free community detection algorithms such as Infomap (Rosvall and Bergstrom 2008) and Label Propagation (Raghavan, Albert, and Kumara 2007).
In our model, each news record r is input as a low-dimensional vector f_input(r) using its text content (i.e., news title) and propagation network (i.e., social context). Initially, we construct two independent representations for r using its text content, f_text(r), and its propagation network, f_network(r). Then, these two representations are concatenated to produce the final representation of r: f_input(r) = f_text(r) ⊕ f_network(r).
Local Representation. For the node-level features, we extract three types of features: (1) text-based; (2) user-based; and (3) temporal-based, which are listed in Table 1. For a given propagation network G_r of a record r, all the features in Table 1 are extracted to represent each vertex (i.e., tweet) in G_r. Then, we adopt the node-level aggregation approach proposed in prior work to propagate the aforementioned node-level features to the source node, as elaborated in Algo. 2 (input: the propagation network G_r = (V_r, E_r, X_r) and the source node v_s ∈ V_r of r; output: the local representation f_local(r)). This algorithm returns the final representation of the source node (see Fig. 1) of G_r as the local representation f_local(r) of r. Finally, the network-based representation is formulated as f_network(r) = f_global(r) ⊕ f_local(r), where ⊕ denotes concatenation. Note: we standardise 1 each dimension of f_network(r) before inputting it to the model to stabilise the learning process of our model.
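A small illustrative sketch of extracting the node-level features in Table 1 from a single tweet, assuming tweets are given as dicts with Twitter-API-style fields and using the vaderSentiment package; the field names, the helper name node_features, and the tiny positive/negative word lists are assumptions for illustration only:

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

_vader = SentimentIntensityAnalyzer()
POSITIVE = {"good", "great", "true"}   # illustrative lexicons; the original word lists are not specified here
NEGATIVE = {"bad", "fake", "hoax"}

def node_features(tweet, source_time, parent_time, child_times):
    """tweet: dict with 'user', 'text', 'created_at' (epoch seconds); the time arguments are epoch seconds."""
    user, words = tweet["user"], tweet["text"].lower().split()
    n_words = max(len(words), 1)
    return [
        # user-based features (n1-n5)
        float(user["verified"]), user["followers_count"], user["friends_count"],
        user["listed_count"], user["favourites_count"],
        # text-based features (n6-n10); mention/hashtag counts are approximated by symbol counts
        _vader.polarity_scores(tweet["text"])["compound"],
        sum(w in POSITIVE for w in words) / n_words,
        sum(w in NEGATIVE for w in words) / n_words,
        tweet["text"].count("@"), tweet["text"].count("#"),
        # temporal features (n11-n14)
        tweet["created_at"] - source_time,
        tweet["created_at"] - parent_time,
        (sum(child_times) / len(child_times) - tweet["created_at"]) if child_times else 0.0,
        user["created_at"],
    ]

Vectors produced this way per node would then be propagated to the source node by the aggregation step (Algo. 2) to obtain f_local(r).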
In our fake news detection classifier, we have six encoding and decoding functions, (f_specific, f_shared, g_specific, g_shared, g_pred, g_recon). In this work, all these functions are modelled as 2-layer feed-forward networks with sigmoid activation, as described above.
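As a concrete illustration of these encoder/decoder blocks, here is a minimal PyTorch sketch of the 2-layer feed-forward network with sigmoid activations and the hidden-size rule d_hidden = max(d_input, d_output)/2 described above; the class name EncoderDecoder is illustrative:

import torch.nn as nn

class EncoderDecoder(nn.Module):
    """2-layer feed-forward block used for f_specific, f_shared, g_specific, g_shared, g_pred, g_recon."""
    def __init__(self, d_input, d_output):
        super().__init__()
        d_hidden = max(d_input, d_output) // 2   # e.g., 1024-dim input and 128-dim output give a 512-dim hidden layer
        self.net = nn.Sequential(
            nn.Linear(d_input, d_hidden), nn.Sigmoid(),
            nn.Linear(d_hidden, d_output), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.net(x)

For example, g_pred would be instantiated as EncoderDecoder(2 * d, 1), taking the concatenated 2d-dimensional representation and producing a single sigmoid-activated veracity score.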