1 Introduction

Detecting fake news is a challenging task since fake news constantly evolves and influences how social groups form and accept opinions [6]. To minimize the effects of the disinformation spread by fake news, Machine Learning algorithms have been proposed in the literature, which learn classification models to discriminate true from fake content [4, 22]. The most common way to deal with this problem is to characterize fake news detection as a supervised binary or multiclass classification problem [22]. However, labeling real and fake news to cover different subjects, sources, and falsity levels, as well as updating the classification model, are costly processes [4]. All of these challenges drive the investigation of more appropriate models for representing and classifying news.

Positive and Unlabeled Learning (PUL) algorithms can be good alternatives in this scenario [3]. PUL algorithms learn models from a small amount of labeled data of the interest class and use unlabeled data to increase classification performance [9]. Therefore, PUL eliminates the need to label a large number of news items of the uninteresting classes. Given the suitability of PUL to the fake news detection scenario, in this paper we propose an approach based on the Positive and Unlabeled Learning by Label Propagation (PU-LP) algorithm [9], a network-based semi-supervised transductive learning algorithm. Despite its good performance on numerical datasets, PU-LP has been little explored for text classification. PU-LP infers reliable sets of interest (fake news) and non-interest (real news) examples using a small amount of labeled data, the Katz index, and a k-Nearest Neighbors (k-NN) network. Then, a label propagation algorithm labels the remaining unlabeled objects as interest or non-interest.

Representing news using traditional models, i.e., Bag-of-Words or embeddings, may not be discriminative enough to distinguish fake from real content [17, 21]. Thus, an important challenge for automatic fake news detection is to assemble a set of features that efficiently characterizes false content in a dynamic scenario. This set may contain features about the creator or spreader, the target victims, the news content, and the social context [22]. In this work, we propose using content-based features in a domain-independent fake news detection approach based on the PU-LP algorithm. Our network uses linguistic features such as representative terms, news emotiveness, average number of words per sentence, and pausality as network objects, thus forming a heterogeneous network [12]. These features have proven to be relevant for news classification [13]. Also, the presence of different objects and relations in the network allows for representing different patterns in the data [15]. To create the k-NN news network, we represent the text with the Bag-of-Words (BoW) and Doc2Vec (D2V) models to compute the similarity between news. To the best of our knowledge, heterogeneous networks and PUL have not been explored in the literature to detect fake news.

We evaluated different combinations of features to measure their impact on classification performance. Our experiments used relatively small sets of labeled fake news: from 10% to 30% of the fake news set. The heterogeneous network was compared with the k-NN networks used in the original proposal of PU-LP. Two Portuguese datasets and four English datasets were used, each containing one or more subjects. For the label propagation, we used two well-established algorithms: Label Propagation Through Heterogeneous Networks [12] and GNetMine [7]. We evaluated the algorithms considering the \(F_{1}\) measure of the fake news class. The results obtained by the proposed approach were compared with a semi-supervised binary baseline and demonstrate competitiveness even with half the news initially labeled and without labeling any real news. This paper is organized as follows: Sects. 2 and 3 present related work. Section 4 describes the proposed approach. Section 5 presents the experimental evaluation and discussions. Section 6 presents conclusions and future work.

2 Related Work

Existing methods for news classification generally represent news in a structured way using traditional models such as BoW, word (or document) embeddings, or networks [22]. However, these models are not discriminative enough to capture the nuances that distinguish fake from real news, requiring the investigation of textual and contextual features that can be incorporated into the representation model [17, 21, 22]. Contextual features are extracted from the news environment, such as post-, source-, and publisher-related information [21]. Textual features involve lexical, syntactic, and semantic aspects of the news content. Lexical features are related to word frequency statistics, which can be captured using n-gram models able to identify, for example, the presence of vague, doubtful, or sensationalist expressions [4]. Syntactic features are related to the presence and frequency of part-of-speech patterns, such as subjects, verbs, and adjectives [17]. At the semantic level, meanings are extracted from terms, where methods like Linguistic Inquiry and Word Count (LIWC) [10] can be used to estimate emotional, cognitive, and structural components present in the written language.

In [17], the authors proposed representation models for the Fake.BR dataset that combine BoW and linguistic features, such as pausality, emotiveness, and uncertainty. The authors reached 96% macro \(F_{1}\) considering two thirds of the data as training. In [18], the authors proposed a multimodal representation that combines text and image information. For each news item, 124 textual features (81 of them extracted with LIWC) and 43 image features are collected. The approach achieves 95% accuracy with a Random Forest using approximately 7,000 news documents and 90% of the data in the training set.

Some works use networks as a representation model to detect false information. [13] proposes a knowledge network for fact-checking, in which the input is a triple (subject, predicate, object). The network is constructed with information from Wikipedia: given a sentence, if it exists within the knowledge network, it is considered authentic. The approach achieves approximately 74% accuracy. [21] proposes a diffusive graph neural network for inferring the credibility of news, creators, and subjects. The graph has approximately 14,000 news items, 3,600 creators, and 152 subjects. The approach achieves the best overall performance in binary classification compared to SVM, DeepWalk, and a neural network. [11] proposes an Adversarial Active Learning Heterogeneous Graph Neural Network to detect fake news. A hierarchical attention mechanism is used to learn node representations, and a selector is responsible for choosing high-value candidates for active learning. The authors evaluate two datasets: the first containing 14,055 news items, 3,634 creators, and 152 subjects, and the second containing 182 news items, 15,257 Twitter users, and 9 publishers. With 20% of real and fake news labeled, the algorithm reaches 57% and 70% macro \(F_{1}\), respectively. [19] proposes IARNet, an Information Aggregating and Reasoning Network to detect fake news on heterogeneous networks. The network contains the post source, comments, and users as nodes, with the interactions between them as edges. The authors evaluated two datasets, Weibo and Fakeddit, using 70% of the labeled news for training. The algorithm achieved 96% accuracy.

To avoid the effort of labeling real news, [5] proposes DCDistanceOCC, a One-Class Learning (OCL) algorithm to identify false content. For each news item, linguistic characteristics are extracted, such as the number of words per sentence and the sentiment of the message. A new example is classified as the interest class if its distance to the class vector is below a threshold. The algorithm is trained using 90% of the fake news. The approach reaches an average \(F_1\) ranging from 54% to 67%, the latter obtained on the Fake.BR dataset.

It is possible to observe that most of the existing work on fake news detection adopts binary supervised learning approaches, which require labeling news of both the real and fake classes. Additionally, the broad spectrum of real news makes the generation of a representative labeled training set difficult. PUL algorithms, which mitigate this drawback of binary supervised learning for fake news detection, are scarce in this scenario. Also, heterogeneous networks, which are usually more adequate for semi-supervised text classification [12], have not been explored in the literature for detecting fake news. Considering these gaps, in this paper we propose a PUL approach for fake news detection based on a heterogeneous network composed of news, terms, and linguistic features. Our approach uses only information that can be collected from the publication's content, without depending on external information. The proposed approach is detailed in the next sections.

3 Positive and Unlabeled Learning Algorithms

PUL algorithms learn models from a set of labeled documents of an interest class and unlabeled documents, either to train a classifier (inductive semi-supervised learning) or to classify known unlabeled documents (transductive semi-supervised learning) [3]. The purpose of using unlabeled documents during learning is to improve classification performance. Besides, since unlabeled documents are easy to collect and a user has to label only a few documents of the interest class, PUL has gained attention in recent years [3, 9].

The most common PUL approaches are those that perform learning in two stages. In the first stage, a set of non-interest documents is generated by extracting reliable outlier documents to be treated as the non-interest class. In addition, the set of interest documents can also be augmented with reliable interest documents. Once there are interest, non-interest, and unlabeled documents, a transductive learning algorithm is applied in the second stage to infer the labels of the remaining unlabeled documents [3]. Generally, PUL algorithms based on this framework apply self-training in some or all of the steps, which can be computationally costly without improving classification performance [12]. Algorithms based on the vector-space model, such as Rocchio or Expectation Maximization, have also demonstrated lower performance than other approaches in text classification [12, 20].

Despite the benefits of network-based approaches in semi-supervised learning scenarios, there are few network-based PUL proposals, such as Positive documents Enlarging PU Classifier (PE-PUC) [20] and Positive and Unlabeled Learning by Label Propagation (PU-LP) [9]. PE-PUC uses network representations only to augment the set of interest documents with reliable positive documents; the network is not used in the classification step. The remaining steps rely on Bayesian classifiers, which have been shown to perform poorly when the labeled training set is small [12]. Networks allow extracting patterns of the interest class even when its examples follow different distributions or densities across regions of the space [15]. As PU-LP is entirely network-based and achieved good performance in [9], we propose to use it for news classification. The next section describes the proposed approach.

4 Proposed Approach: PU-LP for Fake News Detection

Considering that PUL algorithms can reduce the labeling effort for news classification, the benefits of network-based representations for semi-supervised learning, and that adding extra information to the network can help differentiate real and fake content, we propose an approach based on the PU-LP algorithm applied to fake news detection. After PU-LP infers the interest and non-interest news sets in the news network, we add relations (edges) between news, relevant terms, and linguistic features, forming a heterogeneous network. Then, label propagation is applied to classify the unlabeled news. We compared the different proposed heterogeneous networks with the traditional PU-LP algorithm, which considers only the news network. We use different parameters, label propagation algorithms, and representation models for unstructured data to assess the algorithms' behavior. Figure 1 presents the proposed approach of PU-LP for semi-supervised fake news detection. The next subsections present details of each stage of the proposed approach.

Fig. 1. Proposed approach for detecting fake news based on the semi-supervised PU-LP algorithm. Circular nodes represent news; nodes with the letter t are representative unigrams and bigrams; nodes with the letter p correspond to pausality, e to emotiveness, and s to the average number of words per sentence.

4.1 News Collection and Representation Model

Let \(\mathcal {D} = \{d_1, \ldots , d_l, d_{l+1}, \ldots , d_{l+u} \}\) be a news set and \(\mathcal {C} = \big \{\textit{interest},\)\( \textit{not interest}\big \}\), i.e., \(\mathcal {C} = \{\textit{fake}, \textit{real}\}\), be the set of class labels. The first l elements of \(\mathcal {D}\) are labeled fake news, composing the labeled interest set \(\mathcal {D}^{+}\). The remaining u elements are unlabeled news (fake and real), composing the set \(\mathcal {D}^{u}\), with \(u \gg l\) (Fig. 1 - Stage 1). The news dataset must be pre-processed, and a representation model, such as BoW or document embeddings [1], must be adopted to transform the news into structured data (Fig. 1 - Stage 2). Next, PU-LP builds an adjacency matrix with the complete set of examples \(\mathcal {D}\), as detailed below.

4.2 k-NN Matrix, Katz Index and Sets Extraction

The representation model and a chosen distance metric are used to calculate an adjacency matrix, in which news items with similar content have a low distance between them. The adjacency matrix is used as the basis for building a k-Nearest Neighbors (k-NN) matrix, called A, so that \(A_{i,j} = 1\) if the news \(d_j\) is one of the k nearest neighbors of the news \(d_i\), and \(A_{i,j} = 0\) otherwise (Fig. 1 - Stage 3). From the k-NN matrix, a heterogeneous network \(\mathcal {N}\) is also created. The heterogeneous network can be defined as a triple \(\mathcal {N} = \langle \mathcal {O},\mathcal {R},\mathcal {W} \rangle \), in which \(\mathcal {O}\) is the set of objects, \(\mathcal {R}\) is the set of relations between objects, and \(\mathcal {W}\) is the set of weights of these relations. Given two objects \(o_i, o_j \in \mathcal {O}\), the relation between them is \(r_{o_i,o_j}\), with weight \(w_{o_i,o_j}\). In \(\mathcal {N}\), objects are news, the relations between news are created according to the k-NN matrix, and the weights \(w_{o_i,o_j}\) are the cosine similarity between news. Although cosine is a good local similarity measure, it is not good enough as a global one. To consider paths in the news network that involve common neighboring news, [9] proposes using the Katz index. The basic idea is that if two news items have many neighbors in common, they are likely to belong to the same class. The Katz index is a global similarity measure that computes the similarity between pairs of nodes considering all possible paths in a graph that connect them. Thus:
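A minimal sketch of Stage 3, assuming the news vectors are rows of a dense matrix; the function name and inputs are illustrative, not from [9]:

```python
import numpy as np

def knn_matrix(X, k=5):
    """Build the k-NN adjacency matrix A used by PU-LP: A[i, j] = 1 iff
    document j is among the k most cosine-similar documents to document i.

    X: (n_docs, n_features) matrix of BoW or Doc2Vec vectors.
    """
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    sim = Xn @ Xn.T                      # pairwise cosine similarities
    np.fill_diagonal(sim, -np.inf)       # a document is not its own neighbor
    A = np.zeros_like(sim)
    for i in range(sim.shape[0]):
        A[i, np.argsort(sim[i])[-k:]] = 1
    return A
```

As defined, A is directed (j being a neighbor of i does not imply the converse); symmetrizing with `np.maximum(A, A.T)` is a common variant.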

$$\begin{aligned} sim(d_i,d_j) = \sum ^{\infty }_{h=1} \alpha ^h \cdot |\textit{path}_{d_i,d_j}^{<h>}| = \alpha A_{i,j} + \alpha ^2 (A^2)_{i,j} + \alpha ^3 (A^3)_{i,j} + ..., \end{aligned}$$
(1)

in which \(\alpha \) is a parameter that controls the influence of paths: with a very small \(\alpha \), long paths contribute very little. When \(\alpha < 1/\epsilon \), where \(\epsilon \) is the largest eigenvalue of the matrix A, Eq. 1 converges and can be calculated in closed form as \(S = (I - \alpha A)^{-1} - I\), where I denotes the identity matrix and S has dimensions \(|\mathcal {D}| \times |\mathcal {D}|\). Thus, \(S_{i,j} \in \mathbb {R}\) denotes the similarity between the nodes \(d_i\) and \(d_j\) according to the Katz index (Fig. 1 - Stage 4). The labeled news in \(\mathcal {D}^{+}\) and the similarity matrix S are used to infer two sets: the set of reliable interest news RI and the set of reliable non-interest news RN (Fig. 1 - Stage 5). The set RI contains the news from \(\mathcal {D}^{u}\) most similar to the examples in \(\mathcal {D}^{+}\), and RN contains the news from \(\mathcal {D}^{u} - RI\) most dissimilar to the set \(\mathcal {D}^{+} \cup RI\).
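The closed-form computation of the Katz matrix can be sketched as follows (the function name is ours; the convergence condition follows the text above):

```python
import numpy as np

def katz_similarity(A, alpha=0.01):
    """Closed-form Katz index: S = (I - alpha*A)^{-1} - I.

    The series in Eq. 1 converges when alpha < 1 / (largest eigenvalue of A).
    """
    eps = np.max(np.abs(np.linalg.eigvals(A)))
    assert alpha < 1 / eps, "alpha too large: Katz series does not converge"
    I = np.eye(A.shape[0])
    return np.linalg.inv(I - alpha * A) - I
```

The closed form avoids summing the infinite series explicitly; for large networks, a truncated series or sparse solver would be preferable to a dense inverse.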

For the inference of the RI set, an iterative method is applied. The task is divided into m iterations. In each iteration, the number of reliable interest news extracted is \((\lambda /m) \times |\mathcal {D}^{+}|\), where \(\lambda \) controls the set's size. The news in \(\mathcal {D}^{u}\) are ranked according to their average similarity (based on S) to all news in \(\mathcal {D}^{+}\). The \((\lambda /m)|\mathcal {D}^{+}|\) most similar news are taken from \(\mathcal {D}^{u}\) and added to \(RI'\). At the end of each iteration, RI is incremented with the elements in \(RI'\) (\(RI \leftarrow RI \cup RI'\)) [9].

For the inference of the reliable non-interest set, the news in \(\mathcal {D}^{u} - RI\) are ranked according to their average similarity (based on S) to all news in \(\mathcal {D}^{+} \cup RI\). The algorithm extracts the \(|\mathcal {D}^{+} \cup RI|\) most dissimilar news, forming the set RN. After obtaining RN, the sets \(\mathcal {D}^{+} \cup RI\), RN, and \(\mathcal {D}^{u} \leftarrow (\mathcal {D}^{u} - RI - RN)\) are used as input by label propagation algorithms based on transductive semi-supervised learning. More details about the algorithm can be found in [9].
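The extraction of RI and RN can be sketched as below. This is a simplified reading of the iterative scheme in [9]: we assume each iteration ranks the unlabeled news against the positive set augmented with the RI extracted so far, and that ties are broken by index order.

```python
import numpy as np

def extract_reliable_sets(S, positive, unlabeled, lam=0.6, m=2):
    """Sketch of PU-LP's reliable-set extraction (Fig. 1 - Stage 5).

    S: Katz similarity matrix; positive/unlabeled: lists of node indices;
    lam, m: the lambda and m parameters from the text.
    """
    P, U, RI = list(positive), list(unlabeled), []
    per_iter = max(1, int((lam / m) * len(P)))   # news extracted per iteration
    for _ in range(m):
        # rank unlabeled docs by mean similarity to the current interest set
        avg = {u: S[u, P + RI].mean() for u in U}
        best = sorted(U, key=lambda u: -avg[u])[:per_iter]
        RI += best
        U = [u for u in U if u not in best]
    # the |P ∪ RI| docs least similar to P ∪ RI form the non-interest set
    avg = {u: S[u, P + RI].mean() for u in U}
    RN = sorted(U, key=lambda u: avg[u])[:len(P) + len(RI)]
    return RI, RN
```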

4.3 Adding Features in the News Network

Considering that the PU-LP algorithm is entirely network-based, we propose adding, as nodes in the news network \(\mathcal {N}\), features that have been shown to be relevant for news classification in [2, 17]:

  • uncertainty = total number of modal verbs and passive voice constructions

  • non-immediacy = total number of singular first and second person pronouns

  • emotiveness (emo) = \(\frac{\text {total number of adjectives + total number of adverbs}}{\text {total number of nouns + total number of verbs}}\)

  • pausality (pau) = \(\frac{\text {total number of punctuation marks}}{\text {total number of sentences}}\)

  • average words sentence (avgws) = \(\frac{\text {total number of words}}{\text {total number of sentences}}\)
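Under simple, hypothetical tokenization rules (sentences split on `.`, `!`, `?`; words as `\w+` runs), the three selected features can be computed as in this sketch. Emotiveness takes POS counts from an external tagger (e.g., spaCy), which we do not reproduce here:

```python
import re

def _sentences(text):
    """Naive sentence split on ., ! and ? (a hypothetical tokenizer)."""
    return [s for s in re.split(r'[.!?]+', text) if s.strip()]

def pausality(text):
    """pau = punctuation marks / sentences."""
    return len(re.findall(r'[.,;:!?]', text)) / max(1, len(_sentences(text)))

def avg_words_sentence(text):
    """avgws = words / sentences."""
    return len(re.findall(r'\w+', text)) / max(1, len(_sentences(text)))

def emotiveness(n_adj, n_adv, n_noun, n_verb):
    """emo = (adjectives + adverbs) / (nouns + verbs); POS counts come
    from an external tagger, not computed here."""
    return (n_adj + n_adv) / max(1, n_noun + n_verb)
```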

Among the features considered, we computed, for each dataset, which ones had the highest correlation with the target attribute and appeared in more than one dataset. The features chosen to be included in the network were pausality, emotiveness, and average words per sentence. Table 1 shows the correlation values. The features are added to the news network \(\mathcal {N}\) as new unlabeled nodes \(f_j\), \(1 \le j \le \text {total features}\). New edges between news and features are added, whose weights \(w_{d_i,f_j}\) correspond to the normalized value of the feature \(f_j\) for the news \(d_i\). We also considered other linguistic features extracted by LIWC [10], such as counts of personal pronouns, prepositions, verbs, nouns, adjectives and quantitative words, the use of affective words, positive and negative emotions, and anxiety, but they did not prove relevant in our proposal.

Table 1. Correlations of the main linguistic features considering all news datasets.

In addition to the linguistic characteristics, we also considered terms as network nodes, since news terms are widely used in research to discriminate true from false content. They have also proven useful in label propagation for text classification [12]. To select terms, stopwords were removed and the remaining terms were stemmed. Then, single terms and sequences of two terms, i.e., unigrams and bigrams, are selected as network nodes if they have a minimum document frequency of 2 documents and their term frequency-inverse document frequency (tf-idf) value in a document is above a threshold \(\ell \).

Table 2. List of characteristics included in each proposed heterogeneous network.

Twelve different heterogeneous networks were created by combining all the features. Table 2 presents the features included in each one. After adding all the features as nodes to the network \(\mathcal {N}\), we normalize the network relations to mitigate possible distortions due to the different ranges of the different relation types. The relations are normalized by relation type. Thus, the edge weight between an object \(o_{i} \in \mathcal {O}_{l}\) and an object \(o_{j} \in \mathcal {O}_m\) is given by Eq. 2:

$$\begin{aligned} w_{o_i,o_j} = \frac{w_{o_i,o_j}}{\sum _{o_{k}} w_{o_i,o_k}}, o_{i} \in \mathcal {O}_{l}, {o_{j} \in \mathcal {O}_m}, \text{ and } o_{k} \in \mathcal {O}_{m} \end{aligned}$$
(2)
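Eq. 2 amounts to row-normalizing each relation-type block of the weight matrix separately, as in this sketch (the dictionary layout of the blocks is an assumption):

```python
import numpy as np

def normalize_relations(W):
    """Apply Eq. 2: divide each edge weight by the sum of the weights of
    same-type edges leaving the same object.

    W: dict mapping a relation type, e.g. ("news", "term"), to its
    (|O_l| x |O_m|) weight matrix -- a hypothetical layout.
    """
    out = {}
    for rel, M in W.items():
        row_sums = M.sum(axis=1, keepdims=True)
        row_sums[row_sums == 0] = 1          # avoid division by zero
        out[rel] = M / row_sums
    return out
```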

After building the news and terms network, the next stage (Fig. 1 - Stage 9) is to carry out the label propagation using transductive learning algorithms for heterogeneous networks.

4.4 Label Propagation

We propose using regularization-based transductive algorithms for heterogeneous networks to classify the unlabeled news as fake or real. Regularization-based algorithms satisfy two premises: (i) the class information of neighboring objects must be similar; and (ii) the class information assigned to labeled objects during the classification process must be similar to their real class information [12]. The two algorithms considered in this paper are Label Propagation through Heterogeneous Networks (LPHN) [12] and GNetMine (GNM) [7].

In order to explain the regularization functions of both algorithms, let \(\mathbf {f}_{o_i} = \{f_{\textit{interest}}, f_{\textit{not-interest}}\}\) be the class information vector, which indicates how strongly an object \(o_i\) belongs to each class, and let \(\mathbf {y}_{o_i}\) be the real class information vector, in which the position corresponding to the object's class is set to 1. Thus, only labeled objects have values in vector \(\mathbf {y}\). The term \( w_{o_i, o_j}\) is the weight of the edge connecting object \(o_i\) to object \(o_j\), and \(\mathcal {O}^{L}\) refers to the set of labeled objects. The regularization function minimized by LPHN is given by:

$$\begin{aligned} Q(\mathbf {F}) = \sum _{\mathcal {O}_i,\mathcal {O}_j \subset \mathcal {O}} \frac{1}{2} \sum _{{o_i} \in \mathcal {O}_i} \sum _{{o_j} \in \mathcal {O}_j} w_{o_i, o_j}(\mathbf {f}_{o_i} - \mathbf {f}_{o_j})^2 + \lim _{\mu \rightarrow \infty } \mu \sum _{o_i \in \mathcal {O}^L} (\mathbf {f}_{o_i} - \mathbf {y}_{o_i})^2. \end{aligned}$$
(3)

We can observe that the different relations have the same importance, and that the \(\lim _{\mu \rightarrow \infty }\) term forces \(\mathbf {f}_{o_i} = \mathbf {y}_{o_i},\, \forall o_{i} \in \mathcal {O}^{L}\), i.e., the class information of labeled objects does not change. After inferring the \(\mathbf {f}\) vectors of the unlabeled objects, Class Mass Normalization (CMN) [23] is applied, and then the objects are classified according to the \(\arg \)-\(\max \) of the corresponding \(\mathbf {f}\) vector.
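A simplified iterative version of this propagation, with labeled objects clamped to their real class information, can be sketched as follows. This flat-matrix form ignores the per-relation-type structure and the CMN step for brevity:

```python
import numpy as np

def lphn_propagate(W, y, labeled, n_iter=1000, tol=5e-5):
    """Sketch of LPHN-style propagation: each unlabeled object's class
    vector becomes the weighted average of its neighbors'; labeled objects
    stay clamped to y (the role of the lim term in Eq. 3).

    W: (n, n) symmetric weight matrix; y: (n, 2) class vectors (unlabeled
    rows start at zero); labeled: boolean mask of labeled objects.
    """
    deg = W.sum(axis=1, keepdims=True)
    deg[deg == 0] = 1                       # isolated nodes: avoid div by 0
    f = y.astype(float).copy()
    for _ in range(n_iter):
        f_new = (W @ f) / deg               # neighborhood average
        f_new[labeled] = y[labeled]         # clamp labeled objects
        converged = np.abs(f_new - f).max() < tol
        f = f_new
        if converged:
            break
    return f.argmax(axis=1)                 # column 0 = interest (fake)
```

The convergence threshold and iteration cap mirror the values used in Sect. 5.2.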

In GNetMine [7], the relations between different objects have different importance. The label of an object \(o_{i} \in \mathcal {O}^{L}\) can be changed during the classification process if information from neighboring objects diverges from the class of the object initially labeled. The regularization performed by GNetMine is given by:

$$\begin{aligned} Q(\mathbf {F}) =&\sum _{\mathcal {O}_i,\mathcal {O}_j\subset \mathcal {O}} \lambda _{\mathcal {O}_i,\mathcal {O}_j} \sum _{o_k\in \mathcal {O}_i} \sum _{o_l\in \mathcal {O}_j} w_{o_k,o_l} \left| \left| \frac{\mathbf {f}_{o_k}(\mathcal {O}_i)}{\sqrt{\displaystyle \sum _{o_m\in \mathcal {O}_j}w_{o_k,o_m}}} - \frac{\mathbf {f}_{o_l}(\mathcal {O}_j)}{\sqrt{\displaystyle \sum _{o_m\in \mathcal {O}_i}w_{o_l,o_m}}} \right| \right| ^2 \nonumber \\&\quad +\sum _{o_j\in \mathcal {O}^L}\alpha _{o_j}\,||\mathbf {f}_{o_j}-\mathbf {y}_{o_j}||^2 \end{aligned}$$
(4)

in which \(\lambda _{\mathcal {O}_i,\mathcal {O}_j}\) \((0 \le \lambda _{\mathcal {O}_i,\mathcal {O}_j} \le 1)\) is the importance given to the relation between objects of types \(\mathcal {O}_i\) and \(\mathcal {O}_j\), and \(\alpha _{o_j}\) \((0 \le \alpha _{o_j} \le 1)\) is the importance given to the real class information of an object \(o_j \in \mathcal {O}^{L}\) (the set of labeled objects). The documents are classified according to the \(\arg \)-\(\max \) of the final \(\mathbf {f}_{d_{i}}\) vectors for \(d_{i} \in \mathcal {D}^{u}\).

5 Experimental Evaluation

In this section, we present the experimental configuration, the datasets, and an analysis of the results achieved. Our goal is to encourage the study of PUL approaches for fake news detection, which perform well using little labeled data. Furthermore, we want to demonstrate that with a small amount of labeled fake news, structured in a network, it is possible to achieve reasonable classification performance, which can be improved by adding extra information taken from the news textual content.

Table 3. Detailed information about news datasets.

5.1 News Datasets

In this paper, six news datasets (Footnote 1) were evaluated: four in English and two in Portuguese. Table 3 presents detailed information about the language, subject, and number of real and fake news items in each dataset. Fact-checked news (FCN) was collected from five Brazilian journalistic fact-checking sites to evaluate our approach. The second Portuguese dataset is Fake.BR (FBR), the first reference corpus in Portuguese for fake news detection [17]. The third dataset was acquired from the FakeNewsNet repository (FNN) [16], which contains news about famous people fact-checked by the GossipCop website. FakeNewsNet has the greatest class imbalance. The last three datasets are also written in English; their news was taken randomly from the FakeNewsCorpus dataset (Footnote 2) (FNC0, FNC1, FNC2), an open-source dataset composed of millions of news items collected from 1001 domains. Fake and real news were selected, and tokens that could bias the behavior of the classification algorithms, such as links and names of reputable publication vehicles, were removed.

5.2 Experimental Setup and Evaluation Criteria

This section presents the experimental configuration and evaluation criteria for PU-LP and the baseline algorithms. After pre-processing, feature vectors for news representation were obtained considering two strategies: (i) a traditional BoW, with tf-idf as the term-weighting scheme; and (ii) D2V (Paragraph Vectors). In D2V, we used the union of the Distributed Memory and Distributed Bag-of-Words models to generate the document embeddings. For training each of these models, we considered the average and the concatenation of the word vectors to create the hidden layer's output. We also employed a maximum number of epochs \(\in \{100, 1000\}\), \(\alpha = 0.025\), \(\alpha _{min} = 0.0001\), number of dimensions of each model \(= 500\), window size \(= 8\), and minimum count \(= 1\) [8]. For the k-NN matrix used in PU-LP, we used \(k \in \{5, 6, 7\}\) and cosine as the similarity measure. For the extraction of the reliable interest and non-interest sets, we used \(m = 2\), \(\lambda \in \{0.6, 0.8\}\), and \(\alpha \in \{0.005, 0.01, 0.02\}\). These values were chosen as suggested in [9].

For the selection of representative unigrams and bigrams, we used \(\ell = 0.08\). The parameter \(\ell \) was chosen after a statistical analysis of the sample, indicating that about 25% of the bag-of-words terms had tf-idf greater than 0.08. For the label propagation stage, as suggested in [14], we used a convergence threshold of 0.00005 and a maximum of 1,000 iterations. For GNM, \(\alpha \in \{0.1, 0.5\}\) and \(\lambda = 1\) were considered.

A 10-fold cross-validation adapted to OCL and PUL problems was used as the validation scheme. In this case, the set of fake news (\(\mathcal {D}^{+}\)) was randomly divided into 10 folds. To simulate a semi-supervised learning environment, in which the number of labeled examples is much lower than the number of unlabeled examples, i.e., \(|\mathcal {D}^{+}| \ll |\mathcal {D}^u|\), we carried out different experiments considering 1, 2 or 3 folds as labeled data. The remaining folds and the real news are: (i) considered as test documents for the OCL algorithm; (ii) considered as unlabeled and test documents for the PUL algorithms.
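The adapted cross-validation scheme can be sketched as follows (the function and index layout are illustrative):

```python
import numpy as np

def pu_folds(n_fake, n_real, n_labeled_folds=1, n_folds=10, seed=42):
    """Sketch of the adapted 10-fold scheme: only fake news is split into
    folds; 1-3 folds act as the labeled positive set, and everything else
    (remaining fake folds + all real news) is unlabeled/test data.
    """
    rng = np.random.default_rng(seed)
    fake_idx = rng.permutation(n_fake)
    folds = np.array_split(fake_idx, n_folds)
    for start in range(n_folds):
        picked = [folds[(start + i) % n_folds] for i in range(n_labeled_folds)]
        labeled = np.concatenate(picked)
        rest = np.setdiff1d(fake_idx, labeled)
        # real news indices follow the fake ones: n_fake .. n_fake+n_real-1
        unlabeled = np.concatenate([rest, np.arange(n_fake, n_fake + n_real)])
        yield labeled, unlabeled
```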

We proposed a baseline approach using binary semi-supervised learning to assess the labeling procedure of reliable interest and reliable non-interest examples in PU-LP. For this analysis, the set of real news was randomly divided into ten subsets. In the cross-validation scheme, for each fake news fold used to train the algorithm, one fold of real news was also used. From the network \(\mathcal {N}\) built by PU-LP, and considering the training set as the set of labeled nodes, the label propagation algorithms infer the classes of the remaining news in the network. We considered values of k in the interval [5, 7]. The propagation algorithms and their respective configurations are the same as in the experiments performed with PU-LP.

As the evaluation measure, we used \(F_{1} = (2 \cdot \textit{precision} \cdot \textit{recall})/(\textit{precision}+\textit{recall})\), considering fake news as the positive class (interest-\(F_1\)) (Footnote 3). In the next section, the results of the experiments are presented.
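For reference, interest-\(F_1\) computed from raw counts of the fake class:

```python
def interest_f1(tp, fp, fn):
    """F1 of the fake (interest) class: tp = fake correctly detected,
    fp = real flagged as fake, fn = fake missed."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```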

5.3 Results and Discussions

Tables 4 and 5 present the best interest-\(F_{1}\) results that each network reached considering the BoW and D2V representation models for both PU-LP and the binary semi-supervised baseline, the experimental configuration defined in Sect. 5.2, and the news datasets. Due to space limitations, we present only the results obtained with the GNM algorithm, which obtained the best overall performance. The complete table of results is available in our public repository. 10%, 20%, and 30% indicate the percentage of news used to train the algorithms. Networks 1 to 12 combine the features proposed in Sect. 4.3 (see Table 2). Table 6 presents the average ranking and standard deviation analysis of the proposed heterogeneous networks.

Table 4. Interest-\(F_{1}\) of the PU-LP and binary baseline approaches with Bag-of-Words representation model, using GNetMine as label propagation.

On network 1 (Net 1), built only with news, we can see that PU-LP generally behaves better than the binary baseline, even using half as much labeled news. When exceptions occur, the results tend to be very close, with less than a 2% difference. This demonstrates that identifying reliable fake news and inferring a real news set using the initially labeled interest set and the Katz index can be a promising strategy for news classification. In particular, using only 10% of labeled fake news with representation models that consider semantic relations, such as D2V, can provide results as good as using larger training sets, which favors the dynamic scenario of news classification.

Including terms (Net 2) in the news network tends to improve classification performance in general, especially when the news is represented with BoW (Tables 4 and 5). Table 6 also shows that Net 2 obtained a better average ranking, with a low standard deviation, relative to the other proposed networks. For BoW with 20% and 30% labeled data, also including pausality achieved better results (Net 7). Term selection based on a tf-idf threshold seems to help the propagation algorithms distinguish real from fake news, increasing the algorithm's accuracy.

Table 5. Interest-\(F_{1}\) of the PU-LP and binary baseline approaches with Doc2Vec representation model, using GNetMine as label propagation.

Among the news datasets, only FNN is imbalanced. FNN contains only celebrity news, and fake news is a minority, making up only 24.3% of the dataset. For this dataset, the interest-\(F_{1}\) tends to be lower. However, our proposal performed better than the binary baseline, which also points out the utility of PUL for fake news detection.

Lower results also occur on the FBR dataset. This behavior also happened in [5], in which 67% interest-\(F_1\) was obtained using 90% labeled fake news and one-class learning algorithms. Our hypothesis is that, since this collection is composed of 6 different subjects, fake news from different subjects are spread throughout the feature space, making it difficult to infer reliable negative documents. Therefore, we conclude that to achieve more promising results, it is best to classify news considering one subject at a time. Although the results achieved on these datasets are inferior, we can observe that the addition of extra information to the heterogeneous network was beneficial, especially for the baseline, which increased its interest-\(F_{1}\) by more than 10%.

PU-LP obtained strong results on the FCN, FNC0, FNC1 and FNC2 news datasets. Using D2V as the representation model, the results ranged from 87% to 92% interest-\(F_{1}\) with only news and terms in the network. These results are relevant mainly because they consider an initially small set of labeled data, unlike the vast majority of methods proposed in the literature for news classification.

Table 6. Average ranking and standard deviation analysis for the proposed heterogeneous networks.

6 Conclusion and Future Work

This paper proposed a new approach for detecting fake news based on PU-LP. Our main contributions are a heterogeneous network using content-based features, resulting in a domain-independent fake news detection approach, and a performance assessment of different linguistic features for fake news detection.

PU-LP achieves good classification performance using a small amount of labeled data from the interest class and uses unlabeled data to improve classification performance, minimizing the news labeling effort. Furthermore, since PU-LP is a network-based PUL approach, it allows easy incorporation of other features into the learning process. We assessed the performance of heterogeneous networks that incorporate combinations of four different features based on terms and linguistic characteristics.

The results of our experimental analysis on six datasets of different languages and subjects indicate that our proposal can provide good classification performance even with a small amount of labeled fake news. Incorporating additional linguistic information, such as representative terms and pausality, into the network improved classification performance. Our approach achieved performance similar to a binary classifier on all datasets while requiring much less data labeling effort. Future work will consider more powerful representation models, such as context-based ones, and more efficient label propagation algorithms, such as those based on neural networks. We also intend to apply our approach to new datasets, which will allow us to perform statistical significance tests to better assess the results.