Improving Short-Content Misinformation Detection Using Multiple Aspect Trajectories Classification Techniques

Sanchez, Juan Pablo Chavarro; Portela, Tarlis Tortelli; Carvalho, Jônata Tyska

doi:10.1007/978-3-031-79032-4_9

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 15413)

Included in the following conference series:

Brazilian Conference on Intelligent Systems


Abstract

The proliferation of fake news and misinformation on social media poses a significant threat to social integrity. While there are approaches such as automatic content analysis and artificial intelligence detection to mitigate this problem, these methods face challenges in classifying misinformation in short content, like that posted on microblogging platforms or instant messaging services. In this study, we present an innovative approach to misinformation detection that combines content-based detection with multi-aspect trajectory classification. This approach models information propagation by considering each shared message as a moving object within the network, enriching the propagation trajectory with different semantic aspects. We implemented this approach on a misinformation dataset collected from WhatsApp and compared it with traditional content-based detection methods, and we combined these two detection orientations into a hybrid approach. Our findings indicate that the proposed hybrid approach achieves an F1-score of 0.89, surpassing baseline models by 8% in misinformation detection, even in short content. These results suggest that combining content-based detection with multi-aspect trajectory classification is a novel and promising strategy for addressing misinformation on social media.


1 Introduction

In today’s digital era, social media has taken a pivotal role in people’s daily lives. These platforms let users around the world communicate, share information, and express opinions quickly, affordably, and accessibly. According to Statista [30], over 50% of the world’s population is active on social media, which underlines its significance as a tool for communication, information, and opinion sharing.

The growing reliance on social media for real-time news has increased the risk of misinformation, as malicious actors exploit these platforms [23, 34]. Technological advancements, such as large language models, further enable the rapid spread of false information, reinforcing echo chambers and filter bubbles [3, 21]. Additionally, the structure of social networks promotes the dissemination of misinformation [2], creating significant challenges for governments and society.

The detection of fake news and online misinformation follows two main approaches: manual fact-checking and automatic detection. Manual verification by news agencies faces limitations due to the vast volume of data, potential human bias, and impracticality for early detection [24]. Automatic detection can overcome most of these limitations. Automatic methods can be based on the source, the context, or the content of the information. Source-based detection assesses the credibility of the sources creating or sharing the information, while context-based detection analyzes user interaction patterns and information propagation. Content-based detection seeks patterns in the text or images to identify the news article’s intent. Although content-based detection is effective and is the most studied approach in the state of the art [11], it is vulnerable to the manipulation of news style and language and to adversarial attacks [37]. A further significant challenge in misinformation detection is the handling of short texts, typical of microblogging platforms such as Weibo and Twitter and of instant messaging services such as WhatsApp or Telegram. Because of their brevity, the messages on these platforms can be harder to evaluate and verify accurately.

Given the complexity and heterogeneity of data related to information on online social networks, along with the limitations of content-based approaches, we propose two novel approaches: one based on the context of misinformation propagation and user interaction on social networks, and another hybrid approach that combines content-based and trajectory-based detection. In this paper, we introduce a model that conceptualizes the propagation path of a message on an online social network, specifically an instant messaging service, as a Multiple Aspects Trajectory (MAT) [18]. This type of data provides an effective and expressive representation of the diversity of data dimensions, namely the contexts associated with online information and its propagation. To validate the feasibility of these approaches, we turned to a dataset of messages spread in public WhatsApp groups focused on Brazilian politics [1]. Using the analogous definition of information as multiple aspects trajectories, we modeled the propagation paths of these messages based on forwarding chains as trajectories.

This study proposes a novel trajectory-based approach and a hybrid approach for misinformation detection in social networks, combining content-based and trajectory-based classification. The results show that the hybrid approach outperforms the baseline method by 8%, achieving an F1 score of 0.89, especially in the classification of short texts. The main contributions of this work are: a new dataset of information trajectories, a new application domain for trajectory-based classification methods, and a new classification method with great potential for detecting misinformation in social networks.

2 Related Work

In this section, we will analyze the related works on detecting misinformation and fake news, with a primary focus on those employing propagation-based approaches. These approaches take into careful consideration the social context in which such misleading information spreads, accounting for the ways data disseminates across online networks. Researchers have approached this challenge from various angles and have employed different input representations in their models [13, 36].

Models that use cascade, tree, or “tree-like” structures directly represent the path of information propagation. In [4], Del Vicario et al. investigate how users in Facebook groups tend to share homogeneous content, leading to the formation of echo chambers. Within these echo chambers, homogeneity drives the spread of misinformation. The authors also establish relationships between the trajectory size and the lifecycle of information. Similarly, [32] shows that false news spreads faster, farther, and to a broader audience on Twitter than true news, analyzing factors such as cascade depth and breadth. Additionally, Ducci et al. [5] introduce the “cascade-LSTM” model, which detects misinformation and analyzes retweet behavior.

Some other propagation-based detection methods utilize structural network information, such as text data and the propagation process, to portray the behavior of users involved in the dissemination of misinformation [13]. Often, these methods adopt a graph-based geometric orientation, and depending on the interaction between nodes, they can be classified as homogeneous, heterogeneous, or hierarchical. Homogeneous networks involve a single type of node and a single type of edge. In [16], Ma et al. propose a model based on recursive neural networks for rumor detection based on propagation trees that use user characteristics in the Twitter15 and Twitter16 datasets. In [10], Han et al. use a homogeneous network representation to classify news using Graph Neural Networks (GNNs), highlighting the advantages of propagation-based approaches in terms of news text independence. In [29], a model is proposed to capture temporal evolution patterns of news from Twitter15, Twitter16, FakeNewsSetGen, and Weibo using Homogeneous Dynamic Graph Neural Networks for fake news detection. In [19], a homogeneous graph is created without considering the textual content of each tweet. Instead, a tree structure is generated, and Graph Convolutional Networks (GCNs) are employed for fake news detection.

Heterogeneous networks involve multiple types of nodes and edges. For instance, in [27], relationships between authors, users, and news are modeled in a network called TriFN, and a classifier is employed to detect fake news. In the work of Monti et al. [20], which employs similar dimensions to the previous studies and CNNs for classification, the model is found to be potentially independent of the text and geographic location of the users involved in the network. The authors also highlight the advantage of propagation-based approaches against adversarial attacks, as such attacks would require manipulating the social network.

Hierarchical networks involve various types of nodes and edges forming hierarchical relationships. This is exemplified in the work developed by Shu et al. [26], where a hierarchical propagation network is constructed by evaluating different granularities to extract temporal, structural, or linguistic features.

An approach for representing propagation can be based on time series, more specifically, multivariate time series. In the work of Preveti et al. [22], rumor propagation is represented as the time series of a cascade, incorporating user-related and temporal aspects for classification. In the work of Liu et al. [15], the propagation path is represented as a multivariate time series, with points based on user and time-related aspects. These aspects are primarily numeric or binary, and convolutional and recurrent networks are applied to detect time series associated with fake news in the Twitter15, Twitter16, and Weibo datasets.

3 Dissemination of Misinformation as MAT

This section, divided into two parts, explores addressing misinformation in online social networks through multi-aspect trajectories. The first part covers the theoretical foundations, while the second focuses on applying these principles by converting a dataset into multi-aspect trajectories.

3.1 Misinformation and MAT

As discussed in Sect. 2, existing representations of news or information propagation paths generally use cascade structures, graphs, or time series, incorporating mainly numeric or textual features related to users, news, networks, or time. However, these representations lack semantic data. In this section, we introduce the essential definitions and concepts for our approach. We begin by presenting the definition of “Multiple-Aspect Trajectory” as suggested in [17], and then define what a “Multiple-Aspect Trajectory” is in the context of information in Online Social Networks (OSN), capable of capturing semantic data.

Definition 1

(Multiple-Aspect Trajectory). A multiple-aspect trajectory is an ordered sequence of points \(T = \left\langle d_1, d_2, \dots, d_n \right\rangle\), with \(d_i = (x, y, t, \mathcal{A})\) being the \(i\)-th point of the trajectory at location \((x, y)\) and timestamp \(t\), described by the set \(\mathcal{A} = \left\{ a_1, a_2, \dots, a_r \right\}\) of \(r\) attributes.

This type of sequential data can include heterogeneous data of diverse nature and types. These trajectories can be labeled, possibly with more than one label, for use in a classification task. Following Petry et al. [17], trajectory classification is defined as follows:

Definition 2

(Trajectory classification). Let \(\mathcal{L}\) be a set of labels and let \(\mathcal{T} = \{(T_1, \text{label}(T_1)), \ldots, (T_{|\mathcal{T}|}, \text{label}(T_{|\mathcal{T}|}))\}\) be a set of pairs, each containing a trajectory \(T_i\) and its class label \(\text{label}(T_i) \in \mathcal{L}\). Trajectory classification involves learning a prediction function (i.e., a model) \(f\) that assigns each trajectory \(T_i\) in \(\mathcal{T}\) to one of the class labels in \(\mathcal{L}\).

Kumar et al. [14] differentiate between misinformation, which is false information without intent to deceive, and disinformation, which involves deceptive intent. This distinction is supported by other researchers [12, 33], though some studies use a broader definition of misinformation to encompass all false or inaccurate information spread on social networks. This includes rumors, disinformation (intentional or unintentional), urban legends, fake news, spam, and hate speech. These phenomena, given their temporal evolution in a social network, can be represented as a sequence of multiple aspects.

In line with our objective to analyze the behavior of information in a social network, it is essential to have a general concept that represents the dissemination of information. Following Guo et al. [9], we define information dissemination in social networks by considering the content, social, and temporal dimensions, as outlined by Shu et al. [25].

Let s be information disseminated in a social network N, which consists of two main sets: Publisher and Content. Publisher \(P_{s}\) includes a set of profile features to describe the original author, such as name, domain, age, among other attributes. Content \(C_{s}\) consists of a set of attributes that represent the news article and includes headline, text, image, etc.
Let \(E=\{e_{it}\}\) be the set of tuples that represent how the information spreads over time among n users \(U=\{u_1,u_2,\dots ,u_n\}\) and their corresponding posts \(P=\{p_1,p_2,\dots ,p_n\}\) on social media regarding information s. Each element \(e_{it}\) is described as \(\{p_i, u_i, t\}\), indicating that user \(u_i\) spreads information s through post \(p_i\) at time t.

Therefore, to consider information or news s as a trajectory, we must treat it as a sequence that passes through points of interest (POIs). Due to privacy concerns in social networks, it is not always possible to obtain a user’s location. Thus, we first define a temporal sequence of points along the path that the information follows. Each point corresponds to an interaction \(e_i\) at a time t, in which a user \(u_i\) interacts with the post \(p_i\) corresponding to the POI. This i-th point in the temporal sequence is associated with content and context aspects provided by \(u_i\) and \(p_i\). Adapting this description of misinformation propagation in a social network to the definition of multi-aspect trajectories given in Definition 1, we present the following definition.

Definition 3

(Multiple-Aspect Trajectory of information). A multi-aspect trajectory of information propagated in an Online Social Network (OSN) is an ordered sequence of points \(E= \left\langle e_1,\; e_2,\;\dots ,\;e_{n}\right\rangle \), where \(e_i=(u_i,t,\mathcal {A})\) represents the \(i-\)th point of the trajectory located at \(p_i\) at a timestamp t, described by a set of attributes \(\mathcal {A}=C_{s_{i}}\cup P_{s_{i}}\), where \(C_{s_i}\) is the set of attributes related to the content of s, and \(P_{s_{i}}\) is the set of context attributes related to s.

This study addresses the problem of binary classification of trajectories, using \(\mathcal {L}=\{0,1\}\), as specified in Definition 2.
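As an illustration of Definitions 2 and 3, the following minimal sketch shows one possible in-memory representation of an information trajectory. The class and field names (`InfoPoint`, `InfoTrajectory`, `aspects`) are hypothetical and are not taken from the original work.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class InfoPoint:
    """One point e_i = (u_i, t, A) of an information trajectory."""
    user: str                      # u_i: the user who shared the message
    timestamp: float               # t: when the message was shared
    aspects: Dict[str, Any] = field(default_factory=dict)  # A: content and context attributes


@dataclass
class InfoTrajectory:
    """Ordered sequence E = <e_1, ..., e_n> with a binary class label."""
    tid: int
    points: List[InfoPoint]
    label: int  # 1 = misinformation, 0 = no misinformation

    def __post_init__(self) -> None:
        if self.label not in (0, 1):
            raise ValueError("label must be 0 or 1")
        # Definition 3 requires an ordered sequence, so sort points by time.
        self.points = sorted(self.points, key=lambda p: p.timestamp)


example = InfoTrajectory(
    tid=1,
    points=[
        InfoPoint("user_b", 20.0, {"has_media": False}),
        InfoPoint("user_a", 10.0, {"has_media": True}),
    ],
    label=1,
)
```

A classifier in the sense of Definition 2 is then any function that maps an `InfoTrajectory` (without its label) to a value in {0, 1}.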

3.2 Dataset

In this section, we start from the original dataset “FAKEWHATSAPP.BR”, which contains WhatsApp messages labeled as misinformation or not. From these messages, we generate the dataset “MAT FAKEWHATSAPP.BR”, which contains the multi-aspect trajectories of the messages, according to Definition 3.

FAKEWHATSAPP.BR. The original dataset, proposed in [1], consists of news collected from 59 public WhatsApp groups that were used during the political campaigns for the 2018 presidential elections in Brazil. This data collection took place from July 2, 2018, to October 29 of the same year, resulting in a total of 282,601 messages labeled in two classes: the positive class, represented as 1, corresponds to messages with misinformation, while the negative class, represented as 0, corresponds to messages without misinformation.

To effectively label the dataset, the findings of Vosoughi et al. [32] were considered, demonstrating that false information spreads faster, wider, and deeper. A subset of 5,284 unique messages was selected based on their virality and manually labeled following a classification protocol. The labeled subset comprises mostly brief messages, averaging 20 tokens for non-misinformative messages and 34 for misinformative messages.

To automatically expand the categories assigned to the subset of messages to the entire dataset, an algorithm generated the TF-IDF matrix from both the labeled subset and the complete corpus. It calculated the similarity between unlabeled and labeled messages, assigning labels to unlabeled messages when the cosine similarity exceeded 0.9. This process labeled 21,289 messages, with 6,926 being unique. Cabral [1] evaluated this dataset using classic natural language processing methods, including Bag of Words (BoW) and TF-IDF features with n-gram variations. The best results, with an F1 score of 0.73, were achieved using MLP, LSVM, and SGD algorithms. Excluding short messages and focusing on those with over 50 words increased the F1 score to 0.87.
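The label-expansion step described above can be sketched as follows. This is a simplified illustration and not the original implementation: the whitespace tokenizer, the smoothed IDF formula, and the rule of taking the label of the single most similar labeled message are all assumptions; only the cosine-similarity threshold of 0.9 comes from the text.

```python
import math
from collections import Counter
from typing import Dict, List, Optional


def tfidf_vectors(docs: List[str]) -> List[Dict[str, float]]:
    """Build sparse TF-IDF vectors (whitespace tokens, smoothed IDF)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append({
            tok: cnt * (math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1.0)
            for tok, cnt in counts.items()
        })
    return vectors


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(tok, 0.0) for tok, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def propagate_labels(labeled: List[str], labels: List[int],
                     unlabeled: List[str],
                     threshold: float = 0.9) -> List[Optional[int]]:
    """Give each unlabeled text the label of its most similar labeled text,
    but only when that similarity exceeds the threshold; otherwise None."""
    vectors = tfidf_vectors(labeled + unlabeled)
    labeled_vecs = vectors[:len(labeled)]
    result: List[Optional[int]] = []
    for vec in vectors[len(labeled):]:
        best_sim, best_label = 0.0, None
        for lab_vec, lab in zip(labeled_vecs, labels):
            sim = cosine(vec, lab_vec)
            if sim > best_sim:
                best_sim, best_label = sim, lab
        result.append(best_label if best_sim > threshold else None)
    return result
```

Messages that remain `None` stay unlabeled, which is consistent with only part of the full corpus receiving a label in the original dataset.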

MAT FAKEWHATSAPP.BR. Following Definition 3, we generate the set of multi-aspect trajectories from the aforementioned dataset. This set can be seen as a collection of trajectories passing through points of interest (POIs). In the context of WhatsApp, each POI represents a user who shares a message, and the moving object is the message being shared from one user to another. The semantic aspects are derived from the data obtained from the social network.

If we were to construct the trajectories as temporal sequences of message repetitions (forwards), we would obtain the 6,926 trajectories corresponding to the unique labeled messages in the dataset, composed of 21,289 points. However, a more detailed examination of the dataset reveals that the TF-IDF vector representation and similarity calculation fail to fully capture similar records that exhibit variations. To address this limitation, we use word embeddings generated by BERT, a representation that captures the context of the text.

The trajectories are created following the steps below, which are illustrated in Fig. 1.

I. Extraction and Grouping of Trajectories.

Identification of Forwards: All forwarded messages in the conversation are identified, thus establishing the beginning of a trajectory.

Temporal Grouping: Forwarded messages are grouped into trajectories ordered by time, allowing us to capture the sequence of forwards and the spread of information in the chat.

II. Representation of Trajectories with Sentence Embeddings.

Generation of Embeddings: Each message in a trajectory is represented by a multilingual BERT sentence embedding. This captures the semantic and lexical context of each message in a vector space.

Similarity Calculation: We calculate cosine similarity between the embeddings of all trajectories. Those that exceed a similarity threshold of 0.9 are considered candidates for merging.

Merging Similar Trajectories: Trajectories or records with high similarity are candidates for merging and are combined into a single trajectory. This reflects the relationship between messages that share similar content.

III. Assignment of Unique Identifiers (TID).

Temporal Sorting: The forwards of each message representing a trajectory are sorted chronologically to establish their temporal sequence.

TID Assignment: A unique TID is assigned to each message or trajectory, ensuring that each trajectory in the dataset corresponds to a unique representative message.
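The three steps above can be sketched as follows, under several assumptions. The sentence embeddings are assumed to be precomputed (for instance, by a multilingual BERT sentence encoder) and stored with each message. The dictionary keys `user`, `timestamp`, and `embedding` are hypothetical. The greedy merge compares each message only with the first message of each existing trajectory, which simplifies the all-pairs comparison described in the text.

```python
import math
from typing import Any, Dict, List, Sequence

Message = Dict[str, Any]  # assumed keys: "user", "timestamp", "embedding"


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def build_trajectories(messages: List[Message],
                       threshold: float = 0.9) -> Dict[int, List[Message]]:
    """Group forwarded messages into trajectories keyed by a unique TID."""
    trajectories: List[List[Message]] = []
    # Processing in chronological order keeps every trajectory sorted by time.
    for msg in sorted(messages, key=lambda m: m["timestamp"]):
        for traj in trajectories:
            # Merge when the message is similar enough to the representative.
            if cosine(msg["embedding"], traj[0]["embedding"]) > threshold:
                traj.append(msg)
                break
        else:
            trajectories.append([msg])  # start a new trajectory
    return {tid: traj for tid, traj in enumerate(trajectories, start=1)}


def filter_min_points(trajectories: Dict[int, List[Message]],
                      min_points: int = 10) -> Dict[int, List[Message]]:
    """Keep only trajectories with at least `min_points` points."""
    return {tid: t for tid, t in trajectories.items() if len(t) >= min_points}
```

The default of 10 in `filter_min_points` mirrors the selection criterion used to build the dataset.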

The dataset was built by selecting messages that were shared at least 10 times, so that every trajectory has 10 or more points. This criterion reduced the dataset to 469 trajectories, 177 labeled as non-misinformation and 292 as misinformation, reflecting the higher frequency of misinformation messages in the original dataset. These 469 trajectories span 10,251 points and are associated with 2,408 unique messages. The selected aspects for analysis are presented in Table 1. The selection of these aspects was guided by the attributes commonly used in location check-in datasets such as FourSquare [17].

The process of construction can be replicated using datasets from different messaging platforms such as Telegram, or it can be modified for use on other social networks like Twitter, where information is disseminated in brief texts. This representation of the spread of information on online social networks as a trajectory of multiple aspects is agnostic to the text itself, as it leverages data from the context and the propagation of information.

Table 1. Description of Dataset Attributes.


4 Trajectory Classification Models

In the field of multi-aspect trajectory analysis, various techniques have been developed to discover patterns and knowledge in this data, ranging from similarity analysis to clustering and classification, among others. Initially, the focus was on the raw trajectories of moving objects over time, limited to extractable attributes from space-time dimensions [28].

With the advent of semantic trajectories, defined as sequences of points of interest visited by a user, the problem of “Trajectory-user-linking” emerged, which aims to predict which user generated a trajectory. Despite advances such as TULVAE [35] and TULER [8], which have addressed these semantic trajectories, they still face limitations in other dimensions.

Finally, with multi-aspect trajectories, models capable of encompassing all dimensions have emerged. The Movelets method [6], a pioneer in this field, extracts representative subtrajectories of classes, continually improving the quality of the extracted Movelets and reducing computational costs. Additionally, the MARC [17] model, a recurrent neural network, can effectively handle all these dimensions. These models have found applications in a variety of datasets, including human trajectories, animal trajectories, transportation trajectories, hurricane trajectories, vessel trajectories, and storm trajectories. Below, we explore in detail how these models are capable of addressing multi-aspect trajectories [28].

4.1 MARC Model

One of the standout models in this context is the MARC Model [17], which relies on attribute embeddings and recurrent neural networks (RNNs). This model excels in processing trajectories defined by categorical or textual attributes. Its architecture comprises several key components:

Encoding of Trajectory Points: Initially, each trajectory attribute is encoded using one-hot encoding. This encoded representation is then multiplied by an embedding matrix that has as many rows as the attribute has possible values, so that each row is the embedding of one attribute value. Finally, an aggregation function is applied to the embedded attributes of each point. The MARC model supports three aggregation functions: sum, average, and concatenation.

Recurrent Component: The encoded trajectory points move on to a recurrent component, utilizing Long Short-Term Memory (LSTM) cells. This neural network structure with LSTM units is fundamental for handling sequential data like trajectory points. LSTM units can capture relationships and patterns among different trajectory points and their attributes.

Fully Connected Layer and Sigmoid Function: The final component is a fully connected layer that receives information from the recurrent component and subsequently passes through a sigmoid function to calculate probabilities.
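The point-encoding step of the first component can be illustrated with a small sketch in plain Python. It is not the MARC implementation: the embedding matrices below hold arbitrary example values instead of learned weights, and the function names are hypothetical. It only shows that multiplying a one-hot vector by an embedding matrix selects one row, and how the three aggregation functions combine the attribute embeddings of a point.

```python
from typing import List

Vector = List[float]
Matrix = List[Vector]


def one_hot(index: int, size: int) -> Vector:
    """One-hot encoding of a categorical value given its integer index."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def embed(index: int, matrix: Matrix) -> Vector:
    """Multiply the one-hot vector by the embedding matrix.

    The product equals row `index` of the matrix."""
    hot = one_hot(index, len(matrix))
    dim = len(matrix[0])
    return [sum(hot[r] * matrix[r][c] for r in range(len(matrix)))
            for c in range(dim)]


def aggregate(embeddings: List[Vector], mode: str) -> Vector:
    """Combine the attribute embeddings of one trajectory point.

    "sum" and "average" assume all embeddings share the same dimension."""
    if mode == "concatenate":
        return [x for emb in embeddings for x in emb]
    summed = [sum(vals) for vals in zip(*embeddings)]
    if mode == "sum":
        return summed
    if mode == "average":
        return [x / len(embeddings) for x in summed]
    raise ValueError(f"unknown aggregation mode: {mode}")


# Example values only: 2 possible users and 3 possible time slots.
user_matrix = [[1.0, 2.0], [3.0, 4.0]]
slot_matrix = [[0.5, 0.5], [1.0, 0.0], [0.0, 1.0]]
point = [embed(1, user_matrix), embed(2, slot_matrix)]
```

In the full model, the aggregated vector of each point would then be fed, in temporal order, to the recurrent component.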

4.2 Movelet-Based Models

Another category of models aims to identify the best trajectory or subtrajectory features for input into a classifier. These methods extract movelets [6], which are subsequences of a trajectory of any size and with any combination of attributes. Each movelet candidate is evaluated by a quality score based on its discriminative power for each class: the method finds the optimal alignment and division and then assesses the candidate’s F-score, which determines its importance. The internal architecture and the way the movelet candidates are computed differ for each model:

MasterMovelets [7]: Calculates the distance from each candidate to all trajectories, making it the most computationally expensive method.
HiperMovelets [31]: Extracts limited-size subtrajectories and generates movelet candidates based on their in-class frequency, currently being the fastest model.
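The core operation shared by these methods, finding the best alignment of a candidate subtrajectory within a trajectory, can be sketched as below. This is a deliberately simplified illustration: the per-attribute distance (absolute difference for numbers, 0/1 mismatch otherwise) and the averaging over attributes and points are assumptions, and the actual methods use their own distance definitions and quality measures.

```python
from typing import Any, Dict, List, Sequence

Point = Dict[str, Any]


def point_distance(p: Point, q: Point, attributes: Sequence[str]) -> float:
    """Average per-attribute distance between two trajectory points."""
    total = 0.0
    for attr in attributes:
        a, b = p[attr], q[attr]
        if isinstance(a, (int, float)) and isinstance(b, (int, float)):
            total += abs(a - b)              # numeric attribute
        else:
            total += 0.0 if a == b else 1.0  # categorical attribute
    return total / len(attributes)


def best_alignment_distance(candidate: List[Point],
                            trajectory: List[Point],
                            attributes: Sequence[str]) -> float:
    """Smallest mean distance over all positions where the candidate fits."""
    size = len(candidate)
    if size == 0 or size > len(trajectory):
        return float("inf")
    best = float("inf")
    for start in range(len(trajectory) - size + 1):
        window = trajectory[start:start + size]
        dist = sum(point_distance(c, w, attributes)
                   for c, w in zip(candidate, window)) / size
        best = min(best, dist)
    return best
```

Computing this distance from one candidate to every trajectory yields one feature column; a candidate is discriminative when trajectories of one class are consistently closer to it than trajectories of the other class.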

5 Experimental Evaluation

In this section, we describe the methodology used to evaluate our approach for misinformation classification using multi-aspect trajectory classification. We compare it with a content-based detection approach similar to Cabral et al. [1], which serves as the baseline. All experiments use hyperparameter optimization through 5-fold stratified cross-validation, with the F1-score as the primary evaluation metric.
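For reference, the two ingredients of this protocol can be sketched in a few lines. The fold assignment below is a deterministic round-robin within each class, which is an assumption made for simplicity; library implementations usually shuffle the data first.

```python
from typing import Dict, List, Sequence


def f1_score(y_true: Sequence[int], y_pred: Sequence[int],
             positive: int = 1) -> float:
    """F1 = harmonic mean of precision and recall for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def stratified_folds(labels: Sequence[int], k: int = 5) -> List[List[int]]:
    """Split indices into k folds that preserve the class proportions."""
    by_class: Dict[int, List[int]] = {}
    for idx, lab in enumerate(labels):
        by_class.setdefault(lab, []).append(idx)
    folds: List[List[int]] = [[] for _ in range(k)]
    for indices in by_class.values():
        # Deal the indices of each class to the folds in turn.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds
```

Each fold serves once as the validation set while the remaining four are used for training, and the F1-scores of the five runs are averaged.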

5.1 Content-Based Detection

In order to provide a comparative analysis, a classic content-based detection approach was established employing conventional natural language processing and machine learning methods. In this experiment, different textual content associated with each trajectory was utilized, meaning that if a trajectory represents the dissemination of certain information, all textual variants of that information are taken into account.

For each of the 2,408 different messages linked to the 469 trajectories, we applied the same preprocessing as Cabral et al. [1], which included removing stopwords, normalization, lemmatization, retaining emojis as tokens, and normalizing URLs. Given their success with text vector representation using TF-IDF and various n-grams, the same vectorizer was used, incorporating unigrams, bigrams, and trigrams. Finally, the classification algorithms that yielded the best results with the aforementioned vector representation were selected, including Logistic Regression (LR), Multi-Layer Perceptron (MLP), and Linear Support Vector Machine (LSVM).

5.2 Trajectory-Based Detection

In evaluating the proposed trajectory propagation-based approach, the trajectory classification models discussed in the previous section were used, with preprocessing and transformation detailed in Sect. 3.

For the MARC model, experiments varied the aggregation method of the embeddings, testing different methods (add, average, concatenate).

For the model based on Movelets, HiperMovelets was chosen for its efficiency and quality in extracting the candidate Movelets matrix representing each trajectory. This matrix was used in three classification models: Random Forest (RF), Support Vector Machine (SVM), and Multilayer Perceptron (MLP).

5.3 Hybrid Approach

As pointed out by Cabral et al. [1], detecting misinformation based solely on content has limitations, especially with short texts. In addition to this limitation, it is important to consider the number of texts associated with a piece of information, as content-based models employ frequency-based methods, and their performance can suffer when the frequency is low.

For this reason, the proposed hybrid classifier uses a classification threshold on the number of words and the number of texts associated with a piece of information. If the text has fewer than a certain number of words and fewer than a certain number of different associated messages, the message receives the class that the trajectory-based method assigns to the trajectory to which the message belongs. Otherwise, the classification is made by a content-based classification model.

For this approach, we evaluate the combination of all trajectory-based and content-based models with their optimized hyperparameters. Furthermore, to estimate this threshold, we treat the number of words and the number of different texts as hyperparameters. The models are trained using cross-validation, evaluating the hybrid model on the validation fold. After finding the best hyperparameters, the hybrid model is trained on the complete training set, and its performance is evaluated on the test set.
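The decision rule of the hybrid classifier can be sketched as a small function. The function and parameter names are hypothetical, word counting by whitespace is an assumption, and the default thresholds of 40 words and 5 associated messages are the average values reported in Sect. 6, used here only as illustrative defaults.

```python
def hybrid_predict(text: str,
                   n_associated_texts: int,
                   trajectory_prediction: int,
                   content_prediction: int,
                   max_words: int = 40,
                   max_texts: int = 5) -> int:
    """Choose between the trajectory-based and content-based predictions.

    Short messages with few associated text variants are delegated to the
    trajectory-based model; all other messages use the content-based model."""
    n_words = len(text.split())
    if n_words < max_words and n_associated_texts < max_texts:
        return trajectory_prediction
    return content_prediction
```

In a hyperparameter search, `max_words` and `max_texts` would be tuned on the validation folds together with the choice of the two underlying models.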

6 Results

A total of 69 experiments were conducted to evaluate different approaches to misinformation detection. Nine combinations of TF-IDF with unigrams, bigrams, or trigrams were explored, using Multilayer Perceptron (MLP), Logistic Regression (LR), and Linear Support Vector Machines (LSVM) classifiers for content-based detection. Additionally, six trajectory-based models were evaluated: three variants of the MARC model and three models based on the creation of Movelets with HiperMovelets, classified with Multilayer Perceptron (MLP), Support Vector Machines (SVM), and Random Forest (RF). Finally, 54 experiments were run with the hybrid approach, pairing each of the nine content-based configurations with each of the six trajectory classification models mentioned above.

Table 2. Results of misinformation classification using different approaches


Given the number of experiments, Table 2 presents the top three results per approach, indicating their F1 score and accuracy. The best result is achieved by the hybrid approach that combines Bigram+MLP+Hyper+RF, which reaches the highest accuracy (0.85) and F1-score (0.89) among all evaluated models. Content-based models exhibit acceptable performance, primarily with the MLP classifier, regardless of the n-gram order. However, these models achieve the lowest accuracy. Trajectory-based models are led by the MARC-based model, whose three variants achieve the best results for this approach. Furthermore, the best trajectory-based model (MARC-avg) slightly outperforms the best content-based model in terms of F1 score, and the difference is larger in terms of the accuracy achieved by these models.

Table 2 also presents the best results of the hybrid approach, which combines trajectory-based and content-based models by means of a classification threshold. Averaged across all experiments, the tuned threshold values are fewer than 40 words and fewer than 5 associated messages for a piece of information. In such cases, classification is done by a trajectory-based method; otherwise, it is done by a content-based method.

Comparing the three best trajectory-based classifiers presented in Table 2 with the three best content-based classifiers, we observe that the former correctly classify, on average, 77.5% of short messages with few different associated texts, while the latter correctly classify, on average, 62.1%, about 15 percentage points fewer.

This demonstrates an advantage of trajectory-based classification: its independence from the text of the messages associated with a piece of information, as this approach only uses propagation data to represent a set of messages. When a trajectory is classified, all messages associated with it are classified, whereas in a content-based approach, only the unique text associated with a piece of information is classified. This characteristic makes trajectory-based classification particularly suitable for short texts, where the information contained in the text may be limited. With the thresholds determined for each experiment, the best results are achieved when using unigrams and bigrams with the MLP classifier if the threshold condition is not satisfied, and Hyper+RF or MARC-cct otherwise. The best of these hybrid models achieves an F1-score of 0.89, which surpasses the best content-based method by 8%. These models also achieve superior accuracy. To better observe this, Fig. 2 presents the average accuracy and F1-score per classification approach.

As shown in Fig. 2, trajectory-based and content-based methods deliver similar F1-scores, with trajectory-based models achieving slightly higher accuracy but greater variability. In contrast, the hybrid approach outperforms both, with higher mean F1-score and accuracy and much lower variance, which highlights the advantage of combining both strategies for misinformation detection. To assess statistical significance, we first applied the Shapiro-Wilk test (W = 0.95, p = 0.75), which gave no evidence against normality, and then a paired t-test (t = 3.51, p < 0.02). The results indicate that the hybrid model significantly outperforms the content-based and trajectory-based models.
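For readers unfamiliar with the paired t-test mentioned above, the statistic can be sketched with the Python standard library alone. The function name `paired_t_statistic` and the score lists are hypothetical placeholders, not values from the paper, and the text does not specify which samples were paired, so this only illustrates the general computation.

```python
import math
from statistics import mean, stdev
from typing import Sequence, Tuple


def paired_t_statistic(a: Sequence[float], b: Sequence[float]) -> Tuple[float, int]:
    """Return (t, degrees of freedom) for two paired samples."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    sd = stdev(diffs)  # sample standard deviation (n - 1 denominator)
    if sd == 0:
        raise ValueError("all paired differences are identical")
    # t = mean difference divided by its standard error
    t = mean(diffs) / (sd / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


# Hypothetical placeholder scores, NOT results from the paper.
hybrid_scores = [0.90, 0.85, 0.88, 0.80]
baseline_scores = [0.80, 0.80, 0.80, 0.80]

t_value, dof = paired_t_statistic(hybrid_scores, baseline_scores)
print(f"t = {t_value:.2f}, degrees of freedom = {dof}")
```

Converting t to a p-value requires the Student t distribution; in practice a library such as SciPy offers `scipy.stats.shapiro` for the normality check and `scipy.stats.ttest_rel` for the paired test.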

7 Conclusion

In this article, we have introduced an innovative definition of multi-aspect trajectories to address the spread of misinformation in social networks. Our approach views each shared message as a moving object within the social network and applies trajectory classification techniques commonly used for classifying moving entities such as humans, animals, and weather phenomena.

Multi-aspect trajectory-based models have achieved comparable or even superior results to conventional misinformation detection methods, both in their trajectory-based variant and in combination with content-based models. The hybridization of both approaches has shown a significant improvement in classification capability.

This research opens a new horizon in the fight against misinformation. The application of trajectory-based models, particularly those based on Movelets, promises greater interpretability and accuracy in misinformation detection.

References

Cabral, L., Monteiro, J.M., da Silva, J.W.F., Mattos, C.L.C., Mourao, P.J.C.: Fakewhastapp. br: NLP and machine learning techniques for misinformation detection in brazilian portuguese whatsapp messages. In: ICEIS (1), pp. 63–74 (2021)
Ceylan, G., Anderson, I.A., Wood, W.: Sharing of misinformation is habitual, not just lazy or biased. Proc. Natl. Acad. Sci. 120(4), e2216614120 (2023)
Choi, D., Chun, S., Oh, H., Han, J., Kwon, T.T.: Rumor propagation is amplified by echo chambers in social media. Sci. Rep. 10(1), 310 (2020)
Del Vicario, M., et al.: The spreading of misinformation online. Proc. Natl. Acad. Sci. 113(3), 554–559 (2016)
Ducci, F., Kraus, M., Feuerriegel, S.: Cascade-LSTM: a tree-structured neural classifier for detecting misinformation cascades. In: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 2666–2676 (2020)
Ferrero, C.A., Alvares, L.O., Zalewski, W., Bogorny, V.: Movelets: exploring relevant subtrajectories for robust trajectory classification. In: Proceedings of the 33rd Annual ACM Symposium on Applied Computing, pp. 849–856 (2018)
Ferrero, C.A., Petry, L.M., Alvares, L.O., da Silva, C.L., Zalewski, W., Bogorny, V.: Mastermovelets: discovering heterogeneous movelets for multiple aspect trajectory classification. Data Min. Knowl. Disc. 34(3), 652–680 (2020)
Gao, Q., Zhou, F., Zhang, K., Trajcevski, G., Luo, X., Zhang, F.: Identifying human mobility via trajectory embeddings. In: IJCAI, vol. 17, pp. 1689–1695 (2017)
Guo, B., Ding, Y., Yao, L., Liang, Y., Yu, Z.: The future of false information detection on social media: new perspectives and trends. ACM Comput. Surv. (CSUR) 53(4), 1–36 (2020)
Han, Y., Karunasekera, S., Leckie, C.: Graph neural networks with continual learning for fake news detection from social media. arXiv preprint arXiv:2007.03316 (2020)
Hoy, N., Koulouri, T.: A systematic review on the detection of fake news articles. arXiv preprint arXiv:2110.11240 (2021)
Islam, M.R., Liu, S., Wang, X., Xu, G.: Deep learning for misinformation detection on online social networks: a survey and new perspectives. Soc. Netw. Anal. Min. 10(1), 1–20 (2020). https://doi.org/10.1007/s13278-020-00696-x
Kondamudi, M.R., Sahoo, S.R., Chouhan, L., Yadav, N.: A comprehensive survey of fake news in social networks: attributes, features, and detection approaches. J. King Saud Univ.-Comput. Inf. Sci. 35(6), 101571 (2023)
Kumar, S., Shah, N.: False information on web and social media: a survey. arXiv preprint arXiv:1804.08559 (2018)
Liu, Y., Wu, Y.F.: Early detection of fake news on social media through propagation path classification with recurrent and convolutional networks. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018)
Ma, J., Gao, W., Wong, K.F.: Rumor detection on twitter with tree-structured recursive neural networks. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1980–1989 (2018)
May Petry, L., Leite Da Silva, C., Esuli, A., Renso, C., Bogorny, V.: Marc: a robust method for multiple-aspect trajectory classification via space, time, and semantic embeddings. Int. J. Geogr. Inf. Sci. 34(7), 1428–1450 (2020)
Mello, R.D.S., et al.: Master: a multiple aspect view on trajectories. Trans. GIS 23(4), 805–822 (2019)
Michail, D., Kanakaris, N., Varlamis, I.: Detection of fake news campaigns using graph convolutional networks. Int. J. Inf. Manag. Data Insights 2(2), 100104 (2022)
Monti, F., Frasca, F., Eynard, D., Mannion, D., Bronstein, M.M.: Fake news detection on social media using geometric deep learning. arXiv preprint arXiv:1902.06673 (2019)
NewsGuard: The next great misinformation superspreader: how chatgpt could spread toxic misinformation at unprecedented scale (2023). https://www.newsguardtech.com/misinformation-monitor/jan-2023/
Previti, M., Rodriguez-Fernandez, V., Camacho, D., Carchiolo, V., Malgeri, M.: Fake news detection using time series and user features classification. In: Castillo, P.A., Jiménez Laredo, J.L., Fernández de Vega, F. (eds.) EvoApplications 2020. LNCS, vol. 12104, pp. 339–353. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-43722-0_22
Reuters Institute for the Study of Journalism: Digital news report 2023 (2023). https://reutersinstitute.politics.ox.ac.uk/digital-news-report/2023
Saikh, T., De, A., Ekbal, A., Bhattacharyya, P.: A deep learning approach for automatic detection of fake news. arXiv preprint arXiv:2005.04938 (2020)
Shu, K., Bernard, H.R., Liu, H.: Studying fake news via network analysis: detection and mitigation. In: Emerging Research Challenges and Opportunities in Computational Social Network Analysis and Mining, pp. 43–65 (2019)
Shu, K., Mahudeswaran, D., Wang, S., Liu, H.: Hierarchical propagation networks for fake news detection: investigation and exploitation. In: Proceedings of the International AAAI Conference on Web and Social Media, vol. 14, pp. 626–637 (2020)
Shu, K., Wang, S., Liu, H.: Beyond news contents: the role of social context for fake news detection. In: Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, pp. 312–320 (2019)
da Silva, C.L., Petry, L.M., Bogorny, V.: A survey and comparison of trajectory classification methods. In: 2019 8th Brazilian Conference on Intelligent Systems (BRACIS), pp. 788–793. IEEE (2019)
Song, C., Shu, K., Wu, B.: Temporally evolving graph neural network for fake news detection. Inf. Process. Manag. 58(6), 102712 (2021)
Statista: Number of social network users worldwide from 2017 to 2025 (2021). https://www.statista.com/statistics/278414/number-of-worldwide-social-network-users/
Tortelli Portela, T., Tyska Carvalho, J., Bogorny, V.: Hipermovelets: high-performance movelet extraction for trajectory classification. Int. J. Geogr. Inf. Sci. 36(5), 1012–1036 (2022)
Vosoughi, S., Roy, D., Aral, S.: The spread of true and false news online. Science 359(6380), 1146–1151 (2018)
Wu, L., Morstatter, F., Carley, K.M., Liu, H.: Misinformation in social media: definition, manipulation, and detection. ACM SIGKDD Explor. Newsl. 21(2), 80–90 (2019)
Zhang, X., Ghorbani, A.A.: An overview of online fake news: characterization, detection, and discussion. Inf. Process. Manag. 57(2), 102025 (2020)
Zhou, F., Gao, Q., Trajcevski, G., Zhang, K., Zhong, T., Zhang, F.: Trajectory-user linking via variational autoencoder. In: IJCAI, pp. 3212–3218 (2018)
Zhou, X., Zafarani, R.: A survey of fake news: fundamental theories, detection methods, and opportunities. ACM Comput. Surv. (CSUR) 53(5), 1–40 (2020)
Zhou, Z., Guan, H., Bhat, M.M., Hsu, J.: Fake news detection via NLP is vulnerable to adversarial attacks. arXiv preprint arXiv:1901.09657 (2019)


Acknowledgements

This work has been partially supported by CAPES (Finance code 001). We also thank the reviewers who contributed to the improvement of this work. The views and opinions expressed in this paper are the sole responsibility of the authors.

Author information

Authors and Affiliations

Universidade Federal de Santa Catarina, 476 88.040-900, Florianopolis, Brazil
Juan Pablo Chavarro Sanchez & Jônata Tyska Carvalho
Instituto Federal do Paraná (IFPR), Curitiba, Brazil
Tarlis Tortelli Portela

Corresponding author

Correspondence to Juan Pablo Chavarro Sanchez.

Editor information

Editors and Affiliations

Universidade Federal Fluminense, Niterói, Brazil
Aline Paes
Instituto Tecnológico de Aeronáutica, São José dos Campos, Brazil
Filipe A. N. Verri

About this paper

Cite this paper

Sanchez, J.P.C., Portela, T.T., Carvalho, J.T. (2025). Improving Short-Content Misinformation Detection Using Multiple Aspect Trajectories Classification Techniques. In: Paes, A., Verri, F.A.N. (eds) Intelligent Systems. BRACIS 2024. Lecture Notes in Computer Science(), vol 15413. Springer, Cham. https://doi.org/10.1007/978-3-031-79032-4_9

DOI: https://doi.org/10.1007/978-3-031-79032-4_9
Published: 30 January 2025
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-79031-7
Online ISBN: 978-3-031-79032-4
eBook Packages: Computer Science, Computer Science (R0)
