1 Introduction

Political topics generate heated discussions and opinion polarization in social networks between supporters and opponents of a given subject. This dynamic is especially evident during election periods. When a candidate gives a speech, social networks are flooded with publications supporting or opposing the topics addressed in that speech. In Brazil, a survey presented by the DataSenado Institute [26] shows that the main communication channels for seeking information about politics are: TV (37%), social networks (24%), and Internet sites (23%). Young people most often use social networks and journalistic portals, while TV attracts older age groups; among social network users, at least 20% confirm that they use these platforms to talk about politics.

Therefore, the period before and after the 2022 Brazilian elections represents an excellent opportunity to analyze social media data on political discussions. As shown in Fig. 1, using social media posts, it is possible to better understand political discussion by applying a topic model to identify and study the different topics in the data. Furthermore, using sentiment analysis, it is possible to estimate users' opinions of the different topics. In this article, we present a framework based on topic modeling, using Twitter publications about politics in Brazil; however, the approach can be applied to a variety of current scenarios and topics. Our goal is to automatically identify potentially high-impact policy-related topics by combining two NLP tasks: topic modeling and sentiment analysis. The proposal is presented in Fig. 2.

Fig. 1. Topic Modelling Techniques in NLP.

Fig. 2. Controversial Topic Identification.

In the literature, there are several techniques for clustering and identifying topics. A traditional method for topic discovery and semantic mining of unsorted documents is Latent Dirichlet Allocation (LDA). However, the method has some drawbacks. It does not consider the temporal aspect of the data when creating topics from data spanning different epochs, and training a model with millions of examples is computationally intensive [18]. Moreover, LDA does not take word order or semantic relationships between words into account when creating topics [4].

As an alternative to the standard approach, the works on Top2Vec and BERTopic [4, 14] explored a topic modeling methodology based on state-of-the-art dense representations, followed by dimensionality reduction and clustering, to extract meaningful topics from document collections. A series of studies then emerged to compare the aforementioned techniques, such as [11], in which the performance of the different algorithms is evaluated in terms of their strengths and weaknesses in a social science context. Based on the results of this study, the authors indicate that BERTopic supports more embedding models than Top2Vec, allows multilingual analysis, and automatically determines the number of topics. The advantages of Top2Vec are its support for hierarchical topic reduction, its ability to work with very large datasets, and its use of embeddings, so that no preprocessing of the original data is required.

Focusing on Portuguese texts, BERTopic has already been successfully applied to Brazilian Portuguese documents by [1]. In their work, the authors applied the methodology to automatically classify legal documents into the six most representative document classes of the Brazilian National Council of Justice (CNJ), achieving a macro F1 score of 89%. Similarly, the authors of [28] examined the social impact of law changes on the Twitter social network.

On the other hand, some papers have also conducted public opinion experiments on the political aspects of social networks. A recent study mapped large-scale cross-party sentiment on tweets in Greece, Spain, and the United Kingdom [5]. The study showed a preponderance of negative tweets from politicians and examined trends in popularity and sentiment. Another study in Spain used Twitter to analyze the impact of elite discourse on affective citizen polarization [16]. According to this study, users' contact with candidates does not affect polarization.

From this perspective, the aim of this article is to develop a framework that uses a topic modeling approach based on the clustering techniques proposed by [1, 14, 28] for publications in Brazilian Portuguese on Twitter about political topics. Furthermore, we hypothesize that combining the extracted topics with state-of-the-art sentiment analysis can add more information and identify potentially controversial topics. Thus, we can identify controversial issues and assess their public reception. The main contributions of this work are:

  1. A new framework for identifying controversial topics in Twitter data, combining clustering and sentiment analysis based on state-of-the-art Transformer representations;

  2. Evaluation of the discovered topics. The results are evaluated by associating real-world events with the topics and by using quantitative metrics;

  3. Comparison between our methodology and a simpler approach.

This work is organized as follows. In Sect. 2, we briefly describe the related works. Section 3 describes the methodology, and Sect. 4 presents our experimental results. Finally, Sect. 5 presents the conclusions and future work.

2 Related Works

Many works deal with information extraction from social networks, opinion mining, and general linguistic representations. In this section, we briefly survey examples of applications.

Text mining is an important area in Knowledge Discovery in Databases (KDD). It focuses on discovering interesting patterns in structured and unstructured data [13]. As a result, the applications in this area are diverse, and the field fosters strong connections to natural language processing, data mining, machine learning, information retrieval, and knowledge management [23]. In the following, we present some recent studies on topic modeling and text clustering related to this proposal.

The work proposed in [3] presents a machine learning-based approach to improve the classification of cognitive distortions in Arabic content on Twitter, enriching the text representation by defining latent topics in tweets. Another study on data clustering tools applies topic modeling to customer service chats [15]. That work focuses on finding new intents in user messages that are not yet covered by any existing intent, and on reorganizing existing intents by analyzing the generated topic model.

In relation to text clustering, HDBSCAN [19] was applied to investigate how to link popular social media topics and news stories using Transformer models and neural networks [2]. Other works use HDBSCAN to compare latent semantic analysis and latent Dirichlet allocation on the topic of COVID-19 [27]. This study examines the most frequent words in each cluster and compares them with factual data about the outbreak to find out if there are any correlations; the authors also report how well HDBSCAN clusters the data in comparison to K-Means [22]. The work in [17] investigates the most effective way of performing text classification and clustering of duplicate texts in technical documentation written in Simplified Technical English. Vector representations from pre-trained Transformers and LSTM models were tested against TF-IDF using the density-based clustering algorithms DBSCAN and HDBSCAN.

Monitoring social media has become essential for government entities, large corporations, and global companies. Several data mining tools currently assess public reaction to measures taken by governments, companies, and famous personalities. For example, some studies use social media data to predict a country's elections based on public sentiment [8]. Other recent work examines social media to study public awareness of COVID-19 pandemic trends and to uncover meaningful themes of concern posted by Twitter users [7].

Focusing on data collection, there are several tools for extracting and analyzing data from social networks. Tweepy is a Python library for accessing the official Twitter API, allowing tweets to be retrieved from Twitter profiles [10]. Another option, Snscrape, is a scraper for social networking services (e.g., Instagram, Facebook, and Twitter) that does not rely on official APIs [6]. It scrapes data such as user profiles, hashtags, or search results and returns the discovered items.

3 Methodology

In this section, we describe the steps we took to identify controversial political themes using Twitter data. Figure 3 shows an overview of our methodology, presenting its main components: Data Collection (Sect. 3.1), Tweet Pre-processing (Sect. 3.2), Calibration and Clustering (Sects. 3.3 and 3.4), and the Cluster Analysis performed (Sect. 3.7). Finally, Sect. 3.8 discusses the evaluation of the controversial topics.

Fig. 3. Overview of the controversial topic identification and analysis.

3.1 Data Collection

The 38\(^{th}\) Brazilian president, Jair Messias Bolsonaro, used to perform weekly live streams on his channel on the YouTube video sharing platform. During the live streams, which lasted an average of one hour, he discussed events of the week related to his government.

We assumed that even though the live streams took place on a specific platform (YouTube), supporters and opponents would end up generating publications about the content of the broadcast on other social networks such as Twitter. Twitter is a social network in which users publish and interact with others through small text messages called tweets. Given this context, we considered Twitter posts published after the start of the ex-president's live streams a good source of controversial topics.

Given the start time of the weekly live streams, we collected Twitter posts related to the former Brazilian president published up to 3 h after the start of the broadcasts. We focused on the live streams performed in May 2022 and collected tweets using the snscrapeFootnote 1 social media scraper. Our data consists of tweets mentioning the ex-president's Twitter profile (using '@jairbolsonaro') or his name ('Bolsonaro'). This way, we expect that the collected tweets at least mention Jair Bolsonaro.

Table 1. Examples collected for each May live stream. The last live stream was delayed due to Bolsonaro's schedule.

Table 1 shows the number of tweets collected for each live stream in May. In total, 30,120 unique tweets were analyzed, with an average of 20 tokens (space-separated tokens) per tweet. In addition, it is worth mentioning that the number of collected tweets exceeds the number of posts published in the YouTube comments section for each live stream.

3.2 Tweet Pre-processing

As a text pre-processing step for the following analyses, we removed mentions (usernames), URLs, empty texts, and duplicated texts using the spacyFootnote 2 Python package. We preserved hashtags, as they can be good indicators of themes or subjects. Table 1 presents the number of examples remaining after the text pre-processing.
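The cleanup described above can be sketched with simple regular expressions (an illustrative sketch; the paper performs this step with the spacy package, and the function name here is hypothetical):

```python
import re

def clean_tweets(tweets):
    """Remove mentions and URLs, keep hashtags, drop empty and duplicated texts.

    Minimal regex-based sketch of the pre-processing step.
    """
    seen, cleaned = set(), []
    for text in tweets:
        text = re.sub(r"@\w+", "", text)          # drop mentions (usernames)
        text = re.sub(r"https?://\S+", "", text)  # drop URLs
        text = re.sub(r"\s+", " ", text).strip()  # normalize whitespace
        if text and text not in seen:             # drop empty and duplicated texts
            seen.add(text)
            cleaned.append(text)
    return cleaned

# Hashtags survive; the near-duplicate and empty entries are removed.
print(clean_tweets(["@user oi #Brasil https://t.co/x", "@user oi #Brasil", ""]))  # → ['oi #Brasil']
```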

Following the text pre-processing, we converted the resulting texts into state-of-the-art contextualized dense representations based on Transformers [31]. We used efficient multilingual representations based on the MiniLM [32] language model, trained using the Sentence Transformers framework [24] and publicly available for use and researchFootnote 3.

We limited the token sequence length to 128 WordPiece tokens (smaller sequences are padded and larger sequences are truncated). The resulting representation of each tweet is a 384-dimensional vector, generated by averaging the contextualized token representations, and well suited to comparison using similarity metrics such as cosine distance [24]. Since these representations use context, we expect publications that are semantically similar (or related to similar subjects) to be close in the representation space.
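As an illustration of this property, cosine similarity over toy vectors (3-dimensional stand-ins for the 384-dimensional MiniLM embeddings; the numbers are made up for the example):

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two dense vectors (1.0 = same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Tweets on the same subject should map to nearby vectors...
similar = cosine_similarity([1.0, 0.9, 0.1], [0.9, 1.0, 0.2])
# ...while unrelated tweets should score lower.
unrelated = cosine_similarity([1.0, 0.9, 0.1], [-0.1, 0.2, 1.0])
print(similar > unrelated)  # → True
```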

Finally, we used UMAP (Uniform Manifold Approximation and Projection) [20] to reduce the dimensionality of the MiniLM vectors. UMAP is able to reduce dimensions while preserving the global distribution of the original data in the lower-dimensional space, with competitive execution time compared to other dimension reduction techniques (e.g., t-SNE). Since our goal is to use clustering as an intermediate step to identify controversial topics, the dimension reduction is used to improve the efficiency of the clustering method. We observed that performing clustering without dimension reduction led to worse results.

3.3 Clustering Tweets with HDBSCAN

We used the HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) [19] algorithm to cluster tweets and extract common topics, inspired by the success of previous works [4, 14]. The method is a hierarchical variation of the DBSCAN clustering algorithm [12] that is able to identify clusters of different densities and is more robust with respect to its parameters. HDBSCAN identifies regions with a high concentration of examples (high density) as clusters and is able to ignore noisy examples (labeling them as noise).

When representing Twitter publications as multidimensional vectors, we expect that regions of common subjects can be of any shape. Hence, we chose HDBSCAN to identify these regions of common subjects as clusters. The motivation for using HDBSCAN is its ability to identify clusters of different densities, handle outliers, automatically discover the number of clusters, and provide a hierarchical structure of clusters. Furthermore, when dealing with social media posts, not every publication is associated with a common theme. Assuming such examples are not close to high-density areas, HDBSCAN can easily identify them as noise.

3.4 Parameter Calibration

Since clustering is used to identify topics, the parameters of both methods, UMAPFootnote 4 and HDBSCANFootnote 5, must be chosen adequately. To find a good set of parameters, we used Random Search with the objective of maximizing the DBCV (Density-Based Clustering Validation) index [21]. DBCV is a metric created with the evaluation of density-based clustering as its main goal. It evaluates the density of the obtained clusters (considering their shape properties) and takes the number of noise examples into account. It assesses both the density sparseness of the examples within a cluster and the density separation between different clusters, using the core distance and mutual reachability distances presented by the authors. Thus, we chose DBCV to evaluate the quality of the clustering with a metric suited to density-based clustering, instead of relying on metrics such as the silhouette score and DB index, which may not capture the specific characteristics of density-based clustering algorithms.

Considering the UMAP dimension reduction pre-processing step, two parameters were adjusted: the number of components and the number of neighbors. The number of components determines the number of dimensions in which we want to embed our representations. The number of neighbors controls the balance between the local structure (favored by lower values) and the global structure (favored by larger values) of the data. Furthermore, the cosine metric was used to determine distances in the original space, and we fixed the minimum distance in the lower-dimensional space at zero.

For HDBSCAN, we chose to tune the following parameters: minimum cluster size and minimum samples. Minimum cluster size defines the smallest number of examples that can be considered a cluster. Minimum samples specifies the number of neighbors used to estimate the probability density function. For the remaining parameters, we preserved the defaults of the HDBSCAN Python package.

Although HDBSCAN determines the number of clusters automatically, we observed empirically that a minimum number of clusters is beneficial for the analysis. Thus, parameter sets that generated fewer than 4 clusters were discarded. Table 2 presents the UMAP and HDBSCAN parameter sets obtained for each day of analysis.
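The calibration loop can be sketched as follows (the parameter ranges and the scoring callback below are illustrative stand-ins; the paper maximizes the DBCV of a full UMAP + HDBSCAN run, and the actual ranges appear in Table 2):

```python
import random

# Illustrative parameter ranges; the ranges actually used appear in Table 2.
SEARCH_SPACE = {
    "n_components": range(2, 16),        # UMAP output dimensions
    "n_neighbors": range(5, 51),         # UMAP local/global balance
    "min_cluster_size": range(10, 101),  # HDBSCAN
    "min_samples": range(5, 51),         # HDBSCAN
}
MIN_CLUSTERS = 4  # parameter sets producing fewer clusters are discarded

def random_search(score_fn, n_trials=50, seed=42):
    """Maximize score_fn over random parameter draws.

    score_fn(params) returns (score, n_clusters); in the paper the score is
    the DBCV index of a UMAP + HDBSCAN run with those parameters.
    """
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        params = {k: rng.choice(list(v)) for k, v in SEARCH_SPACE.items()}
        score, n_clusters = score_fn(params)
        if n_clusters < MIN_CLUSTERS:  # discard degenerate clusterings
            continue
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Stand-in objective: pretend DBCV peaks near min_cluster_size = 30.
best, best_score = random_search(lambda p: (-abs(p["min_cluster_size"] - 30) / 30.0, 6))
print(best is not None)  # → True
```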

Table 2. Parameters obtained for each May tweet collection and value ranges used.

3.5 Sentiment Analysis

Sentiment analysis is used as an intermediate step of the controversial topic identification. Contextualized dense representations generated by Transformers [31] are predominant in different NLP tasks. Souza et al. [30] investigated different usages of Transformer representations for sentiment analysis of reviews in Brazilian Portuguese. As the authors' results show, fine-tuning a language model led to better performance on all the datasets used. Therefore, to perform sentiment analysis on the collected tweets, we used the BERTimbau-base [29] language model fine-tuned on the TweetSentBR corpus [9].

The corpus consists of tweets published during popular Brazilian TV shows. A total of 12,312 examples were recovered from the original corpus. The examples are divided into three classes: positive (45%), neutral (25%), and negative (29%). We preserved the class imbalance, since the minority class has a substantial number of examples (25%), and used the same text pre-processing described at the start of Sect. 3.2.

We fine-tuned BERTimbau for sentiment analysis using the popular HuggingFace libraryFootnote 6. We limited the token sequence length to a maximum of 128 tokens (padding the smaller sequences and truncating the larger ones). We trained the model for 5 epochs using a batch size of 128, a learning rate of \(2\times 10^{-5}\), and a cosine scheduler with warm-up (10% of the train data used for warm-up). Evaluating on a stratified holdout (80% train and 20% test), the model achieved \(69.15\%\) macro-F1. We used the fine-tuned model to perform sentiment analysis on the collected tweets described in Sect. 3.1.

3.6 Controversial Topic Identification

Controversial topic detection is done by combining the clustering and sentiment analysis of the collected tweets. For each date of analysis, HDBSCAN generates the cluster labels. Then, the fine-tuned BERTimbau extracts the sentiment of each tweet in the clusters. Next, the clusters are sorted by the percentage of positive, neutral, and negative examples they contain. Finally, assuming that each cluster corresponds to a certain subject, we observed that potentially controversial topics are usually located in clusters with large amounts of negative publications. Therefore, a controversial cluster is defined as a cluster whose negative percentage is above a threshold C (\(C \in \mathbb {R}\), \(0 \le C \le 1\)), where C is a configurable parameter. We observed that a threshold \(C = 0.7\) was enough to identify controversial clusters in the collected data. Following this methodology, once a controversial cluster is identified, we can also quantify how poorly the topic was received by Twitter users by analysing its negative percentage.
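A minimal sketch of this selection step (the function name and label conventions are illustrative; cluster labels follow HDBSCAN's convention of -1 for noise):

```python
from collections import Counter

def controversial_clusters(cluster_labels, sentiments, C=0.7):
    """Return {cluster: negative fraction} for clusters whose share of
    negative tweets reaches the threshold C. HDBSCAN noise (-1) is ignored.
    """
    counts = {}
    for label, sentiment in zip(cluster_labels, sentiments):
        if label == -1:
            continue
        counts.setdefault(label, Counter())[sentiment] += 1
    result = {}
    for label, c in counts.items():
        negative_fraction = c["negative"] / sum(c.values())
        if negative_fraction >= C:
            result[label] = negative_fraction
    return result

labels = [0, 0, 0, 0, 1, 1, 1, 1, -1]
sentiments = ["negative", "negative", "negative", "neutral",
              "positive", "positive", "negative", "neutral", "negative"]
print(controversial_clusters(labels, sentiments))  # → {0: 0.75}
```

Cluster 0 is flagged (3 of 4 tweets negative, above C = 0.7), cluster 1 is not, and the noise example is skipped.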

3.7 Cluster Analysis

The same cluster-based TF-IDF used by BERTopic [14] is used to discriminate and analyse the tweet clusters. The procedure is a generalization of the TF-IDF score: by modifying the definitions of term frequency and inverse document frequency, the cluster-based TF-IDF generalizes the TF-IDF representation from documents to collections (clusters) of documents. The modified score is defined as \(W_{t,c} = tf_{t,c}\cdot \log \left( 1+\frac{A}{tf_{t}}\right) \), where \(tf_{t,c}\) is the frequency of a token t in a cluster c, \(tf_{t}\) is the frequency of t across all clusters, and \(A\) is the average number of tokens per cluster. TF-IDF aims to assign high scores to the most discriminative tokens in a text example. Treating all examples of a cluster as a single example, the most discriminative tokens of a cluster are expected to have the highest TF-IDF scores. Thus, we summarized each cluster by the tokens with the highest TF-IDF scores. This way, by analysing the filtered tokens, we can easily identify the topics present in the clusters.

To generate the cluster TF-IDF scores, we added lemmatisation, punctuation removal, and stop-word removal to the text pre-processing step (Sect. 3.2), treated the examples of each cluster as a single document (by concatenation), and generated the scores using unigrams.
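The score above can be implemented in a few lines (a self-contained sketch operating on already pre-processed token lists; the helper names are illustrative):

```python
import math
from collections import Counter

def cluster_tfidf(clusters):
    """Cluster-based TF-IDF: W_{t,c} = tf_{t,c} * log(1 + A / tf_t),
    where each cluster is the concatenation of its (pre-processed) tweets.

    clusters: {cluster_id: list of token lists}.
    """
    tf_c = {c: Counter(tok for doc in docs for tok in doc)
            for c, docs in clusters.items()}
    tf_all = Counter()
    for counts in tf_c.values():
        tf_all.update(counts)
    A = sum(tf_all.values()) / len(clusters)  # average tokens per cluster
    return {c: {t: tf * math.log(1 + A / tf_all[t]) for t, tf in counts.items()}
            for c, counts in tf_c.items()}

def top_tokens(scores, k=10):
    """Tokens of one cluster sorted by descending cluster TF-IDF."""
    return [t for t, _ in sorted(scores.items(), key=lambda kv: -kv[1])[:k]]
```

For example, a token that is frequent inside one cluster but rare elsewhere receives a high score there, so `top_tokens` surfaces the cluster's most discriminative vocabulary.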

3.8 Evaluation

In addition to identifying a controversial cluster, it is interesting to associate a real-world event with its content as a form of validation. In order to associate news or presidential declarations with the controversial clusters, we manually searched the news published during the analyzed week. We also analyzed the topics discussed in Bolsonaro's live streams, in search of subjects that could also appear in the clusters.

To quantitatively assess the discoveries, we computed the coherence measure \(C_V\) [25] for each controversial topic identified. Based on the distributional hypothesis of words, the coherence measure \(C_V\) aims to quantify how much a topic is supported by the documents analysed. We evaluated \(C_V\) by summarizing each controversial topic by its top-10 tokens with the highest cluster TF-IDF scores (using the same pre-processing described in Sect. 3.7).
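For intuition, a much simpler coherence can be sketched over document co-occurrences (a UMass-style score; this is only a lightweight stand-in for \(C_V\), which additionally uses a sliding window, NPMI, and cosine similarity of context vectors and is typically computed with a library such as gensim):

```python
import math
from itertools import combinations

def umass_coherence(top_tokens, documents, eps=1e-12):
    """Simplified UMass-style topic coherence over document co-occurrences.

    top_tokens: the tokens summarizing one topic.
    documents: list of tokenized documents.
    Higher values mean the top tokens tend to appear together.
    """
    doc_sets = [set(d) for d in documents]

    def df(*tokens):  # number of documents containing all given tokens
        return sum(1 for s in doc_sets if all(t in s for t in tokens))

    score, pairs = 0.0, 0
    for w_i, w_j in combinations(top_tokens, 2):
        if df(w_j) == 0:
            continue
        score += math.log((df(w_i, w_j) + 1) / df(w_j) + eps)
        pairs += 1
    return score / pairs if pairs else 0.0
```

A topic whose top tokens co-occur in the same documents scores higher than one built from tokens that never appear together.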

4 Results and Discussions

In this section, we present the results of exploring the HDBSCAN clusters together with the sentiment analysis data.

4.1 Clustering and Sentiment Analysis

Table 3 shows the results of applying the parameters obtained in the calibration step (Sect. 3.4) to the collected Twitter data. We obtained DBCV values above 0.45 for all analyzed dates; the collection of May 19th generated the largest number of clusters and the lowest DBCV.

Table 3. Clustering and sentiment analysis results obtained.

Regarding sentiment analysis, Table 3 presents the quantities of positive, neutral, and negative tweets in the collected data. In general, we observed similar percentages of neutral tweets across dates. On the other hand, the percentages of positive and negative tweets varied the most, with the highest percentages associated with negative tweets. The predominance of negative comments in the Brazilian scenario corroborates the previously mentioned work [5].

4.2 Controversial Topic Discussion

Table 4. Controversial clusters detected and their respective sizes (number of examples), negative percentage and \(C_V\).
Fig. 4. Word cloud visualizations of the 50 tokens with the highest cluster TF-IDF score in each controversial cluster. The bigger the token, the higher its TF-IDF score.

Table 4 shows information about the controversial clusters identified. We observed a large variation between the dates of analysis in both the number of examples in the controversial clusters and the \(C_V\) metric. In addition, May \(19^{th}\) was the only date with more than one controversial cluster identified.

To visualize the controversial clusters identified by HDBSCAN, we generated the word clouds shown in Fig. 4. We analyzed the tweets in each controversial cluster and discuss the discovered controversial events below.

  • May \(05^{th}\) (Fig. 4a): a large quantity of tweets may have been motivated by a declaration made by Bolsonaro during his broadcast, in which he expressed his concern about fraud in the upcoming election. For context, in the same week as the live stream, the Brazilian military forces questioned the entity responsible for the electoral process about possible vulnerabilities in the electronic voting machines.Footnote 7

  • May \(12^{th}\) (Fig. 4b): during his live stream, Bolsonaro talked about a possible deal with Petrobras (an important Brazilian petroleum corporation) to reduce fuel prices. In the same week, the Brazilian Senate approved a bill changing fuel taxes. We observed tweets related to both events in the cluster.Footnote 8

  • May \(19^{th}\) (Figs. 4c and 4d): on this day, we identified two clusters with negative percentages over 70%. The first one (Fig. 4c) may be related to a speech made by Bolsonaro during an event in the capital of Rio de Janeiro on the same day. In his speech, Bolsonaro repeated that there is no corruption in his government and criticized the predecessor government. We observed the lowest \(C_V\) (0.402) for this cluster. The second cluster (Fig. 4d) seems to contain tweets related to news published during the week disclosing the ex-president's expenses on his corporate card.Footnote 9

  • May \(27^{th}\) (Fig. 4e): on the same day as the live stream, the Brazilian government announced a cut in spending destined for the Brazilian Ministry of Education. We observed a large number of tweets related to this event in the cluster; it had the worst reception from users (86% negative) and was the topic with the highest \(C_V\).Footnote 10

4.3 Evaluating a Simpler Approach

Table 5. Top-10 tokens and \(C_V\), based on the cluster TF-IDF for each date of data collection. HDBSCAN top tokens on the left, and K-means top tokens on the right.

To check whether a simpler clustering approach could obtain similar results, we repeated the methodology using the K-Means algorithm and no dimension reduction. We chose K-Means, instead of sparse topic modelling methods such as LDA, because it can benefit from the dense MiniLM representations. We fixed the K-Means parameters (200 random initialisations and 500 maximum iterations) and determined the number of clusters using the elbow rule.
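The elbow rule can be automated geometrically (one common heuristic, shown here as a sketch; the paper does not state how the elbow point was selected): pick the k whose inertia lies farthest from the straight line joining the first and last points of the curve.

```python
def elbow_point(inertias):
    """Elbow rule: return the 1-based k whose inertia has the largest
    perpendicular distance to the line joining the first and last points
    of the curve (assuming inertias correspond to k = 1, 2, ...).
    """
    n = len(inertias)
    x1, y1, x2, y2 = 1, inertias[0], n, inertias[-1]

    def dist(i):  # distance numerator from point (i + 1, inertias[i]) to the line
        x0, y0 = i + 1, inertias[i]
        return abs((y2 - y1) * x0 - (x2 - x1) * y0 + x2 * y1 - y2 * x1)

    return max(range(n), key=dist) + 1

# A typical K-Means inertia curve: a sharp drop followed by a plateau.
print(elbow_point([1000, 400, 150, 120, 110, 105]))  # → 3
```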

Focusing only on the clusters with the highest percentage of negative tweets, Table 5 compares the top-10 tokens, based on their cluster TF-IDF, for HDBSCAN and K-Means. As we can see, even though we observed similar results for the first date of analysis (May \(05^{th}\)), the use of HDBSCAN together with the UMAP dimension reduction led to clusters with more discriminative and meaningful tokens on the other days. This result is also reflected in the \(C_V\): although the \(C_V\) of the HDBSCAN topics varied the most, its values were always superior to those of the topics extracted by K-Means.

4.4 Other Discoveries

Fig. 5. Examples of clusters with high percentages of positive and neutral tweets. Word clouds generated in the same way as in Sect. 4.2.

To extract controversial topics, we focused only on the negative tweets. Although the percentages of positive and neutral tweets were not as representative as the negative ones, we chose to investigate their examples as well. By analyzing the clusters with high percentages of these classes, we also discovered interesting patterns. Considering all dates of analysis and all identified clusters, we briefly discuss the findings below.

  • Most positive clusters (Fig. 5a): analyzing the most positive clusters obtained for each date, we mainly identified tweets from supporters of Bolsonaro. Specifically, at the start of each live stream, the ex-president's Twitter account ('@jairbolsonaro') shared the YouTube link of the broadcast as a tweet. We identified that the clusters with the highest positive rates usually contain tweets from supporters, posted in response to the official account's tweet, congratulating Bolsonaro for carrying out the live stream. We discovered the tweets' source (in response to) by looking at the tweet metadata obtained from snscrape.

  • Most neutral clusters (Fig. 5b): sorting and investigating clusters by their neutral percentage, we identified clusters consisting of many short tweets containing laughter ('kkkkkk') without much context. HDBSCAN was able to isolate these 'laugh' clusters, but we suspected that the amount of neutral examples might indicate unwanted bias in the BERTimbau sentiment classifier. Indeed, by looking at the examples in these clusters, we identified many ironic or sarcastic tweets (with negative sentiment) incorrectly labeled as neutral.

5 Conclusions

In this paper, controversial political topics were successfully identified in Twitter data by applying clustering-based topic modeling and analyzing the negative publications. To the best of our knowledge, this is the first study to combine clustering and sentiment analysis based on Transformers to identify controversial topics in social media. We identified one event for each date and validated the detected controversy by searching the news published during the week. We compared two different approaches to identify controversial clusters and obtained favorable results using the UMAP and HDBSCAN methods. Finally, we presented results from investigating positive and neutral tweets in the obtained clusters.

As future work, we intend to replace the sentiment analysis step with aspect-based sentiment analysis to monitor social posts about other Brazilian politicians; the sentiment analysis could be improved by treating politician names as aspects of the analysis. We also intend to replicate our methodology to analyze data from other social networks (such as Facebook and Instagram).