Abstract
Popular participation in law-making is an important resource in the evolution of democracy and direct legislation. The amount of legislative documents produced within the past decade has risen dramatically, making it difficult for law practitioners to attend to legislation while still listening to the opinions of citizens. This work focuses on the use of topic models for summarizing and visualizing Brazilian comments about legislation (bills). We provide a qualitative evaluation from a legal expert and compare it with the topics predicted by our model. To that end, we designed a specific sentence embedding technique able to induce models for Portuguese texts and used the resulting models for topic modeling, obtaining very good results. We experimentally compared our proposal with other techniques for multilingual sentence embeddings, evaluating them on three topical corpora prepared by us, two of them annotated by a specialist and the third automatically annotated by hashtags.
1 Introduction
Popular participation in the definition of public policy is inherent to a democratic state, and it comprises more than electoral processes. Information and Communication Technologies (ICTs) provide a means for citizens to manifest their opinions on social media and government platforms [31]. Furthermore, citizens can effectively contribute to better public policies and government decision-making [41]. In Brazil, the Chamber of Deputies has an online platform that enables all Brazilian citizens to interact and express their opinions concerning bills being discussed by the parliament [12].
One of the main sources of public opinion data is social engagement platforms, such as TwitterFootnote 1. In 2019, approximately 500 million tweets were generated daily worldwide [47]. Relying exclusively on human judgment to analyze such large public opinion data streams in the context of Brazilian e-democracy is not feasible. The design and use of new computational tools able to extract relevant and useful information from the population's opinions on new bills can result in legislation that is more widely accepted and embraced.
As a means to reliably process opinions and ensure that citizens have their freedom of speech respected and equally considered, text analytics techniques [18] emerge as tools for e-democracy [17]. It is important to observe that the objects (opinions) received in these data streams are mostly unlabeled. Hence, unsupervised and semi-supervised learning approaches, such as clustering and topic modeling algorithms, are more suitable to extract relevant information from these data.
Text clustering and topic modeling techniques, often used in text mining, have been frequently adopted for text data analysis [17]. Previous studies, such as Evangelopoulos and Visinescu [17], successfully applied clustering algorithms for topic modeling in the analysis of citizens' opinions in e-democracy. Topic modeling techniques can extract information from texts to support semantic analyses. This paper investigates a similar application, using topic modeling as the core method to extract key information about Brazilians' political opinions on proposed legislation publicly available in the Chamber of Deputies online platform. This work focuses on a very specific instrument of citizen participation in the law-making process: comments related to bills, in particular those collected from the polls in the Chamber of Deputies web portal (refer to Sect. 3 for details). While there is a vast literature exploring the analysis of social media influence in politics [5, 30], our comments dataset is more structured and focused on specific bills. The research reported in this paper has the following contributions:
-
We adapted the BERTopic topic mining tool [21] to extract topics from political comments from [11];
-
We applied an unsupervised deep-learning-based embedding technique [19] to train a Brazilian Portuguese legislative sentence embedding; the resulting model is available at https://github.com/nadiafelix/Bracis2021;
-
We compared the performance of BERTopic leveraging Portuguese and multilingual sentence embedding models. The sentence embeddings are used by the topic models as input features in the clustering approach;
-
We described and made available three topical corpora, two of them annotated by a specialist and the other automatically annotated by hashtags;
-
We performed a dynamic topic modeling presenting a timeline of evolution;
-
We evaluated a topic model, previously evaluated only for English, on Brazilian Portuguese corpora.
-
Our work, to the best of our knowledge, is the first open source approach in Brazilian Portuguese to analyze user-generated content from the parliament platform about bills.
This work is organized as follows. The second section provides an overview of related work. Section 3 outlines the corpora and exploratory analyses. The fourth section introduces the methodology. Section 5 presents the experimental evaluation. The final section discusses and concludes the work, summarizing the findings, advantages, limitations, contributions, and research opportunities.
2 Related Work
Legal Text Analysis. Recent advances in natural language processing techniques led to a surge in the application of text analysis in legal domains. In private law practice, legal analytics tools impact areas such as legal research, document retrieval, contract review, automatic document generation, and custom legal advice [14, 39]. Recent surveys indicate that the adoption of these tools not only significantly reduces the time to deliver services but also improves the predictability of and trust in the litigation process [3].
Beyond assisting legal practitioners, textual analysis has been increasingly adopted in the legislative process. Before a law comes into force, a bill (law proposal) is subject to extensive technical analysis and discussion, which often allows the participation of civil organizations and the general public. During this process, a plethora of textual documents is generated, including technical reports, transcribed speeches, project amendments, among others. The availability of these documents in open formats [10] spurred the development of approaches to extract insights including the analysis of parliamentary debates [1] and prediction of bill enactment probabilities [27].
Among those legislative documents, this work focuses, as noted in the introduction, on a very specific instrument of citizen participation in the lawmaking process: comments on law proposals collected from the polls in the Chamber of Deputies web portal (refer to Sect. 3 for details). Unlike the social media streams studied in the vast literature on political influence [5, 30], this comments dataset is more structured and focused on specific bills.
Text Clustering. Text clustering is a data mining task that consists of grouping instances according to their characteristics; as in other learning tasks, the curse of dimensionality jeopardizes clustering performance [28]. Short texts, such as tweets, lead to sparser vectors, which may require compression. Clustering has hierarchical approaches [45]: the top-down approach starts with an initial cluster containing the entire dataset, which is successively divided [2], while the bottom-up approach initially treats each instance as a cluster and agglomerates the most similar instances according to a predefined similarity measure [2]. Distance-based clustering, such as KMeans, starts with k clusters, each with one centroid, and assigns the remaining instances to centroids according to a distance metric [6], such as cosine similarity, Euclidean distance, or Jaccard distance. Finally, density-based methods, such as Density-Based Spatial Clustering of Applications with Noise (DBSCAN), grow clusters from the instances that have the most neighbors [40].
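As an illustration of the density-based strategy just described, the following is a minimal pure-Python DBSCAN sketch (the parameter names `eps` and `min_pts` follow standard DBSCAN terminology; this is an illustrative sketch, not the implementation used in the experiments):

```python
from math import dist

def dbscan(points, eps=1.0, min_pts=3):
    """Minimal DBSCAN: points with at least `min_pts` neighbors within
    radius `eps` seed and expand clusters; the rest are noise (-1)."""
    labels = {}        # point index -> cluster id (-1 = noise)
    cluster_id = -1
    neighbors = lambda i: [j for j, q in enumerate(points)
                           if dist(points[i], q) <= eps]
    for i in range(len(points)):
        if i in labels:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1          # noise (may later join a cluster as border)
            continue
        cluster_id += 1             # i is a core point: start a new cluster
        labels[i] = cluster_id
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels.get(j, -1) == -1:     # unvisited or previously noise
                labels[j] = cluster_id
                js = neighbors(j)
                if len(js) >= min_pts:      # j is also a core point: expand
                    queue.extend(k for k in js if k not in labels)
    return [labels[i] for i in range(len(points))]
```

Note that, unlike KMeans, the number of clusters is not an input: it emerges from the density structure of the data, which is the property BERTopic relies on to infer the number of topics automatically.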
Topic models and text clustering have analogous purposes: both aim to group text instances based on their similarity [24, 26]. Nonetheless, whereas clustering algorithms mostly take a deterministic approach, topic models rest on stochastic assumptions [2]. Moreover, interpretable topic models extract from each group the latent semantic n-grams that best represent the topic [26], in contrast to clustering algorithms, which only gather instances without summarizing the inherent characteristics of each group.
Topic Modeling. Topic models are widely used to analyze large text collections. They represent a suite of unsupervised learning algorithms whose purpose is to discover the thematic structure in a collection of documents. Originally, Latent Semantic Indexing [29] and Probabilistic Latent Semantic Indexing [23] paved the way for the creation of Latent Dirichlet Allocation (LDA) [7], which is the most commonly implemented topic modeling algorithm in use today. In recent years, researchers have started to develop topic modeling algorithms that achieve state-of-the-art performance by leveraging word embeddings [15, 25, 34].
Lda2vec [26], the Embedded Topic Model (ETM) [16], and Deep Latent Dirichlet Allocation (DLDA) [13] are three examples of topic models that extend LDA. Lda2vec is an unsupervised topic model that combines Latent Dirichlet Allocation and word embeddings (word2vec skip-gram) [26], generating context vectors by combining word vectors and document vectors [26]. The second model, ETM, incorporates word embeddings into LDA [16] and maintains its performance even with large vocabularies. In addition, topics are represented as vectors in the semantic space, and the topic embeddings generate context vectors; this leads to interpretable topics regardless of the inclusion or removal of stopwords. The third model, DLDA, uses sampling strategies that provide scalability to single- or multi-layered methods. Besides, DLDA can have multiple learning rates (based on topics and layers) and is a generalization of LDA. In contrast to the two previous models, DLDA uses count vectors. All three models lose semantic context due to their embedding approach, an issue partially addressed by methods that consider sentence embeddings, such as BERT [15].
Top2Vec [4] and BERTopic [21] proposed the use of distributed representations of documents and words due to their ability to capture the semantics of words and documents. Their authors emphasize that earlier approaches required the number of topics as an input to the topic model, whereas Top2Vec and BERTopic discover this parameter automatically. The first work proposes that topic vectors be jointly embedded with the document and word vectors, with the distance between them representing semantic similarity. BERTopic is a topic modeling technique that leverages sentence-based BERT embeddings [15] and a class-based TF-IDF [36] to create dense clusters, allowing for interpretable topics whilst keeping important words in the topic descriptions.
As this paper focuses on topic models, we compare the previously described approaches in Table 1 (the columns #Topics and Dynamic express whether the approach uses the number of topics and dynamic topic modeling, respectively). This table shows, e.g., that BERTopic [21] can use different topic modeling techniques (value “yes” in the column #Topics) and supports dynamic topic modeling (value “yes” in the column Dynamic), a collection of techniques that analyze the evolution of topics over time, generating topic representations at each timestamp for each topic. These methods help to understand how a topic changes during a period of time. Furthermore, BERTopic provides other forms of topic modeling not explored in this study, such as semi-supervised learning.
3 Portuguese Political Comments
The Chamber of Deputies Board of Innovation and Information Technology provided the dataset used in this study, which is available at [11]. This dataset consists of user commentary information for each bill, including the number of comments, date, opinion on the bill (positive or negative), and the commented proposition.
Users can participate through the website [11]. To organize the commentaries, since there are several law projects and constitutional amendment propositions, the user chooses a specific proposition, indicates whether he/she stands for or against the legal document, and formulates the comment itself. A feature was implemented to “like” a certain commentary, simulating a social network, with the purpose of emphasizing the popularity of a certain point of view.
An exploratory data analysis was performed on the dataset prior to the experimental evaluation. For every comment we calculated its word count, character count, and mean word length (characters per word). These metrics can be visualized in Fig. 1. Metadata with the number of likes of each comment, as well as its perception by the user as a positive or negative aspect of the legal document were also included in the analysis.
The majority of comments have no or few likes, which is expected since the most-liked comments are shown to users on the voting page. This added visibility creates a snowball effect in which those comments attain even more likes. This pattern is seen across both bills and positive/negative opinions.
In word and character count, the density distributions differ slightly in shape between the projects and opinions, although their medians remain close to each other. The mean word length presents a normal-like distribution in all categories. Only comments with at least 2 words were kept in the corpora, and URL links were removed.
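The per-comment statistics used in this analysis (word count, character count, and mean word length), together with the preprocessing just described, can be sketched as follows. The paper does not specify whether the character count includes whitespace; here it counts only the characters of the words, which is an assumption:

```python
import re

def comment_metrics(comment):
    """Word count, character count, and mean word length for one comment."""
    text = re.sub(r"https?://\S+", "", comment)   # strip URL links, as in Sect. 3
    words = text.split()
    n_words = len(words)
    n_chars = sum(len(w) for w in words)          # assumption: whitespace excluded
    mean_len = n_chars / n_words if n_words else 0.0
    return n_words, n_chars, mean_len

def keep(comment):
    """Corpus filter: keep comments with at least 2 words."""
    return comment_metrics(comment)[0] >= 2
```

For instance, `comment_metrics("Sou contra o projeto")` yields 4 words, 17 characters, and a mean word length of 4.25.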
3.1 Corpora
For our experiments, we considered all comments associated with two bills (PL 3723/2019 and PEC 471/2005), and the set of all comments that have a hashtag in their content (Hashtag Corpus). Below, we briefly detail these scenarios.
PL 3723/2019. The PL 3723/2019 bill amends Law No. 10.826, of December 22, 2003, which deals with the registration, possession, and sale of firearms and ammunition and defines related crimes. The PL 3723/2019 corpus was collected from National Congress PollsFootnote 2, in which the citizens are asked about the positive and negative parts of this bill. This corpus is composed of 195 negative and 530 positive comments; for our experiments, we considered the total number of comments (i.e., 725 comments).
PEC 471/2005. The PEC 471/2005 bill grants tenure to the current holders of and substitutes for notary services invested in accordance with the law. Known as the “PEC dos Cartórios”, the PEC 471/2005 corpus was collected from National Congress PollsFootnote 3, in which the citizens were asked about the positive and negative parts of this bill. This corpus is composed of 383 negative and 205 positive comments; for our experiments, we considered the total number of comments, i.e., 588 comments.
Hashtag Corpus. This corpus is composed of 522 comments with hashtags [11], drawn from bills that have at least 16 such comments. In total, 189 bills have hashtagged comments; we considered the ten bills with the most comments and hashtags: PEC 32/2020, PL 318/2021, PL 3019/2020, PDL 232/2019, PL 2893/2019, PEC 135/2019, PEC 101/2003, PL 4425/2020, PEC 6/2019, and PL 1577/2019. More details are shown in Table 2.
4 Methodology
We summarize the methodology for topic modeling in Fig. 2. First, we obtain the comments of citizens about a bill; next, we choose a sentence embedding that converts the comments into vector representations. Finally, a clustering algorithm obtains clusters based on spatial density.
4.1 Sentence Embeddings
A common method to address clustering and semantic search is to map each sentence to a vector space such that semantically similar sentences are close. Researchers have started to input individual sentences into BERT [15] and derive fixed-length sentence embeddings. The most common approaches average the BERT output layer (known as BERT embeddings) or use the output of the first token (the [CLS] token). As shown by [33], this practice yields rather poor sentence embeddings, often worse than averaging static word embeddings (for example, GloVe [32]). In this work, we evaluate three sentence embeddings: (i) the Universal Sentence Encoder [9]; (ii) the Multilingual Universal Sentence Encoder for Semantic Retrieval (distiluse-base-multilingual-cased-v2) [48]; and (iii) a Portuguese Legislative Sentence Embedding [19] trained by us.
-
(i) Universal Sentence Encoder (USE)Footnote 4: an encoder of greater-than-word-length text trained on a variety of data [9] in a variety of languages (Arabic, Chinese-simplified, Chinese-traditional, English, French, German, Italian, Japanese, Korean, Dutch, Polish, Portuguese, Spanish, Thai, Turkish, Russian), i.e., a cross-lingual sentence representation. Unsupervised training data for the sentence encoding models are drawn from web sources such as WikipediaFootnote 5, web news, web question-answer pages, and discussion forums. The authors augment unsupervised learning with training on supervised data from the Stanford Natural Language Inference corpus (SNLI)Footnote 6.
-
(ii) Multilingual Universal Sentence Encoder for Semantic Retrieval (distiluse-base-multilingual-cased-v2): a multilingual knowledge-distilled version of the multilingual Universal Sentence Encoder supporting 50+ languages, including Portuguese. Training data consist of question-answer pairs mined from online forums, including RedditFootnote 7, StackOverflowFootnote 8, and YahooAnswersFootnote 9, translation pairs, and MultiNLI [46], an extensive corpus containing examples from multiple sources. The number of mined question-answer pairs varies across languages, with a bias toward a handful of top-tier languages; to balance training across languages, the authors used Google's translation system to translate SNLI to the other languages.
-
(iii) Portuguese Legislative Sentence Embedding: we trained our sentence embedding using a contrastive learningFootnote 10 framework [19], named SimCSE by its authors, coupled with the Portuguese pre-trained language model BERTimbau [37]. This proposal advances state-of-the-art sentence embeddings using only unsupervised learning [19]. We use the same hyperparameters, and the training set is described next: (1) from [12], we used the column Conteudo with 215,713 comments, removing the comments from PL 3723/2019, PEC 471/2005, and the Hashtag Corpus in order to avoid bias; (2) from [12], we also used 147,008 bills, taking the summary field txtEmenta and the project core text txtExplicacaoEmenta; (3) from Political SpeechesFootnote 11, we used 462,831 texts, specifically the columns sumario, textodiscurso, and indexacao. These corpora were segmented into sentences and concatenated, producing 2,307,426 sentences.
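All three embeddings above place comments in a space where semantic similarity is measured by cosine similarity. A minimal sketch, with toy 3-dimensional vectors standing in for real model outputs (the example sentences and values are illustrative only, not actual embeddings):

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity: 1.0 for identical directions, 0.0 for orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors: in a good embedding space, the two pro-bill comments
# should be closer to each other than to the anti-bill comment.
emb = {
    "sou a favor do projeto": [0.9, 0.1, 0.2],
    "apoio o projeto de lei": [0.8, 0.2, 0.3],
    "sou contra a proposta":  [0.1, 0.9, 0.4],
}
```

For instance, `cosine(emb["sou a favor do projeto"], emb["apoio o projeto de lei"])` is higher than the similarity between a pro-bill and an anti-bill comment, which is exactly the geometry the density-based clustering step exploits.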
4.2 Topic Models
BERTopic [21] is a topic modeling technique that leverages transformer-based sentence embeddings and c-TF-IDF, a class-based TF-IDF (Term Frequency-Inverse Document Frequency), to create dense clusters, allowing for interpretable topics whilst keeping important words in the topic descriptions. The algorithm assumes that many semantically similar documents can indicate an underlying topic. This approach automatically finds the number of topics, requires no stop word lists or stemming/lemmatization, and works on short texts [21].
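The class-based TF-IDF can be sketched as follows, following the formulation in [21]: the documents of each cluster are joined into one “class document”, and a term's score is its in-class frequency weighted by log(1 + A/f_t), where A is the average number of words per class and f_t the term's frequency across all classes. This is a simplified reading of c-TF-IDF with whitespace tokenization assumed, not BERTopic's actual code:

```python
from collections import Counter
from math import log

def c_tf_idf(docs_per_class):
    """Class-based TF-IDF: a term scores high when it is frequent within
    its class but rare across all classes."""
    # Term frequencies per class, after joining each class into one document
    tf = {c: Counter(" ".join(docs).split())
          for c, docs in docs_per_class.items()}
    f = Counter()                      # term frequency across all classes
    for counts in tf.values():
        f.update(counts)
    A = sum(f.values()) / len(tf)      # average number of words per class
    return {c: {t: n * log(1 + A / f[t]) for t, n in counts.items()}
            for c, counts in tf.items()}
```

The highest-scoring terms per cluster become the topic description, which is how BERTopic keeps important words in the topic representations.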
5 Experimental Evaluation
5.1 Quantity and Quality Evaluation
The unsupervised nature of topic models makes evaluation difficult [43]. We use topic coherence evaluation [38] to decide which sentence embedding yields better interpretability.
Topic Coherence Evaluation. The goal of topic coherence is to automate the evaluation of the interpretability of latent topics; the underlying idea is rooted in the distributional hypothesis of linguistics [22] – words with similar meanings tend to occur in similar contexts [38]. Coherence measures have been shown to correlate positively with human interpretability and have been applied to evaluate the effectiveness of advances in the field of topic modeling. In order to evaluate our models, we used the coherence measure \(C_V\) from the literature [35], the most accurate measure according to Röder et al. [35], calculated as follows. The top N words of each topic are selected as the representation of the topic, denoted as W = \(\{w_1, . . . , w_N \}\). Each word \(w_i\) is represented by an N-dimensional vector \(v(w_i) =\) \(\{NPMI(w_i, w_j )\}_{j=1,...,N}\), whose jth entry is the Normalized Pointwise Mutual Information (NPMI) between words \(w_i\) and \(w_j\), i.e., \(NPMI(w_i, w_j) = \frac{\log P(w_i,w_j)-\log (P(w_i)P(w_j))}{-\log P(w_i,w_j)}\), which involves the marginal and joint probabilities \(P(w_i), P(w_j), P(w_i, w_j)\). W is represented by the sum of all word vectors, \(v(W) = \sum _{j=1}^{N} v(w_j)\). For each word \(w_i\), a pair \((v(w_i), v(W))\) is formed, and a cosine similarity \(\phi _i(v(w_i), v(W)) = \frac{v(w_i)^Tv(W)}{\parallel v(w_i) \parallel \parallel v(W) \parallel }\) is calculated for each pair. The final \(C_V\) score for the topic is the average of all \(\phi _i\)'s.
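The computation above can be sketched in a few lines. Probabilities are supplied directly here (in practice they are estimated from word co-occurrences in a sliding window over a reference corpus), and a word's NPMI with itself is 1.0 by definition:

```python
from math import log, sqrt

def npmi(p_i, p_j, p_ij):
    """Normalized pointwise mutual information between two words."""
    return (log(p_ij) - log(p_i * p_j)) / -log(p_ij)

def c_v(top_words, p):
    """C_V of one topic: each top word w_i becomes the vector of its NPMI
    with every top word; the score is the mean cosine similarity between
    each word vector v(w_i) and the summed topic vector v(W).
    `p[w]` is a marginal probability, `p[(w1, w2)]` (sorted pair) a joint one."""
    N = len(top_words)
    def npmi_pair(w, u):
        return 1.0 if w == u else npmi(p[w], p[u], p[tuple(sorted((w, u)))])
    vec = {w: [npmi_pair(w, u) for u in top_words] for w in top_words}
    v_W = [sum(vec[w][k] for w in top_words) for k in range(N)]
    def cos(u, v):
        return (sum(a * b for a, b in zip(u, v))
                / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))))
    return sum(cos(vec[w], v_W) for w in top_words) / N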
Human Evaluation. Each performance metric has distortions and limitations and, in the context of unsupervised learning, could mislead results. Thus, an expert also qualitatively evaluated the datasets. The specialist defined the key topics of the corpora without knowing how many topics the model produced and without any other clue.
5.2 Setup
We used the BERTopic [21] topic modeling toolFootnote 12 with the default parameters and n-grams ranging from 2 to 20; the n-gram range corresponds to the number of terms that compose each topic. We report results over 5 runs, showing the mean and standard deviation for topics ranging from 2 to 46 in increments of 2.
We trained our Portuguese Legislative Sentence Embedding based on SimCSEFootnote 13 with train batch size = 128, number of epochs = 1, max sequence length = 32, and the Portuguese pre-trained language model BERTimbau [37]. SimCSE is based on Huggingface's transformers packageFootnote 14, and we use 5e−5 as the learning rate with the unsupervised learning approach.
5.3 Results
Coherence of Topics. In the first experiment, we analyze the number of topics versus the \(C_V\) coherence score for the three corpora. As shown in Figs. 3a, 3b, and 3c, the Portuguese Legislative Sentence Embedding presents the best results in all experiments.
We emphasize that although we analyze the coherence of topics as a function of the number of topics, this metric was not used to decide the optimal number of topics. This is because the clustering algorithm used to learn the model infers automatically the number of topics through the spatial density of the sentences represented by the embeddingsFootnote 15.
Real and Predicted Topics. The second experiment compares the topics inferred by the best BERTopic configuration (with the Portuguese Legislative Sentence Embedding) and the topics noted by the domain specialist. We pair the most similar ones in Table 3, showing that the approach proposed by BERTopic is indeed effective. For reasons of space, we report the real topics and respective predicted topics only for PL 3723/2019Footnote 16. It is important to note that the topics raised by the expert are not the only or definitive ones; for this experiment, we asked her to raise the obvious topics. Another observation is that the expert did not know how many topics were inferred by BERTopic and did not have previous access to them, so there is no bias in her answers.
Example 1
Defesa, pois como vigilante tenho o direto de defesa. Trabalho com risco de vida. E também não estamos seguros. O governo não nos da segurança. (“Defense, because as a security guard I have the right to defend myself. My job puts my life at risk. And we are not safe either. The government does not give us security.”)
BERTopic models Example 1 as shown in Fig. 4, where one can see that topic 10 is the most significant. The relevant keywords from this topic are ‘armas’, ‘não’, ‘direito’, ‘de fogo’, ‘cidadão’, and ‘defesa’. Other topics are also relevant: topic 8 is about ‘arma de fogo’ and ‘uma arma de fogo’; topic 9 is about ‘advogados’, ‘categorias’, ‘risco’, ‘de risco’, ‘arma’, and ‘porte de arma’.
6 Discussion
This work has presented a fully automated approach for identifying topics in Brazilian comments about legislation (bills). BERTopic together with the Portuguese Legislative Sentence Embedding achieved the best results. This work is an early indication of how legal practitioners can identify salient and coherent topics using automatic topic modeling tools and how this can foster greater citizen participation.
In the sentence embeddings context, we experimented with an unsupervised training approach that takes an input sentence and predicts itself in a contrastive objective, with only standard dropout used as noise [19]. This simple method works well, performing on par with previous supervised counterparts [19]. We intend to train other unsupervised models for Portuguese in this legal domain, as in [44], evaluating them in topic modeling and in other tasks.
Regarding the topic model experiments, we intend to analyze other coherence measures, as well as human evaluation. Besides that, we intend to develop an annotation protocol for the corpora, thereby identifying the topics and enabling a more accurate evaluation with supervised measures [42].
Notes
- 1.
- 2.
- 3.
- 4.
- 5.
- 6.
- 7.
- 8.
- 9.
- 10.
Contrastive learning is a machine learning technique used to learn the general features of a dataset without labels by teaching the model which data points are similar or different [20].
- 11.
Available at https://github.com/nadiafelix/Bracis2021.
- 12.
- 13.
- 14.
- 15.
To analyze coherence as a function of the number of topics, we provide the number of topics as an input parameter.
- 16.
We report the real Topics and respective predicted topics for PEC 471/2005 and Hashtag corpus in https://github.com/nadiafelix/Bracis2021.
References
Abercrombie, G., Batista-Navarro, R.: Sentiment and position-taking analysis of parliamentary debates: a systematic literature review. J. Comput. Soc. Sci. 3(1), 245–270 (2020)
Allahyari, M., et al.: A brief survey of text mining: classification, clustering and extraction techniques (2017)
Andrade, M.D.D., Rosa, B.D.C., Pinto, E.R.G.D.C.: Legal tech: analytics, inteligência artificial e as novas perspectivas para a prática da advocacia privada. Revista Direito GV 16(1) (2020)
Angelov, D.: Top2Vec: distributed representations of topics (2020). https://arxiv.org/abs/2008.09470
Barberá, P., Rivero, G.: Understanding the political representativeness of Twitter users. Soc. Sci. Comput. Rev. 33(6), 712–729 (2015)
Basu, S., Banerjee, A., Mooney, R.J.: Active semi-supervision for pairwise constrained clustering. In: Proceedings of the 2004 SIAM, International Conference on Data Mining. Society for Industrial and Applied Mathematics, April 2004
Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022 (2003)
Campello, R.J.G.B., Moulavi, D., Zimek, A., Sander, J.: Hierarchical density estimates for data clustering, visualization, and outlier detection. ACM Trans. Knowl. Discov. Data 10(1) (2015)
Chidambaram, M., et al.: Learning cross-lingual sentence representations via a multi-task dual-encoder model (2019)
Câmara dos Deputados: Dados Abertos da Câmara dos Deputados (Open Data of the Chamber of Deputies, when translated to English). https://dadosabertos.camara.leg.br (2021). Accessed 8 June 2021
Câmara dos Deputados (Brazilian Chamber of Deputies, when translated to English): Enquetes (polls, when translated to English). https://www.camara.leg.br/enquetes. Accessed 5 May 2021
Câmara dos Deputados (Brazilian Chamber of Deputies, when translated to English): Popular participation. https://www2.camara.leg.br/transparencia/servicos-ao-cidadao/participacao-popular (2021). Accessed 5 May 2021
Cong, Y., Chen, B., Liu, H., Zhou, M.: Deep latent Dirichlet allocation with topic-layer-adaptive stochastic gradient Riemannian MCMC. In: Proceedings of the 34th International Conference on Machine Learning, vol. 70, pp. 864–873. ICML 2017, JMLR.org (2017)
Dale, R.: Law and word order: NLP in legal tech. Nat. Lang. Eng. 25(1), 211–217 (2019)
Devlin, J., Chang, M., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Comp. Linguistics: Human Language Tech, pp. 4171–4186. Minnesota, June 2019
Dieng, A.B., Ruiz, F.J.R., Blei, D.M.: Topic modeling in embedding spaces. Transactions of the Association for Computational Linguistics 8, 439–453 (2020)
Evangelopoulos, N., Visinescu, L.: Text-mining the voice of the people. Commun. ACM 55(2), 62–69 (2012)
Gandomi, A., Haider, M.: Beyond the hype: big data concepts, methods, and analytics. Int. J. Inf. Manage. 35(2), 137–144 (2015)
Gao, T., Yao, X., Chen, D.: SimCSE: simple contrastive learning of sentence embeddings. https://arxiv.org/abs/2104.08821 (2021). Accessed 5 May 2021
Giorgi, J., Nitski, O., Wang, B., Bader, G.: DeCLUTR: deep contrastive learning for unsupervised textual representations (2021)
Grootendorst, M.: BERTopic: leveraging BERT and c-TF-IDF to create easily interpretable topics. https://doi.org/10.5281/zenodo.4381785. Accessed 5 May 2021
Harris, Z.S.: Distributional structure. Word 10(2–3), 146–162 (1954)
Hofmann, T.: Probabilistic latent semantic indexing. In: Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. SIGIR 1999, pp. 50–57. Association for Computing Machinery, New York, NY, USA (1999)
Mahyoub, M., Hind, J., Woods, D., Wong, C., Hussain, A., Aljumeily, D.: Hierarchical text clustering and categorisation using a semi-supervised framework. In: 2019 12th International Conference on Developments in eSystems Engineering (DeSE). IEEE, October 2019
Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space (2013)
Moody, C.E.: Mixing Dirichlet topic models and word embeddings to make lda2vec (2016)
Nay, J.J.: Predicting and understanding law-making with word vectors and an ensemble model. PLoS ONE 12(5), e0176999 (2017)
Nebu, C.M., Joseph, S.: Semi-supervised clustering with soft labels. In: 2015 International Conference on Control Communication & Computing India (ICCC). IEEE, November 2015
Papadimitriou, C., Raghavan, P., Tamaki, H., Vempala, S.: Latent semantic indexing: a probabilistic analysis. J. Comput. Syst. Sci. 61(2), 217–235 (2000)
Parmelee, J.H., Bichard, S.L.: Politics and the Twitter Revolution: How Tweets Influence the Relationship Between Political Leaders and the Public. Lexington books, Lanham (2011)
Pavan, J.N.S., Pinochet, L.H.C., de Brelàz, G., dos Santos Júnior, D.L., Ribeiro, D.M.N.M.: Study of citizen engagement in the participation of elective mandate actions in the Brazilian legislature: analysis of the use of political techs. Cadernos EBAPE.BR 18(3), 525–542, September 2020
Pennington, J., Socher, R., Manning, C.: GloVe: global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543. Association for Computational Linguistics, Doha, Qatar, October 2014
Reimers, N., Gurevych, I.: Sentence-BERT: sentence embeddings using Siamese BERT-networks. In: Proceedings of the 2019 Conference on Empirical Methods in NLP and the 9th International Joint Conference on NLP (EMNLP-IJCNLP), pp. 3982–3992. ACL, Hong Kong, China, November 2019
Reimers, N., Gurevych, I.: Making monolingual sentence embeddings multilingual using knowledge distillation. In: Proceedings of the 2020 Conference on Empirical Methods in NLP. ACL, November 2020
Röder, M., Both, A., Hinneburg, A.: Exploring the space of topic coherence measures. In: Proceedings of the 8th ACM International Conference on Web Search and Data Mining, WSDM 2015, pp. 399–408. Association for Computing Machinery, New York, NY, USA (2015)
Sammut, C., Webb, G.I. (eds.): TF-IDF, pp. 986–987. Springer, US, Boston, MA (2010)
Souza, F., Nogueira, R., Lotufo, R.: BERTimbau: pretrained BERT models for Brazilian Portuguese. In: 9th Brazilian Conference on Intelligent Systems, BRACIS, Rio Grande do Sul, Brazil, 20–23 October 2020. (to appear)
Stevens, K., Kegelmeyer, P., Andrzejewski, D., Buttler, D.: Exploring topic coherence over many models and many topics. In: Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 952–961. Association for Computational Linguistics, Jeju Island, Korea, July 2012
Sugathadasa, K., et al.: Legal document retrieval using document vector embeddings and deep learning. In: Arai, K., Kapoor, S., Bhatia, R. (eds.) SAI 2018. AISC, vol. 857, pp. 160–175. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-01177-2_12
Thaiprayoon, S., Unger, H., Kubek, M.: Graph and centroid-based word clustering. In: Proceedings of the 4th International Conference on NLP and Information Retrieval. ACM, December 2020. https://doi.org/10.1145/3443279.3443290
United Nations: Inclusion and more public participation, will help forge better government policies: Guterres. https://news.un.org/en/story/2020/09/1073742, September 2020. Accessed 5 May 2021
Vinh, N.X., Epps, J., Bailey, J.: Information theoretic measures for clusterings comparison: is a correction for chance necessary? In: Proceedings of the 26th Annual International Conference on Machine Learning, ICML 2009, pp. 1073–1080. Association for Computing Machinery, New York, NY, USA (2009)
Wallach, H., Murray, I., Salakhutdinov, R., Mimno, D.: Evaluation methods for topic models. In: Proceedings of the 26th Annual International Conference on ML (ICML 2009), pp. 1105–1112. ACM (2009)
Wang, K., Reimers, N., Gurevych, I.: TSDAE: using transformer-based sequential denoising auto-encoder for unsupervised sentence embedding learning (2021)
Willett, P.: Recent trends in hierarchic document clustering: a critical review. Inf. Process. Manage. 24(5), 577–597 (1988)
Williams, A., Nangia, N., Bowman, S.: A broad-coverage challenge corpus for sentence understanding through inference. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, vol. 1, pp. 1112–1122. Association for Computational Linguistics, New Orleans, Louisiana, June 2018
World Economic Forum: How much data is generated each day? https://www.weforum.org/agenda/2019/04/how-much-data-is-generated-each-day-cf4bddf29f/ (2019). Accessed 5 May 2021
Yang, Y., et al.: Multilingual universal sentence encoder for semantic retrieval (2019)
© 2021 Springer Nature Switzerland AG
Silva, N.F.F.d. et al. (2021). Evaluating Topic Models in Portuguese Political Comments About Bills from Brazil's Chamber of Deputies. In: Britto, A., Valdivia Delgado, K. (eds.) Intelligent Systems. BRACIS 2021. Lecture Notes in Computer Science, vol. 13074. Springer, Cham. https://doi.org/10.1007/978-3-030-91699-2_8
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-91698-5
Online ISBN: 978-3-030-91699-2