1 Introduction

A privacy policy (PPol) is a document that informs data subjects of all the processes, treatments, and security measures that a company applies to their personal data [9]. However, studies from almost 20 years ago already showed that reading and understanding most of these documents requires college-level education [20]. Furthermore, PPols are extensive documents that demand considerable time to read [31]. With the enactment of personal data protection legislation, such as the GDPR (General Data Protection Regulation) [43] in European territory and the LGPD (Lei Geral de Proteção de Dados, in Portuguese - General Data Protection Law, in English) [8] in Brazil, PPols needed to evolve in order to eliminate gaps that were already known and also to meet legal requirements. One way to ensure that privacy policies are improving and comply with data protection laws is through algorithms for automated analysis, more specifically, a combination of natural language processing (NLP) and machine learning (ML) techniques [26].

Considerable effort has been put into extracting information from PPols. Different methods have been applied to identify and retrieve relevant keywords and text fragments [3]. The work presented in [45] steered PPol studies toward supervised ML. The development of corpora that map the concepts to be extracted from policies into categories classifiable through ML gained prominence, as no datasets with these characteristics existed until then.

Various approaches to evaluating PPols with respect to GDPR compliance have been studied [2, 14, 35]. All of them use the GDPR as the legal reference guiding the proposed models, and all were conducted on English-language documents. Thus, there is a gap regarding works that address other legal norms and other languages. There is an opportunity to develop research beyond English and the GDPR, in order to compare results across languages and laws and to pave the way for the adoption of automated privacy policy analysis that ensures compliance with data protection regulations.

To address this gap across languages and legal systems, we developed a mapping of the Brazilian data protection law (LGPD) [8] into machine-readable attributes. To this end, we built an annotation guideline that details how to apply our mapping to evaluate PPol clauses [33]. The guideline instantiates 3 large blocks of categories, totaling 27 annotation categories. Using the guideline, we assembled a corpus of 6431 annotated PPol clauses with 3 levels of compliance with the LGPD: compliant, partially non-compliant, and non-compliant. To the best of our knowledge, this is the first Portuguese corpus that evaluates PPols from the perspective of the LGPD. We evaluated the corpus using classical ML and deep learning models, applied to two tasks: identifying potentially non-compliant clauses and categorizing potentially non-compliant clauses.

The remainder of this paper is organized as follows. We present related work in Sect. 2. In Sect. 3, we depict the development of the guideline and the construction of our corpus. Section 4 describes our experimental analysis and evaluation methodology, and we discuss our results in Sect. 5. Finally, Sect. 6 presents our conclusions and suggests future directions for our work.

2 Related Work

Initially, work on PPols aimed to extract information from documents through rules and to build knowledge ontologies to evaluate new PPols. At this stage, the studies in [3, 5] capture information and keywords from PPols through lexical rules. The potential of ML to evaluate PPols began to emerge with the findings in [6, 47], which brought encouraging evidence of the effectiveness of crowdsourcing for annotating PPols. In [45], Wilson et al. present the OPP-115 corpus, a pioneer in PPol research.

After the OPP-115 corpus, many works focused on building and improving corpora and applying ML. In [16], the authors propose a formal model for data practices. The works in [11, 12, 46] study improvements to crowdsourcing in order to automate PPol clause analysis, and Zaeem et al. introduce a large PPol corpus with more than 200,000 PPols gathered from DMOZ [34]. Gebauer et al. propose a human-in-the-loop prototype combined with ML to annotate PPols [19]. PrivacyGLUE is introduced in [39] as a benchmark to measure general language understanding in the privacy domain. In [36], the authors present question-answering methods to retrieve information from privacy policies. Bhatia et al. introduce a way to identify incompleteness in privacy policies using semantic frames [7]. Kotal et al. show that techniques to extract knowledge from PPols fail when policy texts are ambiguous [24]. Yang et al. propose an approach for automatically extracting purpose-aware rules from PPols [49], while Kumar et al. present an approach to automatically extract and classify opt-out choices found in PPols [4]. In [32], the authors build a corpus from rule-based information extraction of PPols for Internet of Things apps. A proposal for an ML-based privacy score framework is presented in [23].

Another line of work that developed after OPP-115 was the assessment of PPols against data protection legislation, especially the GDPR. Methodologies that combine a mapping of the GDPR into categories with the creation of a corpus of privacy risk classes are addressed in [14, 41]. The inspiring work presented in [25] combines the development of a corpus with GDPR-guided classes and the application of ML to classify it. In [35], the authors compare Article 5 of the GDPR with OPP-115 annotations in order to validate the corpus' applicability. We also highlight the studies in [2, 18, 29, 37, 50], which rely on the GDPR to extract information from PPols, and the work of [28], which introduces a corpus to evaluate the structure of PPols for Android apps based on GDPR articles.

We highlight 4 more works in languages other than English. In [30], the authors built a knowledge base using language processing and unsupervised learning techniques on PPols in Italian. The work in [51] introduces a Chinese privacy dataset for sequence labeling tasks and compliance identification studies. An Arabic dataset for ML applications on PPols is presented in [1]. Finally, we highlight the work of Correia et al. [15], who developed a named entity recognizer for legal entities in Portuguese, based on texts from the Brazilian Supreme Court.

Our work differs from those reviewed above in three points: (i) we mapped a data protection law into categories attributable by ML models, using Brazilian legislation, unlike other works, which use European legislation or only linguistic or conceptual rules; (ii) we constructed a detailed guideline on mapping the LGPD and applying it to annotate raw texts, presenting a model that is reproducible for legislation other than the GDPR; (iii) we used texts in Portuguese, since almost all research is carried out on documents in English.

3 Annotation Guideline and Corpus Development

The work to create the guideline and build the corpus was based on the theory proposed in [21], in which Hovy et al. present 7 steps for the methodological annotation of a corpus: (i) selecting representative texts; (ii) instantiating the theory; (iii) choosing and training annotators; (iv) defining the annotation process; (v) designing the annotation interface; (vi) determining and applying evaluation metrics; (vii) corpus maintenance.

Our work was also inspired by the research developed in the Claudette project, which constructed a corpus relating the GDPR [43] to PPols and evaluated the dataset through ML models [13, 25].

Following the theoretical framework in [21], Sect. 3.1 details the theory applied to develop the guideline, while Sect. 3.2 describes the other corpus construction steps, such as text capture, the annotation interface and the preparation of annotators.

3.1 Annotation Guideline

We started the guideline work with the study and instantiation of the LGPD theory [8], the set of standards responsible for the protection of personal data in Brazilian territory. Our systematic analysis of the LGPD resulted in an initial version with more than 30 categories, separated into 3 large blocks (data omission, data processing, unclear language and general).

The guideline underwent 3 revisions, in which we included new categories, excluded some existing ones, and merged others into a single category (neutralization of the theory [21]). After the revisions, the final version contains 27 categories: the data omission block has 18, data processing has 7, and unclear language and general have one each. Table 1 presents all the categories. For the sake of space, we refer the reader to [33] for details on each category within the guideline.

Table 1. Categories defined in the Guideline [33]. They map the LGPD blocks into classes attributable by ML.

The conceptualization of the categories was accompanied by the definition of degrees of compliance for each of them. Following the work of [13, 25], we classify compliance into 3 levels: level 1 - compliance, level 2 - potential partial non-compliance, level 3 - potential total non-compliance. Specific details of the degree of compliance for each category are described in the guideline [33]. The guideline also indicates, with its legal basis, whether the data evaluated by a category must be present in a PPol. We included this information because its absence was reported in [25] as an obstacle to delimiting the task to be automated by the ML model. Figure 1 illustrates the Data correction category, with its compliance levels, examples in documents, and the column indicating whether it is mandatory in a PPol.

Fig. 1. Example detailing compliance levels for the Data correction category [33].

During the revisions, our team tagged 2 PPols to evaluate the instantiated theory after each change. This assessment aimed to check whether the categories were consistent with the sections of the LGPD they instantiated, to identify ambiguities between categories, and to find very specific categories that could be added. The entire guideline development process took 5 months to produce the initial version, and subsequent revisions took another 5 months; in total, we worked for 10 months to reach the final version of the guideline [33].

3.2 Corpus

After completing the guideline, we moved on to the remaining steps of the Hovy et al. methodology [21]. We collected documents from companies in different economic sectors, following criteria similar to those applied by Liepina et al. [25]. Our collection efforts resulted in a set of 75 PPols from different companies.

We processed the PPols to transform them into clauses in JSON format. This process took place in 3 stages, detailed in Fig. 2. First, we parsed each PPol text into sentences, splitting with a regular expression at the punctuation marks ? ! . ;. Second, we reviewed each parsed PPol to remove incorrectly split sentences. The last stage consisted of transforming each parsed PPol into text files (one file per sentence) and converting them to JSON files.

Fig. 2. Process of capturing and processing privacy policies. Source: own authorship.
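The first and third stages above can be sketched as follows. The exact regular expression and the JSON field names are our assumptions for illustration, not the paper's implementation:

```python
import json
import re

def split_policy(text: str) -> list[str]:
    # Stage 1 (sketch): break the policy at the sentence-final punctuation
    # marks the paper lists (? ! . ;), keeping the punctuation attached.
    parts = re.split(r"(?<=[?!.;])\s+", text.strip())
    # Drop empty fragments left by the split.
    return [p.strip() for p in parts if p.strip()]

def to_json(sentences: list[str]) -> str:
    # Stage 3 (sketch): one record per sentence, serialized as JSON.
    return json.dumps(
        [{"id": i, "text": s} for i, s in enumerate(sentences)],
        ensure_ascii=False,
    )

policy = "Coletamos seu e-mail. Usamos cookies; veja os detalhes! Dúvidas?"
sentences = split_policy(policy)
```

The second, manual review stage is needed precisely because a purely punctuation-based split mishandles abbreviations, list items, and URLs.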

With the PPol sentences in the appropriate format, we chose our annotation tool, Label Studio [42]. Within the tool, we configured the annotation interface according to the guideline categories (Sect. 3.1). Figure 3 shows an example of the interface for annotating a sentence. Annotation has two parts: choosing a category and, in sequence, determining the compliance level. Each sentence can be tagged with one or more categories. We also included two extra categories: “NA”, for sentences with no content to be analyzed under the guideline, and “Incorrect”, for sentences parsed with errors that were not caught during manual review. All sentences tagged with either of these categories were deleted from the final corpus.

Fig. 3. Label Studio annotation interface with an example annotation, showing categories and levels of compliance with the LGPD. The interface text is in Portuguese. Source: own authorship.

We selected and trained a group of 5 annotators, all undergraduate Law students at Universidade de São Paulo. The training consisted of a 1-hour online session explaining the use of the annotation interface and presenting the annotation guideline. We chose an annotation strategy similar to that recommended by Braun [10], which suggests assigning 2 or more annotators per sentence and a more senior expert to resolve disagreements. We started with 3 annotators per clause and weekly meetings to discuss disagreements and reach consensus; when a disagreement persisted, the most senior researcher on our team, a specialist in Consumer Law and Data Protection, made the final decision. After training, two annotators did not progress in their work and decided to leave the group. These withdrawals impacted our strategy, and we had to use only one annotator per sentence in order to meet the annotators' working deadline, a scenario that can introduce bias into the annotation process. Before the withdrawals, the annotators obtained a Fleiss' kappa \(\kappa <\) 0 (on 266 clauses, 4% of the corpus). This result may reflect a phenomenon discussed in Sect. 5.
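For reference, Fleiss' kappa over a multiply-annotated subset can be computed from a per-item category count matrix. The sketch below is a standard implementation of the statistic (statsmodels provides an equivalent `fleiss_kappa`), not the paper's code:

```python
import numpy as np

def fleiss_kappa(counts: np.ndarray) -> float:
    # counts[i, j] = number of annotators assigning item i to category j;
    # every row must sum to the same number of annotators n (3 in the paper).
    n = counts.sum(axis=1)[0]
    N = counts.shape[0]
    # Per-item observed agreement P_i and its mean.
    P_i = (np.square(counts).sum(axis=1) - n) / (n * (n - 1))
    P_bar = P_i.mean()
    # Chance agreement from the marginal category proportions.
    p_j = counts.sum(axis=0) / (N * n)
    P_e = np.square(p_j).sum()
    return (P_bar - P_e) / (1 - P_e)
```

A value below zero, as observed here, means the annotators agreed less often than chance would predict.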

The final corpusFootnote 1 took 10 months to complete and contains 6431 distinct sentences. Highlights include the categories Purpose of treatment, Category of processed data, and Data Sharing, which together correspond to almost 50% of the sentences. Around 22% of the sentences are potentially non-compliant, that is, they may conflict with the LGPD.

4 Corpus Evaluation

The evaluation of our corpus was inspired by the methodologies presented in [25, 27]. Following the reviewed literature, we divided the experimental work into two tasks: detecting potentially non-compliant sentences and categorizing potentially non-compliant sentences.

The detection task is a binary classification whose purpose is to identify sentences that are potentially non-compliant with the LGPD. The categorization task is also a binary classification, with one classifier per category, performed only on sentences with compliance level 3.

4.1 Feature Representation

For the detection task, we use a bag-of-words (BoW) representation with bigrams and trigrams (weighted with TF-IDF) and part-of-speech (POS) tags [38]. We use spaCyFootnote 2 and its POS tagger for Portuguese. We also use the standard input pipeline for BERT models, applying tokenization, padding, and other auxiliary operations [48]. We chose the BoW representation to evaluate whether lexical and grammatical information alone is sufficient to identify non-compliant sentences; this also allows comparison with the results in [25, 27]. The BERT representation contributes semantic information to the classification task.
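A minimal sketch of the BoW side of this representation with scikit-learn follows. The toy clauses are illustrative, not from the corpus, and the paper's POS features (from spaCy's Portuguese model, which requires a separate download) would be concatenated as additional dimensions:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Illustrative Portuguese clauses (not from the corpus).
clauses = [
    "coletamos dados pessoais para fins de publicidade",
    "compartilhamos dados pessoais com terceiros",
    "voce pode corrigir seus dados a qualquer momento",
]

# Bigrams and trigrams weighted by TF-IDF, as described in Sect. 4.1.
vectorizer = TfidfVectorizer(ngram_range=(2, 3))
X = vectorizer.fit_transform(clauses)
```

The resulting sparse matrix `X` is what the SVM receives as input.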

For the categorization task, we employ only the BoW approach, with POS tags and bigrams and trigrams. Thus, we consider only lexical and grammatical information to categorize sentences, similar to what was proposed in [27].

4.2 Machine Learning Algorithms

For the detection task, we use two classifiers: a support vector machine (SVM) [44] and a fine-tuned BERT model pre-trained in Portuguese, known as BERTimbau [40]. The SVM receives the BoW representation as input, while the BERT model uses the preprocessing and representation described in Sect. 4.1. We chose SVM to enable comparison with the results in [25], even though the model chosen by those authors is a modified variant of SVM. The BERT model was our second choice, intended to evaluate how much improvement a model close to the state of the art can bring to the task. In this scenario, all sentences with compliance level 3 are assigned the positive label, and the remaining sentences, with compliance levels 1 and 2, the negative label.

The second task, categorizing potentially non-compliant sentences, was carried out with the SVM + BoW configuration. We train one model per compliance category in the guideline; in each, a single category is marked as the positive label and all others as negative. Only sentences with compliance level 3 are included in this task.
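The one-vs-rest label construction described above can be sketched as follows. The sentence texts and category names are illustrative:

```python
def one_vs_rest_labels(sentences, target):
    # sentences: (text, categories) pairs, all at compliance level 3.
    # For the classifier of `target`, that category is the positive
    # label (1) and every other category is negative (0).
    texts = [text for text, _ in sentences]
    labels = [1 if target in cats else 0 for _, cats in sentences]
    return texts, labels

level3 = [
    ("compartilhamos seus dados com parceiros comerciais", {"Data Sharing"}),
    ("usamos seus dados para enviar publicidade", {"Advertising"}),
    ("coletamos dados de navegacao", {"Category of processed data"}),
]
X, y = one_vs_rest_labels(level3, "Data Sharing")
```

Repeating this for every guideline category yields the per-category binary training sets.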

To address the class imbalance in our corpus, we adopted an algorithm-level approach for the SVM: cost-sensitive learning, with higher weights for minority classes. The models' performance was evaluated with the F1-score, precision, and recall metrics [38], selected so that we can compare against the detection results in [25] and the categorization results in [27].

5 Results and Discussion

The experimentsFootnote 3 for the SVM detection task were performed by splitting the data into 90% for training and 10% for testing. We trained 10 models with 5-fold cross-validation, each with a different seed for the dataset split, and performed a grid search for the C parameter (search space = [0.05, 0.1, 0.25, 0.5, 1.0, 2.0, 5.0, 10.0, 50.0]) with a linear kernel, class weight = ‘balanced’, and a fixed seed for the models. Test-set performances were averaged.
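A single run of this setup can be sketched with scikit-learn as follows. The toy data is ours, and the paper averages 10 such runs with different split seeds:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Toy stand-ins: 1 = potentially non-compliant (level 3), 0 = levels 1-2.
texts = (["compartilhamos dados sem o seu consentimento"] * 6
         + ["voce pode acessar seus dados pessoais"] * 6)
labels = [1] * 6 + [0] * 6

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.1, random_state=0, stratify=labels)

pipe = make_pipeline(
    TfidfVectorizer(ngram_range=(2, 3)),
    # class_weight="balanced" implements the cost-sensitive learning
    # described in Sect. 4.2.
    SVC(kernel="linear", class_weight="balanced"),
)
grid = GridSearchCV(
    pipe,
    {"svc__C": [0.05, 0.1, 0.25, 0.5, 1.0, 2.0, 5.0, 10.0, 50.0]},
    cv=5,
    scoring="f1",
)
grid.fit(X_train, y_train)
```

After fitting, `grid.best_params_` holds the selected C, and `grid.predict(X_test)` gives the test-set predictions used to compute F1, precision, and recall.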

The BERT model used a single dataset split due to its computational cost. We split the data into 80% for training, 10% for validation, and 10% for testing, and trained for 4 epochs, following the recommendation in [17], with batch size = 8 and the Adam optimizer (learning rate = \(2\times 10^{-5}\), \(\epsilon = 1\times 10^{-8}\)). All metrics were measured on the test set.

Following the evaluation methodology of [25], we added two baseline classifiers: one that always returns the class of interest, called Always positive, and a random classifier whose probability for the class of interest matches its proportion in the training set, around 22% (Sect. 3.2). Table 2 presents the metrics obtained for the potentially non-compliant sentence detection task.

Table 2. Performance of the models for the task of detecting potentially non-compliant sentences.
Table 3. Performance of each classifier on the test set for non-compliant sentence categorization. The proportion column refers to the percentage of non-compliant sentences in the category relative to all non-compliant sentences (compliance level 3). The category blocks (omission of data required by law, data processing, and unclear language) are separated by colors; the data processing block is in gray.

The BERT model outperforms all others, an encouraging result. We noticed that the Always positive baseline performed very close to the SVM, which led us to some hypotheses. Since the SVM exploits a lexical and grammatical representation, PPols may contain sentences that require higher linguistic complexity to be understood and become separable. The BERT model exploits a representation with semantic and contextual information [22] and performs considerably better (around 42% higher). This is consistent with [25], where the authors report difficulties with sentences that need the PPol's context. When comparing with online terms of service (ToS), studied in [27] (F1 = 0.769 for SVM + BoW), we see that the lexical and grammatical BoW representations for ToS yield results higher than ours and than those found in [25] (F1 = 0.42\(\,-\,\)0.55). This scenario may reinforce the idea of the greater complexity and difficulty of analyzing PPols, a fact reported in [20]. In line with this analysis, we manually studied some sentences annotated with more than one category and with different compliance levels. For example, the sentence “Você reconhece e concorda que poderemos lhe enviar informações e avisos importantes referentes à sua conta e ao Site da comunidade por e-mail, mensagem de texto ou outros meios, com base nas informações que você nos forneceu.” (in English, “You acknowledge and agree that we may send you important information and notices regarding your account and the Community Site by email, text message or other means based on the information you have provided to us.”) was tagged with the category Purpose of treatment at compliance level 1 and with the categories Category of processed data and Advertising at compliance level 3. The tagged categories are in accordance with the guideline definitions.
Therefore, a sentence can be compliant or non-compliant depending on the category. Future work could separate the categories into blocks before applying classification, similar to what was done in [25]. Even so, this finding reinforces the hypothesis of the complexity of the sentences found in PPols.

A second hypothesis relates to bias that the annotation process may have introduced into the corpus. This is a limitation of our work: only one annotator tagged each sentence and may have made subjective choices not grounded in the guideline.

For the second task, categorizing non-compliant sentences, we performed only a randomized grid search with 5-fold cross-validation for each category. We assigned the positive label to one category and the negative label to all others, using the same hyperparameters as in the detection task. In total, we trained 26 SVM classifiers. Table 3 displays the results. The categorization task performed below the task of identifying potentially non-compliant sentences. The category Other consents shows no results because it has no sentences with compliance level 3. The work carried out in [27], with ToS, had the opposite finding: their categorization task performed better. Compared with the literature, we interpret our findings as a further indication of the complexity of PPol sentences, as hypothesized for the detection task.

Two more findings caught our attention. First, higher F1-scores are associated with categories with a higher proportion of sentences in the dataset (Category of processed data, 3rd party sharing, and Purpose of treatment). We read this as an indication of the need to expand the number of sentences in our corpus to improve its quality. Second, recall is high regardless of the category's proportion in the dataset: for example, the category “Take it or leave it” corresponds to only 1.6% of the sentences in the categorization task yet achieved 100% recall. In other words, the BoW representation, exploiting lexical and grammatical information, is sufficient for the SVM to identify a category correctly; the problem becomes excessive false positives, probably because the sentence representation introduces ambiguities to the models. A future path of research is to study the classifiers' false positives and identify their causes, in order to evolve the representation of the PPol information that the algorithms receive as input.

6 Conclusions and Future Work

In this work, we introduced a corpus of more than 6,400 PPol sentences annotated with a standardized guideline that maps the LGPD into machine-readable categories. We defined 27 categories, separated into 3 blocks, that can qualify a sentence, along with 3 levels of compliance.

We evaluated our corpus on two classification tasks, in order to verify the feasibility of analyzing PPol sentences automatically. The first task was the detection of potentially non-compliant sentences, using two approaches: SVM + BoW, exploiting lexical and grammatical information, and a BERT model, which incorporates semantic information. The SVM + BoW model did not outperform our baselines, while the BERT model presented an encouraging result, close to that found in [25]. We interpret these results as an indication that it is possible to detect PPol sentences potentially non-compliant with the LGPD. Regarding the categorization of potentially non-compliant sentences, our results point to the need for more advanced sentence representations and models closer to the state of the art to reach performances comparable to the literature. We also highlight the complexity of PPol texts in Portuguese revealed by our work. We believe our findings contribute to the improvement of NLP research applied to Digital Law in Portuguese.

As future work, we intend to expand the size of our corpus in order to increase the number of examples per category; a larger volume of data has the potential to improve the performance of the tested models. A second activity is to improve the annotation process by employing more annotators per sentence, following the initial strategy we proposed, as recommended in [10]. Finally, we wish to apply the BERT model to categorize potentially non-compliant clauses and to investigate the causes of the excess false positives found in our studies.