1 Introduction

A privacy policy (PPol) is a document that informs data subjects of all the processes, treatments, and security measures that a company applies to their personal data [9]. However, studies from almost 20 years ago already showed that reading and understanding most of these documents requires college-level education [20]. Furthermore, PPols are extensive documents that demand considerable time to read [31]. With the enactment of personal data protection legislation, such as the GDPR (General Data Protection Regulation) [43] in European territory and the LGPD (Lei Geral de Proteção de Dados, in Portuguese - General Data Protection Law, in English) [8] in Brazil, PPols needed to evolve in order to eliminate gaps that were already known and also to meet legal requirements. One way to ensure that privacy policies are improving and comply with data protection laws is through algorithms for automated analysis, more specifically, a combination of natural language processing (NLP) and machine learning (ML) techniques [26].

Considerable effort has been put into extracting information from PPols. Different methods have been applied to identify and retrieve relevant keywords and text fragments [3]. The work presented in [45] steered PPol studies toward supervised ML. The development of corpora that map the concepts to be extracted from policies into categories classifiable through ML gained prominence, as no datasets with these characteristics existed until then.

Various approaches to evaluating PPols with respect to GDPR compliance have been studied [2, 14, 35]. All of them use the GDPR as the legal reference guiding the proposed models, and all were conducted on English-language documents. Thus, there is a gap regarding works that address other legal norms and other languages. There is an opportunity to develop research beyond English and the GDPR, in order to compare results across languages and laws and to pave the way for the adoption of automated privacy policy analysis that ensures compliance with data protection regulations.

To address this gap across languages and legal systems, we developed a mapping of the Brazilian data protection law (LGPD) [8] into machine-readable attributes. To this end, we built an annotation guideline that details how to apply our mapping to evaluate PPol clauses [33]. The guideline instantiates 3 large blocks of categories, totaling 27 annotation categories. Using the guideline, we assembled a corpus of 6431 annotated PPol clauses with 3 levels of compliance with the LGPD: compliant, partially non-compliant, and non-compliant. To the best of our knowledge, this is the first Portuguese corpus that evaluates PPols from the perspective of the LGPD. We evaluated the corpus using classical ML and deep learning models, applied to two tasks: identifying potentially non-compliant clauses and categorizing potentially non-compliant clauses.

The remainder of this paper is organized as follows. We present related work in Sect. 2. In Sect. 3, we depict the development of the guideline and the construction of our corpus. Section 4 describes our experimental analysis and evaluation methodology, and we discuss our results in Sect. 5. Finally, Sect. 6 presents our conclusions and suggests future directions for our work.

2 Related Work

Initially, work on PPols aimed to extract information from documents through rules and to build knowledge ontologies to evaluate new PPols. At this stage, the studies in [3, 5] capture information and keywords from PPols through lexical rules. The potential of ML to evaluate PPols began to emerge with the findings in [6, 47], which brought encouraging evidence of the effectiveness of crowdsourcing for annotating PPols. In [45], Wilson et al. present the OPP-115 corpus, a pioneer in PPol research.

After the OPP-115 corpus, many works focused on building and improving corpora and applying ML. In [16], the authors propose a formal model for data practices. The works in [11, 12, 46] study improvements to crowdsourcing in order to automate PPol clause analysis, and Zaeem et al. introduce a large PPol corpus with more than 200,000 PPols gathered from DMOZ [34]. Gebauer et al. propose a human-in-the-loop prototype combined with ML to annotate PPols [19]. PrivacyGLUE is introduced in [39] as a benchmark to measure general language understanding in the privacy domain. In [36], the authors present question-answering methods to retrieve information from privacy policies. Bhatia et al. introduce a way to identify incompleteness in privacy policies using semantic frames [7]. Kotal et al. show that techniques to extract knowledge from PPols fail when policy texts are ambiguous [24]. Yang et al. propose an approach for automatically extracting purpose-aware rules from PPols [49], while Kumar et al. present an approach to automatically extract and classify opt-out choices found in PPols [4]. In [32], the authors build a corpus from rule-based information extraction of PPols for Internet of Things apps. A proposal for an ML-based privacy score framework is presented in [23].

Another line of work that developed after OPP-115 was the assessment of PPols against data protection legislation, especially the GDPR. Methodologies that combine a mapping of the GDPR into categories with the creation of a corpus of privacy risk classes are addressed in [14, 41]. The inspiring work presented in [25] combines the development of a corpus with GDPR-guided classes and the application of ML to classify it. In [35], the authors compare Article 5 of the GDPR with OPP-115 annotations in order to validate the corpus' applicability. We also highlight the studies in [2, 18, 29, 37, 50], which rely on the GDPR to extract information from PPols, and the work of [28], which introduces a corpus to evaluate the structure of PPols for Android apps based on GDPR articles.

We highlight 4 more works in languages other than English. In [30], the authors built a knowledge base using language processing and unsupervised learning techniques on PPols in Italian. The work in [51] introduces a Chinese privacy dataset for sequence labeling tasks and compliance identification studies. An Arabic dataset for ML applications on PPols is presented in [1]. Finally, we highlight the work of Correia et al. [15], who developed a named entity recognizer for legal entities in Portuguese, based on texts from the Brazilian Supreme Court.

Our work differs from those reviewed above in three points: (i) we mapped a data protection law into categories attributable by ML models, using Brazilian legislation, unlike other works, which use European legislation or only linguistic or conceptual rules; (ii) we constructed a detailed guideline on mapping the LGPD and applying it to annotate raw texts, presenting a model that is reproducible for legislation other than the GDPR; (iii) we used texts in Portuguese, since almost all research is carried out on documents in English.

3 Annotation Guideline and Corpus Development

The work to create the guideline and build the corpus was based on the theory proposed in [21], in which Hovy et al. present 7 steps for the methodological annotation of a corpus: (i) selecting representative texts; (ii) instantiating the theory; (iii) choosing and training annotators; (iv) defining the annotation process; (v) designing the annotation interface; (vi) determining and applying evaluation metrics; (vii) corpus maintenance.

Our work was also inspired by the research developed in the Claudette project, which constructed a corpus relating the GDPR [43] to PPols and evaluated the dataset through ML models [13, 25].

Following the theoretical framework in [21], Sect. 3.1 details the theory applied to develop the guideline, while Sect. 3.2 describes the other corpus construction steps, such as text capture, the annotation interface and the preparation of annotators.

3.1 Annotation Guideline

We started the guideline work with the study and instantiation of the LGPD theory [8], the set of standards responsible for the protection of personal data in Brazilian territory. Our systematic analysis of the LGPD resulted in an initial version with more than 30 categories, separated into 3 large blocks (data omission, data processing, unclear language and general).

The guideline underwent 3 revisions, in which we included new categories, excluded some existing ones, and merged others into a single category (neutralization of the theory [21]). After the revisions, the final version contains 27 categories: the data omission block has 18, data processing has 7, and unclear language and general have one each. Table 1 presents all the categories. For the sake of space, we refer the reader to [33] for details on each category within the guideline.

Table 1. Categories defined in the Guideline [33]. They map the LGPD blocks into classes attributable by ML.

The conceptualization of the categories was accompanied by the definition of degrees of compliance for each of them. Following the work of [13, 25], we classify compliance into 3 levels: level 1 - compliance, level 2 - potential partial non-compliance, level 3 - potential total non-compliance. Specific details of the degree of compliance for each category are described in the guideline [33]. The guideline also indicates, with its legal basis, whether the data evaluated by a category must be present in a PPol. We included this information because its absence was reported in [25] as an obstacle to delimiting the task to be automated by the ML model. Figure 1 illustrates the Data correction category, with its compliance levels, examples in documents, and the column indicating whether it is mandatory in a PPol.

Fig. 1. Example detailing compliance levels for the Data correction category [33].

During the revisions, our team tagged 2 PPols to evaluate the instantiated theory after each change. This assessment aimed to check whether the categories were consistent with the sections of the LGPD they instantiated, to identify ambiguities between categories, and to find very specific categories that could be added. The entire guideline development process took 5 months to produce the initial version, and subsequent revisions took another 5 months; in total, we worked for 10 months to reach the final version of the guideline [33].

3.2 Corpus

After completing the guideline, we moved on to the remaining steps of the Hovy et al. methodology [21]. We collected documents from companies in different economic sectors, following criteria similar to those applied by Liepina et al. [25]. Our collection efforts resulted in a set of 75 PPols from different companies.

We processed the PPols to transform them into clauses in JSON format. This process took place in 3 stages, detailed in Fig. 2. First, we parsed each PPol text into sentences, splitting with a regular expression at the punctuation marks ? ! . ;. Second, we reviewed each parsed PPol to remove incorrectly split sentences. The last stage consisted of transforming each parsed PPol into text files (one file per sentence) and converting them to JSON files.

Fig. 2. Process of capturing and processing privacy policies. Source: own authorship.
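The first and third stages above can be sketched as follows. The exact regular expression and the JSON field names are our assumptions for illustration, not the paper's implementation:

```python
import json
import re

def split_policy(text: str) -> list[str]:
    # Stage 1 (sketch): break the policy at the sentence-final punctuation
    # marks the paper lists (? ! . ;), keeping the punctuation attached.
    parts = re.split(r"(?<=[?!.;])\s+", text.strip())
    # Drop empty fragments left by the split.
    return [p.strip() for p in parts if p.strip()]

def to_json(sentences: list[str]) -> str:
    # Stage 3 (sketch): one record per sentence, serialized as JSON.
    return json.dumps(
        [{"id": i, "text": s} for i, s in enumerate(sentences)],
        ensure_ascii=False,
    )

policy = "Coletamos seu e-mail. Usamos cookies; veja os detalhes! Dúvidas?"
sentences = split_policy(policy)
```

The second, manual review stage is needed precisely because a purely punctuation-based split mishandles abbreviations, list items, and URLs.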

With the PPol sentences in the appropriate format, we chose our annotation tool, Label Studio [42]. Within the tool, we configured the annotation interface according to the guideline categories (Sect. 3.1). Figure 3 shows an example of the interface for annotating a sentence. Annotation has two parts: choosing a category and, in sequence, determining the compliance level. Each sentence can be tagged with one or more categories. We also included two extra categories: “NA”, for sentences with no content to be analyzed under the guideline, and “Incorrect”, for sentences parsed with errors that were not caught during manual review. All sentences tagged with either of these categories were deleted from the final corpus.

Fig. 3. Label Studio annotation interface with an example annotation, showing categories and levels of compliance with the LGPD. The interface text is in Portuguese. Source: own authorship.

We selected and trained a group of 5 annotators, all undergraduate Law students at Universidade de São Paulo. The training consisted of a 1-hour online session explaining the use of the annotation interface and presenting the annotation guideline. We chose an annotation strategy similar to that recommended by Braun [10], which suggests assigning 2 or more annotators per sentence and a more senior expert to resolve disagreements. We started with 3 annotators per clause and weekly meetings to discuss disagreements and reach consensus; when a disagreement persisted, the most senior researcher on our team, a specialist in Consumer Law and Data Protection, made the final decision. After training, two annotators did not progress in their work and decided to leave the group. These withdrawals impacted our strategy, and we had to use only one annotator per sentence in order to meet the annotators' working deadline, a scenario that can introduce bias into the annotation process. Before the withdrawals, the annotators obtained a Fleiss' kappa \(\kappa <\) 0 (on 266 clauses, 4% of the corpus). This result may reflect a phenomenon discussed in Sect. 5.
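For reference, Fleiss' kappa over a multiply-annotated subset can be computed from a per-item category count matrix. The sketch below is a standard implementation of the statistic (statsmodels provides an equivalent `fleiss_kappa`), not the paper's code:

```python
import numpy as np

def fleiss_kappa(counts: np.ndarray) -> float:
    # counts[i, j] = number of annotators assigning item i to category j;
    # every row must sum to the same number of annotators n (3 in the paper).
    n = counts.sum(axis=1)[0]
    N = counts.shape[0]
    # Per-item observed agreement P_i and its mean.
    P_i = (np.square(counts).sum(axis=1) - n) / (n * (n - 1))
    P_bar = P_i.mean()
    # Chance agreement from the marginal category proportions.
    p_j = counts.sum(axis=0) / (N * n)
    P_e = np.square(p_j).sum()
    return (P_bar - P_e) / (1 - P_e)
```

A value below zero, as observed here, means the annotators agreed less often than chance would predict.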

The final corpusFootnote 1 took 10 months to complete and contains 6431 distinct sentences. Highlights include the categories Purpose of treatment, Category of processed data, and Data Sharing, which together correspond to almost 50% of the sentences. Around 22% of the sentences are potentially non-compliant, that is, they may conflict with the LGPD.

4 Corpus Evaluation

The evaluation of our corpus was inspired by the methodologies presented in [25, 27]. Following the reviewed literature, we divided the experimental work into two tasks: detecting potentially non-compliant sentences and categorizing potentially non-compliant sentences.

The detection task is a binary classification whose purpose is to identify sentences that are potentially non-compliant with the LGPD. The categorization task is also a binary classification, with one classifier per category, performed only on sentences with compliance level 3.

4.1 Feature Representation

For the detection task, we use a bag-of-words (BoW) representation with bigrams and trigrams (weighted with TF-IDF) and part-of-speech (POS) tags [38]. We use spaCyFootnote 2 and its POS tagger for Portuguese. We also use the standard input pipeline for BERT models, applying tokenization, padding, and other auxiliary operations [48]. We chose the BoW representation to evaluate whether lexical and grammatical information alone is sufficient to identify non-compliant sentences; this also allows comparison with the results in [25, 27]. The BERT representation contributes semantic information to the classification task.
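A minimal sketch of the BoW side of this representation with scikit-learn follows. The toy clauses are illustrative, not from the corpus, and the paper's POS features (from spaCy's Portuguese model, which requires a separate download) would be concatenated as additional dimensions:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Illustrative Portuguese clauses (not from the corpus).
clauses = [
    "coletamos dados pessoais para fins de publicidade",
    "compartilhamos dados pessoais com terceiros",
    "voce pode corrigir seus dados a qualquer momento",
]

# Bigrams and trigrams weighted by TF-IDF, as described in Sect. 4.1.
vectorizer = TfidfVectorizer(ngram_range=(2, 3))
X = vectorizer.fit_transform(clauses)
```

The resulting sparse matrix `X` is what the SVM receives as input.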

For the categorization task, we employ only the BoW approach, with POS tags and bigrams and trigrams. Thus, we consider only lexical and grammatical information to categorize sentences, similar to what was proposed in [27].

4.2 Machine Learning Algorithms

For the detection task, we use two classifiers: a support vector machine (SVM) [44] and a fine-tuned BERT model pre-trained in Portuguese, known as BERTimbau [40]. The SVM receives the BoW representation as input, while the BERT model uses the preprocessing and representation described in Sect. 4.1. We chose SVM to enable comparison with the results in [25], even though the model chosen by those authors is a modified variant of SVM. The BERT model was our second choice, intended to evaluate how much improvement a model close to the state of the art can bring to the task. In this scenario, all sentences with compliance level 3 are assigned the positive label, and the remaining sentences, with compliance levels 1 and 2, the negative label.

The second task, categorizing potentially non-compliant sentences, was carried out with the SVM + BoW configuration. We train one model per compliance category in the guideline; in each, a single category is marked as the positive label and all others as negative. Only sentences with compliance level 3 are included in this task.
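The one-vs-rest label construction described above can be sketched as follows. The sentence texts and category names are illustrative:

```python
def one_vs_rest_labels(sentences, target):
    # sentences: (text, categories) pairs, all at compliance level 3.
    # For the classifier of `target`, that category is the positive
    # label (1) and every other category is negative (0).
    texts = [text for text, _ in sentences]
    labels = [1 if target in cats else 0 for _, cats in sentences]
    return texts, labels

level3 = [
    ("compartilhamos seus dados com parceiros comerciais", {"Data Sharing"}),
    ("usamos seus dados para enviar publicidade", {"Advertising"}),
    ("coletamos dados de navegacao", {"Category of processed data"}),
]
X, y = one_vs_rest_labels(level3, "Data Sharing")
```

Repeating this for every guideline category yields the per-category binary training sets.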

To address the class imbalance in our corpus, we adopted an algorithm-level approach for the SVM: cost-sensitive learning, with higher weights for minority classes. The models' performance was evaluated with the F1-score, precision, and recall metrics [38], selected so that we can compare against the detection results in [25] and the categorization results in [27].

5 Results and Discussion

The experimentsFootnote 3 for the SVM detection task were performed by splitting the data into 90% for training and 10% for testing. We trained 10 models with 5-fold cross-validation, each with a different seed for the dataset split, and performed a grid search for the C parameter (search space = [0.05, 0.1, 0.25, 0.5, 1.0, 2.0, 5.0, 10.0, 50.0]) with a linear kernel, class weight = ‘balanced’, and a fixed seed for the models. Test-set performances were averaged.
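A single run of this setup can be sketched with scikit-learn as follows. The toy data is ours, and the paper averages 10 such runs with different split seeds:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Toy stand-ins: 1 = potentially non-compliant (level 3), 0 = levels 1-2.
texts = (["compartilhamos dados sem o seu consentimento"] * 6
         + ["voce pode acessar seus dados pessoais"] * 6)
labels = [1] * 6 + [0] * 6

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.1, random_state=0, stratify=labels)

pipe = make_pipeline(
    TfidfVectorizer(ngram_range=(2, 3)),
    # class_weight="balanced" implements the cost-sensitive learning
    # described in Sect. 4.2.
    SVC(kernel="linear", class_weight="balanced"),
)
grid = GridSearchCV(
    pipe,
    {"svc__C": [0.05, 0.1, 0.25, 0.5, 1.0, 2.0, 5.0, 10.0, 50.0]},
    cv=5,
    scoring="f1",
)
grid.fit(X_train, y_train)
```

After fitting, `grid.best_params_` holds the selected C, and `grid.predict(X_test)` gives the test-set predictions used to compute F1, precision, and recall.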

The BERT model used a single dataset split due to its computational cost. We split the data into 80% for training, 10% for validation, and 10% for testing, and trained for 4 epochs, following the recommendation in [17], with batch size = 8 and the Adam optimizer (learning rate = \(2\times 10^{-5}\), \(\epsilon = 1\times 10^{-8}\)). All metrics were measured on the test set.

Following the evaluation methodology of [25], we added two baseline classifiers: one that always returns the class of interest, called Always positive, and a random classifier whose probability for the class of interest matches its proportion in the training set, around 22% (Sect. 3.2). Table 2 presents the metrics obtained for the potentially non-compliant sentence detection task.

Table 2. Performance of the models for the task of detecting potentially non-compliant sentences.
Table 3. Performance of each classifier on the test set for non-compliant sentence categorization. The proportion column refers to the percentage of non-compliant sentences in the category relative to all non-compliant sentences (compliance level 3). The category blocks (omission of data required by law, data processing, and unclear language) are separated by colors; the data processing block is in gray.

The BERT model outperforms all others, an encouraging result. We noticed that the Always positive baseline performed very close to the SVM, which led us to some hypotheses. Since the SVM exploits a lexical and grammatical representation, PPols may contain sentences that require higher linguistic complexity to be understood and become separable. The BERT model exploits a representation with semantic and contextual information [22] and performs considerably better (around 42% higher). This is consistent with [25], where the authors report difficulties with sentences that need the PPol's context. When comparing with online terms of service (ToS), studied in [27] (F1 = 0.769 for SVM + BoW), we see that the lexical and grammatical BoW representations for ToS yield results higher than ours and than those found in [25] (F1 = 0.42\(\,-\,\)0.55). This scenario may reinforce the idea of the greater complexity and difficulty of analyzing PPols, a fact reported in [20]. In line with this analysis, we manually studied some sentences annotated with more than one category and with different compliance levels. For example, the sentence “Você reconhece e concorda que poderemos lhe enviar informações e avisos importantes referentes à sua conta e ao Site da comunidade por e-mail, mensagem de texto ou outros meios, com base nas informações que você nos forneceu.” (in English, “You acknowledge and agree that we may send you important information and notices regarding your account and the Community Site by email, text message or other means based on the information you have provided to us.”) was tagged with the category Purpose of treatment at compliance level 1 and with the categories Category of processed data and Advertising at compliance level 3. The tagged categories are in accordance with the guideline definitions.
Therefore, a sentence can be compliant or non-compliant depending on the category. Future work could separate the categories into blocks before applying classification, similar to what was done in [25]. Even so, this finding reinforces the hypothesis of the complexity of the sentences found in PPols.

A second hypothesis relates to bias that the annotation process may have introduced into the corpus. This is a limitation of our work: only one annotator tagged each sentence and may have made subjective choices not grounded in the guideline.

For the second task, categorizing non-compliant sentences, we performed only a randomized grid search with 5-fold cross-validation for each category. We assigned the positive label to one category and the negative label to all others, using the same hyperparameters as in the detection task. In total, we trained 26 SVM classifiers. Table 3 displays the results. The categorization task performed below the task of identifying potentially non-compliant sentences. The category Other consents shows no results because it has no sentences with compliance level 3. The work carried out in [27], with ToS, had the opposite finding: their categorization task performed better. Compared with the literature, we interpret our findings as a further indication of the complexity of PPol sentences, as hypothesized for the detection task.

Two more findings caught our attention. First, higher F1-scores are associated with categories with a higher proportion of sentences in the dataset (Category of processed data, 3rd party sharing, and Purpose of treatment). We read this as an indication of the need to expand the number of sentences in our corpus to improve its quality. Second, recall is high regardless of the category's proportion in the dataset: for example, the category “Take it or leave it” corresponds to only 1.6% of the sentences in the categorization task yet achieved 100% recall. In other words, the BoW representation, exploiting lexical and grammatical information, is sufficient for the SVM to identify a category correctly; the problem becomes excessive false positives, probably because the sentence representation introduces ambiguities to the models. A future path of research is to study the classifiers' false positives and identify their causes, in order to evolve the representation of the PPol information that the algorithms receive as input.

6 Conclusions and Future Work

In this work, we introduced a corpus of more than 6,400 PPol sentences annotated with a standardized guideline that maps the LGPD into machine-readable categories. We defined 27 categories, separated into 3 blocks, that can qualify a sentence, along with 3 levels of compliance.

We evaluated our corpus on two classification tasks, in order to verify the feasibility of analyzing PPol sentences automatically. The first task was the detection of potentially non-compliant sentences, using two approaches: SVM + BoW, exploiting lexical and grammatical information, and a BERT model, which incorporates semantic information. The SVM + BoW model did not outperform our baselines, while the BERT model presented an encouraging result, close to that found in [25]. We interpret these results as an indication that it is possible to detect PPol sentences potentially non-compliant with the LGPD. Regarding the categorization of potentially non-compliant sentences, our results point to the need for more advanced sentence representations and models closer to the state of the art to reach performances comparable to the literature. We also highlight the complexity of PPol texts in Portuguese revealed by our work. We believe our findings contribute to the improvement of NLP research applied to Digital Law in Portuguese.

As future work, we intend to expand the size of our corpus in order to increase the number of examples per category; a larger volume of data has the potential to improve the performance of the tested models. A second activity is to improve the annotation process by employing more annotators per sentence, following the initial strategy we proposed, as recommended in [10]. Finally, we wish to apply the BERT model to categorize potentially non-compliant clauses and to investigate the causes of the excess false positives found in our studies.