1 Introduction

Most modern natural language processing (NLP) methodologies, including Large Language Models (LLMs), rely on extensive text corpora for precise training and weight adaptation [18]. Large-scale training corpora, or pre-training corpora, are fundamental for developing foundational models, which serve as the basis for numerous task-specific adaptations [7]. State-of-the-art LLM training pipelines utilize various types of datasets: (i) pre-training corpora to acquire language structure, syntax, and semantics; (ii) instruction fine-tuning datasets to enhance the model’s capability to follow instructions; (iii) preference datasets to rank responses; and (iv) evaluation datasets to measure model performance [23].

Recent research shows that the size and diversity of pre-training corpora significantly impact LLM performance [14, 18]. Most pre-training datasets are available in English and Chinese, which are high-resource languages, while other languages have significantly fewer tokens [23]. Although multilingual corpora can help mitigate data scarcity for low-resource languages, these datasets are often unbalanced, favoring high-resource languages [17]. This imbalance affects the performance of multilingual models for less-represented languages, and models trained on multilingual corpora do not perform as well as those trained on monolingual ones [34]. Therefore, it is essential to train or fine-tune models in the target languages to capture linguistic nuances, structures, and domain-specific or cultural knowledge [28].

A direct implication of this scenario is the necessity of making high-quality plain-text corpora available to encourage research on language-specific models and the development of better-performing approaches. Therefore, we introduce Aroeira: a curated Portuguese language-specific corpus composed of approximately 100 GB of text. The content was extracted from recent Common Crawl (CC) web pages (up until 2023) and fully curated to remove web tags, ensuring quality and bias filtering. To the best of our knowledge, Aroeira is the largest highly-curated Portuguese corpus available to date. It has the potential to influence new instruction fine-tuning and evaluation dataset studies while guiding the development of preference datasets and large models.

Aroeira was created based on a double-pipeline inspired by [30]. The pipeline comprises two key steps: data quality management and content safety assurance. These steps ensure the size and quality necessary for a Portuguese corpus to train safe LLMs effectively. As part of the content safety step, we investigated techniques for filtering hazardous content and mitigating biases in our corpora [16]. This effort resulted in a custom Portuguese word dictionary, which encompasses offensive words, as well as terms, expressions, and phrases that include sexism, homophobia, ableism, racism, hate speech, and political, religious, and regional prejudice [13, 24, 26].

We highlight our main contributions:

  • Introduction of Aroeira, a 100 GB Portuguese corpus from diverse internet sources. Our dataset surpasses the largest currently available corpus for training language models in Portuguese in terms of size, quality, and representativeness.

  • Development of a parameterizable double-pipeline, which includes: downloading, extracting, language identification, quality filtering, and text storage in the data step; filtering sexual content, toxic data, and bias in the content safety step.

  • Creation of a dictionary to filter biased terms and mitigate social bias in the Portuguese language.

This paper is organized as follows. In Sect. 2, we present related work on corpus extraction. In Sects. 3 and 4, we describe the methodology for generating the corpus and the configuration of hyperparameters used in the quality filters, respectively. In Sect. 5, we analyze the volumetry of Aroeira in terms of year distribution, knowledge domains, document length, and other relevant results. Finally, Sect. 6 presents conclusions and future work.

2 Related Work

The largest Portuguese language corpus is BrWac [35], which has approximately 25 GB of textual data distributed across 3.53 Mi documents totaling 2.68 Bi tokens. Another large corpus is Carolina 1.2 Ada [11], which contains approximately 2.11 Mi documents and a total of 11 GB of textual data. When we compare these corpora with those of other languages, the gaps become evident. Gao et al. [14], for example, propose The Pile, a corpus with 825 GB of English texts derived from various data sources, including scientific articles, patent documents, and forums.

An inspiring work for Aroeira is the Colossal Clean Crawled Corpus (C4) [30], a curated English-only corpus. C4 was created using Common Crawl (CC) data extracted in April 2019 and comprises approximately 750 GB of clean English text. Similar to our approach, the authors apply filters to the raw data. CLUECorpus2020 [36] was constructed using cleaned CC data, resulting in a high-quality Chinese pre-training corpus of 100 GB and 36 Bi tokens. MassiveText [29] is a collection of large English datasets created with data from different sources; it contains 2.35 Bi documents, equivalent to 10.5 TB of text. More recently, Sabiá [28] applied a filtering methodology similar to that of MassiveText to the Portuguese section of the ClueWeb dataset [27] and retrieved a curated dataset. WuDaoCorpora [38] is a 3 TB Chinese corpus with 1.08 Tri Hanzi characters collected from 822 Mi web pages.

It is also worth mentioning that a current trend is the proposal of multilingual corpora. The C4Corpus authors [17] present the construction of a 12 Mi web-page corpus covering more than 50 languages, including Portuguese. English accounts for 7.7 Mi (64.2%) documents, while Portuguese has only 0.3 Mi (2.5%). RedPajama [10] is a large multilingual corpus containing 100 Bi text documents extracted from 84 CC snapshots; quality signals were applied to 30 Bi documents, and deduplication was performed on 20 Bi documents. Its reported language distribution is English (69.8%), German (9.2%), Spanish (8.8%), French (7.8%), and Italian (4.4%).

As we can see, the corpora available in English and Chinese contain massive amounts of data, easily surpassing corpora in Portuguese and other languages. However, the amount of information available on the internet in English and Chinese is also greater than in Portuguese.

Another important aspect is the biases present in corpora and texts. Language is a highly relevant avenue for manifesting social hierarchies, pre-established concepts, and standard forms of treatment [6]. Various efforts are being made to evaluate data biases and how they impact the behavior of language models. The work in [24] analyzed 93 social groups that receive stigmatized treatment from NLP models; [25] created StereoSet to measure stereotypical treatment of certain ethnic groups; and [26] developed a benchmark dataset for measuring biases related to gender, race, age, sexual orientation, and others.

These aspects are relevant in a context with strong normative motivations and the need to create responsible AI. Many ways exist to mitigate text biases, such as data augmentation, content filtering, rebalancing, masking, and many others [13]. Our work uses the concept studied by [16], where filtering sensitive content can result in models with more equitable treatment of different ethnic groups. The work specifically uses word co-occurrence in the filtering process.

Based on these past works, we can see that, in general, they focus on creating corpora for training language models for high-resource language tasks. Thus, there is a clear need for a large Portuguese corpus, especially since the number of large Brazilian Portuguese models has increased drastically in recent years. Examples include Bertimbau [33], PTT5 [9], Bertaú [12], Sabiá [4, 28], Cabrita [22], and Bode [15].

3 Aroeira

In this section, we detail the steps of the corpus creation (double-pipeline), which is divided according to two objectives: (i) collecting data (Data Pipeline) and (ii) ensuring content safety (Content Safety Pipeline). The whole pipeline comprises nine steps: data collection and sampling, text extraction, language identification, deduplication, and quality filters in the Data Pipeline; and the sexual content filter, toxic data filter, bias filter, and categorization in the Content Safety Pipeline. Figure 1 presents the entire workflow.

Fig. 1. Double pipeline: the Data Pipeline contains collection and sampling, text extraction, language identification, deduplication, and quality filters; the Content Safety Pipeline encompasses the sexual content filter, toxic data filter, bias filter, and categorization.

3.1 Data Collection and Sampling

The data collection and sampling step involves downloading raw Web ARChive (WARC) files and extracting Portuguese text from them. All data is sourced from Common Crawl (CC), which contains petabytes of scraped internet content from millions of web pages. We use the raw HTTP responses as the initial material, from which we extract and filter Portuguese text as detailed in Subsects. 3.2 and 3.3. CC organizes its datasets by date, each comprising thousands of individual shards of scraped content. We sampled shards from datasets ranging from 2015 to 2023, prioritizing more recent data.

3.2 Text Extraction

This computational step uses multiple cloud machines in parallel. These instances download and process raw files by extracting text from the HTML and filtering for Portuguese text. The resulting data is then processed further on a single machine containing a key-value database for deduplication, as explained in Subsect. 3.4. We opted to work with WARC files to ensure better text quality, which entails handling raw HTML files and extracting texts ourselves. The Python library Trafilatura [5] was used to extract only natural-language text from the HTML files. Metadata from the webpages was saved for later use in the pipeline.
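In the paper's pipeline, Trafilatura performs this extraction. As a rough, self-contained illustration of the idea (not the library's actual logic), a stdlib-only sketch that keeps visible text while skipping script and style blocks might look like this:

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping script/style blocks — a crude
    stand-in for what Trafilatura does far more carefully (boilerplate
    removal, main-content detection, etc.)."""

    SKIP = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are outside skipped elements.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())


def extract_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser._chunks)
```

In practice one would call `trafilatura.extract(raw_html)` instead; the sketch only conveys the extraction idea.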

3.3 Language Identification

Roughly 0.2% of the pages in each shard are in Portuguese. To filter these pages, it is necessary to automatically detect the language in which they were written. Following [14] and [1], we employ Meta AI’s pre-trained fastText model [20], which can detect 176 languages. For each downloaded page, the text is extracted using the Trafilatura library [5], and the fastText model determines the language. Pages identified as Portuguese with the highest probability by fastText were selected.
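The page-selection logic can be sketched as follows. The `min_conf` threshold is a hypothetical parameter added for illustration; the paper only states that pages with Portuguese as the top prediction were kept:

```python
def keep_if_portuguese(predictions, min_conf=0.5):
    """predictions: list of (label, probability) pairs from a
    language-ID model, sorted by probability descending. fastText's
    lid.176 model emits labels like '__label__pt'. Keep the page only
    when Portuguese is the top prediction with sufficient confidence
    (min_conf is an illustrative assumption, not a paper setting)."""
    if not predictions:
        return False
    label, prob = predictions[0]
    return label == "__label__pt" and prob >= min_conf


# With the real model (assuming lid.176.bin has been downloaded):
#   import fasttext
#   model = fasttext.load_model("lid.176.bin")
#   labels, probs = model.predict(text.replace("\n", " "))
#   predictions = list(zip(labels, probs))
```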

3.4 Deduplication

The purpose of this step is to remove duplicated data from the corpus. To achieve this goal, we use two deduplication approaches. The first is a page-level approach, which identifies and removes pages with duplicate URLs. The second is a document-level approach, which aims to remove documents with significant overlap. We employ the MinHashLSH algorithm to calculate the Jaccard similarity between documents, considering two documents duplicates when their similarity exceeds 0.7 [29].
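A minimal sketch of MinHash-based deduplication follows, using brute-force pairwise comparison instead of the LSH banding that makes the real pipeline scale; the shingle size and number of hash functions are illustrative choices, and only the 0.7 threshold comes from the paper:

```python
import hashlib
from itertools import combinations


def shingles(text, n=5):
    """Character n-gram shingle set (n=5 chosen for illustration)."""
    text = " ".join(text.lower().split())
    return {text[i:i + n] for i in range(max(1, len(text) - n + 1))}


def minhash_signature(items, num_perm=128):
    """One salted MD5 hash per 'permutation'; the minimum hash value
    over the set approximates the minimum of a random permutation."""
    sig = []
    for seed in range(num_perm):
        salt = seed.to_bytes(4, "big")
        sig.append(min(
            int.from_bytes(hashlib.md5(salt + s.encode()).digest()[:8], "big")
            for s in items))
    return sig


def estimated_jaccard(sig_a, sig_b):
    # Fraction of agreeing signature slots estimates Jaccard similarity.
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def deduplicate(docs, threshold=0.7):
    """Drop the later document of any pair whose estimated Jaccard
    similarity exceeds the threshold (0.7, as in the paper)."""
    sigs = [minhash_signature(shingles(d)) for d in docs]
    dropped = set()
    for i, j in combinations(range(len(docs)), 2):
        if i in dropped or j in dropped:
            continue
        if estimated_jaccard(sigs[i], sigs[j]) > threshold:
            dropped.add(j)
    return [d for k, d in enumerate(docs) if k not in dropped]
```

At corpus scale, the O(n²) loop would be replaced by LSH banding so that only candidate pairs sharing a band are compared.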

3.5 Quality Filters

A significant amount of the data available on the internet may be of insufficient quality for training language models. Examples include automatically generated text and text not written for human consumption [29]. This step aims to retain only pages written by humans for humans. To achieve this, we applied a series of ten quality filters:

  • Number of tokens: Removes pages with fewer than a minimum number of tokens (in this work, we used the same tokenizer employed by GPT-2), as texts with low token counts are generally not informative;

  • Number of words: Removes pages that do not meet specified upper and lower word limits, excluding punctuation and special characters;

  • Type Token Ratio (TTR): The ratio of unique words (types) to total words (tokens) [31]. TTR [37] serves as an indicator of text quality;

  • Symbols-word ratio: Removes pages whose symbol-to-word percentage exceeds a specified limit. Any special character is considered a symbol;

  • Symbols at the beginning of the text: Removes pages with an excessive number of symbols at the beginning of the text;

  • Stopwords: Filters pages based on the proportion of stopwords, whose presence may indicate text coherence [29];

  • N-gram repetition: Removes pages with excessive repetition of sentences, paragraphs, or n-grams, which indicates low informational content [29];

  • Number of sentences: Removes pages with fewer than a specified number of sentences;

  • Lorem ipsum: Removes pages containing the term “Lorem ipsum” [30];

  • Valid words: Removes pages whose percentage of words found in a language dictionary is below a specified threshold.

The thresholds for each filter are detailed in Sect. 4.
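Two of the ten filters (TTR and valid words) can be sketched as follows. The thresholds used here (`min_words`, `min_ttr`, `min_valid`) are placeholders for illustration, not the values from Table 1:

```python
import re


def type_token_ratio(text):
    """TTR: unique words (types) divided by total words (tokens)."""
    tokens = re.findall(r"\w+", text.lower())
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def passes_quality(text, min_words=50, min_ttr=0.5,
                   valid_vocab=None, min_valid=0.7):
    """Illustrative subset of the quality filters. valid_vocab would
    be a language dictionary (set of known Portuguese words); the
    numeric thresholds are assumed defaults, not the paper's."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < min_words:          # word-count filter
        return False
    if type_token_ratio(text) < min_ttr:  # TTR filter
        return False
    if valid_vocab is not None:         # valid-words filter
        valid = sum(w in valid_vocab for w in words) / len(words)
        if valid < min_valid:
            return False
    return True
```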

3.6 Sexual Content Filter

To maintain the integrity of the corpus, a filter was applied to remove sexual content from the data. For each page collected, we verified whether its URL was present in the Université Toulouse 1 (UT1) blocklist. As noted by [2], the UT1 blocklist is an extensive compilation of block lists frequently used for internet access control in schools. It was developed with the help of automated systems and human contributors and currently includes 3.7 million entries. For this work, we utilized a filtered version of this blocklist tailored for Brazilian websites. It should be noted that this filter only excludes websites marked as adult content. For the remaining content, we randomly selected 25,000 examples and used a Mistral 7B [19] model to extract pejorative sexual terms. These terms were then reviewed by humans and used as the final sexual content filter.

3.7 Toxic Data Filter

In this step, our objective is to identify and remove potentially toxic content. Toxicity is defined as rude, disrespectful, or unreasonable content likely to incite an argument [13]. Our filter comprises a dictionary of insults and pejorative terms. We evaluated exact matches of dictionary words against document terms and removed documents in which the percentage of dictionary words exceeded a specified threshold. The dictionary used for this filter was created by merging two lists of words. We emphasize that this filter does not aim to eliminate all data containing toxic words, but rather to remove content with a significant proportion of toxic content.
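The exact-match ratio test can be sketched as follows; the dictionary and `max_ratio` threshold below are placeholders, not the paper's actual word lists or Table 1 value:

```python
import re


def toxic_ratio(text, toxic_words):
    """Fraction of the document's words that appear (exact match)
    in the toxicity dictionary."""
    words = re.findall(r"\w+", text.lower())
    if not words:
        return 0.0
    return sum(w in toxic_words for w in words) / len(words)


def keep_document(text, toxic_words, max_ratio=0.01):
    """Keep documents whose toxic-word proportion stays below the
    threshold (max_ratio is an assumed value for illustration)."""
    return toxic_ratio(text, toxic_words) <= max_ratio
```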

3.8 Bias Filter

Our work involves a step to identify and eliminate potential biases in the text based on contextual cues. We construct a dictionary of Portuguese expressions used in biased contexts.

When compiling this dictionary, it is crucial to consider societal dynamics and their relationship with language [6]. We filtered the corpus by checking for exact matches between the dictionary words and those in the text [16]. Various types of social biases were mapped, including gender, religion, race, sexist expressions, xenophobia, homophobia, ableism, fatphobia, and politics [24, 26].

3.9 Categorization

This phase aims to categorize each page into a specific knowledge domain. We identified 27 knowledge domains, covering several categories and subjects, such as: blog posts, news articles, marketing, movies, social media, health, culinary recipes, books, scientific articles, politics, etc. The information regarding each domain can be leveraged to balance the data for specific tasks or augment datasets where knowledge is lacking. Essentially, we map the URLs of different pages to each knowledge scope and assign a topic to each URL. A pt-pt text category has been introduced to differentiate between Brazilian Portuguese and European Portuguese, as Brazilian Portuguese predominates in the dataset.

The identified knowledge domains and their distribution are discussed in Subsect. 5.2.
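The URL-to-domain mapping with a keyword fallback might be sketched like this; the host and keyword tables are hypothetical stand-ins for the real dictionary covering the 27 domains:

```python
from urllib.parse import urlparse

# Hypothetical entries for illustration; the real mapping covers
# thousands of base URLs and all 27 knowledge domains.
DOMAIN_BY_HOST = {
    "g1.globo.com": "news",
    "blogspot.com": "blog",
}
KEYWORD_TO_DOMAIN = {
    "receita": "culinary recipes",
    "noticia": "news",
}


def categorize(url, text):
    """Assign a knowledge domain: first by base-URL lookup, then by
    keyword matching over the URL and text, else 'NR'."""
    host = urlparse(url).netloc.lower()
    # 1) base-URL lookup (exact host or subdomain match)
    for known, domain in DOMAIN_BY_HOST.items():
        if host == known or host.endswith("." + known):
            return domain
    # 2) keyword fallback
    haystack = (url + " " + text).lower()
    for kw, domain in KEYWORD_TO_DOMAIN.items():
        if kw in haystack:
            return domain
    return "NR"  # Not Recognized
```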

4 Qualitative Configuration Test

We generated a 1 GB sample of texts and analyzed the distribution of metrics such as the number of tokens, word count, and TTR. We use this sample to find the best configuration for our double-pipeline to produce the resultant datasets. Different value sets were empirically tested, and for each one, we checked the correctness of page removals and recorded the number of pages excluded after the filters were applied.

In addition to finding the optimal configuration, we performed qualitative analyses on this sample to verify the appropriateness of removals. This qualitative test helped us define threshold values that yield content with good textual quality, i.e., text that is lexically diverse, fluid, has few repetitions, and carries semantically relevant content. We also evaluated the number of potentially toxic words and possible biases contained in the texts.

Table 1 shows the optimal configuration to obtain texts that exceed our minimum quality requirements.

Table 1. Double-pipeline final configuration.

5 Results

The created corpus was evaluated against five requirements: (i) it must be larger than the existing corpora for the Portuguese language (see Subsect. 5.1); (ii) it must be diverse, i.e., contain data from different sources (see Subsect. 5.2); (iii) it must cover information from the most recent to the least recent (see Subsect. 5.3); (iv) it must present high-quality text indicators (see Subsect. 5.4); and (v) it must avoid introducing or increasing bias.

Due to the large corpus size, Subsects. 5.4 and 5.5 utilize a 10% randomly generated sample to present the results. Consequently, Fig. 4 and Table 3 were created based on this sample size.

5.1 Corpus Size

The first requirement evaluated was the corpus size. We collected terabytes of data from different CC dumps, each containing approximately 0.2% of texts in Portuguese. Since CC does not filter its data, documents may be of poor quality, contain inappropriate or biased content, or be duplicated. Therefore, a corpus-cleaning step was necessary to ensure that the final corpus was composed only of non-duplicated documents that respect the quality criteria. At the end of this process, we obtained a corpus of 100 GB.

Table 2 presents the created corpus statistics alongside other Portuguese corpora. Aroeira surpasses BrWac [35] and Carolina 1.2 Ada [11] in size, document quantity, and token count. Thus, our corpus is potentially a more diverse resource in terms of texts and tokens than the available alternatives.

Table 2. Corpora size comparison.

5.2 Knowledge Domains

The second requirement evaluated was the distribution of knowledge domains within the created corpus. A mapping of different URLs to their respective knowledge domains was conducted. Each base URL was verified against a dictionary. When there were no matches, keywords were used to determine the document’s domain (Subsect. 3.9). Figure 2 illustrates the document distribution across these domains.

The complexity of the corpus strongly correlates with downstream data performance [3]. Therefore, an extensive representation of knowledge domains can contribute to the generation of more robust models, potentially improving in-context few-shot learning performance [32].

Most documents could not be assigned to a specific domain and are marked as NR (Not Recognized). Among those that were categorized, blog posts and news articles were the most frequent, although other categories such as institutional texts, e-commerce, and internet forums were also found. Knowledge domains are essential for evaluating the quality of the data in the corpus and for filtering data used in the pre-training phase of domain-specific language models.

5.3 Distribution of Documents over Time

Our third analysis is the distribution of the corpus documents over time. This temporal analysis is important to identify possible temporal biases such as outdated texts. Our corpus presents a recent data distribution, which indicates more up-to-date texts.

Figure 3 illustrates that our dataset comprises documents spanning seven years, beginning in 2017. The bulk of the data is from 2017 to 2019, but a notable portion is recent, from 2021, 2022, and 2023. This distribution meets the need for both recent and extensive data. As a result, models can train on up-to-date information and learn recent and common terms used in Portuguese.

Fig. 2. Distribution of knowledge domains.

Fig. 3. Distribution of documents over time.

Fig. 4. Quality indicators. Higher TTR and percentage of valid words indicate better corpus quality, while lower values for the other indicators also signify better quality.

5.4 Quality Indicators

We used TTR value, symbol word percentage, stopword percentage, valid words, and toxic content as quality indicators. Figure 4 shows the results obtained.

We have two metrics that indicate the diversity of the content present in our corpus. The TTR indicates the variability of tokens in a sentence, and we aim for this value to be as high as possible, as it is a sign of texts composed of varied tokens with low word repetition. A TTR threshold of 0.5 is a quality parameter for the text. We obtained a distribution curve with the first quartile close to 0.5 TTR, a median of 0.57, and the third quartile above 0.65. Thus, most of the data in our corpus reaches a satisfactory TTR value, which is a strong indicator of non-repetitive texts.

Another indicator of variability is the percentage of valid tokens or keywords. This indicator measures the frequency of contextually relevant words within the text, and we also want higher values for more diverse and fluid texts. We achieved an excellent distribution in this indicator, with most data above 0.7. This distribution value is another strong sign of lexically diverse texts.

In contrast, the metrics for the percentage of symbols and the percentage of stopwords are indicators of less fluid texts, with many symbols interrupting the text or commonly used words that do not add semantic value to the documents (e.g., “to”, “for”, or “the”). Our goal is to minimize these metrics as much as possible. We achieved our goal of reducing these values, obtaining distributions with low values, with the third quartile below 0.2 in both metrics. This result is a strong indication that the texts in the corpus are fluid.

Finally, we aim to minimize the percentage of potentially toxic words, decreasing to close to 0 (no potentially toxic words). The results show that we achieved this goal in almost all the texts in the corpus, except for some outliers that do not exceed 0.2% of toxic content.

5.5 Bias

We performed a word co-occurrence analysis to identify biases in our corpus. This technique has proven effective in revealing stereotypical treatment of particular social groups [13], which we wish to avoid. It is worth noting that this method is one of several mitigation approaches, and the corpus may still exhibit bias. We analyzed three groups of biases, as shown in Table 3: (i) Gender, (ii) Religion, and (iii) Race. We selected different words for each group and analyzed the context in which these words appear.
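A co-occurrence count of this kind can be computed with a simple sliding window; the window size below is an illustrative choice, not the paper's setting:

```python
import re
from collections import Counter


def cooccurrences(docs, target, window=5):
    """Counts words appearing within `window` tokens of `target`
    across the corpus (window=5 is an arbitrary illustrative choice).
    The most common counts reveal which words typically surround a
    social-group term."""
    counts = Counter()
    target = target.lower()
    for doc in docs:
        tokens = re.findall(r"\w+", doc.lower())
        for i, tok in enumerate(tokens):
            if tok == target:
                lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
                counts.update(
                    t for j, t in enumerate(tokens[lo:hi], lo) if j != i)
    return counts
```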

We chose representative terms indicative of social groups and conducted an analysis focusing on those with the highest co-occurrence frequencies. We isolated the terms “Man” and “Woman” to evaluate gender bias. We selected “Atheist”, “Christian”, “Buddhist”, “Evangelical”, “Jewish”, “Muslim”, and “Umbandist” for religious bias. Finally, we highlighted the words “White”, “Black”, “Asian”, and “Hispanic” for racial bias. Table 3 presents the results for gender, religious, and racial bias.

Table 3. Related word co-occurrence.

We found none of the stereotypical treatments or hazardous associations described in the literature, such as associations with crime, income, and others, in the analyzed groups [16]. Furthermore, the words selected among the various social groups are very similar, which suggests a more equitable treatment in our proposed corpus.

6 Conclusion

Recent studies have shown significant improvements in the performance of language models trained on large corpora [8, 14]. Consequently, the interest in creating large datasets has grown. Most existing research focuses on high-resource languages like English and Chinese, with considerable efforts made to develop multilingual corpora. However, there is a pressing need to develop large datasets for lower-resource languages.

This work addresses this gap for a lower-resource language, specifically Portuguese. We created the largest curated corpus for training or pre-training Portuguese language models. To achieve this, we implemented a double-pipeline process to extract data while ensuring content safety. The process includes downloading, text extraction, language identification, deduplication, quality filtering, filtering for sexual content and toxicity, bias filtering, categorization, and storage. This effort involved collecting terabytes of data, resulting in a curated dataset of approximately 100 GB for Aroeira’s construction.

Our results demonstrate that our corpus fulfills the requirements of corpus size, knowledge domains, document distribution over time, quality indicators, and bias mitigation. We conducted statistical analyses of the corpus to better comprehend the size of the collected documents in terms of the number of tokens, words, and sentences. Additionally, we analyzed bias to recognize potential harms in the created corpus, a distinguishing factor in building models free from social biases. Our findings conclude that the corpus created is of high quality and diversity, with minimal bias.

We are eager to advance our work by training models with encoder architectures, such as BERT models [21]. Furthermore, we plan to pre-train language models with Aroeira to obtain higher-quality models in Portuguese. We also intend to investigate existing instruction and evaluation datasets. Finally, conducting comparative bias analyses on models trained with Aroeira relative to other available corpora [13] will be highly valuable.