1 Introduction

Language models have revolutionized the field of natural language processing with their exceptional ability to perform tasks with minimal supervision. Although primarily pretrained on English-centric corpora, these models have shown impressive multilingual capabilities [10]. Given the abundance of languages worldwide, the majority of which are low-resource, it has become common practice to pretrain a single model on multiple languages simultaneously. Models like XLM-R [12], mBART [28], mT5 [70], and BLOOM [54] exemplify this approach.

Despite the success of these multilingual models, we argue that they may not be the optimal approach for capturing the cultural and knowledge richness inherent in individual languages. When a moderately-sized language-specific corpus is available, continued pretraining could integrate the missing knowledge into the model, enhancing its performance on targeted tasks. To test this hypothesis, we extend the pretraining of English-centric models using Portuguese corpora and evaluate their performance on an extensive range of Portuguese datasets employing a few-shot learning approach. Our results indicate that, even for models trained beyond the recommendations of Hoffmann et al. [18], this additional pretraining considerably improves performance compared to multilingual models.

We evaluate our models on datasets comprising texts originally created by native Brazilian Portuguese speakers, as well as datasets translated from English to Portuguese. We observe improvements across all datasets due to the Portuguese pretraining, with the gains being particularly pronounced for datasets created by Brazilian speakers. One of the largest improvements was observed on the ENEM dataset [57], which is derived from entrance exams used by Brazilian universities and requires extensive knowledge of the country’s history, geography, and literature. This result provides evidence that the major contribution of our language-specific pretraining is to inject domain-specific knowledge about a particular culture as opposed to solely enhancing language proficiency.

2 Related Work

The success of multilingual pretraining has been well-documented in the literature, with models such as ByT5 [69], mT5 [70], XLM-R [12], XGLM [27] and mGPT [56] paving the way for more inclusive language understanding and generation by leveraging shared knowledge across multiple languages. However, there are limitations to this approach.

BLOOM, a 176B-parameter model pretrained on 46 languages, performs worse on English tasks compared to OPT [74], a similarly sized model pretrained on English-centric corpora using comparable computational resources and data size. We conjecture that BLOOM’s underperformance may be attributed to its relatively limited exposure to English tokens during the pretraining phase. Consequently, this observation suggests that monolingual pretraining could offer supplementary advantages.

In support of this hypothesis, models with hundreds of millions of parameters pretrained on monolingual texts have demonstrated gains over multilingual counterparts [2, 6,7,8, 21, 24, 25, 32, 36, 52, 59]. Additionally, research has indicated that language adaptation is beneficial even for low-resource languages [4, 13, 38, 72]. However, there is a limited number of published research articles with comprehensive evaluations of the benefits of continued pretraining at the multi-billion-parameter scale [22, 50, 73]. Through this study, we contribute to the literature by demonstrating the effectiveness of continued language-specific pretraining for Portuguese language models up to the 65B-parameter scale.

The question concerning whether it is advantageous to train models for specific languages is closely associated with the question of whether it is beneficial to train models for particular domains of knowledge. Recent studies, such as Minerva [26] and Galactica [62], have shown that domain-specific pretraining can lead to significant improvements, even with a smaller pretraining corpus compared to large-scale, general-purpose pretraining corpora. Analogously, Fu et al. [15] demonstrated the feasibility of specializing smaller models to perform multi-step reasoning, a capability typically exclusive to models with at least 50B parameters, at the expense of diminished performance in other, more general tasks.

Pretraining with a combination of general and domain-specific corpora can potentially enhance performance in specialized tasks without compromising effectiveness in general-purpose tasks, albeit at the cost of increased computational demands. For example, BloombergGPT [68], a 50B-parameter model pretrained on a heterogeneous corpus in which more than half of the texts are from the financial domain, exhibits comparable performance to OPT-66B in general tasks. However, BloombergGPT’s pretraining dataset is three times larger, and consequently used more computational resources.

Rather than pursuing a single model that performs well across multiple domains, Gururangan et al. [17] propose an alternative approach: using multiple expert models, each trained on a domain-specific subset within a broader, diverse dataset, to function as a single general-purpose model. Their models outperform dense ones across various domain-specific tasks, at the expense of an increased parameter count, consequently leading to larger memory requirements for efficient inference.

3 Methodology

In this section, we outline the pretraining data and training details used to build our models, including data sources, preprocessing techniques, architectures, hyperparameters, and optimization methods.

3.1 Pretraining Data

The pretraining data is derived from the Portuguese subset of the ClueWeb 2022 dataset [40, 41]. To increase the dataset's quality, we apply the quality filters from MassiveText [45], modifying them to accommodate the specific requirements of the Portuguese language. We normalize the text with ftfy, convert wikitexts into human-readable texts, and exclude documents containing fewer than 200 unique tokens.
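The unique-token exclusion rule can be sketched as follows. This is a minimal illustration, using whitespace splitting as a stand-in for the models' subword tokenizers, which the actual pipeline would use instead:

```python
def passes_quality_filter(text: str, min_unique_tokens: int = 200) -> bool:
    """Keep a document only if it contains at least `min_unique_tokens`
    distinct tokens; repetitive, low-information pages are discarded.

    Whitespace tokenization here is a simplification of subword tokenization.
    """
    return len(set(text.split())) >= min_unique_tokens


# A page that repeats one word fails; a page with many distinct words passes.
spammy_doc = "promo " * 500
diverse_doc = " ".join(f"palavra{i}" for i in range(250))
```

A threshold on *unique* tokens, rather than total length, is what filters out boilerplate-heavy pages that are long but carry little content.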

These quality filters are primarily designed for web pages and may not seamlessly transfer to other domains. There is potential for improvement by employing more automated methods; however, this study did not explore such approaches due to the resource-intensive nature of pretraining experiments.

Following the cleaning process, all documents are concatenated using an end-of-sequence token as a separator, and then tokenized. The GPT-J tokenizer, which is identical to the GPT-2 tokenizer [44], produces 7.8 billion tokens, while the LLaMA tokenizer produces 7.3 billion tokens. The discrepancy in the total number of tokens stems primarily from the different tokenization strategies each model employs, byte-level BPE and SentencePiece-based BPE [23], respectively, along with differences in the vocabularies used by each tokenizer.

We extended the training of three models, LLaMA 7B and 65B [63] as well as GPT-J [66], all originally trained on English-centric corpora, on Portuguese texts. The models further pretrained from LLaMA are denoted Sabiá, while the one derived from GPT-J is referred to as Sabiá-J.

3.2 Sabiá Models

The LLaMA 7B and 65B models are decoder-only Transformer models [64] with an architecture similar to PaLM’s [10]. The models were trained using a causal language modeling objective on a massive dataset sourced from webpages, code, books, and scientific papers. The 7B model was trained on 1 trillion tokens and the 65B model was trained on 1.4 trillion tokens. While the majority of the corpus is in English, it also includes an unspecified amount of Portuguese text.

Starting from the LLaMA weights, we train the Sabiá models on our Portuguese dataset (see Sect. 3.1) using the t5x and seqio frameworks [48]. Adhering closely to the hyperparameters used by PaLM, we use the AdaFactor optimizer [55] without factorization, a first-order momentum \(\beta _1 = 0.9\), and a second-order momentum \(\beta _2 = 1 - k^{-0.8}\), where k represents the step number. We apply global norm clipping at 1.0 and dynamic weight decay of \(lr^2\), with lr denoting the current learning rate.
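The step-dependent hyperparameters above can be made concrete with a small sketch. This is not the full AdaFactor update, only the schedules the text specifies (\(\beta_2\) rising with the step number and weight decay tied to the squared learning rate); the function name and dict layout are illustrative:

```python
def adafactor_hyperparams(step: int, lr: float) -> dict:
    """Per-step hyperparameters described in the text (a sketch, not the
    full optimizer): fixed beta1, beta2 = 1 - k^{-0.8}, weight decay = lr^2,
    and a global gradient-norm clip of 1.0.
    """
    k = max(step, 1)  # avoid 0^{-0.8} at step zero
    return {
        "beta1": 0.9,
        "beta2": 1.0 - k ** -0.8,
        "weight_decay": lr ** 2,
        "global_norm_clip": 1.0,
    }
```

Note that \(\beta_2\) starts at 0 (no second-moment history) and approaches 1 as training progresses, so the second-moment estimate averages over an ever-longer window.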

Besides the standard causal language modeling loss, we use an auxiliary loss of \(10^{-4} \log ^2 (\sum _i e^{z_i})\), where z are the logits, to decrease the likelihood of loss spikes at the 65B-parameter scale. The learning rate is linearly increased from 0 to 1e-3 over the initial 1,000 steps, followed by a constant learning rate of 1e-3 for an additional 9,000 steps.
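The auxiliary term, often called a z-loss, penalizes the squared log-partition function of the logits so that the softmax normalizer stays close to 1, which helps avoid loss spikes. A minimal sketch (plain Python, with a numerically stable log-sum-exp):

```python
import math

def z_loss(logits: list[float], coeff: float = 1e-4) -> float:
    """Auxiliary loss coeff * log^2(sum_i exp(z_i)).

    Computed via a max-shifted log-sum-exp for numerical stability;
    in training this is added to the causal language modeling loss.
    """
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return coeff * lse ** 2
```

When the logits already sum to a partition of 1 (log-sum-exp of 0), the penalty vanishes; large positive or negative drifts of the logits are pulled back quadratically.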

The models were trained on a TPU v2-512, using batches of 512 sequences, each containing 2048 tokens. We utilized gradient checkpointing, also known as rematerialization, to enable the use of larger batches, thereby increasing TPU utilization. For the 7B model, this configuration results in a throughput of 124,000 tokens/sec, corresponding to a Model FLOPs Utilization (MFU) [10] of 45.2%, excluding the self-attention operations. For the 65B model, we achieve a throughput of 14,000 tokens/sec, resulting in an MFU of 47.4%.
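The reported MFU figures can be sanity-checked with the common estimate of roughly \(6N\) FLOPs per token for an \(N\)-parameter decoder (self-attention excluded, as in the text). The ~11.5 PFLOP/s peak we assume for a TPU v2-512 is our own figure, not one stated in the text:

```python
def mfu(n_params: float, tokens_per_sec: float, peak_flops_per_sec: float) -> float:
    """Model FLOPs Utilization: achieved matmul FLOPs over hardware peak.

    Uses the ~6 * N FLOPs-per-token approximation, attention excluded.
    """
    return 6 * n_params * tokens_per_sec / peak_flops_per_sec


# 7B model at 124,000 tokens/sec on an assumed 11.5 PFLOP/s peak
utilization_7b = mfu(7e9, 124_000, 11.5e15)
```

Under these assumptions the 7B configuration lands close to the reported 45.2%.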

The resulting models were trained on a total of 10.4 billion tokens, or 1.52 epochs of the Portuguese dataset, corresponding to 10,000 training steps for both Sabiá models. We noticed improvements in few-shot tasks beyond one epoch, which corroborates the results of Taylor et al. [62]. However, due to the high cost of pretraining, we did not continue training.

3.3 Sabiá-J

The GPT-J model is a 6B-parameter decoder-only Transformer model whose architecture and training hyperparameters closely follow GPT-3 6.7B. The main differences reside in computing the MLP and self-attention in parallel, using attention heads of dimension 256 (twice that of GPT-3 6.7B), and using Rotary Positional Embeddings (RoPE) [61]. GPT-J was trained on 400B tokens from The Pile dataset [16], of which 97.4% are in English.

We begin training Sabiá-J from the released GPT-J checkpoint, using the mesh-transformer-jax framework [65] and the AdamW optimizer [30] with a weight decay of 0.1. We warm up the learning rate to 1.2e-5 over 13,500 steps, apply a cosine decay over 135,518 steps down to a final learning rate of 2.4e-6, and keep it constant thereafter. We train on a TPU v3-8 using an effective batch size of 32 sequences of 2048 tokens. This results in a throughput of 5,200 tokens/sec, corresponding to an MFU of 44.5% without self-attention. The model was trained for 18 days on 7.8B tokens, or one epoch of the Portuguese dataset.
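A sketch of this learning-rate schedule, assuming the cosine decay starts immediately after warmup (the exact phase boundaries are not stated in the text):

```python
import math

def sabia_j_lr(step: int,
               warmup_steps: int = 13_500,
               decay_steps: int = 135_518,
               peak_lr: float = 1.2e-5,
               end_lr: float = 2.4e-6) -> float:
    """Sabiá-J schedule as described: linear warmup to peak_lr, cosine
    annealing down to end_lr, then constant at end_lr."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps          # linear warmup
    t = step - warmup_steps
    if t < decay_steps:
        cos = 0.5 * (1.0 + math.cos(math.pi * t / decay_steps))
        return end_lr + (peak_lr - end_lr) * cos      # cosine anneal
    return end_lr                                     # constant tail
```

The cosine factor runs from 1 at the start of the decay to 0 at its end, so the rate moves smoothly from the peak to the floor and stays there.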

4 Evaluation on Poeta

We evaluate the Sabiá models on the Portuguese Evaluation Tasks (Poeta) benchmark, which comprises 14 downstream NLP datasets in Portuguese: ASSIN 2 RTE and STS [47], ENEM Challenge [57], ENEM 2022 [37], FaQuAD [53], TweetSentBr [5], AG News [75], IMDB [31], MASSIVE [14], MKQA [29], BoolQ [11], SST2 [58], WSC [33], and BLUEX [1]. Half of them (ASSIN 2 RTE and STS, BLUEX, ENEM Challenge, ENEM 2022, FaQuAD, and TweetSentBr) were originally written in Portuguese, and the remaining ones were either manually or automatically translated into Portuguese from their originals in English. We refer to the first group as “Native” datasets and the second group as “Translated” datasets.

The models were evaluated in a few-shot manner using the maximum number of examples that fit into a 2048-token context for each task. We used the GPT-2 tokenizer as a reference because it produces the most tokens, ensuring that prompts tokenized with the other tokenizers fit comfortably into the context.

To evaluate the models, we manually select a set of few-shot examples for each dataset on Poeta. Where applicable, these examples are balanced by class (the exceptions being FaQuAD, BLUEX, ENEM Challenge, ENEM 2022, MKQA, and WSC). For each test example, the prompts are built with the selected few-shot examples in alternating order. Each task on Poeta has a particular instruction that is placed at the beginning of the prompt.
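The prompt assembly just described can be sketched as follows. The "Pergunta"/"Resposta" keywords and the separator are illustrative placeholders, not the exact delimiters used in the benchmark:

```python
def build_prompt(instruction: str,
                 few_shot: list[tuple[str, str]],
                 test_input: str,
                 sep: str = "\n\n") -> str:
    """Assemble a few-shot prompt: the task instruction first, then the
    selected examples, then the unanswered test example.

    The Pergunta/Resposta delimiters are hypothetical stand-ins."""
    parts = [instruction]
    for question, answer in few_shot:
        parts.append(f"Pergunta: {question}\nResposta: {answer}")
    parts.append(f"Pergunta: {test_input}\nResposta:")
    return sep.join(parts)
```

The prompt ends right after the answer delimiter, so the model's continuation is read as its prediction for the test example.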

Following Srivastava et al. [60], we adopt the Normalized Preferred Metric (NPM) as our primary evaluation measure:

$$\begin{aligned} \texttt{NPM} = \frac{1}{N} \sum_{i=1}^{N} 100 \times \frac{\texttt{[raw preferred metric]}_i - \texttt{[random score]}_i}{\texttt{[high score]}_i - \texttt{[random score]}_i} \end{aligned}$$
(1)

where N is the number of evaluation datasets, \(\texttt {[raw preferred metric]}_i\) is the score obtained by the model on the i-th dataset, \(\texttt {[random score]}_i\) is the score of a random model (e.g., 50% for a binary classification task) and \(\texttt {[high score]}_i\) is the highest possible score on that dataset, which is either 1 or 100. The preferred metric and random score for each dataset are presented in Table 1. The rationale behind employing NPM rather than a straightforward average across all datasets is to mitigate the undue influence of datasets with inherently high scores, such as binary classification datasets, which could otherwise outweigh datasets characterized by lower scores.
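Equation (1) translates directly into code. A minimal sketch, taking per-dataset scores as parallel lists:

```python
def npm(raw: list[float], random_score: list[float], high_score: list[float]) -> float:
    """Normalized Preferred Metric per Eq. (1): each dataset's score is
    rescaled so that random performance maps to 0 and the highest possible
    score maps to 100, then averaged over the N datasets."""
    n = len(raw)
    return sum(
        100.0 * (r - rnd) / (hi - rnd)
        for r, rnd, hi in zip(raw, random_score, high_score)
    ) / n
```

For a binary classification dataset scored in percent, 75% accuracy against a 50% random baseline yields an NPM of 50, the same as a 0.75 correlation on a metric that is 0 at random and tops out at 1, which is exactly the comparability the normalization buys.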

Table 1. A summary of the datasets constituting the Poeta benchmark.

5 Results

The main results can be found in Table 2. Models such as BLOOMZ, XGLM and Bertin-GPT struggled to generate answers in Portuguese. To address this issue, we adopted an approach akin to that of the XGLM authors: we calculate the likelihood of each candidate answer string given the input text and select the class with the highest probability. For FaQuAD, the only dataset in the benchmark without predetermined candidate answers, we allowed the models to generate answers in their original format.
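This candidate-ranking procedure can be sketched as follows. The `token_logprob` callable stands in for a real model call returning a token's log-probability given a context; the whitespace "tokenizer" is a toy simplification:

```python
from typing import Callable

def classify_by_likelihood(prompt: str,
                           candidates: list[str],
                           token_logprob: Callable[[str, str], float]) -> str:
    """Score each candidate answer string by the summed log-probability of
    its tokens as a continuation of the prompt, and return the argmax.

    `token_logprob(context, token)` is a stand-in for a model forward pass.
    """
    def score(candidate: str) -> float:
        context, total = prompt, 0.0
        for tok in candidate.split():  # toy whitespace tokenization
            total += token_logprob(context, tok)
            context += " " + tok
        return total
    return max(candidates, key=score)
```

Because only fixed candidate strings are scored, the model never has to *generate* text in Portuguese, which is what makes this workable for models that drift into other languages when sampled freely.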

We observe that the LLaMA baselines significantly outperform models of equivalent size trained with fewer tokens, such as Galactica and OPT. Furthermore, despite being trained on English-centric corpora, LLaMA-7B surpasses multilingual BLOOM and XGLM of similar sizes. The Sabiá models demonstrate considerable improvement in NPM compared to their respective baseline models. These NPM gains are more substantial for the smaller Sabiá-J and Sabiá-7B models. Notably, Sabiá-65B marginally outperforms OpenAI’s GPT-3.5-turbo, which serves as the base model for ChatGPT.

Table 2. Few-shot NPM results on the Poeta benchmark.

Through our Portuguese pretraining, the improvement in NPM was larger on native datasets than on translated ones. For Sabiá-65B, improvements over LLaMA-65B came mostly from the native subset. We hypothesize that this is due to the “mechanistic” nature of translated datasets: since they were translated from English, the baseline model already possesses the knowledge needed to solve them and gains little from learning the linguistic, syntactic, and grammatical knowledge of the target language. For instance, to answer the question “does p o box come before street address” (BoolQ dataset), the model gains little from additional pretraining on a Portuguese corpus, as it is unlikely that the corpus would provide new information regarding the formatting of US mailing addresses that the model has not already encountered during its initial English-centric pretraining. Conversely, language-specific pretraining introduces the specific knowledge required to solve tasks in the native subset.

Although GPT-J exhibited lower few-shot performance in English tasks relative to LLaMA, we use it in this study to illustrate that the benefits of extended pretraining are not limited to highly optimized models like LLaMA. We chose not to use BLOOM-7.1B as our initial checkpoint for pretraining due to its inferior performance compared to GPT-J in preliminary few-shot experiments on three Portuguese datasets. However, we later discovered that its performance on Poeta surpassed GPT-J’s. Nonetheless, BLOOM still exhibits lower performance compared to LLaMA.

Analogous to Sabiá-J, BERTIN-GPT is a model pretrained on Spanish text starting from the GPT-J weights. Since Spanish and Portuguese are similar languages, it is reasonable to expect that BERTIN-GPT would perform better than its baseline model. Nevertheless, the observed NPM for BERTIN-GPT is only slightly higher than GPT-J’s.

A noteworthy comparison involves Galactica, a model pretrained on scientific text, predominantly in English, and a similarly-sized OPT model, which utilized comparable pretraining compute but was pretrained on a larger and more diverse English-centric corpus. In their study, the authors demonstrate that Galactica performs on par with OPT on English tasks and largely outperforms OPT on scientific-related tasks. Conversely, OPT significantly outperforms Galactica in Portuguese tasks. This result underscores the trade-offs associated with domain-specific specialization, which often entails diminished performance in other tasks.

BLOOMZ [35], a multilingual instruction-tuned model, demonstrated superior performance compared to its baseline BLOOM model, rivaling LLaMA of equivalent size. Nevertheless, our approach of pretraining in Portuguese appears to yield superior results, as Sabiá-J surpasses BLOOMZ despite originating from a lower-performing baseline model. We envision continued pretraining and instruction tuning as complementary techniques to be combined in future research.

5.1 Results per Dataset

Table 3 presents the results per Poeta dataset for Sabiá models, their baselines, and for the supervised state-of-the-art. The SOTA results reported for the translated datasets were obtained using their original English versions [46, 51, 71, 76]. Since the Poeta benchmark excludes unanswerable examples of the MKQA dataset, we decided not to include the SOTA result for this dataset.

In more challenging datasets, such as ENEM Challenge, ENEM 2022, and BLUEX, which are derived from admission exams to Brazilian universities, we see the most significant gains due to language-specific pretraining. Substantial improvements are also observed in TweetSentBr, a dataset containing tweets with an abundance of slang and references to Brazilian popular culture. We hypothesize that this pretraining imparts specific knowledge about the country’s culture, literature, and geography that is less frequently encountered and learned during the original pretraining with more diverse texts.

Table 3. Results per dataset. \(^1\) [49]; \(^2\) [9]; \(^3\) [34]; \(^4\) [3]; \(^5\) [71]; \(^6\) [76]; \(^7\) [46]; \(^8\) [51].

Certain capabilities only emerge at scale, as evidenced by [67]. For example, 6-7B models perform close to the random baseline in datasets such as ASSIN 2 RTE and STS, and WSC. However, at the 65B scale, we observe substantial improvements, approaching or surpassing state-of-the-art supervised models on the ASSIN 2 RTE and FaQuAD datasets.

GPT-4 [39] results indicate that there is still room for improvement for Sabiá-65B in the majority of the datasets evaluated in this work. Nevertheless, Sabiá-65B performs on par with GPT-4 in datasets such as ASSIN 2 RTE, ENEM Challenge, and FaQuAD.

5.2 Data Contamination

The pretraining data for the Sabiá models was collected up until February 2022. Since ENEM 2022 was publicly released in November 2022, its questions and answers could not have been present in the model's pretraining data. Consequently, the improvements observed on ENEM 2022, which were higher than the average across datasets, cannot be attributed to data contamination. However, for the other datasets, the possibility of data contamination cannot be ruled out.

5.3 Ablation: English Datasets

In this ablation study, we investigate the potential impact of Portuguese pretraining on the performance of the model in English datasets. We evaluated the LLaMA-7B and the Sabiá-7B models in English multiple-choice tasks. For simplicity, we employed a few-shot evaluation setup with 10 randomly selected examples (dynamic-sampled prompt). Importantly, we did not incorporate any descriptions or include Portuguese keywords to delimit the few-shot examples. We also restricted all the datasets to 350 test examples.

Following LLaMA’s [63] approach, given the provided context, we select the answer with the highest likelihood normalized by the number of characters. The results in Table 4 indicate that the Sabiá-7B model exhibits a slightly reduced performance in English tasks compared to the baseline. This result corroborates our premise that model specialization invariably entails a balancing act, where improvements in one domain frequently coincide with degradation in another.
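The character-normalized selection rule can be sketched as follows; `loglik` stands in for a model call returning the candidate's total log-likelihood given the context:

```python
from typing import Callable

def pick_normalized(candidates: list[str],
                    loglik: Callable[[str], float]) -> str:
    """Select the answer whose log-likelihood, divided by its character
    count, is highest, following LLaMA's evaluation protocol.

    `loglik` is a stand-in for a real model scoring call."""
    return max(candidates, key=lambda c: loglik(c) / len(c))
```

Dividing by length removes the bias toward short answers: a longer option accumulates more (negative) token log-probabilities, so without normalization it would almost always lose to a shorter one.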

Table 4. Results in English datasets.

6 Limitations

Owing to the financial constraints associated with pretraining and, more significantly, the manual labor involved in collecting and curating evaluation datasets, experiments were conducted exclusively in Portuguese. Given that our models started pretraining from English-pretrained models and that Portuguese and English exhibit relatively close linguistic proximity, we anticipate that other researchers conducting further pretraining on languages closely related to English will observe comparable improvements in their target tasks. However, determining whether the benefits of this method persist for languages more distant from English remains an open research question.

Portuguese is a language with an abundance of high-quality web-based texts. Thus, the gains observed with the proposed method may not necessarily extend to low-resource languages with limited availability of quality texts. In such cases, parameter-efficient methods [19, 42, 43] could be advantageous, as evidenced by Yong et al. [72]. We did not use these techniques in this study due to the training costs, which are approximately equivalent to training the entire model.

7 Conclusion

In this study, we contributed to the expanding body of scientific evidence that specializing models for individual languages leads to improvements, even when the baseline model is large and extensively trained. We achieved this for the Portuguese language utilizing a near state-of-the-art model with 65 billion parameters. Given the relatively low pretraining cost and significant performance gains observed, we foresee a future landscape consisting of a diverse array of models, each tailored to a specific domain, rather than a single, all-encompassing model.