1 Introduction

Automatic Text Simplification (ATS) aims to transform complex sentences into simpler ones, which supports second language learners and improves communication with people with low literacy, among other benefits [4]. ATS can be applied in multiple domains. For instance, it could simplify legal documents, transforming jargon-heavy juridical terms into more accessible vocabulary and making the documents more understandable to the general public [9]. Hence, there are many opportunities for ATS techniques in public-sector institutions, which need to make their public documents more accessible and thereby reinforce their relevance and transparency.

ATS has been an active research topic over the years, with several applications [28]. In this context, José and Finatto noted that the demand for text simplification in Brazil has increased over the years [16]. One of the main reasons is the growing need to make specialised concepts accessible to a wider range of people. The authors also investigated the language used in documents provided by the Ministry of Health of Brazil. They found two kinds of documents describing the same disease. The first is directed at health professionals and therefore uses domain-specific vocabulary. The other class of documents, aimed at the general public, provides a “simplified version”. Nevertheless, it still contains terms that are unusual among less educated readers, such as “mucus”. Moreover, the manual simplification of complex text is not scalable; thus, ATS alternatives should be investigated thoroughly [16].

In the past decade, Portuguese ATS expanded significantly, with systems using lexical and syntactic simplification and Statistical Machine Translation (SMT) methods [3, 4, 30]. For instance, Specia [30] proposed an SMT model to simplify Portuguese text using only a few examples, achieving acceptable adequacy and fluency [30]. However, these studies were developed more than ten years ago, before the advent of neural network architectures for NLP tasks. Therefore, even though deep learning algorithms are a trend in the field, they still have limited application in Portuguese ATS. For instance, Neural Machine Translation (NMT) is a recent method for text simplification that directly transforms complex sentences into simpler ones without any need for syntactic or lexical analysis [2]. NMT has gained popularity due to its successful simplification results in a variety of domains [2]. Moreover, different companies have used NMT methods in their services, such as Google Translate, Microsoft Translator, and IBM Watson Language Translator [13, 27, 37].

MT-based ATS methods, such as NMT, usually use a parallel corpus to map hard-to-read sentences to simpler ones. Moreover, they are domain-independent, i.e., one can train an ATS model on a large parallel corpus and then apply the same model to texts from a different domain with comparable performance [8, 12, 32]. It is important to highlight that NMT methods have presented successful results and outperformed consolidated statistical methods [2]. To the best of our knowledge, no research has investigated the application of NMT methods to simplify Brazilian Portuguese texts.

Based on this scenario, this study investigates and assesses the use of state-of-the-art NMT methods to automatically simplify documents written in Brazilian Portuguese. To reach this goal, the paper presents an empirical evaluation of NMT models on a parallel corpus extracted from complex and simplified translations of the Bible. The results demonstrate that the use of NMT for Portuguese text simplification is promising, with a wide range of practical applications. These findings can improve text accessibility for more people, fostering the democratisation of information.

This paper is organised as follows. Section 2 introduces basic concepts about NMT and the methods adopted in this work. Section 3 presents works related to the ATS problem. In Sect. 4, the materials and methods are detailed. Section 5 presents the results and discussion. Finally, Sect. 6 states the conclusions and future work.

2 Background

This paper explores the use of NMT models based on monolingual translation to simplify texts in Portuguese. Herein, we briefly introduce Recurrent Neural Networks (RNNs), NMT, and the methods considered in this work, i.e., an attention-based model and a Bidirectional Recurrent Neural Network (B-RNN) with an attention layer.

The main advantage of an RNN is its ability to learn temporal information, even though it can also be used in non-temporal contexts [20]. For machine translation, RNNs are usually combined in an encoder-decoder architecture in which both components are RNNs [20, 34]. The encoder is responsible for transforming the input into a context vector that summarises its information after T recursive updates [25]. The decoder takes a dummy input and recursively generates the output, feeding each step with the previously generated token [25]. In this paper, we consider a specific RNN type called the Bidirectional Recurrent Neural Network (B-RNN), in which the encoder conditions not only on past inputs but also on future ones [20]. It produces a forward sequence \((f_1, f_2, \dots , f_n)\) and a backward sequence \((b_1, b_2, \dots , b_n)\) such that h = [f, b] is their concatenation [20].
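For illustration only (this is not the paper's implementation), the bidirectional encoding h = [f, b] can be sketched in NumPy with a minimal vanilla RNN; the weight matrices below are random placeholders:

```python
import numpy as np

def rnn_pass(inputs, W_x, W_h, h0):
    """Run a vanilla RNN over a sequence, returning all hidden states."""
    h, states = h0, []
    for x in inputs:
        h = np.tanh(W_x @ x + W_h @ h)
        states.append(h)
    return states

rng = np.random.default_rng(0)
d_in, d_h, T = 4, 3, 5                                # input dim, hidden dim, sequence length
xs = [rng.normal(size=d_in) for _ in range(T)]
W_x, W_h = rng.normal(size=(d_h, d_in)), rng.normal(size=(d_h, d_h))

f = rnn_pass(xs, W_x, W_h, np.zeros(d_h))             # forward states f_1..f_T
b = rnn_pass(xs[::-1], W_x, W_h, np.zeros(d_h))[::-1]  # backward states b_1..b_T

# Each position t is represented by the concatenation h_t = [f_t, b_t]
h = [np.concatenate([ft, bt]) for ft, bt in zip(f, b)]
```

In practice, the forward and backward passes use separate weight matrices; a single pair is shared here only to keep the sketch short.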

A significant advancement in RNNs was the proposal of the attention mechanism [6]. This mechanism allows a sequence-to-sequence model to focus on key parts of the input sequence at each decoding step. Consequently, it permits the model to learn the correct alignment between the sentences [6]. Studies have shown that attention mechanisms significantly improve model performance on long sentences and improve the model's soft alignment [6]. Consequently, attention had a considerable impact on improving machine translation results [6]. In this work, we used a B-RNN with an attention layer as one of the algorithms to be analysed for the Portuguese ATS problem.

More recently, a method called the Transformer was proposed, based solely on attention layers and dispensing with recurrence and convolutions entirely [33]. The Transformer follows a general sequence-to-sequence encoder-decoder architecture [33]. The encoder is a stack of N layers, each composed of two sub-layers: a multi-head self-attention mechanism and a position-wise feed-forward network; each sub-layer is wrapped with a residual connection followed by layer normalisation [33]. The decoder follows a similar design, using a stack of N layers with an additional multi-head attention sub-layer over the encoder output, also with normalisation and residual connections [33]. The attention implementation proposed by [33] is a scaled dot-product attention, where the queries (Q) and keys (K) are vectors of dimension \(d_k\) and the values (V) are vectors of dimension \(d_v\) [33]. The attention for each output is calculated as given by Eq. 1, where \(\frac{1}{\sqrt{d_k}}\) is the scaling factor.

$$\begin{aligned} Attention(Q,K,V) = softmax(\frac{QK^{T}}{\sqrt{d_k}})V. \end{aligned}$$
(1)

In practice, the model uses multi-head attention to learn different parts of the representation at different positions [33]. Attention is used in both the encoder and the decoder, including in a self-attention manner [33]. In addition to attention, the model also uses fully connected position-wise feed-forward layers and positional encodings [33].
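Equation 1 can be sketched directly in NumPy; the random matrices below are placeholders, and a real implementation would batch this over heads:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V  (Eq. 1)."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                      # (n_q, n_k) similarity scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # row-wise softmax
    return weights @ V, weights

rng = np.random.default_rng(1)
Q = rng.normal(size=(2, 4))   # 2 queries of dimension d_k = 4
K = rng.normal(size=(3, 4))   # 3 keys of dimension d_k = 4
V = rng.normal(size=(3, 5))   # 3 values of dimension d_v = 5
out, w = scaled_dot_product_attention(Q, K, V)
```

Each output row is a convex combination of the value vectors, with weights summing to one per query.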

The Transformer's experimental results showed that attention-only models outperformed RNNs in quality while requiring significantly less training time on two machine translation tasks (English-to-German and English-to-French) [33]. The Transformer also enabled new promising models such as Bidirectional Encoder Representations from Transformers (BERT) and many others [33]. Due to these relevant results, the Transformer was considered in our experiments as one of the algorithms to be analysed for the Portuguese ATS problem.

3 Related Works

ATS is a relevant task that has attracted growing interest in the Natural Language Processing field in recent years. This section presents relevant methods developed over the last decade for the ATS problem in Brazilian Portuguese.

Recently, the ATS problem has been addressed as a monolingual machine translation problem, in which a given text is translated into a simpler one. There are two relevant machine translation approaches: Statistical Machine Translation (SMT) and Neural Machine Translation (NMT). In SMT, the translation of an original sentence f into a sentence e is modelled via Bayes' theorem, combining a translation model and a language model, as detailed in [1]. The research carried out by [30] treated the ATS problem for Portuguese texts as a translation task. The authors adopted the SMT approach to learn how to translate complex sentences into simple ones. The SMT system was trained on a parallel corpus of original and simplified texts, aligned at the sentence level. The translations produced were evaluated using the Bilingual Evaluation Understudy (BLEU) metric and manual inspection. According to both evaluations, the results were promising, and the overall sentence quality was not harmed. It was observed that some types of simplification operations, mainly lexical, were correctly captured.
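Concretely, the standard noisy-channel decomposition underlying this SMT formulation (stated here from the literature, not reproduced verbatim from [1]) selects the output sentence \(\hat{e}\) as

$$\begin{aligned} \hat{e} = \mathop {\arg \max }\limits _{e} P(e \mid f) = \mathop {\arg \max }\limits _{e} P(f \mid e)\,P(e), \end{aligned}$$

where \(P(f \mid e)\) is the translation model and \(P(e)\) is the language model.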

In summary, despite all these advancements, there is still a gap in studies on the application of NMT to Portuguese. NMT is a recently developed deep learning technique that has achieved significant results on several complex tasks [2, 6, 23, 34, 39]. According to [2], NMT-based methods have proven a better alternative to SMT techniques for translation problems. Although several works have employed NMT for text simplification [2, 10, 23, 26, 32], no work has applied it to Portuguese texts. To the best of our knowledge, the last works on automatic text simplification in Portuguese were conducted more than ten years ago [3, 4, 30] and used SMT. Thus, this paper aims to explore state-of-the-art NMT methods for text simplification in Portuguese.

4 Materials and Methods

This section details the dataset, methods, experimental methodology and evaluation metrics of this work.

4.1 Data Description

This work adopted a parallel corpus based on different versions of the Bible to evaluate the NMT methods. The first one is a traditional version called Almeida Revista e Corrigida (ARC), published in 1997, with a complex text style. The newer version, called Nova Almeida Atualizada (NAA), was launched in 2017 as a simplification of the traditional version. In this paper, we also evaluated other versions of the Bible: the Nova Tradução Linguagem de Hoje (NLTH), the Nova Bíblia Viva (NBV), and the Nova Versão Internacional (NVI). Considering a diverse range of versions may provide different simplification types, as explored in the Porsimples project [4].

Each dataset has 29070 aligned verses, which were used to create the proposed sequence-to-sequence models. Table 1 provides more information about the corpora used in this study, including the number of tokens, number of sentences, readability, and lexical diversity (extracted using the pylinguistics library [7]). In that library, the Flesch Reading Ease (FRE) score was proposed by [17] and adapted to Portuguese by [22] (see Eq. 2).

$$\begin{aligned} FRE = 206.835 - (1.015 * ASL) - (84.6 * ASW) \end{aligned}$$
(2)

The adaptation by [22] adds 42 points to the equation proposed by [17] because Portuguese words have more syllables than English ones; otherwise, the score would be overly penalised. ASL denotes the average sentence length (words per sentence), and ASW denotes the average number of syllables per word. Complex and simplified Portuguese texts tend to exhibit key differences. One of them is the length of each sentence and the number of tokens per sentence, as pointed out by a previous work, the Porsimples project, which manually simplified a newspaper corpus. That study found that, in most cases, the simplified versions had fewer words per sentence [4]. The same was observed in the Bible versions considered here.
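The adapted score can be sketched as a small function; the word, sentence, and syllable counts below are hypothetical, not values from the corpus:

```python
def flesch_reading_ease_pt(n_words, n_sentences, n_syllables):
    """Flesch Reading Ease (Eq. 2) plus the 42-point adjustment for
    Portuguese described above. Higher scores mean easier text."""
    asl = n_words / n_sentences   # average sentence length (words per sentence)
    asw = n_syllables / n_words   # average syllables per word
    return 206.835 - 1.015 * asl - 84.6 * asw + 42

# Hypothetical text: 100 words, 8 sentences, 180 syllables
score = flesch_reading_ease_pt(100, 8, 180)   # ASL = 12.5, ASW = 1.8
```

Shorter sentences (lower ASL) and shorter words (lower ASW) both push the score upward, which is why the simplified versions score higher.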

In a random sample of 292 pairs of verses (0.01% of the original dataset), we analysed different aspects of the versions of the Bible. Table 1 and Fig. 1 show the differences between the traditional ARC version and the other ones.

Table 1. Descriptive statistics on the texts, computed using the pylinguistics library on a random sample of 303 pairs of verses (0.01% of the original dataset). In general, the simplified texts have fewer tokens per sentence and more sentences.

Figure 1 shows histograms of the number of tokens per sentence for each Bible version. The simplified versions of the Bible have fewer tokens per sentence, with more sentences under the median. The distribution of tokens per sentence in the ARC is smoother and shows a prevalence of longer sentences (i.e., more tokens per sentence). It is considerably different from the other versions, especially the NLTH and NBV. The difference between the ARC and the NLTH version is even greater in almost all aspects, such as average tokens per sentence and median sentence length. Thus, both features indicate that the NLTH and the other versions are easier to read because they have fewer tokens per sentence and more sentences.

Fig. 1.

The histograms show the distribution of tokens per sentence in the versions of the Bible, in a random sample of 303 pairs of verses (0.01% of the original dataset). The split into sentences and tokens was made using the Portuguese sentencizer and tokenizer from spaCy [15].
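As a rough stand-in for the spaCy pipeline used in the paper, the per-sentence token counts behind such a histogram can be sketched with a naive punctuation-based split (the verse text is an illustrative example, not a corpus excerpt):

```python
import re
import statistics

def tokens_per_sentence(text):
    """Naive sentence split on .!? followed by whitespace, then
    whitespace tokenisation (a crude stand-in for spaCy)."""
    sentences = [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]
    return [len(s.split()) for s in sentences]

verse = ("No princípio, criou Deus os céus e a terra. "
         "A terra era sem forma e vazia.")
counts = tokens_per_sentence(verse)            # tokens in each sentence
median_len = statistics.median(counts)         # median sentence length
```

A real pipeline would use proper tokenisation (clitics, punctuation, abbreviations), which is why the paper relies on spaCy's Portuguese models.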

Table 2 exemplifies the aligned ARC and NLTH Bible versions used in the experiments. It is important to note that ARC is considered the complex version to be simplified. The other versions are considered targets in separate experiments and are combined into a single dataset afterwards.

Table 2. The table exemplifies the aligned versions of the Bible used in the experiments. The passage is from Genesis Chapter 1.1-4.

4.2 Automatic Text Simplification

This section presents the details of the automatic text simplification architectures adopted in this study. We applied the Transformer and the Bidirectional Recurrent Neural Network (B-RNN), which are state-of-the-art architectures, to simplify texts in Portuguese.

Table 3. The table shows the hyperparameters used in both models.

To perform the evaluation, we used the algorithms implemented in the OpenNMT framework [18]. OpenNMT makes it possible to train a model using different datasets. We considered each SOURCE-TARGET pair a distinct corpus and assigned different weights based on the difference between the median sentence length of the ARC version and that of the target version, as presented in Table 1. Further, the validation set received target examples in the same proportion as the weights. The main objective is to avoid over-fitting by increasing the training data. It may also allow the model to learn different simplification styles, which may improve its generalisation. The combined corpus is identified in the following tables as “multi-corpus”.
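One plausible reading of this weighting scheme can be sketched as follows; the median sentence lengths are hypothetical placeholders, not the values from Table 1:

```python
# Hypothetical median sentence lengths (tokens per sentence); ARC is the
# complex source version, the others are simplification targets.
medians = {"ARC": 22, "NAA": 18, "NLTH": 14, "NBV": 15, "NVI": 17}

# Weight each SOURCE-TARGET corpus by how much shorter its target's
# median sentence is relative to the ARC source: targets that simplify
# more aggressively contribute more examples per training batch.
weights = {target: medians["ARC"] - m
           for target, m in medians.items() if target != "ARC"}
```

With these placeholder values, NLTH (the most aggressively simplified target) would receive the largest weight, and NAA the smallest.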

Finally, the dataset was split into 17441 parallel verses for training, 8139 for validation, and 3489 for testing. The experiments considered different target corpora, with the ARC Bible version always as the input. Different encoder-decoder architectures were also considered (see Table 5). Two experiments were performed for each model: with and without pre-trained word embeddings (Portuguese GloVe embeddings with 300 dimensions [14]). In total, 20 experiments were performed: one for each of the five corpora under each of the four model configurations. All experiments used a shared embedding and vocabulary and allowed the execution of 10000 training epochs. Detailed information on the experimental setup is given in Tables 3 and 4.

Table 4. The table shows the hyperparameters specific to each model.

4.3 Evaluation

The evaluation used two metrics for translation and text simplification assessment. The first is the Bilingual Evaluation Understudy (BLEU) score [24]. The BLEU score is a widely used metric for evaluating machine translation between two languages against a reference corpus. It has also been extensively used to assess automatic text simplification, especially for models based on monolingual translation [5, 35]. The other metric is the System output Against References and against the Input (SARI) score [35]. Unlike the BLEU score, the original purpose of the SARI score is to evaluate text simplification, considering the system output, the references, and the source sentence. In summary, the SARI score measures how well words are maintained or changed by the system [35]. Herein, we used the SARI and BLEU score implementations proposed by [5].

5 Results

This section presents the results obtained in the experiments. First, we discuss quantitative aspects of the supervised metrics; then, a more in-depth discussion of the quality of the predictions is given. Table 6 synthesises the results from Table 5. Table 5 shows the detailed results of the text simplification for each architecture and pair of datasets analysed. Given that it was not possible to find a massive parallel corpus of simplified Portuguese texts, and given the training time constraints of the experiments, the B-RNN and B-RNN+Embedding models achieved the best results. Despite its poorer performance compared with the B-RNN model, the Transformer might improve when trained for more epochs and on a larger corpus [11, 21, 29].

Table 5. Results of the text simplification for the different experiments. The B-RNN model outperforms all the other models when both metrics are considered.
Table 6. Summary of the metrics in Table 5.
Table 7. Multi-corpus predictions produced by the B-RNN model with and without pre-trained embeddings. Both models removed a specific part of the sentence (in bold) to make it shorter.

5.1 Simplification Quality

One particular insight is that the simplification using multiple targets achieves a much higher BLEU score but a lower SARI score in almost all experiments. This difference is due to the distinct nature of the two metrics: the BLEU score measures how many n-grams of the system prediction appear in the references. In other words, it calculates a “modified precision”, which removes the incentive to over-generate a particular word to obtain a high score [24]. Therefore, a high BLEU score might mean that the model's predictions share a significant overlap with the references. On the other hand, the SARI score rewards words that are kept in both the reference and the source sentence [5, 36]. It also rewards the addition of new words as long as they belong to at least one reference [5, 36]. Further, the metric proved intuitive in how the simplification gain is calculated [5, 36].
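BLEU's clipping of over-generated words can be illustrated with its unigram modified precision (a simplified sketch of one component of the full metric, which also combines higher-order n-grams and a brevity penalty; the sentences are made-up examples):

```python
from collections import Counter

def modified_unigram_precision(prediction, references):
    """Clip each predicted unigram's count by its maximum count in any
    reference, so repeating a word cannot inflate the score."""
    pred_counts = Counter(prediction.split())
    max_ref = Counter()
    for ref in references:
        for tok, n in Counter(ref.split()).items():
            max_ref[tok] = max(max_ref[tok], n)
    clipped = sum(min(n, max_ref[tok]) for tok, n in pred_counts.items())
    return clipped / sum(pred_counts.values())

refs = ["o gato está no tapete", "há um gato no tapete"]
p = modified_unigram_precision("o o o o", refs)   # "o" over-generated
```

Even though every predicted token appears in a reference, the clipped count keeps the degenerate prediction from scoring well, which is the behaviour [24] designed the metric to have.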

Fig. 2.

Text readability scores, calculated by the pylinguistics library, of the simplifications produced by the B-RNN model on different corpora.

Table 8. Scores of the simplifications produced over different corpora by the best-performing methods, i.e., the B-RNN with and without pre-trained embeddings.

Figure 2 shows that, for the models trained on a single corpus, the readability metric improves and is even higher than the readability of the reference corpus. This means that the model was able to learn the style of the target corpus, as pointed out by previous works [19, 38]. Besides, although the predicted texts have more tokens per sentence on average, the high readability score might mean that the model is predicting short words, as word length is one of the aspects considered in the readability metric [7].

Finally, the model trained with a single corpus achieved a higher SARI score, indicating a better simplification. Nonetheless, in this particular example, it could not produce a sentence with the same level of grammatical correctness and semantic meaning as the one produced by the multi-corpus training approach. It was pointed out by [31] that even though SARI scores can represent the quality of simplified sentences, the BLEU score performs better at scoring their grammaticality.

Table 7 presents an example of the outcome of the best text simplification method, illustrating the practical potential of the proposal. As presented, the text produced by the algorithm contains more general words than the original text. Even though it did not produce an exact translation of the tokens, the model was able to maintain the original meaning and grammatical correctness of the sentence.

6 Conclusion

Neural Machine Translation (NMT) methods have achieved successful results on the text simplification problem in different languages, overcoming traditional statistical approaches. To the best of our knowledge, no previous research had investigated the application of NMT methods to simplify Brazilian Portuguese texts. The main contribution of this paper is the application of NMT methods to the simplification of Portuguese text. Two different state-of-the-art NMT methods were considered: the Transformer and the Bidirectional Recurrent Neural Network (B-RNN). The results demonstrated that the B-RNN obtained the best results on average (BLEU = 21.84 without pre-trained embeddings and SARI = 45.34 with pre-trained embeddings), despite the small corpus size and the limited number of training epochs.

Another significant improvement was the use of multiple corpora presenting different possible simplifications for the same input, which achieved an improvement of over 8 points in the BLEU score. Despite a lower SARI score, the higher BLEU score might indicate the ability to preserve the sentence's meaning and grammatical correctness.

As future work, we intend to: (i) perform an analysis of the parameters of each algorithm evaluated; (ii) use different embedding models, such as BERT [11, 21, 29]; (iii) apply the NMT methods to texts from different domains, such as law and health; (iv) explore the use of other methods, such as lexical and syntactic simplification and pre-trained models for a monolingual translation approach.