1 Introduction

Automatic Text Summarization is a fundamental task in the field of Natural Language Processing (NLP), essential for handling large amounts of information. This task aims to condense information from one or more documents, producing summaries that preserve the main points of the original content [13]. Summarization can be classified into two main types: extractive and abstractive. Extractive summarization selects sentences or phrases directly from the original text(s), while abstractive summarization generates new sentences that capture the original meaning of the text(s) [17]. Additionally, summarization can be categorized as single-document, when summarizing a single text, or multi-document, when integrating information from multiple texts [7].

More recently, Large Language Models (LLMs) have been employed for several NLP applications, including generation tasks such as text summarization [1]. Despite their popularity, LLMs present an explainability deficit and have some limitations, such as the expensive computational infrastructure they require and the occurrence of hallucinations, which, in the case of summarization, may undermine the goal of preserving the original meaning of the texts. On the other hand, extractive summarization methods generally do not suffer from these disadvantages. Interestingly, there has been a resurgence of interest in these methods, with the proposal of novel methods that combine classical ideas with new perspectives and also bring new approaches to the task.

Various extractive methods have been proposed and tested mainly on English corpora like CNN/DM [9]. Some notable methods are PreSumm [12], HiStruct+ [20], RankSum [10], MemSum [8] and MatchSum [22], which have demonstrated promising performances in generating summaries. However, there is a significant gap in the application of these methods to other languages and, therefore, limited evaluation of their multilingual capacity. Given the multilingual nature of the web, multilingual summarization is a highly relevant field, and there is a growing need for resources and techniques that support multiple languages, expanding the applicability of NLP advances to diverse linguistic communities.

Evaluation in summarization (whether it is multilingual or not) is crucial for advancing research in the field. Traditionally, the ROUGE (Recall-Oriented Understudy for Gisting Evaluation) metric [11] has been widely used to measure the overlap of n-grams between automatic summaries and human reference summaries. However, ROUGE is limited since it only computes superficial text similarity. To address these limitations, other metrics have arisen, such as BLANC [21]. BLANC evaluates summaries by using them to help pre-trained language models (such as BERT [5]) perform the token unmasking task, under the assumption that a “more helpful” summary improves the model’s success on the task.

In this context, this work aims at evaluating the effectiveness of some of the more recent extractive methods for Brazilian Portuguese, assessing their multilingual performance. We run selected state-of-the-art and more classical methods on well-known reference news corpora for English and Brazilian Portuguese, and compare the results using both the ROUGE and BLANC metrics, in order to provide a more comprehensive evaluation.

The next section briefly describes the main related work. The datasets that we use are detailed in Sect. 3. Section 4 presents our experimental setup, while the discussion of results and final remarks are given in Sects. 5 and 6, respectively.

2 Related Work

Several automatic extractive summarization methods have been developed, especially using deep learning models and word embeddings. Here, we briefly overview classical and more recent methods, including those that are considered state-of-the-art in the area.

TextRank [14] is a graph-based ranking algorithm used for both sentence and keyword extraction. Inspired by Google’s PageRank algorithm, TextRank applies the idea of “voting” to determine the importance of textual units. It constructs a graph where nodes represent sentences or words, and edges indicate co-occurrence or similarity relations. The algorithm iterates over the graph until the importance scores of the nodes converge. TextRank is an unsupervised method that has proven competitive in established benchmarks like DUC-2002, demonstrating robustness and portability across different domains and languages.
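As a minimal sketch of the idea (not the authors’ original implementation), the following code builds a sentence graph weighted by word overlap, normalized by sentence lengths as in the original paper, and runs PageRank-style score updates for a fixed number of iterations:

```python
import math
import re

def textrank(sentences, damping=0.85, iterations=50):
    """Minimal TextRank sketch: similarity graph + PageRank-style iteration."""
    # Tokenize each sentence into a set of lowercase words.
    tokens = [set(re.findall(r"\w+", s.lower())) for s in sentences]
    n = len(sentences)
    # Edge weights: word overlap normalized by the log of sentence lengths.
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and len(tokens[i]) > 1 and len(tokens[j]) > 1:
                overlap = len(tokens[i] & tokens[j])
                weights[i][j] = overlap / (math.log(len(tokens[i])) + math.log(len(tokens[j])))
    # Iterate the weighted PageRank update; scores converge for small graphs.
    scores = [1.0] * n
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            rank = 0.0
            for j in range(n):
                out = sum(weights[j])
                if weights[j][i] > 0 and out > 0:
                    rank += weights[j][i] / out * scores[j]
            new_scores.append((1 - damping) + damping * rank)
        scores = new_scores
    return scores
```

The top-scoring sentences would then be selected, in document order, to form the extract.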

Centroid-embedding [19] summarization leverages the compositional capabilities of word embeddings to capture semantic relationships between words and sentences. This method constructs a centroid vector, which represents the central theme of the document, by summing the embeddings of the most significant words, determined by their TF-IDF scores. Each sentence is then represented as a sum of its word embeddings, and sentences closest to the centroid vector are selected for the summary. This approach addresses the limitations of traditional bag-of-words models, which often fail to capture semantic similarities between sentences with different words but similar meanings. The centroid-based method has proven effective in both multi-document and multilingual summarization tasks, offering competitive performance compared to more complex deep learning models. Its simplicity and robustness make it an attractive option for extractive summarization across various domains and languages.
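The selection step can be sketched as follows. This is an illustrative simplification, not the original implementation: it uses raw term frequency in place of TF-IDF (which degenerates for a single document) and a small hypothetical embedding table, whereas the original method relies on pre-trained word embeddings:

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def vec_sum(vectors, dim):
    """Component-wise sum of a list of vectors."""
    out = [0.0] * dim
    for v in vectors:
        for k in range(dim):
            out[k] += v[k]
    return out

def centroid_summary(sentences, embeddings, dim, top_k_words=3, num_sentences=1):
    """Select the sentences whose embedding sums lie closest to the centroid."""
    # Score words by document frequency (stand-in for TF-IDF in this sketch).
    words = [w for s in sentences for w in s.lower().split()]
    counts = Counter(w for w in words if w in embeddings)
    top_words = [w for w, _ in counts.most_common(top_k_words)]
    # Centroid = sum of the embeddings of the most significant words.
    centroid = vec_sum([embeddings[w] for w in top_words], dim)
    # Each sentence = sum of its word embeddings; rank by similarity to centroid.
    scored = []
    for idx, s in enumerate(sentences):
        svec = vec_sum([embeddings[w] for w in s.lower().split() if w in embeddings], dim)
        scored.append((cosine(svec, centroid), idx))
    best = sorted(scored, reverse=True)[:num_sentences]
    return [sentences[idx] for _, idx in sorted(best, key=lambda t: t[1])]
```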

PreSumm [12], also known as BERTSUM, has extractive and abstractive summarization strategies. It uses the BERT architecture to understand the context of sentences and select the most relevant ones for the summary. The technique includes pretraining on general language tasks and fine-tuning specific to the summarization task. PreSumm performs summarization in two phases: (1) an extraction phase where the most important sentences are selected, and (2) an abstractive generation phase where these sentences can be refined to improve the fluency and cohesion of the final summary. For this work, we used only the extractive strategy of PreSumm.

HiStruct+ [20] incorporates hierarchical structure information into pre-trained language models like BERT and RoBERTa to improve extractive summarization. This model is designed to handle the intrinsic structure of documents, considering the hierarchy of sections and subtitles when selecting sentences for the summary. HiStruct+ is particularly effective for long and complex documents with clear hierarchical structures, such as scientific articles. The approach improves summarization accuracy by preserving the logical and semantic structure of the documents.

Other relevant methods are RankSum [10], MemSum [8] and MatchSum [22], but we do not address them in this paper because they have been outperformed by some of the above methods in the literature and also in some of our initial experiments (that we do not report in this paper).

3 Datasets

For our experiments for Brazilian Portuguese, we adopt the CSTNews corpusFootnote 1 [2]. It is a reference and widely known corpus developed to support research in automatic summarization of journalistic texts in Brazilian Portuguese. The corpus comprises 50 clusters of news articles, totaling 140 texts and their respective summaries. Each cluster contains 2 to 3 news articles about the same event, collected from various Brazilian media sources such as Folha de São Paulo, Estadão, O Globo, Jornal do Brasil, and Gazeta do Povo, selected according to their impact at the time of publication. The corpus contains 2,088 sentences and 47,240 words, with an average of 2.8 texts, 41.76 sentences, and 944.8 words per cluster. In addition to the original texts, each cluster includes single-document manual summaries, multi-document manual summaries, and automatic summaries. The corpus is also manually annotated in various ways for syntax, semantics and discourse information.

CNN/Daily Mail [9] is a corpus widely used in automatic summarization research for English. It comprises 286,817 article-summary pairs, each consisting of a news article and a human-generated summary in the form of bullet points covering the main points of the article. The articles are sourced from the CNN and Daily Mail news websites. The corpus was created initially for the passage-based question-answering task but was adapted for summarization research by restoring the bullet points to form multi-sentence summaries. In the training set, the source documents have an average of 766 words distributed over 29.74 sentences, while the summaries consist of 53 words distributed over 3.72 sentences. The corpus is available in two versions: one with real entity names and another with anonymized entities that were replaced by document-specific IDs, facilitating vocabulary reduction and experimentation with deep learning models.

4 Experiment Setup

Each corpus was divided into 70% for training, 15% for validation, and 15% for testing (as some methods required training). The original configurations of each summarizer were maintained, adapting them to the CSTNews corpus. The experimental process is summarized in the following subsections.
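A minimal sketch of such a split is shown below; the shuffling and the fixed seed are our illustrative assumptions, not details reported here:

```python
import random

def split_corpus(items, train=0.70, val=0.15, seed=42):
    """Shuffle with a fixed seed and split into 70/15/15 train/val/test portions."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n = len(items)
    n_train = int(n * train)
    n_val = int(n * val)
    return items[:n_train], items[n_train:n_train + n_val], items[n_train + n_val:]
```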

4.1 Pre-processing

The NLTK library was used for tokenization, stopword removal, and normalization of Portuguese texts. This step is crucial to ensure that the models, originally trained in English, can operate effectively in the new language. Tokenization segments the text into smaller units, such as words or subwords. Stopword removal eliminates common words that do not significantly contribute to the task. Normalization adjusts the text to a more consistent form, facilitating subsequent processing. For HiStruct+ and PreSumm, we performed similar pre-processing using the Stanford CoreNLP tool.
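The steps above can be illustrated with a minimal standard-library sketch; in our experiments the actual tokenization and stopword list came from NLTK (e.g., nltk.word_tokenize and nltk.corpus.stopwords.words('portuguese')), and the stopword set below is only a hypothetical miniature:

```python
import re

# Hypothetical miniature stopword list; in practice, use NLTK's Portuguese list.
STOPWORDS_PT = {"a", "o", "as", "os", "de", "do", "da", "em", "um", "uma", "e", "que"}

def preprocess(text, stopwords=STOPWORDS_PT):
    """Tokenize, lowercase (normalization) and remove stopwords."""
    tokens = re.findall(r"\w+", text.lower())
    return [t for t in tokens if t not in stopwords]
```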

4.2 Model Training and Adjustment

We applied the methods’ original codes with minimal modifications. We also fine-tuned the model parameters to optimize their performance on CSTNews. For this, we adapted the tokenization processes and changed some hyperparameters specific to the language and style of the texts in the CSTNews corpus.

We aimed to replicate the original training conditions of the models as closely as possible despite our limitations. The model used in PreSumm was BERT-base-uncased, while HiStruct+ employed RoBERTa-base, as these were the models used in the original works.

Since we only had access to a CPU, the number of epochs and batch sizes had to be adjusted. PreSumm and HiStruct+ were trained for 7 epochs with a batch size of 14.

The learning rates were the same as in the original works: 2e-3 for PreSumm and HiStruct+.

4.3 Evaluation

The generated summaries were evaluated using ROUGE and BLANC. ROUGE measures the overlap of n-grams between the automatic and human reference summaries. The most commonly used variants are ROUGE-1, which measures unigram (individual word) overlap, ROUGE-2, which measures bigram (word pair) overlap, and ROUGE-L, which measures the longest common subsequence overlap. Each variant can be evaluated in terms of precision, recall, and F1 score. Precision measures the proportion of n-grams in the generated summary that are present in the reference summary, while recall measures the proportion of n-grams in the reference summary that are present in the generated summary; the F1 score is the harmonic mean of the two. Although ROUGE is widely used for its simplicity and effectiveness, it is criticized for not adequately capturing the semantic quality and coherence of summaries, focusing mainly on superficial text similarity.

Conversely, BLANC measures how much the generated summary aids a pre-trained language model (like BERT) in understanding the document. It uses the masked token task (Cloze task) to evaluate the functional utility of a summary. In BLANC-help mode (which is the one we used), the summary is concatenated to each document sentence during inference, and the model’s ability to predict masked words with the summary is measured. The difference in prediction accuracy with and without the summary indicates the summary’s quality: a higher BLANC score means the summary provides more contextual information that helps the model understand the document better. Unlike ROUGE, BLANC does not require human-written reference summaries.
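As an illustration of how ROUGE-N is computed (a simplified sketch; published results typically use the official ROUGE toolkit, with stemming and other options):

```python
from collections import Counter

def rouge_n(candidate, reference, n=1):
    """Compute ROUGE-N precision, recall and F1 from clipped n-gram overlap."""
    def ngrams(tokens, n):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    # Clipped counts: each n-gram is matched at most as often as it occurs in both.
    overlap = sum((cand & ref).values())
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

For example, a candidate that reproduces part of the reference verbatim scores perfect unigram precision but lower recall, since it misses part of the reference content.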

5 Results and Discussion

Tables 1 and 2 show the ROUGE results for the summarization methods on the CSTNews and CNN/DM corpora. For CSTNews, PreSumm demonstrates the highest performance in most ROUGE metrics, particularly in ROUGE-1 F1 (55.55), ROUGE-1 Precision (67.91), and ROUGE-2 Precision (49.64). This indicates its high effectiveness in selecting relevant sentences that closely match the reference summaries at the unigram and bigram levels. However, its lower recall scores suggest that, while maintaining high precision, it may miss some relevant content. For the CNN/DM dataset, PreSumm also shows the highest performance, with a ROUGE-1 score of 53.37. TextRank and Centroid-embedding have lower scores (30.14 and 18.12, respectively), consistent with their lower performance on the CSTNews corpus and demonstrating the need for more advanced models to handle complex text summarization tasks.

Table 1. Results on ROUGE metrics for CSTNews dataset
Table 2. Results on ROUGE metrics for CNN/DM dataset
Table 3. Results on BLANC metric for CSTNews and CNN/DM

Table 3 presents the BLANC results for the same models. Centroid-embedding achieves the highest BLANC score (57.74) for CSTNews. Despite its high ROUGE scores, PreSumm produces the lowest BLANC score, which shows the importance of using complementary metrics for a more holistic evaluation. For the CNN/DM dataset, TextRank achieves the highest BLANC score (26.20). Centroid-embedding also performs well, with a score of 24.35. PreSumm again shows a relatively low BLANC score.

As an illustration, Table 4 shows a summary generated by PreSumm, allowing a comparison with the human-written summary. The summary generated by PreSumm accurately captured Fabiana Murer’s victory and the matching of the medal record, but omitted details about the competitors and the achieved marks.

Table 4. Comparison between summary generated by PreSumm and the human-written summary

Table 5 shows three CSTNews summaries generated by PreSumm, the summarizer which achieved the best ROUGE results. These summaries were randomly chosen for a manual human evaluation. We used some of the known TAC (Text Analysis Conference) criteria [4], which include overall responsiveness and readability. The assessment of overall responsiveness examines how effectively a summary addresses the information need outlined in the topic statement, considering both the content and linguistic quality of the summary. The readability score evaluates the summary’s fluency and structure, independent of content, and is based on factors such as grammatical correctness, lack of redundancy, referential clarity, focus, structure, and coherence. These criteria are evaluated on a five-point scale: from very poor to very good.

We evaluated these three random summaries as “very good” based on the above criteria, since they show no redundancy, no referential problems, and very good structure and coherence. Their high ROUGE scores can reflect the responsiveness of the summaries: since the generated summaries are close to the gold standard, that is, the human-written summaries, they are expected to cover the document’s main content and thus have very good responsiveness.

Table 5. Three CSTNews summaries generated by PreSumm

Table 6 shows three CSTNews summaries generated by Centroid-embedding, the summarizer which achieved the best BLANC results. These summaries were also randomly chosen for a manual human evaluation under the TAC criteria. Similarly to the previous summaries, we also evaluated these three as “very good” regarding overall responsiveness and readability.

Table 6. Three CSTNews summaries generated by Centroid-embedding

These results raise an important question: why did Centroid-embedding achieve such high BLANC scores? Since BLANC measures how much the generated summary aids a pre-trained language model in understanding the original document, the high BLANC scores suggest a high amount of informational content in the summaries. However, based on these three randomly chosen summaries, we could not hypothesize why they had higher BLANC scores than those of PreSumm, for example. Thus, we encourage further research and new experiments with both the ROUGE and BLANC metrics.

While providing valuable insights into the performance of various summarization models on the CSTNews corpus, the present study is subject to several methodological limitations. One significant limitation is the size of the dataset. Although a reference in the area, the CSTNews corpus contains only 140 texts, and more texts than this limited number are needed for training robust summarization models. We therefore emphasize the need for more resources in languages other than English, to enhance the reliability and generalizability of summarization models. We are aware of a recent corpus, RecognaSumm [15], which contains over 130,000 journalistic texts in Brazilian Portuguese, and we plan to perform experiments with it in the future.

Another limitation of this study is the computational resources used. Due to the unavailability of GPU resources, the experiments were conducted using CPUs. The lack of GPU access limited our ability to fine-tune model parameters optimally, potentially impacting the performance of the summarizers. With the availability of GPUs, future studies could perform more extensive parameter tuning, leading to better results.

More efforts are also needed to refine and develop evaluation metrics beyond traditional measures like ROUGE. Metrics such as BLANC, which supposedly consider semantic coherence and fluency, should be further explored and validated. Developing new metrics that can more accurately reflect the quality of summaries in different languages and contexts will be crucial for advancing the field.

6 Final Remarks

We contributed to the investigation of multilingual summarization by evaluating state-of-the-art extractive summarization models on the Brazilian Portuguese CSTNews corpus and the English CNN/DM corpus. Our study utilized ROUGE and the more recently developed BLANC metric, providing insights into the models’ performance. The findings revealed that PreSumm performed well according to ROUGE metrics, while Centroid-embedding had good BLANC scores for both languages.

Future work includes testing other methods for these languages, as well as for other languages, including ones of different linguistic typologies. We also aim at using corpora of diverse genres and domains. Such evaluation is necessary for determining the potential of current and future summarization methods for multilingual summarization.

Another relevant issue consists in exploring multilingual multi-document summarization, leveraging previous research and datasets that exist for Brazilian Portuguese, such as those of [3] and [6]. Exploring the multilingual performance of some classic single- and multi-document methods that were created for Portuguese may be another interesting endeavor, as is the case of GistSumm [16] – the first summarization system for Portuguese – and RSumm [18].

The interested reader may find the datasets, the source codes of the methods and other details about this work at the POeTiSA project website (at https://sites.google.com/icmc.usp.br/poetisa).