1 Introduction

Automatic text summarization applies computational methods to produce a condensed version of one or more input documents while preserving the most relevant information [7]. Particularly in the case of summarizing multiple documents, in addition to selecting the most relevant content, there is the challenge of avoiding redundancies and inconsistencies. News summarization is an example: an ongoing event can be reported by different channels, resulting in different descriptions with different updates.

According to [22], many of the multi-document summarization models combine inputs as a flat sequence and do not consider relationships between documents. We believe the combination of input documents can help provide a more accurate description of the content, and our goal is to improve summarization by leveraging this variety of input documents.

More specifically, we will apply a Siamese network [13] to evaluate different input documents and identify those capable of producing better summaries. Differences between documents arise, for example, from different volumes of information and degrees of objectivity. To emphasize these differences, we will consider different extractive summarization techniques. Extractive summarization aims to produce a summary by selecting a representative subset of the sentences or paragraphs from the source text without modifying them, as opposed to abstractive summarization, which aims to represent the input content with new words and sentences. Despite the promising results with abstractive methods, it is not possible to guarantee that the summaries produced will be consistent with the facts [32], which can be limiting in some usage scenarios.

The Siamese neural network is a natural choice for our approach, as these networks are efficient in comparison learning. We use datasets composed of multiple input documents and at least one reference summary. Every input document is evaluated using the ROUGE metric, and the documents are then reorganized into pairs indicating which one obtained the better evaluation. This arrangement is compatible with the Siamese network, which needs to learn to recognize the qualities that enable one document to be evaluated more highly than another. We used a training dataset and a separate test dataset, demonstrating generalization to unseen data. As an additional benefit, reorganizing documents into pairs produces a larger volume of training data.

Our main contributions are as follows:

  • Proposal of a document comparison method, based on Siamese networks, applied to the automatic summarization of multiple documents;

  • Description of changes made to a variety of summarization methods to take advantage of document prioritization;

  • Experiments validated with statistical tests pointing to relevant differences in our results;

  • Comparison of some popular summarization techniques using the same datasets and testing methodology;

  • Our document prioritization code is made publicly available (Footnote 1).

The remainder of this article is organized as follows. We review the related work in Sect. 2. Section 3 presents the proposed model description. Section 4 describes the datasets, the baseline methods, the changes made to the methods to include document prioritization, and information about the testing methodology. In Sect. 5 the results and discussions are presented. Finally, Sect. 6 presents the conclusions and future work.

2 Related Work

Cross-document relations should be taken into account when summarizing multiple documents. These relationships are typically used by summarization techniques to reduce repetition and identify the most important details. Applying clustering techniques, as in [8, 33], is one way to reduce redundancy.

Graphs are often used to represent the relationships between pieces of text, as in [31]. That work used heterogeneous vertices to represent information at larger granularity, such as sentences, or smaller granularity, such as words. The vertices contained embeddings to represent information, which were updated with graph attention networks [29]. GRAPHSUM is presented in [2]; it analyzes text fragments to select sentences. Text fragments are represented as nodes, and a variation of the PageRank method is proposed to favor positive correlations.

The encoder-decoder framework is a widely used method in abstractive summarization approaches, including multi-document scenarios. With this method, the encoder layer aims to create a representation of the input texts, and the decoder generates new sentences using this representation. The methodology has shown good results, although there are still some challenges, especially with long documents. To mitigate this difficulty, [15] proposed the use of graphs to guide abstractive summarization.

A popular dataset for evaluating multi-document approaches is presented in [9], which also proposes a summarization technique. The suggested approach is a modification of the Pointer-Generator summarizer [26], applying an attention layer that includes the importance of sentences. To determine the importance of sentences, a bidirectional LSTM network is applied to produce embeddings. Using the sentence representations as input, a similar process is applied to produce the document representation. These outputs are used to identify the most important sentences through a variation of the MMR [3] technique.

In the works retrieved, the main motivation for extracting cross-document relations is the reduction of redundancy and identification of the most important sentences. In contrast to these approaches, our goal is to evaluate entire documents. Beyond just finding documents with the most pertinent information, our method seeks to identify those that are capable of producing the best summaries. A possible application, which will be demonstrated in our experiments, is the use of an additional layer in the evaluation of sentences, prioritizing the sentences that are present in the best documents.

3 Model Description

The suggested approach seeks to determine which articles produce the best summaries, allowing documents to be ranked in order of importance for information extraction. The hypothesis is that better summaries can be produced when more sentences are extracted from the best documents. In the datasets we consider, which are presented in Sect. 4, we observe that each sample is composed of a variable number of input documents. For simplicity, we suggest a technique for evaluating pairs of documents, which allows us to rank documents by applying multiple comparisons.

In this way, the model needs to be able to determine which of the two input texts has the highest likelihood of yielding a good summary, and Siamese networks are useful for learning by comparison [16]. In its simplest configuration, a Siamese network is composed of two attribute extraction networks, one for each input, and a comparison head. This architecture is flexible enough to allow deep-learning techniques to be applied to the feature extraction layer, which is a configuration that has achieved state-of-the-art results in different tasks [1, 14, 30]. Additionally, the two feature extraction networks have identical connection weights, ensuring a similar processing method for both input data.

In our model, as depicted in Fig. 1, each input document is processed by a neural network and produces a representation as document embeddings. These embeddings are used in a classifier to identify the best document. During training, the error observed in the classification is used to adjust the entire network, including the two networks that produce the document embeddings. With this approach, the reduction of errors in the classifier occurs through the optimization of document representations, which requires the network to learn how to identify the most useful attributes for evaluating document quality.

Fig. 1.
figure 1

Document comparison using Siamese network. Two documents are evaluated by neural networks with the same connection values. The output is embeddings that are input to a classifier.

Figure 2 shows the network in more detail, up to the point where the document representation is produced. In this figure, the input documents are used to produce word embeddings that are processed by a convolutional neural network (CNN) [17]. CNNs provide adjustable configurations that make it possible to evaluate accuracy gains as a function of computational cost, in addition to supporting parallel processing. Furthermore, this type of network has been successfully applied to a variety of problems such as news classification [10], medical image classification [5] and energy demand prediction [25].

Our network is composed of one-dimensional filters, with multiple planes and different window sizes. This means that the filters treat each word embedding as a multidimensional point, that the filters are moved along the text, and that at each iteration several tokens are evaluated simultaneously, depending on the window size. Filter results are compiled in a max-pooling layer. During training, we included a dropout procedure, which in our experiments helped to increase generalization. At the end, a fully connected layer performs dimensionality reduction and produces the document representation. The step after producing document embeddings, not represented in this figure, is the union of the two document representations in a fully connected layer and a classifier activated with a sigmoid function.
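To make the embedding stage concrete, the forward pass can be sketched as below. This is a minimal numpy sketch with untrained random weights and toy dimensions (the actual model uses 128 kernels and 32-value document embeddings); all function and variable names are ours, not taken from the released code.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_document(word_embs, filters, W_dense, b_dense, drop_rate=0.0):
    """Forward pass of the document encoder (untrained, for illustration).

    word_embs: (seq_len, emb_dim) word embeddings of one document.
    filters:   dict mapping window size -> (n_kernels, window, emb_dim) weights.
    One-dimensional convolutions with several window sizes are applied,
    each followed by max-pooling over time; the pooled features are
    concatenated and reduced by a dense layer to the document embedding.
    """
    pooled = []
    for window, W in filters.items():
        seq_len = word_embs.shape[0]
        outs = np.empty((seq_len - window + 1, W.shape[0]))
        for t in range(seq_len - window + 1):
            patch = word_embs[t:t + window]                 # (window, emb_dim)
            outs[t] = np.tensordot(W, patch, axes=([1, 2], [0, 1]))
        outs = np.maximum(outs, 0.0)                        # ReLU
        pooled.append(outs.max(axis=0))                     # max-pooling over time
    feats = np.concatenate(pooled)
    if drop_rate > 0.0:                                     # dropout (training only)
        feats = feats * (rng.random(feats.shape) >= drop_rate) / (1 - drop_rate)
    return feats @ W_dense + b_dense                        # dense reduction

emb_dim, n_kernels, doc_dim = 8, 4, 3
filters = {w: rng.normal(size=(n_kernels, w, emb_dim)) for w in (2, 3)}
W_dense = rng.normal(size=(len(filters) * n_kernels, doc_dim))
b_dense = np.zeros(doc_dim)
doc_emb = encode_document(rng.normal(size=(10, emb_dim)), filters, W_dense, b_dense)
```

In the Siamese setting, the same `filters`, `W_dense`, and `b_dense` would be applied to both input documents, realizing the shared connection weights.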

Fig. 2.
figure 2

Document embedding generation. Tokens are converted to chains of word-embeddings, which are evaluated by convolutional filters. Outputs are reduced with max-pooling and document embeddings are produced with a dense layer.

To train the network, we need labeled pairs of documents that we construct using the training data. We start by evaluating all documents with the ROUGE metric. Using this evaluation, we choose two documents, \(d_1\) and \(d_2\), and assign the label 1 if the ROUGE evaluation of \(d_1\) is better than \(d_2\) and 0 otherwise. We take care during this process to maintain class balance and avoid duplicating documents in order to prevent overfitting. Furthermore, as pointed out by [28], differences in the lengths can interfere with the evaluation with ROUGE, so we are using summaries of the input documents truncated with the same number of tokens. For simplicity, summaries are produced by extracting the initial tokens from documents and we leave other approaches for future work.
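The pair-construction step can be sketched as follows, assuming each document's ROUGE evaluation has already been computed. Shuffling followed by disjoint pairing is one simple way to avoid duplicating documents and to keep the labels roughly balanced; the helper names are ours.

```python
import random

def build_training_pairs(docs, rouge_scores, seed=0):
    """Arrange documents into labeled pairs for Siamese training.

    `rouge_scores[i]` is the precomputed ROUGE evaluation of the
    truncated summary of `docs[i]` against the reference. Documents
    are shuffled and paired disjointly, so no document appears in
    more than one pair; label 1 means the first document of the
    pair obtained the better evaluation.
    """
    order = list(range(len(docs)))
    random.Random(seed).shuffle(order)
    pairs = []
    for a, b in zip(order[::2], order[1::2]):
        if rouge_scores[a] == rouge_scores[b]:
            continue  # skip ties: no clear "better" document
        label = 1 if rouge_scores[a] > rouge_scores[b] else 0
        pairs.append((docs[a], docs[b], label))
    return pairs

docs = ["doc a", "doc b", "doc c", "doc d"]
scores = [0.40, 0.25, 0.33, 0.10]   # hypothetical ROUGE evaluations
pairs = build_training_pairs(docs, scores)
```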

To apply the trained network to order sets of documents, we need a strategy using pairwise comparisons. Since each sample we considered was composed of a few input documents, for the sake of simplicity, we are applying the comparison to all possible pairs. In this way, considering all possible pairs of documents, we count the number of times that a document \(d_x\) obtained a score higher than \(d_y\), according to Eq. 1. In this equation, D is the set of input documents, and \(S(d_i)\) is the output of the Siamese network.

$$\begin{aligned} score(d_x) = \mid \{ d_y \in D \mid d_y \ne d_x,\; S(d_x) > S(d_y) \} \mid \end{aligned}$$
(1)

Equation 1 is repeated for all documents allowing the identification of the best ones. To simplify integration with the other methods that will be presented in Sect. 4, we normalize the document scores to sum to 1.
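In code, the scoring of Eq. 1 together with the normalization described above can be sketched as follows. The `better` callback stands in for the trained Siamese comparison, and the length-based toy comparator is purely illustrative.

```python
def rank_documents(docs, better):
    """Score documents by pairwise wins, following Eq. 1.

    `better(dx, dy)` stands in for the trained Siamese comparison and
    should return True when dx is judged the better document. The raw
    win counts are normalized to sum to 1, as described above.
    """
    wins = {dx: sum(1 for dy in docs if dy != dx and better(dx, dy))
            for dx in docs}
    total = sum(wins.values())
    return {d: w / total for d, w in wins.items()} if total else wins

# Toy comparator for illustration: shorter documents "win".
docs = ["aaa", "bb", "cccc"]
ranking = rank_documents(docs, lambda x, y: len(x) < len(y))
```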

4 Materials and Methods

This section presents the materials used in the experiments, including the datasets and baseline methods. It also describes the changes we made to include document prioritization and some implementation details.

4.1 Datasets

Few datasets are available for multi-document summarization, mainly because producing reference summaries is labor-intensive. Some of the datasets we use are composed of news, where it is possible to find different sources describing the same event.

Multi-News [9]: Contains articles extracted from the news aggregation site Newser (Footnote 2). This site contains human-written news summaries that include citations to external sources of information, gathering information from over 1,500 news channels. To compose this dataset, the summaries are used as references and the cited news articles as sources. There are a few versions of this data, and we are using the pre-processed version without truncation.

WCEP [11]: Contains documents extracted from the Wikipedia Current Events Portal (Footnote 3). Following Wikipedia guidelines, summaries are short, approximately 35 words, written in the present tense, and avoid opinions and sensationalism. Each summary averages only 1.2 cited sources, but this number has been augmented with documents extracted from the Common Crawl News dataset (Footnote 4). We are using the version distributed by the authors, which was truncated at a maximum of 100 documents per sample.

WikiSum [18]: Wikipedia was used to create this dataset by providing reference summaries. The task is to create the lead, or first part, of a Wikipedia page using the references listed in the article and ten Google search engine results. However, due to its enormous size, some works use only part of this data. In this work we are using the partition of [20], which contains the first 40 paragraphs selected with a logistic regression method trained on ROUGE-2 recall scores obtained by comparing paragraphs with the reference summary. Since the input data is made up of separate paragraphs that are not identified with the documents from which they were taken, we treat each paragraph as a separate document.

arXiv [4]: Scientific publications taken from arXiv (Footnote 5) are included in this dataset. Since many of the datasets currently in use are composed of news articles or other short content, the objective was to develop a dataset composed of longer documents. The extracted content contains the abstracts, which are used as reference summaries, and the article sections, which are used as input. Since we are interested in multi-document datasets, we consider each section as an input document.

Multi-XScience [21]: A dataset of scientific articles, where the objective is to produce the related-work section using the abstract of the article itself and a collection of reference documents. The articles were extracted from arXiv, and related documents were identified by applying a set of heuristics to the Microsoft Academic Graph [27].

Table 1 presents a summary of the evaluated datasets. The “Train Test Val” columns correspond to the number of samples in each data partition. “Doc” is the average number of input documents per sample. “Doc/Ref Len” is the average number of tokens per document and reference, calculated using the nltk library (Footnote 6).

Table 1. Dataset attributes

4.2 Baseline and Integration Methods

In this section, we describe the methods we used to compare the results and the procedure we applied to integrate with document prioritization.

LeadSum: Since a part of the datasets is composed of news, we are considering extracting content from the beginning of articles. This strategy is justified by the fact that in journalistic content the first sentences are usually the most important [12]. In this way, all content is merged into a single large document, and the initial content is extracted until the length defined for the summary is reached.

DocScore+LeadSum: To use document prioritization, we are reordering the input documents, moving the most relevant ones to the beginning. This way, if the most relevant document is larger than the size of the summary, all content will be extracted from this source. Otherwise, content from other documents will be used following the same order of relevance.
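This reordering-plus-extraction step can be sketched as follows; whitespace tokenization and the function name are our simplifications.

```python
def doc_score_lead_sum(docs, doc_scores, max_tokens):
    """LeadSum applied after reordering documents by priority score.

    Documents are sorted best-first, concatenated, and the leading
    tokens are extracted until the summary budget is reached.
    Whitespace tokenization is used here for simplicity.
    """
    ordered = [d for _, d in
               sorted(zip(doc_scores, docs), key=lambda p: -p[0])]
    tokens = " ".join(ordered).split()
    return " ".join(tokens[:max_tokens])

docs = ["worst doc text here", "best doc text here", "middle doc text here"]
summary = doc_score_lead_sum(docs, [0.1, 0.6, 0.3], max_tokens=5)
```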

TextRank [23]: An unsupervised extractive summarization method based on graphs. In this method, the sentences are the vertices, and the similarity between sentences defines the connection weights. Using this graph, PageRank is applied to assign a score to each sentence, which is used to rank the sentences. Summaries are produced by selecting the sentences with the highest scores. In our implementation, the similarity between sentences was calculated with Tf-Idf (term frequency–inverse document frequency).
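A compact sketch of this procedure, with Tf-Idf cosine similarities and a plain power-iteration PageRank, might look like the following. This is our simplified illustration, not the implementation used in the experiments.

```python
import math
from collections import Counter

def tfidf_vectors(sentences):
    """Sparse Tf-Idf vectors (dicts) for whitespace-tokenized sentences."""
    tokenized = [s.lower().split() for s in sentences]
    n = len(tokenized)
    df = Counter(w for toks in tokenized for w in set(toks))
    return [{w: tf * math.log(n / df[w]) for w, tf in Counter(toks).items()}
            for toks in tokenized]

def cosine(a, b):
    dot = sum(v * b[w] for w, v in a.items() if w in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def textrank(sentences, damping=0.85, iters=50):
    """Score sentences with PageRank over a Tf-Idf similarity graph."""
    vecs = tfidf_vectors(sentences)
    n = len(sentences)
    sim = [[cosine(vecs[i], vecs[j]) if i != j else 0.0 for j in range(n)]
           for i in range(n)]
    out_weight = [sum(row) for row in sim]   # total outgoing similarity
    scores = [1.0 / n] * n
    for _ in range(iters):
        scores = [(1 - damping) / n
                  + damping * sum(sim[j][i] / out_weight[j] * scores[j]
                                  for j in range(n)
                                  if sim[j][i] and out_weight[j])
                  for i in range(n)]
    return scores

sents = ["the cat sat on the mat",
         "a cat sat on a mat",
         "dogs run in the park"]
sentence_scores = textrank(sents)
```

In this toy example, the first two sentences are mutually similar and therefore receive higher scores than the third, which is only weakly connected to the graph.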

DocScore+TextRank: The result of processing with TextRank is a score for each sentence. To integrate with document prioritization, we add the document score to the sentence score, resulting in a greater probability of selecting sentences from the best documents. In this sum, we apply a weighting factor, according to Eq. 2, where \(T(s_{d, i})\) is the score obtained with TextRank for sentence i of document d, S(d) is the document score obtained with our model and \(\alpha \) is an adjustable parameter.

$$\begin{aligned} score(s_{d, i}) = T(s_{d, i}) + \alpha * S(d) \end{aligned}$$
(2)
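Equation 2 amounts to a simple per-sentence adjustment, sketched below with hypothetical scores; the dictionary layout is our own choice for illustration.

```python
def combined_scores(textrank_scores, doc_scores, alpha=0.5):
    """Combine TextRank sentence scores with document priority (Eq. 2).

    textrank_scores maps a document id to its list of sentence scores
    T(s_{d,i}); doc_scores maps a document id to its normalized
    priority S(d); alpha is the adjustable weighting factor, tuned on
    validation data in the experiments.
    """
    return {doc_id: [t + alpha * doc_scores[doc_id] for t in sent_scores]
            for doc_id, sent_scores in textrank_scores.items()}

tr = {"d1": [0.30, 0.10], "d2": [0.25]}   # hypothetical TextRank scores
ds = {"d1": 0.2, "d2": 0.8}               # hypothetical document priorities
combined = combined_scores(tr, ds, alpha=0.5)
```

With these values, the single sentence of the prioritized document "d2" (0.25 + 0.5 · 0.8 = 0.65) overtakes the best sentence of "d1".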

PacSum [34]: This is a summarization method that represents the input contents in the form of digraphs. The authors argue that, when selecting sentences to construct summaries, there are those that contain the most relevant information and those that are complementary. A simple approach is proposed to identify the most relevant sentences considering their position of occurrence in the text. In the proposed digraph, the vertices are sentence embeddings, and the direction of the connections is defined by the order of occurrence of the sentences in the text. The most central sentences are selected, and centrality is defined with Eq. 3. In this equation, i and j are positions of sentences in the input documents, \(s_i\) is a sentence, \(e_{i,j}\) is a measure of similarity between sentences \(s_i\) and \(s_j\), and \(\lambda _1\) and \(\lambda _2\) are adjustable parameters that determine the importance of the initial and final sentences, respectively.

$$\begin{aligned} centrality(s_i) = \lambda _1 \sum _{j<i}e_{i,j} + \lambda _2 \sum _{j>i}e_{i,j} \end{aligned}$$
(3)
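Equation 3 can be computed directly from a similarity matrix, as in the following sketch with hypothetical similarities (the actual method derives \(e_{i,j}\) from sentence embeddings).

```python
def pacsum_centrality(sim, lam1, lam2):
    """Sentence centrality following Eq. 3.

    sim[i][j] is the similarity e_{i,j} between sentences i and j,
    indexed in order of occurrence; lam1 weights similarity to
    earlier sentences (j < i) and lam2 to later ones (j > i).
    """
    n = len(sim)
    centrality = []
    for i in range(n):
        before = sum(sim[i][j] for j in range(i))
        after = sum(sim[i][j] for j in range(i + 1, n))
        centrality.append(lam1 * before + lam2 * after)
    return centrality

# Hypothetical symmetric similarities between three sentences.
sim = [[0.0, 0.5, 0.2],
       [0.5, 0.0, 0.4],
       [0.2, 0.4, 0.0]]
cent = pacsum_centrality(sim, lam1=0.3, lam2=1.0)
```

With \(\lambda _2 > \lambda _1\), as in this example, similarity to later sentences counts more, favoring sentences that occur early in the text.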

DocScore+PacSum: PacSum’s adjustable parameters are calibrated through a supervised process. In particular, the parameters \(\lambda _1\) and \(\lambda _2\) are configured to prioritize the initial and final sentences. This way, the method already has a mechanism in addition to sentence similarity that we will reuse to prioritize the most important documents. To achieve this, our approach was to reorder the input documents, keeping the most relevant ones first, which allows the parameters \(\lambda _1\) and \(\lambda _2\) to be adjusted more effectively.

BertSum [19]: A supervised method for producing extractive summaries using the BERT [6] network. The network receives the content to be summarized, which is modified to include sentence separators so that contextualized sentence embeddings are produced. To select sentences, several approaches are evaluated, including a simple classifier, an inter-sentence Transformer, and an LSTM network. As an additional resource to avoid redundancy, summaries are produced while avoiding sentences with repeated trigrams.

DocScore+BertSum: It is challenging to use BertSum when there is a large amount of input data because it processes all input sentences at once. In the BERT network, input documents are truncated to 512 tokens, which in our test case can lead to the deletion of complete documents. To minimize this limitation, we reorder the input documents in a manner similar to our approach with PacSum, allowing the best documents to participate in the summarization.

Oracle: Inserted as an upper bound for extractive methods. We use a greedy approach similar to the one presented in [19]. Starting with an empty set, in each iteration we include the sentence, among all sentences in all documents, that most increases the ROUGE score. Furthermore, repeated sentences are not included. The process is repeated until the size defined for the summary is reached or no sentence increases the summary score.
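The greedy oracle can be sketched as follows. For a self-contained example, a simple unigram-recall proxy stands in for the actual ROUGE computation; all names are ours.

```python
def unigram_recall(summary_tokens, ref_tokens):
    """Toy stand-in for ROUGE: fraction of reference unigrams covered."""
    ref = set(ref_tokens)
    return len(ref & set(summary_tokens)) / len(ref)

def greedy_oracle(sentences, reference, max_tokens):
    """Greedily add the sentence that most increases the (proxy) ROUGE
    score, skipping repeats, until the token budget is reached or no
    sentence improves the score."""
    ref_tokens = reference.lower().split()
    chosen, chosen_tokens, best_score = [], [], 0.0
    while True:
        best_gain, best_sent = 0.0, None
        for s in sentences:
            if s in chosen:
                continue  # repeated sentences are not included
            cand = chosen_tokens + s.lower().split()
            if len(cand) > max_tokens:
                continue  # would exceed the summary budget
            gain = unigram_recall(cand, ref_tokens) - best_score
            if gain > best_gain:
                best_gain, best_sent = gain, s
        if best_sent is None:
            break  # no remaining sentence increases the score
        chosen.append(best_sent)
        chosen_tokens += best_sent.lower().split()
        best_score = unigram_recall(chosen_tokens, ref_tokens)
    return chosen

sents = ["cats like milk", "dogs chase cats", "the sky is blue"]
oracle_summary = greedy_oracle(sents, "cats and dogs like milk", max_tokens=8)
```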

Oracle Lead: An additional upper bound that exclusively takes document ordering into account. Input documents were evaluated with ROUGE and ordered best first. Using the data in this sequence, we apply LeadSum, which is the simplest summarizer.

4.3 Implementation Details

Our goal when running the experiments is to find out whether adding document prioritization improves the quality of summaries. To achieve this, we conducted experiments on the datasets and evaluated how well the baseline and modified versions of the algorithms performed. All training of the document prioritization model was carried out on the training data and monitored on the validation data. We terminated training when no reduction in classification error was observed on the validation set for three consecutive iterations.

To determine the model settings, we performed tests on the Multi-News dataset and applied the same configuration to all datasets. In our evaluation, we only considered the quality of the results, disregarding the computational cost, although in our studies larger networks did not always produce better results. We used 128 convolutional kernels and produced document embeddings with a length of 32 values. The input data was converted to lowercase, and we use a vocabulary with the 10,000 most frequent lemmatized tokens. The word embeddings were generated with fastText [24], which has a sub-word mechanism capable of avoiding out-of-vocabulary occurrences. We used the distribution trained on Common Crawl and allowed these representations to be optimized during training. As mentioned in Sect. 3, the input documents are truncated to the same size as expected for the summary.

Some of the baseline methods also require setting parameters. The modified version of TextRank includes the \(\alpha \) parameter, which needs to be defined. To do so, we generated summaries for various values of \(\alpha \) using the validation data. We used the first 1,000 samples from the validation set to evaluate ten \(\alpha \) values between 0 and 1.0, using the best-performing configuration to produce the summaries on the test set. For PacSum we use the optimizer provided by the authors, which selects values using grid search. In this process, as in the original article, we use the validation set. We also use the version of BERT that was fine-tuned by the method's authors.

We configured BertSum to select sentences with a classifier in the output layer and maintained the default settings, which include the block-trigram strategy. We trained the model with the training dataset, evaluating on the validation set every 1,000 training iterations. Training was stopped after three consecutive evaluations with no decrease in error on the validation set. Because WikiSum is a very large dataset, when training BertSum on it we reduced the validation set to the initial 3,200 samples. We also reduced the training set to 20% of the total, which is enough to complete training within the first epoch, since in our experiments training stopped after using approximately 5% of the data.

According to [28], evaluation using ROUGE can be influenced by the length of the summaries. Therefore, when generating summaries, enough sentences are chosen and the final sentence is truncated to reach the specified number of tokens, producing summaries that are exactly the same length across all methods. Summary lengths were chosen through a literature search: we use 300 tokens for Multi-News, 40 for WCEP, 60 for WikiSum, 220 for arXiv, and 120 for Multi-XScience.
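The length-normalization step can be sketched as follows, using whitespace tokenization for simplicity.

```python
def truncate_summary(selected_sentences, max_tokens):
    """Concatenate the selected sentences and truncate to exactly
    `max_tokens` whitespace tokens, so that all methods produce
    summaries of precisely the same length."""
    tokens = []
    for sentence in selected_sentences:
        tokens.extend(sentence.split())
        if len(tokens) >= max_tokens:
            break
    return " ".join(tokens[:max_tokens])

truncated = truncate_summary(["one two three", "four five six"], max_tokens=4)
```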

All experiments were repeated 10 times, each with a different random seed. Using the Wilcoxon rank-sum statistic, we assessed the statistical significance of the differences between the results of the prioritized version and those of the original methods. For the statistical tests, it is also necessary to repeat the experiments with the original versions of the methods. Since in the LeadSum, PacSum, and BertSum methods the order of documents matters, we randomly changed the order of the input documents in each experiment. With this approach, we compare random reordering of the input documents against reordering with our prioritization method.
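For reference, the rank-sum comparison can be sketched in a few lines using the normal approximation (a library routine such as scipy's would normally be used instead); the scores below are hypothetical.

```python
import math

def rank_sum_p_value(a, b):
    """Two-sided Wilcoxon rank-sum test via the normal approximation.

    A minimal sketch for comparing the repeated ROUGE scores of the
    prioritized and original versions of a method.
    """
    values = sorted(a + b)
    ranks = {}
    i = 0
    while i < len(values):
        j = i
        while j < len(values) and values[j] == values[i]:
            j += 1
        ranks[values[i]] = (i + j + 1) / 2  # average 1-based rank for ties
        i = j
    r_a = sum(ranks[v] for v in a)          # rank sum of the first sample
    n1, n2 = len(a), len(b)
    mu = n1 * (n1 + n2 + 1) / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (r_a - mu) / sigma
    # Two-sided p-value from the standard normal distribution.
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

prioritized = [0.42, 0.45, 0.44, 0.47, 0.43]  # hypothetical ROUGE scores
original = [0.30, 0.31, 0.29, 0.32, 0.33]
p_value = rank_sum_p_value(prioritized, original)
```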

The original version of TextRank does not use random numbers and is not affected by the order of documents, so we use another comparison approach. In its original implementation, the ROUGE metric is presented along with confidence intervals, which are calculated with bootstrap for a 95% confidence factor. In this way, we performed ten repetitions of our modified version and compared the median with this confidence interval.

5 Results and Discussion

5.1 Classification Scores

In this section, the classification results using the Siamese network are presented. The input data are the document pairs produced as indicated in Sect. 3, so these results refer to binary classification to identify the best document. Document pairs were produced for the training, testing, and validation data, and we are presenting the results with the testing data after the training is complete. The results are presented in Table 2.

Table 2. Siamese Network classification results

We conclude from the results that the suggested network helps identify the best documents. Despite this, especially on the Multi-XScience dataset, the accuracy was not very high. We therefore conducted experiments to determine the situations in which the model made the most mistakes. In particular, we were interested in the relationship between the ROUGE score difference of a document pair and classification performance.

Since each training sample is composed of a pair of documents, we consider the difference in ROUGE evaluations between the documents, reusing the evaluation applied to construct the training data. Ordering the pairs by this difference, we divide the test data into groups of 500 samples. Thus, the first group corresponds to documents with similar evaluations and the last group contains documents with greater differences in quality.

For each group, we calculate the F1-score and the mean ROUGE difference. We calculated the Spearman correlation between these values; the results are in the last column of Table 2, where a very strong correlation can be verified. This indicates that the proposed Siamese network works better when the input documents differ widely in evaluation and not as well when the evaluations are very similar.
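The grouping-and-correlation analysis can be sketched as follows, using accuracy instead of F1-score to keep the example short and synthetic data in place of the real test samples.

```python
import math

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks (no ties assumed)."""
    def to_ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        for rank, i in enumerate(order):
            r[i] = rank + 1.0
        return r
    rx, ry = to_ranks(x), to_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy) if sx and sy else 0.0

def group_stats(diffs, correct, group_size):
    """Sort samples by ROUGE difference, split into fixed-size groups,
    and return (mean difference, accuracy) per group."""
    order = sorted(range(len(diffs)), key=lambda i: diffs[i])
    stats = []
    for start in range(0, len(order), group_size):
        idx = order[start:start + group_size]
        mean_diff = sum(diffs[i] for i in idx) / len(idx)
        acc = sum(correct[i] for i in idx) / len(idx)
        stats.append((mean_diff, acc))
    return stats

# Synthetic example: the classifier is right more often for larger differences.
diffs = [0.01, 0.02, 0.03, 0.04, 0.11, 0.12, 0.13, 0.14, 0.21, 0.22, 0.23, 0.24]
correct = [0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1]
stats = group_stats(diffs, correct, group_size=4)
rho = spearman([m for m, _ in stats], [a for _, a in stats])
```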

These results are also represented in Fig. 3. To simplify visualization, this figure shows a simplified version of the correlation experiment: we applied the same ordering procedure by ROUGE difference, but separated the results into just 10 groups. The figure shows that classifier performance is modest when the ROUGE difference is close to zero, but improves substantially as the difference increases.

Fig. 3.
figure 3

F1-score as a function of the difference in quality of documents evaluated with ROUGE.

5.2 Summarization Scores

We evaluate the results with the ROUGE-1, ROUGE-2, and ROUGE-L metrics. Table 3 presents the ROUGE-1 results, corresponding to the medians of 10 repetitions of the experiments. The results with the other metrics were equivalent and are available with our code. We highlight in bold the best results that are statistically significant under the Wilcoxon rank-sum test at a 95% confidence level. For TextRank, we highlight the best results that exceed the confidence margins calculated with bootstrap at a 95% confidence level. When the results are not different enough to achieve statistical significance, we highlight both values in bold.

Evaluating these results, we can see that document prioritization had a significant impact, which is evident in Oracle Lead. Even with the simplest summarizer, perfect prioritization of the input documents brought the best results. Although it does not achieve perfect ordering, the proposed model for prioritizing documents allowed important improvements. Across the presented results, the median ROUGE-1 improvement was 2.7 points.

Document prioritization brought benefits in most results and, in the worst case, obtained results equivalent to the non-prioritized version. Equivalent results were obtained with TextRank on the WCEP and Multi-XScience datasets. The original version of TextRank is not influenced by document order, and prioritization gives an advantage to the best documents. In the case of WCEP, considering that it is the dataset with the largest volume of input documents, it might be necessary to increase the advantage of the best documents, which could be achieved with higher values of \(\alpha \) in Eq. 2. The Multi-XScience dataset had the smallest increase in evaluation and was also the dataset with the lowest classification accuracy, as presented in Sect. 5.1.

Table 3. ROUGE-1 results, F1-Score (%)

6 Conclusions

We presented a document comparison approach using a Siamese network that is useful for identifying the best documents in multi-document summarization tasks. As shown in the results, document prioritization can help increase the quality of summaries produced with different summarization techniques.

The integration process with several methods from the literature was presented, and a possible future work is integration with other methods.

Since each experiment was conducted with identical tools and procedures, in addition to evaluating document prioritization, it is possible to compare different summarization techniques.

The original and prioritized versions were compared in pairs using statistical tests, which indicated significant differences in the results.