1 Introduction

Video summarization is a challenging task that has gained significant attention in the computer vision and multimedia communities [1, 19]. One of the goals of video summarization is to extract essential information from a video and present it in a condensed format [9, 10, 16, 18]. This task is essential for applications such as video captioning, surveillance, synopsis of news videos [4, 5, 13], and video retrieval, among others [1]. The video summarization task involves several sub-tasks, such as keyframe extraction, object tracking, and summarization itself. The keyframe extraction step selects representative frames that capture the essence of the video, while object tracking aims to track important objects across frames [2]. The summarization step involves selecting a subset of keyframes that provide a comprehensive summary of the video while minimizing redundancy [6]. Video summarization techniques can be categorized into unsupervised and supervised approaches, depending on the availability of training data. While unsupervised techniques aim to identify patterns in the video data without any prior knowledge, supervised techniques require labeled data to train the summarization model [1, 19].

Video summarization is particularly useful when dealing with a video collection containing lots of repeated or redundant information spread out over many points in time [6]. In such cases, it becomes a challenge to analyze the entire video and efficiently extract useful information. Video summary techniques can help identify the most important frames in the video that are likely to contain unique and relevant information. By summarizing the video, one can achieve a condensed version that retains the most relevant information while reducing the overall size of the video collection [6]. This allows us to efficiently analyze large video datasets and highlight the most important information, improving the overall effectiveness of video analysis tasks [10, 16].

Fig. 1.
figure 1

Example of a summary generated by the HieTaSumm method compared with the groundtruth. In this case, the HieTaSumm method returns 13 keyframes for video v21 of the OpenVideo dataset, while two annotators select fewer: the first (User 1) selects 11 keyframes and the second (User 4) selects 9 keyframes.

Figure 1 shows that creating a single summary for a video that accurately reflects every user’s perception and preferences can be a challenging task. Since groundtruth data is generated by humans, each user’s interpretation of what is essential and relevant may vary. To generate the groundtruth for the video summarization task, annotators must watch the entire video and identify the most crucial moments. However, what one annotator perceives as essential may differ from what another does, leading to a subjective groundtruth. Hence, the subjectivity of the groundtruth generated by each user is a critical aspect to consider in a machine learning method [2, 16].

Regardless of the difficulties related to the subjectivity of groundtruth generated by several users, many unsupervised methods have been proposed over the years. In [11], the authors presented a platform for customizing video summaries. Using clustering techniques, they proposed a method named VISTO, which analyzed low-level features to determine the similarity between frames. Keyframes are selected as the centers of each cluster; then a post-processing step analyzes and removes possible frame redundancies. In [9], the authors presented a clustering-based strategy, named VSUMM, to solve the video summarization task. First, a sampling process reduces the number of frames under analysis. Then, frames represented by color histograms are grouped into similar sets by a k-means algorithm. VSUMM tended to group frames dispersed in time, which may have considerable temporal separation. In [17], the authors presented a graph-based approach for video summarization named HSUMM. The proposed approach was hierarchical and comprised keyframe extraction, scene segmentation, and video summarization stages. During the keyframe extraction stage, their method selected representative frames based on image quality and diversity. In the scene segmentation stage, the video was divided into different scenes based on the visual similarity between frames. Finally, keyframes were combined to generate a video summary. The approach employed a hierarchical graph-based clustering capable of generating effective video summaries. In [15], the authors presented an unsupervised approach for summarizing a collection of videos, developing a diversity-aware optimization method for multi-video summarization that explores the videos’ complementarity.

The video summarization landscape has evolved over the last few years, especially after the introduction of deep learning algorithms. The study in [14] focused on egocentric video summarization and the challenges of this task. In [3], the authors concentrated on summarization methods that are directly applied to the compressed domain. Finally, the authors in [20] presented the relevant bibliography for dynamic video summarization. According to [1], in deep-learning-based video summarization methods, the video content is represented by deep feature vectors extracted by pre-trained neural networks. The extracted features are then utilized by a deep summarizer network, whose output can be either a set of keyframes (i.e., a static summary) or a set of video fragments (that form a dynamic summary).

The study conducted by [6] introduces a hierarchical approach for generating keyframes for video captioning. Their approach involves generating a fixed number of keyframes per video, which serves as a representative distribution of the video content. These keyframes are then used as input for a transformer-based method. In contrast to the approach proposed in [6], our approach takes a different direction. We propose the creation of a dynamic number of keyframes to serve as a comprehensive summary for each video. This approach recognizes the inherent variability in video content and aims to capture the most salient frames specific to each video. Consequently, the number of keyframes in the summary differs across videos within the same dataset. By employing a dynamic keyframe selection process, the proposed approach enables a more discriminating representation of the video content compared to the fixed-number approach suggested by [6].

This work proposes an unsupervised method for video summarization that considers changes in video content over time, named Hierarchical Time-aware Summarizer (HieTaSumm). Similarly to recent deep-learning-based approaches, the proposed method uses pre-trained neural networks to generate video frame descriptions. However, it does not adopt a deep summarizer network, avoiding the challenges related to its training. Instead, a hierarchical graph-based clustering strategy is adopted. It is worth mentioning that the proposed method assesses frame importance over time when selecting the keyframes that comprise the video summary, which differs from other hierarchical approaches. The major contributions of this work are two-fold: (i) a strategy for video summarization that incorporates frame importance over time for selecting keyframes; and (ii) the identification of keyframes through a hierarchical graph-based clustering using deep-learning-based descriptors and a dynamic strategy to define summary sizes.

This work is organized as follows. Section 2 defines the main concepts used in this work. Section 3 presents the proposed method, followed by the experimental results in Sect. 4. Finally, Sect. 5 draws some conclusions and proposals for future work.

2 Fundamental Concepts

Let \({\mathbb {A}}\subset {\mathbb {N}}^2\), \({\mathbb {A}}=\{0,\ldots ,\textrm{H}-1\}\times \{0,\ldots ,\textrm{W}-1\}\), where \(\textrm{H}\) and \(\textrm{W}\) are the height and width of each frame, respectively, and \({\mathbb {T}}\subset {\mathbb {N}}\), \({\mathbb {T}}=\{0,\ldots , N-1\}\), in which N is the number of frames of a video. A frame f is a function from \({\mathbb {A}}\) to \({\mathbb {R}}^3\), where for each spatial position (x, y) in \({\mathbb {A}}\), f(x, y) represents the color value at pixel location (x, y). A video \(\textrm{V}_{ N }^{}\), in domain \({\mathbb {A}}\times {\mathbb {T}}\), can be seen as a sequence of frames \({ f}_{t}\). It can be described by \( \textrm{V}_{ N }^{}={({ f}_{t})}_{t\in {\mathbb {T}}}\), where \( N \) is the number of frames contained in the video.

A frame \({ f}_{}\) is usually described in terms of a global descriptor \(d({ f}_{})\). Let \({ f}_{t_1}\) and \({ f}_{t_2}\) be two video frames at locations \(t_1\) and \(t_2\), respectively. The (dis)similarity between \({ f}_{t_1}\) and \({ f}_{t_2}\) can be evaluated by a distance measure \(\mathcal {D}{(d({ f}_{t_1}),d({ f}_{t_2}))}\) between their descriptors. There are several choices for \(\mathcal {D}\), depending on the global descriptor, e.g., histogram/frame difference, histogram intersection, difference of histogram means, and even the \(L_2\) norm.
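As an illustration of \(\mathcal {D}\), two common choices, the \(L_2\) norm and the cosine distance (the latter is used in Sect. 4), can be sketched as follows; the function names are ours and not part of any reference implementation:

```python
import numpy as np

def l2_distance(d1, d2):
    """Euclidean (L2) distance between two global frame descriptors."""
    d1, d2 = np.asarray(d1, dtype=float), np.asarray(d2, dtype=float)
    return float(np.linalg.norm(d1 - d2))

def cosine_distance(d1, d2):
    """Cosine distance (1 - cosine similarity): 0 for identical
    directions, 1 for orthogonal descriptors."""
    d1, d2 = np.asarray(d1, dtype=float), np.asarray(d2, dtype=float)
    return float(1.0 - np.dot(d1, d2) / (np.linalg.norm(d1) * np.linalg.norm(d2)))
```

Either function can play the role of \(\mathcal {D}\) when weighting the edges of the similarity graph defined next.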

A time-aware frame similarity graph \(G_\delta = (V, E_\delta )\) is a weighted undirected graph. Each node \(v_{t} \in V\) represents a frame \(f_{t} \in \textrm{V}_{N}^{}\). There is an edge \(e \in E_\delta \) with a weight \(w(e) = \mathcal {D}{(d({ f}_{t_1}),d({ f}_{t_2}))}\) between two nodes \(v_{t_1}\) and \(v_{t_2}\) if the difference between their time indexes falls below a specified threshold \(\delta \), i.e.,

$$\begin{aligned} E_\delta = \{\;(v_{t_1},v_{t_2}, \mathcal {D}{(d({ f}_{t_1}),d({ f}_{t_2}))})\; |\;v_{t_1}, v_{t_2} \in V, v_{t_1} \ne v_{t_2}, |t_2 - t_1| \le \delta \}. \end{aligned}$$
(1)

This constraint on the frames’ time indexes limits the connections between distant video frames, effectively allowing the proposed method to consider two frames as similar only if they are not very far apart in time. This is a noteworthy distinction from many other approaches in the literature, which may consider two frames as similar independently of their time of occurrence. This permits the proposed method to assess a frame’s importance over time when selecting it as a keyframe for the video summary, even when similar content reoccurs throughout the video. Figure 2(a) illustrates a time-aware frame similarity graph with \(\delta = 4\).
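A minimal sketch of how the edge set \(E_\delta \) of Eq. 1 could be enumerated, assuming per-frame descriptors are already available; this illustrates the definition and is not the authors’ implementation:

```python
def build_time_aware_graph(descriptors, delta, dist):
    """Enumerate the edges of the time-aware frame similarity graph G_delta.

    descriptors: per-frame descriptor vectors, in temporal order.
    delta: maximum allowed difference |t2 - t1| between time indexes.
    dist: dissimilarity function D between two descriptors.
    Returns a list of (t1, t2, weight) tuples with t1 < t2.
    """
    n = len(descriptors)
    edges = []
    for t1 in range(n):
        # Connect only frames whose time indexes differ by at most delta.
        for t2 in range(t1 + 1, min(t1 + delta + 1, n)):
            edges.append((t1, t2, dist(descriptors[t1], descriptors[t2])))
    return edges
```

For a 6-frame video with \(\delta = 4\), this produces 14 edges instead of the 15 of a complete graph; the saving grows quickly for longer videos.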

Similar to [17], this work also constructs a hierarchy based on a minimum spanning tree (MST) of the original graph. We define an edge-weighted tree of frames \(T_{G_\delta }= (V, E_\delta ^*)\) as a connected acyclic spanning subgraph of \(G_\delta \), i.e., \(E_\delta ^* \subseteq E_\delta \). The weight of \(T_{G_\delta }\) is equal to the sum of the weights of all edges belonging to \(E_\delta ^*\), i.e., \(w(T_{G_\delta }) = \sum _{e \in E_\delta ^*} w(e)\). The minimum spanning tree of frames \(T^{*}_{G_\delta }\) is a tree of frames whose weight is minimal.
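A standard way to compute \(T^{*}_{G_\delta }\) is Kruskal’s algorithm (the one adopted later in Sect. 3); a compact sketch using a union-find structure, with names of our choosing:

```python
def kruskal_mst(n, edges):
    """Return the MST edge list of a connected graph with n nodes,
    where edges are (u, v, weight) tuples."""
    parent = list(range(n))

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    mst = []
    for u, v, w in sorted(edges, key=lambda e: e[2]):
        ru, rv = find(u), find(v)
        if ru != rv:  # the edge joins two components, so it creates no cycle
            parent[ru] = rv
            mst.append((u, v, w))
            if len(mst) == n - 1:
                break
    return mst
```

Sorting the edge list dominates the cost, so the sketch runs in \(O(|E_\delta | \log |E_\delta |)\) time.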

Given a finite set V, a partition of V is a set \(\textbf{P}\) of nonempty disjoint subsets of V whose union is V. Any element of \(\textbf{P}\), denoted by \(\textbf{R}\), is called a region of \(\textbf{P}\). Given two partitions \(\textbf{P}\) and \(\textbf{P}^\prime \) of V, \(\textbf{P}^\prime \) is said to be a (total) refinement of \(\textbf{P}\), denoted by \(\textbf{P}^{\prime } \preceq \textbf{P}\), if any region of \(\textbf{P}^\prime \) is included in a region of \(\textbf{P}\). Let \(\mathcal {H}= (\textbf{P}_1, \dots , \textbf{P}_\ell )\) be a set of \(\ell \) partitions on V. \(\mathcal {H}\) is a hierarchy if \(\textbf{P}_{i-1} \preceq \textbf{P}_i\), for any \(i \in \{ 2, \dots , \ell \}\). According to [8], an MST can be utilized to represent a hierarchy, and a weighted MST of a graph can address any connected hierarchy for that graph. Additionally, the work in [12] demonstrated that creating a hierarchical graph segmentation involves reweighting an MST using a dissimilarity measure between regions. Thus, the proposed method utilizes an MST of frame similarity graph \(T^{*}_{G_\delta }\) to obtain a hierarchy \(\mathcal {H}\) which is then used to obtain frame clusters.

Finally, a hierarchical segmentation of \(G_\delta \) into k components is equivalent to the partition of a hierarchy \(\mathcal {H}\) into k regions (each containing more similar elements) and can be done by removing the \(k-1\) edges with the highest weights (representing the greatest dissimilarity) from \(T^{*}_{G_\delta }\) (since it represents \(\mathcal {H}\)). This strategy incorporates a similarity measure between clusters while partitioning the graph, providing a more comprehensive approach than traditional methods that only consider the similarity between isolated frames.
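The segmentation into k components described above can be sketched as follows; keeping the \(n-k\) lightest MST edges is equivalent to removing the \(k-1\) heaviest ones. The function name is ours:

```python
def cut_hierarchy(n, mst_edges, k):
    """Partition n frames into k clusters by removing the k-1 heaviest
    edges from the MST that represents the hierarchy."""
    # The MST has n-1 edges; dropping the k-1 heaviest leaves n-k edges.
    kept = sorted(mst_edges, key=lambda e: e[2])[:max(n - k, 0)]
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Union the endpoints of every kept edge; the resulting
    # union-find components are the k clusters.
    for u, v, _ in kept:
        parent[find(u)] = find(v)
    clusters = {}
    for t in range(n):
        clusters.setdefault(find(t), []).append(t)
    return list(clusters.values())
```

Because the input is an MST, removing an edge always splits exactly one component in two, so the result has exactly k clusters.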

3 Hierarchical Time-Aware Video Summarization

Figure 2 illustrates the steps of the proposed method. The main steps of the HieTaSumm method are the following: (a) generation of a time-aware frame similarity graph \(G_\delta \) to represent a video; (b) computation of a minimum spanning tree \(T^{*}_{G_\delta }\) for that graph; (c) creation of a hierarchy \(\mathcal {H}\) based on \(T^{*}_{G_\delta }\); (d) generation of subsets of frames through cuts on the hierarchy; and (e) selection of keyframes to represent each subset.

Fig. 2.
figure 2

Illustration of the proposed method steps: (a) generation of a time-aware frame similarity graph \(G_\delta \) for a video; (b) computation of its minimum spanning tree \(T^{*}_{G_\delta }\); (c) creation of a hierarchy \(\mathcal {H}\) based on \(T^{*}_{G_\delta }\); (d) generation of subsets of frames through hierarchy cuts (edge removals); and (e) selection of keyframes to represent each subset. These keyframes are the result of the summarization process.

The HieTaSumm method (see Algorithm 1) creates and uses a frame similarity graph \(G_\delta \). Each vertex represents a distinct video frame, and there is an edge between two vertices if the difference between their time indexes falls below a specified threshold \(\delta _t\). Equation 2 represents this constraint and is implemented at line 7 of Algorithm 1.

$$\begin{aligned} |t_2 - t_1| < \delta _t \end{aligned}$$
(2)

in which \(t_1\) and \(t_2\) represent the time indexes of frames \(f_{t_1}\) and \(f_{t_2}\), respectively. Additionally, the edge weight represents the (dis)similarity between frames.

Algorithm 1
figure a

Hierarchical time-aware video summarization

The proposed method employs Kruskal’s algorithm to obtain the MST \(T^{*}_{G_\delta }\) from \(G_\delta \), while the watershed by area [7] is used to generate a hierarchy \(\mathcal {H}\) from \(T^{*}_{G_\delta }\). Once a hierarchy \(\mathcal {H}\) is constructed, a hierarchical segmentation of \(G_\delta \) generates a video summary of size k. For that, the proposed method only needs to remove the \(k-1\) edges with the highest weights from \(\mathcal {H}\). Instead of generating a fixed-size video summary, we adopt a strategy for identifying the moment when stability is reached during the edge removal process, similar to (but distinct from) the one used in [17]. Let \(e'\) be the edge with the highest weight in the hierarchy \(\mathcal {H}\). The edge \(e'\) is removed only when its weight \(w(e')\) is greater than or equal to an equilibrium measure function \(\textbf{F}(e')\), i.e., \(w(e') \ge \textbf{F}(e')\). In this work, the equilibrium measure function is given by Eq. 3.

$$\begin{aligned} \textbf{F}(e) = \gamma \sigma _w(e) \end{aligned}$$
(3)

in which \(\sigma _w(e)\) represents the standard deviation of all edge weights of the connected component that contains edge e, and \(\gamma \) is a parameter related to the allowed variability. During tests, we have set \(\gamma \) empirically.
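A simplified sketch of this stopping rule. Two caveats: here \(\sigma _w\) is computed over all remaining MST edges rather than per connected component as in Eq. 3, and removal also stops once no weight variability remains; both are our simplifications, not the paper’s exact procedure:

```python
import statistics

def dynamic_cut(mst_edges, gamma):
    """Iteratively remove the heaviest remaining edge e' while
    w(e') >= gamma * sigma_w, i.e., while the equilibrium measure
    allows the cut. Returns (removed edges, remaining edges)."""
    remaining = sorted(mst_edges, key=lambda e: e[2], reverse=True)
    removed = []
    while len(remaining) > 1:
        sigma = statistics.pstdev([w for _, _, w in remaining])
        if sigma == 0 or remaining[0][2] < gamma * sigma:
            break  # equilibrium reached: stop cutting
        removed.append(remaining.pop(0))
    return removed, remaining
```

Each removed edge splits one component in two, so the summary size becomes the number of removed edges plus one; no fixed k needs to be chosen in advance.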

Finally, after dividing the hierarchy into several connected components, central frames (concerning chronological order) are selected as keyframes for the video summary.
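Our reading of “central frames concerning chronological order” can be sketched as taking the median time index of each cluster; this interpretation is an assumption on our part:

```python
def select_keyframes(clusters):
    """For each cluster of frame time indexes, pick the chronologically
    central frame (median time index) as the cluster's keyframe."""
    keyframes = []
    for cluster in clusters:
        ordered = sorted(cluster)  # restore chronological order
        keyframes.append(ordered[len(ordered) // 2])
    return sorted(keyframes)
```

Sorting the final list keeps the keyframes in temporal order, matching the chronological summaries shown in Figs. 3 and 4.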

This dynamic choice of the number of components and, consequently, the size of the video summary becomes essential when it comes to videos that contain numerous very similar scenes. In such cases, employing a static number of frames for all videos can result in redundant and repetitive content in the summary. By adopting a dynamic approach, the method can infer an adequate summary size based on the specific video content and characteristics.

4 Experimental Results

This section provides a comprehensive analysis of the results obtained by the proposed approach to video summarization with dynamic selection of the summary size.

We compared HieTaSumm with other unsupervised video summarization methods, namely HSUMM [17], VSUMM1 [9], VSUMM2 [9], VISTO [11], and Open Video summaries (referred to as OVSummary). These comparative assessments allow a comprehensive review of the performance and effectiveness of HieTaSumm against these established approaches.

4.1 Implementation Details and Dataset

Similar to [11, 17], we applied the proposed method to the same collection of videos from the OpenVideo dataset (referred to as the VSUMM dataset in [19]). This dataset contains 50 videos of different genres. All videos are in MPEG-1 format (30 fps, 352 \(\times \) 240 pixels). The genres are distributed into documentary, educational, ephemeral, historical, and lecture. The duration of each video varies from 1 to 4 min. The user summaries were created collaboratively by 50 different people, each of whom chose the keyframes for 5 videos. Thus, 250 user summaries were created, and each video has 5 different user summaries generated manually. As a pre-processing step, we sampled all videos at 4 fps.

For the creation of the frame similarity graph, we use ResNet50 and VGG16 (both pre-trained on ImageNet) to extract frame descriptors. The cosine similarity was used to assess the similarity between two frame descriptors. We also set \(\delta _t = 32\) (i.e., 8 s at 4 fps) and \(\gamma = 0.05\) during the experiments. The parameter \(\delta _t\) plays a crucial role in enforcing the temporal threshold and restricting vertex connections, avoiding the creation of edges that span all frames of the video; if all frames were connected, temporal dependencies might be neglected. Similarly, the parameter \(\gamma \) regulates the variance amplification in feature differences. Its utilization helps control the level of distinction among features, ensuring a balanced representation of the underlying data.

4.2 Evaluation Metrics

Assessing frame quality in the context of video summarization poses a distinct challenge because of the many ways in which frames can be selected while conveying similar meanings. These variations can arise from analyzing different resources and informational aspects. Although humans have an intuitive understanding of this process, its abstract evaluation remains an open question without a specific framework. As a result, the conventional practice involves adapting similar metrics that have been stretched to accommodate the specific requirements of the video summarization task. By re-purposing and customizing these metrics, researchers and practitioners can assess the effectiveness and fidelity of generated summaries, despite the inherent complexities and subjectivity involved in their evaluation [9, 17].

To compute the improvement of the frame selection, we evaluate the obtained results following the same approach used by the authors of [9, 17]. They reported their results using metrics widely disseminated in the literature, such as CUSa, CUSe [9, 17], and COV [17], defined by Eqs. 4 to 6, respectively, to evaluate the similarity between the frames generated by their summarization method and the GT results.

$$\begin{aligned} \text{ CUSa } = \frac{m_A}{n_U} \end{aligned}$$
(4)
$$\begin{aligned} \text{ CUSe } = \frac{\overline{m}_A}{n_U} \end{aligned}$$
(5)

in which \(m_A\) denotes the number of keyframes from the Automatic Summary (AS) that match the user summary, \(\overline{m}_A\) represents the number of non-matching keyframes from AS, and \(n_U\) is the number of keyframes selected by the user to compose the user summary (U) for each video.

$$\begin{aligned} \text{ COV } = \frac{\sum _{U \in US}|M(AS,U)|}{\sum _{U \in US} |U|} \end{aligned}$$
(6)

in which M(X, Y) and |.| are the maximum matching between two sets of different elements X and Y, and the cardinality of a set, respectively.
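A minimal sketch of the three metrics, with “matching” simplified to exact frame-identifier equality; the original works [9, 17] match visually similar frames, so this is only an approximation of the full evaluation protocol:

```python
def cus_a(auto, user):
    """CUSa (Eq. 4): matching automatic keyframes over the user summary size."""
    return len(set(auto) & set(user)) / len(user)

def cus_e(auto, user):
    """CUSe (Eq. 5): non-matching automatic keyframes over the user summary size."""
    return len(set(auto) - set(user)) / len(user)

def cov(auto, user_summaries):
    """COV (Eq. 6): matched keyframes summed over all user summaries,
    normalized by the total number of user keyframes."""
    matched = sum(len(set(auto) & set(u)) for u in user_summaries)
    total = sum(len(u) for u in user_summaries)
    return matched / total
```

Averaging `cus_a` over all users of a video gives the per-video CUSa score; `cov` instead pools matches across users before normalizing, which is what makes it sensitive to opinion diversity.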

While the first two metrics provide valuable insights, they often fail to measure the diversity displayed in user summaries, as COV does. Furthermore, averaging the measurements over users can introduce distortions and inaccuracies. Specifically, CUSa, which is commonly employed to assess user opinions, fails to effectively capture their diversity. To illustrate, consider two users, A and B, providing summaries for the same video. Let the summary of user A be \(U_A = \{X, Y\}\) while the summary of user B is \(U_{B} = \{M, N, O, P, Q, R, S, T, U, V\}\), in which each character denotes a single frame of video. Now suppose that three distinct methods generate summaries: \(AS_1 = \{X, Y\}\), \(AS_2 = \{M, N, O, P, Q, R, S, T, U, V\}\), and \(AS_3 = \{X, M, N, O, P, Q\}\). Despite these summaries being completely different, they yield the same accuracy rate (i.e., CUSa = 0.5). This highlights the limitations of CUSa in accurately assessing divergence of opinion and the need for more comprehensive assessment metrics [9, 17].

Unlike CUSa, COV assesses the extent to which an automatic summary covers all user-generated summaries. This measure takes into account both the diversity of opinions expressed by users and the degree of agreement among them. Specifically, the CUSa measure calculates the average ratio between each user’s summary and an automatic summary, thus capturing the level of agreement between the two. In contrast, COV assesses the proportion of an automatic summary that aligns with all user summaries, providing a measure of overall coverage. We use COV as the first metric to compute the effectiveness of HieTaSumm. The reader should refer to [9, 17] for more information about these metrics.

4.3 Quantitative Analysis

Table 1 presents the HieTaSumm results. We used ResNet50 and VGG16 to extract frame descriptors for the construction of the frame similarity graph. During the evaluation of the results, we also used ResNet50 and VGG16 to extract frame descriptors, and the cosine similarity was used to verify the agreement between the groundtruth and automatic summaries. We have also used color histograms (CH) during the assessment of the results.

Table 1 presents the average values of all metrics for the 50 videos in the dataset. The results are presented for different levels of precision (between groundtruth and automatic summaries). It is possible to notice that ResNet50 presents a slight improvement over VGG16 under higher precision in the evaluation, while VGG16 presented better results under lower precision. Moreover, it is also possible to observe the high values of COV and CUSa achieved by the HieTaSumm method; even under higher precision in the evaluation, the proposed method still presents competitive results.

Table 1. Performance of HieTaSumm method for different levels of precision in evaluation of video summaries. CUSa, CUSe, and COV values were multiplied by \(10^2\) to improve readability.
Fig. 3.
figure 3

Comparative example of HieTaSumm results compared with HSUMM results and with the frames selected by User 3 and User 5 (both selected 9 frames). The video summary generated by HieTaSumm contains 9 frames.

4.4 Qualitative Analysis

To provide a better understanding of the results obtained and their improvements, Figs. 3 and 4 present samples of summaries generated by various approaches in the literature, including HSUMM [17], VSUMM1 [9], VSUMM2 [9], VISTO [11], and Open Video summaries (referred to as OVSummary), and the groundtruth (GT) results, alongside those generated by the HieTaSumm method. This comparison enables the evaluation of time awareness, similarity to the GT results, and the rate of frames selected by the HieTaSumm method and the others.

Fig. 4.
figure 4

Comparative example of HieTaSumm results compared with the results of VSUMM1 [9], VSUMM2 [9], VISTO [11], OVSummary, and with the frames selected by User 2 and User 3.

Figure 3 shows the results generated by the HieTaSumm method along with the HSUMM [17] results and the summaries generated by two users. The frame list created by each user encapsulates a distinct selection of frames, reflecting individual preferences and perspectives. Employing cosine similarity, we can quantify the degree of similarity between the users’ GT and the summary generated by the HieTaSumm method. However, it is essential to recognize that similarity is subjective and may vary among observers. Factors such as the weighting of different frames, the level of granularity in frame selection, and the specific context of the video all influence perceived similarity. Therefore, when evaluating the cosine similarity between two lists of frames, it is crucial to consider the subjective nature of perception and the different perspectives that individuals bring to the comparison. The result obtained by HSUMM has a much higher number of frames than the others and, because of this, presents many frames with high similarity. Furthermore, HSUMM results may not preserve chronological order.

On the other hand, the HieTaSumm method presents a fluid and coherent result. Furthermore, the selected keyframes are very similar to the frames in the GT. Among all keyframes selected by the HieTaSumm method, only one frame has no direct correlate among those selected by the two users. Even with different keyframes selected by the users, the automatic summary generated by the HieTaSumm method is very close to theirs (especially for Users 3 and 5, shown in Fig. 3). In addition, the unrelated keyframe preserves temporal order; looking at the three keyframes in which the map is present, it is possible to observe a refinement process that identifies the correct highlighted region, starting from a global visualization and moving to a local analysis that identifies the region in focus as the most important location on the map presented in the video.

Figure 4 also presents some subjective characteristics of the keyframes selected by Users 2 and 3. The number of frames they selected, 15 and 17 respectively, suggests a larger number of scene changes. This variation can lead automatic methods to select more frames, but the increase in the number of scenes can also cause automatic methods to return repeated frames. The returned summaries thus face the challenge of maintaining temporal coherence without selecting two highly similar frames with no other events between them. With this difficulty in mind, OVSummary presents a series of repeated frames side by side. VISTO displays some repeated frames, but a reduced number of frames with more similar information. VSUMM1 captures more scene changes and has some information that tends to be more similar. VSUMM2 tends to avoid redundancy but misses some scenes that are more relevant to the user. Finally, the hierarchical approach used by HieTaSumm tends to reduce the redundancy of highly similar information. HieTaSumm results have a smaller number of keyframes, but these keyframes are more related to the user summaries. Moreover, the keyframes selected by the HieTaSumm method keep the temporal ordering and show that the dynamic selection of summary size helps to capture changing scenes more smoothly.

5 Conclusion

This work proposed an unsupervised method for video summarization that considers changes in video content over time, named Hierarchical Time-aware Summarizer (HieTaSumm). It uses pre-trained neural networks to generate video frame descriptions together with a hierarchical graph-based clustering strategy. The proposed method explores a time-aware frame similarity graph to represent video content considering changes over time. Moreover, a dynamic strategy for defining the summary size is adopted. Experimental results indicate that the proposed approach has great potential. Specifically, it seems to enhance coherence among different video segments, reduce frame redundancy in the generated summaries, and increase the diversity of selected keyframes.

Future work may explore other strategies for selecting keyframes and different hierarchies. It might also be interesting to investigate the impact of datasets with few scene changes. By following these research directions, we can advance the video summarization field and further refine the dynamic frame selection approach to provide more accurate, informative, and user-centric video summaries.