1 Introduction

Graph-based image representation is an emerging research area that leverages the spatial relationships between image elements to model image content more effectively [16]. Graph-based approaches can enhance the understanding of image semantics and context by incorporating domain-specific knowledge into the learning process. Moreover, these approaches can provide multi-scale representations of the same image [17, 22, 23], capturing local and global information about its structure. These advantages make graph-based image representation appealing for various image analysis tasks.

Applying machine learning algorithms to graph data presents unique challenges due to its irregularity, variable sizes, and diverse neighbor relationships [24]. Graph Neural Networks (GNNs) have emerged to address these challenges, adapting neural network architectures to process graph-structured data effectively. GNNs capture the intricate relationships and dependencies between vertices in a graph, leveraging information from neighboring vertices and edges to encode local and global structural information. This approach enables accurate predictions and improved generalization, making GNNs ideal for tackling various graph-based machine-learning problems.

One of the main applications of GNNs is graph-based image analysis, which requires representing images as graphs. However, this is a challenging task. One common approach in the literature is to apply image segmentation methods to partition the image into regions and represent each region by a vertex in a graph [2, 9, 17, 19, 22, 23]. Despite the simplicity of this approach, it has two important drawbacks: (i) the dependency on the quality and quantity of the regions produced by the segmentation algorithm; and (ii) the hierarchical relationships between image elements are not captured. Here we argue that this property is essential for effective image reasoning, and a graph representation must capture it. It is worth mentioning that hierarchical information can be seen as a multi-scale representation. Figure 1 illustrates failures in image classification when the hierarchical structure is disregarded (all cases were classified as airplanes using the representation of [2]).

Fig. 1.

Examples of images classified as airplanes, according to [2].

To overcome the limitations of existing methods, we propose a novel approach that leverages hierarchical segmentation techniques to generate graph vertices for image classification from graph representations. Hierarchical segmentation is a well-established technique in computer vision and image processing [6, 7, 13], enabling the identification of objects and regions within an image based on their visual characteristics. Additionally, our method incorporates hierarchical relationships between image elements into the resulting graph, facilitating image analysis tasks. More specifically, hierarchical segmentation amounts to generating a set of image segmentations at different levels of detail that respect the principles of multi-scale image analysis [12]. By adopting this approach, we argue that we can effectively capture the rich structural information in the image and encode it in the resulting graph. Thus, our proposal for image classification uses a hierarchical segmentation method to capture hierarchical information for representing image data features. However, noise in raw images may degrade the quality of the segmentation. For that reason, instead of working on raw images, we generate superpixels from them to produce homogeneous and concise regions with good object delineation. From these superpixels, region adjacency graphs are computed to represent the superpixel images.

To evaluate the effectiveness of our proposed representation, we conducted experiments using images from the CIFAR-10 dataset, leveraging our hierarchical segmentation technique from the superpixel images to generate graphs. We also introduce a novel model, the Hierarchical Graph Convolutional Network for Image Classification (HGCIC), that can effectively learn from the hierarchical graphs and classify images.

Experiments demonstrated promising results, outperforming some state-of-the-art methods and highlighting the effectiveness of both the model and the graph representation. Furthermore, the graphs used to train our model have far fewer edges and vertices than those utilized in other works, reducing the computational complexity of GNNs and speeding up the learning process.

We can briefly describe the two major contributions of this work to graph-based image analysis: (i) the proposition of a novel graph representation method that leverages hierarchical image segmentation to capture hierarchical representations of the underlying image structure; and (ii) the introduction of a novel graph convolutional network (GCN) architecture that can extract and use the essential information from our new graph representation.

This work is organized as follows. Section 2 presents some related works. Section 3 describes the most important concepts needed for understanding the proposed method. Section 4 presents a hierarchical graph convolutional network for classifying images based on superpixel graphs. Section 5 describes the experiments and a comparative analysis of the proposed approach against state-of-the-art methods. Finally, in Sect. 6, we draw some conclusions and present directions for future work.

2 Related Works

Superpixel image segmentation can be seen as a fundamental task in image processing that divides an image into homogeneous regions. This process can reduce the complexity of an image and provide a more efficient representation for further analysis. By treating these regions or segments as vertices, we can transform the image into a graph structure, which can facilitate the use of graph-based algorithms.

A common strategy in the literature for representing images as graphs is to consider vertices as superpixels [2, 9, 11, 17, 20, 22]. Superpixels are groups of pixels clustered into perceptually meaningful regions based on similarity criteria such as color or location. This approach leverages the flexibility of graph neural networks, which can handle irregular graphs with different shapes and sizes. Other methods include splitting the images into fixed patches viewed as vertices [14] or interpreting each pixel as a vertex [8, 20]. Some works also explore multiscale graph representations of images to achieve better graph classification performance.

Multiscale graph-image representations can capture different levels of detail and features from the image. For instance, in [22], the authors proposed a novel superpixel algorithm that produces segments with a wide size distribution, allowing a more flexible representation of an image, as it can capture fine and coarse details. The authors in [23] model hyperspectral images as multiple graphs with different neighborhood scales and propose a dynamic graph convolution operation that updates the similarity measures among vertices by fusing feature embeddings. To the best of our knowledge, the work in [17] is the only one that has used a multiscale graph segmentation, in which superpixels are obtained at several scales and different types of relations between vertices are explored to improve the model expressiveness.

The edges of the graph are also crucial for constructing the graph representation, as they enable the message-passing mechanism of GNNs. The message-passing mechanism updates the feature representations of vertices by exchanging information with neighboring vertices through edges. One natural way to build adjacency is using a region adjacency graph (RAG), in which edges are created between spatial neighbors [2, 22]. Another strategy is constructing k-nearest adjacency based on the vertex’s spatial and/or feature distances, as used in [9, 14]. Both methods have shown promising results, and their choice mainly depends on the problem requirements and the properties of the data.

3 Theoretical Background

3.1 Graph Neural Networks

Graph neural networks (GNNs) can be divided into two types: spatial or spectral. Spatial GNNs work directly on the graph structure and compute vertex (and maybe edge) representations using information from their neighbors. Spectral GNNs, on the other hand, function on the graph spectral domain. They utilize the eigenvectors of the graph Laplacian matrix as the foundation for representing the graph data. This work focuses solely on spatial GNNs.

The inputs of each GNN layer are the vertex feature vectors \( \{h_{u}\in \mathbb {R}^{d}\;|\; u \in \mathcal {V}\}\), the set of edges \(\mathcal {E}\), and, optionally, edge feature vectors (which may be seen as weights) \( \{w_{uv}\in \mathbb {R}^{d}\;|\; (u,v) \in \mathcal {E}\}\). The result of each layer is a new vertex representation \(\{h_{u}^{'}\in \mathbb {R}^{d^{'}}\;|\; u \in \mathcal {V}\}\), in which the same parametric function is applied to each vertex given its neighbors \(\mathcal {N}_{u}=\{v \in \mathcal {V}\;|\; (u,v)\in \mathcal {E}\}\) and the edges incident to it, generically given by:

$$\begin{aligned} h'_{u}=f_{\theta }\left( h_{u}, aggregate(h_{v}, w'_{uv}\;|\;v\in \mathcal {N}_{u})\right) \end{aligned}$$
(1)

in which aggregate is a permutation invariant function (like max, min, sum), \(f_{\theta }\) is the parametric function, and \(w'_{uv}\) is the updated edge feature defined by:

$$\begin{aligned} w'_{uv} = g_{\theta }(h_{u}, h_{v}, w_{uv}) \end{aligned}$$
(2)

in which \(g_{\theta }\) is a distinct parametric function. Each vertex update step is also called the message passing step since vertices send information to their neighbors.
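To make the generic update of Eq. 1 concrete, the following NumPy sketch uses sum as the aggregate and a single linear map followed by ReLU standing in for \(f_{\theta }\); edge features and their update (Eq. 2) are omitted for brevity, so this is an illustration rather than any particular GNN layer:

```python
import numpy as np

def message_passing(h, edges, W_self, W_nbr):
    """One generic message-passing step (Eq. 1) with sum aggregation.

    h              : (n, d) array of vertex features
    edges          : list of (u, v) pairs; v is a neighbor of u
    W_self, W_nbr  : (d, d') weight matrices playing the role of f_theta
    """
    agg = np.zeros((h.shape[0], W_nbr.shape[1]))
    for u, v in edges:               # aggregate: permutation-invariant sum
        agg[u] += h[v] @ W_nbr
    return np.maximum(h @ W_self + agg, 0.0)   # ReLU nonlinearity
```

Because the same weights are applied at every vertex, the layer is independent of the graph size, which is what lets a single GNN process graphs with varying numbers of vertices.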

3.2 Residual Gated Graph Convolutional Network

According to [4], GatedGCN is a fusion of the vanilla GCN and edge gating mechanism. In [9], the authors suggested modifications to the GatedGCN architecture by introducing residual connections and batch normalization [15]. The vertex update is given by:

$$\begin{aligned} h'_{u} = h_{u} + \textrm{ReLU}(\textrm{BN}(U^{\ell }h_{u} + \sum _{v \in N_{u}}\alpha _{uv}\odot V^{\ell }h_{v})) \end{aligned}$$
(3)

in which \(U^{\ell }\), \(V^{\ell }\) are linear transformations, \(\odot \) denotes Hadamard product, \(\textrm{ReLU}\) stands for Rectified Linear Unit, \(\textrm{BN}\) represents batch normalization, and \(\alpha _{uv}\) are the edge gates defined by:

$$\begin{aligned} \alpha _{uv} = \frac{\sigma (\hat{w}_{uv})}{\sum _{{v}' \in N_{u}}\sigma (\hat{w}_{u{v}'}) + \varepsilon } \end{aligned}$$
(4)
$$\begin{aligned} \hat{w}_{uv} = A^{\ell }h_{u}^{\ell }+B^{\ell }h_{v}^{\ell } + C^{\ell } {w}_{uv} \end{aligned}$$
(5)

in which \(A^{\ell }\), \(B^{\ell }\), \(C^{\ell }\) are linear transformations, \(\sigma \) is the sigmoid function, and \(\varepsilon \) is a small fixed constant for numerical stability. The edge gate in Eq. 4 works as a soft attention mechanism [9], allowing the model to learn the importance of different vertices in a neighborhood. Finally, the edge features are updated as follows:

$$\begin{aligned} {w}^{'}_{uv} = {w}_{uv} + \textrm{ReLU}(\textrm{BN}(\hat{w}_{uv})) \end{aligned}$$
(6)
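A minimal NumPy sketch of the updates in Eqs. 3 to 6 follows; batch normalization is omitted and dense weight matrices are assumed, so this illustrates the gating logic only, not the trained implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_gcn_layer(h, w, nbrs, A, B, C, U, V, eps=1e-6):
    """Residual GatedGCN update (Eqs. 3-6), without BatchNorm.

    h    : (n, d) vertex features
    w    : dict mapping edge (u, v) to its (d,) edge feature
    nbrs : nbrs[u] lists the neighbors of vertex u
    A, B, C, U, V : (d, d) weight matrices as in the equations above
    """
    h_new, w_new = h.copy(), {}
    for u in range(len(h)):
        w_hat = {v: A @ h[u] + B @ h[v] + C @ w[(u, v)]
                 for v in nbrs[u]}                                  # Eq. 5
        gate_sum = sum(sigmoid(w_hat[v]) for v in nbrs[u])
        msgs = sum(sigmoid(w_hat[v]) / (gate_sum + eps) * (V @ h[v])
                   for v in nbrs[u])                                # Eq. 4
        h_new[u] = h[u] + np.maximum(U @ h[u] + msgs, 0.0)          # Eq. 3
        for v in nbrs[u]:
            w_new[(u, v)] = w[(u, v)] + np.maximum(w_hat[v], 0.0)   # Eq. 6
    return h_new, w_new
```

Note how the gates \(\alpha _{uv}\) normalize each neighbor's contribution by the sum over the whole neighborhood, so the layer learns a per-feature weighting of neighbors rather than treating them uniformly.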

3.3 Hierarchical Segmentation

Hierarchical image segmentation is a set of image segmentations at different detail levels [13]. The segmentations with lower levels of detail can be created by merging regions from segmentations at higher levels of detail.

Hierarchical approaches must obey the principles of multi-scale image analysis. These principles ensure that the segmentation is consistent across different levels of detail. The causality principle defines that a contour presented at a scale \(k_{1}\) should be present at any scale \(k_{2} < k_{1}\). The location principle defines that contours should be stable because they neither move nor deform from one scale to another [12].

Hierarchical image segmentation organizes image segments into a tree structure where each vertex represents a different level of detail or abstraction. The highest level of the tree represents the entire image, while lower levels correspond to smaller and more specific sub-regions or sub-segments. This structure provides a way to represent the image at different levels of resolution, allowing for a better understanding of the image’s contents.

Fig. 2.

Pipeline for computing a hierarchy from the original image.

4 Hierarchical Graph Convolutional Networks by Using Hierarchy of Superpixels

Given a finite set V, a partition of V is a set \(\textbf{P}\) of nonempty disjoint subsets of V whose union is V. Any element of \(\textbf{P}\), denoted by \(\textbf{R}\), is called a region of \(\textbf{P}\). Given two partitions \(\textbf{P}\) and \(\textbf{P}^\prime \) of V, \(\textbf{P}^\prime \) is said to be a (total) refinement of \(\textbf{P}\), denoted by \(\textbf{P}^{\prime } \preceq \textbf{P}\), if any region of \(\textbf{P}^\prime \) is included in a region of \(\textbf{P}\). Let \(\mathcal {H}= (\textbf{P}_1, \dots , \textbf{P}_\ell )\) be a set of \(\ell \) partitions on V. \(\mathcal {H}\) is a hierarchy if \(\textbf{P}_{i-1} \preceq \textbf{P}_i\), for any \(i \in \{ 2, \dots , \ell \}\).
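The refinement relation above is straightforward to check programmatically; as a small illustration, the following Python sketch represents partitions as lists of frozensets:

```python
def is_refinement(P_fine, P_coarse):
    """True if every region of P_fine is included in some region of
    P_coarse, i.e. P_fine is a (total) refinement of P_coarse."""
    return all(any(r <= s for s in P_coarse) for r in P_fine)

def is_hierarchy(partitions):
    """True if each partition refines the next one, matching the
    definition of a hierarchy given above."""
    return all(is_refinement(partitions[i - 1], partitions[i])
               for i in range(1, len(partitions)))
```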

4.1 Graph Construction

Let \(G=(V,E)\) be a RAG computed from the superpixels, in which the set V represents the superpixels and the set E the adjacency relation between them. Let \(\mathcal {H}= (\textbf{P}_1, \dots , \textbf{P}_\ell )\) be a hierarchy computed from the graph G. Let \(\mathcal {R}_j\) be the set of regions in the partition \(\textbf{P}_j\) of the hierarchy \(\mathcal {H}\). Let \(\mathcal {R}\) be the set containing all regions belonging to all partitions \(\textbf{P}_j \in \mathcal {H}\).

Figure 2 illustrates our proposal for computing the hierarchy from the original image. In the following, we describe how to compute three different graphs from a given hierarchy. These graphs will be used as input in the learning step of our method.

Hierarchy-Based Graph. This graph, denoted by \(G_h = (V_h,E_h)\), is computed from the hierarchical structure \(\mathcal {H}\), in which the set of vertices \(V_h\) is equal to \(\mathcal {R}\). The set of edges \(E_h\) is defined by \(E_h=\{(r_i,r_j),(r_j,r_i)~|~r_i, r_j \in \mathcal {R},~r_i \ne ~r_j,~\textrm{in which}~ r_j ~\textrm{is the smallest region that contains}~ r_i \}\).

kNN-Based Graph. This graph, denoted by \(G_k = (V_k,E_k)\), is computed from the k-nearest neighborhood of regions of the hierarchical structure \(\mathcal {H}\) in the feature space, in which the set of vertices \(V_k\) is equal to \(\mathcal {R}\). Let \(W(V) = \{w(v),~\forall v \in V\}\) be the set of feature vectors related to the vertex set V. The set of edges \(E_k\) is defined by \(E_k=\{(r_i,r_j)\;|\;w(r_j)~\mathrm{is one of the k-nn of~} w(r_i)\}\).

Complete-Based Graph. This graph, denoted by \(G_c = (V_c,E_c)\), is computed from the hierarchical structure \(\mathcal {H}\), in which the set of vertices \(V_c\) is equal to \(\mathcal {R}\). The set of edges \(E_c\) is defined by \(E_c=\{(r_i,r_j)~|~r_i, r_j \in \mathcal {R},~r_i \ne ~r_j\}\).
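The three edge-set constructions can be sketched compactly; in the snippet below, regions are identified by integer ids, the hierarchy is given as a hypothetical parent map (each region pointing to the smallest region containing it), and the kNN variant uses NumPy feature vectors with Euclidean distance:

```python
import numpy as np

def hierarchy_edges(parent):
    """Hierarchy-based E_h: each region is linked, in both directions,
    to the smallest region that contains it (its parent)."""
    E = set()
    for r, p in parent.items():
        if p is not None:
            E.add((r, p)); E.add((p, r))
    return E

def knn_edges(feats, k):
    """kNN-based E_k: each region points to its k nearest regions in
    feature space."""
    E = set()
    for i in range(len(feats)):
        d = np.linalg.norm(feats - feats[i], axis=1)
        d[i] = np.inf                       # exclude the region itself
        for j in np.argsort(d)[:k]:
            E.add((i, int(j)))
    return E

def complete_edges(n):
    """Complete-based E_c: every ordered pair of distinct regions."""
    return {(i, j) for i in range(n) for j in range(n) if i != j}
```

Note that only the hierarchy-based construction grows linearly with the number of regions; the complete graph is quadratic, which is consistent with the edge-count differences reported in the experiments.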

It is important to mention that each region is represented by the following set of features: color (color channel means and a 2-bin color histogram), texture (contrast, dissimilarity, homogeneity, energy, correlation, and angular second moment), region (orientation, bounding box area, solidity, area, eccentricity, convex area, perimeter, mean intensity, Euler number, and Hu moments), position (X and Y mean position), and the vertex altitude in the segmentation tree. The extracted features were then used to create a vertex feature vector \(h_{u}\in \mathbb {R}^{104\times 1}\) for each \(u \in {V}\).
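As a small illustration of the color part of this descriptor, the channel means and 2-bin histograms for one region might be computed as below; the texture, shape, position, and altitude features would be appended analogously, and the exact 104-D layout used in the paper is not reproduced here:

```python
import numpy as np

def color_features(pixels):
    """Color portion of a region descriptor: per-channel mean plus a
    2-bin histogram per channel, for one region.

    pixels : (m, 3) array of RGB values in [0, 1]
    """
    means = pixels.mean(axis=0)                       # channel means (3,)
    hists = [np.histogram(pixels[:, c], bins=2, range=(0, 1))[0]
             for c in range(3)]                       # 2-bin histograms
    return np.concatenate([means, np.concatenate(hists)])   # (9,)
```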

4.2 Architecture

Figure 3 shows the proposed GCN architecture for the image classification task. To embed the input edge and vertex features, two linear layers are applied to produce D-dimensional embeddings. The dimensions of the edge and vertex embeddings remain the same across all layers. Inspired by [5], this work adopts \(\mathcal {M}+1\) convolutions, in which \(\mathcal {M}\) is chosen at inference time. All \(\mathcal {M}\) convolutional layers share the same weights, which improved the results and made the proposed architecture parameter-efficient [5].

Fig. 3.

Model Architecture. The input vertex and edge features are h and w, respectively.

An adaptive architecture that adjusts its depth can capture important features and patterns that lead to better accuracy and performance. However, balancing model capacity and complexity is crucial to avoid sub-optimal results. If the model is too shallow, it may not be able to capture complex patterns, while a model that is too deep may suffer from overfitting or be computationally expensive. After the graph convolution layers, we employ a readout layer that generates a fixed-size vector representation from the graph features. The output of the readout layer is then fed into a multi-layer perceptron (MLP) that learns to make class predictions based on the graph features.

By combining the adaptive depth graph convolution layers, readout layer, and MLP, the proposed model can effectively extract and learn hierarchical representations of the input graphs, leading to accurate and robust classification results.

5 Experimental Results

To evaluate our proposal for image classification, we applied our strategy to a well-known database, considering the three different graphs computed from the hierarchy. All models were trained for 1,000 epochs with the Cross-Entropy loss, a batch size of 64, and the Adam optimizer with \( \beta _{1} = 0.9 \), \(\beta _{2} = 0.98\), and \(\epsilon = 10^{-9} \). The initial learning rate of \(10^{-3}\) is reduced by half whenever the validation accuracy does not improve for ten epochs, until reaching a stopping learning rate of \(10^{-5}\). We saved the weights of the epoch with the best validation accuracy.
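The learning-rate schedule described above (halving after a ten-epoch plateau, down to \(10^{-5}\)) is essentially what PyTorch's `torch.optim.lr_scheduler.ReduceLROnPlateau` provides; a dependency-free sketch of the same logic, taking the per-epoch validation accuracy history, is:

```python
def plateau_schedule(val_accs, lr0=1e-3, lr_min=1e-5, patience=10,
                     factor=0.5):
    """Return the learning rate used at each epoch: halve it when the
    validation accuracy has not improved for `patience` epochs, and
    stop the decay once `lr_min` is reached."""
    lr, best, stale, lrs = lr0, float("-inf"), 0, []
    for acc in val_accs:
        if acc > best:
            best, stale = acc, 0      # improvement resets the counter
        else:
            stale += 1
            if stale >= patience and lr > lr_min:
                lr = max(lr * factor, lr_min)
                stale = 0
        lrs.append(lr)
    return lrs
```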

We evaluated the proposed model in the CIFAR-10 database [18], comprising 60,000 32 \(\times \) 32 color images across ten classes, with 6,000 images each. The database is split into 45,000 training images, 5,000 validation images, and 10,000 test images. It is important to observe that we have followed the procedure described in [9] and randomly sampled 5,000 images from the training set for validation. The same splits were used for all experiments.

5.1 Implementation Details

Graph Construction. We adopt SLIC [1] as the superpixel segmentation method since it is simple, fast, and memory-efficient, and because almost all works in our comparative analysis use it, which makes the comparison fair. The target number of superpixels is typically 20 but may vary for each image. We have used the watershed by area [6] as the hierarchical segmentation method, which outperformed methods based on other attributes in our experiments. Thus, the hierarchy-based, kNN-based, and complete-based graphs are constructed from the hierarchy computed using watershed by area, which is applied to the RAG of the superpixels obtained by SLIC. The number of nearest neighbors k was set to 8 in the kNN-based graph setup.

Architecture. To implement our model, we use PyTorch Geometric [10], which batches multiple graphs into a single graph with multiple subgraphs for mini-batch training. Following [5], we could set the variable depth \(\mathcal {M}\) as half of the number of vertices in each graph. However, this is not feasible for graphs with different sizes in the same batch. Instead, we use \(\mathcal {M} = \left\lfloor max(|\mathcal {V}|_{minibatch}) /2 \right\rfloor \), in which \(max(|\mathcal {V}|_{minibatch})\) is the maximum number of vertices among the graphs in the batch.
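The batch-level depth rule can be stated in one line; in the sketch below, `graph_sizes` holds \(|\mathcal {V}|\) for every graph in the mini-batch:

```python
def variable_depth(graph_sizes):
    """M = floor(max |V| in the mini-batch / 2): the number of
    shared-weight convolutions applied to every graph in the batch."""
    return max(graph_sizes) // 2
```

Since the convolutional layers share weights, increasing \(\mathcal {M}\) for a batch of larger graphs adds no parameters, only message-passing steps.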

We use GatedGCN as the graph convolution since it preserves and updates the edge features \(w_{uv}\) between vertices u and v at each layer [9], and the soft attention mechanism in Eq. 4 enables the model to learn how important each neighbor v is for vertex u. Furthermore, we changed Eqs. 3–6 by replacing batch normalization with layer normalization and adding nonlinearity and normalization before computing the attention coefficients, resulting in Eqs. 7–10.

$$\begin{aligned} h'_{u} = h_{u} + \textrm{ReLU}(\textrm{LN}(U^{\ell }h_{u} + \sum _{v \in N_{u}}\alpha _{uv}\odot V^{\ell }h_{v})) \end{aligned}$$
(7)
$$\begin{aligned} \alpha _{uv} = \frac{\sigma (\hat{w}_{uv})}{\sum _{{v}' \in N_{u}}\sigma (\hat{w}_{u{v}'}) + \varepsilon } \end{aligned}$$
(8)
$$\begin{aligned} \hat{w}_{uv} = \textrm{ReLU}(\textrm{LN}(A^{\ell }h_{u}^{\ell }+B^{\ell }h_{v}^{\ell } + C^{\ell } {w}_{uv})) \end{aligned}$$
(9)
$$\begin{aligned} {w}^{'}_{uv} = {w}_{uv} + \textrm{ReLU}(\textrm{LN}(\hat{w}_{uv})) \end{aligned}$$
(10)

Distinct from other GCN models, this work uses a readout layer that concatenates global mean and global max pooling, capturing the average and maximum values of the vertex features over the entire graph. Combining two permutation-invariant functions was motivated by the better results obtained in initial experiments compared to using only one. It is worth mentioning that the MLP includes two linear layers with layer normalization, which helps improve the model’s training stability and generalization performance. Table 1 shows the details of each layer in our proposed architecture.
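The readout described above is a simple concatenation of two pooled vectors; a NumPy sketch for a single graph with vertex features `h` of shape (n, d) is:

```python
import numpy as np

def readout(h):
    """Concatenate global mean and global max pooling over the vertex
    features of one graph, yielding a fixed-size 2d-dimensional graph
    representation that is fed to the MLP classifier."""
    return np.concatenate([h.mean(axis=0), h.max(axis=0)])
```

In the PyTorch Geometric implementation this corresponds to combining its mean and max graph pooling operators over each graph in the batch; the sketch above only shows the single-graph case.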

Table 1. HGCIC Architecture Details
Table 2. Accuracy of the proposed model and state-of-the-art methods on the CIFAR-10 dataset. * means the target number of nodes, since the authors do not report the average value, and cells with – refer to data that have not been reported.

5.2 Quantitative Analysis

Table 2 shows the results for the HGCIC model and other state-of-the-art methods. The HGCIC model was trained using three different graphs: kNN-based (HGCIC\(_{k}\)), hierarchy-based (HGCIC\(_{h}\)), and complete-based (HGCIC\(_{c}\)). Interestingly, the HGCIC\(_{h}\) model, based on the hierarchy structure, has a much smaller number of edges and outperforms the other two graphs.

This result is noteworthy since the number of edges in a graph directly impacts the message-passing step of graph neural networks, which are designed to learn from the graph’s structural information. However, our findings suggest that the relationships captured by edges in the graph are more critical than the graph’s number of edges. This is consistent with prior work showing that incorporating hierarchical relationships between vertices and edges can help improve GNN performance [17]. Overall, test results demonstrate the effectiveness of the proposed hierarchical adjacency method in enhancing our model’s performance.

Despite the superior performance of the HGCIC\(_{h}\) model, our work did not achieve the best accuracy in image classification. However, we achieved a competitive result without resorting to complex strategies used by other methods, such as recurrent neural networks [5], multiple relations [17], positional embeddings [21], or vector fields to define the directions of information propagation [3]. These techniques could also be incorporated into our model in the future to boost its performance further. Another factor that could affect our results is the size of our graphs, which were the smallest among the compared methods. The only model with fewer parameters than ours was the one proposed by [2], but it achieved a lower result than the proposed method. Therefore, our work demonstrates a promising approach for graph-based image analysis and shows the potential of hierarchical segmentation for creating effective graph representations of images.

A drawback of the methods proposed in [17] is that they require computing the eigenvalues and eigenvectors of the graph Laplacian matrix. This step is essential for learning filters that depend on the Laplacian eigenbasis and capture the graph structural information, but it can be costly. Furthermore, since the Laplacian eigenbasis is specific to each graph, models trained on one graph may not generalize well to others. This limitation can restrict the scalability and applicability of these methods to a broader range of graph data. Using spatial GNNs, our model can capture the geometric features of the graph without relying on the Laplacian eigenbasis and avoid the problem of generalization to new graphs.

Table 3. Examples of predictions of the proposed models. Images have been enlarged for easy viewing.

5.3 Qualitative Analysis

In Table 3, we present the qualitative results of our models. We observe that the model trained with the hierarchy-based adjacency can classify not only simple images but also images that are challenging for humans to classify due to their low resolution (only \(32\times 32\)).

5.4 Ablation Study

We conducted an ablation study to assess the performance impact of various components in our proposed method. We compared our proposed architecture with a baseline model based on [9], which has four GatedGCN layers, a global average pooling for the readout layer, and an MLP similar to ours, but without normalization as in the original paper. Additionally, we tested the proposed architecture and graph creation method with a simple feature vector as used in [2, 9, 17], which consists of the concatenation of the average value for each color channel and the geometric centroid, to show that a more expressive feature set can enhance the performance of the task.

Table 4 shows the results of our ablation study. All modified models showed degraded accuracy compared to the originals, demonstrating the benefits of both the proposed architecture and the chosen features.

Table 4. Performance of the ablation study to assess how changes in the proposed method affect performance. HGCIC\(_{\bullet }^{5F}\) are the models with the proposed architecture trained with the simpler 5D feature vector, and G4GCN\(_{\bullet }\) are trained with the full set of features and four GatedGCN layers.

Out of all the models tested, the ones using the architecture proposed in [9] showed the most significant decrease in accuracy. This was due to the limited capacity of the architecture to learn complex features from the data. Specifically, the four layers in this architecture were insufficient for the graph-learned embeddings to converge, resulting in poor representations in the final GNN layer.

Among the models trained with the simpler 5D feature vector, the one with the highest accuracy was the model trained with the complete graph adjacency. This model performed best because, with such limited features, the distance between vertices is not a reliable indicator of segment differences; as a result, the model gave similar weight to all neighbors in the message passing, consequently aggregating features from more neighbors. Similar behavior also occurred in the experiments with the architecture of [9], where the hierarchy adjacency caused relevant vertices to be farther away from the others. Again, four layers are insufficient for the information to propagate throughout the graph.

6 Conclusion

This work introduces a new approach for constructing graphs from images using hierarchical segmentation methods. Additionally, it presents a new model called HGCIC, which has been trained using graphs obtained from hierarchical segmentation in three different adjacency setups. The proposed model has demonstrated remarkable results by incorporating variable depth, hierarchical relationships through edges, and well-defined features. The implications of this research are exciting and could have a far-reaching impact on various applications that rely on graph-image analysis.

Although the results were slightly inferior to the best state-of-the-art accuracy, they still showed promise. The proposed approach utilizes significantly smaller graphs than those in previous works, and the proposed GCN architecture contains fewer parameters yet still delivers promising results. Moreover, the ablation study confirmed the hypothesis that the choice of architecture and features positively impacts the model’s overall performance. These findings highlight the potential of the proposed approach as a more efficient and effective means of graph construction and analysis.

In future works, we plan to investigate the impact of attention mechanisms on our approach. Additionally, we aim to conduct a more in-depth analysis of the relationship between the number of vertices and model performance while exploring multiple graph relations.