1 Introduction

With the exponential growth of data, the challenge lies in acquiring raw data and transforming it into valuable information. Although the data exists, extracting and utilizing it is not a simple task. Complex business cases, for example, can be effectively modeled as graphs, which may contain rich data about the entities they represent. Enhancing learning from graph representations signals a viable path to uncovering the hidden knowledge in raw data [6]. Deep Learning and Machine Learning excel at discovering hidden knowledge and characteristics, enabling the use of specific techniques to generate rich data from each graph node. The semantically rich data generated can then be used in various downstream applications, such as Recommender Systems (RecSys) [1, 17].

However, an important restriction is the typically low level of information attached to graphs. If graph elements, especially nodes, carry more data, the knowledge representation improves and enhances the performance of the underlying applications. Node embedding is a technique that maps graph nodes to low-dimensional vectors that preserve the graph structure and node features [16]. Thus, node embeddings are an excellent option for collecting and aggregating data from nodes, resulting in high-quality node representations.

Some works [10, 18] introduce the generation of embeddings based on neighboring nodes, increasing their expressiveness and recommendation performance. Meanwhile, the works [17, 19] evaluate embeddings over heterogeneous data types, opening up a new perspective. The studies [7, 16] propose a new approach for generating embeddings from metapaths, capturing semantics based on node relationships. Vision GNN [11] and Superpixel Image Classification [3] are examples of the few works where the graph is an image and the nodes are parts of it, demonstrating the feasibility of using images as nodes. None of the related work addresses the use of Heterogeneous Graphs with heterogeneous data types, such as texts, images, and subgraphs.

This paper proposes the specifications and a new algorithm that leverages Deep Learning techniques to extract information from images. Feature generation is provided by Specialized Autoencoders, which are Machine Learning models designed to generate embeddings by mapping high-dimensional data to a compact and meaningful latent representation, preserving the essential data from the node features. The experimental results clearly demonstrate that embeddings, particularly compositions of different types of embeddings, can yield superior results compared to evaluations conducted without them. In general, this paper contributes an unexplored way of generating information and embeddings on nodes of Heterogeneous Graphs, through the definition of a modeling approach using texts, images, and subgraphs as representations of the graph nodes.

In previous work AGHE [2], we proposed an approach for generating and processing heterogeneous embeddings. In this paper, we specify each step of the approach, with the main contributions as follows: a) the specification of procedures for processing and generating different types of embeddings; b) an algorithm based on these procedures for processing and generating heterogeneous embeddings; c) the introduction of composition embeddings as an alternative to generate highly semantic nodes, together with an evaluation of the performance achieved by the proposed algorithm; d) the sharing of public datasets, including the final recommended graph in JSON and CSV formats, with the research community, particularly those focused on Heterogeneous Graphs and RecSys.

The remainder of this paper is organized as follows: Sect. 2 conceptualizes the background techniques applied in this paper. Section 3 describes the related work. Section 4 presents the AGHE approach. Section 5 defines the specification of the procedures that serve as the foundation for the proposed algorithms shown in that section. Section 6 conducts the experiments and evaluates the results achieved, while Sect. 7 presents the conclusions and future work.

2 Background

The aim of this section is to present the Machine Learning and Deep Learning concepts and techniques adopted in this paper.

Heterogeneous Graph and Embeddings. In Heterogeneous Graphs, the nodes and edges can be of different types; e.g., a group of people represented by a Heterogeneous Graph can have nodes of different types. The bipartite graph is a special, commonly used type of Heterogeneous Graph, where edges exist between nodes of two different types. Thus, multiple types of nodes and different relationships carry comprehensive information and rich semantics [8]. A recurring problem in Representation Learning on Heterogeneous Graphs is the Over-Smoothing issue: after too many aggregations of the features in a graph, the node embeddings start to converge to nearly the same, or even exactly the same, value, losing the capacity to discriminate between the different types of nodes and their various characteristics. During the aggregation process, if neighboring nodes have very similar features or if the aggregation is repeated many times, the local differences between nodes are smoothed out, leading to a homogenization of the features. Two simple techniques used to mitigate Over-Smoothing are discarding duplicate features from nodes during the embedding generation step and removing random edges during each training epoch [12].
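The two mitigations mentioned above can be sketched in a few lines of plain Python; the feature list, edge list, and drop rate below are illustrative assumptions, not values from the paper:

```python
import random

def dedup_features(features):
    """Discard duplicate features before aggregation (set semantics,
    preserving first-seen order)."""
    seen, out = set(), []
    for f in features:
        if f not in seen:
            seen.add(f)
            out.append(f)
    return out

def drop_random_edges(edges, drop_rate, rng=None):
    """Remove a random fraction of edges at each training epoch
    (DropEdge-style regularization)."""
    rng = rng or random.Random(0)
    return [e for e in edges if rng.random() >= drop_rate]

feats = ["red", "car", "red", "fast", "car"]
print(dedup_features(feats))  # ['red', 'car', 'fast']

edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
kept = drop_random_edges(edges, drop_rate=0.25)
print(set(kept) <= set(edges))  # True: only original edges survive
```

In practice, a fresh random subset of edges would be dropped at every epoch, so the model never sees exactly the same neighborhood twice.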

Embedding captures the graph topology, node features, node-to-node relationships, and other relevant information about graphs, subgraphs, and nodes. Hence, embedding represents the nodes, and the similarity between node embeddings indicates their similarity in the graph [10, 14].

MetaPath2Vec. A Heterogeneous Graph embedding model that formalizes metapath-based random walks to construct the heterogeneous neighborhood of a node and then leverages a heterogeneous Skip-Gram model to perform node embedding. It maximizes the probability of preserving both the structures and semantics of a given heterogeneous network, learning desirable node representations in heterogeneous networks [7].

Autoencoders. They are useful for incorporating structural graph information from nodes, providing specific implementations. An Autoencoder (AE) is a neural network architecture that imposes a bottleneck on the network, forcing a compressed knowledge representation of the original input. If the input features were independent of each other, compression and subsequent reconstruction would be a very difficult task. Thus, AEs are neural networks that aim to copy their input to their output, compressing the input into a latent space representation through an Encoder \(h = f(x)\) and then rebuilding the output from this representation through a Decoder \(r = g(h)\) [5]. Autoencoders may be modified or combined to form new models for various applications such as generative modeling, classification, clustering, anomaly detection, recommendation, dimensionality reduction, and information capture [5].
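As an illustration of the Encoder \(h = f(x)\) and Decoder \(r = g(h)\) composition, consider the simplest case: a linear Autoencoder with a rank-p bottleneck, whose optimal solution is given in closed form by the SVD (equivalent to PCA). This is a deliberate simplification of the trained nonlinear networks the paper uses; the synthetic data below is an assumption for demonstration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Data that is intrinsically 2-dimensional, embedded in R^5.
Z = rng.normal(size=(100, 2))
M = rng.normal(size=(2, 5))
X = Z @ M  # 100 samples, 5 features, rank 2

# Optimal linear autoencoder with bottleneck p=2: project onto the
# top-p right singular vectors of X.
p = 2
_, _, Vt = np.linalg.svd(X, full_matrices=False)
W = Vt[:p].T                 # 5 x 2 projection matrix

encode = lambda x: x @ W     # Encoder h = f(x): R^5 -> R^2
decode = lambda h: h @ W.T   # Decoder r = g(h): R^2 -> R^5

H = encode(X)
R = decode(H)
print(H.shape)                       # (100, 2)
print(np.allclose(R, X, atol=1e-8))  # True: rank-2 data reconstructs exactly
```

Because the data truly lives in a 2-dimensional subspace, the bottleneck loses nothing; with real node features, the reconstruction error measures how much information the latent representation preserves.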

3 Related Work

The generation of data embeddings from node features in Heterogeneous Graphs can be traced back to random-walk-based approaches such as Representation Learning on Graphs [10, 18], which improve node expressivity. Closer to the aims of our proposal is [19], which defines a Heterogeneous Graph Neural Network with embedding processing. The survey Graph Neural Networks in Recommender Systems [17] shows that GNNs have been widely used in downstream applications, essentially because graph structures and GNNs have superiority in graph Representation Learning, citing GraphSAGE [10] as an important work on generating node embeddings from node feature information.

MetaPath2Vec [7] captures the structure of the Heterogeneous Graph, guiding random walks to generate sequences of heterogeneous nodes with rich semantics. Hence, the metapath plays an important role in this paper, capturing vital information by leveraging the relationships among heterogeneous nodes and transforming it into a form of node embedding. Vision GNN [11] and Superpixel Image Classification [3] are other sources of inspiration that illustrate image representation in the form of a graph. In this context, each node corresponds to a distinct part of the same image, implying that every node encapsulates an image. Adopting a similar conceptualization, we extend this idea by employing an image to represent node content.

Heterogeneous Graphs can be integrated into various applications. However, one usually needs to carefully consider two factors: first, how to construct the Heterogeneous Graph for a specific application, and second, what information or domain knowledge should be incorporated into the Heterogeneous Graph to ultimately benefit the application [16]. In RecSys, the interaction between users and items can be naturally modeled as a Heterogeneous Graph with two types of nodes. Thus, the application of Heterogeneous Graph embedding to RecSys constitutes an important research area, as highlighted in the survey on Heterogeneous Graph Embedding [16]. Hence, our proposal uses recommendation as an assessment to validate the performance gains from employing various types of node embeddings from the Heterogeneous Graph.

4 AGHE Approach for Generating Enhanced Heterogeneous Embeddings from Heterogeneous Graphs

This section presents the Approach for Generating Enhanced Heterogeneous Embeddings from Heterogeneous Graphs (AGHE) [2], shown in Fig. 1. AGHE generates heterogeneous embeddings through the processing of texts, images, and subgraphs represented in the nodes of Heterogeneous Graphs, via the following steps:

1. Graph Creation - generates the Heterogeneous Graph along with all its components, such as nodes, edges, and node features. This step is critical because all other steps and results depend on it;

2. Generating Text Node Embeddings - the process of creating node embeddings from their corresponding node features or extracted from images embedded in the nodes;

3. Metapath and Aggregated Node Embeddings - generates aggregated feature embeddings using the random-walk approach. This includes defining metapaths that represent the business rules through the relationships among the nodes, followed by the generation of their embeddings using the MetaPath2Vec algorithm;

4. Graph Enhancement with RecSys Tasks - represents the experiments conducted in this paper, aiming to predict the type of nodes, predict links, and cluster the nodes based on the Heterogeneous Graph generated in the first step;

5. Rebuilding the Graph - involves incorporating the generated embeddings and saved predictions into the graph nodes.

Fig. 1. Steps of AGHE - Approach for Generating Enhanced Heterogeneous Embeddings from Heterogeneous Graphs.

In the following sections, we describe the main contributions of this paper, specifying the key procedures to be performed at each step, serving as the foundation for building the proposed algorithm.

5 Specification for Enhancing Graph Data Quality

This section builds on the AGHE approach [2], defining the procedures used in the proposed Algorithms 1 and 2, as highlighted in Fig. 1; each of them is described as follows.

5.1 Graph Creation

Graph creation is the first step and consists of two parts. The first regards adding nodes and defining their types, characterizing a Heterogeneous Graph. Subsequently, node features should be attached using a set of short descriptions or simple words. In this context, edge features are not considered part of the Heterogeneous Graph in this paper. The second part involves mapping relationships between nodes by adding edges to the Heterogeneous Graph, which enables navigation between the nodes. If the node type is Image, another important operation is needed: uploading the node image from an image database provided by the application that owns the graph. As a result, the Heterogeneous Graph is created and ready to be used in subsequent steps. Figure 2(a) shows a Heterogeneous Graph \(\mathcal{H}\mathcal{G}(V, E)\), where V is a set of nodes and E is a set of edges, in which a node \(v \in V\) can have different types of content with various data type features, such as text, image, and subgraph. Figure 2(b) shows the same graph simplified, with heterogeneous data features embedded into the nodes, after the graph creation is done.
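The two parts of this step can be sketched with a minimal in-memory structure; the node identifiers, types, and features below are illustrative assumptions, not entries from the experimental dataset:

```python
# Minimal Heterogeneous Graph HG(V, E): nodes carry a type and
# heterogeneous features; edges are plain pairs (no edge features,
# matching the restriction stated in the text).
hg = {"nodes": {}, "edges": []}

def add_node(hg, node_id, node_type, features=None, image_path=None):
    """Part 1: add a typed node with optional short text features.
    For Image nodes, record where to upload the image from later."""
    hg["nodes"][node_id] = {
        "type": node_type,
        "features": list(features or []),
        "image_path": image_path,  # resolved against the image database
    }

def add_edge(hg, u, v):
    """Part 2: map a relationship, enabling navigation between nodes."""
    if u in hg["nodes"] and v in hg["nodes"]:
        hg["edges"].append((u, v))

add_node(hg, "mary", "Person", features=["teacher", "driver"])
add_node(hg, "car1", "Car", image_path="images/car1.jpg")
add_edge(hg, "mary", "car1")

print(len(hg["nodes"]), len(hg["edges"]))  # 2 1
```

A library such as networkx would normally play this role; the dictionary form just makes the node-type/feature/edge bookkeeping of the step explicit.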

Fig. 2. Heterogeneous Graph within heterogeneous data types.

5.2 Embeddings Creation and Rebuilding the Graph

After the graph is ready for use, the following steps describe the creation of embeddings from the processing of heterogeneous features, detailing the techniques used in the proposed algorithm:

1. Specialized Autoencoder for Processing Image Feature Data Type - nodes of type image are processed in two steps: first, a Convolutional Neural Network (CNN) classifies the image and, second, extracts the characteristics from the image. Both results, class and characteristics, are merged and saved into the nodes as text data features. A CNN can be represented as the approximation function \(f^{*}: \mathbb {R}^{\alpha \times \beta \times \gamma } \rightarrow \{1,...,c\}\) that takes as input an image from an unknown distribution over \(\mathbb {R}^{\alpha \times \beta \times \gamma }\), where \(\alpha \times \beta \) are the pixels of the image and \(\gamma \) is the number of color bands, and determines which class the image belongs to from a set of classes \(\{1,...,c\}\). Our proposed algorithm uses the ResNet50 model, a CNN with 50 layers and ImageNet weights, to implement the Specialized Autoencoder for extracting classes and characteristics from nodes whose types are images. The extracted data is saved into the respective node feature as a text vector;

2. Specialized Autoencoder for Generating Aggregated Embedding - the text features of nodes are captured over k iterations over their neighbors and combined with the embedding information from the previous \(k-1\) iterations. The approach has two phases, Message-Passing and Readout [9, 13]. The Message-Passing (or Propagation) phase runs for T time steps and contains two subfunctions: a message function \(M_{t}\) and a node update function \(U_{t}\). During the message-passing phase, the hidden states \(h_{v}^{t}\) at each node in the graph are updated based on messages \(m_{v}^{t + 1}\) according to:

    $$\begin{aligned} \begin{aligned} m_{v}^{t + 1} = \sum _{w \in N_{(v)}} M_{t}(h_{v}^{t},h_{w}^{t}), \ \ \ \ h_{v}^{t + 1} = U_{t}(h_{v}^{t},m_{v}^{t + 1}), \end{aligned} \end{aligned}$$
    (1)

    where \(N_{(v)}\) denotes the neighbors of node v in graph G, \(h_{w}^{t}\) is the hidden state of neighbor w, and the function \(U_{t}\) updates the hidden states \(h_{v}^{t}\). The Readout phase uses a function R to compute an embedding vector representation for the entire graph as follows:

    $$\begin{aligned} \hat{y} = R(\{ h_{v}^{T} \mid v \in G \}), \end{aligned}$$
    (2)

    where T denotes the total number of time steps. The Message-Passing phase in our algorithm was implemented through random walks, which access k neighboring nodes, extract their features, and aggregate them with the local node. To avoid the Over-Smoothing issue in the first layer, the node features were saved into a data structure of type set, removing duplicate features. The Readout phase in our proposed algorithm uses a Word2Vec model to generate the vector node embeddings from the node features produced by the Message-Passing phase;

3. Specialized Autoencoder for Generating Metapaths Embedding - the first step of the process is to read the original graph as input; then, the MetaPath2Vec algorithm, using the pre-defined metapaths as sequences of nodes constrained by node type, traverses the graph and captures the semantics between the nodes. The node information is saved into the latent vector space through the function \(f: v \rightarrow \mathbb {R}^{d}\), where v is a node and \(\mathbb {R}^{d}\) is the vector space. The specific nodes of the graph data type already have a linked adjacency matrix, like an embedded subgraph, which MetaPath2Vec exploits to generate the corresponding metapath embedding [15]. MetaPath2Vec uses random walks to generate the metapath embeddings, where a Skip-Gram or Node2Vec model maintains the proximity of node v and its neighbors in the random walk sequences. A Heterogeneous Graph \(\mathcal{H}\mathcal{G} = (V,E,T)\) is considered, where \(T_{V}\) represents the set of node types and \(\mid T_{V} \mid \ > \ 1\), maximizing the probability of having a heterogeneous context \(N_{t}(v), t \in T_{V}\), given a node v:

    $$\begin{aligned} arg \ \underset{\theta }{max}\ \sum _{v \in V} \sum _{t \in T_{V}} \sum _{c_{t} \in N_{t}(v)} \log p(c_{t} \mid v;\theta ), \end{aligned}$$
    (3)

    where \(N_{t}(v)\) denotes v's neighbourhood of the \(t^{th}\) node type and \(p(c_{t} \mid v;\theta )\) is defined as the Softmax function

    $$\begin{aligned} p(c_{t} \mid v;\theta ) = \frac{e^{X_{ct} \cdot X_{v}}}{\sum _{u \in V} {e^{X_{u} \cdot X_{v}}}}, \end{aligned}$$
    (4)

    where \(X_{v}\) is the \(v^{th}\) row of X, representing the embedding vector for node v [7]. The proposed algorithm implements the MetaPath2Vec algorithm by performing walks on the Heterogeneous Graph based on a set of predefined metapaths. The hyperparameters “length” and “walks” define the length of each random walk and the number of random walks to generate from each node, respectively. By tracking valid walks based on node types, a Node2Vec model is built to generate the metapath vector embeddings for each node;

4. Consolidating Data Embedding and Rebuilding the Graph with Node Features and Embeddings - each Autoencoder \(A:\mathbb {R}^{n} \rightarrow \mathbb {R}^{p}\) generates its output data vectors, which are used as input by the Decoder \(B:\mathbb {R}^{p} \rightarrow \mathbb {R}^{n}\) to rebuild the entire graph, associating the nodes with their generated data embeddings, satisfying

    $$\begin{aligned} arg \ min_{A,B} \ E[\varDelta (x,B \circ A(x))], \end{aligned}$$
    (5)

    where E is the expectation over the distribution of x, \(\varDelta \) is the reconstruction loss function, which measures the distance between the output of the Decoder and the input, and A and B are neural networks [4, 5]. Based on the data saved during the entire process, the proposed algorithm rebuilds the Heterogeneous Graph with the original nodes and edges, adding the generated features, vector embeddings, and prediction information to the respective nodes and edges.
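The Message-Passing aggregation described in item 2 can be sketched in plain Python: neighbor features are collected up to k hops and deduplicated with a set to curb Over-Smoothing; the Word2Vec Readout step is omitted here, and the toy graph and features are illustrative assumptions:

```python
def aggregate_features(graph, features, node, k):
    """Collect the node's own features plus those of neighbors up to k
    hops away, deduplicated (set semantics), preserving first-seen order."""
    seen, out = set(), []
    frontier, visited = [node], {node}
    for _ in range(k + 1):
        nxt = []
        for v in frontier:
            for f in features.get(v, []):
                if f not in seen:   # set dedup mitigates Over-Smoothing
                    seen.add(f)
                    out.append(f)
            for w in graph.get(v, []):
                if w not in visited:
                    visited.add(w)
                    nxt.append(w)
        frontier = nxt
    return out

graph = {"mary": ["car1", "pet1"], "car1": ["mary"], "pet1": ["mary"]}
features = {"mary": ["teacher"], "car1": ["red", "fast"], "pet1": ["dog", "fast"]}
print(aggregate_features(graph, features, "mary", k=1))
# ['teacher', 'red', 'fast', 'dog']  ("fast" appears only once)
```

In the full pipeline, the returned feature list for each node would be fed to a Word2Vec model to produce the aggregated vector embedding.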

5.3 Algorithms

Algorithm 1 processes Heterogeneous Graphs with features of different data types as input, following the procedures described in Sect. 5.2. It produces node representations in latent vector spaces, consolidating various data type features into a unified format and generating the respective node embeddings. The algorithm takes a Heterogeneous Graph \(\mathcal{H}\mathcal{G}\) as input data and consists of several main blocks. Lines 1 to 3 locate the subgraphs in the \(\mathcal{H}\mathcal{G}\) and recursively call the algorithm to generate their features and embeddings. Lines 4 to 7 iterate over the nodes, uploading the images and extracting their features using a specialized image Autoencoder based on a ResNet50 CNN model with ImageNet weights. Line 8 generates node embeddings from node features using a specialized text Autoencoder with the Word2Vec algorithm. Lines 9 and 10 iterate over the set of hyperparameter metapaths M, generating metapath embeddings for the local node from its neighbors according to the MetaPath2Vec algorithm, using a Node2Vec model to create the respective embeddings. Lines 11 to 15 generate aggregated embeddings from direct neighbors and create compositions of embeddings, such as features with metapaths and aggregated features with metapaths. Thus, \(\mathcal{H}\mathcal{G}\) has all the features and embeddings returned in line 17, upon completion of the execution of Algorithm 1.

Algorithm 1 delivers a Heterogeneous Graph \(\mathcal{H}\mathcal{G}\) with vector embeddings \(v_{h}\), where \(v_{h} \in \{Features, Aggregated, Metapaths, Features+Metapaths, Aggregated+Metapaths\}\). Thus, the next step is to evaluate the performance of the downstream application using \(\mathcal{H}\mathcal{G}\) with embeddings \(v_{h}\) to determine if it achieves better results compared to without node embeddings.

Algorithm 2 implements RecSys tasks as an example of a downstream application, where lines 1 to 5 compute the RecSys tasks based on the set H of embeddings. In line 3, the Link Prediction task can define a target node \(V_{v}\) to predict the new edges, using Cosine Similarity as the applied technique. In line 4, the node classification task has the hyperparameters Iterations = 5 and K-fold cross-validation = 10 to calculate performance based on the average metrics and evaluate their standard deviations. Line 5 calls the node clustering task, defining 3 clusters using the KMeans algorithm. The generation of JSON and CSV files aims to provide a new public dataset with the raw data for the research community.

The time complexity of the algorithms is determined by the loops that iterate over the graph nodes: O(n), where n is the number of nodes in the graph; \(O(m \times n)\), where m is the number of metapaths; and \(O(n^{2})\) from the nested loops that generate embeddings composed of the neighboring features of each node, which dominates the other terms. Therefore, the total time complexity of the proposed algorithms is polynomial, \(O(n^{2})\) in the worst case.

6 Experiments

This section presents the experiments aimed at applying RecSys tasks to the graph with and without embeddings, and evaluating the resulting metrics in both cases. Special focus is given to the composition of embeddings, such as Features+Metapaths and Aggregated+Metapaths, as introduced in this paper. The experiments were guided by the methodology based on the approach for generating enhanced heterogeneous features and embeddings using Algorithms 1 and 2. Figure 3(a) shows the main pipeline of the methodology used to execute the entire set of experiments, guided by the following steps:

Algorithm 1. GenProcHetEmbedding: Generating and processing heterogeneous data type features and embeddings.

Algorithm 2. RecSys: Running RecSys tasks over the Heterogeneous Graph.

1. Graph Generation - responsible for creating Heterogeneous Graphs, including nodes, edges, and node features when available;

2. Embeddings Generation - the process of creating embeddings, where Features are generated from text node features; Aggregated embeddings are generated from the text node features of neighboring nodes; Metapath embeddings are generated from the defined set of metapaths; and Features+Metapaths and Aggregated+Metapaths are compositions of those embeddings;

3. RecSys and Performance Metrics - iterates over the set of embeddings, applying RecSys tasks to each type of embedding and generating the respective performance metrics.

In essence, the experiments involve the generation of features and embeddings from nodes, and the validation of RecSys performance metrics based on the generated embeddings. To evaluate the generalization capability and robustness of the XGBoost models used in the experiments, the Accuracy and F1-Score metrics should be calculated using 10 stratified K-folds, averaged over 5 iterations, along with their respective Standard Deviations.
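The stratification underlying this protocol - splitting the samples into folds so each fold keeps roughly the class (node-type) proportions - can be sketched without external libraries; the label list below is an illustrative assumption:

```python
from collections import defaultdict

def stratified_folds(labels, k):
    """Assign each sample index to one of k folds, round-robin within
    each class, so every fold keeps roughly the class proportions."""
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    folds = [[] for _ in range(k)]
    for idxs in by_class.values():
        for j, i in enumerate(idxs):
            folds[j % k].append(i)
    return folds

labels = ["Person"] * 6 + ["Car"] * 4
folds = stratified_folds(labels, k=2)
print(sorted(len(f) for f in folds))  # [5, 5]
# Each fold receives 3 Person and 2 Car samples.
```

A library routine such as scikit-learn's StratifiedKFold performs the same role (with shuffling) in a full evaluation pipeline; each fold serves once as the test split while the model trains on the rest.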

6.1 Heterogeneous Graph Data Model

Figure 3(b) defines the Heterogeneous Graph employed in the experiments, containing nodes of different types, such as Person, Car, and Pet. Each node has its own heterogeneous features, which may include texts, images (e.g., cars and pets), or subgraphs, such as Mary's family embedded into the Mary node, each with its corresponding embeddings.

Fig. 3. Methodology and Heterogeneous Graph within heterogeneous data types model.

Based on the scope of experiments and the use case shown in Fig. 3(b), the graph data model was defined according to Table 1. The entire Heterogeneous Graph used in the experiments is available at https://github.com/silviofernandoangonese/datasets/blob/main/experiments_initial_het_graph.json.

Table 1. Heterogeneous Graph data model used in the experiments.

6.2 Execution of Experiments

The experiments were guided according to the methodology shown in Fig. 3(a) and the pre-definition of the metapaths used in the experiments, which were \(\mathcal {M}\ \leftarrow \{(Car, Person), (Pet, Person), (Car, Person, Person), (Pet, Person, Pet), (Person, Person, Car), (Person, Person, Pet), (Person, Person, Person)\}\). These metapaths represent the interactions between different types of nodes, providing the business semantics for the respective node embeddings.
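A metapath-constrained random walk over such a set can be sketched as follows: at each step, only neighbors whose type matches the next position of the metapath are candidates, and walks that cannot be completed are discarded. The toy graph, node types, and seed are illustrative assumptions:

```python
import random

def metapath_walk(graph, node_types, start, metapath, rng):
    """Walk following the node-type sequence of the metapath;
    return None if the walk cannot be completed (invalid walk)."""
    if node_types[start] != metapath[0]:
        return None
    walk = [start]
    for t in metapath[1:]:
        candidates = [w for w in graph[walk[-1]] if node_types[w] == t]
        if not candidates:
            return None
        walk.append(rng.choice(candidates))
    return walk

graph = {"car1": ["mary"], "mary": ["car1", "john"], "john": ["mary"]}
node_types = {"car1": "Car", "mary": "Person", "john": "Person"}
rng = random.Random(0)
print(metapath_walk(graph, node_types, "car1", ("Car", "Person", "Person"), rng))
# ['car1', 'mary', 'john']
```

The resulting node sequences would then feed a Skip-Gram/Node2Vec model to produce the metapath embeddings, as described in Sect. 5.2.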

The second step was to generate single and composed node embeddings, which were defined as Features, Aggregated Features, Metapaths, a composition of Features and Metapaths, and a composition of Aggregated Features and Metapaths, which guided the entire experiments and analysis shown in Table 2.

The third and fourth methodological steps aim to validate the graph data quality by applying the Link Prediction, Node Classification, and Node Clustering tasks to all previously defined types of embeddings. Link Prediction was calculated using Cosine Similarity with a threshold of 80% relative to the target node; the algorithm supports selecting a specific node or all nodes as targets. For Node Classification, the node type was used as the class. Link prediction for “No Embeddings” was calculated using the Jaccard coefficient, based on the set intersection and union operations \(J(A, B) = \frac{|A \cap B|}{|A \cup B|}\). For Node Clustering, we pre-defined three clusters, C0, C1, and C2. “No Embeddings” clusters were calculated using the Louvain algorithm based on node communities, where nodes without an identified community have no cluster assigned.
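The two link scorers used above can be sketched directly: Cosine Similarity over embeddings with the 0.8 threshold, and the Jaccard coefficient over neighbor sets for the “No Embeddings” baseline. The vectors and neighbor sets below are illustrative assumptions:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def jaccard(neigh_a, neigh_b):
    """Jaccard coefficient J(A, B) = |A ∩ B| / |A ∪ B| over neighbor sets."""
    union = neigh_a | neigh_b
    return len(neigh_a & neigh_b) / len(union) if union else 0.0

# Predict a link when cosine similarity exceeds the 80% threshold.
emb = {"u": [1.0, 0.9, 0.1], "v": [0.9, 1.0, 0.2], "w": [-1.0, 0.1, 0.9]}
print(cosine_similarity(emb["u"], emb["v"]) > 0.8)  # True: link predicted
print(cosine_similarity(emb["u"], emb["w"]) > 0.8)  # False: no link

# Jaccard over neighbor sets for the "No Embeddings" baseline.
print(jaccard({"a", "b", "c"}, {"b", "c", "d"}))  # 0.5
```

Scoring every candidate pair this way and keeping those above the threshold yields the predicted-link counts reported in Table 2.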

The experimental results are collected and presented in Table 2, where each column represents the following: “Type Node Embeddings” indicates the type of generated embeddings from an ablation perspective, where the experiments aim to identify which embedding has the greatest impact on model performance; “Prediction Links” is the count of links predicted by the RecSys Link Prediction task; “Classification Avg Acc and F1-Score” are the averages of the Accuracy and F1-Score metrics, and “STDs” are the Standard Deviations of each average, illustrating the overall accuracy of node classification over the entire graph; “Cor” is the count of correctly predicted links from the final model applied over the entire graph, and “Inc” is the count of incorrectly predicted links. The columns “Clusters C0 C1 C2” display the count of nodes clustered by the RecSys task.

Table 2. Performance metrics achieved from different types of node embeddings.

6.3 Evaluation of Results

The assumption that enriching the Heterogeneous Graph with heterogeneous embeddings, generated from processing the data available within the graph, could enhance the performance of downstream applications was validated. Table 2 illustrates the evolution of performance, starting with the Features embeddings and progressing to the best performance, achieved by the composition of Aggregated+Metapaths node embeddings, as indicated by the average Accuracy and F1-Score metrics. The Link Prediction count using the Features embedding is so high that it can indicate data homogeneity with a lack of distinctive features. Although it varies based on certain factors, predictions from Metapaths and Features+Metapaths may be deemed more reliable. Node Clustering reveals similar cluster distributions regardless of the embedding used, except for the Aggregated embedding.

The best results were achieved with the composition of Aggregated+Metapaths embeddings, indicating that combining aggregated features from neighbors with metapath embeddings led to better performance in all tested RecSys tasks. The experiments reveal that the best average Accuracy was 71.90% with a 1.98% Standard Deviation, which means the Accuracy could range from 69.92% to 73.88%. The average F1-Score was 83.66% with a 1.95% Standard Deviation; thus the F1-Score can vary between 81.71% and 85.61%. A higher Standard Deviation indicates a greater spread of results, suggesting that the model is more sensitive to variations in input data or other training conditions. Conversely, a lower Standard Deviation indicates greater consistency in results, which may indicate that the model is more stable and robust. This suggests that combining information from node content with the semantics defined by features and the structural relationships defined by metapaths can lead to a more powerful graph node representation. The final recommended heterogeneous graph JSON file is available at https://github.com/silviofernandoangonese/datasets/blob/main/experiments_final_recommended_het_graph.json and the CSV file at https://github.com/silviofernandoangonese/datasets/blob/main/experiments_final_recommended_het_graph.csv.

The failure cases were due to the high number of links predicted from the Features embedding, specifically 866 out of 872 nodes. This result is not acceptable, indicating the need for a deeper analysis of the model to understand the reason for the discrepancy in values. Additionally, another issue related to the Features embedding is the smallest cluster, C0, which contains only 1 node. This raises an important question: why is there only 1 node in this cluster, and what does it mean? Future investigations should explore this anomaly to enhance our understanding of the clustering behavior and improve the model's accuracy.

7 Conclusion

This paper proposed an algorithm based on AGHE, the Approach for Generating Enhanced Heterogeneous Embeddings from Heterogeneous Graphs, enhancing the graph as a dataset for downstream applications. The performance achieved in the conducted experiments, especially when we compare “No Embeddings” with “Aggregated+Metapaths” embeddings, demonstrates how the proposed algorithm effectively contributes to data enhancement in Heterogeneous Graphs, represented by RecSys, which was used as the downstream application reference in this paper. Developing effective and efficient graph analytics from embedded information can greatly help to better understand complex graphs and provide innovative solutions for data models. Based on the obtained results, we believe the conducted studies can open the door to its use in different downstream applications, as demonstrated. The choice of appropriate embeddings plays a crucial role in the performance of downstream tasks; the results indicate that a one-size-fits-all embedding approach is not necessarily the best for all tasks and datasets.

An important lesson learned from the experiments is the significance of exploring a variety of embedding generation techniques and considering the unique characteristics of the data and tasks at hand. Combining information from different sources, such as node features and structural relationships defined by metapaths, can lead to more comprehensive and informative node representations in the graph. This highlights the importance of exploring hybrid approaches.

Future work includes: aggregating edge data features to enhance the node data embeddings; evaluating the performance of recommendations using a tabulated dataset and the same dataset modeled as a Heterogeneous Graph with heterogeneous embeddings; automatically detecting subgraphs and embedding them into the interconnected node; and evaluating the impact of normalizing the elements of the embedding vectors.