AttrE2vec: Unsupervised Attributed Edge Representation Learning


Preprint · December 2020




Piotr Bielak (a), Tomasz Kajdanowicz (a), Nitesh V. Chawla (a, b)

(a) Department of Computational Intelligence, Wroclaw University of Science and Technology, Poland
(b) Department of Computer Science and Engineering, University of Notre Dame, Notre Dame, IN, USA

Abstract

Representation learning has overcome the often arduous and manual featurization of
networks through (unsupervised) feature learning, as it results in embeddings that can be
applied to a variety of downstream learning tasks. Representation learning on graphs
has focused mainly on shallow (node-centric) or deep (graph-based) learning approaches.
While there have been approaches that work on homogeneous and heterogeneous
networks with multi-typed nodes and edges, there is a gap in learning edge representations.
This paper proposes a novel unsupervised inductive method called AttrE2vec, which
learns a low-dimensional vector representation for edges in attributed networks. It
systematically captures the topological proximity, attribute affinity, and feature similarity
of edges. In contrast to current advances in edge embedding research, our proposal extends
the body of methods providing representations for edges by capturing graph attributes in
an inductive and unsupervised manner. Experimental results show that, compared to
contemporary approaches, our method builds more powerful edge vector representations,
reflected by higher quality measures (AUC, accuracy) in downstream tasks such as edge
classification and edge clustering. This is also confirmed by analyzing low-dimensional
embedding projections.

Keywords: representation learning, graphs, edge embedding, random walk, neural
network, attributed graph.

1. Introduction

Complex networks, including attributed and heterogeneous networks, are ubiquitous,
from recommender systems to citation networks and biological systems [1]. These
networks present a multitude of machine learning problem statements, including node
classification, link prediction, and community detection. A fundamental aspect of any
such machine learning (ML) task, transductive or inductive, is the availability of fea-
turized data. Traditionally, researchers have identified several network characteristics
suited to specific ML tasks and used them for the learning algorithm. This practice is
arduous as it often entails customizing to each specific ML task, and also is limited to
the computable characteristics.

This has led to a surge in (unsupervised) algorithms and methods that learn embeddings
from the networks, such that these embeddings form the featurized representation
of the network for the ML tasks [2, 3, 4, 5, 6]. This area of research is generally
referred to as representation learning in networks. The embeddings generated by
representation learning methods are generally agnostic to the end use-case, as they are
generated in an unsupervised fashion. Traditionally, the focus was on representation
learning on homogeneous networks, i.e., networks that have a single type of nodes and
edges and do not have attributes attached to the nodes and edges [4].

Preprint submitted to Information Sciences January 1, 2021

arXiv:2012.14727v1 [cs.LG] 29 Dec 2020

Figure 1: Our proposed AttrE2vec model compared to other methods on the task of attributed graph
embedding. Colors denote edge features. On the left we can see a graph, where the features are aligned
to substructures of the graph. On the right, the features were shuffled (ca. 50%). Traditional approaches
fail to build robust representations, whereas our method includes feature information to construct the
embedding vectors.

Existing representation learning models mainly focus on transductive learning, where
a model can only be trained using the entire input graph. It means that the model
requires all the nodes and a fixed structure of the network in the training phase, e.g.,
Node2vec [7], DeepWalk [8] and GCN [9], to some extent. Besides, there have been
methods focused on heterogeneous networks that incorporate different typed nodes and
edges in a network, as well as content at each node [10, 11].

On the other hand, a less explored and exploited approach is the inductive setting. In
this approach, only a part of the network is used to train the model to infer embeddings
for new nodes. Several attempts have been made in the inductive setting including EP-B
[12], GraphSAGE [13], GAT [14], SDNE [15], TADW [16], AHNG [17] or PVECB [18].
There is also recent progress on heterogeneous graph embedding, e.g., MIFHNE [19] or
models based on graph neural networks [20].

State-of-the-art network embedding techniques are mostly unsupervised, i.e., aim at
learning low-dimensional representations that preserve the structure of an input graph,
e.g., GraphSAGE [13], DANE [21], line2vec [22], RCAN [23]. Semi-supervised or
supervised methods can also learn vector representations, but only for a specific
downstream prediction task, e.g., TADW [16] or FSCNMF [24]. Hence it has been shown in
the literature that not much supervision is required to learn the embeddings.

In recent years, proposed models have mainly focused on graphs that do not contain
attributes related to nodes and edges [4]. This is especially noticeable for edge attributes.
The majority of proposed approaches consider node attributes only, omitting the richness
of the edge feature space while learning the representation. Nevertheless, several models
have been successfully introduced that make use of node features, such as DANE [21],
GraphSAGE [13], SDNE [15] or CAGE [25], as well as models that consume edge
attributes, such as EGNN [26], NEWEE [27] and EGAT [28].

Table 1: Comparison of most representative graph embedding methods with their abilities to learn
the representation, with or without attributes, reasoning types and short characteristics. The most
prominent and appropriate methods selected to compare to AttrE2vec in experiments are marked with
bold text.

Columns: Method (year); Representation (Nodes, Edges); Attributed (Nodes, Edges);
Reasoning (Transduct., Induct.); Family

Supervised
ECN [29] (2016) X X neigh. aggr.
GCN [9] (2017) X X X X GCN/GNN
ECC [30] (2017) X X X GCN, DL
FSCNMF [24] (2018) X X X GCN
GAT [14] (2018) X X X X AE, DL
Planetoid [31] (2018) X X X X GNN
EGNN [26] (2019) X X X X X X GNN
EdgeConv [32] (2019) X X GNN
EGAT [28] (2019) X X X X X X GNN
Attribute2vec [33] (2020) X X X GCN

Unsupervised
DeepWalk [8] (2014) X X RW, skip-gram
TADW [16] (2015) X X X RW, MF
LINE [34] (2015) X X RW, skip-gram
Node2vec [7] (2016) X X RW, skip-gram
SDNE [15] (2016) X X X X AE
GraphSAGE [13] (2017) X X X X RW
EP-B [12] (2017) X X X X AE
Struc2vec [35] (2017) X X RW, skip-gram
DANE [21] (2018) X X X X AE
Line2vec [22] (2019) X X RW, skip-gram
NEWEE [27] (2019) X X X X RW, skip-gram
AttrE2vec (2020) X X X X X RW, AE, DL

Both node-based embedding methods and graph neural network inspired methods do
not generalize effectively to both transductive and inductive settings, especially when
there are attributes associated with edges. This work is motivated by the idea of un-
supervised learning on networks with attributed edges such that the embeddings are
generalizable across tasks and are inductive.

To that end, we develop AttrE2vec, a novel unsupervised learning model that adapts an
auto-encoder and a self-attention network and uses a feature reconstruction loss and a
graph structural loss. To learn an edge representation, AttrE2vec splits the edge
neighborhood into two parts, one for each endpoint of the edge, and then generates random
edge walks in both neighborhoods. All walks are then aggregated over the node and edge
attributes using one of the proposed strategies (Avg, Exp, GRU, ConcatGRU). These
are accumulated with the original node and edge features and then fed to an attention
and a dense layer to encode the edge. The embeddings are subsequently inferred via a
two-step loss function, covering both feature reconstruction and graph structural loss. As
a consequence, AttrE2vec can explicitly incorporate feature information from nodes and
edges many hops away to effectively produce plausible edge embeddings in the
inductive setting.

In summary, our main contributions are as follows:

• we propose a novel unsupervised AttrE2vec method, which learns a low-dimensional
vector representation for edges that are attributed

• we exploit the concept of a graph-topology-driven edge feature aggregation, from
simple ones to learnable GRU based, that captures edge topological proximity and
similarity of edge features

• the proposed method is inductive and allows getting the representation for edges
not present in the training phase

• we conduct various experiments and show that our AttrE2vec method has superior
performance over all of the baseline methods on edge classification and clustering
tasks.

2. Related work and Research Gap

Embedding information networks has received significant interest from the research
community. We refer the readers to the survey articles for a comprehensive overview of
network embedding [4, 5, 3, 2] and cite only some of the most prominent works that are
relevant.

Unsupervised network embedding methods use only the network structure or
original attributes of nodes and edges to construct embeddings. The most common
method is DeepWalk [8], which works in two phases: it constructs node neighborhoods by
performing fixed-length random walks and then employs the skip-gram [7] model to
preserve the co-occurrences between nodes and their neighbors. This two-phase framework
later inspired other network embedding methods, which propose different strategies for
constructing node neighborhoods or modeling co-occurrences between nodes, e.g., node2vec
[7], Struc2vec [35], GraphSAGE [13], line2vec [22] or NEWEE [27]. Another group of
unsupervised methods utilizes auto-encoders or graph neural networks to obtain embeddings.
SDNE [15] uses an auto-encoder architecture to preserve first- and second-order proximities
by jointly optimizing the loss in neighborhood reconstruction. Other auto-encoder
based representatives are EP-B [12] and DANE [21].

Supervised network embedding methods are constructed as end-to-end methods
for particular tasks like node classification or link prediction. These methods require
the network structure, attributes of nodes and edges (if the method is capable of using
them) and some annotated target like a node class. The representatives are ECN [29],
ECC [30], FSCNMF [24], GAT [14], Planetoid [31], EGNN [26], GCN [9], EdgeConv [32],
EGAT [28], Attribute2vec [33].

Edge representation learning has already been tackled by several methods, i.e.,
ECN [29], EGNN [26], line2vec [22], EdgeConv [32], EGAT [28]. However, none of these
methods was able to directly take into account attributes of edges as well as perform the
learning in an unsupervised manner.

All the characteristics of the representative node and edge representation learning
methods are grouped in Table 1.

3. Method

3.1. Motivation

In the following paragraphs, we explain our three-fold motivation for proposing
AttrE2vec.

Edge embeddings. Over the last decade, network processing approaches have gathered more
and more attention, as graph data is produced in an increasing number of systems. Network
embedding traditionally provided the notion of vectorizing nodes, which was used in node
classification or clustering. However, edge representation learning has not gathered
enough attention and was accomplished through node embedding transformation [36].
Nevertheless, such an approach is problematic. For instance, inferring edge type from
neighboring nodes’ embeddings may not be the best choice for edge type classification in
heterogeneous social networks. We claim that efficient edge clustering, edge attribute re-
gression, or link prediction tasks require dedicated and specific edge representations. We
expect that the representation learning approach devoted strictly to edges provides more
powerful vector representations than traditional methods that require node embeddings
trained upfront and transform nodes’ embedding to represent edges.

Inductive embedding methods. The vast majority of contemporary network representation
learning methods are transductive (see Table 1). This means that any change to the graph
requires retraining the whole method to provide predictions for unseen cases; this
property limits the applicability of such methods due to high computational costs. In
contrast, the inductive approach builds a predictive ability that can be applied to unseen
cases and does not need retraining; in general, inductive methods have a lower
computational cost. Considering these advantages, we expect modern edge embedding methods
to be inductive.

Encoding graph attributes in embeddings. Much of the real-world data exhibits rich at-
tribute sets or meta-data that contain crucial information, e.g., about the similarity of
nodes or edges. Traditionally, graph representation learning has been focused on ex-
ploiting the network structure, omitting the related content. Thus, we may expect to
consume attributes as a regularizer over the structure. It would allow overcoming the
limitation when the only edge discriminating ability is encoded in the edges’ attributes,
not in the graph’s structure. Relying only on the network would produce inconclusive
embeddings.



3.2. Attributed graph edge embedding

We denote an attributed graph as $G = (V, E)$, where $V$ is a set of nodes and
$E = \{(u, v) \in V \times V\}$ a set of edges. Every node $u$ and every edge $e = (u, v)$
has associated features: $m_u \in \mathbb{R}^{d_V}$ and $f_{uv} \in \mathbb{R}^{d_E}$, where
$M \in \mathbb{R}^{|V| \times d_V}$ and $F \in \mathbb{R}^{|E| \times d_E}$ are node
and edge feature matrices, respectively. By $d_V$ we denote the dimensionality of the node
feature space and by $d_E$ the dimensionality of the edge feature space. The edge embedding
task is defined as learning a function $g : E \to \mathbb{R}^d$, which takes an edge and
outputs its low-dimensional vector representation. Note that the embedding dimension $d$
should be much less than the original edge feature dimensionality $d_E$, i.e., $d \ll d_E$.
More specifically, we aim at using the topological structure of the graph and node and edge
attributes: $f : (E, F, M) \to \mathbb{R}^d$.

Figure 2: Overview of the AttrE2vec model. The model first computes edge random walks on the two
neighborhoods of a given edge $(u, v)$. The walks of each neighborhood are aggregated into $S_u$
and $S_v$, respectively. Both are combined with the edge features $f_{uv}$ using an Encoder module,
which results in the edge embedding vector $h_{uv}$. The loss function consists of two parts:
structural loss ($L_{cos}$) and feature reconstruction loss ($L_{MSE}$).

3.3. AttrE2vec

In contrast to traditional node embedding methods, we shift the focus from nodes
to edges and consider a graph from an edge perspective. Given any edge $e = (u, v)$, we
can observe three natural sources of knowledge: the edge attributes themselves and the two
neighborhoods $N_u$ and $N_v$, located behind nodes $u$ and $v$, respectively. In AttrE2vec,
we exploit all three sources jointly.

First, we obtain aggregations (summaries) $S_u, S_v$ of both neighborhoods $N_u, N_v$.
We want to capture the topological structure of the neighborhood, so we perform $k$ edge
random walks of length $L$, which start from node $u$ (or $v$, respectively) and use a
uniformly distributed neighbor sampling approach (DeepWalk-like) to obtain the next
edge. Each $i$-th walk $w_u^i$ started from node $u$ is hence a sequence of edges.

$$\mathrm{RW}(G, k, L, u) \to \{w_u^1, w_u^2, \ldots, w_u^k\}$$

$$w_u^i \equiv ((u, u_2), (u_3, u_4), \ldots, (u_{L-1}, u_L))$$
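The edge random walk step can be illustrated with a minimal sketch. It is not the authors' implementation: the adjacency-dictionary graph format, the function names `edge_random_walk` and `random_walks`, and the `seed` parameter are assumptions made for illustration. The sketch also assumes that a walk of length $L$ means $L$ traversed edges and that consecutive edges share a node.

```python
import random


def edge_random_walk(adj, start, length, rng):
    """Uniform (DeepWalk-like) walk from `start`, returned as a list of edges.

    `adj` maps each node to a list of its neighbors (hypothetical format).
    """
    walk, node = [], start
    for _ in range(length):
        neighbors = adj.get(node, [])
        if not neighbors:  # dead end: stop the walk early
            break
        nxt = rng.choice(neighbors)  # uniform neighbor sampling
        walk.append((node, nxt))
        node = nxt
    return walk


def random_walks(adj, k, length, start, seed=0):
    """Return k edge random walks of the given length starting at `start`."""
    rng = random.Random(seed)
    return [edge_random_walk(adj, start, length, rng) for _ in range(k)]


if __name__ == "__main__":
    toy_graph = {0: [1], 1: [0, 2], 2: [1]}
    for w in random_walks(toy_graph, k=3, length=4, start=0):
        print(w)
```

Calling `random_walks` once for node $u$ and once for node $v$ yields the two sets of walks that describe the neighborhoods $N_u$ and $N_v$.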


Next, we take the attributes of the edges (and nodes, if applicable) in each random
walk and aggregate them into a single vector using the walk aggregation model Aggw.

$$S_u^i = \mathrm{Agg}_w(w_u^i, F, M)$$

Later, the aggregated walks are combined using the neighborhood aggregation model
Aggn, which produces the neighborhood summary Su (and Sv, respectively). The proposed
implementations of these aggregations are given in Section 3.4.

$$S_u = \mathrm{Agg}_n(\{S_u^1, S_u^2, \ldots, S_u^k\})$$

Finally, we obtain the low dimensional edge embedding huv using an encoder Enc
module. It combines the edge attributes fuv with the summarized neighborhood infor-
mation Su, Sv. We employ a simple Multilayer Perceptron (MLP) with 3 inputs (each of
size equal to the edge features dimensionality) and an attention mechanism over these in-
puts, to check how much of the information of each input is used to create the embedding
vector (see Figure 3):

$$h_{uv} = \mathrm{Enc}(f_{uv}, S_u, S_v)$$

Figure 3: Encoder module architecture

The overall illustration of the method is contained in Figure 2 and the inference
algorithm is shown in Algorithm 1.

3.4. Aggregation models

For the purpose of the neighborhood aggregation model Aggn, we use an average over
vectors Siu, as there is no particular ordering of these vectors (each one was generated
by an equally important random walk). In the case of walk aggregation, we propose the
following:



Algorithm 1: AttrE2vec inference algorithm

Data: graph G, edge list x_e, edge features F, node features M
Params: number of random walks per node k, random walk length L
Result: edge embedding vectors h_uv
begin
    foreach (u, v) in x_e do
        foreach i in (1 ... k) do
            w_u^i = RW(G, L, u)
            S_u^i = Agg_w(w_u^i, F, M)
            w_v^i = RW(G, L, v)
            S_v^i = Agg_w(w_v^i, F, M)
        end
        S_u = Agg_n({S_u^1, ..., S_u^k})
        S_v = Agg_n({S_v^1, ..., S_v^k})
        h_uv = Enc(f_uv, S_u, S_v)
    end
end

• average – that computes a simple average of the edge attribute vectors in the
random walk;

$$S_u^i = \frac{1}{L} \sum_{n=1}^{L} f_{u_n u_{n+1}}$$

• exponential – that computes a weighted average, where the weights are exponentials
of the negated position in the random walk, so that edges further away are less
important than the near ones;

$$S_u^i = \frac{1}{L} \sum_{n=1}^{L} e^{-n} f_{u_n u_{n+1}}$$

• GRU – that uses a Gated Recurrent Unit [37] architecture, where hidden and input
dimension is equal to the edge attribute dimension; the aggregated representation
is the output of the last hidden vector; the aggregation process starts here at the
end of the random walk and proceeds to the beginning;

$$S_u^i = \mathrm{GRU}(\{f_{u_n u_{n+1}}, f_{u_{n-1} u_n}, \ldots, f_{u_1 u_2}\})$$

• ConcatGRU – that is similar to the GRU-based aggregator, but here we also
use the node feature information by concatenating the node attributes with the
edge attributes; hence the GRU input size is equal to the sum of the edge and
node dimensions; in case there are not any node features available, one could use
network-specific features, like degree, betweenness or more advanced techniques like
Node2vec; the hidden dimension size and the aggregation direction are unchanged;

$$S_u^i = \mathrm{ConcatGRU}(\{f_{u_n u_{n+1}} \oplus m_{u_n}, \ldots, f_{u_1 u_2} \oplus m_{u_1}\})$$
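The two non-learnable walk aggregators and the neighborhood average can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not the authors' code: edge features are assumed to be stored in a dictionary keyed by edge, vectors are plain lists, and the function names are hypothetical. The exponential variant divides by the walk length, following the formula as written above, rather than by the sum of the weights.

```python
import math


def agg_walk_average(walk, edge_features):
    """Average of the edge attribute vectors along one walk."""
    vecs = [edge_features[e] for e in walk]
    dim = len(vecs[0])
    return [sum(v[j] for v in vecs) / len(vecs) for j in range(dim)]


def agg_walk_exponential(walk, edge_features):
    """Weighted sum with weights exp(-n), n = 1..L, divided by the walk length L."""
    vecs = [edge_features[e] for e in walk]
    dim = len(vecs[0])
    length = len(vecs)
    return [
        sum(math.exp(-n) * v[j] for n, v in enumerate(vecs, start=1)) / length
        for j in range(dim)
    ]


def agg_neighborhood(walk_summaries):
    """Agg_n: plain average over the k walk summaries S^1, ..., S^k."""
    k = len(walk_summaries)
    dim = len(walk_summaries[0])
    return [sum(s[j] for s in walk_summaries) / k for j in range(dim)]


if __name__ == "__main__":
    feats = {(0, 1): [1.0, 0.0], (1, 2): [3.0, 2.0]}
    walk = [(0, 1), (1, 2)]
    print(agg_walk_average(walk, feats))
    print(agg_walk_exponential(walk, feats))
```

The GRU and ConcatGRU aggregators would replace `agg_walk_average` with a recurrent network that reads the walk from its end to its beginning; they are omitted here because they require trained parameters.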

3.5. Learning AttrE2vec’s parameters

AttrE2vec is designed to make the most use of edge attributes and information about
the structure of the network. Therefore we propose a loss function, which consists of two
main parts:

• structural loss $L_{cos}$ – computes a cosine embedding loss; this function tries to
minimize the cosine distance between a given embedding $h$ and embeddings of edges
sampled from the random walks $h^+$ (positive), and simultaneously to maximize the
cosine distance between the embedding $h$ and embeddings of edges sampled from the
set of all edges in the graph $h^-$ (negative), except for those in the random walks:

$$L_{cos} = \frac{1}{|B|} \sum_{h_{uv} \in B} \left( \sum_{h_{uv}^+} \left(1 - \cos(h_{uv}, h_{uv}^+)\right) + \sum_{h_{uv}^-} \cos(h_{uv}, h_{uv}^-) \right)$$

where B denotes a minibatch of edges and |B| the minibatch size,

• feature reconstruction loss $L_{MSE}$ – computes a mean squared error between the actual
edge features and the outputs of a decoder (implemented as a 3-layer MLP – see
Figure 4), which reconstructs the edge features based on the edge embeddings;

$$L_{MSE} = \frac{1}{|B|} \sum_{(h_{uv}, f_{uv}) \in B} (\mathrm{DEC}(h_{uv}) - f_{uv})^2$$

where B denotes a minibatch of edges and |B| the minibatch size.

Figure 4: Decoder module architecture

We combine the values of the above loss functions using a mixing parameter $\lambda \in [0, 1]$.
The higher the value of this parameter, the more structural information is preserved
and the less focus is put on the feature reconstruction. The total loss of AttrE2vec is
given as follows:

$$L = \lambda L_{cos} + (1 - \lambda) L_{MSE}$$
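As a worked illustration of the three formulas, the following plain-Python sketch evaluates the losses on small hand-made vectors. It follows the formulas as written and is not the authors' PyTorch code. The batch layouts, function names, and the reading of the squared term in $L_{MSE}$ as a squared Euclidean norm (summed over feature dimensions) are assumptions; embeddings are assumed to be non-zero so that the cosine is defined.

```python
import math


def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def structural_loss(batch):
    """L_cos; `batch` is a list of (h, positives, negatives) triples."""
    total = 0.0
    for h, positives, negatives in batch:
        total += sum(1.0 - cosine(h, p) for p in positives)  # pull positives closer
        total += sum(cosine(h, n) for n in negatives)        # push negatives away
    return total / len(batch)


def reconstruction_loss(batch):
    """L_MSE; `batch` is a list of (decoded_features, true_features) pairs."""
    total = 0.0
    for decoded, true in batch:
        total += sum((d - t) ** 2 for d, t in zip(decoded, true))
    return total / len(batch)


def total_loss(l_cos, l_mse, lam):
    """Mix both parts with lambda in [0, 1]."""
    return lam * l_cos + (1.0 - lam) * l_mse


if __name__ == "__main__":
    l_cos = structural_loss([([1.0, 0.0], [[1.0, 0.0]], [[1.0, 0.0]])])
    l_mse = reconstruction_loss([([1.0, 2.0], [1.0, 0.0])])
    print(total_loss(l_cos, l_mse, lam=0.5))
```

With $\lambda = 1$ only the structural term is optimized, and with $\lambda = 0$ only the reconstruction term, which matches the role of the mixing parameter described above.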


4. Experiments

To evaluate the proposed model’s performance, we perform three tasks: edge
classification, edge clustering, and embedding visualization on three real-world datasets.
We first train our model on a small subset of edges (inductive setting). Then we use the
model to infer embeddings for edges from the test set. Finally, we evaluate them in all
downstream tasks: by predicting the class of edges in citation graphs (edge
classification), by applying the K-means++ algorithm (edge clustering; as defined in [22])
and by applying the dimensionality reduction method t-SNE (embedding visualization). We
compare our model to several baselines and contemporary methods in all experiments,
see Table 1. In addition, we check the influence of AttrE2vec’s hyperparameters and
perform an ablation study on artificially generated datasets. We implement our model in
the popular deep learning framework PyTorch. All experiments were performed on an
NVIDIA GTX1080Ti. Upon acceptance in the journal, we will make our code available
at https://github.com/attre2vec/attre2vec and include our DVC [38] pipeline so
that all experiments can be easily reproduced.

4.1. Datasets

Table 2: Datasets used in the experiments.

Name       Features (initial)   Features (pre-processed)   Number of                     Training instances
           node      edge       node      edge             nodes     edges     classes   inductive   transductive
Cora       1 433     0          32        260              2 485     5 069     7+1       160         5 069
Citeseer   3 703     0          32        260              2 110     3 668     6+1       140         3 668
Pubmed     500       0          32        260              19 717    44 324    3+1       80          44 324

To make our evaluation comparable with prior work, we focus on well-known datasets
that appear in the literature, namely Cora [39], Citeseer [39] and Pubmed [40]. These
are citation networks of scientific papers in several research areas, where nodes are the
papers and edges denote citations between papers. We summarize basic statistics about
the datasets before and after pre-processing steps in Table 2. Raw datasets contain
node features only in the form of high dimensional sparse bags of words. For Cora and
Citeseer, these are binary vectors, showing which of the most popular words were used
in a given paper, and for Pubmed, the features are in the form of TF-IDF vectors. To
adjust the datasets to our problem setting, we apply the following pre-processing steps
to obtain edge level features, which are used to train and evaluate our AttrE2vec model:

• we create dense vector representations of the nodes’ features by applying Doc2vec
[41] in the PV-DBOW variant with a target dimension size of 128;

• for each edge (u,v) and its symmetrical version (v,u) (necessary to perform uni-
form, undirected random walks) we extract the following features:

– 1 feature – cosine similarity of raw node features for nodes u and v (binary
BoW; for Pubmed transformed from TF-IDF to binary BoW),



– 2 features – the ratios of the number of used words (number of ones in the
BoW) to all possible words in the document (length of BoW vector) in each
paper u and v,

– 256 features – concatenation of Doc2vec features for nodes u and v,

– 1 feature – a binary indicator, which denotes whether this is an original edge
(1) or its symmetrical counterpart (0),

• we apply standardization (StandardScaler in Scikit-Learn [42]) of the edge feature
matrix.
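The construction of the 260-dimensional edge feature vector (1 + 2 + 256 + 1) listed above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names, the ordering of the feature groups inside the vector, and the handling of all-zero BoW vectors are assumptions, and the final standardization step is omitted.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros (assumption)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def edge_feature_vector(bow_u, bow_v, doc2vec_u, doc2vec_v, is_original):
    """Build the raw (unstandardized) feature vector of edge (u, v).

    bow_*     : binary bag-of-words vectors of the two papers
    doc2vec_* : dense 128-dimensional Doc2vec vectors of the two papers
    """
    similarity = cosine_similarity(bow_u, bow_v)   # 1 feature
    ratio_u = sum(bow_u) / len(bow_u)              # 2 features: share of used words
    ratio_v = sum(bow_v) / len(bow_v)
    indicator = 1.0 if is_original else 0.0        # 1 feature: original vs. symmetrical edge
    return (
        [similarity, ratio_u, ratio_v]
        + list(doc2vec_u) + list(doc2vec_v)        # 256 features
        + [indicator]
    )


if __name__ == "__main__":
    vec = edge_feature_vector([1, 1, 0, 0], [1, 0, 0, 0], [0.0] * 128, [0.0] * 128, True)
    print(len(vec))
```

Each undirected citation would produce two such vectors, one for $(u, v)$ with the indicator set to 1 and one for $(v, u)$ with the indicator set to 0.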

Moreover, we extracted new node features as 32-dimensional Node2vec embeddings
to enable the evaluation of one of our model versions (AttrE2vec with the ConcatGRU
aggregator), which uses both edge and node attributes.

Raw datasets provide each node labeled by the research area the paper comes from. To
apply this knowledge in the edge classification problem setting, we applied the following
rule: if an edge has two nodes from the same class (research area), the edge receives this
class; if two nodes have different classes, the edge between these nodes is assigned with
a cross-domain citation class.
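The labeling rule above amounts to a short function. The sketch below assumes node classes are stored in a dictionary; the name `edge_label` and the constant used for the cross-domain class are hypothetical.

```python
CROSS_DOMAIN = "cross-domain"  # hypothetical name for the extra "+1" class


def edge_label(node_class, u, v):
    """Return the class of edge (u, v) derived from the classes of its endpoints."""
    class_u, class_v = node_class[u], node_class[v]
    return class_u if class_u == class_v else CROSS_DOMAIN


if __name__ == "__main__":
    classes = {"p1": "Theory", "p2": "Theory", "p3": "Neural_Networks"}
    print(edge_label(classes, "p1", "p2"))
    print(edge_label(classes, "p1", "p3"))
```

This explains the "+1" in the class counts of Table 2: every dataset gains one additional cross-domain class on top of its original research areas.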

To ensure a fair comparison, we follow the dataset preparation scheme from
EP-B [12], i.e., for each dataset (Cora, Citeseer, Pubmed) we sample 10 train/validation/test
sets, where the train set consists of 20 edges per class and the validation and test sets
contain 1 000 randomly chosen edges each. When reporting the resulting metrics, we
show the mean values over these ten sampled sets (together with the standard deviation).
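A possible way to draw one such split is sketched below. The function name, its parameters, and the choice to keep the three sets disjoint are assumptions made for illustration; the text does not specify how overlaps are handled.

```python
import random
from collections import defaultdict


def sample_split(edge_labels, per_class=20, n_val=1000, n_test=1000, seed=0):
    """Sample one train/validation/test split of edges.

    `edge_labels` maps each edge to its class. The train set holds `per_class`
    edges of every class; validation and test sets are drawn at random from the
    remaining edges (disjointness is an assumption of this sketch).
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for edge, label in edge_labels.items():
        by_class[label].append(edge)

    train = []
    for label in sorted(by_class, key=str):
        train.extend(rng.sample(by_class[label], per_class))

    train_set = set(train)
    rest = [e for e in edge_labels if e not in train_set]
    rng.shuffle(rest)
    return train, rest[:n_val], rest[n_val:n_val + n_test]


if __name__ == "__main__":
    toy = {(i, i + 1): i % 4 for i in range(400)}
    splits = [sample_split(toy, per_class=5, n_val=50, n_test=50, seed=s) for s in range(10)]
    print(len(splits), [len(part) for part in splits[0]])
```

With 20 edges per class, the train set sizes match the inductive column of Table 2, for example 8 classes × 20 = 160 edges for Cora.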

4.2. Baselines

We compare our method against several baseline methods. In the most simple case,
we use the edge features obtained during the pre-processing phase for all datasets (further
referred to as Doc2vec).

Many standard approaches employ simple node embedding transformations to obtain
edge embeddings. The authors of Node2vec [36] proposed binary operators like averaging,
Hadamard product, or L1 and L2 norms of vector differences. Here, we use the following
methods to obtain node embeddings: DeepWalk [8], Node2vec [36], SDNE [43] and
Struc2vec [35]. In preliminary experiments, we evaluated these methods and found
that the Average operator and an embedding size of 64 give the best results. We
use these models in 2 setups: (a) Avg(M,M) – using only the averaged node features,
(b) Avg(M,M)⊕F – like the previous one, but concatenated with the edge features from the
dataset (in total 324-dim vectors).
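The binary operators and the two baseline setups can be sketched as follows. The function names are hypothetical; the L1 and L2 variants are written element-wise (absolute and squared differences), which is how the Node2vec authors define their weighted-L1 and weighted-L2 operators.

```python
def op_average(a, b):
    """Element-wise mean of two node embeddings."""
    return [(x + y) / 2 for x, y in zip(a, b)]


def op_hadamard(a, b):
    """Element-wise product of two node embeddings."""
    return [x * y for x, y in zip(a, b)]


def op_weighted_l1(a, b):
    """Element-wise absolute difference."""
    return [abs(x - y) for x, y in zip(a, b)]


def op_weighted_l2(a, b):
    """Element-wise squared difference."""
    return [(x - y) ** 2 for x, y in zip(a, b)]


def avg_baseline(m_u, m_v):
    """Setup (a): Avg(M, M)."""
    return op_average(m_u, m_v)


def avg_concat_baseline(m_u, m_v, f_uv):
    """Setup (b): Avg(M, M) concatenated with the edge features F."""
    return op_average(m_u, m_v) + list(f_uv)


if __name__ == "__main__":
    edge_vec = avg_concat_baseline([0.1] * 64, [0.3] * 64, [0.0] * 260)
    print(len(edge_vec))
```

With 64-dimensional node embeddings and 260-dimensional edge features, setup (b) yields the 324-dimensional vectors mentioned above.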

We also checked a scheme to compute a 64-dim PCA reduction of the concatenated
features to have comparable vector sizes with the 64-dimensional embedding of our model,
but these turned out to perform poorly. Note that SDNE has the capability of inductive
reasoning, but due to the non-availability of such implementation, we decided to evaluate
this method in the transductive scheme (which works in favor of the method).



Figure 5: Architecture of the MLP(M,M).

Figure 6: Architecture of the MLP(M,M,F).

We also extend our body of baselines by more sophisticated approaches – two dense
autoencoder architectures. In the first setting MLP(M,M), we train a model (see
Figure 5), which reconstructs concatenated embeddings of connected nodes. In the second
baseline MLP(M,M,F), the autoencoder (see Figure 6) is extended by edge attributes.
In both settings, we employ the mean squared error as the model loss function. The
output of the encoders (embeddings) is used in the downstream tasks. The input node
embeddings are obtained using the methods mentioned above, i.e., DeepWalk, Node2vec,
SDNE, and Struc2vec.

The last baseline is Line2vec [22], which is directly dedicated to edges; we use an
embedding size of 64.

4.3. Edge classification

To evaluate our model in an inductive setting, we need to make sure that test edges
are unseen during the model training procedure – we remove them from the graph. Note
that all baselines (except for GraphSage, see 1) require all edges during the training
phase (i.e., these are transductive methods).

After each training epoch of AttrE2vec, we evaluate the embeddings using an
L2-regularized Logistic Regression (LR) classifier and compute the AUC. The regression model
is trained on edge embeddings from the train set and evaluated on edge embeddings from
the validation set. We take the model with the highest AUC value on the validation set.



Table 3: AUC values for edge classification. F denotes the edge attributes (also referred to as "Doc2vec"),
M – node attributes (e.g., embeddings computed using "Node2vec"), ⊕ – concatenation operator,
Avg(M,M) – average operator on node embeddings, MLP(·) – encoder output of MLP autoencoder
trained on given attributes. AUC in bold shows the highest value and AUC in italic – the second
highest value.

                                                 AUC
Method group / name                 Vector size  Citeseer       Cora           Pubmed

Transductive
  Edge features only; F (Doc2vec)   260          86.13 ± 0.95   88.67 ± 0.51   79.15 ± 1.41
  Line2vec                          64           86.19 ± 0.28   91.75 ± 1.07   84.88 ± 1.19
  Avg(M,M)
    DeepWalk                        64           58.40 ± 1.08   59.98 ± 1.32   51.04 ± 1.23
    Node2vec                        64           58.26 ± 0.89   59.59 ± 1.11   51.03 ± 1.01
    SDNE                            64           54.28 ± 1.57   55.91 ± 1.11   50.00 ± 0.00
    Struc2vec                       64           61.29 ± 0.86   61.30 ± 1.58   54.67 ± 1.46
  MLP(M,M)
    DeepWalk                        64           55.88 ± 1.68   57.87 ± 1.53   51.23 ± 0.77
    Node2vec                        64           55.35 ± 2.26   57.44 ± 0.87   51.48 ± 1.55
    SDNE                            64           55.56 ± 0.93   56.02 ± 1.22   50.00 ± 0.00
    Struc2vec                       64           59.93 ± 1.43   59.76 ± 1.80   53.27 ± 1.32
  Avg(M,M)⊕F
    DeepWalk                        324          86.13 ± 0.95   88.67 ± 0.51   79.15 ± 1.41
    Node2vec                        324          86.13 ± 0.95   88.67 ± 0.51   79.15 ± 1.41
    SDNE                            324          86.14 ± 1.03   88.70 ± 0.51   79.15 ± 1.41
    Struc2vec                       324          86.21 ± 0.97   88.73 ± 0.48   79.24 ± 1.36
  MLP(M,M,F)
    DeepWalk                        64           84.58 ± 1.11   86.47 ± 0.87   78.60 ± 1.84
    Node2vec                        64           84.65 ± 1.05   86.71 ± 0.68   78.84 ± 1.71
    SDNE                            64           84.32 ± 1.13   85.99 ± 0.77   78.34 ± 1.07
    Struc2vec                       64           83.95 ± 1.16   85.54 ± 0.96   77.19 ± 1.42

Inductive
  Avg(M,M) GraphSage                64           54.84 ± 1.90   55.16 ± 1.36   51.14 ± 1.64
  MLP(M,M) GraphSage                64           55.19 ± 1.04   55.47 ± 1.66   50.36 ± 1.54
  Avg(M,M)⊕F GraphSage              324          86.14 ± 0.95   88.68 ± 0.51   79.16 ± 1.41
  MLP(M,M,F) GraphSage              64           84.63 ± 1.11   86.14 ± 0.45   78.00 ± 1.85
  AttrE2vec (our)
    Avg                             64           88.97 ± 0.82   93.43 ± 0.56   87.68 ± 1.25
    Exp                             64           88.91 ± 1.10   92.80 ± 0.38   86.18 ± 1.41
    GRU                             64           88.92 ± 1.13   93.06 ± 0.63   86.39 ± 1.21
    ConcatGRU                       64           88.56 ± 1.34   92.93 ± 0.61   86.34 ± 1.18

Moreover, an early stopping strategy is implemented: if the validation AUC does
not improve for more than 15 epochs, the training is terminated. Our approach to model
selection follows the scheme proposed in [44], because it is more natural
than relying on the loss function. This procedure is repeated for all 10 data splits (see Section
4.1 for details). We report the mean and standard deviation of the AUC over the 10 test sets (see Table 3).
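The model selection rule can be sketched as follows. This is an illustration under assumptions rather than the authors' training loop: the function name, the list of per-epoch validation AUC values, and the returned tuple are hypothetical; only the patience of 15 epochs and the choice of the epoch with the highest validation AUC come from the text.

```python
# Sketch of validation-AUC-based model selection with early stopping.
# `val_aucs` stands in for the AUC measured after each training epoch.

def select_best_epoch(val_aucs, patience=15):
    """Return (best_epoch, best_auc, last_epoch_run).

    Training stops once the validation AUC has failed to improve
    for more than `patience` consecutive epochs.
    """
    best_auc = float("-inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, auc in enumerate(val_aucs):
        if auc > best_auc:
            best_auc, best_epoch = auc, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement > patience:
                return best_epoch, best_auc, epoch
    return best_epoch, best_auc, len(val_aucs) - 1


# Toy history: improvement for three epochs, then a plateau.
history = [0.50, 0.60, 0.70] + [0.65] * 20
best_epoch, best_auc, last_epoch = select_best_epoch(history)
print(best_epoch, best_auc, last_epoch)
```

In this toy history the best model is the one from epoch 2, and training halts at epoch 18, the sixteenth consecutive epoch without improvement.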

We choose AdamW [45] with a learning rate of 0.001 to optimize our model's parameters.
We also set the number of positive samples to |h+| = 5 and the number of negative samples
to |h−| = 10 in the cosine embedding loss. The mixing coefficient is set to λ = 0.5,
so that features and topological graph structure have equal influence. We choose
an embedding size of 64 as a reasonable value when dealing with edge features of size
260.
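How the mixing coefficient balances the two loss terms can be sketched as below. The exact loss is defined in Section 3.5 and is not reproduced here; the convex combination is an assumption chosen to agree with the description in Section 5 (λ = 0 leaves only feature reconstruction, larger λ emphasizes the structural term). The cosine embedding loss follows the common definition (1 − cos for positive pairs, max(0, cos − margin) for negative pairs); all function names and the margin default are hypothetical.

```python
import math

# Sketch of a mixed objective; assumes non-zero input vectors.

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cosine_embedding_loss(anchor, positives, negatives, margin=0.0):
    """Pull positive samples towards the anchor, push negatives away."""
    pos_terms = [1.0 - cosine_similarity(anchor, p) for p in positives]
    neg_terms = [max(0.0, cosine_similarity(anchor, n) - margin)
                 for n in negatives]
    terms = pos_terms + neg_terms
    return sum(terms) / len(terms)


def mse(x, x_hat):
    """Mean squared error between features and their reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


def mixed_loss(structural, reconstruction, lam=0.5):
    """Assumed form: lam * structural + (1 - lam) * reconstruction."""
    return lam * structural + (1.0 - lam) * reconstruction


# Toy example with 2-dim embeddings and features.
emb_loss = cosine_embedding_loss([1.0, 0.0], [[1.0, 0.0]], [[0.0, 1.0]])
rec_loss = mse([1.0, 2.0], [1.0, 4.0])
total = mixed_loss(emb_loss, rec_loss, lam=0.5)
```

With |h+| = 5 and |h−| = 10, `positives` would hold five vectors and `negatives` ten for each anchor edge.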

In Table 3, we summarize the AUC values for baseline methods and for our model.
Although the original edge feature vectors are relatively high-dimensional (260),
using only the edge features (Doc2vec) already yields good results. However, adding structural
information about the graph could further improve the results.

Representations from node embedding methods, transformed to edge
embeddings using the average operator Avg(M,M), achieve poor results of about
50-60% AUC. However, if these are combined with the edge features from the datasets
(Avg(M,M)⊕F), the AUC values increase significantly to about 86%, 88% and 79% for
Citeseer, Cora, and Pubmed, respectively. Unfortunately, this results in an even higher
vector dimensionality (324).

The MLP-based approach leads to similar conclusions. Using only node embeddings
(MLP(M,M)), we achieve quite poor results of about 50% (on Pubmed) up to
60% (on Cora). With the MLP(M,M,F) approach, we observe that edge features improve
the classification results. The AUC values are still slightly worse than with the concatenation
operator (Avg(M,M)⊕F), but we can reduce the edge embedding size to 64.

The Line2vec [22] algorithm achieves very good results, without considering edge
features information – we get about 86%, 92% and 85% AUC for Citeseer, Cora, and
Pubmed, respectively. These values are higher than for any other baseline approach.

Our model performs the best among all evaluated methods. For Citeseer, we gain
about 3 percentage points (pp) compared to the best baselines: Line2vec, Struc2vec (Avg(M,M)⊕F)
or GraphSage (Avg(M,M)⊕F). Note that the algorithm is trained only on 140 edges
in the inductive setting, whereas all transductive baselines require the whole graph for
training. The gains on Cora are 2 pp, and on Pubmed we achieve up to 4 pp (and up
to 8 pp compared only to GraphSage (Avg(M,M)⊕F)). Our model with the Average
(Avg) aggregator works the best, whereas the Gated Recurrent Unit (GRU) aggregator
achieves the second-best results.

4.4. Edge clustering

Similarly to Line2vec [22], we apply the K-Means++ algorithm to the resulting embedding
vectors and compute the unsupervised clustering accuracy [46]. We summarize
the results in Table 4. Our model performs the best in all but one case and achieves
significantly better results than other baseline methods. The only exception is the
Pubmed dataset, where Line2vec achieves the best clustering accuracy. Other baseline
methods perform similarly to how they did in the edge classification task. Hence, we do not discuss
the details here and refer the reader to the detailed results in Table 4.
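Unsupervised clustering accuracy is usually computed as the best accuracy over all one-to-one assignments of cluster ids to class labels. The sketch below is an illustration under that assumption, not the authors' evaluation code: it searches the assignments by brute force, which is feasible only for a handful of classes, and the function name is hypothetical. For larger label sets an assignment solver such as the Hungarian algorithm would normally replace the permutation search.

```python
from itertools import permutations

# Brute-force sketch of unsupervised clustering accuracy.

def clustering_accuracy(true_labels, cluster_ids):
    """Best fraction of matches over one-to-one cluster-to-class maps."""
    if len(true_labels) != len(cluster_ids) or not true_labels:
        raise ValueError("inputs must be non-empty and of equal length")
    classes = sorted(set(true_labels))
    clusters = sorted(set(cluster_ids))

    # Contingency counts: how many items of class y fall into cluster c.
    counts = {}
    for y, c in zip(true_labels, cluster_ids):
        counts[(c, y)] = counts.get((c, y), 0) + 1

    # Pad with dummy targets when there are more clusters than classes,
    # so every cluster still receives a distinct target.
    padded = classes + [None] * max(0, len(clusters) - len(classes))

    best = 0
    for assignment in permutations(padded, len(clusters)):
        matched = sum(counts.get((c, y), 0)
                      for c, y in zip(clusters, assignment))
        best = max(best, matched)
    return best / len(true_labels)


# Cluster ids are arbitrary, so a pure relabeling scores 1.0.
perfect = clustering_accuracy([0, 0, 1, 1, 2, 2], [2, 2, 0, 0, 1, 1])
partial = clustering_accuracy([0, 0, 1, 1], [1, 1, 1, 0])
```

In the second toy call, mapping cluster 1 to class 0 and cluster 0 to class 1 matches three of the four items.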

4.5. Embedding visualization

For all tested baseline methods and our proposed AttrE2vec method, we compute
2-dimensional projections of the produced embeddings using the t-SNE [47] method. We
visualize them in Figure 7. In our subjective opinion, these plots correspond to the AUC
scores reported in Table 3: the higher the AUC, the better the group separation. In
detail, the raw Doc2vec edge features seem to form groups, but these unfortunately overlap
to some degree. We cannot observe any pattern in the node embedding-based settings
(Avg(M,M) and MLP(M,M)); they tend to be quasi-random. When concatenated
with the edge attributes (Avg(M,M)⊕F and MLP(M,M,F)), we observe a slightly
better grouping, but it is still unsatisfactory. The AttrE2vec model produces much better-formed
groups, with only a little overlap. To summarize, based on the observed group
separability and AUC metrics, our approach works the best among all methods.



Figure 7: 2-D t-SNE projections of embedding vectors for all evaluated methods. Columns denote
the aggregation approach, except for F, which denotes the edge attributes, and g(E), which is an edge embedding
obtained with graph structure only. Rows group particular methods.



Table 4: Accuracy on edge clustering. F denotes the edge attributes (also referred to as "Doc2vec"),
M – node attributes (e.g., embeddings computed using "Node2vec"), ⊕ – concatenation operator,
Avg(M,M) – average operator on node embeddings, MLP(·) – encoder output of MLP autoencoder
trained on given attributes. Accuracy in bold shows the highest value and accuracy in italic – the second
highest value.

                                                 Accuracy
Method group / name                 Vector size  Citeseer       Cora           Pubmed

Transductive
  Edge features only; F (Doc2vec)   260          54.13 ± 2.73   54.64 ± 5.86   46.33 ± 1.53
  Line2vec                          64           54.73 ± 2.56   63.50 ± 1.92   55.26 ± 1.36
  Avg(M,M)
    DeepWalk                        64           28.89 ± 1.06   21.93 ± 0.86   27.24 ± 0.50
    Node2vec                        64           26.82 ± 0.67   21.32 ± 0.62   27.17 ± 0.74
    SDNE                            64           21.01 ± 0.50   17.97 ± 0.47   31.38 ± 0.69
    Struc2vec                       64           25.21 ± 1.33   20.15 ± 0.64   32.02 ± 1.49
  MLP(M,M)
    DeepWalk                        64           26.36 ± 1.37   21.06 ± 0.57   27.40 ± 0.93
    Node2vec                        64           26.37 ± 1.64   21.31 ± 0.98   27.67 ± 0.78
    SDNE                            64           22.27 ± 0.76   17.15 ± 0.36   28.44 ± 1.21
    Struc2vec                       64           24.22 ± 0.83   19.56 ± 0.49   31.31 ± 1.70
  Avg(M,M)⊕F
    DeepWalk                        324          54.13 ± 2.73   54.70 ± 5.85   46.33 ± 1.53
    Node2vec                        324          54.13 ± 2.73   54.70 ± 5.85   46.33 ± 1.53
    SDNE                            324          55.29 ± 2.06   55.43 ± 4.63   46.33 ± 1.53
    Struc2vec                       324          55.59 ± 1.51   52.47 ± 6.52   46.32 ± 1.29
  MLP(M,M,F)
    DeepWalk                        64           48.74 ± 4.03   47.38 ± 4.72   46.49 ± 1.20
    Node2vec                        64           50.80 ± 2.30   48.48 ± 3.38   46.15 ± 1.43
    SDNE                            64           46.17 ± 3.15   44.87 ± 3.54   45.74 ± 1.89
    Struc2vec                       64           47.35 ± 3.73   44.38 ± 3.04   45.40 ± 1.72

Inductive
  Avg(M,M) GraphSage                64           18.79 ± 0.62   17.70 ± 1.05   27.04 ± 0.71
  MLP(M,M) GraphSage                64           18.92 ± 0.98   17.89 ± 0.85   27.09 ± 0.81
  Avg(M,M)⊕F GraphSage              324          54.06 ± 2.54   54.82 ± 6.86   46.49 ± 1.64
  MLP(M,M,F) GraphSage              64           48.79 ± 4.04   47.49 ± 5.41   45.15 ± 1.54
  AttrE2vec (our)
    Avg                             64           59.82 ± 3.30   65.42 ± 1.71   48.86 ± 2.46
    Exp                             64           59.07 ± 4.65   66.36 ± 3.62   48.02 ± 2.55
    GRU                             64           60.16 ± 2.25   66.15 ± 3.71   49.41 ± 1.49
    ConcatGRU                       64           60.71 ± 2.75   66.00 ± 2.21   50.27 ± 3.75

5. Hyperparameter Sensitivity of AttrE2vec

We investigate the effect of each hyperparameter independently, i.e.,
varying a given parameter while keeping the default values of all other parameters. The
evaluation is performed for two inductive variants of our model: with the Average aggregator
and with the GRU aggregator. We use all three datasets (Cora, Citeseer, Pubmed) and
report the AUC values. We choose the following hyperparameter value sets (values with an
asterisk denote the default value for that parameter):

• length of random walk: L = {4, 8∗, 16},

• number of random walks: k = {4, 8, 16∗},

• embedding size: d = {16, 32, 64∗},

• mixing parameter: λ = {0, 0.25, 0.5∗, 0.75, 1}.
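The one-parameter-at-a-time protocol over these value sets can be enumerated with a short sketch. The dictionary keys and the helper name are hypothetical; only the value sets and the defaults (marked with an asterisk above) come from the text.

```python
# Sketch: enumerate configurations that differ from the defaults
# in at most one hyperparameter.

DEFAULTS = {
    "walk_length": 8,      # L
    "num_walks": 16,       # k
    "embedding_size": 64,  # d
    "mixing": 0.5,         # lambda
}

GRID = {
    "walk_length": [4, 8, 16],
    "num_walks": [4, 8, 16],
    "embedding_size": [16, 32, 64],
    "mixing": [0.0, 0.25, 0.5, 0.75, 1.0],
}


def one_at_a_time_configs(defaults, grid):
    """Return the default config plus every single-parameter variation."""
    configs = [dict(defaults)]
    for name, values in grid.items():
        for value in values:
            if value == defaults[name]:
                continue  # already covered by the default configuration
            config = dict(defaults)
            config[name] = value
            configs.append(config)
    return configs


configs = one_at_a_time_configs(DEFAULTS, GRID)
print(len(configs))  # 1 default + 2 + 2 + 2 + 4 variations = 11
```

Varying one parameter at a time yields 11 distinct configurations per aggregator and dataset, compared with 3 · 3 · 3 · 5 = 135 for a full grid.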



Figure 8: Effects of hyperparameters on Cora, Citeseer and Pubmed datasets.

The results of all experiments are summarized in Figure 8. We observe that for both
aggregation variants, Avg and GRU, the trends are similar, so we discuss
them based only on the Average aggregator.

In general, the higher the number of random walks k and the length of a single
random walk L, the better results are achieved. One may require higher values of these
parameters, but it significantly increases the random walk computation time and the
model training itself.

Unsurprisingly, the embedding size (embedding dimension) also follows the same
trend. With more dimensions, we can fit more information into the created representations.
However, as the goal of an embedding is to find low-dimensional vector representations,
we should keep the dimensionality reasonable. Our chosen values (16, 32, 64) seem plausible
when working with 260-dimensional edge features.

As for the loss mixing parameter λ, we observe that too high values negatively influence
the model performance. The greater the value, the more important the structural loss becomes,
while the feature loss becomes less relevant. Choosing λ = 0 causes
the loss function to consider feature reconstruction only and to completely ignore the embedding
loss. This yields significantly worse results and confirms that our approach of
combining both feature reconstruction and structural embedding loss is justified. In
general, the best values are achieved when both loss terms have equal influence
(λ = 0.5).

6. Ablation study

Figure 9: AttrE2vec performance for various noise levels p and mixing parameter values λ ∈ {0, 0.5, 1}.

Figure 10: 2-D representations of ideal and noisy graph edges using AttrE2vec with λ ∈ {0, 0.5, 1}.

We performed an ablation study to check whether our method AttrE2vec is robust
to noise introduced in an artificially generated network. We use a barbell graph, which
consists of two fully connected graphs and a path which connects them (see: Figure 1).
The graph has seven nodes in each full graph and seven nodes in the path – a total of
50 edges. Next, we generate features from 3 clusters in a 200-dimensional space using
isotropic Gaussian blobs. We assign the features to 3 parts of the graph: the first to
the edges in one of the full graphs, the second to the edges in the path and the third
to the edges in the other full graph. The edge classes match the feature clusters
(i.e., three classes). Therefore, the structure is aligned with the features, so any good
structure-based embedding method can fit this data very well (see: Figure 1). A problem
occurs when the features (and hence the classes) are shuffled within the graph structure.
Methods that employ only a structural loss function will fail. We want to check how our
model AttrE2vec, which includes both structural and feature-based loss, performs with
different amounts of such noise.
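The synthetic setup can be reproduced approximately with the sketch below. Several details are assumptions: the two edges that attach the path to the full graphs are counted as path edges (class 2), which is what makes the total come to 50; the cluster centers, their range, and the standard deviation are arbitrary; and the function names are hypothetical. The phrase "isotropic Gaussian blobs" resembles the wording used for scikit-learn's `make_blobs`, but the source does not name the tool, so plain `random` is used here.

```python
import random
from itertools import combinations

# Sketch of the barbell graph with per-edge classes and blob features.

def barbell_edges(clique_size=7, path_size=7):
    """Map each edge (u, v) to its class: 1 and 3 for the two full
    graphs, 2 for the path (including the two attaching edges)."""
    left = list(range(clique_size))
    path = list(range(clique_size, clique_size + path_size))
    right = list(range(clique_size + path_size,
                       2 * clique_size + path_size))
    edges = {}
    for u, v in combinations(left, 2):
        edges[(u, v)] = 1
    chain = [left[-1]] + path + [right[0]]
    for u, v in zip(chain, chain[1:]):
        edges[(u, v)] = 2
    for u, v in combinations(right, 2):
        edges[(u, v)] = 3
    return edges


def blob_features(labels, dim=200, std=1.0, seed=0):
    """Draw one isotropic Gaussian sample per label around a
    class-specific center (centers and std are assumptions)."""
    rng = random.Random(seed)
    centers = {c: [rng.uniform(-10.0, 10.0) for _ in range(dim)]
               for c in sorted(set(labels))}
    return [[rng.gauss(mu, std) for mu in centers[c]] for c in labels]


edges = barbell_edges()
labels = list(edges.values())
features = blob_features(labels)
print(len(edges))  # 21 + 8 + 21 = 50
```

Each full graph on seven nodes contributes 21 edges, and the seven path nodes together with the two attaching edges contribute 8, giving the 50 edges stated in the text.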

We use the graph described above and introduce noise by shuffling a fraction p of all
edge pairs that are from different classes, i.e., an edge with class 2 (originally located
in the path) may be swapped with one from the full graphs (classes 1 or 3).
We use our AttrE2vec model with an Average aggregator in the transductive setting
(due to the graph size) and report the edge classification AUC for different values of
p ∈ {0, 0.1, ..., 0.9, 1} and λ ∈ {0, 0.5, 1}. The values of the mixing parameter λ
allow us to check how the model behaves when working only with a feature-based loss
(λ = 0), only with a structural loss (λ = 1), and with both losses at equal importance
(λ = 0.5). We train our model for five epochs and repeat the computations ten times for
every (p,λ) pair, due to the shuffling procedure’s randomness. We report the mean and
standard deviation of the AUC value in Figure 9.

Using only the feature loss or a combination of both losses allows us to achieve nearly
100% AUC in the classification task. The fluctuations appear due to the low number
of training epochs and the local optima problem. The performance of the model that
uses only structural loss (λ = 1) decreases with higher shuffling probabilities, and from
a certain point, it starts improving slightly because shuffling results in a complete swap
of two classes, i.e., all features and classes from one graph part are exchanged with all
features and classes from another part of the graph.

We also demonstrate how our method reacts to noisy data for various λ ∈ {0, 0.5, 1}.
There are two graphs: one where the features are aligned with substructures of the graph
and a second one with shuffled features (ca. 50%), see Figure 10. Keeping AttrE2vec with
λ = 0.5 allows noisy graphs to be represented fairly well.

7. Conclusions and future work

We introduce AttrE2vec – a novel unsupervised and inductive embedding model that
learns attributed edge embeddings by leveraging a self-attention network with an autoencoder
over the attribute space and a structural loss on aggregated random walks. AttrE2vec
can directly aggregate feature information from edges and nodes many hops away
to infer embeddings not only for present nodes, but also for new nodes. Extensive
experimental results show that AttrE2vec obtains state-of-the-art results in edge
classification and clustering on Cora, Pubmed and Citeseer.



Acknowledgments

The work was partially supported by the National Science Centre, Poland, grants No.
2016/21/D/ST6/02948 and 2016/23/B/ST6/01735, as well as by the Department of
Computational Intelligence, Wrocław University of Science and Technology statutory
funds.

References

[1] W. Hu, M. Fey, M. Zitnik, Y. Dong, H. Ren, B. Liu, M. Catasta, J. Leskovec, R. Barzilay,
P. Battaglia, Y. Bengio, M. Bronstein, S. Günnemann, W. Hamilton, T. Jaakkola, S. Jegelka,
M. Nickel, C. Re, L. Song, J. Tang, M. Welling, R. Zemel, Open graph benchmark: Datasets for
machine learning on graphs (may 2020). arXiv:2005.00687.
URL http://arxiv.org/abs/2005.00687

[2] D. Zhang, J. Yin, X. Zhu, C. Zhang, Network Representation Learning: A Survey, IEEE Transac-
tions on Big Data 6 (1) (2018) 3–28. doi:10.1109/tbdata.2018.2850013.

[3] Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, P. S. Yu, A Comprehensive Survey on Graph Neural
Networks, IEEE Transactions on Neural Networks and Learning Systems (2019) 1–21.
doi:10.1109/TNNLS.2020.2978386.

[4] B. Li, D. Pi, Network representation learning: a systematic literature review, Neural Computing
and Applications 32 (21) (2020) 16647–16679. doi:10.1007/s00521-020-04908-5.

[5] I. Chami, S. Abu-El-Haija, B. Perozzi, C. Ré, K. Murphy, Machine Learning on Graphs: A Model
and Comprehensive Taxonomy (2020).
URL http://arxiv.org/abs/2005.03675

[6] S. Bahrami, F. Dornaika, A. Bosaghzadeh, Joint auto-weighted graph fusion and scalable semi-
supervised learning, Information Fusion 66 (2021) 213–228.

[7] A. Grover, J. Leskovec, Node2vec: Scalable feature learning for networks, in: Proceedings of the
ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Vol. 13-17-
Augu, 2016, pp. 855–864. doi:10.1145/2939672.2939754.

[8] B. Perozzi, R. Al-Rfou, S. Skiena, DeepWalk: Online Learning of Social Representations,
in: Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and
data mining - KDD ’14, ACM Press, New York, New York, USA, 2014, pp. 701–710.
doi:10.1145/2623330.2623732.
URL http://dl.acm.org/citation.cfm?doid=2623330.2623732

[9] T. N. Kipf, M. Welling, Semi-supervised classification with graph convolutional networks, in: 5th
International Conference on Learning Representations, ICLR 2017 - Conference Track Proceedings,
International Conference on Learning Representations, ICLR, 2017, pp. 1–14. arXiv:1609.02907.
URL http://arxiv.org/abs/1609.02907

[10] Y. Dong, N. V. Chawla, A. Swami, Metapath2vec: Scalable representation learning for hetero-
geneous networks, in: Proceedings of the ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining, Vol. Part F1296, ACM, New York, NY, USA, 2017, pp. 135–144.
doi:10.1145/3097983.3098036.
URL https://dl.acm.org/doi/10.1145/3097983.3098036

[11] S.-H. Wang, V. V. Govindaraj, J. M. Górriz, X. Zhang, Y.-D. Zhang, Covid-19 classification by
FGCNet with deep feature fusion from graph convolutional network and convolutional neural network,
Information Fusion 67 (2021) 208–229.

[12] A. García-Durán, M. Niepert, Learning graph representations with embedding propagation, in:
Advances in Neural Information Processing Systems, Vol. 2017-Decem, 2017, pp. 5120–5131.

[13] W. L. Hamilton, R. Ying, J. Leskovec, Inductive representation learning on large graphs, in: Ad-
vances in Neural Information Processing Systems, Vol. 2017-Decem, 2017, pp. 1025–1035.

[14] P. Veličković, A. Casanova, P. Liò, G. Cucurull, A. Romero, Y. Bengio, Graph attention networks,
in: 6th International Conference on Learning Representations, ICLR 2018 - Conference Track
Proceedings, International Conference on Learning Representations, ICLR, 2018, pp. 1–12.
arXiv:1710.10903.



[15] D. Wang, P. Cui, W. Zhu, Structural deep network embedding, in: Proceedings of the ACM
SIGKDD International Conference on Knowledge Discovery and Data Mining, Vol. 13-17-Augu,
2016, pp. 1225–1234. doi:10.1145/2939672.2939753.

[16] C. Yang, Z. Liu, D. Zhao, M. Sun, E. Y. Chang, Network representation learning with rich text
information, in: IJCAI International Joint Conference on Artificial Intelligence, Vol. 2015-Janua,
2015, pp. 2111–2117.

[17] M. Liu, J. Liu, Y. Chen, M. Wang, H. Chen, Q. Zheng, AHNG: Representation learning on attributed
heterogeneous network, Information Fusion 50 (2019) 221–230.

[18] L. Lan, P. Wang, J. Zhao, J. Tao, J. Lui, X. Guan, Improving network embedding with partially
available vertex and edge content, Information Sciences 512 (2020) 935–951.
doi:10.1016/j.ins.2019.09.083.

[19] B. Li, D. Pi, Y. Lin, I. Khan, L. Cui, Multi-source information fusion based heterogeneous network
embedding, Information Sciences 534 (2020) 53–71. doi:10.1016/j.ins.2020.05.012.

[20] C. Zhang, D. Song, C. Huang, A. Swami, N. V. Chawla, Heterogeneous graph neural network,
in: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data
Mining, ACM, New York, NY, USA, 2019, pp. 793–803. doi:10.1145/3292500.3330961.
URL https://dl.acm.org/doi/10.1145/3292500.3330961

[21] H. Gao, H. Huang, Deep attributed network embedding, in: IJCAI International Joint Conference
on Artificial Intelligence, Vol. 2018-July, 2018, pp. 3364–3370. doi:10.24963/ijcai.2018/467.

[22] S. Bandyopadhyay, A. Biswas, N. Murty, R. Narayanam, Beyond node embedding: A direct unsu-
pervised edge representation framework for homogeneous networks (2019). arXiv:1912.05140.

[23] Y. Chen, T. Qian, Relation constrained attributed network embedding, Information Sciences 515
(2020) 341–351. doi:10.1016/j.ins.2019.12.033.

[24] S. Bandyopadhyay, H. Kara, A. Kannan, M. N. Murty, FSCNMF: Fusing structure and content via
non-negative matrix factorization for embedding information networks (2018). arXiv:1804.05313.

[25] D. Nozza, E. Fersini, E. Messina, CAGE: Constrained deep Attributed Graph Embedding, Infor-
mation Sciences 518 (2020) 56–70. doi:10.1016/j.ins.2019.12.082.

[26] J. Kim, T. Kim, S. Kim, C. D. Yoo, Edge-labeling graph neural network for few-shot learning, in:
Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recogni-
tion, Vol. 2019-June, 2019, pp. 11–20. arXiv:1905.01436, doi:10.1109/CVPR.2019.00010.

[27] Q. Li, Z. Cao, J. Zhong, Q. Li, Graph representation learning with encoding edges, Neurocomputing
361 (2019) 29–39. doi:10.1016/j.neucom.2019.07.076.

[28] L. Gong, Q. Cheng, Exploiting edge features for graph neural networks, in: Proceedings of the IEEE
Computer Society Conference on Computer Vision and Pattern Recognition, 2019, pp. 9203–9211.
doi:10.1109/CVPR.2019.00943.

[29] C. Aggarwal, G. He, P. Zhao, Edge classification in networks, in: 2016 IEEE 32nd International
Conference on Data Engineering, ICDE 2016, Institute of Electrical and Electronics Engineers Inc.,
2016, pp. 1038–1049. doi:10.1109/ICDE.2016.7498311.

[30] M. Simonovsky, N. Komodakis, Dynamic edge-conditioned filters in convolutional neural networks
on graphs, in: Proceedings - 30th IEEE Conference on Computer Vision and Pattern Recognition,
CVPR 2017, Vol. 2017-Janua, 2017, pp. 29–38. doi:10.1109/CVPR.2017.11.

[31] T. D. Bui, S. Ravi, V. Ramavajjala, Neural Graph Learning: Training Neural Networks Using
Graphs, dl.acm.org 2018-Febua (2018) 64–71. doi:10.1145/3159652.3159731.

[32] Y. Wang, Y. Sun, M. M. Bronstein, J. M. Solomon, Z. Liu, S. E. Sarma, Dynamic Graph CNN for
Learning on Point Clouds, ACM Transactions on Graphics 38 (5) (2019) 146. doi:10.1145/3326362.

[33] T. Wanyan, C. Zhang, A. Azad, X. Liang, D. Li, Y. Ding, Attribute2vec: Deep network embedding
through multi-filtering GCN (apr 2020). arXiv:2004.01375.
URL http://arxiv.org/abs/2004.01375

[34] J. Tang, M. Qu, M. Wang, M. Zhang, J. Yan, Q. Mei, LINE: Large-scale information network
embedding, in: WWW 2015 - Proceedings of the 24th International Conference on World Wide
Web, 2015, pp. 1067–1077. doi:10.1145/2736277.2741093.

[35] L. F. Ribeiro, P. H. Saverese, D. R. Figueiredo, Struc2vec: Learning node representations from
structural identity, in: Proceedings of the ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining, Vol. Part F1296, 2017, pp. 385–394. doi:10.1145/3097983.3098061.

[36] A. Grover, J. Leskovec, node2vec: Scalable feature learning for networks, in: Proceedings of the
22nd ACM SIGKDD international conference on Knowledge discovery and data mining, ACM,
2016, pp. 855–864.

[37] J. Chung, C. Gulcehre, K. Cho, Y. Bengio, Empirical Evaluation of Gated Recurrent Neural
Networks on Sequence Modeling (dec 2014). arXiv:1412.3555.
URL http://arxiv.org/abs/1412.3555

[38] R. Kuprieiev, D. Petrov, R. Valles, P. Redzyński, C. da Costa-Luis, A. Schepanovski, I. Shcheklein,
S. Pachhai, J. Orpinel, F. Santos, A. Sharma, Zhanibek, D. Hodovic, P. Rowlands, Earl, A. Grigorev,
N. Dash, G. Vyshnya, maykulkarni, Vera, M. Hora, xliiv, W. Baranowski, S. Mangal, C. Wolff,
nik123, O. Yoktan, K. Benoy, A. Khamutov, A. Maslakov, Dvc: Data version control - git for data
& models (May 2020). doi:10.5281/zenodo.3859749.
URL https://doi.org/10.5281/zenodo.3859749

[39] P. Sen, G. Namata, M. Bilgic, L. Getoor, B. Galligher, T. Eliassi-Rad, Collective classification in
network data, AI Magazine 29 (3) (2008) 93. doi:10.1609/aimag.v29i3.2157.
URL https://ojs.aaai.org/index.php/aimagazine/article/view/2157

[40] G. Namata, B. London, L. Getoor, B. Huang, Query-driven Active Surveying for Collective
Classification, in: Proceedings of the Workshop on Mining and Learning with Graphs, Edinburgh,
Scotland, UK, 2012, pp. 1–8.

[41] Q. Le, T. Mikolov, Distributed representations of sentences and documents, in: 31st International
Conference on Machine Learning, ICML 2014, Vol. 4, 2014, pp. 2931–2939. arXiv:1405.4053.
URL http://arxiv.org/abs/1405.4053

[42] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Pret-
tenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot,
E. Duchesnay, Scikit-learn: Machine learning in Python, Journal of Machine Learning Research 12
(2011) 2825–2830.

[43] D. Wang, P. Cui, W. Zhu, Structural deep network embedding, in: Proceedings of the 22nd ACM
SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16, ACM,
New York, NY, USA, 2016, pp. 1225–1234. doi:10.1145/2939672.2939753.
URL http://doi.acm.org/10.1145/2939672.2939753

[44] D. Q. Nguyen, T. D. Nguyen, D. Phung, A self-attention network based node embedding model
(jun 2020). arXiv:2006.12100.
URL http://arxiv.org/abs/2006.12100

[45] I. Loshchilov, F. Hutter, Decoupled Weight Decay Regularization (nov 2017). arXiv:1711.05101.
URL http://arxiv.org/abs/1711.05101

[46] J. Xie, R. Girshick, A. Farhadi, Unsupervised deep embedding for clustering analysis, in: M. F.
Balcan, K. Q. Weinberger (Eds.), Proceedings of The 33rd International Conference on Machine
Learning, Vol. 48 of Proceedings of Machine Learning Research, PMLR, New York, New York,
USA, 2016, pp. 478–487.
URL http://proceedings.mlr.press/v48/xieb16.html

[47] L. van der Maaten, G. Hinton, Visualizing data using t-SNE, Journal of Machine Learning Research
9 (2008) 2579–2605.
URL http://www.jmlr.org/papers/v9/vandermaaten08a.html
