key: cord-0497887-8tq08b4d
authors: Srinivasa, Rakshith S; Xiao, Cao; Glass, Lucas; Romberg, Justin; Sun, Jimeng
title: Fast Graph Attention Networks Using Effective Resistance Based Graph Sparsification
date: 2020-06-15
journal: nan
DOI: nan
sha: aa339dc08635486396623300d7ab64d1c8d632f6
doc_id: 497887
cord_uid: 8tq08b4d

The attention mechanism has demonstrated superior performance for inference over nodes in graph neural networks (GNNs); however, it results in a high computational burden during both training and inference. We propose FastGAT, a method to make attention based GNNs lightweight by using spectral sparsification to generate an optimal pruning of the input graph. This results in a per-epoch time that is almost linear in the number of graph nodes, as opposed to quadratic. Further, we provide a re-formulation of a specific attention based GNN, the Graph Attention Network (GAT), that interprets it as a graph convolution method using the random walk normalized graph Laplacian. Using this framework, we theoretically prove that spectral sparsification preserves the features computed by the GAT model, thereby justifying our FastGAT algorithm. We experimentally evaluate FastGAT on several large real-world graph datasets for node classification tasks; FastGAT can dramatically reduce (up to 10x) the computational time and memory requirements, allowing the usage of attention based GNNs on large graphs.

Graphs are efficient representations of pairwise relations, with many real-world applications including product co-purchasing networks [1], co-author networks [2], etc. Graph neural networks (GNNs) have become popular as a tool for inference from graph based data. By leveraging the geometric structure of the graph, GNNs learn improved representations of the graph nodes and edges that can lead to better performance in various inference tasks [3, 4, 5]. More recently, the attention mechanism has demonstrated superior performance for inference over nodes in GNNs [5, 6, 7, 8, 9, 10]. However, attention based GNNs suffer from a large computational cost (see the discussion in [11]). This may hinder the applicability of the attention mechanism to large graphs.

GNNs generally rely on graph convolution operations. For a graph G with N nodes, graph convolution with a kernel $g_w : \mathbb{R} \to \mathbb{R}$ is defined as

$$g_w \star h = U\, g_w(\Lambda)\, U^\top h, \qquad (1)$$

where U is the matrix of eigenvectors and $\Lambda$ is the diagonal matrix that contains the eigenvalues of the normalized graph Laplacian matrix, defined as

$$L_{\mathrm{sym}} = I - D^{-1/2} A D^{-1/2}, \qquad (2)$$

with D and A being the degree matrix and the adjacency matrix of the graph, and $g_w$ applied elementwise to the eigenvalues. Since computing U and $\Lambda$ can be very expensive ($O(N^3)$), most GNNs use an approximation of the graph convolution operator. For example, in graph convolution networks (GCN) [3], node features are updated by computing averages as a first order approximation of Eq. (1) over the neighbors of the nodes. A single neural network layer is defined as

$$H^{(l+1)} = \sigma\!\left(\tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} H^{(l)} W^{(l)}\right), \qquad (3)$$

where $H^{(l)}$ and $W^{(l)}$ are the activations and the weight matrix at the l-th layer respectively, $\tilde{A} = A + I$, and $\tilde{D}$ is the corresponding degree matrix.
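To make the propagation rule in Eq. (3) concrete, the following is a minimal sketch of a single GCN layer using NumPy/SciPy only; it is illustrative, and the function and variable names (gcn_layer, adj, H, W) are ours rather than from any reference implementation.

```python
# Minimal sketch of the first-order GCN propagation rule in Eq. (3).
import numpy as np
import scipy.sparse as sp

def gcn_layer(adj: sp.csr_matrix, H: np.ndarray, W: np.ndarray) -> np.ndarray:
    """One GCN layer: ReLU(D^{-1/2} (A + I) D^{-1/2} H W)."""
    N = adj.shape[0]
    A_tilde = adj + sp.eye(N)                       # add self-loops: A + I
    deg = np.asarray(A_tilde.sum(axis=1)).ravel()   # degrees of A + I
    D_inv_sqrt = sp.diags(1.0 / np.sqrt(deg))       # D^{-1/2}
    A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt       # symmetric normalization
    return np.maximum(A_hat @ H @ W, 0.0)           # ReLU(A_hat H W)
```

The per-layer cost is dominated by the sparse multiplication with the normalized adjacency matrix; attention based GNNs add to this cost, as discussed next.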
Attention based GNNs add another layer of complexity: they compute pairwise attention coefficients between all connected nodes. This process can greatly increase the computational burden, especially on large graphs. Approaches to speed up GNNs were proposed in [12, 4]. However, these sampling and aggregation based methods were designed for simple GCNs and are not applicable to attention based GNNs. There has also been work on inducing sparsity in attention based GNNs [13, 14], but it focuses on addressing potential overfitting of attention based models rather than scalability.

In this paper, we propose Fast Graph Attention neTwork (FastGAT), a spectral sparsification method that leverages effective resistances to make attention based GNNs lightweight. The effective resistance measures the importance of an edge in terms of preserving the graph connectivity. FastGAT uses it to measure the importance of edges and prune the input graph to generate a randomized subgraph with far fewer edges. Such a procedure preserves the spectral features of a graph, hence retaining the information that attention based GNNs need. At the same time, the pruned graph makes more complex but computationally intensive models, such as attention GNNs, tractable. With the subgraph as their input, attention based GNNs enjoy a much smaller computational complexity. Note that FastGAT is applicable to all attention based GNNs. In this paper, we mostly focus on the GAT model; however, we also show the generalizability of our approach by applying it to two other attention based GNNs, namely the cosine similarity based approach [7] and Gated Attention Networks [15].

In addition, to understand why spectral sparsification is suitable for speeding up attention based GNNs, we provide a reformulation of the GAT model and show that it is equivalent to applying a layer-wise graph convolution using the random walk normalized graph Laplacian. Based on this re-interpretation, we theoretically prove that spectral sparsification preserves the feature representations computed by the GAT model. We believe this interpretation also opens up interesting connections between sparsifying state transition matrices of random walks and speeding up computations in GNNs.

The contributions of our paper are outlined below:
• We propose FastGAT, a method that uses effective resistance based spectral graph sparsification to accelerate attention GNNs. The rapid subsampling and the spectrum preserving property of FastGAT help attention GNNs retain their accuracy advantages while becoming computationally light.
• We provide a theoretical justification for using spectral sparsification in the context of attention based GNNs by proving that spectral sparsification preserves the features computed by GNNs.
• FastGAT outperforms state-of-the-art algorithms across a variety of datasets in terms of computation, achieving a speedup of up to 10x in training and inference time. On larger datasets such as Reddit, the GAT model runs out of memory, whereas FastGAT achieves an F1 score of 0.93 with a per-epoch time of 7.73s.
• Further, FastGAT is generalizable to other attention based GNNs, such as the cosine similarity based attention function [7] and the Gated Attention Network [15].

Accelerating graph based inference has drawn increasing interest. Two methods proposed in [12] (FastGCN) and [16] speed up GCNs by using importance sampling to sample a subset of nodes per layer during training. Similarly, GraphSAGE [4] also proposes an edge sampling and aggregation based method for inductive learning tasks. All of the above works use simple aggregation and target simple GCNs, while our work focuses on more recent attention based GNNs such as [5]. We are able to take advantage of the attention mechanism while still being computationally tractable.

Graph sparsification aims to approximate a given graph by a graph with fewer edges for efficient computation.
Depending on the final goal, there are cut sparsifiers [17], pairwise distance preserving sparsifiers [18] and spectral sparsifiers [19, 20], among others [21, 22, 23, 24]. In this work, we use spectral sparsification to choose a randomized subgraph. Apart from providing the strongest guarantees in preserving graph structure [25], spectral sparsifiers align well with GNNs due to their connection to spectral graph convolutions. Our work is similar to [26], which uses spectral graph sparsification to accelerate simpler Laplacian smoothing based statistical inference tasks such as regression.

There has been recent work on graph sparsification for neural networks [13, 14, 27, 28]. However, their main goal is to address the issue of overfitting in GNNs. They still require learning attention coefficients and binary gate values for all edges in the graph, and hence do not lead to any gains in computation or memory footprint. Moreover, [13] uses a gradient estimation procedure based on [28] which leads to high variance, because a single sample of the randomized loss function is used to estimate its expected value. In [14], the main goal is to learn a sparser graph to reduce overfitting rather than to improve scalability, and the method thus lacks the ability to handle large graphs. In contrast, FastGAT is a pre-processing step, resulting in a drastic improvement in training and inference time. It is also highly stable in terms of training and inference. Another work similar in spirit to ours is [27], which considers graph scattering transforms. These are untrainable counterparts of GCNs aimed towards gaining a better theoretical understanding of GCN models, and they do not directly compete with more powerful models such as attention based GNNs.

We denote a generic graph as G(E, V), where E is the edge set and V is the node set. N = |V| and M = |E| are the numbers of nodes and edges. w_e is the weight of an edge e ∈ E. L represents the graph Laplacian (defined as L = D − A, where D is the degree matrix) and λ_i(L) denotes the i-th eigenvalue of L. We use A ⪯ B to indicate that the matrix B − A is positive semidefinite, and use A† to denote the Moore-Penrose inverse of a matrix. We denote the i-th standard basis vector as χ_i.

Our goal is to address the scalability bottleneck by sparsifying the input graph before building an attention GNN, i.e., given an input graph G, to output a sparsified graph H with far fewer edges, while also ensuring that the properties of G are preserved. We outline the steps in our approach below.

Motivated by the fact that GNNs are approximations of spectral graph convolutions (defined in (1)), we aim to preserve the spectrum (or eigenstructure) of the graph. Formally, let L_G and L_H be the Laplacian matrices of the original graph G and the sparsified graph H. Spectral graph sparsification ensures that the spectral content of H is similar to that of G:

$$(1 - \epsilon)\, L_G \preceq L_H \preceq (1 + \epsilon)\, L_G. \qquad (4)$$

To do so, we adapt the idea in [20] to sample edges using a distribution according to the effective resistances of the edges, as defined below.

Definition 1 (Effective Resistance) The effective resistance between any two nodes of a graph can be defined as the potential difference induced across the two nodes when a unit current is injected at one node and extracted from the other node. Mathematically, it is defined as

$$R_e = b_e^\top L^{\dagger} b_e,$$

where $b_e = \chi_u - \chi_v$ ($\chi_l$ is a standard basis vector with 1 in the l-th position) and $L^{\dagger}$ is the pseudo-inverse of the graph Laplacian matrix.

The concept of effective resistance offers a way to measure the importance of an edge to the graph structure. For example, the removal of an edge between two nodes with a high effective resistance can significantly harm the graph connectivity.
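For small graphs, the effective resistances in Definition 1 can be computed directly from the Laplacian pseudo-inverse. The sketch below does exactly that; it is purely illustrative (the paper relies on the near-linear-time estimator of [20], outlined in the appendix, Section B.4), and all names in it are ours.

```python
# Exact effective resistances via the Laplacian pseudo-inverse, R_e = b_e^T L^+ b_e.
import numpy as np

def effective_resistances(A: np.ndarray, edges):
    """A: dense (weighted) adjacency matrix; edges: list of (u, v) node-index pairs."""
    L = np.diag(A.sum(axis=1)) - A          # graph Laplacian L = D - A
    L_pinv = np.linalg.pinv(L)              # Moore-Penrose pseudo-inverse L^+
    R = []
    for u, v in edges:
        b = np.zeros(A.shape[0])
        b[u], b[v] = 1.0, -1.0              # b_e = chi_u - chi_v
        R.append(b @ L_pinv @ b)            # R_e = b_e^T L^+ b_e
    return np.array(R)
```

Computing the pseudo-inverse costs O(N^3), which is exactly why the approximate estimator of [20] is used in practice.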
We define a distribution over the graph edges that is proportional to the effective resistances of the edges and then prune the graph by sampling from this distribution. The pruning process is summarized in Algorithm 1.

Choosing ε. As shown in Algorithm 1, the procedure requires setting a pruning parameter ε, which determines the quality of the approximation after sparsification according to Eq. (4). In our experiments, we use two representative values of ε.

Complexity. The sample complexity q in Algorithm 1 directly determines the final complexity. It can be theoretically shown that if q = O(N log N / ε²), then the spectral approximation in Eq. (4) can be achieved [20]. Note that this results in a number of edges that is almost linear in the number of nodes, as compared to quadratic as in the case of dense graphs. Of course, an evident question here is that of the complexity of computing R_e for all edges. R_e can be quickly estimated in O(M log N) time, where M is the number of edges [20]. While we describe the algorithm in detail in the appendix (Section B.4), it uses a combination of fast solvers for Laplacian based linear systems and the Johnson-Lindenstrauss Lemma. This is almost linear in the number of edges, and hence much smaller than the complexity of computing attention coefficients in every layer and forward pass of a GNN. Another important point is that the computation of the R_e's is a one-time cost. Unlike graph attention coefficients, we do not need to recompute the effective resistances in every training iteration. Hence, once sparsified, the same graph can be used in all subsequent experiments. Further, since each edge is sampled independently, the edge sampling process itself can be parallelized.
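The sampling step summarized in Algorithm 1 is short enough to sketch in full. The sketch below follows the scheme of [20]: sample q edges with replacement from the distribution p_e ∝ w_e R_e and reweight the retained edges. The reweighting rule w_e/(q p_e) is the standard one from [20], and the constant 0.16 is borrowed from the experimental setting q = int(0.16 N log N / ε²) reported later; these choices, and all names in the code, are assumptions rather than the paper's exact implementation.

```python
# Sketch of effective-resistance based edge sampling (Algorithm 1, following [20]).
import numpy as np

def sparsify_by_effective_resistance(edges, weights, R, N, eps, rng=None):
    """edges: list of (u, v) tuples; weights, R: arrays of edge weights / resistances."""
    rng = np.random.default_rng() if rng is None else rng
    p = weights * R
    p = p / p.sum()                                  # sampling distribution p_e ∝ w_e R_e
    q = int(0.16 * N * np.log(N) / eps ** 2)         # number of samples
    idx = rng.choice(len(edges), size=q, replace=True, p=p)
    new_w = {}
    for i in idx:
        # each sampled copy contributes w_e / (q * p_e); repeated samples accumulate
        new_w[edges[i]] = new_w.get(edges[i], 0.0) + weights[i] / (q * p[i])
    return new_w                                     # sparsified, reweighted edge set
```

Because the q draws are independent, the loop can be vectorized or parallelized without changing the result, as noted above.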
Given the sparsified graph as the output of step 1, we then apply standard attention based GNNs. For the GAT formulation in [5], in each layer we compute the attention coefficients as in Eq. (5):

$$\alpha_{ij} = \frac{\exp\!\left(\mathrm{LeakyReLU}\!\left(a^\top [W h_i \,\|\, W h_j]\right)\right)}{\sum_{k \in \mathcal{N}_{s,i}} \exp\!\left(\mathrm{LeakyReLU}\!\left(a^\top [W h_i \,\|\, W h_k]\right)\right)}, \qquad (5)$$

where the $h_i$'s are the input node features to the layer, W and a are linear mappings that are learnt, $\mathcal{N}_{s,i}$ denotes the set of neighbors of node i after sparsification, and $\|$ denotes concatenation. With the α's as defined above, the node-i output embedding of a GAT layer is given as in Eq. (6):

$$h_i' = \sigma\!\Big(\sum_{j \in \mathcal{N}_{s,i}} \alpha_{ij}\, W h_j\Big). \qquad (6)$$

Note that we compute the attention coefficients only for the subset of edges present after sparsification. This directly leads to a drastically lower computational complexity, especially when multi-head attention is used.

Our insight that preserving structural properties of a graph can provide sufficient information for GATs, and the convincing experimental results that we report in Section 5, raise many challenging and interesting theoretical questions. Although we use the sampling strategy provided in [20], their work addresses the preservation of only the eigenvalues of L. However, we are interested in the following questions: i) How is the GAT model related to graph convolutions? ii) Why does preserving the spectral structure of the graph also lead to good performance under the GAT model? To answer these questions, we first show that GAT has a direct relationship to graph convolutions using the random walk normalized graph Laplacian, which we define shortly. We then give an upper bound on the error between the feature updates computed by a single layer of the GAT model using the full graph and a spectrally sparsified graph.

For our analysis, we assume the attention vector a defined in [5] is symmetric, i.e., its two halves are equal. This results in a symmetric attention coefficient matrix and makes the analysis simpler. Note that symmetric attention functions are also used in practice [7].

Our first result is on the equivalence of GAT and graph convolutions. As stated in (1), graph convolution with a kernel g can be defined using the eigenvectors and eigenvalues of the symmetric normalized graph Laplacian matrix L_sym. Similarly, we can define a convolution operation using an alternative of the normalized Laplacian, known as the random walk normalized Laplacian matrix, defined as

$$L_{\mathrm{rw}} = I - D^{-1} A.$$

The convolution operation with L_rw can then be defined as in Eq. (7), analogously to (1), using the eigenvectors and eigenvalues of L_rw in place of those of L_sym.

Proposition 1 Each layer in the GAT model defines a new, layer-dependent graph adjacency matrix Γ^(l) and a corresponding degree matrix Γ_D^(l). Each layer then computes a first order approximation of the convolution operator defined in (7) using Γ^(l) and Γ_D^(l):

$$H^{(l+1)} = \sigma\!\left( (\Gamma_D^{(l)})^{-1} \Gamma^{(l)} H^{(l)} W^{(l)} \right),$$

where σ is the non-linearity and g^(l) is the corresponding layer-wise convolution kernel.

Proposition 1 establishes a direct equivalence between graph convolution and the GAT model. This interpretation also shows that the GAT model applies Laplacian smoothing to node features [29, 30]. We defer the proof of Proposition 1 to the appendix (Section A.2). Such a connection between spectral operations and attention based graph neural networks provides directions for the theoretical analysis of attention GNNs.

Spectral sparsification preserves the spectrum of the underlying graph. This hints that neural network computations that utilize spectral convolutions can be approximated using sparser graphs. We first show that this is true in a layer-wise sense for the GCN [3] model and then show a similar result for the GAT model as well. Below, we use ReLU to denote the standard Rectified Linear Unit and ELU to denote the Exponential Linear Unit.

Theorem 1 At any layer l of a GCN model with input features H^(l) ∈ R^{N×D} and weight matrix W^(l) ∈ R^{D×F}, if the element-wise non-linearity σ is either the ReLU or the ELU function, the features H_f and H_s computed using (3) with the full and a layer-dependent spectrally sparsified graph obey a Frobenius-norm error bound proportional to the sparsification parameter ε; the bound is stated in terms of L_sym, the symmetric normalized Laplacian matrix of the input graph (2), and is derived in the appendix.

In our next result, we show a similar upper bound on the features computed with the full and the sparsified graphs using the GAT model.

Theorem 2 At layer l of GAT with weight matrix W^(l) ∈ R^{D×F}, let Γ^(l), Γ_D be as defined in Proposition 1 and let Γ_sym be the corresponding symmetric normalized graph Laplacian. Then, the features H_f and H_s computed using (6) with the full and a layer-dependent spectrally sparsified graph obey an analogous Frobenius-norm error bound proportional to ε, where σ is either the ReLU or the ELU function.

Theorem 2 shows that if a layer-wise spectral sparsification of the graph is used to reduce the number of edges, then the feature updates computed by the sparse model are good approximations in a relative sense. Note that this requires sparsifying the graph in each layer separately, with the weights in the adjacency matrix given by Γ. In the next section, we show empirically that this expensive procedure of layer-wise sparsification can be replaced by a one-time spectral sparsification, as in FastGAT. We defer the proof of Theorem 2 to the appendix (Section A.2).
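The reformulation behind Proposition 1 is easy to check numerically: with a symmetric attention vector, the softmax-normalized GAT aggregation coincides (pre-activation) with the row-normalized product Γ_D^{-1} Γ H W, where Γ holds the exponentiated LeakyReLU scores on the edges. The toy check below is our own illustration, not the paper's code; the LeakyReLU slope of 0.2 and all names are assumptions.

```python
# Toy verification that the GAT update equals a row-normalized graph convolution.
import numpy as np

rng = np.random.default_rng(0)
N, D, F = 5, 4, 3
A = (rng.random((N, N)) < 0.6).astype(float)
A = np.triu(A, 1); A = A + A.T + np.eye(N)            # symmetric adjacency with self-loops
H = rng.standard_normal((N, D))
W = rng.standard_normal((D, F))
a = rng.standard_normal(F)                            # symmetric attention: same vector on both halves

HW = H @ W
s = HW @ a                                            # a^T W h_i for every node i
raw = s[:, None] + s[None, :]                         # a^T [W h_i || W h_j] under the symmetry assumption
e = np.where(raw > 0, raw, 0.2 * raw)                 # LeakyReLU with slope 0.2

# (i) GAT form: softmax over each node's neighborhood, then aggregate
masked = np.where(A > 0, e, -np.inf)
alpha = np.exp(masked) / np.exp(masked).sum(axis=1, keepdims=True)
out_gat = alpha @ HW

# (ii) Convolution form of Proposition 1: Gamma_D^{-1} Gamma (H W)
Gamma = np.where(A > 0, np.exp(e), 0.0)               # layer-dependent "adjacency" matrix
out_conv = np.diag(1.0 / Gamma.sum(axis=1)) @ Gamma @ HW

assert np.allclose(out_gat, out_conv)                 # identical pre-activation updates
```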
Approximation of weight matrices. Theorems 1 and 2 provide an upper bound on the feature updates obtained using the full and sparsified graphs. In practice, we observe an even stronger notion of approximation between GAT and FastGAT: the weight matrices of the two models after training are good approximations of each other. We report this observation in Section A.3 in the appendix. We show that the error between the learned matrices is small and proportional to the value of ε itself.

Datasets. We evaluated FastGAT on large and dense graph datasets using semi-supervised node classification tasks. Datasets are sourced from the Deep Graph Library (DGL) [31]. Their statistics are provided in Table 1. We also evaluated on smaller datasets including Cora, Citeseer and Pubmed, but present their results in the appendix (Section B.2), as they are much smaller graphs.

Baselines. We compared FastGAT with the following baseline methods: (1) the original graph attention network (GAT) [5], (2) SparseGAT [13], which learns edge coefficients to sparsify the graph, (3) random subsampling of edges, and (4) FastGCN [12], which is also designed for GNN speedup. Note that [12] is shown to outperform [4]; thus, here we compare only to FastGCN [12]. The evaluation setup and model implementation details are provided in Section B in the appendix.

Our first goal is to study the impact of graph sparsification on the accuracy and time performance of attention based GNNs in node classification. For a fair comparison, we fix the parameter ε, which determines how much spectral information is preserved. We then sample q = int(0.16 N log N / ε²) edges from the distribution p_e with replacement, as described in Section 3.1.

First, we provide a direct comparison between FastGAT and the original GAT model and report the results in Table 6. As can be observed from the results, FastGAT achieves the same test accuracy as the full GAT model across all the datasets, while being dramatically faster: on a GPU (CPU), we are able to achieve up to a 5x (10x) speedup. We then compare FastGAT with the following baselines: sparseGAT [13], random subsampling of edges, and FastGCN [12] in Table 3. Since sparseGAT and random subsampling use the attention mechanism just as our method does, we expect the accuracy of our method to be on par with sparseGAT; we expect to outperform random subsampling, since random edge removal can drop important edges. Since FastGCN does not use the attention mechanism, we generally expect FastGAT to outperform it in terms of classification accuracy. For training time, we compare our method only with other attention based GNNs, since attention coefficients are the computational bottleneck. We compare the training time per epoch for the baseline methods against FastGAT in Figure 1.

From the results, we can conclude that FastGAT matches state-of-the-art accuracy (F1-score), while being much faster and hence more efficient. While random subsampling of edges leads to a model that is as fast as ours, it also results in a degradation in accuracy. Our method is also faster than FastGCN on some large datasets, even though FastGCN does not compute any attention coefficients. We can conclude that despite using a much sparser graph, the classification accuracy remains the same (or sometimes even improves), while the training time reduces drastically. This is most evident in the case of the Reddit dataset, where the vanilla GAT model runs out of memory on a machine with 128GB RAM and a Tesla P100 GPU. This is expected, as computing attention coefficients over 57 million edges in each layer and epoch is a daunting task.
FastGAT has a significantly lower training time, while also preserving classification accuracy (Table 3). On the Reddit dataset, both the GAT and sparseGAT models run out of memory in the transductive setting. In Tables 6 and 3, FastGCN-400 denotes that we sample 400 nodes in every forward pass, as described in [12] (similarly, in FastGCN-800 we sample 800 nodes). FastGAT-0.5 denotes that we use ε = 0.5. GAT-rand-0.5 uses random subsampling of edges, but keeps the same number of edges as FastGAT-0.5.

Our next goal is to study whether FastGAT needs more epochs to achieve the same level of accuracy as that of using the full graphs. Fig. 2 shows that the per-epoch learning curves of FastGAT are consistent with those of the full graph for multiple datasets. We show similar plots for the other datasets in the appendix (Section B.3).

Finally, we study whether FastGAT is sensitive to the particular formulation of the attention function. Alternative formulations have been proposed to capture pairwise similarity. For example, [7] proposes a cosine similarity based approach, where the attention coefficient of an edge is defined via a scaled cosine similarity as in Eq. (10), where β^(l) is a layer-wise learnable parameter and cos(x, y) = x^⊤y / (‖x‖‖y‖). Another definition is proposed in [15] (GaAN: Gated Attention Networks), which defines attention as in Eq. (11), where FC_src and FC_dst are 2-layer fully connected neural networks. We performed similar experiments with these attention definitions. Tables 4 and 5 confirm that spectral sparsification generalizes to different attention functions and hence can be used as a general tool for efficient training on large scale graphs. Note that the variability in accuracy across Tables 6, 4 and 5 comes from the different definitions of the attention function and not from the spectral sparsification process. Our goal is to show that, given a model, spectral sparsification can achieve accuracy similar to that model, but in much less time.

In this paper, we introduced FastGAT, a method to make attention based GNNs lightweight by using spectral sparsification. We provided a re-formulation that interprets the GAT model as a graph convolution method, based on which we theoretically justified our FastGAT algorithm. FastGAT can significantly reduce the computational time across multiple large real-world graph datasets while attaining state-of-the-art performance.

Our paper presents a scalable algorithm, FastGAT, for speeding up attention based GNNs. Next we discuss possible real-world applications of FastGAT and its potential societal impacts.

Applications and societal impact: Many application domains such as technology, web, healthcare, manufacturing, education and retail now depend on AI algorithms. Neural network models (or deep learning models) have been playing pivotal roles as the back-end algorithms supporting predictive/classification models, computer vision applications and natural language processing applications. Attention based GNNs have been providing state-of-the-art performance in many classification tasks. Our algorithm FastGAT can play a positive role in supporting those diverse applications by speeding up these algorithms. For example, in healthcare applications, we can imagine FastGAT being applied 1) to analyze large biomedical knowledge graphs, including those derived from COVID-19 publications, to extract knowledge representations of potential treatments and assist drug development, and 2) to model large patient-provider networks to identify insurance fraud.
Caveat and potential weakness: Like many deep learning models, attention based GNNs are still black-box models, and it is not always straightforward to justify their outputs. Interpretable AI research is an important area that can potentially address this challenge in the future. The other weakness of attention based GNNs is the need for large amounts of training data, which might be difficult to acquire in some application domains.

A.1 GAT model is equivalent to layer-wise convolution

Consider a single layer of the GAT model. Let H ∈ R^{N×D} be the input feature matrix to a single GAT layer, let W ∈ R^{D×F} denote the weight matrix, and let C ∈ R^{D×2} be such that C = [a(1:N) a(N+1:2N)], where a ∈ R^{2N} denotes the attention coefficient vector as defined in [5]. Let A ∈ R^{N×N} be the graph adjacency matrix and let Ã = A + I_N. For a given graph, if D represents the degree matrix, then D^{-1}Ã is simply the state transition matrix of a random walker on the graph. We can further define a matrix Q ∈ R^{N×2} whose i-th row collects the two scalar scores associated with node i, one from each half of a. Further, let us define e_ij (as in [5]) as

$$e_{ij} = \mathrm{LeakyReLU}\!\left(a^\top [W h_i \,\|\, W h_j]\right),$$

where h_i is the i-th row of H. Then the vector e_i with e_i(j) = e_ij can be written in terms of the rows of Q. Hence, the matrix of attention coefficients before the softmax normalization, Γ, is obtained by applying f = exp(LeakyReLU(·)) elementwise to these scores on the edges of Ã (and setting entries outside the edge set to zero). Further, let Γ_D = diag(Γ 1_N). The GAT layer update can then be expressed as

$$H^{\mathrm{new}} = \sigma\!\left(\Gamma_D^{-1} \Gamma H W\right). \qquad (18)$$

Consider a new graph G_Γ(E, V) with Γ as its adjacency matrix, and let Γ_D be the corresponding degree matrix. Then the random walk normalized Laplacian is defined as

$$L_{\mathrm{rw}} = I - \Gamma_D^{-1} \Gamma.$$

Note that although L_rw is asymmetric, it is similar to the symmetric normalized Laplacian matrix:

$$L_{\mathrm{rw}} = \Gamma_D^{-1/2}\, L_{\mathrm{sym}}\, \Gamma_D^{1/2}.$$

Hence, L_rw has real eigenvalues that match those of L_sym. The corresponding eigenvectors of the two matrices are also related: if v is an eigenvector of L_sym, then Γ_D^{-1/2} v is an eigenvector of L_rw. Using this, we can define a new convolution operator analogous to (1), using the eigenvectors and eigenvalues of L_rw. Then, using a Chebyshev polynomial approximation similar to [3], we can show that for a given feature vector h, a first order approximation to the operation g_θ ⋆ h is

$$g_\theta \star h \approx \Gamma_D^{-1} \Gamma h\, w, \qquad (22)$$

and, for multiple output features, the new graph convolution operation has the first order approximation

$$g_\theta \star H \approx \Gamma_D^{-1} \Gamma H W, \qquad (23)$$

which matches exactly with (18). This shows that the model defined in [5] is similar to a GCN model, but with a layer-dependent graph.
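The similarity between L_rw and L_sym used above is easy to verify numerically: the two matrices share their spectrum, and the inverse square root of the degree matrix maps eigenvectors of L_sym to eigenvectors of L_rw. The toy check below (written for a generic graph with degree matrix D; all names are ours) confirms both facts.

```python
# Toy check: L_rw = I - D^{-1} A is similar to L_sym = I - D^{-1/2} A D^{-1/2}.
import numpy as np

rng = np.random.default_rng(1)
N = 6
A = (rng.random((N, N)) < 0.5).astype(float)
A = np.triu(A, 1); A = A + A.T + np.eye(N)                  # undirected graph with self-loops
d = A.sum(axis=1)
L_sym = np.eye(N) - np.diag(d ** -0.5) @ A @ np.diag(d ** -0.5)
L_rw = np.eye(N) - np.diag(1.0 / d) @ A

evals_sym, V = np.linalg.eigh(L_sym)                        # L_sym is symmetric
evals_rw = np.sort(np.linalg.eigvals(L_rw).real)
assert np.allclose(evals_sym, evals_rw)                     # identical spectra

v = V[:, 1]                                                 # any eigenvector of L_sym
u = np.diag(d ** -0.5) @ v                                  # mapped vector
assert np.allclose(L_rw @ u, evals_sym[1] * u)              # eigenvector of L_rw
```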
In this section, we show that the features learnt by graph convolution based neural networks are preserved when spectral sparsification techniques are applied to the original data graph. We use the following notation: L_sym denotes the symmetric normalized Laplacian matrix of a graph, as defined in (2), and L_sym,s denotes the symmetric normalized Laplacian matrix of the corresponding spectrally sparsified graph with parameter ε. Similarly, we use L_rw to denote the random walk normalized Laplacian matrix of a graph and L_rw,s to denote the corresponding random walk normalized Laplacian matrix of the spectrally sparsified graph.

Spectral sparsification and the GCN model. Consider the graph convolution network architecture proposed in [3]. We assume that the non-linearity σ(·) used is Lipschitz continuous with Lipschitz constant σ′. For a single neural network layer, let the input features be H ∈ R^{N×D} and let the weight matrix be W ∈ R^{D×F}, where F is the number of output features. Then, the new set of features computed by the GCN model is

$$H_f = \sigma\!\left(\tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} H W\right),$$

and the corresponding set of features computed by the GCN model using a spectrally sparsified graph is

$$H_s = \sigma\!\left(\tilde{D}_s^{-1/2} \tilde{A}_s \tilde{D}_s^{-1/2} H W\right).$$

Proof of Theorem 1. We first characterize the spectral norm error between the corresponding graph Laplacians L_sym and L_sym,s and then use the bound to prove Theorem 1. We use D and D_s to denote the degree matrices of the full and the spectrally sparsified graphs. Since both L_sym and L_sym,s are symmetric and positive semidefinite, we have

$$\|L_{\mathrm{sym}} - L_{\mathrm{sym},s}\| = \max\!\big(|\lambda_{\max}(L_{\mathrm{sym}} - L_{\mathrm{sym},s})|,\; |\lambda_{\min}(L_{\mathrm{sym}} - L_{\mathrm{sym},s})|\big) = \sup_{\|x\|=1} \big|x^\top (L_{\mathrm{sym}} - L_{\mathrm{sym},s})\, x\big|.$$

Taking the supremum on both sides and using the spectral approximation property (4), we obtain a bound on ‖L_sym − L_sym,s‖ that is proportional to ε, under a mild condition that holds for small ε. We then have the final result as follows. Let σ′ be the Lipschitz constant of the non-linearity σ; then

$$\|H_f - H_s\|_F \;\le\; \sigma' \, \|L_{\mathrm{sym}} - L_{\mathrm{sym},s}\|\, \|H W\|_F,$$

where we use σ′ = 1 for the ReLU or ELU non-linearity, and the inequality ‖AB‖_F ≤ ‖A‖ ‖B‖_F for any two matrices A and B.

Theorem 1 shows that if two GCN models that use the full and spectrally sparsified graphs have the same initialization W, then the corresponding feature updates are close in a Frobenius norm sense. Although we have not explored the dynamics of training, we strongly believe that similar bounds can be obtained on the gradients of the network parameters and in turn on the gradient descent updates.

Spectral sparsification and the GAT model. We now consider the graph attention network model proposed in [5]. With H and W as defined in the previous section, the feature update equations for the GAT model using the full and the spectrally sparsified graphs are given as

$$H_f = \sigma\!\left(\Gamma_D^{-1} \Gamma H W\right), \qquad H_s = \sigma\!\left(\Gamma_{D,s}^{-1} \Gamma_s H W\right).$$

As before, we first bound the error ‖L_rw − L_rw,s‖ and then use it to bound the error ‖H_f − H_s‖_F. Theorem 2 shows that if a layer-wise spectral sparsification of the graph is used to reduce the number of edges, then the feature updates computed by the sparse model are also preserved. Note that this requires sparsifying the graph in each layer separately, with the weights in the adjacency matrix given by Γ. In the next section, we show that this expensive procedure of layer-wise sparsification can be replaced by a one-time spectral sparsification procedure for the binary node classification problem.

We use the following lemma to establish Theorem 2.

Lemma 1 Let A ∈ R^{N×N} be any matrix and let D ∈ R^{N×N} be a diagonal matrix with positive diagonal entries. Then the spectral norm of D^{-1}A can be bounded in terms of the spectral norm of A and the diagonal entries of D.

Proof of Theorem 2: Using Lemma 1 together with the spectral approximation property (4), we bound ‖L_rw − L_rw,s‖. Further, from the feature update equations above, and since σ is Lipschitz continuous with Lipschitz constant σ′, we obtain the final result as in the proof of Theorem 1.

Theorems 1 and 2 provide an upper bound on the feature updates obtained using the full and sparsified graphs under both GCN and GAT. A stronger notion of information preservation after sparsification is obtained by studying the weight matrices to see whether the graph structure is retained after sparsification. To this end, we study the error ‖W − W_s‖, where W and W_s are the weight matrices of the neural network learned, respectively, using the full and the sparsified graph in any given layer. Such a result shows whether the graph structure retained is sufficient for the GAT model to learn strong features. We find that the error is proportional to the parameter ε. Such a comparison was not possible on the Reddit dataset, since the model cannot be run on the full graph. To lend support to this claim, we studied the difference between the weight matrices learned with and without spectral sparsification.
We used three different datasets (Coauthor-Physics, Github Social and Coauthor-CS). In each case, we used three different values of ε (0.25, 0.5, 0.75). At each parameter setting, we performed 5 independent trials and averaged the relative Frobenius errors between the weight matrices and the attention functions a of all attention heads. We report the results in Fig. 3. It is clear that the error between the learned matrices is proportional to the value of ε itself. This shows that the training process is highly stable with respect to spectral sparsification of the input graph.

Amazon Computers and Amazon Photo are segments of the Amazon co-purchase graph [1], where nodes represent goods, edges indicate that two goods are frequently bought together, node features are bag-of-words encoded product reviews, and class labels are given by the product category. Coauthor CS and Coauthor Physics are co-authorship graphs based on the Microsoft Academic Graph from the KDD Cup 2016 challenge. Here, nodes are authors, connected by an edge if they co-authored a paper; node features represent paper keywords for each author's papers, and class labels indicate the most active fields of study for each author. For the Reddit dataset, we predict which community different Reddit posts belong to based on the interactions between the posts. The Github Social dataset consists of Github users as nodes, and the task is to classify the users as web or machine learning developers (binary classification). For all the above datasets, the task is node classification. Additionally, we have experiments on the citation graphs Cora, Citeseer and Pubmed. In these datasets, the nodes represent documents, edges represent citation links, and the task is to categorize the documents into their fields of study.

Evaluation setup. For the Reddit dataset, we use a training, validation and test data split of 65%, 10% and 25%, as specified in the DGL library. For the other datasets, the split is 10%, 20% and 70%. The same split is used for evaluating the original GAT model. For training and evaluation, we closely follow the setup used in [5]. We first use the spectral sparsification algorithm to obtain a sparse graph and then use a two-layer GAT model for training and inference. The first layer consists of K = 8 attention heads, computing 8 output features each, after which we apply the exponential linear unit (ELU). The second layer consists of a single attention head that computes C features (where C is the number of classes), followed by a softmax activation. We use the same architecture when comparing with the GAT and sparseGAT models. We train all the models using a transductive approach wherein we use the features of all the nodes to learn the node embeddings.

Implementation details. For each dataset, we compute the effective resistances of the edges using the Laplacians library written in Julia by Spielman [32]. The rest of the algorithm is implemented in PyTorch. We use the code for GAT provided in [5]. We train our models on an Ubuntu 16.04 machine with 128GB memory and a Tesla P100 GPU (with 16GB memory). We use the Adam optimizer with a learning rate of 0.001. We use the hyperparameters recommended in [5] for all of our experiments that use the GAT model. For FastGCN, we use the baseline parameters recommended in [12].
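For concreteness, the two-layer architecture described above (eight attention heads with ELU, followed by a single-head output layer with softmax) can be written compactly as below. The paper builds on the original GAT code of [5]; expressing the model with PyTorch Geometric's GATConv, and all class and argument names here, are our own shorthand. The edge_index passed to the model would be the sparsified edge set produced in the pre-processing step.

```python
# Sketch of the two-layer GAT used in the experiments (8 heads + ELU, then 1 head).
import torch
import torch.nn.functional as F
from torch_geometric.nn import GATConv

class TwoLayerGAT(torch.nn.Module):
    def __init__(self, in_dim: int, hidden: int = 8, n_classes: int = 7, heads: int = 8):
        super().__init__()
        self.conv1 = GATConv(in_dim, hidden, heads=heads)            # 8 heads, 8 features each
        self.conv2 = GATConv(hidden * heads, n_classes, heads=1)     # single-head output layer

    def forward(self, x, edge_index):
        x = F.elu(self.conv1(x, edge_index))                         # ELU after the first layer
        x = self.conv2(x, edge_index)
        return F.log_softmax(x, dim=-1)                              # per-class scores
```

Training then proceeds with Adam (learning rate 0.001) exactly as for the full-graph model; only the edge list changes.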
Computing effective resistances. For all datasets, computing the effective resistances is a one-time pre-processing task. We use the algorithm proposed in [20], which takes about O(M log N) time to compute the effective resistances of all the edges in the graph. We compute the resistance values and store them as metadata. While performing training and inference, we load the resistance values and then sample from the distribution described in Section 3.1.

We report the experimental results on the smaller datasets Cora, Citeseer and Pubmed in Table 6. Since the number of edges is small compared to the larger datasets, the adjacency matrices of these graphs are already considerably sparse. Hence, sparsification does not result in a large reduction in the number of edges. However, the trend is still similar to what was observed on the large datasets: the accuracy does not drop, while training and inference time is lower than for the model using the full graph.

We provide plots of the achieved training accuracy against the epoch index for the rest of the datasets (Coauthor-CS, Amazon-Photos, Github Social) in Figure 4. The accuracy achieved while training with sparsified graphs matches well with that obtained using the full graph on all the datasets, showing that spectral sparsification does not affect learning in attention GNNs.

In this section, we briefly describe the algorithm to quickly compute the effective resistances of a graph G(E, V). We use the algorithm presented in [20] (Section 4) and describe it here for the sake of completeness. Let B ∈ R^{M×N} denote the signed edge-vertex incidence matrix of the graph and let Y ∈ R^{M×M} be the diagonal matrix of edge weights, so that L = B^⊤ Y B. Then, it can be shown that

$$R(uv) = \left\| Y^{1/2} B L^{\dagger} (\chi_u - \chi_v) \right\|_2^2.$$

Note that the R(uv)'s are just pairwise distances between the columns of the M × N matrix Y^{1/2} B L^†. The Johnson-Lindenstrauss Lemma can then be applied to approximately compute these distances: if R is a t × M random matrix chosen from a suitable distribution, such as a Bernoulli or Gaussian distribution, with t = O(log N / τ²), then the pairwise distances between the columns of R Y^{1/2} B L^† approximate the R(uv)'s up to a factor of (1 ± τ). Finally, the effective resistances are computed by applying a fast Laplacian linear system solver [33] to the rows of the matrix R Y^{1/2} B. Each application of the fast solver takes O(M log(1/δ)) time, where δ denotes the failure probability and can be set to a constant. The fast solver needs to be applied to O(log N) rows of the matrix R Y^{1/2} B. Hence, the overall complexity of the algorithm is O(M log N).
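The scheme just described can be sketched as follows: project Y^{1/2} B L^† with a random t × M sign matrix, solve one Laplacian system per projected row, and read the effective resistances off as pairwise column distances. The sketch is illustrative only; for brevity it substitutes SciPy's conjugate-gradient solver (with a tiny diagonal shift) for the fast Laplacian solver of [33], and every function, constant and variable name is our assumption.

```python
# Sketch of JL-based approximate effective resistances (in the spirit of [20]).
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import cg

def approx_effective_resistances(edges, weights, N, tau=0.3, rng=None):
    """edges: list of (u, v) with 0-based node ids; weights: array of edge weights."""
    rng = np.random.default_rng() if rng is None else rng
    M = len(edges)
    rows = np.repeat(np.arange(M), 2)
    cols = np.asarray(edges).ravel()
    vals = np.tile([1.0, -1.0], M)
    B = sp.csr_matrix((vals, (rows, cols)), shape=(M, N))   # signed edge-vertex incidence matrix
    Y_sqrt = sp.diags(np.sqrt(weights))
    L = B.T @ sp.diags(weights) @ B                         # Laplacian L = B^T Y B

    t = int(np.ceil(24 * np.log(N) / tau ** 2))             # JL dimension, t = O(log N / tau^2)
    Q = rng.choice([-1.0, 1.0], size=(t, M)) / np.sqrt(t)   # random sign projection
    P = ((Y_sqrt @ B).T @ Q.T).T                            # t x N matrix Q Y^{1/2} B

    Z = np.zeros((t, N))
    for i in range(t):                                      # row i of Z ~ row i of Q Y^{1/2} B L^+
        Z[i], _ = cg(L + 1e-8 * sp.eye(N), P[i])            # small shift handles the null space
    # R(u, v) is approximated by the squared distance between columns u and v of Z
    return np.array([np.sum((Z[:, u] - Z[:, v]) ** 2) for u, v in edges])
```

In the pipeline described above, these R_e values are computed once, stored as metadata, and reused across all subsequent runs.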
In the previous sections, we showed that for a suitable value of the tolerance parameter ε (such as 0.5 or 0.9), the accuracy is equivalent to that of using the full graph. However, the level of sparsification needed to maintain the classification performance might be different for different datasets. This raises the natural question of how to choose ε for a given dataset, which we address in this subsection. We provide an algorithm that sweeps through various values of ε and achieves state-of-the-art results on any given dataset. In our experience, we find that using ε = 0.5 produces test accuracies that are as good as those obtained using the full graph. Hence, we set 0.5 as the minimum value of ε that our algorithm chooses. It iteratively chooses a denser or a sparser graph based on the current validation error. We provide a block diagram of the algorithm in Figure 6.

Figure 6: Adaptive algorithm to tune the ε parameter (or the number of edges). We start with a sparse graph and iteratively build denser graphs as we progress through the epochs. In the "Add edges" step, we add a fixed number (0.003M) of edges to the graph. In the "Rate of learning better?" step, we compare the slope of the training accuracy curve with the previous slope over 20 epochs.

In Figure 7, we show the training accuracy versus the epoch index for our algorithm and compare it with that of a model using a constant ε of 0.5. From the figure, it is evident that our adaptive algorithm is successful in achieving the same learning rate as that of a model with a constant ε. Hence, this algorithm is suitable to be deployed as is on other real-world datasets.

Figure 7: Simulation results for the adaptive algorithm on the Coauthor-Physics, Coauthor-CS and Reddit datasets.

In Figure 5, we show the number of edges remaining in the graph after each instance of the algorithm choosing to sparsify or densify the graph. Since denser graphs do offer more information, it is natural that the algorithm chooses denser graphs over time in general. But it is also interesting to see that there are instances where the algorithm chooses a sparser graph. We also show the accompanying time per epoch, where we can see that it is much smaller than that of using a constant, low value of ε.

References

Inferring networks of substitutable and complementary products
Inductive representation learning on large graphs
Semi-supervised classification with graph convolutional networks
Inductive representation learning on large graphs
Graph Attention Networks. International Conference on Learning Representations
Capsule graph neural network
Attention-based graph neural network for semi-supervised learning
Heterogeneous multi-layered network model for omics data integration and analysis
Hierarchical representation learning in graph neural networks with node decimation pooling
Understanding attention and generalization in graph neural networks
FastGCN: fast learning with graph convolutional networks via importance sampling
Sparse graph attention networks
Robust graph representation learning via neural sparsification
GaAN: Gated attention networks for learning on large and spatiotemporal graphs
Adaptive sampling towards fast graph representation learning
Approximating s-t minimum cuts in Õ(n²) time
On sparse spanners of weighted graphs
Nearly-linear time algorithms for graph partitioning, graph sparsification, and solving linear systems
Graph sparsification by effective resistances
gSparsify: Graph motif based sparsification for graph clustering
Improved large-scale graph learning through ridge spectral sparsification
Metropolis algorithms for representative subgraph sampling
Provable and practical approximations for the degree distribution using sublinear graph samples
Graph sparsification, spectral sketches, and faster resistance computation, via short cycle decompositions
Graph sparsification approaches for Laplacian smoothing
Pruned graph scattering transforms
Learning sparse neural networks through L0 regularization
A signal processing approach to fair surface design
Deeper insights into graph convolutional networks for semi-supervised learning
Deep Graph Library
Spectral sparsification of graphs