title: Automated Graph Machine Learning: Approaches, Libraries and Directions
authors: Wang, Xin; Zhang, Ziwei; Zhu, Wenwu
date: 2022-01-04

Graph machine learning has been extensively studied in both academia and industry. However, as the literature on graph learning booms with a vast number of emerging methods and techniques, it becomes increasingly difficult to manually design the optimal machine learning algorithm for different graph-related tasks. To tackle this challenge, automated graph machine learning, which aims to discover the best hyper-parameter and neural architecture configurations for different graph tasks/data without manual design, is attracting increasing attention from the research community. In this paper, we extensively discuss automated graph machine learning approaches, covering hyper-parameter optimization (HPO) and neural architecture search (NAS) for graph machine learning. We briefly overview existing libraries designed for either graph machine learning or automated machine learning, and further introduce in depth AutoGL, our dedicated and the world's first open-source library for automated graph machine learning. Last but not least, we share our insights on future research directions for automated graph machine learning. This paper is the first systematic and comprehensive discussion of approaches, libraries, and directions for automated graph machine learning.

Graph data is ubiquitous in our daily life. We can use graphs to model the complex relationships and dependencies between entities, ranging from small molecules in proteins and particles in physical simulations to large nation-wide power grids and global airlines. Therefore, graph machine learning, i.e., machine learning on graphs, has long been an important research direction for both academia and industry [1]. In particular, network embedding [2], [3], [4], [5] and graph neural networks (GNNs) [6], [7], [8] have drawn increasing attention in the last decade. They have been successfully applied to recommendation systems [9], [10], fraud detection [11], bioinformatics [12], [13], physical simulation [14], traffic forecasting [15], [16], knowledge representation [17], drug re-purposing [18], [19], and pandemic prediction [20].

Despite the popularity of graph machine learning algorithms, the existing literature heavily relies on manual hyper-parameter or architecture design to achieve the best performance, resulting in costly human efforts when a vast number of models emerge for various graph tasks. Taking GNNs as an example, at least one hundred new general-purpose architectures have been published in top-tier machine learning and data mining conferences in the year 2021 alone, not to mention cross-disciplinary research on task-specific designs. More and more human effort is inevitably needed if we stick to the manual trial-and-error paradigm for designing the optimal algorithms for targeted tasks. On the other hand, automated machine learning (AutoML) has been extensively studied to reduce human efforts in developing and deploying machine learning models [21], [22]. Complete AutoML pipelines have the potential to automate every step of machine learning, including automated data collection and cleaning, automated feature engineering, and automated model selection and optimization, etc.
Due to the popularity of deep learning models, hyper-parameter optimization (HPO) [23], [24], [25], [26] and neural architecture search (NAS) [27], [28] are the most widely studied. AutoML has achieved or surpassed human-level performance [29], [30], [31] with little human guidance in areas such as computer vision [32], [33]. Automated graph machine learning, which combines the advantages of AutoML and graph machine learning, naturally serves as a promising research direction to further boost model performance and has attracted increasing interest from the community. In this paper, we provide a systematic overview of approaches for automated graph machine learning 1, introduce related public libraries as well as our AutoGL, the world's first open-source library for automated graph machine learning, and share our insights on challenges and future research directions. In particular, we focus on two major topics: HPO and NAS for graph machine learning. For HPO, we focus on how to develop scalable methods. For NAS, we follow the literature and compare different methods in terms of search spaces, search strategies, and performance estimation strategies. We also briefly discuss several recent automated graph learning works that feature different aspects such as architecture pooling, structure learning, accelerators, and joint software-hardware design. Besides, how different methods tackle the challenges of AutoML on graphs is discussed along the way as well. Then, we review libraries related to automated graph machine learning and discuss AutoGL, the first dedicated framework and open-source library for automated graph machine learning. We highlight the design principles of AutoGL and briefly introduce its usage, both of which are specially designed for AutoML on graphs. Last but not least, we point out potential research directions for both graph HPO and graph NAS, including but not limited to scalability, explainability, out-of-distribution generalization, robustness, and hardware-aware design. We believe this paper will greatly facilitate and further promote the studies and applications of automated graph machine learning in both academia and industry.

The rest of the paper is organized as follows. In Section 2, we introduce the fundamentals and preliminaries for automated graph machine learning by briefly introducing the basic formulations of graph machine learning and AutoML. We comprehensively discuss HPO-based approaches for graph machine learning in Section 3 and NAS-based methods for graph machine learning in Section 4. Then, in Section 5.1, we overview related libraries for graph machine learning and automated machine learning, and we introduce in depth AutoGL, our dedicated and the world's first open-source library tailored for automated graph machine learning. Last but not least, we outline future research opportunities in Section 6 and conclude the whole paper in Section 7.

We briefly present the basic problem formulations for graph machine learning and automated machine learning, as well as the unique characteristics of automated graph machine learning, before moving to the next section. We consider a graph $G = (\mathcal{V}, \mathcal{E})$, where $\mathcal{V} = \{v_1, \dots, v_{|\mathcal{V}|}\}$ is a set of nodes and $\mathcal{E} \subseteq \mathcal{V} \times \mathcal{V}$ is a set of edges. The neighborhood of node $v_i$ is denoted as $\mathcal{N}(i) = \{v_j : (v_i, v_j) \in \mathcal{E}\}$. The nodes can also have features denoted as $\mathbf{F} \in \mathbb{R}^{|\mathcal{V}| \times f}$, where $f$ is the number of features. We use bold uppercase letters (e.g., $\mathbf{X}$) and bold lowercase letters (e.g., $\mathbf{x}$) to represent matrices and vectors, respectively.
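For concreteness, the toy example below instantiates this notation for a small graph; the graph, features, and all numbers are made up purely for illustration.

```python
# Toy example of the notation above: a 4-node graph G = (V, E) with 2-dimensional
# node features F (|V| = 4, f = 2). All values are arbitrary.
import numpy as np

edges = [(0, 1), (1, 2), (2, 3), (3, 0)]             # E is a subset of V x V
F = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0],
              [0.5, 0.5]])                            # F has shape |V| x f
# Neighborhood N(i), treating the edges as undirected.
neighbors = {i: [j for (a, j) in edges if a == i] + [a for (a, j) in edges if j == i]
             for i in range(4)}
print(neighbors[0])   # [1, 3]
```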
Most tasks of graph machine learning can be divided into the following two categories:
• Node-level tasks: the tasks are associated with individual nodes or pairs of nodes. Typical examples include node classification and link prediction.
• Graph-level tasks: the tasks are associated with the whole graph, such as graph classification and graph generation.

For node-level tasks, graph machine learning models usually learn a node representation $\mathbf{H} \in \mathbb{R}^{|\mathcal{V}| \times d}$ and then adopt a classifier or predictor on the node representation to solve the task. For graph-level tasks, a representation for the whole graph is learned and fed into a classifier/predictor. GNNs are the current state-of-the-art in learning node and graph representations. The message-passing framework of GNNs [34] is formulated as follows:

$$\mathbf{m}_i^{(l)} = \mathrm{AGG}^{(l)}\left(\left\{a_{ij}^{(l)} \mathbf{W}^{(l)} \mathbf{h}_j^{(l)}, \forall v_j \in \mathcal{N}(i)\right\}\right), \quad (1)$$

$$\mathbf{h}_i^{(l+1)} = \sigma\left(\mathrm{COMBINE}^{(l)}\left(\mathbf{m}_i^{(l)}, \mathbf{h}_i^{(l)}\right)\right), \quad (2)$$

where $\mathbf{h}_i^{(l)}$ denotes the representation of node $v_i$ in the $l$-th layer, $\mathbf{m}_i^{(l)}$ is the message for node $v_i$, $\mathrm{AGG}^{(l)}(\cdot)$ is the aggregation function, $a_{ij}^{(l)}$ denotes the weight from node $v_j$ to node $v_i$, $\mathrm{COMBINE}^{(l)}(\cdot)$ is the combining function, $\mathbf{W}^{(l)}$ are learnable weights, and $\sigma(\cdot)$ is an activation function. The node representation is usually initialized as the node features, $\mathbf{H}^{(0)} = \mathbf{F}$, and the final representation is obtained after $L$ message-passing layers, $\mathbf{H} = \mathbf{H}^{(L)}$. For the graph-level representation, pooling methods (also called readout) are applied to the node representations:

$$\mathbf{h}_G = \mathrm{POOL}\left(\left\{\mathbf{h}_i^{(L)}, \forall v_i \in \mathcal{V}\right\}\right), \quad (3)$$

where $\mathbf{h}_G$ is the representation of the whole graph $G$.

Many AutoML algorithms such as HPO and NAS can be formulated as the following bi-level optimization problem:

$$\min_{\alpha \in \mathcal{A}} \mathcal{L}_{\mathrm{val}}\left(\mathbf{W}^*(\alpha), \alpha\right) \quad \mathrm{s.t.} \quad \mathbf{W}^*(\alpha) = \arg\min_{\mathbf{W}} \mathcal{L}_{\mathrm{train}}\left(\mathbf{W}(\alpha), \alpha\right), \quad (4)$$

where $\alpha$ is the optimization objective of the AutoML algorithm, e.g., hyper-parameters in HPO and neural architectures in NAS, $\mathcal{A}$ is the feasible space for the objective, and $\mathbf{W}(\alpha)$ are the trainable weights in the graph machine learning models. Essentially, we aim to optimize the objective in the feasible space so that the model achieves the best results in terms of a validation function, and $\mathbf{W}^*$ indicates that the weights are fully optimized in terms of a training function. Different AutoML methods differ in how the feasible space is designed and how the objective functions are instantiated and optimized, since directly optimizing Eq. (4) requires enumerating and training every feasible objective, which is prohibitive in practice. Typical formulations of automated graph machine learning need to properly integrate the above formulations in Section 2.1 and Section 2.2 to form a new optimization problem.

Automated graph machine learning, which non-trivially combines the strengths of AutoML and graph machine learning, faces the following challenges.
• The uniqueness of graph machine learning: Unlike audio, images, or text, which have a grid structure, graph data lies in a non-Euclidean space [35]. Thus, graph machine learning usually has unique architectures and designs. For example, typical NAS methods focus on search spaces for convolution and recurrent operations, which are distinct from the building blocks of GNNs [36].
• Complexity and diversity of graph tasks: As aforementioned, graph tasks per se are complex and diverse, ranging from node-level to graph-level problems, and with different settings, objectives, and constraints [37]. How to impose proper inductive biases and integrate domain knowledge into a graph AutoML method is indispensable.
• Scalability: Many real graphs such as social networks or the Web are incredibly large-scale, with billions of nodes and edges [38]. Besides, the nodes in the graph are interconnected and cannot be treated as independent samples. Designing scalable AutoML algorithms for graphs poses significant challenges since both graph machine learning and AutoML are already notorious for being compute-intensive.

Approaches with HPO or NAS for graph machine learning reviewed in later sections target at least one of these three challenges. As such, we will discuss approaches for automated graph machine learning from two aspects: i) HPO for graph machine learning and ii) NAS for graph machine learning.
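Before moving on, the following minimal, self-contained PyTorch sketch makes the formulation of Eqs. (1)-(3) concrete. It assumes mean aggregation, uniform weights $a_{ij} = 1$, a sum-style COMBINE, and a ReLU non-linearity; it is an illustration rather than the implementation of any particular method surveyed below.

```python
# A minimal sketch (not the authors' code) of one message-passing layer following
# Eqs. (1)-(2): mean aggregation (AGG), sum combination (COMBINE), uniform a_ij = 1,
# and a ReLU non-linearity sigma.
import torch
import torch.nn as nn

class SimpleMessagePassingLayer(nn.Module):
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)          # W^(l) applied to neighbors
        self.self_loop = nn.Linear(in_dim, out_dim, bias=False)  # transforms h_i^(l) for COMBINE

    def forward(self, h: torch.Tensor, edge_index: torch.Tensor) -> torch.Tensor:
        # h: [num_nodes, in_dim]; edge_index: [2, num_edges] with rows (source, target).
        src, dst = edge_index
        messages = self.W(h)[src]                                  # a_ij * W^(l) h_j^(l)
        agg = torch.zeros(h.size(0), messages.size(1), device=h.device)
        agg.index_add_(0, dst, messages)                           # sum messages per target node
        deg = torch.zeros(h.size(0), device=h.device).index_add_(
            0, dst, torch.ones(dst.size(0), device=h.device))
        m = agg / deg.clamp(min=1).unsqueeze(-1)                   # mean aggregation -> m_i^(l)
        return torch.relu(m + self.self_loop(h))                   # sigma(COMBINE(m_i^(l), h_i^(l)))

# Stacking L such layers yields H = H^(L); averaging the rows of H is one simple
# instantiation of the pooling operation POOL in Eq. (3).
layer = SimpleMessagePassingLayer(8, 16)
h = torch.randn(5, 8)
edge_index = torch.tensor([[0, 1, 2, 3], [1, 2, 3, 4]])
h_next = layer(h, edge_index)
h_G = h_next.mean(dim=0)
```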
In this section, we review HPO for graph machine learning. The main challenge here is scalability, i.e., a real graph can have billions of nodes and edges, and each trial on the graph is computationally expensive. Next, we elaborate on how different methods tackle the efficiency challenge. Notice that we omit some straightforward HPO methods such as random search and grid search [23] since they are applied to graphs without any modification.

Tu et al. [39] propose AutoNE, the first HPO method specially designed to tackle the efficiency problem of graphs, to facilitate hyper-parameter optimization for large-scale graph representation learning. AutoNE proposes a transfer paradigm that samples subgraphs as proxies for the large graph. Specifically, AutoNE has three modules: the sampling module, the signature extraction module, and the meta-learning module. In the sampling module, multiple representative subgraphs are sampled from the large graph using a multi-start random walk strategy. Each subgraph learns a representation by the signature extraction module. Then, AutoNE conducts HPO on the sampled subgraphs using Bayesian optimization [25] and records the results. Finally, using the HPO results and the representations of the subgraphs to extract meta-knowledge, AutoNE fine-tunes the hyper-parameters on the large graph using the meta-learning module. In this way, AutoNE achieves satisfactory results while maintaining scalability since the knowledge from multiple HPO trials on the sampled subgraphs and a few HPO trials on the large graph is properly integrated.
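The subgraph-proxy idea behind AutoNE can be illustrated with the following sketch. It deliberately replaces AutoNE's Bayesian optimization and meta-learning components with plain random search and a simple warm-start, so it only conveys the transfer paradigm, not the actual algorithm; the training routine is a placeholder.

```python
# Illustrative sketch (not AutoNE's implementation) of HPO with sampled subgraphs as
# cheap proxies, followed by a few expensive trials on the full graph.
import random
import networkx as nx

def random_walk_subgraph(G, walk_length=200):
    """Sample a connected induced subgraph with a single random walk."""
    start = random.choice(list(G.nodes))
    visited, current = {start}, start
    for _ in range(walk_length):
        neighbors = list(G.neighbors(current))
        if not neighbors:
            break
        current = random.choice(neighbors)
        visited.add(current)
    return G.subgraph(visited).copy()

def sample_config():
    return {"dim": random.choice([32, 64, 128]), "lr": 10 ** random.uniform(-4, -1)}

def train_and_eval(graph, config):
    # Placeholder objective: replace with real embedding/GNN training and a validation score.
    return random.random()

def proxy_hpo(G, n_subgraphs=4, trials_per_subgraph=10, full_graph_trials=3):
    proxy_results = []
    for _ in range(n_subgraphs):                       # many cheap trials on small proxies
        sub = random_walk_subgraph(G)
        for _ in range(trials_per_subgraph):
            cfg = sample_config()
            proxy_results.append((train_and_eval(sub, cfg), cfg))
    proxy_results.sort(key=lambda x: x[0], reverse=True)
    best_score, best_cfg = -1.0, None
    for _, cfg in proxy_results[:full_graph_trials]:   # a few expensive trials on the full graph
        score = train_and_eval(G, cfg)
        if score > best_score:
            best_score, best_cfg = score, cfg
    return best_cfg, best_score

print(proxy_hpo(nx.fast_gnp_random_graph(1000, 0.01)))
```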
Wang et al. [40] propose e-AutoGR to further increase the explainability of hyper-parameter optimization for automated graph representation learning, with the help of hyper-parameter importance decorrelation. e-AutoGR employs six fully explainable graph features, i.e., the number of nodes, the number of edges, the number of triangles, the global clustering coefficient, the maximum total degree value, and the number of components, as measures of similarity between different graphs. A hyper-parameter decorrelation algorithm (HyperDeco) is proposed to decorrelate the mixed relations among different hyper-parameters given various graph features, so that the importance of different hyper-parameters to model performance can be estimated more accurately through any regression approach. The authors theoretically validate the correctness of the proposed hyper-parameter decorrelation algorithm and empirically discover that first-order proximity is most important for AROPE [41], the number of walks together with the window size is of great importance for DeepWalk [42], and dropout is particularly important for GCN [43].

Guo et al. [44] propose JITuNE to replace the sampling process of AutoNE with graph coarsening, which generates a hierarchical graph synopsis. A similarity measurement module is also proposed to ensure that the coarsened graph shares sufficient similarity with the large graph. Compared with sampling, such a graph synopsis can better preserve graph structural information. Therefore, JITuNE argues that the best hyper-parameters found on the graph synopsis can be directly transferred to the large graph. Besides, since the graph synopsis is generated in a hierarchy, the granularity can be more easily adjusted to meet the time constraints of downstream tasks.

Yuan et al. [45] propose HESGA as another strategy to improve efficiency, using a hierarchical evaluation strategy together with evolutionary algorithms. Specifically, HESGA proposes to evaluate the potential of hyper-parameters by interrupting training after a few epochs and calculating the performance gap with respect to the initial performance with random model weights. This gap is used as a fast score to filter out unpromising hyper-parameters. Then, the standard full evaluation, i.e., training until convergence, is adopted as the final assessor to select the best hyper-parameters to be stored in the population of the evolutionary algorithm.

Besides efficiency, Yoon et al. [46] propose AutoGM, which focuses on studying a unified framework for various graph machine learning algorithms. Specifically, AutoGM finds that many popular GNNs and PageRank can be characterized in a framework similar to Eq. (1) with five hyper-parameters: the number of message-passing neighbors, the number of message-passing steps, the aggregation function, the dimensionality, and the non-linearity. AutoGM also adopts Bayesian optimization to optimize these hyper-parameters.

Yuan et al. [47] focus on the impact of selecting two types of GNN hyper-parameters (i.e., graph-related layers and task-specific layers) on the performance of GNNs for molecular property prediction. They employ CMA-ES for HPO, which is a derivative-free, evolutionary black-box optimization method. The results reveal that optimizing the two types of hyper-parameters separately can improve GNN performance, and that removing either of the two types of hyper-parameters from the search may result in deteriorated performance. Even though doing so means a larger search space, which seems more challenging given the same number of trials (i.e., limited computational resources), such a strategy can surprisingly achieve better performance. Meanwhile, their study further confirms the importance of HPO for GNNs in molecular property prediction problems.

Many molecular datasets are far smaller than the datasets in typical deep learning applications, and most HPO methods have not been explored in terms of their performance on these small datasets in the molecular domain. Yuan et al. [48] conduct a theoretical analysis of common and specific features of two state-of-the-art HPO algorithms, i.e., TPE and CMA-ES, and compare them with random search (RS). Experimental studies are carried out on several benchmarks in MoleculeNet from different perspectives to investigate the impact of RS, TPE, and CMA-ES on HPO of GNNs for molecular property prediction. Their experimental results indicate that TPE is the most suitable HPO method for GNNs in molecular property prediction problems with limited computational resources, while RS is the simplest method capable of achieving performance comparable to TPE and CMA-ES.
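The hierarchical (cheap-then-full) evaluation idea of HESGA [45] described earlier in this section can be sketched as follows; the sketch omits HESGA's evolutionary operators (mutation, crossover, population update) and uses a placeholder training routine, so it only illustrates the two-level evaluation itself.

```python
# Minimal sketch (not HESGA's code) of hierarchical evaluation: a cheap "fast score"
# (performance gain after a few epochs over the random-weight baseline) filters
# candidates, and only the survivors receive a full evaluation.
import random

def sample_hyperparams():
    return {"hidden": random.choice([16, 64, 256]),
            "lr": 10 ** random.uniform(-4, -1),
            "dropout": random.uniform(0.0, 0.6)}

def evaluate(hp, epochs):
    # Placeholder: replace with GNN training for `epochs` epochs returning a validation metric.
    return random.random() + 0.1 * epochs / 200

def hierarchical_search(population_size=20, survivors=5, fast_epochs=5, full_epochs=200):
    population = [sample_hyperparams() for _ in range(population_size)]
    scored = []
    for hp in population:
        initial = evaluate(hp, epochs=0)              # performance with random weights
        fast = evaluate(hp, epochs=fast_epochs)       # interrupted (cheap) training
        scored.append((fast - initial, hp))           # fast score = early performance gap
    scored.sort(key=lambda x: x[0], reverse=True)
    finalists = [hp for _, hp in scored[:survivors]]
    full_scores = [(evaluate(hp, epochs=full_epochs), hp) for hp in finalists]
    return max(full_scores, key=lambda x: x[0])       # full evaluation as the final assessor

print(hierarchical_search())
```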
GCN models are sensitive to the choice of hyper-parameters such as the dropout rate and weight decay [40], especially for deep GCN models. Zhu et al. [49] therefore target automating the training of GCN models through hyper-parameter optimization. To be specific, they propose a self-tuning GCN (ST-GCN) approach that incorporates hypernets in each graph convolutional layer, enabling the joint optimization of GCN model parameters and hyper-parameters. They further extend the approach by incorporating a population-based training scheme into self-tuning GCN, thus alleviating the local minima problem by exploring the hyper-parameter space globally. Experimental results on three benchmark datasets demonstrate the effectiveness of their approaches in terms of optimizing multi-layer GCNs.

Bu et al. [50] analyze the performance of different evolutionary algorithms on automated graph machine learning through an experimental study. The experimental results show that evolutionary algorithms can serve as an effective alternative to traditional hyper-parameter optimization algorithms such as random search, grid search, and Bayesian optimization for GNNs.

Sun et al. [51] propose AutoGRL, an automated graph representation learning framework for the node classification task. AutoGRL consists of an appropriate search space with four components: data augmentation, feature engineering, hyper-parameter optimization, and architecture search. Given graph data, AutoGRL searches for the best graph representation learning model in the search space using an efficient searching algorithm. Extensive experiments are conducted on four real-world node classification datasets to demonstrate that AutoGRL can automatically find competitive graph representation learning models on specific graph data effectively and efficiently.

NAS methods can be compared in three aspects [27]: search space, search strategy, and performance estimation strategy. Next, we review NAS methods for graph machine learning from these three aspects and discuss some designs unique to graphs. We mainly review NAS for GNNs fitting Eq. (1), which is the focus of the literature. We summarize the characteristics of different methods in Table 1.

The first challenge of NAS on graphs is the search space design, since the building blocks of graph machine learning are usually distinct from those of other deep learning models such as CNNs or RNNs. For GNNs, the search space can be divided into the following five categories.

Following the message-passing framework in Eq. (1), the micro search space defines how nodes exchange messages with each other in each layer. Commonly adopted micro search spaces [36], [52] cover the components of the message-passing layer, such as the aggregation function $\mathrm{AGG}^{(l)}(\cdot)$, the aggregation weights $a_{ij}^{(l)}$, the combine function $\mathrm{COMBINE}^{(l)}(\cdot)$, the hidden dimensionality, and the activation function $\sigma(\cdot)$ (e.g., Identity, Softplus, Leaky ReLU, ReLU6, and ELU). However, directly searching all these components results in thousands of possible choices in a single message-passing layer. Thus, it may be beneficial to prune the space to focus on a few crucial components depending on applications and domain knowledge [53].

Similar to residual connections and dense connections in CNNs, node representations in one layer of GNNs do not necessarily solely depend on the immediate previous layer [84], [85]. These connectivity patterns between layers form the macro search space. Formally, such designs are formulated as

$$\mathbf{H}^{(l)} = \sum_{j < l} \mathcal{F}_{jl}\left(\mathbf{H}^{(j)}\right),$$

where $\mathcal{F}_{jl}(\cdot)$ can be the message-passing layer in Eq. (1), ZERO (i.e., not connecting), IDENTITY (e.g., residual connections), or an MLP. Since the dimensionality of $\mathbf{H}^{(j)}$ can vary, IDENTITY can only be adopted if the dimensionality of each layer matches.
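A micro plus macro search space of this kind can be written down very compactly; the sketch below defines illustrative candidate sets (the operation names are examples, not the search space of any specific paper) and samples a random architecture description from them.

```python
# Illustrative sketch of a micro + macro GNN search space and a random sampler.
import random

MICRO_SPACE = {
    "aggregation": ["sum", "mean", "max"],
    "weights":     ["const", "gcn", "gat"],            # how a_ij is computed
    "combine":     ["sum", "concat", "mlp"],
    "hidden_dim":  [16, 64, 128, 256],
    "activation":  ["identity", "relu", "elu", "leaky_relu", "softplus"],
}
MACRO_SPACE = ["zero", "identity", "mlp", "message_passing"]  # F_jl connecting layer j to l

def sample_architecture(num_layers=3):
    layers = [{name: random.choice(choices) for name, choices in MICRO_SPACE.items()}
              for _ in range(num_layers)]
    # connections[l - 1][j] describes how H^(j) contributes to H^(l), for j < l.
    connections = [[random.choice(MACRO_SPACE) for j in range(l)]
                   for l in range(1, num_layers + 1)]
    return {"layers": layers, "connections": connections}

print(sample_architecture())
```

Even this toy MICRO_SPACE yields 3 x 3 x 3 x 4 x 5 = 540 combinations per layer, which illustrates why pruning the space with domain knowledge is often necessary.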
To handle graph-level tasks, information from all the nodes is aggregated to form graph-level representations using the pooling operation in Eq. (3). Jiang et al. [59] propose a pooling search space including row- or column-wise sum, mean, or maximum, attention pooling, attention sum, and flatten. More advanced methods such as hierarchical pooling [86] could also be added to the search space with careful designs. For example, PAS [75] further proposes to search for adaptive pooling architectures. First, they design a unified framework consisting of four modules, Aggregation, Pooling, Readout, and Merge, which can cover existing human-designed pooling methods (global and hierarchical) for graph classification. Based on this framework, a novel search space is designed by incorporating popular operations in human-designed architectures. To further enable efficient search, a coarsening strategy is proposed to continuously relax the search space so that differentiable search methods can be utilized. Extensive experiments on six real-world datasets from three domains are conducted, and the results demonstrate the effectiveness and efficiency of the proposed framework.

Besides architectures, other training hyper-parameters can be incorporated into the search space, i.e., similar to jointly conducting NAS and HPO. Typical hyper-parameters include the learning rate, the number of epochs, the batch size, the optimizer, the dropout rate, and regularization strengths such as the weight decay. These hyper-parameters can be jointly optimized with architectures or separately optimized after the best architectures are found. The HPO methods in Section 3 can also be combined here.

Another critical model choice not covered by the above four categories is the number of message-passing layers. Unlike CNNs, most currently successful GNNs are shallow, e.g., with no more than three layers, possibly due to the over-smoothing problem [87], [84]. Limited by this problem, existing NAS methods for GNNs preset the number of layers as a fixed small number. Apart from a recent attempt, DeepGNAS [71], how to automatically design deep GNNs while integrating techniques to alleviate over-smoothing remains mostly unexplored.

The search strategy can be broadly divided into three categories: architecture controllers trained with reinforcement learning (RL), differentiable methods, and evolutionary algorithms.

A widely adopted NAS search strategy is to use a controller to generate neural architecture descriptions and train the controller with reinforcement learning to maximize the model performance as the reward. For example, if we consider neural architecture descriptions as a sequence, we can use RNNs as the controller [29]. Such methods can be directly applied to GNNs with a suitable search space and performance evaluation strategy.

Differentiable NAS methods such as DARTS [30] and SNAS [82] have gained popularity in recent years. Instead of optimizing different operations separately, differentiable methods construct a single super-network (known as the one-shot model) containing all possible operations. Formally, we denote

$$\bar{o}^{(x,y)}(x) = \sum_{o \in \mathcal{O}} \frac{\exp\left(z_o^{(x,y)}\right)}{\sum_{o' \in \mathcal{O}} \exp\left(z_{o'}^{(x,y)}\right)}\, o(x),$$

where $o^{(x,y)}(x)$ is an operation in the GNN with input $x$ and output $y$, $\mathcal{O}$ is the set of all candidate operations, and $z^{(x,y)}$ are learnable vectors controlling which operation is selected. Briefly speaking, each searched operation is regarded as a weighted mixture, i.e., a probability distribution over all possible operations. In this way, the architecture and model weights can be jointly optimized via gradient-based algorithms. The main challenge lies in making the NAS algorithm differentiable, for which several techniques such as Gumbel-softmax [88] and the concrete distribution [89] are resorted to. When applied to GNNs, slight modifications may be needed to incorporate the specific operations defined in the search space, but the general idea of differentiable methods remains unchanged.
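The mixed operation defined above can be implemented in a few lines; the sketch below assumes simple linear candidate operations and the plain softmax relaxation, and is not the code of DARTS or of any graph NAS method.

```python
# Minimal PyTorch sketch of a DARTS-style mixed operation (softmax over candidates).
import torch
import torch.nn as nn

class MixedOp(nn.Module):
    """Softmax-weighted mixture over candidate operations o in O."""
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.ops = nn.ModuleList([
            nn.Identity() if in_dim == out_dim else nn.Linear(in_dim, out_dim),  # identity / skip
            nn.Linear(in_dim, out_dim),                                           # linear transform
            nn.Sequential(nn.Linear(in_dim, out_dim), nn.ReLU()),                 # linear + ReLU
        ])
        self.z = nn.Parameter(torch.zeros(len(self.ops)))   # architecture parameters z^(x,y)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        weights = torch.softmax(self.z, dim=0)               # exp(z_o) / sum_o' exp(z_o')
        return sum(w * op(x) for w, op in zip(weights, self.ops))

# In a full differentiable NAS pipeline, z and the operation weights are optimized
# alternately, and the discrete architecture keeps argmax(z) for each mixed operation.
mixed = MixedOp(16, 16)
out = mixed(torch.randn(4, 16))
```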
Evolutionary algorithms are a class of optimization algorithms inspired by biological evolution. For NAS, randomly generated architectures are considered the initial individuals of a population. Then, new architectures are generated using mutation and crossover operations over the population. The architectures are evaluated and selected to form the new population, and the same process is repeated. The best architectures are recorded while updating the population, and the final solutions are obtained after sufficient updating steps. For GNNs, regularized evolution (RE) NAS [33] has been widely adopted. RE's core idea is an aging mechanism, i.e., in the selection process, the oldest individuals in the population are removed. Genetic-GNN [90] also proposes an evolution process that alternately updates the GNN architecture and the learning hyper-parameters to find the best fit between them.

It is also feasible to combine these three types of search strategies. For example, AGNN [52] proposes a reinforced conservative search strategy that adopts both RNNs and evolutionary algorithms in the controller and trains the controller with RL. By only generating slightly different architectures, the controller can find well-performing GNNs more efficiently. Peng et al. [63] adopt CEM-RL [64], which combines evolutionary and differentiable methods.

Due to the large number of possible architectures, it is infeasible to fully train each architecture independently. Next, we review some performance estimation strategies. A commonly adopted "trick" to speed up performance estimation is to reduce fidelity [27], e.g., by reducing the number of epochs or the number of data points. This strategy can be directly generalized to GNNs.

Another strategy successfully applied to CNNs is sharing weights among different models, known as parameter sharing or weight sharing [31]. For differentiable NAS with a large one-shot model, parameter sharing is naturally achieved since the architectures and weights are jointly trained. However, training the one-shot model may be difficult since it contains all possible operations. To further speed up the training process, the single-path one-shot model [91] has been proposed, where only one operation between an input and output pair is activated during each pass. For NAS without a one-shot model, sharing weights among different architectures is more difficult but not entirely impossible. For example, since it is known that some convolutional filters are common feature extractors, inheriting weights from previous architectures is feasible and reasonable in CNNs [92]. However, since there is still a lack of understanding of what weights in GNNs represent, we need to be more cautious about inheriting weights [53]. AGNN [52] proposes three constraints for parameter inheritance: same weight shapes, same attention and activation functions, and no parameter sharing in batch normalization and skip connections.
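A cautious weight-inheritance scheme of this flavor amounts to copying only the parameters that are guaranteed to be compatible; the sketch below copies tensors whose names and shapes match and skips normalization layers. It is an illustration in the spirit of the constraints above, not AGNN's actual implementation.

```python
# Minimal sketch of cautious weight inheritance between a parent and a child architecture.
import torch.nn as nn

def inherit_weights(parent: nn.Module, child: nn.Module) -> int:
    """Copy parameters from parent to child when names and shapes match."""
    parent_state = parent.state_dict()
    child_state = child.state_dict()
    inherited = 0
    for name, tensor in child_state.items():
        if "bn" in name or "norm" in name:     # do not share normalization parameters/statistics
            continue
        if name in parent_state and parent_state[name].shape == tensor.shape:
            child_state[name] = parent_state[name].clone()
            inherited += 1
    child.load_state_dict(child_state)
    return inherited

parent = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 7))
child = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 3))   # new output head
print(inherit_weights(parent, child), "tensors inherited")              # 2 tensors inherited
```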
In this section, we discuss several recent advances in automated graph machine learning that feature topological structure learning, efficient architecture search, or software-hardware co-design. Qin et al. [83] investigate the important question of how NAS is able to select the desired GNN architectures by conducting a measurement study with experiments, which discovers that gradient-based NAS methods tend to select proper architectures based on the usefulness of different types of information with respect to the target task. The explorations further show that gradient-based NAS also suffers from noise hidden in the graph, resulting in suboptimal GNN architectures. Based on these findings, they propose a Graph differentiable Architecture Search model with Structure Optimization (GASSO), which allows differentiable search of the architecture with gradient descent and is able to discover graph neural architectures with better performance by employing graph structure learning as a denoising process in the search procedure. The proposed GASSO model is capable of simultaneously searching the optimal architecture and adaptively adjusting the graph structure by jointly optimizing graph architecture search and graph structure denoising. Extensive experiments on real-world graph datasets demonstrate that GASSO is able to achieve state-of-the-art performance compared with existing baselines.

G-CoS [74] is a GNN and accelerator co-search framework that can automatically search for matched GNN structures and accelerators to maximize both task accuracy and acceleration efficiency. Specifically, G-CoS integrates two major components: i) a generic GNN accelerator search space which is applicable to various GNN structures and ii) a one-shot GNN and accelerator co-search algorithm that enables simultaneous and efficient search for optimal GNN structures as well as their matched accelerators. Extensive experiments and ablation studies show that the GNNs together with accelerators generated by G-CoS consistently outperform state-of-the-art GNNs and GNN accelerators in terms of both task accuracy and hardware efficiency, while only requiring a few hours for the end-to-end generation of the best matched GNNs and their accelerators. Similarly, LPGNAS [60] jointly searches for architectures and quantization choices so that both model and buffer sizes can be greatly reduced while keeping accuracy similar to other methods. Their empirical results show that a quantization strategy with 4-bit weights and 8-bit activations might be the key for GNNs. ALGNN [78] considers the computation cost and complexity of the searched model using a multi-objective optimization method.

Lu et al. [76] propose FGNAS as the first software-hardware co-design framework for automating the search and deployment of GNNs. Using FPGAs as the target platform, the FGNAS framework is able to perform FPGA-aware graph neural architecture search, with the FPGA employed as the vehicle for illustration and implementation of the methods. Specific hardware constraints are considered, and quantization is adopted to compress the model. Under specific hardware constraints, they show that the FGNAS framework can successfully identify a solution with higher accuracy in less time than random search and the traditional two-step tuning. To evaluate the design, they conduct experiments on benchmark datasets, i.e., Cora, CiteSeer, and PubMed, and the results show that the proposed FGNAS framework has a better capability of optimizing the accuracy of GNNs when the hardware implementation is specifically constrained.
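A recurring ingredient in these hardware-aware approaches is multi-objective selection between task accuracy and hardware cost. The sketch below shows only this generic Pareto-filtering step, with made-up candidate numbers; it does not reproduce the search algorithms of G-CoS, LPGNAS, ALGNN, or FGNAS.

```python
# Illustrative Pareto-front filter over (accuracy, latency) for hardware-aware co-search.
candidates = [
    {"arch": "A", "accuracy": 0.815, "latency_ms": 12.0},
    {"arch": "B", "accuracy": 0.823, "latency_ms": 25.0},
    {"arch": "C", "accuracy": 0.801, "latency_ms": 6.5},
    {"arch": "D", "accuracy": 0.810, "latency_ms": 30.0},   # dominated by A and B
]

def dominates(a, b):
    """a dominates b if it is no worse on both objectives and strictly better on at least one."""
    return (a["accuracy"] >= b["accuracy"] and a["latency_ms"] <= b["latency_ms"]
            and (a["accuracy"] > b["accuracy"] or a["latency_ms"] < b["latency_ms"]))

pareto_front = [c for c in candidates
                if not any(dominates(o, c) for o in candidates if o is not c)]
print([c["arch"] for c in pareto_front])   # ['A', 'B', 'C']
```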
In this section, we discuss other unique NAS designs for graphs in terms of search space, transferability, and scalability. Besides the basic search space presented in Section 4.1, different graph tasks may require other search spaces. For example, meta-paths are critical for heterogeneous graphs [70], edge features are essential in modeling molecular graphs [59] and many other graph tasks [81], [80], and spatial-temporal modules are needed in skeleton-based recognition [63] and traffic forecasting [66]. A suitable search space usually requires careful design and domain knowledge.

It is non-trivial to transfer GNN architectures across different datasets and tasks due to the complexity and diversity of graph tasks. GraphGym [93] proposes to adopt a fixed set of GNNs as anchors and rank the performance of these GNNs on different tasks and datasets. Then, the rank correlation serves as a metric to measure the similarity between different datasets and tasks. The best-performing GNNs of the most similar tasks are transferred to solve the target tasks.

Similar to AutoNE introduced in Section 3, EGAN [58] proposes to sample small graphs as proxies and conduct NAS on the sampled subgraphs to improve the efficiency of NAS. While achieving some progress, more advanced and principled approaches are further needed to handle billion-scale graphs.

Although there have been quite a few libraries for both graph machine learning and automated machine learning, there is only one library for automated graph machine learning. Therefore, we briefly overview libraries for graph machine learning and automated machine learning, followed by an in-depth introduction of the world's first dedicated open-source automated graph machine learning library, AutoGL. Publicly available libraries are important to facilitate and advance the research and applications of AutoML on graphs. First, we briefly list libraries for graph machine learning and automated machine learning, respectively.

Graph Machine Learning Libraries. Popular libraries for graph machine learning include PyTorch Geometric [94], Deep Graph Library [95], GraphNets [96], AliGraph [97], Euler [98], PBG [99], PGL [100], TF-GNN [101], StellarGraph [102], Spektral [103], CogDL [104], OpenNE [105], OpenHGNN [106], GEM [107], Karate Club [108], and the classical NetworkX [109]. However, these libraries do not support AutoML.

Automated Machine Learning Libraries. On the other hand, AutoML libraries such as NNI [110], AutoKeras [111], AutoSklearn [112], Hyperopt [113], TPOT [114], AutoGluon [115], MLBox [116], and MLJAR [117] are widely adopted. Unfortunately, because of the uniqueness and complexity of graph tasks, they cannot be directly applied to automate graph machine learning. Despite their successes, integrating these libraries to fully support automated graph machine learning is non-trivial. This motivates us to design a specific library tailored for automated graph machine learning.

To fill this gap, we present Automated Graph Learning (AutoGL) 2, the first dedicated framework and library for automated graph machine learning. The overall framework of AutoGL is shown in Figure 1. We summarize and abstract the pipeline of AutoML on graphs into five modules: auto feature engineering, neural architecture search, model training, hyper-parameter optimization, and auto ensemble. For each module, we provide plenty of state-of-the-art algorithms, standardized base classes, and high-level APIs for easy and flexible customization.
The AutoGL library is built upon PyTorch Geometric (PyG) [94], a widely adopted graph machine learning library. AutoGL has the following key characteristics:
• Open source: The code 3 and detailed documentation 4 are available online.
• Easy to use: AutoGL is designed to be user-friendly. Users can conduct quick AutoGL experiments with less than ten lines of code.
• Flexible to be extended: The modular design, high-level base class APIs, and extensive documentation of AutoGL allow flexible and easily customized extensions.

2. This manuscript is based on AutoGL v0.2.0-pre, released on July 11th, 2021. Please visit the website for the most up-to-date version.
3. https://github.com/THUMNLab/AutoGL/
4. https://autogl.readthedocs.io/
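As an illustration of the intended ease of use, a quick-start sketch is given below. The class and function names follow the AutoGL documentation, but the exact argument names and defaults may differ between versions, so treat this as an assumption-laden sketch rather than a verbatim copy of the official example.

```python
# Illustrative AutoGL quick-start sketch (argument names are assumptions; consult the docs).
from autogl.datasets import build_dataset_from_name
from autogl.solver import AutoNodeClassifier

dataset = build_dataset_from_name("cora")           # built-in benchmark dataset
solver = AutoNodeClassifier(
    feature_module="deepgl",                        # auto feature engineering
    graph_models=["gcn", "gat"],                    # candidate models to tune
    hpo_module="anneal",                            # hyper-parameter optimization strategy
    ensemble_module="voting",                       # auto ensemble
)
solver.fit(dataset, time_limit=3600)                # fully automated pipeline under a time budget
solver.get_leaderboard().show()                     # inspect the tuned models
```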
In this section, we introduce the AutoGL designs in detail. AutoGL is designed in a modular and object-oriented fashion to enable clear logic flows, easy usage, and flexible extensions. All the APIs exposed to users are abstracted in a high-level fashion to avoid redundant re-implementation of models, algorithms, and train/evaluation protocols. All five main modules, i.e., auto feature engineering, neural architecture search, model training, hyper-parameter optimization, and auto ensemble, take into account the unique characteristics of graph machine learning. Next, we elaborate on the detailed designs of each module.

We first briefly introduce our dataset management. AutoGL Dataset is currently based on Dataset from PyTorch Geometric and supports common benchmarks for node and graph classification, including the recent Open Graph Benchmark [118]. We present the complete list of datasets in Table 3, and users can also easily customize datasets following our documentation. Specifically, we provide widely adopted node classification datasets including Cora, CiteSeer, PubMed [119], Amazon Computers, Amazon Photo, Coauthor CS, Coauthor Physics [120], and Reddit [121], and graph classification datasets such as MUTAG [122], PROTEINS [123], IMDB-B, IMDB-M, COLLAB [124], etc. Datasets from the Open Graph Benchmark [118] are also supported. Table 3 summarizes the statistics of the supported datasets.

The graph data is first processed by the auto feature engineering module, where various node-, edge-, and graph-level features can be automatically added, compressed, or deleted to help boost the subsequent graph learning process. Graph topological features can also be extracted to better utilize graph structures. Currently, we support 24 feature engineering operations abstracted into three categories: generators, selectors, and graph features. The generators aim to create new node and edge features based on the current node features and graph structures. The selectors automatically filter out and compress features to ensure they are compact and informative. Graph features focus on generating graph-level features. We summarize the supported generators in Table 4, including Graphlets [125], EigenGNN [126], PageRank [127], local degree profile, normalization, one-hot degrees, and one-hot node IDs. For selectors, GBDT [128] and FilterConstant are supported. An automated feature engineering method, DeepGL [129], is also supported, functioning as both a generator and a selector. For graph features, NetLSD [130] and a set of graph feature extractors implemented in NetworkX [109] are wrapped, e.g., NxTransitivity, NxAverageClustering, etc. We also provide convenient wrappers that support feature engineering operations in PyTorch Geometric [94] and NetworkX [109]. Users can easily customize feature engineering methods by inheriting from the class BaseGenerator, BaseSelector, or BaseGraph, or from BaseFeatureEngineer if the methods do not fit in our categorization.

In AutoGL, the Neural Architecture Search (NAS) module aims to automate the construction of graph neural networks: the best GNN model is searched using various NAS methods to fit the given dataset. In the NAS module, the Algorithm, GNNSpace, and Estimator submodules are developed to solve the search problem, where GNNSpace defines the whole search space, the Algorithm determines which architectures should be evaluated next, and the Estimator derives the performance of target architectures.

The model training module handles the training and evaluation process of graph machine learning tasks with two functional sub-modules: Model and Trainer. Model handles the construction of graph machine learning models, e.g., GNNs, by defining learnable parameters and the forward pass. Trainer controls the optimization process for the given model. Common optimization methods are packaged as high-level APIs to provide neat and clean interfaces. More advanced training controls and regularization methods in graph tasks, such as early stopping and weight decay, are also supported. The model training module supports both node-level and graph-level tasks, e.g., node classification and graph classification. Commonly used models for node classification, such as GCN [43], GAT [131], and GraphSAGE [121], models for graph classification, such as GIN [132], and pooling methods such as Top-K Pooling [133] are supported. Users can quickly implement their own graph models by inheriting from the BaseModel class and add customized tasks or optimization methods by inheriting from BaseTrainer.

The Hyper-Parameter Optimization (HPO) module aims to automatically search for the best hyper-parameters of a specified model and training process, including but not limited to architecture hyper-parameters such as the number of layers, the dimensionality of node representations, the dropout rate, and the activation function, as well as training hyper-parameters such as the learning rate, the weight decay, and the number of epochs. The hyper-parameters, their types (e.g., integer, numerical, or categorical), and feasible ranges can be easily set. We have supported various HPO algorithms, including algorithms specified for graph data like AutoNE [39] and e-AutoGR [40] and general-purpose algorithms like random search [23] and the Tree-structured Parzen Estimator (TPE) [24]. Users can customize HPO algorithms by inheriting from the BaseHPOptimizer class.

The auto ensemble module can automatically integrate the optimized individual models to form a more powerful final model. Currently, we have adopted two kinds of ensemble methods: voting and stacking. Voting is a simple yet powerful ensemble method that directly averages the outputs of individual models. Stacking trains another meta-model to combine the outputs of the models. We have supported general linear models (GLM) and gradient boosting machines (GBM) as meta-models.
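The two ensemble flavors can be sketched as follows; the base-learner outputs are synthetic random numbers standing in for, e.g., GCN and GAT probability predictions, and scikit-learn is used only for the stacking meta-model.

```python
# Minimal sketch of voting (probability averaging) versus stacking (meta-model) ensembles.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_val, n_classes = 200, 3
# Hypothetical class probabilities from two base learners on a validation set, plus labels.
proba_gcn = rng.dirichlet(np.ones(n_classes), size=n_val)
proba_gat = rng.dirichlet(np.ones(n_classes), size=n_val)
y_val = rng.integers(0, n_classes, size=n_val)

# Voting: average the base learners' probabilities and take the argmax.
voting_pred = np.argmax((proba_gcn + proba_gat) / 2, axis=1)

# Stacking: concatenate base-learner outputs as features for a meta-model (a GLM here).
meta_features = np.hstack([proba_gcn, proba_gat])
meta_model = LogisticRegression(max_iter=1000).fit(meta_features, y_val)
stacking_pred = meta_model.predict(meta_features)

print("voting acc:", (voting_pred == y_val).mean(),
      "stacking acc:", (stacking_pred == y_val).mean())
```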
On top of the modules mentioned above, we provide another high-level API, Solver, to control the overall pipeline. In Solver, the five modules are integrated systematically to form the final model. Solver receives the feature engineering module, a model list, the HPO module, and the ensemble module as initialization arguments to build an automated graph learning pipeline. Given a dataset and a task, Solver first performs auto feature engineering to clean and augment the input data, then optimizes all the given models using the model training and HPO modules. At last, the optimized best models are combined by the auto ensemble module to form the final model. Solver also provides global controls of the AutoGL pipeline. For example, the time budget can be explicitly set to restrict the maximum time cost, and the training/evaluation protocols can be selected from plain dataset splits or cross-validation.

In this section, we provide experimental results. Note that we mainly want to showcase the usage of AutoGL and its main functional modules rather than aiming to achieve the new state-of-the-art on benchmarks or to compare different algorithms. For node classification, we use Cora, CiteSeer, and PubMed with the standard dataset splits from [43]. For graph classification, we follow the setting in [134] and report the average accuracy of 10-fold cross-validation on MUTAG, PROTEINS, and IMDB-B.

We turn on all the functional modules in AutoGL and report the fully automated results in Table 5 and Table 6. We use the best single model for graph classification under the cross-validation setting. We observe that on all the benchmark datasets, AutoGL achieves better results than the vanilla models, demonstrating the importance of AutoML on graphs and the effectiveness of the proposed pipeline in the released library.

Table 7 reports the results of two implemented HPO methods, i.e., random search and TPE [24], for the semi-supervised node classification task. As shown in the table, as the number of trials increases, both HPO methods tend to achieve better results. Besides, both methods outperform vanilla models without HPO. Note that a larger number of trials does not guarantee better results because of the potential overfitting problem. We further test these HPO methods with ten trials for the graph classification task and report the results in Table 8. The results generally show improvements over the default hand-picked parameters on all datasets.

Table 9 reports the performance of the ensemble module as well as its base learners for the node classification task. We use voting as the example ensemble method and choose GCN and GAT as the base learners. The table shows that the ensemble module achieves better performance than both base learners, demonstrating the effectiveness of the implemented ensemble module.

We have presented AutoGL, the first library for automated graph machine learning, which is open-source, easy to use, and flexible to be extended. Currently, we are actively developing AutoGL and plan to support the following functionalities within a short time:
• Support for large-scale graphs.
• Handle more graph tasks, e.g., heterogeneous graphs and spatial-temporal graphs.
• Support more graph library backends, e.g., Deep Graph Library [95].
All kinds of inputs and suggestions are also warmly welcomed.

We have discussed the existing literature on automated graph machine learning approaches and libraries. Our discussion covers in detail how HPO and NAS can be applied to graph machine learning to handle problems in automated graph machine learning.
We also introduce AutoGL, a dedicated framework and library for automated graph machine learning. In this section, we suggest future directions deserving further investigation from both academia and industry. There exist plenty of challenges and opportunities worthy of future exploration.

• Scalability: AutoML has been successfully applied to various graph scenarios; however, there are still many future directions deserving further investigation regarding scalability to large-scale graphs. On the one hand, although HPO for large-scale graph machine learning has been preliminarily explored in the literature [39], the Bayesian optimization utilized in the model suffers from limited efficiency. Thus, it will be interesting and challenging to explore how to reduce the computational costs to realize fast hyper-parameter optimization. On the other hand, the scalability of NAS for graph machine learning has drawn little attention from researchers, despite the fact that applications involving large-scale graphs are very common in the real world, leaving a large space for further exploration.

• Explainability: Existing automated graph machine learning approaches are mainly based on black-box optimization. For example, it is unclear why certain NAS models perform better than others, and the explainability of NAS algorithms still lacks systematic research efforts. There have been some preliminary studies on the explainability of graph machine learning [135] and on explainable graph hyper-parameter optimization [40] via hyper-parameter importance decorrelation. However, further and deeper investigations into the explainability of automated graph machine learning are still of great importance.

• Out-of-distribution generalization: When applied to new graph datasets and tasks, huge human efforts are still needed to construct task-specific graph HPO configurations and graph NAS frameworks, e.g., spaces and algorithms. The generalization of current graph HPO configurations and NAS frameworks is limited, especially when training and testing data come from different distributions [136]. It will be a promising direction to study the out-of-distribution generalization abilities of both graph HPO and graph NAS algorithms so that they are capable of handling continuously and rapidly changing tasks.

• Robustness: Since many applications of AutoML on graphs are risk-sensitive, e.g., finance and healthcare, the robustness of the models is indispensable for actual usage. Though there exist some initial studies on the robustness [137] of graph machine learning, how to generalize these techniques to automated graph machine learning has not been explored.

• Graph models for AutoML: In this paper, we mainly focus on how AutoML methods are extended to graphs. The other direction, i.e., using graphs to help AutoML, is also feasible and promising. For example, we can model neural networks as a directed acyclic graph (DAG) to analyze their structures [138], [93] or adopt GNNs to facilitate NAS [90], [139], [140], [141]. Ultimately, we expect graphs and AutoML to form tighter connections and further facilitate each other.

• Hardware-aware models: To further improve the scalability of automated graph machine learning, hardware-aware models may be a critical step, especially in real industrial environments. Both hardware-aware graph models [142] and hardware-aware AutoML models [143], [144], [145] have been studied, but integrating these techniques is still in an early stage and poses significant challenges.
• Comprehensive evaluation protocols: Currently, most Au-toML on graphs are tested on small traditional benchmarks such as three citations graphs, i.e., Cora, CiteSeer, and PubMed [119] . However, these benchmarks have been identified as insufficient to compare different graph machine learning models [146] , not to mention AutoML on graphs. More comprehensive evaluation protocols are needed, e.g., on recently proposed graph machine learning benchmarks [37] , [147] , or new dedicated graph AutoML benchmarks similar to the NAS-bench series [148] are needed. In this paper, we discuss the current state-of-the-art automated graph machine learning approaches, libraries. In particular, we in depth elaborate how graph hyperparameter optimization (HPO) and graph neural architecture search (NAS) have been developed to facilitate automated graph machine learning. We also introduce AutoGL, our dedicated framework and open source library for automated graph machine learning. Last but not least, we point out challenges and suggest promising directions deserving further investigations. A survey on network embedding Representation learning on graphs: Methods and applications Graph embedding techniques, applications, and performance: A survey A comprehensive survey of graph embedding: Problems, techniques, and applications Deep learning on graphs: A survey A comprehensive survey on graph neural networks Graph neural networks: A review of methods and applications Graph convolutional neural networks for web-scale recommender systems Learning disentangled representations for recommendation Graph based anomaly detection and description: a survey Network embedding in biomedical data science Predicting multicellular function through multi-layer tissue networks Neural relational inference for interacting systems Diffusion convolutional recurrent neural network: Datadriven traffic forecasting Spatio-temporal graph convolutional networks: a deep learning framework for traffic forecasting Knowledge graph embedding: A survey of approaches and applications Few-shot link prediction via graph neural networks for covid-19 drug-repurposing Network medicine framework for identifying drug repurposing opportunities for covid-19 Examining covid-19 forecasting using spatiotemporal graph neural networks Automl: A survey of the state-of-the-art Taking human out of learning applications: A survey on automated machine learning Random search for hyper-parameter optimization Algorithms for hyper-parameter optimization Practical bayesian optimization of machine learning algorithms Meta hyperparameter optimization with adversarial proxy subsets sampling Neural architecture search: A survey Autoias: Automatic integrated architecture searcher for click-trough rate prediction Neural architecture search with reinforcement learning Darts: Differentiable architecture search Efficient neural architecture search via parameters sharing Learning transferable architectures for scalable image recognition Regularized evolution for image classifier architecture search Neural message passing for quantum chemistry Geometric deep learning: going beyond euclidean data Graph neural architecture search Open graph benchmark: Datasets for machine learning on graphs On power law growth of social networks Autone: Hyperparameter optimization for massive network embedding Explainable automated graph representation learning with hyperparameter importance Arbitrary-order proximity preserved network embedding Deepwalk: Online learning of social representations 
Semi-supervised classification with graph convolutional networks Jitune: Just-in-time hyperparameter tuning for network embedding algorithms A novel genetic algorithm with hierarchical evaluation strategy for hyperparameter optimisation of graph neural networks Autonomous graph mining algorithm search with best speed/accuracy trade-off Which hyperparameters to optimise? an investigation of evolutionary hyperparameter optimisation in graph neural network for molecular property prediction A systematic comparison study on hyperparameter optimisation of graph neural networks for molecular property prediction Automated graph learning via population based selftuning gcn Automatic graph learning with evolutionary algorithms: An experimental study Automated graph representation learning for node classification Auto-gnn: Neural architecture search of graph neural networks Simplifying architecture search for graph neural network Probabilistic dual network architecture search on graphs Neural architecture search in graph neural networks Autograph: Automated graph neural network Evolutionary architecture search for graph neural networks Efficient graph neural architecture search Graph neural network architecture search for molecular property prediction Learned low precision graph neural networks Design space for graph neural networks Sgas: Sequential greedy architecture search Learning graph convolutional network for skeletonbased human action recognition by neural searching CEM-RL: Combining evolutionary and gradientbased methods for policy search Rethinking graph neural architecture search from message-passing Autostg: Neural architecture search for predictions of spatio-temporal graphs One-shot graph neural architecture search with dynamic search space Search to aggregate neighborhood for graph neural network Autoattend: Automated attention representation search Diffmg: Differentiable meta graph search for heterogeneous graph neural networks Search for deep graph neural networks Learn layer-wise connections in graph neural networks Fl-agcns: Federated learning framework for automatic graph convolutional network search G-cos: Gnn-accelerator co-search towards both better accuracy and efficiency Pooling architecture search for graph classification Fgnas: Fpga-aware graph neural architecture search Graphpas: Parallel architecture search for graph neural networks Algnn: Auto-designed lightweight graph neural network Mopso: A proposal for multiple objective particle swarm optimization Edge-featured graph neural architecture search Autogel: An automated graph neural network with explicit link information Snas: stochastic neural architecture search Graph differentiable architecture search with structure learning Deepgcns: Can gcns go as deep as cnns Representation learning on graphs with jumping knowledge networks Hierarchical graph representation learning with differentiable pooling Deeper insights into graph convolutional networks for semi-supervised learning Categorical reparameterization with gumbel-softmax The concrete distribution: A continuous relaxation of discrete random variables Bridging the gap between sample-based and one-shot neural architecture search with bonas Single path one-shot neural architecture search with uniform sampling Large-scale evolution of image classifiers Graph structure of neural networks," in ICML Fast graph representation learning with PyTorch Geometric Deep graph library: A graph-centric, highlyperformant package for graph neural networks Relational inductive biases, 
deep learning, and graph networks Aligraph: A comprehensive graph neural network platform Euler: A distributed graph deep learning framework PyTorch-BigGraph: A Large-scale Graph Embedding System Paddle graph learning Tensorflow gnn Stellargraph machine learning library Graph neural networks in tensorflow and keras with spektral Cogdl: An extensive research toolkit for deep learning on graphs Openne: An open source toolkit for network embedding Heterogeneous graph neural network Gem: A python package for graph embedding methods Karate Club: An API Oriented Open-source Python Framework for Unsupervised Learning on Graphs Exploring network structure, dynamics, and function using networkx Retiarii: A deep learning exploratory-training framework Auto-keras: An efficient neural architecture search system Auto-sklearn: efficient and robust automated machine learning Making a science of model search: Hyperparameter optimization in hundreds of dimensions for vision architectures Evaluation of a tree-based pipeline optimization tool for automating data science Autogluon-tabular: Robust and accurate automl for structured data Mlbox, machine learning box Mljar automated machine learning Open graph benchmark: Datasets for machine learning on graphs Collective classification in network data Pitfalls of graph neural network evaluation Inductive representation learning on large graphs Structure-activity relationship of mutagenic aromatic and heteroaromatic nitro compounds. correlation with molecular orbital energies and hydrophobicity Protein function prediction via graph kernels Deep graph kernels Network motifs: simple building blocks of complex networks Eigen-gnn: A graph structure preserving plug-in for gnns The anatomy of a large-scale hypertextual web search engine Lightgbm: A highly efficient gradient boosting decision tree Deep inductive graph representation learning Netlsd: hearing the shape of a graph Graph Attention Networks How powerful are graph neural networks? Graph u-nets A fair comparison of graph neural networks for graph classification Explainability in graph neural networks: A taxonomic survey Ood-gnn: Out-of-distribution generalized graph neural network Adversarial attack and defense on graph data: A survey Exploring randomly wired neural networks for image recognition Graph hypernetworks for neural architecture search Brp-nas: Prediction-based nas using gcns Gqnas: Graph q network for neural architecture search Hardware acceleration of graph neural networks Proxylessnas: Direct neural architecture search on target task and hardware Mnasnet: Platform-aware neural architecture search for mobile Hardware-aware transformable architecture search with efficient search space Pitfalls of graph neural network evaluation Benchmarking graph neural networks Nas-bench-101: Towards reproducible neural architecture search 1. We provide a paper collection about automated graph machine learning at https://github.com/THUMNLab/awesome-auto-graph-learning.