title: Graph Classification by Mixture of Diverse Experts
authors: Hu, Fenyu; Wang, Liping; Wu, Shu; Wang, Liang; Tan, Tieniu
date: 2021-03-29

Graph classification is a challenging research problem in many applications across a broad range of domains. In these applications, it is very common that the class distribution is imbalanced. Recently, Graph Neural Network (GNN) models have achieved superior performance on various real-world datasets. Despite their success, most current GNN models largely overlook the important setting of imbalanced class distribution, which typically results in prediction bias towards majority classes. To alleviate this prediction bias, we propose to leverage the semantic structure of the dataset based on the distribution of node embeddings. Specifically, we present GraphDIVE, a general framework leveraging a mixture of diverse experts (i.e., graph classifiers) for imbalanced graph classification. Following a divide-and-conquer principle, GraphDIVE employs a gating network to partition an imbalanced graph dataset into several subsets. Each expert network is then trained on its corresponding subset. Experiments on real-world imbalanced graph datasets demonstrate the effectiveness of GraphDIVE.

Graph classification aims to identify the class labels of graphs in a dataset, which is a critical and challenging problem for a broad range of real-world applications (Huang et al., 2016; Fey et al., 2018; Zhang & Chen, 2019; Jia et al., 2020). For instance, in chemistry, a molecule can be represented as a graph, where nodes denote atoms and edges represent chemical bonds. Correspondingly, the classification of molecular graphs can help predict target molecular properties (Hu et al., 2020).

As a powerful approach to graph representation learning, Graph Neural Network (GNN) models have achieved outstanding performance in graph classification (Ying et al., 2018; Xu et al., 2019; Wang et al., 2020; Corso et al., 2020). Most existing GNN models first transform nodes into low-dimensional dense embeddings to learn discriminative graph attributive and structural features, and then summarize all node embeddings to obtain a global representation of the graph. Finally, Multi-Layer Perceptrons (MLPs) are used to facilitate end-to-end training. Nevertheless, current GNN models largely overlook the important setting of imbalanced class distribution, which is ubiquitous in practical graph classification scenarios. For example, in the OGBG-MOLHIV dataset (Hu et al., 2020), only about 3.5% of molecules can inhibit HIV virus replication. Figure 1 presents graph classification results of GCN (Kipf & Welling, 2017) and GIN (Xu et al., 2019) on this dataset. Considering either test accuracy or cross-entropy loss, the classification performance on the minority class falls far behind that on the majority class, which indicates the necessity of improving GNNs from the perspective of imbalanced learning.

(* Equal contribution. 1 Center for Research on Intelligent Perception and Computing, Institute of Automation, Chinese Academy of Sciences. 2 School of Artificial Intelligence, University of Chinese Academy of Sciences. Correspondence to: Shu Wu.)
However, apart from suffering from the well-known learning bias towards majority classes (Sun et al., 2009; He & Garcia, 2009; Kang et al., 2019), the class-imbalanced learning problem is exacerbated in graph classification for the following reasons: (I) structure diversity; (II) poor applicability to the multi-task classification setting.

First, structure diversity and the related out-of-distribution problem are ubiquitous in real-world graph datasets. For example, when scientists want to predict properties of COVID-19, the task is difficult because COVID-19 differs from all known viruses. This means typical imbalanced learning methods, such as re-sampling (Chawla et al., 2002) and re-weighting (Huang et al., 2016), may no longer work on graph datasets. Because of the potential over-fitting to minority classes, these imbalanced strategies are sensitive to fluctuations of minority classes, resulting in unstable performance.

Second, in more common cases, such as drug discovery (Hu et al., 2020; Bécigneul et al., 2020; Pan et al., 2015) and functional brain analysis (Pan et al., 2016), graph datasets contain multiple classification tasks due to the need to predict various properties of a graph simultaneously. For example, the Tox21 Challenge aims to predict whether certain chemical compounds are active for twelve pathway assays, which can be viewed as a twelve-task binary classification setting. Existing imbalanced learning methods are originally designed for single-task classification. Therefore, it is difficult to apply existing imbalanced learning strategies (Chawla et al., 2002; Kang et al., 2019; Kim et al., 2020) to the multi-task setting.

To this end, we propose a novel imbalanced Graph classification framework with DIVerse Experts, referred to as GraphDIVE for brevity. First, we leverage a gating network to capture the semantic structure of the dataset. As illustrated in Figure 2, semantically similar graphs are grouped into the same subset. Then, multiple classifiers, which are referred to as experts, are trained based on their corresponding subsets. Due to differences in semantic structure, samples of the majority class and the minority class tend to be concentrated in different subsets. For the subset containing most of the minority class (please refer to Subset 2 in Figure 2), the imbalance phenomenon is alleviated. As a result, the performance on the minority class improves.

We systematically study the effect of GraphDIVE on public benchmarks and obtain the following key observations: (I) existing imbalanced learning strategies struggle to offer satisfactory improvements on graph datasets because of the graph diversity problem; (II) the performance of existing GNNs on graph classification can be further improved by appropriately modeling the imbalanced distribution; (III) capturing the semantic structure of a dataset can address the structure diversity problem and alleviate the bias towards majority classes; (IV) different gates are generally necessary in the multi-task setting. Apart from these observations, GraphDIVE achieves state-of-the-art results in both single-task and multi-task settings. For instance, on the HIV and BACE benchmarks, GraphDIVE achieves improvements over GCN by 2.09% and 4.24%, respectively.

Over the past few years, we have witnessed the fast development of graph neural networks.
They process permutation-invariant graphs with variable sizes and learn to extract discriminative features through a recursive process of transforming and aggregating representations from neighbors. GNNs were first introduced by Gori et al. (2005) as a form of recurrent neural networks. Later, Bronstein et al. (2017) define convolutional operations using the Fourier transform and the graph Laplacian. GCN (Kipf & Welling, 2017) approximates the Fourier transformation process by truncating the Chebyshev polynomial to the first-order neighborhood. GraphSAGE (Hamilton et al., 2017) samples a fixed number of neighbors and employs several aggregation functions. GAT (Veličković et al., 2018) aggregates information from neighbors using an attention mechanism. DR-GCN (Shi et al., 2020) applies a class-conditioned adversarial network to alleviate bias in the imbalanced node classification task. Recently, Chen et al. (2020) tackle the over-smoothing problem by using initial residual and identity mapping.

Apart from innovations on convolution filters, there are two main branches in the research on graph classification. On the one hand, graph pooling methods, such as DiffPool (Ying et al., 2018), Graph U-Nets (Gao & Ji, 2019), and self-attention pooling (Lee et al., 2019), are developed to extract more global information. Mesquita et al. (2020) take a step further in understanding the effect of local pooling methods. On the other hand, there is a growing class of graph isomorphism methods (Xu et al., 2019; Morris et al., 2019; Corso et al., 2020) which aim to quantify the representation power of GNNs.

Re-sampling. Re-sampling methods aim to balance the class distribution by controlling each class's sample frequency. This can be achieved by over-sampling or under-sampling (Chawla et al., 2002; Han et al., 2005; He et al., 2008). Nevertheless, traditional random sampling methods usually cause over-fitting on minority classes or under-fitting on majority classes. Recently, Kang et al. (2019) propose to use a re-sampling strategy in a two-stage training scheme. Besides, Liu et al. (2020), Kim et al. (2020), and Chou et al. (2020) generate augmented samples to supplement minority classes, which can also be viewed as re-sampling methods. However, it is non-trivial to apply augmentation to graphs with variable sizes. These re-sampling methods are also unable to produce multiple predictions simultaneously in the multi-task setting because different tasks usually have different class distributions.

Re-weighting. Re-weighting methods generally assign different weights to different samples. The traditional scheme re-weights classes proportionally to the inverse of the class frequency, which tends to make optimization difficult under extremely imbalanced settings (Huang et al., 2016; 2019a). Another line of work assigns weights according to the properties of each training instance. FocalLoss (Lin et al., 2017) lowers the weights of well-classified samples. GHM (Li et al., 2019) improves FocalLoss by further lowering the weights of samples with very large gradients, which are likely outliers. However, these two kinds of methods require domain experts to hand-craft the loss function for a specific task, which may restrict their applicability. Recently, Cui et al. (2019) introduce the effective number of samples to put larger weights on minority classes. Tan et al. (2020) propose an equalization loss function that randomly ignores gradients from majority classes.
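To make the re-weighting idea concrete, below is a minimal PyTorch sketch of a binary focal loss in the spirit of Lin et al. (2017); the function name and the default value of gamma are illustrative choices of ours, not taken from the cited paper's implementation.

```python
import torch
import torch.nn.functional as F

def binary_focal_loss(logits, targets, gamma=2.0):
    """Down-weight well-classified samples: the (1 - p_t)^gamma factor shrinks the
    loss of easy examples so training focuses on hard (often minority-class) ones."""
    p = torch.sigmoid(logits)
    p_t = torch.where(targets == 1, p, 1 - p)  # probability assigned to the true class
    ce = F.binary_cross_entropy_with_logits(logits, targets.float(), reduction="none")
    return ((1 - p_t) ** gamma * ce).mean()
```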
LDAM (Wallach et al., 2020) introduces a label-distribution-aware loss function that encourages larger margins for minority classes. Despite their simplicity in implementation, re-weighting methods do not consider the semantic structure of datasets, so they may not handle the graph structure diversity problem, causing unstable predictions.

Mixture of Experts (MoE) is mainly based on the divide-and-conquer principle, in which the problem space is first divided and then addressed by specialized experts (Jacobs et al., 1991). MoE has been explored by several researchers and has witnessed success in a wide range of applications (Tresp, 2001; Collobert et al., 2002; Masoudnia & Ebrahimpour, 2014; Eigen et al., 2013). In recent years, there has been a surge of interest in incorporating MoE models to address challenging tasks in natural language processing and computer vision. Shazeer et al. (2017) introduce a sparsely-gated MoE layer to scale up model capacity, and MoE has also demonstrated effectiveness in fine-grained classification (Ge et al., 2015). Contrary to existing MoE methods, which focus on increasing model capacity, we find that MoE is surprisingly overlooked in graph machine learning and that it is especially appropriate for the imbalanced graph classification task. We also propose two variants considering posterior and prior distributions for the gating function. There are also concurrent MoE works (Ma et al., 2018; Qin et al., 2020; Tang et al., 2020a) that involve multi-task learning. However, they do not consider the important and ubiquitous class-imbalance problem.

The technical core of GraphDIVE is to leverage the semantic structure of the graph dataset based on the node embedding distributions. This encourages the GNN to group structurally different but property-similar graphs into the same subset. The minority classes are then more likely to be classified correctly by certain experts. Our method is similar to traditional MoE (Jacobs et al., 1991), but with a distinct motivation for imbalanced graph classification. In the following, we first present preliminaries and then describe the algorithmic details of GraphDIVE. Finally, we present two model variants for the multi-task setting.

Let $G_i = (A_i, X_i, E_i)$ denote a graph with adjacency matrix $A_i$, node attribute matrix $X_i$, and edge attribute matrix $E_i$, respectively, and let $Y_i = (y_1, \ldots, y_T)$ represent the labels of $G_i$ across $T$ tasks. The task of graph classification is to learn a mapping $f: G_i \rightarrow Y_i$. In this paper, we only consider the binary classification situation, which exists widely in practical applications (Yanardag & Vishwanathan, 2015; Hu et al., 2020). In the class-imbalanced problem, the number of instances of the majority class is far larger than that of the minority class.

The ubiquitous class-imbalance problem of graph datasets brings a huge challenge to existing GNNs because the classifier will inevitably produce a biased prediction towards majority classes (Sun et al., 2009; He & Garcia, 2009; Tang et al., 2020b). We conjecture that it might be too difficult for one classifier to discriminate all graphs. Inspired by Mixture of Experts (MoE) (Jacobs et al., 1991), we propose to assign different experts to different subsets. As illustrated in Figure 3a, GraphDIVE consists of the following components.

Feature extractor. Similar to the practice in Hu et al. (2020), we design a five-layer graph convolution network to extract graph features.
Formally, at the $k$-th layer, the representation of node $v$ is

$$m_v^{(k)} = \mathrm{AGGREGATE}^{(k)}\big(\{(h_u^{(k-1)}, e_{uv}) : u \in \mathcal{N}(v)\}\big), \qquad h_v^{(k)} = \mathrm{COMBINE}^{(k)}\big(h_v^{(k-1)}, m_v^{(k)}\big), \quad (1)$$

where $u \in \mathcal{N}(v)$ denotes the neighbors of node $v$, $h_v^{(k)}$ is the representation of $v$ at the $k$-th layer, $e_{uv}$ is the feature vector of the edge between $u$ and $v$, and $m_v^{(k)}$ is the message aggregated to node $v$.

On top of the graph feature extractor, we summarize the global representation of a graph $G$ by using a graph average pooling layer (i.e., readout function):

$$x = \frac{1}{|V|} \sum_{v \in V} h_v^{(K)}, \quad (2)$$

where $x \in \mathbb{R}^d$, and $d$ denotes the hidden dimension of the graph embedding. Notably, GraphDIVE is generic to the choice of the underlying GNN. Without loss of generality, in this paper we choose two commonly used methods, GCN (Kipf & Welling, 2017) and GIN (Xu et al., 2019), as feature extractors.

Mixture of diverse experts. Under the assumption that a single classifier has difficulty learning the desired mapping under a skewed distribution, we adopt a gating network to decompose the imbalanced graph dataset into several subsets. Then a diverse set of individual networks, referred to as experts, are trained to discriminate graphs in their corresponding subsets. This divide-and-conquer strategy makes the learning process easier for each expert, and thus alleviates the bias towards the majority class.

Formally, given a graph with global representation $x$ and label $y$, we introduce a latent variable $z \in \{1, 2, \ldots, M\}$, where $M$ represents the number of experts. In GraphDIVE, we decompose the likelihood $p(y \mid x; \Theta)$ as

$$p(y \mid x; \Theta) = \sum_{z=1}^{M} p(z \mid x; \Theta)\, p(y \mid z, x; \Theta), \quad (3)$$

where $\Theta$ denotes the learnable parameters of the gating network and the expert networks, and $\sum_{z=1}^{M} p(z \mid x; \Theta) = 1$. Here $p(z \mid x; \Theta)$ is the output of the gating network, indicating the prior probability of assigning $x$ to the $z$-th expert, and $p(y \mid z, x; \Theta)$ represents the output distribution of the $z$-th expert. For simplicity, we implement each expert with one linear projection layer followed by a sigmoid function:

$$p(y = 1 \mid z, x; \Theta) = \mathrm{sigmoid}(W_z x), \quad (4)$$

where $W_z$ is the projection weight of the $z$-th expert. More specifically, the gating network generates an input-dependent soft partition of the dataset based on the cosine similarity between graphs and gating prototypes:

$$p(z \mid x; \Theta) = \frac{\exp\big(\cos(x, W^g_z)/\tau\big)}{\sum_{z'=1}^{M} \exp\big(\cos(x, W^g_{z'})/\tau\big)}, \quad (5)$$

where $\tau$ is a temperature hyper-parameter tuning the distribution of $z$, and $W^g_z$ is the $z$-th gating prototype.

Since samples of the minority class and the majority class are usually different in semantics, they are likely to be grouped into different subsets. For the subset containing most of the minority class, the imbalance phenomenon is much alleviated. Besides, unlike existing imbalanced learning strategies, which suffer from fluctuations in graph structure, GraphDIVE can group structurally different but semantically similar graphs into the same subset. In other words, the semantic structure of the dataset is captured by the gating network. Hence, the proposed method can alleviate the above-mentioned structure diversity problem of graph datasets.

Prior or posterior distribution. Apart from using the prior probability in Eq. (3), we also consider a model variant using the posterior probability as expert weights. According to Bayes' theorem, the posterior probability can be calculated as

$$p(z \mid x, y; \Theta) = \frac{p(z \mid x; \Theta)\, p(y \mid z, x; \Theta)}{\sum_{z'=1}^{M} p(z' \mid x; \Theta)\, p(y \mid z', x; \Theta)}. \quad (6)$$

As opposed to the prior distribution, which considers only graph features, this Bayesian extension considers information from both graph labels and experts. For convenience of expression, we refer to these two model variants as GraphDIVE-pri and GraphDIVE-post. We first present the optimization regime of the Bayesian variant.
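As a concrete illustration of Eqs. (1)-(5), the following is a minimal PyTorch / PyTorch Geometric sketch of the single-task model. The class and variable names are ours, edge features and the OGB atom/bond encoders are omitted, and the backbone is reduced to plain GCN layers, so this should be read as an assumed simplification rather than the reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch_geometric.nn import GCNConv, global_mean_pool

class GraphDIVE(nn.Module):
    def __init__(self, in_dim, hid_dim=300, num_layers=5, num_experts=3, tau=1.0):
        super().__init__()
        dims = [in_dim] + [hid_dim] * num_layers
        self.convs = nn.ModuleList([GCNConv(dims[i], dims[i + 1]) for i in range(num_layers)])
        # One gating prototype per expert; the gate scores a graph by cosine similarity.
        self.prototypes = nn.Parameter(torch.randn(num_experts, hid_dim))
        # Each expert is a single linear projection followed by a sigmoid (binary task).
        self.experts = nn.ModuleList([nn.Linear(hid_dim, 1) for _ in range(num_experts)])
        self.tau = tau

    def forward(self, data):
        h = data.x
        for conv in self.convs:                       # Eq. (1): message-passing layers
            h = F.relu(conv(h, data.edge_index))
        x = global_mean_pool(h, data.batch)           # Eq. (2): graph-level readout
        sim = F.cosine_similarity(x.unsqueeze(1), self.prototypes.unsqueeze(0), dim=-1)
        prior = F.softmax(sim / self.tau, dim=-1)     # Eq. (5): p(z | x), soft partition
        expert_probs = torch.sigmoid(
            torch.cat([expert(x) for expert in self.experts], dim=-1))  # Eq. (4): p(y=1 | z, x)
        mix = (prior * expert_probs).sum(dim=-1)      # Eq. (3): p(y=1 | x)
        return mix, prior, expert_probs
```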
Since the training objective is to maximize the log-likelihood, the loss function can be formulated as follows:

$$\mathcal{L}(\Theta) = -\sum_{i} \log \sum_{z=1}^{M} p(z \mid x_i; \Theta)\, p(y_i \mid z, x_i; \Theta). \quad (7)$$

Noticing the interdependence between the posterior distribution and the prediction distribution from experts, we propose to use the EM algorithm (Dempster et al., 1977) to optimize Eq. (6) and Eq. (7) iteratively:

E-step: estimate the weight $p(z \mid x, y)$ for each expert according to Eq. (6) under the current value of the parameters $\Theta$.

M-step: update the parameters $\Theta$ using the stochastic gradient descent algorithm, where the gradient $\nabla \log p(y, z \mid x)$ is weighted by the estimated $p(z \mid x, y)$. The gradients are blocked in the computation of the posterior distribution.

Moreover, similar to the practice in Hasanzadeh et al. (2020), we introduce a Kullback-Leibler (KL) divergence regularization term $\mathrm{KL}(p(z \mid x, y) \,\|\, p(z \mid x; \Theta))$ into the final loss function:

$$\mathcal{L}_{\text{post}}(\Theta) = -\sum_{i} \mathbb{E}_{p(z \mid x_i, y_i)}\big[\log p(y_i, z \mid x_i; \Theta)\big] + \lambda\, \mathrm{KL}\big(p(z \mid x_i, y_i) \,\|\, p(z \mid x_i; \Theta)\big), \quad (8)$$

where $\lambda$ is a hyper-parameter which controls the extent of regularization. The KL term ensures that the posterior distribution does not deviate too far from the prior distribution, so we choose to compute gradients in the M-step based on Eq. (8).

For the optimization of GraphDIVE-pri, the posterior distribution $p(z \mid x_i, y_i; \Theta)$ in Eq. (8) is replaced with the prior distribution $p(z \mid x_i; \Theta)$, and the regularization term becomes zero. In this case, the gating network and the expert networks can be jointly optimized according to the following objective:

$$\mathcal{L}_{\text{pri}}(\Theta) = -\sum_{i} \sum_{z=1}^{M} p(z \mid x_i; \Theta)\, \log p(y_i, z \mid x_i; \Theta). \quad (9)$$

In experiments, we find that both variants fit graph data well and demonstrate strong generalization ability. A detailed comparison can be found in Section 5.3.

In real-life applications, researchers are likely to be confronted with imbalanced graph classification in the multi-task setting. The multi-task setting increases the difficulty of imbalanced learning, as one graph might belong to the majority class for some tasks while belonging to the minority class for other tasks. Existing imbalanced learning methods are originally designed for single-task classification, so it is non-trivial to adapt them to the multi-task setting. In this paper, we consider two options for the multi-task scenario, which are illustrated in Figure 3b and Figure 3c.

Option I: Shared gates. This is the simplest adaptation variant, which uses shared gating weights for different tasks (see Figure 3b). Compared with the model in Figure 3a, the only difference is that each expert generates predictions for different tasks. This variant has the advantage of reducing model parameters, especially in settings with many experts. However, empirically this variant does not work well in our setting. Our reasoning is that it implicitly assumes training instances obey the same label distribution across all tasks.

Option II: Individual gates. Another natural option is to assign different groups of gating weights to different tasks, as illustrated in Figure 3c. Compared to the shared-gates solution, this variant has several additional gating networks. Notably, it models task relationships in a more sophisticated way: for two less related tasks, sharing expert weights will be penalized, resulting in different expert weights instead.

In this section, a theoretical analysis is provided from the variational inference perspective to help understand why GraphDIVE works. Assume that an observed graph $x$ is related to a latent variable $z \in \{1, 2, \ldots, M\}$, and that $p(z \mid x)$ denotes the probability of $x$ lying in the $z$-th sub-region of the feature space.
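A compact training-step sketch of the two optimization schemes is given below. It assumes the simplified GraphDIVE module sketched earlier (returning the mixture prediction, the prior, and the per-expert probabilities); the variable names and the exact bookkeeping of the loss are our own simplifications of Eqs. (6)-(9), not the authors' code.

```python
import torch
import torch.nn.functional as F

def train_step_posterior(model, data, y, optimizer, lam=0.1, eps=1e-8):
    """One EM-style update for GraphDIVE-post; y holds binary labels in {0, 1}."""
    mix, prior, expert_probs = model(data)                                    # [B], [B, M], [B, M]
    lik = torch.where(y.unsqueeze(-1) == 1, expert_probs, 1 - expert_probs)   # p(y | z, x)
    # E-step (Eq. 6): posterior p(z | x, y) ∝ p(z | x) p(y | z, x); gradients are blocked.
    with torch.no_grad():
        post = prior * lik
        post = post / (post.sum(-1, keepdim=True) + eps)
    # M-step (Eq. 8): posterior-weighted joint log-likelihood plus KL(post || prior).
    nll = -(post * (torch.log(lik + eps) + torch.log(prior + eps))).sum(-1).mean()
    kl = (post * (torch.log(post + eps) - torch.log(prior + eps))).sum(-1).mean()
    loss = nll + lam * kl
    # GraphDIVE-pri would instead jointly train the gate and experts with, e.g.,
    # loss = F.binary_cross_entropy(mix, y.float())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```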
Let $q(z \mid x, y)$ denote a variational distribution used to infer the latent variable given the observed data, and let $p(y \mid x, z; \Theta)$ denote the prediction distribution of each expert. As for the KL regularization term, we consider the simple case of $\lambda = 1$. We then prove the following theorem.

Theorem 1. In GraphDIVE, optimizing the final loss is equivalent to optimizing the following lower bound of $\log p(y \mid x)$:

$$\log p(y \mid x; \Theta) \geq \mathbb{E}_{q(z \mid x, y)}\big[\log p(y \mid x, z; \Theta)\big] - \mathrm{KL}\big(q(z \mid x, y) \,\|\, p(z \mid x; \Theta)\big). \quad (10)$$

Proof.

$$\begin{aligned}
\mathrm{KL}\big(q(z \mid x, y) \,\|\, p(z \mid x, y; \Theta)\big)
&= \mathbb{E}_{q(z \mid x, y)}\Big[\log \frac{q(z \mid x, y)}{p(z \mid x, y; \Theta)}\Big] \\
&= \mathbb{E}_{q(z \mid x, y)}\Big[\log \frac{q(z \mid x, y)\, p(y \mid x; \Theta)}{p(z, y \mid x; \Theta)}\Big] \\
&= \log p(y \mid x; \Theta) + \mathbb{E}_{q(z \mid x, y)}\Big[\log \frac{q(z \mid x, y)}{p(y \mid x, z; \Theta)\, p(z \mid x; \Theta)}\Big] \\
&= \log p(y \mid x; \Theta) - \mathbb{E}_{q(z \mid x, y)}\big[\log p(y \mid x, z; \Theta)\big] + \mathrm{KL}\big(q(z \mid x, y) \,\|\, p(z \mid x; \Theta)\big).
\end{aligned}$$

Notice that $\mathrm{KL}(q(z \mid x, y) \,\|\, p(z \mid x, y; \Theta)) \geq 0$, which concludes the proof.

Eq. (10) provides a lower bound of $\log p(y \mid x)$: GraphDIVE optimizes the evidence lower bound (ELBO) from the perspective of variational inference. The closer $p(z \mid x, y; \Theta)$ is to $q(z \mid x, y)$, the tighter the lower bound is. The first term on the right-hand side encourages $q(z \mid x, y)$ to be high for experts that make good predictions. The second term is the Kullback-Leibler divergence between the variational distribution and the prior distribution output by the gating network. With this term, the gating network considers both graph labels and the experts' capacity when partitioning the graph dataset.

In this section, we provide extensive experimental results of GraphDIVE on imbalanced graph classification datasets under both single-task and multi-task settings. The experimental results demonstrate the superior performance of GraphDIVE over state-of-the-art models. Besides, we present a case study to demonstrate how the gating mechanism improves the classification performance on the minority class.

We conduct experiments on the recently released large-scale datasets of the Open Graph Benchmark (OGB) (Hu et al., 2020), which are more realistic and challenging than traditional graph datasets. More specifically, we choose six molecular graph datasets from OGB: BACE, BBBP, HIV, SIDER, CLINTOX, and TOX21. These datasets cover different complex chemical properties, such as inhibition of human β-secretase and blood-brain barrier penetration. All these datasets contain two classes for each task. We give brief statistics of these datasets in Table 1; a more detailed description can be found in Hu et al. (2020). For the multi-class classification task, please refer to the supplementary materials for details.

Each graph in these molecular datasets represents a molecule, where nodes are atoms and edges are chemical bonds. Each node contains a 9-dimensional attribute vector, including atomic number and chirality, as well as additional atom features such as formal charge and whether the atom is in a ring. Moreover, each edge contains a 3-dimensional attribute vector, including bond type, bond stereochemistry, and an additional bond feature indicating whether the bond is conjugated.

For a fair comparison, we implement our method and all baselines in the same experimental settings as Hu et al. (2020). For both single-task and multi-task datasets, we follow the original scaffold train-validation-test split with a ratio of 80/10/10. The scaffold splitting separates structurally different molecules into different subsets, which provides a more realistic estimate of model performance in experimental settings (Wu et al., 2018). We run each experiment ten times with random seeds ranging from 0 to 9, and report the mean and standard deviation of the test ROC-AUC for all methods.
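As an illustration of this evaluation protocol, a small helper along the following lines can be used; the function names are placeholders, and `train_and_eval` is assumed to train one model with the given seed and return the test labels and predicted scores.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def roc_auc_over_seeds(train_and_eval, seeds=range(10)):
    """Repeat the experiment once per seed and report mean/std of the test ROC-AUC."""
    aucs = []
    for seed in seeds:
        y_true, y_score = train_and_eval(seed)   # assumed to retrain and evaluate the model
        aucs.append(roc_auc_score(y_true, y_score))
    return float(np.mean(aucs)), float(np.std(aucs))
```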
For the hyper-parameter setting, we set the embedding dimension to 300 and the number of layers to 5, and employ the same GNN backbone network structure. We train the model using the Adam optimizer (Kingma & Ba, 2015) with an initial learning rate of 0.001. For the HIV and TOX21 datasets, we train the network for 120 epochs in light of the scale of these datasets; for all other datasets, we train the model for 100 epochs. According to the average performance on the validation dataset, we use grid search to find the optimal values of M (i.e., the number of experts) and λ, with the search space of M set to [2, 3, 4, 5, 6, 7, 8]. We implement all our models based on PyTorch Geometric (Fey & Lenssen, 2019) and run all our experiments on a single NVIDIA GeForce RTX 2080 Ti 12GB.

Setting and baselines. For the single-task setting, we choose the BACE, BBBP, and HIV datasets, which are natively single-task binary classification datasets. Besides, we randomly pick the third task of the SIDER dataset (SIDER-3). We consider two representative and competitive graph neural networks as feature extractors: GCN (Kipf & Welling, 2017) and GIN (Xu et al., 2019). To verify the effectiveness of our method, we also compare against the following strong and competitive methods: FLAG (Kong et al., 2020), GSN (Bouritsas et al., 2020), and WEGL (Kolouri et al., 2021). For all these methods, we use the official implementations and follow the original settings. In addition, we compare our method with state-of-the-art methods designed for imbalanced or long-tailed problems, including FocalLoss (Lin et al., 2017), LDAM (Wallach et al., 2020), GHM (Li et al., 2019), and Decoupling. For FocalLoss, we follow the original parameter setting. For LDAM and Decoupling, the official implementations are adopted, and we carefully tune the hyper-parameters since there is considerable difference between the graph classification datasets and the image classification datasets these methods were originally designed for. For GHM, we set the number of bins to 30 and the momentum η to 0.9.

Comparison with SOTA GNNs. We report the ROC-AUC scores of state-of-the-art GNN models in Table 2. Overall, the proposed GraphDIVE shows strong performance across all four datasets and consistently outperforms the other GNN models. In particular, GraphDIVE achieves up to a 12.82% absolute improvement over GCN on the SIDER-3 dataset. This strong performance verifies the effectiveness of the proposed mixture-of-experts framework. Besides, comparing GraphDIVE-post and GraphDIVE-pri, we observe that GraphDIVE-pri performs better. We suppose the reason is that the calculation of the posterior distribution introduces bias: as formulated in Eq. (6), the posterior distribution considers the capacity of each expert and the graph labels. However, the imbalanced distribution of graph labels makes each expert focus more on the majority class, hindering the performance of GraphDIVE. In contrast, GraphDIVE-pri relieves the confounding bias from labels and achieves relatively better results. In the following text, unless otherwise specified, we refer to GraphDIVE-pri as GraphDIVE for simplicity.

Comparison with SOTA imbalanced learning methods. We also compare GraphDIVE with other state-of-the-art class-imbalanced learning methods. The results are shown in Table 3.
First, we find that GraphDIVE consistently outperforms the baseline models (i.e., GCN and GIN) by considerable margins, which implies that the performance of existing GNNs on graph classification can be further improved by appropriately modeling the imbalanced distribution. It is also worth noting that state-of-the-art imbalanced learning methods, such as LDAM and Decoupling, do not seem to offer significant or stable improvements over the baseline models. For example, FocalLoss performs better than GCN on the BACE and HIV datasets, but it is inferior to the baseline on the BBBP and SIDER-3 datasets. We suppose the reason is that both re-sampling and re-weighting methods make the model focus more on the minority class, resulting in potential over-fitting to the minority class. When there are distinct differences between test graphs and training graphs, existing imbalanced learning methods may fail to generate accurate predictions. Compared with existing imbalanced learning methods, GraphDIVE usually achieves a higher ROC-AUC score and a lower standard deviation. In other words, GraphDIVE maintains remarkable and stable improvements across different datasets. We attribute this property to the gating and expert networks, which capture the semantic structure of the dataset and endow the model with superior generalization ability. We also present a case study in Section 5.5 to illustrate the effectiveness of GraphDIVE.

For the multi-task setting, we choose three datasets: TOX21, CLINTOX, and SIDER. Since there is no prior imbalanced learning research on multi-task graph classification and existing imbalanced learning strategies are difficult to adapt to the multi-task setting, we only compare GraphDIVE with the baseline models. The results in Table 4 show that GraphDIVE still improves performance over GCN and GIN. Notably, the improvements under the multi-task setting are not as remarkable as those under the single-task setting. This observation is expected because different tasks may require the shared graph representation to optimize in different directions, so multi-task imbalanced graph classification remains a challenging research direction. We also note that the individual-gate variant (i.e., DIVE-IG) generally performs better than the shared-gate variant (i.e., DIVE-SG). Considering that one graph might belong to the majority class for some tasks while belonging to the minority class for other tasks, this result indicates that different gates are necessary for the multi-task setting.

To evaluate the effectiveness of the diverse experts in GraphDIVE, we study whether and why diverse experts can alleviate the prediction bias towards the majority class. More specifically, we report the classification accuracy on the minority class in Figure 4. Besides, we visualize the different experts' predictions on the BACE dataset under settings with three and four experts, respectively. According to the number of samples assigned to each expert, we present a pie chart in Figure 5. We have the following observations. First, existing state-of-the-art re-weighting and re-sampling methods, such as LDAM and Decoupling, yield only marginal improvements on the minority class. This is as expected since they are prone to over-fitting to the minority class and cannot fully solve the structure diversity problem. Second, we observe that the proposed GraphDIVE outperforms the baseline and existing imbalanced learning methods by a remarkable margin.
These improvements suggest that GraphDIVE can successfully alleviate the prediction bias towards the majority class and boost performance on the minority class.

Moreover, we take a further step to analyze why GraphDIVE can alleviate this bias. As shown in Figure 5, Class 0 and Class 1 denote the majority class and the minority class, respectively. It can be observed that the classification of different classes relies on different experts. For example, in the setting with three experts, Expert 1 dominates the classification of the minority class while Expert 2 dominates the classification of the majority class. This phenomenon demonstrates that the proposed GraphDIVE successfully captures the semantic structure, i.e., each expert is responsible for a subset of graphs. For the expert that dominates the classification of the minority class, its corresponding subset contains a larger proportion of minority-class samples, so the training of this expert is less likely to be biased towards the majority class. We also investigate the impact of the number of experts; please refer to the supplementary materials for more detailed information.

To verify the effectiveness of the proposed gating mechanism, we conduct an ablation experiment on three single-task datasets. We select GCN as the feature extractor and assign four experts. We replace the gating network with a simple arithmetic mean over the predictions of the experts; the resulting model is called DIVE-mean. As Table 5 shows, compared with GraphDIVE, DIVE-mean (i.e., DIVE-no-gate) reduces the test ROC-AUC by a relative 61.33% on average, and may even degrade performance on the BBBP dataset. These results verify the effectiveness of the gating network. Notably, DIVE-mean has almost the same number of parameters as GraphDIVE. Comparing GCN and DIVE-mean, it can also be seen that simply adding network parameters by introducing more experts does not by itself bring sufficient improvement. To sum up, these results suggest that both the expert networks (refer to Section 5.5) and the gating network are needed to boost the performance of imbalanced graph classification.

In this paper, we have introduced GraphDIVE, a mixture-of-diverse-experts framework for imbalanced graph classification. GraphDIVE employs a gating network to learn a soft partition of the dataset, in which semantically different graphs are grouped into different subsets. Diverse experts are then trained based on their corresponding subsets. Considering whether to use the prior distribution or the posterior distribution, we design two model variants and investigate their effectiveness. Besides, we also extend the framework with two further variants specifically for the multi-task setting. The theoretical analysis shows that GraphDIVE optimizes the exact evidence lower bound with the above divide-and-conquer principle. We have conducted comprehensive experiments on various real-world datasets and practical settings. GraphDIVE consistently advances the performance over various baselines and imbalanced learning methods, and further studies exhibit the rationality and effectiveness of GraphDIVE.

We investigate the impact of the number of experts on GraphDIVE. To be more specific, we experiment with two to eight experts. The results on the single-task datasets are shown in Figure 6. From the figure, it can be found that GraphDIVE outperforms the baseline methods most of the time, meaning that the number of experts can be selected from a wide range.
Besides, the performance first improves as the number of experts increases, which demonstrates that more experts increase model capacity. As the number of experts grows, the experts that dominate the classification of the minority class are more likely to generate unbiased predictions. For example, when the number of experts is less than eight, the performance of GraphDIVE-GCN on the BBBP dataset increases monotonically with the number of experts. Nevertheless, too many experts inevitably introduce redundant parameters to the model, leading to over-fitting as well.

In Table 6, we report the wall-clock time of different methods on the test set. For GraphDIVE, we employ GCN as the feature extractor and assign three experts on all datasets. It can be seen that GraphDIVE introduces only marginal computational overhead compared with GCN.

To further verify that GraphDIVE is general enough for other imbalanced graph classification applications, we conduct experiments on text classification, which is an imbalanced multi-class graph classification task. Figure 7 shows the class distribution of the widely used text classification datasets; these datasets have a long-tailed label distribution. Yao et al. (2019) create a corpus-level graph which treats both documents and words as nodes. They calculate point-wise mutual information for word-word edge weights and use normalized TF-IDF scores for word-document edge weights. TextLevelGCN (Huang et al., 2019b) instead produces a text-level graph for each input text and transforms text classification into a graph classification task. TextLevelGCN supports inductive learning of new words and achieves state-of-the-art results. Hence, we use TextLevelGCN (Huang et al., 2019b) as the feature extractor and closely follow the experimental settings of Huang et al. (2019b). From Table 7, we can see that GraphDIVE achieves better results across the different datasets. This result indicates that GraphDIVE is not only applicable to molecular graphs, but is also beneficial for imbalanced multi-class graph classification in the text domain.
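For readers unfamiliar with the corpus-level graph construction mentioned above, the word-word PMI weighting can be sketched roughly as follows; this is a simplified illustration with our own function name and window size, not the code of Yao et al. (2019) or of this paper.

```python
import math
from collections import Counter
from itertools import combinations

def pmi_word_edges(docs, window=20):
    """Word-word edge weights via point-wise mutual information over sliding windows."""
    win_count, pair_count, total = Counter(), Counter(), 0
    for tokens in docs:                                   # docs: list of token lists
        for i in range(max(1, len(tokens) - window + 1)):
            win = set(tokens[i:i + window])
            total += 1
            win_count.update(win)                         # windows containing each word
            pair_count.update(frozenset(p) for p in combinations(sorted(win), 2))
    edges = {}
    for pair, n_ij in pair_count.items():
        w_i, w_j = tuple(pair)
        pmi = math.log(n_ij * total / (win_count[w_i] * win_count[w_j]))
        if pmi > 0:                                       # keep only positive-PMI edges
            edges[(w_i, w_j)] = pmi
    return edges
```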
References
Optimal transport graph neural networks
Improving graph neural network expressivity via subgraph isomorphism counting
Geometric deep learning: going beyond Euclidean data
SMOTE: synthetic minority over-sampling technique
Simple and deep graph convolutional networks
Remix: rebalanced mixup
A parallel mixture of SVMs for very large scale problems
Principal neighbourhood aggregation for graph nets
Class-balanced loss based on effective number of samples
Fast geometric deep learning with continuous B-spline kernels
Graph U-Nets
Subset feature learning for fine-grained category classification
Fine-grained classification via mixture of deep convolutional neural networks
A new model for learning in graph domains
Inductive representation learning on large graphs
Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning
Bayesian graph neural networks with adaptive connection sampling
Learning from imbalanced data
Adaptive synthetic sampling approach for imbalanced learning
Strategies for pre-training graph neural networks
Open Graph Benchmark: datasets for machine learning on graphs
Learning deep representation for imbalanced classification
Deep imbalanced learning for face recognition and attribute prediction
Text level graph neural network for text classification
Adaptive mixtures of local experts
Adaptive spatial-temporal graph convolutional networks for sleep stage classification
Decoupling representation and classifier for long-tailed recognition
Imbalanced classification via major-to-minor translation
A method for stochastic optimization
Semi-supervised classification with graph convolutional networks
Wasserstein embedding for graph learning
FLAG: adversarial data augmentation for graph neural networks
Self-attention graph pooling
Gradient harmonized single-stage detector
Focal loss for dense object detection
Deep representation learning on long-tailed data: a learnable embedding augmentation perspective
Modeling task relationships in multi-task learning with multi-gate mixture-of-experts
Mixture of experts: a literature survey
Rethinking pooling in graph neural networks
Weisfeiler and Leman go neural: higher-order graph neural networks
Joint structure feature exploration and regularization for multi-task graph classification
Task sensitive feature exploration and learning for multi-task graph classification
Multitask mixture of sequential experts for user activity streams
Outrageously large neural networks: the sparsely-gated mixture-of-experts layer. In International Conference on Learning Representations
Mixture models for diverse machine translation: tricks of the trade
Multi-class imbalanced graph convolutional network learning
Classification of imbalanced data: a review
Equalization loss for long-tailed object recognition
Progressive layered extraction (PLE): a novel multi-task learning (MTL) model for personalized recommendations
Long-tailed classification by keeping the good and removing the bad momentum causal effect
Mixtures of Gaussian processes
Graph Attention Networks
Learning imbalanced datasets with label-distribution-aware margin loss
Haar graph pooling
MoleculeNet: a benchmark for molecular machine learning
How powerful are graph neural networks?
Deep graph kernels
Graph convolutional networks for text classification
Hierarchical graph representation learning with differentiable pooling
Inductive matrix completion based on graph neural networks