key: cord-0569257-qqgbwnfe
authors: Qi, Qi; Luo, Youzhi; Xu, Zhao; Ji, Shuiwang; Yang, Tianbao
title: Stochastic Optimization of Areas Under Precision-Recall Curves with Provable Convergence
date: 2021-04-18
journal: nan
DOI: nan
sha: a765bdc754b37c23e55bdc10a784fb788ae9bb9f
doc_id: 569257
cord_uid: qqgbwnfe

Areas under ROC (AUROC) and precision-recall curves (AUPRC) are common metrics for evaluating classification performance for imbalanced problems. Compared with AUROC, AUPRC is a more appropriate metric for highly imbalanced datasets. While stochastic optimization of AUROC has been studied extensively, principled stochastic optimization of AUPRC has been rarely explored. In this work, we propose a principled technical method to optimize AUPRC for deep learning. Our approach is based on maximizing the averaged precision (AP), which is an unbiased point estimator of AUPRC. We cast the objective into a sum of dependent compositional functions with inner functions dependent on random variables of the outer level. We propose efficient adaptive and non-adaptive stochastic algorithms named SOAP with provable convergence guarantee under mild conditions by leveraging recent advances in stochastic compositional optimization. Extensive experimental results on image and graph datasets demonstrate that our proposed method outperforms prior methods on imbalanced problems in terms of AUPRC. To the best of our knowledge, our work represents the first attempt to optimize AUPRC with provable convergence. SOAP has been implemented in the libAUC library at https://libauc.org/.

Although deep learning (DL) has achieved tremendous success in various domains, the standard DL methods have reached a plateau as the traditional objective functions in DL are no longer sufficient to model all requirements in new applications, which slows down the democratization of AI.
For instance, in healthcare applications, data is often highly imbalanced, e.g., patients suffering from rare diseases are far fewer than those suffering from common diseases. In these applications, accuracy (the proportion of correctly predicted examples) is deemed an inappropriate metric for evaluating the performance of a classifier. Instead, area under the curve (AUC), including area under the ROC curve (AUROC) and area under the Precision-Recall curve (AUPRC), is widely used for assessing the performance of a model. However, optimizing accuracy on training data does not necessarily lead to a satisfactory solution to maximizing AUC [12]. To break the bottleneck for further advancement, DL must be empowered with the capability of efficiently handling novel objectives such as AUC. Recent studies have demonstrated great success along this direction by maximizing AUROC [60]. For example, Yuan et al. [60] proposed a robust deep AUROC maximization method with provable convergence and achieved great success for classification of medical image data. However, to the best of our knowledge, novel DL by maximizing AUPRC has not yet been studied thoroughly. Previous studies [14, 20] have found that when dealing with highly skewed datasets, Precision-Recall (PR) curves could give a more informative picture of an algorithm's performance, which calls for the development of efficient stochastic optimization algorithms for DL by maximizing AUPRC.

Compared with maximizing AUROC, maximizing AUPRC is more challenging. The challenges for optimization of AUPRC are two-fold. First, the analytical form of AUPRC by definition involves a complicated integral that is not readily estimated from model predictions of training examples. In practice, AUPRC is usually computed based on some point estimators, e.g., trapezoidal estimators and interpolation estimators of empirical curves, the non-parametric average precision estimator, and the parametric binomial estimator [3].
Among these estimators, non-parametric average precision (AP) is an unbiased estimate in the limit and can be directly computed based on the prediction scores of samples, which lends itself well to the task of optimizing model parameters. Second, a surrogate function for AP is highly complicated and non-convex. In particular, an unbiased stochastic gradient is not readily computed, so existing stochastic algorithms such as SGD provide no convergence guarantee. Most existing works for maximizing AP-like functions focus on how to compute an (approximate) gradient of the objective function [4, 6, 8, 11, 24, 38, 40, 43, 47, 48], which leaves stochastic optimization of AP with provable convergence as an open question.

Can we design direct stochastic optimization algorithms both in SGD-style and Adam-style for maximizing AP with provable convergence guarantee?

In this paper, we propose a systematic and principled solution for addressing this question towards maximizing AUPRC for DL. By using a surrogate loss in lieu of the indicator function in the definition of AP, we cast the objective into a sum of non-convex compositional functions, which resembles a two-level stochastic compositional optimization problem studied in the literature [52, 53]. However, different from existing two-level stochastic compositional functions, the inner functions in our problem are dependent on the random variable of the outer level, which requires us to develop a tailored stochastic update for computing an error-controlled stochastic gradient estimator. Specifically, a key feature of the proposed method is to maintain and update two scalar quantities associated with each positive example for estimating the stochastic gradient of the individual precision score at the threshold specified by its prediction score.
By leveraging recent advances in stochastic compositional optimization, we propose both adaptive (Adam-style) and non-adaptive (SGD-style) algorithms, and establish their convergence under mild conditions. We conduct comprehensive empirical studies on class-imbalanced graph and image datasets for learning graph neural networks and deep convolutional neural networks, respectively. We demonstrate that the proposed method can consistently outperform prior approaches in terms of AUPRC. In addition, we show that our method achieves better results when the sample distribution is highly imbalanced between classes and is insensitive to mini-batch size.

2 Related Work

AUROC Optimization. AUROC optimization has attracted significant attention in the literature. Recent success of DL by optimizing AUROC on large-scale medical image data has demonstrated the importance of large-scale stochastic optimization algorithms and the necessity of an accurate surrogate function [60]. Earlier papers [25, 28] focus on learning a linear model based on the pairwise surrogate loss and could suffer from a high computational cost, which could be as high as quadratic in the size of the training data. To address the computational challenge, online and stochastic optimization algorithms have been proposed [18, 35, 42, 58, 63]. Recently, [21, 22, 36, 57] proposed stochastic deep AUC maximization algorithms by formulating the problem as a non-convex strongly-concave min-max optimization problem, and derived fast convergence rates under the PL condition, including in the federated learning setting [21]. More recently, Yuan et al. [60] demonstrated the success of their methods on medical image classification tasks, e.g., X-ray image classification and melanoma classification based on skin images. However, an algorithm that maximizes AUROC might not necessarily maximize AUPRC, which calls for the development of efficient algorithms for DL by maximizing AUPRC.

AUPRC Optimization.
AUPRC optimization is much more challenging than AUROC optimization since the objective is not even decomposable over pairs of examples. Although AUPRC optimization has been considered in the literature (cf. [15, 47, 41] and references therein), efficient scalable algorithms for DL with provable convergence guarantee are still lacking. Some earlier works tackled this problem by using traditional optimization techniques, e.g., hill climbing search [37], the cutting-plane method [61], dynamic programming [50], and by developing acceleration techniques in the framework of SVM [39]. These approaches are not scalable to big data for DL. There is a long list of studies in information retrieval [5, 11, 38, 47] and computer vision [4, 6, 8, 9, 24, 40, 48, 43], which have made efforts towards maximizing the AP score. However, most of them focus on how to compute an approximate gradient of the AP function or its smooth approximation, and provide no convergence guarantee for stochastic optimization based on mini-batch averaging. Due to the lack of principled design, these previous methods, when applied to deep learning, are sensitive to the mini-batch size [6, 47, 48] and usually require a large mini-batch size in order to achieve good performance. In contrast, our stochastic algorithms are designed in a principled way to guarantee convergence without requiring a large mini-batch size, as confirmed by our studies. Recently, [15] formulated the objective function as a constrained optimization problem using a surrogate function, and then cast it into a min-max saddle-point problem, which facilitates the use of stochastic min-max algorithms. However, they do not provide any convergence analysis for AUPRC maximization. In contrast, this is the first work that directly optimizes a surrogate function of AP (an unbiased estimator of AUPRC in the limit) and provides a theoretical convergence guarantee for the proposed stochastic algorithms.

Stochastic Compositional Optimization.
Optimization of a two-level compositional function in the form of $E_\xi[f(E_\zeta[g(w; \zeta)]; \xi)]$, where $\xi$ and $\zeta$ are independent random variables, or its finite-sum variant, has been studied extensively in the literature [1, 10, 52, 27, 30, 31, 33, 34, 46, 53, 59, 62, 45]. In this paper, we formulate the surrogate function of AP into a similar but more complicated two-level compositional function of the form $E_\xi[f(E_\zeta g(w; \zeta, \xi))]$, where $\xi$ and $\zeta$ are independent and $\xi$ has a finite support. The key difference between our formulated compositional function and the ones considered in previous work is that the inner function $g(w; \zeta, \xi)$ also depends on the random variable $\xi$ of the outer level. Such a subtle difference complicates both the algorithm design and the convergence analysis. Nevertheless, the proposed algorithm and its convergence analysis are built on previous studies of stochastic two-level compositional optimization.

Notations. We consider the binary classification problem. Denote by $(x, y)$ a data pair, where $x \in \mathbb{R}^d$ denotes the input data and $y \in \{1, -1\}$ denotes its class label. Let $h(x) = h_w(x)$ denote the predictive function parameterized by a parameter vector $w \in \mathbb{R}^D$ (e.g., a deep neural network). Denote by $I(\cdot)$ an indicator function that outputs 1 if the argument is true and zero otherwise. To facilitate the presentation, denote by $X$ a random data, by $Y$ its label and by $F = h(X)$ its prediction score. Let $D = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ denote the set of all training examples and $D_+ = \{x_i : y_i = 1\}$ denote the set of all positive examples. Let $n_+ = |D_+|$ denote the number of positive examples. $x_i \sim D$ means that $x_i$ is randomly sampled from $D$.

Following the work of Bamber [2], AUPRC is an average of the precision weighted by the probability of a given threshold, which can be expressed as

$$\mathrm{AUPRC} = \int_{-\infty}^{\infty} \Pr(Y = 1 \mid F \ge c)\, d\Pr(F \le c \mid Y = 1),$$

where $\Pr(Y = 1 \mid F \ge c)$ is the precision at the threshold value of $c$.
The above integral is an importance-sampled Monte Carlo integral, by which we may interpret AUPRC as the fraction of positive examples among those examples whose output values exceed a randomly selected threshold $c \sim F(X) \mid Y = 1$.

For a finite set of examples $D = \{(x_i, y_i), i = 1, \ldots, n\}$ with the prediction score for each example $x_i$ given by $h_w(x_i)$, we use AP to approximate AUPRC, which is given by

$$\mathrm{AP} = \frac{1}{n_+} \sum_{i=1}^{n} I(y_i = 1)\, \frac{\sum_{s=1}^{n} I(y_s = 1)\, I(h_w(x_s) \ge h_w(x_i))}{\sum_{s=1}^{n} I(h_w(x_s) \ge h_w(x_i))}, \tag{1}$$

where $n_+$ denotes the number of positive examples. It can be shown that AP is an unbiased estimator in the limit $n \to \infty$ [3].

However, the non-continuous indicator function $I(h_w(x_s) \ge h_w(x_i))$ in both the numerator and the denominator in (1) makes the optimization intractable. To tackle this, we use a loss function $\ell(w; x_s, x_i)$ as a surrogate function of $I(h_w(x_s) \ge h_w(x_i))$. One can consider different surrogate losses, e.g., hinge loss, squared hinge loss, smoothed hinge loss, and exponential loss. In this paper, we will consider a smooth surrogate loss function to facilitate the development of an optimization algorithm, e.g., a squared hinge loss $\ell(w; x_s, x_i) = (\max\{m - (h_w(x_i) - h_w(x_s)), 0\})^2$, where $m$ is a margin parameter. Note that we do not require $\ell$ to be a convex function, hence one can also consider a non-convex surrogate loss such as ramp loss. As a result, our problem becomes

$$\min_w P(w) = \frac{1}{n_+} \sum_{x_i \in D_+} \frac{-\sum_{s=1}^{n} I(y_s = 1)\, \ell(w; x_s, x_i)}{\sum_{s=1}^{n} \ell(w; x_s, x_i)}.$$

We cast the problem into a finite sum of compositional functions. To this end, let us define a few notations:

$$g(w; x_j, x_i) = [\ell(w; x_j, x_i)\, I(y_j = 1),\ \ell(w; x_j, x_i)]^\top, \qquad g_{x_i}(w) = E_{x_j \sim D}[g(w; x_j, x_i)], \qquad f(s) = -\frac{s_1}{s_2},$$

where $g_{x_i}(w): \mathbb{R}^D \to \mathbb{R}^2$ and $f: \mathbb{R}^2 \to \mathbb{R}$. Then, we can write the objective function for maximizing AP as a sum of compositional functions:

$$\min_w P(w) = \frac{1}{n_+} \sum_{x_i \in D_+} f(g_{x_i}(w)) = E_{x_i \sim D_+}[f(g_{x_i}(w))]. \tag{4}$$

We refer to the above problem as an instance of two-level stochastic dependent compositional functions. It is similar to the two-level stochastic compositional functions considered in the literature [52, 53] but with a subtle difference. The difference is that in our formulation the inner function $g_{x_i}(w) = E_{x_j \sim D}[g(w; x_j, x_i)]$ depends on the random variable $x_i$ of the outer level.
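To make the AP estimator and its surrogate concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes AP is the mean, over positive examples, of the precision at each positive's score, and that the surrogate objective is the mean over positives of $-[g_{x_i}]_1 / [g_{x_i}]_2$ with a squared hinge loss. The function names `average_precision`, `squared_hinge`, and `surrogate_objective` are illustrative only.

```python
import numpy as np

def average_precision(scores, labels):
    """AP: mean over positives of the precision at each positive's score."""
    scores = np.asarray(scores, dtype=float)
    pos = np.asarray(labels) == 1
    # ge[i, s] is True when h(x_s) >= h(x_i).
    ge = scores[None, :] >= scores[:, None]
    num = (ge & pos[None, :]).sum(axis=1)  # positives scored at or above x_i
    den = ge.sum(axis=1)                   # all examples scored at or above x_i
    return float(np.mean(num[pos] / den[pos]))

def squared_hinge(scores, margin=1.0):
    """loss[i, s] = max(m - (h(x_i) - h(x_s)), 0)^2, a surrogate for I(h(x_s) >= h(x_i))."""
    diff = scores[:, None] - scores[None, :]
    return np.maximum(margin - diff, 0.0) ** 2

def surrogate_objective(scores, labels, margin=1.0):
    """Assumed form: P = (1/n_+) * sum over positives of f(g_i), f(s) = -s_1 / s_2."""
    scores = np.asarray(scores, dtype=float)
    pos = np.asarray(labels) == 1
    loss = squared_hinge(scores, margin)
    g1 = (loss * pos[None, :]).mean(axis=1)  # mean of loss * I(y_s = 1)
    g2 = loss.mean(axis=1)                   # mean of loss (>= m^2 / n > 0 via the s = i term)
    return float(np.mean(-g1[pos] / g2[pos]))
```

Because the self-pair contributes $m^2 > 0$ to both sums, the ratio is well defined for any positive example, and the surrogate value lies in $[-1, 0)$; values closer to $-1$ correspond to rankings that place positives above negatives by at least the margin.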
The dependence of the inner function on $x_i$ makes the proposed algorithm slightly more complicated, since $g_{x_i}(w)$ must be estimated separately for each positive example. It also complicates the analysis of the proposed algorithms. Nevertheless, we can still employ the techniques developed for optimizing stochastic compositional functions to design the algorithms and develop the analysis for optimizing the objective (4).

In order to motivate the proposed method, let us consider how to compute the gradient of $P(w)$. Let the gradient of $g_{x_i}(w)$ be denoted by $\nabla_w g_{x_i}(w) = (\nabla_w [g_{x_i}(w)]_1, \nabla_w [g_{x_i}(w)]_2)$. Then, by the chain rule applied to $f(s) = -s_1/s_2$, we have

$$\nabla_w P(w) = \frac{1}{n_+} \sum_{x_i \in D_+} \left( \frac{-\nabla_w [g_{x_i}(w)]_1}{[g_{x_i}(w)]_2} + \frac{[g_{x_i}(w)]_1\, \nabla_w [g_{x_i}(w)]_2}{([g_{x_i}(w)]_2)^2} \right).$$

The major cost for computing $\nabla_w P(w)$ lies in evaluating $g_{x_i}(w)$ and its gradient $\nabla_w g_{x_i}(w)$, which involves passing through all examples in $D$. To this end, we will approximate these quantities by stochastic samples. The gradient $\nabla_w g_{x_i}(w)$ can be simply approximated by the stochastic gradient, i.e.,

$$\widehat{\nabla}_w g_{x_i}(w) = \frac{1}{B} \sum_{x_j \in \mathcal{B}} \nabla_w g(w; x_j, x_i),$$

where $\mathcal{B}$ denotes a set of $B$ random samples from $D$. For estimating $g_{x_i}(w) = E_{x_j \sim D}\, g(w; x_j, x_i)$, however, we need to ensure that its approximation error is controllable, due to the compositional structure, such that the convergence can be guaranteed. We borrow a technique from the literature of stochastic compositional optimization [52] by using a moving average estimator for estimating $g_{x_i}(w)$ for all positive examples. To this end, we will maintain a matrix $u = [u^1, u^2]$ with each column indexable by any positive example, i.e., $u^1_{x_i}$, $u^2_{x_i}$ correspond to the moving average estimators of $[g_{x_i}(w)]_1$ and $[g_{x_i}(w)]_2$, respectively. The matrix $u$ is updated by the subroutine UG in Algorithm 2, where $\gamma \in (0, 1)$ is a parameter.

Algorithm 1 SOAP (steps 8-10)
8: Update $w_{t+1}$ by an SGD-style method or by an Adam-style method: $w_{t+1} = \mathrm{UW}(w_t, G(w_t))$
9: end for
10: Return: last solution.

It is notable that in Step 3 of Algorithm 2, we clip the moving average update of $u^2_{x_i}$ by a lower bound $u_0$, which is a given parameter.
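The moving-average estimator and the resulting gradient estimate can be sketched as follows. This is an illustrative NumPy sketch under stated assumptions, not the authors' code: the scorer is assumed linear, $h_w(x) = x^\top w$, so the loss gradient is analytic; the update is assumed to take the form $u \leftarrow (1-\gamma)u + \gamma \cdot (\text{batch mean})$ with the second coordinate clipped below at $u_0$; and the gradient estimator is assumed to be the chain-rule expression with $u$ plugged in for $g_{x_i}(w)$. The function name `soap_sgd_step` and its default parameter values are hypothetical.

```python
import numpy as np

def soap_sgd_step(w, X, y, u, pos_idx, batch_idx,
                  gamma=0.9, u0=1e-3, margin=1.0, lr=0.1):
    """One SGD-style step for a linear scorer h_w(x) = x @ w (illustrative).

    u has shape (n, 2); only rows of positive examples are used.
    pos_idx indexes the sampled positives, batch_idx the sampled mini-batch.
    """
    scores_b = X[batch_idx] @ w
    is_pos_b = (y[batch_idx] == 1).astype(float)
    grad = np.zeros_like(w)
    for i in pos_idx:
        # Squared hinge loss l(w; x_j, x_i) for every x_j in the mini-batch.
        slack = np.maximum(margin - (X[i] @ w - scores_b), 0.0)
        loss = slack ** 2
        # Moving-average estimates of the two coordinates of g_{x_i}(w).
        u[i, 0] = (1 - gamma) * u[i, 0] + gamma * np.mean(loss * is_pos_b)
        u[i, 1] = max((1 - gamma) * u[i, 1] + gamma * np.mean(loss), u0)  # clip at u0
        # Gradient of the loss w.r.t. w for the linear scorer: 2 * slack * (x_j - x_i).
        dloss = 2.0 * slack[:, None] * (X[batch_idx] - X[i])
        # Chain rule through f(s) = -s_1 / s_2 evaluated at u instead of g.
        weights = (u[i, 0] - u[i, 1] * is_pos_b) / (u[i, 1] ** 2)
        grad += (weights[:, None] * dloss).mean(axis=0)
    grad /= len(pos_idx)
    return w - lr * grad, u
```

The array `u` is updated in place and also returned. Keeping one row of `u` per positive example is what lets the estimate of $g_{x_i}(w)$ carry information across iterations instead of relying on a single mini-batch.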
The clipping step ensures that the division in the stochastic gradient estimator (7) is always valid; it is also important for the convergence analysis. With these stochastic estimators, we can compute an estimate of $\nabla P(w)$ by equation (7), where $\mathcal{B}_+$ includes a batch of sampled positive data. With this stochastic gradient estimator, we can employ the SGD-style and Adam-style methods shown in Algorithm 3 to update the model parameter $w$. The final algorithm, named SOAP, is presented in Algorithm 1.

Algorithm 3 UW
1: Option 1: SGD-style update (paras: $\alpha$): $w_{t+1} = w_t - \alpha G(w_t)$
2: Option 2: Adam-style update

In this subsection, we present the convergence results of SOAP and also highlight its convergence analysis. To this end, we first present the following assumption.

Assumption 1. $\ell(w; x_j, x_i)$ is Lipschitz continuous and smooth with respect to $w$ for any $x_j \in D$ and $x_i \in D_+$.

With a bounded score function $h_w(x)$ the above assumption can be easily satisfied. Based on the above assumption, we can prove that the objective function $P(w)$ is smooth.

Lemma 1. Suppose Assumption 1 holds, then there exists $L > 0$ such that $P(\cdot)$ is $L$-smooth. In addition, there exists $u_0 \ge C/n$ such that $g_{x_i}(w) \in \Omega$.

Next, we highlight the convergence analysis of SOAP employing the SGD-style update and include that for the Adam-style update in the supplement. Without loss of generality, we assume $|\mathcal{B}_+| = 1$ and the positive sample in $\mathcal{B}_+$ is randomly selected from $D_+$ with replacement. When the context is clear, we abuse the notations $g_i(w)$ and $u_i$ to denote $g_{x_i}(w)$ and $u_{x_i}$ below, respectively. We first establish the following lemma following the analysis of non-convex optimization.

Lemma 2. With $\alpha \le 1/2$, running $T$ iterations of SOAP (SGD-style) updates, we have

where $i_t$ denotes the index of the sampled positive data at iteration $t$, and $C_1$ and $C_2$ are proper constants.

Our key contribution is the following lemma that bounds the second term in the above upper bound.

Lemma 3.
Suppose Assumption 1 holds, with $u$ initialized by (6) for every

where $C_3$ is a proper constant.

Remark: The innovation in proving the above lemma is to group $u_{i_t}$, $t = 1, \ldots, T$, into $n_+$ groups corresponding to the $n_+$ positive examples, then establish the recursion of the error $\|g_{i_t}(w_t) - u_{i_t}\|^2$ within each group, and finally sum up these recursions.

Based on the two lemmas above, we establish the following convergence of SOAP with an SGD-style update.

Theorem 1. Suppose Assumption 1 holds, let the parameters be $\alpha = \frac{1}{n_+^{2/5} T^{3/5}}$, $\gamma = \frac{n_+^{2/5}}{T^{2/5}}$, $\forall\, t \in 1, \cdots, T$, and $T > n_+$. Then after running $T$ iterations, SOAP with an SGD-style update satisfies

$$E\left[\frac{1}{T} \sum_{t=1}^{T} \|\nabla P(w_t)\|^2\right] \le O\left(\frac{n_+^{2/5}}{T^{2/5}}\right).$$

Remark: To the best of our knowledge, this is the first time a stochastic algorithm has been proved to converge for AP maximization.

Similarly, we can establish the following convergence of SOAP by employing an Adam-style update, specifically the AMSGrad update.

Theorem 2. Suppose Assumption 1 holds, let the parameters be $\eta_1 \le \sqrt{\eta_2} \le 1$, $\alpha = \frac{1}{n_+^{2/5} T^{3/5}}$, $\gamma = \frac{n_+^{2/5}}{T^{2/5}}$, $\forall\, t \in 1, \cdots, T$, and $T > n_+$. Then after running $T$ iterations, SOAP with an AMSGrad update satisfies

$$E\left[\frac{1}{T} \sum_{t=1}^{T} \|\nabla P(w_t)\|^2\right] \le O\left(\frac{n_+^{2/5}}{T^{2/5}}\right).$$

In this section, we evaluate the proposed method through comprehensive experiments on imbalanced datasets. We show that the proposed method can outperform prior state-of-the-art methods for imbalanced classification problems. In addition, we conduct experiments on (i) the effects of imbalance ratio, (ii) the insensitivity to batch size, and (iii) the convergence speed on testing data, and observe that our method (i) is more advantageous when data is more imbalanced, (ii) is not sensitive to batch size, and (iii) converges faster than baseline methods.

Our proposed optimization algorithm is independent of specific datasets and tasks. Therefore, we perform experiments on both graph and image prediction tasks.
In particular, the graph prediction tasks in the contexts of molecular property prediction and drug discovery suffer from very severe imbalance problems, as positive labels are very rare while negative samples are abundantly available. Thus, we choose to use graph data intensively in our experiments. Additionally, the graph data we use allow us to vary the imbalance ratio to observe the performance change of different methods.

In all experiments, we compare our method with the following baseline methods. CB-CE refers to a method using a class-balanced weighted cross entropy loss function, in which the weights for positive and negative samples are adjusted with the strategy proposed by Cui et al. [13]. Focal up-weights the penalty on hard examples using focal loss [32]. LDAM refers to training with the label-distribution-aware margin loss [7]. AUC-M is an AUROC maximization method using a surrogate loss [60]. In addition, we compare with three methods for optimizing AUPRC or AP, namely, the

Data. We first conduct experiments on three image datasets: CIFAR10, CIFAR100 and the Melanoma dataset [49]. We construct imbalanced versions of CIFAR10 and CIFAR100 for binary classification. In particular, for each dataset we manually take the last half of the classes as the positive class and the first half of the classes as the negative class. To construct highly imbalanced data, we remove 98% of the positive images from the training data and keep the test data unchanged (i.e., the testing data is still balanced). We then split the training dataset into train/validation sets at an 80%/20% ratio. The Melanoma dataset is from a medical image Kaggle competition, which serves as a natural real imbalanced image dataset. It contains 33,126 labeled medical images, among which 584 images are related to malignant melanoma and labelled as positive samples.
Since the test set used by the Kaggle organizers is not available, we manually split the training data into train/validation/test sets at an 80%/10%/10% ratio and report the AUPRC achieved on the test set by our method and the baselines. The images of the Melanoma dataset are always resized to a resolution of 384 × 384 in our experiments.

Setup. We use two ResNet [23] models, i.e., ResNet18 and ResNet34, as the backbone networks for image classification. For all methods except for CE, the ResNet models are initialized with a model pre-trained by CE with an SGD optimizer with momentum parameter 0.9. We tune the learning rate in the range {1e-5, 1e-4, 1e-3, 1e-2} and the weight decay parameter in the range {1e-6, 1e-5, 1e-4}. Then the last fully connected layer is randomly re-initialized and the network is trained by different methods with the same weight decay parameter but other hyper-parameters individually tuned for fair comparison, e.g., we tune $\gamma$ of SOAP in the range {0.9, 0.99, 0.999}, and tune $m$ in {0.5, 1, 2, 5, 10}. We refer to this scheme as two-stage training, which is widely used for imbalanced data [60]. We consistently observe that this strategy can bring the model to a good initialization state and improve the final performance of our method and the baselines.

Results. Table 1 shows the AUPRC on the testing sets of CIFAR-10 and CIFAR-100. We report the results on Melanoma in Table 3. We can observe that the proposed method SOAP outperforms all baselines. It is also striking to see that on the Melanoma dataset, SOAP outperforms all baselines by a large margin, and all other methods have very poor performance. The reason is that the testing set of Melanoma is also imbalanced (imbalance ratio = 1.72%), while the testing sets of CIFAR-10 and CIFAR-100 are balanced. We also observe that AUROC maximization (AUC-M) does not necessarily optimize AUPRC. We also plot the final PR curves in Figure 3 in the supplement. [54].
We use the same two-stage training scheme with a similar hyper-parameter tuning. We pre-train the networks by Adam for 100 epochs with a tuned initial learning rate of 0.0005, which is decayed by half after 50 epochs.

Results. The AUPRC achieved on the test set by all methods is presented in Table 2. The results show that our method can outperform all baselines by a large margin in terms of AUPRC, regardless of which model structure is used. These results clearly demonstrate that our method is effective for classification problems in which the sample distribution is highly imbalanced between classes.

Data. In addition to molecular property prediction, we explore applying our method to drug discovery. Recent studies have shown that GNNs are effective in drug discovery through predicting the antibacterial property of chemical compounds [51]. Such application scenarios involve training a GNN model on labeled datasets and making predictions on a large library of chemical compounds so as to discover new antibiotics. However, because the positive samples in the training data, i.e., compounds known to have antibacterial properties, are very rare, there exists a very severe class imbalance. We show that our method can serve as a useful solution to the above problem. We conduct experiments on the MIT AICURES dataset from an open challenge (https://www.aicures.mit.edu/tasks) in drug discovery. The dataset consists of 2097 molecules. There are 48 positive samples that have antibacterial activity against Pseudomonas aeruginosa, which is the pathogen leading to secondary lung infections in COVID-19 patients. We conduct experiments on three random train/validation/test splits at an 80%/10%/10% ratio, and report the average AUPRC on the test set over the three splits.

Setup. Following the setup in Sec. 4.2, we use three GNNs: MPNN, GINE and ML-MPNN. We use the same two-stage training scheme with a similar hyper-parameter tuning.
We pre-train the GNNs by the Adam method for 100 epochs with a batch size of 64 and a tuned learning rate of 0.0005, which is decayed by half at the 50th epoch. Due to the limit of space, Table 3 only reports GINE and MPNN results. Please refer to Table 6 in the supplement for the full results of all three GNNs.

Results. The average test AUPRC from three independent runs over three splits is summarized in Tables 3 and 6. We can see that SOAP consistently outperforms all baselines on all three GNN models. Our proposed optimization method can significantly improve the AUPRC achieved by GNN models, indicating that the models tend to assign higher confidence scores to molecules with antibacterial activity. This can help identify a larger number of candidate drugs. We have employed the proposed AUPRC maximization method for improving the testing performance on the MIT AICures Challenge and achieved 1st place. For details, please refer to [54].

Effects of Imbalance Ratio. We now study the effects of the imbalance ratio on the performance improvements of our method. We use two datasets, Tox21 and ToxCast, from MoleculeNet [55]. The results on Tox21 and ToxCast are reported in Table 5 in the supplement. SOAP consistently achieves improved performance when the data is extremely imbalanced. However, it sometimes fails to do so if the imbalance ratio is not too low. Clearly, the improvements from our method are higher when the imbalance ratio of labels is lower. In other words, our method is more advantageous for data with extreme class imbalance.

Insensitivity to Batch Size. We conduct experiments on CIFAR-10 and CIFAR-100 data by varying the mini-batch size for the SOAP algorithm and report results in Figure 2 (leftmost). We can see that SOAP is not sensitive to the mini-batch size. This is consistent with our theory. In contrast, many previous methods for AP maximization are sensitive to the mini-batch size [47, 48, 6].

Convergence Speed.
We report the convergence curves of different methods for maximizing AUPRC or AP in Figure 1 on different datasets. We can see that the proposed SOAP algorithms converge much faster than other baseline methods.

More Surrogate Losses. To verify the generality of SOAP, we evaluate its performance with two more surrogate loss functions $\ell(w; x_s, x_i)$ for the indicator $I(h_w(x_s) \ge h_w(x_i))$, namely, the logistic loss, $\ell(w; x_s, x_i) = -\log \frac{1}{1 + \exp(-c(h_w(x_i) - h_w(x_s)))}$, and the sigmoid loss, $\ell(w; x_s, x_i) = \frac{1}{1 + \exp(c(h_w(x_i) - h_w(x_s)))}$, where $c$ is a hyperparameter. We tune $c \in \{1, 2\}$ in our experiments. We conduct experiments on CIFAR10 and CIFAR100 following the experimental setting in Section 4.1 for the image data. For the graph data, we conduct experiments on the HIV and MUV data following the experimental setting in Section 4.2. We report the results in Table 4. We can observe that SOAP has similar results with different surrogate loss functions.

Consistency. Finally, we show the consistency between the surrogate objective $-P(w)$ and AP by plotting the convergence curves on different datasets in Figure 2 (right two). The curves show that our surrogate objective and the true AP are consistent.

In this work, we have proposed a stochastic method to optimize AUPRC that can be used in deep learning for tackling highly imbalanced data. Our approach is based on maximizing the averaged precision, and we cast the objective into a sum of dependent compositional functions. We proposed efficient adaptive and non-adaptive stochastic algorithms with provable convergence guarantee to compute the solutions. Extensive experimental results on graph and image datasets demonstrate that our proposed method can achieve promising results, especially when the class distribution is highly imbalanced. One limitation of SOAP is that its convergence rate is still slow.
In the future, we will work on improving the convergence rate to address this limitation of the present work.

[21] Guo, Z., Liu, M., Yuan, Z., Shen, L., Liu, W., and Yang, T. Communication-efficient distributed stochastic AUC maximization with deep neural networks. In Proceedings of the 37th International Conference on Machine Learning (ICML), pp. 3864-3874, 2020.
[22] Guo, Z., Yuan, Z., Yan, Y., and Yang, T. Fast objective and duality gap convergence for non-convex strongly-concave min-max problems. arXiv preprint arXiv:2006.06889, 2020.
[23] He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770-778, 2016.
[24] Henderson, P. and Ferrari, V. End-to-end training of object class detectors for mean average precision.
[31] Lin, T., Fan, C., Wang, M., and Jordan, M. I. Improved oracle complexity for stochastic compositional variance reduced gradient. CoRR, abs/1806.00458, 2018.
[32] Lin, T.-Y., Goyal, P., Girshick, R., He, K., and Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2980-2988, 2017.
[33] Liu, L., Liu, J., Hsieh, C., and Tao, D. Stochastically controlled stochastic gradient for the convex and non-convex composition problem. CoRR, abs/1809.02505, 2018.
[34] Liu, L., Liu, J., and Tao
[43] Oksuz, K., Cam, B. C., Akbas, E., and Kalkan, S. A ranking-based, balanced loss function unifying classification and localisation in object detection. In Advances in Neural Information Processing Systems, 2020.
[44] Qi, Q. SOAP code for reproducing results. https://github.com/Optimization-AI, 2021.
[45] Qi, Q., Xu, Y., Jin, R., Yin, W., and Yang, T. Attentional biased stochastic gradient for imbalanced classification. arXiv preprint arXiv:2012.06951, 2020.
[46] Qi, Q., Guo, Z., Xu, Y., Jin, R., and Yang, T.
An online method for a class of distributionally robust optimization with non-convex objectives. In Proceedings of Thirty-fifth Conference on Neural Information Processing Systems (NeurIPS), 2021.

We include the results on the effect of the imbalance ratio in Table 5, the full results using three networks on the MIT AICURES data in Table 6, and the PR curves of the final models on the CIFAR10 and CIFAR100 data in Figure 3.

Table 5: Test AUPRC on task 0 and task 2 of the Tox21 dataset and task 12 and task 8 of the ToxCast dataset with three graph neural network models.

In the following, we abuse the notations $g_i(w) = g_{x_i}(w) \in \mathbb{R}^2$ and $u_i = u_{x_i} = ([u_{x_i}]_1, [u_{x_i}]_2)$. We use $u_{i_t}$ to denote the updated vector at the $t$-th iteration for the sampled $i_t$-th positive data.

The gray dashed lines are the random classifiers on the test data sets, whose AUPRC equals the ratio $n_+/n$ between positive samples and all samples on each data set.

Proof. By combining Lemma 3 and Lemma 2, we have:

Then by setting $\alpha = \frac{1}{n_+^{2/5} T^{3/5}}$, $\gamma = \frac{n_+^{2/5}}{T^{2/5}}$, and multiplying both sides of the above inequality by $\frac{2}{\alpha T}$, we obtain a bound of order $O\left(\frac{n_+^{2/5}}{T^{2/5}}\right)$, where the last inequality is due to $T \ge n_+$ and $O$ suppresses constant numbers. We finish the proof.

Proof of Lemma 1. We first prove the second part, that $g_i(w) \in \Omega$. Due to the definition $g_i(w) = E_{x_j \sim D}[g(w; x_j, x_i)] = E_{x_j \sim D}[\ell(w; x_j, x_i) I(y_j = 1), \ell(w; x_j, x_i)]$ and Assumption 1, it is in $\Omega$. Next, we prove the smoothness of $P(w)$. To this end, we need to use the following Lemma 4, whose proof will be presented after that of Lemma 1.

Lemma 4. $f(u)$ is an $L_f$-smooth, $C_f$-Lipschitz continuous function for any $u \in \Omega$, and $\forall\, i \in [1, \cdots, n]$, $g_i$ is an $L_g$-smooth, $C_g$-Lipschitz continuous function.

Proof of Lemma 4. According to the definition, we have

Due to the assumption that $\ell(w; x_j, x_i)$ is an $L_l$-smooth, $C_l$-Lipschitz continuous function, we have

We finish the proof of Lemma 4.

Proof of Lemma 2.
To make the proof clear, we write $\nabla g_{i_t}(w_t; \xi) = \nabla g(w_t; \xi, x_{i_t})$, $\xi \sim \mathcal{D}$. Let $u_{i_t}$ denote the updated $u$ vector at the $t$-th iteration for the selected positive data $i_t$.
$\cdots\, C_g^2 C_f^2 L/2$.
Taking expectation on both sides, we have
where the inequality (a) is due to $ab \le a^2/2 + b^2/2$, and the inequality (b) uses the facts that $\|\nabla g_{i_t}(w_t; \xi)\| \le C_l$ and that $\nabla f$ is $L_f$-Lipschitz continuous for $u, g_i(w) \in \Omega$, with $C_1 = C_l^2 C_f^2$. Hence we have,
Taking summation and expectation over all randomness, we have
Let $i_t$ denote the positive data selected at the $t$-th iteration. We divide $\{1, \ldots, T\}$ into $n_+$ groups, with the $i$-th group given by $\mathcal{T}_i = \{t^i_1, \ldots, t^i_k, \ldots\}$, where $t^i_k$ denotes the iteration at which the $i$-th positive data is selected for the $k$-th time for updating $u$. Let us define $\phi(t)$ that maps the selected data into its group index and within-group index, i.e., there is a one-to-one correspondence between the index $t$ and the selected data $i$ together with its index within $\mathcal{T}_i$. Below, we use notations
Proof of Lemma 3. To prove Lemma 3, we first introduce another lemma that establishes a recursion for $\|u_{i_t} - g_{i_t}(w_t)\|^2$, whose proof is presented later.
Lemma 5. By the updates of SOAP Adam-style or SGD-style with $B_+ = 1$, the following equation holds for all $t \in \{1, \ldots, T\}$
where $\mathbb{E}_t$ denotes the conditional expectation conditioned on the history before $t^i_{k-1}$.
Then, by mapping every $i_t$ to its own group and making use of Lemma 5, we have
where $u^0_i$ is the initial vector for $u_i$, which can be computed by a mini-batch averaging estimator of $g_i(w_0)$. Thus
Proof. We first introduce the following lemma, whose proof is presented later.
Lemma 6. Suppose the sequence generated in the training process using the positive sample
Define $g_{i_t}(w_t) = g(w_t; \xi, x_{i_t})$. Let $\Pi_\Omega(\cdot): \mathbb{R}^2 \to \Omega$ denote the projection operator.
By the updates of $u_{i_t}$, we have
where the inequality (a) holds because $t^i_k - t^i_{k-1}$ is a geometrically distributed random variable with $p = 1/n_+$, i.e., $\mathbb{E}[(t^i_k - t^i_{k-1})^2 \mid t^i_{k-1}] \le 2/p^2 = 2n_+^2$, by Lemma 6. The last equality holds by defining
Proof of Lemma 6. Denote by $\Delta_k = i_{k+1} - i_k$ the random variable that represents the number of iterations until the $i$-th positive sample is randomly selected for the $(k+1)$-th time, conditioned on $i_k$. Then $\Delta_k$ follows a geometric distribution such that $\Pr(\Delta_k = j) = (1 - p)^{j-1} p$, where $p = \frac{1}{n_+}$ and $j = 1, 2, 3, \ldots$. As a result,
Proof. We first provide two useful lemmas, whose proofs are presented later.
Lemma 7. Assume Assumption 1 holds
where
According to Lemma 8 and plugging Lemma 3 into equation (14), we have
Then, by rearranging terms in equation (15), dividing both sides by $\alpha T (1 + \eta_1)(\; + C_g^2 C_f^2)^{-1/2}$, and absorbing the constants $C_g, L_g, C_3, L, C_f, L_f, V$ into the big $O$, we get
Moreover, by the definition of $\mathcal{L}$ and $w_0 = w_1$, we have
where the inequality (a) is due to Lemma 7 and $c_{T+1} \le (1 - \eta_1)^{-1} \alpha$ in equation (30).
The inequality (a) is due to $\gamma = \frac{n_+^{2/5}}{T^{2/5}}$ and $\alpha = \frac{1}{n_+^{2/5} T^{3/5}}$. In inequality (b), we further absorb $\Delta_1, \eta_1, \eta, c$ into the big $O$ and use $\gamma \le 1 \Rightarrow n_+^{2/5} \le T^{2/5}$.
Proof. This proof follows the proof of Lemma 4 in [10]. Then, following the Adam-style update in Algorithm 3, we have
which completes the proof.
Proof. To make the proof clear, we use the same definitions as in the proof of Lemma 2. Denote $\nabla g_{i_t}(w_t; \xi) = \nabla g(w_t; \xi, x_{i_t})$, $\xi \sim \mathcal{D}$, where $i_t$ is a positive sample randomly drawn from $\mathcal{D}_+$ at the $t$-th iteration, and $\xi$ is a random sample drawn from $\mathcal{D}$ at the $t$-th iteration. It is worth noting that $i_t$ and $\xi$ are independent. Let $u_{i_t}$ denote the updated $u$ vector at the $t$-th iteration for the selected positive data $i_t$.
where $h_{t+1} = \eta_1 h_t + (1 - \eta_1) \nabla g_{i_t}(w_t; \xi) \nabla f(u_{i_t})$ and the second inequality is due to Lemma 7.
Taking expectation on both sides, we have
where $\mathbb{E}_t[\cdot] = \mathbb{E}[\cdot \mid \mathcal{F}_t]$ denotes taking the expectation over $i_t, \xi$ given $w_t$. In the following analysis, we decompose $\Upsilon$ into three parts and bound them one by one:
Let us first bound $I^t_1$,
where the equality (a) is due to $\nabla P(w_t) = \mathbb{E}_{i_t, \xi}[\nabla g_{i_t}(w_t; \xi) \nabla f(g_{i_t}(w_t))]$, where $i_t$ and $\xi$ are independent. The inequality (b) follows from $ab \le a^2/2 + b^2/2$. The last inequality (c) is due to
For $I^t_2$ and $I^t_3$, we have
Define the Lyapunov function $\mathcal{L}_t = P(w_t) - c_t \langle \nabla P(w_{t-1}), D_t h_t \rangle$ (28)
where $c_t$ and $c$ will be defined later. By setting $\alpha_{t+1} \le \alpha_t = \alpha$, $c_t =$
(31)
where the last inequality is due to equation (30), so that we have $2(\alpha + c_{t+1})^{-1/2} C_g^2 L_f^2 \le c\alpha$ and $\alpha + c_{t+1} \le 2(1 - \eta_1)^{-1} \alpha$.
Stochastic multi-level composition optimization algorithms with level-independent convergence rates. CoRR, abs
The area above the ordinal dominance graph and the area below the receiver operating characteristic graph
Area under the precision-recall curve: Point estimates and confidence intervals
Smooth-AP: Smoothing the path towards large-scale image retrieval
Learning to rank with nonsmooth cost functions
Deep metric learning to rank
Learning imbalanced datasets with label-distribution-aware margin loss
Towards accurate one-stage object detection with AP-loss
AP-loss for accurate one-stage object detection
Solving stochastic compositional optimization is nearly as easy as solving stochastic optimization
Ranking measures and loss functions in learning to rank
AUC optimization vs.
error rate minimization
Class-balanced loss based on effective number of samples
A patient-centric dataset of images and metadata for identifying melanomas using clinical context
Training deep neural networks via direct loss minimization
A deep learning approach to antibiotic discovery
Stochastic compositional gradient descent: algorithms for minimizing compositions of expected-value functions
Accelerating stochastic composition optimization
Advanced graph and sequence neural networks for molecular property prediction and drug discovery
MoleculeNet: a benchmark for molecular machine learning
How powerful are graph neural networks?
Optimal epoch stochastic gradient descent ascent methods for min-max optimization
Stochastic online AUC maximization
Fast stochastic variance reduced ADMM for stochastic composition optimization
Robust deep AUC maximization: A new surrogate loss and empirical studies on medical image classification
A support vector method for optimizing average precision
A composite randomized incremental gradient method
Online AUC maximization
We thank Bokun Wang for discussing the proofs, and thank the anonymous reviewers for constructive comments. Q.Q. contributed to the algorithm design, analysis, and experiments under the supervision of T.Y. Y.L. and Z.X. contributed to the experiments under the supervision of S.J. Q.Q. and T.Y. were partially supported by NSF Career Award #1844403, NSF Award #2110545, and NSF Award #1933212. Y.L., Z.X., and S.J. were partially supported by NSF IIS-1955189.
Choosing $\eta_1 < 1$ and defining $\tau = \eta_1^2 / \eta_2$, with the Adam-style (Algorithm 3) updates of SOAP, $h_{t+1} = \eta_1 h_t + (1 - \eta_1) G(w_t)$, we can verify for every dimension $l$,
where $w^l$ is the $l$-th dimension of $w$ and the third inequality follows from the Cauchy-Schwarz inequality.
For the $l$-th dimension of $v$, $v^l_t$, first we have
Using equation (19) and equation (20), we have
Then, by rearranging terms and summing equation (31) over $t = 1, \ldots, T$, we have
By combining with Lemma 3, we finish the proof.
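The second-moment bound invoked through Lemma 6, namely that a geometric random variable with success probability $p = 1/n_+$ satisfies $\mathbb{E}[\Delta^2] \le 2/p^2 = 2n_+^2$, can be sanity-checked numerically. The sketch below is not taken from the SOAP code: the function names, the truncation length, and the value $n_+ = 50$ are hypothetical choices for illustration. It sums the series $\sum_{j \ge 1} j^2 (1-p)^{j-1} p$ and compares it with the standard closed form $(2-p)/p^2$.

```python
def geometric_second_moment(p: float, terms: int = 200_000) -> float:
    """Approximate E[Delta^2] for Pr(Delta = j) = (1 - p)**(j - 1) * p, j = 1, 2, ...

    The infinite series is truncated after `terms` terms (an illustrative
    choice; the neglected tail is negligible when terms is much larger than 1/p).
    """
    total = 0.0
    prob = p  # Pr(Delta = 1)
    for j in range(1, terms + 1):
        total += j * j * prob
        prob *= 1.0 - p  # becomes Pr(Delta = j + 1)
    return total


def closed_form_second_moment(p: float) -> float:
    """Standard identity: Var + mean^2 = (1 - p)/p^2 + 1/p^2 = (2 - p)/p^2."""
    return (2.0 - p) / (p * p)


n_plus = 50                 # hypothetical number of positive examples
p = 1.0 / n_plus            # chance that a given positive example is sampled
series_value = geometric_second_moment(p)
exact_value = closed_form_second_moment(p)
bound = 2.0 * n_plus ** 2   # the 2 / p^2 = 2 n_+^2 bound used in the proof

print(f"series ~ {series_value:.4f}, closed form = {exact_value:.4f}, bound = {bound:.1f}")
```

Since $(2-p)/p^2 \le 2/p^2$ for every $p \in (0, 1]$, the bound holds with room to spare, and the gap shrinks as $n_+$ grows.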