key: cord-0572021-654n2uba
authors: Nan, Yang; Li, Fengyi; Tang, Peng; Zhang, Guyue; Zeng, Caihong; Xie, Guotong; Liu, Zhihong; Yang, Guang
title: Automatic Fine-grained Glomerular Lesion Recognition in Kidney Pathology
date: 2022-03-11
journal: nan
DOI: nan
sha: 9beb399bddc399efea07f4d3d0cb36e5236f71fc
doc_id: 572021
cord_uid: 654n2uba

Recognition of glomerular lesions is key to diagnosis and treatment planning in kidney pathology; however, coexisting glomerular structures such as mesangial regions make this task harder. In this paper, we introduce a scheme to recognize fine-grained glomerular lesions from whole slide images. First, a focal instance structural similarity loss is proposed to drive the model to locate all types of glomeruli precisely. Then an Uncertainty Aided Apportionment Network is designed to carry out fine-grained visual classification without bounding-box annotations. This double-branch structure extracts common features of the child class from the parent class and produces an uncertainty factor for reconstituting the training dataset. Results of the slide-wise evaluation illustrate the effectiveness of the entire scheme, with an 8-22% improvement in mean Average Precision compared with prominent detection methods. The comprehensive results clearly demonstrate the effectiveness of the proposed method.

Immunoglobulin A nephropathy (IgAN) is the leading cause of chronic kidney disease worldwide, especially in Asian regions, with nearly 40% of patients developing end-stage renal disease within decades. Patients with this nephropathy have varied histological lesions, ranging from crescentic glomerulonephritis and mesangial proliferation to global and segmental sclerosis. The five main types of structure in IgAN (Fig. 1) are Neg (tubule and arteriole), GS (Global Sclerosis), C (Crescent), SS (Segmental Sclerosis) and NoA (None of Above).
However, due to the collapse and proliferation of capillary loops, some pathological changes in IgAN are so visually similar that even pathologists cannot reach satisfactory agreement. A previous study found low intra-class correlation coefficients for recognizing SS (0.66) and C (0.46) in IgAN among three to five pathologists [1]. Evidently, there is an urgent need to make the identification of pathological changes more objective and quantitative.

Feb. 12, 2022

As digital pathology evolves, biopsy tissues can be scanned as whole slide images (WSIs) through micro-scanners. Meanwhile, the remarkable success of deep convolutional neural networks also provides pathways to address these intractable issues. Building on these advances, researchers have studied computer-aided diagnosis for renal pathological feature recognition, including glomerulus localization and lesion classification. Current localization methods are mainly based on the combination of patch-wise predictions, including detection [2] and segmentation [3] of numerous glomeruli in large patches, or binary classification (glomerulus or non-glomerulus) in small patches [4]. However, most existing studies focus on classifying sclerosis versus non-sclerosis and do not involve complicated lesion types, which indicates that IgAN has been insufficiently explored.

To identify complex pathological changes in IgAN, several difficulties need to be addressed. Firstly, each glomerulus should be accurately localized. In addition, glomerular lesions with low inter-class variance should be precisely classified, even under an imbalanced data distribution. Unfortunately, most fine-grained classification tasks were performed on natural images [5, 6] with relatively uniform background regions, in contrast to the complex background regions in renal pathological images (Supplement-Fig. 1).
Besides, methods that rely heavily on part annotations [7], such as bounding boxes or masks that label the sub-regions, are impractical for medical images due to the high cost of manual annotation. For fine-grained recognition, the network should assess the 'confidence' of its prediction to express the certainty of its output. Unfortunately, methods such as Bayesian deep learning and Monte Carlo dropout require repeated inference passes, which are inconvenient and time-consuming. Last but not least, a network applied in medical image analysis should be able to reduce the negative impact of inconclusive annotations.

In this paper, we present the first attempt to conduct fine-grained visual recognition (FGVR) in large-scale whole slide images, aiming to recognize complex pathological changes in IgAN. Unlike previous work, FGVR aims to both locate and distinguish fine-grained subcategories. To address the weak capacity of existing detection modules (such as Mask R-CNN [8], FCOS [9], etc.) on the FGVR task, we propose a two-stage scheme consisting of glomerulus segmentation followed by classification. First, a focal instance structural similarity (FISS) loss is presented to acquire accurate segmentation results. It combines focal loss with an instance structural similarity loss to accurately segment the boundary of each glomerulus. Then, the glomerular lesions are divided into two groups based on pathological representation and lesion severity, with Neg, GS, and CSN (the combination of C, SS, and NoA) as parent classes and C, SS, and NoA as child classes. Based on this definition, an Uncertainty-Aided Apportionment Network (UAAN) is proposed to classify complex pathological changes in IgAN, yielding two groups of predictions (corresponding to the parent and child classes) and their uncertainty.
This uncertainty indicates the confidence of the network's prediction and can be further applied to data reconstitution, including finding missing annotations, mislabeled samples, and hard samples. With this indicator, uncertain annotations can be picked out, rechecked, and analyzed by experts. To better illustrate the mechanism and interpretability of UAAN, heatmaps from different layers are visualized using Gradient-weighted Class Activation Mapping (Grad-CAM) [10]. Besides, uniform manifold approximation and projection (UMAP) [11] is applied to show the data distribution before and after data reconstitution. Experimental results on the Warwick-QU dataset [12] and our in-house renal pathology dataset are reported, with a patch-wise and a slide-wise evaluation, respectively.

The main contributions of our work are:
- We have introduced a scheme for fine-grained visual recognition in kidney pathology, aiming to detect complex pathological changes in IgAN.
- We have proposed a focal instance structural similarity loss that improves segmentation performance by assessing structural integrity at the instance level.
- We have designed an effective architecture for fine-grained classification and uncertainty assessment, with detailed ablation experiments and heatmap visualization.
- Comprehensive experiments have been conducted at multiple levels (patch-wise and slide-wise) to demonstrate the effectiveness of our proposed network.

The common loss functions mainly consist of Dice coefficient loss, cross-entropy loss, Jaccard loss, and focal loss [13]. In addition to these well-known losses, Tversky loss [14] was proposed as an extension of the Jaccard loss that weights false positives and false negatives through the hyperparameters α and β. The Lovász loss combined the Jaccard loss with the Lovász extension to find the minima of the submodular function [15].
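For reference, the Tversky index behind the loss in [14] is commonly written as below. The notation is an assumption chosen for illustration: P is the predicted foreground, G the ground truth, and α and β weight false positives and false negatives. Setting α = β = 0.5 recovers the Dice coefficient, and α = β = 1 recovers the Jaccard index.

```latex
T_{\alpha,\beta}(P, G) =
  \frac{|P \cap G|}{|P \cap G| + \alpha\,|P \setminus G| + \beta\,|G \setminus P|},
\qquad
\mathcal{L}_{\mathrm{Tversky}} = 1 - T_{\alpha,\beta}(P, G)
```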
Structural similarity was also introduced as an optimization function for segmentation based on sliding windows [16]. However, existing segmentation losses rarely assess predictions at the instance level [17]. Specifically, the loss function should consider each object within the image individually, rather than adopting the same strategy for all objects. In other words, the penalty for missing a small object should be higher than that for missing an equivalent area within a large object.

Fine-grained visual classification (FGVC) was first defined to classify different species of birds from images with a single object [18, 19]. It can be divided into region-based and feature expression-based methods. The region-based approaches [20, 21] [...] [23] incorporated convolutional operations along the edges of the tree structure and used the routing functions in each node to determine the root-to-leaf computational paths within the tree. However, because of the complex and varied backgrounds, methods developed for natural images can barely achieve satisfactory performance in pathological studies.

Given a whole slide image ℋ that can be divided into large patches, each with a binary mask ℳ, the target bounding boxes ℬ are first obtained from the segmentation stage; the classification stage additionally uses the fixed samples ℱ and the CSN index list ℓ_CSN (details can be found in the supplement).

In this study, the segmentation module is a modified SegNet [24] that introduces group normalization and leaky ReLU. Currently, most segmentation losses are based on overlap measurements, which makes the network focus only on the ratio of correctly predicted pixels to the ground truth. To address this issue, a compound loss (DL+FISS) that considers both overlap and regional structural similarity is introduced, which can be easily adopted in various architectures.
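A compound loss of this kind can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: DL is assumed to denote the Dice loss, the three terms are assumed to be combined as a weighted sum with both weights set to 1, and the per-instance similarity uses a single global SSIM window per ground-truth bounding box instead of a sliding window. The function names, the box format `(row0, col0, row1, col1)`, and the handling of images without instances are hypothetical.

```python
import numpy as np

def dice_loss(p, g, eps=1e-7):
    """Overlap term: 1 - Dice coefficient between prediction p and mask g."""
    inter = np.sum(p * g)
    return 1.0 - (2.0 * inter + eps) / (np.sum(p) + np.sum(g) + eps)

def focal_loss(p, g, gamma=2.0, eps=1e-7):
    """Binary focal loss: down-weights pixels that are already well classified."""
    p = np.clip(p, eps, 1.0 - eps)
    pt = np.where(g == 1, p, 1.0 - p)  # probability assigned to the true class
    return float(np.mean(-((1.0 - pt) ** gamma) * np.log(pt)))

def ssim_global(x, y, k1=0.01 ** 2, k2=0.03 ** 2):
    """SSIM computed over one window covering the whole region (simplification)."""
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    num = (2.0 * mx * my + k1) * (2.0 * cov + k2)
    den = (mx ** 2 + my ** 2 + k1) * (vx + vy + k2)
    return num / den

def iss_loss(p, g, boxes):
    """Instance term: 1 - mean SSIM over the bounding box of each instance.

    Images without instances return 0 here; the paper describes a
    patch-based fallback for that case, which this sketch omits.
    """
    if not boxes:
        return 0.0
    sims = [ssim_global(p[r0:r1, q0:q1], g[r0:r1, q0:q1])
            for (r0, q0, r1, q1) in boxes]
    return 1.0 - float(np.mean(sims))

def compound_loss(p, g, boxes, alpha=1.0, beta=1.0):
    """Dice + alpha * focal + beta * ISS (weighted-sum form is an assumption)."""
    return dice_loss(p, g) + alpha * focal_loss(p, g) + beta * iss_loss(p, g, boxes)
```

Because the instance term averages one similarity value per object, a missed small object costs as much as a missed large one, which is the behaviour argued for in the preceding discussion of instance-level losses.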
The FISS loss combines the focal loss L_Focal with the instance structural similarity (ISS) loss L_ISS, where α and its companion weight are constraint weights, p and g represent the prediction and ground truth, and ℬ indicates the bounding box of each target in the ground truth. Since we want to balance the constraints from L_Focal and L_ISS, we set both weights to 1, which also proved appropriate in our initial pilot study.

The Instance Structural Similarity (ISS) loss is inspired by [25] and aims to assess the structural integrity of each target object. A list of per-instance similarity indexes is acquired; e.g., a list of six indexes is acquired when there are six instances within the input image (shown in Supplement-Fig. 2). The loss is computed from the number of instances in an input image x and from the mean and variance of the i-th instance within x, where the regions of the i-th instance are extracted from both the prediction and the ground truth of x. Instances that are smaller than the sliding window used in the similarity calculation are zero-padded so that the kernel can slide across the object. An image without ROI is divided into patches (we set the number to 4) to compute the loss instead of evaluating the similarity directly. By combining L_ISS and L_Focal, the loss focuses on imbalanced samples while maintaining structural integrity, which yields smoother boundaries and fewer false negatives. Further exploration of FISS is illustrated in Supplement-3.

This section introduces the details of UAAN, including the hierarchical structure design, the feature apportionment mechanism, uncertainty-aided data reconstitution, and solutions for the imbalanced dataset. Deep networks can achieve superior results when categories are independent and identically distributed. However, due to the complex pathological changes in IgAN, lesions cannot be well classified through standard solutions.
In this section, we design a hierarchical architecture for fine-grained classification, with the parent branch classifying classes with large variation and the child branch dividing classes with small variation (shown in Fig. 3). The backbone of UAAN is inspired by Inception-ResNetV2 [26] and introduces group normalization (GN), PReLU [27] (to strengthen the generalization ability), and a basic convolution block (consisting of a convolution layer, GN, and PReLU). After the backbone, feature maps are routed separately to the two sub-branches (parent and child) through the CSN index list ℓ_CSN, which is acquired during data preparation and whose elements are the indexes of the CSN samples in the current batch. For a batch of N images containing m CSN samples, the child branch therefore receives m feature maps, each with the width, height, and number of channels of the last feature map of the backbone. This mechanism requires that each batch include at least one CSN sample.

After the backbone of UAAN, the parent and child branches give the parent and child class predictions, respectively. In addition, a feature apportionment mechanism is proposed between these two branches to transfer valuable information about the subcategories. In both branches, global max-pooling (GMP) layers extract the most intense response in each channel of the last feature map, and dropout layers force the network to learn more robust features that are useful in conjunction with different random subsets of other neurons. Details of the two branches can be found in Fig. 4.

Feature apportionment plays a crucial role in UAAN and is performed through the Grasper layer. The gather operation in the Grasper layer extracts the feature maps of the CSN samples based on ℓ_CSN. Since the parent branch is designed to classify Neg, GS, and CSN (here C, SS, and NoA are regarded as a single parent class), its layers aim to learn discriminative features among Neg, GS, and CSN samples.
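The index-based routing described above can be illustrated with a short NumPy sketch. It only shows how a CSN index list selects the feature maps that are passed to the child branch; the helper names `csn_index_list` and `grasp` are hypothetical, and the example batch mirrors the one given in the supplementary material, with the fixed sample assumed here to be a C sample.

```python
import numpy as np

# Child classes that together form the CSN parent class.
CSN_CLASSES = {"C", "SS", "NoA"}

def csn_index_list(labels):
    """Return the batch positions of samples whose parent class is CSN."""
    return [i for i, label in enumerate(labels) if label in CSN_CLASSES]

def grasp(features, csn_idx):
    """Gather the CSN feature maps: (N, H, W, C) -> (m, H, W, C)."""
    return np.take(features, csn_idx, axis=0)

# Batch of N = 8 images; position 0 holds the fixed sample (assumed class C).
labels = ["C", "NoA", "GS", "SS", "SS", "Neg", "C", "Neg"]
idx = csn_index_list(labels)

# Dummy backbone output with hypothetical spatial size and channel count.
backbone_out = np.zeros((len(labels), 4, 4, 16))
child_in = grasp(backbone_out, idx)
```

With this batch, the index list contains the five CSN positions, so the child branch receives m = 5 of the N = 8 feature maps while the parent branch still sees all of them.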
Therefore, the CSN feature in the parent branch can be regarded as the common feature of C, SS, and NoA that rarely exists in GS and Neg samples (e.g., capillary loops, mesangial regions). In contrast, since the child branch aims to classify C, SS, and NoA, its layers are optimized to extract discriminative features of the child classes. The use of both common and discriminative features can therefore be promoted by transferring the common CSN feature from the parent branch to the child branch through a sigmoid activation layer. This mechanism helps the network to be trained not only with the discriminative features but also with the common features.

The hard samples in this study are samples that contain many similar features (or similar morphological characteristics) among C, SS, and NoA. For instance, mesangial proliferation and crescents may co-exist in a single glomerulus, and experts usually give a subjective judgment according to the degree of severity. Because of these hard samples, network training can barely achieve good results even when a large amount of data is provided: including them in the training set makes it difficult for the network to extract features that are highly related to the real sub-category. In natural image analysis, these 'unreliable' and hard samples are usually tracked, relabeled, and augmented to enhance the capacity of computational modules. However, this is unaffordable for biomedical image analysis due to the heavy cost of expert annotation. We hold that some hard samples do not help to train a robust module and instead cause a negative shift of the classification boundaries.
To bridge this gap, an uncertainty assessment is proposed that outputs a confidence coefficient for each prediction, so that 'unreliable' and hard samples can be removed from the raw training set. It is calculated, for each glomerulus segmented in stage 1, from the logits of the softmax layer and the vectors of the GMP layer, with softmax activation applied to obtain probability density functions over the K classes. With the estimated probability density function and the raw probability density function, the uncertainty factor can be calculated from these two functions. With this uncertainty factor, data reconstitution can be introduced to pick out and remove such samples. Details of the solution for the imbalanced dataset are presented in Supplement-4.

The datasets in this study include Warwick-QU and our in-house Renal Pathological Whole Slide Images (RP-WSIs). The Warwick-QU dataset is used to assess the effectiveness of FISS loss (since segmentation of circular glomeruli is a relatively simple task compared to segmentation of various glands). The RP-WSIs dataset is used to assess the effectiveness of [...]. Details of these two sets are shown in Supplement-5. Essentially, 300 WSIs were used for training and validation and 100 WSIs were used for testing [28]. Details of preprocessing and training strategies can be found in Supplement-6.

The evaluation scheme is divided into patch-wise and slide-wise experiments. Let P be the prediction given by the segmentation network and G the ground truth; the segmentation task is assessed through the Dice coefficient, Dice(P, G) = 2|P ∩ G| / (|P| + |G|). All these losses are trained on the same network without any architectural modification. The classification metrics (Eq. 11) are defined in terms of N, the total number of tested images, and K, the number of classes. Fourthly, we remove the "unreliable" data under different uncertainty thresholds from the raw dataset and have experts perform a second review to explore the influence of different uncertainty thresholds.
Finally, the data distribution in latent space before and after uncertainty-aided data reconstitution is visualized by uniform manifold approximation and projection (UMAP) [11], calculated from the vectors of the GMP and dense layers.

The overall performance of the proposed scheme on whole slide images is assessed through the commonly used metrics Average Precision (AP) and mean Average Precision (mAP). In this experiment, the ground truths (bounding boxes with class tags) are generated from the glomerulus boundaries given by experts. During the evaluation, segmentation is first conducted to locate all glomeruli (stage-1), followed by classification (stage-2). The inputs of the classification are obtained by cropping the minimum bounding boxes of the segmented glomeruli (from stage-1). Note that all segmented objects except tiny objects (area less than 100 pixels), including incorrectly segmented samples, are passed to UAAN. Then, the classification module produces category predictions for these cropped rectangles, and the probability given by stage-2 is used as the confidence score in the mAP computation. Predicted objects whose IoU with the ground truth reaches a threshold (0.5 in our experiments) are considered true positives. We compared the AP50 (average precision at an IoU threshold of 0.5) of each subcategory between our scheme and various detection methods, including Mask R-CNN [8], FCOS [9], Mask Scoring R-CNN [29], and Cascade R-CNN [30]. All these models are trained from scratch without any pre-training.

Experimental results on the Warwick-QU and RP-WSIs-P datasets for FISS loss are reported in Supplement-7 through the Dice coefficient score.

Ablation Study of UAAN. The effectiveness of different techniques is presented in Table 1.
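The micro- and macro-accuracy used in the ablation study can be computed as in the following sketch. It assumes the common definitions, namely micro-accuracy as the overall fraction of correct predictions and macro-accuracy as the unweighted mean of the per-class accuracies; the paper's exact formulas may differ, and the function names are hypothetical.

```python
import numpy as np

def micro_accuracy(y_true, y_pred):
    """Fraction of all samples that are classified correctly."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))

def macro_accuracy(y_true, y_pred, num_classes):
    """Unweighted mean of per-class accuracies, skipping classes with no samples."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    per_class = []
    for k in range(num_classes):
        mask = y_true == k
        if mask.any():
            per_class.append(np.mean(y_pred[mask] == k))
    return float(np.mean(per_class))

# Imbalanced toy example: class 0 dominates and is always predicted correctly.
y_true = [0, 0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 0, 0, 1, 0, 0]
micro = micro_accuracy(y_true, y_pred)                 # 5 of 7 correct
macro = macro_accuracy(y_true, y_pred, num_classes=3)  # mean of 1.0, 0.5, 0.0
```

On imbalanced data the two metrics can diverge sharply, as in this toy example, which is why both are worth reporting when minority classes such as C and SS matter.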
Employing feature apportionment significantly improves micro-accuracy (a 3.34% relative gain), while the uncertainty-aided data reconstitution effectively improves macro-accuracy (a 2.58% gain). When both feature apportionment and data reconstitution are employed, there is a significant increase in both micro (5.57%) and macro (4.69%) accuracy compared with the baseline module. After reconstituting the training set through the uncertainty factor, the performance of all comparison models has been significantly improved, with a 1% to 4% micro-accuracy gain.

Heatmaps of UAAN and competing networks are shown in Fig. 5 using Grad-CAM. The final prediction of UAAN is marked by black bounding boxes and the pathological changes in GS, C, and SS are annotated with yellow curves, while "None" indicates that the heat map of these two categories is not available in the child branch. The heat maps of the common CSN feature (marked by the green rectangle) are clearly related to the CSN common features, highlighting the capillary loops, intrinsic cells, and mesangial matrix. During the training procedure, these features are filtered by the Grasper layer and conveyed to the child branch to enhance the common information. The highlighted regions directly reflect where the network is "focusing" [...]. The influence of different uncertainty thresholds is illustrated in Supplement-8.

Effectiveness of Data Reconstitution. Fig. 6 (a-c) shows the data distribution before reconstitution, which is disordered, with certain NoA, SS, and C samples densely mixed. As observed in Fig. 6 (e-f), employing data reconstitution via the uncertainty factor tends to separate the mislabeled or hard samples in feature space, especially for SS, NoA, and C. Low-variance categories, such as GS and Neg, are less affected. The distribution map in Fig.
6 (f) indicates that the combination of probability vectors from the GMP and dense layers yields the clearest boundary, and the correlation between these vectors provides the basis for the uncertainty index.

Table 4 presents the results of the slide-wise evaluation for fine-grained glomeruli detection on 100 whole slide images. The visualization results of the slide-wise evaluation are presented in Supplement-9.

This section provides discussions on the proposed method, including (1) [...]. Although there is a slight reduction of structural integrity, employing DL+FISS actually achieves better performance than using DL+ISS only.

Table 1 reminds researchers to make better use of common features between different categories, instead of only forcing the network to focus on discriminative regions. As Table 1 shows, feature apportionment mainly helps the network to distinguish NoA from C and SS, while classifying C and SS still requires a cleaner data constitution. Through the reconstitution of the training set, the classification module can find a better boundary to distinguish fine-grained classes. By integrating uncertainty-aided data reconstitution with the feature apportionment mechanism, both the micro- and macro-accuracy are significantly improved. In Table 2, the proposed method gains a competitive edge over other basic models, with 95.07% micro-accuracy (3-11% relative gain) and 91.28% macro-accuracy (12-26%). This large gap indicates the poor classification ability of current competitive models [26, 31-34, 36, 37] when dealing with imbalanced data.
Compared with most fine-grained classification methods, the proposed UAAN has several advantages: (1) it does not require a large number of branches or generative models; (2) it considers both discriminative features and common features between subcategories, instead of focusing on the discriminative part only; (3) it has a streamlined structure and can be further improved by adding complex operations such as [...].

[...] deep convolutional neural networks (especially in medical applications), since it reflects the confidence coefficient of the current predictions. In our study, we introduced the uncertainty assessment by calculating the correlation between two probability densities. Although our assessment does not provide a strict uncertainty calculation using Bayesian inference, it can still estimate the confidence of the predictions to some extent. Besides, it has also been used for selecting poorly annotated data and hard samples, and it does not require multiple forward propagations. The ablation experiments in Table 3 show that, as the uncertainty threshold increases, fewer samples are selected while the ratio of mislabeled samples among them is higher. After detailed inspection by three senior pathologists, 36% and 55% of the mislabeled images were corrected to C and SS, respectively (with an uncertainty threshold of 0.5). This supports our assumption that the output of the GMP layer is more correlated with the real distribution and can be used to simulate the raw probability density function. If there is no obvious correlation between the raw and the estimated probability density functions, we consider the features to be misrepresented in the last dense layer. That is, the output of the softmax layer represents an erroneous probability distribution due to an inaccurate feature representation process. This might be the reason why CNN+SVM performs better than a pure CNN in some previous studies [39].
Aided by this uncertainty assessment, significant improvements (1-4%) in micro-accuracy have been achieved by current state-of-the-art models [26, 31-34, 36, 37].

Limitations. Although our work has achieved superior performance among the current methods, it has some limitations. The proposed two-stage scheme suffers from a complex feature extraction procedure, and the performance of stage-2 (fine-grained classification) depends strongly on that of the segmentation stage. For instance, the classification module cannot work well when two contiguous glomeruli cannot be segmented individually. In addition, varying annotation quality, expert subjectivity, and the dataset itself may affect the stability of data reconstitution. External validation would help to test the generalizability of our proposed deep learning model, but at the moment we have no access to additional data resources. In lieu of constructing an external validation dataset, we will publish our implementation on an open-source platform for reproducibility studies and external validation by other research groups.

In this study, we have introduced a comprehensive scheme for glomerular lesion recognition, employing the focal instance structural similarity loss and the uncertainty-aided apportionment network. This work is the first attempt to tackle the fine-grained visual recognition task in pathological image analysis and has achieved superior performance on IgAN whole slide images. Both the focal instance structural similarity loss and the uncertainty-aided apportionment network are effective, resulting in an 8-22% improvement of the mAP compared with current schemes. The proposed method provides a high-precision computational scheme for fine-grained lesion identification of IgA nephropathy in whole slide images, which helps pathologists make more objective and effective clinical diagnoses.
For future work, it is of note that the detected lesions are not specific to IgA nephropathy, and one of our future research directions will be transfer learning of the lesion detection to other nephropathies with a similar presentation, e.g., lupus nephritis and diabetic nephropathy.

To better explore the performance of segmenting a circular object through different losses, we simulate thousands of situations under different ground truth ratios (shown in the Supplement). Due to the update mechanism of the network's parameters (updated on a batch of images after each iteration), an image including both small and large objects can be treated as a mini-batch of sub-images, each with a single object. Therefore, if a loss function responds well to a single object under different circumstances, it is expected to be effective for images containing various targets. Each red and blue circle can be treated as the prediction and the ground truth, respectively, and the red points can be regarded as the center of the predicted glomerulus. Initially, we introduce the ground truth ratio (GTR), which reflects the size of the target object, and then divide images with different GTRs into three subcategories: small-sized objects (GTR not more than 20%), middle-sized objects (GTR greater than 20% and less than 50%), and large-sized objects (GTR not less than 50%). For small objects with a 6% area ratio in Supplementary-Fig. 2(b), L_SSIM is less sensitive than the general L_Dice and L_ISS, which makes it less effective in small object segmentation. L_ISS presents a more sensitive response to small objects, even when the prediction is close to the ground truth (e.g., we empirically consider the object well segmented if its IoU is greater than 0.8). For middle-sized objects in Supplementary-Fig. 2(c), L_Dice and L_SSIM present similar penalties, while L_ISS is much stricter. For large-sized objects in Supplementary-Fig.
2(d), L_SSIM and L_ISS show similar effects and can precisely reflect structural integrity.

The hierarchical structure of UAAN requires that each batch include at least one CSN sample, which might not hold for an imbalanced dataset. Therefore, we selected 3 "fixed" samples (one each from C, SS, and NoA), and randomly fed one of them into the network together with the regular training samples (if the number of input images per batch is 8, then the number of regular samples is 7). For instance, if a mini-batch contains [fixed sample, NoA, GS, SS, SS, Neg, C, Neg], then N and m will be 8 and 5, respectively, and the 'CSN index list' will be [0, 1, 3, 4, 6] (Fig. 3). Note that the "fixed" samples are not involved in loss calculation and backpropagation, to prevent overfitting.

In addition to the "fixed" samples, we adopt a weighted cross-entropy loss for optimization. The weight of the k-th class is calculated from the number of classes, the number of samples of the k-th class, and the total number of samples of all classes. The loss function for training UAAN is then the weighted cross-entropy over the k-th values of the one-hot encoded label and of the softmax probability. Meanwhile, data augmentation is adopted due to severe class-imbalance issues, including randomized hue adjustment (randomly adjusting the hue of input images by ∆ ∈ [−0.15, 0.15]), randomized horizontal flip, randomized vertical flip, and randomized brightness adjustment (∆ ∈ (0, 0.15)), where ∆ is the scalar added to the normalized image representation.

Each input image x was pre-processed with the standard min-max normalization, which the original formula states in terms of the mean μ and variance of x. All trainable parameters in the experiments were initialized using the Glorot uniform initializer. Networks that participated in the comparison were trained with the Adam optimizer on an NVIDIA Tesla V100 GPU for 100 epochs, with an initial learning rate of 0.0001 and a decay of 0.96 per epoch.
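The learning-rate schedule stated above can be written out as a small sketch. It assumes that "decay of 0.96 per epoch" means a multiplicative (exponential) decay applied once per epoch, counted from epoch 0; the function name `learning_rate` is hypothetical.

```python
def learning_rate(epoch, initial_lr=1e-4, decay=0.96):
    """Learning rate during a zero-based epoch under per-epoch exponential decay."""
    return initial_lr * decay ** epoch

# One value per epoch for the 100 training epochs.
schedule = [learning_rate(epoch) for epoch in range(100)]
```

Under this assumption the rate shrinks by 4% per epoch, so the last epochs are trained with a rate roughly two orders of magnitude smaller than the initial one.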
To present an impartial evaluation, all models were trained from scratch and did not incorporate any transfer learning strategy.

Experimental results of FISS loss are reported in Supplement-Fig. 4.

Fig. 4: Visualization results of different loss functions trained on the same network from scratch, without any preprocessing, post-processing, or data augmentation.

The initial mislabeled rate in the raw dataset is 6.7%, given by three senior pathologists randomly checking 100 images per class ('/' represents no uncertainty threshold applied).

Supplement-Table 5. Ablation Experiments of Uncertainty Factor

With the aid of the uncertainty indicator, 1-15% (uncertainty threshold ∈ [0.3, 0.9]) of the training set is suspected to be "unreliable" data. We remove these data from the raw dataset and have experts perform a second review to explore whether these images are hard or mislabeled samples.

[...] arrows indicate incorrect predictions, including the black arrows (for incorrect classes), the blue arrows (for hard samples that all models fail to recognize) and the red arrows [...]

[1] The Oxford classification of IgA nephropathy: pathology definitions
[2] Region-based convolutional neural nets for localization of glomeruli in trichrome-stained whole kidney sections
[3] Computational segmentation and classification of diabetic glomerulosclerosis
[4] Detection and classification of novel renal histologic phenotypes using deep neural networks
[5] LG-CNN: From local parts to global discrimination for fine-grained recognition
[6] Exploiting spatial relation for fine-grained image classification
[7] Mask-CNN: Localizing parts and selecting descriptors for fine-grained bird species categorization
[8] Mask R-CNN
[9] FCOS: Fully Convolutional One-Stage Object Detection. International Conference on Computer Vision, 2019
[10] Grad-CAM: Visual explanations from deep networks via gradient-based localization
[11] UMAP: Uniform manifold approximation and projection for dimension reduction
[12] A stochastic polygons model for glandular structures in colon histology images
[13] Focal loss for dense object detection
[14] Tversky Loss Function for Image Segmentation Using 3D Fully Convolutional Deep Networks. International Workshop on Machine Learning in Medical Imaging
[15] The Lovász-Softmax loss: A tractable surrogate for the optimization of the intersection-over-union measure in neural networks
[16] Correlation maximized structural similarity loss for semantic segmentation
[17] Deep semantic segmentation of natural and medical images: A review
[18] The Caltech-UCSD Birds-200-2011 dataset
[19] Fei-Fei, 3D object representations for fine-grained categorization
[20] Part-Based R-CNNs for Fine-Grained Category Detection
[21] Part-Stacked CNN for Fine-Grained Visual Categorization
[22] G2C: a generator-to-classifier framework integrating multi-stained visual cues for pathological glomerulus classification
[23] Attention Convolutional Binary Neural Tree for Fine-Grained Visual Categorization
[24] SegNet: A deep convolutional encoder-decoder architecture for image segmentation
[25] Image Quality Assessment: From Error Visibility to Structural Similarity
[26] Inception-v4, Inception-ResNet and the impact of residual connections on learning
[27] Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification
[28] World Medical Association, World Medical Association Declaration of Helsinki: Ethical Principles for Medical Research Involving Human Subjects
[29] Mask Scoring R-CNN. Computer Vision and Pattern Recognition, 2019
[30] Cascade R-CNN: Delving Into High Quality Object Detection. Computer Vision and Pattern Recognition, 2018
[31] Deep Residual Learning for Image Recognition
[32] Aggregated residual transformations for deep neural networks
[33] Densely connected convolutional networks
[34] EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. International Conference on Machine Learning, 2019
[35] Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. Advances in Neural Information Processing Systems
[36] Destruction and construction learning for fine-grained image recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019
[37] Squeeze-and-excitation networks
[38] Identification of glomerular lesions and intrinsic glomerular cell types in kidney diseases via deep learning
[39] Deep convolutional neural networks for image classification: A comprehensive review

Slide-wise Evaluation

Supplement-Fig. 5. Visualization results of the slide evaluation; the red, green, yellow and blue boxes represent annotations/predictions of C, NoA, GS and SS, respectively.

Supplement references:
[1] Fei-Fei, 3D object representations for fine-grained categorization
[2] The Caltech-UCSD Birds-200-2011 dataset
[3] The PASCAL visual object classes challenge: A retrospective
[4] World Medical Association, World Medical Association Declaration of Helsinki: Ethical Principles for [...]

Fig. 1. Histogram of images from Stanford Cars [1], CUB-200-2011 [2], PASCAL-VOC2012 [3], and our in-house dataset, with red, green, and blue regions representing the intensity values of the R, G, and B channels, respectively. The foreground objects are annotated by yellow curves.

Fig. 2: Computation mechanism of previous SSIM methods and the proposed ISS (instance SSIM).

Neg: negative structures such as tubule and arteriole
Global Sclerosis: the entire glomerular tuft involved with sclerosis
Crescent: the presence of at least two layers of cells filling, circumferentially or in a circumscribed manner, the Bowman's space
Segmental Sclerosis: any amount of the tuft involved with sclerosis, but not involving the whole tuft

normalized image representation.

Dataset                  Training   Test
Patches (RP-WSIs-P)      9700       3700
Glomeruli (RP-WSIs-G)    23113      8529*

P indicates the large patches used for the segmentation task, G represents the glomeruli patches used for the classification task, and S is the whole slide images used for the detection task. Both the training set and the test set for patches (RP-WSIs-P) and glomeruli (RP-WSIs-G) were cropped from the corresponding training and test slides, respectively.
RP-WSIs-P included more than 13,000 large patches with a size of 1024 × 1024 pixels, while RP-WSIs-G was built by cropping the minimum bounding rectangle of each glomerulus. The distribution of glomeruli images is shown in Supplement-Table 3.
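
The cropping step used to build the glomeruli patches can be illustrated with a short sketch. This is not the authors' pipeline: it assumes that each glomerulus annotation is available as a list of (x, y) contour points, that the minimum bounding rectangle is axis-aligned, and that a patch is stored as a nested list of pixel values. The names `bounding_rectangle` and `crop_glomerulus` are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[int, int]  # (x, y) pixel coordinates; an assumed annotation format


def bounding_rectangle(contour: Sequence[Point]) -> Tuple[int, int, int, int]:
    """Return the inclusive axis-aligned box (x_min, y_min, x_max, y_max)."""
    if not contour:
        raise ValueError("contour must contain at least one point")
    xs = [x for x, _ in contour]
    ys = [y for _, y in contour]
    return min(xs), min(ys), max(xs), max(ys)


def crop_glomerulus(image: List[List[int]], contour: Sequence[Point]) -> List[List[int]]:
    """Crop the bounding rectangle of one annotated contour from a 2D patch."""
    height, width = len(image), len(image[0])
    x_min, y_min, x_max, y_max = bounding_rectangle(contour)
    # Clamp the box so that annotations touching the patch border stay valid.
    x_min, y_min = max(x_min, 0), max(y_min, 0)
    x_max, y_max = min(x_max, width - 1), min(y_max, height - 1)
    return [row[x_min:x_max + 1] for row in image[y_min:y_max + 1]]


# Toy 6 x 8 patch in which each pixel stores 10 * row + column.
patch = [[10 * r + c for c in range(8)] for r in range(6)]
contour = [(2, 1), (5, 1), (6, 3), (3, 4)]
crop = crop_glomerulus(patch, contour)
```

For real whole slide images the same idea would normally be applied to array-backed image regions rather than nested lists; a rotated minimum-area rectangle would need a different computation.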