title: Magnification Prior: A Self-Supervised Method for Learning Representations on Breast Cancer Histopathological Images
authors: Chhipa, Prakash Chandra; Upadhyay, Richa; Pihlgren, Gustav Grund; Saini, Rajkumar; Uchida, Seiichi; Liwicki, Marcus
date: 2022-03-15

Abstract: This work presents a novel self-supervised pre-training method to learn efficient representations without labels on histopathology medical images by utilizing magnification factors. Other state-of-the-art works mainly focus on fully supervised learning approaches that rely heavily on human annotations. However, the scarcity of labeled and unlabeled data is a long-standing challenge in histopathology. Currently, representation learning without labels remains largely unexplored in the histopathology domain. The proposed method, Magnification Prior Contrastive Similarity (MPCS), enables self-supervised learning of representations without labels on the small-scale breast cancer dataset BreakHis by exploiting the magnification factor, inductive transfer, and reduced human prior. The proposed method matches fully supervised state-of-the-art performance in malignancy classification when only 20% of labels are used in fine-tuning, and outperforms previous works in the fully supervised setting. It formulates a hypothesis, and provides empirical evidence to support it, that reducing human prior leads to efficient representation learning in self-supervision. The implementation of this work is available online on GitHub: https://github.com/prakashchhipa/Magnification-Prior-Self-Supervised-Method

* Corresponding author: prakash.chandra.chhipa@ltu.se

I. INTRODUCTION

Cancer diagnosis by analyzing histopathological whole-slide images (WSI) is an active research field in machine learning [2]. Currently, the most widely used method for analysis of histopathological samples is visual inspection under microscopy at different magnifications. Visual inspection follows a pyramid approach in which structural information is viewed at low magnification and cellular details are captured at high magnification. Automated classification of histopathological WSI can make cancer diagnosis faster and less prone to error. A challenge for supervised learning approaches applied to histopathological WSI is the scarcity of labeled data. Furthermore, label information for digital WSI is also limited and does not provide details of the affected region at different magnifications, as in the BreakHis dataset [3]. In such scenarios, representations learned purely through supervised learning might suffer, as such methods typically require a large amount of labeled data. This can lead to sub-optimal performance on downstream tasks such as malignancy classification.

Fig. 1: The proposed approach comprises three steps: (1) parameter initialization using inductive transfer of the supervised ImageNet pre-trained Efficient-net b2 model [1]; (2) self-supervised pre-training on unlabeled BreakHis histopathology images using the novel Magnification Prior Contrastive Similarity method, which provides positive pairs by exploiting a supervision signal from the data, i.e., the magnification factor, while reducing human prior; (3) supervised fine-tuning on limited labeled BreakHis histopathology images for the downstream task.
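To orient the reader, here is a minimal PyTorch sketch of the three-stage pipeline in Fig. 1. It is an illustration under assumptions, not the released implementation (see the GitHub link above): `mpcs_pretrain` and `finetune` are hypothetical stand-ins for the training loops described later in the paper.

```python
import torch
import torch.nn as nn
from torchvision.models import efficientnet_b2, EfficientNet_B2_Weights

# Step 1: inductive transfer - initialize Efficient-net b2 with supervised
# ImageNet pre-trained weights.
backbone = efficientnet_b2(weights=EfficientNet_B2_Weights.IMAGENET1K_V1)
backbone.classifier = nn.Identity()  # expose the 1408-d features after average pooling

# Step 2: MPCS self-supervised pre-training on unlabeled BreakHis images
# (magnification-based positive pairs + contrastive similarity, described in Sec. III):
# backbone = mpcs_pretrain(backbone, unlabeled_breakhis_loader)  # hypothetical helper

# Step 3: supervised fine-tuning on limited labeled BreakHis images for the
# downstream malignancy classification task:
model = nn.Sequential(backbone, nn.Linear(1408, 2))  # benign vs. malignant head
# finetune(model, labeled_breakhis_loader)  # hypothetical helper
```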
Exploring efficient representation learning on small-scale histopathological WSI data using objectives that do not require labels is, therefore, a promising approach. Such methods require no labeled data for learning representations, and those representations, in turn, can be used to lower the amount of labeled data needed to learn successful downstream models. However, many such methods rely on human priors to decide on a reasonable objective, which requires human effort. This work proposes a novel self-supervised method based on contrastive joint embedding, called Magnification Prior Contrastive Similarity (MPCS), to learn efficient representations without labels. The proposed method uses the magnification factor (a signal from the data) to construct positive pairs for contrastive similarity, enabling self-supervised learning on the small-scale dataset. This work also hypothesizes that reducing human-inducted prior in self-supervised methods enhances representation learning. This hypothesis is investigated by designing various pair sampling methods, each associated with a different level of inducted human prior, for self-supervised pre-training. MPCS is evaluated for histopathological WSI on the BreakHis [3] dataset. This evaluation uses an Efficient-net [1] model pre-trained on ImageNet [4], which is then further pre-trained using MPCS. Finally, the effectiveness of the learned representations is assessed when fine-tuned for the downstream task of malignancy classification at two ratios of labels. The complete approach is depicted in Figure 1. The main contributions of this work are:

1) Enabling self-supervised learning on a small-scale histopathology dataset by exploiting a data prior and transfer learning.
2) Demonstrating a label-efficient self-supervised learning method that obtains state-of-the-art performance using only 20% of labeled data.
3) Outperforming previous works on malignancy classification by 3.51% when using 80% of the labeled data.
4) Hypothesizing a relation between human-inducted prior and the representation learning potential of self-supervised methods.
5) Presenting initial empirical support showing that reducing human prior leads to efficient representation learning, which suggests promising further research.
6) Providing empirical evidence from cross-magnification evaluation to demonstrate magnification-invariant representation learning.

II. RELATED WORK

Most efforts in machine learning for histopathological analysis have used supervised learning. However, current supervised learning methods struggle when labeled data is scarce [2]. Other methods are often used alongside supervised learning to make up for the lack of labeled data. These methods often fall under one or both of two categories: (i) using unlabeled information or pseudo-labels, or (ii) using transfer learning from models trained on other tasks. Examples of such methods are image augmentation [5] for the first category and feature extraction and selection [6], [7] for the second. One public histopathological image dataset that poses such a challenge of data and label scarcity is BreakHis [3]. The dataset consists of 2,480 benign and 5,429 malignant images from 82 patients at four different magnification levels (40×, 100×, 200×, 400×). The original work suggested two evaluation metrics, image-level accuracy and patient-level accuracy (recognition rate), and presented baseline results.
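For concreteness, the two BreakHis evaluation metrics can be expressed in a short sketch. This is a minimal illustration, not the benchmark's reference code; `records` is a hypothetical list of (patient_id, true_label, predicted_label) tuples. The recognition rate averages per-patient image accuracy, so patients with many images do not dominate the score.

```python
# Sketch of the two BreakHis evaluation metrics (illustrative names, assumed
# data layout). `records` holds (patient_id, true_label, predicted_label).
from collections import defaultdict

def image_level_accuracy(records):
    """Fraction of correctly classified images, regardless of patient."""
    return sum(t == p for _, t, p in records) / len(records)

def patient_level_accuracy(records):
    """Recognition rate: mean over patients of per-patient image accuracy."""
    per_patient = defaultdict(lambda: [0, 0])  # patient_id -> [correct, total]
    for pid, t, p in records:
        per_patient[pid][0] += int(t == p)
        per_patient[pid][1] += 1
    return sum(c / n for c, n in per_patient.values()) / len(per_patient)
```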
[Table I: summary of strategies used by prior works on BreakHis [3]. Each row lists a work, its architecture (e.g., DenseNet [20], ResNet [17], AlexNet [9] & VGG16 [24]), augmentation (simple or none), use of ImageNet pre-training, classification method (e.g., XGBoost [21], AnoGAN [26], custom, two networks), and evaluation protocol (3 trials or 5-fold CV); rows include [19], PI [22], TL [23], RPDB [25], and MIM [27].]

Following the release of the original dataset, the authors released a work where they tested the effectiveness of using a Convolutional Neural Network (CNN) on the dataset [8], though without making use of either of the two methods of handling data scarcity mentioned above. After the release of BreakHis [3], many different methods have been applied to the dataset, most of which utilize a CNN in conjunction with one or both of the methods for handling data scarcity above. Most approaches to tackle the BreakHis dataset either use transfer learning from models trained on other supervised tasks, train in a supervised learning setting on BreakHis data extended with augmentations, or both. Table I contains a summary of a few such approaches along with some of the strategies used on BreakHis [3]. In the table, "custom" means a specialized or novel method introduced by that work. "Simple" augmentations refer to one or more of the following: rotation, flipping, cropping, shifting, and zooming.

Another learning paradigm to effectively counter the earlier stated data scarcity challenges is self-supervised learning. It refers to methods that use pseudo-labels retrieved from structural properties of the data itself to learn representations. Representation learning through self-supervised methods focused on unlabeled data can dramatically reduce the need for annotations by human experts. A common approach to self-supervised learning is to optimize the similarity of some extracted features. Representation learning through self-supervised learning paradigms for computer vision can be broadly categorized as (i) Joint Embedding Architecture & Method (JEAM) ([28], [29], [30], [31]), (ii) Prediction Methods ([32], [33], [34]), and, loosely, (iii) Reconstruction Methods ([35], [36]). JEAM is primarily conceptualized around multiple views of input images, the embedding capabilities of joint architectures, and specialized loss objectives, which together have driven recent advances in self-supervision. JEAM can be divided further, with each subdivision providing many interesting works: (i) Contrastive Methods (PIRL [37], SimCLR [28], SimCLRv2 [38], MoCo [39]), (ii) Distillation (BYOL [29], SimSiam [40]), (iii) Quantization (SwAV [30], DeepCluster [41]), and (iv) Information Maximization (Barlow Twins [31], VICReg [42]). Of these divisions, this work focuses on contrastive methods. SimCLR [28], SimCLRv2 [38], and MoCo [39] formalize contrastive JEAM by providing frameworks for learning similarity between positive pairs of distorted/transformed views of the same image and dissimilarity between negative pairs from different images. The methods vary in how they manage the large pool of negative pairs, with SimCLR using a larger batch size and MoCo using a momentum encoder. Recently, contrastive JEAM has been tailored for medical images. In MICLe [43], which is based on SimCLR [28], multiple-instance contrastive learning is applied by enabling input views from several image instances of the same patient. Another work of note that makes use of contrastive methods on histopathology is DRL [44], which learns representations based on local structure heterogeneity and global context homogeneity over noise contrastive estimation.
It obtains state-of-the-art performance on the GlaS [45] and PCam [32] datasets. Other applications making use of contrastive JEAM are chest X-ray [46], [47], CT scans for COVID-19 [48], 3D radiomics [49], and radiographs [50]. A work using contrastive JEAM on the BreakHis dataset is SMSE [51], which trains the network using pair and triplet losses. It is clear that contrastive JEAM for feature learning has progressed in histopathology images. However, it suffers from requiring a large amount of data for pre-training and specialized human knowledge to provide priors and annotations. Thus, applying the contrastive JEAM paradigm to small-scale datasets with reduced dependency on human prior is an open challenge and the interest of this work.

III. METHOD

The primary focus of this work is to introduce a novel self-supervised pre-training method that enhances the accuracy of malignancy classification on whole-slide histopathological images (WSI) with limited labels. The aim is to learn representations from data without labels and to use supervision signals from the data, e.g., the magnification factor. To help overcome unlabeled data scarcity, the work aims to effectively utilize inductive transfer from the ImageNet [4] domain. Given that BreakHis [3] is a small-scale and class-imbalanced dataset, this work hypothesizes a constrained case of inductive transfer. A heterogeneous source-domain dataset, i.e., ImageNet [4], is used to train a large-scale multi-class network that disseminates common yet robust knowledge about visual representations to the small-scale target-domain dataset, i.e., BreakHis [3], for malignancy classification. Specifically, the Efficient-net b2 [1] model is used. In this work, the inductive transfer (i) helps to obtain improved performance on the downstream task of malignancy classification and (ii) enables self-supervised pre-training using the proposed method on the small-scale dataset.

The Magnification Prior Contrastive Similarity (MPCS) method formulates self-supervised pre-training to learn representations on microscopic histopathology WSI without labels on small-scale data. The main objective of MPCS is to lower the amount of labeled data needed for the downstream task, addressing the challenges in supervised learning. MPCS constructs pairs of distinct views considering the characteristics of microscopic histopathology WSI (H-WSI) for contrastive-similarity-based pre-training. The structural properties of microscopic H-WSI differ from those of natural macroscopic images [4] (vehicles, cats, or dogs) in terms of location, size, shape, background-foreground, and the concrete definition of objects. Unlike SimCLR [28], where pairs of distinct views of the input image are constructed by human-centered augmentations, MPCS constructs pairs of distinct views using pair sampling methods based on a signal from the data itself, i.e., the magnification factor in BreakHis [3]. Two H-WSI from different magnification factors of the same sample make a pair, as sketched below. Utilizing a prior from the data (the magnification factor) enables meaningful contrastive learning on histopathology H-WSI and reduces dependency on human-inducted prior. Further, tumor-affected regions in H-WSI are characterized by the formation of highly abnormal amounts of nuclei. Such affected regions are prominent in all the H-WSIs of different magnifications for the same sample. Thus, the affected regions being common and size-invariant across the positive pair of a sample allows learning contrastive similarity through region attention.
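To make the pair construction concrete, here is a minimal sketch. It assumes a hypothetical `sample_views` index mapping each sample id to its images at the four BreakHis magnification factors, and a hypothetical `augment` callable; the uniform application of one augmentation instance to both views follows the description in the next paragraph of the paper.

```python
# Minimal sketch of MPCS positive-pair construction (illustrative names, not
# the authors' code). Two views of the SAME sample at two different
# magnification factors form a positive pair; views of other samples in the
# batch serve as negatives.
def make_positive_pair(sample_views, sample_id, mf1, mf2, augment):
    """Return two magnification views of one sample as a positive pair,
    applying the same augmentation instance uniformly to both views."""
    view1 = augment(sample_views[sample_id][mf1])
    view2 = augment(sample_views[sample_id][mf2])
    return view1, view2

# Usage (hypothetical sample id):
# v1, v2 = make_positive_pair(sample_views, "slide_007", "100x", "200x", augment)
```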
MPCS's effectiveness on the small-scale dataset is obtained through inductive transfer of an ImageNet [4] pre-trained model and by applying a uniform instance of augmentation to both views of a pair during self-supervised pre-training. The current work also hypothesizes that reduced human prior in the pre-training method provides enhanced degrees of freedom, which can increase the potential of the network to learn efficient representations in a self-supervised manner. To investigate this, three pair-sampling strategies are formulated based on the level of inducted human prior (HP), defined by the number of human decisions made during pair sampling. As explained in Figure 3, in Fixed Pair, the magnification factors for both views are chosen by a human, constituting a strong human prior. In Ordered Pair, only the second view of the pair is chosen by a human using a look-up table, constituting a weaker human prior. In Random Pair, no human prior is inducted, and the magnification factors for both views are sampled randomly. Further, Figure 4 illustrates the degrees of freedom (DoF) given to the method: the Fixed Pair strategy provides zero DoF, Ordered Pair provides one DoF, and Random Pair provides two DoF.

In MPCS, to formulate a batch of 2N views, a randomly sampled batch of N input sets X = {X^(1), X^(2), ..., X^(N)} is considered. The framework comprises:
• A base encoder whose representations are taken as the output after the respective average pooling layers, shown in step 3 of Figure 2.
• A small-scale MLP projection head g(·) that maps representations to the latent space where the contrastive loss is applied, shown in step 4 of Figure 2. A multi-layer perceptron with a single hidden layer is used to obtain z, where σ is ReLU.
• A contrastive loss function, the normalized temperature-scaled cross-entropy loss (NT-Xent) from SimCLR, defined for a contrastive prediction, shown in step 5 of Figure 2:

\ell_{i,j} = -\log \frac{\exp(\mathrm{sim}(z_i, z_j)/\tau)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp(\mathrm{sim}(z_i, z_k)/\tau)} \quad (1)

In Equation 1, \mathbb{1}_{[k \neq i]} \in \{0, 1\} is an indicator function evaluating to 1 if and only if k ≠ i. Equations 2 and 3 define \mathrm{sim}(u, v) as the dot product between the l2-normalized u and v, and τ denotes the temperature parameter.

Several of the works presented in Table I use a random 70:30 split of training and test data with repeated trials ([8], [13], [18], [19], [22]). A few other works are ambiguous as to whether full k-fold cross-validation or random trials are used ([15], [16], [27]). A careful analysis reveals the downsides of the data-splitting strategy used in most state-of-the-art methods. While this strategy repeats the experiment five times, it provides no theoretical guarantee that test-set data points are mutually exclusive across trials (any two trials are conditionally independent). Moreover, the specifically chosen 70:30 train-test ratio, repeated five times, forces some data points to occur in a test set more frequently than others. Lastly, this strategy does not account for class-level distribution in each trial, which becomes critical when evaluating on class-imbalanced and small-scale datasets. Therefore, this work proposes a 5-fold stratified cross-validation data split scheme. Each fold contains 20% of the data, follows the class distribution of the whole dataset, and is mutually exclusive from the other folds. Four of the five folds are used for training and the remaining one for testing.
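The pair-sampling strategies and the NT-Xent objective above can be summarized in a short sketch. This is a hedged illustration, not the authors' implementation: the contents of the Ordered Pair look-up table and the temperature default are assumptions, and the stratified-split comment shows scikit-learn as one possible realization.

```python
# Illustrative sketch of MPCS pair sampling and NT-Xent (assumed details noted).
import random
import torch
import torch.nn.functional as F

MFS = ["40x", "100x", "200x", "400x"]
LOOKUP = {"40x": "100x", "100x": "200x", "200x": "400x", "400x": "40x"}  # assumed table

def sample_pair(strategy):
    """Choose the two magnification factors for a positive pair."""
    if strategy == "fixed":    # strong human prior, 0 DoF: both views fixed (200x, 400x)
        return "200x", "400x"
    if strategy == "ordered":  # weaker human prior, 1 DoF: second view via look-up table
        mf1 = random.choice(MFS)
        return mf1, LOOKUP[mf1]
    if strategy == "random":   # no human prior, 2 DoF: both views sampled randomly
        mf1, mf2 = random.sample(MFS, 2)
        return mf1, mf2

def nt_xent(z, tau=0.5):  # tau=0.5 is an assumed default, not from the paper
    """NT-Xent (Eq. 1) for 2N projections where z[i] and z[i+N] are the two
    views of the same sample; all other views in the batch are negatives."""
    z = F.normalize(z, dim=1)                  # l2-normalize projections
    logits = z @ z.t() / tau                   # pairwise similarities sim(z_i, z_k)/tau
    logits.fill_diagonal_(float("-inf"))       # indicator 1[k != i] removes self-pairs
    n = z.shape[0] // 2
    targets = torch.arange(z.shape[0], device=z.device).roll(n)  # index of each positive
    return F.cross_entropy(logits, targets)

# The 5-fold stratified split described above can be realized, for example, with:
# from sklearn.model_selection import StratifiedKFold
# folds = StratifiedKFold(n_splits=5, shuffle=True).split(image_paths, labels)
```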
As a general rule, all proposed pre-training methods are initialized from ImageNet [4] and are denoted in Table II as ImageNet → MPCS-X; in the discussion, the shorter names MPCS-X are preferred. All experiments use the Efficient-net b2 [1] architecture, which was chosen based on a performance comparison among architectures. Model training uses inputs resized to 341×341, a learning rate of 1e-05 for pre-training and 2e-05 for fine-tuning, the Adam optimizer, and augmentations (random crop, flip, affine, rotation, and color jitter). Experiments in the Limited Labeled Data Setting (Table II) use only 20% of the labeled data for fine-tuning and 20% for testing. Experiments in the Fully Supervised Setting (Table II) use 80% of the labeled data for fine-tuning and 20% for testing, evaluating performance as supervised learning for an extensive comparison with previous work. This work focuses only on methodological development. Thus, the comparison is performed with all other state-of-the-art methods, excluding those that incorporate (i) fusion of two or more models or ensemble approaches, (ii) task-specific optimized architectures, or (iii) evaluation on very different data folds. The comparison does, however, include previous works using the 70:30 data-split strategy repeated five times. Statistical significance was calculated between the proposed method and the ImageNet approach in the limited labeled data setting; it could not be computed for the fully supervised setting because fold-wise results are unavailable for previous works.

This work evaluates MPCS, which leverages a magnification prior in histopathology images and employs a reduced human prior by providing more degrees of freedom. The results show that this method is beneficial for self-supervised pre-training. Table III compares the performance of Exp-1 to Exp-4 in Table II, where only 20% of the labeled data is used for supervised fine-tuning. The MPCS pre-trained models obtain a significant (across magnifications, p < 0.05) improvement of (1.55 ∼ 2.52)% over the ImageNet transfer learning method for all four magnifications, showing label efficiency. Additionally, the results are competitive with state-of-the-art methods that were trained with 70%-80% labeled data. Table IV compares the performance of Exp-5 to Exp-7 in Table II, where 80% of the labeled data is used for supervised fine-tuning. Additionally, the table shows the previous works on the dataset. The MPCS pre-trained models obtain higher accuracy than the ImageNet transfer learning method for all magnification levels and are competitive with state-of-the-art methods, outperforming several methods by (3.5 ∼ 8.0)% higher accuracy in malignancy classification. In addition to the evaluation metrics proposed by the original work, several previous works ([18], [52]) have evaluated how well models generalize to unseen magnification factors. A cross-magnification evaluation is performed for completeness, and results can be found in the supplementary material.

A. Self-supervised methods demonstrate significant label efficiency

The first observation from the results in Table III is that the self-supervised method MPCS obtains proportionally bigger margins when fine-tuning with fewer labels (only 20%). All variants of the proposed self-supervised models significantly outperform (across magnifications, p < 0.05) the ImageNet transfer learning model on all magnification factors (40×, 100×, 200×, and 400×) for both evaluation criteria.
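The fold-wise significance claims above can be checked with a paired test over per-fold results. The paper reports p-values but this excerpt does not name the exact test, so a paired t-test is shown as one common choice; the accuracy values below are made-up placeholders.

```python
# Sketch of a fold-wise significance check between MPCS and the ImageNet
# baseline. The test choice and the numbers are assumptions for illustration.
from scipy import stats

mpcs_pla     = [0.921, 0.933, 0.928, 0.917, 0.925]  # per-fold patient-level accuracy (made up)
imagenet_pla = [0.902, 0.909, 0.911, 0.895, 0.904]  # per-fold baseline accuracy (made up)

t_stat, p_value = stats.ttest_rel(mpcs_pla, imagenet_pla)  # paired test over the same folds
print(f"paired t-test: t={t_stat:.2f}, p={p_value:.4f}")   # significant if p < 0.05
```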
Specifically, the MPCS-Ordered Pair method demonstrates robust representation learning capability during fine-tuning in the limited labeled data setting (20% labels), with a significant (p < 0.01) improvement of (2.52 ± 0.02)% in patient-level accuracy over the ImageNet model and similar results on image-level accuracy. It matches the performance of the state-of-the-art methods from the fully supervised learning setting in Table IV for both evaluation criteria. Further, it is noticeable that MPCS pre-trained models also outperform the ImageNet model with a larger fraction of labels in the fully supervised learning setting, as shown in Table IV.

B. Supervision signal from data together with transfer learning enables self-supervision on small-scale datasets

Self-supervised learning methods ([28], [39], [40], & [29]) in the computer vision domain often use millions of images from ImageNet [4] to learn. Going against this trend, the MPCS method for self-supervised learning enables representation learning on the small-scale dataset by using a supervision signal, the magnification factor, in view-pair construction. However, the model evaluated in this work still used large-scale data through inductive transfer of Efficient-net b2 parameters pre-trained on ImageNet, which further helps to scale pre-training.

C. Reduced human prior leads to efficient representation learning

The MPCS-Ordered Pair strategy inducts a weaker human prior in pair sampling; thus, the MPCS method obtains one DoF for randomly choosing the first input view. In comparison, MPCS-Fixed Pair, which inducts a stronger human prior by choosing both views (200× and 400×) by human decision, gives zero DoF to the MPCS method. In MPCS-Random Pair, the MPCS method obtains the highest degrees of freedom, since human prior is absent. The results in Table III and Figure 5 consistently show that pair sampling with a weaker human prior outperforms both extreme cases: a stronger human prior and its complete absence. Thus, the current results support the hypothesis that human prior should be reduced for efficient representation learning. However, this requires further investigation on different datasets and tasks.

D. Comparing the proposed method with other state-of-the-art methods

MPCS demonstrates strong performance on BreakHis for both the 20% and 80% labeled data cases. Table IV compares MPCS with previous works on image-level and patient-level accuracy and highlights the top two results for each. MPCS-Random Pair and MPCS-Ordered Pair achieve the top two results for both image-level and patient-level accuracy when averaged across magnifications, with an increase of around three percentage points over the previous best results [51]. As for magnification-specific performance, Ordered Pair and Random Pair achieve the top two results in all but two categories and obtain the top result in all but one. It is also worth mentioning that, in comparison with other works that do not follow repeated folds and report higher accuracy, specific fold results for MPCS-Ordered Pair and MPCS-Random Pair also reach patient-level accuracies of 98.82% and 96.99%, respectively.

In conclusion, this work proposed a novel self-supervised pre-training method, Magnification Prior Contrastive Similarity, which enables self-supervised pre-training on the comparatively small-scale BreakHis [3] dataset for efficient representation learning by exploiting supervision signals from the data. Future work includes investigating cross-domain adaptation, utilizing unsupervised and self-supervised transfer learning, and further validating the human-prior hypothesis.
REFERENCES

[1] Efficientnet: Rethinking model scaling for convolutional neural networks
[2] Machine learning methods for histopathological image analysis
[3] A dataset for breast cancer histopathological image classification
[4] Imagenet: A large-scale hierarchical image database
[5] Scannet: A fast and dense scanning framework for metastatic breast cancer detection from whole-slide image
[6] Deep features for breast cancer histopathological image classification
[7] Breast cancer image classification via multi-network features and dual-network orthogonal low-rank learning
[8] Breast cancer histopathological image classification using convolutional neural networks
[9] Imagenet classification with deep convolutional neural networks
[10] Deep learning for magnification independent breast cancer histopathology image classification
[11] Data augmentation for histopathological images based on gaussian-laplacian pyramid blending
[12] Using filter banks in convolutional neural networks for texture classification
[13] Multiple instance learning for histopathological breast cancer image classification
[14] Integrated segmentation and recognition of hand-printed numerals
[15] Breast cancer histopathology image classification and localization using multiple instance learning
[16] Breast cancer detection from histopathology images using modified residual neural networks
[17] Deep residual learning for image recognition
[18] Breast cancer histopathological image classification: is magnification important
[19] Sequential modeling of deep features for breast cancer histopathological image classification
[20] Densely connected convolutional networks
[21] Xgboost: Reliable large-scale tree boosting system
[22] Partially-independent framework for breast cancer histopathological image classification
[23] Transfer learning based histopathologic image classification for breast cancer detection
[24] Very deep convolutional networks for large-scale image recognition
[25] Classification of breast cancer histopathological images using discriminative patches screened by generative adversarial networks
[26] Unsupervised anomaly detection with generative adversarial networks to guide marker discovery
[27] Breakhis based breast cancer automatic diagnosis using deep learning: Taxonomy, survey and insights
[28] A simple framework for contrastive learning of visual representations
[29] Bootstrap Your Own Latent: A new approach to self-supervised learning
[30] Unsupervised learning of visual features by contrasting cluster assignments
[31] Barlow twins: Self-supervised learning via redundancy reduction
[32] Rotation equivariant CNNs for digital pathology
[33] Unsupervised learning of visual representations by solving jigsaw puzzles
[34] Unsupervised visual representation learning by context prediction
[35] Auto-encoding variational bayes
[36] Generative adversarial nets
[37] Self-supervised learning of pretext-invariant representations
[38] Big self-supervised models are strong semi-supervised learners
[39] Momentum contrast for unsupervised visual representation learning
[40] Exploring simple siamese representation learning
[41] Deep clustering for unsupervised learning of visual features
[42] Vicreg: Variance-invariance-covariance regularization for self-supervised learning
[43] Big self-supervised models advance medical image classification
[44] Data-efficient histopathology image analysis with deformation representation learning
[45] Gland segmentation in colon histology images: The GlaS challenge contest
[46] MoCo pretraining improves representation and transferability of chest X-ray models
[47] Align, attend and locate: Chest X-ray diagnosis via contrast induced attention network with limited supervision
[48] Sample-efficient deep learning for COVID-19 diagnosis based on CT scans
[49] Imbalance-aware self-supervised learning for 3D radiomic representations
[50] Comparing to learn: Surpassing ImageNet pretraining on radiographs by comparing image representations
[51] Magnification-independent histopathological image classification with similarity-based multi-scale embeddings
[52] Magnification generalization for histopathology image embedding

APPENDIX

A. Self-supervised methods learn magnification-invariant representations

The MPCS method not only outperforms on the magnification-specific tasks stated in Tables III and IV; the representations learned through the proposed method also demonstrate a consistent edge in classification performance over the ImageNet model in cross-magnification evaluation. Table VI evaluates the mean performance of models trained on all magnifications except the one on which evaluation is performed (type-2 mean cross-magnification evaluation). Interestingly, type-2 cross-magnification evaluation also shows similar trends, except at 400×, where the ImageNet model obtained high PLA performance. Empirical analysis of type-1 and type-2 cross-magnification evaluation suggests that MPCS self-supervised pre-trained models perform better than the ImageNet model by learning magnification-invariant representations.
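For clarity, the type-2 protocol described above can be sketched as follows: for each target magnification, models trained on each of the other magnifications are evaluated on the target and their scores averaged. `train_model` and `evaluate` are hypothetical helpers, and `data_by_mag` is an assumed mapping from magnification to train/test splits.

```python
# Sketch of type-2 mean cross-magnification evaluation (illustrative names).
MAGS = ["40x", "100x", "200x", "400x"]

def type2_cross_magnification(train_model, evaluate, data_by_mag):
    results = {}
    for target in MAGS:
        scores = []
        for source in MAGS:
            if source == target:
                continue  # exclude the evaluation magnification from training
            model = train_model(data_by_mag[source]["train"])
            scores.append(evaluate(model, data_by_mag[target]["test"]))
        results[target] = sum(scores) / len(scores)  # mean over source-trained models
    return results
```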