key: cord-1010685-aln3yap8 authors: Liu, Zhizhe; Zhu, Zhenfeng; Zheng, Shuai; Liu, Yang; Zhou, Jiayu; Zhao, Yao title: Margin Preserving Self-paced Contrastive Learning Towards Domain Adaptation for Medical Image Segmentation date: 2021-03-15 journal: IEEE journal of biomedical and health informatics DOI: 10.1109/jbhi.2022.3140853 sha: d574727d0e60d83b3efba5889af55ca93f81da9e doc_id: 1010685 cord_uid: aln3yap8

To bridge the gap between the source and target domains in unsupervised domain adaptation (UDA), the most common strategy focuses on matching the marginal distributions in the feature space through adversarial learning. However, such category-agnostic global alignment fails to exploit the class-level joint distributions, leaving the aligned distributions less discriminative. To address this issue, we propose in this paper a novel Margin Preserving Self-paced Contrastive Learning (MPSCL) model for cross-modal medical image segmentation. Unlike the conventional construction of contrastive pairs in contrastive learning, the domain-adaptive category prototypes are utilized to constitute the positive and negative sample pairs. With the guidance of progressively refined semantic prototypes, a novel margin preserving contrastive loss is proposed to boost the discriminability of the embedded representation space. To enhance the supervision for contrastive learning, more informative pseudo-labels are generated in the target domain in a self-paced way, thus benefiting the category-aware distribution alignment for UDA. Furthermore, the domain-invariant representations are learned through joint contrastive learning between the two domains. Extensive experiments on cross-modal cardiac segmentation tasks demonstrate that MPSCL significantly improves semantic segmentation performance, and outperforms a wide variety of state-of-the-art methods by a large margin.

Fig. 1. Category-agnostic vs. the proposed category-aware domain alignment. Top: previous category-agnostic domain alignment methods aim to align the global marginal distributions but ignore the semantic consistency. Bottom: the proposed margin preserving contrastive learning method for category-aware feature alignment. Obviously, we can boost the inter-class difference and reduce the intra-class variation by constructing enough positive and negative pairs.

The success of deep neural networks (DNNs) is generally built on the assumption that enough labeled data is available for the target task. However, such an assumption is seriously limited in many real-world clinical scenarios. Taking the recent outbreak of epidemic as an example, we have been facing a global health crisis, i.e., the pandemic of a novel Coronavirus Disease (COVID-19) [5], [6], since December 2019. Due to the high cost of annotation and the urgent work of doctors combating the pandemic, there is not enough annotated data to train a well-performing DNN. One of the most common solutions is to train a model on a label-rich domain (named the source) and then generalize it to a label-lacking domain (named the target). However, owing to the significant distribution gap between the two domains (i.e., the domain shift problem), the trained model usually suffers a sharp drop in performance when applied to the target domain. Many approaches based on unsupervised domain adaptation (UDA) [7]-[12] have recently been proposed to better transfer the knowledge learned from the source domain to the target.
Most previous methods employ representation learning based on some distance metric (e.g., the maximum mean discrepancy (MMD) [7]) or adversarial learning [11], [12] to match the marginal distributions between the two domains, while ignoring the semantic consistency, as shown at the top of Fig. 1.

Table I. Comparison with state-of-the-art UDA methods.
Method | Local category alignment | Inter-category separability
AdaOutput [9] | no | no
AdvEnt [10] | no | no
CLAN [13] | yes | no
CAG [14] | yes | no
IntraDA [11] | yes | no
SIFAv2 [4] | no | no
MPSCL (ours) | yes | yes

Here, semantic consistency means that, after domain adaptation alignment, the distributions of the same category from different domains should be identical in the embedding space, while the distributions of different categories can be easily distinguished. To avoid semantic confusion between the two domains, some methods [15], [16] tried to generate pseudo-labels for the target data by self-training, providing more powerful supervision for classifier training. As an anchor-guided UDA model for semantic segmentation, [14] facilitated both category-wise domain alignment and self-training in an explicit way. Despite the category-wise domain alignment, the local semantic structure in the embedding space was not adequately considered in [14], thus ignoring the inter-category separability. As a result, the inter-category difference, which is crucial for dense pixel-wise prediction tasks, will not be sufficiently boosted. Fortunately, it has been shown in some recent works [17]-[19] that contrastive learning can help to learn powerful representations by taking a closer look at both the inter-category and intra-category distributions. We aim to address the semantic inconsistency problem while enhancing both intra-category compactness and inter-category separability. Fig. 1 illustrates a comparison between category-agnostic models and our proposed category-aware domain alignment. In contrast to category-agnostic models, our model better preserves the inter-category margin by constructing enough semantic-prototype-induced contrastive pairs for contrastive learning, while reducing the intra-category variation.

When applying self-training to achieve category-aware feature alignment, we need to generate pseudo-labels for the target data so as to match the joint distributions between the two domains. A straightforward way [15], [16] is to first generate pixel-wise predictions of the target data using a classifier trained on the source data. Then, following the self-paced learning scheme, which has been found effective for gradually learning a robust model [20]-[22], a suitable selection strategy is explored to remove error-prone predictions and generate the final pseudo-labels. However, since the target samples usually contain hard-adapted regions, especially around boundary regions, such a classifier generally produces unreliable yet overconfident pixel predictions. As shown in Fig. 2, not only are incorrect predictions generated, but they also take low entropy values, i.e., they are overconfident. Thus, it is clearly not a trivial task to design a suitable selection strategy that avoids choosing hard-adapted regions as candidates for pseudo-labeling. Intuitively, however, the visual representations belonging to the same category in the source domain and in well-adapted target regions usually have high similarity. This means that the well-adapted pixel regions of the target domain can be identified in the embedding space. Motivated by the observations above, we aim to develop in this paper a contrastive learning framework for cross-modal medical image segmentation, in which the semantic prototypes and pseudo-labels are fully exploited.
In particular, we emphasize our contributions as follows:
- We propose a novel Margin Preserving Self-paced Contrastive Learning (MPSCL) framework to tackle cross-modal medical image segmentation. To the best of our knowledge, this is the first attempt to apply contrastive learning to UDA in medical image analysis.
- Different from the traditional construction of contrastive pairs in contrastive learning, the domain-adaptive semantic prototypes, which are based on the prior knowledge of the source domain, are exploited to bridge the two domains and constitute the positive and negative pairs for contrastive learning.
- Induced by the progressively refined semantic prototypes, a novel margin preserving contrastive loss is proposed to boost the discriminability of the visual representations in the embedding space. Meanwhile, the domain-invariant representations are learned via joint contrastive learning between the two domains.
- To perform contrastive learning in the target domain without prior labels available, more informative pseudo-labels are generated for the target domain via a self-paced scheme, which further benefits the category-aware feature alignment.

Unsupervised domain adaptation aims to alleviate the domain shift problem between the source and target domains. In terms of how the gap between the two domains is bridged, the existing UDA methods can be divided into three categories. The first group addresses the above issue by transforming the image appearance between the two domains [4], [23]. For example, with the success of CycleGAN [24] in unpaired image-to-image translation, Chen et al. [4] proposed to transform the labeled source MRI images to the appearance of the target CT images, and then utilized the synthesized target-like images to train a segmentation model. Different from the approaches based on image alignment, the second stream chooses to bridge the distribution gap between the two domains in the feature space [7], [13], [25], [26]. Specifically, benefiting from the advances of generative adversarial networks [27], which have been widely used in representation learning [28], [29], some methods [13], [26] have focused on learning domain-invariant representations by a minimax game between a generator and a discriminator. Inspired by the fact that the segmentation outputs of images from the two domains should have considerable similarities, e.g., in spatial layout and local context, many recent methods [9], [10], [30] tend to perform structure adaptation between the two domains at the output level. Working along this line, [10] proposed entropy-based adversarial learning to penalize low-confident predictions on the target domain. The differences between our MPSCL and state-of-the-art UDA methods are summarized in Tab. I. It can be seen that: i) different from some global domain alignment methods (e.g., AdvEnt [10] and SIFAv2 [4]), we also conduct local category alignment, which can further improve the transferability of the learned model;
ii) compared with some local domain alignment methods (e.g., CLAN [13] and CAG [14]), our model enhances not only the intra-category compactness but also the inter-category separability, thus making the learned representations more discriminative.

Contrastive learning aims at learning an embedding representation space by maximizing the similarity of positive data pairs and the dissimilarity of negative ones, and has been extensively used in metric learning [31] and self-supervised learning (SSL) [17], [18], [32]. In the SSL setting, where the supervised information of the training data is unavailable, contrastive learning focuses on learning an invariant representation space by designing various pretext tasks based on data transformations (e.g., rotation, cropping, and color jittering) [17], [18]. Recently, Khosla et al. [19] extended the contrastive loss to supervised training; owing to the exploration of local semantic structures, it is able to learn more powerful representations. Although contrastive learning has achieved impressive results in representation learning, it still performs poorly in cross-modal medical semantic segmentation, where the significant distribution gap between the two domains remains a hard nut to crack. Concretely, since the supervision of the target domain is unavailable, it fails to directly bridge the two domains at the semantic level and to construct enough contrastive pairs. Thus, different from the way paired data are constructed in SSL, the domain-adaptive prototypes are utilized in our MPSCL to serve as category anchors, guiding the construction of contrastive pairs in the feature space for contrastive learning.

Self-training, which typically includes a teacher-student framework, uses a good teacher model trained on the labeled data to assign pseudo-labels to the unlabeled data, and then utilizes human labels and pseudo-labels to jointly train a student model. In deep learning, self-training has received increasing interest because it dramatically reduces the cost of data labeling (e.g., in image classification [33], machine translation [34], and speech recognition [35]). Recently, some UDA works [11], [14], [15] have attempted to generate pseudo-labels for the target data to achieve category-aware domain alignment. For example, [14] leveraged an anchor-based pixel-level distance loss to match the joint distributions between the two domains in the feature space by self-training. However, since it fails to make full use of the local semantic structure information, the knowledge learned from the source domain does not generalize well to the target domain. Typically, following the self-paced learning scheme [20]-[22], which learns a more robust model by introducing a regularizer term, these UDA methods also apply an 'easy-to-hard' training scheme, starting the training process with the most confident pseudo-labels.

The key notations used throughout this paper are summarized in Table II.

[Table II. Key notations: $x_n^{src}, y_n^{src}$ denote the image and ground-truth label of the n-th source sample, with $x_n^{src} \in \mathbb{R}^{H\times W\times 1}$ and $y_n^{src} \in \mathbb{R}^{H\times W\times L}$; $x_n^{trg}, \hat{y}_n^{k*}$ denote the image and the k-th pseudo-label of the n-th target sample, with $x_n^{trg} \in \mathbb{R}^{H\times W\times 1}$ and $\hat{y}_n^{k*} \in \mathbb{R}^{H\times W\times L}$; $y_n^{src}[l;i]$ is the ground-truth indicator that the l-th pixel of $x_n^{src}$ belongs to the i-th category, with $l \in \{1,\dots,H\times W\}$.]

Given a labeled source dataset $\{X^{src}, Y^{src}\}$, a semantic segmentation model aims to learn a mapping $\mathcal{F}$ from the image domain $X^{src}$ to the label domain $Y^{src}$:
$$\mathcal{F}: X^{src} \rightarrow Y^{src}. \tag{1}$$
Specifically, the mapping function $\mathcal{F}$ can be obtained by minimizing a hybrid loss $\mathcal{L}_{Seg}$ that is generally defined as:
$$\mathcal{L}_{Seg} = \frac{1}{N_{src}} \sum_{n=1}^{N_{src}} \left( \mathcal{L}_{CE}^{n} + \mathcal{L}_{Dice}^{n} \right), \tag{2}$$
where $p_n^{src} \in \mathbb{R}^{H\times W\times L}$ denotes the pixel-wise prediction by $\mathcal{F}$, and $L$ is the number of categories. Here, the first term $\mathcal{L}_{CE}^{n}$ is the weighted cross-entropy loss for pixel-level classification, and the second term $\mathcal{L}_{Dice}^{n}$ is the Dice loss, which is commonly applied in medical image segmentation tasks with multiple organ structures. As for the design of the hybrid loss $\mathcal{L}_{Seg}$, the central point is how to tackle the class imbalance in medical image segmentation [12].
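To make Eq. (2) concrete, here is a minimal PyTorch-style sketch of such a hybrid loss. The function name, the optional class weights, and the smoothing constant are illustrative assumptions, not the authors' released implementation.

```python
# A hedged sketch of the hybrid segmentation loss in Eq. (2):
# weighted cross-entropy plus a Dice term over L categories.
import torch
import torch.nn.functional as F

def hybrid_seg_loss(logits, target, class_weights=None, smooth=1e-5):
    """logits: (B, L, H, W) raw predictions; target: (B, H, W) category indices."""
    # Weighted cross-entropy for pixel-level classification (L_CE).
    ce = F.cross_entropy(logits, target, weight=class_weights)

    # Dice loss (L_Dice) to counteract class imbalance across organ structures.
    probs = torch.softmax(logits, dim=1)                       # (B, L, H, W)
    one_hot = F.one_hot(target, probs.size(1)).permute(0, 3, 1, 2).float()
    inter = (probs * one_hot).sum(dim=(2, 3))
    union = probs.sum(dim=(2, 3)) + one_hot.sum(dim=(2, 3))
    dice = 1.0 - ((2.0 * inter + smooth) / (union + smooth)).mean()

    return ce + dice
```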
Generally, a model trained on the source domain $X^{src}$ is hard to directly generalize to the target domain due to the significant distribution discrepancies between the two domains. Recently, several UDA methods have been proposed to bridge this gap, which can be formulated as:
$$\mathcal{F}_{uda}: X^{trg} \rightarrow Y^{trg}, \tag{3}$$
where $\mathcal{F}_{uda}$ is trained on the labeled source domain $\{X^{src}, Y^{src}\}$ and the unlabeled target domain $\{X^{trg}\}$. Typically, the mapping function $\mathcal{F}_{uda}$ aims to learn a domain-invariant representation space by distilling transferable knowledge from the source domain. Here, $Y^{src}$ and $Y^{trg}$ are assumed to be identical, as in the general setting.

As shown in Fig. 3, the overall framework of the MPSCL model mainly contains four components, i.e., a generative adversarial network, domain-adaptive category prototypes, self-paced pseudo-labels, and margin preserving contrastive learning.
• Generative Adversarial Network: the generative adversarial network is utilized as the backbone alignment network to promote category-aware alignment between the two domains. In particular, the generator contains three branches, one of which generates the predicted masks of the source domain for supervised learning, while the other two produce the weighted self-information maps of the source and target domains for adversarial learning.
• Domain-adaptive Category Prototypes: the category prototypes are exploited to constitute the contrastive pairs for the joint contrastive learning between the two domains. To make these prototypes well domain-adaptive, they are refined in a progressive way during model training.
• Self-paced Pseudo-labels: in order to conduct contrastive learning in the target domain without prior labels available, informative self-paced pseudo-labels are generated for the target data to provide extra supervision.
• Margin Preserving Contrastive Learning: to boost the discriminability of the representations produced by the generator, a novel margin preserving contrastive loss is proposed. While ensuring tight clustering within categories, the difference between categories is maximized.

Due to the significant distribution discrepancies between the two domains, the traditional ways of constructing contrastive pairs (e.g., rotation, cropping, and color jittering) are not suitable for cross-domain contrastive learning. To this end, the domain-adaptive category prototypes are exploited to construct contrastive pairs. Specifically, to obtain representative prototypes, we first initialize the category prototypes with the category centers of the initial source pixel features:
$$c_i^{(0)} = \frac{1}{|N_i|} \sum_{n=1}^{N_{src}} \sum_{l=1}^{H\times W} y_n^{src}[l;i]\, f_n^{src}[l], \tag{4}$$
where $f_n^{src}[l]$ is the feature of the l-th pixel of $x_n^{src}$, $|N_i|$ denotes the number of pixels belonging to the i-th category in the source domain, i.e., $|N_i| = \sum_{n=1}^{N_{src}} \sum_{l=1}^{H\times W} y_n^{src}[l;i]$, and $y_n^{src}[l;i] = 1$ if the l-th pixel of $x_n^{src}$ belongs to the i-th category. To make the prototypes receive more pseudo-supervision from the target domain during training, and thus obtain better cross-domain adaptability, we update them in a progressive refinement manner at each iteration.
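As a sketch of the initialization in Eq. (4), assuming pixel features of shape (N, D, H, W) and one-hot labels of shape (N, L, H, W) (both hypothetical layout conventions), each prototype starts as the mean source feature of its category:

```python
# A minimal sketch of prototype initialization (Eq. (4)).
import torch

def init_prototypes(feats, one_hot_labels, eps=1e-8):
    """feats: (N, D, H, W); one_hot_labels: (N, L, H, W) -> prototypes (L, D)."""
    # Sum the features of every pixel belonging to category i over all source images.
    num = torch.einsum('ndhw,nlhw->ld', feats, one_hot_labels)
    # |N_i|: the number of source pixels labeled with category i.
    cnt = one_hot_labels.sum(dim=(0, 2, 3)).unsqueeze(1)      # (L, 1)
    return num / (cnt + eps)                                   # c_i^(0), i = 1..L
```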
For the category prototypes $C^{(k)}$ at the k-th iteration, the i-th category prototype $c_i^{(k)}$ is refined with the mean vector of the pixel features belonging to the i-th category in the mini-batch:
$$c_i^{(k)} = (1-\alpha)\, c_i^{(k-1)} + \alpha\, \frac{1}{|B_i|} \sum_{n \in B} \sum_{l} y_n[l;i]\, f_n[l], \tag{5}$$
where $B$ denotes the mini-batch of size batchsize, and $|B_i|$ denotes the number of pixels belonging to the i-th category in it. $\alpha \in [0,1]$ is a momentum coefficient for moving the semantic category prototypes, and is empirically set to 0.2. Meanwhile, for both the source and target domains, the category prototypes are regarded as category anchors and form contrastive pairs with each pixel feature. Then, joint cross-domain contrastive learning is performed to learn domain-invariant representations.

Fig. 4. Illustration of the self-paced pseudo-labels. $C^{(k)}$ represents the domain-adaptive category anchor set at the k-th iteration.

To conduct supervised contrastive learning in the target domain without prior labels available, we borrow the idea of self-training and generate pseudo-labels for the target samples. As shown in Fig. 2, the classifier of the generative network G usually outputs incorrect and overconfident predictions on the target domain, especially in the early stage of model training. Obviously, assigning pseudo-labels directly based on these unreliable predictions is inevitably risky. To avoid selecting pixels with error-prone predictions, we propose a self-paced pseudo-label assignment approach in the embedding space, which is mainly based on the assumption that well-adapted pixel features are close to the prototype of the same category and far from the others. As shown in Fig. 4, a self-paced selection strategy is presented that follows an 'easy-to-hard' scheme to capture those well-adapted pixels. Given the l-th pixel feature $f_n^{trg}[l]$, the confidence scores $\{r_n^{(k)}[l;i] \mid i=1,\cdots,L\}$ are first obtained by using the cosine similarity:
$$r_n^{(k)}[l;i] = \frac{\left\langle f_n^{trg}[l],\, c_i^{(k)} \right\rangle}{\big\| f_n^{trg}[l] \big\|_2\, \big\| c_i^{(k)} \big\|_2}, \tag{6}$$
where $\|\cdot\|_2$ denotes the $L_2$ norm. Let us sort $\{r_n^{(k)}[l;i] \mid i=1,\cdots,L\}$ in descending order and denote by $I_1$ and $I_2$ the indices corresponding to the maximum and submaximum confidence scores, respectively. The final pseudo-labels are then generated as follows:
$$\hat{y}_l = \begin{cases} I_1, & \text{if } r_n^{(k)}[l;I_1] - r_n^{(k)}[l;I_2] \geq \delta_{th}, \\ \text{ignored}, & \text{otherwise}, \end{cases} \tag{7}$$
where $\hat{y}_l$ denotes the predicted category index of the l-th pixel, and $\delta_{th}$ is a pre-defined threshold to remove hard-adapted regions. In practice, what is more desirable for assigning pseudo-labels is to seek those informative samples. To measure whether a sample is informative or not, the confidence difference $R_n^{(k)}[l] = r_n^{(k)}[l;I_1] - r_n^{(k)}[l;I_2]$ is adopted in Eq. (7) to characterize the significance associated with it. Correspondingly, by setting a threshold on the confidence difference, a selection mask of interest regions can be established.
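Putting Eqs. (5)-(7) together, below is a hedged PyTorch-style sketch of the prototype refinement and the self-paced pseudo-label assignment. The tensor layouts, the masking of categories absent from the mini-batch, and the ignore-index convention are assumptions for illustration, not the authors' code.

```python
# A sketch of prototype refinement (Eq. (5)) and self-paced pseudo-labels (Eqs. (6)-(7)).
import torch
import torch.nn.functional as F

ALPHA, DELTA_TH, IGNORE = 0.2, 0.25, 255   # momentum, threshold, unselected pixels

def update_prototypes(protos, batch_feats, batch_one_hot, eps=1e-8):
    """Eq. (5): move each prototype toward its mini-batch category mean."""
    num = torch.einsum('ndhw,nlhw->ld', batch_feats, batch_one_hot)
    cnt = batch_one_hot.sum(dim=(0, 2, 3)).unsqueeze(1)        # (L, 1)
    batch_mean = num / (cnt + eps)
    present = (cnt > 0).float()             # only refine categories seen in the batch
    return protos * (1 - ALPHA * present) + batch_mean * ALPHA * present

def self_paced_pseudo_labels(feats, protos):
    """Eqs. (6)-(7): cosine confidence scores, then keep pixels whose
    maximum-vs-submaximum confidence difference exceeds delta_th."""
    f = F.normalize(feats, dim=1)                              # (N, D, H, W)
    c = F.normalize(protos, dim=1)                             # (L, D)
    r = torch.einsum('ndhw,ld->nlhw', f, c)                    # cosine scores
    top2, idx = r.topk(2, dim=1)                               # max and submax
    confident = (top2[:, 0] - top2[:, 1]) >= DELTA_TH          # selection mask
    labels = torch.where(confident, idx[:, 0], torch.full_like(idx[:, 0], IGNORE))
    return labels, confident
```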
Since the pseudo-label generation is essentially induced by the category prototypes, the progressive refinement of the prototypes as in Eq. (5) allows more reliable and informative pseudo-labels to be generated by means of self-pacing, thus strengthening the supervision for contrastive learning in the target domain.

Fig. 5. Illustration of the margin preserving contrastive loss. The feature of each pixel ($f_n^{src}[l]$ or $f_n^{trg}[l]$) forms positive pairs with the prototypes of the same category and negative pairs with the prototypes of different categories. Meanwhile, to enhance the separability between categories while reducing the variation within categories, we introduce a deviation angle, i.e., $m$ in Eq. (8), as a penalty on the positive pair (i.e., the pair between a pixel feature and its positive category prototype). Finally, after multiple rounds of self-refinement of the category prototypes, the learned representations possess a distinct inter-category margin and therefore become more discriminative.

To boost the discriminability of the representations, cross-domain contrastive learning is proposed to encourage representations belonging to the same category to move closer together and away from other categories. An intuitive way to achieve this goal is a mean square loss as in [14]; however, the discriminability in [14] cannot be sufficiently preserved, since only the intra-category compactness is considered. To ensure tight clustering within categories while maximizing the difference between categories, a novel margin preserving contrastive loss is proposed. As the geometric explanation in Fig. 5 shows, there is usually small separability between categories as well as large variability within categories. To tackle this issue, we introduce a deviation angle as a penalty on the positive anchor for margin preserving. Specifically, the margin preserving contrastive loss of the source (or target) domain is defined as:
$$\mathcal{L}_{C}^{n} = -\frac{1}{H\times W}\sum_{l=1}^{H\times W} \log \frac{e^{\cos\left(\theta_n^{(k)}[l;y_l]+m\right)/\tau}}{e^{\cos\left(\theta_n^{(k)}[l;y_l]+m\right)/\tau} + \sum_{i\neq y_l} e^{\cos\left(\theta_n^{(k)}[l;i]\right)/\tau}}, \tag{8}$$
where $\theta_n^{(k)}[l;i]$ is the angle between the l-th pixel feature and the i-th category prototype, $y_l$ is the (pseudo-)label of the l-th pixel, and $\cos(\theta_n^{(k)}[l;y_l]+m)$ imposes the deviation angle $m$ for margin preserving on the positive prototype. The temperature $\tau$ is set to avoid overfitting [36], and $\tau = 1$ is a common setting. As we can see from Fig. 5, after some iterations of backpropagation, the inter-category difference becomes evident once the margin preserving deviation angle penalty is incorporated. In fact, although the proposed margin preserving contrastive loss is similar to that of [37], it is induced by the domain-adaptive category prototypes to tackle the cross-domain adaptation problem.

To promote the consistency in spatial layout and local context in the output space, the generative adversarial network is utilized to generate semantic masks with similar structure between the two domains. Following [10], we also perform structure adaptation by minimizing the entropy via adversarial learning. For each source image $x_n^{src}$ or target image $x_n^{trg}$, as shown in Fig. 3, the output of the generator is used to produce a weighted self-information map $I_{x_n} \in \mathbb{R}^{H\times W\times L}$ as the discriminator input. Here, $I_{x_n}$ is composed of pixel-level vectors $I_{x_n}[l;i] = -p_n[l;i] \log p_n[l;i]$, where $i=1,\cdots,L$, and $p_n[l;i]$ is the predicted probability of the l-th pixel belonging to the i-th category. Similar to [27], we let the discriminator D distinguish whether the input comes from the source or the target domain, while the generator G is trained to fool the discriminator D. Specifically, letting $\mathcal{L}_{B}$ denote the binary cross-entropy domain classification loss, the objective function to train the discriminator can be defined as:
$$\mathcal{L}_{D} = \mathcal{L}_{B}\left(D\left(I_{x_n^{src}}\right), 1\right) + \mathcal{L}_{B}\left(D\left(I_{x_n^{trg}}\right), 0\right), \tag{9}$$
and the adversarial loss of the generator G is:
$$\mathcal{L}_{adv} = \mathcal{L}_{B}\left(D\left(I_{x_n^{trg}}\right), 1\right). \tag{10}$$
Thus, combining Eq. (2), Eq. (8), and Eq. (10), the total optimization loss of the generator G is derived as:
$$\mathcal{L}_{G} = \mathcal{L}_{Seg} + \gamma\, \mathcal{L}_{C_{src}}^{n} + \beta\, \mathcal{L}_{C_{trg}}^{n} + \lambda\, \mathcal{L}_{adv}, \tag{11}$$
where $\mathcal{L}_{C_{src}}^{n}$ and $\mathcal{L}_{C_{trg}}^{n}$ denote the margin preserving contrastive losses of the source and target domains, $\{\gamma, \beta\}$ represent the corresponding weight factors of the two domains, and $\lambda$ denotes the weight factor of the adversarial term $\mathcal{L}_{adv}$.
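Below is a minimal sketch of the margin preserving contrastive loss as reconstructed in Eq. (8), in the spirit of an additive angular margin [37], together with the weighted self-information map fed to the discriminator. The shapes, the ignore-index convention, and the numerical clamping are assumptions; in this framework the loss would be applied per domain, with ground-truth labels on the source and self-paced pseudo-labels on the target.

```python
# A hedged sketch of the margin preserving contrastive loss (Eq. (8))
# and the weighted self-information map used for adversarial alignment.
import torch
import torch.nn.functional as F

def margin_contrastive_loss(feats, labels, protos, m=0.4, tau=1.0, ignore=255):
    """feats: (P, D) pixel features; labels: (P,) (pseudo-)labels; protos: (L, D)."""
    valid = labels != ignore                          # skip unselected target pixels
    f = F.normalize(feats[valid], dim=1)
    c = F.normalize(protos, dim=1)
    cos = (f @ c.t()).clamp(-1 + 1e-7, 1 - 1e-7)      # cos(theta), shape (P', L)
    theta = torch.acos(cos)
    y = labels[valid]
    # Penalize only the positive pair: cos(theta + m) for the true prototype.
    pos = torch.cos(theta.gather(1, y[:, None]) + m)  # (P', 1)
    logits = cos / tau
    logits.scatter_(1, y[:, None], pos / tau)         # replace the positive logit
    return F.cross_entropy(logits, y)                 # softmax over all prototypes

def weighted_self_information(p, eps=1e-12):
    """I_x[l; i] = -p[l; i] * log p[l; i]; p holds softmax probabilities."""
    return -p * torch.log(p + eps)
```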
In this section, we present experimental results to validate the performance of our MPSCL on the cross-domain semantic segmentation task. Our work in this paper mainly focuses on cross-modal medical image segmentation. Thus, the widely used Multi-Modality Whole Heart Segmentation (MMWHS) challenge 2017 dataset [38] is adopted for cardiac substructure segmentation. Specifically, the training data is composed of 20 unpaired MRI and 20 CT volumes from different patient cohorts, and the ground-truth masks of these data are provided. To evaluate our model quantitatively, we select the following four structures: the ascending aorta (AA), the left atrium blood cavity (LAC), the left ventricle blood cavity (LVC), and the myocardium of the left ventricle (MYO). We conduct extensive experiments for cross-modal adaptation in two directions, i.e., from MRI to CT images and from CT to MRI images. For a fair comparison, we adopt the preprocessed data published by SIFAv2 [4].

During the test phase, we are mainly interested in two aspects of performance, i.e., the overlap and the surface distance between predictions and ground-truth masks. Accordingly, we adopt two common metrics, the Dice similarity coefficient (Dice) and the average symmetric surface distance (ASD), to quantitatively analyze the performance of our model. Dice measures the voxel-wise segmentation accuracy between the predicted segmentation and the ground-truth labels, while ASD computes the average distance between the surfaces of the predicted masks and the ground truth in 3D. A higher Dice and a lower ASD value indicate better segmentation results.

In our experiments, DeepLabV2 [39] with parameters pretrained on ImageNet [40] is selected as the generator G. For the discriminator D, the PatchGAN configuration is adopted in the cardiac CT-to-MRI task, while the discriminator configuration of AdaOutput [9] is used in the cardiac MRI-to-CT task. To provide representative category prototypes and informative pseudo-labels, we first train MPSCL (for more than 4000 iterations) with β = 0, γ = 0, and λ = 0.003. Afterwards, the domain-adaptive prototypes are initialized with the category centers of the initial source features, and the self-paced pseudo-labels are generated for the target domain. To avoid choosing hard-adapted pixel regions while not dropping useful information, the threshold δ_th is set to 0.25. Then, we continue training MPSCL with γ = 1.0, β = 0.1, and λ = 0.003, and progressively refine the category prototypes and pseudo-labels. The deviation angle penalty m is selected from {0.2, 0.4}, and the temperature τ is set to 1.0. Meanwhile, similar to [9], we perform domain adaptation on the multi-level outputs of conv4 and conv5 to further improve performance. During training, our model, except for the discriminator, is optimized with Stochastic Gradient Descent [41] with learning rate 2.5×10^{-4}, momentum 0.9, and weight decay 10^{-4}. The Adam optimizer [42] with learning rate 10^{-4} is used for training the discriminator.
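For reference, the volume-wise Dice metric described above can be sketched in a few lines; this is an illustrative NumPy version (the function name and binary-mask convention are assumptions), while ASD additionally requires surface extraction and distance computation and is omitted here.

```python
# A minimal sketch of the Dice similarity coefficient for one cardiac structure.
import numpy as np

def dice_coefficient(pred, gt, eps=1e-8):
    """pred, gt: boolean 3D volumes (one structure); returns Dice in [0, 1]."""
    inter = np.logical_and(pred, gt).sum()
    return 2.0 * inter / (pred.sum() + gt.sum() + eps)
```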
Fig. 6. Visual comparison of segmentation results produced by different methods for cardiac CT slice images (1st-2nd rows) and MRI slice images (3rd-4th rows). From left to right: the raw test slice images (1st column), the "Supervised training" upper bound (2nd column), the "W/o adaptation" lower bound (3rd column), results of other unsupervised domain adaptation methods (4th-9th columns), results of our MPSCL network (10th column), and the ground truth (last column). The color corresponding to each semantic category is shown at the bottom.

To evaluate the superiority of the proposed MPSCL, we compare it with a wide range of UDA methods on cross-modal medical segmentation tasks. These methods can be divided into two categories: (i) category-agnostic global alignment methods. In this paradigm, we select two image-level alignment UDA methods (SIFAv1 [43] and SIFAv2 [4]) and two methods (AdaOutput [9] and AdvEnt [10]) that bridge the gap at the output level; (ii) category-aware local alignment methods. In this respect, we focus on three category-aware alignment methods, including a category-level adversarial network (CLAN [13]) based on a self-adaptive adversarial loss, and two self-training based alignment methods (CAG [14] and IntraDA [11]). For a fair comparison, the generator architectures used in the implementations of the other methods are the same as in MPSCL, except for SIFAv1 and SIFAv2. Since the same dataset and data preprocessing are applied in SIFAv1 and SIFAv2 as well as in our model, their results are reported directly from the original papers. Additionally, the discriminator architectures follow the PatchGAN configuration [44].

To evaluate the importance of domain adaptation in cross-domain semantic segmentation, we first obtain the lower-bound performance without adaptation (denoted W/o adaptation) by training a model only on the source domain and directly applying it to the target domain. In addition, we also provide the upper-bound performance by conducting supervised learning on the target domain, so as to evaluate how much of the gap between the W/o adaptation model and the fully supervised model is closed. For a fair comparison, the generator G in our MPSCL framework is utilized for training the lower- and upper-bound models.

Table III presents the results on the cross-modal cardiac segmentation task, including the lower- and upper-bound models and several state-of-the-art UDA methods. It can be seen that: i) the W/o adaptation model trained on MRI images obtains an average Dice of only 23.28% when applied to the CT domain. Similarly, the model trained on CT images achieves merely an average Dice of 20.35% on the MRI domain. These results are far below the 90.40% and 85.07% of the supervised training models, which demonstrates the severe domain shift between the MRI and CT domains; ii) remarkably, our MPSCL model achieves significant performance improvements in terms of both the Dice and ASD measurements. For CT images, we improve the average Dice over the four cardiac structures to 84.08%, with the average ASD reduced to 3.47, and for MRI images we obtain an average Dice of 69.87% and an average ASD of 3.80; iii) meanwhile, our method outperforms the other category-agnostic alignment methods by a large margin, which shows that it is of great importance to maintain the semantic consistency between the two domains. In addition, compared with the category-aware alignment methods, our method also achieves significant improvements. For example, for CT images, our MPSCL achieves a clear improvement of 4.68% in average Dice and an obvious reduction of 1.75 in average ASD, which shows the importance of enhancing inter-class separability and intra-class compactness between the two domains. Fig. 6 presents segmentation results for several examples. It is obvious that the W/o adaptation model can hardly capture the correct cardiac structures due to the domain shift between the two domains. Meanwhile, compared with the other UDA methods, the outputs of our MPSCL are more consistent with the ground truth for the slice images in both transfer directions.
In addition, considering the practical clinical application, we also provide the 3D segmentation results of a patient volume in Fig. 7. Although our MPSCL is trained in a 2D view without considering the correlation between frames, very complete and accurate heart structure segmentations can still be obtained for the CT and MRI volumes. Both the quantitative and visual results demonstrate that our MPSCL can effectively tackle the domain shift problem.

Fig. 7. Visual comparison of segmentation results produced by different methods for a cardiac CT data volume (1st row) and an MRI data volume (2nd row). From left to right: the raw test volumes (1st column), the "Supervised training" upper bound (2nd column), the "W/o adaptation" lower bound (3rd column), results of other unsupervised domain adaptation methods (4th-9th columns), results of our MPSCL network (10th column), and the ground truth (last column). The color corresponding to each semantic category is shown at the bottom.

In addition, we also provide in Fig. 8 the training/validation/test curves of the weighted cross-entropy loss to examine overfitting effects. These losses are calculated on the source training, target training, and target test datasets, respectively. It can be seen that, in the early stage of model training, the jitter of the loss is more pronounced on the target validation and test datasets, whereas on the source training set the loss converges quickly. This is because the discriminator can still easily distinguish the two domains in the early stage of training. Moreover, it can be seen that the loss variation trends on the validation and test datasets are the same throughout training, which means that our model does not overfit during the training process. In particular, we also apply an early-stopping strategy to obtain the best model.

In the following, we demonstrate the effectiveness of the deviation angle penalty in the margin preserving contrastive loss given in Eq. (8) from the view of statistical analysis. Fig. 9 presents the distributions of the angle $\theta_n^{(k)}[\cdot\,; y_l]$ in the target domain. It can be observed that, from the beginning to the end of MPSCL training, the similarity between the pixel features and the positive category anchors improves continuously, which means the gap between the two domains is gradually reduced. In other words, the semantic consistency between the two domains is well preserved by our MPSCL. It is worth noting that, at the end of model training, most of the angles are concentrated in a small interval; obviously, the intra-category compactness is enhanced. Meanwhile, due to the noisy part of the generated pseudo-labels, some pixel features still have lower similarities with the positive category anchor.

To provide a more in-depth and visual examination of the self-paced pseudo-labels, we conduct a qualitative analysis as illustrated in Fig. 10. Clearly, the generated pseudo-labels can provide informative supervision for conducting contrastive learning in the target domain without human annotations. Our MPSCL model progressively refines the pseudo-labels to correct errors and produce better supervision during model training in an 'easy-to-hard' scheme. At the start of model training, since the confidence difference $R_n^{(1)}$ between the maximum and submaximum confidence scores of each pixel region is not pronounced, only a few well-adapted pixel regions are selected, although the noisy predictions are also removed. But as training proceeds, the generated predictions come gradually closer to the ground-truth labels.
On the other hand, the pixel regions that were poorly adapted before will receive a more significant confidence difference. With such a self-pacing procedure, the generated pseudo-labels can provide more information to help produce better results on the target categories. However, it should also be noticed that the self-paced pseudo-labels still contain incorrect information, which can lead to negative transfer.

Fig. 10. Visualization of pseudo-labels that are gradually refined during model training. The left side of the dotted line is an MRI slice and the right side is a CT slice. For each side, the first column is the generated predictions based on the domain-adaptive prototypes. The second column is the confidence difference between the maximum and submaximum confidence scores. The third column is the pixel-level mask for interest region selection, where white indicates selected pixels and black indicates unselected pixels. The fourth column is the generated pseudo-labels.

We conduct ablation experiments to evaluate the effectiveness of preserving the semantic consistency between the two domains in our method. The ablation results are shown in Table IV, covering both the MRI-to-CT and CT-to-MRI applications; the Baseline method in Table IV only achieves global marginal feature alignment, obtained by setting β = 0.0. It is obvious that our MPSCL improves the segmentation performance to a large degree: for CT images, the average Dice is increased from 81.75% to 84.08% and the average ASD is reduced from 3.68 to 3.47; similarly, for MRI images, the average Dice is increased to 69.87% and the average ASD is reduced to 3.80. This is due to the fact that our MPSCL significantly promotes the category-level alignment between the two domains and avoids semantic confusion in the feature space. In addition, it is also worth noticing that there exists slight degradation for some categories (i.e., the LAC category in the CT domain and the AA category in the MRI domain). This may be because the threshold δ_th is not suitable for these categories, which results in the generated pseudo-labels containing more erroneous information.

We also validate the influence of the angular margin penalty in our MPSCL; the quantitative results are included in Table V, in which CSCL denotes the conventional self-paced contrastive learning without the deviation angle penalty (i.e., m = 0.0). Meanwhile, we also present the angle distributions of the different settings in Fig. 11. It can be observed that the average Dice is improved from 83.85% to 84.08% and the average ASD is decreased from 3.68 to 3.47 in the MRI-to-CT direction. Additionally, for MRI images, our method improves the average Dice by more than 1.41% and reduces the average ASD by more than 0.27. As illustrated in Fig. 11, the intra-category compactness is further enhanced by explicitly imposing the deviation angle penalty between each pixel feature and the positive category anchor. Both the quantitative and visual results demonstrate that our MPSCL further boosts the discriminability of the representations compared with CSCL. It should also be pointed out that our model does not perform as well as CSCL in some categories (including the LAC and LVC categories in the CT domain and the AA category in the MRI domain). The possible reason is that the pseudo-labels are not correctly assigned in these cases, so the deviation angle penalty may further enlarge the difference between the representations and the positive category anchor.
For the challenge of unsupervised domain adaptation in medical image segmentation, an innovative MPSCL model is proposed, which promotes category-aware feature alignment by cross-domain contrastive learning. To the best of our knowledge, this is the first attempt to introduce contrastive learning for this practical problem. Specifically, the domain-adaptive category prototypes are exploited to constitute contrastive pairs for the joint contrastive learning between the two domains. We generate informative self-paced pseudo-labels for the target domain so as to perform contrastive learning there without prior labels available. The discriminability of the representations is boosted by a margin preserving contrastive loss. It is worth noticing that a single prototype per category does not cover the overall distribution; thus, our ongoing research includes learning prototypes adaptively with the data distribution.

References
[1] Fully convolutional network ensembles for white matter hyperintensities segmentation in MR images.
[2] U-Net: Convolutional networks for biomedical image segmentation.
[3] A 3D active model framework for segmentation of proximal femur in MR images.
[4] Unsupervised bidirectional cross-modality adaptation via deeply synergistic image and feature alignment for medical image segmentation.
[5] A novel coronavirus outbreak of global health concern.
[6] Clinical features of patients infected with 2019 novel coronavirus in Wuhan, China.
[7] Learning transferable features with deep adaptation networks.
[8] Deep CORAL: Correlation alignment for deep domain adaptation.
[9] Learning to adapt structured output space for semantic segmentation.
[10] ADVENT: Adversarial entropy minimization for domain adaptation in semantic segmentation.
[11] Unsupervised intra-domain adaptation for semantic segmentation through self-supervision.
[12] Unsupervised cross-modality domain adaptation of ConvNets for biomedical image segmentations with adversarial loss.
[13] Taking a closer look at domain shift: Category-level adversaries for semantics consistent domain adaptation.
[14] Category anchor-guided unsupervised domain adaptation for semantic segmentation.
[15] Unsupervised domain adaptation for semantic segmentation via class-balanced self-training.
[16] Bidirectional learning for domain adaptation of semantic segmentation.
[17] Self-supervised learning of pretext-invariant representations.
[18] A simple framework for contrastive learning of visual representations.
[19] Supervised contrastive learning.
[20] Few-cost salient object detection with adversarial-paced learning.
[21] SPFTN: A joint learning framework for localizing and segmenting objects in weakly labeled videos.
[22] Leveraging prior-knowledge for weakly supervised object detection under a collaborative self-paced curriculum learning framework.
[23] Unsupervised domain adaptation with dual-scheme fusion network for medical image segmentation.
[24] Unpaired image-to-image translation using cycle-consistent adversarial networks.
[25] Transfer feature learning with joint distribution adaptation.
[26] Collaborative unsupervised domain adaptation for medical image diagnosis.
[27] Generative adversarial nets.
[28] Unsupervised learning for cell-level visual representation in histopathology images with generative adversarial networks.
[29] Distribution-induced bidirectional generative adversarial network for graph representation learning.
[30] Entropy guided unsupervised domain adaptation for cross-center hip cartilage segmentation from MRI.
[31] FaceNet: A unified embedding for face recognition and clustering.
[32] Unsupervised feature learning via non-parametric instance discrimination.
[33] Self-training with noisy student improves ImageNet classification.
[34] Revisiting self-training for neural sequence generation.
[35] Self-training for end-to-end speech recognition.
[36] Distilling the knowledge in a neural network.
[37] ArcFace: Additive angular margin loss for deep face recognition.
[38] Multi-scale patch and multi-modality atlases for whole heart segmentation of MRI.
[39] DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs.
[40] ImageNet: A large-scale hierarchical image database.
[41] Large-scale machine learning with stochastic gradient descent.
[42] Adam: A method for stochastic optimization.
[43] Synergistic image and feature adaptation: Towards cross-modality domain adaptation for medical image segmentation.
[44] Image-to-image translation with conditional adversarial networks.