Title: Self-Supervised Pre-Training of Swin Transformers for 3D Medical Image Analysis
Authors: Tang, Yucheng; Yang, Dong; Li, Wenqi; Roth, Holger; Landman, Bennett; Xu, Daguang; Nath, Vishwesh; Hatamizadeh, Ali
Date: 2021-11-29

Vision Transformers (ViTs) have shown great performance in self-supervised learning of global and local representations that can be transferred to downstream applications. Inspired by these results, we introduce a novel self-supervised learning framework with tailored proxy tasks for medical image analysis. Specifically, we propose: (i) a new 3D transformer-based model, dubbed Swin UNEt TRansformers (Swin UNETR), with a hierarchical encoder for self-supervised pre-training; (ii) tailored proxy tasks for learning the underlying pattern of human anatomy. We demonstrate successful pre-training of the proposed model on 5,050 publicly available computed tomography (CT) images from various body organs. The effectiveness of our approach is validated by fine-tuning the pre-trained models on the Beyond the Cranial Vault (BTCV) Segmentation Challenge with 13 abdominal organs and on segmentation tasks from the Medical Segmentation Decathlon (MSD) dataset. Our model is currently the state-of-the-art (i.e. ranked 1st) on the public test leaderboards of both the MSD and BTCV datasets. Code: https://monai.io/research/swin-unetr

Figure 1. Overview of our proposed pre-training framework. Input CT images are randomly cropped into sub-volumes and augmented with random inner cutout and rotation, then fed to the Swin UNETR encoder as input. We use masked volume inpainting, contrastive learning and rotation prediction as proxy tasks for learning contextual representations of input images.

Vision Transformers (ViTs) [20] have started a revolutionary trend in computer vision [14, 62] and medical image analysis [7, 24]. Transformers demonstrate exceptional capability in learning pre-text tasks, are effective in learning global and local information across layers, and provide scalability for large-scale training [44, 60]. As opposed to convolutional neural networks (CNNs) with limited receptive fields, ViTs encode visual representations from a sequence of patches and leverage self-attention blocks for modeling long-range global information [44]. Recently, the Shifted windows (Swin) Transformer [36] proposed a hierarchical ViT that allows for local computation of self-attention within non-overlapping windows. This architecture achieves linear complexity as opposed to the quadratic complexity of self-attention layers in ViT, hence making it more efficient. In addition, due to their hierarchical nature, Swin Transformers are well-suited for tasks requiring multi-scale modeling. In comparison to CNN-based counterparts, transformer-based models learn stronger feature representations during pre-training and, as a result, perform favorably when fine-tuned on downstream tasks [44]. Several recent efforts on ViTs [6, 56] have achieved new state-of-the-art results by self-supervised pre-training on large-scale datasets such as ImageNet [17].
However, medical image analysis has not benefited from these advances in general computer vision due to: (1) the large domain gap between natural images and medical imaging modalities such as computed tomography (CT) and magnetic resonance imaging (MRI); (2) the lack of cross-plane contextual information when 2D models are applied to volumetric (3D) images such as CT or MRI. The latter is a limitation of 2D transformer models for various medical imaging tasks such as segmentation. Prior studies have demonstrated the effectiveness of supervised pre-training in medical imaging for different applications [11, 45], but creating expert-annotated 3D medical datasets at scale is a non-trivial and time-consuming effort.

To tackle these limitations, we propose a novel self-supervised learning framework for 3D medical image analysis. First, we propose a new architecture dubbed Swin UNEt TRansformers (Swin UNETR) with a Swin Transformer encoder that directly utilizes 3D input patches. Subsequently, the transformer encoder is pre-trained with tailored, self-supervised objectives by leveraging various proxy tasks such as image inpainting, 3D rotation prediction, and contrastive learning (see Fig. 1 for an overview). Specifically, the human body presents naturally consistent contextual information in radiographic images such as CT due to its depicted anatomical structure [51, 58]. Hence, the proxy tasks are utilized for learning the underlying patterns of human anatomy. For this purpose, we extracted numerous patch queries from different body compositions such as head, neck, lung, abdomen, and pelvis to learn robust feature representations from various anatomical contexts, organs, tissues, and shapes. Our framework utilizes contrastive learning [41], masked volume inpainting [43], and 3D rotation prediction [21] as pre-training proxy tasks. Contrastive learning is used to differentiate ROIs from different body compositions, whereas inpainting allows for learning the texture, structure and correspondence of masked regions to their surrounding context. The rotation task serves as a mechanism to learn the structural content of images and generates various sub-volumes that can be used for contrastive learning. We utilize these proxy tasks to pre-train our proposed framework on a collection of 5,050 CT images acquired from various publicly available datasets.

Furthermore, to validate the effectiveness of pre-training, we use 3D medical image segmentation as a downstream application and reformulate it as a 1D sequence-to-sequence prediction task. For this purpose, we leverage the Swin UNETR encoder with hierarchical feature encoding and shifted windows to extract feature representations at four different resolutions. The extracted representations are then connected to a CNN-based decoder. A segmentation head is attached at the end of the decoder for computing the final segmentation output. We fine-tune Swin UNETR with pre-trained weights on two publicly available benchmarks, the Medical Segmentation Decathlon (MSD) and Beyond the Cranial Vault (BTCV). Our model is currently the state-of-the-art on their respective public test leaderboards.

Our main contributions in this work are summarized as follows:
• We introduce a novel self-supervised learning framework with tailored proxy tasks for pre-training on CT image datasets.
To this end, we propose a novel 3D transformer-based architecture, dubbed Swin UNETR, consisting of an encoder that extracts feature representations at multiple resolutions and is utilized for pre-training.
• We demonstrate successful pre-training on a cohort of 5,050 publicly available CT images from various applications using the proposed encoder and proxy tasks. This results in a powerful pre-trained model with robust feature representations that can be utilized for various medical image analysis downstream tasks.
• We validate the effectiveness of the proposed framework by fine-tuning the pre-trained Swin UNETR on the two public benchmarks of MSD and BTCV, and achieve state-of-the-art performance on the test leaderboards of both datasets.

Medical Segmentation with Transformers. Vision transformers were first used in classification tasks, adopted from sequence-to-sequence modeling in natural language processing. Self-attention mechanisms that aggregate information from the entire input sequence first achieved comparable, and then better, performance than prior convolutional architectures such as ResNet [26] or U-Net [15]. Recently, transformer-based networks [30, 55, 57, 63] have been proposed for medical image segmentation. In these pioneering works, the transformer blocks are used either as a bottleneck feature encoder or as additional modules after convolutional layers, resulting in limited exploitation of the spatial context advantages of transformers. Compared to prior works [7, 55], which use transformers as a secondary encoder, we propose to utilize transformers to embed high-dimensional volumetric medical images, which allows for a more direct encoding of 3D patches and positional embeddings. Most medical image analysis tasks, such as segmentation, require dense inference from multi-scale features. Skip connection-based architectures such as UNet [15] and pyramid networks [46] are widely adopted to leverage hierarchical features. However, vision transformers with a single patch size, while successful in natural image applications, are intractable for high-resolution and high-dimensional volumetric images. To avoid the quadratic cost of computing self-attention at scale [35, 49], the Swin Transformer [36, 37] constructs hierarchical encodings with a shifted-window mechanism. Recent works such as Swin UNet [5] and DS-TransUNet [34] utilize the merits of Swin Transformers for 2D segmentation and achieve promising performance. Augmenting the above-mentioned methods, we learn from 3D anatomy in broader medical image segmentation scenarios by incorporating hierarchical volumetric context.

Pre-training in Medical Image Analysis. In medical image analysis, previous studies of pre-training on labeled data demonstrate improved performance through transfer learning [11, 45]. However, generating annotations for medical images is expensive and time-consuming. Recent advances in self-supervised learning offer the promise of utilizing unlabeled data. Self-supervised representation learning [3, 16, 33] constructs feature embedding spaces by designing pre-text tasks, such as solving jigsaw puzzles [40]. Another commonly used pre-text task is to memorize spatial context from medical images, which is motivated by image restoration. This idea is generalized to inpainting tasks [23, 43, 65] that learn visual representations [4, 8, 54] by predicting the original image patches.
Similar efforts for reconstructing spatial context have been formulated as solving a Rubik's cube problem [66], random rotation prediction [21, 50] and contrastive coding [13, 41]. Different from these efforts, our pre-training framework is simultaneously trained with a combination of pre-text tasks tailored for 3D medical imaging data, and leverages a transformer-based encoder as a powerful feature extractor.

Swin UNETR comprises a Swin Transformer [36] encoder that directly utilizes 3D patches and is connected to a CNN-based decoder via skip connections at different resolutions. Fig. 2 illustrates the overall architecture of Swin UNETR. We describe the details of the encoder and decoder in this section.

Assuming that the input to the encoder is a sub-volume X ∈ ℝ^(H×W×D×S), a 3D token with a patch resolution of (H', W', D') has a dimension of H'×W'×D'×S. The patch partitioning layer creates a sequence of 3D tokens of size H/H' × W/W' × D/D' that are projected into a C-dimensional space via an embedding layer. Following [36], for efficient modeling of token interactions, we partition the input volumes into non-overlapping windows and compute local self-attention within each region. Specifically, at layer l, we use a window of size M×M×M to evenly divide the 3D tokens into non-overlapping regions, and regular and shifted window multi-head self-attention (W-MSA and SW-MSA) are computed in consecutive layers as

ẑ^l = W-MSA(LN(z^(l−1))) + z^(l−1),
z^l = MLP(LN(ẑ^l)) + ẑ^l,
ẑ^(l+1) = SW-MSA(LN(z^l)) + z^l,
z^(l+1) = MLP(LN(ẑ^(l+1))) + ẑ^(l+1),

where ẑ^l and ẑ^(l+1) are the outputs of W-MSA and SW-MSA; LN and MLP denote layer normalization and Multi-Layer Perceptron (see Fig. 2). Following [36], we adopt 3D cyclic-shifting for efficient batch computation of shifted windowing. Furthermore, we calculate the self-attention according to

Attention(Q, K, V) = Softmax(QKᵀ / √d) V,

where Q, K, V represent queries, keys and values respectively, and d is the size of the query and key.

Our encoder uses a patch size of 2×2×2 with a feature dimension of 2×2×2×1 = 8 (i.e. single-channel input CT images) and a C = 48-dimensional embedding space. Furthermore, the overall architecture of the encoder consists of 4 stages comprising 2 transformer blocks at each stage (i.e. L = 8 total layers). Between every stage, a patch merging layer is used to reduce the resolution by a factor of 2. The encoder of Swin UNETR is connected to a CNN-based decoder at each resolution via skip connections to create a "U-shaped" network for downstream applications such as segmentation. Specifically, we extract the output sequence representations of each stage i (i ∈ {0, 1, 2, 3, 4}) in the encoder as well as the bottleneck (i = 5) and reshape them into features of size H/2^i × W/2^i × D/2^i. The extracted representations at each stage are then fed into a residual block consisting of two post-normalized 3×3×3 convolutional layers with instance normalization [53]. The processed features from each stage are then upsampled using a deconvolutional layer and concatenated with the processed features of the preceding stage. The concatenated features are fed into another residual block as described above. For segmentation, we concatenate the output of the encoder (i.e. Swin Transformer) with the processed features of the input volume and feed them into a residual block followed by a final 1×1×1 convolutional layer with a proper activation function (i.e. softmax) for computing the segmentation probabilities (see Fig. 2 for details of the architecture).

We pre-train the Swin UNETR encoder with multiple proxy tasks and formulate it with a multi-objective loss function (Fig. 1).

Figure 3. Shifted windowing mechanism for efficient self-attention computation of 3D tokens with 8×8×8 tokens and a 4×4×4 window size.
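To make the windowed self-attention concrete, the following is a minimal, illustrative PyTorch sketch (not the released implementation) of partitioning a 3D token volume into non-overlapping M×M×M windows and applying the attention equation above within each window. The helper names (window_partition_3d, WindowAttention3D) are hypothetical, and the token volume is assumed to be padded so each spatial dimension is divisible by M.

```python
import torch
import torch.nn as nn

def window_partition_3d(x, M):
    """Split a token volume (B, D, H, W, C) into non-overlapping M x M x M windows.
    Returns a tensor of shape (num_windows * B, M**3, C)."""
    B, D, H, W, C = x.shape
    x = x.view(B, D // M, M, H // M, M, W // M, M, C)
    return x.permute(0, 1, 3, 5, 2, 4, 6, 7).reshape(-1, M ** 3, C)

class WindowAttention3D(nn.Module):
    """Multi-head self-attention restricted to one local window (W-MSA),
    i.e. Attention(Q, K, V) = Softmax(Q K^T / sqrt(d)) V computed per window."""
    def __init__(self, dim, num_heads):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                      # x: (num_windows * B, M**3, C)
        Bw, N, C = x.shape
        qkv = self.qkv(x).reshape(Bw, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)   # each: (Bw, heads, N, head_dim)
        attn = (q @ k.transpose(-2, -1)) * self.scale
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(Bw, N, C)
        return self.proj(out)

# Toy usage: an 8x8x8 token volume with 48-dim embeddings and 4x4x4 windows,
# matching Figure 3; SW-MSA would additionally roll the volume by M // 2 along
# each spatial axis (3D cyclic shift) before partitioning.
tokens = torch.randn(1, 8, 8, 8, 48)
windows = window_partition_3d(tokens, M=4)     # (8, 64, 48)
attn = WindowAttention3D(dim=48, num_heads=3)
print(attn(windows).shape)                     # torch.Size([8, 64, 48])
```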
The objective of self-supervised representation learning is to encode region-of-interest (ROI)-aware information of the human body. Inspired by previous works on context reconstruction [23, 65] and contrastive encoding [25], we exploit three proxy tasks for medical image representation learning. Three additional projection heads are attached to the encoder during pre-training. For the downstream task, e.g. segmentation, the full Swin UNETR model is fine-tuned with the projection heads removed.

During training, sub-volumes are cropped from random regions of the volumetric data. Then, stochastic data augmentations with random rotation and cutout are applied twice to each sub-volume within a mini-batch, resulting in two views of each sample. The cutout augmentation randomly masks out ROIs in the sub-volume X ∈ ℝ^(H×W×D×C) with a volume ratio of s. We attach a transposed convolution layer to the encoder as the reconstruction head and denote its output as X̂_M. The reconstruction objective is defined by an L1 loss between the reconstruction and the original sub-volume:

L_inpaint = || X̂_M − X ||_1.

The masked volume inpainting is motivated by prior work focused on 2D images [43]. We extend it to the 3D domain to showcase its effectiveness for representation learning of volumetric medical images.

The rotation prediction task predicts the angle category by which the input sub-volume is rotated. For simplicity, we employ R = 4 classes of 0°, 90°, 180° and 270° rotations along the z-axis. An MLP classification head is used for predicting the softmax probabilities ŷ_r of the rotation categories. Given the ground truth y_r, a cross-entropy loss is used for the rotation prediction task:

L_rot = − Σ_r y_r log(ŷ_r).

The 3D rotation and cutout also serve simultaneously as augmentation transformations for contrastive learning. Self-supervised contrastive coding shows promising performance on visual representation learning when transferred to downstream tasks [12, 42]. Given a batch of augmented sub-volumes, contrastive coding allows for better representation learning by maximizing the mutual information between positive pairs (augmented samples from the same sub-volume), while minimizing it between negative pairs (views from different sub-volumes). The contrastive coding is obtained by attaching a linear layer to the Swin UNETR encoder, which maps each augmented sub-volume to a latent representation v. We use cosine similarity as the distance measurement of the encoded representations as defined in [12]. Formally, the 3D contrastive coding loss between a pair v_i and v_j is defined as

L_contrast = − log [ exp(sim(v_i, v_j) / t) / Σ_k 1_(k≠i) exp(sim(v_i, v_k) / t) ],

where t is the normalized temperature scale, 1_(k≠i) is the indicator function evaluating to 1 iff k ≠ i, and sim denotes the dot product between normalized embeddings. The contrastive learning loss function strengthens intra-class compactness as well as inter-class separability.

Formally, we minimize the total loss function by training Swin UNETR's encoder with the pre-training objectives of masked volume inpainting, 3D image rotation and contrastive coding:

L_total = λ1 L_inpaint + λ2 L_contrast + λ3 L_rot.

A grid-search hyper-parameter optimization estimated the optimal values as λ1 = λ2 = λ3 = 1.

Pre-training Datasets: A total of 5 public CT datasets, consisting of 5,050 subjects, are used to construct our pre-training dataset. The corresponding numbers of 3D volumes for chest, abdomen and head/neck are 2,018, 1,520 and 1,223, respectively. The collection and source details are presented in the supplementary materials. Existing annotations or labels are not utilized from these datasets during the pre-training stage.
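Returning briefly to the three pre-training objectives defined above, the sketch below shows one minimal, self-contained way they could be combined in PyTorch with λ1 = λ2 = λ3 = 1. The function names, batching convention, and NT-Xent-style pairing are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

def inpainting_loss(x_rec, x_orig):
    """Masked volume inpainting: L1 loss between reconstructed and original sub-volume."""
    return F.l1_loss(x_rec, x_orig)

def rotation_loss(rot_logits, rot_labels):
    """Rotation prediction: cross-entropy over R = 4 classes (0, 90, 180, 270 degrees)."""
    return F.cross_entropy(rot_logits, rot_labels)

def contrastive_loss(v1, v2, temperature=0.5):
    """NT-Xent style contrastive coding between two augmented views.
    v1, v2: (N, E) latent embeddings of the two views of the same N sub-volumes."""
    v = F.normalize(torch.cat([v1, v2], dim=0), dim=1)          # (2N, E), cosine similarity
    sim = v @ v.t() / temperature
    n2 = sim.size(0)
    mask = torch.eye(n2, dtype=torch.bool, device=sim.device)
    sim = sim.masked_fill(mask, float('-inf'))                  # exclude k == i
    # the positive pair of sample i is sample i + N (and vice versa)
    targets = torch.arange(n2, device=sim.device).roll(n2 // 2)
    return F.cross_entropy(sim, targets)

def pretraining_loss(x_rec, x_orig, rot_logits, rot_labels, v1, v2,
                     lambdas=(1.0, 1.0, 1.0)):
    """Total loss = lambda1 * L_inpaint + lambda2 * L_contrast + lambda3 * L_rot."""
    l1, l2, l3 = lambdas
    return (l1 * inpainting_loss(x_rec, x_orig)
            + l2 * contrastive_loss(v1, v2)
            + l3 * rotation_loss(rot_logits, rot_labels))
```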
BTCV: The Beyond the Cranial Vault (BTCV) abdomen challenge dataset [32] contains 30 subjects with abdominal CT scans, in which 13 organs are annotated by interpreters under the supervision of radiologists at Vanderbilt University Medical Center. Each CT scan is acquired in the portal venous contrast enhancement phase and consists of 80 to 225 slices with 512×512 pixels and a slice thickness ranging from 1 to 6 mm. The multi-organ segmentation problem is formulated as a 13-class segmentation task (see Table 1 for details). The preprocessing pipeline is detailed in the supplementary materials.

MSD: The Medical Segmentation Decathlon (MSD) dataset [1] comprises 10 segmentation tasks from different organs and image modalities. These tasks are designed to feature difficulties encountered across medical images, such as small training sets, unbalanced classes, multi-modality data and small objects. Therefore, the MSD challenge serves as a comprehensive benchmark to evaluate the generalizability of medical image segmentation methods. The pre-processing pipeline for this dataset is outlined in the supplementary materials.

For the pre-training tasks: (1) masked volume inpainting: the ROI dropping rate is set to 30% (as also used in [3]); the dropped regions are randomly generated until they sum up to the target overall number of masked voxels; (2) 3D contrastive coding: a feature size of 512 is used as the embedding size; (3) rotation prediction: the rotation degrees are configured to 0°, 90°, 180° and 270°. We train the model using the AdamW [38] optimizer with a warm-up cosine scheduler of 500 iterations. The pre-training experiments use a batch size of 4 per GPU (with 96×96×96 patches), an initial learning rate of 4e-4, momentum of 0.9 and decay of 1e-5 for 450K iterations. Our model is implemented in PyTorch and MONAI. A five-fold cross-validation strategy is used to train models for all BTCV and MSD experiments. We select the best model in each fold and ensemble their outputs for the final segmentation predictions. Detailed training hyperparameters for fine-tuning the BTCV and MSD tasks can be found in the supplementary materials. All models are trained on an NVIDIA DGX-1 server.

The Dice similarity coefficient (Dice) and 95% Hausdorff Distance (HD95) are used as measurements of the experimental results. HD95 calculates the 95th percentile of surface distances between the ground truth and prediction point sets. The metric formulations are as follows:

Dice(Y, Ŷ) = 2 Σ_i Y_i Ŷ_i / (Σ_i Y_i + Σ_i Ŷ_i),
HD = max{ max_(y'∈Y') min_(ŷ'∈Ŷ') ||y' − ŷ'||, max_(ŷ'∈Ŷ') min_(y'∈Y') ||y' − ŷ'|| },

where Y and Ŷ denote the ground truth and prediction of voxel values, and Y' and Ŷ' denote the ground truth and prediction surface point sets.

We extensively compare the benchmarks of our model with baselines. The published leaderboard evaluation is shown in Table 1. Compared with other top submissions, the proposed Swin UNETR achieves the best performance. We obtain a state-of-the-art Dice of 0.908, outperforming the second, third and fourth top-ranked baselines by 1.6%, 2.0% and 2.4% on the average of 13 organs, respectively. Distinct improvements can be specifically observed for organs that are smaller in size, such as the splenic and portal veins with 3.6% improvement over the prior state-of-the-art method, the pancreas with 1.6%, and the adrenal glands with 3.8%. Moderate improvements are observed for other organs. The representative samples in Fig. 4 demonstrate the success of Swin UNETR in identifying organ details. Our method detects the pancreas tail (row 1) and branches in the portal vein (row 2) in Fig. 4, where other methods under-segment parts of each tissue.
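As a brief aside on evaluation, the sketch below shows one straightforward way the Dice and HD95 metrics defined above could be computed for a pair of binary masks with NumPy/SciPy; it is illustrative only (brute-force surface distances, hypothetical function names), not the leaderboard evaluation code.

```python
import numpy as np
from scipy.ndimage import binary_erosion
from scipy.spatial.distance import cdist

def dice_score(gt, pred):
    """Dice = 2 * |Y ∩ Y_hat| / (|Y| + |Y_hat|) for binary masks."""
    gt, pred = gt.astype(bool), pred.astype(bool)
    inter = np.logical_and(gt, pred).sum()
    return 2.0 * inter / (gt.sum() + pred.sum() + 1e-8)

def hd95(gt, pred):
    """95th-percentile symmetric surface distance between binary masks
    (brute force; fine for small volumes, slow for large ones)."""
    gt, pred = gt.astype(bool), pred.astype(bool)
    gt_surf = np.argwhere(gt ^ binary_erosion(gt))       # surface voxels of Y'
    pr_surf = np.argwhere(pred ^ binary_erosion(pred))   # surface voxels of Y_hat'
    d = cdist(gt_surf, pr_surf)                          # pairwise Euclidean distances
    d_gt_to_pr = d.min(axis=1)                           # for each ground-truth surface voxel
    d_pr_to_gt = d.min(axis=0)                           # for each predicted surface voxel
    return max(np.percentile(d_gt_to_pr, 95), np.percentile(d_pr_to_gt, 95))
```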
In addition, our method demonstrates a distinct improvement in the segmentation of adrenal glands (row 3).

Table 3. MSD test dataset performance comparison of Dice and NSD. Benchmarks obtained from the MSD test leaderboard.

The overall MSD results per task and the ranking from the challenge leaderboard are shown in Table 3.

Table 4. Ablation study of the effectiveness of each objective function in the proposed pre-training loss. HD denotes Hausdorff Distance. Experiments on fine-tuning the BTCV dataset.

With a subset of the labeled data, experiments with pre-trained weights achieve approximately 10% improvement compared to training from scratch. When employing all labeled data, self-supervised pre-training yields a 1.3% higher average Dice. The Dice of 83.13 obtained by training from scratch on the entire dataset can be reached by the pre-trained Swin UNETR using only 60% of the data. Fig. 7 indicates that our approach can reduce the annotation effort by at least 40% for the BTCV task.

We perform an organ-wise study on the BTCV dataset using pre-trained weights obtained from smaller amounts of unlabeled data. In Fig. 8, the fine-tuning results are obtained from pre-training on 100, 3,000, and 5,000 scans. We observe that Swin UNETR is robust with respect to the total number of CT scans used for pre-training. Fig. 8 also demonstrates that the proposed model can benefit from larger pre-training datasets as the amount of unlabeled data increases.

We perform an empirical study of pre-training with different combinations of self-supervised objectives. As shown in Table 4, on the BTCV test set, using pre-trained weights from inpainting achieves the highest improvement among single-task models. When pairing tasks, inpainting and contrastive learning show a Dice of 84.45% and a Hausdorff Distance (HD) of 24.37. Overall, employing all proxy tasks achieves the best Dice of 84.72%.

Figure 8. Pre-trained weights using 100, 3,000 and 5,000 scans are compared for fine-tuning on the BTCV dataset for each organ.

Our state-of-the-art results on the test leaderboards of the MSD and BTCV datasets validate the effectiveness of the proposed self-supervised learning framework in taking advantage of the large number of available medical images without the need for annotation effort. Subsequently, fine-tuning the pre-trained Swin UNETR model achieves higher accuracy, improves convergence speed, and reduces annotation effort in comparison to training with randomly initialized weights from scratch. Our framework is scalable and can be easily extended with more proxy tasks and augmentation transformations. Meanwhile, the pre-trained encoder can benefit transfer learning for various medical image analysis tasks, such as classification and detection. In the MSD pancreas segmentation task, Swin UNETR with pre-trained weights outperforms AutoML algorithms such as DiNTS [27] and C2FNAS [59] that are specifically designed for searching optimal network architectures on the same segmentation task. Currently, Swin UNETR has only been pre-trained using CT images, and our experiments have not demonstrated sufficient transferability when applied directly to other medical imaging modalities such as MRI. This is mainly due to obvious domain gaps and the different numbers of input channels that are specific to each modality. As a result, this is a potential direction that should be studied in future efforts.

In this work, we present a novel framework for self-supervised pre-training of 3D medical images.
Inspired by merging feature maps at multiple scales, we built Swin UNETR by feeding transformer-encoded spatial representations into a convolution-based decoder. By proposing the first transformer-based 3D medical image pre-training, we leverage the power of the Swin Transformer encoder for fine-tuning on segmentation tasks. Swin UNETR with self-supervised pre-training achieves state-of-the-art performance on the BTCV multi-organ segmentation challenge and the MSD challenge. In particular, we presented large-scale CT pre-training with 5,050 volumes by combining multiple publicly available datasets covering a diversity of anatomical ROIs.

We provide the supplementary materials in the following. In Sec. A, we describe the details of the datasets used for pre-training from public sources. In Sec. B, we illustrate the preprocessing and implementation details of the fine-tuning tasks using the BTCV and MSD datasets. In Sec. C, we present qualitative and quantitative comparisons of segmentation tasks in the MRI modality from the MSD dataset. The presented results include benchmarks from all top-ranking methods on the MSD test leaderboard. In Sec. D, the model complexity analysis is presented. Finally, we provide pseudocode of the Swin UNETR self-supervised pre-training in Sec. E.

In this section, we provide additional information on our pre-training datasets. The proposed Swin UNETR is pre-trained using five collected datasets. The total data cohort contains 5,050 CT scans of various body regions of interest (ROIs) such as head, neck, chest, abdomen, and pelvis. LUNA16 [47], TCIA Covid19 [18] and LiDC [2] contain 888, 761 and 475 CT scans, respectively, which compose the chest CT cohort. HNSCC [22] has 1,287 CT scans from head and neck squamous cell carcinoma patients. The TCIA Colon dataset [29] comprises the abdomen and pelvis cohort with 1,599 scans. We split 5% of each dataset for validation in the pre-training stage. Table S.1 summarizes the sources of each collected dataset. Overall, the numbers of training and validation volumes are 4,761 and 249, respectively. The Swin UNETR encoder is pre-trained using only unlabeled images; annotations were not utilized from any of these datasets. We first clip the CT image intensities from -1000 to 1000, then normalize them to [0, 1]. To obtain informative patches covering anatomies, we crop sub-volumes of 96×96×96 voxels at foreground locations and exclude patches of pure air (voxel value 0). In summary, Swin UNETR is pre-trained on a diverse set of human body compositions and learns a general-purpose representation from data of different institutions that can be leveraged for a wide range of fine-tuning tasks.

We report fine-tuning results on two public benchmarks: the BTCV [32] and MSD [48] challenges. BTCV contains 30 CT scans with 13 annotated anatomies and can be formulated as a single multi-organ segmentation task. MSD contains 10 tasks for multiple organs, from different sources and using different modalities. Details regarding the preprocessing of these datasets are provided in the subsequent sub-sections.

All CT scans are interpolated to a voxel spacing of 1.5×1.5×2.0 mm. The multi-organ segmentation problem is formulated as a 13-class segmentation, which includes large organs such as the liver, spleen, kidneys and stomach; vascular tissues of the esophagus, aorta, IVC, and splenic and portal veins; and small anatomies of the gallbladder, pancreas and adrenal glands.
A soft tissue window is used for clipping the CT intensities, which are then normalized to [0, 1], followed by random sampling of 96×96×96 voxel patches. Data augmentations of random flip, rotation and intensity shifting are used for training, with probabilities of 0.1, 0.1, and 0.5, respectively.

The MSD challenge contains 6 CT and 4 MRI datasets. We provide additional pre-processing and augmentation details for each task as follows:

Task01 BrainTumour: The four MRI modalities for each subject are stacked into a 4-channel input. We convert labels to multiple channels based on tumor classes, where label 1 is the peritumoral edema, label 2 is the GD-enhancing tumor, and label 3 is the necrotic and non-enhancing tumor core. Labels 2 and 3 are merged to construct the tumor core (TC); labels 1, 2 and 3 are merged to construct the whole tumor (WT); and label 2 is the enhancing tumor (ET). We crop sub-volumes of 128×128×128 voxels and use channel-wise non-zero normalization for the MRI images. Data augmentation probabilities of 0.5, 0.1 and 0.1 are set for random flips along each axis, intensity scaling and intensity shifting, respectively.

Task02 Heart: The heart MRI images are interpolated to an isotropic voxel spacing of 1.0 mm. Channel-wise non-zero normalization is applied to each scan. We sample training sub-volumes of 96×96×96 voxels with a positive-to-negative ratio of 2:1. Augmentation probabilities for random flip, rotation, intensity scaling and shifting are set to 0.5, 0.1, 0.2 and 0.5, respectively.

Task03 Liver: Each CT scan is interpolated to an isotropic voxel spacing of 1.0 mm. Intensities are scaled to [-21, 189], then normalized to [0, 1]. 3D patches of 96×96×96 voxels are obtained by sampling with a positive-to-negative ratio of 1:1. Data augmentations of random flip, rotation, intensity scaling and shifting are used, with probabilities of 0.2, 0.2, 0.1 and 0.1, respectively.

Task04 Hippocampus: Each hippocampus MRI image is interpolated to a voxel spacing of 0.2×0.2×0.2 mm and then spatially padded to 96×96×96 as the input size of the Swin UNETR model. As for the other MRI datasets, channel-wise non-zero normalization is used for the intensities. A probability of 0.1 is used for random flip, rotation, and intensity scaling & shifting.

Task05 Prostate: We use both given modalities of the prostate MRI images for each subject as a two-channel input. Channel-wise non-zero normalization is used. A voxel spacing of 0.5 mm and spatial padding along each axis are employed to construct the input size of 96×96×96. We use random flip, rotation, and intensity scaling and shifting as augmentations.

Task07 Pancreas: We clip the intensities to a range of 87 to 199. A patch size of 96×96×96 is used to sample training data with a positive-to-negative ratio of 1:1. We set the augmentations of random flip, rotation and intensity scaling to probabilities of 0.5, 0.25 and 0.5, respectively.

Task08 HepaticVessel: To fit the optimal tissue window for hepatic vessels and tumors, we clip each CT image's intensities to [0, 230] HU. We apply the same data augmentation as Task07 Pancreas for training.

Task09 Spleen: Spleen CT scans are pre-processed with interpolation to an isotropic voxel spacing of 1.0 mm on each axis. A soft tissue window of [-125, 275] HU is used for the portal venous phase contrast-enhanced CT images. We use training data augmentations of random flip, intensity scaling and intensity shifting with probabilities of 0.15, 0.1, and 0.1, respectively.
Task10 Colon: We use an HU range of [-57, 175] for the colon tumor segmentation task, normalized to [0, 1]. Next, we sample training sub-volumes with a positive-to-negative ratio of 1:1. As with Task07 and Task08, we use random flip, rotation and intensity scaling as augmentation transforms, with probabilities of 0.5, 0.25 and 0.5, respectively.

C. Results

In this section, we provide extensive segmentation visualizations from the MSD dataset. In particular, we compare two cases randomly selected from Swin UNETR and DiNTS for each MSD task. As shown in Fig. S.1, DiNTS exhibits under-segmentation, missing parts of the labeled structures (Heart, Hippocampus). The missing parts result in a lower Dice score. On the BrainTumour, Liver, Pancreas, HepaticVessel and Colon tasks, the comparison indicates that our method achieves better segmentation where under-segmentation of tumors is observed for DiNTS. For the Lung task, over-segmentation is observed with DiNTS, where surrounding tissues are included in the lung cancer label, while Swin UNETR clearly delineates the boundary. On Heart and Spleen, DiNTS and Swin UNETR have comparable Dice scores, yet Swin UNETR performs better segmentation at tissue corners (see Fig. S.1). Overall, Swin UNETR achieves better segmentation results and resolves the under- and over-segmentation outliers observed in segmentation via DiNTS.

In this section, we provide quantitative benchmarks of the MRI segmentation tasks from the MSD dataset. In addition to Task01 BrainTumour, we conduct experiments on the three remaining MRI datasets, including Heart, Hippocampus and Prostate (see Table S.2). The results are obtained directly from the MSD leaderboard. Regarding the MRI benchmarks, we achieve much better performance on brain tumor segmentation, as presented in the paper, with an average Dice improvement of 2% against the second-best performance. Compared to Models Genesis [65] and nnU-Net [28], Swin UNETR shows comparable results on Heart, Hippocampus and Prostate. Overall, we achieve the best average results (Dice of 82.14% and NSD of 94.66%) across the four MRI datasets, showing Swin UNETR's strength in medical image segmentation.

In this section, we examine the model complexity along with the inference time. In Table S.3, the number of network parameters, FLOPs, and averaged inference time of Swin UNETR and the baselines on the BTCV dataset are presented. We calculate the FLOPs and inference time based on the input size of 96×96×96 used in the BTCV experiments with a sliding-window approach. Swin UNETR has a moderate parameter count of 61.98M, less than transformer-based methods such as TransUNet [7] with 96.07M, SETR [62] with 86.03M, and UNETR [24] with 92.58M, but larger than 3D UNet (nnUNet) [28] with 19.07M and ASPP [10] with 47.92M. Our model also shows comparable FLOPs and inference time to 3D approaches such as nnUNet [28] and CoTr [55]. Overall, Swin UNETR outperforms CNN-based and other transformer-based methods while preserving a moderate model complexity. Regarding the self-supervised pre-training time of the Swin UNETR encoder, our approach takes only approximately 6 GPU-days. We evaluate pre-training on the 5 collected public datasets with a total of 5,050 scans for training and validation, and set the maximum number of training iterations to 45K steps.

In this section, we illustrate the Swin UNETR pre-training details. The PyTorch-like pseudo-code implementation is shown in Algorithm S.1.
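Since Algorithm S.1 itself is not reproduced here, the following is a compact, hedged sketch of the pre-training step it summarizes and that the next paragraph describes in detail: two augmented views per sub-volume via random 90° rotations and roughly 30% cutout masking, feeding the three proxy-task heads. The model/head interfaces, helper names, and cutout box sizes are illustrative assumptions, not the released code; `pretraining_loss` refers to the combined loss sketched earlier.

```python
import random
import torch

MASK_RATIO = 0.30   # roughly 30% of voxels are dropped by the cutout augmentation

def rotate_z(x, k):
    """Rotate a (C, H, W, D) sub-volume by k * 90 degrees in the axial (H, W) plane."""
    return torch.rot90(x, k, dims=(1, 2))

def cutout(x, ratio=MASK_RATIO):
    """Zero out random box-shaped ROIs until roughly `ratio` of the voxels are masked."""
    x = x.clone()
    _, H, W, D = x.shape
    dropped, target = 0, int(ratio * H * W * D)
    while dropped < target:
        h, w, d = (random.randint(8, 32) for _ in range(3))          # hypothetical box sizes
        i, j, k = (random.randint(0, s - e) for s, e in zip((H, W, D), (h, w, d)))
        x[:, i:i + h, j:j + w, k:k + d] = 0
        dropped += h * w * d
    return x

def make_view(x):
    """One augmented view: random 90-degree rotation (its index is the rotation label),
    a cutout-masked input for inpainting, and the rotated-but-unmasked reconstruction target."""
    k = random.randint(0, 3)
    rotated = rotate_z(x, k)
    return cutout(rotated), rotated, k

def train_step(model, optimizer, batch, pretraining_loss):
    """`model` is assumed to return (reconstruction, rotation logits, contrastive embedding)
    from its three pre-training heads. Two stochastic views are generated per sub-volume so
    that paired views serve as positives for the contrastive objective."""
    inputs, targets, labels = [], [], []
    for x in batch:                       # batch: list of (C, H, W, D) sub-volumes
        for _ in range(2):                # two views of each sub-volume
            inp, tgt, k = make_view(x)
            inputs.append(inp)
            targets.append(tgt)
            labels.append(k)
    inputs, targets = torch.stack(inputs), torch.stack(targets)
    labels = torch.tensor(labels)
    rec, rot_logits, emb = model(inputs)
    loss = pretraining_loss(rec, targets, rot_logits, labels, emb[0::2], emb[1::2])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```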
Swin UNETR is trained in a self-supervised learning paradigm, in which we design masked volume inpainting, rotation prediction and contrastive coding as proxy tasks. The self-training aims at improving the quality of the representations learnt from large amounts of unlabeled data and propagating them to smaller fine-tuning datasets. To this end, we leverage multiple transformations of the input 3D data, which exploit the inherent context through mechanisms akin to autoencoding and similarity identification. In particular, given an input mini-batch, the random rotation transform is applied to each image in the mini-batch iteratively. To simultaneously utilize the augmentation transformations for contrastive learning, a random rotation of 0°, 90°, 180° or 270° is applied twice to the same input to generate a randomly augmented pair of views of the same image patch. Subsequently, the mini-batch data pairs are constructed with the cutout transforms. The voxel drop size is set to 30% of the input sub-volume. We randomly generate masked ROIs inside the image until the total number of masked voxels exceeds the scheduled number of dropped voxels. Unlike the canonical masked-token pre-training rules of BERT [19], our local transformations of the CT sub-volumes are arranged to span neighbouring tokens. This scheme constructs semantic targets across partitioned tokens, which is critical for medical spatial context. In analogy to Models Genesis [65], a CNN-based model consisting of expensive convolutional and transposed convolution layers with skip connections between encoder and decoder, our pre-training approach is trained to reconstruct input sub-volumes from the output tokens of the Swin Transformer. Overall, the intuition of modeling inpainting, rotation prediction and contrastive coding is to generalize better representations from the aspects of image context, geometry and similarity, respectively.

References
The medical segmentation decathlon
The lung image database consortium (LIDC) and image database resource initiative (IDRI): a completed reference database of lung nodules on CT scans
SiT: Self-supervised vision transformer
Big self-supervised models advance medical image classification
Swin-Unet: Unet-like pure transformer for medical image segmentation
Emerging properties in self-supervised vision transformers
TransUNet: Transformers make strong encoders for medical image segmentation
Self-supervised learning for medical image analysis using image context restoration
Encoder-decoder with atrous separable convolution for semantic image segmentation
Encoder-decoder with atrous separable convolution for semantic image segmentation
Med3D: Transfer learning for 3D medical image analysis
A simple framework for contrastive learning of visual representations
An empirical study of training self-supervised vision transformers
Per-pixel classification is not all you need for semantic segmentation
3D U-Net: Learning dense volumetric segmentation from sparse annotation
UP-DETR: Unsupervised pre-training for object detection with transformers
ImageNet: A large-scale hierarchical image database
Chest imaging representing a COVID-19 positive rural US population
BERT: Pre-training of deep bidirectional transformers for language understanding
An image is worth 16x16 words: Transformers for image recognition at scale
Unsupervised representation learning by predicting image rotations
Imaging and clinical data archive for head and neck squamous cell carcinoma patients treated with radiotherapy (Scientific Data)
Transferable visual words: Exploiting the semantics of anatomical patterns for self-supervised learning
UNETR: Transformers for 3D medical image segmentation
Momentum contrast for unsupervised visual representation learning
Deep residual learning for image recognition
DiNTS: Differentiable neural network topology search for 3D medical image segmentation
nnU-Net: A self-configuring method for deep learning-based biomedical image segmentation
Accuracy of CT colonography for detection of large adenomas and cancers
Medical Transformer: Gated axial-attention for medical image segmentation
Scalable neural architecture search for 3D medical image segmentation
MICCAI multi-atlas labeling beyond the cranial vault workshop and challenge
SwinIR: Image restoration using Swin Transformer
DS-TransUNet: Dual Swin Transformer U-Net for medical image segmentation
Feature pyramid networks for object detection
Swin Transformer: Hierarchical vision transformer using shifted windows
Decoupled weight decay regularization
Deep learning to achieve clinically applicable segmentation of head and neck anatomy for radiotherapy
Unsupervised learning of visual representations by solving jigsaw puzzles
Representation learning with contrastive predictive coding
Contrastive learning for unpaired image-to-image translation
Context encoders: Feature learning by inpainting
Do vision transformers see like convolutional neural networks?
Transfusion: Understanding transfer learning for medical imaging
A multi-scale pyramid of 3D fully convolutional networks for abdominal multi-organ segmentation
Validation, comparison, and combination of algorithms for automatic detection of pulmonary nodules in computed tomography images: the LUNA16 challenge
A large annotated medical image dataset for the development and evaluation of segmentation algorithms
SNIPER: Efficient multi-scale training
3D self-supervised methods for medical imaging
Body part regression with self-supervision
High-resolution 3D abdominal segmentation with random patch network fusion
Instance normalization: The missing ingredient for fast stylization
Self-supervised image-text pre-training with mixed data in chest x-rays
CoTr: Efficiently bridging CNN and transformer for 3D medical image segmentation
Self-supervised learning with Swin Transformers
LeViT-UNet: Make faster encoders with transformer for medical image segmentation
Self-supervised learning of pixel-wise anatomical embeddings in radiological images
C2FNAS: Coarse-to-fine neural architecture search for 3D medical image segmentation
Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers
Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers
nnFormer: Interleaved transformer for volumetric segmentation
Prior-aware neural network for partially-supervised multi-organ segmentation
Models Genesis
Rubik's cube+: A self-supervised feature learning framework for 3D medical image analysis