key: cord-0138826-elbbz5cb authors: Shamshad, Fahad; Khan, Salman; Zamir, Syed Waqas; Khan, Muhammad Haris; Hayat, Munawar; Khan, Fahad Shahbaz; Fu, Huazhu title: Transformers in Medical Imaging: A Survey date: 2022-01-24 journal: nan DOI: nan sha: 24aa57dae649b6683d8f5bc8deaf2ff549cdacc4 doc_id: 138826 cord_uid: elbbz5cb

Following their unprecedented success on natural language tasks, Transformers have been successfully applied to several computer vision problems, achieving state-of-the-art results and prompting researchers to reconsider the supremacy of convolutional neural networks (CNNs) as de facto operators. Capitalizing on these advances in computer vision, the medical imaging field has also witnessed growing interest in Transformers, which can capture global context, in contrast to CNNs with local receptive fields. Inspired by this transition, in this survey, we attempt to provide a comprehensive review of the applications of Transformers in medical imaging, covering various aspects ranging from recently proposed architectural designs to unsolved issues. Specifically, we survey the use of Transformers in medical image segmentation, detection, classification, reconstruction, synthesis, registration, clinical report generation, and other tasks. In particular, for each of these applications, we develop a taxonomy, identify application-specific challenges as well as provide insights to solve them, and highlight recent trends. Further, we provide a critical discussion of the field's current state as a whole, including the identification of key challenges and open problems, and outline promising future directions. We hope this survey will ignite further interest in the community and provide researchers with an up-to-date reference regarding applications of Transformer models in medical imaging. Finally, to cope with the rapid development in this field, we intend to regularly update the relevant latest papers and their open-source implementations at https://github.com/fahadshamshad/awesome-transformers-in-medical-imaging.

Convolutional Neural Networks (CNNs) [1]-[4] have significantly impacted the field of medical imaging due to their ability to learn highly complex representations in a data-driven manner. Since their renaissance, CNNs have demonstrated remarkable improvements on numerous medical imaging modalities, including Radiography [5], Endoscopy [6], Computed Tomography (CT) [7], [8], Mammography Images (MG) [9], Ultrasound Images [10], Magnetic Resonance Imaging (MRI) [11], [12], and Positron Emission Tomography (PET) [13], to name a few. The workhorse in CNNs is the convolution operator, which operates locally and provides translational equivariance. While these properties help in developing efficient and generalizable medical imaging solutions, the local receptive field of the convolution operation limits its ability to capture long-range pixel relationships. Furthermore, convolutional filters have stationary weights that are not adapted to the given input image content at inference time. Meanwhile, significant research effort has been made by the vision community to integrate attention mechanisms [14]-[16] into CNN-inspired architectures [17]-[22]. These attention-based 'Transformer' models have become an attractive solution due to their ability to encode long-range dependencies and learn highly effective feature representations [23].
Recent works have shown that these Transformer modules can fully replace the standard convolutions in deep neural networks by operating on a sequence of image patches, giving rise to Vision Transformers (ViTs) [22]. Since their inception, ViT models have been shown to push the state of the art in numerous vision tasks, including image classification [22], object detection [24], semantic segmentation [25], image colorization [26], low-level vision [27], and video understanding [28], to name a few. Furthermore, recent research indicates that the prediction errors of ViTs are more consistent with those of humans than CNNs [29]-[32]. These desirable properties of ViTs have sparked great interest in the medical community to adapt them for medical imaging applications, thereby mitigating the inherent inductive biases of CNNs [33].

Motivation and Contributions: Recently, the medical imaging community has witnessed an exponential growth in the number of Transformer-based techniques, especially after the inception of ViTs (see Fig. 1). The topic is now dominant in prestigious medical imaging conferences and journals, and it is getting increasingly difficult to keep pace with the recent progress due to the rapid influx of papers. As such, a survey of the existing relevant works is timely to provide a comprehensive account of new methods in this emerging field. To this end, we provide a holistic overview of the applications of Transformer models in medical imaging. We hope this work can provide a roadmap for researchers to explore the field further. Our major contributions include:
• This is the first survey paper that comprehensively covers applications of Transformers in the medical imaging domain, thereby bridging the gap between the vision and medical imaging communities in this rapidly evolving area. Specifically, we present a comprehensive overview of more than 125 relevant papers to cover the recent progress.
• We provide a detailed coverage of the field by categorizing the papers based on their applications in medical imaging, as depicted in Fig. 2. For each of these applications, we develop a taxonomy, highlight task-specific challenges, and provide insights about solving them based on the literature reviewed.
• Finally, we provide a critical discussion of the field's current state as a whole, including identifying key challenges, highlighting open problems, and outlining promising future directions.
• Although the main focus of this survey is on Vision Transformers, we are also the first, since the inception of the original Transformer about half a decade ago, to extensively cover its language modeling capabilities in the clinical report generation task (see Sec. 9).

Paper Organization. The rest of the paper is organized as follows. In Sec. 2, we provide a background of the field with a focus on salient concepts underlying Transformers. From Sec. 3 to Sec. 10, we comprehensively cover applications of Transformers in several medical imaging tasks, as shown in Fig. 2. In particular, for each of these tasks, we develop a taxonomy and identify task-specific challenges. Sec. 11 presents open problems and future directions for the field as a whole. Finally, in Sec. 12, we give recommendations to cope with the rapid development of the field and conclude the paper.

Medical imaging approaches have undergone significant advances over the past few decades.
In this section, we briefly provide a background of these advancements and broadly group them into three categories: hand-crafted, CNN-based, and ViT-based. For the hand-crafted and CNN-based methods, we describe the underlying working principles along with their major strengths and shortcomings in the context of medical imaging. For the ViT-based methods, we highlight the core concepts behind their success and defer further details to later sections.

Conventional algorithms to solve medical imaging tasks are based on hand-crafted mathematical models designed by field experts using domain knowledge. The development of these hand-crafted models, with a focus on refining discriminative features and efficient optimization algorithms for a range of medical imaging problems, has been a central research topic in the past [40], [41]. Successful hand-crafted models in medical imaging include total variation [42], non-local self-similarity [43], sparsity/structured sparsity [44], Markov-tree models on wavelet coefficients [45], and untrained neural networks [46]-[48]. These models have been extensively leveraged in the medical domain for image segmentation [49], reconstruction [50], disease classification [51], enhancement [52], and anomaly detection [53] due to their interpretability, solid mathematical foundations, and theoretical guarantees on robustness, recovery, and complexity [54], [55]. Further, unlike deep learning-based approaches, they do not require large annotated medical imaging datasets for training. This reduced reliance on labeled datasets is crucial to the medical research community, as collecting voluminous, reliable, and labeled data in the medical domain is difficult due to the lack of expert annotators, high time consumption, ethical considerations, and financial costs. However, owing to their inability to leverage the expressive power of large medical imaging datasets, these hand-crafted models often suffer from poor discriminative capability [56]. Consequently, these models often fail to represent the nuances of high-dimensional, complex medical imaging data, which can hamper the performance of medical imaging diagnosis systems [57], [58]. To circumvent the poor discriminability and generalization issues, learned hand-crafted models have been proposed to better exploit the data. Representative approaches include the method of optimal directions [59], K-SVD [60], data-driven tight frames [61], low-rank models [62], and piece-wise smooth image models [63]. Next, we explain the popular data-driven approaches explored in the literature.

Figure: ViT-based approaches give superior performance compared to CNN-based methods due to their ability to model the global context. Figure sources: (a) [34], (b) [35], (c) [36], (d) [37], (e) [38], (f) [39].

CNNs are effective at learning discriminative features and extracting generalizable priors from large-scale medical datasets, thus providing excellent performance on medical imaging tasks and making them an integral component of modern AI-based medical imaging systems. The advancements in CNNs have been mainly fueled by novel architectural designs, better optimization procedures, the availability of specialized hardware (e.g., GPUs), and purpose-built open-source software libraries [64]-[66]. We refer interested readers to comprehensive survey papers on CNN applications in medical imaging [56], [67]-[75].
Despite considerable performance gains, the reliance of CNNs on large labeled datasets limits their applicability over the full spectrum of medical imaging tasks. Furthermore, CNN-based approaches are generally more challenging to interpret and often act as black-box solutions. Therefore, there has been an increasing effort in the medical imaging community to amalgamate the strengths of hand-crafted and CNN-based methods, resulting in prior information-guided CNN models [76]. These hybrid methods contain special domain-specific layers and include unrolled optimization [77], generative models [78], and learned denoiser-based approaches [79]. Despite these architectural and algorithmic advancements, the decisive factor behind the success of CNNs has been primarily attributed to their image-specific inductive bias in dealing with scale invariance and modeling local visual structures. While this intrinsic locality (limited receptive field) brings efficiency to CNNs, it impairs their ability to capture long-range spatial dependencies in an input image, thereby stagnating performance [33] (see Fig. 3). This demands an alternative architectural design capable of modeling long-range pixel relationships for better representation learning.

Figure: The vision transformer first splits the input image into patches and projects them (after flattening) into a feature space, where a transformer encoder processes them to produce the final classification output.

Transformers were introduced by Vaswani et al. [14] as a new attention-driven building block for machine translation. Specifically, these attention blocks are neural network layers that aggregate information from the entire input sequence [80]. Since their inception, these models have demonstrated state-of-the-art performance on several Natural Language Processing (NLP) tasks, thereby becoming the default choice over recurrent models. In this section, we focus on Vision Transformers (ViTs) [22], which are built on the vanilla Transformer model [14] by cascading multiple transformer layers to capture the global context of an input image. Specifically, Dosovitskiy et al. [22] interpret an image as a sequence of patches and process it with a standard transformer encoder as used in NLP. These ViT models continue the long-lasting trend of removing hand-crafted visual features and inductive biases from models in an effort to leverage the availability of larger datasets coupled with increased computational capacity. ViTs have garnered immense interest in the medical imaging community, and a number of recent approaches have been proposed that build upon them. We highlight the working principle of ViT for medical image classification in a step-by-step manner in Algorithm 1. Below, we briefly describe the core components behind the success of ViTs, namely self-attention and multi-head self-attention. For a more in-depth analysis of numerous ViT architectures and applications, we refer interested readers to the recent relevant survey papers [23], [81]-[84].

The success of Transformer models has been widely attributed to the self-attention (SA) mechanism due to its ability to model long-range dependencies. The key idea behind the SA mechanism is to learn self-alignment, that is, to determine the relative importance of a single token (patch embedding) with respect to all other tokens in the sequence [80].
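To make these ideas concrete before formalizing them, the following minimal PyTorch-style sketch illustrates the three steps that the equations below make precise: splitting an image into patch tokens, mixing information across all tokens with (single-head) self-attention, and producing a classification output. The class and variable names are hypothetical, and the model is deliberately simplified (one attention head, mean pooling instead of a class token); it is an illustrative sketch under these assumptions rather than the implementation of any surveyed method or of Algorithm 1 itself.

```python
import torch
import torch.nn as nn


class MinimalViTClassifier(nn.Module):
    """Illustrative sketch: patch embedding + one self-attention layer + classifier."""

    def __init__(self, img_size=224, patch_size=16, in_chans=1, dim=256, num_classes=2):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2            # N = HW / P^2 (= 196 here)
        # Flatten each PxP patch and project it to a D-dimensional token embedding.
        self.patch_embed = nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size)
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, dim))
        # Learnable projections producing queries, keys, and values (single head for clarity).
        self.w_q = nn.Linear(dim, dim, bias=False)
        self.w_k = nn.Linear(dim, dim, bias=False)
        self.w_v = nn.Linear(dim, dim, bias=False)
        self.head = nn.Linear(dim, num_classes)                # classification head

    def forward(self, x):                                      # x: (B, C, H, W)
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2)  # (B, N, D)
        tokens = tokens + self.pos_embed                       # add positional information
        q, k, v = self.w_q(tokens), self.w_k(tokens), self.w_v(tokens)
        # Attention matrix A = softmax(Q K^T / sqrt(D_k)) relates every patch to every other patch.
        attn = torch.softmax(q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5), dim=-1)
        z = attn @ v                                           # Z = A V, shape (B, N, D)
        return self.head(z.mean(dim=1))                        # pool tokens and predict the class


# Example: classify a single-channel 224x224 scan (e.g., a chest X-ray) into two classes.
model = MinimalViTClassifier()
logits = model(torch.randn(1, 1, 224, 224))                   # logits.shape == (1, 2)
```

A full ViT additionally stacks many such layers with multi-head attention, MLP blocks, layer normalization, and residual connections, as described next.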
For 2D images, we first reshape the image x ∈ R^{H×W×C} into a sequence of flattened 2D patches x_p ∈ R^{N×(P²C)}, where H and W denote the height and width of the original image, respectively, C is the number of channels, P×P is the resolution of each image patch, and N = HW/P² is the resulting number of patches (for example, a 224×224 image with 16×16 patches yields N = 196). These flattened patches are projected to D dimensions via a trainable linear projection layer and can be represented in matrix form as X ∈ R^{N×D}. The goal of self-attention is to capture the interaction among all these N embeddings, which is done by defining three learnable weight matrices that transform the input X into Queries (via W^Q ∈ R^{D×D_q}), Keys (W^K ∈ R^{D×D_k}), and Values (W^V ∈ R^{D×D_v}). The input sequence X is first projected onto these weight matrices to get Q = XW^Q, K = XW^K, and V = XW^V. The corresponding attention matrix A ∈ R^{N×N} can be written as

A = softmax(QK^T / √D_k).

The output Z ∈ R^{N×D_v} of the SA layer is then given by

Z = AV.

Multi-Head Self-Attention (MHSA) consists of multiple SA blocks (heads) concatenated together channel-wise to model complex dependencies between different elements in the input sequence. Each head has its own learnable weight matrices, denoted by {W^{Q_i}, W^{K_i}, W^{V_i}}, where i = 0, ..., (h−1) and h denotes the total number of heads in the MHSA block. Specifically, we can write

MHSA(Q, K, V) = Concat(Z_0, Z_1, ..., Z_{h−1}) W^O,

where W^O ∈ R^{h·D_v×D} computes a linear transformation of the concatenated heads, and Z_i can be written as

Z_i = SA(XW^{Q_i}, XW^{K_i}, XW^{V_i}).

Note that the complexity of computing the softmax in the SA block is quadratic with respect to the length of the input sequence, which can limit its applicability to high-resolution medical images. Recently, numerous efforts have been made to reduce this complexity, including sparse attention [85], linearized attention [86], low-rank attention [87], memory compression based approaches [88], and improved MHSA [89]. We will discuss efficient SA in the context of medical imaging in the relevant sections. Further, we find it important to clarify that several alternative attention approaches [90]-[93] based on convolutional architectures have been explored in the literature. In this survey, we focus on the specific attention used in transformer blocks (MHSA), which has recently gained significant research attention in medical image analysis. Next, we outline these methods categorized according to specific application domains.

Accurate medical image segmentation is a crucial step in computer-aided diagnosis, image-guided surgery, and treatment planning. The global context modeling capability of Transformers is crucial for accurate medical image segmentation because organs spread over a large receptive field can be effectively encoded by modeling the relationships between spatially distant pixels (e.g., lung segmentation). Furthermore, the background in medical scans is generally scattered (e.g., in ultrasound scans [94]); therefore, learning the global context between the pixels corresponding to the background can help the model prevent misclassification. Below, we highlight various attempts to integrate ViT-based models for medical image segmentation. We broadly classify the ViT-based segmentation approaches into organ-specific and multi-organ categories, as depicted in Fig. 5, due to the varying levels of context modeling required in both sets of methods. ViT-based organ-specific approaches generally consider a specific aspect of the underlying organ to design architectural components or loss functions. We mention specific examples of such design choices in this section.
We have further categorized organ-specific approaches into 2D and 3D-based approaches depending on the input type. Here, we describe the organ-specific ViT-based segmentation approaches for 2D medical scans.

Skin Lesion Segmentation. Accurate skin lesion segmentation for identifying melanoma (cancer cells) is crucial for cancer diagnosis and subsequent treatment planning. However, it remains a challenging task due to significant variations in color, size, occlusions, and contrast of skin lesion areas, resulting in ambiguous boundaries [98] and, consequently, deterioration in segmentation performance. To address the issue of ambiguous boundaries, Wang et al. [97] propose a novel Boundary-Aware Transformer (BAT). Specifically, they design a boundary-wise attention gate in the Transformer architecture to exploit prior knowledge about boundaries. The auxiliary supervision of the boundary-wise attention gate provides feedback to train BAT effectively. Extensive experiments on ISIC 2016+PH2 [99], [100] and ISIC 2018 [101] validate the efficacy of their boundary-wise prior, as shown in Fig. 6. Similarly, Wu et al. [102] propose a dual encoder-based feature adaptive transformer network (FAT-Net) that consists of CNN and transformer branches in the encoder. To effectively fuse the features from these two branches, a memory-efficient decoder and a feature adaptation module have been designed. Experiments on the ISIC 2016-2018 [99], [101], [103] and PH2 [100] datasets demonstrate the effectiveness of the FAT-Net fusion modules.

Figure 6: Comparison of different skin lesion segmentation approaches. From left to right: Input image, CNN-based UNet++ [95], ViT-based TransUNet [96], Boundary-Aware Transformer (BAT) [97], and the ground truth (GT) image. Red circles highlight small regions with an ambiguous boundary where BAT can perform well due to the use of boundary-wise prior knowledge. Image taken from [97].

Tooth Root Segmentation. Tooth root segmentation is one of the critical steps in root canal therapy to treat periodontitis (gum infection) [104]. However, it is challenging due to blurry boundaries and overexposed and underexposed images. To address these challenges, Li et al. [105] propose Group Transformer U-Net (GT U-Net), which consists of transformer and convolutional layers to encode global and local context, respectively. A shape-sensitive Fourier descriptor loss function [106] has been proposed to deal with the fuzzy tooth boundaries. Furthermore, a grouping and bottleneck structure has been introduced in GT U-Net to significantly reduce the computational cost. Experiments on their in-house tooth root segmentation dataset with six evaluation metrics demonstrate the effectiveness of the GT U-Net architectural components and the Fourier-based loss function. In another work, Li et al. [107] propose an anatomy-guided multi-branch Transformer (AGMB-Transformer) to incorporate the strengths of group convolutions [108] and a progressive Transformer network. Experiments on their self-collected dataset of 245 tooth root X-ray images show the effectiveness of AGMB-Transformer.

Cardiac Image Segmentation. Despite their impressive performance in medical image segmentation, Transformers are computationally demanding to train and come with a high parameter budget. To handle these challenges for the cardiac image segmentation task, Deng et al. [109] propose TransBridge, a lightweight, parameter-efficient hybrid model.
TransBridge consists of a Transformer- and CNN-based encoder-decoder structure for left ventricle segmentation in echocardiography. Specifically, the patch embedding layer of the Transformer has been re-designed using the shuffling layer [110] and group convolutions to significantly reduce the number of parameters. Extensive experiments on a large-scale left ventricle segmentation echocardiography dataset [111] demonstrate the benefit of TransBridge over CNN- and Transformer-based baseline approaches [112].

Kidney Tumor Segmentation. Accurate segmentation of kidney tumors via computer-aided diagnosis systems can reduce the effort of radiologists and is a critical step in related surgical procedures. However, it is challenging due to varying kidney tumor sizes and the contrast between tumors and their anatomical surroundings. To address these challenges, Shen et al. [117] propose a hybrid encoder-decoder architecture, COTR-Net, that consists of convolution and transformer layers for end-to-end kidney, kidney cyst, and kidney tumor segmentation. Specifically, the encoder of COTR-Net consists of several convolution-transformer blocks, and the decoder comprises several up-sampling layers with skip connections from the encoder. The encoder weights have been initialized using a pre-trained ResNet [118] architecture to accelerate convergence, and deep supervision has been exploited in the decoder layers to boost segmentation performance. Furthermore, the segmentation masks are refined using morphological operations as a post-processing step. Extensive experiments on the Kidney Tumor Segmentation dataset (KiTS21) [119] demonstrate the effectiveness of COTR-Net.

Cell Segmentation. Inspired by the Detection Transformer (DETR) [132], Zhang et al. propose Cell-DETR [133], a Transformer-based framework for instance segmentation of biological cells. Specifically, they integrate a dedicated attention branch into the DETR framework to obtain instance-wise segmentation masks in addition to box predictions. During training, the focal loss [134] and Sorensen dice loss [132] are used for the segmentation branch. To enhance performance, they integrate three residual decoder blocks [118] in Cell-DETR to generate accurate instance masks. Experiments on their in-house yeast cells dataset demonstrate the effectiveness of Cell-DETR relative to U-Net based baselines [114]. Similarly, existing medical imaging segmentation approaches generally struggle with corneal endothelial cells due to blurry edges caused by the subject's movement [135]. This demands preserving more local details and making full use of the global context. Considering these attributes, Zhang et al. propose MBT-Net, a hybrid model consisting of CNN and transformer layers. Specifically, they propose a body-edge branch that provides precise edge location information and promotes local consistency.

Figure: Comparison of different segmentation approaches: [113], UNet (CNN-based) [114], Attn UNet (CNN-based) [115], UNet++ (CNN-based) [95], and SpecTr (ViT-based) [116]. Image adapted from [116].

Table 1: Evaluation of Transformer-based semantic segmentation methods for pathological image segmentation in terms of average Jaccard index on the PAIP liver histopathological dataset [130]. Transformer-based models tend to outperform CNNs, with the exception of Swin-UNet. Results are from [131], one of the first studies to systematically evaluate the performance of transformers on the pathological image segmentation task. Reported entries include: [126] 0.79±0.14, 0.71±0.26; Segmenter [127] 0.80±0.14, 0.82±0.11; Medical Transformer [128] 0.71±0.14, 0.62±0.17; BEiT [129] 0.72±0.21, 0.66±0.28.
Extensive ablation studies on their self-collected TM-EM3000 and the public Alisarine dataset [137] of corneal endothelial cells show the effectiveness of the MBT-Net architectural components.

Histopathology. Histopathology refers to the diagnosis and study of diseases of tissues under a microscope and is the gold standard for cancer recognition. Therefore, accurate automatic segmentation of histopathology images can substantially alleviate the workload of pathologists. Recently, Nguyen et al. [131] systematically evaluate the performance of six recent ViT- and CNN-based approaches on whole slide images of the PAIP liver histopathological dataset [130]. Their results (shown in Table 1) demonstrate that almost all Transformer-based models indeed exhibit superior performance compared to CNN-based approaches due to their ability to encode the global context.

Here, we describe ViT-based segmentation approaches for volumetric medical data.

Brain Tumor Segmentation. An automatic and accurate brain tumor segmentation approach can lead to the timely diagnosis of neurological disorders such as Alzheimer's disease. Recently, ViT-based models have been proposed to segment brain tumors effectively. Wang et al. [138] have made the first attempt to leverage Transformers for 3D multimodal brain tumor segmentation by effectively modeling local and global features in both the spatial and depth dimensions. Specifically, their encoder-decoder architecture, TransBTS, employs a 3D CNN to extract local 3D volumetric spatial features and Transformers to encode global features. Progressive upsampling in the 3D CNN-based decoder has been used to predict the final segmentation map. To further boost performance, they make use of test-time augmentation. Extensive experimentation on the BraTS 2019 and BraTS 2020 datasets shows the effectiveness of their proposed approach compared to CNN-based methods. Unlike most ViT-based image segmentation approaches, TransBTS does not require pre-training on large datasets and has been trained from scratch. In another work, inspired by the architectural design of TransBTS [138], Jia et al. [139] propose Bi-Transformer U-Net (BiTr-UNet), which performs relatively better on the BraTS 2021 [140] segmentation challenge. Different from TransBTS [138], BiTr-UNet consists of an attention module to refine encoder and decoder features and has two ViT layers (instead of one as in TransBTS). Furthermore, BiTr-UNet adopts a post-processing strategy to eliminate a volume of predicted segmentation if the volume is smaller than a threshold [141], followed by model ensembling via majority voting [142]. Similarly, Peiris et al. [143] propose a lightweight UNet-shaped volumetric transformer, VT-UNet, to segment 3D medical image modalities in a hierarchical manner. Specifically, two self-attention layers have been introduced in the encoder of VT-UNet to capture both global and local contexts. Furthermore, the introduction of window-based self-attention and cross-attention modules and Fourier positional encoding in the decoder significantly improves the accuracy and efficiency of VT-UNet. Experiments on BraTS 2021 [140] show that VT-UNet is robust to data artifacts and exhibits strong generalization ability. In another similar work, Hatamizadeh et al. [145] propose a Swin-UNet-based architecture, Swin UNETR, that consists of a Swin Transformer encoder and a CNN-based decoder.
Specifically, Swin UNETR computes self-attention in an efficient shifted window partitioning scheme and is a top-performing model on the BraTS 2021 [140] validation set. In Table 2, we provide the dice scores and other parameters of various Transformer-based models for the 3D multimodal BraTS 2021 dataset [140].

Histopathology. Boxiang et al. [116] propose the Spectral Transformer (SpecTr) for hyperspectral pathology image segmentation, which employs transformers to learn contextual features across the spectral dimension. To discard irrelevant spectral bands, they introduce a sparsity-based scheme [146]. Furthermore, they employ separate group normalization for each band to eliminate the interference caused by distribution mismatch among spectral images. Extensive experimentation on the hyperspectral pathology dataset, Cholangiocarcinoma [147], shows the effectiveness of SpecTr, as also shown in Fig. 7.

Breast Tumor Segmentation. Detection of breast cancer in the early stages can reduce the fatality rate by more than 40% [148]. Therefore, automatic breast tumor detection is of immense importance to doctors. Recently, Zhu et al. [149] propose a region-aware transformer network (RAT-Net) to effectively fuse breast tumor region information at multiple scales to obtain precise segmentation. Extensive experiments on a large ultrasound breast tumor segmentation dataset show that RAT-Net outperforms CNN- and transformer-based baselines. Similarly, Liu et al. [150] also propose a hybrid architecture consisting of transformer layers in the decoder part of 3D UNet [151] to effectively segment tumors from volumetric breast data.

Multi-organ segmentation aims to segment several organs simultaneously and is challenging due to inter-class imbalance and the varying sizes, shapes, and contrast of different organs. ViT models are particularly suitable for multi-organ segmentation due to their ability to effectively model global relations and differentiate multiple organs. We have categorized multi-organ segmentation approaches based on their architectural design, as these approaches do not consider any organ-specific aspect and generally focus on boosting performance by designing effective and efficient architectural modules [152]. We categorize multi-organ segmentation approaches into Pure Transformer (only ViT layers) and Hybrid Architectures (both CNN and ViT layers).

Pure Transformer-based architectures consist of only ViT layers and have seen fewer applications in medical image segmentation compared to hybrid architectures, as both global and local information is crucial for dense prediction tasks like segmentation [96]. Recently, Karimi et al. [153] propose a pure Transformer-based model for 3D medical image segmentation by leveraging self-attention [17] between neighboring linear embeddings of 3D medical image patches. They also propose a method to effectively pre-train their model when only a few labeled images are available.

Figure 8: Overview of the TransUNet architecture [96] proposed for multi-organ segmentation. It is one of the first transformer-based architectures proposed for medical image segmentation and merits both the transformer and UNet. It employs a hybrid CNN-Transformer architecture for the encoder, followed by multiple upsampling layers in the decoder to output the final segmentation mask. Image adapted from [96].
Extensive experiments show the effectiveness of their convolution-free network on three benchmark 3D medical imaging datasets related to the brain cortical plate [154], pancreas, and hippocampus. One of the drawbacks of using pure Transformer-based models in segmentation is the quadratic complexity of self-attention with respect to the input image dimensions. This can hinder the applicability of ViTs in the segmentation of high-resolution medical images. To mitigate this issue, Cao et al. [125] propose Swin-UNet, which, like the Swin Transformer [126], computes self-attention within a local window and has linear computational complexity with respect to the input image. Swin-UNet also contains a patch expanding layer for upsampling the decoder's feature maps and shows superior performance in recovering fine details compared to bilinear upsampling. Experiments on the Synapse and ACDC [155] datasets demonstrate the effectiveness of the Swin-UNet architectural design.

Hybrid architecture-based approaches combine the complementary strengths of Transformers and CNNs to effectively model the global context and capture local features for accurate segmentation. We have further categorized these hybrid models into single- and multi-scale approaches.

Single-Scale Architectures: These methods process the input image information at one scale only and have seen widespread application in medical image segmentation due to their low computational complexity compared to multi-scale architectures. We can sub-categorize single-scale architectures based on the position of the Transformer layers in the model. These sub-categories include Transformer in Encoder, Transformer between Encoder and Decoder, Transformer in Encoder and Decoder, and Transformer in Decoder.

Transformer in Encoder. Most initially developed Transformer-based medical image segmentation approaches have Transformer layers in the model's encoder. The first work in this category is TransUNet [96], which consists of 12 Transformer layers in the encoder, as shown in Figure 8. These Transformer layers encode the tokenized image patches from the CNN layers. The resulting encoded features are upsampled via up-sampling layers in the decoder to output the final segmentation map. With skip connections incorporated, TransUNet set new records (at the time of publication) on the Synapse multi-organ segmentation dataset [156] and the automated cardiac diagnosis challenge (ACDC) [155]. In other work, Zhang et al. propose TransFuse [157] to effectively fuse features from the Transformer and CNN layers via a BiFusion module. The BiFusion module leverages self-attention and a multi-modal fusion mechanism to selectively fuse the features. Extensive evaluation of TransFuse on multiple modalities (2D and 3D), including polyp segmentation, skin lesion segmentation, hip segmentation, and prostate segmentation, demonstrates its efficacy. Both TransUNet [96] and TransFuse [157] require pre-training on the ImageNet dataset [158] to effectively learn the positional encoding of the images. To learn this positional bias without any pre-training, Valanarasu et al. [128] propose a modified gated axial attention layer [159] that works well on small medical image segmentation datasets. Furthermore, to boost segmentation performance, they propose a Local-Global training scheme to focus on the fine details of input images.
Extensive experimentation on brain anatomy segmentation [160], gland segmentation [161], and MoNuSeg (microscopy) [162] demonstrates the effectiveness of their proposed gated axial attention module. In another work, Tang et al. [163] introduce Swin UNETR, a novel self-supervised learning framework with proxy tasks to pre-train the Transformer encoder on 5,050 images of a CT dataset. They validate the effectiveness of pre-training by fine-tuning the Transformer encoder with a CNN-based decoder on the downstream MSD and BTCV segmentation datasets. Similarly, Sobirov et al. [164] show that transformer-based models can achieve comparable results to state-of-the-art CNN-based approaches on the task of head and neck tumor segmentation. A few works have also investigated the effectiveness of Transformer layers by integrating them into the encoder of UNet-based architectures in a plug-and-play manner. For instance, Cheng et al. [165] propose TransClaw UNet by integrating Transformer layers in the encoding part of Claw UNet [166] to exploit multi-scale information. TransClaw UNet achieves an absolute gain of 0.6 in dice score compared to Claw UNet on the Synapse multi-organ segmentation dataset and shows excellent generalization. Similarly, inspired by LeViT [167], Xu et al. [168] propose LeViT-UNet, which aims to optimize the trade-off between accuracy and efficiency. LeViT-UNet is a multi-stage architecture that demonstrates good performance and generalization ability on the Synapse and ACDC benchmarks.

Transformer between Encoder and Decoder. In this category, Transformer layers are placed between the encoder and decoder of a U-shaped architecture. These architectures are more suitable for avoiding the loss of details during downsampling in the encoder layers. The first work in this category is TransAttUNet [169], which leverages guided attention and multi-scale skip connections to enhance the flexibility of the traditional UNet. Specifically, a robust self-aware attention module has been embedded between the encoder and decoder of the UNet to concurrently exploit the expressive abilities of global spatial attention and transformer self-attention. Extensive experiments on five benchmark medical imaging segmentation datasets demonstrate the effectiveness of the TransAttUNet architecture.

Figure 9: Overview of the interleaved encoder of not-another transFormer (nnFormer) [144] for volumetric medical image segmentation. Note that convolution and transformer layers are interleaved to give full play to their strengths. Image taken from [144].

Similarly, Yan et al. [170] propose the Axial Fusion Transformer UNet (AFTer-UNet), which contains a computationally efficient axial fusion layer between the encoder and decoder to effectively fuse inter- and intra-slice information for 3D medical image segmentation. Experimentation on the BCV [171], Thorax-85 [172], and SegTHOR [173] datasets demonstrates the effectiveness of their proposed fusion layer.

Transformer in Encoder and Decoder. A few works integrate Transformer layers in both the encoder and decoder of a U-shaped architecture to better exploit the global context for medical image segmentation. The first work in this category is UTNet, which efficiently reduces the complexity of the self-attention mechanism from quadratic to linear [174].
Furthermore, to model the image content effectively, UTNet exploits two-dimensional relative position encoding [20]. Experiments show the strong generalization ability of UTNet on a multi-label, multi-vendor cardiac MRI challenge cohort [175]. Similarly, to optimally combine convolution and transformer layers for medical image segmentation, Zhou et al. [144] propose nnFormer, an interleaved encoder-decoder architecture, where convolution layers encode precise spatial information and Transformer layers encode the global context, as shown in Fig. 9. Like the Swin Transformer [126], the self-attention in nnFormer is computed within a local window to reduce the computational complexity. Moreover, deep supervision in the decoder layers has been employed to enhance performance. Experiments on the ACDC and Synapse datasets show that nnFormer surpasses Swin-UNet [125] (a transformer-based medical segmentation approach) by over 7% in dice score on the Synapse dataset. In other work, Lin et al. propose the Dual Swin Transformer UNet (DS-TransUNet) [176] to incorporate the advantages of the Swin Transformer into a U-shaped architecture for medical image segmentation. They split the input image into non-overlapping patches at two scales and feed them into the two Swin Transformer-based branches of the encoder. A novel Transformer Interactive Fusion module has been proposed to build long-range dependencies between different-scale features in the encoder. DS-TransUNet outperforms CNN-based methods on four standard datasets related to polyp segmentation, ISIC 2018, GLAS, and Data Science Bowl 2018.

Transformer in Decoder. Li et al. [177] investigate the use of the Transformer as an upsampling block in the decoder of the UNet for medical image segmentation. Specifically, they adopt a window-based self-attention mechanism to better complement the upsampled feature maps while maintaining efficiency. Experiments on the MSD Brain and Synapse datasets demonstrate the superiority of their architecture compared to bilinear upsampling.

| Method | Organ | Modality | Type | Datasets | Metrics | Architecture | P.T | Highlights |
|---|---|---|---|---|---|---|---|---|
| TransUNet [96] | Multi-organ | CT, MRI | 2D | Synapse [156], ACDC [155] | Dice, Hausdorff distance | Hybrid | Yes | Encodes strong global context by treating the image features as sequences while also utilizing low-level CNN features via a U-shaped hybrid architectural design. |
| TransFuse [157] | Multi-organ | | | | | | | |
| UTNet [34] | Heart | MRI | 2D | MRI Challenge Cohort [175] | Dice, Hausdorff distance | Hybrid | No | Self-attention modules in encoder and decoder; relative position encoding designed to reduce the complexity of self-attention from quadratic to linear. |
| TransClaw UNet [165] | Multi-organ | CT | 2D | Synapse [156] | Dice, Hausdorff distance | | | Integrates transformer layers in the encoder path of Claw-UNet to extract shallow spatial features. |
| TransAttUNet [169] | Multi-organ | X-ray, CT | 2D | ISIC 2018 [101], JSRT [194], Montgomery [195], NIH [196], Clean-CC-CCII [197], Data Science Bowl 18 [193], GLAS [161] | Dice, F1 | | | Multi-level guided attention and multi-scale skip connections to mitigate the information recession problem. |
| LeViT-UNet [168] | Multi-organ | CT, MRI | 2D | Synapse [156], ACDC [155] | Dice, Hausdorff distance | Hybrid | | Integrates the multi-scale LeViT architecture as the encoder in UNet. |
| Polyp-PVT [198] | Multi-organ | | | | | | | |
| nnFormer [144] | Multi-organ | CT, MRI | 3D | Synapse [156], ACDC [155] | Dice | Hybrid | | Interleaved convolution and self-attention based encoder-decoder architecture. |
| MISSFormer [199] | Multi-organ | CT, MRI | 2D | Synapse [156], ACDC [155] | Dice, Hausdorff distance | | | Axial fusion mechanism to fuse intra-slice and inter-slice contextual information to guide segmentation. |
| — | Multi-organ | CT, MRI | 3D | BraTS 21 [140], MSD [171] | Dice, Hausdorff distance | Hybrid | Yes | U-shaped encoder-decoder design; the encoder has two consecutive self-attention layers to encode local and global cues, and the decoder has parallel shifted window based self- and cross-attention blocks to capture fine details. |
| Swin UNETR [145] | Brain | MRI | 3D | BraTS 21 [140] | Dice, Hausdorff distance | Hybrid | Yes | Swin-UNet-based architecture with a Swin Transformer encoder and a CNN-based decoder; computes self-attention in an efficient shifted window partitioning scheme. |

Table 3: An overview of ViT-based approaches for medical image segmentation. P.T: pre-training.
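Since nearly every entry in Table 3 reports the Dice score, and the study in Table 1 reports the average Jaccard index, a minimal sketch of how these overlap metrics are computed from binary segmentation masks is given below. The function and variable names are illustrative only and do not correspond to any surveyed implementation; reported results may additionally average over classes or cases and follow dataset-specific conventions.

```python
import torch


def dice_score(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-6) -> float:
    """Dice = 2|A ∩ B| / (|A| + |B|) for binary masks (1 = organ/lesion, 0 = background)."""
    pred, target = pred.bool(), target.bool()
    intersection = (pred & target).sum().item()
    return (2.0 * intersection + eps) / (pred.sum().item() + target.sum().item() + eps)


def jaccard_index(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-6) -> float:
    """Jaccard (IoU) = |A ∩ B| / |A ∪ B|, the metric reported in Table 1."""
    pred, target = pred.bool(), target.bool()
    intersection = (pred & target).sum().item()
    union = (pred | target).sum().item()
    return (intersection + eps) / (union + eps)


# Toy 4x4 prediction / ground-truth pair: 3 of the 4 predicted pixels overlap the ground truth.
pred = torch.tensor([[0, 1, 1, 0], [0, 1, 1, 0], [0, 0, 0, 0], [0, 0, 0, 0]])
gt = torch.tensor([[0, 1, 1, 0], [0, 1, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]])
print(dice_score(pred, gt), jaccard_index(pred, gt))  # ~0.857 and 0.75
```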
Figure 10: Qualitative results of the brain tumor segmentation task using transformers. From left to right: Ground truth image, UNETR [35] (ViT-based), TransBTS [138] (ViT-based), CoTr [200] (ViT-based), and UNet [114] (CNN-based). Note that transformer-based approaches demonstrate better performance in capturing the fine-grained details of brain tumors compared to the CNN-based method. Image courtesy of [35].

Figure 11: Overview of the CNN and Transformer (CoTr) architecture [112] proposed for 3D medical image segmentation. It consists of a CNN encoder (left) to extract multi-scale features from the input, followed by a DeTrans-encoder (yellow blocks) to process the flattened multi-scale feature maps. Output features from the encoder are fed to the CNN decoder (right) for segmentation mask prediction. Image courtesy of [112].

In another work, Li et al. [190] propose SegTran, a Squeeze-and-Expansion Transformer for 2D and 3D medical image segmentation. Specifically, the squeeze block regularizes the attention matrix, and the expansion block learns diversified representations. Furthermore, a learnable sinusoidal positional encoding has been proposed that helps the model encode spatial relationships. Extensive experiments on the Polyp, BraTS19, and REFUGE20 (fundus images) segmentation challenges demonstrate the strong generalization ability of SegTran.

Multi-Scale Architectures: These architectures process the input at multiple scales to effectively segment organs having irregular shapes and different sizes. Here, we highlight various attempts to integrate multi-scale architectures for medical image segmentation. We further group these approaches into 2D and 3D segmentation categories based on the input image type.

2D Segmentation. Most ViT-based multi-organ segmentation approaches struggle to capture information at multiple scales as they partition the input image into fixed-size patches, thereby losing useful information. To address this issue, Zhang et al. [183] propose a pyramid medical transformer, PMTrans, that leverages multi-resolution attention to capture correlations at different image scales using a pyramidal architecture [201]. PMTrans works on multi-resolution images via an adaptive patch partitioning scheme to access different receptive fields without changing the overall complexity of the self-attention computation. Extensive experiments on three medical imaging datasets, GLAS [161], MoNuSeg [184], and HECKTOR [189], show the effectiveness of exploiting multi-scale information. In other work, Ji et al. [39] propose a Multi-Compound Transformer (MCTrans) that not only learns feature consistency of the same semantic categories but also captures correlations among different semantic categories for accurate segmentation [202].
Specifically, MCTrans captures cross-scale contextual dependencies via a Transformer self-attention module and learns semantic correspondence among different categories via a Transformer cross-attention module. An auxiliary loss has also been introduced to improve the feature correlation of the same semantic category. Extensive experiments on six benchmark segmentation datasets demonstrate the effectiveness of the architectural components of MCTrans.

3D Segmentation. The majority of multi-scale architectures have been proposed for 2D medical image segmentation. To directly handle volumetric data, Hatamizadeh et al. [35] propose a ViT-based architecture (UNETR) for 3D medical image segmentation. UNETR consists of a pure transformer as the encoder to learn sequence representations of the input volume. The encoder is connected to a CNN-based decoder via skip connections to compute the final segmentation output. UNETR achieves impressive performance on the BTCV [203] and MSD [171] segmentation datasets, as shown in Fig. 10. One of the drawbacks of UNETR is its large computational complexity in processing large 3D input volumes. To mitigate this issue, Xie et al. [112] propose a computationally efficient deformable self-attention module [204] that operates on multi-scale features, as shown in Figure 11, to reduce the computational and spatial complexities. Experiments on BTCV [203] demonstrate the effectiveness of their deformable self-attention module for 3D multi-organ segmentation.

From the extensive literature reviewed in this section, we note that the medical image segmentation area is heavily impacted by transformer-based models, with more than 50 publications within one year since the inception of the first ViT model [22]. We believe such interest is due to the availability of large medical segmentation datasets and the challenge competitions associated with them in top conferences, compared to other medical imaging applications. As shown in Fig. 12, a recent transformer-based hybrid architecture is able to achieve a 13% performance gain in terms of dice score compared to the baseline transformer model, indicating the rapid progression of the field. In short, ViT-based architectures have achieved impressive results on benchmark medical datasets, competing with, and most of the time improving over, CNN-based segmentation approaches (see Table 3 for details). Below, we briefly describe some of the challenges associated with ViT-based medical segmentation methods and give possible solutions based on insights from the relevant papers discussed.

As mentioned before, the high computational cost associated with extracting features at multiple levels hinders the applicability of multi-scale architectures in medical segmentation tasks. These multi-scale architectures process input image information at multiple levels and achieve superior performance compared to single-scale architectures. Therefore, designing efficient transformer architectures for multi-scale processing requires more attention. Most of the proposed ViT-based models are pre-trained on the ImageNet dataset for the downstream task of medical image segmentation. This approach is sub-optimal due to the large domain gap between natural and medical image modalities. Recently, a few attempts have been made to investigate the impact of self-supervised pre-training on medical imaging datasets on the segmentation performance of ViTs.
However, these works have shown that a ViT pre-trained on one modality (CT) gives unsatisfactory performance when applied directly to other medical imaging modalities (MRI) due to the large domain gap, making this an exciting avenue to explore. We defer a detailed discussion related to pre-training ViTs for downstream medical imaging tasks to Sec. 11.1. Moreover, recent ViT-based approaches mainly focus on 2D medical image segmentation. Designing customized architectural components that incorporate temporal information for efficient high-resolution and high-dimensional segmentation of volumetric images has not been extensively explored. Recently, a few efforts have been made, e.g., UNETR [35] uses Swin Transformer [126] based architectures to avoid quadratic computing complexity; however, this requires further attention from the community. In addition to focusing on the scale of datasets, with the advent of ViTs, we note there is a need to collect more diverse and challenging medical imaging datasets. Although diverse and challenging datasets are also crucial to gauge the performance of ViTs in other medical imaging applications, they are particularly relevant for medical image segmentation due to the major influx of ViT-based models in this area. We believe these datasets will play a decisive role in exploring the limits of ViTs for medical image segmentation.

Accurate classification of medical images plays an essential role in aiding clinical care and treatment. In this section, we comprehensively cover applications of ViTs in medical image classification. We have broadly categorized these approaches into COVID-19, tumor, and retinal disease classification based methods due to the different sets of challenges associated with these categories, as shown in Fig. 13.

Studies suggest that COVID-19 can potentially be better diagnosed with radiological imaging as compared to the tedious real-time polymerase chain reaction (RT-PCR) test [205]-[207]. Recently, ViTs have been successfully employed for the diagnosis and severity prediction of COVID-19, showing SOTA performance. In this section, we briefly describe the impact of ViTs in advancing recent efforts on automated image analysis for the COVID-19 diagnosis process. Most of these works use three modalities, including computed tomography (CT), ultrasound scans (US), and X-ray. We have further categorized ViT-based COVID-19 classification approaches into Black-box models and Interpretable models according to the level of explainability offered.

ViT-based Black-box models for COVID-19 image classification generally focus on improving accuracy by designing novel and efficient ViT architectures. However, these models are not easily interpretable, making it challenging to gain user trust. We have further sub-categorized black-box models into 2D and 3D categories, depending on the input image type. Below, we briefly describe these approaches:

Figure 13: Taxonomy of ViT-based medical image classification approaches. The influx of ViT-based COVID-19 classification approaches makes it a dominating category in the taxonomy.

2D: The high computational cost of ViTs hinders their deployment on portable devices, thereby limiting their applicability in real-time COVID-19 diagnosis. Perera et al. [208] propose a lightweight Point-of-Care Transformer (POCFormer) to diagnose COVID-19 from lung images captured via portable devices.
Specifically, POCFormer leverages Linformer [174] to reduce the space and time complexity of self-attention from quadratic to linear. POCFormer has two million parameters, about half as many as MobileNetV2 [209], thus making it suitable for real-time diagnosis. Experiments on the COVID-19 lung POCUS dataset [210], [211] demonstrate the effectiveness of their proposed architecture, with above 90% classification accuracy. In other work, Liu et al. [212] propose a ViT-based model for COVID-19 diagnosis by exploiting a new attention mechanism named Vision Outlooker (VOLO) [213]. VOLO is effective for encoding fine-level features into the ViT token representation, thereby improving classification performance. Further, they leverage a transfer learning approach to handle the issue of insufficient and generally unbalanced COVID-19 datasets. Experiments on two publicly available COVID-19 CXR datasets [211], [214] demonstrate the effectiveness of their architecture. Similarly, Jiang et al. [215] leverage the Swin Transformer [126] and Transformer-in-Transformer [216] to classify COVID-19 images against pneumonia and normal images. To further boost accuracy, they employ model ensembling using a weighted average. Research progress in ViT-based COVID-19 diagnosis approaches is heavily impeded by the requirement of a large amount of labeled COVID-19 data, thereby demanding collaborations among hospitals. This collaboration is difficult due to limited consent by patients, privacy concerns, and ethical data usage [217]. To mitigate this issue, Park et al. [218] propose a Federated Split Task-Agnostic (FESTA) framework that leverages the merits of Federated and Split Learning [219], [220] in utilizing ViTs to simultaneously process multiple chest X-ray tasks, including COVID-19 diagnosis, on a massive decentralized dataset. Specifically, they split the ViT into a shared transformer body and task-specific heads, and demonstrate the suitability of the ViT body to be shared across relevant tasks by leveraging a multi-task learning (MTL) [221] strategy, as shown in Fig. 14. They affirm the suitability of ViTs for collaborative learning in medical imaging applications via extensive experiments on the CXR dataset.

3D: Most ViT-based approaches for COVID-19 classification operate on 2D information only. However, as suggested by Kwee et al. [222], the symptoms of COVID-19 might be present at different depths (slices) for different patients. To exploit both 2D and 3D information, Hsu et al. [223] propose a hybrid network consisting of transformers and CNNs. Specifically, they determine the importance of slices based on significant symptoms in the CT scan via the Wilcoxon signed-rank test [224], with the Swin Transformer [126] as the backbone network. To further exploit the intrinsic features in the spatial and temporal dimensions, they propose a Convolutional CT Scan Aware Transformer module to fully capture the context of the 3D scans. Extensive experiments on the COVID-19-CT dataset show the effectiveness of their proposed architectural components. Similarly, Zhang et al. [225], [226] also propose a Swin Transformer based two-stage framework for the diagnosis of COVID-19 on a 3D CT scan dataset [227]. Specifically, their framework consists of a UNet-based lung segmentation model followed by image classification with a Swin Transformer [126] backbone.
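Returning to the FESTA design described above, the following minimal sketch illustrates the idea of a single shared transformer body with lightweight task-specific heads for multiple chest X-ray tasks. All module names, dimensions, and the task list are hypothetical, and the federated/split-learning communication between clients and a server is omitted; this is only an architectural sketch under those assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class SharedTransformerBody(nn.Module):
    """Hypothetical shared ViT body: patch embedding followed by transformer encoder layers."""

    def __init__(self, in_chans=1, dim=256, depth=4, heads=8, patch=16, img=224):
        super().__init__()
        self.patch_embed = nn.Conv2d(in_chans, dim, kernel_size=patch, stride=patch)
        self.pos_embed = nn.Parameter(torch.zeros(1, (img // patch) ** 2, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, x):                                        # x: (B, C, H, W)
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2) + self.pos_embed
        return self.encoder(tokens).mean(dim=1)                  # pooled feature shared by all tasks


class MultiTaskCXRModel(nn.Module):
    """One shared transformer body with a separate lightweight head per chest X-ray task."""

    def __init__(self, task_classes, dim=256):
        super().__init__()
        self.body = SharedTransformerBody(dim=dim)               # shared across all tasks
        self.heads = nn.ModuleDict({t: nn.Linear(dim, c) for t, c in task_classes.items()})

    def forward(self, x, task):                                  # route through the task-specific head
        return self.heads[task](self.body(x))


# Hypothetical task list; in FESTA the body is additionally shared across participating clients.
model = MultiTaskCXRModel({"covid": 2, "pneumonia": 3})
covid_logits = model(torch.randn(2, 1, 224, 224), task="covid")          # shape (2, 2)
pneumonia_logits = model(torch.randn(2, 1, 224, 224), task="pneumonia")  # shape (2, 3)
```

In a split-learning setting, the body and the heads would typically reside on different parties and exchange intermediate features and gradients during training.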
Interpretable models aim to show the features that influence the decision of a model the most, generally via visualization techniques such as saliency-based methods and Grad-CAM. Due to their interpretable nature, these models are well suited to gaining the trust of physicians and patients and have therefore paved the way for clinical deployment. We have further divided interpretable models into saliency-based [228] and Grad-CAM [229] based visualization approaches.

Saliency Based Visualization. Park et al. [231] propose a ViT-based method for COVID-19 diagnosis by exploiting low-level CXR features extracted from a pre-trained backbone network. The backbone network has been trained in a self-supervised manner (using the contrastive learning-based SimCLR method [232]) to extract abnormal CXR feature embeddings from the large and well-curated CheXpert CXR dataset [233]. These feature embeddings are leveraged by the ViT model for high-level diagnosis of COVID-19 images. Extensive experiments on three CXR test datasets acquired from different hospitals demonstrate the superiority of their approach compared to CNN-based models. They also validated the generalization ability of their proposed approach and adopted saliency map visualizations [234] to provide interpretable results. Similarly, Gao et al. [235] propose COVID-ViT to classify COVID from non-COVID images as part of the MIA-COVID19 challenge [227]. Their experiments on 3D CT lung images demonstrate the superiority of the ViT-based approach over a DenseNet [236] baseline in terms of F1 score. In another work, Mondal et al. [230] introduce xViTCOS for COVID-19 screening from lung CT and X-ray images. Specifically, they pre-train xViTCOS on ImageNet to learn generic image representations and fine-tune the pre-trained model on a large chest radiographic dataset. Further, xViTCOS leverages the explainability-driven saliency-based approach [234] with clinically interpretable visualizations to highlight the role of critical factors in the resulting predictions, as shown in Figure 15. Experiments on COVID CT-2A [237] and their privately collected chest X-ray dataset demonstrate the effectiveness of xViTCOS.

Grad-CAM Based Visualization. Shome et al.

A tumor is an abnormal growth of body tissue and can be cancerous (malignant) or noncancerous (benign). Early-stage malignant tumor diagnosis is crucial for subsequent treatment planning and can greatly improve the patient's survival rate. In this section, we review ViT-based models for tumor classification. These models can be mainly categorized into Black-box models and Interpretable models. We highlight the relevant anatomies in bold.

Black-Box Models. TransMed [240] is the first work that leverages ViTs for medical image classification. It is a hybrid CNN and transformer-based architecture that is capable of classifying parotid tumors in multi-modal MRI images. TransMed also employs a novel image fusion strategy to effectively capture mutual information from images of different modalities, thereby achieving competitive results on their privately collected parotid tumor classification dataset. Later, Lu et al. [241] propose a two-stage framework that first performs contrastive pre-training for glioma sub-type classification in the brain, followed by feature aggregation via a proposed transformer-based sparse attention module. Ablation studies on the TCGA-NSCLC [242] dataset show the effectiveness of their two-stage framework. For the task of breast cancer classification, Gheflati et al. [243]
A tumor is an abnormal growth of body tissue and can be cancerous (malignant) or noncancerous (benign). Early-stage malignant tumor diagnosis is crucial for subsequent treatment planning and can greatly improve the patient's survival rate. In this section, we review ViT-based models for tumor classification. These models can be mainly categorized into black-box models and interpretable models. We highlight the relevant anatomies in bold. Black-Box Models. TransMed [240] is the first work that leverages ViTs for medical image classification. It is a hybrid CNN-transformer architecture capable of classifying parotid tumors in multi-modal MRI images. TransMed also employs a novel image fusion strategy to effectively capture mutual information from images of different modalities, thereby achieving competitive results on a privately collected parotid tumor classification dataset. Later, Lu et al. [241] propose a two-stage framework that first performs contrastive pre-training for glioma sub-type classification in the brain, followed by feature aggregation via a proposed transformer-based sparse attention module. Ablation studies on the TCGA-NSCLC [242] dataset show the effectiveness of their two-stage framework. For the task of breast cancer classification, Gheflati et al. [243] systematically evaluate the performance of pure and hybrid pre-trained ViT models. Experiments on two breast ultrasound datasets provided by Al-Dhabyani et al. [244] and Yap et al. [245] show that ViT-based models provide better results than CNNs for classifying images into benign, malignant, and normal categories. Similarly, other works employ hybrid Transformer-CNN architectures to solve medical classification problems for different organs. For instance, Khan et al. [246] propose Gene-Transformer to predict lung cancer subtypes. Experiments on the TCGA-NSCLC [242] dataset demonstrate the superiority of Gene-Transformer over CNN baselines. Chen et al. [247] present a multi-scale GasHis-Transformer to diagnose gastric cancer in the stomach. Jiang et al. [248] propose a hybrid model to diagnose acute lymphocytic leukemia using a symmetric cross-entropy loss function. Interpretable Models. Since the annotation procedure is expensive and laborious, one label is assigned to a set of instances (bag) in whole-slide imaging (WSI) based pathology diagnosis. This type of weakly supervised learning is known as Multiple Instance Learning (MIL) [249], where a bag is labeled positive if at least one instance is positive and negative when all instances in the bag are negative. Most current MIL methods assume that the instances in each bag are independent and identically distributed, thereby neglecting the correlation among different instances. Shao et al. [239] present TransMIL to explore both morphological and spatial information in weakly supervised WSI classification. Specifically, TransMIL aggregates morphological information with two transformer-based modules and a position encoding layer, as shown in Fig. 16. To encode spatial information, a pyramid position encoding generator is proposed. Further, the attention scores from TransMIL are visualized to demonstrate interpretability, as shown in Fig. 17. TransMIL shows state-of-the-art performance on three different computational pathology datasets: CAMELYON16 (breast) [250], TCGA-NSCLC (lung) [242], and TCGA-R (kidney) [251]. To diagnose lung tumors, Zheng et al. [252] propose a graph transformer network (GTN) that leverages a graph-based representation of the WSI. GTN consists of a graph convolutional layer [253], a transformer layer, and a pooling layer. GTN further employs GraphCAM [234] to identify regions that are highly associated with the class label. Extensive evaluations on the TCGA dataset [242] show the effectiveness of GTN. Yu et al. [258] propose the MIL-ViT model, which is first pre-trained on a large fundus image dataset and later fine-tuned on the downstream task of retinal disease classification. The MIL-ViT architecture uses a MIL-based head that can be combined with a ViT in a plug-and-play manner. Evaluations on the APTOS2019 [255] and RFMiD2020 [259] datasets show that MIL-ViT achieves more favorable performance than CNN-based baselines. Most data-driven approaches treat diabetic retinopathy (DR) grading and lesion discovery as two separate tasks, which may be sub-optimal as errors may propagate from one stage to the other. To jointly handle both tasks, Sun et al. [260] propose a lesion-aware transformer (LAT) that consists of a pixel-relation-based encoder and a lesion-aware transformer decoder. In particular, they leverage the transformer decoder to formulate lesion discovery as a weakly supervised lesion localization problem. The LAT model sets the state-of-the-art on the Messidor-1 [261], Messidor-2 [261], and EyePACS [262] datasets.
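To make the weakly supervised MIL setting described above more concrete, the following is a minimal sketch of transformer-based bag aggregation: patch (instance) embeddings from one slide attend to each other, and a class token produces a single bag-level prediction. It illustrates the general idea only; TransMIL [239] additionally uses Nystrom-approximated attention and a pyramid position encoding generator, which are omitted here, and the feature and model dimensions below are assumptions.

```python
import torch
import torch.nn as nn

class TransformerMIL(nn.Module):
    """Aggregate a bag of instance embeddings into one bag-level prediction."""
    def __init__(self, feat_dim=1024, dim=256, num_classes=2):
        super().__init__()
        self.embed = nn.Linear(feat_dim, dim)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, bag):                           # bag: (1, num_instances, feat_dim)
        x = self.embed(bag)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = self.encoder(torch.cat([cls, x], dim=1))  # instances attend to one another
        return self.head(x[:, 0])                     # bag-level logits from the class token

bag = torch.randn(1, 500, 1024)                       # 500 patch embeddings from one slide
logits = TransformerMIL()(bag)                        # a single prediction for the whole bag
print(logits.shape)                                   # torch.Size([1, 2])
```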
Yang et al. [273] propose a hybrid architecture consisting of convolutional and Transformer layers for fundus disease classification on the OIA dataset [274]. Similarly, Wu et al. [275] and Aldahou et al. [276] also verify that ViT models are more accurate in DR grading than their CNN counterparts. In this section, we have provided a comprehensive overview of about 25 papers related to applications of ViTs in medical image classification. In particular, we see a surge of Transformer-based architectures for diagnosing COVID-19, compelling us to develop our taxonomy accordingly. Below, we briefly highlight some of the challenges associated with this area, identify recent trends, and provide future directions worthy of further exploration. The lack of large COVID-19 datasets has hindered the applicability of ViT models to COVID-19 diagnosis. A recent work by Shome et al. [238] attempts to mitigate this issue by combining three open-source COVID-19 datasets to create a large dataset comprising 30,000 images. Still, creating diverse and large COVID-19 datasets is challenging and requires significant effort from the medical community. More attention must be given to designing interpretable (to gain end-users' trust) and efficient (for point-of-care testing) ViT models for COVID-19 diagnosis to make them a viable alternative to RT-PCR testing in the future. We notice that most works have used the original ViT model [22] in a plug-and-play manner to boost medical image classification performance. In this regard, we believe that integrating domain-specific context and accordingly designing architectural components and loss functions can enhance performance and provide more insights for designing effective ViT-based classification models in the future. Finally, let us highlight the exciting work of Matsoukas et al. [33] that, for the first time, demonstrates that ViTs pre-trained on ImageNet perform comparably to CNNs for the medical image classification task, as shown in Table 5. This also raises an interesting question: "Can ViT models pre-trained on medical imaging datasets perform better than ViT models pre-trained on ImageNet for medical image classification?" A recent work by Xie et al. [277] attempts to answer this by pre-training a ViT on large-scale 2D and 3D medical images. On the medical image classification problem, their model obtains substantial performance gains over the ViT model pre-trained on ImageNet, indicating that this area is worth exploring further. A brief overview of ViT-based medical image classification approaches is provided in Table 4. In medical image analysis, object detection refers to the localization and identification of regions of interest (ROIs), such as lung nodules in X-ray images, and is typically an essential aspect of diagnosis. However, it is one of the most time-consuming tasks for clinicians, thereby demanding accurate computer-aided diagnosis (CAD) systems that act as a second observer and may accelerate the process. Following the success of CNNs in medical image detection [278], [279], a few recent attempts have been made to improve performance further using Transformer models. These approaches are mainly based on the detection transformer (DETR) framework [24]. Shen et al. [200] propose the first hybrid framework, COTR, consisting of convolutional and transformer layers for end-to-end polyp detection.
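These detectors share the DETR recipe of decoding a fixed set of learned object queries against encoder features and mapping each query to class and box predictions through feed-forward heads. The following is a minimal sketch of such a query-based detection head; the layer counts, dimensions, and number of queries are illustrative and are not those of DETR [24] or COTR [200].

```python
import torch
import torch.nn as nn

class QueryDetectionHead(nn.Module):
    """Learned object queries decoded against encoder features, DETR-style."""
    def __init__(self, dim=256, num_queries=100, num_classes=1):
        super().__init__()
        self.queries = nn.Embedding(num_queries, dim)
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=6)
        self.class_head = nn.Linear(dim, num_classes + 1)    # +1 for the "no object" class
        self.box_head = nn.Linear(dim, 4)                    # normalized (cx, cy, w, h)

    def forward(self, memory):                    # memory: (batch, hw, dim) encoder features
        q = self.queries.weight.unsqueeze(0).expand(memory.size(0), -1, -1)
        hs = self.decoder(q, memory)              # queries cross-attend to image features
        return self.class_head(hs), self.box_head(hs).sigmoid()

features = torch.randn(1, 49, 256)                # e.g., a flattened 7 x 7 feature map
cls_logits, boxes = QueryDetectionHead()(features)
print(cls_logits.shape, boxes.shape)              # (1, 100, 2) and (1, 100, 4)
```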
The encoder of COTR contains six hybrid convolution-in-transformer layers to encode features, whereas the decoder consists of six transformer layers for object querying followed by a feed-forward network for object detection. COTR performs better than DETR on two different datasets, ETIS-LARIB and CVC-ColonDB. The DETR model [24] is also adapted in other works for end-to-end polyp detection [280] and for detecting lymph nodes in T2 MRI scans for the assessment of lymphoproliferative diseases [281]. Overall, new Transformer-based approaches for medical image detection appear less frequently than those for segmentation and classification. This is in contrast to the early years of CNN-based designs, which were rapidly developed for medical image detection, as indicated in Fig. 31. A recent work [282] shows that the generic class-agnostic detection mechanism of multi-modal ViTs (like MDETR [283]) pre-trained on natural image-text pairs performs poorly on medical datasets. Therefore, investigating the performance of multi-modal ViTs by pre-training them on modality-specific medical imaging datasets is a promising future direction to explore. Furthermore, since recent ViT-based methods yield competitive results on medical image detection problems, we expect to see more contributions in the near future. The goal of medical image reconstruction is to obtain a clean image from a degraded input, for example, recovering a high-resolution MRI image from its under-sampled version. It is a challenging task due to its ill-posed nature. Moreover, exact analytic inverse transforms are unknown in many practical medical imaging scenarios. Recently, ViTs have been shown to address these challenges effectively. We categorize the relevant works into medical image enhancement and medical image restoration areas, as depicted in Fig. 18 (taxonomy of ViT-based medical image reconstruction approaches). ViTs have achieved impressive success in the enhancement of medical images, mostly in the application of Low-Dose Computed Tomography (LDCT) [284], [285]. In LDCT, the X-ray dose is reduced to prevent patients from being exposed to high radiation. However, this reduction comes at the expense of CT image quality degradation and requires effective enhancement algorithms to improve the image quality and, subsequently, diagnostic accuracy. Zhang et al. [37] propose a hybrid architecture, TransCT, that leverages the internal similarity of LDCT images to effectively enhance them. TransCT first decomposes the LDCT image into a high-frequency (HF) part (containing noise) and a low-frequency (LF) part. Next, it removes the noise from the HF part with the assistance of latent textures. To reconstruct the final high-quality LDCT images, TransCT further integrates features from the LF part into the output of the transformer decoder. Experiments on the Mayo LDCT dataset [286] demonstrate the effectiveness of TransCT over CNN-based approaches. To perform LDCT image enhancement, Wang et al. [289] propose TED-Net, a convolution-free ViT-based encoder-decoder architecture. It employs a Token-to-Token block [290] to enrich the image tokenization via a cascaded process. To refine contextual information, TED-Net introduces dilation and cyclic-shift blocks [125] in the tokenization. TED-Net shows favorable performance on the Mayo Clinic LDCT dataset [286].
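The frequency decomposition used as the first step of TransCT can be illustrated with a simple Gaussian low-pass split; the following is only a conceptual sketch of separating an LDCT slice into low-frequency content and a noise-bearing high-frequency residual, not TransCT's actual implementation, and the kernel size and sigma are arbitrary assumptions.

```python
import torch
import torch.nn.functional as F

def gaussian_kernel(size=11, sigma=2.0):
    coords = torch.arange(size, dtype=torch.float32) - (size - 1) / 2
    g = torch.exp(-coords ** 2 / (2 * sigma ** 2))
    kernel = torch.outer(g, g)
    return (kernel / kernel.sum()).view(1, 1, size, size)

def split_frequencies(ldct, kernel):
    """Split an LDCT slice into a smooth low-frequency part and a noisy high-frequency residual."""
    lf = F.conv2d(ldct, kernel, padding=kernel.shape[-1] // 2)   # low-pass filtered content
    hf = ldct - lf                                               # residual: edges and noise
    return lf, hf

ldct = torch.randn(1, 1, 256, 256)                 # stand-in low-dose CT slice
lf, hf = split_frequencies(ldct, gaussian_kernel())
# A denoising network can then operate on hf while lf guides the final reconstruction.
```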
In another work, Luthra et al. [291] propose Eformer, a Transformer-based residual learning architecture for LDCT image denoising. To focus on edges, Eformer uses the Sobel-Feldman operator [292], [293] in its proposed edge enhancement block to boost denoising performance.
Figure 19: Qualitative comparison of unsupervised MRI reconstruction approaches, including Self-Attention GAN DIP [46] (unsupervised CNN), Self-Attention GAN [288] (unsupervised CNN with pre-training), SLATER DIP [38] (unsupervised transformer), SLATER [38] (unsupervised transformer with pre-training), and the reference image; the bottom row shows the corresponding error maps. SLATER outperforms all other approaches in terms of reconstruction quality. Results taken from [38].
Moreover, to handle the over-smoothness issue, a multi-scale perceptual loss [292] is used. Eformer achieves impressive image quality gains in terms of PSNR, SSIM, and RMSE on the AAPM-Mayo Clinic dataset [286]. Like LDCT, low-dose positron emission tomography (LDPET) reduces the harmful radiation exposure of standard-dose PET (SDPET) at the expense of diagnostic accuracy. To address this challenge, Luo et al. [294] propose an end-to-end generative adversarial network (GAN) based method integrated with Transformers, namely Transformer-GAN, to effectively reconstruct SDPET images from the corresponding LDPET images. Specifically, the generator of Transformer-GAN consists of a CNN-based encoder to learn a compact feature representation, a transformer network to encode global context, and a CNN-based decoder to restore the feature representation. They also introduce an adversarial loss to obtain reliable and clinically acceptable images. Extensive experiments on their in-house collected clinical human brain PET dataset show the effectiveness of Transformer-GAN quantitatively and qualitatively. Medical image restoration entails transforming signals collected by acquisition hardware (like MRI scanners) into interpretable images that can be used for diagnosis and treatment planning. Recently, ViT-based models have been proposed for multiple medical image restoration tasks, including undersampled MRI restoration, sparse-view CT image reconstruction, and endoscopic video reconstruction. These models have pushed the boundaries of existing learning-based systems in terms of reconstruction accuracy. Next, we briefly highlight these approaches. Reducing the number of MRI measurements can result in faster scan times and a reduction in artifacts due to patient movement, at the expense of aliasing artifacts in the image [295]. High-Data Regime Approaches. Approaches in this category assume the availability of large MRI training datasets to train the ViT model. Feng et al. [296] propose a Transformer-based architecture, MTrans, for accelerated multi-modal MR imaging. The main component of MTrans is a cross-attention module that extracts and fuses complementary features from the auxiliary modality into the target modality. Experiments on the fastMRI and uiMRI datasets for reconstruction and super-resolution tasks show that MTrans achieves good performance gains over previous methods. However, MTrans requires separate training for the MR reconstruction and super-resolution tasks. To jointly reconstruct and super-resolve MRI images, Feng et al. [297] propose Task-Transformer, which leverages the power of multi-task learning to fuse complementary information between the reconstruction branch and the super-resolution branch. Experiments are performed on the public IXI and a private MRI brain dataset.
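As a concrete illustration of the Sobel-Feldman edge extraction used by Eformer's edge enhancement block above, the following is a minimal sketch that computes an edge-magnitude map from a CT slice with fixed Sobel kernels. It shows the operator itself rather than Eformer's actual block design, and how the edge maps are fused with denoiser features is left as an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SobelEdges(nn.Module):
    """Compute an edge-magnitude map with fixed Sobel-Feldman kernels."""
    def __init__(self):
        super().__init__()
        gx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]])
        gy = gx.t()
        # Fixed (non-learnable) horizontal and vertical gradient filters.
        self.register_buffer('weight', torch.stack([gx, gy]).unsqueeze(1))   # (2, 1, 3, 3)

    def forward(self, x):                          # x: (batch, 1, H, W) CT slice
        grad = F.conv2d(x, self.weight, padding=1)
        return torch.sqrt(grad[:, :1] ** 2 + grad[:, 1:] ** 2 + 1e-8)

ct = torch.randn(4, 1, 256, 256)
edges = SobelEdges()(ct)           # edge maps that can be fused with the denoiser's features
print(edges.shape)                 # torch.Size([4, 1, 256, 256])
```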
Similarly, Mahapatra et al. [298] propose a hybrid architecture to super-resolve MRI images by exploiting the complementary advantages of both CNNs and ViTs. They also propose novel loss functions [299] to preserve semantic and structural information in the super-resolved images. Low-Data Regime Approaches. One drawback of the aforementioned approaches is the requirement of a massive paired dataset of undersampled and corresponding fully sampled MRI acquisitions to train ViT models. To alleviate this data requirement, Korkmaz et al. [38], [300] propose a zero-shot framework, SLATER, that leverages the prior induced by randomly initialized neural networks [46], [47] for unsupervised MR image reconstruction. Specifically, during inference, SLATER inverts its transformer-based generative model via iterative optimization over the network weights to minimize the error between the network output and the under-sampled multi-coil MRI acquisitions while satisfying the MRI forward model constraints. SLATER yields quality improvements on single- and multi-coil MRI brain datasets over other unsupervised learning-based approaches, as shown in Fig. 19.
Figure 20: A ViT model pre-trained on ImageNet produces sharp results compared to a recent CNN-based model [311] and a randomly initialized ViT model. Bottom row: example reconstructions of brain images by models pre-trained on ImageNet and fine-tuned on a knee MRI dataset, showing that pre-trained ViT models are more robust to anatomical shifts. Figures adapted from [301].
Similarly, Lin et al. [301] show that a ViT model pre-trained on ImageNet, when fine-tuned on only 100 fastMRI images, not only yields sharp reconstructions but is also more robust towards anatomy shifts compared to CNNs, as shown in Fig. 20. Furthermore, their experiments indicate that the ViT benefits from higher throughput and lower memory consumption than the U-Net baseline. Sparse-view CT [312] can reduce the effective radiation dose by acquiring fewer projections. However, a decrease in the number of projections demands sophisticated image processing algorithms to achieve high-quality image reconstruction [313]. Wang et al. [305] present a hybrid CNN-Transformer, named Dual-Domain Transformer (DuDoTrans), which considers the global nature of the sinogram's sampling process to better restore high-quality images. In the first step, DuDoTrans obtains low-quality reconstructions from the sinogram via a filtered back-projection step and a learnable DuDo consistency layer. In the second step, a residual image reconstruction module performs enhancement to yield high-quality images. Experiments are performed on the NIH-AAPM dataset [286] to show the generalizability and robustness (against noise and artifacts) of DuDoTrans. Reconstructing surgical scenes from a stereoscopic video is challenging due to surgical tool occlusion and camera viewpoint changes. Long et al. [307] propose E-DSSR to reconstruct surgical scenes from stereo endoscopic videos. Specifically, E-DSSR contains a lightweight stereo Transformer module to estimate depth images with high confidence and a segmentor network to accurately predict the surgical tool's mask. Extensive experiments on the Hamlyn Centre Endoscopic Video Dataset [308] and a privately collected DaVinci robotic surgery dataset demonstrate the robustness of E-DSSR against abrupt camera movements and tissue deformations in real time.
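To make the inference-time optimization behind untrained-prior approaches such as SLATER (discussed above) more concrete, the following is a conceptual sketch: the weights of a generator are optimized so that its output, pushed through an undersampled MRI forward model (FFT followed by a sampling mask), matches the acquired k-space data. The toy CNN generator, single-coil forward model, random mask, and hyperparameters are all illustrative assumptions; SLATER itself uses a transformer-based generator and multi-coil acquisitions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
H = W = 64
mask = (torch.rand(H, W) < 0.3).float()            # assumed k-space undersampling pattern
gt = torch.randn(1, 1, H, W)                       # stand-in ground-truth image
kspace = torch.fft.fft2(gt) * mask                 # simulated undersampled measurements

net = nn.Sequential(nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
                    nn.Conv2d(32, 1, 3, padding=1))
z = torch.randn(1, 1, H, W)                        # fixed random input to the generator
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

for step in range(200):                            # inference-time optimization, no training data
    recon = net(z)
    pred_kspace = torch.fft.fft2(recon) * mask     # forward model: FFT followed by the mask
    loss = (pred_kspace - kspace).abs().pow(2).mean()   # data-consistency loss
    opt.zero_grad()
    loss.backward()
    opt.step()

print(loss.item())                                 # net(z) now approximates the measured data
```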
In this section, we have reviewed about a dozen papers related to the applications of ViT models in medical image reconstruction, as shown in Table 6.
Table 6 (excerpt): Highlights | Modality | Input Type | Datasets | Metric
TransCT [37]: Transformer for LDCT enhancement with high- and low-frequency decomposition | CT | 2D | NIH-AAPM [286] | RMSE, SSIM, VIF [302]
SLATER [38]: Transformer-based approach for zero-shot MRI image reconstruction
Below, we highlight a few challenging problems and recent trends in the field. Recently, an interesting work [301] has investigated the impact of pre-training a ViT for the task of MRI image reconstruction. The results indicate that the pre-trained ViT yields sharp reconstructions and is robust towards anatomy shifts (see Fig. 20). The robustness of ViTs can be of particular relevance to pathology image reconstruction, as the range of pathology can vary significantly in the anatomy being imaged. Further, it raises an interesting question: "Are ViTs pre-trained on medical image datasets able to provide any advantages in terms of reconstruction performance and robustness against anatomy shifts compared to their counterparts pre-trained on ImageNet?" Extensive and systematic experiments are required to answer this question. Another promising future direction is to investigate the performance of a ViT pre-trained on one image modality (like CT) and fine-tuned on another modality (like MRI) for image reconstruction tasks. We notice that most Transformer-based approaches focus on MRI and CT image reconstruction tasks, and their applicability to other modalities is yet to be explored. In addition, the proposed architectures are mostly generic and have not fully exploited application-specific aspects. We believe that designing architectural components and formulating loss functions according to the task at hand can significantly boost performance. We want to highlight one particular work that uses a Transformer architecture to regularize the challenging problem of MRI image reconstruction from under-sampled measurements [38]. This work is inspired by the strong prior induced by the structure of untrained neural networks [46], [47]. These untrained network priors have recently garnered much attention from the medical imaging community as they do not need labeled training data. Considering advances in the untrained neural network area, we believe this direction requires further attention from medical imaging researchers in the context of Transformers. We also observe that, compared to the early years of CNNs (one paper from 2012 to 2015), Transformers have rapidly gained widespread attention in the medical image reconstruction community (more than a dozen papers in 2021), potentially due to recent advancements in image-to-image translation frameworks. In this section, we provide an overview of the applications of ViTs in medical image synthesis. Most of these approaches incorporate an adversarial loss to synthesize realistic and high-quality medical images, albeit at the expense of training instability [314]. We have further classified these approaches into intra-modality synthesis and inter-modality synthesis due to the different sets of challenges in the two categories, as shown in Fig. 21. The goal of intra-modality synthesis is to generate higher-quality images from relatively lower-quality input images of the same modality. Next, we describe the details of ViT-based intra-modality medical image synthesis approaches.
Supervised image synthesis methods require paired source and target images to train ViT-based models. Paired data is difficult to obtain due to annotation cost and time constraints, which generally hinders the applicability of these models in medical imaging applications. Zhang et al. [315] focus on synthesizing infant brain structural MRIs (T1w and T2w scans) using both transformer and performer (simplified self-attention) layers [88]. Specifically, they design a novel multi-resolution pyramid-like U-Net framework, PTNet, utilizing a performer encoder, a performer decoder, and a transformer bottleneck to synthesize high-quality infant MRI. They demonstrate the superiority of PTNet both qualitatively and quantitatively compared to pix2pix [316] and pix2pixHD [317] on a large-scale infant MRI dataset [318]. Furthermore, in addition to better synthesis quality, PTNet has a reasonable execution time of around 30 slices per second. Semi-supervised approaches typically require small amounts of labeled data along with large unlabeled data to train models effectively. Kamran et al. [319] propose a multi-scale conditional generative adversarial network (GAN) [316] using a ViT as the discriminator. They train their proposed model in a semi-supervised way to simultaneously synthesize Fluorescein Angiography (FA) images from fundus photographs and predict retinal degeneration. They use a softmax activation after the MLP head output and a categorical cross-entropy loss for classification. Besides the adversarial loss, they also use MSE and perceptual losses to train their network. For the ViT discriminator, they use an embedding feature loss calculated from the positional and patch features of the transformer encoder layers by feeding in the real and synthesized FA images. Their quantitative results in terms of Fréchet Inception Distance [320] and Kernel Inception Distance [321] demonstrate the superiority of their approach over baseline methods on the diabetic retinopathy dataset provided by Hajeb et al. [322]. Unsupervised approaches are particularly suitable for medical image synthesis tasks as they do not require paired training datasets. Recently, Ristea et al. [323] proposed a cycle-consistent generative adversarial transformer (CyTran) for translating unpaired contrast CT scans to non-contrast CT scans and for volumetric registration of contrast CT scans to non-contrast CT scans. To handle high-resolution CT images, they propose a hybrid convolution and multi-head attention-based architecture, shown in Fig. 22. CyTran is unsupervised due to the integration of a cyclic loss. Moreover, they introduce the Coltea-Lung-CT100W dataset, formed of 100 3D anonymized triphasic lung CT scans of female patients.
Figure 22: Hybrid convolutional-transformer network for CT image generation as proposed in [323]. It consists of downsampling convolutional layers to extract features from input images, a convolutional-transformer block comprising a multi-head self-attention mechanism, and an upsampling block to generate output images.
Figure 23: A pair of MRI (left) and CT (right) images of the same subject showing the significant appearance gap between the two modalities, making medical image synthesis from MRI to CT a challenging task. Image is from [324].
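The cyclic loss that lets CyTran-style models train without paired scans can be sketched as follows: two generators translate between the contrast and non-contrast domains, and translating to the other domain and back must reproduce the input. The tiny convolutional generators below are placeholders rather than CyTran's hybrid convolution-attention blocks, and the adversarial terms are only mentioned in a comment.

```python
import torch
import torch.nn as nn

def tiny_generator():
    return nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
                         nn.Conv2d(16, 1, 3, padding=1))

G_c2n = tiny_generator()           # contrast CT  -> non-contrast CT
G_n2c = tiny_generator()           # non-contrast -> contrast CT
l1 = nn.L1Loss()

contrast = torch.randn(2, 1, 128, 128)             # unpaired batches from the two domains
non_contrast = torch.randn(2, 1, 128, 128)

fake_n = G_c2n(contrast)
fake_c = G_n2c(non_contrast)
# Cyclic loss: translating to the other domain and back must reproduce the input,
# which removes the need for paired contrast/non-contrast scans of the same patient.
cycle_loss = l1(G_n2c(fake_n), contrast) + l1(G_c2n(fake_c), non_contrast)
# In practice this term is combined with adversarial losses from per-domain discriminators.
print(cycle_loss.item())
```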
Inter-modality approaches aim to synthesize target-modality images that capture the useful structural information present in source images of a different modality; examples include CT-to-MRI translation and vice versa. Due to the challenges associated with inter-modal translation, only supervised approaches have been explored. Dalmaz et al. [325] introduce a novel synthesis approach, ResViT, for multi-modal imaging based on a conditional deep adversarial network with a ViT-based generator. Specifically, ResViT employs convolutional and transformer branches within a residual bottleneck to preserve both local precision and contextual sensitivity, along with the realism of adversarial learning. The bottleneck comprises novel aggregated residual transformer blocks to synergistically preserve local and global context, with a weight-sharing strategy to minimize model complexity. The effectiveness of the ResViT model is demonstrated on two multi-contrast brain MRI datasets (including BraTS [326]) and a multi-modal pelvic MRI-CT dataset [327]. In this section, we have reviewed the applications of ViT models in medical image synthesis. Realistic synthesis of medical images is particularly important as, in general, more than one imaging modality is involved in accurate clinical decision-making due to their complementary strengths. Therefore, in many practical applications, a certain modality is desired but infeasible to acquire due to cost and privacy issues. Recent transformer-based approaches can effectively circumvent these issues due to their ability to generate more realistic images than GAN-based methods. Furthermore, most Transformer-based medical image synthesis approaches use an adversarial loss to generate realistic images. The adversarial loss can cause mode collapse, and effective strategies must be employed to mitigate this issue [328]. Lastly, to the best of our knowledge, no work has used transformer-based models for inter-modality image synthesis in an unsupervised setting. This may be due to the highly challenging nature of the problem (CT and MRI images of the same subject, for instance, have significantly different appearances, as shown in Fig. 23), thereby making it a promising direction to explore. Medical image registration aims to find a dense per-voxel displacement field and establish alignment between a pair of fixed and moving images. In medical imaging, registration may be necessary when analyzing a pair of images acquired at different times, from different viewpoints, or using different modalities (like MRI and CT) [75]. Accurate medical image registration is a challenging task due to difficulties in extracting discriminative features from multimodal medical images, complex motion, and the lack of robust outlier rejection approaches [329]. In this section, we briefly highlight the applications of ViTs in medical image registration. The first study to investigate the usage of transformers for self-supervised medical volumetric image registration was proposed by Chen et al. [330]. Their model, ViT-V-Net, is a hybrid architecture composed of convolutional and transformer layers. Specifically, ViTs are applied to the high-level features of the fixed and moving images extracted via a series of convolutional and max-pooling layers. The output of the ViT is then reshaped and decoded using a V-Net style decoder [331]. To efficiently propagate information, ViT-V-Net uses long skip connections between the encoder and decoder. The output of the ViT-V-Net decoder is a dense displacement field, which is fed to a spatial transformer network for warping. Experiments on an in-house MRI dataset show the superiority of ViT-V-Net over other competing approaches in terms of Dice score.
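The final warping step described above can be sketched with a spatial transformer that resamples the moving image at locations shifted by the predicted displacement field. In the sketch below the displacement field is random rather than produced by a registration network, and the 2D setting stands in for the volumetric case; during training, the warped output would be compared against the fixed image with a similarity loss plus a smoothness regularizer on the field.

```python
import torch
import torch.nn.functional as F

def warp(moving, flow):
    """Warp a moving image with a dense displacement field (in pixels)."""
    B, _, H, W = moving.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing='ij')
    identity = torch.stack((xs, ys), dim=0).float().unsqueeze(0)   # (1, 2, H, W)
    new = identity + flow                                          # displaced sampling locations
    # grid_sample expects (x, y) coordinates normalized to [-1, 1].
    new_x = 2.0 * new[:, 0] / (W - 1) - 1.0
    new_y = 2.0 * new[:, 1] / (H - 1) - 1.0
    grid = torch.stack((new_x, new_y), dim=-1)                     # (B, H, W, 2)
    return F.grid_sample(moving, grid, align_corners=True)

moving = torch.rand(1, 1, 64, 64)
flow = torch.randn(1, 2, 64, 64) * 2.0             # stand-in for a predicted displacement field
warped = warp(moving, flow)                        # compared against the fixed image in training
print(warped.shape)                                # torch.Size([1, 1, 64, 64])
```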
Chen et al. [332] further extend ViT-V-Net and propose the TransMorph model for volumetric medical image registration. In particular, TransMorph makes use of a Swin Transformer in the encoder to capture the semantic correspondence between the input fixed and moving images, followed by a convolutional decoder with long skip connections to predict the dense displacement field. For uncertainty estimation, they also introduce Bayesian deep learning by applying variational inference on the parameters of the TransMorph encoder. Extensive evaluation is performed to compare TransMorph with other approaches for the medical image registration task. Specifically, experiments on inter-patient brain MRI registration (data provided by Johns Hopkins University) and XCAT-to-CT registration demonstrate the superiority of TransMorph against twelve different handcrafted, CNN-based, and transformer-based approaches. Similarly, Zhang et al. [333] present a novel dual transformer network (DTN) for volumetric diffeomorphic registration that effectively establishes correspondences between anatomical structures in an unsupervised manner. The DTN consists of two CNN-based 3D U-Net encoders to extract embeddings of the separate and concatenated volumetric MRI images. To further refine and enhance the embeddings, they propose encoder-decoder-based dual transformers to encode the cross-volume dependencies. Given the enhanced embeddings, a CNN decoder infers the deformation fields. Qualitative and quantitative results, in terms of Dice similarity coefficient and negative Jacobian determinant, on the OASIS dataset [334] of MRI scans demonstrate the effectiveness of their proposed architecture. The application of transformers to the medical image registration problem is still at an early stage, and it is difficult to draw firm conclusions. However, given the rapid development of Transformer-based registration approaches in generic computer vision, we expect to see the same trend in this field in the near future. Recently, immense progress has been made in automatically generating clinical reports from medical images using deep learning [335]-[338]. This automatic report generation process can help clinicians in accurate decision-making. However, generating reports (or captions) from medical imaging data is challenging due to the diversity in the reports of different radiologists, long sequence lengths (unlike natural image captions), and dataset bias (more normal than abnormal data). Moreover, an effective medical report generation model is expected to possess two key attributes: (1) language fluency for human readability and (2) clinical accuracy to correctly identify the disease along with related symptoms. In this section, we briefly describe how transformer models help achieve these desired goals and effectively mitigate the aforementioned challenges associated with medical report generation. Specifically, these transformer-based approaches have achieved state-of-the-art performance both in terms of Natural Language Generation (NLG) and Clinical Efficacy (CE) metrics.
Figure 24: Taxonomy of applications of ViTs in clinical report generation.
Also note that, unlike previous sections that mainly discuss ViTs, in this section the focus is on transformers as powerful language models that exploit long-range dependencies for sentence generation.
We have broadly categorized transformer-based clinical report generation approaches into reinforcement learning (RL) based and supervised/unsupervised learning methods, as shown in Fig. 24, due to differences in their underlying training mechanisms. RL-based medical report generation approaches can directly use the evaluation metrics of interest (like human evaluation, relevant medical terminologies, etc.) as rewards and update the model parameters via policy gradients. All approaches covered in this section use the self-critical RL [339] approach to train models, which is more suitable for the report generation task than conventional RL. One of the first attempts to integrate transformers into clinical report generation was made by Xiong et al. [340]. They propose the Reinforced Transformer for Medical Image Captioning (RTMIC), which consists of a pre-trained DenseNet [236] to identify regions of interest in the input medical image, followed by a transformer-based encoder to extract visual features. These features are given as input to the captioning decoder to generate sentences. All these modules are updated via the self-critical reinforcement learning method during training on the IU Chest X-ray dataset [341]. Similarly, Miura et al. [342] show that automatic radiology reports scoring highly on natural language generation metrics such as BLEU [343] and CIDEr [344] are often incomplete and inconsistent. To address these challenges, Miura et al. [342] propose a transformer-based model that directly optimizes two newly proposed reward functions using self-critical RL. The first reward promotes the coverage of radiology domain entities appearing in the corresponding reference reports, and the second reward promotes the consistency of the generated reports with their descriptions in the reference reports. Further, they combine these reward functions with the semantic equivalence metric BERTScore [345], which results in generated reports with better performance in terms of clinical metrics. Surgical Instructions Generation. Inspired by the success of transformers in medical report generation, Zhang et al. [346] propose a transformer model to generate instructions from surgical scenes. The lack of a predefined template, as is available in medical report generation, makes the generation of surgical instructions a challenging task. To handle this challenge, Zhang et al. [346] propose an encoder-decoder-based architecture back-boned by a transformer model. Specifically, their proposed architecture, optimized via self-critical reinforcement learning [339], effectively models the dependencies among visual features, textual features, and visual-textual relational features to accurately generate surgical reports on the DAISI dataset [347].
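The self-critical training objective shared by these RL-based report generators can be sketched as follows: the reward of a sampled report is baselined by the reward of the greedily decoded report, and the log-probabilities of the sampled tokens are weighted by this advantage. The `model`, `decode_sample`, `decode_greedy`, and `reward_fn` helpers below are hypothetical placeholders rather than the interface of any cited method.

```python
import torch

def self_critical_loss(model, images, references, reward_fn):
    """Policy-gradient loss with the greedy decode as baseline (self-critical training)."""
    sampled_ids, log_probs = model.decode_sample(images)   # stochastic decoding + token log-probs
    with torch.no_grad():
        greedy_ids = model.decode_greedy(images)           # baseline report, no gradient needed
    reward_sample = reward_fn(sampled_ids, references)     # e.g., CIDEr, BERTScore, clinical rewards
    reward_greedy = reward_fn(greedy_ids, references)
    advantage = (reward_sample - reward_greedy).unsqueeze(-1)       # (batch, 1)
    # Encourage sampled reports whose reward beats the greedy baseline, discourage the rest.
    return -(advantage * log_probs.sum(dim=-1, keepdim=True)).mean()
```

Because the reward function can be any scalar score, this formulation is what allows clinically oriented rewards, such as entity coverage or consistency, to be optimized directly.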
Supervised/unsupervised approaches use differentiable loss functions to train models for medical report generation and do not interact with an environment via an agent. We have categorized supervised/unsupervised approaches into methods that focus on dataset bias, explainability, feature alignment, and miscellaneous categories, based on the challenges these approaches address. Dataset bias is a common problem in medical report generation, as there are far more sentences describing normalities than abnormalities. To mitigate this bias, Srinivasan et al. [348] propose a hierarchical classification approach using a transformer as the decoder. Specifically, the transformer decoder leverages attention between and across features obtained from reports, images, and tags for effective report generation. The architecture consists of an Abnormality Detection Network to classify normal and abnormal images, a Tag Classification Net to generate tags for images, and a Report Generation Net that takes image features and tags as inputs to generate the final reports. Experiments on the IU Chest X-ray dataset [341] demonstrate the effectiveness of the proposed architectural components. Similarly, Liu et al. [349] try to imitate the working practice of radiologists by distilling posterior and prior knowledge to generate accurate radiology reports. Their proposed architecture consists of three modules: a Posterior Knowledge Explorer (PoKE), a Prior Knowledge Explorer (PrKE), and a Multi-domain Knowledge Distiller (MKD). Specifically, PoKE identifies the abnormal areas in the input images (mitigating image data bias), PrKE explores relevant prior information from radiological reports and a medical knowledge graph (mitigating textual data bias), and MKD (based on a transformer decoder) distills the posterior and prior knowledge to generate the radiology report. In another work, You et al. [350] propose AlignTransformer to generate medical reports from X-ray images. Specifically, AlignTransformer consists of two modules: align hierarchical attention and a multi-grained transformer. The align hierarchical attention module helps to better locate the abnormal regions in the input medical images. The multi-grained transformer, in turn, leverages multi-grained visual features using adaptive exploiting attention [351] to accurately generate long medical reports. AlignTransformer achieves favorable performance on the IU-Xray [341] and MIMIC-CXR [352] datasets.
Figure 25: A chest X-ray image and its accompanying report, which includes findings and impressions, with aligned visual and textual components highlighted in different colors. Figure taken from [353].
Feature alignment based approaches mainly focus on the accurate alignment of the encoded representations of medical images and the corresponding text, which is crucial for interaction and generation across modalities (images and text here) and subsequently for accurate report generation, as indicated in Fig. 25. To align better, Chen et al. [353] propose a cross-modal memory network to augment the transformer-based encoder-decoder model for radiology report generation. They design a shared memory to facilitate alignment between the features of medical images and texts. Experiments on the IU-Xray [341] and MIMIC-CXR [352] datasets demonstrate that the proposed model aligns image and text features better than baseline methods. Building on the shared-memory work of Chen et al. [353], Yan et al. [361] introduce a weakly supervised contrastive objective to favor reports that are semantically close to the target, thereby producing more clinically accurate outputs. Similarly, Amjoud et al. [362] investigate the impact on report generation performance of modifying different architectural components of the model proposed by Chen et al. [353], including replacing the visual extractor and changing the number of layers in the transformer-based decoder. Explainability in medical report generation is crucial to improve trustworthiness for deploying models in clinical settings and serves as a means for extracting bounding boxes for lesion localization.
For model explainability, Hou et al. [363] employ attention to identify regions of interest in the input image and to demonstrate where the model is focusing when generating the resulting text.
Figure 26: The depiction of lesion-image attention mapping areas and ground truth for CNN+Transformer and Faster-RCNN+Transformer samples, where green boxes represent the annotated region for each lesion word and red boxes represent the lesion-image attention mapping regions. Image taken from [354].
This attention mechanism increases the explainability of black-box models used in clinical settings and provides a method for extracting bounding boxes for disease localization. Specifically, they propose the RATCHET transformer model to generate reports, using DenseNet-101 [236] as the image feature extractor. RATCHET consists of a transformer-based RNN-decoder for generating chest radiograph reports. They assess both the model's natural language skills and the medical correctness of the generated reports. Similarly, despite the immense interest of AI and clinical medicine researchers in the automatic report generation area, benchmark datasets are scarce, and the field lacks reliable evaluation metrics. To address these challenges, Li et al. [354] introduce a large-scale fundus fluorescein angiography images and reports dataset containing 10,790 reports describing 1,048,584 images with explainable annotations, as shown in Fig. 26. The dataset comes with annotated Chinese reports and corresponding translated English reports. Further, they introduce nine reliable metrics based on human evaluation criteria. In this section, we highlight several approaches that try to improve different aspects of clinical report generation from medical images. Examples include a memory-driven transformer to capture similar patterns in reports, uncertainty quantification for reliable report generation, a curriculum learning-based method, and an unsupervised approach to avoid paired training datasets. Chen et al. [364] propose a memory-driven transformer to exploit similar patterns across radiology image reports. Specifically, they add a module to each layer of the transformer-based decoder by augmenting the original layer normalization with a novel memory-driven conditional layer normalization. Extensive experiments on the IU Chest X-ray [341] and MIMIC-CXR [352] datasets demonstrate the superiority of their approach in terms of both Natural Language Generation (NLG) and Clinical Efficacy (CE) metrics. Similarly, Lovelace et al. [369] also leverage a transformer-based encoder and decoder for accurate medical report generation on the MIMIC-CXR dataset [352]. To emphasize clinically relevant report generation, they design a method to extract clinical information from the generated reports, which they use to refine the model for clinical coherence. In another work, Alfarghaly et al. [372] present a pre-trained transformer-based model to generate medical reports from images. Specifically, the encoder consists of a pre-trained CheXNet model that generates semantic features from the input medical images. These semantic features are used to condition a GPT-2 decoder [373], [374] to generate accurate medical reports. Similarly, uncertainty quantification is a key indicator for judging the reliability of an automatic medical report generation model. To incorporate this measure, Wang et al. [366] propose a transformer-based confidence-guided framework to quantify both visual and textual uncertainty.
These uncertainties are subsequently used to construct an uncertainty-weighted loss that reduces misjudgment risk and improves the overall quality of the generated reports. In another work, Nguyen et al. [367] propose a differentiable end-to-end framework with a transformer-based generator for report generation. Specifically, their proposed framework has three complementary modules: a classifier to learn representations of disease features, a transformer-based generator to produce the medical report, and an interpreter to make the generated report consistent with the classifier output. They demonstrate the effectiveness of the proposed components on the IU-Xray [341] and MIMIC-CXR [352] datasets. Inspired by curriculum learning [375], Nooralahzadeh et al. [365] present a two-stage transformer architecture to progressively generate medical reports. Their progressive approach performs better than single-stage baselines in generating full radiology reports. In another study, Pahwa et al. [376] investigate the impact of the visual feature extractor model on the performance of medical report generation. Based on their insights, they propose MedSkip, a modified HRNet [377], to extract visual features for subsequent processing by the transformer-based decoder to generate an accurate medical report.
Table 8: Quantitative comparison of transformer models for the task of clinical report generation in terms of Natural Language Generation (NLG) and Clinical Efficacy (CE) on two benchmark datasets. The NLG metrics include BLEU (BL) [343], METEOR (MTR) [370], and ROUGE-L (RG-L) [371], and the CE metrics include precision, recall, and F1 score.
Similarly, Park et al. [378] investigate the expressiveness of features to discriminate between normal and abnormal images. They demonstrate the superiority of a transformer-based decoder without global average pooling over a hierarchical LSTM baseline. Existing transformer-based report generation models are mostly supervised and use paired image-report data during training. Such paired data is difficult to obtain in the medical domain due to privacy and cost. To mitigate this issue, Liu et al. [368] propose a knowledge graph auto-encoder that works in the shared latent space of images and reports to extract useful information in an unsupervised way. Specifically, they use attention in the encoder to extract the knowledge representation from the knowledge graph and use a three-layer transformer in the decoder to generate reports. Their proposed framework can also be used in a semi-supervised or supervised manner in addition to the unsupervised mode. Quantitative and qualitative results, as well as evaluation by radiologists, corroborate the effectiveness of their approach. In this section, we have provided a comprehensive overview of the applications of transformers for clinical report generation from X-ray images. In contrast to previous sections that discuss applications of ViTs, this section focuses on transformers as powerful language models. It is also pertinent to note that, even though multiple surveys exist covering the applications of deep learning in clinical report generation [335]-[338], to the best of our knowledge none of them have covered the applications of transformer models in this area, despite the transformers' phenomenal impact since their inception back in 2017. In this regard, we hope this section will serve as a valuable resource for the research community.
Below, we briefly highlight a few challenges associated with transformer-based models for report generation and outline promising future directions to explore. As we have seen, transformer-based report generation models mostly rely on natural language generation (NLG) evaluation metrics such as CIDEr and BLEU to assess performance. These NLG metrics often fail to reflect clinical efficacy. One recent work by Miura et al. [342] addresses this issue by proposing two new reward functions for a transformer model in a reinforcement learning framework to better capture disease and anatomical knowledge in the generated reports. Another work by Li et al. [354] introduces nine reliable human evaluation criteria to validate the generated reports. Despite these works, we believe that more attention from the research community is required to design reliable clinical evaluation metrics to facilitate the adoption of transformer-based medical report generation models in clinical settings. All transformer-based approaches covered in this section use the X-ray modality for automatic report generation. Generating reports from other modalities such as MRI or PET comes with its own challenges due to the specific nature and distinct characteristics of each modality. Further, a few medical datasets, such as ROCO [379], PEIR Gross [380], and ImageCLEF [381], are available that consist of multiple modalities, different body parts, and corresponding captions. These datasets have the potential to become worthy benchmarks to gauge the performance of future multimodal (or unimodal, e.g., MRI or PET) transformer-based models for medical report generation. We believe that transformer-based report generation models tailored to specific modalities must be explored in the future, with a focus on creating diverse and challenging datasets for other modalities. Details of a few existing medical report generation datasets are given in Table 7. Further, we would like to point interested researchers toward the recently explored surgical instruction generation work using transformers [346], which could have a huge impact on surgical robotics, a market that is expected to reach USD 22.27 billion by 2028. However, only one dataset, DAISI [347], is available to evaluate models in this emerging area, demanding attention from the medical community to create diverse and more challenging datasets. Moreover, datasets for medical report generation such as IU X-Ray [341] do not contain a standard train-test split, and most transformer-based approaches evaluate performance on different test data. In this regard, the results in Table 8 are not directly comparable, but they can provide an overall indication of the performance of the models. What seems to be missing is a set of standardized procedures for creating challenging and diverse clinical report generation datasets. In this section, we briefly highlight applications of Transformers in other medical imaging areas, including survival outcome prediction, visual question answering, and medical point cloud analysis. Survival outcome prediction is a challenging regression task that seeks to predict the relative risk of cancer death. Recently, transformer models have shown impressive success in predicting survival rates. Chen et al. [382] propose a Multimodal Co-Attention Transformer (MCAT) for survival outcome prediction from whole-slide imaging (WSI) in pathology. MCAT learns a co-attention mapping between genomic and WSI features to discover how histology features attend to genes while predicting patient survival outcomes. Extensive experiments on five cancer datasets demonstrate the superiority of MCAT compared to state-of-the-art CNN-based approaches.
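The genomic-guided co-attention at the heart of MCAT can be sketched with a single cross-attention layer in which genomic embeddings act as queries over histology patch tokens. The dimensions, number of genomic groups, and the simple pooled risk head below are illustrative assumptions rather than MCAT's actual configuration.

```python
import torch
import torch.nn as nn

co_attn = nn.MultiheadAttention(embed_dim=256, num_heads=4, batch_first=True)
wsi_tokens = torch.randn(1, 4000, 256)             # patch embeddings from one whole-slide image
genomic_tokens = torch.randn(1, 6, 256)            # embeddings of genomic feature groups

# Each genomic token gathers the histology regions most relevant to it.
attended, attn_weights = co_attn(query=genomic_tokens, key=wsi_tokens, value=wsi_tokens)
risk_head = nn.Linear(256, 1)
risk = risk_head(attended.mean(dim=1))             # pooled co-attended features -> risk score
print(risk.shape)                                  # torch.Size([1, 1])
```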
Similarly, Kipkogei et al. [383] propose a Transformer-based architecture, the Clinical Transformer, to model the relation between clinical and molecular features for predicting survival outcomes on a lung cancer dataset [384]. In another work, Eslami et al. [385] propose PubMedCLIP, a version of Contrastive Language-Image Pre-Training (CLIP) [386] fine-tuned for the medical domain by training it on image-caption pairs from PubMed articles. Extensive experiments show that PubMedCLIP outperforms the previous state-of-the-art by nearly 3%. Recently, Liu et al. [10] propose the 3D Medical Point Transformer (3DMPT) to analyze 3D medical data. 3DMPT is tested on 3D medical classification and part segmentation tasks. Similarly, Malkiel et al. [387] propose a Transformer-based architecture to analyze fMRI data. They pre-train the model on 4D fMRI data in a self-supervised manner and fine-tune it on various downstream tasks, including age and gender prediction as well as diagnosing schizophrenia. We have reviewed the exciting applications of vision transformers in medical image analysis. Despite their impressive performance, several open research questions remain. In this section, we outline some of their limitations and highlight promising future research directions. Specifically, we discuss the challenges of pre-training on large datasets (Sec. 11.1), the interpretability of ViT-based medical imaging approaches (Sec. 11.2), robustness against adversarial attacks (Sec. 11.3), designing efficient ViT architectures for real-time medical applications (Sec. 11.4), challenges in deploying ViT-based models in distributed settings (Sec. 11.5), and domain adaptation (Sec. 11.6). Further, wherever possible, we refer interested researchers to relevant CNN-based medical imaging resources (recent studies, datasets, software libraries, etc.) to explore applications, such as adversarial robustness, that remain untapped by ViT-based models in medical imaging. Due to a lack of intrinsic inductive biases for modeling local visual features, ViTs need to figure out image-specific concepts on their own via pre-training on large-scale training datasets [22]. This may be a barrier to their widespread application in medical imaging, where datasets are typically orders of magnitude smaller than natural image datasets due to cost, privacy concerns, and the rarity of certain diseases, thereby making ViTs difficult to train efficiently in the medical domain. Existing learning-based medical imaging approaches commonly rely on transfer learning via ImageNet pre-training, which may be suboptimal due to the drastically different image characteristics of medical and natural images. Recently, Matsoukas et al. [33] have studied the impact of pre-training on ViT performance for image classification and segmentation via a careful set of extensive experiments on several medical imaging datasets. Below, we briefly highlight the major findings of their work. • CNNs outperform ViTs for the medical image classification task when initialized with random weights. • CNNs and ViTs benefit significantly from ImageNet initialization for medical image classification.
ViTs appear to benefit more from transfer learning, as they make up for the gap observed under random initialization, performing on par with their CNN counterparts. • CNNs and ViTs perform better with self-supervised pre-training approaches like DINO [271] and BYOL [388]. ViTs appear to outperform CNNs by a small margin in this setting for medical image classification. In short, although recent data-efficient ViT-based approaches like DeiT [389], Token-to-Token [290], and Transformer-in-Transformer [216] report encouraging results in generic vision applications, learning transformer models tailored to domain-specific medical imaging applications in a data-efficient manner remains challenging. Recently, Tang et al. [163] have made an attempt to handle this issue by investigating the effectiveness of self-supervised learning as a pre-training strategy on domain-specific medical images. Specifically, they propose a 3D transformer-based hierarchical encoder, Swin UNETR, and, after pre-training on 5,050 CT images, demonstrate its effectiveness by fine-tuning on the downstream task of medical image segmentation.
Figure 27: Impact of pre-training a ViT on a domain-specific medical imaging dataset. First column: by using only 10% of the labelled data, Swin UNETR pre-trained on CT images (blue) achieves a 10% improvement in Dice score over Swin UNETR trained from scratch (orange). Middle column: Dice scores of recently proposed transformer models on the spleen segmentation task of the MSD dataset; Swin UNETR achieves state-of-the-art performance due to pre-training on the domain-specific (CT) medical dataset. Last column: qualitative visualizations of Swin UNETR pre-trained on CT images on the BTCV multi-organ segmentation challenge, where the Swin UNETR predictions are closer to the ground truth than the baseline UNETR approach. First and last columns are adapted from [163].
Pre-training on the medical imaging dataset also reduces the annotation effort compared to training Swin UNETR from scratch with random initialization. This is shown in Fig. 27, where it can be seen that the pre-trained Swin UNETR can achieve, using only 60% of the labeled data, the same performance as the randomly initialized Swin UNETR using 100% of the labeled data, a 40% reduction in manual annotation effort. Furthermore, as shown in Fig. 27 (middle column), pre-training on domain-specific CT data also enables Swin UNETR to achieve state-of-the-art Dice scores on the spleen segmentation task of the MSD dataset. We believe such studies for ViT-based models, along with multi-instance contrastive learning to leverage patient metadata [391], will provide further insights to the community. Similarly, combining self-supervised and semi-supervised pre-training in the context of ViTs for medical imaging applications is also an interesting avenue to explore [392]. Although the success of transformers has been empirically established in an impressive number of medical imaging applications, it has so far eluded a satisfactory interpretation. In most medical imaging applications, ViT models have been deployed as black boxes, failing to provide insights into or explanations of their learning behavior when making predictions. This black-box nature of ViTs has hindered their deployment in clinical practice since, in areas such as medical applications, it is imperative to identify the limitations and potential failure cases of the designed systems, where interpretability plays a fundamental role [393].
Although several explainable AI-based medical imaging systems have been developed to gain deeper insights into the working of CNN models for clinical applications [394]-[396], the work is still in its infancy for ViT-based medical imaging applications. This is despite the inherent suitability of the self-attention mechanism for interpretability, owing to its ability to explicitly model interactions between every region in the image, as shown in Fig. 28 [397]. Recent efforts toward interpretable ViT-based medical imaging models leverage saliency-based approaches [234] and Grad-CAM-based visualizations [229]. Despite these efforts, the development of interpretable and explainable ViT-based approaches, specifically tailored for life-critical medical imaging applications, is a challenging and open research problem. Furthermore, formalisms, challenges, definitions, and evaluation protocols regarding interpretable ViT-based medical imaging systems must also be addressed. We believe that progress in this direction would not only help physicians decide whether they should follow and trust automatic ViT-based model decisions but could also facilitate the deployment of such systems from a legal perspective. Advances in adversarial attacks have revealed the vulnerability of existing learning-based medical imaging systems to imperceptible perturbations of the input images [404]-[406]. Considering the vast amount of money that underpins the medical imaging sector, this inevitably poses a risk whereby potential attackers may seek to profit from manipulation of the healthcare system, as shown in Fig. 29. For example, an attacker might try to manipulate the examination reports of patients for insurance fraud or a false medical reimbursement claim, thereby raising safety concerns. Therefore, ensuring the robustness of ViTs against adversarial attacks in life-critical medical imaging applications is of paramount importance. Although a rich literature exists on the robustness of CNNs in the medical imaging domain, to the best of our knowledge no such study exists for ViTs, making it an exciting as well as challenging direction to explore. Recently, a few attempts have been made to evaluate the robustness of ViTs to adversarial attacks on natural images [407]-[416]. The main conclusion of these attempts, ignoring their nuanced differences, can be summarized as: ViTs are more robust to adversarial attacks than CNNs. However, these robust ViT models cannot be directly deployed for medical imaging applications, as the variety and type of patterns and textures in medical images differ significantly from the natural domain. Therefore, a principled approach to evaluating the robustness of ViTs against adversarial attacks in the medical imaging context, which builds the groundwork for resilience, could serve as a critical step toward deploying these models in clinical settings.
Table 9: Description of datasets generally used in medical adversarial deep learning.
Dataset | Size | Modality
RSNA [264] | 29,700 | X-ray
JSRT [194] | 247 | X-ray
BraTS 2018 [398] | 1,689 | MRI
BraTS 2019 [185] | 1,675 | MRI
OASIS [399] | 373-2,168 | MRI
HAM10000 [272] | 10,000 | Dermatoscopic
ISIC 18 [101] | 3,594 | Dermatoscopic
LUNA 16 [400] | 888 | CT scans
NIH Chest X-ray [196] | 112,000 | X-ray
APTOS [255] | 5,590 | Fundoscopy
Chest X-ray [401] | 5,856 | X-ray
NSLT [402] | 75,000 | CT scans
Diabetic Retinopathy [403] | 35,000 | Fundoscopy
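As a starting point for such robustness studies, the following is a minimal sketch of measuring accuracy under the single-step FGSM attack. The model, data loader, and perturbation budget are placeholders, and a thorough evaluation would also include stronger iterative attacks such as PGD.

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, images, labels, epsilon=2.0 / 255):
    """Single-step FGSM: perturb each pixel by epsilon along the sign of the loss gradient."""
    images = images.clone().requires_grad_(True)
    loss = F.cross_entropy(model(images), labels)
    loss.backward()
    adv = images + epsilon * images.grad.sign()
    return adv.clamp(0, 1).detach()

def robust_accuracy(model, loader, epsilon=2.0 / 255):
    """Accuracy of the classifier on adversarially perturbed inputs."""
    correct = total = 0
    for images, labels in loader:
        adv = fgsm_attack(model, images, labels, epsilon)
        with torch.no_grad():
            correct += (model(adv).argmax(dim=-1) == labels).sum().item()
        total += labels.numel()
    return correct / total
```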
Furthermore, a theoretical understanding that provides guarantees on the performance and robustness of ViTs, as has been pursued for CNNs [417], can be of significant interest to medical imaging researchers. In Table 9, we list datasets used in adversarial medical deep learning to evaluate the robustness of CNNs, for interested researchers who wish to benchmark the robustness of ViT-based models.

Despite the tremendous success of ViTs in numerous medical imaging applications, their intensive memory and computation requirements hamper their deployment on resource-constrained edge devices [423], [424]. Thanks to recent advancements in edge computing, healthcare providers can process, store, and analyze complex medical imaging data on premises, speeding diagnosis, improving clinician workflows, enhancing patient privacy, and saving time, and potentially lives. These edge devices provide extremely fast and highly accurate processing of large amounts of medical imaging data, demanding efficient hardware-aware designs to make ViT-based models suitable for edge-computing-based medical imaging hardware. Recently, a few efforts have been made to compress Transformer-based models by leveraging enhanced block-circulant matrix representations [425] and neural architecture search strategies [426]. Given the exceptional performance of ViTs, we believe there is a dire need for domain-optimized architectural designs tailored to edge devices. Such designs can have a tremendous impact on medical-imaging-based healthcare applications, where on-demand insights help teams make crucial and urgent decisions about patients.

Table 10: Tools and libraries for implementing distributed and secure deep learning.
Library                      Framework                    Description
TensorFlow Federated [418]   TensorFlow                   Open-source framework for machine learning and other computations on decentralized data.
CrypTen [419]                PyTorch                      Framework to facilitate research in secure and privacy-preserving machine learning.
OpenMined [420]              TensorFlow, PyTorch, Keras   Open-source decentralized privacy-preserving framework. Includes specialized tools like PySyft, PyGrid, and SyferText.
Opacus [421]                 PyTorch                      Enables training PyTorch models with differential privacy. Allows the client to track the privacy budget online.
Deepee [422]                 PyTorch                      Library for differentially private deep learning for medical imaging in PyTorch.
PriMIA [422]                 PyTorch                      Framework for end-to-end privacy-preserving decentralized deep learning for medical images.

Building robust deep-learning-based medical imaging models depends heavily on the amount and diversity of the training data. The data required to train a reliable and robust model may not be available within a single institution due to strict privacy regulations, the low incidence rate of some pathologies, data-ownership concerns, and limited numbers of patients. Federated Learning (FL) has been proposed to facilitate multi-hospital collaboration while obviating data transfer. Specifically, in FL, a shared model is built using distributed data from multiple devices, where each device trains the model on its local data and then shares the model parameters with the central model without sharing its actual data. Although a plethora of approaches exists that address FL for CNN-based medical imaging applications, the work is still in its infancy for ViTs and requires further attention. Recently, a few research efforts have been made to exploit the inherent structure of ViTs in distributed medical imaging applications. Park et al.
[218] propose a Federated Split Task-Agnostic (FESTA) framework that integrates the strengths of Federated and Split Learning [219], [220] to utilize a ViT for multiple chest X-ray tasks simultaneously, including COVID-19 diagnosis from CXR images, on a large corpus of decentralized data. Specifically, they split the ViT into a shared body and task-specific heads and demonstrate that a ViT body with sufficient capacity can be shared across relevant tasks by leveraging a multi-task learning (MTL) [221] strategy. However, FESTA is just a proof-of-concept study, and its applicability in clinical trials requires further experimentation. Furthermore, challenges such as privacy attacks and robustness to communication bottlenecks in ViT-based FL medical imaging systems require in-depth investigation. An interesting future direction is to explore recent privacy-enhancement approaches such as differential privacy [427] to prevent gradient inversion attacks [428] on FL-based medical imaging systems in the context of ViTs. In short, we believe that the successful implementation of distributed machine learning frameworks, coupled with the strengths of ViTs, could hold significant potential for enabling precision medicine at scale. This can lead to ViT models that yield unbiased decisions and are sensitive to rare diseases while respecting governance and privacy concerns. In Table 10, we highlight various tools and libraries that have been developed to implement distributed and secure deep learning; these can be useful for researchers who wish to rapidly prototype their ViT-based medical imaging models in distributed settings.

Recent efforts for ViT-based medical imaging systems have primarily focused on improving accuracy and generally lack a principled mechanism to evaluate generalization ability under different distribution/domain shifts. Recent studies have shown that test error generally increases in proportion to the distribution difference between training and test datasets, making this a crucial issue to investigate in the context of ViTs. In medical imaging applications, such distribution shifts arise from several factors, including images acquired with a different device model at a different hospital, images of an unseen disease not present in the training dataset, and incorrectly prepared images, e.g., poor contrast or blur. Extensive research exists on CNN-based out-of-distribution detection approaches in medical imaging [431]-[435]. Recently, a few attempts have shown that large-scale pre-trained ViTs, owing to their high-quality representations, can significantly improve the state of the art on a range of out-of-distribution tasks across different data modalities [386], [430], [436]. However, the investigation in these works has mostly been carried out on toy datasets such as CIFAR-10 and CIFAR-100, and therefore does not necessarily reflect out-of-distribution detection performance on medical images, which exhibit complex textures and patterns, high variance in feature scale (as in X-ray images), and locally specific features. This demands further research to design ViT-based medical imaging systems that are accurate for classes seen during training while providing calibrated estimates of uncertainty for abnormalities and unseen classes.
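One concrete recipe, in the spirit of Fig. 30 [430], is to score test images by their Mahalanobis distance [429] to class-conditional Gaussians fitted on in-distribution ViT embeddings. The sketch below outlines this scoring under the assumption that pooled (e.g., CLS-token) features are available; `extract_embeddings` is a hypothetical helper, not a specific library call.

```python
# Minimal sketch of a Mahalanobis-distance outlier score [429] computed on the
# penultimate embeddings of a pre-trained or fine-tuned ViT.
import numpy as np

def fit_gaussians(train_feats, train_labels):
    """Class-conditional means and a shared (tied) covariance over ID features."""
    classes = np.unique(train_labels)
    means = {c: train_feats[train_labels == c].mean(axis=0) for c in classes}
    centered = np.concatenate(
        [train_feats[train_labels == c] - means[c] for c in classes])
    cov = np.cov(centered, rowvar=False) + 1e-6 * np.eye(train_feats.shape[1])
    return means, np.linalg.inv(cov)

def mahalanobis_score(feats, means, cov_inv):
    """Outlier score = distance to the closest class centroid (higher = more OOD)."""
    dists = []
    for mu in means.values():
        diff = feats - mu
        dists.append(np.einsum("nd,de,ne->n", diff, cov_inv, diff))
    return np.min(np.stack(dists, axis=1), axis=1)

# Usage sketch (extract_embeddings is a placeholder for the model's pooled features):
# means, cov_inv = fit_gaussians(extract_embeddings(id_train_images), id_train_labels)
# scores = mahalanobis_score(extract_embeddings(test_images), means, cov_inv)
# High scores flag likely out-of-distribution inputs (unseen pathology, different
# scanner, corrupted acquisition, ...).
```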
We believe that research in this direction, using techniques from transfer learning and domain adaptation, will be of interest to practitioners working on life-critical medical imaging applications who envision practical deployment. In Fig. 30, we highlight the performance gain of ViTs over CNNs for out-of-distribution detection to inspire medical imaging researchers who wish to explore this area. Another possible direction is to explore recent advances in continual learning [437] to effectively mitigate the issue of domain shift using ViTs. A few preliminary efforts have been made in this direction [438]; however, the work is still in its infancy and requires further attention from the community. Further, standardized and rigorous evaluation protocols also need to be established for domain adaptation in medical imaging applications, similar to the DOMAINBED [439] framework in the natural image domain. Such a framework would also help promote model reproducibility.

Figure 30: A 2D PCA projection of the embedding space of three models, with two in-distribution classes and one out-of-distribution class. The points are projections of embeddings of the in-distribution classes (yellow and black) and the out-of-distribution class (red). The color coding shows the Mahalanobis outlier score [429]. The ResNet-20 plot (left panel) shows overlapping clusters, indicating that classes are not well separated. A ViT pre-trained on ImageNet-21k (middle panel) distinguishes the classes well but does not yield well-separated outlier scores. A ViT fine-tuned on the in-distribution dataset (right panel) is excellent at clustering embeddings by class and assigning high Mahalanobis distance to out-of-distribution inputs (red). Image courtesy of [430].

From the papers reviewed in this survey, it is evident that ViTs have pervaded every area of medical imaging (see Fig. 31). To keep pace with this rapid development, we recommend organizing relevant workshops at top computer vision and medical imaging conferences and arranging special issues in prestigious journals to quickly disseminate the relevant research to the medical imaging community.

In conclusion, we present the first comprehensive review of the applications of Transformers in medical imaging. We briefly cover the core concepts behind the success of Transformer models and then provide a comprehensive literature review of Transformers across a broad range of medical imaging tasks. Specifically, we survey the applications of Transformers in medical image segmentation, detection, classification, reconstruction, synthesis, registration, clinical report generation, and other tasks. In particular, for each of these applications, we develop a taxonomy, identify application-specific challenges, provide insights to solve them, and highlight recent trends. Despite their impressive performance, we anticipate that there is still much exploration left to be done with Transformers in medical imaging, and we hope this survey provides a roadmap for researchers to progress the field further.

Deep learning Backpropagation applied to handwritten zip code recognition Imagenet classification with deep convolutional neural networks Christoph Feichtenhofer, Trevor Darrell, and Saining Xie.
A convnet for the 2020s Deep learning at chest radiography: automated classification of pulmonary tuberculosis by using convolutional neural networks Overview of deep learning in gastrointestinal endoscopy Deep learning computed tomography Recent and upcoming technological developments in computed tomography: high speed, low dose, deep learning, multienergy Deep learning in mammography and breast histology, an overview and future trends Deep learning in medical ultrasound analysis: a review An overview of deep learning in medical imaging focusing on mri Deep learning for brain mri segmentation: state of the art and future directions Deep learning for pet image reconstruction Attention is all you need Pre-training of deep bidirectional transformers for language understanding Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity Non-local neural networks Disentangled non-local neural networks Stand-alone self-attention in vision models Attention augmented convolutional networks Scaling local selfattention for parameter efficient visual backbones An image is worth 16x16 words: Transformers for image recognition at scale An attentive survey of attention models Deformable detr: Deformable transformers for end-toend object detection Rethinking semantic segmentation from a sequenceto-sequence perspective with transformers Pre-trained image processing transformer Vivit: A video vision transformer Intriguing properties of vision transformers The emergence of the shape bias results from communicative efficiency Partial success in closing the gap between human and machine vision Are convolutional neural networks or transformers more like human vision? Is it time to replace cnns with transformers for medical images? Utnet: a hybrid transformer architecture for medical image segmentation Unetr: Transformers for 3d medical image segmentation Hepatic vessel segmentation based on 3dswin-transformer with inductive biased multi-head self-attention Transct: Dual-path transformer for low dose computed tomography Unsupervised mri reconstruction via zero-shot learned adversarial transformers Multi-compound transformer for accurate biomedical image segmentation Fundamentals of medical imaging Medical imaging Nonlinear total variation based noise removal algorithms Patch group based nonlocal self-similarity prior learning for image denoising Compressed sensing: theory and applications Hidden markov tree modeling of complex wavelet transforms Deep image prior Untrained neural network priors for inverse imaging problems: A survey Single-shot retinal image enhancement using deep image priors A review of medical image segmentation algorithms. EAI Endorsed Transactions on Pervasive Health and Technology Image reconstruction: From sparsity to data-adaptive methods and machine learning A survey of medical image classification techniques A review of image enhancement techniques in medical imaging. 
Machine Intelligence and Smart Systems Anomaly detection in medical imaging-a mini review Learning interpretable models Interpretable machine learning in healthcare A review of deep learning in medical imaging: Imaging traits, technology trends, case studies with progress highlights, and future promises Algorithmic aspects of inverse problems using generative models Comparison of handcrafted features and convolutional neural networks for liver mr image adequacy assessment Method of optimal directions for frame design K-svd: An algorithm for designing overcomplete dictionaries for sparse representation Data-driven tight frame construction and image denoising Lowrank modeling and its applications in image analysis γ-convergence approximation to piecewise smooth medical image segmentation Niftynet: a deep-learning platform for medical imaging. Computer methods and programs in biomedicine Torchio: a python library for efficient loading, preprocessing, augmentation and patch-based sampling of medical images in deep learning Deepneuro: an open-source deep learning toolbox for neuroimaging Generative adversarial network in medical imaging: A review Jeroen Awm Van Der Laak, Bram Van Ginneken, and Clara I Sánchez. A survey on deep learning in medical image analysis Guest editorial deep learning in medical imaging: Overview and future promise of an exciting new technique Deep learning for medical image analysis Deep learning in medical image analysis Not-so-supervised: a survey of semi-supervised, multi-instance, and transfer learning in medical image analysis Deep learning techniques for medical image segmentation: achievements and challenges Biomedical imaging and analysis in the age of big data and deep learning Deep learning in medical image registration: a survey. 
Machine Vision and Applications Algorithm unrolling: Interpretable, efficient deep learning for signal and image processing Deep learning techniques for inverse problems in imaging Plug-and-play methods for magnetic resonance imaging: Using denoisers for image recovery Neural machine translation by jointly learning to align and translate A survey on visual transformer Transformers in vision: A survey Efficient transformers: A survey A survey of transformers Dynamicvit: Efficient vision transformers with dynamic token sparsification Transformers are rnns: Fast autoregressive transformers with linear attention omformer: A nystr\" om-based algorithm for approximating self-attention Rethinking attention with performers Ra-unet: A hybrid deep attention-aware network to extract liver and tumor in ct scans Attention gated networks: Learning to leverage salient regions in medical images Attention res-unet with guided decoder for semantic segmentation of brain tumors Attention mechanisms in computer vision: A survey Ultrasound medical imaging techniques: A survey Unet++: Redesigning skip connections to exploit multiscale features in image segmentation Transunet: Transformers make strong encoders for medical image segmentation Boundary-aware transformers for skin lesion segmentation Automatic skin lesion segmentation with fully convolutional-deconvolutional networks Skin lesion analysis toward melanoma detection: A challenge at the international symposium on biomedical imaging (isbi) 2016, hosted by the international skin imaging collaboration (isic) Ph 2-a dermoscopic image database for research and benchmarking Skin lesion analysis toward melanoma detection 2018: A challenge hosted by the international skin imaging collaboration (isic) Fat-net: Feature adaptive transformers for automated skin lesion segmentation Isic 2017-skin lesion analysis towards melanoma detection Individual tooth segmentation from ct images using level set method with shape and intensity prior Gt u-net: A unet like group transformer network for tooth root segmentation Fourier descriptors for plane closed curves Agmb-transformer: Anatomy-guided multi-branch transformer network for automated evaluation of root canal therapy Xception: Deep learning with depthwise separable convolutions Transbridge: A lightweight transformer for left ventricle segmentation in echocardiography Sa-net: Shuffle attention for deep convolutional neural networks Video-based ai for beat-to-beat assessment of cardiac function Cotr: Efficiently bridging cnn and transformer for 3d medical image segmentation Identification of melanoma from hyperspectral pathology image using 3d convolutional networks U-net: Convolutional networks for biomedical image segmentation Spectral transformer for hyperspectral pathology image segmentation Automated kidney tumor segmentation with convolution and transformer network Deep residual learning for image recognition The 2021 kidney and kidney tumor segmentation challenge Pyramid scene parsing network Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs Feature pyramid networks for object detection Pyramid attention network for semantic segmentation Linknet: Exploiting encoder representations for efficient semantic segmentation Swin-unet: Unet-like pure transformer for medical image segmentation Swin transformer: Hierarchical vision transformer using shifted windows Segmenter: Transformer for semantic segmentation Medical transformer: Gated axial-attention for 
medical image segmentation Beit: Bert pre-training of image transformers Liver cancer segmentation challenge Evaluating transformer based semantic segmentation networks for pathological image segmentation End-to-end object detection with transformers Attentionbased transformers for instance segmentation of cells in microstructures Kaiming He, and Piotr Dollár. Focal loss for dense object detection Corneal endothelial cells over the past decade: are we missing the mark (er)? Translational vision science & technology A multibranch hybrid transformer networkfor corneal endothelial cell segmentation A system for the automatic estimation of morphometric parameters of corneal endothelium in alizarine red-stained images Transbts: Multimodal brain tumor segmentation using transformer Bitr-unet: a cnn-transformer combined network for mri brain tumor segmentation The rsnaasnr-miccai brats 2021 benchmark on brain tumor segmentation and radiogenomic classification nnu-net: Self-adapting framework for u-net-based medical image segmentation Application of majority voting to pattern recognition: an analysis of its behavior and performance A volumetric transformer for accurate 3d tumor segmentation nnformer: Interleaved transformer for volumetric segmentation Swin unetr: Swin transformers for semantic segmentation of brain tumors in mri images Adaptively sparse transformers A multidimensional choledoch database and benchmarks for cholangiocarcinoma diagnosis Breast ultrasound image segmentation: a survey Region aware transformer for automatic breast ultrasound tumor segmentation 3d deep attentive u-net with transformer for breast tumor segmentation from automated breast volume scanner net: learning dense volumetric segmentation from sparse annotation Deep learning in multi-organ segmentation Convolution-free medical image segmentation using transformers A deep attentive convolutional neural network for automatic cortical plate segmentation in fetal mri Deep learning techniques for automatic mri cardiac multi-structures segmentation and diagnosis: Is the problem solved? Synapse multi-organ segmentation dataset Transfuse: Fusing transformers and cnns for medical image segmentation Imagenet: A large-scale hierarchical image database Axial-deeplab: Stand-alone axialattention for panoptic segmentation Automatic real-time cnn-based neonatal brain ventricles segmentation Gland segmentation in colon histology images: The glas challenge contest A multi-organ nucleus segmentation challenge Self-supervised pre-training of swin transformers for 3d medical image analysis Automatic segmentation of head and neck tumor: How powerful transformers are? 
Transclaw u-net: Claw u-net with transformers for medical image segmentation Claw u-net: A unet-based network with deep feature concatenation for scleral blood vessel segmentation Levit: a vision transformer in convnet's clothing for faster inference LeViT-UNet: make faster encoders with transformer for medical image segmentation Transattunet: Multi-level attention-guided u-net with transformer for medical image segmentation After-unet: Axial fusion transformer unet for medical image segmentation A large annotated medical image dataset for the development and evaluation of segmentation algorithms A deep learning-based auto-segmentation system for organs-at-risk on whole-body computed tomography images for radiation therapy Segthor: Segmentation of thoracic organs at risk in ct images Linformer: Self-attention with linear complexity Multi-centre, multi-vendor and multi-disease cardiac segmentation: the m&ms challenge Ds-transunet: Dual swin transformer u-net for medical image segmentation More than encoder: Introducing transformer decoder to upsample Kvasirseg: A segmented polyp dataset Wmdova maps for accurate polyp highlighting in colonoscopy: Validation vs. saliency maps from physicians Automated polyp detection in colonoscopy videos using shape and context information A benchmark for endoluminal scene segmentation of colonoscopy images Toward embedded detection of polyps in wce images for early diagnosis of colorectal cancer Pyramid medical transformer for medical image segmentation A dataset and a technique for generalized nuclear segmentation for computational pathology Multimodal brain tumor segmentation challenge Multimodal brain tumor segmentation challenge 2020 U-net transformer: self and cross attention for medical image segmentation The cancer imaging archive (tcia): maintaining and operating a public information repository Overview of the hecktor challenge at miccai 2020: automatic head and neck tumor segmentation in pet/ct Medical image segmentation using squeezeand-expansion transformers Refuge challenge: A unified framework for evaluating automated methods for glaucoma assessment from fundus photographs Pranet: Parallel reverse attention network for polyp segmentation Nucleus segmentation across imaging experiments: the 2018 data science bowl Development of a digital image database for chest radiographs with and without a lung nodule: receiver operating characteristic analysis of radiologists' detection of pulmonary nodules Two public chest x-ray datasets for computer-aided screening of pulmonary diseases. Quantitative imaging in medicine and surgery Xlsor: A robust and accurate lung segmentor on chest x-rays using criss-cross attention and customized radiorealistic abnormalities generation Benchmarking deep learning models and automated model design for covid-19 detection with chest ct scans. 
medRxiv Polyp-pvt: Polyp segmentation with pyramid vision transformers Missformer: An effective medical image segmentation transformer Cotr: Convolution in transformer network for end to end polyp detection Laplacian pyramid reconstruction and refinement for semantic segmentation Context prior for scene segmentation Miccai multi-atlas labeling beyond the cranial vaultworkshop and challenge Deformable convolutional networks Correlation of chest ct and rt-pcr testing for coronavirus disease 2019 (covid-19) in china: a report of 1014 cases Sensitivity of chest ct for covid-19: comparison to rt-pcr Can chest ct features distinguish patients with negative from those with positive initial rt-pcr results for coronavirus disease (covid-19)? Pocformer: A lightweight transformer architecture for detection of covid-19 using point of care ultrasound Mobilenetv2: Inverted residuals and linear bottlenecks Pocovidnet: automatic detection of covid-19 from a new lung ultrasound imaging dataset (pocus) Covid-19 image data collection: Prospective predictions are the future Automatic diagnosis of covid-19 using a tailored transformer-like network Vision outlooker for visual recognition Can ai help in screening viral and covid-19 pneumonia? Covid-19 detection in chest x-ray images using swin-transformer and transformer in transformer Transformer in transformer Federated deep learning for detecting covid-19 lung abnormalities in ct: a privacy-preserving multinational validation study Federated split vision transformer for covid-19cxr diagnosis using task-agnostic training Federated machine learning: Concept and applications Split learning for health: Distributed deep learning without sharing raw patient data Multitask learning. Machine learning Chest ct in covid-19: what the radiologist needs to know Visual transformer with statistical test for covid-19 classification Wilcoxon signed-rank test A transformer-based framework for automatic covid19 diagnosis in chest cts Mia-cov19d: A transformer-based framework for covid19 classification in chest cts Anastasios Arsenos, Levon Soukissian, and Stefanos Kollias. Mia-cov19d: Covid-19 detection through 3-d chest ct image analysis Review of visual saliency detection with comprehensive information Gradcam: Visual explanations from deep networks via gradient-based localization Arnab Bhattacharjee, Parag Singla, and Prathosh AP. 
xvitcos: Explainable vision transformer based covid-19 screening using radiography Vision transformer for covid-19 cxr diagnosis using chest x-ray feature corpus A simple framework for contrastive learning of visual representations Chexpert: A large chest radiograph dataset with uncertainty labels and expert comparison Transformer interpretability beyond attention visualization Covid-vit: Classification of covid-19 from ct chest images based on vision transformer models Densely connected convolutional networks Covid-net ct-2: Enhanced deep neural networks for detection of covid-19 from chest ct images through bigger, more diverse learning Covid-transformer: Interpretable covid-19 detection using vision transformer for healthcare Transmil: Transformer based correlated multiple instance learning for whole slide image classication Transmed: Transformers advance multi-modal medical image classification Smile: Sparse-attention based multiple instance contrastive learning for glioma sub-type classification using pathological images Nsclc radiogenomics: initial stanford study of 26 cases Vision transformers for classification of breast ultrasound images Dataset of breast ultrasound images Automated breast ultrasound lesions detection using convolutional neural networks Gene transformer: Transformers for the gene expression-based classification of lung cancer subtypes Gashis-transformer: A multi-scale visual transformer approach for gastric histopathology image classification Method for diagnosis of acute lymphoblastic leukemia based on vit-cnn ensemble model Multiple instance learning for computer aided diagnosis Diagnostic assessment of deep learning algorithms for detection of lymph node metastases in women with breast cancer A deep learning based graph-transformer for whole slide image classification. medRxiv Semi-supervised classification with graph convolutional networks A radiogenomic dataset of non-small cell lung cancer. Scientific data Aptos 2019 blindness detection: Detect diabetic retinopathy to stop blindness before it's too late Skin lesion analysis towards melanoma detection A curated mammography data set for use in computer-aided detection and diagnosis research Mil-vt: Multiple instance learning enhanced vision transformer for fundus image classification Automatic detection of rare pathologies in fundus photographs using few-shot learning Lesion-aware transformers for diabetic retinopathy grading Feedback on a publicly distributed image database: the messidor database Eyepacs: an adaptable telemedicine system for diabetic retinopathy screening Siim-acr pneumothorax segmentation Covid-net: A tailored deep convolutional neural network design for detection of covid-19 cases from chest x-ray images Covit-gan: Vision transformer forcovid-19 detection in ct scan imageswith self-attention gan fordataaugmentation Sars-cov-2 ct-scan dataset: A large dataset of real patients ct scans for sars-cov-2 identification Extensive covid-19 x-ray and ct chest images dataset. mendeley data, v3 Curated dataset for covid-19 posterior-anterior chest radiography images (x-rays) Chest x-ray image phase features for improved diagnosis of covid-19 using convolutional neural network Emerging properties in self-supervised vision transformers The ham10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions. 
Scientific data Fundus disease image classification based on improved transformer Ophthalmic image analysis dataset Vision transformer-based recognition of diabetic retinopathy grade Encoding retina image to words using ensemble of vision transformers for diabetic retinopathy grading Unified 2d and 3d pre-training for medical image classification and segmentation Evaluate the malignancy of pulmonary nodules using the 3-d deep leaky noisy-or network A comprehensive study of applying object detection methods for medical image analysis Xiaozhou Shi, and Junwen Pan. Transformer for polyp detection Lymph node detection in t2 mri with transformers Multimodal transformers excel at class-agnostic object detection Mdetr-modulated detection for end-to-end multi-modal understanding Screening for lung cancer with low-dose computed tomography: a systematic review and meta-analysis of the baseline findings of randomized controlled trials Hélène de Forges, Pascale Fabbro-Peray, and Julien Frandon. Systematic review and meta-analysis on the impact of lung cancer screening by low-dose computed tomography Low-dose ct for the detection and classification of metastatic liver lesions: results of the 2016 low dose ct grand challenge Model-based image reconstruction for mri Inverse gans for accelerated mri reconstruction Ted-net: Convolutionfree t2t vision transformer-based encoder-decoder dilation network for low-dose ct denoising Tokens-totoken vit: Training vision transformers from scratch on imagenet Eformer: Edge enhancement based transformer for medical image denoising Edcnn: Edge enhancement-based densely connected network with compound loss for low-dose ct denoising An isotropic 3x3 image gradient operator. Presentation at Stanford AI Project 3d transformer-gan for highquality pet reconstruction Deep learning for undersampled mri reconstruction Accelerated multi-modal mr imaging with transformers Task transformer network for joint mri reconstruction and superresolution Mr image super resolution by combining feature disentanglement cnns and vision transformers Swapping autoencoder for deep image manipulation Deep mri reconstruction with generative vision transformers Vision transformers enable fast and robust accelerated mri Image information and visual quality An open dataset and benchmarks for accelerated mri Dudotrans: Dual-domain transformer provides more attention for sinogram restoration in sparse-view ct reconstruction Multidomain integrative swin transformer network for sparse-view tomographic reconstruction E-dssr: Efficient dynamic surgical scene reconstruction with transformer-based stereoscopic depth perception Self-supervised siamese learning on stereo image pairs for depth estimation in robotic surgery Emine Ulku Saritas, Can Barış Top, and Tolga Ç ukur. Transms: Transformers for super-resolution calibration in magnetic particle imaging Openmpidata: An initiative for freely accessible magnetic particle imaging data End-to-end variational networks for accelerated mri reconstruction Framing u-net via deep convolutional framelets: Application to sparse-view ct Image reconstruction for sparse-view ct and interior ct-introduction to compressed sensing and differentiated backprojection. 
Quantitative imaging in medicine and surgery On the loss landscape of adversarial training: Identifying challenges and how to overcome them Ptnet: A high-resolution infant mri synthesizer based on transformer Image-to-image translation with conditional adversarial networks High-resolution image synthesis and semantic manipulation with conditional gans The developing human connectome project: A minimal processing pipeline for neonatal cortical surface reconstruction Vtgan: Semi-supervised retinal image synthesis and disease prediction using vision transformers Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems Diabetic retinopathy grading by digital curvelet transform. Computational and mathematical methods in medicine Cytran: Cycle-consistent transformers for non-contrast to contrast ct translation Deep mr to ct synthesis using unpaired data Resvit: Residual vision transformers for multi-modal medical image synthesis The multimodal brain tumor image segmentation benchmark (brats) Mr and ct data with multiobserver delineations of organs in the pelvic area-part of the gold atlas project Generative adversarial networks in computer vision: A survey and taxonomy Medical image registration in image guided surgery: Issues, challenges and research opportunities Vit-v-net: Vision transformer for unsupervised volumetric medical image registration V-net: Fully convolutional neural networks for volumetric medical image segmentation Transformer for unsupervised medical image registration Learning dual transformer network for diffeomorphic registration Open access series of imaging studies (oasis): cross-sectional mri data in young, middle aged, nondemented, and demented older adults Ion Androutsopoulos, and Dimitris Papamichail. Diagnostic captioning: a survey Deep learning in generating radiology reports: A survey John Pavlopoulos, and Ion Androutsopoulos. 
A survey on biomedical image captioning A survey on deep learning and explainability for automatic imagebased medical report generation Self-critical sequence training for image captioning Reinforced transformer for medical image captioning Preparing a collection of radiology examinations for distribution and retrieval Improving factual completeness and consistency of image-to-text radiology report generation Bleu: a method for automatic evaluation of machine translation Cider: Consensus-based image description evaluation Bertscore: Evaluating text generation with bert Surgical instruction generation with transformers Database for ai surgical instruction Hierarchical x-ray report generation via pathology tags and multi head attention Exploring and distilling posterior and prior knowledge for radiology report generation Aligntransformer: Hierarchical alignment of visual regions and disease tags for medical report generation Meshed-memory transformer for image captioning Mimic-cxr-jpg, a large publicly available database of labeled chest radiographs Crossmodal memory networks for radiology report generation Ffa-ir: Towards an explainable and reliable medical report generation benchmark Padchest: A large chest x-ray image dataset with multi-label annotated reports Knowledge-driven encode, retrieve, paraphrase for medical image report generation Diaretdb1 diabetic retinopathy database and evaluation protocol Auxiliary signal-guided knowledge encoder-decoder for medical report generation Deepopht: medical report generation for retinal images via deep models and visual explanation Locating blood vessels in retinal images by piecewise threshold probing of a matched filter response Weakly supervised contrastive learning for chest x-ray report generation Automatic generation of chest x-ray reports using a transformer-based deep learning model Ratchet: Medical transformer for chest x-ray diagnosis and reporting Generating radiology reports via memory-driven transformer Progressive transformer-based generation of radiology reports Jianping Fan, and Zhiqiang He. Confidence-guided radiology report generation Automated generation of accurate\& fluent medical x-ray reports Auto-encoding knowledge graph for unsupervised medical report generation Learning to generate clinically coherent chest x-ray reports Meteor 1.3: Automatic metric for reliable optimization and evaluation of machine translation systems A package for automatic evaluation of summaries Automated radiology report generation using conditioned transformers Encoder-agnostic adaptation for conditional language generation Language models are unsupervised multitask learners Curriculum learning Medskip: Medical report generation using skip connections and integrated attention Deep highresolution representation learning for human pose estimation Medical image captioning model to convey more details: Methodological comparison of feature difference generation Radiology objects in context (roco): a multimodal image dataset On the automatic generation of medical imaging reports Overview of the imageclef 2018 caption prediction tasks Multimodal co-attention transformer for survival prediction in gigapixel whole slide images Explainable transformer-based neural network for the prediction of survival outcomes in non-small cell lung cancer (nsclc) Tumor mutational load predicts survival after immunotherapy across multiple cancer types Does clip benefit visual question answering in the medical domain as much as it does in the general domain? 
Learning transferable visual models from natural language supervision Pretraining and fine-tuning transformers for fmri prediction tasks Bootstrap your own latent: A new approach to self-supervised learning Training dataefficient image transformers & distillation through attention Big self-supervised models advance medical image classification Contrastive learning leveraging patient metadata improves representations for chest x-ray interpretation Big self-supervised models are strong semisupervised learners On the interpretability of artificial intelligence in radiology: challenges and opportunities Explainable deep learning models in medical image analysis Interpretable medical image classification with selfsupervised anatomical embedding and prior knowledge Estimating uncertainty and interpretability in deep learning for coronavirus (covid-19) detection An attentive survey of attention models Identifying the best machine learning algorithms for brain tumor segmentation, progression assessment, and overall survival prediction in the brats challenge Open access series of imaging studies Lung nodule analysis Identifying medical diagnoses and treatable diseases by image-based deep learning National lung screening trial -the cancer data assess system Diabetic retinopathy challenge Understanding adversarial attacks on deep learning based medical image analysis systems Toward an understanding of adversarial examples in clinical trials Adversarial attacks on medical machine learning Adversarial robustness comparison of vision transformer and mlp-mixer to cnns Understanding robustness of transformers for image classification Towards transferable adversarial attacks on vision transformers On the adversarial robustness of visual transformers Vision transformers are robust learners Fahad Shahbaz Khan, and Fatih Porikli. 
On improving adversarial transferability of vision transformers Rethinking the design principles of robust vision transformer On the robustness of vision transformers to adversarial examples Adversarial token attacks on vision transformers Reveal of vision transformers robustness against adversarial attacks Towards proving the adversarial robustness of deep neural networks Tensorflow federated: Machine learning on decentralized data Crypten: Secure multi-party computation meets machine learning To lower the barrier to entry to privacy preserving technology Userfriendly differential privacy library in pytorch End-to-end privacy preserving deep learning on multiinstitutional medical imaging Improving the efficiency of transformers for resource-constrained devices Ftrans: energy-efficient acceleration of transformers using fpga Hat: Hardware-aware transformers for efficient natural language processing Deep learning with differential privacy Evaluating gradient inversion attacks and defenses in federated learning A simple unified framework for detecting out-of-distribution samples and adversarial attacks Exploring the limits of out-of-distribution detection Generalized out-of-distribution detection: A survey Out of distribution detection for medical images Delving deep into the generalization of vision transformers under distribution shifts Efficient out-of-distribution detection in digital pathology using multi-head convolutional neural networks A baseline for detecting misclassified and out-of-distribution examples in neural networks Oodformer: Out-of-distribution detection transformer Recent advances of continual learning in computer vision: An overview Continual learning for domain adaptation in chest x-ray classification search of lost domain generalization The authors would like to thank Maryam Sultana (MBZ University of Artificial Intelligence) for her help with a few figures.