RadioTransformer: A Cascaded Global-Focal Transformer for Visual Attention-guided Disease Classification
Moinak Bhattacharya, Shubham Jain, Prateek Prasanna
2022-02-23

In this work, we present RadioTransformer, a novel visual attention-driven transformer framework that leverages radiologists' gaze patterns and models their visuo-cognitive behavior for disease diagnosis on chest radiographs. Domain experts, such as radiologists, rely on visual information for medical image interpretation. On the other hand, deep neural networks have demonstrated significant promise in similar tasks, even where visual interpretation is challenging. Eye-gaze tracking has been used to capture the viewing behavior of domain experts, lending insights into the complexity of visual search. However, deep learning frameworks, even those that rely on attention mechanisms, do not leverage this rich domain information. RadioTransformer fills this critical gap by learning from radiologists' visual search patterns, encoded as 'human visual attention regions', in a cascaded global-focal transformer framework. The overall 'global' image characteristics and the more detailed 'local' features are captured by the proposed global and focal modules, respectively. We experimentally validate the efficacy of our student-teacher approach on 8 datasets involving different disease classification tasks where eye-gaze data is not available during the inference phase.

Medical image interpretation relies largely on how domain experts study images. Radiologists hone their image search skills during years of training on medical images from different domains. In fact, studies have shown that systematic visual search patterns can lead to improved diagnostic performance [42, 78].

Figure 1: Visual search patterns of radiologists on chest radiographs are used to first train a global-focal teacher network, referred to as human visual attention training (Section 3.1). This pre-trained teacher network teaches the global-focal student network to learn visual attention in a self-supervised manner using a novel visual attention loss (Section 3.3). The global-focal student-teacher network is implemented to explicitly integrate radiologist visual attention for improving disease classification on chest radiographs.

Current diagnostic and prognostic models, however, are limited to image content semantics such as disease location, annotation, and severity level, and do not take this rich auxiliary domain knowledge into account. They primarily implement hand-crafted descriptors or deep architectures that learn textural and spatial features of diseases [5, 61]. The spatial dependencies of intra-image disease patterns, often subconsciously interpreted by expert readers, may not be adequately captured via image feature representation learning alone. Recent attention-based architectures instead reason over image patches to determine diagnostically relevant regions-of-interest. Although these approaches integrate long-range feature dependencies and learn high-level representations, they lack a priori domain knowledge, fundamentally rooted in disease pathophysiology and its manifestation on images. Recently, it has been demonstrated that deep-learning networks can be trained to learn radiologists' attention level and decisions [47]. However, it is still unclear how effectively and efficiently such search patterns can be used to improve a model's decision-making ability.
To address this gap, we propose to leverage domain experts' systematic viewing patterns, as the basis of underlying attention and intention, to guide a deep learning network towards improved disease diagnosis.

Motivation. The motivation for our approach stems from a) understanding the importance of human visual attention in medical image interpretation, and b) understanding medical experts' search heuristics in decision-making. Medical image interpretation is a complex process that broadly comprises a global-focal approach involving a) identifying suspicious regions from a global perspective, and b) identifying specific abnormalities with a focal perspective. During the global screening process, radiologists scan for coarse, low-contrast features in which certain textural attributes are analyzed and prospective abnormal regions of interest are identified. In the focal process, the regions of abnormality are re-examined to determine the severity or type of disease, or to reject the assumption of abnormality. For example, while analyzing a chest radiograph for COVID-19, a radiologist skims through the thoracic region at a glance to identify suspicious regions based on intensity variations. This helps in selective identification by eliminating 'obviously healthy' regions. The focal feature learning process involves a more critical analysis of the suspicious regions to understand the structural and morphological characteristics of specific regions and their surroundings. This typically involves domain-specific features such as the distribution of infiltrates and the accumulation of fluid. We use this as a motivation to design a global-focal transformer, called RadioTransformer, that integrates human visual cognition with the self-attention-based learning of transformers. This improves their class activation regions, leading to a probabilistic score from attention features that correlates highly with human visual attention-based diagnosis. The objective of our work is to augment the fast learning convergence capabilities of deep networks in a disease diagnosis setting with domain-specific expert viewing patterns in a cognitive-aware manner.

Contributions. The primary contributions of this work can be summarized as follows:
• A novel student-teacher based global-focal RadioTransformer architecture, constituting transformer blocks with shifting windows, is proposed to leverage radiologists' visual attention in order to improve diagnostic accuracy. The global module attempts to learn high-level coarse representations and the focal module attempts to learn low-level granular representations, with two-way lateral connections to address the semantic attention gap with smoothed moving average training.
• A novel visual attention loss (VAL) is proposed to train the student network in a self-supervised manner with the visual attention regions from the teacher network, which is pre-trained with radiologist visual attention. This loss explicitly teaches the student network to focus on regions from teacher-generated visual attention using a weighted combination of attention region overlap and regression of center and boundary points.

Figure 7 shows an overview of the proposed RadioTransformer architecture consisting of the global-focal student-teacher network with a novel Visual Attention Loss. While the underlying concepts of the proposed framework are domain-agnostic, in this work we have validated it on pulmonary and thoracic disease classification on chest radiographs.

Eye-gaze tracking in Radiology.
Eye-tracking studies have been conducted in radiology to draw insights into the visual diagnosis process [39, 78]. Experts' visual search patterns have been studied in various diseases [32, 38, 49, 54, 89, 92] to understand their relationship with the diagnostic performance of radiologists [2, 14, 82]. Clinical error in diagnostic interpretation has often been attributed to reader fatigue and strain, which has been extensively validated via eye-tracking studies [16, 72, 80, 83]. Variations in cognition and perceptual patterns while viewing images can cause the same image to be interpreted differently by different experts. This has led to a few studies displaying eye positions from experts as a visual aid to improve the diagnostic performance of novice readers [36, 42]. The dependence of diagnostic decisions on visual search patterns presents a unique opportunity to integrate this rich auxiliary domain information into computer-aided diagnosis systems.

Visual attention-driven learning. In the context of image interpretation, visual attention refers to the cognitive operations that direct an observer's attention to specific regions in an image. We represent visual attention as saliency maps constructed by tracking users' eye movements. Eye-gaze [34] has been used in several computer vision studies [29, 52] for head-pose estimation, human-computer interaction, driver vigilance monitoring, etc. Human eyes tend to focus on visual features such as corners [45], luminance [71], visual onsets [74, 75], dynamic events [23, 24], color, intensity, and orientation [25, 26, 59]. Image perception, in general, is hence tightly coupled with the visual attention of the observer. Several methods involving gaze analysis have been proposed for tasks such as object detection [57, 90, 91], image segmentation [51, 65], object referring [79], action recognition [21, 40, 48, 81], and action localization [68]. Other specialized methods use visual attention for goal-oriented localization [43] and egocentric activity recognition [50]. A recent work incorporated sonographer knowledge in the form of gaze-tracking data on ultrasounds to enhance anatomy classification tasks [60]. In another study [70], Convolutional Neural Networks (CNNs) trained on eye-tracking data were shown to be equivalent to those trained on manually annotated masks for the task of tumor segmentation. Despite evidence of the importance of expert gaze patterns in improving image interpretation, their role in machine-learning-driven disease classification in radiology is still underexplored. The interpretation of radiology images is a complex task, requiring specialized viewing patterns unlike the more general visual attention in other tasks. For example, determining whether a lesion is cancerous or not involves the following hierarchical steps: a) detecting the presence of a lesion, b) recognizing whether it is pathologic, c) determining the type, and finally, d) providing a diagnosis. These sequential analysis patterns, to some extent, are captured by the visual search patterns, which are not leveraged by machine learning models. To bridge this gap, our proposed work uses the visual attention knowledge from radiologists to train a transformer-based model for improving disease classification on chest radiographs.

Disease classification on chest radiographs.
Reliable classification of cardiothoracic and pulmonary diseases on chest radiographs is a crucial task in Radiology, owing to the high morbidity and mortality resulting from such abnormalities. Several methods have been proposed to address this, of which the most prominent baselines, CheXNet [64] and CheXNext [63], use a DenseNet-121 [20] backbone. Attention-based models, such as A³Net [84] and DuaLAnet [73], have also been proposed for this diagnostic task. CheXGCN [6] and SSGE [7] are Graph Convolutional Network (GCN)-based methods; the latter proposes a student-teacher based SSL method. More recently, attempts have been made to develop methods for diagnosis and prognosis of COVID-19 from chest radiographs. Most of these methods [3, 22, 46, 85, 87] use deep convolutional neural network backbones for COVID-19 prediction. Although CNN-based methods have achieved tremendous success through generic feature extraction strategies, these architectures often fail to comprehensively encode spatial features from a biological viewpoint [35]. To address this limitation, transformer-based approaches, such as vision transformers [12], have been proposed. The self-attention mechanism in transformers integrates global information by encoding the relative locations of the patches. A few recent works have proposed vision transformers for the COVID-19 prediction task [53, 58]. However, the efficacy of shifting-window-based [44] transformer architectures has not been evaluated in this domain. These recent methods compute self-attention among patches within local windows. As an example, Swin-UNet [4] implements swin transformer blocks for medical image segmentation. These blocks are well suited to characterize intra-image disease heterogeneity, a crucial factor affecting diagnosis and patient prognosis. This motivates our choice of using shifting-window blocks in the proposed global-focal network.

Figure 2 presents an overview of the end-to-end framework of the proposed RadioTransformer global-focal student-teacher network. It comprises two parallel architectures, a student and a teacher model. Both the student and teacher networks have global and focal network components. Four focal blocks in each model are cascaded with two global blocks in parallel. The global and focal blocks are connected via a two-way lateral (TWL) connection [10, 13, 41] with smoothed exponential moving average (SEMA). SEMA regulates the attention features shared between the global and focal blocks to bridge the attention gap caused by different learning scales across these networks. The teacher model is first trained with human visual attention obtained from visual search patterns of radiologists. The student model learns the behavior of the teacher network via a self-supervised VAL and a classification loss. There are two TWL connections between the teacher and student models, coupled with layered SEMA. The proposed architecture is explained in the following subsections.
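Before those subsections, the following minimal sketch illustrates the TWL fusion with SEMA mentioned in the overview above: features from a global and a focal block are combined by weighted addition, and the result is smoothed with an exponential moving average whose decay depends on the number of samples in the current iteration. This is an illustrative sketch in TensorFlow, not the authors' implementation; the function name, the weights, and the tensor shapes are assumptions, and only the general mechanism and the 1 − 1/N decay follow the description in this paper.

```python
# Minimal sketch of a two-way lateral (TWL) fusion with a smoothed exponential
# moving average (SEMA), assuming TensorFlow. Shapes, weights, and the function
# name are illustrative, not the authors' implementation.
import tensorflow as tf

def twl_sema_fuse(z_global, z_focal, s_prev, lambda_g=0.5, lambda_f=0.5):
    """Weighted addition of global/focal features followed by SEMA smoothing."""
    n = tf.cast(tf.shape(z_global)[0], tf.float32)    # samples in this iteration
    delta = 1.0 - 1.0 / n                             # smoothing decay, 1 - 1/N
    z_fused = lambda_g * z_global + lambda_f * z_focal
    s_new = delta * s_prev + (1.0 - delta) * z_fused  # exponential moving average
    return z_fused, s_new

# Toy usage with random features from the two streams.
g = tf.random.normal((8, 49, 96))   # e.g., batch of 8, 49 tokens, 96 channels
f = tf.random.normal((8, 49, 96))
s = tf.zeros_like(g)
fused, s = twl_sema_fuse(g, f, s)
```

In the full architecture, separate weights and smoothed states are maintained for the intermediate and final connection points of each TWL connection; the sketch collapses these into a single call for brevity.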
Pre-processing. In this subsection, we discuss the methodology for extracting visual search patterns from eye-tracking data and generating visual attention maps of radiologists. The eye-tracking data [31] consists primarily of a) raw eye-gaze information (as shown in Figure 3.*.*.2), and b) fixation information, captured from radiologists while they analyze chest radiographs in a single-screen setting. The eye-gaze points are reflective of the diagnostic search patterns. The cumulative attention regions, represented as heatmaps (Figure 3.*.*.3), are human attention regions reflective of diagnostically important areas. A multidimensional Gaussian filter with standard deviation σ = 64 is used to generate these attention heatmaps. Contours are then selected from these attention heatmaps with a threshold value of λ = 140 and, subsequently, bounding boxes are generated from the contour with the largest area, as shown in Figure 3.*.*.4.

Human visual attention training (HVAT). Next, the teacher network is trained with ground-truth labels (in our case normal, pneumonia, and congestive heart failure (CHF)) and the bounding boxes generated during pre-processing. The teacher network has a classification head that provides an output probability value and a detection head that outputs key-points. The probability value is a 1 × n vector, where n represents the number of different disease labels. The key-point output is {x_c, y_c, h, w}, where (x_c, y_c) are the x and y coordinates of the center, and (h, w) are the height and width, respectively. Categorical cross-entropy loss is used for classification, and a weighted addition of Generalized Intersection over Union (GIoU) loss [66] and Mean Squared Error (MSE) loss is used for detection. The teacher model is thus pre-trained with human visual attention, learned through both classification and identification of visual attention regions.
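A minimal sketch of the pre-processing step above (heatmap generation, thresholding, and bounding-box extraction), which produces the key-point targets {x_c, y_c, h, w} used in HVAT, is shown below. Only the Gaussian standard deviation (σ = 64) and the contour threshold (140) come from the description in this paper; the helper name, the way gaze points are accumulated into a 2D map, and the normalization step are illustrative assumptions, assuming SciPy and OpenCV.

```python
# Sketch: gaze points -> Gaussian attention heatmap -> thresholded contour ->
# bounding-box key-points (x_c, y_c, h, w). Assumes SciPy and OpenCV; the
# helper name and shapes are illustrative.
import numpy as np
import cv2
from scipy.ndimage import gaussian_filter

def attention_bbox(gaze_xy, image_shape, sigma=64, thresh=140):
    """Return (x_c, y_c, h, w) of the largest high-attention region."""
    heat = np.zeros(image_shape, dtype=np.float32)
    for x, y in gaze_xy:                       # accumulate raw gaze points
        heat[int(y), int(x)] += 1.0
    heat = gaussian_filter(heat, sigma=sigma)  # multidimensional Gaussian, sigma=64
    heat = cv2.normalize(heat, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
    _, mask = cv2.threshold(heat, thresh, 255, cv2.THRESH_BINARY)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return None
    largest = max(contours, key=cv2.contourArea)   # contour with the largest area
    x, y, w, h = cv2.boundingRect(largest)
    return (x + w / 2.0, y + h / 2.0, h, w)        # center, height, width

# Toy usage on a 1024x1024 radiograph with a handful of gaze points.
box = attention_bbox([(500, 510), (505, 512), (520, 530)], (1024, 1024))
```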
Global-focal networks can be described as a single-stream architecture in which the two components operate in parallel. The global network consists of two and the focal network of four shifting-window transformer blocks (Figure 4). This draws its analogy from the pathways that involve the Parvo, Magno, and Konio ganglion cells [55, 88]. The focal network is inspired by the functioning of the slow-responding Parvo cells (in the 'what' pathway), and the global network is inspired by the fast Magno cells (in the 'where' pathway). The teacher and student networks are variants of the global-focal architecture. The primary idea of the global-focal architecture is to pseudo-replicate learning of attention in a detailed shifting-window fashion, as shown in Supplementary Figure 1. The focal and global layers are represented as f_i and g_j, where i ∈ {0, 1, 2, 3} and j ∈ {0, 1}, respectively.

Focal network. The focal network is implemented to learn high-contrast, focal information by shifting the windows incrementally over four blocks that are cascaded in series. The first block of the focal network has a multi-layer perceptron head, h_mlp.

TWL connections between the global and focal networks constitute a weighted addition of the outputs from the aforementioned layers, coupled with SEMA on the weighted-addition outputs. This can be represented as

z_gf_p = λ_gf_p1 · z(g_p(·)) + λ_gf_p2 · z(f_p(·)),

where λ_gf_p1 and λ_gf_p2 are the hyper-parameters for weighted addition of the outputs from the global-focal networks (denoted gf), z(g_p(·)) is the output from the global network, z(f_p(·)) is the output from the focal network, and p ∈ {in, out}, where in is the intermediate and out is the final output. {z_f_in, z_g_in} : {z(f_in(·)), z(g_in(·))} are the outputs from the intermediate layers of the focal and global networks, respectively, and {z_f_out, z_g_out} : {z(f_out(·)), z(g_out(·))} are the final outputs from the focal and global networks, respectively. This is shown in Figure 4. The smoothed moving average is given by

s_v_p = δ̂_gf_p · s_v + (1 − δ̂_gf_p) · v,

where s_v_p is the smoothed value of the current variable v in the current iteration for a given p, and s_v is the smoothed value of the variable from the previous iteration for a different p. δ̂_gf_p is the smoothing decay hyperparameter of the global-focal TWL connection, represented as δ̂_gf_p = 1 − 1/N, where N is the number of samples in the current iteration.

A student-teacher based self-supervised learning network is proposed in this work. The student network is trained by leveraging visual attention from the teacher, which is pre-trained on direct visual attention maps obtained from radiologists' eye-tracking data. The teacher network is updated with a SEMA from the student network.

Teacher network. The teacher network is a cascaded global-focal learning network with two global and four focal blocks connected in parallel. Its input, x_t, is subjected to hard augmentation techniques with stateless, high-value intervals of brightness, contrast, hue, and saturation. z_t_in is the intermediate output of the teacher network, with {λ_l0_t1, λ_l0_t2} and {λ_l1_t1, λ_l1_t2} as the hyperparameters for weighted addition of the intermediate and final outputs from the global and focal blocks, respectively.

Student network. The input to the student network, x_s, is softly augmented with stateless, relatively low-value intervals of brightness, contrast, hue, and saturation compared to the teacher network. The student predicts probability values for the disease classes along with an attention region. This attention region is subjected to a self-supervised loss against the attention region output by the teacher network. z_s_in is the intermediate output of the student network, with {λ_l0_s1, λ_l0_s2} and {λ_l1_s1, λ_l1_s2} as the hyperparameters for weighted addition of the intermediate and final outputs from the global and focal blocks of the student network, respectively.

TWL connections. TWL connections between the student and teacher architectures are introduced between layers {f_in, g_in} and {f_out, g_out}. The weighted addition of the outputs from these layers is coupled with SEMA, where z_st_in is the output from the intermediate TWL connection of the student-teacher network with its corresponding SEMA s_v, and z_st_out is the output from the final-layer TWL connection of the student-teacher network with its corresponding SEMA.

The visual attention regions are obtained from the teacher network and the predicted attention regions are obtained from the student network. We propose a novel self-supervised visual attention loss (VAL) function to train the student network. VAL includes a GIoU and an MSE loss, as shown in Figure 5. We use hyperparameters λ_li ∈ R+, i ∈ {1, 2}, to weight the losses. The GIoU loss is

L_GIoU = 1 − IoU(A_hva, A_pred) + |C \ (A_hva ∪ A_pred)| / |C|,

where A_hva is the visual attention region predicted by the pre-trained teacher network, A_pred is the attention region predicted by the student network, and C is the smallest convex hull of A_hva and A_pred. The regression loss between the predicted key-points and the key-points from visual attention is

L_MSE = (1/n) Σ ||K(·) − K̂(·)||²,

where {c_x, c_y} are the center points and {h, w} are the height and width of the attention region, K(·) is the key-point {c_x, c_y, h, w} of A_pred, K̂(·) is the key-point of A_hva, and n is the number of samples in a particular batch. The final loss is calculated as

L_VAL = λ_l1 · L_GIoU + λ_l2 · L_MSE,

where L_VAL is the proposed VAL and {λ_l1, λ_l2} are the hyperparameters used for weighted addition of the two losses.
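A compact sketch of how such a loss could be computed for axis-aligned attention boxes is given below, assuming TensorFlow. The {x_c, y_c, h, w} box format, the GIoU term, the key-point MSE, and their weighted combination follow the description above; the function names, the enclosing-box approximation of C, and the default weights of 1.0 are illustrative assumptions, not the authors' implementation.

```python
# Sketch of the visual attention loss (VAL): a weighted GIoU term plus a
# key-point MSE term between teacher (A_hva) and student (A_pred) attention
# boxes. Boxes are (x_c, y_c, h, w); names and default weights are illustrative.
import tensorflow as tf

def _to_corners(box):
    xc, yc, h, w = tf.unstack(box, axis=-1)
    return xc - w / 2, yc - h / 2, xc + w / 2, yc + h / 2

def giou_loss(box_hva, box_pred, eps=1e-7):
    x1a, y1a, x2a, y2a = _to_corners(box_hva)
    x1b, y1b, x2b, y2b = _to_corners(box_pred)
    inter_w = tf.maximum(0.0, tf.minimum(x2a, x2b) - tf.maximum(x1a, x1b))
    inter_h = tf.maximum(0.0, tf.minimum(y2a, y2b) - tf.maximum(y1a, y1b))
    inter = inter_w * inter_h
    union = (x2a - x1a) * (y2a - y1a) + (x2b - x1b) * (y2b - y1b) - inter
    # C approximated by the smallest axis-aligned box enclosing both regions.
    c_area = (tf.maximum(x2a, x2b) - tf.minimum(x1a, x1b)) * \
             (tf.maximum(y2a, y2b) - tf.minimum(y1a, y1b))
    giou = inter / (union + eps) - (c_area - union) / (c_area + eps)
    return 1.0 - giou                                   # per-sample GIoU loss

def visual_attention_loss(box_hva, box_pred, lambda_l1=1.0, lambda_l2=1.0):
    l_giou = tf.reduce_mean(giou_loss(box_hva, box_pred))
    l_mse = tf.reduce_mean(tf.square(box_hva - box_pred))  # key-point regression
    return lambda_l1 * l_giou + lambda_l2 * l_mse

# Toy usage with a batch of two attention boxes.
hva = tf.constant([[100., 120., 80., 60.], [300., 310., 50., 90.]])
pred = tf.constant([[110., 118., 70., 66.], [280., 305., 60., 80.]])
loss = visual_attention_loss(hva, pred)
```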
Datasets. The proposed architecture is evaluated on eight different datasets: two pneumonia classification, four COVID-19 classification (TCIA-SBU [11, 67] and MIDRC [11, 76, 77] are used only for testing), and two thoracic disease classification cohorts. Further dataset details are provided in the Supplementary section. The datasets, along with the train-validation-test splits, are shown in Table 1.

Environment. All experiments were performed on the Google Cloud Platform in a compute node with 2 vCPUs, 16 GB RAM, and 20 GB disk memory. The baselines and proposed architectures were trained on a cloud TPU of either type v2-8 or v3-8 with version 2.6.0. All implementations are in TensorFlow [1] and Keras [8] v2.6.0.

Implementation. During HVAT, the teacher network is pre-trained on eye-gaze data from [15, 30].

Table 3. Ablation Study. Accuracy(↑), AUC(↑), F1(↑), Precision(↑), and Recall(↑) are shown for different ablations on three datasets.

We report the F1 Score and Area-Under-Curve (AUC) for all experiments. Detailed results are shown in the Supplementary section. We compare our methods with existing architectures such as different variations of ResNet [18], ResNetv2 [19], DenseNet [20], Vision Transformer [12], Compact Convolutional Transformers [17], and two variations of Swin Transformers [44]. Note that we show our comparison results primarily on the most prominent backbones (DenseNet-121 [20], vision transformer [12], etc.) used by the baselines [53, 63, 64] and not on individual implementations. The results are also shown on two variations of the proposed methodology: RadT w/o (HVAT+VAL) is the basic backbone of our proposed RadioTransformer architecture, i.e., the global-focal student-teacher network without HVAT and VAL, and RadT is the final proposed RadioTransformer architecture consisting of the global-focal student-teacher network with HVAT and VAL. As shown in Table 2, our proposed architecture outperforms the other methods on all six datasets. Note that the F1 scores are computed without any standard averaging such as macro, micro, or weighted. This is why the F1 scores on 14-class classification datasets, such as NIH and VinBigData, are comparatively lower than the reported scores on RSNA, Radiography, etc. However, on these datasets where lower F1 scores are reported, the AUC of the proposed framework still outperforms the baselines.
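As a concrete illustration of this evaluation protocol, the snippet below computes per-class (unaveraged) F1 together with one-vs-rest AUC. It is a sketch of the described metric choices, not the authors' evaluation code; scikit-learn is an assumption, and the label and score arrays are toy placeholders.

```python
# Sketch: per-class F1 (no macro/micro/weighted averaging) and one-vs-rest AUC.
# Assumes scikit-learn; the arrays below are toy placeholders.
import numpy as np
from sklearn.metrics import f1_score, roc_auc_score

y_true = np.array([0, 2, 1, 2, 0, 1, 2, 0])                       # class ids
y_prob = np.random.default_rng(0).dirichlet(np.ones(3), size=8)   # class scores
y_pred = y_prob.argmax(axis=1)

f1_per_class = f1_score(y_true, y_pred, average=None)   # one F1 value per class

auc_per_class = [
    roc_auc_score((y_true == c).astype(int), y_prob[:, c])  # one-vs-rest AUC
    for c in range(y_prob.shape[1])
]
print(f1_per_class, auc_per_class)
```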
Ablation experiments. Here, we discuss the categorical inference on all the individual components of our proposed network. In Table 3, the ablation experiment results for the different components are summarized for three different datasets. The global network outperforms the focal network for the binary classification task on the RSNA dataset. This signifies that for simple binary classification, where global feature representations generally lead to a clear distinction between labels, the global network performs better. This is, in fact, true of radiologists' decision making as well; the results provide a justification for the designed global-focal approach. For the Radiography and VinBigData datasets, which are multi-class classification tasks, the focal network performs better than the global network owing to the diagnostic relevance of the more granular details in the images. It is also evident from the results that when HVAT is used along with the global-focal networks, the scores improve. Interestingly, when VAL is added, the scores are not significantly higher than in the previous ablations. There are primarily two reasons: a) VAL lacks fast convergence on the individual global and focal networks, and b) an attention loss between two visual attention regions may not converge well when regressing key-points and minimizing the GIoU.

Figure 6 illustrates the qualitative differences between RadT w/o (HVAT+VAL) and RadT. Columns 6.a.*.* and 6.b.*.* are normal and pneumonia samples from the RSNA dataset. Similarly, 6.c.*.* are normal and 6.d.*.* are COVID-19 samples from the Radiography dataset. The images in 6.*.*.1 and 6.*.*.2 are the class activation maps from RadT w/o (HVAT+VAL) and RadT, respectively. We can observe clear differences in attention region patterns between these two rows. The attention regions in the first row are relatively discretized, and the inconsistency in overlap with the white regions (infiltrates/fluids) is quite prominent. However, in the second row, relatively continuous attention regions are observed, with consistent overlap with the disease patterns. Similarly, in 6.c.*.1, the attention regions observed are more discrete in nature, unlike in 6.c.*.2. For normal chest radiographs, this potentially signifies that RadT focuses intrinsically on regions that may be significant for a radiologist to diagnose and reject the presence of infiltrates/fluids. On the contrary, RadT w/o (HVAT+VAL) attempts to identify regions that do not overlap with visual attention to reject the presence of infiltrates/fluids. We also observe that the attention regions from RadT w/o (HVAT+VAL) cover a larger area than those from RadT, implying that the lack of visual attention knowledge leads to low confidence in decision making, and hence the model needs to search a comparatively larger space to conclusively accept or reject a claim. In 6.b.2.*, it is observed that for a lung densely filled with fluid, RadT w/o (HVAT+VAL) focuses on a comparatively sparse and large region, whereas RadT focuses on regions with dense fluid accumulation. These qualitative findings suggest that RadioTransformer inherently analyzes the regions with a visuo-cognitive approach similar to that of a radiologist.

Clinical bias. The inherent clinical bias in HVAT, while training with the eye-gaze dataset, is not addressed in the current implementation. Radiologists have access to additional information such as patient age, gender, ethnicity, and other factors that may influence their viewing patterns. We have not controlled for these factors in our experiments.

Diversity in visual search patterns. Due to the lack of datasets comprising eye-gaze patterns from multiple readers, we have not accounted for search pattern diversity in the scope of the current study. Another concern is the presence of 'blind spots' in the search patterns; since the VAL is tightly coupled with the available eye-gaze data, blind spots from the reader may be learned by the network, potentially leading to false interpretations. This can be mitigated by utilizing a large training set comprising data from multiple readers.

Generalizability. RadioTransformer achieves significant improvement in diagnostic tasks on existing pulmonary and thoracic radiology benchmarks. While it is designed to be domain-agnostic, its performance on other domains and imaging modalities (CT, MRI) remains to be validated.

Negative impacts.
RadioTransformer can potentially induce decision bias due to the radiologist-in-the-loop design, leading to false or ambiguous interpretations. A discussion of negative impacts is provided in the Supplementary section.

This paper presents RadioTransformer, a novel visual attention-driven transformer framework motivated by radiologists' visuo-cognitive approaches. Unlike existing techniques that rely only on visual information for diagnostic tasks, RadioTransformer leverages eye-gaze patterns from experts to train a global-focal student-teacher network. Our framework learns and implements hierarchical search patterns to improve the diagnostic performance of self-attention-based learning. When evaluated on eight datasets comprising over 260,000 images, the proposed architecture outperforms SOTA approaches. Our qualitative analysis shows that by integrating visual attention into the network, RadioTransformer focuses on diagnostically relevant regions of interest, leading to higher confidence in decision making. To the best of our knowledge, no method has previously been proposed that integrates gaze data from expert radiologists to improve the diagnostic performance of self-attention-based deep learning architectures. This work paves the way for radiologist-in-the-loop computer-aided diagnosis tools.

Figure 8 illustrates the various augmentations for different blocks of RadioTransformer. The images in the first and second rows are the inputs to the student focal and global blocks, respectively. The images in the third and fourth rows are the inputs to the teacher focal and global blocks, respectively. As seen in the images, the teacher network implements hard augmentations compared to the student network. The focal block has a higher contrast value than the global block. For stateless augmentations, we use tf.image.stateless_random_contrast(.), tf.image.stateless_random_brightness(.), tf.image.stateless_random_hue(.), and tf.image.stateless_random_saturation(.). More details on the augmentation parameters are provided in Supplementary Table 4.
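Below is a small sketch of how such stateless augmentations could be applied with different intensity ranges for the teacher (hard) and student (soft) inputs. The listed tf.image ops are those named above, but the specific delta/interval values, the seed handling, and the helper name are illustrative assumptions; the actual parameters are those listed in Supplementary Table 4.

```python
# Sketch: hard (teacher) vs. soft (student) stateless augmentations in TensorFlow.
# The interval values below are illustrative; the paper's settings are given in
# Supplementary Table 4.
import tensorflow as tf

def augment(image, seed, hard=False):
    """Apply stateless brightness/contrast/hue/saturation jitter to one image."""
    b, c, h, s = (0.4, (0.6, 1.4), 0.2, (0.6, 1.4)) if hard else \
                 (0.1, (0.9, 1.1), 0.05, (0.9, 1.1))
    image = tf.image.stateless_random_brightness(image, max_delta=b, seed=seed)
    image = tf.image.stateless_random_contrast(image, c[0], c[1], seed=seed)
    image = tf.image.stateless_random_hue(image, max_delta=h, seed=seed)
    image = tf.image.stateless_random_saturation(image, s[0], s[1], seed=seed)
    return tf.clip_by_value(image, 0.0, 1.0)

x = tf.random.uniform((224, 224, 3))        # a dummy RGB-converted radiograph
seed = tf.constant([1, 2], dtype=tf.int32)  # stateless ops take a 2-element seed
x_teacher = augment(x, seed, hard=True)     # hard augmentation for the teacher
x_student = augment(x, seed, hard=False)    # soft augmentation for the student
```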
In addition to the AUC and F1 scores provided in the main paper, here we show the accuracy, precision, and recall values for the classification tasks on the 8 datasets. In Supplementary Table 5, the performance metrics are shown for the pneumonia classification datasets (Cell and the RSNA Pneumonia Challenge dataset) and the COVID-19 classification datasets (the SIIM-FISABIO-RSNA COVID-19 challenge and the Radiography dataset). In Supplementary Table 6, we show the performance metrics for the 14 thoracic diseases classification task (in the NIH and VinBigData datasets) and the COVID-19 classification task (in the MIDRC and TCIA-SBU datasets).

We supplement our qualitative results (Section 5.2 of the main paper) with additional class activation maps for both datasets, i.e., RSNA and Radiography. In Figure 9, the RadT w/o (HVAT+VAL) and RadT class activation maps are shown for normal and pneumonia cases. Similarly, in Figure 10, the RadT w/o (HVAT+VAL) and RadT class activation maps are shown for normal and COVID-19 cases. For both datasets, the maps of RadT w/o (HVAT+VAL) show discrete patterns and those of RadT show comparatively continuous patterns. In addition to the previous discussions, we note another interesting finding: in the fourth row of Figure 9, apart from clear attention on the white/fluid regions, there are some extraneous attention regions on the shoulders. This phenomenon is not observed in the fourth row of Figure 10. This is explainable from the ablation study in the main paper. For the RSNA dataset, the global block shows better performance, and hence the global block is activated in this case. The global block focuses on high-level features and, in this case, it attempts to identify features from non-relevant regions (such as the shoulders) in addition to the white/fluid regions in the lungs. In the Radiography dataset, on the other hand, the focal block is activated and the attention regions intersect closely with the white/fluid regions.

Parvo, Magno, and Konio cells are ganglion cells that transfer information generated by the photoreceptors in the retina to the visual cortex in the brain. Structurally, Magno cells are larger and have thick axons with more myelin, while Parvo cells are smaller, with less myelin and thinner axons. Functionally, Magno cells have a large receptive field; they respond rapidly to changing stimuli and detect robust/global details such as luminance, motion, stereopsis, and depth. Parvo cells, on the other hand, have a smaller receptive field, respond slowly to stimuli, and detect finer/local details such as chromatic modulation and the form of an object. The global-focal blocks in RadioTransformer are inspired by these cellular pathways.

Radiologist-in-the-loop Bias. RadioTransformer can potentially be used for training residents/fellows by highlighting specific regions based on predicted attention. However, biases induced by the ground-truth visual attention used for training may propagate to these highlighted regions.

References

Large-scale machine learning on heterogeneous distributed systems
Eye movements of radiologists reflect expertise in ct study interpretation: A potential tool to measure resident development
Mh-covidnet: Diagnosis of covid-19 using deep neural networks and meta-heuristic-based feature selection on x-ray images
Swin-unet: Unet-like pure transformer for medical image segmentation
Deep learning with multimodal representation for pancancer prognosis prediction
Label co-occurrence learning with graph convolutional networks for multi-label chest x-ray image classification
Multi-label chest x-ray image classification via semantic similarity graph embedding. IEEE Transactions on Circuits and Systems for Video Technology
Keras: Deep learning library for theano and tensorflow
Nasser Al Emadi, Mamun Bin Ibne Reaz, and Mohammad Tariqul Islam. Can ai help in screening viral and covid-19 pneumonia?
Spatiotemporal residual networks for video action recognition
The cancer imaging archive (tcia): maintaining and operating a public information repository
Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale
Slowfast networks for video recognition
Eye-tracking in the study of visual expertise: methodology and approaches in medicine
Physiobank, physiotoolkit, and physionet: components of a new research resource for complex physiologic signals. Circulation
The effects of fatigue from overnight shifts on radiology search patterns and diagnostic performance
Escaping the big data paradigm with compact transformers
Deep residual learning for image recognition
Identity mappings in deep residual networks
Densely connected convolutional networks
Mutual context network for jointly estimating egocentric gaze and action
Corodet: A deep learning based classification for covid-19 detection using chest x-ray images
Quantifying the contribution of low-level saliency to human eye movements in dynamic scenes. Visual Cognition
Bayesian surprise attracts human attention
A saliency-based search mechanism for overt and covert shifts of visual attention. Vision Research
Computational modelling of visual attention
Mimic-iv (version 0.4)
Mimic-cxr database
A review and analysis of eye-gaze estimation systems, algorithms and performance evaluation methods in consumer platforms
Eye gaze data for chest x-rays
Creation and validation of a chest x-ray dataset with eye-tracking and report dictation for ai development
The development of expertise in radiology: in chest radiograph interpretation, "expert" search pattern may predate "expert" levels of diagnostic accuracy for pneumothorax identification
Identifying medical diagnoses and treatable diseases by image-based deep learning
Gaze and eye contact: a research review
Attention-based multi-scale gated recurrent encoder with novel correlation loss for covid-19 progression prediction
Computer-displayed eye position as a visual aid to pulmonary nodule interpretation
The 2021 siim-fisabio-rsna machine learning covid-19 challenge: Annotation and standard exam classification of covid-19 chest radiographs
Identification of gaze pattern and blind spots by upper gastrointestinal endoscopy using an eye-tracking technique
State of the art: Eye-tracking studies in medical imaging
In the eye of the beholder: Gaze and actions in first person video
Feature pyramid networks for object detection
Viewing another person's eye movements improves identification of pulmonary nodules in chest x-ray inspection
Goal-oriented gaze estimation for zero-shot learning
Swin transformer: Hierarchical vision transformer using shifted windows
The gaze selects informative details within pictures. Perception & Psychophysics
Covxnet: A multi-dilation and medicine
Can a machine learn from radiologists' visual search behaviour and their interpretation of mammograms - a deep-learning study
Dynamic eye movement datasets and learnt saliency models for visual action recognition
The effect of a digital training tool to aid chest image interpretation: Hybridising eye tracking technology and a decision support tool
Integrating human gaze into attention for egocentric activity recognition
Active segmentation with fixation
A review of various state of art eye gaze estimation techniques
Explainable vision transformer based covid-19 screening using radiography
Visual assessment of digital ulcers in systemic sclerosis analysed by eye tracking: implications for wound assessment
Contrast coding and magno/parvo segregation revealed in reaction time studies. Vision Research
An open dataset of chest x-rays with radiologist's annotations
Training object class detectors from eye tracking data
Vision transformer for covid-19 cxr diagnosis using chest x-ray feature corpus
Modeling the role of salience in the allocation of overt visual attention
Efficient ultrasound image analysis models with sonographer gaze assisted distillation
Radiographic-deformation and textural heterogeneity (r-depth): an integrated descriptor for brain tumor prognosis
Exploring the effect of image enhancement techniques on covid-19 detection using chest x-ray images
Deep learning for chest radiograph diagnosis: A retrospective comparison of the chexnext algorithm to practicing radiologists
Radiologist-level pneumonia detection on chest x-rays with deep learning
An eye fixation database for saliency detection in images
Generalized intersection over union
Stony brook university covid-19 positive cases
Action is in the eye of the beholder: Eye-gaze driven model for spatio-temporal action localization
Augmenting the national institutes of health chest radiograph dataset with expert annotations of possible pneumonia
Eye tracking for deep learning segmentation using convolutional neural networks
The long and the short of it: Spatial statistics at fixation vary with saccade amplitude and task
Fatigue in radiology: a fertile area for future research
Dualanet: Dual lesion attention network for thoracic disease classification in chest x-rays
Stimulus-driven capture and attentional set: selective search for color and visual abrupt onsets. Journal of Experimental Psychology: Human Perception and Performance
Data from medical imaging data resource center (midrc) - rsna international covid radiology database (ricord) release 1c - chest x-ray, covid+ (midrc-ricord-1c). The Cancer Imaging Archive
The rsna international covid-19 open radiology database (ricord)
How visual search relates to visual diagnostic performance: a narrative systematic review of eye-tracking research in radiology
Object referring in videos with language and human gaze
Prevalence of eye strain among radiologists: influence of viewing variables on symptoms
Space-variant descriptor sampling for action recognition based on saliency and eye movements
Analysis of perceptual expertise in radiology - current knowledge and a new perspective. Frontiers in Human Neuroscience
Tired in the reading room: the influence of fatigue in radiology
Triple attention learning for classification of 14 thoracic diseases using chest radiography
Covidnet: A tailored deep convolutional neural network design for detection of covid-19 cases from chest x-ray images
Chestx-ray8: Hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases
Covid-net s: Towards computer-aided severity assessment via training and validation of deep neural networks for geographic extent and opacity extent scoring of chest x-rays for sars-cov-2 lung disease severity
Functional assessment of magno, parvo and konio-cellular pathways; current state and future clinical applications
The influence of experience on gazing patterns during endovascular treatment: Eye-tracking study
Exploring the role of gaze behavior and object detection in scene understanding
Studying relationships between human gaze, description, and computer vision
Quantification of avoidable radiation exposure in interventional fluoroscopy with eye tracking technology
Quantitative Comparison 1. F1(↑) and AUC(↑) are reported for the baselines and the proposed methodology.
Quantitative Comparison 2. F1(↑) and AUC(↑) are reported for the baselines and the proposed methodology.

In this supplementary material, we provide a detailed illustration of the global-focal block (Section 8), additional information on the assets used in this work (Section 9), the different augmentations in the student-teacher network (Section 10), and more quantitative (Section 11) and qualitative (Section 12) results. We also present an analogy of the global-focal block with cellular pathways (Section 13) and the negative impacts (Section 14) of our proposed work.

The global-focal block in the RadioTransformer architecture is shown in detail in Figure 7. The global and focal blocks are cascaded in parallel. The shifting window for each block is shown in red.

The RSNA Pneumonia Detection challenge [69] and Cell Pneumonia [33] are pneumonia classification datasets con-