key: cord-0454765-5d7iizgp
authors: Akbarian, Sina; Seyyed-Kalantari, Laleh; Khalvati, Farzad; Dolatabadi, Elham
title: Evaluating Knowledge Transfer in Neural Network for Medical Images
date: 2020-08-31
journal: nan
DOI: nan
sha: 02b6972cea439f9216ce952c83b9980300cb698c
doc_id: 454765
cord_uid: 5d7iizgp

Deep learning and knowledge transfer techniques have permeated the field of medical imaging and are considered as key approaches for revolutionizing diagnostic imaging practices. However, there are still challenges for the successful integration of deep learning into medical imaging tasks due to a lack of large annotated imaging data. To address this issue, we propose a teacher-student learning framework to transfer knowledge from a carefully pre-trained convolutional neural network (CNN) teacher to a student CNN. In this study, we explore the performance of knowledge transfer in the medical imaging setting. We investigate the proposed network's performance when the student network is trained on a small dataset (target dataset) as well as when teacher's and student's domains are distinct. The performances of the CNN models are evaluated on three medical imaging datasets including Diabetic Retinopathy, CheXpert, and ChestX-ray8. Our results indicate that the teacher-student learning framework outperforms transfer learning for small imaging datasets. Particularly, the teacher-student learning framework improves the area under the ROC Curve (AUC) of the CNN model on a small sample of CheXpert (n=5k) by 4% and on ChestX-ray8 (n=5.6k) by 9%. In addition to small training data size, we also demonstrate a clear advantage of the teacher-student learning framework in the medical imaging setting compared to transfer learning. We observe that the teacher-student network holds a great promise not only to improve the performance of diagnosis but also to reduce overfitting when the dataset is small.

Medical imaging is widely used for diagnosing several life-threatening diseases.
However, a shortage of expert human resources to read and interpret medical imaging exams puts patients' lives at risk [1, 2]. Therefore, finding a reliable alternative for expediting the reading and interpretation of medical images is critical in order to improve diagnosis and, consequently, treatment of diseases [3]. Recently, Artificial Intelligence (AI)-based systems, especially state-of-the-art Deep Neural Network (DNN) models, have proved to be effective in improving clinical decision making for medical imaging diagnostics [4, 5, 6]. However, training a DNN from random initialization to achieve high accuracy is compute-intensive, memory-demanding, and generally requires a large amount of annotated data that is not always easy to collect in the medical domain. Knowledge transfer has gained much attention in the research community as a way to address these shortcomings in training DNN models [7, 8, 9, 10, 11, 12]. Knowledge transfer from a source domain to a target domain is a technique to facilitate the training process of the DNN on smaller datasets. Recently, several knowledge transfer approaches have been proposed to maintain the performance of DNN models while using small training datasets [13, 14, 8, 7, 15]. One popular approach in knowledge transfer is transfer learning, in which a model already pre-trained on a large source dataset (such as ImageNet [16]) is fine-tuned on a target dataset (e.g., medical images) with minimal modifications, where some of the parameters remain frozen during training [13]. A pre-trained network trained on large datasets with thousands of classes, various illumination conditions, different backgrounds, and orientations is a powerful tool to extract features [17], even in a very small and noisy target data regime.
Using transfer learning, the network retains its ability to extract low-level features learned from the source domain, and learns how to combine them to detect complex patterns on the target domain [14]. Transfer learning has been the basis for DNN-based medical imaging diagnosis such as skin cancer [18, 19], chest X-rays [20, 21, 22, 23, 24, 25], Diabetic Retinopathy [26, 27, 28, 29, 30], Alzheimer's Disease [31, 32], and sleep monitoring [33]. However, an empirical study conducted by Raghu et al. [20] showed that, when transfer learning from ImageNet to medical images is used, the parameters of the convolutional neural network (CNN) models do not change drastically during fine-tuning. This study also showed that smaller architectures trained on medical image datasets from scratch can perform similarly to transfer learning from large models. Moreover, Jang et al. [34] reported that transfer learning may not help if the two tasks and/or datasets are semantically distinct. Another popular approach in knowledge transfer is a teacher-student learning framework, which has been actively studied in recent years in order to improve the transfer of knowledge for both in-domain and cross-domain tasks [7, 15, 11, 12]. In this framework, the network providing knowledge is called the teacher and the network learning the knowledge is called the student. During training, a student network learns to imitate the output of a larger and more powerful teacher network or ensemble of networks. Teacher-student learning frameworks have been widely used for performance improvement (especially in small dataset regimes) and/or model compression [7]. Inspired by the growing interest in applying machine learning to medicine and in quickly reusing and adapting previously acquired knowledge to new medical tasks and domains, we propose adopting a teacher-student learning framework in the medical imaging setting.
To the best of our knowledge, there is no study exploring a teacher-student learning framework to improve the performance of medical imaging diagnostic models. In this study, we conducted an empirical investigation to assess the advantages of knowledge transfer in medical imaging through a series of experiments. We focused on four main questions that we found to be fundamental in deriving our experimental analysis in the context of medical imaging:
• How does knowledge transfer perform on small datasets?
• How does knowledge transfer perform when the domains and tasks are distinct?
• How much training data is needed to achieve high performance in knowledge transfer?
• Does knowledge transfer help with overfitting in a small data regime?
In terms of the teacher-student learning framework, we leveraged the work proposed in [12], where the knowledge transfer is framed as an attention transfer mechanism. More specifically, a teacher network improves the performance of another student network by providing information about where it looks, i.e., about where it concentrates its attention. Our experiments were conducted on two medical imaging diagnostic tasks: (1) Chest X-ray pathology classification and (2) Diabetic Retinopathy (DR) classification. The former, chest X-ray imaging, is widely used in diagnosing several diseases such as thorax disease [35], Tuberculosis [4], Pneumonia [5], and COVID-19 [36]. Staff shortages in radiology departments in several countries [37, 38, 39] may put patients' lives at risk. This problem is even more severe in some countries such as Rwanda, where there is one radiologist per 1000 patients [40], or Liberia, where there is one radiologist per 2 million patients [41]. The latter, Diabetic Retinopathy, is one of the major causes of blindness in the western world [26]. The early diagnosis of DR is crucial for its treatment.
Early identification and scaling of DR involve localizing and weighting numerous features on the Retina images, which is highly time consuming. Both applications could benefit from recent advances in DNN and computer vision.

This paper is organized as follows: Section II summarizes related works. Section III describes the datasets used in this study. Section IV presents our proposed approach in building knowledge transfer including transfer learning and teacher-student framework. Section V presents our experiments and results. Section VI discusses the takeaways, addresses limitations of the current work, and proposes potential future work.

Chest X-ray pathology classification. Enriched with access to large public hospital-scale datasets [23, 42, 21, 43], CNNs have been utilized for abnormality classification on medical chest X-ray images [21, 5, 22, 23, 24, 25]. The CNN classifiers are trained on chest X-ray images and produce, per image, the probability of each of several diagnostic labels. Transfer learning has been widely adopted for chest X-ray diagnostic tools [21, 5, 22, 23, 24, 25] and DenseNet [44] is commonly used in training classifiers [45, 5, 22, 25, 24, 21]. In addition to DenseNet, Irvin et al. [21] applied several other CNN models including ResNet-152, Inception-v4, and SE-ResNeXt-101 on X-ray images; however, the DenseNet-121 architecture was found to produce the best results in practice.

Diabetic Retinopathy classification. There has been a great amount of research on early detection of DR using neural networks [47, 48] and CNNs [26]. However, the lack of sufficient annotated Retina data remains one of the challenges of applying deep learning to classification and early detection of DR. Transfer learning, therefore, has been extensively used to improve the performance of the models [27, 28, 29, 30, 49].
Although the CNN model achieved high accuracy for the binary classification of the disease using transfer learning, the performance degraded with increasing in the number of classes. This happens due to the imbalanced nature of the annotated data for some specific classes [29] . In a study conducted by Gulshan et al. [49] , it was shown that CNN models achieved high sensitivity and specificity for detection of diabetic retinopathy from Retinal fundus photographs. Raghu et al. [20] also conducted experimental evaluations of deep and light CNN models with different initialization strategies for detection of diabetic retinopathy. In order to tackle shortcomings with the basic transfer learning, several advanced approaches were proposed including Knowledge Distillation (KD) in the neural network which is a knowledge transfer between a teacher and a student network [7] . The original idea behind the KD came from Bucilua et al. [8] where they proposed the idea of compressing the knowledge of a number of large ensemble base-level classifiers into a single smaller and faster model. This would reduce the computation and memory complexity of the models. This idea was later generalized by Hinton et al. [7] in which a knowledge is transferred from a large DNN (teacher) to a small network (student) by minimizing the difference between the logits (the inputs to the final softmax) produced by the teacher model and those produced by the student model. Yim et al. [15] proposed an approach that minimized the distance between the intermediate layers of the teacher and student networks. This method helps with faster optimization and better performance of the student network than a DNN trained from scratch. Moreover, using their approach, the student DNN can learn the distilled knowledge from a teacher DNN that is trained for a different task. Romero et al. 
[11] also proposed another teacher-student framework, called FitNet, where they introduced intermediate-level hints from the teacher's hidden layers in addition to output layers to guide the training process of the student network. Using FitNet, the student network can learn an intermediate representation that is predictive of the intermediate representations of the teacher network. FitNet is able to train very deep student models with fewer parameters, which can generalize better and/or run faster than their teachers. Attention transfer, proposed by Zagoruyko et al. [12], is a teacher-student training scheme similar to FitNet for knowledge transfer, using the teacher's feature maps to guide the learning of the student. Using this approach, given the spatial attention maps of a teacher network, the student network is trained to learn the exact behavior of the teacher network by trying to replicate its output at a layer receiving attention from the teacher. The number of attention transfers and the position of the layers depend on whether low-, mid-, or high-level representation information is required. Motivated by advances in knowledge transfer approaches and their potential impact on medical image analysis, this study explores the performance of different training strategies in the context of transfer learning and the teacher-student learning framework. This study's teacher-student learning framework leverages the attention transfer mechanism for medical imaging diagnostics.

In this study, we conducted our knowledge transfer experiments on four different publicly available medical imaging datasets listed in Table I. CheXpert [21], ChestX-ray8 [23], and MIMIC-CXR [42] are chest X-ray images annotated for a number of diseases, and Diabetic Retinopathy (Retina) is Retina images annotated for the diabetic scale of retinopathy. Fig. 1 shows some sample images included in these datasets.

CheXpert.
CheXpert [21] is a chest radiograph dataset comprising 223,648 frontal and lateral images of 64,740 patients. Each image in the dataset has 14 multilabel annotations associated with diagnostic labels for 13 diseases: Enlarged Cardiomediastinum, Cardiomegaly, Lung Lesion, Lung Opacity, Edema, Consolidation, Pneumonia, Atelectasis, Pneumothorax, Pleural Effusion, Pleural Other, Fracture, Support Devices, and No Finding.

ChestX-ray8. The original ChestX-ray8 [23] includes 112,120 frontal X-ray images from 30,805 unique patients. However, in this study, we used a small sample (5%) of the dataset, amounting to 5,606 images. The ChestX-ray8 dataset includes 15 multiclass annotations for 14 diseases: Hernia, […]

For all chest X-ray datasets (CheXpert, MIMIC-CXR, and ChestX-ray8), the labels were automatically extracted from the radiologist reports using natural language processing techniques. For CheXpert and MIMIC-CXR in particular, each disease label takes one of the conditions {positive, negative, not mentioned, uncertain}. In this study, all "non-positive" labels were mapped to zero, similar to the "U-zero" study in [21]. In all three chest X-ray datasets the "No Finding" label is not independent of the other disease labels and indicates the absence of other diseases.

In the following section, we describe different knowledge transfer strategies conducted in this study to predict the diagnostic labels from medical imaging datasets. We focus on two knowledge transfer strategies: Transfer Learning and the Teacher-Student Learning framework. We used DenseNet [44] as the backbone for our CNN classifiers. DenseNet is one of the latest neural networks for visual object recognition and has been used extensively in medical image classification [5, 22, 21, 24, 50]. DenseNet is composed of DenseBlocks and Transition Layers, and the input to each layer of a DenseBlock comes from all preceding layers.
Transition Layers, which include batch normalization, a convolution layer, and pooling layers, are placed between the DenseBlocks to reduce the size and complexity of the model. For each task, depending on the dataset, we added an additional […] Two versions of DenseNet were used in this study: DenseNet-121 [44] and DenseNet-40. The latter is a lighter network obtained by removing the last two blocks of DenseNet-121; we call it DenseNet-40 in the rest of the paper (see Appendix for more details of DenseNet-40). For the transfer learning approach, all DenseNet networks were initialized with ImageNet weights. For the teacher-student learning framework, the knowledge was transferred from a teacher either pre-trained on ImageNet (Teacher ImageNet) [16] or carefully pre-trained on MIMIC-CXR (Teacher MIMIC-CXR) [42]. For Teacher MIMIC-CXR, we leveraged PyTorch checkpoints provided by the work of Seyyed-Kalantari et al. [24]. Teacher MIMIC-CXR is a DenseNet-121 initialized with ImageNet weights and trained on 80% of the MIMIC-CXR dataset. More details of the optimization and hyperparameter tuning of the network are reported in [24]. Following the work of Zagoruyko et al. [12], we built an activation-based attention transfer to transfer knowledge from a convolutional layer of the teacher network to a convolutional layer of the student network. In our setting, the knowledge was transferred between the layer before the last layer of the last dense block of the teacher and the corresponding layer of the student, as shown in Fig. 2. For a given convolutional layer, the corresponding 3D activation tensor, A ∈ R^{C×H×W}, consists of C feature planes with spatial dimensions H × W.
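To make the shapes involved concrete, the sketch below tracks channel counts and spatial sizes through a DenseNet-121. The growth rate of 32, the block sizes (6, 12, 24, 16), the 64-channel stem, and the halving transition layers are the standard DenseNet-121 configuration rather than values stated in this text; the 256 × 256 input size and the function name `densenet121_feature_shapes` are assumptions made for illustration.

```python
# Standard DenseNet-121 configuration (assumed; not taken from this text).
GROWTH_RATE = 32                 # feature planes added by every dense layer
BLOCK_LAYERS = (6, 12, 24, 16)   # number of dense layers in each DenseBlock
STEM_CHANNELS = 64               # channels produced by the initial convolution


def densenet121_feature_shapes(input_size=256):
    """Return (block name, output channels, spatial size) for each DenseBlock."""
    # The stem (stride-2 convolution followed by stride-2 pooling) divides
    # the spatial size by 4.
    size = input_size // 4
    channels = STEM_CHANNELS
    shapes = []
    for index, num_layers in enumerate(BLOCK_LAYERS):
        # Each layer receives the concatenation of the block input and the
        # outputs of all preceding layers, so every layer adds GROWTH_RATE
        # channels to the running total.
        channels += num_layers * GROWTH_RATE
        shapes.append((f"block{index + 1}", channels, size))
        if index < len(BLOCK_LAYERS) - 1:
            # Transition layer: a 1x1 convolution halves the channels and
            # 2x2 average pooling halves the spatial size.
            channels //= 2
            size //= 2
    return shapes


if __name__ == "__main__":
    for name, channels, size in densenet121_feature_shapes(256):
        print(f"{name}: {channels} channels at {size}x{size}")
```

Under these assumptions the last DenseBlock operates on 8 × 8 maps and every dense layer in it emits 32 new feature planes, which is consistent with the 8 × 8 attention map and 32 feature planes mentioned for the example in Fig. 2. The modified DenseNet-40 used in this study is not covered by the sketch.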
We assume that the transfer loss is placed between student and teacher attention maps with the same spatial resolution (same H and W), as defined below:

$$\mathcal{L}_{AT} = \frac{1}{C} \sum_{j=1}^{C} \left\| \frac{Q_S^j}{\|Q_S^j\|_2} - \frac{Q_T^j}{\|Q_T^j\|_2} \right\|_2 \qquad (1)$$

where Q_T^j and Q_S^j are respectively the j-th feature plane (out of C feature planes) of the teacher's and student's 3D activation tensor, A, in vectorized form. In order to calculate the attention transfer loss, Q_T^j and Q_S^j were replaced with their l2-normalized form, as can be seen in Eq. 1 and illustrated in Fig. 2. The attention transfer loss was calculated as the l2 norm between the student's and teacher's normalized feature planes, averaged over all C feature planes. The total loss, Eq. (2), combines the attention transfer loss with CE_S, the standard cross-entropy loss for the student network, where β is the weight balancing the attention loss and the cross-entropy loss. In this study, CE_S is a multi-label binary cross-entropy for the X-ray datasets and a multi-class cross-entropy for the Retina dataset.

In the following section, we describe the setup and results of a series of experiments we ran concerning transfer learning and attention transfer mechanisms on various medical imaging datasets.

Fig. 2: […] teacher to a CNN student. During training, the student network learns similar spatial attention maps to those of an already pre-trained teacher in order to make a good prediction. In our setting, transfer of knowledge occurs between the layer before the last layer of the last dense block of the teacher and the corresponding layer of the student. In the shown example, the spatial attention map (H × W) is 8 × 8 and there are 32 feature planes (C).

Parameters. Adam [51] was used to optimize the loss function in all of the tasks. The learning rate was decreased by a factor of 2 every 16 epochs from an initial value of 5 × 10^−5, as suggested in [24]. For all the experiments, the CNN models were trained for a maximum of 128 epochs with a batch size of 32, so that each batch could fit in the Nvidia Titan XP 12 GB GPU used for training the CNN models.
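The attention transfer loss and the learning-rate schedule described above can be sketched in a few lines. This is one reading of the verbal description, not the authors' code: it uses plain Python lists so that it runs without a tensor library, takes the unsquared l2 distance between normalized feature planes, and assumes the rate is halved at epochs 16, 32, and so on. The function names and the `eps` guard are hypothetical. The β-weighted total loss is left out because its exact form is not spelled out in the prose.

```python
import math


def l2_normalize(vector, eps=1e-12):
    """Scale a flat list of floats to unit l2 norm (eps guards all-zero planes)."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / (norm + eps) for v in vector]


def attention_transfer_loss(student, teacher):
    """Average, over feature planes, of the l2 distance between normalized planes.

    `student` and `teacher` are activation tensors given as nested lists of
    shape C x H x W; both must share the same C, H, and W.
    """
    if len(student) != len(teacher):
        raise ValueError("student and teacher need the same number of planes")
    total = 0.0
    for plane_s, plane_t in zip(student, teacher):
        # Vectorize each H x W feature plane, then l2-normalize it.
        q_s = l2_normalize([v for row in plane_s for v in row])
        q_t = l2_normalize([v for row in plane_t for v in row])
        if len(q_s) != len(q_t):
            raise ValueError("feature planes need the same spatial size")
        # l2 distance between the two normalized, vectorized planes.
        total += math.sqrt(sum((a - b) ** 2 for a, b in zip(q_s, q_t)))
    return total / len(student)


def learning_rate(epoch, initial=5e-5, step=16, factor=2.0):
    """Step schedule: divide the rate by `factor` after every `step` epochs."""
    return initial / factor ** (epoch // step)


if __name__ == "__main__":
    teacher = [[[1.0, 2.0], [3.0, 4.0]], [[0.0, 1.0], [1.0, 0.0]]]
    # A student whose activations are a rescaled copy of the teacher's
    # gives a loss close to zero, because each plane is normalized first.
    student = [[[3.0 * v for v in row] for row in plane] for plane in teacher]
    print(attention_transfer_loss(student, teacher))
    print([learning_rate(e) for e in (0, 16, 32, 127)])
```

Because each plane is normalized before comparison, the loss depends on where activations are concentrated rather than on their overall magnitude, which is the sense in which the student imitates where the teacher "looks".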
All evaluations were made based on three repetitions of each model. The best model and sets of hyperparameters were chosen based on the best AUC performance on the validation sets across all epochs. In order to find an optimal value for the β coefficient shown in Table III, we performed a grid search over values from 1 to 2000 on the validation set. The β coefficient is inversely related to the contribution of the attention loss; that is, when β decreases, the impact of the attention loss on the total loss increases.

Architecture. For both transfer learning and attention transfer settings, all 40 layers of DenseNet-40 (1.4m trainable parameters) were unfrozen so their weights could get updated in each epoch of training. For DenseNet-121 in the transfer learning setting, we conducted two tests, where all 121 layers (7.0m trainable parameters) and the last 34 layers (2.4m trainable parameters) of the network were unfrozen, respectively. For the attention transfer setting, both ImageNet and MIMIC-CXR were used for training the teacher network to explore the effect of cross-domain and in-domain training. In both cases, the teacher-student networks were initialized using ImageNet weights. Thus, the same initialization (i.e., ImageNet) was used as for transfer learning. In the transfer learning setting, the loss function was the multi-label binary cross-entropy for the X-ray datasets and multi-class cross-entropy for the Retina dataset. In the attention transfer framework, the cross-entropy loss was combined with the attention transfer loss as shown in Eq. (2). For the CheXpert and ChestX-ray8 datasets, the best models were selected based on the average multi-label AUC on the validation set. For the Retina dataset, the best model was selected based on the weighted average F1-score on the validation set.

Data Augmentation. All of the images were resized to 256×256 and center cropped.
Additionally, random rotation between −15° and +15° and random horizontal flip were applied to the training dataset. Following [21, 22, 5], images were normalized using the mean and standard deviation of ImageNet. All datasets were split into train, validation, and test sets as listed in Table II, with no patient shared across the splits.

VI. RESULTS

1. Knowledge transfer for small datasets. Table IV shows the performance of knowledge transfer on small datasets of X-ray images. For this experiment, we trained our CNN models on small, randomly sampled subsets, CheXpert 5k and ChestX-ray8 5.6k. At a high level, we observe that the teacher network pre-trained on MIMIC-CXR substantially improves the performance on both the CheXpert 5k (AUC = 76.6±0.03) and ChestX-ray8 5.6k (AUC = 80.45±0.38) datasets. In this setting, student networks learn the knowledge required for X-ray diagnostic tasks from a teacher pre-trained on chest X-ray imaging datasets. Hence, in a small dataset regime, attention transfer improves the performance when the teacher and student networks are trained to learn the same task within a similar domain. However, for both datasets, a larger student network (DenseNet-121) with 6,968,206 trainable parameters outperforms a lighter student network (DenseNet-40) with 1,364,142 trainable parameters for attention transfer from Teacher MIMIC-CXR. For CheXpert 5k in particular, the AUC difference between the large and light CNN models is very small, which is not surprising as both teacher and student networks are trained on a similar domain (X-ray images) and the same task (classification for the same labels). In contrast, ChestX-ray8 has a different set of labels (diseases) compared with MIMIC-CXR, although both are still in the same domain, and the larger student network is significantly better than the lighter student network.
Therefore, regardless of the domain of the source and target datasets, the student network should be deep enough to learn the task when the knowledge is transferred from a teacher pre-trained on a different task (different sets of labels). However, we emphasize the importance of utilizing attention transfer in order to improve the classification performance on ChestX-ray8 5.6k. Using this approach, Teacher MIMIC-CXR provides an extra source of information that the CNN cannot gain if trained through a transfer learning approach, as shown in Table IV.

2. Knowledge transfer between distinct domains and tasks. In this experiment, we trained CNN models on the Retina dataset, which is different from both ImageNet and MIMIC-CXR. Some of the differences are as follows: (1) Retina and ImageNet images are RGB, whereas X-ray images are grayscale. (2) All datasets have different sets of classes and labels, i.e., Retina and ImageNet are not multi-label and each image is associated with only one of the 5 and 1000 class labels, respectively; however, X-ray has 15 multi-label binary classes, where one image may have more than one positive disease label. Taking the differences into account, it can be said that the Retina dataset is much closer to ImageNet than to X-ray images. Table V shows the performances of the CNN models on the test set. Our results indicate that better AUC (85.04±0.28) and F1-score (78.01±0.55) performance are achieved on Retina images if the knowledge (attention in our case) is transferred from the teacher network pre-trained on ImageNet rather than MIMIC-CXR. These results imply that in a teacher-student learning framework, the performance substantially increases when the teacher network is pre-trained on a domain similar to the domain that the student network will be trained on.

3. The effect of dataset size on Knowledge Transfer.
In order to analyze how much training data is needed to achieve high performance on attention transfer, we show AUC curves of CNN models as a function of the number of training examples sampled from CheXpert (5k, 50k, and the entire data, which is 178k) in Fig. 3. A glance at the plots reveals three trends. First, for both transfer learning and attention transfer, the AUC score increases as the number of training examples increases. Second, in-domain attention transfer substantially outperforms cross-domain attention transfer for small training data, but as the size of the training data increases, in-domain and cross-domain attention transfer perform the same. As can be seen from Fig. 3-(b), the same is true for the transfer learning setting as well. This can be explained by the fact that for a large dataset the network can learn the task from the data through optimizing the cross-entropy loss; therefore, there is less need to reuse previously acquired knowledge through attention transfer. Lastly, for the transfer learning approach from ImageNet, regardless of the size of the medical imaging data, it is always beneficial to unfreeze all CNN layers during training so all parameters of the network get updated at each epoch.

Table caption: The area under the receiver operating characteristic curve (AUC) score ± the 95% confidence intervals (CI). The best scores are in bold, and the second best scores are underlined. *tp denotes trainable parameters. Here St DenseNet-40 and St DenseNet-121 denote the student network which is DenseNet-40 and DenseNet-121, respectively. For small imaging dataset, attention transfer improves the performance when knowledge is transferred from a teacher to a student trained for the same task (multi-label binary classification) within the same domain (chest X-ray).

4. Knowledge transfer as a regularizer. Fig. 4 illustrates the AUC learning curves of CNN models trained using the student-teacher framework on the CheXpert 5k and ChestX-ray8 5.6k validation sets.
Our analysis indicates that regardless of the size of the student network (DenseNet-40 or DenseNet-121), attention transfer not only improves the performance but also serves as a regularizer to delay overfitting. The regularization effect of attention transfer stabilizes the training of the student network with less fluctuations. Knowledge transfer is widely used in computer vision tasks to enable deep CNN models to quickly learn complex concepts when trained on small image datasets (e.g., hundred/thousands versus millions of images). In this paper, we provided further insights into the adoption of the teacher-student learning framework based on the concept of attention transfer [12] on training CNN models for the medical domain. Our experiments were conducted on diagnostic classification tasks where we explored fundamental components of knowledge transfer on three medical imaging datasets. Our series of experiments revealed that in a small data regime (less than 50k), regardless of the source and target domains, attention transfer outperforms transfer learning approach. However, as the size of the dataset increases attention transfer and transfer learning perform the same function. In terms of transfer learning approach, our finding was in line with the previous study [20] where lighter CNN models were shown to have relatively similar performance results to that of larger networks. We also found that in the teacher-student learning framework, the performance of the CNN model depends on the similarity between the source and target domain where with more similar domains, higher performance can be achieved. However, it is important to note that the source and target domains do not necessarily need to be the same, yet, the best performance will be achieved when the knowledge is transferred from a teacher pre-trained on a similar domain to target domain. 
For instance, in the case of diabetic retinopathy classification, better performance was achieved through knowledge transfer from ImageNet than from medical X-ray images; all three domains are distinct, but Retina and ImageNet are more similar than Retina and X-ray. One other interesting aspect of attention transfer is its regularization effect, which delays overfitting and makes the training more robust compared with the transfer learning approach. Although it might slow down the convergence, it allows the network to continue training, which improves the performance: a trade-off between improvement and convergence speed. To sum up, attention transfer can help with performance improvement on small medical imaging datasets and with cross-domain knowledge transfer from other domains to the medical imaging domain, and, last but not least, can serve as a regularizer during training of the CNN on medical imaging datasets.

As already mentioned in the above paragraph, in this study we showed that the teacher-student learning framework significantly helps the performance for small imaging datasets. Limited availability of annotated data is a major issue in medical imaging and usually restrains medical research, particularly at the beginning of any pandemic. Currently, the 2019 novel coronavirus (COVID-19) is affecting the world and a small amount of available data hurts researchers' ability to build machine learning models that can help inform decision-makers with a timely response to the disease. COVID-19 directly affects the lungs, so chest X-ray or CT, routinely used by clinicians for diagnosis of pneumonia [52], have the potential to be leveraged for COVID-19 screening in emergency departments and ambulatory settings. As a result, there have been recent efforts in the machine learning community to develop advanced computer vision models for automated detection of COVID-19 cases from medical images [36]. This is an example of the need for advanced knowledge transfer techniques that can improve the performance of DNN diagnostic models trained on small datasets.

Fig. 3: […] CheXpert. Here AT DenseNet-121 (Teacher MIMIC-CXR) denotes that knowledge is transferred from Teacher MIMIC-CXR to the student network which is DenseNet-121. TF DenseNet-121 (tp = 7.0 m) denotes transfer learning using DenseNet-121 with 7.0m trainable parameters. The performance on AUC score for both attention transfer and transfer learning increases as the number of training data increase. For attention transfer, the AUC score difference between in-domain and cross-domain knowledge transfer decreases as the size of training data increases.

Our study, being exploratory, raises a number of opportunities for future work which would further elaborate knowledge transfer in the medical setting.

Explainability: In this study, we highlighted some of the advantages of attention transfer versus transfer learning, including a higher performance of the network for small datasets and acting as a regularizer during training. Another important advantage of leveraging attention transfer in medical settings, which has not been explored in this study, is its capability to provide some level of explainability. A direction for future research that stems from this work is to analyze the attention weights to explore where in the medical image the teacher and student networks pay more attention, and the potential correlation of attention weights with the disease.

Knowledge distillation: Attention transfer is just one form of teacher-student learning framework. There are various forms of knowledge transfer that have been widely developed and implemented to aid generalization while training deep CNN models on various domains and tasks.
A potential direction for future work is to investigate other forms of knowledge transfer on medical imaging, especially knowledge distillation, mainly for model compression, which makes models more suitable for deployment on edge devices.
Initialization strategy: During training and optimization of a CNN, the search space for the optimal parameters is determined by the choice of hyperparameters, the initialized weights of the network, and the training strategy. One limitation of this study is that we restricted our experiments to different training strategies for CNN models initialized with ImageNet weights only. In the attention transfer framework, we explored knowledge transfer in both in-domain and cross-domain settings; in both cases, the network initialization was the same as in the transfer learning setting (e.g., ImageNet). In transfer learning, we did not study the performance of networks initialized with cross-domain pre-trained weights, since the goal was to use the same initialization for both transfer learning and attention transfer. Exploring different initializations to expand the search space for optimization of CNN models is an interesting research direction to be pursued in the future.
[Caption, truncated] ... and ChestX-ray8 (5.6k) (bottom). The student network is DenseNet-121 in (a) and (c), and DenseNet-40 in (b) and (d). Attention transfer (AT), acting as a regularizer, delays overfitting and makes training more robust compared with the transfer learning (TF) approach. It might slow convergence, but it allows the student network to continue training.
Few-shot learning: Another direction that remains to be investigated is combining attention transfer with few-shot learning. Inspired by the work of Tian et al. [53], such a combination could be used to learn a good embedding that generalizes well to a novel class.
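As a rough illustration of classifying a novel class from a frozen embedding, the sketch below assigns a query image to the class whose support-set centroid is nearest in embedding space. Nearest-centroid classification is a simple stand-in chosen here for brevity, not the method of Tian et al. or of this paper, and the two-dimensional embeddings and class labels are hypothetical toy values; in practice the vectors would come from the trained student network.

```python
import math


def centroid(vectors):
    """Component-wise mean of a list of equal-length embedding vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def classify_by_nearest_centroid(query, support):
    """Return the label whose support centroid is closest to the query.

    support maps each class label to the few embedding vectors available
    for that class at test time; the embedding network itself stays frozen.
    """
    centroids = {label: centroid(vecs) for label, vecs in support.items()}
    return min(centroids, key=lambda label: math.dist(query, centroids[label]))


# Hypothetical 2-D embeddings: a base class and a novel class with two
# support examples each.
support_set = {
    "base_finding": [[0.0, 0.0], [0.0, 1.0]],
    "novel_finding": [[5.0, 5.0], [6.0, 5.0]],
}
prediction = classify_by_nearest_centroid([5.2, 4.9], support_set)
```

The design point is that only the small support set changes when a new class appears; the embedding is not retrained.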
In this setting, it is essential to learn a good embedding such that a model trained on a base dataset (e.g., CheXpert) can predict a novel class (e.g., COVID-19) with access to only a very limited number of images from the novel class, at test time only. In other words, no image of the novel class is shown to the network during training or validation.
COVID-19 early detection: As mentioned above, our proposed attention transfer pipeline enhances the performance of CNN models in the small-training-data regime, where access to a large annotated dataset is not possible. This was the case in the early stages of the COVID-19 pandemic, when even the large COVID-19 datasets might contain fewer than a thousand positive images [54, 55]. Thus, a CNN trained using a teacher-student framework can be utilized for early detection of COVID-19 from chest X-ray or CT (computed tomography) images with access to fewer images.
We would like to acknowledge the Vector Institute and its high-performance computing platforms, which were made available for conducting the research reported in this paper. We would also like to thank Vanessa Allen, Samir Patel, and Public Health Ontario for their support of this project. We also acknowledge the support of the Natural Sciences and Engineering Research Council of Canada (NSERC) [funding reference number PDF-516984].
[1] Radiologist shortage leaves patient care at risk, warns royal college
[2] Improving Patient Safety: Avoiding Unread Imaging Exams in the National VA Enterprise Electronic Health Record
[3] Radiographers supporting radiologists in the interpretation of screening mammography: a viable strategy to meet the shortage in the number of radiologists
[4] Deep Learning at Chest Radiography: Automated Classification of Pulmonary Tuberculosis by Using Convolutional Neural Networks
[5] CheXNet: Radiologist-Level Pneumonia Detection on Chest X-Rays with Deep Learning
[6] An overview of deep learning in medical imaging focusing on MRI
[7] Distilling the knowledge in a neural network
[8] Model compression
[9] Learning what and where to transfer
[10] Knowledge transfer in deep convolutional neural nets
[11] FitNets: Hints for thin deep nets
[12] Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer
[13] A survey on transfer learning
[14] Assessment of the generalization of learned image reconstruction and the potential for transfer learning
[15] A gift from knowledge distillation: fast optimization, network minimization and transfer learning
[16] ImageNet: A Large-Scale Hierarchical Image Database
[17] Using pre-training can improve model robustness and uncertainty
[18] Dermatologist-level classification of skin cancer with deep neural networks
[19] Deep learning ensembles for melanoma recognition in dermoscopy images
[20] Transfusion: Understanding Transfer Learning for Medical Imaging
[21] CheXpert: A Large Chest Radiograph Dataset with Uncertainty Labels and Expert Comparison
[22] Deep learning for chest radiograph diagnosis: A retrospective comparison of the CheXNeXt algorithm to practicing radiologists
[23] ChestX-ray8: Hospital-Scale Chest X-Ray Database and Benchmarks on Weakly-Supervised Classification and Localization of Common Thorax Diseases
[24] CheXclusion: Fairness gaps in deep chest X-ray classifiers
[25] Learning to diagnose from scratch by exploiting dependencies among labels
[26] Convolutional neural networks for diabetic retinopathy
[27] Convolutional neural networks based transfer learning for diabetic retinopathy fundus image classification
[28] Identification of diabetic retinopathy in eye images using transfer learning
[29] Automated detection of diabetic retinopathy using deep learning
[30] Transfer learning for diabetic retinopathy
[31] A deep learning model to predict a diagnosis of Alzheimer disease by using 18F-FDG PET of the brain
[32] Early diagnosis of Alzheimer's disease with deep learning
[33] Distinguishing obstructive versus central apneas in infrared video of sleep using deep learning: Validation study
[34] Learning what and where to transfer
[35] Diagnose like a Radiologist: Attention Guided Convolutional Neural Network for Thorax Disease Classification
[36] Predicting COVID-19 pneumonia severity on chest X-ray with deep learning
[37] Clinical radiology UK workforce census 2017 report
[38] A County-Level Analysis of the US Radiologist Workforce: Physician Supply and Subspecialty Characteristics
[39] Current radiologist workload and the shortages in Japan: how many full-time radiologists are required?
[40] Imaging in the Land of 1000 Hills: Rwanda Radiology Country Report
[41] Diagnostic radiology in Liberia: a country report
[42] MIMIC-CXR: A large publicly available database of labeled chest radiographs
[43] PadChest: A large chest X-ray image dataset with multi-label annotated reports
[44] Densely Connected Convolutional Networks
[45] Confounding variables can degrade generalization performance of radiological deep learning models
[46] EyePACS: an adaptable telemedicine system for diabetic retinopathy screening
[47] Automatic detection of diabetic retinopathy using an artificial neural network: a screening tool
[48] Automated identification of diabetic retinopathy stages using digital fundus images
[49] Development and Validation of a Deep Learning Algorithm for Detection of Diabetic Retinopathy in Retinal Fundus Photographs
[50] Thoracic disease identification and localization with limited supervision
[51] Adam: A method for stochastic optimization
[52] Computed tomography scan contribution to the diagnosis of community-acquired pneumonia
[53] Rethinking Few-Shot Image Classification: a Good Embedding Is All You Need?
[54] BIMCV COVID-19+: a large annotated dataset of RX and CT images from COVID-19 patients
[55] COVID-19 Image Data Collection: Prospective Predictions Are the Future
In this study, we explored teacher-student learning frameworks in the medical imaging setting for both large and light convolutional neural network students. DenseNet-121 was used as the large student network. For the light student network, we built a light version of DenseNet, called DenseNet-40, by removing the last two blocks of DenseNet-121. Details of the DenseNet-40 architecture are shown in Table VI.
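The effect of dropping the last two dense blocks can be checked with a little arithmetic. The sketch below counts weighted layers and output channels for a DenseNet given its block configuration, assuming the standard DenseNet-121 hyperparameters from the DenseNet paper (blocks of 6, 12, 24 and 16 dense layers, growth rate 32, 64 initial features, compression 0.5). Whether the authors keep a transition layer after the second block is not stated in this section, so `trailing_transition` is a hypothetical switch; Table VI remains the authoritative description.

```python
def densenet_summary(block_config, growth_rate=32, init_features=64,
                     compression=0.5, trailing_transition=False):
    """Return (weighted_layer_count, output_channels) for a DenseNet.

    Counting convention: one initial convolution, two convolutions per dense
    layer (1x1 bottleneck + 3x3), one convolution per transition, and one
    final classifier layer.
    """
    layers = 1  # initial convolution
    channels = init_features
    for index, num_dense_layers in enumerate(block_config):
        layers += 2 * num_dense_layers
        channels += num_dense_layers * growth_rate
        is_last_block = index == len(block_config) - 1
        if not is_last_block or trailing_transition:
            layers += 1  # transition convolution
            channels = int(channels * compression)
    layers += 1  # classifier
    return layers, channels


full = densenet_summary((6, 12, 24, 16))                       # DenseNet-121
truncated = densenet_summary((6, 12))                          # first two blocks only
truncated_with_transition = densenet_summary((6, 12), trailing_transition=True)
```

Under these assumptions the full network comes out at 121 layers with 1024 output channels, and the two-block variant at 39 layers, or 40 if a transition layer is kept after the second block, which would be consistent with the name DenseNet-40.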