title: Mutual Attention-based Hybrid Dimensional Network for Multimodal Imaging Computer-aided Diagnosis
authors: Dai, Yin; Gao, Yifan; Liu, Fayu; Fu, Jun
date: 2022-01-24

Recent works on multimodal 3D computer-aided diagnosis have demonstrated that obtaining a competitive automatic diagnosis model remains nontrivial and challenging when a 3D convolutional neural network (CNN) brings more parameters and medical images are scarce. Considering both the consistency of regions of interest in multimodal images and diagnostic accuracy, we propose a novel mutual attention-based hybrid dimensional network for multimodal 3D medical image classification (MMNet). The hybrid dimensional network integrates 2D CNNs with 3D convolution modules to generate deeper and more informative feature maps and to reduce the training complexity of 3D fusion. In addition, ImageNet pre-trained models can be used in the 2D CNN, which improves the performance of the model. The stereoscopic attention focuses on building rich contextual interdependencies of regions in 3D medical images. To improve the regional correlation of pathological tissues in multimodal medical images, we further design a mutual attention framework in the network to build region-wise consistency in similar stereoscopic regions of different image modalities, providing an implicit manner to instruct the network to focus on pathological tissues. MMNet outperforms many previous solutions and achieves results competitive with the state-of-the-art on three multimodal imaging datasets, i.e., the Parotid Gland Tumor (PGT) dataset, the MRNet dataset, and the PROSTATEx dataset, and its advantages are validated by extensive experiments.

The purpose of computer-aided diagnosis is to use computational methods to analyze medical images of different patients and help physicians diagnose diseases. It is an important task in computer vision with many practical applications, such as lung nodule diagnosis, diabetic retinopathy screening, and COVID-19 examination [1]-[4]. Recently, deep learning models designed for natural image classification [5] have become popular in medical image analysis [6], [7]. However, medical images differ from typical natural images in many aspects. For example, the region of interest is small, and most medical images are 3D. Generally speaking, 3D imaging tasks in the most prominent medical imaging modalities (such as CT and MRI) should be solved directly in 3D, but 3D models usually have more parameters than 2D models, so training requires more labeled data [8]. Compared with natural images, medical images are often available only in small quantities. Training a three-dimensional neural network model from scratch is therefore difficult to converge, and the model may underfit. Furthermore, multimodal imaging is essential for the development of a comprehensive pathological model and has attracted increasing attention from researchers in recent years [9]-[13]. In medical imaging research, different imaging modalities are usually combined to overcome the limitations of an individual imaging technique. For example, as shown in Fig. 1, in magnetic resonance imaging (MRI) [14], T1 images produce good contrast between human anatomical structures, while T2 images help to visualize lesions. Recent research, such as HyperDenseNet [11], has been proposed to solve the problem of multimodal 3D medical image recognition.
Nevertheless, these models have a large number of parameters. Because the region of interest in medical images is small, they perform many redundant and unnecessary calculations. Therefore, we design a transferable, lightweight mutual attention-based hybrid dimensional network to effectively improve the accuracy of multimodal 3D medical image classification. Specifically, our proposed network consists of three key components: the Hybrid Dimensional Block (HDB), the Stereoscopic Attention Module (SAM), and the Mutual Attention Framework (MAF). HDB combines 3D convolutions with 2D convolutions to reduce network parameters and allows pre-trained models from ImageNet [15] to be used on 3D images to improve performance. SAM is a novel attention module that captures pathological information that changes continuously across depths within the same region. MAF transfers the attention weights produced by SAM between the two streams of the dual deep network. Because the important abnormal information of images with different modalities is located in similar areas, this strengthens the information interaction and augments the consistency of pathological detection between the two streams. In summary, we make the following four contributions:
1) We present a new architecture to improve the performance of multimodal imaging computer-aided diagnosis with our proposed hybrid dimensional network. It allows the use of pre-trained models from ImageNet on 3D images and carries far fewer parameters than existing 3D models.
2) We propose a novel stereoscopic attention module, which can be leveraged to capture contextual information from region-wise dependencies in a more efficient way.
3) We design a mutual attention framework that transfers attention weights across modalities, enabling efficient high-level feature representations of spatial information within the network. It aims at preserving multi-modality image consistency by leveraging feature recalibration.
4) Experimental evaluations demonstrate that the proposed method achieves state-of-the-art performance on the in-house Parotid Gland Tumor (PGT) dataset, the MRNet dataset, and the PROSTATEx dataset.
The rest of this paper is organized as follows. Section II presents some closely related works. Section III describes the pipeline of our proposed method. Section IV introduces the experimental results and details. Finally, we summarize our work in Section V.

The mixed-dimensional network aims to reduce the computation in 3D convolutional neural networks and to reuse pre-trained models from 2D image datasets (such as ImageNet). In video action recognition, R2D [16] is a simple method that stacks depth into the channel dimension and feeds the result to a 2D residual network. Pseudo-3D residual networks (P3D) [17] and R(2+1)D [16] use 2D spatial convolution and 1D temporal convolution to approximate 3D convolution and initialize with model parameters pre-trained on ImageNet. However, in computer-aided diagnosis, diseased tissues often cluster in a three-dimensional area; although 1D temporal convolution reduces parameters, it loses a large amount of stereoscopic context information of the pathological area. Therefore, we propose a structure that combines 2D convolution and 3D convolution, which not only preserves the stereoscopic information unique to medical images, but also reduces the parameters reasonably. The attention mechanism [18] is an effective technique that can help the model pay more attention to important information.
In recent years, the attention mechanism has made important breakthroughs in image processing, natural language processing [19], and other fields, and has been proven beneficial for improving model performance. SE-Net [20] first proposed an effective mechanism to learn channel-wise attention and obtained competitive performance. Non-local neural networks [21] generate an attention map from the correlation matrix between each pair of points in the feature map, and the attention then guides the aggregation of rich contextual information. CBAM [22] is a simple yet effective attention module that uses spatial and channel attention to improve performance. ECA-Net [23] improved SE-Net and proposed a local cross-channel interaction strategy without dimensionality reduction. DRA-Net [24] adaptively captures contextual information based on a relation-aware attention mechanism. In the clinical diagnosis task, a radiologist first selectively pays attention to the abnormal area and then examines it in detail. Inspired by this human visual attribute, many papers use attention-based deep learning methods to highlight possible lesions in the image. [25] proposed an attention residual learning convolutional neural network for skin lesion classification in dermoscopy images. [26] proposed a deep selective attention network for breast cancer classification. [27] captures richer contextual dependencies through guided self-attention mechanisms. Although the attention mechanism can effectively improve the performance of deep networks on large training datasets, attention weights with a large number of additional parameters may not only increase computational cost but also overfit small-scale training datasets. In comparison, our method adds only a very small number of parameters and takes into account both the interaction of three-dimensional regions and the correlated attention of images from different modalities.

Multimodal medical classification is a fundamental and challenging task in medical image analysis. It has been shown that a reasonable fusion of different modalities is a promising means of enhancing deep CNNs. Multimodal fusion can capture richer abnormal information and improve the quality of diagnosis. [10] sets three modality-specific encoders to capture low-level features and a decoder to fuse low-level and high-level features. HyperDenseNet [11] builds dual deep networks for different MRI modalities and links features across these streams. [12] fuses final features from modality-specific paths to make the final decision. MMFNet [13] provides a more complex structure to fuse multimodal MRI images. In particular, HyperDenseNet demonstrates the potential of the dual deep network and cross-modal information fusion in improving model performance. On the other hand, in 3D medical image recognition, exchanging cross-modal feature maps is computationally expensive and redundant. Therefore, based on a dual deep network, we propose a method of attention weight fusion. By transferring attention weights between modalities instead of feature maps, the network can efficiently capture complementary information.

Transfer learning [28] uses a powerful pre-trained network as a feature extractor and is an efficient paradigm for improving model performance. In fact, in medical image computing, fine-tuning ImageNet pre-trained models has become common practice.
However, ImageNet pre-trained models require two-dimensional inputs and cannot be applied directly to three-dimensional images. For three-dimensional medical images, there is currently no pre-trained model that can be used for transfer learning in general. At present, the existing transferable models for 3D medical imaging include MedicalNet [29] and Model Genesis [8]. MedicalNet trains the model on several open-source medical datasets. Model Genesis obtains a pre-trained model for 3D medical images through self-supervised learning without using labels. Furthermore, MMFNet [13] proposes an initialization strategy named self-transfer, which trains several modality-specific encoder-decoder models separately; these pre-trained encoders are then used as the initial encoders of the multi-modality model. However, due to differences among medical imaging protocols and constraints of the network structure, these models cannot be transferred well to downstream tasks. Our method combines a 2D residual network with a 3D convolutional neural network so that the 3D network can load pre-trained weights from ImageNet. Moreover, it can be simply embedded into other networks to improve model performance.

It is well known that multimodal MRI images carry a great deal of anatomical and pathological information. This motivates us to build a classification model based on a dual deep network, making full use of the complementary information of the two information sources. In this article, we propose a novel network that aims to improve feature representation for multimodal medical image classification. As shown in Fig. 2, it consists of three main components: hybrid dimensional blocks, the stereoscopic attention module, and the mutual attention connection. We use two ResNet [5] models as the two streams of the network and load parameters pre-trained on ImageNet. This gives the model the ability to extract features from the very beginning of training.

Inspired by the success of Pseudo-3D (P3D) in numerous challenging video classification tasks, we develop a new building module named HDB that merges 2D convolution and 3D convolution, pursuing volume-wise encoding in an efficient way for 3D medical image classification. The overall structure of the hybrid dimensional block is shown in Fig. 3(b). Compared with an ordinary residual network, the input and output of the hybrid dimensional block are 3D volumes, and the main operation part is composed of 2D residual blocks instead of 3D residual blocks. The two convolution types run in parallel on different paths. The 3D convolution path consists of a 3×3×3 3D convolution, batch normalization [30], and ReLU [31], while the 2D convolution path is a classical residual block. For intuitive understanding, assume that the channel size is C, the depth is D, and the feature map size is (H, W), so the input shape is (C, D, H, W). The first 3×3×3 3D convolution has stride 1 and C' kernels, so the output shape is (C', D, H, W). The first 2D convolution in the residual block requires 3D input instead of 4D input. By stacking the C' channels along the depth, the input is reshaped to (C'×D, H, W). After a 2D residual block, the output has shape (C'×D, H', W'). Finally, the channel and depth are split again to obtain shape (C', D, H', W').
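To make the reshape-and-split bookkeeping concrete, the following is a minimal PyTorch sketch of a hybrid dimensional block. It is a reading of the description above rather than the released implementation: the width of the 2D residual branch, the way the depth axis is folded for the 2D path (here into the batch axis, which is what makes loading ImageNet weights straightforward), and the merging of the two paths by a simple element-wise sum are all assumptions.

```python
import torch
import torch.nn as nn
from torchvision.models.resnet import BasicBlock  # standard 2D residual block


class HybridDimensionalBlock(nn.Module):
    """Illustrative HDB: a 3x3x3 3D-conv path in parallel with a 2D residual path.

    Input and output are 5D tensors of shape (N, C, D, H, W).
    """

    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        # 3D path: 3x3x3 convolution (stride 1) + batch norm + ReLU.
        self.path3d = nn.Sequential(
            nn.Conv3d(in_channels, out_channels, kernel_size=3, padding=1),
            nn.BatchNorm3d(out_channels),
            nn.ReLU(inplace=True),
        )
        # 2D path: a classical residual block; its weights could be initialized
        # from an ImageNet-pretrained 2D ResNet of matching width (assumption).
        self.path2d = BasicBlock(out_channels, out_channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, _, d, h, w = x.shape
        y3d = self.path3d(x)  # (N, C', D, H, W)
        # Feed the 2D residual block slice by slice: fold depth into the batch
        # axis, one plausible realization of the "(C'xD, H, W)" reshape above.
        y2d = y3d.permute(0, 2, 1, 3, 4).reshape(n * d, -1, h, w)  # (N*D, C', H, W)
        y2d = self.path2d(y2d)
        y2d = y2d.reshape(n, d, -1, h, w).permute(0, 2, 1, 3, 4)   # (N, C', D, H, W)
        # Merge the two paths; an element-wise sum is assumed here.
        return y3d + y2d
```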
Usually, the pathological part of a medical image changes continuously across depths within the same region, and capturing such regions is very helpful for the predictive ability of the model. However, current research on spatial attention mainly focuses on relationships between pixels rather than between regions [21], [23], [24], and medical image recognition does not depend strongly on pixel-level correlations. In addition, spatial attention mechanisms introduce considerable computational complexity and many parameters, which makes both training and model optimization more difficult. To model regional context in 3D medical images, we propose the stereoscopic attention module. Specifically, we first define i as the set of four dimensions:

i = {C, D, H, W}.

Then, we directly construct the weighted relationship between regions. As shown in Fig. 4, given a feature X ∈ R^(C×D×H×W), where C, D, H and W are the channel, depth, height and width, global average pooling (GAP) is first applied along the channel, depth, height and width dimensions, producing four features X^C_GAP, X^D_GAP, X^H_GAP and X^W_GAP. Next, we reshape the four features into (1, 1, C), (1, 1, D), (1, 1, H) and (1, 1, W), and send them to four different one-dimensional convolutional layers, respectively. After that, they are activated by the sigmoid function, and the results are reshaped into (C, 1, 1, 1), (1, D, 1, 1), (1, 1, H, 1) and (1, 1, 1, W). We thus obtain the four attention weight vectors X^C_aw, X^D_aw, X^H_aw and X^W_aw, each of which is multiplied element-wise with X. Finally, the four weighted features are summed element-wise to obtain the final output O ∈ R^(C×D×H×W):

O = Σ_{i ∈ {C, D, H, W}} X ⊙ X^i_aw,

where ⊙ denotes the element-wise (broadcast) product. SAM explicitly guides the model to build attention over three-dimensional regions. Each attention weight interacts with the information of the adjacent area through one-dimensional convolution and obtains stereoscopic perception when being broadcast from 1D back to 4D.

In medical imaging diagnosis, cross-modality images bring richer information. However, simply feeding different modalities into separate backbones loses much complementary information, and this information has a significant effect on accurately diagnosing diseases. In HyperDenseNet, the authors use dense connections of feature maps to improve gradient flow and information fusion, but this brings many parameters and unnecessary calculations. To improve this situation, we propose the mutual attention framework, which only allows attention parameters to be transferred across modalities. The mutual attention framework divides attention into self-attention and cross-attention. In the general attention mechanism, the feature map is weighted by the attention parameters of its own modality, while in the mutual attention framework, the feature map is weighted by the attention parameters of its own modality and of the cross modality at the same time. This not only strengthens the flow of information between different modalities but also implicitly synthesizes complementary information between them. At the same time, mutual attention imitates the way a doctor reads images of different modalities, that is, looking for as much information as possible in different images to reach the most reliable diagnosis. As shown in Fig. 5, we first define j and k as the sets of four dimensions in the self modality and the cross modality, respectively:

j = {C_self, D_self, H_self, W_self},  k = {C_cross, D_cross, H_cross, W_cross}.

Second, denote the attention weight of the self-modality by W^i_self and the attention weight of the cross-modality by W^i_cross, both obtained as in Eq. 4. Therefore, considering mutual attention, the weighted feature X^i in Eq. 5 is updated as

X^i = X ⊙ (α W^i_self + (1 − α) W^i_cross),

where the factor α ∈ [0, 1] determines the proportion of the self-attention weight. When α = 1, the feature map is weighted only by the self-attention weight; when α = 0, it is weighted only by the cross-attention weight. A reasonable value of α lies between 0 and 1. We will specifically discuss the effect of α on the model in Section IV.
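As an illustration of how SAM and the mutual attention weighting fit together, the sketch below implements the dimension-wise attention weights (global average pooling, 1D convolution, sigmoid) and the α-blending between two modality streams in PyTorch. It is a reading of the description above, not the original code; the 1D-convolution kernel size, the broadcasting details, and applying the blended weights only to the self-modality feature in the helper function are assumptions.

```python
import torch
import torch.nn as nn


class StereoscopicAttention(nn.Module):
    """Sketch of SAM: one attention vector per axis (C, D, H, W) of a 5D feature map."""

    def __init__(self, kernel_size: int = 3):
        super().__init__()
        # One 1D convolution per axis; the kernel size is an assumption.
        self.convs = nn.ModuleList(
            [nn.Conv1d(1, 1, kernel_size, padding=kernel_size // 2, bias=False)
             for _ in range(4)]
        )

    def attention_weights(self, x: torch.Tensor):
        n = x.shape[0]
        weights = []
        # Average over all axes except the one of interest, then 1D conv + sigmoid.
        for axis, conv in zip((1, 2, 3, 4), self.convs):
            reduce_dims = [i for i in (1, 2, 3, 4) if i != axis]
            v = x.mean(dim=reduce_dims)                # (N, L), L = size of kept axis
            v = torch.sigmoid(conv(v.unsqueeze(1)))    # (N, 1, L)
            shape = [n, 1, 1, 1, 1]
            shape[axis] = v.shape[-1]
            weights.append(v.reshape(shape))           # broadcastable to (N, C, D, H, W)
        return weights

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # O = sum over the four axes of X weighted by each attention vector.
        return sum(x * w for w in self.attention_weights(x))


def mutual_attention(x_self, x_cross, sam_self, sam_cross, alpha=0.6):
    """Blend self- and cross-modality attention weights with factor alpha."""
    w_self = sam_self.attention_weights(x_self)
    w_cross = sam_cross.attention_weights(x_cross)
    return sum(x_self * (alpha * ws + (1.0 - alpha) * wc)
               for ws, wc in zip(w_self, w_cross))
```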
To evaluate the proposed method, we conducted comprehensive experiments on the parotid gland tumor (PGT) dataset, the MRNet dataset, and the PROSTATEx dataset. Experimental results show that MMNet reaches state-of-the-art performance on all datasets. Below, we first introduce the datasets and implementation details, then perform extensive ablation experiments on the parotid gland tumor dataset, and finally compare our approach with several state-of-the-art methods on the three datasets.

A. Dataset
1) PGT Dataset: The incidence of malignancy among parotid gland tumors [32] is about 20%. Correct preoperative diagnosis of these tumors is essential for proper surgical planning. Among the available examinations, imaging plays an important role in determining the nature of parotid gland masses. Magnetic resonance imaging (MRI) is considered the preferred imaging method for preoperative diagnosis of parotid tumors [33]. MRI can provide information about the exact location of the lesion and its relationship with surrounding structures, and can assess perineural spread and bone invasion. However, it has been reported that parotid gland tumors show considerable overlap in imaging features (such as tumor margins, homogeneity, and signal intensity), so it is difficult for doctors to identify the mass. According to common clinical classifications, we divide parotid gland tumors into five categories: pleomorphic adenoma, Warthin tumor, malignant tumor, basal cell adenoma, and a few other benign lesions. A total of 375 patients with parotid gland lesions were studied; patients lacking MRI T1 or T2 images were removed, and 344 patients with parotid gland lesions were finally included. First, Otsu thresholding [34] and manual adjustment are performed to extract the foreground area in the original image. The images of different modalities of the same patient are then registered to improve the consistency of the foreground area, and each image is resampled to (18, 224, 224). Therefore, 344 images are finally included, each of which is a stack of the 3D MRI T1 and T2 volumes with size (36, 224, 224). Data augmentation uses random flipping and random noise: random flipping flips the image with 50% probability, and random noise adds Gaussian noise with a mean of 0 and a variance of 0.25. The patients were randomly divided into a training group (n = 275) and an independent test group (n = 69) at a ratio of 4:1, and the training group was used to optimize the model parameters.
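A minimal sketch of this preprocessing and augmentation pipeline, assuming TorchIO [39] (the library listed in the implementation details below), is given for reference. The transform names are real TorchIO APIs, but the flip axes, the noise standard deviation (a variance of 0.25 corresponds to a standard deviation of 0.5), the spatial axis ordering, and the file names are illustrative assumptions rather than the original preprocessing code.

```python
import torchio as tio

# Resample each modality to the stated shape; TorchIO orders spatial axes as
# (W, H, D), so (224, 224, 18) corresponds to the (18, 224, 224) volume above.
preprocess = tio.Resize((224, 224, 18))

augment = tio.Compose([
    tio.RandomFlip(flip_probability=0.5),   # random flipping with 50% probability
    tio.RandomNoise(mean=0.0, std=0.5),     # Gaussian noise, mean 0, variance 0.25
])

# Hypothetical file names for one patient.
subject = tio.Subject(
    t1=tio.ScalarImage('patient001_T1.nii.gz'),
    t2=tio.ScalarImage('patient001_T2.nii.gz'),
)
subject = augment(preprocess(subject))
# The T1 and T2 volumes are then stacked along the depth axis to form the
# (36, 224, 224) network input described above.
```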
2) MRNet Dataset: The MRNet dataset [35] is a publicly available medical image benchmark containing 1370 knee MRI examinations performed at Stanford University Medical Center. Each examination is labeled for anterior cruciate ligament (ACL) tears, meniscal tears, and general abnormality of the corresponding knee joint. The dataset is randomly divided into 1130 training samples, 120 validation samples, and 120 test samples. The dataset provides three MRI modalities: T1-weighted, T2-weighted, and proton density-weighted images. To make a fair comparison with the baseline models, in this paper we preprocess the data with the same strategy used in MRNet. The data augmentation method is consistent with the approach used for the PGT dataset.
3) PROSTATEx Dataset: The PROSTATEx dataset [36] is the training set of the SPIE-AAPM-NCI PROSTATEx challenge. It includes imaging data from 204 patients with a total of 330 biopsy-proven prostate lesions (76 clinically significant and 254 clinically insignificant). This study excluded the remaining test data from 140 patients because their labels were not publicly available. The MRI protocol of the PROSTATEx dataset is described in [36].

We set SGD [37] as the optimizer with a learning rate of 10^-3 and a momentum of 0.9. The maximum number of training epochs is set to 100. Our experiments were performed on an NVIDIA 3080 GPU (with 10 GB of GPU memory). The code is implemented using PyTorch [38] and TorchIO [39]. To ensure a fair comparison, we use different evaluation criteria for the public and private datasets; the criteria on the public datasets are consistent with the existing solutions. On the MRNet dataset and the PROSTATEx dataset, we evaluate using accuracy, sensitivity, specificity, and the area under the receiver operating characteristic curve (ROC-AUC). On the PGT dataset, the evaluation criteria for each model are the overall accuracy and the precision of each category.
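For reference, the binary metrics used on MRNet and PROSTATEx can be computed as in the short scikit-learn sketch below. This is an illustration, not the original evaluation code; the helper name, the default 0.5 threshold, and the variable names y_true and y_score (ground-truth labels and predicted probabilities) are our assumptions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score


def binary_metrics(y_true, y_score, threshold=0.5):
    """Accuracy, sensitivity, specificity, and ROC-AUC for a binary task."""
    y_pred = (np.asarray(y_score) >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),   # true positive rate
        "specificity": tn / (tn + fp),   # true negative rate
        "roc_auc": roc_auc_score(y_true, y_score),
    }


# Example: binary_metrics([0, 1, 1, 0], [0.2, 0.8, 0.6, 0.4])
```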
C. Baseline Methods
1) PGT Dataset: To compare with the hybrid dimensional block, we select P3D and R2D as baseline models. 3D-ResNet and C3D [40] are included in the comparison as common 3D neural network models. HyperDenseNet shares a similar idea with our method, namely fusing multimodal information when processing medical images. MedicalNet and Model Genesis are selected as baseline models because they are both large-scale pre-trained models designed for medical images.
2) MRNet Dataset: In this experiment, we compare our model with three state-of-the-art implementations. MRNet [35] mainly consists of three AlexNet [41] backbone networks, which predict from each modality independently and fuse the decisions of the backbone networks to derive the final diagnosis. ELNet [42] mainly uses a ResNet model and proposes two techniques, multi-slice normalization and a BlurPool layer, to further improve performance. MRPyrNet [43] uses a feature pyramid network to improve the ability to capture injuries occurring in the knee region; this module was inserted into MRNet and ELNet and achieved significant performance improvements.
3) PROSTATEx Dataset: Liu et al. [44] proposed XmasNet, a convolutional neural network-based, end-to-end PCa classification framework. Seah et al. [45] used auto-windowing techniques and transfer learning to further improve the performance of deep models for PCa classification. Chen et al. [46] built deep models using InceptionV3 [47] and VGG-16 [48] for PCa diagnosis and used pre-trained models from ImageNet to alleviate the problem of insufficient medical image data. [49] developed a novel semi-automatic classification framework based on 3D convolutional neural networks using different MRI sequences as inputs. [50] proposed a multi-input selection network (MISN) for PCa classification. Compared with other methods, MISN is a multiple-input, multiple-output deep network that maximizes the use of multiparametric MR images from the PROSTATEx dataset to improve diagnostic performance.

1) PGT Dataset: Table I shows the comparison between MMNet and other models on the parotid gland tumor dataset. The experimental results show that our model achieves a total accuracy of 90.0% with a standard deviation of 0.032, which is well ahead of the most advanced methods, while having fewer parameters and a faster operation speed. Compared with 3D-ResNet34, MMNet34 has fewer parameters (52M vs. 64M), higher performance (90.0% vs. 73.3%), and nearly four times the operation speed (0.72 vs. 2.7). Moreover, MMNet has higher precision and a lower standard deviation in almost every class, which reflects its stronger robustness on smaller medical image datasets. In comparison with HyperDenseNet, our method makes use of the flow of attention weights to improve model performance. This also shows that, for medical images, a large number of model parameters may not bring better results. We also compare with MedicalNet and Model Genesis, which shows that large-scale medical data pre-training may not bring greater improvement than ImageNet pre-training.
2) MRNet Dataset: The experimental results of the different models are summarized in Table II. The ROC curves for MMNet18 and MMNet34 are shown in Fig. 6. On MRNet, MMNet leads the other state-of-the-art implementations in almost all metrics. Specifically, MMNet18 leads in 3 of 12 metrics, MMNet34 leads in 8 metrics, and only ELNet slightly outperforms MMNet18 and MMNet34 in the sensitivity of abnormality diagnosis. It is worth mentioning that our method outperforms existing methods by 10.8% in the sensitivity of ACL diagnosis, which demonstrates the good robustness of MMNet in the diagnosis of minor injuries. In addition, compared to ELNet and MRPyrNet, our method does not rely on any domain knowledge of knee injuries. For example, ELNet requires radiologists to select the slice in the MRI sequence that is most likely to contain pathological information, whereas MRPyrNet assumes that the abnormality is always present in the center of the MRI sequence. While these 2D slice-based models detect abnormalities with the help of such prior knowledge, our implementation extracts common abnormal patterns in multiple MRI sequences more efficiently and robustly through the hybrid dimensional deep model and a dynamic mutual attention module, and thus makes a more accurate diagnosis.
3) PROSTATEx Dataset: We further compared our method with the state-of-the-art methods on the PROSTATEx dataset, and the results are shown in Table III. The ROC curves of the different methods are shown in Fig. 7. Our smaller model achieved a ROC-AUC of 0.92, a sensitivity of 0.79, a specificity of 0.88, and an accuracy of 0.82. The larger model achieves a ROC-AUC of 0.94, a sensitivity of 0.85, a specificity of 0.86, and an accuracy of 0.81, which are much better than the most recent existing methods. In particular, our model outperforms the MISN network designed for the PROSTATEx dataset. Compared with this state-of-the-art approach, we use a hybrid dimensional model and an attention mechanism to extract features effectively and obtain higher performance.
In addition, our method is also superior to other 3D CNN-based methods, and it greatly surpasses other methods that use pre-trained models.

In this section, we perform an ablation analysis of SAM and MAF to validate the effectiveness of our proposed model, and then discuss the impact of the value of α in the MAF.
1) Ablation Experiments on Key Modules: Table IV shows the ablation analysis results of the key modules in MMNet. The MMNet using only hybrid dimensional blocks has an accuracy of 83.6% and a standard deviation of 0.046. The model with SAM has an accuracy of 86.4% and a standard deviation of 0.030. The model with both SAM and MAF performs best, with an accuracy of 90.0% and a standard deviation of 0.032. The experiments show that SAM and MAF make distinct contributions to the model. In addition, MMNet with only HDB is already ahead of P3D on these metrics, showing the advantage of our hybrid dimensional structure in capturing features in 3D medical images.
2) Ablation Experiments on the α value of MAF: To assess the influence of the self-attention weight in MAF, we adjust the value of α in Eq. 11 while keeping other experimental conditions unchanged, and obtain the results shown in Table V. When α = 0, that is, when the attention weights are completely exchanged between modalities, the average accuracy is the lowest, at 82.34%. As α increases, the accuracy first rises and then declines. When the self-attention weight is 0.6, the average accuracy is the highest, at 90.02%; at this setting the model can focus on complementary cross-modal information. When the self-attention weight is 1, the average accuracy is 86.39%, which is better than α = 0. This shows that excessive exchange of cross-modal information reduces the ability of the model to find complementary information.

Class Activation Mapping (CAM) [51] is a tool for analyzing the hidden information behind classification decisions. It uses the global average pooling before the fully connected layer of the model to obtain a weight map of the image. Certain parts of the image are the main basis for the model's classification decision, and CAM can visualize these areas well. As shown in Fig. 8, we show some 2D slices of the images. We find that MMNet has a higher response in capturing pathological regions, which indicates that our proposed model can handle objects with different shapes and structures. In contrast, R2D aliases channels and depth at the same time, resulting in a significant loss of the 3D information of the image, which leads to its poor performance in capturing pathological regions. This shows the importance of keeping channel and depth distinct in 3D medical images.

This paper presents a novel mutual attention-based hybrid dimensional network, MMNet, which utilizes multimodal 3D medical images to boost the performance of lesion detection models. Experiments show that our method can substantially improve the performance of medical image recognition. The method adopts a novel hybrid dimensional architecture and integrates the features of two CNNs to model cross-modal dependencies, achieving both high performance and high efficiency. In summary, our method improves the ability of feature extraction and provides great advantages for multimodal 3D medical image recognition.
[1] Evaluate the malignancy of pulmonary nodules using the 3-d deep leaky noisy-or network.
[2] Explain: Explanatory artificial intelligence for diabetic retinopathy diagnosis.
[3] Convolutional sparse support estimator-based covid-19 recognition from x-ray images.
[4] Anam-net: Anamorphic depth embedding-based lightweight cnn for segmentation of anomalies in covid-19 chest ct images.
[5] Deep residual learning for image recognition.
[6] A survey on deep learning in medical image analysis.
[7] Deep learning for image-based cancer detection and diagnosis - a survey.
[8] Models genesis.
[9] Deep learning-based image segmentation on multimodal medical imaging.
[10] Joint sequence learning and cross-modality convolution for 3d biomedical segmentation.
[11] Hyperdense-net: a hyper-densely connected cnn for multi-modal image segmentation.
[12] Fully convolutional networks for multi-modality isointense infant brain image segmentation.
[13] Mmfnet: A multi-modality mri fusion network for segmentation of nasopharyngeal carcinoma.
[14] Magnetic resonance imaging: theory and practice.
[15] Imagenet: A large-scale hierarchical image database.
[16] A closer look at spatiotemporal convolutions for action recognition.
[17] Learning spatio-temporal representation with pseudo-3d residual networks.
[18] An attentive survey of attention models.
[19] Fundamentals of artificial intelligence.
[20] Squeeze-and-excitation networks.
[21] Non-local neural networks.
[22] Cbam: Convolutional block attention module.
[23] Eca-net: Efficient channel attention for deep convolutional neural networks.
[24] Scene segmentation with dual relation-aware attention network.
[25] Attention residual learning for skin lesion classification.
[26] Attention by selection: A deep selective attention approach to breast cancer classification.
[27] Multi-scale self-guided attention for medical image segmentation.
[28] A survey on transfer learning.
[29] Med3d: Transfer learning for 3d medical image analysis.
[30] Batch normalization: Accelerating deep network training by reducing internal covariate shift.
[31] Deep sparse rectifier neural networks.
[32] Clinical prognostic factors in malignant parotid gland tumors.
[33] Tumors of the parotid gland: MR imaging characteristics of various histologic types.
[34] A threshold selection method from gray-level histograms.
[35] Deep-learning-assisted diagnosis for knee magnetic resonance imaging: development and retrospective validation of mrnet.
[36] Prostatex challenges for computerized classification of prostate lesions from multiparametric magnetic resonance images.
[37] Stochastic gradient descent tricks.
[38] Pytorch: An imperative style, high-performance deep learning library.
[39] Torchio: a python library for efficient loading, preprocessing, augmentation and patch-based sampling of medical images in deep learning.
[40] Learning spatiotemporal features with 3d convolutional networks.
[41] Imagenet classification with deep convolutional neural networks.
[42] Elnet: Automatic classification and segmentation for esophageal lesions using convolutional neural network.
[43] Improving mri-based knee disorder diagnosis with pyramidal feature details.
[44] Prostate cancer diagnosis using deep learning with 3d multiparametric mri.
[45] Detection of prostate cancer on multiparametric mri.
[46] A transfer learning approach for malignant prostate lesion detection on multiparametric mri.
[47] Rethinking the inception architecture for computer vision.
[48] Very deep convolutional networks for large-scale image recognition.
[49] Semi-automatic classification of prostate cancer on multi-parametric mr imaging using a multi-channel 3d convolutional neural network.
[50] Selecting proper combination of mpmri sequences for prostate cancer classification using multi-input convolutional neuronal network.
[51] Learning deep features for discriminative localization.