title: Attention-based VGG-16 model for COVID-19 chest X-ray image classification
authors: Sitaula, Chiranjibi; Hossain, Mohammad Belayet
date: 2020-11-17
journal: Appl Intell
DOI: 10.1007/s10489-020-02055-x

Computer-aided diagnosis (CAD) methods based on chest X-rays (CXR) are among the cheapest options for diagnosing COVID-19 at an early stage, compared to alternatives such as the Polymerase Chain Reaction (PCR) test and Computed Tomography (CT) scans. To this end, a few works have proposed CXR-based methods to diagnose COVID-19. However, they have limited performance because they ignore the spatial relationships between the regions of interest (ROIs) in CXR images, which could identify the regions of the human lungs likely affected by COVID-19. In this paper, we propose a novel attention-based deep learning model that combines an attention module with VGG-16. The attention module captures the spatial relationships between the ROIs in CXR images. In addition, by using an appropriate convolution layer (the 4th pooling layer) of the VGG-16 model together with the attention module, we design a novel deep learning model that is fine-tuned for the classification task. To evaluate the performance of our method, we conduct extensive experiments on three COVID-19 CXR image datasets. The experiments and analysis demonstrate the stable and promising performance of our proposed method compared to state-of-the-art methods. This promising classification performance indicates that our method is well suited to CXR image classification for COVID-19 diagnosis.

COVID-19, which is caused by Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2) [1-3], has been posing a severe threat to humanity through widespread community transmission and a daily increasing death rate. It is believed to have originated in the city of Wuhan, China [4], and has now spread all over the world [5-7]. The spread of the infection is also related to the geographic region of the corresponding country [8]. To identify the infection in the human body, medical professionals have widely used the Polymerase Chain Reaction (PCR) method, which is not only expensive but also arduous. It is, moreover, time-consuming, whereas faster results are more likely to save lives. Thus, researchers have been seeking cheaper and quicker Computer-Aided Diagnosis (CAD) methods such as chest X-ray (CXR) imaging [9-11], Computed Tomography (CT) [12, 13], and so on. Besides, the World Health Organization (WHO) has encouraged chest imaging for patients who are not hospitalized but have mild symptoms. Among CAD methods, the CXR-based approach is one of the cheapest and quickest for early diagnosis of the disease. CXR-based methods for COVID-19 diagnosis are proposed in [9, 13-15]. These methods are mostly based on pre-trained deep learning models, which outperform traditional computer vision-based methods (also called hand-crafted feature extraction methods) [16], as deep learning-based methods extract higher-order features. Consequently, deep learning has achieved breakthrough performance in image analysis, especially for CXR images.
As a result, deep learning-based methods have been widely adopted in the literature for CXR image analysis, especially for COVID-19 diagnosis. Existing CXR-based methods for COVID-19 diagnosis have three major limitations. Firstly, some of them do not perform well because they require a separate classifier after the feature extraction step, which is a demanding task. Secondly, the spatial relationships between the regions of interest (ROIs) in images have been ignored in the literature, although they help discriminate CXR images more accurately. Finally, existing deep learning-based methods require a large number of training parameters, which not only imposes a computational burden during classification but also leads to over-fitting, given the limited availability of COVID-19 CXR images. To address these limitations, we propose a novel deep learning model that uses an appropriate layer of VGG-16 [17] together with an attention module [18]. We choose a pooling layer as the appropriate layer, as it not only has higher discriminability for CXR images but also makes deep learning model training faster [19]. Kumar et al. [19] also note that deep learning models are applicable to different domains, including human health and medicine. Given this importance and applicability, we train our deep learning model end to end, so it does not require an additional classifier for classification. Furthermore, with the help of the attention module as a deep learning layer, we capture the spatial relationships between the ROIs during training to better discriminate CXR images (see the visualization example of ROIs in Fig. 1). Moreover, our model requires a lower number of parameters because it leverages an appropriate layer (the 4th pooling layer) of the VGG-16 model. Specifically, this pooling layer captures invaluable information in CXR images, which helps to identify and diagnose most lung-related diseases, such as COVID-19, swiftly. The main contributions of our proposed method are as follows:

- We propose a novel deep learning model that combines VGG-16 with an attention module and is well suited to COVID-19 CXR image datasets.
- We perform a qualitative and quantitative study of our method using CXR images. The evaluation results demonstrate that our model outperforms the state-of-the-art methods.

The paper is organized as follows. In Section 2, we review the existing methods related to CXR image classification, including for COVID-19. We explain our proposed method in Section 3. Section 4 elaborates the experimental settings, implementation, results and discussion, comparison, and different analyses. Finally, Section 5 concludes the paper and outlines future work.

Deep learning (DL) models are very popular nowadays for various image representation and classification tasks, ranging from scene images to health images [9, 21, 22]. DL models are large artificial neural networks (ANNs) inspired by the structure and function of the human brain. They fall into two types: non-pre-trained DL models and pre-trained DL models. Non-pre-trained DL models must be trained from scratch, which requires massive datasets and is prone to over-fitting. In contrast, pre-trained DL models have already been trained on public image datasets such as ImageNet [23] and Places [24], and avoid over-fitting in most cases.
Because such pre-trained models extract higher-order semantic features, their performance is higher in most domains [16, 21] than that of traditional computer vision methods such as Generalized Search Tree (GIST)-color [25], GIST [26], the Scale-Invariant Feature Transform (SIFT) [27], the Histogram of Oriented Gradients (HOG) [28], and Spatial Pyramid Matching (SPM) [29]. In this section, we review some of the recent deep learning-based methods [9-11, 14, 15, 22, 30-32] that have been widely used for CXR image analysis, including for COVID-19. We divide these methods into those based on a single DL model and those based on ensemble learning.

Several recent works perform CXR image analysis for different diseases, including COVID-19. Firstly, Stephen et al. [30] proposed a DL model to detect pneumonia, training it from scratch on a collection of CXR images. Researchers subsequently realized the ability of pre-trained models in X-ray image analysis tasks and explored the strengths of various DL models further. For example, Loey et al. [11] used a transfer learning approach with AlexNet [33], GoogleNet [34], and ResNet-18 [35] to represent and classify CXR images for COVID-19 diagnosis. They used a COVID-19 dataset consisting of four categories (Covid, Normal, Pneumonia Bacterial, and Pneumonia Viral), and employed a Generative Adversarial Network (GAN) [36] to increase the number of training images, which helps avoid over-fitting in the experiments. Similarly, Khan et al. [9] proposed a novel DL model based on Xception [37]; they fine-tuned the Xception model and trained it on COVID-19 CXR images for classification. Moreover, Ozturk et al. [10] proposed a novel DL model to represent and classify COVID-19 CXR images.

Although existing methods based on a single DL model provide a significant performance boost in CXR image analysis, they still ignore the spatial relationships between ROIs, which are among the important discriminating clues in the CXR image analysis task. A single DL model alone might not carry sufficient discriminating information for CXR image classification. Given these weaknesses, researchers have combined multiple DL models into a single combined model, also called an ensemble model; the corresponding learning approach is called ensemble learning. For example, Zhou et al. [41] combined multiple artificial neural networks (ANNs) to identify lung cancer cells. Similarly, Sasaki et al. [31] designed an ensemble model to detect abnormalities in CXR images. Furthermore, Li et al. [42] used more than two Convolutional Neural Networks (CNNs) to minimize the false-positive rate in detecting lung nodules in CXR images. Similarly, Islam et al. [43] proposed an ensemble model that aggregates different pre-trained DL models to detect lung nodule abnormalities in CXR images. Recently, Chouhan et al. [22] proposed a model that aggregates the outputs of five pre-trained models (AlexNet, DenseNet-121, ResNet-18, Inception-V3, and GoogleNet) to detect pneumonia using transfer learning on CXR images. However, ensemble models still have two weaknesses. Firstly, they are prone to over-fitting in most cases because of the limited number of CXR images in the medical domain. Secondly, ensemble models are computationally expensive, as they must extract patterns using millions of parameters during training.
This also necessitates careful tuning of the hyper-parameters, which is itself a challenging task.

Our proposed method is based on the well-established pre-trained DL model VGG-16 and an attention module. We prefer the VGG-16 model (see the detailed description in Table 1) for two reasons. Firstly, it extracts low-level features using its smaller kernel size, which is appropriate for CXR images, and it has a lower number of layers than its counterpart, the VGG-19 model. Secondly, it has a better feature extraction ability for the classification of COVID-19 CXR images, as shown in [15]. We use a fine-tuning approach, which is one of the transfer learning techniques. To fine-tune the VGG-16 model, we use pre-trained ImageNet [23] weights. This helps to overcome over-fitting, as we have a limited number of COVID-19 CXR images for training. Our proposed method (also called attention-based VGG-16) consists of four main building blocks: an attention module, a convolution module, FC-layers, and a softmax classifier. The overall block diagram of the proposed model is shown in Fig. 2. We explain each building block in the following subsections.

We use the attention module to capture the spatial relationships of visual clues in COVID-19 CXR images. For this, we follow the spatial attention concept proposed by Woo et al. [18]. We perform both max pooling and average pooling on the input tensor, which in our method is the 4th pooling layer of the VGG-16 model. The two resulting 2D tensors (max-pooled and average-pooled) are then concatenated, and a convolution with filter size (f) of 7 × 7 followed by a sigmoid function (σ) is applied. The high-level diagram of the attention module is shown in Fig. 2. The resulting attention map $M_s(F)$ is defined as

$$M_s(F) = \sigma\left(f^{7 \times 7}\left(\left[F^{s}_{avg}; F^{s}_{max}\right]\right)\right)$$

where $F^{s}_{avg} \in \mathbb{R}^{1 \times H \times W}$ and $F^{s}_{max} \in \mathbb{R}^{1 \times H \times W}$ represent the 2D tensors obtained by average pooling and max pooling on the input tensor $F$, respectively, and $H$ and $W$ denote the height and width of the tensor.

We use a convolution module in our method, namely the 4th pooling layer of the VGG-16 model. The convolution module captures the interesting clues of the image. These clues are extracted from a mid-level layer (the 4th pooling layer), which is most appropriate for CXR images; features from other layers (higher or lower) are less appropriate, because CXR images are neither very general nor very specific. Thus, we first feed the 4th pooling layer into the attention module, and then concatenate the result of that module with the 4th pooling layer itself.

To convert the concatenated features from the attention and convolution blocks into one-dimensional (1D) features, we use fully connected layers. They consist of three layers, flatten, dropout, and dense, as shown in Fig. 2. In our method, we fix the dropout rate to 0.5 and set the dense layer size to 256.

To classify the features extracted from the FC-layers, we use a softmax layer. For the softmax layer, which is the last dense layer, the number of units depends on the number of categories (e.g., three for a dataset with three categories, four for a dataset with four categories, etc.). The softmax layer outputs a multinomial distribution of probability scores for the classification (see Table 2).
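To make the architecture concrete, the following is a minimal Keras sketch of the model described above. It assumes the CBAM convention [18] that the spatial attention map is applied to the 4th pooling layer multiplicatively before being concatenated with that layer itself; the 224 × 224 input size and the ReLU activation on the 256-unit dense layer are also assumptions for illustration, and the authors' exact wiring may differ (their source code is linked at the end of the paper).

```python
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

def spatial_attention(x):
    # Channel-wise average and max pooling produce two H x W x 1 maps.
    avg_pool = layers.Lambda(lambda t: tf.reduce_mean(t, axis=-1, keepdims=True))(x)
    max_pool = layers.Lambda(lambda t: tf.reduce_max(t, axis=-1, keepdims=True))(x)
    concat = layers.Concatenate(axis=-1)([avg_pool, max_pool])  # H x W x 2
    # A 7x7 convolution followed by a sigmoid gives the attention map M_s(F).
    return layers.Conv2D(1, kernel_size=7, padding='same',
                         activation='sigmoid')(concat)

def build_model(num_classes, input_shape=(224, 224, 3)):
    base = VGG16(weights='imagenet', include_top=False, input_shape=input_shape)
    pool4 = base.get_layer('block4_pool').output        # 4th pooling layer (convolution module)
    attn = spatial_attention(pool4)                     # attention module
    refined = layers.Multiply()([pool4, attn])          # attention-weighted features (assumed)
    merged = layers.Concatenate(axis=-1)([refined, pool4])  # concatenated with pool4 itself
    x = layers.Flatten()(merged)                        # FC-layers: flatten -> dropout -> dense
    x = layers.Dropout(0.5)(x)
    x = layers.Dense(256, activation='relu')(x)
    out = layers.Dense(num_classes, activation='softmax')(x)  # softmax classifier
    return models.Model(inputs=base.input, outputs=out)

model = build_model(num_classes=3)  # e.g., a three-category dataset
model.summary()
```

Note that, because the model is trained end to end, the attention weights and the fine-tuned VGG-16 layers are learned jointly with the softmax classifier, so no separate classifier is needed.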
To perform extensive experiments with our method, we use three COVID-19 CXR image datasets [9, 10], whose details are listed in Table 3. To implement our proposed method, we used Keras [44] in Python [45]. To train our deep learning model end to end, we used the softmax layer as the classifier in the experiments. Similarly, for fine-tuning, we loaded pre-trained ImageNet weights and trained the model from the initial layer (the first layer of the VGG-16 model) with CXR images. The detailed parameters required to implement our method, including the basic training settings and offline augmentation, are listed in Table 4. Additionally, to prevent over-fitting, we fixed the learning rate to decay every 4 steps at a rate of 0.4, relative to the initial learning rate, with the Adam optimizer. We implemented our method on a computer with an NVIDIA GeForce GTX 1050 GPU with 4 GB of GDDR5 VRAM.

Since our method uses a fine-tuning approach, we compare it with fine-tuned versions of several pre-trained deep learning models (Table 6). To fine-tune the other pre-trained models, we use settings similar to those of our method (see details in Table 4). Moreover, to achieve optimal accuracy for the existing methods, we perform additional hyper-parameter tuning during training; the details of these optimal parameters are presented in Table 5. Additionally, we compare our model with three state-of-the-art models that have used COVID-19 CXR images for classification (Table 7).

In Table 6, we present the results on D1, D2, and D3 in columns 2, 3, and 4, respectively. Looking at the results in column 2 for D1, we observe that our method outperforms all fine-tuned pre-trained models. Specifically, our method yields 79.58% accuracy, at least 10 percentage points higher than the second-best method (Incep.-ResnetV2), which achieves 68.10% on D1. Similarly, in column 3 for D2, our method again outperforms all fine-tuned pre-trained models: it achieves 85.43% accuracy, at least 1.5 percentage points higher than the second-best contender (Incep.-ResnetV2) at 83.93%. Moreover, in column 4 for D3, our method again surpasses the existing methods: it achieves 87.49% accuracy, at least 3.14 percentage points higher than the second-best method (Incep.-ResnetV2) at 84.35%. Furthermore, the second-best method has the highest number of training parameters (57 million), over 3 times more than ours; a higher number of training parameters burdens the deep learning model during training. Also, when implementing our method with VGG-19 on all three datasets, we notice that it outperforms all the pre-trained models on D1, D2, and D3, which confirms the efficacy of our method with the VGG-19 model as well.

Furthermore, in Table 7, we present the results for D1, D2, and D3 in columns 2, 3, and 4, respectively. Across the three datasets, our method performs excellently compared to three recent contender methods (CoroNet [9], Luz et al. [14], and nCOVnet [15]). It is also interesting to see that our method is stable on each dataset, in contrast to Luz et al. [14] and nCOVnet [15], which have fewer parameters than ours. Moreover, our method has the third-lowest number of parameters yet delivers stable classification performance across the different datasets.
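To make the training schedule above concrete, here is a minimal Keras sketch of the step decay, reading "every 4 steps at a rate of 0.4" as a multiplicative decay of 0.4 applied every 4 epochs (an assumption); the initial learning rate of 1e-4 is also assumed for illustration, and the authors' actual settings are in Table 4. It reuses the `model` built in the earlier sketch.

```python
from tensorflow.keras.callbacks import LearningRateScheduler
from tensorflow.keras.optimizers import Adam

INITIAL_LR = 1e-4  # assumed for illustration; see Table 4 for the authors' setting

def step_decay(epoch):
    # Multiply the initial learning rate by 0.4 once every 4 epochs.
    return INITIAL_LR * (0.4 ** (epoch // 4))

model.compile(optimizer=Adam(learning_rate=INITIAL_LR),
              loss='categorical_crossentropy',
              metrics=['accuracy'])
# model.fit(train_images, train_labels, epochs=40,
#           validation_data=(val_images, val_labels),
#           callbacks=[LearningRateScheduler(step_decay)])
```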
To sum up, we attribute the stable and consistent performance of our model on the three COVID-19 CXR image datasets to three main reasons. First, our model leverages the small filter size of the VGG-16 model, which is appropriate for capturing the interesting regions of CXR images. Second, the 4th pooling layer used in our method is well suited to CXR images, because CXR images are neither very specific nor very general compared to ImageNet [23], on which the VGG-16 model was pre-trained. Third, the convolution block lets us capture more of the interesting regions of CXR images, which bolsters performance.

Table 6: Comparison with other fine-tuned models based on pre-trained deep learning models, using average classification accuracy (%) and training parameters (in millions) on the three datasets (D1, D2, and D3).

In this subsection, we study the convergence of our method on the three datasets (D1, D2, and D3), shown in Figs. 5, 6, and 7, respectively. To examine the stability of the learning pattern, we increased the number of epochs from 40 to 60 in our model. Note that we present the representative model accuracy/loss plot for one set from each dataset. From Figs. 5, 6, and 7, we observe that the gap between training and validation accuracy/loss on D1 is lower than on D2 and D3. Furthermore, we observe that our method has converged and shows a good fit on all datasets. Hence, this result indicates an ability to generalize in the prediction of CXR images during classification.

In this subsection, we perform a class-wise analysis of our proposed method on all datasets (D1, D2, and D3). For this, we use precision (3), recall (4), and f-score (5) for each class on the corresponding dataset, defined as follows:

$$\text{Precision} = \frac{tp}{tp + fp} \tag{3}$$

$$\text{Recall} = \frac{tp}{tp + fn} \tag{4}$$

$$\text{F-score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{5}$$

where $fp$, $tp$, and $fn$ denote false positives, true positives, and false negatives, respectively. The results are listed in Tables 8, 9, and 10 for D1, D2, and D3, respectively. To report them, we average the precision, recall, and f-score over all five sets of the corresponding dataset. Observing the three tables, our method produces the highest precision for the Covid class on two datasets, and the second-best precision for this class on the third dataset. In the meantime, our method also achieves significant performance for the other classes in terms of recall and f-score on all datasets. Furthermore, we use confusion matrices to examine the distribution of predicted images across classes, shown in Fig. 8 for D1, D2, and D3. Looking closely at the three confusion matrices in the figure, we notice that our method classifies images into their corresponding classes at a high rate on each dataset.
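As a minimal illustration of the class-wise metrics in Eqs. (3)-(5), the following sketch computes per-class precision, recall, and f-score from a confusion matrix; the matrix values here are hypothetical, not the paper's results.

```python
import numpy as np

def classwise_metrics(cm):
    """Per-class precision, recall, and f-score from a confusion matrix
    whose rows are true classes and columns are predicted classes."""
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp  # predicted as class c but actually another class
    fn = cm.sum(axis=1) - tp  # actually class c but predicted as another class
    precision = tp / (tp + fp)                               # Eq. (3)
    recall = tp / (tp + fn)                                  # Eq. (4)
    fscore = 2 * precision * recall / (precision + recall)   # Eq. (5)
    return precision, recall, fscore

# Hypothetical 3-class confusion matrix (e.g., Covid / Normal / Pneumonia):
cm = np.array([[48, 1, 1],
               [2, 45, 3],
               [1, 4, 45]])
p, r, f = classwise_metrics(cm)
print(p.round(3), r.round(3), f.round(3))
```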
In this subsection, we analyze the visual maps produced by the convolution and attention modules for five different classes (covid, no findings, normal, pneumonia bacteria, and pneumonia viral). For this, we utilize one of the sets (Set 1) from dataset D3; we choose D3 for the qualitative analysis because it has a higher number of categories than the remaining datasets used in our work. The visualization maps are presented in Fig. 9. Observing the visualization maps for the five classes, we notice that the convolution and attention modules impart complementary information, indicating that both are equally important for better class separation. Specifically, we observe that the attention map mostly highlights defects in the upper region of the lungs, as seen in the figure for the covid, no findings, pneumonia bacteria, and pneumonia viral classes. Since the attention module identifies local salient regions, we believe it has detected the local salient regions deteriorated by covid and other diseases in the top regions of the lungs. The convolution map, by contrast, identifies defects in the lower and middle regions of the lungs. Since the convolution module highlights defects over the global region, unlike the attention module, we conjecture that it has detected the salient regions with potential defects in multiple parts (lower and middle) of the lungs. Meanwhile, we notice that normal images produce no heatmap from either the convolution or the attention module. This is expected, because such images are clear and easily separable for classification.

In this subsection, we perform an ablative analysis of our method on D1. For this, we study the contributions of the attention module, the convolution module, and their combination in our method, using average classification accuracy and computational complexity, which are listed in Table 11. Observing the table, we notice that the combination of both modules (attention and convolution) outperforms each module alone. We therefore speculate that, although the attention module is not strong on its own, it bolsters classification performance when working jointly with the convolution module.

Meanwhile, we analyze the complexity of each module (convolution, attention) used in our method. Let $l$, $c$, $s$, and $k$ represent the corresponding deep learning layer, the number of input channels, the spatial size of the filter, and the spatial size of the output feature map, respectively. First, the convolution module at layer $l$ has complexity $O(c_{l-1} \cdot s_l^2 \cdot c_l \cdot k_l^2)$. Note that the VGG-16 model itself contains a stack of convolution layers, without which we cannot perform the classification task; thus, VGG-16 without the additional convolution and attention modules incurs a similar complexity. Second, the attention module, which consists of max pooling and average pooling followed by a convolution operation, has complexity $O(2 \cdot c_{l-1} \cdot k_l^2) + O(c_{l-1} \cdot s_l^2 \cdot k_l^2)$. Importantly, the attention module has a lower complexity than the convolution module because it primarily requires pooling operations. Last, our combined module (attention module plus convolution module) incurs the combined complexity of the convolution and attention modules. Note that these computational complexities are similar on the other datasets as well.

In this paper, we proposed a novel deep learning model, called attention-based VGG-16, that uses an attention module on top of VGG-16 to classify COVID-19 CXR images. We evaluated our method on three COVID-19 CXR datasets. The evaluation results indicate that our method is efficient not only in terms of classification accuracy but also in the number of training parameters. From these results, we conclude that our proposed method is well suited to COVID-19 CXR image classification.
However, the performance of our proposed method could be further improved by the following two techniques. First, our method does not use offline data augmentation in the experiments; the use of extensive augmentation techniques such as a GAN or a convolutional autoencoder before training could improve performance further. This would also increase the number of CXR images, mitigating the over-fitting problem during training. Second, the use of other pre-trained deep learning models with a smaller filter size could improve performance on CXR images, because a smaller filter size helps extract more discriminating ROIs from CXR images.

Funding: There was no financial support for this work.

Availability of data and materials: The datasets are publicly available. The source code of our proposed method is available online at https://bitbucket.org/chirudeakin/covidattention/src/master/

References

[1] Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) and coronavirus disease-2019 (COVID-19): the epidemic and the challenges
[2] Game consumption and the 2019 novel coronavirus
[3] Diagnostic testing for the novel coronavirus
[4] A review of coronavirus disease-2019 (COVID-19)
[5] First case of 2019 novel coronavirus in the United States
[6] The first two cases of 2019-nCoV in Italy: where they come from
[7] The first 2019 novel coronavirus case in Nepal
[8] Forecasting of COVID-19 per regions using ARIMA models and polynomial functions
[9] CoroNet: a deep neural network for detection and diagnosis of COVID-19 from chest X-ray images
[10] Automated detection of COVID-19 cases using deep neural networks with X-ray images
[11] Within the lack of chest COVID-19 X-ray dataset: a novel detection model based on GAN and deep transfer learning
[12] Classification of COVID-19 patients from chest CT images using multi-objective differential evolution-based convolutional neural networks
[13] COVID-19 pneumonia diagnosis using a simple 2D deep learning framework with a single chest CT image: model development and validation
[14] Towards an efficient deep learning model for COVID-19 patterns detection in X-ray images
[15] Application of deep learning for fast detection of COVID-19 in X-rays using nCOVnet
[16] Computer-aided detection in chest radiography based on artificial intelligence: a survey
[17] Very deep convolutional networks for large-scale image recognition
[18] CBAM: convolutional block attention module
[19] Deep network architecture for large scale visual detection and recognition issues
[20] Grad-CAM: visual explanations from deep networks via gradient-based localization
[21] HDF: hybrid deep features for scene image representation
[22] A novel transfer learning based approach for pneumonia detection in chest X-ray images
[23] ImageNet: a large-scale hierarchical image database
[24] Places: an image database for deep scene understanding
[25] Modeling the shape of the scene: a holistic representation of the spatial envelope
[26] Gist of the scene
[27] Distinctive image features from scale-invariant keypoints
[28] Histograms of oriented gradients for human detection
[29] Beyond bags of features: spatial pyramid matching for recognizing natural scene categories
[30] An efficient deep learning approach to pneumonia classification in healthcare
[31] Ensemble learning in systems of neural networks for detection of abnormal shadows from X-ray images of lungs
[32] Automatic detection of coronavirus disease (COVID-19) using X-ray images and deep convolutional neural networks
[33] ImageNet classification with deep convolutional neural networks
[34] Going deeper with convolutions
[35] Deep residual learning for image recognition
[36] Generative adversarial nets
[37] Xception: deep learning with depthwise separable convolutions
[38] YOLO9000: better, faster, stronger
[39] EfficientNet: rethinking model scaling for convolutional neural networks
[40] Deep learning system for COVID-19 diagnosis aid using X-ray pulmonary images
[41] Lung cancer cell identification based on artificial neural network ensembles
[42] False-positive reduction on lung nodules detection in chest radiographs by ensemble of convolutional neural networks
[43] Automatic detection of pneumonia on compressed sensing images using deep learning
[44] Keras
[45] Python
[46] Rethinking the inception architecture for computer vision
[47] Densely connected convolutional networks
[48] Inception-v4, Inception-ResNet and the impact of residual connections on learning
[49] MobileNets: efficient convolutional neural networks for mobile vision applications

Conflicts of interest: We confirm that there are no known conflicts of interest.

Publisher's note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.