key: cord-0663242-kjb6fx9k
authors: He, Xin; Wang, Shihao; Ying, Guohao; Zhang, Jiyong; Chu, Xiaowen
title: Efficient Multi-objective Evolutionary 3D Neural Architecture Search for COVID-19 Detection with Chest CT Scans
date: 2021-01-26
journal: nan
DOI: nan
sha: 12683cda7a4bb9e7fb7a74f384ac23ae71e0e6bb
doc_id: 663242
cord_uid: kjb6fx9k

The COVID-19 pandemic has been spreading globally for months. Because of its long incubation period and high testing cost, there is no sign that its spread is slowing down, and a faster testing method is urgently needed. This paper proposes an efficient Evolutionary Multi-objective neural ARchitecture Search (EMARS) framework, which automatically searches for 3D neural architectures in a well-designed search space for COVID-19 chest CT scan classification. Within the framework, we use a weight-sharing strategy to significantly improve search efficiency and finish the search process in 8 hours. We also propose a new objective, namely potential, which improves the robustness of the search process. With the objectives of accuracy, potential, and model size, we find a lightweight model (3.39 MB) that outperforms three human-designed baseline models, i.e., ResNet3D101 (325.21 MB), DenseNet3D121 (43.06 MB), and MC3_18 (43.84 MB). In addition, our search space allows the class activation mapping algorithm to be easily embedded into all searched models, which provides interpretability for medical diagnosis by visualizing the regions on which the models base their judgment and thereby locating the lesion areas.

Ten months after the global pandemic started, COVID-19 is still a significant problem for most countries in the world. Besides the 14-day incubation period, which makes it hard to trace and quarantine patients, accurate diagnosis is also hard to perform. One of the most widely used testing methods is reverse transcription-polymerase chain reaction (RT-PCR) [1] for viral testing; however, it is relatively slow and expensive, and it requires professionals, reagents, and specialized devices to perform. To facilitate rapid COVID-19 diagnosis, many researchers have turned to deep learning (DL) techniques. However, most of the proposed models are designed manually, which requires abundant experience and expertise from the designer. This hinders the broad application of DL to COVID-19 diagnosis. At the same time, a neural network needs to perform well on multiple metrics (e.g., precision and sensitivity) before it is applied to actual disease diagnosis. However, in practice, no human expert can guarantee to find the optimal neural architecture.

This paper proposes an efficient evolutionary multi-objective neural architecture search (EMARS) method, which can automatically search 3D neural networks for COVID-19 detection. Our method achieves 89.74% sensitivity on the Clean-CC-CCII dataset [2], which is significantly higher than the average sensitivity of antigen tests (56.2%) [3], similar to radiologists' average diagnostic sensitivity with chest CT (92%) [4], and only slightly lower than the average sensitivity of RT-PCR (95.2%) [3].

The neural architecture search (NAS) technique is a feasible and promising solution to automate and accelerate model design, as many studies have experimentally demonstrated that NAS-designed models outperform handcrafted models. There are four main classes of NAS methods: reinforcement learning (RL)-based methods [5]-[8], gradient descent (GD)-based methods [9]-[11], surrogate model-based optimization (SMBO) methods [12], and evolutionary algorithm (EA)-based methods [13]-[19]. Many early studies focus on searching for neural architectures that achieve higher performance (e.g., classification accuracy), regardless of resource consumption and model size. For example, Zoph et al. [5] were the first to propose RL-based NAS methods and successfully found models outperforming state-of-the-art (SOTA) manually designed models, but their search took 800 GPUs and 22,400 GPU days, which is unaffordable for individuals and small companies. The subsequent GD-based methods, such as DARTS [9], significantly improve search efficiency. However, as stated in [10], DARTS tends to select simpler operations (e.g., skip-connect) in the later stage of the search, resulting in a lack of diversity in the searched models. EA-based methods can escape local optima and find promising models, but they introduce randomness during the search stage. They also demand huge computational resources and time, as typical EA-based NAS methods need to train each individual for several epochs before obtaining its validation results.

In this paper, we propose an Evolutionary Multi-objective neural ARchitecture Search (EMARS) framework to resolve the above issues. Many previous NAS studies search for only one cell and repeat the searched cell a fixed number of times to construct the final model. Inspired by [8], [20], our search space is factorized into multiple searchable cells and blocks (shown in Fig. 1), which guarantees model diversity in our work. Besides, we use mobile inverted bottleneck convolutions (MBConv) as the candidate operations. MBConv requires less computation than standard convolution modules and has been shown to be effective in improving model performance. In EMARS, individuals are child architectures derived from the same SuperNet, which is initialized at the beginning of the algorithm. In other words, all individuals share the weights of the SuperNet, which significantly improves the efficiency of our evolutionary algorithm. The search process of 100 epochs can be finished in about 8 hours using 4 Nvidia Tesla V100 GPUs.

Many multi-objective NAS methods [8], [11], [18], [19] consider only accuracy and model size as objectives. In this work, we introduce a new objective, namely potential, into the NSGA-III algorithm [21]. By applying EMARS to three publicly available COVID-19 CT scan datasets (Clean-CC-CCII [22], MosMedData [23], and Covid-CTset [24]), we experimentally demonstrate that potential greatly improves the robustness of the search process. According to the experimental results, EMARS can effectively find a series of neural architectures with higher accuracy than the baseline models (ResNet3D101 [25], DenseNet3D121 [26], and MC3_18 [25]), and these searched architectures cover a wide range of model sizes. Furthermore, medical diagnoses generally require interpretable decisions, so we apply the class activation mapping (CAM) [27] algorithm to our EMARS series models to visualize the judgment of the model, which can help doctors understand the chest CT scan while verifying the validity of that judgment.

The contributions of our work are summarized as follows:

1) We design a 3D search space, which is factorized into multiple searchable cells and blocks and hence increases model diversity. We also use the weight-sharing strategy, which significantly improves search efficiency. The search process of 100 epochs can be finished in about 8 hours.
2) We propose an EA-based NAS framework, namely EMARS, which is scalable, i.e., multiple objectives can easily be added to EMARS for optimization. Specifically, we introduce a new objective, called potential, and experimentally show that it improves the robustness of the search process.
3) With the proposed search space and EMARS, we find a series of neural architectures, all of which outperform three baseline models with a much smaller model size.
4) In our search space, a global average pooling layer is inserted before the fully connected layer; therefore, the class activation mapping (CAM) [27] algorithm can be easily embedded into our searched models, which can help doctors locate the discriminative lesion areas on the CT scan images.

The rest of the paper is organized as follows. Section II describes the related work. Section III introduces the search space for building 3D neural architectures. Section IV illustrates our search algorithm, including the warm-up, selection, crossover, and mutation processes. We introduce the experimental implementations in Section V, and present and analyze the results in Section VI. Section VII concludes the paper and proposes future research directions.

With the rapid growth of computational power, DL has become a popular way to assist the diagnosis of X-ray or CT images [28]. The growing number of publicly available datasets also makes it easier for researchers to apply deep neural networks to different tasks. For COVID-19, there are two kinds of data: CT images and X-ray images. CT is a 3D format that contains the textural information of a part of the human body and is composed of many cross-sectional slice images, whereas X-ray is a 2D format that contains overlapping textural information of the human body. Several studies apply deep learning to X-ray datasets [29]-[31]. Ghoshal et al. [29] achieved 88.39% accuracy on their X-ray dataset with a DL model, while Narin et al. [31] achieved 98% accuracy on a smaller X-ray dataset. Experiments using CT datasets are more common than those using X-rays. Two kinds of CT data, 2D and 3D, are used for deep learning classification of COVID-19. According to [32], most existing studies focus on a single representative slice from a CT scan volume for COVID-19 detection [33]-[37]. However, since 3D volumes contain more information, which is CT's advantage by design, more experiments choose to use 3D CT volumes for classification and segmentation tasks [22]-[24], [38], [39], in which Zheng et al. [38], Li et al. [39], Morozov et al. [23], and Zhang et al. [22] designed 3D convolutional networks to analyze 3D CT volumes.

Recently, there has been growing interest in the NAS technique, as it has been applied to many areas and has outperformed human-designed models [40], [41]. Arguably, the studies of [5], [7] mark the beginning of NAS, as they demonstrated that RL-based NAS methods could effectively discover good architectures. ENAS [6] accelerates the search process by adopting a parameter-sharing strategy, in which all child architectures are regarded as sub-graphs of a super-net; this enables these architectures to share parameters, obviating the need to train each child model from scratch. Besides RL-based methods, several other methods have been proposed to further improve NAS efficiency.

SMBO methods evaluate the searched models with a surrogate function instead of metrics from trained architectures and thus shorten the search time. For example, Liu et al. [12] learned a surrogate model to guide the search, and their method is five times more efficient than the RL-based method [7]. Liu et al. [9] were among the first to propose a GD-based method, namely DARTS, which uses the softmax function to relax the discrete search space and significantly improves search efficiency. However, according to Liang et al. [10], the performance of DARTS is often observed to collapse, as DARTS tends to select simpler operations (e.g., skip-connect) in the later stage of the search, which may result in a lack of diversity in the searched models.

Evolutionary algorithms (EAs) are inspired by biological evolution. A new model (also known as an individual) is evolved from a previous model with operations including selection, crossover, and mutation. The early EA-based methods are compute-intensive, e.g., AmoebaNet [16] took 450 GPUs and 3,150 GPU days for searching. CARS [17] significantly improves search efficiency by introducing the weight-sharing strategy into the evolutionary algorithm. MoreNAS [18] combines RL and EA to obtain promising architectures for multi-objective optimization. LemonadeNAS [42] encodes networks as functions, and each network can be generated by network morphism operators.

For multi-objective tasks, the target is complex, and it is hard to weigh different objectives against each other. Some existing methods try to reduce the multi-objective problem to a single-objective one. For example, MONAS [19] maps multiple objectives to a single one by a linear combination, but this may lead to suboptimal solutions. Since it is difficult for a single network to be the best on every objective, architectures on the Pareto front are preferred. Yang et al. [17] introduced an improved method based on NSGA-III [21], namely pNSGA, to obtain optimal architectures under multiple objectives. Besides, most multi-objective NAS methods [8], [11], [18], [19] consider only accuracy and model size as objectives. In this paper, we propose a new objective, namely potential, which greatly improves the robustness of the search process.

A well-designed search space greatly benefits the final model performance. The traditional cell-based search space [6], [9] has several problems: 1) the cell structure is inefficient for inference, as it is a non-regularized combination of candidate operations; 2) the final model is constructed by repeating the searched cell and thus lacks diversity. To this end, we adopt the idea of factorizing each network into cells and blocks [8], as shown in Fig. 1. The details of our search space are introduced as follows.

Each block is a searchable module, which is selected from a predefined set of candidate operations. To find a lightweight and high-quality 3D model for COVID-19 detection, we add a series of mobile inverted bottleneck convolutions (MBConv) [20] to the candidate operation set. As shown in Fig. 2, MBConvk_e, with kernel size k and expansion ratio e, comprises three sub-modules: 1) a 3D point-wise (1×1×1) convolution, which expands the number of channels of the output feature to e times that of the input feature; 2) an intermediate expansion layer, which uses a lightweight 3D depthwise convolution with kernel size k × k × k to extract features and introduce non-linearity; 3) another 3D point-wise (1×1×1) convolution, which restores the output feature size to the input feature size. In MBConv, most convolutional operations are followed by a 3D batch normalization and a ReLU6 activation function [43]; the last convolution has no ReLU6.
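To make the "lightweight" claim concrete, the following Python sketch counts the convolution weights of one MBConv block and of a standard 3D convolution with the same number of input and output channels. It is only a back-of-the-envelope illustration: the function names and the values of the channel count, k, and e are assumptions chosen for the example, and biases and batch-normalization parameters are ignored.

```python
def mbconv3d_weights(channels: int, k: int, e: int) -> int:
    """Convolution weights in one MBConv block (no bias, no batch-norm terms)."""
    hidden = channels * e
    expand = channels * hidden      # 1x1x1 point-wise convolution, C -> e*C
    depthwise = hidden * k ** 3     # k x k x k depthwise convolution, one filter per channel
    project = hidden * channels     # 1x1x1 point-wise convolution, e*C -> C
    return expand + depthwise + project


def conv3d_weights(channels: int, k: int) -> int:
    """Weights of a standard k x k x k 3D convolution with C input and C output channels."""
    return channels * channels * k ** 3


# Illustrative values only (assumed, not the searched configuration).
c, k, e = 24, 3, 3
print("MBConv:", mbconv3d_weights(c, k, e), "standard conv:", conv3d_weights(c, k))
```

For these example values the MBConv block needs roughly a third of the weights of the standard convolution; the ratio depends on the channel count, the kernel size, and the expansion ratio.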

Each cell is composed of a calibration block and a number of searchable blocks that may differ from cell to cell. The calibration block is a 3D 1 × 1 × 1 point-wise convolution that resolves feature dimension mismatches; therefore, all subsequent blocks have stride 1. As shown in Fig. 1, in our search space, the i-th cell has B_i searchable blocks, and the blocks within a cell can also differ from each other. This design paradigm therefore enables model diversity. In our experiments, we empirically choose the following set of candidate operations:

Identity indicates the skip connection operation [44] , which is equivalent to reducing one block.

As shown in Fig. 1, the network is specified by several factors: the cell structures, the number of filters F in the stem layer, the number of cells N, the number of blocks B = [B_1, ..., B_N], and the stride of each calibration block S = [S_1, ..., S_N] for each cell. In this work, we fix F, N, B, and S during the search stage; thus the network can be regarded as a SuperNet N that covers all possible cell structures. Besides, the stem layer is a fixed convolutional operation that processes the input. The global average pooling (GAP) [45] layer is inserted before the fully connected (FC) layer; therefore, the network can handle variable input sizes.

Although EA can effectively solve multi-objective problems that other optimization algorithms struggle with, it usually consumes huge computational resources [13], [14], [42]. Therefore, inspired by [6], [9], [17], we adopt a weight-sharing strategy to improve efficiency.

An individual architecture, denoted by N(α), is sampled from the SuperNet N, where α is a set of one-hot sequences that encodes the individual architecture. Each one-hot sequence is decoded as a candidate operation. For example, as shown in Fig. 1, [0, 0, 0, 0, 0, 1] is decoded as the identity (skip-connect) operation. All individuals share the weights W of the SuperNet, and the weights used by an individual N(α) are denoted by W(α).
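The decoding step can be sketched in a few lines of Python. The operation names below are hypothetical placeholders (only the position of the identity operation follows the Fig. 1 example), and the length-6 encoding simply mirrors that example.

```python
# Hypothetical operation names: only "identity" in the last position follows
# the Fig. 1 example; the other entries are placeholders.
CANDIDATE_OPS = ["op_0", "op_1", "op_2", "op_3", "op_4", "identity"]


def decode(alpha):
    """Map each one-hot sequence in alpha to the name of its candidate operation."""
    ops = []
    for one_hot in alpha:
        if len(one_hot) != len(CANDIDATE_OPS) or sum(one_hot) != 1:
            raise ValueError(f"not a valid one-hot sequence: {one_hot}")
        ops.append(CANDIDATE_OPS[one_hot.index(1)])
    return ops


# An individual with two searchable blocks.
alpha = [[1, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 1]]
print(decode(alpha))
```

Here `alpha` plays the role of α: one one-hot sequence per searchable block, each selecting exactly one candidate operation.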

The loss of an individual is

$$\mathcal{L}_{\alpha} = H(\mathcal{N}(\alpha)(X), Y),$$

where H is the loss function, X is the input data, and Y is the label data. The gradient of the individual's weights W(α) can then be calculated as

$$dW(\alpha) = \frac{\partial \mathcal{L}_{\alpha}}{\partial W(\alpha)}. \tag{1}$$

Since the weights W of the SuperNet are shared among all individual architectures, the gradient of W can be calculated as the accumulation of the gradients of all individuals:

$$dW = \frac{1}{P}\sum_{i=1}^{P} dW(\alpha_i), \tag{2}$$

where P is the size of the population. In [17], the authors used a mini-batch of architectures to obtain an unbiased approximation to Eq. 2, detailed as Eq. 3:

$$dW \approx \frac{1}{B}\sum_{j=1}^{B} dW(\alpha_j), \tag{3}$$

where B is the number of individuals in a mini-batch and B < P. In our experiments, we find that B = 1 works well, i.e., we can update W using the gradient from a single individual sampled sequentially from the population.
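The effect of updating shared weights with one individual at a time (B = 1) can be illustrated with a deliberately tiny Python sketch. Everything here is an assumption made for illustration: the "network" is a product of scalar weights, the loss is a squared error, and the helper names are hypothetical. The point is only that each step touches the weights of the operations selected by the current individual and leaves the rest of the shared weights unchanged.

```python
def forward(weights, alpha, x):
    """Toy 'network': multiply x by the weight of the operation chosen in each block."""
    y = x
    for block, op in enumerate(alpha):
        y *= weights[block][op]
    return y


def update_shared_weights(weights, alpha, x, target, lr=0.1):
    """One SGD step on the shared weights using a single individual (B = 1)."""
    err = forward(weights, alpha, x) - target      # derivative of 0.5 * err**2 w.r.t. the output
    grads = []
    for block in range(len(alpha)):
        others = x
        for j, op in enumerate(alpha):
            if j != block:
                others *= weights[j][op]
        grads.append(err * others)                 # d(loss) / d(weights[block][alpha[block]])
    for block, op in enumerate(alpha):
        weights[block][op] -= lr * grads[block]    # unselected operations are left untouched
    return 0.5 * err ** 2


# Shared weights W: 2 blocks x 2 candidate operations, all initialised to 1.0.
W = [[1.0, 1.0], [1.0, 1.0]]
population = [[0, 1], [1, 1], [0, 0]]              # each individual picks one operation per block
for alpha in population:                           # sequential sampling, one individual per step
    update_shared_weights(W, alpha, x=1.0, target=2.0)
print(W)
```

Individuals that select the same operation in a block update the same entry of `W`, which is what allows training effort to be shared across the population.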

Our evolutionary search algorithm is based on NSGA-III [21] and is composed of selection, crossover, mutation, and update steps. Alg. 1 summarizes the detailed steps of our evolutionary algorithm for searching 3D neural architectures. In our experiments, the SuperNet weights W are randomly initialized and shared among all individuals; if we start evolving from the very beginning, the first set of sampled architectures receives more training. In this case, these architectures may dominate in the later stage and compromise the effectiveness of the search. Similar to [17], [46], we adopt uniform sampling to treat all individual architectures equally during the warm-up stage. In our experiments, all searchable blocks are selected from eight different operations, and each operation is sampled with a probability of 1/8. During the warm-up stage, many individuals are sampled and trained, and afterwards the top P best-performing individual architectures are collected as the initial population for evolution.
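Uniform sampling during warm-up can be sketched as follows. The helper name and the number of blocks used in the example are illustrative assumptions; only the eight candidate operations and the 1/8 probability come from the text.

```python
import random

NUM_OPS = 8  # eight candidate operations, each sampled with probability 1/8


def sample_individual(num_blocks, num_ops=NUM_OPS, rng=random):
    """Draw one architecture: an independent, uniformly chosen one-hot sequence per block."""
    alpha = []
    for _ in range(num_blocks):
        one_hot = [0] * num_ops
        one_hot[rng.randrange(num_ops)] = 1  # every operation is equally likely
        alpha.append(one_hot)
    return alpha


rng = random.Random(0)                          # fixed seed so the example is repeatable
print(sample_individual(num_blocks=5, rng=rng))  # 5 blocks is an arbitrary example size
```

Because every block is drawn independently and uniformly, no architecture is favoured before the evolution starts.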

As Alg. 1 illustrates, all individuals from the population A are trained equally on batches of training data before the selection step. Then, the top K best-performing individuals are selected for the following evolution steps. The selection process allows us to preserve strong individuals while eliminating weak ones. The most commonly used selection method is to select individuals based on their fitness, such as validation accuracy [14], [47]. In practice, we are concerned not only with model accuracy but also with other metrics such as model size. We therefore use NSGA-III [21] to select promising individuals along the Pareto front of the multiple objectives.

In practice, we denote {N(α_1), ..., N(α_P)} as a population of individual architectures and T = {T_1, ..., T_M} as the set of M objectives. We want to reduce the number of architectures by replacing architectures with ones that dominate them. Within {N(α_1), ..., N(α_P)}, N(α_i) dominates N(α_j) when N(α_i) is not worse than N(α_j) on every objective and is strictly better on at least one. Formally, with every objective written so that smaller values are better:

$$T_k(\mathcal{N}(\alpha_i)) \le T_k(\mathcal{N}(\alpha_j)) \;\; \forall k \in \{1, \dots, M\}, \qquad T_k(\mathcal{N}(\alpha_i)) < T_k(\mathcal{N}(\alpha_j)) \;\; \exists k \in \{1, \dots, M\}.$$

This definition guarantees that N(α_i) is better than N(α_j) on at least one metric and at least as good on all the others. Thus, we can replace N(α_j) with N(α_i) and ensure that the metrics improve during the evolution process.
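The dominance relation can be checked with a short Python sketch. It assumes that every objective is expressed so that smaller is better (for example, 1 − accuracy and model size), and it extracts only the plain non-dominated set; it is not an implementation of NSGA-III, which additionally uses reference points to keep the selected individuals diverse. The objective values are made up.

```python
def dominates(a, b):
    """True if objective vector a dominates b (all objectives are to be minimized)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(points):
    """Indices of the points that are not dominated by any other point."""
    return [
        i for i, p in enumerate(points)
        if not any(dominates(q, p) for j, q in enumerate(points) if j != i)
    ]


# Made-up objective vectors: (1 - validation accuracy, model size in MB).
population = [(0.12, 20.0), (0.10, 25.0), (0.15, 4.0), (0.16, 5.0), (0.12, 22.0)]
print(pareto_front(population))
```

In this toy population the fourth individual is dominated by the third and the fifth by the first, so only the first three remain on the front.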

Most existing multi-objective NAS methods [8], [11], [18], [19] consider only accuracy and model size, which may cause a Matthew effect: models that have received relatively more training tend to achieve higher validation accuracy and are thus more likely to be trained further, while other models lose the opportunity to compete. Therefore, we propose a new objective, namely potential (P), which predicts the performance of an individual from its historical performance. The potential of an individual is the slope of the line obtained by applying a linear fit to the individual's historical performance:

$$P = \frac{\sum_{s=1}^{S} (X_s - \bar{X})(Y_s - \bar{Y})}{\sum_{s=1}^{S} (X_s - \bar{X})^2},$$

where P indicates the individual potential, X ∈ R^{S×1} stores the epoch indices at which the individual is sampled, Y ∈ R^{S×1} stores the corresponding validation accuracies, S is the number of times the individual is sampled, and X̄ and Ȳ are the means of the entries of X and Y.
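As a sketch of how such a slope could be computed, the following Python function fits an ordinary least-squares line to (epoch, validation accuracy) pairs and returns its slope. The function name and the example histories are illustrative; how individuals sampled fewer than twice are handled is not specified in the text, so raising an error here is an assumption.

```python
def potential(epochs, accuracies):
    """Slope of the least-squares line through (epoch, validation accuracy) pairs."""
    s = len(epochs)
    if s != len(accuracies) or s < 2:
        # Assumption: at least two observations are needed to fit a line.
        raise ValueError("need at least two (epoch, accuracy) pairs")
    mean_x = sum(epochs) / s
    mean_y = sum(accuracies) / s
    sxx = sum((x - mean_x) ** 2 for x in epochs)
    if sxx == 0:
        raise ValueError("epoch indices must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(epochs, accuracies))
    return sxy / sxx


# Made-up histories: one improving individual and one that is getting worse.
print(potential([3, 10, 24], [0.52, 0.61, 0.70]))
print(potential([5, 12, 30], [0.66, 0.64, 0.60]))
```

A positive slope marks an individual whose accuracy is still rising and that may deserve more training even if its current accuracy is modest.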

In our experiments, we consider three objectives: accuracy, model size, and potential. Therefore, we can generate three Pareto stages by applying the non-dominated sorting algorithm to each objective. Then we merge three Pareto stages to get the final Pareto stage.

After selection, the K best-performing individuals are used for crossover, which exchanges architecture encodings between two different parent individuals to produce recombinant individuals. Since we use one-hot encodings to represent the categorical candidate operations, crossover operates on whole one-hot encodings rather than on individual bits. Each pair of corresponding one-hot encodings of two parent individuals is crossed over with a probability of p_c. Fig. 3(a) presents an example in which only one crossover occurs between two parent individuals, both of which consist of three one-hot encodings.

Fig. 3. Examples of crossover and mutation. The basic unit for both crossover and mutation is the one-hot encoding sequence, which represents a candidate operation. The length of the one-hot sequence indicates the total number of candidate operations (here 6).

Since crossovers are performed between two promising parent individuals, the child individuals can inherit their good architecture encodings. In other words, crossovers are primarily for exploitation, while mutations are usually for exploration. The basic unit for mutation is also the one-hot encoding. In Fig. 3(b), the second one-hot encoding of the parent individual is mutated.
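Crossover and mutation at the level of whole one-hot sequences can be sketched as below. The function names are hypothetical, and drawing the mutated operation uniformly (which may re-select the same operation) is an assumption, since the text does not say how the replacement operation is chosen.

```python
import random


def crossover(parent_a, parent_b, p_c, rng=random):
    """Swap whole one-hot sequences between two parents, each position with probability p_c."""
    child_a, child_b = list(parent_a), list(parent_b)
    for i in range(len(child_a)):
        if rng.random() < p_c:
            child_a[i], child_b[i] = child_b[i], child_a[i]
    return child_a, child_b


def mutate(parent, p_m, rng=random):
    """Replace each one-hot sequence, with probability p_m, by a freshly drawn one."""
    child = []
    for one_hot in parent:
        if rng.random() < p_m:
            new = [0] * len(one_hot)
            new[rng.randrange(len(one_hot))] = 1  # assumption: uniform re-draw
            child.append(new)
        else:
            child.append(list(one_hot))
    return child


rng = random.Random(0)
a = [[1, 0, 0, 0, 0, 0], [0, 1, 0, 0, 0, 0], [0, 0, 1, 0, 0, 0]]
b = [[0, 0, 0, 0, 0, 1], [0, 0, 0, 0, 1, 0], [0, 0, 0, 1, 0, 0]]
print(crossover(a, b, p_c=0.3, rng=rng))
print(mutate(a, p_m=0.2, rng=rng))
```

Because a one-hot sequence is always moved or replaced as a unit, every child remains a valid encoding with exactly one operation per block.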

In this section, we first describe three datasets used in our experiments. Then, we introduce the implementation details of the baseline experiment and our EMARS algorithm.

In this paper, we use three publicly available datasets: Clean-CC-CCII [32], MosMedData [23], and COVID-CTset [24], all of which provide 3D chest CT scans. The statistics of the three datasets are presented in Table I. Clean-CC-CCII consists of three classes: NCP (novel coronavirus pneumonia), CP (common pneumonia), and Normal, while both MosMedData and COVID-CTset contain only NCP and Normal. Besides, each patient may have several CT scans, and each scan is composed of multiple grayscale slice images. For Clean-CC-CCII and MosMedData, the image format is PNG (portable network graphics), while the image format of COVID-CTset is 16-bit TIFF (tagged image file format). Notably, in this work, the basic classification unit is the scan rather than the slice image.

We conducted multiple search experiments on the Clean-CC-CCII dataset. To shorten the experimental time, we set the size of each slice to 128×128 and fix each scan to 32 slices. After finishing the search experiments, we applied the best-performing model to the other two datasets to verify its transferability. To provide a more thorough evaluation of the performance of EMARS-designed models, we processed these two datasets differently. For the MosMedData dataset, each scan consists of 40 slices with a slice resolution of 256×256. For the Covid-CTset, we set the slice size to 512×512, and each scan contains 32 slices.

In our experiments, we use three hand-crafted 3D neural architectures as the baseline models: DenseNet3D121 [26], ResNet3D101 [25], and MC3_18 [25]. We apply transformations to scans, including resize, center-crop, and normalization. For the training set, we randomly apply horizontal and vertical flips. We use the Adam [48] optimizer with a weight decay of 5e-4. The learning rate is initialized to 0.001, and the cosine annealing scheduler [49] is applied to adjust it. The three baseline models are trained for 200 epochs. The loss function H is cross-entropy.

EMARS includes two stages: the search stage and the retraining stage. The experimental configuration of each stage is as follows:

1) Search stage: The SuperNet comprises six cells, and the number of searchable blocks in each cell is [4, 4, 4, 4, 4, 1]. Each block is selected from eight candidate operations (see Section III-C). The blocks within the same cell keep the same number of input and output channels, and all blocks have a stride of 1. In other words, the spatial dimensions of the output features of each cell are determined by the calibration block. Here, we empirically set the number of output channels of each cell to [24, 40, 80, 96, 192, 320], and the stride of the calibration block in each cell to [2, 2, 2, 1, 2, 1]. The stem block is a Conv3D-BN3D-ReLU6 sequential module, with the number of output channels fixed to 32.

Besides using the weight-sharing strategy, we reduce the resolution of the input scan data to 64×64 to improve search efficiency. The training set is divided into a sub-training set D_train and a validation set D_val. To avoid the Matthew effect, we performed the warm-up stage before the evolution and initialized the population with 20 individuals. The SuperNet weights are optimized using the stochastic gradient descent (SGD) optimizer with a momentum of 3e-4. The initial learning rate is 0.001.

Selection. For each epoch in the evolution process, all individual architectures from the population were trained equally with batches of training data before the selection. Using the NSGA-III algorithm, we evaluate the impact of different objectives on the search results, including accuracy, potential, and model size. We set the selection size K in Alg. 1 to 10, which means that we preserved the 10 most promising individuals for exploitation and generated 10 new individuals for exploration.

Fig. 4. Individuals in (a) cover a wider range over the model size dimension, while (b) and (c) show that individuals gradually evolve into large and small models, respectively.

Crossover & Mutation. As shown in Alg. 1, after the selection, we first generated a random probability p ∈ (0, 1). If p > 0.5, we randomly sampled a new individual; otherwise, we performed crossover and mutation on the selected individuals to generate a new individual. The basic unit for both crossover and mutation is the one-hot encoding. The probabilities of crossover and mutation for each one-hot encoding are p_c = 0.3 and p_m = 0.2, respectively.
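The generation step described above can be sketched in Python as follows. This is a simplified reading of the text, not Alg. 1 itself: each block is stored as an operation index rather than a one-hot sequence, the two parents are drawn uniformly from the selected individuals, and the helper names are hypothetical. The values of K, p_c, p_m, and the threshold 0.5 follow the text.

```python
import random

NUM_OPS, P_C, P_M, K = 8, 0.3, 0.2, 10  # values stated in the text


def random_individual(num_blocks, rng):
    """Uniformly sample one operation index per block."""
    return [rng.randrange(NUM_OPS) for _ in range(num_blocks)]


def make_child(selected, rng):
    """Create one new individual from the selected (surviving) individuals."""
    if rng.random() > 0.5:
        return random_individual(len(selected[0]), rng)   # fresh random individual
    a, b = rng.sample(selected, 2)                        # assumption: uniform parent choice
    child = [gb if rng.random() < P_C else ga for ga, gb in zip(a, b)]          # crossover
    return [rng.randrange(NUM_OPS) if rng.random() < P_M else g for g in child]  # mutation


def next_population(selected, rng=None):
    """Keep the selected individuals and add the same number of new ones."""
    rng = rng or random.Random()
    return list(selected) + [make_child(selected, rng) for _ in range(len(selected))]


rng = random.Random(1)
selected = [random_individual(5, rng) for _ in range(K)]  # stand-in for the K selected individuals
population = next_population(selected, rng)
print(len(population))
```

With K = 10 this yields a population of 20, matching the population size used in the search stage.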

2) Retraining stage: After the search stage, we export the top-10 promising individual architectures along the Pareto front of the search stage's objectives. We train each exported individual for a few epochs and finally choose the best-performing one for further retraining. The experimental configuration of retraining is the same as that of the baseline experiment.

In this section, we first introduce our evaluation metrics. Then, we present and analyze the experimental results.

We use several commonly used evaluation metrics to compare the model performance, as follows:

$$\text{Accuracy} = \frac{TN + TP}{TN + TP + FN + FP}$$

Note that the positive and negative cases are assigned to the COVID-19 class and the non-COVID-19 class, respectively. Specifically, TP and TN indicate the number of correctly classified COVID-19 (i.e., NCP) and non-COVID-19 (i.e., CP and Normal) scans, respectively. FP is the number of non-COVID-19 scans wrongly classified as COVID-19, and FN is the number of COVID-19 scans wrongly classified as non-COVID-19. The accuracy is the micro-averaged value over all test data and evaluates the overall performance of the model. Besides, we also use model size as an evaluation metric to compare model efficiency.
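For reference, the metrics reported in the result tables can be computed from the four counts as in the sketch below. The accuracy follows the formula above; precision, sensitivity, and f1-score use their standard definitions, which we assume match the ones used in the tables. The counts are made up, and division by zero is not handled.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, sensitivity (recall), and F1 from confusion-matrix counts."""
    accuracy = (tn + tp) / (tn + tp + fn + fp)
    precision = tp / (tp + fp)        # share of predicted COVID-19 scans that are correct
    sensitivity = tp / (tp + fn)      # share of true COVID-19 scans that are detected
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {"accuracy": accuracy, "precision": precision,
            "sensitivity": sensitivity, "f1": f1}


# Made-up counts for illustration only.
metrics = classification_metrics(tp=90, tn=80, fp=20, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```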

To verify the multi-objective optimization ability of the EMARS algorithm, we set up three experiments with different model size objectives. Each of the three experiments runs for 100 epochs. We visualize the search results of the three experiments in Fig. 4, in which each point indicates the result of one individual. We split the 100 epochs into two parts: the purple points represent the evolution results of the first 50 epochs, and the yellow points represent the evolution results of the last 50 epochs. Fig. 4 shows the distribution of evolution with the objectives of (a) only the validation accuracy, (b) the validation accuracy and large model size, and (c) the validation accuracy and small model size, respectively. We can see that the results of the three experiments are in line with expectations. The yellow points in Fig. 4(a) cover a wider range over the model size dimension, while the yellow points in Fig. 4(b) and (c) tend to be distributed toward large and small model sizes, respectively. Besides, the number of individuals with an accuracy of more than 0.7 in Fig. 4(a) to (c) is 26, 42, and 14, respectively. A possible explanation is that the larger models have more parameters and thus overfit the training dataset during the search stage, which is supported by the retraining results shown in Table II. EMARS-A, EMARS-B, and EMARS-C are the best-performing models selected from Fig. 4(a), (b), and (c), respectively. Although EMARS-B is much larger than EMARS-A and EMARS-C, it achieves the lowest accuracy, sensitivity, and f1-score.

We experimentally demonstrated that the potential objective greatly benefits search stability. Fig. 5(a) shows that, with the objective of accuracy, EMARS can gradually find strong individuals achieving high validation accuracy, but may also waste time on some weak ones. Fig. 5(b) shows that, with the objective of potential, the search process is relatively robust, as the 30th and 70th percentiles are closer to each other than in Fig. 5(a); however, the search in Fig. 5(b) fails to find enough individuals with high accuracy. Therefore, we take accuracy, potential, and small model size as objectives, and Fig. 5(c) shows that the search process remains robust and many individuals reach high accuracy. We further retrained the best-performing individual from each experiment, and their performance is presented in Table III. One can see that EMARS-E, which is searched with the objectives of accuracy, potential, and small model size, has a smaller size and outperforms EMARS-A and EMARS-D in terms of precision, sensitivity, and f1-score.

Table IV compares the performance of the three 3D baseline models and the EMARS series models on the Clean-CC-CCII dataset. Our searched architectures cover an extensive range of model sizes, from 3.39 MB to 20.61 MB. In other words, we can easily select a suitable architecture for different deployment devices. Besides, all EMARS series models outperform the baseline models with a much smaller model size. EMARS-A achieves the best accuracy of 89.67% among all models, and it surpasses ResNet3D101, DenseNet3D121, and MC3_18 by 4.83%, 3.05%, and 4.07%, respectively. EMARS-C achieves the best sensitivity among all models. EMARS-E achieves the best precision and f1-score and has the smallest size (3.39 MB), which is 98.96%, 92.13%, and 92.27% smaller than ResNet3D101, DenseNet3D121, and MC3_18, respectively.
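The size reductions quoted above follow directly from the reported model sizes. The short Python check below recomputes them; the helper name is ours, and the sizes are the ones stated in the abstract.

```python
def size_reduction(small_mb, large_mb):
    """Percentage by which the smaller model undercuts the larger one."""
    return 100.0 * (1.0 - small_mb / large_mb)


emars_e_mb = 3.39
baselines_mb = {"ResNet3D101": 325.21, "DenseNet3D121": 43.06, "MC3_18": 43.84}
for name, mb in baselines_mb.items():
    print(f"EMARS-E is {size_reduction(emars_e_mb, mb):.2f}% smaller than {name}")
```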

Based on the above experimental results, we select two representative architectures, i.e., EMARS-B and EMARS-C, to evaluate the transferability by training them on the MosMedData and Covid-CTset datasets. The experimental configuration is the same as the baseline experiment (see Section V-B). The final results presented in Table V show the transferability of our searched architectures.

CAM is an algorithm that visualizes the regions that the model focuses on and hence provides interpretability for our searched models. We apply it to a 3D CT scan volume from the Clean-CC-CCII dataset using the EMARS-E model. Fig. 6 presents the generated heat maps of some slices. A redder and brighter region has a larger impact on the model's decision to classify the scan as COVID-19.

From the perspective of the scan volume, we can see that some slices have more impact on the model's decision than others. Within a single slice, the areas that EMARS-E focuses on show ground-glass opacity, which has been shown to be a distinctive feature of COVID-19 chest CT images [50]. CAM enables the interpretability of our searched models (e.g., EMARS-E), which can help doctors quickly locate the discriminative lesion areas in a large CT volume.
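The core of CAM is a weighted sum of the last convolutional feature maps, using the fully connected weights of the target class; this is why a global average pooling layer directly before the FC layer is sufficient. The pure-Python sketch below shows that weighted sum on a tiny made-up 3D feature tensor. The function name and the numbers are illustrative, and the upsampling of the map to the input resolution and the colour mapping used for heat maps are omitted.

```python
def class_activation_map(feature_maps, class_weights):
    """Weighted sum over channels of feature maps with shape C x D x H x W."""
    depth = len(feature_maps[0])
    height = len(feature_maps[0][0])
    width = len(feature_maps[0][0][0])
    return [
        [
            [
                sum(w * fm[d][h][x] for w, fm in zip(class_weights, feature_maps))
                for x in range(width)
            ]
            for h in range(height)
        ]
        for d in range(depth)
    ]


# Two channels, one slice of size 2 x 2 (made-up activations).
features = [
    [[[1.0, 0.0], [0.0, 0.0]]],   # channel 0
    [[[0.0, 2.0], [0.0, 0.0]]],   # channel 1
]
covid_weights = [0.5, 1.0]        # hypothetical FC weights of the COVID-19 class
print(class_activation_map(features, covid_weights))
```

Voxels with large values in the resulting map are those that contribute most to the score of the chosen class.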

In this work, we propose a factorized 3D search space in which all child architectures share weights with one another. We introduce an efficient evolutionary multi-objective neural architecture search (EMARS) framework to search for 3D models for COVID-19 CT scan classification. We also propose a new objective, namely potential, that effectively improves the robustness of the search process. The results on three COVID-19 datasets show that the models searched by EMARS cover a wide range of model sizes, and that they all outperform the baseline models on the Clean-CC-CCII dataset. We also verify the transferability of the EMARS series models by training two representative models on the MosMedData and Covid-CTset datasets. Our work demonstrates that NAS is a powerful and promising solution for assisting COVID-19 detection from CT scans. In the future, we will apply our EMARS framework to more complex tasks, such as 3D medical image segmentation.

Massive and rapid COVID-19 testing is feasible by extraction-free SARS-CoV-2 RT-PCR

Benchmarking deep learning models and automated model design for COVID-19 detection with chest CT scans

Rapid, point-of-care antigen and molecular-based tests for diagnosis of SARS-CoV-2 infection

Chest CT for detecting COVID-19: a systematic review and meta-analysis of diagnostic accuracy

Neural architecture search with reinforcement learning

Efficient neural architecture search via parameter sharing

Learning transferable architectures for scalable image recognition

MnasNet: Platform-aware neural architecture search for mobile

DARTS: Differentiable architecture search

DARTS+: Improved differentiable architecture search with early stopping

FBNet: Hardware-aware efficient ConvNet design via differentiable neural architecture search

Progressive neural architecture search

Large-scale evolution of image classifiers

Hierarchical representations for efficient architecture search

Genetic CNN

Regularized evolution for image classifier architecture search

CARS: Continuous evolution for efficient neural architecture search

Multi-objective reinforced evolution in mobile neural architecture search

MONAS: Multiobjective neural architecture search using reinforcement learning

Proceedings of the IEEE conference on computer vision and pattern recognition

An evolutionary many-objective optimization algorithm using reference-point-based nondominated sorting approach, part i: solving problems with box constraints

Clinically applicable AI system for accurate diagnosis, quantitative measurements, and prognosis of COVID-19 pneumonia using computed tomography

MosMedData: Chest CT scans with COVID-19 related findings

A fully automated deep learning-based network for detecting COVID-19 from a new and large lung CT scan dataset

A closer look at spatiotemporal convolutions for action recognition

Temporal 3D ConvNets: New architecture and transfer learning for video classification

Learning deep features for discriminative localization

A survey on deep learning in medical image analysis

Estimating Uncertainty and Interpretability in Deep Learning for Coronavirus (COVID-19) Detection

COVID-19 Screening on Chest X-ray Images Using Deep Learning based Anomaly Detection

Automatic Detection of Coronavirus Disease (COVID-19) Using X-ray Images and Deep Convolutional Neural Networks

Benchmarking deep learning models and automated model design for COVID-19 detection with chest CT scans

Classification of COVID-19 patients from chest CT images using multi-objective differential evolution-based convolutional neural networks

Application of deep learning technique to manage COVID-19 in routine clinical practice using CT images: Results of 10 convolutional neural networks

Covid MTNet: COVID-19 detection with multi-task deep learning approaches

Sample-Efficient Deep Learning for COVID-19 Diagnosis Based on CT Scans

Radiologist-Level COVID-19 Detection Using CT Scans with Detail-Oriented Capsule Networks

Deep Learning-based Detection for COVID-19 from Chest CT using Weak Label

Artificial intelligence distinguishes COVID-19 from community acquired pneumonia on chest CT

AutoML: A survey of the state-of-the-art

Neural architecture search: A survey

Efficient multi-objective neural architecture search via Lamarckian evolution

Efficient convolutional neural networks for mobile vision applications

Deep residual learning for image recognition

Network in network

Single path one-shot neural architecture search with uniform sampling

EENA: Efficient evolution of neural architecture

Adam: A method for stochastic optimization

SGDR: Stochastic gradient descent with warm restarts

Performance of radiologists in differentiating COVID-19 from non-COVID-19 viral pneumonia at chest CT