key: cord-1019845-7jb1dia6 authors: Louati, Hassen; Bechikh, Slim; Louati, Ali; Aldaej, Abdulaziz; Said, Lamjed Ben title: Joint design and compression of convolutional neural networks as a Bi-level optimization problem date: 2022-05-17 journal: Neural Comput Appl DOI: 10.1007/s00521-022-07331-0 sha: 412336d643494b3df255d6b8ea931ac6ef068be7 doc_id: 1019845 cord_uid: 7jb1dia6 Over the last decade, deep neural networks have shown great success in the fields of machine learning and computer vision. Currently, the CNN (convolutional neural network) is one of the most successful networks, having been applied in a wide variety of application domains, including pattern recognition, medical diagnosis and signal processing. Despite CNNs’ impressive performance, their architectural design remains a significant challenge for researchers and practitioners. The problem of selecting hyperparameters is extremely important for these networks. The reason for this is that the search space grows exponentially in size as the number of layers increases. In fact, all existing classical and evolutionary pruning methods take as input an already pre-trained or designed architecture. None of them take pruning into account during the design process. However, to evaluate the quality and possible compactness of any generated architecture, filter pruning should be applied before the communication with the data set to compute the classification error. For instance, a medium-quality architecture in terms of classification could become a very light and accurate architecture after pruning, and vice versa. Many cases are possible, and the number of possibilities is huge. This motivated us to frame the whole process as a bi-level optimization problem where: (1) architecture generation is done at the upper level (with minimum NB and NNB) while (2) its filter pruning optimization is done at the lower level. Motivated by evolutionary algorithms’ (EAs) success in bi-level optimization, we use the newly suggested co-evolutionary migration-based algorithm (CEMBA) as a search engine in this research to address our bi-level architectural optimization problem. The performance of our suggested technique, called Bi-CNN-D-C (Bi-level convolution neural network design and compression), is evaluated using the widely used benchmark data sets for image classification, called CIFAR-10, CIFAR-100 and ImageNet. Our proposed approach is validated by means of a set of comparative experiments with respect to relevant state-of-the-art architectures. CNNs are currently among the most widely used machine learning models for object recognition and computer vision [1] [2] [3] . Despite the fact that CNN with several layers have been in use for a long time, they gained widespread interest in the scientific community in 2006 following the work of several researchers, such as Bengio et al. 2007 [4] and LeCun et al. 2015 [5] . In fact, when dealing with extremely complex classification problems or for specific purposes, CNN has become increasingly used, especially at a high level of precision. CNNs architecture is defined by a large number of hyperparameters, which should be fine-tuned to optimize the architecture. Previous works in the literature have been proposed with the goal of optimizing architectures such as ResNet [6] and VGGNet [7] . Unfortunately, the majority of these architectures are either defined manually by experts or automatically created using greedy induction techniques. 
Despite the impressive performance of the CNN design, experts in the disciplines of optimization and machine learning proposed that improved structures may be discovered using automated approaches. Evolutionary computation researchers proposed modeling this task as an optimization problem and then solving it with an appropriate search algorithm [8] . Indeed, selecting the blocks' number, nodes per block and the graph's topology within each CNN block is similar to solving a problem of optimization within a large search space. Due to the fact that EAs are capable of approximating the global optimum and thus avoiding local optimal solutions (architectures), authors in [9] proposed recently the use of such metaheuristic techniques to handle the challenge of optimizing the CNN architecture. Efficient model designs [10, 11] focus on acceleration over compression through the use of optimized convolutional operations or network architectures. Recently, as a means of improving accuracy, deepening on CNN models has become a popular trend, as demonstrated by ResNet [6] , VGGNet [7] and Xception [12] . Indeed, it is challenging to deploy these deep models on low-resource devices such as smartphones and mobile robots. However, billions of network parameters represent a significant storage overhead for embedded devices, such as the VGG16 deep learning model, which has over 138 million parameters and requires over 500MB of memory space to classify a 224 224 image. Obviously, such a large model cannot be directly deployed in on-board devices. Deep compression process is a critical technique for resizing a deep learning model by consolidating and removing inadequate components. However, compressing deep models without significant loss of precision is a critical issue. Several techniques for CNN pruning have been presented, including neurons, filters and channel pruning approaches [13] , which reduce model weight by removing unimportant connections. Due to the fact that EAs are capable of approximating the global optimum and thus avoiding locally optimal solutions, recent works [9, 14, 15] recommend that similar metaheuristic algorithms be employed to address the CNN architectural optimization challenge in the field of network compression. To do so successfully, the solution encoding, the fitness function and the variation operators must all be defined. In fact, all previous work focuses on compressing manual architectures and their nonexistent compression for automated CNN architecture. We notice that in our previous works, the problematic of compression is not tackled. Motivated by recent survey papers [9, 14, 16] on deep neural networks pruning and the reported interesting results, we decided to tackle the problem of filter pruning. As any CNN architecture could be pruned in different ways, we framed the problem of ''joint design and pruning'' as a bi-level optimization problem. The upper-level goal is to search for good architectures, while the lowerlevel one is to apply filter pruning on the considered architecture. Indeed, the evaluation of an upper-level architecture requires sending this architecture to the lower level to execute the fitter pruning on it by deactivating some filters. The filters that should be deactivated could not be known before hand as the number of possibilities is huge and corresponds to a whole search space. For this reason, the filter pruning task is executed at the upper level as an evolutionary optimization (search) process. 
In this way, the fitness evaluation of each upper-level solution (architecture) requires the (near) optimal filter pruning decision (encoded as a binary vector where 0 means that the corresponding filter is deactivated) found at the lower level. By following such as a bi-level optimization process, the final output of our approach is an CNN architecture with minimum number of filters and optimized topology. Figure 1 illustrates an example of a Bi-CNN-D-C (bi-level convolution neural network design and compression) scenario. To our knowledge, this is the first study to model and solve the CNN architecture design and compression problem as bi-level method. Each upper-level solution necessitates solving a separate lower-level optimization problem; the computational cost intends to be prohibitively expensive. We address this issue by solving combinatorial BLOPs using CEMBA [17] . Indeed, each upper-level population collaborates with its corresponding lower-level population. This fact enables a significant reduction in the number of evaluations performed during the lower-level search process. The main contributions of our paper could be summarized as follows: • For the first time, an evolutionary method that combines CNN architecture generation with filter pruning within the optimization process is developed. This is motivated by the fact that any generated architecture should be first pruned before evaluating its classification performance. • The joint design and filter pruning is modeled as a bilevel optimization problem where architectures are generated through crossover and mutation at the upper level with minimum NB and NNB, while filter pruning of each architecture is applied at the lower level. • The bi-level optimization modeling is solved using a bilevel co-evolutionary algorithm to ensure the effective collaboration between the architecture generation (at the upper level) and the filter pruning at (the lower level). • Detailed experiments on CIFAR and ImageNet data sets in addition to a COVID-19 case study are conducted in comparison with several recent and prominent peer works. The merits of our proposed algorithm, Bi-CNN-D-C, are demonstrated based on several metrics including the classification error, the number of GPU-Days and the number of parameters. The rest of this paper is structured as follows. Section 2 summarizes the review of the literature on CNN pruning. Section 3 details our proposed approach. Section 4 details the experimental design and performance analysis results. Finally, in Sect. 5, the paper is concluded and some future research directions are suggested. 2 Related work Recently, some researchers have taken an interest in EAs as a means of evolving deep neural network architectures. A survey on applications of swarm intelligence and evolutionary computing based optimization of deep learning models has been published by Darwish et al. [9] . Based on this survey, we selected the most representative: • Cheung and Sable [18] optimized the architecture hyperparameters using a hybrid EA on the basis of the diagonal Levenberg Marquardt technique with rapid convergence and a low computing cost of fitness assessments number. They established the critical role of architectural choices in convolutional layer networks. Their findings demonstrate that even the simplest evolution strategies can yield significant gains. 
When variation effects are present, the employment of evolved parameters in combination with local contrast normalization preprocessing and absolute value across layers has proven a compulsive performance on the MNIST data sets [19] . • Fujino et al. [20] presented evolutionary Deep Learning, called evoDL, as a technique for discovering unique architectural designs. This technique is intended to be used to investigate the development of hyperparameters in deep convolutional neural networks, called DCNNs. Additionally, authors proposed AlexNet as a fundamental framework of the CNN and optimize both the parameters tuning and activation functions using evoDL. • Real et al. [21] used the CIFAR-10 and CIFAR-100 data sets to develop the CNN structure in order to identify the classification model. They presented a mutation operator that may be used to avoid locally optimum models. They demonstrated that neuro-evolution is capable of constructing highly accurate networks. • Xie et al. [22] maximized the recognition accuracy by representing the network topology as a binary string. The primary constraint was the high computing cost, which compelled the authors to conduct the tests on small-scale data sets. • Mirjalili et al. [23] developed an adaption for solving bi-objective models, called NSGA-Net. The image classification and object alignment results obtained demonstrate that NSGA-Net is capable of providing the user with less than complicated correct designs. • Alejandro et al. [24] developed EvoDEEP to optimize network characteristics by calculating the probability of layer transitions based on the finite state machines concept. The goal was reducing classification error rates and preserving the layer sequence. • Real et al. [25] provided a GA with an updated tournament selection operator that takes into account the age of the chromosomes while selecting youngest chromosomes. The architectures are described as small directed graphs with edges and vertices representing common network actions and hidden states. They developed novel mutation operators connecting the edges' origin to other vertices and rename the edges arbitrarily in order to cover the entire search space. • Sun et al. [26] developed an evolutionary technique for improving convolutional neural network designs and initializing their weights for image classification problems. This aim was realized by developing a unique approach for initializing weight, a novel encoding variable-length chromosomes strategy, a slacked binary tournament selection methodology and an efficient fitness evaluation technique. Experiments indicated that the EvoCNN methodology surpasses clearly a wide number of existing approaches in terms of classification performance on practically all data sets investigated. • Lu et al. [26] established a multi-objective modeling of the architectural search problem for the first time by minimizing two potentially conflicting objectives: classification error rate and computational complexity, as measured by the number of floating point operations (FLOPS). In order to execute a multi-objective EA, they updated the non-dominated sorting GA-II (NSGA-II) algorithm. • Jing et al. [27] developed a multi-objective model aiming to maximize classification accuracy while keeping the tuning parameters to a minimum. The proposed model was solved based on a hybrid binary encoding representing component layers and network connections using multi-objective particle swarm optimization with Decomposition, called MOPSO/D. 
The architectures discovered are considered to be exceptionally competitive when compared to models created manually and automatically. Deep network compression is one of the most significant strategies for resizing a deep learning model by removing its ineffective components [14]. However, compressing deep models without considerable loss of precision is a key challenge. Recently, many studies have focused on discovering new EA-based techniques to minimize the computational complexity of CNNs while retaining their performance [14]. We divide the network compression techniques into three categories depending on the existing work: filter pruning [29–32], quantization [33–37] and Huffman encoding [38–40]. The convolutional operation in the CNN model integrates a large number of filters to improve its performance under various classification and prediction processes [41]. Recently, various filter pruning techniques [29–32] have been suggested. Adding filters enhances the spatial characteristics that the CNN model can capture [9, 42]. However, this increment results in a significant increase in the DNN model's FLOPs. As a result, removing superfluous filters is critical for reducing the computational requirements of the DCNN model. Figure 2 illustrates a scenario using filter-level pruning. We summarize the most important works on filter pruning currently available: • Luo et al. [31] introduced an efficient framework named ThiNet for accelerating the operation of the CNN model through the use of compression during the training and testing phases. They implemented filter-level pruning, in which a filter that is no longer necessary is deleted based on statistical information generated from the following layer. The authors formulated filter-level pruning as an optimization problem for determining which filters to prune, and solved it with a greedy method defined as follows:

arg min_{E_j} Σ_{i=1}^{N} ( ŷ_i − Σ_{e ∈ E_j} x̂_{i,e} )²   subject to   |E_j| = k × c_rate,

where N represents the number of training examples (X_i, Y_i), |E_j| represents the number of elements in the retained channel subset, k represents the number of channels within the CNN model and c_rate represents the fraction of channels retained after compression. • Bhattacharya and Lane [43] developed a technique for CNN compression that applies sparsification to the convolutional filters and the fully connected layer. The primary goal was to minimize the amount of storage required by devices throughout the training and inference processes. By utilizing layer separation and sparsified convolutional filters, the computational and spatial complexity of the DCNN model can be significantly reduced. • Zhou et al. [44] suggested a multi-objective optimization problem for filter pruning, followed by a knee-guided approach. They proposed a trade-off between performance degradation and parameter count. The fundamental concept is to remove parameters that contribute to performance degradation, using a performance loss criterion to determine the significance of a parameter. To produce a tiny compressed model, the number of filters should be limited to a minimum while still achieving a high degree of precision. The challenge can be handled by identifying a compact binary representation capable of pruning the maximum number of filters while maintaining a reasonable level of performance. This work has the advantage of lowering the number of parameters and processing overhead.
• Huynh et al. [45] presented the DeepMon approach for performing deep learning inference on mobile devices. They assert that inference can be done in a short period of time and with minimal power consumption by using the graphics processing unit of the mobile device. They presented a method for optimizing convolutional processing on mobile graphics processing units. The technique reuses intermediate results by exploiting the CNN's internal processing structure, which includes filters and a network of connections; thus, deleting filters and superfluous connections yields faster inference. • Denton et al. [32] significantly reduced the time required to evaluate a large CNN model developed for object recognition. The authors exploited the redundancy within convolutional filters to derive approximations that significantly reduce the required computation. They began by compressing each convolutional layer using an appropriate low-rank approximation and then fine-tuned until prediction performance was recovered. Weight quantization decreases both the storage and computing requirements of the CNN model [33–37]. Han et al. [34] suggested a weight quantization approach for compressing deep neural networks by reducing the number of bits needed to encode weight matrices. The authors attempt to decrease the number of weights that have to be stored in memory: identical weights are removed, and numerous connections are derived from a single remaining weight. The authors used integer arithmetic for inference and floating point computations for training. Jacob et al. [37] presented a quantization technique based on integer arithmetic for inference. Integer arithmetic is more efficient than floating point arithmetic and requires fewer bits per value. Additionally, the authors constructed a training procedure that mitigates the accuracy penalty associated with the conversion of floating point operations to integer operations. As a result, the suggested technique eliminates the trade-off between on-device latency and the accuracy degradation caused by integer operations; inference is performed using integer arithmetic and training using floating point operations. Quantization creates an affine mapping between Q-bit integers q and real numbers r, i.e., of the type

r = W × (q − T),   (2)

where Eq. (2) denotes the quantization mapping with the parameters W and T. For instance, Q is set to 8 for 8-bit quantization, W is an arbitrary positive real number (the scale), and T (the zero point) has the same integer type as the quantized values. The quantization strategy for compressing DNN models is explored in the current literature [33–37]. These strategies cover model reduction by arranging weight matrices optimally. However, the previous work does not address the negative repercussions of weight quantization or its estimation complexity.
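To make the affine mapping of Eq. (2) concrete, the following minimal NumPy sketch quantizes a toy filter bank and dequantizes it again; the way the scale W and zero point T are derived from the weight range, as well as all function and variable names, are illustrative assumptions rather than code from the cited works.

```python
import numpy as np

def quantize(r, W, T, Q=8):
    """Map real values r to Q-bit integers q under the affine model r ≈ W * (q - T)."""
    qmin, qmax = 0, 2 ** Q - 1
    q = np.round(r / W + T)                      # invert the affine mapping
    return np.clip(q, qmin, qmax).astype(np.int32)

def dequantize(q, W, T):
    """Recover approximate real values from the quantized integers."""
    return W * (q.astype(np.float32) - T)

# Example: derive W (scale) and T (zero point) from the observed weight range.
weights = np.random.randn(64, 3, 3, 3).astype(np.float32)   # toy conv filter bank
Q = 8
w_min, w_max = weights.min(), weights.max()
W = (w_max - w_min) / (2 ** Q - 1)               # arbitrary positive real scale
T = int(round(-w_min / W))                       # integer zero point
q = quantize(weights, W, T, Q)
recovered = dequantize(q, W, T)
print("max reconstruction error:", np.abs(weights - recovered).max())
```

In practice, per-layer or per-channel ranges are often used instead of a single global range, which reduces the reconstruction error further.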
A Huffman encoding is a frequently used lossless data compression algorithm [46]. Schmidhuber et al. [39] utilized Huffman coding to compress text files generated by a neural prediction network. Han et al. [40] used a three-stage compression strategy to encode the quantized weights, which included pruning, quantization and finally Huffman coding [9]. Ge et al. [47] proposed a hybrid model compression technique based on Huffman coding to capture the sparse nature of trimmed weights. Huffman codes are superior to all other variable-length prefix codings. However, Elias (1975) and Golomb encodings [48] can take advantage of various intriguing characteristics, such as the recurrence of specific sequences, to achieve better average code lengths. Despite the interesting findings of design and compression work on optimizing deep learning architectures, all researchers treated architecture optimization as a single-level problem. We show that CNN design can be improved if two optimization levels are considered, where a search space is assigned to each level. The two following questions motivate our bi-level model: • How can we design a less complex architecture with the minimum possible number of convolution blocks (NB) and convolution nodes per block (NNB) while achieving high performance, which is highly dependent on the topologies of the convolution blocks' graphs? • For any CNN architecture, there is a large number of filters per layer; how can we determine the optimal number of filters per layer? For the following reasons, a bi-level modeling of the design and compression problem is necessary to solve these two research questions. On the one hand, optimizing the design and compression hyperparameters requires intelligent sampling of the entire upper-level search space. On the other hand, in order to assess the quality of an upper-level solution (NB, avgNNB, NF, Err), we must pass the vector (TOP, NF) to the lower level as a fixed parameter, with the intention of finding the best selection of filters (NF) from the lower-level search space. Once the lower-level process is completed, each architecture is passed through a quantization of its 32-bit floating point values into 5-bit integer levels; this process is used to further reduce the stored size of the weights file. None of the existing strategies approaches the problem as a bi-level optimization one; instead, each set of hyperparameters is evaluated independently. This observation reveals a significant shortcoming of present approaches and constitutes the paper's key research gap. The bi-level modeling of the CNN architecture design and compression optimization problem illustrated in Fig. 3 demonstrates our approach. In fact, the upper-level optimization process is concerned with optimizing the pair (NB, NNB) and determining the optimal topology sequence in terms of classification accuracy, while the lower level focuses on pruning the CNN filters. As we are in the case of bi-level optimization, we have two kinds of solutions: (1) an upper-level solution and (2) a lower-level one. Indeed, the upper-level solution is encoded as a vector containing two sub-vectors: (1) the first one contains integer values expressing the NB and the NNB and (2) the second one is a binary sequence expressing the topology (encoding adopted from Genetic-CNN [49]). This encoding is chosen to keep the chromosome length at the upper level as short as possible. The lower-level solution is a sequence of sub-vectors, each expressing the filter pruning decision of the corresponding convolution node. It is therefore wiser to model the problem as a bi-level one, with the goal of finding better architectures with less complexity. To solve the proposed bi-level optimization model using CEMBA's adaptation, the following upper-level processes should be detailed: -Upper-level solution encoding: It is constructed by concatenating the number of blocks NB with an integer sequence NNB representing the number of nodes in every block and with a sequence describing the graph topology of the convolution layers.
A possible directed graph is represented by this object. -Upper-level fitness function: To evaluate an upper-level solution, we must reduce the complexity of the CNN architecture as much as possible by optimizing the pair (NB, NNB) while achieving high performance. To accomplish this, we propose the following fitness function (a small illustrative sketch of this function and of the crossover operator is given after the constraint list below):

F(NB, avgNNB, NF, Err) = (NB / NBmax) + (AvgNNB / NNBmax) + (NF / NFmax) + Err.

To vary the population at the upper level, the uniform crossover operator [50] has been adopted, which allows variation across all chromosomal segments. To guarantee the diversity of solution variation, every parent solution is converted into a binary sequence based on the Gray encoding [51]. This encoding technique is motivated by the fact that neighboring integer values differ by just one bit, which is not true for the conventional binary encoding [52]. This has been shown to help prevent premature convergence at so-called Hamming walls [53], where too many simultaneous mutations (or crossover events) are required to change the chromosome into a more advantageous solution. The uniform crossover procedure randomly selects a recombination mask from a uniform distribution; this mask is a binary vector of 0s and 1s. The first offspring takes each bit from the first parent when the corresponding mask bit equals 0 and from the second parent when it equals 1. The second offspring is generated using the inverse mask. Finally, each offspring is decoded back into an integer vector and its fitness value is calculated. Because the proposed solution is a vector of integers, the length of the binary chromosome is a multiple of four (each integer is encoded using 4 bits). If this is not the case for a created offspring, the last bits are removed to maintain a length that is a multiple of four. It is crucial to remember that the NB value of a created offspring may differ from the length of the NNB sequence. In this case, the offspring solution is repaired by adjusting the length of the NNB sequence to match NB. Then, in order to maximize accuracy, we must determine the optimal topologies. As seen in Fig. 4, the topology is encoded as a sequence of squared binary matrices, one for each possible directed graph. A value of 1 indicates that the row node is a predecessor of the column node; a value of 0 indicates that there is no relationship between the two nodes. Because this work is concerned with the CNN model, the following constraints must be respected: • Each active convolution node should have a predecessor node. The latter may be a previous convolution node or the input convolution node. • Each active convolution node should have a successor node. The latter may be a convolutional successor node or the output node. • Any active convolution node should have its predecessors in the preceding layers. For instance, node 4 may have predecessors among nodes 3, 2, 1 and the input node. • The initial convolution node should have a single preceding node, which acts as the input node. • The output node of the last convolution node should have only one successor node.
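As announced above, here is a minimal sketch of the upper-level fitness F and of the mask-based uniform crossover; the normalization constants, the toy chromosome length and the function names are illustrative assumptions, not the authors' exact settings.

```python
import random

def upper_fitness(NB, avg_NNB, NF, err, NB_max, NNB_max, NF_max):
    """Upper-level fitness F(NB, avgNNB, NF, Err): smaller values are better."""
    return NB / NB_max + avg_NNB / NNB_max + NF / NF_max + err

def uniform_crossover(parent1, parent2):
    """Mask-based uniform crossover on two Gray-coded bit lists of equal length."""
    mask = [random.randint(0, 1) for _ in parent1]
    # Bit taken from parent1 when the mask bit is 0, from parent2 when it is 1.
    child1 = [b1 if m == 0 else b2 for b1, b2, m in zip(parent1, parent2, mask)]
    # The second offspring uses the inverse mask.
    child2 = [b2 if m == 0 else b1 for b1, b2, m in zip(parent1, parent2, mask)]
    return child1, child2

# Toy usage: two 12-bit parents (e.g., three 4-bit Gray-coded integers each).
p1 = [random.randint(0, 1) for _ in range(12)]
p2 = [random.randint(0, 1) for _ in range(12)]
c1, c2 = uniform_crossover(p1, p2)
# Illustrative normalization bounds (assumed, not the paper's values).
print(upper_fitness(NB=10, avg_NNB=40, NF=256, err=0.08,
                    NB_max=15, NNB_max=49, NF_max=512))
```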
The goal of mutation is to inject abrupt changes within the population to ensure its diversity and thus its ability to explore other regions of the search space (e.g., regions not visited so far). Among the available mutation operators, we cite one-point mutation, random reset and inversion mutation, just to name a few. As we adopted binary encoding at both levels (NB integer, NNB integers, topology and the filter activation decision vector), the one-point mutation allows a progressive change of the mutated solution. In this way, diversity is slightly incorporated within the evolutionary process. As with the crossover operator, the solution handled by the mutation operator is converted to a binary string using Gray encoding before applying the one-point mutation. Because the variation may alter the NB field, consistency is restored using the following repair technique (where LNS denotes the length of the NNB sequence): • If NB < LNS, then delete the final (LNS − NB) integers of the NNB sequence. • If NB > LNS, then append (NB − LNS) randomly generated integers at the end of the NNB sequence. These two rules guarantee that NB always equals LNS. On the basis of prior work [1], we suppose in this research that the quantity NB must lie in the interval [9, 11] while the quantity NNB lies within the interval [32, 49]. The Acc is computed using the holdout validation method [54], where 80% of the data records are randomly selected for training and 20% for testing. • Lower-level solution encoding: It represents the selected filters (NF) to be kept or pruned in the convolution layers. The filter subset is represented by a binary vector, i.e., a bit sequence of 0s and 1s. • Lower-level fitness function: To assess the lower-level solutions, the complexity of the CNN architecture must be reduced by minimizing NF while preserving or improving the high precision. To do so, we adopt a fitness function that, analogously to the upper-level one, combines the normalized number of kept filters with the classification error. In the proposed lower-level filter pruning, a binary string is adopted for representing the filters of a CNN model. It is essential to know that the suggested algorithm prunes only convolution layers. DCNNs are basically constructed by stacking multiple convolution, pooling and fully connected layers, and the goal of this paper is the automated joint design and filter pruning of CNN architectures. As filters are located only within convolution layers, only the latter are pruned. Our approach considers each bit as one single filter; e.g., to represent two convolution layers with 16 filters in one layer and 32 filters in the other, a 48-bit string is required, where a bit set to zero indicates the elimination of the corresponding filter. Furthermore, when pruning plain CNN models, a single bit string is needed, with every bit representing one filter of the model (see the sketch below). Figure 5 shows the binary representation of a CNN before and after pruning. The two-point crossover operator is used to vary the population [50] since it enables all chromosome parts to change. Each parent solution in this operation is a set of binary strings [51]. A couple of cutting points is chosen for each couple of parents, after which the bits between the cuts are exchanged to produce a couple of offspring solutions. In fact, the two-point crossover is adopted to allow the variation of all parts of the chromosomes. Indeed, if the one-point crossover is used, the extreme regions (extreme genes) of the chromosomes are likely to remain unchanged. This could significantly reduce the exploitation ability of the crossover and the population diversity. To mitigate this issue, researchers proposed the use of two cut points instead of a single one to allow the variation of the entire chromosome.
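The sketch below illustrates, under our own naming assumptions, the lower-level binary encoding (one bit per filter) and the two-point crossover just described; the 16/32-filter example mirrors the one given in the text.

```python
import random

def random_filter_mask(filters_per_layer):
    """One bit per filter across all convolution layers; 0 = prune the filter."""
    return [random.randint(0, 1) for _ in range(sum(filters_per_layer))]

def two_point_crossover(parent1, parent2):
    """Exchange the segment between two cut points chosen at random."""
    a, b = sorted(random.sample(range(1, len(parent1)), 2))
    child1 = parent1[:a] + parent2[a:b] + parent1[b:]
    child2 = parent2[:a] + parent1[a:b] + parent2[b:]
    return child1, child2

def kept_filters(mask, filters_per_layer):
    """Number of filters kept in each layer after applying the pruning mask."""
    counts, start = [], 0
    for n in filters_per_layer:
        counts.append(sum(mask[start:start + n]))
        start += n
    return counts

# Example from the text: two convolution layers with 16 and 32 filters -> 48-bit string.
layers = [16, 32]
m1, m2 = random_filter_mask(layers), random_filter_mask(layers)
c1, c2 = two_point_crossover(m1, m2)
print(kept_filters(c1, layers))
```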
Similar to the crossover, the solution handled by the mutation operator is encoded as a binary string, on which a random one-point mutation is then applied. A point on the chromosomes of both parents is chosen at random and referred to as the ''crossing point''; the bits to the right of this point are exchanged between the two parent chromosomes. A quantization of the 32-bit floating point values into 5-bit integer levels is used to further reduce the stored size of the weights file. The quantization levels are spread linearly between Wmin and Wmax because this produces more accurate results than density-based quantization: even if a weight occurs with a low probability, it may have a high value and therefore a high influence, which would be lost if it were quantized to a value well below its real one. This stage produces a compressed sparse row representation of the quantized weights. Given the statistical characteristics of the quantization output, Huffman compression might be used to further reduce the weights file; however, this adds the additional hardware needs of a Huffman decompressor and of a converter from compressed sparse rows to a weights matrix. The test error is computed using the holdout validation technique [55], which randomly selects 70% of the data records for training and 30% for testing. To deal with the over-fitting issue, the training data (70%) is divided into 5 folds, and thus fivefold cross-validation is applied during training. The classification performance is averaged over the 5 folds of the training partitions. Figure 6 illustrates the validation strategy adopted in this work [56]. Eventually, in the experiments, we report the classification error on the test data (30%). We compared the performance of the suggested strategy to earlier work using the frequently used benchmark data sets CIFAR-10, CIFAR-100 and ImageNet. The first data set contains 60,000 32 × 32 RGB images that are grouped into ten classes of 6,000 images each; the training set contains 50,000 samples and the test set 10,000. The second data set is similar to CIFAR-10 but differs in terms of class count: it comprises 100 classes, and each class contains 600 images. Both data sets present significant challenges due to factors such as noise, image size and image rotation. The images are augmented during the preprocessing stage to ensure that the comparisons are fair. Indeed, four zero-valued pixels are padded around each image, from which a 32 × 32 crop is then randomly extracted; following that, the cropped image is randomly flipped with a probability of 0.5 (a small sketch of this protocol is given at the end of this subsection). This technique was influenced by [57]. Due to the enormous number of classes, most current studies do not conduct experiments on the CIFAR-100 data set. To demonstrate the performance of Bi-CNN-D-C, we carry out a series of tests on CIFAR-10 and CIFAR-100. Finally, the third data set, ImageNet, contains 14,197,122 images that have been annotated using the WordNet hierarchy. Since 2010, the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) has used the data set as a benchmark for image classification and object recognition. The data set is freely available; additionally, a set of test images is released without the associated manual annotations.
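A common way to realize the padding, random cropping and random flipping protocol described above for CIFAR-10 is sketched below with torchvision; the library choice, the `./data` path and the exact transform parameters reflect our reading of the protocol inspired by [57], not code from the paper.

```python
import torchvision.transforms as T
from torchvision.datasets import CIFAR10

# Pad each side with 4 zero pixels, take a random 32x32 crop,
# then flip horizontally with probability 0.5.
train_transform = T.Compose([
    T.RandomCrop(32, padding=4),
    T.RandomHorizontalFlip(p=0.5),
    T.ToTensor(),
])

train_set = CIFAR10(root="./data", train=True, download=True,
                    transform=train_transform)
test_set = CIFAR10(root="./data", train=False, download=True,
                   transform=T.ToTensor())
print(len(train_set), len(test_set))  # 50000 10000
```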
Our examination study will address the following main questions: • How do the architectures generated by Bi-CNN-D-C compare to previous work on CIFAR-10 image classification? • Is it possible for Bi-CNN-D-C to maintain its efficacy on CIFAR-100 and ImageNet, that is, when the number of classes is increased to 100? • Is Bi-CNN-D-C capable of producing high-quality designs in spite of its high computational cost? To answer these RQs, we compare the best architecture developed by Bi-CNN-D-C to previously generated and current designs. According to previous work, the most used performance measures in image classification using DNNs are the error rate and the number of floating point operations (FLOPs). Equation (8) gives the error rate (Test error) as the proportion of misclassified samples, where FP stands for false positives, FN stands for false negatives and NE stands for the total number of samples. #Params is the sum of the weights and biases in the convolution layers and is given by Eq. (4), where Wc, Bc, pc and K are the weights, biases, parameters and convolution kernel size, respectively; N represents the number of kernels; and C represents the number of channels in the input image [1]. The GPUDays metric is the number of GPU-day units, where one unit means that the algorithm has run for one day on one GPU. FLOPs represent the number of floating point operations, commonly taken as a measure of computational cost, and are given by Eq. (7), where W, H and C_in represent, respectively, the width, height and number of channels of the input feature map, K is the kernel width, and C_out is the number of output channels (an illustrative computation of #Params and FLOPs is sketched at the end of this subsection). Each method is executed 20 times, and then the performance values of the best 20 outputted architectures of each method are averaged (for each metric). The most representative previous studies from both categories of CNN generation methods are compared to our Bi-CNN-D-C methodology. From the evolutionary CNN design approaches, we selected BLOP-CNN, Genetic-CNN, LargeScale-Evo, AE-CNN, CNN-GA and NSGA-Net. From the pruning approaches, we selected DeepPruningES, Channel-Evo and Classical-Pruning [58], which are added to the experimental comparisons. The parameters of the compared algorithms are established on the basis of the commonly used trial-and-error strategy [59] in order to achieve as much impartiality as possible in the comparisons; the parameter settings used in our studies are summarized in dedicated tables. Tables 3, 4 and 5 sum up the comparative results obtained for the various architectures generated by the various CNN design methods on CIFAR-10, CIFAR-100 and ImageNet. We detail and clarify these results for each category and metric in the sections that follow. In fact, evolutionary methods have the ability to escape local optima and cover the entire search space because of their global search capability. They also have the capability of accepting less performing architectures on a probabilistic basis through the use of the mating selection operator, which allows them to cover the entire search space. This practice helps to widen the explored search space, as we can obtain more diverse architectures with varying block counts and sizes. Indeed, the algorithm can produce a large number of architectures having multiple topologies for each pair (NB, NNB). This behavior distinguishes BLOP-CNN and Bi-CNN-D-C as the first algorithms in the literature that simultaneously vary and optimize these various components in a bi-level design.
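Since Eqs. (4), (7) and (8) are only referenced by number here, the sketch below gives standard formulations that are consistent with the variables defined above (K kernel size, C_in/C_out channels, W/H feature-map size, FP/FN/NE for the error rate); it should be read as our interpretation rather than the paper's exact equations.

```python
def conv_params(K, C_in, N):
    """#Params of a convolution layer: K*K*C_in weights per kernel plus one bias, for N kernels."""
    Wc = K * K * C_in * N   # weights
    Bc = N                  # biases
    return Wc + Bc

def conv_flops(W, H, C_in, K, C_out):
    """Multiply-accumulate count of a convolution layer producing a W x H x C_out map."""
    return W * H * C_out * K * K * C_in

def error_rate(FP, FN, NE):
    """Test error as the fraction of misclassified samples."""
    return (FP + FN) / NE

# Toy check on a 3x3 convolution applied to a 32x32x16 input with 32 output channels.
print(conv_params(K=3, C_in=16, N=32))                  # 4640
print(conv_flops(W=32, H=32, C_in=16, K=3, C_out=32))
print(error_rate(FP=30, FN=20, NE=1000))                # 0.05
```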
However, efficient model designs such as BLOP-CNN focus more on acceleration than compression by optimizing convolution operations or network architectures. That is the reason, we are driving the compression research process more effectively and efficiently. We will now proceed to the analysis of the #Params. These metrics are used to represent the number of parameters. Due to the reduction blocks and minimization of NB and NNB at the upper level, and compression convolution layer hyperparameters (FSi,Nbits of quantized weights) at the lower level, the #Params of Bi-CNN-D-C are less than those of NSGA-Net-4 and BLOP-CNN, as shown in Tables 3 and 4 . Notably, the CIFAR-10 and CIFAR-100 images are identical in size (i.e., the images are identical in both data sets, except that the number of classes increases from ten (CIFAR-10) to one hundred (CIFAR-100)). The final fully connected layer for CIFAR-100 takes longer to compute theoretically, but this is irrelevant because it is so small in comparison with the rest of the model. In what follows, we analyze the computation time of our algorithm in terms of the number of GPUDays, which mainly depends on the number of function evaluations and the population architectures' number of parameters (params). We agree that the bi-level optimization process computational cost is very important (31 GPUDays as shown in Tables 3, 4 and 5), but acceptable with respect to evolutionary NAS methods. This could be explained by the low population size and number of generations defined from the start, which are both equal to 30. Despite these low values, Bi-CNN-D-C is able to generate pruned architectures with minimum number of filters and thus much reduced number of parameters. It is worth noting that our algorithm is the first one that applies filter pruning to evolved architectures within the evolution process, because existing methods apply such pruning to existing architectures that are passed as inputs to the pruning method. The considerable reduction in terms of the NF makes the GPUDays metric decreasing from one generation to another, as the time required to compute the architecture classification error is dropping with the advance of the optimization process. When evolutionary algorithms are employed to solve realworld issues, we typically looking to realize whether or not they have converged. In this section, the evolutionary trajectories of the suggested method in terms of the benchmark data sets are studied. We analyze the convergence behavior of BLOP-CNN on CIFAR-10 Over the evolutionary search process, the NB quantity is decreased from 15 to 9. Additionally, the interval extent is minimized over generations, resulting in architectures with similar NB values at the conclusion of the optimization process. Due to the fact that NB and NNB are mutually exclusive objectives, minimizing the NNB is not easy. Indeed, the AVG NNB's slope decrease is notably less than NB's. This fact could be explicated by the fact that these two quantities may have a conflicting connection. Indeed, reducing the number of blocks NB may result in an increase in the number of nodes per block NNB, with the goal of maximizing or preserving the classification Acc. We believe that the EA at the top is attempting to strike a favorable trade-off between NB and NNB. Moreover, We find that the upper level is progressively maximized from generation to generation with a degree of convergence toward the maximum attained value. The quantity of Acc is increased from 40 to 98%. 
The first 15 generations have a rather steep maximization slope in comparison with the latter 15 generations. This could be explained by the fact that the search space of possible CNN architectures is widely explored during the first phase of the evolution process. During this phase, the huge search space contains low-, medium- and high-quality architectures; the population is therefore distributed over the entire search space, and a large number of low- and medium-quality architectures are visited and even designated as the best architectures found so far. After that, the evolutionary process is able to focus the population on the promising regions of the search space, where most architectures have similar, respectable classification performance values. This focus makes the algorithm sample well-performing architectures that are similar in terms of classification accuracy, which explains the flattening of the maximization slope. This phenomenon can also be observed in many other applications of genetic algorithms [49, 60]. The evolution trajectory analysis shows a good interaction between NB, NNB, NF and Err at the upper level, with the goal of developing effective CNN architectures with an optimal number of blocks and nodes per block. The most representative previous studies compressed already-existing manual CNN architectures; there is no existing work that compresses an architecture generated automatically. For this reason, Bi-CNN-D-C is the first work to compress an evolved CNN architecture automatically. Table 10 summarizes the manual CNN architectures that were used to compare the proposed approach. In fact, the results on the CIFAR-10 and CIFAR-100 data sets are presented in Tables 6 and 7, respectively. For each DCNN architecture, the best test error and number of FLOPs are shown. Based on Table 6, the average test error of the DeepPruningES algorithm is between 7.43 and 8.91%: for VGG16 and VGG19, it obtains similar results of 8.21% with a 32.01% and 32.56% reduction in the number of FLOPs, respectively, and for ResNet56, ResNet110, DenseNet50 and DenseNet100, it obtains values lying between 7.43% and 8.91% with reductions between 16.72% and 32.56% in the number of FLOPs. In Table 6, the average test error of the Channel-Evo algorithm is between 5.85 and 7.91%, with VGG16 and VGG19 achieving similar results of 7.26% with respective 52% and 53.05% reductions in the number of FLOPs, and ResNet56, ResNet110, DenseNet50 and DenseNet100 values lying between 5.85% and 7.91% with a 16.02% to 17.35% reduction in the number of FLOPs. Still in Table 6, the average test error of the auto-balanced filter approach is between 8.27 and 9.32%, where VGG16, VGG19, ResNet56, ResNet110, DenseNet50 and DenseNet110 obtain values between 8.27% and 9.32% and a 36.5% to 57% reduction in FLOPs. The average test error of Classical-Pruning lies between 6.46 and 9.09%. In fact, the proposed Bi-CNN-D-C algorithm provides 1.98 × 10^7 and 2.21 × 10^7 FLOPs, with pruned percentages of 32.5% and 28.9%, respectively. Bi-CNN-D-C is thus capable of reducing FLOPs while maintaining acceptable test errors. The main reason that could explain the outperformance of Bi-CNN-D-C over the considered peer works corresponds to the principal motivation of this work.
From the start of this paper, we have mentioned that the main shortcoming of existing pruning methods, including evolutionary ones, consists in the fact that these methods take as input an existing (already designed) architecture, and then the algorithm searches for the best possible pruning decision. This drastically limits the performance of such methods. From a metaphorical point of view, we could say that such an algorithm remains stuck at a single point of the architecture search space, which is not the case for our algorithm. By allowing the joint design and pruning of architectures, our algorithm is able not only to move from one architecture to another but also to (near) optimally prune each generated architecture. In this way, our algorithm is not paralyzed at a single point of the architecture space (i.e., it has the freedom to sample the search space). Moreover, the collaborative interaction of the two optimization levels (upper and lower) ensures the narrowing of the search process toward high-performing architectures with a minimum number of filters (and also minimum NB and NNB). To the best of the authors' knowledge, Bi-CNN-D-C is the first EA capable of automatically designing and compressing CNN architectures while providing a bi-level interaction between the convolution layers and their hyperparameters. Chest X-ray and CT are two of the most commonly available radiological tests for the diagnosis of several lung diseases. In our study, we acquired chest X-rays belonging to 50 COVID-19 patients from the archive of [61] collected by Dr. Joseph Cohen; this archive contains chest X-ray and computed tomography images of individuals suffering from acute respiratory distress syndrome, severe respiratory problems, pneumonia and COVID-19. We also selected normal chest X-ray images from the Kaggle chest X-ray (pneumonia) data set (https://www.kaggle.com/paultimothymooney/chest-xray-pneumonia). This study therefore uses a chest X-ray image database divided into two separate groups: COVID-19 patient images and normal patient images. We scaled all images within the data set to 224 by 224 pixels. Then, the data set is randomly split into two distinct subsets, where 80% is considered for training and 20% for testing (a small preprocessing sketch is given below). Figures 8 and 9 show chest X-ray pictures of uninfected and infected people, respectively.
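A minimal preprocessing sketch for the described pipeline (resize to 224 × 224, 80/20 random split) is given below; the folder layout, paths and helper names are hypothetical.

```python
import os, random
from PIL import Image

def load_and_resize(folder, label, size=(224, 224)):
    """Load every image in a folder, resize it to 224x224 and attach a class label."""
    samples = []
    for name in os.listdir(folder):
        img = Image.open(os.path.join(folder, name)).convert("RGB").resize(size)
        samples.append((img, label))
    return samples

# Hypothetical folder layout: one directory per class.
data = load_and_resize("xray/covid", 1) + load_and_resize("xray/normal", 0)
random.shuffle(data)
split = int(0.8 * len(data))            # 80% training / 20% testing
train, test = data[:split], data[split:]
print(len(train), len(test))
```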
Many computational intelligence strategies for COVID-19 detection using computed tomography (CT) and X-ray images have been proposed recently [62–64]. Fei et al. [65] proposed VB-Net, an interactive approach for CT images that is an extended version of V-Net. It is divided into two sections: the first is a contracting path that uses downsampling and convolution to obtain the image's general features, and the second is an expanding path that incorporates fine-grained image features through upsampling and convolution processes. The work's main distinctive feature is the addition of a human expert into the loop in order to guide the segmentation of the infection regions. Prabira et al. [66] extracted a set of deep features using nine pre-trained CNN models, which were then given to a support vector machine (SVM) classifier; in the comparison studies, manually constructed DCNN architectures were used. Based on X-ray images, the suggested method was found to have superior detection accuracy. Xu et al. [67] developed a deep screening approach to distinguish COVID-19 pneumonia from Influenza-A viral pneumonia and healthy cases. A 3D deep learning algorithm was first used to segment candidate infection regions. A location-attention classification model then categorized each image as Influenza-A viral pneumonia, COVID-19 or unrelated to infection, together with confidence ratings. Eventually, a noisy-OR Bayesian function was used to compute the type of infection and the total confidence score. Shuai et al. [68] presented an Inception model based on transfer learning, which can be divided into two halves: the first half uses a pre-trained Inception network to transform the input image into one-dimensional feature vectors for the classification task, whereas the second half uses a fully connected network. The resulting test errors on X-ray and CT images are summarized in Tables 8 and 9, respectively, using the identical computer configuration and implementation environment as the experiments of the previous section. We notice that Bi-CNN-D-C outperforms the peer approaches in terms of results. In reality, it is worth mentioning that the suggested algorithm's primary purpose is to reduce architecture complexity. Manual architecture design, from the standpoint of optimization, involves a wide search space that must be carefully sampled in order to build an effective and efficient design. The behavior of Bi-CNN-D-C on these images can be justified in the same manner as on the CIFAR-10 and CIFAR-100 images. We recall that the aim of our approach is to focus on CNN design and compression. Bi-CNN-D-C shows its ability to provide end users with successful designs capable of detecting COVID-19 from X-ray and CT images, while taking into consideration the hierarchical structure of the CNN architecture design challenge. This case study, we believe, will encourage researchers and practitioners to use scalable computational approaches for X-ray image analysis utilizing DNNs in the future. Deep neural networks have demonstrated outstanding performance in a wide range of machine learning tasks, including classification and clustering [69, 70], and in real-life applications of soft computing techniques in different fields [2, 71–73]. In fact, designing an architecture for a deep CNN is an extremely interesting, challenging and timely subject. Recently, several works have concentrated on developing novel methods for reducing the computational complexity of CNNs using EAs. Indeed, recent research [14] suggests that such metaheuristic algorithms could be used to solve the CNN architecture optimization problem in the field of network compression and acceleration. Nevertheless, none of the previous works took the bi-level nature of neural architecture design and compression into account. Regarding CNN architecture design, there is a virtually unlimited number of alternative architectures with various network topologies for any set of blocks of various sizes. Regarding CNN architecture compression, the search space expands exponentially in size as the number of layers increases; we must therefore reduce network complexity and eliminate redundancy in order to achieve a small compressed model.
Based on this observation, we suggest in this study a bi-level model of the CNN architecture design and compression problem, where the upper level seeks to minimize network complexity, primarily in terms of the NB and NNB, and to maximize classification Acc, while the lower level prunes the filters of each generated architecture; CEMBA [17] is then used to solve the bi-level model. According to the results analysis, the suggested approach Bi-CNN-D-C demonstrated its effectiveness and superior performance on the CIFAR-10 and CIFAR-100 benchmark data sets when compared to several representative architectures as well as some additional ones generated by recent prominent CNN design and compression methods. Finally, we would like to draw attention to some interesting perspectives. The first is directly related to our work and entails developing an interaction model that enables users to interact with Bi-CNN-D-C during the evolution process by examining some generated architectures, mining their common patterns and then making recommendations in the form of soft and/or hard constraints to generate CNN architectures that satisfy the expert's preferences and knowledge. The second objective is to extend our work to the field of real-time federated learning, in which multiple local devices collaborate to train a shared global model while the training data remain on the edge devices. This enables the resolution of critical issues such as data privacy, data security, data access rights and access to heterogeneous data, all of which are critical in a wide variety of domains, including defense, telecommunications, the Internet of Things and pharmaceutics. The third perspective concerns the pruning itself: pure filter pruning can lead to the removal of important channels and could cause a considerable loss of interesting channels. Our future work will address this issue by preserving important channels even if their corresponding filters contain a very small number of channels. Finally, it would be interesting to extend our approach to the case of dynamic environments (incremental learning) [74]; dynamic evolutionary algorithms could then represent an interesting alternative for dealing with such a challenge. The most evocative examples of manual architectures are VGG16, VGG19, ResNet50, DenseNet50 and ResNet110. The details of the CNN architectures used to compare the proposed approach are summarized in Table 10. Most academic and real-world optimization problems use a single level of optimization. Many problems, however, are structured in two levels and are referred to as bi-level optimization problems (BLOPs) [75]. In such scenarios, an optimization problem is nested within the constraints of an external optimization problem. The upper-level problem, often known as the leader problem, is the external optimization task. The nested internal optimization task is also called the follower problem, or lower-level problem, and the two-level problem is referred to as a leader-follower problem or a Stackelberg game [76]. The follower problem acts as a constraint at the upper level; therefore, only solutions that are optimal with regard to the follower problem can be considered as candidate leader solutions. Definition: Assuming F : R^n × R^n → R to be the leader objective function and f : R^n × R^n → R to be the follower one, a BLOP can be stated analytically as shown below. There are two types of variables in a BLOP: the upper-level variables x_u and the lower-level variables x_l.
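Using the leader objective F and follower objective f introduced above, the textbook statement of a BLOP (with any additional constraints omitted for brevity) reads:

```latex
\begin{aligned}
\min_{x_u \in \mathbb{R}^n} \quad & F(x_u, x_l) \\
\text{s.t.} \quad & x_l \in \operatorname*{arg\,min}_{x_l \in \mathbb{R}^n} \; f(x_u, x_l).
\end{aligned}
```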
The optimization of the follower problem is done over the x_l variables, with the x_u variables acting as fixed parameters. As a result, each x_u defines a new follower problem, whose optimal solution is a function of x_u and must be found. In the leader problem, both kinds of variables (x_u, x_l) are involved, but the optimization is carried out only over x_u. Given this formal definition, BLOPs are intrinsically more complex to solve than single-level problems. It is not unexpected that the majority of previous research has focused on the more straightforward cases of BLOP, such as problems with convenient features like linear objective and constraint functions, convexity or quadraticity [77]. Although the first studies on bi-level optimization originate from the 1970s, the utility of these mathematical programs in representing hierarchical decision-making and engineering process challenges was not realized until the early 1980s. Since then, BLOPs have received special attention from researchers; Kolstad [75] compiled the first bibliographic survey on the subject in the mid-1980s (1985). Existing BLOP-solving methods can be divided into two groups: (1) classical methods and (2) evolutionary methods. The first family comprises extreme point-based approaches [78], branch-and-bound [79], penalty function methods [80], complementary pivoting [81] and trust region methods [82]. The disadvantage of these strategies is that they rely heavily on the mathematical properties of the BLOP in question. Metaheuristic algorithms, which are mostly evolutionary algorithms (EAs), belong to the second family. Several EAs have recently proved their efficacy in addressing such problems due to their insensitivity to the mathematical properties of the problem, as well as their capacity to handle enormous problem instances by offering satisfactory answers in a fair amount of time; a few notable works are [80, 81]. Appendix C Main principle of CEMBA As for each upper-level architecture there exists a whole search space of possible filter pruning decisions, the joint neural architecture search and (filter) compression are framed as a bi-level optimization problem. Motivated by the recent emergence of the field of EBO (evolutionary bi-level optimization) and the interesting results achieved in many application fields, we decided to use an evolutionary algorithm as a search engine to solve our bi-level optimization problem. The main difficulty faced in EBO is its considerable computational cost: the fitness evaluation of each upper-level solution requires running a whole lower-level evolutionary process to approximate the corresponding optimal lower-level solution. In the literature, there exist many EBO algorithms [83], but most of them focus on problems with continuous variables (using approximation techniques and gradient information); the number of algorithms designed for the discrete case is much smaller. Examples of discrete EBO algorithms are NBEA, BLMA, CODBA, CODCRO and CEMBA [84]. CEMBA has been demonstrated to be among the most effective bi-level EAs for dealing with discrete scenarios [17].
Every upper-level solution is evaluated in two stages: first, the upper-level variables are passed to the lower level; second, the indicator-based multi-objective local search approximates the solution with the maximum marginal contribution in terms of the multi-objective quality indicator and sends it back to the upper level to complete the quality evaluation of the considered solution. To summarize how it works, we review the search processes of its upper and lower levels (a simplified sketch of the overall loop is given after this list):
• Upper- and lower-population initialization The upper-level population and the lower-level population are created from scratch. The two starting populations are obtained by applying the Discrete Space Decomposition Method twice. The goal of using a decomposition approach is to produce uniform coverage of the decision space and, to the extent possible, a collection of solutions that are evenly dispersed over each level's decision space.
• Lower-level fitness assignment Assessing an upper-level solution in a BLOP requires executing an entire lower-level method, which is the fundamental challenge of bi-level optimization. As a result, we decompose the problem and solve each level with its own population. The lower-level algorithm of each lower population uses the upper solutions of the matching upper population to evaluate the lower-level solutions.
• Local search procedure The local search is applied to each lower-level population following the IBMOLS principle. We begin by computing the normalized values of the objective functions. Then, for each lower-level solution, we create a neighborhood and compute its fitness value using an indicator I and the normalized objective values. The fitness values are then updated, the worst solution is removed, and the fitness values are updated again. It is worth noting that neighborhood generation stops in one of two situations: (1) when the entire neighborhood of a solution has been examined, or (2) when a neighboring solution that improves with regard to I is discovered. Once all lower-level members have been visited, the entire local search procedure ends.
• Best-indicator-contribution lower-level solution determination Because evaluating the leader solutions of the upper-level population requires approximating the follower solution of each matching lower-level population, the lower-level solutions are compared in terms of indicator contribution and the best one is returned to the corresponding upper-level member.
• Upper-level indicator-based procedure After obtaining the lower-level solutions with the best indicator contribution for the follower problem, each upper-level population runs its algorithm based on IBEA: the individual with the lowest fitness value is found and deleted, and the fitness values of the remaining individuals are updated, until the stopping criterion is reached. Mating selection and variation are then applied. We should remark that using IBEA helps the algorithm approximate the best upper-level front.
• Migration strategy (every a generations) After a given number of generations a, a migration strategy is applied: the parameter b is used to select a set of the b best solutions in the follower objective space, which are exchanged between the populations.
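To make these steps concrete, the following is a deliberately simplified, single-objective toy sketch of such a co-evolutionary bi-level loop, written in Python. It is not the authors' implementation: the encodings, the surrogate joint_fitness, the mutation operators and the constants (POP, UPPER_GENS, LOWER_GENS, MIG_PERIOD, MIG_B) are all hypothetical placeholders, and the indicator-based (IBEA/IBMOLS) machinery is replaced by plain best/worst selection. It only illustrates the structure: each leader evaluation triggers a full follower run, and follower populations periodically exchange their best members.

import random

random.seed(0)

# Toy encodings (hypothetical): an upper-level architecture is a pair (NB, NNB);
# a lower-level pruning decision is a binary mask over a fixed filter budget.
N_FILTERS = 16
POP, UPPER_GENS, LOWER_GENS = 8, 10, 8
MIG_PERIOD, MIG_B = 3, 2  # migrate the MIG_B best follower masks every MIG_PERIOD generations

def random_arch():
    return (random.randint(2, 6), random.randint(2, 8))  # (NB, NNB)

def random_mask():
    return [random.randint(0, 1) for _ in range(N_FILTERS)]

def joint_fitness(arch, mask):
    # Hypothetical surrogate (higher is better): stands in for "accuracy after
    # pruning minus a complexity penalty"; no real CNN is trained or pruned here.
    nb, nnb = arch
    kept = sum(mask) / N_FILTERS
    accuracy_proxy = 1.0 - abs(0.6 - kept) - 0.02 * abs(nb * nnb - 18)
    complexity_penalty = 0.01 * nb * nnb * kept
    return accuracy_proxy - complexity_penalty

def lower_level_search(arch, lower_pop):
    # Follower: evolve pruning masks for a fixed architecture (a crude stand-in
    # for the indicator-based local search of the lower level).
    for _ in range(LOWER_GENS):
        parent = max(lower_pop, key=lambda m: joint_fitness(arch, m))
        child = [bit ^ (random.random() < 0.1) for bit in parent]  # bit-flip mutation
        worst = min(range(len(lower_pop)), key=lambda i: joint_fitness(arch, lower_pop[i]))
        lower_pop[worst] = child  # replace the worst mask
    return max(lower_pop, key=lambda m: joint_fitness(arch, m))

def bilevel_search():
    upper_pop = [random_arch() for _ in range(POP)]
    lower_pops = [[random_mask() for _ in range(POP)] for _ in range(POP)]  # one follower population per leader
    best = None
    for gen in range(UPPER_GENS):
        # Evaluate each leader by running a full follower search (the costly nested step).
        scored = []
        for idx, (arch, lpop) in enumerate(zip(upper_pop, lower_pops)):
            mask = lower_level_search(arch, lpop)
            scored.append((joint_fitness(arch, mask), idx, arch, mask))
        scored.sort(key=lambda t: t[0], reverse=True)
        if best is None or scored[0][0] > best[0]:
            best = (scored[0][0], scored[0][2], scored[0][3])
        # Leader variation: keep the top half and refill the population by mutation.
        survivors = [t[2] for t in scored[: POP // 2]]
        upper_pop = survivors + [
            (max(2, nb + random.choice((-1, 0, 1))), max(2, nnb + random.choice((-1, 0, 1))))
            for nb, nnb in survivors
        ]
        # Migration every MIG_PERIOD generations: copy the MIG_B best masks of the
        # best leader's follower population into the other follower populations.
        if (gen + 1) % MIG_PERIOD == 0:
            best_idx, best_arch = scored[0][1], scored[0][2]
            donors = sorted(lower_pops[best_idx], key=lambda m: joint_fitness(best_arch, m), reverse=True)[:MIG_B]
            for j, lpop in enumerate(lower_pops):
                if j != best_idx:
                    lpop[:MIG_B] = [list(m) for m in donors]
    return best

if __name__ == "__main__":
    fitness, arch, mask = bilevel_search()
    print("best fitness:", round(fitness, 3), "| (NB, NNB):", arch, "| kept filters:", sum(mask))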
The Taguchi method [85] is a more sophisticated alternative to the trial-and-error approach [86]. In order to further clarify and verify the proposed parameter tuning values, we applied the Taguchi method, in which a signal-to-noise ratio (SNR) is computed over the N performed runs; the SNR reflects both the variability and the mean of the experimental data (a common form of this statistic is recalled below). The parameters used for tuning are the following: (1) population size (Pop. size), (2) upper generation number (UGen. nb) and (3) lower generation number (LGen. nb). Three levels are considered for each parameter, and the corresponding orthogonal array is L27(3^3), i.e., 27 experiments with 3 variables at 3 levels each. Figure 11 displays the obtained SNR results for IB-CEMBA. Moreover, Fig. 12 displays the results computed for Bi-CNN-D-C in terms of mean upper-level fitness values in the Taguchi experimental analysis; the computed mean upper-level fitness values confirm the optimal levels obtained using the SNR parameter.
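For reference, one commonly used larger-the-better form of the Taguchi SNR, assumed here as an illustration with y_i denoting the response (fitness) observed in run i out of N runs, is:

\[
\mathrm{SNR} = -10 \, \log_{10}\!\left( \frac{1}{N} \sum_{i=1}^{N} \frac{1}{y_i^{2}} \right).
\]

For a smaller-the-better response, the term 1/y_i^2 inside the sum is replaced by y_i^2.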
References

Deep convolutional neural network architecture design as a bilevel optimization problem
A hybridization of deep learning techniques to predict and control traffic disturbances
Deep learning and case-based reasoning for predictive and adaptive traffic emergency management
Greedy layerwise training of deep networks
Deep learning
Simpler certified radius maximization by propagating covariances
Very deep convolutional networks for large-scale image recognition
Evolutionary optimization of convolutional neural networks for cancer miRNA biomarkers classification
A survey of swarm and evolutionary computing approaches for deep learning
Performance characterization of deep learning models for breathing-based authentication on resource-constrained devices
Evolutionary optimization of residual neural network architectures for modulation classification
Xception: deep learning with depthwise separable convolutions
Network trimming: a data-driven neuron pruning approach towards efficient deep architectures
Dutta T (2020) A survey on deep neural network compression: challenges, overview, and solutions
Advanced metaheuristic optimization techniques in applications of deep neural networks: a review
Evolutionary design of neural network architectures: a review of three decades of research
Solving combinatorial multi-objective bi-level optimization problems using multiple populations and migration schemes
Hybrid evolution of convolutional networks
The MNIST database of handwritten digit images for machine learning research
The MNIST database of handwritten digit images for machine learning research
Large-scale evolution of image classifiers
Aggregated residual transformations for deep neural networks
Evolutionary algorithms and neural networks
EvoDeep: a new evolutionary approach for automatic deep neural networks parametrisation
Regularized evolution for image classifier architecture search
Completely automated CNN architecture design based on blocks
A self-organizing multiobjective particle swarm optimization algorithm for multimodal multi-objective
A survey on deep neural network compression: challenges, overview, and solutions
Pruning deep convolutional neural networks architectures with evolution strategy
Pruning deep convolutional neural networks architectures with evolution strategy
ThiNet: a filter level pruning method for deep neural network compression
Exploiting linear structure within convolutional networks for efficient evaluation
Network trimming: a data-driven neuron pruning approach towards efficient deep architectures
Deep compression: compressing deep neural networks with pruning, trained quantization and Huffman coding
To compress, or not to compress: characterizing deep learning model compression for embedded inference
Performance characterization of deep learning models for breathing-based authentication on resource-constrained devices
Quantization and training of neural networks for efficient integer-arithmetic-only inference
Deep compression: compressing deep neural networks with pruning, trained quantization and Huffman coding
Predictive coding with neural nets: application to text compression
Deep compression: compressing deep neural networks with pruning, trained quantization and Huffman coding
Multi-agent deep neural networks coupled with LQF-MWM algorithm for traffic control and emergency vehicles guidance
Efficient neural network using pointwise convolution kernels with linear phase constraint
Sparsification and separation of deep learning layers for constrained resource inference on wearables
A knee-guided evolutionary algorithm for compressing deep neural networks
DeepMon: mobile GPU-based deep learning framework for continuous vision applications
Deep compression: compressing deep neural networks with pruning, trained quantization and Huffman coding
Universal codeword sets and representations of the integers
Optimal source codes for geometrically distributed integer alphabets (corresp.)
Genetic CNN
On the virtues of parameterized uniform crossover
An analysis of gray versus binary encoding in genetic search
An analysis of gray versus binary encoding in genetic search
Multi-criterion evolutionary design of deep convolutional neural networks
The reusable holdout: preserving validity in adaptive data analysis
Wrappers for feature subset selection
Structure discovery of deep neural network based on evolutionary algorithms
Pruning filters for efficient convnets
Parameter tuning for configuring and analyzing evolutionary algorithms
NSGA-Net: neural architecture search using multi-objective genetic algorithm. In: Genetic and Evolutionary Computation Conference
COVID-19 image data collection: prospective predictions are the future. Journal of Machine Learning for Biomedical Imaging (MELBA)
COVID-19 diagnosis on CT images with Bayes optimization-based deep neural networks and machine learning algorithms
Evolutionary optimization of convolutional neural network architecture design for thoracic X-ray image classification
Evolutionary optimization for CNN compression using thoracic X-ray image classification
Lung infection quantification of COVID-19 in CT images with deep learning
Detection of coronavirus disease (COVID-19) based on deep features
Deep learning system to screen coronavirus disease 2019 pneumonia
A deep learning algorithm using CT images to screen for corona virus disease (COVID-19)
Price forecasting for real estate using machine learning: a case study on Riyadh city
Traffic disturbance mining and feedforward neural network to enhance the immune network control performance
Deep learning-based appearance features extraction for automated carp species identification
A survey of deep learning techniques: application in wind and solar energy resources
Spatiotemporal modeling for nonlinear distributed thermal processes based on KL decomposition, MLP and LSTM network
A multiple reference point-based evolutionary algorithm for dynamic multi-objective optimization with undetectable changes
A review of the literature on bi-level mathematical programming
A study of the demand for butter in the United Kingdom
Mixed integer linear programming models to solve a real-life vehicle routing problem with pickup and delivery
An explicit solution to the multi-level programming problem
Midperipheral fundus involvement in diabetic retinopathy
Convex combinations of stable polynomials
Multi-objective Stackelberg game between a regulating authority and a mining company: a case study in environmental economics
Bilevel optimization based on kriging approximations of lower level optimal value function
A review on bilevel optimization: from classical to evolutionary approaches and applications
Solving combinatorial bi-level optimization problems using multiple populations and migration schemes
Taguchi Techniques for Quality Engineering: Loss Function, Orthogonal Experiments, Parameter and Tolerance Design
Parameter tuning for configuring and analyzing evolutionary algorithms

Acknowledgements The authors thank the Deanship of Scientific Research at Prince Sattam bin Abdulaziz University for supporting this work.

Conflict of interest The authors declare no conflict of interest.