1 Introduction

Machine learning is a field of artificial intelligence that aims to create computational systems capable of acquiring knowledge through accumulated experiences and making automated decisions [15]. Among the various applications of machine learning, its relevance in the healthcare field stands out, being employed in medical diagnosis, disease prediction, and clinical decision support [6]. These systems aim to assist and streamline the work of medical professionals.

In this context, in 2021, MedMNIST [24] emerged as an extensive collection of biomedical images developed to fine-tune and test classification models in medical applications. The MedMNIST dataset consists of 708,069 2D images distributed across 12 distinct datasets, as well as 9,998 3D images divided into 6 distinct sets. All images are standardized to a resolution of 28\(\,\times \,\)28 pixels and cover various medical contexts, including binary and multiclass classifications. The data is pre-divided into training, validation, and test sets. In addition to creating the dataset, the authors benchmarked different versions of the ResNet model [8] (ResNet-18 and ResNet-50) and three automated machine learning frameworks on each dataset.

In the realm of deep learning techniques, Convolutional Neural Networks (CNNs), such as the ResNets, have stood out as extremely effective in several real-world tasks involving processing visual data, such as object detection and image and video processing [17]. Although recent advances in these techniques have led to an increase in the accuracy of CNNs, they have significantly increased their size and inference time [14].

Therefore, the research area of neural network compression techniques is gaining increasing prominence, particularly pruning and quantization, which have been extensively explored in the literature. Broadly speaking, in pruning, less significant neuron connections are removed from the model. In quantization, the precision of the model parameters is reduced. In [14], it has been shown that different types of pruning present varying trade-offs in terms of compression, accuracy, and speedup. On the other hand, quantization demonstrated significant improvement in performance and reduced storage requirements.

Compressing machine learning models for medical applications is crucial for several reasons. First, many medical devices, such as portable ultrasound machines and wearable health monitoring devices, operate with limited computational resources. Compressed models enable these devices to perform tasks with reduced memory and computational power, making them more suitable for resource-constrained environments. Additionally, in medical settings, particularly during real-time applications like diagnosis and monitoring, the speed of inference is crucial. Compressed models facilitate faster execution, thereby decreasing the time required for critical decision-making. Also, compressed models typically consume less energy during inference, which is advantageous for battery-powered medical devices or situations where power conservation is important, as it can extend device runtime and reduce the need for frequent recharging or battery replacement. Furthermore, from a patient’s perspective, the wait time for a diagnosis can be highly stressful. Therefore, even a minor reduction in inference time can substantially alleviate patient anxiety. It is also noteworthy that minutes saved in a system performing numerous inferences can translate to substantial reductions in patient wait times overall.

When applying pruning and quantization techniques, several decisions must be made, such as determining the extent of connections to prune and selecting the layers for quantization. Previous research [2, 5, 9, 10, 22, 23] has shown that the compression challenge is essentially a multi-objective optimization problem, requiring the identification of optimal trade-offs between the computational efficiency of the model (in terms of time and memory consumption) and its accuracy. Except for the approach presented in [9], which uses reinforcement learning, all other studies employ various types of evolutionary algorithms (EAs) as optimization methods. However, a significant bottleneck in this process is the considerable number of function evaluations required to find satisfactory solutions, especially challenging in the compression problem since the objective function requires evaluating the compressed model over the entire test set.

The use of evolutionary algorithms for problems with expensive function evaluations is a well-known concern [4]. A well-established technique to address this issue is the use of surrogate models, also called meta-models [19, 20]. The idea is to build a computationally inexpensive model to substitute for the expensive objective and constraint functions, and to use this cheaper model to guide the optimization process instead of evaluating the expensive functions over the entire test set.

The primary goal of this work is to understand the impact of model compression techniques on the performance of Convolutional Neural Networks (CNNs) in health-related applications. Specifically, we employ a surrogate-based multi-objective compression methodology on the ResNet50 model, which has already been extensively tested on the MedMNIST dataset in [24]. This technique is evaluated on three MedMNIST datasets: RetinaMNIST for grading diabetic retinopathy severity, DermaMNIST for categorizing seven different diseases, and BloodMNIST for classifying blood cells into eight categories.

The results indicate that the proposed surrogate-based multi-objective compression method effectively identifies less computationally intensive models while maintaining or even enhancing accuracy across all three datasets. This underscores the practicality and significance of employing compression techniques, especially in the healthcare domain. Extending this technique to other healthcare-related applications with higher-quality neural network models holds promising potential for further validation and exploration. Notably, inference time was reduced by 50% on average, and accuracy improved by over 1%.

2 Artificial Neural Network Compression as a Multi-objective Optimization Problem

Deep artificial neural networks are robust, complex, and highly effective structures for solving various types of problems. However, their high computational cost can hinder their implementation on different devices [22]. In this section, we provide a brief overview of two compression techniques, pruning and quantization, including the relevant parameters that directly influence the final performance and quality of the compressed model.

2.1 Pruning and Quantization

Pruning, within the context of neural networks, can be defined as a compression technique. It involves the selective removal of connections or neurons with low importance scores while preserving the network’s accuracy, as discussed in previous works [14, 21]. These importance scores can be determined based on various criteria, including weight magnitude, sensitivity, or activation level, as outlined in [21]. The pruning process typically consists of three key steps: (1) assessing the importance scores of each connection or neuron, (2) eliminating the connections or neurons with the lowest scores, and (3) fine-tuning the remaining network to restore its accuracy. For achieving higher compression rates, pruning can be applied iteratively, and it can also be combined with other compression techniques, such as quantization, to further reduce the network’s size and complexity.
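The three steps above can be sketched with PyTorch's pruning utilities. This is a minimal illustration, not the paper's pipeline: the layer size and the 50% pruning amount are arbitrary choices of ours.

```python
import torch
import torch.nn.utils.prune as prune

# Toy layer standing in for one layer of a larger network.
layer = torch.nn.Linear(100, 50)

# Steps (1)+(2): score connections by |weight| (L1 magnitude) and zero
# out the 50% of connections with the lowest scores.
prune.l1_unstructured(layer, name="weight", amount=0.5)

# The pruned connections are masked to zero, lowering the layer's
# non-zero weights rate.
nonzero_rate = (layer.weight != 0).float().mean().item()

# Fold the mask into the tensor, making the pruning permanent.
prune.remove(layer, "weight")
```

Step (3), fine-tuning, is then ordinary training of the masked network and is omitted here.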

Quantization, as applied to an artificial neural network (ANN), serves as a model compression technique aimed at diminishing the number of bits required to represent each of the ANN’s parameters [7]. Typically, ANN parameters are represented using 32-bit floating-point numbers, thereby inflating both storage requirements and computational intricacies. Lowering the precision of these parameters not only accelerates model performance and curtails storage expenses but also retains adequate information for effective model inference, all while upholding accuracy [22].
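As a concrete illustration, PyTorch's post-training dynamic quantization converts the weights of selected layer types from 32-bit floats to 8-bit integers. The tiny model below is a stand-in of ours, not the paper's ResNet50:

```python
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(784, 256),
    torch.nn.ReLU(),
    torch.nn.Linear(256, 10),
)

# Replace every nn.Linear with a dynamically quantized counterpart whose
# weights are stored as 8-bit integers instead of 32-bit floats.
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

# The quantized model maps the same inputs to outputs of the same shape,
# with roughly 4x smaller linear-layer weight storage.
x = torch.randn(1, 784)
out = quantized(x)
```

Note that this quantization path runs on CPU, consistent with the hardware restriction discussed in Sect. 4.3.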

2.2 Compression Design Variables

The implementation of these compression techniques requires the specification of certain parameters that may vary based on the chosen approach. When using pruning, one must define whether it will apply to only convolutional layers, only linear layers, or both. Once the type of layer is selected, one must specify which layers to prune and the pruning intensity, which is the proportion of weights and biases that will be pruned in a layer. Additionally, quantization can be applied to the previously pruned structures. Therefore, seven design variables have been defined to optimize the compression process and achieve the most suitable resulting model. Below, the variables and their respective domains are outlined:

  • \(x_1\) (Linear Layer Pruning): \(x_1\) \(\in \) {0 (No pruning in the linear layers), 1 (Prune all linear layers), 2 (Prune a random set of linear layers)}

  • \(x_2\) (Convolutional Layer Pruning): \(x_2\) \(\in \) {0 (No pruning in the convolutional layers), 1 (Prune all convolutional layers), 2 (Prune a random set of convolutional layers)}

  • \(x_3\) (Pruning Intensity for the Linear Layer (%)): \(0\%<x_3<100\%\)

  • \(x_4\) (Pruning Intensity for the Convolutional Layer (%)): \(0\%<x_4<100\%\)

  • \(x_5\) (Pruning Type for the Linear Layer): \(x_5\) \(\in \) {0 (Prune weights), 1 (Prune biases), 2 (Prune both weights and biases)}

  • \(x_6\) (Pruning Type for the Convolutional Layer): \(x_6\) \(\in \) {0 (Prune weights), 1 (Prune biases), 2 (Prune both weights and biases)}

  • \(x_7\) (Quantization): \(x_7\) \(\in \) {0 (do not perform quantization), 1 (perform quantization)}

Variables \(x_1\) to \(x_6\) refer to compression by pruning, and \(x_7\) defines whether quantization is applied or not.
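A candidate solution is therefore a vector of seven values. A minimal sampler for this search space might look as follows; the function name is ours, and the intensities \(x_3\), \(x_4\) are encoded as fractions in (0, 1):

```python
import random

def sample_design_vector(rng=random):
    """Draw one random candidate from the search space of Sect. 2.2."""
    return [
        rng.randint(0, 2),      # x1: linear-layer pruning mode
        rng.randint(0, 2),      # x2: convolutional-layer pruning mode
        rng.uniform(0.0, 1.0),  # x3: linear pruning intensity (fraction)
        rng.uniform(0.0, 1.0),  # x4: convolutional pruning intensity (fraction)
        rng.randint(0, 2),      # x5: linear pruning type (weights/biases/both)
        rng.randint(0, 2),      # x6: convolutional pruning type
        rng.randint(0, 1),      # x7: apply quantization (0/1)
    ]

x = sample_design_vector()
```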

2.3 Objective Functions

As mentioned before, the compression problem is commonly posed as a multi-objective optimization problem, which aims at finding solutions with a good balance between the quality and the performance (execution time and memory consumption) of the resulting model.

More specifically, in this work, the compression problem is formulated as follows:

$$\begin{aligned} \begin{array}{rl} \min _{\textbf{x}} &{} f_1(\textbf{x}) = \text {FLOPs}(\textbf{x}) \times \text {non-zero weights rate}(\textbf{x}) \\ \max _{\textbf{x}} &{} f_2(\textbf{x}) = \text {Accuracy}(\textbf{x}) \\ \text {s.t.} &{} \text {Accuracy}(\textbf{x}) \ge \tau \end{array} \end{aligned}$$
(1)

where \(\textbf{x} = \left[ x_1, x_2, x_3, x_4, x_5, x_6, x_7\right] \) is defined over the search space presented in Sect. 2.2 and \(\tau \) = (0.96 \(\times \) Base Model Accuracy). That is, we allow for a decrease in accuracy of up to 4%.

The non-zero weights rate, which reflects the sparsity of the weight and bias tensors in the neural network model, is directly impacted by the pruning technique. The number of floating-point operations (FLOPs), in turn, is reduced by the application of quantization. When we multiply the FLOPs by the non-zero weights rate, we obtain an objective \(f_1\) that is affected by both pruning and quantization. Minimizing \(f_1\) entails increasing compression levels, which is desirable in terms of computational efficiency and resource utilization. Conversely, maximizing the objective \(f_2\) improves the overall quality of the neural network, as measured by relevant performance metrics such as accuracy and F-score.

When analyzing the problem of neural network compression in the literature, it is possible to observe a conflict between the amount of pruning performed and the accuracy obtained. This conflict is demonstrated in studies such as [22] and [14]. Thus, this paper addresses a multi-objective compression problem that involves seven variables (see Sect. 2.2), two objectives, and an accuracy constraint, which aims to avoid spending computational resources on the evaluation of poor solutions.
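The first objective can be computed directly from a compressed model once a FLOPs count is available. A sketch, assuming the FLOPs count comes from an external profiler (e.g. thop or fvcore), which we do not reproduce here:

```python
import torch

def nonzero_weights_rate(model):
    """Fraction of non-zero entries across all parameter tensors."""
    total, nonzero = 0, 0
    for p in model.parameters():
        total += p.numel()
        nonzero += (p != 0).sum().item()
    return nonzero / total

def f1(model, flops):
    """First objective of Eq. (1): FLOPs x non-zero weights rate."""
    return flops * nonzero_weights_rate(model)

# A dense (unpruned) model has a rate of ~1.0, so f1 equals the raw
# FLOPs; pruning lowers the rate and quantization lowers the FLOPs term.
model = torch.nn.Linear(10, 10)
rate = nonzero_weights_rate(model)
```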

3 Multi-objective Constrained Optimization Based on Surrogate Models

To find compression parameters that lead to good models, we adapt the approach proposed in [19] to the problem defined in Eq. (1). It consists of the following steps:

  1. Build an initial dataset \( D = \langle X,Y \rangle \) for generating the surrogate models. \( X \) is a set of \( n \) points sampled from the search space defined in Sect. 2.2, and \( Y \) stores the values of the objectives (FLOPs \(\times \) non-zero weights rate, accuracy) and the constraint (minimum allowed accuracy), as defined in Eq. (1), corresponding to the set \( X \).

  2. Generate a set, \( S_{o1} \), of surrogate models for the first objective.

  3. Generate a set, \( S_{o2} \), of surrogate models for the second objective.

  4. Generate a set, \( S_{c} \), of surrogate models for the constraint.

  5. Given a quality metric, \( M \) (e.g., mean absolute percentage error), use cross-validation to select the best surrogate in each set.

  6. Build a surrogate problem with the best surrogate models.

  7. Solve (approximately) the surrogate problem with a multi-objective evolutionary algorithm (MOEA).

  8. Randomly sample \( k \) feasible non-dominated solutions obtained by the MOEA. These will henceforth be called infill points.

  9. Evaluate these infill points using the original objective and constraint functions, and add the results to the dataset \( D \).

  10. Repeat steps 2 to 9 until the allotted budget for function evaluations is exhausted.

  11. Return \( D^* \), the set of feasible non-dominated solutions in \( D \).
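The loop above can be sketched on a cheap stand-in problem. In this toy version of ours, random candidate generation replaces CTAEA, a synthetic bi-objective function replaces the expensive compressed-model evaluation, the constraint is omitted, and a single surrogate family is used instead of cross-validated selection:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

def expensive_objectives(x):
    # Stand-in for evaluating a compressed model (f1, f2 of Eq. (1)).
    return np.array([x.sum() ** 2, -np.abs(x).sum()])

# Step 1: initial dataset D = <X, Y>.
X = rng.uniform(-1, 1, size=(50, 7))
Y = np.array([expensive_objectives(x) for x in X])

budget = 20
while budget > 0:
    # Steps 2-5: fit one surrogate per objective.
    s1 = RandomForestRegressor(n_estimators=20, random_state=0).fit(X, Y[:, 0])
    s2 = RandomForestRegressor(n_estimators=20, random_state=0).fit(X, Y[:, 1])

    # Steps 6-8: optimize the surrogate problem (random candidates and a
    # crude scalarization stand in for CTAEA) and pick 2 infill points.
    cand = rng.uniform(-1, 1, size=(200, 7))
    pred = s1.predict(cand) + s2.predict(cand)
    infills = cand[np.argsort(pred)[:2]]

    # Step 9: evaluate the infills exactly and grow the dataset.
    X = np.vstack([X, infills])
    Y = np.vstack([Y, [expensive_objectives(x) for x in infills]])
    budget -= len(infills)

# Only 50 initial points + 20 infills are ever evaluated with the true
# (expensive) functions; all other candidates cost only surrogate calls.
```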

Selecting an appropriate model for the objective functions and constraints is a critical step for the performance of any surrogate-based optimization algorithm. In this work, we have chosen to use the Mean Absolute Percentage Error (MAPE) as the quality metric. MAPE is a statistical metric used to measure the average absolute percentage differences between the predicted values and the actual values in a dataset. It is mathematically defined by:

$$\begin{aligned} \frac{1}{n} \left( \sum _{i=0}^{n-1}\frac{|Y_i - \hat{Y}_i|}{\max (\epsilon , |Y_i|)} \right) \end{aligned}$$
(2)

where n represents the number of samples, \(Y_i\) is the actual value, \(\hat{Y}_i\) is the predicted value, and \(\epsilon \) is a very small positive value to avoid division by zero. Because it penalizes errors more for values of lower magnitude, it emerges as a valuable quality metric for optimization, as demonstrated in [19]. The employed MOEA is the Two-Archive Evolutionary Algorithm for constrained Multiobjective Optimization (CTAEA) [12]. As demonstrated in the literature, CTAEA is an excellent algorithm for solving constrained multi-objective problems [1, 13].
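Eq. (2) translates directly into code:

```python
import numpy as np

def mape(y_true, y_pred, eps=1e-8):
    """Mean absolute percentage error, as in Eq. (2)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(np.abs(y_true - y_pred) / np.maximum(eps, np.abs(y_true)))

# Both predictions are off by 10% of their true value, so MAPE is 0.1;
# note that the same absolute error of 10 would weigh far more heavily
# on the smaller true value, which is the property exploited here.
err = mape([100.0, 10.0], [110.0, 11.0])  # 0.1
```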

4 Experimental Setup

In this section, we introduce the datasets used, the ResNet50 base model, and the parameter settings for the method presented in Sect. 3.

4.1 Datasets

The MedMNIST dataset [24] encompasses an extensive collection of biomedical images organized into various categories. For the experiments in this study, we chose the following three 2D datasets containing color images standardized to a 3\(\,\times \,\)28\(\,\times \,\)28 format:

  1. RetinaMNIST: This dataset contains 1,600 retinal fundus images, divided into 1,080 images for training, 120 for validation, and 400 for testing. Each image is associated with labels representing five distinct classes, making it suitable for an ordinal regression task aimed at classifying the severity of diabetic retinopathy into five different levels [24].

  2. DermaMNIST: A comprehensive collection of dermatoscopic images of common pigmented skin lesions. The dataset comprises 10,015 dermatoscopic images categorized into 7 different diseases, forming a multi-class classification task [24]. The dataset is divided into 7,007 images for training, 1,003 for validation, and 2,005 for testing.

  3. BloodMNIST: This dataset consists of images of individual normal blood cells captured from individuals without infections, hematologic or oncologic diseases, who were not undergoing pharmacological treatment at the time of blood collection. It encompasses a total of 17,092 images distributed across 8 distinct classes [24]. There are 11,959 images allocated for training, 1,712 for validation, and 3,421 for testing.

4.2 ResNet50

Similar to the dataset selection process, the choice of the ResNet architecture was influenced by its well-established presence in the literature, enabling comparisons across different studies. Specifically, we opted for the ResNet50 architecture [8], a common choice in studies on neural network compression. Moreover, this model was also the best-performing model among those tested on these datasets in [24].

The ResNet, short for Residual Network, is a deep convolutional neural network model that introduced residual connections, which allow the network to learn residual functions and thus alleviate the degradation in accuracy that often accompanies increasing network depth. ResNet50 has 50 layers, 49 of which are convolutional and only one fully connected. Its bottleneck blocks combine 1\(\,\times \,\)1 and 3\(\,\times \,\)3 convolutional filters, and batch normalization is applied after each convolution. This architecture has about 26 million parameters.

To train this architecture, the same parameters used in [24] were adopted, which are: (i) epochs: 100; (ii) learning rate: \(1e-3\); (iii) L2 regularization penalty: \(1e-3\); (iv) batch size: 128; (v) loss function: Cross-entropy; (vi) optimizer: Adam.
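In PyTorch, this configuration corresponds roughly to the setup below; the model is a stand-in of ours for the actual ResNet50, and Adam's weight_decay argument implements the L2 penalty:

```python
import torch

model = torch.nn.Linear(784, 8)          # stand-in for ResNet50
criterion = torch.nn.CrossEntropyLoss()  # (v) loss function
optimizer = torch.optim.Adam(            # (vi) optimizer
    model.parameters(),
    lr=1e-3,                             # (ii) learning rate
    weight_decay=1e-3,                   # (iii) L2 regularization penalty
)

# One illustrative step with a batch of 128 samples ((iv) batch size);
# the full schedule runs for 100 epochs ((i)).
x = torch.randn(128, 784)
y = torch.randint(0, 8, (128,))
loss = criterion(model(x), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```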

4.3 Experimental Procedure

To evaluate the compression performance of the proposed surrogate-based approach, we applied the methodology presented in Sect. 3 as follows:

  1. Load the untrained version of ResNet50 from PyTorch.

  2. Train the loaded architecture on the RetinaMNIST, DermaMNIST, and BloodMNIST training sets using the same parameters as [24].

  3. Run five independent executions of the procedure presented in Sect. 3 to solve the compression problem defined by Eq. (1). The parameters used in all the runs were:

     (a) Initial sampling: 50 function evaluations.

     (b) Optimizer: CTAEA.

     (c) Number of generations: 1000.

     (d) Population size: 50.

     (e) Evaluation function budget: 100.

     (f) Infills per iteration: 2.

     (g) Surrogate models: KNeighborsRegressor [16], RandomForest [16], XGBoost [3], LightGBM [11].

     (h) Surrogate selection metric: MAPE.

     (i) Number of independent runs: 5 for each dataset.

  4. The quality of each compressed solution is assessed with the RetinaMNIST, DermaMNIST, or BloodMNIST test set.

  5. The final set of all the evaluated solutions, as well as the hypervolume of the non-dominated set, is stored for each run. The hypervolume is defined as:

    Definition 1 (Hypervolume Indicator). Given a point set \(S \subseteq \mathbb {R}^d\) and a reference point \(r \in \mathbb {R}^d\), the hypervolume indicator of S is the measure of the region weakly dominated by S and bounded above by r, i.e.:

    $$\begin{aligned} H(S) = \varLambda (\{ q \in \mathbb {R}^d | \exists p \in S: p \le q \text { and } q\le r\}) \end{aligned}$$
    (3)

    where \(\varLambda (\cdot )\) denotes the Lebesgue measure.

  6. The Friedman test was used to compare the different algorithms and algorithm versions and, when necessary, Dunn's test was used as a post-hoc. The adopted confidence level \(\alpha \) was set to 0.05.
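For two objectives, the hypervolume of Definition 1 reduces to summing the areas of the rectangles dominated by a sorted non-dominated front. A minimal sketch of ours, assuming both objectives are minimized (a maximized objective such as accuracy would first be negated):

```python
def hypervolume_2d(points, r):
    """Hypervolume of a 2-D non-dominated set w.r.t. reference point r."""
    hv, prev_f2 = 0.0, r[1]
    for f1, f2 in sorted(points):          # ascending in the first objective
        if f1 <= r[0] and f2 < prev_f2:    # skip out-of-bounds/dominated points
            hv += (r[0] - f1) * (prev_f2 - f2)
            prev_f2 = f2
    return hv

# Three mutually non-dominated points against reference point (4, 4):
# rectangles of area 3, 2, and 1, so H = 6.
front = [(2.0, 2.0), (1.0, 3.0), (3.0, 1.0)]
hv = hypervolume_2d(front, r=(4.0, 4.0))  # 6.0
```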

To ensure that the surrogates were not negatively affecting compression, we conducted a test using pure CTAEA without surrogate models. The experiment was performed within the same budget of 100 function evaluations, which roughly translates to 5 h of CPU time on machines with an Intel(R) Xeon(R) 2.200 GHz, 12.68 GB RAM and a disk of 107.72 GB. It should be noted that GPUs were not used in this study due to the limitations of the current quantization implementation [18].

The pure CTAEA was tested in two different configurations: (i) 10 generations with a population size of 10, and (ii) 20 generations with a population size of 5. These configurations were chosen to strike a balance, within the small evaluation budget, between a sufficiently large population size and enough generations for the population to evolve and improve.

5 Results

In this section, we analyze the performance of our methodology from three different perspectives. First, in Sect. 5.1, we evaluate the accuracy of our base models in comparison to the ResNet50 model as described in [24]. Second, in Sect. 5.2, we compare our proposed compression method, which utilizes surrogates, to the original versions of the optimizer. Finally, in Sect. 5.3, we assess the trade-offs between quality and performance for the compressed model relative to the base models.

5.1 Accuracy of Base Models Prior to Compression

We acknowledge that model performance can fluctuate slightly due to factors such as the use of different frameworks, random splitting into mini-batches, and random weight initialization, among others. To ensure that our base models (ResNet50 before compression) achieve accuracy comparable to benchmarked results, in this section we compare the implemented models with the results obtained by [24]. Table 1 presents the obtained results.

Table 1. Accuracy of Implementations

As shown in Table 1, the accuracy difference between our implementation and that of Yang et al. [24] was within 2% on the datasets where ours was lower, and approximately 3% on those where ours was higher. Therefore, we conclude that we have reasonably accurate starting models for testing the method presented in Sect. 3.

5.2 Assessing the Impact of Employing Surrogate Models in the Optimization Process

In this section, we aim to determine whether the surrogate-based approach confers an advantage over the standard, non-surrogate-based version of the CTAEA by performing a comparison across three different datasets. Figure 1 shows the box plots for the surrogate-based methodology alongside the two standard CTAEA configurations. Specifically, CTAEA1 employs 20 individuals that undergo 5 generations of evolution, while CTAEA2 uses 10 individuals that undergo 10 generations of evolution.

As shown in Fig. 1(a), the approach employing surrogate models yielded larger hypervolumes for RetinaMNIST. The box plots also reveal lower variance in the values, indicating that the proposed method leads to more robust solutions, characterized by higher average hypervolumes and a lower standard deviation.

Fig. 1.
figure 1

Comparison between the surrogate-based and the raw version of CTAEA - ResNet50 on (a) RetinaMNIST, (b) DermaMNIST, and (c) BloodMNIST. The hypervolume scale is \(10^7\).

When we analyze the results for the DermaMNIST dataset, as illustrated in Fig. 1(b), we observe similar outcomes to those obtained with the RetinaMNIST dataset. The approach utilizing surrogate models demonstrated advantages over the standard CTAEA model, and this advantage became even more pronounced for the DermaMNIST dataset. Additionally, we notice a lower standard deviation, indicating that the results were again more consistent.

Finally, Fig. 1(c) highlights the results for the BloodMNIST dataset. Once again, the surrogate-based approach yields better hypervolumes for the non-dominated sets and more consistent results in terms of solution quality. Thus, in all the analyzed cases, the surrogate-based approach outperformed both variants of the plain CTAEA (p < 0.05). It is important to note that CTAEA1 produced highly variable outcomes between runs, while CTAEA2 was slightly more consistent; however, for the DermaMNIST dataset, CTAEA2 exhibited exceptionally unfavorable results.

5.3 Analysis of the Non-Dominated Set of Compressed Solutions

Figure 2 shows the non-dominated solutions with the lowest hypervolume, i.e., the worst set of solutions among the five runs, achieved by the surrogate-based multi-objective approach.

Fig. 2.
figure 2

ResNet50 - Objective function space (first column) and test set inference time by non-zero weights rate (second column). The first row relates to RetinaMNIST, the second to DermaMNIST, and the last to BloodMNIST.

Figures 2(a) and 2(b) show the results for the RetinaMNIST dataset. The blue points were obtained with the proposed approach, the green ‘x’ represents the base model performance before compression, and the red triangle the results from MedMNIST [24]. In Fig. 2(a), we can see that compression by pruning and quantization led to models with fewer FLOPs which, in some cases, also presented better accuracy. Although we did not optimize for inference time, Fig. 2(b) shows that, on the test set, it was reduced from 30 s to less than 15 s for the compressed models. The inference time for the models of [24] was not reported.

Figures 2(c) and 2(d) present the results for the DermaMNIST dataset, and Figs. 2(e) and 2(f) those for BloodMNIST. The dots represent the compressed solutions, with blue indicating the use of both pruning and quantization, and orange the use of pruning only; the compression strategy was determined by the optimization algorithm. The base model is represented by a green ‘x’, and the MedMNIST model by a red triangle. The results for DermaMNIST were similar to those obtained for RetinaMNIST: the compressed models showed a significant reduction in the number of floating-point operations (FLOPs) and halved the inference time. For BloodMNIST, as shown in Figs. 2(e) and 2(f), the inference time was again reduced by about half. In terms of accuracy, however, the compressed models did not surpass the results of [24], although they still improved upon the base model.

It is evident that most of the non-dominated solutions involve the joint application of pruning and quantization techniques for model compression. Quantization plays a crucial role in this process, substantially reducing the number of floating-point operations (FLOPs) and shrinking the network size in megabytes (MB). The pruning technique, on the other hand, drastically reduces the rate of non-zero weights, enabling simpler operations and opening the possibility of applying alternative encodings to the neural network structure, such as sparse matrix encoding, which also reduces the network's size in megabytes. The combination of these two techniques results in substantially less complex neural networks with significantly shorter inference times compared to the original architecture. The right-hand charts of Fig. 2 illustrate this comparison, showing that the compressed networks are, on average, more than twice as fast.
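The sparse matrix encoding mentioned above is easy to illustrate: once pruning has zeroed most weights, storing a layer in CSR format requires memory only for the surviving entries. The matrix size and 90% pruning rate below are illustrative choices of ours:

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)
w = rng.standard_normal((512, 512)).astype(np.float32)
w[rng.random(w.shape) < 0.9] = 0.0   # simulate ~90% of weights pruned

dense_bytes = w.nbytes               # 512*512*4 bytes, zeros included
w_csr = sparse.csr_matrix(w)         # stores only the non-zero entries
sparse_bytes = (
    w_csr.data.nbytes + w_csr.indices.nbytes + w_csr.indptr.nbytes
)
# The CSR representation is several times smaller than the dense one.
```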

Confusion Matrices of the Most Accurate Compressed Solutions: The methodology adopted in this study chose accuracy as the primary metric for comparison. However, especially in medical contexts, it is relevant to explore other metrics that address different analyses, such as false positives, false negatives, among other characteristics. Therefore, Fig. 3 displays confusion matrices for the three datasets, considering both the original and compressed solutions that resulted in higher accuracy.

Fig. 3.
figure 3

Confusion matrices for the original architecture and its compressed version with the highest accuracy. The first row relates to RetinaMNIST, the second to DermaMNIST, and the last to BloodMNIST.

Several patterns emerge from this analysis. First, as observed in the previous section, there is a modest increase in the accuracy of the compressed models compared to the base models. Furthermore, the profile of false positives and false negatives remains largely consistent between the compressed and base models, with one exception: on the RetinaMNIST dataset, the compressed model exhibited a greater bias toward the majority class. Given the resource constraints and the fact that accuracy was one of the objectives, some degree of bias toward the dominant class is an anticipated result of the compression process. Notably, such a bias did not manifest in the DermaMNIST dataset, despite its significant class imbalance, indicating that while compression can introduce or accentuate biases in some cases, it does not do so uniformly across datasets. Overall, the compressed models retain the hits and misses of the base models, which suggests that knowledge from the base model is transferred to the compressed model. The variability in these outcomes underscores the need for broader experimentation to fully understand the typical effects of model compression across various datasets.

6 Conclusion

The proposed surrogate-based optimization approach has showcased its ability to significantly reduce computational overhead while maintaining or even enhancing the accuracy of the models across the three distinct datasets: RetinaMNIST, DermaMNIST, and BloodMNIST.

Particularly, the compressed models have demonstrated a substantial reduction in floating-point operations (FLOPs) and inference times, which is critical in medical applications where timely and accurate diagnosis is paramount. Although the compression process introduced a bias towards majority classes in the RetinaMNIST dataset, this was not observed in the DermaMNIST dataset, indicating that the effects of compression techniques can vary across different datasets and applications.

These findings suggest that the surrogate-based approach is a promising strategy for optimizing the trade-offs between model size, computational efficiency, and performance in healthcare-related applications. The ability to deploy lightweight yet accurate models on resource-constrained medical devices could significantly improve real-time diagnostics and patient monitoring, making advanced healthcare more accessible.

Future research should aim to validate these techniques across a broader array of medical datasets and deep learning models. In addition, refining the surrogate models to enhance their predictive accuracy and reliability could further improve the methodology. The ultimate goal is to facilitate the integration of efficient, high-performing CNNs into the healthcare sector, thereby contributing to the advancement of medical AI and improving patient outcomes.