Hyperparameter Optimization for COVID-19 Chest X-Ray Classification
Ibraheem Hamdi; Muhammad Ridzuan; Mohammad Yaqub
2022-01-26

Despite the introduction of vaccines, Coronavirus disease (COVID-19) remains a worldwide dilemma, continuously developing new variants such as Delta and the recent Omicron. The current standard for testing is the polymerase chain reaction (PCR). However, PCR tests can be expensive, slow, and/or inaccessible to many people. X-rays, on the other hand, have been in routine use since the early 20th century and are relatively cheap, quick to obtain, and typically covered by health insurance. With a careful selection of model, hyperparameters, and augmentations, we show that it is possible to develop models with 83% accuracy in binary classification and 64% in multi-class classification for detecting COVID-19 infections from chest x-rays.

COVID-19 is an infectious disease that causes mild to moderate respiratory illness and, in severe cases, death. Over 2.5 million new cases were reported in the last 24 hours [1], and 60% of the US population is expected to be infected by March of 2022 [2]. PCR, a molecular test that identifies the disease in genetic material from collected swabs, is commonly used to detect it [3]. However, the test can cost over 400 US dollars [4], and results are taking longer and longer to arrive due to the rise in cases [5, 6, 7]. It is therefore becoming increasingly important to develop a cheaper, faster, and more accessible screening procedure. X-rays were discovered in the late 19th century and have been used in medicine ever since [8]. A chest x-ray can be obtained within 15 minutes, including positioning time [9], making it quick enough for rapid diagnosis. A model that can recognize and classify COVID-19 infections from chest x-rays would therefore be immensely beneficial, especially at a time when early quarantine is critical to limiting the spread of the virus.

For learning purposes, we made use of the vanilla DenseNet-121 [10] architecture as well as the TorchXRayVision [11] library's variants, which are pre-trained on x-ray data from different dataset combinations. We investigated the effect of model, hyperparameter, and augmentation choices on classification accuracy.

In the beginning, the program was written as a Jupyter Notebook using only vanilla DenseNet-121 and no augmentations. Each experiment required manually modifying the code in multiple locations. With the introduction of TorchXRayVision's models and additional augmentations, mistakes during these modifications became more likely, and traceback errors started to appear. The solution was to add a section at the beginning of the code that asked the user which augmentations, hyperparameters, and model were required. At that time, the only way to stop the code before it reached the set maximum number of epochs was to manually interrupt the program. Furthermore, model checkpoints were being created at every epoch, slowing down training and using up too much storage space. To address this, one user-defined validation-accuracy threshold was introduced to start saving checkpoints and another to stop the program. As the project grew, it became obvious that a better, more automated system was needed, preferably with features like early stopping, to save the best model and conserve GPU time, and online logging, to enable remote monitoring.
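As an illustration of the kind of setup we were after (a hedged sketch rather than the code we eventually used; the metric name, patience, and epoch count are placeholders), early stopping and best-model checkpointing map directly onto PyTorch Lightning callbacks:

```python
import pytorch_lightning as pl
from pytorch_lightning.callbacks import EarlyStopping, ModelCheckpoint

# Stop training once validation accuracy stops improving, instead of
# manually interrupting the run.
early_stopping = EarlyStopping(monitor="val/acc", mode="max", patience=10)

# Keep only the best checkpoint rather than one per epoch, saving disk
# space and checkpoint-writing time.
checkpoint = ModelCheckpoint(monitor="val/acc", mode="max", save_top_k=1)

trainer = pl.Trainer(max_epochs=100, callbacks=[early_stopping, checkpoint])
# trainer.fit(model, datamodule=datamodule)  # model and datamodule defined elsewhere
```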
Luckily, a presentation by one of the students at the university introduced us to a potential solution. The Hydra-Lightning-Template [12] became the core of this program. It enables easy experimentation, hyperparameter search with Optuna [13], and online logging through Weights & Biases (Wandb) [14]. It is built on PyTorch Lightning and uses Hydra's '.yaml' configuration files to define experiment settings such as the hyperparameters and the model. This immensely simplified experimentation by minimizing the amount of boilerplate code, allowing us to produce different models without manually changing the code.

Optuna is an automatic hyperparameter optimization framework that integrates with PyTorch Lightning. It has a simple, intuitive interface that is platform-agnostic and lightweight, with minimal infrastructure dependencies, making it extremely versatile. Its features include pruning to terminate unpromising trials and speed up optimization, easy parallelization for scalability across multiple machines, quick visualization through a built-in dashboard, and an abundance of optimization samplers [15]. This makes Optuna more suitable for our experimentation than alternatives such as GPyOpt, SageMaker, or Google Vizier, which do not offer all of these advantages and features together [16].

The Society for Imaging Informatics in Medicine (SIIM) partnered with the Foundation for the Promotion of Health and Biomedical Research of Valencia Region (FISABIO), the Medical Imaging Databank of Valencia Region (BIMCV), and the Radiological Society of North America (RSNA) to create a Kaggle competition aimed at advancing the identification of COVID-19 pneumonia from chest x-rays. The dataset provided is composed of over 6,000 scans in DICOM format belonging to four classes: 1,676 'Negative for Pneumonia', 2,855 'Typical Appearance', 1,049 'Indeterminate Appearance', and 474 'Atypical Appearance' [17]. To ensure data balance in the binary experiments, 50% of the data consisted of 'Negative for Pneumonia' scans, while the other half was a random mix of the other three classes. To avoid confusing the model with lateral x-rays, 230 data folders containing more than one DICOM file were excluded. Furthermore, the dataset was stratified according to the assigned labels to ensure a balanced distribution of classes. In our experiments, we used a 70:20:10 ratio to split the data into training, validation, and testing sets.

Optuna offers flexibility through a wide range of samplers. These include the Tree-structured Parzen Estimator (TPE), Gaussian Processes (GP), Covariance Matrix Adaptation Evolution Strategy (CMA-ES), and Random and Grid Search [18]. TPE is Optuna's default and recommended sampler, especially for experiments with fewer than 1,000 trials, and was therefore used. Within the Hydra-Lightning-Template, a configuration file can be created to define the optimization settings: the number of trials, the optimization metric, the sampler (optimization technique), the hyperparameters, and their search ranges are all specified inside a '.yaml' file. Optuna then uses the sampler to maximize (or minimize) the optimization metric, validation accuracy in our case, by trying different hyperparameter values within the search space. Based on the learning curve, Optuna may also prune (early-stop) an unpromising trial to save time and move on to another set of hyperparameters.
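For illustration, such a search configuration might look as follows; this is a sketch in the style of the template's Hydra Optuna sweeper configs, where the parameter paths (model.lr, model.dropout, datamodule.batch_size) are hypothetical placeholders for our actual config keys, and the ranges shown match the search limits reported later:

```yaml
# @package _global_
defaults:
  - override /hydra/sweeper: optuna

# Metric that Optuna maximizes across trials.
optimized_metric: "val/acc"

hydra:
  sweeper:
    direction: maximize
    n_trials: 25                              # number of trials (placeholder)
    sampler:
      _target_: optuna.samplers.TPESampler    # Optuna's default sampler
    params:
      model.lr: interval(0.0001, 0.001)       # learning-rate search range
      model.dropout: interval(0.0, 0.2)       # dropout-rate search range
      datamodule.batch_size: choice(8, 16, 32, 64, 128)
```

A sweep is then launched in Hydra's multirun mode, for instance with 'python train.py -m hparams_search=covid_optuna', where the config name matches whatever the file above is saved as.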
The Pydicom [19] library was used to read the x-ray images provided in DICOM format; its 'pixel_array' attribute converts an image into a 2-dimensional NumPy array. However, vanilla DenseNet-121 expects a 3-channel input, while the TorchXRayVision models expect a grayscale image of size 224 x 224. We therefore designed our Data Module to handle files differently according to the model specified for the experiment.

According to the configuration of each experiment, geometric transformations such as scaling, shear, rotation, and translation, as well as horizontal and vertical flipping, were applied to investigate the effect of augmentations on model accuracy. Scaling, shear, rotation, and translation were chosen because they naturally occur in x-ray imaging as a result of positioning error, while horizontal and vertical flipping can occur as a result of input or scanning error; together, these transformations are a good representation of what is seen in the real world. Color transformations, however, would not help, since x-rays are always produced in grayscale.

The DenseNet-121 model was chosen for this task due to its reported success with chest x-rays in the literature [20]. In addition, TorchXRayVision provides nine DenseNet-121 models pretrained on different x-ray datasets such as CheXpert, NIH, and PadChest [11]. We encountered an issue downloading the weights for the 'JF Healthcare' model, so it was discarded. Each of the pretrained models has 18 outputs corresponding to the pathology labels shared across the TorchXRayVision datasets (including, for example, Atelectasis, Consolidation, Pneumonia, Edema, Effusion, and Cardiomegaly). When calling the TorchXRayVision models, the number of classes is set to 2 or 4 according to the experiment. Vanilla DenseNet-121 does not offer such an option, so its classifier was replaced with 'nn.Linear(num_features, num_classes)', where num_features is the in_features of the original classifier and num_classes is set to 2 or 4 depending on whether the experiment is binary or multi-class. To keep the architecture consistent across all experiments, the ResNet model from TorchXRayVision was not used.

Using Optuna for hyperparameter optimization, we experimented with augmentations, batch size, dropout rate, model choice, and learning rate. The search ranges were set to 0.0001-0.001 for the learning rate, 0%-20% for the dropout rate, +/-15% for rotation, +/-30% for scale, +/-30% for shear, and 0%-100% for translation. Horizontal and vertical flipping and batch sizes of 8, 16, 32, 64, and 128 were also studied. The Adam optimizer and cross-entropy loss were used for all of our experiments.
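To make the data handling and model setup concrete, here is a minimal sketch under the assumptions above; the torchxrayvision weights tag is the library's identifier for its combined-dataset model, while the resizing step is a simplified placeholder for our actual preprocessing:

```python
import numpy as np
import pydicom
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision
import torchxrayvision as xrv

def load_dicom(path: str) -> torch.Tensor:
    # 'pixel_array' converts the DICOM pixel data into a 2-D NumPy array.
    img = pydicom.dcmread(path).pixel_array.astype(np.float32)
    return torch.from_numpy(img)[None, None]  # shape (1, 1, H, W)

def preprocess(img: torch.Tensor, for_xrv: bool) -> torch.Tensor:
    # TorchXRayVision models take a single-channel 224 x 224 input;
    # vanilla DenseNet-121 takes a 3-channel (ImageNet-style) input.
    img = F.interpolate(img, size=(224, 224), mode="bilinear", align_corners=False)
    return img if for_xrv else img.repeat(1, 3, 1, 1)

num_classes = 2  # 2 for binary, 4 for multi-class

# Vanilla DenseNet-121: swap the ImageNet classifier for a 2- or 4-way head.
vanilla = torchvision.models.densenet121(pretrained=True)
vanilla.classifier = nn.Linear(vanilla.classifier.in_features, num_classes)

# A TorchXRayVision DenseNet-121 pretrained on the combined x-ray datasets.
xrv_model = xrv.models.DenseNet(weights="densenet121-res224-all")
```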
To begin, the models were trained using a learning rate of 0.0003 and a batch size of 32. TorchXRayVision's "ALL" model, which is pretrained on all of the datasets combined, achieved the highest baseline accuracy of 81.61% (see Table 1) and was therefore used for our binary experiments. We observed that the maximum validation accuracy peaked at a batch size of 64 and decreased on either side (see Table 2), reflecting the fact that the ability of a model to generalize degrades beyond a certain batch size [21]. With a validation accuracy of 82.81%, we decided to use a batch size of 64 from that point on for our binary experiments.

Instead of assuming 0.0003, we then searched for the optimal learning rate using the "ALL" model and a batch size of 64. Validation accuracy peaked at a learning rate of 0.0001737 (see Table 3) and dropped above or below that value. With too low a learning rate, the model takes much longer to reach the optimal solution and may even underfit the data; with a very high learning rate, the model is unable to converge because the steps are too large [22]. The highest validation accuracy of 82.48% was achieved with a learning rate of 0.0001737, and we estimate that the optimal learning rate lies between 0.00016 and 0.00018.

Dropout is a technique used to boost accuracy and prevent models from overfitting by randomly dropping units from layers during training [23]. This way, the network is forced to learn with incomplete information, enabling it to generalize better. Using the "ALL" model, a batch size of 64, and a learning rate of 0.0001737, the best model achieved a validation accuracy of 80.3% with a 13.06% dropout rate (see Table 4). Contrary to previous intuition, dropout did not improve validation accuracy. It is possible that the network needed more time to learn with dropout, and that better accuracy could have been achieved with longer training. Given this decrease in accuracy, we proceeded without dropout when experimenting with augmentations, again using the "ALL" model, a batch size of 64, and a learning rate of 0.0001737. The model achieved its highest validation accuracy of 80.85% using a scale of +/-24.12%, a shear of +/-18.09%, and a translation of 64.25%, without any horizontal or vertical flipping (see Table 5).

After binary classification, we experimented with classifying all four labels separately. Using a learning rate of 0.0003, baseline performance was obtained for all eight models. The highest validation accuracy of 62.15% was again achieved by the "ALL" model (see Table 6), so we used it for our multi-class experiments as well. A batch size of 32 achieved the highest validation accuracy of 62.39% (see Table 7), so we used it for all following experiments. Instead of assuming 0.0003 again, we used Optuna to search for the optimal learning rate with the "ALL" model and a batch size of 32. The highest validation accuracy of 63.18% was reached with a learning rate of 0.0005913 (see Table 8); we estimate that the optimal learning rate lies between 0.0005 and 0.0006. Using the "ALL" model, a batch size of 32, and a learning rate of 0.0005913, we noticed a slight improvement with dropout: the best model achieved a validation accuracy of 63.6% with a 13.94% dropout rate (see Table 9), and we estimate that the best dropout rate lies between 13% and 14%. With the "ALL" model, a batch size of 32, a learning rate of 0.0005913, and a dropout rate of 13.94%, we then experimented with augmentations. The best-performing model achieved a validation accuracy of 64.38% using +/-1.437% scale, +/-5.252% shear, and 27.61% translation (see Table 10).

As shown above, the accuracy of our multi-class models hardly exceeded the 60% mark before overfitting the data. After inspecting our confusion matrices (see Fig. 1) and F1 scores (see Fig. 2), we discovered that the models are simply unable to recognize the third ('Indeterminate Appearance') and fourth ('Atypical Appearance') classes, especially the latter. This is likely a direct consequence of having less training data for these classes than for the other two.
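As a sketch of how such a per-class inspection can be done (not our exact evaluation code; the label arrays below are placeholders for the validation labels and model predictions), the confusion matrix and per-class F1 scores can be computed with scikit-learn:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score

class_names = ["Negative for Pneumonia", "Typical Appearance",
               "Indeterminate Appearance", "Atypical Appearance"]

# Placeholder arrays; in practice these come from the validation set.
y_true = np.array([0, 1, 2, 3, 0, 1, 1, 2])
y_pred = np.array([0, 1, 1, 1, 0, 1, 0, 1])

cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 2, 3])  # rows: true, cols: predicted
per_class_f1 = f1_score(y_true, y_pred, labels=[0, 1, 2, 3], average=None)

for name, score in zip(class_names, per_class_f1):
    print(f"{name}: F1 = {score:.3f}")
```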
According to a study [24], up to 29% of negative PCR tests can become positive when repeated. With careful tuning of augmentations, hyperparameters, and model, we were able to produce a model with 83% accuracy that can be used as a quick way to detect COVID-19 early and take precautionary measures, even without a positive PCR result. However, for multi-class classification, even with further hyperparameter optimization, we do not believe it is possible to exceed an accuracy of 65-66%. In other words, we can classify the presence of COVID-19 pneumonia from chest x-rays, but not the severity of the infection.

Augmentations are widely accepted as a method to improve accuracy, especially when there is a lack or imbalance of data [26]. In the case of our binary experiments, however, this concept did not hold true. Adding dropout was also detrimental to the performance of our binary models: it caused the training accuracy to peak and then decrease, which limited overfitting but did not translate into better validation accuracy.

After examining the final results of the competition, we discovered that the highest score achieved was 63.5% [25]; the Dice score was used because the challenge asked for bounding boxes, not just classification. With that score in mind, we believe there is a limitation within the dataset itself, regardless of the model, learning rate, dropout, batch size, and/or augmentation selection. For future work, we would still like to experiment with other models, optimizers, and losses, as well as GAN-based data augmentations [27], to improve our multi-class model.

References
Canadians struggle with flight delays, testing backlogs as they return from U.S.
Densely Connected Convolutional Networks.
mlmed/torchxrayvision: TorchXRayVision: A library of chest X-ray datasets and models (GitHub).
ashleve/lightning-hydra-template: PyTorch Lightning + Hydra, a feature-rich template for rapid, scalable and reproducible ML experimentation with best practices (GitHub).
Optuna: A hyperparameter optimization framework.
Weights & Biases: Developer tools for ML.
Optuna: A Define-by-Run Hyperparameter Optimization Framework (SciPy).
CheXNet: Radiologist-Level Pneumonia Detection on Chest X-Rays with Deep Learning.
On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima.
A disciplined approach to neural network hyper-parameters: Part 1, learning rate, batch size, momentum, and weight decay.
Dropout: A Simple Way to Prevent Neural Networks from Overfitting.
Interpreting a COVID-19 test result.
Data Augmentation Can Improve Robustness.
GAN-based Data Augmentation for Chest X-ray Classification.