title: Mixing-AdaSIN: Constructing a De-biased Dataset using Adaptive Structural Instance Normalization and Texture Mixing
authors: Kang, Myeongkyun; Chikontwe, Philip; Luna, Miguel; Hong, Kyung Soo; Ahn, June Hong; Park, Sang Hyun
date: 2021-03-26

Abstract. Following the pandemic outbreak, several works have proposed to diagnose COVID-19 with deep learning in computed tomography (CT), reporting performance on par with experts. However, models trained/tested on the same in-distribution data may rely on the inherent data biases for successful prediction, failing to generalize to out-of-distribution samples or CT scans acquired with different scanning protocols. Early attempts have partly addressed bias mitigation and generalization through augmentation or re-sampling, but are still limited by collection costs and the difficulty of quantifying bias in medical images. In this work, we propose Mixing-AdaSIN, a bias mitigation method that uses a generative model to produce de-biased images by mixing texture information between differently labeled CT scans with semantically similar features. Here, we use Adaptive Structural Instance Normalization (AdaSIN) to enhance de-biased generation quality and guarantee structural consistency. Subsequently, a classifier trained with the generated images learns to correctly predict the label without relying on bias and generalizes better. To demonstrate the efficacy of our method, we construct a biased COVID-19 vs. bacterial pneumonia dataset based on CT protocols and compare against existing state-of-the-art de-biasing methods. Our experiments show that classifiers trained with the de-biased generated images report improved in-distribution performance and generalization on an external COVID-19 dataset.

1 Introduction

Deep learning models often exploit underlying biases in the data for better predictions, yet they fail to generalize when the bias shifts in external data, reporting lower performance. This has critical implications, especially in medical imaging where biases are hard to define or accurately quantify. To address this, extensive data augmentation [5] or re-sampling is often employed, though both are still limited by collection costs (multi-institute data) and the difficulty of expressing the bias when it is unknown. Thus, there is a need for methods that mitigate bias, in part or fully, towards improved model performance and better generalization.

In general, models trained on biased data achieve high accuracy yet, despite their capacity, lack the incentive to learn the full complexity of the intended task. For instance, a model trained on our in-house biased dataset with COVID-19 and bacterial pneumonia reported a 97.18% f1-score on the validation set, yet degraded to 33.55% when evaluated on an unbiased test set. Here, we believe the bias may originate from varied CT protocols based on exam purpose, scanners, and contrast delivery requirements [3]. Though contrast CT is a standard protocol, it is challenging for practitioners to meet its requirements during the pandemic due to extra processes such as contrast agent injection and disinfection [21, 11]. Further, protocols may also vary for other pneumonias that exhibit imaging characteristics similar to COVID-19 in CT. Consequently, we believe biased datasets are often constructed unexpectedly, and sometimes unavoidably, due to the aforementioned factors.
Among the existing techniques proposed to remove a model's dependence on bias, augmentation is the de-facto technique for medical images, while other methods pre-define the bias the trained model should be invariant to. The latter assumes the bias is easily defined, but extra care is needed in the medical setting where such assumptions do not hold. To address this, we propose to construct a de-biased dataset in which spurious features based on texture information become uninformative for accurate prediction. A key motivation is that accurate discrimination of COVID-19 from other pneumonias depends on the CT protocols related to texture features and contrast. Thus, we propose to generate COVID-19 CTs with bacterial pneumonia protocol characteristics and, vice versa, bacterial pneumonia CTs with COVID-19 protocol characteristics. Specifically, we propose Mixing-AdaSIN, a generative bias removal framework that leverages Adaptive Structural Instance Normalization (AdaSIN) and texture mixing to generate de-biased images used to train classifiers robust to the original bias. For image generation, we employ two main components: (a) texture mixing, which enables realistic image generation, and (b) AdaSIN, which guarantees structural consistency and prevents bias retention from the input image by modifying the distribution of the structure feature. To prevent incorrect image generation, we first pre-train a contrastive encoder [8] to learn key CT features and later use it to search similar image pairs for the texture mixing step in the proposed generative framework. For evaluation, we construct biased train/validation sets based on the CT protocol, and an unbiased test set from CT protocols that do not overlap with those of the train/validation sets. The proposed method reports high bias mitigation performance (66.97% to 80.97%) and shows improved generalization when verified on an external dataset. The main contributions of this work are summarized as follows:

- We propose a generative model that can sufficiently mitigate the bias present in the training dataset using AdaSIN and texture mixing.
- We show that the use of contrastive learning for texture transfer pair selection prevents incorrect image generation.
- We constructed a biased COVID-19 vs. bacterial pneumonia dataset to verify bias mitigation performance. Our approach not only enabled improvements for standard classification models such as ResNet18, but also for current state-of-the-art COVID-19 classification models.
- We demonstrate the generalization performance of our classifier trained with the de-biased data on an external public dataset.

2 Related Work

CT based COVID-19 Classification. Several methods have been proposed to address this task since the inception of the pandemic [12, 16, 24, 25]. For instance, Li et al. [16] encode CT slices using a 2D CNN and aggregate slice predictions via max-pooling to obtain a patient-level diagnosis. Wang et al. [24] proposed COVID-Net, an approach that utilizes the long-range connectivity among slices to increase diagnostic performance. Later, Wang et al. [25] further improved COVID-Net by employing batch normalization and contrastive learning to make the model robust to multi-center training. However, these models do not address the bias in the dataset and thus may fail to generalize to other datasets.

Bias Mitigation. To mitigate bias, Alvi et al. [1] employed a bias classifier and a confusion loss to regularize the extracted features so that they become indistinguishable to the bias classifier. Kim et al. [15] proposed to mitigate bias through mutual information minimization and the use of a gradient reversal layer. Though these methods can mitigate distinct biases such as color, they fail in the medical domain since the bias from CT protocols is subtle and hard to distinguish even for humans [2]. Another line of work comprises augmentation-based models that utilize techniques such as arbitrary style transfer. Geirhos et al. [5] proposed Shape-ResNet, an approach that fine-tunes a model pre-trained on AdaIN-generated images [10]. Li et al. [18] proposed a multi-task learning approach that enables accurate prediction using shape, texture, or both types of features. Though successful, a key drawback is the heavy reliance on artistic image generation techniques that may be detrimental to medical images and subsequent diagnoses. In contrast, our approach captures subtle differences in the image to generate consistent texture-updated images, so classifiers trained on the generated images can avoid subtle bias information.

Generative Texture Transfer. Methods [10] and [6] both generate texture-updated images based on arbitrary style transfer, with adaptive and conditional instance normalization employed to transfer texture information, respectively. CycleGAN [27] is another popular method for texture transfer that relies on consistency across several model outputs. However, these techniques not only change the texture but also induce structural distortion in the outputs, which may introduce new forms of bias. Recently, Park et al. [20] achieved better results by using an autoencoder with texture swapping as well as a co-occurrence patch discriminator to capture high-level detail. In this method, however, the discriminator may change the original structural characteristics, which is undesirable for medical images. Since our main objective is to maintain structural information, we avoid techniques such as cycle consistency and patch discriminators that often produce structurally distorted images.

3 Proposed Method

Let D = {X^1, ..., X^n} denote the training dataset, where each CT scan X^i = {x^i_1, ..., x^i_m} is a set of slices with label y^i ∈ {0, 1} denoting bacterial pneumonia and COVID-19 samples, respectively. Our goal is to generate a dataset D̄ = {X̄^1, ..., X̄^n}, where each X̄^i is a set of texture-updated CT slices that may contain the bias information of the other label's CT protocol. To achieve this, we first pre-train a contrastive encoder on D to learn representative slice features and then use it for slice similarity search, i.e., finding pairs x^i_1, x^j_2 ∼ D with y^i ≠ y^j and semantically similar image structures. Second, to generate x̄^i_1 we feed a searched pair x^i_1, x^j_2 to an encoder network E that outputs structure and texture features used as input for a generator network G; here, AdaSIN and texture mixing are employed on the respective features for improved generation. Lastly, following standard practice for adversarial methods, we also employ a discriminator network D to improve the quality of generation. To enforce style and content similarity, a pre-trained VGG19 [23] is used to optimize a content loss between x̄_1 and x_1, and a style loss between x̄_1 and x_2, respectively. Through this process, we generate images to construct D̄, and then combine D and D̄ to train a classifier that learns to ignore the bias information. The overall framework is presented in Figure 1, with specific details of each step categorized below.
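To make the above concrete, the generation step can be sketched as follows. This is a minimal illustration of the pipeline as described, assuming hypothetical handles E, G, and adasin for the encoder, generator, and AdaSIN modules; it is not the authors' released code.

```python
import torch

def generate_debiased(E, G, adasin, x1, x2):
    """One generation step: keep the structure of slice x1 and inject the
    texture of its paired slice x2 from the other class (sketch only)."""
    s1, _t1 = E(x1)       # structure feature map and texture vector of x1
    s2, t2 = E(x2)        # same for the paired slice x2
    s = adasin(s1, s2)    # re-normalize structure statistics to curb bias
    return G(s, t2)       # render x1's anatomy under x2's texture
```

As we read the method, the generated x̄_1 inherits x_1's label for classifier training, while its texture now reflects the other class's protocol.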
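A minimal sketch of this pairing step, assuming a pre-trained MoCo backbone (`moco_encoder`) that maps a batch of pre-processed slices to embedding vectors; the names and tensor shapes are our assumptions:

```python
import torch

@torch.no_grad()
def build_pairs(moco_encoder, covid_slices, bacterial_slices):
    """Pair each COVID-19 slice with its structurally closest bacterial
    pneumonia slice via L1 distance in the MoCo embedding space."""
    z_cov = moco_encoder(torch.stack(covid_slices))      # (N, d) embeddings
    z_bac = moco_encoder(torch.stack(bacterial_slices))  # (M, d) embeddings
    dists = torch.cdist(z_cov, z_bac, p=1)               # pairwise L1, (N, M)
    nearest = dists.argmin(dim=1)                        # closest cross-label slice
    return [(covid_slices[i], bacterial_slices[j])
            for i, j in enumerate(nearest.tolist())]
```

torch.cdist computes all pairwise L1 distances at once; for large slice collections this would be chunked to bound memory.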
Slice Similarity Search. As opposed to arbitrarily sampling pairs for texture transfer in our generative framework, we employ similarity-based sampling of image pairs with similar structural features across the classes. Here, we pre-train a momentum contrastive network (MoCo) [8] on the training data and find the closest slice based on the L1 distance between embeddings. This is crucial for generation, since using arbitrary image pairs for texture transfer can produce artificial images without any clinical significance. Following this, we construct image pairs over the entire dataset for image generation.
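Both operations can be sketched directly from the definitions above. The AdaSIN function follows the equation given; the modulated convolution follows the StyleGAN2 formulation cited in [14]. Tensor shapes and the mapping from the texture vector to per-layer styles are assumptions on our part:

```python
import torch
import torch.nn.functional as F

def adasin(s1, s2, eps=1e-5):
    """AdaSIN: re-normalize structure feature s1 with the per-channel
    statistics of s2. s1, s2: (B, C, H, W) structure maps from encoder E."""
    mu1, sd1 = s1.mean(dim=(2, 3), keepdim=True), s1.std(dim=(2, 3), keepdim=True)
    mu2, sd2 = s2.mean(dim=(2, 3), keepdim=True), s2.std(dim=(2, 3), keepdim=True)
    return sd2 * (s1 - mu1) / (sd1 + eps) + mu2

def modulated_conv2d(x, weight, style, demodulate=True, eps=1e-8):
    """Texture injection via weight (de)modulation as in StyleGAN2 [14].
    x: (B, Cin, H, W); weight: (Cout, Cin, k, k); style: (B, Cin), derived
    from the texture vector t_2."""
    B, Cin, H, W = x.shape
    Cout, _, k, _ = weight.shape
    w = weight.unsqueeze(0) * style.view(B, 1, Cin, 1, 1)  # modulate
    if demodulate:                                         # demodulate
        w = w * torch.rsqrt((w ** 2).sum(dim=(2, 3, 4), keepdim=True) + eps)
    # Grouped-convolution trick: one weight set per batch element.
    x = x.reshape(1, B * Cin, H, W)
    w = w.reshape(B * Cout, Cin, k, k)
    out = F.conv2d(x, w, padding=k // 2, groups=B)
    return out.reshape(B, Cout, H, W)
```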
To train the entire framework, we follow the arbitrary style transfer training procedure described in [10]. A pre-trained VGG19 is used to extract features from the input pairs, which are then used in the style loss L_style and the content loss L_content, respectively. The style loss minimizes the mean squared error (MSE) between the generated output and the texture image, whereas the content loss minimizes the MSE between the generated output and the structure image [17]. Further, we use an adversarial loss to improve image generation quality:

L_GAN(G, D) = −E_{x_1, x_2 ∼ D}[D(G(AdaSIN(s_1, s_2), t_2))],

with regularization (non-saturating loss) omitted for simplicity [7, 14]. The final loss is the sum of these terms:

L = L_GAN + L_style + L_content.
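A sketch of the training losses under these definitions. The torchvision VGG19 feature extractor and the choice of layers (up to relu4_1) are assumptions, as is matching style via per-channel feature statistics, which follows the AdaIN recipe in [10]; grayscale CT slices are assumed to be replicated to three channels:

```python
import torch
import torch.nn.functional as F
from torchvision import models

# VGG19 features up to relu4_1 (index 21 in torchvision's vgg19().features).
vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1).features[:21].eval()
for p in vgg.parameters():
    p.requires_grad_(False)

def content_loss(generated, structure_img):
    # MSE between VGG features of the output and the structure source x_1.
    return F.mse_loss(vgg(generated), vgg(structure_img))

def style_loss(generated, texture_img, eps=1e-5):
    # MSE between per-channel feature statistics of the output and the
    # texture source x_2 (AdaIN-style statistic matching).
    fg, ft = vgg(generated), vgg(texture_img)
    loss = F.mse_loss(fg.mean(dim=(2, 3)), ft.mean(dim=(2, 3)))
    loss += F.mse_loss(fg.std(dim=(2, 3)) + eps, ft.std(dim=(2, 3)) + eps)
    return loss

def gan_loss_g(d_fake):
    # Generator term of L_GAN: maximize D's score on generated images.
    return -d_fake.mean()
```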
Implementation Details. A Mask-Cascade-RCNN-ResNeSt-200 [26] with deformable convolutions (DCN) [4] was employed to extract the lung and lesion regions in the CT scans and mask out the non-lung regions.

4 Experiments

Settings. For evaluation, we constructed an in-house biased dataset for training/validation and an unbiased testing dataset. First, we extracted the CT protocol per scan using the metadata in the DICOM files and created COVID-19 and bacterial pneumonia splits based on the protocol. The dataset and bias information are shown in Table 1. To evaluate the effect of bias mitigation using the generated images for classification, a pre-trained ResNet18 [9] and recent COVID-19 classification models [16, 24, 25], i.e., COVNet, COVID-Net, and Contrastive-COVIDNet, were compared. The models were trained for 100 epochs with a batch size of 64 and a learning rate of 0.001. We also applied random crops, horizontal flips, and intensity (brightness and contrast) augmentations as baseline augmentation techniques. We further report a performance comparison of our approach against recent state-of-the-art non-generation-based bias mitigation methods [15, 1] proposed for natural image classification. To verify the effectiveness of the proposed method, we include comparisons against a commonly used arbitrary style transfer model, i.e., AdaIN [10], and the current state-of-the-art generation method, i.e., the swapping autoencoder [20]. For a fair comparison of the generation-based methods, the same texture pairs were utilized for generation. Also, training and validation were performed three times for all methods, with the final average performance reported.

Results. In Table 2, we present the evaluation results on the biased COVID-19 vs. bacterial pneumonia dataset. Initially, the baseline model shows a high f1-score on the validation set, i.e., 97.18%, yet drops significantly to 33.55% on the unbiased test set. This shows that the classifier makes predictions based on the bias information. The learning-based models, i.e., Learning-not-to-learn [15] and Blind-eye [1], show no considerable performance improvement, highlighting their failure to mitigate bias for medical domain images, as reported in [2]. These methods were proposed to remove bias that is distinctly recognizable in the image, such as color; capturing and mitigating the subtle bias differences in medical images is considerably harder for such techniques. Among the generation-based methods, the proposed method reports the best performance. Even though AdaIN can transfer texture well, the quality of the generated images is extremely low; consequently, this inhibits classifier training, as shown by the limited performance improvements. Though the swapping autoencoder updates the texture with high-quality image generation, two major drawbacks were noted: (i) the generated image still retains bias, and (ii) it distorts key representative characteristics by artificially translating lesions, i.e., changing a COVID-19 lesion to appear as bacterial pneumonia. Such phenomena may be due to the direct use of structure features and of the co-occurrence discriminator, which deforms structural information. On the other hand, our model employs two modules: texture mixing, boosting high-quality image generation; and AdaSIN, minimizing bias artifact transfer. We consider these techniques instrumental in mitigating bias. In addition, the classifier trained with the de-biased data employs more generalized features, as shown in Figure 2: the Grad-CAM [22] of the trained classifier points to the lesion more accurately, so better generalization performance can be expected. Our proposed method can also be easily applied to existing CT-based COVID-19 classification models. In particular, COVNet reported a 74.61% f1-score, which represents successful mitigation of bias artifacts. However, COVID-Net and Contrastive COVID-Net showed relatively low accuracy, mainly due to slight differences in training details and architectures; also, due to the long-range connectivity in these models, reliance on bias information is heavily induced during training.

Fig. 2. Grad-CAM [22] visualizations of the ResNet18 [9] classifier. Grad-CAM of the baseline classifier highlights normal lung areas, whereas the classifier trained with Mixing-AdaSIN highlights the lesion correctly; the de-biased model can thus be expected to generalize better.
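For reference, the Grad-CAM maps in Figure 2 can be reproduced with standard forward/backward hooks. This is a minimal sketch of the Grad-CAM computation itself, not the authors' visualization code; `target_layer` would typically be the last convolutional block of ResNet18 (e.g., model.layer4):

```python
import torch
import torch.nn.functional as F

def grad_cam(model, target_layer, x, class_idx):
    """Minimal Grad-CAM [22] sketch for a CNN classifier."""
    feats, grads = [], []
    h1 = target_layer.register_forward_hook(lambda m, i, o: feats.append(o))
    h2 = target_layer.register_full_backward_hook(
        lambda m, gi, go: grads.append(go[0]))
    logits = model(x)
    model.zero_grad()
    logits[:, class_idx].sum().backward()
    h1.remove(); h2.remove()
    # Weight each feature map by the mean of its gradients, then ReLU.
    w = grads[0].mean(dim=(2, 3), keepdim=True)            # (B, C, 1, 1)
    cam = F.relu((w * feats[0]).sum(dim=1, keepdim=True))  # (B, 1, h, w)
    return F.interpolate(cam, size=x.shape[2:], mode="bilinear",
                         align_corners=False)
```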
To verify the generalization efficacy of our trained classifier on external data, we employ the publicly available MosMed dataset [19], which consists of 1110 CT scans from COVID-19 positive patients. However, the dataset contains CT scans that are not consistent with the pneumonia observed in our original dataset; thus, we selected scans of severe and critical patients only to evaluate the trained models. In addition, since we trained three classifiers in the internal experiments, we tested each classifier and report the final average performance. In Table 3, the results are fairly consistent with the improvements shown in the internal dataset evaluation. Our model shows a significant improvement over the baseline, with a +1% gain over the swapping autoencoder. More importantly, even though the classifier had not observed CT samples with a different protocol, performance was still consistent, verifying the utility of the proposed de-biasing technique.

5 Conclusion

In this work, we have proposed a novel methodology to train a COVID-19 vs. bacterial pneumonia classifier that is robust to bias information present in the training data. We constructed an in-house biased training dataset in conjunction with an unbiased testing dataset and showed that our method allows the classifier to learn the appropriate features to correctly predict the labels without relying on bias information, achieving better generalization. This was possible thanks to an image generation design that relies on two major components: (a) texture mixing, which enables realistic image generation, and (b) AdaSIN, which prevents bias flowing from the input to the output image during generation while maintaining structural consistency. We demonstrated the benefits of our pipeline by achieving the best bias mitigation performance compared to related methods on both our in-house dataset and an external dataset. Considering that biases can easily be included when constructing datasets, we hope our findings help improve performance in various medical tasks.

References

[1] Turning a blind eye: Explicit removal of biases and variation from deep neural network embeddings
[2] Debiasing skin lesion datasets and models? Not so fast
[3] Contrast agents in diagnostic imaging: Present and future
[4] Deformable convolutional networks
[5] ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness
[6] Exploring the structure of a real-time, arbitrary neural artistic stylization network
[7] Improved training of Wasserstein GANs
[8] Momentum contrast for unsupervised visual representation learning
[9] Deep residual learning for image recognition
[10] Arbitrary style transfer in real-time with adaptive instance normalization
[11] Chest CT practice and protocols for COVID-19 from radiation dose management perspective
[12] Quantitative assessment of chest CT patterns in COVID-19 and bacterial pneumonia patients: a deep learning perspective
[13] Progressive growing of GANs for improved quality, stability, and variation
[14] Analyzing and improving the image quality of StyleGAN
[15] Learning not to learn: Training deep neural networks with biased data
[16] Using artificial intelligence to detect COVID-19 and community-acquired pneumonia based on pulmonary CT: evaluation of the diagnostic accuracy
[17] Demystifying neural style transfer
[18] Shape-texture debiased neural network training
[19] MosMedData: data set of 1110 chest CT scans performed during the COVID-19 epidemic
[20] Swapping autoencoder for deep image manipulation
[21] Role of computed tomography in COVID-19
[22] Grad-CAM: Visual explanations from deep networks via gradient-based localization
[23] Very deep convolutional networks for large-scale image recognition
[24] COVID-Net: A tailored deep convolutional neural network design for detection of COVID-19 cases from chest X-ray images
[25] Contrastive cross-site learning with redesigned net for COVID-19 CT classification
[26] ResNeSt: Split-attention networks
[27] Unpaired image-to-image translation using cycle-consistent adversarial networks