title: Pseudo Bias-Balanced Learning for Debiased Chest X-ray Classification
authors: Luo, Luyang; Xu, Dunyuan; Chen, Hao; Wong, Tien-Tsin; Heng, Pheng-Ann
date: 2022-03-18

Deep learning models have frequently been reported to learn from shortcuts such as dataset biases. As deep learning plays an increasingly important role in the modern healthcare system, there is a great need to combat shortcut learning in medical data and to develop unbiased and trustworthy models. In this paper, we study the problem of developing debiased chest X-ray diagnosis models from biased training data without knowing the bias labels exactly. We start from two observations: the imbalance of the bias distribution is one of the key causes of shortcut learning, and dataset biases are preferred by the model if they are easier to learn than the intended features. Based on these observations, we propose a novel algorithm, pseudo bias-balanced learning, which first captures and predicts per-sample bias labels via the generalized cross entropy loss and then trains a debiased model using the pseudo bias labels and a bias-balanced softmax function. To the best of our knowledge, we are among the first to tackle dataset biases in medical images without explicit labeling of the bias attributes. We constructed several chest X-ray datasets with various dataset bias situations and demonstrated with extensive experiments that our proposed method achieves consistent improvements over other state-of-the-art approaches.

To date, deep learning (DL) has achieved comparable or even superior performance to experts on many medical image analysis tasks [16]. Robust and trustworthy DL models are hence needed more than ever to unleash their potential in solving real-world healthcare problems. However, a common trust failure of DL is that models reach high accuracy without learning from the intended features: for example, using backgrounds to distinguish foreground objects [18], using gender to classify hair colors [19], or, worse yet, using patients' position to determine COVID-19 pneumonia from chest X-rays [4]. This phenomenon is called shortcut learning [5], where the DL model chooses unintended features, or dataset biases, for decision making. To a greater or lesser extent, biases are introduced during the creation of datasets [21]. Meanwhile, the general objective of recognition tasks is to minimize the risk of mapping the inputs to the output predictions over the training data. If the dataset biases frequently co-occur with the primary targets, the model may take shortcuts by learning from such spurious correlations to minimize the empirical risk over the whole training set. As a result, dramatic performance drops can be observed when the model is applied to other data that does not share the same covariate shift. In the field of medical image analysis, shortcut learning has also been frequently reported, including but not limited to: using hospital tokens to recognize pneumonia cases [24]; learning confounding patient and healthcare variables to identify fracture cases; relying on chest drains to classify pneumothorax cases [15]; or leveraging shortcuts to determine COVID-19 patients [4].
These findings reveal that shortcut learning makes deep models less explainable and less trustworthy to doctors as well as patients, and addressing shortcut learning is a far-reaching topic for modern medical image analysis.

To combat shortcut learning and develop debiased models, a major branch of previous work is based on data re-weighting to learn more from the less biased data. For instance, REPAIR [12] proposed to solve a minimax problem between the classifier parameters and dataset re-sampling weights. Group distributionally robust optimization [19] prioritized worst-group learning, which was also largely implemented through data re-weighting. Yoon et al. [23] proposed to address dataset bias with a weighted loss and a dynamic data sampler. Another direction of work emphasizes learning invariance across different environments. Invariant risk minimization [1] penalized the loss gradient across environments to obtain an unbiased estimator. More recently, contrastive learning and mutual information minimization have also been used to learn invariant representations across different environments [20, 27]. However, these methods assume that the dataset biases are explicitly annotated, which may be infeasible in realistic situations given the burden and expertise required for labeling, especially for medical images. Recently, some approaches have made efforts to relax the dependency on explicit bias labels. Nam et al. [14] proposed to learn a debiased model by mining high-loss samples with a highly biased model. Lee et al. [11] further incorporated feature swapping between the biased and debiased models to augment the training samples. Yet, very few methods have attempted to efficiently address shortcut learning in medical data without explicit labeling of the biases.

In this paper, we are among the first to tackle the challenging problem of developing debiased medical image analysis models without explicit labeling of the bias attributes. We first observed that the imbalance of the bias distribution is one of the key causes of shortcut learning, and that dataset biases are preferred when they are easier to learn than the intended features. We thereby propose a novel algorithm, namely pseudo bias-balanced learning (PBBL). PBBL first develops a highly biased model by emphasizing learning from the easier features. The biased model is then used to generate pseudo bias labels, which are later utilized to train a debiased model with a bias-balanced softmax function. We constructed several chest X-ray datasets with various bias situations to evaluate the efficacy of the debiased model. We demonstrate that our method is effective and robust under all scenarios and achieves consistent improvements over other state-of-the-art approaches.

Let X be the set of input data, Y the set of target attributes that we want the model to learn, and B the set of bias attributes that are irrelevant to the targets. Our goal is to learn a function f : X → Y that is not affected by the dataset bias. We built the following chest X-ray datasets for our study.

Source-biased Pneumonia (SbP): For the training set, we first randomly sampled 5,000 pneumonia cases from MIMIC-CXR [8] and 5,000 healthy cases (no findings) from NIH [22]. We then sampled 5,000 × r% pneumonia cases from NIH and the same number of healthy cases from MIMIC-CXR. Here, the data source becomes the dataset bias, and the health condition is the target to be learned.
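For concreteness, the training split described above can be assembled with a few lines of pandas. The sketch below is purely illustrative and is not the authors' data-preparation code; it assumes hypothetical metadata tables `mimic_df` and `nih_df` with a `label` column taking the values "pneumonia" or "no_finding".

```python
import pandas as pd

def build_sbp_train(mimic_df: pd.DataFrame, nih_df: pd.DataFrame, r: int, seed: int = 0):
    """Assemble a source-biased pneumonia training split (illustrative sketch)."""
    n_aligned = 5000               # bias-aligned cases per group
    n_conflict = 5000 * r // 100   # bias-conflicting cases per group (5,000 x r%)
    parts = [
        mimic_df[mimic_df.label == "pneumonia"].sample(n_aligned, random_state=seed),
        nih_df[nih_df.label == "no_finding"].sample(n_aligned, random_state=seed),
        nih_df[nih_df.label == "pneumonia"].sample(n_conflict, random_state=seed),
        mimic_df[mimic_df.label == "no_finding"].sample(n_conflict, random_state=seed),
    ]
    return pd.concat(parts, ignore_index=True)
```

With r = 10, this yields 10,000 bias-aligned and 1,000 bias-conflicting samples, i.e., roughly the 90% biased-sample ratio discussed below.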
We created the validation and testing sets by equally sampling 200 and 400 images, respectively, from each group (with or without pneumonia; from NIH or MIMIC-CXR). We varied r to be 1, 5, and 10, which led to biased-sample ratios of 99%, 95%, and 90%, respectively. Moreover, as overcoming dataset bias can lead to better external validation performance [5], we included 400 pneumonia cases and 400 healthy cases from PadChest [2] to evaluate the generalization capability of the proposed method. Note that we converted all images to JPEG format to prevent the data format from becoming another dataset bias.

Gender-biased Pneumothorax (GbP): A previous study [10] pointed out that gender imbalance in medical datasets can lead to a biased and unfair classifier. Based on this finding, we constructed two training sets from the NIH dataset [22]: 1) GbP-Tr1: 800 male samples with pneumothorax, 100 male samples with no findings, 800 female samples with no findings, and 100 female samples with pneumothorax; 2) GbP-Tr2: 800 female samples with pneumothorax, 100 female samples with no findings, 800 male samples with no findings, and 100 male samples with pneumothorax. For the validation and testing sets, we equally collected 150 and 250 samples, respectively, from each group (with or without pneumothorax; male or female). Here, gender becomes the dataset bias and the health condition is the target that the model aims to learn.

Following previous studies [14, 11], we call a sample bias-aligned if its target and bias attributes are highly correlated in the training set (e.g., (pneumonia, MIMIC-CXR) or (healthy, NIH) in the SbP dataset). On the contrary, a sample is bias-conflicting if its target and bias attributes are combined in the opposite way (e.g., (pneumonia, NIH) or (healthy, MIMIC-CXR)).

Our first observation is that bias-imbalanced training data leads to a biased classifier. Based on the SbP dataset, we trained the model under two different settings: i) SbP with r = 10; ii) bias-balancing by equally sampling 500 cases from each group. The results are shown in Fig. 1a and Fig. 1b, respectively. Clearly, when the dataset is bias-imbalanced, learning the bias-aligned samples is favored. On the contrary, balancing the biases mitigates shortcut learning even with less training data.

For a better interpretation, we adopt the causal assumption [13] that the data X is generated from both the target attributes Y and the bias attributes B, which are independent of each other, as shown in Fig. 1c. The conditional probability p(y = j|x) can hence be formalized as

p(y = j \mid x) = \frac{p(x \mid y = j, b)\, p(y = j \mid b)}{\sum_{i} p(x \mid y = i, b)\, p(y = i \mid b)},

where p(y = j|b) raises a distributional discrepancy between the biased training data and the ideal bias-balanced data (e.g., the testing data). Moreover, according to the experimental analysis above, the imbalance also makes the model favor learning from bias-aligned samples, which finally results in a biased classifier.
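One way to remove the skewed prior p(y = j|b) in the controlled comparison above is group-balanced sampling instead of subsampling. The sketch below is a minimal illustration using PyTorch's WeightedRandomSampler, assuming binary target labels y and binary bias labels b are available, as in this controlled experiment; it is not part of the proposed method, which does not assume access to bias labels.

```python
import torch
from torch.utils.data import WeightedRandomSampler

def bias_balanced_sampler(y: torch.Tensor, b: torch.Tensor) -> WeightedRandomSampler:
    """Draw mini-batches in which every (target, bias) group is equally represented."""
    group = y * 2 + b                            # four groups for binary y and b
    counts = torch.bincount(group, minlength=4).float()
    weights = 1.0 / counts[group]                # inverse group frequency per sample
    return WeightedRandomSampler(weights, num_samples=len(y), replacement=True)
```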
To tackle the bias-imbalanced situation, let k be the number of classes and n_{j,b} the number of training samples of target class j with bias class b. We define φ_j to be the desired conditional probability of the bias-balanced dataset, i.e., the probability in the equation above with p(y = j|b) = 1/k, and φ̂_j to be the conditional probability of the biased training dataset, with p(y = j|b) = n_{j,b} / \sum_{i=1}^{k} n_{i,b}. We can then derive a bias-balanced softmax as follows.

Theorem 1. If φ can be expressed by the standard Softmax function of the logits η generated by the model, i.e., \phi_j = \frac{\exp(\eta_j)}{\sum_{i=1}^{k}\exp(\eta_i)}, then φ̂ can be expressed as

\hat{\phi}_j = \frac{n_{j,b}\exp(\eta_j)}{\sum_{i=1}^{k} n_{i,b}\exp(\eta_i)}.

Theorem 1 (we provide the proof in the supplementary) shows that the bias-balanced softmax resolves the distribution discrepancy between the bias-imbalanced training set and the bias-balanced testing set. Denoting by M the number of training samples, we obtain the bias-balanced loss for training a debiased model:

L_{BS} = -\frac{1}{M}\sum_{m=1}^{M}\log\frac{n_{y_m, b_m}\exp(\eta_{y_m})}{\sum_{i=1}^{k} n_{i, b_m}\exp(\eta_i)},

where (x_m, y_m, b_m) denotes the m-th training sample with its target and bias labels and η denotes its logits. However, this loss requires an estimate of the bias distribution over the training set, while comprehensively labeling all kinds of attributes would be impractical, especially for medical data. In the following, we elaborate on how to estimate the bias distribution without knowing the bias labels.

Inspired by [14], we conducted two experiments on the Source-biased Pneumonia dataset with r = 10, where the model was set to classify the data source (Fig. 2a) or the health condition (Fig. 2b), respectively. The model shows almost no sign of fitting the bias attribute (health condition) when it is required to distinguish the data source. On the other hand, the model quickly learns the bias (data source) when set to classify pneumonia versus healthy cases. From these findings, one can conclude that dataset biases are preferred when they are easier to learn than the intended features.

Based on this observation, we can develop a model that captures the dataset bias by making it quickly fit the easier features of the training data. We therefore adopt the generalized cross entropy (GCE) loss [26], which was originally proposed to handle noisy labels by fitting the easier clean data first and only slowly memorizing the hard noisy samples. Inheriting this idea, the GCE loss captures easy, biased samples more quickly than the categorical cross entropy (CE) loss. Letting f(x) be the softmax output of the model, f_{y=j}(x) the probability of x being classified into class y = j, and θ the parameters of model f, the GCE loss is formulated as

L_{GCE}(f(x; \theta), y = j) = \frac{1 - f_{y=j}(x; \theta)^{q}}{q},

where q is a hyper-parameter. The gradient of the GCE loss is

\frac{\partial L_{GCE}(f(x;\theta), y=j)}{\partial \theta} = f_{y=j}(x;\theta)^{q}\,\frac{\partial L_{CE}(f(x;\theta), y=j)}{\partial \theta}

(we provide the derivation in the supplementary), which explicitly weights the CE gradient by the agreement between the model's prediction and the given label. As shown in Fig. 2c, the GCE loss fits the bias-aligned samples quickly while yielding much higher loss on the bias-conflicting samples.

With the observations and analysis discussed above, we propose a debiasing algorithm, namely Pseudo Bias-Balanced Learning. We first train a biased model f_B(θ_B) with the GCE loss and compute the corresponding receiver operating characteristic (ROC) over the training set. Based on the ROC curve, we compute the sensitivity u(τ) and specificity v(τ) at each threshold τ and assign a pseudo bias label to each sample by thresholding the output of the biased model. Moreover, as the biased model may also memorize the correct predictions for the hard bias-conflicting cases [25], we propose to capture and enhance the bias via iterative model training. Finally, we train the debiased model with the pseudo bias labels and the bias-balanced softmax function. Our approach is summarized in Algorithm 1.

Algorithm 1: Pseudo Bias-Balanced Learning
Input: θ_B, θ_D, images x, target labels y, numbers of iterations T_B, T_D, N.
Output: Debiased model f_D(x; θ_D).
1: Initialize b̂ = y.
2: for n = 1, ..., N do
3:   Initialize network f_B(x; θ_B).
4:   for t = 1, ..., T_B do
5:     Update f_B(x; θ_B) with L_GCE(f_B(x; θ_B), b̂).
6:   end for
7:   Calculate u, v, and τ over the training set.
8:   Update the pseudo bias labels b̂ by thresholding f_B(x; θ_B).
9: end for
10: Initialize network f_D(x; θ_D).
11: for t = 1, ..., T_D do
12:   Update f_D(x; θ_D) with L_BS(f_D(x; θ_D), y, b̂).
13: end for
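As a concrete reference for the two update steps in Algorithm 1, the following is a minimal sketch of the two losses, assuming binary targets and binary pseudo bias labels. The function names and the `counts` tensor (with counts[b, j] = n_{j,b}, estimated from the current pseudo bias labels) are our own notation rather than the authors' released code; the bias-balanced softmax reduces to shifting the logits by log n_{j,b} before applying the standard cross entropy.

```python
import torch
import torch.nn.functional as F

def gce_loss(logits: torch.Tensor, targets: torch.Tensor, q: float = 0.7) -> torch.Tensor:
    """Generalized cross entropy: (1 - p_y^q) / q, averaged over the batch."""
    p_y = F.softmax(logits, dim=1).gather(1, targets.unsqueeze(1)).squeeze(1)
    return ((1.0 - p_y.clamp_min(1e-8) ** q) / q).mean()

def bias_balanced_loss(logits: torch.Tensor, targets: torch.Tensor,
                       pseudo_bias: torch.Tensor, counts: torch.Tensor) -> torch.Tensor:
    """Bias-balanced softmax loss: add log n_{j,b} to the logits, then cross entropy."""
    log_prior = torch.log(counts[pseudo_bias].float().clamp_min(1.0))  # shape (B, k)
    return F.cross_entropy(logits + log_prior, targets)
```

In an iterative run of Algorithm 1, `counts` would be recomputed each time the pseudo bias labels are updated.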
Evaluation metrics: We evaluate the models with the area under the ROC curve (AUC) under four criteria: i) AUC on bias-aligned samples; ii) AUC on bias-conflicting samples; iii) the average of the bias-aligned AUC and the bias-conflicting AUC, which we call balanced AUC; and iv) AUC on all samples. The difference between the first two metrics reflects whether the model is biased, while the latter two metrics provide unbiased evaluations on the testing data.

Compared methods: We compared our method with four other approaches: i) the vanilla model, which does not use any debiasing strategy and can be broadly regarded as a lower bound; ii) Group Distributionally Robust Optimization (G-DRO) [19], which uses the ground-truth bias labels and can be regarded as an upper bound. G-DRO divides the training data into different groups according to their target and bias labels, then optimizes the model with priority on the worst-performing group and thus achieves robustness on every single group; iii) Learning from Failure (LfF) [14], which develops a debiased model using losses re-weighted by a biased model; iv) Disentangled Feature Augmentation (DFA) [11], which builds on LfF and further adds feature sharing and augmentation between the debiased and biased models.

Model training protocol: We used the same backbone for every method: a DenseNet-121 [7] with pre-trained weights from [3]. Specifically, we fixed the weights of the DenseNet, replaced the final output layer with three linear layers, and used rectified linear units as the intermediate activation function. We ran each model with three different random seeds and report the test results corresponding to the best validation AUC. Each model is optimized with Adam [9] for around 1,000 steps with a batch size of 256 and a learning rate of 1e-4. N in Algorithm 1 is empirically set to 1 for the SbP dataset and 2 for the GbP datasets, respectively. q in the GCE loss is set to 0.7 as recommended in [26].
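As a rough illustration of this protocol, the sketch below builds a frozen DenseNet-121 feature extractor with a three-layer head. The hidden width of 256 and the torchvision ImageNet weights are placeholders of our own choosing (the paper uses pre-trained weights from TorchXRayVision [3]), so treat this as a structural sketch rather than a faithful reproduction.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision

class FrozenDenseNetClassifier(nn.Module):
    def __init__(self, num_classes: int = 2, hidden: int = 256):
        super().__init__()
        backbone = torchvision.models.densenet121(weights="IMAGENET1K_V1")
        self.features = backbone.features
        for p in self.features.parameters():
            p.requires_grad = False          # backbone weights stay fixed
        self.head = nn.Sequential(           # three linear layers with ReLU in between
            nn.Linear(1024, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f = F.relu(self.features(x))
        f = F.adaptive_avg_pool2d(f, 1).flatten(1)   # (B, 1024) pooled features
        return self.head(f)                           # only the head is trained
```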
Results on the Source-biased Pneumonia dataset: We report the test results of our method and the state-of-the-art approaches on the source-biased pneumonia dataset in Table 1. As the bias ratio increases, the vanilla model becomes more and more biased and suffers a severe decrease in balanced AUC and overall AUC. All other methods also show decreases on these two metrics, while G-DRO remains quite robust under all situations. Meanwhile, our method achieves consistent improvements over the compared approaches under most situations, demonstrating its effectiveness in debiasing. Interestingly, the change in external testing performance appears to be in line with the change in balanced AUC and overall AUC, which further suggests that overcoming shortcut learning improves the model's generalization capability. These findings demonstrate our method's effectiveness in addressing shortcut learning, which also shows potential for robustness and trustworthiness in real-world clinical usage.

Results on the Gender-biased Pneumothorax dataset: We report the test results of our method and the state-of-the-art approaches on the gender-biased pneumothorax dataset in Table 2. Judging by the performance of the vanilla model, gender bias may not affect performance as severely as the data-source bias, but it can lead to serious fairness issues. We observe that G-DRO performs robustly on the two different training sets. Among the approaches that do not use ground-truth bias labels, our proposed method achieves consistent improvements over the others on both training sets. These results also show the potential of our method for developing fair and trustworthy diagnosis models.

In this paper, we studied the causes of and solutions to shortcut learning in medical image analysis, taking chest X-rays as an example. We showed that shortcut learning occurs when the bias distribution is imbalanced, and that a dataset bias is preferred when it is easier to learn than the intended features. Based on these findings, we proposed a novel pseudo bias-balanced learning algorithm to develop a debiased model without explicit labeling of the bias attributes. We also constructed several debiasing datasets from publicly available data, on which we demonstrated that our method overcame shortcut learning and achieved consistent improvements over other state-of-the-art methods.

Proof of Theorem 1 (following [17, 6]): The exponential family parameterization of the multinomial distribution provides the standard Softmax function as the canonical response function,

\phi_j = \frac{\exp(\eta_j)}{\sum_{i=1}^{k}\exp(\eta_i)},

with the canonical link function \eta_j = \log(\phi_j / \phi_k). Writing the response function in log form gives \log\phi_j = \eta_j - \log\sum_{i=1}^{k}\exp(\eta_i). By adding -\log(\phi_j/\hat{\phi}_j) to both sides, we have

\log\hat{\phi}_j = \eta_j - \log\frac{\phi_j}{\hat{\phi}_j} - \log\sum_{i=1}^{k}\exp(\eta_i),

from which we further have

\hat{\phi}_j = \frac{\exp\big(\eta_j - \log(\phi_j/\hat{\phi}_j)\big)}{\sum_{i=1}^{k}\exp(\eta_i)}, \qquad \sum_{i=1}^{k}\exp(\eta_i) = \sum_{i=1}^{k}\exp\big(\eta_i - \log(\phi_i/\hat{\phi}_i)\big),

where the second identity follows from \sum_i \hat{\phi}_i = 1. Substituting the second identity back into the first, we obtain

\hat{\phi}_j = \frac{\exp\big(\eta_j - \log(\phi_j/\hat{\phi}_j)\big)}{\sum_{i=1}^{k}\exp\big(\eta_i - \log(\phi_i/\hat{\phi}_i)\big)}.

We recall that, by the definitions of φ and φ̂,

\phi_j = \frac{p(x|y=j,b)}{\sum_{i=1}^{k} p(x|y=i,b)}, \qquad \hat{\phi}_j = \frac{n_{j,b}\, p(x|y=j,b)}{\sum_{i=1}^{k} n_{i,b}\, p(x|y=i,b)}.

Hence,

\frac{\phi_j}{\hat{\phi}_j} = \frac{1}{n_{j,b}}\cdot\frac{\sum_{i=1}^{k} n_{i,b}\, p(x|y=i,b)}{\sum_{i=1}^{k} p(x|y=i,b)} = \frac{C}{n_{j,b}},

where C does not depend on j. For simplicity, we let n_b = \sum_{i=1}^{k} n_{i,b} be the number of samples with bias label b, so that the priors in the definitions above are 1/k and n_{j,b}/n_b, respectively. Finally, substituting the ratio back into the expression for \hat{\phi}_j, the factor C cancels between the numerator and the denominator, and we have

\hat{\phi}_j = \frac{n_{j,b}\exp(\eta_j)}{\sum_{i=1}^{k} n_{i,b}\exp(\eta_i)}.

Gradient of the Generalized Cross Entropy Loss [26]: The GCE loss takes the form

L_{GCE}(f(x;\theta), y=j) = \frac{1 - f_{y=j}(x;\theta)^{q}}{q}.

Hence, its gradient is

\frac{\partial L_{GCE}(f(x;\theta), y=j)}{\partial \theta} = -f_{y=j}(x;\theta)^{q-1}\,\frac{\partial f_{y=j}(x;\theta)}{\partial \theta}.

Recall that the conventional cross entropy loss is L_{CE}(f(x;\theta), y=j) = -\log f_{y=j}(x;\theta), hence

\frac{\partial L_{CE}(f(x;\theta), y=j)}{\partial \theta} = -\frac{1}{f_{y=j}(x;\theta)}\,\frac{\partial f_{y=j}(x;\theta)}{\partial \theta}.

Therefore,

\frac{\partial L_{GCE}(f(x;\theta), y=j)}{\partial \theta} = f_{y=j}(x;\theta)^{q}\,\frac{\partial L_{CE}(f(x;\theta), y=j)}{\partial \theta}.
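The gradient identity above is easy to verify numerically with autograd. The snippet below is a quick sanity check of our own (not from the paper), comparing the GCE and CE gradients with respect to the logits of a single sample; by the chain rule, the same scalar factor carries over to gradients with respect to the model parameters θ.

```python
import torch

torch.manual_seed(0)
q = 0.7
logits = torch.randn(5, requires_grad=True)  # logits of one sample, 5 classes
j = 2                                        # ground-truth class index

p = torch.softmax(logits, dim=0)
gce = (1.0 - p[j] ** q) / q                  # generalized cross entropy
ce = -torch.log(p[j])                        # standard cross entropy

g_gce, = torch.autograd.grad(gce, logits, retain_graph=True)
g_ce, = torch.autograd.grad(ce, logits)

# d(L_GCE)/d(logits) should equal p_j^q * d(L_CE)/d(logits)
print(torch.allclose(g_gce, (p[j].detach() ** q) * g_ce, atol=1e-6))  # -> True
```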
References
[1] Invariant risk minimization.
[2] PadChest: A large chest x-ray image dataset with multi-label annotated reports.
[3] TorchXRayVision: A library of chest X-ray datasets and models.
[4] AI for radiographic COVID-19 detection selects shortcuts over signal.
[5] Shortcut learning in deep neural networks.
[6] Unbiased classification through bias-contrastive and bias-balanced learning.
[7] Densely connected convolutional networks.
[8] MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports.
[9] Adam: A method for stochastic optimization.
[10] Gender imbalance in medical imaging datasets produces biased classifiers for computer-aided diagnosis.
[11] Learning debiased representation via disentangled feature augmentation.
[12] REPAIR: Removing representation bias by dataset resampling.
[13] Representation learning via invariant causal mechanisms.
[14] Learning from failure: De-biasing classifier from biased classifier.
[15] Hidden stratification causes clinically meaningful failures in machine learning for medical imaging.
[16] AI in health and medicine.
[17] Balanced meta-softmax for long-tailed visual recognition.
[18] "Why should I trust you?" Explaining the predictions of any classifier.
[19] Distributionally robust neural networks for group shifts: On the importance of regularization for worst-case generalization.
[20] EnD: Entangling and disentangling deep representations for bias correction.
[21] Unbiased look at dataset bias.
[22] ChestX-ray8: Hospital-scale chest X-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases.
[23] Generalizable feature learning in the presence of data bias and domain class imbalance with application to skin lesion classification.
[24] Variable generalization performance of a deep learning model to detect pneumonia in chest radiographs: A cross-sectional study.
[25] Understanding deep learning requires rethinking generalization.
[26] Generalized cross entropy loss for training deep neural networks with noisy labels.
[27] Learning bias-invariant representation by cross-sample mutual information minimization.