GAN-based disentanglement learning for chest X-ray rib suppression
Luyi Han, Yuanyuan Lyu, Cheng Peng, S. Kevin Zhou
2021-10-18

Clinical evidence has shown that rib-suppressed chest X-rays (CXRs) can improve the reliability of pulmonary disease diagnosis. However, previous approaches to generating rib-suppressed CXRs face challenges in preserving details and eliminating rib residues. We hereby propose a GAN-based disentanglement learning framework called Rib Suppression GAN, or RSGAN, to perform rib suppression by utilizing the anatomical knowledge embedded in unpaired computed tomography (CT) images. In this approach, we employ a residual map to characterize the intensity difference between a CXR and the corresponding rib-suppressed result. To predict the residual map in the CXR domain, we disentangle the image into structure- and contrast-specific features and transfer the rib structural priors from digitally reconstructed radiographs (DRRs) computed from CT. Furthermore, we employ additional adaptive losses to suppress rib residue and preserve more details. We conduct extensive experiments based on 1,673 CT volumes and four benchmarking CXR datasets, totaling over 120K images, to demonstrate that (i) our proposed RSGAN achieves superior image quality compared to the state-of-the-art rib suppression methods; (ii) combining the CXR with our rib-suppressed result leads to better performance in lung disease classification and tuberculosis area detection.

Chest X-ray (CXR), a 2D projection of a 3D scene from an X-ray source, and computed tomography (CT), which reconstructs a 3D scene from a collection of 2D X-ray projections, are widely used modalities for diagnosing lung disease. Compared to CT, CXR examinations induce up to 120 times lower radiation dose and are more affordable (Brenner and Hall, 2007). However, diagnosis with CXR alone is challenging and misses lung cancer findings (Hossain et al., 2018), because a CXR represents a 2D projection of the 3D chest and contains overlapped anatomies and ambiguous structural details. Therefore, improving the diagnostic reliability of CXR is clinically significant for better clinical decision making. Clinical evidence indicates that bone-suppressed CXR can improve the interpretation of CXR images and diagnostic reliability by markedly reducing the visibility of particular superimposed structures in chest radiographs (Hogeweg et al., 2013a; Laskey, 1996). Rib suppression is also an important preprocessing step for a computer-aided diagnosis (CAD) system, contributing to lung segmentation and tuberculosis detection (Jaeger et al., 2013; Maduskar et al., 2013). Automatically detecting and removing the rib structures in a CXR with image-processing techniques can simplify the feature extraction and analysis stages in a CAD system. Fig. 1 shows an example of rib suppression.

Dual-energy (DE) CXRs can provide bone-free images, but motion artifacts are unavoidable due to cardiac motion and breathing. In a post-processing setting, various methods have been proposed to generate bone-suppressed CXRs, which can be categorized as (1) physical-model based and (2) deep-learning based. The physical-model-based methods model the bone structures with various physical priors and obtain bone-free results by subtracting the bone from the original CXR.
These approaches require a manually annotated bone mask for each CXR image (Hogeweg et al., 2013a; von Berg et al., 2016; Simkó et al., 2009; Hogeweg et al., 2012; Wu et al., 2012). Learning-based methods can be divided into two subcategories according to the source of manual ground truth: (1) DE CXR based and (2) digitally reconstructed radiography (DRR) based. For DE-CXR-based learning models, the suppression of bones in CXR images is learned with a specifically designed artificial neural network (Suzuki et al., 2006; Chen and Suzuki, 2013) or convolutional neural network (CNN) (Yang et al., 2017; Zhou et al., 2018; Chen et al., 2019). However, the limited amount of DE CXR images impedes sufficient training of CNNs. DRR-based learning methods attempt to utilize the structural prior knowledge in the CT domain. Clinically, the bone component in the DRR domain, paired with the original chest DRR, is readily obtained by projecting the bone from a set of CT images, making it possible to learn a bone-suppression model (Li et al., 2019). Although DRRs appear similar to CXRs, they have different contrast due to simulation assumptions. Domain adaptation based on CycleGAN (Zhu et al., 2017) is employed to reduce this gap by transferring images from the CXR domain to the DRR domain. The bone component of a CXR image can then be generated by a model trained on large annotated DRR data with a reduced domain gap. To obtain rib-suppressed results, Li et al. (2020) subtract the bone decomposition from the original high-resolution CXR. By histogram matching on the CXR within the inner-rib mask, their approach produces high-resolution results with inter-rib information unchanged, which is meaningful for clinical diagnosis.

Recently, GAN-based methods have achieved success in domain adaptation. The principle of a GAN is to introduce a discriminator that distinguishes the generated image domain from the real image domain, while the generator is forced to fool the discriminator and thereby produce images belonging to the real-image domain. The performance of domain adaptation improves through this adversarial learning between the generator and the discriminator. In particular, CycleGAN (Zhu et al., 2017) constructs a forward- and backward-generation cycle to ensure bilateral consistency, and this cycle-consistency constraint enables transfer between unpaired data (Zhang et al., 2018b; Kamnitsas et al., 2017; Zhang et al., 2018c). However, the cycle-consistency in CycleGAN is imposed only at the image level, resulting in dropped details at the backward generation stage. To better preserve details during domain adaptation, MUNIT and DRIT (Lee et al., 2018) disentangle the latent space of the generator into content- and style-specific representations. In this way, only the domain-variant features are exchanged and the domain-invariant features are well preserved. Many recent medical applications, such as synthesis (Ben-Cohen et al., 2019) and cross-modality segmentation, adopt the concept of feature disentanglement to improve performance. A representative example is Chartsias et al. (2019), which improves high-level representations of 2D medical images by disentangling the latent space into spatial anatomical and non-spatial modality representations. Despite the success in generating more details via feature disentanglement, these methods still suffer from blur in the generated images.
This is because the recovery capacity of the decoder is limited and it is hard to reconstruct a realistic noise distribution. This motivates us to propose an easier-to-generate output for the decoder. Instead of directly generating images in the target domain, our RSGAN aims at generating a residual map, which contains the domain-invariant anatomical structure and the residual intensity in a specific domain.

The annotations of a CT volume, e.g., lung and bone, can be projected to the DRR domain. By leveraging the information from DRR images, several studies have addressed X-ray decomposition (Albarqouni et al., 2017; Li et al., 2019), lung enhancement (Gozes and Greenspan, 2018; Li et al., 2019), and rib suppression (Li et al., 2019, 2020). To mitigate the domain gap between CXR and DRR, some of these methods (Li et al., 2019, 2020) include domain adaptation between DRR and CXR images. Specifically, DecGAN (Li et al., 2019) proposes a CycleGAN-based network that decomposes the input into different components (bone, lung, and other soft-tissue structures) and then adaptively combines the decomposed components to implement rib suppression or lung enhancement. However, the rib-suppressed DRR is not strictly equal to the linear combination of the projections of the lung and other soft-tissue structures. Realistic suppression requires that the rib area in the CT volume be replaced with soft-tissue intensities; in DecGAN (Li et al., 2019), this part of soft tissue is omitted and treated as air. Furthermore, the CycleGAN-based framework drops details and results in a low-resolution generation. Li et al. (2020), a follow-up of DecGAN (Li et al., 2019), attempt to generate more details in a coarse-to-fine manner: the generated bone component is subtracted from the CXR image with ad-hoc histogram matching within the predicted rib mask. However, the performance of rib suppression is affected by the accuracy of the generated rib mask, and a less accurate rib mask results in sharp intensity changes at the rib edges in the rib-suppressed image. This approach has the following drawbacks:
1. lacking structure-consistency in CXR during domain adaptation;
2. neglecting the difference between the decomposed bone and the intensity residue obtained by subtracting the corresponding rib-suppressed prediction from a CXR image;
3. lacking a formulation to promote a consistent rib mask, which may result in sharp changes around rib edges.

To address the above challenges and obtain better rib suppression results for real CXRs, we propose a generative adversarial network (GAN) based disentanglement learning framework called Rib Suppression GAN (RSGAN). Our contributions are threefold:
1. We transfer the prior knowledge from the DRR domain to the CXR domain with disentangled structure- and contrast-specific generators.
2. We predict the residual map of ribs instead of the rib-suppressed image, enabling details to be effectively recovered by a typical decoder.
3. We formulate adaptive loss functions to enhance intensity consistency in the inter-rib regions and preserve the details overlapped by ribs.

The remainder of this paper is organized as follows. In Section 2, we detail the proposed rib suppression method, RSGAN. In Section 3, we describe the datasets and the metrics for evaluating the performance of rib suppression. In Section 4, we show the predicted rib-suppressed images for each competing method and the feasibility of RSGAN in downstream applications.
In Section 5, we discuss the improvements of RSGAN in detail, and we conclude in Section 6.

Denote a CXR image by I. It is assumed that I = Q + R, where Q is the rib-suppressed image and R is the rib residual image. While direct approaches explicitly predict the rib-suppressed image Q from I, we instead attempt to estimate R from I and then subtract it from I to finally obtain Q. We propose a GAN-based disentanglement learning framework, RSGAN, to suppress rib bones in CXR images by leveraging anatomical knowledge from unpaired CT/DRR images. Although the contrast differs between CXR and DRR images, the anatomical structures are similar in both. Based on this fact, our framework disentangles structure and contrast with separate generators for both CXR and DRR images. The rib bone features learned from the DRR domain are then combined with the contrast feature of the CXR domain to construct the CXR-based residual map. The final rib-suppressed CXR is obtained by subtracting the transferred ribs from the input CXR image, as shown in Fig. 1. To facilitate the description of the proposed approach, the mathematical setting and notations are listed in Table 1.

Fig. 2 shows the disentangled representations in our generative model. In the generator, each input is disentangled into three components, including anatomical content, image contrast, and rib bones, with three encoders E_S, E_C, and E_B, respectively. The output comprises four predictions (lung mask, rib-suppressed image, residual map of ribs, and bone component projection) from four decoders G_L, G_Q, G_R, and G_B, respectively. In the decoding path, G_Q takes the contrast and rib-suppressed features as inputs and outputs the rib-suppressed image. Improving upon Li et al. (2019), we inpaint the rib-removed area in the CT volume based on its surrounding tissue intensities to simulate more realistic suppression, and the inpainted volume is projected as the ground truth for the rib-suppressed DRR. G_R takes the contrast and rib bone features as inputs and outputs a residual map. The residual map is used to generate a rib-suppressed image with more details by subtracting it from the input image; conversely, the input image can be reconstructed by adding the rib-suppressed image and the residual map. G_B takes only the rib bone feature as input and outputs the bone component projection. The ground truth of the bone component projection for the DRR image is generated by projecting the separate CT rib bone component obtained from the 3D volume. By generating the bone projection, better rib bone features can be obtained for residual map prediction, and a bone mask becomes available because the bone projection has clearer rib edges than the residual map. G_L takes the rib-suppressed features and outputs the lung mask, which helps the model pay more attention to the lung area.

For CXR images, it is difficult to achieve desirable rib-suppressed results due to the overlap of multiple tissues (e.g., ribs, clavicles, lung). To relieve the interference from other organs, a specific representation of ribs should be well constructed. CT scans provide the 3D structure of the ribs without any overlap, making them easy to annotate. Thus, tailored rib-suppressed DRR images can be obtained by projecting the corresponding rib-suppressed CT volumes, and the rib features of CXR images can then be transferred from those of DRR images.
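To make the residual-map formulation concrete, the following is a minimal PyTorch sketch of how a disentangled generator could assemble a rib-suppressed CXR from a predicted residual map. It is not the authors' implementation: the placeholder layers, channel widths, and the omission of E_S and the other decoders are simplifying assumptions, and only the module names (E_B, E_C, G_R) mirror the paper's notation.

```python
import torch
import torch.nn as nn

class ResidualRibSuppressor(nn.Module):
    """Toy disentangled generator: encode bone structure and contrast, decode a residual map."""
    def __init__(self, feat_ch=256):
        super().__init__()
        # Placeholder bone-structure encoder E_B (the real E_B uses strided convs plus a residual block).
        self.E_B = nn.Sequential(nn.Conv2d(1, feat_ch, 3, stride=4, padding=1), nn.ReLU(inplace=True))
        # Placeholder contrast encoder E_C producing a global contrast code.
        self.E_C = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(1, feat_ch))
        # Placeholder residual-map decoder G_R mapping bone features back to image resolution.
        self.G_R = nn.Sequential(
            nn.Upsample(scale_factor=4, mode="bilinear", align_corners=False),
            nn.Conv2d(feat_ch, 1, 3, padding=1),
        )

    def forward(self, i_x):                 # i_x: (B, 1, 320, 320) CXR
        f_bone = self.E_B(i_x)              # rib-bone structure features, (B, 256, 80, 80)
        contrast = self.E_C(i_x)            # global contrast code, (B, 256); unused by this toy decoder
        r_x = self.G_R(f_bone)              # predicted rib residual map R
        q_x = i_x - r_x                     # I = Q + R  =>  Q = I - R
        return q_x, r_x, contrast

cxr = torch.rand(2, 1, 320, 320)
q, r, _ = ResidualRibSuppressor()(cxr)      # q: rib-suppressed CXR, r: residual map
```

The key design point carried over from the paper is that the decoder only has to output the residual R, while the rib-suppressed image is recovered by the fixed subtraction Q = I - R.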
As illustrated in Fig. 2, the mapping for rib suppression is learned from the DRR domain and supervised from three aspects, including rib suppression, the residual map of ribs, and the bone component. The supervised loss is given as

L_sup = ||Q_d − Q̂_d||_1 + ||R_d − R̂_d||_1 + ||B_d − B̂_d||_1,

where ||·||_1 is the L_1 loss, Q_d and Q̂_d are the ground truth and prediction of the rib-suppressed DRR image, respectively, R_d and R̂_d are the ground truth and predicted residual map of bone in the DRR image, respectively, and B_d and B̂_d are the ground truth and prediction of the bone component projection in the DRR image, respectively.

Directly disentangling ribs from CXR under the supervision of the annotated ribs in DRR is ineffective due to the domain gap. To reduce the impact of domain variance on rib suppression, our RSGAN, shown in Fig. 3, employs cycle-consistency learning to implement CXR disentanglement via domain adaptation. In particular, we separate the learned features into domain-invariant and domain-specific ones. Here, we assume that the rib bone features and the rib-suppressed features disentangled from CXR or DRR images are domain-invariant, while the contrast features are domain-specific. In this way, a domain-specific image can be generated by exchanging the contrast feature from the target domain via a contrast exchanging (CE) block. Transferring rib suppression from the DRR domain to the CXR domain requires capturing the specific style of the two domains. Here, two discriminators are involved to evaluate the generated images in the DRR style and the CXR style, respectively. Additionally, a third discriminator is utilized to assess the predicted bone projections from CXR and DRR images, so as to sufficiently and precisely transfer the rib structures from DRR to CXR. Adversarial learning is employed to minimize the domain distance between real images and predictions, with adversarial losses for the generator and discriminators defined over D_x, D_d, and D_B, where I_x and I_d denote the CXR and DRR images, I_{d→x} denotes the predicted CXR image transferred from the DRR domain, I_{x→d} denotes the predicted DRR image transferred from the CXR domain, and B̂_d and B̂_x refer to the predictions of the bone component projection in the DRR and CXR images, respectively. The two discriminators D_x and D_d evaluate the reconstructed domain-transferred images in the CXR and DRR domains, respectively, which retain the anatomical structure but carry the exchanged contrast. D_B is the bone component discriminator that distinguishes the bone projection predicted from CXR from that predicted from DRR.

To improve transferability between the two domains, we constrain learning with both feature- and image-level reconstructions. At the feature level, we employ contrast- and structure-consistency losses. The contrast-consistency loss L_c keeps the contrast feature identical within the same domain, and the structure-consistency loss L_s protects the structures from deterioration when the contrast changes. Both are computed as L_1 distances between the features encoded from the domain-transferred images and those encoded from the corresponding real images: L_c matches the contrast codes within each target domain, while L_s matches the structure codes of an image before and after contrast exchange, where E_C refers to the contrast encoder, E_S refers to the rib-suppressed encoder, E_B refers to the rib bone encoder, I_{x→d} means transferring the CXR image to the DRR domain, and I_{d→x} denotes the reverse transfer.
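The DRR-domain supervision and the feature-level consistency terms above might be implemented roughly as in the sketch below. This is not the paper's exact formulation: the helper names are hypothetical, and the particular pairing of encoders and images inside l_c and l_s is an assumption consistent with the stated purpose of each loss.

```python
import torch
import torch.nn.functional as F

def supervised_drr_loss(q_hat, q_gt, r_hat, r_gt, b_hat, b_gt):
    """L_sup: L1 supervision on the rib-suppressed DRR, rib residual map, and bone projection."""
    return F.l1_loss(q_hat, q_gt) + F.l1_loss(r_hat, r_gt) + F.l1_loss(b_hat, b_gt)

def feature_consistency_losses(E_C, E_S, E_B, i_x, i_d, i_x2d, i_d2x):
    """Assumed forms of L_c and L_s: a transferred image keeps the target domain's contrast
    code (L_c) and the source image's structure/bone codes (L_s)."""
    l_c = F.l1_loss(E_C(i_x2d), E_C(i_d)) + F.l1_loss(E_C(i_d2x), E_C(i_x))
    l_s = (F.l1_loss(E_S(i_x2d), E_S(i_x)) + F.l1_loss(E_B(i_x2d), E_B(i_x))
           + F.l1_loss(E_S(i_d2x), E_S(i_d)) + F.l1_loss(E_B(i_d2x), E_B(i_d)))
    return l_c, l_s
```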
To ensure the generation is consistent with the real images, a pixel-wise L_1 constraint is used to yield a reconstructed image close to the real one at the image level. Meanwhile, a cycle-consistency L_1 constraint is introduced to ensure pixel similarity between an original input and the corresponding cycle-transferred generation, as shown in Fig. 3. The pixel-wise reconstruction loss L_rec and the cycle-consistency loss L_cyc used for the image-to-image translation are written as

L_rec = ||Î_x − I_x||_1 + ||Î_d − I_d||_1,
L_cyc = ||Ĩ_x − I_x||_1 + ||Ĩ_d − I_d||_1,

where Î_x and Î_d refer to the forward predictions of I_x and I_d, respectively, and Ĩ_x and Ĩ_d denote the cycle predictions. Î_x is reconstructed directly from the disentangled features of I_x, while Ĩ_x is generated by transferring I_{x→d} back to the CXR domain; Î_d and Ĩ_d are formulated similarly.

The above domain adaptation learning is driven by the global intensity distribution, and the translation of local details is ignored. Hence, as shown in Fig. 4, additional local constraints are applied to the rib-suppression learning on CXR images from three aspects: (1) handling rib residue; (2) inter-rib intensity consistency; and (3) inner-rib detail preservation.

Handling rib residue. Rib residues may remain in the rib-suppression prediction during domain adaptation learning. Here, we employ a gradient-based constraint within the lung regions to eliminate the influence of intensity variations across images and to promote a smooth transition in the rib regions of the rib-suppression prediction. Gradient supervision is applied to the predicted rib-suppressed CXR images and the corresponding residual maps of ribs during adversarial learning, yielding the gradient-based adversarial loss L_G, where ∇ refers to the gradient operator, D refers to a discriminator that regularizes the gradient map of the rib-suppressed image for a smooth transition at rib edges, Q_{d→x} is the predicted rib-suppressed CXR image transferred from the DRR domain, R_x denotes the predicted residual map of ribs in the CXR, and L_x and L_d refer to the predicted lung masks in the CXR and DRR images, respectively. The lung mask decoder is trained with a Dice loss on a set of annotated CXR images and is frozen afterward.

Inter-rib intensity consistency. Inter-rib intensities should be consistent between the input CXR image and the corresponding rib-suppressed prediction. The intensity-consistency loss L_inter penalizes intensity changes between them outside the rib regions, where M_x denotes the corresponding bone mask of I_x, which is generated from B̂_x by a bilateral filter and threshold segmentation, and ∪ denotes the union region of two input masks. Note that the bone mask M_x cannot be generated from the residual map R_x, because this would introduce unstable texture changes into the residual map when L_G is applied; M_x would then be influenced by the disturbance of the residual map and enlarge the bias during training. Instead, the bone component projection B̂_x is more stable and stops the gradient from L_G, which is necessary for generating the bone mask.

Inner-rib detail preservation. To preserve details such as the trachea overlapped with ribs, we employ a Laplacian-based loss L_Δ to eliminate the overlapped regions from the residual prediction. The Laplacian, used as a regularization, encourages smoother residual maps (Li et al., 2020); L_Δ penalizes the Laplacian response ΔR_x of the predicted residual map, where Δ represents the Laplacian operator.

The above extra adaptive loss functions are employed to refine the performance of rib suppression on local details; they are used only in the fine-tuning stage and are removed in the initial stage. The total loss of the generative model combines the adversarial, feature-level, and image-level terms in the initial stage, and additionally includes L_G, L_inter, and L_Δ in the fine-tuning stage, where we set λ_adv = 1, λ_f = 1, λ_i = 10, λ_G = 10, λ_inter = 500, and λ_Δ = 1 based on experimental experience.
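A hedged sketch of the two non-adversarial adaptive terms described above is given below. The exact masking used in the paper's L_inter (the "union of two masks") is not fully specified in the text, so this assumes the penalty is applied inside the lung but outside the bone mask, and the Laplacian term is assumed to be a mean absolute Laplacian response of the residual map.

```python
import torch
import torch.nn.functional as F

# 3x3 Laplacian kernel as a conv weight of shape (out_ch, in_ch, kH, kW).
LAPLACIAN = torch.tensor([[0., 1., 0.],
                          [1., -4., 1.],
                          [0., 1., 0.]]).view(1, 1, 3, 3)

def inter_rib_consistency(i_x, q_x, lung_mask, bone_mask):
    """Assumed L_inter: L1 difference between the CXR and its rib-suppressed prediction
    on inter-rib pixels (inside the lung, outside the bone mask)."""
    inter_rib = lung_mask * (1.0 - bone_mask)
    return (inter_rib * (q_x - i_x)).abs().sum() / inter_rib.sum().clamp(min=1.0)

def laplacian_smoothness(r_x):
    """Assumed L_Delta: encourage a smooth residual map by penalizing its Laplacian response."""
    lap = F.conv2d(r_x, LAPLACIAN.to(r_x.device), padding=1)
    return lap.abs().mean()
```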
We utilize two CT datasets and four CXR datasets in our experiments. The CT datasets involve 896 CT volumes from LIDC-IDRI (Armato III et al., 2011) and 777 CT volumes from the 2017 and 2019 TianChi AI Competition for Healthcare organized by Alibaba; all CT volumes are selected with a slice thickness of less than 2.5 mm. The CXR datasets involve 11,200 CXRs of size 512 × 512 from TBX11K (Liu et al., 2020), 112,120 CXRs of size 1024 × 1024 from Chest-xray14 (Wang et al., 2017), 138 CXRs of size around 4020 × 4892 from Montgomery County (MC) (Jaeger et al., 2014), and 662 CXRs of size around 3000 × 3000 from Shenzhen Hospital (Jaeger et al., 2014). For the CT data, we train a U-Net model for rib segmentation on 199 CT volumes selected from LIDC-IDRI with manual labels and generate rib masks for the whole CT dataset. Then, based on the rib masks, we suppress the rib regions in the CT volumes with an inpainting method (Telea, 2004). To generate the DRRs, we utilize DeepDRR (Unberath et al., 2019) to project each CT volume into 42 projections, uniformly sampled from −10° to 10° in azimuth and elevation angles. Similarly, we generate the projections of the rib-suppressed volume, the rib region, and the lung area with the same projection strategy. All DRRs are resized to 320 × 320 with bilinear interpolation for training. For the CXR data, all CXRs are resized to 320 × 320 when input into the network and restored to their original resolution for rib suppression. We utilize the training sets of TBX11K, Chest-xray14, and Montgomery County for training, and the others for testing.

Fig. 5 illustrates the details of the three kinds of blocks in the generator. The contrast encoder E_C consists of four convolutional layers with a kernel size of 3 × 3 and a stride of 2, followed by global average pooling (GAP) and a fully connected layer with 256 output channels. The rib-suppressed encoder E_S and the rib bone encoder E_B have the same structure, consisting of two convolutional layers with a kernel size of 3 × 3 and a stride of 2, followed by a residual block with 256 output channels. In this paper, the features encoded by E_S and E_B have a size of 80 × 80 with 256 channels. The rib-suppressed image decoder G_Q and the residual map decoder G_R have the same structure, as shown in Fig. 5, involving three weight demodulation blocks (Karras et al., 2020); each weight demodulation block is followed by an upsampling layer with a scale factor of 2 to pass a higher-resolution feature to the next block. The three feature-map resolutions in the decoder are 80 × 80, 160 × 160, and 320 × 320. Every block includes a sub-branch convolutional layer to generate an output image at the corresponding resolution, and the lower-resolution image is merged with the higher-resolution one output by the next block via an upsampling layer and element-wise addition. The contrast feature is fed into each weight demodulation block, as shown by the dotted arrow. The bone component decoder G_B has a structure similar to G_Q and G_R, with the weight demodulation blocks replaced by residual blocks, and its final output is followed by a Tanh activation. The lung mask decoder G_L is the same as G_B but ends with a Sigmoid activation to predict the lung mask. Note that the contrast feature is not fed into G_B or G_L.
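As an illustration of the encoder configuration described above, the following sketch builds a contrast encoder with four 3 × 3 stride-2 convolutions, global average pooling, and a 256-dimensional fully connected layer; the intermediate channel widths and activation choice are not stated in the text and are assumptions.

```python
import torch
import torch.nn as nn

class ContrastEncoder(nn.Module):
    """E_C as described: 4 strided 3x3 convs -> GAP -> FC with 256 outputs."""
    def __init__(self, in_ch=1, width=64, code_dim=256):
        super().__init__()
        chs = [in_ch, width, width * 2, width * 4, width * 4]   # assumed channel widths
        layers = []
        for c_in, c_out in zip(chs[:-1], chs[1:]):
            layers += [nn.Conv2d(c_in, c_out, kernel_size=3, stride=2, padding=1),
                       nn.LeakyReLU(0.2, inplace=True)]
        self.convs = nn.Sequential(*layers)
        self.pool = nn.AdaptiveAvgPool2d(1)       # global average pooling
        self.fc = nn.Linear(chs[-1], code_dim)    # 256-dim contrast code

    def forward(self, x):                         # x: (B, 1, 320, 320)
        h = self.pool(self.convs(x)).flatten(1)   # (B, chs[-1])
        return self.fc(h)                         # (B, 256)

code = ContrastEncoder()(torch.rand(1, 1, 320, 320))   # -> torch.Size([1, 256])
```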
Four discriminators are utilized in our proposed RSGAN. All discriminators are built with PatchGAN (Zhu et al., 2017), consisting of four convolutional layers with a kernel size of 4 and a stride of 2, and one convolutional layer with a kernel size of 1. The proposed network is developed with PyTorch and trained on a GeForce RTX 2080 Ti GPU. All networks are trained using the Adam optimizer with a learning rate of 1 × 10^-5. We train the proposed network for 40,000 iterations in the initial stage and 10,000 iterations in the fine-tuning stage with a batch size of 1.

Four metrics are utilized to evaluate rib suppression quality, including Weber contrast (Hogeweg et al., 2013b), Learned Perceptual Image Patch Similarity (LPIPS) (Zhang et al., 2018a), Peak Signal-to-Noise Ratio (PSNR), and Structural Similarity Index Measure (SSIM). Weber contrast (Hogeweg et al., 2013b) estimates the rib-suppression performance at the boundaries in CXR images by calculating the contrast gap between the rib-suppressed region and the background. It is defined as C_w = I_r / I_b − 1, where I_r denotes the average intensity of the manually annotated rib region and I_b denotes that of the confined background, corresponding to the region surrounding the ribs within a radius of fewer than 5 pixels. A lower Weber contrast means higher similarity between the rib-suppressed region and the background, suggesting better visual quality of the predicted rib-suppressed CXR. LPIPS (Zhang et al., 2018a) measures perceptual similarity at the feature level, enabling evaluation of the difference between generated and real images in multiple feature spaces. LPIPS is defined as

d(x, x_0) = Σ_l 1/(H_l W_l) Σ_{h,w} || ω_l ⊙ (ŷ^l_{hw} − ŷ^l_{0,hw}) ||_2^2,

where x and x_0 denote the reference and generated patches, respectively, ŷ^l, ŷ^l_0 ∈ R^{H_l × W_l × C_l} correspond to the unit-normalized features at the l-th layer of a feature-extraction network given x and x_0, ω_l ∈ R^{C_l} is a weight vector over channels, ⊙ refers to channel-wise multiplication, and ||·||_2 is the L_2 norm. Note that AlexNet (Krizhevsky et al., 2012) is chosen as the feature-extraction network, and all weights in ω_l are set to 1. Besides, PSNR and SSIM, the widely used image-level similarity metrics, are employed to estimate the details remaining in the rib-suppressed prediction: PSNR reflects the overall pixel error, and SSIM quantifies structural similarity based on luminance, contrast, and structure. All three of the above similarity metrics are calculated over the manually labeled lung regions excluding bones. Lower LPIPS and higher PSNR and SSIM indicate a better prediction. Two other metrics are introduced to evaluate the quality of the two key components in RSGAN, the residual map mechanism and domain adaptation: the mean absolute error for reconstruction consistency (MAE_rec) and the mean absolute error for cycle consistency (MAE_cyc), defined as MAE_rec = ||Î_x − I_x||_1 and MAE_cyc = ||Ĩ_x − I_x||_1. Lower MAE_rec and MAE_cyc indicate better performance of the generator.
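For reference, a small NumPy sketch of the Weber contrast and MAE terms defined above is given below. This is not the paper's evaluation code: the binary rib and background masks are assumed inputs (the sub-5-pixel background ring would come from a separate mask-dilation step).

```python
import numpy as np

def weber_contrast(img, rib_mask, background_mask):
    """C_w = I_r / I_b - 1, using mean intensities over the rib and background regions."""
    i_r = img[rib_mask > 0].mean()
    i_b = img[background_mask > 0].mean()
    return i_r / i_b - 1.0

def mae(a, b):
    """Mean absolute error, as used for MAE_rec and MAE_cyc."""
    return np.abs(a - b).mean()
```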
We compare our method with U-Net (DRR), U-Net (Cycle), and Li et al. (2020) using the above metrics. For U-Net (DRR), a U-Net (Ronneberger et al., 2015) is trained on annotated DRR images and directly used to predict the residual map of ribs for CXR. For U-Net (Cycle), the CXR image is first translated to the DRR domain by CycleGAN (Zhu et al., 2017) and fed back to the CXR domain after rib suppression by a U-Net (Ronneberger et al., 2015). The comparisons are based on 192 CXRs with manually annotated rib and lung regions, of which 99 CXRs are randomly selected from the Shenzhen Hospital dataset (Jaeger et al., 2014) and 93 come from the Chest-xray14 dataset (Wang et al., 2017).

Table 2 and Fig. 6 show the quantitative and visual results of the compared methods, respectively. RSGAN achieves state-of-the-art performance with the lowest Weber contrast of 1.96 × 10^-2, the lowest LPIPS of 0.62 × 10^-2, and the highest PSNR of 34.62 and SSIM of 0.996. The LPIPS, PSNR, and SSIM results of RSGAN are much better than those of U-Net (DRR) and U-Net (Cycle), probably because RSGAN imposes a constraint on the inter-rib area and keeps the region between ribs unchanged. The proposed RSGAN achieves slightly better Weber contrast, PSNR, and SSIM than Li et al. (2020) and a 0.56 × 10^-2 lower LPIPS, indicating that RSGAN yields better perceptual similarity at the edges of the rib area. As illustrated in Fig. 6, we randomly choose four CXRs, including two normal and two pulmonary tuberculosis cases, from the Shenzhen Hospital dataset (Jaeger et al., 2014). The rib-suppressed results of RSGAN show cleaner visualization in both overview and details than the other methods, and also a lower difference in the inter-rib area in the colormap. In the blue rectangle of case 1, U-Net (DRR) and U-Net (Cycle) cannot suppress the ribs completely and smooth out details; owing to the failure of histogram matching, a sharp difference appears in the result of Li et al. (2020). Among all methods, RSGAN removes the most rib residue and retains the richest details, which demonstrates that RSGAN obtains better and more robust rib-suppressed images via the disentanglement approach. Regarding the details remaining in the rib-suppressed results, the proposed RSGAN renders details more clearly than the other methods and retains both large nidi (tuberculosis, red rectangles) and very small objects (small vessels and trachea, e.g., blue rectangles), whereas Li et al. (2020) leaves a large clavicle residue because of poor rib prediction. All of these advantages contribute to reducing reading difficulty and the chance of misdiagnosis.

We investigate the effectiveness of two key components in our RSGAN: (i) the residual map mechanism and domain adaptation (RMDA); and (ii) the different extra adaptive loss functions. The residual map mechanism is coupled with domain adaptation, so we treat them as an indivisible component, and experiments with and without RMDA are performed to demonstrate its effectiveness. The detailed settings are RSGAN not using the residual map (nRM), RSGAN using the residual map (RM), and RSGAN using RMDA. In the RSGAN (nRM) experiment, the final rib-suppressed image is directly predicted by the generator G_Q in Fig. 2, and G_Q is trained with an L_1 loss between Q_d and Q̂_d, the ground truth and prediction of the rib-suppressed DRR image, respectively. In the RSGAN (RM) experiment, all generators are preserved but the discriminators are removed; the generators with the residual map mechanism are trained with the corresponding loss terms, where λ_i is set to 10, identical to RSGAN. For RSGAN (RMDA), all generators and discriminators are used, and all settings are the same as the proposed RSGAN except that the extra adaptive loss functions are excluded.
RSGAN (RMDA) is thus equal to RSGAN trained only with the initial stage mentioned in Section 2.5, whose loss function is written as Eq. 12. We then set RSGAN (RMDA) as the baseline model and provide an ablation study on the effect of the proposed extra loss functions, including the rib-suppressed adversarial loss L_G, the background consistency loss L_inter, and the residual smoothness loss L_Δ. We compare the quantitative results when adding the above components in turn, evaluated with the metrics of Weber contrast, LPIPS, PSNR, and SSIM.

Table 2 and Fig. 7 show the quantitative and visual results of the ablation study on the two key components, the residual map mechanism and domain adaptation. Comparing RSGAN (RM) with RSGAN (nRM), the LPIPS is reduced by nearly 29 × 10^-2, and the PSNR and SSIM increase by 2.65 and 0.008, respectively, which illustrates that introducing the residual map mechanism improves the preservation of inter-rib details. The metrics of RSGAN (RM) are also better than those of U-Net (DRR), which shows that our proposed disentangled network has better generation ability than U-Net. After introducing the domain adaptation structure, RSGAN (RMDA) achieves a lower Weber contrast of 2.45 × 10^-2 and a lower MAE_rec of 0.027 than RSGAN (RM), and a much better MAE_cyc of 0.041 than U-Net (Cycle) and Li et al. (2020). This may be because the proposed domain adaptation reduces the impact of domain variance between DRR and CXR and avoids changing the inter-rib area during domain transfer.

Table 3 shows the quantitative comparisons over the different extra adaptive loss functions; the first row lists the baseline results obtained with the typical adversarial learning loss alone. We can see that adding the suppression loss L_G reduces the Weber contrast by 0.74 × 10^-2, illustrating that the intensity difference between the inner rib region and the background decreases. Despite this, L_G results in a decrease of PSNR and an increase of LPIPS, indicating a loss of details during rib suppression. After further introducing the inter-rib constraint loss L_inter, LPIPS, PSNR, and SSIM improve markedly, suggesting that L_inter facilitates the preservation of inter-rib details when suppressing ribs. Additionally, the complementary Laplacian-based loss L_Δ yields a marginal improvement on the listed metrics and retains clearer details in the inner-rib area of the rib-suppressed result. As shown in Fig. 8, three CXRs are chosen from the Shenzhen Hospital dataset (Jaeger et al., 2014) to visually compare the rib-suppressed performance under different loss functions. When L_G is added, we obtain rib-suppressed CXRs with fewer rib residues than the baseline, but with fewer details remaining. Compared with L_G alone, L_inter helps keep the intensity in the inter-rib area unchanged, and L_Δ improves the preservation of details in the inner-rib area. Fig. 9 illustrates the box plots of Weber contrast, LPIPS, PSNR, and SSIM for RSGAN and the compared methods.
Paired t-tests show that the Weber contrast, LPIPS, PSNR, and SSIM improvements given by RSGAN are statistically significant (p < 0.05) against U-Net (DRR), U-Net (Cycle), Li et al. (2020), RSGAN (nRM), RSGAN (RM), RSGAN (RMDA), and "+L_G". Although RSGAN does not give statistically significant improvements in Weber contrast and SSIM against "+L_G + L_inter" (p = 0.141 and p = 0.900), it achieves statistically significant improvements in LPIPS and PSNR (p < 0.05).

We employ two downstream applications, lung disease classification and tuberculosis detection, to evaluate the quality of the generated rib-suppressed CXR images. To quantify the contribution of rib suppression to lung disease classification, we conduct experiments on the Chest-xray14 dataset (Wang et al., 2017) and TBX11K (Liu et al., 2020). Following the experiments in Wang et al. (2017), we utilize DenseNet-121 (Huang et al., 2017) as the classification network to predict 14 lung diseases in the Chest-xray14 dataset and conduct experiments on different input combinations. We evaluate five combinations:
(1) DenseNet-121: input with the original CXR only;
(2) DenseNet-121 + Li: input with the rib-suppressed CXR generated by Li et al. (2020) only;
(3) DenseNet-121 + Li (Mix): input with the concatenation of two original CXR channels and one rib-suppressed CXR channel generated by Li et al. (2020);
(4) DenseNet-121 + RSGAN: input with the rib-suppressed CXR generated by our method only;
(5) DenseNet-121 + RSGAN (Mix): input with the concatenation of two original CXR channels and one rib-suppressed CXR channel generated by our method.
We compare the five input combinations using the common metric of Area Under the Curve (AUC) for predicting the 14 lung diseases in the Chest-xray14 dataset (Wang et al., 2017). The results are illustrated in Table 4, and our method achieves state-of-the-art results on lung disease classification. Concatenating the original CXR and the rib-suppressed image generated by our method yields a 0.014 boost in average AUC over the original CXR alone. Using only our rib-suppressed CXR achieves 1% higher AUC for pneumothorax and 1.6% higher AUC for fibrosis than the original CXR, because these lung diseases cause appearance changes in lung textures and rib suppression helps highlight the lung texture.

Following the experiments in Liu et al. (2020), we utilize Faster R-CNN (FRCNN) as a classification network to classify CXRs into three classes (healthy, unhealthy but non-tuberculosis, tuberculosis). Similarly, we evaluate five input combinations and compare them using the metrics of Accuracy, AUC, Sensitivity, Specificity, Average Precision, and Average Recall for predicting the three classes. The comparisons are based on the TBX11K dataset (Liu et al., 2020) using the official splits. As illustrated in Table 5, RSGAN achieves state-of-the-art results on CXR image classification, with an increase of 2.7% in accuracy and 1.49% in AUC over the input without rib-suppressed images. This supports that rib suppression can help improve CXR image classification accuracy for clinical diagnosis.
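For illustration, the "Mix" input described above (two original CXR channels plus one rib-suppressed channel) could be assembled as in the following sketch. The torchvision DenseNet-121 is a real model, but the multi-label head, loss choice, and preprocessing are assumptions rather than the paper's exact training setup.

```python
import torch
import torch.nn as nn
from torchvision.models import densenet121

model = densenet121(num_classes=14)            # 14 thorax disease labels (Chest-xray14)
criterion = nn.BCEWithLogitsLoss()             # assumed multi-label objective

def make_mix_input(cxr, rib_suppressed):
    """cxr, rib_suppressed: (B, 1, H, W) tensors -> (B, 3, H, W) 'Mix' input."""
    return torch.cat([cxr, cxr, rib_suppressed], dim=1)

x = make_mix_input(torch.rand(2, 1, 320, 320), torch.rand(2, 1, 320, 320))
logits = model(x)                              # (2, 14)
```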
The first line of each case is the CXR images, and the second line is the colormap for the difference in the inter-rib area between the rib-suppressed image and the original CXR. The red region in the CXR colormap indicates the manually labeled mask of the inter-rib area. Rectangles in different colors refer to different areas in the image, and the areas are magnified at the right side of the image. Red arrows indicate some notable differences in the local area. Fig. 8 . Rib-suppressed results for ablation study of extra adaptive loss functions. The first line of each case is the CXR images, and the second line is the colormap for the difference in the inter-rib area between the rib-suppressed image and the original CXR. The red region in the CXR colormap indicates the manually labeled mask of the inter-rib area. Rectangles in different colors refer to different areas in the image, and the areas are magnified at the right side of the image. Red arrows indicate some notable differences in the local area. Fig. 9 . Box plots of Weber contrast, LPIPS, PSNR, and SSIM for different methods. For statistically significant improvements between RSGAN and compared methods, methods marked with *, **, and *** indicate 0.01 < p < 0.05, 0.001 < p < 0.01, and p < 0.001, respectively. Table 2 . The quantitative results of CXR rib suppression of manual annotated image from Shenzhen hospital and Chest-xray14 dataset. The comparison methods involve state-of-art methods and ablation study with different key components in RSGAN (nRM indicates not using a residual map, RM indicates using a residual map, and RMDA indicates using both residual map and domain adaption). The best result is in bold and the second best one is underlined. C w (10 −2 )↓ LPIPS ( Table 3 . Ablation study with different extra adaptive loss functions. The best result is in bold and the second best one is underlined. tion network to predict TB area and classify each bounding box into category-agnostic tuberculosis (CA TB), active TB, and latent TB. Similar to classification tasks, we combine five types of input to train the network: (1) For the evaluation of tuberculosis detection, we utilize two metrics of the average precision of the bounding box (AP). AP 50 refers to AP at the IoU (intersection-over-union) threshold of 0.5. mAP denotes the average AP with the IoU threshold from 0.5 to 0.95 with a step of 0.05. The higher the AP 50 and mAP are, the higher accuracy of tuberculosis detection the method gets. The comparisons are based on TBX11K dataset (Liu et al., 2020) according to the official splits. The results are illustrated in Table 6 . Combining the original CXR image with its ribsuppressed result can obtain the best detection results than the Table 6 . TB area detection results on TBX11K test set. CA TB denotes class-agnostic TB. The best result is in bold and the second best one is underlined. CA TB Active TB Latent TB AP 50 mAP AP 50 mAP AP 50 mAP FRCNN (Liu et al., 2020) 55 other types of input, and achieves approximately 5% increase on AP 50 of CA TB than that of input only with original CXR image. It might be because that RSGAN could suppress the clavicles and ribs overlapped on the tuberculosis lesion, which is illustrated in the red rectangles of case 3 and case 4 in Fig. 6 . In this study, we develop the RSGAN for suppressing ribs in the CXR images. Our proposed RSGAN is especially designed to handle with the challenges -overlapped anatomies and unknown ground truth -in CXR rib-suppression. 
To obtain the rib structural features, we borrow rib-suppression knowledge from DRR images: the structure-specific features of ribs are learned in our disentangled GAN framework and then transferred to CXR images for rib suppression via domain adaptation. In addition, to keep the inter-rib intensities (i.e., those in the cavity regions) intact, we incorporate a residual map mechanism, which characterizes the intensity difference between a CXR and its corresponding rib-suppressed image, into the domain adaptation as a constraint on the contrast-specific features. Furthermore, extra adaptive loss functions are introduced to handle rib residue and preserve details in both the inter-rib and inner-rib regions. We demonstrate the effectiveness of our proposed RSGAN by comparison with state-of-the-art methods, and extensive ablation studies are conducted to validate the improvement brought by each key component of RSGAN.

Residual Map. ResNet (He et al., 2016) demonstrates that residual representations can simplify the optimization of deep CNNs, and many studies (Nie et al., 2018; Sun et al., 2020; de Bel et al., 2021) show that long-range residual connections benefit medical image synthesis. This is because residual maps have an intensity distribution close to zero, making the CNN easier to train. The quantitative results listed in Table 2 and the predictions shown in Fig. 7 confirm that the residual map mechanism effectively improves the accuracy of the predicted rib-suppressed image. For the rib suppression task, the residual map contains many zero-intensity pixels because most areas of the image other than the ribs remain unchanged. Moreover, the residual map is a combination of the projected rib and tissue components, which is simpler than the rib-suppressed CXR itself; thus the rib residual map is easier to predict than the rib-suppressed CXR, which simplifies the task. As shown in Fig. 6, Fig. 7, and Table 2, U-Net (DRR) and RSGAN (RM), the methods that directly predict the residual map instead of the rib-suppressed image, achieve superior performance to RSGAN (nRM) in both quantitative and visual results. In particular, the methods without the residual map are less effective, producing blurred boundaries and dropped details, as shown in Fig. 7.

Domain Adaptation. Domain adaptation is a key factor in suppressing ribs in CXR images based on DRR images. In the proposed RSGAN, the structural features of DRR images are adaptively transferred to those of CXR images via the disentangled structure- and contrast-specific generators. CycleGAN (Zhu et al., 2017) is a widely used solution for domain adaptation by constructing mappings between source and target domains, but CycleGAN-based methods only learn the relationship between CXR and DRR images and neglect the domain variance between the original and rib-suppressed images. The disentanglement network provides reconstruction consistency for the decomposition and reconstruction of a single image and cycle consistency for the domain-transfer processes between CXR and DRR. Aided by these two consistency constraints, the methods equipped with a disentangled architecture achieve clearer details and less blur in the inter-rib area than U-Net (Cycle), demonstrating that disentangled domain adaptation can compensate for the shortcomings of CycleGAN-based methods through these implicit constraints.
As shown in Table 2, Li et al. (2020) achieves a higher MAE_cyc but better rib suppression performance than U-Net (Cycle). The worse MAE_cyc of Li et al. (2020) arises because the LoG transformation is applied to both CXR and DRR images, which sharpens the bone component and enlarges the consistency error. Different from U-Net (Cycle), Li et al. (2020) avoids transferring the rib-suppressed image from DRR to CXR by generating a rib mask from the domain-transferred image, which alleviates the problem of CycleGAN-based methods. However, it is difficult to guarantee that the predicted binary rib masks are accurate at the rib boundaries; as shown in Fig. 6, a plaque remains in case 2 for Li et al. (2020), caused by an incorrect binary rib mask.

Adaptive Loss Function. An effective loss function is highly important for the convergence and performance of a neural network. The extra adaptive loss functions in our RSGAN are introduced to address the drawbacks of the method of Li et al. (2020), which predicts the residual map from the bone component projection based on histogram matching and then subtracts the residual map from the original CXR image within the area of the predicted binary mask. That method neglects the difference between the decomposed bone and the intensity residue and lacks a formulation to promote a consistent rib mask, which may result in sharp changes around rib edges. The loss function L_G in RSGAN provides GAN-based learning to handle rib residue adaptively rather than relying on binary segmentation. As shown in Table 2 and Table 3, the baseline with L_G achieves a lower Weber contrast than both RSGAN and Li et al. (2020). However, LPIPS, PSNR, and SSIM perform worse when only L_G is introduced. Weber contrast evaluates the average intensity difference between the rib and its surrounding region; because rib-suppressed DRR images retain fewer trachea details than CXRs, adversarial learning with L_G blurs the details in the inter-rib area of CXR images, leading to a lower Weber contrast but worse LPIPS, PSNR, and SSIM. To solve this issue, L_inter keeps the intensity in the inter-rib area unchanged and L_Δ improves the preservation of details in the inner-rib area.

Besides the rib suppression performance, we also compare our proposed RSGAN with Li et al. (2020) in downstream applications, including lung disease classification and tuberculosis detection. As shown in Table 4, using only the rib-suppressed image achieves a lower AUC than using only the original CXR images, while combining the CXR with its corresponding rib-suppressed image obtains better results than using either alone. This demonstrates that rib suppression can improve the reliability of pulmonary disease diagnosis by reducing overlapped anatomies and ambiguous structural details, but some details may be missing in the rib-suppressed image, which causes a drop of AUC in lung disease classification when it is used alone. Compared with Li et al. (2020), combining the CXR with the rib-suppressed image predicted by our proposed RSGAN obtains the best performance in both lung disease classification and tuberculosis detection.

In this paper, we propose a GAN-based disentanglement learning network to automatically obtain the rib suppression result of a CXR image. The proposed approach suppresses ribs by predicting the residual map between the input CXR and its rib-suppressed image.
Specifically, we train the model on annotated DRRs for rib suppression and transfer the structural priors derived from unpaired CT/DRR images into the CXR domain. Furthermore, we propose three rib suppression loss functions based on prior knowledge to improve the quality of the generated residual map. Experimental results on multiple benchmarking CXR datasets demonstrate that the performance of automatic lung disease classification and TB area detection is boosted with the aid of the rib-suppressed images produced by our approach. A limitation of our RSGAN is that minor pixel-level rib residues may appear in the predicted rib-suppressed CXR images, caused by interpolation bias when resampling a low-resolution image to high resolution. A future study will focus on generating a high-resolution residual map to achieve a more accurate rib-suppressed image.

References
X-ray in-depth decomposition: Revealing the latent structures.
The Lung Image Database Consortium (LIDC) and Image Database Resource Initiative (IDRI): a completed reference database of lung nodules on CT scans.
Residual CycleGAN for robust domain transformation of histopathological tissue slides.
Improving CNN training using disentanglement for liver lesion classification in CT.
A novel bone suppression method that improves lung nodule detection.
Computed tomography - an increasing source of radiation exposure.
Disentangled representation learning in cardiac image analysis.
Separation of bones from chest radiographs by means of anatomically specific multiple massive-training ANNs combined with total variation minimization smoothing.
Bone suppression of chest radiographs with cascaded convolutional networks in wavelet domain.
Lung structures enhancement in chest radiographs via CT-based FCNN training.
Deep residual learning for image recognition.
Suppression of translucent elongated structures: applications in chest radiography.
Suppression of translucent elongated structures: applications in chest radiography.
Clavicle segmentation in chest radiographs.
Missed lung cancer.
Densely connected convolutional networks.
Multimodal unsupervised image-to-image translation.
Two public chest X-ray datasets for computer-aided screening of pulmonary diseases. Quantitative Imaging in Medicine and Surgery 4.
Automatic screening for tuberculosis in chest radiographs: a survey. Quantitative Imaging in Medicine and Surgery 3.
Unsupervised domain adaptation in brain lesion segmentation with adversarial networks.
Analyzing and improving the image quality of StyleGAN.
ImageNet classification with deep convolutional neural networks.
Dual-energy X-ray absorptiometry and body composition.
Diverse image-to-image translation via disentangled representations.
High-resolution chest X-ray bone suppression using unpaired CT structural priors.
Encoding CT anatomy knowledge for unpaired chest X-ray image decomposition.
Rethinking computer-aided tuberculosis diagnosis.
Improved texture analysis for automatic detection of tuberculosis (TB) on chest radiographs with bone suppression images.
Medical image synthesis with deep convolutional adversarial networks.
U-Net: Convolutional networks for biomedical image segmentation.
Elimination of clavicle shadows to help automatic lung nodule detection on chest radiographs.
An adversarial learning approach to medical image synthesis for lesion detection.
Image-processing technique for suppressing ribs in chest radiographs by means of massive training artificial neural network (MTANN).
An image inpainting technique based on the fast marching method.
Enabling machine learning in X-ray-based procedures via realistic simulation of image formation.
ChestX-ray8: Hospital-scale chest X-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases.
A learning-based deformable template matching method for automatic rib centerline extraction and labeling in CT images.
Unsupervised domain adaptation via disentangled representations: Application to cross-modality liver segmentation.
Cascade of multi-scale convolutional neural networks for bone suppression of chest radiographs in gradient domain.
The unreasonable effectiveness of deep features as a perceptual metric.
Task driven generative modeling for unsupervised domain adaptation: Application to X-ray image segmentation.
Translating and segmenting multimodal medical volumes with cycle- and shape-consistency generative adversarial network.
Generation of virtual dual energy images from standard single-shot radiographs using multi-scale and conditional adversarial network.
Unpaired image-to-image translation using cycle-consistent adversarial networks.