key: cord-0117543-znsy7z45 authors: Guo, Xiaoyuan; Duan, Jiali; Purkayastha, Saptarshi; Trivedi, Hari; Gichoya, Judy Wawira; Banerjee, Imon title: OSCARS: An Outlier-Sensitive Content-Based Radiography Retrieval System date: 2022-04-06 journal: nan DOI: nan sha: 7f259918cde5d2770bf6d0a7413b88fd184c978c doc_id: 117543 cord_uid: znsy7z45

Improving retrieval relevance on noisy datasets is an emerging need for the curation of large-scale clean datasets in the medical domain. While existing methods can be applied for class-wise retrieval (aka inter-class), they cannot distinguish the granularity of likeness within the same class (aka intra-class). The problem is exacerbated on external medical datasets, where noisy samples of the same class are treated equally during training. Our goal is to identify both intra- and inter-class similarities for fine-grained retrieval. To achieve this, we propose an Outlier-Sensitive Content-based rAdiography Retrieval System (OSCARS), consisting of two steps. First, we train an outlier detector on a clean internal dataset in an unsupervised manner. Then we use the trained detector to generate anomaly scores on the external dataset, whose distribution is used to bin intra-class variations. Second, we propose a quadruplet (a, p, n_intra, n_inter) sampling strategy, where the intra-class negative n_intra is sampled from bins of the same class other than the bin the anchor a belongs to, while the inter-class negative n_inter is randomly sampled from other classes. We suggest a weighted metric learning objective to balance the intra- and inter-class feature learning. We experimented on two representative public radiography datasets, and the experiments show the effectiveness of our approach. The training and evaluation code can be found at https://github.com/XiaoyuanGuo/oscars.

With the widespread adoption of radiology in diagnosis and treatment planning, the amount of medical image data is rapidly increasing Hwang et al. (2012). Fast and effective retrieval in large-scale medical image repositories is in high demand to support data management, research and clinical applications Sotomayor et al. (2021). One common approach is content-based retrieval, which has been widely researched and applied in the medical field Wang et al. (2014); Dubey (2021); Chowdhury et al. (2016); Chen et al. (2022). For a given query image, a content-based image retrieval (CBIR) system returns a ranked list of images from the database based on a similarity measure between the query and the retrieved images Duan and Kuo (2021); Revaud et al. (2019). The core idea behind CBIR is to minimize the distance of an anchor image a to its positive counterparts p and maximize its distance to the corresponding negative images n in the feature space. Usually, the positive images are in the same class as the anchor image. However, adopting this strategy can be problematic, as it only considers the inter-class variation. The assumption that, as long as a and p are from the same class, they show similar visual features is not realistic, since samples from one class often exhibit certain intra-class variations. Noisy, under-represented samples, also called outliers, can exist. This phenomenon is more common in radiology, as images are often acquired with different equipment from different sources and vary with acquisition protocols. These variations, as shown in the left part of Fig. 1, pose specific challenges in the consumer domain and need to be recognized when assessing image similarity Akgül et al. (2011).
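To make the CBIR query flow concrete, the following is a minimal PyTorch sketch of the ranking step described above: embed the query, score it against pre-computed database embeddings, and return the ranked list. This is an illustrative sketch under our own naming conventions, not the authors' released code.

```python
import torch
import torch.nn.functional as F

def retrieve(query_emb: torch.Tensor, db_embs: torch.Tensor, k: int = 10):
    """Rank database images by cosine similarity to the query embedding.

    query_emb: (D,) feature vector of the query image.
    db_embs:   (N, D) feature matrix of the database images.
    Returns the indices of the top-k most similar database images.
    """
    q = F.normalize(query_emb.unsqueeze(0), dim=1)   # (1, D), unit-norm
    db = F.normalize(db_embs, dim=1)                 # (N, D), unit-norm
    sims = (db @ q.t()).squeeze(1)                   # cosine similarities, (N,)
    return torch.topk(sims, k=k).indices
```

Metric-learning objectives such as the triplet and quadruplet losses discussed below shape the embedding space that this ranking operates on.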
Although there have been multiple studies on radiograph retrieval, few pay attention to the intra-class similarity problem. Anavi et al. (2015) investigated X-ray image retrieval with both distance-based and probability-based approaches. Chowdhury et al. (2016) proposed a content-based medical image retrieval (CBMIR) system for radiographic images and employed a CNN to obtain high-level image representations. Qayyum et al. (2017) proposed a CBIR framework by training a CNN for the classification task. Layode and Rahman (2020) developed a chest X-ray image retrieval system for COVID-19 detection with deep denoising autoencoders as feature extractors. Zhong et al. (2021) designed an image retrieval system for COVID-19 chest radiographs by optimizing a multi-similarity loss. Outside the medical domain, methods including FastAP Cakir et al. (2019), MultiSimilarity Wang et al. (2019), CircleLoss Sun et al. (2020) and SupCon Khosla et al. (2020) try to discover challenging negative data to improve retrieval accuracy. Nevertheless, these existing efforts all emphasize the inter-class similarity but neglect the intra-class similarity.

In this paper, we focus on relevant radiograph image retrieval in external datasets, which can contain a lot of noisy data compared to the clean internal dataset. Such a system will help collect cleaner external image datasets with minimal human effort and accelerate AI evaluation. To achieve this goal, we propose an Outlier-Sensitive Content-based rAdiography Retrieval System (OSCARS), which takes both the intra-class and inter-class variations into consideration. To acquire the intra-class variation information, we adopt unsupervised anomaly detectors trained on the internal dataset and use the anomaly scores they assign to the external dataset to split each class into several bins, each covering a certain range of anomaly scores. Based on these bins, we construct quadruplets (a, p, n_intra, n_inter) with an anchor image a, a positive image p from the same class and same bin, an intra-class negative image n_intra from the same class but a different bin, and an inter-class negative image n_inter from a different class. With the proposed quadruplet sampling strategy, we incorporate the intra-class discriminative information into the training data and hence improve the retrieval sensitivity to outlier-related queries after model training. All the images in a quadruplet are fed into the feature extractor to learn their latent embeddings (e_a, e_p, e_n_intra, e_n_inter). As illustrated in the right of Fig. 1, we then learn the intra-class embedding similarity to achieve Sim(e_a, e_p) > Sim(e_a, e_n_intra) with an intra-class triplet loss L_intra, and the inter-class similarity Sim(e_a, e_n_intra) > Sim(e_a, e_n_inter) with an inter-class triplet loss L_inter, in a weighted way.

Our summarized contributions are: 1. We introduce the task of outlier-sensitive image retrieval for noisy external medical image datasets and propose an effective image retrieval system, OSCARS, to enhance the relevance of outlier-related results. 2. We propose to acquire intra-class information of external datasets via anomaly detectors trained in an unsupervised manner. By training on clean internal datasets, the anomaly detectors assign each sample of the external dataset a specific anomaly score, based on which we split each class into several bins with different intra-class variations. 3. We sample both intra-class and inter-class negative images to construct quadruplets for intra-class and inter-class similarity learning. 4. We demonstrate the model's effectiveness on two representative public radiography datasets: Stanford Musculoskeletal Radiography (MURA) Rajpurkar et al. (2017) and CheXpert Irvin et al. (2019).

Figure 2: OSCARS architecture involves two main steps. Step 1: train anomaly detectors on the internal dataset for each class C_I^i; learn the clean in-distributions with anomaly scores assigned to C_I^i; apply the trained anomaly detectors to each class C_E^i of the external dataset and split the data into several bins according to the anomaly scores (dark colors mean more distribution shift). Step 2: generate quadruplets (a, p, n_intra, n_inter) by sampling the intra-class positive, intra-class negative and inter-class negative simultaneously; learn the intra-class and inter-class similarity in feature space with the intra-class triplet loss L_intra and inter-class triplet loss L_inter.

Given a clean internal dataset D_I and a noisy external dataset D_E, the external data of class c can contain outliers visually different from the internal class. A conventional image retrieval system is therefore insufficient for the external dataset, as it treats all samples from one class as the same without considering the intra-class variations; such a system lacks sensitivity to outliers, undermining retrieval accuracy. Our objective is to train an image retrieval model that prioritizes images by both intra-class and inter-class dissimilarity during retrieval ranking. Figure 2 summarizes the whole framework of our model, which involves two main steps. First, we learn intra-class information in an unsupervised way (introduced in Sec. 2.1). Second, we sample training data that carry both intra-class bin information and inter-class information (introduced in Sec. 2.2). With these steps, images with the same labels and similar contents are pulled together, maintaining intra-class similarity.

Due to the difficulty of collecting annotated data with intra-class information in the medical domain, outlier-sensitivity research on medical images has lagged behind. To overcome this problem, we propose to generate intra-class labels automatically, inspired by a recent work, MedShift Guo et al. (2021b). Given a clean internal dataset D_I, MedShift suggested an approach to identify outliers in a noisy external dataset D_E. Following the same steps as MedShift, we first obtain the internal distribution information by training an unsupervised outlier detector named CVAD Guo et al. (2021a) for each class on the same internal datasets used in Guo et al. (2021b). The trained anomaly detectors, having learnt intra-class discriminative features, are then evaluated on the external datasets. Each external sample thus receives an anomaly score, based on which we split each class into B bins with the K-Means clustering technique Lloyd (1982); MacQueen et al. (1967). B (5 in our paper) is determined by the Elbow method Thorndike (1953). The resulting bins cover different anomaly score ranges, and the data from the different bins give us the intra-class labels.
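A minimal sketch of this binning step, assuming scikit-learn's KMeans is applied to the one-dimensional anomaly scores with B = 5 as in the paper; the helper name and the bin re-ordering are our illustrative choices, not the authors' code:

```python
import numpy as np
from sklearn.cluster import KMeans

def bin_by_anomaly_score(scores: np.ndarray, n_bins: int = 5) -> np.ndarray:
    """Cluster 1-D anomaly scores into n_bins intra-class bins.

    Returns one bin index per sample, re-labelled so that bin 0 holds the
    lowest (most in-distribution) scores and bin n_bins-1 the highest.
    """
    km = KMeans(n_clusters=n_bins, n_init=10, random_state=0)
    raw = km.fit_predict(scores.reshape(-1, 1))
    # Re-order cluster ids by their centers so bins follow the score range.
    order = np.argsort(km.cluster_centers_.ravel())
    remap = {old: new for new, old in enumerate(order)}
    return np.array([remap[c] for c in raw])
```

In practice, B itself would be chosen per class with the Elbow method, e.g. by inspecting the K-Means inertia over a range of cluster counts.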
Given that both the intra-class and inter-class labels are available, for each image a we randomly sample one intra-class positive image p, one intra-class negative sample n_intra and one inter-class negative sample n_inter, collecting the quadruplets (a, p, n_intra, n_inter) for training. Each image of a sampled quadruplet is fed to a CNN-based feature extractor to acquire the latent embeddings (e_a, e_p, e_n_intra, e_n_inter). For simplicity, we adopt ResNet18 He et al. (2016) pre-trained on ImageNet Deng et al. (2009) as the network backbone. OSCARS is designed to consider both the inter-class similarity and the intra-class similarity at the same time, which gives the model sensitivity to intra-class outlier relevance during image retrieval. However, balancing the two parts is challenging: too much weight on intra-class information will degrade the general inter-class retrieval accuracy. Therefore, we design an intra-class triplet margin loss and an inter-class triplet margin loss to optimize the model. To balance the influence of intra-class and inter-class information on the final ranking, we adopt a weighted loss formulated as:

$$L = \lambda L_{intra}(e_a, e_p, e_{n_{intra}}) + (1-\lambda)\, L_{inter}(e_a, e_{n_{intra}}, e_{n_{inter}}) = \lambda \max\{d(e_a, e_p) - d(e_a, e_{n_{intra}}) + M_{intra},\ 0\} + (1-\lambda) \max\{d(e_a, e_{n_{intra}}) - d(e_a, e_{n_{inter}}) + M_{inter},\ 0\}, \quad (1)$$

where $d(x, y) = \|x - y\|_2$. λ, M_intra and M_inter are set to 0.05, 1 and 2 in our experiments, respectively.

Given a query image unseen during training, we first acquire its representation with the trained image feature backbone and then compute the cosine similarity between the representative features of the query image and the dataset images. Images are ranked by similarity score in descending order.

We evaluated our approach on two publicly available large-scale radiograph image datasets. The first is the Stanford MURA dataset, a large dataset of bone X-rays containing seven classes: HAND, FOREARM, FINGER, SHOULDER, ELBOW, WRIST, HUMERUS. The second is the CheXpert dataset, which has 14 classes in total: No Finding, Enlarged Cardiomediastinum, Cardiomegaly, Lung Lesion, Lung Opacity, Edema, Consolidation, Pneumonia, Atelectasis, Pneumothorax, Pleural Effusion, Pleural Other, Fracture, Support Devices. Chest X-ray images come in two views, frontal and lateral; we use only the frontal view here and leave the lateral view for future studies. See more details in the supplementary materials.

For the retrieval task, we report the retrieval recall at rank K (R@K, K ∈ {1, 5, 10, 50, 100}), precision at rank K (P@K, K ∈ {1, 5, 10, 50, 100}), and outlier sensitivity (S@K, K ∈ {1, 5, 10, 50, 100}). Recall measures the percentage of relevant images retrieved over the total number of retrieved images, $recall@K = N_R / K$, where $N_R$ is the number of relevant images among the top-K results. Precision is assigned based on whether the query image and a retrieved image share the same label: with an indicator function $\delta(\cdot) \in \{0, 1\}$, $precision@K = \frac{1}{K} \sum_{i=1}^{K} \delta(y_q = y_i)$, where $y_q$ and $y_i$ are the labels of the query and the i-th retrieved image. Additionally, we evaluate the outlier sensitivity by calculating the anomaly score difference $|A_q - A_r|$ between the query and each correctly retrieved image, where $A$ denotes the anomaly score. We scale the anomaly scores of the MURA dataset into [0, 1] with the sigmoid function due to the large variations of its anomaly scores.
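Returning to the weighted objective in Eqn. 1, here is a minimal PyTorch sketch with the stated hyperparameters (λ = 0.05, M_intra = 1, M_inter = 2). It is our illustrative reading of the equation, not the authors' released code, and the function name is ours.

```python
import torch
import torch.nn.functional as F

def oscars_loss(e_a, e_p, e_n_intra, e_n_inter,
                lam: float = 0.05, m_intra: float = 1.0, m_inter: float = 2.0):
    """Weighted quadruplet objective: lam * L_intra + (1 - lam) * L_inter.

    Each argument is a batch of embeddings of shape (B, D); d is the
    Euclidean distance, as in Eqn. 1.
    """
    d_ap = F.pairwise_distance(e_a, e_p)              # d(e_a, e_p)
    d_an_intra = F.pairwise_distance(e_a, e_n_intra)  # d(e_a, e_n_intra)
    d_an_inter = F.pairwise_distance(e_a, e_n_inter)  # d(e_a, e_n_inter)
    l_intra = torch.clamp(d_ap - d_an_intra + m_intra, min=0)
    l_inter = torch.clamp(d_an_intra - d_an_inter + m_inter, min=0)
    return (lam * l_intra + (1 - lam) * l_inter).mean()
```

Note that n_intra serves double duty: it is the negative of the intra-class triplet and the anchor-side "positive" of the inter-class triplet, which is what ties the two rankings together.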
The pipelines are developed using PyTorch 1.9.0, Python 3.7.3 and CUDA compilation tools V11.4 on a machine with 4 NVIDIA Quadro RTX A6000 GPUs with 48GB memory. All models are trained for 50 epochs with an initial learning rate of 0.001 and an SGD optimizer. As a representative image retrieval method trained on triplet data, we select DeepRank as our baseline. State-of-the-art CBIR approaches including FastAP, MultiSimilarity, CircleLoss and SupCon are used to compare model performance. Notably, we keep the feature extractor consistent across all methods to ensure fair comparisons.

Quantitative Results: Table 1 presents the recall and precision performance on the Stanford MURA and CheXpert datasets. Since CheXpert samples can have multiple labels, we calculate correct hits with a loose-match strategy: for a query chest X-ray with multiple labels, a retrieved image is relevant as long as one label matches. Compared to the baseline DeepRank, OSCARS enhances the recall and precision performance on both datasets and achieves the best recall at 1 and precision at 1. In general, SupCon has the highest recall on the MURA dataset; nonetheless, OSCARS achieves the best precision for MURA and the best recall for CheXpert. Additionally, we report the sensitivity results in the supplementary materials.

Qualitative Results: Figure 3 shows an example of a HAND query image from the MURA dataset; the corresponding retrieval results, including ours, are presented in different rows. As can be seen, although many methods achieve high recall and precision (see Table 1), they fail to distinguish the intra-class variations. In particular, MultiSimilarity and SupCon exhibit little sensitivity to the noisy query. Comparatively, our method prioritizes intra-class similarity and ranks images with similar anomaly semantics ahead. Please refer to the supplementary materials for more results.

Figure 3: Hand results. Left is the query image; the right shows retrieval results. Green boxes mean both intra- and inter-class correct; blue boxes mark inter-class-correct predictions. Each retrieved image has its label on top; for correct predictions, we also show the anomaly scores. Closer anomaly scores mean more similarity.

Impact of Lambda: We also explore the impact of different λ values in the loss function (Eqn. 1). A good balance between intra-class and inter-class information enables the retrieval system to produce both accurate inter-class and outlier-sensitive intra-class results. Figure 4 illustrates the performance variations on the two datasets under different settings. λ decides how the model weights the intra-class and inter-class information. We observe that too much weight on intra-class similarity degrades the inter-class similarity predictions; our experiments suggest that 0.05 works well.
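For completeness, here is a small sketch of the loose-match rule used in the CheXpert evaluation above, assuming labels are encoded as 14-bit binary vectors as described in Appendix D; the helper name is ours and the function is illustrative only:

```python
import numpy as np

def loose_match_precision_at_k(query_label: np.ndarray,
                               retrieved_labels: np.ndarray) -> float:
    """Precision@K under the loose-match rule: a retrieved image counts as
    relevant if it shares at least one positive label with the query.

    query_label:      (C,) binary multi-label vector of the query.
    retrieved_labels: (K, C) binary label matrix of the top-K results.
    """
    q = query_label.astype(bool)                             # (C,)
    hits = (retrieved_labels.astype(bool) & q).any(axis=1)   # (K,) relevance flags
    return float(hits.mean())
```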
In this work, we propose OSCARS, an outlier-sensitive radiography image retrieval system that goes beyond retrieving images with the most inter-class similarity and also inspects the intra-class similarity implicitly when query images show certain variations. By automatically learning the clean internal distribution, the intra-class variations of external sources are captured and used to generate intra-class labels by splitting each class into several groups. Feeding the sampled quadruplets, which consist of the intra-class positive and negative samples and the inter-class negative samples, to the image feature learner, a weighted margin loss is adopted to optimize the retrieval network. The resulting retrieval system is sensitive to outlier-related queries, as it has learnt to rank the retrieved results based on both intra-class and inter-class similarities. This outlier-sensitive image retrieval approach gives clinical users access to more relevant medical images and allows radiologists to process and analyze radiography images more effectively.

We here introduce the details of the Stanford MURA and CheXpert datasets and the samples used in training and evaluation. Stanford MURA contains 21,471 images. We sample a quadruplet for each image and split the quadruplets into training and validation sets with a ratio of 8:2. We evaluate the retrieval performance on the remaining 1,873 unseen images. CheXpert has 223,414 training images in total, of which 138,358 are in frontal view; after filtering out invalid samples, 118,286 remain. For each image we sample a quadruplet, resulting in 118,286 training quadruplets, which we split into training (80%) and validation (20%) parts. The remaining 282 frontal chest X-rays are used for testing. Since both datasets contain images of varied sizes, we resize all images to a fixed size of 224 × 224 × 3 to fit the feature extractor network.

We report the sensitivity results on both the MURA and CheXpert datasets. We calculate sensitivity only for correct hits; therefore, even when a model shows lower sensitivity values in some situations, the overall evaluation of its performance should also take recall and precision into consideration. Because the anomaly score ranges of MURA vary a lot, we scale its scores into the range [0, 1] with a sigmoid function; for the CheXpert dataset, we keep the original anomaly scores. As CheXpert data is multi-labeled, a sample with more than one label can have multiple anomaly scores, one per class; therefore, when there are multiple hits, we take the minimum difference between the anomaly scores of the query image and the database images. Compared to the MURA classes, chest X-rays are often similar to each other and thus difficult to retrieve, so we present the CheXpert results with higher floating-point precision. Generally, lower sensitivity values are better.

Appendix C. Visualization of MURA Retrieval. Figure C.5: Query results. Left is the query image; retrieval results are shown on the right. Green boxes mean both intra- and inter-class correct; blue boxes mark inter-class-correct predictions. Each retrieved image has its corresponding label on top, and for the correct predictions we also show the anomaly scores. Closer anomaly scores mean more similarity.

Since a CheXpert sample can have more than one label, we encode the labels into a binary code of length 14, where 1 means the sample belongs to the class and 0 means irrelevant. The 14 bits correspond, in order, to the classes No Finding, Enlarged Cardiomediastinum, Cardiomegaly, Lung Lesion, Lung Opacity, Edema, Consolidation, Pneumonia, Atelectasis, Pneumothorax, Pleural Effusion, Pleural Other, Fracture, Support Devices. For simplicity, we show only the labels, without the anomaly scores. Figure D.7: CheXpert query results. Left is the query image; retrieval results are shown on the right. Green boxes mean both intra-class and inter-class correct; blue boxes mark inter-class-correct predictions, and red boxes mark wrong predictions. Each retrieved image has its corresponding label on top.
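The exact sensitivity formula did not survive extraction, but per the description above it compares anomaly scores between the query and its correct hits, with sigmoid scaling for MURA and a minimum over per-class score differences for multi-label CheXpert. A sketch under those assumptions (helper names and the averaging over hits are ours):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sensitivity_at_k(query_scores, retrieved_scores, scale: bool = False):
    """Mean anomaly-score difference between the query and the correctly
    retrieved images among the top-K results (lower is better).

    query_scores:     (L,) anomaly scores of the query, one per class label
                      (L == 1 for single-label MURA).
    retrieved_scores: list of (L_i,) score arrays, one per correct hit.
    scale:            apply sigmoid scaling (used for MURA, whose raw
                      scores vary widely).
    """
    q = sigmoid(np.asarray(query_scores)) if scale else np.asarray(query_scores)
    diffs = []
    for r in retrieved_scores:
        r = sigmoid(np.asarray(r)) if scale else np.asarray(r)
        # For multi-label samples, take the minimum score difference
        # over all label pairs, as described for CheXpert above.
        diffs.append(np.min(np.abs(q[:, None] - r[None, :])))
    return float(np.mean(diffs))
```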
References

Akgül et al. (2011). Content-based image retrieval in radiology: current status and future directions.
Anavi et al. (2015). A comparative study for chest radiograph image retrieval using binary texture and deep learning classification.
Cakir et al. (2019). Deep metric learning to rank.
Chen et al. (2022). Deep learning for instance retrieval: A survey.
Chowdhury et al. (2016). An efficient radiographic image retrieval system using convolutional neural network.
Deng et al. (2009). ImageNet: A large-scale hierarchical image database.
Duan and Kuo (2021). Bridging gap between image pixels and semantics via supervision: A survey.
Dubey (2021). A decade survey of content based image retrieval using deep learning.
Guo et al. (2021a). CVAD: A generic medical anomaly detector based on Cascade VAE.
Guo et al. (2021b). MedShift: identifying shift data for medical dataset curation.
He et al. (2016). Deep residual learning for image recognition.
Hwang et al. (2012). Medical image retrieval: past and present.
Irvin et al. (2019). CheXpert: A large chest radiograph dataset with uncertainty labels and expert comparison.
Khosla et al. (2020). Supervised contrastive learning.
Layode and Rahman (2020). A chest X-ray image retrieval system for COVID-19 detection using deep transfer learning and denoising auto encoder.
Lloyd (1982). Least squares quantization in PCM.
MacQueen et al. (1967). Some methods for classification and analysis of multivariate observations.
Qayyum et al. (2017). Medical image retrieval using deep convolutional neural network.
Rajpurkar et al. (2017). Large dataset for abnormality detection in musculoskeletal radiographs.
Revaud et al. (2019). Learning with average precision: Training image retrieval with a listwise loss.
Sotomayor et al. (2021). Content-based medical image retrieval and intelligent interactive visual browser for medical education.
Sun et al. (2020). Circle loss: A unified perspective of pair similarity optimization.
Thorndike (1953). Who belongs in the family?
Wang et al. (2014). Learning fine-grained image similarity with deep ranking.
Wang et al. (2019). Multi-similarity loss with general pair weighting for deep metric learning.
Zhong et al. (2021). Deep metric learning-based image retrieval system for chest radiograph and its clinical applications in COVID-19.