title: Weakly Supervised Contrastive Learning for Better Severity Scoring of Lung Ultrasound
authors: Gare, Gautam Rajendrakumar; Tran, Hai V.; deBoisblanc, Bennett P.; Rodriguez, Ricardo Luis; Galeotti, John Michael
date: 2022-01-18

With the onset of the COVID-19 pandemic, ultrasound has emerged as an effective tool for bedside monitoring of patients. As a result, a large number of lung ultrasound scans have become available for AI-based diagnosis and analysis. Several AI-based patient severity scoring models have been proposed that rely on scoring the appearance of the ultrasound scans; these models are trained on ultrasound-appearance severity scores that are manually labeled based on standardized visual features. We address the challenge of labeling every ultrasound frame in a video clip. Our contrastive learning method treats the video clip severity labels as noisy weak severity labels for individual frames, thus requiring only video-level labels, and we show that it performs better than conventional cross-entropy-loss-based training. We combine frame severity predictions into video severity predictions and show that the frame-based model achieves performance comparable to a video-based TSM model on a large dataset combining public and private sources.

Lung ultrasound (LUS) imaging has proven to be an effective bedside tool for monitoring COVID-19 patients (Mento et al., 2020; Raheja et al., 2019; Amatya et al., 2018). Several AI-based applications have emerged that help with the diagnosis and identification of COVID-19 lung biomarkers (Born et al., 2020, 2021; Roy et al., 2020; Van Sloun and Demi, 2020; Xue et al., 2021; Gare et al., 2021). Most of these methods rely on expert-annotated data for learning, demanding scarce and expensive time from expert physicians and radiologists involved in the mitigation of the COVID-19 pandemic. This raises a need for label-efficient learning techniques.

Monitoring patient severity and making prognostic predictions play a critical role in the allocation of limited medical resources. To this end, several AI-based patient severity scoring techniques have recently been proposed (Roy et al., 2020; Xue et al., 2021) which rely on video- and frame-based annotations. However, labeling every individual frame in an ultrasound video clip is time-consuming and expensive. Labeling only the video clip, and treating the clip's severity label as a pseudo severity label for each of its frames, would be far cheaper. But doing so introduces label noise, since not all frames in a clip actually display the same severity signs. For instance, the B-line artifact, which is indicative of an unhealthy lung, is not consistently seen in every frame of an unhealthy lung ultrasound clip, so not all frames show the same level of disease. We propose a contrastive learning strategy as a way to mitigate the label noise introduced by the use of such weak frame severity labels directly obtained from the corresponding video severity label. Contrastive learning has previously been used in the literature for semi- and self-supervised learning (Chen et al., 2020a), and quite a few applications of it have already been presented in the medical domain (Wang et al., 2020; Xue et al., 2021).
Contrastive learning acts as a way to regularize feature embeddings so as to learn discriminative features: objective functions operating on the cosine similarity of the feature embeddings enforce greater overlap (similarity) among intra-class features than among inter-class features. Many techniques apply contrastive learning to differentiate COVID-19, healthy, and other pneumonic conditions (Chen et al., 2020b). Chen et al. (2020b) applied contrastive learning to CT scans as a few-shot COVID-19 diagnosis technique, bringing together the feature embeddings of the same class and pulling apart the feature embeddings of different classes. Similarly, Zhang et al. applied contrastive learning to CT scans and paired text to enhance the network's domain invariance without using any expert annotation. Xue et al. (2021) applied contrastive learning to patient-level feature embeddings in order to align features from two different modalities, LUS and clinical information, to predict patient severity. Their LUS feature embeddings are high-level embeddings aggregated from frame-level features into ultrasound zone-level features. In addition to aligning the feature embeddings of the two modalities, they preserve the patient-severity-discriminative features by introducing novel additional loss components to the contrastive loss. Taking a cue from them, we also augment the contrastive loss with additional terms to retain the ultrasound-severity-discriminative features.

We propose a weakly supervised training methodology that applies contrastive learning to the prediction of ultrasound video clip severity scores, making use of the noisy frame severity scores directly obtained from the corresponding video severity score. We show that the proposed contrastive learning setup is more robust to the weak frame severity label noise and thus generalizes better than cross-entropy-loss-based training.

Given an ultrasound B-mode grey image $I_g$, the task is to find a function $F : I_g \rightarrow L$ that maps the image $I_g$ to an ultrasound severity score label $L \in \{0, 1, 2, 3\}$. Because the pleural line produces distinct artifacts (A-lines, B-lines) when scattering ultrasound, depending on the lung condition, the classification model should learn the underlying mappings between the pleural line, the artifacts, and the pixel values in order to make its predictions (a minimal code sketch of such a model is given below).

We compiled a lung ultrasound dataset with linear and curvilinear videos sourced from the publicly usable subset of the POCOVID-Net dataset (Born et al., 2020, 2021) (128 videos), as well as our own private dataset (160 videos). Our dataset consists of multiple ultrasound B-scans of left and right lung regions at depths ranging from 4 cm to 6 cm under different scan settings, obtained using a Sonosite X-Porte ultrasound machine. The combined dataset consists of ultrasound scans of healthy and COVID-19 patients, totaling 288 videos (113 healthy and 175 COVID-19) and resulting in about 50K images. Figure 1 shows the distribution of the data across the ultrasound severity scores and probe types.

[Figure 1: distribution of videos across severity scores 0-3 for the convex and linear probes.]
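The frame-level scoring function $F$ above can be made concrete with a minimal PyTorch sketch. It assumes a standard torchvision ResNet-50 backbone (the architecture used in our experiments below) with its 1000-way head replaced by a 4-way severity head; the wrapper class and its names are illustrative, not taken from any released code.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class FrameSeverityScorer(nn.Module):
    """F : I_g -> L, mapping a B-mode frame to severity logits over {0, 1, 2, 3}.

    Also exposes the 2048-d embedding u taken after global average pooling,
    which the contrastive loss described later operates on.
    """
    def __init__(self, num_scores: int = 4):
        super().__init__()
        backbone = resnet50(weights=None)  # initialization scheme is an assumption
        self.features = nn.Sequential(*list(backbone.children())[:-1])  # up to global avg pool
        self.classifier = nn.Linear(2048, num_scores)

    def forward(self, x: torch.Tensor):
        u = self.features(x).flatten(1)   # (B, 2048) embedding u
        logits = self.classifier(u)       # (B, 4) severity logits
        return logits, u

# Usage: one grey frame resized to 312x232 pixels, replicated to 3 channels
# (the 3-channel replication is an assumption).
model = FrameSeverityScorer()
frame = torch.randn(1, 3, 232, 312)       # (B, C, H, W)
logits, u = model(frame)
p = logits.softmax(dim=1)                 # per-frame severity probabilities p_i
```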
We use the same 4-level ultrasound severity scoring scheme as defined in (Sim) and similarly used in (Roy et al., 2020). Score-0 indicates a normal lung, with a continuous pleural line and horizontal A-line artifacts. Scores 1 to 3 signify an abnormal lung: score-1 indicates alterations in the pleural line with ≤ 5 vertical B-line artifacts, score-2 indicates the presence of > 5 B-lines, and score-3 signifies confounding B-lines with large consolidations. All manual labeling was performed by individuals with at least a month of training from a pulmonary ultrasound specialist. Refer to Figure 4 for sample images corresponding to the severity scores.

We perform dataset upsampling to address the class imbalance in the training data, wherein we upsample all the minority-class labeled data to obtain a balanced training dataset (Rahman and Davis, 2013). All images are resized to 312×232 pixels using bilinear interpolation. Data augmentation is not applied.

To assess the ultrasound severity score of the video clips, we use the video labels as noisy weak labels for the corresponding video frames. We augment the cross-entropy training objective for the classification task with a contrastive learning objective, in order to learn features that are robust to the frame-level label noise. The proposed contrastive learning objective is inspired by (Xue et al., 2021), wherein discriminative representations are learned using a contrastive loss consisting of three parts, which respectively handle intra-class alignment $\mathcal{L}_{IA}$, inter-class contrastive learning $\mathcal{L}_{CL}$, and contrastive continuity $\mathcal{L}_{CC}$. The intra-class alignment objective $\mathcal{L}_{IA}$ brings the feature embeddings of the same severity score closer together; the inter-class contrastive learning objective $\mathcal{L}_{CL}$ differentiates the feature embeddings of different severity scores; and the contrastive continuity objective $\mathcal{L}_{CC}$ ensures that the hierarchy among the severity scores is preserved. The proposed contrastive learning approach is implemented by optimizing the following objective:

$$\mathcal{L}_{con} = \mathcal{L}_{IA} + \mathcal{L}_{CL} + \mathcal{L}_{CC} \quad (1)$$

where $N$ is the total number of frames and $\mathrm{sim}(a, b) = \frac{a^{\top} b}{\|a\| \, \|b\|}$ is the cosine similarity between vectors $a$ and $b$. Here $u$ denotes the feature embedding extracted after the global average pooling layer of the network, a 2048-dimensional vector, and $s$ is the ultrasound severity score of the corresponding frame feature $u$.

Unlike (Xue et al., 2021), which only relates immediate severity levels, we explicitly relate all severity levels to enforce linear relationships, in order to preserve the sequential nature of the possible output choices (e.g., severity-1 is closer to severity-2 than to severity-3) while simultaneously achieving the desired contrast in the loss. Our approach avoids the incorrect possibility of the model learning multi-dimensional distances among outputs, which could, for example, make severity-0 seem very close to severity-3 if the model incorrectly learned a cyclical order among the severity levels. Prior systems do not take this ordinal relationship into account, which can give rise to unnatural orderings, as can be observed in the confusion matrices shown in Figure 4. During training, for the input frame under consideration $i$, we randomly sample frames $k$, $m$, $n$ from different video clips that have different severity scores than $i$, and we randomly select frame $j$ from the same video clip as $i$ within a 10-frame window.
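The component losses are only characterized above, so the sketch below shows one plausible instantiation of $\mathcal{L}_{con}$ for a single anchor frame, under stated assumptions: the margin term (0.1 per unit of severity gap) is a made-up hyperparameter, and the exact functional forms used here are illustrative rather than the paper's Equation (1) components. The sketch follows the sampling scheme just described: a positive $u_j$ from the same clip and negatives from clips with other severity scores.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(u_i, s_i, u_j, negatives):
    """One plausible instantiation of L_con = L_IA + L_CL + L_CC for a single
    anchor frame i (illustrative; not the paper's exact equations).

    u_i       : (D,) anchor embedding with severity score s_i
    u_j       : (D,) positive embedding, same clip as i (same severity)
    negatives : list of (u_k, s_k) pairs from clips with severities != s_i,
                e.g. the sampled frames k, m, n
    """
    sim = lambda a, b: F.cosine_similarity(a, b, dim=0)

    # Intra-class alignment: pull same-severity embeddings together.
    l_ia = 1.0 - sim(u_i, u_j)

    # Inter-class contrast: push different-severity embeddings apart, with a
    # margin growing linearly with the severity gap (assumed 0.1 per level),
    # so that severity-1 stays closer to severity-2 than to severity-3.
    l_cl = sum(F.relu(sim(u_i, u_k) - sim(u_i, u_j) + 0.1 * abs(s_i - s_k))
               for u_k, s_k in negatives) / len(negatives)

    # Contrastive continuity: similarity to the anchor should decrease
    # monotonically as the severity gap increases, preserving the hierarchy.
    ranked = sorted(negatives, key=lambda pair: abs(s_i - pair[1]))
    l_cc = sum(F.relu(sim(u_i, u_far) - sim(u_i, u_near))
               for (u_near, _), (u_far, _) in zip(ranked, ranked[1:]))
    l_cc = l_cc / max(len(negatives) - 1, 1)

    return l_ia + l_cl + l_cc
```

In training, this per-anchor loss would be averaged over the $N$ frames in a batch, matching the $1/N$ normalization in the text.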
The overall training objective $\mathcal{L}_{overall}$ consists of a weighted combination of the cross-entropy loss $\mathcal{L}_{ce}$ for the classification error and the contrastive learning loss $\mathcal{L}_{con}$ for feature regularization:

$$\mathcal{L}_{overall} = \mathcal{L}_{ce} + \alpha \, \mathcal{L}_{con}$$

where the cross-entropy loss is $\mathcal{L}_{ce} = \frac{1}{N} \sum_{i} -g_i \log p_i$, in which $N$ is the total number of frames, $g_i$ is the ground-truth one-hot severity score, and $p_i$ is the predicted probability scores from the last softmax layer of the network; the contrastive learning loss $\mathcal{L}_{con}$ is as defined in Equation (1). For all our experiments we set $\alpha$ to 0.5.

Using the frame predicted probability scores $p_i$, we calculate the video's predicted probability scores $p_v$ by taking the maximum severity-category score over all the corresponding video frames' predicted probability scores:

$$p_v = \left[\, \max_{i \in v} p_i^0, \; \max_{i \in v} p_i^1, \; \max_{i \in v} p_i^2, \; \max_{i \in v} p_i^3 \,\right] \quad (6)$$

where $p_i^0, \ldots, p_i^3$ are the severity-category probability scores 0 to 3, respectively, of frame $i$ belonging to video $v$. Using these video predicted probability scores $p_v$ we evaluate the video-based severity scoring metrics of the model (a short code sketch of this aggregation appears at the end of this section).

The network is implemented in PyTorch and trained using the stochastic gradient descent algorithm (Bottou, 2010) with an Adam optimizer (Kingma and Ba, 2015) set with an initial learning rate of 0.001. The model is trained on an Nvidia Titan RTX GPU with a batch size of 8 for 30 epochs on the classification task. The ReduceLROnPlateau learning-rate scheduler is used, which reduces the learning rate by a factor of 0.5 when the performance metric (accuracy) plateaus on the validation set. For the final evaluation, we pick the model with the highest validation-set accuracy and test it on the held-out test set.

For the severity classification, we report accuracy, precision, recall, and F1 score (Born et al., 2020; Roy et al., 2020). The receiver operating characteristic (ROC) curve is also reported along with its area-under-the-curve (AUC) metric (Kim et al., 2020; Fawcett, 2006); for this metric we take the weighted average, with weights corresponding to the support of each class, and for the multi-label case we use the one-vs-all approach.

We train the ResNet-50 (RN50) (He et al., 2016) model, commonly used for classification and benchmarking, with the proposed contrastive learning setup and compare its performance with the same model trained using only the cross-entropy loss, in order to assess the robustness to the noisy weak frame severity labels achieved by the contrastive learning objective. We also compare against a model trained using the original contrastive learning loss of Xue et al. (2021) and against a TSM (Lin et al., 2018) video classification network similar to (Gare et al.); training details are given in Appendix A. We conduct five independent runs; in each run we randomly split the videos into train, validation, and test sets with a 70%, 10%, and 20% split ratio respectively, maintaining the same split ratio across the individual severity-scored clips and ensuring that all frames corresponding to a video remain in the same split. The training set is upsampled to address the class imbalance (Rahman and Davis, 2013). We report the resulting metrics in the form of mean and standard deviation over the five independent runs. Table 1 shows the mean and standard deviation of the frame-based severity scoring metrics, obtained by evaluating the models from the five independent runs on the held-out test set.
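As noted above, here is a minimal sketch of the overall objective and the Equation (6) max-aggregation; the helper names are illustrative.

```python
import torch

def overall_loss(l_ce, l_con, alpha=0.5):
    # L_overall = L_ce + alpha * L_con, with alpha = 0.5 as in the experiments.
    return l_ce + alpha * l_con

def video_probs(frame_probs: torch.Tensor) -> torch.Tensor:
    """Equation (6): per-severity-category max over all frames of one video.

    frame_probs : (N_frames, 4) softmax outputs p_i for one video
    returns     : (4,) video-level probability scores p_v
    """
    return frame_probs.max(dim=0).values

# Usage: video severity prediction from two frames' probabilities.
p = torch.tensor([[0.70, 0.20, 0.05, 0.05],
                  [0.10, 0.60, 0.20, 0.10]])
p_v = video_probs(p)              # tensor([0.70, 0.60, 0.20, 0.10])
severity = int(p_v.argmax())      # predicted video severity score: 0
```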
We observe that the contrastive learning (CL) trained models perform better than the cross-entropy (CE) trained model; the original and the proposed contrastive learning losses give similar scores, with the original loss performing slightly better.

We calculate the video-based severity scoring metrics of the models from the video predicted probability scores $p_v$, obtained by taking the maximum severity-category score over all the corresponding video frames' predicted probability scores $p$, as defined in Equation (6). Table 2 shows the mean and standard deviation of the video-based severity scoring metrics, obtained by evaluating the models from the five independent runs on the held-out test set. We again observe that the CL-trained models perform better than the CE-trained model and have performance comparable to the video-based TSM model, with our proposed loss function achieving the highest accuracy, recall, and F1-score.

The macro-average and per-severity-score ROC plots of the CL-trained model using the proposed loss for video-based prediction are shown in Figure 2. The lower performance on severity score-3 compared to the other scores could be due to the limited amount of training data for severity score-3. Figure 4 shows the confusion matrices of both contrastive-loss-trained models over the combined 5 runs; the model trained with the proposed loss is confused between immediate severity scores, which is reasonable, and is less confused between non-immediate severity scores than the model trained with the original loss.

Comparing the models' scoring metrics on the held-out test set with those on the validation (val) set used for hyperparameter optimization (see Table 3), we see that although the CE-trained model achieved higher accuracy and F1-score (avg) on the validation set than our CL-trained model, it was outperformed by the CL-trained model on the held-out test set. This suggests that the CL-trained model generalized better to unseen data, which is indicative of robust features learned using the contrastive loss.

We visualize the model's layer-2 Grad-CAM (Selvaraju et al., 2016) and show the mean Grad-CAM image for each of the four severity scores, taken over the entire test set (∼10K images) for the best run, in Figure 4. We also show Grad-CAM on four randomly selected images for which our CL-trained model appeared to be looking at the correct locations (the pleural line and the A-line and B-line artifacts), whereas the CE-trained model was basing its predictions on non-lung tissue. For these four test images, the CL model correctly predicted the severity scores, whereas the CE model got all predictions wrong. This suggests that the contrastive learning objective led to learning better discriminative features (a minimal code sketch of the Grad-CAM procedure follows at the end of this section).

[Figure 4: grey input images, Grad-CAM overlays on random samples, and mean Grad-CAMs over the test set for the CE RN50 and CL RN50 models, across severity scores 0-3.]

We demonstrated a weakly supervised method for scoring COVID-19 lung ultrasound clips using our proposed contrastive learning objective, which treats video-based severity labels as frame-based severity labels, thus reducing labeling cost. While these frame labels are noisy, we demonstrated that the contrastive learning objective is robust to such label noise compared to the cross-entropy learning objective. We also showed that the frame-based model trained using the proposed contrastive learning loss achieves performance comparable to a video-based TSM model.
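The layer-2 Grad-CAM visualization referenced above can be reproduced with a short, library-free routine. The sketch below (hooks on the ResNet layer2 block, bilinear upsampling, max-normalization) is a standard Grad-CAM implementation under assumed details, not the paper's code; the stock ResNet-50 with a 4-way head stands in for the trained model.

```python
import torch
import torch.nn.functional as F
from torchvision.models import resnet50

def grad_cam(model, layer, x, target_class):
    """Minimal Grad-CAM (Selvaraju et al., 2016) for one conv block."""
    acts, grads = {}, {}
    h1 = layer.register_forward_hook(lambda m, i, o: acts.update(a=o))
    h2 = layer.register_full_backward_hook(lambda m, gi, go: grads.update(g=go[0]))

    logits = model(x)
    model.zero_grad()
    logits[0, target_class].backward()    # gradient of the chosen severity logit
    h1.remove(); h2.remove()

    # Channel weights = spatially averaged gradients; CAM = ReLU of weighted sum.
    w = grads["g"].mean(dim=(2, 3), keepdim=True)            # (1, C, 1, 1)
    cam = F.relu((w * acts["a"]).sum(dim=1, keepdim=True))   # (1, 1, h, w)
    cam = F.interpolate(cam, size=x.shape[2:], mode="bilinear", align_corners=False)
    return cam / (cam.max() + 1e-8)       # normalize heatmap to [0, 1]

# Stand-in model: stock ResNet-50 with a 4-way severity head.
model = resnet50(weights=None)
model.fc = torch.nn.Linear(2048, 4)
model.eval()
x = torch.randn(1, 3, 232, 312)
heatmap = grad_cam(model, model.layer2, x, target_class=2)  # layer-2, as in the paper
```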
For a fair comparison with the frame-based models, no augmentation is used. We compare our video-based scoring with the scores reported by prior methods in the literature (Roy et al., 2020; Xue et al., 2021) in Table 4. We see that our method achieves higher scores, though we note that these scores are obtained on different datasets.

References

Lung Ultrasound for COVID-19 - Tabular View - ClinicalTrials
Diagnostic use of lung ultrasound compared to chest radiograph for suspected pneumonia in a resource-limited setting
Automatic detection of COVID-19 from a new lung ultrasound imaging dataset (POCUS)
Accelerating Detection of Lung Pathologies with Explainable Ultrasound Image Analysis
Large-scale machine learning with stochastic gradient descent
A simple framework for contrastive learning of visual representations
Momentum Contrastive Learning for Few-Shot COVID-19 Diagnosis from Chest CT Images
An introduction to ROC analysis
The Role of Pleura and Adipose in Lung Ultrasound AI
Dense pixel-labeling for reverse-transfer and diagnostic learning on lung ultrasound for COVID-19 and pneumonia detection
Deep residual learning for image recognition
Changes in cancer detection and false-positive recall in mammography using artificial intelligence: a retrospective, multireader study (The Lancet Digital Health)
Adam: A method for stochastic optimization
TSM: Temporal Shift Module for Efficient Video Understanding
On the Impact of Different Lung Ultrasound Imaging Protocols in the Evaluation of Patients Affected by Coronavirus Disease
Application of Lung Ultrasound in Critical Care Setting: A Review
Addressing the Class Imbalance Problem in Medical Datasets
Deep Learning for Classification and Localization of COVID-19 Markers in Point-of-Care Lung Ultrasound
Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization
Localizing B-Lines in Lung Ultrasonography by Weakly Supervised Deep Learning, In-Vivo Results
Contrastive Cross-site Learning with Redesigned Net for COVID-19 CT Classification
Modality alignment contrastive learning for severity assessment of COVID-19 from lung ultrasound and clinical information
Contrastive Learning of Medical Visual Representations from Paired Images and Text

Acknowledgments

This work was sponsored in part by US Army Medical contracts W81XWH-19-C0083 and W81XWH-19-C0101. This work used the Extreme Science and Engineering Discovery Environment (XSEDE), which is supported by National Science Foundation grant number ACI-1548562. Specifically, it used the Bridges system, which is supported by NSF award number ACI-1445606, at the Pittsburgh Supercomputing Center (PSC). We would also like to thank our collaborators at Carnegie Mellon University (CMU), Louisiana State University (LSU), and the University of Pittsburgh (UPitt). We are pursuing intellectual-property protection. Galeotti serves on the advisory board of Activ Surgical, Inc. He and Rodriguez are involved in the startup Elio AI, Inc.

Appendix A

We follow the same setup as (Gare et al.) for training a TSM network (Lin et al., 2018) with a ResNet-18 (RN18) (He et al., 2016) backbone and bi-directional residual shift with 1/8 of the channels shifted in each direction. The model is fed input clips 16 frames wide (224×224 pixels), sampled using the same strategy as in Gare et al. For testing, 3 sequential sample clips per video are evaluated and used to obtain the corresponding video predicted probability scores $p_v$, as defined in Equation (6). The model is trained for 30 epochs using the cross-entropy loss.
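To illustrate the test protocol just described, the sketch below scores a video with 3 sequential 16-frame clips and combines them with the Equation (6) max-aggregation. The evenly spaced clip placement and the stand-in model are assumptions; the actual sampling follows Gare et al.

```python
import torch
import torch.nn as nn

class DummyTSM(nn.Module):
    """Stand-in for the TSM RN18 video network, so the sketch runs end to end."""
    def forward(self, clips):                  # clips: (B, T, 3, 224, 224)
        return torch.randn(clips.shape[0], 4)  # severity logits

def eval_video(model, video, clip_len=16, n_clips=3):
    """Score one video with n_clips sequential clips, then max-aggregate the
    per-severity-category probabilities as in Equation (6).

    video : (T, 3, 224, 224) frames
    """
    T = video.shape[0]
    starts = torch.linspace(0, max(T - clip_len, 0), n_clips).long()
    probs = []
    with torch.no_grad():
        for s in starts:
            clip = video[int(s):int(s) + clip_len].unsqueeze(0)  # (1, 16, 3, 224, 224)
            probs.append(model(clip).softmax(dim=1))
    return torch.cat(probs).max(dim=0).values                    # p_v, shape (4,)

p_v = eval_video(DummyTSM().eval(), torch.randn(48, 3, 224, 224))
```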