Federated Contrastive Learning for Decentralized Unlabeled Medical Images
Nanqing Dong, Irina Voiculescu
2021-09-15 (arXiv:2109.07504v1 [cs.LG])

Abstract. A label-efficient paradigm in computer vision is based on self-supervised contrastive pre-training on unlabeled data followed by fine-tuning with a small number of labels. Making practical use of a federated computing environment in the clinical domain and learning on medical images pose specific challenges. In this work, we propose FedMoCo, a robust federated contrastive learning (FCL) framework, which makes efficient use of decentralized unlabeled medical data. FedMoCo has two novel modules: metadata transfer, an inter-node statistical data augmentation module, and self-adaptive aggregation, an aggregation module based on representational similarity analysis. To the best of our knowledge, this is the first FCL work on medical images. Our experiments show that FedMoCo consistently outperforms FedAvg, a seminal federated learning framework, in extracting meaningful representations for downstream tasks. We further show that FedMoCo can substantially reduce the amount of labeled data required in a downstream task, such as COVID-19 detection, to achieve a reasonable performance.

Recent studies in self-supervised learning (SSL) [21] have led to a renaissance of research on contrastive learning (CL) [1]. Self-supervised or unsupervised CL aims to learn transferable representations from unlabeled data. In a CL framework, a model is first pre-trained on unlabeled data in a self-supervised fashion via a contrastive loss, and then fine-tuned on labeled data. With state-of-the-art (SOTA) CL frameworks [8, 16, 3, 24], a model pre-trained on unlabeled data and fine-tuned with only a small amount of labeled data can achieve performance comparable to the same model trained with a large amount of labeled data on various downstream tasks.

As a data-driven approach, deep learning has fueled many breakthroughs in medical image analysis (MIA). Meanwhile, large-scale fully labeled medical datasets require considerable human annotation effort, which makes data scarcity a major bottleneck in practical research and applications. To leverage unlabeled data, CL [10, 4, 23] has obtained promising results, yet none of these studies consider federated learning (FL), which can handle sensitive data stored on multiple devices [14, 20, 22, 7]. Data privacy regulations restrict collecting clinical data from different hospitals for conventional data-centralized CL. Under FL protocols, medical images (either raw or encoded) must not be exchanged between data nodes. In this work, we design an FCL framework on decentralized medical data (i.e. medical images stored on multiple devices or at multiple locations) to extract useful representations for MIA.

Fig. 1: Illustration of the FCL workflow. Each node works on an independent local dataset. The parameter server works with the data nodes for periodic synchronization, metadata transfer (Sec. 2.2), and self-adaptive aggregation (Sec. 2.3). For each downstream task, the pre-trained model weights are fine-tuned with a small amount of labeled data in either a supervised or a semi-supervised fashion. Due to data privacy, the labeled data of a downstream task can only be accessed locally.

There are two direct negative impacts of the FL environment on CL.
First, different imaging protocols in hospitals (nodes) create domain shift [6]. In contrast to CL on centralized data, each node in FCL only has access to its local data, which has a smaller variation in sample distribution. This impairs the CL performance of single nodes. Second, without the supervision of ground-truth labels, there is no guarantee for the performance of CL in each node. Applying supervised FL frameworks to CL directly might lead to worse generalization ability, and hence a poorer outcome, as the performance of FCL could be dominated by the performance of a few nodes. How to aggregate CL models across nodes is still an open question.

We introduce FedMoCo, a robust FCL framework for MIA (see Fig. 1). FedMoCo uses MoCo [8] as the intra-node CL model. To mitigate the above hurdles, we propose metadata transfer, an inter-node augmentation module utilizing the Box-Cox power transformation [2] and Gaussian modeling; we also propose self-adaptive aggregation, a module based on representational similarity analysis (RSA) [12]. We empirically evaluate FedMoCo under various simulated scenarios. Our experiments show that FedMoCo consistently outperforms FedAvg [15], a SOTA FL framework, and that FedMoCo can efficiently reduce the annotation cost of downstream tasks, such as COVID-19 detection. By pre-training on unlabeled non-COVID datasets, FedMoCo requires only 3% of the examples of a labeled COVID-19 dataset to achieve 90% accuracy. Our contributions are threefold: (1) to the best of our knowledge, this is the first work on FCL for medical images; (2) we propose FedMoCo with two novel modules; (3) our results provide insights into future research on FCL.

Problem Formulation. Let K > 1 denote the number of nodes and let D_k denote the local unlabeled dataset of node k, containing n_k examples; the datasets {D_k}_{k=1}^K are non-IID. There is an additional master node, which does not store any clinical data. The master node is implemented as a parameter server (PS) [13]. The model of interest is f_θ and its parameter set θ_0 is randomly initialized on the PS. At the beginning of the federated training, K copies of θ_0 are distributed to the nodes as {θ_k}_{k=1}^K, i.e. we fully synchronize the data nodes with the PS. An important FL protocol is data privacy preservation: exchanging training data between nodes is strictly prohibited. Instead, node k updates θ_k by training on D_k independently. After the same number of local epochs, {θ_k}_{k=1}^K are aggregated into θ_0 on the PS. Again, we synchronize {θ_k}_{k=1}^K with the aggregated θ_0. This process is repeated until certain criteria are met. To enforce data privacy, we only ever exchange model parameters {θ_k}_{k=0}^K and metadata between the data nodes and the PS. The learning outcome is θ_0, to be used in relevant downstream tasks.

We create a positive pair by generating two random views of the same image through data augmentation. A negative pair is formed by taking two random views from two different images. Given an image x and a family of stochastic image transformations T, we randomly sample two transformations τ and τ′ to obtain two random views τ(x) and τ′(x), which form a positive pair. Let z denote the representation of an image extracted by a CNN encoder f_θ. In contrast to previous works [27, 8, 3], we add a ReLU function between the last fully-connected layer and an L2-normalization layer, which projects all extracted features into a non-negative feature space. The non-negative feature space is a prerequisite for Sec. 2.2.
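The following PyTorch sketch illustrates how two random views can be generated and how an encoder can project images into this non-negative, L2-normalized feature space. It is a minimal illustration rather than the authors' released code; the concrete transformation family T and the projection dimension are assumptions (the actual augmentation policy is described later and shown in Fig. 3).

```python
import torch.nn as nn
import torch.nn.functional as F
from torchvision import transforms, models

# Hypothetical instantiation of the stochastic transformation family T.
T = transforms.Compose([
    transforms.Grayscale(num_output_channels=3),  # CXRs are single-channel; replicate for the ResNet stem (assumption)
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

class NonNegativeEncoder(nn.Module):
    """ResNet18 backbone followed by ReLU and L2 normalization, so that every
    extracted feature vector lies in a non-negative feature space."""
    def __init__(self, feature_dim=128):
        super().__init__()
        backbone = models.resnet18(weights=None)
        backbone.fc = nn.Linear(backbone.fc.in_features, feature_dim)
        self.backbone = backbone

    def forward(self, x):
        z = self.backbone(x)
        z = F.relu(z)                  # ReLU before normalization -> non-negative features
        return F.normalize(z, dim=1)   # L2 normalization

# A positive pair is two stochastic views of the same image x (a PIL image here):
#   z_q = encoder(T(x).unsqueeze(0));  z_d = encoder(T(x).unsqueeze(0))
# Views of two different images form a negative pair.
```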
For a query image x ∈ D_k, we have the positive pair z_q = f_{θ_k^q}(τ(x)) and z_0 = f_{θ_k^d}(τ′(x)), and the contrastive loss InfoNCE [18] is defined as

    \mathcal{L}_q = -\log \frac{\exp(z_q \cdot z_0 / t)}{\sum_{i=0}^{N} \exp(z_q \cdot z_i / t)},    (1)

where t is the temperature parameter and {z_i}_{i=1}^N are the negative examples. We use the dynamic dictionary with momentum update from MoCo [8] to maintain a large number of negative examples in Eq. 1. In node k, there are two CNNs: an encoder (θ_k^q) for the query image and another (θ_k^d) for the corresponding positive example, related by the momentum update

    \theta_k^d \leftarrow m\,\theta_k^d + (1 - m)\,\theta_k^q,    (2)

where m ∈ [0, 1) is the momentum coefficient and θ_k^q is updated by back-propagation. For a query image x, we have z_q = f_{θ_k^q}(τ(x)) and z_0 = f_{θ_k^d}(τ′(x)).

The dynamic dictionary of node k must not exchange information with other nodes. This limits the sample variation of the dynamic dictionary, which in turn limits CL performance. In addition, a node overfitted to its local data may not generalize well to other nodes. We overcome these two hurdles by utilizing metadata computed from the encoded feature vectors in each node. For node k, after the full synchronization with the PS (i.e. before the start of the next round of local updates), we encode the local examples with f_{θ_k} and collect statistics of the resulting features. In order to enforce a Gaussian-like distribution of the features, we use the Box-Cox power transformation (BC) [2]. BC is a reversible transformation defined as

    BC(y) = \begin{cases} (y^{\lambda} - 1)/\lambda, & \lambda \neq 0 \\ \log y, & \lambda = 0 \end{cases}    (3)

where λ controls the skewness of the transformed distribution. Then, we calculate the mean and covariance of the transformed features

    \mu_k = \frac{1}{n_k} \sum_{i=1}^{n_k} y_i, \qquad \Sigma_k = \frac{1}{n_k - 1} \sum_{i=1}^{n_k} (y_i - \mu_k)(y_i - \mu_k)^{\top},    (4)

where y_i = BC(f_{θ_k}(x_i)). The metadata of the learned representations in each node are collected by the PS as {(μ_k, Σ_k)}_{k=1}^K, and {(μ_j, Σ_j)}_{j≠k} is sent to node k. We name this operation metadata transfer. Metadata transfer improves the CL performance of node k by statistically augmenting the dynamic dictionary of node k with metadata collected from all the other nodes j ≠ k. For each pair (μ_k, Σ_k), there is a corresponding Gaussian distribution N(μ_k, Σ_k). In the next round of local updates, we increase the sample variation in each node by sampling new points from these Gaussian distributions when minimizing Eq. 1. Specifically, for node k, we increase the number of negative examples in Eq. 1 from N to N(1 + η), where η ≥ 0 is a hyper-parameter that controls the level of interaction between node k and the other nodes (e.g. η = 0 means no interaction). We sample ηN/(K − 1) examples from each N(μ_l, Σ_l), l ≠ k. Letting ỹ ∼ N(μ_l, Σ_l), l ≠ k, we have z̃ = BC^{−1}(ỹ), and the new contrastive loss is

    \mathcal{L}_q = -\log \frac{\exp(z_q \cdot z_0 / t)}{\sum_{i=0}^{N} \exp(z_q \cdot z_i / t) + \sum_{j=1}^{\eta N} \exp(z_q \cdot \tilde{z}_j / t)}.    (5)

(A minimal sketch of this statistical augmentation is given below.)

Algorithm 1: FedMoCo. The training in each data node is warmed up for t_w rounds before metadata transfer; a local round can consist of a few local epochs. [The pseudo-code initializes θ_0 and alternates synchronized local updates, metadata transfer, and aggregation via Eqs. 6 and 7.]

Given the locally updated {θ_k^t}_{k=1}^K, collected at the end of round t, the aggregation step of round t can be formulated as θ^t = θ_0^t = Σ_{k=1}^K a_k θ_k^t, where a_k denotes the k-th diagonal element of a matrix A. For example, FedAvg [15] uses a_k = n_k / Σ_{j=1}^K n_j, because the number of labels indicates the strength of supervision in each node; this rationale does not hold for unsupervised learning. For a node k with large n_k and small variation in D_k (i.e. its examples are similar), f_{θ_k} converges faster but θ_k learns less meaningful representations (e.g. overfitted to "no findings" or to certain diseases). Yet a_k = n_k / Σ_{j=1}^K n_j gives θ_k^t a larger weight, and θ_0^t is dominated by θ_k^t. Instead, we propose self-adaptive aggregation to compute the matrix A. Let −1 ≤ r_k ≤ 1 denote the representational similarity analysis (RSA) [12] score of node k. In round t, we first take ñ_k random samples from D_k as a subset D̃_k, for computational and statistical efficiency.
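Before completing the aggregation rule, the following NumPy sketch illustrates the metadata transfer step described above (Eqs. 3-5): compute Gaussian metadata of the Box-Cox-transformed features and sample synthetic negatives from the other nodes' metadata. This is a hedged sketch, not the authors' implementation; the epsilon offset (Box-Cox needs strictly positive inputs), the clipping in the inverse transform, and all names are assumptions made here for illustration.

```python
import numpy as np

def box_cox(y, lam=0.5, eps=1e-6):
    """Box-Cox power transformation (Eq. 3). The features are non-negative by
    construction (ReLU), so a small epsilon keeps the transform defined at zero."""
    y = y + eps
    return np.log(y) if lam == 0 else (y ** lam - 1.0) / lam

def inv_box_cox(z, lam=0.5, eps=1e-6):
    """Inverse Box-Cox transformation (clipping keeps fractional powers real
    for samples drawn far into the Gaussian tails)."""
    if lam == 0:
        y = np.exp(z)
    else:
        y = np.clip(lam * z + 1.0, 0.0, None) ** (1.0 / lam)
    return y - eps

def node_metadata(features, lam=0.5):
    """Compute the Gaussian metadata (mu_k, Sigma_k) of the Box-Cox-transformed
    features of one node (Eq. 4). features: array of shape (n_k, d)."""
    y = box_cox(features, lam)
    mu = y.mean(axis=0)
    sigma = np.cov(y, rowvar=False)
    return mu, sigma

def synthetic_negatives(other_metadata, n_per_node, lam=0.5, seed=0):
    """Sample from each N(mu_l, Sigma_l), l != k, received via metadata transfer,
    and map the samples back to feature space; they act as the extra negative
    keys z~ in Eq. 5."""
    rng = np.random.default_rng(seed)
    samples = [inv_box_cox(rng.multivariate_normal(mu, sigma, size=n_per_node), lam)
               for mu, sigma in other_metadata]
    return np.concatenate(samples, axis=0)
```

In each round, node k would receive {(μ_j, Σ_j)}_{j≠k} from the PS and call synthetic_negatives with n_per_node = ηN/(K − 1) before minimizing Eq. 5.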
For each x_i ∈ D̃_k, we obtain two representations: f_{θ_0^{t−1}}(x_i) (using the aggregated weights at the end of round t − 1, which are also the globally synchronized weights at the beginning of round t) and f_{θ_k^t}(x_i) (using the locally updated weights at the end of round t). We define ρ_ij as Pearson's correlation coefficient between f_θ(x_i) and f_θ(x_j) for 1 ≤ i, j ≤ ñ_k, and define RDM_ij = 1 − ρ_ij, where RDM is the representation dissimilarity matrix [12] of f_θ. Then, r_k is defined based on Spearman's rank correlation:

    r_k = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)},    (6)

where d_i is the difference between the ranks of the i-th elements of the lower triangular part of the RDM for f_{θ_0^{t−1}} and of the RDM for f_{θ_k^t}, and n = ñ_k(ñ_k − 1)/2. We define A for the aggregation step at the end of round t through Eq. 7, which maps the scores {r_k}_{k=1}^K to the weights {a_k}_{k=1}^K. The CL performance at node k in round t is measured by r_k. Intuitively, given the same θ^{t−1} at the start of round t, a small r_k indicates that the change between the representations f_{θ^{t−1}} and f_{θ_k^t} is large, i.e. node k is still learning meaningful representations. Eq. 7 assigns larger weights to local models with higher potential in representational power. The complete pseudo-code is given in Algorithm 1.

For a fair comparison, we use the same set of hyperparameters and the same training strategy in all experiments. We use ResNet18 [9] as the network backbone, initialized with the same random seed. We use the momentum optimizer with momentum 0.9. Without using any pre-trained weights, we demonstrate that FedMoCo can learn from scratch. The initial learning rate is 0.03 and is multiplied by 0.1 (and 0.01) at 120 (and 160) epochs. The batch size is 64 for each node, the weight decay is 10^{−4}, m in Eq. 2 is 0.999, λ in Eq. 3 is 0.5, the temperature t is 0.2, N is 1024, and η in Eq. 5 is 0.05. We estimate RSA with ñ_k = 100. In the absence of prior FCL work for MIA, we compare FedMoCo with a strong baseline: an integration of MoCo [8] and FedAvg [15], a seminal supervised FL model. All models are implemented with PyTorch and run on NVIDIA Tesla V100 GPUs.

Datasets. FCL should work for any type of medical image. We illustrate FCL on anterior and posterior chest X-rays (CXRs). We use three public large-scale CXR datasets as the unlabeled pre-training data to simulate the federated environment, namely CheXpert [11], ChestX-ray8 [26], and VinDr-CXR [17] (see Table 1). The three datasets were collected and annotated from different sources independently and express a large variety in data modalities (see Fig. 2). The original images are cropped and downsampled to 256 × 256. All three datasets contain noisy labels, and their label distributions differ.

Table 1: Properties of the pre-training datasets (size, number of classes, multi-label, multi-view, balanced, resolution); for example, CheXpert [11] comprises 371,920 images across 14 classes at 390 × 320 resolution.

Data Augmentation. Driven by clinical domain knowledge and experimental findings, we adopt the augmentation policy illustrated in Fig. 3. Compared with previous studies of MoCo on medical images [10, 4, 23], we propose stochastic histogram equalization for CL on medical images. The stochasticity comes from uniformly sampling the parameters of CLAHE [19]. For a fair comparison, we use the same augmentation policy for all models in this work.

Linear Classification Protocol. We evaluate the performance of unsupervised pre-training by following the linear classification protocol (LCP) [8, 16, 3, 24]. ResNet18 is first pre-trained on the unlabeled data. Then a supervised linear classifier (a fully-connected layer) is trained on top of the frozen features extracted from a labeled dataset for 50 epochs with a constant learning rate of 0.1.
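A minimal PyTorch sketch of this linear probing step (frozen pre-trained backbone, a single fully-connected layer on top) is shown below; the data loader, optimizer choice, and feature dimension are assumptions rather than details taken from the paper.

```python
import torch
import torch.nn as nn

def linear_probe(pretrained_backbone, train_loader, feature_dim=512, num_classes=2,
                 epochs=50, lr=0.1, device="cuda"):
    """Linear classification protocol: train a linear classifier on frozen features.
    feature_dim must match the dimensionality of the probed feature layer."""
    backbone = pretrained_backbone.to(device).eval()
    for p in backbone.parameters():
        p.requires_grad = False                      # freeze the pre-trained encoder

    classifier = nn.Linear(feature_dim, num_classes).to(device)
    optimizer = torch.optim.SGD(classifier.parameters(), lr=lr, momentum=0.9)
    criterion = nn.CrossEntropyLoss()

    for _ in range(epochs):
        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)
            with torch.no_grad():
                features = backbone(images)          # frozen features
            loss = criterion(classifier(features), labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return classifier
```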
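As a further sketch, the self-adaptive aggregation module (Sec. 2.3) can be illustrated as follows: build representation dissimilarity matrices from Pearson correlations, compare their lower triangles with Spearman's rank correlation (Eq. 6), and turn the resulting RSA scores into aggregation weights. Since Eq. 7 is not reproduced above, the normalization a_k proportional to (1 − r_k) used here is only an assumption that matches the stated intuition (smaller r_k, larger weight); all names are illustrative.

```python
import numpy as np
from scipy.stats import spearmanr

def rdm(features):
    """Representation dissimilarity matrix: RDM_ij = 1 - Pearson correlation
    between the representations of samples i and j. features: (n_k, d)."""
    return 1.0 - np.corrcoef(features)

def rsa_score(feats_global, feats_local):
    """Spearman rank correlation r_k between the lower triangles of the RDMs
    of the globally synchronized and the locally updated model (Eq. 6)."""
    idx = np.tril_indices(feats_global.shape[0], k=-1)
    r_k, _ = spearmanr(rdm(feats_global)[idx], rdm(feats_local)[idx])
    return r_k

def aggregation_weights(rsa_scores):
    """Assumed instantiation of Eq. 7: a_k proportional to (1 - r_k), so nodes
    whose representations are still changing receive larger weights."""
    w = 1.0 - np.asarray(rsa_scores)
    return w / w.sum()

def aggregate(state_dicts, weights):
    """Weighted average of locally updated parameters: theta_0 = sum_k a_k theta_k.
    state_dicts is a list of PyTorch state_dicts from the K data nodes."""
    avg = {}
    for key in state_dicts[0]:
        avg[key] = sum(float(w) * sd[key].float() for w, sd in zip(weights, state_dicts))
    return avg
```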
We report the classification accuracy on the validation set of the labeled data as the performance of FCL. We choose a public COVID-19 CXR dataset [5] with 3886 CXRs. Note that COVID-19 has not been seen in the three pre-training datasets. We use 50% of the CXRs for training and 50% for testing.

The experiments are simulated in a controllable federated environment. To eliminate the effect of n_k, we first evaluate FedMoCo in a setting where each node has the same number of CXRs. We create 3 data nodes by randomly sampling 10000 CXRs from each of the three datasets, and create 6 nodes by partitioning each data node equally. We evaluate the pre-training performance at T = 200 and T = 400, where one round is one local epoch and FedMoCo is warmed up for t_w = 50 epochs. We provide the performance of MoCo trained in a single node with the same data as the Oracle for centralized CL. LCP results (mean and standard deviation over 3 runs) are presented in Table 2. We have two empirical findings for FCL: increasing the number of nodes decreases performance, and longer pre-training does not necessarily improve the performance on downstream tasks.

To show the isolated contribution of metadata transfer and self-adaptive aggregation, we simulate two common situations in FL with K = 3 and T = 200. First, with the same data as above, we create data distributions that are unbalanced in terms of dataset size (Table 3). Second, with the same number of examples in each node, we create data distributions that are unbalanced in terms of label distribution, by sampling only healthy CXRs (no findings) for the CheXpert and ChestX-ray8 nodes and only CXRs with disease labels for the VinDr-CXR node. The results are presented in Table 4. In summary, FedMoCo consistently outperforms FedAvg under non-IID challenges, and metadata transfer and self-adaptive aggregation each improve the performance as individual modules. An interesting finding, which runs against common intuition, is that FCL can sometimes outperform centralized CL depending on the data distributions; this may be worth further investigation.

Downstream Task Evaluation. To demonstrate the practical value of the representations extracted by FCL under data scarcity, we use COVID-19 detection [5] as the downstream task. We use the same training and test sets as in LCP. We fine-tune the ResNets pre-trained by the FCL models in Table 2 (K = 3 and T = 200) with only 3% of the training set. We train a randomly initialized ResNet with the full training set as the Oracle for supervised learning. For a fair comparison, we use a fixed learning rate of 0.01 to train all models for 100 epochs. We report the highest accuracy in Table 5. FedMoCo outperforms centralized CL and FedAvg. Compared with standard supervised learning, FedMoCo uses only 3% of the labels to achieve 90% accuracy, which greatly reduces the annotation cost.

In this work, we formulate and discuss FCL on medical images and propose FedMoCo. We evaluate the robustness of FedMoCo under a few characteristic non-IID challenges and use COVID-19 detection as the downstream task. More investigations will be conducted, but these initial results already provide insights into future FCL research. In future work, we plan to focus on the task affinity [25] between FCL and its corresponding downstream tasks, and to quantitatively analyze how the representations extracted by FCL influence the performance of different downstream tasks.
References

[1] Contrastive learning and neural oscillations
[2] An analysis of transformations
[3] A simple framework for contrastive learning of visual representations
[4] Momentum contrastive learning for few-shot COVID-19 diagnosis from chest CT images
[5] Can AI help in screening viral and COVID-19 pneumonia?
[6] Unsupervised domain adaptation for automatic estimation of cardiothoracic ratio
[7] Federated deep learning for detecting COVID-19 lung abnormalities in CT: a privacy-preserving multinational validation study
[8] Momentum contrast for unsupervised visual representation learning
[9] Deep residual learning for image recognition
[10] Sample-efficient deep learning for COVID-19 diagnosis based on CT scans
[11] CheXpert: A large chest radiograph dataset with uncertainty labels and expert comparison
[12] Representational similarity analysis: connecting the branches of systems neuroscience
[13] Communication efficient distributed machine learning with the parameter server
[14] Privacy-preserving federated brain tumour segmentation
[15] Communication-efficient learning of deep networks from decentralized data
[16] Self-supervised learning of pretext-invariant representations
[17] VinDr-CXR: An open dataset of chest X-rays with radiologist's annotations
[18] Representation learning with contrastive predictive coding
[19] Adaptive histogram equalization and its variations. Computer Vision, Graphics, and Image Processing
[20] The future of digital health with federated learning
[21] Learning classification with unlabeled data
[22] Federated learning in medicine: facilitating multi-institutional collaborations without sharing patient data
[23] MoCo pretraining improves representation and transferability of chest X-ray models
[24] What makes for good views for contrastive learning
[25] Branched multi-task networks: deciding what layers to share
[26] ChestX-ray8: Hospital-scale chest X-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases
[27] Unsupervised feature learning via non-parametric instance discrimination

Acknowledgements. We would like to thank Huawei Technologies Co., Ltd. for providing GPU computing service for this study.