title: Differentially private federated deep learning for multi-site medical image segmentation
authors: Ziller, Alexander; Usynin, Dmitrii; Remerscheid, Nicolas; Knolle, Moritz; Makowski, Marcus; Braren, Rickmer; Rueckert, Daniel; Kaissis, Georgios
date: 2021-07-06

Collaborative machine learning techniques such as federated learning (FL) enable the training of models on effectively larger datasets without data transfer. Recent initiatives have demonstrated that segmentation models trained with FL can achieve performance similar to locally trained models. However, FL is not a fully privacy-preserving technique and privacy-centred attacks can disclose confidential patient data. Thus, supplementing FL with privacy-enhancing technologies (PTs) such as differential privacy (DP) is a requirement for clinical applications in a multi-institutional setting. The application of PTs to FL in medical imaging and the trade-offs between privacy guarantees and model utility, the ramifications on training performance and the susceptibility of the final models to attacks have not yet been conclusively investigated. Here we demonstrate the first application of differentially private gradient descent-based FL on the task of semantic segmentation in computed tomography. We find that high segmentation performance is possible under strong privacy guarantees with an acceptable training time penalty. We furthermore demonstrate the first successful gradient-based model inversion attack on a semantic segmentation model and show that the application of DP prevents it from divulging sensitive image features.

Training effective machine learning (ML) models in medical imaging is a data-driven problem, where model utility is typically directly dependent on the quantity and quality of data available during training. In recent works (Sheller et al., 2019, 2020), federated learning (FL) has been proposed to allow the utilisation of multi-site clinical datasets to enable and encourage collaboration between data owners, obtaining larger pools of high-quality, diverse and representative data while avoiding direct data sharing. While FL circumvents centralised data pooling, it is not a privacy-enhancing technology (PT), as it does not provide the federation with any formal notion of privacy with regard to the patient data they are holding. This can leave the model vulnerable to catastrophic privacy breaches through attacks such as model inversion (Zhu and Han, 2020; Zhao et al., 2020; Geiping et al., 2020) or membership inference (Shokri et al., 2017) during model training by malicious parties inside or outside the federation. It is therefore imperative that collaborative learning does not just benefit the utility of the model through a richer data pool, but also provides formal privacy guarantees to the participating parties, for example through the utilisation of PTs such as differential privacy (DP) or encrypted algorithm training. However, the application of PTs often comes at the cost of decreased model utility (privacy-accuracy trade-off), e.g. due to the addition of noise to the training process (Bagdasaryan and Shmatikov, 2019). Minimising this trade-off is a complex, yet fundamental process, and has so far not been conclusively investigated in the area of medical imaging.
In the present work, we perform federated medical image segmentation under image-level differential privacy guarantees. Through detailed experiments, we demonstrate that the appropriate choice of architecture and training technique can enable excellent model performance despite rigorous privacy guarantees while minimising network overhead and computational demands.

We use the term Federated Learning (FL, (Konečnỳ et al., 2016)) to denote a collaborative learning protocol in which a number of parties (nodes/workers) jointly train a neural network. Training occurs in rounds during which the central server sends the model to the nodes, local training occurs for a number of iterations, whereupon the updated models are aggregated by Federated (Gradient) Averaging (McMahan et al., 2017); a minimal sketch of this aggregation step is given below. We assume a cross-silo topology, based on a small number of centres with relatively large datasets. Moreover, we assume homogeneous node compute resources and constant node availability.

We use the following definition of Differential Privacy (DP) by Dwork et al. (2014): For some randomised algorithm (=mechanism) M, all subsets S of its image, a sensitive dataset D and its neighbouring dataset D', whereby D and D' differ by at most one record, we say that M is (ε, δ)-differentially private (DP) if, for a (typically small) constant ε and 0 ≤ δ < 1:

Pr[M(D) ∈ S] ≤ exp(ε) · Pr[M(D') ∈ S] + δ.

The probability Pr is taken over the randomness of the algorithm M. DP is, therefore, an attribute of an algorithm that makes it approximately invariant to the exclusion or inclusion of a single data point. It quantifies an individual's contribution to the final outcome of the computation. When the difference between the contributions of multiple participants is minimal, it is not possible to reliably determine the presence or the absence of an individual.

Sheller and colleagues have demonstrated the utilisation of FL in the context of brain tumour segmentation (Sheller et al., 2019, 2020). However, neither work utilises PTs to provide privacy guarantees to the included patients. Li et al. (2019) also showcase brain tumour segmentation using FL. The Sparse Vector Technique utilised in that study, however, only provides privacy guarantees for the model's parameters and not for the dataset records, which is not a meaningful notion of privacy. In comparison, DP-SGD, utilised in our study, provides the guarantees to each individual patient, providing the federation with a pragmatic, information-theoretic privacy-preserving solution instead. Fay et al. (2020) utilise the Private Aggregation of Teacher Ensembles for brain tumour segmentation. This technique was originally developed for classification tasks and imposes strong assumptions on the learning process. Consequently, the authors could not demonstrate reasonable privacy guarantees in their study, while still incurring a steep utility penalty. Yang et al. (2021) utilised FL for COVID-19 lesion segmentation in computed tomography (CT) but did not employ PTs. Lastly, Sarma et al. (2021) demonstrate FL segmentation of prostate volumes in magnetic resonance imaging (MRI), but also did not employ any PTs.
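The following is a minimal sketch of the Federated Averaging aggregation step referred to above, assuming PyTorch-style state_dicts. The function and variable names are illustrative and do not correspond to the PriMIA implementation used in this work; the optional sample-count weighting mirrors the aggregation option evaluated later in the experiments.

```python
# Illustrative sketch of Federated Averaging (McMahan et al., 2017): the server
# averages locally trained parameters, optionally weighting each worker by its
# number of training samples. Not the PriMIA implementation used in the study.
import copy
from typing import Dict, List

import torch


def federated_average(
    worker_states: List[Dict[str, torch.Tensor]],
    sample_counts: List[int],
    weighted: bool = True,
) -> Dict[str, torch.Tensor]:
    """Aggregate a list of model state_dicts into a new global state_dict."""
    if weighted:
        total = float(sum(sample_counts))
        weights = [n / total for n in sample_counts]
    else:
        weights = [1.0 / len(worker_states)] * len(worker_states)

    return {
        name: sum(w * state[name].float() for w, state in zip(weights, worker_states))
        for name in worker_states[0]
    }


def training_round(global_model, nodes, local_train):
    """One FL round: broadcast, train locally on each node, aggregate.

    `nodes` and `local_train` are placeholders for the simulated hospital
    datasets and the local optimisation loop, respectively.
    """
    states, counts = [], []
    for node in nodes:
        local_model = copy.deepcopy(global_model)   # server broadcasts the model
        states.append(local_train(local_model, node.data))
        counts.append(len(node.data))
    global_model.load_state_dict(federated_average(states, counts))
```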
As witnessed from the survey of previous works above, although several studies have dealt with the topic of FL for medical image segmentation, our work is, to the best of our knowledge, the first to utilise differentially private stochastic gradient descent (DP-SGD) in addition to FL in order to provide both stringent image-level privacy guarantees and maintain high model utility in the setting of medical image segmentation. We summarise our main contributions below:

• We present an in-depth study on the application of DP to medical image segmentation by successfully training several segmentation model architectures in the federated setting. Contrary to previous works, our implementation of differentially private training provides strict, provable guarantees with respect to each individual image, which allows the federation to obtain a meaningful measure of privacy.

• Our models trained with FL achieve segmentation performance comparable to centrally trained models while suffering only mild privacy-utility trade-offs.

• We demonstrate the first successful model inversion attack on semantic segmentation architectures, leading to the full reconstruction of input images for certain models. We thus provide evidence that, despite prevailing opinion in the literature, FL in itself is an insufficient technique for protecting patient privacy. We then empirically show that, consistent with its theoretical privacy guarantees, the addition of DP to model training completely thwarts such privacy-centred attacks.

For FL experimentation, we simulated a scenario in which three hospitals (=workers/nodes) collaboratively train the neural network, coordinated by a central server. For this, we split the dataset randomly by patient onto the three servers, maintaining an equal number of patients per server. We conducted all experimentation using the PriMIA framework (Kaissis et al., 2021), a generic open-source software package for privacy-preserving and federated deep learning on medical imaging, which we adapted to semantic medical image segmentation.

For deep neural network training, we utilised DP-stochastic gradient descent (DP-SGD) (Abadi et al., 2016), which extends DP guarantees to gradient-based optimisation by clipping the L2-norm of the per-sample gradients of each minibatch to a specific value and adding Gaussian noise of predetermined magnitude to the averaged minibatch gradients before performing an optimisation step (a didactic sketch of this step is given below). We utilise the Rényi Differential Privacy Accountant (Mironov et al., 2019), an extension of the moments accountant technique by Abadi et al., for the privacy analysis of this algorithm, i.e. the calculation of the privacy loss at each individual site in terms of ε. We note that the reported privacy guarantees are record-level, and not patient-level, guarantees. Moreover, we regarded all datasets used as public for the purposes of experimentation such as hyperparameter searches. DP training was performed under three privacy regimes, shown in Table 1. In the following, we will refer to these as low, medium and high privacy regimes. Finally, we note that the utilisation of Batch Normalisation layers is incompatible with DP training, as the running statistics of the layers are maintained non-privately. Hence, we deactivated the running statistics collection for Batch Normalisation layers, effectively converting them to Instance/Channel Normalisation layers (Dai and Heckel, 2019), which are DP-compatible.
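Below is a didactic sketch of a single DP-SGD step as described above. It loops over the samples of a minibatch for clarity, whereas the implementation used in the study vectorises the per-sample gradient computation; all function and variable names are illustrative rather than taken from PriMIA, and Batch Normalisation layers are assumed to have running-statistics tracking disabled as described in the text.

```python
# Illustrative DP-SGD step (Abadi et al., 2016): per-sample gradients are
# clipped to a maximum L2 norm, summed, perturbed with Gaussian noise and
# averaged before the optimiser step. Didactic per-sample loop; production
# implementations vectorise this computation.
import torch


def dp_sgd_step(model, loss_fn, optimizer, images, masks,
                max_grad_norm=1.0, noise_multiplier=1.0):
    params = [p for p in model.parameters() if p.requires_grad]
    summed = [torch.zeros_like(p) for p in params]
    batch_size = images.shape[0]

    for i in range(batch_size):
        model.zero_grad()
        loss = loss_fn(model(images[i:i + 1]), masks[i:i + 1])
        grads = torch.autograd.grad(loss, params)

        # Clip the concatenated per-sample gradient to max_grad_norm (L2).
        total_norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
        clip_coef = torch.clamp(max_grad_norm / (total_norm + 1e-6), max=1.0)
        for s, g in zip(summed, grads):
            s.add_(g * clip_coef)

    # Add Gaussian noise calibrated to the clipping norm, then average.
    for p, s in zip(params, summed):
        noise = torch.randn_like(s) * noise_multiplier * max_grad_norm
        p.grad = (s + noise) / batch_size

    optimizer.step()
```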
(Hyper-)parameter    Low     Medium   High
                     1.0     1.0      1.0
ε (local)            5.98    3.58     1.82
ε (federated)        11.5    7.08     3.54

Table 1: Definition of privacy regimes used in the study. The Noise Multiplier and L2 Clipping Norm are hyperparameters of the DP-SGD algorithm. δ and ε are record-level guarantees for the local model and per-site record-level guarantees for the federated models. ε also designates the privacy budget, i.e. the privacy loss at which model training is aborted.

In order to empirically verify whether reconstruction of training data by an Honest-but-Curious (HbC) adversary with white-box model access (Evans et al., 2017) who participates in the training protocol is possible, we performed a model inversion attack utilising the gradients shared during Federated Averaging. We employ an approach similar to the Deep Leakage from Gradients (DLG) attack (Zhu and Han, 2020; Zhao et al., 2020; Geiping et al., 2020) to infer the data that generated the update. The attack relies on approximating the input image through a gradient descent-based perturbation of a randomly initialised noise matrix, based on a differentiable similarity metric to the gradient captured during model training. Unlike the original attack, designed for image reconstruction in a classification context given a model update and the corresponding label, we provided the gradient update and a segmentation mask, which are used to reconstruct the input. We observed that the adversary's use of a segmentation mask of the victim greatly improves the results of the reconstruction. In the case of more complex models, the attack was not successful without access to the segmentation mask, which we consider a limitation of this method.

All experiments were carried out using U-Net-like architectures (Ronneberger et al., 2015) with modifications detailed in (Yakubovskiy, 2020), introducing newer architectures as backbones to the encoder portion of the U-Net (see the sketch below). Since network input/output overhead is a critical bottleneck for federated learning, we focused on architectures which provide a good balance between network size and performance on established benchmarks. Hence, we included MobileNet V2 (Sandler et al., 2018) and ResNet-18 (He et al., 2016) as backbones. Moreover, we considered MoNet (Knolle et al., 2020), a novel lightweight U-Net-like architecture with extremely few parameters specifically optimised for FL. Lastly, for comparability to the original U-Net, we utilised an eleven-layer VGG architecture with Batch Normalisation (Simonyan and Zisserman, 2014) (VGG-11 BN). An overview of the number of parameters (i.e. model size) and Multiply/Accumulate operations (MACs) can be found in Table 2. We note that MoNet relies on dilated (atrous) convolutions with large receptive fields, which introduce a substantial number of operations despite the small network size. Moreover, the MobileNet-V2 architecture is optimised for CPU performance (Orsic et al., 2019). We therefore carried out timing experiments both on CPU and GPU (compare Table 4).

We performed hyperparameter optimisation on non-private baseline models in a decentralised manner over the entire federation to obtain suitable values for the learning rate, beta parameters for the Adam optimiser, as well as translation, rotation and scale values for image augmentation. This corresponds to local hyperparameter optimisation prior to DP training to avoid repeated dataset interaction, which would consume the privacy budget.
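As an illustration of the backbone exchange described above, the following sketch instantiates U-Net variants with different encoders using the segmentation_models_pytorch package (Yakubovskiy, 2020). The argument values are assumptions for illustration only (single-channel CT input, binary liver mask, no ImageNet weights), and MoNet is a custom architecture that is not part of this package.

```python
# Sketch: U-Net-style models with interchangeable encoder backbones via the
# segmentation_models_pytorch package (Yakubovskiy, 2020). Argument values are
# illustrative; MoNet is a custom architecture not available in this package.
import segmentation_models_pytorch as smp

common = dict(
    encoder_weights=None,  # the study pre-trains on a pancreatic CT dataset instead
    in_channels=1,         # single-channel CT slices
    classes=1,             # binary liver mask
)

models = {
    "ResNet-18 U-Net": smp.Unet(encoder_name="resnet18", **common),
    "MobileNet-V2 U-Net": smp.Unet(encoder_name="mobilenet_v2", **common),
    "VGG-11 BN U-Net": smp.Unet(encoder_name="vgg11_bn", **common),
}

# Parameter counts give a rough sense of the network input/output overhead
# that each backbone incurs during federated training.
for name, model in models.items():
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{name}: {n_params / 1e6:.1f} M parameters")
```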
For FL, the synchronisation rate, the number of different image augmentations, as well as whether or not the model parameters were weighted by the number of samples on the worker during aggregation, were additionally optimised. The settings found through optimisation were used to train the corresponding models with DP.

All models were trained and evaluated on the MSD Liver segmentation task (Simpson et al., 2019). The dataset was split into a 63% training set, a 7% validation set and a 30% held-out testing set. All architectures were pre-trained on a pancreatic CT segmentation dataset from our institution to improve convergence speed and offer equal starting conditions to all models.

Model evaluation results on the test set are listed in Table 3. We observed that models trained with FL are able to achieve performance on par with locally trained models. In line with previous results (Abadi et al., 2016), all models exhibited performance deterioration due to the utilisation of DP. Surprisingly, we found the performance penalty to be especially high for the ResNet-18 architecture, which did not converge at all in the FL setting with DP. Moreover, we consistently found the VGG-11 BN architecture to perform best. This finding suggests that a high parameter count is not the sole determinant of performance deterioration due to DP, as suggested in previous work (Papernot et al., 2019). Instead, both the collaborative training setting and the architecture itself seem to contribute to this phenomenon. However, these findings await detailed future analysis. Moreover, the high variance of the per-image Dice score distributions for the ResNet-18 and MobileNet V2 backbones suggests a disparate effect of DP on different images (see Figure 1), with the MoNet and VGG-11 BN architectures maintaining a more consistent performance in the privately trained local setting. Exemplary segmentation results are shown in Figure 2. In Table 4, we compare the time required per epoch of training for the various models. We found the optimised, widely used architectures to train more efficiently. In comparison, MoNet trades off computational efficiency for a much smaller size to alleviate network input/output constraints.

In this study we employ a novel gradient-based model inversion attack based on the work of Geiping et al. (2020). In our setting, the adversary starts with a randomly initialised (image, segmentation mask) pair and a captured model update. The optimisation task executed by the adversary involves perturbing the image to such an extent that it (along with a corresponding segmentation mask) produces a gradient update similar to the one that the adversary captured. We utilise cosine similarity as the cost function in this optimisation process. The procedure is repeated until either the loss starts diverging or the final iteration is reached. We note that for our method the attacker is assumed to have access to a segmentation mask that corresponds to the sensitive image. We consider this to be a limitation of our approach and note that for smaller networks, the utilisation of a randomly initialised segmentation mask can provide the adversary with a suitable reconstruction result. However, for the purposes of medical imaging, larger models such as VGG-11 or MoNet require a segmentation mask in order to allow the attacker model to converge and produce an acceptable reconstruction result.
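A minimal sketch of this gradient-matching reconstruction, in the spirit of Geiping et al. (2020), is given below: the adversary optimises a randomly initialised dummy image so that the gradient it induces (together with the known segmentation mask) is cosine-similar to the captured update. All names and hyperparameter values are illustrative and do not reproduce the exact attack configuration used in the study.

```python
# Illustrative gradient-matching reconstruction adapted to segmentation: the
# adversary holds a captured gradient update and the victim's segmentation
# mask, and optimises a dummy image so that its induced gradient matches the
# captured one under cosine similarity.
import torch


def invert_gradients(model, loss_fn, captured_grads, mask,
                     image_shape, steps=2000, lr=0.1):
    params = [p for p in model.parameters() if p.requires_grad]
    dummy_image = torch.randn(1, *image_shape, requires_grad=True)
    optimiser = torch.optim.Adam([dummy_image], lr=lr)

    for _ in range(steps):
        optimiser.zero_grad()
        model.zero_grad()
        loss = loss_fn(model(dummy_image), mask)
        grads = torch.autograd.grad(loss, params, create_graph=True)

        # Cosine distance between the induced and the captured gradients.
        dot = sum((g * c).sum() for g, c in zip(grads, captured_grads))
        norm = (sum(g.pow(2).sum() for g in grads).sqrt()
                * sum(c.pow(2).sum() for c in captured_grads).sqrt())
        cost = 1.0 - dot / (norm + 1e-12)

        cost.backward()
        optimiser.step()

    return dummy_image.detach()
```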
Gradient-based model inversion attacks were performed in two settings: initially, gradients captured from FL training without DP were attacked. We then evaluated the attacks on the same architectures with the addition of DP. As presented in Figure 3, non-private models sharing unprotected updates during training risk their data being reconstructed in full. In comparison, models trained with the addition of DP yield no usable information to the inversion attack, regardless of architecture. In general, architectural complexity seems to inhibit attack success to a greater degree than shown in previous work on classification (Geiping et al., 2020).

Figure 3: Gradient-based reconstructions on the models used in the study. Top row: unprotected models. Bottom row: models with DP (Noise Multiplier 2.0, L2 clipping norm 0.5).

To the best of our knowledge, this is the first work to demonstrate DP-SGD-based collaborative model training in the context of semantic medical image segmentation. Our main conclusion is that the provision of rigorous privacy guarantees is possible in the FL setting while maintaining high model utility. One area that we highlight as a promising research direction is an investigation into the relationship between privacy-oriented modes of training and the selection of an optimal model architecture. We found that larger model architectures can be more robust to noise addition, whereas lightweight models can be advantageous in non-private FL settings. This highlights a promising research area that considers the application of privacy-preserving mechanisms in task-specific deployments in order to better tailor defence mechanisms to learning tasks with optimal utility preservation. Additionally, we outline a requirement for investigations into the field of pragmatic applications of DP, in order to allow privately trained models to be interpretable by machine learning researchers and to facilitate the widespread utilisation of private collaborative model training.

Our novel model inversion attack in the unprotected FL setting resulted in a catastrophic privacy breach, while only utilising a segmentation mask and a shared model update. This highlights that FL alone is an insufficient privacy preservation mechanism in collaborative learning and should be regarded as a method for preservation of data ownership/governance which facilitates controlled data access. As our method requires possession of a segmentation mask by the adversary, future work will include a natural relaxation of this requirement and the study of robust, scalable model inversion attacks. We note that, similarly to other model inversion implementations, supporting large batches of images is a non-trivial task; we therefore also outline this area as a potential future direction of work. We expect that our work will stimulate further research on privacy-preserving machine learning, essential to large-scale, multi-site medical imaging analysis, in order to allow collaborative model training while mitigating the associated privacy risks.

The work follows appropriate ethical standards in conducting research and writing the manuscript, following all applicable laws and regulations regarding treatment of animals or human subjects.
Deep learning with differential privacy. ACM SIGSAC Conference on Computer and Communications Security
Differential Privacy Has Disparate Impact on Model Accuracy
Channel normalization in convolutional neural network avoids vanishing gradients
The algorithmic foundations of differential privacy
A pragmatic introduction to secure multi-party computation
Decentralized differentially private segmentation with PATE
Inverting gradients - how easy is it to break privacy in federated learning?
Deep residual learning for image recognition
End-to-end privacy preserving deep learning on multi-institutional medical imaging
Efficient, high-performance pancreatic segmentation using multi-scale feature extraction
Federated learning: Strategies for improving communication efficiency
Privacy-preserving federated brain tumour segmentation
Communication-efficient learning of deep networks from decentralized data
Rényi differential privacy of the sampled Gaussian mechanism
In defense of pre-trained ImageNet architectures for real-time semantic segmentation of road-driving images
Making the shoe fit: Architectures, initializations, and tuning for learning with privacy
U-Net: Convolutional networks for biomedical image segmentation
MobileNetV2: Inverted residuals and linear bottlenecks
Federated learning improves site performance in multicenter deep learning without data sharing
Multi-institutional deep learning modeling without sharing patient data: A feasibility study on brain tumor segmentation
Federated learning in medicine: facilitating multi-institutional collaborations without sharing patient data
Membership inference attacks against machine learning models
Very deep convolutional networks for large-scale image recognition
A large annotated medical image dataset for the development and evaluation of segmentation algorithms
Segmentation Models PyTorch
Federated semi-supervised learning for COVID region segmentation in chest CT using multi-national data from China, Italy, Japan
iDLG: Improved deep leakage from gradients
Deep leakage from gradients

Georgios Kaissis received funding from the Technical University of Munich, School of Medicine Clinician Scientist Programme (KKF), project reference H14. Dmitrii Usynin received funding from the Technical University of Munich / Imperial College London Joint Academy for Doctoral Studies. This research was supported by the UK Research and Innovation London Medical Imaging & Artificial Intelligence Centre for Value Based Healthcare. The funders played no role in the design of the study, the preparation of the manuscript or the decision to publish. The liver segmentation dataset is described and available at https://arxiv.org/pdf/1902.09063. The authors declare no conflicts of interest.