key: cord-0163783-as8bizwj
authors: Kamp, Michael; Fischer, Jonas; Vreeken, Jilles (CISPA Helmholtz Center for Information Security; Max Planck Institute for Informatics)
title: Federated Learning from Small Datasets
date: 2021-10-07
journal: nan
DOI: nan
sha: 9b37543334e7d4018d8a1d335908dfed62682cff
doc_id: 163783
cord_uid: as8bizwj
Federated learning allows multiple parties to collaboratively train a joint model without sharing local data. This enables applications of machine learning in settings of inherently distributed, undisclosable data such as in the medical domain. In practice, joint training is usually achieved by aggregating local models, for which local training objectives have to be similar, in expectation, to the joint (global) objective. Often, however, local datasets are so small that local objectives differ greatly from the global objective, causing federated learning to fail. We propose a novel approach that intertwines model aggregations with permutations of local models. The permutations expose each local model to a daisy chain of local datasets, resulting in more efficient training in data-sparse domains. This enables training on extremely small local datasets, such as patient data across hospitals, while retaining the training efficiency and privacy benefits of federated learning.
How can we learn high quality models when data is inherently distributed into small parts that cannot be shared or pooled, as we often encounter, for example, in the medical domain (Rieke et al., 2020)? Federated learning solves many but not all of these problems. While it can achieve good global models without disclosing any of the local data, it does require sufficient data to be available at each site in order for the locally trained models to achieve a minimum quality.
In many relevant applications, this is not the case: in healthcare settings we often have as few as a few dozen samples (Granlund et al., 2020; Su et al., 2021; Painter et al., 2020), but also domains where deep learning (DL) is generally regarded as highly successful, such as natural language processing and object detection, often suffer from a lack of data (Kang et al., 2019). In this paper, we present an elegant idea in which models are moved around iteratively and passed from client to client, thus forming a daisy-chain that the model traverses. This daisy-chaining allows us to learn from such small, distributed datasets simply by consecutively training the model with the data available at each site. We should not do this naively, however, since it would not only lead to overfitting, a common problem in federated learning which can cause learning to diverge (Haddadpour and Mahdavi, 2019), but also violate privacy, since a client can draw inferences from a model about the data of the client it received the model from (Shokri et al., 2017). To alleviate these issues, we propose an approach to combine daisy-chaining of local datasets with aggregation of models orchestrated by a coordinator, which we term federated daisy-chaining (FEDDC). […] standard federated learning cannot. For non-convex models such as convolutional neural networks, it improves upon the state of the art on standard benchmark and medical datasets. Formally, we show that FEDDC allows convergence on datasets so small that standard federated learning diverges, by analyzing aggregation via the Radon point from a PAC-learning perspective. We substantiate this theoretical analysis by showing that FEDDC in practice matches the accuracy of a model trained on the full data of the SUSY binary classification dataset, beating standard federated learning by a wide margin. In fact, FEDDC allows us to achieve optimal model quality with only 2 samples per client.
In an extensive empirical evaluation, we then show that FEDDC outperforms vanilla federated learning (McMahan et al., 2017), naive daisy-chaining, and FedProx (Li et al., 2020a) on the benchmark dataset CIFAR10 (Krizhevsky, 2009), and more importantly on two real-world medical datasets. In summary, our contributions are as follows.
• FEDDC, an elegant novel approach to federated learning from small datasets via a combination of daisy-chaining and aggregation,
• a theoretical guarantee that FEDDC improves models in terms of (ϵ, δ)-guarantees, which standard federated averaging cannot,
• a thorough discussion of the privacy aspects and mitigations suitable for FEDDC, including an empirical evaluation of differentially private FEDDC, and
• an extensive set of experiments showing that FEDDC substantially improves model quality for small datasets, being able to train ResNet18 on a pneumonia dataset on as few as 8 samples per client.
Learning from small datasets is a well studied problem in machine learning. The literature offers, among others, general solutions such as using simpler models and transfer learning (Torrey and Shavlik, 2010), as well as more specialized ones such as data augmentation (Ibrahim et al., 2021) and few-shot learning (Vinyals et al., 2016; Prabhu et al., 2019). In our scenario, however, data is abundant, but the problem is that the local datasets at each site are small and cannot be pooled. Federated learning and its variants have been shown to learn from incomplete local data sources, e.g., non-iid label distributions (Li et al., 2020a) and differing feature distributions (Li et al., 2020b; Reisizadeh et al., 2020a), but were proven to fail in case of large gradient diversity (Haddadpour and Mahdavi, 2019) and label distributions that are too dissimilar (Marfoq et al., 2021).
For very small datasets, local empirical distributions may vary greatly from the global data distribution. While the difference between the empirical and the true distribution decreases exponentially with the sample size (e.g., according to the Dvoretzky-Kiefer-Wolfowitz inequality), for small sample sizes the difference can be substantial, in particular if the data distribution differs from a Normal distribution (Kwak and Kim, 2017). FedProx (Li et al., 2020a) is a variant of federated learning that is particularly suitable for tackling non-iid data distributions. It increases training stability by adding a momentum-like proximal term to the objective functions. This increase in stability, however, comes at the cost of no longer being privacy-preserving (Rahman et al., 2021). We compare FEDDC to FedProx in Section 7. We can reduce sample complexity by training networks only partially, e.g., by collaboratively training only a shared part of the model. This approach allows training client-specific models in the medical domain, but by design cannot train a global model. Kiss and Horvath (2021) propose a decentralized and communication-efficient variant of federated learning that migrates models over a decentralized network and stores incoming models locally at each client until sufficiently many models are collected on each client for an averaging step, similar to gossip federated learning (Jelasity et al., 2005). The variant without averaging is similar to simple daisy-chaining, which we compare to in Section 7. FEDDC is compatible with any aggregation operator, including the Radon point (Kamp et al., 2017) and the geometric median (Pillutla et al., 2019). It can also be straightforwardly combined with approaches to improve communication-efficiency, such as dynamic averaging (Kamp et al., 2018) and model quantization (Reisizadeh et al., 2020b).
We assume iterative learning algorithms (cf. Chp.
2.1.4 in Kamp, 2019) A : X × Y × H → H that update a model h ∈ H using a dataset D ⊂ X × Y from an input space X and output space Y, i.e., h_{t+1} = A(D, h_t). Given a set of m ∈ N clients with local datasets D^1, ..., D^m ⊂ X × Y drawn iid from a data distribution 𝒟 and a loss function ℓ : Y × Y → R, the goal is to find a single model h* ∈ H that minimizes the risk
ε(h) = E_{(x,y)∼𝒟}[ℓ(h(x), y)].   (1)
In centralized learning, the datasets are pooled as D = ⋃_{i∈[m]} D^i and A is applied to D until convergence. Note that applying A on D can be the application to any random subset, e.g., as in mini-batch training, and convergence is measured in terms of low training loss, small gradient, or small deviation from the previous iterate. In standard federated learning (McMahan et al., 2017), A is applied in parallel for b ∈ N rounds on each client locally to produce local models h^1, ..., h^m. These models are then centralized and aggregated using an aggregation operator agg : H^m → H, i.e., h = agg(h^1, ..., h^m). The aggregated model h is then redistributed to local clients, which perform another b rounds of training using h as a starting point. This is iterated until convergence of h. In the following section, we describe FEDDC.
We propose federated daisy-chaining as an extension to federated learning and hence assume a setup where we have m clients and one designated coordinator node. We provide pseudocode of our approach as Algorithm 1.
The client. Each client trains its local model in each round on local data (line 4), and sends its model to the coordinator every b rounds for aggregation, where b is the aggregation period, and every d rounds for daisy-chaining, where d is the daisy-chaining period (line 6). This re-distribution of models results in each individual model following a daisy-chain of clients, training on each local dataset. Such a daisy-chain is interrupted by each aggregation round.
The coordinator. Upon receiving models (line 10), in a daisy-chaining round (line 11) the coordinator draws a random permutation π of clients (line 12) and re-distributes the model of client i to client π(i) (line 13), while in an aggregation round (line 15), the coordinator instead aggregates all local models (line 16) and re-distributes the aggregate to all clients (line 17).
[…] where t_max is the overall number of rounds. Although inherently higher than in plain federated learning, the overall amount of communication in daisy-chained federated learning is still low. In particular, in each communication round, each client sends and receives only a single model from the coordinator. The amount of communication per communication round is thus linear in the number of clients and model size, similar to federated averaging. In the following section we show that the additional daisy-chaining rounds ensure convergence for small datasets in terms of PAC-like (ϵ, δ)-guarantees.
Next, we theoretically analyze the key properties of FEDDC in terms of PAC-like (ϵ, δ)-guarantees. For that, we make the following assumption on the learning algorithm A.
Assumption 1 ((ϵ, δ)-guarantees). The learning algorithm A applied on all datasets drawn iid from 𝒟 of size n ≥ n_0 ∈ N produces a model h ∈ H such that with probability δ ∈ (0, 1] it holds for ϵ > 0 that P(ε(h) > ϵ) < δ.
Algorithm 1 (surviving lines)
Require: daisy-chaining period d, aggregation period b, learning algorithm A, aggregation operator agg, m clients with local datasets D^1, ..., D^m
1: initialize local models h^1_0, ..., h^m_0
2: at local client i in round t
3: draw random set of samples S from local dataset D^i
4-16: […]
17: send h_t to all clients
18: end if
The sample size n_0 is a monotone function in δ and ϵ, i.e., for fixed ϵ, n_0 is monotonically increasing with δ, and for fixed δ it is monotonically decreasing with ϵ (note that typically n_0 is a polynomial in ϵ^{-1} and log(δ^{-1})). Here ε(h) is the risk defined in Equation 1.
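The client and coordinator steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses a linear least-squares model, a single gradient step as the learning algorithm A, averaging as agg, and one process that simulates both the coordinator and all clients. The round conditions (`t % b == b - 1`, `t % d == d - 1`), the choice to let aggregation take precedence when both periods coincide, and the names `local_step` and `feddc` are assumptions made for this example.

```python
import numpy as np

def local_step(w, X, y, lr=0.1):
    """One gradient step of least-squares regression (stand-in for A)."""
    grad = X.T @ (X @ w - y) / len(y)
    return w - lr * grad

def feddc(datasets, dim, rounds, d, b, rng):
    """Simulate daisy-chaining every d rounds and averaging every b rounds."""
    m = len(datasets)
    models = [np.zeros(dim) for _ in range(m)]
    for t in range(rounds):
        # Each client trains the model it currently holds on its own data.
        models = [local_step(w, X, y) for w, (X, y) in zip(models, datasets)]
        if t % b == b - 1:
            # Aggregation round (assumed to take precedence): average all
            # local models and send the aggregate back to every client.
            avg = np.mean(models, axis=0)
            models = [avg.copy() for _ in range(m)]
        elif t % d == d - 1:
            # Daisy-chaining round: draw a random permutation pi and
            # send the model of client i to client pi(i).
            pi = rng.permutation(m)
            shuffled = [None] * m
            for i in range(m):
                shuffled[pi[i]] = models[i]
            models = shuffled
    return models
```

With d = 1, every non-aggregation round moves each model to a different, randomly chosen client, so over time a model is trained on many local datasets even though no data leaves a client.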
We will show that aggregation for small local datasets can diverge and that daisy-chaining can prevent this. For this, we analyze the development of (ϵ, δ)-guarantees on model quality when aggregating local models with and without daisy-chaining. It is an open question how such an (ϵ, δ)-guarantee develops when averaging local models. Existing work analyzes convergence (Haddadpour and Mahdavi, 2019; Kamp et al., 2018) or regret (Kamp et al., 2014) and thus gives no generalization bound. Recent work on generalization bounds for federated averaging via the NTK-framework (Huang et al., 2021) is promising, but not directly compatible with daisy-chaining: the analysis of Huang et al. (2021) requires local datasets to be disjoint, which would be violated by a daisy-chaining round. Using the Radon point (Radon, 1921) as aggregation operator, however, does permit analyzing the development of (ϵ, δ)-guarantees. In particular, it was shown that for fixed ϵ the probability of bad models is reduced doubly exponentially (Kamp et al., 2017) when we aggregate models using the (iterated) Radon point (Clarkson et al., 1996). Here, a Radon point of a set of points S from a space X is, similar to the geometric median, a point in the convex hull of S with a high centrality (more precisely, a Tukey depth (Tukey, 1975; Gilad-Bachrach et al., 2004) of at least 2). For a Radon point to exist, the size of S has to be sufficiently large; the minimum size of S ⊂ X is called the Radon number of the space X, and for X ⊆ R^d the Radon number is d + 2. Let r ∈ N be the Radon number of H, A be a learning algorithm as in Assumption 1, and ε be convex. Assume m ≥ r^h clients with h ∈ N. For ϵ > 0, δ ∈ (0, 1] assume local datasets D^1, ..., D^m of size larger than n_0(ϵ, δ) drawn iid from 𝒟, and let h^1, ..., h^m be local models trained on them using A. Let 𝔯_h be the iterated Radon point with h iterations computed on the local models. Then it follows from Theorem 3 in Kamp et al.
(2017) that for all i ∈ [m] it holds that
P(ε(𝔯_h) > ϵ) ≤ (r · P(ε(h^i) > ϵ))^{2^h},   (2)
where 𝔯_h denotes the iterated Radon point and the probability is over the random draws of local datasets. This implies that the iterated Radon point only improves over the local models if δ < r^{-1}. Consequently, local models need to achieve a minimum quality for the federated learning system to converge.
Corollary 2. Given a model space H with Radon number r ∈ N, convex risk ε, and a learning algorithm A with sample size n_0(ϵ, δ). Given ϵ > 0 and any h ∈ N, if local datasets D^1, ..., D^m with m ≥ r^h are smaller than n_0(ϵ, r^{-1}), then federated learning using the Radon point does not improve model quality in terms of (ϵ, δ)-guarantees.
In other words, when using aggregation by Radon points alone, an improvement in terms of (ϵ, δ)-guarantees is strongly dependent on large enough local datasets. Furthermore, given δ > r^{-1}, the guarantee can become arbitrarily bad by increasing the number of aggregation rounds. Federated daisy-chaining as given in Alg. 1 permutes local models at random, which is in theory equivalent to permuting local datasets. This way, the amount of data visible to each model is increased. Since the permutation is drawn at random, the minimum number of distinct local samples observed by each model can be given with high probability. […]
Proof. For m clients with m local datasets, the chance of a client i not seeing dataset j after τ permutations is ((m − 1)/m)^τ. The probability that each of the m clients is not seeing m − k + 1 other datasets is hence […], and corresponds to the probability of each client seeing fewer than k distinct other datasets. The probability of all clients seeing at least k distinct datasets is hence at least […]. Taking the logarithm on both sides with base (m − 1)/m < 1 yields […]. Multiplying with m − k + 1 and observing that τ daisy-chaining rounds with period d require T = τd total rounds yields the result.
From Lm.
3 it follows that when we perform daisy-chaining with m clients and local datasets of size n for at least d · ln δ · ((ln(m − 1) − ln(m)) (m − k + 1) m)^{-1} rounds, each local model will with probability at least 1 − δ be trained on at least kn samples.
Proposition 4. Given a model space H with Radon number r ∈ N, convex risk ε, and a learning algorithm A with sample size n_0(ϵ, δ). Given ϵ > 0, δ ∈ (0, r^{-1}) and any h ∈ N, if local datasets D^1, ..., D^m of size n ∈ N with m ≥ r^h, then Alg. 1 using the Radon […]
Proof. The number of daisy-chaining rounds before computing a Radon point ensures that with probability 1 − δ all local models are trained on at least kn samples with k = n_0(ϵ, δ)/n, i.e., each model is trained on at least n_0(ϵ, δ) samples and thus an (ϵ, δ)-guarantee holds for each model. Since δ < r^{-1}, this guarantee is improved as detailed in Eq. (2).
To support this theoretical result, we compare FEDDC using the iterated Radon point with standard federated learning on the SUSY binary classification dataset (Baldi et al., 2014), training a linear model on 441 clients with only 2 samples per client. The results in Figure 1 show that after 500 rounds FEDDC reached the test accuracy of a model that has been trained on the centralized dataset (ACC=0.77), beating federated learning by a large margin (ACC=0.65). Before further investigating FEDDC empirically in Section 7, we discuss the privacy aspects of FEDDC in the following section.
A major benefit of federated learning is that data remains undisclosed on the local clients and only model parameters are exchanged. It is, however, possible to draw inferences about local data from model parameters. In classical federated learning there are two types of attacks that would allow such inference: (i) an attacker intercepting the communication of a client with the coordinator, obtaining model updates to infer upon the client's data, and (ii) a malicious coordinator obtaining models to infer upon the data of each client.
A malicious client cannot learn about other clients' data, since it only obtains the average of all local models. In federated daisy-chaining there is a third possible attack: (iii) a malicious client obtaining model updates from another client to infer upon its data. In the following, we discuss potential defenses against these three types of attacks in more detail. Note that we limit the discussion to attacks that aim at inferring upon local data, thus breaching data privacy. For a discussion of attacks that aim to poison the learning process (Bhagoji et al., 2019) or create backdoors for adversarial examples, we refer to Lyu et al. (2020).
A general and widespread approach to tackle all three possible attack types is to add noise to the model parameters before sending. Using appropriate clipping and noise, this guarantees (ϵ, δ)-differential privacy for local data at the cost of a slight-to-moderate loss in model quality.
Another approach to tackle an attack on communication (i) is to use encrypted communication. One can also protect against a malicious coordinator (ii) by using homomorphic encryption that allows the coordinator to average models without decrypting them (Zhang et al., 2020). This, however, only works for particular aggregation operators and does not allow performing daisy-chaining. Secure daisy-chaining in the presence of a malicious coordinator (ii) can, however, be performed using asymmetric encryption. Assume each client creates a public-private key pair and shares the public key with the coordinator. To prevent a malicious coordinator from sending clients its own public key and acting as a man in the middle, public keys have to be announced (e.g., by broadcast). While this allows sending clients to identify the recipient of their model, no receiving client can identify the sender. Thus, inference on the origin of a model remains impossible.
For a daisy-chaining round the coordinator sends the public key of the receiving client to the sending client, the sending client checks the validity of the key and sends an encrypted model to the coordinator which forwards it to the receiving client. Since only the receiving client can decrypt the model, the communication is secure. In standard federated learning, a malicious client cannot infer upon the data of other clients from model updates, since it only receives the average model. In federated daisy-chaining, it receives the model from a random, unknown client in each daisy-chaining round. Now, the malicious client can infer upon the membership of a particular data point in the local dataset of the client the model originated from, i.e., a membership inference attack (Shokri et al., 2017) . Similarly, the malicious client can infer upon the presence of data points with certain attributes in the dataset (Ateniese et al., 2015) . The malicious client, however, does not know the client the model was trained on, i.e., it does not know the origin of the dataset. Using a random scheduling of daisy-chaining and averaging rounds at the coordinator, the malicious client cannot even distinguish between a model from another client or the average of all models. Nonetheless, daisy-chaining opens up new potential attack vectors (e.g., by clustering received models to potentially determine their origins). These potential attack vectors can be tackled by adding noise to model parameters as discussed above, since "[d]ifferentially private models are, by construction, secure against membership inference attacks" (Shokri et al., 2017) . To investigate the impact of this privacy technique on FEDDC, we apply it in practice: We train a small ResNet on 250 clients using FEDDC with d = 2 and b = 10. Details on the experimental setup can be found in Supp. A.1,A.2. Differential privacy is achieved by clipping local model updates and adding Gaussian noise as proposed by Geyer et al. 
(2017). The results shown in Figure 2 indicate that the standard trade-off between model quality and privacy holds for FEDDC as well. Moreover, for mild privacy settings the model quality does not decrease. That is, FEDDC is able to robustly predict even under differential privacy.
We evaluate FEDDC against the state of the art in federated learning on synthetic and real-world data. In particular, we compare to standard federated averaging (FedAvg) (McMahan et al., 2017), FedAvg with the same amount of communication as FEDDC, FedProx (Li et al., 2020a), and simple daisy-chaining without aggregation. As real-world applications we consider the image classification problem CIFAR10 (Krizhevsky, 2009), publicly available MRI scans for brain tumors, and chest X-rays for pneumonia (e.g., from COVID-19). For reproducibility, we provide details on architectures and experimental setup in Supp. A.1, A.2. The implementation of the experiments is publicly available at https://anonymous.4open.science/r/FedDC-1BC9.
We first investigate the potential of FEDDC on a synthetic binary classification dataset generated by the sklearn (Pedregosa et al., 2011) make_classification function with 100 features. On this dataset, we train a simple MLP with 3 hidden layers on m = 50 clients with n = 10 samples per client. We compare FEDDC with d = 1 and b = 200 to FedAvg with b = 200. The results presented in Figure 3 show that FEDDC achieves an optimal test performance of 0.89 (centralized training on all data achieves a test accuracy of 0.88), substantially outperforming FedAvg. The results indicate that the main reason is overfitting of local clients, since for FedAvg train accuracy reaches 1.0 quickly after each averaging step. In the following, we investigate how these promising results translate to real-world datasets.
To compare FEDDC with the state of the art on real-world data, we first consider the CIFAR10 image benchmark.
To find a suitable aggregation period b for FEDDC and FedAvg, we first run a grid search across periods for 250 clients with small versions of ResNet (details in Supp. A.2). We report the results in Figure 4 and set the period for FEDDC to 10, and consider federated averaging with periods of both 1 and 10. For our next experiment, we equip 150 clients each with a ResNet18. To simulate our setting in which each client has only a small number of samples, each client receives only 64 samples. Note that the combined number of examples is only one fifth of the original training data, hence we cannot expect the typical performance on this dataset. As neural networks are non-convex, Radon points are no longer suitable as aggregation method, so we instead resort to averaging. Results are reported in Table 1. We observe that FEDDC achieves an accuracy more than 6 percentage points higher than federated averaging with the same amount of communication. Looking closer, we see that FedAvg drastically overfits, achieving training accuracies of 0.97, a trend similar to the one reported in Figure 3 for synthetic data. We further see that daisy-chaining alone, besides its privacy issues, performs worse than FEDDC. Similarly, FedProx run with b = 10 and µ = 0.1 only achieves an accuracy of 0.545.
Empirical evaluation shows that FEDDC drastically improves upon state-of-the-art methods for federated learning in settings with only small amounts of available data. This confirms the theoretical potential, given by the (ϵ, δ)-guarantees, of improving model quality, which is unique among federated learning methods. Using the iterated Radon point as aggregation method, and given as few as 2 samples per client, FEDDC matches the test accuracy of a model trained on the whole SUSY dataset, outperforming standard federated learning by over 12 percentage points of accuracy.
This result shows that, unlike standard federated learning, FEDDC does not heavily overfit and is able to learn a model that generalizes; it is consistent with the results on a synthetic prediction task using multi-layer perceptrons. To study FEDDC in the context of real data, we consider both the standard image benchmark data CIFAR10 and two challenging image classification tasks from the health domain where only little data is available. On each of these tasks, FEDDC consistently outperforms state-of-the-art federated learning methods. Similar to before, we observe overfitting of standard federated learning methods. To rule out any effects due to increased communication, we also considered FedAvg with the same amount of communication as our method; FedAvg, however, shows no improvement. Through FEDDC, we present an effective solution to the problem of federated learning on small datasets. We further show that our method is able to robustly predict even under the effect of differential privacy, and suggest effective measures based on encryption as mitigations against attacks on communication or malicious coordinators.
We considered the problem of learning high quality models in settings where data is inherently distributed across sites, data cannot be shared between sites, and each site only has very little data available. We propose an elegant, surprisingly simple approach that effectively solves this problem by combining the idea of model aggregation from federated learning with the concept of passing individual models around, while still maintaining privacy. We showed that this approach theoretically improves models in terms of (ϵ, δ)-guarantees, which state-of-the-art federated averaging cannot provide. In extensive empirical evaluations, including challenging image classification tasks from the health domain, we further show that for settings with limited data available per site, our method improves upon existing work by a wide margin.
It thus paves the way for learning high quality models from small datasets. Although the amount of communication is not a critical issue for the settings in which we intend FEDDC to be used, improving its communication efficiency makes for engaging future work and would enable its use in settings with limited bandwidth, e.g., model training on mobile devices. Both from a practical and from a security and privacy perspective, it would also be interesting to study how to formulate FEDDC in a decentralized setting, when no coordinator is available.
A.1 NETWORK ARCHITECTURES
Here, we detail the network architectures considered in our empirical evaluation. All code is publicly available to ensure reproducibility.
MLP for Synthetic Data. A standard MLP with ReLU activations and three linear layers of size 100, 50, 20.
Averaging round experiment. For this set of experiments we use smaller versions of ResNet architectures with 3 blocks, where the blocks use 16, 32, 64 filters, respectively. In essence, these are smaller versions of the original ResNet18 to keep training of 250 networks feasible.
CIFAR10 & Pneumonia. For CIFAR10, we consider a standard ResNet18 architecture, where weights are initialized by a Kaiming Normal and biases are zero-initialized. Each client constructs and initializes a ResNet network separately. For pneumonia, X-ray images are resized to be of size (224, 224).
MRI. For the MRI scan data, we train a small convolutional network of type Conv(32)-Batchnorm-ReLU-MaxPool-Conv(64)-Batchnorm-ReLU-MaxPool-Linear, where Conv(x) are convolutional layers with x filters of kernel size 3. The pooling layer uses a stride of 2 and kernel size of 2. The Linear layer is of size 2, matching the number of output classes. All scan images are resized to be of size (150, 150).
In this section, we give additional information for the training setup for individual experiments in our empirical evaluation.
SUSY experiments. SUSY is a binary classification dataset with 18 features. We train linear models with stochastic gradient descent (learning rate 0.0001, found by grid search on an independent part of the dataset) on 441 clients. We aggregate every 50 rounds. Aggregation is performed via the iterated Radon point (Kamp et al., 2017) with h = 2 iterations. FEDDC performs daisy-chaining with period d = 1. The test accuracy is evaluated on a test set with 1 000 000 samples drawn iid at random.
The synthetic binary classification dataset is generated by the sklearn (Pedregosa et al., 2011) make_classification […]
Averaging rounds parameter optimization. To find a suitable aggregation period, we explore b ∈ {1, 10, 20, 50, 100, 200, 500, ∞} on CIFAR10 using 250 clients each equipped with a small ResNet. We assign 64 samples to each client drawn at random (without replacement) from the CIFAR10 training data and use a batch size of 64. For each parameter, we train for 10k rounds with SGD using cross entropy loss and an initial learning rate of 0.1, multiplying the rate by a factor of 0.5 every 2500 rounds.
CIFAR10 differential privacy and main experiments. We keep the same experimental setup as for the hyperparameter search, but now use 100 clients each equipped with a ResNet18.
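The data assignment and learning-rate schedule used in the averaging-period search can be written down compactly. The following sketch is illustrative only: the helper names `partition_clients` and `learning_rate` are hypothetical, and it reads "without replacement" as giving clients disjoint samples, which is an interpretation rather than something the text spells out. The schedule follows the stated values (initial rate 0.1, multiplied by 0.5 every 2500 rounds).

```python
import numpy as np

def partition_clients(num_samples, num_clients, samples_per_client, seed=0):
    """Draw disjoint index sets, one row of indices per client."""
    rng = np.random.default_rng(seed)
    needed = num_clients * samples_per_client
    if needed > num_samples:
        raise ValueError("not enough samples for a disjoint split")
    # Sampling without replacement guarantees that no index is reused.
    idx = rng.choice(num_samples, size=needed, replace=False)
    return idx.reshape(num_clients, samples_per_client)

def learning_rate(round_idx, initial=0.1, factor=0.5, step=2500):
    """Step schedule: multiply the rate by `factor` every `step` rounds."""
    return initial * factor ** (round_idx // step)
```

For the CIFAR10 training set of 50 000 images, `partition_clients(50000, 250, 64)` assigns 16 000 distinct images in total, and over 10k rounds the rate is halved three times.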
For the differential privacy experiment
Hacking smart machines with smarter ones: How to extract meaningful data from machine learning classifiers
Searching for exotic particles in high-energy physics with deep learning
Analyzing federated learning through an adversarial lens
Approximating center points with iterative radon points
Differentially private federated learning: A client level perspective
Bayes and tukey meet at the center point
Hyperpolarized mri of human prostate cancer reveals increased lactate with tumor grade driven by monocarboxylate transporter 1
On the convergence of local descent methods in federated learning
Fl-ntk: A neural tangent kernel-based framework for federated learning analysis
Augmentation in healthcare: Augmented biosignal using deep learning and tensor representation
Gossip-based aggregation in large dynamic networks
Black-Box Parallelization for Machine Learning
Communication-efficient distributed online prediction by dynamic model synchronization
Effective parallelisation for machine learning
Efficient decentralized deep learning by dynamic model averaging
Few-shot object detection via feature reweighting
Migrating models: A decentralized view on federated learning
Learning multiple layers of features from tiny images
Central limit theorem: the cornerstone of modern statistics
Federated optimization in heterogeneous networks
Fedbn: Federated learning on non-iid features via local batch normalization
A survey of text data augmentation
Threats to Federated Learning
On safeguarding privacy and security in the framework of federated learning
Federated multi-task learning under a mixture of distributions
Communication-efficient learning of deep networks from decentralized data
The angiosarcoma project: enabling genomic and clinical discoveries in a rare cancer through patient-partnered research
Scikit-learn: Machine learning in Python
Robust aggregation for federated learning
Manish Chaplain, David Sontag, and Xavier Amatriain. Few-shot learning for dermatological disease diagnosis
Mengen konvexer Körper, die einen gemeinsamen Punkt enthalten
Md Saddam Hossain Mukta, and AKM Najmul Islam. Challenges, applications and design aspects of federated learning: A survey
Robust federated learning: The case of affine distribution shifts
Fedpaq: A communication-efficient federated learning method with periodic averaging and quantization
The future of digital health with federated learning
Membership inference attacks against machine learning models
Comprehensive integrative profiling of upper tract urothelial carcinomas
Can you really backdoor federated learning
Transfer learning
Mathematics and picturing data
Matching networks for one shot learning
Federated learning with matched averaging
Federated learning with differential privacy: Algorithms and performance analysis
Flop: Federated learning on medical datasets using partial networks
Batchcrypt: Efficient homomorphic encryption for cross-silo federated learning