key: cord-0131430-wjbtogjz authors: Zhang, Zhe; Ma, Shiyao; Nie, Jiangtian; Wu, Yi; Yan, Qiang; Xu, Xiaoke; Niyato, Dusit title: Semi-Supervised Federated Learning with non-IID Data: Algorithm and System Design date: 2021-10-26 journal: nan DOI: nan sha: 3796a5a3d98006311767ea8c1be87b36bad25d07 doc_id: 131430 cord_uid: wjbtogjz

Federated Learning (FL) allows edge devices (or clients) to keep data locally while simultaneously training a shared high-quality global model. However, current research is generally based on the assumption that the training data of local clients have ground-truth labels. Furthermore, FL faces the challenge of statistical heterogeneity, i.e., the distribution of a client's local training data is non-independent and identically distributed (non-IID). In this paper, we present a robust semi-supervised FL system design, where the system aims to solve the problems of data availability and non-IID data in FL. In particular, this paper focuses on the labels-at-server scenario, where there is only a limited amount of labeled data on the server and only unlabeled data on the clients. In our system design, we propose a novel method to tackle these problems, which we refer to as Federated Mixing (FedMix). FedMix improves the naive combination of FL and semi-supervised learning methods and designs parameter decomposition strategies for disjoint learning on labeled data, unlabeled data, and the global model. To alleviate the non-IID problem, we propose a novel aggregation rule based on the frequency of a client's participation in training, namely the FedFreq aggregation algorithm, which adjusts the weight of the corresponding local model according to this frequency. Extensive evaluations conducted on the CIFAR-10 dataset show that the performance of our proposed method is significantly better than that of the current baseline. It is worth noting that our system is robust to different non-IID levels of client data.

Federated Learning (FL) [1], [2] is a distributed machine learning paradigm that allows multiple edge devices (or clients) to cooperatively train a shared global model [3]-[5]. The most obvious difference between FL and traditional distributed machine learning is that clients can privately access local training data without sharing data with cloud centers [6]-[8]. However, the current mainstream work is based on an unrealistic assumption: that the training data of local clients have ground-truth labels [9]. In our daily lives, it is not common for each client to have rich labeled data. For example, in the early stage of the COVID-19 epidemic, community hospitals without enough labeled data may not have been able to train a high-precision pathophoresis prediction model. On the other hand, in most cases, putting together a properly labeled dataset for a given FL task is a time-consuming, expensive, and complicated endeavor [9]. Therefore, it is challenging to train a high-quality global model in real scenarios that lack labeled data. In the face of the above challenges, recent works [9]-[14] study how to design a semi-supervised FL (SSFL) system that can efficiently integrate semi-supervised learning into FL techniques. For example, Jeong et al. [12] proposed an SSFL system with a new inter-client consistency loss to achieve this goal. In fact, the consistency regularization technique is widely used in semi-supervised learning; it enforces that a model produces the same output for the same data under two different injected noises [15], [16].
Furthermore, pseudo-label methods are important for SSFL: they mainly utilize pseudo-labels whose predicted value is higher than a confidence threshold to achieve high-precision SSFL [9], [10], [17]. However, it is worth noting that there still remain gaps when deploying SSFL in practice. First, traditional SSFL methods generally introduce semi-supervised techniques (such as consistency loss and pseudo-labels) directly into the FL system, which ignores the implicit contribution between iterative updates of the global model. Previous work only focused on how to set pseudo-labels or how to decompose the parameters of labeled and unlabeled data for disjoint learning. In this way, the learned global model will be biased towards the labeled data (supervised model) or the unlabeled data (unsupervised model) instead of the global model [12]. This implies that we need to observe the implicit effects between iterations of the global model at a fine-grained level. Second, the non-independent and identically distributed (non-IID) nature of data across clients has always been a key and challenging issue in FL. The reason is that there are too many differences in data distribution, features, and the number of labels between clients, which is not conducive to the convergence of the global model. Currently, many efforts have effectively alleviated the non-IID problem; for example, FedBN [18] utilizes local batch normalization to alleviate the feature shift before averaging and aggregating local models. However, such methods add additional computational and communication overhead to the server or clients.

In this paper, to address the first issue, we propose the Federated Mixing (FedMix) algorithm, which performs parameter decomposition for disjoint learning of the supervised model (learned on labeled data), the unsupervised model (learned on unlabeled data), and the global model. In particular, this algorithm analyzes the implicit effects between iterations of the global model in a fine-grained manner. To address the second issue, we propose a novel aggregation rule called Federated Frequency (FedFreq), which dynamically adjusts the weight of the corresponding local model by recording the training frequency of the client to alleviate the non-IID problem. Furthermore, we introduce the Dirichlet distribution function to simulate scenarios with different non-IID levels in our experiments. The main contributions of this paper are as follows:

• We present a robust semi-supervised federated learning system design, where the system aims to solve the problems of data availability and non-IID data in FL. In our system design, we propose the FedMix algorithm to improve the naive combination of FL and semi-supervised learning methods.
• We propose a novel aggregation rule called FedFreq, which dynamically adjusts the weight of the corresponding local model by recording the training frequency of the client to alleviate the non-IID problem.
• We conduct extensive evaluations on the CIFAR-10 dataset, which show that the performance of our designed system is 3% higher than that of the baseline.

Semi-supervised federated learning attempts to use semi-supervised learning techniques [19]-[23] to further improve the performance of the FL model in scenarios where there is unlabeled data on the client side [11]. For example, Long et al. [10] proposed a semi-supervised federated learning (SSFL) system, FedSemi, which unifies the consistency-based semi-supervised learning model [24], the dual model [15], and the mean teacher model [25] to achieve SSFL.
The DS-FL system [26] was proposed to solve the communication overhead problem in SSFL. Reference [27] proposes a method to study the distribution of non-IID data, which introduces a probability distance metric to evaluate the difference in client data distributions in SSFL. Different from the literature [9], [10], [26], in this paper we focus on the labels-at-server scenario and also solve the problems of data availability and data heterogeneity in SSFL.

If the local dataset distributions of the clients are inconsistent (i.e., the non-IID problem) [18], [28]-[33], the local objective loss function of each client will be inconsistent with the global objective [34]. In particular, when local model updates are larger, this divergence becomes more pronounced. Therefore, we need to design robust FL systems to solve the above problems. Some studies try to design a robust federated learning algorithm to solve the non-IID problem. For example, FedProx [6] limits the distance between the local model and the global model by introducing an additional L2 regularization term in the local objective function, which limits the size of local model updates. However, this method has the disadvantage that each client needs to individually tune the local regularization term to obtain good model performance. FedNova [35] improves FedAvg in the aggregation phase by normalizing and scaling the model updates according to the clients' local training batches. Although previous studies have alleviated the non-IID problem to some extent, they only evaluated data distributions at specific non-IID levels and lack extensive experimental verification for different non-IID scenarios. Therefore, we propose a more comprehensive data distribution and data partition strategy, i.e., we introduce the Dirichlet distribution function to simulate different non-IID levels of client data.

Federated learning solves the problem of data islands under the premise of privacy protection. In particular, FL is a distributed machine learning framework that requires clients to keep data locally, where these clients coordinate to train a shared global model ω*. In FL, there is a server S and K clients, each of which holds an IID or non-IID dataset D_k. Specifically, for a training sample x on the client side, let ℓ(ω; x) be the loss function at the client, where ω ∈ R^d denotes the model's trainable parameters. Therefore, we let L(ω) = E_{x∼D}[ℓ(ω; x)] be the loss function at the server. Thus, FL needs to optimize the following objective function at the server:

min_{ω ∈ R^d} L(ω) = Σ_{k=1}^{K} p_k · L_k(ω),

where L_k(ω) = E_{x∼D_k}[ℓ(ω; x)] is the local loss of the k-th client, and p_k ≥ 0 with Σ_k p_k = 1 indicates the relative influence of the k-th client on the global model. In FL, to minimize the above objective function, the server and clients execute the following steps:

• Step 1, Initialization: The server sends the initialized global model ω_0 to the selected clients.
• Step 2, Local training: Each client uses a local optimizer (e.g., SGD, Adam) on its local dataset D_k to train the received initialization model. Then, each client uploads its local model ω_t^k to the server.
• Step 3, Aggregation: The server collects and uses a certain algorithm (e.g., FedAvg [1]) to aggregate the model updates uploaded by these clients and obtain a new global model ω_{t+1} (a minimal aggregation sketch follows below). Then, the server sends the updated global model ω_{t+1} to all selected clients.

Note that FL repeats the above steps until the global model converges.
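As a concrete illustration of the aggregation in Step 3, the sketch below averages the uploaded client models weighted by local dataset size, in the style of FedAvg [1]. It is a minimal sketch under the assumption that each model is represented as a dict of NumPy parameter arrays; the function and variable names are illustrative and not taken from the paper.

    import numpy as np

    def fedavg_aggregate(client_models, client_sizes):
        """Weighted average of client models, in the style of FedAvg (Step 3).

        client_models: list of dicts mapping parameter name -> np.ndarray.
        client_sizes:  list of local dataset sizes |D_k|; they induce the weights p_k.
        """
        total = float(sum(client_sizes))
        p = [s / total for s in client_sizes]          # p_k >= 0 and sum_k p_k = 1
        new_global = {}
        for name in client_models[0]:
            new_global[name] = sum(w * m[name] for w, m in zip(p, client_models))
        return new_global

At the end of a round, the server would call such a routine on the collected updates and broadcast the returned model as ω_{t+1} to the clients selected for the next round.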
In the real world (e.g., in the financial and medical fields), unlabeled data is easy to obtain, while labeled data is often difficult to acquire. Meanwhile, annotating data requires a lot of manpower and material resources. To this end, researchers proposed a machine learning paradigm, namely semi-supervised learning [36], [37], which learns a high-precision model on a mixed dataset (part of the data is labeled and the rest is unlabeled). Thus, in recent years, semi-supervised learning has become a hot research direction in the field of deep learning. In this section, we introduce a basic assumption and two methods of semi-supervised learning.

Assumption 1: In machine learning, there is a basic assumption that if the features of two unlabeled samples u_1 and u_2 are similar, the corresponding model predictions y_1 and y_2 should be the same [38]. According to the above assumption, we adopt two common semi-supervised learning methods as follows:

Consistency Regularization: The main idea of this method is that the model's predictions should be the same whether or not noise is added to an unlabeled training sample [15], [36]. We generally use data augmentation methods (such as image flipping and shifting) to add noise and increase the diversity of the dataset. Specifically, for an unlabeled sample u_i in the unlabeled dataset, we apply data augmentation π(·) to obtain a perturbed version π(u_i). Thus, we can calculate the consistency loss as follows:

L_con = (1/m) · Σ_{i=1}^{m} || f_θ(u_i) − f_θ(π(u_i)) ||^2,

where m is the total number of unlabeled samples and f_θ(u_i) indicates the model output of the unlabeled sample u_i.

Pseudo-label: The pseudo-label method [24] utilizes some labeled samples to train a model that then sets pseudo-labels for unlabeled samples. Previous work generally used sharpening [39] and argmax [24] methods to set pseudo-labels, where the former makes the model's output distribution on unlabeled samples more extreme and the latter changes the model output on unlabeled samples to a one-hot vector.

In the SSFL system, there are two essential scenarios of SSFL based on the location of the labeled data. The first scenario considers a conventional case where clients have both labeled and unlabeled data (labels-at-client), and the second scenario considers a more challenging case where the labeled data is only available at the server (labels-at-server). In particular, in this paper we consider only the labels-at-server scenario. Next, we give the definition of the problem studied in this paper as follows:

Labels-at-server Scenario: In SSFL, we assume that there is a server S and K clients, where the server holds a labeled dataset D_s = {(x_i, y_i)}_{i=1}^{n} and each client holds a local unlabeled dataset D_k = {u_i}_{i=1}^{m}. Thus, in this scenario, for an unlabeled training sample u_i, let L_u^k be the loss function at the client side:

L_u^k = (1/m) · Σ_{i=1}^{m} CE( ŷ_i, f_{θ_k}(π(u_i)) ),

where CE(·,·) is the cross-entropy loss, m is the number of unlabeled samples, π(·) is the data augmentation function (e.g., flip and shift of the unlabeled samples), ŷ_i is the pseudo label of the unlabeled sample u_i, and f_{θ_k}(·) indicates the output of the k-th client's model θ_k on an unlabeled sample. For a labeled sample x_i, let L_s be the loss function at the server side:

L_s = (1/n) · Σ_{i=1}^{n} CE( y_i, f_θ(x_i) ),

where n is the number of labeled samples and f_θ(x_i) indicates the output of the labeled sample x_i on the model θ. Therefore, the objective function of this scenario in the SSFL system is to minimize the following loss function:

min_θ L_s + Σ_{k=1}^{K} L_u^k.

Note that the whole learning process is similar to the traditional FL system, except that the server not only aggregates the client model parameters but also trains the model with labeled data.
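The consistency-regularization and pseudo-label ingredients described above can be made concrete with a short sketch: a consistency loss that compares predictions on an unlabeled batch with predictions on an augmented view, and argmax pseudo-labels kept only above a confidence threshold. Here `model` and `augment` are assumed callables and the mean-squared form of the consistency loss is one common choice; neither is prescribed by the paper.

    import numpy as np

    def consistency_loss(model, u, augment):
        """Mean squared difference between predictions on u and on an augmented view of u."""
        p_clean = model(u)              # (m, num_classes) softmax outputs
        p_noisy = model(augment(u))
        return np.mean((p_clean - p_noisy) ** 2)

    def argmax_pseudo_labels(probs, threshold=0.80):
        """One-hot pseudo-labels for predictions whose maximum probability exceeds the threshold.

        probs: (m, num_classes) array of softmax outputs on unlabeled samples.
        Returns (labels, mask), where mask marks the confident samples that are kept.
        """
        confidence = probs.max(axis=1)
        labels = np.eye(probs.shape[1])[probs.argmax(axis=1)]   # one-hot via argmax
        mask = confidence >= threshold
        return labels, mask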
In our system setting, the server S holds a labeled dataset D_s = {(x_i, y_i)}_{i=1}^{n}, where n indicates the number of labeled samples. For K clients, we assume that the k-th client holds a local unlabeled dataset D_k = {u_i}_{i=1}^{m}, where m denotes the number of unlabeled samples on the local client. Similar to the traditional FL system, the server and clients in SSFL cooperate to train a high-performance global model ω*. The goal of previous work is to optimize the objective function mentioned above, i.e., the overall SSFL objective. However, such work ignores the implicit contribution between iterations of the global model, which results in the learned global model not being optimal. Inspired by the above facts, we propose an SSFL algorithm called FedMix that focuses on the implicit contributions between iterations of the global model in a fine-grained manner. We define the supervised model trained on the labeled dataset as σ, the unsupervised model trained on the unlabeled dataset as ψ, and the aggregated global model as ω. Specifically, we design a strategy that assigns three weights α, β, and γ to the unsupervised model ψ, the supervised model σ, and the previous round's global model, respectively. The designed algorithm can capture the implicit relationship between each iteration of the global model in a fine-grained manner. Thus, the steps of our proposed FedMix algorithm are as follows:

• Step 1, Initialization: The server randomly selects a certain proportion F (0 < F < 1) of clients from all local clients and sends them the initialized global model ω_0. Note that the global model ω_0 also remains on the server side.

• Step 2, Server Training: Unlike FL, in our SSFL system the server not only aggregates the models uploaded by the clients but also trains the supervised model σ (i.e., σ_t ← ω_t) on the labeled dataset D_s. Thus, the server uses its local optimizer on the labeled dataset D_s to train the supervised model σ. The minimization of the objective function is defined as follows:

min_{σ_t} λ_s · Σ_{(x, y) ∈ D_s} CE( y, f_{σ_t}(x) ),

where λ_s is a hyperparameter, x and y are from the labeled dataset D_s, and f_{σ_t}(x) is the output of the labeled sample on the supervised model σ at the t-th training round.

• Step 3, Local Training: The k-th client utilizes its local unlabeled data to train the received global model ω_t (i.e., ψ_t^k ← ω_t) and then obtains the unsupervised model ψ_{t+1}^k. Thus, we define the following objective function:

min_{ψ_t^k} Σ_{u ∈ D_k} [ λ_1 · CE( ŷ, f_{ψ_t^k}(π_1(u)) ) + λ_2 · CE( ŷ, f_{ψ_t^k}(π_2(u)) ) ] + λ_L1 · ||σ_t − ψ_t^k||_2,

where λ_1, λ_2, and λ_L1 are hyperparameters that control the ratio between the loss terms, ψ_t^k is the unsupervised model of the k-th client at the t-th training round, u is from the unlabeled dataset D_k, π(·) is the form of perturbation, i.e., π_1 is the shift augmentation and π_2 is the flip augmentation, ||σ_t − ψ_t^k||_2 is a penalty term that lets the k-th client's unsupervised model ψ_t^k learn the knowledge of the supervised model σ_t (note that σ_t is obtained from the server-side objective in Step 2), and ŷ is the pseudo label obtained by using our proposed argmax method (a code sketch of this local objective is given after Algorithm 1 below). The argmax method is defined as follows:

ŷ = 1( Max( (1/A) · Σ_{a=1}^{A} f_{ψ_t^k}(π_a(u)) ) ),

where Max(·) is a function that outputs the maximum probability that the unlabeled data belongs to a certain class, 1(·) is the one-hot function that changes that numerical value to 1, A represents the number of unlabeled data after data augmentation, and u is from the unlabeled dataset D_k. Specifically, we discard low-confidence predictions below the confidence threshold τ = 0.80 when generating pseudo-labels.

• Step 4, Aggregation: The server uses the proposed FedFreq (see Section IV-B) aggregation algorithm to aggregate the unsupervised models uploaded by the clients and obtain the global unsupervised model, i.e.,

ψ_{t+1} = Σ_{k=1}^{K} w_{t+1}^k · ψ_{t+1}^k,

where ψ_{t+1}^k is the unsupervised model of the k-th client at the (t+1)-th training round and w_{t+1}^k is the weight of the k-th client. The server then aggregates the global unsupervised model ψ_{t+1}, the supervised model σ_{t+1}, and the global model ω_t from the previous round t to obtain a new global model ω_{t+1}:

ω_{t+1} = α · ψ_{t+1} + β · σ_{t+1} + γ · ω_t,

where α, β, and γ are the corresponding weights of the three models, with α + β + γ = 1 and α, β, γ ≥ 0.

Repeat all the above steps until the global model converges. The proposed FedMix algorithm is shown in Algorithm 1.

Algorithm 1: FedMix algorithm on the labels-at-server scenario.
Input: the client set K; the mini-batch size B_server and the number of epochs E_server at the server side; the local mini-batch size B_client and the number of local epochs E_client at the client side; and the learning rate η.
  for each training round t do
    σ_t ← ω_t
    for each server epoch e from 1 to E_server do
      for each mini-batch b ∈ B_server do
        update the supervised model σ_t on b with learning rate η (server training, Step 2)
      end for
    end for
    m ← max(F · K, 1)
    S_t ← randomly select m clients from the client set K
    for each client k ∈ S_t in parallel do
      ψ_t^k ← ω_t
      ψ_{t+1}^k ← ClientUpdate(k, ψ_t^k)
    end for
    aggregate ψ_{t+1} and compute ω_{t+1} = α·ψ_{t+1} + β·σ_{t+1} + γ·ω_t (aggregation, Step 4)
  end for
  return ω* to the server
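To make the local-training step concrete, the sketch below evaluates the client objective of Step 3 on one unlabeled batch: pseudo-labels come from the one-hot argmax of predictions averaged over A augmented views and are kept only above the confidence threshold τ, and the loss combines cross-entropy on two perturbed views with a penalty pulling the unsupervised model toward the supervised model. The composition of the terms and the helper names are an illustrative reading of the description above, not the authors' exact implementation.

    import numpy as np

    def fedmix_client_loss(psi, sigma, u_batch, model_fn, augmentations,
                           lam1=1.0, lam2=1.0, lam_prox=1e-2, tau=0.80, A=5):
        """Client-side FedMix-style loss on one unlabeled batch (illustrative sketch).

        psi, sigma   : flat parameter vectors of the unsupervised / supervised models.
        model_fn     : model_fn(params, x) -> softmax probabilities of shape (m, C).
        augmentations: list of augmentation callables; the first two play the roles
                       of pi_1 (shift) and pi_2 (flip) in the text.
        """
        # Pseudo-labels: one-hot argmax of predictions averaged over A augmented views,
        # kept only when the maximum averaged probability exceeds the threshold tau.
        # The available augmentations are cycled to build the A views.
        avg_probs = np.mean(
            [model_fn(psi, augmentations[a % len(augmentations)](u_batch)) for a in range(A)],
            axis=0)
        y_hat = np.eye(avg_probs.shape[1])[avg_probs.argmax(axis=1)]
        mask = avg_probs.max(axis=1) >= tau

        def masked_ce(labels, probs):
            # Cross-entropy averaged over the batch, zeroed for low-confidence samples.
            per_sample = -np.sum(labels * np.log(probs + 1e-12), axis=1)
            return np.mean(per_sample * mask)

        return (lam1 * masked_ce(y_hat, model_fn(psi, augmentations[0](u_batch)))
                + lam2 * masked_ce(y_hat, model_fn(psi, augmentations[1](u_batch)))
                + lam_prox * np.linalg.norm(sigma - psi, ord=2))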
In this section, we present the designed FedFreq aggregation algorithm, which can dynamically adjust the weight of the corresponding local model according to the training frequency of the client to alleviate the non-IID problem. We observe that the parameter distribution of the global model will be biased towards clients that often participate in federated training, which clearly harms the robustness of the global model. Therefore, our insight is to reduce the influence of clients with high training frequency on the global model to improve the robustness of the model. Thus, the FedFreq aggregation rule computes the weight w_{t+1}^k of each selected client from its training frequency, where F is the sample proportion at the server, K is the total number of clients, q_{t+1}^k is the number of times that the k-th client has been trained up to the (t+1)-th round, and S_{t+1} denotes the set of clients selected by the server in round t+1. For each selected client k ∈ S_{t+1}, the weight decreases as q_{t+1}^k grows, and the weights are normalized over S_{t+1}, so that frequently selected clients contribute less to the global unsupervised model ψ_{t+1}.

To better evaluate the robustness of the designed system to non-IID data, in this paper we introduce the Dirichlet distribution function [40], [41], which is a popular non-IID function, to adjust the non-IID level of the local client data. Specifically, we generate data distributions of different non-IID levels by adjusting the parameter (i.e., µ) of the Dirichlet distribution function. We assume that the local dataset D_k of the k-th client has c classes; thus, the definition of the Dirichlet distribution function is as follows:

Dir(Θ | µ_1, ..., µ_c) = ( Γ(Σ_{i=1}^{c} µ_i) / Π_{i=1}^{c} Γ(µ_i) ) · Π_{i=1}^{c} ϕ_i^{µ_i − 1},

where Θ is a set of c samples randomly selected from the Dirichlet function, i.e., Θ = {ϕ_1, ..., ϕ_c} and Θ ∼ Dir(µ_1, ..., µ_c); µ, µ_1, ..., µ_c are the parameters of the Dirichlet distribution function (where µ = µ_1 = µ_2 = ... = µ_c); and p_k(ϕ_c) denotes the proportion of the c-th class data in all data of the client. In particular, the smaller the µ, the higher the non-IID level of the data distribution of each client; otherwise, the data distribution of the client tends to the IID setting. Therefore, we adjust the parameters of the Dirichlet distribution function to simulate different non-IID levels of the client's local dataset. For example, as shown in Fig. 3, we demonstrate the data distribution when µ = {0.1, 1, 10}.
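A Dirichlet-based partition of this kind can be generated in a few lines. In the sketch below, each client draws class proportions Θ = (ϕ_1, ..., ϕ_c) from Dir(µ, ..., µ) and samples its local indices accordingly; smaller µ concentrates each client on fewer classes, i.e., a higher non-IID level. The helper is a generic sketch, not the authors' data pipeline, and for simplicity it samples each client independently (indices may repeat across clients).

    import numpy as np

    def dirichlet_partition(labels, num_clients, mu, num_classes=10, seed=0):
        """Split sample indices across clients using Dirichlet(mu) class proportions.

        labels: 1-D array of class labels of the unlabeled training pool.
        Returns a list with one index array per client.
        """
        rng = np.random.default_rng(seed)
        class_indices = [np.where(labels == c)[0] for c in range(num_classes)]
        shard_size = len(labels) // num_clients       # roughly equal-sized local datasets
        clients = []
        for _ in range(num_clients):
            proportions = rng.dirichlet(np.full(num_classes, mu))   # Theta ~ Dir(mu, ..., mu)
            counts = rng.multinomial(shard_size, proportions)
            idx = [rng.choice(class_indices[c], size=min(n, len(class_indices[c])),
                              replace=False)
                   for c, n in enumerate(counts) if n > 0]
            clients.append(np.concatenate(idx) if idx else np.array([], dtype=int))
        return clients

With µ = 0.1 most clients end up dominated by a few classes, while larger values such as µ = 100 approach the IID setting.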
In the labels-at-server scenario, we compare our method FedMix and the baseline FedMatch [12] on the CIFAR-10 dataset. Furthermore, we simulate the federated learning setup (one server and K clients) on a commodity machine with an Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz and an NVIDIA GeForce RTX 2080Ti GPU.

Dataset 1) CIFAR-10 dataset under the IID setting: We use the CIFAR-10 dataset, including 56,000 training samples and 2,000 test samples as the validation dataset in our experiment. The training set includes 55,000 unlabeled samples and 1,000 labeled samples, where the former are used to train the unsupervised models locally and the latter are used to train the supervised model at the server. The unlabeled samples are equally distributed to 100 clients at a ratio of 1:100, so that each client holds 550 samples (i.e., 55 for each class, with 10 classes in total). Similarly, the labeled samples on the server total 1,000 and contain 10 classes, with 100 samples in each class. Meanwhile, we set a client participation rate of F = 0.05, i.e., 5 clients are randomly selected for training in each round.

Dataset 2) CIFAR-10 dataset under the non-IID setting: Our setting is similar to the above IID setting, except that the Dirichlet distribution function is introduced to adjust the non-IID level of the local client data. Specifically, we generate data distributions of different non-IID levels by adjusting the parameter (i.e., µ) of the Dirichlet distribution function. Meanwhile, we simulate quantity imbalance and class imbalance in the local client data. In particular, we make each client hold a different number of training samples; for example, some clients have 580 samples, while some clients have fewer than 50 samples. Second, we make each client hold a different number of samples per class; for example, some clients have ten classes of data, and some clients have fewer than two classes of data.

Baseline and training details: Our baseline is FedMatch [12], which naively uses an unsupervised-model and supervised-model parameter decomposition strategy, i.e., ω = ψ + σ. In the training process, both our model and the baseline use Stochastic Gradient Descent (SGD) to optimize a ResNet-9 neural network with an initial learning rate η = 1e-3. We set the number of training rounds t = 150, the number of labeled samples on the server N_s = 1000, the local client training epochs E_client = 1 with mini-batch size B_client = 64, and the server training epochs E_server = 1 with mini-batch size B_server = 64. Second, we set the number of data augmentations in the argmax method to A = 5.

As shown in Fig. 4, under both IID and non-IID settings, our method FedMix is better than the baseline under each of the different aggregation method settings. For example, under the non-IID setting, the convergence accuracy of our method is 47.5%, about 3% higher than that of the baseline. In particular, the accuracy of our method increases faster and more stably in the early stage of model training. The reasons are that: (1) FedMix focuses on the implicit contributions between iterations of the global model in a fine-grained manner, while FedMatch only naively uses model parameter decomposition. (2) The frequency-based aggregation method FedFreq is more suitable for non-IID settings.
Notably, FedFreq only requires the server to assign appropriate weights in the aggregation process according to the training frequency of each client, which does not bring additional computational overhead to the server or the local clients.

Fig. 5 shows the performance comparison of the proposed method under different hyperparameter settings. To be specific, the three hyperparameters are the weights of the global unsupervised model, the supervised model, and the previous round's global model. From Fig. 5, we can find that under the non-IID setting, as α decreases, the accuracy curve of the proposed method becomes unstable. The reason for this phenomenon is that, as the weight of the global unsupervised model decreases, FedFreq gradually loses its effect. In particular, when α = 0.5, β = 0.3, and γ = 0.2, our global model is the aggregation of the three models with the optimal weights. In addition, we find that the proposed method easily achieves better performance when these three parameters are relatively close under the IID setting, as shown in Fig. 5(b).

Fig. 6(a) shows the performance comparison of the proposed method on different non-IID levels of client data. In this experiment, we let µ = 0.1 denote the highest non-IID level of the client data. In this case, as the value of µ increases, the local client data distribution becomes closer to the IID setting. It can be seen from Fig. 6 that for different non-IID levels, our method achieves stable accuracy in all cases. Meanwhile, the model convergence accuracy under the µ = {0.1, 1, 10, 100} settings does not differ by more than 1%. Therefore, our method is not sensitive to the different levels of client data distribution, i.e., it is robust to different types of data distribution settings. Fig. 6(b) shows the performance comparison of the proposed method for different numbers of labeled samples at the server. Obviously, the converged accuracy of our method is 47% with 800 labeled samples, which is 2% higher than FedMatch. However, when the number of labeled samples is reduced to 700, the accuracy of our model decreases greatly. Therefore, we regard N_s = 800 as the best setting for our method.

In this section, we further analyze the advantages of FedMix compared to FedMatch in the labels-at-server scenario. 1) The performance of the model trained by FedMix on the CIFAR-10 dataset is better than that of FedMatch. This is because FedMatch simply uses the strategy of parameter decomposition of the unsupervised model and the supervised model in the training process, i.e., ω_t = ψ_t + σ_t. In this way, the learned global model will be biased towards the unlabeled data (unsupervised model) or the labeled data (supervised model) instead of the overall data. Thus, in order to avoid the drift problem of the global model, FedMix adds the global model from the previous round to the model parameter aggregation, i.e., ω_t = α·ψ_t + β·σ_t + γ·ω_{t−1}. Meanwhile, we conducted a sensitivity experiment of model performance with respect to different hyperparameter weights to find the optimal weight combination. 2) FedMix is robust to different levels of non-IID data. In our experiment, we introduced the Dirichlet distribution function to simulate the local clients' non-IID data in FL. In detail, we generate data distributions of different non-IID levels by adjusting the parameters of the Dirichlet distribution function, i.e., µ = {0.1, 1, 10, 100} respectively correspond to different levels of non-IID.
The results show that the performance difference of our model does not exceed 1% under different levels of non-IID settings. FedMatch uses a pseudorandom method to generate the non-IID data distribution of each client. However, such a distribution hardly occurs in reality, which causes the model to lose robustness.

In this paper, we studied the labels-at-server scenario and addressed the problems of data availability and non-IID data in FL. To solve the first problem, we designed a robust SSFL system that uses the FedMix algorithm to achieve high-precision semi-supervised learning. To tackle the non-IID problem, we proposed a novel aggregation algorithm, FedFreq, which effectively achieves stable performance of the global model during training without adding additional computational overhead. Experimental verification shows that our robust SSFL system performs significantly better than the baseline. In future work, we will further improve the algorithm to maximize the use of unlabeled data. Furthermore, we will continue to strengthen the theory of SSFL so that it can be better applied in real-world scenarios.

[1] Communication-efficient learning of deep networks from decentralized data
[2] Incentive mechanism for reliable federated learning: A joint optimization approach to combining reputation and contract theory
[3] Privacy-preserving traffic flow prediction: A federated learning approach
[4] When information freshness meets service latency in federated learning: A task-aware incentive scheme for smart industries
[5] Deep anomaly detection for time-series data in industrial IoT: A communication-efficient on-device federated learning approach
[6] Federated optimization in heterogeneous networks
[7] Federated learning for 6G communications: Challenges, methods, and future directions
[8] Reliable federated learning for mobile networks
[9] RC-SSFL: Towards robust and communication-efficient semi-supervised federated learning system
[10] FedSemi: An adaptive federated semi-supervised learning framework
[11] Towards utilizing unlabeled data in federated learning: A survey and prospective
[12] Federated semi-supervised learning with inter-client consistency & disjoint learning
[13] Semi-supervised federated learning for travel mode identification from GPS trajectories
[14] GraphFL: A federated learning framework for semi-supervised node classification on graphs
[15] Temporal ensembling for semi-supervised learning
[16] Adversarial dropout for supervised and semi-supervised learning
[17] FixMatch: Simplifying semi-supervised learning with consistency and confidence
[18] FedBN: Federated learning on non-IID features via local batch normalization
[19] Introduction to semi-supervised learning
[20] Semi-supervised learning (Chapelle et al.)
[21] Semi-supervised learning with deep generative models
[22] S4L: Self-supervised semi-supervised learning
[23] SemiBoost: Boosting for semi-supervised learning
[24] Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks
[25] Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results
[26] Distillation-based semi-supervised federated learning for communication-efficient collaborative training with non-IID private data
[27] Improving semi-supervised federated learning by reducing the gradient diversity of models
[28] On the convergence of FedAvg on non-IID data
[29] Federated learning with non-IID data
[30] Robust and communication-efficient federated learning from non-IID data
[31] Federated learning with hierarchical clustering of local updates to improve training on non-IID data
[32] Optimizing federated learning on non-IID data with reinforcement learning
[33] Asynchronous online federated learning for edge devices with non-IID data
[34] Federated learning on non-IID data silos: An experimental study
[35] Tackling the objective inconsistency problem in heterogeneous federated optimization
[36] Regularization with stochastic transformations and perturbations for deep semi-supervised learning
[37] Unsupervised data augmentation for consistency training
[38] A survey on deep semi-supervised learning
[39] MixMatch: A holistic approach to semi-supervised learning
[40] Bayesian nonparametric federated learning of neural networks
[41] Measuring the effects of non-identical data distribution for federated visual classification