title: Federated Semi-Supervised Learning with Class Distribution Mismatch
authors: Wang, Zhiguo; Wang, Xintong; Sun, Ruoyu; Chang, Tsung-Hui
date: 2021-10-29

Abstract: Many existing federated learning (FL) algorithms are designed for supervised learning tasks, assuming that the local data owned by the clients are well labeled. However, in many practical situations, it can be difficult and expensive to acquire complete data labels. Federated semi-supervised learning (Fed-SSL) is an attractive solution for fully utilizing both labeled and unlabeled data. As in federated supervised learning, the class distribution of the labeled/unlabeled data can be non-i.i.d. among clients. Moreover, within each client, the class distribution of the labeled data may differ from that of the unlabeled data. Both issues can severely jeopardize the FL performance. To address them, we introduce two regularization terms that effectively alleviate the class distribution mismatch problem in Fed-SSL. In addition, to cope with non-i.i.d. data, we leverage variance reduction and normalized averaging techniques to develop a novel Fed-SSL algorithm. Theoretically, we prove that the proposed method has a convergence rate of $\mathcal{O}(1/\sqrt{T})$, where $T$ is the number of communication rounds, even when the data distributions are non-i.i.d. among clients. To the best of our knowledge, this is the first formal convergence result for Fed-SSL problems. Numerical experiments on the MNIST and CIFAR-10 datasets show that the proposed method greatly improves the classification accuracy compared to baselines.

Existing analyses have established the convergence of FedAvg when the objective function is convex. For nonconvex FL optimization, the authors of [9] obtained a tight convergence rate for FedAvg. However, FedAvg is known to suffer from slow convergence when the data among clients are non-i.i.d. Therefore, various remedies have been proposed, such as adding a proximal term (FedProx [10]), variance reduction (VRL-SGD [11], SCAFFOLD [9]), and normalized averaging (FedNova [12]).
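As a point of reference for the techniques listed above, the following minimal sketch shows the FedAvg-style local-SGD-plus-weighted-averaging loop that they all modify. The function names and interfaces are illustrative assumptions, not code from the paper or from [1].

```python
# Minimal FedAvg-style sketch (illustrative only): K clients run local SGD on
# their own data and the server takes a data-size-weighted average of the
# returned models.
import numpy as np

def local_sgd(theta, data, grad_fn, lr=0.01, steps=10):
    """Run `steps` SGD iterations starting from the global model `theta`."""
    theta = theta.copy()
    for _ in range(steps):
        x, y = data[np.random.randint(len(data))]   # one random local sample
        theta -= lr * grad_fn(theta, x, y)
    return theta

def fedavg_round(theta_global, client_datasets, grad_fn, lr=0.01, steps=10):
    """One communication round: broadcast, local training, weighted averaging."""
    sizes = np.array([len(d) for d in client_datasets], dtype=float)
    weights = sizes / sizes.sum()                    # omega_k = N_k / N
    local_models = [local_sgd(theta_global, d, grad_fn, lr, steps)
                    for d in client_datasets]
    return sum(w * th for w, th in zip(weights, local_models))
```

Under non-i.i.d. client data, the locally trained models drift apart, which is exactly what the variance-reduction and normalized-averaging corrections discussed above (and later in this paper) are designed to counteract.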
Despite their popularity, most existing works on FL focus on supervised learning tasks in which the data owned by the clients are well labeled. Nevertheless, data labeling can be expensive and time-consuming. This issue is more severe in the FL setting, since clients may not have the resources to label their personal data, e.g., pictures on mobile phones or medical images in hospitals [13]. This raises a fundamental question: how can these massive, distributed unlabeled data be used effectively to improve FL? In machine learning, the standard approach to exploiting unlabeled data is semi-supervised learning (SSL) [13]-[15], so it is natural to consider SSL for FL.

We provide a brief review of existing SSL methods for data classification. According to [14], [15], traditional centralized SSL methods include self-training, graph-based methods, and semi-supervised support vector machines (SVMs), to name a few. The success of these methods relies on critical assumptions beyond the standard smoothness, cluster, and manifold assumptions. For example, the self-training method, which originated with Yarowsky [16], is a well-known bootstrapping algorithm that relies on the assumption that the data have a well-separated clustering structure [14]. Theoretical analyses supporting the effectiveness of self-training algorithms are given in [17], [18]. Many recent approaches to semi-supervised learning advocate training a neural network with a consistency loss, which forces the model to produce consistent outputs when its inputs are perturbed; examples include pseudo-labeling [19], the ladder network [20], the Π model [21], the mean teacher [22], virtual adversarial training (VAT) [23], and MixMatch [24].

Existing SSL methods typically assume that the labeled and unlabeled data come from the same class distribution, but in practice the unlabeled data are unlikely to be manually purified beforehand [25]. For example, in medical diagnosis, unlabeled medical images may contain rare diseases that never appear in the labeled data set. As illustrated in Fig. 1, for an image classification task at Client A, the labeled images contain two classes (bird and dog), but the unlabeled images include three novel classes (deer, car, and horse) that are not present in the labeled images. Hence, the unlabeled data may consist of both relevant and irrelevant samples, and using the irrelevant unlabeled data often degrades SSL performance [26]. We refer to this as the class distribution mismatch problem. Recently, there have been attempts to overcome this problem in centralized SSL. For example, reference [27] applied a safe deep SSL method to alleviate the harm caused by irrelevant unlabeled data; the key idea is to select relevant unlabeled data rather than using all unlabeled data directly. However, this method requires solving a complicated optimization problem for data selection.

A related direction is distributed SSL, which has received considerable interest recently [28]-[33]. In [30], [31], the authors considered semi-supervised SVMs in a distributed setting and used consensus-constrained distributed optimization methods [34], [35]. In [32], [33], manifold-regularized SSL methods are studied, which however require the clients to estimate the Euclidean distance matrix of the data samples in advance. This can cause significant communication overhead when the system has massively distributed clients and when the data at the clients have very different distributions. With the development of federated learning, the authors of [36] give a brief outlook on Fed-SSL. The recent work [37] proposed a semi-supervised FL method called FedSem to exploit unlabeled data in smart-city applications. FedSem has two steps: the first trains a global model using only the labeled data, and the second injects the unlabeled data into the learning process via pseudo-labeling. In [38], the authors propose an inter-client consistency loss that regularizes the models learned at local clients to output the same predictions. Reference [39] also uses a Fed-SSL technique based on pseudo-labeling to aid the diagnosis of COVID-19. However, the methods in [37]-[39] consider neither the non-i.i.d. data among clients nor the class distribution mismatch between labeled and unlabeled data.

It is worthwhile to point out that these two issues bring both challenges and opportunities for Fed-SSL. The challenge is that non-i.i.d. data slow down the convergence of FL algorithms, so the clients can hardly make accurate predictions for the unlabeled data in the early iterations.
It is conceivable that an early mistake of the pseudo-labeling method can reinforce itself by generating incorrectly labeled data, and re-training with these data leads to an even worse model in successive iterations [14]. On the other hand, the opportunity is that the FL setting provides a means to leverage data information from other clients (see Fig. 1), so the class distribution mismatch problem can be alleviated and the clients are helped to predict correct labels for their unlabeled data. This also brings a new challenge: how can the knowledge of other clients be transferred to help predict labels for data whose classes are not seen in the local labeled subset? These factors make a naive combination of existing FL algorithms and centralized SSL techniques unlikely to deliver satisfactory performance.

In this paper, we develop a new approach to the Fed-SSL problem under non-i.i.d. data and class distribution mismatch, aiming to fully utilize both the labeled and unlabeled data and to achieve high-quality FL performance in a communication-efficient way. Our contributions are summarized as follows.

• Problem formulation: We formulate the Fed-SSL problem as a joint optimization of the model parameters and the pseudo labels of the unlabeled data. To mitigate the performance degradation caused by non-i.i.d. data and class distribution mismatch, we introduce two regularization terms in the objective function (see (5a)). One is a penalty on the pseudo labels, aimed at boosting the prediction accuracy in the early training stages. The other is a confidence penalty on the model parameters, which facilitates knowledge transfer between clients so that unlabeled data can be classified into novel classes that are not seen in the labeled data.

• Algorithm design: We propose a novel federated SSL algorithm, called Fed-SHVR, that allows heterogeneous numbers of local SGD iterations and employs a variance reduction technique. Since the Fed-SSL problem involves two blocks of variables, we adopt the block coordinate descent (BCD) method, which is known to be effective for loss functions with multiple blocks. To improve communication efficiency, similar to FedAvg, within each communication round of Fed-SHVR each client performs multiple epochs of local SGD with respect to the model parameters and one step of pseudo-label prediction. Recognizing that heterogeneous local SGD iterations suffer from large variance and objective inconsistency when the data distributions among clients are non-i.i.d. [12], and inspired by [9], [12], we introduce gradient correction terms at the clients to reduce the variance among clients, and normalized averaging of the local gradients at the server to ensure objective consistency.

• Convergence analysis: We prove that Fed-SHVR converges to a stationary point of the Fed-SSL problem at a sublinear rate of O(1/√T), where T is the number of communication rounds. The analysis requires neither the bounded-gradient-dissimilarity assumption of [12] nor convexity of the objective function. To the best of our knowledge, such convergence results for Fed-SSL have not appeared in the literature.

• Experiments: The performance of Fed-SHVR is evaluated on the MNIST and CIFAR-10 datasets. The results show that, under non-i.i.d. data and class distribution mismatch among clients, Fed-SHVR greatly improves the classification accuracy compared with baselines [38], [40], including naive combinations of FL with centralized SSL methods.
Synopsis: Section II presents the formulated Fed-SSL optimization problem. Section III presents the proposed Fed-SHVR algorithm. Section IV establishes the convergence conditions and the convergence rate of Fed-SHVR. The performance of Fed-SHVR is illustrated in Section V, and the conclusion is given in Section VI.

In this section, we review the centralized SSL problem and present the considered Fed-SSL optimization problem. Let L = {(x_i, y_i), i = 1, ..., N} denote the labeled data set and U = {u_i, i = 1, ..., M} denote the unlabeled data set, where x_i is the i-th labeled sample with label y_i ∈ R^C, a one-hot vector representing the true class, and C is the number of classes. For supervised learning, one can train a classification model by minimizing the cross-entropy loss L_CE(θ; x_i, y_i) = −Σ_{j=1}^C [y_i]_j log [f_θ(x_i)]_j averaged over the labeled samples, where θ is the classification model parameter and f_θ(x_i) ∈ R^C is the predicted probability vector (e.g., the softmax output) for sample x_i.

In order to utilize the unlabeled data U, we follow the pseudo-labeling approach and denote v̂_i ∈ R^C as the pseudo label for each unlabeled sample u_i, i ∈ [M] ≜ {1, ..., M}. In particular, we consider the joint model training and pseudo-label prediction problem (1) of [41], where v̂ = {v̂_1, ..., v̂_M} is the set of pseudo labels and e = [1, ..., 1]^T ∈ R^C is the all-one vector. As seen from (1), the pseudo labels v̂ are treated as optimization variables and are jointly optimized with the model parameter θ. The simplex constraints in (1b) imply that the obtained v̂ are soft labels. The objective function (1a) consists of a supervised loss on the labeled data and a loss on the unlabeled data computed with the pseudo labels; the weight α_0 balances the supervised and unsupervised losses [19]. To solve (1), one can apply the popular block coordinate descent (BCD) method, which updates θ and v̂ alternately, as described below.

• Updating v̂ with θ fixed: this corresponds to problem (2). By the definition of the cross entropy, the objective (2a) is linear in v̂, so the closed-form solution of (2) is the hard label [v̂_i]_j = 1 if j = argmax_s [f_θ(u_i)]_s and [v̂_i]_j = 0 otherwise (3), where [v̂_i]_j and [f_θ(u_i)]_s denote the j-th and s-th entries of v̂_i and f_θ(u_i), respectively.

• Updating θ with v̂ fixed: this corresponds to problem (4), which can be handled by the standard SGD method [24].

Note that the alternating updates (3) and (4) are exactly the pseudo-labeling method of [19]; thus, the pseudo-labeling method [19] is in fact an application of the BCD method to the joint model training and pseudo-label prediction problem (1). Based on (1), we next formulate a Fed-SSL problem. Consider an FL setting with a server and K distributed clients. We assume that both the labeled data L and the unlabeled data U are distributed over the K clients. Specifically, client k owns the local data set D_k = L_k ∪ U_k, where L_k = {(x_{k,i}, y_{k,i}), i = 1, ..., N_k} is the local labeled data set and U_k = {u_{k,i}, i = 1, ..., M_k} is the local unlabeled data set. Here, D_k, k ∈ [K] ≜ {1, ..., K}, are non-overlapping, and N_k and M_k are the numbers of labeled and unlabeled samples at client k, respectively. Note that under the FL scenario, the data sizes N_k and M_k can be unbalanced among the clients. Moreover, as shown in Fig. 1, the labeled data and the (typically much larger) unlabeled data may be drawn from different class distributions, i.e., there is a class distribution mismatch.
We then propose the Fed-SSL optimization problem (5), which minimizes the weighted sum of the clients' local objectives F_k(θ, v̂_k) jointly over the model parameter θ and the local pseudo labels v̂_k ∈ V_k, where V_k denotes the probability-simplex constraint set of client k. Notably, compared with (1a), we introduce in (5a) two additional regularization terms, r_1(v̂_k) and r_2(θ), for the pseudo labels and the model parameter, respectively, where α_1 and α_2 are the corresponding weight coefficients. The motivations for r_1(·) and r_2(·) are explained below.

Regularization r_1(·): When the data are non-i.i.d. among clients, FL algorithms behave unstably and converge slowly [9], so the model mostly makes incorrect predictions in the early training stages. The pseudo-labeling method, however, takes the hard pseudo labels in (3) as "ground truth", which leads to overconfident mistakes; moreover, early mistakes can reinforce themselves and result in an even worse model in successive iterations. We therefore regularize the pseudo labels to reduce their confidence level. Toward this goal, we choose r_1(v̂_k) as in (6), namely the average Kullback-Leibler (KL) divergence KL(v̂_{k,i}, u) between the pseudo labels and the uniform distribution u = [1/C, ..., 1/C] ∈ R^C. Adding the regularization term r_1(v̂_k) in (6) makes the pseudo-label prediction less decisive, as shown by the lemma below.

Lemma 1. Let r_1(v̂_k) be given by (6). For fixed θ and k ∈ [K], the closed-form solution of problem (7) is the sharpened distribution in (8), where [v̂_{k,i}]_j denotes the j-th entry of v̂_{k,i}.

Note that the closed-form solution (8) is exactly the sharpening function used in the popular MixMatch method [24]. When α_1 → 0, the output v̂_{k,i} in (8) approaches a one-hot distribution and degenerates to the hard pseudo label in (3). In contrast, the soft pseudo label in (8) avoids overconfident mistakes and can achieve better recovery accuracy [41]. As shown in Section V, the regularizer r_1(v̂_k) helps speed up the convergence of Fed-SSL in the early training stages.

Regularization r_2(·): When the distributions of the labeled and unlabeled data are mismatched, the unlabeled data may contain classes unseen in the labeled data, so the required knowledge has to be transferred from other clients. To enable this, inspired by [42], we introduce in (9) a confidence penalty on the model output f_θ(u_k) based on the KL divergence KL(f_θ(u_k), u). Penalizing KL(f_θ(u_k), u) keeps the prediction f_θ(u_k) away from a categorical (one-hot-like) distribution. This prevents the local model from making predictions based solely on its local knowledge of the seen labels; instead, through the FL process, the model parameters are able to infer novel classes that never appear in the local labeled data.
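To make the two regularizers concrete, the sketch below implements one plausible per-client loss in the spirit of (5a) and a MixMatch-style sharpening update in the spirit of Lemma 1. The exact weighting and normalization in the paper's (5a)-(8) are given by its displayed equations and may differ; all function and variable names are illustrative assumptions.

```python
# Sketch of a per-client Fed-SSL loss with the two regularizers of (5a) and a
# MixMatch-style "sharpening" pseudo-label update (Lemma 1); illustrative only.
import torch
import torch.nn.functional as F

def sharpen(probs: torch.Tensor, alpha1: float) -> torch.Tensor:
    """Soft pseudo labels: raise probabilities to a power and renormalize.
    As alpha1 -> 0 this approaches the hard arg-max label of (3)."""
    p = probs.clamp_min(1e-12) ** (1.0 / max(alpha1, 1e-12))
    return p / p.sum(dim=1, keepdim=True)

def client_loss(model, x_l, y_l, u, v_hat, alpha0, alpha1, alpha2, C):
    """Labeled CE + alpha0 * unlabeled CE with pseudo labels
    + alpha1 * KL(v_hat || uniform) + alpha2 * KL(f_theta(u) || uniform).
    y_l holds integer class indices; v_hat is held fixed during this step."""
    uniform = torch.full((1, C), 1.0 / C)
    logp_l = F.log_softmax(model(x_l), dim=1)
    logp_u = F.log_softmax(model(u), dim=1)

    sup = F.nll_loss(logp_l, y_l)                         # labeled cross entropy
    unsup = -(v_hat * logp_u).sum(dim=1).mean()           # CE with pseudo labels
    r1 = (v_hat * (v_hat.clamp_min(1e-12) / uniform).log()).sum(dim=1).mean()
    p_u = logp_u.exp()
    r2 = (p_u * (p_u.clamp_min(1e-12) / uniform).log()).sum(dim=1).mean()
    return sup + alpha0 * unsup + alpha1 * r1 + alpha2 * r2
```

In the alternating (BCD) scheme, v_hat would be computed once per round with torch.no_grad() as sharpen(F.softmax(model(u), dim=1), alpha1) and then held fixed during the local SGD steps on θ.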
In this section, we present the proposed Fed-SHVR algorithm for solving problem (5); it is summarized in Algorithm 1. Since the Fed-SSL problem (5) involves two blocks of variables, θ and v̂, many existing FL algorithms cannot be applied directly [43]. Given that BCD steps like (3) and (4) can handle the Fed-SSL problem (5), we propose to train a global model for (5) by combining BCD with the local SGD strategy [1] to reduce the communication cost. Because non-i.i.d. data are common in FL scenarios and averaging after local SGD can induce a large variance among clients, variance reduction techniques [11] are desirable. Specifically, in each communication round t = 1, 2, ..., the algorithm has two parts: a client update and a server update.

Client update: Based on its local data, each client alternately updates the model parameter θ and the pseudo labels v̂_k within the BCD framework. First, the pseudo labels for problem (5) are obtained by solving (7); as shown in Lemma 1, the soft labels in (8) are the closed-form solution when r_1(v̂_k) is chosen as in (6). Second, after updating v̂_k, the client selects mini-batches of labeled data ξ_k and unlabeled data ζ_k uniformly at random from L_k and U_k, respectively, and computes the stochastic gradient g_k(θ_k, v̂_k) of the local loss function. Directly running several local SGD steps to update θ in (16) can cause client drift when the data are non-i.i.d. among clients. To counter this drift, following SCAFFOLD [9] and VRL-SGD [11], we introduce a gradient correction term d_k^t for the local SGD: the correction term computed in (15) is used in the update (16). As shown in Remark 1, (16) together with (15) is equivalent to taking a gradient step along an estimated global gradient direction of (5a), which effectively reduces the variance among clients. Finally, after τ_k local SGD updates (Steps (14)-(16)), the client model is sent to the server.

Server update: If the mini-batch sizes for labeled and unlabeled data are B_l and B_u, respectively, and each client k performs E local epochs, then the number of local SGD iterations is τ_k = max{M_k E / B_u, N_k E / B_l}. The clients are heterogeneous if their τ_k differ (and homogeneous otherwise). In [12], it is proved that the standard averaging of client models [1], θ^t = Σ_{k=1}^K ω_k θ_k^{t−1,τ_k}, after heterogeneous numbers of local updates τ_k prevents the algorithm from converging to a stationary point, a phenomenon called objective inconsistency. To handle heterogeneous local updates, in our algorithm the server obtains the model θ^t by weighted averaging between the normalized local gradients and the previous model parameter θ^{t−1}, and then broadcasts θ^t to the clients.

Remark 1. Using (14), the correction term d_k^t can be rewritten in terms of the accumulated local stochastic gradients; substituting this expression into (16) shows that each local step moves along an estimate of the global gradient direction of (5a) that accounts for the data of all clients, where the final equality is due to (11). Interestingly, the resulting update (13) resembles SAGA [44], in which the model is updated along an estimated global gradient direction that considers the data of all clients. Thus, Fed-SHVR can be seen as an extension of the variance reduction techniques of [11] and [44].

Algorithm 1: Proposed Fed-SHVR
1: Input: initial model parameters θ^0 = θ_1^{0,τ_1} = ... = θ_K^{0,τ_K} at the server; initial pseudo labels v̂_1^0, ..., v̂_K^0 and gradient correction terms d_1^0 = ... = d_K^0 = 0 at the clients; learning rate η.
2: Compute the local iteration numbers τ_k of the clients and τ̄ = Σ_{k=1}^K ω_k τ_k at the server.
3: for communication round t = 1 to T do
4:   Server side: compute and broadcast θ^t to all clients.
5:   Client side:
6:   for client k = 1 to K (in parallel) do
7:     Obtain v̂_k^{t+1} from (7)-(8).
8:     Update the gradient correction term d_k^t as in (15).
9:     Set θ_k^{t,0} = θ^t.
10:    for q = 1 to τ_k do
11:      Select data ξ_k^{t,q} and ζ_k^{t,q} uniformly at random from L_k and U_k, and update θ_k^{t,q} as in (16).
12:    end for
13:    Upload θ_k^{t,τ_k} to the server.
14:  end for
15: end for
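The round structure of Algorithm 1 can be sketched as follows. Since the paper's correction rule (15), local update (16), and server averaging step are given by displayed equations not reproduced here, this sketch uses a SCAFFOLD-style correction and a FedNova-style normalized averaging, which the text names as the inspirations; it is meant only to illustrate the control flow, not the exact formulas, and all names are illustrative assumptions.

```python
# Schematic sketch of one Fed-SHVR-like communication round (cf. Algorithm 1).
import numpy as np

def client_round(theta_srv, d_k, d_bar, data_k, grad_fn, sharpen_fn, eta, tau_k):
    """Client k: pseudo-label step (cf. (7)-(8)), then tau_k corrected SGD steps."""
    v_hat = sharpen_fn(theta_srv, data_k)            # BCD step on pseudo labels
    theta = theta_srv.copy()
    for _ in range(tau_k):
        g = grad_fn(theta, v_hat, data_k)            # stochastic gradient of F_k
        theta -= eta * (g - d_k + d_bar)             # drift-corrected local SGD
    # SCAFFOLD-style refresh of the local correction term (one plausible choice):
    d_k_new = d_k - d_bar + (theta_srv - theta) / (eta * tau_k)
    return theta, d_k_new, v_hat

def server_round(theta_prev, client_thetas, taus, weights):
    """FedNova-style normalized averaging of the clients' local progress."""
    tau_bar = sum(w * t for w, t in zip(weights, taus))
    step = sum(w * (theta_prev - th) / tk
               for w, th, tk in zip(weights, client_thetas, taus))
    return theta_prev - tau_bar * step
```

Here d_bar would be the (weighted) average of the clients' d_k maintained by the server, as in SCAFFOLD's global control variate; the paper's own scheme keeps only per-client correction terms d_k^t, initialized to zero.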
In this section, we establish the convergence conditions of the proposed Fed-SHVR algorithm. We first make some standard assumptions.

Assumption 1. The regularization terms r_1(v̂) and r_2(θ) are continuously differentiable. In addition, r_1(v̂) is µ-strongly convex with µ > 0, i.e., for any v̂_1, v̂_2 ∈ V_k, r_1(v̂_1) ≥ r_1(v̂_2) + ⟨∇r_1(v̂_2), v̂_1 − v̂_2⟩ + (µ/2)‖v̂_1 − v̂_2‖_1².

The regularization term r_1(v̂_k) defined in (6) is strongly convex over the probability simplex V_k with respect to the ℓ_1-norm (see [45], Definition 2 and Example 2). However, note that the objective function F_k(θ, v̂_k) in (5a) is not jointly convex with respect to (θ, v̂).

Assumption 2. The local cost F_k(θ, v̂_k) is L-smooth (possibly non-convex) with respect to (θ, v̂_k) for k ∈ [K], i.e., its gradient is L-Lipschitz continuous for all θ, θ′ and v̂_k, v̂′_k ∈ V_k.

Assumption 3. Given v̂_k, k ∈ [K], the stochastic gradient g_k(θ, v̂_k) is unbiased with bounded variance, i.e., E[g_k(θ, v̂_k)] = ∇_θ F_k(θ, v̂_k) and E‖g_k(θ, v̂_k) − ∇_θ F_k(θ, v̂_k)‖² ≤ σ², where σ² is the noise variance and E denotes the expectation with respect to the random mini-batches {ξ_k, ζ_k}.

Assumption 2 is a standard smoothness assumption in non-convex optimization with two blocks of variables [46]. Assumption 3 is the common assumption that the stochastic gradient noise is zero mean with bounded variance σ² [9]. Since the variable v̂_k is constrained to V_k, let us define the optimality gap g^t as in (21). By definition, g^t ≥ 0, and it has the following property.

Property 1. If g^t = 0, then the iterate {θ^t, v̂^t} is a stationary point of problem (5). Proof: see Appendix B.

Theorem 1. Suppose Assumptions 1-3 hold and the step size η is chosen sufficiently small. Then the iterates generated by Fed-SHVR satisfy the optimality-gap bound in (22), in which c_1 is a constant determined by the problem and algorithm parameters. Proof: see Appendix C.

Theorem 1 shows that the proposed Fed-SHVR has a convergence rate of O(1/√T). To the best of our knowledge, this is the first result establishing a convergence rate for Fed-SSL optimization.

Remark 2. In [11], [12], the authors proved that VRL-SGD and FedNova have a convergence rate of O(1/√T). However, their proofs cannot be applied to the proposed Fed-SHVR, since Fed-SHVR combines both the variance reduction and normalized averaging techniques and, in addition, involves two blocks of variables. Moreover, unlike [12], our analysis does not require the boundedness of the gradient dissimilarity across clients, in which the dissimilarity is bounded by a constant κ. Thus, new proof techniques are developed to establish the convergence rate in (22); details are given in Appendix C.

In this section, we examine the numerical performance of the proposed Fed-SHVR algorithm and compare it with existing methods on the MNIST and CIFAR-10 datasets. Let us first consider the MNIST digit recognition task. The data are split into 60000 training images and 10000 test images, and there are K = 10 clients. We consider two ways of dividing the MNIST data over the clients:

• IID: each client is randomly assigned a uniform distribution over the 10 classes and receives 6000 examples, of which 60 are labeled and 5940 are unlabeled.

• Non-IID: each client has only 60 labeled samples from two digits, while the 59400 unlabeled samples are randomly partitioned across the 10 clients using a Dirichlet distribution Dir_10(0.1) [47]; thus, the class distributions of the labeled and unlabeled data in each client are mismatched.

We run each experiment with 5 random seeds and report the average.
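The Non-IID unlabeled-data partition described above can be generated with a routine of the following kind. The class-wise Dirichlet recipe is a common implementation of Dir_K(α) partitions [47] and is an assumption here, since the paper does not list its sampling code.

```python
# Sketch of a Dirichlet-based non-IID partition of (unlabeled) data across
# clients, in the spirit of Dir_10(0.1) used above; details are assumptions.
import numpy as np

def dirichlet_partition(labels, num_clients=10, alpha=0.1, seed=0):
    """Return a list of index arrays, one per client."""
    rng = np.random.default_rng(seed)
    client_indices = [[] for _ in range(num_clients)]
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        rng.shuffle(idx)
        # proportion of class-c samples assigned to each client
        proportions = rng.dirichlet(alpha * np.ones(num_clients))
        cuts = (np.cumsum(proportions)[:-1] * len(idx)).astype(int)
        for k, part in enumerate(np.split(idx, cuts)):
            client_indices[k].extend(part.tolist())
    return [np.array(ci) for ci in client_indices]

# Example: 59400 unlabeled MNIST training labels -> 10 mismatched client shards.
# shards = dirichlet_partition(unlabeled_train_labels, num_clients=10, alpha=0.1)
```

Smaller α makes the class proportions per client more skewed, which is how the mismatch between each client's labeled digits and its unlabeled pool is induced.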
We use a simple multi-layer perceptron (MLP) with one hidden layer of 5000 units and ReLU activations. The mini-batch sizes are B_l = 32 for labeled data and B_u = 32 for unlabeled data. In (16), we set the learning rate η = 0.01, α_1 = 0.75, α_2 = 0.1, and α_0 ramps up from 0 to its final value during the first 50 epochs. All clients perform E = 2 local epochs; accordingly, the number of local SGD iterations is τ_k ∈ [270, 500] in the Non-IID case and τ_k = 371 in the IID case.

Example 1: The popular mean teacher [22] and pseudo-labeling [19] methods can naturally be combined with FedAvg [1]; we call the resulting baselines Fed-MT and Fed-Pseudo, respectively. Fig. 2 presents the test accuracy of the proposed Fed-SHVR and of Fed-MT, Fed-Pseudo, and vanilla FedAvg under the IID and Non-IID cases. The proposed Fed-SHVR performs considerably better than Fed-MT and Fed-Pseudo in both cases, and all Fed-SSL algorithms perform better than supervised FedAvg, which uses only the 600 labeled samples. Comparing Fig. 2(a) with Fig. 2(b), the Non-IID case with mismatched class distributions degrades the performance of Fed-SSL; the two regularization terms introduced in our method help Fed-SSL attain higher test accuracy.

Example 2: We evaluate each federated SSL technique in the IID case with 600 labeled samples and varying amounts of unlabeled data; the resulting test accuracies are shown in Fig. 3(a). Compared with FedAvg, increasing the number of unlabeled samples tends to improve the performance of the Fed-SSL techniques, and the proposed Fed-SHVR performs best. Fig. 3(b) shows the test accuracy as the number of clients increases. The performance of the Fed-SSL methods changes only slightly, so these methods are robust to the number of clients while still outperforming FedAvg, which uses only labeled data.

Example 3: We next consider an ablation study. The proposed Fed-SHVR can be viewed as a combination of FedNova and VRL-SGD, employing heterogeneous local SGD iterations together with a variance reduction technique. We compare Fed-SHVR with two ablated variants: a FedNova-style variant that updates the model parameters without the gradient correction term d_k^t in (16), and a VRL-SGD-style variant that uses only the variance reduction technique while still performing heterogeneous local SGD iterations. Fig. 4 shows the training loss and the standard deviation (std) curves versus the communication round in the Non-IID setting, where the std at round t is computed from the per-run prediction accuracies p_t^i, i = 1, ..., 5, of the global model as std_t = ((1/5)Σ_{i=1}^5 (p_t^i − p̄_t)²)^{1/2} with p̄_t = (1/5)Σ_{i=1}^5 p_t^i. Fig. 4(a) shows that Fed-SHVR reaches the smallest training loss, and Fig. 4(b) reveals that Fed-SSL with the variance reduction technique reduces the variance among workers, with Fed-SHVR achieving the smallest variance.

Example 4: We discuss the effectiveness of the two regularization terms in (6) and (9); setting α_1 = α_2 = 0 removes both of them. Table I shows the test accuracy of the proposed Fed-SHVR with different weight coefficients for the two regularization terms in (5a). Compared with FedAvg using only labeled data, Fed-SHVR performs better whenever the two regularization terms are used, which implies that the unlabeled data improve the performance. For the Non-IID case, Fed-SHVR with α_1 = 0.75 performs much better than with α_1 = 0 at the first communication round T = 1, which shows that the regularizer r_1(v̂_k) speeds up the convergence in the early training iterations, especially when the class distributions of the labeled and unlabeled data are mismatched. Table I also shows that the model regularizer r_2(θ) improves the final accuracy at communication round T = 100. In other words, the confidence penalty r_2(θ) leads to smoother output distributions, which results in better generalization.
Moreover, we explore how the regularizer affects the loss landscape and generalization [48]. Fig. 5 plots the 2D surface of F(θ^100 + a δ_1 + b δ_2, v̂^100) as a function of (a, b), where F(·,·) is defined in (5a), δ_1 and δ_2 are two direction vectors, and (θ^100, v̂^100) are the model parameters and pseudo labels at communication round T = 100. From Fig. 5, we observe that the loss function with r_2(θ) in (5a) has a smoother landscape, which tends to produce minimizers that generalize better.

We next run experiments on the CIFAR-10 benchmark, which contains 32×32-pixel RGB images from 10 classes ("airplane", "frog", "automobile", "bird", "cat", "dog", "deer", "horse", "ship", "truck"). CIFAR-10 consists of 50000 training samples and 10000 test samples. In our experiments, we consider an FL setting with 40 clients. For each client, we construct the labeled data from 100 images covering only 6 classes (e.g., bird, cat, deer, dog, truck, ship); another client may have a different set of 6 classes (e.g., frog, horse, airplane, automobile, bird, cat). Thus, the total number of labeled samples is 4000, but the labeled data of different clients may follow different distributions. The total number of unlabeled samples is 20000, randomly partitioned across the 40 clients using a Dirichlet distribution Dir_40(0.025), so the amount of unlabeled data can also differ across clients. Similar to [27], we vary the fraction of unlabeled images drawn from outside each client's 6 labeled classes to modulate the class distribution mismatch: a labeled/unlabeled class-mismatch ratio of 1 means that all unlabeled data come from the other 4 classes, whereas a ratio of 0.5 means that half of the unlabeled data come from the client's 6 labeled classes and the other half from the remaining 4 classes. All methods in the comparison use the same 13-layer ConvNet (CNN13) architecture with the same initialization; CNN13 is widely used in SSL [21]-[23]. The mini-batch sizes are B_l = 32 for labeled data and B_u = 32 for unlabeled data, the learning rate is η = 0.01, and the regularization coefficients are α_1 = 0.75 and α_2 = 0.1. All clients perform E = 2 local epochs, and the number of local SGD iterations τ_k varies from 40 to 90 in our simulation.

Fig. 6(a) shows the test accuracy after 1000 communication rounds of the proposed Fed-SHVR and of the other federated semi-supervised methods, Fed-MT and Fed-Pseudo. Interestingly, the test accuracy of the proposed Fed-SHVR increases with the mismatch ratio; a mismatch ratio equal to 1 means that the class distribution of the labeled data is completely different from that of the unlabeled data at every client. Reference [27] has shown that the performance of centralized SSL methods decreases significantly as the class mismatch between labeled and unlabeled data increases. In contrast, the proposed semi-supervised methods under the federated setting perform better than supervised FL (Fed-SL) as the class mismatch increases; one possible reason is that Fed-SHVR and Fed-MT can use the information of other clients to improve the test accuracy. Compared with Fed-Pseudo, the proposed Fed-SHVR additionally uses the two regularization terms to mitigate the harm caused by the class distribution mismatch, which explains its higher accuracy.
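For concreteness, the labeled/unlabeled class-mismatch construction described above (following the protocol of [27]) can be sketched as follows; the routine and its arguments are illustrative assumptions rather than the authors' code.

```python
# Illustrative sketch of a labeled/unlabeled class-mismatch split for one client.
import numpy as np

def mismatch_split(labels, seen_classes, n_labeled, n_unlabeled,
                   mismatch_ratio, seed=0):
    """Pick labeled samples from `seen_classes` only, and unlabeled samples of
    which a `mismatch_ratio` fraction comes from the remaining (novel) classes."""
    rng = np.random.default_rng(seed)
    seen = np.isin(labels, seen_classes)
    seen_idx, novel_idx = np.flatnonzero(seen), np.flatnonzero(~seen)

    labeled = rng.choice(seen_idx, size=n_labeled, replace=False)
    n_novel = int(round(mismatch_ratio * n_unlabeled))
    unlabeled = np.concatenate([
        rng.choice(novel_idx, size=n_novel, replace=False),
        rng.choice(np.setdiff1d(seen_idx, labeled),
                   size=n_unlabeled - n_novel, replace=False),
    ])
    return labeled, rng.permutation(unlabeled)

# e.g., one CIFAR-10 client with 6 seen classes, 100 labeled and 500 unlabeled
# images, and a mismatch ratio of 0.5:
# lab, unlab = mismatch_split(y_train, [2, 3, 4, 5, 9, 8], 100, 500, 0.5)
```

With mismatch_ratio = 1, the client's unlabeled pool contains only the 4 classes that never appear in its labeled data, which is the hardest setting discussed above.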
We have proposed a novel Fed-SSL algorithm for fully utilizing both labeled and unlabeled data in a heterogeneous FL setting, where the clients may have a small amount of labeled data and a large amount of unlabeled data. To overcome the challenges caused by non-i.i.d. data distributions and class distribution mismatch, we have introduced two regularization terms in (5) and developed the new Fed-SSL algorithm (Algorithm 1), which adopts variance reduction and normalized averaging techniques. We have proved that the proposed algorithm has a convergence rate of O(1/√T), where T is the number of communication rounds. Numerical experiments have shown that the proposed algorithm greatly outperforms existing Fed-SSL baselines and exhibits robust classification performance in scenarios with non-i.i.d. data and class distribution mismatch.

The Lagrangian function of (7) is given by (24), and the optimal solution should satisfy the condition in (25). Substituting (24) into (25), the condition can be rewritten accordingly. Then, considering the resulting value for each class j = 1, . . . , C, and combining the above two equations, for any j we obtain the corresponding relation. Thus, using condition (26), we can solve for the exponential term, and together with (27) we obtain the final result.

If g^t = 0, then according to (21) we have the corresponding condition on θ^t. Since v̂_k^{t+1} is the optimal solution of F_k(θ^t, v̂) in (7), the first-order optimality condition gives (30). Combining (29) and (30), for all θ and v̂_k ∈ V_k we obtain (31), which shows that {θ^t, v̂^t} is a stationary point of (5).

Next, we present two important lemmas about the descent of the objective with respect to θ and v̂, respectively. For ease of presentation, let us define the auxiliary variables Π^t and Ξ^t.

Proof. Substituting (16) into (14), the relation between the iterates {θ^t} is given by (32), where the last equality is due to (11). Using (32), and then the Cauchy-Schwarz inequality and Assumptions 2-3, we obtain a bound which, by the definitions of Π^{t+1} and Ξ^t, gives the final result.

Proof. Summing θ_k^{t,s} over s = 0 to q − 1 in (16) and using θ^{t,0} = θ^t yields the corresponding relation. Recalling the definition of d_k^t in (12), we obtain (35). Substituting (35) into (34) and using the Cauchy-Schwarz inequality give rise to (36). Using the convexity of ‖·‖², we obtain a first bound, and by a similar argument a second one. Substituting these two inequalities into (36) and using Assumptions 2-3 yields (37). Combining the definition of Ξ^t with (37), and letting a_τ̂ = (τ̂ − 1)(2τ̂ − 1) with τ̂ = max{τ_1, . . . , τ_K}, we have τ_k − 1 ≤ a_τ̂ τ_k since τ_k ≥ 1. Since Σ_{k=1}^K ω_k = 1, and according to the definition of B_1, we obtain a further bound. Using the convexity of ‖·‖², dropping a negative term on the right-hand side, and then using (18) in Assumption 2, we obtain an inequality in η. Since η ≤ 1/(L√(11a_τ̂)), it follows that 1 − 3η²L²a_τ̂ ≥ 8/11 and η²L²a_τ̂/(1 − 3η²L²a_τ̂) ≤ 1/8. After rearranging, this gives the final result.

Proof. From Assumption 2, taking an expectation over the samples gives (44). Next, we bound the terms B_2 and B_3 on the right-hand side of (44). According to (32) and (19) in Assumption 3, we have an expression whose last equality is based on the elementary identity −⟨a, b⟩ = (1/2)(‖a − b‖² − ‖a‖² − ‖b‖²); then, using the convexity of ‖·‖² and the L-smoothness assumption, we obtain (46). Substituting (46) into (45), we get (47). Next, let us bound the term B_2 on the right-hand side of (44). Using (32), we obtain a bound whose inequality follows from the Cauchy-Schwarz inequality.
Using (19)-(20) in Assumption 3, we obtain (49). Using (44), (47) and (49), we derive a combined bound. According to the condition η ≤ 1/(2Lτ̄) in Lemma 4, and using Lemma 2 together with η ≤ 1/(2Lτ̄), the bound (51) can be relaxed to (52). Substituting (33) of Lemma 3 into (52), we get (53). Since η ≤ 1/(11La_τ̂), (53) can be relaxed further. According to (5a), we have the corresponding decomposition of the objective. Since (7) shows that v̂_k^{t+1} is the optimal solution, the first-order optimality condition implies (56). Because L_CE(θ; u_k, v̂_k) is linear in v̂_k and r_1(v̂_k) is strongly convex in v̂_k, Assumption 1 yields (57). Substituting (56) into (57) gives rise to (58). Combining (55) and (58), and then using (54), (55) and (59), we obtain the descent estimate; when η ≤ 4µα_1/(τ̄ + a_τ̂L²), it follows that µα_1/2 − ηL²a_τ̂/8 ≥ ητ̄/8. Thus, we obtain the final result.

Based on (43) of Lemma 4, summing the corresponding inequality from t = 1 to T yields a bound on min_{t∈{1,...,T}} g^t. Since 0 is a lower bound for the cross-entropy loss F(θ, v̂) and for Π^T, and since Π^1 = 0 by the definition of Π^t, substituting the step size η specified in Theorem 1 into the right-hand side of (62) gives the final result.

References
[1] Communication-efficient learning of deep networks from decentralized data
[2] Federated learning: Challenges, methods, and future directions
[3] Federated machine learning: Concept and applications
[4] Distributed learning in the nonconvex world: From batch data to streaming and beyond
[5] Federated optimization: Distributed machine learning for on-device intelligence
[6] Local SGD converges fast and communicates little
[7] On the convergence of FedAvg on non-IID data
[8] First analysis of local GD on heterogeneous data
[9] SCAFFOLD: Stochastic controlled averaging for federated learning
[10] Federated optimization in heterogeneous networks
[11] Variance reduced local SGD with lower communication complexity
[12] Tackling the objective inconsistency problem in heterogeneous federated optimization
[13] Semi-supervised learning of classifiers: Theory, algorithms, and their application to human-computer interaction
[14] Semi-supervised learning literature survey
[15] Semi-supervised learning
[16] Unsupervised word sense disambiguation rivaling supervised methods
[17] Understanding the Yarowsky algorithm
[18] Analysis of semi-supervised learning with the Yarowsky algorithm
[19] Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks
[20] Semi-supervised learning with ladder networks
[21] Temporal ensembling for semi-supervised learning
[22] Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results
[23] Virtual adversarial training: A regularization method for supervised and semi-supervised learning
[24] MixMatch: A holistic approach to semi-supervised learning
[25] Semi-supervised learning under class distribution mismatch
[26] Realistic evaluation of deep semi-supervised learning algorithms
[27] Safe deep semi-supervised learning for unseen-class unlabeled data
[28] Distributed semi-supervised learning with kernel ridge regression
[29] Distributed and asynchronous methods for semi-supervised learning
[30] Distributed semi-supervised support vector machines
[31] Distributed online semi-supervised support vector machine
[32] Fully decentralized semi-supervised learning via privacy-preserving matrix completion
[33] A distributed semi-supervised learning algorithm based on manifold regularization using wavelet neural network
[34] Multi-agent distributed optimization via inexact consensus ADMM
[35] On nonconvex decentralized gradient descent
[36] A survey towards federated semi-supervised learning
[37] Exploiting unlabeled data in smart cities using federated learning
[38] Federated semi-supervised learning with inter-client consistency & disjoint learning
[39] Federated semi-supervised learning for COVID region segmentation in chest CT using multi-national data from China, Italy, Japan
[40] Benchmarking semi-supervised federated learning
[41] Joint optimization framework for learning with noisy labels
[42] Regularizing neural networks by penalizing confident output distributions
[43] Clustering by orthogonal NMF model and non-convex penalty optimization
[44] SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives
[45] Logarithmic regret algorithms for strongly convex repeated games
[46] Block stochastic gradient iteration for convex and nonconvex optimization
[47] Demystifying model averaging for communication-efficient federated matrix factorization
[48] Visualizing the loss landscape of neural nets