key: cord-0426206-uig0ad2u
authors: Yang, Seunghan; Park, Hyoungseob; Byun, Junyoung; Kim, Changick
title: Robust Federated Learning with Noisy Labels
date: 2020-12-03
journal: nan
DOI: 10.1109/mis.2022.3151466
sha: e99d63f30cbb340a448f2530896ed65ba6355b7e
doc_id: 426206
cord_uid: uig0ad2u

Federated learning is a paradigm that enables local devices to jointly train a server model while keeping the data decentralized and private. In federated learning, since local data are collected by clients, there is little guarantee that the data are correctly annotated. Although many studies have addressed training networks that are robust to such noisy data in a centralized setting, these algorithms still suffer from noisy labels in federated learning. Compared to the centralized setting, clients' data can have different noise distributions due to variations in their labeling systems or the background knowledge of users. As a result, local models form inconsistent decision boundaries and their weights severely diverge from each other, which are serious problems in federated learning. To solve these problems, we introduce a novel federated learning scheme in which the server cooperates with local models to maintain consistent decision boundaries by interchanging class-wise centroids. These centroids are central features of local data on each device, which are aligned by the server in every communication round. Updating local models with the aligned centroids helps to form consistent decision boundaries among local models, even though the noise distributions in clients' data differ from each other. To improve local model performance, we introduce a novel approach to select confident samples that are used for updating the model with the given labels. Furthermore, we propose a global-guided pseudo-labeling method to update the labels of unconfident samples by exploiting the global model.
Our experimental results on the noisy CIFAR-10 dataset and the Clothing1M dataset show that our approach is noticeably effective in federated learning with noisy labels.

Figure 1: Test accuracy on the CIFAR-10 dataset at various noise ratios in the centralized setting (solid line) and the federated setting (dotted line). For federated learning with noisy labels, we distribute noisy data to clients in an i.i.d. fashion. Co-teaching and Joint Optimization are novel methods for the centralized setting, but these algorithms combined with FedAvg (McMahan et al. 2017) suffer from performance degradation in the federated setting. Best viewed in color.

Modern edge devices such as smartphones can access an abundant amount of data, which is suitable for training deep learning models. Because conventional centralized learning requires each client device to transmit its local data to the central server, it can lead to serious data privacy issues. To address these problems, federated learning has been actively studied to shift the learning environment from the central server to each edge device. In detail, federated learning allows a server model to be trained on each client's private data without transmitting raw data to the server. The federated learning paradigm consists of two stages: 1) At the beginning of each round, a server broadcasts the server model to selected clients, and these clients train models on their own data for multiple iterations. 2) After the clients train their models, the server aggregates the clients' model parameters. The above process iterates until the global model converges. In FedAvg (McMahan et al. 2017), the clients' model parameters are aggregated element-wise with coefficients proportional to the local dataset sizes. The global model converges effectively under FedAvg, especially when the local datasets follow an i.i.d. distribution.
Many studies have been conducted to apply federated learning to practical applications, e.g., dealing with non-i.i.d. data (Li et al. 2018; Zhao et al. 2018; Shoham et al. 2019; Li et al. 2020b), noisy communication (Ang et al. 2020), domain adaptation (Peng et al. 2020), fair resource allocation (Li et al. 2020a), and continual learning (Yoon et al. 2020). Although the above studies try to solve practical issues related to preserving privacy, problems remain when local devices are used for training neural networks. In practice, because of privacy issues, all local data must be annotated by alternative labeling techniques such as exploiting machine-generated labels (Kuznetsova et al. 2018). These labels are inevitably corrupted unless the labeling techniques of all clients are accurate. Similarly, in the centralized setting, robust learning with noisy labels has attracted attention because of its applicability to realistic situations, and various algorithms have been proposed to train models accurately in the presence of noise. Recent algorithms have tried to minimize the effect of noisy labels by sampling reliable data (Wei et al. 2020; Huang et al. 2019; Guo et al. 2018), updating labels (Yi and Wu 2019), or estimating labels from matched prototypes (Han, Luo, and Wang 2019; Lee et al. 2018). These approaches have evolved to train models successfully with noisy labels.

The aforementioned approaches for dealing with noisy labels suffer from performance degradation in the federated setting, as illustrated in Fig. 1. Unlike in centralized learning, the noise distributions in clients' data can differ from each other because of discrepancies between their labeling systems or background knowledge. As a result, local models form inconsistent decision boundaries and their weights severely diverge from each other, i.e., weight divergence.
This makes local models difficult to aggregate, which is a serious problem in federated learning (Li et al. 2018; Chen, Bhardwaj, and Marculescu 2020; Lim et al. 2020). Therefore, in federated learning with noisy labels, the different noise distributions across clients should be considered, and the learning directions of clients' models should be kept similar.

To treat these difficulties, we introduce a new federated learning scheme in which the server cooperates with local models to maintain consistent decision boundaries by interchanging class-wise centroids, as described in Fig. 2. In detail, we store local class-wise centroids, which are central features of local data, on each device and upload them to the server in every round. The server aggregates them into global centroids and broadcasts these centroids to clients. The centroids are used to update local models so that they maintain decision boundaries consistent with those of other clients, even though the noise distributions in clients' data differ from each other. In local updates, we compute local centroids from samples with relatively small losses to reduce the effect of noisy data. We adjust these centroids based on their similarity to the global centroids to prevent them from being corrupted by representations of noisy data. Based on the centroids, we select confident samples to keep the model from fitting noisy labels. We also use the global model to correct the given labels of unconfident samples, which alleviates overfitting to noisy samples in each local model.

To the best of our knowledge, this is the first federated learning algorithm dealing with noisy labels. We present a new federated learning scheme that interchanges additional information called centroids and propose novel algorithms for reducing the effect of noisy data. Our approach maintains high performance at various noise ratios in the federated setting (Fig. 1).
Federated learning has drawn considerable attention in recent years because of the increasing number of edge devices and the local data collected by them. Federated learning aims to fully utilize the information in each local dataset without causing serious data privacy and communication issues, by transmitting the local network's parameters instead of local raw data. To avoid those issues, federated learning imposes several restrictions, which raise various problems: 1) statistical challenges (non-i.i.d. data), 2) lower network bandwidth, 3) inconsistent accuracy across devices, and 4) noisy communication. FedProx (Li et al. 2018), FedMA, and research on the convergence of FedAvg (Li et al. 2020b) focus on algorithms that make the model converge on non-i.i.d. data. To cope with limited network bandwidth, DGC (Lin et al. 2018), signSGD (Bernstein et al. 2018), and STC (Sattler et al. 2019) transmit only important gradient changes. To maintain uniform accuracy across clients, Li et al. (2020a) propose fair resource allocation. Ang et al. (2020) aim to cope with the disturbance caused by noisy communication. Furthermore, studies on specific privacy-preserving tasks have increased in recent years, e.g., domain adaptation (Peng et al. 2020) and continual learning (Shoham et al. 2019; Yoon et al. 2020). The aforementioned studies assume that every client has a clean dataset. However, correct annotation is not guaranteed since the local data are created by clients. Therefore, we consider local datasets that contain data with noisy labels and propose an algorithm to deal with them.

There are many studies on the robustness of networks against noisy labels in the centralized setting. Since deep networks have sufficient capacity to fit an entire noisy dataset (Zhang et al. 2016), it is essential to develop training methods that are robust to noisy labels.
Noise-cleaning-based approaches (Guo et al. 2018; Huang et al. 2019; Lyu and Tsang 2019; Wei et al. 2020) aim to detect noisy samples and train with clean samples. In particular, Co-teaching introduces a pair of networks with the same structure, and each network guides its peer network by using its small-loss instances. Among the label correction approaches (Yi and Wu 2019), Joint Optimization updates all dataset labels with pseudo-labels to prevent the network from fitting the noisy dataset. A more recent line of research focuses on label correction by exploiting the representation power of the network to distinguish clean labels (Lee et al. 2018; Han, Luo, and Wang 2019). Deep self-learning (Han, Luo, and Wang 2019) determines the label of a sample by comparing its features with several prototypes of the categories. Furthermore, meta-learning-based methods (Ren et al. 2018; Li et al. 2019; Shu et al. 2019) focus on optimizing parameters that are less prone to overfitting, and another effective approach is to design robust loss functions.

Previous algorithms for noisy labels aim to train networks in the centralized setting, not the federated setting. In the noisy federated setting, clients have data with various noise distributions, and this can result in inconsistent decision boundaries and severe weight divergence in local models. To tackle these problems, we let the server cooperate with the clients to maintain consistent decision boundaries via class-wise centroids. By exploiting the centroids, we ensure that all local models have similar feature representations of classes. Moreover, we propose an algorithm for selecting confident samples and a self-training scheme suitable for the federated setting.

In this section, we start with the problem definition, then describe our proposed local update and global update methods.
In the federated setting with multiple clients and a global server, the local training data $D_k$ of the k-th client consist of $n_k$ images $x^i_k$ and the corresponding labels $y^i_k$, and the server cannot access any training data. In the noisy federated learning scenario, local datasets inevitably contain noisy samples, for which some of the given labels are not accurate, and the noise distributions in clients' data differ from each other. The models can overfit to noisy data and suffer from weight divergence, leading to aggregation difficulties in federated learning (Lim et al. 2020).

To solve the above problem, we introduce global and local class-wise centroids, which are the central features of each class on the server and on the clients, respectively. Local centroids are the average feature vectors from the global average pooling layer in each local dataset, and global centroids are calculated by reflecting the local centroids of selected clients, as described in detail in the next sections. We denote the global centroids and the local centroids of the k-th client corresponding to class c by $f^c_G$ and $f^c_k$. In addition, $y^i_k$ and $\hat{y}^i_k$ indicate the one-hot vector of the ground truth label and a pseudo-label extracted by the softmax layer, respectively.

At the beginning of each round, selected clients receive both the global model parameters and the global class-wise centroids from the server for local updates. The selected clients load the global model parameters, and their models are then trained on their own local datasets by minimizing the classification loss of Eq. 1. In Eq. 1, $F_k$ and $C_k$ denote the feature extractor and the classifier of the k-th client, respectively, and $l_{ce}(\cdot)$ is the cross-entropy loss function. A binary mask vector of the k-th client, $m_k \in \{0, 1\}^{n_k}$, controls whether each sample is learned with the ground truth label or with the pseudo-label. We propose a novel sampling approach to select confident samples, which is used to update the mask $m_k$.
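The role of the binary mask can be illustrated with a small NumPy sketch. It is not the authors' implementation: it assumes that the per-sample loss is the cross-entropy against the given label when the mask is 1 and against the pseudo-label when the mask is 0, and that the result is averaged over the local samples. The function names `cross_entropy` and `masked_classification_loss` are hypothetical.

```python
import numpy as np

def cross_entropy(probs, targets, eps=1e-12):
    """Per-sample cross-entropy between softmax outputs and one-hot (or soft) targets."""
    return -np.sum(targets * np.log(probs + eps), axis=1)

def masked_classification_loss(probs, given_onehot, pseudo_labels, mask):
    """Mix two cross-entropy terms with a binary mask (assumed form, see lead-in).

    mask[i] == 1: sample i is trained with its given label.
    mask[i] == 0: sample i is trained with its pseudo-label.
    The mean over samples is an assumption of this sketch.
    """
    mask = mask.astype(probs.dtype)
    loss_given = cross_entropy(probs, given_onehot)
    loss_pseudo = cross_entropy(probs, pseudo_labels)
    return float(np.mean(mask * loss_given + (1.0 - mask) * loss_pseudo))
```

With an all-ones mask this reduces to ordinary cross-entropy on the given labels, which is the behaviour one would expect before any sample is flagged as unconfident.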
In addition, we introduce a global-guided pseudo-labeling method, which takes advantage of the federated setting. Instead of naive pseudo-labeling, we obtain $\hat{y}_k$ by exploiting the global model $F_G$ and $C_G$. This method improves local model performance while preventing the model from overfitting to noisy data. In parallel, each client loads the global centroids from the server and updates its model to have features similar to the global centroids. To achieve this, we calculate local centroids on each local model depending on their similarity to the global centroids, and we explicitly constrain local features to map onto the local centroids. Note that the server and all of the clients store centroids on their own devices and transmit them in each communication round. Class-wise centroids add a communication burden, but the amount is much smaller than that of the model parameters (0.01% to 0.03% in our experiments).

Local centroids. We use feature vectors extracted from $F_k$ to compute the local class-wise centroids $f_k$. If we calculate $f_k$ by using all local samples with their given labels, noisy labels have a negative effect on the correct formation of the centroids. Therefore, we introduce loss-based local centroids, which are motivated by (Arpit et al. 2017). We only use features of samples with relatively small losses to create accurate feature centroids. At first, we refine the dataset $D_k$ by selecting the $R(t)$ percentage of small-loss instances on each client as follows:

$$\tilde{D}_k = \arg\min_{\hat{D}_k : |\hat{D}_k| \ge R(t)\,|D_k|} l_{ce}(\hat{D}_k) \quad (2)$$

where $\hat{D}_k$ is the optimization variable, $|\cdot|$ stands for the cardinality of a set, i.e., the number of samples, and $R(t)$ controls how many small-loss samples should be selected in each round. Then, the k-th local model calculates the naive average features of each class, $\hat{f}^c_k$, from the small-loss samples as follows:

$$\hat{f}^c_k = \frac{1}{\tilde{n}^c_k} \sum_{(x^i_k,\, y^i_k) \in \tilde{D}_k} \mathbb{1}(y^i_k = c)\, F_k(x^i_k) \quad (3)$$

where $\tilde{n}^c_k$ is the number of samples corresponding to the label c in $\tilde{D}_k$, and $\mathbb{1}(\cdot)$ is the indicator function returning 1 for true statements and 0 otherwise.
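The two steps just described, keeping the small-loss fraction of the local samples and averaging their features per class, can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions rather than the authors' code: the helper names are hypothetical, the kept fraction is rounded up, and a class with no kept sample is given a zero vector (the text does not say how that case is handled).

```python
import numpy as np

def select_small_loss(losses, keep_ratio):
    """Return indices of the keep_ratio fraction of samples with the smallest loss."""
    n_keep = max(1, int(np.ceil(keep_ratio * len(losses))))
    return np.argsort(losses, kind="stable")[:n_keep]

def class_mean_features(features, labels, num_classes, keep_idx):
    """Average the feature vectors of the kept samples separately for each class.

    Classes without any kept sample keep a zero vector (assumption of this sketch).
    Returns the (num_classes, dim) means and the per-class sample counts.
    """
    means = np.zeros((num_classes, features.shape[1]), dtype=features.dtype)
    counts = np.zeros(num_classes, dtype=int)
    kept_features, kept_labels = features[keep_idx], labels[keep_idx]
    for c in range(num_classes):
        in_class = kept_labels == c
        counts[c] = int(in_class.sum())
        if counts[c] > 0:
            means[c] = kept_features[in_class].mean(axis=0)
    return means, counts
```

In a training loop, `losses` would be the per-sample cross-entropy of the current local model and `keep_ratio` would play the role of $R(t)$ for the current round.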
However, these average features may differ from those of the other clients. To avoid these undesirable deviations, we derive the local centroids $f^c_k$ in Eq. 4 as a weighted average that depends on the similarity between the global centroids $f^c_G$ and the average features $\hat{f}^c_k$. The function $\mathrm{sim}(\cdot, \cdot)$ can be any similarity function, but we choose cosine similarity for our experiments. $\hat{f}^c_k$ is calculated by Eq. 3, and the global centroids $f^c_G$ are transmitted from the server and reflect the centroids of all clients, as described in the next section.

We expect the class-wise centroids to be the central features of clean samples. At the beginning of training, deep networks tend to prioritize learning simple patterns first (Arpit et al. 2017), and we exploit this property to form global centroids that are less susceptible to noisy samples. After that, we update the local centroids to reflect their similarity to these global centroids. This similarity-based update can keep the centroids less corrupted by noisy data even after a large number of training epochs.

We exploit these local centroids to reduce the weight divergence of clients' models. In detail, we design the loss function of Eq. 5 to map the features of each confident sample onto the centroid of its class, using a binary mask $m^i_k$ of the k-th client that returns 1 for confident samples and 0 otherwise.

Confident samples. We introduce a sampling approach to select confident samples for training each client's model without a detrimental influence from noisy labels. To this end, we introduce feature similarity-based labels. Note that we should not fully trust the given labels because they may not be annotated accurately. Nor should we rely solely on the feature similarity-based labels, which can induce wrong labels for hard samples. The complementary use of feature similarity-based and ground truth labels can help to find accurate confident samples.
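One way to picture the complementary use of the two label sources is the sketch below. It rests on two assumptions that the text does not spell out: the similarity-based label of a sample is taken to be the class whose local centroid has the highest cosine similarity to the sample's feature vector, and a sample is treated as confident when that label agrees with its given label. The function names are hypothetical.

```python
import numpy as np

def cosine_similarity_matrix(features, centroids, eps=1e-12):
    """Cosine similarity between every feature vector and every class centroid."""
    f = features / (np.linalg.norm(features, axis=1, keepdims=True) + eps)
    c = centroids / (np.linalg.norm(centroids, axis=1, keepdims=True) + eps)
    return f @ c.T  # shape: (num_samples, num_classes)

def confident_mask(features, given_labels, centroids):
    """Mark a sample as confident when both label sources agree (assumed rule).

    Returns the binary mask and the similarity-based labels.
    """
    sim_labels = cosine_similarity_matrix(features, centroids).argmax(axis=1)
    mask = (sim_labels == given_labels).astype(np.int64)
    return mask, sim_labels
```

Because the rule is applied per sample, the number of confident samples is not fixed in advance for any class.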
We therefore consider the similarity-based labels obtained with the local centroids and the ground truth labels at the same time. By adopting the ground truth label and the similarity-based label together, we obtain the mask $m^i_k$ that marks confident samples. We exploit this mask for $L_c$ and $L_{cen}$ to reduce the impact of noisy samples. Since the number of confident samples is not fixed for each class, this mask can choose confident samples well regardless of the different noise ratios of the classes.

Global-guided pseudo-labeling. To fully utilize the local data information, we exploit the well-known label correction method of pseudo-labeling. Although this self-learning strategy with pseudo-labeling is powerful for label correction in the centralized setting, it leads local models to be self-biased (Arazo et al. 2019). Therefore, we propose global-guided pseudo-labeling, which corrects the labels of local data by employing the server model. Our technique for label estimation prevents local models from being self-biased. Each client receives the global model at broadcast time and uses it to generate the global-guided pseudo-labels $\hat{y}_k$, where $C_G$ and $F_G$ are the client's networks with the global parameters. After that, each client trains its network with these global-guided pseudo-labels by Eq. 1.

Finally, the k-th local model is trained to minimize the sum of three losses:

$$L^k = L^k_c + \lambda_{cen} L^k_{cen} + \lambda_e L^k_e$$

where $L^k_e$ is the entropy regularization of the prediction results. Note that this term pushes the probability distribution of each softmax output toward a single class. $p_i$ indicates the softmax output $C_k(F_k(x^i_k))$, and the loss $L^k_e$ is calculated as $-\sum_i p_i \log p_i$. $\lambda_{cen}$ and $\lambda_e$ indicate trade-off parameters. Our complete algorithm is illustrated in Fig. 3.

After the local update in each round, the clients upload their model parameters and local centroids to the server. We exploit FedAvg (McMahan et al. 2017) for weight aggregation, which is well known as an effective algorithm for i.i.d. data.
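The weighted averaging that FedAvg performs on the uploaded parameters can be sketched as follows. This is a generic NumPy illustration, not the authors' code: parameters are represented as a dictionary from a layer name to an array, and the function name `fedavg` and that representation are assumptions of the sketch.

```python
import numpy as np

def fedavg(client_params, client_sizes):
    """Average client parameters element-wise, weighted by local dataset size.

    client_params: list of dicts mapping a parameter name to an ndarray.
    client_sizes:  list with the number of local samples of each client.
    """
    total = float(sum(client_sizes))
    return {
        name: sum((n_k / total) * params[name]
                  for params, n_k in zip(client_params, client_sizes))
        for name in client_params[0]
    }
```

For example, two clients holding 1 and 3 samples with weights of all ones and all fours produce an average of 0.25 * 1 + 0.75 * 4 = 3.25 in every entry.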
For centroid aggregation, the server updates the global centroids by a similarity-based aggregation of the uploaded local centroids. This leads the server model to deal explicitly with the different noise ratios across clients. Moreover, since it performs a class-wise summation of local centroids, it is less affected by different noise ratios across classes.

Weight aggregation. We execute FedAvg (McMahan et al. 2017) for weight aggregation, which is suitable for an i.i.d. dataset. Since only noisy labels are added to i.i.d. data, we expect the FedAvg algorithm to work well enough in our experimental settings. FedAvg takes a weighted average of the local parameters $\theta_L$ as follows:

$$\theta_G = \sum_{k} \frac{n_k}{n}\, \theta^k_L$$

where $\theta_G$ denotes the global parameters, $\theta^k_L$ denotes the local parameters of the k-th client, and $n$ and $n_k$ indicate the total number of data and the number of the k-th client's data, respectively.

Global centroid aggregation. To tackle the different noise distributions across clients explicitly, we update the global centroids with a similarity-based summation of local centroids. In every round, the local centroids of the selected clients update the global centroids according to their similarity to the previous global centroids on the server. Let $K$ be the set of indices of the clients selected in the current round. The global centroids are then updated with weights $w^c_k$, where $w^c_k$ indicates the similarity between the stored global centroid of class c and the centroid of class c uploaded by the k-th client. Therefore, this weight update rule allows the global centroids to reflect the similarity with the local centroids, which depends on the client and the class. The complete pseudo-code is shown in the supplementary material.

We adopt the following federated setting from FedAvg (McMahan et al. 2017). We set the number of clients to 100 and distribute each dataset to clients in an i.i.d. fashion. We set the number of local epochs and the local mini-batch size to 5 and 50, respectively, considering communication efficiency and the memory limitations of local devices.

Datasets

CIFAR-10.
CIFAR-10 (Krizhevsky and Hinton 2009) is a benchmark dataset of 10 categories, which contains 50,000 images for training and 10,000 images for testing. We replace the ground truth labels of CIFAR-10 with two types of noisy labels: symmetric flipping (Van Rooyen, Menon, and Williamson 2015) and pair flipping, which are described in the supplementary material. Since our paper mainly focuses on robustness in federated learning at various noise ratios, the noise ratio is chosen from {0.1, 0.2, 0.3, 0.4, 0.5, 0.6} for symmetric flipping and {0.1, 0.2, 0.3, 0.4, 0.45} for pair flipping. In detail, we assign noisy labels to the entire dataset with the designated noise ratio, and then randomly distribute it to 100 clients. This process induces different noise distributions across clients, and we fix the seed for a fair comparison. In ablation studies, we experiment with extremely different noise ratios.

Clothing1M. Clothing1M (Xiao et al. 2015) is a large real-world dataset of 14 categories, which contains 1 million images of clothing with noisy labels since it is obtained from several online shopping websites. In (Xiao et al. 2015), it is reported that the overall noise ratio is approximately 38.46%. The Clothing1M dataset also contains 50k, 14k, and 10k clean samples for training, validation, and testing, respectively, but we do not use the clean training data. For federated learning, we randomly divide the Clothing1M dataset into 100 groups, one per client, and we set them as local datasets.

(Deng et al. 2009) by following (Han, Luo, and Wang 2019). To prevent overfitting to the small number of training data in each local dataset, we augment the training data by resizing, normalizing, and cropping images. The detailed experimental setup is described in the supplementary material.

Analysis

CIFAR-10. We report the results of CIFAR-10 with symmetric flipping and pair flipping in Table 1.
As shown in Table 1, our method achieves better overall test accuracy at various noise ratios. Co-teaching selects a fixed number of loss-based samples, which is vulnerable to different noise distributions across clients. It causes serious performance degradation of the server model since each client model is affected by noisy data. Joint Optimization follows a self-training scheme, which is a naive pseudo-labeling method. In federated learning, this self-training method may lead the network to be self-biased due to the large number of local epochs, and the weights of self-biased local models can diverge severely. Notably, in extremely noisy cases, clients cannot be trained properly due to the high noise ratios, and when these local parameters are aggregated on the server, performance deteriorates further. Our proposed method also depends on a loss-based algorithm but is robust to different noise distributions across clients because of the similarity-based centroid update. Moreover, we exploit a global-guided pseudo-labeling method, which mitigates the self-bias of each client, and we validate the effectiveness of this algorithm in ablation studies. Furthermore, neither of the two previous methods can guarantee that all local models are trained to form similar decision boundaries, which makes the aggregation of local models unsuccessful. Our proposed method constrains the class representations of all client models not to diverge from the global class representations. This induces all local models to be trained to have similar boundaries, and we demonstrate the efficacy of the algorithm in noisy federated learning.

Clothing1M. Different from the artificial noise in CIFAR-10, Clothing1M is a real-world noisy label dataset that includes a lot of noise with unknown structure. By comparing the results in Table 2, we can see that our proposed method outperforms the others by a large margin in the federated setting.
Deep self-learning (Han, Luo, and Wang 2019), which corrects the labels of the data by comparing the similarity of their features with several prototypes, achieves a large performance improvement in the centralized setting. However, in the federated setting, this algorithm suffers from significant performance degradation since it does not constrain local models to have similar decision boundaries. We compute the centroids from samples with relatively small losses and update them to be similar to the global centroids. This keeps the centroids less corrupted by noisy data and makes the local decision boundaries similar to those of other clients.

We conduct ablation studies to show that each proposed algorithm is effective in federated learning with noisy labels.

Different noise ratios in clients. In the federated setting, clients may have different amounts of noise because of the discrepancy between clients' labeling systems. In detail, we split clients into five groups and assign different noise ratios. We set a noise variance η, then divide the noise range [noise ratio − η, noise ratio + η] equally into five noise ratios and assign one noise ratio to each group. For example, if we set the noise ratio and the noise variance η to 0.4 and 0.2, respectively, each group is assigned one of the noise ratios {0.2, 0.3, 0.4, 0.5, 0.6}. In Table 3, we experiment by fixing the noise ratio to 0.4 and changing the noise variance. Our approach achieves consistent performance regardless of the different noise ratios across clients.

Different noise ratios in classes. Due to the background knowledge of the client user, the local data can have different noise ratios across classes. We assume an extreme situation where each client has totally erroneous samples for a single class. In detail, we force each client to have entirely wrong labels for a single random class and replace the labels of the other classes with noisy labels of the designated noise ratio.
We show that the proposed algorithm is robust to different noise ratios across classes in Table 3.

Confident samples. We evaluate the effectiveness of our sampling approach in Table 4. Note that we set the noise ratio to 0.4 by using symmetric flipping. As shown in Table 4, the mask for confident samples and the pseudo-labeling method complement each other. The network trained only with the selected samples, i.e., with unconfident samples removed, performs better than the one trained with all samples, and the performance increases considerably when the labels of the unconfident samples are replaced by pseudo-labels. Moreover, we show the precision and recall of noisy label detection for our sampling approach in Table 5. Precision is the number of correctly detected noisy labels over the total number of detected noisy labels, and recall is the number of correctly detected noisy labels over the total number of noisy labels in the data. Our sampling approach selects confident samples by the complementary use of feature similarity-based and ground truth labels. This leads to high precision in detecting noisy labels.

Global-guided pseudo-labeling. We have conducted experiments in which global-guided pseudo-labeling is removed or replaced with naive pseudo-labeling (Table 4). Although the self-learning strategy with pseudo-labeling is powerful for label correction in centralized learning, it leads local models to be self-biased toward their own datasets. Our global-guided pseudo-labeling outperforms the naive approach by preventing local models from being self-biased.

Interchanging class-wise centroids. To show the effectiveness of interchanging centroids, we run our algorithm with local models updated without using the global centroids. In detail, we use the loss function of Eq. 5 but calculate the local centroids without using Eq. 4, so local models are not explicitly constrained to have similar boundaries.
Figure 4 shows that the global centroids help all clients to have similar feature representations, which reduces weight divergence.

In this paper, we have considered that each local dataset may contain noisy labels in the practical federated learning scenario. Our proposed approach is to interchange additional information, namely the global and local feature centroids of classes. We demonstrate that our approach is clearly effective in the noisy federated setting by reducing weight divergence. Moreover, we propose a novel algorithm for local updates that includes similarity-based confident example sampling and global-guided pseudo-labeling. In extensive experiments, we have shown that our approach outperforms existing state-of-the-art methods on CIFAR-10 and Clothing1M.

Our study suggests a practical learning scenario, especially learning with noisy labels. Based on our proposed federated learning with noisy labels, contributors in various fields such as healthcare, fairness, and recommendation systems can indirectly benefit from our global-guided update scheme. In the case of healthcare, medical data with wrong annotations can be a potential threat to a smart healthcare system. Our scheme can help prevent medical accidents caused by erroneous data in federated learning, and it could play a crucial role in the development of smart healthcare. Our approach promotes the trend of shifting the learning environment from the central server to various edge devices by allowing models to learn without precise data. It enables client-side learning that does not require an expert to annotate specialized data for each device. In our setting, we only focus on dealing with noisy labels on i.i.d. data. More work is needed on noisy labels in federated learning with non-i.i.d. data.

Han, J.; Luo, P.; and Wang, X. 2019. Deep self-learning from noisy labels. In Proceedings of the IEEE International Conference on Computer Vision.
60,000 images for training and 10,000 images for testing. We replace the ground truth labels of MNIST with two types of noisy labels: symmetric and pair flipping. We implemented the 9-layer CNN used in Co-teaching. In Table 6, our proposed approach maintains high performance at various noise ratios.

Image variances in different domains. In the real world, data would differ from client to client not only in the noise in the labels but also in the features themselves. We investigate this situation in federated learning. We assume that clients are in different environments, especially with respect to light intensity, e.g., a photo of a car on one client would differ in light intensity from that on other clients, as illustrated in Fig. 5. In detail, we use the noisy CIFAR-10 dataset with noise ratio 0.4 and scale the light intensity of each client's dataset by a specific factor between 0.5 and 1.5 by using the ImageEnhance module in the PIL package. Our algorithm achieves 88.4% although noise exists in both the labels and the features (88.7% in the original setting).

Real federated learning scenario. For centralized learning, transmitting medical data from each medical center to the central server causes privacy issues. In consideration of these issues, we evaluate our algorithm under the assumption that the medical dataset stays in each medical center. We choose the COVIDx dataset (Wang and Wong 2020), which consists of 15,282 chest X-ray images, and distribute the dataset to 100 medical centers (clients). Our approach achieves 93.5%, which is comparable to the centralized setting (93.6%).

Number of participating clients. In real-world federated learning scenarios, the population base can be significantly larger and a considerably smaller portion can be selected in every round. We provide experimental results on the noisy CIFAR-10 dataset with noise ratio 0.4 by changing the number of participating clients per round in Table 7.
Note that the client population is 100. Even when only two clients participate in each communication round, our approach shows comparable performance because it reduces the weight divergence of the clients' models.

Computational cost. Since our proposed algorithm increases the computational cost of both local and global updates, we measure the time from the start of one round to the start of the next. As shown in Table 8, the speed of our algorithm is similar to that of Joint Optimization. The similarity-based updates and the use of confident samples add only a marginal computational cost. Since Co-teaching trains two networks, it takes longer than the methods that use only one network.

Performance dependency on hyper-parameters. We use the noisy CIFAR-10 dataset with symmetric flipping at a noise ratio of 0.4. We set T_pl, λ_e, λ_cen, τ, and T to 100, 0.8, 1.0, 0.4, and 10, respectively. We then conducted experiments that vary each of these hyper-parameters. As shown in Fig. 6, the prediction accuracy is robust to all hyper-parameters except T_pl, the round at which pseudo-labels start to replace ground truth labels. Since the network cannot generate accurate pseudo-labels ŷ at the early stage, networks trained with a low T_pl achieve lower performance than networks trained with a high T_pl.
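To make the roles of T, τ, and T_pl concrete, the following minimal Python sketch evaluates the keep-ratio schedule R(t) = 1 − min{(t/T)τ, τ} from Algorithm 1 with the values stated above (T = 10, τ = 0.4, T_pl = 100). The function names keep_ratio and use_pseudo_labels are hypothetical and chosen only for illustration; they are not taken from the authors' code.

```python
def keep_ratio(t: int, T: int = 10, tau: float = 0.4) -> float:
    """R(t) = 1 - min((t / T) * tau, tau): fraction of small-loss samples kept in round t."""
    return 1.0 - min(t / T * tau, tau)


def use_pseudo_labels(t: int, T_pl: int = 100) -> bool:
    """Hypothetical helper: True once round t has reached the start round T_pl."""
    return t >= T_pl


if __name__ == "__main__":
    # Print the schedule at a few rounds to see how it flattens after T rounds.
    for t in (1, 5, 10, 50, 100):
        print(t, round(keep_ratio(t), 2), use_pseudo_labels(t))
```

With these values, the fraction of kept samples falls from 0.96 in the first round to 0.6 from round 10 onward, and pseudo-labels start being used at round 100.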
Robust Federated Learning With Noisy Communication
Pseudo-labeling and confirmation bias in deep semi-supervised learning
A closer look at memorization in deep networks
signSGD with majority vote is communication efficient and fault tolerant
FedMAX: Mitigating Activation Divergence for Accurate and Communication-Efficient Federated Learning
ImageNet: A large-scale hierarchical image database
CurriculumNet: Weakly supervised learning from large-scale web images
Co-teaching: Robust training of deep neural networks with extremely noisy labels
Federated Adversarial Domain Adaptation
Training deep neural networks on noisy labels with bootstrapping
Learning to reweight examples for robust deep learning
Robust and communication-efficient federated learning from non-iid data
Overcoming Forgetting in Federated Learning on Non-IID Data
Learning an explicit mapping for sample weighting
Joint optimization framework for learning with noisy labels
Learning with symmetric label noise: The importance of being unhinged
Federated Learning with Matched Averaging
COVID-Net: A Tailored Deep Convolutional Neural Network Design for Detection of COVID-19 Cases from Chest X-Ray Images
Symmetric cross entropy for robust learning with noisy labels
Combating noisy labels by agreement: A joint training method with co-regularization
Learn Probabilistic end-to-end noise correction for learning with noisy labels
Federated Continual Learning with Adaptive Parameter Communication
Understanding deep learning requires rethinking generalization
Generalized cross entropy loss for training deep neural networks with noisy labels

We ported the official code of Joint Optimization (Tanaka et al. 2018) to PyTorch and reproduced the model of Deep self-learning (Han, Luo, and Wang 2019) according to the paper. In our federated setting, we set the number of clients to 100, and 10 clients are selected in every round.
The batch size and the number of local epochs are 50 and 5, and we trained the network for 100, 1000, and 40 rounds on MNIST, CIFAR-10, and Clothing1M, respectively.

We used SGD optimizer with a momen
We determined balance parameters (λ_cen and λ_e) and T_pl based on ablation studies and the previous work
The initial learning rate is 0.25 for MNIST and CIFAR-10, and 0.01 for Clothing1M. For the Clothing1M dataset, the learning rate was decreased by a factor of 10 every 10 rounds
MNIST and CIFAR-10, and 0.001 for Clothing1M. For Clothing1M, the learning rate was decreased by a factor of 10 every 10 rounds
We determined balance parameters (λ_α and λ_β), start epoch, and learning rate based on
We set α, β, and start epoch to 1.2, 0.8, and 100, respectively, in MNIST and CIFAR-10. For Clothing1M, we used a
followed the paper (Han, Luo, and Wang 2019) to choose hyper-parameters except the number of selected samples and prototypes
is a benchmark dataset of 10 categories, which contains

Algorithm 1: Robust Federated Learning with Noisy Labels
Input: global weights θ_G, global centroids f_G, learning rate η, start round that uses pseudo-labels T_pl
Server executes:
    initialize θ_G;
    for each round t = 1
        Update global weights θ_G by Eq
        Update R(t) = 1 − min{(t/T)τ, τ}
R(t))..., E do
    Shuffle training set D_k
    N_max do
        Fetch mini-batch D_{k,j} from D_k
        Obtain small-loss sets D̂_{k,j} by Eq. 2 from D_{k,j};
        Obtain confident mask vector m_k
        // Replacing pseudo-labels with ground truth labels
        Update local weights θ_k by minimizing Eq
        Obtain loss-based average features f̂_{k,j} by Eq. 3 from D̂_{k,j}
        Update local centroids f_{k,j} by Eq
    Load f_{k,0} ← f_{k,N_max};
Output: θ_k and f_k

Supplementary Material: Robust Federated Learning with Noisy Labels

Algorithm details

Initialization of global centroids. Since we update local centroids using their similarities with global centroids, randomly initialized global centroids hinder local models from deriving accurate local centroids.
For this reason, the average features f̂_k are used instead of global centroids in the first round. After that, global centroids are computed by aggregating the clients' local centroids.

Loss-based centroids. At the beginning of training, deep networks tend to learn simple patterns first (Arpit et al. 2017). We exploit this property by keeping more instances at the start, i.e., R(t) is large in Eq. 2. As training proceeds, we gradually reduce R(t) to prevent local models from fitting noisy samples, following prior work. We set T and τ to 10 and ε, respectively, in our experiments, and we show that our approach is robust to these parameters in the experimental section.

Pseudo-labels. In Eq. 1, we train the network on unconfident samples by using pseudo-labels. At the early stage, the network cannot generate accurate pseudo-labels ŷ because it has not been trained long enough. Therefore, we initially use the ground truth labels y in place of the pseudo-labels ŷ. Once the number of rounds reaches a predefined number (T_pl), we start using pseudo-labels and train the network jointly with y and ŷ.

Scheduling for λ_cen. To avoid a noisy mask at the early stage of training, we initialize λ_cen to 0 and gradually increase it to a predefined value.

Mini-batch algorithm. We modify Eq. 4 for mini-batch SGD as follows:

where f_{k,j} indicates the local centroids at the j-th iteration. The full algorithm for our local and global updates is given in Algorithm 1. Note that f_{k,0} indicates the global centroids f_G in the first local epoch.

Since CIFAR-10 (Krizhevsky and Hinton 2009) and MNIST (LeCun et al. 1998) are clean, following (Reed et al. 2014; Patrini et al. 2017), we manually corrupt these datasets with the label transition matrix Q, where Q_ij = Pr(ỹ = j | y = i) given that the noisy label ỹ is flipped from the clean label y. For symmetric flipping, we inject the symmetric label noise as follows:

where n is the number of classes and ε indicates the noise ratio.
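As an illustration of symmetric flipping, the following minimal Python sketch builds a transition matrix under a commonly used convention, which is an assumption here: each clean label keeps its class with probability 1 − eps and flips uniformly to one of the other n − 1 classes. The function name symmetric_transition_matrix and the argument name eps (the noise ratio) are hypothetical.

```python
from typing import List


def symmetric_transition_matrix(n: int, eps: float) -> List[List[float]]:
    """Assumed form: Q[i][i] = 1 - eps and Q[i][j] = eps / (n - 1) for j != i."""
    if n < 2 or not 0.0 <= eps <= 1.0:
        raise ValueError("need n >= 2 and 0 <= eps <= 1")
    off_diagonal = eps / (n - 1)
    return [
        [1.0 - eps if i == j else off_diagonal for j in range(n)]
        for i in range(n)
    ]


# Example with 10 classes and noise ratio 0.4, as in the CIFAR-10 experiments.
Q = symmetric_transition_matrix(n=10, eps=0.4)

# Each row i is a probability distribution Pr(noisy = j | clean = i).
assert all(abs(sum(row) - 1.0) < 1e-9 for row in Q)
```

Under this convention the noise ratio equals the total probability that a label leaves its original class, regardless of the number of classes.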
Pair flipping is a well-known noise generation method that focuses on fine-grained classification with noisy labels, and its noise transition matrix Q is obtained as follows:
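As an illustration, the following minimal Python sketch builds a pair-flipping transition matrix and samples noisy labels from it. It assumes a commonly used convention in which class i keeps its label with probability 1 − eps and otherwise flips to the single class (i + 1) mod n; the direction of the flip, the function names, and the value eps = 0.4 are assumptions made for this example.

```python
import random
from typing import List


def pair_transition_matrix(n: int, eps: float) -> List[List[float]]:
    """Assumed form: class i stays with prob. 1 - eps, else flips to (i + 1) mod n."""
    if n < 2 or not 0.0 <= eps <= 1.0:
        raise ValueError("need n >= 2 and 0 <= eps <= 1")
    Q = [[0.0] * n for _ in range(n)]
    for i in range(n):
        Q[i][i] = 1.0 - eps
        Q[i][(i + 1) % n] = eps
    return Q


def corrupt_labels(labels: List[int], Q: List[List[float]], seed: int = 0) -> List[int]:
    """Sample one noisy label per clean label y from the row Q[y]."""
    rng = random.Random(seed)
    classes = list(range(len(Q)))
    return [rng.choices(classes, weights=Q[y], k=1)[0] for y in labels]


# Toy label set: 10 classes, 1,000 samples per class.
clean = [i % 10 for i in range(10_000)]
noisy = corrupt_labels(clean, pair_transition_matrix(10, 0.4))

# The empirical fraction of flipped labels should be close to eps.
flipped = sum(c != z for c, z in zip(clean, noisy)) / len(clean)
print(f"empirical noise ratio: {flipped:.3f}")
```

The same corrupt_labels helper works with any row-stochastic matrix Q, so it can also be used with a symmetric transition matrix.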