key: cord-0619971-b3jwh6as authors: Chen, Yujing; Chai, Zheng; Cheng, Yue; Rangwala, Huzefa title: Asynchronous Federated Learning for Sensor Data with Concept Drift date: 2021-09-01 journal: nan DOI: nan sha: 09d1aeaeece8c6fe551eab7a330fd706da3236dd doc_id: 619971 cord_uid: b3jwh6as Federated learning (FL) involves multiple distributed devices jointly training a shared model without any of the participants having to reveal their local data to a centralized server. Most of previous FL approaches assume that data on devices are fixed and stationary during the training process. However, this assumption is unrealistic because these devices usually have varying sampling rates and different system configurations. In addition, the underlying distribution of the device data can change dynamically over time, which is known as concept drift. Concept drift makes the learning process complicated because of the inconsistency between existing and upcoming data. Traditional concept drift handling techniques such as chunk based and ensemble learning-based methods are not suitable in the federated learning frameworks due to the heterogeneity of local devices. We propose a novel approach, FedConD, to detect and deal with the concept drift on local devices and minimize the effect on the performance of models in asynchronous FL. The drift detection strategy is based on an adaptive mechanism which uses the historical performance of the local models. The drift adaptation is realized by adjusting the regularization parameter of objective function on each local device. Additionally, we design a communication strategy on the server side to select local updates in a prudent fashion and speed up model convergence. Experimental evaluations on three evolving data streams and two image datasets show that model~detects and handles concept drift, and also reduces the overall communication cost compared to other baseline methods. With the rapid expansion of IoT devices in our digital universe and the explosion in the quantity of data generated by these device sensors, on-device learning has emerged as a new paradigm that enables the training of statistical models locally on the devices [1] , [2] . Keeping data on device, federated learning (FL) offers an approach to do training locally in a way that a global model is collaboratively trained under the coordination of a central server [3] - [6] . In general, most of the existing FL solutions operate under the assumption that device data are stationary. However, in the real-world, due to the underlying data generating mechanism, the distribution of the data is constantly evolving and the generation of data streams is in the non-stationary environment. This phenomenon is known as concept drift [7] , [8] , which exists commonly in the scenarios of large scale data learning. For example, user behavior/activity prediction models trained before the COVID-19 may not work equally well during COVID-19 pandemic as the user's behaviors changes, weather prediction models change from season to season, customer consumption patterns and underlying recommender systems vary due to the pandemic, economy and so on. In other research fields, concept drift has been referred as covariate shift [9] , dataset shift [10] and non-stationary learning [11] . In federated learning process, the occurrence of concept drift leads to a shift of test data distribution from the original training data leading to serious failures in terms of classification/regression performance [11] . Therefore, it is crucial that the federated learning algorithms work adaptively. In general, concept drift can be divided into two types based the changing speed of data patterns: gradual and sudden drift [12] . In case of gradual drift, data distribution changes significantly between the underlying data and the incoming data in a long period, while in sudden drift, this large amounts of change happens in a relatively short period of time. In real-world scenarios, drift could be a mixture of both types [13] . In this paper, we perform study on both these common situations. Previous approaches on coping with concept drift mainly include sliding window approaches, online algorithms, and adaptive ensembles [14] . Ensemble approaches are popular on improving classification accuracy for concept drift problems [15] , [16] . However, most of these prior work are deployed in static learning problems and cost more for training. Recent work on asynchronous federated learning propose an online learning approach to handle the streaming device data [17] . While this approach is novel at tackling the challenges associated with heterogeneous edge devices, it cannot deal with the concept drift on local device data. Inspired by this work, we propose a novel asynchronous federated learning method, FedConD, which can detect and handle concept drift on local devices adaptively. In addition, although [17] provides dynamic learning strategy on local devices to deal with the varying latency among devices, the global model may still be biased to the devices with lower latency as they communicate more updates to the server. Allowing the server to broadcast the global model to all devices at each round is also a waste of communication bandwidth. In light of these challenges, we design a new strategy which allows the server to send update requests to devices with high latency (e.g., devices with low bandwidth or poor hardware) instead of sending update requests to all devices blindly. This balances the update frequency among devices and leads to a more fair global model. Meanwhile, the overall communication between the server and devices is also reduced with an intelligent selection of local updates. The main contributions can be summarized as follows. • We propose an asynchronous federated learning framework, FedConD, to detect and handle the data distribution changes across edge devices. • In this framework the server adopts a novel communication strategy to speedup the model convergence and leads to a more fair global model with balanced frequency of local updates. FedConD can improve the predictive performance of the worst 20% devices while also maintains the best test performance for the top 20% devices. • Experiments show that the proposed framework achieves better prediction performance than baseline methods with sudden/gradual drift on local devices, and converges faster compared with the other asynchronous online federated learning approach. Federated learning has been proposed as an alternative setting for decentralized learning approaches by McMahan et al. [4] : leaving training data distributed on local nodes, a shared global model is learned by aggregating locallycomputed updates. Compared to conventional distributed machine learning [18] - [20] , this framework is robust to highly unbalanced and not independent and identically distributed (non-i.i.d) data, unstable network connections and a large number of heterogeneous, client nodes. Many extensions have been explored based on this original federated learning framework [6] , [21] , [22] . However, these synchronous FL approaches assume that the device data are stationary, which is not realistic in the real world situation. An online learning algorithm with asynchronous communication strategy was proposed by Chen et al. [17] to perform federated learning under a more close to real world setting. This solution has no consideration of concept drift during the training process and suffers from the communication bottleneck problem due to the asynchronous framework. To reduce communication costs in federated learning, new settings have been proposed to deal with a federated optimization problem either by model compression or communicating partial model [3] , [6] , [23] , [24] . But the model prediction performance will be hurt by model compression and the convergence speed will be prolonged by partial model communication. Concept drift is a phenomenon in which the statistical properties of a target domain change over time in an arbitrary manner [25] . It was first proposed by Schlimmer et al. [26] who pointed out that noisy data may turn into non-noise information over time. These changes might be caused by changes in hidden variables which cannot be measured directly [8] . A wide range of algorithms have been developed to address concept drift. According to the changes in data distribution over time, concept drift can be roughly categorized into two types [12] : 1) sudden/abrupt drift changing data distribution in a quite short period of time; 2) gradual drift which the large data distribution changes has a relatively longer time. There are some other concept drift types such as incremental drift [27] , which has a gradual period of "intermediate concept". The term "intermediate concept" was introduced by Gama el al. [11] to describe the transformation between concepts. The intermediate concept of gradual drift is one of the starting or ending concept, while in incremental drift the intermediate concept is a mixture of the starting concept and the ending concept [8] . In real-world scenarios, drift could be a mixture of many types. In this study, we consider the most common drift types: (i) sudden drift and (ii) gradual drift. Previous work on concept drift uses chunk based learning techniques [16] , [28] , [29] , in which a new classifier is learned with a new set of samples. These approaches are not suitable for federated learning, as the sampling rate of device varies, and devices may not generate enough samples quickly enough to build a new classifier. Furthermore, the prediction performance is sensitive to the chunk size, while this parameter is hard to decide in a real-world setting. Other widely used approaches to deal with concept drift problems are ensemble learning-based algorithms, such as using a variant of bagging [15] , dynamic adaptation to concept changes (DACC) [30] , dynamic weighted majority (DWM) [28] and streaming ensemble algorithm (SEA) [16] . In these algorithms, a new model is created when drift is detected, but this new model is added to an ensemble pool which also includes older models. These ensemble algorithms are not suitable for on-device learning as they cost more to setup, train, and deploy. Many algorithms have explored machine learning algorithms through parallelization and distribution [31] - [36] . There are very few approaches that address issues related to concept drift in a fully distributed network. Ang et al. [37] propose a P2P learning framework for concept drift classification, which includes a drift detection (reactive behavior) and simultaneously a drift prediction (proactive behavior) mechanisms. The basic idea behind the approach is the use of chunkbased technique with a triggering and an ensemble based approaches. As stated in Section II-B, ensemble approaches are not suitable for on-device learning, and moreover, this approach also suffers from high communication cost for the network wide model propagation. Hegedűs el al. [38] handle concept drift by maintaining new as well as old models in the network, and models of the data perform random walks in the P2P network. While these techniques can be used in distributed learning, the network topology without a central server is fundamentally different from the federated learning framework in this study. To the best of our knowledge, currently there are no robust solutions to tackle concept drift efficiently in asynchronous federated learning. We put forward the concept drift issue in the federated learning framework and propose techniques to detect and adapt the concept drift on local devices. Our proposed algorithm is detailed in Section IV. In this section, we define the classical federated learning objective and introduce the definition of concept drift. For classical federated learning algorithms, the goal is to minimize the following objective function: where K is the total number of devices, p k ≥ 0 is the weight of k-th device and K k=1 p k = 1. Suppose D k is data captured on device k, and let n k = |D k | be the number of samples on device k. We can set p k to be n k N , where N = k n k is the total number of samples over the entire dataset. The local objective of client k is defined as: where i (x i , y i ; w k ) is the corresponding loss function 1 for data sample {x i , y i } and w k is the local model parameter. The distribution D k may change over time. For a given time period [0, t], a set of samples S 0,t = (d 0 , ..., d t ) are from D k , where d t = (x t , y t ) is the data sample at time step t, x t is the feature vector, y t is the target label. Assume each training example on device k is generated by a source S k , then a sudden drift occurs when S t k is suddenly replaced by S t+1 k at time step t. Gradual drifts change with a slower rate and roughly can be divided into two types. The first type has a mixed distribution of D k and D k during the transition phase. As time goes on, the probability of observing examples from D k decreases, while that of examples from D k increases. The second type also be referred as incremental drift [8] , which has more than two data distributions, however the difference between any adjacent distributions is small and the transition usually takes a longer period of time. Concept drift is critical to online learning problems, because such inequality may lead to a inconsistency in decision boundaries, thereby increasing the error rate. Research into concept drift adaptation focuses on how to minimize the drop in accuracy and achieve the fastest recovery rate during the concept transformation process. We consider of both of these concept drift types and aim to minimize the performance drops of the learned models in this paper. In this section, we first introduce our strategy for concept drift detection and method to handle the concept drift on local devices. Then we explain the learning procedure on server. To begin with, we first update the local objective function of client k: where w is the global model on the server. The regularization component λ 2 ||w k − w|| 2 aims to restrict the amount of local deviation by penalizing large changes from the current model at the server [17] , [21] . By restricting the local updates to be closer to the initial (global) model, we can limit the impact of variable local updates efficiently. Figure 1 illustrates the learning process on local device. The device starts to do prediction on the new incoming data after receiving the global model w from server. Next a drift detection process will be applied on the prediction results to determine whether a concept drift happens (detail explanation is in Section IV-A). If a drift is detected, the drift adaptation process is initialized before the training starts, otherwise the training will start directly. Finally the trained local model w k will be uploaded to server for aggregation. The main idea behind our concept drift detection algorithm is that we can use the new generated samples on the devices for evaluating the models before we use them for training. Then using the results got from this evaluation we can decide whether the concept drifted. broadcast w t to the device with the fewest updates 8: end for 9: Learning Process of Local Client k at round t 10: receive w t from the server 11 to the server For device k, at time step t the classifier predicts the class label of x t+1 to beŷ t+1 . Most of previous works assume that the true label y t+1 is provided after some time, and both the y t+1 andŷ t+1 are available for the drift detection [25] , [37] - [39] , which is, the learning is in a supervised framework. Then, example x t+1 with its label y t+1 becomes a part of the training data and the process is repeated when the next instance is observed. We perform drift detection using statistical testing. Each time we do prediction on the new incoming data with the downloaded server model, and then perform evaluation on the predicted labels and the true labels. Next we add the evaluation result into a queue with bounded size. Suppose device k has performed local training a times, and let s k be the evaluation value, then the evaluation queue is [s 1 k , s 2 k , ..., s a k ]. By storing these evaluations in the history, we can detect the trend of the performance of the model and decide whether there is a drift occurs at time a + 1. We assume that s a+1 if no concept drift happens; and a significant decrease of s a+1 k suggests that the concept is changing. The test is performed by calculating the following statistic 2 : . We compare its value to the percentile of the standard normal distribution to obtain the observed significance level (p-value). If the p-value, P , is less than a significance level, then the null hypothesis (s a+1 k =s k ) is rejected and the alternative hypothesis (s a+1 k >s k ) is accepted, namely concept drift has been detected. We set the significance level as 0.05 (5%), which can indicate a difference is significant if P is less than this value. The bounded size of the queue is set to be 20 for our cases. The added regularization component in Equation (3) is a penalty term to the loss function. For instance, Zhang el al. [40] propose elastic averaging SGD in deep networks with a similar penalty term in its objective. In distributed methods, Shamir et al. propose DANE [41] and Reddi et al. provide AIDE [20] use similar regularization term in the local objective function, while these methods are within the data center setting. Li el al. propose FedProx [21] to add this penalty term to tackle heterogeneity in federated networks, and they show that FedProx is robust to systems heterogeneity with stable convergence compared to vanilla FedAvg. In Equation (3), λ controls the amount of penalty added to the original local objective f k (·). When a drift is detected, the local model will deviate further from the global model, thus we should increase the weight of penalty component to force the local model to be close to the global model. We show empirically that this strategy can safely incorporate variable amounts of local work resulting from concept drift. The server aggregates the received local models in an asynchronous manner, that is, server model will be updated after receiving one device's update. Suppose at round t + 1, server receives the update from device k. Let w t+1 be the server model, w t k be the local model, n t+1 k be the number of samples on devices k and N t+1 be the total samples over all devices. The server model is updated as follow: where η k is the learning rate of device k, and ∇h k (·) is the gradient on the local data of device k. Previous asynchronous FLs have no constrains on local device selection, that is, the server broadcasts global model w to all available devices at each round [17] , [42] . This communication strategy is not efficient due to 1) downlink communication cost is high due to the server need to communicate with all local devices at each round; 2) the server can easily become a communication bottleneck with tens of thousands of devices updating the model simultaneously; 3) the update frequency of devices vary due to reasons like system heterogeneity, network bandwidth or data heterogeneity. Therefore, the aggregated global model may bias to devices which have more frequent updates. In former work of communication-efficient federated learning, one commonly used approach to reduce the overall communication cost is compressing the model size transferred between the server and devices [6] , [23] , [43] . While these compression techniques can reduce the communication cost efficiently, the model performance is also hurt because the whole model information cannot be preserved during the compression-decompression process. Different from the synchronous FL frameworks, which allow only a portion of devices to perform local training and communicate with the server. With large amount of local devices, asynchronous FL framework should select local device update more wisely and also control the number of local devices which perform training simultaneously. To address the above mentioned problems we propose a communication-efficient strategy to reduce the overall communication cost during the learning process in asynchronous FL frameworks, and furthermore, learn a more fair global model with faster convergence speed. At the server side we maintain records of the update frequency of all devices {p 1 , p 2 , ..., p K }. At the beginning of each round, the server will send update request to the device with fewest updates, which is, getting min(p 1 , p 2 , ..., p K ) and send server model to the according device. Meanwhile, we design a parameter γ to control the number of local devices which are performing local training at the same time. A detailed analysis on this parameter can be found in Section VI-C. We perform extensive evaluations on three evolving data streams and two image datasets. • FitRec Dataset 3 : User sport records generated on mobile devices and uploaded to Endomondo, including multiple sources of sequential sensor data such as heart rate, speed, GPS as well as the sport type (e.g., biking, hiking), user gender and weather condition. Following [44] , we resampled the data in 10-second intervals. We use data of randomly selected 30 users for heart rate prediction. • Air Quality Dataset 4 : Air quality data collected from multiple weather sensor devices distributed in 9 locations of Beijing from Jan 2017 to Jan 2018, with features such as thermometer and barometer. Each area is modeled as a separate participant and the observed weather data is used to predict the measure of six air pollutants (e.g., PM2.5, PM10) from May 1st, 2018 to May 31st, 2018. • ExtraSensory Dataset 5 : Mobile phone sensor data (e.g., location services, audio, accelerator) collected from 60 users, performing any of 51 activities (e.g., walking, talking, running), using personal smart phones/watches was used as a more realistic alternative to the generated data set to show how the proposed algorithm could be applied in a different context. [45] . We use the provided 225-length feature vectors of time and frequency domain variables generated for each instance. • Fashion-MNIST 6 We compare the proposed approach with the following synchronous and asynchronous federated learning approaches. • FedAvg [3] , [4] : the commonly used synchronous federated learning approach proposed by McMahan et al. [4] . • FedProx [21] : synchronous federated learning framework with a proximal term on the local objective function to mitigate the data heterogeneity problem and to improve the model stability compared to FedAvg. • ASO-Fed [17] : asynchronous federated learning framework to deal with non-IID streaming device data. C. Experimental Setup 1) Drift Simulation: Real-world data satisfy the gradual drift situation as these data are generated by sensors in the nonstationary environment. For instance, data on each weather sensor of the Air Quality Dataset changes from season to season. Thus we use the three provided real-world datasets to test the algorithm for gradual drift. To make the problem more challenging, following [46] , we simulate the sudden concept drift in time-series datasets. First we randomly select a portion C of devices to add concept drifts. Then we modify the feature values to be random values between 10 − 1000 of a consecutive time period of samples and keep the according labels unchanged. We perform this sudden concept drift simulation on both the FitRec dataset and Air Quality Dataset. There are drawbacks for real-world data for evaluating concept drift handling methods, as the precise start and end time of drifts is unknown. Due to these limitations, it is difficult to evaluate methods and understand the drift [14] . Thus we also simulate concept drifts on the image data. We use the similar strategy as Mansukhani et al. [47] , which is, adding noise to a portion of the image data. Same as the timeseries datasets, we randomly select a portion C of participants to add noise. We set C as 0.1, which is, there are 10% devices with concept drifts in each dataset. To further evaluate the model performance with this parameter changes, we report the results with different values of C in Section VI-C. 2) Training Details: We split the each device's data into 60%, 20%, 20% for training, validation, and testing, respectively. As for each training data, we start with a random portion of the total training size, and increase by 0.01% − 0.1% each iteration to simulate the arriving data. We set the threshold δ value is 0.2 for image classification data and 0.25 for timeseries regression data. We set the portion of participation devices of FedAvg and FedProx as 0.2 at each round. We use a single layer LSTM followed by a fully connected layer for the two time-series datasets and two CNN layers followed by a max pooling layer for the two image datasets. The local epoch number of each client is set as 2. We design simple network architectures for all datesets so that it can be easily handled by mobile devices. All of the experiments are conducted with two Intel E5-2660 v3 10-core CPU at 2.60GHz [48] . Table I reports the prediction performance comparing the proposed model with the baseline approaches. We observer that FedConD receives the best predictive performance (Avg) and the lowest variance (Var) among all the FL approaches. These results indicate that FedConD performs well with concept drift on local devices. As the server dynamically requests updates from local devices based on their update frequency, our framework can balance the updates among all devices and learn a more fair optimal global model. We notice that ASO-Fed has the close performance as the proposed model both on the predictive performance and the statistic results. ASO-Fed designs an online learning algorithm with a decay coefficient to balance the previous and current local gradients, which also works on dealing with concept drifts. However, with a fixed regularization parameter across all devices, ASO-Fed is not flexible enough to handle drifts efficiently. Synchronous FLs (FedAvg and FedProx) do not perform well compared with the other two asynchronous approaches. To evaluate the proposed drift adaption strategy, we also report the statistics on devices with drifts in Table I . Each device in ExtraSensory data contains mixed drift types, thus we only report the variance across all devices. The results indicates that the proposed model can get the lowest variance within the drift devices. 1) Communication Efficiency: Figure 2 shows the learning curves of the global model on all FL frameworks. For Cifar-10 and Fashion-MNIST image data classification tasks, our proposed model achieves the best predictive performance and also converges faster than the other three approaches. For FitRec and Air Quality time-series data regression tasks, we observe larger fluctuations on ASO-Fed at the initial learning phase. Due to the high non-IID nature of these two datasets, it is common to see this fluctuations in asynchronous update frameworks as the server aggregates local updates one at each time. Our proposed framework FedConD converges more quickly and sees stable performance. The two synchronous frameworks, however, suffer from unstable model performance and cannot handle streaming data. We further compare the two asynchronous FL methods with respect to the communicated bytes required to achieve a certain target accuracy on a federated learning task. Table III shows the amount of upstream and downstream communications required to achieve the target prediction performance for the asynchronous FL methods. FedConDmanages to achieve the desired target performance within the smallest upload communication budget compared with ASOFed. Meanwhile, we see much smaller communication costs required for FedConD for the downstream communications. This is due to the design of γ to control the number of local training devices. We further discuss the selection of γ in Section VI-C. 2) Fairness of Client Participation: The server model aims to get a equal contributions from the local clients by adjusting the update frequency of local models. While the FL developer may have different fairness criteria, achieving a more fair accuracy distribution across devices is mostly the paramount [49] . We follow the popular metrics to measure the fairness of model accuracy for different strategies [49] , [50] . We evaluate the statistical performance of the test accuracy for FedConD on the best 20% and worst 20% devices. As shown in Table II , FedConD can improve the predictive performance of the worst 20% devices while also maintains the best test performance for the top 20% devices. This indicates that the proposed communication-efficient strategy on the server can lead to a balanced and fair model. Lacking methods to deal with streaming data and concept drift, FedAvg and FedProx are observed to have large gaps in the average performance between the top devices and bottom devices on the studied benchmarks. 1) Adaptation to new concepts.: We simulate sudden drift on the image benchmark data by injecting noise on local devices (experimental protocol described in Section V-C1). As shown in Figure 2 , FedConD achieves the best classification performance on Cifar-10 and Fashion-MNIST data compared with other competitors It is also noticeable that FedAvg and FedProx fail to converge in this scenario. ASO-Fed has the similar accuracy trend as FedConD, but it needs longer time to converge. This is because ASO-Fed has extra computation such as feature learning component on the server. Streaming data usually contains mixed type of concept drifts. 2) Adaptation to mixed concepts.: We also add sudden drift to the FitRec and Air Quality data to create the complicate drift types with a mix of gradual drift and sudden drift. We notice that the prediction performance of all FL algorithms have larger fluctuations with the mixed drifts. ASO-Fed has noticeable unstable performance at the beginning stage of the training process. FedAvg and FedProx still show no convergence trend. FedConD converges to the best performance with the lowest prediction errors. To better evaluate the effect of concept drift on both the global model and local models, we further show the predictive performance of the global model for FedConD in Figure 3 and local models for FedConD in Figure 4 . As shown in Figure 3 , FitRec is a non-IID dataset. Each local device contains data of only one sport type (e.g., hiking, biking). Thus the unstable learning curve is not only due to concept drifts but also caused by the non-IID aspects of this data. We observe the obvious drop of the prediction performance on Fashion-MNIST at the initial learning stage. This dataset is simpler than Cifar-10 for it contains only grayscale images, thus the changes on local data distributions can be reflected on the global model. In contrast, drifts on local data cannot be easily noticed on the server side for the Cifar-10 data. Air Quality data has a smaller number of devices, changing even one device's data distribution will have relatively longer affect on the global model. We show the predictive performance at each round on the drifted devices for the Fashion-MNIST dataset in Figure 4 . Obvious drops can be noticed from the learning curve of FedConD, which indicate the data distribution changes on these local devices and the proposed model can detect these changes. ASO-Fed has similar learning trend as FedConD as it has a local decay strategy to deal with online learning, while FedConD has higher accuracy when drift occurs and converges to better performance. FedAvg and FedProx fail to detect the local drift. We perform another empirical study on model performance with drifts occurring at different stages. Figure 5 shows the prediction performance of FedConD compared with other FL algorithms for the sudden drift occurs at 10% place (Early Stage), 40% place (Middle Stage) and 70% place (Late Stage) of the device data. ASO-Fed has larger errors than FedConD when the drift happens at an early time, and this error gap shrinks as the drift occurs at later times. Synchronous FL algorithms have unstable prediction performance no matter when the drift starts. We also observe that the prediction errors drop as the drift starts late, which indicates a later drift affects the server model less. C. Sensitivity to Tunable Parameters 1) Number of Drift Devices: We set the portion of drift devices to be 10% for the default experimental setting. In this section we also perform evaluations on the model performance with more devices which have concept drifts. As shown in Figure 6 , we show the convergence of FedConD compared to ASO-Fed with the number of drift devices increase. With 20% drift devices, slight decrease in the prediction performance for both approaches are observed for the Fashion-MNIST data. For the Air Quality data, the SMAPE value increases a little of ASO-Fed compared with 10% drift devices setting, while there is almost no difference on the prediction performance of FedConDon 10% and 20% drift devices. We observe the similar situation for 50% drift devices, where there are slight decrease on the prediction performance and the learning curves remain stable of FedConD. These results also prove the robustness of FedConD on dealing with local concept drifts across an increasing number of devices. 2) Number of Local Training Devices Simultaneously: Asynchronous learning framework allows local devices to perform local training simultaneously, which is, local device can start its own training any time when it is ready. However, with all of local devices performing local training at the same time and upload the local models to the server, the server can easily become a bottleneck. Meanwhile, devices with better resources (e.g., network, hardware) will perform more updates to the server and lead to a biased global model. To deal with these problems, we design a parameter γ to control the number of local active devices at the same time. Higher value of γ implies more devices are performing the local training at the same time and lower value of γ implies fewer devices perform local training simultaneously. In Figure 7 , we show the convergence trend of FedConD with different γ values. FedConD achieves the best predictive performance when γ = 0.2 or γ = 0.3. However, with γ = 0.2, FedConD converges faster, thus we use γ = 0.2 in the experiments of FedConD. 3) Regularization Parameter λ: Figure 8 displays the convergence trend of FedConD for changing the penalty on local model with concept drifts. Intuitively, the local model will deviate further from the global model if concept drift occurs. These results confirms our claim in Section IV-B. When a concept drift is detected on local device, we increase the penalty to force the local model to be closer to the global model. In Figure 8 , the model performance is better when the regularization parameter, λ is increased. Otherwise, the local model will deviate further from the global model and hurt the predictive performance on the global model. In this paper, we propose a novel framework, FedConD, to detect and handle concept drift in asynchronous federated learning. To the best of our knowledge, this is the first study of concept drift for the federated learning framework with heterogeneous device data. On the server side, in order to get a more fair global model and reduce the overall communication cost, we design a strategy to balance the local updates and control the number of devices which perform local training at the same time. Experimental results on three evolving streaming data and two image data show that the proposed model can detect and handle concept drift in asynchronous federated learning efficiently. Smartphones will get even smarter with on-device machine learning On-device machine learning: An algorithms and learning theory perspective Federated learning: Strategies for improving communication efficiency Communication-efficient learning of deep networks from decentralized data Client selection for federated learning with heterogeneous resources in mobile edge Communication-efficient federated learning for wireless edge intelligence in iot Concept drift in streaming data classification: algorithms, platforms and issues Regional concept drift detection and density synchronized drift adaptation Machine learning in non-stationary environments: Introduction to covariate shift adaptation When training and test sets are different: characterizing learning transfer A survey on concept drift adaptation The problem of concept drift: definitions and related work Quantifying the limited and gradual concept drift assumption Learning under concept drift: A review Leveraging bagging for evolving data streams A streaming ensemble algorithm (sea) for large-scale classification Asynchronous online federated learning for edge devices with non-iid data Distributed optimization with arbitrary local solvers Disco: Distributed optimization for selfconcordant empirical loss Aide: Fast and communication efficient distributed optimization Federated optimization in heterogeneous networks Tifl: A tier-based federated learning system Communication-efficient on-device machine learning: Federated distillation and augmentation under non-iid private data Fedat: A communication-efficient federated learning method with asynchronous tiers under non-iid data Concept drift detection via competence models Incremental learning from noisy data A concept drift-tolerant case-base editing technique Dynamic weighted majority: An ensemble method for drifting concepts Mining concept-drifting data streams using ensemble classifiers Online learning: Searching for the best forgetting strategy under concept drift Slow learners are fast Distributed training strategies for the structured perceptron Parallelized stochastic gradient descent Distributed delayed stochastic optimization Asynchronous distributed optimization using a randomized alternating direction method of multipliers An asynchronous distributed proximal gradient method for composite convex optimization On classifying drifting concepts in p2p networks Massively distributed concept drift handling in large networks Reacting to different types of concept drift: The accuracy updated ensemble algorithm Deep learning with elastic averaging sgd Communication-efficient distributed optimization using an approximate newton-type method Asynchronous federated optimization An exact quantized decentralized gradient descent algorithm Modeling heart rate and activity data for personalized fitness recommendation Extrasensory app: Data collection in-the-wild with rich user interface to self-report behavior Determining the training window for small sample size classification with concept drift Data drift detection for image classifiers The design and operation of cloudlab Advances and open problems in federated learning Fair resource allocation in federated learning