key: cord-0665588-leb9t2f7 authors: Farhadkhani, Sadegh; Guerraoui, Rachid; Hoang, Le-Nguyen; Villemaud, Oscar title: An Equivalence Between Data Poisoning and Byzantine Gradient Attacks date: 2022-02-17 journal: nan DOI: nan sha: 60b8330ef408b341f85018c8112f3390531e1bbb doc_id: 665588 cord_uid: leb9t2f7 To study the resilience of distributed learning, the"Byzantine"literature considers a strong threat model where workers can report arbitrary gradients to the parameter server. Whereas this model helped obtain several fundamental results, it has sometimes been considered unrealistic, when the workers are mostly trustworthy machines. In this paper, we show a surprising equivalence between this model and data poisoning, a threat considered much more realistic. More specifically, we prove that every gradient attack can be reduced to data poisoning, in any personalized federated learning system with PAC guarantees (which we show are both desirable and realistic). This equivalence makes it possible to obtain new impossibility results on the resilience to data poisoning as corollaries of existing impossibility theorems on Byzantine machine learning. Moreover, using our equivalence, we derive a practical attack that we show (theoretically and empirically) can be very effective against classical personalized federated learning models. Learning algorithms typically leverage data generated by a large number of users [SSP + 13, WPN + 19, WSM + 19] to often learn a common model that fits a large population [KMR15] , but also sometimes to construct a personalized model for each individual [RRS11] . Autocompletion [LB21], conversational [SHL18] and recommendation [IJW + 19] schemes are examples of such personalization algorithms deployed at scale. To be effective, besides huge amounts of data [BMR + 20, FZS21], these algorithms require customization, motivating research into the promising but challenging field of personalized federated learning [FMO20, HHHR20, DTN20]. Clearly, in applications such as content recommendation, activists, companies, and politicians have strong incentives to promote certain views, products or ideologies [Hoa20, HFE21]. Perhaps unsurprisingly, this led to the proliferation of fabricated activities to bias algorithms [BH19, NHK19], using for instance "fake reviews" [WNWW20] . The scale of this phenomenon is well illustrated by the case of Facebook which, in 2019 alone, reported the removal of around 6 billion fake accounts from its platform [FG19]. This is particularly concerning in the era of "stochastic parrots" [BGMS21] : climate denialists are incentivized to pollute textual datasets with claims like "climate change is a hoax", rightly assuming that autocompletion, conversational and recommendation algorithms trained on such data will more likely spread these views [MN20] . This raises serious concerns about the vulnerability of personalized federated learning to such misleading data. Data poisoning attacks clearly constitute now a major machine learning security issue [KNL + 20]. Overall, in adversarial environments like social media, and given the advent of deep fakes [JD21], we should expect most data to be strategically crafted and labeled. In this context, the authentication of the data provider is critical. In particular, the safety of learning algorithms arguably demands that they be trained solely on cryptographically signed data, namely, data that provably come from a known source. 
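As a minimal illustration of this last point, a training pipeline could discard any record that does not carry a valid signature from a registered provider. The sketch below uses the Ed25519 API of the PyNaCl library; the record format and the registry of provider keys are illustrative assumptions, not part of this paper's system.

```python
# Minimal sketch: keep only training records carrying a valid Ed25519 signature
# from a known provider (illustrative; not the paper's pipeline).
from nacl.signing import VerifyKey
from nacl.exceptions import BadSignatureError

def filter_signed_records(records, registered_keys):
    """records: iterable of (provider_id, payload_bytes, signature_bytes).
    registered_keys: dict mapping provider_id -> raw 32-byte Ed25519 public key."""
    trusted = []
    for provider_id, payload, signature in records:
        key_bytes = registered_keys.get(provider_id)
        if key_bytes is None:
            continue  # unknown provider: drop the record
        try:
            VerifyKey(key_bytes).verify(payload, signature)
            trusted.append((provider_id, payload))
        except BadSignatureError:
            continue  # forged or corrupted record: drop it
    return trusted
```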
But even signed data cannot be wholeheartedly trusted since users typically have preferences over what ought to be recommended to others. Naturally, users have incentives to behave strategically in order to promote certain views or products. To study resilience, the Byzantine learning literature usually assumes that each federated learning worker may behave arbitrarily [BMGS17, YCRB18, KHJ21, YL21] . To understand the implication of this assumption, recall that at each iteration of a federated learning stochastic gradient descent, every worker is given the updated model, and asked to compute the gradient of the loss function with respect to (a batch of) its local data. Byzantine learning assumes that a worker may report any gradient; without having to certify that the gradient was generated through data poisoning. Whilst very general, and widely studied in the last few years, this gradient attack threat model has been argued to be unrealistic in practical federated learning [SHKR22], especially when the workers are machines owned by trusted entities [KMA + 21]. We prove in this paper a somewhat surprising equivalence between gradient attacks and data poisoning, in a general framework. Essentially, we give the first practically compelling argument for the necessity to protect learning against gradient attacks. Our result enables to carry over results on Byzantine gradient attacks to the data poisoning world. For instance, the impossibility result of [EFG + 21], combined with our equivalence result, says that the more heterogeneous the data, the more harmful the poisoning can be. Also, we can also derive very concrete data poisoning attacks from gradient ones. Contributions. As a preamble of our main result, we formalize local PAC* learning 1 [Val84] for personalized learning, and prove that a simple and general solution to personalized federated linear regression and classification is indeed locally PAC* learning. Our proof leverages a new concept called gradient-PAC* learning. We prove that gradient PAC* learning, which is verified by basic learning algorithms like linear and logistic regression, is sufficient to guarantee local PAC* learning. This is an important and nontrivial contribution of this paper. Our main contribution is then to prove that local PAC* learning in personalized federated learning essentially implies an equivalence between data poisoning and gradient attacks. More precisely, we show how any (converging) gradient attack can be turned into a data poisoning attack, with equal harm. As a corollary, we derive new impossibility theorems on what any robust personalized learning algorithm can guarantee, given heterogeneous genuine users and under data poisoning. Given how easy it generally is to create fake accounts on web platforms and to inject poisonous data through fake activities, our results arguably greatly increase the concerns about the vulnerabilities of learning from user-generated data, even when "Byzantine learning algorithms" are used, especially on controversial issues like hate speech moderation, where genuine users will inevitably provide conflicting reports on which words are abusive and ought to be removed. Finally, we present a simple but very general strategic gradient attack, called the countergradient attack, which any participant to federated learning can deploy to bias the global model towards any target model that better suits their interest. 
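To make this threat model concrete, the following sketch (NumPy, with illustrative names) shows one aggregation round of federated SGD: honest workers return the gradient of their loss on their local data, whereas a Byzantine worker may return any vector of the right dimension, unconstrained by any dataset. This is a generic illustration of the threat model, not the personalized scheme introduced in Section 2.

```python
import numpy as np

def honest_gradient(model, local_data, grad_fn):
    # An honest worker computes the gradient of its loss on (a batch of) its data.
    return grad_fn(model, local_data)

def byzantine_gradient(model):
    # A Byzantine worker may report *any* vector, unconstrained by any dataset.
    return np.random.uniform(-1.0, 1.0, size=model.shape[0])

def server_round(model, workers, grad_fn, lr=0.1):
    """One round of federated SGD: collect one gradient per worker and average."""
    grads = []
    for w in workers:
        if w["byzantine"]:
            grads.append(byzantine_gradient(model))
        else:
            grads.append(honest_gradient(model, w["data"], grad_fn))
    return model - lr * np.mean(grads, axis=0)
```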
We prove the effectiveness of this counter-gradient attack under fairly general assumptions, which apply to many proposed personalized learning frameworks, including [HHHR20, DTN20]. We then show empirically how this attack can be turned into a devastating data poisoning attack, with remarkably few data points (the code is provided in the Supplementary Material and will be made accessible online).

Related work. Collaborative PAC learning was introduced by [BHPQ17], and then extensively studied [CZZ18, NZ18], sometimes assuming Byzantine collaborating users [Qia18, JO20, KFAL20]. It was however assumed that all honest users have the same labeling function. In other words, all users agree on how every query should be answered. This is a very unrealistic assumption in many critical applications, like content moderation or language processing. In fact, in such applications, removing outliers can be argued to amount to ignoring minorities' views, which would be highly unethical. The very definition of PAC learning must then be adapted, which is precisely what we do in this paper (by also adapting it to parameterized models). A large literature has focused on data poisoning, with either a focus on backdoor [DCL19, ZMZ+20, SMCO21, TJH+20, SGG+21] or triggerless attacks [BNL12, MBD+17, SHN+18, ZHL+19, HGF+20, BNS+06, AMW+21, GFH+21]. However, most of this research analyzed data poisoning without signed data. A noteworthy exception is [MMM19], whose universal attack amplifies the probability of a (bad) property. Our work bridges the gap, for the first time, between that line of work and what has been called Byzantine resilience [MGR18, BBG19, XKG19, EMGR21]. Results in this area typically establish resilience against a minority of adversarial users, and many of them apply almost straightforwardly to personalized federated learning [EGG+20, EFG+21]. The attack we present in this paper considers a specific kind of Byzantine player, namely a strategic one [SMS+21], whose aim is to bias the learned models towards a specific target model. The resilience of learning algorithms to such strategic users has been studied in many special cases, including regression [CPPS18, DFP10, PPP04, BPT17], classification [MPR12, CLP20, MAMR11, HMPW16], statistical estimation [CDP15], and clustering [PS03]. While some papers provide positive results in settings where each user can only provide a single data point [CPPS18, PPP04], [SMS+21] show how to arbitrarily manipulate convex learning models through multiple data injections, when a single model is learned from all data at once.

Structure of the paper. The rest of the paper is organized as follows. Section 2 presents a general model of personalized learning, formalizes local PAC* learning and describes a general federated gradient descent algorithm. Section 3 proves the equivalence between data poisoning and gradient attacks, under local PAC* learning. Section 4 proves the local PAC* learning properties for federated linear regression and classification. Section 5 describes a simple and general data poisoning attack, and shows its effectiveness against the ℓ2² regularization, both theoretically and empirically. Section 6 concludes. Proofs of our theoretical results and details about our experiments are given in the Appendix.

We consider a set [N] = {1, . . . , N} of users. Each user n ∈ [N] has a local signed dataset D_n, and learns a local model θ_n ∈ R^d. Users may collaborate to improve their models. Personalized learning then inputs a tuple of users' local datasets D ≜ (D_1, . . . , D_N), and outputs a tuple of local models θ* ≜ (θ*_1, . . . , θ*_N). Like many others, we assume that the users perform federated learning to do so, by leveraging the computation of a common global model ρ ∈ R^d. Intuitively, the global model is an aggregate of all users' local models, which users can leverage to improve their local models. The global model typically allows users with too few data to obtain an effective local model, while it may be mostly discarded by users whose local datasets are large. More formally, we consider a personalized learning framework which generalizes the models proposed by [DTN20] and [HHHR20]. Namely, we consider that the personalized learning algorithm outputs a global minimum (ρ*, θ*) of a global loss given by

Loss(ρ, θ, D) ≜ Σ_{n ∈ [N]} L_n(θ_n, D_n) + Σ_{n ∈ [N]} R(ρ, θ_n),

where R is a regularization, typically with a minimum at θ_n = ρ. For instance, [HHHR20] and [DTN20] define R(ρ, θ_n) ≜ λ ‖ρ − θ_n‖_2^2, which we shall call the ℓ2² regularization. But other regularizations may be considered, like the ℓ2 regularization R(ρ, θ_n) ≜ λ ‖ρ − θ_n‖_2, or the smooth-ℓ2 regularization R(ρ, θ_n) ≜ λ √(1 + ‖ρ − θ_n‖_2^2). Note that, for all such regularizations, the limit λ → ∞ essentially yields the classical non-personalized federated learning framework. In this paper, we focus on personalized learning algorithms that provably recover a user n's preferred model θ†_n, if the user provides a large enough honest dataset D_n, i.e. a dataset constructed with θ†_n. Such honest datasets D_n could typically be obtained by repeatedly drawing random queries (or features), and by using the user's preferred model θ†_n to provide (potentially noisy) answers (or labels). We refer to Section 4 for examples. The model recovery condition is then formalized as follows.

Definition 1. A personalized learning algorithm is locally PAC* learning if, for any subset H ⊂ [N] of users, any preferred models θ†_H, any ε, δ > 0, and any datasets D_{−H} from the other users n ∉ H, there exists I such that, if all users h ∈ H provide honest datasets D_h with at least |D_h| ≥ I data points, then, with probability at least 1 − δ, the learned local models satisfy ‖θ*_h(D) − θ†_h‖_2 ≤ ε for all h ∈ H.

Local PAC* learning is arguably a very desirable property. Indeed, it guarantees that an honest active user will not be discouraged from participating in federated learning, as they will eventually learn their preferred model by providing more and more data. Note that the required number of data points I also depends on the datasets D_{−H} provided by the other users. This implies that a locally PAC* learning algorithm is still vulnerable to poisoning attacks, as the attacker's dataset is not fixed a priori. In Section 4, we will show how local PAC* learning can be achieved in practice, by considering specific local loss functions L_n. While the computation of ρ* and θ* could be done by a single machine, which first collects the datasets D and then minimizes the global loss Loss, modern machine learning deployments often rather rely on federated (stochastic) gradient descent (or variants), with a central trusted parameter server. In this setting, each user n keeps their data D_n locally. At each iteration t, the parameter server sends the latest global model ρ^t to the users. Each user n is then expected to update its local model given the global model ρ^t, either by solving θ^t_n ≜ argmin_{θ_n} L_n(θ_n, D_n) + R(ρ^t, θ_n) [DTN20], or by making a (stochastic) gradient step from the previous local model θ^{t−1}_n [HR21].
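For concreteness, the sketch below instantiates this local update for the ℓ2² regularization and a simple least-squares local loss (used here only so that the arg min has a closed form), together with the gradient that the user then reports to the server, as described next. Names and the choice of loss are illustrative.

```python
import numpy as np

def local_update(rho_t, Q, a, lam):
    """Local step with l2^2 regularization and a least-squares local loss:
    theta_n = argmin_theta 0.5*||Q @ theta - a||^2 + lam*||rho_t - theta||^2.
    Q: (I, d) matrix of queries, a: (I,) vector of answers (illustrative)."""
    d = Q.shape[1]
    # Setting the gradient Q^T(Q theta - a) + 2*lam*(theta - rho_t) to zero:
    lhs = Q.T @ Q + 2.0 * lam * np.eye(d)
    rhs = Q.T @ a + 2.0 * lam * rho_t
    return np.linalg.solve(lhs, rhs)

def report_gradient(rho_t, theta_t_n, lam):
    # Gradient of R(rho, theta) = lam*||rho - theta||^2 with respect to rho,
    # which the user reports to the parameter server.
    return 2.0 * lam * (rho_t - theta_t_n)
```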
User n is then expected to report the gradient g t n = ∇ ρ R(ρ t , θ t n ) of the global model to the parameter server. The parameter server then updates the global model, using a gradient step, i.e. it computes where η t is the learning rate at iteration t. For simplicity, here, and since our goal is to show the vulnerability of personalized federated learning even in good conditions, we assume that the network is synchronous and that no node can crash. Note also that our setting could be generalized to fully decentralized collaborative learning, as was done by [EFG + 21] . Users are only allowed to send plausible gradient vectors. More precisely, we denote Grad(ρ) {∇ ρ R(ρ, θ) | θ ∈ R d } the closure set of plausible (sub)gradients at ρ. If user n's gradient g t n is not in the set Grad(ρ t ), the parameter server can easily detect the malicious behavior and g t n will be ignored at iteration t. In the case of an 2 2 regularization, where R(ρ, θ) = λ ρ − θ 2 2 , we clearly have Grad(ρ) = R d for all ρ ∈ R d . It can be easily shown that, for 2 and smooth-2 regularizations, Grad(ρ) is the closed ball B(0, λ). Nevertheless, even then, a strategic user s ∈ [N ] can deviate from its expected behavior, to bias the global model in their favor. We identify in particular three sorts of attacks. Data poisoning: Instead of collecting an honest dataset, s fabricates any strategically crafted dataset D s , and then performs all other operations as expected. Model attack: At each iteration t, s fixes θ t s θ ♠ s , where θ ♠ s is any strategically crafted model. All other operations would then be executed as expected. Gradient attack: At each iteration t, s sends any (plausible) strategically crafted gradient g t s . Gradient attacks are intuitively most harmful, as the strategic user can adapt their attack based on what they observe during training. However, because of this, gradient attacks are more likely to be flagged as suspicious behaviors. At the other end, data poisoning may seem much less harmful. But it is also harder to detect, as the strategic user can report their entire dataset, and prove that they rigorously performed the expected computations. In fact, data poisoning can be executed, even if users directly provide the data to a (trusted) central authority, which then executes (stochastic) gradient descent. This is typically what is done to construct recommendation algorithms, where users' data are their online activities (what they view, like and share). Crucially, especially in applications with no clear ground truth, such as content moderation or language processing, the strategic user can always argue that their dataset is "honest"; not strategically crafted. Ignoring the strategic user's data on the basis that it is an "outlier" may then be regarded as unethical, as it amounts to rejecting minorities' viewpoints. We now present our main result, considering "model-targeted attacks", i.e., the attacker aims to bias the global model towards a target model θ † s . This attack was also previously studied by [SMS + 21]. Theorem 1 (Equivalence between gradient attacks and data poisoning). Assume local PAC* learning, and 2 2 , 2 or smooth-2 regularization. Suppose that each loss L n is convex and that the learning rate η t is constant. Consider any datasets D −s provided by users n = s. Then, for any target model For the sake of exposition, our results are stated for 2 2 or smooth-2 regularization only. 
But the proof, in Appendix B, holds for all continuous regularizations R with R(ρ, θ) → ∞ as ρ − θ 2 → ∞. We now sketch our proof, which goes through model attacks. To study the model attack, we define the modified loss with directly strategic user s's reported model θ ♠ s as where θ −s and D −s are variables and datasets for users n = s. Denote ρ * (θ ♠ s , D −s ) and θ * −s (θ ♠ s , D −s ) a minimum of the modified loss function. and θ * Lemma 1 (Reduction from model attack to data poisoning). Consider any data D and user s ∈ [N ]. Assume the global loss has a global minimum (ρ * , θ * ). Then (ρ * , θ * −s ) is also a global minimum of the modified loss with datasets D −s and strategic reporting θ ♠ s θ * s ( D). Now, intuitively, by virtue of local PAC* learning, strategic user s can essentially guarantee that the personalized learning framework will be learning θ * s ≈ θ ♠ s . In the sequel, we show that this is the case. Lemma 2 (Reduction from data poisoning to model attack). Assume 2 2 , 2 or smooth-2 regularization, and assume local PAC* learning. Consider any datasets D −s and any attack model θ ♠ s such that the modified loss Loss s has a unique minimum ρ * (θ ♠ s , D −s ), θ * −s (θ ♠ s , D −s ). Then, for any ε > 0, there exists a dataset D s such that we have Sketch of proof. Given local PAC*, for a large dataset D s constructed from θ ♠ s , s can guarantee θ * s ( D) ≈ θ ♠ s . By carefully bounding the effect of the approximation on the loss using the Heine-Cantor theorem, we show that this implies ρ * ( D) ≈ ρ * (θ ♠ s , D −s ) and θ * n ( D) ≈ θ * n (θ ♠ s , D −s ) for all n = s too. The precise analysis is nontrivial. We now prove that any successful converging model-targeted gradient attack can be transformed into an equivalently successful model attack. Lemma 3 (Reduction from model attack to gradient attack). Assume that L n is convex for all nodes n ∈ [N ], and that we use 2 2 , 2 or smooth-2 regularization. Consider a converging gradient attack g t s with limit g ∞ s that makes the global model ρ t converge to ρ ∞ with a constant learning rate η. Then for any Sketch of proof. The proof is based on the observation that since Grad is closed and g ∞ s ∈ Grad, we can construct θ ♠ s which approximately yields the gradient g ∞ s . Since any model attack can clearly be achieved by the corresponding honest gradient attack, model attacks and gradient attacks are thus equivalent. In light of our previous results, this implies that gradient attacks are essentially equivalent to data poisoning (Theorem 1). Note that Theorem 1 (and Lemma 3) assumes that the global model converges. Here, we prove that this assumption is automatically satisfied for converging gradients, at least when local models θ t n are fully optimized given ρ t , at each iteration t, in the manner of [DTN20], and under smoothness assumptions. Proposition 1. Assume that L n is convex and L-smooth for all nodes n ∈ [N ], and that we use 2 2 or smooth-2 regularization. If g t s converges and if η t = η is a constant small enough, then ρ t will converge too. Sketch of proof. Denote g ∞ s the limit of g t s . Gradient descent then behaves as though it was minimizing the loss plus ρ T g ∞ s (and ignoring R(ρ, θ s )). Essentially, classical gradient descent theory then guarantees ρ t → ρ ∞ , though the precise proof is nontrivial (see Appendix C). Given our equivalence, impossibility theorems on (heterogeneous) federated learning under (converging) gradient attacks imply impossibility results under data poisoning. 
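To summarize the chain of reductions established in this section (informally, with quantifiers and probability statements omitted; Lemma 1, together with the observation that a model attack can be implemented by honest gradient reporting, gives the converse direction):

```latex
% Informal summary of the reductions (quantifiers and probability statements omitted).
\text{converging gradient attack } (g_s^t \to g_s^\infty,\ \rho^t \to \rho^\infty)
\ \overset{\text{Lem.~3}}{\Longrightarrow}\
\text{model attack } \theta_s^{\spadesuit} \text{ with } \nabla_\rho R(\rho^\infty, \theta_s^{\spadesuit}) \approx g_s^\infty
\ \overset{\text{Lem.~2}}{\Longrightarrow}\
\text{data poisoning } \mathcal{D}_s \text{ (honest data for } \theta_s^{\spadesuit}\text{)}
\text{ with } \rho^*(\mathcal{D}) \approx \rho^*(\theta_s^{\spadesuit}, \mathcal{D}_{-s}) \approx \rho^\infty .
```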
For instance, [EFG + 21] and [HKJ20] proved theorems saying that the more heterogeneous the learning, the more vulnerable it is in a Byzantine context. In fact, and interestingly, [EFG + 21] and [HKJ20] actually leverage a model attack. Before translating the corresponding result, some work is needed to formalize what Byzantine resilience may mean in our setting. Definition 2. A personalized learning algorithm ALG achieves (F, N, C)-Byzantine learning if, for any subset H ⊂ [N ] of honest users with |H| = N − F , any honest vectors θ † H ∈ (R d ) |H| , given any ε, δ > 0, there exists I such that, when each honest user h ∈ H provides honest datasets D † h by answering I queries with model θ † h , then, with probability at least 1−δ, for any poisoning datasets D ♠ where θ † H is the average of honest users' preferred models. is a reasonable measure of the heterogeneity among honest users. Thus, our definition captures well the robustness of the algorithm ALG, for heterogeneous learning under data poisoning. Interestingly, our equivalence theorem allows to translate the model-attack-based impossibility theorems of [EFG + 21] into an impossibility theorem on data poisoning resilience. To the best of our knowledge, though similar to collaborative PAC learning [BHPQ17] , local PAC* learnability is a new concept in the context of personalized federated learning. It is thus important to show that it is not unrealistic. To achieve this, in this section, we provide sufficient conditions for a personalized learning model to be locally PAC* learnable. First, we construct local losses L n as sums of losses per input, i.e. for some "loss per input" function and a weight ν > 0. Appendix E gives theoretical and empirical arguments are provided for using such a sum (as opposed to an expectation). Remarkably, for linear or logistic regression, given such a loss, local PAC* learning can then be guaranteed. Theorem 2 (Personalized least square linear regression is locally PAC* learning). Consider 2 2 , 2 or smooth-2 regularization. Assume that, to generate a data x i , a user with preferred parameter θ † ∈ R d first independently draws a random vector query Q i ∈ R d from a bounded query distributioñ Q, with positive definite matrix 3 Σ = E Q i Q T i . Assume that the user labels Q i with answer A i = Q T i θ † +ξ i , where ξ i is a zero-mean sub-Gaussian random noise with parameter σ ξ , independent from Q i and other data points. Finally, assume that (θ, (Q i , A i )) = 1 2 (θ T Q i − A i ) 2 . Then the personalized learning algorithm is locally PAC* learning. Theorem 3 (Personalized logistic regression is locally PAC*-learning). Consider 2 2 , 2 or smooth-2 regularization. Assume that, to generate a data x i , a user with preferred parameter θ † ∈ R d first independently draws a random vector query Q i ∈ R d from a query distributionQ, whose support Supp(Q) is bounded and spans the full vector space R d . Assume that the user then labels Q i with answer A i = 1 with probability σ(Q T i θ † ), and labels it . Then the personalized learning algorithm is locally PAC* learning. The full proofs of theorems 2 and 3 are given in Appendix F. Here, we provide proof outlines. In both cases, we leverage the following stronger form of PAC* learning. Definition 3 (Gradient-PAC*). 
Let E(D, θ † , I, A, B, α) the event defined by The loss L is gradient-PAC* if, for any K > 0, there exist constants A K , B K > 0 and α K < 1, such that for any θ † ∈ R d with θ † 2 ≤ K, assuming that the dataset D is obtained by honestly collecting and labeling I data points according to the preferred model θ † , the probability of the event E(D, θ † , I, A K , B K , α K ) goes to 1 as I → ∞. Intuitively, this definition asserts that, as we collect more data from a user, then, with high probability, the gradient of the loss at any point θ too far from θ † will point away from θ † . In particular, gradient descent is then essentially guaranteed to draw θ closer to θ † . The right-hand side of the equation defining E(D, θ † , I, A, B, α) is subtly chosen to be strong enough to guarantee local PAC*, and weak enough to be verified by linear and logistic regression. ξ i Q i , which can be controlled by appropriate concentration bounds. Meanwhile, for logistic regression, for |b| ≤ K, we observe that (a − b)(σ(a) − σ(b)) ≥ c K min(|a − b| , |a − b| 2 ). Essentially, this proves that gradient-PAC* would hold if the empirical loss was replaced by the expected loss. The actual proofs, however, are nontrivial, especially in the case of logistic regression, which leverages topological considerations to derive a critical uniform concentration bound. Now, under very mild assumptions on the regularization R (not even convexity!), which are verified by the 2 2 , 2 and smooth-2 regularizations, we prove that the gradient-PAC* learnability through suffices to guarantee that personalized learning will be locally PAC* learning. Lemma 5. Consider 2 2 , 2 or smooth-2 regularization. If is gradient-PAC* and nonnegative, then personalized learning is locally PAC*-learning. Sketch of proof. Given other users' datasets, R yields a fixed bias. But as the user provides more data, by gradient-PAC*, the local loss dominates, thereby guaranteeing local PAC*-learning. Appendix G provides a full proof. Combining the two lemmas clearly yields theorems 2 and 3 as special cases. Note that our result actually applies to a more general set of regularizations and losses. Deep neural networks generally do not verify gradient PAC*. After all, because of symmetries like neuron swapping, different values of the parameters might compute the same neural network function. Thus the "preferred model" θ † is arguably ill-defined for neural networks 4 . Nevertheless, we may consider a strategic user who only aims to bias the last layer. In particular, assuming that all layers but the last one of a neural network are pretrained and fixed, then our theory may apply to the parameters of the last layer. We now construct a practical data poisoning attack, by introducing a new gradient attack, and by then leveraging our equivalence to turn it into a data poisoning attack. We define a simple, general and practical gradient attack, which we call the counter-gradient attack (CGA). Intuitively, this attack estimates the sum g †,t −s of the gradients of other users based on its value at the previous iteration, which can be inferred from the way the global model ρ t−1 was updated into ρ t . More precisely, apart from initializationĝ 1 −s 0, CGA makes the estimation Strategic user s then reports the plausible gradient that moves the global model closest to the user's target model θ † s , assuming others reportĝ t −s . 
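Concretely, one iteration of this estimate-then-respond computation can be sketched as follows, using the server update rule of Section 2 to infer the other users' aggregate gradient, and the closed forms of Proposition 2 below for the projection onto the set of plausible gradients. All names are illustrative.

```python
import numpy as np

def cga_step(rho_prev, rho_t, g_prev_s, theta_target, eta_prev, eta_t, lam, reg="l2sq"):
    """One iteration of the counter-gradient attack (CGA), as sketched in the text.
    rho_prev, rho_t: global models at iterations t-1 and t.
    g_prev_s: gradient reported by the strategic user at iteration t-1.
    theta_target: the strategic user's target model."""
    # The server update is rho_t = rho_prev - eta_prev * (g_prev_s + g_others),
    # so the sum of the other users' gradients can be estimated as:
    g_others_hat = (rho_prev - rho_t) / eta_prev - g_prev_s
    # Ideal report h: the vector such that
    # rho_{t+1} = rho_t - eta_t * (h + g_others_hat) = theta_target.
    h = (rho_t - theta_target) / eta_t - g_others_hat
    if reg == "l2sq":
        return h  # every vector is a plausible gradient under l2^2 regularization
    # l2 / smooth-l2: project onto the ball B(0, lam) of plausible gradients
    norm = np.linalg.norm(h)
    return h * min(1.0, lam / norm) if norm > 0 else h
```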
In other words, at every iteration, CGA reports Note that this attack only requires user s to know the learning rates η t−1 and η t , the global models ρ t−1 and ρ t , and their target model θ † s . For convex sets Grad(ρ t ), it is straightforward to see that CGA boils down to computing the orthogonal projection of h t s on Grad(ρ t ). This yields very simple computations for 2 2 , 2 and smooth-2 regularizations. Proposition 2. For 2 2 regularization, CGA reports g t s = h t s . For 2 or smooth-2 regularization, CGA reports g t s = h t s min 1, λ/ h t s 2 . Proof. Equation (7) boils down to minimizing the distance between ρ t −θ † s ηt −ĝ t −s and Grad(ρ), which is the ball B(0, λ). This minimum is the orthogonal projection. Theoretical analysis. We prove that CGA is perfectly successful against 2 2 regularization. To do so, we suppose that, at each iteration t and for each user n = s, the local models θ n are fully optimized with respect to ρ t , and the honest gradients of g †,t n are used to update ρ. Theorem 4. Consider 2 2 regularization. Assume that is convex and L -smooth, and that η t = η is small enough. Then CGA is converging and optimal, as ρ t → θ † s . Sketch of proof. The main challenge is to guarantee that the other users' gradients g †,t n for n = s remain sufficiently stable over time to guarantee convergence, which can be done by leveraging L-smoothness. The full proof, with the necessary upper-bound on η, is given in Appendix H. The analysis of the convergence against smooth-2 is unfortunately significantly more challenging. Here, we simply make a remark about CGA at convergence. Proposition 3. If CGA against smooth-2 regularization converges for η t = η, then it either achieves perfect manipulation, or it is eventually partially honest, in the sense that the gradient by CGA correctly points towards θ † s . Proof. Denote P the projection onto the closed ball B(0, λ). If CGA converges, then, by Proposi- Empirical evaluation of CGA. We deployed CGA to bias the federated learning of MNIST. We consider a strategic user whose target model is one that labels 0's as 1's, 1's as 2's, and so on, until 9's that are labeled as 0's. In particular, this target model has a nil accuracy. Figure 1 shows that such a user effectively hacks the 2 2 regularization against 10 honest users who each have 6,000 data points of MNIST, in the case where local models only undergo a single gradient step at each iteration, but fails to hack the 2 regularization. See Appendix I for more details. We also ran a similar successful attack on the last layer of a deep neural network trained on cifar-10, which is detailed in Appendix J. We now show how to turn a gradient attack into model attack, against 2 2 regularization. It is trivial to transform any gradient g ∞ s such that ρ ∞ = θ † s into a model attack by setting θ ♠ s θ † s − 1 2 g ∞ s , as Proposition 4. Consider the 2 2 regularization. Suppose that g t s → g ∞ s and ρ t → θ † s , with a constant learning rate η t = η. Then, under the model attack θ ♠ Proof. Given a constant learning rate, the convergence ρ t → θ † s implies that the sum of honest users' gradients at ρ = θ † s equals −g ∞ s . Therefore, to achieve ρ * = θ † s , it suffices to send θ ♠ s such that the gradient of λ ρ − θ ♠ s 2 2 with respect to ρ at ρ = θ † s equals g ∞ s . Since the gradient is The case of linear regression. 
In linear regression, any model attack can be turned into a single data poisoning attack, as proved by the following theorem whose proof is given in Appendix K. Theorem 5. Consider the 2 2 regularization and linear regression. For any data D −s and any target value θ † s , there is a datapoint (Q, A) to be injected by user s such that The case of linear classification. We now consider linear classification, with the case of MNIST. By Lemma 2, any model attack can be turned into data poisoning, by (mis)labeling sufficiently many (random) data points, However, this may require creating too many data labelings, especially if the norm of θ ♠ s is large (which holds if s faces many active users), as suggested by Theorem 3. For efficient data poisoning, define the indifference affine subspace V ⊂ R d as the set of images with equiprobable labels. Intuitively, labeling images close to V is very informative, as it informs us directly about the separating hyperplanes. To generate images, we draw random images, project them orthogonally on V and add a small noise. We then label the image probabilistically with model θ ♠ s . Note that this leads us to consider images not in [0, 1] d . Nevertheless, Figure 2d shows the effectiveness of the resulting data poisoning attack, with only 2,000 data points, as opposed to the 60,000 honestly labeled data points that the 10 other nodes cumulatively have. Remarkably, complete data relabeling was achieved by poisoning merely 3.3% of the total database. More details are given in Appendix L. We show in this paper that, unlike what has been argued, e.g., [SHKR22], the gradient attack threat is not unrealistic. More precisely, for personalized federated learning with local PAC* guarantees, we proved that effective gradient attacks can be derived from strategic data reporting, with potentially surprisingly few data. In fact, by leveraging our newly found equivalence, we derived new impossibility theorems on what any robust learning can guarantee, even under data poisoning only. Yet such attacks are known to be ubiquitous for many high-risk applications, like content recommendation. Arguably, a lot more security measures are urgently needed to make large-scale learning algorithms safe. All our experiments are run on the classical datasets MNIST and FashionMNIST. We provide all of the source codes to reproduce the experiments: • The sum versus expectation experiments can be run by executing this file: https://www.dropbox.com/sh/qdgmz9air24nhyr/AAAtycEkxc_1hGbvU5YG18z4a?dl=0 • The counter-gradient attack experiments can be run by executing this file: https://www.dropbox.com/sh/bycqkccgmk4muzn/AACRD1yeTglLSHEd1OOAzmVqa?dl=0 • The data poisoning attack experiments can be run by executing this file: https://www.dropbox.com/sh/qodnl6ivzti8hch/AADgX4EYuSOotiMCAHyTIiGMa?dl=0 • The cifar10 on VGG 13-BN experiments can be run by executing this file: The experiments are seeded and the CuDNN backend is configured in deterministic mode in order to reduce the sources of non-determinism. We also turn of the benchmark mode. Executing the codes will generate the figures and statistics of our main paper, and most of the figures of our Appendix. Our other figures can be obtained by adjusting the hyperparameters of our codes. The full description of the architecture and optimisation algorithm used is described in Appendix E. The experimental setup details of each experiment are provided in the Appendix, along with additional results. 
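For reference, the generator behind the data poisoning experiment mentioned above can be sketched as follows, written for a binary linear classifier so that the indifference subspace V is simply the hyperplane {x : θᵀx = 0}; the multiclass case used for MNIST instead projects onto the affine subspace where all class scores are equal. Names and default values are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def poison_points(theta_attack, n_points, dim, noise_scale=0.01, seed=None):
    """Generate poisoned (x, label) pairs near the indifference set of a binary
    linear classifier theta_attack (sketch of the attack described above)."""
    rng = np.random.default_rng(seed)
    unit = theta_attack / np.linalg.norm(theta_attack)
    points = []
    for _ in range(n_points):
        x = rng.standard_normal(dim)
        x = x - (x @ unit) * unit                      # orthogonal projection onto {x : theta^T x = 0}
        x = x + noise_scale * rng.standard_normal(dim)  # small noise off the subspace
        label = int(rng.random() < sigmoid(theta_attack @ x))  # probabilistic labeling by theta_attack
        points.append((x, label))
    return points
```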
The Appendix also contains the full proofs of our theorems. The safety of algorithms is arguably a prerequisite to their ethics. After all, an arbitrarily manipulable large-scale algorithm will unavoidably endanger the targets of the entities that successfully design such algorithms. Typically, unsafe large-scale recommendation algorithms may be hacked by health disinformation campaigns that aim to promote non-certified products, e.g., by falsely pretending that they cure COVID-19. Such algorithms must not be regarded as ethical, even if they were designed with the best intentions. We believe that our work helps understand the vulnerabilities of such algorithms, and will motivate further research in the ethics and security of machine learning. [DFP10] We say that f : R d → R is locally strongly convex if, for any convex compact set C ⊂ R d , there exists µ > 0 such that f is µ-strongly convex on C, i.e. for any x, y ∈ C and any λ ∈ [0, 1], we have It is well-known that if f is differentiable, this condition amounts to saying that ∇f (x) − ∇f (y) 2 ≥ µ x − y 2 for all x, y ∈ C. And if f is twice differentiable, then it amounts to saying ∇ 2 f (x) µI for all x ∈ C. Lemma 6. If f is locally strongly convex and g is convex, then f + g is locally strongly convex. Definition 5. We say that f : Lemma 7. If f is L f -smooth and g is L g -smooth, then f + g is (L f + L g )-smooth. Lemma 8. Suppose that f : R d × R d → R is locally strongly convex and L-smooth, and that, for any x ∈ X, where X ⊂ R d is a convex compact subset, the map y → f (x, y) has a minimum y * (x). Note that local strong convexity guarantees the uniqueness of this minimum. Then, there exists K such that the function y * is K-Lipschitz continuous on X. Proof. The existence and uniqueness of y * (x) hold by strong convexity. Fix x, x . By optimality of y * , we know that ∇ y f (x, y * (x)) = ∇ y f (x , y * (x )) = 0. We then have the following bounds where we first used the local strong convexity assumption, then the fact that ∇ y f (x, y * (x)) = 0, then the fact that ∇ y f (x , y * (x )) = 0, and then the L-smooth assumption. Lemma 9. Suppose that f : R d × R d → R is locally strongly convex and L-smooth, and that, for any x ∈ X, where X ⊂ R d is a convex compact subset, the map y → f (x, y) has a minimum y * (x). Define g(x) min y∈Y f (x, y). Then g is convex and differentiable on X and ∇g(x) = ∇ x f (x, y * (x)). Proof. First we prove that g is convex. Let x 1 , x 2 ∈ R d , and λ 1 , λ 2 ∈ [0, 1] with λ 1 + λ 2 = 1. For any y 1 , y 2 ∈ R d , we have ≤ f (λ 1 x 1 + λ 2 x 2 , λ 1 y 1 + λ 2 y 2 ) ≤ λ 1 f (x 1 , y 1 ) + λ 2 f (x 2 , y 2 ). Taking the infimum of the right-hand side over y 1 and y 2 yields g(λ 1 x 1 +λ 2 x 2 ) ≤ λ 1 g(x 1 )+λ 2 g(x 2 ), which proves the convexity of g. Now denote h(x) = ∇ x f (x, y * (x)). We aim to show that ∇g(x) = h(x). Let ε ∈ R d small enough so that x + ε ∈ X. Now note that we have which shows that h(x) is a superderivative of g at x. We now show that it is also a subderivative. To do so, first note that its value at x + ε is approximately the same, i.e. where we used the L-smoothness of f and Lemma 8. Now notice that But we know that h(x + ε) − h(x) 2 = O( ε 2 ). Rearranging the terms then yields which shows that h(x) is also a subderivative. Therefore, we know that g(x + ε) = g(x) + ε T h(x) + o( ε 2 ), which boils down to saying that g is differentiable in x ∈ X, and that ∇g(x) = h(x). Lemma 10. Suppose that f : X × R d → R is µ-strongly convex, where X ⊂ R d is closed and convex. 
Then g : X → R, defined by g(x) = inf y∈Y f (x, y), is well-defined and µ-strongly convex too. Proof. The function y → f (x, y) is still strongly convex, which means that it is at least equal to a quadratic approximation around 0, which is a function that goes to infinity in all directions as y 2 → ∞. This proves that the infimum must be reached within a compact set, which implies the existence of a minimum. Thus g is well-defined. Moreover, for any x 1 , x 2 ∈ X, y 1 , y 2 ∈ R d , and λ 1 , λ 2 ≥ 0 with λ 1 + λ 2 = 1, we have where we used the µ-strong convexity of f . Taking the infimum over y 1 , y 2 implies the µ-strong convexity of g. Now instead of proving our theorems for different cases separately, we make the following assumptions on the components of the global loss that encompasses both 2 2 and smooth-2 regularization, a well as linear regression and logistic regression. Assumption 1. Assume that is convex and L -smooth, and that R(ρ, θ) = R 0 (ρ − θ), where R 0 : R d → R is locally strongly convex (i.e. strongly convex on any convex compact set), L R 0smooth and satisfy R 0 (z) = Ω( z 2 ) as z 2 → ∞. Lemma 11. Under Assumption 1, Loss is locally strongly convex and L-smooth. Proof. All terms of Loss are L 0 -smooth, for an appropriate value of L 0 . By Lemma 7, their sum is thus also L-smooth, for an appropriate value of L. Now, given Lemma 6, to prove that Loss is locally strongly convex, it suffices to prove that ν θ n 2 2 + R 0 (ρ − θ 1 ) is locally strongly convex. Consider any convex compact set C ⊂ R d×(1+N ) . Since R 0 is locally strongly convex, we know that there exists µ > 0 such that ∇ 2 R 0 µI. As a result, Now define α 2µ ν+2µ . Clearly, 0 < α < 1. Moreover, 0 ≤ 1 α θ 1 − αρ 2 2 = 1 α 2 θ 1 2 2 + α 2 ρ 2 2 − 2ρ T θ 1 . Therefore 2ρ T θ 1 ≤ α 2 ρ 2 2 + 1 α 2 θ 1 2 2 , which thus implies which proves that ∇ 2 Loss κI, with κ > 0. This shows that Loss is locally strongly convex. Lemma 12. Under Assumption 1, ρ → θ * (ρ, D) is Lipchitz continuous on any compact set. Proof. Define f n (ρ, θ n ) ν θ n 2 2 + x∈Dn (θ n , x)+λ ρ − θ n 2 2 . If is L-smooth, then f n is clearly (|D n | L + ν + λ)-smooth. Moreover, if is convex, then for any ρ, the function θ n → f n (ρ, θ n ) is at least ν-strongly convex. Thus Lemma 8 applies, which guarantees that ρ → θ * (ρ, D) is Lipchitz. Lemma 13. Under Assumption 1, ρ → Loss(ρ, θ * (ρ, D), D) is L-smooth and locally strongly convex. Proof. By Lemma 11, the global loss is known to be L-smooth, for some value of L and locally strongly convex. Denoting f : ρ → Loss(ρ, θ * (ρ, D), D), we then have which proves that f is L-smooth. For strong convexity, note that since the global loss function is locally strongly convex, for any compact convex set C, there exists µ such that Loss(ρ, θ, D) is µ-strongly convex on C = (C 1 , C 2 ) ⊂ (R d , R N ×d ), therefore, by Lemma 10, f (ρ) will also be µ-strongly convex on C 1 which means that f (ρ) is locally strongly convex. B Proof of the equivalence B.1 Proof of the reduction from model attack to data poisoning Proof of Lemma 1. We omit making the dependence of the optima on D explicit, and we consider any other models ρ and θ −s . We have the following inequalities: where we used the optimality of (ρ * , θ * ) in the second line, and where we repeatedly used the fact that θ * s = θ ♠ s . This proves that (ρ * , θ * −s ) is a global minimum of the modified loss. First, we define the following modified loss function: where θ −s and D −s are variables and datasets for users n = s. 
We then define ρ * (θ ♠ s , D −s ) and θ * −s (θ ♠ s , D −s ) as a minimum of the modified loss function, and θ * s (θ ♠ s , D −s ) θ ♠ s . We now prove a slightly more general version of Lemma 2, which applies to a larger class of regularizations. It also shows how to construct the strategic's user data poisoning attack. Lemma 14 (Reduction from data poisoning to model attack). Assume local PAC* learning. Suppose also that R is continuous and that R(ρ, θ) → ∞ when ρ − θ 2 → ∞. Consider any datasets D −s and any attack model θ ♠ s such that the modified loss Loss s has a unique minimum ρ * (θ ♠ s , D −s ), θ * −s (θ ♠ s , D −s ). Then, for any ε, δ > 0, there exists I such that if user s's dataset D s contains at least I inputs drawn from model θ ♠ s , then, with probability at least 1 − δ, we have Clearly, 2 2 , 2 and smooth-2 are continuous regularizations, and verify R(ρ, θ) → ∞ when ρ − θ 2 → ∞. Moreover, setting δ 1/2 shows that the probability that the dataset D s satisfies the inequalities of Lemma 14 is positive. This implies in particular that there must be a dataset D s that satisfies these inequalities. All in all, this shows that Lemma 14 implies Lemma 2. Proof of Lemma 14. Let ε, δ > 0 and θ ♠ s ∈ R d . Denote ρ ♠ ρ * (θ ♠ s , D −s ) and θ ♠ θ * (θ ♠ s , D −s ) the result of strategic user s's model attack. We define the compact set C by We define D R d×N − C the closure of the complement of C. Clearly, ρ ♠ , θ ♠ −s / ∈ D. We aim to show that, when strategic user s reveals a large dataset D s whose answers are provided using the attack model θ ♠ s , then the same holds for any global minimum of the global loss ρ * ( D), θ * −s ( D) ∈ C. Note that, to prove this, it suffices to prove that the modified loss takes too large values, even when θ ♠ s is replaced by θ * s ( D). Let us now formalize this. Denote By a similar argument as that of Lemma 5, using the assumption R → ∞ at infinity, we know that the infimum is actually a minimum. Moreover, given that the minimum of the modified loss Loss s is unique, we know that the value of the loss function at this minimum is different from its value at ρ ♠ , θ ♠ −s . As a result, we must have η > 0. Now, since the function R is differentiable, it must be continuous. By the Heine-Cantor theorem, it is thus uniformly continuous on all compact sets. Thus, there must exist κ > 0 such that, for all models θ s satisfying θ s − θ ♠ s 2 ≤ κ, we have Now, Lemma 5 guarantees the existence of I such that, if user s provides a dataset D s of least I answers with the model θ ♠ s , then with probability at least 1 − δ, we will have θ * ε) . Under this event, we then have This shows that there is a high probability event under which the minimum of ρ, θ −s → Loss s ρ, θ −s , θ * s ( D), D −s cannot be reached in D. This is equivalent to what the theorem we needed to prove states. Proof of Lemma 3. We define By Lemma 13, we know that Loss 1 s (ρ) is locally strongly convex and has a unique minimum. By the definition of ρ ∞ , we must have n =s ∇ ρ R(ρ ∞ , θ * n (ρ ∞ )) + g ∞ s = 0, and thus ∇ ρ Loss 1 s (ρ ∞ ) = 0. Now define and ρ * (θ s ), its minimizer. Therefore, we have By Lemma 13, we know that Loss 2 s is locally strongly convex. Therefore, there exists µ 1 > 0 such that Loss 2 s (ρ, θ s ) is µ 1 -strongly convex in (θ s , ρ) : ∇ ρ R(ρ ∞ , θ s ) − g ∞ s 2 ≤ ε 2 , ρ − ρ * (θ s ) 2 ≤ 1 for ε 2 small enough. Therefore, for any 0 < ε < 1, if ρ ∞ − ρ * (θ s ) 2 > ε, we then have and thus ∇ ρ Loss 2 s (ρ ∞ , θ s ) 2 ≥ µ 1 ε. which is a contradiction. 
Therefore, we must have In fact, if g ∞ s belongs to the interior of Grad(ρ ∞ ), we can guarantee ∇ρR(ρ ∞ , θ ♠ s ) = g ∞ s . In this section, we prove a slightly more general result than Proposition 1. Namely, instead of working with specific regularizations, we consider a more general class of regularizations, identified by Assumption 1. Lemma 15. Suppose Assumption 1 holds true. Assume that L n is convex and L-smooth for all nodes n ∈ [N ]. If g t s converges and if η t = η is a constant small enough, then ρ t will converge too. Note that since 2 2 and smooth-2 regularizations satisfy Assumption 1, Lemma 15 clearly implies Proposition 1. We now introduce the key objects of the proof of Lemma 15. Denote g ∞ s the limit of the attack gradients g t s . We now define and prove that ρ t will converge to the minimizer of Loss 1 s (ρ). By Lemma 13, we show that Loss 1 s (ρ) is both locally strongly convex and L-smooth. Now define ζ t s g t s −g ∞ s . We then know ζ t s → 0 and ∇Loss 1 s (ρ t ) is the sum of all gradient vectors received from all users assuming the strategic user s sends the vector g ∞ s in all iterations. Thus, at iteration t of the optimization algorithm, we will take one step in the direction G t ∇Loss 1 We now prove the following lemma that bounds the difference between the function value in two successive iterations. Lemma 16. If Loss 1 s (ρ) is L-smooth and η t ≤ 1/L, we have Proof. Since Loss 1 s is L-smooth, we have Now plugging ρ t+1 − ρ t = −η t G t and ∇Loss 1 s (ρ t ) = G t − ζ t s into the inequality implies where we used the fact η t ≤ 1/L. Lemma 17. There is M such that, for all t, Loss 1 s (ρ t ) ≤ M . Proof. Consider the closed ball B(ρ * , 1) centered on ρ * and of radius 1. By Lemma 13, we know that Loss 1 s is locally strongly convex and thus there exists a µ 1 > 0 such that Loss 1 s is µ 1 -strongly convex on B(ρ * , 1). Now consider a point ρ 1 on the boundary of B(ρ * , 1). By strong convexity we have Now similarly, by the convexity of Loss 1 s on R d , for any ρ ∈ R d −B(ρ * , 1), we have ∇Loss 1 s (ρ 1 ) 2 ≥ √ µ 1 . Now since ζ t s → 0, there exists an iteration T 1 after which (t ≥ T 1 ), we have ζ t s 2 ≤ 1 4 √ µ 1 , Thus, for ρ t − ρ * 2 ≥ 1, the loss cannot increase at the next iteration. Now consider the case ρ t − ρ * 2 < 1 for t ≥ T 1 . The smoothness of Loss 1 s implies ∇Loss 1 s (ρ t ) 2 < L. Therefore, Now we define M 1 max ρ∈B(ρ * ,1+η(L+ 1 4 √ µ 1 )) Loss 1 s (ρ), the maximum function value in the closed ball B ρ * , 1 + η(L + 1 4 √ µ 1 ) . Therefore, we have Loss 1 s (ρ t+1 ) ≤ M 1 . So far we proved that for t ≥ T 1 , in each iteration of gradient descent either the function value will not increase or it will be upper-bounded by M 1 . This implies that for all t, the function value Loss 1 s (ρ t ) is upper-bounded by This concludes the proof. Lemma 18. There is a compact set X such that, for all t, ρ t ∈ X. Proof. Now since Loss 1 s is µ 1 -strongly convex in B(ρ * , 1), for any point ρ ∈ R d such that ρ − ρ t 2 = 1, we have But now by the convexity of Loss 1 s in R d , for any ρ such that ρ − ρ * 2 ≥ 1, we have This implies that if ρ t − ρ * 2 > 2 µ 1 M 2 − Loss 1 s (ρ * ) , then Loss 1 s (ρ t ) > M 2 . Therefore, we must have ρ t − ρ * 2 ≤ 2 µ 1 M 2 − Loss 1 s (ρ * ) , for all t ≥ 0. This describes a closed ball, which is a compact set. C.0.2 Convergence of the global model under converging gradient attack Lemma 19. Suppose u t ≥ 0 verifies u t+1 ≤ αu t + δ t , with δ t → 0. Then u t → 0. Proof. 
We now show that for any ε > 0, there exists an iteration T (ε), such that for t ≥ T (ε), we have u t ≤ ε. For this, note that by induction, we observe that, for all t ≥ 0, Since δ t → 0, there exists an iteration T 2 (ε) such that for all t ≥ T 2 (ε), we have δ t ≤ ε(1−α) 2 . Therefore, for t ≥ T 2 (ε), we have Denoting M 0 (ε) Therefore, for t ≥ ln ε 2(u 0 +M 0 (ε)) ln α , we have This proves that u t → 0. We now prove Lemma 15 (and hence Proposition 1). Proof of Lemma 15. Define X based on Lemma 18. Since Loss 1 s is locally strongly convex, there exists µ 2 > 0 such that Loss 1 s is µ 2 -strongly convex in a convex compact set X containing ρ t for all t ≥ 0. By the strong convexity of Loss 1 s (ρ), we have Now, using the fact we have But now note that Loss 1 s (ρ t ) − Loss 1 s (ρ * ) ≥ Loss 1 s (ρ t ) − Loss 1 s (ρ t+1 ). Thus, combining Equation (87) and Lemma 16 yields By rearranging the terms, we then have Now note that η ≤ 1/L < 1/µ 2 and thus 0 < 1 − µ 2 η < 1. We now define two sequences u t ρ t − ρ * 2 and δ t = η ζ t s 2 . We already know that δ t → 0, and we want to show u t also converges to 0. By Equation (90), we have which implies and thus Lemma 19 allows to conclude. But then, by the triangle inequality, we must have This is a contradiction. Thus (F, N, C)-Byzantine learning cannot be guaranteed for F ≥ N/2. for the case H = [N ] − [F ]. The first inequality implies ρ ALG ≤ F/(N − F ), while the second can then be rewritten But this equation is now deterministic. Since it must hold with a strictly positive probability, it must thus hold deterministically. Moreover, it holds for any ε > 0. Taking the limit ε → 0 yields the result. In this section, we provide both theoretical and empirical results to argue for using a sum-based local loss over an expectation-based local loss. Table 1 : Accuracy of trained models, depending on the use of expectation (denoted E) or sum (Σ), and on the use of linear classifier (L) or a 2-layer neural net (N N ) . Here, all users are honest and an 2 2 regularization is used, but there is a large heterogeneity in the amount of data per user. Indeed, intuitively, if one considers an expectation E x∼Dn [ (θ n , x)] rather than a sum, as is done by [HHHR20], [DTN20] and [EFG + 21], then the weight of an honest active user's local loss will not increase as a node provides more and more data, which will hinder the ability of θ n to fit the user's local data. In fact, intuitively, using an expectation wrongly yields the same influence to any two nodes, even when one (honest) node provides a much larger dataset D n than the other, and should thus intuitively be regarded as "more reliable". There is another theoretical argument for using the sum rather than the expectation. Namely, if the loss is regarded as a Bayesian negative log-posterior, given a prior exp − n∈[N ] ν θ n 2 − n∈[N ] R(ρ, θ n ) on the local and global models, then the term that fits local data should equal the negative loglikelihood of the data, given the models (ρ, θ). Assuming that the distribution of each data point x ∈ D n is independent from all other data points, and depends only on the local model θ n , this negative log-likelihood yields a sum over data points; not an expectation. We also empirically compared the performances of sum as opposed to the expectation. To do so, we constructed a setting where 10 "idle" users draw randomly 10 data points from the FashionMNIST dataset, while one "active" user has all of the FashionMNIST dataset (60,000 data points). 
We then learned local and global models, with R(ρ, θ) λ ρ − θ 2 2 , λ = 1. We compared two different classifiers to which we refer as a "linear model" and "2-layers neural network", both using CrossEntropy loss. The linear model has (784 + 1) × 10 parameters. The neural network has 2 layers of 784 parameters with bias, with ReLU activation in between, adding up to ((784+1)×784+(784+1)×10. Note also that, in all our experiments, we did not consider any local regularization, i.e. we set ν 0. All our experiments are seeded with seed 999. To see a strong difference between sum and average, we made the FashionMNIST dataset harder to learn, by randomly labeling 60% of the training set. Table 1 reports the accuracy of local and global models in the different settings. Our results clearly and robustly indicate that the use of sums outperforms the use of expectations. On each of the following plots, we display the top-1 accuracy on the MNIST test dataset (10 000 images) for the active user, for the global model and for one of the idle users (in Table 1 , the mean accuracy for idle users is reported), as we vary the value of λ. Intuitively, λ models how much we want the local models to be similar. In the case of learning FashionMNIST, given that the data is i.i.d., larger values of λ are more meaningful (though our experiments show that they may hinder convergence speed). However, in settings where users have different data distributions, e.g. because the labels depend on users' preferences, then smaller values of λ may be more relevant. Note that the use of a common value of λ in both cases is slightly misleading, as using the sum intuitively decreases the comparative weight of the regularization term. To reduce this effect, for this experiment only, we divide the local losses by the average of the number of data points per node for the sum version. This way, if the number of points is equal for all nodes, the two losses will be exactly the same. What's more, our experiments seem to robustly show that using the sum consistently outperforms the expectation, for both a linear classifier and a 2-layer neural network, for the problem of noisy FashionMNIST classification. Recall that we introduced noise into FashionMNIST to make the problem harder to learn and observe a clear difference between the average and the sum. In this section, we present results of our experiments when the noise is removed. Even without noise, the difference between using the sum and using the expectation still seems important. We acknowledge, however, that the plots suggest that even though we ran this experiment for 10 times more (and 5 times more for the linear model) than other experiments, we might not have reached convergence yet, and that the use of the expectation might still eventually gets closer to the case of sum. We believe that the fact that the difference between sum and expectation in the absence of noise is weak is due to the fact that the FashionMNIST dataset is sufficiently linearly separable. Thus, we achieve a near-zero loss in both cases, which make the sum and the expectation close at optimum. Even in this case, however, we observed that the sum clearly outperforms the expectation especially, in the first epochs. We argue that the reason for this is the following. By taking the average in local losses, the weights of the data of idle nodes are essentially blown out of proportion. As a result, the optimizer will very quickly fit these data. 
However, the signal from the data of the active node will then be too weak, so that the optimizer has to first almost perfectly fit the idle nodes' data before it can catch the signal of the active node's data and hence the average achieves weaker convergence performances than the sum. Throughout this section, we use the following terminology. Definition 6. Consider a parameterized event E(I). We say that the event E occurs with high probability if P [E(I)] → 1 as I → ∞. Define Σ 2 max x 2 =0 ( Σx 2 / x 2 ) the 2 operator norm of the matrix Σ. For symmetric matrices Σ, this is also the largest eigenvalue in absolute value. Theorem 6 (Covariance concentration, Theorem 6.5 in [Wai19]). Denote Σ = E Q i Q T i , where Q i ∈ R d is from a σ Q -sub-Gaussian random distributionQ. Then, there are universal constants c 1 , c 2 and c 3 such that, for any set {Q i } i∈[I] of i.i.d. samples fromQ, and any δ > 0, the sample , each increasingly ordered. Then and for each i = 1, ..., d. Lemma 20. Consider two symmetric definite positive matrices S and Σ. Denote ρ min and λ min their minimal eigenvalues. Then |ρ min − λ min | ≤ S − Σ 2 . Proof. This is a direct consequence of Theorem 7, for A = S, B = Σ − S, i = 1, and j = 0. Corollary 3. There are universal constants c 1 , c 2 and c 3 such that, for any σ Q -sub-Gaussian vector distributionQ ∈ R d and any δ > 0, the sample covariance Σ = 1 where min Sp(Σ) and min Sp(Σ) are the minimal eigenvalues ofΣ and Σ. Proof. This follows from Theorem 6 and Lemma 20. Lemma 21. With high probability, min Sp(Σ) ≥ min Sp(Σ)/2. Proof. Denote λ min min Sp(Σ) and λ min min Sp(Σ). Since each Q i is drawn i.i.d. from a σ Q -sub-Gaussian, we can apply Corollary 3. Namely, there are constants c 1 , c 2 and c 3 , such that for any δ > 0, we have We now set δ λ min /(4σ 2 Q ) and we consider I large enough so that c 1 d I + d I ≤ λ min /(4σ 2 Q ). With high probability, we then have λ min ≥ λ min /2. In this section, we prove the first part of Lemma 4. Namely, we prove that linear regression is gradient-PAC* learning. Before moving to the main proof that linear regression is gradient-PAC*, we first prove a few useful lemmas. These lemmas will rest on the following well-known theorems. Theorem 8 (Lemma 2.7.7 in [Ver18]). If X and Y are sub-Gaussian, then XY is sub-exponential. Lemma 23. There exists B such that i∈I ξ i Q i 2 ≤ BI 3/4 with high probability. Proof. By Lemma 22, the terms ξ i Q i [j] are iid, sub-exponential and have zero mean. Therefore, by Theorem 9, there exist constants c 4 and c 5 such that for any coordinate j ∈ [d] of ξ i Q i and for all 0 ≤ u ≤ c 4 , we have Plugging u = vI (−1/4) into the inequality for some small enough constant v, and using union bound then yields Defining B v √ d yields the lemma. We now move on to proving that least square linear regression is gradient-PAC*. Proof of Theorem 2. Note that ∇ θ (θ, Q, A) = (θ T Q − A)Q. Thus, on input i ∈ [I], we have Moreover, we have As a result, we have But now, with high probability, we have (θ − θ † ) T Σ(θ − θ † ) ≥ (λ min /2) θ − θ † 2 2 (Lemma 21) and i∈I ξ i Q i 2 ≤ BI (3/4) (Lemma 23). Using the fact that θ † 2 ≤ K and the Cauchy-Schwarz inequality, we have Denoting A K λ min 2 and B K B + 2νK and using the fact that I ≥ 1, we then have with high probability. This corresponds to saying Assumption 3 is satisfied for α = 3/4. In this section, we now prove the second part of Lemma 4. Namely, we prove that logistic regression is gradient-PAC* learning. 
In this section, we now prove the second part of Lemma 4. Namely, we prove that logistic regression is gradient-PAC* learning. We first prove two useful lemmas about the following logistic distance function.

Definition 7. We define the logistic distance function between two reals $a$ and $b$ by $D(a, b) \triangleq (\sigma(a) - \sigma(b))(a - b)$, where $\sigma$ is the sigmoid function.

Lemma 24. If $a, b \in \mathbb{R}$ are such that, for some $k > 0$, $|a| \leq k$ and $|b| \leq k$, then there exists some constant $c_k > 0$ such that $D(a, b) \geq c_k (a - b)^2$.

Proof. Note that the derivative of $\sigma(z)$ is strictly positive, symmetric ($\sigma'(z) = \sigma'(-z)$) and monotonically decreasing for $z \geq 0$. Therefore, for any $z \in [-k, k]$, we know $\sigma'(z) \geq \sigma'(k) \triangleq c_k$. Thus, by the mean value theorem, we have $(\sigma(a) - \sigma(b))/(a - b) \geq c_k$. Multiplying both sides by $(a - b)^2$ then yields the lemma.

Lemma 25. If $b \in \mathbb{R}$ and $|b| \leq k$ for some $k > 0$, then there exists a constant $d_k > 0$ such that, for any $a \in \mathbb{R}$, we have $D(a, b) \geq d_k \min\{|a - b|, (a - b)^2\}$.

Proof. Assume $|a - b| \geq 1$ and define $d_k \triangleq \sigma(k + 1) - \sigma(k)$. If $b \geq 0$, since $\sigma'(z)$ is decreasing for $z \geq 0$, the increase of $\sigma$ over the unit interval adjacent to $b$ is at least $d_k$; the case $b < 0$ is handled symmetrically. Therefore, $D(a, b) \geq d_k |a - b|$. For the case of $|a - b| \leq 1$, the claimed bound also holds.

F.3.2 A uniform lower bound

We next show that, when the support of $\tilde{Q}$ spans $\mathbb{R}^d$, we have $\mathbb{E}|Q^T u| > 0$ for every unit vector $u \in S^{d-1}$.

Proof. Let $u \in S^{d-1}$. We know that there exist $Q_1, \ldots, Q_d \in \mathrm{Supp}(\tilde{Q})$ and $\alpha_1, \ldots, \alpha_d \in \mathbb{R}$ such that $u$ is colinear with $\sum_j \alpha_j Q_j$. In particular, we then have $u^T \sum_j \alpha_j Q_j = \sum_j \alpha_j (Q_j^T u) \neq 0$. Therefore, there must be a query $Q_* \in \mathrm{Supp}(\tilde{Q})$ such that $Q_*^T u \neq 0$, which implies $a \triangleq |Q_*^T u| > 0$. By continuity of the scalar product, there must then also exist $\varepsilon > 0$ such that, for any $Q \in B(Q_*, \varepsilon)$, we have $|Q^T u| \geq a/2$, where $B(Q_*, \varepsilon)$ is the Euclidean ball centered on $Q_*$ and of radius $\varepsilon$. But now, by definition of the support, we know that $p \triangleq \mathbb{P}[Q \in B(Q_*, \varepsilon)] > 0$. By the law of total expectation, we then have $\mathbb{E}|Q^T u| \geq p \, a / 2 > 0$, which is the lemma.

Lemma 27. Assume that, for all unit vectors $u \in S^{d-1}$, we have $\mathbb{E}|Q^T u| > 0$, and that $\mathrm{Supp}(\tilde{Q})$ is bounded by $M_Q$. Then there exists $C > 0$ such that, with high probability, $\sum_{i \in [I]} |Q_i^T u| \geq C I$ for all $u \in S^{d-1}$.

Proof. By continuity of the scalar product and of the expectation operator, and by compactness of $S^{d-1}$, there exists $C_0 > 0$ such that $\mathbb{E}|Q^T u| \geq C_0$ for all $u \in S^{d-1}$. Now define $\varepsilon \triangleq C_0 / (4 M_Q)$. Note that $S^{d-1} \subset \bigcup_{u \in S^{d-1}} B(u, \varepsilon)$. Thus we have a covering of the hypersphere by open sets. But since $S^{d-1}$ is compact, we know that we can extract a finite covering. In other words, there exists a finite subset $S \subset S^{d-1}$ such that $S^{d-1} \subset \bigcup_{u \in S} B(u, \varepsilon)$. Put differently, for any $v \in S^{d-1}$, there exists $u \in S$ such that $\|u - v\|_2 \leq \varepsilon$.

Now consider $u \in S$. Given that $\mathrm{Supp}(\tilde{Q})$ is bounded, we know that $|Q_i^T u| \in [0, M_Q]$. Moreover, such variables $|Q_i^T u|$ are i.i.d. By Hoeffding's inequality, for any $t > 0$, the probability that $\sum_{i \in [I]} |Q_i^T u| \leq I (\mathbb{E}|Q^T u| - t)$ is at most $\exp(-2 I t^2 / M_Q^2)$. Choosing $t = C_0 / 2$ then yields $\mathbb{P}\big[ \sum_{i \in [I]} |Q_i^T u| \leq C_0 I / 2 \big] \leq \exp(-I C_0^2 / (2 M_Q^2))$. Taking a union bound over $u \in S$ then guarantees that the event $\{\forall u \in S, \ \sum_{i \in [I]} |Q_i^T u| \geq C_0 I / 2\}$ has probability at least $1 - |S| \exp(-I C_0^2 / (2 M_Q^2))$, which clearly goes to 1 as $I \to \infty$. Thus $\forall u \in S, \ \sum_{i \in [I]} |Q_i^T u| \geq C_0 I / 2$ holds with high probability.

Now consider $v \in S^{d-1}$. We know that there exists $u \in S$ such that $\|u - v\|_2 \leq \varepsilon$. Then, we have $\sum_{i \in [I]} |Q_i^T v| \geq \sum_{i \in [I]} |Q_i^T u| - \sum_{i \in [I]} |Q_i^T (u - v)| \geq C_0 I / 2 - I M_Q \varepsilon = C_0 I / 4$, which proves the lemma with $C \triangleq C_0 / 4$.

Lemma 28. Assume that $\tilde{Q}$ has a bounded support, whose interior contains the origin. Suppose also that $\|\theta^\dagger\|_2 \leq K$. Then there exists $A_K > 0$ such that, with high probability, $\sum_{i \in [I]} D(\theta^T Q_i, \theta^{\dagger T} Q_i) \geq A_K I \min\{\|\theta - \theta^\dagger\|_2, \|\theta - \theta^\dagger\|_2^2\}$ for all $\theta \in \mathbb{R}^d$.

Proof. Note that, by the Cauchy-Schwarz inequality, we have $|\theta^{\dagger T} Q_i| \leq \|\theta^\dagger\|_2 \|Q_i\|_2 \leq K M_Q$. Thus, Lemma 25 implies the existence of a positive constant $d_K$ such that, for all $\theta \in \mathbb{R}^d$, each logistic distance $D(\theta^T Q_i, \theta^{\dagger T} Q_i)$ is lower bounded in terms of $|Q_i^T u_{\theta - \theta^\dagger}|$, where $u_{\theta - \theta^\dagger} \triangleq (\theta - \theta^\dagger)/\|\theta - \theta^\dagger\|_2$ is the unit vector in the direction of $\theta - \theta^\dagger$. Now, by Lemma 27, we know that, with high probability, for all unit vectors $u \in S^{d-1}$, we have $\sum_{i \in [I]} |Q_i^T u| \geq C I$. Thus, for $I$ sufficiently large, for any $\theta \in \mathbb{R}^d$, with high probability, the linear part of the bound holds. We now focus on the case of $\|\theta - \theta^\dagger\|_2 \leq f_K$.
The triangle inequality yields $|\theta^T Q_i| \leq |\theta^{\dagger T} Q_i| + |(\theta - \theta^\dagger)^T Q_i| \leq (K + f_K) M_Q$. Thus, by Lemma 24, we know there exists some constant $c_K$ such that $D(\theta^T Q_i, \theta^{\dagger T} Q_i) \geq c_K (Q_i^T (\theta - \theta^\dagger))^2$. Since the distribution $\tilde{Q}$ is bounded (and thus sub-Gaussian), by Theorem 6, with high probability, we have $\sum_{i \in [I]} (Q_i^T (\theta - \theta^\dagger))^2 = I (\theta - \theta^\dagger)^T \hat{\Sigma} (\theta - \theta^\dagger) \geq (\lambda_{\min}/2) I \|\theta - \theta^\dagger\|_2^2$, where $\lambda_{\min}$ is the smallest eigenvalue of $\Sigma = \mathbb{E}[Q Q^T]$. Combining this with the bound obtained for the other case, and defining $A_K \triangleq \min\{\lambda_{\min} c_K / 2, e_K\}$, we then obtain the lemma.

Now we proceed with the proof that logistic regression is gradient-PAC*.

Proof of Theorem 3. Note that $\sigma(-z) = e^{-z} \sigma(z) = 1 - \sigma(z)$ and $\sigma'(z) = e^{-z} \sigma(z)^2$. We then have $\nabla_\theta \ell(\theta, Q, A) = (\sigma(\theta^T Q) - \mathbf{1}[A = 1]) Q$, where $\mathbf{1}[A = 1]$ is the indicator function that outputs 1 if $A = 1$, and 0 otherwise. As a result, $(\theta - \theta^\dagger)^T \sum_{i \in [I]} \nabla_\theta \ell(\theta, Q_i, A_i) = \sum_{i \in [I]} D(\theta^T Q_i, \theta^{\dagger T} Q_i) + (\theta - \theta^\dagger)^T \sum_{i \in [I]} Z_i$, with $Z_i \triangleq (\sigma(\theta^{\dagger T} Q_i) - \mathbf{1}[A_i = 1]) Q_i$. By Lemma 28, with high probability, the first term is at least $A_K I \min\{\|\theta - \theta^\dagger\|_2, \|\theta - \theta^\dagger\|_2^2\}$. To control the second term, note that the random vectors $Z_i$ are i.i.d., bounded and have zero mean. Therefore, by applying Hoeffding's bound to every coordinate of $Z_i$, and then taking a union bound, there exists $B > 0$ such that, with high probability, $\big\| \sum_{i \in [I]} Z_i \big\|_2 \leq B I^{3/4}$. Applying now the Cauchy-Schwarz inequality, with high probability, the second term is at least $-B I^{3/4} \|\theta - \theta^\dagger\|_2$. Combining this with the $\nu$-regularization term and using $\|\theta^\dagger\|_2 \leq K$, we then obtain the gradient lower bound with $B_K \triangleq B + 2 \nu K$. This shows that Assumption 3 is satisfied for the logistic loss with $\alpha = 3/4$, and $A_K$ and $B_K$ as previously defined.

Before proving the theorem, we prove a useful lemma that bounds the set of possible values for the global model and the honest local models.

Lemma 29. Assume that $R$ and $\ell$ are nonnegative. For $I$ large enough, if all honest active nodes $h \in H$ provide at least $I$ data points, then, with high probability, $\theta^*_H$ must lie in a compact subset of $\mathbb{R}^{d \times H}$ that does not depend on $I$.

Proof. Essentially, we will show that, if $\theta^*_H$ is too far from $\theta^\dagger_H$, then the loss takes values strictly larger than $L_0$. Assumption 3 implies the existence of an event $E$ that occurs with probability at least $P_0 \triangleq P(K_H, I)^{|H|}$, under which the gradient lower bound holds for any $\theta_h \in \mathbb{R}^d$ and any $h \in H$. Note also that $P_0 \to 1$ as $I \to \infty$. We now integrate both sides over the line segment from $\theta^\dagger_h$ to $\theta_h$. The fundamental theorem of calculus for line integrals then lower bounds the loss gap between $\theta_h$ and $\theta^\dagger_h$. Now, if $\|\theta_h - \theta^\dagger_h\|_2 > 2$, this gap is at least $A_{K_H} I - 2 B_{K_H} I^\alpha$. Now, for $I > I_1 \triangleq \max\{ 2 L_0 / A_{K_H}, (4 B_{K_H} / A_{K_H})^{1/(1 - \alpha)} \}$, this implies that, if $\|\theta_h - \theta^\dagger_h\|_2 > 2$ for any $h \in H$, then the loss exceeds $L_0$, regardless of $\rho$ and $\theta_{-H}$. Therefore, we must have $\|\theta^\dagger_h - \theta^*_h\|_2 \leq 2$. Such inequalities describe a bounded closed subset of $\mathbb{R}^{d \times H}$, which is thus compact.

Lemma 30. Assume that $R(\rho, \theta) \to \infty$ as $\|\rho - \theta\|_2 \to \infty$, and that $\|\theta^\dagger_h - \theta^*_h\|_2 \leq 2$ for all honest users $h \in H$. Then $\rho^*$ must lie in a compact subset of $\mathbb{R}^d$ that does not depend on $I$.

Proof. Consider an honest user $h'$. Given our assumption on $R \to \infty$, we know that there exists $D_{K_H}$ such that, if $\|\rho - \theta^*_{h'}\|_2 \geq D_{K_H}$, then $R(\rho, \theta^*_{h'}) \geq L_0 + 1$. Thus any global optimum $\rho^*$ must satisfy $\|\rho^* - \theta^*_{h'}\|_2 \leq D_{K_H}$, which defines a compact subset of $\mathbb{R}^d$.

Proof of Lemma 5. Fix $\varepsilon, \delta > 0$. We want to show the existence of some value of $I(\varepsilon, \delta, \mathcal{D}_{-H}, \theta^\dagger)$ that guarantees $(\varepsilon, \delta)$-local PAC* learning for honest users. By Lemmas 29 and 30, we know that the set $C$ of possible values for $(\rho^*, \theta^*_H)$ is compact. Now, we define the maximum of the norm of achievable gradients at the optimum; we know that this maximum exists since $C$ is compact. Using the optimality of $(\rho^*, \theta^*)$, the first-order optimality conditions hold for all $h \in H$. Now note that $\mathbb{P}[E \wedge E'] = 1 - \mathbb{P}[\neg E \vee \neg E'] \geq 1 - \mathbb{P}[\neg E] - \mathbb{P}[\neg E'] \geq 2 P_0 - 1$. It now suffices to consider $I$ larger than $I_2$ and large enough so that $P(K_H, I)^{|H|} \geq 1 - \delta/2$ (whose existence is guaranteed by Assumption 3, and which guarantees $2 P_0 - 1 \geq 1 - \delta$), and so that the resulting bound on $\|\theta^*_h - \theta^\dagger_h\|_2$ is at most $\varepsilon$.

In other words, it is the loss when local models are optimized, and when the data of strategic user $s$ are removed.
Lemma 31. Assuming $\ell_2^2$ regularization and convex loss-per-input functions $\ell$, for any datasets $\mathcal{D}$, $\mathrm{Loss}$ is strongly convex. As a result, so is $\mathrm{Loss}^\rho_{-s}$.

Proof. Note that the global loss can be written as a sum of convex functions and of the term $\nu \sum_n \|\theta_n\|_2^2 + \lambda \|\rho - \theta_1\|_2^2$. Using tricks similar to the proof of Lemma 11, we see that the loss is strongly convex. The latter part of the lemma is then a straightforward application of Lemma 10.

We now move on to the proof of Theorem 4. Note that our statement of the theorem was not fully explicit, especially about the upper bound on the constant learning rate $\eta$. Here, we prove that it holds for $\eta_t = \eta \leq 1/(3L)$, where $L$ is a constant such that $\mathrm{Loss}^\rho_{-s}$ is $L$-smooth. The existence of $L$ is guaranteed by Lemma 13.

Proof of Theorem 4. Note that, by Lemma 9, $\mathrm{Loss}^\rho_{-s}$ is convex, differentiable and $L$-smooth, and $\nabla \mathrm{Loss}^\rho_{-s}(\rho^t) = g^{\dagger,t}_{-s}$. For $\ell_2^2$ regularization, we have $\mathrm{Grad}(\rho) = \mathbb{R}^d$ for all $\rho \in \mathbb{R}^d$. Then the minimum of (7) is zero; it is attained by the counter-gradient $g^t_s$, for which the resulting global update satisfies $\rho^{t+1} - \rho^t = \eta (g^{\dagger,t}_{-s} - g^{\dagger,t-1}_{-s}) - \eta (g^{\dagger,t-1}_{-s} - g^{\dagger,t-2}_{-s})$. Then, using the $L$-smoothness of $\mathrm{Loss}^\rho_{-s}$, and denoting $u_t \triangleq \|\rho^{t+1} - \rho^t\|_2$, we have $u_{t+1} \leq L \eta_t u_t + L \eta_{t-1} u_{t-1}$. Now assume that $\eta \leq 1/(3L)$. Then $u_{t+1} \leq \frac{1}{3}(u_t + u_{t-1})$. We then know that $u_{t+2} \leq \frac{1}{3}(u_{t+1} + u_t) \leq \frac{1}{3}\big( \frac{1}{3}(u_t + u_{t-1}) + u_t \big) = \frac{4}{9} u_t + \frac{1}{9} u_{t-1}$. The dominating sequence $(v_t)$, defined by the corresponding equality recursion, then satisfies $v_t \leq (\sqrt{7}/3)^t \, (\sqrt{7}/3) \max\{v_0, v_1\}$. Thus, defining $\alpha \triangleq \sqrt{7}/3 < 1$, there exists $C > 0$ such that $u_t \leq v_t \leq C \alpha^t$. This implies that $\sum_t \|\rho^{t+1} - \rho^t\|_2 \leq \sum_t C \alpha^t < \infty$. Thus the series $\sum_t (\rho^{t+1} - \rho^t)$ converges, which implies the convergence of $\rho^t$ to a limit $\rho^\infty$. By $L$-smoothness, we know that $g^{\dagger,t}_{-s}$ must converge too. Taking (174) to the limit then implies $\rho^\infty = \theta^\dagger_s$. This shows that the strategic user achieves precisely what they want with CGA. CGA is thus optimal.

In this section, CGA is executed against 10 honest users, each one having 6,000 data points of MNIST, drawn randomly and independently. CGA is run by a strategic user whose target model $\theta^\dagger_s$ labels 0's as 1's, 1's as 2's, and so on, until 9's as 0's. We learn $\theta^\dagger_s$ by relabeling the MNIST training dataset and learning from the relabeled data. We use $\lambda = 1$, the Adam optimizer and a decreasing learning rate.

Figure 15: Norm of the global model, distance to initialization and distance to target, under attack by CGA. In particular, we see that the attack against $\ell_2^2$ regularization is successful, as the distance between the global model and the target model goes to zero.
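For concreteness, the following is a minimal sketch (ours, not the authors' code) of how such a cyclically relabeled target model $\theta^\dagger_s$ can be obtained: relabel MNIST with $y \mapsto (y + 1) \bmod 10$ and fit a linear classifier on the relabeled data. The training details (optimizer, number of epochs, batch size) are illustrative assumptions rather than the paper's exact settings.

```python
import torch
import torch.nn.functional as F
from torchvision import datasets, transforms

train = datasets.MNIST("data", train=True, download=True,
                       transform=transforms.ToTensor())
loader = torch.utils.data.DataLoader(train, batch_size=256, shuffle=True)

target_model = torch.nn.Linear(784, 10)          # (784 + 1) x 10 parameters
opt = torch.optim.Adam(target_model.parameters())

for epoch in range(5):
    for x, y in loader:
        poisoned_y = (y + 1) % 10                # cyclic relabeling 0->1, ..., 9->0
        loss = F.cross_entropy(target_model(x.view(-1, 784)), poisoned_y)
        opt.zero_grad(); loss.backward(); opt.step()
```

The same cyclic relabeling defines the target used in the CIFAR-10 experiment described next.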
We considered VGG-13-BN, which was pretrained on CIFAR-10 by [Pha21]. We now assume that 10 nodes are given parts of the CIFAR-10 dataset, while a strategic node also joins the personalized federated gradient descent algorithm. The strategic node's goal is to bias the global model towards a target model, which misclassifies the CIFAR-10 data by reclassifying 0 into 1, 1 into 2, ..., and 9 into 0. We first show the result of performing the counter-gradient attack on the last layer of the neural network. Essentially, images are then reduced to their vector embeddings, and the last layer performs a simple linear classification akin to the case of MNIST (see Appendix I). Reconstructing an attack model whose effect is equivalent to the counter-gradient attack is identical to what was done in the case of MNIST (see Section 5.2). This last step is however nontrivial. On the one hand, we could simply use the attack model to label a large number of random images. However, this solution would likely require a large sample complexity. For a more efficient data poisoning, we can construct vector embeddings on the indifference affine subspace $V$, as was done for MNIST in Section 5.3. This is what is shown below. We acknowledge, however, that this does not quite correspond to data poisoning, as it requires reporting a vector embedding and its label, rather than an actual image and its label. The challenge is then to reconstruct an image that has a given vector embedding. We note that, while this is not a straightforward task in general, it has been shown to be at least somewhat possible for some neural networks, especially when they are designed to be interpretable [ZF14, WWZ+19, MCYJ19].

K Single data poisoning for least square linear regression

Proof of Theorem 5. We define the minimized loss with respect to $\rho$, and without strategic user $s$, by $\mathrm{Loss}^*_{-s}$. Now consider a subgradient $g \in \nabla_\rho \mathrm{Loss}^*_{-s}(\theta^\dagger_s, \mathcal{D}_{-s})$ of the minimized loss at $\theta^\dagger_s$. For $x \triangleq -g/(2\lambda)$, we then have $-g = \nabla (\lambda \|\cdot\|_2^2)(x)$. We then define $\theta^\spadesuit_s \triangleq \theta^\dagger_s - x$; the corresponding local loss $\mathrm{Loss}_s$ is defined by (39). Now consider the single data point $(Q, A)$ injected by the strategic user. Combining it all together with the uniqueness of the solution then yields the desired arg min, which is what we wanted.

L Data poisoning against linear classification

L.1 Generating efficient poisoning data

For every label $a \in \{1, \ldots, 9\}$, we define $y_a \triangleq \theta^\spadesuit_a - \theta^\spadesuit_0$ and $c_a \triangleq -(\theta^\spadesuit_{a0} - \theta^\spadesuit_{00})$ (where $\theta^\spadesuit_{a0}$ is the bias of the linear classifier for label $a$). The indifference subspace $V$ is then the set of images $Q \in \mathbb{R}^d$ such that $Q^T y_a = c_a$ for all $a \in \{1, \ldots, 9\}$.

To project any image $X \in \mathbb{R}^d$ onto $V$, let us first construct an orthogonal basis of the vector space orthogonal to $V$, using the Gram-Schmidt algorithm. Namely, we first define $z_1 \triangleq y_1$. Then, for any answer $a \in \{1, \ldots, 9\}$, we define $z_a \triangleq y_a - \sum_{b < a} \frac{y_a^T z_b}{\|z_b\|_2^2} z_b$. It is easy to check that, for $b < a$, we have $z_b^T z_a = 0$. By induction, we see that $z_a^T Q$ is a constant independent from $Q$, for $Q \in V$. Indeed, for $a = 1$, this is clear as $z_1^T Q = y_1^T Q = c_1$. Moreover, for $a > 1$, in the computation of $z_a^T Q$, $Q$ always appears as $z_b^T Q$ for $b < a$. Moreover, denoting (with a slight abuse of notation) $c_a$ the constant such that $z_a^T Q = c_a$ for all $Q \in V$ and $a \in \{1, \ldots, 9\}$, we see that these constants can be computed by induction from the relation $z_a^T Q = y_a^T Q - \sum_{b < a} \frac{y_a^T z_b}{\|z_b\|_2^2} z_b^T Q$.

Finally, we can simply perform repeated projections onto the hyperplanes where label $a$ is equally probable as the answer 0. To do this, we first define the orthogonal projection $P(X, y, c)$ of $X \in \mathbb{R}^d$ onto the hyperplane $x^T y = c$, which is given by $P(X, y, c) \triangleq X - \frac{X^T y - c}{\|y\|_2^2} y$. It is straightforward to verify that $P(X, y, c)^T y = c$ and that $P(P(X, y, c), y, c) = P(X, y, c)$.
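The following numpy sketch (ours, not the authors' code) illustrates the two constructions just described: Gram-Schmidt orthogonalization of the normals $y_1, \ldots, y_9$ and the orthogonal projection $P(X, y, c)$ onto a single hyperplane. The toy dimensions and random vectors are illustrative assumptions.

```python
import numpy as np

def gram_schmidt(ys):
    """Orthogonalize the family ys (list of vectors), in the order given."""
    zs = []
    for y in ys:
        z = y - sum((y @ zb) / (zb @ zb) * zb for zb in zs)
        zs.append(z)
    return zs

def project(X, y, c):
    """Orthogonal projection of X onto the hyperplane {x : x^T y = c}."""
    return X - ((X @ y - c) / (y @ y)) * y

# toy check in dimension d = 784
rng = np.random.default_rng(999)
d = 784
ys = [rng.normal(size=d) for _ in range(9)]
zs = gram_schmidt(ys)
assert abs(zs[0] @ zs[3]) < 1e-6                       # z_b^T z_a = 0 for b < a
X, c = rng.normal(size=d), 0.5
assert abs(project(X, ys[0], c) @ ys[0] - c) < 1e-6    # lands on the hyperplane
```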
We then canonically define the repeated projection by induction, as $P(X, (y_1, \ldots, y_{k+1}), (c_1, \ldots, c_{k+1})) \triangleq P(P(X, (y_1, \ldots, y_k), (c_1, \ldots, c_k)), y_{k+1}, c_{k+1})$.

Now consider any image $X \in \mathbb{R}^d$. Its projection can be obtained by setting $Q \triangleq P(X, (z_1, \ldots, z_9), (c_1, \ldots, c_9)) + \xi$. Note that, to avoid being exactly on the boundary, and thus to retrieve information about the scale of $\theta^\spadesuit$ and about which side of the boundary favors which label, we add a small noise $\xi$. This makes sure that $Q$ does not lie exactly on $V$ (which would lead to multiple solutions for the learning), while the noise remains small enough so that the probabilities of the different labels stay close to 0.1 (the equiprobable probability).

We acknowledge that images obtained this way may not lie in $[0, 1]^d$, like the images of the MNIST dataset. In general, one could search for points $Q \in V \cap [0, 1]^d$. Note that, in theory, by Theorem 3 (or a generalization of it), labeling random images in $[0, 1]^d$ should suffice. However, in the case where $V \cap [0, 1]^d$ is empty (typically if no image in $[0, 1]^d$ is regarded by the model $\theta^\spadesuit_s$ as realistically a 9), this procedure may require the labeling of significantly more images to be successful.

Using the efficient poisoning data fabrication, we thus have a set of images $(Q, p(Q))$, where $p_a(Q)$ is the probability assigned to image $Q$ and label $a$. This defines the following local loss for the strategic node: $L_s(\theta_s, \mathcal{D}_s) = \sum_{(Q, p(Q)) \in \mathcal{D}_s} \sum_{a \in \{0, 1, \ldots, 9\}} -p_a(Q) \log \sigma_a(\theta_s, Q)$, where $\sigma_a(\theta_s, Q) = \frac{\exp(\theta_{sa}^T Q + \theta_{sa0})}{\sum_b \exp(\theta_{sb}^T Q + \theta_{sb0})}$ is the probability that image $Q$ has label $a$, according to the model $\theta_s$. We acknowledge that such labelings of queries are unusual. Evidently, in practice, an image may be labeled $N$ times, and the number of labels $N_a$ it receives can be set to approximately $N_a \approx N p_a(Q)$.

It is noteworthy that the gradient of the loss function is then given by $\nabla_{\theta_{sa}} L_s(\theta_s, \mathcal{D}_s) = \sum_{(Q, p(Q)) \in \mathcal{D}_s} (\sigma_a(\theta_s, Q) - p_a(Q)) Q_+$, where we defined $Q_+ \triangleq (1, Q)$ (which allows us to factor in the bias of the model). This shows that $\nabla_{\theta_s} L_s(\theta_s, \mathcal{D}_s)$ points systematically away from $\theta^\spadesuit_s$, and thus that gradient descent will move towards $\theta^\spadesuit_s$. In fact, if the set of images $Q$ covers all dimensions (which occurs if there are $\Omega(d)$ images, which is the case for 2,000 images, since $d = 784$), then gradient descent will always move the model in the direction of $\theta^\spadesuit_s$, which will be the minimum. Moreover, by overweighting each data point $(Q, p(Q))$ by a factor $\alpha$ (as though the image $Q$ had been labeled $\alpha$ times), we can guarantee gradient-PAC* learning, which means that we will have $\theta^*_s \approx \theta^\spadesuit_s$, even in the personalized federated learning framework. This shows why data poisoning should work in theory, with relatively few data injections.

Note that the number of other users does make learning harder. Indeed, the norm of the gradient of the regularization $R(\rho, \theta_s)$ at $\rho = \theta^\dagger_s$ and $\theta_s = \theta^\spadesuit_s$ is equal to $2 \lambda \|\theta^\dagger_s - \theta^\spadesuit_s\|_2$. As the number $N - 1$ of other users grows, we should expect this distance to grow roughly proportionally to $N$. In order to make strategic user $s$ robustly learn $\theta^\spadesuit_s$, the norm of the gradient of the local loss $L_s$ at $\theta^\dagger_s$ must be vastly larger than $2 \lambda \|\theta^\dagger_s - \theta^\spadesuit_s\|_2$. This means that the value of $\alpha$ (or, equivalently, the number of data points injected into $\mathcal{D}_s$) must also grow proportionally to $N$.
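To make the soft-label loss and its gradient concrete, here is a small numpy sketch (ours, not the authors' code) of the cross-entropy $L_s$ above and of its per-label gradient $(\sigma_a(\theta_s, Q) - p_a(Q)) \, Q_+$, with $Q_+$ prepending a 1 to account for the bias. Variable names and shapes are illustrative assumptions.

```python
import numpy as np

def softmax_probs(theta, Q_plus):
    """theta: (10, d+1) weights with bias column first; Q_plus: (d+1,) with leading 1."""
    logits = theta @ Q_plus
    logits -= logits.max()                       # numerical stability
    e = np.exp(logits)
    return e / e.sum()

def local_loss_and_grad(theta, data):
    """data: list of (Q_plus, p), where p is a probability vector over the 10 labels."""
    loss, grad = 0.0, np.zeros_like(theta)
    for Q_plus, p in data:
        sigma = softmax_probs(theta, Q_plus)
        loss += -(p * np.log(sigma)).sum()       # soft-label cross-entropy
        grad += np.outer(sigma - p, Q_plus)      # row a holds (sigma_a - p_a) * Q_plus
    return loss, grad
```

Overweighting a data point by a factor $\alpha$ simply multiplies its contribution to both the loss and the gradient, which matches the scaling argument above on how many injections are needed as the number of other users grows.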