title: Inducing Equilibria via Incentives: Simultaneous Design-and-Play Finds Global Optima
authors: Liu, Boyi; Li, Jiayang; Yang, Zhuoran; Wai, Hoi-To; Hong, Mingyi; Nie, Yu Marco; Wang, Zhaoran
date: 2021-10-04

To regulate a social system comprised of self-interested agents, economic incentives (e.g., taxes, tolls, and subsidies) are often required to induce a desirable outcome. This incentive design problem naturally possesses a bi-level structure, in which an upper-level "designer" modifies the payoffs of the agents with incentives while anticipating the response of the agents at the lower level, who play a non-cooperative game that converges to an equilibrium. The existing bi-level optimization algorithms developed in machine learning raise a dilemma when applied to this problem: anticipating how incentives affect the agents at equilibrium requires solving the equilibrium problem repeatedly, which is computationally inefficient; bypassing the time-consuming step of equilibrium-finding can reduce the computational cost, but may lead the designer to a sub-optimal solution. To address this dilemma, we propose a method that tackles the designer's and the agents' problems simultaneously in a single loop. In particular, at each iteration, both the designer and the agents move only one step based on first-order information. In the proposed scheme, although the designer does not solve the equilibrium problem repeatedly, it can still anticipate the overall influence of the incentives on the agents, which guarantees optimality. We prove that the algorithm converges to the global optimum at a sublinear rate for a broad class of games.

Bypassing the time-consuming step of equilibrium-finding can reduce the computational cost, but may lead the designer to a sub-optimal solution. This dilemma prompts the following question that motivates this study: can we obtain the optimal solution to an incentive design problem without repeatedly solving the equilibrium problem? In this paper, we propose an efficient, principled method that tackles the designer's problem and the agents' problem simultaneously in a single loop. At the lower level, we use the mirror descent method [30] to model the process through which the agents move towards equilibrium. At the upper level, we use the gradient descent method to update the incentives towards optimality. At each iteration, both the designer and the agents move only one step based on first-order information. However, as discussed before, the upper-level gradient relies on the corresponding lower-level equilibrium, which is not available in the single-loop update. Hence, we propose to estimate the upper-level gradient with the implicit differentiation formula, in which the equilibrium strategy is replaced by the strategy at the current iteration; this estimate might be biased at the beginning. Nevertheless, we prove that if we improve the lower-level solution with larger step sizes, the upper-level and lower-level problems converge simultaneously. The proposed scheme hence guarantees optimality because, after convergence, it anticipates the overall influence of the incentives on the agents. The proposed approach is also efficient: the computational cost at each iteration is low, and the algorithm converges to the optimal solution at a provably fast rate.

Notation. We denote by $\langle \cdot, \cdot \rangle$ the inner product in vector spaces. Given $n$ matrices $\{A_i\}_{i=1}^n$, we denote by $\mathrm{blkdiag}(A_1, \ldots, A_n)$ the block diagonal matrix whose $i$th block is $A_i$.
For a vector $a = (a_i)$, we denote $a_{-i} = (a_j)_{j \neq i}$. For a finite set $\mathcal{X}$ of cardinality $n$, we denote $\Delta(\mathcal{X}) = \{\pi \in \mathbb{R}^n_+ : \sum_{x_i \in \mathcal{X}} \pi_{x_i} = 1\}$. For any vector norm $\|\cdot\|$, we denote by $\|\cdot\|_* = \sup_{\|z\| \le 1} \langle \cdot, z \rangle$ its dual norm. Given a strongly convex function $\psi$, we define the corresponding Bregman divergence as $D_\psi(x, y) = \psi(x) - \psi(y) - \langle \nabla \psi(y), x - y \rangle$.

The incentive design problem studied in this paper is a special case of mathematical programs with equilibrium constraints (MPEC) [14], a class of optimization problems constrained by equilibrium conditions. MPEC is closely related to bi-level programs [7], which bind two mathematical programs together by treating one program as part of the constraints for the other. The study of bi-level programming can be traced back to the Stackelberg game in economics [36]. In the optimization literature, it was first introduced to tackle resource allocation problems [4] and has since found applications in such diverse topics as revenue management, network design, traffic control, and energy systems. In the past decade, researchers have discovered numerous applications of bi-level programming in machine learning (ML), including meta-learning [9], adversarial learning [16], hyperparameter optimization [23], and neural architecture search [19]. These bi-level programs in ML are often solved by descent methods, which require differentiating through the (usually unconstrained) lower-level optimization problem [20]. The differentiation can be carried out either implicitly on the optimality conditions, as in conventional sensitivity analysis [see e.g., 1, 32, 3], or explicitly by unrolling the numerical procedure used to solve the lower-level problem [see e.g., 23, 10]. In the explicit approach, one may "partially" unroll the solution procedure (i.e., stop after just a few rounds, or even only one round) to reduce the computational cost. Although this popular heuristic has delivered satisfactory performance on many practical tasks [21, 27, 9, 19], it cannot guarantee optimality for bi-level programs in the general setting, because it cannot produce an accurate upper-level gradient at each iteration [38].

Unlike bi-level programs, MPECs (see [22] for a comprehensive introduction) remain relatively underexplored in the ML literature. Recently, Li et al. [18] extended the explicit differentiation method for bi-level programs to MPECs. Their algorithm unrolls an iterative projection algorithm for solving the lower-level problem, which is formulated as a variational inequality (VI) problem. Leveraging recent advances in differentiable programming [1], they embedded each projection iteration as a differentiable layer in a computational graph and thereby transformed the explicit differentiation into standard backpropagation through the graph. Although backpropagation is efficient, constructing and storing such a graph, which may require a large number of projection layers to reach a good solution to the lower-level problem, is still demanding. To reduce this burden, one may partially unroll the iterative projection algorithm, yet this still cannot guarantee optimality for MPECs, for the same reason as for bi-level programs. The simultaneous design-and-play approach developed in this study addresses this dilemma. Our approach follows the algorithms of Hong et al. [15] and Chen et al. [6], which solve bi-level programs via single-loop updates.
Importantly, they solve both the upper-and the lower-level problem using a gradient descent algorithm and establish the relationship between the convergence rate of the single-loop algorithm and the step size used in gradient descent. However, their algorithms are limited to the cases where the lower-level optimization problem is unconstrained. Our work extends these single-loop algorithms to MPECs that have an equilibrium problem at the lower level. We choose mirror descent as the solution method to the lower-level problem because of its broad applicability to optimization problems with constraints [30] and generality in the behavioral interpretation of games [25, 17] . We show that the convergence of the proposed simultaneous design-and-play approach relies on the setting of the step size for both the upperand lower-level updates, a finding that echos the key result in [15] . We first give the convergence rate under mirror descent and the unconstrained assumption and then extend the result to the constrained case. For the latter, we show that convergence cannot be guaranteed if the lower-level solution gets too close to the boundary early in the simultaneous solution process. To avoid this trap, the standard mirror descent method is revised to carefully steer the lower-level solution away from the boundary. We study incentive design in both atomic games [29] and nonatomic games [33], classified depending on whether the set of agents is endowed with an atomic or a nonatomic measure. In social systems, both types of games can be useful, although the application context varies. Atomic games are typically employed when each agent has a non-trivial influence on the payoffs of other agents. In a nonatomic game, on the contrary, a single agent's influence is negligible and the payoff could only be affected by the collective behavior of agents. Consider a game played by a finite set of agents N = {1, . . . , n}, where each agent i ∈ N selects a strategy a i ∈ A i ⊆ R d i to maximize the reward, which is determined by a continuously differentiable Suppose that for all i ∈ N , the strategy set A i is closed and convex, and the reward function Example 3.1 (Cournot oligopoly). In the Cournot oligopoly model, each firm i ∈ N supplies the market with a quantity a i of goods. We assume that the price of the good is p = p 0 − j∈N γ j · a j , where a, γ 1 , . . . , γ n > 0. The profit of the firm i is given by where c i : R → R is the cost function for the firm i. Consider a game played by a continuous set of agents, which can be divided into a finite set of classes N = {1, . . . , n}. We assume that each i ∈ N represents a class of infinitesimal and homogeneous agents sharing the finite strategy set A i with |A i | = d i . The mass distribution for the class i is defined as a vector q i ∈ ∆(A i ) that gives the proportion of agents using each strategy. Let the cost of an agent in class i to select a strategy k ∈ A i given q = (q 1 , . . . , q n ) be c ik (q). Formally, a joint mass distribution q ∈ ∆(A) = i∈N ∆(A i ) is a Nash equilibrium, also known as a Wardrop equilibrium [37] , if for all i ∈ N , there exists b i such that The following result extends the VI formulation to Nash equilibrium in nonatomic game: denote Example 3.2 (Routing game). Consider a set of agents traveling from source nodes to sink nodes in a directed graph with nodes V and edges E. 
Denote N ⊆ V × V as the set of source-sink pairs, K i ⊆ 2 E as the set of paths connecting i ∈ N and E ik ⊆ E as the set of all edges on the path k ∈ K i . Suppose that each source-sink pair i ∈ N is associated with ρ i nonatomic agents aiming to choose a route from K i to minimize the total cost incurred. Let q ik and f ik be the proportion and number of travelers using the path k ∈ K w , then we have f ik = ρ i · q ik for all i ∈ N and k ∈ K i . Let x e be the number of travelers using the edge e, then the path and edge flow are related by x e = i∈N k∈K i f ik · δ eik , where δ eik equals 1 if e ∈ E ik and 0 otherwise. Let the cost on edge e be t e (x e ), then the total cost for a traveler selecting a path k ∈ K i will be e∈E t e (x e ) · δ eik . Despite the difference, we see that an equilibrium of both atomic and nonatomic games can be formulated as a solution to a corresponding VI problem in the following form where v i and X i denote different terms in the two types of games. Suppose that there exists an incentive designer aiming to induce a desired equilibrium. To this end, the designer can add incentives θ ∈ Θ ⊆ R d in the reward/cost functions. In this respect, the function v i in (3.3) will be parameterized by θ, denoted as v i θ . We assume that the designer's objective is determined by a function f : Θ × X → R. Denote S(θ) as the solution set to (3.3) . We then obtain the uniform formulation of the incentive design problem for both atomic games and nonatomic games If the equilibrium problem admits multiple equilibria, the agents may converge to different ones and it would be difficult to determine which one can better predict the behaviors of the agents without further information. Therefore, in this paper, we only consider the case where the game admits a unique equilibrium because, in the absence of uniqueness, additional rules are usually required to single out a representative one, which is beyond the scope of this paper. Sufficient conditions under which the game admits a unique equilibrium will be provided later. We propose to update θ and x simultaneously to improve the computational efficiency. The game dynamics at the lower level is modeled using the mirror descent method. Specifically, at the stage k, given θ k and x k , the agent first receives v i θ k (x k ) as the feedback. As in many cases, its exact value is difficult to obtain [25], we assume that the agents receive v i k as an estimator of v i θ k (x k ). After receiving the feedback, the agents update their strategies via is the Bregman divergence induced by a strongly convex function ψ i . Subsequently, the designer aims to update θ k in the opposite direction of the gradient of its objective function, which can be written as To compute the exact value of such a gradient, we first need to obtain the equilibrium x * (θ k ). However, at the stage k, we only have the current strategy x k . Therefore, we also have to establish an estimator of ∇x * (θ k ) and ∇f * (θ k ) using x k , the form of which will be specified later. We first consider unconstrained games with X i = R d i , for all i ∈ N . We select ψ i (·) as smooth function, i.e., there exists a constant H ψ ≥ 1 such that for all i ∈ N and Examples of ψ i satisfying this assumption include (but are not limited to) It can be directly checked that we can set H ψ = max i∈N δ i , where δ i is the largest singular value of Q i . In this case, the corresponding Bregman divergence becomes which is known as the squared Mahalanobis distance. 
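To make the lower-level update concrete, the following is a minimal sketch (not the paper's code) of one mirror-descent step under the assumptions of this subsection: an unconstrained strategy set X_i = R^{d_i} and the quadratic potential ψ_i(x) = ½ xᵀQ_i x, so that the Bregman divergence is the squared Mahalanobis distance and the update reduces to a preconditioned gradient step. The function name, the step-size symbol beta_i, and the toy values are illustrative.

```python
import numpy as np

def mirror_step_quadratic(x_i, v_hat_i, Q_i, beta_i):
    """One lower-level mirror-descent step for agent i (unconstrained case).

    With psi_i(x) = 0.5 * x^T Q_i x, the Bregman divergence D_{psi_i} is the
    squared Mahalanobis distance, and maximizing
        beta_i * <v_hat_i, x> - D_{psi_i}(x, x_i)
    over x in R^{d_i} gives the closed-form preconditioned step below.
    """
    return x_i + beta_i * np.linalg.solve(Q_i, v_hat_i)

# Toy usage: with Q_i = I the step is plain gradient ascent on agent i's reward.
x_i = np.zeros(2)
v_hat_i = np.array([1.0, -0.5])   # noisy feedback v_i^k received by agent i
print(mirror_step_quadratic(x_i, v_hat_i, np.eye(2), beta_i=0.1))
```

A better-conditioned Q_i only rescales the step direction; it does not change the fixed points of the dynamics.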
Before laying out the algorithm, we first give the following lemma characterizing ∇ θ x * (θ). Proof. See Appendix A.2 for a detailed proof. Then for any given θ ∈ Θ and x ∈ X , we can define as an estimator of ∇f * (θ). Now we are ready to present the following bi-level incentive design algorithm for unconstrained games. Update strategy profile where ∇f k is an estimator of ∇f (θ k , x k+1 ) as defined in (4.3). Output: Last iteration incentive parameter θ k+1 and strategy profile x k+1 . We then consider the case where for all i ∈ N , x i is constrained within the probability simplex Here 1 d i ∈ R d i is the vector of all ones. In such a case, we naturally consider which is the Shannon entropy. Such a choice gives the following Bregman divergence, which is known as the KL divergence. To construct the estimator of ∇f * (θ), we have the following lemma as an extension of Lemma 4.1. As a consequence of Lemma 4.2, corresponding to (4.3), we define as an extension to the gradient ∇f * (θ), where J θ is an estimate of J θ . In addition to a different gradient estimate, we also modify Algorithm 1 to keep the iterations x k from hitting the boundary at the early stage. The modification involves an additional step that mixes the strategy with a uniform strategy 1 d i /d i , i.e., imposing an additional step upon finishing the update (4.4), where ν k+1 ∈ (0, 1) is a the mixing parameter, decreasing to 0 when k → ∞. In the following, we give the formal presentation of the modified bi-level incentive design algorithm for simplex constrained games. Input: θ 0 ∈ Θ, x 0 ∈ X , step sizes (α k , {β i k } i∈N ), k ≥ 0, and mixing parameters ν k , k ≥ 0. For k = 0, 1, . . . do: Update strategy profile where ∇f k is an estimator of ∇f (θ k , x k+1 ) defined in (4.5). In this section, we study the convergence of the proposed algorithms. We first make the following assumptions on the incentive design problem and our algorithms. For simplicity, define Assumption 5.1. The lower-level problem in (3.4) satisfies: • The strategy set X i of player i is a nonempty, compact, and convex subset of R d i . • There exist constants ρ θ , ρ x > 0 such that for all x ∈ X and θ ∈ Θ, • For all θ ∈ Θ, the equilibrium x * (θ) of the game is variationally stable with respect to D ψ , i.e., for all x ∈ X , Assumption 5.2. The upper-level problem in (3.4) satisfies: • The set Θ is compact and convex. • The function f * (θ) is µ-strongly convex, i.e., there exists some µ > 0 such that, for any θ, θ ′ ∈ Θ and x ∈ X , • The gradient ∇f * (θ) is uniformly bounded, i.e., there exists M > 0 such that for all θ ∈ Θ, it holds that ∇f * (θ) 2 ≤ M . • The extended gradient ∇f (θ, x) is Lipschitz continuous with respect to D ψ , i.e., there exists H > 0 such that for all x, x ′ ∈ X and θ ∈ Θ, The filtrations satisfy: • The gradient estimates v k and ∇f k are unbiased estimates, i.e., • The gradient estimates v k and ∇f k have bounded mean squared estimation errors, i.e., there exist δ f , δ u > 0 such that Remark. Under Assumption 5. In this part, we establish the convergence guarantee of Algorithm 1 for unconstrained games. We define the optimality gap ǫ θ k and the equilibrium gap ǫ x k+1 as We track such two gaps as the convergence criteria in the subsequent results. Theorem 5.4. For Algorithm 1, set the step sizes where H * = ρ θ /ρ x . Suppose that Assumptions 5.1-5.3 hold, then we have Proof. See Appendix B for a detailed proof and a detailed expression of convergence rates. 
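As a reading aid, here is a self-contained sketch of the single-loop scheme of Algorithm 1 on a hypothetical two-firm Cournot game with a per-unit tax θ as the incentive. It is not the paper's implementation: the implicit-gradient estimate uses the standard implicit-function form ∇_θ x*(θ) ≈ −[∇_x v_θ(x_k)]^{-1} ∇_θ v_θ(x_k) evaluated at the current iterate, in the spirit of Lemma 4.1 and the estimator (4.3), and the step sizes follow the schedules of Theorem 5.4 (α_k ∝ 1/(k+1), β_k ∝ 1/(k+1)^{2/3}). The demand model, the designer's objective (steering total output toward a target), and all constants are made up for illustration.

```python
import numpy as np

# Hypothetical Cournot duopoly regulated by a per-unit tax theta (illustration only).
p0, gamma = 10.0, 1.0            # inverse demand: p = p0 - gamma * (a_1 + a_2)
c = np.array([1.0, 2.0])         # marginal costs of the two firms
target = 4.0                     # total output the designer wants to induce

def v(theta, a):
    """Stacked pseudo-gradient of the firms' profits: v_i = d r_i / d a_i."""
    return p0 - gamma * a.sum() - gamma * a - c - theta

def grad_a_v(a):
    """Jacobian of v with respect to a (constant for this linear-quadratic game)."""
    n = a.size
    return -gamma * (np.ones((n, n)) + np.eye(n))

def grad_theta_v(a):
    """Jacobian of v with respect to theta."""
    return -np.ones((a.size, 1))

def grad_f(theta, a):
    """Gradients of the designer's objective f(theta, a) = (sum(a) - target)^2."""
    grad_x = 2.0 * (a.sum() - target) * np.ones(a.size)   # d f / d a
    grad_t = np.zeros(1)                                  # d f / d theta (no direct dependence)
    return grad_t, grad_x

theta, a = np.zeros(1), np.array([1.0, 1.0])
alpha0, beta0 = 0.2, 0.5
for k in range(5000):
    alpha_k = alpha0 / (k + 1)               # upper-level step size
    beta_k = beta0 / (k + 1) ** (2.0 / 3.0)  # lower-level step size, decaying more slowly

    # Lower level: one mirror-descent step (identity potential => gradient ascent on profit).
    a = a + beta_k * v(theta, a)

    # Upper level: implicit-differentiation estimate of grad f*(theta), with the
    # equilibrium x*(theta) replaced by the current strategy profile a.
    dx_dtheta = -np.linalg.solve(grad_a_v(a), grad_theta_v(a))
    grad_t, grad_x = grad_f(theta, a)
    g = grad_t + dx_dtheta.T @ grad_x
    theta = np.clip(theta - alpha_k * g, 0.0, None)       # projection onto Theta = R_+

print("tax:", theta, "quantities:", a, "total output:", a.sum())
```

Because the lower level uses a larger (more slowly decaying) step size than the upper level, the strategy profile tracks the equilibrium induced by the current tax, which is what makes the biased gradient estimate asymptotically accurate.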
In this part, we establish the convergence guarantee of Algorithm 2 for simplex constrained games. We still define optimality gap ǫ θ k as ǫ θ k = E θ k − θ * 2 2 . Yet, corresponding to (4.6), we track ǫ x k+1 as a measure of convergence for the strategies of the agents, which is defined as We are now ready to give the convergence guarantee of Algorithm 2. Theorem 5.5. For Algorithm 2, set the step sizes α k = α/(k+1) 1/2 , β k = β/(k+1) 2/7 , β i k = λ i ·β k , and ν k = 1/(k + 1) 4/7 with constants α and β satisfying Proof. See Appendix C for a detailed proof and a detailed form of the convergence rates. The rates in Theorem 5.5 appears to be slower than those in Theorem 5.4. This is due to the fact that the KL divergence is lower bounded by squared vector norms. We also note here that, as opposed to Algorithm 1 that uses smooth potential functions, the ill behavior of the Shannon entropy on the boundaries of the simplex becomes a hurdle in both algorithm design and analysis. The key to overcoming such an obstacle is the mixing step in (4.6), which keeps the iterations at a diminishing distance away from the boundaries. [6] Chen, T., Sun, Y. and Yin, W. (2021). A single-timescale stochastic bilevel optimization method. arXiv preprint arXiv:2102.04671 . [7] Colson, B., Marcotte, P. and Savard, G. In this section, we present detailed discussions on the properties of the games. We establish the following sufficient condition for variational stability. Lemma A.1. Define the matrix H λ (x) as a block matrix with (i, j)-th block taking the form of Suppose that ψ i satisfies the smooth condition (4.2). If H λ (x) + H λ (x) ⊤ ≺ −2 · H ψ · I d for some λ ∈ R N + , then the Nash equilibrium x * is λ-variationally stable with respect to the Bregman divergence D ψ . Proof. We define the λ-weighted gradient of the game as By definition, we can verify that H λ (x) is the Jacobian matrix of v λ (x) with respect to x. For any Taking additional dot product · , x − x ′ on both sides of (A.1), we have where the inequality follows from H λ ( where the second inequality follows from the fact that x * is a Nash equilibrium. Eventually, as ψ i satisfies (4.2), we have Suppose that ψ i is the Shannon entropy. If H λ (x) + H λ (x) ⊤ ≺ 0 for some λ ∈ R N + , then Nash equilibrium x * is λ-strongly variationally stable with respect to the KL divergence. where the second inequality follows from the fact that x * is a Nash equilibrium. Unconstrained game. We now provide the proof of Lemma 4.1. Proof of Lemma 4.1. Since X i = R d i for all i ∈ N , by the definition of x * (θ), we have v θ (x * (θ)) = 0 for all θ ∈ R d . Then, differentiating this equality with respect to θ on both ends, for any i ∈ N , we have Thus, we have As a consequence of Lemma 4.1, we have the following lemma addressing the sensitivity of Nash equilibrium with respect to the incentive parameter θ. Lemma A.3. Under Assumption 5.1, we have for { x k } k≥0 generated by Algorithm 1, where Proof. We have for some θ = ω · θ k + (1 − ω) · θ k−1 with ω ∈ [0, 1], where the last inequality follows from Assumption 5.1. Simplex constrained game. As a consequence of Lemma 4.2, we have the following lemma as the simplex constrained version of Lemma A.3. Lemma A.4. Under Assumption 5.1, we have Proof. Recall that J θ is defined in Lemma 4.2 and that · 2 is the spectral norm when operating on a matrix. We have Thus, by Lemma 4.2, we have for some θ = ω · θ k + (1 − ω) · θ k−1 with ω ∈ [0, 1], where the last inequality follows from Assumption 5.1. 
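For the simplex-constrained setting analyzed in Theorem 5.5, the following sketch (again illustrative rather than the paper's code) spells out the two ingredients of Algorithm 2: the entropic mirror-descent step, whose closed form is the familiar multiplicative-weights update because ψ_i is the Shannon entropy, and the mixing step (4.6) that keeps the iterate a diminishing distance away from the boundary of the simplex. The step-size and mixing schedules follow Theorem 5.5; the single-class toy feedback vector is made up, and the upper-level incentive update is omitted (it mirrors the unconstrained sketch above).

```python
import numpy as np

def entropic_mirror_step(x_i, v_hat_i, beta_i):
    """One lower-level step for class i on the simplex Delta(A_i).

    With psi_i equal to the Shannon entropy, D_{psi_i} is the KL divergence and
    the mirror-descent update has the multiplicative-weights closed form
    x_i^{k+1} proportional to x_i^k * exp(beta_i * v_hat_i) (reward convention).
    """
    w = x_i * np.exp(beta_i * (v_hat_i - v_hat_i.max()))   # max-shift for numerical stability
    return w / w.sum()

def mix_with_uniform(x_i, nu):
    """Mixing step (4.6): pull the iterate toward the uniform strategy 1/d_i."""
    return (1.0 - nu) * x_i + nu * np.full(x_i.size, 1.0 / x_i.size)

def schedules(k, alpha=0.1, beta=0.5):
    """Step-size and mixing schedules used in Theorem 5.5."""
    alpha_k = alpha / (k + 1) ** 0.5        # upper level (not used in this toy)
    beta_k = beta / (k + 1) ** (2.0 / 7.0)  # lower level
    nu_k = 1.0 / (k + 1) ** (4.0 / 7.0)     # mixing parameter, decreasing to 0
    return alpha_k, beta_k, nu_k

# Toy usage: one class with three strategies and a fixed reward vector.
x = np.full(3, 1.0 / 3.0)
rewards = np.array([1.0, 0.5, 0.0])
for k in range(200):
    _, beta_k, nu_k = schedules(k)
    x = entropic_mirror_step(x, rewards, beta_k)
    x = mix_with_uniform(x, nu_k)
print(x)   # mass concentrates on the first strategy but stays off the boundary
```

For a nonatomic game where the feedback is a cost rather than a reward, the sign in the exponent flips. Since ν_k decays to zero, the bias introduced by the mixing vanishes, while the iterates never reach the boundary, where the entropy's gradient blows up.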
We first present the following two lemmas under the conditions presented in Theorem 5.4. (1 − β j /8) . Proof. See Appendix B.1 for a detailed proof. (1 − µα j ) . Thus, by Lemma B.1, we have Similarly, we have for α k = α/(k + 1), Thus, combining (B.1) with Lemma B.2, we obtain Here the third inequality follows from β k = β/α 2/3 · α Proof. Recall that We have where the second inequality follows from the fact that Taking the conditional expectation given F x k , we obtain where the second inequality follows from Assumption 5.3. Summing up (B.3) for all i ∈ N , we have By the λ-strong variational stability of x i (θ k ), we have Moreover, by Assumption 5. 3 we have where the first inequality follows from the optimality condition that v θ k (x * (θ k )) = 0, and the second inequality follows from Assumption 5.1. Thus, taking (B.5) and (B.6) into (B.4), we obtain By Assumptions 5.2 and 5.3, we have taking which into (B.9) and taking expectation on both sides, we further obtain By γ = (4 − 2β k )/β k , we have where the first inequality follows from α k−1 ≤ 2α k , and the second inequality follows from α k = α/(k + 1), β k = β/(k + 1) 2/3 , and α/β 3/2 ≤ 1/12H ψ HH * . Thus, we obtain Recursively applying (B.11), we obtain (1 − β j /4) . Thus, we conclude the proof Lemma B.1. Proof. Since the projection argmax x∈X is non-expansive, we have Taking the conditional expectation given F θ k , we obtain where the second inequality follows from Assumption 5.2 and the Cauchy-Schwartz inequality, and the last inequality follows from Assumption 5.2. Applying (B.10) to (B.12) and taking expectation on both sides, we get Recursively applying (B.13), we obtain (1 − µα j ) . Thus, we conclude the proof of Lemma B.2. We first present the following two lemmas under the conditions presented in Theorem 5.5 Lemma C.1. For all k ≥ 0, we have (1 − β j /8) (1 − β j /8) . Proof. See Appendix C.1 for a detailed proof. Lemma C.2. For all k ≥ 0, we have (1 − µα j ) . Proof. See Appendix C.2 for a detailed proof. Thus, by Lemma B.1, we have (1 − β j /8) (1 − β j /8) (1 − β j /8) . (C.1) Since ν l = 1/(l + 1) 4/7 , we have β l ν l log(1/ν l ) + 2ν l+1 + 2ν 2 l ≤ (4/β 2 + 4/7β) · β 2 l , taking which into (C.1), we obtain Similarly, we have for α k = α/(k + 1) 1/2 , Thus, combining (B.1) with Lemma B.2, we obtain (1 − µα j ) Here the third inequality follows from β k = β/α 4/7 · α 4/7 k . Therefore, (C.2) and (C.3) conclude the proof of Theorem 5.5. Proof. We first show that we have the exact form of x i k+1 as By the definition of KL divergence, we have summing up which gives where the second inequality follows from v θ (x * (θ)) ∞ ≤ V * , and the third inequality follows from Assumption 5.1. Thus, taking (C.9) and (C.10) into (C.8), we obtain for 6N H 2 where the second inequality follows from Lemma D.3. By Lemma D.3, we further have for ν k ≤ By Lemma D.2, we have for any γ > max{1, 1/ν 2 k }, By Assumption 5.3, we have where the third inequality follows from Assumptions 5.2 and 5.3, and the last inequality follows from Lemma D. 3 . Taking (C.14) into (C.13) and taking expectation on both sides, we obtain ǫ x k+1 ≤ 1 − β k /4 + 3 (1 + γ)/ν 2 k − 1 /2 · H 2 H 2 * α 2 k−1 · ǫ x k + (δ 2 u + 3V 2 * ) · λ 2 2 β 2 k + (1 − β k /4) · (1 + γ)/ν 2 k − 1 2 · H 2 * α 2 k−1 · (δ 2 f + 2M 2 + 6N H 2 ν k−1 ) − N β k ν k log ν k + 2N · (ν k+1 + ν 2 k ). 
(C.15) With γ = (4β k −2)/β k , α k = α/(k +1), β k = β/(k +1) 2/7 , ν k = 1/(k +1) 4/7 , and α/β 3/2 ≤ 1/7 H H * , we have taking which into (C.15), we obtain ǫ x k+1 ≤ (1 − β k /8) · ǫ x k + (δ 2 u + 3V 2 * ) · λ 2 2 β 2 k − N β k ν k log ν k + 2N · (ν k+1 + ν 2 k ) + β 2 k · (δ 2 f + 2M 2 + 6N H 2 ν k−1 )/8 H 2 ≤ (1 − β k /8) · ǫ x k + (δ 2 u + 3V 2 * ) · λ 2 2 β 2 k − N β k ν k log ν k + 2N · (ν k+1 + ν 2 k ) + β 2 k · (δ 2 f + 2M 2 + 6N H 2 )/8 H 2 . (C.16) Recursively applying (C.16), we obtain (1 − β j /8) + (δ 2 u + 3V 2 * ) · λ 2 2 + (δ 2 f + 2M 2 + 6N H 2 )/8 H 2 · k l=0 β 2 l · k j=l+1 (1 − β j /8) (−β l ν l log ν l + 2ν l+1 + 2ν 2 l ) · k j=l+1 (1 − β j /8) . Proof. Applying (C.14) to (B.12) and taking expectation on both sides, we get where the second inequality follows from ν k ≤ 1. Recursively applying (C.17), we obtain (1 − µα j ) . The following lemma is used in the analysis on unconstrained games. Lemma D.1. Let ψ(·) be 1-strongly convex with respect to the norm · . Assume that (4.2) holds. We have for any γ > H 2 ψ ≥ 1, Proof. By the definition of Bregman divergence, we have for any γ > 0, D ψ (x, z) − D ψ (y, z) = ψ(x) − ∇ψ(z), x − z − ψ(y) + ∇ψ(z), y − z = −D ψ (x, y) + ∇ψ(z) − ∇ψ(x), y − x ≤ −1/2 · x − y 2 + ∇ψ(z) − ∇ψ(x) * · y − x , where the inequality follows from 1-strong convexity of ψ(·) and the Cauchy-Schwartz inequality. By (4.2), we further have D ψ (x, z) − D ψ (y, z) ≤ −1/2 · x − y 2 + H ψ · z − x · y − x , where the second inequality follows from 1-strong convexity of ψ(·). Rearranging the terms in (D.1), we finish the proof of Lemma D.1. The following two lemmas are involved in the analysis on simplex constrained games. Lemma D.2. Let ψ(·) be the Shannon entropy. We have for any γ k > max{1, 1/ν 2 k }, for { x k } k≥0 generated by Algorithm 2. Proof. For the Shannon entropy ψ(·), we have ∇ x j ψ( x) = 1 + log[ x] j , which gives Thus, we have Replacing H ψ with 1/ν k in the proof of Lemma D.1, we conclude the proof of Lemma D.2. Lemma D.3. Let {x k } k≥0 and { x k } k≥0 be the sequences of strategy profiles generated by Algorithm 2 with ν k ≤ O(1/k). We have Proof. By the definition of KL divergence, we have Optnet: Differentiable optimization as a layer in neural networks Hierarchical optimization: An introduction Deep equilibrium models The theory of the market economy Some theoretical aspects of road traffic research Understanding short-horizon bias in stochastic meta-optimization where the last inequality holds with N H 2 u λ 2 2 β 2 k ≤ β k , which is satisfied by β ≤ 1/N H 2 u λ 2 2 . By Lemma D.1, we have for any γ > max{1, H 2 ψ },be the normalization factor of the exact form of x i k+1 in (C.5). We havewhere the last equality follows from the fact that j∈which concludes the proof of (C.4).Continuing from (C.4), we havewhere the second inequality follows from the fact thatTaking the conditional expectation given F x k , we obtainwhere the last inequality follows from Cauchy-Schwartz inequality and the fact thatMoreover, by Assumption 5.3, we haveBy (4.6), we have for allTaking (D.6) into (D.5), we obtainwhich concludes the proof of (D.2). Similar arguments also yields (D.3). Also, we havesumming up which for i ∈ N gives (D.4).
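The appendix repeatedly derives one-step recursions of the form ε_{k+1} ≤ (1 − β_k/8) ε_k + C β_k² and then applies them recursively. The following generic calculation (an illustration with generic constants C, β, M, not a line from the paper's appendix) shows how such a recursion with β_k = β(k+1)^{-2/3} yields a gap of order O(k^{-2/3}); the same template with the exponents of Theorem 5.5 gives the slower rates of the simplex-constrained case.

```latex
% Suppose the gap sequence satisfies
%   \epsilon_{k+1} \le \bigl(1 - \tfrac{\beta_k}{8}\bigr)\epsilon_k + C\beta_k^2,
%   \qquad \beta_k = \beta (k+1)^{-2/3}.
% Claim: \epsilon_k \le M (k+1)^{-2/3} for a suitable M, i.e. \epsilon_k = O(k^{-2/3}).
\begin{align*}
\epsilon_{k+1}
  &\le \frac{M}{(k+1)^{2/3}} - \Bigl(\frac{\beta M}{8} - C\beta^2\Bigr)\frac{1}{(k+1)^{4/3}}
  && \text{(induction hypothesis)} \\
  &\le \frac{M}{(k+2)^{2/3}}
  && \text{since } \frac{1}{(k+1)^{2/3}} - \frac{1}{(k+2)^{2/3}}
     \le \frac{2}{3(k+1)^{5/3}}
     \le \Bigl(\frac{\beta}{8} - \frac{C\beta^2}{M}\Bigr)\frac{1}{(k+1)^{4/3}},
\end{align*}
% where the last inequality holds once M \ge 16 C \beta and (k+1)^{1/3} \ge 32/(3\beta);
% enlarging M covers the finitely many earlier iterations.
```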