key: cord-0174372-vcr7eafu
authors: Merhej, Ramona; Santos, Fernando P.; Melo, Francisco S.; Chetouani, Mohamed; Santos, Francisco C.
title: Learning Collective Action under Risk Diversity
date: 2022-01-30
journal: nan
DOI: nan
sha: f661fb7f3df78b749275472515fde69ddae3665e
doc_id: 174372
cord_uid: vcr7eafu

Collective risk dilemmas (CRDs) are a class of n-player games that represent societal challenges where groups need to coordinate to avoid the risk of a disastrous outcome. Multi-agent systems incurring such dilemmas face difficulties achieving cooperation and often converge to sub-optimal, risk-dominant solutions where everyone defects. In this paper we investigate the consequences of risk diversity in groups of agents learning to play CRDs. We find that risk diversity poses new challenges to cooperation that are not observed in homogeneous groups. We show that increasing risk diversity significantly reduces overall cooperation and hinders collective target achievement. It leads to asymmetrical changes in agents' policies -- i.e. the increase in contributions from individuals at high risk is unable to compensate for the decrease in contributions from individuals at low risk -- which overall reduces the total contributions in a population. When comparing RL behaviors to rational individualistic and social behaviors, we find that RL populations converge to fairer contributions among agents. Our results highlight the need for aligning risk perceptions among agents or developing new learning techniques that explicitly account for risk diversity.

The World Economic Forum recently (January 2021) published its 16th report on global risks [3]. Among the most concerning risks are climate change, biodiversity loss, extreme weather, as well as societal division and economic fragility. While it is evident that large collective efforts are needed to avoid these disasters, people, institutions or countries remain reluctant to cooperate. On the one hand, no single entity has the power to save the system on its own. This is known as the problem of many hands (PMH) and is amplified when actions are not directly harmful but only create the risk of a harm [50]. On the other hand, cooperation in such contexts entails a social dilemma: the best individual outcome occurs when others contribute to the collective good and risks are avoided without one's intervention. This selfish reasoning, and the shifting of responsibility onto others, gives rise to the so-called Tragedy of the Commons. The tension within individuals or entities, created by the urgent need for cooperation, the individually rational choice to defect, and the uncertainty about future outcomes, makes decision making non-trivial [6, 7, 25].

The Collective Risk Dilemma (CRD) is a simple game metaphor that tries to capture such challenges [14, 35, 43, 40, 51, 52]. In a CRD, agents decide how much of their wealth to contribute to a common cause in order to avoid the risk of a future disaster. The future disaster is only avoided with certainty if the agents manage to collect more contributions than a given target threshold. The behaviors of individuals playing CRDs have been analyzed both experimentally [13, 14, 35, 47] and theoretically, resorting to evolutionary game theory [40, 41, 43, 45, 52] and multi-agent reinforcement learning [15, 34]. Previous works, however, assume an identical risk factor for all agents [14, 35, 41, 43]. In reality, heterogeneous perceptions of and exposures to risk are ubiquitous.
Most recently, the COVID-19 crisis highlighted our strengths and weaknesses in successfully cooperating under such discrepancies. Particularly, it showed how different countries adopted different safety measures depending on how risky they assessed the situation to be [5, 21]. Diversity in risk perception was not only observed on a national scale but also within each country [26]. The pandemic also revealed how age or medical conditions can result in different levels of risk exposure to the same virus [1]. Still, some studies have looked into other types of heterogeneities among agents and have reported significant changes in the achieved cooperation and target achievement [22, 34, 52]. The findings on other heterogeneities motivated us to investigate the effect of introducing risk diversity in a population of agents facing collective risks. We examine how averaging out the risk value instead of considering risk diversity can alter the results we observe.

While the game tensions play a decisive role in the choices made by the agents, the final equilibrium of the system also depends on the decision making process of the agents. Decision making can be modeled either as a static or as a dynamic process. A static perspective often models agents as rational and having full knowledge of all possible strategy profiles and outcomes for players. This leads agents to converge to the intersection of their best responses, known as the Nash equilibrium. Yet, experimental studies have shown that humans often make far from rational choices [16, 32, 20, 46], and seem to adapt their policies based on previous experience. Reinforcement Learning (RL) provides new tools to model decision making dynamics and, in fact, was shown to accurately model human behaviors in social dilemmas [39]. RL has rapidly evolved in the past years, and several variations were developed specifically to promote cooperation in social dilemmas [17, 23]. We do not use any of these algorithms in our work as they require extensive information sharing and have therefore mostly been applied to 2-player games. Additionally, our goal in this paper is to first understand how simple reinforcement learning dynamics can influence agents' cooperation in the presence of risk diversity. Examining the cooperation challenges that RL dynamics may pose under risk diversity is essential before moving on to designing algorithms that solve these challenges. As such, we focus on independent reinforcement learning algorithms where agents can only observe their own actions and rewards. To assess the strengths and weaknesses of adaptive agents in reaching cooperative solutions under risk diversity, we compare the learned strategies with a set of static solutions. On one hand, we compare the behaviors under RL to those prescribed by individualistic and rational game theory, and on the other hand, to socially optimal solutions that maximize the total welfare in the population.

We begin our paper with Section 2 on related work. After that, in Section 3 we model the collective risk dilemma, explain how risk diversity is introduced, and describe agents' learning dynamics. This is followed by Section 4 in which we derive the static solutions for the game. We display our results in Section 5 and conclude our work in Section 6. We examine how, in a population of adaptive agents facing collective dilemmas, risk diversity can affect that population's ability to cooperate and effectively avoid a disastrous outcome.
Previous works on CRDs, both experimental and theoretical, have concluded that higher risk translates into higher cooperation and consequently may help in escaping the tragedy of the commons [34, 35, 43, 45]. But the global risk is not the only decisive factor in an agent's willingness to cooperate. The introduction of different inequalities between agents can have a significant impact on cooperation. Under evolutionary game theory, inequalities in wealth, productivity and benefits are found to reduce agents' cooperation in a continuous public goods game [22]. Similar results are also found in a threshold public goods game when agents can only adapt by imitating agents from the same wealth class [52]. Wealth inequality is also shown to hinder target achievement in a study on CRDs with reinforcement learners [34], and in an experimental study on a threshold public goods game [47]. While most studied heterogeneities in the literature focus on wealth inequality, we argue that risk diversity is another heterogeneity worth studying in populations facing collective risks.

We distinguish between two types of risk diversity: risk perception diversity - where agents perceive the same risk as higher or lower than it actually is - and risk exposure diversity - where some agents are more or less vulnerable to facing a risk. In the context of risk perception diversity, a survey of 119 countries confirms significant variance in public concern and risk assessment of the global climate change problem [27]. In the context of risk exposure diversity, we saw that the recent COVID-19 pandemic led to the distinction between people at normal risk and those at increased risk for severe illness from COVID-19 [1]. Governmental units such as the Occupational Safety and Health Administration (OSHA) of the United States Department of Labor have classified jobs into four potential risk exposure levels [4]. The Organization for Economic Co-operation and Development (OECD) published a document urging governments to support the most vulnerable people [2], and several other studies on that subject have been published in different countries [10, 37, 49]. Risk diversity is therefore a fundamental feature taken into account by countries when elaborating their safety measures and preventive policies. Risk diversity on an individual level also translates into behavioral diversity and differences in safety measure compliance [48]. Whether risk diversity emerges from exposure or perception diversity is irrelevant when studying the emergent agent policies and hence their ability to reach a target threshold. However, the consequences of reaching or missing the threshold do differ depending on whether agents are effectively at higher or lower risk, or whether this discrepancy is merely a matter of perception. The behavioral differences between people at high risk and those who are not, as well as the previous work that demonstrated the impact of heterogeneities on a population's cooperation capabilities, inspired us to dedicate a study to risk diversity in collective risk dilemmas.

Although n-player non-symmetrical social dilemmas and games with mixed motives are abundant in the real world, cooperation in multi-agent reinforcement learning has mainly focused on 2-player games. A study on sequential social dilemmas with deep RL [28] identified coordination subproblems that prevent proper cooperation of agents. Coordination problems in MARL are quite common and are not restricted to social dilemmas [33].
One of the reasons for coordination difficulty in MARL is the non-stationarity of the opponent and the simultaneous policy updates of the players [9]. Suggested solutions try to increase agents' understanding of the opponent's dynamics and leverage this understanding to achieve higher cooperation. One algorithm proposes predicting the opponent's policy changes before computing the agent's policy gradient [53]. Another alternative suggests differentiating through the learning updates of the opponent to actively shape their learning [17]. A third solution incorporates both policy prediction and opponent shaping to increase stability while simultaneously escaping saddle points [29]. Other solutions to increase cooperation in MARL focus on enabling communication capacities between agents. Communication can take several forms. For example, agents may communicate by sending messages [19], sharing intentions [24] or experiences [11], or advising actions to one another [36]. Enriching agents with communication capabilities has been shown to improve performance [19, 11], speed up learning [36, 19, 11] and enhance coordination [24, 36]. Implementing a centralized critic with decentralized actors is another form of indirect communication and information sharing among agents that can increase performance and cooperation [8, 18, 31]. A third set of solutions for overcoming cooperation difficulty in RL introduces conditional commitment in agents' policies. One example is an algorithm designed to always asymptotically behave as a Tit-for-Tat strategy by learning simultaneously a cooperative and a selfish Q-function and alternating between them to avoid exploitability [23]. Finally, solutions modifying agents' motivations can be seen as institutional solutions [12]. Notably, in MARL, intrinsic rewards can be engineered and added to environmental rewards to help agents solve a sub-problem of the game and facilitate the emergence of coordination [30].

We note that all the aforementioned solutions for increasing cooperation in MARL settings focus on 2-player games. Major computational and convergence problems still inhibit the scaling of these algorithms to n-player games. Additionally, most solutions are developed to increase cooperation in purely cooperative settings. We propose a non-symmetrical n-player mixed-motive game. We describe the emergent behaviors of simple reinforcement learners in these settings. The goal of the paper is to recognize the cooperation challenges of reinforcement learning dynamics in the context of such dilemmas. The outcomes of our study can be exploited to design more effective RL algorithms in the future. However, such developments remain out of the scope of our current paper.

A Collective Risk Dilemma (CRD) is a game in which agents need to cooperate to avoid a potential disaster [14, 35, 43, 40, 51, 52]. Agents' success in avoiding the disaster requires a minimum amount of collective effort. Effort is modeled by the costly contribution of players towards a common pool. If contributions are below the threshold, they will not alleviate the consequences of the disaster. Additionally, contributions above the threshold do not create any additional value for the players. As a result, agents are simultaneously motivated to cooperate to increase the chances of avoiding the disaster, and to defect and free-ride with the hope that others will ensure disaster avoidance. Formally, in a population of finite size Z, we allocate to every player an initial endowment b.
Players are then sampled into groups of size N to play CRDs. They need to jointly collect enough contributions to reach a target threshold t to avoid with certainty some common disaster. If a group manages to achieve the threshold target, the disaster is avoided and players only lose what they had contributed to the common pool. However, should the target not be met, agents, depending on their level of risk exposure to the disaster r_i, will lose a fraction p of their remaining endowment. At the end of the game, player i who started with an initial endowment b will be left with

b_i^end = b(1 - c_i)                                   if the target is met,
b_i^end = b(1 - c_i)(1 - p)  with probability r_i,     if the target is missed,
b_i^end = b(1 - c_i)         with probability 1 - r_i, if the target is missed,

where c_i is a binary choice of contributing 0 or a fraction c of the endowment to the pool (c_i ∈ {0, c}).

The perceived benefit or harm of these changes in endowment is captured by a subjective function known as the utility in economic game theory. One common utility function is the log-utility. The log-utility function has been used when studying the impact of wealth inequality in collective risk dilemmas [34] and is used more broadly in economics to capture what is known as diminishing marginal utility [38]. It supposes that the loss of a given amount of money is perceived as more painful by poorer individuals than by richer ones. While all agents are equally wealthy in our scenario, we do intend to examine mixtures of heterogeneities in future works, such as the combination of wealth inequality with risk diversity. With that in mind, to better compare our results with future works, we decide to also adopt a log-utility function. The payoffs of the game are expressed as the difference in the log of agents' wealth before and after a game was played. Avoiding a disaster costs a cooperator x_C = log((b - cb)/b) = log(1 - c), and a defector x_D = log(b/b) = 0, i.e., nothing. Facing a disaster costs cooperators x̄_C = log(1 - c - p(1 - c)) and defectors x̄_D = log(1 - p). The necessary conditions on r, c, p and t that ensure that the game designed is a social dilemma are detailed in Appendix A.1. The goal of each player is to find a probabilistic strategy π*_i - representing the probability of player i choosing to cooperate - that maximizes its payoff.

Introduction of risk diversity. We consider risk diversity in the form of binary risk classes. That is, we split our population into two classes: agents at high risk of being affected by the disaster and agents at low risk. The former group represents a fraction z_H of the population, and the latter a fraction z_L = 1 - z_H. Given an average population risk value r and a risk diversity value δ, if the target is not achieved, agents at high risk will lose an additional fraction p of their remaining wealth with probability r_H = r + δ/(2 z_H), while agents at low risk only face that disaster with a risk probability r_L = r - δ/(2 z_L).

Numerical Values. The population size is set to Z = 200 individuals. The agents are organized in groups of N = 6. They are given an initial endowment b = 1 and can choose to either cooperate and contribute a fraction c = 0.1 of it to a common pool or defect and contribute nothing. Participants have stochastic policies π_i that define the probabilities of choosing each action. The threshold t is set so that the target is only achieved if at least half of the agents in a group cooperate, i.e., t = Mcb with M = N/2. Agents at high and low risk are equally frequent in the population, with z_H = z_L = 50% of the population.
This means that, for an average population risk r and a risk diversity δ, agents at high risk will face a disaster with probability r_H = r + δ while agents at low risk will face a disaster with a risk probability r_L = r - δ. If the threshold target is not achieved, every agent that faces a disaster pays a penalty of p = 0.7, or 70% of its remaining wealth. We proceed with two experiments: in the first, we fix the diversity value to δ = 0.1 and test varying average risk values r, while in the second, we set the population average risk value to r = 0.5 and vary the risk diversity value δ. This allows us to better understand the impact of risk diversity for regimes of high and low baseline risk (δ fixed and varying r) and also the impacts of increasing symmetric risk diversity (r = 0.5 and varying δ).

The goal of the paper is to understand how simple reinforcement dynamics can encourage or discourage cooperative behaviors in populations with risk diversity. We choose to model the agents' learning dynamics using the Roth-Erev algorithm [39], which was shown to successfully model human decision making in social dilemmas. Accordingly, we create a population of Z agents and allow every player i, at every timestep k, to hold and update a propensity vector that assigns a propensity value to each of the possible actions. In the 2-action collective risk dilemma, this translates to a vector q_{i,k} = (q_{i,k}(C), q_{i,k}(D)), where q_{i,k}(C) and q_{i,k}(D) are the respective propensities for the cooperative and the defective action at timestep k. For every interaction k in the learning process, agents normalize their propensity vector and sample one of the two actions following the obtained probabilities. At the end of the k-th game, when returns are distributed, every player i, depending on the selected action A and the received reward x, updates the propensity vector such that

q_{i,k+1}(a) = (1 - φ) q_{i,k}(a) + x   if a = A,
q_{i,k+1}(a) = (1 - φ) q_{i,k}(a)       otherwise,

where φ is a forgetting parameter that inhibits the propensities from growing to infinity. Further details about the population training procedure as well as numerical values are given in Appendix A.2.
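To make the payoff structure and the learning update concrete, the sketch below illustrates one group interaction. It is not the authors' code: the function names (crd_payoffs, roth_erev_update), the use of numpy, and the action encoding are our own illustrative assumptions, but the payoff values and the propensity update follow the definitions above.

```python
import numpy as np

def crd_payoffs(actions, risks, b=1.0, c=0.1, p=0.7, M=3, rng=None):
    """Log-utility payoffs of one CRD group interaction.

    actions: array of 0 (defect) / 1 (cooperate) for the N group members.
    risks:   array with each member's disaster probability r_i.
    Returns one payoff per player (difference in log-wealth), i.e.,
    x_C = log(1-c), x_D = 0, x̄_C = log(1-c-p(1-c)) or x̄_D = log(1-p).
    """
    rng = rng or np.random.default_rng()
    actions = np.asarray(actions)
    target_met = actions.sum() >= M                  # threshold t = M*c*b
    wealth = b * (1.0 - c * actions)                 # endowment after contributing
    if not target_met:                               # disasters only hit on failure
        disaster = rng.random(len(actions)) < np.asarray(risks)
        wealth = np.where(disaster, wealth * (1.0 - p), wealth)
    return np.log(wealth / b)

def roth_erev_update(q, chosen, payoff, phi=0.001):
    """Roth-Erev update of one agent's propensity vector q = [q(C), q(D)]."""
    q = (1.0 - phi) * q                              # forgetting on both actions
    q[chosen] += payoff                              # reinforce the chosen action
    return q
```

For instance, with r = 0.5, δ = 0.3 and z_H = z_L = 0.5, the high-risk members of a group would be passed risks of 0.8 and the low-risk members risks of 0.2.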
To highlight the peculiarities of adaptive agents in social dilemmas, we compare the learned and adaptive solutions to other statically tailored solutions. Particularly, we are interested in comparing the learned solutions to 1) solutions that are rational from an individualistic point of view and 2) solutions that are rational from a communal or collective point of view. The rational solution from an individualistic perspective is the Nash equilibrium, while the rational equilibrium from a communal perspective is the total welfare maximizing solution. Given the computational difficulty of finding Nash equilibria or social welfare maximizing solutions for an n-player, non-linear, general-sum game with continuous strategies, we choose to work with class-based solutions that were previously proposed for CRDs under wealth inequality [34]. These solutions pre-impose perfect coordination between players of the same class (here, agents at high/low risk), by forcing all agents from the same class to follow the same strategy. As a result, agents of the same class can be modeled as one large agent, which transforms the n-player game into a 2-player game.

Class-based Nash. Following a similar reasoning to the one detailed for wealth inequality [34], we transform the n-player game into a 2-player game. After transformation of the game, the method relies on a graphical approach to extract the intersection points of the best response strategies of the two classes. The intersection points represent the class-based Nash equilibria. We repeat this to extract class-based Nash solutions for all game settings and all risk diversities. The exact changes in reasoning with respect to wealth inequality as well as the plots for extracting class-based Nash points can be found in Appendix A.3.1. We use results obtained under class-based Nash equilibrium as the baseline to evaluate how rational the learned strategies of adaptive reinforcement learners are.

Class-based maximum welfare. We extend the class-based Nash method to extract class-based maximum welfare points. We continue to impose absolute equality and fairness within a given class and evaluate the total secured welfare of a population for different combinations of strategies. Here, instead of plotting best response lines, we draw a heat-map with the secured welfare for each combination of class strategies. We define the class-based maximum welfare solution as the point that minimizes the total losses in welfare for the population. Further details and the corresponding heat-maps are given in Appendix A.3.2.

We study the consequences of risk diversity in populations of RL agents learning to play CRDs. After the training phase, the strategies are evaluated based on the resulting population's probability of achieving the target threshold t. For every setting, we roll out a game where the population is split into groups of N players. In each group, agents, following their learned strategies, choose to either contribute or not. We define η as the average percentage of groups in the population that reach the target threshold. This random variable is evaluated and averaged over 10^6 simulations. Studies are run both on heterogeneous populations with risk diversity, as well as on their homogeneous counterparts (i.e., populations with the same average risk factor r but no diversity δ).
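As an illustration of this evaluation procedure, the sketch below (our own simplified rendering, not the authors' code; estimate_eta and its parameters are hypothetical names) estimates η by repeatedly sampling random groups of N agents from the learned policies and checking whether each group meets the threshold. For simplicity it samples independent groups rather than partitioning the whole population at once.

```python
import numpy as np

def estimate_eta(coop_probs, n_samples=100_000, N=6, M=3, seed=0):
    """Estimate eta: the fraction of sampled groups reaching the target.

    coop_probs: learned cooperation probability pi_i of each of the Z agents.
    """
    rng = np.random.default_rng(seed)
    coop_probs = np.asarray(coop_probs, dtype=float)
    successes = 0
    for _ in range(n_samples):
        group = rng.choice(len(coop_probs), size=N, replace=False)  # random group
        actions = rng.random(N) < coop_probs[group]                 # sample C/D
        successes += int(actions.sum() >= M)                        # threshold check
    return successes / n_samples
```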
To study the effect that risk inequality can have on a population facing a collective risk dilemma, we begin by comparing the group achievement rate η and the learned strategies of a homogeneous population on one hand, with those of a heterogeneous population with risk diversity factor δ = 0.1 on the other hand. We plot the results for varying average risk factors r in Figure 1. With or without inequalities, we observe that the group achievement rate increases with the risk factor r (Figure 1a). However, while most studies reported that inequalities had a decisive impact on group achievement [34, 44, 52], risk diversity has little or no impact on group performance for all risk values of r ≥ 0.3. Additionally, for r ≥ 0.3, only minor differences are observed in the strategies of agents at high and low risk, which, as r increases, converge to the strategies learned by a homogeneous population. These results contrast with those for r = 0.1, where risk diversity reduces target achievement and causes a large gap in cooperation between the two classes. As r increases, the relative strength of the diversity δ/r decreases, resulting in more homogeneous behaviors between the classes.

Building on this, we investigate the role of the diversity factor δ. In a second experiment, we fix the average risk in the population to r = 0.5 and evaluate populations of varying risk diversity factors δ. In Figure 2a, we observe how, for the same average risk, stronger diversity causes a drop in achievement. The steepest drop occurs when all the population's risk is carried by only half of the population, i.e., for δ = 0.5 (r_L = r - δ = 0). Figure 2b shows the strategies followed by individuals at high and at low risk in each of the populations. We notice an increased gap in cooperation between the two classes as one class adjusts its cooperation rate faster than the other one. The reduced cooperation of agents at low risk is not compensated by a similar increase in cooperation from agents at high risk, which explains the drop in target achievement as δ increases.

Next, we explore the effect of risk diversity on the policies of RL agents compared with a baseline of 1) individualistically rational solutions and 2) socially rational solutions. These are respectively the class-based Nash and the class-based maximum welfare policies from Section 4. First, in Figure 3a, we consider a constant diversity δ = 0.1 and observe the effect of changing the risk. Then, in Figure 3b, we fix the average risk r = 0.5 and look at the impact of varying diversity. In all cases, we notice that adaptive agents converge to more egalitarian solutions than the class-based agents, in the sense that the gap in cooperation between agents at high and low risk is consistently smaller for RL populations compared to class-based populations. A higher risk reduces this gap while a higher diversity increases it. When comparing class-based Nash to class-based social welfare maximizing solutions, we notice that for agents at low risk, selfish Nash solutions usually recommend higher cooperation than social welfare solutions. However, in Figure 3b, we observe a crossing point between the class-based Nash solutions and the class-based social welfare maximizing ones at δ = 0.2. From a selfish perspective, as δ increases, agents at low risk become less exposed to the disaster and the costs of high cooperation become larger than the costs of failure. In contrast, from a social perspective, as δ increases, the losses on agents at high risk increase and agents at low risk need to pitch in to avoid further losses on the population. RL agents at low risk learn behaviors similar to the class-based Nash solutions and eventually stop cooperating with increasing diversity. However, RL agents at high risk have trouble converging to solutions of high cooperation as recommended by the class-based Nash policies.

On a final note, we highlight the distinctions between the class-based Nash and the general Nash solution. The general Nash is a point where no agent can increase its payoff by deviating alone from the chosen strategy. In other words, the Nash equilibrium considers fully independent players with no pre-established coordination. It finds a solution for both the game's cooperation and coordination dilemmas. The class-based Nash, however, supposes no agent ever deviates alone from a chosen strategy. Instead, all agents of the same class move in the same coordinated manner. The class-based Nash reduces the degrees of freedom and only solves the game's inter-class cooperation dilemma. As a result, the class-based Nash and the Nash equilibria may not always converge to the same solution. For instance, total defection is a Nash equilibrium in the CRD: if Z − 1 agents in the population defect, then the Z-th agent's best response is to defect as well since the target threshold cannot be achieved alone. Yet, this equilibrium point was not found in any class-based Nash solutions. As a result of moving collectively, defection is less desirable for an agent because it simultaneously implies a defection of the rest of the agents in the same class.
In all class-based Nash strategies, if we fix Z − 1 strategies in the population and only allow one agent to change its strategy, defection is indeed the most profitable choice. This proves that the class-based Nash points are not Nash equilibria. Interestingly, if we apply the same test to the learned strategies for all tested risk values, i.e., if we fix Z − 1 strategies in the population and only allow the Z-th agent to change its strategy, defection is again the most profitable choice. Learned strategies are not Nash equilibria either. We hypothesize that the large size of the population can hamper convergence to Nash equilibria for adaptive agents. In a similar way to the class-based update, if several agents in a population simultaneously increase their defection rate, the next interaction may become less profitable, as the increase in failure (caused by a reduction in target achievement) is not compensated by the individual decrease in cooperation cost. Assessing the benefits of deviating alone from a strategy profile, which is necessary for computing Nash equilibria, is not easily done in RL populations where all agents can simultaneously change their strategies. Learning with RL in large populations seems to help in escaping defective Nash equilibria.

We examined how risk inequality between RL agents can affect a population's target achievement rate and the cooperation levels of different risk classes. First, we found that high risk diversity causes a noticeable decrease in group achievement. Second, as diversity increases, cooperation levels of agents at high and low risk respectively increase and decrease. However, while the changes in risk exposure are symmetrical, the changes in cooperation are not. The increase in cooperation of one class is always smaller than the accompanying decrease in cooperation of the other class, which raises significant target achievement difficulties. Third, we showed that RL populations converge to more egalitarian solutions among the two classes compared to their class-based counterparts. Finally, we discussed how learning in large RL populations may help in avoiding defective Nash equilibria.

We recall that risk diversity can emerge from a misalignment in either risk perception or risk exposure. In the case of risk perception diversity, our results highlight the need to align risk perceptions among individuals - using education for example [27] - to improve a population's ability to collectively reach a target. However, if diversity in risk exposure relates to geographic locations, health problems, or other non-modifiable variables, collective success demands altruistic actions from agents who may not directly benefit from cooperating. When agents are at very low risk, cooperative actions cannot be enforced using communication, retaliation or other classical solutions for cooperation in symmetrical social dilemmas. Mixing individualistic and social qualities in agents is necessary to achieve cooperative AI under risk diversity. For fully individualistic agents, allowing inter-agent contracts and bargains can be a way for selfish cooperation to emerge. This requires the understanding of the payoffs of the game, the capacity to develop win-win proposals, and the ability to implement those contracts (i.e., the ability to receive and offer rewards or incentives from and to other agents).

Social dilemmas arise from a misalignment of individual and collective interests generated by specific tensions in the payoff function [32].
More specifically, mutual cooperation should always be preferred over unilateral cooperation and over mutual defection. However, there should also always be either greed or fear that drives agents to defect, to either exploit their peers or protect themselves from exploitation. Recall that in our game, a disaster is faced with probability r_i by agent i if the group fails to achieve the target threshold. Full cooperation always results in target achievement while full defection always results in failure to achieve the target. As such, mutual cooperation yields a payoff x_C = log(1 − c), whereas mutual defection yields with probability r_i a payoff x̄_D = log(1 − p) and with probability 1 − r_i a payoff x_D = 0. To satisfy the conditions for a social dilemma, mutual cooperation must yield a higher expected payoff than mutual defection, i.e., log(1 − c) > r_i log(1 − p). Additionally, the threshold t needs to be lower bounded by cb; otherwise, a unilateral cooperation would also avoid the disaster and hence be as good as mutual cooperation. Finally, to incentivize agents to defect, the threshold needs to be achievable with less than total cooperation (t < Ncb); otherwise, agents would have no motivation to free-ride.

We train the agents of a population asynchronously, with the propensity update rule described in Section 3. A comparison between synchronous and asynchronous learning showed no significant differences in results for players learning to play the Ultimatum Game [42]. Similar conclusions were also reached with populations facing collective risks [34]. At every update-step k, a group of N agents is selected randomly from the population of Z agents. The agents in this group engage in the game described in Section 3.1. Every player i in the group randomly chooses one of the available actions following probabilities p_{i,k} that are derived by normalizing the propensity vector q_{i,k}. The selected actions determine whether or not the target is achieved. If it is achieved, then all agents avoid the disaster. Otherwise, the occurrence of a disaster for agent i is sampled according to its risk exposure level r_i. The payoffs for each agent are then distributed according to Section 3.1, after which all agents in the group update their propensity vectors. This is repeated for a total of K update-steps. While training, we keep track of the number of times every agent in the population has been selected in a vector u. Since the algorithm does not guarantee that all agents are chosen equally many times, we define K_min, the minimum number of update-steps every agent needs to have performed before training is done. If, after K total update-steps, some agent still has not performed at least K_min updates, then training continues until this condition is satisfied. Because the payoffs added to the propensity vector q_{i,k} are negative, we choose the Softmax function to derive p_{i,k}(A), the normalized propensity for each action. We have

p_{i,k}(A) = exp(q_{i,k}(A)) / Σ_{A'} exp(q_{i,k}(A')).

We initialize the propensity vectors q_{i,0} by sampling, for each action, a random propensity value from a normal distribution N(µ = 0, σ = 1). During learning, we set the total number of update-steps to K = 2,500,000 and impose a minimum number of K_min = 30,000 updates for every agent. The forgetting parameter is set to φ = 0.001. All simulations are repeated for 5 runs.
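For concreteness, the following is a minimal, self-contained sketch of this asynchronous training loop. It is our own illustrative rendering, not the authors' code: the name train_population is an assumption, and the constants b = 1, c = 0.1, p = 0.7 and M = N/2 are hard-coded from the numerical values above (reduce K and K_min for a quick test).

```python
import numpy as np

def softmax(q):
    """Turn a propensity vector into action probabilities."""
    e = np.exp(q - q.max())
    return e / e.sum()

def train_population(risks, K=2_500_000, K_min=30_000, N=6, phi=0.001, seed=0):
    """Asynchronous Roth-Erev training of a population facing CRDs.

    risks: array of length Z with each agent's disaster probability r_i.
    Returns the Z x 2 propensity matrix (columns: [cooperate, defect]).
    """
    rng = np.random.default_rng(seed)
    risks = np.asarray(risks, dtype=float)
    Z = len(risks)
    q = rng.normal(0.0, 1.0, size=(Z, 2))        # propensities ~ N(0, 1)
    u = np.zeros(Z, dtype=int)                   # per-agent update counts
    k = 0
    while k < K or u.min() < K_min:              # run until both conditions hold
        group = rng.choice(Z, size=N, replace=False)
        probs = np.array([softmax(q[i])[0] for i in group])
        coop = (rng.random(N) < probs).astype(int)          # 1 = cooperate
        # CRD payoffs (log-utility) for this group, as in Section 3
        wealth = 1.0 - 0.1 * coop                           # b = 1, c = 0.1
        if coop.sum() < N // 2:                             # target missed
            hit = rng.random(N) < risks[group]
            wealth = np.where(hit, wealth * 0.3, wealth)    # penalty p = 0.7
        payoffs = np.log(wealth)
        for i, a, x in zip(group, coop, payoffs):
            q[i] *= (1.0 - phi)                  # forgetting on both propensities
            q[i][0 if a == 1 else 1] += x        # reinforce the chosen action
        u[group] += 1
        k += 1
    return q
```

A population with r = 0.5 and δ = 0.3 would, for instance, be trained by passing 100 agents with risk 0.8 and 100 agents with risk 0.2.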
We follow an analogous reasoning to the one used for finding class-based Nash strategies under wealth inequality in collective risks [34]. We repeat the analysis while modifying what is necessary to accommodate homogeneous initial wealth and risk diversity. We detail the steps of this procedure below.

In a group of N players, let n_L be the number of players at low risk and n_H = N − n_L the number of players at high risk. Let n^c_L be the number of players at low risk that actually contribute to the pool, i.e., n^c_L ∈ {0, 1, ..., n_L}, and n^c_H be the number of contributors at high risk in the group, i.e., n^c_H ∈ {0, 1, ..., n_H}. Hence, a total number of (n_L + 1) × (n_H + 1) different combinations of group contributions are possible. The probability P_{n_L}(n^c_L, n^c_H) that each of these possible configurations occurs in a group with n_L agents at low risk follows a binomial law and depends on π_L and π_H. Since the game is probabilistic, the probability of a player i avoiding a disaster, given that it chose action a, can be written in terms of these configuration probabilities, separately for players at low risk and for players at high risk.

We can now write the expected payoff functions of player i depending on whether it is at low or high risk. Let H^{n_L}_L(π_L, π_H) and H^{n_L}_H(π_L, π_H) be the respective expected payoff functions of agents at low and high risk involved in a game with n_L players at low risk, where all agents at low risk follow strategy π_L and all agents at high risk follow strategy π_H. The expected payoff of an agent depends on whether the game was successful or not and whether it contributed or not to the common pool. These expected payoffs are obtained by summing, over all possible contribution configurations, the probability of each configuration multiplied by the expected return of the agent in that configuration, where the returns x_C, x̄_C, x_D and x̄_D are the payoffs described in Section 3.1. Finally, as groups are sampled randomly, the overall expected payoffs H_L(π_L, π_H) and H_H(π_L, π_H) are obtained by averaging H^{n_L}_L and H^{n_L}_H over the probability of an agent finding itself in a group with n_L agents at low risk.

Agents at low and high risk exposure aim at maximizing their respective expected payoff functions H_L and H_H. A Nash equilibrium (π*_L, π*_H) satisfies

H_L(π*_L, π*_H) ≥ H_L(π_L, π*_H)   for all π_L ∈ [0, 1],
H_H(π*_L, π*_H) ≥ H_H(π*_L, π_H)   for all π_H ∈ [0, 1].

Again, we rely on a graphical method and discretize the domain of π_L and π_H into intervals of length 0.001. We calculate the corresponding payoffs H_L and H_H over the space of possible (π_L, π_H) pairs. We then plot, for every π_H, the low-risk class's best response π^{BR}_L, i.e., the π_L such that H_L(π_L, π_H) is maximized, and similarly, for every π_L, the high-risk class's best response π^{BR}_H. The intersections of the resulting best-response lines represent class-based Nash equilibrium points. We extract these points for different game configurations in each of our two scenarios: on one hand, for different average risk values r with a fixed risk diversity δ = 0.1 (Figure 4), and on the other hand, for different risk diversity factors δ and a fixed average risk factor r = 0.5 (Figure 5).

When evaluating the expected return for the population, we do not look at the relative cost a loss has on an individual, but rather at the absolute impact it has on the population. We modify the values of the log-utility returns x_C, x̄_C, x_D and x̄_D in the expected payoff expressions above and replace them with a linear utility. A successful cooperation from an agent costs the society x_C = −cb and a failed cooperation costs x̄_C = −cb − (1 − c)pb. Similarly, a successful defection costs nothing (x_D = 0), whereas a failed defection incurs a cost of x̄_D = −pb. Then, using these class expected payoffs, we build a heat-map with the average population wealth for every combination of π_L and π_H strategies. Figure 6 illustrates some of the heat-maps obtained for different risk values and δ = 0.1, while Figure 7 illustrates heat-maps obtained for different δ values and an average population risk factor r = 0.5. Dark green colors represent solutions maximizing social welfare. We observe that the higher the risk factor, the lower the maximum social welfare obtained (see color bars). In all cases, individuals at high risk are recommended to cooperate more than those at low risk.
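The grid search over class strategies can be sketched as follows. This is our own illustrative code, not the authors': for brevity it fixes every group to n_L = n_H = N/2 (whereas the appendix averages over random group compositions), uses a coarser grid of 0.01 instead of 0.001, and the function names and default δ are assumptions.

```python
import numpy as np
from math import comb
from itertools import product

# Numerical values from Section 3.
b, c, p, N, M = 1.0, 0.1, 0.7, 6, 3
LOG = (np.log(1 - c), np.log(1 - c - p * (1 - c)), 0.0, np.log(1 - p))  # x_C, x̄_C, x_D, x̄_D
LIN = (-c * b, -c * b - (1 - c) * p * b, 0.0, -p * b)                   # linear-utility costs

def contrib_dist(n_low, n_high, pi_L, pi_H):
    """Probability of each possible number of contributors among the co-players."""
    dist = np.zeros(n_low + n_high + 1)
    for k_l in range(n_low + 1):
        for k_h in range(n_high + 1):
            dist[k_l + k_h] += (comb(n_low, k_l) * pi_L**k_l * (1 - pi_L)**(n_low - k_l)
                                * comb(n_high, k_h) * pi_H**k_h * (1 - pi_H)**(n_high - k_h))
    return dist

def H(pi_L, pi_H, r_self, focal_low, utility=LOG):
    """Expected return of a focal agent of one class in a 3-low/3-high group."""
    xc, xbc, xd, xbd = utility
    pi_self = pi_L if focal_low else pi_H
    n_low, n_high = (2, 3) if focal_low else (3, 2)          # co-players only
    dist = contrib_dist(n_low, n_high, pi_L, pi_H)
    s_C, s_D = dist[M - 1:].sum(), dist[M:].sum()            # success probabilities
    pay_C = s_C * xc + (1 - s_C) * (r_self * xbc + (1 - r_self) * xc)
    pay_D = s_D * xd + (1 - s_D) * (r_self * xbd + (1 - r_self) * xd)
    return pi_self * pay_C + (1 - pi_self) * pay_D

def class_based_solutions(r=0.5, delta=0.3, grid=np.linspace(0, 1, 101)):
    """Class-based Nash points and max-welfare pair on a discretized strategy grid."""
    r_H, r_L = r + delta, r - delta
    br_L = {ph: max(grid, key=lambda pl: H(pl, ph, r_L, True)) for ph in grid}
    br_H = {pl: max(grid, key=lambda ph: H(pl, ph, r_H, False)) for pl in grid}
    nash = [(pl, ph) for pl, ph in product(grid, repeat=2)
            if br_L[ph] == pl and br_H[pl] == ph]
    welfare = {(pl, ph): 0.5 * H(pl, ph, r_L, True, LIN) + 0.5 * H(pl, ph, r_H, False, LIN)
               for pl, ph in product(grid, repeat=2)}
    return nash, max(welfare, key=welfare.get)
```

Within this simplification, class_based_solutions returns the intersection points of the discretized best-response curves and the (π_L, π_H) pair that minimizes the population's expected losses under the linear utility.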
References

Coronavirus disease (covid-19): Risks and safety for older people
Covid-19: Protecting people and societies
The global risks report 2021
A comparative study of strategies for containing the covid-19 pandemic in gulf cooperation council countries and the european union
Effective choice in the prisoner's dilemma
The evolution of cooperation. Science
Emergent tool use from multi-agent autocurricula
The mechanics of n-player differentiable games
Health justice strategies to combat covid-19: protecting vulnerable communities during a pandemic
Shared experience actor-critic for multi-agent reinforcement learning
Open problems in cooperative ai
Coordination under threshold uncertainty in a public goods game. ZEW-Centre for European Economic Research Discussion Paper
Timing uncertainty in collective risk dilemmas encourages group reciprocation and polarization. iScience
Modeling behavioral experiments on uncertainty and cooperation with population-based reinforcement learning
Maximization, learning, and economic behavior
Learning with opponent-learning awareness
Triantafyllos Afouras, Nantas Nardelli, and Shimon Whiteson. Learning to communicate with deep multi-agent reinforcement learning
The theory of learning in games
Whose coronavirus strategy worked best? Scientists hunt most effective policies
Social dilemmas among unequals
Foolproof cooperative learning
Communication in multi-agent reinforcement learning: Intention sharing
Social dilemmas: The anatomy of cooperation
Socially connected and covid-19 prepared: The influence of sociorelational safety on perceived importance of covid-19 precautions and trust in government responses
Predictors of public climate change awareness and risk perception around the world
Multiagent reinforcement learning in sequential social dilemmas
Stable opponent shaping in differentiable games
Nicolas Heess, and Thore Graepel. Emergent coordination through competition
Multiagent actor-critic for mixed cooperative-competitive environments
Learning dynamics in social dilemmas
Independent reinforcement learners in cooperative markov games: a survey regarding coordination problems
Cooperation between independent reinforcement learners under wealth inequality and collective risks
The collective-risk social dilemma and the prevention of simulated dangerous climate change
Learning to teach in cooperative multiagent reinforcement learning
Protecting older people from covid-19: should the united kingdom start at age 60?
Evaluating gambles using dynamics
Learning in extensive-form games: Experimental data and simple dynamic models in the intermediate term
Outcome-based partner selection in collective risk dilemmas
Dynamics of informal risk sharing in collective index insurance
Dynamics of fairness in groups of autonomous learning agents
Risk of collective failure provides an escape from the tragedy of the commons
Social diversity promotes the emergence of cooperation in public goods games
Evolutionary dynamics of climate change under collective-risk dilemmas
Signals: Evolution, learning, and information
Giorgos Kallis, and Andreas Löschel. Inequality, communication, and the avoidance of disastrous climate change in a public goods game
Survey data on government risk communication and citizen compliance during the covid-19 pandemic in vietnam
The covid-19 pandemic: Lessons on building more equal and sustainable societies. The Economic and Labour Relations Review
Risk and responsibility
A bottom-up institutional approach to cooperative governance of risky commons
Climate policies under wealth inequality
Multi-agent learning with policy prediction

This work was partially supported by FCT-Portugal (UIDB/50021/2020, PTDC/MAT-APL/6804/2020, and PTDC/CCI-INF/7366/2020). This work has also received funding from the European Union's H2020 program (grant 76595).