authors: Reymond, Mathieu; Hayes, Conor F.; Willem, Lander; Rădulescu, Roxana; Abrams, Steven; Roijers, Diederik M.; Howley, Enda; Mannion, Patrick; Hens, Niel; Nowé, Ann; Libin, Pieter
title: Exploring the Pareto front of multi-objective COVID-19 mitigation policies using reinforcement learning
date: 2022-04-11

Infectious disease outbreaks can have a disruptive impact on public health and societal processes. As decision making in the context of epidemic mitigation is hard, reinforcement learning provides a methodology to automatically learn prevention strategies in combination with complex epidemic models. Current research focuses on optimizing policies w.r.t. a single objective, such as the pathogen's attack rate. However, as the mitigation of epidemics involves distinct, and possibly conflicting, criteria (i.a., prevalence, mortality, morbidity, cost), a multi-objective approach is warranted to learn balanced policies. To lift this decision-making process to real-world epidemic models, we apply deep multi-objective reinforcement learning and build upon a state-of-the-art algorithm, Pareto Conditioned Networks (PCN), to learn a set of solutions that approximates the Pareto front of the decision problem. We consider the first wave of the Belgian COVID-19 epidemic, which was mitigated by a lockdown, and study different deconfinement strategies, aiming to minimize both COVID-19 cases (i.e., infections and hospitalizations) and the societal burden that is induced by the applied mitigation measures. We contribute a multi-objective Markov decision process that encapsulates the stochastic compartment model that was used to inform policy makers during the COVID-19 epidemic. As these social mitigation measures are implemented in a continuous action space that modulates the contact matrix of the age-structured epidemic model, we extend PCN to this setting. We evaluate the solution set returned by PCN and observe that it correctly learns to reduce the social burden whenever the hospitalization rates are sufficiently low. In this work, we thus show that multi-objective reinforcement learning is attainable in complex epidemiological models and provides essential insights to balance complex mitigation policies.

As shown by the COVID-19 pandemic, infectious disease outbreaks represent a paramount global challenge that should be tackled by prevention strategies. To this end, understanding the complex dynamics that underlie these epidemics is essential. Epidemiological transmission models allow us to capture and understand such dynamics and facilitate the study of prevention strategies through simulation. However, developing efficient mitigation strategies remains a challenging process, given the non-linear and complex nature of epidemics. For this reason, reinforcement learning provides a methodology to automatically learn prevention strategies in combination with complex epidemic models. Previous research typically focuses on optimising policies with respect to a single objective, such as the pathogen's attack rate, while the mitigation of epidemics is a problem that inherently covers distinct and possibly conflicting criteria (i.a., prevalence, mortality, morbidity, cost). Therefore, optimizing on a single objective requires that these distinct criteria are somehow aggregated into a single metric.
Generally, a decision maker will rely on an expert's assistance to manually design such a metric. However, a decision maker may not be aware of all possible trade-offs between different solutions, and may therefore overlook solutions that better reflect their preferences, potentially resulting in sub-optimal outcomes. Furthermore, manually designing such metrics is time consuming, costly and error-prone, as this non-intuitive process requires repetitive and tedious tuning to achieve the desired behaviour [Hayes et al., 2021]. Moreover, taking a single-objective approach has several other limiting factors [Hayes et al., 2021, Roijers et al., 2013]. To alleviate the challenging process of explicitly defining such a metric, we can directly learn optimal behaviours by taking a multi-objective approach. By assuming that a decision maker will always prefer solutions for which at least one objective improves, it is possible to learn a set of optimal solutions called the Pareto front. This enables decision makers to review each solution on the Pareto front before making a decision, thereby being aware of the trade-offs that a solution may imply.

In this work, we investigate the use of multi-objective reinforcement learning (MORL) to learn a set of solutions that approximates the Pareto front of multi-objective mitigation strategies. We consider the first wave of the Belgian COVID-19 epidemic, which was mitigated by a hard lockdown. When the incidence of confirmed cases was steadily dropping, epidemiological experts were asked to investigate strategies to exit from the stringent lockdown that was imposed. Here, we consider the epidemiological model developed by Abrams et al., which was constructed to describe the Belgian COVID-19 epidemic and was fitted to hospitalisation incidence data and serial sero-prevalence data. This model constitutes a stochastic discrete-time age-structured compartmental model that simulates mitigation strategies by varying social distancing parameters concerning school, work and leisure contacts. Based on this model, we contribute MOBelCov, a novel multi-objective epidemiological environment, in the form of a multi-objective Markov decision process (MOMDP). MOBelCov encapsulates the epidemiological model developed by Abrams et al. to implement state transitions, uses an action space that combines a proportional reduction of school, work and leisure contacts at each time step, and defines a reward function based on two objectives: the attack rate (i.e., the proportion of the population affected by the pathogen) and the social burden. To learn and explore the trade-offs between the attack rate and social burden, we build upon a state-of-the-art MORL approach, namely Pareto Conditioned Networks (PCN) [Reymond et al., 2022], which uses a single neural network to learn the policies that belong to the Pareto front. As PCN is an algorithm designed for discrete action-spaces, we extend it towards continuous action-spaces to accommodate MOBelCov's action-space. With this continuous-action variant of PCN, we explore the Pareto front of multi-objective COVID-19 mitigation policies. By evaluating the solution set of mitigation policies returned by PCN, we observe that PCN minimises the social burden in scenarios where hospitalization rates are sufficiently low.
Therefore, in this work we illustrate that multi-objective reinforcement learning can be utilised to provide important insights surrounding the trade-offs between complex mitigation policies in real-world epidemiological models.

Real-world decision problems often have multiple and possibly conflicting objectives. Multi-objective reinforcement learning (MORL) can be used to find optimal solutions for sequential decision making problems with multiple objectives [Hayes et al., 2021]. For MORL, problems are typically modelled as a multi-objective Markov decision process (MOMDP), i.e., a tuple M = (S, A, T, γ, R), where S and A are the state and action spaces respectively, T : S × A × S → [0, 1] is a probabilistic transition function, γ is a discount factor determining the importance of future rewards and R : S × A × S → R^n is an n-dimensional vector-valued immediate reward function, where n denotes the number of objectives. For single-objective RL, i.e., when n = 1, the goal is to find the policy π* that maximizes the expected sum of discounted rewards, also called the expected return: V^π = E[ Σ_t γ^t r_t ], where r_t = R(s_t, a_t, s_{t+1}). In contrast, in MORL, n > 1, which leads to vectorial returns. In this case, there can be policies for which, without any additional information, it is impossible to know if one is better than the other. For example, we cannot decide which policy between π_1 and π_2 is optimal if they lead to expected returns (also called V-values) V^{π_1} = (0, 1) and V^{π_2} = (1, 0) respectively. We call these solutions non-dominated, i.e., solutions for which it is impossible to improve an objective without hampering another. The set that contains all the non-dominated solutions of the decision problem is called the Pareto front F. Our goal is to find the set of policies that lead to all the V-values contained in the Pareto front, Π* = {π | V^π ∈ F}. In general, we call any set of V-values a solution set. When a solution set contains only non-dominated V-values, it is referred to as a coverage set. In the case that no π exists with a V^π dominating any of the solutions in a coverage set, this coverage set is the Pareto front.

Comparing the learned coverage sets of different algorithms is a non-trivial task, as one algorithm's output might dominate the other in some region of the objective-space, but be dominated in another. Intuitively, one would generally prefer the algorithm that covers a wider range of decision maker preferences. In this work we use several metrics to evaluate our algorithm's performance. A widely used metric in the literature is the hypervolume [Zitzler et al., 2003]. This metric evaluates the learned coverage set by computing its volume with respect to a fixed, specified reference point. This metric is, by definition, the highest for the Pareto front, as no other possible solution can increase its volume (since all other solutions are dominated). One disadvantage of the hypervolume is that it can be difficult to interpret. For example, when working in high-dimensional objective-spaces, adding or removing a single point can lead to wildly different hypervolume values, especially if the point lies close to an extremum of the space. To alleviate these shortcomings, we additionally evaluate our work on a different metric called the ε-indicator I_ε [Zitzler et al., 2003], which measures how close a coverage set is to the Pareto front F.
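To make these concepts concrete, the sketch below filters a set of V-values down to its non-dominated subset (a coverage set) and computes the two-objective hypervolume, together with the additive ε-indicator and its mean variant that are defined formally in the next paragraph. This is a minimal NumPy illustration under our own naming and the two-objective restriction; it is not the evaluation code used for the experiments.

```python
import numpy as np

def non_dominated(values: np.ndarray) -> np.ndarray:
    """Keep only the V-values that no other V-value Pareto-dominates (maximization)."""
    keep = []
    for i, v in enumerate(values):
        dominated = np.any(np.all(values >= v, axis=1) & np.any(values > v, axis=1))
        if not dominated:
            keep.append(i)
    return values[keep]

def hypervolume_2d(coverage_set: np.ndarray, ref: np.ndarray) -> float:
    """Area dominated by a two-objective coverage set w.r.t. a fixed reference point."""
    pts = coverage_set[np.argsort(-coverage_set[:, 0])]  # sort by first objective, descending
    hv, prev_y = 0.0, ref[1]
    for x, y in pts:
        if y > prev_y:
            hv += (x - ref[0]) * (y - prev_y)
            prev_y = y
    return hv

def eps_indicators(pareto_front: np.ndarray, coverage_set: np.ndarray):
    """Additive epsilon indicator: for each Pareto-front point, how much its closest
    coverage-set point falls short in the worst objective."""
    eps_i = np.array([np.min(np.max(p - coverage_set, axis=1)) for p in pareto_front])
    return eps_i.max(), eps_i.mean()   # I_eps (worst case) and I_eps-mean (average)
```

For instance, applying non_dominated to the returns of a set of learned policies yields the coverage set that the other two functions then evaluate.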
Intuitively, I_ε expresses that any solution of F is at most ε better, with respect to each objective o, than the closest solution of the evaluated coverage set CS: I_ε(CS) = max_{V ∈ F} min_{V′ ∈ CS} max_o (V_o − V′_o). The main disadvantage of this metric is that we need the true Pareto front to compute it, which in the case of our MOMDP (see Section 3) is unknown. To still gain insights from our learned policies, we approximate the true Pareto front using the non-dominated policies across all runs. We note that the ε-indicator metric is quite pessimistic, as it measures worst-case performance [Zintgraf et al., 2015]: it will still report bad performance as long as a single point of the Pareto front is not correctly modeled, even if all the other points are covered. As such, we also use I_ε-mean [Reymond et al., 2022], which measures the average ε distance of the solutions in F with respect to the evaluated coverage set. Figure 1 shows a visual representation of the hypervolume and ε metrics in two dimensions.

Figure 1: The ε metrics first compute the maximum distance between each point in the Pareto front and its closest point in the coverage set (ε_i). We can then take their maximum value to compute the I_ε metric, or their mean value to obtain the I_ε-mean metric of the coverage set.

3 COVID-19 model and the MOBelCov MOMDP

3.1 Stochastic compartment model for SARS-CoV-2

As an environment to evaluate non-pharmaceutical interventions, we consider the adapted version of the SEIR compartmental model presented by Abrams et al., which was used to investigate exit strategies in Belgium after the first epidemic wave of SARS-CoV-2. This model constitutes a discrete-time stochastic model that considers an age-structured population. It generalises a standard SEIR model, extended to capture the different stages of disease spread and history that are associated with SARS-CoV-2 (i.e., pre-symptomatic, asymptomatic, symptomatic with mild symptoms and symptomatic with severe symptoms) and to represent the stages associated with severe disease, i.e., hospitalisation, admission to the intensive care unit (ICU) and death. As we consider an age-structured population, we use this extended SEIR structure for K = 10 age groups. Contacts between the different age groups, which impact the propagation rate of the epidemic, are modeled using social contact matrices (SCMs). We define an SCM for 6 different social environments: C_home, C_work, C_transport, C_school, C_leisure and C_other for the home, work, transport, school, leisure and other environments respectively. The model is described by a set of ordinary differential equations (ODEs), as described in the SI. By formulating this set of differential equations as a chain-binomial process, we can obtain stochastic trajectories from this model. In order to model different types of interventions, we follow Abrams et al. Firstly, in order to consider distinct exit scenarios, we alter the SCMs to reflect a contact reduction in a particular age group. Secondly, we assume that compliance to the interventions is gradual and model this using a logistic compliance function (see details in SI).
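To illustrate how such a chain-binomial formulation turns deterministic compartment flows into stochastic trajectories, the sketch below samples, for each flow, how many individuals leave a compartment within one time step, with an exit probability derived from the corresponding rate. This is a deliberately simplified, non-age-structured sketch with placeholder compartments and rates; it is not the transition code of the model by Abrams et al.

```python
import numpy as np

rng = np.random.default_rng(0)

def binomial_flow(count: int, rate: float, h: float) -> int:
    """Number of individuals (out of `count`) that make a transition with rate `rate`
    during a time step of length h, sampled as one chain-binomial draw."""
    p = 1.0 - np.exp(-rate * h)            # per-individual probability of leaving
    return int(rng.binomial(count, p))

def seir_step(S, E, I, R, lam, gamma, delta, h):
    """One stochastic step of a minimal SEIR chain (lam: force of infection)."""
    new_exposed    = binomial_flow(S, lam, h)    # S -> E
    new_infectious = binomial_flow(E, gamma, h)  # E -> I
    new_recovered  = binomial_flow(I, delta, h)  # I -> R
    return (S - new_exposed,
            E + new_exposed - new_infectious,
            I + new_infectious - new_recovered,
            R + new_recovered)
```

In the full model, every compartment flow and every age group receives such a draw at each step (with a step length of h = 1/24, as noted in the appendix).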
We consider a contact reduction function that imposes a proportional reduction of work (including transport) contacts p_w, school contacts p_s and leisure contacts p_l, implemented as a linear combination of the contact matrices introduced above (Equation 3; see SI).

In order to apply multi-objective reinforcement learning, we construct the MOBelCov MOMDP based on the epidemiological model introduced in Section 3.1 and graphically depicted in Figure A.1.

The state of the MOMDP consists of the state variables that make up the epidemiological model (see Figure A.1), to which we refer as s_m, together with the social contact matrix Ĉ that is currently in place. s_m directly corresponds to the aggregation of the state variables of the epidemiological model over all age groups k ∈ {1, . . . , K}, where S encodes the members of the population who are susceptible to infection and E encodes the members of the population who have been exposed to COVID-19. Moreover, I_presym, I_asym, I_mild, I_hosp and I_icu are the members of the population who are infected with COVID-19 and are, respectively, pre-symptomatic, asymptomatic, experiencing mild symptoms, hospitalised, or in the ICU. Finally, H_new,k represents the number of newly hospitalised individuals in age group k. Therefore, a state in MOBelCov combines the model state with the currently installed SCM: s = (s_m, Ĉ). We parameterise the model using the mean of the posteriors as reported by Abrams et al. (details in SI).

Action-space: Our actions concern the installation of a social contact matrix Ĉ with a particular reduction (see Section A.4). To this end, we use the proportional reduction parameters p_w, p_s, p_l defined in Section A.4. Thus, each a ∈ A is a 3-dimensional continuous vector in [0, 1]^3 (i.e., a = [p_w, p_s, p_l]) which impacts the SCM according to Equation 3.

Transition function: The model defined by Abrams et al. utilises a model transition probability M(s′_m | s_m, Ĉ) (see SI for details on M), where Ĉ is the currently installed SCM, which progresses the epidemiological model by one timestep. We use this function as the transition function in MOBelCov. For each timestep t, we simulate the model for one week, using the Ĉ obtained from a_t. As mentioned above and detailed in the SI, the compartment model is described by a set of ODEs. To obtain a stochastic version, the ODEs can be formulated as a chain-binomial process. Based on these two formulations, we create and experiment on two versions of the MOBelCov environment: the (deterministic) ODE model (i.e., a MOMDP with a deterministic transition function) and the (stochastic) Binomial model (i.e., a MOMDP with a stochastic transition function).

We define a vector reward function which considers multiple objectives: the attack rate (i.e., infections, hospitalisations) and the burden imposed by the interventions on the population. The attack rate in terms of infections is defined as the difference in susceptibles between the current state and the next state. Since this is a cost that needs to be minimized, we define the corresponding reward function as the negative attack rate: R_ARI(s, a, s′) = −(S(s) − S(s′)), where S(s) denotes the total number of susceptible individuals in state s. The reward function to reduce the attack rate in terms of hospitalisations is simply defined as the negative number of new hospitalisations: R_ARH(s, a, s′) = −Σ_k H_new,k(s′). Finally, we use the missed contacts resulting from the intervention measures as a proxy for the societal burden. To quantify missed contacts, we consider the original social contact matrix C and the installed social contact matrix Ĉ, and compute the difference Ĉ − C. The resulting difference matrix quantifies the average frequency of contacts missed.
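As an illustration of how an action [p_w, p_s, p_l] could be turned into an installed SCM and a lost-contacts quantity (the population weighting is detailed in the next paragraph), consider the sketch below. The exact linear combination used by Abrams et al. (Equation 3), in particular how the transport and other environments are grouped, is described in the SI; the grouping below is an assumption made purely for illustration.

```python
import numpy as np

def reduced_contact_matrix(C: dict, p_w: float, p_s: float, p_l: float) -> np.ndarray:
    """Assumed proportional-reduction scheme: home contacts are kept, while work
    (with transport), school and leisure (with other) contacts are scaled by the action."""
    return (C["home"]
            + p_w * (C["work"] + C["transport"])
            + p_s * C["school"]
            + p_l * (C["leisure"] + C["other"]))

def missed_contacts(C_full: np.ndarray, C_hat: np.ndarray,
                    susceptible: np.ndarray, recovered: np.ndarray) -> float:
    """Social-burden proxy: contacts missed under C_hat relative to the unreduced
    matrix C_full, weighted by the currently uninfected population per age group.
    Negating this quantity gives a reward of the same shape as R_SB."""
    missed_per_person = (C_full - C_hat).sum(axis=1)   # average missed contacts by age group
    return float(np.dot(missed_per_person, susceptible + recovered))
```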
To determine the missed contacts for the entire population, we apply the calculated difference matrix to the population sizes of the respective age groups that are currently uninfected (i.e., susceptible and recovered individuals). Therefore, we define the social burden reward function R_SB as follows: R_SB(s, a, s′) = Σ_k Σ_{k′} (Ĉ − C)(k, k′) · (S_k(s) + R_k(s)), where S_k(s) represents the number of susceptible individuals in age group k in state s and R_k(s) represents the number of recovered individuals in age group k in state s. In Section 5, we optimize PCN on two different variants of the multi-objective reward function: [R_ARI, R_SB] and [R_ARH, R_SB].

In multi-objective optimization, the set of optimal policies can grow exponentially in the number of objectives. Thus, recovering them all is an expensive process and requires an exhaustive exploration of the complete state space. To address this problem, we use Pareto Conditioned Networks (PCN), a method that uses a single neural network to encompass all non-dominated policies [Reymond et al., 2022]. The key idea behind PCN is to use supervised learning techniques to improve the policy instead of resorting to temporal-difference learning. This eliminates the moving-target problem [Schmidhuber, 2019], resulting in a stable learning algorithm. PCN uses a single neural network that takes a tuple (s, ĥ, R̂) as input. R̂ represents the desired return of the decision maker, i.e., the return PCN should reach by the end of the episode. ĥ denotes the desired horizon, which expresses the number of timesteps that should be executed before reaching R̂. At execution time, both ĥ and R̂ are chosen by the decision maker at the start of the episode. Then, at every timestep, the desired return is updated according to the perceived reward r_t, R̂ ← R̂ − r_t, and the desired horizon is decreased by one, ĥ ← ĥ − 1. PCN's neural network has an output for each action a_i ∈ A. Each output represents the confidence the network has that, by taking the corresponding action in s, R̂ will be reached in ĥ timesteps. We can draw an analogy with a classification problem where the network should learn to classify (s, ĥ, R̂) to its corresponding label a_i. Similar to classification, PCN requires a labeled dataset with training examples to learn a mapping from input to label. However, contrary to classification, the data present in the dataset is not fixed. Each time PCN executes a trajectory, it creates a tuple x = (s_t, R_t, T − t), y = a_t for every transition of the trajectory and adds it to the dataset. PCN collects new data for a fixed number of episodes, after which it re-trains the network with batch updates from the newly assembled dataset. This improves the policies induced by the network, which in turn enables the agent to gather superior data for the next training batch. PCN trains the network as a classification problem, where each class represents a different action. Transitions x = (s_t, h_t, R_t), y = a_t are sampled from the dataset, and the ground-truth output y is compared with the predicted output ŷ = π(s_t, h_t, R_t). The predictor (i.e., the policy) is then updated using the cross-entropy loss function L = −Σ_{a ∈ A} y_a log ŷ_a, where y_a = 1 if a = a_t and y_a = 0 otherwise.

While the original PCN algorithm is designed for MOMDPs with discrete action-spaces, the problem we tackle (see Section 3) is defined in terms of a continuous action-space. We thus extend PCN to the continuous action-space setting. First, we change the output of the neural network such that there is a single output value for each dimension of the action-space.
Since the actions should be bound to the domain of possible actions ([0, 1] in the case of MOBelCov, see Section 3), we apply a tanh non-linearity to this output. Second, the problem becomes a regression problem instead of a classification problem, as the labeled dataset now uses continuous labels y = a_t instead of categories. We thus use a Mean Squared Error (MSE) loss to update the policy: L = ‖ŷ − y‖², with ŷ = π(s_t, h_t, R_t). Since learning the full set of Pareto-efficient policies Π* requires that the policies π* ∈ Π* are deterministic stationary policies [Roijers et al., 2013], we use the output ŷ as the action at execution time. However, PCN improves its policy through exploration, by continuously updating its dataset with better trajectories. Thus, at training time, we follow [Lillicrap et al., 2015] and use a stochastic policy by adding random noise, sampled from a Normal distribution, to the action, where the hyper-parameter η defines the magnitude of the noise to be added.

PCN trains its policy on a dataset that is collected by executing trajectories. It assumes that re-enacting a transition from the dataset leads to the same episodic return. When the transition function T of the MOMDP is deterministic, the whole trajectory can be faithfully re-enacted, which guarantees that we obtain the same return. Combined with the fact that PCN's policy is deterministic at execution time, conditioning the policy on a target episodic return is equivalent to conditioning it on the V-value V. However, when T is stochastic, we lose this guarantee. We cope with this issue by adding small random noise to R_t when performing gradient descent, which reduces the risk of overfitting [Zur et al., 2009]. Moreover, although the Binomial model of MOBelCov is stochastic, the possible next-states resulting from a state-action pair are similar to each other. This allows PCN to compensate if r_t = R(s_t, a_t, s_{t+1}) is somewhat worse than expected. This is confirmed by our experiments (see Section 5), where the coverage sets learned by PCN on the ODE model and on the Binomial model are similar.

Our goal is to use PCN to learn deconfinement strategies in the MOBelCov environment. We aim to obtain policies that balance the social burden experienced by the population against the epidemiological objective of minimising the attack rate. To this end, we consider two cases for the vectorial reward function, [R_ARH, R_SB] and [R_ARI, R_SB], to learn and analyse policies under different targets with respect to the considered attack rate. We now evaluate our extension of PCN for continuous action-spaces on both the ODE and Binomial models. In accordance with Abrams et al., the simulation starts on the 1st of March 2020, by seeding a number of infections in the population. Two weeks later, on the 14th of March, the Belgian government initiated a full lockdown of the country. This is implemented by fixing the actions p_w, p_s, p_l to 0.2, 0 and 0.1 respectively. This lockdown ended on the 4th of May 2020, at which point the government decided on a multi-phase exit strategy to incrementally reduce teleworking, reopen schools and allow leisure activities, such as the re-opening of bars and the cultural sector. It is from this day onward that PCN aims to learn policies for diverse exit strategies, trading off the total number of daily new hospitalizations against the total number of contacts lost, the latter serving as a proxy for the social burden. The simulation lasts throughout the school holidays (from 01/07/2020 to 31/08/2020).
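Before describing the rest of the experimental setup, we give a minimal sketch of the continuous-action components of PCN described above: a policy network conditioned on (state, desired horizon, desired return) with a bounded output, an MSE update on stored actions, and Gaussian exploration noise. The layer sizes, the rescaling of tanh to [0, 1] and the clamping of noisy actions are our own simplifications, not the architecture used in the experiments (which is listed in the SI).

```python
import torch
import torch.nn as nn

class ContinuousPCNPolicy(nn.Module):
    """pi(s, h, R) -> action in [0, 1]^3 (p_w, p_s, p_l)."""
    def __init__(self, state_dim: int, n_objectives: int = 2, action_dim: int = 3):
        super().__init__()
        self.state_emb = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU())
        self.cond_emb = nn.Sequential(nn.Linear(n_objectives + 1, 64), nn.ReLU())
        self.head = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, action_dim))

    def forward(self, state, horizon, desired_return):
        cond = torch.cat([desired_return, horizon], dim=-1)
        z = self.state_emb(state) * self.cond_emb(cond)    # combine state and conditioning
        return 0.5 * (torch.tanh(self.head(z)) + 1.0)      # squash the output to [0, 1]

def train_step(policy, optimizer, states, horizons, returns, actions):
    """Supervised regression of stored actions with an MSE loss."""
    pred = policy(states, horizons, returns)
    loss = nn.functional.mse_loss(pred, actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

def explore_action(policy, state, horizon, desired_return, eta: float):
    """At data-collection time, add Gaussian noise of magnitude eta and keep the action valid."""
    with torch.no_grad():
        a = policy(state, horizon, desired_return)
    return torch.clamp(a + eta * torch.randn_like(a), 0.0, 1.0)
```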
Schools are closed during the school holidays, which is simulated by setting p_s = 0 regardless of the corresponding value output by the policy, i.e., during periods of school closure p_s is ignored. To evaluate the quality of the policies discovered by PCN, we compare it to a baseline. The baseline consists of a set of 100 fixed policies that iterate over the possible social restriction levels, with values ranging from 0 to 1. In other words, the fixed policies directly operate in a fine-grained manner on the whole contact reduction function Ĉ. This allows us to obtain a strong baseline for potential exit strategies over the objective space. We note that while such fixed policies are a feasible approach, they do not scale well in terms of action and objective spaces, and they are not able to provide an adaptive restriction level, which is what we aim to provide using PCN. All experiments are averaged over 5 runs. The hyper-parameters and the neural network architecture can be found in the SI.

We learn a coverage set that ranges from imposing minimal restrictions to enforcing many restrictions (see Figure 4). The coverage set is shown in Figure 3. We notice that the coverage sets discovered by PCN almost completely dominate the coverage set of the baseline, showing that there are much better alternatives to the fixed policies. This is most visible for the compromise policies, where one has to carefully choose when to remove social restrictions while at the same time minimizing the impact on daily new hospitalizations. In these scenarios, PCN discovers policies that drastically reduce the total number of new hospitalizations (e.g., by more than 20000) for the same social burden. Analysing the executions of such policies (see Figure 4, middle plot) shows a flattened hospitalization curve, with a gradual increase of social freedom during the school holidays such that the peak stays stable and gradually decreases over time. Interestingly, we notice that the most restrictive policy (i.e., the one that prioritizes hospitalizations over social burden, see Figure 4, bottom plot) still starts to gradually increase p_w and p_l from the end of July onward. This is because by then the epidemic has mostly faded out, and it is safe to reduce social restrictions. The timing of this reduction is important, as reducing restrictions too soon can lead to a new wave. PCN learns the impact of its decisions over time, and correctly infers the timing at which restrictions can be safely lifted.

Table 1: Evaluation metrics (hypervolume, I_ε and I_ε-mean) for the coverage sets comparing hospitalizations with social burden; the fixed-policy baseline obtains 0.6 ± 0.0, 0.106 ± 0.0 and 0.058 ± 0.0 on these metrics respectively. In general, training on the ODE model results in slightly better coverage sets than on the Binomial model. Training on infections (ARI) still provides a competitive coverage set in terms of hospitalizations. All PCN coverage sets outperform the baseline.

Next, we compare the performance of PCN when trained on the ODE and Binomial models. The ODE model has the advantage of using a deterministic transition function, for which PCN is suited. However, taking stochasticity into account via the Binomial model is important, as individual behavioural changes introduce substantial uncertainty in the course of the outbreak, and stochastic model evaluations are required to assess the impact of this uncertainty on the effectiveness of the intervention strategies.
Regardless, the results displayed in Table 1 show that the solution sets that PCN finds using the Binomial model are competitive with those found using the ODE model, as indicated by their hypervolume and I_ε-mean metrics. We do note that the difference in the I_ε indicator is more significant, indicating that a few solutions discovered using the ODE model are not found when using the Binomial model, which impacts worst-case performance. Overall, we see that our extension of PCN performs efficiently, even with some degree of stochasticity in the transition function.

We now assess the difference in coverage sets when optimizing on R_ARH versus R_ARI. Although at a different scale, our experiments show that infections and hospitalizations are correlated. This is confirmed in Figure 3: the coverage set in terms of hospitalizations learned by PCN_ARI is only slightly worse than the one learned by PCN_ARH, even though PCN_ARI optimized on the infection attack rate. This is expected, as during the initial phase of the epidemic only limited immunity was present in the population (little natural immunity and no vaccines), which induces a tight coupling between infection and hospitalisation cases. Moreover, the opposite also holds: both coverage sets are similar in terms of infections. However, there is one notable exception. A number of policies learned by PCN_ARH start removing social restrictions before the epidemic has died out, resulting in a rise of hospitalizations at the end of the summer holidays. Since, following Abrams et al., this is when the simulation stops, the total number of hospitalizations remains low. The number of infections, however, starts growing before this rise in hospitalizations and, since its growth is exponential, ends up high. We note that PCN_ARH also learns alternative policies for similar returns [R_ARH, R_SB] that do not exhibit this behavior. This shows that, even though a policy might lead to a return that matches the preferences of the decision maker, its execution might still lead to undesired behavior. We argue that the use of such expert systems for decision making should be paired with careful interpretation.

The dataset of trajectories that PCN is trained on is pruned over time to keep only the most relevant trajectories. The returns of these trajectories are used in Figure 3 to show the discovered coverage set. Each of these returns can be used as a desired return for policy execution. We now assess the robustness of the executed policies by comparing the average return obtained over multiple policy executions with the corresponding target return. We show that the executed policies reliably obtain returns that are similar to the desired return used to condition PCN. Results are shown in Table 2. The I_ε indicator shows that, over all policies, the decision maker will lose at worst 0.046 and 0.068 normalized return in any of the objectives for the ODE and Binomial versions respectively. On average, it will lose 0.007 and 0.024 normalized return respectively, which corresponds to either a total of 3633 and 1059 more hospitalizations than expected or a total of 262 and 76 fewer contacts than expected.

                       I_ε               I_ε-mean
PCN_ARH (ODE)          0.046 ± 0.012     0.007 ± 0.003
PCN_ARH (Binomial)     0.068 ± 0.021     0.024 ± 0.002
PCN_ARI (ODE)          0.056 ± 0.013     0.008 ± 0.004
PCN_ARI (Binomial)     0.058 ± 0.012     0.011 ± 0.003

Table 2: Comparing the difference between the desired return provided to PCN and the actual return PCN obtained when executing its policy.
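This robustness check can be sketched as follows: condition the policy on a target return and horizon taken from the coverage set, execute it several times, and compare the average achieved return with the target. The environment and policy interfaces below are assumptions made for illustration, not the code used in the experiments.

```python
import numpy as np

def evaluate_target(env, policy, desired_return, horizon, n_episodes=10):
    """Execute the policy conditioned on a target return several times and report
    the average achieved return, to compare against the target (cf. Table 2).
    `env` and `policy` follow an assumed Gym-like / PCN-like interface."""
    achieved = []
    for _ in range(n_episodes):
        state = env.reset()
        R, h = np.array(desired_return, dtype=float), float(horizon)
        total = np.zeros_like(R)
        done = False
        while not done:
            action = policy(state, h, R)            # deterministic action at execution time
            state, reward, done, _ = env.step(action)
            total += reward
            R -= reward                             # update the desired return with the observed reward
            h = max(h - 1.0, 1.0)                   # decrease the desired horizon
        achieved.append(total)
    return np.mean(achieved, axis=0)                # compare this with desired_return
```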
We see that, regardless of the setting, the learned policy faithfully obtains a return similar to its desired return.

Reinforcement learning (RL) has been used in conjunction with epidemiological models to learn policies to limit the spread of diseases and to predict the effects of possible mitigation strategies [Probert et al., 2019, Ernst et al., 2006]. For example, RL has been used extensively in modelling and controlling the spread of influenza [Das et al., 2008, 2018]. RL and deep RL have also been used as a decision making aid to reduce the spread of COVID-19, for example to learn effective mitigation strategies [Ohi et al., 2020], to learn the efficacy of lockdown and travel restrictions [Kwak et al., 2021] and to limit the influx of asymptomatic travellers [Bastani et al., 2021]. Multi-objective methods have also been deployed to learn optimal strategies to mitigate the spread of COVID-19. Wan et al. [Wan et al., 2021] implement a model-based multi-objective policy search method and demonstrate their method on COVID-19 data from China. As the method is model-based, a model of the transition function must be learned by sampling from the environment. The method proposed by Wan et al. [Wan et al., 2021] only considers a discrete action space, which limits the application of their algorithm. Wan et al. [Wan et al., 2021] use linear weights to compute a set of Pareto optimal policies. However, methods which use linear weights can only learn policies on the convex hull of the Pareto front [Vamplew et al., 2008]; therefore, the full Pareto front cannot be learned. It is important to note that the method proposed by Kompella et al. [Kompella et al., 2020] considers multiple objectives. However, the objectives are combined using a weighted sum with hand-tuned weights which are determined by the authors. The weighted sum is applied by the reward function and a single-objective RL method is used to learn a single optimal policy. In contrast to previous work, our approach makes no assumptions regarding the scalarisation function of the user and is able to discover Pareto fronts of arbitrary shape.

Making decisions on how to mitigate epidemics has important ethical implications with respect to public health and societal burden. In this regard, it is crucial to approach this decision making from a balanced perspective, to which end we argue that multi-objective decision making is essential. In this work, we establish a novel approach, i.e., an expert system, to study multi-faceted policies, and this approach shows great potential with respect to future epidemic control. We are aware of the ethical implications that expert systems have on the decision process and we make the disclaimer that all results based on, or derived from, the expert system that we propose should be carefully interpreted by experts in the field of public health, and in a much broader context of economics, well-being and education. We note that the work in this manuscript was conducted by a multi-disciplinary consortium that includes computer scientists and scientists with a background in public health, epidemiology and bio-statistics. In this work, we focus on the clinical outcomes of the intervention strategies and use the reduced contacts as a proxy for the quality of life lost. This could be extended into more formal economic evaluations by assessing the health and monetary benefits and costs of different interventions for various stakeholders.
The COVID-19 pandemic demonstrated the broad impact of infectious diseases on sectors outside health care. This stresses the need to capture a societal, and thus multi-objective, perspective in the decision making process on public health and health care interventions. Our learned policies confirm this, showing that focusing solely on reducing the number of hospitalizations results in taking drastic measures (more than a thousand social interactions lost per person over the span of 4 months) that may have a long-lasting impact on the population. To conclude, we show that multi-objective reinforcement learning can be used to learn a wide set of high-quality policies on real-world problems, providing the decision maker with insightful and diverse alternatives, and showing the impact of taking extreme measures.

In this work we utilise the compartmental model proposed by Abrams et al. and extend it to a multi-objective Markov decision process (MOMDP). The MOMDP used in this work is outlined in the main paper. Below, we describe the details of the underlying compartmental model used to model the spread of COVID-19.

Figure A.1: Schematic diagram of the compartmental model for SARS-CoV-2 presented by Abrams et al., which is used to derive the MOMDP.

We utilise the adapted SEIR mathematical compartmental model proposed by Abrams et al. to construct the MOBelCov MOMDP with deterministic and stochastic state transitions. In this model, members of the population are susceptible to infection when they are in compartment S. If an individual comes into contact with an infectious individual, they move to the exposed compartment E at a time-specific rate λ(t). After a period of time, an exposed individual becomes infectious and moves to the pre-symptomatic compartment I_presym at rate γ. Once infected, individuals develop mild symptoms (I_mild) with probability 1 − p or do not develop any symptoms (I_asym) with probability p, where asymptomatic individuals recover (R) at rate δ_1. Individuals who experience symptoms can suffer from a mild infection (I_mild) and recover at rate δ_2, or they progress to a more severe infection (I_sev) at rate ψ. Individuals with a severe infection are then transferred to a hospital for treatment (I_hosp) with probability φ_1; some individuals, however, become critically ill and are transferred directly to the intensive care unit (ICU) with probability 1 − φ_1. Individuals in the hospital (I_hosp) and in the ICU (I_icu) recover at rates δ_3 and δ_4 and die at rates τ_3 and τ_4, respectively. Figure A.1 outlines the compartmental model defined by Abrams et al. The flows of the deterministic model are defined by a set of ordinary differential equations, in which, for example, S = (S_1(t), S_2(t), ..., S_K(t))^T is the vector representing the susceptible members of the population of each age group k at time t. In this set of ordinary differential equations, each state variable represents a vector over all age groups for a particular compartment at time t. Infection dynamics are governed by an age-specific force of infection λ(k, t), where k is the respective age group out of a total of K age groups, and β(k, k′) is the time-invariant transmission rate that encodes the average per capita rate at which an infectious individual in age group k makes an effective contact with a susceptible individual in age group k′, per unit of time.
Under the social contact hypothesis [Wallinga et al., 2006], we have that β(k, k′) = q · C(k, k′), where q is a proportionality factor and the matrix C denotes the social mixing behaviour within and between the different age groups in the population, and is referred to as a social contact matrix. Following Abrams et al., we rely on distinct social contact matrices for symptomatic and asymptomatic individuals, respectively C_s and C_a. Therefore, it is possible to define the transmission rates for symptomatic and asymptomatic individuals as β_s(k, k′) = q_s · C_s(k, k′) and β_a(k, k′) = q_a · C_a(k, k′) (A.4). Moreover, the age-dependent force of infection can be written in vector form as λ(t) = (λ(1, t), λ(2, t), ..., λ(K, t)). For all further information about the different compartments and parameters, please refer to the work of Abrams et al.

To create a version of MOBelCov with a deterministic transition function M, we focus on the deterministic model proposed by Abrams et al. Given that the transitions within the compartmental model are deterministic, it is possible to utilise these model transitions for MOBelCov. Applying the contact matrix Ĉ to the model state s_m progresses the model for one timestep and returns a new compartmental model state s′_m. Given s_m and Ĉ, the deterministic compartmental model returns s′_m with probability 1. Therefore, it is possible to use this process as a deterministic transition function M for MOBelCov.

In Figure B.1, we display the coverage set with respect to the number of hospitalizations of the learned policies for both the ODE and Binomial variants, as the Binomial variants were omitted in the main paper for clarity. When trained on R_ARH, the Binomial coverage set is on par with its ODE counterpart. On the other hand, the coverage sets when trained on R_ARI are slightly worse, as confirmed by the evaluation metrics in Table 1. Moreover, we also show the coverage sets with respect to the number of infections in Figure B.2. Interestingly, both variants trained on the Binomial model learn a coverage set that almost completely dominates the variants trained on the ODE model. This is due to the stochasticity of the Binomial model. PCN stores trajectories in its dataset and keeps a selection for training. Due to the stochasticity in the transition function, some trajectories might result in higher returns, even though the same policy was applied. Given the high variability in infections, this difference might become significant. This is also why the variants using the Binomial model show a worse I_ε and I_ε-mean than their ODE counterparts: their desired return stems from a single trajectory, while we evaluate the policies over multiple trajectories. The resulting average return is then different from the desired return. Section 5.3 explains why some policies trained on R_ARH show suboptimal behavior with respect to the number of infections.

PCN learns a coverage set containing more than 100 different policies. To gain better insight into their behavior, and how they differ from each other, we plot executions of each learned policy in Figure B.3. The plots are displayed from the least restrictive policy in terms of social burden to the most restrictive one. We notice that policies 60 to 140 display similar behavior. At first, they all completely restrict social contact, effectively continuing the lockdown. Afterwards, they progressively lessen these restrictions. What differs in these policies is the timing and the speed at which the restrictions are lessened.
These policies correspond to the right-most part of the coverage set displayed in Figure B.1, which shows many alternatives for the [0, 40000] hospitalizations segment.

We used the same hyper-parameters across all experiments. Each experiment resulted in 5 independent trials. Finally, we performed a grid-search over possible hyper-parameter values. All hyper-parameters used and their possible values explored during the grid-search are displayed in Table B.1. Next to the hyper-parameter search, we also performed a grid search over 4 different neural network architectures. All the architectures have the same structure. We use a compartment embedding sc_emb, an SCM embedding sm_emb and a school-holidays embedding sh_emb that take as inputs the compartmental state, the previous p_w, p_s, p_l values (as they fully define the SCM Ĉ) and a boolean flag for school holidays, respectively. All these embeddings have the same size of 64 and are multiplied together to form the full state embedding. This state embedding is used as input for another network, s_emb. Additionally, we use a common embedding c_emb for the concatenation of the desired return and horizon. Finally, the results of s_emb and c_emb are multiplied together, before passing through a fully connected network fc that has 3 outputs, one for each of p_w, p_s and p_l. All the architectures of the different components are displayed in Table B.2. The variant used in all experiments is conv1d-big.

Intervening in the spread of the virus, for example by reducing social contacts or through government interventions, introduces uncertainty in the further course of the outbreak. Therefore, to understand how this uncertainty affects the spread of the disease, we introduce a stochastic compartment model which can model the uncertainty generated by interventions in social contacts. By formulating the set of differential equations defined above as a chain-binomial, we can obtain stochastic trajectories from this model [Bailey, 1975]. A chain-binomial model assumes a stochastic model where infected individuals are generated by some underlying probability distribution. For the stochastic model we consider a time interval (t, t + h], where h is defined as the length between two consecutive time points. Similar to Abrams et al., in this work we set h = 1/24. The set of differential equations is then defined as a chain-binomial process. Given that MOBelCov also calculates new hospitalisations, H_new, we define H_new for the stochastic compartmental model accordingly. For more details on this model and the chain-binomial representation of the differential equations, we refer the reader to the work of Abrams et al.

To create a version of MOBelCov with a stochastic transition function M, we utilise the stochastic compartmental model outlined above. Given that the transitions within the compartmental model are derived from an underlying probability distribution, it is possible to utilise the stochastic compartmental model transitions for MOBelCov. As previously outlined in Section A.1, applying the contact matrix Ĉ to the model state s_m progresses the model and returns a new model state s′_m. Given that the underlying model dynamics are governed in a probabilistic manner, the model returns s′_m stochastically. Therefore, it is possible to use this process as a stochastic transition function M for MOBelCov.

The model is parameterised using the mean of the posteriors as reported by Abrams et al. The population size for each of the considered age groups was taken from the Belgian statistical agency STATBEL. To initialise the model, we used the number of confirmed cases until 13 March 2020, as reported by the Belgian agency for public health Sciensano. In order to model different types of interventions, we follow Abrams et al. Firstly, to consider distinct exit scenarios, we alter the social contact matrices to reflect a contact reduction in a particular age group. Secondly, we assume that compliance to the interventions is gradual and model this using a logistic compliance function. We use the logistic compliance function of time t proposed by Abrams et al., where t_I indicates the time at which the intervention started. We initialise β*_1 to the value estimated by Abrams et al. and choose β*_0 = −5 as an intercept, to have c(t) = 0 for t = 0, in correspondence with Figure F2 in the Supplementary Information of Abrams et al.

RL has also been used to learn effective mitigation strategies which limit the spread of COVID-19 [Ohi et al., 2020]. Kwak et al. [Kwak et al., 2021] use deep RL, with an SIRD model, to learn the efficacy of lockdown and travel restrictions in controlling the COVID-19 pandemic, using data collected from various government organisations. Bastani et al. [Bastani et al., 2021] use an RL algorithm called Eva to limit the influx of asymptomatic travellers infected with SARS-CoV-2 to Greece. Kompella et al. [Kompella et al., 2020] define a COVID-19 simulator and a methodology to learn mitigation policies that minimize the economic impact without overwhelming the hospital capacity. RL has also been used to learn effective COVID-19 vaccination distribution strategies [Beigi et al., 2021, Awasthi et al., 2020] and to create decision support systems for mitigating the spread of COVID-19.

References
Deep reinforcement learning for large-scale epidemic control.
Peter Vamplew, and Diederik M. Roijers. A practical guide to multi-objective rl and planning.
A survey of multi-objective sequential decision-making.
The impact of contact tracing and household bubbles on deconfinement strategies for covid-19.
Modelling the early phase of the belgian covid-19 epidemic using a stochastic compartmental model and studying its implied future trajectories.
Pareto conditioned networks.
Performance assessment of multiobjective optimizers: An analysis and review.
Quality assessment of morl algorithms: A utility-based approach.
Reinforcement learning upside down: Don't predict rewards - just map them to actions.
Continuous control with deep reinforcement learning.
Noise injection for training artificial neural networks: A comparison with weight decay and early stopping.
Context matters: using reinforcement learning to develop human-readable, state-dependent outbreak response policies.
Clinical data based optimal sti strategies for hiv: a reinforcement learning approach.
A large-scale simulation model of pandemic influenza outbreaks for development of dynamic mitigation strategies.
Bayesian best-arm identification for selecting influenza mitigation strategies.
Exploring optimal control of epidemic spread using rl.
Deep reinforcement learning approaches for global public health strategies for covid-19 pandemic.
Ioannis Vlachogiannis, Christos Hadjicristodoulou, Pagona Lagiou, Gkikas Magiorkinis, Dimitrios Paraskevis, and Sotirios Tsiodras. Efficient and targeted covid-19 border testing via rl.
Multi-objective model-based reinforcement learning for infectious disease control.