title: COVID-19 Pandemic Cyclic Lockdown Optimization Using Reinforcement Learning
authors: Arango, Mauricio; Pelov, Lyudmil
date: 2020-09-10

This work examines the use of reinforcement learning (RL) to optimize cyclic lockdowns, one of the methods available for control of the COVID-19 pandemic. The problem is structured as an optimal control system for tracking a reference value, corresponding to the maximum usage level of a critical resource, such as ICU beds. However, instead of using conventional optimal control methods, RL is used to find optimal control policies. A framework was developed to calculate optimal cyclic lockdown timings using an RL-based on-off controller. The RL-based controller is implemented as an RL agent that interacts with an epidemic simulator, implemented as an extended SEIR epidemic model. The RL agent learns a policy function that produces an optimal sequence of open/lockdown decisions such that goals specified in the RL reward function are optimized. Two concurrent goals were used: the first is a public health goal that minimizes overshoots of ICU bed usage above an ICU bed threshold, and the second is a socio-economic goal that minimizes the time spent under lockdowns. It is assumed that cyclic lockdowns are considered as a temporary alternative to extended lockdowns when a region faces imminent danger of overpassing resource capacity limits and when imposing an extended lockdown would cause severe social and economic consequences due to a lack of the economic resources necessary to support its affected population during an extended lockdown.

This article examines the use of RL [1] to optimize cyclic lockdowns, which is one of the methods available for control of the COVID-19 pandemic. As originally described in [2] and [3], cyclic lockdowns are a control method in the epidemic control toolbox. During the first phase of the pandemic, stringent combinations of Non-Pharmaceutical Interventions (NPIs) were applied by governments to slow down the exponential spread of the disease and reduce epidemic indicators to low values. We assume a lockdown is the combination of multiple NPIs, including stay-at-home orders, school closures, and business closures, and that these NPIs are applied and released together. Across the world, lockdown measures have generally been very successful and have helped save millions of lives, as discussed in [4] and [5]. However, lockdowns have high economic and social costs, so in some regions it may be unsustainable to leave them in place for very long intervals, pressing governments to reduce restrictions. Yet the risk of new epidemic waves becomes high if the degree of unprotected social interaction increases too much during opening phases, which would require new deployment of lockdowns. We assume cyclic lockdowns are used only as an alternative to extended lockdowns when a region's government, due to economic limitations, cannot provide sufficient support to its affected population. In cases where there is an urgent need to lower the rate of spread of the virus in order to avoid overpassing the available capacity of critical resources, such as ICU beds and ICU medical teams, and at the same time the economy cannot be completely closed, cyclic lockdowns can be a viable alternative, as described in [6].
Cyclic lockdowns can be used to buy time to strengthen other critical processes needed to control the epidemic, including testing, contact tracing, isolation facilities, hiring of necessary medical teams, expanding medical equipment stockpiles, and deploying unified and effective community education programs on required social practices, including the use of masks and distancing.

The main contribution of this effort is the development of a tool to estimate the optimal timing of open and closed segments during cyclic lockdowns. We approach this problem by framing it as an optimal control problem and solving it using RL methods. From the RL point of view, it is a sequential decision problem with a combination of public health and economic impact goals that need to be optimized. The tool comprises a simulation and optimization framework that integrates a dynamic epidemic model with an RL control agent. The tool helps answer questions such as: when would the first lockdown occur, what percentage of time is spent in open and in lockdown mode, what is the average length in days of both the open and closed segments, and what is the maximum effective reproduction number [7], R_t, that would avoid lockdowns.

There is previous work on applying RL to epidemic control, as described in [8] and [9]. However, to our knowledge there is no previous work on applying RL to optimize cyclic lockdowns. Also related is work on using feedback control to guide the application of intervention measures [10]; our work differs in that it provides optimal control and does so using RL methods. Cyclic lockdown strategies have been studied in [6], where predictable, fixed open and closed cycle segment lengths are proposed (e.g., 4 days open and 10 closed). Our work differs in that it optimizes cycles according to public health and economic goals. Cyclic lockdown patterns studied in [2] and [3] use a basic heuristic feedback control method in which a simple fixed rule decides when to apply or release a lockdown: a lockdown is applied when the number of ICU beds in use exceeds a high ICU bed threshold and released when the number of ICU beds in use falls below a low ICU bed threshold. This approach has two disadvantages: it can cause very large ICU bed usage spikes that sometimes exceed the total ICU bed capacity, and it produces long lockdown cycles. We refer to this basic method as the baseline on/off control method and contrast our solution with it. Our implementation of the baseline on/off control method used the same value for both the high and low ICU bed thresholds.

This work is structured as an optimal control tracking problem in which the optimization goal is to bring one of the epidemic system variables, the number of ICU beds in use, as close as possible to a reference input, which is the ICU bed threshold. The input control action is the epidemic's effective reproduction number [7], R_t, which can take only two values, for example 1.5 (open) or 0.7 (lockdown). This type of on/off control is also referred to as bang-bang control. Instead of using conventional optimal control methods to find an optimal control policy function for R_t, we use RL as described in [11], [12], [13], and [14]. In contrast to conventional optimal control, RL does not require a model of the target system dynamics. With RL, a control policy function can be learned by an RL control agent interacting directly with the target system or with a simulator.
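For reference, the baseline on/off rule can be expressed in a few lines of Python. This is only an illustrative sketch: the function and variable names (baseline_on_off_control, icu_in_use, and so on) are hypothetical and are not taken from the implementation used in this work.

def baseline_on_off_control(icu_in_use, currently_locked_down,
                            icu_threshold_high, icu_threshold_low,
                            r_open=1.5, r_closed=0.7):
    # Fixed rule: impose a lockdown when ICU usage exceeds the high threshold and
    # release it only when usage falls below the low threshold. In the baseline
    # used in this work, both thresholds are set to the same value.
    if icu_in_use > icu_threshold_high:
        return r_closed
    if currently_locked_down and icu_in_use >= icu_threshold_low:
        return r_closed
    return r_open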
In this case, although a dynamic model of the epidemic is available, it is used as a simulator and is decoupled from the RL control agent. This provides significant flexibility to modify either the simulation model or the RL control agent with minimal or no impact on the other.

The rest of this document describes the dynamic epidemic model in section 2, the RL control agent and how it was integrated with the epidemic model in section 3, and the experiments performed for cyclic lockdown policy optimization in section 4. Finally, conclusions are summarized in section 5.

To simulate the dynamics of the COVID-19 pandemic we use an extended SEIR (Susceptible, Exposed, Infected, Recovered) model. The SEIR model is a compartmental epidemiological model [15], in which the population is divided into groups according to their status with respect to the disease progression. We use an extended SEIR model leveraging the model described in [16]. The compartments are Susceptible (S), Exposed (E), Infected (I), Removed (RM), Mild (M), Severe (SV), Hospitalized (H), Dead (D), and Recovered (RC), as illustrated in Figure 1. The extended model assumes that infected persons self-isolate at a fixed rate and move to the Removed (RM) compartment. Removed infected persons can develop either a mild case (with probability p_mild), in which they self-isolate and recover without hospitalization, or a severe case (with probability p_severe = 1 − p_mild), in which hospitalization is required and the patient either recovers (with probability 1 − p_death) or dies (with probability p_death).

For each of the population groups, its rate of change is modeled as a differential equation, and the combined model is a system of differential equations that follows the compartment flows illustrated in Figure 1. In addition to the above probabilities, the differential equations have multiple change-rate parameters, whose values are summarized in Table 1 (for example, the probability of death in hospital after admission is 0.17). This system of differential equations can be solved with the approximate Euler numerical method, as explained in [18]. The Euler method requires converting the set of continuous-time differential equations into a set of discrete-time difference equations that generate estimates of the population values in each of the compartments at each successive time interval.

A key variable in this model, as will be described below, is the current number of ICU beds in use, ICU(t). Its value is derived as 30% of the current number of hospitalized persons, H(t). Our model is dynamic because one of its parameters, the transmission rate, β, which is the average number of persons an infectious individual transmits the disease to each day, is time-dependent. The transmission rate can be modified through interventions that change the level of interaction between people: the more stringent the social distancing NPIs, the lower the corresponding transmission rate. β can also be reduced through interventions that reduce the infectiousness of asymptomatic infected individuals, such as requiring the use of face masks. R_t, the effective reproduction number, is the most widely known epidemic metric and represents the severity of an epidemic at any given time. R_t is defined as the average number of persons one infected individual infects.
Mathematically, R_t is the ratio of the transmission rate to the recovery rate, R_t = β / γ. If the transmission rate is larger than the recovery rate, R_t > 1, indicating the epidemic is expanding; if the transmission rate is lower than the recovery rate, R_t < 1, indicating the epidemic is in decline. We assume the recovery rate γ is constant and is influenced only by biological and medical factors, not by NPIs. Hence, instead of using β as the time-dependent input to the model, we use R_t, from which β is immediately derived as β = R_t × γ. As discussed above, NPIs determine the value of β and of R_t. In the Euler method-based solution, the only input to the set of equations on each time interval is an R_t value (action) produced by the RL agent.

Reinforcement learning involves a control agent that interacts through trial and error with a target system that needs to be controlled and produces as output an optimal policy function. The policy function is optimal with respect to goals that are specified as a reward function, which is provided as an input to the control agent. In our case, the target system is the epidemic, and it is obviously neither safe nor realistic to do trial-and-error interactions with it. Instead, we use the epidemic model described above as an approximate version of the real system. The policy function produced by the control agent takes as input the state of the system and outputs the best action to take at every time step in the progression of the epidemic.

In an online RL setting such as this one, the control agent starts with an initial random policy function and interacts with the target system on every time step (daily in our scenario) by observing the current state and applying the policy function to generate an output action. In this scenario, an action is one of two possible values, corresponding either to non-lockdown (open) conditions or to lockdown (closed) conditions. The model runs for one time step with the new input, which changes the state of the model. The model's state is the set of compartment population sizes: how many susceptible, infected, hospitalized in ICU, and so on. The control agent uses a subset of the complete state, referred to as the observed state. With a new observed state, the control agent uses the reward function to calculate the reward value triggered by the latest action. On every interaction step, the control agent collects a data tuple comprising the current state, action, next state, and reward value. The tuple is fed to the agent's RL algorithm, which incrementally adjusts the policy function so that it maximizes the total sum of reward values in an episode. An episode is a run of the model and the control agent over a sufficient number of steps (days) to observe the evolution of the epidemic.

RL systems need the specification of three essential information elements: the observed state space, the action space, and the reward function. The observed state space is a collection of metrics that can be observed in the target system and that summarize its current situation from the perspective of the control agent. In our case, the observed state space comprises only one variable, the current number of infected persons, I(t). The reason is that, for the type of goals that need to be optimized (e.g., operate as close as possible to the maximum ICU bed use threshold), this variable can be measured and is sufficient, because other variables such as the number of hospitalized patients and the number of ICU beds in use depend on it.
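To illustrate how an R_t action advances the simulator by one day, the sketch below shows a single Euler update step for an extended SEIR model with the compartment structure described in section 2. It is only a sketch: the rate names, the dictionary-based parameterization, and the simplified flow structure are assumptions made for illustration, not the exact equations or Table 1 values of the model used in this work.

def seir_euler_step(state, r_t, params, dt=1.0):
    # state: population counts for (S, E, I, RM, M, SV, H, D, RC)
    S, E, I, RM, M, SV, H, D, RC = state
    N = S + E + I + RM + M + SV + H + D + RC
    beta = r_t * params["removal_rate"]            # beta = R_t x gamma (section 2)

    new_exposed      = beta * S * I / N * dt                       # S -> E
    new_infectious   = params["incubation_rate"] * E * dt          # E -> I
    new_removed      = params["removal_rate"] * I * dt             # I -> RM (self-isolation)
    rm_outflow       = params["rm_progression_rate"] * RM * dt     # RM -> M or SV
    new_mild         = params["p_mild"] * rm_outflow
    new_severe       = (1.0 - params["p_mild"]) * rm_outflow
    mild_recovered   = params["mild_recovery_rate"] * M * dt       # M -> RC
    new_hospitalized = params["hospitalization_rate"] * SV * dt    # SV -> H
    hosp_outflow     = params["hospital_exit_rate"] * H * dt       # H -> D or RC
    new_deaths       = params["p_death"] * hosp_outflow
    hosp_recovered   = (1.0 - params["p_death"]) * hosp_outflow

    S  -= new_exposed
    E  += new_exposed - new_infectious
    I  += new_infectious - new_removed
    RM += new_removed - rm_outflow
    M  += new_mild - mild_recovered
    SV += new_severe - new_hospitalized
    H  += new_hospitalized - hosp_outflow
    D  += new_deaths
    RC += mild_recovered + hosp_recovered

    icu_in_use = 0.3 * H                            # ICU(t) = 30% of H(t)
    return (S, E, I, RM, M, SV, H, D, RC), icu_in_use

On each simulated day the RL agent supplies r_t (either the open or the lockdown value), the step is executed, and the agent then observes the resulting I(t) as its state.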
The action space is the range of action values that can be produced by the policy function. Recall that the action values correspond to R_t values, which reflect the combined strength of the NPIs currently in place. To simplify our analysis, we chose to have only two action values in the action space: a high value, referred to as R_open, and a low value, referred to as R_closed. The R_open value corresponds to non-lockdown NPI combinations (e.g., mandatory use of face masks) and the R_closed value corresponds to a lockdown NPI combination (stay-at-home orders, school closures, business closures). Since this is a discrete action space, we use an RL algorithm that supports only discrete actions.

The RL algorithm used is Double Deep Q-Network (DDQN) [20]. DDQN is an enhanced version of Q-learning [1], a value-based method that iteratively computes a value function, Q(s, a), or Q-function, whose output is the total value of taking action a from the current state s until the final state of the target system that needs to be controlled. In Q-learning, the Q-function is defined as an iterative form of the Bellman equation [1], whose current state, action, next state, and reward inputs are updated on every interaction between the Q-learning agent and the target system. In its most basic form, the Q-function's mapping of inputs to outputs is represented as a table. However, a table does not generalize well to state-action combinations that were not visited during interaction. To improve generalization, Q-learning replaces the table with regression as a function approximator. Deep Q-learning refers to methods where the function approximator is a neural network. The DQN [19] and DDQN [20] algorithms have significantly improved Q-learning performance through the use of methods including experience replay, mini-batch sampling, and dual networks to contain Q-value overestimation.

The reward function defines the goals that are optimized through the RL algorithm. We chose to optimize the use of critical resources such as hospital beds, and in particular ICU beds, and to do so with the maximum possible economic activity without exceeding the ICU bed threshold. This is equivalent to maintaining the number of ICU beds in use, ICU(t), as close as possible to the ICU bed threshold, which corresponds to a tracking problem in optimal control. A basic reward function (or cost function when minimizing) for a tracking problem uses the ICU error value, defined as the difference between the number of ICU beds in use and the ICU bed threshold, to produce a shaped reward, meaning that the higher the error the lower the reward (higher penalty), as expressed in rule (1). The function also uses an error margin whereby a penalty is applied only if the error is lower than the margin. A reward based only on the ICU error, however, does not account for the cost of the control input itself, that is, the time spent in lockdown. To address this issue, we use a reward function that explicitly assigns a portion of its value based on the control input, R_t. This is done by dividing the optimization goal into two sub-goals, each with its own reward function, and producing a total reward function as a linear combination of the sub-goal reward functions. The first sub-goal is to operate at the maximum possible level of economic activity, which translates to using R_open as much as possible. This is expressed in equation (2), which assigns a reward value of 0 if R_open is used, or a penalty (negative value) of −c1 if R_closed is used. The second sub-goal is to avoid overshooting the established ICU bed threshold.
This is implemented in equation (3) by assigning a penalty (negative value) if the ICU error is higher than a predefined error margin c2. This penalty is proportional to the ICU error, which is multiplied by a constant set to c1/c2, so that the resulting penalty value is higher (more negative) than −c1 only for ICU error values higher than c2. The total reward is a linear combination of these two values, as expressed in equation (4). The constants c1 and c2 in the reward equations were selected as follows: c1 = 0.1 and c2 = 0.05 × ICU_threshold. The weight w is a constant that determines how much weight to assign to the ICU error component of the combined reward function. For each model, it was obtained through manual tuning using multiple trial runs. The calculation of w could be automated by treating it as another input action variable, in addition to R_t, with a method such as the one described in [21]. This was not pursued in the current work and is left as future work.

Both the RL agent and the extended SEIR epidemic simulator are implemented in Python, and they are integrated using the OpenAI Gym API [22]. This is a simple API that defines a collection of methods and information elements needed for interaction between an RL agent and an environment (target system). The main method is step(a), which is invoked by the RL agent and executed by the target system. Its execution involves performing action a in the simulator and transitioning the system from the current state to the next state. The method's return values are the next state, the reward value, a flag indicating whether or not the end state has been reached, and a field containing any other additional information required by the agent.

The experiments performed involved modeling a hypothetical region with a population of 20 million under the COVID-19 pandemic. The epidemic is simulated from its starting date, assumed to be March 1, 2020. The initial reproduction number, R0, was assumed to be 3.0, and the simulation runs with this fixed R until a lockdown is applied on March 25. It is assumed the lockdown causes a rapid change in R from 3.0 to 0.7. The lockdown lasts sixty days, during which R is fixed at 0.7. At the end of the lockdown, R becomes time-dependent, referred to as R_t, and is controlled by the RL agent. After the lockdown, R_t in each step can switch between two values: R_open and R_closed, where R_closed is always set to 0.7. It is assumed that changes from R_open to R_closed and vice versa are instantaneous. The minimum duration of an R_t change is one day.

The purpose of these experiments is to find a policy function for deciding, at different R_open levels, when to impose lockdowns and for how long. The goal of the policy function is to stay as close as possible to an operating threshold or reference value. The operating threshold employed is the maximum number of ICU beds in use as a percentage of the total ICU capacity. Other resource measurements could also be used as thresholds, such as a percentage of the total number of ICU medical teams available or a percentage of the total number of contact tracers. The ICU threshold used is 1,400, which corresponds to 70% of a total ICU capacity of 2,000 units. The goals of the policy produced by the RL agent are defined by the reward function described in the previous section. The first goal is to operate at R_open as long as possible, which is equivalent to minimizing the time spent in lockdowns (economic goal). The second goal is to avoid overpassing the ICU limit (public health goal).
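The sketch below shows one way to implement the two-part reward of equations (2)-(4), using the constants c1, c2, and the weight w as described in section 3.2. It is an illustrative reconstruction rather than the actual code used in this work; in the actual integration this computation would be performed inside the environment's step(a) method, and the default value of w below is only a placeholder since w was tuned manually for each model.

def reward(action_is_open, icu_in_use, icu_threshold, c1=0.1, c2_fraction=0.05, w=1.0):
    c2 = c2_fraction * icu_threshold                 # error margin, 0.05 x ICU threshold

    # Economic sub-goal (equation 2): no penalty when open, -c1 when locked down.
    r_economic = 0.0 if action_is_open else -c1

    # Public-health sub-goal (equation 3): penalize ICU usage above the threshold,
    # proportionally to the ICU error and scaled by c1/c2, so the penalty exceeds c1
    # in magnitude only when the error exceeds the margin c2.
    icu_error = icu_in_use - icu_threshold
    r_icu = -(c1 / c2) * icu_error if icu_error > c2 else 0.0

    # Total reward (equation 4): linear combination, with weight w on the ICU term.
    return r_economic + w * r_icu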
After training against the epidemic model, the RL algorithm produces a near-optimal policy according to these goals. The RL-based results are compared to experiments using the baseline on/off feedback control (fixed-rule) method described in the introduction. Lockdown cycles were analyzed for R_open values of 1.7, 1.5, 1.3, and 1.1, for both RL-based and baseline on/off control. Figures 3 to 6 illustrate the results of these experiments. Each figure has two rows of graphs. The first row (a, b, c) corresponds to a simulation of the epidemic with R_t as the input control signal generated by the RL agent. The second row (d, e, f) corresponds to a simulation run with R_t as the input control signal generated by the baseline on/off feedback control method. The first graph in each row shows the number of currently infected persons, I(t), which is used as the RL agent's state variable; the second graph shows the number of occupied ICU beds, ICU(t); and the third graph shows the daily value of the control input R_t. A low R_t, R_closed, indicates lockdown conditions and a high R_t, R_open, indicates open conditions. All of the simulations were run over a time span of 270 days, except for R_open = 1.1, which was run for 365 days.

As seen in figures 3.b-6.b, the RL agent succeeds in closely tracking the ICU bed threshold reference value with minimal or no overshoot in every case. In contrast, overshoot values with the baseline on/off feedback control method are very high for R_open values above 1.1. With R_open = 1.7, the maximum overshoot value is 3,520 (Fig. 3e), or 176% of the total ICU capacity of 2,000. With R_open = 1.5, the maximum overshoot value is 2,848 (Fig. 4e), or 142% of the total ICU capacity. With R_open = 1.3, the maximum overshoot value is 2,100 (Fig. 5e), or 105% of the total ICU capacity.

High overshoots with baseline on/off control occur because the epidemic is a dynamic system with inertia in its response to inputs at the system's stages (compartments in the SEIR model). This inertia is caused by response delays to inputs in each of the stages. In the case of ICU beds in use, ICU(t), which is equal to 30% of H(t), any change in the input R_t must propagate through the Exposed, Infected, and Severe compartments (see Figure 2), which results in a combined lag equal to the sum of the corresponding compartment delays, 11 days based on the model parameters listed in Table 1. This means that if a lockdown input is applied, it will only start causing a decrease in ICU(t) at least 11 days after the lockdown starts. During this lag interval, ICU(t) will continue to grow at an exponential rate dependent on R_t. This explains the large increases in overshoots with the baseline on/off control method as R_open increases. Large spikes, as observed in Figures 3e, 4e, and 5e, are undesirable because they imply a significantly higher number of deaths, given that an increased number of patients in ICU beds results in a larger number of deaths. If spike values exceed the ICU bed capacity, the situation becomes worse because the fraction of hospitalized patients that die can increase due to a lack of medical equipment. The simulation model accounts for this by increasing the probability-of-death-in-hospital parameter (p_death), only during overflow days, by a factor equal to the ratio of ICU beds in use to ICU capacity.
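A minimal sketch of this overflow adjustment is shown below. The function name is illustrative, and the cap at 1.0 is an added safeguard not specified in this work.

def effective_p_death(p_death_base, icu_in_use, icu_capacity):
    # On overflow days (ICU usage above capacity), scale the in-hospital death
    # probability by the ratio of ICU beds in use to ICU capacity.
    if icu_in_use > icu_capacity:
        return min(1.0, p_death_base * icu_in_use / icu_capacity)
    return p_death_base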
For the simulation run with R_open = 1.7 (Figure 3), aggregate deaths over a period of 270 days when using baseline on/off control were 19,553, 53% higher than the 12,810 deaths when using RL-based control. With R_open = 1.5, aggregate deaths with baseline on/off control were 15,923, 28% higher than the 12,393 deaths with RL-based control.

The RL agent learns a control policy that minimizes overshoots by applying lockdowns enough days in advance to reach very close to the ICU limit, or only slightly overpass it, but without so many excess days that an unnecessary undershoot occurs. This results in an ICU(t) signal with small oscillations around the ICU limit reference value. The on/off control signal required to produce small ICU(t) oscillations needs to have much shorter cycles than the cycles in the control signal generated with baseline on/off control or some other blunt control method. The RL agent does learn a policy that produces shorter cycles, and its open/closed segment lengths are optimal according to the epidemic model state.

There are two key patterns to highlight in the RL-based cyclic lockdowns. The first is that as the level of virus spread increases, represented by an increase in R_open, the control policy generates shorter open segment lengths to reduce transmission. The second is that, except for R_open = 1.1, the total cycle lengths for each R_open level are almost the same, between 9 and 10 days; what changes is the open/closed ratio. The total cycle length depends on the parameters of the epidemic model (compartment delays) and the R_open level. We observed two bands of R_open levels: R_open ≤ 1.1 and R_open > 1.1. These two bands result from the exponential growth characteristics of the epidemic and its model. For R_open ≤ 1.1, growth is very slow and considered to be before the elbow of the exponential curve, while for R_open > 1.1, growth is significantly faster. For R_open ≤ 1.1, the total cycle lengths are slightly higher, such as 13 days in the case of R_open = 1.1. RL-based cycle lengths ranging from 9 to 13 days are significantly shorter than the cycle lengths obtained with baseline on/off control (see Figures 3f-6f): the cycle segment lengths with baseline on/off control are approximately eight times longer than with RL-based control.

As described above, RL-based control produces short lockdown cycles because these are the optimal lengths needed to align the controlled variable ICU(t) with the tracked reference value (ICU limit). However, in addition to the technical control reasons for using short lockdown cycles, there are economic and social reasons to favor short lockdown cycles over long and stringent lockdowns. Long lockdowns cause severe negative impact on the economy, especially in regions where governments do not have the economic strength to provide adequate help to citizens and businesses affected by the lockdown. Short cyclic lockdowns, as discussed in [6], reduce the negative economic impact because most businesses and informal economic activity could continue functioning by adapting to the short lockdown intervals.

The suggested workflow for applying the RL-based lockdown optimization method is to first measure the current R_t, referred to as R_actual. This is the value used for R_open. The value used for R_closed is the lowest average R_t measured during the initial extended lockdown. Then the ICU limit value is selected, and with these three values as inputs, a new policy model is trained and saved.
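A compressed sketch of this train-and-save step is shown below, assuming a Gym-style epidemic environment and a DDQN agent. EpidemicEnv, DDQNAgent, and their methods are hypothetical stand-ins for the Python implementation described in section 3, not its actual API.

def train_policy(r_open, r_closed, icu_limit, episodes=500, episode_days=270):
    env = EpidemicEnv(r_open=r_open, r_closed=r_closed,
                      icu_limit=icu_limit, max_days=episode_days)
    agent = DDQNAgent(state_dim=1, n_actions=2)        # state: I(t); actions: open / lockdown

    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            action = agent.act(state)                  # epsilon-greedy action selection
            next_state, reward, done, _ = env.step(action)
            agent.remember(state, action, reward, next_state, done)
            agent.learn()                              # DDQN update from a replay mini-batch
            state = next_state

    agent.save("cyclic_lockdown_policy")               # persist the trained policy model
    return agent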
The RL agent then loads the trained policy model and runs it with the simulator for an episode length, in days, that results in multiple lockdown cycles. Data is recorded as in figures 3c-6c. The policy signal produced by the RL agent is used to obtain the average lengths of the open and closed segments of a lockdown cycle. These segment lengths are used by a region's government to set up a lockdown plan over multiple cycles. Once the lockdown plan is in operation, the average R_actual over multiple cycles is measured. New lockdown cycles are applied with the updated R_open, and the update process is repeated after a certain number of cycles. If, in addition to lockdowns, there are strong parallel efforts to improve other control measures such as testing, contact tracing, isolation, and use of face masks, R_t may decrease. This can be determined by comparing the R_actual measured in the field with the calculated average, R_calc. If R_actual is lower, then a new R_open can be calculated with equation (5), a new policy model trained, and new open/closed segment lengths obtained. When R_open decreases, the ratio of open to closed segment times increases. We refer to this pattern as a downward staircase, because repeating it enables a stepwise reduction in the percentage of time closed. When R_actual reaches 0.8 or lower, the R_open obtained with equation (5) is 1.0 or lower, which indicates the epidemic is no longer expanding (or is contracting, if the value is lower than 1.0) and therefore there is no longer a need to perform lockdowns. In summary, the downward staircase approach can be used to evolve from short lockdown segments at high R_open levels, to shorter lockdowns at lower R_open levels, and eventually to R_open levels where no lockdowns are required.

We developed an approach to calculate optimal cyclic lockdown timings using an RL-based on-off controller. The problem was structured as an optimal control system for tracking a reference value, which in this case is an ICU bed limit. Tracking the ICU limit as closely as possible achieves two optimization goals specified in the RL reward function: the first is a public health goal that minimizes overshoots of ICU bed usage above the ICU bed limit, and the second is a socio-economic goal that minimizes the time spent under lockdowns. The RL-based optimal on-off controller succeeds in producing control policies that track the ICU bed limit reference with high accuracy, as illustrated in figures 3b-6b. Also, the RL-based controller generates short lockdown cycles, between 9 and 14 days, with lockdown intervals of at most 6 days. These results contrast with the high ICU limit overshoot values and much longer cycles and lockdown intervals obtained when using basic heuristic feedback methods.

The cyclic lockdown approach described here is intended only as a temporary alternative to extended lockdowns, when a region is facing imminent danger of overpassing resource capacity limits, such as ICU beds, and when imposing an extended lockdown would cause severe social and economic consequences due to the region's lack of the economic resources necessary to support its affected population. Under these circumstances, temporary cyclic lockdowns could be used as a bridge method to buy time for improving other processes needed to control the epidemic.
When used in conjunction with other methods to reduce the virus spread, cyclic lockdowns can be applied in a downward staircase approach that helps to reduce R_open to safe levels close to or lower than 1.0.

Finally, we highlight two areas of improvement and future work. The first is automating the selection of the weight coefficients when the reward function is a linear combination of multiple sub-reward functions, as is the case in the combined reward function described in section 3.2. This entails exploring methods that treat each of the weights as an additional action variable. The second is finding optimal controllers for the cyclic lockdown problem using conventional optimal control methods, such as model predictive control (MPC) and the linear quadratic regulator (LQR), and comparing them with the RL-based solution.

References
[1] Reinforcement Learning: An Introduction. 2nd edition.
[2] Impact of non-pharmaceutical interventions (NPIs) to reduce COVID-19 mortality and healthcare demand.
[3] Effects of non-pharmaceutical interventions on COVID-19 cases, deaths, and demand for hospital services in the UK: a modelling study.
[4] Estimating the effects of non-pharmaceutical interventions on COVID-19 in Europe.
[5] The effect of large-scale anti-contagion policies on the COVID-19 pandemic.
[6] Adaptive cyclic exit strategies from lockdown to suppress COVID-19 and allow economic activity.
[7] The reproduction number of COVID-19 and its correlation with public health interventions.
[8] Identifying Cost-Effective Dynamic Policies to Control Epidemics.
[9] Context matters: using reinforcement learning to develop human-readable, state-dependent outbreak response policies.
[10] How control theory can help us control COVID-19.
[11] Reinforcement Learning in Direct Adaptive Optimal Control.
[12] Reinforcement Learning in Feedback Control.
[13] Reinforcement Learning Versus Model Predictive Control: A Comparison on a Power System Problem.
[14] From Reinforcement Learning to Optimal Control: A unified framework for sequential decisions.
[15] Compartmental Models in Epidemiology.
[16] Forecasting hospitalization and ICU rates of the COVID-19 outbreak: an efficient SEIR model.
[17] A time-dependent SEIR model to analyse the evolution of the SARS-CoV-2 epidemic outbreak in Portugal.
[18] How Quickly does an Influenza Epidemic Grow.
[19] Human-level control through deep reinforcement learning.
[20] Deep Reinforcement Learning with Double Q-learning.
[21] Generalizing Across Multi-Objective Reward Functions in Deep Reinforcement Learning.
[22] Getting Started with Gym.