key: cord-0511365-u9eat09c authors: Roy, Kaushik; Zhang, Qi; Gaur, Manas; Sheth, Amit title: Knowledge Infused Policy Gradients for Adaptive Pandemic Control date: 2021-02-11 journal: nan DOI: nan sha: c214873ded1cef7d1dd761ae1634f0e5ac1b5a91 doc_id: 511365 cord_uid: u9eat09c COVID-19 has impacted nations differently based on their policy implementations. Effective policy requires taking public information into account and adapting to new knowledge. Epidemiological models built to understand COVID-19 seldom provide the policymaker with the capability for adaptive pandemic control (APC). The core challenges to be overcome include (a) the inability to handle a high degree of non-homogeneity in the different contributing features across the pandemic timeline, (b) the lack of an approach that enables adaptive incorporation of public health expert knowledge, and (c) the lack of transparent models that enable understanding of the decision-making process in suggesting policy. In this work, we take the early steps to address these challenges using Knowledge Infused Policy Gradient (KIPG) methods. Prior work on knowledge infusion does not handle the soft and hard imposition of varying forms of knowledge, such as disease information and guidelines that must necessarily be complied with. Furthermore, the models do not attend to non-homogeneity in feature counts, which manifests as partial observability in informing the policy. Additionally, interpretable structures are extracted post-learning instead of learning an interpretable model, as required for APC. To this end, we introduce a mathematical framework for KIPG methods that can (a) induce relevant feature counts over multi-relational features of the world, (b) handle latent non-homogeneous counts as hidden variables that are linear combinations of kernelized aggregates over the features, and (c) infuse knowledge as functional constraints in a principled manner. The study establishes a theory for imposing hard and soft constraints and simulates it through experiments. In comparison with knowledge-intensive baselines, we show quick, sample-efficient adaptation to new knowledge and interpretability in the learned policy, especially in a pandemic context. Reinforcement learning (RL) is one of the main techniques to solve sequential decision making problems. When combined with deep neural networks, RL has achieved impressive performance in many applications, including robotics [1], game playing [2], recommender systems [3], etc. As RL fundamentally solves decision making problems via trial and error, a major drawback of RL is the huge amount of interaction required to learn good decision policies, which can lead to prohibitive cost and slow convergence. The inefficiency of RL has motivated studies on incorporating expert domain knowledge when solving decision making problems. In this direction, prior work has largely focused on the setting of imitation learning (IL), in which knowledge is represented through expert demonstrations, i.e., the expert demonstrates the desired behavior (rather than specifying the reward signal) in various scenarios stemming from a decision making problem. Such demonstrations can be used to either directly learn a classifier that mimics the expert's behavior, known as behavior cloning [4], or to infer the reward function that rationalizes the expert's behavior and is then optimized through RL, known as inverse RL [5]. The performance of IL relies on the quality of expert demonstrations.
Expert demonstrations are often not exhaustive enough to provide supervision for all of the scenarios that might be encountered. Moreover, sometimes only a suboptimal expert is available due to humans' bounded rationality. Therefore, policies learned with IL are inferior to policies learned with RL that uses the problem's original reward. These limitations of IL have further motivated recent work that attempts to combine RL and IL. These approaches leverage both the original reward and expert demonstrations to learn better-than-expert policies faster than RL-only approaches. For example, AlphaGo is pretrained with human expert moves and then refined via RL [2]. This paper focuses on combining RL and expert knowledge to solve sequential decision making problems, which we term knowledge infused RL. Different from the aforementioned prior works, we aim to deal with the following important, yet understudied challenges in knowledge infused RL (KIRL). 1) Partial observability and non-stationarity. Many real-world domains modeled as sequential decision making problems are partially observable. Moreover, there are often exogenous factors that are un-modeled in the input features, making the underlying decision making problem non-stationary. Most existing works in KIRL (e.g., those combining RL and IL) ignore this and apply algorithms developed under the assumption of full observability and stationarity, while we aim to explicitly consider partial observability and non-stationarity in our KIRL method. 2) Structured knowledge representation. The IL framework assumes that knowledge is represented by expert demonstrations of low-level behavior, yet low-level demonstrations are difficult to obtain in many domains. Instead, human knowledge is often represented in a structured, high-level manner. As an example, consider the guidelines of a public health agency for a pandemic. We argue that leveraging such structured knowledge requires a structured input representation for the RL algorithm. 3) Interpretability. We aim to develop approaches such that both the process of leveraging expert knowledge and the resulting learned policy are interpretable. Interpretability for KIRL is required in high-stakes domains such as public health, yet it is largely ignored in prior works. This paper focuses on the pandemic control setting, which manifests all three of the aforementioned challenges. Consider a scenario where a notional city is in a pandemic, with the following characteristics: people live in homes, work at offices, and shop at places, all connected in a geographical map. The city's government aims to optimize its pandemic control policy with KIRL to strike a balance between public health and economic resiliency. The true number of infected people is only partially observable to the government, as exhaustive testing is often not possible. In general, the course of the pandemic is a non-stationary process due to exogenous factors, such as people flowing into a certain area for a short-term local event, an abrupt decrease in the mortality rate of the disease due to new medications, etc. The partial observability and non-stationarity result in a non-homogeneous counting process of the observed number of infected individuals. The knowledge of the experts consulted is often formatted into high-level guidelines; for example, shopping area A should be locked down before shopping area B, since locking down B will result in more severe economic consequences.
In order to incorporate such guidelines, the input to the KIRL algorithm needs to be represented in an interpretable manner. Further, the resulting pandemic control policy learned by KIRL also needs to be interpretable. For APC during COVID-19, we have identified specific challenges that include the agent handling relational features, non-homogeneity in the feature counts, and learning interpretable, non-black-box structures through knowledge infusion [6]. We believe that the inability of previous approaches to handle these issues can pose bottlenecks in agent models for assisting policymakers. Our study aims to investigate our formulation in handling these specific challenges through knowledge infusion in functional space, since we specify knowledge as functional constraints. There is a rich body of work on RL concerning relational feature-based functional spaces [7, 8]. However, these works do not use knowledge infusion, handle count features, or deal with partial observability in the state. Poisson dependency networks have been proposed to handle multivariate count data but do not consider non-homogeneity in the counts [9]. Odom et al. present a way to incorporate knowledge constraints in relational function spaces that can be used in conjunction with the work of Kersting et al. and Hadiji et al. to achieve knowledge infusion in Policy Gradients [10, 7, 9] for pandemic control. Our work most closely resembles this, and we therefore employ it as our evaluation baseline. We make key and necessary modifications to the agent's approximation architecture: moving from trees to a linear basis, using kernel aggregates to handle non-homogeneity and partial observability, and, most importantly, developing a mathematical framework for the hard and soft imposition of knowledge as functional constraints that is applicable in a wide range of APC scenarios. Also, we prove that the baseline is an instance of our framework. Other approaches for knowledge infusion in functional spaces include the use of cost-sensitivity constraints for imbalanced data and monotonicity constraints, which are not directly applicable to our setting [11, 12]. The use of monotonicity constraints in preference-based knowledge infused RL can be an interesting extension of our work. Our contribution is twofold. First, we create an agent-based pandemic simulator that models the interactions between individuals who move across specific locations within a community, such as homes, offices, shops, hospitals, etc. The spread of the disease is simulated using the typical SIR model. Interventions such as locking down a specific location and increasing testing are the control measures modeled in the simulator. The pandemic simulator manifests all three of the motivating challenges. Second, we develop a novel KIRL algorithm, Knowledge Infused Policy Gradient, that addresses these challenges. To incorporate the structured knowledge format and support interpretability, the policy is derived from learned relational features using an interpretable 2-layer neural network. An example of such a relational feature is: there exists a residential neighborhood, a person living in a home, and a shop on the same route with many people shopping at this place, where a potential intervention is locking down such shops. The partial observability is addressed by aggregating the learned relational features over time.
Further, expert knowledge is infused into the policy gradient-based optimization in an online manner so that the knowledge can be adjusted whenever necessary to adapt to the non-stationary course of the pandemic. Figure 1 shows the pipeline of our KIRL algorithm, which will be described in detail in Sections 2 and 3.

Figure 1: The knowledge infusion and analysis pipeline, which begins with the agent interacting with the simulator to learn the policy, constructing a state representation, and utilizing knowledge through KIPG to learn a knowledge infused agent policy for APC scenarios. This policy is then evaluated for its efficacy (% of test cases passed) and interpretability (readable and meaningful).

In the standard RL framework, a Markov Decision Process (MDP) is defined by a set of states s ∈ S, actions a ∈ A, state transition probabilities on taking actions P(s' | s, a) ∈ [0, 1], and a reward model R(s, a) : S × A → R. A common way to specify the policy π(s, a), to execute action a in state s, is using a Boltzmann distribution: π(s, a) = e^{ψ(s,a)} / Σ_b e^{ψ(s,b)}. Policy gradient methods learn parameters θ that optimize the value ρ(θ) of the policy via the gradient:

∇_θ ρ(θ) = E_{s∼d, a∼π} [ ∇_θ log π_θ(s, a) Q^π(s, a) ]   (1)

Here, d is the distribution from which the states are drawn and Q^π is the Q-value function corresponding to policy π. The Q^π(s, a) function is estimated using state-action pair trajectories from a simulator, by Monte Carlo methods or function approximation. Kersting et al. show that parameterization of the policy is difficult to achieve owing to feature selection in continuous and relational environments, in which there are infinitely many possibilities [7]. Thus we employ gradient ascent in functional space to learn the function ψ(s, a) directly, relying on learned relational features (see Section 3.1). Learning these features overcomes the problem of pre-defining count features, thereby providing finer-grained control. We start with an initial function ψ_0(s, a) and add m = 1 … M functions Δ_m(s, a) to fit the functional gradients ∂ρ/∂ψ(s, a). In the parametric setting we use the gradient specified in Equation 1. Making the change from θ to the function ψ(s, a), we need the form of the gradient ∂ log π(s, a) / ∂ψ(s, b), as this is the only component of the gradient dependent on ψ(s, a), for each action b ∈ A. We instead compute the gradient as:

∂ log π(s, a) / ∂ψ(s, b) = I(a = b) − π(s, b)

In this paper, the agent uses a Neuro-Symbolic approximator for the policy by learning a set of linear models. Each linear model Δ_m(s, a) can be viewed as a basis function for the approximation of ψ(s, a). To prevent correlation between them, the sampled state-action pairs are sub-sampled each time. The resulting policy is a linear combination of these basis functions. We move away from tree-based models for the following reasons: 1) The basis functions can handle continuous, discrete, and relational inputs without extensive pre-processing or binning. 2) The final linear combination ψ(s, a) can be laid out as a 2-layer Neural Network (NN), for which the weights in the 2nd layer are unity. The network weights can be refined through backpropagation. Bubeck et al. show that a 2-layer NN with ReLU activations is able to approximate arbitrary functions with high precision using O(n/d) neurons in the hidden layer, where d is the dimension of the input and n is the number of data samples [13]. Since the dimensionality in our setting is high and n, i.e., the size of the data, is typically low, we can use just a few linear basis functions (neurons) in the hidden layer to achieve good approximation. 3) Each basis function Δ_m(s, a) has an interpretable structure.
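To make the functional-gradient update concrete, the following is a minimal sketch (not the authors' implementation) of boosting linear basis functions to fit the policy gradients. The Boltzmann policy and the gradient I(a_i = a) − π(a | s_i) follow the text above; the feature matrix Phi, the least-squares fit in fit_basis, the sub-sampling rate, and all names are illustrative assumptions.

```python
# A minimal sketch of functional-space policy gradient boosting with linear
# basis functions, assuming relational count features Phi(s) are already
# extracted as numpy arrays. Illustrative only; not the paper's code.
import numpy as np

def softmax_policy(psi_values):
    """Boltzmann policy pi(s, a) = exp(psi(s, a)) / sum_b exp(psi(s, b))."""
    z = psi_values - psi_values.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fit_basis(Phi, residuals):
    """Fit one linear basis Delta_m(s, a) = Phi @ w to the functional gradients."""
    w, *_ = np.linalg.lstsq(Phi, residuals, rcond=None)
    return w

def kipg_boost(Phi, actions, q_values, n_actions, n_stages=10, subsample=0.7, seed=0):
    """
    Phi:      (N, d) relational count features for each visited state.
    actions:  (N,) index of the action taken in each state.
    q_values: (N,) Monte Carlo return / Q estimate for the taken action.
    Returns a list of per-action weight vectors; psi(s, a) = sum_m Phi @ W_m[:, a].
    """
    rng = np.random.default_rng(seed)
    N, d = Phi.shape
    psi = np.zeros((N, n_actions))          # psi_0 = 0
    ensemble = []
    for _ in range(n_stages):
        pi = softmax_policy(psi)
        # Functional gradient of log pi: I(a_i = a) - pi(a | s_i), scaled by Q.
        grad = -pi.copy()
        grad[np.arange(N), actions] += 1.0
        grad *= q_values[:, None]
        # Sub-sample (s, a) pairs to decorrelate successive basis functions.
        idx = rng.choice(N, size=int(subsample * N), replace=False)
        W = np.column_stack([fit_basis(Phi[idx], grad[idx, a]) for a in range(n_actions)])
        ensemble.append(W)
        psi += Phi @ W                       # add the new basis function
    return ensemble
```

Each W_m fitted here corresponds to one hidden neuron per action in the 2-layer network view described above, with unit second-layer weights.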
Interpretability theory for NN structures has recently been well studied, and we require that the agent policy be robust and interpretable for applications such as APC. Dombrowski et al. show that the interpretation of an NN's decision making for ψ(s, a), given by the network weights (heat map), can be manipulated through perturbations of the input features Φ(s, a) while still yielding almost the same output ψ(s, a) [14]. This is related to the curvature of the output manifold of the NN. They propose to alleviate this issue by replacing ReLU activations with softplus non-linearities with a small parameter β: (1/β) log(1 + e^{β w⊤Φ(s,a)}). With this modification, the weights are more robust to perturbations or modifications in the input (s, a). We learn the relational features (clauses) Φ(s, a) over the state using standard and well-understood Inductive Logic Programming (ILP) methods [15]. The inductive bias is provided in the form of Aleph modes, which can be automatically learned from a schema of the world [16, 17]. This bias is included to constrain the search so that it does not include features that do not make sense. For example, to decide on locking down a shop, we would not like an irrelevant feature that does not actually contain the shop or anything related to it. Furthermore, we count the number of examples that satisfy the features that make up Φ(s, a). For example, the feature same(State,Res,Shop) ∧ pin(State,Person,Home) ∧ hin(State,Home,Res) outputs the number of persons, homes, shops, and residential areas that satisfy this feature description, where the feature denotes: there exists a residential neighborhood and a shop on the same route, and a person living in a home that is part of the residential neighborhood. We make use of a minimal threshold of mutual information to satisfy, along with a maximum clause length, for picking the features Φ(s, a). It is likely in APC that π(s, a) is a process that depends on counts over latent state features h(s, a), such as the number of persons who are ill, i.e., Φ(s, a) = f(n_h(s, a)). This creates partial observability, and thus the entire observed data trajectory history H = {(s_1, a_1), (s_2, a_2), …, (s_t, a_t)} influences the counts Φ(s, a) and in turn the policy π(s, a). Hadiji et al. present a way to handle multivariate count models, although not latent counts [9]. However, their model does not take into account non-homogeneity in the counts, which is also characteristic of APC scenarios. Hence, due to the expected non-homogeneity, the count can be modeled as Φ(s_t, a_t) = μ + Σ_{t' < t} K(t, t') Φ(s_{t'}, a_{t'}), where μ models a base count (bias). We aggregate the count features that make up Φ over the history H using this type of kernel K to handle partial observability and non-homogeneity in the counting process. The kernel approach, inspired by a Hawkes process model, acts as a method to model a homogeneous count process in the local neighborhood around t, where the non-homogeneity arises due to a union of several such locally homogeneous processes. Additionally, it has been studied before that the limiting distribution of a binary outcome process over an infinite horizon is a counting process, for example binomial to Poisson [18]. We can similarly use the same principle in reverse to model the policy as a binary outcome process over each discrete (s_t, a_t), at time t.
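The kernel aggregation of count features over the trajectory history can be sketched as below. This is a simplified, assumed reading: the base count mu and all function names are illustrative, while the kernel K(t, t') = exp(−(t − t')²) follows the definition given in the next section.

```python
# A minimal sketch (assumptions, not the paper's implementation) of the
# Hawkes-inspired kernel aggregation used to handle partial observability and
# non-homogeneous counts: observed count features up to time t-1 are combined
# with the kernel K(t, t') = exp(-(t - t')^2) plus a base count mu.
import numpy as np

def kernel(t, t_prime):
    return np.exp(-(t - t_prime) ** 2)

def aggregate_counts(history, t, mu=0.0):
    """
    history: (T, d) array; row t' holds the relational count features Phi at step t'.
    t:       current time step; only rows with t' < t contribute.
    mu:      base count (bias) of the locally homogeneous process.
    Returns the aggregated, kernel-weighted feature vector fed into psi.
    """
    past = history[:t]
    if len(past) == 0:
        return np.full(history.shape[1], mu)
    weights = kernel(t, np.arange(len(past)))   # K(t, t') for each past step
    return mu + weights @ past                  # mu + sum_t' K(t, t') * Phi_t'

# Usage: aggregated features at step 5 of a 10-step trajectory of 4 count features
history = np.random.poisson(3.0, size=(10, 4)).astype(float)
phi_t = aggregate_counts(history, t=5, mu=0.5)
```

Because the kernel decays quickly with |t − t'|, each aggregate behaves like a locally homogeneous count around t, matching the intuition described above.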
We thus model the policy π(s_t, a_t) at time t as a binary outcome problem, using relational count features from the trajectory history H up to time t − 1, aggregated using the kernel K(t, t') = e^{−(t − t')²}. Using a relational description, we can mathematically formalize knowledge as specifying constraints over the parameters θ. More precisely, knowledge is specified as a set of functional constraints G_k, k ∈ [1, K], for each action a ∈ A as: ∧_j c_j(s) ⇒ (G_k(θ) = f_k(θ, a)), which says that functional constraints are applied in probability space to the parameters if the conjunction of conditions ∧_j c_j is satisfied in state s. If G_k is high, then action a is preferred in s when ∧_j c_j applies. In functional space, ψ(s, a) is constrained in place of θ. We denote by D the data containing (s, a) pairs from trajectories. ψ(s, a) is updated according to Bayes rule, which defines the posterior as:

P(ψ(s, a) | D) ∝ P(D | ψ(s, a)) P(ψ(s, a))   (2)

Taking the log on both sides, we get log P(ψ(s, a) | D) ∝ log P(D | ψ(s, a)) + log P(ψ(s, a)). For our problem the distribution P(D | ψ(s, a)) is the policy π that we are trying to learn; therefore P(D | ψ(s, a)) = π(s, a), and the prior P(ψ(s, a)) is a Laplace distribution centered on the constraints G_k with weights α_k, where α_k can be thought of as a weight on how important we consider G_k. It is worth noting that if we set all α_k = α in Equation 2, we recover the formulation in Odom et al.'s work [10]. We formally state this in Theorem 1, where the proof is omitted due to the space limit. More generally, P(ψ(s, a), G_k) can assume functional forms other than Laplace distributions as well, depending on domain requirements. n_a is the number of functional constraints that agree with the action taken and n_¬a is the number that do not. The Bayesian formulation imposes constraints in a soft way such that the agent gradually incorporates the knowledge. A second way to incorporate knowledge as functional constraints, one that enforces hard imposition, is to use the conditional functional gradient ascent method [19]. In this method, ψ(s, a), after m stages of boosting, is approximated as ψ_0(s, a) + Σ_{i=1}^{m−1} η_i Δ_i(s, a), where η is a hyper-parameter that is empirically set and decayed as learning progresses. The form of the constraints G_k is the same as in the Bayesian formulation. Thus these constraints can be infused into the optimization through conditional functional gradients, and equality and inequality constraints can be handled as well. Any off-the-shelf LP solver can be used for the resulting LP formulation during optimization. In APC, assistive agents are required to comply with general guidelines (hard constraints) while also benefiting from adapting to knowledge in a gradual manner (soft constraints). We can combine the Bayesian formulation with conditional functional gradients to achieve this type of agent. Thus, knowledge is first specified for soft infusion using ∧_j c_j(s) ⇒ (G_k(ψ(s, a)) = f_k(α_k, ψ(s, a))), following which the functional gradient of log π(s, a) with respect to ψ(s, a) is computed as (I(s, a) − π(a | s; ψ(s, a))) + Σ_k α_k (n_a − n_¬a). Next, hard constraints that the agent has to comply with can be specified for hard infusion using ∧_j c^h_j(s) ⇒ (G_h(ψ(s, a)) = f_h(α_h, ψ(s, a))), which can be optimized using an LP solver or by solving the Lagrangian.

Table 1: Some relational features that are learned, with their English descriptions. It can be seen that the features allow finer-grained control at the level of individual shops, homes, residences, workplaces, and routes.
1. same(State,Res,Shop) ∧ pin(State,Person,Home) ∧ hin(State,Home,Res): There exists a residential neighborhood and a shop on the same route, and a person living in a home that is part of the residential neighborhood.
2. There exists a shop, a workplace, and a residential neighborhood on the same route, and a person living in a home belonging to that residential neighborhood.
3. sopen(State,Shop): There exists a Shop that is open.
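As a concrete illustration of how the soft and hard impositions could be realized, the sketch below computes the knowledge-weighted gradient (I − π) + Σ_k α_k (n_a − n_¬a) and uses a simple masking step that stands in for the LP/Lagrangian optimization described above. Reducing each constraint to a per-action preference row, and all names and values, are assumptions for illustration only.

```python
# A minimal sketch of soft (gradient-level) and hard (masking) knowledge
# infusion, under the assumption that each functional constraint has been
# grounded into a per-action preference in the current state.
import numpy as np

def soft_knowledge_gradient(pi_row, action_taken, prefers, alphas):
    """
    pi_row:       (A,) current policy probabilities in state s.
    action_taken: index of the action observed in the trajectory.
    prefers:      (K, A) binary matrix; prefers[k, a] = 1 if constraint k prefers a.
    alphas:       (K,) constraint strengths (importance of each piece of advice).
    Returns the knowledge-infused gradient (I - pi) + sum_k alpha_k (n_a - n_not_a).
    """
    base = -pi_row.copy()
    base[action_taken] += 1.0                              # I(a_i = a) - pi(a | s_i)
    n_agree = (alphas[:, None] * prefers).sum(axis=0)      # weighted agreeing advice
    n_disagree = (alphas[:, None] * (1 - prefers)).sum(axis=0)
    return base + (n_agree - n_disagree)

def apply_hard_constraint(psi_row, forbidden_actions, floor=-1e6):
    """Hard imposition (a stand-in for the LP / Lagrangian step): actions that a
    guideline forbids get a very low utility, so the Boltzmann policy assigns
    them negligible probability."""
    psi_row = psi_row.copy()
    psi_row[list(forbidden_actions)] = floor
    return psi_row

# Usage: two pieces of advice over 3 actions; advice 0 weighted twice as heavily
pi_row = np.array([0.5, 0.3, 0.2])
prefers = np.array([[1, 0, 0], [0, 1, 0]])
grad = soft_knowledge_gradient(pi_row, action_taken=0, prefers=prefers,
                               alphas=np.array([2.0, 1.0]))
```

Setting all alphas equal recovers the uniform advice weighting of the baseline formulation, mirroring the relationship stated in Theorem 1.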
In this way, knowledge infusion is carried out so that the agent can effectively assist policymakers with both incoming knowledge about the developing situation and compliance with general guidelines. We design a simulator that interacts with the agent. It simulates the pandemic spread in a small city. The simulator includes information about persons, households, and facilities in the city, namely residential areas, hospitals, shops, and workplaces. Actions are to lock/unlock parts of the city as well as to increase testing by 10%. The reward model is to reduce human fatalities. We now compare the knowledge infusion methods, with and without feature aggregation. Possible actions are to lock down or unlock routes, homes, residential neighborhoods, shops, or workplaces and to increase testing at these locations. A "NilPolicy" is also included as part of the action space. All features are existentially quantified. The agent learns by interacting with the simulator to optimize the reward of minimizing infections. It is then subsequently required to adapt to new knowledge about the city map. This knowledge is provided in the form of prohibiting the closing down of certain parts of the city, specified as functional constraints with α_k denoting the constraint strength/importance adjudged by the constraint specifier. We evaluate how Knowledge Infusion using all methods fares against Policy Gradient without Knowledge Infusion (KI) and against the baseline, which uses relational count features and combines Odom et al.'s [10] knowledge infusion with Kersting et al.'s [7] Policy Gradient approach. The baseline also uses a linear basis instead of trees. We define a test case as passed when the policy's choice is equal to the real MDP's choice. Recall from Figure 1 that the real MDP is known during evaluation. The percentage of test cases passed is reported against the number of simulator trajectories. Note that in the combined setting, the baseline approach, which uses Odom's KI, cancels out the effect of knowledge that is weighted contrastingly, i.e., with α and −α. Also, as seen in Section 3.1, aggregation is used to model partial observability in the state. We note that aggregation shows improved performance: without aggregation, the percentage of test cases passed is on average 10% lower than with aggregation for 20, 50, and 100 trajectories, across all comparison settings. It can be seen that sample efficiency improves vastly with KI compared to without. Moreover, in the other settings the baseline yields similar results because it is exactly derivable from the Bayesian formulation, as proved in Theorem 1. We define an Event as setting into movement a certain population of people who do not follow the typical simulation dynamics. For example, as already mentioned, the simulator encodes that persons in a residential area shop at a "shop" that is in the neighborhood. But we select certain persons who may instead deviate from this routine. A concrete example of this might be people who deviate from their daily routine of going to work to instead gather at a location staging a rally. This Event, if not handled early, for example by locking down the location and testing everyone that attended, can cause unintended consequences in terms of both human fatalities and economic losses incurred. We demonstrate how knowledge infusion can be used to mitigate this effect. We must take care here to analyze the Event before acting too quickly in infusing knowledge.
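For orientation, the following is a highly simplified, hypothetical sketch of one step of location-based SIR spread with lockdown interventions. The actual simulator models persons, households, routes, hospitals, testing levels, and economic effects, none of which are captured here; all field and parameter names are illustrative assumptions.

```python
# A minimal sketch of one simulator step: SIR-style spread at open locations,
# with locked-down locations excluded from mixing. Illustrative only.
import random

def step(people, locations, beta=0.3, gamma=0.1, locked=frozenset()):
    """
    people:    list of dicts such as {"state": "S"/"I"/"R", "home": str, "shop": str}
    locations: mapping location name -> list of person indices visiting it today
    locked:    set of location names currently under lockdown (no mixing there)
    Returns a reward proxy: the negative count of currently infected people.
    """
    # Infection: susceptible visitors mix with infected visitors at open locations.
    for loc, visitors in locations.items():
        if loc in locked or not visitors:
            continue
        infected = sum(1 for i in visitors if people[i]["state"] == "I")
        for i in visitors:
            if people[i]["state"] == "S" and random.random() < beta * infected / len(visitors):
                people[i]["state"] = "I"
    # Recovery.
    for p in people:
        if p["state"] == "I" and random.random() < gamma:
            p["state"] = "R"
    return -sum(1 for p in people if p["state"] == "I")
```

An Event of the kind described above would correspond to temporarily rerouting a subset of person indices to an additional, crowded location, which is exactly the situation the lockdown and testing actions are meant to catch early.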
Since the effect of imposing a policy in RL at any given time step propagates forward to all other time steps, in the case that the Event is transient in nature (passes quickly), knowledge infusion may cause more harm than good. We analyze the weights at the input layer of the neural network to understand which parts of the state space were highlighted in enforcing the agent policy. This is similar to a heat-map visualization like the one Dombrowski et al. use, except that only the input layer weights are considered [14]. The single hidden layer is a composition of the input features and is hence omitted from the interpretability analysis. We take the example of the top 2 largest weights for lockshop, and the percentage of test cases in which this held true, to illustrate the interpretability. The top 2 features are feature IDs 1 and 3, holding the largest weights in 85% of the test cases. We do this for the combined knowledge infusion setting. The feature IDs are from Table 1. We can see, for example, that for lockshop, the features with IDs 1 and 3 hold the most weight in this fraction of the test cases. The simulator dynamics encode that persons in a residential area shop at a "shop" that is in the neighborhood. This is what the "same" predicate means. Thus the feature states that there are many people in homes that are part of a residential area with a shop, which implies, due to the simulator dynamics, that many people will be shopping regularly at this shop (i.e., high interaction). Therefore this result is not only interpretable but also shows that the agent learns to lock down a shop only when the shop is a source of high interaction among people and when it is open. In settings with continuously evolving dynamics and the resulting non-stationarity and partial observability, standard RL frameworks suffer from unaffordable delays due to non-stationarity and sub-optimal policy learning due to partial observability. This is because of their fundamental trial-and-error based correction. In our example setting of APC, this delay in correction or a sub-optimal policy can prove extremely costly. In various other real-life scenarios, we see similar issues. We develop a principled Knowledge Infusion framework to enable effective control of the unintended consequences that arise thereof and demonstrate its effectiveness. We will explore more specifications for knowledge as functional constraints, and their applications, in future work in enhanced simulation settings.
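The input-layer weight inspection used in the interpretability analysis above can be sketched as follows. The weight values, the index-to-ID mapping, and the helper names are illustrative assumptions; only the feature clauses follow Table 1.

```python
# A minimal sketch of ranking input-layer weights per action, assuming the
# learned policy is the 2-layer network with unit second-layer weights, so each
# action's utility is a linear combination of the relational count features.
import numpy as np

feature_names = {
    1: "same(State,Res,Shop) ^ pin(State,Person,Home) ^ hin(State,Home,Res)",
    3: "sopen(State,Shop)",
}

def top_features(input_weights, action_index, k=2):
    """input_weights: (d, A) first-layer weights; returns the k features with the
    largest absolute weight for the given action (e.g., lockshop)."""
    w = input_weights[:, action_index]
    order = np.argsort(-np.abs(w))[:k]
    return [(int(i) + 1, feature_names.get(int(i) + 1, "unnamed"), float(w[i]))
            for i in order]

# Usage with illustrative weights over 3 features and 2 actions
W = np.array([[0.9, 0.1],
              [0.2, 0.0],
              [0.7, 0.3]])
print(top_features(W, action_index=0))   # e.g. features 1 and 3 dominate lockshop
```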
References
[1] Reinforcement learning in robotics: A survey.
[2] Mastering the game of Go with deep neural networks and tree search.
[3] A contextual-bandit approach to personalized news article recommendation.
[4] A reduction of imitation learning and structured prediction to no-regret online learning.
[5] Apprenticeship learning via inverse reinforcement learning.
[6] Knowledge-infused deep learning.
[7] Non-parametric policy gradients: A unified treatment of propositional and relational domains.
[8] Fitted Q-learning for relational domains.
[9] Poisson dependency networks: Gradient boosted models for multivariate count data.
[10] Knowledge-based probabilistic logic learning.
[11] Learning from imbalanced data in relational domains: A soft margin approach.
[12] A unified framework for knowledge intensive gradient boosting: Leveraging human experts for noisy sparse domains.
[13] Network size and weights size for memorization with two-layers neural networks.
[14] Explanations can be manipulated and geometry is to blame.
[15] Inductive logic programming.
[16] User friendly automatic construction of background knowledge: Mode construction from ER diagrams.
[17] The Aleph manual.
[18] On the convergence of binomial to Poisson distributions.
[19] Functional Frank-Wolfe boosting for general loss functions.