Abstract
In Stochastic Shortest Path (SSP) problems, the requirement of having at least one policy with a probability of reaching goals (probability-to-goal) equal to 1 cannot always be met. This is the case when dead ends, states from which the probability-to-goal is equal to 0, are unavoidable for every policy, which demands the definition of alternate methods to handle such cases. A criterion has the \(\alpha \)-strong probability-to-goal priority property if a necessary condition for optimality is that the ratio between the probability-to-goal values of the optimal policy and any other policy is bounded below by a value \(0 \le \alpha \le 1\). This definition is helpful when evaluating the preferences of different criteria for SSPs with dead ends. The Min-Cost given Max-Prob (MCMP) criterion prefers policies that minimize a well-defined cost function in the presence of unavoidable dead ends, among the policies that maximize probability-to-goal. However, it only guarantees \(\alpha \)-strong priority for \(\alpha = 1\). In this paper, we define \(\alpha \)-MCMP, a criterion based on MCMP with the addition of the guarantee of \(\alpha \)-strong priority for any value \(0 \le \alpha \le 1\). We also perform experiments comparing \(\alpha \)-MCMP and GUBS, the only other criterion known to have \(\alpha \)-strong priority for \(0 \le \alpha \le 1\), to analyze the difference between the probability-to-goal of policies generated by each criterion.
1 Introduction
Markov Decision Processes (MDPs) [11] constitute the main theoretical framework for modeling probabilistic planning problems, in which an agent interacts with an environment by applying actions with stochastic outcomes, to optimize a given objective function. In this context, Stochastic Shortest Path (SSP) [1] problems model scenarios in which an agent needs to minimize the expected accumulated cost to goal. The solution to an SSP is a policy, a mapping from states to actions. Conventionally, SSPs assume that there is at least one policy that has a probability of 1 to reach goals when followed by the agent.
In realistic settings, this is a strong requirement. Environments commonly have dead ends, states from which goals cannot be reached. When such states exist and are unavoidable, the conventional criterion for solving SSPs is not well-defined, which requires the definition of alternate criteria to address this case.
Several criteria were proposed in the literature in this context. The MAXPROB criterion [7] chooses policies that maximize the probability of reaching goals (probability-to-goal). The S3P [14] and iSSPUDE [6] criteria also prefer policies that maximize probability-to-goal, then choose the ones that minimize cost measures only considering histories that reach the goal. The MCMP criterion [16], on the other hand, chooses policies that maximize probability-to-goal, but then it minimizes a different cost measure that does not ignore histories that have dead-end states. This can lead to more natural optimal policies when compared to S3P and iSSPUDE since dead ends are taken into account. Also, since it can be formulated as a linear program, efficient state-of-the-art methods can be used to solve this criterion.
Other criteria make trade-offs between probability-to-goal and cost measures, such that they do not only prefer policies that maximize probability-to-goal. The fSSPUDE [6] criterion uses a finite penalty to give up, such that the agent can pay this value to exit the process at any step. A discount factor, commonly used in infinite horizon MDPs [11], can also be used in SSPs for making such trade-offs [15]. The GUBS criterion [4] was proposed as an alternative for criteria that make trade-offs between probability-to-goal and cost measures. Among other features, GUBS maintains good theoretical properties, like the \(\alpha \)-strong probability-to-goal priority [2], which guarantees a lower bound on the ratio of probability-to-goal values considering pairs of policies, for a given value of \(0 \le \alpha \le 1\).
Between the criteria that were mentioned, GUBS is the only one that guarantees \(\alpha \)-strong probability-to-goal priority for \(0 \le \alpha \le 1\). However, we can modify constraints in the MCMP linear programming formulation such that a new criterion, which also guarantees \(\alpha \)-strong probability-to-goal priority for \(0 \le \alpha \le 1\), is obtained.
In this paper, we thus propose \(\alpha \)-MCMP, a criterion derived from modifying a constraint in MCMP’s linear programming formulation such that it guarantees \(\alpha \)-strong probability-to-goal priority for \(0 \le \alpha \le 1\). We also perform experiments to evaluate the differences in the preference of policies of \(\alpha \)-MCMP and GUBS.
The structure of this work is defined as follows: Sect. 2 contains the background for this work, outlining definitions and properties of SSPs and alternate criteria to solve them in the presence of unavoidable dead ends; Sect. 3 introduces \(\alpha \)-MCMP and a proof that it maintains the \(\alpha \)-strong probability-to-goal priority property, as well as some differences between this new criterion and GUBS; Sect. 4 describes the empirical evaluation that was performed; and Sect. 5 has the conclusion of the present work.
2 Background
The following subsections will cover definitions and properties that will be used throughout this paper.
2.1 Stochastic Shortest Path (SSP) Problems
A Stochastic Shortest Path (SSP) problem [1] is an indefinite-horizon stochastic process in which, at time t, the agent is at a state \(s_t\) and chooses an action \(a_t\) that leads to \(s_{t + 1}\) according to a probability distribution conditioned on \(s_t\) and \(a_t\), while paying a cost of \(c_t\).
Definition 1
An SSP [1] is a tuple \(\mathcal {M} = \langle \mathcal {S}, s_0, \mathcal {A}, P, c, \mathcal {G}\rangle \), where:
-
\(\mathcal {S}\) is a finite set of states;
-
\(s_0\in \mathcal {S}\) is the initial state;
-
\(\mathcal {A}\) is a finite set of actions;
-
\(P: \mathcal {S}\times \mathcal {A}\times \mathcal {S}\rightarrow [0, 1]\) is the transition function, such that \(P(s, a, s') = \Pr (s_{t + 1} = s' \mid s_t = s, a_t = a)\);
-
\(c: \mathcal {S}\times \mathcal {A}\rightarrow \mathbb {R}_{\ge 0}\) is the cost function, which assigns a cost c(s, a) for taking action a at state s;
-
\(\mathcal {G}\subset \mathcal {S}\) is the set of absorbing goal states, i.e. \(c(g, a) = 0\) and \(P(g, a, g) = 1, \forall g\in \mathcal {G}\) and \(a\in \mathcal {A}\). Also, \(c(s, a) > 0\) for \(s\in \mathcal {S}\setminus \mathcal {G}\).
The objective in an SSP is to find a policy \(\pi \), a mapping from states to actions, that minimizes the expected cost to goal \(V^{\pi }(s) = \lim _{T\rightarrow \infty } \mathbb {E}\left[ \sum _{t=0}^{T-1}c_{t}\mid \pi , s_0 = s\right] \). The probability-to-goal of a policy \(\pi \) from state \(s\in \mathcal {S}\) is given by the function \(P^\pi _G(s) = \lim _{t\rightarrow \infty }\Pr (s_{t}\in \mathcal {G}\mid \pi , s)\). We also define the maximum probability-to-goal of a state s in an SSP as the function \(P_G(s) = \max _{\pi \in \Pi }P^\pi _G(s)\), in which \(\Pi \) is the set of all policies. A policy \(\pi \) is proper if the probability-to-goal when following it is 1, i.e. \(P^\pi _G(s_0) = 1\).
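For a fixed policy, \(P^\pi _G\) satisfies the fixed-point equation \(P^\pi _G(s) = \sum _{s'} P(s, \pi (s), s')\, P^\pi _G(s')\) with \(P^\pi _G(g) = 1\) for goals. The sketch below is not from the paper; the dictionary encoding and state names are illustrative assumptions.

```python
def prob_to_goal(transitions, policy, states, goals, iters=1000):
    """Approximate P^pi_G by fixed-point iteration.
    transitions[(s, a)] is a list of (next_state, probability) pairs."""
    p = {s: (1.0 if s in goals else 0.0) for s in states}
    for _ in range(iters):
        for s in states:
            if s in goals:
                continue
            p[s] = sum(pr * p[s2] for s2, pr in transitions[(s, policy[s])])
    return p

# Tiny SSP: action "b" at s0 reaches the goal w.p. 0.95, a dead end w.p. 0.05.
transitions = {
    ("s0", "b"): [("sg", 0.95), ("sde", 0.05)],
    ("sde", "b"): [("sde", 1.0)],  # absorbing dead end: P^pi_G(sde) stays 0
}
p = prob_to_goal(transitions, {"s0": "b", "sde": "b"}, ["s0", "sde", "sg"], {"sg"})
print(p["s0"])  # 0.95
```

Since dead ends never gain probability mass, the iteration converges from below to the true probability-to-goal values.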
When the agent acts for multiple steps \(t \in \{0, 1, \dots , T\}\), she generates a history \(h = \{\langle s_0, a_0, c_0\rangle , \langle s_1, a_1, c_1\rangle , \dots , \langle s_{T-1}, a_{T-1}, c_{T-1}\rangle , s_T \}\), where \(s_t\in \mathcal {S}\) is the state the agent is in at step t, \(a_t\in \mathcal {A}\) is the action applied at \(s_t\), and \(c_t = c(s_t, a_t)\). We define \(\mathcal {H}=(\mathcal {S}\times \mathcal {A}\times \mathbb {R}_{>0})^{*}\times \mathcal {S}\) as the set of all histories.
Among methods to solve SSPs, it is possible to describe an SSP as a linear program and solve it by using this program as an input to a solver [3]. In this LP, the objective is to find the values of \(x_{s, a}\), which are variables that represent the expected number of times action \(a\in \mathcal {A}\) is executed at state \(s\in \mathcal {S}\). This linear program is outlined in LP1:

$$\begin{aligned} \text {(LP1)}\quad \min _{x} \;&\sum _{s\in \mathcal {S}}\sum _{a\in \mathcal {A}} c(s, a)\, x_{s, a}&\\ \text {s.t. } \;&out(s) - in(s) = 0,\quad \forall s\in \mathcal {S}\setminus (\{s_0\}\cup \mathcal {G})&\text {(C4)}\\&out(s_0) - in(s_0) = 1&\text {(C5)}\\&\textstyle \sum _{s_g\in \mathcal {G}} in(s_g) = 1&\text {(C6)}\\&x_{s, a} \ge 0,\quad \forall s\in \mathcal {S}, a\in \mathcal {A}, \end{aligned}$$

where \(out(s) = \sum _{a\in \mathcal {A}} x_{s, a}\) and \(in(s) = \sum _{s'\in \mathcal {S}}\sum _{a\in \mathcal {A}} P(s', a, s)\, x_{s', a}\).
in(s) represents the expected flow entering s, while out(s) represents the expected flow leaving s. Constraint (C4) requires that, except for \(s_0\) and goal states, the expected flow entering a state equals the expected flow leaving it. (C5) indicates that the expected flow leaving \(s_0\) must exceed the flow entering \(s_0\) by exactly 1; in other words, in expectation, the agent must leave the initial state one more time than she enters it (not counting the first step of the process). Finally, (C6) requires the total expected flow reaching goal states to equal 1.
The objective function then minimizes the expected total cost of reaching the goal from \(s_0\), subject to these constraints. Because the optimal policy \(\pi ^*\) generated after solving LP1 is guaranteed to be deterministic, it can be defined as \(\pi ^*(s) = \arg \max _{a\in \mathcal {A}} x_{s, a}\).
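To make the flow constraints concrete, the snippet below (an illustrative sketch, not from the paper; the dictionary encoding of \(x_{s,a}\) and P is an assumption) computes in(s) and out(s) from candidate occupation measures and checks (C4)-(C6) on a two-step deterministic chain.

```python
def flows(x, P, states, actions):
    """Expected in/out flows from occupation measures x[(s, a)].
    P[(s, a, s2)] is the transition probability, defaulting to 0."""
    out = {s: sum(x.get((s, a), 0.0) for a in actions) for s in states}
    inn = {s: sum(x.get((s2, a), 0.0) * P.get((s2, a, s), 0.0)
                  for s2 in states for a in actions) for s in states}
    return inn, out

# Chain s0 -a-> s1 -a-> sg, deterministic; the optimal solution
# executes action "a" exactly once at s0 and once at s1.
P = {("s0", "a", "s1"): 1.0, ("s1", "a", "sg"): 1.0, ("sg", "a", "sg"): 1.0}
x = {("s0", "a"): 1.0, ("s1", "a"): 1.0}
inn, out = flows(x, P, ["s0", "s1", "sg"], ["a"])
assert out["s1"] - inn["s1"] == 0  # (C4): flow conserved at s1
assert out["s0"] - inn["s0"] == 1  # (C5): one extra departure from s0
assert inn["sg"] == 1              # (C6): all flow reaches the goal
```

In practice these occupation measures come from an LP solver; the check above only illustrates what a feasible solution of LP1 must satisfy.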
2.2 SSPs with Dead Ends and Alternate Criteria
When a policy \(\pi \) is improper, \(V^\pi \) diverges and is thus not well-defined. Hence, when no proper policy exists, the standard criterion for solving SSPs is not well-defined either. This happens when dead ends, states from which no goal state can be reached, are unavoidable. Figure 1 contains an example of an SSP with an unavoidable dead end. This problem has a state \(s_0\), an unavoidable dead-end state \(s_{de}\), and a goal state \(s_g\). At \(s_0\), action a leads to \(s_g\) deterministically with cost \(c_a\). Action b has cost \(c_b\) and leads to \(s_g\) with probability P, and to \(s_{de}\) with probability \(1 - P\).
For SSPs with unavoidable dead ends, different criteria have to be used to solve these problems. One can maximize probability without optimizing cost, which is the case of the MAXPROB criterion [7]. Instead of just maximizing probability, several criteria take the dual approach of minimizing a well-defined cost measure in the case of unavoidable dead ends, but only considering policies that maximize probability-to-goal. Throughout this work, we will refer to this class of criteria as lexicographic. S3P [14] and MCMP [16] are examples of such criteria. When considering the example outlined in Fig. 1, lexicographic criteria would choose a as the optimal action at \(s_0\), because it maximizes probability-to-goal.
Other criteria can make trade-offs between probability-to-goal and some cost measures. fSSPUDEs [6], for example, use a finite penalty to pay when a dead end is reached. A discount factor [15] can also be used for this objective. Additionally, the GUBS criterion [4] has parameters that combine Expected Utility Theory with goal prioritization to make such trade-offs. In the example in Fig. 1, these criteria could either choose between actions a or b, depending on which values for their parameters are selected.
Sections 2.4 and 2.5 will detail the MCMP and GUBS criteria, respectively. Before that, the next subsection will formally define the \(\alpha \)-strong probability-to-goal priority property.
2.3 Making Trade-Offs Between Probability-to-Goal and Cost Measures
When analyzing different criteria for solving SSPs with unavoidable dead ends, we might ask how to evaluate the decisions made by these criteria.
One method for doing so is to take into account how each criterion maintains the \(\alpha \)-strong probability-to-goal priority property [2]. The definition for this property is given as follows:
Definition 2
(\(\alpha \) -strong probability-to-goal priority [2]). Consider \( 0 \le \alpha \le 1\); we say that a decision criterion has \(\alpha \)-strong probability-to-goal priority if, for all SSPs \(\mathcal {M}\) and all pairs of policies \(\pi ,\pi '\in \Pi \), the following condition is true:

$$\pi \succeq \pi ' \implies \frac{P_G^{\pi }(s_{0})}{P_G^{\pi '}(s_{0})} \ge \alpha ,$$

where \(P_G^{\pi '}(s_{0}) > 0\). If \(\pi ^*\) is an optimal policy, then for all \(\pi '\in \Pi \) the following equation holds:

$$\frac{P_G^{\pi ^*}(s_{0})}{P_G^{\pi '}(s_{0})} \ge \alpha .$$
Note that \(\frac{P_G^{\pi }(s_{0})}{P_G^{\pi '}(s_{0})} \ge \alpha \) is a necessary condition for \(\pi \succeq \pi '\) to be true.
Lexicographic criteria have 1-strong probability-to-goal priority, as these criteria guarantee that every policy preferred over another has a probability-to-goal at least as large [2]. The GUBS criterion guarantees \(\alpha \)-strong priority for \(0\le \alpha \le 1\), while the fSSPUDE and discounted cost criteria only have such guarantees for \(\alpha = 0\) (i.e., they are only 0-strong) [2].
2.4 Min-Cost Given Max-Prob (MCMP)
The Min-Cost given Max-Prob (MCMP) criterion [16] shrinks trajectories by pruning each history at the first dead-end state it reaches, if any. This is formulated by the function \(\psi : \mathcal {H}\rightarrow \mathcal {H}\):

$$\psi (h) = {\left\{ \begin{array}{ll} \{\langle s_0, a_0, c_0\rangle , \dots , \langle s_{k-1}, a_{k-1}, c_{k-1}\rangle , s_k\}, &{} \text {if } s_k \text { is the first dead end in } h,\\ h, &{} \text {otherwise,} \end{array}\right. }$$

for a history \(h = \{\langle s_0, a_0, c_0 \rangle , \langle s_1, a_1, c_1 \rangle , \dots \}\).
Based on that, the value of a policy under the MCMP criterion is the expected accumulated cost of the pruned histories:

$$\bar{C}^{\pi }_{MCMP}(s_0) = \mathbb {E}\left[ c(\psi (h)) \mid \pi , s_0\right] ,$$

where \(c(h)\) denotes the accumulated cost of history h.
Finally, a policy \(\pi ^*_{MCMP}\) is optimal under MCMP if it minimizes \(\bar{C}_{MCMP}\) among the policies that maximize \(P^\pi _G(s_0)\):

$$\pi ^*_{MCMP} = \arg \min _{\pi \in \Pi _{MP}} \bar{C}^{\pi }_{MCMP}(s_0),$$

such that \(\Pi _{MP}\) is the set of policies that maximize \(P^\pi _G(s_0)\), i.e. \(\Pi _{MP} = \{\pi \mid P^\pi _G(s_0) = \max _{\pi '\in \Pi }P^{\pi '}_G(s_0)\}\).
Solving MCMP is equivalent to solving LP2 below, a modified version of LP1:

$$\begin{aligned} \text {(LP2)}\quad \min _{x} \;&\sum _{s\in \mathcal {S}}\sum _{a\in \mathcal {A}} c(s, a)\, x_{s, a}&\\ \text {s.t. } \;&out(s) - in(s) \le 0,\quad \forall s\in \mathcal {S}\setminus (\{s_0\}\cup \mathcal {G})&\text {(C7)}\\&out(s_0) - in(s_0) \le 1&\text {(C8)}\\&\textstyle \sum _{s_g\in \mathcal {G}} in(s_g) = P_G(s_0)&\text {(C9)}\\&x_{s, a} \ge 0,\quad \forall s\in \mathcal {S}, a\in \mathcal {A}. \end{aligned}$$

The differences between the original LP for SSPs and LP2 are that constraints (C4) and (C5) were replaced by (C7) and (C8), respectively, in which equalities become inequalities, and (C6) was replaced by (C9), in which 1 is substituted by \(P_G(s_0)\). The inequality introduced in (C7) allows the expected flow entering a state (in(s)) to be higher than the expected flow leaving it (out(s)). This can be interpreted as an implicit zero-cost give-up action available to the agent, which represents the pruning of histories that contain dead ends defined in the function \(\psi \). On a similar note, the inequality in (C8) allows the expected flow entering the initial state \(s_0\), counting the first arrival, to be higher than the expected flow leaving it. In other words, the agent always enters \(s_0\) at the first step, but does not always leave it without giving up (hence \(out(s_0) \ge 0\)).
Also note that a probabilistic Markovian policy \(\pi ^*(s, a) = x_{s, a} / out(s)\) can be generated after solving LP2.
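The extraction of the probabilistic policy \(\pi ^*(s, a) = x_{s, a} / out(s)\) can be sketched as follows (the encoding and the occupation-measure values are hypothetical, for illustration only):

```python
def extract_policy(x, states, actions):
    """Probabilistic Markovian policy pi(s, a) = x[(s, a)] / out(s)."""
    pi = {}
    for s in states:
        out = sum(x.get((s, a), 0.0) for a in actions)
        if out > 0:  # states with no outgoing flow (e.g. give-up) are skipped
            pi[s] = {a: x.get((s, a), 0.0) / out for a in actions}
    return pi

# Hypothetical LP2 solution mixing two actions at s0.
x = {("s0", "a"): 0.6, ("s0", "b"): 0.4}
pi = extract_policy(x, ["s0"], ["a", "b"])
print(pi["s0"])  # {'a': 0.6, 'b': 0.4}
```

States where out(s) = 0 carry no flow under the optimal solution, so the policy can be left undefined there.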

To find \(P_G(s_0)\), another linear program, referred to as LP3, can be used [16]. In this LP, the objective function maximizes the total expected flow reaching goal states, which is equivalent to the probability-to-goal from the initial state \(s_0\):

$$\begin{aligned} \text {(LP3)}\quad \max _{x} \;&\sum _{s_g\in \mathcal {G}} in(s_g)&\\ \text {s.t. } \;&out(s) - in(s) \le 0,\quad \forall s\in \mathcal {S}\setminus (\{s_0\}\cup \mathcal {G})\\&out(s_0) - in(s_0) \le 1\\&x_{s, a} \ge 0,\quad \forall s\in \mathcal {S}, a\in \mathcal {A}. \end{aligned}$$
Since MCMP is a lexicographic criterion, it follows that it maintains 1-strong probability-to-goal priority [2]. As an example, consider the SSP displayed in the Fig. 1. As mentioned before, MCMP will choose as optimal the policy that takes action a at \(s_0\). Let \(\pi \) be this policy. The cost value under the MCMP criterion for \(\pi \) is \(C^\pi _{MCMP}(s_0) = c_a\).
Additionally, we can modify MCMP to make trade-offs between probability and cost by simply substituting \(P_G(s_0)\) in (C9) with any probability value p; LP2 will then generate a policy \(\pi \) such that \(P^\pi _G(s_0) = p\) [8].
In the example of Fig. 1, if we replace \(P_G(s_0)\) in (C9) with the value P, for instance, LP2 could return a policy that always takes b at \(s_0\), or some linear combination of actions a and b that yields probability-to-goal P, if the MCMP cost of that combination is lower. To illustrate this more concretely, consider the case of \(P = 0.95\), \(c_a = 2\), and \(c_b = 1\). If we replace \(P_G(s_0)\) in (C9) with \(P = 0.95\), the optimal policy returned by LP2 always takes b, yielding an expected cost of 1. However, if we replace \(P_G(s_0)\) with 0.98 in (C9) instead, the optimal policy takes a with probability 0.6 and b with probability 0.4, leading to an expected cost of \(0.6 \times 2 + 0.4 \times 1 = 1.6\) and probability-to-goal \(0.6 \times 1 + 0.4\times 0.95 = 0.98\). Note that, in this case, LP2 allows a policy that, for example, takes a with probability 0.98 and takes the implicit give-up action with probability 0.02. This also yields a probability-to-goal of 0.98, thus fulfilling (C9), but it is not optimal because its expected cost is \(0.98 \times 2 = 1.96\), which is higher than 1.6.
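The trade-off above can be checked numerically. The brute-force sketch below (not the authors' code; a grid search standing in for the LP) searches over mixtures of action a, action b, and the implicit give-up action that achieve probability-to-goal exactly p in the SSP of Fig. 1, and recovers the costs discussed in the example:

```python
def best_mix(p, P, c_a, c_b, steps=2000):
    """Cheapest mixture (x: prob. of a, y: prob. of b, rest: give up)
    with probability-to-goal exactly p, for the SSP of Fig. 1."""
    best = None
    for i in range(steps + 1):
        x = i / steps
        y = (p - x) / P          # chosen so that x * 1 + y * P == p
        if y < -1e-12 or x + y > 1 + 1e-12:
            continue             # infeasible mixture
        cost = c_a * x + c_b * max(y, 0.0)
        if best is None or cost < best[0]:
            best = (cost, x)
    return best

cost, x = best_mix(0.98, P=0.95, c_a=2.0, c_b=1.0)
print(round(cost, 6), x)  # 1.6 0.6
cost95, _ = best_mix(0.95, P=0.95, c_a=2.0, c_b=1.0)
print(round(cost95, 6))   # 1.0 (always take b)
```

The search confirms that for p = 0.98 the cheapest feasible mixture takes a with probability 0.6 and b with probability 0.4, exactly as in the example.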
2.5 Goals with Utility-Based Semantics (GUBS)
The Goals with Utility-based Semantics (GUBS) criterion [4] was proposed as an alternative method for solving SSPs with unavoidable dead ends by combining goal prioritization with Expected Utility Theory. It combines these two characteristics by its two parameters: the constant goal utility \(K_g\), and the utility function over cost u, respectively.
Definition 3
(GUBS criterion [4]). The GUBS criterion evaluates a history by the utility function U:

$$U(h) = u(C_h) + K_g\, \mathbb {1}_{\mathcal {G}}(h),$$

where \(C_h\) is the accumulated cost of h and \(\mathbb {1}_{\mathcal {G}}(h)\) is 1 if h reaches a goal state and 0 otherwise.

An agent follows the GUBS criterion if she evaluates a policy \(\pi \) under the following value function:

$$V^{\pi }_{GUBS}(s_0) = \mathbb {E}\left[ U(h) \mid \pi , s_0\right] .$$

A policy \(\pi ^*\) is optimal under the GUBS criterion if it maximizes the function \(V^{\pi }_{GUBS}\), i.e.:

$$\pi ^* = \arg \max _{\pi \in \Pi } V^{\pi }_{GUBS}(s_0).$$
Among the criteria mentioned so far, GUBS is the only one that guarantees \(\alpha \)-strong probability-to-goal priority for any value \(0 \le \alpha \le 1\).
Corollary 1
(\(\alpha \) -strong probability-to-goal priority for GUBS [2]). For an arbitrary \(0 \le \alpha \le 1\), the GUBS criterion has \(\alpha \)-strong probability-to-goal priority.
For \(0 \le \alpha < 1\), the property is maintained if the following condition is true:

$$K_g \ge \frac{\alpha \,(U_{max} - U_{min})}{1 - \alpha },$$

such that the values returned by u are in the interval \([U_{min}, U_{max}]\).
For \(\alpha = 1\), this property is true when \(U_{max} = U_{min}\) and \(K_g > 0\).
The eGUBS criterion [5] is a specialization of GUBS, in which the function u is an exponential function taken from risk-sensitive SSPs [10].
Definition 4
(eGUBS criterion [2, 5]). The eGUBS criterion is the GUBS criterion where the utility function u is defined as follows:

$$u(C_T) = e^{\lambda C_T},$$

over some accumulated cost \(C_T\) and a risk factor \(\lambda <0\).
In the GUBS criterion, generated policies are non-Markovian, since they need to have the state space augmented with the accumulated cost. From the properties of eGUBS, however, it is possible to use algorithms that guarantee that these non-Markovian policies are finite, based on a maximum cost obtained from the problem structure [5]. eGUBS-VI [5] is an algorithm that does this by computing an optimal policy under eGUBS completely via value iteration, while eGUBS-AO* [2] is another algorithm that can compute an optimal policy by using a mix of value iteration and heuristic search.
Also, since for eGUBS \(U_{min} = 0\) and \(U_{max} = 1\), the lower bound obtained in Corollary 1 can be restricted to \(K_g \ge \frac{\alpha }{1 - \alpha }\) [2].
3 \(\alpha \)-MCMP Criterion
In this section, we will show how to define a new criterion called \(\alpha \)-MCMP, which maintains the \(\alpha \)-strong probability-to-goal priority by leveraging the MCMP definition.
MCMP is 1-strong since it is a criterion that maximizes probability-to-goal. However, we can relax the LP definition of MCMP so that it maintains \(\alpha \)-strong probability-to-goal priority for \(0 \le \alpha \le 1\). This can be done by replacing constraint (C9) with a new constraint (C10), which, instead of requiring the total expected flow entering goal states to equal \(P_G(s_0)\) (the maximum probability-to-goal), requires this flow to equal \(\alpha P_G(s_0)\).
LP4 contains the modified version of LP2:

$$\begin{aligned} \text {(LP4)}\quad \min _{x} \;&\sum _{s\in \mathcal {S}}\sum _{a\in \mathcal {A}} c(s, a)\, x_{s, a}&\\ \text {s.t. } \;&out(s) - in(s) \le 0,\quad \forall s\in \mathcal {S}\setminus (\{s_0\}\cup \mathcal {G})&\text {(C7)}\\&out(s_0) - in(s_0) \le 1&\text {(C8)}\\&\textstyle \sum _{s_g\in \mathcal {G}} in(s_g) = \alpha P_G(s_0)&\text {(C10)}\\&x_{s, a} \ge 0,\quad \forall s\in \mathcal {S}, a\in \mathcal {A}. \end{aligned}$$
This yields a new criterion that minimizes the same cost measure as MCMP, but over a different set of policies. Throughout this work, we refer to this modified version of MCMP as \(\alpha \)-MCMP. It can be formally defined as follows:
Definition 5
Given a value of \(0 \le \alpha \le 1\), a policy \(\pi ^*_{\alpha MCMP}\) is optimal under the \(\alpha \)-MCMP criterion if it minimizes \(\bar{C}_{MCMP}\) among policies with probability-to-goal equal to \(\alpha P_G(s_0)\):

$$\pi ^*_{\alpha MCMP} = \arg \min _{\pi \in \Pi _{\alpha MP}} \bar{C}^{\pi }_{MCMP}(s_0),$$

for \(\Pi _{\alpha MP} = \{\pi \mid P^\pi _G(s_0) = \alpha P_G(s_0)\}\).
The following theorem demonstrates that \(\alpha \)-MCMP thus maintains \(\alpha \)-strong priority for any \(0\le \alpha \le 1\).
Theorem 1
(\(\alpha \) -strong probability-to-goal priority for \(\alpha \) -MCMP). The \(\alpha \)-MCMP criterion has \(\alpha \)-strong probability-to-goal priority for any value \(0 \le \alpha \le 1\).
Proof
First, note that by replacing (C9) with the relaxed constraint \(\sum \limits _{s_g\in \mathcal {G}}in(s_g) \ge \alpha P_G(s_0)\), the condition \(\pi ^*\succeq \pi \implies \frac{P^{\pi ^*}_G(s_0)}{P^{\pi }_G(s_0)}\ge \alpha \) from Definition 2 holds, in which \(\pi ^*\) is the optimal policy generated by the resulting LP and \(\pi \) is any other policy: the constraint guarantees \(P^{\pi ^*}_G(s_0) \ge \alpha P_G(s_0)\), and since \(P^{\pi }_G(s_0) \le P_G(s_0)\) for every \(\pi \), the ratio is at least \(\alpha \).
Consider an arbitrary policy \(\pi '\) with \(P^{\pi '}_G(s_0) \ge \alpha P_G(s_0)\) and let \(\pi \) be defined as the policy that follows \(\pi '\) with probability \(\frac{\alpha P_G(s_0)}{P^{\pi '}_G(s_0)}\) and gives up immediately with probability \(1 - \frac{\alpha P_G(s_0)}{P^{\pi '}_G(s_0)}\). The probability-to-goal value of \(\pi \) is \(P^{\pi }_G(s_0) = \frac{\alpha P_G(s_0)}{P^{\pi '}_G(s_0)}P^{\pi '}_G(s_0) + (1 - \frac{\alpha P_G(s_0)}{P^{\pi '}_G(s_0)})\times 0 = \alpha P_G(s_0)\). It then follows that \(\pi \) is at least as good as \(\pi '\), because \(\bar{C}^{\pi }_{MCMP}(s_0) = \frac{\alpha P_G(s_0)}{P^{\pi '}_G(s_0)} \bar{C}^{\pi '}_{MCMP}(s_0) \le \bar{C}^{\pi '}_{MCMP}(s_0)\).
Thus, a policy with probability-to-goal \(\alpha P_G(s_0)\) will always be as good as any policy with probability-to-goal higher than \(\alpha P_G(s_0)\) when the constraint \(\sum \limits _{s_g\in \mathcal {G}}in(s_g) \ge \alpha P_G(s_0)\) is used. This means that using the constraint \(\sum \limits _{s_g\in \mathcal {G}}in(s_g) = \alpha P_G(s_0)\) (C10) is equivalent to using \(\sum \limits _{s_g\in \mathcal {G}}in(s_g) \ge \alpha P_G(s_0)\). Since the restriction \(\sum \limits _{s_g\in \mathcal {G}}in(s_g) \ge \alpha P_G(s_0)\) maintains the \(\alpha \)-strong probability-to-goal priority property for \(0\le \alpha \le 1\) and (C10) is equivalent to using it, the \(\alpha \)-MCMP criterion maintains the \(\alpha \)-strong probability-to-goal priority property for \(0\le \alpha \le 1\). \(\square \)
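The give-up mixture used in the proof can be illustrated numerically (a sketch; the specific numbers are assumptions): mixing any policy \(\pi '\) with the give-up action scales both its probability-to-goal and its MCMP cost by the same factor.

```python
def mix_with_giveup(p_prime, cost_prime, alpha, p_max):
    """Follow pi' with prob. q = alpha * p_max / p_prime; give up otherwise.
    Returns the mixture's (probability-to-goal, MCMP cost)."""
    q = alpha * p_max / p_prime
    return q * p_prime, q * cost_prime

# pi' reaches the goal w.p. 0.9 at MCMP cost 10; alpha = 0.5, P_G(s0) = 1.0.
prob, cost = mix_with_giveup(0.9, 10.0, alpha=0.5, p_max=1.0)
print(prob, cost)  # prob equals alpha * P_G(s0) (up to rounding), cost < 10
```

Because q is at most 1 whenever \(P^{\pi '}_G(s_0) \ge \alpha P_G(s_0)\), the mixture never costs more than \(\pi '\) itself, which is exactly the inequality used in the proof.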
3.1 Relationship Between \(\alpha \)-MCMP and GUBS
By defining \(\alpha \)-MCMP, we now have a criterion that maintains the \(\alpha \)-strong probability-to-goal priority property for \(0 \le \alpha \le 1\) like GUBS, but with the advantages of generating a Markovian policy and having efficient solutions inherited from MCMP.
Although both criteria maintain the \(\alpha \)-strong priority for \(0\le \alpha \le 1\), the probability-to-goal of optimal policies in \(\alpha \)-MCMP will always equal \(\alpha P_G(s_0)\). In the GUBS criterion, when choosing a value of \(K_g\) by following Corollary 1, its resulting policy will not necessarily have a fixed probability-to-goal value of \(\alpha P_G(s_0)\). This value can instead be higher than \(\alpha P_G(s_0)\), if a deterministic policy with such a value exists and if the chosen utility function allows for that.
For example, consider the eGUBS criterion and a value of \(\alpha = 0.95\). From Corollary 1, a value of \(K_g \ge \frac{0.95}{0.05} = 19\) guarantees that eGUBS is 0.95-strong. Also, consider the example in Fig. 1, such that \(P = 0.95\) and the cost \(c_a\) is larger than \(c_b\) by a margin of \(\delta \), i.e., \(c_a = c_b + \delta \), for \(\delta > 0\). Under this setting, eGUBS chooses a as the optimal action at \(s_0\) if the expected utility when following it is greater than the one when following b, i.e., \(u(c_a) + K_g > 0.95 (u(c_b) + K_g) + 0.05 u(\infty )\). Considering \(K_g = 20 \ge 19\) and that \(u(\infty ) = \lim _{C\rightarrow \infty } e^{\lambda C} = 0\) for \(\lambda < 0\):

$$\begin{aligned} e^{\lambda c_a} + 20&> 0.95\,(e^{\lambda c_b} + 20)\\ e^{\lambda c_a} - 0.95\, e^{\lambda c_b}&> -1. \end{aligned}$$

Thus, for values of \(\lambda \), \(c_a\), and \(c_b\) such that \(e^{\lambda c_a} - 0.95 e^{\lambda c_b} > -1\), action a will be optimal. For instance, if \(\delta = 1\), \(\lambda = -0.1\), \(c_b = 100\), and \(c_a = 100 + 1 = 101\), the utility of taking action a is \(e^{-0.1 \times 101} + 20 \approx 20\), which is higher than the utility of taking b, which is \(0.95 (e^{-0.1 \times 100} + 20) \approx 19\). For this same setting, the optimal policy under \(\alpha \)-MCMP is the one that chooses a with probability 0.95 and gives up with probability 0.05, with an expected cost of \(0.95 \times 101 = 95.95\).
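The comparison in this example can be reproduced directly (a sketch under the paper's numbers, using \(u(\infty ) = 0\) since \(\lambda < 0\)):

```python
import math

def egubs_value(p_goal, goal_cost, k_g, lam):
    """Expected eGUBS utility of a policy reaching the goal w.p. p_goal at
    accumulated cost goal_cost; dead-end histories contribute u(inf) = 0
    and receive no K_g bonus."""
    return p_goal * (math.exp(lam * goal_cost) + k_g)

k_g, lam = 20.0, -0.1                    # K_g = 20 >= alpha / (1 - alpha) = 19
v_a = egubs_value(1.0, 101, k_g, lam)    # deterministic action a, cost 101
v_b = egubs_value(0.95, 100, k_g, lam)   # risky action b, cost 100
print(v_a > v_b)           # True: a is preferred under eGUBS
print(round(0.95 * 101, 2))  # 95.95, the alpha-MCMP expected cost
```

The utilities come out to roughly 20.00004 for a and 19.00004 for b, matching the approximations in the text.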
In summary, for a value of \(\alpha = 0.95\) and depending on the trade-off incurred by the chosen value of \(\lambda \), the GUBS criterion can choose an optimal policy \(\pi \) such that \(P^\pi _G(s_0) \ge \alpha P_G(s_0)\). \(\alpha \)-MCMP, on the other hand, will always select a policy that has a probability-to-goal value equal to \(\alpha P_G(s_0)\).
Table 1 summarizes some of the key differences between \(\alpha \)-MCMP and GUBS.
4 Experiments
We have performed experiments in the Navigation [12], River [4], and Triangle Tireworld [9] domains (Figs. 2a, 2b, and 2c illustrate these domains, respectively) to evaluate \(\alpha \)-MCMP empirically and compare it to the GUBS criterion.
The Navigation domain is a grid world, where the agent starts at the rightmost column in the last row, and the goal is in the same column but in the first row. In the middle rows, the actions have a probability of making the agent disappear, thus going to a dead-end state. These probabilities are lower on the left and higher on the right. The River domain is similar, also being a grid world, but representing an agent at a river bank having to cross the river to get to the goal that is on the other side. Actions in the river have a positive probability of making the agent fall down one row, while actions in the bank and the bridge, located in the first row, are deterministic. If the agent gets to the first row after falling down in the river, the waterfall is reached, in which no actions to other states are available, thus being a dead end. Finally, in the Triangle Tireworld domain, the agent is represented by a car that needs to go to a goal location, always with a positive probability of getting a flat tire. Certain states have available tires to change, but these are in general further away from the goal than states with no tires available. In the instances depicted in Fig. 2c, for example, the black circles represent locations that have available tires. If the car gets a flat tire after getting to a state with no tire to replace, then a dead end is reached.
We aimed to evaluate the different probability-to-goal values obtained from optimal policies generated from both criteria when varying their parameters. For that, we varied \(\alpha \) between the values of \(\{10^{-5}, 10^{-4}, 10^{-3},\) \(10^{-2}, 0.1, 0.5, 0.999\}\) and computed LP4 to obtain the optimal policy of \(\alpha \)-MCMP. For GUBS, we used the value of \(K_g\) as in Corollary 1 as a parameter for the given values of \(\alpha \) and then computed the eGUBS criterion for this value of \(K_g\) and different values of \(\lambda \) using the eGUBS-VI algorithm. The values of \(\lambda \) were \(\{-0.05,\) \(-0.06, -0.07, -0.08, -0.09, -0.1, -0.2\}\) for the Navigation domain, \(\{-0.01, -0.1,\) \( -0.2,-0.3, -0.4, -0.5\}\) for the River domain, and \(\{-0.1, -0.2, -0.25,-0.3,-0.35,\) \( -0.4\}\) for the Triangle Tireworld domain. A single instance was used for each domain. The number of states for each of these instances is 101, 200 and 19562 for the Navigation, River, and Triangle Tireworld domains, respectively. Note that the numbers of processed states can be considerably higher for eGUBS because it might need to process a high number of augmented states.
The domains were defined as PDDLGym [13] environments, and the code of the implementation is available at https://github.com/GCrispino/ssp-deadends/tree/bracis2023-paper.
Figures 3, 4, and 5 contain graphs that display the probability-to-goal values of each policy obtained for GUBS and \(\alpha \)-MCMP given different values of \(\alpha \) (displayed in \(\log _{10}\) scale on the x-axis), respectively for the Navigation, River, and Triangle Tireworld domains. They show that, as mentioned before, while the probability-to-goal obtained by \(\alpha \)-MCMP always equals \(\alpha P_G(s_0)\), the values obtained from eGUBS' optimal policies differ. In the different lines (see Note 1) reflecting different values of the risk factor \(\lambda \), the probability-to-goal values are always higher than \(\alpha P_G(s_0)\). How much higher these probabilities are, though, depends on the chosen values of \(\lambda \). For example, in the River domain, we can observe several different types of probability-to-goal curves when varying \(\lambda \). The one that results in the widest difference between the minimum and the maximum probability-to-goal values is \(\lambda = -0.5\), for which the minimum probability-to-goal obtained was about 0.07 for \(\alpha = 10^{-5}\), and the maximum value was 1, obtained when \(\alpha \in \{10^{-3}, 10^{-2}, 0.1, 0.5, 0.999\}\).
In fact, for all domains, we observed that the higher the absolute value of \(\lambda \), the higher the difference between the minimum and maximum probability-to-goal values obtained for that value of \(\lambda \). It is also interesting to note that intermediate values of \(\lambda \) can cover intermediate values of probability-to-goal that larger absolute values of \(\lambda \) cannot. For example, in the River domain, when \(\lambda \) is equal to \(-0.1\) and \(-0.2\), values of probability-to-goal close to 0.6 were obtained, which was not the case when the absolute value of \(\lambda \) was higher than 0.2 (i.e. when \(\lambda \) is equal to \(-0.3\), \(-0.4\), and \(-0.5\)). As another example, in the Navigation domain when \(\lambda = -0.1\), a value of about 0.47 was obtained, while for other values of \(\lambda \), such as \(-0.2\), and for absolute values lower than 0.09 (i.e. when \(\lambda \) is equal to \(-0.05, -0.06, -0.07\), and \(-0.08\)), either a value close to 0 or close to the maximum probability-to-goal was obtained.
5 Conclusion
In this paper, we define \(\alpha \)-MCMP, a criterion obtained from MCMP to solve SSPs with unavoidable dead ends. We also show how the modifications made to MCMP to obtain \(\alpha \)-MCMP guarantee the \(\alpha \)-strong probability-to-goal priority property. Besides this, \(\alpha \)-MCMP inherits advantages from MCMP, such as Markovian policies and efficient state-of-the-art methods to solve it. It also differs from GUBS in some respects, for instance by preferring policies that might give up in the middle of the process over policies that never do so.
Finally, we performed experiments to evaluate policies obtained by \(\alpha \)-MCMP and compared them to policies generated by the eGUBS criterion. The results indicate that we can make trade-offs in both criteria by choosing values of \(\alpha \) a priori. Nonetheless, the way that these criteria make this trade-off is different. \(\alpha \)-MCMP always has probability-to-goal equal to \(\alpha P_G(s_0)\), while the compromise made by eGUBS will depend on the value of \(\lambda \) used in its utility function.
This paper attempts to contribute to the general understanding of sequential decision-making in the presence of unavoidable dead-end states, as do several other works in this area [2, 4,5,6,7, 14,15,16].
Notes
1. Not every line representing a value of \(\lambda \) is visible in the figures, because its values might be very close to the values of other lines, which can cover it.
References
Bertsekas, D.: Dynamic Programming and Optimal Control. Athena Scientific, Belmont, Mass (1995)
Crispino, G.N., Freire, V., Delgado, K.V.: GUBS criterion: arbitrary trade-offs between cost and probability-to-goal in stochastic planning based on expected utility theory. Artif. Intell. 316, 103848 (2023)
d’Epenoux, F.: A probabilistic production and inventory problem. Manage. Sci. 10(1), 98–108 (1963)
Freire, V., Delgado, K.V.: GUBS: a utility-based semantic for goal-directed Markov decision processes. In: Proceedings of the 16th Conference on Autonomous Agents and Multiagent Systems, pp. 741–749 (2017)
Freire, V., Delgado, K.V., Reis, W.A.S.: An exact algorithm to make a trade-off between cost and probability in SSPs. In: Proceedings of the International Conference on Automated Planning and Scheduling, vol. 29, pp. 146–154 (2019)
Kolobov, A., Weld, D., et al.: A theory of goal-oriented MDPs with dead ends. In: Proceedings of the Twenty-Eighth Conference on Uncertainty in Artificial Intelligence, pp. 438–447 (2012)
Kolobov, A., Weld, D.S., Geffner, H.: Heuristic search for generalized stochastic shortest path MDPs. In: Proceedings of the Twenty-First International Conference on International Conference on Automated Planning and Scheduling, pp. 130–137 (2011)
Kuo, I., Freire, V.: Probability-to-goal and expected cost trade-off in stochastic shortest path. In: Gervasi, O., Murgante, B., Misra, S., Garau, C., Blečić, I., Taniar, D., Apduhan, B.O., Rocha, A.M.A.C., Tarantino, E., Torre, C.M. (eds.) ICCSA 2021. LNCS, vol. 12951, pp. 111–125. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-86970-0_9
Little, I., Thiebaux, S., et al.: Probabilistic planning vs. replanning. In: ICAPS Workshop on IPC: Past, Present and Future, pp. 1–10 (2007)
Patek, S.D.: On terminating Markov decision processes with a risk-averse objective function. Automatica 37(9), 1379–1386 (2001)
Puterman, M.: Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley, New York (1994)
Sanner, S., Yoon, S.: IPPC results presentation. In: International Conference on Automated Planning and Scheduling (2011). http://users.cecs.anu.edu.au/ssanner/IPPC_2011/IPPC_2011_Presentation.pdf
Silver, T., Chitnis, R.: PDDLGym: Gym environments from PDDL problems. In: International Conference on Automated Planning and Scheduling (ICAPS) PRL Workshop, pp. 1–6 (2020). https://github.com/tomsilver/pddlgym
Teichteil-Königsbuch, F.: Stochastic safest and shortest path problems. In: Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence, pp. 1825–1831 (2012)
Teichteil-Königsbuch, F., Vidal, V., Infantes, G.: Extending classical planning heuristics to probabilistic planning with dead-ends. In: Proceedings of the Twenty-Fifth AAAI Conference on Artificial Intelligence, pp. 1017–1022 (2011)
Trevizan, F.W., Teichteil-Königsbuch, F., Thiébaux, S.: Efficient solutions for stochastic shortest path problems with dead ends. In: Proceedings of the Thirty-Third Conference on Uncertainty in Artificial Intelligence (UAI), pp. 1–10 (2017)
Acknowledgments
This study was supported in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES) - Finance Code 001, by the São Paulo Research Foundation (FAPESP) grant #2018/11236-9, and by the Center for Artificial Intelligence (C4AI-USP), with support from FAPESP (grant #2019/07665-4) and the IBM Corporation.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Crispino, G.N., Freire, V., Delgado, K.V. (2023). \(\alpha \)-MCMP: Trade-Offs Between Probability and Cost in SSPs with the MCMP Criterion. In: Naldi, M.C., Bianchi, R.A.C. (eds.) Intelligent Systems. BRACIS 2023. Lecture Notes in Computer Science, vol. 14195. Springer, Cham. https://doi.org/10.1007/978-3-031-45368-7_8
Print ISBN: 978-3-031-45367-0
Online ISBN: 978-3-031-45368-7