Abstract
Stochastic shortest path problems (SSPs) are Markov decision processes with goal states, in which the problem is to find policies that achieve the goal with the lowest possible expected cost. A common feature of this type of problem is the existence of states from which it is not possible to reach the goal; these states are called dead-ends. In this case, it is important to have methods that consider not only the cost, but also the probability of achieving the goal. In Reinforcement Learning (RL), the problem of making this trade-off between cost to goal and probability to goal has been little studied, and the common strategy to deal with SSPs with dead-ends is to use discounts. In some works, penalties are used during exploration in SSPs with dead-ends. However, using discounts and penalties can lead to errors in finding a policy with a desired trade-off. The GUBS criterion (Goals with Utility-Based Semantic) considers this type of trade-off without using discounts or penalties, has a clear semantics based on expected utility theory, and has been used to solve SSPs with dead-end states in the planning area. Thus, in this work, the GUBS criterion is used to propose the first two RL algorithms that make a trade-off between probability to goal and cost to goal without using discounts or penalties: Q-learning-GUBS and Q-learning-eGUBS+\(C_{max}\). Theoretical and experimental results show that the proposed algorithms make this trade-off according to the configuration of the GUBS parameters (code available at https://github.com/QlearningGubs/code).
1 Introduction
Stochastic shortest path problems (SSPs) are Markov decision processes (MDPs) with a set of goal states. In these problems, the objective is to find a policy with the lowest expected cumulative cost to reach these goals. In many cases, these problems have states from which it is not possible to reach the goal; these states are called dead-ends. In problems with dead-ends, the criterion of lowest expected cost becomes ill-defined when there is no proper policy (one that reaches the goal with probability 1) [6].
Several works try to solve SSPs with dead-ends; some of them seek to find the policy with the maximum probability of reaching the goal (MAXPROB) [7], while others give greater preference to probability and consider cost as a secondary aspect (lexicographic criteria) [6, 14]. However, it is not always advisable to find policies with the highest probability of achieving the goal, as they may have high costs. Sometimes it is better to make a trade-off between probability and cost, for example, decreasing the probability a little to obtain a large decrease in cost, or increasing the cost a little to obtain a large increase in probability; this characteristic is called the infinite-infinitesimal trade-off [1]. An example of this type of trade-off is illustrated by the River problem.
The River problem [4] considers an \(N_x \times N_y\) grid. The agent must cross a river, starting from an initial position \(s_0\) on one bank and reaching a destination position on the other bank. There is a bridge at position \(y = N_y\). The agent's actions move it in the four cardinal directions. The actions on the river banks (\(x = 1\) or \(x = N_x\)) and on the bridge are deterministic, while within the river they are probabilistic: there is a probability P of being dragged down the river, with the risk of reaching the waterfall located at \(y = 1\), where the agent becomes trapped or killed (a dead-end).
The optimal solution under the lexicographic criteria is to walk toward the bridge and traverse it, since this policy reaches the goal with probability 1, even if its cost is very high. In this problem, however, a solution with probability 0.9999 should be accepted if it yields a large decrease in cost.
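As an illustration, the River dynamics can be sketched as a single transition function. This is a minimal sketch: the coordinate convention, action names, drift model, and dead-end flag are our assumptions, not the exact encoding used in [4].

```python
import random

def river_step(state, action, Nx=5, Ny=20, P=0.4):
    """One transition of a sketched River gridworld.

    States are (x, y) positions; x=1 and x=Nx are the banks, the
    bridge is at y=Ny, and the waterfall (dead-end) is at y=1.
    Actions move in the four cardinal directions; inside the river
    the agent is dragged one cell toward the waterfall with
    probability P instead of moving as intended.
    """
    x, y = state
    moves = {"N": (0, 1), "S": (0, -1), "E": (1, 0), "W": (-1, 0)}
    dx, dy = moves[action]
    on_bank_or_bridge = x in (1, Nx) or y == Ny
    if not on_bank_or_bridge and random.random() < P:
        dx, dy = 0, -1  # dragged down the river
    nx = min(max(x + dx, 1), Nx)   # clip to the grid
    ny = min(max(y + dy, 1), Ny)
    dead_end = (ny == 1 and 1 < nx < Nx)  # waterfall cells
    return (nx, ny), dead_end
```

Under this sketch, moving along a bank is deterministic, while any move inside the river risks a drift toward the waterfall, which reproduces the infinite-infinitesimal trade-off described above.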
In Reinforcement Learning (RL), most theoretical works focus on finite-horizon and infinite-horizon settings or on loop-free SSPs [12]. Also, the problem of making this trade-off has been little studied for SSPs with dead-ends. Additionally, a common strategy to deal with SSPs with dead-end states in RL is to use discounts. Geibel and Wysotzki [5] consider an MDP with goal and error states (undesirable or dangerous states); the proposed RL solution is based on weighting the cost and the probability of entering error states, and uses a discount factor. Park [8] considers an MDP with goal and fail states; the proposed Q-learning-like algorithm records past failure experiences and uses them to guide exploration, and this work [8] also uses a discount factor. Using a discount factor can have the undesirable effect of biasing importance toward short-term behavior, thus diminishing the incentive to eventually reach the goal state [1, 12]. In some works [12], penalties are used in the exploration problem for SSPs with dead-ends. However, using penalties can also lead to errors in finding a policy with a desired trade-off, as formally demonstrated by [1].
The GUBS criterion (Goals with Utility-Based Semantic) [4] was proposed to make a correct infinite-infinitesimal trade-off in planning problems. This criterion allows us to find solutions for SSPs with dead-ends, with a parameterization whose meaning is grounded in expected utility theory. Thus, in this work, the GUBS criterion is used to propose the first two RL algorithms that make a trade-off between probability to goal and cost to goal without using discounts or penalties: Q-learning-GUBS and Q-learning-eGUBS+\(C_{max}\). Additionally, to the best of our knowledge, the proposed algorithms are the first to use expected utility theory to make this trade-off in RL, which allows more precise value criteria to be designed.
2 Problem Formalism and Background
A Stochastic Shortest Path problem (SSP) is defined by a tuple \(M= \langle S,A,P,c,\mathcal {G} \rangle \), where: S is a discrete and finite set of completely observable states that model the world; A is a finite set of actions; \(P:S \times A \times S \rightarrow [0,1]\) is a probabilistic transition function, where \(P(s,a,s') = Pr(s_{t+1} = s'|s_t = s, a_t = a)\) is the probability of transitioning from a state \(s \in S\) to a state \(s' \in S\) after executing an action \(a \in A\); \(c: S \times A \rightarrow \mathcal {R}\) is a cost function for each action \(a \in A\) in a state \(s \in S\); and \(\mathcal {G} \subset S\) is the set of goal states, i.e., for every \(g \in \mathcal {G}\), \(c(g,a)=0\) and \(P(g,a,g)=1\), \(\forall a \in A\), while \(c(s,a) > 0\), \(\forall s \in S \setminus \mathcal {G}\).
In a state \(s_t \in S\) of the environment at time t, the agent executes an action \(a_t \in A\) that has a cost \(c_t=c(s_t,a_t)\) and probabilistic effects, generating a future state \(s_{t+1}\) according to \(P(s_t,a_t,s_{t+1})\). Every interaction can be summarized in a history of length T, represented by \(h_T=\{s_1,a_1,c_1,s_2,a_2,c_2,s_3,\ldots ,s_{T-1},a_{T-1},c_{T-1},s_T\}\). The set of all possible histories is \(\mathcal {H}=(S \times A \times \mathcal {R}_{>0})^* \times S\).
In an SSP, the process follows a history \(h_T\) that ends in a goal state \(g \in \mathcal {G}\). The solution for an SSP is a policy \(\pi \), which can be stationary (\(\pi : S \rightarrow A\)), where each state s is mapped to an action a, or non-stationary (\(\pi : \mathcal {H} \rightarrow A\)), where a history \(h_t\) is mapped at time t to an action \(a_t\). This policy must be optimal (\(\pi ^*\)) under some criterion. A policy is called proper if the process reaches a goal state with probability 1 when it is followed; formally, \(\lim _{t\rightarrow \infty }Pr(s_t \in \mathcal {G}|\ \pi ,s_0)=1\), where \(s_0\) is the initial state of the process. The optimal value function \(V^*\) for an SSP is defined by the Bellman equation \(V^{*}(s) =\min _{a \in A} \{ c(s,a) + \displaystyle \sum _{s' \in S}P(s,a,s')V^{*}(s') \}\). The optimal policy \(\pi ^*(s)\) for state s is \(\pi ^*(s) = \arg \min _{a \in A}\{c(s,a)+\displaystyle \sum _{s' \in S}P(s,a,s')V^{*}(s')\}\).
A state s is called a dead-end state when the probability of reaching a goal state from s is 0; formally, \(\lim _{t\rightarrow \infty }Pr(s_t \in \mathcal {G}|\pi ,s_0 \in \mathcal {D})=0\), where \(\mathcal {D} \subset S \) is the set of dead-end states. Conventional SSPs do not consider dead-ends.
2.1 Criteria for SSPs with Dead-Ends
In environments with dead-ends, the agent must decide between finding a policy with a low expected total cost \(\overline{C}_G(s)\) from an initial state s to the goal or a policy with the maximum probability \(P_G(s)\) of achieving the goal. Several criteria have been proposed to solve SSPs with dead-ends, some of which prioritize the probability \(P_G(s)\) [7, 13,14,15]. Instead of prioritizing probability, an alternative is to make a trade-off between cost to goal and probability to goal. The fSSPUDE criterion [6] extends the original SSP with an artificial action q and a penalty D. When q is used, the process transitions to a goal state and the agent pays the cost D; therefore, \(P(s,q,g) = 1\), \(\forall s \in S\) and for a goal state g, and \(c(s,q) = D\), \(\forall s \in S\). The objective is then to find a policy that minimizes the expected accumulated cost \(\overline{C}_G(s) = \lim _{T\rightarrow \infty } E[\sum ^{T-1}_{t=0} c(s_t,a_t)| \pi ,s_0]\). Thus, fSSPUDE can give more or less importance to avoiding dead-ends by choosing an appropriate D. The discounted cost criterion uses a discount factor to allow policies with a probability of reaching the goal less than 1 to converge [13]; the objective is to minimize the expected discounted accumulated cost \(\overline{C}_G(s) = \lim _{T\rightarrow \infty } E[\sum ^{T-1}_{t=0} \gamma ^t c(s_t,a_t)| \pi ,s_0]\).
The penalty D used in fSSPUDE and the discount factor used in the discounted cost criterion depend on the SSP. Because it is necessary to know the structure of the SSP, both values are difficult to determine so as to obtain the desired trade-off between probability and cost [1]. Besides, these criteria do not have the \(\alpha -\)strong probability-to-goal property for \(\alpha > 0\) [1].
Definition 1
(\(\alpha -\) strong probability-to-goal [1]). For \(0 \le \alpha \le 1\), a criterion has \(\alpha -\)strong probability-to-goal if for every SSP and for any pair of policies \(\pi , \pi ' \in \varPi \), the following condition is true: \(\pi \succ \pi ' \Longrightarrow \frac{P^{\pi }_G(s_0)}{P^{\pi '}_G(s_0)} \ge \alpha \).
Where \(\pi \succ \pi '\) means that the decision-maker prefers \(\pi \) to \(\pi '\).
2.2 Goals with Utility-Based Semantic-GUBS
A criterion that solves SSPs with dead-ends with a clear trade-off between cost and probability of achieving the goal is the GUBS criterion [4]. GUBS is based on utility theory and evaluates policies based on:

\(V^{\pi }(s_0) = \lim _{T\rightarrow \infty } E\left[ u(C_T) + K_g\beta _T\ |\ \pi ,s_0\right] \)   (1)

where u(.) is a utility function, \(C_T = \sum ^{T-1}_{t=0}c(s_t,a_t)\) is the accumulated cost up to time T, \(K_g \) is the reward for reaching a goal state, and \(\beta _T\) is 1 if a goal has been reached by time T and 0 otherwise.
Unlike the discounted cost criterion and fSSPUDE, in which the choice of penalty and discount factor depends on the SSP, the value \(K_g\) does not depend on the SSP. Additionally, GUBS has the \(\alpha -\)strong probability-to-goal property [1]. Some planning algorithms have been proposed to compute the GUBS criterion: GUBS-VI [4], eGUBS-VI [3] and eGUBS-AO* [1]. The GUBS-VI algorithm, which is based on the Value Iteration (VI) algorithm, solves GUBS for discrete costs and takes as a parameter the maximum accumulated cost \(C_{max}\) for every state, including dead-end states, which are assigned a cost of \(C_{max}\). Since the algorithm needs to compute the utility based on the accumulated cost, the policy found is non-stationary. To be able to use VI, extended states that include the cost accumulated so far must be used. The formulation with discrete costs and extended states for GUBS is presented next.
Definition 2
(Discrete Cost [4]). An SSP with dead-ends can be restated as an MDP with bounded discrete costs \(M= \langle X,A,P',c',R'_g \rangle \), where: \(X = {S} \times \mathbb {N}\) is an extended state space, whose elements are states paired with the accumulated cost so far; \(P'(x,a,x')=P(s,a,s')\), where \(x = (s,c)\), \(x'=(s',c+c(s,a))\), \(x,x' \in X\) and c is the accumulated cost; \(c'(x,a)=u(c+c(s,a))-u(c)\) for a utility function u(.); and \(R'_g = K_g\) is the terminal reward for the goal states, with value \(K_g \in \mathbb {R}\).
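The utility-difference cost \(c'(x,a)\) of Definition 2 can be computed directly from the accumulated cost. The sketch below assumes the exponential utility \(u(c) = e^{\lambda c}\) with risk factor \(\lambda < 0\) used by eGUBS; the specific value of \(\lambda \) is illustrative.

```python
import math

def u_exp(c, lam=-0.1):
    """Exponential utility over accumulated cost; lam < 0 is the
    (illustrative) risk factor."""
    return math.exp(lam * c)

def extended_cost(x, step_cost, lam=-0.1):
    """c'(x, a) = u(c + c(s,a)) - u(c) for an extended state
    x = (s, c), following Definition 2."""
    s, c = x
    return u_exp(c + step_cost, lam) - u_exp(c, lam)
```

Since \(\lambda < 0\) and step costs are positive, \(c'(x,a)\) is negative and shrinks in magnitude as the accumulated cost grows, which is what bounds the total utility.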
The GUBS-VI algorithm uses Definition 2 to calculate the Q-value:

\(Q(x,a) = c'(x,a) + \sum _{x' \in X}P'(x,a,x')\max _{a' \in A}Q(x',a')\)   (2)

with \(Q(x,a) = K_g\) for every extended goal state x.
A special case of the GUBS criterion is the Exponential GUBS (eGUBS) criterion. In this case, the utility function is the exponential function used in risk-sensitive SSPs (RS-SSPs) [9], in which the risk factor \(\lambda \), which models the agent's risk attitude, is negative. The eGUBS criterion is defined in terms of the lexicographic criterion.
Definition 3
(Risk-Sensitive Lexicographic criterion [1, 3]). The Risk-Sensitive Lexicographic criterion considers:
- If \(P_G^{\pi }(s) > P_G^{\pi '}(s)\), then \(\pi \succ \pi '\); or
- If \(P_G^{\pi }(s) = P_G^{\pi '}(s)\) and \(V^{\pi }_{\lambda } > V^{\pi '}_{\lambda }\), then \(\pi \succ \pi '\).
The eGUBS-VI algorithm [3] solves the eGUBS criterion exactly. eGUBS-VI calls the Risk-Sensitive-Lexicographic-VI algorithm [3], which receives an SSP and a \(\lambda \) and calculates the optimal policy \(\pi ^*_l\), the optimal value \(V_{\lambda }\), and the maximum probability to goal \(P_G\) for the lexicographic criterion described in Definition 3. We use Risk-Sensitive-Lexicographic-VI as part of one of the algorithms proposed in this work.
The eGUBS-AO* algorithm optimizes the search for the solution by considering a finite, acyclic hypergraph whose nodes are augmented states (s, c) and whose hyperedges lead (s, c) to its successor states \((s',c')\), where \(s'\) is a successor state in the original MDP and \(c' = c + c(s,a)\) for the performed action a. eGUBS-AO* solves eGUBS by means of a heuristic search with a variant of the AO* algorithm.
2.3 Reinforcement Learning
In RL, it is common to divide the process into episodes (represented by e). An episode can end in several ways, for example, when a goal state, a dead-end, or a pre-defined time limit is reached. The Q-Learning algorithm [16] is a well-known RL algorithm. The update of the Q-value for a state \(s_t \in S\) when performing action \(a_t \in A\) at time t is:

\(Q_{t+1}(s_t,a_t) = (1-\alpha _t)Q_t(s_t,a_t) + \alpha _t\left[ c_t + \gamma \min _{a \in A}Q_t(s_{t+1},a)\right] \)   (3)

where \(\alpha _t\) is the learning rate (\(0\le \alpha _t \le 1\)) that determines the learning speed, \(\gamma \) is the discount factor on future values, and \(s_{t+1}\) is the next state after executing action \(a_t\).
Theorem 1
(Convergence of the Q-learning [16]) Let the costs in each time t be \(c_t\le {C}\), learning rates \(0 \le \alpha _t < 1 \), and \(\sum _{t=1}^{\infty }\alpha _t(s,a) = \infty ,\sum _{t=1}^{\infty }(\alpha _t(s,a))^2 < \infty \) then \(Q_t(s,a)\rightarrow Q^*(s,a)\) as \(t\rightarrow \infty \), for every state s and action a, with probability 1.
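The tabular Q-learning update for cost minimization described above can be sketched as follows; the dictionary layout for the Q table is an illustrative assumption.

```python
def q_update(Q, s, a, cost, s_next, actions, alpha, gamma):
    """One temporal-difference step of Q-learning with costs
    (minimization). Q is a dict keyed by (state, action)."""
    # Bootstrap on the best (lowest-cost) action in the next state.
    best_next = min(Q[(s_next, b)] for b in actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (cost + gamma * best_next)
    return Q[(s, a)]
```

For example, with \(Q(s_0,a)=0\), a step cost of 1, \(\alpha =0.5\), \(\gamma =0.9\) and a best next value of 1, the updated value is \(0.5\,(1 + 0.9 \cdot 1) = 0.95\).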
The algorithm proposed by Geibel and Wysotzki [5] is one of the few RL methods designed specifically to perform a trade-off between cost and the probability of reaching an error state. Although this algorithm uses the concept of probability of reaching an error state, this concept is closely related to the probability of reaching the goal used in this work.
Let \(\varPhi \) be the set of error states. The probability of reaching an error state (also called risk) from an initial state s is defined as \(\rho ^{\pi }(s) = E\left[ \sum ^{\infty }_{t=0}\bar{\gamma }^t\bar{r}_t|s_0=s\right] \), where \(\bar{\gamma }\) is the discount factor for the risk and \(\bar{r}_{s,a}(s') = {\left\{ \begin{array}{ll} 1, & \text {if} \,s\in \varPhi \, \text {and}\, s' = \eta \\ 0, & \text {otherwise} \end{array}\right. }\), where \(\eta \) is a fictitious absorbing state to which the agent transitions after reaching an error state. With these concepts, a value for the risk is defined as \(\bar{Q}^{\pi }(s,a) = E\left[ \bar{r}_0 + \bar{\gamma }\rho ^{\pi }(s')\right] \), where \(s_0 = s, a_0 = a\) and \(s'\) is the next state. Then a new general value is defined as \(Q^{\pi }_{\xi }(s,a) = \xi Q^{\pi }(s,a) - \bar{Q}^{\pi }(s,a)\). The trade-off between the probability of reaching an error state and the cost to goal is determined by varying the weight \(\xi \).
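A greedy action choice under the weighted criterion \(Q^{\pi }_{\xi }\) can be sketched as follows. The dictionary layout is an illustrative assumption: `Q` holds values to be maximized and `Q_bar` holds the risk estimates.

```python
def geibel_wysotzki_action(Q, Q_bar, s, actions, xi):
    """Greedy action under Q_xi(s,a) = xi * Q(s,a) - Qbar(s,a).

    Small xi makes the risk term dominate (avoid error states);
    large xi makes the value term dominate.
    """
    return max(actions, key=lambda a: xi * Q[(s, a)] - Q_bar[(s, a)])
```

Varying `xi` over several orders of magnitude, as done in Sect. 4.3, shifts the chosen policy between risk-averse and value-seeking behavior.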
3 RL Under the GUBS Criterion
As shown in the previous section, the GUBS criterion is the only known criterion that guarantees the \(\alpha \)-strong probability-to-goal property. In this work, RL algorithms are introduced to solve the GUBS criterion. Some challenges faced in the development of RL algorithms that use the GUBS criterion are: (1) using extended states with cost, since GUBS finds non-stationary policies, something that is not the default behavior of RL algorithms; (2) defining the type of the initial and final policies used in the algorithms; (3) adapting the equations of the GUBS-VI algorithm [4] to temporal-difference updates that do not use the transition function a priori; and (4) defining or computing the maximum accumulated cost \(C_{max}\) used by the GUBS algorithms. The algorithms introduced are Q-learning-GUBS and Q-learning-eGUBS+\(C_{max}\). These algorithms do not need to know transition probabilities or costs, and do not use discounts or penalties to find a solution under the GUBS criterion. They are inspired by two well-known temporal-difference algorithms: Q-Learning and Dyna-Q (a model-based algorithm).
3.1 Q-Learning-GUBS
The Q-learning-GUBS algorithm is based on the Q-learning [16] and GUBS-VI [4] algorithms. Like GUBS-VI, Q-learning-GUBS solves GUBS for discrete costs, can use several utility functions (for example, the exponential function and others that satisfy the properties defined in [1]), and is an approximate algorithm that takes as a parameter the maximum accumulated cost \(C_{max}\) for every state.
Based on Eqs. 3 and 2, the following is the modified Bellman update to solve the discrete cost problem (Definition 2):

\(Q(x_t,a_t) \leftarrow (1-\alpha _t)Q(x_t,a_t) + \alpha _t\left[ c'_{t+1} + \max _{a \in A}Q(x_{t+1},a)\right] \)   (4)

where the Q-value is calculated for each extended state \(x_t = (s_t,c_t)\) considering the utility difference \(c'_{t+1}\) of Definition 2. Notice that Eq. 4 does not use a discount factor. Furthermore, Q-learning-GUBS uses a random policy as the initial policy and computes a non-stationary final policy.
The Q-learning-GUBS algorithm (Algorithm 1) first initializes the Q table, assigning as value the difference between the minimum utility and the utility for each accumulated cost c for states that are not goal states; for the goal states, the algorithm sets the Q-value to \(K_g\) (line 25). Then, the algorithm iterates over a set of episodes in which it computes \(\alpha _e\) and \(\epsilon _e\) (line 4) and gets the initial extended state \(x_0\) (line 5). At each step, an action a is chosen for the current state s following an \(\epsilon _e\)-greedy policy; the agent acts on the environment and observes the cost of the action c(s, a) and the next state \(s'\) (lines 11 and 12). For goal states \(s \in \mathcal {G}\), \(c(s,a) = 0\) (line 14). With these data, the Q-value is updated (line 18) for each discrete accumulated cost \(c'' \in \{0,\ldots ,C_{max}\}\) based on Eq. 4. The algorithm moves to the next episode (i) when a goal state is reached, (ii) when the accumulated cost is greater than or equal to \(C_{max}\), or (iii) when a dead-end is found (line 19). Finally, the algorithm computes the GUBS policy from the Q table, choosing the action with the highest value for each extended state (line 25).
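The main loop of Algorithm 1 can be sketched as follows. This is a simplified sketch, not the exact algorithm: the environment interface (`reset`/`step`), the zero initialization of the Q table, the handling of the terminal reward \(K_g\), and the fixed \(\alpha \) and \(\epsilon \) are our assumptions, and the update is applied only to the visited accumulated cost rather than to every \(c'' \in \{0,\ldots ,C_{max}\}\).

```python
import random

def q_learning_gubs(env, actions, u, K_g, C_max, episodes,
                    alpha=0.1, eps=0.1):
    """Sketch of Q-learning-GUBS over extended states (s, c).

    Assumes env.reset() returns an initial state and env.step(s, a)
    returns (cost, next_state, reached_goal, dead_end), with positive
    integer costs; u is a utility function over accumulated cost.
    No discount factor is used.
    """
    Q = {}
    def q(x, a):  # lazily initialized Q table (assumption: zeros)
        return Q.setdefault((x, a), 0.0)

    for _ in range(episodes):
        s, c = env.reset(), 0
        while True:
            x = (s, c)
            if random.random() < eps:            # eps-greedy policy
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda b: q(x, b))
            cost, s2, goal, dead = env.step(s, a)
            if goal:
                target = K_g                     # terminal reward
            else:
                # utility-difference cost c'(x,a) = u(c+cost) - u(c),
                # no discounting, bootstrapping on the best next value
                target = (u(c + cost) - u(c)
                          + max(q((s2, min(c + cost, C_max)), b)
                                for b in actions))
            Q[(x, a)] = (1 - alpha) * q(x, a) + alpha * target
            s, c = s2, c + cost
            if goal or dead or c >= C_max:       # episode termination
                break
    return Q
```

On a trivial one-step environment whose only action reaches the goal, the learned value of the initial extended state approaches \(K_g\), as expected from the terminal-reward semantics.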
Q-learning-GUBS convergence. The algorithm finds an approximate policy for the problem regardless of the utility function used. The proof of convergence is based on the following definition and theorems.
Definition 4
(SSPs with discrete cost and maximum accumulated cost \(C_{max}\)). The SSP of Definition 2 can be restated into a new SSP with the same elements but with a finite set of states \(X' \subseteq X\) where, \(X'= \{x | x = (s,c), c \le C_{max} \}\) and c is the accumulated cost.
Theorem 2
(Q-learning-GUBS convergence). Let M be the SSP of Definition 4 where, in the algorithm, \(0 \le \alpha _t < 1 \), \(\sum ^{\infty }_{t=1} \alpha _t = \infty \) and \(\sum ^{\infty }_{t=1} \alpha _t^2<\infty \). As the number of episodes \(n \rightarrow \infty \), Q-learning-GUBS converges to the optimal value \(Q'^*(x,a)\) for M.
Proof
Since the SSP of Definition 4 has a finite number of states, it is possible to use the same proof of convergence as for the Q-Learning algorithm (Theorem 1). Thus, Algorithm 1 converges to the optimal value \(Q'^*(x,a)\) for the SSP of Definition 4.

Theorem 3
(The Q-learning-GUBS algorithm is approximate). Let M be the SSP of Definition 2 where \(0 \le \alpha _t < 1 \), \(\sum ^{\infty }_{t=1} \alpha _t = \infty \) and \(\sum ^{\infty }_{t=1} \alpha _t^2<\infty \), and let the number of episodes \(n \rightarrow \infty \); then:
1. Q-learning-GUBS converges to the value \(Q'^*(x,a)\) where \(Q^*(x,a) - Q'^*(x,a) \ge 0\) for every extended state x and action a, and \(Q^*(x,a)\) is the optimal value for the problem in Definition 2;
2. \(\lim _{C_{max} \rightarrow \infty }Q'^*(x,a) = Q^*(x,a)\).
Proof
The Q-learning-GUBS algorithm restates the original SSP M (Definition 2) into an SSP \(M'\) with maximum cost \(C_{max}\) (Definition 4). Since \(M'\) has a set of states \(X' \subseteq X\), we have \(Q^*(x,a) - Q'^*(x,a) \ge 0\). Let \(Q^{\pi *}\) be the value of the optimal policy of the original SSP evaluated in the SSP of Definition 4; then \(Q^*(x,a) \ge Q'^*(x,a) \ge Q^{\pi *}(x,a)\). The difference between these values is given by the traces that do not reach the goal before cost \(C_{max}\). The greater the value \(C_{max}\), the lower the probability of such traces; therefore \(Q^*(x,a) \rightarrow Q^{\pi *}(x,a)\) and \(Q^{\pi *}(x,a) \rightarrow Q'^*(x,a)\). Thus \(\lim _{C_{max} \rightarrow \infty }Q'^*(x,a) = Q^*(x,a)\).
3.2 Q-Learning-eGUBS+\(C_{max}\)
Q-learning-eGUBS+\(C_{max}\) is based on the GUBS-VI, eGUBS-VI, Q-learning and Dyna-Q [11] algorithms. Like eGUBS-VI, Q-learning-eGUBS+\(C_{max}\) computes a lexicographic policy as an initial policy and calculates a non-stationary final policy. Although Q-learning-eGUBS+\(C_{max}\) is based on the GUBS-VI algorithm, which solves the GUBS criterion, it calculates the lexicographic policy with the Risk-Sensitive-Lexicographic-VI algorithm [3], which is based on the exponential utility function. Thus, Q-learning-eGUBS+\(C_{max}\) solves the eGUBS criterion.
Unlike Q-learning-GUBS, Q-learning-eGUBS+\(C_{max}\) can compute the \(C_{max}\) value automatically. In the Q-learning-eGUBS+\(C_{max}\) algorithm, a lexicographic policy \(\pi _l\) is found based on a model of the transition probabilities \(\hat{P}\) and of the action costs \(\hat{C}\) of the SSP, which is recomputed every f episodes with the Risk-Sensitive-Lexicographic-VI algorithm [3]. As in [11] and [2], the probability model is estimated by \(\hat{P}(s,a,s') = \frac{N(s,a,s')}{\sum _{s''}N(s,a,s'')}\), and the cost model by \(\hat{C}(s,a) = \frac{\sum c(s,a)}{\sum _{s''}N(s,a,s'')}\), where \(N(s,a,s')\) is the number of times the agent is in state s, performs action a, and transitions to state \(s'\), and \(\sum c(s,a)\) is the sum of the costs observed for the pair (s, a).
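The maximum-likelihood model estimation above can be sketched as follows; the count and cost-sum containers are illustrative assumptions.

```python
from collections import defaultdict

def update_model(counts, cost_sums, s, a, cost, s2):
    """Record one observed transition (s, a, cost, s2) in the
    visit counts N(s,a,s') and the cost sums for (s, a)."""
    counts[(s, a, s2)] += 1
    cost_sums[(s, a)] += cost

def estimate(counts, cost_sums, s, a, states):
    """Return (P_hat(s,a,.), C_hat(s,a)):
    P_hat(s,a,s') = N(s,a,s') / sum_{s''} N(s,a,s'') and the
    analogous average for the observed costs."""
    total = sum(counts[(s, a, s2)] for s2 in states)
    if total == 0:                 # pair (s, a) never visited
        return {}, 0.0
    P_hat = {s2: counts[(s, a, s2)] / total for s2 in states}
    C_hat = cost_sums[(s, a)] / total
    return P_hat, C_hat
```

For example, after observing the transition \((s_0, a) \rightarrow s_1\) twice and \((s_0, a) \rightarrow s_2\) once, each with cost 1, the estimates are \(\hat{P}(s_0,a,s_1) = 2/3\) and \(\hat{C}(s_0,a) = 1\).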
To compute the \(C_{max}\) value, after calculating the lexicographic policy \(\pi _l\), the Q-learning-eGUBS+\(C_{max}\) algorithm evaluates the expected cost to goal \(\overline{C}^{\pi _l}_G(s)\) of the policy \(\pi _l\) iteratively for each state s, without considering the infinite costs of the dead-end states:

\(\overline{C}^{\pi _l}_G(s) = \hat{C}(s,\pi _l(s)) + \sum _{s' \in S \setminus \mathcal {D}}\hat{P}(s,\pi _l(s),s')\overline{C}^{\pi _l}_G(s')\)   (5)
The \(C_{max}\) value is set to the expected cost of the lexicographic policy from the initial state \(s_0\) to the goal:

\(C_{max} = \overline{C}^{\pi _l}_G(s_0)\)   (6)
Since the steps for computing the lexicographic policy and \(C_{max}\) are expensive in terms of processing time, and this cost depends on the number of states and the type of problem, these steps are performed only in some episodes.

Q-learning-eGUBS+\(C_{max}\) (Algorithm 2), in addition to the inputs of the Q-learning-GUBS algorithm, receives as input the minimum error \(\varepsilon _l\) used in the Risk-Sensitive-Lexicographic-VI algorithm and the number f of episodes needed before recalculating the lexicographic policy. After initializing the values of Q, N(s, a), \(N(s,a,s')\), \(\hat{c}(s,a)\), \(\hat{C}(s,a)\) and \(\hat{P}(s,a,s')\), Algorithm 2 performs a set of episodes. In each episode, it performs the same operations to interact with the environment and to update the Q-value as in the Q-learning-GUBS algorithm. Furthermore, every f episodes, Algorithm 2 recomputes the model of probabilities \(\hat{P}\) and costs \(\hat{c}\) based on the counts of visits to states, N(s, a) and \(N(s,a,s')\), and on the accumulated cost of those visits, \(\hat{C}(s,a)\) (lines 8 and 9). It uses these models to calculate the lexicographic policy (line 11) using the Risk-Sensitive-Lexicographic-VI algorithm [3]. The value \(C_{max}\) is calculated in line 12 based on Eq. 6. Algorithm 2 computes the GUBS value in line 25 based on Eq. 4. Finally, based on the GUBS value, the optimal GUBS policy \(\pi ^*\) is computed in line 26. The counts N(s, a) and \(N(s,a,s')\) and the accumulated cost \(\hat{C}(s,a)\) used in the model calculation are updated in each episode (line 27).
4 Experiments
The experiments were carried out in two domains: the River domain [4] (see Sect. 1) and the Navigation domain. In the Navigation domain [10], a robot walks in a grid of dimensions \(N_x \times N_y\). The robot starts from an initial position \(s_0\) and tries to reach a destination position. In the first and last lines (\(y = 1\) and \(y = N_y\)) the actions are deterministic; in the intermediate lines, the agent has a probability P(x, y) of disappearing, going to a dead-end state. The probability of disappearing is greater the closer the agent is to the goal.
Although these domains are simple, they have characteristics that make them difficult: uncertainty, risk, and complex decision making. They were modeled as SSPs with extended states, which include the position and the accumulated cost. Actions correspond to movement decisions and have cost 1. In the River domain, the experiments were carried out in an environment of size \(N_y = 20\) and \(N_x=5\); the initial state is on one bank of the river (\(x_0 = 1, y_0 = 2\)), the goal is on the other bank at position \(x_g = N_x, y_g = 1\), and two probability values were used, \(P=0.4\) and \(P=0.6\). These are the same values of P used by [4], and they induce different preferences for proper policies. In the Navigation domain, the initial state is at \(x_0 = N_x, y_0 = N_y\), the goal is at position \(x_g = N_x, y_g = 1\), and two instances were used: one of size \(N_x = 15 \times N_y = 3\) with a minimum probability of disappearing of 0.2, and another of size \(N_x = 10\) \(\times \) \(N_y = 4\) with a minimum probability of disappearing of 0.1. Table 1 shows the parameters used by the algorithms. A function with linear decay was used to vary \(\alpha \) and \(\epsilon \) from their respective maximum values to their minimum values. The initial extended state was chosen randomly for each episode. Table 2 shows the parameters used for the River and Navigation problems.
In Sects. 4.1 and 4.2, the proposed algorithms are compared with the GUBS-VI algorithm. This choice is because the proposed algorithms are based on GUBS-VI; the purpose of the comparison is to verify whether the proposed algorithms find the same results obtained by GUBS-VI. Theoretically, the proposed algorithms should obtain the same result; however, since they use sampling, they may only obtain a very close solution within the number of episodes used. In the figures shown in Sects. 4.1 and 4.2, the x-axis shows the values of \(K_g\) in base-2 logarithm and the y-axis shows the probability and cost of reaching the goal from the initial state. Each point in the figures is the average of 5 executions of the algorithms; the standard deviation is also shown. We also recorded the learning curves of these experiments; these results show that the algorithm that converges fastest is Q-learning-GUBS. We do not include these curves for space reasons.
In Sect. 4.3, we compare the proposed algorithms with the Geibel and Wysotzki criterion [5]. Note that the GUBS criterion was compared with other criteria (fSSPUDE and discounted cost) by [1].
4.1 Results of the River Domain
Figures 1a) and 1b) show the probability to goal obtained by the GUBS-VI algorithm and by the proposed algorithms Q-learning-GUBS and Q-learning-eGUBS+\(C_{max}\) in the River problems with \(P=0.4\) and \(P=0.6\), respectively. The probabilities obtained by the proposed algorithms are very close to those obtained by GUBS-VI for both instances of the domain. The proper policy for the \(P=0.4\) instance is reached when \(K_g > 16\) (\(\log (K_g)=4\)), and the proposed algorithms find policies with probabilities increasingly closer to 1. For the instance with \(P=0.6\), the optimal policy when \(K_g \ge 0.5\) (\(\log (K_g)=-1\)) is a proper policy (i.e., crossing the bridge); in this case, the proposed algorithms find the proper policy. Figures 1c) and 1d) compare the costs from the initial state obtained by the three algorithms for the same instances of the River domain. The results of the algorithms are very close for both instances.
4.2 Results of the Navigation Domain
Figures 2a) and 2b) show the probabilities to goal obtained by the three algorithms in the two described instances of the Navigation domain. The probabilities found by the proposed algorithms are very close to the probabilities found by the GUBS-VI algorithm for both instances of the problem. However, there is a larger standard deviation for values of \(K_g\) that have policies close to MAXPROB and correspond to the furthest route from the goal (the probability for the goal is 0.8 for the instance \(N_x = 15 \times N_y = 3\), and 0.81 for the instance \(N_x = 10 \times N_y = 4\)).
The algorithms find a policy according to \(K_g\). For values \(K_g \ge 0.3\) (\(\log (K_g) = -1.7\)) in the instance \(N_x = 15 \times N_y = 3\) and \(K_g \ge 0.05\) (\(\log (K_g) = -4.3\)) in the instance \(N_x = 10 \times N_y = 4\), the Q-learning-GUBS algorithm finds the MAXPROB policy. However, due to the parameters used, the Q-learning-eGUBS+\(C_{max}\) algorithm cannot find the MAXPROB policy in the instance \(N_x = 15 \times N_y = 3\) in any of the 5 executions.
Figures 2c) and 2d) compare the costs reached from the initial state by the three algorithms for the same two instances of the Navigation domain. The policy costs calculated by the Q-learning-eGUBS+\(C_{max}\) algorithm are very close to the policy costs calculated by the algorithm GUBS-VI.
The Q-learning-GUBS algorithm finds policies with higher costs than those found by the GUBS-VI algorithm, especially for large values of \(K_g\). We analyze this behavior next. When \(K_g\) is large, the Q-learning-GUBS algorithm oscillates more, especially while learning the probabilities in states with uncertainty; it is difficult to reach the exact value of the actions since, in addition, the differences between action values are very small in this problem. The found policy then uses actions with values close to, but not equal to, the optimal values for some extended states. This causes the formation of cycles, in which the policy can return to the same real state but with a higher accumulated cost, increasing the total cost to the goal. As far as we have analyzed, by changing the parameters, for example by increasing the number of episodes, it is possible to reach the right cost.
4.3 Comparison with the Geibel and Wysotzki Criterion
We compare the proposed algorithms with the Geibel and Wysotzki criterion [5]. For this comparison, we varied the parameter \(\xi \) in the range \([0.1, 1 \times 10^7]\) to evaluate whether the Geibel and Wysotzki criterion can compute the optimal policies for eGUBS by choosing the parameter appropriately and, if not, how close it can get to the optimal value. The discounts \(\gamma = 0.9\) and \(\bar{\gamma } = 1\) were used.
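The parameter sweep can be summarized as follows. This is a minimal sketch assuming hypothetical `solve_gw` and `gubs_value` callables (a Geibel-Wysotzki solver and a GUBS policy evaluator); neither name comes from the paper's code.

```python
import numpy as np

def sweep_xi(solve_gw, gubs_value, lo=0.1, hi=1e7, n=25):
    """Evaluate Geibel-Wysotzki policies under GUBS for log-spaced xi.

    solve_gw(xi, gamma) -> policy and gubs_value(policy) -> float are
    hypothetical callables standing in for the two criteria.
    """
    xis = np.logspace(np.log10(lo), np.log10(hi), n)
    return [(xi, gubs_value(solve_gw(xi, gamma=0.9))) for xi in xis]
```

Plotting the returned pairs against \(\log _2(\xi )\) gives a curve of the shape shown in Fig. 3; the point of the experiment is precisely that such a sweep is needed, since no closed-form choice of \(\xi \) is known.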
Figure 3 shows \(\log _2(\xi )\) on the x-axis and, on the y-axis, the values corresponding to the optimal policy found by the GUBS-VI, Q-learning-GUBS and Q-learning-eGUBS+\(C_{max}\) algorithms with \(K_g = 5\) and \(C_{max} = 100\), and the values of the optimal policies found by the Geibel and Wysotzki criterion evaluated under the GUBS criterion for the River domain of size \(N_y = 20\) and \(N_x=5\) with \(P=0.4\). Each point in the figure is the average of 5 executions of the algorithms.
The values found by the algorithms Q-learning-eGUBS+\(C_{max}\), Q-learning-GUBS and GUBS-VI are close. When \(\log _2(\xi ) = 1\), the values of the policies found by the Geibel and Wysotzki criterion are close to the ones found by GUBS-VI, Q-learning-eGUBS+\(C_{max}\) and Q-learning-GUBS.
Notice that, to carry out this experiment, we had to sweep an open-ended set of values for \(\xi \), because there is no clear procedure for choosing a value of \(\xi \) under this criterion that guarantees finding the policy that would result under GUBS. In some situations, carrying out this kind of procedure to guarantee finding certain policies may not be feasible.
5 Conclusion
In this work, we propose the first two RL algorithms that use the GUBS criterion to make a trade-off between probability to goal and cost to goal: Q-learning-GUBS and Q-learning-eGUBS+\(C_{max}\). Q-learning-GUBS is an approximate algorithm that can use several utility functions. Q-learning-eGUBS+\(C_{max}\) is a model-based algorithm that computes the \(C_{max}\) value for the particular case of the exponential utility function.
Experiments were carried out with the proposed algorithms on several instances of two domains. The results show that these algorithms can find policies that make a trade-off between probability and cost according to the parameter \(K_g\) and to the utility function used. We also compared the proposed algorithms with the Geibel and Wysotzki criterion [5]. For some parameter values, the values of the policies found by the Geibel and Wysotzki criterion are close to those found by the proposed algorithms. Unlike the Geibel and Wysotzki criterion, which is based on \(\xi \) and discount factors whose choice depends on the problem, the GUBS parameters (\(K_g\), \(U_{max}\) and \(U_{min}\)) do not depend on the problem, so they can be set rationally.
The proposed algorithms could be the basis for designing more sophisticated RL algorithms that can work in more complex domains.
Notes
- 1.
Notice that \(c'\) is a utility over costs and the optimal policy must maximize this utility.
References
Crispino, G.N., Freire, V., Delgado, K.V.: GUBS criterion: arbitrary trade-offs between cost and probability-to-goal in stochastic planning based on expected utility theory. Artif. Intell. 316(C) (2023)
Faycal, T., Zito, C.: Dyna-t: dyna-q and upper confidence bounds applied to trees. arXiv abs/2201.04502 (2022)
Freire, V., Delgado, K.V., Reis, W.A.S.: An exact algorithm to make a trade-off between cost and probability in SSPs. In: International Conference on Automated Planning and Scheduling 2019, vol. 29, pp. 146–154 (2019)
Freire, V., Delgado, K.V.: GUBS: a utility-based semantic for goal-directed Markov decision processes. In: International Conference on Autonomous Agents and Multiagent Systems, pp. 741–749 (2017)
Geibel, P., Wysotzki, F.: Risk-sensitive reinforcement learning applied to control under constraints. J. Artif. Intell. Res. 24, 81–108 (2005)
Kolobov, A., Mausam, Weld, D.S.: A theory of goal-oriented MDPs with dead ends. In: Conference on Uncertainty in Artificial Intelligence, pp. 438–447. AUAI Press, Arlington, Virginia, USA (2012)
Kolobov, A., Mausam, M., Weld, D., Geffner, H.: Heuristic search for generalized stochastic shortest path MDPs. In: International Conference on Automated Planning and Scheduling 2011, vol. 21, pp. 130–137 (2011)
Park, I.W., Kim, J.H., Park, K.H.: Accelerated Q-learning for fail state and action spaces. In: 2008 IEEE International Conference on Systems, Man and Cybernetics, pp. 763–767 (2008)
Patek, S.D.: On terminating Markov decision processes with a risk-averse objective function. Automatica 37(9), 1379–1386 (2001)
Sanner, S., Yoon, S.: IPPC results presentation. In: International Conference on Automated Planning and Scheduling (2011). http://users.cecs.anu.edu.au/ssanner/IPPC_2011/IPPC_2011_Presentation.pdf. Accessed Aug 2024
Sutton, R.S.: Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In: International Conference on Machine Learning, pp. 216–224. Morgan Kaufmann (1990)
Tarbouriech, J.: Goal-oriented exploration for reinforcement learning. Ph.D. thesis, Université de Lille (2022)
Teichteil-Konigsbuch, F., Vidal, V., Infantes, G.: Extending classical planning heuristics to probabilistic planning with dead-ends. In: AAAI Conference on Artificial Intelligence, pp. 1017–1022. AAAI Press (2011)
Teichteil-Königsbuch, F.: Stochastic safest and shortest path problems. In: AAAI Conference on Artificial Intelligence, vol. 26, pp. 1825–1831 (2012)
Trevizan, F.W., Teichteil-Königsbuch, F., Thiébaux, S.: Efficient solutions for stochastic shortest path problems with dead ends. In: Conference on Uncertainty in Artificial Intelligence (2017)
Watkins, C.J.C.H., Dayan, P.: Technical note: Q-learning. Mach. Learn. 8(3–4), 279–292 (1992)
Acknowledgments
This study was supported in part by the Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq) – Finance Code 001, by the São Paulo Research Foundation (FAPESP) grant #2018/11236-9 and the Center for Artificial Intelligence (C4AI-USP), with support by FAPESP (grant #2019/07665-4) and by the IBM Corporation.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Polar, C.D., Delgado, K.V., Freire, V. (2025). Reinforcement Learning with Utility-Based Semantic for Goals. In: Paes, A., Verri, F.A.N. (eds) Intelligent Systems. BRACIS 2024. Lecture Notes in Computer Science(), vol 15413. Springer, Cham. https://doi.org/10.1007/978-3-031-79032-4_25
Print ISBN: 978-3-031-79031-7
Online ISBN: 978-3-031-79032-4