1 Introduction

A Markov Decision Process (MDP) provides a mathematical framework for representing and solving sequential decision-making problems under uncertainty in fully observable environments [4]. In these problems, an agent must interact with an environment through actions that can have stochastic outcomes.

Various optimization criteria are used to find an optimal policy for a Markov decision process. The most commonly used criterion in finite and indefinite horizon MDPs is accumulated reward. In infinite horizon problems, the average reward per decision step criterion is used.

The average reward criterion is also known as long-run average reward, mean-payoff, limit-average reward, or steady-state reward [1]. This criterion is risk-neutral, does not use discount, and is naturally used in cyclic and non-terminating problems. This criterion is fundamental for areas such as controlled queues and inventory systems with frequent restocking, where decisions are made frequently and continually [4]. It is also important for natural resource management and environmental economics, for example, to determine the extraction rate of a renewable natural resource [3].

The average reward criterion depends on the limiting behavior of the stochastic process, which gives rise to different models of the criterion according to the structure of the Markov chain generated by stationary policies.

One way to think about the stochastic process in the limit is to sum the rewards (i.e., to calculate the expected accumulated reward) and then to calculate how much is added to this sum at each step, a quantity called the gain. As time tends to infinity, the gain tends to a constant value. Since this sum considers sequences of \(h=\infty \) states, this way of thinking is called \(h=\infty \) in this work. Another way to think about the stochastic process in the limit is to calculate the steady-state distribution, i.e., the probability that the system is in each state in the long run. From these probabilities and the immediate reward, the expected reward is calculated. This way of thinking is called \(h=1\) in this work. Under the risk-neutral average reward criterion, both ways of thinking about the stochastic process in the limit yield the same result.

In many situations, it is desirable to consider risk during the process at each stage. In the financial field, for example, a risk-averse investor might prefer to minimize uncertainties by opting for a lower guaranteed return, even if this implies obtaining a smaller gain [7]. Moreover, considering risk at each stage can increase the robustness of policies, allowing them to function in a wider variety of scenarios, such as in the use of renewable natural resources that suffer from greater volatility due to climate change. Among the works that use the average reward criterion and consider risk during the process are those that use exponential utility functions [2] and CVaR [7] (referred to as steady CVaR or long-run CVaR).

While the algorithm based on an exponential utility function [2] calculates the accumulated reward and how much is added to this sum at each step, thus following the \(h=\infty \) approach, the long-run CVaR algorithm [7] calculates the steady-state distribution and considers the immediate reward, which corresponds to the \(h=1\) approach.

The objective of this work is to propose a mathematical framework that provides a unified treatment of the literature on average reward and risk in MDPs, including works that use exponential utility functions [2] and CVaR [7], as well as to include interpretations with \(1 \le h \le \infty \), allowing differentiation of policies that may not be distinguished with existing criteria.

This paper is organized as follows. Section 2 reviews the definitions of MDP, average reward criterion, and risk-sensitive average reward literature. Section 3 presents the proposed framework. Section 4 evaluates the policies of an MDP for risk-neutral and risk-sensitive criteria with the proposed framework. Finally, Sect. 5 presents the conclusions.

2 Background

A Markov decision process [4] is described by a tuple \(\mathcal {M}=\langle S,A,P,R\rangle \) where: S is a finite set of states; A is a finite set of actions; \(P: S \times A \times S \rightarrow [0,1]\) is a transition function that represents the probability that \(j \in S\) is reached after the agent executes an action \(a \in A\) in a state \(s\in S\), i.e., \(\Pr (s_{N+1}=j |s_{N}=s,a_{N}=a) = P(s,a,j)\); and \(R: S \times A \rightarrow \mathbb {R}\) is a reward function that represents the reward of executing an action \(a \in A\) in a state \(s \in S\).
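The tuple \(\mathcal {M}=\langle S,A,P,R\rangle \) can be sketched directly as data. A minimal sketch follows; the two-state, two-action MDP below is an illustrative assumption, not an instance taken from this paper.

```python
# Minimal sketch of the tuple M = <S, A, P, R> as plain Python data.
# The states, actions, probabilities and rewards are illustrative assumptions.

S = [0, 1]
A = ["a", "b"]

# P[(s, a, j)] = Pr(s_{n+1} = j | s_n = s, a_n = a)
P = {(0, "a", 0): 0.9, (0, "a", 1): 0.1,
     (0, "b", 0): 0.2, (0, "b", 1): 0.8,
     (1, "a", 0): 0.5, (1, "a", 1): 0.5,
     (1, "b", 0): 0.0, (1, "b", 1): 1.0}

# R[(s, a)] = reward of executing action a in state s
R = {(0, "a"): 1.0, (0, "b"): 0.0,
     (1, "a"): -1.0, (1, "b"): 2.0}

def is_valid(S, A, P):
    """Each (s, a) row of the transition function must sum to 1."""
    return all(abs(sum(P.get((s, a, j), 0.0) for j in S) - 1.0) < 1e-9
               for s in S for a in A)
```

The validity check encodes the requirement that \(P(s,a,\cdot )\) is a probability distribution for every state-action pair.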

In this work, the average reward criterion is used for infinite-horizon problems. The average reward criterion depends on the limiting behavior of the stochastic process, which determines certain models of the criterion based on the structure of the Markov chain generated by stationary policies. Among the relevant types of Markov chains are unichain, multichain, recurrent, ergodic, and aperiodic chains. A unichain Markov chain has one closed irreducible set of states (a set in which all states communicate with each other and do not communicate with states outside the set) plus a possibly empty set of transient states (states that are no longer visited after some point in time); otherwise, the Markov chain is multichain. A recurrent Markov chain is a unichain whose set of transient states is empty. A Markov chain is aperiodic if the greatest common divisor of the possible return times to each state is 1, i.e., returns to a state do not occur only at multiples of a fixed period. A Markov chain is ergodic if it is recurrent and aperiodic.

2.1 Risk-Neutral Criterion

The expected total reward of a policy \(\pi \) from the initial state s up to the decision epoch \(N+1\) is a function \(v^\pi _{N+1}\) defined by

$$\begin{aligned} v^\pi _{N+1}(s) = E \Bigg \{ \sum _{n=1}^{N} r(S_n,A_n) \Big | S_1=s,\pi \Bigg \} = E \Bigg \{ \sum _{n=1}^{N} r_n \Bigg \}, \end{aligned}$$
(1)

where \(S_n\) and \(A_n\) are the random variables of the state and action at time step n. When N tends to infinity, \(v^\pi _{N+1}(s)\) generally diverges.

The expected average reward of a policy \(\pi \), i.e. the gain of a policy \(\pi \), is

$$\begin{aligned} \begin{aligned} g^\pi (s) & = \lim _{N\rightarrow \infty } \frac{1}{N} v^\pi _{N+1}(s) \\ & = \lim _{N\rightarrow \infty } \frac{1}{N} \sum _{n=1}^{N} P^{n-1}_{\pi }r_{\pi }(s), \end{aligned} \end{aligned}$$
(2)

where \(P^{n}_{\pi }\) is the n-th power of the transition matrix \(P_\pi \) and \(r_{\pi }\) is the reward function of \(\pi \).

A policy with the highest gain \(g^*(s)\) is called the optimal policy \(\pi ^*\),

$$\begin{aligned} g^*(s) = g^{\pi ^*}(s) = \sup _{\pi \in \varPi } g^\pi (s), \end{aligned}$$
(3)

where \(\varPi \) is the set of stationary policies.

In the following sections, the simplified version of the notation shown on the right side of Eq. 1 will be used.

2.2 Risk-Sensitive Criterion with Utility Function

There are several ways to incorporate risk into decision making with MDPs. One way is to use a utility function \(u: \mathbb {R} \rightarrow \mathbb {R}\), such as the exponential utility function [2]. The exponential utility function u is defined by

$$\begin{aligned} u(x) = -(\text {sgn} \gamma )e^{-\gamma x}, \end{aligned}$$
(4)

with inverse

$$\begin{aligned} u^{-1}(y) = -\frac{1}{\gamma } \text {ln}[-\text {sgn}(\gamma )y], \end{aligned}$$
(5)

where \(\gamma \) is the risk factor and \(\text {sgn} (\gamma )\) indicates the sign of \(\gamma \).

The certainty equivalent \(v_{u}^{\pi }\) is the deterministic value whose utility equals the expected utility of the (random) accumulated reward \(v^\pi \), i.e.,

$$\begin{aligned} u(v_{u}^{\pi }) = E\left[ u(v^\pi ) \right] . \end{aligned}$$
(6)

A positive risk factor \(\gamma \) indicates risk aversion and implies \(v^\pi _u < v^\pi \), while a negative factor indicates risk propensity and implies \(v^\pi _u > v^\pi \).

The utility of the certainty equivalent \(v_{u}^{\pi }\), denoted \(u^\pi \), is

$$\begin{aligned} {u^\pi } = u(v_{u}^{\pi }) = -(\text {sgn} \gamma ) e^{-\gamma {v_{u}^{\pi }}}. \end{aligned}$$
(7)

The utility \(u^\pi \) of a policy at iteration \(n+1\) can be calculated by the recursion [2]

$$\begin{aligned} {u^{\pi , n+1}(s)} = \sum _{j\in S} P_\pi ^n(s,j)e^{-\gamma r_\pi (s)}{u^{\pi ,n}(j)}, \end{aligned}$$
(8)

where \(u^{\pi ,0} = (-1, ..., -1)\), \(|u^{\pi ,0}| = |S|\) and \(P_\pi ^0 = I\). From \(u^{\pi ,{n+1}}(s)\), the certainty equivalent can be calculated as \(v_{u}^{\pi ,n}(s) = u^{-1}(u^{\pi ,n}(s))\). During the process, \(v_{u}^{\pi ,n}\) grows linearly, and its per-step variation is called the certainty equivalent gain, denoted by \(g^\pi _u\). Under the exponential utility criterion, an agent seeks a policy that maximizes the certainty equivalent gain of an MDP.

Howard and Matheson [2] propose a policy iteration algorithm that finds a stationary policy and deals only with aperiodic and irreducible Markov processes (all states communicate). Given an arbitrary initial policy, the system of Eqs. 9 must be solved in the policy evaluation step to find the certainty equivalent gain \(g_u^\pi \) and the certainty equivalents \(v_u^\pi \) [2]

$$\begin{aligned} e^{-\gamma (g_u^\pi +v_u^\pi (s))} =\sum _{j\in S} p(j|s,\pi (s))e^{-\gamma (r(s,\pi (s))+v_u^\pi (j))}. \end{aligned}$$
(9)

Since there are more variables than equations, one of the certainty equivalent values must be set to 0. Then, using the values found in the policy evaluation step of the previous policy, a policy improvement step is applied to select, in each state s, a new action a that maximizes the equation [2]

$$\begin{aligned} v_u^\pi (s,a) = \frac{1}{\gamma } \text {ln}\left[ \sum _{j\in S} p(j|s,a)e^{-\gamma (r(s,a)+v_u^\pi (j))}\right] . \end{aligned}$$
(10)

When no change occurs in any state, the optimal policy has been found.

2.3 Risk-Sensitive Criterion with CVaR

Another way to insert risk into MDPs is through the VaR and CVaR risk measures. The VaR (Value-at-Risk) and CVaR (Conditional Value-at-Risk) metrics are widely used in the management of portfolios of financial assets.

VaR measures the worst expected loss within a given \( \alpha \) confidence level, where \(\alpha \in (0, 1)\). VaR is defined as the \( 1- \alpha \) quantile of Z, i.e.

$$\begin{aligned} \text {VaR}_\alpha (Z) = \min \{ z | F(z) \ge 1-\alpha \}, \end{aligned}$$
(11)

where Z is a random variable and F(z) is the cumulative distribution function.

An alternative measure is the CVaR metric, a coherent risk measure [6], computed by averaging the losses that exceed the VaR value. CVaR with a confidence level \(\alpha \in (0, 1)\) is defined as [5]

$$\begin{aligned} \text {CVaR}_\alpha (Z) = \min _{w\in \mathbb {R}} \left\{ w + \frac{1}{\alpha } E\left[ (Z-w)^{+}\right] \right\} , \end{aligned}$$
(12)

where \((x)^{+} = \max \{x,0\}\) represents the positive part of x, and w is a decision variable that, at the optimum, attains the value of VaR:

$$\begin{aligned} \text {CVaR}_\alpha (Z) = \text {VaR}_\alpha (Z) + \frac{1}{\alpha } E\left[ (Z-\text {VaR}_\alpha (Z))^{+}\right] . \end{aligned}$$
(13)
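The definitions above can be checked numerically on a small discrete distribution. The following sketch computes CVaR both by the tail-average formula (Eq. 13) and by the minimization program (Eq. 12); the loss values and probabilities are illustrative assumptions, not data from this paper.

```python
# Sketch: VaR and CVaR of a discrete loss distribution, computed two ways.
# The distribution and the alpha level below are illustrative assumptions.

def var_cvar(values, probs, alpha):
    """Upper-tail VaR and CVaR of a discrete loss Z at level alpha (Eqs. 11, 13)."""
    pairs = sorted(zip(values, probs))
    # VaR: smallest z with F(z) >= 1 - alpha (Eq. 11)
    cum = 0.0
    for z, p in pairs:
        cum += p
        if cum >= 1 - alpha - 1e-12:
            var = z
            break
    # CVaR via the tail-average formula (Eq. 13)
    excess = sum(p * max(z - var, 0.0) for z, p in pairs)
    return var, var + excess / alpha

def cvar_minimization(values, probs, alpha):
    """CVaR via the minimization program (Eq. 12); for a discrete Z the
    piecewise-linear objective attains its minimum at an outcome point."""
    def obj(w):
        return w + sum(p * max(z - w, 0.0) for z, p in zip(values, probs)) / alpha
    return min(obj(w) for w in values)

losses, probs = [1.0, 2.0, 10.0], [0.7, 0.2, 0.1]
var, cvar = var_cvar(losses, probs, 0.1)
```

For this distribution the worst \(10\%\) of outcomes is the single loss 10, so both computations return \(\text {CVaR}_{0.1} = 10\) with \(\text {VaR}_{0.1} = 2\).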

MDPs with the CVaR criterion for average reward were first studied by Xia and Glynn [7]. However, they deal only with stationary policies associated with ergodic Markov chains, and they suggest that the results can be extended to unichain MDPs. The CVaR gain is defined as [7]

$$\begin{aligned} g^\pi _{\text {CVaR}_\alpha } = \lim _{N\rightarrow \infty } \frac{1}{N} \sum _{n=1}^{N} \text {CVaR}_\alpha (r_{n}). \end{aligned}$$
(14)

Note that in Eq. 14 the random variable is only the immediate reward. Additionally, a Bellman-type local optimality equation is proposed and a policy iteration-type algorithm is developed.

3 A Unified Framework

In this work, some interpretations of the risk-neutral average reward criterion are considered, followed by interpretations of the risk-sensitive average reward criterion with utility function and CVaR. The proposed framework considers stationary and aperiodic policies, i.e., stationary policies that generate aperiodic unichain or aperiodic multichain Markov chains. For aperiodic policies, the limit \(P^*_\pi =\lim _{n \rightarrow \infty } P^n_\pi \) exists.

3.1 Formalization of the Risk-Neutral Criterion

One of the most common ways to calculate the value function of an MDP is by computing the accumulated reward. This approach can be extended to the infinite horizon, where, to ensure convergence of the value, it is divided by the number of decision steps N (Eq. 15). Note that in this equation the accumulated reward is the random variable. From this equation, various interpretations of the gain of an MDP for a policy \(\pi \) can be derived; the relationships between these interpretations are shown in Eqs. 15 to 18.

$$\begin{aligned} g^\pi & = \lim _{N\rightarrow \infty } \frac{1}{N} E\left[ \sum _{n=1}^{N} r_{n} \right] \end{aligned}$$
(15)
$$\begin{aligned} & = \lim _{N\rightarrow \infty } E\left[ \sum _{n=1}^{N+1} r_{n} - \sum _{n=1}^{N} r_{n} \right] \end{aligned}$$
(16)
$$\begin{aligned} & = \lim _{N\rightarrow \infty } \sum _{n=1}^{N} \frac{1}{N} E[r_{n}] \end{aligned}$$
(17)
$$\begin{aligned} & = \lim _{N\rightarrow \infty } \sum _{n=1}^{N} \frac{1}{Nh} E\left[ \sum _{i=n}^{n+h-1} r_{i}\right] \end{aligned}$$
(18)

Equation 15 was described above. Equation 16 interprets the gain as the increase in reward at each decision step: the value is calculated as the difference between the expected accumulated rewards of two consecutive time steps. Even though each accumulated reward tends to infinity, the difference between them is finite. This approach is similar to the gain extracted in iterative algorithms, such as the relative value iteration algorithm [4].

Equation 17 considers the immediate reward as a random variable. The gain is interpreted as the sum of the expected immediate rewards divided by N.

Equation 18 interprets the gain as the Cesaro limit of the expected accumulated reward over h decision steps after each time n. In this equation, the random variable is the accumulated reward of trajectories of fixed size h. The division is by Nh because N sums of h rewards are taken.

Note that in Eqs. 16, 17 and 18 the state distribution converges, allowing the use of the steady-state matrix \(P^*_\pi \) to calculate the gain.
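The agreement between these interpretations can be checked numerically. The following sketch compares Eq. 15 (Cesaro average of expected rewards) with Eq. 17 computed through the steady-state distribution; the two-state chain used here is an illustrative assumption.

```python
# Numerical sanity check that the interpretations in Eqs. 15 and 17 agree:
# the Cesaro average of expected rewards equals the steady-state expectation.
# The two-state chain and rewards below are illustrative assumptions.

P = [[0.9, 0.1],
     [0.5, 0.5]]
r = [1.0, 0.0]

def gain_eq15(P, r, s0=0, N=20000):
    """(1/N) E[sum_n r_n]: evolve the state distribution and average E[r_n]."""
    mu = [0.0] * len(P); mu[s0] = 1.0
    total = 0.0
    for _ in range(N):
        total += sum(m * ri for m, ri in zip(mu, r))  # E[r_n]
        mu = [sum(mu[i] * P[i][j] for i in range(len(P))) for j in range(len(P))]
    return total / N

def gain_eq17(P, r, iters=200):
    """Steady-state expectation: approximate a row of P* by powering P."""
    mu = [1.0, 0.0]
    for _ in range(iters):
        mu = [sum(mu[i] * P[i][j] for i in range(len(P))) for j in range(len(P))]
    return sum(m * ri for m, ri in zip(mu, r))
```

For this chain the stationary distribution is \((5/6, 1/6)\), so both computations give a gain close to \(5/6\).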

3.2 Formalization of the Risk-Sensitive Criterion

Before formalizing the risk-sensitive criterion, some notations used in this section are introduced.

Let \(\psi \) be a sequence of states and \(\psi (k)\) the k-th state of this sequence (indexed from 0). Let \(S^h\) be the set of all sequences \(\psi \) of size h. Let \(f_{s,n}^{\pi ,h}: S^h \rightarrow [0,1]\) be the function such that \(f_{s,n}^{\pi ,h}(\psi )\) is the probability of the sequence \(\psi \) of size h occurring when the process starts in state s at time n; it can be calculated as

$$\begin{aligned} \begin{aligned} f_{s,n}^{\pi ,h}(\psi ) & = Pr\left( \bigwedge _{k=0}^{h-1} S_{n+k}=\psi (k) \,\Big |\, S_{0}=s\right) \\ & = Pr\left( S_{n}=\psi (0) \,|\, S_{0}=s\right) \prod _{k=1}^{h-1}Pr\left( S_{n+k}=\psi (k)\,|\,S_{n+k-1}=\psi (k-1)\right) . \end{aligned} \end{aligned}$$
(19)

Let \(v^{\pi ,h}_n(\psi )\) be the accumulated reward of the sequence \(\psi \) of size h with the process starting at time n

$$\begin{aligned} v^{\pi ,h}_n(\psi ) = \sum _{k=0}^{h-1} r( \psi (k) ). \end{aligned}$$
(20)
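Equations 19 and 20 can be sketched directly. The helper below uses the transition matrix and rewards of policy \(\pi _1\) from Example 1 in Sect. 4; as a sanity check, the probabilities of all \(|S|^h\) sequences must sum to 1.

```python
# Sketch of Eqs. 19-20: probability and accumulated reward of a state
# sequence psi of size h, when the process starts in state s at time n.
# P and r are the chain and rewards of policy pi_1 of Example 1 (Sect. 4).

from itertools import product

P = [[0.5, 0.5, 0.0, 0.0],
     [0.0, 0.0, 1.0, 0.0],
     [0.0, 0.0, 0.5, 0.5],
     [1.0, 0.0, 0.0, 0.0]]
r = [0.0, 6.0, 0.0, -5.0]

def dist_at_time(P, s, n):
    """Row s of P^n, i.e. Pr(S_n = . | S_0 = s)."""
    mu = [0.0] * len(P); mu[s] = 1.0
    for _ in range(n):
        mu = [sum(mu[i] * P[i][j] for i in range(len(P))) for j in range(len(P))]
    return mu

def seq_prob(P, s, n, psi):
    """f_{s,n}^{pi,h}(psi), Eq. 19: reach psi(0) at time n, then chain on."""
    p = dist_at_time(P, s, n)[psi[0]]
    for k in range(1, len(psi)):
        p *= P[psi[k - 1]][psi[k]]
    return p

def seq_reward(r, psi):
    """v_n^{pi,h}(psi), Eq. 20: accumulated reward of the sequence."""
    return sum(r[state] for state in psi)

h, s, n = 2, 0, 3
total = sum(seq_prob(P, s, n, psi) for psi in product(range(4), repeat=h))
```

Since the one-step factors in Eq. 19 each sum to 1 over the next state, `total` equals 1 up to floating-point error.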

Risk-Sensitive Criterion with Utility Function. The utility function u can be applied either to the sum of rewards or directly to the immediate reward. Equation 21 considers the expected utility of the accumulated reward and represents the approach of Howard and Matheson [2], which uses the exponential utility function. It is similar to Eq. 15, but the utility function u is applied to the sum of rewards

$$\begin{aligned} g^\pi _{u} = \lim _{N\rightarrow \infty } \frac{1}{N} u^{-1}\left( E\left[ u\left( \sum _{n=1}^{N} r_{n} \right) \right] \right) . \end{aligned}$$
(21)

Equation 22 is equivalent to Eq. 21. It calculates the expected value of the utility of the accumulated reward over all sequences of size \(N\rightarrow \infty \) that start in state s at time 1, and then the average is calculated by

$$\begin{aligned} \begin{aligned} g^\pi _u & = \lim _{N\rightarrow \infty } \frac{1 }{N} u^{-1}\left( \sum _{\psi \in {S^{N}}} f_{s,1}^{\pi ,N}(\psi ) u\left( \sum _{k=0}^{N-1} r( \psi (k) ) \right) \right) . \end{aligned} \end{aligned}$$
(22)

When \(N\rightarrow \infty \), the probability of each state in the sequence tends to the steady-state probability at any time n, so the probability of the sequence \(f_{s,n}^{\pi ,h}\) can be calculated using the steady-state probability of each state in Eq. 19. This interpretation introduces how an arbitrary sequence of states is viewed in the upcoming equations.

Equation 23 considers the expected utility of the accumulated rewards of sequences of fixed size h starting at each time n, \(1\le n \le N\), with \(N\rightarrow \infty \), and averages over time:

$$\begin{aligned} g_{u}^{\pi ,h} = \lim _{N\rightarrow \infty } \frac{1}{Nh} \sum _{n=1}^{N} u^{-1}\left( E\left[ u\left( \sum _{i=n}^{n+h-1} r_{i} \right) \right] \right) . \end{aligned}$$
(23)

Equations 24 and 25 reinterpret Eq. 23 by inserting a sequence \(\psi \) of h states, the probability \(f_{s,n}^{\pi ,h}(\psi )\) of this sequence (Eq. 19) and its accumulated reward \(v^{\pi ,h}_n(\psi )\) (Eq. 20). For each sequence of size h starting at time n, its probability is multiplied by the utility of its accumulated reward

$$\begin{aligned} {u_{n}^{\pi ,h}} = \sum _{\psi \in {S^h}} f_{s,n}^{\pi ,h}(\psi ) u\left( v^{\pi ,h}_n(\psi ) \right) . \end{aligned}$$
(24)

Then the inverse utility \(u^{-1}(u_{n}^{\pi ,h})\) is summed over all times n and its average is obtained when \(N\rightarrow \infty \):

$$\begin{aligned} g^{\pi ,h}_u = \lim _{N\rightarrow \infty } \frac{1 }{Nh} \sum _{n=1}^{N} {u^{-1}(u^{\pi ,h}_{n})}. \end{aligned}$$
(25)

For \(h=1\) and u the identity function, Eq. 25 reduces to the risk-neutral average reward, equivalent to Eq. 2.

Note that Eq. 25 is proposed in this work to classify the existing literature with respect to h. When \(h=\infty \), Eq. 25 is equivalent to the criterion used by Howard and Matheson [2].

Risk-Sensitive Criterion with CVaR. Equation 26 considers the average CVaR of the accumulated reward. Analogously to Eq. 21, it considers the accumulated reward of the stochastic process, but applies the CVaR measure instead of the utility function u

$$\begin{aligned} g^\pi _{\text {CVaR}} = \lim _{N\rightarrow \infty } \frac{1}{N} \text {CVaR}\left[ \sum _{n=1}^{N} r_{n} \right] . \end{aligned}$$
(26)

As far as we know, there is no work in the literature that solves Eq. 26.

Similarly to Eq. 23, but with CVaR in place of the utility function u, Eq. 27 considers the accumulated rewards of fixed-length sequences of size h started at each time n and averages the CVaR values of all these sequences:

$$\begin{aligned} g^{\pi ,h}_\text {CVaR} = \lim _{N\rightarrow \infty } \frac{1}{Nh} \sum _{n=1}^{N} \text {CVaR}\left[ \sum _{i=n}^{n+h-1} r_{i} \right] . \end{aligned}$$
(27)

Equation 28 reinterprets Eq. 27 by inserting a sequence \(\psi \) of h states, the probability of this sequence of states \(f_{s,n}^{ \pi ,h}\) (Eq. 19) and its accumulated reward \(v^{\pi ,h}_n(\psi )\) (Eq. 20). Thus, for sequences \(\psi \) of size h starting at time n:

$$\begin{aligned} \begin{aligned} v_{\text {CVaR},n}^{\pi ,h} & = \text {CVaR}_\alpha (v^{\pi ,h}_n(\psi )) \\ & = \text {VaR}_\alpha (v^{\pi ,h}_n(\psi )) + \frac{1}{\alpha } \sum _{\psi \in {S^h}} f_{s,n}^{\pi ,h}(\psi ) \left[ (v^{\pi ,h}_n(\psi )-\text {VaR}_\alpha (v^{\pi ,h}_n(\psi )))^{+}\right] . \end{aligned} \end{aligned}$$
(28)

Then the sum over all times n is calculated and its average is obtained for \(N\rightarrow \infty \):

$$\begin{aligned} g^{\pi ,h}_{\text {CVaR}} = \lim _{N\rightarrow \infty } \frac{1 }{Nh} \sum _{n=1}^{N} v^{\pi ,h}_{\text {CVaR},n}. \end{aligned}$$
(29)

Note that Eq. 29 is proposed in this work to classify the existing literature with respect to h. When \(h=1\), Eq. 29 is equivalent to Eq. 14 of Xia and Glynn [7].

4 Numerical Example

Example 1

Consider an MDP with 4 states and three policies \(\pi _1\), \(\pi _2\) and \(\pi _3\), where the transition matrix of \(\pi _1\) and \(\pi _2\) is

$$\begin{aligned} P_{\pi _1} = P_{\pi _2}= \begin{bmatrix} 0.5 & 0.5 & 0 & 0\\ 0 & 0 & 1 & 0\\ 0 & 0 & 0.5 & 0.5\\ 1 & 0 & 0 & 0 \end{bmatrix}, \end{aligned}$$
(30)

and the transition matrix for \(\pi _3\) is

$$\begin{aligned} P_{\pi _3} = \begin{bmatrix} 1 & 0 & 0 & 0\\ 0 & 0 & 0 & 0\\ 0 & 0 & 0 & 0\\ 0 & 0 & 0 & 0 \end{bmatrix}. \end{aligned}$$
(31)

Let \(r_{\pi }\) be the reward function of policy \(\pi \), with \(r_{\pi _1}=(0,6,0,-5)\), \(r_{\pi _2}=(0,1,0,0)\) and \(r_{\pi _3}=(\frac{1}{6})\).

Figure 1 represents the three policies, where the vertices represent the states, the edges the actions, and the values in parentheses represent the transition probability and the reward, respectively.

The Markov chains generated by the stationary policies in Example 1 are unichain and aperiodic. Thus, all policies have constant gain, i.e., the gain of any policy is independent of the state [4].

Fig. 1. Graphical representation of the policies of the MDP given in Example 1.

The steady-state transition matrix is calculated by \(P_\pi ^* = \lim _{n\rightarrow \infty } P_\pi ^{n}\), with \(P_\pi ^{n+1}=P_\pi ^nP_\pi \); for policies \(\pi _1\) and \(\pi _2\) it is

$$\begin{aligned} P_{\pi _1}^* = P_{\pi _2}^* = \begin{bmatrix} \frac{1}{3} & \frac{1}{6} & \frac{1}{3} & \frac{1}{6}\\ \frac{1}{3} & \frac{1}{6} & \frac{1}{3} & \frac{1}{6}\\ \frac{1}{3} & \frac{1}{6} & \frac{1}{3} & \frac{1}{6}\\ \frac{1}{3} & \frac{1}{6} & \frac{1}{3} & \frac{1}{6} \end{bmatrix}. \end{aligned}$$
(32)

For \(\pi _3\), \(P^*_{\pi _3}=P_{\pi _3}\).
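The limit \(P^*_\pi \) can be approximated by repeatedly multiplying the transition matrix by itself. The sketch below does this for \(P_{\pi _1}\) and recovers the rows \((\frac{1}{3}, \frac{1}{6}, \frac{1}{3}, \frac{1}{6})\) of Eq. 32.

```python
# Sketch: approximating the steady-state matrix P* of pi_1 by powering P,
# reproducing the identical rows (1/3, 1/6, 1/3, 1/6) of Eq. 32.

P = [[0.5, 0.5, 0.0, 0.0],
     [0.0, 0.0, 1.0, 0.0],
     [0.0, 0.0, 0.5, 0.5],
     [1.0, 0.0, 0.0, 0.0]]

def mat_mul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def steady_state(P, iters=200):
    """P^(iters+1); for an aperiodic unichain this converges to P*."""
    Pn = P
    for _ in range(iters):
        Pn = mat_mul(Pn, P)
    return Pn

P_star = steady_state(P)
```

Because the chain is aperiodic and unichain, all rows of the limit coincide with the stationary distribution.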

Although policies \(\pi _1\) and \(\pi _2\) have the same transition matrix and hence the same expected average reward, the distributions of accumulated rewards for \(\pi _1\) have greater variance than those for \(\pi _2\) for different values of h, as shown in Fig. 2. Therefore, risk-sensitive criteria are expected to differentiate between policies \(\pi _1\) and \(\pi _2\).

Fig. 2. Graphs of the distribution of accumulated reward and cumulative distribution function (CDF) of accumulated reward for different values of \( h \).

Figures 2 a), c), e) and g) show the distribution of the accumulated reward for policies \(\pi _1\) (blue bars), \(\pi _2\) (orange bars) and \(\pi _3\) (green bars) of the MDP given in Example 1 for \(h \in \{1,2,5,101\}\), respectively. In addition to the distributions, the values \(v^{\pi ,h}\) (red line), \(v^{\pi ,h}_{u_{1}}\) (purple line), \(v^{\pi ,h}_{\text {VaR}_{0.2}}\) (brown line) and \(v^{\pi ,h}_{\text {CVaR}_{0.2}}\) (pink line) are shown.

The accumulated rewards for \(h=1\) and \(h=2\) range from \(-5\) to 6 for \(\pi _1\) and range from 0 to 1 for \(\pi _2\). For \(h=5\), the accumulated rewards range from \(-5\) to 7 for \(\pi _1\) and from 0 to 2 for \(\pi _2\). For \(h=101\), the accumulated rewards range from \(-5\) to 30 for \(\pi _1\) and from 0 to 25 for \(\pi _2\). The accumulated reward for \(\pi _3\) does not vary for each h. It is expected that a risk-averse criterion may consider these three policies differently. Additionally, when h tends to infinity, all accumulated rewards tend to infinity (see Figs. 2 a), c), e) and g)).

Figures 2 b), d), f) and h) show the cumulative distribution function (CDF) for the policies with the same values of h. For each policy, the CDF is symmetric, and the medians of \(\pi _1\) and \(\pi _2\) coincide for all values of h.

In the next subsections, examples of the gain calculation for each of the criteria used are shown.

Consider the following notation for preferences between policies: \(A\succ B\) means that the decision-maker prefers A to B, \(A \sim B\) means that the decision-maker is indifferent between A and B, and \(A\succeq B\) means that the decision-maker prefers A to B or is indifferent between them.

4.1 Calculation of Gain Using the Risk-Neutral Criterion

The gain of a policy \(\pi \), defined in Eq. 15, can be calculated by dividing the accumulated reward \(v^{\pi }_N\) by the number of steps N, i.e., \(g^\pi =\lim _{N\rightarrow \infty } \frac{1}{N}v^{\pi }_N\). The value \(v_N^\pi \) can be calculated iteratively by

$$\begin{aligned} v^{\pi ,n+1}=v^{\pi ,n}+P_\pi ^nr_\pi , \end{aligned}$$
(33)

where \(v^{\pi ,0}\) is a null vector of size |S| and \(P_\pi ^0 = I\).

Using this form of calculation for the example, the gain of policies \(\pi _1\), \(\pi _2\) and \(\pi _3\) is \(g^{\pi _1} = g^{\pi _2} = g^{\pi _3} = (0.167, 0.167, 0.167, 0.167)\).

The gain of a policy defined in Eq. 17 can be calculated by multiplying the steady-state matrix \(P_\pi ^*\) by the policy reward \(r_\pi \), i.e., \(g^{\pi } =P_{\pi }^*r_{\pi }\) [4]. In the example using this second form of calculation, the gain is \(g^{\pi _1} = g^{\pi _2} = g^{\pi _3} = (0.167, 0.167, 0.167, 0.167).\)

Note that, using the risk-neutral criterion, \(\pi _3 \sim \pi _2 \sim \pi _1\).
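The iterative calculation of Eq. 33 can be sketched directly for \(\pi _1\); dividing the accumulated vector by N approaches the constant gain \(0.167\) in every state.

```python
# Sketch of Eq. 33 for pi_1 of Example 1: accumulate v^{pi,n+1} = v^{pi,n}
# + P^n_pi r_pi, then divide by N to approximate the gain g = 0.167.

P = [[0.5, 0.5, 0.0, 0.0],
     [0.0, 0.0, 1.0, 0.0],
     [0.0, 0.0, 0.5, 0.5],
     [1.0, 0.0, 0.0, 0.0]]
r = [0.0, 6.0, 0.0, -5.0]

def gain(P, r, N=5000):
    ns = len(P)
    v = [0.0] * ns                                           # v^{pi,0}
    Pn = [[float(i == j) for j in range(ns)] for i in range(ns)]  # P^0 = I
    for _ in range(N):
        v = [v[i] + sum(Pn[i][j] * r[j] for j in range(ns)) for i in range(ns)]
        Pn = [[sum(Pn[i][k] * P[k][j] for k in range(ns)) for j in range(ns)]
              for i in range(ns)]
    return [vi / N for vi in v]

g = gain(P, r)
```

For finite N the result carries a transient bias of order 1/N, so the assertion below uses a loose tolerance.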

4.2 Calculation of Gain Using Utility Function with \(\gamma =1\) for \(h=1\)

To calculate the gain \(g^{\pi ,h}_\text {u}\), it is necessary to calculate \(f_{s,n}^{\pi ,h}(\psi )\) (Eq. 19), the probability of the sequence \(\psi \) of size h occurring when the process starts in state s at time n; \(v_{n}^{\pi ,h}(\psi )\) (Eq. 20), the accumulated reward of the sequence \(\psi \); and \(u(v^{\pi ,h}_n(\psi ))\). Table 1 shows the sequences \(\psi \) of size \(h=1\) for policy \(\pi _1\) (column 1), \(f_{s,n}^{\pi _1,h=1}(\psi )\) (column 2), \(v_{n}^{\pi _1,h=1}(\psi )\) (column 3) and \(u(v^{\pi _1,h=1}_n(\psi ))\) (column 4). For policy \(\pi _1\), \(u^{\pi _1,h=1}_n=\frac{1}{3}*(-1)+\frac{1}{6}*(-0.002)+\frac{1}{3}*(-1)+\frac{1}{6}*(-148.413)=-25.403\) (Eq. 24). Thus, \(v^{\pi _1,h=1}_{n} = u^{-1}(u^{\pi _1,h=1}_{n}) = -3.235\) and \(g^{\pi _1,h=1}_{u_1} = \frac{v^{\pi _1,h=1}_{n}}{1} = -3.235\).

For policy \(\pi _2\), \(u^{\pi _2,h=1}_n=\frac{5}{6}*(-1)+\frac{1}{6}*(-0.368) = -0.895\). Consequently, \(v^{\pi _2,h=1}_{n}=u^{-1}(-0.895) = 0.111\) and \(g^{\pi _2,h=1}_{u_1}=0.111\). For policy \(\pi _3\), \(u^{\pi _3,h=1}_n=-0.846\) and \(v^{\pi _3,h=1}_{n}=g^{\pi _3,h=1}_{u_1}=0.167\).

Figure 2a) shows the values \(v^{\pi ,h=1}_{u_{1}}\) for the three policies (purple vertical lines), which are used to calculate their respective gains. For \(h=1\) and \(\gamma =1\), \(\pi _3 \succ \pi _2 \succ \pi _1\).
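The \(h=1\) calculation above can be sketched in a few lines, using the steady-state probabilities \((\frac{1}{3},\frac{1}{6},\frac{1}{3},\frac{1}{6})\) from Eq. 32 and \(\gamma =1\).

```python
# Sketch of the h=1 exponential-utility gains of Sect. 4.2 (gamma = 1):
# expected utility over the steady-state distribution (Eq. 24), then the
# certainty equivalent via the inverse utility.

from math import exp, log

f = [1/3, 1/6, 1/3, 1/6]            # steady-state distribution (Eq. 32)
r_pi1 = [0.0, 6.0, 0.0, -5.0]
r_pi2 = [0.0, 1.0, 0.0, 0.0]
r_pi3, f_pi3 = [1/6], [1.0]         # pi_3: single recurrent state

def ce_gain(probs, r, gamma=1.0):
    """u(x) = -e^{-gamma x}; CE = u^{-1}(E[u]) = -(1/gamma) ln(-E[u])."""
    eu = sum(p * (-exp(-gamma * ri)) for p, ri in zip(probs, r))
    return -log(-eu) / gamma

g1 = ce_gain(f, r_pi1)       # about -3.235
g2 = ce_gain(f, r_pi2)       # about  0.111
g3 = ce_gain(f_pi3, r_pi3)   # 1/6, about 0.167
```

The three values reproduce the gains computed in the text, confirming the ordering \(\pi _3 \succ \pi _2 \succ \pi _1\) for \(h=1\) and \(\gamma =1\).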

Table 1. Sequences \(\psi \) of size \(h=1\) for policy \(\pi _1\), \(f_{s,n}^{\pi _1,h=1}(\psi )\), \(v_{n}^{\pi _1,h=1}(\psi )\) and \(u(v_{n}^{\pi _1,h=1}(\psi ))\).

\(\psi \) | \(f_{s,n}^{\pi _1,h=1}(\psi )\) | \(v_{n}^{\pi _1,h=1}(\psi )\) | \(u(v_{n}^{\pi _1,h=1}(\psi ))\)
(1) | \(\frac{1}{3}\) | 0 | \(-1\)
(2) | \(\frac{1}{6}\) | 6 | \(-0.002\)
(3) | \(\frac{1}{3}\) | 0 | \(-1\)
(4) | \(\frac{1}{6}\) | \(-5\) | \(-148.413\)

4.3 Calculation of Gain Using Utility Function with \(\gamma =1\) for \(h=\infty \)

Solving the system of Eqs. 9 proposed in [2] to evaluate the policies yields \({g}^{\pi _1}_u = {g}^{\pi _2}_u = (0.157, 0.157, 0.157, 0.157)\) and \({g}^{\pi _3}_u=(0.167)\). As stated previously, \({g}^{\pi }_u= {g}^{\pi ,h=\infty }_u\) for any policy.

Note that \(g^{\pi }_u \le g^{\pi }\) for all policies, which is expected of a risk-averse agent, whose certainty equivalent gain is less than or equal to the gain.

The gains \(g^{\pi ,h}_{u}\) were calculated for \(1\le h \le 2000\) and policies \(\pi \in \{\pi _1, \pi _2, \pi _3\}\). As expected, \(g^{\pi ,h}_{u}\) approaches \(g^{\pi ,h=\infty }_{u}\) as h increases, with \(g^{\pi _1,h=2000}_{u}=g^{\pi _2,h=2000}_{u}=(0.155, 0.155, 0.155, 0.155)\) and \(g^{\pi _3,h=2000}_{u}=(0.167)\).

For \(h=\infty \) and \(\gamma =1\), \(\pi _3 \succ \pi _2 \sim \pi _1\).
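The \(h=\infty \) gain can also be obtained without policy iteration. Substituting \(w(s)=e^{-\gamma v_u^\pi (s)}\) into Eq. 9 gives \(e^{-\gamma g_u^\pi } w(s) = \sum _j Q(s,j) w(j)\) with \(Q(s,j)=p(j|s,\pi (s))e^{-\gamma r(s,\pi (s))}\), so \(e^{-\gamma g_u^\pi }\) is the spectral radius of Q. The sketch below applies this reformulation, stated here as an observation consistent with Eq. 9 rather than as the algorithm of [2], to \(\pi _1\) of Example 1 with \(\gamma =1\).

```python
# Sketch: certainty-equivalent gain of pi_1 (Example 1, gamma = 1) as
# g = -(1/gamma) ln(rho(Q)), where rho(Q) is the spectral radius of
# Q(s,j) = p(j|s) e^{-gamma r(s)}, found here by power iteration.

from math import exp, log

P = [[0.5, 0.5, 0.0, 0.0],
     [0.0, 0.0, 1.0, 0.0],
     [0.0, 0.0, 0.5, 0.5],
     [1.0, 0.0, 0.0, 0.0]]
r = [0.0, 6.0, 0.0, -5.0]
gamma = 1.0

Q = [[P[s][j] * exp(-gamma * r[s]) for j in range(4)] for s in range(4)]

def spectral_radius(Q, iters=2000):
    """Power iteration with max-norm scaling on a nonnegative matrix."""
    w = [1.0] * len(Q)
    lam = 1.0
    for _ in range(iters):
        w = [sum(Q[i][j] * w[j] for j in range(len(Q))) for i in range(len(Q))]
        lam = max(abs(x) for x in w)
        w = [x / lam for x in w]
    return lam

g_u = -log(spectral_radius(Q)) / gamma   # certainty-equivalent gain
```

The result reproduces \(g_u^{\pi _1} = 0.157\) from the policy evaluation above; \(\pi _2\) yields the same value because its Q has the same spectral radius.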

4.4 Calculation of Gain Using CVaR with \(\alpha =0.2\) for \(h=1\)

To calculate the gain \(g^{\pi ,h}_\text {CVaR}\), it is necessary to calculate \(f_{s,n}^{\pi ,h}(\psi )\) (Eq. 19), \(v_{n}^{\pi ,h}(\psi )\) (Eq. 20) and the distribution of these accumulated rewards \(f_{v_{n}^{\pi }}^{h}\). Table 1 shows the sequences \(\psi \) of size \(h=1\) for policy \(\pi _1\) (column 1), \(f_{s,n}^{\pi _1,h=1}(\psi )\) (column 2) and \(v_{n}^{\pi _1,h=1}(\psi )\) (column 3).

Thus, the possible accumulated rewards are \((-5,0,6)\) with distribution \(f_{v^{\pi _1}_n}^{h=1}=(\frac{1}{6}, \frac{4}{6}, \frac{1}{6})\). Considering a confidence level of \(20\%\), \(v^{\pi _1,h=1}_{\text {VaR}_{0.2}}=\text {VaR}_{0.2}(v_{n}^{\pi _1,h=1})=0\) and \(v^{\pi _1,h=1}_{\text {CVaR}_{0.2}}=\text {CVaR}_{0.2}(v_{n}^{\pi _1,h=1}) =\frac{\frac{1}{6}(-5)+(0.2-\frac{1}{6})*0}{0.2}=-4.167\). Thus, \(g^{\pi _1,h=1}_{\text {CVaR}_{0.2}} = \frac{v^{\pi _1,h=1}_{\text {CVaR}_{0.2}}}{1} = -4.167\).

For policy \(\pi _2\), the possible accumulated rewards are (0, 1) with distribution \(f_{v_{n}^{\pi _2}}^{h=1}=(\frac{5}{6}, \frac{1}{6})\). Considering a confidence level of \(20\%\), \(v^{\pi _2,h=1}_{\text {VaR}_{0.2}}=0\) and \(v^{\pi _2,h=1}_{\text {CVaR}_{0.2}}=\frac{0.2*0}{0.2}=0\). Thus, \(g^{\pi _2,h=1}_{\text {CVaR}_{0.2}} = \frac{v^{\pi _2,h=1}_{\text {CVaR}_{0.2}}}{1} = 0\).

For policy \(\pi _3\), the only accumulated reward is \(v_{n}^{\pi _3,h=1} = \frac{1}{6}\) with distribution \(f_{v_{n}^{\pi _3}}^{h=1}=(1)\). Considering a confidence level of \(20\%\), \(v^{\pi _3,h=1}_{\text {VaR}_{0.2}}=v^{\pi _3,h=1}_{\text {CVaR}_{0.2}}=\frac{1}{6}\), and \(g^{\pi _3,h=1}_{\text {CVaR}_{0.2}} = \frac{v^{\pi _3,h=1}_{\text {CVaR}_{0.2}}}{1} = \frac{1}{6}\).

Figure 2a) shows the values \(v^{\pi ,h=1}_{\text {CVaR}_{0.2}}\) for the three policies (pink vertical lines), which are used to calculate their respective gains. For \(h=1\) and a confidence level of \(20\%\), \(\pi _3 \succ \pi _2 \succ \pi _1\).

The values found in this example for \(h=1\) are the same values found with the criterion used by Xia and Glynn [7].
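The worked values above amount to a lower-tail CVaR of the one-step reward under the steady-state distribution, i.e., the average of the worst \(\alpha \) probability mass of outcomes. A minimal sketch of that computation:

```python
# Sketch of the h=1 CVaR gains of Sect. 4.4: lower-tail CVaR (average of
# the worst alpha probability mass) of the one-step reward under the
# steady-state distribution, reproducing -4.167, 0 and 1/6.

def lower_cvar(values, probs, alpha):
    """Average of the worst alpha probability mass of a discrete reward."""
    pairs = sorted(zip(values, probs))   # ascending rewards
    mass, acc = alpha, 0.0
    for z, p in pairs:
        take = min(p, mass)
        acc += take * z
        mass -= take
        if mass <= 1e-12:
            break
    return acc / alpha

f = [1/3, 1/6, 1/3, 1/6]                     # steady-state distribution
g1 = lower_cvar([0, 6, 0, -5], f, 0.2)       # about -4.167
g2 = lower_cvar([0, 1, 0, 0], f, 0.2)        # 0
g3 = lower_cvar([1/6], [1.0], 0.2)           # 1/6
```

The three gains match the values in the text and the criterion of Xia and Glynn [7] for this example.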

4.5 Comparison of \(g^{\pi }\), \(g_{u}^{\pi ,h}\), \(g_{\text {VaR}}^{\pi ,h}\) and \(g_{\text {CVaR}}^{\pi ,h}\)

Figure 3a) shows \(g_{u}^{\pi ,h}\) for \(1 \le h \le 2000\) and risk factors \(\gamma \in \{1, 0.1, 0.01, 0.001\}\). For all policies, as \(\gamma \) tends to 0, the gain \(g_{u_\gamma }^{\pi ,h}\) tends to the risk-neutral gain \(g^{\pi }\) for any value of h. This property was demonstrated by Howard and Matheson [2] for \(h=\infty \). It is also observed that, for all policies, \(g_{u_1}^{\pi ,h}\) approaches \(g_{u_1}^{\pi ,h=\infty }\) as h tends to infinity. Additionally, if \(h=\infty \) (orange lines in Fig. 3), \(\pi _3 \succ \pi _1 \sim \pi _2\), and if \(h \ne \infty \) (dotted lines in Fig. 3), \(\pi _3 \succ \pi _2 \succ \pi _1\).

Fig. 3. Gain of the policies varying h.

Figure 3b) shows \(g_{\text {VaR}}^{\pi ,h}\) for \(1 \le h \le 2000\) and confidence levels \(\alpha \in \{0.01, 0.1, 0.2, 0.5, 0.8, 0.9, 0.99\}\). For policies \(\pi _1\) and \(\pi _2\) and confidence levels above \(50\%\), \(g_{\text {VaR}_\alpha }^{\pi ,h} \ge g^\pi \) for \(h \ge 1\), and \(g_{\text {VaR}_\alpha }^{\pi ,h}\) tends to \(g^\pi \) as h tends to infinity. For confidence levels below \(50\%\), \(g_{\text {VaR}_\alpha }^{\pi ,h} \le g^\pi \) for \(h \ge 2\), again approaching \(g^\pi \) as h tends to infinity. For policy \(\pi _3\), \(g_{\text {VaR}_\alpha }^{\pi ,h}=g^\pi \) for all h.

Figure 3c) shows \(g_{\text {CVaR}}^{\pi ,h}\) for \(1 \le h \le 2000\) and confidence levels \(\alpha \in \{0.01, 0.1, 0.2, 0.5, 0.8, 0.9, 0.99\}\). The values \(g_{\text {CVaR}_\alpha }^{\pi ,h}\) tend to \(g^\pi \) when h tends to infinity for any value of \(\alpha \) and, as expected, when \(\alpha \) tends to 1, \(g_{\text {CVaR}_\alpha }^{\pi ,h}\) approaches \(g^\pi \) for any h. Additionally, if \(h=\infty \), \(\pi _1 \sim \pi _2 \sim \pi _3\), and if \(h \ne \infty \) (dotted lines in Fig. 3), \(\pi _3 \succeq \pi _2 \succeq \pi _1\).

5 Conclusions

In this work, we proposed a mathematical framework that unifies the average reward and risk criteria for MDPs, for stationary policies that generate aperiodic unichain or multichain Markov chains. The proposed formalization uses sequences of states of size h, allowing two ways of thinking about the stochastic process in the limit: (i) the accumulated reward at infinity and (ii) the steady-state distribution.

The gain under the risk-neutral average reward criterion does not vary with the number of states in the sequence. Works on average reward with an exponential utility function [2] and with the CVaR risk measure [7] were classified using the proposed framework. The exponential utility criterion [2] was classified as \(h=\infty \), since it uses the sum of rewards at infinity, and the CVaR criterion [7] was classified as \(h=1\), because it considers the immediate reward and the steady-state distribution.

To the best of our knowledge, the exponential utility function applied to sequences of states of size \(h=1\) had not been considered in the literature. This criterion allowed the differentiation of policies not distinguished by exponential utility with \(h=\infty \) or by the risk-neutral average reward, as shown in the numerical examples.

As far as we know, no work in the literature characterizes CVaR for \(h=\infty \). Through numerical experiments, we conjecture that this criterion is equivalent to the risk-neutral one; a formal proof is left for future work.