key: cord-0640328-pjsey8di authors: Jaeger, Manfred; Bacci, Giorgio; Bacci, Giovanni; Larsen, Kim Guldstrand; Jensen, Peter Gjøl title: Approximating Euclidean by Imprecise Markov Decision Processes date: 2020-06-26 journal: nan DOI: nan sha: 1fae298a7c4e9ac81063358a55cb87366b16afce doc_id: 640328 cord_uid: pjsey8di

Euclidean Markov decision processes are a powerful tool for modeling control problems under uncertainty over continuous domains. Finite-state imprecise Markov decision processes can be used to approximate the behavior of these infinite models. In this paper we address two questions: first, we investigate what kind of approximation guarantees are obtained when the Euclidean process is approximated by finite-state models induced by increasingly fine partitions of the continuous state space. We show that for cost functions over finite time horizons the approximations become arbitrarily precise. Second, we use imprecise Markov decision process approximations as a tool to analyse and validate cost functions and strategies obtained by reinforcement learning. We find that, on the one hand, our new theoretical results validate basic design choices of a previously proposed reinforcement learning approach. On the other hand, the imprecise Markov decision process approximations reveal some inaccuracies in the learned cost functions.

Markov Decision Processes (MDP) [12] provide a unifying framework for modeling decision making in situations where outcomes are partly random and partly under the control of a decision maker. MDPs are useful for studying optimization problems solved via dynamic programming and reinforcement learning. They are used in several areas, including economics, control, robotics and autonomous systems. In its simplest form, an MDP comprises a finite set of states S, a finite set of control actions Act, and a transition function which for each state s and action a specifies the transition probabilities P_a(s, s') to successor states s'. In addition, taking action a in state s incurs an immediate cost C(s, a). The overall problem is to find a strategy σ that specifies the action σ(s) to be taken in state s in order to optimize some objective (e.g. the expected cost of reaching a goal state). For many applications, however, such as queuing systems, epidemic processes (e.g. COVID-19), and population processes, the restriction to a finite state space is inadequate. Rather, the underlying system has an infinite state space, and the decision making process must take into account the continuous dynamics of the system. In this paper, we consider a particular class of infinite-state MDPs, namely Euclidean Markov Decision Processes [9], where the state space S is given by a (measurable) subset of R^K for some fixed dimension K. As an example, consider the semi-random walk illustrated on the left of Fig. 1 with state space S = [0, x_max] × [0, t_max] (one-dimensional space, and time). Here the goal is to cross the x = 1 finishing line before t = 1. The decision maker has two actions at her disposal: to move fast and expensively (cost 3), or to move slowly and cheaply (cost 1). Both actions have uncertainty about the distance traveled and the time taken.
This uncertainty is modeled by a uniform distribution over a successor state square: given current state (x, t) and action a ∈ {slow, fast}, the distribution over possible successor states is the uniform distribution over [x + δ(a) − ε, x + δ(a) + ε] × [t + τ(a) − ε, t + τ(a) + ε], where (δ(a), τ(a)) represents the direction of the movement in space and time, which depends on the action a, while the parameter ε models the uncertainty. Now, the question is to find the strategy σ : S → Act that will minimize the expected cost of reaching a goal state. In [9], we proposed two reinforcement learning algorithms implemented in UPPAAL STRATEGO [5], using online partition refinement techniques. In that work we experimentally demonstrated their improved convergence tendencies on a range of models. For the semi-random walk example, the online learning algorithm returns the strategy illustrated on the right of Fig. 1. However, despite its efficiency and experimentally demonstrated convergence properties, the learning approach of [9] provides no hard guarantees as to how far away the expected cost of the learned strategy is from the optimal one. In this paper we propose a step-wise partition refinement process, where each partitioning induces a finite-state imprecise MDP (IMDP). From the induced IMDP we can derive upper and lower bounds on the expected cost of the original infinite-state Euclidean MDP. As a crucial result, we prove the correctness of these bounds, i.e., that they are always guaranteed to contain the true expected cost. Also, we provide value iteration procedures for computing lower and upper expected costs of IMDPs. Figure 2 shows upper and lower bounds on the expected cost over the regions shown in Figure 1. Applying the IMDP value iteration procedures to the partition learned by UPPAAL STRATEGO therefore allows us to compute guaranteed lower and upper bounds on the expected cost, and thereby validate the results of reinforcement learning. The main contributions of this paper can be summarized as follows:
- We define IMDP abstractions of infinite-state Euclidean MDPs, and establish as key theoretical properties: the correctness of value iteration to compute upper and lower expected cost functions, the correctness of the upper and lower cost functions as bounds on the cost function of the original Euclidean MDP, and, under a restriction to finite time horizons, the convergence of upper and lower bounds to the actual cost values.
- We demonstrate the applicability of the general framework to analyze the accuracy of strategies learned by reinforcement learning.
Related Work. Our work is closely related to various types of MDP models proposed in different areas. Imprecise Markov chains and imprecise Markov decision processes have been considered in areas such as operations research and artificial intelligence [15, 4, 14]. The focus here typically is on approximating optimal policies for fixed, finite state spaces. In the same spirit, but from a verification point of view, [2] focuses on reachability probabilities. Lumped Markov chains are obtained by aggregating sets of states of a Markov chain into a single state. Much work is devoted to the question of when and how the resulting process again is a Markov chain (it rarely is) [13, 6]. The interplay of lumping and imprecision is considered in [7]. Most work in this area is concerned with finite state spaces.
Abstraction by state space partitioning (lumping) can be understood as a special form of partial observability (one only observes which partition element the current state belongs to). A combination of partial observability with imprecise probabilities is considered in [8]. [10] introduces abstractions of finite-state MDPs by partitioning the state space. Upper and lower bounds for reachability probabilities are obtained from the abstract MDP, which is formalized as a two-player stochastic game. [11] is concerned with obtaining accurate specifications of an abstraction obtained by state space partitioning. The underlying state space is finite, and a fixed partition is given. Thus, while there is a large amount of closely related work on abstracting MDPs by state space partitioning, and on imprecise MDPs that can result from such an abstraction, to the best of our knowledge our work is distinguished from previous work by: the consideration of infinite continuous state spaces for the underlying models of primary interest, and the focus on the properties of refinement sequences induced by partitions of increasing granularity.

Definition 1 (Euclidean MDP). A Euclidean Markov decision process (EMDP) is a tuple M = (S, G, Act, T, C), where
- S ⊆ R^K is a measurable subset of the K-dimensional Euclidean space equipped with the Borel σ-algebra B^K,
- G ⊆ S is a measurable set of goal states,
- Act is a finite set of actions,
- T : S × Act × B^K → [0, 1] defines for every a ∈ Act a transition kernel on (S, B^K), i.e., T(s, a, ·) is a probability distribution on B^K for all s ∈ S, and T(·, a, B) is measurable for all B ∈ B^K. Furthermore, the set of goal states is absorbing, i.e. for all s ∈ G and all a ∈ Act: T(s, a, G) = 1,
- C : S × Act → R_{≥0} is a cost function for state-action pairs, such that for all a ∈ Act: C(·, a) is measurable, and C(s, a) = 0 for all s ∈ G.

A run π of an MDP is a sequence of alternating states and actions s_1 a_1 s_2 a_2 ···. We denote the set of all runs of an EMDP M as Π_M. We use π_i to denote (s_i, a_i), π_{≤i} for the prefix s_1 a_1 s_2 a_2 ··· s_i a_i, and π_{>i} for the tail s_{i+1} a_{i+1} s_{i+2} a_{i+2} ··· of a run. The cost of a run is C_∞(π) := Σ_{i≥1} C(π_i). The set Π_M is equipped with the product σ-algebra (B^K ⊗ 2^Act)^∞ generated by the cylinder sets B_1 × {a_1} × ··· × B_n × {a_n} × (S × Act)^∞ (n ≥ 1, B_i ∈ B^K, a_i ∈ Act). We denote by B_+ the Borel σ-algebra restricted to the non-negative reals, and by B̄_+ its standard extension to R̄_{≥0} := R_{≥0} ∪ {∞}, i.e. the sets of the form B and B ∪ {∞}, where B ∈ B_+. Due to space constraints proofs are only included in the extended online version of this paper.

We next consider strategies for EMDPs. We limit ourselves to memoryless and stationary strategies, noting that on the rich Euclidean state space S this is less of a limitation than on finite state spaces, since a non-stationary, time-dependent strategy can here be turned into a stationary strategy by adding one real-valued dimension representing time.

Definition 2 (Strategy). A (memoryless, stationary) strategy for an MDP M is a function σ : S → (Act → [0, 1]), mapping states to probability distributions over Act, such that for every a ∈ Act the function s ↦ σ(s)(a) is measurable.

The following lemma is mostly a technicality that needs to be established in order to ensure that an MDP in conjunction with a strategy and an initial state distribution defines a Markov process on S × Act, and hence a probability distribution on Π_M.

Lemma 2. If σ is a strategy, then T_σ((s, a), B × {a'}) := ∫_B σ(t)(a') T(s, a, dt) is a transition kernel on (S × Act, B^K ⊗ 2^Act).
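As a concrete illustration of Definitions 1 and 2, and of the run distribution enabled by Lemma 2, the following Python sketch implements the semi-random walker from the introduction and estimates the expected cost of a strategy (formalized in Definition 3 below) by Monte Carlo simulation of runs. It is an illustrative sketch only: apart from the costs 1 and 3 and the uniform successor square, everything else (the offsets δ(a), τ(a), the uncertainty radius ε, the handling of missed deadlines, the step cap and the example strategy) is an assumption made for illustration.

```python
import random

# Semi-random walker on S = [0, x_max] x [0, t_max].  Only the costs (1 and 3) and the
# uniform successor square come from the text; the numeric offsets are assumed values.
PARAMS = {"fast": (0.25, 0.15, 3.0),   # (delta(a), tau(a), cost C(s, a))
          "slow": (0.10, 0.20, 1.0)}
EPS = 0.05                             # uncertainty radius epsilon (assumed value)

def is_goal(state):
    """Goal: the x = 1 finishing line is crossed no later than time t = 1."""
    x, t = state
    return x >= 1.0 and t <= 1.0

def step(state, action, rng=random):
    """Sample a successor uniformly from the square centred at (x + delta(a), t + tau(a))."""
    x, t = state
    dx, dt, cost = PARAMS[action]
    return (rng.uniform(x + dx - EPS, x + dx + EPS),
            rng.uniform(t + dt - EPS, t + dt + EPS)), cost

def simulate_run(strategy, s0, max_steps=200, rng=random):
    """Sample one run s1 a1 s2 a2 ... under a memoryless strategy and return its cost."""
    state, total = s0, 0.0
    for _ in range(max_steps):                 # practical truncation of the run
        if is_goal(state) or state[1] > 1.0:   # absorbed in G, or deadline missed (assumed handling)
            return total
        dist = strategy(state)                 # distribution over actions, as in Definition 2
        action = rng.choices(list(dist), weights=list(dist.values()))[0]
        state, cost = step(state, action)
        total += cost
    return total

def expected_cost(strategy, s0, runs=10_000):
    """Monte Carlo estimate of E_sigma(C, s0)."""
    return sum(simulate_run(strategy, s0) for _ in range(runs)) / runs

def fast_when_behind(state):
    """Example strategy: move fast while behind schedule (t > x), slow otherwise."""
    x, t = state
    return {"fast": 1.0} if t > x else {"slow": 1.0}

print(expected_cost(fast_when_behind, (0.0, 0.0)))
```

Such simulation estimates come with no guarantees; the induced IMDPs introduced below are used precisely to obtain certified lower and upper bounds on these expected costs.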
Usually, an initial state distribution will be given by a fixed initial state s = s_1. We then denote the resulting distribution over Π_M by P_{s,σ} (this also depends on the underlying M; to avoid notational clutter, we do not always make this dependence explicit in the notation).

Definition 3 (Expected Cost). Let s ∈ S. The expected cost at s under strategy σ is the expectation of C_∞ under the distribution P_{s,σ}, denoted E_σ(C, s). The expected cost at initial state s then is defined as E(C, s) := inf_σ E_σ(C, s).

Example 1. If s ∈ G, then for any strategy σ: P_{s,σ}(∩_{i≥1} {s_i ∈ G}) = 1, and hence E(C, s) = 0. However, E(C, s) = 0 can also hold for s ∉ G, since C(s, a) = 0 also is allowed for non-goal states s.

Note that, for any strategy σ, the functions E_σ(C, ·) and E(C, ·) are [0, ∞]-valued measurable functions on S. This follows by measurability of C(·, a) and σ(·)(a), for all a ∈ Act, and [1, Theorem 13.4]. We next show that expected costs in EMDPs can be computed by value iteration. Our results are closely related to Theorem 7.3.10 in [12]. However, our scenario differs from the one treated by Puterman [12] in that we deal with uncountable state spaces, and in that we want to permit infinite cost values. Adapting Puterman's notation [12], we introduce two operators, L and L_σ, on [0, ∞]-valued measurable functions E on S, defined as follows:

(L E)(s) := min_{a ∈ Act} ( C(s, a) + ∫_S E(t) T(s, a, dt) ),
(L_σ E)(s) := Σ_{a ∈ Act} σ(s)(a) ( C(s, a) + ∫_S E(t) T(s, a, dt) ).

The operators above are well-defined: LE and L_σ E are again [0, ∞]-valued measurable functions on S. Since the set of actions Act is finite, for every E we can define a deterministic strategy d such that LE = L_d E. We can establish an even stronger relation: L = inf_σ L_σ (Lemma 4). As a first main step we can show that the expected cost under the strategy σ is a fixed point for the operator L_σ:

Proposition 1. For any strategy σ, E_σ(C, ·) = L_σ E_σ(C, ·).

As a corollary of Lemma 4 and Proposition 1, E(C, ·) is a pre-fixpoint of the L operator. Moreover, we can show that it is the least pre-fixpoint of L (Proposition 2). By Proposition 2 and the Tarski fixed point theorem, E(C, ·) is the least fixed point of L. The following theorem provides us with a stronger result, namely that E(C, ·) is the supremum of the point-wise increasing chain L_n := L^n⊥ (n ≥ 1), where ⊥ is the function that is constant 0 on S, and L_∞ := sup_{n≥0} L_n. It thereby states that value iteration converges to E(C, ·).

Theorem 1. E(C, ·) = L_∞.

The value iteration of Theorem 1 is a mathematical process, not an algorithmic one, as it is defined pointwise on the uncountable state space S. Our goal, therefore, is to approximate the expected cost function E(C, ·) of an EMDP by expected cost functions on finite state spaces consisting of partitions of S. In order to retain sufficient information about the original EMDP to be able to derive provable upper and lower bounds for E(C, ·), we approximate the EMDP by an Imprecise Markov Decision Process (IMDP) [15].

Definition 4 (Imprecise MDP). An imprecise Markov decision process (IMDP) is a tuple (S, G, Act, T*, C*), where
- S is a finite set of states,
- G ⊆ S is the set of goal states,
- Act is a finite set of actions,
- T* : S × Act → 2^(S→R_{≥0}) assigns to state-action pairs a closed set of probability distributions over S; the set of goal states is absorbing, i.e., for all s ∈ G and all T(s, a) ∈ T*(s, a): Σ_{t∈G} T(s, a)(t) = 1,
- C* : S × Act → 2^{R_{≥0}} assigns to state-action pairs a closed set of costs, such that for all s ∈ G, a ∈ Act: C*(s, a) = {0}.

Memoryless, stationary strategies σ are defined as before. In order to turn an IMDP into a fully probabilistic model, one also needs to resolve the choice of a transition probability distribution and cost value.

Definition 5 (Adversary, Lower/Upper expected cost).
An adversary α for an IMDP consists of two functions: α_T, which selects for every state-action pair (s, a) a transition probability distribution α_T(s, a) ∈ T*(s, a), and α_C, which selects for every (s, a) a cost α_C(s, a) ∈ C*(s, a). A strategy σ, an adversary α, and an initial state s together define a probability distribution P_{s,σ,α} over runs π with s_1 = s, and hence the expected cost E_{σ,α}(C*(π), s). We then define the lower and upper expected cost as

E^min(C*(π), s) := min_σ min_α E_{σ,α}(C*(π), s),   (3)
E^max(C*(π), s) := min_σ max_α E_{σ,α}(C*(π), s).   (4)

Since T*(s, a) and C*(s, a) are required to be closed sets, we can here write min_α and max_α rather than inf_α, sup_α. Furthermore, the closure conditions are needed to justify a restriction to stationary adversaries, as the following example shows (cf. also Example 7.3.2 in [12]). Since there is only one action, there is only one strategy σ. Then, if the adversary at the i'th step selects transition probabilities (ε_i, 1 − ε_i, 0), one obtains E^min(C*(π), s_1) = 1 − δ. For every stationary adversary the transition from s_1 to s_2 will be taken eventually with probability 1, so that here E^min(C*(π), s_1) = 1.

We note that only in the case of E^max does α act as an "adversary" to the strategy σ. In the case of E^min, σ and α represent co-operative strategies. In other definitions of imprecise MDPs only the transition probabilities are set-valued [15]. Here we also allow an imprecise cost function. Note, however, that for the definition of E^min(C*, s) and E^max(C*, s) the adversary's strategy α_C will simply be to select the minimal (respectively maximal) possible costs, and that we can also obtain E^min, E^max as the expected lower/upper costs on IMDPs with point-valued cost functions, where then the adversary has no choice for the strategy α_C.

We now characterize E^min, E^max as limits of value iteration, again following the strategy of the proof of Theorem 7.3.10 of [12]. In this case, the proof has to be adapted to accommodate the additional optimization of the adversary, and, as in Section 2.1, to allow for infinite costs. We again start by defining suitable operators L^min, L^max on [0, ∞]-valued functions C defined on S:

(L^opt C)(s) := min_{a ∈ Act} ( opt_{c ∈ C*(s,a)} c + opt_{T ∈ T*(s,a)} Σ_{t∈S} T(t) C(t) ),   (5)

where opt ∈ {min, max}. The mapping that assigns to each (s, a) an optimizing T ∈ T*(s, a) in (5) defines the α_T component of an adversary. Similarly, the mapping that assigns to each s a minimizing action a ∈ Act defines a strategy. Let ⊥ be the function that is constant 0 on S. Denote L^opt_n := (L^opt)^n ⊥, and L^opt_∞ := sup_{n≥0} L^opt_n. We can now state the applicability of value iteration for IMDPs as follows:

Theorem 2. Let opt ∈ {min, max}. Then E^opt(C*(π), ·) = L^opt_∞.   (9)

We note that even though L^opt, in contrast to the L operator for EMDPs, now only needs to be computed over a finite state space, we do not obtain from Theorem 2 a fully specified algorithmic procedure for the computation of E^opt, because the optimization over T*(s, a) contained in (5) will require customized solutions that depend on the structure of the T*(s, a).

From now on we only consider EMDPs whose state space S is a compact subset of R^K. We approximate such a Euclidean MDP by IMDPs constructed from finite partitions of S. In the following, we denote with A = {ν_1, ..., ν_|A|} ⊂ 2^S a finite partition of S. We call an element ν ∈ A a region and shall assume that each such ν is Borel measurable. For s ∈ S we denote by [s]_A the unique region ν ∈ A such that s ∈ ν. The diameter of a region is δ(ν) := sup_{s,s' ∈ ν} ||s − s'||, and the granularity of A is defined as δ(A) := max_{ν∈A} δ(ν). We say that a partition B refines a partition A if for any ν ∈ B there exists µ ∈ A with ν ⊆ µ. We write A ⪯ B in this case.
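The experiments reported later use successively refined partitions parameterized by ∆, which we read here as uniform grids with square cells of side ∆ (an assumption about the experimental setup). The following small Python sketch makes the notation [s]_A, δ(A) and ⪯ concrete for such grid partitions; the default bounds x_max = t_max = 1 and the treatment of boundary points are likewise assumptions.

```python
import math

def region_index(state, cell, x_max=1.0, t_max=1.0):
    """[s]_A for a uniform grid partition A with square cells of side `cell`."""
    x, t = state
    i = min(int(x / cell), int(round(x_max / cell)) - 1)   # clamp boundary points into the last cell
    j = min(int(t / cell), int(round(t_max / cell)) - 1)
    return (i, j)

def granularity(cell):
    """delta(A): the maximal region diameter; for square cells of side `cell` this is cell * sqrt(2)."""
    return cell * math.sqrt(2)

def refines(cell_b, cell_a):
    """A grid with cell size cell_b refines one with cell size cell_a (A <= B) iff the cells nest."""
    ratio = cell_a / cell_b
    return ratio >= 1 and abs(ratio - round(ratio)) < 1e-9
```

For instance, cell sizes 0.1, 0.05 and 0.025 (the values of ∆ appearing in the experiments below) give a chain A_1 ⪯ A_2 ⪯ A_3 of partitions with strictly decreasing granularity.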
A Euclidean MDP M = (S, G, Act, T, C) and a partition A of S induce an abstracting IMDP [10, 11] according to the following definition.

Definition 6 (Induced IMDP). Let M = (S, G, Act, T, C) be an EMDP, and let A be a finite partition of S consistent with G in the sense that for any ν ∈ A either ν ⊆ G or ν ∩ G = ∅. The IMDP defined by M and A then is M_A = (A, G_A, Act, T*_A, C*_A), where
- G_A = {ν ∈ A | ν ⊆ G},
- T*_A(ν, a) = cl({T_A(s, a) | s ∈ ν}), where T_A(s, a) is the marginal of T(s, a, ·) on A, i.e. T_A(s, a)(ν') = ∫_{ν'} T(s, a, dt), and cl denotes topological closure,
- C*_A(ν, a) = cl({C(s, a) | s ∈ ν}).

The following theorem states how an induced IMDP approximates the underlying Euclidean MDP. In the following, we use subscripts on expectation operators to identify the (I)MDPs that define the expectations.

Theorem 3. Let M be an EMDP with compact state space S, and let A be a finite partition of S consistent with G. Then for all s ∈ S:

E^min_{M_A}(C*(π), [s]_A) ≤ E_M(C, s) ≤ E^max_{M_A}(C*(π), [s]_A).   (10)

If A ⪯ B, then B improves the bounds in the sense that E^min_{M_A}(C*(π), [s]_A) ≤ E^min_{M_B}(C*(π), [s]_B) and E^max_{M_B}(C*(π), [s]_B) ≤ E^max_{M_A}(C*(π), [s]_A).

Our goal now is to establish conditions under which the approximation (10) becomes arbitrarily tight for sufficiently fine partitions. This will require certain continuity conditions for M, as spelled out in the following definition. In the following, d_tv stands for the total variation distance between distributions. Note that we will be using d_tv both for discrete distributions on partitions A, and for continuous distributions on S.

Definition 7 (Continuity). An EMDP M is continuous if for all a ∈ Act: the mapping s ↦ T(s, a, ·) is continuous with respect to d_tv, and the mapping s ↦ C(s, a) is continuous.

We observe that due to the assumed compactness of S, the first condition of Definition 7 is satisfied if T is defined as a function T(s, a, t) on S × Act × S that for each a, as a function of (s, t), is continuous on S × S, and such that T(s, a, ·) is for all s, a a density function relative to the Lebesgue measure.

We next introduce some notation for N-step expectations and distributions. In the following, we use τ to denote strategies for induced IMDPs defined on partitions A, whereas σ is reserved for strategies defined on Euclidean state spaces S. For a given partition A and strategy τ for M_A, let α^+, α^− denote two strategies for the adversary (to be interpreted as strategies that are close to achieving sup_α E_{τ,α}(C*(π), ·) and inf_α E_{τ,α}(C*(π), ·), respectively, even though we will not explicitly require properties that derive from this interpretation). We then denote with P^N_{τ,α^+}, P^N_{τ,α^−} the distributions defined by τ, α^+ and τ, α^− on run prefixes of length N, and with E^N_{τ,α^+}, E^N_{τ,α^−} the corresponding expectations for the sum of the first N costs, C^(N)(π) := Σ_{i=1}^N C(π_i). The P^N and E^N also depend on the initial state ν_1. To avoid notational clutter, we do not make this explicit in the notation. We then obtain the approximation guarantee stated in Theorem 4, whose two bounds we refer to as (13) and (14). Theorem 4 is a strengthening of Theorem 2 in [9]. The latter applied to processes that are guaranteed to terminate within N steps. Our new theorem applies to the expected cost of the first N steps in a process of unbounded length. When the process has a bounded time horizon of no more than N steps, and if we let τ, α^+, α^− be the strategy and the adversaries that achieve the optima in (3), respectively (4), then (13) yields the convergence of both the lower and the upper expected costs of the induced IMDPs to E_M(C, s) as δ(A) → 0. We conjecture that this actually also holds true for arbitrary EMDPs: let A_1 ⪯ ··· ⪯ A_i ⪯ ··· be a sequence of partitions consistent with G such that lim_{i→∞} δ(A_i) = 0; then for all s ∈ S:

lim_{i→∞} E^min_{M_{A_i}}(C*(π), [s]_{A_i}) = lim_{i→∞} E^max_{M_{A_i}}(C*(π), [s]_{A_i}) = E_M(C, s).

The approximation guarantees given by Theorems 3 and 4 have two important implications: first, they guarantee the correctness and asymptotic accuracy of upper/lower bounds computed by value iteration in IMDP abstractions of the underlying EMDP. Second, they show that the hypothesis space of strategies defined over finite partitions that underlies the reinforcement learning approach of [9] is adequate in the sense that it contains strategy representations that approximate the optimal strategy for the underlying continuous domain arbitrarily well.
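For a finite IMDP the value iteration of Theorem 2 is directly implementable once the inner optimization over T*(s, a) can be carried out. The sketch below assumes the simplest representation: each T*(s, a) is given by a finite list of candidate distributions (for an induced IMDP M_A one could, for instance, use the marginals T_A(s, a) of finitely many sample states s ∈ ν), and each C*(s, a) by an interval. With such a finite approximation of the closed sets, the computed values estimate rather than certify the bounds of Theorem 3; all identifiers are placeholders.

```python
from typing import Dict, List, Tuple

Dist = Dict[str, float]                          # successor region -> probability
Spec = Tuple[List[Dist], Tuple[float, float]]    # (candidate distributions, cost interval [c_min, c_max])

def value_iteration(states, actions, goal, model: Dict[Tuple[str, str], Spec],
                    opt: str = "min", iters: int = 1000):
    """Approximate E^opt (opt in {'min', 'max'}) by iterating the operator L^opt from the bottom element."""
    inner = min if opt == "min" else max
    value = {s: 0.0 for s in states}             # bottom element: constant 0
    for _ in range(iters):
        new = {}
        for s in states:
            if s in goal:
                new[s] = 0.0                     # goal regions are absorbing with cost 0
                continue
            candidates = []
            for a in actions:
                dists, (c_min, c_max) = model[(s, a)]
                cost = c_min if opt == "min" else c_max                 # adversary resolves C*(s, a)
                move = inner(sum(p * value[t] for t, p in d.items())    # adversary resolves T*(s, a)
                             for d in dists)
                candidates.append(cost + move)
            new[s] = min(candidates)             # the strategy always minimises over actions
        value = new
    return value
```

Running this with opt='min' and opt='max' on the same induced model yields approximations of the lower and upper cost functions of Theorem 3; the iterates increase monotonically from the bottom element, so a finite iteration budget simply truncates the supremum L^opt_∞ of Theorem 2.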
We now use our semi-random walker example to illustrate the theory presented in the preceding sections, and to demonstrate its applicability to the validation of machine learning models. We first illustrate experimentally the bounds and convergence properties expressed by Theorems 3 and 4. Figure 3 shows the upper and lower expected costs that we obtain from the induced IMDPs. One can see how the intervals narrow with successive partition refinements. The bounds on the section S_0 are closer and converge more uniformly than on S_0.7. This shows that in the upper left region of the state space (x < 0.5, t ≥ 0.7) the adversary has a greater influence on the process than in the lower part of the state space (x ≈ 0), and the difference between a cooperative and a non-cooperative adversary is more pronounced.

Ultimately, induced strategies are of greater interest than the concrete cost functions. Once upper and lower expectations define the same strategy, further refinement may not be necessary. Figure 4 illustrates for the whole state space S the strategies σ obtained from the lower (Equation (3)) and upper (Equation (4)) approximations. On regions colored blue and yellow, both strategies agree to take the fast and slow actions, respectively. The regions colored light green are those where the lower bound strategy chooses the fast action, and the upper bound strategy the slow action. Conversely for the regions colored light red. One can observe how the blue and yellow areas increase in size with successive partition refinements. However, this growth is not entirely monotonic: for example, some regions in the upper left that for ∆ = 0.1 are yellow are sub-divided in successive refinements ∆ = 0.05, 0.025 into regions that are partly yellow, partly light green.

We now turn to partitions computed by the reinforcement learning method developed in [9], and a comparison of the learned cost functions and strategies with those obtained from the induced IMDPs. We have implemented the semi-random walker in UPPAAL STRATEGO and used reinforcement learning to learn partitions, cost functions and strategies. Our learning framework produces a sequence of refinements, based on sampling 100 additional runs for each refinement. In the following we consider the models learned after k = 27 and k = 205 refinements. Figure 5 illustrates expected cost functions for the partition learned at k = 205. One can observe a strong correlation between the bounds and the learned costs. Nevertheless, the learned cost function sometimes lies outside the given bounds. This is to be expected, since the random sampling process may produce data that is not sufficiently representative to estimate costs for some regions.

Turning again to the strategies obtained on the whole state space, we first note that the learned strategy at k = 205, which is shown in Figure 1 (right), exhibits an overall similarity with the strategies illustrated in Figure 4, with the fast action preferred along a diagonal region in the middle of the state space. To understand the differences between the learning and IMDP results, it is important to note that in the learning setting s_0 = (0, 0) is taken to be the initial state of interest, and all sampling starts there. As a result, regions that are unlikely to be reached (under any choice of actions) from this initial state will obtain very little relevant data, and therefore unreliable cost estimates. This is not necessarily a disadvantage, if we want to learn an optimal control strategy for processes starting at s_0. The value iteration process, in contrast, does not take into account the distinguished nature of s_0.
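The strategy comparisons visualised in Figures 4 and 6 can be reproduced from two value functions computed as above: for every region one extracts the action that is greedy with respect to the lower and to the upper cost function, records where the two agree, and checks whether a given learned strategy is supported by at least one of them. A minimal sketch, reusing the model format of the `value_iteration` listing above (representing the learned strategy as a plain region-to-action mapping is an assumption about its interface):

```python
def greedy_action(s, value, model, actions, opt):
    """The action chosen in region s by the strategy induced by `value` (cf. Equations (3)/(4))."""
    inner = min if opt == "min" else max
    def score(a):
        dists, (c_min, c_max) = model[(s, a)]
        cost = c_min if opt == "min" else c_max
        return cost + inner(sum(p * value[t] for t, p in d.items()) for d in dists)
    return min(actions, key=score)

def agreement_map(states, actions, goal, model, value_lo, value_hi, learned=None):
    """Classify non-goal regions as in Figure 4 (learned=None), or validate a learned strategy as in Figure 6."""
    report = {}
    for s in states:
        if s in goal:
            continue
        lo = greedy_action(s, value_lo, model, actions, "min")   # strategy from the lower approximation
        hi = greedy_action(s, value_hi, model, actions, "max")   # strategy from the upper approximation
        if learned is None:
            report[s] = lo if lo == hi else ("lower: " + lo, "upper: " + hi)
        else:
            report[s] = "supported" if learned[s] in (lo, hi) else "unsupported"
    return report
```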
Figure 6 provides a detailed picture of the consistency of the strategies learned at k = 27 and k = 205 with the strategies obtained from value iteration over the same partitions. Drawn in blue/yellow are those regions where the learned strategy picks the fast/slow action, and at least one of the upper or lower bound strategies selects the same action. Light blue are those regions where the learned strategy chooses the fast action, but both IMDP strategies select slow. In a single region in the k = 205 partition (drawn in light yellow) the learned strategy chooses the slow action, while both IMDP strategies select fast. As Figure 6 shows, the areas of greatest discrepancy (light blue) are those in the top left and bottom right, which are unlikely to be reached from the initial state (0, 0).

In this paper we have developed theoretical foundations for the approximation of Euclidean MDPs by finite-state imprecise MDPs. We have shown that bounds on the cost function computed on the basis of the IMDP abstractions are correct, and that for bounded time horizons they converge to the exact costs when the IMDP abstractions are refined. We conjecture that this convergence also holds for the total cost of (potentially) infinite runs. The results we obtained here provide theoretical underpinnings for the learning approach developed in [9]. Upper and lower bounds computed from induced IMDPs can be used to check the accuracy of learned value functions. As we have seen, data sparsity and sampling variance can make the learned cost functions fall outside the computed bounds. One can also use value iteration on IMDP approximations directly as a tool for computing cost functions and strategies, which then would come with stronger guarantees than what we obtain through learning. However, compared to the learning approach, this has important limitations: first, we will usually only obtain a partial strategy that is uniquely defined only where upper and lower bounds lead to the same actions. Second, we will require a full model of the underlying EMDP, from which IMDP abstractions then can be derived, and the optimization problem over adversaries that is part of the value iteration process must be tractable. Reinforcement learning, on the other hand, can also be applied to black-box systems, and its computational complexity is essentially independent of the complexities of the underlying dynamic system.

The following lemma collects some basic facts about the total variation distance:

Lemma 5. Let A be a finite set, and let P, P' be distributions on A with d_tv(P, P') ≤ ε.
A. Let f, f' be functions on A with values in R_{≥0} such that |f(ν) − f'(ν)| ≤ ε for all ν. Then the bound (16) on the difference of the expectations holds, where E, E' denote expectation under P and P', respectively.
B. For each ν ∈ A let Q_ν, Q'_ν be distributions on a space S (discrete or continuous), such that d_tv(Q_ν, Q'_ν) ≤ ε for all ν. Then the total variation distance between the two induced distributions Σ_ν P(ν)Q_ν and Σ_ν P'(ν)Q'_ν is bounded accordingly.

Proof. For A we write out the difference of the expectations, and then (16) follows. The proof for B is very similar: using the definition of total variation as d_tv(P, P') = sup_{S'⊆S} |P(S') − P'(S')|, the first term on the right can be bounded by ε, and the second by 2ε.

Proof. For each i, π ↦ C(π_i) is (B^K ⊗ 2^Act)^∞–B̄_+ measurable according to the measurability condition on C. It follows that also C^(N)(π) and C_∞(π) are measurable.

Lemma 2. If σ is a strategy, then T_σ((s, a), B × {a'}) := ∫_B σ(t)(a') T(s, a, dt) is a transition kernel on (S × Act, B^K ⊗ 2^Act).
From the above, measurability of LE follows from the measurability of C(·, a), for all a ∈ Act, and of minima of measurable functions [1, Theorem 13.4]. Measurability of L_σ E follows similarly by additionally noticing that for any strategy σ, the [0, 1]-valued function σ(·)(a) is measurable, for all a ∈ Act.

Proof. inf_σ L_σ ≤ L follows by noticing that L = inf_d L_d, where d ranges only over deterministic strategies. To establish the reverse inequality, notice that, for all σ and s ∈ S, Σ_{a∈Act} σ(s)(a) = 1. Thus, L ≤ L_σ, for all strategies σ. From this we obtain inf_σ L_σ ≥ L.

Proposition 1. For any strategy σ, E_σ(C, ·) = L_σ E_σ(C, ·).

Proof. We have to show that the following holds for all states s ∈ S: E_σ(C, s) = (L_σ E_σ(C, ·))(s) (18). By the monotone convergence theorem and linearity of the integral, we have the decomposition E_σ(C, s) = E_{s,σ}[C(π_1)] + E_{s,σ}[C_∞(π_{>1})] (19). By definition, the first expectation in (19) is just ∫_{π∈Π} C(π_1) P_{s,σ}(dπ) = Σ_{a∈Act} C(s, a) · σ(s)(a), and by a change of variable in the integral, the second expectation in (19) is Σ_{a∈Act} σ(s)(a) ∫_S E_σ(C, t) T(s, a, dt). Thus, (18) follows.

Proof. By Lemma 4, Proposition 1 and monotonicity of L_σ we have, for every strategy σ, L E(C, ·) ≤ L_σ E(C, ·) ≤ L_σ E_σ(C, ·) = E_σ(C, ·), and hence L E(C, ·) ≤ E(C, ·). Next we prove that if E ≥ LE, then E ≥ E(C, ·). By induction on n ≥ 1, we prove that, for all s ∈ S and strategies σ,

((L_σ)^n E)(s) ≥ ∫_Π C^(n)(π) P_{s,σ}(dπ).   (20)

The base case n = 1 follows by definition of P_{s,σ}(dπ) and because E is positive. As for the inductive step, assume (20) holds for n ≥ 1; the claim for n + 1 then follows by unfolding L_σ once more. Let d be the deterministic strategy such that LE = L_d E. By hypothesis, E ≥ LE, and by monotonicity of L_d, we obtain E ≥ (L_d)^n E, for all n ≥ 1. Thus, by (20) and the monotone convergence theorem, for all s ∈ S, E(s) ≥ sup_{n≥1} ∫_Π C^(n)(π) P_{s,d}(dπ) = E_d(C, s). Since E(C, s) = inf_σ E_σ(C, s), from the above we have E(s) ≥ E(C, s).

Proof. The chain ⊥ ≤ L_1 ≤ L_2 ≤ ... is monotonically increasing. This is immediate from ⊥ ≤ L⊥ and monotonicity of the operator L. Next we show that L_∞ is a fixed point of the L operator. Clearly, ⊥ ≤ L L_∞, and by monotonicity of L, for all n ≥ 1, L_n ≤ L L_∞. Hence L_∞ ≤ L L_∞. Now we establish L L_∞ ≤ L_∞. If L_∞(s) = ∞, the inequality holds trivially at s. Assume L_∞(s) < ∞. Then there exists a sequence (a_n)_{n≥0} in Act such that

L_∞(s) = sup_{n≥0} L_{n+1}(s) = sup_{n≥0} ( C(s, a_n) + ∫_S L_n(t) T(s, a_n, dt) ).

Let S_∞ = {t ∈ S | L_∞(t) = ∞}. In the following we show that

∃N ≥ 0 such that, ∀n ≥ N: T(s, a_n, S_∞) = 0.   (21)

If S_∞ = ∅, (21) holds trivially. Let S_∞ ≠ ∅. Assume by contradiction that for all N ≥ 0 there exists n ≥ N such that T(s, a_n, S_∞) > 0. This is equivalent to the existence of a subsequence (a_{n_k}) such that for all k, T(s, a_{n_k}, S_∞) > 0. For b ∈ R and n ≥ 0, denote by E^n_b the set {t ∈ S | L_n(t) ≥ b}. Then S_∞ = ∩_{b∈R} ∪_{n≥0} E^n_b (22). Moreover, for all b, b' ∈ R and n ≥ 0, if b ≥ b' then E^n_b ⊆ E^n_{b'}, and by monotonicity of the operator L, E^n_b ⊆ E^{n+1}_b. Thus, by [1, Theorem 10.2], for all a_{n_k} and b ∈ R, T(s, a_{n_k}, ∪_{n≥0} E^n_b) = lim_{n→∞} T(s, a_{n_k}, E^n_b) (23). Since T(s, a_{n_k}, S_∞) > 0, by (22), for all a_{n_k} and all b ∈ R, T(s, a_{n_k}, ∪_{n≥0} E^n_b) > 0. Consequently, by (23), for all b ∈ R there exists k such that T(s, a_{n_k}, E^{n_k}_b) > 0. Thus, since b can assume arbitrarily large values, L_∞(s) = ∞. This contradicts our initial assumption that L_∞(s) < ∞. Therefore (21) must hold. By (21), for all n ≥ N, ∫_S L_n(t) T(s, a_n, dt) = ∫_{S∖S_∞} L_n(t) T(s, a_n, dt). Thus the following holds:

L_∞(s) = sup_{n≥N} ( C(s, a_n) + ∫_S L_n(t) T(s, a_n, dt) ) = sup_{n≥N} ( C(s, a_n) + ∫_S L_∞(t) T(s, a_n, dt) ) + ∆(s).

Hence, if ∆(s) = 0, we get L_∞(s) ≥ L L_∞(s). The finiteness of Act ensures the existence of an action a ∈ Act repeating infinitely often in (a_n)_{n≥N}. Thus there exists a subsequence (n_k) such that, for all n_k, ∫_S ( L_{n_k}(t) − L_∞(t) ) T(s, a_{n_k}, dt) tends to 0, and hence ∆(s) = 0. Finally, we show that L_∞ = E(C, ·).
By monotonicity of L, for all n ≥ 0, we have L_n ≤ E(C, ·). Hence L_∞ ≤ E(C, ·). The reverse inequality L_∞ ≥ E(C, ·) follows by Proposition 2, since L_∞ ≥ L L_∞.

Theorem 2. Let opt ∈ {min, max}. Then E^opt(C*(π), ·) = L^opt_∞.   (9)

Proof. Step 1: The sequence L^opt_k is monotonically increasing: this is immediate from the facts that L^opt_1 ≥ ⊥ because C^opt ≥ 0, and C ≥ C' ⇒ L^opt C ≥ L^opt C'.

Step 2: We show that L^opt_∞ is a fixed point of the L^opt operator. Let S^{opt,∞} := {s ∈ S | L^opt_∞(s) = ∞} and S^{opt,<∞} := S ∖ S^{opt,∞}. By monotonicity, for s ∈ S^{opt,∞} we have L^opt L^opt_∞(s) = ∞. Now let s ∈ S^{opt,<∞}. We define, separately for the two cases of opt, the set Act^{opt,<∞}(s) of actions that after the adversary's choice do not lead to infinite-cost states. The set Act^{opt,<∞}(s) is non-empty (the closedness of T* is again required here), and we can limit the optimization in (5) to actions from Act^{opt,<∞}(s). Moreover, the restriction of the minimization to actions from Act^{opt,<∞}(s) already is valid for the definition of L^opt L^opt_n(s) for all sufficiently large n. We have that L^opt_n → L^opt_∞ uniformly on the (finite) set S^{opt,<∞}. It follows that for all s ∈ S^{opt,<∞} and a ∈ Act^{opt,<∞}(s) the limit n → ∞ can be exchanged with the optimization over T*(s, a), and hence L^opt L^opt_∞(s) = L^opt_∞(s).

Step 3: E^opt is the least fixed point of the L^opt operator. That E^opt is a fixed point follows immediately from our restriction to stationary and memoryless strategies. Let C ≥ 0 be an arbitrary fixed point of L^opt. Recalling (7), we can then write C(s) as an optimization of the form (5); from the optimizing choices one obtains a strategy and an adversary witnessing C ≥ E^opt.

Proof (Theorem 4). The proof is by induction on N. By the continuity conditions of Definition 7 and the compactness of S, for every ε > 0 there is a δ > 0 such that for all ν with δ(ν) ≤ δ and all a ∈ Act: C^max(ν, a) − C^min(ν, a) ≤ ε, and d_tv(T(s, a, ·), T(s', a, ·)) ≤ ε for all s, s' ∈ ν. Let A have granularity ≤ δ. We then decompose the difference of the N-step expectations into the contribution of the first N − 1 steps and that of the last step. By the induction hypothesis, the left term is bounded by ε/2. According to Lemma 5 A, the right term is bounded by ε/2, thus yielding (13). The bound (14) directly follows from Lemma 5 B.

References
[1] Probability and Measure.
[2] On the complexity of model checking interval-valued discrete time Markov chains.
[3] Measure Theory. Birkhäuser.
[4] Imprecise Markov chains with an absorbing state.
[5] Uppaal Stratego.
[6] Optimal state-space lumping in Markov chains.
[7] Computing inferences for large-scale continuous-time Markov chains by combining lumping with imprecision.
[8] Partially observable Markov decision processes with imprecise parameters.
[9] Teaching Stratego to play ball: Optimal synthesis for continuous space MDPs.
[10] Game-based abstraction for Markov decision processes.
[11] Approximate abstractions of Markov chains with interval decision processes.
[12] Markov Decision Processes.
[13] A finite characterization of weak lumpable Markov processes. Part I: The discrete time case.
[14] Using imprecise continuous time Markov chains for assessing the reliability of power networks with common cause failure and non-immediate repair.
[15] Markov decision processes with imprecise transition probabilities.