Journal of Artificial Intelligence Research 64 (2019) 817-859. Submitted 07/18; published 03/19.

Modeling and Planning with Macro-Actions in Decentralized POMDPs

Christopher Amato (camato@ccs.neu.edu), Khoury College of Computer Sciences, Northeastern University, Boston, MA 02115 USA
George Konidaris (gdk@cs.brown.edu), Department of Computer Science, Brown University, Providence, RI 02912 USA
Leslie P. Kaelbling (lpk@csail.mit.edu), MIT Computer Science and Artificial Intelligence Laboratory, Cambridge, MA 02139 USA
Jonathan P. How (jhow@mit.edu), MIT Laboratory for Information and Decision Systems, Cambridge, MA 02139 USA

Abstract

Decentralized partially observable Markov decision processes (Dec-POMDPs) are general models for decentralized multi-agent decision making under uncertainty. However, they typically model a problem at a low level of granularity, where each agent's actions are primitive operations lasting exactly one time step. We address the case where each agent has macro-actions: temporally extended actions that may require different amounts of time to execute. We model macro-actions as options in a Dec-POMDP, focusing on actions that depend only on information directly available to the agent during execution. Therefore, we model systems where coordination decisions only occur at the level of deciding which macro-actions to execute. The core technical difficulty in this setting is that the options chosen by each agent no longer terminate at the same time. We extend three leading Dec-POMDP algorithms for policy generation to the macro-action case, and demonstrate their effectiveness in both standard benchmarks and a multi-robot coordination problem. The results show that our new algorithms retain agent coordination while allowing high-quality solutions to be generated for significantly longer horizons and larger state-spaces than previous Dec-POMDP methods. Furthermore, in the multi-robot domain, we show that, in contrast to most existing methods that are specialized to a particular problem class, our approach can synthesize control policies that exploit opportunities for coordination while balancing uncertainty, sensor information, and information about other agents.

1. Introduction

The Dec-POMDP (Bernstein, Givan, Immerman, & Zilberstein, 2002; Oliehoek & Amato, 2016) is a general framework for decentralized sequential decision-making under uncertainty and partial observability. Dec-POMDPs model problems where a team of agents shares the same objective function, but where each individual agent can only make noisy, partial observations of the environment. Solution methods for Dec-POMDPs aim to produce policies that optimize reward while considering uncertainty in action outcomes, sensors, and information about the other agents. Although much research has been conducted on solution methods for Dec-POMDPs, solving large instances remains intractable.
Advances have been made in optimal algorithms (see, for example, Amato, Chowdhary, Geramifard, Ure, & Kochenderfer, 2013; Amato, Dibangoye, & Zilberstein, 2009; Aras, Dutech, & Charpillet, 2007; Boularias & Chaib-draa, 2008; Dibangoye, Amato, Buffet, & Charpillet, 2013; Oliehoek, Spaan, Amato, & Whiteson, 2013; Dibangoye, Amato, Buffet, & Charpillet, 2016), but most approaches that scale well make very strong assumptions about the domain (e.g., assuming a large amount of independence between agents) (Dibangoye, Amato, Doniec, & Charpillet, 2013; Melo & Veloso, 2011; Nair, Varakantham, Tambe, & Yokoo, 2005) and/or have no guarantees about solution quality (Oliehoek, Whiteson, & Spaan, 2013; Seuken & Zilberstein, 2007b; Velagapudi, Varakantham, Sycara, & Scerri, 2011). One reason for this intractability is that actions are modeled as primitive (low-level) operations that last exactly one time step. The length of a single step can be adjusted (trading off solution quality for horizon length), but is always assumed to be the same for all agents. This allows synchronized action selection, but also requires reasoning about action selection and coordination at every time step.

In single-agent (i.e., MDP) domains, hierarchical approaches to learning and planning (Barto & Mahadevan, 2003), exemplified by the options framework (Sutton, Precup, & Singh, 1999), have explored using higher-level, temporally extended macro-actions (or options) to represent and solve problems, leading to significant performance improvements in planning (Silver & Ciosek, 2012; Sutton et al., 1999). We now extend these ideas to the multi-agent case by introducing a Dec-POMDP formulation with macro-actions modeled as options. The primary technical challenge here is that decision-making is no longer synchronized: each agent's options must be selected, and may complete, at different times. To permit coordination, agents must use their knowledge of option policies to reason about the progress of other agents and their impact on the world.

The use of macro-actions in the multi-agent case can incorporate the benefits of the single agent case, such as simpler and more efficient modeling of real systems (e.g., robots with actions that execute predefined controllers) (Stone, Sutton, & Kuhlmann, 2005), more efficient planning (Sutton et al., 1999), skill transfer (Konidaris & Barto, 2007), and skill-specific abstractions (Konidaris & Barto, 2009; Dietterich, 2000). Additional benefits can be gained by exploiting known structure in the multi-agent problem. For instance, in some cases macro-actions may only depend on locally observable information. One example is a robot navigating to a waypoint in a security patrol application. Only local information is required for navigation, but choosing which waypoint to navigate to next requires reasoning about the location and state of all the other robots. Macro-actions with independent execution allow coordination decisions to be made only when necessary (i.e., when choosing macro-actions) rather than at every time step. Furthermore, macro-actions can build on other macro-actions, allowing hierarchical planning. The resulting macro-action formulation allows asynchronous decision-making using actions with varying time durations. We therefore focus on the case where the agents are given local options that depend only on information locally observable to the agent during execution.
Our results show that high-quality solutions can be found for a typical Dec-POMDP benchmark as well as large problems that traditional Dec-POMDP methods cannot solve: a four-agent meeting-in-a-grid problem and a domain based on robots navigating among movable obstacles (Stilman & Kuffner, 2005). Our macro-action-based methods can scale well in terms of the problem horizon and domain variables, but do not directly target scalability in terms of the number of agents (although such extensions are possible in the future). Incorporating macro-actions into Dec-POMDPs results in a scalable algorithmic framework for generating solutions for a wide range of probabilistic multi-agent systems.

One important application area for our approach is multi-robot systems. For single robots, automatic planning systems provide a flexible general-purpose strategy for constructing plans given high-level declarative domain specifications, even in the presence of substantial stochasticity and partial observability (Thrun, Burgard, & Fox, 2005). By incorporating macro-actions into Dec-POMDPs, we show that this strategy can be effectively extended to multi-robot systems: our methods naturally bridge Dec-POMDPs and multi-robot coordination, allowing principled decentralized methods to be applied to real domains. To solidify this bridge, we describe a process for creating a multi-robot macro-action Dec-POMDP (MacDec-POMDP) model, solving it, and using the solution to produce a set of executable SMACH (Bohren, 2010) finite-state machine task controllers. Our methods allow automatic off-line construction of robust multi-robot policies that support coordinated actions—including generating communication strategies that exploit the environment to share information critical to achieving the group's overall objective.

2. Background

We now describe the Dec-POMDP and options frameworks, upon which our work is based.

2.1 Decentralized Partially-Observable Markov Decision Processes

Dec-POMDPs (Bernstein et al., 2002) generalize POMDPs^1 (Kaelbling, Littman, & Cassandra, 1998) and MDPs^2 (Puterman, 1994) to the multi-agent, decentralized setting. As depicted in Figure 1, Dec-POMDPs model a team of agents that must cooperate to solve some task by receiving local observations and individually selecting and executing actions over a sequence of time steps. The agents share a single reward function that specifies their objective, but which is not typically observed during execution. Execution is decentralized because each agent must select its own action at each time step, without knowledge of the actions chosen or observations received by the other agents. Finally, the problem is partially observable because, while the formalism assumes that there exists a Markovian state at each time step, the agents do not have access to it. Instead, each agent receives a separate observation at each time step, which reflects its own partial and local view of the world.

1. POMDPs are Dec-POMDPs where there is only one agent or the decision-making by the agents is centralized.
2. MDPs are POMDPs where the state is fully observable.

More formally, a Dec-POMDP is defined by a tuple $\langle I, S, \{A_i\}, T, R, \{\Omega_i\}, O, h\rangle$, where:

• I is a finite set of agents.
• S is a finite set of states with designated initial state distribution b0.
• Ai is a finite set of actions for each agent i, with A = ×i Ai the set of joint actions.
• T is a state transition probability function, T : S × A × S → [0, 1], that specifies the probability of transitioning from state s ∈ S to s′ ∈ S when actions ~a ∈ A are taken by the agents. Hence, T(s,~a,s′) = Pr(s′|~a,s).
• R is a reward function, R : S × A → ℝ, giving the immediate reward for being in state s ∈ S and taking actions ~a ∈ A.
• Ωi is a finite set of observations for each agent i, with Ω = ×i Ωi the set of joint observations.
• O is an observation probability function, O : Ω × A × S → [0, 1], giving the probability of the agents receiving observations ~o ∈ Ω given that actions ~a ∈ A were taken, resulting in state s′ ∈ S. Hence, O(~o,~a,s′) = Pr(~o|~a,s′).
• h is the number of steps until the problem terminates, called the horizon.

Figure 1: An n-agent Dec-POMDP. Each agent i receives observations oi and executes actions ai; all agents receive a single collective reward r.

Note that while the actions and observations are factored with one factor per agent, the state—which represents the state of the whole system—need not be.

The solution to a Dec-POMDP is a joint policy—a set of policies, one for each agent. In an MDP, a solution policy is represented directly as a mapping from states to actions. In partially observed settings, the agents do not have access to the state, and so must represent policies some other way. In POMDP settings it is typically possible to calculate the belief state—a probability distribution over the unobserved state—and represent the agent's policy as a mapping from belief states to actions. However, this is not possible in the Dec-POMDP setting, because each agent would need access to the histories of all the other agents to calculate a (centralized) belief state. We therefore represent the history of each agent explicitly: the action-observation history for agent i, $h^A_i = (a^0_i, o^0_i, \ldots, a^t_i, o^t_i)$, represents the actions taken and observations received at each step (up to step t); the set of such histories for agent i is $H^A_i$. Each agent's policies are then a function of the agent's history, and are either represented as a policy tree, where the vertices indicate actions to execute and the edges indicate transitions conditioned on an observation, or as a finite-state controller which executes in a similar manner. An example of each is given in Figure 2.

Figure 2: A single agent's policy represented as (a) a policy tree and (b) a finite-state controller with initial state shown with a double circle.

The value of a joint policy, π, from state s is
$$V^\pi(s) = E\left[\sum_{t=0}^{h-1} \gamma^t R(\vec{a}^t, s^t) \,\middle|\, s, \pi\right],$$
which represents the expected value of the immediate reward for the set of agents summed for each step of the problem given the action prescribed by the policy until the horizon is reached. In the finite-horizon case (which we consider in this paper), the discount factor, γ, is typically set to 1. An optimal policy beginning at state s is $\pi^*(s) = \arg\max_\pi V^\pi(s)$. The goal is to maximize the total cumulative reward, beginning at some initial distribution over states b0. Dec-POMDPs have been widely studied and there are a number of significant advances in algorithms (e.g., see the recent surveys of Amato et al., 2013; Oliehoek, 2012; Oliehoek & Amato, 2016).
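To make the preceding definitions concrete, the following is a minimal Python sketch of how a small, fully enumerated Dec-POMDP and a joint policy of policy trees could be represented and evaluated exactly (with γ = 1, as in the finite-horizon case above). The DecPOMDP and PolicyTreeNode containers, their field names, and the callable signatures are illustrative assumptions for this sketch only, not the formal model or the authors' implementation.

```python
from dataclasses import dataclass, field

# Hypothetical containers for a small, explicitly enumerated Dec-POMDP.
@dataclass
class DecPOMDP:
    agents: list          # e.g., [0, 1]
    b0: dict              # initial state distribution: {s: prob}
    h: int                # horizon
    T: callable           # T(s, joint_a) -> {s2: prob}
    R: callable           # R(s, joint_a) -> float
    O: callable           # O(joint_a, s2) -> {joint_o: prob}

@dataclass
class PolicyTreeNode:
    action: object                                 # local action at this node
    children: dict = field(default_factory=dict)   # local observation -> child node

def evaluate(model, roots):
    """Exact value of a joint policy given as one policy-tree root per agent.
    Assumes each tree is defined to depth at least h."""
    def value(s, nodes, steps_left):
        if steps_left == 0:
            return 0.0
        joint_a = tuple(n.action for n in nodes)
        v = model.R(s, joint_a)
        for s2, p_s in model.T(s, joint_a).items():
            for joint_o, p_o in model.O(joint_a, s2).items():
                next_nodes = tuple(n.children[o] for n, o in zip(nodes, joint_o))
                v += p_s * p_o * value(s2, next_nodes, steps_left - 1)
        return v
    return sum(p * value(s, tuple(roots), model.h) for s, p in model.b0.items())
```

For larger problems this kind of exact enumeration is replaced by the sampling-based evaluation described in Section 5.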
Unfortunately, optimal (and boundedly optimal) methods (Amato et al., 2009; Bernstein, Amato, Hansen, & Zilberstein, 2009; Aras et al., 2007; Boularias & Chaib-draa, 2008; Dibangoye et al., 2013; Oliehoek et al., 2013) do not scale to large problems, and approximate methods (Oliehoek et al., 2013; Seuken & Zilberstein, 2007b; Velagapudi et al., 2011; Wu, Zilberstein, & Chen, 2010a; Wu, Zilberstein, & Jennings, 2013) do not scale or perform poorly as problem size (including horizon) grows. Subclasses of the full Dec-POMDP model have been explored, but they make strong assumptions about the domain (e.g., assuming a large amount of independence between agents) (Dibangoye et al., 2013; Melo & Veloso, 2011; Nair et al., 2005). The key question is then: how can scalability with respect to horizon and domain variables be achieved while making minimal (and accurate) assumptions about the problems being solved?

Our solution to this question is the use of hierarchy in Dec-POMDPs. While many hierarchical approaches have been developed for multi-agent systems (e.g., Horling & Lesser, 2004), very few are applicable to multi-agent models based on MDPs and POMDPs. In this paper, the hierarchy will take the form of options replacing each agent's primitive actions. The result is a general framework for asynchronous decision making operating at multiple levels of granularity that fits many real-world problems. As a result, we target scalability with respect to the problem horizon and domain variables (actions, observations and states), but leave scalability with respect to the number of agents to future work (e.g., by combining the methods from this paper with those that scale in terms of the number of agents).

Figure 3: A multi-robot warehouse domain.

2.2 Multi-Robot Domains

Our work is motivated by multi-robot coordination domains. Consider the multi-robot warehousing problem (shown in Figure 3) that we present in the experiments. A team of robots is tasked with finding a set of large and small boxes in the environment and returning them to a shipping location. Large boxes require multiple robots to push. As a result, coordination is necessary not just for assigning robots to push specific boxes, but also because two robots are required to cooperate to push the larger box at the same time. There is stochasticity in the movements of robots and partial observability with respect to the location of the boxes and other robots: both can only be detected when they are within range. We also consider cases where the robots can send communication signals to each other, but we do not define the meaning of the messages. Therefore, our planner must determine where the robots should navigate, what boxes they should push and what communication messages should be sent (if any) at each step of the problem to optimize the solution for the team. The robots must make these decisions based solely on the information they individually receive during execution (e.g., each robot's estimate of its own location as well as where and when boxes and other robots have been seen).

This multi-robot warehousing problem can be formalized as a Dec-POMDP.^3 In fact, any problem where multiple robots share a single overall reward or cost function can be formulated as a Dec-POMDP. Therefore, a Dec-POMDP solver could potentially automatically generate control policies (including policies over when and what to communicate) for very rich decentralized control problems, in the presence of uncertainty.

3. In fact, there is a common Dec-POMDP benchmark that can be thought of as a simple version of a warehouse problem (Seuken & Zilberstein, 2007a).
Unfortunately, this generality comes at a cost: as mentioned above, Dec-POMDPs are typically infeasible to solve except for very small problems (Bernstein et al., 2002). By contrast, we will show that by considering macro-actions, we retain the ability to coordinate while allowing high-quality solutions to be generated for significantly larger problems than would have been possible using other Dec-POMDP-based methods. In this example, macro-actions could be navigating to a small or large box, pushing a box (alone or with another robot) to a destination, or communicating with another robot. Macro-actions are a natural model for the modular controllers often sequenced to obtain robot behavior. The macro-action approach leverages expert-designed or learned controllers for solving subproblems (e.g., navigating to a waypoint or grasping an object), bridging the gap between traditional robotics research and work on Dec-POMDPs. This approach has the potential to produce high-quality general solutions for real-world heterogeneous multi-robot coordination problems by automatically generating control and communication policies.

2.3 The Options Framework

The options framework (Sutton et al., 1999) provides methods for learning and planning using high-level actions, or options, in Markov decision processes. In that setting, an option is defined by a tuple m = (βm, Im, πm), consisting of a stochastic termination condition, βm : S → [0, 1], which determines the probability with which an option ceases to execute in each state; an initiation set, Im ⊂ S, which determines whether or not an option can be executed from a state; and a stochastic option policy, πm : S × A → [0, 1], that maps states to action execution probabilities. An option describes a policy that an agent can choose to execute from any state in Im, which results in the execution of policy πm until execution ceases according to βm. The set of options is termed M.

For example, in the warehouse example above, an option-based macro-action may be navigating to a waypoint. In that case, the initiation set may be all states (it is available anywhere), the option policy may be a policy that navigates the robot to the waypoint location from any location, and the termination condition may be the state that represents the waypoint location or a set of states within a given radius of the waypoint. There may also be terminal states for failure to reach the waypoint (e.g., states representing the robot getting stuck).

The resulting problem is known as a Semi-Markov Decision Process, or SMDP (Sutton et al., 1999). Note that we can create an option for a single-step action a by defining πm(s,a) = βm(s) = 1, ∀s, and Im = S. The options framework therefore generalizes the traditional MDP setting. The goal is to generate a (possibly stochastic) policy, µ : S × M → [0, 1], that selects an appropriate option given the current state. The Bellman equation for the SMDP is
$$V^\mu(s) = \sum_m \mu(s,m)\left[R(s,m) + \sum_{s'} p(s'|s,m)\,V^\mu(s')\right],$$
where $p(s'|s,m) = \sum_{k=0}^{\infty} p^m_s(s',k)\,\gamma^k$, with $p^m_s(s',k)$ representing the probability that option m will terminate in state s′ from state s after k steps, and R(s,m) is an expectation over discounted rewards until termination, $E[r^t + \gamma r^{t+1} + \cdots + \gamma^{k-1} r^{t+k}]$, for executing option m starting at time t and terminating at time t + k.
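As an illustration of how these quantities can be obtained when a low-level simulator is available, the sketch below estimates R(s, m) and p(s'|s, m) for a single option by Monte Carlo rollout. The Option fields and the simulator interface (reset_to, step) are hypothetical names used only for this example; this is a sketch under those assumptions, not the paper's implementation.

```python
import random
from dataclasses import dataclass

@dataclass
class Option:
    initiation_set: set     # states where the option may be chosen (I_m)
    policy: dict            # state -> primitive action (a deterministic pi_m)
    termination: dict       # state -> probability of terminating (beta_m)

def estimate_option_model(sim, option, s, gamma=0.95, n_rollouts=1000, max_steps=200):
    """Monte Carlo estimates of R(s, m) and the discounted model p(s'|s, m)."""
    assert s in option.initiation_set
    total_return = 0.0
    discounted_mass = {}                   # s' -> accumulated gamma^k termination mass
    for _ in range(n_rollouts):
        state, disc, ret = sim.reset_to(s), 1.0, 0.0
        for _ in range(max_steps):
            a = option.policy[state]
            state, reward = sim.step(a)    # sample next low-level state and reward
            ret += disc * reward
            disc *= gamma
            if random.random() < option.termination[state]:
                break
        total_return += ret
        # if the step cap is hit, the final state is treated as terminal
        discounted_mass[state] = discounted_mass.get(state, 0.0) + disc
    R_sm = total_return / n_rollouts
    p_sm = {s2: mass / n_rollouts for s2, mass in discounted_mass.items()}
    return R_sm, p_sm
```

These estimates can then be plugged into the SMDP Bellman equation above, for example to run value iteration over options.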
If the option policies and the lower-level MDP are known, these quantities can be calculated from the underlying models. If these quantities are not known, learning can be used to generate a solution.

When the state is partially observable, these ideas can be directly transferred to the POMDP case. This can be done by representing the POMDP as a belief-state MDP. That is, given a current belief state, b, and a policy of option-based macro-actions, µ, the value can be calculated as
$$V^\mu(b) = \sum_m \mu(b,m)\left[R(b,m) + \int_{b'} p(b'|b,m)\,V^\mu(b')\right],$$
where µ(b,m) now selects a policy based on the belief state, $R(b,m) = \sum_s b(s)\,R(s,m)$, and $p(b'|b,m) = \sum_{k=0}^{\infty} p^m_b(b',k)\,\gamma^k$, with $p^m_b(b',k)$ representing the probability that option m will terminate in belief state b′ from belief b after k steps. Several POMDP methods have been developed that use option-based macro-actions (Theocharous & Kaelbling, 2003; He, Brunskill, & Roy, 2011; Lim, Sun, & Hsu, 2011).

Using either of these approaches directly is not possible in a decentralized multi-agent setting. First, the centralized information (a state or belief state) that prior approaches use for high-level action selection is not present during execution in the Dec-POMDP setting. Consequently, the action selection function, µ, must be reformulated for the decentralized case. Second, in the multi-agent case the inclusion of temporally extended actions means that action selection is no longer synchronized across agents—some agents' options would terminate while others are still executing. Therefore, it is not clear when macro-actions should be considered complete (i.e., up to which point rewards and transitions should be calculated), which complicates the definition of the reward and transition functions, R and p. We now introduce a framework that addresses these issues, thereby enabling the use of options in the Dec-POMDP setting.

3. Adding Options to Dec-POMDPs

We extend the Dec-POMDP model by replacing the local actions available to each agent with option-based macro-actions. Specifically, the action set of each agent i, which is denoted Ai above, is replaced with a finite set of options Mi. Then, M = ×i Mi is the set of joint options, replacing A, the joint set of actions. We focus on local options for each agent i, each of which is defined by a tuple $m_i = (\beta_{m_i}, I_{m_i}, \pi_{m_i})$, where $\beta_{m_i} : H^A_i \to [0,1]$ is a stochastic termination condition; $I_{m_i} \subset H^M_i$ is the initiation set; and $\pi_{m_i} : H^A_i \times A_i \to [0,1]$ is the option policy. Note that $H^A_i$ is agent i's primitive action-observation history, while $H^M_i$ is agent i's macro-action-macro-observation history (or option history, which is formally defined below). The different histories allow the agents to locally maintain the necessary information to know how to execute and terminate macro-actions (based on low-level actions and observations, typically beginning when an option is first executed) and initiate them (based on high-level history information that is maintained over a longer timeframe). Such local options model systems where the execution of a particular option, once selected, does not require coordination between agents, but can instead be completed by the agent on its own. Decision making that enables coordination between agents need only happen at the level of which option to execute, rather than inside the options themselves.
Of course, other (non-local) forms of options that control and depend on multiple agents are possible, but we discuss the local form due to its simplicity and generality. The macro-actions for the warehouse problem are discussed in Section 6.2.1, but, in short, macro-actions can be defined for navigation, pushing and communication. For example, there are macro-actions for navigating to each room that could contain boxes. For these macro-actions, the initiation set is all observations (they are available everywhere), the policy navigates the robot to the specified room using low-level observation information that is available to that robot (using low-level observation histories) and the termination condition consists of observations that are only possible inside the desired room (localization information within the room).

3.1 The MacDec-POMDP Model

We will refer to Dec-POMDPs with such macro-actions as MacDec-POMDPs. In the MacDec-POMDP, the agent and state spaces remain the same as in the Dec-POMDP definition, but macro-actions and macro-observations are added. Formally, a MacDec-POMDP is a tuple $\langle I, S, \{M_i\}, \{A_i\}, T, R, \{\zeta_i\}, \{\Omega_i\}, \{Z_i\}, O, h\rangle$, where:

• I, S, {Ai}, T, R, {Ωi}, O and h are the same as in the Dec-POMDP definition (and represent the 'underlying' Dec-POMDP),
• Mi is a finite set of macro-actions for each agent i, with M = ×i Mi the set of joint macro-actions,
• ζi is a finite set of macro-observations for each agent i, with ζ = ×i ζi the set of joint macro-observations,
• Zi is a macro-observation probability function for agent i, Zi : ζi × Mi × S → [0, 1], giving the probability of the agent receiving macro-observation zi ∈ ζi given that macro-action mi ∈ Mi has completed and the current state is s′ ∈ S. Hence, Zi(zi, mi, s′) = Pr(zi|mi, s′).

Note that the macro-observations are assumed to be independently generated for each agent after that agent's macro-action has completed. This is reasonable since macro-action completion is asynchronous (making it uncommon that multiple macro-actions terminate at the same time) and macro-observations are generated based on the underlying state (which could include information about the other agents).

In the MacDec-POMDP, we will not attempt to directly represent the transition and reward functions, but instead infer them by using the underlying Dec-POMDP model or a simulator.^4 That is, because we assume either a model or a simulator of the underlying Dec-POMDP is known, we can evaluate policies using macro-actions in the underlying Dec-POMDP by either knowing that underlying Dec-POMDP model or having a simulator that implements such a model. This evaluation using the Dec-POMDP model or simulator can be thought of as 'unrolling' each agent's macro-action and, when any macro-action completes, selecting an appropriate next macro-action for that agent. As a result, a formal representation of higher-level transition and reward models is not necessary.

4. In related work based on the ideas in this paper, we do generate such an explicit model that considers time until completion for any macro-action, resulting in the semi-Markovian Dec-POSMDP (Omidshafiei, Agha-mohammadi, Amato, & How, 2017).

3.2 Designing Macro-Observations

In the MacDec-POMDP, macro-observations are assumed to be given or designed. Determining the set of macro-observations that provides the necessary information, without unnecessarily adding problem variables, remains an open question (as it is in the primitive observation case).
In general, the high-level macro-observations can consist of any finite set for each agent, but some natural representations exist. For instance, the macro-observation may just be the particular terminal condition that was reached (e.g., the robot entered office #442). A lot of information is lost in this case, so macro-observations can also be action-observation histories, representing all the low-level information that took place during macro-action execution. When action-observation histories are used, initiation conditions of macro-actions can depend on the histories of macro-actions already taken and their results. Option policies and termination conditions will generally depend on histories that begin when the macro-action is first executed (action-observation histories). While defining the 'best' set of macro-observations is an open problem, there is some work on choosing them and learning the macro-observation probability functions (Omidshafiei, Liu, Everett, Lopez, Amato, Liu, How, & Vian, 2017a). In this paper, we assume they are defined based on the underlying state (as defined above). The macro-observation probability function can be adapted to depend on terminal conditions or local observations rather than states.

3.3 MacDec-POMDP Solutions

Solutions to MacDec-POMDPs map from option histories to macro-actions. An option history, which includes the sequence of macro-observations seen and macro-actions selected, is defined as $h^M_i = (z^0_i, m^1_i, \ldots, z^{t-1}_i, m^t_i)$. Here, $z^0_i$ may be a null macro-observation or an initial macro-observation produced from the initial belief state b0. Note that while histories over primitive actions provide the number of steps that have been executed (because they include actions and observations at each step), an option history typically requires many more (primitive) steps to execute than the number of macro-actions listed.

We can then define policies for each agent, µi, for choosing macro-actions that depend on option histories. A (stochastic) local policy, $\mu_i : H^M_i \times M_i \to [0,1]$, then depends on these option histories, and a joint policy for all agents is written as µ. The evaluation of such policies is more complicated than in the Dec-POMDP case because decision-making is no longer synchronized. In cases when a model of macro-action execution (e.g., the option policy) and the underlying Dec-POMDP are available, we can evaluate the high-level policies in a similar way to other Dec-POMDP-based approaches. Given a joint policy, the primitive action at each step is determined by the (high-level) policy, which chooses the macro-action, and the macro-action policy, which chooses the (primitive) action. This 'unrolling' uses the underlying Dec-POMDP to generate (primitive) transitions and rewards, but determines what actions to take from the macro-actions. The joint high-level and macro-action policies can then be evaluated as:
$$V^\mu(s) = E\left[\sum_{t=0}^{h-1} \gamma^t R(\vec{a}^t, s^t) \,\middle|\, s, \pi, \mu\right].$$
When the underlying Dec-POMDP and the macro-action policies are not available, we can use a simulator or a high-level model to execute the policies and return samples of the relevant values. Simulation is very similar to model-based evaluation, but uses Monte Carlo estimation as discussed in Section 5.
For example, we can evaluate a joint 2-agent policy µ which begins with macro-actions m1 and m2 at state s and executes for t steps as:
$$
\begin{aligned}
V^\mu_t(m_1,m_2,s) = \sum_{o_1,o_2} O(o_1,o_2,a_1,a_2,s) \sum_{a_1,a_2} \pi_{m_1}(o_1,a_1)\,\pi_{m_2}(o_2,a_2) \Bigg[ R(a_1,a_2,s) \\
+ \sum_{s'} T(s',a_1,a_2,s) \sum_{o'_1,o'_2} O(o'_1,o'_2,a_1,a_2,s') \Bigg( \\
\beta_{m_1}(o'_1)\,\beta_{m_2}(o'_2) \sum_{m'_1,m'_2} \mu_1(o'_1,m'_1)\,\mu_2(o'_2,m'_2)\,V^\mu_{t-1}(m'_1,m'_2,s') && \text{(both terminate)} \\
+\, \beta_{m_1}(o'_1)\big(1-\beta_{m_2}(o'_2)\big) \sum_{m'_1} \mu_1(o'_1,m'_1)\,V^\mu_{t-1}(m'_1,m_2,s') && \text{(agent 1 terminates)} \\
+\, \big(1-\beta_{m_1}(o'_1)\big)\,\beta_{m_2}(o'_2) \sum_{m'_2} \mu_2(o'_2,m'_2)\,V^\mu_{t-1}(m_1,m'_2,s') && \text{(agent 2 terminates)} \\
+\, \big(1-\beta_{m_1}(o'_1)\big)\big(1-\beta_{m_2}(o'_2)\big)\,V^\mu_{t-1}(m_1,m_2,s') \Bigg)\Bigg], && \text{(neither terminates)}
\end{aligned}
$$
where single observations are used instead of longer histories for macro-action policies, π, and termination conditions, β. For simplicity, we also use observations based on the current state, O(o1,o2,a1,a2,s), rather than the next state. The example can easily be extended to consider histories and the other observation function (as well as more agents). Also, note that macro-actions will be chosen from the policy over macro-actions, µ, based on the option history, which is not shown explicitly (after termination of a macro-action, a high-level macro-observation will be generated and the next specified macro-action will be chosen as described above). Note that agents' macro-actions may terminate at different times; the appropriate action is then chosen by the relevant agent's policy and evaluation continues. Because we are interested in a finite-horizon problem, we assume evaluation continues for h (primitive) steps.

Given that we can evaluate policies over macro-actions, we can then compare these policies. We can define a hierarchically optimal policy $\mu^*(s) = \arg\max_\mu V^\mu(s)$, which defines the highest-valued policy among those that use the given MacDec-POMDP. Because a hierarchically optimal policy may not include all possible history-dependent policies, it may have lower value than the optimal policy for the underlying Dec-POMDP (the globally optimal policy).^5 A globally optimal policy can be guaranteed by including the primitive actions in the set of macro-actions for each agent and mapping the primitive observation function to the macro-observation function, because the same set of policies can be created from this primitive macro-action set as would be created in the underlying Dec-POMDP. However, this typically makes little sense, because it is at least as hard as planning in the underlying Dec-POMDP directly.

5. Unlike flat Dec-POMDPs, stochastic policies may be beneficial in the macro-action case because full agent histories are no longer used. This remains an area of future work.
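In practice this 'unrolling' can also be carried out by sampling episodes in the underlying model or a simulator. The sketch below illustrates one sampled episode, including the asynchronous re-selection of macro-actions that the four termination cases of the equation capture. The simulator interface (reset, step, macro_observation), the macro-action objects (act, termination_prob) and the high-level policy interface (initial_macro, next_macro) are hypothetical names chosen for this sketch, not the authors' implementation.

```python
import random

def rollout_value(sim, agents, high_level_policies, h, gamma=1.0):
    """One sampled episode of a joint macro-action policy, 'unrolled' in the
    underlying Dec-POMDP simulator. Each agent re-selects a macro-action only
    when its own macro-action terminates (asynchronous decision making)."""
    obs = sim.reset()                                      # low-level observation per agent
    histories = [[] for _ in agents]                       # option history per agent
    current = [high_level_policies[i].initial_macro() for i in agents]
    value, discount = 0.0, 1.0
    for _ in range(h):
        # each agent's current macro-action supplies its primitive action
        joint_action = [current[i].act(obs[i]) for i in agents]
        obs, reward = sim.step(joint_action)
        value += discount * reward
        discount *= gamma
        for i in agents:                                   # asynchronous termination checks
            if random.random() < current[i].termination_prob(obs[i]):
                z = sim.macro_observation(i)               # high-level macro-observation
                histories[i].append((current[i], z))
                current[i] = high_level_policies[i].next_macro(histories[i])
    return value

def estimate_value(sim, agents, policies, h, episodes=1000):
    """Monte Carlo estimate of the joint policy value (the approach of Section 5)."""
    return sum(rollout_value(sim, agents, policies, h) for _ in range(episodes)) / episodes
```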
4. Algorithms

Because Dec-POMDP algorithms produce policies mapping agent histories to actions, they can be extended to consider macro-actions instead of primitive actions by adjusting policy evaluation and keeping track of macro-action progress and termination. We discuss how macro-actions can be incorporated into three such algorithms; extensions can also be made to other approaches. In these cases, deterministic policies are generated which are represented as policy trees (as shown in Figure 4). A policy tree for each agent defines a policy that can be executed based on local information. The root node defines the macro-action to choose in the known initial state, and macro-actions are specified for each legal macro-observation of the root macro-action (as seen in Figure 4(b)). In the figure, macro-observations that are not shown are not possible after the given macro-action has completed. Execution continues until the primitive horizon h is reached, meaning some nodes of the tree may not be reached due to the differing execution times of some macro-actions. Such a tree can be evaluated up to a desired horizon using the policy evaluation given above (i.e., evaluation using the underlying Dec-POMDP model or simulator). All the methods we discuss use some form of search through the policy space to generate high-quality macro-action-based policies.

Figure 4: Policies for a single agent after (a) one step and (b) two steps of dynamic programming using macro-actions m1 and m2 and macro-observations z (some of which are not possible after executing a particular macro-action).

4.1 Dynamic Programming

A simple exhaustive search method can be used to generate hierarchically optimal deterministic policies which use macro-actions. This algorithm is similar in concept to the dynamic programming algorithm used in Dec-POMDPs (Hansen, Bernstein, & Zilberstein, 2004), but full evaluation and pruning (removing dominated policies) are not used at each step (since these cannot naturally take place in the macro-action setting). Instead we can exploit the structure of macro-actions to reduce the space of policies considered. Due to the inspiration from dynamic programming for finite-horizon Dec-POMDPs (Hansen et al., 2004), we retain the name for the algorithm, but our algorithm is not a true dynamic programming algorithm as a full evaluation is not conducted and built on at every step (as discussed below).

Algorithm 1 Option-based dynamic programming (O-DP)
 1: function OptionDecDP(h)
 2:   t ← 0
 3:   PrimitiveHorizonBelowh ← true
 4:   Mt ← ∅
 5:   repeat
 6:     Mt+1 ← ExhaustiveBackup(Mt)
 7:     PrimitiveHorizonBelowh ← TestPolicySetsLength(Mt+1)
 8:     t ← t + 1
 9:   until PrimitiveHorizonBelowh = false
10:   return Mt
11: end function

We can exhaustively generate all combinations of macro-actions by first considering each agent using any single macro-action to solve the problem, as seen for one agent with two macro-actions (m1 and m2) in Figure 4(a). We can test all combinations of these 1-macro-action policies for the set of agents to see if they are guaranteed to reach (primitive) horizon h (starting from the initial state). If any combination of policies does not reach h with probability 1, we will not have a valid policy for all steps. Therefore, an exhaustive backup is performed by considering starting from all possible macro-actions and then, for any legal macro-observation of the macro-action (represented as z in the figure), transitioning to one of the 1-macro-action policies from the previous step (see Figure 4(b)). This step creates all possible next (macro-action) step policies. We can check again to see if any of the current set of policies will terminate before the desired horizon and continue to grow the policies (exhaustively as described above) as necessary. When all policies are sufficiently long, all combinations of these policies can be evaluated as above (by flattening out the policies into primitive action Dec-POMDP policies, starting from some initial state and proceeding until h).
The combination with the highest value at the initial state, s0, is chosen as the (hierarchically optimal) policy. Pseudocode for this approach is given in Algorithm 1. Here, Mt represents the set of (joint) macro-action policies generated for t (macro-action) steps. ExhaustiveBackup performs the generation of all possible next-step policies for each agent and TestPolicySetsLength checks to see if all policies reach the given horizon, h. PrimitiveHorizonBelowh represents whether there is any tree that has a primitive horizon less than h. The algorithm continues until all policies reach h and the final set of policies Mt can be returned for evaluation.

This algorithm will produce a hierarchically optimal deterministic policy because it constructs all legal deterministic macro-action policies that are guaranteed to reach horizon h. This follows from the fact that macro-actions must last at least one step and all combinations of macro-actions are generated at each step until it can be guaranteed that additional backups will cause redundant policies to be generated. Our approach represents exhaustive search in the space of legal policies that reach a desired horizon. As such it is not a true dynamic programming algorithm, but additional ideas from dynamic programming for Dec-POMDPs (Hansen et al., 2004) can be incorporated. For instance, we could prune policies based on value, but this would require evaluating all possible joint policies at every state after each backup. This evaluation would be very costly as the policy would be flattened after each backup and all combinations of flat policies would be evaluated for all states for all possible reachable horizons. Instead, beyond just scaling in the horizon due to the macro-action length, another benefit of our approach is that only legal policies are generated using the initiation and terminal conditions for macro-actions. As seen in Figure 4(b), macro-action m1 has two possible terminal states while macro-action m2 has three. Furthermore, macro-actions are only applicable given certain initial conditions. For example, m1 may not be applicable after observing z4 and m2 may not be applicable after z1. This structure limits the branching factor of the policy trees produced and thus the number of trees considered.
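A sketch of an exhaustive backup that exploits this structure is shown below for a single agent. The PolicyTree container, the macro-action method initiable_after, and the helper legal_macro_obs are hypothetical names used only for illustration; the sketch assumes the initiation and termination structure described above.

```python
from itertools import product

class PolicyTree:
    """A macro-action policy tree: a root macro-action plus one subtree per
    legal macro-observation (hypothetical container for this sketch)."""
    def __init__(self, macro, children=None):
        self.macro = macro
        self.children = children or {}            # macro-observation -> PolicyTree

def exhaustive_backup(macros, prev_trees, legal_macro_obs):
    """All one-macro-action-longer policy trees for a single agent.
    `legal_macro_obs(m)` lists the macro-observations possible after m completes;
    `m.initiable_after(z)` encodes m's initiation conditions."""
    if not prev_trees:                             # first backup: 1-macro-action policies
        return [PolicyTree(m) for m in macros]
    new_trees = []
    for m in macros:
        obs = legal_macro_obs(m)
        # for each macro-observation, allow only subtrees whose root macro-action
        # is initiable after that observation
        choices = [[t for t in prev_trees if t.macro.initiable_after(z)] for z in obs]
        for assignment in product(*choices):
            new_trees.append(PolicyTree(m, dict(zip(obs, assignment))))
    return new_trees
```

Joint policies are then formed as the cross product of each agent's tree set, and only combinations guaranteed to reach horizon h are evaluated, as in Algorithm 1.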
4.2 Memory-Bounded Dynamic Programming

Memory-bounded dynamic programming (MBDP) (Seuken & Zilberstein, 2007b) can also be extended to use macro-actions, as shown in Algorithm 2. MBDP is similar to the dynamic programming method above, but only a finite number of policy trees are retained (given by parameter MaxTrees) after each backup. After an exhaustive backup has been performed (in either DP or MBDP), at most $|M_i| \times |M_{i,t-1}|^{|\zeta_i|}$ new trees are generated for each agent i given the previous policy set $M_{i,t-1}$ (although the number will often be much smaller since many macro-actions may not be possible after a given macro-observation). The key addition in MBDP is that, next, a subset of t-step trees, M̂t, is chosen by evaluating the full set of trees, Mt, at states^6 that are generated by a heuristic policy (Hpol in the algorithm). The heuristic policy is executed for the first h − t − 1 steps of the problem.^7 Heuristic policies can include centralized MDP or POMDP policies or random policies (or a combination of these), providing a set of possible states to consider at that depth. A set of MaxTrees states is generated and the highest valued trees for each state are kept.

6. The original MBDP algorithm (Seuken & Zilberstein, 2007b) uses beliefs rather than states at lines 9 and 10 of the algorithm. Our algorithm could similarly use beliefs, but we discuss using states for simplicity.
7. Note that h is a primitive (underlying Dec-POMDP) horizon, while t is a macro-action step. While backups will often result in increasing policy length by more than one primitive step, we conservatively use one step here, but recognize that more accurate calculations along with corresponding better state estimates are possible.

Algorithm 2 Option-based memory bounded dynamic programming (O-MBDP)
 1: function OptionMBDP(MaxTrees, h, Hpol)
 2:   t ← 0
 3:   PrimitiveHorizonBelowh ← true
 4:   Mt ← ∅
 5:   repeat
 6:     Mt+1 ← ExhaustiveBackup(Mt)
 7:     M̂t+1 ← ∅
 8:     for all k ∈ MaxTrees do
 9:       sk ← GenerateState(Hpol, h − t − 1)
10:       M̂t+1 ← M̂t+1 ∪ arg max_{µt+1 ∈ Mt+1} V^{µt+1}(sk)
11:     end for
12:     t ← t + 1
13:     Mt ← M̂t+1
14:     PrimitiveHorizonBelowh ← TestPolicySetsLength(Mt)
15:   until PrimitiveHorizonBelowh = false
16:   return Mt
17: end function

This process of exhaustive backups and retaining MaxTrees trees continues, using shorter and shorter heuristic policies, until all combinations of the retained trees reach horizon h. Again, the set of trees with the highest value at the initial state is returned. This approach is potentially suboptimal because a fixed number of trees are retained, and tree sets are optimized over states that are both assumed to be known and may never be reached. Nevertheless, since the number of policies retained at each step is bounded by MaxTrees, MBDP has time and space complexity linear in the horizon. As a result, MBDP and its extensions (Amato et al., 2009; Kumar & Zilberstein, 2010; Wu et al., 2010a) have been shown to perform well in many large Dec-POMDPs. The macro-action-based extension of MBDP uses the structure provided by the initiation and terminal conditions as in the dynamic programming approach in Algorithm 1, but does not have to produce all policies that will reach horizon h, as the algorithm is no longer seeking hierarchical optimality. Scalability can therefore be adjusted by reducing the MaxTrees parameter (although solution quality may be reduced).
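The tree-retention step (lines 8–11 of Algorithm 2) can be sketched as follows for one backup. Here sample_state and evaluate_joint are hypothetical helpers standing in for running the heuristic policy and for the policy evaluation of Section 3.3; this is an illustrative sketch, not the paper's implementation.

```python
def retain_best_trees(joint_trees, max_trees, heuristic_depth,
                      sample_state, evaluate_joint):
    """Keep at most `max_trees` joint policy trees, each chosen as the best
    candidate at a state sampled by running the heuristic policy Hpol."""
    retained = []
    for _ in range(max_trees):
        s_k = sample_state(heuristic_depth)          # run Hpol for h - t - 1 steps
        best = max(joint_trees, key=lambda trees: evaluate_joint(trees, s_k))
        if best not in retained:                     # avoid duplicate selections
            retained.append(best)
    return retained
```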
4.3 Direct Cross Entropy Policy Search

Another method for solving Dec-POMDPs that has been effective is a cross entropy method, called DICE (for DIrect Cross Entropy) (Oliehoek, Kooi, & Vlassis, 2008). Instead of using dynamic programming, this method searches through the space of policy trees by sampling. That is, it maintains sampling distributions (the probability of choosing an action) at each history of each agent. Policies are sampled based on these distributions and the resulting joint policies are evaluated. A fixed number of best-performing policies are retained and the sampling distributions are updated based on the action choice frequency of these policies (mixed with the current distributions). Policy sampling and distribution updates continue for a fixed number of iterations (or until a convergence test, such as one based on KL-divergence, is satisfied).

The macro-action version of DICE is described in Algorithm 3. The inputs are the number of iterations of the algorithm (Iter), the number of joint policies to sample at each iteration, N, the number of joint policies used for updating the sampling distributions, Nb, the learning rate, α, and the (primitive) horizon, h.

Algorithm 3 Option-based direct cross entropy policy search (O-DICE)
 1: function OptionDICE(Iter, N, Nb, α, h)
 2:   Vbest ← −∞
 3:   ξ ← InitialDistribution
 4:   for all i ∈ Iter do
 5:     M ← ∅
 6:     for n ← 0 to N do
 7:       µ ← Sample(ξ)
 8:       M ← M ∪ {µ}
 9:       V ← V^µ(s0)
10:       if V > Vbest then
11:         Vbest ← V
12:         µbest ← µ
13:       end if
14:     end for
15:     Mbest ← KeepBestPols(M, Nb)
16:     ξnew ← Update(ξ)
17:     ξnew ← αξnew + (1 − α)ξ
18:     ξ ← ξnew
19:   end for
20:   return µbest
21: end function

The best value, Vbest, is initialized to negative infinity and the sampling distributions are typically initialized to uniform action distributions. In the macro-action case, sampling distributions that are based on option histories are used instead of primitive histories. Specifically, we maintain $\xi^{h^M_i}(m)$ for each option history $h^M_i$ of each agent i, which represents the probability of selecting macro-action m after that agent observes history $h^M_i$. The algorithm then begins with an empty set of joint policies, M, and samples N policies for each agent. Because macro-actions often have limited initial and terminal conditions, sampling is more complicated. It is done in a top-down fashion from the first macro-action until the (primitive) horizon is reached, while taking into account the possible macro-observations after starting from the initial state and executing the policy to that point. This allows both the terminal conditions and initial sets to be used to create distributions over valid macro-actions based on the previous histories. These N policies for each agent are evaluated and, if a new best policy is found, the value and policy are stored in Vbest and µbest. The policies with the Nb highest values from the N are stored in Mbest, and ξnew is updated for each agent's histories with
$$\xi^{h^M_i}_{\text{new}}(m) = \frac{1}{N_b}\sum_{\mu \in M_{best}} I(\mu_i, h^M_i, m),$$
where $I(\mu_i, h^M_i, m)$ is an indicator variable that is 1 when macro-action m is taken by policy µi after history $h^M_i$. This ξnew is mixed with the previous distribution, ξ, based on the learning rate, α, and the process continues until the number of iterations is exhausted. The best joint policy, µbest, can then be returned.
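The distribution update and mixing step (lines 15–18) can be sketched as below for a single agent. Purely for illustration, a sampled policy is represented here as a mapping from option histories to macro-actions, and xi[history][m] holds the current sampling probability; these representations are assumptions of the sketch, not the authors' code.

```python
from collections import defaultdict

def update_distribution(xi, best_policies, alpha):
    """One O-DICE update for a single agent: mix the empirical macro-action
    frequencies of the best sampled policies into the sampling distribution."""
    freq = defaultdict(lambda: defaultdict(float))
    for policy in best_policies:                 # policy: {option_history: macro_action}
        for history, m in policy.items():
            freq[history][m] += 1.0 / len(best_policies)
    return {history: {m: alpha * freq[history][m] + (1.0 - alpha) * p
                      for m, p in dist.items()}
            for history, dist in xi.items()}
```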
We therefore extend this model to use a simulator rather than a full model of the problem, as shown in Figure 5. In many cases, a simulator already exists or is easier to construct than the full model. Our planner still assumes the set of macro-actions and macro- observations are known, but the policies of the macro-actions as well as the underlying Dec-POMDP are not explicitly known. Instead, we make the more realistic assumption that we can simulate the macro-actions in an environment similar to the real-world domain. As such, our proposed algorithms for generating policies over macro-actions remain the same (since constructing policies of macro-actions only requires knowledge of the set of macro-actions and their initiation and terminal conditions), but all evaluation is conducted in the simulator (through sampling) rather than through enumerating all reachable states to compute the Bellman equation. That is, by using policy search, we can decouple the process of finding solutions with the process of evaluating them. As a result, we assume the macro-action and macro-observation sets are discrete, but the underlying state, action and observation spaces can be continuous. Op#mized  controllers  for  each  robot   (in  SMACH  format)   System  descrip#on   (macro-­‐ac#ons,  dynamics,  sensor  uncertainty,  rewards/costs)   Planner   (solving  the  MacDec-­‐POMDP)   Figure 5: A high level system diagram for multi-robot problems where the system can be described formally or using a simulator, solutions are generated with our planning methods and the output is a set of controllers, one for each robot. Specifically, a fixed policy can be evaluated by Monte Carlo sampling starting at an initial state (or belief state), choosing an action for each agent according to the policy, sampling an observation from the system, updating the current position in the policy (i.e., the current node in each agent’s policy tree) and then continuing this process until some maximum time step has been reached. The value of the k-th sample-based trajectory starting at s0 and using policy π is given by V π,k(s0) = r k 0 + . . . + γ TrkT , where r k t is the reward given to the team on the t-th step. After K trajectories, V̂ π(s0) = ∑K k=1 V π,k(s0) K . 833 Amato, Konidaris, Kaelbling & How As the number of samples increases, the estimate of the policy’s value will approach the true value. This sample-based evaluation is necessary in large or continuous state spaces. Sample-based evaluation has been used in the Dec-POMDP case (Wu, Zilberstein, & Chen, 2010b; Liu, Amato, Liao, Carin, & How, 2015), but we extend the idea to the macro-action case where there is the added benefit of abstracting away the details of the macro-action policies. In the multi-robot case, given the macro-actions, macro-observations and simulator, our off-line planners can automatically generate a solution which optimizes the value function with respect to the uncertainty over outcomes, sensor information, and other robots. The planner generates the solution in the form of a set of policy trees (as in Figure 4) which are parsed into a corresponding set of SMACH controllers (Bohren, 2010), one for each robot. SMACH controllers are hierarchical state machines for use in a ROS (Quigley, Conley, Gerkey, Faust, Foote, Leibs, Wheeler, & Ng, 2009) environment. 
6. Experiments

We test the performance of our macro-action-based algorithms in simulation, on existing benchmarks, a larger domain, and a novel multi-robot warehousing domain.

6.1 Simulation Experiments

For the simulation experiments, we test on a common Dec-POMDP benchmark, a four-agent extension of this benchmark, and a large problem inspired by robot navigation. Our algorithms were run on a single core 2.5 GHz machine with 8GB of memory. For option-based MBDP (O-MBDP), heuristic policies for the desired lengths were generated by producing 1000 random policies and keeping the joint policy with the highest value at the initial state. Sampling was used (10000 simulations) to determine if a policy will terminate before the horizon of interest.

6.1.1 An Existing Dec-POMDP Problem: Meeting in a Grid

The meeting-in-a-grid problem is an existing two-agent Dec-POMDP benchmark in which agents receive 0 reward unless they are both in one of two corners in a 3x3 grid (Amato et al., 2009). Agents can move up, down, left, right or stay in place, but transitions are noisy, so an agent may move to an adjacent square rather than its desired location. Each agent has full observability of its own location, but cannot observe the other agent (even when they share the same grid square). We defined two options for each agent: each one moving the agent to one of the two goal corners. Options are valid in any (local) state and terminate when they reach the appropriate goal corner. An agent stays in a corner on a step by choosing the appropriate option again. Macro-observations are the agent's location (they are the same as the primitive observations, but the agent only observes its updated location after completion of a macro-action). It is clear that these options provide the important macro-actions for the agents and navigation is possible based on local information in this problem. While this is a very small problem, it allows for direct comparison with Dec-POMDP methods.

Figure 6: Value and time results for the meeting in a grid Dec-POMDP benchmark including leading Dec-POMDP approaches DecRSPI and MBDP as well as option-based DP and MBDP.

Results for this problem are split between Figure 6 and Table 1 because not all results are available for all algorithms. We compared against one leading optimal Dec-POMDP algorithm, feature-based heuristic search value iteration (FB-HSVI) (Dibangoye et al., 2016), and three leading approximate Dec-POMDP algorithms: MBDP with incremental policy generation (MBDP-IPG) (Amato et al., 2009), rollout sampling policy iteration (DecRSPI) (Wu et al., 2010a) and trial-based dynamic programming (TBDP) (Wu et al., 2010b). MaxTrees = 3 was used in both O-MBDP and MBDP-IPG (referred to as MBDP in the figure and table). Results for other algorithms are taken from their respective publications.
As such, results were generated on different machines, but the trends should remain the same. The left figure shows that all approaches achieve approximately the same value, but option-based DP (O-DP) cannot solve horizons longer than 10 without running out of memory. Impressively, FB-HSVI is able to scale to horizon 30 by not explicitly representing a policy and maintaining a compressed distribution over agent histories and the state. Nevertheless, since FB-HSVI is an optimal method, it becomes intractable as the horizon grows (it would be an interesting area of future research to see how macro-actions could be combined with the compressed representation of FB-HSVI). The right figure shows the time required for different horizons. All approaches run quickly for small horizons, but DecRSPI requires an intractable amount of time as the horizon grows. The table shows time and value results for larger horizons. Again, all approaches achieve similar values, but O-MBDP is much faster than MBDP-IPG or TBDP. The benefit of using a macro-action representation can be seen most directly by comparing O-MBDP and MBDP, which are both based on the same algorithm: there is a significant improvement in running time, while solution quality is maintained.

                 Value                 Time (s)
              h = 100   h = 200     h = 100   h = 200
  O-MBDP(3)     94.4      194.4        133        517
  MBDP(3)       92.1      193.4       3084      13875
  TBDP          92.8      192.1        427       1372

Table 1: Times and values for larger horizons on the meeting in a grid benchmark.

Figure 7: 4-agent meeting in a grid results showing (a) value and (b) running time on a 10 × 10 grid.

6.1.2 Larger Grids with More Agents

To test the scalability of these approaches, we consider growing the meeting-in-a-grid benchmark to a larger grid size and a larger number of agents. That is, agents still receive zero reward unless all agents are in one of the goal corners. The same options and macro-observations are used as in the 3x3 version of the problem. We generated results for several four-agent problems with random starting locations for each agent. We did not compare with current optimal or approximate Dec-POMDP methods because, while they may be theoretically applicable, current implementations cannot solve problems with more than two agents or the methods assume structure (e.g., factorization or independence) that is not present in our problem.

Results for option-based dynamic programming and MBDP on problems with a 10×10 grid are shown in Figure 7. Three trees were used for O-MBDP. It is worth noting that these are very large problems with 10^8 states. Also, the 4-agent version of the problem is actually much harder than the 2-agent problem in Section 6.1.1, because all 4 agents must be in the same square to receive any reward (rather than just 2) and the grid is much larger (10x10 rather than 3x3). Agents are randomly initialized, but for horizon 10, it may be impossible for all 4 agents to reach each other in the given time. By horizon 20 (the largest we solved), the agents can often reach each other, but just at the later horizons due to noise and the large grid.
6.1.2 Larger Grids with More Agents

To test the scalability of these approaches, we consider growing the meeting-in-a-grid benchmark to a larger grid size and a larger number of agents. That is, agents still receive zero reward unless all agents are in one of the goal corners. The same options and macro-observations are used as in the 3x3 version of the problem. We generated results for several four-agent problems with random starting locations for each agent. We did not compare with current optimal or approximate Dec-POMDP methods because, while they may be theoretically applicable, current implementations cannot solve problems with more than two agents or the methods assume structure (e.g., factorization or independence) that is not present in our problem.

Figure 7: 4-agent meeting in a grid results showing (a) value and (b) running time on a 10 × 10 grid.

Results for option-based dynamic programming and MBDP on problems with a 10×10 grid are shown in Figure 7. Three trees were used for O-MBDP. It is worth noting that these are very large problems with 10^8 states. Also, the 4-agent version of the problem is actually much harder than the 2-agent problem in Section 6.1.1, because all 4 agents must be in the same square to receive any reward (rather than just 2) and the grid is much larger (10x10 rather than 3x3). Agents are randomly initialized, but for horizon 10, it may be impossible for all 4 agents to reach each other in the given time. By horizon 20 (the largest we solved), the agents can often reach each other, but frequently only near the end of the horizon due to noise and the large grid. For instance, an optimal solution to a deterministic version of this problem (an upper bound for the stochastic problem we use) for horizon 20 is approximately 2. The dynamic programming method is able to solve problems with a long enough horizon to reach the goal (producing positive value), but longer horizons are not solvable. The MBDP-based approach is able to solve much longer horizons, requiring much less time than O-DP. O-MBDP is able to produce near-optimal values for horizons that are also solvable by O-DP, but results may be further from optimal as the horizon grows (as is often the case with MBDP-based approaches).

6.1.3 Two-Agent NAMO

We also consider a two-agent version of the problem of robots navigating among movable obstacles (Stilman & Kuffner, 2005). Here, as shown in Figure 8, both agents are trying to reach a goal square (marked by G), but there are obstacles in the way. Each robot can move in four directions (up, down, left and right) or use a 'push' action to attempt to move a box to a specific location (diagonally down to the left for the large box and into the corner for both small boxes). The push action fails and the robot stays in place when the robot is not in front of the box. Robots can move the small boxes (b1 and b2) by themselves, but must move the larger box (b3) together. Observations are an agent's own location (but not the location of the other agent) and whether the large box or the same-numbered box has been moved (i.e., agent 1 can observe box 1 and agent 2 can observe box 2). There is noise in both navigation and box movement: movement is successful with probability 0.9, and pushing the small and large boxes is successful with probability 0.9 and 0.8, respectively. To encourage the robots to reach the goal as quickly as possible, there is a negative reward (-1) when any agent is not in the goal square.

Four options were defined for each agent. These consisted of 1) moving to a designated location to push the big box, 2) attempting to push the large box, 3) pushing the designated small box (box 1 for agent 1 and box 2 for agent 2) to the corner square, and 4) moving to the goal. The option of moving to the goal is only valid when at least one box has been moved, and moving any box is only valid if the large box and the agent's designated box have not yet been moved. Movement options terminate at the desired location and pushing options terminate with the box successfully or unsuccessfully moved. Macro-observations were the same as the primitive observations (the agent's location and box movements). These options provide high-level choices for the agents to coordinate on this problem, while abstracting away the navigation tasks to option execution. Options for just moving to the small boxes could also be incorporated, but were deemed unnecessary because no coordination is required to push the small boxes.
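The initiation conditions above can be written down compactly. The sketch below uses our own names for the options and for the locally observable flags; it is only meant to make the validity rules concrete, not to reproduce the implementation used in the experiments.

```python
from dataclasses import dataclass

@dataclass
class NamoLocalInfo:
    """Locally observable information for one NAMO agent (illustrative names)."""
    big_box_moved: bool
    own_small_box_moved: bool  # box 1 for agent 1, box 2 for agent 2

def option_applicable(option: str, info: NamoLocalInfo) -> bool:
    """Initiation conditions as described in the text: moving to the goal
    requires that some box has been moved, and the box-related options are
    valid only while neither the large box nor the agent's designated small
    box has been moved."""
    some_box_moved = info.big_box_moved or info.own_small_box_moved
    if option == "go_to_goal":
        return some_box_moved
    if option in ("go_to_big_box_spot", "push_big_box", "push_own_small_box"):
        return not some_box_moved
    return False
```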
Figure 8: A 6x6 two-agent NAMO problem.

Results for option-based dynamic programming are given in Figure 9. Here, O-DP performs very well on a range of different problem sizes and horizons. Because negative reward is given until both agents are in the goal square, more steps are required to reach the goal as the problem size increases. The agents will stay in the goal upon reaching it, causing the value to plateau. As shown in the top figure, O-DP is able to produce this policy for the different problem sizes and horizons. The running times for each of the grid sizes (5 × 5 to 25 × 25) are shown in the bottom figure for the horizon 25 problem. Here, we see that the running time increases for larger state spaces, but the growth is sublinear.

Figure 9: Value and time results for O-DP in the two-agent NAMO problem for various size grids (where size is the length of a single side).

A comparison with other Dec-POMDP algorithms (including O-MBDP) is shown in Table 2. For TBDP and GMAA*-ICE (a leading optimal Dec-POMDP algorithm) (Oliehoek et al., 2013), the grid size was increased while at least horizon 4 could be solved and then the horizon was increased until it reached 100. Results for these algorithms were provided by personal communication with the authors and run on other machines, but the trends remain the same. For O-MBDP, 20 trees were used because smaller numbers resulted in poor performance, but parameters were not exhaustively evaluated.

The results show that TBDP is able to solve the 4×4 problem, but runs out of memory when trying to solve any 5×5 problems. GMAA*-ICE can solve larger problem sizes, but runs out of memory for longer horizons. GMAA*-ICE scales better with the increased state space because it is able to exploit the factorization of the problem, but is limited to very small horizons because it is solving the underlying Dec-POMDP optimally. The inability of current approaches to solve these problems is not surprising given their size. By contrast, O-DP is able to solve the 25×25 problem, which has over 3 million states, while O-MBDP solves the 50×50 problem, which has 50 million states. O-MBDP is able to solve even larger problems, but we did not analyze its performance beyond the 50×50 problem.

Table 2: Largest representative NAMO problems solvable by each approach. For GMAA*-ICE and TBDP, the problem size was increased until horizon 4 was not solvable.

                          Num. of States   h     Value   Time (s)
O-DP                      3.125 × 10^6     100   −42.7   40229
O-MBDP(20)                5 × 10^7         100   −93.0   4723
GMAA*-ICE (footnote 3)    165,888          4     −4      11396
TBDP                      2,048            100   −6.4    1078

3. Larger problem sizes were not tested for GMAA*-ICE, but some may be solvable. Note that for any problem larger than 4×4, horizons beyond 4 are not solvable and the running time is already high for the 12×12 case.
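The state counts in Table 2 follow directly from the factorization of the NAMO domain described above: each agent occupies one of the grid cells, and each of the three boxes has either been moved or not. The check below is our own arithmetic (not code from the paper) and reproduces all four counts.

```python
def namo_states(side: int, n_agents: int = 2, n_boxes: int = 3) -> int:
    """State count for a side x side NAMO grid under the factorization
    described in the text: agent positions times box moved/not-moved flags."""
    cells = side * side
    return cells ** n_agents * 2 ** n_boxes

assert namo_states(25) == 3_125_000    # O-DP row:       3.125 x 10^6
assert namo_states(50) == 50_000_000   # O-MBDP(20) row: 5 x 10^7
assert namo_states(12) == 165_888      # GMAA*-ICE row:  12 x 12 grid
assert namo_states(4) == 2_048         # TBDP row:       4 x 4 grid
```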
Figure 10: The multi-robot warehouse domain with depots and robots labeled.

6.2 Multi-Robot Experiments

We also tested our methods in a warehousing scenario using a collection of iRobot Creates (Figure 10), where we varied the communication capabilities available to the robots. The results demonstrate that our methods can automatically generate the appropriate motion and communication behavior while considering uncertainty over outcomes, sensor information and other robots.

6.2.1 The Warehouse Domain

We consider three robots in a warehouse that are tasked with finding and retrieving boxes of two different sizes: large and small. Robots can navigate to known depot locations (rooms) to retrieve boxes and bring them back to a designated drop-off area. The larger boxes can only be moved effectively by two robots (if a robot tries to pick up the large box by itself, it will move to the box, but fail to pick it up). While the locations of the depots are known, the contents (the number and type of boxes) are unknown. In our implementation, we assumed there were three boxes (one large and two small), each of which was equally likely to be in one of two depots. Our planner generates a SMACH controller for each of the robots off-line using our option-based algorithms. These controllers are then executed online in a decentralized manner.

In each scenario, we assumed that each robot could observe its own location, see other robots if they were within (approximately) one meter, observe the nearest box when in a depot, and observe the size of the box if it is holding one (defining the resulting macro-observations). In the simulator used by the planner to evaluate solutions, the resulting state space includes the location of each robot (discretized into nine possible locations) and the location of each of the boxes (in a particular depot, with a particular robot or at the goal). In particular, there are ∏_{i∈I} locAg_i × ∏_{b∈B} locB_b states, where locAg_i is the location of agent i, discretized to a 3x3 grid, and locB_b represents the location of box b (at a depot, with a robot, at the goal, or with a pair of robots); the size of locB_b for each b is numDepots + numAgents + numGoals + numAgents × numAgents, where we set numDepots = 2, numAgents = 3 and numGoals = 1. The primitive actions are to move in four different directions as well as pickup, drop and communication actions. The macro-actions and macro-observations vary slightly for each scenario, but are detailed in the sections below. Note that this primitive state and action representation is used for evaluation purposes and not actually implemented on the robots (which just utilize the SMACH controllers). Higher-fidelity simulators could also be used, but running time may increase if the simulations are computationally intensive (average solution times for the policies presented below were approximately one hour).

The three-robot version of this scenario has 2,460,375 states, which is several orders of magnitude larger than problems typically solvable by Dec-POMDP approaches.8 These problems are solved using the option-based MBDP algorithm initialized with a hand-coded heuristic policy. Navigation has a small amount of noise in the amount of time required to move to locations (reflecting the real-world dynamics); this noise increases when the robots are pushing the large box (reflecting the need for slower movements and turns in this case). Specifically, the robots were assumed to transition to the desired square deterministically with no boxes, with probability 0.9 with the small box and with probability 0.8 with the large box. Picking up boxes and dropping them was assumed to be deterministic. These noise parameters were assumed to be known in this work, but they could also be learned by executing macro-actions multiple times in the given initiation sets.9 Note that the MacDec-POMDP framework is very general, so other types of macro-actions and observations could also be used (including observation of other failures). More details about each scenario are given below.

8. Our state representation technically has 1,259,712,000 states, since we also include observations of each agent (of which there are 8 in this version of the problem) in the state space.

9. These parameters and controllers were loosely based on the actual robot navigation and box pushing. Other work has looked at more directly determining these models and parameters (Amato, Konidaris, Anders, Cruz, How, & Kaelbling, 2017; Omidshafiei et al., 2017).
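As a quick check on the sizes just quoted, the product above works out as follows; this is our own arithmetic using the quantities defined in the text.

```python
# Warehouse state count under the factorization given in the text.
num_depots, num_agents, num_goals = 2, 3, 1
agent_locs = 9                 # each robot's location, discretized to a 3x3 grid
num_boxes = 3
box_locs = num_depots + num_agents + num_goals + num_agents * num_agents  # = 15

states = agent_locs ** num_agents * box_locs ** num_boxes
assert states == 2_460_375     # matches the count reported above

# Footnote 8: additionally tracking each agent's macro-observation (8 per agent)
assert states * 8 ** num_agents == 1_259_712_000
```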
6.2.2 Scenario 1: No Communication

In the first scenario, the robots cannot communicate with each other. Therefore, all cooperation is based on the controllers that are generated by the planner (which were generated offline) and observations of the other robots (when executing online). The macro-actions were: Go to depot 1, Go to depot 2, Go to the drop-off area, Pick up the small box, Pick up the large box, and Drop off a box.

The depot macro-actions are applicable anywhere and terminate when the robot is within the walls of the appropriate depot. The drop-off and drop macro-actions are only applicable if the robot is holding a box, and the pickup macro-actions are only applicable when the robot observes a box. Picking up the small box was assumed to succeed deterministically, but the model could easily be adjusted if the pickup mechanism is less robust. The macro-observations are the basic ones defined above: the robot can observe its own location (9 discrete positions), whether there is another robot present in the location, the nearest box when in a depot (small, large or none), and the size of the box it is holding (small, large or none). The macro-actions correspond to natural choices for robot controllers.
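One way to encode these initiation rules is sketched below; the field and macro-action names are ours, chosen to mirror the description above, and the actual experiments used SMACH controllers rather than this code.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class WarehouseMacroObs:
    """One robot's macro-observation in Scenario 1 (illustrative names)."""
    location: int                # one of 9 discrete positions
    robot_present: bool          # another robot seen at this location
    nearest_box: Optional[str]   # "small", "large", or None (set while in a depot)
    holding: Optional[str]       # "small", "large", or None

def applicable(macro_action: str, obs: WarehouseMacroObs) -> bool:
    """Initiation conditions for the Scenario 1 macro-actions as described above."""
    if macro_action in ("go_depot_1", "go_depot_2"):
        return True                         # depot macro-actions are applicable anywhere
    if macro_action in ("go_drop_off_area", "drop_off_box"):
        return obs.holding is not None      # only when holding a box
    if macro_action in ("pick_up_small", "pick_up_large"):
        return obs.nearest_box is not None  # only when a box is observed
    return False
```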
Figure 11: Scenario 1 (no communication). (a) Two robots set out for different depots. (b) Robots observe boxes in depots (large on left, small on right). (c) The white robot moves to the large box and the green robot moves to the small one. (d) The white robot waits at the large box while the green robot pushes the small box. (e) The green robot drops the box off at the goal. (f) The green robot goes to depot 1 and sees the other robot and the large box. (g) The green robot moves to help the white robot. (h) The green robot moves to the box and the two robots push it back to the goal.

This case10 (seen in Figure 11 along with a depiction of the executed policy in Figure 12) uses only two robots to more clearly show the optimized behavior in the absence of communication. The robots begin in the drop-off area and the policy generated by the planner begins by assigning one robot to go to each of the depots (seen in Figure 11(a)). The robots then observe the contents of the depots they are in (seen in Figure 11(b)). If there is only one robot in a depot and there is a small box to push, the robot will push the small box (Figures 11(c) and 11(d)). If the robot is in a depot with a large box and no other robots, it will stay in the depot, waiting for another robot to come and help push the box (Figure 11(d)). In this case, once the other robot is finished pushing the small box (Figure 11(e)), it goes back to the depots to check for other boxes or robots that need help (Figure 11(f)). When it sees another robot and the large box in the depot on the left (depot 1), it attempts to help push the large box (Figure 11(g)) and the two robots successfully push the large box to the goal (Figure 11(h)).

10. All videos can be seen at http://youtu.be/fGUHTHH-JNA

Figure 12: Path executed in the policy trees for the no-communication scenario by the white robot (left) and the green robot (right). Only the macro-actions executed (nodes) and the observations seen are shown. Observations are shown pictorially, with the box sizes (small as a square and large as a rectangle) and robots (white Create) given along the corresponding edge. Macro-action abbreviations: d1 = depot 1, d2 = depot 2, g = goal (drop-off area), ps = pick up small box, pl = pick up large box, dr = drop box.

The planner has automatically derived a strategy for dynamic task allocation—one robot goes to each room, and the robots then search for where help is needed after pushing any available boxes. This behavior was generated by an optimization process that considered the different costs of actions and the uncertainty involved (in the current step and into the future) and used those values to tailor the behavior to the particular problem instance.

6.2.3 Scenario 2: Local Communication

In scenario 2, robots can communicate when they are within one meter of each other. The macro-actions are the same as above, but we added ones to communicate and to wait for communication. The resulting macro-action set is: Go to depot 1, Go to depot 2, Go to the drop-off area, Pick up the small box, Pick up the large box, Drop off a box, Go to an area between the depots (the "waiting room"), Send signal #1, Send signal #2, and Wait in the waiting room for another robot.

Here, we allow the robots to choose to go to a "waiting room" which is between the two depots. This permits the robots to possibly communicate or receive communications before committing to one of the depots. The waiting-room macro-action is applicable in any situation and terminates when the robot is between the waiting room walls. The depot macro-actions are now only applicable in the waiting room, while the drop-off, pickup and drop macro-actions remain the same. The wait macro-action is applicable in the waiting room and terminates when the robot observes another robot in the waiting room. The signaling macro-actions are applicable in the waiting room and are observable by other robots that are within approximately a meter of the signaling robot. The macro-observations are the same as in the previous scenario, but now include observations for the two signals. Note that we do not specify how each communication signal should be interpreted, or when it should be sent. The results for this three-robot domain are shown in Figure 13.

Figure 13: Scenario 2 (limited communication). (a) The three robots begin moving to the waiting room. (b) One robot goes to depot 1 and two robots go to depot 2. The depot 1 robot sees a large box. (c) The robot saw a large box, so it moved to the waiting room while the other robots pushed the small boxes. (d) The depot 1 robot waits while the other robots push the small boxes. (e) The two robots drop off the small boxes at the goal while the other robot waits. (f) The green robot goes to the waiting room to check for signals and the white robot sends signal #1. (g) Signal #1 is interpreted as a need for help in depot 1, so they move to depot 1 and push the large box. (h) The two robots in depot 1 push the large box back to the goal.
The robots go to the waiting room (Figure 13(a)) and then two of the robots go to depot 2 (the one on the right) and one robot goes to depot 1 (the one on the left) (Figure 13(b)). Because there are three robots, the choice for the third robot is random, while one robot will always be assigned to each of the depots. Because there is only a large box to push in depot 1, the robot in this depot goes back to the waiting room to try to find another robot to help it push the box (Figure 13(c)). The robots in depot 2 see two small boxes and choose to push these back to the goal (also Figure 13(d)). Once the small boxes are dropped off (Figure 13(e)), one of the robots returns to the waiting room and is then recruited by the other robot to push the large box back to the goal (Figures 13(f) and 13(g)). The robots then successfully push the large box back to the goal (Figure 13(h)). In this case, the planning process determines how the signals should be used to perform communication.

6.2.4 Scenario 3: Global Communication

In the last scenario, the robots can use signaling (rather than direct communication). In this case, there is a switch in each of the depots that can turn on a blue or red light. This light can be seen in the waiting room, and there is another light switch in the waiting room that can turn off the light. (The light and switch were simulated in software and not incorporated in the physical domain.) The macro-actions were: Go to depot 1, Go to depot 2, Go to the drop-off area, Pick up the small box, Pick up the large box, Drop off a box, Go to the "waiting room", Turn on a blue light, Turn on a red light, and Turn off the light.

The first seven macro-actions are the same as for the communication case except that we relaxed the assumption that the robots had to go to the waiting room before going to the depots (making both the depot and waiting-room macro-actions applicable anywhere). The macro-actions for turning the lights on are applicable in the depots and the macro-actions for turning the lights off are applicable in the waiting room. The macro-observations are the same as in the previous scenario, but the two signals are now the lights instead of the communication signals. While the lights were intended to signal requests for help in each of the depots, we did not assign a particular color to a particular depot. In fact, we did not assign them any meaning at all, allowing the planner to set them in any way that improves performance.

The results are shown in Figure 14. Because one robot started ahead of the others, it was able to go to depot 1 to sense the size of the boxes while the other robots go to the waiting room (Figure 14(a)). The robot in depot 1 turned on the light (red in this case, but not shown in the images) to signify that there is a large box and assistance is needed (Figure 14(b)). The green robot (the first other robot to reach the waiting room) sees this light, interprets it as a need for help in depot 1, and turns off the light (Figure 14(c)).

Figure 14: Scenario 3 (signaling). (a) One robot starts first and goes to depot 1 while the other robots go to the waiting room. (b) The robot in depot 1 sees a large box, so it turns on the red light (the light is not shown). (c) The green robot sees the light first, turns it off, and goes to depot 1. The white robot goes to depot 2. (d) Robots in depot 1 move to the large box, while the robot in depot 2 begins pushing the small box. (e) Robots in depot 1 begin pushing the large box and the robot in depot 2 pushes a small box to the goal. (f) The robots from depot 1 successfully push the large box to the goal.
The other robot arrives in the waiting room, does not observe a light on, and moves to depot 2 (also Figure 14(c)). The robot in depot 2 chooses to push a small box back to the goal and the green robot moves to depot 1 to help the other robot (Figure 14(d)). One robot then pushes the small box back to the goal while the two robots in depot 1 begin pushing the large box (Figure 14(e)). Finally, the two robots in depot 1 push the large box back to the goal (Figure 14(f)). This behavior is optimized based on the information given to the planner. The semantics of all these signals, as well as the movement and signaling decisions, were decided by the planning algorithm to maximize value.

6.2.5 Simulation Results

We also evaluated the multi-robot experiments in the simulator to measure the difference in performance between option-based MBDP (O-MBDP) and option-based direct cross-entropy policy search (O-DICE). Option-based dynamic programming is not scalable enough to solve these domains to the horizons considered. For O-MBDP, maxTrees = 3, which was chosen to balance solution quality and running time. For O-DICE, Iter = 100, N = 10, Nb = 5, and α = 0.1, which were chosen based on suggestions from the original work (Oliehoek et al., 2008). A version of O-DICE was also implemented that, rather than maintaining sampling distributions for the whole tree, only maintains a single sampling distribution that is used at each node in the tree. This latter version of O-DICE is referred to as O-DICE (1) and can be thought of as a biased form of Monte Carlo sampling.

As can be seen in Table 3, O-DICE outperforms O-MBDP in terms of both value and time. In all cases, versions of O-DICE are more scalable than O-MBDP, even though only 3 trees were used for O-MBDP. For problems in which both O-MBDP and O-DICE could produce solutions, the values were very similar, but the O-DICE methods required significantly less time.

Table 3: Multi-robot warehouse simulation results for option-based MBDP (O-MBDP) and option-based direct cross-entropy search (O-DICE) using parameters for full histories (full) or just a single value (1). Value and time in seconds are given, with − signifying that the algorithm ran out of memory before generating any valid solution and * signifying that the algorithm ran out of memory before completion.

No Communication
              O-MBDP(3)          O-DICE (1)         O-DICE (full)
              value   time (s)   value   time (s)   value   time (s)
Horizon 7     0       10910      0       50         0       312
Horizon 8     0       27108      0       552        0       3748
Horizon 9     1.161   161454     1.169   601        1.158   5247
Horizon 10    −       −          2.163   618        2.159   6400
Horizon 11    −       −          3.033   699        3.120   10138

Communication
              O-MBDP(3)          O-DICE (1)         O-DICE (full)
              value   time (s)   value   time (s)   value   time (s)
Horizon 7     0.225   49         0.221   46         0.217   207
Horizon 8     0.421   139        0.409   444        0.420   2403
Horizon 9     1.60    *          1.650   549        1.650   3715
Horizon 10    −       −          2.179   901        2.795   3838
Horizon 11    −       −          −       −          −       −

Signaling
              O-MBDP(3)          O-DICE (1)         O-DICE (full)
              value   time (s)   value   time (s)   value   time (s)
Horizon 7     0.225   353        0.221   63         0.221   204
Horizon 8     0.421   16466      0.417   649        0.430   4011
Horizon 9     1.663   87288      1.691   659        1.694   7362
Horizon 10    −       −          2.392   682        2.782   7447
Horizon 11    −       −          3.756   763        3.964   10336

O-MBDP either runs out of memory (due to the large number of trees generated during a backup step) or requires a very long time to generate the maxTrees trees. Using more efficient versions of MBDP (e.g., Wu et al., 2010a) should improve performance, but performance improvements could also be made to O-DICE. An extensive comparison between these algorithms has not been conducted, even for primitive-action Dec-POMDP domains, but we expect that performance will depend on the domain and the parameters used (e.g., heuristics in MBDP). The full version of O-DICE was able to outperform the single-parameter version of O-DICE in terms of value, but also required more time.
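For reference, the cross-entropy loop underlying O-DICE has the following overall shape, written with our own function names; the real algorithm maintains a sampling distribution over macro-actions at every node of each agent's policy tree (or a single shared distribution in the O-DICE (1) variant), which is abstracted here behind sample_policy and refit.

```python
from typing import Callable, Dict, Sequence, Tuple

def cross_entropy_policy_search(
    sample_policy: Callable[[Dict], object],                 # draw a joint policy from the distribution
    evaluate: Callable[[object], float],                     # estimate its value (e.g., by simulation)
    refit: Callable[[Dict, Sequence[object], float], Dict],  # blend elite statistics into the distribution
    init_dist: Dict,
    iters: int = 100,    # Iter in the text
    n: int = 10,         # N:  policies sampled per iteration
    n_best: int = 5,     # Nb: elite samples kept
    alpha: float = 0.1,  # learning rate for the distribution update
) -> Tuple[object, float]:
    """Minimal cross-entropy policy search loop of the kind used by (O-)DICE."""
    dist = init_dist
    best_policy, best_value = None, float("-inf")
    for _ in range(iters):
        samples = [sample_policy(dist) for _ in range(n)]
        scored = sorted(((evaluate(p), p) for p in samples),
                        key=lambda vp: vp[0], reverse=True)
        elites = [p for _, p in scored[:n_best]]
        if scored[0][0] > best_value:
            best_value, best_policy = scored[0]
        # dist <- (1 - alpha) * dist + alpha * (statistics of the elite samples)
        dist = refit(dist, elites, alpha)
    return best_policy, best_value
```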
6.2.6 Infinite Horizon Comparisons

Unlike in the POMDP case, finite-horizon Dec-POMDP methods are typically not scalable enough to solve large or infinite-horizon problems. As a consequence, special-purpose infinite-horizon methods have been developed, which typically use a finite-state controller policy representation instead of a policy tree. The finite-state controller allows memory to be bounded. As a result, finite-state controller-based methods are typically more scalable for long-horizon problems, but perform poorly for smaller horizons.

Finite-state controllers, which condition action selection on an internal memory state, have been widely used in Dec-POMDPs (Bernstein et al., 2009; Amato, Bernstein, & Zilberstein, 2010a; Amato, Bonet, & Zilberstein, 2010b; Pajarinen & Peltonen, 2011; Wu et al., 2013; Kumar, Zilberstein, & Toussaint, 2015; Kumar, Mostafa, & Zilberstein, 2016). Finite-state controllers operate in the same way as policy trees in that there is a designated initial node and, following action selection at that node, the controller transitions to the next node depending on the observation seen. This continues for the (possibly infinite) duration of the problem. Finite-state controllers explicitly represent infinite-horizon policies, but can also be used for finite-horizon policies.

Recently, we and others have extended the ideas of macro-actions from this paper to finite-state controller representations. In particular, heuristic search (Amato et al., 2017) and a DICE-based approach (Omidshafiei et al., 2017) have been explored. G-DICE (Omidshafiei et al., 2017) is the same as O-DICE except that it is applied to the finite-state controller representation rather than the tree. The heuristic search method of Amato et al. (2017) is similar to multi-agent A* approaches (Oliehoek et al., 2013; Szer, Charpillet, & Zilberstein, 2005; Oliehoek, Spaan, & Vlassis, 2008; Oliehoek, Whiteson, & Spaan, 2009), but again is applied to the finite-state controller representation rather than the tree.

It is worth noting that the key difference is the policy representation: the algorithms in this paper could be applied to finite-state controllers, and many finite-state controller-based methods could be applied to trees. This paper introduces macro-actions in Dec-POMDPs and explores some initial algorithms for tree-based solutions; many future algorithms are now possible.
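For completeness, a finite-state controller of the kind used by these methods can be represented very simply; the sketch below (with names and the two-node example entirely our own) just shows the node/transition structure, with each node selecting a macro-action and each macro-observation selecting the next node.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

@dataclass
class FiniteStateController:
    """Minimal controller for one agent: a designated start node, a
    macro-action at each node, and observation-driven node transitions."""
    action_at: Dict[int, str]              # node -> macro-action
    next_node: Dict[Tuple[int, str], int]  # (node, macro-observation) -> next node
    start: int = 0

    def act(self, node: int) -> str:
        return self.action_at[node]

    def step(self, node: int, macro_obs: str) -> int:
        return self.next_node[(node, macro_obs)]

# A purely illustrative two-node controller: go to depot 1 until a large box
# is observed, then wait until another robot is nearby, then start over.
fsc = FiniteStateController(
    action_at={0: "go_depot_1", 1: "wait"},
    next_node={(0, "large_box"): 1, (0, "none"): 0,
               (1, "robot_present"): 0, (1, "none"): 1},
)
```

Unlike a depth-h policy tree, the controller's memory is bounded by its number of nodes, which is what makes it attractive for long or infinite horizons at the cost of representational power for short ones.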
Nevertheless, for thoroughness of results, we provide the performance of the heuristic search method MDHS (Amato et al., 2017) on our benchmark problems. MDHS is an anytime algorithm, so it will continue to improve until the best parameters for the given controller size are found. For a fair comparison, we let it run for the same amount of time as the full version of O-DICE. We set the parameters in the same way as the previous work (Amato et al., 2017) (e.g., 5 controller nodes were used) and the initial lower bound was found from the best of 100 random controller parameterizations. Reporting results for all horizons of all domains would be redundant, but the results we provide are representative of the other domains and horizon values.

As can be seen in Table 4, MDHS often achieves values that are similar to the O-DICE values, but sometimes significantly underperforms. For instance, MDHS achieves only 17% of the O-DICE value in the meeting-in-a-grid problem with 4 agents, 38% of the O-DICE value in the horizon 10 robot warehouse problem with signaling, and 69% and 42% of the O-DICE value in the horizon 9 and 10 warehouse problems with communication, respectively.

Table 4: Results for the controller-based MDHS method on our benchmark problems, along with the performance relative to O-DICE (full).

Meeting in a Grid
              2 agents, hor=100     2 agents, hor=200     4 agents, hor=20
              value     % O-DICE    value     % O-DICE    value     % O-DICE
              92.478    98%         192.407   99%         0.076     17%

NAMO
              size 10               size 15               size 20
              value     % O-DICE    value     % O-DICE    value     % O-DICE
Horizon 10    −10       100%        −10       100%        −10       100%
Horizon 20    −18.533   88%         −20       100%        −20       100%
Horizon 30    −19.558   89%         −27.458   94%         −29.961   99%

Robot warehouse
              No Communication      Communication         Signaling
              value     % O-DICE    value     % O-DICE    value     % O-DICE
Horizon 7     0         100%        0.205     94%         0.207     94%
Horizon 8     0         100%        0.393     94%         0.428     99%
Horizon 9     1.120     97%         1.138     69%         1.611     95%
Horizon 10    2.064     94%         1.167     42%         1.055     38%
Horizon 11    2.932     94%         −         −           3.807     96%

The values for the NAMO problems are not particularly interesting, as all policies have the same value until the horizon becomes significantly longer than the domain size (since the agent requires more steps to reach the goal as the domain size increases), but we still see that MDHS does not achieve the full O-DICE values for non-degenerate horizons. In general, MDHS is more scalable in terms of the horizon (e.g., solving the horizon 11 robot warehouse problem with communication), but scalability depends on choosing a proper controller size to balance solution quality and computational efficiency. As a result, controller-based methods, such as MDHS, can return lower-quality solutions on horizons that are solvable by the tree-based methods. MDHS will also require an intractable amount of time to improve solutions as the number of observations grows, since it searches for assignments for all possible next observations in the controller (Omidshafiei et al., 2017). As is currently the case in (primitive) Dec-POMDPs, tree-based and controller-based algorithms both have their place in macro-action-based Dec-POMDPs. The performance of MDHS (or controller-based methods more generally) relative to tree-based methods is highly problem- and horizon-dependent (as seen in our results). A general rule of thumb may be to use a tree-based method for finite-horizon problems that are solvable with it and to use controller-based (or other) methods otherwise.

7. Related Work

While many hierarchical approaches have been developed for multi-agent systems (Horling & Lesser, 2004), very few are applicable to multi-agent models based on MDPs and POMDPs. Perhaps the most similar approach is that of Ghavamzadeh et al. (Ghavamzadeh, Mahadevan, & Makar, 2006).
This is a multi-agent reinforcement learning approach with a given task hierarchy, where communication is used to coordinate actions at higher levels and agents are assumed to be independent at lower levels. This work was limited to a multi-agent MDP model with (potentially costly) communication, making the learning problem challenging, but the planning problem is simpler than the full Dec-POMDP case.

Other approaches have considered identifying and exploiting independence between agents to limit reasoning about coordination and improve scalability. These include general assumptions about agent independence, such as transition-independent Dec-MDPs (Becker, Zilberstein, Lesser, & Goldman, 2004b) and factored models such as ND-POMDPs (Nair et al., 2005), as well as methods that consider coordination based on 'events' or states. Events which may require or allow interaction have been explored in Dec-MDPs (Becker, Lesser, & Zilberstein, 2004a) and (centralized) multi-robot systems (Messias, Spaan, & Lima, 2013). Other methods have considered locations or states where interaction is needed to improve scalability in planning (Spaan & Melo, 2008; Velagapudi et al., 2011) and learning (Melo & Veloso, 2011).

The work on independence assumes agents are always independent or coordinate using a fixed factorization, making it less general than an option-based approach. The work on event- and state-based coordination focuses on a different type of domain knowledge: knowledge of states where coordination takes place. While this type of knowledge may be available, it may be easier to obtain and utilize procedural knowledge. The domain may therefore be easier to specify using macro-actions with different properties (such as independence or tight coordination), allowing planning to determine the necessary states for coordination. Furthermore, this type of state information could be used to define options for reaching these coordination points. Lastly, macro-actions could possibly be used in conjunction with previous methods, further improving scalability.

As mentioned in the introduction, we do not target scalability with respect to the number of agents. Several such methods have been developed that make various assumptions about agent abilities and policies (e.g., Sonu, Chen, & Doshi, 2015; Varakantham, Adulyasak, & Jaillet, 2014; Velagapudi et al., 2011; Oliehoek et al., 2013; Nguyen, Kumar, & Lau, 2017a, 2017b). Macro-action-based methods could potentially be incorporated into these methods to again increase scalability in terms of the number of agents as well as the horizon and other problem variables.

There are several frameworks for multi-robot decision making in complex domains. For instance, behavioral methods have been studied for performing task allocation over time with loosely-coupled (Parker, 1998) or tightly-coupled (Stroupe, Ravichandran, & Balch, 2004) tasks. These are heuristic in nature and make strong assumptions about the type of tasks that will be completed.

Linear temporal logic (LTL) has also been used to specify robot behavior (Belta, Bicchi, Egerstedt, Frazzoli, Klavins, & Pappas, 2007; Loizou & Kyriakopoulos, 2004); from this specification, reactive controllers that are guaranteed to satisfy the specification can be derived.
These methods are appropriate when the world dynamics can be effectively described non-probabilistically and when there is a useful characterization of the robot's desired behavior in terms of a set of discrete constraints. When applied to multiple robots, it is necessary to give each robot its own behavior specification. In contrast, our approach (probabilistically) models the domain and allows the planner to automatically optimize the robots' behavior.

Market-based approaches use traded value to establish an optimization framework for task allocation (Dias & Stentz, 2003; Gerkey & Matarić, 2004). These approaches have been used to solve real multi-robot problems (Kalra, Ferguson, & Stentz, 2005), but are largely aimed at tasks where the robots can communicate through a bidding mechanism.

Emery-Montemerlo et al. (Emery-Montemerlo, Gordon, Schneider, & Thrun, 2005) introduced a (cooperative) game-theoretic formalization of multi-robot systems which resulted in solving a Dec-POMDP. An approximate forward search algorithm was used to generate solutions, but because a (relatively) low-level Dec-POMDP was used, scalability was limited. Their system also required synchronized execution by the robots.

8. Discussion

We have considered local options in this paper, but our framework could support other types of options. For example, we could consider options in which the policy is local but the initiation and termination sets are not: initiation and termination could depend on the agent's history, or on other agents' states. Generalizing a local option in this way retains the advantages described here, because the decision about which option to execute already requires coordination but executing the option itself does not. We could also use options with history-based policies, or define multi-agent options that control a subset of agents to complete a task. In general, we expect that an option will be useful for planning when its execution allows us to temporarily ignore some aspect of the original problem. For example, the option might be defined in a smaller state space (allowing us to ignore the full complexity of the problem), or use only observable information (allowing us to ignore the partially observable aspect of the problem), or involve a single agent or a subset of agents communicating (allowing us to ignore the decentralized aspect of the problem).

We can gain additional benefits by exploiting known structure in the multi-agent problem. For instance, most controllers only depend on locally observable information and do not require coordination. For example, consider a controller that navigates to a waypoint. Only local information is required for navigation—the robot may detect other robots but their presence does not change its objective, and it simply moves around them—but choosing the target waypoint likely requires the planner to consider the locations and actions of all robots. Macro-actions with independent execution allow coordination decisions to be made only when necessary (i.e., when choosing macro-actions) rather than at every time step. Because MacDec-POMDPs are built on top of Dec-POMDPs, macro-action choice may depend on history, but during execution macro-actions may depend only on a single observation or on any number of steps of history, or even represent the actions of a set of robots.
That is, macro-actions are very general and can be defined in such a way to take advantage of the knowledge available to the robots during execution. We have so far assumed that the agent is given an appropriate set of macro-actions with which to plan. In all of our domains, there were quite natural choices for macro-actions and macro-observations (e.g., navigating to depots and observing that you are in a depot along with its contents), but such natural representations are not always present. Research on skill discovery (McGovern & Barto, 2001) has attempted to devise methods by which a single agent can instead acquire an appropriate set of options autonomously, through interaction with its (fully observable) environment. While some of these methods may be directly applicable, the characteristics of the partially observable, multi-agent case also offer new opportunities for skill discovery. For example, we may wish to synthesize skills that collapse uncertainty across multiple agents, perform coordinated multi-agent actions, communicate essential state information, or allow agents to synchronize and replan. Related work has begun to explore some of these topics (Omidshafiei et al., 2017, 2017a), but many open questions remain. In terms of multi-robot domains, we demonstrated macro-action-based approaches on multiple other domains with limited sensing and communication. These other domains included a logistics (beer delivery) domain, where two robots must efficiently find out about and service beer orders in cooperation with a ‘picker/bartender’ robot, which can retrieve items (Amato, Konidaris, Anders, Cruz, How, & Kaelbling, 2015; Amato et al., 2017), a package delivery domain, where a group of aerial robots must retrieve and deliver packages from base locations to delivery locations while dealing with limited battery life (Omidshafiei, Agha-mohammadi, Amato, & How, 2015; Omidshafiei, Agha-mohammadi, Amato, Liu, How, & Vian, 2016; Omidshafiei et al., 2017a, 2017) as well as an adversarial domain in which a team of robots is playing capture the flag against another team of robots (Hoang, Xiao, Sivakumar, Amato, & How, 2018). Also, our results have shown that the use of macro-actions can significantly improve scalability—for example, by allowing us to use larger grids with the same set of agents and obstacles in the NAMO problem (see Figure 8). However, in such cases—where the state space grows but the number of agents and significant interactions does not—we should in principle be able to deal with any size grid with no increase in computation time, because the size of the grid is irrelevant to the coordination aspects of the problem. This does not occur in the work presented here because we plan in the original state space; methods for constructing a more abstract task-level representation (Konidaris, Kaelbling, & Lozano- Perez, 2018) could provide further performance improvements. It is also worth noting that our approach can incorporate state-of-the-art methods for solving more restricted scenarios as options. The widespread use of techniques for solving restricted robotics scenarios has led to a plethora of usable algorithms for specific problems, but no way to combine these in more complex scenarios. 
Our approach can build on the large amount of research in single and multi-robot systems that has gone into solving difficult problems such as navigation in a formation (Balch & Arkin, 1998), cooperative transport of an object (Kube & Bonabeau, 2000), coordination with signaling (Beckers, Holland, & Deneubourg, 1994) or communication under various limitations (Rekleitis, Lee-Shue, New, & Choset, 2004). The solutions to these problems could be represented as macro-actions in our framework, building on existing research to solve even more complex multi-robot problems.

This paper focused on (sample-based) planning using macro-actions, but learning could also be used to generate policies over macro-actions. In particular, other work developed a method that learns policies using only high-level macro-action trajectories (macro-actions and macro-observations) (Liu, Amato, Anesta, Griffith, & How, 2016). As a result, the methods do not need any models and are applicable in cases where data is difficult or costly to obtain (e.g., human demonstrations, elaborate training exercises). Our experiments showed that the methods can also produce very high-quality solutions, even outperforming and improving upon hand-coded 'expert' solutions with a small amount of data. We also improved upon and tested these approaches in a multi-robot search and rescue problem (Liu, Sivakumar, Omidshafiei, Amato, & How, 2017). In general, using macro-actions with other multi-agent reinforcement learning methods (including the popular deep methods, e.g., Foerster, Assael, de Freitas, & Whiteson, 2016; Omidshafiei, Pazis, Amato, How, & Vian, 2017b; Lowe, Wu, Tamar, Harb, Abbeel, & Mordatch, 2017; Rashid, Samvelyan, Schroeder, Farquhar, Foerster, & Whiteson, 2018; Palmer, Tuyls, Bloembergen, & Savani, 2018; Omidshafiei, Kim, Liu, Tesauro, Riemer, Amato, Campbell, & How, 2019) could be a promising way of improving performance, while allowing asynchronous action execution.

Finally, while this paper focused on dynamic programming (Hansen et al., 2004; Seuken & Zilberstein, 2007b) and direct policy search methods (Oliehoek et al., 2008), forward search methods (Oliehoek et al., 2013; Szer et al., 2005; Oliehoek et al., 2008, 2009; Dibangoye et al., 2016) are likely to perform well when using MacDec-POMDPs. When building up policies from the last step, as in dynamic programming, adding macro-actions to the beginning of a tree changes when the macro-actions deeper down the tree will be completed. In forward search methods, actions are added to the leaves of the tree, leaving the completion times for previous macro-actions in the policy (those at earlier heights) the same. We have not explored such search methods for MacDec-POMDPs, but they appear to be promising.

9. Conclusion

We presented a new formulation for representing decentralized decision-making problems under uncertainty using higher-level macro-actions (modeled as options), rather than primitive (single-step) actions. We called this framework the macro-action Dec-POMDP (MacDec-POMDP). Because our macro-action model is built on top of the Dec-POMDP framework, Dec-POMDP algorithms can be extended to solve problems with macro-actions while retaining agent coordination. We focused on local options, which allow us to reason about coordination only when deciding which option to execute.
Our results have demonstrated that high-quality solutions can be achieved on current benchmarks, and that very large problems can be effectively modeled and solved this way. As such, our macro-action framework represents a promising approach for scaling multi-agent planning under uncertainty to real-world problem sizes.

We also have demonstrated that complex multi-robot domains can be solved with Dec-POMDP-based methods. The MacDec-POMDP model is expressive enough to capture multi-robot systems of interest, but also simple enough to be feasible to solve in practice. Our results show that a general-purpose MacDec-POMDP planner can generate cooperative behavior for complex multi-robot domains, with task allocation, direct communication, and signaling behavior emerging automatically as properties of the solution for the given problem model. Because all cooperative multi-robot problems can be modeled as Dec-POMDPs, MacDec-POMDPs represent a powerful tool for automatically trading off various costs, such as time, resource usage and communication, while considering uncertainty in the dynamics, sensors and other robot information. These approaches have great potential to lead to automated solution methods for general probabilistic multi-robot coordination problems with heterogeneous robots in complex, uncertain domains.

More generally, this work opens the door to many research questions about representing and solving multi-agent problems hierarchically. Promising avenues for future work include exploring different types of options, further work on reinforcement learning for either generating options or policies over options, and developing more scalable solution methods that exploit domain and hierarchical structure. One example of such structure would be the use of a factored reward function (Nair et al., 2005), which allows more efficient policy generation and evaluation.

Acknowledgements

We would like to thank Matthijs Spaan and Feng Wu for providing results, as well as Ari Anders, Gabriel Cruz, and Christopher Maynor for their help with the robot experiments. Research supported in part by NSF project #1664923, ONR MURI project #N000141110688, DARPA YFA D15AP00104, AFOSR YIP FA9550-17-1-0124, and NIH R01MH109177.

References

Amato, C., Bernstein, D. S., & Zilberstein, S. (2010a). Optimizing fixed-size stochastic controllers for POMDPs and decentralized POMDPs. Journal of Autonomous Agents and Multi-Agent Systems, 21 (3), 293–320.

Amato, C., Bonet, B., & Zilberstein, S. (2010b). Finite-state controllers based on Mealy machines for centralized and decentralized POMDPs. In Proceedings of the AAAI Conference on Artificial Intelligence, pp. 1052–1058.

Amato, C., Chowdhary, G., Geramifard, A., Ure, N. K., & Kochenderfer, M. J. (2013). Decentralized control of partially observable Markov decision processes. In Proceedings of the IEEE Conference on Decision and Control, pp. 2398–2405.

Amato, C., Dibangoye, J. S., & Zilberstein, S. (2009). Incremental policy generation for finite-horizon DEC-POMDPs. In Proceedings of the International Conference on Automated Planning and Scheduling, pp. 2–9.

Amato, C., Konidaris, G. D., Anders, A., Cruz, G., How, J. P., & Kaelbling, L. P. (2015). Policy search for multi-robot coordination under uncertainty. In Proceedings of the Robotics: Science and Systems Conference.

Amato, C., Konidaris, G. D., Anders, A., Cruz, G., How, J. P., & Kaelbling, L. P. (2017).
Policy search for multi-robot coordination under uncertainty. The International Jour- nal of Robotics Research. Aras, R., Dutech, A., & Charpillet, F. (2007). Mixed integer linear programming for exact finite-horizon planning in decentralized POMDPs. In Proceedings of the International Conference on Automated Planning and Scheduling, pp. 18–25. Balch, T., & Arkin, R. C. (1998). Behavior-based formation control for multi-robot teams. IEEE Transactions on Robotics and Automation, 14 (6), 926–939. Barto, A., & Mahadevan, S. (2003). Recent advances in hierarchical reinforcement learning. Discrete Event Dynamic Systems, 13, 41–77. Becker, R., Lesser, V., & Zilberstein, S. (2004a). Decentralized Markov Decision Processes with Event-Driven Interactions. In Proceedings of the International Conference on Autonomous Agents and Multiagent Systems, pp. 302–309. Becker, R., Zilberstein, S., Lesser, V., & Goldman, C. V. (2004b). Solving transition- independent decentralized Markov decision processes. Journal of Artificial Intelligence Research, 22, 423–455. Beckers, R., Holland, O., & Deneubourg, J.-L. (1994). From local actions to global tasks: Stigmergy and collective robotics. In Artificial life IV, Vol. 181, p. 189. Belta, C., Bicchi, A., Egerstedt, M., Frazzoli, E., Klavins, E., & Pappas, G. J. (2007). Symbolic planning and control of robot motion [grand challenges of robotics]. Robotics & Automation Magazine, IEEE, 14 (1), 61–70. Bernstein, D. S., Amato, C., Hansen, E. A., & Zilberstein, S. (2009). Policy iteration for decentralized control of Markov decision processes. Journal of Artificial Intelligence Research, 34, 89–132. Bernstein, D. S., Givan, R., Immerman, N., & Zilberstein, S. (2002). The complexity of decentralized control of Markov decision processes. Mathematics of Operations Research, 27 (4), 819–840. Bohren, J. (2010). SMACH. http://wiki.ros.org/smach/. Boularias, A., & Chaib-draa, B. (2008). Exact dynamic programming for decentralized POMDPs with lossless policy compression. In Proceedings of the International Con- ference on Automated Planning and Scheduling. Dias, M. B., & Stentz, A. T. (2003). A comparative study between centralized, market- based, and behavioral multirobot coordination approaches. In Proceedings of IEEE/RSJ International Conference on Intelligent Robots and Systems, Vol. 3, pp. 2279 – 2284. Dibangoye, J. S., Amato, C., Buffet, O., & Charpillet, F. (2013). Optimally solving Dec- POMDPs as continuous-state MDPs. In Proceedings of the International Joint Con- ference on Artificial Intelligence. Dibangoye, J. S., Amato, C., Buffet, O., & Charpillet, F. (2016). Optimally solving Dec- POMDPs as continuous-state MDPs. Journal of Artificial Intelligence Research, 55, 443–497. 854 Modeling and Planning with Macro-Actions in Decentralized POMDPs Dibangoye, J. S., Amato, C., Doniec, A., & Charpillet, F. (2013). Producing efficient error- bounded solutions for transition independent decentralized MDPs. In Proceedings of the International Conference on Autonomous Agents and Multiagent Systems. Dietterich, T. G. (2000). Hierarchical reinforcement learning with the MAXQ value function decomposition. Journal of Artificial Intelligence Research, 13, 227–303. Emery-Montemerlo, R., Gordon, G., Schneider, J., & Thrun, S. (2005). Game theoretic control for robot teams. In Proceedings of the International Conference on Robotics and Automation, pp. 1163–1169. Foerster, J., Assael, I. A., de Freitas, N., & Whiteson, S. (2016). 
Learning to communicate with deep multi-agent reinforcement learning. In Advances in Neural Information Processing Systems, pp. 2137–2145. Gerkey, B. P., & Matarić, M. J. (2004). A formal analysis and taxonomy of task allocation in multi-robot systems. International Journal of Robotics Research, 23 (9), 939–954. Ghavamzadeh, M., Mahadevan, S., & Makar, R. (2006). Hierarchical multi-agent rein- forcement learning. Journal of Autonomous Agents and Multi-Agent Systems, 13 (2), 197–229. Hansen, E. A., Bernstein, D. S., & Zilberstein, S. (2004). Dynamic programming for partially observable stochastic games. In Proceedings of the National Conference on Artificial Intelligence, pp. 709–715. He, R., Brunskill, E., & Roy, N. (2011). Efficient planning under uncertainty with macro- actions. Journal of Artificial Intelligence Research, 523–570. Hoang, T. N., Xiao, Y., Sivakumar, K., Amato, C., & How, J. (2018). Near-optimal ad- versarial policy switching for decentralized asynchronous multi-agent systems. In Proceedings of the International Conference on Robotics and Automation. Horling, B., & Lesser, V. (2004). A survey of multi-agent organizational paradigms. The Knowledge Engineering Review, 19 (4), 281–316. Kaelbling, L. P., Littman, M. L., & Cassandra, A. R. (1998). Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101, 1–45. Kalra, N., Ferguson, D., & Stentz, A. T. (2005). Hoplites: A market-based framework for planned tight coordination in multirobot teams. In Proceedings of the International Conference on Robotics and Automation, pp. 1170 – 1177. Konidaris, G., & Barto, A. G. (2007). Building portable options: Skill transfer in rein- forcement learning. Proceedings of the International Joint Conference on Artificial Intelligence, 7, 895–900. Konidaris, G., Kaelbling, L. P., & Lozano-Perez, T. (2018). From skills to symbols: Learn- ing symbolic representations for abstract high-level planning. Journal of Artificial Intelligence Research, 61, 215–289. Konidaris, G. D., & Barto, A. G. (2009). Skill discovery in continuous reinforcement learning domains using skill chaining. In Advances in Neural Information Processing Systems 22, pp. 1015–1023. 855 Amato, Konidaris, Kaelbling & How Kube, C. R., & Bonabeau, E. (2000). Cooperative transport by ants and robots. Robotics and Autonomous Systems, 30 (1-2), 85–101. Kumar, A., Mostafa, H., & Zilberstein, S. (2016). Dual formulations for optimizing dec- pomdp controllers. In Proceedings of the International Conference on Automated Planning and Scheduling. Kumar, A., & Zilberstein, S. (2010). Point-based backup for decentralized POMDPs: com- plexity and new algorithms. In Proceedings of the International Conference on Au- tonomous Agents and Multiagent Systems, pp. 1315–1322. Kumar, A., Zilberstein, S., & Toussaint, M. (2015). Probabilistic inference techniques for scalable multiagent decision making. Journal of Artificial Intelligence Research, 53 (1), 223–270. Lim, Z., Sun, L., & Hsu, D. J. (2011). Monte Carlo value iteration with macro-actions. In Advances in Neural Information Processing Systems, pp. 1287–1295. Liu, M., Amato, C., Anesta, E., Griffith, J. D., & How, J. P. (2016). Learning for decen- tralized control of multiagent systems in large partially observable stochastic environ- ments. In Proceedings of the AAAI Conference on Artificial Intelligence. Liu, M., Amato, C., Liao, X., Carin, L., & How, J. P. (2015). Stick-breaking policy learning in Dec-POMDPs. 
In Proceedings of the International Joint Conference on Artificial Intelligence. Liu, M., Sivakumar, K., Omidshafiei, S., Amato, C., & How, J. P. (2017). Learning for multi-robot cooperation in partially observable stochastic environments with macro- actions. In Proceedings of IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 1853–1860. Loizou, S. G., & Kyriakopoulos, K. J. (2004). Automatic synthesis of multi-agent motion tasks based on ltl specifications. In Decision and Control, 2004. CDC. 43rd IEEE Conference on, Vol. 1, pp. 153–158. IEEE. Lowe, R., Wu, Y., Tamar, A., Harb, J., Abbeel, O. P., & Mordatch, I. (2017). Multi-agent actor-critic for mixed cooperative-competitive environments. In Advances in Neural Information Processing Systems, pp. 6379–6390. McGovern, A., & Barto, A. G. (2001). Automatic discovery of subgoals in reinforcement learning using diverse density. In Proceedings of the Eighteenth International Confer- ence on Machine Learning, pp. 361–368. Melo, F., & Veloso, M. (2011). Decentralized MDPs with sparse interactions. Artificial Intelligence. Messias, J. V., Spaan, M. T. J., & Lima, P. U. (2013). GSMDPs for multi-robot sequential decision-making. In Proceedings of the Twenty-Seventh AAAI Conference on Artificial Intelligence, pp. 1408–1414. Nair, R., Varakantham, P., Tambe, M., & Yokoo, M. (2005). Networked distributed POMDPs: a synthesis of distributed constraint optimization and POMDPs. In Pro- ceedings of the National Conference on Artificial Intelligence. 856 Modeling and Planning with Macro-Actions in Decentralized POMDPs Nguyen, D. T., Kumar, A., & Lau, H. C. (2017a). Collective multiagent sequential deci- sion making under uncertainty. In Proceedings of the AAAI Conference on Artificial Intelligence. Nguyen, D. T., Kumar, A., & Lau, H. C. (2017b). Policy gradient with value function approximation for collective multiagent planning. In Advances in Neural Information Processing Systems, pp. 4322–4332. Oliehoek, F. A. (2012). Decentralized POMDPs. In Wiering, M., & van Otterlo, M. (Eds.), Reinforcement Learning: State of the Art, Vol. 12 of Adaptation, Learning, and Opti- mization, pp. 471–503. Springer Berlin Heidelberg. Oliehoek, F. A., & Amato, C. (2016). A Concise Introduction to Decentralized POMDPs. Springer. Oliehoek, F. A., Kooi, J. F., & Vlassis, N. (2008). The cross-entropy method for policy search in decentralized POMDPs. Informatica, 32, 341–357. Oliehoek, F. A., Spaan, M. T. J., Amato, C., & Whiteson, S. (2013). Incremental clustering and expansion for faster optimal planning in Dec-POMDPs. Journal of Artificial Intelligence Research, 46, 449–509. Oliehoek, F. A., Spaan, M. T. J., & Vlassis, N. (2008). Optimal and approximate Q-value functions for decentralized POMDPs. Journal of Artificial Intelligence Research, 32, 289–353. Oliehoek, F. A., Whiteson, S., & Spaan, M. T. J. (2009). Lossless clustering of histories in decentralized POMDPs. In Proceedings of the International Conference on Au- tonomous Agents and Multiagent Systems. Oliehoek, F. A., Whiteson, S., & Spaan, M. T. J. (2013). Approximate solutions for factored Dec-POMDPs with many agents. In Proceedings of the International Conference on Autonomous Agents and Multiagent Systems. Omidshafiei, S., Agha-mohammadi, A., Amato, C., & How, J. P. (2015). Decentralized control of partially observable Markov decision processes using belief space macro- actions. In Proceedings of the International Conference on Robotics and Automation, pp. 5962–5969. 
Omidshafiei, S., Agha-mohammadi, A., Amato, C., & How, J. P. (2017). Decentralized control of multi-robot partially observable Markov decision processes using belief space macro-actions. The International Journal of Robotics Research.
Omidshafiei, S., Agha-mohammadi, A., Amato, C., Liu, S.-Y., How, J. P., & Vian, J. (2016). Graph-based cross entropy method for solving multi-robot decentralized POMDPs. In Proceedings of the International Conference on Robotics and Automation.
Omidshafiei, S., Kim, D.-K., Liu, M., Tesauro, G., Riemer, M., Amato, C., Campbell, M., & How, J. (2019). Learning to teach in cooperative multiagent reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence.
Omidshafiei, S., Liu, S.-Y., Everett, M., Lopez, B., Amato, C., Liu, M., How, J. P., & Vian, J. (2017a). Semantic-level decentralized multi-robot decision-making using probabilistic macro-observations. In Proceedings of the International Conference on Robotics and Automation.
Omidshafiei, S., Pazis, J., Amato, C., How, J. P., & Vian, J. (2017b). Deep decentralized multi-task multi-agent reinforcement learning under partial observability. In Proceedings of the International Conference on Machine Learning.
Pajarinen, J. K., & Peltonen, J. (2011). Periodic finite state controllers for efficient POMDP and DEC-POMDP planning. In Shawe-Taylor, J., Zemel, R., Bartlett, P., Pereira, F., & Weinberger, K. (Eds.), Advances in Neural Information Processing Systems 24, pp. 2636–2644.
Palmer, G., Tuyls, K., Bloembergen, D., & Savani, R. (2018). Lenient multi-agent deep reinforcement learning. In Proceedings of the International Conference on Autonomous Agents and Multiagent Systems, pp. 443–451.
Parker, L. E. (1998). ALLIANCE: An architecture for fault tolerant multirobot cooperation. IEEE Transactions on Robotics and Automation, 14 (2), 220–240.
Puterman, M. L. (1994). Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley-Interscience.
Quigley, M., Conley, K., Gerkey, B. P., Faust, J., Foote, T., Leibs, J., Wheeler, R., & Ng, A. Y. (2009). ROS: An open-source robot operating system. In ICRA Workshop on Open Source Software.
Rashid, T., Samvelyan, M., Schroeder, C., Farquhar, G., Foerster, J., & Whiteson, S. (2018). QMIX: Monotonic value function factorisation for deep multi-agent reinforcement learning. In Proceedings of the International Conference on Machine Learning, pp. 4295–4304.
Rekleitis, I., Lee-Shue, V., New, A. P., & Choset, H. (2004). Limited communication, multi-robot team based coverage. In Proceedings of the IEEE International Conference on Robotics and Automation, Vol. 4, pp. 3462–3468.
Seuken, S., & Zilberstein, S. (2007a). Improved memory-bounded dynamic programming for decentralized POMDPs. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, pp. 344–351.
Seuken, S., & Zilberstein, S. (2007b). Memory-bounded dynamic programming for DEC-POMDPs. In Proceedings of the International Joint Conference on Artificial Intelligence, pp. 2009–2015.
Silver, D., & Ciosek, K. (2012). Compositional planning using optimal option models. In Proceedings of the International Conference on Machine Learning.
Sonu, E., Chen, Y., & Doshi, P. (2015). Individual planning in agent populations: Exploiting anonymity and frame-action hypergraphs. In Proceedings of the International Conference on Automated Planning and Scheduling.
Spaan, M. T. J., & Melo, F. S. (2008).
Interaction-driven Markov games for decentralized multiagent planning under uncertainty. In Proceedings of the International Conference on Autonomous Agents and Multiagent Systems, pp. 525–532.
Stilman, M., & Kuffner, J. (2005). Navigation among movable obstacles: Real-time reasoning in complex environments. International Journal on Humanoid Robotics, 2 (4), 479–504.
Stone, P., Sutton, R. S., & Kuhlmann, G. (2005). Reinforcement learning for RoboCup soccer keepaway. Adaptive Behavior, 13 (3), 165–188.
Stroupe, A. W., Ravichandran, R., & Balch, T. (2004). Value-based action selection for exploration and dynamic target observation with robot teams. In Proceedings of the International Conference on Robotics and Automation, Vol. 4, pp. 4190–4197.
Sutton, R. S., Precup, D., & Singh, S. (1999). Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112 (1), 181–211.
Szer, D., Charpillet, F., & Zilberstein, S. (2005). MAA*: A heuristic search algorithm for solving decentralized POMDPs. In Proceedings of the Conference on Uncertainty in Artificial Intelligence.
Theocharous, G., & Kaelbling, L. P. (2003). Approximate planning in POMDPs with macro-actions. In Advances in Neural Information Processing Systems.
Thrun, S., Burgard, W., & Fox, D. (2005). Probabilistic Robotics (Intelligent Robotics and Autonomous Agents). The MIT Press.
Varakantham, P., Adulyasak, Y., & Jaillet, P. (2014). Decentralized stochastic planning with anonymity in interactions. In Proceedings of the AAAI Conference on Artificial Intelligence, pp. 2505–2512.
Velagapudi, P., Varakantham, P. R., Sycara, K., & Scerri, P. (2011). Distributed model shaping for scaling to decentralized POMDPs with hundreds of agents. In Proceedings of the International Conference on Autonomous Agents and Multiagent Systems, pp. 955–962.
Wu, F., Zilberstein, S., & Chen, X. (2010a). Point-based policy generation for decentralized POMDPs. In Proceedings of the International Conference on Autonomous Agents and Multiagent Systems, pp. 1307–1314.
Wu, F., Zilberstein, S., & Chen, X. (2010b). Rollout sampling policy iteration for decentralized POMDPs. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, pp. 666–673.
Wu, F., Zilberstein, S., & Jennings, N. R. (2013). Monte-Carlo expectation maximization for decentralized POMDPs. In Proceedings of the International Joint Conference on Artificial Intelligence, pp. 397–403.