Multi-agent active perception with prediction rewards

Mikko Lauri Department of Computer Science
Universität Hamburg Hamburg, Germany [email protected]
Frans A. Oliehoek Department of Computer Science
TU Delft Delft, the Netherlands [email protected]
Abstract
Multi-agent active perception is a task where a team of agents cooperatively gathers observations to compute a joint estimate of a hidden variable. The task is decentralized and the joint estimate can only be computed after the task ends by fusing observations of all agents. The objective is to maximize the accuracy of the estimate. The accuracy is quantified by a centralized prediction reward determined by a centralized decision-maker who perceives the observations gathered by all agents after the task ends. In this paper, we model multi-agent active perception as a decentralized partially observable Markov decision process (Dec-POMDP) with a convex centralized prediction reward. We prove that by introducing individual prediction actions for each agent, the problem is converted into a standard Dec-POMDP with a decentralized prediction reward. The loss due to decentralization is bounded, and we give a sufficient condition for when it is zero. Our results allow application of any Dec-POMDP solution algorithm to multi-agent active perception problems, and enable planning to reduce uncertainty without explicit computation of joint estimates. We demonstrate the empirical usefulness of our results by applying a standard Dec-POMDP algorithm to multi-agent active perception problems, showing increased scalability in the planning horizon.
1 Introduction
Active perception, collecting observations to reduce uncertainty about a hidden variable, is one of the fundamental capabilities of an intelligent agent [2]. In multi-agent active perception a team of autonomous agents cooperatively gathers observations to infer the value of a hidden variable. Application domains include search and rescue robotics, sensor networks, and distributed hypothesis testing. A multi-agent active perception task often has a finite duration: after observations have been gathered, they are collected in a central database for inference. While the inference phase is centralized, the observation gathering phase is decentralized: each agent acts independently, without knowing the observations collected by the other agents and without guaranteed communication with them.
The key problem in multi-agent active perception is to determine how each agent should act during the decentralized phase to maximize the informativeness of the collected observations, evaluated afterwards during the centralized inference phase. The problem can be formalized as a decentralized partially observable Markov decision process (Dec-POMDP) [3, 15], a general model of sequential multi-agent decision-making under uncertainty. At each time step in a Dec-POMDP, each agent in the team takes an individual action. The next state is determined according to a Markov chain conditional on the current hidden state and all individual actions. Each agent then perceives an individual observation correlated with the next state and the individual actions. The agents should act so as to maximize the expected sum of shared rewards, accumulated at each time step over a finite horizon.
In the decentralized phase, the per-step reward depends on the hidden state and the individual actions of all agents. In typical Dec-POMDPs, the reward on all time steps is of this form. Active perception problems are modelled by a reward that is a convex function of the team’s joint estimate of the hidden state, for example the negative entropy [11]. This encourages agents to act in ways that lead to joint state estimates with low uncertainty. In analogy to the centralized inference phase, the reward at the final time step can be thought of as a centralized prediction reward where a centralized decision-maker perceives the individual action-observation histories of all agents and determines a reward based on the corresponding joint state estimate. Due to the choice of reward function, algorithms for standard Dec-POMDPs are not applicable to such active perception problems. Despite the pooling of observations after the end of the task, the problem we target is decentralized. We design a strategy for each agent to act independently during the observation gathering phase, without knowing for sure how the others acted or what they perceived. Consequently, the joint state estimate is not available to any agent during the observation gathering phase. Strategies executable in a centralized manner are available only if all-to-all communication during task execution is possible, which we do not assume. As decentralized strategies are a strict subset of centralized strategies, the best decentralized strategy is at most as good as the best centralized strategy [16].
In this paper, we show that the convex centralized prediction reward can be converted to a decentralized prediction reward that is a function of the hidden state and so-called individual prediction actions. This converts the Dec-POMDP with a convex centralized prediction reward into a standard Dec-POMDP with rewards that depend on the hidden state and actions only. This enables solving multi-agent active perception problems without explicit computation of joint state estimates, using any standard Dec-POMDP algorithm. We show that the error induced when converting the centralized prediction reward into a decentralized one, the loss due to decentralization, is bounded. We also give a sufficient condition for when the loss is zero, meaning that the problems with the centralized, respectively decentralized, prediction rewards are equivalent. We demonstrate the empirical usefulness of our results by applying standard Dec-POMDP solution algorithms to active perception problems, showing improved scalability over the state-of-the-art.
The remainder of the paper is organized as follows. We review related work in Section 2, and give preliminary definitions for Dec-POMDPs in Section 3. In Section 4, we introduce our proposed conversion of a centralized prediction reward to a decentralized prediction reward. We propose a method to apply standard Dec-POMDP algorithms to multi-agent active perception in Section 5, and empirically evaluate the method in Section 6. Section 7 concludes the paper.
2 Related work
We briefly review possible formulations of multi-agent active perception problems, and then focus on the Dec-POMDP model that provides the most general formulation.
Multi-agent active perception has been formulated as a distributed constraint optimization problem (DCOP), submodular maximization, or as a specialized variant of a partially observable Markov decision process (POMDP). Probabilistic DCOPs with partial agent knowledge have been applied to signal source localization [9, 26]. DCOPs with Markovian dynamics have been proposed for target tracking by multiple sensors [13]. DCOPs are a simpler model than Dec-POMDPs, as a fixed communication structure is assumed or the noise in the sensing process is not modelled. Submodular maximization approaches assume the agents’ reward can be stated as a submodular set function, and apply distributed greedy maximization to obtain an approximate solution [22, 8, 7]. Along with the structure of the reward function, inter-agent communication is typically assumed. Specialized variants of POMDPs may also be applied. If all-to-all communication without delay during task execution is available, centralized control is possible and the problem can be solved as a multi-agent POMDP [24]. Auctioning of POMDP policies can facilitate multi-agent cooperation when agents can communicate [6]. Best et al. [4] propose a decentralized Monte Carlo tree search planner where agents periodically communicate their open-loop plans to each other.
Multi-agent active perception without implicit communication, with uncertainty on state transitions and the agents’ perception, may be modelled as a Dec-ρPOMDP [10, 11]. In contrast to standard Dec-POMDPs, the reward function in a Dec-ρPOMDP is a convex function of the joint state estimate, for example the negative entropy. Unfortunately, because of the convex reward, standard Dec-POMDP solution algorithms are not applicable. Furthermore, the heuristic algorithm proposed in [11, 12]
requires explicit computation of reachable joint state estimates. Computing and storing these joint state estimates adds significant memory overhead to the already high computational burden of solving Dec-POMDPs [3]. We address both shortcomings, showing that a Dec-ρPOMDP can be converted to a standard Dec-POMDP with linear rewards. This in turn enables applying standard planning algorithms that do not require computation of reachable joint state estimates, leading to improved scalability.
Our work in the decentralized setting draws inspiration from approaches using prediction rewards in single-agent and centralized scenarios. Araya-López et al. [1] proposed to tackle single-agent active perception as a ρPOMDP with a convex reward. The related POMDP with information rewards (POMDP-IR) was proposed in [25]. The POMDP-IR model adds prediction actions that the agent selects in addition to the usual actions. Active perception is facilitated by rewarding the agent for correctly predicting the true underlying state. The equivalence of ρPOMDP and POMDP-IR model was later established [20]. Recently, Satsangi et al. [21] propose a reinforcement learning method to solve ρPOMDPs taking advantage of the equivalence. In this paper we prove an analogous equivalence for Dec-POMDPs, by converting a Dec-ρPOMDP to a standard Dec-POMDP with individual prediction actions and a decentralized prediction reward. Unlike in the POMDP case, the conversion does not always result in a perfect equivalence, but is associated with a loss due to decentralization.
3 Multi-agent active perception as a Dec-POMDP
In this section, we review how active perception problems are modelled as Dec-POMDPs. We also review plan-time sufficient statistics, which allow us to concisely express joint state estimates and value functions of a Dec-POMDP. We concentrate on the practically relevant active perception problem where a prediction reward at the final time step depends on the joint state estimate.
3.1 Decentralized POMDPs
We define a Dec-POMDP where the action and observation spaces along with the reward function are time-dependent, as this will be convenient for our subsequent introduction of prediction actions.
Definition 1. A Dec-POMDP is a tuple M = h, I, S, b0, A, Z, T, R , where
• h ∈ N is the time horizon of the problem,
• I = {1, 2, . . . , n} is a set of n agents,
• S is the finite set of states s,
• b0 ∈ ∆(S) is the initial state distribution at time step t = 0,
• A is the collection of individual action spaces Ai,t for each agent i ∈ I and time step t = 0, . . . , h − 1. The tuple at = a1,t, a2,t, . . . , an,t of individual actions is called the joint action at time step t,
• Z is the collection of individual observation spaces Zi,t for each agent i ∈ I and time step t = 1, . . . , h. The tuple zt = z1,t, z2,t, . . . , zn,t of individual observations is called the joint observation at time step t,
• T is the dynamics function specifying conditional probabilities P(zt+1, st+1 | st, at), and
• R is the collection of reward functions Rt(st, at) for time steps t = 0, . . . , h − 1.
An admissible solution of a Dec-POMDP is a decentralized joint policy π, i.e., a tuple π1, . . . , πn where the individual policy πi of each agent i maps individual observation sequences zi,t = (zi,1, . . . , zi,t) to an individual action.1 An individual policy is a sequence πi = (δi,0, . . . , δi,h−1) of individual decision rules that map length-t individual observation sequences to an individual action δi,t(zi,t) = ai,t. A joint decision rule is a tuple δt = δ1,t, . . . , δn,t that maps a length-t joint observation sequence zt = z1,t, . . . , zn,t to a joint action δt(zt) = at. We shall use notation z−i,t to denote the individual observation sequences of all agents except i.
1zi,0 = ∅ as there is no observation at time t = 0.
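For concreteness, the tuple of Definition 1 and the policy objects above can be written down directly. The following Python sketch is purely illustrative (it is not the authors' implementation, and all names are ours); it stores time-indexed action and observation spaces and represents decision rules as mappings from observation sequences to actions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

State = int
Action = int
Obs = int
JointAction = Tuple[Action, ...]
JointObs = Tuple[Obs, ...]

@dataclass
class DecPOMDP:
    """Minimal container for the tuple of Definition 1."""
    h: int                                    # time horizon
    n_agents: int                             # number of agents |I|
    states: List[State]                       # finite state set S
    b0: Dict[State, float]                    # initial state distribution
    A: List[List[List[Action]]]               # A[t][i]: actions of agent i at time step t
    Z: List[List[List[Obs]]]                  # Z[t][i]: observations of agent i at time step t
    T: Callable[[JointObs, State, State, JointAction], float]  # P(z_{t+1}, s_{t+1} | s_t, a_t)
    R: List[Callable[[State, JointAction], float]]             # R_t(s_t, a_t) for t = 0..h-1

# An individual decision rule maps a length-t individual observation sequence to an action;
# an individual policy is one decision rule per time step, and a joint policy is one
# individual policy per agent.
IndividualDecisionRule = Dict[Tuple[Obs, ...], Action]
IndividualPolicy = List[IndividualDecisionRule]
JointPolicy = List[IndividualPolicy]
```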
The objective is to find an optimal joint policy π∗ that maximizes the expected sum of rewards, that is, π∗ = argmax_π E[ Σ_{t=0}^{h−1} Rt(st, δt(zt)) ], where the expectation is with respect to the distribution of states and joint observation sequences induced under π, i.e., P(s0, . . . , sh, zh | b0, π) ≜ b0(s0) Π_{t=0}^{h−1} T(zt+1, st+1 | st, δt(zt)).
In this paper, we are interested in cooperative active perception. We assume that after the task
terminates at time step h, the joint observation sequence zh is used to compute the conditional
distribution bh ∈ ∆(S) over the final state sh, that is, bh(sh) ≜ P(sh | zh, b0, π). We seek policies that, instead of only maximizing the expected sum of rewards Rt(st, at), also maximize the informativeness of bh.
The Dec-ρPOMDP [11] models active perception by maximizing a convex function of bh, e.g., the negative entropy. This convex function may be thought of as a centralized prediction reward that is independent of the agents’ individual actions. Conceptually, the centralized prediction reward is determined by a virtual centralized decision-maker that perceives the agents’ individual action and observation sequences at time h, computes bh, and then determines the final reward. We approximate the centralized prediction reward using a piecewise linear and convex function as illustrated in Figure 1. Consider a bounded, convex, and differentiable function f : ∆(S) → R. The tangent hyperplane α ∈ R^|S| of f at b ∈ ∆(S) is α = ∇f(b) − f∗(∇f(b)), where ∇ denotes the gradient and f∗ is the convex conjugate of f [5]. We select a finite set of linearization points bj ∈ ∆(S), and define Γ as the set of corresponding tangent hyperplanes αj.2 Then, we obtain a lower approximation as f(b) ≥ max_{α∈Γ} Σ_s b(s)α(s). This approximation is also used in ρPOMDPs, and has a bounded error [1]. The approximation is also used in [21], where error bounds for cases such as 0-1 rewards are provided.
Figure 1: A convex function f is approximated by max_i ⟨b, αi⟩, where the αi are tangent hyperplanes and ⟨·, ·⟩ is the inner product. The horizontal axis depicts b(s) in a system with two states s and s′, where b(s′) = 1 − b(s). A prediction action corresponds to selecting a single αi.
The Dec-ρPOMDP problem we consider is as follows.
Definition 2 (Dec-ρPOMDP). A Dec-ρPOMDP is a pair M, Γ , where M is a Dec-POMDP and Γ is a set of tangent hyperplanes that determine the centralized prediction reward ρ : ∆(S) → R defined as ρ(b) ≜ max_{α∈Γ} Σ_s b(s)α(s).
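To make Definition 2 concrete for the negative entropy f(b) = Σ_s b(s) ln b(s), the NumPy sketch below (illustrative only, not the paper's code) builds Γ from a few linearization points and evaluates ρ(b) = max_{α∈Γ} Σ_s b(s)α(s); on the simplex the tangent hyperplane of the negative entropy at bj reduces to αj(s) = ln bj(s), the same form used in the experiments of Section 6.

```python
import numpy as np

def neg_entropy(b: np.ndarray) -> float:
    """f(b) = sum_s b(s) ln b(s), convex on the simplex."""
    b = np.clip(b, 1e-12, 1.0)
    return float(np.sum(b * np.log(b)))

def tangent_alpha(b: np.ndarray) -> np.ndarray:
    """Tangent hyperplane of the negative entropy at b; on the simplex alpha(s) = ln b(s)."""
    return np.log(np.clip(b, 1e-12, 1.0))

# Linearization points b_j and the corresponding set Gamma of alpha-vectors.
linearization_points = [np.array([0.1, 0.9]), np.array([0.5, 0.5]), np.array([0.9, 0.1])]
Gamma = [tangent_alpha(bj) for bj in linearization_points]

def rho(b: np.ndarray, Gamma) -> float:
    """Piecewise-linear lower bound rho(b) = max_{alpha in Gamma} <b, alpha> <= f(b)."""
    return max(float(b @ alpha) for alpha in Gamma)

b = np.array([0.3, 0.7])
assert rho(b, Gamma) <= neg_entropy(b) + 1e-9   # lower approximation, as in Figure 1
```

Adding more linearization points tightens the approximation, at the cost of a larger Γ (and, after the conversion of Section 4, larger prediction action sets).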
The standard Dec-POMDP is the special case where Γ contains a single all-zero element. The objective in the Dec-ρPOMDP is to find an optimal joint policy that maximizes the expected sum of rewards and the centralized prediction reward, i.e., E[ Σ_{t=0}^{h−1} Rt(st, δt(zt)) + ρ(bh) ].
3.2 Sufficient plan-time statistics and the optimal value function
Sufficient plan-time statistics are probability distributions over states and joint observation sequences
given the past joint policy followed by the agents [14]. A past joint policy at time t is a joint policy
specified until time t, denoted ϕt = ϕ1,t, . . . , ϕn,t , where each individual past policy is a sequence of individual decision rules: ϕi,t = (δi,0, . . . , δi,t−1). The sufficient plan-time statistic for initial state distribution b0 and a past joint policy ϕt is defined as
σt(st, zt) ≜ P(st, zt | b0, ϕt),   (1)
and at the starting time, σ0(s0) ≜ b0(s0). The conditional σt(· | zt) is the state distribution after perceiving the joint observation sequence zt when executing policy ϕt with initial state distribution b0. The marginal σt(zt) is the probability of the joint observation sequence zt given ϕt and b0.
Sufficient plan-time statistics are updateable to any extension of the past joint policy ϕt. The extension of an individual past policy ϕi,t by an individual decision rule δi,t is defined as ϕi,t+1 = ϕi,t ◦ δi,t ≜ (δi,0, . . . , δi,t−1, δi,t). The extension of a past joint policy ϕt by δt = δ1,t, . . . , δn,t is defined as ϕt+1 = ϕt ◦ δt ≜ ϕ1,t+1, . . . , ϕn,t+1 . Let zt+1 = (zt, zt+1) be a joint observation sequence that extends zt with zt+1. Given the statistic σt for ϕt and the next joint decision rule δt, the updated statistic for ϕt+1 = ϕt ◦ δt is σt+1 = Uss(σt, δt), where the update operator Uss is defined as
[Uss(σt, δt)](st+1, zt+1) ≜ Σ_{st} T(zt+1, st+1 | st, δt(zt)) σt(st, zt).   (2)
2We defer discussion on how we select the linearization points to Section 5.
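A direct implementation tabulates σt over (state, joint observation sequence) pairs and applies Eq. (2) as a sum over predecessor states. The Python sketch below is illustrative only; it assumes the container sketched earlier, with δt given as a callable from joint observation sequences to joint actions and T(z′, s′, s, a) returning a probability.

```python
from collections import defaultdict

def initial_statistic(b0):
    """sigma_0(s_0) = b0(s_0); the empty joint observation sequence is ()."""
    return {(s, ()): p for s, p in b0.items() if p > 0.0}

def update_statistic(sigma_t, delta_t, T, states, joint_obs_space):
    """Eq. (2): [Uss(sigma_t, delta_t)](s_{t+1}, z^{t+1})
       = sum_{s_t} T(z_{t+1}, s_{t+1} | s_t, delta_t(z^t)) * sigma_t(s_t, z^t)."""
    sigma_next = defaultdict(float)
    for (s_t, z_hist), p in sigma_t.items():
        a_t = delta_t(z_hist)                  # joint action prescribed for this history
        for z_next in joint_obs_space:         # joint observations possible at step t+1
            for s_next in states:
                pr = T(z_next, s_next, s_t, a_t)
                if pr > 0.0:
                    sigma_next[(s_next, z_hist + (z_next,))] += pr * p
    return dict(sigma_next)
```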
The statistic σt is sufficient to predict the expected immediate reward Rˆt(σt, δt) of choosing the next joint decision rule δt:
Rˆt(σt, δt) = Σ_{zt,st} σt(st, zt) Rt(st, δt(zt)).   (3)
Furthermore, σh is sufficient to predict the expected centralized prediction reward ρˆ(σh):
ρˆ(σh) = Σ_{zh} σh(zh) max_{α∈Γ} Σ_{sh} σh(sh | zh) α(sh).   (4)
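Given a tabulated σ, Eqs. (3) and (4) follow directly. The sketch below reuses the dictionary representation from the previous snippet and assumes the α-vectors are indexable by state (e.g., lists or arrays over integer state indices); it is an illustration, not the authors' code.

```python
def expected_reward(sigma_t, delta_t, R_t):
    """Eq. (3): sum over (s_t, z^t) of sigma_t(s_t, z^t) * R_t(s_t, delta_t(z^t))."""
    return sum(p * R_t(s_t, delta_t(z_hist)) for (s_t, z_hist), p in sigma_t.items())

def expected_centralized_prediction_reward(sigma_h, Gamma):
    """Eq. (4). Since sigma_h(z^h) >= 0 can be pulled inside the max, it suffices to
       maximize the unnormalized sum over s_h of sigma_h(s_h, z^h) * alpha(s_h) per history."""
    by_history = {}
    for (s_h, z_hist), p in sigma_h.items():
        by_history.setdefault(z_hist, []).append((s_h, p))
    return sum(
        max(sum(p * alpha[s_h] for s_h, p in entries) for alpha in Gamma)
        for entries in by_history.values()
    )
```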
In contrast to the reward in earlier stages, ρˆ does not depend on decision rules at all. We interpret ρˆ as the expectation of the prediction reward of a centralized decision-maker who selects the best α ∈ Γ after perceiving the full history of joint actions and observations at task termination.
Let π be a joint policy consisting of the joint decision rules δ0, δ1, . . . , δh−1. The value-to-go of π starting from a sufficient plan-time statistic σt at time step t is
Qπt(σt, δt) = Rˆt(σt, δt) + ρˆ(σt+1) if t = h − 1,
Qπt(σt, δt) = Rˆt(σt, δt) + Qπt+1(σt+1, δt+1) if 0 ≤ t < h − 1,   (5)
with the shorthand σt+1 = Uss(σt, δt). The value function V π of a joint policy π is defined as the sum of expected rewards when the agents act according to π, V π(σ0) ≜ Qπ0(σ0, δ0). The value function of an optimal policy π∗ is denoted V ∗, and it satisfies V ∗(σ0) ≥ V π(σ0) for all π. We write V^π_M and V^π_{M,Γ} for the value function in a standard Dec-POMDP and a Dec-ρPOMDP, respectively.
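Putting the preceding quantities together, the value of a joint policy under the Dec-ρPOMDP objective can be evaluated by rolling the plan-time statistic forward. The sketch below reuses the illustrative helpers introduced above and, for simplicity, represents the policy as a list of joint decision rules (callables from joint observation sequences to joint actions); it is a sketch under those assumptions, not the authors' implementation.

```python
def policy_value(decpomdp, joint_decision_rules, Gamma, joint_obs_spaces):
    """V^pi(sigma_0) as in Eq. (5): accumulate the expected rewards of Eq. (3) for
       t = 0, ..., h-1 and add the expected centralized prediction reward of Eq. (4)
       at the horizon. joint_obs_spaces[t] lists the joint observations at step t+1."""
    sigma = initial_statistic(decpomdp.b0)
    value = 0.0
    for t in range(decpomdp.h):
        delta_t = joint_decision_rules[t]
        value += expected_reward(sigma, delta_t, decpomdp.R[t])
        sigma = update_statistic(sigma, delta_t, decpomdp.T, decpomdp.states,
                                 joint_obs_spaces[t])
    return value + expected_centralized_prediction_reward(sigma, Gamma)
```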
4 Conversion to standard Dec-POMDP
In this section, we show that any Dec-ρPOMDP can be converted to a standard Dec-POMDP by adding individual prediction actions for each agent, and by introducing a decentralized prediction reward. Each individual prediction action corresponds to a selection of a tangent hyperplane of the centralized prediction reward, as illustrated in Figure 1. As the individual prediction actions are chosen in a decentralized manner, the decentralized prediction reward never exceeds the centralized prediction reward. Consequently, we prove that an optimal policy of the standard Dec-POMDP may be applied to the Dec-ρPOMDP with bounded error compared to the true optimal policy. We give a sufficient condition for when this loss due to decentralization is zero and a Dec-ρPOMDP is equivalent to a standard Dec-POMDP.
A major implication of our results is that it is possible to apply any Dec-POMDP solver to a DecρPOMDP problem. We approximate a centralized prediction reward by a decentralized prediction reward that only depends on states and actions. This allows planning for active perception problems without explicit computation of joint state estimates.
We first introduce our proposed conversion and prove its properties, including the error bound. We conclude the section by giving a sufficient condition for when the two problems are equivalent. All omitted proofs are found in the supplementary material.
Definition 3. Given a Dec-ρPOMDP M, Γ with M = h, I, S, b0, A, Z, T , R , convert it to a standard Dec-POMDP M+ = h + 1, I, S, b0, A+, Z+, T +, R+ where the horizon is incremented by one and
• in A+, the individual action space Ai,h for each agent i ∈ I at time h is a set of individual prediction actions ai,h, with one individual prediction action for each tangent hyperplane α ∈ Γ; for other time steps Ai,t are as in A,
• in Z+, the individual observation space Zi,h+1 for each agent i ∈ I at time h + 1 contains a single null observation; for other time steps Zi,t are as in Z,
• T +(zh+1, sh+1 | sh, ah) equals one if zh+1 is the joint null observation and sh+1 = sh, and is zero otherwise; for other time steps T + is equal to T , and
• in R+, the reward function Rh at time step h is a linear combination of individual prediction rewards Ri,h of each agent, such that for a joint prediction action ah = a1,h, . . . , an,h the decentralized prediction reward is
Rh(sh, ah) = (1/n) Σ_{i=1}^{n} Ri,h(sh, ai,h),   (6)
with Ri,h(sh, ai,h) = αai,h (sh), where αai,h ∈ Γ is the tangent hyperplane corresponding to the individual prediction action ai,h; for other time steps Rt are as in R.
As the action and observation sets in M and M+ are the same until t = h, past joint policies are interchangeable, and plan-time sufficient statistics are identical.
Lemma 1. Let M, Γ be a Dec-ρPOMDP and define M+ as above. Then, for any past joint policy ϕt with t ≤ h, the respective plan-time sufficient statistics σt in M, Γ and σt+ in M+ are equivalent, i.e., σt ≡ σt+.
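The reward part of the conversion in Definition 3 is mechanical: each agent's prediction-action set indexes Γ, and the step-h reward averages the chosen α-vectors. The Python sketch below is a hedged illustration of Eq. (6) with names of our own choosing.

```python
def make_prediction_action_space(Gamma, n_agents):
    """A_{i,h}: one individual prediction action per alpha-vector, identical for every agent."""
    return [list(range(len(Gamma))) for _ in range(n_agents)]

def decentralized_prediction_reward(Gamma):
    """Eq. (6): R_h(s_h, a_h) = (1/n) * sum_i alpha_{a_{i,h}}(s_h)."""
    def R_h(s_h, joint_prediction_action):
        n = len(joint_prediction_action)
        return sum(Gamma[a_i][s_h] for a_i in joint_prediction_action) / n
    return R_h
```

If all agents pick the same α-vector the reward reduces to that hyperplane evaluated at sh; if they disagree, the reward is the average of the chosen hyperplanes.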
In M+ the horizon is incremented, so that each agent takes one more action than in M, Γ . The
additional action is one of the newly added individual prediction actions at time step h. To select an
individual prediction action, an agent may use 1) individual information, that is, the agent’s individual
observation sequence, and 2) plan-time information common to all agents, that is, the plan-time
sufficient statistic. An individual prediction rule φi maps the individual observation sequence zi,h
and the plan-time sufficient statistic σh to an individual prediction action. A joint prediction rule
φ = φ1, . . . , φn is a tuple of individual prediction rules, and maps a joint observation sequence zh
and σh to a joint prediction action. The expected decentralized prediction reward given σh and φ is
defined analogously to Eq. (3) as Rˆh(σh, φ) ≜ Σ_{zh,sh} σh(sh, zh) Rh(sh, φ(zh, σh)).
An optimal joint prediction rule maximizes the expected decentralized prediction reward. We prove
that an optimal joint prediction rule consists of individual prediction rules that maximize the expected
individual prediction reward. We then show that the expected decentralized prediction reward is
at most equal to the centralized prediction reward, and that a similar relation holds between value functions in the respective Dec-POMDP and Dec-ρPOMDP. The key property required is that the
decentralized prediction reward is a sum of individual prediction rewards.
Lemma 2 (Optimal joint prediction rule). Let M, Γ be a Dec-ρPOMDP and define M+ as above,
and let σh be a plan-time sufficient statistic for any past joint policy. Then the joint prediction rule
φ∗ = φ∗1, . . . , φ∗n , where each individual prediction rule φ∗i is defined as
φ∗i(zi,h, σh) ≜ argmax_{ai,h∈Ai,h} Σ_{z−i,h,sh} σh(sh, z−i,h | zi,h) Ri,h(sh, ai,h)   (7)
maximizes expected decentralized prediction reward, that is, Rˆh(σh, φ∗) = maxφ Rˆh(σh, φ).
In the Dec-ρPOMDP M, Γ , the centralized prediction reward is determined by a centralized virtual agent that has access to the observation histories of all agents. In our converted standard Dec-POMDP M+, each agent individually takes a prediction action, leading to the decentralized prediction reward. The decentralized prediction reward never exceeds the centralized prediction reward as we show next.
Lemma 3. The expected decentralized prediction reward Rˆh(σh, φ∗) in M+ is at most equal to the expected centralized prediction reward ρˆ(σh) in M, Γ , i.e., Rˆh(σh, φ∗) ≤ ρˆ(σh).
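The selection rule of Eq. (7) and the inequality of Lemma 3 can be checked numerically on a small tabulated σh. The sketch below is a brute-force toy that reuses the earlier dictionary representation (joint observation sequences are tuples of joint observations, each indexed by agent); it is not the paper's code.

```python
def individual_sequence(z_hist, i):
    """Project a joint observation sequence onto agent i's individual observations."""
    return tuple(z[i] for z in z_hist)

def optimal_individual_prediction_action(sigma_h, Gamma, i, z_i):
    """Eq. (7). Conditioning on z_{i,h} only rescales every score by 1 / sigma_h(z_{i,h}),
       so the unnormalized sums over consistent histories give the same argmax."""
    best, best_a = float("-inf"), None
    for a_i, alpha in enumerate(Gamma):
        score = sum(p * alpha[s_h] for (s_h, z_hist), p in sigma_h.items()
                    if individual_sequence(z_hist, i) == z_i)
        if score > best:
            best, best_a = score, a_i
    return best_a

def expected_decentralized_prediction_reward(sigma_h, Gamma, n_agents):
    """R_hat_h(sigma_h, phi*) with the optimal individual rules and the averaged reward of Eq. (6)."""
    total = 0.0
    for (s_h, z_hist), p in sigma_h.items():
        picks = [optimal_individual_prediction_action(sigma_h, Gamma, i,
                                                      individual_sequence(z_hist, i))
                 for i in range(n_agents)]
        total += p * sum(Gamma[a_i][s_h] for a_i in picks) / n_agents
    return total

# Lemma 3 in action: for any tabulated sigma_h,
# expected_decentralized_prediction_reward(sigma_h, Gamma, n)
#     <= expected_centralized_prediction_reward(sigma_h, Gamma).
```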
Lemma 4. Let M, Γ and M+ be as defined above. Let ϕh be a past joint policy for M+, and let
φ∗ be the optimal joint prediction rule. Then, the value of ϕh ◦ φ∗ in M+ is at most equal to the
value of ϕh in M, Γ , i.e., V^{ϕh◦φ∗}_{M+}(σ0) ≤ V^{ϕh}_{M,Γ}(σ0).
We now show that the difference between the optimal values of the standard Dec-POMDP M+ and the Dec-ρPOMDP M, Γ is bounded. We call the error the loss due to decentralization.
Theorem 1 (Loss due to decentralization). Consider a Dec-ρPOMDP M, Γ with the optimal value function V^∗_{M,Γ}. Let π be an optimal policy for the standard Dec-POMDP M+ created as in Definition 3, and denote by ϕh the past joint policy consisting of the first h decision rules of π. Then the difference of V^∗_{M,Γ} and the value function V^{ϕh}_{M,Γ} of applying ϕh to M, Γ is bounded by
|V^∗_{M,Γ}(σ0) − V^{ϕh}_{M,Γ}(σ0)| ≤ 2 max_{σh} |ρˆ(σh) − Rˆh(σh, φ∗)|,   (8)
where φ∗ is the optimal joint prediction rule.
Proof. For clarity, in the following we omit the argument σ0 from the value functions. Suppose the optimal policy π∗ in M, Γ consists of the joint decision rules δ0∗, . . . , δh∗−1. For M+, define the partial joint policy ϕ∗h = (δ0∗, . . . , δh∗−1) and its extension ϕ∗h ◦ φ∗. By Lemma 2, an optimal joint policy π of M+ is an extension of some past joint policy ϕh by φ∗, i.e., π = ϕh ◦ φ∗. Because ϕh ◦ φ∗ is optimal in M+, we have that V^{ϕ∗h◦φ∗}_{M+} ≤ V^{ϕh◦φ∗}_{M+}. By Lemma 4, V^{ϕh◦φ∗}_{M+} ≤ V^{ϕh}_{M,Γ}. Finally, because π∗ is optimal in M, Γ , we have V^{ϕh}_{M,Γ} ≤ V^∗_{M,Γ}. Now,
|V^∗_{M,Γ} − V^{ϕh}_{M+}| ≤ |V^∗_{M,Γ} − V^{ϕ∗h◦φ∗}_{M+}| + |V^{ϕ∗h◦φ∗}_{M+} − V^{ϕh}_{M+}|   (9a)
≤ |V^∗_{M,Γ} − V^{ϕ∗h◦φ∗}_{M+}| + |V^{ϕ∗h◦φ∗}_{M+} − V^∗_{M,Γ}| = 2 · |V^∗_{M,Γ} − V^{ϕ∗h◦φ∗}_{M+}|   (9b)
= 2 · | Σ_{t=0}^{h−1} Rˆt(σt∗, δt∗) + ρˆ(σh∗) − ( Σ_{t=0}^{h−1} Rˆt(σt∗, δt∗) + Rˆh(σh∗, φ∗) ) |   (9c)
= 2 · |ρˆ(σh∗) − Rˆh(σh∗, φ∗)| ≤ 2 · max_{σh} |ρˆ(σh) − Rˆh(σh, φ∗)|.   (9d)
We first apply the triangle inequality. The second inequality holds as V^{ϕ∗h◦φ∗}_{M+} ≤ V^{ϕh}_{M,Γ} ≤ V^∗_{M,Γ}.
The next equality is by symmetry of absolute difference. The third line follows from the definition of the value function and Lemma 1. Note that we refer by σt∗ to the sufficient plan-time statistics reached under partial joint policies ϕ∗t of π∗. The final inequality follows by maximizing the difference of the expected centralized and decentralized prediction rewards.
The theorem shows that given a Dec-ρPOMDP M, Γ , we may solve the standard Dec-POMDP M+ and apply its optimal policy to the Dec-ρPOMDP with bounded error. If there is only a single agent, that agent is equivalent to the conceptualized centralized agent, which implies ρˆ ≡ Rˆh and that the loss is zero. The equivalence of a ρPOMDP with a convex prediction reward and a POMDP with information rewards first shown in [20] is therefore obtained as a special case.
The proof indicates that the error is at most twice the difference of the expected centralized prediction reward and the expected decentralized prediction reward at the sufficient plan-time statistic σh∗ at time h under an optimal policy π∗ of M, Γ . This suggests that the final bound given as maximum over all sufficient plan-time statistics may be overly pessimistic for some cases. While a complete characterization of such cases is beyond the scope of this paper, we give below a sufficient condition for when the error is zero in the multi-agent case. The proof is in the supplementary material.
Observation 1. Consider the setting of Theorem 1. Assume that the observation sequence of each agent is conditionally independent of the observation sequences of all other agents given the past joint policy and initial state distribution, i.e., for every agent i, σh(zh) = σh(zi,h)σh(z−i,h). Then π∗ is an optimal joint policy for M, Γ if and only if π∗ ◦ φ∗ is an optimal joint policy for M+.
The sufficient condition above is restrictive. Informally, it requires that each agent executes its own independent active perception task. However, the loss due to decentralization may be small if a multi-agent active perception task almost satisfies the requirement. It might be possible to derive less restrictive sufficient conditions in weakly-coupled problems by building on influence-based abstractions [17]. Such weakly-coupled cases may arise, e.g., in distributed settings where two agents deployed in different regions far away from each other have limited influence on each other.
5 Adaptive prediction action search
Recall that the centralized prediction reward function ρ is obtained by approximating a continuous convex function f by a set of α-vectors. The approximation is more accurate the more α-vectors
Algorithm 1 Adaptive prediction action search (APAS) for Dec-ρPOMDP planning
Input: Dec-ρPOMDP M, Γ , convex function f : ∆(S) → R, number of linearization points K
Output: Best joint policy found, πbest
1: Vbest ← −∞, πbest ← ∅
2: repeat
3:   // Policy optimization phase
4:   M+ ← CONVERTDECPOMDP(M, Γ)        // Apply Definition 3
5:   π ← PLAN(M+)                      // Use any Dec-POMDP planner
6:   V ← EVALUATE(π)
7:   if V > Vbest then Vbest ← V, πbest ← π
8:   // Adaptation phase
9:   Γ ← ∅
10:  for k = 1, . . . , K do
11:    zh ∼ σh(zh)                     // Sample joint observation sequence zh using πbest
12:    bk ← σh(· | zh)                 // Final state estimate corresponding to zh
13:    αk ← ∇f(bk) − f∗(∇f(bk))        // Tangent hyperplane of f at bk
14:    Γ ← Γ ∪ {αk}
15:  end for
16: until converged
17: return πbest
are used: for error bounds, see [1, 21]. We propose a planning algorithm for Dec-ρPOMDPs called Adaptive Prediction Action Search (APAS) that dynamically updates the α-vectors during planning. This avoids the need to create a large number of α-vectors a priori, some of which may turn out not to be useful. Moreover, APAS allows application of any standard Dec-POMDP algorithm to solve Dec-ρPOMDPs by using the conversion proposed in Section 4. This can avoid the computation of all reachable state estimates required by earlier algorithms for Dec-ρPOMDPs [11, 12].
The pseudocode for APAS is shown in Algorithm 1. APAS consists of two phases that are repeated until convergence: the policy optimization phase and the α-vector adaptation phase. In the policy optimization phase, the best joint policy for the current set Γ of α-vectors is found. On the first iteration we initialize Γ randomly as explained in the next section. In the adaptation phase, the α-vectors are then modified such that they best approximate the final reward for joint state estimates that are most likely reached by the current joint policy. In the optimization phase the Dec-ρPOMDP is converted to a standard Dec-POMDP as described in Section 4. We then apply any standard Dec-POMDP algorithm to plan a joint policy π for the converted Dec-POMDP. If the value of π exceeds the value of the best policy so far, πbest, the best policy is updated. In the adaptation phase, we sample a final state distribution bk under the currently best policy, and insert the tangent hyperplane at bk into Γ. This corresponds to sampling a point on the horizontal axis in Figure 1 and adding the corresponding tangent hyperplane to Γ. This is repeated K times. Specifically, we simulate a trajectory under the current best policy πbest to sample a joint observation sequence (Line 11), use Bayesian filtering to compute the corresponding state estimate bk (Line 12), and compute the corresponding α-vector (Line 13). The updated set Γ of α-vectors provides the best approximation of the final reward for the sampled state estimates. The time complexity of APAS is determined by the complexity of the Dec-POMDP planner called in PLAN. A reference implementation is available at https://github.com/laurimi/multiagent-prediction-reward.
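The adaptation phase (Lines 9-15 of Algorithm 1) is straightforward to sketch for the negative-entropy reward used in the experiments. The snippet below is a hedged illustration: sample_history and filter_posterior stand in for the policy simulation and Bayesian filtering steps and are not part of the paper's code.

```python
import numpy as np

def adapt_alpha_vectors(K, sample_history, filter_posterior, rng=None):
    """Adaptation phase of APAS for f(b) = sum_s b(s) ln b(s): sample K joint observation
       sequences under the current best policy, compute the corresponding final state
       estimates b_k, and return the tangent hyperplanes alpha_k(s) = ln b_k(s)."""
    rng = rng if rng is not None else np.random.default_rng()
    Gamma = []
    for _ in range(K):
        z_h = sample_history(rng)                   # Line 11: z^h sampled under pi_best
        b_k = np.asarray(filter_posterior(z_h))     # Line 12: b_k = sigma_h(. | z^h)
        alpha_k = np.log(np.clip(b_k, 1e-12, 1.0))  # Line 13: tangent hyperplane of f at b_k
        Gamma.append(alpha_k)
    return Gamma
```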
6 Experiments
The Dec-ρPOMDP we target is computationally more challenging (NEXP-complete [3]) than centralized POMDP, POMDP-IR, or ρPOMDP (PSPACE-complete [19]). Immediate all-to-all communication during task execution would be required to solve the problem we target as a centralized problem, and the resulting optimal value would be higher [16]. Therefore, a comparison to centralized methods is neither fair nor necessary. We compare APAS to the NPGI algorithm of [11] that solves a Dec-ρPOMDP by iterative improvement of a fixed-size policy graph. As the PLAN subroutine of APAS, we use the finite-horizon variant of the policy graph improvement method of [18]. This method is algorithmically similar to NPGI, which helps isolate the effect of applying the proposed conversion to a standard Dec-POMDP from the effect of algorithmic design choices to the extent possible. Any other Dec-POMDP algorithm may also be used with APAS. Details on the algorithms and parameter settings are provided in the supplementary material.

Table 1: Average policy values ± standard error in the MAV and the rovers domains.

MAV domain
h   APAS (ours)       APAS (no adapt.)   NPGI [12]   Optimal
2   -2.006 ± 0.005    -2.112 ± 0.002     -1.931      -1.919
3   -1.936 ± 0.004    -2.002 ± 0.002     -1.833      -1.831
4   -1.879 ± 0.004    -1.943 ± 0.002     -1.768      —
5   -1.842 ± 0.005    -1.918 ± 0.002     -1.725      —
6   -1.814 ± 0.004    -1.898 ± 0.002     —           —
7   -1.789 ± 0.004    -1.892 ± 0.003     —           —
8   -1.820 ± 0.005    -1.885 ± 0.006     —           —

Rovers domain
h   APAS (ours)       APAS (no adapt.)   NPGI [12]   Optimal
2   -3.484 ± 0.002    -3.800 ± 0.006     -3.495      -3.479
3   -3.402 ± 0.008    -3.680 ± 0.006     -3.192      -3.189
4   -3.367 ± 0.011    -3.640 ± 0.006     -3.036      —
5   -3.293 ± 0.012    -3.591 ± 0.006     -2.981      —
6   -3.333 ± 0.012    -3.631 ± 0.007     —           —
7   -3.375 ± 0.014    -3.673 ± 0.008     —           —
8   -3.496 ± 0.014    -3.860 ± 0.010     —           —
We evaluate on the Dec-ρPOMDP domains from [11]: the micro air vehicle (MAV) domain and information gathering rovers domain. In the MAV domain two agents cooperatively track a moving target to determine if the target is friendly or hostile. In the rovers domain, two agents collect samples of scientific data in a grid environment. The reward at the final time step is the negative entropy of the joint state estimate bh, that is, f(bh) = Σ_{sh} bh(sh) ln bh(sh). For the MAV problem, we use K = 2, and for the rovers problem, K = 5 individual prediction actions. For APAS, we initialize Γ by randomly sampling linearization points bk ∈ ∆(S), k = 1, . . . , K using [23]. The corresponding α-vectors for the negative entropy are αk(s) = ln bk(s). We only observed minor benefits from using a greater number of α-vectors, and only for long horizons; details are reported in the supplementary material. We run 100 repetitions using APAS and report the average policy value and its standard error. Results for NPGI are as reported in [12]. To ensure comparability, all reported policy values are computed using f as the final reward. As a baseline, we report the optimal value if known.
Comparison to the state-of-the-art. The results are shown in Table 1. While APAS does not exceed state-of-the-art performance in terms of policy quality, the major advantage of APAS is that it is able to scale to greater horizons than NPGI. The number of joint state distributions reachable under the current policy grows exponentially in the horizon h. Computation of these state distributions is required by NPGI, causing it to run out of memory with h > 5.
While running APAS, we compute the value of the policies exactly (Alg. 1, Line 6). For long horizons, exact evaluation requires a significant fraction of the computation budget. We expect that further scaling in terms of planning horizon is possible by switching to approximate evaluation, at the cost of noisy value estimates. Results on computation time as well as results for horizons up to 10 for the Rovers problem are reported in the supplementary material.
Benefit of adaptive prediction action selection. The adaptation phase of APAS is disabled by not running Lines 9-15. Instead, at the start of each iteration of the loop we randomly sample K linearization points using [23] to create Γ and solve the corresponding Dec-POMDP. We repeat this procedure 1000 times and report the results in the column “APAS (no adapt.)” of Table 1. We conclude that the α-vector adaptation clearly improves performance in both domains.
7 Conclusion
We showed that multi-agent active perception modelled as a Dec-ρPOMDP can be reduced to a standard Dec-POMDP by introducing individual prediction actions. The difference between the optimal solution of the standard Dec-POMDP and the Dec-ρPOMDP is bounded. Our reduction enables application of any standard Dec-POMDP solver to multi-agent active perception problems, as demonstrated by our proposed APAS algorithm.
Our results allow transferring advances in scalability for standard Dec-POMDPs to multi-agent active perception tasks. In multi-agent reinforcement learning, rewards typically depend on the underlying state of the system. Therefore, our reduction result also enables further investigation into learning for multi-agent active perception. An investigation of the necessary conditions for when the loss due to decentralization is zero is another future direction.
Broader Impact
This work is theoretical in nature, and therefore does not present any foreseeable societal consequence.
Acknowledgments
This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No. 758824 —INFLUENCE).
References
[1] Mauricio Araya-López, Olivier Buffet, Vincent Thomas, and Francois Charpillet. A POMDP Extension with Belief-dependent Rewards. In Advances in Neural Information Processing Systems, pages 64–72, 2010.
[2] Ruzena Bajcsy, Yiannis Aloimonos, and John K Tsotsos. Revisiting active perception. Autonomous Robots, 42(2):177–196, 2018.
[3] Daniel S Bernstein, Robert Givan, Neil Immerman, and Shlomo Zilberstein. The complexity of decentralized control of Markov decision processes. Mathematics of Operations Research, 27(4):819–840, 2002.
[4] Graeme Best, Oliver M Cliff, Timothy Patten, Ramgopal R Mettu, and Robert Fitch. Dec-MCTS: Decentralized planning for multi-robot active perception. The International Journal of Robotics Research, 38(2-3):316–337, 2019.
[5] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[6] Jesus Capitan, Matthijs T.J. Spaan, Luis Merino, and Anibal Ollero. Decentralized multi-robot cooperation with auctioned POMDPs. The International Journal of Robotics Research, 32(6):650–671, 2013.
[7] Micah Corah and Nathan Michael. Distributed matroid-constrained submodular maximization for multirobot exploration: Theory and practice. Autonomous Robots, 43(2):485–501, 2019.
[8] Bahman Gharesifard and Stephen L Smith. Distributed submodular maximization with limited information. IEEE Transactions on Control of Network Systems, 5(4):1635–1645, 2017.
[9] Manish Jain, Matthew Taylor, Milind Tambe, and Makoto Yokoo. DCOPs meet the real world: Exploring unknown reward matrices with applications to mobile sensor networks. In Intl. Joint Conference on Artificial Intelligence (IJCAI), pages 181–186, 2009.
[10] Mikko Lauri, Eero Heinänen, and Simone Frintrop. Multi-robot active information gathering with periodic communication. In IEEE Intl. Conf. on Robotics and Automation (ICRA), pages 851–856, 2017.
[11] Mikko Lauri, Joni Pajarinen, and Jan Peters. Information Gathering in Decentralized POMDPs by Policy Graph Improvement. In Intl. Conf. on Autonomous Agents and Multiagent Systems (AAMAS), pages 1143–1151, 2019.
[12] Mikko Lauri, Joni Pajarinen, and Jan Peters. Multi-agent active information gathering in discrete and continuous-state decentralized POMDPs by policy graph improvement. Autonomous Agents and Multi-Agent Systems, 34(42):1–44, 2020. Issue no. 2.
[13] Duc Thien Nguyen, William Yeoh, Hoong Chuin Lau, Shlomo Zilberstein, and Chongjie Zhang. Decentralized multi-agent reinforcement learning in average-reward dynamic DCOPs. In AAAI Conference on Artificial Intelligence, pages 1447–1455, 2014.
[14] Frans A. Oliehoek. Sufficient plan-time statistics for decentralized POMDPs. In Intl. Joint Conference on Artificial Intelligence (IJCAI), pages 302–308, 2013.
[15] Frans A. Oliehoek and Christopher Amato. A Concise Introduction to Decentralized POMDPs. Springer, 2016.
[16] Frans A. Oliehoek, Matthijs TJ Spaan, and Nikos Vlassis. Optimal and approximate Q-value functions for decentralized POMDPs. Journal of Artificial Intelligence Research, 32:289–353, 2008.
Mikko Lauri Department of Computer Science
Universität Hamburg Hamburg, Germany [email protected]
Frans A. Oliehoek Department of Computer Science
TU Delft Delft, the Netherlands [email protected]
Abstract
Multi-agent active perception is a task where a team of agents cooperatively gathers observations to compute a joint estimate of a hidden variable. The task is decentralized and the joint estimate can only be computed after the task ends by fusing observations of all agents. The objective is to maximize the accuracy of the estimate. The accuracy is quantified by a centralized prediction reward determined by a centralized decision-maker who perceives the observations gathered by all agents after the task ends. In this paper, we model multi-agent active perception as a decentralized partially observable Markov decision process (Dec-POMDP) with a convex centralized prediction reward. We prove that by introducing individual prediction actions for each agent, the problem is converted into a standard DecPOMDP with a decentralized prediction reward. The loss due to decentralization is bounded, and we give a sufficient condition for when it is zero. Our results allow application of any Dec-POMDP solution algorithm to multi-agent active perception problems, and enable planning to reduce uncertainty without explicit computation of joint estimates. We demonstrate the empirical usefulness of our results by applying a standard Dec-POMDP algorithm to multi-agent active perception problems, showing increased scalability in the planning horizon.
1 Introduction
Active perception, collecting observations to reduce uncertainty about a hidden variable, is one of the fundamental capabilities of an intelligent agent [2]. In multi-agent active perception a team of autonomous agents cooperatively gathers observations to infer the value of a hidden variable. Application domains include search and rescue robotics, sensor networks, and distributed hypothesis testing. A multi-agent active perception task often has a finite duration: after observations have been gathered, they are collected to a central database for inference. While the inference phase is centralized, the observation gathering phase is decentralized: each agent acts independently, without knowing the observations collected by the other agents nor guaranteed communication to the other agents.
The key problem in multi-agent active perception is to determine how each agent should act during the decentralized phase to maximize the informativeness of the collected observations, evaluated afterwards during the centralized inference phase. The problem can be formalized as a decentralized partially observable Markov decision process (Dec-POMDP) [3, 15], a general model of sequential multi-agent decision-making under uncertainty. At each time step in a Dec-POMDP, each agent in the team takes an individual action. The next state is determined according to a Markov chain conditional on the current hidden state and all individual actions. Each agent then perceives an individual observation correlated with the next state and the individual actions. The agents should act so as to maximize the expected sum of shared rewards, accumulated at each time step over a finite horizon.
34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.
In the decentralized phase, the per-step reward depends on the hidden state and the individual actions of all agents. In typical Dec-POMDPs, the reward on all time steps is of this form. Active perception problems are modelled by a reward that is a convex function of the team’s joint estimate of the hidden state, for example the negative entropy [11]. This encourages agents to act in ways that lead to joint state estimates with low uncertainty. In analogy to the centralized inference phase, the reward at the final time step can be thought of as a centralized prediction reward where a centralized decision-maker perceives the individual action-observation histories of all agents and determines a reward based on the corresponding joint state estimate. Due to the choice of reward function, algorithms for standard Dec-POMDPs are not applicable to such active perception problems. Despite the pooling of observations after the end of the task, the problem we target is decentralized. We design a strategy for each agent to act independently during the observation gathering phase, without knowing for sure how the others acted or what they perceived. Consequently, the joint state estimate is not available to any agent during the observation gathering phase. Strategies executable in a centralized manner are available only if all-to-all communication during task execution is possible, which we do not assume. As decentralized strategies are a strict subset of centralized strategies, the best decentralized strategy is at most as good as the best centralized strategy [16].
In this paper, we show that the convex centralized prediction reward can be converted to a decentralized prediction reward that is a function of the hidden state and so-called individual prediction actions. This converts the Dec-POMDP with a convex centralized prediction reward into a standard Dec-POMDP with rewards that depend on the hidden state and actions only. This enables solving multi-agent active perception problems without explicit computation of joint state estimates applying any standard Dec-POMDP algorithm. We show that the error induced when converting the centralized prediction reward into a decentralized one, the loss due to decentralization, is bounded. We also give a sufficient condition for when the loss is zero, meaning that the problems with the centralized, respectively decentralized, prediction rewards are equivalent. We prove the empirical usefulness of our results by applying standard Dec-POMDP solution algorithms to active perception problems, demonstrating improved scalability over the state-of-the-art.
The remainder of the paper is organized as follows. We review related work in Section 2, and give preliminary definitions for Dec-POMDPs in Section 3. In Section 4, we introduce our proposed conversion of a centralized prediction reward to a decentralized prediction reward. We propose a method to apply standard Dec-POMDP algorithms to multi-agent active perception in Section 5, and empirically evaluate the method in Section 6. Section 7 concludes the paper.
2 Related work
We briefly review possible formulations of multi-agent active perception problems, and then focus on the Dec-POMDP model that provides the most general formulation.
Multi-agent active perception has been formulated as a distributed constraint optimization problem (DCOP), submodular maximization, or as a specialized variant of a partially observable Markov decision process (POMDP). Probabilistic DCOPs with partial agent knowledge have been applied to signal source localization [9, 26]. DCOPs with Markovian dynamics have been proposed for target tracking by multiple sensors [13]. DCOPs are a simpler model than Dec-POMDPs, as a fixed communication structure is assumed or the noise in the sensing process is not modelled. Submodular maximization approaches assume the agents’ reward can be stated as a submodular set function, and apply distributed greedy maximization to obtain an approximate solution [22, 8, 7]. Along with the structure of the reward function, inter-agent communication is typically assumed. Specialized variants of POMDPs may also be applied. If all-to-all communication without delay during task execution is available, centralized control is possible and the problem can be solved as a multi-agent POMDP [24]. Auctioning of POMDP policies can facilitate multi-agent cooperation when agents can communicate [6]. Best et al. [4] propose a decentralized Monte Carlo tree search planner where agents periodically communicate their open-loop plans to each other.
Multi-agent active perception without implicit communication with uncertainty on state transitions and the agents’ perception may be modelled as a Dec-ρPOMDP [10, 11]. In contrast to standard Dec-POMDPs, the reward function in a Dec-ρPOMDP is a convex function of the joint state estimate, for example the negative entropy. Unfortunately, because of the convex reward, standard Dec-POMDP solution algorithms are not applicable. Furthermore, the heuristic algorithm proposed in [11, 12]
2
requires explicit computation of reachable joint state estimates. Computing and storing these joint state estimates adds significant memory overhead to the already high computational burden of solving Dec-POMDPs [3]. We address both shortcomings, showing that a Dec-ρPOMDP can be converted to a standard Dec-POMDP with linear rewards. This in turn enables applying standard planning algorithms that do not require computation of reachable joint state estimates, leading to improved scalability.
Our work in the decentralized setting draws inspiration from approaches using prediction rewards in single-agent and centralized scenarios. Araya-López et al. [1] proposed to tackle single-agent active perception as a ρPOMDP with a convex reward. The related POMDP with information rewards (POMDP-IR) was proposed in [25]. The POMDP-IR model adds prediction actions that the agent selects in addition to the usual actions. Active perception is facilitated by rewarding the agent for correctly predicting the true underlying state. The equivalence of ρPOMDP and POMDP-IR model was later established [20]. Recently, Satsangi et al. [21] propose a reinforcement learning method to solve ρPOMDPs taking advantage of the equivalence. In this paper we prove an analogous equivalence for Dec-POMDPs, by converting a Dec-ρPOMDP to a standard Dec-POMDP with individual prediction actions and a decentralized prediction reward. Unlike in the POMDP case, the conversion does not always result in a perfect equivalence, but is associated with a loss due to decentralization.
3 Multi-agent active perception as a Dec-POMDP
In this section, we review how active perception problems are modelled as Dec-POMDPs. We also review plan-time sufficient statistics, which allow us to concisely express joint state estimates and value functions of a Dec-POMDP. We concentrate on the practically relevant active perception problem where a prediction reward at the final time step depends on the joint state estimate.
3.1 Decentralized POMDPs
We define a Dec-POMDP where the action and observation spaces along with the reward function are time-dependent, as this will be convenient for our subsequent introduction of prediction actions. Definition 1. A Dec-POMDP is a tuple M = h, I, S, b0, A, Z, T , R , where
• h ∈ N is the time horizon of the problem,
• I = {1, 2, . . . , n} is a set of n agents,
• S is the finite set of states s,
• b0 ∈ ∆(S) is the initial state distribution at time step t = 0,
• A is the collection of individual action spaces Ai,t for each agent i ∈ I and time step t = 0, . . . , h − 1. The tuple at = a1,t, a2,t, . . . , an,t of individual actions is called the joint action at time step t,
• Z is the collection of individual observation spaces Zi,t for each agent i ∈ I and time step t = 1, . . . , h. The tuple zt = z1,t, z2,t, . . . , zn,t of individual observations is called the joint observation at time step t,
• T is the dynamics function specifying conditional probabilities P(zt+1, st+1 | st, at), and
• R is the collection of reward functions Rt(st, at) for time steps t = 0, . . . , h − 1.
An admissible solution of a Dec-POMDP is a decentralized joint policy π, i.e., a tuple π1, . . . , πn where the individual policy πi of each agent i maps individual observation sequences zi,t = (zi,1, . . . , zi,t) to an individual action.1 An individual policy is a sequence πi = (δi,0, . . . , δi,h−1) of individual decision rules that map length-t individual observation sequences to an individual action δi,t(zi,t) = ai,t. A joint decision rule is a tuple δt = δ1,t, . . . , δn,t that maps a length-t joint observation sequence zt = z1,t, . . . , zn,t to a joint action δt(zt) = at. We shall use notation z−i,t to denote the individual observation sequences of all agents except i.
1zi,0 = ∅ as there is no observation at time t = 0.
3
The objective is to find an optimal joint policy π∗ that maximizes the expected sum of rewards,
that is, π∗ = argmaxπ E
h−1 t=0
Rt
(st
,
δt
(zt
))
, where the expectation is with respect to the distri-
bution of states and joint observation sequences induced under π, i.e., P(s0, . . . , sh, zh | b0, π)
b0(s0)
h−1 t=0
T (zt+1,
st+1
|
st,
δt(zt)).
In this paper, we are interested in cooperative active perception. We assume that after the task
terminates at time step h, the joint observation sequence zh is used to compute the conditional
distribution bh ∈ ∆(S) over the final state sh, that is, bh(sh) P(sh | zh, b0, π). We seek policies that, instead of only maximizing the expected sum of rewards Rt(st, at), also maximize the informativeness of bh.
The Dec-ρPOMDP [11] models active perception by maximizing a convex function of bh, e.g., the negative entropy. This convex function may be though of as a centralized pre-
diction reward that is independent of the agents’ individ-
ual actions. Conceptually, the centralized prediction reward
is determined by a virtual centralized decision-maker that
perceives the agents’ individual action and observation sequences at time h, computes bh, and then determines the final reward. We approximate the centralized prediction re-
ward using a piecewise linear and convex function as illus-
trated in Figure 1. Consider a bounded, convex, and differentiable function f : ∆(S) → R. The tangent hyperplane α ∈ R|S| of f at b ∈ ∆(S) is α = ∇f (b) − f ∗(∇f (b)), where ∇ denotes the gradient and f ∗ is the convex conjugate of f [5]. We select a finite set of linearization points bj ∈ ∆(S), and define Γ as the set of corresponding tangent hyperplanes αj.2 Then, we obtain a lower approximation as f (b) ≥ maxα∈Γ s b(s)α(s). This approximation is also used in ρPOMDPs, and has a bounded error [1]. The approxi-
mation is also used in [21], where error bounds for cases such as 0-1 rewards are provided. The Dec-ρPOMDP problem we
consider is as follows.
f (b)
b, α1
0
b, α2
b, α3
−0.2
−0.4
−0.6
−0.8
−1 0 0.2 0.4 0.6 0.8 1 b
Figure 1: A convex function f is approximated by maxi b, αi where αi are tangent hyperplanes and ·, · is
the inner product. The horizontal
axis depicts b(s) in a system with two states s and s , where b(s ) = 1 − b(s). A prediction action corresponds to selecting a single αi.
Definition 2 (Dec-ρPOMDP). A Dec-ρPOMDP is a pair M, Γ , where M is a Dec-POMDP and Γ is a set of tangent hyperplanes that determine the centralized prediction reward ρ : ∆(S) → R
defined as ρ(b) max b(s)α(s).
α∈Γ s
The standard Dec-POMDP is the special case where Γ contains a single all-zero element. The
objective in the Dec-ρPOMDP is to find an optimal joint policy that maximizes the expected sum of
rewards and the centralized prediction reward, i.e., E
h−1 t=0
Rt
(st
,
δt
(zt
))
+
ρ(bh
)
.
3.2 Sufficient plan-time statistics and the optimal value function
Sufficient plan-time statistics are probability distributions over states and joint observation sequences
given the past joint policy followed by the agents [14]. A past joint policy at time t is a joint policy
specified until time t, denoted ϕt = ϕ1,t, . . . , ϕn,t , where each individual past policy is a sequence of individual decision rules: ϕi,t = (δi,0, . . . , δi,t−1). The sufficient plan-time statistic for initial state distribution b0 and a past joint policy ϕt is defined as
σt(st, zt) P(st, zt | b0, ϕt),
(1)
and at the starting time, σ0(s0) b0(s0). The conditional σt(· | zt) is the state distribution after perceiving the joint observation sequence zt when executing policy ϕt with initial state distribution b0. The marginal σt(zt) is the probability of the joint observation sequence zt given ϕt and b0.
Sufficient plan-time statistics are updateable to any extension of the past joint policy ϕt. The extension of an individual past policy ϕi,t by an individual decision rule δi,t is defined ϕi,t+1 = ϕi,t ◦ δi,t
2We defer discussion on how we select the linearization points to Section 5.
4
(δi,0, . . . , δi,t−1, δi,t). The extension of a past joint policy ϕt by δt = δ1,t, . . . , δn,t is defined
ϕt+1 = ϕt ◦ δt ϕ1,t+1, . . . , ϕn,t+1 . Let zt+1 = (zt, zt+1) be a joint observation sequence that extends zt with zt+1. Given the statistic σt for ϕt and the next joint decision rule δt, the updated statistic for ϕt+1 = ϕt ◦ δt is σt+1 = Uss(σt, δt), where the update operator Uss is defined as
[Uss(σt, δt)](st+1, zt+1)
T (zt+1, st+1 | st, δt(zt))σt(st, zt).
(2)
st
The statistic σt is sufficient to predict the expected immediate reward R̂t(σt, δt) of choosing the next joint decision rule δt:

R̂t(σt, δt) = Σ_{zt,st} σt(st, zt) Rt(st, δt(zt)).   (3)
Furthermore, σh is sufficient to predict the expected centralized prediction reward ρ̂(σh):

ρ̂(σh) = Σ_{zh} σh(zh) max_{α∈Γ} Σ_{sh} σh(sh | zh) α(sh).   (4)
In contrast to the reward in earlier stages, ρˆ does not depend on decision rules at all. We interpret ρˆ as the expectation of the prediction reward of a centralized decision-maker who selects the best α ∈ Γ after perceiving the full history of joint actions and observations at task termination.
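The expectation in Eq. (4) groups the statistic by joint observation sequence, normalizes to obtain the conditional state estimate, and takes the best α-vector per sequence. A minimal sketch, using the same dictionary representation of σh as above and α-vectors indexed by the position of each state in a list `states` (an assumption of the sketch):

def expected_centralized_reward(sigma_h, Gamma, states):
    # rho_hat(sigma_h) = sum_z sigma_h(z) max_{alpha in Gamma} sum_s sigma_h(s | z) alpha(s)
    by_z = {}
    for (s, z_seq), p in sigma_h.items():
        by_z.setdefault(z_seq, dict.fromkeys(states, 0.0))[s] += p
    value = 0.0
    for z_seq, weights in by_z.items():
        p_z = sum(weights.values())                   # marginal sigma_h(z)
        if p_z == 0.0:
            continue
        belief = [weights[s] / p_z for s in states]   # conditional sigma_h(s | z)
        best = max(sum(b * alpha[k] for k, b in enumerate(belief)) for alpha in Gamma)
        value += p_z * best
    return value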
Let π be a joint policy consisting of the joint decision rules δ0, δ1, . . . , δh−1. The value-to-go of π starting from a sufficient plan-time statistic σt at time step t is
Q^π_t(σt, δt) =
  R̂t(σt, δt) + ρ̂(σt+1)                      if t = h − 1,
  R̂t(σt, δt) + Q^π_{t+1}(σt+1, δt+1)        if 0 ≤ t < h − 1,   (5)

with the shorthand σt+1 = Uss(σt, δt). The value function V^π of a joint policy π is defined as the sum of expected rewards when the agents act according to π, V^π(σ0) ≜ Q^π_0(σ0, δ0). The value function of an optimal policy π* is denoted V*, and it satisfies V*(σ0) ≥ V^π(σ0) for all π. We write V^π_M and V^π_{M,Γ} for the value function in a standard Dec-POMDP and a Dec-ρPOMDP, respectively.
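Putting Eqs. (2)-(5) together, the value of a joint policy can be evaluated by rolling the statistic forward and adding the expected centralized prediction reward at the horizon. The sketch below reuses the illustrative helpers above and assumes, for simplicity, a time-independent reward R(s, a).

def evaluate_policy(sigma_0, decision_rules, R, T, states, joint_observations, Gamma):
    # V^pi(sigma_0) = sum_t R_hat_t(sigma_t, delta_t) + rho_hat(sigma_h), Eqs. (3)-(5)
    sigma_t, value = sigma_0, 0.0
    for delta_t in decision_rules:                    # delta_0, ..., delta_{h-1}
        value += sum(p * R(s, delta_t(z_seq))         # expected immediate reward, Eq. (3)
                     for (s, z_seq), p in sigma_t.items())
        sigma_t = update_statistic(sigma_t, delta_t, T, states, joint_observations)
    return value + expected_centralized_reward(sigma_t, Gamma, states)   # Eq. (4)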
4 Conversion to standard Dec-POMDP
In this section, we show that any Dec-ρPOMDP can be converted to a standard Dec-POMDP by adding individual prediction actions for each agent, and by introducing a decentralized prediction reward. Each individual prediction action corresponds to a selection of a tangent hyperplane of the centralized prediction reward, as illustrated in Figure 1. As the individual prediction actions are chosen in a decentralized manner, the decentralized prediction reward never exceeds the centralized prediction reward. Consequently, we prove that an optimal policy of the standard Dec-POMDP may be applied to the Dec-ρPOMDP with bounded error compared to the true optimal policy. We give a sufficient condition for when this loss due to decentralization is zero and a Dec-ρPOMDP is equivalent to a standard Dec-POMDP.
A major implication of our results is that it is possible to apply any Dec-POMDP solver to a Dec-ρPOMDP problem. We approximate a centralized prediction reward by a decentralized prediction reward that only depends on states and actions. This allows planning for active perception problems without explicit computation of joint state estimates.
We first introduce our proposed conversion and prove its properties, including the error bound. We conclude the section by giving a sufficient condition for when the two problems are equivalent. All omitted proofs are found in the supplementary material.
Definition 3. Given a Dec-ρPOMDP ⟨M, Γ⟩ with M = ⟨h, I, S, b0, A, Z, T, R⟩, convert it to a standard Dec-POMDP M+ = ⟨h + 1, I, S, b0, A+, Z+, T+, R+⟩, where the horizon is incremented by one and
• in A+, the individual action space Ai,h for each agent i ∈ I at time h is a set of individual prediction actions ai,h, with one individual prediction action for each tangent hyperplane α ∈ Γ; for other time steps Ai,t are as in A,
• in Z+, the individual observation space Zi,h+1 for each agent i ∈ I at time h + 1 contains a single null observation; for other time steps Zi,t are as in Z,
• T+(zh+1, sh+1 | sh, ah) assigns probability one to the joint null observation and sh+1 = sh, and is zero otherwise; for other time steps T+ is equal to T, and
• in R+, the reward function Rh at time step h is a linear combination of individual prediction rewards Ri,h of each agent, such that for a joint prediction action ah = ⟨a1,h, . . . , an,h⟩ the decentralized prediction reward is

Rh(sh, ah) = (1/n) Σ_{i=1}^{n} Ri,h(sh, ai,h),   (6)

with Ri,h(sh, ai,h) = α_{ai,h}(sh), where α_{ai,h} ∈ Γ is the tangent hyperplane corresponding to the individual prediction action ai,h; for other time steps Rt are as in R.
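To illustrate the decentralized prediction reward of Eq. (6): each agent's individual prediction action indexes an α-vector in Γ, and the joint reward at a state is the average of the selected vectors evaluated at that state. A minimal sketch, with Γ stored as a list of α-vectors indexed by state position (an assumption):

def decentralized_prediction_reward(s_index, joint_prediction_action, Gamma):
    # R_h(s, a_h) = (1/n) sum_i alpha_{a_i}(s), Eq. (6)
    n = len(joint_prediction_action)
    return sum(Gamma[a_i][s_index] for a_i in joint_prediction_action) / n

If all agents select the same α-vector, the reward equals that single α evaluated at the state, i.e., one facet of the centralized prediction reward.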
As the action and observation sets in M and M+ are the same until t = h, past joint policies are interchangeable, and plan-time sufficient statistics are identical.
Lemma 1. Let M, Γ be a Dec-ρPOMDP and define M+ as above. Then, for any past joint policy ϕt with t ≤ h, the respective plan-time sufficient statistics σt in M, Γ and σt+ in M+ are equivalent, i.e., σt ≡ σt+.
In M+ the horizon is incremented, so that each agent takes one more action than in M, Γ . The
additional action is one of the newly added individual prediction actions at time step h. To select an
individual prediction action, an agent may use 1) individual information, that is, the agent’s individual
observation sequence, and 2) plan-time information common to all agents, that is, the plan-time
sufficient statistic. An individual prediction rule φi maps the individual observation sequence zi,h
and the plan-time sufficient statistic σh to an individual prediction action. A joint prediction rule
φ = ⟨φ1, . . . , φn⟩ is a tuple of individual prediction rules, and maps a joint observation sequence zh and σh to a joint prediction action. The expected decentralized prediction reward given σh and φ is defined analogously to Eq. (3) as R̂h(σh, φ) ≜ Σ_{zh,sh} σh(sh, zh) Rh(sh, φ(zh, σh)).
An optimal joint prediction rule maximizes the expected decentralized prediction reward. We prove
that an optimal joint prediction rule consists of individual prediction rules that maximize the expected
individual prediction reward. We then show that the expected decentralized prediction reward is
at most equal to the centralized prediction reward, and that a similar relation holds between value functions in the respective Dec-POMDP and Dec-ρPOMDP. The key property required is that the
decentralized prediction reward is a sum of individual prediction rewards.
Lemma 2 (Optimal joint prediction rule). Let ⟨M, Γ⟩ be a Dec-ρPOMDP and define M+ as above, and let σh be a plan-time sufficient statistic for any past joint policy. Then the joint prediction rule φ* = ⟨φ*_1, . . . , φ*_n⟩, where each individual prediction rule φ*_i is defined as

φ*_i(zi,h, σh) ∈ argmax_{ai,h ∈ Ai,h} Σ_{z−i,h, sh} σh(sh, z−i,h | zi,h) Ri,h(sh, ai,h),   (7)

maximizes the expected decentralized prediction reward, that is, R̂h(σh, φ*) = max_φ R̂h(σh, φ).
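Eq. (7) can be read as: agent i restricts the plan-time statistic to histories consistent with its own observation sequence and picks the α-vector with the highest expected value; since normalizing by σh(zi,h) does not change the argmax, unnormalized weights suffice. A minimal sketch under the same representational assumptions as above, with a hypothetical helper agent_obs(z_seq, i) that extracts agent i's individual observation sequence:

def individual_prediction_rule(sigma_h, Gamma, states, i, z_i, agent_obs):
    # phi*_i(z_i, sigma_h): argmax over alpha-vectors of the expected individual
    # prediction reward, Eq. (7); normalization by sigma_h(z_i) is omitted as it
    # does not affect the argmax.
    weights = dict.fromkeys(states, 0.0)
    for (s, z_seq), p in sigma_h.items():
        if agent_obs(z_seq, i) == z_i:
            weights[s] += p
    scores = [sum(weights[s] * alpha[k] for k, s in enumerate(states)) for alpha in Gamma]
    return max(range(len(Gamma)), key=lambda a: scores[a])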
In the Dec-ρPOMDP M, Γ , the centralized prediction reward is determined by a centralized virtual agent that has access to the observation histories of all agents. In our converted standard Dec-POMDP M+, each agent individually takes a prediction action, leading to the decentralized prediction reward. The decentralized prediction reward never exceeds the centralized prediction reward as we show next.
Lemma 3. The expected decentralized prediction reward Rˆh(σh, φ∗) in M+ is at most equal to the expected centralized prediction reward ρˆ(σh) in M, Γ , i.e., Rˆh(σh, φ∗) ≤ ρˆ(σh).
Lemma 4. Let ⟨M, Γ⟩ and M+ be as defined above. Let ϕh be a past joint policy for M+, and let φ* be the optimal joint prediction rule. Then, the value of ϕh ◦ φ* in M+ is at most equal to the value of ϕh in ⟨M, Γ⟩, i.e., V^{ϕh◦φ*}_{M+}(σ0) ≤ V^{ϕh}_{M,Γ}(σ0).
We now show that the difference between the optimal values of the standard Dec-POMDP M+ and the Dec-ρPOMDP ⟨M, Γ⟩ is bounded. We call this error the loss due to decentralization.
Theorem 1 (Loss due to decentralization). Consider a Dec-ρPOMDP ⟨M, Γ⟩ with the optimal value function V*_{M,Γ}. Let π be an optimal policy for the standard Dec-POMDP M+ created as in Definition 3, and denote by ϕh the past joint policy consisting of the first h decision rules of π. Then the difference of V*_{M,Γ} and the value function V^{ϕh}_{M,Γ} of applying ϕh to ⟨M, Γ⟩ is bounded by

|V*_{M,Γ}(σ0) − V^{ϕh}_{M,Γ}(σ0)| ≤ 2 max_{σh} |ρ̂(σh) − R̂h(σh, φ*)|,   (8)

where φ* is the optimal joint prediction rule.
Proof. For clarity, in the following we omit the argument σ0 from the value functions. Suppose the optimal policy π* in ⟨M, Γ⟩ consists of the joint decision rules δ*_0, . . . , δ*_{h−1}. For M+, define the partial joint policy ϕ*_h = (δ*_0, . . . , δ*_{h−1}) and its extension ϕ*_h ◦ φ*. By Lemma 2, an optimal joint policy π of M+ is an extension of some past joint policy ϕh by φ*, i.e., π = ϕh ◦ φ*. Because ϕh ◦ φ* is optimal in M+, we have that V^{ϕ*_h◦φ*}_{M+} ≤ V^{ϕh◦φ*}_{M+}. By Lemma 4, V^{ϕh◦φ*}_{M+} ≤ V^{ϕh}_{M,Γ}. Finally, because π* is optimal in ⟨M, Γ⟩, we have V^{ϕh}_{M,Γ} ≤ V*_{M,Γ}. Now,

|V*_{M,Γ} − V^{ϕh}_{M,Γ}| ≤ |V*_{M,Γ} − V^{ϕ*_h◦φ*}_{M+}| + |V^{ϕ*_h◦φ*}_{M+} − V^{ϕh}_{M,Γ}|   (9a)
≤ |V*_{M,Γ} − V^{ϕ*_h◦φ*}_{M+}| + |V^{ϕ*_h◦φ*}_{M+} − V*_{M,Γ}| = 2 · |V*_{M,Γ} − V^{ϕ*_h◦φ*}_{M+}|   (9b)
= 2 · |Σ_{t=0}^{h−1} R̂t(σ*_t, δ*_t) + ρ̂(σ*_h) − (Σ_{t=0}^{h−1} R̂t(σ*_t, δ*_t) + R̂h(σ*_h, φ*))|   (9c)
= 2 · |ρ̂(σ*_h) − R̂h(σ*_h, φ*)| ≤ 2 · max_{σh} |ρ̂(σh) − R̂h(σh, φ*)|.   (9d)

We first apply the triangle inequality. The second inequality holds as V^{ϕ*_h◦φ*}_{M+} ≤ V^{ϕh}_{M,Γ} ≤ V*_{M,Γ}. The next equality is by symmetry of the absolute difference. The third line follows from the definition of the value function and Lemma 1. Note that we refer by σ*_t to the sufficient plan-time statistics reached under partial joint policies ϕ*_t of π*. The final inequality follows by maximizing the difference of the expected centralized and decentralized prediction rewards.
The theorem shows that given a Dec-ρPOMDP M, Γ , we may solve the standard Dec-POMDP M+ and apply its optimal policy to the Dec-ρPOMDP with bounded error. If there is only a single agent, that agent is equivalent to the conceptualized centralized agent, which implies ρˆ ≡ Rˆh and that the loss is zero. The equivalence of a ρPOMDP with a convex prediction reward and a POMDP with information rewards first shown in [20] is therefore obtained as a special case.
The proof indicates that the error is at most twice the difference of the expected centralized prediction reward and the expected decentralized prediction reward at the sufficient plan-time statistic σh∗ at time h under an optimal policy π∗ of M, Γ . This suggests that the final bound given as maximum over all sufficient plan-time statistics may be overly pessimistic for some cases. While a complete characterization of such cases is beyond the scope of this paper, we give below a sufficient condition for when the error is zero in the multi-agent case. The proof is in the supplementary material.
Observation 1. Consider the setting of Theorem 1. Assume that the observation sequence of each agent is conditionally independent of the observation sequences of all other agents given the past joint policy and initial state distribution, i.e., for every agent i, σh(zh) = σh(zi,h)σh(z−i,h). Then π∗ is an optimal joint policy for M, Γ if and only if π∗ ◦ φ∗ is an optimal joint policy for M+.
The sufficient condition above is restrictive. Informally, it requires that each agent executes its own independent active perception task. However, the loss due to decentralization may be small if a multi-agent active perception task almost satisfies the requirement. It might be possible to derive less restrictive sufficient conditions in weakly-coupled problems by building on influence-based abstractions [17]. Such weakly-coupled cases may arise, e.g., in distributed settings where two agents deployed in different regions far away from each other have limited influence on each other.
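In practice, the per-statistic gap ρ̂(σh) − R̂h(σh, φ*) that appears in the bound can be computed from the pieces sketched in Sections 3 and 4, which gives a direct check of how far a given problem is from the zero-loss case. A sketch reusing those illustrative helpers:

def decentralization_gap(sigma_h, Gamma, states, n_agents, agent_obs):
    # rho_hat(sigma_h) - R_hat_h(sigma_h, phi*); non-negative by Lemma 3, and
    # zero exactly when decentralized prediction loses nothing at sigma_h.
    centralized = expected_centralized_reward(sigma_h, Gamma, states)
    decentralized = 0.0
    for (s, z_seq), p in sigma_h.items():
        joint_action = tuple(
            individual_prediction_rule(sigma_h, Gamma, states, i, agent_obs(z_seq, i), agent_obs)
            for i in range(n_agents)
        )
        decentralized += p * decentralized_prediction_reward(states.index(s), joint_action, Gamma)
    return centralized - decentralized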
5 Adaptive prediction action search
Recall that the centralized prediction reward function ρ is obtained by approximating a continuous convex function f by a set of α-vectors. The approximation is more accurate the more α-vectors are used; for error bounds, see [1, 21]. We propose a planning algorithm for Dec-ρPOMDPs called Adaptive Prediction Action Search (APAS) that dynamically updates the α-vectors during planning. This avoids the need to create a large number of α-vectors a priori, some of which may turn out not to be useful. Moreover, APAS allows application of any standard Dec-POMDP algorithm to solve Dec-ρPOMDPs by using the conversion proposed in Section 4. This can avoid the computation of all reachable state estimates required by earlier algorithms for Dec-ρPOMDPs [11, 12].

Algorithm 1 Adaptive prediction action search (APAS) for Dec-ρPOMDP planning
Input: Dec-ρPOMDP ⟨M, Γ⟩, convex function f : ∆(S) → R, number of linearization points K
Output: Best joint policy found, πbest
1: Vbest ← −∞, πbest ← ∅
2: repeat
3:   // Policy optimization phase
4:   M+ ← CONVERTDECPOMDP(M, Γ)    ▷ Apply Definition 3
5:   π ← PLAN(M+)    ▷ Use any Dec-POMDP planner
6:   V ← EVALUATE(π)
7:   if V > Vbest then Vbest ← V, πbest ← π
8:   // Adaptation phase
9:   Γ ← ∅
10:  for k = 1, . . . , K do
11:    zh ∼ σh(zh)    ▷ Sample joint observation sequence zh using πbest
12:    bk ← σh(· | zh)    ▷ Final state estimate corresponding to zh
13:    αk ← ∇f(bk) − f*(∇f(bk))    ▷ Tangent hyperplane of f at bk
14:    Γ ← Γ ∪ {αk}
15:  end for
16: until converged
17: return πbest
The pseudocode for APAS is shown in Algorithm 1. APAS consists of two phases that are repeated until convergence: the policy optimization phase and the α-vector adaptation phase. In the policy optimization phase, the best joint policy for the current set Γ of α-vectors is found. On the first iteration we initialize Γ randomly as explained in the next section. In the adaptation phase, the α-vectors are then modified such that they best approximate the final reward for joint state estimates that are most likely reached by the current joint policy. In the optimization phase the Dec-ρPOMDP is converted to a standard Dec-POMDP as described in Section 4. We then apply any standard Dec-POMDP algorithm to plan a joint policy π for the converted Dec-POMDP. If the value of π exceeds the value of the best policy so far, πbest, the best policy is updated. In the adaptation phase, we sample a final state distribution bk under the currently best policy, and insert the tangent hyperplane at bk into Γ. This corresponds to sampling a point on the horizontal axis in Figure 1 and adding the corresponding tangent hyperplane to Γ. This is repeated K times. Specifically, we simulate a trajectory under the current best policy πbest to sample a joint observation sequence (Line 11), use Bayesian filtering to compute the corresponding state estimate bk (Line 12), and compute the corresponding α-vector (Line 13). The updated set Γ of α-vectors provides the best approximation of the final reward for the sampled state estimates. The time complexity of APAS is determined by the complexity of the Dec-POMDP planner called in PLAN. A reference implementation is available at https://github.com/laurimi/multiagent-prediction-reward.
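The adaptation phase (Lines 9-15) is straightforward to sketch for the negative-entropy reward used in Section 6: sample final beliefs under the current best policy and take tangents there. The helper sample_final_belief, which rolls out πbest, samples a joint observation sequence, and returns the filtered final estimate σh(· | zh), is hypothetical.

import numpy as np

def adapt_alpha_vectors(pi_best, sample_final_belief, K, eps=1e-12):
    # Lines 9-15 of Algorithm 1 for f(b) = sum_s b(s) ln b(s).
    Gamma = []
    for _ in range(K):
        b_k = sample_final_belief(pi_best)        # Lines 11-12: z_h ~ sigma_h, b_k = sigma_h(. | z_h)
        alpha_k = np.log(np.clip(b_k, eps, 1.0))  # Line 13: tangent of the negative entropy at b_k
        Gamma.append(alpha_k)                     # Line 14: insert into the new set of alpha-vectors
    return Gamma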
6 Experiments
The Dec-ρPOMDP we target is computationally more challenging (NEXP-complete [3]) than a centralized POMDP, POMDP-IR, or ρPOMDP (PSPACE-complete [19]). Immediate all-to-all communication during task execution would be required to solve the problem we target as a centralized problem, and the resulting optimal value would be higher [16]. Therefore, a comparison to centralized methods is neither fair nor necessary. We compare APAS to the NPGI algorithm of [11] that solves a Dec-ρPOMDP by iterative improvement of a fixed-size policy graph. As the PLAN subroutine of APAS, we use the finite-horizon variant of the policy graph improvement method of [18]. This method is algorithmically similar to NPGI, which helps isolate the effect of applying the proposed conversion to a standard Dec-POMDP from the effect of algorithmic design choices to the extent possible.
Table 1: Average policy values ± standard error in the MAV and rovers domains.

MAV
h   APAS (ours)      APAS (no adapt.)   NPGI [12]   Optimal
2   -2.006 ± 0.005   -2.112 ± 0.002     -1.931      -1.919
3   -1.936 ± 0.004   -2.002 ± 0.002     -1.833      -1.831
4   -1.879 ± 0.004   -1.943 ± 0.002     -1.768      —
5   -1.842 ± 0.005   -1.918 ± 0.002     -1.725      —
6   -1.814 ± 0.004   -1.898 ± 0.002     —           —
7   -1.789 ± 0.004   -1.892 ± 0.003     —           —
8   -1.820 ± 0.005   -1.885 ± 0.006     —           —

Rovers
h   APAS (ours)      APAS (no adapt.)   NPGI [12]   Optimal
2   -3.484 ± 0.002   -3.800 ± 0.006     -3.495      -3.479
3   -3.402 ± 0.008   -3.680 ± 0.006     -3.192      -3.189
4   -3.367 ± 0.011   -3.640 ± 0.006     -3.036      —
5   -3.293 ± 0.012   -3.591 ± 0.006     -2.981      —
6   -3.333 ± 0.012   -3.631 ± 0.007     —           —
7   -3.375 ± 0.014   -3.673 ± 0.008     —           —
8   -3.496 ± 0.014   -3.860 ± 0.010     —           —
Any other Dec-POMDP algorithm may also be used with APAS. Details on the algorithms and parameter settings are provided in the supplementary material.
We evaluate on the Dec-ρPOMDP domains from [11]: the micro air vehicle (MAV) domain and the information gathering rovers domain. In the MAV domain, two agents cooperatively track a moving target to determine if the target is friendly or hostile. In the rovers domain, two agents collect samples of scientific data in a grid environment. The reward at the final time step is the negative entropy of the joint state estimate bh, that is, f(bh) = Σ_{sh} bh(sh) ln bh(sh). For the MAV problem, we use K = 2, and for the rovers problem, K = 5 individual prediction actions. For APAS, we initialize Γ by randomly sampling linearization points bk ∈ ∆(S), k = 1, . . . , K using [23]. The corresponding α-vectors for the negative entropy are αk(s) = ln bk(s). We only observed minor benefits from using a greater number of α-vectors, and only for long horizons; details are reported in the supplementary material. We run 100 repetitions using APAS and report the average policy value and its standard error. Results for NPGI are as reported in [12]. To ensure comparability, all reported policy values are computed using f as the final reward. As a baseline, we report the optimal value if known.
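For reference, the initialization of Γ can be sketched as follows; we draw the linearization points from a symmetric Dirichlet(1, . . . , 1), which samples ∆(S) uniformly, and we assume (without having verified) that this matches the procedure of [23].

import numpy as np

def initial_alpha_vectors(num_states, K, rng=None, eps=1e-12):
    # K random linearization points b_k in Delta(S) and their tangents
    # alpha_k(s) = ln b_k(s) for the negative-entropy final reward.
    rng = np.random.default_rng() if rng is None else rng
    points = rng.dirichlet(np.ones(num_states), size=K)
    return [np.log(np.clip(b_k, eps, 1.0)) for b_k in points]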
Comparison to the state-of-the-art. The results are shown in Table 1. While APAS does not exceed state-of-the-art performance in terms of policy quality, the major advantage of APAS is that it is able to scale to greater horizons than NPGI. The number of joint state distributions reachable under the current policy grows exponentially in the horizon h. Computation of these state distributions is required by NPGI, causing it to run out of memory with h > 5.
While running APAS, we compute the value of the policies exactly (Alg. 1, Line 6). For long horizons, exact evaluation requires a significant fraction of the computation budget. We expect that further scaling in terms of planning horizon is possible by switching to approximate evaluation, at the cost of noisy value estimates. Results on computation time as well as results for horizons up to 10 for the Rovers problem are reported in the supplementary material.
Benefit of adaptive prediction action selection. The adaptation phase of APAS is disabled by skipping Lines 9-15. Instead, at the start of each iteration of the main loop we randomly sample K linearization points using [23] to create Γ and solve the corresponding Dec-POMDP. We repeat this procedure 1000 times and report the results in the column “APAS (no adapt.)” of Table 1. We conclude that the α-vector adaptation clearly improves performance in both domains.
7 Conclusion
We showed that multi-agent active perception modelled as a Dec-ρPOMDP can be reduced to a standard Dec-POMDP by introducing individual prediction actions. The difference between the optimal solution of the standard Dec-POMDP and the Dec-ρPOMDP is bounded. Our reduction enables application of any standard Dec-POMDP solver to multi-agent active perception problems, as demonstrated by our proposed APAS algorithm.
Our results allow transferring advances in scalability for standard Dec-POMDPs to multi-agent active perception tasks. In multi-agent reinforcement learning, rewards typically depend on the underlying state of the system. Therefore, our reduction result also enables further investigation into learning for multi-agent active perception. An investigation of the necessary conditions for when the loss due to decentralization is zero is another future direction.
Broader Impact
This work is theoretical in nature, and therefore does not present any foreseeable societal consequence.
Acknowledgments
This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No. 758824 —INFLUENCE).
References
[1] Mauricio Araya-López, Olivier Buffet, Vincent Thomas, and Francois Charpillet. A POMDP Extension with Belief-dependent Rewards. In Advances in Neural Information Processing Systems, pages 64–72, 2010.
[2] Ruzena Bajcsy, Yiannis Aloimonos, and John K Tsotsos. Revisiting active perception. Autonomous Robots, 42(2):177–196, 2018.
[3] Daniel S Bernstein, Robert Givan, Neil Immerman, and Shlomo Zilberstein. The complexity of decentralized control of Markov decision processes. Mathematics of Operations Research, 27(4):819–840, 2002.
[4] Graeme Best, Oliver M Cliff, Timothy Patten, Ramgopal R Mettu, and Robert Fitch. Dec-MCTS: Decentralized planning for multi-robot active perception. The International Journal of Robotics Research, 38(2-3):316–337, 2019.
[5] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[6] Jesus Capitan, Matthijs T.J. Spaan, Luis Merino, and Anibal Ollero. Decentralized multi-robot cooperation with auctioned POMDPs. The International Journal of Robotics Research, 32(6):650–671, 2013.
[7] Micah Corah and Nathan Michael. Distributed matroid-constrained submodular maximization for multirobot exploration: Theory and practice. Autonomous Robots, 43(2):485–501, 2019.
[8] Bahman Gharesifard and Stephen L Smith. Distributed submodular maximization with limited information. IEEE Transactions on Control of Network Systems, 5(4):1635–1645, 2017.
[9] Manish Jain, Matthew Taylor, Milind Tambe, and Makoto Yokoo. DCOPs meet the real world: Exploring unknown reward matrices with applications to mobile sensor networks. In Intl. Joint Conference on Artificial Intelligence (IJCAI), pages 181–186, 2009.
[10] Mikko Lauri, Eero Heinänen, and Simone Frintrop. Multi-robot active information gathering with periodic communication. In IEEE Intl. Conf. on Robotics and Automation (ICRA), pages 851–856, 2017.
[11] Mikko Lauri, Joni Pajarinen, and Jan Peters. Information Gathering in Decentralized POMDPs by Policy Graph Improvement. In Intl. Conf. on Autonomous Agents and Multiagent Systems (AAMAS), pages 1143–1151, 2019.
[12] Mikko Lauri, Joni Pajarinen, and Jan Peters. Multi-agent active information gathering in discrete and continuous-state decentralized POMDPs by policy graph improvement. Autonomous Agents and Multi-Agent Systems, 34(42):1–44, 2020. Issue no. 2.
[13] Duc Thien Nguyen, William Yeoh, Hoong Chuin Lau, Shlomo Zilberstein, and Chongjie Zhang. Decentralized multi-agent reinforcement learning in average-reward dynamic DCOPs. In AAAI Conference on Artificial Intelligence, pages 1447–1455, 2014.
[14] Frans A. Oliehoek. Sufficient plan-time statistics for decentralized POMDPs. In Intl. Joint Conference on Artificial Intelligence (IJCAI), pages 302–308, 2013.
[15] Frans A. Oliehoek and Christopher Amato. A Concise Introduction to Decentralized POMDPs. Springer, 2016.
[16] Frans A. Oliehoek, Matthijs TJ Spaan, and Nikos Vlassis. Optimal and approximate Q-value functions for decentralized POMDPs. Journal of Artificial Intelligence Research, 32:289–353, 2008.