Multiagent active perception with prediction rewards
Mikko Lauri Department of Computer Science
Universität Hamburg Hamburg, Germany [email protected]
Frans A. Oliehoek Department of Computer Science
TU Delft Delft, the Netherlands [email protected]
Abstract
Multiagent active perception is a task where a team of agents cooperatively gathers observations to compute a joint estimate of a hidden variable. The task is decentralized and the joint estimate can only be computed after the task ends by fusing observations of all agents. The objective is to maximize the accuracy of the estimate. The accuracy is quantified by a centralized prediction reward determined by a centralized decision-maker who perceives the observations gathered by all agents after the task ends. In this paper, we model multiagent active perception as a decentralized partially observable Markov decision process (DecPOMDP) with a convex centralized prediction reward. We prove that by introducing individual prediction actions for each agent, the problem is converted into a standard DecPOMDP with a decentralized prediction reward. The loss due to decentralization is bounded, and we give a sufficient condition for when it is zero. Our results allow application of any DecPOMDP solution algorithm to multiagent active perception problems, and enable planning to reduce uncertainty without explicit computation of joint estimates. We demonstrate the empirical usefulness of our results by applying a standard DecPOMDP algorithm to multiagent active perception problems, showing increased scalability in the planning horizon.
1 Introduction
Active perception, collecting observations to reduce uncertainty about a hidden variable, is one of the fundamental capabilities of an intelligent agent [2]. In multiagent active perception a team of autonomous agents cooperatively gathers observations to infer the value of a hidden variable. Application domains include search and rescue robotics, sensor networks, and distributed hypothesis testing. A multiagent active perception task often has a finite duration: after observations have been gathered, they are collected to a central database for inference. While the inference phase is centralized, the observation gathering phase is decentralized: each agent acts independently, without knowing the observations collected by the other agents and without guaranteed communication with them.
The key problem in multiagent active perception is to determine how each agent should act during the decentralized phase to maximize the informativeness of the collected observations, evaluated afterwards during the centralized inference phase. The problem can be formalized as a decentralized partially observable Markov decision process (DecPOMDP) [3, 15], a general model of sequential multiagent decision-making under uncertainty. At each time step in a DecPOMDP, each agent in the team takes an individual action. The next state is determined according to a Markov chain conditional on the current hidden state and all individual actions. Each agent then perceives an individual observation correlated with the next state and the individual actions. The agents should act so as to maximize the expected sum of shared rewards, accumulated at each time step over a finite horizon.
34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.
In the decentralized phase, the per-step reward depends on the hidden state and the individual actions of all agents. In typical DecPOMDPs, the reward on all time steps is of this form. Active perception problems are modelled by a reward that is a convex function of the team's joint estimate of the hidden state, for example the negative entropy [11]. This encourages agents to act in ways that lead to joint state estimates with low uncertainty. In analogy to the centralized inference phase, the reward at the final time step can be thought of as a centralized prediction reward, where a centralized decision-maker perceives the individual action-observation histories of all agents and determines a reward based on the corresponding joint state estimate. Due to this choice of reward function, algorithms for standard DecPOMDPs are not applicable to such active perception problems. Despite the pooling of observations after the end of the task, the problem we target is decentralized. We design a strategy for each agent to act independently during the observation gathering phase, without knowing for sure how the others acted or what they perceived. Consequently, the joint state estimate is not available to any agent during the observation gathering phase. Strategies executable in a centralized manner are available only if all-to-all communication during task execution is possible, which we do not assume. As decentralized strategies are a strict subset of centralized strategies, the best decentralized strategy is at most as good as the best centralized strategy [16].
In this paper, we show that the convex centralized prediction reward can be converted to a decentralized prediction reward that is a function of the hidden state and so-called individual prediction actions. This converts the DecPOMDP with a convex centralized prediction reward into a standard DecPOMDP with rewards that depend on the hidden state and actions only. This enables solving multiagent active perception problems with any standard DecPOMDP algorithm, without explicit computation of joint state estimates. We show that the error induced when converting the centralized prediction reward into a decentralized one, the loss due to decentralization, is bounded. We also give a sufficient condition for when the loss is zero, meaning that the problems with the centralized, respectively decentralized, prediction rewards are equivalent. We demonstrate the empirical usefulness of our results by applying standard DecPOMDP solution algorithms to active perception problems, showing improved scalability over the state-of-the-art.
The remainder of the paper is organized as follows. We review related work in Section 2, and give preliminary deﬁnitions for DecPOMDPs in Section 3. In Section 4, we introduce our proposed conversion of a centralized prediction reward to a decentralized prediction reward. We propose a method to apply standard DecPOMDP algorithms to multiagent active perception in Section 5, and empirically evaluate the method in Section 6. Section 7 concludes the paper.
2 Related work
We brieﬂy review possible formulations of multiagent active perception problems, and then focus on the DecPOMDP model that provides the most general formulation.
Multiagent active perception has been formulated as a distributed constraint optimization problem (DCOP), submodular maximization, or as a specialized variant of a partially observable Markov decision process (POMDP). Probabilistic DCOPs with partial agent knowledge have been applied to signal source localization [9, 26]. DCOPs with Markovian dynamics have been proposed for target tracking by multiple sensors [13]. DCOPs are a simpler model than DecPOMDPs, as a fixed communication structure is assumed or the noise in the sensing process is not modelled. Submodular maximization approaches assume the agents' reward can be stated as a submodular set function, and apply distributed greedy maximization to obtain an approximate solution [22, 8, 7]. Along with the structure of the reward function, inter-agent communication is typically assumed. Specialized variants of POMDPs may also be applied. If all-to-all communication without delay during task execution is available, centralized control is possible and the problem can be solved as a multiagent POMDP [24]. Auctioning of POMDP policies can facilitate multiagent cooperation when agents can communicate [6]. Best et al. [4] propose a decentralized Monte Carlo tree search planner where agents periodically communicate their open-loop plans to each other.
Multiagent active perception without implicit communication, and with uncertainty on state transitions and the agents' perception, may be modelled as a DecρPOMDP [10, 11]. In contrast to standard DecPOMDPs, the reward function in a DecρPOMDP is a convex function of the joint state estimate, for example the negative entropy. Unfortunately, because of the convex reward, standard DecPOMDP solution algorithms are not applicable. Furthermore, the heuristic algorithm proposed in [11, 12] requires explicit computation of reachable joint state estimates. Computing and storing these joint state estimates adds significant memory overhead to the already high computational burden of solving DecPOMDPs [3]. We address both shortcomings, showing that a DecρPOMDP can be converted to a standard DecPOMDP with linear rewards. This in turn enables applying standard planning algorithms that do not require computation of reachable joint state estimates, leading to improved scalability.
Our work in the decentralized setting draws inspiration from approaches using prediction rewards in single-agent and centralized scenarios. Araya-López et al. [1] proposed to tackle single-agent active perception as a ρPOMDP with a convex reward. The related POMDP with information rewards (POMDP-IR) was proposed in [25]. The POMDP-IR model adds prediction actions that the agent selects in addition to the usual actions. Active perception is facilitated by rewarding the agent for correctly predicting the true underlying state. The equivalence of the ρPOMDP and POMDP-IR models was later established [20]. Recently, Satsangi et al. [21] proposed a reinforcement learning method to solve ρPOMDPs taking advantage of the equivalence. In this paper we prove an analogous equivalence for DecPOMDPs, by converting a DecρPOMDP to a standard DecPOMDP with individual prediction actions and a decentralized prediction reward. Unlike in the POMDP case, the conversion does not always result in a perfect equivalence, but is associated with a loss due to decentralization.
3 Multiagent active perception as a DecPOMDP
In this section, we review how active perception problems are modelled as DecPOMDPs. We also review plan-time sufficient statistics, which allow us to concisely express joint state estimates and value functions of a DecPOMDP. We concentrate on the practically relevant active perception problem where a prediction reward at the final time step depends on the joint state estimate.
3.1 Decentralized POMDPs
We define a DecPOMDP where the action and observation spaces along with the reward function are time-dependent, as this will be convenient for our subsequent introduction of prediction actions.

Definition 1. A DecPOMDP is a tuple M = ⟨h, I, S, b_0, A, Z, T, R⟩, where
• h ∈ N is the time horizon of the problem,
• I = {1, 2, . . . , n} is a set of n agents,
• S is the ﬁnite set of states s,
• b0 ∈ ∆(S) is the initial state distribution at time step t = 0,
• A is the collection of individual action spaces A_{i,t} for each agent i ∈ I and time step t = 0, ..., h − 1. The tuple a_t = ⟨a_{1,t}, a_{2,t}, ..., a_{n,t}⟩ of individual actions is called the joint action at time step t,
• Z is the collection of individual observation spaces Z_{i,t} for each agent i ∈ I and time step t = 1, ..., h. The tuple z_t = ⟨z_{1,t}, z_{2,t}, ..., z_{n,t}⟩ of individual observations is called the joint observation at time step t,
• T is the dynamics function specifying conditional probabilities P(z_{t+1}, s_{t+1} | s_t, a_t), and
• R is the collection of reward functions R_t(s_t, a_t) for time steps t = 0, ..., h − 1.
An admissible solution of a DecPOMDP is a decentralized joint policy π, i.e., a tuple ⟨π_1, ..., π_n⟩ where the individual policy π_i of each agent i maps individual observation sequences z^t_i = (z_{i,1}, ..., z_{i,t}) to an individual action.¹ An individual policy is a sequence π_i = (δ_{i,0}, ..., δ_{i,h−1}) of individual decision rules that map length-t individual observation sequences to an individual action: δ_{i,t}(z^t_i) = a_{i,t}. A joint decision rule is a tuple δ_t = ⟨δ_{1,t}, ..., δ_{n,t}⟩ that maps a length-t joint observation sequence z^t = ⟨z^t_1, ..., z^t_n⟩ to a joint action δ_t(z^t) = a_t. We shall use the notation z^t_{−i} to denote the individual observation sequences of all agents except i.
¹ z^0_i = ∅ as there is no observation at time t = 0.
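As a small illustration (with hypothetical actions and observations), an individual policy can be stored as one lookup table per time step, keyed by the agent's own observation sequence:

```python
# Hypothetical two-step individual policy for agent i: each individual decision
# rule delta_{i,t} maps the agent's own observation sequence to an action.
delta_i0 = {(): "move"}                               # z_i^0 is empty at t = 0
delta_i1 = {("ping",): "measure", ("none",): "move"}  # one entry per length-1 sequence
pi_i = [delta_i0, delta_i1]                           # pi_i = (delta_{i,0}, delta_{i,1})

def act(pi_i, t, z_seq):
    """Apply decision rule delta_{i,t} to the observation sequence z_i^t."""
    return pi_i[t][tuple(z_seq)]

assert act(pi_i, 0, ()) == "move"
assert act(pi_i, 1, ("ping",)) == "measure"
```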
The objective is to find an optimal joint policy π* that maximizes the expected sum of rewards, that is, π* = argmax_π E[ Σ_{t=0}^{h−1} R_t(s_t, δ_t(z^t)) ], where the expectation is with respect to the distribution of states and joint observation sequences induced under π, i.e., P(s_0, ..., s_h, z^h | b_0, π) ≜ b_0(s_0) Π_{t=0}^{h−1} T(z_{t+1}, s_{t+1} | s_t, δ_t(z^t)).
In this paper, we are interested in cooperative active perception. We assume that after the task terminates at time step h, the joint observation sequence z^h is used to compute the conditional distribution b_h ∈ Δ(S) over the final state s_h, that is, b_h(s_h) ≜ P(s_h | z^h, b_0, π). We seek policies that, instead of only maximizing the expected sum of rewards R_t(s_t, a_t), also maximize the informativeness of b_h.
The DecρPOMDP [11] models active perception by maximizing a convex function of b_h, e.g., the negative entropy. This convex function may be thought of as a centralized prediction reward that is independent of the agents' individual actions. Conceptually, the centralized prediction reward is determined by a virtual centralized decision-maker that perceives the agents' individual action and observation sequences at time h, computes b_h, and then determines the final reward. We approximate the centralized prediction reward using a piecewise linear and convex function, as illustrated in Figure 1. Consider a bounded, convex, and differentiable function f : Δ(S) → R. The tangent hyperplane α ∈ R^S of f at b ∈ Δ(S) is α = ∇f(b) − f*(∇f(b)), where ∇ denotes the gradient and f* is the convex conjugate of f [5]. We select a finite set of linearization points b_j ∈ Δ(S), and define Γ as the set of corresponding tangent hyperplanes α_j.² Then, we obtain a lower approximation as f(b) ≥ max_{α∈Γ} Σ_s b(s)α(s). This approximation is also used in ρPOMDPs, and has a bounded error [1]. The approximation is also used in [21], where error bounds for cases such as 0–1 rewards are provided. The DecρPOMDP problem we consider is as follows.
Figure 1: A convex function f is approximated by max_i ⟨b, α_i⟩, where the α_i are tangent hyperplanes and ⟨·, ·⟩ denotes the inner product. The horizontal axis depicts b(s) in a system with two states s and s′, where b(s′) = 1 − b(s). A prediction action corresponds to selecting a single α_i.
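For example, for the negative entropy f(b) = Σ_s b(s) ln b(s), the tangent hyperplane at a linearization point b evaluates on the simplex as α(s) = ln b(s). A small numerical sketch of the lower approximation, with arbitrarily chosen linearization points:

```python
import math

def neg_entropy(b):
    """f(b) = sum_s b(s) ln b(s); convex on the probability simplex."""
    return sum(p * math.log(p) for p in b if p > 0.0)

def tangent_alpha(b):
    """Tangent hyperplane of the negative entropy at b: alpha(s) = ln b(s)."""
    return [math.log(p) for p in b]

def rho(b, Gamma):
    """Lower approximation f(b) >= max_{alpha in Gamma} <b, alpha>."""
    return max(sum(p * a for p, a in zip(b, alpha)) for alpha in Gamma)

# Arbitrarily chosen linearization points on a two-state simplex.
points = [[0.5, 0.5], [0.1, 0.9], [0.9, 0.1]]
Gamma = [tangent_alpha(bj) for bj in points]

b = [0.3, 0.7]
assert rho(b, Gamma) <= neg_entropy(b) + 1e-9  # lower bound everywhere
# The approximation is tight at each linearization point.
assert abs(rho(points[0], Gamma) - neg_entropy(points[0])) < 1e-9
```

Adding more linearization points tightens the lower bound, which is the basis of the adaptation scheme in Section 5.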
Definition 2 (DecρPOMDP). A DecρPOMDP is a pair ⟨M, Γ⟩, where M is a DecPOMDP and Γ is a set of tangent hyperplanes that determine the centralized prediction reward ρ : Δ(S) → R defined as ρ(b) ≜ max_{α∈Γ} Σ_s b(s)α(s).
The standard DecPOMDP is the special case where Γ contains a single all-zero element. The objective in the DecρPOMDP is to find an optimal joint policy that maximizes the expected sum of rewards and the centralized prediction reward, i.e., E[ Σ_{t=0}^{h−1} R_t(s_t, δ_t(z^t)) + ρ(b_h) ].
3.2 Sufficient plan-time statistics and the optimal value function
Sufficient plan-time statistics are probability distributions over states and joint observation sequences given the past joint policy followed by the agents [14]. A past joint policy at time t is a joint policy specified until time t, denoted ϕ_t = ⟨ϕ_{1,t}, ..., ϕ_{n,t}⟩, where each individual past policy is a sequence of individual decision rules: ϕ_{i,t} = (δ_{i,0}, ..., δ_{i,t−1}). The sufficient plan-time statistic for initial state distribution b_0 and a past joint policy ϕ_t is defined as

σ_t(s_t, z^t) ≜ P(s_t, z^t | b_0, ϕ_t),    (1)

and at the starting time, σ_0(s_0) ≜ b_0(s_0). The conditional σ_t(· | z^t) is the state distribution after perceiving the joint observation sequence z^t when executing policy ϕ_t with initial state distribution b_0. The marginal σ_t(z^t) is the probability of the joint observation sequence z^t given ϕ_t and b_0.
Sufficient plan-time statistics can be updated for any extension of the past joint policy ϕ_t. The extension of an individual past policy ϕ_{i,t} by an individual decision rule δ_{i,t} is defined as ϕ_{i,t+1} = ϕ_{i,t} ◦ δ_{i,t} ≜ (δ_{i,0}, ..., δ_{i,t−1}, δ_{i,t}). The extension of a past joint policy ϕ_t by δ_t = ⟨δ_{1,t}, ..., δ_{n,t}⟩ is defined as ϕ_{t+1} = ϕ_t ◦ δ_t ≜ ⟨ϕ_{1,t+1}, ..., ϕ_{n,t+1}⟩. Let z^{t+1} = (z^t, z_{t+1}) be a joint observation sequence that extends z^t with z_{t+1}. Given the statistic σ_t for ϕ_t and the next joint decision rule δ_t, the updated statistic for ϕ_{t+1} = ϕ_t ◦ δ_t is σ_{t+1} = U_ss(σ_t, δ_t), where the update operator U_ss is defined as

[U_ss(σ_t, δ_t)](s_{t+1}, z^{t+1}) ≜ Σ_{s_t} T(z_{t+1}, s_{t+1} | s_t, δ_t(z^t)) σ_t(s_t, z^t).    (2)

² We defer discussion of how we select the linearization points to Section 5.
The statistic σ_t is sufficient to predict the expected immediate reward R̂_t(σ_t, δ_t) of choosing the next joint decision rule δ_t:

R̂_t(σ_t, δ_t) = Σ_{z^t, s_t} σ_t(s_t, z^t) R_t(s_t, δ_t(z^t)).    (3)
Furthermore, σ_h is sufficient to predict the expected centralized prediction reward ρ̂(σ_h):

ρ̂(σ_h) = Σ_{z^h} σ_h(z^h) max_{α∈Γ} Σ_{s_h} σ_h(s_h | z^h) α(s_h).    (4)
In contrast to the reward in earlier stages, ρ̂ does not depend on decision rules at all. We interpret ρ̂ as the expectation of the prediction reward of a centralized decision-maker who selects the best α ∈ Γ after perceiving the full history of joint actions and observations at task termination.
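The statistic update (2) and the expected prediction reward (4) can be sketched directly on a toy model. The computations follow the equations; the model itself (two static states, one joint action, a joint observation matching the state with probability 0.8) is invented for illustration:

```python
import math
from collections import defaultdict

# Toy model (hypothetical numbers): a static hidden state in {0, 1}, a single
# joint action, and a joint observation equal to the state with probability 0.8.
S, Z = (0, 1), (0, 1)

def T(z_next, s_next, s, a):
    """Dynamics P(z_{t+1}, s_{t+1} | s_t, a_t) with a static state."""
    if s_next != s:
        return 0.0
    return 0.8 if z_next == s_next else 0.2

# sigma_t maps (s_t, joint observation sequence) -> probability, as in Eq. (1).
sigma0 = {(s, ()): 0.5 for s in S}

def update(sigma, delta):
    """One application of the update operator U_ss of Eq. (2);
    delta maps a joint observation sequence to a joint action."""
    out = defaultdict(float)
    for (s, zs), p in sigma.items():
        a = delta(zs)
        for s_next in S:
            for z_next in Z:
                out[(s_next, zs + (z_next,))] += T(z_next, s_next, s, a) * p
    return dict(out)

def rho_hat(sigma, Gamma):
    """Expected centralized prediction reward, Eq. (4)."""
    by_seq = defaultdict(lambda: defaultdict(float))
    for (s, zs), p in sigma.items():
        by_seq[zs][s] += p
    total = 0.0
    for zs, joint in by_seq.items():
        p_z = sum(joint.values())
        cond = [joint[s] / p_z for s in S]  # sigma_h(s_h | z^h)
        total += p_z * max(sum(c * a for c, a in zip(cond, alpha)) for alpha in Gamma)
    return total

sigma1 = update(sigma0, lambda zs: "look")
assert abs(sum(sigma1.values()) - 1.0) < 1e-12  # sigma_1 is a joint distribution

# Negative-entropy tangents at b = (0.8, 0.2) and (0.2, 0.8): alpha(s) = ln b(s).
Gamma = [[math.log(0.8), math.log(0.2)], [math.log(0.2), math.log(0.8)]]
val = rho_hat(sigma1, Gamma)
# Both reachable posteriors are (0.8, 0.2) or (0.2, 0.8), where the tangent
# approximation is tight, so val equals the expected negative entropy.
assert abs(val - (0.8 * math.log(0.8) + 0.2 * math.log(0.2))) < 1e-9
```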
Let π be a joint policy consisting of the joint decision rules δ_0, δ_1, ..., δ_{h−1}. The value-to-go of π starting from a sufficient plan-time statistic σ_t at time step t is

Q^π_t(σ_t, δ_t) = R̂_t(σ_t, δ_t) + ρ̂(σ_{t+1})                 if t = h − 1,
Q^π_t(σ_t, δ_t) = R̂_t(σ_t, δ_t) + Q^π_{t+1}(σ_{t+1}, δ_{t+1})   if 0 ≤ t < h − 1,    (5)

with the shorthand σ_{t+1} = U_ss(σ_t, δ_t). The value function V^π of a joint policy π is defined as the sum of expected rewards when the agents act according to π, V^π(σ_0) ≜ Q^π_0(σ_0, δ_0). The value function of an optimal policy π* is denoted V*, and it satisfies V*(σ_0) ≥ V^π(σ_0) for all π. We write V^π_M and V^π_{⟨M,Γ⟩} for the value function in a standard DecPOMDP and a DecρPOMDP, respectively.
4 Conversion to standard DecPOMDP
In this section, we show that any DecρPOMDP can be converted to a standard DecPOMDP by adding individual prediction actions for each agent, and by introducing a decentralized prediction reward. Each individual prediction action corresponds to a selection of a tangent hyperplane of the centralized prediction reward, as illustrated in Figure 1. As the individual prediction actions are chosen in a decentralized manner, the decentralized prediction reward never exceeds the centralized prediction reward. Consequently, we prove that an optimal policy of the standard DecPOMDP may be applied to the DecρPOMDP with bounded error compared to the true optimal policy. We give a sufﬁcient condition for when this loss due to decentralization is zero and a DecρPOMDP is equivalent to a standard DecPOMDP.
A major implication of our results is that it is possible to apply any DecPOMDP solver to a DecρPOMDP problem. We approximate a centralized prediction reward by a decentralized prediction reward that only depends on states and actions. This allows planning for active perception problems without explicit computation of joint state estimates.
We ﬁrst introduce our proposed conversion and prove its properties, including the error bound. We conclude the section by giving a sufﬁcient condition for when the two problems are equivalent. All omitted proofs are found in the supplementary material.
Definition 3. Given a DecρPOMDP ⟨M, Γ⟩ with M = ⟨h, I, S, b_0, A, Z, T, R⟩, convert it to a standard DecPOMDP M+ = ⟨h + 1, I, S, b_0, A+, Z+, T+, R+⟩, where the horizon is incremented by one and
• in A+, the individual action space Ai,h for each agent i ∈ I at time h is a set of individual prediction actions ai,h, with one individual prediction action for each tangent hyperplane α ∈ Γ; for other time steps Ai,t are as in A,
• in Z+, the individual observation space Zi,h+1 for each agent i ∈ I at time h + 1 contains a single null observation; for other time steps Zi,t are as in Z,
• T+(z_{h+1}, s_{h+1} | s_h, a_h) assigns probability one to the joint null observation and s_{h+1} = s_h, and is otherwise zero; for other time steps T+ is equal to T, and
• in R+, the reward function R_h at time step h is a linear combination of individual prediction rewards R_{i,h} of each agent, such that for a joint prediction action a_h = ⟨a_{1,h}, ..., a_{n,h}⟩ the decentralized prediction reward is

R_h(s_h, a_h) = (1/n) Σ_{i=1}^{n} R_{i,h}(s_h, a_{i,h}),    (6)

with R_{i,h}(s_h, a_{i,h}) = α_{a_{i,h}}(s_h), where α_{a_{i,h}} ∈ Γ is the tangent hyperplane corresponding to the individual prediction action a_{i,h}; for other time steps R_t are as in R.
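A minimal sketch of the decentralized prediction reward (6), with a hypothetical two-state Γ:

```python
# Hypothetical two-state example: Gamma holds two alpha-vectors, indexed by the
# individual prediction action that selects them (e.g. tangents of some convex f).
Gamma = [[0.0, -2.0], [-2.0, 0.0]]

def decentralized_prediction_reward(s, joint_prediction_action, Gamma):
    """Eq. (6): the average of the individual prediction rewards
    R_{i,h}(s, a_{i,h}) = alpha_{a_{i,h}}(s)."""
    n = len(joint_prediction_action)
    return sum(Gamma[a_i][s] for a_i in joint_prediction_action) / n

# If all agents pick the same alpha, the reward is that alpha's value at s.
assert decentralized_prediction_reward(0, (0, 0), Gamma) == 0.0
# If they disagree, the reward is the average of their choices -- never above
# the best single choice, which is the source of the loss due to decentralization.
assert decentralized_prediction_reward(0, (0, 1), Gamma) == -1.0
```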
As the action and observation sets in M and M+ are the same until t = h, past joint policies are interchangeable, and plan-time sufficient statistics are identical.

Lemma 1. Let ⟨M, Γ⟩ be a DecρPOMDP and define M+ as above. Then, for any past joint policy ϕ_t with t ≤ h, the respective plan-time sufficient statistics σ_t in ⟨M, Γ⟩ and σ^+_t in M+ are equivalent, i.e., σ_t ≡ σ^+_t.
In M+ the horizon is incremented, so that each agent takes one more action than in ⟨M, Γ⟩. The additional action is one of the newly added individual prediction actions at time step h. To select an individual prediction action, an agent may use 1) individual information, that is, the agent's individual observation sequence, and 2) plan-time information common to all agents, that is, the plan-time sufficient statistic. An individual prediction rule φ_i maps the individual observation sequence z^h_i and the plan-time sufficient statistic σ_h to an individual prediction action. A joint prediction rule φ = ⟨φ_1, ..., φ_n⟩ is a tuple of individual prediction rules, and maps a joint observation sequence z^h and σ_h to a joint prediction action. The expected decentralized prediction reward given σ_h and φ is defined analogously to Eq. (3) as R̂_h(σ_h, φ) ≜ Σ_{z^h, s_h} σ_h(s_h, z^h) R_h(s_h, φ(z^h, σ_h)).
An optimal joint prediction rule maximizes the expected decentralized prediction reward. We prove
that an optimal joint prediction rule consists of individual prediction rules that maximize the expected
individual prediction reward. We then show that the expected decentralized prediction reward is
at most equal to the centralized prediction reward, and that a similar relation holds between value functions in the respective DecPOMDP and DecρPOMDP. The key property required is that the
decentralized prediction reward is a sum of individual prediction rewards.
Lemma 2 (Optimal joint prediction rule). Let ⟨M, Γ⟩ be a DecρPOMDP and define M+ as above, and let σ_h be a plan-time sufficient statistic for any past joint policy. Then the joint prediction rule φ* = ⟨φ*_1, ..., φ*_n⟩, where each individual prediction rule φ*_i is defined as

φ*_i(z^h_i, σ_h) ≜ argmax_{a_{i,h} ∈ A_{i,h}} Σ_{z^h_{−i}, s_h} σ_h(s_h, z^h_{−i} | z^h_i) R_{i,h}(s_h, a_{i,h})    (7)

maximizes the expected decentralized prediction reward, that is, R̂_h(σ_h, φ*) = max_φ R̂_h(σ_h, φ).
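Equation (7) can be sketched on a toy two-agent statistic. Because the individual prediction reward α_{a_{i,h}}(s_h) depends only on the state, summing out z^h_{−i} reduces the argmax to the conditional state marginal given the agent's own observations; the probabilities below are invented for illustration:

```python
import math
from collections import defaultdict

# Hypothetical two-agent, two-state statistic sigma_h:
# keys are (s_h, (z_i, z_other)) with invented probabilities.
sigma_h = {
    (0, ("a", "x")): 0.30, (0, ("a", "y")): 0.10,
    (1, ("a", "x")): 0.05, (1, ("a", "y")): 0.05,
    (0, ("b", "x")): 0.05, (0, ("b", "y")): 0.05,
    (1, ("b", "x")): 0.10, (1, ("b", "y")): 0.30,
}
S = (0, 1)
# Individual prediction rewards R_{i,h}(s, a_i) = alpha_{a_i}(s):
# negative-entropy tangents at (0.8, 0.2) and (0.2, 0.8).
Gamma = [[math.log(0.8), math.log(0.2)], [math.log(0.2), math.log(0.8)]]

def phi_star_i(z_i, sigma_h, Gamma):
    """Eq. (7). Since alpha_{a_i}(s) depends only on the state, summing out
    z_{-i,h} reduces the argmax to the conditional state marginal given z_i."""
    marg = defaultdict(float)
    for (s, (zi, _z_other)), p in sigma_h.items():
        if zi == z_i:
            marg[s] += p
    norm = sum(marg.values())
    cond = [marg[s] / norm for s in S]
    values = [sum(c * a for c, a in zip(cond, alpha)) for alpha in Gamma]
    return max(range(len(Gamma)), key=lambda k: values[k])

# Agent i's observation "a" is evidence for state 0, "b" for state 1.
assert phi_star_i("a", sigma_h, Gamma) == 0
assert phi_star_i("b", sigma_h, Gamma) == 1
```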
In the DecρPOMDP ⟨M, Γ⟩, the centralized prediction reward is determined by a centralized virtual agent that has access to the observation histories of all agents. In our converted standard DecPOMDP M+, each agent individually takes a prediction action, leading to the decentralized prediction reward. The decentralized prediction reward never exceeds the centralized prediction reward, as we show next.

Lemma 3. The expected decentralized prediction reward R̂_h(σ_h, φ*) in M+ is at most equal to the expected centralized prediction reward ρ̂(σ_h) in ⟨M, Γ⟩, i.e., R̂_h(σ_h, φ*) ≤ ρ̂(σ_h).
Lemma 4. Let ⟨M, Γ⟩ and M+ be as defined above. Let ϕ_h be a past joint policy for M+, and let φ* be the optimal joint prediction rule. Then, the value of ϕ_h ◦ φ* in M+ is at most equal to the value of ϕ_h in ⟨M, Γ⟩, i.e., V^{ϕ_h ◦ φ*}_{M+}(σ_0) ≤ V^{ϕ_h}_{⟨M,Γ⟩}(σ_0).
We now show that the difference between the optimal values of the standard DecPOMDP M+ and the DecρPOMDP ⟨M, Γ⟩ is bounded. We call this error the loss due to decentralization.
Theorem 1 (Loss due to decentralization). Consider a DecρPOMDP ⟨M, Γ⟩ with the optimal value function V*_{⟨M,Γ⟩}. Let π be an optimal policy for the standard DecPOMDP M+ created as in Definition 3, and denote by ϕ_h the past joint policy consisting of the first h decision rules of π. Then the difference of V*_{⟨M,Γ⟩} and the value function V^{ϕ_h}_{⟨M,Γ⟩} of applying ϕ_h to ⟨M, Γ⟩ is bounded by

V*_{⟨M,Γ⟩}(σ_0) − V^{ϕ_h}_{⟨M,Γ⟩}(σ_0) ≤ 2 max_{σ_h} [ ρ̂(σ_h) − R̂_h(σ_h, φ*) ],    (8)

where φ* is the optimal joint prediction rule.
Proof. For clarity, in the following we omit the argument σ_0 from the value functions. Suppose the optimal policy π* in ⟨M, Γ⟩ consists of the joint decision rules δ*_0, ..., δ*_{h−1}. For M+, define the partial joint policy ϕ*_h = (δ*_0, ..., δ*_{h−1}) and its extension ϕ*_h ◦ φ*. By Lemma 2, an optimal joint policy π of M+ is an extension of some past joint policy ϕ_h by φ*, i.e., π = ϕ_h ◦ φ*. Because ϕ_h ◦ φ* is optimal in M+, we have that V^{ϕ*_h ◦ φ*}_{M+} ≤ V^{ϕ_h ◦ φ*}_{M+}. By Lemma 4, V^{ϕ_h ◦ φ*}_{M+} ≤ V^{ϕ_h}_{⟨M,Γ⟩}. Finally, because π* is optimal in ⟨M, Γ⟩, we have V^{ϕ_h}_{⟨M,Γ⟩} ≤ V*_{⟨M,Γ⟩}. Now,

|V*_{⟨M,Γ⟩} − V^{ϕ_h ◦ φ*}_{M+}|
≤ |V*_{⟨M,Γ⟩} − V^{ϕ*_h ◦ φ*}_{M+}| + |V^{ϕ*_h ◦ φ*}_{M+} − V^{ϕ_h ◦ φ*}_{M+}|    (9a)
≤ |V*_{⟨M,Γ⟩} − V^{ϕ*_h ◦ φ*}_{M+}| + |V^{ϕ*_h ◦ φ*}_{M+} − V*_{⟨M,Γ⟩}| = 2 · |V*_{⟨M,Γ⟩} − V^{ϕ*_h ◦ φ*}_{M+}|    (9b)
= 2 · [ Σ_{t=0}^{h−1} R̂_t(σ*_t, δ*_t) + ρ̂(σ*_h) − ( Σ_{t=0}^{h−1} R̂_t(σ*_t, δ*_t) + R̂_h(σ*_h, φ*) ) ]    (9c)
= 2 · [ ρ̂(σ*_h) − R̂_h(σ*_h, φ*) ] ≤ 2 · max_{σ_h} [ ρ̂(σ_h) − R̂_h(σ_h, φ*) ].    (9d)

We first apply the triangle inequality. The second inequality holds as V^{ϕ*_h ◦ φ*}_{M+} ≤ V^{ϕ_h}_{⟨M,Γ⟩} ≤ V*_{⟨M,Γ⟩}. The next equality is by symmetry of the absolute difference. The third line follows from the definition of the value function and Lemma 1. Note that we refer by σ*_t to the sufficient plan-time statistics reached under the partial joint policies ϕ*_t of π*. The final inequality follows by maximizing the difference of the expected centralized and decentralized prediction rewards.
The theorem shows that given a DecρPOMDP M, Γ , we may solve the standard DecPOMDP M+ and apply its optimal policy to the DecρPOMDP with bounded error. If there is only a single agent, that agent is equivalent to the conceptualized centralized agent, which implies ρˆ ≡ Rˆh and that the loss is zero. The equivalence of a ρPOMDP with a convex prediction reward and a POMDP with information rewards ﬁrst shown in [20] is therefore obtained as a special case.
The proof indicates that the error is at most twice the difference of the expected centralized prediction reward and the expected decentralized prediction reward at the sufﬁcient plantime statistic σh∗ at time h under an optimal policy π∗ of M, Γ . This suggests that the ﬁnal bound given as maximum over all sufﬁcient plantime statistics may be overly pessimistic for some cases. While a complete characterization of such cases is beyond the scope of this paper, we give below a sufﬁcient condition for when the error is zero in the multiagent case. The proof is in the supplementary material.
Observation 1. Consider the setting of Theorem 1. Assume that the observation sequence of each agent is conditionally independent of the observation sequences of all other agents given the past joint policy and initial state distribution, i.e., for every agent i, σ_h(z^h) = σ_h(z^h_i) σ_h(z^h_{−i}). Then π* is an optimal joint policy for ⟨M, Γ⟩ if and only if π* ◦ φ* is an optimal joint policy for M+.
The sufficient condition above is restrictive. Informally, it requires that each agent executes its own independent active perception task. However, the loss due to decentralization may be small if a multiagent active perception task almost satisfies the requirement. It might be possible to derive less restrictive sufficient conditions in weakly-coupled problems by building on influence-based abstractions [17]. Such weakly-coupled cases may arise, e.g., in distributed settings where two agents deployed in different regions far away from each other have limited influence on each other.
5 Adaptive prediction action search

Recall that the centralized prediction reward function ρ is obtained by approximating a continuous convex function f by a set of α-vectors. The approximation is more accurate the more α-vectors are used; for error bounds, see [1, 21]. We propose a planning algorithm for DecρPOMDPs called Adaptive Prediction Action Search (APAS) that dynamically updates the α-vectors during planning. This avoids the need to a priori create a large number of α-vectors, some of which may turn out not to be useful. Moreover, APAS allows application of any standard DecPOMDP algorithm to solve DecρPOMDPs by using the conversion proposed in Section 4. This can avoid the computation of all reachable state estimates required by earlier algorithms for DecρPOMDPs [11, 12].

Algorithm 1 Adaptive prediction action search (APAS) for DecρPOMDP planning
Input: DecρPOMDP ⟨M, Γ⟩, convex function f : Δ(S) → R, number of linearization points K
Output: Best joint policy found, π_best
1: V_best ← −∞, π_best ← ∅
2: repeat
3:   // Policy optimization phase
4:   M+ ← ConvertDecPOMDP(M, Γ)       ▷ Apply Definition 3
5:   π ← Plan(M+)                     ▷ Use any DecPOMDP planner
6:   V ← Evaluate(π)
7:   if V > V_best then V_best ← V, π_best ← π
8:   // Adaptation phase
9:   Γ ← ∅
10:  for k = 1, ..., K do
11:    z^h ∼ σ_h(z^h)                 ▷ Sample joint observation sequence z^h using π_best
12:    b_k ← σ_h(· | z^h)             ▷ Final state estimate corresponding to z^h
13:    α_k ← ∇f(b_k) − f*(∇f(b_k))    ▷ Tangent hyperplane of f at b_k
14:    Γ ← Γ ∪ {α_k}
15:  end for
16: until converged
17: return π_best
The pseudocode for APAS is shown in Algorithm 1. APAS consists of two phases that are repeated until convergence: the policy optimization phase and the α-vector adaptation phase. In the policy optimization phase, the best joint policy for the current set Γ of α-vectors is found. On the first iteration we initialize Γ randomly as explained in the next section. In the adaptation phase, the α-vectors are then modified such that they best approximate the final reward for joint state estimates that are most likely reached by the current joint policy.

In the optimization phase the DecρPOMDP is converted to a standard DecPOMDP as described in Section 4. We then apply any standard DecPOMDP algorithm to plan a joint policy π for the converted DecPOMDP. If the value of π exceeds the value of the best policy so far, π_best, the best policy is updated. In the adaptation phase, we sample a final state distribution b_k under the currently best policy, and insert the tangent hyperplane at b_k into Γ. This corresponds to sampling a point on the horizontal axis in Figure 1 and adding the corresponding tangent hyperplane to Γ. This is repeated K times. Specifically, we simulate a trajectory under the current best policy π_best to sample a joint observation sequence (Line 11), use Bayesian filtering to compute the corresponding state estimate b_k (Line 12), and compute the corresponding α-vector (Line 13). The updated set Γ of α-vectors provides the best approximation of the final reward for the sampled state estimates. The time complexity of APAS is determined by the complexity of the DecPOMDP planner called in Plan. A reference implementation is available at https://github.com/laurimi/multiagentpredictionreward.
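The adaptation phase (Lines 9–15 of Algorithm 1) can be sketched for the negative-entropy case, where Line 13 reduces to α(s) = ln b(s); the statistic σ_h below is a hypothetical stand-in for the distribution induced by π_best:

```python
import math
import random

def adapt_alphas(sigma_h, S, K, rng):
    """APAS adaptation phase (Lines 9-15 of Algorithm 1), assuming f is the
    negative entropy so that Line 13 reduces to alpha(s) = ln b(s).
    sigma_h maps (s, z_seq) -> probability under the current best policy."""
    by_seq = {}
    for (s, zs), p in sigma_h.items():
        by_seq.setdefault(zs, {})[s] = p
    seqs = list(by_seq)
    weights = [sum(by_seq[z].values()) for z in seqs]
    Gamma = []
    for _ in range(K):
        z = rng.choices(seqs, weights=weights)[0]           # Line 11: z^h ~ sigma_h(z^h)
        p_z = sum(by_seq[z].values())
        b = [by_seq[z].get(s, 0.0) / p_z for s in S]        # Line 12: b_k = sigma_h(. | z^h)
        Gamma.append([math.log(max(p, 1e-12)) for p in b])  # Line 13: tangent at b_k
    return Gamma

rng = random.Random(0)
# Hypothetical final-step statistic over two states and two observation sequences.
sigma_h = {(0, ("u",)): 0.4, (1, ("u",)): 0.1, (0, ("v",)): 0.1, (1, ("v",)): 0.4}
Gamma = adapt_alphas(sigma_h, (0, 1), K=3, rng=rng)
assert len(Gamma) == 3
# Each alpha is tangent at its sampled b, so <b, alpha> recovers f(b) there.
for alpha in Gamma:
    b = [math.exp(a) for a in alpha]
    f_b = sum(p * math.log(p) for p in b)
    assert abs(sum(p * a for p, a in zip(b, alpha)) - f_b) < 1e-9
```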
6 Experiments
The DecρPOMDP we target is computationally more challenging (NEXPcomplete [3]) than centralized POMDP, POMDPIR, or ρPOMDP (PSPACEcomplete [19]). Immediate alltoall communication during task execution would be required to solve the problem we target as a centralized problem, and the resulting optimal value would be higher [16]. Therefore, a comparison to centralized methods is neither fair nor necessary. We compare APAS to the NPGI algorithm of [11] that solves a DecρPOMDP by iterative improvement of a ﬁxedsize policy graph. As the PLAN subroutine of APAS, we use the ﬁnitehorizon variant of the policy graph improvement method of [18]. This method is algorithmically similar to NPGI which helps isolate the effect of applying the proposed conversion to a standard DecPOMDP from the effect of algorithmic design choices to the extent
8
Table 1: Average policy values ± standard error in the MAV (left) and the rovers (right) domains.

MAV
h   APAS (ours)       APAS (no adapt.)   NPGI [12]   Optimal
2   -2.006 ± 0.005    -2.112 ± 0.002     -1.931      -1.919
3   -1.936 ± 0.004    -2.002 ± 0.002     -1.833      -1.831
4   -1.879 ± 0.004    -1.943 ± 0.002     -1.768      —
5   -1.842 ± 0.005    -1.918 ± 0.002     -1.725      —
6   -1.814 ± 0.004    -1.898 ± 0.002     —           —
7   -1.789 ± 0.004    -1.892 ± 0.003     —           —
8   -1.820 ± 0.005    -1.885 ± 0.006     —           —

Rovers
h   APAS (ours)       APAS (no adapt.)   NPGI [12]   Optimal
2   -3.484 ± 0.002    -3.800 ± 0.006     -3.495      -3.479
3   -3.402 ± 0.008    -3.680 ± 0.006     -3.192      -3.189
4   -3.367 ± 0.011    -3.640 ± 0.006     -3.036      —
5   -3.293 ± 0.012    -3.591 ± 0.006     -2.981      —
6   -3.333 ± 0.012    -3.631 ± 0.007     —           —
7   -3.375 ± 0.014    -3.673 ± 0.008     —           —
8   -3.496 ± 0.014    -3.860 ± 0.010     —           —
possible. Any other Dec-POMDP algorithm may also be used with APAS. Details on the algorithms and parameter settings are provided in the supplementary material.
We evaluate on the Dec-ρPOMDP domains from [11]: the micro air vehicle (MAV) domain and the information gathering rovers domain. In the MAV domain, two agents cooperatively track a moving target to determine if the target is friendly or hostile. In the rovers domain, two agents collect samples of scientific data in a grid environment. The reward at the final time step is the negative entropy of the joint state estimate b_h, that is, $f(b_h) = \sum_{s_h} b_h(s_h) \ln b_h(s_h)$. For the MAV problem, we use K = 2, and for the rovers problem, K = 5 individual prediction actions. For APAS, we initialize Γ by randomly sampling linearization points b_k ∈ ∆(S), k = 1, . . . , K using [23]. The corresponding α-vectors for the negative entropy are α_k(s) = ln b_k(s). We only observed minor benefits from using a greater number of α-vectors, and only for long horizons; details are reported in the supplementary material. We run 100 repetitions using APAS and report the average policy value and its standard error. Results for NPGI are as reported in [12]. To ensure comparability, all reported policy values are computed using f as the final reward. As a baseline, we report the optimal value if known.
Comparison to the state-of-the-art. The results are shown in Table 1. While APAS does not exceed state-of-the-art performance in terms of policy quality, the major advantage of APAS is that it is able to scale to greater horizons than NPGI. The number of joint state distributions reachable under the current policy grows exponentially in the horizon h. Computation of these state distributions is required by NPGI, causing it to run out of memory when h > 5.
While running APAS, we compute the value of the policies exactly (Alg. 1, Line 6). For long horizons, exact evaluation requires a significant fraction of the computation budget. We expect that further scaling in terms of planning horizon is possible by switching to approximate evaluation, at the cost of noisy value estimates. Results on computation time, as well as results for horizons up to 10 for the rovers problem, are reported in the supplementary material.
Benefit of adaptive prediction action selection. The adaptation phase of APAS is disabled by not running Lines 9–15. Instead, at the start of each iteration of the while loop, we randomly sample K linearization points using [23] to create Γ and solve the corresponding Dec-POMDP. We repeat this procedure 1000 times and report the results in the column “APAS (no adapt.)” of Table 1. We conclude that the α-vector adaptation clearly improves performance in both domains.
7 Conclusion
We showed that multiagent active perception modelled as a Dec-ρPOMDP can be reduced to a standard Dec-POMDP by introducing individual prediction actions. The difference between the optimal solutions of the standard Dec-POMDP and the Dec-ρPOMDP is bounded. Our reduction enables the application of any standard Dec-POMDP solver to multiagent active perception problems, as demonstrated by our proposed APAS algorithm.

Our results allow transferring advances in scalability for standard Dec-POMDPs to multiagent active perception tasks. In multiagent reinforcement learning, rewards typically depend on the underlying state of the system. Therefore, our reduction result also enables further investigation into learning for multiagent active perception. An investigation of the necessary conditions for when the loss due to decentralization is zero is another future direction.
Broader Impact
This work is theoretical in nature and therefore does not present any foreseeable societal consequences.
Acknowledgments
This project has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No. 758824 — INFLUENCE).
References
[1] Mauricio Araya-López, Olivier Buffet, Vincent Thomas, and François Charpillet. A POMDP Extension with Belief-dependent Rewards. In Advances in Neural Information Processing Systems, pages 64–72, 2010.
[2] Ruzena Bajcsy, Yiannis Aloimonos, and John K Tsotsos. Revisiting active perception. Autonomous Robots, 42(2):177–196, 2018.
[3] Daniel S Bernstein, Robert Givan, Neil Immerman, and Shlomo Zilberstein. The complexity of decentralized control of Markov decision processes. Mathematics of Operations Research, 27(4):819–840, 2002.
[4] Graeme Best, Oliver M Cliff, Timothy Patten, Ramgopal R Mettu, and Robert Fitch. Dec-MCTS: Decentralized planning for multi-robot active perception. The International Journal of Robotics Research, 38(2-3):316–337, 2019.
[5] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[6] Jesus Capitan, Matthijs T.J. Spaan, Luis Merino, and Anibal Ollero. Decentralized multirobot cooperation with auctioned POMDPs. The International Journal of Robotics Research, 32(6):650–671, 2013.
[7] Micah Corah and Nathan Michael. Distributed matroidconstrained submodular maximization for multirobot exploration: Theory and practice. Autonomous Robots, 43(2):485–501, 2019.
[8] Bahman Gharesifard and Stephen L Smith. Distributed submodular maximization with limited information. IEEE Transactions on Control of Network Systems, 5(4):1635–1645, 2017.
[9] Manish Jain, Matthew Taylor, Milind Tambe, and Makoto Yokoo. DCOPs meet the real world: Exploring unknown reward matrices with applications to mobile sensor networks. In Intl. Joint Conference on Artiﬁcial Intelligence (IJCAI), pages 181–186, 2009.
[10] Mikko Lauri, Eero Heinänen, and Simone Frintrop. Multirobot active information gathering with periodic communication. In IEEE Intl. Conf. on Robotics and Automation (ICRA), pages 851–856, 2017.
[11] Mikko Lauri, Joni Pajarinen, and Jan Peters. Information Gathering in Decentralized POMDPs by Policy Graph Improvement. In Intl. Conf. on Autonomous Agents and Multiagent Systems (AAMAS), pages 1143–1151, 2019.
[12] Mikko Lauri, Joni Pajarinen, and Jan Peters. Multi-agent active information gathering in discrete and continuous-state decentralized POMDPs by policy graph improvement. Autonomous Agents and Multi-Agent Systems, 34(2):1–44, 2020.
[13] Duc Thien Nguyen, William Yeoh, Hoong Chuin Lau, Shlomo Zilberstein, and Chongjie Zhang. Decentralized multiagent reinforcement learning in averagereward dynamic DCOPs. In AAAI Conference on Artiﬁcial Intelligence, pages 1447–1455, 2014.
[14] Frans A. Oliehoek. Sufficient plan-time statistics for decentralized POMDPs. In Intl. Joint Conference on Artificial Intelligence (IJCAI), pages 302–308, 2013.
[15] Frans A. Oliehoek and Christopher Amato. A Concise Introduction to Decentralized POMDPs. Springer, 2016.
[16] Frans A. Oliehoek, Matthijs TJ Spaan, and Nikos Vlassis. Optimal and approximate Qvalue functions for decentralized POMDPs. Journal of Artiﬁcial Intelligence Research, 32:289–353, 2008.
1 Introduction
The key problem in multiagent active perception is to determine how each agent should act during the decentralized phase to maximize the informativeness of the collected observations, evaluated afterwards during the centralized inference phase. The problem can be formalized as a decentralized partially observable Markov decision process (Dec-POMDP) [3, 15], a general model of sequential multiagent decision-making under uncertainty. At each time step in a Dec-POMDP, each agent in the team takes an individual action. The next state is determined according to a Markov chain conditional on the current hidden state and all individual actions. Each agent then perceives an individual observation correlated with the next state and the individual actions. The agents should act so as to maximize the expected sum of shared rewards, accumulated at each time step over a finite horizon.
34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.
In the decentralized phase, the per-step reward depends on the hidden state and the individual actions of all agents. In typical Dec-POMDPs, the reward on all time steps is of this form. Active perception problems are modelled by a reward that is a convex function of the team's joint estimate of the hidden state, for example the negative entropy [11]. This encourages agents to act in ways that lead to joint state estimates with low uncertainty. In analogy to the centralized inference phase, the reward at the final time step can be thought of as a centralized prediction reward: a centralized decision-maker perceives the individual action-observation histories of all agents and determines a reward based on the corresponding joint state estimate. Due to the choice of reward function, algorithms for standard Dec-POMDPs are not applicable to such active perception problems. Despite the pooling of observations after the end of the task, the problem we target is decentralized. We design a strategy for each agent to act independently during the observation gathering phase, without knowing for sure how the others acted or what they perceived. Consequently, the joint state estimate is not available to any agent during the observation gathering phase. Strategies executable in a centralized manner are available only if all-to-all communication during task execution is possible, which we do not assume. As decentralized strategies are a strict subset of centralized strategies, the best decentralized strategy is at most as good as the best centralized strategy [16].
In this paper, we show that the convex centralized prediction reward can be converted to a decentralized prediction reward that is a function of the hidden state and so-called individual prediction actions. This converts the Dec-POMDP with a convex centralized prediction reward into a standard Dec-POMDP with rewards that depend on the hidden state and actions only. This enables solving multiagent active perception problems without explicit computation of joint state estimates, applying any standard Dec-POMDP algorithm. We show that the error induced when converting the centralized prediction reward into a decentralized one, the loss due to decentralization, is bounded. We also give a sufficient condition for when the loss is zero, meaning that the problems with the centralized and decentralized prediction rewards, respectively, are equivalent. We demonstrate the empirical usefulness of our results by applying standard Dec-POMDP solution algorithms to active perception problems, demonstrating improved scalability over the state-of-the-art.
The remainder of the paper is organized as follows. We review related work in Section 2, and give preliminary definitions for Dec-POMDPs in Section 3. In Section 4, we introduce our proposed conversion of a centralized prediction reward to a decentralized prediction reward. We propose a method to apply standard Dec-POMDP algorithms to multiagent active perception in Section 5, and empirically evaluate the method in Section 6. Section 7 concludes the paper.
2 Related work
We briefly review possible formulations of multiagent active perception problems, and then focus on the Dec-POMDP model that provides the most general formulation.
Multiagent active perception has been formulated as a distributed constraint optimization problem (DCOP), submodular maximization, or as a specialized variant of a partially observable Markov decision process (POMDP). Probabilistic DCOPs with partial agent knowledge have been applied to signal source localization [9, 26]. DCOPs with Markovian dynamics have been proposed for target tracking by multiple sensors [13]. DCOPs are a simpler model than Dec-POMDPs, as a fixed communication structure is assumed or the noise in the sensing process is not modelled. Submodular maximization approaches assume the agents' reward can be stated as a submodular set function, and apply distributed greedy maximization to obtain an approximate solution [22, 8, 7]. Along with the structure of the reward function, inter-agent communication is typically assumed. Specialized variants of POMDPs may also be applied. If all-to-all communication without delay during task execution is available, centralized control is possible and the problem can be solved as a multiagent POMDP [24]. Auctioning of POMDP policies can facilitate multiagent cooperation when agents can communicate [6]. Best et al. [4] propose a decentralized Monte Carlo tree search planner where agents periodically communicate their open-loop plans to each other.
Multiagent active perception without communication, and with uncertainty in the state transitions and the agents' perception, may be modelled as a Dec-ρPOMDP [10, 11]. In contrast to standard Dec-POMDPs, the reward function in a Dec-ρPOMDP is a convex function of the joint state estimate, for example the negative entropy. Unfortunately, because of the convex reward, standard Dec-POMDP solution algorithms are not applicable. Furthermore, the heuristic algorithm proposed in [11, 12]
requires explicit computation of reachable joint state estimates. Computing and storing these joint state estimates adds significant memory overhead to the already high computational burden of solving Dec-POMDPs [3]. We address both shortcomings, showing that a Dec-ρPOMDP can be converted to a standard Dec-POMDP with linear rewards. This in turn enables applying standard planning algorithms that do not require computation of reachable joint state estimates, leading to improved scalability.
Our work in the decentralized setting draws inspiration from approaches using prediction rewards in single-agent and centralized scenarios. Araya-López et al. [1] proposed to tackle single-agent active perception as a ρPOMDP with a convex reward. The related POMDP with information rewards (POMDP-IR) was proposed in [25]. The POMDP-IR model adds prediction actions that the agent selects in addition to the usual actions. Active perception is facilitated by rewarding the agent for correctly predicting the true underlying state. The equivalence of the ρPOMDP and POMDP-IR models was later established [20]. Recently, Satsangi et al. [21] proposed a reinforcement learning method to solve ρPOMDPs taking advantage of the equivalence. In this paper, we prove an analogous equivalence for Dec-POMDPs, by converting a Dec-ρPOMDP to a standard Dec-POMDP with individual prediction actions and a decentralized prediction reward. Unlike in the POMDP case, the conversion does not always result in a perfect equivalence, but is associated with a loss due to decentralization.
3 Multiagent active perception as a Dec-POMDP

In this section, we review how active perception problems are modelled as Dec-POMDPs. We also review plan-time sufficient statistics, which allow us to concisely express joint state estimates and value functions of a Dec-POMDP. We concentrate on the practically relevant active perception problem where a prediction reward at the final time step depends on the joint state estimate.
3.1 Decentralized POMDPs
We define a Dec-POMDP where the action and observation spaces along with the reward function are time-dependent, as this will be convenient for our subsequent introduction of prediction actions.

Definition 1. A Dec-POMDP is a tuple M = ⟨h, I, S, b0, A, Z, T, R⟩, where

• h ∈ ℕ is the time horizon of the problem,

• I = {1, 2, . . . , n} is a set of n agents,

• S is the finite set of states s,

• b0 ∈ ∆(S) is the initial state distribution at time step t = 0,

• A is the collection of individual action spaces A_{i,t} for each agent i ∈ I and time step t = 0, . . . , h − 1. The tuple a_t = ⟨a_{1,t}, a_{2,t}, . . . , a_{n,t}⟩ of individual actions is called the joint action at time step t,

• Z is the collection of individual observation spaces Z_{i,t} for each agent i ∈ I and time step t = 1, . . . , h. The tuple z_t = ⟨z_{1,t}, z_{2,t}, . . . , z_{n,t}⟩ of individual observations is called the joint observation at time step t,

• T is the dynamics function specifying conditional probabilities P(z_{t+1}, s_{t+1} | s_t, a_t), and

• R is the collection of reward functions R_t(s_t, a_t) for time steps t = 0, . . . , h − 1.
An admissible solution of a Dec-POMDP is a decentralized joint policy π, i.e., a tuple ⟨π1, . . . , πn⟩ where the individual policy πi of each agent i maps individual observation sequences z_{i,t} = (z_{i,1}, . . . , z_{i,t}) to an individual action.¹ An individual policy is a sequence πi = (δ_{i,0}, . . . , δ_{i,h−1}) of individual decision rules that map length-t individual observation sequences to an individual action δ_{i,t}(z_{i,t}) = a_{i,t}. A joint decision rule is a tuple δt = ⟨δ_{1,t}, . . . , δ_{n,t}⟩ that maps a length-t joint observation sequence z^t = ⟨z_{1,t}, . . . , z_{n,t}⟩ to a joint action δt(z^t) = a_t. We shall use the notation z_{−i,t} to denote the individual observation sequences of all agents except i.

¹ z_{i,0} = ∅ as there is no observation at time t = 0.
The objective is to find an optimal joint policy π∗ that maximizes the expected sum of rewards, that is,
$$\pi^* = \operatorname*{argmax}_{\pi} \, \mathbb{E}\Big[\sum_{t=0}^{h-1} R_t\big(s_t, \delta_t(\mathbf{z}^t)\big)\Big],$$
where the expectation is with respect to the distribution of states and joint observation sequences induced under π, i.e., $P(s_0, \ldots, s_h, \mathbf{z}^h \mid b_0, \pi) \triangleq b_0(s_0) \prod_{t=0}^{h-1} T(\mathbf{z}_{t+1}, s_{t+1} \mid s_t, \delta_t(\mathbf{z}^t))$.
In this paper, we are interested in cooperative active perception. We assume that after the task terminates at time step h, the joint observation sequence z^h is used to compute the conditional distribution b_h ∈ ∆(S) over the final state s_h, that is, $b_h(s_h) \triangleq P(s_h \mid \mathbf{z}^h, b_0, \pi)$. We seek policies that, instead of only maximizing the expected sum of rewards R_t(s_t, a_t), also maximize the informativeness of b_h.

The Dec-ρPOMDP [11] models active perception by maximizing a convex function of b_h, e.g., the negative entropy. This convex function may be thought of as a centralized prediction reward that is independent of the agents' individual actions. Conceptually, the centralized prediction reward is determined by a virtual centralized decision-maker that perceives the agents' individual action and observation sequences at time h, computes b_h, and then determines the final reward. We approximate the centralized prediction reward using a piecewise linear and convex function as illustrated in Figure 1. Consider a bounded, convex, and differentiable function f : ∆(S) → ℝ. The tangent hyperplane $\alpha \in \mathbb{R}^{|S|}$ of f at b ∈ ∆(S) is $\alpha = \nabla f(b) - f^*(\nabla f(b))$, where ∇ denotes the gradient and f∗ is the convex conjugate of f [5]. We select a finite set of linearization points b_j ∈ ∆(S), and define Γ as the set of corresponding tangent hyperplanes α_j.² Then, we obtain a lower approximation as $f(b) \ge \max_{\alpha \in \Gamma} \sum_s b(s)\alpha(s)$. This approximation is also used in ρPOMDPs, and has a bounded error [1]. The approximation is also used in [21], where error bounds for cases such as 0–1 rewards are provided. The Dec-ρPOMDP problem we consider is as follows.
[Figure 1: A convex function f is approximated by max_i ⟨b, α_i⟩, where the α_i are tangent hyperplanes and ⟨·, ·⟩ is the inner product. The horizontal axis depicts b(s) in a system with two states s and s′, where b(s′) = 1 − b(s). A prediction action corresponds to selecting a single α_i.]
Definition 2 (Dec-ρPOMDP). A Dec-ρPOMDP is a pair ⟨M, Γ⟩, where M is a Dec-POMDP and Γ is a set of tangent hyperplanes that determine the centralized prediction reward ρ : ∆(S) → ℝ defined as
$$\rho(b) \triangleq \max_{\alpha \in \Gamma} \sum_{s} b(s)\alpha(s).$$

The standard Dec-POMDP is the special case where Γ contains a single all-zero element. The objective in the Dec-ρPOMDP is to find an optimal joint policy that maximizes the expected sum of rewards and the centralized prediction reward, i.e., $\mathbb{E}\big[\sum_{t=0}^{h-1} R_t(s_t, \delta_t(\mathbf{z}^t)) + \rho(b_h)\big]$.
3.2 Sufficient plan-time statistics and the optimal value function

Sufficient plan-time statistics are probability distributions over states and joint observation sequences given the past joint policy followed by the agents [14]. A past joint policy at time t is a joint policy specified until time t, denoted ϕt = ⟨ϕ_{1,t}, . . . , ϕ_{n,t}⟩, where each individual past policy is a sequence of individual decision rules: ϕ_{i,t} = (δ_{i,0}, . . . , δ_{i,t−1}). The sufficient plan-time statistic for initial state distribution b0 and a past joint policy ϕt is defined as
$$\sigma_t(s_t, \mathbf{z}^t) \triangleq P(s_t, \mathbf{z}^t \mid b_0, \varphi_t), \tag{1}$$
and at the starting time, σ0(s0) ≜ b0(s0). The conditional σt(· | z^t) is the state distribution after perceiving the joint observation sequence z^t when executing policy ϕt with initial state distribution b0. The marginal σt(z^t) is the probability of the joint observation sequence z^t given ϕt and b0.

Sufficient plan-time statistics are updateable to any extension of the past joint policy ϕt. The extension of an individual past policy ϕ_{i,t} by an individual decision rule δ_{i,t} is defined as ϕ_{i,t+1} = ϕ_{i,t} ◦ δ_{i,t} ≜ (δ_{i,0}, . . . , δ_{i,t−1}, δ_{i,t}). The extension of a past joint policy ϕt by δt = ⟨δ_{1,t}, . . . , δ_{n,t}⟩ is defined as ϕ_{t+1} = ϕt ◦ δt ≜ ⟨ϕ_{1,t+1}, . . . , ϕ_{n,t+1}⟩. Let z^{t+1} = (z^t, z_{t+1}) be a joint observation sequence that extends z^t with z_{t+1}. Given the statistic σt for ϕt and the next joint decision rule δt, the updated statistic for ϕ_{t+1} = ϕt ◦ δt is σ_{t+1} = U_{ss}(σt, δt), where the update operator U_{ss} is defined as
$$[U_{ss}(\sigma_t, \delta_t)](s_{t+1}, \mathbf{z}^{t+1}) \triangleq \sum_{s_t} T(\mathbf{z}_{t+1}, s_{t+1} \mid s_t, \delta_t(\mathbf{z}^t))\,\sigma_t(s_t, \mathbf{z}^t). \tag{2}$$
The statistic σt is sufficient to predict the expected immediate reward R̂t(σt, δt) of choosing the next joint decision rule δt:
$$\hat{R}_t(\sigma_t, \delta_t) = \sum_{\mathbf{z}^t, s_t} \sigma_t(s_t, \mathbf{z}^t)\,R_t(s_t, \delta_t(\mathbf{z}^t)). \tag{3}$$
Furthermore, σh is sufficient to predict the expected centralized prediction reward ρ̂(σh):
$$\hat{\rho}(\sigma_h) = \sum_{\mathbf{z}^h} \sigma_h(\mathbf{z}^h) \max_{\alpha \in \Gamma} \sum_{s_h} \sigma_h(s_h \mid \mathbf{z}^h)\,\alpha(s_h). \tag{4}$$

² We defer discussion on how we select the linearization points to Section 5.
In contrast to the reward in earlier stages, ρ̂ does not depend on decision rules at all. We interpret ρ̂ as the expectation of the prediction reward of a centralized decision-maker who selects the best α ∈ Γ after perceiving the full history of joint actions and observations at task termination.

Let π be a joint policy consisting of the joint decision rules δ0, δ1, . . . , δ_{h−1}. The value-to-go of π starting from a sufficient plan-time statistic σt at time step t is
$$Q_t^{\pi}(\sigma_t, \delta_t) = \begin{cases} \hat{R}_t(\sigma_t, \delta_t) + \hat{\rho}(\sigma_{t+1}) & \text{if } t = h-1,\\ \hat{R}_t(\sigma_t, \delta_t) + Q_{t+1}^{\pi}(\sigma_{t+1}, \delta_{t+1}) & \text{if } 0 \le t < h-1, \end{cases} \tag{5}$$
with the shorthand σ_{t+1} = U_{ss}(σt, δt). The value function V^π of a joint policy π is defined as the sum of expected rewards when the agents act according to π, $V^{\pi}(\sigma_0) \triangleq Q_0^{\pi}(\sigma_0, \delta_0)$. The value function of an optimal policy π∗ is denoted V∗, and it satisfies V∗(σ0) ≥ V^π(σ0) for all π. We write $V_M^{\pi}$ and $V_{\langle M,\Gamma\rangle}^{\pi}$ for the value function in a standard Dec-POMDP and a Dec-ρPOMDP, respectively.
4 Conversion to standard Dec-POMDP

In this section, we show that any Dec-ρPOMDP can be converted to a standard Dec-POMDP by adding individual prediction actions for each agent, and by introducing a decentralized prediction reward. Each individual prediction action corresponds to a selection of a tangent hyperplane of the centralized prediction reward, as illustrated in Figure 1. As the individual prediction actions are chosen in a decentralized manner, the decentralized prediction reward never exceeds the centralized prediction reward. Consequently, we prove that an optimal policy of the standard Dec-POMDP may be applied to the Dec-ρPOMDP with bounded error compared to the true optimal policy. We give a sufficient condition for when this loss due to decentralization is zero and a Dec-ρPOMDP is equivalent to a standard Dec-POMDP.

A major implication of our results is that it is possible to apply any Dec-POMDP solver to a Dec-ρPOMDP problem. We approximate a centralized prediction reward by a decentralized prediction reward that only depends on states and actions. This allows planning for active perception problems without explicit computation of joint state estimates.

We first introduce our proposed conversion and prove its properties, including the error bound. We conclude the section by giving a sufficient condition for when the two problems are equivalent. All omitted proofs are found in the supplementary material.
Definition 3. Given a Dec-ρPOMDP ⟨M, Γ⟩ with M = ⟨h, I, S, b0, A, Z, T, R⟩, convert it to a standard Dec-POMDP M+ = ⟨h + 1, I, S, b0, A+, Z+, T+, R+⟩ where the horizon is incremented by one and

• in A+, the individual action space A_{i,h} for each agent i ∈ I at time h is a set of individual prediction actions a_{i,h}, with one individual prediction action for each tangent hyperplane α ∈ Γ; for other time steps A_{i,t} are as in A,

• in Z+, the individual observation space Z_{i,h+1} for each agent i ∈ I at time h + 1 contains a single null observation; for other time steps Z_{i,t} are as in Z,

• T+(z_{h+1}, s_{h+1} | s_h, a_h) has probability of one for the joint null observation and s_{h+1} = s_h, and is otherwise zero; for other time steps T+ is equal to T, and

• in R+, the reward function R_h at time step h is a linear combination of the individual prediction rewards R_{i,h} of each agent, such that for a joint prediction action a_h = ⟨a_{1,h}, . . . , a_{n,h}⟩ the decentralized prediction reward is
$$R_h(s_h, \mathbf{a}_h) = \frac{1}{n}\sum_{i=1}^{n} R_{i,h}(s_h, a_{i,h}), \tag{6}$$
with $R_{i,h}(s_h, a_{i,h}) \triangleq \alpha_{a_{i,h}}(s_h)$, where $\alpha_{a_{i,h}} \in \Gamma$ is the tangent hyperplane corresponding to the individual prediction action a_{i,h}; for other time steps R_t are as in R.
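The reward part of the conversion amounts to tabulating Eq. (6). A small sketch, using a hypothetical two-element Γ for a two-state problem (the α values are made up for illustration):

```python
import itertools

# Hypothetical Gamma for a 2-state problem: one alpha per prediction action.
Gamma = [[0.0, -2.0], [-2.0, 0.0]]
n_agents = 2
S = [0, 1]

# One individual prediction action per tangent hyperplane in Gamma.
A_pred = range(len(Gamma))

# Eq. (6): the decentralized prediction reward R_h(s_h, a_h) is the average
# of the individual prediction rewards R_{i,h}(s_h, a_{i,h}) = alpha_{a_{i,h}}(s_h).
R_h = {
    (s, a): sum(Gamma[a_i][s] for a_i in a) / n_agents
    for s in S
    for a in itertools.product(A_pred, repeat=n_agents)
}

assert R_h[(0, (0, 0))] == 0.0    # both agents select alpha_0 in state 0
assert R_h[(0, (0, 1))] == -1.0   # agents disagree: average of 0 and -2
```

Because R_h depends only on the state and the joint prediction action, the converted problem has a standard state-and-action reward that any Dec-POMDP solver accepts.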
As the action and observation sets in M and M+ are the same until t = h, past joint policies are interchangeable, and plan-time sufficient statistics are identical.

Lemma 1. Let ⟨M, Γ⟩ be a Dec-ρPOMDP and define M+ as above. Then, for any past joint policy ϕt with t ≤ h, the respective plan-time sufficient statistics σt in ⟨M, Γ⟩ and σt⁺ in M+ are equivalent, i.e., σt ≡ σt⁺.
In M+ the horizon is incremented, so that each agent takes one more action than in ⟨M, Γ⟩. The additional action is one of the newly added individual prediction actions at time step h. To select an individual prediction action, an agent may use 1) individual information, that is, the agent's individual observation sequence, and 2) plan-time information common to all agents, that is, the plan-time sufficient statistic. An individual prediction rule φi maps the individual observation sequence z_{i,h} and the plan-time sufficient statistic σh to an individual prediction action. A joint prediction rule φ = ⟨φ1, . . . , φn⟩ is a tuple of individual prediction rules, and maps a joint observation sequence z^h and σh to a joint prediction action. The expected decentralized prediction reward given σh and φ is defined analogously to Eq. (3) as $\hat{R}_h(\sigma_h, \phi) \triangleq \sum_{\mathbf{z}^h, s_h} \sigma_h(s_h, \mathbf{z}^h)\,R_h(s_h, \phi(\mathbf{z}^h, \sigma_h))$.
An optimal joint prediction rule maximizes the expected decentralized prediction reward. We prove that an optimal joint prediction rule consists of individual prediction rules that maximize the expected individual prediction reward. We then show that the expected decentralized prediction reward is at most equal to the centralized prediction reward, and that a similar relation holds between value functions in the respective Dec-POMDP and Dec-ρPOMDP. The key property required is that the decentralized prediction reward is a sum of individual prediction rewards.

Lemma 2 (Optimal joint prediction rule). Let ⟨M, Γ⟩ be a Dec-ρPOMDP and define M+ as above, and let σh be a plan-time sufficient statistic for any past joint policy. Then the joint prediction rule φ∗ = ⟨φ∗₁, . . . , φ∗ₙ⟩, where each individual prediction rule φ∗ᵢ is defined as
$$\phi_i^*(\mathbf{z}_{i,h}, \sigma_h) \triangleq \operatorname*{argmax}_{a_{i,h} \in A_{i,h}} \sum_{\mathbf{z}_{-i,h}, s_h} \sigma_h(s_h, \mathbf{z}_{-i,h} \mid \mathbf{z}_{i,h})\,R_{i,h}(s_h, a_{i,h}), \tag{7}$$
maximizes the expected decentralized prediction reward, that is, $\hat{R}_h(\sigma_h, \phi^*) = \max_{\phi} \hat{R}_h(\sigma_h, \phi)$.
In the Dec-ρPOMDP ⟨M, Γ⟩, the centralized prediction reward is determined by a centralized virtual agent that has access to the observation histories of all agents. In our converted standard Dec-POMDP M+, each agent individually takes a prediction action, leading to the decentralized prediction reward. The decentralized prediction reward never exceeds the centralized prediction reward, as we show next.

Lemma 3. The expected decentralized prediction reward R̂h(σh, φ∗) in M+ is at most equal to the expected centralized prediction reward ρ̂(σh) in ⟨M, Γ⟩, i.e., R̂h(σh, φ∗) ≤ ρ̂(σh).

Lemma 4. Let ⟨M, Γ⟩ and M+ be as defined above. Let ϕh be a past joint policy for M+, and let φ∗ be the optimal joint prediction rule. Then, the value of ϕh ◦ φ∗ in M+ is at most equal to the value of ϕh in ⟨M, Γ⟩, i.e., $V_{M^+}^{\varphi_h \circ \phi^*}(\sigma_0) \le V_{\langle M,\Gamma\rangle}^{\varphi_h}(\sigma_0)$.
We now show that the difference between the optimal values of the standard Dec-POMDP M+ and the Dec-ρPOMDP ⟨M, Γ⟩ is bounded. We call the error the loss due to decentralization.

Theorem 1 (Loss due to decentralization). Consider a Dec-ρPOMDP ⟨M, Γ⟩ with the optimal value function $V_{\langle M,\Gamma\rangle}^{*}$. Let π be an optimal policy for the standard Dec-POMDP M+ created as in Definition 3, and denote by ϕh the past joint policy consisting of the first h decision rules of π. Then the difference of $V_{\langle M,\Gamma\rangle}^{*}$ and the value function $V_{\langle M,\Gamma\rangle}^{\varphi_h}$ of applying ϕh to ⟨M, Γ⟩ is bounded by
$$V_{\langle M,\Gamma\rangle}^{*}(\sigma_0) - V_{\langle M,\Gamma\rangle}^{\varphi_h}(\sigma_0) \le 2 \max_{\sigma_h}\big[\hat{\rho}(\sigma_h) - \hat{R}_h(\sigma_h, \phi^*)\big], \tag{8}$$
where φ∗ is the optimal joint prediction rule.
Proof. For clarity, in the following we omit the argument σ0 from the value functions. Suppose the optimal policy π∗ in ⟨M, Γ⟩ consists of the joint decision rules δ∗0, . . . , δ∗_{h−1}. For M+, define the partial joint policy ϕ∗h = (δ∗0, . . . , δ∗_{h−1}) and its extension ϕ∗h ◦ φ∗. By Lemma 2, an optimal joint policy π of M+ is an extension of some past joint policy ϕh by φ∗, i.e., π = ϕh ◦ φ∗. Because ϕh ◦ φ∗ is optimal in M+, we have that $V_{M^+}^{\varphi_h^* \circ \phi^*} \le V_{M^+}^{\varphi_h \circ \phi^*}$. By Lemma 4, $V_{M^+}^{\varphi_h \circ \phi^*} \le V_{\langle M,\Gamma\rangle}^{\varphi_h}$. Finally, because π∗ is optimal in ⟨M, Γ⟩, we have $V_{\langle M,\Gamma\rangle}^{\varphi_h} \le V_{\langle M,\Gamma\rangle}^{*}$. Now,
$$\begin{aligned}
\big|V_{\langle M,\Gamma\rangle}^{*} - V_{M^+}^{\varphi_h}\big| &\le \big|V_{\langle M,\Gamma\rangle}^{*} - V_{M^+}^{\varphi_h^* \circ \phi^*}\big| + \big|V_{M^+}^{\varphi_h^* \circ \phi^*} - V_{M^+}^{\varphi_h}\big| && \text{(9a)}\\
&\le \big|V_{\langle M,\Gamma\rangle}^{*} - V_{M^+}^{\varphi_h^* \circ \phi^*}\big| + \big|V_{M^+}^{\varphi_h^* \circ \phi^*} - V_{\langle M,\Gamma\rangle}^{*}\big| = 2\,\big|V_{\langle M,\Gamma\rangle}^{*} - V_{M^+}^{\varphi_h^* \circ \phi^*}\big| && \text{(9b)}\\
&= 2\,\Big|\sum_{t=0}^{h-1}\hat{R}_t(\sigma_t^*, \delta_t^*) + \hat{\rho}(\sigma_h^*) - \Big(\sum_{t=0}^{h-1}\hat{R}_t(\sigma_t^*, \delta_t^*) + \hat{R}_h(\sigma_h^*, \phi^*)\Big)\Big| && \text{(9c)}\\
&= 2\,\big[\hat{\rho}(\sigma_h^*) - \hat{R}_h(\sigma_h^*, \phi^*)\big] \le 2\max_{\sigma_h}\big[\hat{\rho}(\sigma_h) - \hat{R}_h(\sigma_h, \phi^*)\big]. && \text{(9d)}
\end{aligned}$$
We first apply the triangle inequality. The second inequality holds as $V_{M^+}^{\varphi_h^* \circ \phi^*} \le V_{\langle M,\Gamma\rangle}^{\varphi_h} \le V_{\langle M,\Gamma\rangle}^{*}$. The next equality is by symmetry of the absolute difference. The third line follows from the definition of the value function and Lemma 1. Note that we refer by σ∗t to the sufficient plan-time statistics reached under the partial joint policies ϕ∗t of π∗. The final inequality follows by maximizing the difference of the expected centralized and decentralized prediction rewards.
The theorem shows that given a DecρPOMDP ⟨M, Γ⟩, we may solve the standard DecPOMDP M+ and apply its optimal policy to the DecρPOMDP with bounded error. If there is only a single agent, that agent is equivalent to the centralized virtual agent, which implies ρˆ ≡ Rˆh and that the loss is zero. The equivalence of a ρPOMDP with a convex prediction reward and a POMDP with information rewards, first shown in [20], is therefore obtained as a special case.
The proof indicates that the error is at most twice the difference of the expected centralized prediction reward and the expected decentralized prediction reward at the plan-time sufficient statistic σ∗h at time h under an optimal policy π∗ of ⟨M, Γ⟩. This suggests that the final bound, given as a maximum over all plan-time sufficient statistics, may be overly pessimistic in some cases. While a complete characterization of such cases is beyond the scope of this paper, we give below a sufficient condition for when the error is zero in the multiagent case. The proof is in the supplementary material.
Observation 1. Consider the setting of Theorem 1. Assume that the observation sequence of each agent is conditionally independent of the observation sequences of all other agents given the past joint policy and initial state distribution, i.e., for every agent i, σh(zh) = σh(zi,h)σh(z−i,h). Then π∗ is an optimal joint policy for ⟨M, Γ⟩ if and only if π∗ ◦ φ∗ is an optimal joint policy for M+.
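The factorization condition of Observation 1 can be checked numerically. The sketch below is illustrative: it assumes two agents and a tabular representation of the distribution over joint observation sequences under the past joint policy, which is not how the paper represents σh.

```python
# Check the condition sigma_h(z_h) = sigma_h(z_i,h) * sigma_h(z_-i,h) of
# Observation 1 for two agents. sigma_z maps (z_i, z_rest) pairs of
# observation sequences to their probability (illustrative representation).

def observations_independent(sigma_z, tol=1e-9):
    marg_i, marg_rest = {}, {}
    for (zi, zr), p in sigma_z.items():
        marg_i[zi] = marg_i.get(zi, 0.0) + p
        marg_rest[zr] = marg_rest.get(zr, 0.0) + p
    # The factorization must hold for every pair, including pairs that
    # have probability zero under sigma_z.
    return all(abs(sigma_z.get((zi, zr), 0.0) - pi * pr) <= tol
               for zi, pi in marg_i.items() for zr, pr in marg_rest.items())
```

A product distribution passes the check, while perfectly correlated observation sequences fail it, so the zero-loss guarantee of Observation 1 would not apply in the latter case.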
The sufficient condition above is restrictive. Informally, it requires that each agent executes its own independent active perception task. However, the loss due to decentralization may be small if a multiagent active perception task almost satisfies the requirement. It might be possible to derive less restrictive sufficient conditions in weakly-coupled problems by building on influence-based abstractions [17]. Such weakly-coupled cases may arise, e.g., in distributed settings where two agents deployed in different regions far away from each other have limited influence on each other.
5 Adaptive prediction action search
Recall that the centralized prediction reward function ρ is obtained by approximating a continuous convex function f by a set of α-vectors. The approximation is more accurate the more α-vectors
Algorithm 1 Adaptive prediction action search (APAS) for DecρPOMDP planning
Input: DecρPOMDP ⟨M, Γ⟩, convex function f : ∆(S) → R, number of linearization points K
Output: Best joint policy found, πbest
1: Vbest ← −∞, πbest ← ∅
2: repeat
3:   // Policy optimization phase
4:   M+ ← CONVERTDECPOMDP(M, Γ)   ▷ Apply Definition 3
5:   π ← PLAN(M+)   ▷ Use any DecPOMDP planner
6:   V ← EVALUATE(π)
7:   if V > Vbest then Vbest ← V, πbest ← π
8:   // Adaptation phase
9:   Γ ← ∅
10:  for k = 1, . . . , K do
11:    zh ∼ σh(zh)   ▷ Sample joint observation sequence zh using πbest
12:    bk ← σh(· | zh)   ▷ Final state estimate corresponding to zh
13:    αk ← ∇f(bk) − f∗(∇f(bk))   ▷ Tangent hyperplane of f at bk
14:    Γ ← Γ ∪ {αk}
15:  end for
16: until converged
17: return πbest
are used; for error bounds, see [1, 21]. We propose a planning algorithm for DecρPOMDPs called Adaptive Prediction Action Search (APAS) that dynamically updates the α-vectors during planning. This avoids the need to create a priori a large number of α-vectors, some of which may turn out not to be useful. Moreover, APAS allows application of any standard DecPOMDP algorithm to solve DecρPOMDPs by using the conversion proposed in Section 4. This can avoid the computation of all reachable state estimates required by earlier algorithms for DecρPOMDPs [11, 12].
The pseudocode for APAS is shown in Algorithm 1. APAS consists of two phases that are repeated until convergence: the policy optimization phase and the α-vector adaptation phase. In the policy optimization phase, the best joint policy for the current set Γ of α-vectors is found. On the first iteration, we initialize Γ randomly, as explained in the next section. In the adaptation phase, the α-vectors are then modified such that they best approximate the final reward for the joint state estimates that are most likely to be reached by the current joint policy. In the optimization phase, the DecρPOMDP is converted to a standard DecPOMDP as described in Section 4. We then apply any standard DecPOMDP algorithm to plan a joint policy π for the converted DecPOMDP. If the value of π exceeds the value of the best policy found so far, πbest, the best policy is updated. In the adaptation phase, we sample a final state distribution bk under the current best policy and insert the tangent hyperplane at bk into Γ. This corresponds to sampling a point on the horizontal axis in Figure 1 and adding the corresponding tangent hyperplane to Γ. This is repeated K times. Specifically, we simulate a trajectory under the current best policy πbest to sample a joint observation sequence (Line 11), use Bayesian filtering to compute the corresponding state estimate bk (Line 12), and compute the corresponding α-vector (Line 13). The updated set Γ of α-vectors provides the best approximation of the final reward at the sampled state estimates. The time complexity of APAS is determined by the complexity of the DecPOMDP planner called in PLAN. A reference implementation is available at https://github.com/laurimi/multiagentpredictionreward.
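The adaptation phase (Lines 9–15 of Algorithm 1) can be sketched as follows, specialized to the negative entropy f(b) = Σs b(s) ln b(s) used later in the experiments. The function `sample_final_belief` is a placeholder standing in for simulating the current best policy and Bayesian filtering (Lines 11–12); all names are illustrative, not the paper's implementation.

```python
import math

def tangent_alpha(b, eps=1e-12):
    # Tangent hyperplane of f(b) = sum_s b(s) ln b(s) at belief b:
    # the gradient is ln b(s) + 1 and the conjugate offset is -1,
    # so the tangent alpha-vector is alpha(s) = ln b(s).
    return [math.log(max(p, eps)) for p in b]

def adapt_alpha_vectors(sample_final_belief, K):
    Gamma = []
    for _ in range(K):                    # Line 10
        b_k = sample_final_belief()       # Lines 11-12 (placeholder)
        Gamma.append(tangent_alpha(b_k))  # Lines 13-14
    return Gamma

def approx_f(Gamma, b):
    # Piecewise-linear lower approximation: max_alpha alpha . b <= f(b),
    # tight at the sampled linearization points.
    return max(sum(a * p for a, p in zip(alpha, b)) for alpha in Gamma)
```

By convexity of f, each tangent satisfies α · b = f(b) at its own linearization point and α · b′ ≤ f(b′) everywhere else, which is why adapting the linearization points toward the beliefs actually reached by the policy tightens the approximation where it matters.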
6 Experiments
The DecρPOMDP we target is computationally more challenging (NEXP-complete [3]) than a centralized POMDP, POMDP-IR, or ρPOMDP (PSPACE-complete [19]). Solving the problem we target as a centralized problem would require immediate all-to-all communication during task execution, and the resulting optimal value would be higher [16]. Therefore, a comparison to centralized methods is neither fair nor necessary. We compare APAS to the NPGI algorithm of [11], which solves a DecρPOMDP by iterative improvement of a fixed-size policy graph. As the PLAN subroutine of APAS, we use the finite-horizon variant of the policy graph improvement method of [18]. This method is algorithmically similar to NPGI, which helps isolate the effect of applying the proposed conversion to a standard DecPOMDP from the effect of algorithmic design choices to the extent
Table 1: Average policy values ± standard error in the MAV (top) and the rovers (bottom) domains.

MAV
h   APAS (ours)      APAS (no adapt.)   NPGI [12]   Optimal
2   2.006 ± 0.005    2.112 ± 0.002      1.931       1.919
3   1.936 ± 0.004    2.002 ± 0.002      1.833       1.831
4   1.879 ± 0.004    1.943 ± 0.002      1.768       —
5   1.842 ± 0.005    1.918 ± 0.002      1.725       —
6   1.814 ± 0.004    1.898 ± 0.002      —           —
7   1.789 ± 0.004    1.892 ± 0.003      —           —
8   1.820 ± 0.005    1.885 ± 0.006      —           —

Rovers
h   APAS (ours)      APAS (no adapt.)   NPGI [12]   Optimal
2   3.484 ± 0.002    3.800 ± 0.006      3.495       3.479
3   3.402 ± 0.008    3.680 ± 0.006      3.192       3.189
4   3.367 ± 0.011    3.640 ± 0.006      3.036       —
5   3.293 ± 0.012    3.591 ± 0.006      2.981       —
6   3.333 ± 0.012    3.631 ± 0.007      —           —
7   3.375 ± 0.014    3.673 ± 0.008      —           —
8   3.496 ± 0.014    3.860 ± 0.010      —           —
possible. Any other DecPOMDP algorithm may also be used with APAS. Details on the algorithms and parameter settings are provided in the supplementary material.
We evaluate on the DecρPOMDP domains from [11]: the micro air vehicle (MAV) domain and the information gathering rovers domain. In the MAV domain, two agents cooperatively track a moving target to determine if the target is friendly or hostile. In the rovers domain, two agents collect samples of scientific data in a grid environment. The reward at the final time step is the negative entropy of the joint state estimate bh, that is, f(bh) = Σs bh(s) ln bh(s). For the MAV problem, we use K = 2, and for the rovers problem, K = 5 individual prediction actions. For APAS, we initialize Γ by randomly sampling linearization points bk ∈ ∆(S), k = 1, . . . , K using [23]. The corresponding α-vectors for the negative entropy are αk(s) = ln bk(s). We observed only minor benefits from using a greater number of α-vectors, and only for long horizons; details are reported in the supplementary material. We run 100 repetitions using APAS and report the average policy value and its standard error. Results for NPGI are as reported in [12]. To ensure comparability, all reported policy values are computed using f as the final reward. As a baseline, we report the optimal value if known.
Comparison to the state of the art. The results are shown in Table 1. While APAS does not exceed state-of-the-art performance in terms of policy quality, its major advantage is that it is able to scale to longer horizons than NPGI. The number of joint state distributions reachable under the current policy grows exponentially in the horizon h. Computation of these state distributions is required by NPGI, causing it to run out of memory with h > 5.
While running APAS, we compute the value of the policies exactly (Alg. 1, Line 6). For long horizons, exact evaluation requires a signiﬁcant fraction of the computation budget. We expect that further scaling in terms of planning horizon is possible by switching to approximate evaluation, at the cost of noisy value estimates. Results on computation time as well as results for horizons up to 10 for the Rovers problem are reported in the supplementary material.
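The switch from exact to approximate evaluation mentioned above can be sketched as a Monte Carlo replacement for the EVALUATE step (Alg. 1, Line 6). Here `simulate_episode` is a placeholder that executes the joint policy once and returns the episode return; it is not part of the paper's implementation.

```python
import statistics

def mc_evaluate(simulate_episode, n_rollouts=1000):
    # Monte Carlo policy evaluation: average the returns of independent
    # rollouts. The estimate is noisy, with standard error shrinking as
    # O(1/sqrt(n_rollouts)).
    returns = [simulate_episode() for _ in range(n_rollouts)]
    mean = statistics.fmean(returns)
    stderr = statistics.stdev(returns) / n_rollouts ** 0.5
    return mean, stderr
```

The trade-off is exactly the one described in the text: each rollout is cheap compared to exact evaluation, but the comparison against πbest on Line 7 then operates on noisy value estimates.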
Benefit of adaptive prediction action selection. We disable the adaptation phase of APAS by not running Lines 9–15. Instead, at the start of each iteration of the main loop, we randomly sample K linearization points using [23] to create Γ and solve the corresponding DecPOMDP. We repeat this procedure 1000 times and report the results in the column “APAS (no adapt.)” of Table 1. We conclude that the α-vector adaptation clearly improves performance in both domains.
7 Conclusion
We showed that multiagent active perception modelled as a DecρPOMDP can be reduced to a standard DecPOMDP by introducing individual prediction actions. The difference between the optimal solution of the standard DecPOMDP and the DecρPOMDP is bounded. Our reduction enables application of any standard DecPOMDP solver to multiagent active perception problems, as demonstrated by our proposed APAS algorithm.
Our results allow transferring advances in scalability for standard DecPOMDPs to multiagent active perception tasks. In multiagent reinforcement learning, rewards typically depend on the underlying state of the system. Therefore, our reduction result also enables further investigation into learning for multiagent active perception. An investigation of the necessary conditions for when the loss due to decentralization is zero is another future direction.
Broader Impact
This work is theoretical in nature, and therefore does not present any foreseeable societal consequence.
Acknowledgments
This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No. 758824, INFLUENCE).
References
[1] Mauricio Araya-López, Olivier Buffet, Vincent Thomas, and Francois Charpillet. A POMDP Extension with Belief-dependent Rewards. In Advances in Neural Information Processing Systems, pages 64–72, 2010.
[2] Ruzena Bajcsy, Yiannis Aloimonos, and John K Tsotsos. Revisiting active perception. Autonomous Robots, 42(2):177–196, 2018.
[3] Daniel S Bernstein, Robert Givan, Neil Immerman, and Shlomo Zilberstein. The complexity of decentralized control of Markov decision processes. Mathematics of Operations Research, 27(4):819–840, 2002.
[4] Graeme Best, Oliver M Cliff, Timothy Patten, Ramgopal R Mettu, and Robert Fitch. Dec-MCTS: Decentralized planning for multi-robot active perception. The International Journal of Robotics Research, 38(2-3):316–337, 2019.
[5] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[6] Jesus Capitan, Matthijs T.J. Spaan, Luis Merino, and Anibal Ollero. Decentralized multi-robot cooperation with auctioned POMDPs. The International Journal of Robotics Research, 32(6):650–671, 2013.
[7] Micah Corah and Nathan Michael. Distributed matroid-constrained submodular maximization for multi-robot exploration: Theory and practice. Autonomous Robots, 43(2):485–501, 2019.
[8] Bahman Gharesifard and Stephen L Smith. Distributed submodular maximization with limited information. IEEE Transactions on Control of Network Systems, 5(4):1635–1645, 2017.
[9] Manish Jain, Matthew Taylor, Milind Tambe, and Makoto Yokoo. DCOPs meet the real world: Exploring unknown reward matrices with applications to mobile sensor networks. In Intl. Joint Conference on Artiﬁcial Intelligence (IJCAI), pages 181–186, 2009.
[10] Mikko Lauri, Eero Heinänen, and Simone Frintrop. Multi-robot active information gathering with periodic communication. In IEEE Intl. Conf. on Robotics and Automation (ICRA), pages 851–856, 2017.
[11] Mikko Lauri, Joni Pajarinen, and Jan Peters. Information Gathering in Decentralized POMDPs by Policy Graph Improvement. In Intl. Conf. on Autonomous Agents and Multiagent Systems (AAMAS), pages 1143–1151, 2019.
[12] Mikko Lauri, Joni Pajarinen, and Jan Peters. Multi-agent active information gathering in discrete and continuous-state decentralized POMDPs by policy graph improvement. Autonomous Agents and Multi-Agent Systems, 34(42):1–44, 2020. Issue no. 2.
[13] Duc Thien Nguyen, William Yeoh, Hoong Chuin Lau, Shlomo Zilberstein, and Chongjie Zhang. Decentralized multiagent reinforcement learning in averagereward dynamic DCOPs. In AAAI Conference on Artiﬁcial Intelligence, pages 1447–1455, 2014.
[14] Frans A. Oliehoek. Sufficient plan-time statistics for decentralized POMDPs. In Intl. Joint Conference on Artificial Intelligence (IJCAI), pages 302–308, 2013.
[15] Frans A. Oliehoek and Christopher Amato. A Concise Introduction to Decentralized POMDPs. Springer, 2016.
[16] Frans A. Oliehoek, Matthijs T.J. Spaan, and Nikos Vlassis. Optimal and approximate Q-value functions for decentralized POMDPs. Journal of Artificial Intelligence Research, 32:289–353, 2008.