
Context-Specific Multiagent Coordination and Planning with Factored MDPs
Carlos Guestrin, Shobha Venkataraman, Daphne Koller
Stanford University

Construction Crew Problem: Dynamic Resource Allocation

Joint Decision Space
Represent as MDP:
Action space: joint action a for all agents
State space: joint state x of all agents
Reward function: total reward r
The action space is exponential: an action is an assignment a = {a1,…, an}.
The state space is exponential in the number of variables, and a global decision requires complete observation.


Context-Specific Structure

Summary: Context-Specific Coordination

Summary of Algorithm

1. Pick local rule-based basis functions hi

2. Single LP algorithm for Factored MDPs obtains Qi’s

3. Variable coordination graph computes maximizing action

Construction Crew Problem

SysAdmin: Rule-based vs. Table-based

Multiagent Coordination Examples
Search and rescue; factory management; supply chain; firefighting; network routing; air traffic control.
Multiple, simultaneous decisions. Limited observability. Limited communication.

Comparing to Apricodd [Boutilier et al. ’96-’99]

Conclusions and Extensions

Multiagent planning algorithm: Variable coordination structure; Limited context-specific communication; Limited context-specific observability.

Solve large MDPs!

Extensions to hierarchical and relational models

Stanford University → CMU

Agent 2: Plumbing, Painting
Agent 1: Foundation, Electricity, Plumbing
Agent 3: Electricity, Painting
Agent 4: Decoration

WANTED: Agents that coordinate to build and maintain houses, but only when necessary!

Foundation → {Electricity, Plumbing} → Painting → Decoration

Local Q-function Approximation


Q(A1,…,A4, X1,…,X4) ≈ Q1(A1,A4, X1,X4) + Q2(A1,A2, X1,X2) + Q3(A2,A3, X2,X3) + Q4(A3,A4, X3,X4)

Q3 is associated with Agent 3, which observes only X2 and X3.

Limited observability: agent i only observes the variables in Qi.
Must choose the action that maximizes ∑i Qi.
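As a concrete illustration, here is a minimal Python sketch of this decomposition; the names LocalQ and global_q are illustrative assumptions, not code from the paper. Each local piece stores only its own scope, which is exactly why agent i needs such limited observability.

```python
# Minimal sketch (illustrative names): each Q_i sees only a few agents'
# actions and a few state variables.

class LocalQ:
    def __init__(self, action_scope, state_scope, table):
        self.action_scope = action_scope   # e.g. ('A2', 'A3') for Q3
        self.state_scope = state_scope     # e.g. ('X2', 'X3') for Q3
        self.table = table                 # (actions + states) tuple -> value

    def value(self, joint_action, joint_state):
        # The agent reads only the variables in its own scope,
        # never the full joint state.
        a = tuple(joint_action[v] for v in self.action_scope)
        x = tuple(joint_state[v] for v in self.state_scope)
        return self.table[a + x]

def global_q(local_qs, joint_action, joint_state):
    # Q(A1..A4, X1..X4) is approximated by the sum of the local pieces.
    return sum(q.value(joint_action, joint_state) for q in local_qs)
```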

Problems with Coordination Graph

Tasks last multiple time steps. Failures cause chain reactions. Multiple houses.

[Plots: running time (seconds) vs. number of agents, comparing csi and non-csi.
Bidirectional Ring: quadratic fit y = 0.53x² - 0.96x - 0.01, R² = 0.990.
Reverse Star (server topology): exponential fit y = 0.000049·exp(2.27x), R² = 0.9992.]

Actual value of resulting policies:

Problem     Optimal   Apricodd   Rule-based
Expon06     530.9     530.9      530.9
Expon08     77.09     77.09      77.09
Expon10     0.034     0.034      0.034
Linear06    531.4     531.4      531.4
Linear08    430.5     430.5      430.5
Linear10    348.7     348.7      348.7

[Plot: Running Times for the 'Linear' Problems; time (seconds) vs. no. of variables (6–20).
Apricodd: y = 0.1473x³ - 0.8595x² + 2.5006x - 1.5964, R² = 0.9997.
Rule-based: y = 0.0254x² + 0.0363x + 0.0725, R² = 0.9983.]

[Plot: Running Times for the 'Expon' Problems; time (seconds, 0–500) vs. no. of variables (6–12), Apricodd vs. Rule-based.]

Context-Specific Coordination Structure

Problems: table size is exponential in the number of variables; messages are tables; agents communicate even when it is not necessary; the coordination structure is fixed.
What we want: use the structure in the tables; a variable coordination structure.
Exploit context-specific independence!

[Figure: coordination graph over agents A1–A4 with local rule sets Q1–Q4]

Local value rules represent context-specific structure:
⟨q1 : Plumbing_not_done ∧ A1 = Plumb ∧ A2 = Plumb : -100⟩
Set of rules Qi for each agent; the agents must coordinate to maximize the total value: max_a ∑i Qi.

Rule-based variable elimination [Zhang and Poole ’99]

Maximizing out A1

Rule-based coordination graph for finding optimal action

A: simplification on instantiation of the state
B: simplification when passing messages
C: simplification on maximization
Simplification by approximation

Variable agent communication structure: the coordination structure is dynamic.
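To make step A concrete, here is a small Python sketch of value rules as (context, value) pairs and of simplification on instantiation of the state; the representation and the rule values are illustrative assumptions, not the authors' code.

```python
# Sketch of value rules and simplification step A (illustrative).
# A rule is (context, value); context maps variables to required values.

def instantiate(rules, evidence):
    """Drop rules inconsistent with the observed state and remove the
    satisfied state literals from the surviving contexts."""
    out = []
    for context, value in rules:
        if any(var in evidence and context[var] != evidence[var]
               for var in context):
            continue                        # rule can never fire in this state
        reduced = {var: val for var, val in context.items()
                   if var not in evidence}
        out.append((reduced, value))
    return out

rules = [({'A2': 1, 'A3': 1, 'X': 1}, 0.1),   # <a2 ^ a3 ^ x : 0.1>
         ({'A3': 1, 'A4': 1, 'X': 1}, 3.0),   # <a3 ^ a4 ^ x : 3>
         ({'A1': 1, 'A6': 1, 'X': 0}, 3.0)]   # <a1 ^ a6 ^ not-x : 3>
print(instantiate(rules, {'X': 1}))           # third rule is pruned
```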

Long-term Utility = Value of MDP
Value computed by linear programming:
One variable V(x) for each state
One constraint for each state x and action a
The number of states and actions is exponential!

minimize: ∑x V(x)
subject to: V(x) ≥ Q(x,a)  ∀ x, a
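As a toy instance of this LP (a made-up 3-state, 2-action Q table, solved with scipy.optimize.linprog purely for illustration), note that the constraint loop enumerates every state-action pair, which is exactly the scalability problem:

```python
import numpy as np
from scipy.optimize import linprog

Q = np.array([[1.0, 2.0],      # Q(x0, a0), Q(x0, a1)  (made-up numbers)
              [0.5, 3.0],
              [2.5, 1.0]])
n_states, n_actions = Q.shape

c = np.ones(n_states)          # minimize sum_x V(x)
A_ub, b_ub = [], []
for x in range(n_states):
    for a in range(n_actions):  # one constraint per (x, a): V(x) >= Q(x, a)
        row = np.zeros(n_states)
        row[x] = -1.0           # rewritten as -V(x) <= -Q(x, a)
        A_ub.append(row)
        b_ub.append(-Q[x, a])

res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
              bounds=[(None, None)] * n_states)
print(res.x)                   # -> [2.0, 3.0, 2.5] = max_a Q(x, a)
```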

Decomposable Value Function

Linear combination of restricted-domain basis functions:
Ṽ(x) = ∑i wi hi(x)
Each hi is a rule over a small part of a complex system, e.g. the value of having two agents in the same house, or the value of two agents painting a house together.
Must find w giving a good approximate value function.

Single LP Solution for Factored MDPs

minimize: ∑i wi ∑x hi(x)
subject to: ∑i wi hi(x) ≥ Q(x,a)  ∀ x, a

One variable wi for each basis function: polynomially many LP variables.
One constraint for every state and action.
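A sketch of the same LP in the basis weights, again with scipy.optimize.linprog and illustrative names; the naive loop below still enumerates every (x, a), which is what the rule-based construction later in the deck avoids:

```python
import numpy as np
from scipy.optimize import linprog

def approx_lp(H, Q):
    """H[x, i] = h_i(x); Q[x, a] = long-term utilities (toy, table-based)."""
    n_states, k = H.shape
    c = H.sum(axis=0)           # objective: sum_i w_i * sum_x h_i(x)
    A_ub, b_ub = [], []
    for x in range(n_states):
        for a in range(Q.shape[1]):
            A_ub.append(-H[x])  # sum_i w_i h_i(x) >= Q(x, a)
            b_ub.append(-Q[x, a])
    res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  bounds=[(None, None)] * k)
    return res.x                # fitted weights w, one per basis function
```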

Factored MDP

[DBN figure: state variables Plumbing_i and Painting_i at time t, Plumbing_i' and Painting_i' at time t+1, with reward R and action A2; edges mark required tasks and dependent tasks]

Agent 2: Plumbing, Painting
Agent 1: Foundation, Electricity, Plumbing
Agent 3: Electricity, Painting
Agent 4: Decoration

[Schweitzer and Seidmann ‘85]

[Guestrin et al. ’01]

Rule-based variable elim. Exponentially smaller LP than table-based!

[Worked figure: rule-based coordination graph over agents A1–A6 with value rules
⟨a2 ∧ a3 ∧ x : 0.1⟩, ⟨a3 ∧ a4 ∧ x : 3⟩, ⟨a1 ∧ a2 ∧ a4 ∧ x : 3⟩, ⟨a1 ∧ a2 ∧ x : 5⟩, ⟨a1 ∧ a3 ∧ ¬x : 1⟩, ⟨a6 ∧ x : 7⟩, ⟨a1 ∧ a5 ∧ x : 4⟩, ⟨a5 ∧ a6 ∧ x : 2⟩, ⟨a1 ∧ a6 ∧ ¬x : 3⟩.
A: Instantiate current state, x = true: rules conditioned on ¬x are pruned, and x is dropped from the remaining contexts.
B: Eliminate variable A1: rules mentioning A1 are replaced by new rules over A1's neighbors.
C: Local maximization over the remaining agents.]

Outline

Given long-term utilities ∑i Qi(x,a): local message passing computes the maximizing action; variable coordination structure.
Long-term planning to obtain ∑i Qi(x,a): linear programming approach; exploit context-specific structure.

[Bellman et al. ‘63], [Tsitsiklis & Van Roy ’96], [Koller & Parr ’99,’00], [Guestrin et al. ’01]

Q(x,a) = R(x,a) + γ ∑x' P(x'|x,a) V(x')

Factored value function: V = ∑i wi hi
Factored Q-function: Q = ∑i Qi
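In matrix form the backprojection is a one-liner; the sketch below assumes a tabular R[x, a], per-action transition matrices P[a], and a value vector V (illustrative, not the authors' code):

```python
import numpy as np

def q_from_v(R, P, V, gamma=0.95):
    """Q(x, a) = R(x, a) + gamma * sum_x' P(x'|x, a) V(x').

    Shapes: R (n_states, n_actions); P (n_actions, n_states, n_states);
    V (n_states,).
    """
    # P[a] @ V gives E[V(x') | x, a] for every state x at once.
    expected_next = np.stack([P[a] @ V for a in range(R.shape[1])], axis=1)
    return R + gamma * expected_next
```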

Foundation → {Electricity, Plumbing} → Painting → Decoration

Example 1: 2 agents, 1 house
Agent 1 = {Foundation, Electricity, Plumbing}
Agent 2 = {Plumbing, Painting, Decoration}

Example 2: 4 agents, 2 houses
Agent 1 = {Painting, Decoration}; moves between houses
Agent 2 = {Foundation, Electricity, Plumbing, Painting}, house 1
Agent 3 = {Foundation, Electricity}, house 2
Agent 4 = {Plumbing, Decoration}, house 2


                                   Our rule-based approach          Apricodd
Algorithm based on                 Linear programming               Value iteration
Types of independence exploited    Additive and context-specific    Only context-specific
“Basis function” representation    Specified by user                Determined by algorithm

Introduction
Context-Specific Coordination, Given Qi's
Long-Term Planning, Computing Qi's
Experimental Results

Use Coordination graph [Guestrin et al. ’01]

Use variable elimination for maximization: [Bertele & Brioschi ‘72]

Limited communication for optimal action choice

Comm. bandwidth = induced width of coord. graph

Here we need only 23, instead of 63 sum operations.

[Figure: coordination graph over agents A1–A4 with local Q-functions Q1–Q4]

Computing Maximizing Action: Coordination Graph

max_{A1,A2,A3,A4} Q1(A1,A2) + Q2(A1,A3) + Q3(A3,A4) + Q4(A2,A4)
  = max_{A1,A2,A3} Q1(A1,A2) + Q2(A1,A3) + max_{A4} [Q3(A3,A4) + Q4(A2,A4)]
  = max_{A1,A2,A3} Q1(A1,A2) + Q2(A1,A3) + g1(A2,A3)

g1(A2,A3) gives, for every action of A2 and A3, the maximum value for A4.
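A minimal table-based variable-elimination sketch of this computation, assuming binary actions and factors stored as (scope, table) pairs; names are illustrative, and the rule-based version follows the same pattern on rules instead of tables:

```python
from itertools import product

def eliminate(factors, agent):
    """Max out one agent; returns the remaining factors plus a new factor
    over the agent's neighbours (one message-passing step)."""
    touching = [f for f in factors if agent in f[0]]
    rest = [f for f in factors if agent not in f[0]]
    scope = tuple(sorted({v for s, _ in touching for v in s} - {agent}))
    table = {}
    for assignment in product([0, 1], repeat=len(scope)):
        best = float('-inf')
        for a in (0, 1):
            full = dict(zip(scope, assignment))
            full[agent] = a
            best = max(best, sum(t[tuple(full[v] for v in s)]
                                 for s, t in touching))
        table[assignment] = best
    return rest + [(scope, table)]

def max_joint_value(factors, order):
    """`order` must cover every agent appearing in `factors`."""
    for agent in order:
        factors = eliminate(factors, agent)
    return sum(t[()] for _, t in factors)   # only empty scopes remain

# e.g. factors = [(('A1','A2'), q1), (('A1','A3'), q2),
#                 (('A3','A4'), q3), (('A2','A4'), q4)]
# value = max_joint_value(factors, ['A4', 'A1', 'A2', 'A3'])
```

Recovering the maximizing joint action only requires recording, at each elimination, which local choice achieved the max, then replaying those choices in reverse order.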

hi and Qi depend on small sets of variables and actions

Polynomial-time algorithm generates compact LP

subject to: ∑i wi hi(x) ≥ Q(x,a)  ∀ x, a
equivalently: 0 ≥ max_{x,a} [Q(x,a) - ∑i wi hi(x)]

For example:
0 ≥ max_{A,B,C,D} [f1(A,B) + f2(A,C) + f3(C,D) + f4(B,D)]
becomes
0 ≥ max_{A,B,C} [f1(A,B) + f2(A,C) + g1(B,C)]
g1(B,C) ≥ f3(C,D) + f4(B,D)  ∀ B, C, D
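A quick numerical sanity check, using random toy tables (illustrative only), that eliminating D through the auxiliary function g1 preserves the maximum:

```python
import numpy as np

rng = np.random.default_rng(0)
f1, f2, f3, f4 = (rng.normal(size=(2, 2)) for _ in range(4))

# Direct maximization over all joint assignments of (A, B, C, D).
direct = max(f1[a, b] + f2[a, c] + f3[c, d] + f4[b, d]
             for a in (0, 1) for b in (0, 1)
             for c in (0, 1) for d in (0, 1))

# Eliminate D first: g1(b, c) = max_d [f3(c, d) + f4(b, d)].
g1 = {(b, c): max(f3[c, d] + f4[b, d] for d in (0, 1))
      for b in (0, 1) for c in (0, 1)}
via_g1 = max(f1[a, b] + f2[a, c] + g1[b, c]
             for a in (0, 1) for b in (0, 1) for c in (0, 1))

assert np.isclose(direct, via_g1)
```

In the LP, g1(B,C) becomes a set of new variables with one constraint per (B, C, D), so the number of constraints tracks the induced width of the elimination order rather than the total number of state and action variables.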

++≥

top related