

Context-Specific Multiagent Coordination and Planning with Factored MDPs
Carlos Guestrin, Shobha Venkataraman, Daphne Koller
Stanford University

Construction Crew Problem: Dynamic Resource Allocation

Joint Decision Space
Represent as MDP:
  Action space: joint action a for all agents
  State space: joint state x of all agents
  Reward function: total reward r
Action space is exponential: an action is an assignment a = {a1, …, an}
State space is exponential in the number of variables
A global decision requires complete observation

Context-Specific Structure

Summary: Context-Specific Coordination

Summary of Algorithm

1. Pick local rule-based basis functions hi

2. Single LP algorithm for Factored MDPs obtains Qi’s

3. Variable coordination graph computes maximizing action

Construction Crew Problem

SysAdmin: Rule-based vs. Table-based

Multiagent Coordination Examples:
  Search and rescue
  Factory management
  Supply chain
  Firefighting
  Network routing
  Air traffic control
Multiple, simultaneous decisions
Limited observability
Limited communication

Comparing to Apricodd [Boutilier et al. ’96-’99]

Conclusions and Extensions

Multiagent planning algorithm: Variable coordination structure; Limited context-specific communication; Limited context-specific observability.

Solve large MDPs!

Extensions to hierarchical and relational models

Stanford University → CMU

Agent 1: Foundation, Electricity, Plumbing
Agent 2: Plumbing, Painting
Agent 3: Electricity, Painting
Agent 4: Decoration

WANTED: Agents that coordinate to build and maintain houses, but only when necessary!

Foundation → {Electricity, Plumbing} → Painting → Decoration

Local Q-function Approximation

[Figure: graph over nodes M1-M4 with one local Q-function per agent; Q3 is associated with Agent 3, which observes only X2 and X3.]

Q(A_1, \dots, A_4, X_1, \dots, X_4) \approx Q_1(A_1, A_4, X_1, X_4) + Q_2(A_1, A_2, X_1, X_2) + Q_3(A_2, A_3, X_2, X_3) + Q_4(A_3, A_4, X_3, X_4)

Limited observability: agent i only observes variables in Qi

Must choose the action maximizing \sum_i Q_i
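To make the additive approximation concrete, here is a minimal sketch (not the authors' code) of local Q-terms stored over small scopes and summed to score a joint action. The scopes mirror the decomposition above; the table entries and the helper names (SCOPES, Q_TABLES, total_q) are made-up illustrations.

```python
# Minimal sketch (not the authors' code): an additive approximation
# Q(x, a) ~= sum_i Q_i, where each Q_i only reads a small subset of
# agents and state variables.

SCOPES = {                      # Q_i -> (agent indices, state-variable indices)
    "Q1": ((1, 4), (1, 4)),
    "Q2": ((1, 2), (1, 2)),
    "Q3": ((2, 3), (2, 3)),
    "Q4": ((3, 4), (3, 4)),
}

# Sparse tables: {(local actions, local state): value}; missing entries = 0.
Q_TABLES = {
    "Q1": {((0, 1), (1, 0)): 2.0},
    "Q2": {((0, 0), (1, 1)): 1.5},
    "Q3": {((1, 1), (0, 1)): -1.0},
    "Q4": {((1, 0), (1, 1)): 3.0},
}

def total_q(joint_action, joint_state):
    """Evaluate sum_i Q_i(x, a) by reading each table on its own scope."""
    total = 0.0
    for name, (agents, variables) in SCOPES.items():
        a_local = tuple(joint_action[i] for i in agents)
        x_local = tuple(joint_state[i] for i in variables)
        total += Q_TABLES[name].get((a_local, x_local), 0.0)
    return total

# Joint action / state indexed by agent (1-based keys for readability).
action = {1: 0, 2: 0, 3: 1, 4: 1}
state = {1: 1, 2: 1, 3: 0, 4: 0}
print(total_q(action, state))   # -> 3.5
```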

Problems with Coordination Graph

Tasks last multiple time steps
Failures cause chain reactions
Multiple houses

[Figure: SysAdmin running time (seconds) vs. number of agents, rule-based (csi) vs. table-based (non-csi), on Bidirectional Ring and Server / Reverse Star topologies. Fitted trends shown on the plots: y = 0.53x^2 - 0.96x - 0.01 (R^2 = 0.99) and y = 0.000049 exp(2.27x) (R^2 = 0.999).]

Actual value of resulting policies, 'Expon' problems:

  Problem    Optimal   Apricodd   Rule-based
  Expon06    530.9     530.9      530.9
  Expon08    77.09     77.09      77.09
  Expon10    0.034     0.034      0.034

Actual value of resulting policies, 'Linear' problems:

  Problem    Optimal   Apricodd   Rule-based
  Linear06   531.4     531.4      531.4
  Linear08   430.5     430.5      430.5
  Linear10   348.7     348.7      348.7

[Figure: running times for the 'Linear' problems, time (seconds) vs. no. of variables (6-20), Apricodd vs. rule-based; fitted trends y = 0.1473x^3 - 0.8595x^2 + 2.5006x - 1.5964 (R^2 = 0.9997) and y = 0.0254x^2 + 0.0363x + 0.0725 (R^2 = 0.9983).]

[Figure: running times for the 'Expon' problems, time (seconds) vs. no. of variables (6-12), Apricodd vs. rule-based.]

Context-Specific Coordination Structure

Table size exponential in # variables
Messages are tables
Agents communicate even if not necessary
Fixed coordination structure

What we want:
  Use structure in tables
  Variable coordination structure
  Exploit context-specific independence!

[Figure: coordination graph over agents A1-A4 with local value-rule sets Q1, Q2, Q3, Q4 attached to the edges.]

Local value rules represent context-specific structure:

⟨q1, Plumbing_not_done ∧ A1 = Plumb ∧ A2 = Plumb : -100⟩

Set of rules Qi for each agent
Must coordinate to maximize total value: \max_{a_1, \ldots, a_g} \sum_i Q_i

Rule-based variable elimination [Zhang and Poole ’99]

Maximizing out A1

Rule-based coordination graph for finding optimal action

A - Simplification on instantiation of the state
B - Simplification when passing messages
C - Simplification on maximization
Simplification by approximation

Variable agent communication structure
Coordination structure is dynamic
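As a concrete (and hedged) illustration of the value-rule representation and of simplification A, here is a small sketch where a rule is a (context, value) pair and instantiating the observed state drops contradicted rules and strips satisfied state literals. The rule shown mirrors the poster's plumbing example; the dict encoding and the function name instantiate are ours, not the authors'.

```python
# Hedged sketch of context-specific value rules <context : value>.
# A context is a dict of variable/action assignments; the example rule
# mirrors <Plumbing_not_done ^ A1 = Plumb ^ A2 = Plumb : -100>.

def instantiate(rules, observed_state):
    """Simplification A: condition a rule set on the observed state.

    Rules whose context contradicts the observation are dropped; satisfied
    state literals are removed, leaving rules over action variables only.
    """
    simplified = []
    for context, value in rules:
        remaining = {}
        consistent = True
        for var, val in context.items():
            if var in observed_state:
                if observed_state[var] != val:   # contradicted: drop the rule
                    consistent = False
                    break
            else:
                remaining[var] = val             # action literal: keep it
        if consistent:
            simplified.append((remaining, value))
    return simplified

rules = [({"Plumbing_not_done": True, "A1": "Plumb", "A2": "Plumb"}, -100.0)]
print(instantiate(rules, {"Plumbing_not_done": True}))
# -> [({'A1': 'Plumb', 'A2': 'Plumb'}, -100.0)]
```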

Long-term Utility = Value of MDP
Value computed by linear programming:
  One variable V(x) for each state
  One constraint for each state x and action a
  Number of states and actions exponential!

\text{minimize: } \sum_x V(x)
\text{subject to: } V(x) \ge Q(x, a) \quad \forall x, a
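For contrast with the factored formulation used later, here is a hedged sketch of this exact LP on a deliberately tiny flat MDP (two states, two actions, made-up transitions and rewards), solved with scipy.optimize.linprog. It has one variable V(s) per state and one constraint per state-action pair, which is exactly the enumeration that becomes infeasible for large factored problems.

```python
# Hedged sketch: the exact LP formulation of a tiny flat MDP, solved with
# scipy.optimize.linprog.  Transition and reward numbers are illustrative.
import numpy as np
from scipy.optimize import linprog

n_states, n_actions, gamma = 2, 2, 0.9
P = np.array([[[0.8, 0.2], [0.1, 0.9]],     # P[s, a, s']
              [[0.5, 0.5], [0.3, 0.7]]])
R = np.array([[1.0, 0.0],                   # R[s, a]
              [0.0, 2.0]])

c = np.ones(n_states)                       # minimize sum_s V(s)

# One constraint per (s, a): V(s) >= R(s, a) + gamma * sum_s' P(s'|s, a) V(s'),
# rewritten into linprog's  A_ub @ V <= b_ub  form.
A_ub, b_ub = [], []
for s in range(n_states):
    for a in range(n_actions):
        row = gamma * P[s, a].copy()
        row[s] -= 1.0
        A_ub.append(row)
        b_ub.append(-R[s, a])

res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
              bounds=[(None, None)] * n_states)
print("V* =", res.x)
```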

Decomposable Value Function

Linear combination of restricted-domain basis functions:
\tilde{V}(x) = \sum_i w_i h_i(x)

Each hi is a rule over small part(s) of a complex system, for example:
  The value of having two agents in the same house
  The value of two agents painting a house together

Must find w giving good approximate value function
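A minimal sketch of the rule-based basis-function idea: each h_i is an indicator of a small context, so the approximate value is a weighted sum of whichever contexts hold in the current state. The contexts, weights, and the helper name value_estimate below are illustrative, not taken from the poster.

```python
# Minimal sketch: a rule-based basis function h_i is an indicator of a
# small context, so V~(x) = sum_i w_i h_i(x) is a weighted sum of the
# contexts that hold in x.  Contexts and weights are illustrative only.

BASIS = [  # (context of h_i, weight w_i)
    ({"Agent1_in_house1": True, "Agent2_in_house1": True}, 2.0),
    ({"Painting_done": False, "Plumbing_done": True}, 1.5),
]

def value_estimate(state):
    """V~(x) = sum_i w_i * [context_i holds in x]."""
    return sum(w for context, w in BASIS
               if all(state.get(var) == val for var, val in context.items()))

print(value_estimate({"Agent1_in_house1": True, "Agent2_in_house1": True,
                      "Painting_done": True, "Plumbing_done": True}))  # -> 2.0
```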

Single LP Solution for Factored MDPs

\text{minimize: } \sum_i w_i \sum_x h_i(x)
\text{subject to: } \sum_i w_i h_i(x) \ge \sum_i Q_i(x, a) \quad \forall x, a

One variable wi for each basis function: polynomially many LP variables
One constraint for every state and action
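The sketch below writes this approximate LP out the naive way, with one explicit constraint per joint state and action of a tiny two-task toy problem (illustrative rewards, deterministic transitions, and a three-function basis, all made up). The poster's contribution is precisely to avoid this enumeration by generating an equivalent, exponentially smaller constraint set via rule-based variable elimination; the toy only shows what the constraints assert.

```python
# Hedged sketch: the approximate LP over basis weights w_i, enumerated
# naively for a tiny toy problem.  All numbers are illustrative.
import itertools
import numpy as np
from scipy.optimize import linprog

gamma = 0.9
states = list(itertools.product([0, 1], repeat=2))      # x = (X1, X2): task i done?
actions = list(itertools.product([0, 1], repeat=2))     # a = (A1, A2): agent i works?

def reward(x, a):
    # Illustrative: reward 1 for each not-yet-done task its agent works on.
    return sum(1.0 for i in range(2) if a[i] == 1 and x[i] == 0)

def next_state_dist(x, a):
    # Illustrative deterministic transitions: working on task i completes it.
    return {tuple(1 if a[i] == 1 else x[i] for i in range(2)): 1.0}

def h(x):
    # Basis functions: constant, "task 1 done", "task 2 done".
    return np.array([1.0, float(x[0]), float(x[1])])

c = sum(h(x) for x in states)                  # minimize sum_x sum_i w_i h_i(x)
A_ub, b_ub = [], []
for x in states:
    for a in actions:
        expected_h = sum(p * h(x2) for x2, p in next_state_dist(x, a).items())
        # Constraint: sum_i w_i h_i(x) >= R(x, a) + gamma * E[ V~(x') ]
        A_ub.append(-(h(x) - gamma * expected_h))
        b_ub.append(-reward(x, a))

res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
              bounds=[(None, None)] * 3)
print("basis weights w =", res.x)
```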

Factored MDP

[Figure: two-time-slice DBN fragment with nodes Plumbing_i, Painting_i (time t), Plumbing_i', Painting_i' (time t+1), the reward R, and the action A2; annotations mark the required tasks and the dependent tasks.]

Agent 1: Foundation, Electricity, Plumbing
Agent 2: Plumbing, Painting
Agent 3: Electricity, Painting
Agent 4: Decoration

[Schweitzer and Seidmann ‘85]

[Guestrin et al. ’01]

Rule-based variable elimination: exponentially smaller LP than table-based!

[Figure: rule-based coordination graph over agents A1-A6, shown at three stages (A, B, C below). Initial value rules include ⟨a1 ∧ a2 ∧ x : 5⟩, ⟨a2 ∧ a3 ∧ x : 0.1⟩, ⟨a3 ∧ a4 ∧ x : 3⟩, ⟨a1 ∧ a2 ∧ a4 ∧ x : 3⟩, ⟨a1 ∧ a5 ∧ x : 4⟩, ⟨a5 ∧ a6 ∧ x : 2⟩, ⟨a6 ∧ x : 7⟩, ⟨a1 ∧ a3 ∧ ¬x : 1⟩, ⟨a1 ∧ a6 ∧ ¬x : 3⟩.]

A - Instantiate current state (x = true): rules inconsistent with the observation are dropped and the satisfied literal x is removed from the rest, leaving ⟨a1 ∧ a2 : 5⟩, ⟨a2 ∧ a3 : 0.1⟩, ⟨a3 ∧ a4 : 3⟩, ⟨a1 ∧ a2 ∧ a4 : 3⟩, ⟨a1 ∧ a5 : 4⟩, ⟨a5 ∧ a6 : 2⟩, ⟨a6 : 7⟩.

B - Eliminate variable A1: the rules mentioning a1 are collected and A1 is maximized out; the new rules (the figure shows, e.g., ⟨a2 : 5⟩, ⟨a5 : 4⟩, ⟨a4 : 1⟩) are passed on, and the A1 node and its edges leave the graph.

C - Local maximization: elimination continues over the remaining agents A2-A6 until each agent can choose its own maximizing action.

Outline

Given long-term utilities \sum_i Q_i(x, a):
  Local message passing computes the maximizing action
  Variable coordination structure
Long-term planning to obtain \sum_i Q_i(x, a):
  Linear programming approach
  Exploit context-specific structure

[Bellman et al. ‘63], [Tsitsiklis & Van Roy ’96], [Koller & Parr ’99,’00], [Guestrin et al. ’01]

Q(x, a) = R(x, a) + \gamma \sum_{x'} P(x' \mid x, a) \, V(x')

Factored value function: V = \sum_i w_i h_i
Factored Q-function: Q = \sum_i Q_i
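Backprojection is what keeps each Q_i small: plugging V = \sum_i w_i h_i into the one-step lookahead means each basis function only needs the transition model of its own variables. Below is a hedged sketch for a single binary basis function over Painting; the CPT, variable names, and the helper backproject are illustrative, not taken from the poster.

```python
# Hedged sketch: backprojecting one basis function h (an indicator that
# Painting' is done) through a factored transition model gives
#   g(x, a) = sum_{Painting'} P(Painting' | parents(x, a)) * h(Painting'),
# which depends only on the parents of Painting in the DBN.

def h_paint(painting_done):
    return 1.0 if painting_done else 0.0

# CPT for Painting', indexed by (Plumbing, Painting, agent 2 paints?).
CPT_PAINTING = {
    (1, 0, True): 0.9,  (1, 0, False): 0.0,
    (0, 0, True): 0.0,  (0, 0, False): 0.0,   # cannot paint before plumbing
    (1, 1, True): 1.0,  (1, 1, False): 1.0,   # finished tasks stay finished
    (0, 1, True): 1.0,  (0, 1, False): 1.0,
}

def backproject(h, cpt, parents):
    """Expected value of h(Painting') given the parent assignment."""
    p_done = cpt[parents]
    return p_done * h(True) + (1.0 - p_done) * h(False)

print(backproject(h_paint, CPT_PAINTING, (1, 0, True)))   # -> 0.9
```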

Foundation → {Electricity, Plumbing} → Painting → Decoration

Example 1: 2 agents, 1 house
  Agent 1 = {Foundation, Electricity, Plumbing}
  Agent 2 = {Plumbing, Painting, Decoration}
Example 2: 4 agents, 2 houses
  Agent 1 = {Painting, Decoration}; moves
  Agent 2 = {Foundation, Electricity, Plumbing, Painting}, house 1
  Agent 3 = {Foundation, Electricity}, house 2
  Agent 4 = {Plumbing, Decoration}, house 2

Actual value of resulting policies

                                   Our rule-based approach          Apricodd
Algorithm based on                 Linear programming               Value iteration
Types of independence exploited    Additive and context-specific    Only context-specific
"Basis function" representation    Specified by user                Determined by algorithm

Introduction
Context-Specific Coordination, Given Qi's
Long-Term Planning, Computing Qi's
Experimental Results

Use Coordination graph [Guestrin et al. ’01]

Use variable elimination for maximization: [Bertele & Brioschi ‘72]

Limited communication for optimal action choice

Comm. bandwidth = induced width of coord. graph

Here we need only 23, instead of 63 sum operations.

[Figure: coordination graph over agents A1-A4 with local Q-functions Q1, Q2, Q3, Q4 on the edges.]

Computing Maximizing Action: Coordination Graph

\max_{A_1, A_2, A_3, A_4} \left[ Q_1(A_1, A_2) + Q_2(A_1, A_3) + Q_3(A_3, A_4) + Q_4(A_2, A_4) \right]
= \max_{A_1, A_2, A_3} \left[ Q_1(A_1, A_2) + Q_2(A_1, A_3) + \max_{A_4} \left[ Q_3(A_3, A_4) + Q_4(A_2, A_4) \right] \right]
= \max_{A_1, A_2, A_3} \left[ Q_1(A_1, A_2) + Q_2(A_1, A_3) + g_1(A_2, A_3) \right]

g_1(A_2, A_3): for every action of A_2 and A_3, the maximum value attainable by A_4.
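A minimal sketch (with made-up Q-values, not from the poster) of exactly this elimination order: A4 is maximized out first, yielding g1(A2, A3) plus A4's best response, the remaining agents are then maximized, and the stored best responses are decoded to recover the maximizing joint action.

```python
# Hedged sketch of max-out variable elimination on the four-agent example.
# Q-values are illustrative; each Q_i is a dict over its two agents' actions.
from itertools import product

ACTIONS = (0, 1)
Q1 = {(a1, a2): 2.0 * a1 + a2 for a1, a2 in product(ACTIONS, ACTIONS)}   # Q1(A1,A2)
Q2 = {(a1, a3): a1 * a3 for a1, a3 in product(ACTIONS, ACTIONS)}          # Q2(A1,A3)
Q3 = {(a3, a4): 3.0 * a3 - a4 for a3, a4 in product(ACTIONS, ACTIONS)}    # Q3(A3,A4)
Q4 = {(a2, a4): a2 + 2.0 * a4 for a2, a4 in product(ACTIONS, ACTIONS)}    # Q4(A2,A4)

# Eliminate A4: g1(A2,A3) = max_{A4} [ Q3(A3,A4) + Q4(A2,A4) ],
# remembering A4's best response so the joint action can be decoded later.
g1, best_a4 = {}, {}
for a2, a3 in product(ACTIONS, ACTIONS):
    vals = {a4: Q3[(a3, a4)] + Q4[(a2, a4)] for a4 in ACTIONS}
    best_a4[(a2, a3)] = max(vals, key=vals.get)
    g1[(a2, a3)] = vals[best_a4[(a2, a3)]]

# Maximize the reduced problem over A1, A2, A3.
best_val, best = float("-inf"), None
for a1, a2, a3 in product(ACTIONS, ACTIONS, ACTIONS):
    v = Q1[(a1, a2)] + Q2[(a1, a3)] + g1[(a2, a3)]
    if v > best_val:
        best_val, best = v, (a1, a2, a3)

a1, a2, a3 = best
print("joint action:", (a1, a2, a3, best_a4[(a2, a3)]), "value:", best_val)
```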

hi and Qi depend on small sets of variables and actions

Polynomial-time algorithm generates compact LP

\text{subject to: } \sum_i w_i h_i(x) \ge \sum_i Q_i(x, a) \quad \forall x, a

Equivalently: 0 \ge \max_{x, a} \left[ Q(x, a) - \sum_i w_i h_i(x) \right]

Computing this max by variable elimination turns the single nonlinear constraint into a compact set of linear constraints. For example, eliminating D in

0 \ge \max_{A, B, C} \left[ f_1(A, B) + f_2(A, C) + \max_{D} \left[ f_3(C, D) + f_4(B, D) \right] \right]

introduces new LP variables g_1(B, C) and yields

0 \ge \max_{A, B, C} \left[ f_1(A, B) + f_2(A, C) + g_1(B, C) \right]
g_1(B, C) \ge f_3(C, D) + f_4(B, D) \quad \forall B, C, D
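A hedged sketch of this reformulation step: eliminating D introduces new LP variables g1(B, C) and replaces the inner max by one linear constraint per assignment of (B, C, D). The constraints are printed symbolically; in the actual construction each f_i is itself a linear expression in the basis weights w, and the saving over naive enumeration grows exponentially with the number of variables eliminated. The domain sizes and names below are illustrative.

```python
# Hedged sketch: variable elimination turns the nonlinear constraint
#   0 >= max_{A,B,C,D} [ f1(A,B) + f2(A,C) + f3(C,D) + f4(B,D) ]
# into linear constraints over new LP variables g1(B,C).  Constraints are
# emitted as strings; in the real LP each f_i is linear in the weights w.
from itertools import product

DOM = (0, 1, 2)      # illustrative three-valued variables
constraints = []

# Eliminate D: one new LP variable g1(b,c) per (b,c), one constraint per d.
for b, c, d in product(DOM, DOM, DOM):
    constraints.append(f"g1({b},{c}) >= f3({c},{d}) + f4({b},{d})")

# The outer constraint now ranges only over A, B, C, using g1 for the inner max.
for a, b, c in product(DOM, DOM, DOM):
    constraints.append(f"0 >= f1({a},{b}) + f2({a},{c}) + g1({b},{c})")

print(len(constraints), "linear constraints;",
      "naive enumeration of (A,B,C,D) would give", len(DOM) ** 4)
for cst in constraints[:3]:
    print(cst)
```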