
Concurrent Probabilistic Temporal Planning (CPTP)

Mausam
Joint work with Daniel S. Weld
University of Washington, Seattle

Motivation

Three features of real-world planning domains:

Durative actions: all actions (navigation between sites, placing instruments, etc.) take time.

Concurrency: some instruments may warm up while others perform their tasks and still others shut down to save power.

Uncertainty: all actions (pick up the rock, send data, etc.) have a probability of failure.

Motivation (contd.)

Concurrent temporal planning (widely studied with deterministic effects):
Extends classical planning.
Doesn't easily extend to probabilistic outcomes.

Concurrent planning with uncertainty (Concurrent MDPs, AAAI'04):
Handles combinations of actions over an MDP.
Actions take unit time.

Few planners handle all three in concert!

Outline of the talk

MDP and CoMDP
Concurrent Probabilistic Temporal Planning: a Concurrent MDP in an augmented state space
Solution methods for CPTP: two heuristics to guide the search; hybridisation
Experiments & conclusions
Related & future work

Markov Decision Process

S : a set of states, factored into Boolean variables
A : a set of actions, each of unit duration
Pr : S × A × S → [0,1], the transition model
C : A → ℝ, the cost model
s0 : the start state
G : a set of absorbing goals

GOAL of an MDP

Find a policy π : S → A that minimises the expected cost of reaching a goal, for a fully observable Markov decision process executed over an indefinite horizon.

Equations: optimal policy

Define J*(s) (the optimal cost) as the minimum expected cost to reach a goal from s.

J* should satisfy:

J^*(s) \;=\; \min_{a \in Ap(s)} \Big[\, C(a) \;+\; \sum_{s'} \Pr(s' \mid s, a)\, J^*(s') \,\Big]
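A minimal Python sketch of a single Bellman backup of this equation, evaluating it once at a state s with the current estimates Jn; the callables applicable, cost and transition are illustrative placeholders, not the authors' implementation:

```python
# One Bellman backup: J_{n+1}(s) = min_{a in Ap(s)} [ C(a) + sum_{s'} Pr(s'|s,a) * J_n(s') ].
# All data structures here are illustrative placeholders.

def bellman_backup(s, applicable, cost, transition, J):
    """Return (best_cost, best_action) for state s.

    applicable(s)    -> iterable of actions a in Ap(s)
    cost(a)          -> C(a)
    transition(s, a) -> iterable of (s_next, prob) pairs
    J                -> dict mapping states to current estimates J_n
    """
    best_cost, best_action = float("inf"), None
    for a in applicable(s):
        q = cost(a) + sum(p * J.get(s_next, 0.0)      # Q_{n+1}(s, a)
                          for s_next, p in transition(s, a))
        if q < best_cost:
            best_cost, best_action = q, a
    return best_cost, best_action
```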

Bellman Backup

[Figure: Bellman backup at state s. Each applicable action a1, a2, a3 ∈ Ap(s) leads to successor states with current estimates Jn; Qn+1(s,a) is computed for each action, and Jn+1(s) is the minimum over them.]

RTDP Trial

[Figure: one RTDP trial. At state s, a Bellman backup over a1, a2, a3 yields Qn+1(s,a) and Jn+1(s); the greedy action amin = a2 is simulated, a successor is sampled, and the process repeats until a goal is reached.]

Real Time Dynamic Programming (Barto, Bradtke and Singh'95)

Trial: simulate the greedy policy, performing a Bellman backup on each visited state.
Repeat RTDP trials until the cost function converges.
Anytime behaviour; only expands the reachable state space; complete convergence is slow.

Labeled RTDP (Bonet & Geffner'03): admissible if started with an admissible (optimistic, lower-bound) cost function; monotonic; converges quickly.
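A minimal sketch of one RTDP trial, reusing the bellman_backup sketch above; the goals set and the applicable, cost and transition callables are illustrative assumptions:

```python
import random

def rtdp_trial(s0, goals, applicable, cost, transition, J, max_steps=1000):
    """One RTDP trial (sketch): follow the greedy policy from s0, performing a
    Bellman backup on every visited state. J is a dict of optimistic (admissible)
    estimates, updated in place; `goals` is a set of absorbing goal states."""
    s = s0
    for _ in range(max_steps):
        if s in goals:
            break
        # Greedy Bellman backup at the current state (bellman_backup as sketched earlier).
        J[s], a = bellman_backup(s, applicable, cost, transition, J)
        # Simulate the greedy action: sample a successor according to Pr(. | s, a).
        succs, probs = zip(*transition(s, a))
        s = random.choices(succs, weights=probs, k=1)[0]
```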

Concurrent MDP (CoMDP) (Mausam & Weld'04)

Allows concurrent combinations of actions.

Safe execution: inherit the mutex definitions from classical planning:
Conflicting preconditions
Conflicting effects
Interfering preconditions and effects
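A hedged sketch of that mutex test; the grounded-action representation (sets of literal strings, with a "not-" prefix for negation) is an assumption for illustration, not the paper's encoding:

```python
from collections import namedtuple

# Illustrative grounded action: sets of literal strings for preconditions, add and delete effects.
Action = namedtuple("Action", ["precond", "add", "delete"])

def negate(lit):
    # Negation is modelled with a "not-" prefix (illustrative convention only).
    return lit[4:] if lit.startswith("not-") else "not-" + lit

def mutex(a1, a2):
    """True if a1 and a2 may not execute concurrently: conflicting preconditions,
    conflicting effects, or one action's effects interfering with the other's preconditions."""
    if a1.precond & {negate(l) for l in a2.precond}:          # conflicting preconditions
        return True
    if (a1.add & a2.delete) or (a2.add & a1.delete):          # conflicting effects
        return True
    if (a1.delete & a2.precond) or (a2.delete & a1.precond):  # interference
        return True
    return False
```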

Bellman Backup (CoMDP)

[Figure: Bellman backup in a CoMDP. At state s, the backup ranges over every applicable non-mutex action combination (a1; a2; a3; a1,a2; a1,a3; a2,a3; a1,a2,a3), each with its own successor estimates Jn; Jn+1(s) is the minimum of the resulting Q-values.]

Exponential blowup to calculate a Bellman backup!

Sampled RTDP

RTDP with stochastic (partial) backups (approximate):
Always try the last best combination.
Randomly sample a few other combinations.

In practice:
Close to optimal solutions.
Converges very fast.
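A minimal sketch of such a stochastic backup; q_value, applicable and mutex are illustrative callables (q_value(s, combo, J) standing for C(combo) + Σ_s' Pr(s'|s,combo)·J(s')), and none of this is the authors' code:

```python
import random

def sampled_backup(s, J, best_combo, applicable, q_value, mutex, num_samples=10):
    """Stochastic (partial) Bellman backup for a CoMDP state (sketch).

    Rather than enumerating every non-mutex action combination (exponentially many),
    evaluate the last best combination, all single actions, and a few random combos.
    Actions are assumed hashable (e.g. action names)."""
    acts = list(applicable(s))
    if not acts:
        return None
    candidates = [frozenset([a]) for a in acts]              # single actions
    if s in best_combo:
        candidates.append(best_combo[s])                      # always retry the incumbent
    for _ in range(num_samples):                              # a few random non-mutex combos
        combo = frozenset(random.sample(acts, random.randint(1, len(acts))))
        if not any(mutex(a, b) for a in combo for b in combo if a is not b):
            candidates.append(combo)
    best = min(candidates, key=lambda c: q_value(s, c, J))
    best_combo[s] = best                                      # remember for the next backup
    J[s] = q_value(s, best, J)
    return best
```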

Outline of the talk

MDP and CoMDP
Concurrent Probabilistic Temporal Planning: a Concurrent MDP in an augmented state space
Solution methods for CPTP: two heuristics to guide the search; hybridisation
Experiments & conclusions
Related & future work

Modelling CPTP as CoMDP

CoMDP → CPTP:
Model explicit action durations.
Minimise the expected make-span.

If we initialise C(a) to the action's duration Δ(a), two decision-epoch formulations arise: aligned epochs and interwoven epochs.

Augmented state space

[Figure: a timeline (0, 3, 6, 9, ...) over which actions a–h execute concurrently, starting from world state X.]

⟨X, ∅⟩
⟨X1, {(a,1), (c,3)}⟩, where X1 is the result of applying b to X.
⟨X2, {(h,1)}⟩, where X2 is the result of applying a, b, c, d and e to X.

Simplifying assumptions

All actions have deterministic durations.
All action durations are integers.

Action model:
Preconditions must hold until the end of the action.
Effects become usable only at the end of the action.

Properties:
Mutex rules are still required.
It is sufficient to consider only the epochs at which some action ends.

Completing the CoMDP

Redefine the applicability set, the transition function, and the start and goal states.

Example: the transition function is redefined so that the agent moves forward in time to the next epoch at which some action completes.

Start state: ⟨s0, ∅⟩, etc.
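A sketch of one way the interwoven (augmented) state might be represented: the world state X plus the set of currently executing actions with their remaining durations. The class and the finished_effects callable are illustrative assumptions:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class InterwovenState:
    """Augmented state <X, {(a, delta)}>: world state X plus the actions still
    executing, each paired with its remaining duration (a sketch, not the paper's code)."""
    X: frozenset                              # world state: frozenset of true propositions
    executing: frozenset = frozenset()        # frozenset of (action_name, remaining_time) pairs

def advance(state, elapsed, finished_effects):
    """Move to the next decision epoch: `elapsed` time units pass, the effects of
    actions finishing within that window are applied (via the illustrative callable
    finished_effects), and the remaining actions keep running."""
    done = [a for a, d in state.executing if d <= elapsed]
    still_running = frozenset((a, d - elapsed) for a, d in state.executing if d > elapsed)
    return InterwovenState(X=finished_effects(state.X, done), executing=still_running)

# The start state is <s0, empty set>:
# start = InterwovenState(X=s0, executing=frozenset())
```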

Solution

CPTP = a CoMDP over the interwoven state space.

Thus one may use our sampled RTDP (etc.).

PROBLEM: exponential blowup in the size of the state space.

Outline of the talk

MDP and CoMDP
Concurrent Probabilistic Temporal Planning: a Concurrent MDP in an augmented state space
Solution methods for CPTP:
Solution 1: two heuristics to guide the search
Solution 2: hybridisation
Experiments & conclusions
Related & future work

Max Concurrency Heuristic (MC)

Define c : the maximum number of actions executable concurrently in the domain.

Serialising any concurrent plan stretches its expected make-span by at most a factor of c, so
J*(X) ≤ c · J*(⟨X, ∅⟩), and hence J*(⟨X, ∅⟩) ≥ J*(X) / c.

[Figure: serialisation example with actions a, b, c and c = 2: the concurrent plan from X to G has J*(⟨X, ∅⟩) = 10, while the serialised plan gives J*(X) ≤ 20.]

Admissible heuristic.
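As a sketch, the resulting heuristic value for an interwoven state can be obtained from the single-action MDP's optimal cost divided by c; J_serial is an illustrative callable returning J*(X), and the treatment of already-executing actions is omitted here:

```python
def max_concurrency_heuristic(state, J_serial, c):
    """Admissible, optimistic initial estimate for an interwoven state: the optimal
    cost J*(X) of the underlying single-action MDP divided by the maximum concurrency c.
    (Sketch only: already-executing actions are ignored for simplicity.)"""
    return J_serial(state.X) / c
```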

Eager Effects Heuristic: solving a relaxed problem

Relaxed state space: S × ℤ.
Let (X, δ) be a state where X is the world state and δ is the time remaining for all actions (started at any point in the history) to complete execution.

Start state: (s0, 0)
Goal states: { (X, 0) | X ∈ G }

Eager Effects Heuristic (contd.)

[Figure: example — from world state X, actions a (duration 2), b (duration 8) and c (duration 4) are started; after 2 units the relaxed state is (V, 6), with 6 time units left before all started actions truly complete.]

Allow all actions, even those mutex with a or c!

Allowing inapplicable actions to execute: optimistic!
Assuming knowledge of action effects ahead of time: optimistic!

Hence the name: Eager Effects!

Admissible heuristic.

Solution 2: Hybridisation

Observations:
The aligned-epoch policy is sub-optimal but fast to compute.
The interwoven-epoch policy is optimal but slow to compute.

Solution: produce a hybrid policy, i.e.:
Output the interwoven policy for probable states.
Output the aligned policy for improbable states.

Path to goals

[Figure: paths from the start state s to goal states G; one branch is labelled low probability.]

Hybrid algorithm (contd.)

Observation: RTDP explores probable branches much more than others.

Algorithm(m, k, r): loop:
Do m RTDP trials; let the current value of the start state be J(s0) (at most the optimal value).
Output a hybrid policy: the interwoven policy for states visited more than k times, the aligned policy for all other states.
Evaluate the hybrid policy: J̃(s0) (at least the optimal value).
Stop if J̃(s0) − J(s0) < r · J(s0).
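A minimal sketch of that loop; run_rtdp_trials, interwoven_policy_for, aligned_policy and evaluate are placeholder callables/structures, not the paper's implementation:

```python
def hybrid_cptp(m, k, r, run_rtdp_trials, interwoven_policy_for, aligned_policy, evaluate):
    """Hybridised CPTP solver, Algorithm(m, k, r) (sketch).

    run_rtdp_trials(m)       -> runs m (sampled) RTDP trials; returns (J(s0), visit_counts)
    interwoven_policy_for(S) -> greedy interwoven-epoch policy restricted to the states in S
    aligned_policy           -> cheap aligned-epoch policy used as the fallback (a dict)
    evaluate(policy)         -> expected make-span of the hybrid policy from s0
    """
    while True:
        # 1. Improve the interwoven value function with m more RTDP trials.
        J_s0, visits = run_rtdp_trials(m)                # J(s0): at most the optimal value
        # 2. Hybrid policy: interwoven on frequently visited ("probable") states,
        #    aligned-epoch everywhere else.
        frequent = {s for s, n in visits.items() if n > k}
        policy = {**aligned_policy, **interwoven_policy_for(frequent)}
        # 3. Evaluate the hybrid policy: at least the optimal value.
        J_tilde_s0 = evaluate(policy)
        # 4. Stop once the gap is within the optimality ratio r.
        if J_tilde_s0 - J_s0 < r * J_s0:
            return policy
```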

Hybridisation

Outputs a proper policy:
The policy is defined at all reachable states.
The policy is guaranteed to take the agent to a goal.

Has an optimality-ratio parameter (r) that controls the balance between optimality and running time.
Can be used as an anytime algorithm.
Is general: we can hybridise two algorithms in other settings too, e.g. when solving the original concurrent MDP.

Outline of the talk

MDP and CoMDP
Concurrent Probabilistic Temporal Planning: a Concurrent MDP in an augmented state space
Solution methods for CPTP: two heuristics to guide the search; hybridisation
Experiments & conclusions
Related & future work

Experiments

Domains: Rover, MachineShop, Artificial
State variables: 14–26
Durations: 1–20

Speedups in Rover domain

[Figure: efficiency of different methods. Time in seconds (logarithmic scale, 1 to 10,000) on six Rover problems for Interwoven Epoch, Max Concurrency, Eager Effects, the Hybrid Algorithm and Aligned Epochs.]

Qualities of solution

[Figure: solution quality of different methods. Ratio of make-span to the optimal (0.8 to 1.7) on six Rover problems for Interwoven Epoch, Max Concurrency, Eager Effects, the Hybrid Algorithm and Aligned Epochs.]

Experiments: Summary

Max Concurrency heuristic: fast to compute; speeds up the search.
Eager Effects heuristic: high quality; can be expensive in some domains.
Hybrid algorithm: very fast; produces good-quality solutions.
Aligned-epoch model: super-fast; at times outputs poor-quality solutions.

Related Work

Prottle (Little, Aberdeen, Thiebaux’05)

Generate, test and debug paradigm (Younes & Simmons’04)

Concurrent options (Rohanimanesh & Mahadevan’04)

Future Work

Other applications of hybridisation: CoMDP, MDP, over-subscription planning.

Relaxing the assumptions: handling mixed costs; extending to PDDL2.1; stochastic action durations.

Extensions to metric resources.
State-space compression/aggregation.