
Page 1: Incremental Pruning

Incremental Pruning

CSE 574, May 9, 2003
Stanley Kok

Page 2: Incremental Pruning

Value-Iteration (Recap)

• DP update – a step in value-iteration
• MDP = <S, A, T, R>
  - S – finite set of states in the world
  - A – finite set of actions
  - T: S x A -> Π(S), the transition function (e.g. T(s,a,s’) = 0.2)
  - R: S x A -> R, the reward function (e.g. R(s,a) = 10)

• Algorithm: value iteration (a minimal sketch follows)
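The slide leaves the algorithm itself to a figure; here is a minimal value-iteration sketch in Python (the dict-based encodings of T and R are illustrative assumptions, not from the slides):

def value_iteration(S, A, T, R, gamma=0.9, eps=1e-6):
    # V starts at zero everywhere
    V = {s: 0.0 for s in S}
    while True:
        # DP update: one step of value iteration
        V_new = {s: max(R[(s, a)] + gamma * sum(T[(s, a, s2)] * V[s2] for s2 in S)
                        for a in A)
                 for s in S}
        if max(abs(V_new[s] - V[s]) for s in S) < eps:
            return V_new
        V = V_new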

Page 3: Incremental Pruning

POMDP

• <S, A, T, R, Ω, O> tuple
  - S, A, T, R as in an MDP
  - Ω – finite set of observations
  - O: S x A -> Π(Ω), the observation function

• Belief state b
  - an information state
  - a probability distribution over S
  - e.g. b(s1) = probability that the world is in state s1

Page 4: Incremental Pruning

POMDP - SE

• SE – State Estimator
  - updates the belief state based on the previous belief state, the last action, and the current observation

• SE(b, a, o) = b’
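The slides leave SE’s formula to a figure; a hedged sketch of the standard Bayes-filter update it computes (the dict-based T and O are illustrative assumptions):

def SE(b, a, o, S, T, O):
    # b'(s') is proportional to O(s', a, o) * sum over s of T(s, a, s') * b(s)
    b_new = {s2: O[(s2, a, o)] * sum(T[(s, a, s2)] * b[s] for s in S) for s2 in S}
    norm = sum(b_new.values())  # = Pr(o | a, b)
    return {s2: p / norm for s2, p in b_new.items()}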

Page 5: Incremental Pruning

POMDP - SE

Page 6: Incremental Pruning

POMDP - Π

• Focus on Π component

• POMDP -> “Belief MDP” with MDP parameters:
  - S => B, the set of belief states
  - A => same
  - T => τ(b, a, b’)
  - R => ρ(b, a)

• Solve with the value-iteration algorithm

Page 7: Incremental Pruning

POMDP - Π

• τ(b, a, b’) – the belief transition function

• ρ(b, a) – the belief reward function
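The equations themselves were slide images; a reconstruction of the standard belief-MDP definitions (in LaTeX), using SE from Page 4:

\tau(b, a, b') = \sum_{o \in \Omega \,:\, SE(b, a, o) = b'} \Pr(o \mid a, b)

\rho(b, a) = \sum_{s \in S} b(s) \, R(s, a)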

Page 8: Incremental Pruning

Two Problems

• How to represent value function over continuous belief space?

• How to update the value function V_t from V_{t-1}?

• POMDP -> MDP
  - S => B, set of belief states
  - A => same
  - T => τ(b, a, b’)
  - R => ρ(b, a)

Page 9: Incremental Pruning

Running Example

• POMDP with
  - two states (s1 and s2)
  - two actions (a1 and a2)
  - three observations (z1, z2, z3)

[Figure: 1D belief space for a 2-state POMDP; the axis is the probability that the state is s1]

Page 10: Incremental Pruning

First Problem Solved

• Key insight: the value function is piecewise linear & convex (PWLC)
  - convexity makes intuitive sense:

• Middle of belief space – high entropy, can’t select actions appropriately, less long-term reward

• Near the corners of the simplex – low entropy, take actions more likely to be appropriate for the current world state, gain more reward

• Each line (hyperplane) is represented by a vector
  - the coefficients of the line (hyperplane)
  - e.g. V(b) = c1 x b(s1) + c2 x (1 - b(s1))

• To find the value at b, find the vector with the largest dot product with b
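A hedged sketch of this lookup, with the value function stored as a list of coefficient tuples:

def pwlc_value(b, vectors):
    # V(b) = max over all vectors of (vector . b)
    return max(sum(c * p for c, p in zip(v, b)) for v in vectors)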

 

Page 11: Incremental Pruning

Second Problem

• Can’t iterate over all belief states (infinite) for value-iteration but…

• Given the vectors representing V_{t-1}, generate the vectors representing V_t

Page 12: Incremental Pruning

Horizon 1

• No future: the value function consists only of the immediate reward
• e.g.
  - R(s1, a1) = 0, R(s2, a1) = 1.5
  - R(s1, a2) = 1, R(s2, a2) = 0
  - b = <0.25, 0.75>

  Value of doing a1
  = 0 x b(s1) + 1.5 x b(s2)
  = 0 x 0.25 + 1.5 x 0.75 = 1.125

  Value of doing a2
  = 1 x b(s1) + 0 x b(s2)
  = 1 x 0.25 + 0 x 0.75 = 0.25
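As a quick check of the arithmetic in Python (the numpy layout is an illustrative choice):

import numpy as np
R = np.array([[0.0, 1.0],    # rows: s1, s2; columns: a1, a2
              [1.5, 0.0]])
b = np.array([0.25, 0.75])
print(b @ R)   # [1.125 0.25]: value of a1, value of a2; a1 is best at this b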

Page 13: Incremental Pruning

Second Problem

• Break the problem down into 3 steps:
  - compute value of belief state given action and observation
  - compute value of belief state given action
  - compute value of belief state

Page 14: Incremental Pruning

Horizon 2 – Given action & obs

• If in belief state b, what is the best value of doing action a1 and seeing z1?

• Best value = best value of immediate action + best value of next action

• Best value of immediate action = horizon 1 value function

Page 15: Incremental Pruning

Horizon 2 – Given action & obs

• Assume the best immediate action is a1 and the observation is z1
• What’s the best action for the b’ that results from the initial b when we perform a1 and observe z1?
• Not feasible – we cannot do this for all belief states (infinite)

Page 16: Incremental Pruning

Horizon 2 – Given action & obs

• Construct a function over the entire (initial) belief space
  - from the horizon 1 value function
  - with the belief transformation built in

Page 17: Incremental Pruning

Horizon 2 – Given action & obs

• S(a1, z1) corresponds to the paper’s notation
• S() has built in:
  - the horizon 1 value function
  - the belief transformation
  - the “weight” of seeing z after performing a
  - the discount factor
  - the immediate reward

• S() is PWLC
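The construction itself was a slide image; a reconstruction of the transformation used in the incremental pruning paper, mapping each horizon 1 vector α to a vector of S(a, z) (γ is the discount factor, and the immediate reward is split evenly over the |Ω| observations so that the per-observation pieces sum back to the full value):

\tau(\alpha, a, z)(s) = \frac{R(s, a)}{|\Omega|} + \gamma \sum_{s' \in S} O(s', a, z) \, T(s, a, s') \, \alpha(s')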

Page 18: Incremental Pruning

Second Problem

• Break the problem down into 3 steps:
  - compute value of belief state given action and observation
  - compute value of belief state given action
  - compute value of belief state

Page 19: Incremental Pruning

Horizon 2 – Given action

• What is the horizon 2 value of a belief state given immediate action is a1?

• Horizon 2: do action a1
• Horizon 1: do action…?

Page 20: Incremental Pruning

Horizon 2 – Given action

• What’s the best strategy at b?
• How to compute the line (vector) representing the best strategy at b? (easy)
• How many strategies are there in the figure?
• What’s the max number of strategies (after taking immediate action a1)?

Page 21: Incremental Pruning

Horizon 2 – Given action

• How can we represent the 4 regions (strategies) as a value function?

• Note: each region is a strategy

Page 22: Incremental Pruning

Horizon 2 – Given action

• Sum up the vectors representing each region
• A sum of vectors is a vector (add lines, get lines)
• Corresponds to the paper’s cross-sum transformation
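A hedged sketch of the cross-sum on vector sets (vectors as tuples):

from itertools import product

def cross_sum(*vector_sets):
    # pick one vector per observation, in every possible way, and add them
    # componentwise; sums of lines are lines, so the result is again a set
    # of value-function vectors
    return [tuple(map(sum, zip(*choice))) for choice in product(*vector_sets)]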

Page 23: Incremental Pruning

Horizon 2 – Given action

• What does each region represent?
• Why is this step hard (alluded to in the paper)?

Page 24: Incremental Pruning

Second Problem

• Break the problem down into 3 steps:
  - compute value of belief state given action and observation
  - compute value of belief state given action
  - compute value of belief state

Page 25: Incremental Pruning

Horizon 2

[Figure: the a1 and a2 value functions are combined with a union (∪)]

Page 26: Incremental Pruning

Horizon 2

This tells you how to act!

Page 27: Incremental Pruning

Purge

Page 28: Incremental Pruning

Second Problem

• Break the problem down into 3 steps:
  - compute value of belief state given action and observation
  - compute value of belief state given action
  - compute value of belief state

• Use the horizon 2 value function to update horizon 3’s ...

Page 29: Incremental Pruning

The Hard Step

• Easy to inspect visually to obtain the different regions
• But in higher-dimensional spaces, with many actions and observations…. a hard problem

Page 30: Incremental Pruning

Naïve way - Enumerate

• How does Incremental Pruning do it?

Page 31: Incremental Pruning

Incremental Pruning

• How does IP improve on the naïve method?
• Will IP ever do worse than the naïve method?

[Figure: combinations of vectors are generated and then purged/filtered]
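A hedged sketch of the contrast (assuming a purge() that removes dominated vectors, and cross_sum() from the earlier sketch): the naïve method builds the full cross-sum over all observations and purges once at the end; IP purges after every pairwise cross-sum, so intermediate sets stay small.

from functools import reduce

def naive(vector_sets, purge):
    # one giant cross-sum, purged once at the end
    return purge(reduce(cross_sum, vector_sets))

def incremental_pruning(vector_sets, purge):
    # purge after each pairwise cross-sum
    return reduce(lambda A, B: purge(cross_sum(A, B)), vector_sets)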

Page 32: Incremental Pruning

Incremental Pruning

• What other novel idea(s) are in IP?
  - RR: come up with a smaller set D as the argument to Dominate()

• RR has more linear programs but fewer constraints in the worst case
  - empirically, the ↓ in constraints saves more time than the ↑ in linear programs requires

Page 33: Incremental Pruning

Incremental Pruning

• What other novel idea(s) are in IP?
  - RR: come up with a smaller set D as the argument to Dominate()

Why are the terms after the U needed?

Page 34: Incremental Pruning

Identifying Witness

• Witness Thm:
  - Let Ua be a set of vectors representing a value function
  - Let u be in Ua (e.g. u = α_{z1,a2} + α_{z2,a1} + α_{z3,a1})
  - If there is a vector v which differs from u in one observation (e.g. v = α_{z1,a1} + α_{z2,a1} + α_{z3,a1}) and there is a b such that b·v > b·u,
  - then Ua is not equal to the true value function

Page 35: Incremental Pruning

Witness Algm

• Randomly choose a belief state b
• Compute the vector representing the best value at b (easy)
• Add the vector to the agenda
• While the agenda is not empty:
  - get vector Vtop from the top of the agenda
  - b’ = Dominate(Vtop, Ua)
  - if b’ is not null (there is a witness):
    - compute the vector u for the best value at b’ and add it to Ua
    - compute all vectors v that differ from u at one observation and add them to the agenda

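Dominate() is not spelled out on the slides; a hedged sketch of the standard linear program it solves (find a belief where the query vector beats every vector in Ua), using scipy:

import numpy as np
from scipy.optimize import linprog

def dominate(v, U, eps=1e-9):
    # maximize delta subject to b.v >= b.u + delta for all u in U,
    # with b a probability distribution; the LP variables are [b, delta]
    n = len(v)
    c = np.zeros(n + 1)
    c[-1] = -1.0                     # minimize -delta, i.e. maximize delta
    A_ub = np.array([np.append(np.subtract(u, v), 1.0) for u in U])
    b_ub = np.zeros(len(U))
    A_eq = np.array([[1.0] * n + [0.0]])   # b sums to 1
    b_eq = np.array([1.0])
    bounds = [(0.0, 1.0)] * n + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    if res.success and res.x[-1] > eps:
        return res.x[:n]             # a witness belief b'
    return None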

Page 36: Incremental Pruning

Linear Support

• If the value function is incorrect, the biggest difference is at the edges (convexity)

Page 37: Incremental Pruning

Linear Support

Page 38: Incremental Pruning

Experiments

• Comments???

Page 39: Incremental Pruning

Important Ideas

• Purge()

Page 40: Incremental Pruning

Flaws

• Insufficient background/motivation

Page 41: Incremental Pruning

Future Research

• Better best-case/worst-case analyses
• Precision parameter ε

Page 42: Incremental Pruning

Variants

• Reactive Policy
  - s_t = z_t
  - π(z) = a
  - branch & bound search
  - gradient ascent search
  - perceptual aliasing problem

• Finite History Window
  - π(z1…zk) = a
  - suffix tree to represent observation histories, with an action at each leaf

• Recurrent Neural Nets
  - use neural nets to maintain some state (so information about the past is not forgotten)

Page 43: Incremental Pruning

Variants – Belief State MDP

• Exact V, exact b
• Approximate V, exact b
  - discretize b into a grid and interpolate

• Exact V, approximate b
  - use particle filters to sample b (see the sketch below)
  - track an approximate belief state using a DBN

• Approximate V, approximate b
  - combine the previous two
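A hedged sketch of the particle-filter option (sample_T and weight_O are hypothetical stand-ins for sampling the transition model T and evaluating the observation model O):

import random

def pf_update(particles, a, o, sample_T, weight_O):
    # propagate each sampled state through the transition model
    proposed = [sample_T(s, a) for s in particles]
    # weight each particle by the likelihood of the actual observation
    weights = [weight_O(s2, a, o) for s2 in proposed]
    # resample in proportion to the weights
    return random.choices(proposed, weights=weights, k=len(particles))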

Page 44: Incremental Pruning

Variants - Pegasus

• Policy Evaluation of Goodness And Search Using Scenarios

• Convert POMDP to another POMDP with deterministic state transitions

• Search for policy of transformed POMDP with highest estimated value

Page 45: Incremental Pruning

That’s it!