A Tutorial on the Partially Observable Markov Decision Process and Its Applications

Lawrence Carin

June 7, 2006

Outline

Overview of Markov decision processes (MDPs)

Introduction to partially observable Markov decision processes (POMDPs)

Some applications of POMDPs

Conclusions

Markov decision processes

The MDP is defined by the tuple $\langle S, A, T, R \rangle$:

$S$ is a finite set of states of the world.

$A$ is a finite set of actions.

$T: S \times A \to \Pi(S)$ is the state-transition function, where $\Pi(S)$ denotes the probability distributions over $S$; $T(s, a, s')$ is the probability that action $a$ changes the world state from $s$ to $s'$.

$R: S \times A \to \mathbb{R}$ is the reward function; $R(s, a)$ is the reward the agent receives for performing action $a$ in world state $s$.
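As a rough illustration of the tuple, the sketch below stores an MDP in plain Python containers and checks that every T(s, a, ·) is a probability distribution. The class name `MDP`, the dictionary layout, and the two-state example are assumptions made for illustration; they are not part of the tutorial.

```python
from dataclasses import dataclass


@dataclass
class MDP:
    states: list   # S: finite set of world states
    actions: list  # A: finite set of actions
    T: dict        # T[(s, a)] -> {s_next: probability}, i.e. T(s, a, s')
    R: dict        # R[(s, a)] -> float, i.e. R(s, a)

    def is_valid(self, tol=1e-9):
        """Return True if T(s, a, .) is a distribution for every (s, a)."""
        for s in self.states:
            for a in self.actions:
                row = self.T[(s, a)]
                if any(p < 0.0 for p in row.values()):
                    return False
                if abs(sum(row.values()) - 1.0) > tol:
                    return False
        return True


# Hypothetical two-state, two-action example (not from the tutorial).
toy_mdp = MDP(
    states=["s0", "s1"],
    actions=["stay", "move"],
    T={
        ("s0", "stay"): {"s0": 0.9, "s1": 0.1},
        ("s0", "move"): {"s0": 0.2, "s1": 0.8},
        ("s1", "stay"): {"s0": 0.1, "s1": 0.9},
        ("s1", "move"): {"s0": 0.8, "s1": 0.2},
    },
    R={
        ("s0", "stay"): 0.0, ("s0", "move"): 0.0,
        ("s1", "stay"): 1.0, ("s1", "move"): 0.0,
    },
)
print(toy_mdp.is_valid())
```

Storing each row of T as a sparse dictionary keeps the example readable; a dense array indexed by (s, a, s') would serve equally well for larger problems.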

Two properties of the MDP

• The action-dependent state transition is Markovian

• The state is fully observable after taking action a.

Illustration of MDPs

[Figure: the agent acts on the WORLD, which evolves according to T(s, a, s') and returns the state to the agent.]

Objective of MDPs

• Find the optimal policy $\pi$, mapping each state $s$ to an action $a$, that maximizes the value function $V(s)$:

$$V(s) = \max_{a} \Big[ R(s, a) + \gamma \sum_{s'} T(s, a, s')\, V(s') \Big]$$

where $\gamma$ is the discount factor.
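One common way to compute such a value function is value iteration, which applies the maximization over actions repeatedly until the values stop changing. The sketch below is a minimal version under stated assumptions: the discount factor `gamma`, the stopping tolerance `tol`, and the deterministic two-state example are all hypothetical and chosen only so the result is easy to check by hand.

```python
def q_value(s, a, V, T, R, gamma):
    """R(s, a) plus the discounted expected value of the next state."""
    return R[(s, a)] + gamma * sum(p * V[s2] for s2, p in T[(s, a)].items())


def value_iteration(states, actions, T, R, gamma=0.9, tol=1e-10):
    """Iterate the Bellman update until the largest change is below tol."""
    V = {s: 0.0 for s in states}
    while True:
        new_V = {s: max(q_value(s, a, V, T, R, gamma) for a in actions)
                 for s in states}
        delta = max(abs(new_V[s] - V[s]) for s in states)
        V = new_V
        if delta < tol:
            break
    # Greedy policy: the action attaining the maximum in each state.
    policy = {s: max(actions, key=lambda a: q_value(s, a, V, T, R, gamma))
              for s in states}
    return V, policy


# Hypothetical deterministic example: "stay" keeps the state, "move" flips it,
# and only staying in s1 is rewarded.
states = ["s0", "s1"]
actions = ["stay", "move"]
T = {
    ("s0", "stay"): {"s0": 1.0}, ("s0", "move"): {"s1": 1.0},
    ("s1", "stay"): {"s1": 1.0}, ("s1", "move"): {"s0": 1.0},
}
R = {("s0", "stay"): 0.0, ("s0", "move"): 0.0,
     ("s1", "stay"): 1.0, ("s1", "move"): 0.0}

V, policy = value_iteration(states, actions, T, R, gamma=0.9)
print(V, policy)
```

In this example the fixed point can be verified directly: staying in s1 forever is worth 1 / (1 - 0.9) = 10, and s0 is one discounted step away from it, so its value is 9.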

Introduction to POMDPs

The POMDP is defined by the tuple $\langle S, A, T, R, \Omega, O \rangle$:

$S$, $A$, $T$, and $R$ are defined as in MDPs.

$\Omega$ is a finite set of observations the agent can make of its world.

$O: S \times A \to \Pi(\Omega)$ is the observation function; $O(s', a, o)$ is the probability of making observation $o$ after performing action $a$ and landing in state $s'$.

Differences between MDPs and POMDPs

• The state is hidden after taking action a.

• The hidden state information is inferred through the action- and state-dependent observation function O(s', a, o).

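Uncertainty of state s in POMDPs

A new concept in POMDPs: the belief state b(s)

$$b_t(s) = \Pr(s_t = s \mid o_1, a_1, o_2, a_2, \ldots, o_{t-1}, a_{t-1}, o_t)$$

The belief state b(s) evolves according to Bayes' rule:

$$b'(s') = \frac{O(s', a, o) \sum_{s \in S} T(s, a, s')\, b(s)}{\Pr(o \mid a, b)} \tag{1}$$

[Figure: starting from belief b with n control intervals remaining, action a followed by observation o1 or o2 leads to b' = T(b | a, o1) or b' = T(b | a, o2) with n-1 control intervals remaining; the belief axes are labeled by states (s2, s3).]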
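The Bayes-rule update of the belief state can be sketched in a few lines. The function below predicts the next-state distribution with T, weights it by the observation probability O, and normalizes by Pr(o | a, b). The dictionary layout and the two-state "listen" example (observation correct with probability 0.85) are assumptions for illustration only.

```python
def belief_update(b, a, o, states, T, O):
    """Return b' given belief b, action a, and observation o.

    b: {s: probability}
    T[(s, a)]  -> {s_next: probability}   (T(s, a, s'))
    O[(s2, a)] -> {o: probability}        (O(s', a, o))
    """
    unnormalized = {}
    for s2 in states:
        predicted = sum(T[(s, a)].get(s2, 0.0) * b[s] for s in states)
        unnormalized[s2] = O[(s2, a)].get(o, 0.0) * predicted
    norm = sum(unnormalized.values())  # this is Pr(o | a, b)
    if norm == 0.0:
        raise ValueError("observation has zero probability under (b, a)")
    return {s2: v / norm for s2, v in unnormalized.items()}


# Hypothetical example: the hidden state is "left" or "right"; listening does
# not change it and reports the correct side with probability 0.85.
states = ["left", "right"]
T = {("left", "listen"): {"left": 1.0}, ("right", "listen"): {"right": 1.0}}
O = {
    ("left", "listen"): {"hear-left": 0.85, "hear-right": 0.15},
    ("right", "listen"): {"hear-left": 0.15, "hear-right": 0.85},
}

b0 = {"left": 0.5, "right": 0.5}
b1 = belief_update(b0, "listen", "hear-left", states, T, O)
b2 = belief_update(b1, "listen", "hear-left", states, T, O)
print(b1, b2)
```

Starting from a uniform belief, one "hear-left" observation moves the belief in "left" to 0.85, and a second consistent observation sharpens it further, which shows how the hidden state is inferred from observations alone.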

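Illustration of POMDPs

[Figure: the agent acts on the WORLD, governed by T(s, a, s') and O(s', a, o), which returns an observation. SE: state estimator, updating the belief b to b' using (1). π: policy search, mapping the belief b to an action a.]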

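Objective of POMDPs

• Find the optimal policy for POMDPs, mapping each belief point b to an action a, that maximizes the value function V(b):

$$
\begin{aligned}
V^{n}(b) &= \max_{a \in A} \sum_{s \in S} b(s) \sum_{s' \in S} T(s, a, s') \sum_{o \in \Omega} O(s', a, o) \big\{ R(s, a) + \gamma V^{n-1}(b') \big\} \\
&= \max_{a \in A} \Big[ \sum_{s \in S} b(s) R(s, a) + \gamma \sum_{s \in S} b(s) \sum_{s' \in S} T(s, a, s') \sum_{o \in \Omega} O(s', a, o)\, V^{n-1}(b') \Big]
\end{aligned}
\tag{2}
$$

Here $b'$ is the belief reached from $b$ after action $a$ and observation $o$, $n$ is the number of control intervals remaining, and $\gamma$ is the discount factor. The first term, $\sum_{s} b(s) R(s, a)$, is the expected immediate reward.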

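Optimal value function

• The optimal value function for a finite horizon is piecewise linear and convex in POMDPs:

$$V^{n}(b) = \max_{k} \Big[ \sum_{s \in S} \alpha_k^{n}(s)\, b(s) \Big] = \max_{k} \big[ \alpha_k^{n} \cdot b \big] \tag{3}$$

[Figure: the value function as the upper envelope of linear segments over the belief space; the segments p1, p2, p5 carry the actions a(p1), a(p2), a(p5).]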
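Evaluating a piecewise linear and convex value function amounts to taking the largest inner product between the belief and a set of α-vectors, and the maximizing vector also identifies the action to take. The sketch below assumes each α-vector is stored together with its action; the three vectors and their action names are hypothetical.

```python
def value_and_action(b, alpha_vectors):
    """Return (V(b), action), where V(b) = max_k alpha_k . b.

    alpha_vectors: list of (action, {s: alpha_k(s)}) pairs.
    """
    def dot(alpha):
        return sum(alpha[s] * b[s] for s in b)

    action, alpha = max(alpha_vectors, key=lambda pair: dot(pair[1]))
    return dot(alpha), action


# Hypothetical alpha-vectors over two states.
alpha_vectors = [
    ("a1", {"s0": 1.0, "s1": 0.0}),
    ("a2", {"s0": 0.0, "s1": 1.0}),
    ("a3", {"s0": 0.6, "s1": 0.6}),
]

print(value_and_action({"s0": 0.9, "s1": 0.1}, alpha_vectors))
print(value_and_action({"s0": 0.5, "s1": 0.5}, alpha_vectors))
```

Near the corners of the belief space the vectors for `a1` and `a2` dominate, while the flat vector for `a3` wins around the uniform belief; the maximum over linear functions is what makes the value function convex.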

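Substituting (3) and (1) into (2), the normalizer $\Pr(o \mid a, b)$ of (1) cancels:

$$
\begin{aligned}
V^{n}(b) &= \max_{a} \Big[ \sum_{s} b(s) R(s, a) + \gamma \sum_{o} \Pr(o \mid a, b)\, V^{n-1}(b') \Big] \\
&= \max_{a} \Big[ \sum_{s} b(s) R(s, a) + \gamma \sum_{o} \max_{k} \sum_{s} b(s) \sum_{s'} T(s, a, s')\, O(s', a, o)\, \alpha_k^{n-1}(s') \Big] \\
&= \max_{a} \sum_{s} b(s) \Big[ R(s, a) + \gamma \sum_{o} \sum_{s'} T(s, a, s')\, O(s', a, o)\, \alpha_{l(b,a,o)}^{n-1}(s') \Big]
\end{aligned}
$$

The index $l$ is obtained by maximizing over $k$:

$$l(b, a, o) = \arg\max_{k} \sum_{s} b(s) \sum_{s'} T(s, a, s')\, O(s', a, o)\, \alpha_k^{n-1}(s')$$

The left-hand side is the optimal value of belief point b. The bracketed term, evaluated at the maximizing action a, is the α-vector of belief point b.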
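The backup at a single belief point can be sketched as follows: for every action and observation, the previous α-vectors are projected back through T and O, the one that is best at b is selected (the index l), and the selected vectors are summed with the immediate reward. The discount factor `gamma` and the three-action example (listen, open-left, open-right) are assumptions for illustration and do not come from the tutorial.

```python
def backup(b, alphas, states, actions, observations, T, O, R, gamma=0.95):
    """Return (value at b, best action, new alpha-vector for b)."""
    best = None
    for a in actions:
        new_alpha = {s: R[(s, a)] for s in states}
        for o in observations:
            # Project every previous alpha-vector back through T and O.
            projected = [
                {s: sum(T[(s, a)].get(s2, 0.0) * O[(s2, a)].get(o, 0.0) * alpha[s2]
                        for s2 in states)
                 for s in states}
                for alpha in alphas
            ]
            # Maximizing at b selects the index l for this (a, o) pair.
            g = max(projected, key=lambda v: sum(b[s] * v[s] for s in states))
            for s in states:
                new_alpha[s] += gamma * g[s]
        value = sum(b[s] * new_alpha[s] for s in states)
        if best is None or value > best[0]:
            best = (value, a, new_alpha)
    return best


# Hypothetical example: listening is cheap and informative; opening the
# wrong side is heavily penalized and resets the hidden state.
states = ["left", "right"]
actions = ["listen", "open-left", "open-right"]
observations = ["hear-left", "hear-right"]
uniform_s = {"left": 0.5, "right": 0.5}
uniform_o = {"hear-left": 0.5, "hear-right": 0.5}
T = {("left", "listen"): {"left": 1.0}, ("right", "listen"): {"right": 1.0}}
O = {("left", "listen"): {"hear-left": 0.85, "hear-right": 0.15},
     ("right", "listen"): {"hear-left": 0.15, "hear-right": 0.85}}
for s in states:
    for a in ("open-left", "open-right"):
        T[(s, a)] = uniform_s
        O[(s, a)] = uniform_o
R = {("left", "listen"): -1.0, ("right", "listen"): -1.0,
     ("left", "open-left"): -100.0, ("right", "open-left"): 10.0,
     ("left", "open-right"): 10.0, ("right", "open-right"): -100.0}

zero = [{s: 0.0 for s in states}]  # value function with no steps remaining
print(backup({"left": 0.5, "right": 0.5}, zero, states, actions, observations, T, O, R))
print(backup({"left": 0.02, "right": 0.98}, zero, states, actions, observations, T, O, R))
```

With a zero terminal value function the backup reduces to maximizing the expected immediate reward, so an uncertain belief prefers listening while a confident one prefers opening the side believed to be safe.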

Approaches to solving POMDPs

• Exact algorithms: find all α-vectors for the whole belief space; exact but intractable for large problems.

• Approximate algorithms: find α-vectors for a subset of the belief space; fast and able to handle large problems.

Point-Based Value Iteration (PBVI)

• Focus on a finite set of belief points.

• Maintain an α-vector for each point.

[Figure: value function supported at the belief points b0, b1, b3, b4, b5.]
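A minimal sketch of the point-based idea is given below: a fixed, finite set of belief points is backed up repeatedly, and one α-vector is kept per point. It covers only the value-update step; the belief-set expansion of the published PBVI algorithm is omitted. The fixed belief set, the discount factor `gamma`, the zero initialization, and the three-action example are assumptions for illustration.

```python
def point_backup(b, alphas, states, actions, observations, T, O, R, gamma):
    """Return (action, alpha-vector) maximizing the backed-up value at b."""
    best_value, best_action, best_alpha = None, None, None
    for a in actions:
        new_alpha = {s: R[(s, a)] for s in states}
        for o in observations:
            projected = [
                {s: sum(T[(s, a)].get(s2, 0.0) * O[(s2, a)].get(o, 0.0) * alpha[s2]
                        for s2 in states)
                 for s in states}
                for alpha in alphas
            ]
            g = max(projected, key=lambda v: sum(b[s] * v[s] for s in states))
            for s in states:
                new_alpha[s] += gamma * g[s]
        value = sum(b[s] * new_alpha[s] for s in states)
        if best_value is None or value > best_value:
            best_value, best_action, best_alpha = value, a, new_alpha
    return best_action, best_alpha


def pbvi(beliefs, states, actions, observations, T, O, R, gamma=0.95, n_iter=30):
    """Keep one alpha-vector per belief point; back up all points n_iter times."""
    alphas = [{s: 0.0 for s in states}]
    policy = [None] * len(beliefs)
    for _ in range(n_iter):
        backed = [point_backup(b, alphas, states, actions, observations, T, O, R, gamma)
                  for b in beliefs]
        policy = [a for a, _ in backed]
        alphas = [alpha for _, alpha in backed]
    return alphas, policy


# Hypothetical example (same structure as a two-door listening problem).
states = ["left", "right"]
actions = ["listen", "open-left", "open-right"]
observations = ["hear-left", "hear-right"]
T = {("left", "listen"): {"left": 1.0}, ("right", "listen"): {"right": 1.0}}
O = {("left", "listen"): {"hear-left": 0.85, "hear-right": 0.15},
     ("right", "listen"): {"hear-left": 0.15, "hear-right": 0.85}}
for s in states:
    for a in ("open-left", "open-right"):
        T[(s, a)] = {"left": 0.5, "right": 0.5}
        O[(s, a)] = {"hear-left": 0.5, "hear-right": 0.5}
R = {("left", "listen"): -1.0, ("right", "listen"): -1.0,
     ("left", "open-left"): -100.0, ("right", "open-left"): 10.0,
     ("left", "open-right"): 10.0, ("right", "open-right"): -100.0}

beliefs = [{"left": p, "right": 1.0 - p} for p in (0.02, 0.25, 0.5, 0.75, 0.98)]
alphas, policy = pbvi(beliefs, states, actions, observations, T, O, R)
print(policy)
```

The cost of each sweep grows with the number of belief points rather than with the number of α-vectors an exact method would enumerate, which is the source of the speed-up mentioned for approximate algorithms.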

Region-Based Value Iteration (RBVI)

• RBVI maintains an α-vector for each convex region over which the optimal value function is linear.

• RBVI simultaneously determines the α-vectors for all relevant convex regions based on all available belief points.

The piecewise linear value function:

which can be reformulated as

by introducing hidden variables $z(b) = k$, denoting $b \in B_k$.

RBVI (Contd)

The belief space is partitioned using hyper-ellipsoids,

Then we have

RBVI (Contd)

The joint distribution of V(b) and b can be written as

Expectation-Maximization (EM) Estimation:

RBVI (Contd)

E step:

M step:

Applications of POMDPs

• Application of partially observable Markov decision processes to robot navigation in a minefield

• Application of partially observable Markov decision processes to feature selection

• Application of partially observable Markov decision processes to sensor scheduling

Some considerations in applying POMDPs to new problems

• How to define the state

• How to obtain the transition and observation matrices

• How to set the reward

References

1. L. P. Kaelbling, M. L. Littman, and A. R. Cassandra. Planning and acting in partially observable stochastic domains. Artificial Intelligence, Vol. 101, 1998.

2. R. D. Smallwood and E. J. Sondik. The optimal control of partially observable Markov processes over a finite horizon. Operations Research, 21:1071–1088, 1973.

3. J. Pineau, G. Gordon, and S. Thrun. Point-based value iteration: An anytime algorithm for POMDPs. International Joint Conference on Artificial Intelligence (IJCAI), Acapulco, Mexico, Aug. 2003.

4. D. P. Bertsekas. Dynamic Programming and Optimal Control, Vols. 1 and 2. Athena Scientific, Belmont, Massachusetts, 2001.

5. R. Bellman. Dynamic Programming. Princeton University Press, 1957.
