
Page 1:

Markov Decision Processes: A Survey

Adviser: Yeong-Sung Lin
Graduate Student: Cheng-Ta Lee

Network Optimization Research Group

March 22, 2004

Page 2:

Outline

Introduction
Markov Theory
Markov Decision Processes
Conclusion
Future Work

Page 3:

Introduction: Decision Theory

Probability Theory + Utility Theory = Decision Theory

Probability theory describes what an agent should believe based on evidence.
Utility theory describes what an agent wants.
Decision theory describes what an agent should do.

Page 4:

Introduction

Markov decision process (MDP) theory has developed substantially in the last three decades and has become an established topic within operational research.

Modeling of (infinite) sequences of recurring decision problems (general behavioral strategies).

MDPs defined
Objective functions
Policies

Page 5:

Markov Theory

Markov process: a mathematical model that is useful in the study of complex systems.
The basic concepts of a Markov process are the "state" of a system and state "transitions".
A graphic example of a Markov process is presented by a frog in a lily pond.
State transition system:
Discrete-time process
Continuous-time process

Page 6:

Markov Theory

To study the discrete-time process, suppose that there are N states in the system, numbered from 1 to N. If the system is a simple Markov process, then the probability of a transition to state j during the next time interval, given that the system now occupies state i, is a function only of i and j and not of any history of the system before its arrival in i.

In other words, we may specify a set of conditional probabilities $p_{ij}$, where

$$\sum_{j=1}^{N} p_{ij} = 1, \qquad 0 \le p_{ij} \le 1.$$
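As a small numerical illustration (not part of the original slides), the following minimal sketch, assuming numpy is available, encodes the toymaker's transition matrix introduced on the next slide and checks the two conditions above:

```python
import numpy as np

# Toymaker transition matrix (from the next slide): rows are the current state,
# columns the next state.
P = np.array([[0.5, 0.5],
              [0.4, 0.6]])

# Every p_ij lies in [0, 1] and every row sums to 1.
assert np.all((P >= 0) & (P <= 1))
assert np.allclose(P.sum(axis=1), 1.0)
```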

Page 7:

The Toymaker Example

First state: the toy is in great favor.
Second state: the toy is out of favor.
Matrix form and transition diagram:

$$P = [p_{ij}] = \begin{bmatrix} 0.5 & 0.5 \\ 0.4 & 0.6 \end{bmatrix}$$

Page 8:

The Toymaker Example

Let $\pi_i(n)$ denote the probability that the system will occupy state i after n transitions, if its state at n = 0 is known. It follows that

$$\sum_{i=1}^{N} \pi_i(n) = 1,$$

$$\pi_j(n+1) = \sum_{i=1}^{N} \pi_i(n)\, p_{ij}, \qquad n = 0, 1, 2, \ldots,$$

or, in vector form, $\pi(n+1) = \pi(n) P$, $n = 0, 1, 2, \ldots$. Hence

$$\pi(1) = \pi(0) P, \quad \pi(2) = \pi(1) P = \pi(0) P^2, \quad \pi(3) = \pi(2) P = \pi(0) P^3,$$

and in general $\pi(n) = \pi(0) P^n$, $n = 0, 1, 2, \ldots$.
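A minimal sketch (not part of the original slides), assuming numpy is available, that iterates the recursion above for the toymaker and reproduces the successive state probabilities shown in Table 1.1 two slides below:

```python
import numpy as np

P = np.array([[0.5, 0.5],
              [0.4, 0.6]])
pi = np.array([1.0, 0.0])        # start with a successful toy: pi(0) = [1, 0]
for n in range(1, 6):
    pi = pi @ P                  # pi(n) = pi(n-1) P
    print(n, pi)                 # [0.5 0.5], [0.45 0.55], [0.445 0.555], ...
```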

Page 9:

The Toymaker Example

If the toymaker starts with a successful toy, then $\pi_1(0) = 1$ and $\pi_2(0) = 0$, so that $\pi(0) = [1 \;\; 0]$ and

$$\pi(1) = \pi(0)P = [1 \;\; 0]\begin{bmatrix} 0.5 & 0.5 \\ 0.4 & 0.6 \end{bmatrix} = [0.5 \;\; 0.5]$$

$$\pi(2) = \pi(1)P = [0.5 \;\; 0.5]\begin{bmatrix} 0.5 & 0.5 \\ 0.4 & 0.6 \end{bmatrix} = \left[\tfrac{9}{20} \;\; \tfrac{11}{20}\right]$$

$$\pi(3) = \pi(2)P = \left[\tfrac{9}{20} \;\; \tfrac{11}{20}\right]\begin{bmatrix} 0.5 & 0.5 \\ 0.4 & 0.6 \end{bmatrix} = \left[\tfrac{89}{200} \;\; \tfrac{111}{200}\right]$$

Page 10:

The Toymaker Example

Table 1.1  Successive State Probabilities of Toymaker Starting with a Successful Toy

n      = 0    1     2      3       4        5        ...
π1(n)  = 1    0.5   0.45   0.445   0.4445   0.44445  ...
π2(n)  = 0    0.5   0.55   0.555   0.5555   0.55555  ...

Table 1.2  Successive State Probabilities of Toymaker Starting without a Successful Toy

n      = 0    1     2      3       4        5        ...
π1(n)  = 0    0.4   0.44   0.444   0.4444   0.44444  ...
π2(n)  = 1    0.6   0.56   0.556   0.5556   0.55556  ...

Page 11:

The Toymaker Example

The row vector $\pi$ with components $\pi_i$ is thus the limit as n approaches infinity of $\pi(n)$; it satisfies

$$\pi P = \pi, \qquad \sum_{i=1}^{N} \pi_i = 1.$$

For the toymaker,

$$\pi_1 = 0.5\pi_1 + 0.4\pi_2, \qquad \pi_2 = 0.5\pi_1 + 0.6\pi_2, \qquad \pi_1 + \pi_2 = 1,$$

so that $\pi_1 = \tfrac{4}{9}$ and $\pi_2 = \tfrac{5}{9}$.

Page 12:

z-Transformation

For the study of transient behavior and for theoretical convenience, it is useful to study the Markov process from the point of view of the generating function or, as we shall call it, the z-transform.

Consider a time function f(n) that takes on arbitrary values f(0), f(1), f(2), and so on, at nonnegative, discrete, integrally spaced points of time and that is zero for negative time.

Such a time function is shown in Fig. 2.4.

Fig. 2.4  An arbitrary discrete-time function

Page 13:

z-Transformation

The z-transform F(z) is defined such that

$$F(z) = \sum_{n=0}^{\infty} f(n) z^n$$

Table 1.3  z-Transform Pairs

Time Function (n >= 0)           z-Transform
f(n)                             F(z)
f1(n) + f2(n)                    F1(z) + F2(z)
kf(n)  (k is a constant)         kF(z)
f(n-1)                           zF(z)
f(n+1)                           z^{-1}[F(z) - f(0)]
1  (unit step)                   1/(1-z)
n  (unit ramp)                   z/(1-z)^2
α^n                              1/(1-αz)
nα^n                             αz/(1-αz)^2
α^n f(n)                         F(αz)

Page 14:

z-Transformation

Consider first the step function

$$f(n) = \begin{cases} 1 & n = 0, 1, 2, 3, \ldots \\ 0 & n < 0 \end{cases}$$

Its z-transform is

$$F(z) = \sum_{n=0}^{\infty} f(n) z^n = 1 + z + z^2 + z^3 + \cdots, \quad \text{or} \quad F(z) = \frac{1}{1-z}.$$

For the geometric sequence $f(n) = \alpha^n$, $n \ge 0$,

$$F(z) = \sum_{n=0}^{\infty} f(n) z^n = \sum_{n=0}^{\infty} (\alpha z)^n, \quad \text{or} \quad F(z) = \frac{1}{1-\alpha z}.$$
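A quick numerical spot-check of these two transform pairs (not part of the original slides); the sample values of z and α below are arbitrary choices, and the partial sums simply approximate the infinite series:

```python
# Check F(z) = sum_{n>=0} f(n) z^n against the closed forms on this slide.
z, alpha = 0.3, 0.5
step_partial = sum(z**n for n in range(200))                 # f(n) = 1
geom_partial = sum((alpha**n) * (z**n) for n in range(200))  # f(n) = alpha^n
print(step_partial, 1 / (1 - z))            # both ~1.42857
print(geom_partial, 1 / (1 - alpha * z))    # both ~1.17647
```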

Page 15:

z-Transformation

We shall now use the z-transform to analyze Markov processes. Transforming $\pi(n+1) = \pi(n)P$, $n = 0, 1, 2, \ldots$, gives

$$z^{-1}\big[\Pi(z) - \pi(0)\big] = \Pi(z) P$$

$$\Pi(z) - \pi(0) = z\,\Pi(z) P$$

$$\Pi(z)(I - zP) = \pi(0)$$

$$\Pi(z) = \pi(0)(I - zP)^{-1}$$

In this expression I is the identity matrix.

Page 16:

z-Transformation

Let us investigate the toymaker's problem by z-transformation. With

$$P = \begin{bmatrix} \tfrac{1}{2} & \tfrac{1}{2} \\ \tfrac{2}{5} & \tfrac{3}{5} \end{bmatrix}, \qquad I - zP = \begin{bmatrix} 1 - \tfrac{1}{2}z & -\tfrac{1}{2}z \\ -\tfrac{2}{5}z & 1 - \tfrac{3}{5}z \end{bmatrix},$$

the determinant of $I - zP$ is $(1 - z)(1 - \tfrac{1}{10}z)$, so

$$(I - zP)^{-1} = \frac{1}{(1-z)(1 - \tfrac{1}{10}z)} \begin{bmatrix} 1 - \tfrac{3}{5}z & \tfrac{1}{2}z \\ \tfrac{2}{5}z & 1 - \tfrac{1}{2}z \end{bmatrix}.$$

Expanding each element in partial fractions yields

$$(I - zP)^{-1} = \frac{1}{1-z}\begin{bmatrix} \tfrac{4}{9} & \tfrac{5}{9} \\ \tfrac{4}{9} & \tfrac{5}{9} \end{bmatrix} + \frac{1}{1 - \tfrac{1}{10}z}\begin{bmatrix} \tfrac{5}{9} & -\tfrac{5}{9} \\ -\tfrac{4}{9} & \tfrac{4}{9} \end{bmatrix}.$$

Let the matrix H(n) be the inverse transform of $(I - zP)^{-1}$ on an element-by-element basis:

$$H(n) = \begin{bmatrix} \tfrac{4}{9} & \tfrac{5}{9} \\ \tfrac{4}{9} & \tfrac{5}{9} \end{bmatrix} + \left(\tfrac{1}{10}\right)^{n}\begin{bmatrix} \tfrac{5}{9} & -\tfrac{5}{9} \\ -\tfrac{4}{9} & \tfrac{4}{9} \end{bmatrix}.$$

Then $\Pi(z) = \pi(0)(I - zP)^{-1}$ becomes, by inverse transformation, $\pi(n) = \pi(0) H(n)$.
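A minimal numerical check of this closed form (not part of the original slides), assuming numpy is available: H(n) should coincide with the n-th power of P for every n.

```python
import numpy as np

P = np.array([[0.5, 0.5],
              [0.4, 0.6]])
S = np.array([[4/9, 5/9], [4/9, 5/9]])      # steady-state component of H(n)
T = np.array([[5/9, -5/9], [-4/9, 4/9]])    # transient component of H(n)
for n in range(6):
    H = S + (0.1 ** n) * T                  # closed form derived on this slide
    assert np.allclose(np.linalg.matrix_power(P, n), H)
```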

Page 17:

z-Transformation

If the toymaker starts in the successful state 1, then $\pi(0) = [1 \;\; 0]$ and

$$\pi(n) = \left[\tfrac{4}{9} + \tfrac{5}{9}\left(\tfrac{1}{10}\right)^n \;\;\; \tfrac{5}{9} - \tfrac{5}{9}\left(\tfrac{1}{10}\right)^n\right],$$

or $\pi_1(n) = \tfrac{4}{9} + \tfrac{5}{9}(\tfrac{1}{10})^n$, $\pi_2(n) = \tfrac{5}{9} - \tfrac{5}{9}(\tfrac{1}{10})^n$.

If the toymaker starts in the unsuccessful state 2, then $\pi(0) = [0 \;\; 1]$ and

$$\pi(n) = \left[\tfrac{4}{9} - \tfrac{4}{9}\left(\tfrac{1}{10}\right)^n \;\;\; \tfrac{5}{9} + \tfrac{4}{9}\left(\tfrac{1}{10}\right)^n\right],$$

or $\pi_1(n) = \tfrac{4}{9} - \tfrac{4}{9}(\tfrac{1}{10})^n$, $\pi_2(n) = \tfrac{5}{9} + \tfrac{4}{9}(\tfrac{1}{10})^n$.

We have now obtained analytic forms for the data in Tables 1.1 and 1.2.

Page 18:

Laplace Transformation

We shall extend our previous work to the case in which the process may make transitions at random time intervals.

The Laplace transform of a time function f(t) which is zero for t < 0 is defined by

$$F(s) = \int_{0}^{\infty} f(t)\, e^{-st}\, dt$$

Table 2.4  Laplace Transform Pairs

Time Function (t >= 0)           Laplace Transform
f(t)                             F(s)
f1(t) + f2(t)                    F1(s) + F2(s)
kf(t)  (k is a constant)         kF(s)
df(t)/dt                         sF(s) - f(0)
1  (unit step)                   1/s
t  (unit ramp)                   1/s^2
e^{-at}                          1/(s+a)
t e^{-at}                        1/(s+a)^2
e^{-at} f(t)                     F(s+a)

Page 19:

Laplace Transformation

Recall $F(s) = \int_0^{\infty} f(t) e^{-st}\,dt$. For the unit step

$$f(t) = \begin{cases} 1 & t \ge 0 \\ 0 & t < 0 \end{cases} \qquad \text{the transform is} \qquad F(s) = \int_0^{\infty} e^{-st}\,dt = \frac{1}{s}.$$

For the exponential $f(t) = e^{-at}$, $t \ge 0$,

$$F(s) = \int_0^{\infty} e^{-at} e^{-st}\,dt = \int_0^{\infty} e^{-(s+a)t}\,dt = \frac{1}{s+a}.$$

For a continuous-time Markov process with transition rates $a_{ij}$, over a small interval dt

$$\pi_j(t + dt) = \pi_j(t)\big[1 - a_j\,dt\big] + \sum_{i \ne j} \pi_i(t)\,a_{ij}\,dt, \qquad a_j = \sum_{i \ne j} a_{ji},$$

and letting $dt \to 0$ yields the differential equations used on the next slide.

Page 20:

Laplace Transformation

We shall now use the Laplace transform to analyze Markov processes. The state probabilities of a continuous-time process obey

$$\frac{d}{dt}\pi_j(t) = \sum_{i=1}^{N} \pi_i(t)\,a_{ij}, \qquad \text{or} \qquad \frac{d}{dt}\pi(t) = \pi(t)A,$$

with solution $\pi(t) = \pi(0)e^{At}$, where

$$e^{At} = I + At + \frac{A^2 t^2}{2!} + \frac{A^3 t^3}{3!} + \cdots.$$

For discrete processes, $\pi(n) = \pi(0)P^n$, $n = 0, 1, 2, \ldots$, so the two descriptions are related by $e^{A} = P$, or $A = \ln P$.

Transforming $\frac{d}{dt}\pi(t) = \pi(t)A$ gives

$$s\,\Pi(s) - \pi(0) = \Pi(s)A, \qquad \Pi(s)(sI - A) = \pi(0), \qquad \Pi(s) = \pi(0)(sI - A)^{-1}.$$

Page 21:

Laplace Transformation

Recall the toymaker's initial policy, for which the transition-probability matrix was

$$P = \begin{bmatrix} \tfrac{1}{2} & \tfrac{1}{2} \\ \tfrac{2}{5} & \tfrac{3}{5} \end{bmatrix}.$$

For the continuous-time formulation of this process the transition-rate matrix is taken to be

$$A = \begin{bmatrix} -5 & 5 \\ 4 & -4 \end{bmatrix}, \qquad sI - A = \begin{bmatrix} s+5 & -5 \\ -4 & s+4 \end{bmatrix},$$

whose determinant is $s(s+9)$, so

$$(sI - A)^{-1} = \frac{1}{s(s+9)}\begin{bmatrix} s+4 & 5 \\ 4 & s+5 \end{bmatrix}.$$

Expanding each element in partial fractions yields

$$(sI - A)^{-1} = \frac{1}{s}\begin{bmatrix} \tfrac{4}{9} & \tfrac{5}{9} \\ \tfrac{4}{9} & \tfrac{5}{9} \end{bmatrix} + \frac{1}{s+9}\begin{bmatrix} \tfrac{5}{9} & -\tfrac{5}{9} \\ -\tfrac{4}{9} & \tfrac{4}{9} \end{bmatrix}.$$

Page 22:

Laplace Transformation

Let the matrix H(t) be the inverse transform of $(sI - A)^{-1}$. Then $\Pi(s) = \pi(0)(sI - A)^{-1}$ becomes, by inverse transformation, $\pi(t) = \pi(0)H(t)$, where

$$H(t) = \begin{bmatrix} \tfrac{4}{9} & \tfrac{5}{9} \\ \tfrac{4}{9} & \tfrac{5}{9} \end{bmatrix} + e^{-9t}\begin{bmatrix} \tfrac{5}{9} & -\tfrac{5}{9} \\ -\tfrac{4}{9} & \tfrac{4}{9} \end{bmatrix}.$$
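A minimal numerical check of this closed form (not part of the original slides), assuming numpy and scipy are available: H(t) should coincide with the matrix exponential $e^{At}$.

```python
import numpy as np
from scipy.linalg import expm   # matrix exponential

A = np.array([[-5.0, 5.0],
              [4.0, -4.0]])
S = np.array([[4/9, 5/9], [4/9, 5/9]])
T = np.array([[5/9, -5/9], [-4/9, 4/9]])
for t in (0.0, 0.1, 0.5, 1.0):
    H = S + np.exp(-9 * t) * T              # closed form of H(t) on this slide
    assert np.allclose(expm(A * t), H)
```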

Page 23:

Laplace Transformation

If the toymaker starts in the successful state 1, then $\pi(0) = [1 \;\; 0]$ and

$$\pi(t) = \left[\tfrac{4}{9} + \tfrac{5}{9}e^{-9t} \;\;\; \tfrac{5}{9} - \tfrac{5}{9}e^{-9t}\right],$$

or $\pi_1(t) = \tfrac{4}{9} + \tfrac{5}{9}e^{-9t}$, $\pi_2(t) = \tfrac{5}{9} - \tfrac{5}{9}e^{-9t}$.

If the toymaker starts in the unsuccessful state 2, then $\pi(0) = [0 \;\; 1]$ and

$$\pi(t) = \left[\tfrac{4}{9} - \tfrac{4}{9}e^{-9t} \;\;\; \tfrac{5}{9} + \tfrac{4}{9}e^{-9t}\right],$$

or $\pi_1(t) = \tfrac{4}{9} - \tfrac{4}{9}e^{-9t}$, $\pi_2(t) = \tfrac{5}{9} + \tfrac{4}{9}e^{-9t}$.

These are the continuous-time analogues of the analytic forms obtained for Tables 1.1 and 1.2.

Page 24:

Markov Decision Processes

MDP theory applies dynamic programming to the solution of a stochastic decision problem with a finite number of states.

The transition probabilities between the states are described by a Markov chain.

The reward structure of the process is described by a matrix that represents the revenue (or cost) associated with movement from one state to another.

Both the transition and revenue matrices depend on the decision alternatives available to the decision maker.

The objective of the problem is to determine the optimal policy that maximizes the expected revenue over a finite or infinite number of stages.

Page 25:

Markov Process with Rewards

Suppose that an N-state Markov process earns rij dollars when it makes a transition from state i to j.

We call rij the “reward” associated with the transition from i to j.

The rewards need not be in dollars; they could be voltage levels, units of production, or any other physical quantity relevant to the problem.

Let us define vi(n) as the expected total earnings in the next n transitions if the system is now in state i.

Page 26:

Markov Process with Rewards

Recurrence relation:

$$v_i(n) = \sum_{j=1}^{N} p_{ij}\big[r_{ij} + v_j(n-1)\big], \qquad i = 1, 2, \ldots, N;\; n = 1, 2, 3, \ldots$$

$$v_i(n) = \sum_{j=1}^{N} p_{ij} r_{ij} + \sum_{j=1}^{N} p_{ij} v_j(n-1)$$

Defining the expected immediate reward

$$q_i = \sum_{j=1}^{N} p_{ij} r_{ij}, \qquad i = 1, 2, \ldots, N,$$

we obtain

$$v_i(n) = q_i + \sum_{j=1}^{N} p_{ij} v_j(n-1), \qquad i = 1, 2, \ldots, N;\; n = 1, 2, 3, \ldots,$$

or, in vector form, $v(n) = q + P\,v(n-1)$.
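A minimal sketch (not part of the original slides), assuming numpy is available, that iterates this recurrence for the toymaker and reproduces Table 3.1 on the next slide:

```python
import numpy as np

# Toymaker under the original policy; q_i = sum_j p_ij r_ij gives q = [6, -3].
P = np.array([[0.5, 0.5],
              [0.4, 0.6]])
R = np.array([[9.0, 3.0],
              [3.0, -7.0]])
q = (P * R).sum(axis=1)          # expected immediate rewards: [6, -3]

v = np.zeros(2)                  # v(0) = 0
for n in range(1, 6):
    v = q + P @ v                # v(n) = q + P v(n-1)
    print(n, v)                  # [6 -3], [7.5 -2.4], [8.55 -1.44], ...
```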

Page 27:

The Toymaker Example

$$P = \begin{bmatrix} 0.5 & 0.5 \\ 0.4 & 0.6 \end{bmatrix}, \qquad R = \begin{bmatrix} 9 & 3 \\ 3 & -7 \end{bmatrix}, \qquad q = \begin{bmatrix} 6 \\ -3 \end{bmatrix}$$

$$v_i(n) = q_i + \sum_{j=1}^{N} p_{ij} v_j(n-1), \qquad i = 1, 2, \ldots, N;\; n = 1, 2, 3, \ldots$$

Table 3.1  Total Expected Reward for Toymaker as a Function of State and Number of Weeks Remaining

n      = 0    1     2      3       4        5         ...
v1(n)  = 0    6     7.5    8.55    9.555    10.5555   ...
v2(n)  = 0   -3    -2.4   -1.44   -0.444     0.5556   ...

Page 28:

Toymaker's problem: total expected reward in each state as a function of weeks remaining.

Page 29:

z-Transform Analysis of the Markov Process with Rewards

The z-transform of the total-value vector v(n) will be called V(z), where

$$V(z) = \sum_{n=0}^{\infty} v(n) z^n.$$

Transforming $v(n+1) = q + P\,v(n)$, $n = 0, 1, 2, \ldots$, gives

$$z^{-1}\big[V(z) - v(0)\big] = \frac{1}{1-z}\,q + P\,V(z)$$

$$V(z) - v(0) = \frac{z}{1-z}\,q + z P\,V(z)$$

$$(I - zP)\,V(z) = \frac{z}{1-z}\,q + v(0)$$

$$V(z) = \frac{z}{1-z}(I - zP)^{-1} q + (I - zP)^{-1} v(0).$$

With $v(0) = 0$,

$$V(z) = \frac{z}{1-z}(I - zP)^{-1} q.$$

Page 30:

z-Transform Analysis of the Markov Process with Rewards

From the toymaker's problem,

$$(I - zP)^{-1} = \frac{1}{1-z}\begin{bmatrix} \tfrac{4}{9} & \tfrac{5}{9} \\ \tfrac{4}{9} & \tfrac{5}{9} \end{bmatrix} + \frac{1}{1 - \tfrac{1}{10}z}\begin{bmatrix} \tfrac{5}{9} & -\tfrac{5}{9} \\ -\tfrac{4}{9} & \tfrac{4}{9} \end{bmatrix},$$

so that

$$\frac{z}{1-z}(I - zP)^{-1} = \frac{z}{(1-z)^2}\begin{bmatrix} \tfrac{4}{9} & \tfrac{5}{9} \\ \tfrac{4}{9} & \tfrac{5}{9} \end{bmatrix} + \frac{z}{(1-z)(1 - \tfrac{1}{10}z)}\begin{bmatrix} \tfrac{5}{9} & -\tfrac{5}{9} \\ -\tfrac{4}{9} & \tfrac{4}{9} \end{bmatrix}.$$

Let the matrix F(n) be the inverse transform of $\frac{z}{1-z}(I - zP)^{-1}$. Since the inverse transform of $\frac{z}{(1-z)^2}$ is $n$ and that of $\frac{z}{(1-z)(1-\frac{1}{10}z)}$ is $\frac{10}{9}\big[1 - (\tfrac{1}{10})^n\big]$,

$$F(n) = n\begin{bmatrix} \tfrac{4}{9} & \tfrac{5}{9} \\ \tfrac{4}{9} & \tfrac{5}{9} \end{bmatrix} + \frac{10}{9}\Big[1 - \big(\tfrac{1}{10}\big)^n\Big]\begin{bmatrix} \tfrac{5}{9} & -\tfrac{5}{9} \\ -\tfrac{4}{9} & \tfrac{4}{9} \end{bmatrix}.$$

The total-value vector v(n) is then F(n)q by inverse transformation of $V(z) = \frac{z}{1-z}(I - zP)^{-1} q$, and, since $q = \begin{bmatrix} 6 \\ -3 \end{bmatrix}$,

$$v_1(n) = n + \tfrac{50}{9}\Big[1 - \big(\tfrac{1}{10}\big)^n\Big], \qquad v_2(n) = n - \tfrac{40}{9}\Big[1 - \big(\tfrac{1}{10}\big)^n\Big].$$

We see that, as n becomes very large, $v_1(n) \to n + \tfrac{50}{9}$ and $v_2(n) \to n - \tfrac{40}{9}$: both v1(n) and v2(n) have slope 1 and v1(n) - v2(n) = 10.

Page 31:

Optimization Techniques in General Markov Decision Processes

Value Iteration
Exhaustive Enumeration
Policy Iteration
Linear Programming
Lagrangian Relaxation

Page 32:

Value Iteration

The toymaker's alternatives (state 1: advertise or not; state 2: do research or not):

State 1 (successful toy):
  Alternative 1 (no advertising):  [p_1j^1] = [0.5  0.5],   [r_1j^1] = [9  3]
  Alternative 2 (advertising):     [p_1j^2] = [0.8  0.2],   [r_1j^2] = [4  4]

State 2 (unsuccessful toy):
  Alternative 1 (no research):     [p_2j^1] = [0.4  0.6],   [r_2j^1] = [3  -7]
  Alternative 2 (research):        [p_2j^2] = [0.7  0.3],   [r_2j^2] = [1  -19]

Page 33:

Diagram of States and Alternatives

Page 34:

The Toymaker’s Problem Solved by Value Iteration

The quantity $q_i^k$ is the expected reward from a single transition from state i under alternative k:

$$q_i^k = \sum_{j=1}^{N} p_{ij}^k\, r_{ij}^k, \qquad i = 1, 2, \ldots, N.$$

The alternatives for the toymaker are presented in the following table.

State i               Alternative k        p_i1^k  p_i2^k   r_i1^k  r_i2^k   q_i^k
1 (Successful toy)    1 (No advertising)   0.5     0.5      9       3        6
                      2 (Advertising)      0.8     0.2      4       4        4
2 (Unsuccessful toy)  1 (No research)      0.4     0.6      3       -7       -3
                      2 (Research)         0.7     0.3      1       -19      -5
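A quick numerical check of the last column (not part of the original slides), computing each $q_i^k$ from the transition probabilities and rewards in the table:

```python
# q_i^k = sum_j p_ij^k * r_ij^k for the four (state, alternative) pairs above.
data = {  # (i, k): (transition probabilities, rewards)
    (1, 1): ([0.5, 0.5], [9, 3]),
    (1, 2): ([0.8, 0.2], [4, 4]),
    (2, 1): ([0.4, 0.6], [3, -7]),
    (2, 2): ([0.7, 0.3], [1, -19]),
}
q = {ik: sum(p * r for p, r in zip(*pr)) for ik, pr in data.items()}
print(q)   # {(1,1): 6.0, (1,2): 4.0, (2,1): -3.0, (2,2): -5.0}
```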

Page 35:

The Toymaker’s Problem Solved by Value Iteration

We call $d_i(n)$ the "decision" in state i at the nth stage. When $d_i(n)$ has been specified for all i and all n, a "policy" has been determined. The optimal policy is the one that maximizes total expected return for each i and n.

To analyze this problem, let us redefine $v_i(n)$ as the total expected return in n stages starting from state i if an optimal policy is followed. It follows that for any n

$$v_i(n+1) = \max_k \sum_{j=1}^{N} p_{ij}^k \big[r_{ij}^k + v_j(n)\big], \qquad n = 0, 1, 2, \ldots$$

"Principle of optimality" of dynamic programming: in an optimal sequence of decisions or choices, each subsequence must also be optimal.

Page 36:

The Toymaker’s Problem Solved by Value Iteration

Using $q_i^k = \sum_{j=1}^{N} p_{ij}^k r_{ij}^k$, $i = 1, 2, \ldots, N$, the recurrence becomes

$$v_i(n+1) = \max_k \Big[q_i^k + \sum_{j=1}^{N} p_{ij}^k\, v_j(n)\Big], \qquad n = 0, 1, 2, \ldots$$

Table 3.6  Toymaker's Problem Solved by Value Iteration

n      = 0    1     2      3       4       ...
v1(n)  = 0    6     8.2    10.22   12.222  ...
v2(n)  = 0   -3    -1.7     0.23    2.223  ...
d1(n)  = -    1     2       2       2      ...
d2(n)  = -    1     2       2       2      ...
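A minimal sketch of this value-iteration recurrence (not part of the original slides), using the toymaker data from the preceding table; running it reproduces Table 3.6:

```python
# P[(i, k)] = transition probabilities from state i under alternative k;
# q[(i, k)] = expected immediate reward; v(0) = 0.
P = {(1, 1): [0.5, 0.5], (1, 2): [0.8, 0.2],
     (2, 1): [0.4, 0.6], (2, 2): [0.7, 0.3]}
q = {(1, 1): 6.0, (1, 2): 4.0, (2, 1): -3.0, (2, 2): -5.0}

v = {1: 0.0, 2: 0.0}
for n in range(1, 5):
    new_v, decision = {}, {}
    for i in (1, 2):
        # v_i(n) = max_k [ q_i^k + sum_j p_ij^k v_j(n-1) ]
        best = max((q[i, k] + sum(p * v[j] for p, j in zip(P[i, k], (1, 2))), k)
                   for k in (1, 2))
        new_v[i], decision[i] = best
    v = new_v
    print(n, v, decision)   # (6, -3, d=1,1), (8.2, -1.7, d=2,2), (10.22, 0.23, d=2,2), ...
```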

Page 37:

The Toymaker’s Problem Solved by Value Iteration

Note that for n=2, 3, and 4, the second alternative in each state is to be preferred. This means that the toymaker is better advised to advertise and to carry on research in spite of the costs of these activities.

For this problem the convergence seems to have taken place at n=2, and the second alternative in each state has been chosen. However, in many problems it is difficult to tell when convergence has been obtained.

Page 38:

Evaluation of the Value-Iteration Approach

The value-iteration method is not particularly suited to long-duration (infinite-stage) processes.

Page 39:

Exhaustive Enumeration

This is one of the methods for solving the infinite-stage problem.

The method calls for evaluating all possible stationary policies of the decision problem.

This is equivalent to an exhaustive enumeration process and can be used only if the number of stationary policies is reasonably small.

Page 40:

Exhaustive Enumeration

Suppose that the decision problem has S stationary policies, and assume that P^s and R^s are the (one-step) transition and revenue matrices associated with policy s, s = 1, 2, …, S.

Page 41:

Exhaustive Enumeration

The steps of the exhaustive enumeration method are as follows.

Step 1. Compute $v_i^s$, the expected one-step (one-period) revenue of policy s given state i, i = 1, 2, …, m.

Step 2. Compute $\pi_i^s$, the long-run stationary probabilities of the transition matrix $P^s$ associated with policy s. These probabilities, when they exist, are computed from the equations

$$\pi^s P^s = \pi^s, \qquad \pi_1^s + \pi_2^s + \cdots + \pi_m^s = 1,$$

where $\pi^s = (\pi_1^s, \pi_2^s, \ldots, \pi_m^s)$.

Step 3. Determine $E^s$, the expected revenue of policy s per transition step (period), by using the formula

$$E^s = \sum_{i=1}^{m} \pi_i^s v_i^s.$$

Step 4. The optimal policy s* is determined such that

$$E^{s^*} = \max_s \{E^s\}.$$
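A minimal sketch of Steps 2 and 3 (not part of the original slides), assuming numpy is available, evaluated here for the gardener's policy s = 2 used as the worked example a few slides below:

```python
import numpy as np

def stationary_distribution(P):
    """Solve pi P = pi together with sum(pi) = 1 (least-squares on the stacked system)."""
    N = P.shape[0]
    A = np.vstack([P.T - np.eye(N), np.ones(N)])
    b = np.append(np.zeros(N), 1.0)
    return np.linalg.lstsq(A, b, rcond=None)[0]

# Policy s = 2 of the gardener problem ("fertilize regardless of the state").
P2 = np.array([[0.30, 0.60, 0.10],
               [0.10, 0.60, 0.30],
               [0.05, 0.40, 0.55]])
v2 = np.array([4.7, 3.1, 0.4])       # expected one-step revenue per state
pi2 = stationary_distribution(P2)
print(pi2)                           # ~[6/59, 31/59, 22/59]
print(pi2 @ v2)                      # expected revenue per period, ~2.256
```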

Page 42:

Exhaustive Enumeration

We illustrate the method by solving the gardener problem for an infinite-period planning horizon.

The gardener problem has a total of eight stationary policies, as the following table shows:

Stationary policy, s Action

1 Do not fertilize at all.

2 Fertilize regardless of the state.

3 Fertilize if in state 1.

4 Fertilize if in state 2.

5 Fertilize if in state 3.

6 Fertilize if in state 1 or 2.

7 Fertilize if in state 1 or 3.

8 Fertilize if in state 2 or 3.

Page 43:

Exhaustive Enumeration

The matrices P^s and R^s for policies 1 and 2 are

$$P^1 = \begin{bmatrix} 0.2 & 0.5 & 0.3 \\ 0 & 0.5 & 0.5 \\ 0 & 0 & 1 \end{bmatrix}, \quad R^1 = \begin{bmatrix} 7 & 6 & 3 \\ 0 & 5 & 1 \\ 0 & 0 & -1 \end{bmatrix}, \qquad P^2 = \begin{bmatrix} 0.3 & 0.6 & 0.1 \\ 0.1 & 0.6 & 0.3 \\ 0.05 & 0.4 & 0.55 \end{bmatrix}, \quad R^2 = \begin{bmatrix} 6 & 5 & -1 \\ 7 & 4 & 0 \\ 6 & 3 & -2 \end{bmatrix}.$$

The matrices for policies 3 through 8 are derived from those of policies 1 and 2 by taking, for each state, the row of P^2 and R^2 if that state is fertilized and the row of P^1 and R^1 otherwise:

$$P^3 = \begin{bmatrix} 0.3 & 0.6 & 0.1 \\ 0 & 0.5 & 0.5 \\ 0 & 0 & 1 \end{bmatrix}, \quad R^3 = \begin{bmatrix} 6 & 5 & -1 \\ 0 & 5 & 1 \\ 0 & 0 & -1 \end{bmatrix}, \qquad P^4 = \begin{bmatrix} 0.2 & 0.5 & 0.3 \\ 0.1 & 0.6 & 0.3 \\ 0 & 0 & 1 \end{bmatrix}, \quad R^4 = \begin{bmatrix} 7 & 6 & 3 \\ 7 & 4 & 0 \\ 0 & 0 & -1 \end{bmatrix},$$

$$P^5 = \begin{bmatrix} 0.2 & 0.5 & 0.3 \\ 0 & 0.5 & 0.5 \\ 0.05 & 0.4 & 0.55 \end{bmatrix}, \quad R^5 = \begin{bmatrix} 7 & 6 & 3 \\ 0 & 5 & 1 \\ 6 & 3 & -2 \end{bmatrix}, \qquad P^6 = \begin{bmatrix} 0.3 & 0.6 & 0.1 \\ 0.1 & 0.6 & 0.3 \\ 0 & 0 & 1 \end{bmatrix}, \quad R^6 = \begin{bmatrix} 6 & 5 & -1 \\ 7 & 4 & 0 \\ 0 & 0 & -1 \end{bmatrix},$$

$$P^7 = \begin{bmatrix} 0.3 & 0.6 & 0.1 \\ 0 & 0.5 & 0.5 \\ 0.05 & 0.4 & 0.55 \end{bmatrix}, \quad R^7 = \begin{bmatrix} 6 & 5 & -1 \\ 0 & 5 & 1 \\ 6 & 3 & -2 \end{bmatrix}, \qquad P^8 = \begin{bmatrix} 0.2 & 0.5 & 0.3 \\ 0.1 & 0.6 & 0.3 \\ 0.05 & 0.4 & 0.55 \end{bmatrix}, \quad R^8 = \begin{bmatrix} 7 & 6 & 3 \\ 7 & 4 & 0 \\ 6 & 3 & -2 \end{bmatrix}.$$

Page 44:

Exhaustive Enumeration

Step 1: The values of $v_i^s$ can thus be computed as given in the following table.

s     v_1^s   v_2^s   v_3^s
1     5.3     3       -1
2     4.7     3.1      0.4
3     4.7     3       -1
4     5.3     3.1     -1
5     5.3     3        0.4
6     4.7     3.1     -1
7     4.7     3        0.4
8     5.3     3.1      0.4

Page 45:

Exhaustive Enumeration

Step 2: The computations of the stationary probabilities are achieved by using the equations

$$\pi^s P^s = \pi^s, \qquad \pi_1^s + \pi_2^s + \cdots + \pi_m^s = 1.$$

As an illustration, consider s = 2. The associated equations are

$$\pi_1^2 = 0.3\pi_1^2 + 0.1\pi_2^2 + 0.05\pi_3^2$$
$$\pi_2^2 = 0.6\pi_1^2 + 0.6\pi_2^2 + 0.4\pi_3^2$$
$$\pi_3^2 = 0.1\pi_1^2 + 0.3\pi_2^2 + 0.55\pi_3^2$$
$$\pi_1^2 + \pi_2^2 + \pi_3^2 = 1$$

The solution yields $\pi_1^2 = \tfrac{6}{59}$, $\pi_2^2 = \tfrac{31}{59}$, $\pi_3^2 = \tfrac{22}{59}$.

In this case, the expected yearly revenue is

$$E^2 = \sum_{i=1}^{3} \pi_i^2 v_i^2 = \tfrac{1}{59}\big(6 \times 4.7 + 31 \times 3.1 + 22 \times 0.4\big) = 2.256.$$

Page 46:

Exhaustive Enumeration

Steps 3 & 4: The following table summarizes $\pi^s$ and $E^s$ for all the stationary policies.

s     π_1^s     π_2^s     π_3^s     E^s
1     0         0         1         -1
2     6/59      31/59     22/59      2.256
3     0         0         1         -1
4     0         0         1         -1
5     5/154     69/154    80/154     1.724
6     0         0         1         -1
7     5/137     62/137    70/137     1.734
8     12/135    69/135    54/135     2.216

Policy 2 yields the largest expected yearly revenue, $E^{s^*} = 2.256$. The optimum long-range policy calls for applying fertilizer regardless of the state of the system.

Page 47:

Policy Iteration

If the system is completely ergodic, the limiting state probabilities $\pi_i$ are independent of the starting state, and the gain g of the system is

$$g = \sum_{i=1}^{N} \pi_i q_i,$$

where $q_i$ is the expected immediate return in state i defined by

$$q_i = \sum_{j=1}^{N} p_{ij} r_{ij}, \qquad i = 1, 2, \ldots, N.$$

Page 48:

Policy Iteration

A possible five-state problem.

The alternative thus selected is called the “decision” for that state; it is no longer a function of n. The set of X’s or the set of decisions for all states is called a “policy”.

Page 49:

Policy Iteration

It is possible to describe the policy by a decision vector d whose elements represent the number of the alternative selected in each state. In this case

$$d = \begin{bmatrix} 3 \\ 2 \\ 2 \\ 1 \\ 3 \end{bmatrix}.$$

An optimal policy is defined as a policy that maximizes the gain, or average return per transition.

Page 50:

Policy Iteration

In the five-state problem diagrammed, there are 120 different policies (the product of the numbers of alternatives available in the five states).

However feasible enumeration may be for 120 policies, it becomes infeasible for very large problems. For example, a problem with 50 states and 50 alternatives in each state contains 50^50 (≈ 10^85) policies.

The policy-iteration method that will be described finds the optimal policy in a small number of iterations.

It is composed of two parts, the value-determination operation and the policy-improvement routine.

Page 51:

Policy Iteration: The Iteration Cycle

The cycle alternates between the value-determination operation (evaluate the current policy) and the policy-improvement routine (choose a better alternative in each state), using

$$q_i = \sum_{j=1}^{N} p_{ij} r_{ij}, \qquad i = 1, 2, \ldots, N.$$
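A minimal sketch of the two-part cycle for the toymaker (not part of the original slides); the function name `value_determination` is illustrative, numpy is assumed available, and the convention v_N = 0 follows the slides that come next:

```python
import numpy as np

P = {(1, 1): [0.5, 0.5], (1, 2): [0.8, 0.2],
     (2, 1): [0.4, 0.6], (2, 2): [0.7, 0.3]}
q = {(1, 1): 6.0, (1, 2): 4.0, (2, 1): -3.0, (2, 2): -5.0}
states, alts = (1, 2), (1, 2)

def value_determination(policy):
    """Solve g + v_i = q_i + sum_j p_ij v_j with v_2 = 0; unknowns are (g, v_1)."""
    A, b = np.zeros((2, 2)), np.zeros(2)
    for row, i in enumerate(states):
        A[row, 0] = 1.0                      # coefficient of the gain g
        if i == 1:
            A[row, 1] += 1.0                 # coefficient of v_1 (v_2 is pinned to 0)
        A[row, 1] -= P[i, policy[i]][0]      # -p_i1 * v_1
        b[row] = q[i, policy[i]]
    g, v1 = np.linalg.solve(A, b)
    return g, {1: v1, 2: 0.0}

policy = {1: 1, 2: 1}                        # start from the "maximize q" policy
while True:
    g, v = value_determination(policy)
    new_policy = {i: max(alts, key=lambda k: q[i, k] +
                         sum(p * v[j] for p, j in zip(P[i, k], states)))
                  for i in states}
    print(policy, "gain =", g, "v =", v)     # policy (1,1): g = 1; policy (2,2): g = 2
    if new_policy == policy:
        break
    policy = new_policy
```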

Page 52:

The Toymaker’s Problem

Let us suppose that we have no a priori knowledge about which policy is best. Then if we set $v_1 = v_2 = 0$ and enter the policy-improvement routine, it will select as an initial policy the one that maximizes expected immediate reward in each state.

For the toymaker, this policy consists of selecting alternative 1 in both states 1 and 2. For this policy

$$d = \begin{bmatrix} 1 \\ 1 \end{bmatrix}, \qquad P = \begin{bmatrix} 0.5 & 0.5 \\ 0.4 & 0.6 \end{bmatrix}, \qquad R = \begin{bmatrix} 9 & 3 \\ 3 & -7 \end{bmatrix}, \qquad q = \begin{bmatrix} 6 \\ -3 \end{bmatrix}.$$

Page 53:

The Toymaker’s Problem

We are now ready to begin the value-determination operation that will evaluate our initial policy:

$$g + v_1 = 6 + 0.5v_1 + 0.5v_2, \qquad g + v_2 = -3 + 0.4v_1 + 0.6v_2.$$

Setting $v_2 = 0$ and solving these equations, we obtain $g = 1$, $v_1 = 10$, $v_2 = 0$.

We are now ready to enter the policy-improvement routine, as shown in Table 3.8. The test quantity is $q_i^k + \sum_{j=1}^{N} p_{ij}^k v_j$:

State i   Alternative k   Test Quantity
1         1               6 + 0.5(10) + 0.5(0) = 11
1         2               4 + 0.8(10) + 0.2(0) = 12
2         1               -3 + 0.4(10) + 0.6(0) = 1
2         2               -5 + 0.7(10) + 0.3(0) = 2

Page 54:

The Toymaker’s Problem

The policy-improvement routine reveals that the second alternative in each state produces a higher value of the test quantity than does the first alternative. For this policy,

$$d = \begin{bmatrix} 2 \\ 2 \end{bmatrix}, \qquad P = \begin{bmatrix} 0.8 & 0.2 \\ 0.7 & 0.3 \end{bmatrix}, \qquad q = \begin{bmatrix} 4 \\ -5 \end{bmatrix}.$$

We are now ready for the value-determination operation that will evaluate this policy:

$$g + v_1 = 4 + 0.8v_1 + 0.2v_2, \qquad g + v_2 = -5 + 0.7v_1 + 0.3v_2.$$

With $v_2 = 0$, the results of the value-determination operation are $g = 2$, $v_1 = 10$, $v_2 = 0$.

The gain of this policy is thus twice that of the original policy. Since a further policy-improvement step returns the same policy, we have found the optimal policy.

For the optimal policy, $v_1 = 10$, $v_2 = 0$, so that $v_1 - v_2 = 10$. This means that, even when the toymaker is following the optimal policy by using advertising and research, having a successful toy is still worth 10 more in expected future return than having an unsuccessful one.

Page 55:

Linear Programming

The infinite-stage Markov decision problem can be formulated and solved as a linear program.

We have defined the policy of an MDP as a decision vector $D = (d_1, d_2, \ldots, d_N)$. Each state has K decisions, so the policy can be characterized by assigning values $d_{ik} \in \{0, 1\}$ in the $N \times K$ matrix

$$D = \begin{bmatrix} d_{11} & d_{12} & \cdots & d_{1K} \\ d_{21} & d_{22} & \cdots & d_{2K} \\ \vdots & & & \vdots \\ d_{N1} & d_{N2} & \cdots & d_{NK} \end{bmatrix},$$

where each row must contain a single 1 with the rest of the elements zero. When an element $d_{ik} = 1$, it can be interpreted as calling for decision k when the system is in state i.

Page 56:

Linear Programming

When we use linear programming to solve the MDP, the objective is the expected revenue

$$E = \sum_{i=1}^{N} \sum_{k=1}^{K} d_{ik}\, \pi_i\, q_i^k.$$

The linear programming formulation is best expressed in terms of a variable $w_{ik}$, which is related to $d_{ik}$ as follows. Let $w_{ik}$ be the unconditional probability that the system is in state i and decision k is made; that is, $w_{ik} = P\{\text{state } i \text{ and decision } k\}$.

From the rules of conditional probability, $w_{ik} = \pi_i d_{ik}$. Furthermore, $\pi_i = \sum_{k=1}^{K} w_{ik}$, so that

$$d_{ik} = \frac{w_{ik}}{\sum_{k=1}^{K} w_{ik}}.$$

Page 57:

Linear Programming

There exist several constraints on $w_{ik}$:

1. $\sum_{i=1}^{N} \pi_i = 1$, so that $\sum_{i=1}^{N} \sum_{k=1}^{K} w_{ik} = 1$.

2. From the results on steady-state probabilities, $\pi_j = \sum_{i=1}^{N} \pi_i p_{ij}$, so that $\sum_{k=1}^{K} w_{jk} = \sum_{i=1}^{N} \sum_{k=1}^{K} w_{ik}\, p_{ij}^k$, for $j = 1, 2, \ldots, N$.

3. $w_{ik} \ge 0$, for $i = 1, 2, \ldots, N$ and $k = 1, 2, \ldots, K$.

Page 58:

Linear Programming

The long-run expected average revenue per unit time is given by

$$E = \sum_{i=1}^{N} \sum_{k=1}^{K} w_{ik}\, q_i^k,$$

hence the problem is to choose the $w_{ik}$ that

$$\text{Maximize} \quad \sum_{i=1}^{N} \sum_{k=1}^{K} w_{ik}\, q_i^k$$

subject to the constraints:

1. $\sum_{i=1}^{N} \sum_{k=1}^{K} w_{ik} = 1$

2. $\sum_{k=1}^{K} w_{jk} - \sum_{i=1}^{N} \sum_{k=1}^{K} w_{ik}\, p_{ij}^k = 0$, for $j = 1, 2, \ldots, N$

3. $w_{ik} \ge 0$, for $i = 1, 2, \ldots, N$ and $k = 1, 2, \ldots, K$.

This is clearly a linear programming problem that can be solved by the simplex method. Once the optimal $w_{ik}$ are obtained, the optimal policy follows from $d_{ik} = w_{ik} / \sum_{k=1}^{K} w_{ik}$.

Page 59:

Linear Programming

The following is an LP formulation of the gardener problem without discounting:

Maximize E = 5.3w11 + 4.7w12 + 3w21 + 3.1w22 - w31 + 0.4w32

subject to
w11 + w12 - (0.2w11 + 0.3w12 + 0.1w22 + 0.05w32) = 0
w21 + w22 - (0.5w11 + 0.6w12 + 0.5w21 + 0.6w22 + 0.4w32) = 0
w31 + w32 - (0.3w11 + 0.1w12 + 0.5w21 + 0.3w22 + w31 + 0.55w32) = 0
w11 + w12 + w21 + w22 + w31 + w32 = 1
wik >= 0, for all i and k

The optimal solution is w11 = w21 = w31 = 0 and w12 = 0.1017, w22 = 0.5254, and w32 = 0.3729. This result means that d12 = d22 = d32 = 1. Thus, the optimal policy selects alternative k = 2 for i = 1, 2, and 3. The optimal value of E is 4.7(0.1017) + 3.1(0.5254) + 0.4(0.3729) = 2.256.
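The same LP can be checked numerically; the following is a minimal sketch assuming scipy is available (scipy.optimize.linprog is not mentioned in the slides). Variable order is [w11, w12, w21, w22, w31, w32].

```python
import numpy as np
from scipy.optimize import linprog

c = -np.array([5.3, 4.7, 3.0, 3.1, -1.0, 0.4])   # linprog minimizes, so negate E
A_eq = np.array([
    [0.8,  0.7,  0.0, -0.1, 0.0, -0.05],   # balance equation for state 1
    [-0.5, -0.6, 0.5,  0.4, 0.0, -0.4 ],   # balance equation for state 2
    [-0.3, -0.1, -0.5, -0.3, 0.0, 0.45],   # balance equation for state 3
    [1.0,  1.0,  1.0,  1.0, 1.0,  1.0 ],   # probabilities sum to 1
])
b_eq = np.array([0.0, 0.0, 0.0, 1.0])
res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * 6, method="highs")
print(res.x.round(4))   # ~[0, 0.1017, 0, 0.5254, 0, 0.3729]
print(-res.fun)         # ~2.256
```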

Page 60:

Lagrangian Relaxation

If the linear programming method cannot find the optimal solution once additional constraints are added, we can use Lagrangian relaxation to move those constraints into the objective function and then solve the resulting subproblem without the additional constraints.

By adjusting the Lagrange multipliers, we can obtain an upper bound and a lower bound on the problem.

We use the Lagrange multipliers to rearrange the revenues of the Markov decision process, and then solve the original Markov decision process model to find the optimal policy.
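The slides do not give a concrete algorithm for this step, so the following is only a minimal sketch of one common way to realize the idea, under assumptions not in the original: a single extra constraint of the form sum of c[i,k]*w[i,k] <= budget, a hypothetical cost table `c`, and a hypothetical solver `solve_mdp` (for example, the LP or policy-iteration routines sketched earlier) that returns a policy, the w variables, and the objective value for adjusted rewards.

```python
def lagrangian_relaxation(q, c, budget, solve_mdp, steps=50, step_size=0.1):
    """Sketch: fold a relaxed constraint into the rewards and adjust the multiplier."""
    lam, best_upper, policy = 0.0, float("inf"), None
    for _ in range(steps):
        # Adjusted rewards: q'_i^k = q_i^k - lam * c_i^k.
        q_adj = {(i, k): q[i, k] - lam * c[i, k] for (i, k) in q}
        policy, w, value = solve_mdp(q_adj)
        upper = value + lam * budget                  # Lagrangian dual value (upper bound)
        best_upper = min(best_upper, upper)
        usage = sum(c[i, k] * w[i, k] for (i, k) in w)
        lam = max(0.0, lam + step_size * (usage - budget))   # subgradient step on lam
    return best_upper, policy
```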

Page 61:

Comparison

The methods are compared along four characteristics: simplicity of calculation, suitability for large problems, whether they guarantee an optimal policy, and whether they can handle additional constraints.

Methods compared: Value Iteration, Exhaustive Enumeration, Policy Iteration, Linear Programming, Lagrangian Relaxation.

Page 62:

Semi-Markov Decision Processes

So far we have assumed that decisions are taken at each of a sequence of unit time intervals.

We will allow decisions to be taken at varying integral multiples of the unit time interval.

The interval between decisions may be predetermined or random.

Page 63:

Partially Observable MDPs

MDPs assume complete observability (you can always tell what state you are in).

We can't always be certain of the current state.

POMDPs are more difficult to solve than MDPs.

Most real-world problems are POMDPs.

Page 64:

Applications of MDPs

Capacity Expansion
Decision Analysis
Network Control
Queueing System Control

Page 65:

Conclusion

MDPs provide an elegant formal framework for sequential decision making.

We have presented a powerful tool for formulating models and finding optimal policies.

Five algorithms were presented:
Value Iteration
Exhaustive Enumeration (optimal policy)
Policy Iteration (optimal policy)
Linear Programming (optimal policy)
Lagrangian Relaxation (optimal policy)

Page 66:

Future Work

Sensor Networks:
Maximize the system lifetime of sensor networks
Maximize the area covered by sensor networks
Minimize the response time of sensor networks

Page 67:

References

1. Hamdy A. Taha, "Operations Research: An Introduction," third edition, 1982.
2. Hillier and Lieberman, "Introduction to Operations Research," fourth edition, Holden-Day, Inc., 1986.
3. R. K. Ahuja, T. L. Magnanti, and J. B. Orlin, "Network Flows," Prentice-Hall, 1993.
4. Leslie Pack Kaelbling, "Techniques in Artificial Intelligence: Markov Decision Processes," MIT OpenCourseWare, Fall 2002.
5. Ronald A. Howard, "Dynamic Programming and Markov Processes," Wiley, New York, 1970.
6. D. J. White, "Markov Decision Processes," Wiley, 1993.
7. Dean L. Isaacson and Richard W. Madsen, "Markov Chains: Theory and Applications," Wiley, 1976.
8. M. H. A. Davis, "Markov Models and Optimization," Chapman & Hall, 1993.
9. Martin L. Puterman, "Markov Decision Processes: Discrete Stochastic Dynamic Programming," Wiley, New York, 1994.
10. Hsu-Kuan Hung, Adviser: Yeong-Sung Lin, "Optimization of GPRS Time Slot Allocation," June 2001.
11. Hui-Ting Chuang, Adviser: Yeong-Sung Lin, "Optimization of GPRS Time Slot Allocation Considering Call Blocking Probability Constraints," June 2002.

Page 68:

References

12. 高孔廉, "Operations Research: Quantitative Methods for Management Decisions," San Min (distributor), 4th edition, 1985.
13. 李朝賢, "Introduction to Operations Research," Hung Yeh Cultural Enterprise, August 1977.
14. 楊超然, "Operations Research," San Min Book Co., 1st edition, September 1977.
15. 葉若春, "Operations Research," Chung Hsing Management Consultants, 5th edition, August 1997.
16. 薄喬萍, "Operations Research and Decision Analysis," Fu Wen Book Co., 1st edition, June 1989.
17. 葉若春, "Linear Programming: Theory and Applications," 10th revised edition, September 1984.
18. Leonard Kleinrock, "Queueing Systems, Volume I: Theory," Wiley, New York, 1975.
19. Chiu, Hsien-Ming, "Lagrangian Relaxation," Tamkang University, Fall 2003.
20. L. Cheng, E. Subrahmanian, A. W. Westerberg, "Design and planning under uncertainty: issues on problem formulation and solution," Computers and Chemical Engineering, 27, 2003, pp. 781-801.
21. Regis Sabbadin, "Possibilistic Markov Decision Processes," Engineering Applications of Artificial Intelligence, 14, 2001, pp. 287-300.
22. K. Karen Yin, Hu Liu, Neil E. Johnson, "Markovian Inventory Policy with Application to the Paper Industry," Computers and Chemical Engineering, 26, 2002, pp. 1399-1413.