On Markovian decision programming with recursive reward functions




Annals of Operations Research 24 (1990) 145-164

ON MARKOVIAN DECISION PROGRAMMING WITH RECURSIVE REWARD FUNCTIONS*

Jianyong LIU and Ke LIU Institute of Applied Mathematics, Academia Sinica, Beijing, P.R. China

Abstract

In this paper, the infinite horizon Markovian decision programming with recursive reward functions is discussed. We show that Bellman's optimal principle is applicable to our model. Then, a sufficient and necessary condition for a policy to be optimal is given. For the stationary case, an iteration algorithm for finding a stationary optimal policy is designed. The algorithm is a generalization of Howard's [7] and Iwamoto's [3] algorithms.

Keywords

Markovian decision programming, recursive reward functions, optimal policy.

1. Introduction

Infinite horizon Markovian decision programming with recursive reward functions was discussed by Furukawa and Iwamoto [1,2]. Under certain conditions on the reward functions, they proved that there is a (p, ε)-stationary optimal policy. When the action set is countable (finite), there is a stationary ε-optimal policy (stationary optimal policy). In ref. [3], an infinite horizon discounted model is discussed and an iteration algorithm for finding a stationary optimal policy is given for the case of finite F.

In ref. [4], proofs of the optimal principle and weak optimal principle are given for the finite horizon model.

Our model is almost the same as that in [1]. Notations, definitions, some basic assumptions and results are stated in section 2; these are generally taken from [1]. The main results of this paper are given in section 3. Under the recursiveness, monotonicity and Lipschitz conditions (similar to [1]), Bellman's optimal principle holds (theorem 4). Then, we prove that a policy is optimal if and only if every decision rule of this policy is a probability distribution on an optimal action set under any realizable history (theorem 5). From the point of view of a decision rule, theorem 5 describes the characteristics of an optimal policy and is a generalization of theorem 7 in [5]. As an application of theorem 5, we prove an important relation between the class of general policies and the class of Markov policies (theorem 6). In section 4,

*This research was supported by the National Natural Science Foundation of China.

© J.C. Baltzer AG, Scientific Publishing Company


stationary policies are discussed. Theorem 8 is a generalization of theorem 2 in the paper by Blackwell [6] and is also an application of theorem 5. An iteration algorithm for finding a stationary optimal policy for finite F is given in theorem 9; this algorithm is a generalization of Howard's [7] and Iwamoto's [3] algorithms.

In theorem 10, an iteration algorithm for the case when F is infinite is given. To avoid involving measure theory, we assume that the state set and the action set are countable. However, similar results for the general case can be obtained by our methods.

2. Notations, some basic assumptions, and results

The model under consideration is {S, A, q_n, g}, where S is a state set and A an action set. Both of them are countable. At stage m (m > 0), S_m and A_m denote the state and action sets, respectively (where S_m = S, A_m = A). s_m and a_m denote elements of S_m and A_m, respectively. At stage t, Y_t and A_t denote, respectively, the state of the system and the action taken. h_m = (s_1, a_1, \ldots, s_{m-1}, a_{m-1}, s_m) is called a history up to stage m, and H_m denotes the set of all such h_m.

{q_n} is the stochastic transition law of the system: when the system is in state s_n at the nth stage and we take action a_n, the system moves to a new state s_{n+1} selected according to the probability q_n(s_{n+1} | s_n, a_n). {q_n} satisfies

\sum_{s_{n+1} \in S_{n+1}} q_n(s_{n+1} \mid s_n, a_n) = 1, \quad for n > 0, s_n \in S_n, a_n \in A_n.

Note that {q_n} is nonhomogeneous, whereas the transition probability in [1] is not. For m > 0, the function g_m : \prod_{i=m}^{\infty}(S_i \times A_i) \to \mathbb{R}^1 is the reward function from stage m to \infty. Let 0 < m \le n; the function

g_{m,n+1} : \prod_{i=m}^{n}(S_i \times A_i) \times S_{n+1} \times \mathbb{R}^1 \to \mathbb{R}^1

is the reward function from stage m to n + 1.

The set of general policies π = (π_1, π_2, ...) is denoted by Π, where π_n, defined as in [1], is a randomized decision rule which may depend on the history. Let π = (π_1, π_2, ...) ∈ Π; π^n denotes (π_{n+1}, π_{n+2}, ...), n ≥ 0, and Π^n denotes the set {π^n | π ∈ Π}. A mapping f : S → A is called a deterministic (non-randomized) decision rule, and F denotes the set of all such f. Let f_i ∈ F, i = 1, 2, ...; π = (f_1, f_2, ...) is called a Markov policy. Π_m denotes the set of all Markov policies, and Π_m^n denotes the set {π^n | π ∈ Π_m}, n ≥ 0.

Let π ∈ Π. The expected reward function on S_1 using the policy π is given by

E^{\pi}[g_1(s_1, a_1, \ldots) \mid Y_1 = s_1] \equiv E^{\pi}g_1(s_1), \quad s_1 \in S_1.

Suppose π* ∈ Π; if for every π ∈ Π and s_1 ∈ S_1 we have

E^{\pi^*}g_1(s_1) \ge E^{\pi}g_1(s_1),

then π* is called an optimal policy in Π.
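To fix ideas before stating the assumptions, the following minimal Python sketch (not part of the paper; all names are illustrative) shows one way to represent the countable model {S, A, q_n, g} and a randomized history-dependent policy:

from typing import Callable, Dict, Sequence, Tuple

State = int
Action = int
History = Tuple[int, ...]   # h_n = (s_1, a_1, ..., s_{n-1}, a_{n-1}, s_n)

# Nonhomogeneous transition law: stage n maps (s_n, a_n) to a distribution over s_{n+1}.
TransitionLaw = Dict[int, Callable[[State, Action], Dict[State, float]]]

def make_transition_law(stages: int) -> TransitionLaw:
    # Purely illustrative kernel; the stage-indexed dictionary is what lets q_n vary with n.
    def q_n(s: State, a: Action) -> Dict[State, float]:
        return {0: 0.5, 1: 0.5} if a == 0 else {s: 1.0}
    return {n: q_n for n in range(1, stages + 1)}

# A randomized decision rule pi_n maps a history h_n to a probability distribution over actions;
# a general policy pi = (pi_1, pi_2, ...) is a sequence of such rules.
DecisionRule = Callable[[History], Dict[Action, float]]
Policy = Sequence[DecisionRule]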


In this paper, we assume {g_m, g_{m,n}} satisfies the recursive relation, that is: for every π = (π_1, π_2, ...) ∈ Π, every m, n with 1 ≤ m ≤ n, and every h_m = (s_1, a_1, \ldots, s_m) ∈ H_m, we have:

(1) E^{m-1,\pi}[g_m(s_m, a_m, \ldots) \mid h_m]
    = E^{\pi_m}[g_{m,m+1}(s_m, a_m, s_{m+1}, E^{m,\pi}[g_{m+1}(s_{m+1}, a_{m+1}, \ldots) \mid h_{m+1} = (h_m, a_m, s_{m+1})]) \mid h_m],

where E^{m-1,\pi}[g_m(s_m, a_m, \ldots) \mid h_m] is the total expected reward using π from stage m to \infty when the history of the system is h_m.

(2) For any bounded function u : H_{n+1} \to \mathbb{R}^1, we have

E^{(\pi_m, \pi_{m+1}, \ldots, \pi_n)}[g_{m,n+1}(s_m, a_m, \ldots, s_{n+1}, u(h_{n+1})) \mid h_m]
    = E^{\pi_m}[g_{m,m+1}(s_m, a_m, s_{m+1}, E^{(\pi_{m+1}, \ldots, \pi_n)}[g_{m+1,n+1}(s_{m+1}, a_{m+1}, \ldots, s_{n+1}, u(h_{n+1})) \mid h_{m+1} = (h_m, a_m, s_{m+1})]) \mid h_m],

where h_{n+1} = (h_m, a_m, \ldots, s_{n+1}).

For succinctness, the condition in the conditional expectation is often omitted.

In this paper, we assume that E^{n,\pi}g_{n+1}(h_{n+1}) is uniformly bounded, that is: there is a constant L such that |E^{n,\pi}[g_{n+1}(s_{n+1}, a_{n+1}, \ldots) \mid h_{n+1}]| \le L for every π ∈ Π, n ≥ 0 and h_{n+1} ∈ H_{n+1}. We also assume that {g_{m,n}} is monotone, that is: for every m, n with 1 ≤ m ≤ n, every s_i ∈ S_i, a_i ∈ A_i, m ≤ i ≤ n, every s_{n+1} ∈ S_{n+1}, and every c_1, c_2 ∈ \mathbb{R}^1, we have:

(i) if c_1 \le c_2, then

g_{m,n+1}(s_m, a_m, \ldots, s_{n+1}, c_1) \le g_{m,n+1}(s_m, a_m, \ldots, s_{n+1}, c_2);

(ii) if c_1 < c_2, then

g_{m,n+1}(s_m, a_m, \ldots, s_{n+1}, c_1) < g_{m,n+1}(s_m, a_m, \ldots, s_{n+1}, c_2).


In this paper, we assume that {g_{m,n}} satisfies the Lipschitz condition, that is: for any n > 0 there is k_n > 0 such that

\sup_{h_n \in H_n} | E^{\pi_n}[g_{n,n+1}(s_n, a_n, s_{n+1}, u(h_n, a_n, s_{n+1})) \mid h_n]
    - E^{\pi_n}[g_{n,n+1}(s_n, a_n, s_{n+1}, v(h_n, a_n, s_{n+1})) \mid h_n] |
\le k_n \sup_{h_{n+1} \in H_{n+1}} | u(h_{n+1}) - v(h_{n+1}) |, \quad for any \pi_n,

where u, v are any bounded real functions from H_{n+1} to \mathbb{R}^1. We also assume that k_1 k_2 \cdots k_n \to 0 when n \to \infty. The above formula is abbreviated to

\| E^{\pi_n} g_{n,n+1}(s_n, a_n, s_{n+1}, u(h_n, a_n, s_{n+1})) - E^{\pi_n} g_{n,n+1}(s_n, a_n, s_{n+1}, v(h_n, a_n, s_{n+1})) \| \le k_n \| u - v \|.
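As a concrete illustration (not taken from the paper), the classical discounted model is the special case

g_{m,m+1}(s_m, a_m, s_{m+1}, c) = r(s_m, a_m) + \beta c, \qquad 0 \le \beta < 1,

with r bounded (say |r| \le (1-\beta)L, so that the uniform boundedness assumption holds with the same constant L). Both monotonicity conditions hold because c \mapsto r + \beta c is strictly increasing, and

| E^{\pi_n} g_{n,n+1}(s_n, a_n, s_{n+1}, u) - E^{\pi_n} g_{n,n+1}(s_n, a_n, s_{n+1}, v) | = \beta \, | E^{\pi_n}[u - v] | \le \beta \| u - v \|,

so the Lipschitz condition holds with k_n = \beta and k_1 k_2 \cdots k_n = \beta^n \to 0.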

For m > 1 and s_m ∈ S_m, define

V^*_m(s_m) = \sup_{\pi \in \Pi,\; h_{m-1} \in H_{m-1},\; a_{m-1} \in A_{m-1}} E^{m-1,\pi}[g_m(s_m, a_m, \ldots) \mid h_m = (h_{m-1}, a_{m-1}, s_m)],

V^*_1(s_1) = \sup_{\pi \in \Pi} E^{\pi}[g_1(s_1, a_1, \ldots) \mid h_1 = (s_1)], \quad s_1 \in S_1.

It is easy to see that

\sup_{m \ge 1,\; s_m \in S_m} | V^*_m(s_m) | \le L.

LEMMA 1

Let 0 < m ≤ n, π = (π_1, π_2, ...) ∈ Π, and let u, v be bounded real functions from H_{n+1} to \mathbb{R}^1. Then

\| E^{(\pi_m, \ldots, \pi_n)} g_{m,n+1}(s_m, a_m, \ldots, s_{n+1}, u(h_m, a_m, \ldots, s_{n+1}))
    - E^{(\pi_m, \ldots, \pi_n)} g_{m,n+1}(s_m, a_m, \ldots, s_{n+1}, v(h_m, a_m, \ldots, s_{n+1})) \|
\le k_m k_{m+1} \cdots k_n \| u - v \|.


Proof

By recursiveness and the Lipschitz condition, the proof is trivial. □

LEMMA 2

For any m > 0 and ε > 0, there exists a Markov policy π^{m-1} = (f_m, f_{m+1}, ...) ∈ Π_m^{m-1} such that

E^{m-1,\pi}g_m(s_m, a_m, \ldots) \ge V^*_m(s_m) - \varepsilon, \quad for any s_m \in S_m.

Proof

Because k_1 k_2 \cdots k_N \to 0 when N \to \infty, there is N > m such that 2 L k_m k_{m+1} \cdots k_N < \varepsilon / 2. Take a Markov policy π^N = (f_{N+1}, f_{N+2}, ...) ∈ Π_m^N. Take ε' > 0 that satisfies

(k_m k_{m+1} \cdots k_{N-1} + k_m k_{m+1} \cdots k_{N-2} + \cdots + k_m)\varepsilon' + \varepsilon' < \varepsilon / 2.

Suppose π^i = (f_{i+1}, f_{i+2}, \ldots, f_N, π^N) has been defined, m ≤ i ≤ N. We define f_i as follows: for any s_i ∈ S_i, take f_i(s_i) ∈ A_i such that

\sum_{s_{i+1} \in S_{i+1}} q_i(s_{i+1} \mid s_i, f_i(s_i)) g_{i,i+1}(s_i, f_i(s_i), s_{i+1}, E^{i,\pi}g_{i+1}(s_{i+1}, \ldots))
\ge \sup_{a_i \in A_i} \sum_{s_{i+1} \in S_{i+1}} q_i(s_{i+1} \mid s_i, a_i) g_{i,i+1}(s_i, a_i, s_{i+1}, E^{i,\pi}g_{i+1}(s_{i+1}, \ldots)) - \varepsilon'.

Obviously, π^{m-1} = (f_m, f_{m+1}, \ldots, f_N, π^N) as defined above is a Markov policy, that is, π^{m-1} ∈ Π_m^{m-1}.

Take any π̃^{m-1} = (π̃_m, π̃_{m+1}, ...) ∈ Π^{m-1}. By the proof methods of theorem 5.1 in [1] and [2], we can prove: for any s_m ∈ S_m, h_{m-1} ∈ H_{m-1} and a_{m-1} ∈ A_{m-1}, we have

E^{m-1,\tilde\pi}g_m(s_m, a_m, \ldots) \le E^{m-1,\pi}g_m(s_m, a_m, \ldots) + \varepsilon.

By the definition of V^*_m(s_m), we have

E^{m-1,\pi}g_m(s_m, \ldots) \ge V^*_m(s_m) - \varepsilon, \quad for any s_m \in S_m. □

THEOREM 1

For any ε > 0, there is π ∈ Π_m such that


E^{\pi}g_1(s_1, \ldots) \ge V^*_1(s_1) - \varepsilon, \quad for any s_1 \in S_1.

That is: there is a Markov ε-optimal policy in Π.

Proof

This follows immediately from lemma 2.

Similarly, we have:

THEOREM 2

For any m > 0 and s_m ∈ S_m, we have

\sup_{\pi^{m-1} \in \Pi_m^{m-1}} E^{m-1,\pi}g_m(s_m, a_m, \ldots) = V^*_m(s_m). □

THEOREM 3

(Optimal equation). For any m > 0 and s_m ∈ S_m, we have

V^*_m(s_m) = \sup_{a_m \in A_m} \sum_{s_{m+1} \in S_{m+1}} q_m(s_{m+1} \mid s_m, a_m) g_{m,m+1}(s_m, a_m, s_{m+1}, V^*_{m+1}(s_{m+1})).

Proof

Let C_m(s_m) be the right-hand side of the above equation. We shall prove V^*_m(s_m) = C_m(s_m). For any s_m ∈ S_m and π^{m-1} = (f_m, f_{m+1}, ...) ∈ Π_m^{m-1}, by monotonicity we have

E^{m-1,\pi}g_m(s_m, a_m, \ldots) = E^{f_m} g_{m,m+1}(s_m, a_m, s_{m+1}, E^{(f_{m+1}, \ldots)}g_{m+1}(s_{m+1}, \ldots))
\le E^{f_m} g_{m,m+1}(s_m, a_m, s_{m+1}, V^*_{m+1}(s_{m+1}))
= \sum_{s_{m+1} \in S_{m+1}} q_m(s_{m+1} \mid s_m, f_m(s_m)) g_{m,m+1}(s_m, f_m(s_m), s_{m+1}, V^*_{m+1}(s_{m+1}))
\le \sup_{a_m \in A_m} \sum_{s_{m+1} \in S_{m+1}} q_m(s_{m+1} \mid s_m, a_m) g_{m,m+1}(s_m, a_m, s_{m+1}, V^*_{m+1}(s_{m+1}))
= C_m(s_m).


Because of theorem 2,

V^*_m(s_m) = \sup_{\pi^{m-1} \in \Pi_m^{m-1}} E^{m-1,\pi}g_m(s_m, \ldots) \le C_m(s_m).

By lemma 2, for any ε > 0 there exists π^m ∈ Π_m^m such that

E^{m,\pi}g_{m+1}(s_{m+1}, \ldots) \ge V^*_{m+1}(s_{m+1}) - \varepsilon, \quad for any s_{m+1} \in S_{m+1}.

For any s_m ∈ S_m and a_m ∈ A_m, take f_m ∈ F such that f_m(s_m) = a_m. By monotonicity and the Lipschitz condition, we have

\sum_{s_{m+1} \in S_{m+1}} q_m(s_{m+1} \mid s_m, a_m) g_{m,m+1}(s_m, a_m, s_{m+1}, V^*_{m+1}(s_{m+1}))
\le \sum_{s_{m+1} \in S_{m+1}} q_m(s_{m+1} \mid s_m, a_m) g_{m,m+1}(s_m, a_m, s_{m+1}, E^{m,\pi}g_{m+1}(s_{m+1}, \ldots) + \varepsilon)
= E^{f_m} g_{m,m+1}(s_m, a_m, s_{m+1}, E^{m,\pi}g_{m+1}(s_{m+1}, \ldots) + \varepsilon)
\le E^{f_m} g_{m,m+1}(s_m, a_m, s_{m+1}, E^{m,\pi}g_{m+1}(s_{m+1}, \ldots)) + k_m \varepsilon
= E^{(f_m, \pi^m)}g_m(s_m, \ldots) + k_m \varepsilon \le V^*_m(s_m) + k_m \varepsilon.

By the definition of C_m(s_m), we have C_m(s_m) \le V^*_m(s_m) + k_m \varepsilon. Letting ε → 0, we have C_m(s_m) \le V^*_m(s_m). □

Remark

Theorem 6.5(e) in [1] only proves the above-mentioned result for the stationary case.
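For intuition only (this sketch is not part of the paper), the optimal equation suggests approximating V^*_m by a finite backward recursion: truncate at some horizon N, set the terminal value to 0, and apply the right-hand side of theorem 3 stage by stage; by lemma 1 the truncation error is bounded by a multiple of k_m k_{m+1} \cdots k_N, which tends to 0. The names below (value_iteration, q, g) are illustrative.

from typing import Callable, Dict, List

def value_iteration(
    states: List[int],
    actions: List[int],
    q: Callable[[int, int, int], Dict[int, float]],   # q(n, s, a) -> {s_next: probability}
    g: Callable[[int, int, int, int, float], float],  # g_{n,n+1}(s, a, s_next, c)
    horizon: int,
) -> Dict[int, Dict[int, float]]:
    """Backward recursion V_n(s) = max_a sum_{s'} q_n(s'|s,a) g_{n,n+1}(s,a,s',V_{n+1}(s')),
    truncated at `horizon` with V_{horizon+1} = 0; an illustrative approximation of V*_n."""
    V: Dict[int, Dict[int, float]] = {horizon + 1: {s: 0.0 for s in states}}
    for n in range(horizon, 0, -1):
        V[n] = {
            s: max(
                sum(p * g(n, s, a, s2, V[n + 1][s2]) for s2, p in q(n, s, a).items())
                for a in actions
            )
            for s in states
        }
    return V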

LEMMA 3

For any n > 0, any h_n = (s_1, a_1, \ldots, s_n) ∈ H_n and any π_n, we have

E^{\pi_n} g_{n,n+1}(s_n, a_n, s_{n+1}, V^*_{n+1}(s_{n+1})) \le V^*_n(s_n).

Proof

This follows immediately from the proof of theorem 3. □


3. The properties of the optimal policy

DEFINITION

Let π ∈ Π, n ≥ 1 and h_n = (s_1, a_1, \ldots, s_n) ∈ H_n. If P_π(Y_1 = s_1, A_1 = a_1, \ldots, Y_n = s_n | Y_1 = s_1) > 0, then h_n is called a realizable history under the policy π (h_n is called simply an R-history (π)), where P_π(· | ·) is the conditional probability using the policy π. Sometimes, the above inequality is denoted simply by P_π{h_n | Y_1 = s_1} > 0.
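Computationally, an R-history is simply a history of positive probability under π. The following check is an illustrative sketch (not from the paper), reusing the stage-indexed transition law and history-dependent policy representation sketched in section 2; it multiplies the decision-rule and transition probabilities along the history.

def history_probability(h, policy, q) -> float:
    """P_pi{h_n | Y_1 = s_1} = prod_t pi_t(a_t | h_t) * q_t(s_{t+1} | s_t, a_t),
    where h = (s_1, a_1, s_2, ..., s_n)."""
    p = 1.0
    steps = (len(h) - 1) // 2            # number of (action, next state) pairs in h
    for t in range(steps):
        s_t, a_t, s_next = h[2 * t], h[2 * t + 1], h[2 * t + 2]
        h_t = h[: 2 * t + 1]             # the sub-history h_{t+1} = (s_1, a_1, ..., s_{t+1})
        p *= policy[t](h_t).get(a_t, 0.0)          # pi_{t+1}(a_t | h_{t+1})
        p *= q[t + 1](s_t, a_t).get(s_next, 0.0)   # q_{t+1}(s_{t+2} | s_{t+1}, a_{t+1})
    return p

def is_realizable(h, policy, q) -> bool:
    """h is an R-history (pi) exactly when its probability given Y_1 = s_1 is positive."""
    return history_probability(h, policy, q) > 0.0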

THEOREM 4

(Bellman's optimal principle). Let π ∈ Π be optimal, n ≥ 1, and let h_n = (s_1, a_1, \ldots, s_n) be an R-history (π). Then

E^{n-1,\pi}[g_n(s_n, a_n, \ldots) \mid h_n] = V^*_n(s_n).

Proof

Apply induction on n. Because π is optimal, the assertion is true for n = 1. Suppose it is true for n ≥ 1. Let h_{n+1} = (s_1, a_1, \ldots, s_n, a_n, s_{n+1}) be an R-history (π); then it is easy to see that h_n = (s_1, a_1, \ldots, s_n) is also an R-history (π). By the induction hypothesis,

E^{n-1,\pi}[g_n(s_n, \ldots) \mid h_n] = V^*_n(s_n).

By the definition of V^*_{n+1}, for every \tilde h_{n+1} = (h_n, \tilde a_n, \tilde s_{n+1}) \in H_{n+1} we have

E^{n,\pi}[g_{n+1}(\tilde s_{n+1}, \ldots) \mid \tilde h_{n+1}] \le V^*_{n+1}(\tilde s_{n+1}).   (4.1)

If E^{n,\pi}[g_{n+1}(s_{n+1}, \ldots) \mid h_{n+1}] < V^*_{n+1}(s_{n+1}), by monotonicity we have

g_{n,n+1}(s_n, a_n, s_{n+1}, E^{n,\pi}[g_{n+1}(s_{n+1}, \ldots) \mid h_{n+1}]) < g_{n,n+1}(s_n, a_n, s_{n+1}, V^*_{n+1}(s_{n+1})).   (4.2)

Because P_π{h_{n+1} | Y_1 = s_1} > 0, it is easy to see that

P_\pi\{A_n = a_n, Y_{n+1} = s_{n+1} \mid h_n\} > 0.   (4.3)

By (4.1)-(4.3) and lemma 3, we have


V^*_n(s_n) = E^{n-1,\pi}[g_n(s_n, \ldots) \mid h_n]
= E^{\pi_n}[g_{n,n+1}(s_n, \tilde a_n, \tilde s_{n+1}, E^{n,\pi}[g_{n+1}(\tilde s_{n+1}, \ldots) \mid \tilde h_{n+1} = (h_n, \tilde a_n, \tilde s_{n+1})]) \mid h_n]
= \sum_{\tilde a_n \in A_n,\, \tilde s_{n+1} \in S_{n+1}} P_\pi\{A_n = \tilde a_n, Y_{n+1} = \tilde s_{n+1} \mid h_n\} \times g_{n,n+1}(s_n, \tilde a_n, \tilde s_{n+1}, E^{n,\pi}g_{n+1}(\tilde s_{n+1}, \ldots))
< \sum_{\tilde a_n \in A_n,\, \tilde s_{n+1} \in S_{n+1}} P_\pi\{A_n = \tilde a_n, Y_{n+1} = \tilde s_{n+1} \mid h_n\} \times g_{n,n+1}(s_n, \tilde a_n, \tilde s_{n+1}, V^*_{n+1}(\tilde s_{n+1}))
= E^{\pi_n}[g_{n,n+1}(s_n, \tilde a_n, \tilde s_{n+1}, V^*_{n+1}(\tilde s_{n+1})) \mid h_n] \le V^*_n(s_n).

Here we have a contradiction. So E^{n,\pi}[g_{n+1}(s_{n+1}, \ldots) \mid h_{n+1}] = V^*_{n+1}(s_{n+1}), that is, the result is also true for the case of n + 1. □

For an optimal policy π = (π_1, π_2, ...) ∈ Π, which conditions should its decision rules π_n satisfy? Conversely, under which conditions on the π_n is the policy π optimal? Other authors have discussed these problems for some other models, e.g. [5,9,10]. For our model, theorem 5 answers these questions and is a generalization of theorem 7 in [5].

For n > 0 and s_n ∈ S_n, define

A^*_n(s_n) = \{ a_n \in A_n \mid \sum_{s_{n+1} \in S_{n+1}} q_n(s_{n+1} \mid s_n, a_n) g_{n,n+1}(s_n, a_n, s_{n+1}, V^*_{n+1}(s_{n+1})) = V^*_n(s_n) \}.

A^*_n(s_n) is called the optimal action set in state s_n at stage n.
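In computational terms (an illustrative sketch, not from the paper), A^*_n(s_n) collects the actions attaining the supremum in the optimal equation. With a finite action set and V^*_{n+1} available it can be enumerated directly; a small tolerance stands in for exact equality in floating point.

def optimal_action_set(n, s, actions, q, g, V_next, V_n_s, tol=1e-12):
    """A*_n(s) = {a in A_n : sum_{s'} q_n(s'|s,a) g_{n,n+1}(s,a,s',V*_{n+1}(s')) = V*_n(s)}."""
    best = []
    for a in actions:
        value = sum(p * g(n, s, a, s2, V_next[s2]) for s2, p in q(n, s, a).items())
        if abs(value - V_n_s) <= tol:
            best.append(a)
    return best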

THEOREM 5

A policy π = (π_1, π_2, ...) ∈ Π is optimal if and only if for every n ≥ 1: whenever h_n = (s_1, a_1, \ldots, s_n) ∈ H_n is an R-history (π) and a_n ∈ A_n − A^*_n(s_n), we have π_n(a_n | h_n) = 0.

Proof

(Necessity). Suppose π is optimal. For every n ≥ 1, if h_n = (s_1, a_1, \ldots, s_n) is an R-history (π), then by theorem 4, E^{n-1,\pi}[g_n(s_n, \ldots) \mid h_n] = V^*_n(s_n).


Let a^*_n ∈ A_n with π_n(a^*_n | h_n) > 0. We shall prove a^*_n ∈ A^*_n(s_n). If a^*_n ∉ A^*_n(s_n), by theorem 3 we have

\sum_{s_{n+1} \in S_{n+1}} q_n(s_{n+1} \mid s_n, a^*_n) g_{n,n+1}(s_n, a^*_n, s_{n+1}, V^*_{n+1}(s_{n+1})) < V^*_n(s_n).   (5.1)

By (5.1), theorem 3 and π_n(a^*_n | h_n) > 0, we have

V^*_n(s_n) = E^{n-1,\pi}[g_n(s_n, \ldots) \mid h_n]
= E^{\pi_n}[g_{n,n+1}(s_n, \tilde a_n, \tilde s_{n+1}, E^{n,\pi}[g_{n+1}(\tilde s_{n+1}, \ldots) \mid \tilde h_{n+1} = (h_n, \tilde a_n, \tilde s_{n+1})]) \mid h_n]
\le E^{\pi_n}[g_{n,n+1}(s_n, \tilde a_n, \tilde s_{n+1}, V^*_{n+1}(\tilde s_{n+1})) \mid h_n]
= \sum_{\tilde a_n \in A_n} \sum_{\tilde s_{n+1} \in S_{n+1}} P_\pi\{A_n = \tilde a_n, Y_{n+1} = \tilde s_{n+1} \mid h_n\} \times g_{n,n+1}(s_n, \tilde a_n, \tilde s_{n+1}, V^*_{n+1}(\tilde s_{n+1}))
= \sum_{\tilde a_n \in A_n} \pi_n(\tilde a_n \mid h_n) \sum_{\tilde s_{n+1} \in S_{n+1}} q_n(\tilde s_{n+1} \mid s_n, \tilde a_n) g_{n,n+1}(s_n, \tilde a_n, \tilde s_{n+1}, V^*_{n+1}(\tilde s_{n+1}))
< \sum_{\tilde a_n \in A_n} \pi_n(\tilde a_n \mid h_n) V^*_n(s_n) = V^*_n(s_n),

which is a contradiction. So, when π_n(a^*_n | h_n) > 0, we have a^*_n ∈ A^*_n(s_n). Hence, when a_n ∈ A_n − A^*_n(s_n), we have π_n(a_n | h_n) = 0.

(Sufficiency). For every s_1 ∈ S_1 and m ≥ 1, take a realizable history h_m = (s_1, a_1, \ldots, s_m); by the sufficiency assumption,

when π_m(\tilde a_m \mid h_m) > 0, we have \tilde a_m \in A^*_m(s_m).   (5.2)

By (5.2) and the definition of A^*_m(s_m), we have

E^{\pi_m}[g_{m,m+1}(s_m, \tilde a_m, \tilde s_{m+1}, V^*_{m+1}(\tilde s_{m+1})) \mid h_m]


= \sum_{\tilde a_m \in A_m,\ \pi_m(\tilde a_m \mid h_m) > 0} \pi_m(\tilde a_m \mid h_m) \sum_{\tilde s_{m+1} \in S_{m+1}} q_m(\tilde s_{m+1} \mid s_m, \tilde a_m) g_{m,m+1}(s_m, \tilde a_m, \tilde s_{m+1}, V^*_{m+1}(\tilde s_{m+1}))
= \sum_{\tilde a_m \in A_m,\ \pi_m(\tilde a_m \mid h_m) > 0} \pi_m(\tilde a_m \mid h_m) V^*_m(s_m) = V^*_m(s_m).   (5.3)

It is easy to see that h_{m-1} = (s_1, a_1, \ldots, s_{m-1}) is also a realizable history. If P_\pi\{A_{m-1} = \tilde a_{m-1}, Y_m = \tilde s_m \mid h_{m-1}\} > 0, then \tilde h_m = (h_{m-1}, \tilde a_{m-1}, \tilde s_m) is also a realizable history. Similarly to the proof of (5.3), we have

E^{\pi_m}[g_{m,m+1}(\tilde s_m, \tilde a_m, \tilde s_{m+1}, V^*_{m+1}(\tilde s_{m+1})) \mid \tilde h_m] = V^*_m(\tilde s_m).   (5.4)

By recursiveness and (5.4), we have

E^{(\pi_{m-1}, \pi_m)}[g_{m-1,m+1}(s_{m-1}, \tilde a_{m-1}, \tilde s_m, \tilde a_m, \tilde s_{m+1}, V^*_{m+1}(\tilde s_{m+1})) \mid h_{m-1}]
= E^{\pi_{m-1}}[g_{m-1,m}(s_{m-1}, \tilde a_{m-1}, \tilde s_m, E^{\pi_m}[g_{m,m+1}(\tilde s_m, \tilde a_m, \tilde s_{m+1}, V^*_{m+1}(\tilde s_{m+1})) \mid \tilde h_m = (h_{m-1}, \tilde a_{m-1}, \tilde s_m)]) \mid h_{m-1}]
= \sum_{\tilde a_{m-1} \in A_{m-1},\, \tilde s_m \in S_m,\, P_\pi\{A_{m-1} = \tilde a_{m-1}, Y_m = \tilde s_m \mid h_{m-1}\} > 0} P_\pi\{A_{m-1} = \tilde a_{m-1}, Y_m = \tilde s_m \mid h_{m-1}\}
    \times g_{m-1,m}(s_{m-1}, \tilde a_{m-1}, \tilde s_m, E^{\pi_m}[g_{m,m+1}(\tilde s_m, \tilde a_m, \tilde s_{m+1}, V^*_{m+1}(\tilde s_{m+1})) \mid \tilde h_m = (h_{m-1}, \tilde a_{m-1}, \tilde s_m)])
= \sum_{\tilde a_{m-1} \in A_{m-1},\, \tilde s_m \in S_m,\, P_\pi\{A_{m-1} = \tilde a_{m-1}, Y_m = \tilde s_m \mid h_{m-1}\} > 0} P_\pi\{A_{m-1} = \tilde a_{m-1}, Y_m = \tilde s_m \mid h_{m-1}\}
    \times g_{m-1,m}(s_{m-1}, \tilde a_{m-1}, \tilde s_m, V^*_m(\tilde s_m))
= E^{\pi_{m-1}}[g_{m-1,m}(s_{m-1}, \tilde a_{m-1}, \tilde s_m, V^*_m(\tilde s_m)) \mid h_{m-1}].

Similarly to the proof of (5.3), we have

E^{\pi_{m-1}}[g_{m-1,m}(s_{m-1}, \tilde a_{m-1}, \tilde s_m, V^*_m(\tilde s_m)) \mid h_{m-1}] = V^*_{m-1}(s_{m-1}).


Proceeding in this way, we obtain

E^{(\pi_1, \pi_2, \ldots, \pi_m)} g_{1,m+1}(s_1, \tilde a_1, \ldots, \tilde s_{m+1}, V^*_{m+1}(\tilde s_{m+1})) = V^*_1(s_1).

By recursiveness and lemma 1, we have

\| E^{\pi} g_1(s_1, \ldots) - E^{(\pi_1, \pi_2, \ldots, \pi_m)} g_{1,m+1}(s_1, \tilde a_1, \ldots, \tilde s_{m+1}, V^*_{m+1}(\tilde s_{m+1})) \|
= \| E^{(\pi_1, \pi_2, \ldots, \pi_m)} g_{1,m+1}(s_1, \tilde a_1, \ldots, \tilde s_{m+1}, E^{m,\pi}g_{m+1}(\tilde s_{m+1}, \ldots))
    - E^{(\pi_1, \pi_2, \ldots, \pi_m)} g_{1,m+1}(s_1, \tilde a_1, \ldots, \tilde s_{m+1}, V^*_{m+1}(\tilde s_{m+1})) \|
\le k_1 k_2 \cdots k_m \| E^{m,\pi}g_{m+1}(\tilde s_{m+1}, \ldots) - V^*_{m+1}(\tilde s_{m+1}) \|
\le 2 L k_1 k_2 \cdots k_m \to 0, \quad when m \to \infty.

Therefore,

E^{\pi} g_1(s_1, \ldots) = \lim_{m \to \infty} E^{(\pi_1, \pi_2, \ldots, \pi_m)} g_{1,m+1}(s_1, \tilde a_1, \ldots, \tilde s_{m+1}, V^*_{m+1}(\tilde s_{m+1})) = V^*_1(s_1), \quad for any s_1 \in S_1.

Hence, π is optimal. □

Theorem 5 states that a policy is optimal if and only if each of its decision rules takes an optimal action under every realizable history. The following theorem is an application of theorem 5.

THEOREM 6

(a) If there is an optimal policy in Π, then there is π ∈ Π_m which is optimal in Π.

(b) If π ∈ Π_m is optimal in Π_m, then π is also optimal in Π.

Proof

(a) Let π = (π_1, π_2, ...) ∈ Π be optimal. For every n ≥ 1 and s_n ∈ S_n:

(i) Suppose there is h_n = (s_1, a_1, \ldots, s_n) which is an R-history (π). Fix such an R-history h^0_n = (s^0_1, a^0_1, \ldots, s_n) and take a_n ∈ A_n such that π_n(a_n | h^0_n) > 0. Define g_n(s_n) = a_n. By theorem 5, a_n ∈ A^*_n(s_n).

(ii) If (i) is not true, then take any a_n ∈ A_n and define g_n(s_n) = a_n.

Set π* = (g_1, g_2, ...). Obviously π* ∈ Π_m.


PROPOSITION 1

Let n ≥ 1, s_n ∈ S_n. If h_n = (s_1, a_1, \ldots, s_n) is an R-history (π*), then there is h'_n = (s'_1, a'_1, \ldots, s_n) which is an R-history (π).

Proof of proposition 1

Apply induction on n. It is trivial for n = 1. Assume proposition 1 is true for n ≥ 1. If h^*_{n+1} = (\bar s_1, \bar a_1, \ldots, \bar s_n, \bar a_n, \bar s_{n+1}) is an R-history (π*), then \bar a_n = g_n(\bar s_n) and q_n(\bar s_{n+1} \mid \bar s_n, \bar a_n) > 0. Clearly, \bar h_n = (\bar s_1, \bar a_1, \ldots, \bar s_n) is also an R-history (π*). By the induction assumption there is h'_n = (s'_1, a'_1, \ldots, \bar s_n) which is an R-history (π). By the definition of g_n, there is h^0_n = (s^0_1, a^0_1, \ldots, \bar s_n) which is an R-history (π) such that π_n(g_n(\bar s_n) | h^0_n) > 0. Setting h^0_{n+1} = (h^0_n, \bar a_n, \bar s_{n+1}), we have

P_\pi\{h^0_{n+1} \mid Y_1 = s^0_1\} = q_n(\bar s_{n+1} \mid \bar s_n, \bar a_n) \pi_n(\bar a_n \mid h^0_n) P_\pi\{h^0_n \mid Y_1 = s^0_1\}
= q_n(\bar s_{n+1} \mid \bar s_n, \bar a_n) \pi_n(g_n(\bar s_n) \mid h^0_n) P_\pi\{h^0_n \mid Y_1 = s^0_1\} > 0.

Therefore, h^0_{n+1} is an R-history (π). So proposition 1 is also true for n + 1. The proof of proposition 1 is complete.

For every n ≥ 1 and s_n ∈ S_n: if h_n = (s_1, a_1, \ldots, s_n) is an R-history (π*), then by proposition 1 there is h'_n = (s'_1, a'_1, \ldots, s_n) which is an R-history (π). By the definition of g_n, g_n(s_n) ∈ A^*_n(s_n). So, we have

\pi^*_n(a_n \mid h_n) = 0, \quad when a_n \in A_n - A^*_n(s_n).

By theorem 5, π* is optimal.

(b) By theorem 2, (b) is obviously true. □

Theorem 6 states the following important fact: the problems of the existence and the computation of an optimal policy in Π can be reduced to the same problems in Π_m.

For the finite horizon model (see [4]), theorems 4, 5 and 6 are still true. It is unnecessary to go into details here.

4. An algorithm to find a stationary optimal policy

In addition to the preceding assumptions, in this section we assume that {g_{m,n}} is stationary, that is: for any m, n ≥ 1 and c ∈ \mathbb{R}^1, when s_m = s_n, a_m = a_n and s_{m+1} = s_{n+1}, we have

g_{m,m+1}(s_m, a_m, s_{m+1}, c) = g_{n,n+1}(s_n, a_n, s_{n+1}, c).


This assumption is from ref. [1]. Now, we also assume that {q_n} is homogeneous, that is: q_n(s' | s, a) = q(s' | s, a) for any n ≥ 1, s, s' ∈ S and a ∈ A.

LEMMA 4

For any m, m" > l , k > O and f i ~ F, i = m, m + l . . . . . m + k, and u is a bounded real funct ion on S, when s = s , , we have

E (fro'I'+' ..... /"§ (Sm, am . . . . . Sin+k+ 1, U(S. ,+k+I ))

= E(_fm ,f.~+ 1 ..... / " § k)g . , ' , . , '+ k + 1 (sin ' , am' , . . . . s . , '+ k + 1, u(sm'+k + x )).

P r o o f

Because {g.,,n} are s ta t ionary and {qn} are homogeneous , it is ea sy to see, when s = s , ,

E f " g,,,',m" + I (sin' , am', Sm'+ 1, u(sm" + 1 ))

= Ef"gm' ,m'+ I (Srn ,fro (sin), sin'+ 1, u(sm'+ 1 ))

= E f 'gm,rn + 1 (sin ,fro (sin), sin'+ 1, u(srn'+ 1 ))

= Ef"gm,rn+l (sin , a m , s m + l ,U(Sm+l )).

B y recurs iveness , we can p rove this l e m m a by induct ion.

LEMMA 5

For any m, m' > 0 and f_i ∈ F, i = m, m+1, \ldots, when s_m = s_{m'} we have

E^{(f_m, f_{m+1}, \ldots)} g_m(s_m, a_m, \ldots) = E^{(f_m, f_{m+1}, \ldots)} g_{m'}(s_{m'}, a_{m'}, \ldots).

Proof

By recursiveness, lemma 4 and lemma 1, we have

| E^{(f_m, f_{m+1}, \ldots)} g_{m'}(s_{m'}, a_{m'}, \ldots) - E^{(f_m, f_{m+1}, \ldots)} g_m(s_m, a_m, \ldots) |
= | E^{(f_m, \ldots, f_{m+k})} g_{m',m'+k+1}(s_{m'}, a_{m'}, \ldots, s_{m'+k+1}, E^{(f_{m+k+1}, \ldots)} g_{m'+k+1}(s_{m'+k+1}, \ldots)) - E^{(f_m, \ldots)} g_m(s_m, a_m, \ldots) |


= I E~J" ..... ["+k)gm,m+k+l (Sin,a,. . . . . . S , . + k + l , E if'+*+1 .... )g , . '+k+l (Sm+k+t . . . . ))

-E( f= ..... f"*k) gm,m+k + l (Sm,am . . . . . Sm+k + l ,E (f"**§ ~ .... )gm+k + l (Sm+k + l . . . . ))1

< 2Lkmk, , ,+l . . . km+k --~ O, when k --+ oo.

Hence, l emma 5 is true. []

THEOREM 7

For any m > 0 and s ∈ S, we have V^*_m(s) = V^*_1(s) and A^*_m(s) = A^*_1(s).

Proof

By theorem 2, lemma 5, the definition of A^*_n(s) and the stationarity hypothesis, theorem 7 is obviously true. □

The following theorem is a generalization of theorem 2 in Blackwell [6], and it is also an application of our theorem 5.

THEOREM 8

(a) Let π = (π_1, π_2, \ldots, π_{n-1}, f_n, π_{n+1}, \ldots) ∈ Π be optimal, f_n ∈ F. If for any s_n ∈ S_n there is h_n = (s_1, a_1, \ldots, s_n) which is an R-history (π), then the stationary policy f_n^\infty is also optimal.

(b) Let π = (f_1, π_2, \ldots) ∈ Π be optimal, f_1 ∈ F. Then f_1^\infty is also optimal.

Proof

(a) For any s_n ∈ S_n, by theorem 5, f_n(s_n) ∈ A^*_n(s_n). For any m > 0 and s_m ∈ S_m, obviously s_m ∈ S_n, so f_n(s_m) ∈ A^*_n(s_m). By theorem 7, we have

f_n(s_m) \in A^*_n(s_m) = A^*_m(s_m).

By theorem 5, f_n^\infty is optimal.

(b) Similar to the proof of (a). □

Let f ∈ F and v ∈ V ≡ {E^π g_1 | π ∈ Π_m}. Define

[T_f v](s) = \sum_{s' \in S} q(s' \mid s, f(s)) g_{1,2}(s, f(s), s', v(s')), \quad s \in S.

Clearly,

[T_f v](s) = E^{f} g_{1,2}(s, a, s', v(s')), \quad s \in S.
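In code, T_f is a single expected-reward sweep over the states. The sketch below is illustrative only and assumes finite S; q here is the homogeneous kernel q(s'|s,a) and f is a table mapping s to f(s).

def T_f(v, f, states, q, g12):
    """[T_f v](s) = sum_{s'} q(s'|s, f(s)) g_{1,2}(s, f(s), s', v(s'))."""
    return {
        s: sum(p * g12(s, f[s], s2, v[s2]) for s2, p in q(s, f[s]).items())
        for s in states
    }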


Let u, v be real functions on a set X; u ≤ v means u(x) ≤ v(x) for all x ∈ X, and u < v means u ≤ v and u ≠ v.

LEMMA 6

(a) Let u, v ∈ V and u ≤ v. Then T_f u ≤ T_f v.

(b) If f ∈ F and π ∈ Π_m, then T_f(E^{\pi}g_1) = E^{(f,\pi)}g_1.

(c) Let π ∈ Π_m. Then T_f^n(E^{\pi}g_1) \to E^{f^\infty}g_1 when n \to \infty.

Proof

(a) and (b) are trivial.

(c) By (b) and recursiveness,

T_f^n(E^{\pi}g_1)(s) = E^{(f^n, \pi)}g_1(s)
= E^{f^n} g_{1,n+1}(s, \ldots, s_{n+1}, E^{\pi}g_{n+1}(s_{n+1}, \ldots)), \quad s \in S.

So, by lemma 1,

| E^{(f^n, \pi)}g_1(s) - E^{f^\infty}g_1(s) | \le 2 L k_1 k_2 \cdots k_n \to 0, \quad when n \to \infty, s \in S.

Therefore, (c) is true. □

THEOREM 9

Let f ∈ F, and for s ∈ S define

G(s, f) = \{ a \in A \mid \sum_{s' \in S} q(s' \mid s, a) g_{1,2}(s, a, s', E^{f^\infty}g_1(s')) > E^{f^\infty}g_1(s) \}.

Then:

(1) If G(s, f) = \emptyset for all s ∈ S, then f^\infty is optimal.

(2) If the condition in (1) is not true, define g ∈ F such that

(a) g(s) ∈ G(s, f), when G(s, f) \ne \emptyset;

(b) g(s) = f(s), when G(s, f) = \emptyset.

Then E^{g^\infty}g_1 > E^{f^\infty}g_1.


Proof

(1) By the hypothesis of (1), we have E^{f^\infty}g_1 \ge T_{\varphi}(E^{f^\infty}g_1) for any \varphi ∈ F. For any π = (f_1, f_2, \ldots) ∈ Π_m, similarly to the proof of theorem 3.1 in [3], we have E^{f^\infty}g_1 \ge E^{\pi}g_1. Hence, by theorem 2,

E^{f^\infty}g_1(s) \ge \sup_{\pi \in \Pi_m} E^{\pi}g_1(s) = V^*_1(s), \quad for any s \in S.

So, f^\infty is optimal in Π.

(2) By the definition of g, we have T_g(E^{f^\infty}g_1) > E^{f^\infty}g_1. Similarly to the proof of theorem 3.2 in [3], we have E^{g^\infty}g_1 > E^{f^\infty}g_1. □

Theorem 9 yields an iteration algorithm for finding an optimal stationary policy. If F is finite, we can find an optimal stationary policy by this theorem. It is easy to see that theorem 9 is a generalization of Howard's [7] and Iwamoto's [3] algorithms.
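Concretely, theorem 9 gives a Howard-type policy-iteration loop. The sketch below is illustrative only: finite S, A and F are assumed, and the policy evaluation v = E^{f^\infty}g_1 is taken as a supplied subroutine `evaluate`, since computing it depends on the particular recursive reward.

def improve(f, states, actions, q, g12, evaluate):
    """One improvement step of theorem 9: for each state s, look for an action in
    G(s, f) = {a : sum_{s'} q(s'|s,a) g_{1,2}(s, a, s', E^{f^inf}g_1(s')) > E^{f^inf}g_1(s)}
    and switch to it; keep f(s) wherever G(s, f) is empty."""
    v = evaluate(f)                          # v[s] = E^{f^inf} g_1(s)
    g_new, improved = dict(f), False
    for s in states:
        for a in actions:
            if sum(p * g12(s, a, s2, v[s2]) for s2, p in q(s, a).items()) > v[s]:
                g_new[s], improved = a, True
                break
    return g_new, improved

def policy_iteration(f0, states, actions, q, g12, evaluate):
    """Iterate the improvement step until G(s, f) is empty for every s; by theorem 9 the
    returned f defines an optimal stationary policy f^inf (finite F gives termination,
    since each improvement strictly increases E^{f^inf}g_1)."""
    f = dict(f0)
    while True:
        f, improved = improve(f, states, actions, q, g12, evaluate)
        if not improved:
            return f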

Remark

Theorem 3.3 in [3] proved only that, by the algorithm, we can find f^\infty which is optimal in Π_m (when F is finite). It was not proved in ref. [3] that f^\infty is also optimal in Π.

Let \bar V = V \cup \{V^*_1\}; for V ∈ \bar V define

[T V](s) = \sup_{a \in A} \sum_{s' \in S} q(s' \mid s, a) g_{1,2}(s, a, s', V(s')), \quad s \in S.

LEMMA 7

T V^*_1 = V^*_1.

Proof

By theorems 7 and 3, we have

[T V^*_1](s) = \sup_{a \in A} \sum_{s' \in S} q(s' \mid s, a) g_{1,2}(s, a, s', V^*_1(s'))
= \sup_{a \in A} \sum_{s' \in S} q(s' \mid s, a) g_{1,2}(s, a, s', V^*_2(s')) = V^*_1(s), \quad s \in S. □

If F is infinite, we assume that for any f ∈ F there is g ∈ F such that

T E^{f^\infty}g_1 = T_g E^{f^\infty}g_1.


Then we have the following iteration algorithm.

Algorithm (A)

Let f_0 ∈ F. By the hypothesis, there is f_1 ∈ F such that T E^{f_0^\infty}g_1 = T_{f_1} E^{f_0^\infty}g_1. In general, suppose f_n ∈ F is known; by the hypothesis, there is f_{n+1} ∈ F such that T E^{f_n^\infty}g_1 = T_{f_{n+1}} E^{f_n^\infty}g_1. For {f_0, f_1, ...}, we have the following result.
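A sketch of algorithm (A) in the same illustrative style follows. Here `argmax_rule` assumes the supremum in [T E^{f_n^\infty}g_1](s) is attained (which is exactly the hypothesis T E^{f^\infty}g_1 = T_g E^{f^\infty}g_1 made above), the action set is enumerated finitely so that max() is computable, and policy evaluation is again a supplied subroutine.

def argmax_rule(v, states, actions, q, g12):
    """Pick f_{n+1}(s) attaining sup_a sum_{s'} q(s'|s,a) g_{1,2}(s, a, s', v(s'))."""
    return {
        s: max(actions,
               key=lambda a: sum(p * g12(s, a, s2, v[s2]) for s2, p in q(s, a).items()))
        for s in states
    }

def algorithm_A(f0, states, actions, q, g12, evaluate, max_sweeps=100):
    """Iterate f_{n+1} = argmax_rule(E^{f_n^inf}g_1, ...); stop when the rule repeats
    (then f_n^inf is optimal by theorem 10(1)) or after max_sweeps iterations."""
    f = dict(f0)
    for _ in range(max_sweeps):
        f_next = argmax_rule(evaluate(f), states, actions, q, g12)
        if f_next == f:
            return f
        f = f_next
    return f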

THEOREM 10

(1) If f_{n+1} = f_n, then f_n^\infty is optimal in Π.

(2) \sup_{s \in S} | E^{f_{n+1}^\infty}g_1(s) - V^*_1(s) | \le k_1 k_2 \cdots k_{n+1} \sup_{s \in S} | E^{f_0^\infty}g_1(s) - V^*_1(s) |.

Proof

(1) By theorem 9, (1) is obviously true.

(2) It is easy to see that

T_{f_{n+1}} E^{f_n^\infty}g_1 = T E^{f_n^\infty}g_1 \ge T_{f_n} E^{f_n^\infty}g_1 = E^{f_n^\infty}g_1.

By lemma 6(a),

T_{f_{n+1}}^{m+1} E^{f_n^\infty}g_1 \ge T_{f_{n+1}}^{m} E^{f_n^\infty}g_1, \quad for any m \ge 0.

By lemma 6(c),

E^{f_{n+1}^\infty}g_1 \ge T_{f_{n+1}} E^{f_n^\infty}g_1.

Hence,

E^{f_n^\infty}g_1 \le T E^{f_n^\infty}g_1 = T_{f_{n+1}} E^{f_n^\infty}g_1 \le E^{f_{n+1}^\infty}g_1 \le V^*_1,

that is,

\sup_{s \in S} | E^{f_{n+1}^\infty}g_1(s) - V^*_1(s) | \le \sup_{s \in S} | [T E^{f_n^\infty}g_1](s) - V^*_1(s) |
= \sup_{s \in S} | [T E^{f_n^\infty}g_1](s) - [T V^*_1](s) |.   (10.1)

For any \varphi ∈ F and s ∈ S, by the definition of T, the stationarity hypothesis and the Lipschitz condition, we have


[T_{\varphi} E^{f_n^\infty}g_1](s) - [T V^*_1](s) \le [T_{\varphi} E^{f_n^\infty}g_1](s) - [T_{\varphi} V^*_1](s)
= E^{\varphi} g_{1,2}(s, a, s', E^{f_n^\infty}g_1(s')) - E^{\varphi} g_{1,2}(s, a, s', V^*_1(s'))
= E^{\varphi} g_{n+1,n+2}(s, a, s', E^{f_n^\infty}g_1(s')) - E^{\varphi} g_{n+1,n+2}(s, a, s', V^*_1(s'))
\le k_{n+1} \sup_{s \in S} | E^{f_n^\infty}g_1(s) - V^*_1(s) |.

Hence,

[T E^{f_n^\infty}g_1](s) - [T V^*_1](s) \le k_{n+1} \sup_{s \in S} | E^{f_n^\infty}g_1(s) - V^*_1(s) |, \quad for any s \in S.

Similarly,

[T V^*_1](s) - [T E^{f_n^\infty}g_1](s) \le k_{n+1} \sup_{s \in S} | E^{f_n^\infty}g_1(s) - V^*_1(s) |, \quad for any s \in S.

So,

\sup_{s \in S} | [T E^{f_n^\infty}g_1](s) - [T V^*_1](s) | \le k_{n+1} \sup_{s \in S} | E^{f_n^\infty}g_1(s) - V^*_1(s) |.

By (10.1), we have

\sup_{s \in S} | E^{f_{n+1}^\infty}g_1(s) - V^*_1(s) | \le k_{n+1} \sup_{s \in S} | E^{f_n^\infty}g_1(s) - V^*_1(s) |, \quad n \ge 0.

By repeatedly using the above formula, (2) is obviously true. □

Theorem 10 states that algorithm (A) is convergent: either it finds an optimal stationary policy in finitely many steps, or {E^{f_n^\infty}g_1} converges uniformly to V^*_1 on S. For any ε > 0, we can therefore find an ε-optimal stationary policy in finitely many steps by algorithm (A).
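As a small numerical illustration (not from the paper), if the Lipschitz constants are constant, k_n ≡ β < 1, then theorem 10(2) together with the uniform bound L gives sup_s |E^{f_{n+1}^\infty}g_1(s) - V^*_1(s)| \le 2L\beta^{n+1}, so the number of sweeps needed for ε-optimality can be read off directly:

import math

def sweeps_for_epsilon(beta: float, L: float, eps: float) -> int:
    """Smallest n with 2*L*beta**(n+1) <= eps: the number of algorithm (A) sweeps that
    guarantees an eps-optimal stationary policy when k_n == beta < 1 (illustrative bound)."""
    return max(0, math.ceil(math.log(eps / (2 * L)) / math.log(beta)) - 1)

# Example: beta = 0.9, L = 10, eps = 1e-3 gives 93 sweeps.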

References

[1] N. Furukawa and S. Iwamoto, Markovian decision processes with recursive reward functions, Bull. Math. Statist. 15, 3-4(1973)79-91.

[2] N. Furukawa and S. Iwamoto, Correction to "Markovian decision processes with recursive reward functions", Bull. Math. Statist. 16, 1-2(1974)127.

[3] S. Iwamoto, Discrete dynamic programming with a recursive additive system, Bull. Math. Statist. 16, 1-2(1974)49-66.

[4] N. Furukawa and S. Iwamoto, Dynamic programming on recursive reward systems, Bull. Math. Statist. 17, 1-2(1976)103-126.

[5] Dong Zeqing and Liu Ke, Structure of optimal policies for discounted Markovian decision programming, J. Math. Res. Exposition 6, 3(1986)125-134, in Chinese.


[6] D. Blackwell, Discrete dynamic programming, Ann. Math. Statist. 33(1962)719-726.

[7] R.A. Howard, Dynamic Programming and Markov Processes (Wiley, New York, 1960).

[8] Dong Zeqing, Lecture on Markovian decision programming, Institute of Applied Mathematics, Academia Sinica, Beijing, Mimeograph (1985), in Chinese.

[9] Dong Zeqing and Zhang Sheng, On the properties of ε(>0) optimal policies in the discounted unbounded return model, Acta Math. Appl. Sinica (English Series) 3, 1(1987)15-25.

[10] Dong Zeqing and Liu Ke, Structure of optimal policies for discounted semi-Markov decision programming with unbounded rewards, Sci. Sinica (Ser. A), 4(1986)337-349.