Undiscounted Infinite-Horizon DP



Page 1: Undiscounted infinite-horizon DP

Undiscounted infinite-horizon DP: Stochastic shortest path & average reward DP

Cathy Wu
6.246 Reinforcement Learning: Foundations and Methods

Mar 2, 2021

Page 2: Undiscounted infinite-horizon DP

References

1. Dimitri Bertsekas. Dynamic Programming and Optimal Control (DPOC), Vol. I, Chapter 5.

2. Dimitri Bertsekas. MIT 6.231 Dynamic Programming and Stochastic Control. Fall 2015, Lectures 10-12 & 17-18.

3. Daniela Pucci De Farias. MIT 2.997 Decision-Making in Large-Scale Systems. Spring 2004, Lectures 4-5.

4. Dimitrios Katselis. UIUC ECE586 MDPs and Reinforcement Learning. Spring 2019, Lecture 12. Acknowledgement: R. Srikant.

Page 3: Undiscounted infinite-horizon DP

Outline

1. Undiscounted problems

2. Stochastic shortest path

3. Average reward dynamic programming

Page 4: Undiscounted infinite-horizon DP

Undiscounted Problems

§ System: $x_{k+1} = f(x_k, u_k, w_k)$
§ Value of a policy π = {μ_0, μ_1, …}:

$$J_\pi(x_0) = \limsup_{N \to \infty} \; \mathbb{E}_{w_k,\,k=0,1,\dots} \Big[ \sum_{k=0}^{N-1} g\big(x_k, \mu_k(x_k), w_k\big) \Big]$$

§ Note that J_π(x_0) and J*(x_0) can be +∞ or −∞.

§ Shorthand notation for the DP mappings:

$$(TJ)(x) = \max_{u \in U(x)} \mathbb{E}_w\big[ g(x,u,w) + J\big(f(x,u,w)\big) \big], \quad \forall x$$

$$(T_\mu J)(x) = \mathbb{E}_w\big[ g(x,\mu(x),w) + J\big(f(x,\mu(x),w)\big) \big], \quad \forall x$$

§ T and T_μ need not be contractions in general, but their monotonicity is helpful (see DPOC Vol. II, Ch. 4).

§ Stochastic shortest path (SSP) problems provide a “soft boundary” between the easy finite-state discounted problems and the hard undiscounted problems.
• They share features of both.

• Some nice theory is recovered thanks to the termination state, and special conditions.

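As a small illustration of these mappings, here is a minimal sketch of T and T_μ for a finite-state, finite-control problem with expected one-stage rewards; the toy transition model, reward table, and all variable names are assumptions for illustration, not part of the lecture.

```python
import numpy as np

# Minimal sketch of the DP mappings T and T_mu (reward maximization).
n_states, n_actions = 3, 2
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))  # P[u, x, y]
g = rng.normal(size=(n_states, n_actions))                        # g[x, u] = E_w[g(x, u, w)]

def T(J):
    """(T J)(x) = max_u E[g(x, u, w) + J(f(x, u, w))]."""
    Q = g + np.einsum("uxy,y->xu", P, J)   # Q[x, u]
    return Q.max(axis=1)

def T_mu(J, mu):
    """(T_mu J)(x) = E[g(x, mu(x), w) + J(f(x, mu(x), w))]."""
    Q = g + np.einsum("uxy,y->xu", P, J)
    return Q[np.arange(n_states), mu]

J0 = np.zeros(n_states)
mu = np.zeros(n_states, dtype=int)
print(T(J0), T_mu(J0, mu))   # one application of each mapping
```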

Page 5: Undiscounted infinite-horizon DP

“Easy” and “Difficult” Problems

§ Easy problems
• All of them are finite-state, finite-control
• Bellman's equation has a unique solution
• Optimal policies are obtained from Bellman's equation
• Value and policy iteration algorithms apply

§ Somewhat complicated problems [last week, today]
• Infinite state, discounted, bounded g (contractive structure)
• Finite-state SSP with “nearly” contractive structure
• Bellman's equation has a unique solution; value and policy iteration work

§ Difficult problems (with additional structure) [today]
• Infinite state, g ≥ 0 or g ≤ 0 for all (x, u, w), deterministic problems
• SSP without contractive structure
• Average reward

§ Hugely large and/or model-free problems [next lectures]
• Big state space and/or simulation model
• Approximate DP methods

§ Continuous, measure theoretic formulations (not in this course)

Page 6: Undiscounted infinite-horizon DP

Outline

1. Undiscounted problems

2. Stochastic shortest path
   a. Results overview
   b. Connection to discounted problems
   c. Analysis sketch
   d. Significance of proper policies
   e. Analysis sketch (sequel)

3. Average reward dynamic programming

Page 7: Undiscounted infinite-horizon DP

Stochastic Shortest Path Problems

§ Assume a finite-state system: states 1, …, n and a special cost-free termination state t
• Transition probabilities p_ij(u)
• Action/control constraints u ∈ U(i) (finite set)
• Value of policy π = {μ_0, μ_1, …} is:

$$J_\pi(i) = \lim_{N \to \infty} \mathbb{E}\Big[ \sum_{k=0}^{N-1} g\big(x_k, \mu_k(x_k)\big) \,\Big|\, x_0 = i \Big]$$

• Bounded g
• Optimal policy if J_π(i) = J*(i) for all i.

§ Assumption (Termination inevitable): There exists an integer m such that for every policy and initial state, there is positive probability that the termination state will be reached after no more than m steps; for all π, we have

$$\rho_\pi = \max_{i=1,\dots,n} \mathbb{P}\big(x_m \neq t \mid x_0 = i, \pi\big) < 1$$


Page 8: Undiscounted infinite-horizon DP

Assumption

§ Assumption (Termination inevitable): There exists an integer m such that for every policy and initial state, there is positive probability that the termination state t will be reached after no more than m steps; for all π, we have

$$\rho_\pi = \max_{i=1,\dots,n} \mathbb{P}\big(x_m \neq t \mid x_0 = i, \pi\big) < 1$$

§ Note: We have ρ = max_π ρ_π < 1, which is “tractable” since ρ_π depends only on the first m components of π.

§ Shortest path routing examples:
• acyclic (assumption is satisfied)
• non-acyclic (assumption is not satisfied)

Page 9: Undiscounted infinite-horizon DP

Discounted Problems

§ Assume a discount factor α < 1.
§ Conversion to an SSP problem ⟹ add a termination state that is reached with probability 1 − α at every transition, while the other transition probabilities are scaled by α.
§ The kth-stage cost is the same for both problems.
§ Value iteration converges to J* for all initial J_0:

$$J_{k+1}(i) = \max_{u \in U(i)} \Big[ g(i,u) + \alpha \sum_{j=1}^{n} p_{ij}(u)\, J_k(j) \Big], \quad \forall i$$

§ J* is the unique solution of Bellman's equation:

$$J^*(i) = \max_{u \in U(i)} \Big[ g(i,u) + \alpha \sum_{j=1}^{n} p_{ij}(u)\, J^*(j) \Big], \quad \forall i$$

§ Policy iteration terminates with an optimal policy, and linear programming works.

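The conversion can be checked numerically. The sketch below builds the explicit (n+1)-state SSP for an assumed toy MDP (all names are illustrative): from state i under control u the SSP moves to j with probability α·p_ij(u) and to the added cost-free termination state with probability 1 − α, and value iteration on either formulation returns the same values.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, alpha = 4, 2, 0.9
P = rng.dirichlet(np.ones(n), size=(m, n))   # P[u, i, j], toy discounted MDP
g = rng.normal(size=(n, m))                  # expected stage reward g(i, u)

def vi_discounted(iters=2000):
    J = np.zeros(n)
    for _ in range(iters):
        J = (g + alpha * np.einsum("uij,j->iu", P, J)).max(axis=1)
    return J

def vi_ssp(iters=2000):
    # Explicit (n+1)-state SSP: index n is the added cost-free termination state.
    P_ssp = np.zeros((m, n + 1, n + 1))
    P_ssp[:, :n, :n] = alpha * P             # scaled original transitions
    P_ssp[:, :n, n] = 1 - alpha              # terminate with probability 1 - alpha
    P_ssp[:, n, n] = 1.0                     # termination state is absorbing
    g_ssp = np.zeros((n + 1, m))
    g_ssp[:n] = g                            # zero reward once terminated
    J = np.zeros(n + 1)
    for _ in range(iters):
        J = (g_ssp + np.einsum("uij,j->iu", P_ssp, J)).max(axis=1)
    return J[:n]

print(np.allclose(vi_discounted(), vi_ssp()))   # True: the two formulations agree
```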

Page 10: Undiscounted infinite-horizon DP

Main Results

§ Given any initial conditions J_0(1), …, J_0(n), the sequence J_k(i) generated by value iteration,

$$J_{k+1}(i) = \max_{u \in U(i)} \Big[ g(i,u) + \sum_{j=1}^{n} p_{ij}(u)\, J_k(j) \Big], \quad \forall i,$$

converges to the optimal cost J*(i) for each i.
§ Bellman's equation has J*(i) as its unique solution:

$$J^*(i) = \max_{u \in U(i)} \Big[ g(i,u) + \sum_{j=1}^{n} p_{ij}(u)\, J^*(j) \Big], \quad \forall i, \qquad J^*(t) = 0$$

§ A stationary policy μ is optimal if and only if, for every state i, μ(i) attains the maximum in Bellman's equation.
§ Key proof idea: the “tail” of the cost series,

$$\sum_{k=mK}^{\infty} \mathbb{E}\big[ g\big(x_k, \mu_k(x_k)\big) \big],$$

vanishes as K → ∞.
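The sketch below runs this value iteration on an assumed toy SSP in which every action reaches the termination state with positive probability, so termination is inevitable; the model and variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 4, 2
raw = rng.dirichlet(np.ones(n + 1), size=(m, n))
raw[:, :, n] += 0.1                          # force positive termination probability
P = raw / raw.sum(axis=2, keepdims=True)     # P[u, i, j]; column j = n is the termination state t
g = rng.normal(size=(n, m))                  # expected stage reward g(i, u)

J = np.zeros(n + 1)                          # J[n] plays the role of J*(t) = 0
for _ in range(10_000):
    Q = g + np.einsum("uij,j->iu", P, J)     # Q[i, u]
    J_new = np.append(Q.max(axis=1), 0.0)
    if np.max(np.abs(J_new - J)) < 1e-12:
        break
    J = J_new

mu = (g + np.einsum("uij,j->iu", P, J)).argmax(axis=1)  # stationary policy attaining the max
print(J[:n], mu)
```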

Page 11: Undiscounted infinite-horizon DP

Finiteness of Policy Rewards

§ View ρ = max_π ρ_π < 1 as an upper bound on the non-termination probability during the first m steps, regardless of the policy used.
§ For any π and any initial state i,

$$\mathbb{P}(x_{2m} \neq t \mid x_0 = i, \pi) = \mathbb{P}(x_{2m} \neq t \mid x_m \neq t, x_0 = i, \pi) \times \mathbb{P}(x_m \neq t \mid x_0 = i, \pi) \le \rho^2,$$

§ and similarly, $\mathbb{P}(x_{km} \neq t \mid x_0 = i, \pi) \le \rho^k$, for i = 1, …, n.
§ So

$$\mathbb{E}\{\text{Reward between times } km \text{ and } (k+1)m - 1\} \le m \rho^k \max_{i=1,\dots,n,\; u \in U(i)} \big| g(i,u) \big|$$

§ and

$$\big| J_\pi(i) \big| \le \sum_{k=0}^{\infty} m \rho^k \max_{i=1,\dots,n,\; u \in U(i)} \big| g(i,u) \big| = \frac{m}{1-\rho} \max_{i=1,\dots,n,\; u \in U(i)} \big| g(i,u) \big|$$

Page 12: Undiscounted infinite-horizon DP

Proof: J_k → J* (sketch)

§ Assume for simplicity that J_0(i) = 0, ∀i. For any K ≥ 1, write the cost of any policy π as

$$J_\pi(x_0) = \sum_{k=0}^{mK-1} \mathbb{E}\big[ g\big(x_k, \mu_k(x_k)\big) \big] + \sum_{k=mK}^{\infty} \mathbb{E}\big[ g\big(x_k, \mu_k(x_k)\big) \big] \le \sum_{k=0}^{mK-1} \mathbb{E}\big[ g\big(x_k, \mu_k(x_k)\big) \big] + \sum_{k=K}^{\infty} \rho^k m \max_{x,u} \big| g(x,u) \big|$$

§ Take the maximum of both sides over π to obtain

$$J^*(x_0) \le J_{mK}(x_0) + \frac{\rho^K}{1-\rho}\, m \max_{x,u} \big| g(x,u) \big|$$

§ Similarly, we have

$$J_{mK}(x_0) - \frac{\rho^K}{1-\rho}\, m \max_{x,u} \big| g(x,u) \big| \le J^*(x_0)$$

§ It follows that $\lim_{K \to \infty} J_{mK}(x_0) = J^*(x_0)$.
§ J_{mK}(x_0) and J_{mK+k}(x_0) converge to the same limit for k < m (since k extra steps far into the future don't matter), so J_k(x_0) → J*(x_0).
§ Similarly, J_0 ≠ 0 does not matter.

Page 13: Undiscounted infinite-horizon DP

Example

§ Minimizing the expected time to termination: Let

$$g(i,u) = 1, \quad \forall i = 1,\dots,n,\; u \in U(i)$$

§ Under our assumptions, the costs J*(i) uniquely solve Bellman's equation, which has the form

$$J^*(i) = \min_{u \in U(i)} \Big[ 1 + \sum_{j=1}^{n} p_{ij}(u)\, J^*(j) \Big], \quad i = 1,\dots,n$$

§ In the special case where there is only one control at each state, J*(i) is the mean first passage time from i to t. These times, denoted m_i, are the unique solution of the classical equations

$$m_i = 1 + \sum_{j=1}^{n} p_{ij}\, m_j, \quad i = 1,\dots,n,$$

which are seen to be a form of Bellman's equation.
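In the single-control case the classical equations above are a plain linear system; a minimal sketch with an assumed toy chain (the last state plays the role of the termination state t):

```python
import numpy as np

# Mean first passage times to the termination state t via
#     m_i = 1 + sum_j p_ij m_j,   j ranging over the non-termination states.
P = np.array([
    [0.5, 0.3, 0.1, 0.1],
    [0.2, 0.5, 0.2, 0.1],
    [0.1, 0.1, 0.5, 0.3],
    [0.0, 0.0, 0.0, 1.0],    # termination state is absorbing
])

Q = P[:3, :3]                                     # transitions among non-termination states
m = np.linalg.solve(np.eye(3) - Q, np.ones(3))    # (I - Q) m = 1
print(m)                                          # mean first passage times to t
```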

Page 14: Undiscounted infinite-horizon DP

Proper policies

§ Definition: A stationary policy μ is called proper if, under μ, from every state i there is a positive-probability path that leads to t.
§ Important fact: If μ is proper, T_μ is a contraction with respect to some weighted sup-norm:

$$\max_i \frac{1}{v_i} \big| (T_\mu J)(i) - (T_\mu J')(i) \big| \;\le\; \rho_\mu \max_i \frac{1}{v_i} \big| J(i) - J'(i) \big|$$

§ T is similarly a contraction if all μ are proper (the case we just analyzed).

Page 15: Undiscounted infinite-horizon DP

SSP Theory: the sequel

§ The theory can be pushed one step further. Instead of all policies being proper, assume that:
1) There exists at least one proper policy
2) For each improper μ, J_μ(i) = −∞ for some i
§ Example: Deterministic shortest path problem with a single destination t.
• States ⟺ nodes; controls ⟺ arcs
• Termination state ⟺ the destination
• Assumption (1) ⟺ every node is connected to the destination
• Assumption (2) ⟺ all cycle costs > 0
§ Note that T is not necessarily a contraction (since not all policies may be proper).
§ The theory in summary is as follows:
• J* is the unique solution of Bellman's equation
• μ* is optimal if and only if T_{μ*} J* = T J*
• VI converges: T^k J → J* for all J ∈ ℝ^n
• PI terminates with an optimal policy, if started with a proper policy

Page 16: Undiscounted infinite-horizon DP

SSP Algorithms

§ All the basic algorithms have counterparts under our assumptions; see DPOC Vol. II, Ch. 3.
§ “Easy” case: all policies proper, in which case the mappings T and T_μ are contractions.
§ Even with improper (infinite-cost) policies, all basic algorithms have satisfactory counterparts:
• VI and PI
• Optimistic PI
• Asynchronous VI
• Asynchronous PI
• Q-learning analogs
§ ** THE BOUNDARY OF NICE THEORY **
§ Serious complications arise under any one of the following:
• There is no proper policy
• There is an improper policy with finite cost for all i
• The state space is infinite and/or the control space is infinite [infinite but compact U(i) can be dealt with]

Page 17: Undiscounted infinite-horizon DP

Pathologies I: Deterministic Shortest Paths

§ Two policies, one proper (apply μ), one improper (apply μ′).
§ Bellman's equation is

$$J(1) = \min\big\{ J(1),\, b \big\}$$

The set of solutions is (−∞, b].
§ Case b > 0, J* = 0: VI does not converge to J* except if started from J*. PI may get stuck starting from the inferior proper policy.
§ Case b < 0, J* = b: VI converges to J* if started above J*, but not if started below J*. PI can oscillate (if started with μ′ it generates μ, and if started with μ it can generate μ′).
§ Discuss: Why doesn't this issue arise in the discounted setting?

[Figure: node 1 and the destination; policy μ takes the arc from node 1 to the destination with cost b, while μ′ takes the self-loop at node 1 with cost 0.]

(Warning: min)

Page 18: Undiscounted infinite-horizon DP

SSP Analysis I

§ For a proper policy μ, J_μ is the unique fixed point of T_μ, and T_μ^k J → J_μ for all J (holds by the theory of DPOC Vol. I, Section 5.2).
§ Key fact: a μ satisfying J ≤ T_μ J for some J ∈ ℝ^n must be proper – true because

$$J \le T_\mu^k J = P_\mu^k J + \sum_{m=0}^{k-1} P_\mu^m g_\mu$$

and since $J_\mu = \sum_{m=0}^{\infty} P_\mu^m g_\mu$, some component of the term on the right goes to −∞ as k → ∞ if μ is improper (by our assumptions).
§ Consequence: T can have at most one fixed point within ℝ^n.
§ Proof: If J and J′ are two fixed points, select μ and μ′ such that J = TJ = T_μ J and J′ = TJ′ = T_{μ′} J′. By the preceding assertion, μ and μ′ must be proper, and J = J_μ and J′ = J_{μ′}. Also

$$J = T^k J \ge T_{\mu'}^k J \to J_{\mu'} = J'$$

Similarly, J′ ≥ J, so J = J′.

Page 19: Undiscounted infinite-horizon DP

SSP Analysis II

§ We first show that T has a fixed point, and that PI converges to it.
§ Use PI: generate a sequence of proper policies μ^k starting from a proper policy μ^0.
§ μ^{k+1} is proper and J_{μ^k} ≤ J_{μ^{k+1}} since

$$J_{\mu^k} = T_{\mu^k} J_{\mu^k} \le T J_{\mu^k} = T_{\mu^{k+1}} J_{\mu^k} \le \lim_{N \to \infty} T_{\mu^{k+1}}^N J_{\mu^k} = J_{\mu^{k+1}}$$

§ Thus J_{μ^k} is non-decreasing, some policy μ̄ is repeated, and J_{μ̄} = T J_{μ̄}. So J_{μ̄} is a fixed point of T.
§ Next show that T^k J → J_{μ̄} for all J, i.e., VI converges to the same limit as PI. (Sketch: true if J = J_{μ̄}. Argue using the properness of μ̄ to show that the terminal cost difference J − J_{μ̄} does not matter.)
§ To show J_{μ̄} = J*: for any π = {μ^0, μ^1, …},

$$T_{\mu^0} \cdots T_{\mu^{k-1}} J_0 \le T^k J_0,$$

where J_0 ≡ 0. Take the limsup as k → ∞ to obtain J_π ≤ J_{μ̄}, so μ̄ is optimal and J_{μ̄} = J*.
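A minimal sketch of the policy iteration just described, on an assumed toy SSP in which every action has positive termination probability (so every policy is proper and each evaluation is a plain linear solve); the model and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
n, m_act = 4, 3
P = rng.dirichlet(np.ones(n + 1), size=(m_act, n))   # P[u, i, j]; j = n is termination
P[:, :, n] += 0.05                                    # every action can terminate
P /= P.sum(axis=2, keepdims=True)
g = rng.normal(size=(n, m_act))                       # expected stage reward g(i, u)

def evaluate(mu):
    # Policy evaluation: J_mu = g_mu + P_mu J_mu  =>  (I - P_mu) J_mu = g_mu,
    # where P_mu keeps only transitions among the non-termination states.
    P_mu = P[mu, np.arange(n), :n]
    g_mu = g[np.arange(n), mu]
    return np.linalg.solve(np.eye(n) - P_mu, g_mu)

mu = np.zeros(n, dtype=int)                           # initial (proper) policy
for _ in range(100):
    J = evaluate(mu)
    Q = g + np.einsum("uij,j->iu", P[:, :, :n], J)    # termination contributes 0
    mu_new = Q.argmax(axis=1)                         # policy improvement
    if np.array_equal(mu_new, mu):
        break
    mu = mu_new
print(mu, evaluate(mu))
```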

Page 20: Undiscounted infinite-horizon DP

Outline

1. Undiscounted problems

2. Stochastic shortest path

3. Average reward dynamic programming
   a. Connections with finite horizon DP
   b. Connections with stochastic shortest path
   c. Bellman's equation
   d. Algorithms: value & policy iteration
   e. Connections with discounted MDPs
   f. Blackwell optimal policies

Page 21: Undiscounted infinite-horizon DP

Average Reward Problems

§ In average reward problems, we aim at finding a policy π which maximizes:

$$J_\pi(x) = \limsup_{T \to \infty} \frac{1}{T}\, \mathbb{E}\Big[ \sum_{t=0}^{T-1} r_\pi(x_t) \,\Big|\, x_0 = x \Big] \qquad (1)$$

§ In the average-reward problem, J_π(x) does not offer enough information for an optimal policy to be found.
§ In most cases of interest, we will have J_π(x) = λ_π for some scalar λ_π, for all x, so that it does not allow us to distinguish the value of being in each state.
§ Footnotes:
• For any fixed T, the reward accrued up to time T does not matter (only the state that we are at at time T matters).
• Setting: stationary dynamics, finite states and actions.

Page 22: Undiscounted infinite-horizon DP

Intuition: constant value

Definition (Communicate). We say that two states x, y communicate under policy u if there are t, t′ ∈ {1, 2, …} such that P_u^t(x, y) > 0 and P_u^{t′}(y, x) > 0.

§ If all states communicate, the optimal reward is independent of the initial state [if we can go from x to y in finite expected time, we must have λ*(x) ≥ λ*(y)]. So λ*(x) ≡ λ*, ∀x.

§ Because communication issues are so important, the methodology relies heavily on Markov chain theory.

§ The theory depends a lot on whether the chains corresponding to policies have a single or multiple “recurrent classes.” We will focus on the simplest version, using SSP theory.

Page 23: Undiscounted infinite-horizon DP

More definitions…

Definition (Unichain Policy). We say that a policy u is unichain if all of its recurrent states communicate.

Definition (Transient State). We say that a state x is transient under policy u if it is only visited finitely many times, regardless of the initial condition of the system.

§ In the figure ⟹, states 1, 2, and 3 all communicate with each other, but state 4 doesn't communicate with any state.

§ States 1, 2, and 3 are recurrent, while state 4 is transient.

§ This MDP is thus unichain.

[Figure: four-state Markov chain with states 1, 2, 3, 4; states 1–3 form a recurrent class and state 4 is transient.]

Page 24: Undiscounted infinite-horizon DP

Assumption

Assumption. One of the states, x*, is such that for some integer m > 0, and for all initial states and all policies, x* is visited with positive probability at least once within the first m steps.

§ Equivalently: The special state x* is recurrent in the Markov chain corresponding to each stationary policy.
§ Equivalently (previous SSP assumption, termination inevitable): There exists an integer m such that for every policy and initial state, there is positive probability that the termination state t will be reached after no more than m steps; for all π, we have

$$\rho_\pi = \max_{i=1,\dots,n} \mathbb{P}\big(x_m \neq t \mid x_0 = i, \pi\big) < 1$$

Definition (Recurrent State). We say that a state x is recurrent under policy π if, conditioned on it being visited at least once, it is visited infinitely many times.

Page 25: Undiscounted infinite-horizon DP

More intuition: constant value

§ Consider a set of states X = {x_1, x_2, …, x*, …, x_n}.
§ The states are visited in a sequence with some initial state x, say
x, …, x*, …, x*, …, x*, …
§ Let t_k(x), k = 1, 2, … be the stages corresponding to the kth visit to state x*, starting at state x. Let

$$\lambda_k^u(x) = \frac{\mathbb{E}\Big[ \sum_{t=t_k(x)}^{t_{k+1}(x)-1} r_u(x_t) \Big]}{t_{k+1}(x) - t_k(x)}$$

§ Intuitively, we have the same transition probabilities whenever we start a new trajectory at state x*. Thus λ_k^u(x) is independent of the initial state x, and λ_k^u(x) = λ^u.
§ Then expect λ*(x) ≡ some λ*.

[Figure: trajectory timeline marked with h(x), λ_1^u, λ_2^u.]

Page 26: Undiscounted infinite-horizon DP

Connection to finite-horizon problems

§ Going back to the definition of the function

$$J^*(x, T) = \max_{\pi}\, \mathbb{E}\Big[ \sum_{t=0}^{T-1} r_{u_t}(x_t) \,\Big|\, x_0 = x \Big] \qquad (2)$$

§ We conjecture that the function can be approximated as follows:

$$J^*(x, T) \approx \lambda^*(x)\, T + h^*(x) + o(T), \quad \text{as } T \to \infty$$

§ Note that, since λ*(x) is independent of the initial state, we can rewrite the approximation as:

$$J^*(x, T) \approx \lambda^* T + h^*(x) + o(T), \quad \text{as } T \to \infty \qquad (3)$$

§ The term h*(x) can be interpreted as a residual reward that depends on the initial state x and will be referred to as the differential cost (reward) function.
§ It can be shown that

$$h^*(x) = \lim_{T \to \infty} \mathbb{E}\Big[ \sum_{t=0}^{T-1} \big( r_{\mu^*}(x_t) - \lambda^* \big) \,\Big|\, x_0 = x \Big]$$

Page 27: Undiscounted infinite-horizon DP

Bellman's equation

§ We can now speculate about a version of Bellman's equation for computing λ* and h*.
§ Approximating J*(x, T) as in (3), we have

$$J^*(x, T+1) = \max_u \Big[ r_u(x) + \sum_y P_u(x, y)\, J^*(y, T) \Big]$$

$$\lambda^*(T+1) + h^*(x) + o(T) = \max_u \Big[ r_u(x) + \sum_y P_u(x, y) \big( \lambda^* T + h^*(y) + o(T) \big) \Big]$$

§ Therefore, we have:

$$\lambda^* + h^*(x) = \max_u \Big[ r_u(x) + \sum_y P_u(x, y)\, h^*(y) \Big] \qquad (4)$$

Page 28: Undiscounted infinite-horizon DP

Connection with SSP

[Figure: state trajectory divided into cycles by successive visits to x*.]

§ Divide the sequence of generated states into cycles marked by successive visits to x*.
§ Let's focus on a single cycle: it can be viewed as a state trajectory of an SSP problem with x* as the termination state.
• Let the cost (reward) at x of the SSP be h(x) = r(x) − λ*.
• We will argue (informally) that:

Average reward problem ≡ A minimum cost (maximum reward) cycle problem ≡ SSP Problem.

(Warning: min)

Page 29: Undiscounted infinite-horizon DP

Connection with SSP (continued)

§ Consider a minimum cycle cost problem (take the special state x* to be state n): find a stationary policy μ that minimizes the expected cost per transition within a cycle,

$$\lambda_\mu = \frac{C_{nn}(\mu)}{T_{nn}(\mu)},$$

where, for a fixed μ:
C_nn(μ): E[ cost from n up to the first return to n ]
T_nn(μ): E[ time from n up to the first return to n ]
§ Intuitively, C_nn(μ)/T_nn(μ) is the average cost of μ, and the optimal cycle cost is λ*, so

$$C_{nn}(\mu) - T_{nn}(\mu)\, \lambda^* \ge 0$$

§ Consider the SSP with stage costs g(i, u) − λ*. The cost of μ starting from n is C_nn(μ) − T_nn(μ) λ*, so the optimal (minimum cycle cost) μ is also optimal for the SSP.
§ Also: the optimal SSP cost starting from n is 0.

(Warning: min)
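The cycle argument can be checked numerically for a single fixed policy. The sketch below (with an assumed toy chain and cost vector) computes the expected cycle cost and cycle length starting from the special state and verifies that their ratio equals the long-run average cost obtained from the stationary distribution.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 5
P = rng.dirichlet(np.ones(n), size=n)     # fixed policy: transition matrix P[i, j]
g = rng.normal(size=n)                    # one-stage cost g(i)
s = n - 1                                 # the special state

others = [i for i in range(n) if i != s]
Q = P[np.ix_(others, others)]             # transitions among the other states
I = np.eye(n - 1)

# Expected time / accumulated cost from each non-special state until hitting s.
t_hit = np.linalg.solve(I - Q, np.ones(n - 1))
c_hit = np.linalg.solve(I - Q, g[others])

T_cycle = 1.0 + P[s, others] @ t_hit      # leave s, then return to s
C_cycle = g[s] + P[s, others] @ c_hit

# Long-run average cost via the stationary distribution (pi P = pi).
evals, evecs = np.linalg.eig(P.T)
pi = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
pi /= pi.sum()

print(C_cycle / T_cycle, pi @ g)          # the two numbers should agree
```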

Page 30: Undiscounted infinite-horizon DP

Bellman's Equation

§ Let h*(i) be the optimal cost of this SSP problem when starting at the non-termination states i = 1, …, n. Then h*(1), …, h*(n) uniquely solve the corresponding Bellman's equation:

$$h^*(i) = \min_{u \in U(i)} \Big[ g(i, u) - \lambda^* + \sum_{j=1}^{n-1} p_{ij}(u)\, h^*(j) \Big], \quad \forall i$$

§ If μ* is an optimal stationary policy for the SSP problem, we have:

$$h^*(n) = C_{nn}(\mu^*) - T_{nn}(\mu^*)\, \lambda^* = 0$$

§ Combining these equations, we have:

$$\lambda^* + h^*(i) = \min_{u \in U(i)} \Big[ g(i, u) + \sum_{j=1}^{n-1} p_{ij}(u)\, h^*(j) \Big], \quad \forall i, \qquad h^*(n) = 0$$

§ If μ*(i) attains the min for each i, μ* is optimal.
§ There is also a Bellman equation for a single policy μ.
§ Finally, flip all the signs for rewards (vs. costs).
§ Discuss: Any issues with solving the above Bellman equation?

(Warning: min)

Page 31: Undiscounted infinite-horizon DP

Bellman operators

§ Define the Bellman operators as follows:

$$T_u h = g_u + P_u h, \qquad T h = \min_u T_u h$$

Lemma 1 (Monotonicity). Let h, h̄ be arbitrary with h ≤ h̄. Then T h ≤ T h̄ and T_u h ≤ T_u h̄.

Lemma 2 (Offset). For all h and c ∈ ℝ, we have $T(h + c\mathbf{1}) = Th + c\mathbf{1}$, where $\mathbf{1}$ is the all-ones vector.

§ The contraction principle does not hold for T h = min_u T_u h.
§ Bellman's equation:

$$\lambda \mathbf{1} + h = T h$$

Page 32: Undiscounted infinite-horizon DP

Bellman's Equation

Theorem. Suppose that λ* and h* satisfy Bellman's equation. Let μ* be greedy with respect to h*, i.e., Th* ≡ T_{μ*} h*. Then

λ_{μ*}(x) = λ*, ∀x
λ_{μ*}(x) ≥ λ_μ(x), ∀μ, ∀x

§ Bellman's equation:

$$\lambda \mathbf{1} + h = T h \qquad (5)$$

Page 33: Undiscounted infinite-horizon DP

Proof: Bellman's Equation

§ Let μ = {μ_0, μ_1, …} and let k be arbitrary. Then

$$T_{\mu_{k-1}} h^* \le T h^* = \lambda^* \mathbf{1} + h^*$$

$$T_{\mu_{k-2}} T_{\mu_{k-1}} h^* \le T_{\mu_{k-2}} (h^* + \lambda^* \mathbf{1}) = T_{\mu_{k-2}} h^* + \lambda^* \mathbf{1} \le T h^* + \lambda^* \mathbf{1} = h^* + 2 \lambda^* \mathbf{1}$$

§ Then

$$T_{\mu_0} T_{\mu_1} \cdots T_{\mu_{k-1}} h^* \le k \lambda^* \mathbf{1} + h^*$$

§ Thus, we have

$$\mathbb{E}\Big[ \sum_{t=0}^{k-1} g_{\mu_t}(x_t) + h^*(x_k) \,\Big|\, x_0 = x \Big] \le k \lambda^* + h^*(x)$$

Page 34: Undiscounted infinite-horizon DP

Proof: Bellman's Equation (continued)

§ Dividing both sides by k and taking the limit as k → ∞, we have

$$\lambda_\mu \le \lambda^* \mathbf{1}$$

§ Take μ = {μ*, μ*, μ*, …}; then all of the preceding inequalities become equalities. Thus

$$\lambda^* \mathbf{1} = \lambda_{\mu^*}$$

Page 35: Undiscounted infinite-horizon DP

Remarks: Bellman's Equation

§ If (λ*, h*) is a solution to Bellman's equation, then (λ*, h* + c·1) is also a solution, for every scalar c.

§ However, unlike the case of discounted-reward and finite-horizon problems, the average-reward Bellman’s equation does not necessarily have a solution.

§ Discuss: Are there examples in which the average reward should not be the same for all initial states?

Page 36: Undiscounted infinite-horizon DP

Value Iteration

§ Natural VI method: generate optimal k-stage rewards by the DP algorithm, starting from any J_0:

$$J_{k+1}(x) = \max_{u \in U(x)} \Big[ r_u(x) + \sum_y P_u(x, y)\, J_k(y) \Big], \quad \forall x$$

§ Convergence:

$$\lim_{k \to \infty} \frac{J_k(x)}{k} = \lambda^*, \quad \forall x$$

§ Proof outline: Let J_k* be generated by the same recursion starting from the optimal differential cost (reward), i.e., the initial condition J_0* = h*. Then, by induction,

$$J_k^*(x) = k \lambda^* + h^*(x), \quad \forall x, \forall k$$

§ On the other hand,

$$\big| J_k(x) - J_k^*(x) \big| \le \max_{y=1,\dots,n} \big| J_0(y) - h^*(y) \big|, \quad \forall x,$$

since J_k(x) and J_k*(x) are optimal rewards for two k-stage problems that differ only in the terminal reward functions, which are J_0 and h*.

Page 37: Undiscounted infinite-horizon DP

Relative Value Iteration

§ The VI method just described has two drawbacks:
• Since typically some components of J_k diverge to ∞ or −∞, calculating lim_{k→∞} J_k(x)/k is numerically cumbersome.
• The method will not compute a corresponding differential reward vector h*.
§ We can bypass both difficulties by subtracting a constant from all components of the vector J_k, so that the difference, call it h_k, remains bounded.
§ Relative VI algorithm: pick any state s, and iterate according to

$$h_{k+1}(x) = \max_{u \in U(x)} \Big[ r_u(x) + \sum_y P_u(x, y)\, h_k(y) \Big] - \max_{u \in U(s)} \Big[ r_u(s) + \sum_y P_u(s, y)\, h_k(y) \Big], \quad \forall x$$

§ Convergence: We can show h_k → h* (under an extra assumption; see DPOC Vol. II).
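A minimal sketch of relative VI on an assumed toy MDP: the backup value at the reference state s converges to λ*, and the iterates h_k stay bounded and converge to a differential reward vector normalized so that h(s) = 0.

```python
import numpy as np

rng = np.random.default_rng(5)
n, m = 5, 3
P = rng.dirichlet(np.ones(n), size=(m, n))    # P[u, x, y]
r = rng.normal(size=(n, m))                   # r[x, u]
s = 0                                         # reference state

h = np.zeros(n)
for _ in range(10_000):
    backup = (r + np.einsum("uxy,y->xu", P, h)).max(axis=1)   # (T h)(x)
    lam = backup[s]                                           # estimate of lambda*
    h_new = backup - lam                                      # keeps h(s) = 0
    if np.max(np.abs(h_new - h)) < 1e-12:
        break
    h = h_new

print("lambda* ~", lam)
print("h (with h[s] = 0):", h)
```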

Page 38: Undiscounted infinite-horizon DP

Policy Iteration

§ At iteration k, we have a stationary policy μ^k.
§ Policy evaluation: Compute λ^k and h^k(i) of μ^k, using the n + 1 equations h^k(n) = 0 and

$$\lambda^k + h^k(i) = g\big(i, \mu^k(i)\big) + \sum_{j=1}^{n} p_{ij}\big(\mu^k(i)\big)\, h^k(j), \quad \forall i$$

§ Policy improvement: Find

$$\mu^{k+1}(i) = \arg\max_{u \in U(i)} \Big[ g(i, u) + \sum_{j=1}^{n} p_{ij}(u)\, h^k(j) \Big], \quad \forall i$$

§ If λ^{k+1} = λ^k and h^{k+1}(i) = h^k(i), ∀i, stop; otherwise, repeat with μ^{k+1} replacing μ^k.
§ Result: for each k, we either have λ^{k+1} > λ^k or we have policy improvement:

$$\lambda^{k+1} = \lambda^k, \qquad h^{k+1}(i) \ge h^k(i), \quad i = 1, \dots, n$$

§ The algorithm terminates with an optimal policy.
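A minimal sketch of this policy iteration for an assumed unichain toy MDP: evaluation solves the n + 1 linear equations with h(n) pinned to 0, and improvement takes the greedy argmax.

```python
import numpy as np

rng = np.random.default_rng(6)
n, m = 5, 3
P = rng.dirichlet(np.ones(n), size=(m, n))    # P[u, i, j]
g = rng.normal(size=(n, m))                   # g[i, u]

def evaluate(mu):
    # Solve  lam + h(i) = g(i, mu(i)) + sum_j p_ij(mu(i)) h(j),  with h(n-1) = 0.
    P_mu = P[mu, np.arange(n), :]             # (n, n) transition matrix of mu
    g_mu = g[np.arange(n), mu]
    A = np.zeros((n, n))                      # unknowns: (lam, h(0), ..., h(n-2))
    A[:, 0] = 1.0
    A[:, 1:] = np.eye(n)[:, :n-1] - P_mu[:, :n-1]
    z = np.linalg.solve(A, g_mu)
    return z[0], np.append(z[1:], 0.0)        # lam, h

mu = np.zeros(n, dtype=int)
for _ in range(100):
    lam, h = evaluate(mu)
    mu_new = (g + np.einsum("uij,j->iu", P, h)).argmax(axis=1)   # improvement
    if np.array_equal(mu_new, mu):
        break
    mu = mu_new
print("optimal average reward:", lam, "policy:", mu)
```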

Page 39: Undiscounted infinite-horizon DP

Remarks

1. Unlike discounted reward problems, average reward problems are full of technicalities.

2. Depending on the structure of the transition probability matrix P_u for a given action u, an optimal stationary policy may not exist.

3. Such existence problems are highly technical, especially for infinite-state spaces.

4. There are general sufficient conditions for the existence of optimal stationary policies for average reward problems.

Page 40: Undiscounted infinite-horizon DP

Average reward results

Theorem (Average reward). Any of the following conditions is sufficient for the optimal average reward to be the same regardless of the initial state:

1. (Unichain condition) Every stationary policy u yields a Markov chain with a single recurrent class (i.e., one communicating class) and a possibly empty set of transient states.

2. There exists a unichain optimal policy.

3. For every pair of states x and y, there is a policy u such that x and y communicate.

Page 41: Undiscounted infinite-horizon DP

Connection with Discounted Reward MDPs

§ “Vanishing Discount Factor Idea”
§ Let J_α* be the value of a discounted reward MDP with discount factor α. Then λ*, h of the average-reward problem are obtained by:

$$\lambda^* = \lim_{\alpha \to 1} (1 - \alpha)\, J_\alpha^*(x)$$

$$h(x) = \lim_{\alpha \to 1} \big[ J_\alpha^*(x) - J_\alpha^*(y) \big] \quad \text{for any } y$$

§ The following set of equations relates average reward MDPs and α-discounted MDPs:

$$\lambda^* = \lambda^*(x) = \lim_{T \to \infty} \max_{u} \frac{1}{T+1}\, \mathbb{E}_u\Big[ \sum_{t=0}^{T} r(x_t, u_t) \,\Big|\, x_0 = x \Big]$$

$$= \lim_{T \to \infty} \max_{u} \lim_{\alpha \to 1} \frac{\mathbb{E}_u\big[ \sum_{t=0}^{T} \alpha^t r(x_t, u_t) \mid x_0 = x \big]}{\sum_{t=0}^{T} \alpha^t} = \lim_{\alpha \to 1} \lim_{T \to \infty} \max_{u} \frac{\mathbb{E}_u\big[ \sum_{t=0}^{T} \alpha^t r(x_t, u_t) \mid x_0 = x \big]}{\sum_{t=0}^{T} \alpha^t}$$

$$= \lim_{\alpha \to 1} \max_{u} \frac{\lim_{T \to \infty} \mathbb{E}_u\big[ \sum_{t=0}^{T} \alpha^t r(x_t, u_t) \mid x_0 = x \big]}{\lim_{T \to \infty} \sum_{t=0}^{T} \alpha^t} = \lim_{\alpha \to 1} (1 - \alpha)\, J_\alpha^*(x)$$
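A quick numerical illustration of the vanishing-discount relation on an assumed toy MDP: (1 − α)·J_α*(x) is compared, for α approaching 1, against λ* estimated independently by relative value iteration.

```python
import numpy as np

rng = np.random.default_rng(8)
n, m = 4, 2
P = rng.dirichlet(np.ones(n), size=(m, n))   # P[u, x, y]
r = rng.normal(size=(n, m))                  # r[x, u]

def discounted_opt(alpha, iters=20_000):
    J = np.zeros(n)
    for _ in range(iters):
        J = (r + alpha * np.einsum("uxy,y->xu", P, J)).max(axis=1)
    return J

def average_opt(iters=20_000):
    h, lam = np.zeros(n), 0.0
    for _ in range(iters):
        backup = (r + np.einsum("uxy,y->xu", P, h)).max(axis=1)
        lam, h = backup[0], backup - backup[0]   # relative VI
    return lam

lam_star = average_opt()
for alpha in (0.9, 0.99, 0.999):
    print(alpha, (1 - alpha) * discounted_opt(alpha)[0], lam_star)
```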

Page 42: Undiscounted infinite-horizon DP

Blackwell Optimal Policies

§ Let μ_α* denote the optimal policy of a discounted reward MDP with discount factor α.
§ Let μ* be the optimal policy of an average reward MDP.
§ Under the unichain condition (and finite states), given a sequence α_k such that α_k → 1, it turns out that μ_{α_k}* → μ*.
§ Since there are only a finite number of policies, the convergence μ_{α_k}* → μ* implies that there exists an ᾱ < 1 such that for all α ∈ (ᾱ, 1), the optimal policy of the discounted reward MDP coincides with that of the average reward MDP (i.e., μ*).
§ Such policies μ_α* are called Blackwell Optimal Policies.