
Page 1: Module 6 Value Iteration
CS 886 Sequential Decision Making and Reinforcement Learning
University of Waterloo
(c) 2013 Pascal Poupart

Page 2: Markov Decision Process

• Definition
  – Set of states: $S$
  – Set of actions (i.e., decisions): $A$
  – Transition model: $\Pr(s_t \mid s_{t-1}, a_{t-1})$
  – Reward model (i.e., utility): $R(s_t, a_t)$
  – Discount factor: $0 \le \gamma \le 1$
  – Horizon (i.e., number of time steps): $h$
• Goal: find an optimal policy $\pi$
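
To make these ingredients concrete, here is a minimal numpy sketch of one way to store such an MDP. The array names `T`, `R`, `gamma`, `h` and the toy numbers are illustrative assumptions, not part of the slides; the later code sketches reuse them.

```python
import numpy as np

# Toy MDP used only for illustration (2 states, 2 actions; numbers are made up).
# T[a][s, s'] = Pr(s' | s, a): one |S| x |S| transition matrix per action a.
T = np.array([[[0.9, 0.1],
               [0.2, 0.8]],
              [[0.5, 0.5],
               [0.0, 1.0]]])
# R[s, a] = immediate reward for taking action a in state s.
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9   # discount factor, 0 <= gamma <= 1
h = 10        # horizon (number of time steps)
```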

Page 3: Finite Horizon

• Policy evaluation:
  $V_h^\pi(s) = \sum_{t=0}^{h} \gamma^t \sum_{s'} \Pr(S_t = s' \mid S_0 = s, \pi)\, R(s', \pi_t(s'))$
• Recursive form (dynamic programming):
  $V_0^\pi(s) = R(s, \pi_0(s))$
  $V_t^\pi(s) = R(s, \pi_t(s)) + \gamma \sum_{s'} \Pr(s' \mid s, \pi_t(s))\, V_{t-1}^\pi(s')$

Page 4: Finite Horizon

• Optimal policy $\pi^*$:
  $V_h^{\pi^*}(s) \ge V_h^\pi(s) \quad \forall \pi, s$
• Optimal value function $V^*$ (shorthand for $V^{\pi^*}$):
  $V_0^*(s) = \max_a R(s, a)$
  $V_t^*(s) = \max_a \left[R(s, a) + \gamma \sum_{s'} \Pr(s' \mid s, a)\, V_{t-1}^*(s')\right]$   (Bellman's equation)

Page 5: Value Iteration Algorithm

Optimal policy $\pi^*$:
  $t = 0$: $\pi_0^*(s) \leftarrow \operatorname{argmax}_a R(s, a) \quad \forall s$
  $t > 0$: $\pi_t^*(s) \leftarrow \operatorname{argmax}_a \left[R(s, a) + \gamma \sum_{s'} \Pr(s' \mid s, a)\, V_{t-1}^*(s')\right] \quad \forall s$
NB: $\pi^*$ is non-stationary (i.e., time-dependent)

valueIteration(MDP)
  $V_0^*(s) \leftarrow \max_a R(s, a) \quad \forall s$
  For $t = 1$ to $h$ do
    $V_t^*(s) \leftarrow \max_a \left[R(s, a) + \gamma \sum_{s'} \Pr(s' \mid s, a)\, V_{t-1}^*(s')\right] \quad \forall s$
  Return $V^*$
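
As a runnable counterpart of the algorithm box above, here is a numpy sketch (same assumed `T`, `R` layout as before; `value_iteration_finite` is an illustrative name) that also records the nonstationary greedy policy $\pi_t^*$.

```python
import numpy as np

def value_iteration_finite(T, R, gamma, h):
    """Finite-horizon value iteration (illustrative sketch).

    Returns V with V[t] = V_t^* and pi with pi[t, s] = pi_t^*(s).
    """
    A, S, _ = T.shape
    V = np.zeros((h + 1, S))
    pi = np.zeros((h + 1, S), dtype=int)
    V[0] = R.max(axis=1)               # V_0^*(s) = max_a R(s, a)
    pi[0] = R.argmax(axis=1)
    for t in range(1, h + 1):
        EV = T @ V[t - 1]              # EV[a, s] = sum_s' Pr(s'|s,a) V_{t-1}^*(s')
        Q = R + gamma * EV.T           # Q[s, a] = R(s,a) + gamma * EV[a, s]
        V[t] = Q.max(axis=1)
        pi[t] = Q.argmax(axis=1)       # greedy (nonstationary) policy
    return V, pi
```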

Page 6: Value Iteration

• Matrix form:
  $R_a$: $S \times 1$ column vector of rewards for action $a$
  $V_t^*$: $S \times 1$ column vector of state values
  $T_a$: $S \times S$ matrix of transition probabilities for action $a$

valueIteration(MDP)
  $V_0^* \leftarrow \max_a R_a$
  For $t = 1$ to $h$ do
    $V_t^* \leftarrow \max_a \left[R_a + \gamma T_a V_{t-1}^*\right]$
  Return $V^*$
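
In this matrix form each backup is essentially one line of numpy. A sketch continuing with the illustrative arrays from earlier (`R` is $|S| \times |A|$, `T` stacks the $T_a$ matrices), where the max over actions is taken componentwise, state by state:

```python
import numpy as np

# Matrix-form backups: V_t = max_a (R_a + gamma * T_a V_{t-1}).
V = R.max(axis=1)                               # V_0 = max_a R_a
for t in range(1, h + 1):
    V = (R.T + gamma * (T @ V)).max(axis=0)     # componentwise max over actions
```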

Page 7: Infinite Horizon

• Let $h \to \infty$
• Then $V_h^\pi \to V_\infty^\pi$ and $V_{h-1}^\pi \to V_\infty^\pi$
• Policy evaluation:
  $V_\infty^\pi(s) = R(s, \pi_\infty(s)) + \gamma \sum_{s'} \Pr(s' \mid s, \pi_\infty(s))\, V_\infty^\pi(s') \quad \forall s$
• Bellman's equation:
  $V_\infty^*(s) = \max_a \left[R(s, a) + \gamma \sum_{s'} \Pr(s' \mid s, a)\, V_\infty^*(s')\right]$

Page 8: Policy Evaluation

• Linear system of equations:
  $V_\infty^\pi(s) = R(s, \pi_\infty(s)) + \gamma \sum_{s'} \Pr(s' \mid s, \pi_\infty(s))\, V_\infty^\pi(s') \quad \forall s$
• Matrix form:
  $R$: $S \times 1$ column vector of state rewards for $\pi$
  $V$: $S \times 1$ column vector of state values for $\pi$
  $T$: $S \times S$ matrix of transition probabilities for $\pi$
  $V = R + \gamma T V$
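
To make the matrix form concrete, a minimal sketch that assembles $R$ and $T$ for a given stationary policy from the action-indexed arrays used earlier; `policy_matrices` and the policy layout `pi[s] = a` are illustrative assumptions.

```python
import numpy as np

def policy_matrices(T, R, pi):
    """Build R_pi (|S|,) and T_pi (|S|, |S|) for a stationary policy pi[s] = a."""
    S = R.shape[0]
    R_pi = R[np.arange(S), pi]       # R_pi[s] = R(s, pi(s))
    T_pi = T[pi, np.arange(S)]       # T_pi[s, s'] = Pr(s' | s, pi(s))
    return R_pi, T_pi
```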

Page 9: Solving Linear Equations

• Linear system: $V = R + \gamma T V$
• Gaussian elimination: solve $(I - \gamma T) V = R$
• Compute inverse: $V = (I - \gamma T)^{-1} R$
• Iterative methods: value iteration (a.k.a. Richardson iteration)
  – Repeat $V \leftarrow R + \gamma T V$
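
Both routes are short with numpy; a sketch under the same assumed array layout, with the direct solve next to the Richardson loop (the slide's "repeat $V \leftarrow R + \gamma T V$"):

```python
import numpy as np

def evaluate_policy_direct(R_pi, T_pi, gamma):
    """Gaussian elimination: solve (I - gamma T) V = R."""
    return np.linalg.solve(np.eye(len(R_pi)) - gamma * T_pi, R_pi)

def evaluate_policy_richardson(R_pi, T_pi, gamma, n_iter=1000):
    """Richardson iteration: repeat V <- R + gamma T V a fixed number of times."""
    V = np.zeros_like(R_pi)
    for _ in range(n_iter):
        V = R_pi + gamma * T_pi @ V
    return V
```

The direct solve costs $O(|S|^3)$; each Richardson sweep costs $O(|S|^2)$, and the number of sweeps needed for a given accuracy grows as $\gamma \to 1$.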

Page 10: Contraction

• Let $H(V) := R + \gamma T V$ be the policy evaluation operator
• Lemma 1: $H$ is a contraction mapping:
  $\|H(V) - H(\tilde V)\|_\infty \le \gamma \|V - \tilde V\|_\infty$
• Proof:
  $\|H(V) - H(\tilde V)\|_\infty = \|R + \gamma T V - R - \gamma T \tilde V\|_\infty$  (by definition)
  $= \gamma \|T (V - \tilde V)\|_\infty$  (simplification)
  $\le \gamma \|T\|_\infty \|V - \tilde V\|_\infty$  (since $\|AB\| \le \|A\|\,\|B\|$)
  $= \gamma \|V - \tilde V\|_\infty$  (since $\|T\|_\infty = \max_s \sum_{s'} T(s, s') = 1$)

Page 11: Convergence

• Theorem 2: Policy evaluation converges to $V^\pi$ for any initial estimate $V$:
  $\lim_{n \to \infty} H^{(n)}(V) = V^\pi \quad \forall V$
• Proof:
  – By definition $V^\pi = H^{(\infty)}(0)$, but policy evaluation computes $H^{(\infty)}(V)$ for any initial $V$.
  – By Lemma 1, $\|H^{(n)}(V) - H^{(n)}(\tilde V)\|_\infty \le \gamma^n \|V - \tilde V\|_\infty$.
  – Hence, as $n \to \infty$, $\|H^{(n)}(V) - H^{(n)}(0)\|_\infty \to 0$, and so $H^{(\infty)}(V) = V^\pi \;\; \forall V$.

Page 12: Approximate Policy Evaluation

• In practice, we can't perform an infinite number of iterations.
• Suppose that we perform value iteration for $k$ steps and $\|H^{(k)}(V) - H^{(k-1)}(V)\|_\infty = \epsilon$. How far is $H^{(k)}(V)$ from $V^\pi$?

Page 13: Approximate Policy Evaluation

• Theorem 3: If $\|H^{(k)}(V) - H^{(k-1)}(V)\|_\infty \le \epsilon$, then
  $\|V^\pi - H^{(k)}(V)\|_\infty \le \dfrac{\epsilon}{1 - \gamma}$
• Proof:
  $\|V^\pi - H^{(k)}(V)\|_\infty = \|H^{(\infty)}(V) - H^{(k)}(V)\|_\infty$  (by Theorem 2)
  $= \left\|\sum_{t=1}^{\infty} \left(H^{(t+k)}(V) - H^{(t+k-1)}(V)\right)\right\|_\infty$  (telescoping sum)
  $\le \sum_{t=1}^{\infty} \|H^{(t+k)}(V) - H^{(t+k-1)}(V)\|_\infty$  (since $\|A + B\| \le \|A\| + \|B\|$)
  $\le \sum_{t=1}^{\infty} \gamma^t \epsilon = \dfrac{\gamma\epsilon}{1-\gamma} \le \dfrac{\epsilon}{1-\gamma}$  (by Lemma 1)

Page 14: Optimal Value Function

• Non-linear system of equations:
  $V_\infty^*(s) = \max_a \left[R(s, a) + \gamma \sum_{s'} \Pr(s' \mid s, a)\, V_\infty^*(s')\right] \quad \forall s$
• Matrix form:
  $R_a$: $S \times 1$ column vector of rewards for action $a$
  $V^*$: $S \times 1$ column vector of optimal values
  $T_a$: $S \times S$ matrix of transition probabilities for action $a$
  $V^* = \max_a \left[R_a + \gamma T_a V^*\right]$

Page 15: Contraction

• Let $H^*(V) := \max_a \left[R_a + \gamma T_a V\right]$ be the operator used in value iteration
• Lemma 3: $H^*$ is a contraction mapping:
  $\|H^*(V) - H^*(\tilde V)\|_\infty \le \gamma \|V - \tilde V\|_\infty$
• Proof: without loss of generality, let $H^*(V)(s) \ge H^*(\tilde V)(s)$ and let
  $a_s^* = \operatorname{argmax}_a \left[R(s, a) + \gamma \sum_{s'} \Pr(s' \mid s, a)\, V(s')\right]$

Page 16: Contraction

• Proof continued:
  Then $0 \le H^*(V)(s) - H^*(\tilde V)(s)$  (by assumption)
  $\le R(s, a_s^*) + \gamma \sum_{s'} \Pr(s' \mid s, a_s^*)\, V(s') - R(s, a_s^*) - \gamma \sum_{s'} \Pr(s' \mid s, a_s^*)\, \tilde V(s')$  (by definition)
  $= \gamma \sum_{s'} \Pr(s' \mid s, a_s^*) \left(V(s') - \tilde V(s')\right)$
  $\le \gamma \sum_{s'} \Pr(s' \mid s, a_s^*)\, \|V - \tilde V\|_\infty$  (max-norm upper bound)
  $= \gamma \|V - \tilde V\|_\infty$  (since $\sum_{s'} \Pr(s' \mid s, a_s^*) = 1$)
• Repeat the same argument for the case $H^*(\tilde V)(s) \ge H^*(V)(s)$, and for each $s$.

Page 17: Convergence

• Theorem 4: Value iteration converges to $V^*$ for any initial estimate $V$:
  $\lim_{n \to \infty} H^{*(n)}(V) = V^* \quad \forall V$
• Proof:
  – By definition $V^* = H^{*(\infty)}(0)$, but value iteration computes $H^{*(\infty)}(V)$ for some initial $V$.
  – By Lemma 3, $\|H^{*(n)}(V) - H^{*(n)}(\tilde V)\|_\infty \le \gamma^n \|V - \tilde V\|_\infty$.
  – Hence, as $n \to \infty$, $\|H^{*(n)}(V) - H^{*(n)}(0)\|_\infty \to 0$, and so $H^{*(\infty)}(V) = V^* \;\; \forall V$.

Page 18: Value Iteration

• Even when the horizon is infinite, we perform only finitely many iterations.
• Stop when $\|V_n - V_{n-1}\|_\infty \le \epsilon$

valueIteration(MDP)
  $V_0 \leftarrow \max_a R_a$; $n \leftarrow 0$
  Repeat
    $n \leftarrow n + 1$
    $V_n \leftarrow \max_a \left[R_a + \gamma T_a V_{n-1}\right]$
  Until $\|V_n - V_{n-1}\|_\infty \le \epsilon$
  Return $V_n$
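
A runnable counterpart of this loop (illustrative, using the same array conventions as the earlier sketches; `eps` plays the role of the threshold $\epsilon$):

```python
import numpy as np

def value_iteration(T, R, gamma, eps):
    """Infinite-horizon value iteration: iterate the Bellman backup until
    ||V_n - V_{n-1}||_inf <= eps, then return V_n."""
    V = R.max(axis=1)                                  # V_0 = max_a R_a
    while True:
        V_new = (R.T + gamma * (T @ V)).max(axis=0)    # V_n = max_a (R_a + gamma T_a V_{n-1})
        if np.max(np.abs(V_new - V)) <= eps:
            return V_new
        V = V_new
```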

Page 19: Induced Policy

• Since $\|V_n - V_{n-1}\|_\infty \le \epsilon$, Theorem 4 together with the argument of Theorem 3 (applied to $H^*$) gives
  $\|V_n - V^*\|_\infty \le \dfrac{\epsilon}{1-\gamma}$
• But how good is the stationary policy $\pi_n(s)$ extracted from $V_n$?
  $\pi_n(s) = \operatorname{argmax}_a \left[R(s, a) + \gamma \sum_{s'} \Pr(s' \mid s, a)\, V_n(s')\right]$
• How far is $V^{\pi_n}$ from $V^*$?
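
Extracting the induced stationary policy from $V_n$ is one argmax per state; a sketch under the same assumptions as the earlier snippets, followed by how it would be used with the `value_iteration` sketch above:

```python
def greedy_policy(T, R, gamma, V):
    """pi_n(s) = argmax_a [ R(s, a) + gamma * sum_s' Pr(s'|s, a) V(s') ]."""
    Q = R + gamma * (T @ V).T      # Q[s, a]
    return Q.argmax(axis=1)

V_n = value_iteration(T, R, gamma, eps=1e-6)
pi_n = greedy_policy(T, R, gamma, V_n)   # induced stationary policy
```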

Page 20: Induced Policy

• Theorem 5: $\|V^{\pi_n} - V^*\|_\infty \le \dfrac{2\epsilon}{1-\gamma}$
• Proof:
  $\|V^{\pi_n} - V^*\|_\infty = \|V^{\pi_n} - V_n + V_n - V^*\|_\infty$
  $\le \|V^{\pi_n} - V_n\|_\infty + \|V_n - V^*\|_\infty$  (since $\|A + B\| \le \|A\| + \|B\|$)
  $= \|H_{\pi_n}^{(\infty)}(V_n) - V_n\|_\infty + \|V_n - H^{*(\infty)}(V_n)\|_\infty$  (by Theorems 2 and 4)
  $\le \dfrac{\epsilon}{1-\gamma} + \dfrac{\epsilon}{1-\gamma}$  (by the argument of Theorem 3)
  $= \dfrac{2\epsilon}{1-\gamma}$

Page 21: Summary

• Value iteration
  – Simple dynamic programming algorithm
  – Complexity: $O(n |A| |S|^2)$, where $n$ is the number of iterations
• Can we optimize the policy directly, instead of optimizing the value function and then inducing a policy?
  – Yes: by policy iteration