RL Lecture 5: Monte Carlo Methods
TRANSCRIPT
R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction 1
RL Lecture 5: Monte Carlo Methods
❐ Monte Carlo methods learn from complete sample returns
  ■ Only defined for episodic tasks
❐ Monte Carlo methods learn directly from experience
  ■ On-line: No model necessary and still attains optimality
  ■ Simulated: No need for a full model
Monte Carlo Policy Evaluation
❐ Goal: learn Vπ(s)
❐ Given: some number of episodes under π which contain s
❐ Idea: Average returns observed after visits to s
❐ Every-Visit MC: average returns for every time s is visited in an episode
❐ First-visit MC: average returns only for first time s is visited in an episode
❐ Both converge asymptotically
First-visit Monte Carlo policy evaluation
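The pseudocode on this slide is an image in the source; a minimal Python sketch of the idea (function names and the episode format are my own) could look like:

```python
from collections import defaultdict

def first_visit_mc(generate_episode, num_episodes, gamma=1.0):
    """First-visit MC: average the return following the FIRST visit to each state.

    generate_episode() returns a list of (state, reward) pairs, where reward is
    the reward received on leaving that state."""
    total = defaultdict(float)
    count = defaultdict(int)
    for _ in range(num_episodes):
        episode = generate_episode()
        # return-to-go for every time step, computed backwards
        G = 0.0
        returns = [0.0] * len(episode)
        for t in range(len(episode) - 1, -1, -1):
            G = gamma * G + episode[t][1]
            returns[t] = G
        # only the first visit to each state contributes to the average
        first = {}
        for t, (s, _) in enumerate(episode):
            first.setdefault(s, t)
        for s, t in first.items():
            total[s] += returns[t]
            count[s] += 1
    return {s: total[s] / count[s] for s in total}
```

With a deterministic two-state episode A (reward 0) then B (reward 1), the estimate converges immediately to V(A) = V(B) = 1.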
Blackjack example
❐ Object: Have your card sum be greater than the dealer's without exceeding 21.
❐ States (200 of them):
  ■ current sum (12–21)
  ■ dealer's showing card (ace–10)
  ■ do I have a usable ace?
❐ Reward: +1 for winning, 0 for a draw, −1 for losing
❐ Actions: stick (stop receiving cards), hit (receive another card)
❐ Policy: Stick if my sum is 20 or 21, else hit
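A simplified simulator makes this concrete. The sketch below assumes an infinite deck and a dealer who hits below 17 (these simplifications are mine, not from the slide), and Monte Carlo averages the +1/0/−1 outcomes under the stick-on-20-or-21 policy:

```python
import random

random.seed(0)
CARDS = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 10, 10]  # ace=1; 10/J/Q/K count 10

def hand_value(cards):
    """Best total: one ace counts as 11 if that does not bust (a usable ace)."""
    total = sum(cards)
    return total + 10 if 1 in cards and total + 10 <= 21 else total

def play_episode():
    """One game under the slide's policy: stick on 20 or 21, else hit."""
    player = [random.choice(CARDS), random.choice(CARDS)]
    dealer = [random.choice(CARDS), random.choice(CARDS)]
    while hand_value(player) < 20:          # player policy
        player.append(random.choice(CARDS))
    if hand_value(player) > 21:
        return -1                           # player busts
    while hand_value(dealer) < 17:          # assumed fixed dealer rule
        dealer.append(random.choice(CARDS))
    p, d = hand_value(player), hand_value(dealer)
    if d > 21 or p > d:
        return 1
    return 0 if p == d else -1

# Monte Carlo estimate of the policy's expected return over random deals
avg = sum(play_episode() for _ in range(20000)) / 20000
```

Hitting until 20 busts often, so the average return comes out clearly negative; this is exactly the kind of quantity MC policy evaluation estimates by averaging sampled returns.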
Blackjack value functions
[Figure: state-value function for the stick-on-20-or-21 policy, with a usable ace and with no usable ace, after 10,000 and after 500,000 episodes; axes: dealer showing (A–10), player sum (12–21), value (−1 to +1)]
Backup diagram for Monte Carlo
❐ Entire episode included
❐ Only one choice at each state (unlike DP)
❐ MC does not bootstrap
❐ Time required to estimate one state does not depend on the total number of states

[Figure: MC backup diagram: a single trajectory of states and actions ending in the terminal state]
The Power of Monte Carlo
e.g., Elastic Membrane (Dirichlet Problem)
How do we compute the shape of the membrane or bubble?
Two Approaches
❐ Relaxation: iteratively replace each interior point's height with the average of its neighbors
❐ Kakutani's algorithm, 1945: from the point of interest, run random walks to the boundary and average the heights at the exit points
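Kakutani's random-walk approach can be sketched in a few lines. The grid size and boundary condition below (height 1 along the top edge, 0 elsewhere) are hypothetical choices of mine for illustration:

```python
import random

random.seed(1)

def walk_to_boundary(x, y, n, boundary):
    """Random walk on an n x n grid from (x, y) until a boundary cell is hit;
    return the fixed membrane height boundary(x, y) at the exit point."""
    while 0 < x < n - 1 and 0 < y < n - 1:
        dx, dy = random.choice([(1, 0), (-1, 0), (0, 1), (0, -1)])
        x, y = x + dx, y + dy
    return boundary(x, y)

def kakutani_estimate(x, y, n, boundary, walks=20000):
    """Average of the boundary heights seen by many walks: the height at (x, y)."""
    return sum(walk_to_boundary(x, y, n, boundary) for _ in range(walks)) / walks

# hypothetical boundary condition: height 1 along the top edge, 0 elsewhere
def top_edge(x, y):
    return 1.0 if y == 9 else 0.0

h = kakutani_estimate(5, 5, 10, top_edge)  # interior point of a 10x10 grid
```

Like MC policy evaluation, the cost of estimating one point does not depend on how many other grid points exist, whereas relaxation must sweep the whole grid.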
Monte Carlo Estimation of Action Values (Q)
❐ Monte Carlo is most useful when a model is not available
  ■ We want to learn Q*
❐ Qπ(s,a): average return starting from state s and action a, then following π
❐ Also converges asymptotically if every state-action pair is visited
❐ Exploring starts: Every state-action pair has a non-zero probability of being the starting pair
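Averaging returns per state-action pair is a small variation on the state-value case; a sketch (every-visit variant, with an episode format of my own choosing) might be:

```python
from collections import defaultdict

def mc_action_values(episodes, gamma=1.0):
    """Every-visit Monte Carlo estimate of Q(s, a).

    Each episode is a list of (state, action, reward) triples, with reward
    received after taking that action."""
    total = defaultdict(float)
    count = defaultdict(int)
    for episode in episodes:
        G = 0.0
        # walk backwards so each (s, a) is paired with its return-to-go
        for s, a, r in reversed(episode):
            G = gamma * G + r
            total[(s, a)] += G
            count[(s, a)] += 1
    return {sa: total[sa] / count[sa] for sa in total}
```

Convergence requires that every (s, a) pair keeps being visited, which is exactly what exploring starts (or, later, soft policies) guarantees.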
Monte Carlo Control
❐ MC policy iteration: Policy evaluation using MC methods followed by policy improvement
❐ Policy improvement step: greedify with respect to value (or action-value) function
[Figure: GPI diagram: π and Q linked by evaluation (Q → Qπ) and improvement (π → greedy(Q))]
Convergence of MC Control
❐ Policy improvement theorem tells us:

    Qπk(s, πk+1(s)) = Qπk(s, argmax_a Qπk(s,a))
                    = max_a Qπk(s,a)
                    ≥ Qπk(s, πk(s))
                    = Vπk(s)

❐ This assumes exploring starts and an infinite number of episodes for MC policy evaluation
❐ To solve the latter:
  ■ update only to a given level of performance
  ■ alternate between evaluation and improvement per episode
Monte Carlo Exploring Starts
Fixed point is optimal policy π*
Proof is an open question
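The slide's pseudocode is an image in the source. As an illustration only (the tiny two-state MDP and all names here are hypothetical, not Sutton & Barto's pseudocode), a Monte Carlo ES sketch:

```python
import random
from collections import defaultdict

def step(s, a):
    """Hypothetical two-state MDP: returns (reward, next_state or None)."""
    if s == 0:
        return (0.5, None) if a == 0 else (0.0, 1)  # action 1 leads to state 1
    return (1.0, None) if a == 0 else (0.0, None)   # state 1 then terminates

def mc_exploring_starts(episodes=2000, seed=0):
    rng = random.Random(seed)
    Q = defaultdict(float)
    N = defaultdict(int)
    policy = {0: 0, 1: 0}
    for _ in range(episodes):
        s, a = rng.choice([0, 1]), rng.choice([0, 1])  # exploring start
        trajectory = []
        while True:
            r, s_next = step(s, a)
            trajectory.append((s, a, r))
            if s_next is None:
                break
            s, a = s_next, policy[s_next]  # then follow the current policy
        G = 0.0
        for s, a, r in reversed(trajectory):  # undiscounted return
            G += r
            N[(s, a)] += 1
            Q[(s, a)] += (G - Q[(s, a)]) / N[(s, a)]  # running average
            policy[s] = max((0, 1), key=lambda act: Q[(s, act)])  # greedify
        # (each state occurs at most once per episode here, so no first-visit check)
    return policy, Q

policy, Q = mc_exploring_starts()
```

On this toy problem the greedy fixed point is the optimal policy: give up the immediate 0.5 at state 0 to collect the reward of 1 through state 1.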
[Figure: optimal policy π* (STICK/HIT regions over dealer showing A–10 and player sum 11–21, shown with a usable ace and with no usable ace) and the optimal value function V* (−1 to +1)]
Blackjack example continued
❐ Exploring starts
❐ Initial policy as described before
On-policy Monte Carlo Control
❐ On-policy: learn about the policy currently being executed
❐ How do we get rid of exploring starts?
  ■ Need soft policies: π(s,a) > 0 for all s and a
  ■ e.g. an ε-soft (ε-greedy) policy:

      π(s,a) = 1 − ε + ε/|A(s)|  for the greedy action
      π(s,a) = ε/|A(s)|          for each non-max action

❐ Similar to GPI: move policy towards greedy policy (i.e. ε-greedy)
❐ Converges to best ε-soft policy
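The ε-soft probabilities above are easy to compute directly; a small sketch (function name is mine):

```python
def epsilon_soft_probs(q_values, epsilon=0.1):
    """ε-greedy action probabilities, which are ε-soft: every action gets at
    least ε/|A(s)|, and the greedy action gets 1 − ε + ε/|A(s)| in total."""
    n = len(q_values)
    greedy = max(range(n), key=lambda a: q_values[a])
    probs = [epsilon / n] * n          # base probability for every action
    probs[greedy] += 1.0 - epsilon     # extra mass on the greedy action
    return probs

probs = epsilon_soft_probs([0.0, 2.0, 1.0])  # greedy action is index 1
```

Because every action keeps probability at least ε/|A(s)|, every state-action pair continues to be visited, which is what removes the need for exploring starts.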
On-policy MC Control
Off-policy Monte Carlo control
❐ Behavior policy generates behavior in the environment
❐ Estimation policy is the policy being learned about
❐ Average returns from the behavior policy, weighting each return by the relative probability of its trajectory under the estimation policy
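One standard way to do this weighting is weighted importance sampling; the episode format below is my own choice for the sketch:

```python
def weighted_is_estimate(episodes):
    """Weighted importance-sampling value estimate for the start state.

    episodes: list of (G, ratios) pairs, where G is the episode's return under
    the behavior policy and ratios[t] = pi(s_t, a_t) / pi'(s_t, a_t) compares
    the estimation policy pi to the behavior policy pi' at each step."""
    num = den = 0.0
    for G, ratios in episodes:
        w = 1.0
        for rho in ratios:
            w *= rho  # relative probability of the whole trajectory
        num += w * G
        den += w
    return num / den if den > 0 else 0.0
```

A trajectory the estimation policy would never take (some ratio is 0) contributes nothing, while likelier-under-π trajectories count for more.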
Learning about π while following π′
Off-policy MC control
Incremental Implementation
❐ MC can be implemented incrementally
  ■ saves memory
❐ Compute the weighted average of each return:

  non-incremental:

    V_n = ( Σ_{k=1}^{n} w_k R_k ) / ( Σ_{k=1}^{n} w_k )

  incremental equivalent:

    V_{n+1} = V_n + (w_{n+1} / W_{n+1}) [ R_{n+1} − V_n ]
    W_{n+1} = W_n + w_{n+1},  with V_0 = 0 and W_0 = 0
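The equivalence of the two forms is easy to check numerically; a short sketch:

```python
def incremental_weighted_average(pairs):
    """V_{n+1} = V_n + (w_{n+1}/W_{n+1}) (R_{n+1} - V_n), W_{n+1} = W_n + w_{n+1},
    starting from V_0 = W_0 = 0: only V and W are kept in memory."""
    V = W = 0.0
    for w, R in pairs:
        W += w
        V += (w / W) * (R - V)
    return V

pairs = [(1.0, 2.0), (3.0, 4.0), (0.5, 1.0)]  # (weight, return) pairs
batch = sum(w * R for w, R in pairs) / sum(w for w, _ in pairs)
inc = incremental_weighted_average(pairs)     # matches batch up to rounding
```

The incremental form never stores the individual returns, which is the memory saving the slide refers to; with all weights equal to 1 it reduces to the ordinary sample-average update.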
Racetrack Exercise
[Figure: two racetrack grids, each with a starting line and a finish line]
❐ States: grid squares, velocity horizontal and vertical
❐ Rewards: -1 on track, -5 off track
❐ Actions: +1, −1, 0 to velocity
❐ 0 < Velocity < 5
❐ Stochastic: 50% of the time it moves 1 extra square up or right
Summary
❐ MC has several advantages over DP:
  ■ Can learn directly from interaction with environment
  ■ No need for full models
  ■ No need to learn about ALL states
  ■ Less harm by Markovian violations (later in book)
❐ MC methods provide an alternate policy evaluation process
❐ One issue to watch for: maintaining sufficient exploration
  ■ exploring starts, soft policies
❐ No bootstrapping (as opposed to DP)