Value Function Methods
CS 285: Deep Reinforcement Learning, Decision Making, and Control
Sergey Levine
Class Notes
1. Homework 2 is due in one week (next Monday)
2. Remember to start forming final project groups and writing your proposal!
• Proposal due 9/25, this Wednesday!
Today’s Lecture
1. What if we just use a critic, without an actor?
2. Extracting a policy from a value function
3. The Q-learning algorithm
4. Extensions: continuous actions, improvements
• Goals:
  • Understand how value functions give rise to policies
  • Understand the Q-learning algorithm
  • Understand practical considerations for Q-learning
Recap: actor-critic
(generate samples, i.e. run the policy → fit a model to estimate return → improve the policy)
Can we omit policy gradient completely?
forget policies, let’s just do this!
(generate samples, i.e. run the policy → fit a model to estimate return → improve the policy)
Policy iteration
High level idea:
(generate samples, i.e. run the policy → fit a model to estimate return → improve the policy)
how to do this?
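The two steps, written out (a standard formulation consistent with this lecture's notation):

```latex
\text{step 1: evaluate } A^\pi(\mathbf{s}, \mathbf{a}) = r(\mathbf{s}, \mathbf{a}) + \gamma \, \mathbb{E}\!\left[V^\pi(\mathbf{s}')\right] - V^\pi(\mathbf{s}) \\
\text{step 2: set } \pi \leftarrow \pi', \quad \pi'(\mathbf{a}_t \mid \mathbf{s}_t) = \begin{cases} 1 & \text{if } \mathbf{a}_t = \arg\max_{\mathbf{a}_t} A^\pi(\mathbf{s}_t, \mathbf{a}_t) \\ 0 & \text{otherwise} \end{cases}
```

Step 2 is just an argmax, so the open question ("how to do this?") is step 1: evaluating the current policy.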
Dynamic programming
(Figure: tabular gridworld, one value estimate per state, e.g. 0.2–0.7.)
Bootstrapped update: just use the current estimate of the value function here, on the right-hand side of the backup.
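A minimal tabular sketch of this bootstrapped evaluation step, assuming known dynamics; the array names `T`, `R`, and `pi` are illustrative:

```python
import numpy as np

def policy_evaluation(T, R, pi, gamma=0.99, n_sweeps=100):
    """Bootstrapped policy evaluation: repeatedly back up using the
    *current* estimate of V on the right-hand side.
    T:  transition probabilities, shape (S, A, S)
    R:  rewards, shape (S, A)
    pi: deterministic policy, shape (S,), integer actions
    """
    S = T.shape[0]
    V = np.zeros(S)
    for _ in range(n_sweeps):
        # V^pi(s) <- r(s, pi(s)) + gamma * E_{s'}[V^pi(s')]
        V = R[np.arange(S), pi] + gamma * T[np.arange(S), pi] @ V
    return V
```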
Policy iteration with dynamic programming
(generate samples, i.e. run the policy → fit a model to estimate return → improve the policy)
(Figure: the same tabular gridworld of value estimates.)
Even simpler dynamic programming
the max over actions approximates the new (improved) policy's value!
(generate samples, i.e. run the policy → fit a model to estimate return → improve the policy)
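Skipping the explicit policy entirely gives value iteration; a tabular sketch in the same illustrative setting as above:

```python
def value_iteration(T, R, gamma=0.99, n_sweeps=100):
    """Tabular value iteration: no explicit policy representation.
    The max over actions *is* the (implicit) policy improvement."""
    S, A, _ = T.shape
    V = np.zeros(S)
    for _ in range(n_sweeps):
        Q = R + gamma * T @ V   # step 1: Q(s,a) <- r(s,a) + gamma * E[V(s')]
        V = Q.max(axis=1)       # step 2: V(s) <- max_a Q(s,a)
    return V
```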
Fitted value iteration
(generate samples, i.e. run the policy → fit a model to estimate return → improve the policy)
curse of dimensionality: a table with one entry per state is intractable for large or continuous state spaces, so we fit a function approximator instead
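Replacing the table with a network \(V_\phi\) gives fitted value iteration; the standard two-step update, as in this lecture:

```latex
1.\quad y_i \leftarrow \max_{\mathbf{a}_i} \big( r(\mathbf{s}_i, \mathbf{a}_i) + \gamma\, \mathbb{E}\!\left[ V_\phi(\mathbf{s}_i') \right] \big) \\
2.\quad \phi \leftarrow \arg\min_\phi \tfrac{1}{2} \sum_i \big\| V_\phi(\mathbf{s}_i) - y_i \big\|^2
```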
What if we don’t know the transition dynamics?
need to know outcomes for different actions!
Back to policy iteration…
can fit this using samples
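Concretely, evaluating Q instead of V lets us use sampled transitions: the backup below needs only (s, a, s′, r) tuples, with the expectation approximated by the single sampled s′ in each tuple:

```latex
Q^\pi(\mathbf{s}, \mathbf{a}) \leftarrow r(\mathbf{s}, \mathbf{a}) + \gamma\, \mathbb{E}_{\mathbf{s}' \sim p(\mathbf{s}' \mid \mathbf{s}, \mathbf{a})}\!\left[ Q^\pi\big(\mathbf{s}', \pi(\mathbf{s}')\big) \right]
```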
Can we do the “max” trick again?
doesn’t require simulation of actions!
+ works even for off-policy samples (unlike actor-critic)
+ only one network, no high-variance policy gradient
- no convergence guarantees for non-linear function approximation (more on this later)
forget policy, compute value directly
can we do this with Q-values also, without knowing the transitions?
Fitted Q-iteration
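A compact sketch of the full loop; the `collect_transitions` function and the regressor's `fit`/`predict` interface are hypothetical stand-ins for "any policy" and "any function approximator":

```python
def fitted_q_iteration(collect_transitions, q, n_outer=10, K=5, gamma=0.99):
    """Fitted Q-iteration sketch.

    collect_transitions() -> (s, a, s2, r) arrays gathered by *some* policy
    (off-policy is fine).  q is any regressor with a hypothetical interface:
    q.predict(states) -> (N, A) Q-values; q.fit(states, actions, targets).
    """
    for _ in range(n_outer):
        # 1. collect a dataset {(s_i, a_i, s'_i, r_i)} using some policy
        s, a, s2, r = collect_transitions()
        for _ in range(K):
            # 2. set y_i <- r_i + gamma * max_a' Q_phi(s'_i, a')
            y = r + gamma * q.predict(s2).max(axis=1)
            # 3. set phi <- argmin_phi sum_i ||Q_phi(s_i, a_i) - y_i||^2
            q.fit(s, a, y)
    return q
```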
Review
(generate samples, i.e. run the policy → fit a model to estimate return → improve the policy)
• Value-based methods
  • Don't learn a policy explicitly
  • Just learn a value function or Q-function
• If we have a value function, we have a policy
• Fitted Q-iteration
Break
Why is this algorithm off-policy?
Fitted Q-iteration learns from a dataset of transitions (s, a, s′, r). Given (s, a), the outcome does not depend on which policy collected the sample, and the max over a′ in the target is computed inside the Q-function rather than by running the current policy — so data from any behavior policy will do.
What is fitted Q-iteration optimizing?
most guarantees are lost when we leave the tabular case (e.g., when we use neural network function approximation)
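In terms of an objective, fitted Q-iteration can be viewed as (approximately) minimizing the Bellman error; one common way to write it, with β denoting the data distribution:

```latex
\mathcal{E} = \tfrac{1}{2}\, \mathbb{E}_{(\mathbf{s}, \mathbf{a}) \sim \beta} \left[ \Big( Q_\phi(\mathbf{s}, \mathbf{a}) - \big[ r(\mathbf{s}, \mathbf{a}) + \gamma \max_{\mathbf{a}'} Q_\phi(\mathbf{s}', \mathbf{a}') \big] \Big)^{\!2} \right]
```

In the tabular case, driving this error to zero recovers the optimal Q-function; with neural network approximation, that guarantee is lost.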
Online Q-learning algorithms
(generate samples, i.e. run the policy → fit a model to estimate return → improve the policy)
off-policy, so many choices here!
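A minimal sketch of one online step. The general algorithm takes a gradient step on the parameters φ; the tabular special case below keeps the structure visible (`env_step` is an assumed environment interface, and `epsilon_greedy` is defined in the exploration sketch below):

```python
def online_q_learning_step(Q, env_step, s, eps=0.1, alpha=0.1, gamma=0.99):
    """One step of online Q-learning on a tabular Q of shape (S, A).
    env_step(s, a) -> (s2, r) is an assumed environment interface."""
    # 1. take some action a and observe (s, a, s', r)
    a = epsilon_greedy(Q, s, eps)
    s2, r = env_step(s, a)
    # 2. compute the target y = r + gamma * max_a' Q(s', a')
    y = r + gamma * Q[s2].max()
    # 3. move Q(s, a) a step toward the target (the gradient step, tabular case)
    Q[s, a] += alpha * (y - Q[s, a])
    return s2
```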
Exploration with Q-learning
We’ll discuss exploration in more detail in a later lecture!
final policy: greedy, i.e. always take the argmax of Q — but why is this a bad idea for step 1 (collecting data)? A deterministic greedy policy never tries actions it currently believes are suboptimal. Two standard remedies, sketched below:
• “epsilon-greedy”
• “Boltzmann exploration”
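Minimal sketches of both exploration rules, assuming a tabular `Q` of shape (S, A):

```python
import numpy as np

def epsilon_greedy(Q, s, eps=0.1):
    """With probability eps take a uniformly random action, else the greedy one."""
    if np.random.rand() < eps:
        return np.random.randint(Q.shape[1])
    return int(np.argmax(Q[s]))

def boltzmann(Q, s, temperature=1.0):
    """Sample actions with probability proportional to exp(Q(s, a) / temperature):
    better-looking actions are tried more often, but nothing is ruled out."""
    logits = Q[s] / temperature
    p = np.exp(logits - logits.max())   # subtract max for numerical stability
    p /= p.sum()
    return int(np.random.choice(len(p), p=p))
```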
Review
(generate samples, i.e. run the policy → fit a model to estimate return → improve the policy)
• Value-based methods
  • Don't learn a policy explicitly
  • Just learn a value function or Q-function
• If we have a value function, we have a policy
• Fitted Q-iteration
  • Batch-mode, off-policy method
• Q-learning
  • Online analogue of fitted Q-iteration
Value function learning theory
(Figure: the tabular gridworld of value estimates.)
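The key fact behind the tabular convergence claim: the Bellman backup operator B is a contraction in the infinity norm, so repeated backups converge to its unique fixed point V* (the standard argument, consistent with this lecture):

```latex
(\mathcal{B}V)(\mathbf{s}) = \max_{\mathbf{a}} \big( r(\mathbf{s}, \mathbf{a}) + \gamma\, \mathbb{E}_{\mathbf{s}'}\!\left[ V(\mathbf{s}') \right] \big), \qquad \big\| \mathcal{B}V - \mathcal{B}\bar{V} \big\|_\infty \le \gamma\, \big\| V - \bar{V} \big\|_\infty
```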
Non-tabular value function learning
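With function approximation, each backup is followed by a projection Π onto the set Ω of value functions the approximator can represent. B is a contraction in the ∞-norm and Π is a contraction in the ℓ2 norm, but their composition ΠB need not be a contraction in any norm — which is what breaks the convergence argument:

```latex
V \leftarrow \Pi \mathcal{B} V, \qquad \Pi V = \arg\min_{V' \in \Omega} \sum_{\mathbf{s}} \big\| V'(\mathbf{s}) - V(\mathbf{s}) \big\|^2
```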
Conclusions:
• value iteration converges (tabular case)
• fitted value iteration does not converge
  • not in general
  • often not in practice
What about fitted Q-iteration?
The same argument applies: the projected backup is not a contraction, so fitted Q-iteration has no general convergence guarantee. Applies also to online Q-learning.
But… it’s just regression!
Q-learning is not gradient descent!
no gradient through target value
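What "no gradient through target value" looks like in code; a minimal PyTorch-style sketch (the network and tensor names are illustrative):

```python
import torch

def q_learning_loss(q_net, obs, actions, rewards, next_obs, gamma=0.99):
    """Semi-gradient Q-learning loss: looks like regression, but the target is
    computed with the same network and detached, so no gradient flows through it."""
    with torch.no_grad():  # y = r + gamma * max_a' Q_phi(s', a'), treated as a constant
        target = rewards + gamma * q_net(next_obs).max(dim=1).values
    # Q_phi(s, a) for the actions actually taken (actions: int64 tensor, shape (N,))
    q_sa = q_net(obs).gather(1, actions.unsqueeze(1)).squeeze(1)
    return ((q_sa - target) ** 2).mean()
```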
A sad corollary
The same issue afflicts the bootstrapped critic from the actor-critic lecture: fitted bootstrapped policy evaluation also has no general convergence guarantee.
An aside regarding terminology
Review
(generate samples, i.e. run the policy → fit a model to estimate return → improve the policy)
• Value iteration theory
  • Linear operator for backup
  • Linear operator for projection
  • Backup is a contraction
  • Value iteration converges
• Convergence with function approximation
  • Projection is also a contraction
  • Projection + backup is not a contraction
  • Fitted value iteration does not converge in general
• Implications for Q-learning
  • Q-learning, fitted Q-iteration, etc. do not converge with function approximation
• But we can make it work in practice!
  • Sometimes – tune in next time