Deep RL
Robert Platt, Northeastern University
Q-learning

[Diagram: the agent–world loop. The Q-function maps the current state to a value for each action; the agent acts by taking the argmax over actions, and the world returns the next state and reward.]

Update rule:

    Q(s,a) ← Q(s,a) + α [ r + γ max_{a'} Q(s',a') − Q(s,a) ]
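For concreteness, here is a minimal tabular Q-learning sketch in Python (sizes and hyperparameters are illustrative, not from the slides):

```python
import numpy as np

# Tabular Q-learning update: move Q(s,a) toward r + gamma * max_a' Q(s',a').
num_states, num_actions = 16, 4     # illustrative sizes
alpha, gamma = 0.1, 0.99            # step size and discount

Q = np.zeros((num_states, num_actions))

def q_update(s, a, r, s_next):
    td_target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (td_target - Q[s, a])
```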
Deep Q-learning (DQN)

[Diagram: the same agent–world loop, but the Q-function is now a neural network mapping the state to the values of the different possible discrete actions; the agent takes the argmax over these outputs.]
But why would we want to do this?
Where does “state” come from?

[Diagram: the agent takes actions a, and perceives states and rewards s, r.]

Earlier, we dodged this question: “it’s part of the MDP problem statement.”

But that’s a cop-out. How do we get state?

Typically we can’t use “raw” sensor data as state with a tabular Q-function – it’s too big (e.g., Pacman has something like 2^(num pellets) + … states).
Is it possible to do RL WITHOUT hand-coding states?
DQN
Instead of state, we have an image
– in practice, it could be a history of the k most recent images stacked as a single k-channel image

Hopefully this new image representation is Markov...
– in some domains, it might not be!
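A minimal sketch of that k-channel stacking in Python (the shape and class name are illustrative):

```python
import numpy as np
from collections import deque

# Stack the k most recent grayscale frames into one k-channel observation.
class FrameStack:
    def __init__(self, k=4, frame_shape=(84, 84)):
        blank = np.zeros(frame_shape, dtype=np.float32)
        self.frames = deque([blank] * k, maxlen=k)  # always holds k frames

    def push(self, frame):
        self.frames.append(frame)                   # oldest frame falls off
        return np.stack(self.frames, axis=0)        # shape (k, H, W)
```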
[Network diagram: a stack of images feeds the Q-function: Conv 1 → Conv 2 → FC 1 → Output. The number of output nodes equals the number of actions.]
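A minimal PyTorch sketch of a network with this shape, assuming 84×84 inputs and k stacked frames; the layer sizes are illustrative since the slides don't specify them:

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    def __init__(self, k=4, num_actions=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(k, 16, kernel_size=8, stride=4), nn.ReLU(),   # Conv 1
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),  # Conv 2
            nn.Flatten(),
            nn.Linear(32 * 9 * 9, 256), nn.ReLU(),                  # FC 1
            nn.Linear(256, num_actions),                            # one output per action
        )

    def forward(self, x):          # x: (batch, k, 84, 84)
        return self.net(x)         # (batch, num_actions) action values
```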
Q-function updates in DQN

Here’s the standard Q-learning update equation:

    Q(s,a) ← Q(s,a) + α [ r + γ max_{a'} Q(s',a') − Q(s,a) ]
Rewriting:

    Q(s,a) ← (1 − α) Q(s,a) + α [ r + γ max_{a'} Q(s',a') ]

Let’s call the bracketed term the “target”. This equation adjusts Q(s,a) in the direction of the target.

We’re going to accomplish this same thing in a different way using neural networks...
Use this loss function:

    L(w) = ( r + γ max_{a'} Q(s',a'; w) − Q(s,a; w) )²

The first term, r + γ max_{a'} Q(s',a'; w), is the target. Notice that Q is now parameterized by the weights, w (I’m including the bias in the weights).
Question

What’s this loss function called?
Q-function updates in DQN

We’re going to optimize this loss function using the following gradient:

    ∇_w L(w) = −2 [ r + γ max_{a'} Q(s',a'; w) − Q(s,a; w) ] ∇_w Q(s,a; w)
Think-pair-share

What’s wrong with this gradient?
We call this the semi-gradient rather than the gradient: it differentiates Q(s,a; w) but treats the target as a constant, even though the target also depends on w.
– semi-gradient descent still converges
– this is often more convenient
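In an autograd framework, the semi-gradient is exactly what you get by blocking gradients through the target; a minimal PyTorch sketch (assumed framework, illustrative names):

```python
import torch

# Semi-gradient TD loss: torch.no_grad() treats the target as a constant,
# so backprop flows only through Q(s,a;w), not through the bootstrap term.
def semi_gradient_loss(q_net, s, a, r, s_next, done, gamma=0.99):
    q_sa = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)  # Q(s,a;w)
    with torch.no_grad():
        best_next = q_net(s_next).max(dim=1).values              # max_a' Q(s',a';w)
        target = r + gamma * (1.0 - done) * best_next            # the "target"
    return ((target - q_sa) ** 2).mean()
```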
“Barebones” DQN

Initialize Q(s,a;w) with random weights
Repeat (for each episode):
    Initialize s
    Repeat (for each step of the episode):
        Choose a from s using policy derived from Q (e.g. ε-greedy)
        Take action a, observe r, s'
        w ← w − α ∇_w L(w)        [one semi-gradient step on this transition]
        s ← s'
    Until s is terminal

Where: L(w) = ( r + γ max_{a'} Q(s',a'; w) − Q(s,a; w) )²
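A runnable sketch of this loop, assuming a Gymnasium-style env, PyTorch, and a discrete action space (hyperparameters are illustrative):

```python
import random
import torch

# "Barebones" DQN: epsilon-greedy acting plus one semi-gradient step per transition.
def barebones_dqn(env, q_net, episodes=500, alpha=1e-3, gamma=0.99, eps=0.1):
    opt = torch.optim.SGD(q_net.parameters(), lr=alpha)

    def qvals(obs):
        return q_net(torch.as_tensor(obs, dtype=torch.float32).unsqueeze(0))

    for _ in range(episodes):
        s, _ = env.reset()
        done = False
        while not done:
            # choose a from s using an epsilon-greedy policy derived from Q
            if random.random() < eps:
                a = env.action_space.sample()
            else:
                with torch.no_grad():
                    a = qvals(s).argmax().item()
            s_next, r, terminated, truncated, _ = env.step(a)
            done = terminated or truncated
            # one semi-gradient step on this single transition
            with torch.no_grad():
                target = r + gamma * (0.0 if terminated else qvals(s_next).max().item())
            loss = (target - qvals(s)[0, a]) ** 2
            opt.zero_grad()
            loss.backward()
            opt.step()
            s = s_next
```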
The parameterized Q-function and its gradient-based weight update are all that changed relative to standard Q-learning.
Example: 4x4 frozen lake env
Get to the goal (G). Don’t fall in a hole (H).
Demo!
Think-pair-share
Suppose the “barebones” DQN algorithm with this DQN network experiences the following transition (shown in the slide figure):
Which weights in the network could be updated on this iteration?
Experience replay
Deep learning typically assumes independent, identically distributed (IID) training data
But is this true in the deep RL scenario?
Our solution: buffer experiences and then “replay” them during training
Initialize Q(s,a;w) with random weights
Initialize replay buffer D
Repeat (for each episode):
    Initialize s
    Repeat (for each step of the episode):
        Choose a from s using policy derived from Q (e.g. ε-greedy)
        Take action a, observe r, s'
        Add this experience (s, a, r, s') to the replay buffer D
        If mod(step, trainfreq) == 0:        [train every trainfreq steps]
            sample batch B from D
            take one gradient-descent step on the loss over B
        s ← s'
    Until s is terminal
Where the minibatch loss is:

    L(w) = Σ_{(s,a,r,s') ∈ B} ( r + γ max_{a'} Q(s',a'; w) − Q(s,a; w) )²
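A minimal sketch of the buffer D in Python (capacity illustrative):

```python
import random
from collections import deque

# Replay buffer D: a bounded FIFO of transitions with uniform sampling.
class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.buf = deque(maxlen=capacity)   # oldest experiences fall off

    def add(self, s, a, r, s_next, done):
        self.buf.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        return random.sample(self.buf, batch_size)  # uniform random minibatch
```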
Buffers like this are pretty common in DL
Think-pair-share

What do you think are the tradeoffs between:
– a large replay buffer vs a small replay buffer?
– a large batch size vs a small batch size?
With target network

Initialize Q(s,a;w) with random weights; initialize target-network weights w⁻ ← w
Repeat (for each episode):
    Initialize s
    Repeat (for each step of the episode):
        Choose a from s using policy derived from Q (e.g. ε-greedy)
        Take action a, observe r, s'
        Add (s, a, r, s') to D
        If mod(step, trainfreq) == 0:
            sample batch B from D
            take one gradient-descent step on the loss over B
        If mod(step, copyfreq) == 0:
            w⁻ ← w          [copy the online weights into the target network]
        s ← s'
    Until s is terminal

Where: L(w) = Σ_{(s,a,r,s') ∈ B} ( r + γ max_{a'} Q(s',a'; w⁻) − Q(s,a; w) )²

Target network helps stabilize learning – why?
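Mechanically the change is small; a minimal PyTorch sketch, assuming the QNetwork sketch from earlier:

```python
import copy
import torch

q_net = QNetwork()                   # online network, weights w
target_net = copy.deepcopy(q_net)    # target network, weights w⁻ ← w

def td_targets(r, s_next, done, gamma=0.99):
    # bootstrap from the frozen copy, not the network being trained
    with torch.no_grad():
        next_q = target_net(s_next).max(dim=1).values
    return r + gamma * (1.0 - done) * next_q

# ... every copyfreq steps:
target_net.load_state_dict(q_net.state_dict())   # w⁻ ← w
```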
Example: 4x4 frozen lake env
Get to the goal (G). Don’t fall in a hole (H).
Demo!
Comparison: replay vs no replay
[Figure: average final score achieved]
Double DQN
Recall the problem of maximization bias:
Our solution from the TD lecture was double Q-learning: learn two Q-functions, Q₁ and Q₂, and update Q₁(s,a) toward r + γ Q₂(s', argmax_{a'} Q₁(s',a')) (and symmetrically for Q₂).
Can we adapt this to the DQN setting?
Same algorithm as before (replay buffer plus target network), but the target now uses both networks: the online network selects the next action, and the target network evaluates it.

Where: L(w) = Σ_{(s,a,r,s') ∈ B} ( r + γ Q(s', argmax_{a'} Q(s',a'; w); w⁻) − Q(s,a; w) )²
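A minimal sketch of that target computation in PyTorch (function name illustrative):

```python
import torch

# Double DQN target: select the action with the online weights w,
# evaluate it with the target weights w⁻.
def double_dqn_targets(q_net, target_net, r, s_next, done, gamma=0.99):
    with torch.no_grad():
        a_star = q_net(s_next).argmax(dim=1, keepdim=True)        # argmax_a' Q(s',a';w)
        next_q = target_net(s_next).gather(1, a_star).squeeze(1)  # Q(s',a*;w⁻)
    return r + gamma * (1.0 - done) * next_q
```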
Think-pair-share
1. In what sense is this double Q-learning?
2. What are the pros/cons vs the earlier version of double-Q?
3. Why not convert the original double-Q algorithm into a deep version?
Double DQN

[Results figures]
Prioritized Replay Buffer
Previously, the batch B was sampled from D uniformly at random.
Can we do better by sampling the batch intelligently?
Example domain (cliffworld):
– Left action transitions to state 1 w/ zero reward
– Far right state gets reward of 1
Question

Why is the sampling method particularly important in this domain?
Prioritized Replay Buffer

[Figure: number of updates needed to learn the true value function, as a function of replay buffer size. A larger replay buffer corresponds to larger values of n in cliffworld. The black line selects minibatches uniformly at random; the blue line greedily selects the transitions that minimize loss over the entire buffer.]
Minimizing loss over entire buffer is impractical
Can we achieve something similar online?
Prioritized Replay Buffer

Idea: sample the elements of the minibatch by drawing samples with probability

    P(i) = p_i / Σ_k p_k

where p_i denotes the priority of a sample
– simplest case: p_i = |δ_i| + ε, where δ_i is the sample’s TD error (this is “proportional” sampling)

Problem: since we’re changing the distribution of updates performed, this is off-policy.
– need to weight the sample updates by importance weights; in the prioritized replay paper, w_i = (1/(N·P(i)))^β

Question: qualitatively, how should we re-weight experiences?
– e.g. how should we re-weight an experience that prioritized replay does not sample often?
Question: why is the ε in p_i = |δ_i| + ε needed?
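A minimal sketch of proportional prioritized sampling with importance weights; the β exponent and max-normalization follow the prioritized experience replay paper, and all names are illustrative:

```python
import numpy as np

class PrioritizedReplay:
    def __init__(self, capacity=100_000, eps=1e-3, beta=0.4):
        self.data, self.prios = [], []
        self.capacity, self.eps, self.beta = capacity, eps, beta

    def add(self, transition, td_error):
        if len(self.data) >= self.capacity:
            self.data.pop(0)
            self.prios.pop(0)
        self.data.append(transition)
        self.prios.append(abs(td_error) + self.eps)   # p_i = |δ_i| + ε

    def sample(self, batch_size):
        p = np.asarray(self.prios)
        probs = p / p.sum()                           # P(i) = p_i / Σ_k p_k
        idx = np.random.choice(len(self.data), batch_size, p=probs)
        # importance weights correct for the non-uniform sampling
        w = (len(self.data) * probs[idx]) ** (-self.beta)
        w /= w.max()                                  # normalize for stability
        return [self.data[i] for i in idx], w, idx
```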
Results in this domain: the prioritized buffer is not as good as the oracle (greedy selection), but it is better than uniform sampling...
[Figure: prioritized replay results averaged over 57 Atari games]
Dueling networks for Q-learning
Recall the architecture of the Q-network:
This is a more common way of drawing it:

[Diagram: input image → CONV layers → fully connected layers → Q-values]

We’re going to express the Q-function using a new network architecture.
Decompose Q as:

    Q(s,a) = V(s) + A(s,a)

where A is the advantage function.
Think-pair-share

1. Why might this decomposition be better?
2. Is A always positive, negative, or either? Why?
Intuition

[Slide figure]
Dueling networks for Q-learning

Notice that the V/A decomposition of Q is not unique, given Q targets only.

Therefore, pin it down by forcing the advantage of the greedy action to zero:

    Q(s,a) = V(s) + ( A(s,a) − max_{a'} A(s,a') )
Question

Why does this help?
Actually, in practice, the mean advantage is subtracted instead:

    Q(s,a) = V(s) + ( A(s,a) − (1/|A|) Σ_{a'} A(s,a') )
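A minimal PyTorch sketch of the corresponding dueling head (sizes illustrative): shared features split into a scalar V stream and a per-action A stream, recombined with the mean subtracted:

```python
import torch
import torch.nn as nn

class DuelingHead(nn.Module):
    def __init__(self, in_features=256, num_actions=4):
        super().__init__()
        self.value = nn.Linear(in_features, 1)                 # V(s)
        self.advantage = nn.Linear(in_features, num_actions)   # A(s,a)

    def forward(self, features):
        v = self.value(features)                    # (B, 1)
        a = self.advantage(features)                # (B, num_actions)
        return v + a - a.mean(dim=1, keepdim=True)  # Q(s,a)
```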
Experimental setup:
– Action set: left, right, up, down, no-op (with an arbitrary number of no-op actions)
– SE: squared error relative to the true value function
– Compare dueling vs single-stream networks (all networks are three-layer MLPs)
– Increasing the number of actions above corresponds to adding more no-op actions

Conclusion: dueling networks can help a lot for large numbers of actions.
[Figure: change in average rewards across 57 ALE domains, versus DQN with a single-stream network]
Asynchronous methods
Idea: run multiple RL agents in parallel
– all agents run against their own environments and Q-functions
– periodically, all agents sync with a global Q-function

Instantiations of the idea:
– asynchronous Q-learning
– asynchronous SARSA
– asynchronous advantage actor-critic (A3C)
Asynchronous Q-learning
[Algorithm diagram: each learner accumulates gradients against shared Q-functions, periodically applies the accumulated batch of weight updates, and periodically updates the target network.]
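A conceptual single-learner sketch in PyTorch (synchronization across threads and exploration are omitted; assumes a Gymnasium-style env):

```python
import torch

# One learner: accumulate gradients locally for several steps, then apply
# them to the shared network as one batch of weight updates.
def learner_loop(shared_net, target_net, env, steps_per_update=5,
                 gamma=0.99, lr=1e-4, total_steps=100_000):
    opt = torch.optim.SGD(shared_net.parameters(), lr=lr)
    s, _ = env.reset()
    opt.zero_grad()
    for t in range(1, total_steps + 1):
        with torch.no_grad():   # greedy acting; exploration omitted for brevity
            a = shared_net(torch.as_tensor(s, dtype=torch.float32)
                           .unsqueeze(0)).argmax().item()
        s_next, r, terminated, truncated, _ = env.step(a)
        with torch.no_grad():
            boot = 0.0 if terminated else target_net(
                torch.as_tensor(s_next, dtype=torch.float32).unsqueeze(0)).max().item()
        q_sa = shared_net(torch.as_tensor(s, dtype=torch.float32).unsqueeze(0))[0, a]
        ((r + gamma * boot - q_sa) ** 2).backward()   # accumulate gradients
        if t % steps_per_update == 0:
            opt.step()                                # apply accumulated updates
            opt.zero_grad()
        if terminated or truncated:
            s, _ = env.reset()
        else:
            s = s_next
```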
Why does this approach help?

It helps decorrelate training data
– standard DQN relies on the replay buffer and the target network to decorrelate data
– asynchronous methods accomplish the same thing by having multiple learners
– makes it feasible to use on-policy methods like SARSA (why?)
[Figure: learning curves for different numbers of learners versus wall-clock time]
[Figure: learning curves for different numbers of learners versus the number of SGD steps across all threads – the speedup is not just due to greater computational efficiency]
Combine all these ideas!