Deep Reinforcement Learning in the Real World
Sergey Levine, UC Berkeley / Google Brain
Deep learning helps us handle unstructured environments
Reinforcement learning provides a formalism for behavior
decisions (actions)
consequences, observations, rewards
Mnih et al. ‘13
notice something?
But maybe it’s not so simple… deep learning handles unstructured environments using data like this:
Maybe we learned the wrong lesson from deep learning:
how can we possibly hope for RL to enable intelligent machines when it trains on data that looks like this?
It’s not about deep nets…
It’s about big models + highly varied and diverse data!
RL has a big problem
reinforcement learning vs. supervised machine learning
this is done once
train for many epochs
this is done many times
RL has a big problem
reinforcement learning vs. actual reinforcement learning
this is done many times
this is done many times
this is done many, many times
Off-policy RL with large datasets
reinforcement learning vs. off-policy reinforcement learning
this is done many times
big dataset from past interaction
train for many epochs
occasionally get more data
How does off-policy RL work?
Off-policy RL = prediction
• What do we predict?
• Future observations: understand how the world works
• Future rewards: understand the consequences of your actions
• They’re not as different as you think – more on this later
Off-policy model-free RL algorithms
Off-policy model-based RL algorithms
Why these are actually the same thing
Off-policy model-free learning
big dataset from past interaction
occasionally get more data
train for many epochs – enforce this equation on all off-policy data
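The equation itself is an image and does not survive in this transcript; the condition being enforced is presumably the standard Bellman optimality equation for the Q-function:

$$Q(s, a) = r(s, a) + \gamma \, \mathbb{E}_{s' \sim p(\cdot \mid s, a)}\Big[\max_{a'} Q(s', a')\Big]$$

Training "for many epochs" then means minimizing the squared violation of this equation over every transition $(s, a, r, s')$ in the off-policy dataset.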
don’t need on-policy data for this!
dataset of transitions (“replay buffer”)
off-policy Q-learning
See, e.g., Riedmiller, Neural Fitted Q-Iteration ‘05;
Ernst et al., Tree-Based Batch Mode RL ‘05
How to solve for the Q-function?
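One classical answer, in the spirit of the fitted Q-iteration papers cited above: repeatedly regress Q toward the Bellman backup computed on every transition in a fixed buffer. A minimal tabular sketch (the function name and toy MDP are illustrative, not from the talk):

```python
import numpy as np

def fitted_q_iteration(transitions, n_states, n_actions, gamma=0.9, iters=50):
    """Solve for Q from a fixed dataset of (s, a, r, s_next, done) tuples by
    repeatedly enforcing the Bellman equation on all off-policy data."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(iters):
        new_Q = Q.copy()
        for s, a, r, s_next, done in transitions:
            # Bellman backup: terminal transitions bootstrap nothing.
            new_Q[s, a] = r if done else r + gamma * Q[s_next].max()
        Q = new_Q
    return Q

# Toy two-state chain: in state 0, action 1 reaches the goal (reward 1).
buffer = [(0, 0, 0.0, 0, False), (0, 1, 1.0, 1, True), (1, 0, 0.0, 1, True)]
Q = fitted_q_iteration(buffer, n_states=2, n_actions=2)
```

No on-policy data is needed: the backup only ever evaluates Q at states and actions already stored in the buffer.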
QT-Opt: off-policy Q-learning at scale
live data collection
stored data from all past experiments
training buffers, Bellman updaters
training threads
Kalashnikov, Irpan, Pastor, Ibarz, Herzog, Jang, Quillen, Holly, Kalakrishnan, Vanhoucke, Levine. QT-Opt: Scalable Deep Reinforcement Learning of Vision-Based Robotic Manipulation Skills.
Grasping with QT-Opt
➢About 1000 training objects
➢About 600k training grasp attempts
➢Q-function network with 1.2M parameters
➢The only grasp-specific feature is the reward (1 if grasped)
Emergent grasping strategies
96% grasp success
So far…
off-policy reinforcement learning
big dataset from past interaction
train for many epochs
occasionally get more data
repeatedly get more data
live data collection
dataset of transitions (“replay buffer”)
off-policy Q-learning
See, e.g., Riedmiller, Neural Fitted Q-Iteration ‘05;
Ernst et al., Tree-Based Batch Mode RL ‘05
So what’s the problem?
what action will this pick?
How to stop training on garbage?
Kumar, Fu, Tucker, Levine. Stabilizing Off-Policy Q-Learning via Bootstrapping Error Reduction.
See also: Fujimoto, Meger, Precup. Off-Policy Deep Reinforcement Learning without Exploration.
naïve RL
distrib. matching (Fujimoto)
random data
only use values inside support region
support constraint
pessimistic w.r.t. epistemic uncertainty
our method
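One way to read the "support constraint" idea: when computing the Bellman backup, only maximize over actions that actually appear in the dataset, so the target never bootstraps from out-of-distribution Q-values. A toy sketch (the helper and its arguments are illustrative, not the paper's algorithm, which uses learned support sets and uncertainty-aware pessimism rather than exact action lookup):

```python
import numpy as np

def constrained_backup(Q, r, s_next, seen_actions, gamma=0.99):
    """Bellman target that maximizes only over in-support actions at s_next,
    instead of trusting Q-values for actions the dataset never contains."""
    support = seen_actions.get(s_next, set())
    if not support:                     # nothing in-support: fall back to reward
        return r
    return r + gamma * max(Q[s_next, a] for a in support)

# Q wildly overestimates action 1 in state 1, which the buffer never took.
Q = np.array([[0.0, 0.0], [1.0, 5.0]])
seen = {1: {0}}
naive = 1.0 + 0.99 * Q[1].max()         # naive RL: bootstraps from garbage
safe = constrained_backup(Q, 1.0, 1, seen)
```

The naive target inherits the erroneous value 5.0; the constrained target ignores it, which is exactly the failure mode the slide's "what action will this pick?" question points at.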
How well does it work?
“mediocre” data
average reward in dataset
distrib. matching
our method
Off-policy model-free RL algorithms
Off-policy model-based RL algorithms
Why these are actually the same thing
Off-policy model-based reinforcement learning
big dataset from past interaction
occasionally get more data
train for many epochs
➢ Comparatively simple supervised learning problem
➢ Natural and straightforward to apply to multi-task settings (all tasks have the same physics)
➢ Models can be repurposed to new tasks without any additional learning
➢ Typically worse final performance than model-free RL – is that always true?
High-level algorithm outline
Nagabandi, Konolige, Levine, Kumar. Deep Dynamics Models for Learning Dexterous Manipulation. CoRL 2019.
experience buffer
model training
planning with model
use bootstrap ensemble for uncertainty estimation
derivative-free optimization using a variant of cross-entropy method
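The outline above can be sketched end to end: score each candidate action sequence by its average return across an ensemble of dynamics models, then refit a Gaussian to the elite candidates — the cross-entropy method. This is a toy 1-D sketch under stated assumptions (simple averaging, a made-up task); the paper's planner uses a more elaborate variant:

```python
import numpy as np

def cem_plan(ensemble, reward_fn, s0, horizon=3, pop=200, elites=20, iters=5, seed=0):
    """Cross-entropy-method planning: sample action sequences, roll them
    through each dynamics model in the ensemble, average the rewards, and
    refit the sampling distribution to the top-scoring candidates."""
    rng = np.random.default_rng(seed)
    mu, std = np.zeros(horizon), np.ones(horizon)
    for _ in range(iters):
        cand = rng.normal(mu, std, size=(pop, horizon))
        scores = np.zeros(pop)
        for model in ensemble:            # average return over the ensemble
            s = np.full(pop, s0)
            for t in range(horizon):
                s = model(s, cand[:, t])
                scores += reward_fn(s)
        elite = cand[np.argsort(scores)[-elites:]]
        mu, std = elite.mean(axis=0), elite.std(axis=0) + 1e-6
    return mu

# Toy task: dynamics s' = s + a, reward -s^2 (drive the state to zero);
# the two "bootstrap" models disagree slightly on the action gain.
ensemble = [lambda s, a: s + a, lambda s, a: s + 1.05 * a]
plan = cem_plan(ensemble, lambda s: -s ** 2, s0=1.0)
```

Because the planner is derivative-free, the dynamics models never need to be differentiated through, which is part of what makes this recipe easy to apply.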
Model-based RL for dexterous manipulation
Nagabandi, Konolige, Levine, Kumar. Deep Dynamics Models for Learning Dexterous Manipulation. CoRL 2019.
When should we prefer model-based RL?
➢ On narrow tasks, model-free RL tends to do very well
➢ On broad and diverse tasks, model-based RL has a major advantage!
➢ When we want to transfer the same dynamics model to perform multiple tasks
➢ When we want to learn in the real world with just a few hours of interaction
Off-policy model-free RL algorithms
Off-policy model-based RL algorithms
Why these are actually the same thing
Q-Functions (can) learn models
dataset of transitions (“replay buffer”)
off-policy Q-learning
Q-Functions (can) learn models
Before:
Now:
Andrychowicz et al., “Hindsight Experience Replay”
This results in much faster learning.
Why?
Kaelbling et al., “Learning to Reach Goals”
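The relabeling trick behind hindsight experience replay can be sketched in a few lines: every transition is stored again with goals drawn from states the trajectory actually reached later, so even "failed" trajectories yield successful goal-reaching examples. A minimal sketch (function name and tuple layout are illustrative):

```python
import random

def hindsight_relabel(trajectory, k=4, seed=0):
    """For each (s, a, s_next) transition, emit k relabeled transitions whose
    goal is the state reached at a future step of the same trajectory; the
    sparse reward is 1 exactly when the transition attains that goal."""
    rng = random.Random(seed)
    relabeled = []
    for t, (s, a, s_next) in enumerate(trajectory):
        for _ in range(k):
            future = rng.randrange(t, len(trajectory))  # pick a future step
            goal = trajectory[future][2]                # state reached there
            reward = 1.0 if s_next == goal else 0.0
            relabeled.append((s, a, goal, reward, s_next))
    return relabeled

# A two-step trajectory: 0 -> 1 -> 2.
augmented = hindsight_relabel([(0, 0, 1), (1, 1, 2)])
```

A goal-conditioned Q-function trained on such data is, implicitly, learning which states are reachable from where — one sense in which Q-functions (can) learn models.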
Q-Functions (can) learn models
Temporal difference models
Temporal Difference Models. Pong*, Gu*, Dalal, Levine.
As fast as model-based learning; asymptotic performance as good as (or better than) model-free.
Planning with TDMs
optimization over state variables
invalid states should be unreachable
but invalid states are out of distribution!
cannot make good predictions for OOD states!
Optimizing over valid states
Planning with Goal-Conditioned Policies.
Nasiriany*, Pong*, Lin, Levine.
(but there are many other choices)
➢ Main idea: VAE provides state abstraction, TDM provides temporal abstraction
➢ Details (not covered here) matter: how to turn this into unconstrained optimization, how to optimize, etc. (see paper)
LEAP: Latent Embeddings for Abstracted Planning
Planning with Goal-Conditioned Policies.
Nasiriany*, Pong*, Lin, Levine.
Graph search with goal-conditioned values
➢ Can search through graph using any graph search algorithm
➢ Where do we get the graph nodes?
➢ Use the entire replay buffer!
Search on the Replay Buffer: Bridging Planning and Reinforcement Learning
Eysenbach, Salakhutdinov, Levine.
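The bullet points above amount to: treat every buffer state as a graph node, connect pairs that a goal-conditioned value function deems nearby, and chain short reachable hops with ordinary shortest-path search. A toy sketch with Dijkstra, where a distance oracle stands in for the learned value function (all names and the 1-D example are illustrative):

```python
import heapq

def replay_buffer_search(states, dist, start, goal, max_dist=1.5):
    """Dijkstra over replay-buffer states: an edge (u, v) exists whenever
    dist(u, v) is at most max_dist, i.e. the low-level policy can reliably
    reach v from u; long-horizon plans are chains of such short hops."""
    nodes = list(states) + [start, goal]
    best = {n: float("inf") for n in nodes}
    best[start], prev = 0.0, {}
    heap = [(0.0, start)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > best[u]:
            continue                    # stale heap entry
        if u == goal:
            break
        for v in nodes:
            w = dist(u, v)
            if v != u and w <= max_dist and d + w < best[v]:
                best[v], prev[v] = d + w, u
                heapq.heappush(heap, (d + w, v))
    path, n = [goal], goal              # walk predecessors back to the start
    while n != start:
        n = prev[n]
        path.append(n)
    return path[::-1]

# Toy 1-D buffer: intermediate states bridge a start/goal pair that is
# too far apart to reach directly (edges limited to length 1.5).
waypoints = replay_buffer_search([1.0, 2.0, 3.0], lambda u, v: abs(u - v),
                                 start=0.0, goal=4.0)
```

Any graph-search algorithm would do here; the essential point is that the nodes come for free from the replay buffer.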
Search on the Replay Buffer (SoRB)
Off-policy model-free RL algorithms
Off-policy model-based RL algorithms
Why these are actually the same thing
RAIL
Robotic AI & Learning Lab
website: http://rail.eecs.berkeley.edu
source code: http://rail.eecs.berkeley.edu/code.html