Deep Reinforcement Learning: An Introduction
TRANSCRIPT
Deep Reinforcement Learning
Xiaohu Zhu
Future Engineer Institute
July 9, 2016
Introduction
Recently, Deep Learning's influence has been growing faster and faster. The following names are more familiar to us than before.
University of Toronto - Geoffrey Hinton
IDSIA - Jürgen Schmidhuber
University of Alberta - Richard Sutton
NYU - Yann LeCun
University of Montreal - Yoshua Bengio
Stanford University - Fei-Fei Li, Percy Liang
UC Berkeley - Pieter Abbeel
DeepMind - David Silver
OpenAI - Ilya Sutskever
Google - Jeff Dean
Baidu - IDL - Andrew Ng
Microsoft - MSR - Li Deng
Facebook - FAIR - Tomas Mikolov
Huawei - Noah's Ark Lab - Hang Li
...
Deep Learning
Recently, Deep Learning has played the key role in the ignition of AI.
AlphaGo
Atari games
Motor control and robotics
Poker
...
Deep Learning Architectures
Different types of architectures for various domains.
CNN - ResNet
RNN - LSTM - GRU - NTM
Generative Models - Adversarial Networks
...
Deep Learning Ingredients
Several key ingredients in deep learning.
1. Chain rule - elementary calculus
2. Backpropagation - efficient gradient calculation
3. SGD - optimization techniques
4. Weight sharing (over location or time)
5. Regularisation
6. Batch normalisation
7. Dropout
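A minimal sketch of how ingredients 1-3 fit together, using NumPy; the two-layer regression network, random data, and step size are illustrative assumptions, not from the slides.

```python
import numpy as np

# Ingredients 1-3: the chain rule yields the gradients, backpropagation
# computes them layer by layer, and a gradient step updates the weights.
rng = np.random.default_rng(0)
X = rng.normal(size=(32, 4))          # a batch of inputs (illustrative)
y = rng.normal(size=(32, 1))          # regression targets
W1 = rng.normal(size=(4, 8)) * 0.1
W2 = rng.normal(size=(8, 1)) * 0.1
lr = 0.01

for step in range(100):
    h = np.tanh(X @ W1)               # forward: hidden layer
    pred = h @ W2                     # forward: output layer
    loss = np.mean((pred - y) ** 2)   # MSE loss

    d_pred = 2 * (pred - y) / len(X)          # chain rule, output layer
    dW2 = h.T @ d_pred
    d_h = (d_pred @ W2.T) * (1 - h ** 2)      # tanh'(z) = 1 - tanh(z)^2
    dW1 = X.T @ d_h                           # chain rule, input layer

    W1 -= lr * dW1    # gradient step; true SGD would sample mini-batches,
    W2 -= lr * dW2    # one fixed batch is used here for brevity
```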
Reinforcement Learning
Agent-oriented learning: learn by interacting with the corresponding environment to reach a desired goal.
through trial and error
delayed evaluative reward
Reinforcement Learning
Figure: excerpted from Rich Sutton’s NIPS 2015 tutorial
The agent uses information from interaction to determine the best choice
The environment could be unknown, nonlinear, stochastic, and complex
A flavor of time sequentiality and online data
Position of MDP
In particular, the Markov Decision Process (MDP).
Figure: excerpted from Percy Liang’s lecture slides
Elements of MDP
A Markov Decision Process (MDP) is defined by the tuple (S, A, T, R, γ, s0):
S - state set
A - action set
T(s, a, s′) - probability of transitioning to s′ when taking action a in state s
R(s, a, s′) - reward for the transition (s, a, s′)
γ - discount factor
s0 - starting state
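To make the tuple concrete, here is a toy MDP written as plain Python data; the two-state chain is a made-up illustration, not an example from the talk.

```python
# A toy MDP matching the tuple (S, A, T, R, gamma, s0).
S = ["s0", "s1"]                       # state set
A = ["stay", "go"]                     # action set
gamma = 0.9                            # discount factor
s0 = "s0"                              # starting state

# T[(s, a)] maps each successor state s2 to its probability
T = {
    ("s0", "stay"): {"s0": 1.0},
    ("s0", "go"):   {"s1": 0.8, "s0": 0.2},
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "go"):   {"s0": 1.0},
}
# R[(s, a, s2)] is the reward; unlisted transitions give 0
R = {("s0", "go", "s1"): 1.0}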
MDP - Definitions
A policy π is a mapping from each state s ∈ S to an action a ∈ A. It can be used to produce an episode:

s0; a1, r1, s1; a2, r2, s2; a3, r3, s3; ...; an, rn, sn

Value function Vπ(s): expected reward if we follow π from state s

$$V_\pi(s) = \begin{cases} 0 & \text{if } s \text{ is terminal} \\ Q_\pi(s, \pi(s)) & \text{otherwise} \end{cases}$$

Q-value function Qπ(s, a): expected reward, with discount γ, of taking action a in state s and then following π

$$Q_\pi(s, a) = \sum_{s'} T(s, a, s')\left[R(s, a, s') + \gamma V_\pi(s')\right]$$
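The Q-formula above translates directly into code. Below is a sketch of iterative policy evaluation on the toy MDP from the previous slide; the fixed policy pi and the number of sweeps are illustrative choices (the toy chain has no terminal state, so the "terminates" case never fires).

```python
# Iterative policy evaluation: repeatedly set V(s) = Q_pi(s, pi(s)) with
# Q_pi(s, a) = sum_s2 T(s, a, s2) * [R(s, a, s2) + gamma * V(s2)].
pi = {"s0": "go", "s1": "stay"}        # an arbitrary fixed policy

def q_pi(V, s, a):
    return sum(p * (R.get((s, a, s2), 0.0) + gamma * V[s2])
               for s2, p in T[(s, a)].items())

V = {s: 0.0 for s in S}
for _ in range(100):                   # enough sweeps to converge here
    V = {s: q_pi(V, s, pi[s]) for s in S}
print(V)                               # V["s0"] ~ 0.98, V["s1"] = 0.0
```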
Solving an MDP
In a fully known environment, we try to find the best policy, given:
Transition probability T(s, a, s′)
Reward R(s, a, s′)
Our goal is to maximize expected utility; since the model is known, we can just figure out the solution in a dark room, with no interaction needed.
Go further - Reinforcement Learning
In an unknown environment, we can only get partial information from interaction.
Transition probability T(s, a, s′): unknown
Reward R(s, a, s′): unknown
Our goal is still to maximize expected utility, but we are in a much more difficult setting called reinforcement learning. It is human-like: we go around, taking actions, getting feedback. We get rewarded for doing well and learn along the way.
Tradeoff between Exploration and Exploitation
Pure exploration or pure exploitation on its own often gives bad performance.

ε-greedy policy:

$$\pi(s) = \begin{cases} \arg\max_{a \in A(s)} Q^*(s, a) & \text{with prob. } 1 - \varepsilon \\ a \sim A & \text{with prob. } \varepsilon \end{cases}$$

Gradually decrease ε: naive agents explore a lot, sophisticated agents exploit. Does a phase transition exist?
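A sketch of the ε-greedy rule with a simple decay schedule; the tabular Q array, the decay constants, and the loop placement are illustrative assumptions.

```python
import numpy as np

# Epsilon-greedy: exploit the greedy action with prob. 1 - eps,
# otherwise sample a random action. Q is an (n_states, n_actions) table.
def epsilon_greedy(Q, s, eps, rng):
    if rng.random() < eps:
        return int(rng.integers(Q.shape[1]))   # explore
    return int(np.argmax(Q[s]))                # exploit

rng = np.random.default_rng(0)
eps, eps_min, decay = 1.0, 0.05, 0.995
# Inside the training loop one would do:
#   a = epsilon_greedy(Q, s, eps, rng)
#   eps = max(eps_min, eps * decay)            # gradually decrease epsilon
```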
Problems in RL
Curse of dimensionality: in real applications the state space is huge, and ε-greedy exploration cannot cover the whole space in a reasonable amount of time.
Key points
Epsilon-greedy: balance the exploration/exploitation tradeoff
Function approximation: can generalize to unseen states
Thanks to DNN
Powerful approximation tools, if we have enough data to fit them well.
Better than linear functions
Easy to use for starters, hard to tune
Fashionable: it works!
Deep Reinforcement Learning - DRL
Now, under the name of DRL, there are more and more interesting methods targeting different challenging problems, such as control, planning, and motor skills. Representative ones are as follows:
DQN - classic DRL with crucial techniques
DDPG - continuous control
TRPO - policy optimisation
NFSP - for Nash equilibrium
Dueling Network - improved DQN
...
What is Deep Reinforcement Learning
Deep Neural Networks × Reinforcement Learning
Introduced by Mnih et al. in 2013
The main idea is to use a DNN to approximate the action-value function Q
Performs well in many Atari games (beyond human expert players in several games)
Category of DRL Methods
Based on the methodology chosen, we can divide those methods into:
Value based
Policy based
Model based - AlphaGo
Benchmarks
Arcade Learning Environment
RLLAB
OpenAI gym
ROS
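As an example, a minimal interaction loop with OpenAI Gym (one of the benchmarks above) looks like this; CartPole-v0 and the random placeholder policy are illustrative choices, and the snippet assumes the 2016-era Gym API.

```python
import gym

# One episode with a random policy in OpenAI Gym.
env = gym.make("CartPole-v0")
obs = env.reset()
done, total_reward = False, 0.0
while not done:
    action = env.action_space.sample()         # placeholder policy
    obs, reward, done, info = env.step(action)
    total_reward += reward
print(total_reward)
```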
Q-Learning
Optimal Q-values should obey the Bellman equation:

$$Q^*(s, a) = \mathbb{E}_{s'}\left[r + \gamma \max_{a'} Q^*(s', a') \,\middle|\, s, a\right]$$

Use r + γ max_a′ Q(s′, a′, w) as a target, and minimise the MSE loss by SGD:

$$l_w = \left(r + \gamma \max_{a'} Q(s', a', w) - Q(s, a, w)\right)^2$$
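A tabular sketch of the same update; the slide's version approximates Q(s, a, w) with a network, but the target r + γ max_a′ Q(s′, a′) is the same. The learning rate and array shapes are assumptions.

```python
import numpy as np

# One Q-learning step: move Q(s, a) toward the bootstrapped target
# r + gamma * max_a' Q(s2, a'). Q is an (n_states, n_actions) table.
def q_learning_step(Q, s, a, r, s2, done, alpha=0.1, gamma=0.99):
    target = r if done else r + gamma * np.max(Q[s2])
    Q[s, a] += alpha * (target - Q[s, a])      # gradient step on the MSE loss
    return Q
```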
DQN
Definition: use a neural network for Q∗
Playing Atari [Google DeepMind, 2013]:
last 4 frames (images) ⇒ 3-layer NN ⇒ keystroke
ε-greedy, train over 10M frames with 1M replay memory
Human-level performance on some games (Breakout), less good on others (Space Invaders)
Basic DQN issues
The vanilla setting does not work well.
Successive samples are correlated, non-iid
Policy changes rapidly with slight changes to Q-values
Scale of rewards and Q-values is unknown
Cure 1 for DQN - Experience Replay
Successive samples are correlated, non-i.i.d.

Use ε-greedy policy to select action a_t
Store transition (s_t, a_t, r_{t+1}, s_{t+1}) in replay memory D
Sample a random mini-batch of transitions (s, a, r, s′) from D
Optimise MSE between Q-network and Q-learning targets, e.g.

$$L(w) = \mathbb{E}_{s,a,r,s' \sim D}\left[\left(r + \gamma \max_{a'} Q(s', a', w) - Q(s, a, w)\right)^2\right]$$
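A sketch of the replay memory D; the capacity and batch size echo the DQN numbers quoted on the next slides, but the class itself is an illustrative implementation.

```python
import random
from collections import deque

# Replay memory D: store transitions, sample uncorrelated mini-batches.
class ReplayMemory:
    def __init__(self, capacity=1000000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions drop out

    def push(self, s, a, r, s2, done):
        self.buffer.append((s, a, r, s2, done))

    def sample(self, batch_size=32):
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```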
Cure 2 for DQN - Freeze Target Q-network
Policy changes rapidly with slight changes to Q-values.

Compute Q-learning targets w.r.t. old, fixed parameters w⁻:

$$r + \gamma \max_{a'} Q(s', a', w^-)$$

Optimise MSE between Q-network and Q-learning targets:

$$L(w) = \mathbb{E}_{s,a,r,s' \sim D}\left[\left(r + \gamma \max_{a'} Q(s', a', w^-) - Q(s, a, w)\right)^2\right]$$

Periodically update the fixed parameters: w⁻ ← w
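A sketch of the frozen-target mechanics, with weights kept as dicts of arrays; note that q_values() below is a hypothetical forward pass (not a real library call), and the sync period is an assumption.

```python
# Frozen-target sketch: w is the online network's weights, w_minus the
# old fixed copy. q_values(s2, weights) is a hypothetical forward pass
# returning an array of Q-values over actions.
def td_target(r, s2, done, w_minus, gamma=0.99):
    return r if done else r + gamma * q_values(s2, w_minus).max()

def maybe_sync(step, w, w_minus, every=10000):
    if step % every == 0:                      # periodically set w- <- w
        for name in w:
            w_minus[name] = w[name].copy()
```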
Cure 3 for DQN - Reward/Value Range
Scale of rewards and Q-values is unknown.
DQN clips the rewards to [−1, +1]
This prevents Q-values from becoming too large
Ensures gradients are well-conditioned
DQN helps in playing Atari games
Deep Policy Gradient
Use a deep network with weights u to represent the policy:

$$a \sim \pi(a \mid s, u) \quad \text{or} \quad a = \pi(s, u)$$

Objective function: the total discounted reward

$$l_u = \mathbb{E}\left[r_1 + \gamma r_2 + \gamma^2 r_3 + \cdots \mid \pi(\cdot, u)\right]$$

Optimise the objective end-to-end by SGD.
Suitable for continuous control problems, since DPG tackles the action space issue.
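For the stochastic case, the classic REINFORCE estimator gives a gradient of l_u; the linear softmax policy and the episode format below are illustrative assumptions, not the method named on the slide.

```python
import numpy as np

# REINFORCE sketch: grad l_u ~ G * sum_t d/du log pi(a_t | s_t, u),
# for a linear softmax policy with u of shape (n_actions, n_features).
def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def episode_gradient(states, actions, rewards, u, gamma=0.99):
    G = sum(gamma**t * r for t, r in enumerate(rewards))  # discounted return
    grad = np.zeros_like(u)
    for s, a in zip(states, actions):
        p = softmax(u @ s)
        g_logp = -np.outer(p, s)              # d/du log pi: -p_b * s ...
        g_logp[a] += s                        # ... plus s on the taken row
        grad += G * g_logp
    return grad                               # ascend: u += lr * grad
```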
DDPG - Deep Deterministic Policy Gradient
DDPG can solve continuous control problems. It is an Actor-Critic method.
DPG [David Silver et al., 2014]
The main idea is to use DNNs to approximate the deterministic policy gradient
Suitable for continuous control problems, since DPG tackles the action space issue.
DDPG - Deep Deterministic Policy Gradient
Use an Actor network and a Critic network.

Figure: excerpted from

The Critic uses DQN to estimate the current policy's value:

$$l_w = \left(r + \gamma Q(s', \pi(s', u^-), w^-) - Q(s, a, w)\right)^2$$

The Actor improves the policy in the direction that improves Q:

$$\frac{\partial l_u}{\partial u} = \frac{\partial Q(s, a, w)}{\partial a} \frac{\partial a}{\partial u}$$

The Critic provides the loss function for the Actor.
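Putting the two losses together, one update looks roughly like this. To keep it runnable, the sketch uses linear actor/critic functions with hand-written gradients and DDPG-style soft target updates; all shapes, features, and step sizes are illustrative, not the paper's architecture.

```python
import numpy as np

# Toy DDPG-style update with linear actor/critic so the gradients can
# be written by hand.
rng = np.random.default_rng(0)
n_s, n_a = 3, 2
u = rng.normal(size=(n_a, n_s)) * 0.1    # actor:  a = u @ s
w = rng.normal(size=(n_s + n_a,)) * 0.1  # critic: Q = w . [s; a]
u_t, w_t = u.copy(), w.copy()            # target copies u-, w-
gamma, lr, tau = 0.99, 0.01, 0.005

def critic(s, a, w):  return w @ np.concatenate([s, a])
def actor(s, u):      return u @ s

def ddpg_update(s, a, r, s2, u, w, u_t, w_t):
    # Critic: regress Q(s, a, w) toward r + gamma * Q(s2, pi(s2, u-), w-)
    y = r + gamma * critic(s2, actor(s2, u_t), w_t)
    td = critic(s, a, w) - y
    w -= lr * td * np.concatenate([s, a])     # d(td^2)/dw up to a factor 2
    # Actor: ascend dQ/da * da/du; for this linear critic dQ/da = w[n_s:]
    dq_da = w[n_s:]
    u += lr * np.outer(dq_da, s)              # since each a_i = u_i . s
    # Soft-update the targets (DDPG moves them slowly toward the online nets)
    w_t[:] = (1 - tau) * w_t + tau * w
    u_t[:] = (1 - tau) * u_t + tau * u
    return u, w, u_t, w_t

# One call per transition sampled from replay memory, e.g.:
s, s2, r = rng.normal(size=n_s), rng.normal(size=n_s), 1.0
a = actor(s, u) + 0.1 * rng.normal(size=n_a)  # exploration noise
u, w, u_t, w_t = ddpg_update(s, a, r, s2, u, w, u_t, w_t)
```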
Explanation of DDPG
Three ingredients make up DDPG.
Figure: excerpted from
DDPG algorithm
It uses techniques from DQN, and therefore needs four neural networks (actor and critic, each with a target copy).
TRPO
Policy optimisation perspective. Stochastic policy gradient.
REINFORCE algorithm
Natural policy gradient
Trust region policy optimization
PLATO
Practical Suggestions
Representations + Strategy
Promising direction for achieving AGI
DL + RL = AI
End-to-End
Learn by trial-and-error
General framework
Impressive performance
Concerns about AGI
Safety problems and the problem of human replacement.
“What is the boundary of intelligence?”
“Verified Artificial Intelligence” UC Berkeley
“Concrete Problems in AI Safety” Google & OpenAI
...
To summarise
Design a suitable MDP for your problem
Get a big neural network
Use SGD to train it
Go out and play Pokémon
The End
It is the good question that leads to discovery.