Deep Reinforcement Learning: An Introduction



Xiaohu Zhu

Future Engineer Institute

July 9, 2016



Introduction

Recently, Deep Learning's influence has been growing faster and faster, and the following names have become more familiar to us than before.

University of Toronto - Geoffrey Hinton

IDSIA - Jürgen Schmidhuber

University of Alberta - Richard Sutton

NYU - Yann LeCun

University of Montreal - Yoshua Bengio

Stanford University - Li Feifei, Percy Liang

UC Berkeley - Pieter Abbeel

DeepMind - David Silver

OpenAI - Ilya Sutskever

Google - Jeff Dean

Baidu - IDL - Andrew Ng

Microsoft - MSR - Deng Li

Huawei - Noah's Ark Lab - Hang Li

...


Deep Learning

Recently, Deep Learning has played the key role in the ignition of AI:

AlphaGo

Atari games

Motor control and robotics

Poker

...


Deep Learning Architectures

Different types of architectures are used for various domains:

CNN - ResNet

RNN - LSTM - GRU - NTM

Generative Models - Adversarial Networks

...


Deep Learning Ingredients

Several key ingredients go into deep learning; a minimal code sketch combining a few of them follows the list.

1 Chain rule - elementary calculus

2 Backpropagation - efficient gradient calculation

3 SGD - optimization techniques

4 Weight sharing (over location or time)

5 Regularisation

6 Batch normalisation

7 Dropout
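To make these ingredients concrete, the following assumed PyTorch sketch (layer sizes and hyperparameters are purely illustrative, not from the slides) touches most of them at once: SGD with weight decay as regularisation, weight sharing via a convolution, batch normalisation, dropout, and backpropagation of the chain rule.

import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1),   # weight sharing over locations
    nn.BatchNorm2d(8),                           # batch normalisation
    nn.ReLU(),
    nn.Flatten(),
    nn.Dropout(p=0.5),                           # dropout
    nn.Linear(8 * 28 * 28, 10),
)
opt = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=1e-4)  # SGD + L2 regularisation

x = torch.randn(32, 1, 28, 28)             # a dummy mini-batch of 28x28 images
y = torch.randint(0, 10, (32,))            # dummy class labels
loss = nn.functional.cross_entropy(model(x), y)
opt.zero_grad()
loss.backward()                            # backpropagation = repeated chain rule
opt.step()                                 # one SGD step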


Reinforcement Learning

Agent-oriented learning: learn by interacting with the corresponding environment to reach a desired goal.

through trial and error

delayed evaluative reward


Reinforcement Learning

Figure: excerpted from Rich Sutton’s NIPS 2015 tutorial

The agent uses the information from interaction to determine the best choice

The environment could be unknown, nonlinear, stochastic and complex

A flavor of time sequentiality and online data


Position of MDP

In particular, the Markov Decision Process (MDP).

Figure: excerpted from Percy Liang’s lecture slides


Elements of MDP

A Markov Decision Process (MDP) is defined by the tuple (S, A, T, R, γ, s0); a minimal sketch in code follows the list.

S: state set

A: action set

T(s, a, s′): probability of transitioning to s′ when taking action a in state s

R(s, a, s′): reward for the transition (s, a, s′)

γ: discount factor

s0: starting state
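As a concrete illustration (an assumed sketch, not part of the slides), the tuple can be written down directly as a small Python structure; the two-state example is made up purely to show the shape of the data.

from typing import NamedTuple, Dict, Tuple, List

class MDP(NamedTuple):
    states: List[str]                            # S
    actions: List[str]                           # A
    T: Dict[Tuple[str, str], Dict[str, float]]   # T(s, a, .): distribution over s'
    R: Dict[Tuple[str, str, str], float]         # R(s, a, s')
    gamma: float                                 # discount factor
    s0: str                                      # starting state

toy = MDP(
    states=["s1", "s2"],
    actions=["stay", "go"],
    T={("s1", "stay"): {"s1": 1.0},
       ("s1", "go"):   {"s2": 0.9, "s1": 0.1},
       ("s2", "stay"): {"s2": 1.0},
       ("s2", "go"):   {"s1": 1.0}},
    R={("s1", "go", "s2"): 1.0},                 # unlisted transitions have reward 0
    gamma=0.9,
    s0="s1",
)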


MDP - Definitions

A policy π is a mapping from each state s ∈ S to an action a ∈ A. It can be used to produce an episode:

s0; a1, r1, s1; a2, r2, s2; a3, r3, s3; ...; an, rn, sn

Value function Vπ(s): the expected reward obtained when following π from state s

Vπ(s) = 0 if s is terminal, and Vπ(s) = Qπ(s, π(s)) otherwise

Q-value function Qπ(s, a): the expected discounted reward (discount γ) obtained by taking action a in state s and then following π (a tabular policy-evaluation sketch follows the formula)

Qπ(s, a) = Σ_{s′} T(s, a, s′) [ R(s, a, s′) + γ Vπ(s′) ]
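These two definitions translate directly into a tabular policy-evaluation loop. Below is a minimal assumed sketch on a made-up three-state MDP (states, transitions and rewards are illustrative only).

gamma = 0.9
states = ["s1", "s2", "end"]                 # "end" is terminal
policy = {"s1": "go", "s2": "go"}            # pi(s)
T = {("s1", "go"): {"s2": 1.0},
     ("s2", "go"): {"end": 0.5, "s1": 0.5}}
R = {("s1", "go", "s2"): 1.0, ("s2", "go", "end"): 5.0}

def Q(V, s, a):
    # Q_pi(s, a) = sum_s' T(s, a, s') [ R(s, a, s') + gamma * V_pi(s') ]
    return sum(p * (R.get((s, a, s2), 0.0) + gamma * V[s2])
               for s2, p in T[(s, a)].items())

V = {s: 0.0 for s in states}
for _ in range(100):                         # fixed-point iteration
    V = {s: 0.0 if s == "end" else Q(V, s, policy[s]) for s in states}
print(V)                                     # approximate V_pi for each state

Because γ < 1, repeating this update is a contraction, so V converges to Vπ.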


Solve MDP

In a fully known environment, we try to find a good policy:

Transition probability T(s, a, s′) known

Reward R(s, a, s′) known

Our goal is to maximize expected utility; with a known model we can simply work out the solution offline, as if in a dark room.


Go further - Reinforcement Learning

In an unknown environment, we can only get partial information from interaction:

Transition probability T(s, a, s′) unknown

Reward R(s, a, s′) unknown

Our goal is still to maximize expected utility, but we are in a much more difficult setting called reinforcement learning. It is human-like: we go around, taking actions, getting feedback; we get rewarded for doing well and learn along the way.


Tradeoff between Exploration and Exploitation

Treating exploration and exploitation separately often gives bad performance.

ε-greedy policy (a short code sketch follows)

π(s) = argmax_{a ∈ A(s)} Q∗(s, a) with probability 1 − ε; a ∼ A with probability ε

gradually decrease ε

naive: explore a lot; sophisticated: exploit. Does a phase transition exist?
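A minimal sketch of ε-greedy action selection with a gradually decaying ε (assumed code: the tabular Q, the state placeholder and the decay schedule are all made up for illustration).

import random

def epsilon_greedy(Q, state, actions, eps):
    # with prob. eps explore uniformly; otherwise exploit argmax_a Q(state, a)
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((state, a), 0.0))

Q = {}                                  # tabular Q-values, default 0
actions = [0, 1, 2]
eps, eps_min, decay = 1.0, 0.05, 0.995
for episode in range(1000):
    a = epsilon_greedy(Q, "some_state", actions, eps)
    # ... act in the environment and update Q here ...
    eps = max(eps_min, eps * decay)     # gradually decrease epsilon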


Problems in RL

Curse of dimensionality: in real applications the state space is huge, and it is impossible for ε-greedy exploration to cover the whole space in a reasonable amount of time.


Key points

Epsilon-greedy: balance the exploration/exploitation tradeoff

Function approximation: can generalize to unseen states


Thanks to DNN

Deep neural networks are powerful approximation tools if we have enough data to fit them well.

Better than linear function approximation

Easy to use for starters, hard to tune

Fashionable, and it works!


Deep Reinforcement Learning - DRL

Under the name of DRL there are now more and more interesting methods targeting different challenging problems, such as control, planning and motor skills. Representative ones are as follows:

DQN classic DRL with crucial techniques

DDPG continuous control

TRPO policy optimisation

NFSP for Nash equilibrium

Dueling Network improved DQN

...


What is Deep Reinforcement Learning

Deep Neural Networks × Reinforcement Learning

Introduced by Mnih et al. in 2013

The main idea is to use a DNN to approximate the action-value function Q

Performs well on many Atari games (beyond human expert players in several games)


Category of DRL Methods

Based on the methodology chosen, we can divide those methods into:

Value based

Policy based

Model based - AlphaGo


Benchmarks

Arcade Learning Environment

RLLAB

OpenAI gym

ROS


Q-Learning

Optimal Q-values should obey the Bellman equation:

Q∗(s, a) = E_{s′}[ r + γ max_{a′} Q∗(s′, a′) | s, a ]

Use r + γ max_{a′} Q(s′, a′, w) as a target

Minimise the MSE loss by SGD (a tabular sketch of the resulting update follows):

l(w) = ( r + γ max_{a′} Q(s′, a′, w) − Q(s, a, w) )²
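Below is a minimal tabular version of this update (assumed code; the slides use a parametric Q(s, a, w), which is replaced here by a lookup table so the target and the squared-error step are easy to see).

from collections import defaultdict

Q = defaultdict(float)                  # Q[(s, a)], defaults to 0
alpha, gamma = 0.1, 0.99

def q_update(s, a, r, s_next, actions, done):
    # target = r + gamma * max_a' Q(s', a'), with no bootstrap on terminal s'
    target = r if done else r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    # step on the squared error (target - Q(s, a))^2
    Q[(s, a)] += alpha * (target - Q[(s, a)])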


DQN

Definition: use a neural network for Q∗

Playing Atari [Google DeepMind, 2013]:

last 4 frames (images) ⇒ 3-layer NN ⇒ keystroke

ε-greedy, train over 10M frames with 1M replay memory

Human-level performance on some games (Breakout), less good on others (Space Invaders)
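A minimal sketch of such a frames-to-Q-values network in PyTorch (assumed code: the 84×84 preprocessing, filter sizes and hidden width roughly follow the 2013 setup, but treat them as illustrative).

import torch
import torch.nn as nn

n_actions = 4                                     # number of keystrokes/actions (assumed)
q_net = nn.Sequential(
    nn.Conv2d(4, 16, kernel_size=8, stride=4),    # input: the last 4 stacked 84x84 frames
    nn.ReLU(),
    nn.Conv2d(16, 32, kernel_size=4, stride=2),
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(32 * 9 * 9, 256),
    nn.ReLU(),
    nn.Linear(256, n_actions),                    # one Q(s, a) per action
)

frames = torch.zeros(1, 4, 84, 84)                # one preprocessed state
greedy_action = q_net(frames).argmax(dim=1)       # pick argmax_a Q(s, a)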


Basic DQN issues

The vanilla setting does not work well:

Successive samples are correlated, non-iid

Policy changes rapidly with slight changes to Q-values

Scale of rewards and Q-values is unknown


Cure 1 for DQN - Experience Replay

Successive samples are correlated, non-iid.

Use an ε-greedy policy to select the action a_t

Store the transition (s_t, a_t, r_{t+1}, s_{t+1}) in replay memory D (a code sketch of the memory is given below)

Sample a random mini-batch of transitions (s, a, r, s′) from D

Optimise the MSE between the Q-network and the Q-learning targets, e.g.

L(w) = E_{s,a,r,s′∼D}[ ( r + γ max_{a′} Q(s′, a′, w) − Q(s, a, w) )² ]
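A minimal sketch of the replay memory (assumed, not the DeepMind implementation): transitions go into a bounded buffer and training draws random, therefore far less correlated, mini-batches.

import random
from collections import deque

class ReplayMemory:
    def __init__(self, capacity=1_000_000):
        self.buffer = deque(maxlen=capacity)     # oldest transitions fall out

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        batch = random.sample(self.buffer, batch_size)
        s, a, r, s_next, done = zip(*batch)      # column-wise tuples
        return s, a, r, s_next, done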


Cure 2 for DQN - Freeze Target Q-network

Policy changes rapidly with slight changes to Q-values.

Compute Q-learning targets w.r.t. old, fixed parameters w−

r + γ max_{a′} Q(s′, a′, w−)

Optimise MSE between Q-network and Q-learning targets

L(w) = E_{s,a,r,s′∼D}[ ( r + γ max_{a′} Q(s′, a′, w−) − Q(s, a, w) )² ]

Periodically update fixed parameters w− ← w
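A minimal PyTorch sketch of the frozen target network w− (assumed code: network sizes, the optimiser and the refresh period C are illustrative).

import copy
import torch
import torch.nn as nn

q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))   # Q(s, a, w)
target_net = copy.deepcopy(q_net)                      # frozen copy: parameters w-
opt = torch.optim.Adam(q_net.parameters(), lr=1e-3)    # only w is optimised
gamma = 0.99

def td_loss(s, a, r, s_next, done):
    # s: [B, 4] float, a: [B] long, r: [B] float, done: [B] float (1.0 if terminal)
    with torch.no_grad():                              # targets use the frozen w-
        max_next = target_net(s_next).max(dim=1).values
        target = r + gamma * (1 - done) * max_next
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    return nn.functional.mse_loss(q_sa, target)

# per mini-batch: loss = td_loss(...); opt.zero_grad(); loss.backward(); opt.step()
# every C steps:  target_net.load_state_dict(q_net.state_dict())   # w- <- w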


Cure 3 for DQN - Reward/Value Range

Scale of rewards and Q-values is unknown

DQN clips the rewards to [−1, +1]

This prevents Q-values from becoming too large

Ensures gradients are well-conditioned
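The clipping step itself is a one-liner; in an assumed training loop it sits where the raw game reward enters the transition.

clipped_r = max(-1.0, min(1.0, float(reward)))   # clip the raw reward to [-1, +1]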


DQN helps in playing Atari games


Deep Policy Gradient

Use a deep network with weights u to represent the policy:

a = π(a | s, u)  or  a = π(s, u)

Objective function is the total discounted reward:

l(u) = E[ r1 + γ r2 + γ² r3 + · · · | π(·, u) ]

Optimise the objective end-to-end by SGD (a REINFORCE-style sketch follows)

Suitable for continuous control problems, since DPG tackled the action-space issue.
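One common way to optimise this objective for a stochastic policy π(a|s, u) is a REINFORCE-style gradient estimate; the following is a minimal assumed sketch with a small made-up network.

import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))  # logits over 2 actions
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
gamma = 0.99

def act(state):
    dist = torch.distributions.Categorical(logits=policy(state))
    a = dist.sample()
    return a.item(), dist.log_prob(a)

def reinforce_update(log_probs, rewards):
    # discounted returns G_t = r_t + gamma * G_{t+1}, computed backwards
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.insert(0, G)
    returns = torch.tensor(returns)
    # surrogate loss whose gradient is the policy gradient: - sum_t log pi(a_t|s_t, u) G_t
    loss = -(torch.stack(log_probs) * returns).sum()
    opt.zero_grad()
    loss.backward()
    opt.step()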


DDPG - Deep Deterministic Policy Gradient

DDPG can solve continuous control problems. It is an actor-critic method.

DPG [David Silver et al., 2014]

The main idea is to use a DNN to approximate the deterministic policy gradient

Suitable for continuous control problems, since DPG tackled the action-space issue.


DDPG - Deep Deterministic Policy Gradient

Use an actor network and a critic network.

Figure: excerpted from

The critic uses DQN-style targets to estimate the value of the current policy:

l(w) = ( r + γ Q(s′, π(s′, u−), w−) − Q(s, a, w) )²

The actor improves the policy in the direction that increases Q:

∂l(u)/∂u = ∂Q(s, a, w)/∂a · ∂a/∂u

The critic provides the loss function for the actor (a sketch of both updates follows).
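A minimal PyTorch sketch of the two updates (assumed code: toy network sizes, a 3-dimensional state, a 1-dimensional action, and soft target updates in place of periodic copying).

import copy
import torch
import torch.nn as nn

actor  = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 1), nn.Tanh())   # pi(s, u)
critic = nn.Sequential(nn.Linear(3 + 1, 64), nn.ReLU(), nn.Linear(64, 1))          # Q(s, a, w)
actor_t, critic_t = copy.deepcopy(actor), copy.deepcopy(critic)                    # u-, w-
actor_opt  = torch.optim.Adam(actor.parameters(),  lr=1e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)
gamma, tau = 0.99, 0.005

def ddpg_update(s, a, r, s_next, done):
    # critic loss: ( r + gamma Q(s', pi(s', u-), w-) - Q(s, a, w) )^2
    with torch.no_grad():
        q_next = critic_t(torch.cat([s_next, actor_t(s_next)], dim=1)).squeeze(1)
        target = r + gamma * (1 - done) * q_next
    q = critic(torch.cat([s, a], dim=1)).squeeze(1)
    critic_loss = nn.functional.mse_loss(q, target)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # actor loss: -Q(s, pi(s, u), w); its gradient is dQ/da * da/du
    actor_loss = -critic(torch.cat([s, actor(s)], dim=1)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    # soft update of the target networks (in place of periodic copying)
    for net, net_t in ((actor, actor_t), (critic, critic_t)):
        for p, p_t in zip(net.parameters(), net_t.parameters()):
            p_t.data.mul_(1 - tau).add_(tau * p.data)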


Explanation of DDPG

Three ingredients make up DDPG.

Figure: excerpted from


DDPG algorithm

DDPG uses the techniques from DQN, and therefore needs four neural networks (actor, critic, and a target copy of each).


TRPO

Policy optimisation perspective. Stochastic policy gradient.

REINFORCE algorithm

Natural policy gradient

Trust region policy optimization

PLATO


Practical Suggestions

Representations + Strategy


Promising direction for achieving AGI

DL + RL = AI

End-to-End

Learn by trial-and-error

General framework

Impressive performance


Concerns about AGI

Safety problems and the human-replacement problem.

“What is the boundary of intelligence?”

“Verified Artificial Intelligence” UC Berkeley

“Concrete Problems in AI Safety” Google & OpenAI

...


To summarise

Design a suitable MDP for the problem

Get a big neural network

Use SGD to train it

Go out and play Pokémon


The End

It is the good question that leads to discovery.


References

Deep Learning in Neural Networks: An Overview

Deep Learning

David Silver’s course

Percy Liang’s course
