Deep Reinforcement Learning: An Introduction



Xiaohu Zhu

Future Engineer Institute

July 9, 2016



Introduction

Recently, Deep Learning's influence has been growing faster and faster, and the following names have become more familiar to us than before.

University of Toronto - Geoffrey Hinton

IDSIA - Jürgen Schmidhuber

University of Alberta - Richard Sutton

NYU - Yann LeCun

University of Montreal - Yoshua Bengio

Stanford University - Li Feifei, Percy Liang

UC Berkeley - Pieter Abbeel

DeepMind - David Silver

OpenAI - Ilya Sutskever

Google - Jeff Dean

Baidu - IDL - Andrew Ng

Microsoft - MSR - Deng Li

Huawei - Noah's Ark Lab - Hang Li

...


Deep Learning

Recently, Deep Learning has played the key role in the ignition of AI:

AlphaGo

Atari games

Motor control and robotics

Poker

...


Deep Learning Architectures

Different types of architectures are used for various domains:

CNN - ResNet

RNN - LSTM - GRU - NTM

Generative Models - Adversarial Networks

...


Deep Learning Ingredients

Several key ingredients go into deep learning; a minimal code sketch combining a few of them follows the list.

1 Chain rule - elementary calculus

2 Backpropagation - efficient gradient calculation

3 SGD - optimization techniques

4 Weight sharing (over location or time)

5 Regularisation

6 Batch normalisation

7 Dropout
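To make these ingredients concrete, the following assumed PyTorch sketch (layer sizes and hyperparameters are purely illustrative, not from the slides) touches most of them at once: SGD with weight decay as regularisation, weight sharing via a convolution, batch normalisation, dropout, and backpropagation of the chain rule.

import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1),   # weight sharing over locations
    nn.BatchNorm2d(8),                           # batch normalisation
    nn.ReLU(),
    nn.Flatten(),
    nn.Dropout(p=0.5),                           # dropout
    nn.Linear(8 * 28 * 28, 10),
)
opt = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=1e-4)  # SGD + L2 regularisation

x = torch.randn(32, 1, 28, 28)             # a dummy mini-batch of 28x28 images
y = torch.randint(0, 10, (32,))            # dummy class labels
loss = nn.functional.cross_entropy(model(x), y)
opt.zero_grad()
loss.backward()                            # backpropagation = repeated chain rule
opt.step()                                 # one SGD step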


Reinforcement Learning

Agent-oriented learning: learn by interacting with the corresponding environment to reach a desired goal.

through trial and error

delayed evaluative reward


Reinforcement Learning

Figure: excerpted from Rich Sutton’s NIPS 2015 tutorial

The agent uses the information from interaction to determine the best choice

The environment could be unknown, nonlinear, stochastic and complex

A flavor of time sequentiality and online data


Position of MDP

In particular, the Markov Decision Process (MDP).

Figure: excerpted from Percy Liang’s lecture slides


Elements of MDP

A Markov Decision Process (MDP) is defined by the tuple (S, A, T, R, γ, s0); a minimal sketch in code follows the list.

S: state set

A: action set

T(s, a, s′): probability of transitioning to s′ when taking action a in state s

R(s, a, s′): reward for the transition (s, a, s′)

γ: discount factor

s0: starting state
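As a concrete illustration (an assumed sketch, not part of the slides), the tuple can be written down directly as a small Python structure; the two-state example is made up purely to show the shape of the data.

from typing import NamedTuple, Dict, Tuple, List

class MDP(NamedTuple):
    states: List[str]                            # S
    actions: List[str]                           # A
    T: Dict[Tuple[str, str], Dict[str, float]]   # T(s, a, .): distribution over s'
    R: Dict[Tuple[str, str, str], float]         # R(s, a, s')
    gamma: float                                 # discount factor
    s0: str                                      # starting state

toy = MDP(
    states=["s1", "s2"],
    actions=["stay", "go"],
    T={("s1", "stay"): {"s1": 1.0},
       ("s1", "go"):   {"s2": 0.9, "s1": 0.1},
       ("s2", "stay"): {"s2": 1.0},
       ("s2", "go"):   {"s1": 1.0}},
    R={("s1", "go", "s2"): 1.0},                 # unlisted transitions have reward 0
    gamma=0.9,
    s0="s1",
)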


MDP - Definitions

A policy π is a mapping from each state s ∈ S to an action a ∈ A. It can be used to produce an episode:

s0; a1, r1, s1; a2, r2, s2; a3, r3, s3; ...; an, rn, sn

Value function Vπ(s): the expected reward obtained when following π from state s

Vπ(s) = 0 if s is terminal, and Vπ(s) = Qπ(s, π(s)) otherwise

Q-value function Qπ(s, a): the expected discounted reward (discount γ) obtained by taking action a in state s and then following π (a tabular policy-evaluation sketch follows the formula)

Qπ(s, a) = Σ_{s′} T(s, a, s′) [ R(s, a, s′) + γ Vπ(s′) ]
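These two definitions translate directly into a tabular policy-evaluation loop. Below is a minimal assumed sketch on a made-up three-state MDP (states, transitions and rewards are illustrative only).

gamma = 0.9
states = ["s1", "s2", "end"]                 # "end" is terminal
policy = {"s1": "go", "s2": "go"}            # pi(s)
T = {("s1", "go"): {"s2": 1.0},
     ("s2", "go"): {"end": 0.5, "s1": 0.5}}
R = {("s1", "go", "s2"): 1.0, ("s2", "go", "end"): 5.0}

def Q(V, s, a):
    # Q_pi(s, a) = sum_s' T(s, a, s') [ R(s, a, s') + gamma * V_pi(s') ]
    return sum(p * (R.get((s, a, s2), 0.0) + gamma * V[s2])
               for s2, p in T[(s, a)].items())

V = {s: 0.0 for s in states}
for _ in range(100):                         # fixed-point iteration
    V = {s: 0.0 if s == "end" else Q(V, s, policy[s]) for s in states}
print(V)                                     # approximate V_pi for each state

Because γ < 1, repeating this update is a contraction, so V converges to Vπ.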


Solve MDP

In a fully known environment, we try to find a good policy:

Transition probability T(s, a, s′) known

Reward R(s, a, s′) known

Our goal is to maximize expected utility; with a known model we can simply work out the solution offline, as if in a dark room.


Go further - Reinforcement Learning

In an unknown environment, we can only get partial information from interaction:

Transition probability T(s, a, s′) unknown

Reward R(s, a, s′) unknown

Our goal is still to maximize expected utility, but we are in a much more difficult setting called reinforcement learning. It is human-like: we go around, taking actions, getting feedback; we get rewarded for doing well and learn along the way.


Tradeoff between Exploration and Exploitation

Treating exploration and exploitation separately often gives bad performance.

ε-greedy policy (a short code sketch follows)

π(s) = argmax_{a ∈ A(s)} Q∗(s, a) with probability 1 − ε; a ∼ A with probability ε

gradually decrease ε

naive: explore a lot; sophisticated: exploit. Does a phase transition exist?
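A minimal sketch of ε-greedy action selection with a gradually decaying ε (assumed code: the tabular Q, the state placeholder and the decay schedule are all made up for illustration).

import random

def epsilon_greedy(Q, state, actions, eps):
    # with prob. eps explore uniformly; otherwise exploit argmax_a Q(state, a)
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((state, a), 0.0))

Q = {}                                  # tabular Q-values, default 0
actions = [0, 1, 2]
eps, eps_min, decay = 1.0, 0.05, 0.995
for episode in range(1000):
    a = epsilon_greedy(Q, "some_state", actions, eps)
    # ... act in the environment and update Q here ...
    eps = max(eps_min, eps * decay)     # gradually decrease epsilon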


Problems in RL

Curse of dimensionality: in real applications the state space is huge, and it is impossible for ε-greedy exploration to cover the whole space in a reasonable amount of time.


Key points

Epsilon-greedy: balance the exploration/exploitation tradeoff

Function approximation: can generalize to unseen states


Thanks to DNN

Deep neural networks are powerful approximation tools if we have enough data to fit them well.

Better than linear function approximation

Easy to use for starters, hard to tune

Fashionable, and it works!


Deep Reinforcement Learning - DRL

Under the name of DRL there are now more and more interesting methods targeting different challenging problems, such as control, planning and motor skills. Representative ones are as follows:

DQN classic DRL with crucial techniques

DDPG continuous control

TRPO policy optimisation

NFSP for Nash equilibrium

Dueling Network improved DQN

...


What is Deep Reinforcement Learning

Deep Neural Networks × Reinforcement Learning

Introduced by Mnih et al. in 2013

The main idea is to use a DNN to approximate the action-value function Q

Performs well on many Atari games (beyond human expert players in several games)


Category of DRL Methods

Based on the methodology chosen, we can divide those methods into:

Value based

Policy based

Model based - AlphaGo


Benchmarks

Arcade Learning Environment

RLLAB

OpenAI gym

ROS


Q-Learning

Optimal Q-values should obey the Bellman equation:

Q∗(s, a) = E_{s′}[ r + γ max_{a′} Q∗(s′, a′) | s, a ]

Use r + γ max_{a′} Q(s′, a′, w) as a target

Minimise the MSE loss by SGD (a tabular sketch of the resulting update follows):

l(w) = ( r + γ max_{a′} Q(s′, a′, w) − Q(s, a, w) )²
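Below is a minimal tabular version of this update (assumed code; the slides use a parametric Q(s, a, w), which is replaced here by a lookup table so the target and the squared-error step are easy to see).

from collections import defaultdict

Q = defaultdict(float)                  # Q[(s, a)], defaults to 0
alpha, gamma = 0.1, 0.99

def q_update(s, a, r, s_next, actions, done):
    # target = r + gamma * max_a' Q(s', a'), with no bootstrap on terminal s'
    target = r if done else r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    # step on the squared error (target - Q(s, a))^2
    Q[(s, a)] += alpha * (target - Q[(s, a)])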


DQN

Definition: use a neural network for Q∗

Playing Atari [Google DeepMind, 2013]:

last 4 frames (images) ⇒ 3-layer NN ⇒ keystroke

ε-greedy, train over 10M frames with 1M replay memory

Human-level performance on some games (Breakout), less good on others (Space Invaders)
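A minimal sketch of such a frames-to-Q-values network in PyTorch (assumed code: the 84×84 preprocessing, filter sizes and hidden width roughly follow the 2013 setup, but treat them as illustrative).

import torch
import torch.nn as nn

n_actions = 4                                     # number of keystrokes/actions (assumed)
q_net = nn.Sequential(
    nn.Conv2d(4, 16, kernel_size=8, stride=4),    # input: the last 4 stacked 84x84 frames
    nn.ReLU(),
    nn.Conv2d(16, 32, kernel_size=4, stride=2),
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(32 * 9 * 9, 256),
    nn.ReLU(),
    nn.Linear(256, n_actions),                    # one Q(s, a) per action
)

frames = torch.zeros(1, 4, 84, 84)                # one preprocessed state
greedy_action = q_net(frames).argmax(dim=1)       # pick argmax_a Q(s, a)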


Basic DQN issues

The vanilla setting does not work well:

Successive samples are correlated, non-iid

Policy changes rapidly with slight changes to Q-values

Scale of rewards and Q-values is unknown


Cure 1 for DQN - Experience Replay

Successive samples are correlated, non-iid.

Use an ε-greedy policy to select the action a_t

Store the transition (s_t, a_t, r_{t+1}, s_{t+1}) in replay memory D (a code sketch of the memory is given below)

Sample a random mini-batch of transitions (s, a, r, s′) from D

Optimise the MSE between the Q-network and the Q-learning targets, e.g.

L(w) = E_{s,a,r,s′∼D}[ ( r + γ max_{a′} Q(s′, a′, w) − Q(s, a, w) )² ]
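A minimal sketch of the replay memory (assumed, not the DeepMind implementation): transitions go into a bounded buffer and training draws random, therefore far less correlated, mini-batches.

import random
from collections import deque

class ReplayMemory:
    def __init__(self, capacity=1_000_000):
        self.buffer = deque(maxlen=capacity)     # oldest transitions fall out

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        batch = random.sample(self.buffer, batch_size)
        s, a, r, s_next, done = zip(*batch)      # column-wise tuples
        return s, a, r, s_next, done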


Cure 2 for DQN - Freeze Target Q-network

Policy changes rapidly with slight changes to Q-values.

Compute Q-learning targets w.r.t. old, fixed parameters w−

r + γ max_{a′} Q(s′, a′, w−)

Optimise MSE between Q-network and Q-learning targets

L(w) = E_{s,a,r,s′∼D}[ ( r + γ max_{a′} Q(s′, a′, w−) − Q(s, a, w) )² ]

Periodically update fixed parameters w− ← w
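A minimal PyTorch sketch of the frozen target network w− (assumed code: network sizes, the optimiser and the refresh period C are illustrative).

import copy
import torch
import torch.nn as nn

q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))   # Q(s, a, w)
target_net = copy.deepcopy(q_net)                      # frozen copy: parameters w-
opt = torch.optim.Adam(q_net.parameters(), lr=1e-3)    # only w is optimised
gamma = 0.99

def td_loss(s, a, r, s_next, done):
    # s: [B, 4] float, a: [B] long, r: [B] float, done: [B] float (1.0 if terminal)
    with torch.no_grad():                              # targets use the frozen w-
        max_next = target_net(s_next).max(dim=1).values
        target = r + gamma * (1 - done) * max_next
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    return nn.functional.mse_loss(q_sa, target)

# per mini-batch: loss = td_loss(...); opt.zero_grad(); loss.backward(); opt.step()
# every C steps:  target_net.load_state_dict(q_net.state_dict())   # w- <- w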


Cure 3 for DQN - Reward/Value Range

Scale of rewards and Q-values is unknown

DQN clips the rewards to [−1, +1]

This prevents Q-values from becoming too large

Ensures gradients are well-conditioned
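The clipping step itself is a one-liner; in an assumed training loop it sits where the raw game reward enters the transition.

clipped_r = max(-1.0, min(1.0, float(reward)))   # clip the raw reward to [-1, +1]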


DQN helps in playing Atari games


Deep Policy Gradient

Use a deep network with weights u to represent the policy:

a = π(a | s, u)  or  a = π(s, u)

Objective function is the total discounted reward:

l(u) = E[ r1 + γ r2 + γ² r3 + · · · | π(·, u) ]

Optimise the objective end-to-end by SGD (a REINFORCE-style sketch follows)

Suitable for continuous control problems, since DPG tackled the action-space issue.
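One common way to optimise this objective for a stochastic policy π(a|s, u) is a REINFORCE-style gradient estimate; the following is a minimal assumed sketch with a small made-up network.

import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))  # logits over 2 actions
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
gamma = 0.99

def act(state):
    dist = torch.distributions.Categorical(logits=policy(state))
    a = dist.sample()
    return a.item(), dist.log_prob(a)

def reinforce_update(log_probs, rewards):
    # discounted returns G_t = r_t + gamma * G_{t+1}, computed backwards
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.insert(0, G)
    returns = torch.tensor(returns)
    # surrogate loss whose gradient is the policy gradient: - sum_t log pi(a_t|s_t, u) G_t
    loss = -(torch.stack(log_probs) * returns).sum()
    opt.zero_grad()
    loss.backward()
    opt.step()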


DDPG - Deep Deterministic Policy Gradient

DDPG can solve continuous control problems. It is an actor-critic method.

DPG [David Silver et al., 2014]

The main idea is to use a DNN to approximate the deterministic policy gradient

Suitable for continuous control problems, since DPG tackled the action-space issue.


DDPG - Deep Deterministic Policy Gradient

Use an actor network and a critic network.

Figure: excerpted from

The critic uses DQN-style targets to estimate the value of the current policy:

l(w) = ( r + γ Q(s′, π(s′, u−), w−) − Q(s, a, w) )²

The actor improves the policy in the direction that increases Q:

∂l(u)/∂u = ∂Q(s, a, w)/∂a · ∂a/∂u

The critic provides the loss function for the actor (a sketch of both updates follows).
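A minimal PyTorch sketch of the two updates (assumed code: toy network sizes, a 3-dimensional state, a 1-dimensional action, and soft target updates in place of periodic copying).

import copy
import torch
import torch.nn as nn

actor  = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 1), nn.Tanh())   # pi(s, u)
critic = nn.Sequential(nn.Linear(3 + 1, 64), nn.ReLU(), nn.Linear(64, 1))          # Q(s, a, w)
actor_t, critic_t = copy.deepcopy(actor), copy.deepcopy(critic)                    # u-, w-
actor_opt  = torch.optim.Adam(actor.parameters(),  lr=1e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)
gamma, tau = 0.99, 0.005

def ddpg_update(s, a, r, s_next, done):
    # critic loss: ( r + gamma Q(s', pi(s', u-), w-) - Q(s, a, w) )^2
    with torch.no_grad():
        q_next = critic_t(torch.cat([s_next, actor_t(s_next)], dim=1)).squeeze(1)
        target = r + gamma * (1 - done) * q_next
    q = critic(torch.cat([s, a], dim=1)).squeeze(1)
    critic_loss = nn.functional.mse_loss(q, target)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # actor loss: -Q(s, pi(s, u), w); its gradient is dQ/da * da/du
    actor_loss = -critic(torch.cat([s, actor(s)], dim=1)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    # soft update of the target networks (in place of periodic copying)
    for net, net_t in ((actor, actor_t), (critic, critic_t)):
        for p, p_t in zip(net.parameters(), net_t.parameters()):
            p_t.data.mul_(1 - tau).add_(tau * p.data)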


Explanation of DDPG

Three ingredients make up DDPG.

Figure: excerpted from


DDPG algorithm

DDPG uses the techniques from DQN, and therefore needs four neural networks (actor, critic, and a target copy of each).


TRPO

Policy optimisation perspective. Stochastic policy gradient.

REINFORCE algorithm

Natural policy gradient

Trust region policy optimization

PLATO


Practical Suggestions

Representations + Strategy


Promising direction for achieving AGI

DL + RL = AI

End-to-End

Learn by trial-and-error

General framework

Impressive performance


Concerns about AGI

Safety problems and the human-replacement problem.

“What is the boundary of intelligence?”

“Verified Artificial Intelligence” UC Berkeley

“Concrete Problems in AI Safety” Google & OpenAI

...


To summarise

Design a suitable MDP for the problem

Get a big neural network

Use SGD to train it

Go out and play Pokémon


The End

It is the good question that leads to discovery.


References

Deep Learning in Neural Networks: An Overview

Deep Learning

David Silver’s course

Percy Liang’s course
