Ben Lau, Quantitative Researcher, Complus Asset Management Limited, at MLConf NYC 2017
TRANSCRIPT
Deep Reinforcement Learning: using deep learning to play self-driving car games
Ben Lau
Ben Lau - Deep Learning and Reinforcement Learning
MLConf 2017, New York City
What is Reinforcement Learning?
3 classes of learning:

Supervised Learning: labeled data, direct feedback.

Unsupervised Learning: no labeled data, no feedback, "find hidden structure".

Reinforcement Learning: uses reward as feedback, learns a series of actions, trial and error.
RL: Agent and Environment
(Diagram: the Agent sends an action to the Environment; the Environment returns an observation and a reward R_t.)

At each step t, the Agent:
• receives an observation
• executes an action
• receives a reward

At each step t, the Environment:
• receives an action
• sends an observation
• sends a reward
RL: State
Experience is a sequence of observations, actions, and rewards.
The state is a summary of experience.
Note: not all states are fully observable.
(Images: a fully observable environment vs. a not fully observable one)
Approach to Reinforcement Learning
Value-Based RL: estimate the optimal value function Q*(s, a), the maximum value achievable under any policy.

Policy-Based RL: search directly for the optimal policy π*, the policy achieving maximum future reward.

Model-Based RL: build a model of the environment and plan (e.g. by lookahead) using the model.
Deep Learning + RL = AI
(Diagram: game input and a reward signal feed a deep convolutional network, which outputs Steer, Gas Pedal, and Brake.)
Policies
A deterministic policy is the agent's behavior.
It is a map from state to action: a = π(s)
In reinforcement learning, the agent's goal is to choose each action so as to maximize the sum of future rewards:
Choose a_t to maximize R_t = r_t + γ r_{t+1} + γ^2 r_{t+2} + …
γ is a discount factor in [0,1], as the reward is less certain the further away it is.

State (s)        Action (a)
Obstacle         Brake
Corner           Left/Right
Straight line    Acceleration
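The discounted sum above can be computed directly; a minimal sketch (the reward list and γ value are illustrative):

```python
def discounted_return(rewards, gamma=0.9):
    """Compute R_t = r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ..."""
    total = 0.0
    for k, r in enumerate(rewards):
        total += (gamma ** k) * r
    return total

# With gamma = 0.9 the third reward is weighted by 0.81,
# so nearer rewards count more: 1 + 0.9 + 0.81 ~= 2.71
r = discounted_return([1.0, 1.0, 1.0], gamma=0.9)
```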
Approach to Reinforcement Learning
Value-Based RL: estimate the optimal value function Q*(s, a), the maximum value achievable under any policy.

Policy-Based RL: search directly for the optimal policy π*, the policy achieving maximum future reward.

Model-Based RL: build a model of the environment and plan (e.g. by lookahead) using the model.
Value Function
A value function is a prediction of future reward: how much reward will I get from action a in state s?
A Q-value function gives the expected total reward from state-action pair (s, a), under policy π, with discount factor γ:
Q^π(s, a) = E[ r_t + γ r_{t+1} + γ^2 r_{t+2} + … | s, a ]
An optimal value function is the maximum achievable value: Q*(s, a) = max_π Q^π(s, a)
Once we have Q*, we can act optimally: π*(s) = argmax_a Q*(s, a)
Understanding Q Function
The best way to understand the Q-function is to think of it as a "strategy guide".
Suppose you are playing a difficult game (e.g. DOOM).
If you have a strategy guide, it's pretty easy: just follow the guide.
Suppose you are in state s and need to make a decision. If you have this magical Q-function (strategy guide), it is easy: just pick the action with the highest Q-value.
(Image: Doom strategy guide)
How to find the Q-function
Discounted future reward: R_t = r_t + γ r_{t+1} + γ^2 r_{t+2} + … + γ^{n-t} r_n
which can be written recursively as: R_t = r_t + γ R_{t+1}
Recall the definition of the Q-function (the maximum discounted future reward if we choose action a in state s):
Q(s_t, a_t) = max R_{t+1}
Therefore, we can rewrite the Q-function as:
Q(s, a) = r + γ max_{a'} Q(s', a')
In plain English: the maximum future reward for (s, a) is the immediate reward r plus the maximum future reward attainable in the next state s' over the next action a'.
This equation can be solved by dynamic programming or by iterative methods.
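The iterative solution can be sketched on a toy problem. Below is a minimal value-iteration-style sweep of the update Q(s, a) ← r + γ max_{a'} Q(s', a'); the 3-state chain and its rewards are made up for illustration:

```python
# Toy deterministic MDP: 3 states in a row; action 0 = left, 1 = right.
# Reaching the rightmost state gives reward 1 and ends the episode.
n_states, gamma = 3, 0.9

def step(s, a):
    s2 = max(0, s - 1) if a == 0 else min(n_states - 1, s + 1)
    done = (s2 == n_states - 1)
    return s2, (1.0 if done else 0.0), done

Q = [[0.0, 0.0] for _ in range(n_states)]
for _ in range(100):                     # repeat until the values settle
    for s in range(n_states - 1):        # terminal state keeps Q = 0
        for a in (0, 1):
            s2, r, done = step(s, a)
            # Bellman update: Q(s,a) <- r + gamma * max_a' Q(s',a')
            Q[s][a] = r + (0.0 if done else gamma * max(Q[s2]))
```

After convergence the greedy action in every state is "right", and Q(0, right) = 0.9 reflects the one-step-discounted reward of reaching the goal from the far end.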
Deep Q-Network (DQN)
The action-value function (Q-function) is often very big.
DQN idea: use a neural network to compress this Q-table, using the weights w of the network: Q(s, a; w) ≈ Q*(s, a).
Training then becomes finding the set of optimal weights w instead.
In the literature this is often called "non-linear function approximation".

State    Action    Value
A        1         140.11
A        2         139.22
B        1         145.89
B        2         140.23
C        1         123.67
C        2         135.27
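A minimal sketch of the function-approximation view. With one-hot features the "network" below is still just a table, but the update is the same gradient step used when a neural network replaces it; all numbers are illustrative:

```python
# Q(s, a; w) = w[s * n_actions + a]: a linear approximator with
# one-hot state-action features (a toy stand-in for a deep network).
n_states, n_actions, gamma, lr = 3, 2, 0.9, 0.5
w = [0.0] * (n_states * n_actions)

def q(s, a):
    return w[s * n_actions + a]

def td_update(s, a, r, s2):
    """One gradient step on (target - Q(s,a;w))^2 with respect to w."""
    target = r + gamma * max(q(s2, b) for b in range(n_actions))
    w[s * n_actions + a] += lr * (target - q(s, a))

# Train repeatedly on a single transition (s=1, a=1, r=1, s'=2);
# since state 2's values stay at zero here, q(1, 1) approaches 1.0.
for _ in range(60):
    td_update(1, 1, 1.0, 2)
```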
DQN Demo: using a Deep Q-Network to play Doom
Approach to Reinforcement Learning
Value-Based RL: estimate the optimal value function Q*(s, a), the maximum value achievable under any policy.

Policy-Based RL: search directly for the optimal policy π*, the policy achieving maximum future reward.

Model-Based RL: build a model of the environment and plan (e.g. by lookahead) using the model.
Deep Policy Network
Review: a policy is the agent's behavior.
It is a map from state to action: a_t = π(s_t)
We can search for the policy directly.
Let's parameterize the policy π_θ by some model parameters θ.
We call this policy-based reinforcement learning because we adjust the model parameters θ directly.
The goal is to maximize the total discounted reward from the beginning:
maximize R = r_1 + γ r_2 + γ^2 r_3 + …
Policy Gradient
How do we make good actions more likely?
Define the objective function as the total discounted reward:
J(θ) = E[ r_1 + γ r_2 + γ^2 r_3 + … | π_θ ]
where the expectation of the total reward R is taken under the probability distribution induced by the policy parameters θ.
The goal becomes: maximize the total reward by adjusting θ.
To do so, compute the gradient ∇_θ J(θ) and follow it (gradient ascent).
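The gradient-ascent update can be sketched on a stand-in objective. The quadratic J and the numeric gradient here are purely illustrative; in real policy-gradient RL the gradient is estimated from sampled episodes:

```python
# Toy gradient ascent on a known "total reward" J(theta) = -(theta - 2)^2,
# which is maximized at theta = 2.
def J(theta):
    return -(theta - 2.0) ** 2

theta, alpha, eps = 0.0, 0.1, 1e-5
for _ in range(200):
    # central-difference estimate of dJ/dtheta
    grad = (J(theta + eps) - J(theta - eps)) / (2 * eps)
    theta += alpha * grad        # theta <- theta + alpha * grad J(theta)
```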
Policy Gradient (II)
Recall: the Q-function is the maximum discounted future reward in state s, action a.
In the continuous case, we can write the objective as J(θ) = E[ Q(s, π_θ(s)) ]
Therefore, we can compute the gradient as ∇_θ J = E[ ∇_θ Q(s, π_θ(s)) ]
Using the chain rule, we can rewrite this as ∇_θ J = E[ ∇_a Q(s, a)|_{a=π_θ(s)} · ∇_θ π_θ(s) ]
No dynamics model required!
1. Only requires that Q is differentiable w.r.t. a
2. As long as a can be parameterized as a function of θ
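A toy numeric sketch of this chain rule; the critic Q and the linear policy are invented for illustration:

```python
import random

# Assume a known critic Q(s, a) = -(a - 3s)^2, so the best action is a = 3s.
# With the deterministic policy a = theta * s, theta should converge to 3.
def dQ_da(s, a):
    return -2.0 * (a - 3.0 * s)   # gradient of Q with respect to the action

theta, alpha = 0.0, 0.05
random.seed(0)
for _ in range(500):
    s = random.uniform(0.5, 1.5)         # sample a state
    a = theta * s                        # deterministic policy pi_theta(s)
    # chain rule: dQ/dtheta = dQ/da * da/dtheta, where da/dtheta = s
    theta += alpha * dQ_da(s, a) * s
```

Note that the update never needed the environment's dynamics, only the critic's gradient with respect to the action.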
The power of Policy Gradient
Because the policy gradient does not require a dynamics model, no prior domain knowledge is required.
AlphaGo does not pre-program any domain knowledge.
It keeps playing many games (via self-play) and adjusts the policy parameters to maximize the reward (winning probability).
Intuition: Value vs Policy RL
Value-based RL is similar to a driving instructor: a score is given for any action taken by the student.
Policy-based RL is similar to a driver: it is the actual policy of how to drive a car.
The car racing game TORCS
TORCS is a state-of-the-art open-source simulator written in C++.
Main features: sophisticated dynamics; provided with several tracks and controllers.
Sensors: rangefinder, speed, position on track, rotation speed of wheels, RPM, angle with track.
Quite realistic compared to real self-driving cars. (Image: track sensors)
Deep Learning Recipe
Game input (state s): rangefinder, speed, position on track, rotation speed of wheels, RPM, angle with track.
A deep neural network maps this state to the actions Steer, Gas Pedal, and Brake, with reward as feedback.
Compute the optimal policy via policy gradient.
Design of the reward function
Obvious choice: the highest velocity of the car.
However, use a modified reward function with a penalty on |track pos|:
this encourages the car to stay in the center of the track.
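A sketch of such a shaped reward. The exact functional form below is an assumption for illustration (forward speed along the track axis minus penalties for sideways speed and for distance from the track center), not necessarily the one used in the talk:

```python
import math

def shaped_reward(speed, angle, track_pos):
    """speed: car speed; angle: angle between car and track axis (radians);
    track_pos: normalized distance from the track center (0 = centered)."""
    forward = speed * math.cos(angle)            # progress along the track
    sideways = speed * abs(math.sin(angle))      # penalize drifting sideways
    off_center = speed * abs(track_pos)          # penalize leaving the center
    return forward - sideways - off_center
```

Driving fast, straight, and centered maximizes this reward; the same speed earns less the further the car is from the center line.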
Source code available here (Google: "DDPG Keras")
Training set: Aalborg track.
Validation set: Alpine tracks.
Recall basic machine learning: make sure you test the model on the validation set, not the training set.
Learning how to brake
Since we try to maximize the velocity of the car, the AI agent doesn't want to hit the brake at all (braking goes against the reward function).
Solution: use the "stochastic brake" idea.
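One way to sketch the stochastic brake idea; this is an illustrative guess at the mechanism (occasionally forcing a braking action during exploration so the agent still experiences braking), not the talk's exact implementation:

```python
import random

def explore_brake(policy_brake, p=0.1, rng=random):
    """With probability p, override the policy's brake output with a
    random brake pressure in [0, 1]; otherwise trust the policy."""
    if rng.random() < p:
        return rng.uniform(0.0, 1.0)
    return policy_brake
```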
Final Demo: the car does not stay in the center of the track
Future Application
Self-driving cars
Thank you! Twitter: @yanpanlau
Appendix
How to find the Q-function (II)
Q(s, a) = r + γ max_{a'} Q(s', a')
We can use an iterative method to solve for the Q-function. Given a transition (s, a, r, s'):
we want Q(s, a) to equal r + γ max_{a'} Q(s', a').
Considering finding the Q-function as a regression task, we can define a loss function:
Loss = [ (r + γ max_{a'} Q(s', a')) - Q(s, a) ]², where the first term is the target and Q(s, a) is the prediction.
Q is optimal when the loss function is at its minimum.
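The loss above can be written directly in code; a minimal sketch:

```python
def td_loss(q_sa, r, gamma, q_next_values):
    """Squared error between the TD target and the current prediction.
    q_sa: current prediction Q(s, a);
    q_next_values: Q(s', a') for every next action a'."""
    target = r + gamma * max(q_next_values)   # r + gamma * max_a' Q(s', a')
    return (target - q_sa) ** 2               # (target - prediction)^2
```

For example, with prediction 0.5, reward 1.0, γ = 0.9, and next-state values [0.0, 1.0], the target is 1.9 and the loss is (1.9 - 0.5)² = 1.96.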