Deep Reinforcement Learning: using deep learning to play self-driving car games
Ben Lau, Quantitative Researcher, Complus Asset Management Limited
MLconf NYC 2017

TRANSCRIPT

Page 1:

Deep Reinforcement Learning: using deep learning to play self-driving car games

Ben Lau

MLConf 2017, New York City

Page 2:

What is Reinforcement Learning?

Three classes of learning:

Supervised Learning: labeled data, direct feedback

Unsupervised Learning: no labeled data, no feedback, “find hidden structure”

Reinforcement Learning: reward as feedback, learns a series of actions, trial and error

Page 3:

RL: Agent and Environment

[Diagram: the agent-environment loop. The agent executes an action; the environment returns an observation and a reward R_t.]

At each step t, the agent: • receives an observation • executes an action • receives a reward

and the environment: • receives the action • sends an observation • sends a reward
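
As a concrete illustration of this loop, here is a minimal sketch using OpenAI Gym's API (Gym and the random agent are illustrative assumptions, not part of the talk):

```python
# A minimal agent-environment loop, sketched with OpenAI Gym's API
# (Gym is an illustrative assumption, not from the talk).
import gym

env = gym.make("CartPole-v1")
obs = env.reset()                 # environment sends the first observation
done, total_reward = False, 0.0
while not done:
    action = env.action_space.sample()       # a random "agent" for illustration
    obs, reward, done, _ = env.step(action)  # env receives action, returns obs + reward
    total_reward += reward
print("episode reward:", total_reward)
```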

Page 4:

RL: State

Experience is a sequence of observations, actions, and rewards

The state is a summary of experience

Note: not all states are fully observable

[Images: an example of a fully observable state vs. a not fully observable one]
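
As a worked formalization of “the state is a summary of experience” (standard notation, an assumption, since the slide states this only in words):

```latex
% Experience (history) up to time t, and the state as a function of it
h_t = (o_1, a_1, r_1, \dots, o_{t-1}, a_{t-1}, r_{t-1}, o_t), \qquad s_t = f(h_t)
```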

Page 5:

Approaches to Reinforcement Learning

Value-Based RL: estimate the optimal value function Q*(s,a); this is the maximum value achievable under any policy

Policy-Based RL: search directly for the optimal policy π*; this is the policy achieving maximum future reward

Model-Based RL: build a model of the environment, then plan (e.g. by lookahead) using the model

Page 6:

Deep Learning + RL = AI

[Diagram: game input (screen pixels) feeds a deep convolutional network, which outputs the actions Steer, Gas Pedal, and Brake; the reward signal drives learning.]

Page 7:

Policies

A deterministic policy is the agent’s behavior

It is a map from state to action: a_t = π(s_t)

In reinforcement learning, the agent’s goal is to choose each action so that it maximizes the sum of future rewards

Choose a_t to maximize R_t = r_t + γ r_{t+1} + γ² r_{t+2} + …

γ is a discount factor in [0,1], as the reward is less certain the further away it is

State (s)        Action (a)
Obstacle         Brake
Corner           Left / Right
Straight line    Acceleration
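
The table above is itself a tiny deterministic policy; as a sketch, it could be written as a plain lookup (the state labels are hypothetical, for illustration only):

```python
# The state -> action table above, written as a deterministic policy pi(s).
# State labels are hypothetical illustrations.
POLICY = {
    "obstacle": "brake",
    "corner": "steer",           # left or right, depending on the corner
    "straight_line": "accelerate",
}

def pi(state):
    return POLICY[state]

print(pi("corner"))  # -> steer
```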

Page 8:

Approaches to Reinforcement Learning

Value-Based RL: estimate the optimal value function Q*(s,a); this is the maximum value achievable under any policy

Policy-Based RL: search directly for the optimal policy π*; this is the policy achieving maximum future reward

Model-Based RL: build a model of the environment, then plan (e.g. by lookahead) using the model

Page 9:

Value Function

A value function is a prediction of future reward: how much reward will I get from action a in state s?

A Q-value function gives the expected total reward from state-action pair (s, a), under policy π, with discount factor γ

An optimal value function is the maximum achievable value

Once we have the optimal Q-function Q*, we can act optimally
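
Written out (the slide names these quantities in words; the formulas below are the standard definitions):

```latex
Q^{\pi}(s,a) = \mathbb{E}\left[\, r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots \mid s_t = s,\; a_t = a,\; \pi \,\right]
\qquad
Q^{*}(s,a) = \max_{\pi} Q^{\pi}(s,a)
\qquad
\pi^{*}(s) = \operatorname*{arg\,max}_{a} Q^{*}(s,a)
```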

Page 10:

Understanding the Q-Function

The best way to understand the Q-function is to think of it as a “strategy guide”

Suppose you are playing a difficult game (Doom)

If you have a strategy guide, it’s pretty easy: just follow the guide

Suppose you are in state s and need to make a decision. If you have this magical Q-function (strategy guide), then it is easy: just pick the action with the highest Q-value

[Image: a Doom strategy guide]

Page 11:

How to Find the Q-Function

Discounted future reward: R_t = r_t + γ r_{t+1} + γ² r_{t+2} + … + γ^(n−t) r_n

which can be written as: R_t = r_t + γ R_{t+1}

Recall the definition of the Q-function (maximum future reward when choosing action a in state s): Q(s_t, a_t) = max R_{t+1}

Therefore, we can rewrite the Q-function as: Q(s, a) = r + γ max_{a'} Q(s', a')

In plain English, it means the maximum future reward for (s, a) is the immediate reward r plus the maximum future reward in the next state s', over actions a'

It can be solved by dynamic programming or by iteration
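
As a sketch of the iterative solution, here is tabular Q-learning on a toy chain MDP (the environment, α, γ, and the exploration constant are illustrative assumptions):

```python
# Iterative solution of the Q-function (tabular Q-learning) on a toy problem.
# The chain MDP, alpha, gamma and epsilon are illustrative assumptions.
import random

n_states, n_actions = 5, 2        # a tiny chain: move left (0) or right (1)
alpha, gamma = 0.1, 0.9
Q = [[0.0] * n_actions for _ in range(n_states)]

def step(s, a):
    """Environment: reward 1 for reaching the right end, else 0."""
    s2 = max(0, min(n_states - 1, s + (1 if a == 1 else -1)))
    return s2, (1.0 if s2 == n_states - 1 else 0.0)

for episode in range(500):
    s = 0
    for t in range(20):
        # epsilon-greedy action selection
        if random.random() < 0.1:
            a = random.randrange(n_actions)
        else:
            a = max(range(n_actions), key=lambda x: Q[s][x])
        s2, r = step(s, a)
        # Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
        Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
        s = s2

print(Q)  # Q-values should favor action 1 (move right) in every state
```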

Page 12:

Deep Q-Network (DQN)

The action-value function (Q-table) is often very big

DQN idea: use a neural network to compress this Q-table, using the weights w of the network

Training then becomes finding an optimal set of weights w instead

In the literature this is often called “non-linear function approximation”

State  Action  Value
A      1       140.11
A      2       139.22
B      1       145.89
B      2       140.23
C      1       123.67
C      2       135.27
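
A minimal sketch of such a network in Keras (layer sizes and dimensions are assumptions, not the talk's model): it maps a state to one Q-value per action, so the weights w stand in for the table.

```python
# A Q-network: state in, one Q-value per action out.
# Layer sizes and input dimension are illustrative assumptions.
from keras.models import Sequential
from keras.layers import Dense

state_dim, n_actions = 4, 2

q_net = Sequential([
    Dense(64, activation="relu", input_shape=(state_dim,)),
    Dense(64, activation="relu"),
    Dense(n_actions, activation="linear"),   # Q(s, a) for each action a
])
q_net.compile(optimizer="adam", loss="mse")  # regression on the Bellman target
```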

Page 13:

DQN Demo: using a Deep Q-Network to play Doom

Page 14:

Approaches to Reinforcement Learning

Value-Based RL: estimate the optimal value function Q*(s,a); this is the maximum value achievable under any policy

Policy-Based RL: search directly for the optimal policy π*; this is the policy achieving maximum future reward

Model-Based RL: build a model of the environment, then plan (e.g. by lookahead) using the model

Page 15:

Deep Policy Network

Review: a policy is the agent’s behavior

It is a map from state to action: a_t = π(s_t)

We can search for the policy directly

Let’s parameterize the policy π_θ by some model parameters θ

We call it policy-based reinforcement learning because we adjust the model parameters θ directly

The goal is to maximize the total discounted reward from the beginning: maximize E[r_1 + γ r_2 + γ² r_3 + … | π_θ]

Page 16:

Policy Gradient

How do we make good actions more likely?

Define the objective function as the total discounted reward: J(θ) = E[r_1 + γ r_2 + γ² r_3 + … | π_θ], or equivalently J(θ) = E_{x∼p(x|θ)}[R]

where the expectation of the total reward R is taken under some probability distribution p(x|θ) parameterized by θ

The goal becomes maximizing the total reward by adjusting θ

To do this, compute the gradient ∇_θ J(θ)
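
In symbols, maximizing J by gradient ascent (the step size α is an assumption):

```latex
J(\theta) = \mathbb{E}_{x \sim p(x \mid \theta)}[R], \qquad \theta \leftarrow \theta + \alpha \, \nabla_{\theta} J(\theta)
```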

Page 17:

Policy Gradient (II)

Recall: the Q-function is the maximum discounted future reward in state s, action a

In the continuous case we can write the objective as J(θ) = E[Q(s, π_θ(s))]

Therefore, we can compute the gradient as ∇_θ J = E[∇_θ Q(s, π_θ(s))]

Using the chain rule, we can rewrite this as ∇_θ J = E[∇_a Q(s, a)|_{a=π_θ(s)} · ∇_θ π_θ(s)]

No dynamics model required!

1. It only requires that Q is differentiable w.r.t. a
2. a just needs to be parameterizable as a function of θ
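
A minimal sketch of this chain rule in code, in the style of a TensorFlow 1.x + Keras DDPG actor update (all layer sizes, dimensions, and wiring are assumptions, not the talk's actual implementation):

```python
# Deterministic policy gradient via the chain rule: dJ/dtheta = dQ/da * da/dtheta.
# TensorFlow 1.x + Keras sketch; layer sizes and dimensions are assumptions.
import tensorflow as tf
from keras.layers import Input, Dense, Concatenate
from keras.models import Model

state_dim, action_dim = 29, 3          # e.g. TORCS sensors in; steer/gas/brake out

# Actor: a = pi_theta(s)
s_in = Input(shape=(state_dim,))
h = Dense(64, activation="relu")(s_in)
a_out = Dense(action_dim, activation="tanh")(h)
actor = Model(s_in, a_out)

# Critic: Q(s, a)
s_c, a_c = Input(shape=(state_dim,)), Input(shape=(action_dim,))
q = Dense(64, activation="relu")(Concatenate()([s_c, a_c]))
q = Dense(1)(q)
critic = Model([s_c, a_c], q)

# dQ/da from the critic, then chain it through the actor's parameters.
dq_da = tf.gradients(critic.output, a_c)[0]     # gradient of Q w.r.t. the action
actor_grads = tf.gradients(actor.output,
                           actor.trainable_weights,
                           grad_ys=-dq_da)      # negated because optimizers minimize
```

At run time one would feed a batch of states through the actor, evaluate dQ/da at those (state, action) pairs, and apply actor_grads with an optimizer; note that no model of the environment's dynamics appears anywhere.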

Page 18:

The Power of Policy Gradient

Because the policy gradient does not require a dynamics model, no prior domain knowledge is required

AlphaGo doesn’t pre-program any domain knowledge

It keeps playing many games (via self-play) and adjusts the policy parameters to maximize the reward (winning probability)

Page 19:

Intuition: Value vs. Policy RL

Value-based RL is like a driving instructor: a score is given for every action the student takes

Policy-based RL is like the driver: it is the actual policy of how to drive a car

Page 20:

The Car Racing Game TORCS

TORCS is a state-of-the-art open source simulator written in C++

Main features: sophisticated dynamics, provided with several tracks and controllers

Sensors: rangefinder (track sensors), speed, position on track, rotation speed of wheels, RPM, angle with the track

Quite realistic compared to self-driving cars
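
For illustration, the sensor readout can be flattened into a single state vector; the field names below follow the common gym_torcs wrapper conventions and are assumptions, not the talk's exact code:

```python
# Assembling TORCS sensor readings into one state vector.
# Field names follow the gym_torcs convention (an assumption here).
import numpy as np

def make_state(obs):
    return np.hstack([
        obs.angle,                 # angle between the car and the track axis
        obs.track,                 # rangefinder (track sensor) readings
        obs.trackPos,              # lateral position on the track
        obs.speedX, obs.speedY, obs.speedZ,
        obs.wheelSpinVel / 100.0,  # rotation speed of the wheels (scaled)
        obs.rpm,
    ])
```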

Page 21:

Deep Learning Recipe

[Diagram: game input (state s: rangefinder, speed, position on track, rotation speed of wheels, RPM, angle with the track) feeds a deep neural network, which outputs Steer, Gas Pedal, and Brake; the reward signal drives learning.]

Compute the optimal policy via policy gradient

Page 22:

Design of the Reward Function

Obvious choice: the highest velocity of the car

Instead, use a modified reward function that penalizes |trackPos|

This encourages the car to stay in the center of the track
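
One plausible concrete form of such a reward (an assumption; the slide only says velocity, penalized by |trackPos|): reward forward speed along the track axis and subtract an off-center penalty.

```python
# A hedged sketch of the modified reward: reward forward speed along the
# track axis, penalize off-center driving. The exact form is an assumption.
import numpy as np

def reward(speed_x, angle, track_pos):
    return speed_x * np.cos(angle) - speed_x * np.abs(track_pos)
```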

Page 23:

Source code available here: Google “DDPG Keras”

Page 24:

Training Set: Aalborg Track

Page 25:

Validation Set: Alpine Tracks

Recall basic machine learning: make sure you test the model on the validation set, not the training set

Page 26:

Learning How to Brake

Since we try to maximize the velocity of the car, the AI agent doesn’t want to hit the brake at all (braking goes against the reward function)

Solution: use the stochastic brake idea (see the sketch below)
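
A hedged sketch of the stochastic brake idea: during exploration, occasionally force the brake on so the agent actually experiences braking (the probability and brake range are assumptions):

```python
# Stochastic brake during exploration: occasionally override the policy's
# brake output so braking gets explored at all. p and the brake range are
# illustrative assumptions.
import random

def stochastic_brake(brake, p=0.1):
    if random.random() < p:
        return random.uniform(0.5, 1.0)  # assumed brake range [0, 1]
    return brake
```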

Page 27:

Final Demo: the car does not stay in the center of the track

Page 28:

Future Applications

Self-driving cars

Page 29:

Future Applications

Page 30:

Thank you! Twitter: @yanpanlau

Page 31:

Appendix

Page 32:

How to Find the Q-Function (II)

Q(s, a) = r + γ max_{a'} Q(s', a')

We can use an iterative method to solve the Q-function: given a transition (s, a, r, s'), we want Q(s, a) to be the same as r + γ max_{a'} Q(s', a')

Consider finding the Q-function as a regression task; we can then define a loss function:

Loss = ( target − prediction )² = ( [r + γ max_{a'} Q(s', a')] − Q(s, a) )²

Q is optimal when the loss function is at its minimum
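
A minimal sketch of this regression step with a Keras Q-network (continuing the earlier Q-network sketch; the batch variables, shapes, and γ are assumptions):

```python
# One DQN regression step: move Q(s, a) toward the target r + gamma * max_a' Q(s', a').
# q_net maps states to per-action Q-values (see the earlier sketch); the batch
# arrays and gamma are illustrative assumptions.
def dqn_train_step(q_net, states, actions, rewards, next_states, dones, gamma=0.99):
    q_targets = q_net.predict(states)                  # predictions Q(s, .)
    next_max = q_net.predict(next_states).max(axis=1)  # max_a' Q(s', a')
    for i, a in enumerate(actions):
        q_targets[i, a] = rewards[i] + (0.0 if dones[i] else gamma * next_max[i])
    q_net.fit(states, q_targets, verbose=0)            # minimize the squared loss
```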