Deep Reinforcement Learning: using deep learning to play self-driving car games
Ben Lau, Quantitative Researcher, Complus Asset Management Limited
MLconf NYC 2017

TRANSCRIPT

Page 1:

Deep Reinforcement Learning: using deep learning to play self-driving car games

Ben Lau

MLConf 2017, New York City

Page 2:

What is Reinforcement Learning?

Three classes of learning:

Supervised Learning: labeled data, direct feedback

Unsupervised Learning: no labeled data, no feedback, “find hidden structure”

Reinforcement Learning: reward as feedback, learns a series of actions, trial and error

Page 3:

RL: Agent and Environment

[Diagram: the agent-environment loop. The agent executes an action; the environment returns an observation and a reward R_t.]

At each step t, the agent: • receives an observation • executes an action • receives a reward

and the environment: • receives the action • sends an observation • sends a reward
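
As a concrete illustration of this loop, here is a minimal sketch using OpenAI Gym's API (Gym and the random agent are illustrative assumptions, not part of the talk):

```python
# A minimal agent-environment loop, sketched with OpenAI Gym's API
# (Gym is an illustrative assumption, not from the talk).
import gym

env = gym.make("CartPole-v1")
obs = env.reset()                 # environment sends the first observation
done, total_reward = False, 0.0
while not done:
    action = env.action_space.sample()       # a random "agent" for illustration
    obs, reward, done, _ = env.step(action)  # env receives action, returns obs + reward
    total_reward += reward
print("episode reward:", total_reward)
```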

Page 4:

RL: State

Experience is a sequence of observations, actions, and rewards

The state is a summary of experience

Note: not all states are fully observable

[Images: an example of a fully observable state vs. a not fully observable one]
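
As a worked formalization of “the state is a summary of experience” (standard notation, an assumption, since the slide states this only in words):

```latex
% Experience (history) up to time t, and the state as a function of it
h_t = (o_1, a_1, r_1, \dots, o_{t-1}, a_{t-1}, r_{t-1}, o_t), \qquad s_t = f(h_t)
```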

Page 5:

Approaches to Reinforcement Learning

Value-Based RL: estimate the optimal value function Q*(s,a); this is the maximum value achievable under any policy

Policy-Based RL: search directly for the optimal policy π*; this is the policy achieving maximum future reward

Model-Based RL: build a model of the environment, then plan (e.g. by lookahead) using the model

Page 6:

Deep Learning + RL = AI

[Diagram: game input (screen pixels) feeds a deep convolutional network, which outputs the actions Steer, Gas Pedal, and Brake; the reward signal drives learning.]

Page 7:

Policies

A deterministic policy is the agent’s behavior

It is a map from state to action: a_t = π(s_t)

In reinforcement learning, the agent’s goal is to choose each action so that it maximizes the sum of future rewards

Choose a_t to maximize R_t = r_t + γ r_{t+1} + γ² r_{t+2} + …

γ is a discount factor in [0,1], as the reward is less certain the further away it is

State (s)        Action (a)
Obstacle         Brake
Corner           Left / Right
Straight line    Acceleration
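
The table above is itself a tiny deterministic policy; as a sketch, it could be written as a plain lookup (the state labels are hypothetical, for illustration only):

```python
# The state -> action table above, written as a deterministic policy pi(s).
# State labels are hypothetical illustrations.
POLICY = {
    "obstacle": "brake",
    "corner": "steer",           # left or right, depending on the corner
    "straight_line": "accelerate",
}

def pi(state):
    return POLICY[state]

print(pi("corner"))  # -> steer
```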

Page 8:

Approaches to Reinforcement Learning

Value-Based RL: estimate the optimal value function Q*(s,a); this is the maximum value achievable under any policy

Policy-Based RL: search directly for the optimal policy π*; this is the policy achieving maximum future reward

Model-Based RL: build a model of the environment, then plan (e.g. by lookahead) using the model

Page 9:

Value Function

A value function is a prediction of future reward: how much reward will I get from action a in state s?

A Q-value function gives the expected total reward from state-action pair (s, a), under policy π, with discount factor γ

An optimal value function is the maximum achievable value

Once we have the optimal Q-function Q*, we can act optimally
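
Written out (the slide names these quantities in words; the formulas below are the standard definitions):

```latex
Q^{\pi}(s,a) = \mathbb{E}\left[\, r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots \mid s_t = s,\; a_t = a,\; \pi \,\right]
\qquad
Q^{*}(s,a) = \max_{\pi} Q^{\pi}(s,a)
\qquad
\pi^{*}(s) = \operatorname*{arg\,max}_{a} Q^{*}(s,a)
```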

Page 10:

Understanding the Q-Function

The best way to understand the Q-function is to think of it as a “strategy guide”

Suppose you are playing a difficult game (Doom)

If you have a strategy guide, it’s pretty easy: just follow the guide

Suppose you are in state s and need to make a decision. If you have this magical Q-function (strategy guide), then it is easy: just pick the action with the highest Q-value

[Image: a Doom strategy guide]

Page 11:

How to Find the Q-Function

Discounted future reward: R_t = r_t + γ r_{t+1} + γ² r_{t+2} + … + γ^(n−t) r_n

which can be written as: R_t = r_t + γ R_{t+1}

Recall the definition of the Q-function (maximum future reward when choosing action a in state s): Q(s_t, a_t) = max R_{t+1}

Therefore, we can rewrite the Q-function as: Q(s, a) = r + γ max_{a'} Q(s', a')

In plain English, it means the maximum future reward for (s, a) is the immediate reward r plus the maximum future reward in the next state s', over actions a'

It can be solved by dynamic programming or by iteration
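
As a sketch of the iterative solution, here is tabular Q-learning on a toy chain MDP (the environment, α, γ, and the exploration constant are illustrative assumptions):

```python
# Iterative solution of the Q-function (tabular Q-learning) on a toy problem.
# The chain MDP, alpha, gamma and epsilon are illustrative assumptions.
import random

n_states, n_actions = 5, 2        # a tiny chain: move left (0) or right (1)
alpha, gamma = 0.1, 0.9
Q = [[0.0] * n_actions for _ in range(n_states)]

def step(s, a):
    """Environment: reward 1 for reaching the right end, else 0."""
    s2 = max(0, min(n_states - 1, s + (1 if a == 1 else -1)))
    return s2, (1.0 if s2 == n_states - 1 else 0.0)

for episode in range(500):
    s = 0
    for t in range(20):
        # epsilon-greedy action selection
        if random.random() < 0.1:
            a = random.randrange(n_actions)
        else:
            a = max(range(n_actions), key=lambda x: Q[s][x])
        s2, r = step(s, a)
        # Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
        Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
        s = s2

print(Q)  # Q-values should favor action 1 (move right) in every state
```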

Page 12:

Deep Q-Network (DQN)

The action-value function (Q-table) is often very big

DQN idea: use a neural network to compress this Q-table, using the weights w of the network

Training then becomes finding an optimal set of weights w instead

In the literature this is often called “non-linear function approximation”

State  Action  Value
A      1       140.11
A      2       139.22
B      1       145.89
B      2       140.23
C      1       123.67
C      2       135.27
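
A minimal sketch of such a network in Keras (layer sizes and dimensions are assumptions, not the talk's model): it maps a state to one Q-value per action, so the weights w stand in for the table.

```python
# A Q-network: state in, one Q-value per action out.
# Layer sizes and input dimension are illustrative assumptions.
from keras.models import Sequential
from keras.layers import Dense

state_dim, n_actions = 4, 2

q_net = Sequential([
    Dense(64, activation="relu", input_shape=(state_dim,)),
    Dense(64, activation="relu"),
    Dense(n_actions, activation="linear"),   # Q(s, a) for each action a
])
q_net.compile(optimizer="adam", loss="mse")  # regression on the Bellman target
```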

Page 13:

DQN Demo: using a Deep Q-Network to play Doom

Page 14:

Approaches to Reinforcement Learning

Value-Based RL: estimate the optimal value function Q*(s,a); this is the maximum value achievable under any policy

Policy-Based RL: search directly for the optimal policy π*; this is the policy achieving maximum future reward

Model-Based RL: build a model of the environment, then plan (e.g. by lookahead) using the model

Page 15:

Deep Policy Network

Review: a policy is the agent’s behavior

It is a map from state to action: a_t = π(s_t)

We can search for the policy directly

Let’s parameterize the policy π_θ by some model parameters θ

We call it policy-based reinforcement learning because we adjust the model parameters θ directly

The goal is to maximize the total discounted reward from the beginning: maximize E[r_1 + γ r_2 + γ² r_3 + … | π_θ]

Page 16:

Policy Gradient

How do we make good actions more likely?

Define the objective function as the total discounted reward: J(θ) = E[r_1 + γ r_2 + γ² r_3 + … | π_θ], or equivalently J(θ) = E_{x∼p(x|θ)}[R]

where the expectation of the total reward R is taken under some probability distribution p(x|θ) parameterized by θ

The goal becomes maximizing the total reward by adjusting θ

To do this, compute the gradient ∇_θ J(θ)
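
In symbols, maximizing J by gradient ascent (the step size α is an assumption):

```latex
J(\theta) = \mathbb{E}_{x \sim p(x \mid \theta)}[R], \qquad \theta \leftarrow \theta + \alpha \, \nabla_{\theta} J(\theta)
```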

Page 17:

Policy Gradient (II)

Recall: the Q-function is the maximum discounted future reward in state s, action a

In the continuous case we can write the objective as J(θ) = E[Q(s, π_θ(s))]

Therefore, we can compute the gradient as ∇_θ J = E[∇_θ Q(s, π_θ(s))]

Using the chain rule, we can rewrite this as ∇_θ J = E[∇_a Q(s, a)|_{a=π_θ(s)} · ∇_θ π_θ(s)]

No dynamics model required!

1. It only requires that Q is differentiable w.r.t. a
2. a just needs to be parameterizable as a function of θ
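
A minimal sketch of this chain rule in code, in the style of a TensorFlow 1.x + Keras DDPG actor update (all layer sizes, dimensions, and wiring are assumptions, not the talk's actual implementation):

```python
# Deterministic policy gradient via the chain rule: dJ/dtheta = dQ/da * da/dtheta.
# TensorFlow 1.x + Keras sketch; layer sizes and dimensions are assumptions.
import tensorflow as tf
from keras.layers import Input, Dense, Concatenate
from keras.models import Model

state_dim, action_dim = 29, 3          # e.g. TORCS sensors in; steer/gas/brake out

# Actor: a = pi_theta(s)
s_in = Input(shape=(state_dim,))
h = Dense(64, activation="relu")(s_in)
a_out = Dense(action_dim, activation="tanh")(h)
actor = Model(s_in, a_out)

# Critic: Q(s, a)
s_c, a_c = Input(shape=(state_dim,)), Input(shape=(action_dim,))
q = Dense(64, activation="relu")(Concatenate()([s_c, a_c]))
q = Dense(1)(q)
critic = Model([s_c, a_c], q)

# dQ/da from the critic, then chain it through the actor's parameters.
dq_da = tf.gradients(critic.output, a_c)[0]     # gradient of Q w.r.t. the action
actor_grads = tf.gradients(actor.output,
                           actor.trainable_weights,
                           grad_ys=-dq_da)      # negated because optimizers minimize
```

At run time one would feed a batch of states through the actor, evaluate dQ/da at those (state, action) pairs, and apply actor_grads with an optimizer; note that no model of the environment's dynamics appears anywhere.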

Page 18:

The Power of Policy Gradient

Because the policy gradient does not require a dynamics model, no prior domain knowledge is required

AlphaGo doesn’t pre-program any domain knowledge

It keeps playing many games (via self-play) and adjusts the policy parameters to maximize the reward (winning probability)

Page 19:

Intuition: Value vs. Policy RL

Value-based RL is like a driving instructor: a score is given for every action the student takes

Policy-based RL is like the driver: it is the actual policy of how to drive a car

Page 20:

The Car Racing Game TORCS

TORCS is a state-of-the-art open source simulator written in C++

Main features: sophisticated dynamics, provided with several tracks and controllers

Sensors: rangefinder (track sensors), speed, position on track, rotation speed of wheels, RPM, angle with the track

Quite realistic compared to self-driving cars
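
For illustration, the sensor readout can be flattened into a single state vector; the field names below follow the common gym_torcs wrapper conventions and are assumptions, not the talk's exact code:

```python
# Assembling TORCS sensor readings into one state vector.
# Field names follow the gym_torcs convention (an assumption here).
import numpy as np

def make_state(obs):
    return np.hstack([
        obs.angle,                 # angle between the car and the track axis
        obs.track,                 # rangefinder (track sensor) readings
        obs.trackPos,              # lateral position on the track
        obs.speedX, obs.speedY, obs.speedZ,
        obs.wheelSpinVel / 100.0,  # rotation speed of the wheels (scaled)
        obs.rpm,
    ])
```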

Page 21:

Deep Learning Recipe

[Diagram: game input (state s: rangefinder, speed, position on track, rotation speed of wheels, RPM, angle with the track) feeds a deep neural network, which outputs Steer, Gas Pedal, and Brake; the reward signal drives learning.]

Compute the optimal policy via policy gradient

Page 22:

Design of the Reward Function

Obvious choice: the highest velocity of the car

Instead, use a modified reward function that penalizes |trackPos|

This encourages the car to stay in the center of the track
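
One plausible concrete form of such a reward (an assumption; the slide only says velocity, penalized by |trackPos|): reward forward speed along the track axis and subtract an off-center penalty.

```python
# A hedged sketch of the modified reward: reward forward speed along the
# track axis, penalize off-center driving. The exact form is an assumption.
import numpy as np

def reward(speed_x, angle, track_pos):
    return speed_x * np.cos(angle) - speed_x * np.abs(track_pos)
```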

Page 23:

Source code available here: Google “DDPG Keras”

Page 24:

Training Set: Aalborg Track

Page 25:

Validation Set: Alpine Tracks

Recall basic machine learning: make sure you test the model on the validation set, not the training set

Page 26:

Learning How to Brake

Since we try to maximize the velocity of the car, the AI agent doesn’t want to hit the brake at all (braking goes against the reward function)

Solution: use the stochastic brake idea (see the sketch below)
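
A hedged sketch of the stochastic brake idea: during exploration, occasionally force the brake on so the agent actually experiences braking (the probability and brake range are assumptions):

```python
# Stochastic brake during exploration: occasionally override the policy's
# brake output so braking gets explored at all. p and the brake range are
# illustrative assumptions.
import random

def stochastic_brake(brake, p=0.1):
    if random.random() < p:
        return random.uniform(0.5, 1.0)  # assumed brake range [0, 1]
    return brake
```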

Page 27:

Final Demo: the car does not stay in the center of the track

Page 28:

Future Applications

Self-driving cars

Page 29:

Future Applications

Page 30:

Thank you! Twitter: @yanpanlau

Page 31:

Appendix

Page 32:

How to Find the Q-Function (II)

Q(s, a) = r + γ max_{a'} Q(s', a')

We can use an iterative method to solve the Q-function: given a transition (s, a, r, s'), we want Q(s, a) to be the same as r + γ max_{a'} Q(s', a')

Consider finding the Q-function as a regression task; we can then define a loss function:

Loss = ( target − prediction )² = ( [r + γ max_{a'} Q(s', a')] − Q(s, a) )²

Q is optimal when the loss function is at its minimum
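
A minimal sketch of this regression step with a Keras Q-network (continuing the earlier Q-network sketch; the batch variables, shapes, and γ are assumptions):

```python
# One DQN regression step: move Q(s, a) toward the target r + gamma * max_a' Q(s', a').
# q_net maps states to per-action Q-values (see the earlier sketch); the batch
# arrays and gamma are illustrative assumptions.
def dqn_train_step(q_net, states, actions, rewards, next_states, dones, gamma=0.99):
    q_targets = q_net.predict(states)                  # predictions Q(s, .)
    next_max = q_net.predict(next_states).max(axis=1)  # max_a' Q(s', a')
    for i, a in enumerate(actions):
        q_targets[i, a] = rewards[i] + (0.0 if dones[i] else gamma * next_max[i])
    q_net.fit(states, q_targets, verbose=0)            # minimize the squared loss
```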