
RL

Reinforcement Learning

Olivier Pietquin

pietquin@google.com

olivier.pietquin@univ-lille.fr

Google Research - Brain Team

Reinforcement Learning Summer School 2019

RL

OlivierPietquin

Introduction

MDP

DynamicProgramming

Part I

Reminder

RL

OlivierPietquin

Introduction

Problemdescription

MDP

DynamicProgramming

1 Introduction
Problem description

2 MDP

3 Dynamic Programming

RL

OlivierPietquin

Introduction

Problemdescription

MDP

DynamicProgramming

Learning methods

Supervised Learning

Learn a mapping between inputs and outputs;

An oracle provides labelled examples of this mapping;

Unsupervised Learning

Learn a structure in a data set (capture the distribution);

No oracle;

Reinforcement Learning

Learn to Behave!

Online Learning.

Sequential decision making, control.


RL

OlivierPietquin

Introduction

Problemdescription

MDP

DynamicProgramming

General problem

RL is a problem (unsolved), a general paradigm, not a method!

Image taken from ATR Cyber rodent project

RL

OlivierPietquin

Introduction

Problemdescription

MDP

DynamicProgramming

Induced Problems

Trial-and-error learning process

Acting is mandatory to learn.

Exploration vs Exploitation Dilemma

Should the agent follow its current policy because it knows its consequences?

Should the agent explore the environment to find a better strategy?

Delayed Rewards

The results of an action can be delayed

How to learn to sacrifice small immediate rewards to gain large long-term rewards?

RL

OlivierPietquin

Introduction

Problemdescription

MDP

DynamicProgramming

Examples

Artificial problems

Mazes or grid-worlds

Mountain car

Inverted Pendulum

Games: Backgammon, Chess, Atari, Go

Real-world problems

Man-Machine Interfaces

Data center cooling

Autonomous robotics

RL

OlivierPietquin

Introduction

Problemdescription

MDP

DynamicProgramming

Examples I

Grid World

State: x,y position

Actions: up,down,right,left

Reward: +1 for reaching goal state, 0 every other step

Cart Pole

State: angle, angular velocity

Actions: right, left

Reward: +1 for vertical position, 0 otherwise

RL

OlivierPietquin

Introduction

Problemdescription

MDP

DynamicProgramming

Examples II

Chess, Go

State: configuration of the board

Actions: move a piece, place a stone

Reward: +1 for winning, 0 for draw, -1 for loosing

Atari

RL

OlivierPietquin

Introduction

Problemdescription

MDP

DynamicProgramming

Example: Dialogue as an MDP

The dialogue strategy is optimized at an intention level.

States

Dialogue states are given by the context (e.g. information retrieved, status of a database query)

Actions

Dialogue acts: simple communicative acts (e.g. greeting, open question, confirmation)

Reward

User satisfaction, usually estimated as a function of objective measures (e.g. dialogue duration, task completion, ASR performance)

RL

OlivierPietquin

Introduction

MDP

Long term vision

Policy

Value Function

DynamicProgramming

1 Introduction

2 MDP
Long term vision
Policy
Value Function

3 Dynamic Programming

RL

OlivierPietquin

Introduction

MDP

Long term vision

Policy

Value Function

DynamicProgramming

Markov Decision Processes (MDP)

Definition (MDP)

An MDP is a tuple {S, A, (P_t), (r_t), γ} such that:

S is the state space;

A is the action space;

T is the time axis;

(P_t)_{t∈T} is a family of Markovian transition probability distributions between states, conditioned on actions, with T^a_{ss'} = P_t(s'|s, a);

(r_t)_{t∈T} is a bounded family of rewards associated to transitions;

γ is a discount factor.

Interpretation

At each time t ∈ T, the agent observes the current state s_t ∈ S and performs an action a_t ∈ A; the system is randomly driven according to T^a_{ss'} = P_t(·|s_t, a_t) to a new state s_{t+1} (P_t(s'|s, a) is the probability of stepping into state s' after performing action a at time t in state s), and the agent receives a reward r_t(s_t, a_t, s_{t+1}) ∈ R, with R^a_{ss'} = E[r_t | s, s', a].
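To make the tuple concrete, here is a minimal sketch (an assumption for illustration, not from the slides) of a small finite MDP stored as NumPy arrays: a hypothetical 2-state, 2-action chain with transition tensor P[s, a, s'] and reward tensor R[s, a, s'].

```python
import numpy as np

# Hypothetical toy MDP (not from the slides): 2 states, 2 actions.
# P[s, a, s'] = probability of moving to s' after taking a in s.
# R[s, a, s'] = reward received on that transition.
n_states, n_actions = 2, 2
gamma = 0.9

P = np.zeros((n_states, n_actions, n_states))
P[0, 0] = [0.9, 0.1]   # action 0 in state 0: mostly stay
P[0, 1] = [0.2, 0.8]   # action 1 in state 0: mostly move to state 1
P[1, 0] = [1.0, 0.0]
P[1, 1] = [0.1, 0.9]

R = np.zeros((n_states, n_actions, n_states))
R[:, :, 1] = 1.0       # reaching state 1 yields reward +1

assert np.allclose(P.sum(axis=2), 1.0)  # each P[s, a, :] is a distribution
```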

RL

OlivierPietquin

Introduction

MDP

Long term vision

Policy

Value Function

DynamicProgramming

Gain : premises of local view

Definition (Cumulative reward)

R_t = r_{t+1} + r_{t+2} + ... + r_T = Σ_{i=t+1}^{T} r_i

Definition (Discounted cumulative reward)

R_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + ... = Σ_{k=0}^{∞} γ^k r_{t+k+1}

Definition (Averaged Gain)

R_t = (1 / (T − 1)) Σ_{i=t+1}^{T} r_i
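As a quick illustration of the discounted return (a sketch with a made-up reward sequence), the sum Σ_k γ^k r_{t+k+1} can be accumulated backwards in one pass:

```python
def discounted_return(rewards, gamma=0.9):
    """Compute r_1 + gamma*r_2 + gamma^2*r_3 + ... for a finite reward list."""
    g = 0.0
    for r in reversed(rewards):  # backward accumulation: g <- r + gamma * g
        g = r + gamma * g
    return g

print(discounted_return([0.0, 0.0, 1.0], gamma=0.9))  # 0.81
```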

RL

OlivierPietquin

Introduction

MDP

Long term vision

Policy

Value Function

DynamicProgramming

Policy

π_t : S → Δ(A)

Definition (Policy or Strategy π)

The agent's policy or strategy π_t at time t is a mapping from S into distributions over A defining the agent's behavior (a mapping between situations and actions; remember Thorndike).

Definition (Optimal Policy or Strategy π*)

An optimal policy or strategy π* for a given MDP is a policy that maximises the agent's gain.

RL

OlivierPietquin

Introduction

MDP

Long term vision

Policy

Value Function

DynamicProgramming

Value Function

Definition (Value function of a state, V^π(s))

∀s ∈ S, V^π(s) = E_π[ Σ_{t=0}^{∞} γ^t r(s_t, a_t) | s_0 = s ]

V^π(s) = expected gain when starting from s and following the policy π.

Definition (Action-value function or quality function, Q^π(s, a))

∀s ∈ S, a ∈ A, Q^π(s, a) = E_π[ Σ_{t=0}^{∞} γ^t r(s_t, a_t) | s_0 = s, a_0 = a ]

Q^π(s, a) = expected gain when starting from state s, selecting action a and then following policy π.

RL

OlivierPietquin

Introduction

MDP

DynamicProgramming

BellmanEquations

Algorithms

1 Introduction

2 MDP

3 Dynamic Programming
Bellman Equations
Algorithms

RL

OlivierPietquin

Introduction

MDP

DynamicProgramming

BellmanEquations

Algorithms

Bellman evaluation equations

Bellman equations for Q^π(s, a) and V^π(s)

Q^π(s, a) = Σ_{s'} T^a_{ss'} [R^a_{ss'} + γ V^π(s')]

V^π(s) = Σ_a π(a|s) Σ_{s'} T^a_{ss'} [R^a_{ss'} + γ V^π(s')]

A system of |S| linear equations in |S| unknowns (tabular representation).
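Because policy evaluation is a linear system, V^π can be computed exactly by solving (I − γ P^π) V = R^π. A minimal NumPy sketch on a hypothetical 2-state, 2-action MDP (the arrays P, R and the uniform policy are assumptions):

```python
import numpy as np

n_states, n_actions, gamma = 2, 2, 0.9
P = np.zeros((n_states, n_actions, n_states))
P[0, 0], P[0, 1] = [0.9, 0.1], [0.2, 0.8]
P[1, 0], P[1, 1] = [1.0, 0.0], [0.1, 0.9]
R = np.zeros((n_states, n_actions, n_states))
R[:, :, 1] = 1.0                      # reward +1 for landing in state 1

pi = np.array([[0.5, 0.5],            # pi(a|s): uniform random policy
               [0.5, 0.5]])

# Policy-averaged transition matrix and expected immediate reward.
P_pi = np.einsum('sa,sat->st', pi, P)
R_pi = np.einsum('sa,sat,sat->s', pi, P, R)

# Solve (I - gamma * P_pi) V = R_pi  ->  V^pi
V_pi = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
print(V_pi)
```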

RL

OlivierPietquin

Introduction

MDP

DynamicProgramming

BellmanEquations

Algorithms

Bellman Optimality equations

Theorem (Bellman equation for V*(s))

V*(s) = max_a Σ_{s'} T^a_{ss'} [R^a_{ss'} + γ V*(s')]

Theorem (Bellman equations for Q*(s, a))

Q*(s, a) = Σ_{s'} T^a_{ss'} [R^a_{ss'} + γ V*(s')]
         = Σ_{s'} T^a_{ss'} [R^a_{ss'} + γ max_{a'} Q*(s', a')]

∀s ∈ S, π*(s) = argmax_a Σ_{s'} T^a_{ss'} [R^a_{ss'} + γ V*(s')]

RL

OlivierPietquin

Introduction

MDP

DynamicProgramming

BellmanEquations

Algorithms

Value Iteration

Value iteration algorithm

initialize V_0 ∈ V, n ← 0
while ‖V_{n+1} − V_n‖ > ε do
  for s ∈ S do
    V_{n+1}(s) = max_a Σ_{s'} T^a_{ss'} [R^a_{ss'} + γ V_n(s')]
  end for
  n ← n + 1
end while
for s ∈ S do
  π(s) = argmax_{a∈A} Σ_{s'} T^a_{ss'} [R^a_{ss'} + γ V_n(s')]
end for
return V_n, π
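A minimal NumPy sketch of the value iteration loop above, on a hypothetical 2-state, 2-action MDP (the arrays P and R and the threshold are assumptions, not from the slides):

```python
import numpy as np

n_states, n_actions, gamma, eps = 2, 2, 0.9, 1e-8
P = np.zeros((n_states, n_actions, n_states))
P[0, 0], P[0, 1] = [0.9, 0.1], [0.2, 0.8]
P[1, 0], P[1, 1] = [1.0, 0.0], [0.1, 0.9]
R = np.zeros((n_states, n_actions, n_states))
R[:, :, 1] = 1.0

V = np.zeros(n_states)
while True:
    # Q[s, a] = sum_s' P[s, a, s'] * (R[s, a, s'] + gamma * V[s'])
    Q = np.einsum('sat,sat->sa', P, R + gamma * V[None, None, :])
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) <= eps:
        V = V_new
        break
    V = V_new

policy = Q.argmax(axis=1)   # greedy policy w.r.t. the converged V
print(V, policy)
```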

RL

OlivierPietquin

Introduction

MDP

DynamicProgramming

BellmanEquations

Algorithms

Policy Iteration

Policy iteration algorithm

Init π_0 ∈ D, n ← 0
while π_{n+1} ≠ π_n do
  (Evaluation phase) solve the linear system
    V_{n+1}(s) = Σ_{s'} T^{π_n(s)}_{ss'} [R^{π_n(s)}_{ss'} + γ V_{n+1}(s')]
  (Improvement phase) for s ∈ S do
    π_{n+1}(s) = argmax_{a∈A} Σ_{s'} T^a_{ss'} [R^a_{ss'} + γ V_{n+1}(s')]
  end for
  n ← n + 1
end while
return V_{n+1}, π_{n+1}
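And a matching sketch of policy iteration on the same hypothetical toy MDP (an assumption for illustration); the evaluation phase solves the linear system exactly:

```python
import numpy as np

n_states, n_actions, gamma = 2, 2, 0.9
P = np.zeros((n_states, n_actions, n_states))
P[0, 0], P[0, 1] = [0.9, 0.1], [0.2, 0.8]
P[1, 0], P[1, 1] = [1.0, 0.0], [0.1, 0.9]
R = np.zeros((n_states, n_actions, n_states))
R[:, :, 1] = 1.0

policy = np.zeros(n_states, dtype=int)          # arbitrary initial policy
while True:
    # Evaluation phase: solve (I - gamma * P_pi) V = R_pi exactly.
    P_pi = P[np.arange(n_states), policy]                     # (S, S)
    R_pi = np.einsum('st,st->s', P_pi, R[np.arange(n_states), policy])
    V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
    # Improvement phase: act greedily w.r.t. V.
    Q = np.einsum('sat,sat->sa', P, R + gamma * V[None, None, :])
    new_policy = Q.argmax(axis=1)
    if np.array_equal(new_policy, policy):
        break
    policy = new_policy

print(V, policy)
```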

RL

OlivierPietquin

Introduction

ProblemDefinition

Monte CarloMethods

TemporalDifferences

ExplorationManagement

Conclusion

Part II

Reinforcement Learning

RL

OlivierPietquin

Introduction

ProblemDefinition

Monte CarloMethods

TemporalDifferences

ExplorationManagement

Conclusion

4 Introduction

5 Problem Definition

6 Monte Carlo Methods

7 Temporal Differences

8 Exploration Management

9 Conclusion


RL

OlivierPietquin

Introduction

ProblemDefinition

Monte CarloMethods

TemporalDifferences

ExplorationManagement

Conclusion

Reinforcement Learning

Unknown environment

If the system's dynamics are not known, learning has to happen through interaction. No policy can be learnt before some information about the environment is gathered. This setting defines the Reinforcement Learning problem.

Naive method: Adaptive DP

Learn the environment's dynamics through interaction (by sampling the distributions) and apply dynamic programming.

RL

OlivierPietquin

Introduction

ProblemDefinition

Monte CarloMethods

TemporalDifferences

ExplorationManagement

Conclusion

4 Introduction

5 Problem Definition

6 Monte Carlo Methods

7 Temporal Differences

8 Exploration Management

9 Conclusion

RL

OlivierPietquin

Introduction

ProblemDefinition

Monte CarloMethods

TemporalDifferences

ExplorationManagement

Conclusion

Monte Carlo Methods

Learning V^π(s) through sampling

Randomly choose a starting state s ∈ S

Follow the policy π and observe the cumulative gain R_t

Do this infinitely often and average: V^π(s) = E_π[R_t]

Learning Q^π(s, a) through sampling

Randomly choose a starting state s ∈ S

Randomly choose an action a ∈ A (exploring starts)

Follow policy π and observe the gain R_t

Do this infinitely often and average: Q^π(s, a) = E_π[R_t]

Improve the policy: π(s) = argmax_{a∈A} Q^π(s, a)
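A runnable sketch of first-visit Monte Carlo estimation of Q^π with exploring starts; the tiny episodic environment (a 3-state chain with terminal state 2) and the evaluated policy are assumptions made up for illustration:

```python
import numpy as np
rng = np.random.default_rng(0)

# Hypothetical episodic toy problem (not from the slides):
# 3 states on a line, state 2 is terminal, actions 0=left, 1=right, +1 on reaching 2.
def step(s, a):
    s_next = min(s + 1, 2) if a == 1 else max(s - 1, 0)
    return s_next, float(s_next == 2), s_next == 2   # (next state, reward, done)

def run_episode(s, a, policy, gamma=0.9, max_len=50):
    """Exploring start (s, a), then follow `policy`; return the discounted gain."""
    g, discount = 0.0, 1.0
    for _ in range(max_len):
        s_next, r, done = step(s, a)
        g += discount * r
        discount *= gamma
        if done:
            break
        s, a = s_next, policy[s_next]
    return g

policy = np.array([1, 1, 1])                 # evaluated policy pi: always go right
returns = {(s, a): [] for s in range(2) for a in range(2)}
for _ in range(2000):                        # exploring starts: random (s, a) pairs
    s, a = rng.integers(0, 2), rng.integers(0, 2)
    returns[(s, a)].append(run_episode(s, a, policy))

Q = np.array([[np.mean(returns[(s, a)]) for a in range(2)] for s in range(2)])
greedy = Q.argmax(axis=1)                    # improvement step: pi(s) = argmax_a Q(s, a)
print(Q, greedy)
```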


RL

OlivierPietquin

Introduction

ProblemDefinition

Monte CarloMethods

TemporalDifferences

ExplorationManagement

Conclusion

Problem

Dynamic Programming

Requires knowing the system’s dynamics

But takes the structure into account:

∀s ∈ S, V*(s) = max_{a∈A} E[ r(s, a) + γ Σ_{s'∈S} T^a_{ss'} V*(s') ]

Monte Carlo

No knowledge is necessary

No consideration is made of the structure: Q^π(s, a) = E_π[R_t]

So the agent has to wait until the end of the interaction to improve the policy

High variance

RL

OlivierPietquin

Introduction

ProblemDefinition

Monte CarloMethods

TemporalDifferences

Q-Learning

Eligibility Traces

ExplorationManagement

Conclusion

4 Introduction

5 Problem Definition

6 Monte Carlo Methods

7 Temporal Differences
Q-Learning
Eligibility Traces

8 Exploration Management

9 Conclusion

RL

OlivierPietquin

Introduction

ProblemDefinition

Monte CarloMethods

TemporalDifferences

Q-Learning

Eligibility Traces

ExplorationManagement

Conclusion

Temporal Differences (TD) I

TD Principle

Ideal case (deterministic):

V(s_t) = r_t + γ r_{t+1} + γ² r_{t+2} + γ³ r_{t+3} + ...
       = r_t + γ V(s_{t+1})

In practice:

δ_t = [r_t + γ V(s_{t+1})] − V(s_t) ≠ 0!

δ_t is the temporal difference error (TD error).

Note: r(s_t, a_t) = r_t
Note: the target is now r_t + γ V(s_{t+1}), which is biased but has lower variance.

RL

OlivierPietquin

Introduction

ProblemDefinition

Monte CarloMethods

TemporalDifferences

Q-Learning

Eligibility Traces

ExplorationManagement

Conclusion

Temporal Differences (TD) II

New Evaluation method for V

Widrow-Hoff-like update rule:

V_{t+1}(s_t) ← V_t(s_t) + α (r_t + γ V_t(s_{t+1}) − V_t(s_t))

α is the learning rate

r_t + γ V_t(s_{t+1}) is the target

RL

OlivierPietquin

Introduction

ProblemDefinition

Monte CarloMethods

TemporalDifferences

Q-Learning

Eligibility Traces

ExplorationManagement

Conclusion

SARSA

Same for Q

Q_{t+1}(s_t, a_t) ← Q_t(s_t, a_t) + α (r_t + γ Q_t(s_{t+1}, a_{t+1}) − Q_t(s_t, a_t))

SARSA

Init Q_0
for n ← 0 until N_tot − 1 do
  s_n ← StateChoice
  a_n ← ActionChoice = f(Q_n(s, a))
  Perform action a_n and observe s', r
  Choose a' = f(Q_n(s', a'))
  δ_n ← r_n + γ Q_n(s', a') − Q_n(s_n, a_n)
  Q_{n+1}(s_n, a_n) ← Q_n(s_n, a_n) + α_n(s_n, a_n) δ_n
  s ← s', a ← a'
end for
return Q_{N_tot}

RL

OlivierPietquin

Introduction

ProblemDefinition

Monte CarloMethods

TemporalDifferences

Q-Learning

Eligibility Traces

ExplorationManagement

Conclusion

Q-Learning

Learn π∗ following πt (off-policy)

Q_{t+1}(s_t, a_t) ← Q_t(s_t, a_t) + α (r_t + γ max_b Q_t(s_{t+1}, b) − Q_t(s_t, a_t))

Q-learning algorithm

for n ← 0 until N_tot − 1 do
  s_n ← StateChoice
  a_n ← ActionChoice
  (s'_n, r_n) ← Simulate(s_n, a_n)
  % Update Q_n
  Q_{n+1} ← Q_n
  δ_n ← r_n + γ max_b Q_n(s'_n, b) − Q_n(s_n, a_n)
  Q_{n+1}(s_n, a_n) ← Q_n(s_n, a_n) + α_n(s_n, a_n) δ_n
end for
return Q_{N_tot}
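A runnable sketch of tabular Q-learning on the same made-up 3-state chain (environment, hyper-parameters and the ε-greedy behaviour policy standing in for ActionChoice are all assumptions):

```python
import numpy as np
rng = np.random.default_rng(0)

n_states, n_actions, gamma, alpha, eps = 3, 2, 0.9, 0.1, 0.1

def step(s, a):                      # toy chain: state 2 terminal, +1 on reaching it
    s_next = min(s + 1, 2) if a == 1 else max(s - 1, 0)
    return s_next, float(s_next == 2), s_next == 2

Q = np.zeros((n_states, n_actions))
for episode in range(500):
    s, done = 0, False
    while not done:
        # epsilon-greedy behaviour policy (off-policy: the target is greedy)
        a = rng.integers(n_actions) if rng.random() < eps else int(Q[s].argmax())
        s_next, r, done = step(s, a)
        target = r + (0.0 if done else gamma * Q[s_next].max())
        Q[s, a] += alpha * (target - Q[s, a])        # TD update towards the max target
        s = s_next

print(Q, Q.argmax(axis=1))
```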

RL

OlivierPietquin

Introduction

ProblemDefinition

Monte CarloMethods

TemporalDifferences

Q-Learning

Eligibility Traces

ExplorationManagement

Conclusion

Q-Learning

RL

OlivierPietquin

Introduction

ProblemDefinition

Monte CarloMethods

TemporalDifferences

Q-Learning

Eligibility Traces

ExplorationManagement

Conclusion

Problem of TD(0) method

Problem

With a limited number of interactions, information propagation may not reach all the states.

Ex: grid world.

Solution?

Remember all interactions and replay them a large number of times.

RL

OlivierPietquin

Introduction

ProblemDefinition

Monte CarloMethods

TemporalDifferences

Q-Learning

Eligibility Traces

ExplorationManagement

Conclusion

Eligibility Traces

The TD framework is based on R^(1)_t = r_{t+1} + γ V_t(s_{t+1})

One can also write:

R^(2)_t = r_{t+1} + γ r_{t+2} + γ² V_t(s_{t+2})

R^(n)_t = r_{t+1} + γ r_{t+2} + ... + γ^{n−1} r_{t+n} + γ^n V_t(s_{t+n})

General update rule

ΔV_t(s_t) = α [R^(n)_t − V_t(s_t)]

RL

OlivierPietquin

Introduction

ProblemDefinition

Monte CarloMethods

TemporalDifferences

Q-Learning

Eligibility Traces

ExplorationManagement

Conclusion

Forward view I

Any average of different returns R^(n)_t can be used:

R^avg_t = 1/2 R^(2)_t + 1/2 R^(4)_t

R^avg_t = 1/3 R^(1)_t + 1/3 R^(2)_t + 1/3 R^(3)_t

Eligibility Traces

R^λ_t = (1 − λ) Σ_{n=1}^{∞} λ^{n−1} R^(n)_t

ΔV_t(s_t) = α [R^λ_t − V_t(s_t)]

0 < λ < 1

RL

OlivierPietquin

Introduction

ProblemDefinition

Monte CarloMethods

TemporalDifferences

Q-Learning

Eligibility Traces

ExplorationManagement

Conclusion

Forward view II

RL

OlivierPietquin

Introduction

ProblemDefinition

Monte CarloMethods

TemporalDifferences

Q-Learning

Eligibility Traces

ExplorationManagement

Conclusion

Backward View I

A memory variable is associated to each state (or state-action pair):

∀s, t: e_t(s) = γλ e_{t−1}(s) if s ≠ s_t
       e_t(s) = γλ e_{t−1}(s) + 1 if s = s_t

Update rule

δ_t = r_t + γ V_t(s_{t+1}) − V_t(s_t)

∀s: ΔV_t(s) = α δ_t e_t(s)
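A sketch of this backward view, TD(λ) with accumulating traces for policy evaluation, on the same made-up chain (the fixed "always right" policy and hyper-parameters are assumptions):

```python
import numpy as np

n_states, gamma, lam, alpha = 3, 0.9, 0.8, 0.1

def step(s):                          # follow the fixed policy "always right"
    s_next = min(s + 1, 2)
    return s_next, float(s_next == 2), s_next == 2

V = np.zeros(n_states)
for episode in range(200):
    e = np.zeros(n_states)            # eligibility traces, reset each episode
    s, done = 0, False
    while not done:
        s_next, r, done = step(s)
        delta = r + (0.0 if done else gamma * V[s_next]) - V[s]   # TD error
        e *= gamma * lam              # decay all traces
        e[s] += 1.0                   # accumulate the trace of the visited state
        V += alpha * delta * e        # every state is updated, weighted by its trace
        s = s_next

print(V)
```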

RL

OlivierPietquin

Introduction

ProblemDefinition

Monte CarloMethods

TemporalDifferences

Q-Learning

Eligibility Traces

ExplorationManagement

Conclusion

Backward View II

TD(λ) and Q(λ)

All states are updated, the learning rate of each state being weighted by the corresponding eligibility trace;

if λ = 0: TD(0);

if λ = 1: Monte Carlo.

Sarsa(λ)

δ_t = r_t + γ Q_t(s_{t+1}, a_{t+1}) − Q_t(s_t, a_t)

Q_{t+1}(s, a) = Q_t(s, a) + α δ_t e_t(s, a)

Watkins's Q(λ)

δ_t = r_t + γ max_b Q_t(s_{t+1}, b) − Q_t(s_t, a_t)

Q_{t+1}(s, a) = Q_t(s, a) + α δ_t e_t(s, a)

RL

OlivierPietquin

Introduction

ProblemDefinition

Monte CarloMethods

TemporalDifferences

Q-Learning

Eligibility Traces

ExplorationManagement

Conclusion

Backward View III

RL

OlivierPietquin

Introduction

ProblemDefinition

Monte CarloMethods

TemporalDifferences

Q-Learning

Eligibility Traces

ExplorationManagement

Conclusion

Interpretation

RL

OlivierPietquin

Introduction

ProblemDefinition

Monte CarloMethods

TemporalDifferences

Q-Learning

Eligibility Traces

ExplorationManagement

Conclusion

Replacing traces

RL

OlivierPietquin

Introduction

ProblemDefinition

Monte CarloMethods

TemporalDifferences

ExplorationManagement

Conclusion

4 Introduction

5 Problem Definition

6 Monte Carlo Methods

7 Temporal Differences

8 Exploration Management

9 Conclusion

RL

OlivierPietquin

Introduction

ProblemDefinition

Monte CarloMethods

TemporalDifferences

ExplorationManagement

Conclusion

Exploration Management

Action selection

Greedy Selection : a = a∗ = argmaxa Q(s, a)

ε-greedy selection: P(a*) = 1 − ε

Softmax (Gibbs or Boltzmann): P(a) = e^{Q(a)/τ} / Σ_{a'} e^{Q(a')/τ}

Optimistic Initialization

Initialize the value functions with high values so as to visit unseen states thanks to the action selection rules.

Uncertainty and value of information

Take uncertainty on the values into account.

Compute the value of information provided by exploration.
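A small sketch of the ε-greedy and softmax selection rules listed above (the Q-values and temperature are made up):

```python
import numpy as np
rng = np.random.default_rng(0)

def epsilon_greedy(q_values, eps=0.1):
    """With probability eps pick a uniformly random action, otherwise the greedy one."""
    if rng.random() < eps:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def softmax_selection(q_values, tau=1.0):
    """Gibbs/Boltzmann selection: P(a) proportional to exp(Q(a) / tau)."""
    prefs = np.asarray(q_values) / tau
    prefs -= prefs.max()                       # numerical stability
    probs = np.exp(prefs) / np.exp(prefs).sum()
    return int(rng.choice(len(q_values), p=probs))

q = [0.2, 0.5, 0.1]
print(epsilon_greedy(q), softmax_selection(q, tau=0.5))
```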

RL

OlivierPietquin

Introduction

ProblemDefinition

Monte CarloMethods

TemporalDifferences

ExplorationManagement

Conclusion

4 Introduction

5 Problem Definition

6 Monte Carlo Methods

7 Temporal Differences

8 Exploration Management

9 Conclusion

RL

OlivierPietquin

Introduction

ProblemDefinition

Monte CarloMethods

TemporalDifferences

ExplorationManagement

Conclusion

Conclusion

Good

Optimal control without models of the physics

Online learning

Bad

Large state spaces

Sample efficiency

RL

OlivierPietquin

Introduction

PolicyEvaluation

Control

Warnings

DeepQ-Network

Part III

Value Function Approximation

RL

OlivierPietquin

Introduction

PolicyEvaluation

Control

Warnings

DeepQ-Network

10 Introduction

11 Policy Evaluation

12 Control

13 Warnings

14 Deep Q-Network

RL

OlivierPietquin

Introduction

PolicyEvaluation

Control

Warnings

DeepQ-Network

The Curse of Dimensionality (Bellman) I

Some examples

Backgammon: 10^20 states [Tesauro, 1995]

Chess: 10^50 states

Go: 10^170 states, 400 actions [Silver et al., 2016]

Atari: 240x160 continuous dimensions [Mnih et al., 2015]

Robotics: multiple degrees of freedom

Language: very large discrete action space

Tabular RL

Complexity is polynomial. Doesn’t scale up.

RL

OlivierPietquin

Introduction

PolicyEvaluation

Control

Warnings

DeepQ-Network

The Curse of Dimensionality (Bellman) II

Two problems

How to handle large state/action spaces in memory?

How to generalise over state/action spaces to learn faster?

Challenges for Machine Learning

Data are not i.i.d. because they come in trajectories

Non-stationarity during control

Off-policy learning induces a difference between the observed and learnt distributions

RL

OlivierPietquin

Introduction

PolicyEvaluation

Control

Warnings

DeepQ-Network

Value Function Approximation

Parametric approximation

The value function (or Q-function) will be expressed as a function of a set of parameters θ_i:

V̂^π(s) = V_θ(s) = V(s, θ)    Q̂^π(s, a) = Q_θ(s, a) = Q(s, a, θ)

where θ is the (column) vector of parameters [θ_i]_{i=1}^{p}

Method

Search the space H = {V_θ(s) (resp. Q_θ(s, a)) | θ ∈ R^p} generated by the parameters θ_i for the best fit to V^π(s) (resp. Q^π(s, a)) by minimizing an objective function J(θ).

Goal

Learn the optimal parameters θ* = argmin_θ J(θ) from samples.

RL

OlivierPietquin

Introduction

PolicyEvaluation

Control

Warnings

DeepQ-Network

Types of parameterizations I

Linear function approximation

V_θ(s) = Σ_{i=1}^{p} θ_i φ_i(s) = θ^⊤ φ(s)

where the φ_i(s) are called basis functions (or features) and define H, and φ(s) = [φ_i(s)]_{i=1}^{p}.

Look-up table

It is a special case of linear function approximation

Parameters are the values of each state (θ_i = V(s_i) and p = |S|)

φ(s) = δ(s) = [δ_i(s)]_{i=1}^{|S|} where δ_i(s) = 1 if s = s_i and 0 otherwise

RL

OlivierPietquin

Introduction

PolicyEvaluation

Control

Warnings

DeepQ-Network

Types of parameterizations II

Neural networks

θ is the vector of synaptic weights

The input to the network is either s or (s, a)

Either a single output for V_θ(s) or Q_θ(s, a), or |A| outputs (one for each Q_θ(s, a_j))

Other approximations

Tile Coding

Regression trees, nearest neighbours, etc.

RL

OlivierPietquin

Introduction

PolicyEvaluation

Direct methods

ResidualMethods

Least-Square TD

Fitted-ValueIteration

Control

Warnings

DeepQ-Network

10 Introduction

11 Policy Evaluation
Direct methods
Residual Methods
Least-Square TD
Fitted-Value Iteration

12 Control

13 Warnings

14 Deep Q-Network

RL

OlivierPietquin

Introduction

PolicyEvaluation

Direct methods

ResidualMethods

Least-Square TD

Fitted-ValueIteration

Control

Warnings

DeepQ-Network

Direct or semi-gradient methods I

General Idea

J(θ) = ‖V^π(s) − V_θ(s)‖^p_{p,μ}

where ‖f(x)‖_{p,μ} = [ ∫_X μ(x) ‖f(x)‖^p dx ]^{1/p} is the expectation of the ℓ_p-norm according to the distribution μ.
As samples are generated by a policy π, μ is in general the stationary distribution d^π of the Markov chain induced by π.

In practice: empirical ℓ_2-norm

J(θ) = (1/N) Σ_{i=1}^{N} (v^π_i − V_θ(s_i))²

where v^π_i is a realisation of V^π(s_i)

RL

OlivierPietquin

Introduction

PolicyEvaluation

Direct methods

ResidualMethods

Least-Square TD

Fitted-ValueIteration

Control

Warnings

DeepQ-Network

Direct or semi-gradient methods II

Gradient Descent

θ ← θ − (α/2) ∇_θ J(θ)

∇_θ J(θ) = −(2/N) Σ_{i=1}^{N} (v^π_i − V_θ(s_i)) ∇_θ V_θ(s_i)

Stochastic Gradient Descent

θ ← θ − (α_i/2) ∇_θ [v^π_i − V_θ(s_i)]²
  ← θ + α_i ∇_θ V_θ(s_i) (v^π_i − V_θ(s_i))

RL

OlivierPietquin

Introduction

PolicyEvaluation

Direct methods

ResidualMethods

Least-Square TD

Fitted-ValueIteration

Control

Warnings

DeepQ-Network

Direct or semi-gradient methods III

Problem

vπi is of course unknown.

Different solutions

Monte Carlo estimate: v^π_i ≈ G^H_i = Σ_{t=i}^{i+H} γ^{t−i} r(s_t, a_t)

TD(0) estimate: v^π_i ≈ r(s_i, a_i) + γ V_θ(s_{i+1})

TD(λ) estimate: v^π_i ≈ G^λ_i = (1 − λ) Σ_t λ^{t−1} G^(t)_i

Most often used: the TD(0) estimate (bootstrapping)

Replace v^π_i by its current estimate according to the Bellman equation, r(s_i, a_i) + γ V_{θ_{i−1}}(s_{i+1}):

θ ← θ + α_i ∇_θ V_θ(s_i) (r(s_i, a_i) + γ V_θ(s_{i+1}) − V_θ(s_i))

RL

OlivierPietquin

Introduction

PolicyEvaluation

Direct methods

ResidualMethods

Least-Square TD

Fitted-ValueIteration

Control

Warnings

DeepQ-Network

Direct or semi-gradient methods IV

Linear TD(0)

V_θ(s) = θ^⊤ φ(s)

∇_θ V_θ(s) = φ(s)

Linear TD(0) update:

θ ← θ + α_i φ(s_i) (r(s_i, a_i) + γ θ^⊤ φ(s_{i+1}) − θ^⊤ φ(s_i))

Notes

This generalises exact TD(0) (using φ(s) = δ(s))

Guaranteed to converge to the global optimum with linear function approximation

No guarantee in the general case.
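A sketch of the linear semi-gradient TD(0) update on the made-up chain, using one-hot features so that it reduces to the exact tabular case (features, evaluated policy and step size are assumptions):

```python
import numpy as np

n_states, gamma, alpha = 3, 0.9, 0.1

def phi(s):                                   # one-hot features: recovers exact TD(0)
    f = np.zeros(n_states)
    f[s] = 1.0
    return f

def step(s):                                  # fixed policy: always move right
    s_next = min(s + 1, 2)
    return s_next, float(s_next == 2), s_next == 2

theta = np.zeros(n_states)
for episode in range(300):
    s, done = 0, False
    while not done:
        s_next, r, done = step(s)
        v_next = 0.0 if done else theta @ phi(s_next)
        td_error = r + gamma * v_next - theta @ phi(s)
        theta += alpha * td_error * phi(s)     # semi-gradient: only grad of V_theta(s)
        s = s_next

print(theta)                                   # approximates V^pi
```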

RL

OlivierPietquin

Introduction

PolicyEvaluation

Direct methods

ResidualMethods

Least-Square TD

Fitted-ValueIteration

Control

Warnings

DeepQ-Network

Residual or full gradient methods I

Semi versus full gradient

Semi-gradient: the estimate of v^π_i does not follow the gradient of J(θ), only ∇_θ V_θ (TD(0) is substituted before differentiation)

Same as minimizing the Bellman residual:

J(θ) = ‖T^π V_θ(s) − V_θ(s)‖^p_{μ,p}

where T^π is the evaluation Bellman operator:

T^π V(s) = E_π[R(s, a) + γ V(s')]

RL

OlivierPietquin

Introduction

PolicyEvaluation

Direct methods

ResidualMethods

Least-Square TD

Fitted-ValueIteration

Control

Warnings

DeepQ-Network

Residual or full gradient methods II

Residual approach [Baird, 1995]

V̂θ(s) must satisfy Bellman equation (V π = TπV π):

J(θ) = ‖TπVθ(s)− Vθ(s)‖pµ,p

In practice

J(θ) = (1/N) Σ_{i=1}^{N} (T̂^π V_θ(s_i) − V_θ(s_i))²

with T̂^π V(s) = r(s, π(s)) + γ V(s')

RL

OlivierPietquin

Introduction

PolicyEvaluation

Direct methods

ResidualMethods

Least-Square TD

Fitted-ValueIteration

Control

Warnings

DeepQ-Network

Residual or full gradient methods III

Gradient descent

θ ← θ − (α/N) Σ_{i=1}^{N} (∇_θ T̂^π V_θ(s_i) − ∇_θ V_θ(s_i)) (T̂^π V_θ(s_i) − V_θ(s_i))

Stochastic Gradient Descent

θ ← θ − α_i (∇_θ T̂^π V_θ(s_i) − ∇_θ V_θ(s_i)) (T̂^π V_θ(s_i) − V_θ(s_i))

Linear residual

θ ← θ − α_i (γ φ(s_{i+1}) − φ(s_i)) (r(s_i, a_i) + γ θ^⊤ φ(s_{i+1}) − θ^⊤ φ(s_i))

RL

OlivierPietquin

Introduction

PolicyEvaluation

Direct methods

ResidualMethods

Least-Square TD

Fitted-ValueIteration

Control

Warnings

DeepQ-Network

Residual or full gradient methods IV

Problem

The approach works with deterministic MDPs

In stochastic MDPs, the estimator is biased:

E[ (T̂^π V_θ(s) − V_θ(s))² ] = ( E[T̂^π V_θ(s) − V_θ(s)] )² + Var( T̂^π V_θ(s) − V_θ(s) )
                            ≠ ( E[T̂^π V_θ(s) − V_θ(s)] )²

Solution: double sampling [Baird, 1995]

θ ← θ − α_i [ γ ∇_θ V_θ(s¹_{i+1}) − ∇_θ V_θ(s_i) ] ( r(s_i, a_i) + γ V_θ(s²_{i+1}) − V_θ(s_i) )

where s¹_{i+1} and s²_{i+1} are two independent samples of the next state.

RL

OlivierPietquin

Introduction

PolicyEvaluation

Direct methods

ResidualMethods

Least-Square TD

Fitted-ValueIteration

Control

Warnings

DeepQ-Network

Least-Square Temporal Differences I

General idea (batch method)

Let's define Π as the projection operator such that:

ΠV = argmin_{V_θ ∈ H} ‖V − V_θ‖^q_{ν,q}

Least-squares TD minimizes the distance between the current estimate V_θ and the projection onto H of T^π V_θ(s):

J(θ) = ‖V_θ(s) − Π T^π V_θ(s)‖^p_{μ,p}

RL

OlivierPietquin

Introduction

PolicyEvaluation

Direct methods

ResidualMethods

Least-Square TD

Fitted-ValueIteration

Control

Warnings

DeepQ-Network

Least-Square Temporal Differences II

Two nested optimisation problems

J_1(θ) = (1/N) Σ_{i=1}^{N} ‖V_θ(s_i) − V_ω(s_i)‖²

J_2(ω) = (1/N) Σ_{i=1}^{N} ‖V_ω(s_i) − (r(s_i, a_i) + γ V_θ(s_{i+1}))‖²

Linear solution: LSTD [Bradtke and Barto, 1996, Boyan, 1999]

θ* = [ Σ_{i=1}^{N} φ(s_i) (φ(s_i) − γ φ(s_{i+1}))^⊤ ]^{−1} Σ_{i=1}^{N} φ(s_i) r(s_i, a_i)
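The closed-form LSTD solution can be computed directly from a batch of transitions; a sketch on synthetic data from the made-up chain (features, the data-collecting policy and the zero-feature convention for the terminal state are assumptions):

```python
import numpy as np

n_features, gamma = 2, 0.9                     # features for the 2 non-terminal states

def phi(s):
    f = np.zeros(n_features)
    if s < 2:                                   # terminal state (2) gets zero features
        f[s] = 1.0
    return f

# Collect transitions (s, r, s') under the fixed policy "always right".
transitions = []
for _ in range(200):
    s, done = 0, False
    while not done:
        s_next = min(s + 1, 2)
        r, done = float(s_next == 2), s_next == 2
        transitions.append((s, r, s_next))
        s = s_next

A = np.zeros((n_features, n_features))
b = np.zeros(n_features)
for s, r, s_next in transitions:
    A += np.outer(phi(s), phi(s) - gamma * phi(s_next))
    b += phi(s) * r

theta = np.linalg.solve(A, b)                   # theta* = A^{-1} b
print(theta)                                    # ~ [0.9, 1.0]: V(0) and V(1)
```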

RL

OlivierPietquin

Introduction

PolicyEvaluation

Direct methods

ResidualMethods

Least-Square TD

Fitted-ValueIteration

Control

Warnings

DeepQ-Network

Iterative projected fixed point

Fitted value iteration

Under some conditions, the composition of Π and T^π remains a contraction. The Fitted Value Iteration (FVI) procedure consists in iteratively applying the rule:

V_θ ← Π T V_θ

In practice (batch method) [Gordon, 1995]

Collect a set of transitions with π: {s_i, a_i, r(s_i, a_i), s_{i+1}}_{i=1}^{N}

Initialise θ_0

Build a data set: D_t = {s_i, T̂^π V_{θ_t}(s_i)} = {s_i, r(s_i, a_i) + γ V_{θ_t}(s_{i+1})}_{i=1}^{N}

Regress on D_t to find θ_{t+1}

Iterate until convergence

RL

OlivierPietquin

Introduction

PolicyEvaluation

Control

SARSA

LSPI

Fitted-Q

Warnings

DeepQ-Network

10 Introduction

11 Policy Evaluation

12 Control
SARSA
LSPI
Fitted-Q

13 Warnings

14 Deep Q-Network

RL

OlivierPietquin

Introduction

PolicyEvaluation

Control

SARSA

LSPI

Fitted-Q

Warnings

DeepQ-Network

Mainly Policy Iteration

1 Learn an approximation of Q^π ≈ Q_θ
2 Improve the policy (ε-greedy or Softmax)

RL

OlivierPietquin

Introduction

PolicyEvaluation

Control

SARSA

LSPI

Fitted-Q

Warnings

DeepQ-Network

Approximate SARSA

Linear approximation of Q^π

Q_θ(s, a) = θ^⊤ φ(s, a)

Linear SARSA

Init θ_0
for n ← 0 until N_tot − 1 do
  s_n ← StateChoice
  a_n ← ActionChoice = f(Q_{θ_n}(s_n, a))
  Perform action a_n and observe s_{n+1}, r(s_n, a_n)
  Choose a_{n+1} = f(Q_{θ_n}(s_{n+1}, a))
  δ_n ← r(s_n, a_n) + γ θ_n^⊤ φ(s_{n+1}, a_{n+1}) − θ_n^⊤ φ(s_n, a_n)
  θ_{n+1} ← θ_n + α_n δ_n φ(s_n, a_n)
end for
return θ_{N_tot}

RL

OlivierPietquin

Introduction

PolicyEvaluation

Control

SARSA

LSPI

Fitted-Q

Warnings

DeepQ-Network

Least Square Policy Iteration

Include LSTD into a policy iteration loop

Build a data set with a random π: {s_i, a_i, r(s_i, a_i), s'_i}_{i=1}^{N}

Evaluate π with LSTD: Q_θ

π ← greedy(Q_θ)

(resample with π = f(Q_θ))

Iterate until convergence

Problem

Being greedy on approximation is unstable.

RL

OlivierPietquin

Introduction

PolicyEvaluation

Control

SARSA

LSPI

Fitted-Q

Warnings

DeepQ-Network

Fitted-Q iteration

Replace V π by Q∗ [Riedmiller, 2005, Ernst et al., 2005]

Collect a set of transitions with π: {s_i, a_i, r(s_i, a_i), s_{i+1}}_{i=1}^{N}

Initialise θ_0

Build a data set: D_t = {(s_i, a_i), T̂* Q_{θ_t}(s_i, a_i)} = {(s_i, a_i), r(s_i, a_i) + γ max_b Q_{θ_t}(s_{i+1}, b)}_{i=1}^{N}

Regress on D_t to find θ_{t+1}

(resample with π = f(Q_θ(s, a)))

Iterate until convergence

Output π(s) = argmax_a Q_θ(s, a)

Good point

There are no assumptions (yet) about the parameterisation (not necessarily linear)
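A sketch of the fitted-Q iteration scheme above; it is not the tree-based or neural versions from the cited papers, but a toy variant using least-squares regression on one-hot (state, action) features over a made-up batch of transitions (environment, features and iteration counts are all assumptions):

```python
import numpy as np
rng = np.random.default_rng(0)

n_states, n_actions, gamma = 3, 2, 0.9

def feat(s, a):                                 # one-hot (state, action) features
    f = np.zeros(n_states * n_actions)
    f[s * n_actions + a] = 1.0
    return f

def step(s, a):                                 # made-up chain, terminal state 2, +1 reward
    s_next = min(s + 1, 2) if a == 1 else max(s - 1, 0)
    return s_next, float(s_next == 2), s_next == 2

# Batch of transitions collected with a uniformly random behaviour policy.
batch = []
for _ in range(1000):
    s, a = int(rng.integers(0, 2)), int(rng.integers(0, n_actions))
    s_next, r, done = step(s, a)
    batch.append((s, a, r, s_next, done))

X = np.array([feat(s, a) for s, a, *_ in batch])
theta = np.zeros(n_states * n_actions)
for _ in range(50):                             # fitted-Q iterations
    # Regression targets: r + gamma * max_b Q_theta(s', b), no bootstrap past terminals.
    y = np.array([r + (0.0 if done else gamma * max(theta @ feat(s_next, b)
                                                    for b in range(n_actions)))
                  for s, a, r, s_next, done in batch])
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)   # regress on D_t to get theta_{t+1}

Q = theta.reshape(n_states, n_actions)
print(Q, Q.argmax(axis=1))                      # greedy policy from the fitted Q
```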

RL

OlivierPietquin

Introduction

PolicyEvaluation

Control

Warnings

DeepQ-Network

10 Introduction

11 Policy Evaluation

12 Control

13 Warnings

14 Deep Q-Network

RL

OlivierPietquin

Introduction

PolicyEvaluation

Control

Warnings

DeepQ-Network

Usage of value function approximation for control

Algorithm    | Look-up | Linear | Non-linear
Monte Carlo  |    ✓    |   ✓    |     ✗
SARSA        |    ✓    |   ✓    |     ✗
Q-learning   |    ✓    |   ✗    |     ✗
LSPI         |    ✓    |   ✓    |     ✗
Fitted-Q     |    ✓    |   ✓    |     ✓

Table: Algorithms comparison

✓: oscillates around the optimal policy
✓: with some tricks

RL

OlivierPietquin

Introduction

PolicyEvaluation

Control

Warnings

DeepQ-Network

Main issues

Deadly Triad (Sutton)

Off-policy estimation

Too much generalisation (extrapolation)

Bootstrapping

Leemon Baird’s counter example [Baird, 1995]:

RL

OlivierPietquin

Introduction

PolicyEvaluation

Control

Warnings

DeepQ-Network

10 Introduction

11 Policy Evaluation

12 Control

13 Warnings

14 Deep Q-Network

RL

OlivierPietquin

Introduction

PolicyEvaluation

Control

Warnings

DeepQ-Network

Deep Q-Network I

Problems to use Neural Nets

Correlated data (trajectories are made of state transitions conditioned on actions)

Non-stationary strategies (learning control while learning values)

Extrapolation (bad for SARSA and Fitted-Q)

Residual methods are better suited but their cost function is biased

RL

OlivierPietquin

Introduction

PolicyEvaluation

Control

Warnings

DeepQ-Network

Deep Q-Network II

Solution [Mnih et al., 2015]

Use two neural networks:
1 a slow-learning target network (θ−)
2 a fast-learning Q-network (θ)

Use experience replay (fill a replay buffer D with transitions generated by π = f(Q_θ(s, a)))

Shuffle samples in the replay buffer and minimize:

J(θ) = Σ_{(s,a,r,s')∈D} [ (r + γ max_b Q_{θ−}(s', b)) − Q_θ(s, a) ]²

θ ← θ + α (r + γ max_b Q_{θ−}(s', b) − Q_θ(s, a)) ∇_θ Q_θ(s, a)

Every N training steps, θ− ← θ
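To make the mechanics concrete, here is a heavily simplified sketch of this recipe — replay buffer, fast network θ, periodically copied target network θ⁻ — using a linear Q-function on a toy chain instead of a deep network on Atari frames, so it stays short and runnable; environment, features and hyper-parameters are all assumptions:

```python
import numpy as np
from collections import deque
rng = np.random.default_rng(0)

n_states, n_actions, gamma = 3, 2, 0.9
alpha, eps, batch_size, sync_every = 0.05, 0.2, 32, 100

def feat(s, a):
    f = np.zeros(n_states * n_actions)
    f[s * n_actions + a] = 1.0
    return f

def step(s, a):                                   # made-up episodic chain
    s_next = min(s + 1, 2) if a == 1 else max(s - 1, 0)
    return s_next, float(s_next == 2), s_next == 2

theta = np.zeros(n_states * n_actions)            # fast "Q-network"
theta_target = theta.copy()                       # slow target network theta^-
replay = deque(maxlen=5000)

s, step_count = 0, 0
for _ in range(3000):
    q_s = [theta @ feat(s, b) for b in range(n_actions)]
    a = int(rng.integers(n_actions)) if rng.random() < eps else int(np.argmax(q_s))
    s_next, r, done = step(s, a)
    replay.append((s, a, r, s_next, done))
    s = 0 if done else s_next
    if len(replay) >= batch_size:
        for i in rng.choice(len(replay), batch_size):     # sampled (shuffled) minibatch
            si, ai, ri, sni, di = replay[int(i)]
            target = ri + (0.0 if di else gamma * max(theta_target @ feat(sni, b)
                                                      for b in range(n_actions)))
            td = target - theta @ feat(si, ai)
            theta += alpha * td * feat(si, ai)            # SGD step on the DQN loss
    step_count += 1
    if step_count % sync_every == 0:
        theta_target = theta.copy()                       # theta^- <- theta every N steps

print(theta.reshape(n_states, n_actions))
```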

RL

OlivierPietquin

Introduction

PolicyEvaluation

Control

Warnings

DeepQ-Network

Deep Q-Network III

Network Architecture

https://www.youtube.com/watch?v=V1eYniJ0Rnk

RL

OlivierPietquin

Introduction

PolicyEvaluation

Control

Warnings

DeepQ-Network

Deep Q-Network IV

Results on 52 Atari games

RL

OlivierPietquin

Introduction

PolicyEvaluation

Control

Warnings

DeepQ-Network

Improvements I

Double DQN [van Hasselt et al., 2016]

DQN:

θ ← θ + α (r + γ Q_{θ−}(s', argmax_b Q_{θ−}(s', b)) − Q_θ(s, a)) ∇_θ Q_θ(s, a)

Double DQN

θ ← θ + α (r + γ Q_{θ−}(s', argmax_b Q_θ(s', b)) − Q_θ(s, a)) ∇_θ Q_θ(s, a)

Decorrelates selection and evaluation and avoids overestimation.
https://www.youtube.com/watch?v=OJYRcogPcfY

RL

OlivierPietquin

Introduction

PolicyEvaluation

Control

Warnings

DeepQ-Network

Improvements II

Prioritized Experience Replay

Don’t sample uniformly

Sample with priority given to high temporal-difference errors:

‖r + γ max_b Q_{θ−}(s', b) − Q_θ(s, a)‖

RL

OlivierPietquin

Introduction

PolicyEvaluation

Control

Warnings

DeepQ-Network

Questions?

RL

OlivierPietquin

Introduction

PolicyGradient

Actor-Critic

Part IV

Policy Gradient Methods

RL

OlivierPietquin

Introduction

Why learn apolicy

Problemdefinition

PolicyGradient

Actor-Critic

15 Introduction
Why learn a policy
Problem definition

16 Policy Gradient

17 Actor-Critic

RL

OlivierPietquin

Introduction

Why learn apolicy

Problemdefinition

PolicyGradient

Actor-Critic

Reasons

Example: Mountain Car

Value Function is much more complex than the policy.

Continuous action space.

Occam’s Razor

RL

OlivierPietquin

Introduction

Why learn apolicy

Problemdefinition

PolicyGradient

Actor-Critic

Problem definition I

Gradient ascent on parameterized policies

Define a parametric policy πθ(s, a)

Suppose πθ(s, a) is differentiable and that ∇θπθ(s, a) isknown

Define an objective function to optimize J(θ) (s.t. η(θ))

J(θ) such that θ* = argmax_θ J(θ)

Perform gradient ascent on the objective function:

θ ← θ + α∇θJ(θ)

RL

OlivierPietquin

Introduction

Why learn apolicy

Problemdefinition

PolicyGradient

Actor-Critic

Problem definition II

Objective function

Total return on episodic tasks:

J_e(θ) = E_{π_θ}[ Σ_{t=1}^{H} r(s_t, a_t) ] = V^{π_θ}(s_1)

Average value on continuing tasks:

J_v(θ) = Σ_s d^{π_θ}(s) V^{π_θ}(s)

Average immediate reward:

J_r(θ) = Σ_s d^{π_θ}(s) Σ_a π_θ(s, a) r(s, a)

d^{π_θ}(s): stationary distribution of the Markov chain induced by π_θ

RL

OlivierPietquin

Introduction

PolicyGradient

REINFORCE

Policy GradientTheorem

PG with baseline

Actor-Critic

15 Introduction

16 Policy Gradient
REINFORCE
Policy Gradient Theorem
PG with baseline

17 Actor-Critic

RL

OlivierPietquin

Introduction

PolicyGradient

REINFORCE

Policy GradientTheorem

PG with baseline

Actor-Critic

Episodic case I

Redefining J_e(θ)

A sample is a trajectory (rollout) τ

J_e(θ) = ∫ p_{π_θ}(τ) R(τ) dτ

where p_{π_θ}(τ) is the probability of observing trajectory τ under policy π_θ and R(τ) is the total return accumulated on trajectory τ

RL

OlivierPietquin

Introduction

PolicyGradient

REINFORCE

Policy GradientTheorem

PG with baseline

Actor-Critic

Episodic case II

Likelihood-ratio trick

∇_θ J(θ) = ∫ ∇_θ p_{π_θ}(τ) R(τ) dτ

         = ∫ p_{π_θ}(τ) (∇_θ p_{π_θ}(τ) / p_{π_θ}(τ)) R(τ) dτ

         = E[ (∇_θ p_{π_θ}(τ) / p_{π_θ}(τ)) R(τ) ]

         = E[ ∇_θ log p_{π_θ}(τ) R(τ) ]

Note

E[ (∇_θ p_{π_θ}(τ) / p_{π_θ}(τ)) R(τ) ]: increases the probability of trajectory τ if it has a high return but not an already high probability.

RL

OlivierPietquin

Introduction

PolicyGradient

REINFORCE

Policy GradientTheorem

PG with baseline

Actor-Critic

Episodic case III

∇_θ J_e(θ) is independent of the dynamics

Using the Markov property:

p_{π_θ}(τ) = p(s_1) Π_{t=1}^{H} p(s_{t+1} | s_t, a_t) π_θ(s_t, a_t)

∇_θ log p_{π_θ}(τ) = Σ_{t=1}^{H} ∇_θ log π_θ(s_t, a_t)

RL

OlivierPietquin

Introduction

PolicyGradient

REINFORCE

Policy GradientTheorem

PG with baseline

Actor-Critic

In Practice: REINFORCE [Williams, 1992, Peters and Schaal, 2006]

Episodic REINFORCE gradient estimate

Using N rollouts (s^i_1, a^i_1, r^i_1, ..., s^i_H, a^i_H, r^i_H), i = 1..N, drawn from π_θ:

∇̂_θ J_e(θ) = (1/N) Σ_{i=1}^{N} [ ( Σ_{t=1}^{H} ∇_θ log π_θ(s^i_t, a^i_t) ) ( Σ_{t=1}^{H} r^i_t ) ]

Notes

Often one single rollout is enough

As it comes from a double sum, this estimate has a high variance.

RL

OlivierPietquin

Introduction

PolicyGradient

REINFORCE

Policy GradientTheorem

PG with baseline

Actor-Critic

Policy Gradient Theorem I

Intuition: case of J_r(θ)

∇_θ J_r(θ) = ∇_θ Σ_s d(s) Σ_a π_θ(s, a) r(s, a)

           = Σ_s d(s) Σ_a ∇_θ π_θ(s, a) r(s, a)

           = Σ_s d(s) Σ_a π_θ(s, a) (∇_θ π_θ(s, a) / π_θ(s, a)) r(s, a)

           = Σ_s d(s) Σ_a π_θ(s, a) ∇_θ log π_θ(s, a) r(s, a)

           = E[ ∇_θ log π_θ(s, a) r(s, a) ]

RL

OlivierPietquin

Introduction

PolicyGradient

REINFORCE

Policy GradientTheorem

PG with baseline

Actor-Critic

Policy Gradient Theorem II

Policy Gradient Theorem (proof in [Sutton et al., 2000])

∇_θ J(θ) = Σ_s d^{π_θ}(s) Σ_a ∇_θ π_θ(s, a) Q^{π_θ}(s, a)

∇_θ J(θ) = E_{π_θ}[ ∇_θ log π_θ(s, a) Q^{π_θ}(s, a) ]

Notes

Generalises to J_e(θ) and J_v(θ)

Q^{π_θ} is the true Q-function of policy π_θ, which is unknown

In the case of J_v(θ): d^{π_θ}(s) is the stationary distribution of the Markov chain induced by π_θ

In the case of J_e(θ): d^{π_θ}(s) is the probability of encountering s when starting from s_1 and following π_θ

In the case of discounted J_e(θ): d^{π_θ}(s) = Σ_{t=0}^{∞} γ^t p(s_t = s | s_1, π_θ)

RL

OlivierPietquin

Introduction

PolicyGradient

REINFORCE

Policy GradientTheorem

PG with baseline

Actor-Critic

REINFORCE with PG theorem Algorithm I

REINFORCE gradient estimate with the policy gradient theorem

Replace Q^{π_θ} by an MC estimate (and d^{π_θ}(s) by empirical counts)

Draw N rollouts (s^i_1, a^i_1, r^i_1, ..., s^i_H, a^i_H, r^i_H), i = 1..N, from π_θ:

∇̂_θ J_e(θ) = (1/N) Σ_{i=1}^{N} [ Σ_{t=1}^{H} ∇_θ log π_θ(s^i_t, a^i_t) Σ_{k=t}^{H} r^i_k ]

Variant: G(PO)MDP

∇̂_θ J_e(θ) = (1/N) Σ_{i=1}^{N} [ Σ_{k=1}^{H} ( Σ_{t=1}^{k} ∇_θ log π_θ(s^i_t, a^i_t) ) r^i_k ]

Both reduce the variance of the gradient estimate

RL

OlivierPietquin

Introduction

PolicyGradient

REINFORCE

Policy GradientTheorem

PG with baseline

Actor-Critic

REINFORCE with PG theorem Algorithm II

Algorithm 1 REINFORCE with the PG theorem

Initialize θ_0 at random, initialize step size α_0
n = 0
while no convergence do
  Generate rollout h_n = {s^n_1, a^n_1, r^n_1, ..., s^n_H, a^n_H, r^n_H} ~ π_{θ_n}
  PG_θ = 0
  for t = 1 to H do
    R_t = Σ_{t'=t}^{H} r^n_{t'}
    PG_θ += ∇_θ log π_{θ_n}(s_t, a_t) R_t
  end for
  n++
  θ_n ← θ_{n−1} + α_n PG_θ
  update α_n (if step-size scheduling)
end while
return θ_n
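A runnable sketch of the algorithm above with a softmax policy over one-hot state features, on the same made-up chain (policy parameterisation, environment, discounting and step size are assumptions):

```python
import numpy as np
rng = np.random.default_rng(0)

n_states, n_actions, gamma, alpha = 3, 2, 0.99, 0.1

def step(s, a):                                   # made-up chain, +1 on reaching state 2
    s_next = min(s + 1, 2) if a == 1 else max(s - 1, 0)
    return s_next, float(s_next == 2), s_next == 2

theta = np.zeros((n_states, n_actions))           # softmax policy: pi(a|s) prop. exp(theta[s, a])

def policy_probs(s):
    prefs = theta[s] - theta[s].max()
    return np.exp(prefs) / np.exp(prefs).sum()

for episode in range(500):
    # Generate one rollout h_n ~ pi_theta
    s, done, traj = 0, False, []
    while not done and len(traj) < 50:
        a = int(rng.choice(n_actions, p=policy_probs(s)))
        s_next, r, done = step(s, a)
        traj.append((s, a, r))
        s = s_next
    # Accumulate the policy gradient with returns-to-go R_t
    grad = np.zeros_like(theta)
    ret = 0.0
    for s, a, r in reversed(traj):
        ret = r + gamma * ret                     # R_t = discounted sum of rewards from t
        score = -policy_probs(s)                  # grad of log pi for a softmax policy
        score[a] += 1.0
        grad[s] += score * ret
    theta += alpha * grad                         # gradient ascent step

print(policy_probs(0), policy_probs(1))           # should favour action 1 (move right)
```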

RL

OlivierPietquin

Introduction

PolicyGradient

REINFORCE

Policy GradientTheorem

PG with baseline

Actor-Critic

Policy Gradient with Baseline I

Reducing variance

The gradient comes from a cumulative function

Subtracting a constant (or a function of s) does not modify the solution:

∇_θ J(θ) = Σ_s d^{π_θ}(s) Σ_a ∇_θ π_θ(s, a) (Q^{π_θ}(s, a) − b(s))

Σ_a ∇_θ π_θ(s, a) b(s) = b(s) ∇_θ Σ_a π_θ(s, a) = b(s) ∇_θ 1 = 0

var(q − b) = var(q) − 2 cov(q, b) + var(b)

We reduce the variance by 2 cov(q, b)

RL

OlivierPietquin

Introduction

PolicyGradient

REINFORCE

Policy GradientTheorem

PG with baseline

Actor-Critic

Policy Gradient with Baseline II

Baseline candidates

An arbitrary constant

The average reward of policy πθ (MC estimate)

The average reward until time step t

Intuition

Instead of using pure performance to compute the gradient, let's compare the current performance with the average. The gradient increases (resp. decreases) the probability of actions that are better (resp. worse) than average.

RL

OlivierPietquin

Introduction

PolicyGradient

Actor-Critic

Compatibleapproximations

QAC algorithm

AdvantageActor-Critic

15 Introduction

16 Policy Gradient

17 Actor-Critic
Compatible approximations
QAC algorithm
Advantage Actor-Critic

RL

OlivierPietquin

Introduction

PolicyGradient

Actor-Critic

Compatibleapproximations

QAC algorithm

AdvantageActor-Critic

Coming back to PG theorem

∇θJ(θ) = E[∇θ log πθ(s, a)Qπθ(s, a)]

Approximate Qπθ

If Q^{π_θ}(s, a) ≈ Q_ω(s, a),

do we have ∇_θ J(θ) ≈ E[ ∇_θ log π_θ(s, a) Q_ω(s, a) ] ?

If yes, π_θ is an actor (it behaves) and Q_ω is a critic (it suggests the direction in which to update the policy)

Both can be estimated online: π_θ with PG and Q_ω with SARSA

It could lead to more stable (lower-variance) algorithms.

RL

OlivierPietquin

Introduction

PolicyGradient

Actor-Critic

Compatibleapproximations

QAC algorithm

AdvantageActor-Critic

Compatible value function approximation I

Theorem: compatibility of approximations [Sutton et al., 2000]

If the two following conditions are satisfied:

1 The parameters ω minimize the mean squared error:

  ω* = argmin_ω E_{π_θ}[ (Q^{π_θ}(s, a) − Q_ω(s, a))² ]

2 The value and policy approximations are compatible:

  ∇_ω Q_ω = ∇_θ log π_θ

Then the policy gradient is exact:

∇_θ J(θ) = E[ ∇_θ log π_θ(s, a) Q_ω(s, a) ]

RL

OlivierPietquin

Introduction

PolicyGradient

Actor-Critic

Compatibleapproximations

QAC algorithm

AdvantageActor-Critic

Compatible value function approximation II

Proof

If the mean squared error is minimal, then its gradient w.r.t. ω is zero:

∇_ω E_{π_θ}[ (Q^{π_θ}(s, a) − Q_ω(s, a))² ] = 0

E_{π_θ}[ (Q^{π_θ}(s, a) − Q_ω(s, a)) ∇_ω Q_ω(s, a) ] = 0

E_{π_θ}[ (Q^{π_θ}(s, a) − Q_ω(s, a)) ∇_θ log π_θ(s, a) ] = 0

Thus

∇_θ J(θ) = E_{π_θ}[ ∇_θ log π_θ(s, a) Q^{π_θ}(s, a) ]
         = E_{π_θ}[ ∇_θ log π_θ(s, a) Q_ω(s, a) ]

RL

OlivierPietquin

Introduction

PolicyGradient

Actor-Critic

Compatibleapproximations

QAC algorithm

AdvantageActor-Critic

Compatible value function approximation III

In practice

∇_ω Q_ω = ∇_θ log π_θ only holds for exponential policies (almost never used in practice)

ω* = argmin_ω E_{π_θ}[ (Q^{π_θ}(s, a) − Q_ω(s, a))² ] is generally not satisfied either, as we do not minimise it through gradient descent on residuals in online settings, and batch methods are not convenient

Most Deep RL methods for PG do not meet these assumptions, but they work in practice

RL

OlivierPietquin

Introduction

PolicyGradient

Actor-Critic

Compatibleapproximations

QAC algorithm

AdvantageActor-Critic

Actor-Critic Algorithm

Algorithm 2 QAC with linear critic

Q_ω(s, a) = ω^⊤ φ(s, a)
Initialize θ and ω at random
Set α, β
Initialise s
Sample a ~ π_θ(s, ·)
for all steps do
  Sample r(s, a) and s' ~ p(·|s, a)
  Sample a' ~ π_θ(s', ·)
  ω ← ω + β [r(s, a) + γ Q_ω(s', a') − Q_ω(s, a)] φ(s, a)
  θ ← θ + α ∇_θ log π_θ(s, a) Q_ω(s, a)
  a ← a', s ← s'
end for
return θ
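A sketch of this actor-critic loop with a softmax actor and a linear critic on one-hot (s, a) features, again on the made-up chain (all concrete choices — environment, features, step sizes — are assumptions):

```python
import numpy as np
rng = np.random.default_rng(0)

n_states, n_actions, gamma = 3, 2, 0.99
alpha, beta = 0.05, 0.1                            # actor and critic step sizes

def step(s, a):                                    # made-up chain, +1 on reaching state 2
    s_next = min(s + 1, 2) if a == 1 else max(s - 1, 0)
    return s_next, float(s_next == 2), s_next == 2

def phi(s, a):                                     # one-hot (state, action) critic features
    f = np.zeros(n_states * n_actions)
    f[s * n_actions + a] = 1.0
    return f

theta = np.zeros((n_states, n_actions))            # actor parameters (softmax policy)
omega = np.zeros(n_states * n_actions)             # critic parameters: Q_omega = omega . phi

def sample_action(s):
    prefs = theta[s] - theta[s].max()
    probs = np.exp(prefs) / np.exp(prefs).sum()
    return int(rng.choice(n_actions, p=probs)), probs

for episode in range(500):
    s, done = 0, False
    a, probs = sample_action(s)
    while not done:
        s_next, r, done = step(s, a)
        a_next, probs_next = sample_action(s_next)
        q_next = 0.0 if done else omega @ phi(s_next, a_next)
        # Critic: SARSA-style TD update of omega.
        omega += beta * (r + gamma * q_next - omega @ phi(s, a)) * phi(s, a)
        # Actor: policy-gradient step using the critic's Q estimate.
        score = -probs.copy()
        score[a] += 1.0
        theta[s] += alpha * score * (omega @ phi(s, a))
        s, a, probs = s_next, a_next, probs_next

print(theta.argmax(axis=1))                        # greedy actions of the learnt actor
```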

RL

OlivierPietquin

Introduction

PolicyGradient

Actor-Critic

Compatibleapproximations

QAC algorithm

AdvantageActor-Critic

Reducing variance with a baseline

Advantage function

Same intuition as before: we should compare to the average performance rather than measure absolute performance to compute the gradient.

Average performance of π_θ starting from state s is V^{π_θ}(s)

Advantage function: A^π(s, a) = Q^π(s, a) − V^π(s)

Advantage actor-critic

Q^{π_θ}(s, a) ≈ Q_ω(s, a),  V^{π_θ}(s) ≈ V_ψ(s)

A_{ω,ψ}(s, a) = Q_ω(s, a) − V_ψ(s)

∇_θ J(θ) ≈ E_{π_θ}[ ∇_θ log π_θ(s, a) A_{ω,ψ}(s, a) ]

RL

OlivierPietquin

Introduction

PolicyGradient

Actor-Critic

Compatibleapproximations

QAC algorithm

AdvantageActor-Critic

Estimating the Advantage function

Using the TD error

TD error: δ^{π_θ}(s, a) = r(s, a) + γ V^{π_θ}(s') − V^{π_θ}(s)

E_{π_θ}[ δ^{π_θ} | s, a ] = E_{π_θ}[ r(s, a) + γ V^{π_θ}(s') | s, a ] − V^{π_θ}(s)
                         = Q^{π_θ}(s, a) − V^{π_θ}(s)
                         = A^{π_θ}(s, a)

With approximation: δ_ψ(s, a) = r(s, a) + γ V_ψ(s') − V_ψ(s)

Policy gradient: ∇_θ J(θ) ≈ E_{π_θ}[ ∇_θ log π_θ(s, a) δ_ψ(s, a) ]

It only depends on the θ and ψ parameters (no ω)

RL

OlivierPietquin

Introduction

PolicyGradient

Actor-Critic

Compatibleapproximations

QAC algorithm

AdvantageActor-Critic

Asynchronous Advantage Actor-Critic (A3C) I

RL

OlivierPietquin

Introduction

PolicyGradient

Actor-Critic

Compatibleapproximations

QAC algorithm

AdvantageActor-Critic

Asynchronous Advantage Actor-Critic (A3C) II

The agent learns a value function and a policy with a shared representation

Many agents (workers) run in parallel

They send gradients to the learner

RL

OlivierPietquin

Introduction

PolicyGradient

Actor-Critic

Compatibleapproximations

QAC algorithm

AdvantageActor-Critic

Asynchronous Advantage Actor-Critic (A3C) III

When the learner updates, it copies its parameters to the workers

PG: ∇_θ π_θ(s, a) ( Σ_{k=1}^{N} γ^k r_{t+k} + γ^{N+1} V_θ(s_{t+N+1}) − V_θ(s_t) )

Value: ∇_θ ( Σ_{k=1}^{N} γ^k r_{t+k} + γ^{N+1} V_{θ−}(s_{t+N+1}) − V_θ(s_t) )

https://www.youtube.com/watch?v=nMR5mjCFZCw

RL

OlivierPietquin

Introduction

PolicyGradient

Actor-Critic

Compatibleapproximations

QAC algorithm

AdvantageActor-Critic

AlphaGo I

RL

OlivierPietquin

Introduction

PolicyGradient

Actor-Critic

Compatibleapproximations

QAC algorithm

AdvantageActor-Critic

AlphaGo II

RL

OlivierPietquin

Introduction

PolicyGradient

Actor-Critic

Compatibleapproximations

QAC algorithm

AdvantageActor-Critic

AlphaGo III

RL

OlivierPietquin

Introduction

PolicyGradient

Actor-Critic

Compatibleapproximations

QAC algorithm

AdvantageActor-Critic

AlphaGo IV

RL

OlivierPietquin

Introduction

PolicyGradient

Actor-Critic

Compatibleapproximations

QAC algorithm

AdvantageActor-Critic

AlphaGo V

RL

OlivierPietquin

Introduction

PolicyGradient

Actor-Critic

Compatibleapproximations

QAC algorithm

AdvantageActor-Critic

Other Example

Language applications [Strub et al., 2017]

Optimize non differentiable objectives (like BLEU score)

Optimize long-term dialogue strategies (the GuessWhat?! game)

RL

OlivierPietquin

Introduction

PolicyGradient

Actor-Critic

Compatibleapproximations

QAC algorithm

AdvantageActor-Critic

Summary: Types of RL algorithms

Value or not Value

Critic: only value (SARSA, Q-learning)

Actor: only policy (Policy Gradient, REINFORCE)

Actor-Critic: policy and value (PG theorem, AAC)

Others

Online / Batch

On-Policy / Off-Policy

Model-based / Model-Free

Exact / Approximate

RL

OlivierPietquin

Introduction

PolicyGradient

Actor-Critic

Compatibleapproximations

QAC algorithm

AdvantageActor-Critic

Questions

RL

OlivierPietquin

Introduction

PolicyGradient

Actor-Critic

Compatibleapproximations

QAC algorithm

AdvantageActor-Critic

Bibliography I

Baird, L. (1995).
Residual algorithms: reinforcement learning with function approximation.
In Proceedings of the Twelfth International Conference on Machine Learning, pages 30–37. Morgan Kaufmann Publishers Inc.

Boyan, J. A. (1999).
Least-squares temporal difference learning.
In Proceedings of the Sixteenth International Conference on Machine Learning, pages 49–56. Morgan Kaufmann Publishers Inc.

Bradtke, S. J. and Barto, A. (1996).
Linear least-squares algorithms for temporal difference learning.
Machine Learning, 22:33–57.

RL

OlivierPietquin

Introduction

PolicyGradient

Actor-Critic

Compatibleapproximations

QAC algorithm

AdvantageActor-Critic

Bibliography II

Ernst, D., Geurts, P., and Wehenkel, L. (2005).
Tree-based batch mode reinforcement learning.
Journal of Machine Learning Research, 6(Apr):503–556.

Gordon, G. J. (1995).
Stable function approximation in dynamic programming.
In Proceedings of the Twelfth International Conference on Machine Learning, pages 261–268. Morgan Kaufmann Publishers Inc.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. (2015).
Human-level control through deep reinforcement learning.
Nature, 518(7540):529–533.

RL

OlivierPietquin

Introduction

PolicyGradient

Actor-Critic

Compatibleapproximations

QAC algorithm

AdvantageActor-Critic

Bibliography III

Peters, J. and Schaal, S. (2006).
Policy gradient methods for robotics.
In Intelligent Robots and Systems, 2006 IEEE/RSJ International Conference on, pages 2219–2225. IEEE.

Riedmiller, M. (2005).
Neural fitted Q iteration – first experiences with a data efficient neural reinforcement learning method.
In ECML, volume 3720, pages 317–328. Springer.

Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al. (2016).
Mastering the game of Go with deep neural networks and tree search.
Nature, 529(7587):484–489.

RL

OlivierPietquin

Introduction

PolicyGradient

Actor-Critic

Compatibleapproximations

QAC algorithm

AdvantageActor-Critic

Bibliography IV

Strub, F., De Vries, H., Mary, J., Piot, B., Courville, A., and Pietquin, O. (2017).
End-to-end optimization of goal-driven and visually grounded dialogue systems.
In International Joint Conference on Artificial Intelligence.

Sutton, R. S., McAllester, D. A., Singh, S. P., and Mansour, Y. (2000).
Policy gradient methods for reinforcement learning with function approximation.
In Advances in Neural Information Processing Systems, pages 1057–1063.

Tesauro, G. (1995).
Temporal difference learning and TD-Gammon.
Communications of the ACM, 38(3):58–69.

RL

OlivierPietquin

Introduction

PolicyGradient

Actor-Critic

Compatibleapproximations

QAC algorithm

AdvantageActor-Critic

Bibliography V

van Hasselt, H., Guez, A., and Silver, D. (2016).

Deep reinforcement learning with double q-learning.

In Thirtieth AAAI Conference on Artificial Intelligence.

Williams, R. J. (1992).
Simple statistical gradient-following algorithms for connectionist reinforcement learning.
Machine Learning, 8(3–4):229–256.
