Reinforcement Learning (Machine Learning, SIR)
Matthieu Geist (CentraleSupélec)
[email protected]




Figure: The perception-action cycle in reinforcement learning.


Applications

- playing games (Backgammon, Go, Tetris, Atari...)
- robotics
- autonomous acrobatic helicopter control [1]
- operations research (pricing, vehicle routing...)
- human-computer interactions (dialogue management, e-learning...)
- virtually any control problem [2]

[1] http://heli.stanford.edu/
[2] An old list: http://umichrl.pbworks.com/w/page/7597597/Successes_of_Reinforcement_Learning


Outline

1 Formalism: Markov Decision Processes; Policy and value function; Bellman operators
2 Dynamic Programming: Linear programming; Value iteration; Policy iteration
3 Approximate Dynamic Programming: State-action value function; Approximate value iteration; Approximate policy iteration
4 Online learning: SARSA and Q-learning; The exploration-exploitation dilemma
5 Policy search and actor-critic methods: The policy gradient theorem; Actor-critic methods

1 Formalism

Markov Decision Processes

A Markov Decision Process (MDP) is a tuple $\{S, A, P, r, \gamma\}$ where:

- $S$ is the (finite) state space;
- $A$ is the (finite) action space;
- $P \in \Delta_S^{S \times A}$ is the Markovian transition kernel. The term $P(s'|s,a)$ denotes the probability of transitioning to state $s'$ given that action $a$ was chosen in state $s$;
- $r \in \mathbb{R}^{S \times A}$ is the reward function; it associates the reward $r(s,a)$ with taking action $a$ in state $s$. The reward function is assumed to be uniformly bounded;
- $\gamma \in (0,1)$ is a discount factor that favors shorter-term rewards (usually set to a value close to 1).

Policy and value function

Policy:
- $\pi \in A^S$; in state $s$, an agent applying policy $\pi$ chooses the action $\pi(s)$.

Value function (quantifies the quality of a policy):
\[
v_\pi(s) = E\Big[\sum_{t=0}^{\infty} \gamma^t r(S_t, \pi(S_t)) \,\Big|\, S_0 = s,\ S_{t+1} \sim P(\cdot|S_t, \pi(S_t))\Big].
\]

Comparing policies (partial ordering):
\[
\pi_1 \geq \pi_2 \ \Leftrightarrow\ \forall s \in S,\ v_{\pi_1}(s) \geq v_{\pi_2}(s).
\]

Optimal policy:
\[
\pi_* \in \operatorname*{argmax}_{\pi \in A^S} v_\pi.
\]

Bellman operators

Rewriting the Bellman equation

\[
\begin{aligned}
v_\pi(s) &= E\Big[\sum_{t=0}^{\infty} \gamma^t r(S_t, \pi(S_t)) \,\Big|\, S_0 = s,\ S_{t+1} \sim P(\cdot|S_t, \pi(S_t))\Big] \\
&= r(s, \pi(s)) + E\Big[\sum_{t=1}^{\infty} \gamma^t r(S_t, \pi(S_t)) \,\Big|\, S_0 = s,\ S_{t+1} \sim P(\cdot|S_t, \pi(S_t))\Big] \\
&= r(s, \pi(s)) + \gamma E\Big[\sum_{t=0}^{\infty} \gamma^t r(S_{t+1}, \pi(S_{t+1})) \,\Big|\, S_0 = s,\ S_{t+1} \sim P(\cdot|S_t, \pi(S_t))\Big] \\
\Leftrightarrow\ v_\pi(s) &= r(s, \pi(s)) + \gamma \sum_{s' \in S} P(s'|s, \pi(s))\, v_\pi(s').
\end{aligned}
\]


Rewriting the Bellman equation (cont.)

Define the stochastic matrix $P_\pi \in \mathbb{R}^{S \times S}$ and the vector $r_\pi \in \mathbb{R}^S$ as
\[
P_\pi = \big(P(s'|s, \pi(s))\big)_{s, s' \in S} \quad \text{and} \quad r_\pi = \big(r(s, \pi(s))\big)_{s \in S}.
\]
Using these notations, we have:
\[
v_\pi = r_\pi + \gamma P_\pi v_\pi \ \Leftrightarrow\ v_\pi = (I - \gamma P_\pi)^{-1} r_\pi.
\]
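To make the formula concrete, here is a minimal NumPy sketch of exact policy evaluation on a small made-up MDP (the arrays P, r and the deterministic policy pi are invented for the example; this is an illustration, not part of the original slides):

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 5, 3, 0.95

# Made-up MDP: P[s, a] is a distribution over next states, r[s, a] a reward.
P = rng.random((nS, nA, nS)); P /= P.sum(axis=2, keepdims=True)
r = rng.random((nS, nA))
pi = rng.integers(nA, size=nS)          # a deterministic policy

# Build P_pi and r_pi, then solve v_pi = (I - gamma * P_pi)^{-1} r_pi.
P_pi = P[np.arange(nS), pi]             # shape (nS, nS)
r_pi = r[np.arange(nS), pi]             # shape (nS,)
v_pi = np.linalg.solve(np.eye(nS) - gamma * P_pi, r_pi)
print(v_pi)
```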


Bellman evaluation operator

Define the Bellman evaluation operator $T_\pi : \mathbb{R}^S \to \mathbb{R}^S$ as
\[
\forall v \in \mathbb{R}^S, \quad T_\pi v = r_\pi + \gamma P_\pi v,
\]
or equivalently, componentwise,
\[
\forall s \in S, \quad [T_\pi v](s) = r(s, \pi(s)) + \gamma \sum_{s' \in S} P(s'|s, \pi(s))\, v(s').
\]
$T_\pi$ is a contraction (in supremum norm) and $v_\pi$ is its unique fixed point:
\[
v_\pi = T_\pi v_\pi.
\]


Optimal value function and policies

Assume that $v_* = v_{\pi_*}$ is known; an optimal policy is then greedy with respect to $v_*$:
\[
\pi_*(s) \in \operatorname*{argmax}_{a \in A} \Big( r(s,a) + \gamma \sum_{s' \in S} P(s'|s,a)\, v_*(s') \Big).
\]
Characterizing $v_*$:
\[
\forall s \in S, \quad v_*(s) = \max_{a \in A} \Big( r(s,a) + \gamma \sum_{s' \in S} P(s'|s,a)\, v_*(s') \Big).
\]


Bellman optimality operator

Define the Bellman optimality operator $T_* : \mathbb{R}^S \to \mathbb{R}^S$ as
\[
\forall v \in \mathbb{R}^S, \quad T_* v = \max_{\pi \in A^S} \big( r_\pi + \gamma P_\pi v \big),
\]
or equivalently, componentwise,
\[
\forall s \in S, \quad [T_* v](s) = \max_{a \in A} \Big( r(s,a) + \gamma \sum_{s' \in S} P(s'|s,a)\, v(s') \Big).
\]
$T_*$ is a contraction (in supremum norm) and $v_*$ is its unique fixed point:
\[
v_* = T_* v_*.
\]

2 Dynamic Programming

DP: solve an MDP when the model is known.

In practice, the model is unknown and one has to rely on data.

Even so, related learning methods are often based on DP.

Linear programming

$v_*$ is the solution of the following linear program:
\[
\min_{v \in \mathbb{R}^S} \mathbf{1}^\top v \quad \text{subject to} \quad v \geq T_* v.
\]
Proof. By monotonicity of $T_*$, any feasible $v$ satisfies $v \geq T_* v \geq T_*^2 v \geq \dots \to v_*$, so
\[
v \geq T_* v \ \Rightarrow\ v \geq v_* \ \Rightarrow\ \mathbf{1}^\top v \geq \mathbf{1}^\top v_*,
\]
and $v_*$ itself is feasible since $v_* = T_* v_*$.


Algorithm 1: Linear programming

1: Solve
\[
\min_{v \in \mathbb{R}^S} \sum_{s \in S} v(s) \quad \text{subject to} \quad v(s) \geq r(s,a) + \gamma \sum_{s' \in S} P(s'|s,a)\, v(s'), \quad \forall s \in S, \forall a \in A,
\]
and get $v_*$.
2: Return the policy $\pi_*$ defined as
\[
\pi_*(s) \in \operatorname*{argmax}_{a \in A} \Big( r(s,a) + \gamma \sum_{s' \in S} P(s'|s,a)\, v_*(s') \Big).
\]
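A possible implementation of Algorithm 1 with scipy.optimize.linprog, reusing the toy arrays P, r, gamma from the earlier policy-evaluation sketch (a sketch under those assumptions, not the official course code):

```python
import numpy as np
from scipy.optimize import linprog

nS, nA = r.shape
# One constraint per (s, a): v(s) - gamma * sum_s' P(s'|s,a) v(s') >= r(s, a),
# rewritten as A_ub @ v <= b_ub for linprog.
A = np.repeat(np.eye(nS), nA, axis=0) - gamma * P.reshape(nS * nA, nS)
res = linprog(c=np.ones(nS), A_ub=-A, b_ub=-r.reshape(-1), bounds=(None, None))
v_star = res.x

# Greedy policy with respect to v_star.
q_star = r + gamma * P @ v_star          # shape (nS, nA)
pi_star = q_star.argmax(axis=1)
```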

Value iteration

- $T_*$ is a $\gamma$-contraction: $\forall u, v \in \mathbb{R}^S$, $\|T_* u - T_* v\|_\infty \leq \gamma \|u - v\|_\infty$;
- $v_*$ is its unique fixed point: $T_* v_* = v_*$;
- Banach fixed-point theorem: for any $v_0$, the sequence $v_{k+1} = T_* v_k$ converges to $v_*$;
- natural stopping criterion: $\|v_{k+1} - v_k\|_\infty \leq \epsilon$;
- output a greedy policy (with respect to $v_k$), $\pi_k \in \mathcal{G}(v_k)$:
\[
\pi \in \mathcal{G}(v) \ \Leftrightarrow\ T_\pi v = T_* v \ \Leftrightarrow\ \forall s \in S,\ \pi(s) \in \operatorname*{argmax}_{a \in A} \Big( r(s,a) + \gamma \sum_{s' \in S} P(s'|s,a)\, v(s') \Big).
\]


Algorithm 2: Value iteration

Require: an initial $v_0 \in \mathbb{R}^S$, a stopping criterion $\epsilon$
1: $k = 0$
2: repeat
3:   for all $s \in S$ do
4:     $v_{k+1}(s) = \max_{a \in A} \big( r(s,a) + \gamma \sum_{s' \in S} P(s'|s,a)\, v_k(s') \big)$
5:   end for
6:   $k \leftarrow k + 1$
7: until $\|v_k - v_{k-1}\|_\infty \leq \epsilon$
8: return a policy $\pi_k \in \mathcal{G}(v_k)$:
\[
\pi_k(s) \in \operatorname*{argmax}_{a \in A} \Big( r(s,a) + \gamma \sum_{s' \in S} P(s'|s,a)\, v_k(s') \Big).
\]
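A compact NumPy sketch of Algorithm 2 for the same kind of toy arrays P, r, gamma (an illustration; the tolerance handling is a choice of the example):

```python
import numpy as np

def value_iteration(P, r, gamma, eps=1e-8):
    """Iterate v <- T_* v until the sup-norm change is below eps."""
    nS, nA, _ = P.shape
    v = np.zeros(nS)
    while True:
        q = r + gamma * P @ v              # q[s, a] = r(s, a) + gamma * E[v(S')]
        v_next = q.max(axis=1)             # componentwise Bellman optimality operator
        if np.max(np.abs(v_next - v)) <= eps:
            return v_next, q.argmax(axis=1)   # value estimate and a greedy policy
        v = v_next
```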


Quality of the obtained solution?

Stop the iterations when $\|v_{k+1} - v_k\|_\infty \leq \epsilon$.

Guarantee on the function $v_k$:
\[
\|v_* - v_k\|_\infty \leq \frac{1}{1 - \gamma}\, \epsilon.
\]
Guarantee on the policy $\pi_k$:
\[
\|v_* - v_{\pi_k}\|_\infty \leq \frac{2\gamma}{(1 - \gamma)^2}\, \epsilon.
\]

Policy iteration

Let $\pi$ be any policy and $v_\pi$ its value function. Let $\pi'$ be greedy with respect to $v_\pi$, $\pi' \in \mathcal{G}(v_\pi)$:
\[
\forall s \in S, \quad \pi'(s) \in \operatorname*{argmax}_{a \in A} \Big( r(s,a) + \gamma \sum_{s' \in S} P(s'|s,a)\, v_\pi(s') \Big).
\]
Then $\pi'$ is a better policy than $\pi$:
\[
v_{\pi'} \geq v_\pi.
\]
This suggests the following algorithmic scheme; iterate:
1 policy evaluation: solve $T_{\pi_k} v_{\pi_k} = v_{\pi_k}$;
2 policy improvement: compute $\pi_{k+1} \in \mathcal{G}(v_{\pi_k})$.


Algorithm 3: Policy iteration

Require: an initial $\pi_0 \in A^S$
1: $k = 0$
2: repeat
3:   solve (policy evaluation)
\[
v_k(s) = r(s, \pi_k(s)) + \gamma \sum_{s' \in S} P(s'|s, \pi_k(s))\, v_k(s'), \quad \forall s \in S
\]
4:   compute (policy improvement)
\[
\pi_{k+1}(s) \in \operatorname*{argmax}_{a \in A} \Big( r(s,a) + \gamma \sum_{s' \in S} P(s'|s,a)\, v_k(s') \Big)
\]
5:   $k \leftarrow k + 1$
6: until $v_k = v_{k-1}$
7: return the policy $\pi_k = \pi_*$
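Algorithm 3 in the same toy NumPy setting, with the evaluation step done by the linear solve seen earlier (again a sketch, not from the slides):

```python
import numpy as np

def policy_iteration(P, r, gamma):
    nS, nA, _ = P.shape
    pi = np.zeros(nS, dtype=int)
    while True:
        # Policy evaluation: v_k = (I - gamma * P_pi)^{-1} r_pi.
        P_pi = P[np.arange(nS), pi]
        r_pi = r[np.arange(nS), pi]
        v = np.linalg.solve(np.eye(nS) - gamma * P_pi, r_pi)
        # Policy improvement: greedy with respect to v_k.
        pi_next = (r + gamma * P @ v).argmax(axis=1)
        if np.array_equal(pi_next, pi):
            return pi, v
        pi = pi_next
```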

3 Approximate Dynamic Programming

DP requires:
- the state and action spaces to be small enough;
- the model to be known.

Unfortunately:
- the state space can be too large (even continuous) for the value function to be represented exactly; a parametric representation can be used instead, for example a linear one,
\[
v_\theta(s) = \theta^\top \phi(s) = \sum_{i=1}^{d} \theta_i \phi_i(s);
\]
- the model might be unknown and one has to rely on a dataset
\[
D = \{(s_i, a_i, r_i, s'_i)\}_{1 \leq i \leq n}.
\]
  - The dataset can be obtained in multiple ways.
  - The evaluation operator can be sampled (assume here $a_i = \pi(s_i)$):
\[
[\hat{T}_\pi v](s_i) = r_i + \gamma v(s'_i)
\]
    is unbiased: $E[[\hat{T}_\pi v](s_i) \mid s_i] = E_{S' \sim P(\cdot|s_i, a_i)}[r_i + \gamma v(S')] = [T_\pi v](s_i)$.

State-action value function

Problems with value functions

Computing a greedy policy requires knowing the model:
\[
\pi \in \mathcal{G}(v) \ \Leftrightarrow\ \forall s \in S,\ \pi(s) \in \operatorname*{argmax}_{a \in A} \Big( r(s,a) + \gamma \sum_{s' \in S} P(s'|s,a)\, v(s') \Big).
\]

Sampling the optimality operator? The optimality operator is
\[
[T_* v](s) = \max_{a \in A} E_{S' \sim P(\cdot|s,a)}\big[ r(s,a) + \gamma v(S') \big];
\]
with $s'_{i,a} \sim P(\cdot|s_i, a)$, a possible sampled operator is
\[
[\hat{T}_* v](s_i) = \max_{a \in A} \big( r(s_i, a) + \gamma v(s'_{i,a}) \big);
\]
it is biased: $E[[\hat{T}_* v](s_i) \mid s_i] \neq [T_* v](s_i)$.


State-action value function (also called Q-function and quality function):
\[
Q_\pi(s,a) = E\Big[\sum_{t=0}^{\infty} \gamma^t r(S_t, A_t) \,\Big|\, S_0 = s,\ A_0 = a,\ S_{t+1} \sim P(\cdot|S_t, A_t),\ A_{t+1} = \pi(S_{t+1})\Big].
\]

Bellman evaluation operator $T_\pi : \mathbb{R}^{S \times A} \to \mathbb{R}^{S \times A}$:
- definition: $[T_\pi Q](s,a) = r(s,a) + \gamma \sum_{s' \in S} P(s'|s,a)\, Q(s', \pi(s'))$;
- $Q_\pi$ is its unique fixed point: $T_\pi Q_\pi = Q_\pi$;
- link to $v_\pi$: $v_\pi(s) = Q_\pi(s, \pi(s))$.

Bellman optimality operator $T_* : \mathbb{R}^{S \times A} \to \mathbb{R}^{S \times A}$:
- definition: $[T_* Q](s,a) = r(s,a) + \gamma \sum_{s' \in S} P(s'|s,a) \max_{a' \in A} Q(s', a')$;
- $Q_*$ is its unique fixed point: $Q_* = T_* Q_*$;
- link to $v_*$: $v_*(s) = \max_{a \in A} Q_*(s,a)$.


The Q-function allows acting greedily:
- with respect to $v_\pi$ (using $v_\pi(s) = Q_\pi(s, \pi(s))$):
\[
\pi' \in \mathcal{G}(v_\pi) \ \Leftrightarrow\ \forall s \in S,\ \pi'(s) \in \operatorname*{argmax}_{a \in A} \Big( r(s,a) + \gamma \sum_{s' \in S} P(s'|s,a)\, v_\pi(s') \Big) \ \Leftrightarrow\ \forall s \in S,\ \pi'(s) \in \operatorname*{argmax}_{a \in A} Q_\pi(s,a);
\]
- with respect to $v_*$:
\[
\pi_*(s) \in \operatorname*{argmax}_{a \in A} Q_*(s,a);
\]
- with respect to any $Q \in \mathbb{R}^{S \times A}$:
\[
\pi \in \mathcal{G}(Q) \ \Leftrightarrow\ \forall s \in S,\ \pi(s) \in \operatorname*{argmax}_{a \in A} Q(s,a).
\]


The Q-function also allows sampling the related operators easily. Recall the dataset
\[
D = \{(s_i, a_i, r_i, s'_i)\}_{1 \leq i \leq n}.
\]
- Sampled Bellman evaluation operator: $[\hat{T}_\pi Q](s_i, a_i) = r_i + \gamma Q(s'_i, \pi(s'_i))$.
- Sampled Bellman optimality operator: $[\hat{T}_* Q](s_i, a_i) = r_i + \gamma \max_{a' \in A} Q(s'_i, a')$.

Features for the Q-function:
- linear parameterization of the Q-function: $Q_\theta(s,a) = \theta^\top \phi(s,a)$ with
\[
\phi(s,a) = \big[\delta_{a=a_1} \phi(s)^\top \ \dots\ \delta_{a=a_{|A|}} \phi(s)^\top\big]^\top
\]
  (see the sketch below);
- other representations are possible.
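The block-wise feature map above can be coded in a few lines; a small sketch in which the state features phi_s are assumed to be given:

```python
import numpy as np

def q_features(phi_s, a, n_actions):
    """phi(s, a): copy of phi(s) placed in the block of action a, zeros elsewhere."""
    d = phi_s.shape[0]
    phi_sa = np.zeros(d * n_actions)
    phi_sa[a * d:(a + 1) * d] = phi_s
    return phi_sa

def q_value(theta, phi_s, a, n_actions):
    # Q_theta(s, a) = theta^T phi(s, a)
    return theta @ q_features(phi_s, a, n_actions)
```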

Approximate value iteration

Value iteration: $Q_{k+1} = T_* Q_k$.
- $T_*$ cannot be applied, the model being unknown;
- with a large state space, even if $Q_k \in \mathcal{H}$, there is no reason for $T_* Q_k \in \mathcal{H}$ to hold.

Approximate value iteration (an introductory example):
- linear parameterization for the Q-functions: $\mathcal{H} = \{Q_\theta(s,a) = \theta^\top \phi(s,a),\ \theta \in \mathbb{R}^d\}$;
- writing $Q_k = Q_{\theta_k}$, sampled operator: $[\hat{T}_* Q_k](s_i, a_i) = r_i + \gamma \max_{a' \in A} Q_k(s'_i, a')$;
- search for the $Q \in \mathcal{H}$ closest to $\hat{T}_* Q_k$:
\[
Q_{k+1} \in \operatorname*{argmin}_{Q_\theta \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^{n} \big( Q_\theta(s_i, a_i) - [\hat{T}_* Q_k](s_i, a_i) \big)^2;
\]
- summary: $Q_{k+1} = \Pi T_* Q_k$.


Abstraction

Approximate value iteration: $Q_{k+1} = \mathcal{A} T_* Q_k$, where $\mathcal{A}$ is an abstract approximation operator.

$\mathcal{A} T_*$ should be a contraction!
- Otherwise divergence can (and will) occur.
- This is not the case for $\Pi T_*$... do not implement the introductory example!
- It is true if averagers are used for function approximation, such as:
  - ensembles of trees, notably extremely randomized trees (as in fitted-Q iteration);
  - kernel averagers (Nadaraya-Watson).


Algorithm 4: Approximate value iteration

Require: a dataset $D = \{(s_i, a_i, r_i, s'_i)\}_{1 \leq i \leq n}$, the number $K$ of iterations, a function approximator, an initial state-action value function $Q_0$
1: for $k = 0$ to $K$ do
2:   apply the sampled Bellman optimality operator to the function $Q_k$:
\[
[\hat{T}_* Q_k](s_i, a_i) = r_i + \gamma \max_{a' \in A} Q_k(s'_i, a')
\]
3:   solve the regression problem with inputs $(s_i, a_i)$ and outputs $[\hat{T}_* Q_k](s_i, a_i)$ to get the Q-function $Q_{k+1}$
4: end for
5: return the greedy policy $\pi_{K+1} \in \mathcal{G}(Q_{K+1})$:
\[
\forall s \in S, \quad \pi_{K+1}(s) \in \operatorname*{argmax}_{a \in A} Q_{K+1}(s,a).
\]
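A sketch of Algorithm 4 in the averager spirit of the previous slide, using scikit-learn's extremely randomized trees as the regressor (fitted-Q style). The dataset arrays S, A, R, Sp, the discrete action list and the start from Q_0 = 0 are assumptions of the example:

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

def fitted_q_iteration(S, A, R, Sp, actions, gamma, K=50):
    """S, A, R, Sp: sampled states, actions, rewards, next states (each state a
    feature vector, each action a scalar); `actions` lists the discrete actions."""
    X = np.column_stack([S, A])
    Q = None
    for _ in range(K):
        if Q is None:
            y = R                                   # Q_0 = 0, so first targets are rewards
        else:
            # max_a' Q_k(s'_i, a') for every sample i
            q_next = np.column_stack([
                Q.predict(np.column_stack([Sp, np.full(len(Sp), a)])) for a in actions
            ])
            y = R + gamma * q_next.max(axis=1)      # sampled Bellman optimality operator
        Q = ExtraTreesRegressor(n_estimators=50).fit(X, y)
    return Q

def greedy_action(Q, s, actions):
    return max(actions, key=lambda a: Q.predict(np.r_[s, a].reshape(1, -1))[0])
```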

Approximate policy iteration

Policy iteration:
1 policy evaluation: solve the fixed-point equation $Q_{\pi_k} = T_{\pi_k} Q_{\pi_k}$;
2 policy improvement: compute the greedy policy $\pi_{k+1} \in \mathcal{G}(Q_{\pi_k})$.

Approximate policy iteration:
1 approximate policy evaluation: find a function $Q_k \in \mathcal{H}$ such that $Q_k \approx T_{\pi_k} Q_k$;
2 policy improvement: compute the greedy policy $\pi_{k+1} \in \mathcal{G}(Q_k)$.


Algorithm 5: Approximate policy iteration

Require: an initial $\pi_0 \in A^S$ (possibly an initial $Q_0$ and $\pi_0 \in \mathcal{G}(Q_0)$), the number of iterations $K$
1: for $k = 0$ to $K$ do
2:   approximate policy evaluation: find $Q_k \in \mathcal{H}$ such that $Q_k \approx T_{\pi_k} Q_k$
3:   policy improvement: $\pi_{k+1} \in \mathcal{G}(Q_k)$
4: end for
5: return the policy $\pi_{K+1}$

Problem: how to find an approximate fixed point of $T_\pi$, that is, a function $Q_\theta \in \mathcal{H}$ such that $Q_\theta \approx T_\pi Q_\theta$?


Monte Carlo rollouts

An approximate fixed point of $T_\pi$ amounts to an approximation of $Q_\pi$. If $Q_\pi$ were known, this would simply be a regression problem; for example, linear least-squares:
\[
\min_{\theta \in \mathbb{R}^d} \frac{1}{n} \sum_{i=1}^{n} \big( Q_\pi(s_i, a_i) - Q_\theta(s_i, a_i) \big)^2.
\]
$Q_\pi$ is (obviously) unknown... Monte Carlo rollout:
- sample a full trajectory starting in $s_i$ where action $a_i$ is chosen first, all subsequent states being sampled according to the system dynamics and all subsequent actions being chosen according to $\pi$; write $q_i$ the associated discounted cumulative reward;
- this is an unbiased estimate: $E[q_i \mid s_i, a_i] = Q_\pi(s_i, a_i)$;
- replace $Q_\pi(s_i, a_i)$ by the unbiased estimate $q_i$:
\[
\min_{\theta \in \mathbb{R}^d} \frac{1}{n} \sum_{i=1}^{n} \big( q_i - Q_\theta(s_i, a_i) \big)^2.
\]
Drawbacks: this requires a simulator, and rollouts can be quite noisy.
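A minimal rollout estimator of Q_pi(s, a); the env object and its reset_to/step methods are a hypothetical simulator interface, not something defined in the slides. Averaging several such rollouts per pair reduces the noise mentioned above:

```python
def rollout_q(env, s, a, pi, gamma, horizon=200):
    """One Monte Carlo estimate q_i of Q_pi(s, a): play a first, then follow pi."""
    env.reset_to(s)                      # hypothetical: put the simulator in state s
    total, discount, action = 0.0, 1.0, a
    for _ in range(horizon):             # truncated rollout (gamma**horizon is negligible)
        s, r, done = env.step(action)    # hypothetical simulator interface
        total += discount * r
        discount *= gamma
        if done:
            break
        action = pi(s)
    return total
```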


Residual approach

Idea: minimize the residual $\|Q_\theta - T_\pi Q_\theta\|$ for some norm. With an $\ell_2$-loss, a parametric representation and the sampled operator:
\[
\min_{\theta \in \mathbb{R}^d} \frac{1}{n} \sum_{i=1}^{n} \big( [\hat{T}_\pi Q_\theta](s_i, a_i) - Q_\theta(s_i, a_i) \big)^2
= \min_{\theta \in \mathbb{R}^d} \frac{1}{n} \sum_{i=1}^{n} \big( r_i + \gamma Q_\theta(s'_i, \pi(s'_i)) - Q_\theta(s_i, a_i) \big)^2.
\]
However, there is a bias problem:
\[
\begin{aligned}
E\big[ \big( [\hat{T}_\pi Q_\theta](s_i, a_i) - Q_\theta(s_i, a_i) \big)^2 \mid s_i, a_i \big]
&= \big( [T_\pi Q_\theta](s_i, a_i) - Q_\theta(s_i, a_i) \big)^2 + \mathrm{var}\big( [\hat{T}_\pi Q_\theta](s_i, a_i) \mid s_i, a_i \big) \\
&\neq \big( [T_\pi Q_\theta](s_i, a_i) - Q_\theta(s_i, a_i) \big)^2.
\end{aligned}
\]


Least-Squares Temporal Differences

Idea: solve $Q_\theta = \Pi T_\pi Q_\theta$, written as a nested optimization problem:
\[
\begin{cases}
w_\theta = \operatorname*{argmin}_{w \in \mathbb{R}^d} \frac{1}{n} \sum_{i=1}^{n} \big( r_i + \gamma Q_\theta(s'_i, \pi(s'_i)) - Q_w(s_i, a_i) \big)^2 \\
\theta_n = \operatorname*{argmin}_{\theta \in \mathbb{R}^d} \frac{1}{n} \sum_{i=1}^{n} \big( Q_\theta(s_i, a_i) - Q_{w_\theta}(s_i, a_i) \big)^2
\end{cases}.
\]


Least-Squares Temporal Differences (cont.)

Optimization problem (linear parameterization assumed):
\[
\begin{cases}
w_\theta = \operatorname*{argmin}_{w \in \mathbb{R}^d} \frac{1}{n} \sum_{i=1}^{n} \big( r_i + \gamma Q_\theta(s'_i, \pi(s'_i)) - Q_w(s_i, a_i) \big)^2 \\
\theta_n = \operatorname*{argmin}_{\theta \in \mathbb{R}^d} \frac{1}{n} \sum_{i=1}^{n} \big( Q_\theta(s_i, a_i) - Q_{w_\theta}(s_i, a_i) \big)^2
\end{cases}.
\]
The first equation is a linear least-squares problem in $w$:
\[
w_\theta = \Big( \sum_{i=1}^{n} \phi(s_i, a_i)\, \phi(s_i, a_i)^\top \Big)^{-1} \sum_{i=1}^{n} \phi(s_i, a_i) \big( r_i + \gamma\, \theta^\top \phi(s'_i, \pi(s'_i)) \big).
\]
The second equation is minimized for $\theta = w_\theta$:
\[
\begin{aligned}
\theta_n = w_{\theta_n} \ &\Leftrightarrow\ \theta_n = \Big( \sum_{i=1}^{n} \phi(s_i, a_i)\, \phi(s_i, a_i)^\top \Big)^{-1} \sum_{i=1}^{n} \phi(s_i, a_i) \big( r_i + \gamma\, \theta_n^\top \phi(s'_i, \pi(s'_i)) \big) \\
&\Leftrightarrow\ \theta_n = \Big( \sum_{i=1}^{n} \phi(s_i, a_i) \big( \phi(s_i, a_i) - \gamma\, \phi(s'_i, \pi(s'_i)) \big)^\top \Big)^{-1} \sum_{i=1}^{n} \phi(s_i, a_i)\, r_i.
\end{aligned}
\]
API with LSTD as the evaluation step is named LSPI (Least-Squares Policy Iteration).
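The closed form for theta_n is a single linear solve; a NumPy sketch in which Phi and Phi_next stack the feature vectors phi(s_i, a_i) and phi(s'_i, pi(s'_i)) (assumed precomputed), and where the small ridge term is an addition of the example for numerical stability:

```python
import numpy as np

def lstd(Phi, Phi_next, R, gamma, reg=1e-6):
    """theta_n = (sum_i phi_i (phi_i - gamma * phi'_i)^T)^{-1} sum_i phi_i r_i."""
    A = Phi.T @ (Phi - gamma * Phi_next)       # d x d matrix
    b = Phi.T @ R                              # d-dimensional vector
    return np.linalg.solve(A + reg * np.eye(A.shape[0]), b)
```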


Algorithm 6: Least-squares policy iteration

Require: an initial $\pi_0 \in A^S$ (possibly an initial $Q_0$ and $\pi_0 \in \mathcal{G}(Q_0)$), the number of iterations $K$
1: for $k = 0$ to $K$ do
2:   approximate policy evaluation:
\[
\theta_k = \Big( \sum_{i=1}^{n} \phi(s_i, a_i) \big( \phi(s_i, a_i) - \gamma\, \phi(s'_i, \pi_k(s'_i)) \big)^\top \Big)^{-1} \sum_{i=1}^{n} \phi(s_i, a_i)\, r_i
\]
3:   policy improvement: $\pi_{k+1} \in \mathcal{G}(Q_{\theta_k})$
4: end for
5: return the policy $\pi_{K+1}$


Approximating the policy

Idea: instead of generalizing the Q-function, generalize the policy. Motivation: a policy might be easier to learn than a Q-function.

At iteration $k$, let $\mathcal{F} \subset A^S$ be a hypothesis space of policies, assume that the $Q_{\pi_k}(s_i, a)$ are known, and solve the cost-sensitive multi-class classification problem
\[
\pi_{k+1} \in \operatorname*{argmin}_{\pi \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^{n} \Big( \max_{a \in A} Q_{\pi_k}(s_i, a) - Q_{\pi_k}(s_i, \pi(s_i)) \Big).
\]
In practice, replace $Q_{\pi_k}(s_i, a)$ by a Monte Carlo rollout. This approach is often called DPI, for Direct Policy Iteration.

4 Online learning

For ADP, we assumed that the dataset was provided. What about online learning?
- It requires an online learner;
- there is a dilemma between exploration and exploitation.

SARSA and Q-learning

SARSA

Goal: online estimation of $Q_\pi$, for a given $\pi$. Assume a linear parameterization and, for a moment, that $Q_\pi$ is known; the risk of interest is
\[
\min_{\theta \in \mathbb{R}^d} \frac{1}{n} \sum_{i=1}^{n} \big( Q_\pi(s_i, a_i) - Q_\theta(s_i, a_i) \big)^2.
\]
Minimize it with a stochastic gradient descent:
\[
\begin{aligned}
\theta_{i+1} &= \theta_i - \frac{\alpha_i}{2} \nabla_\theta \big( Q_\pi(s_i, a_i) - Q_\theta(s_i, a_i) \big)^2 \Big|_{\theta = \theta_i} \\
&= \theta_i + \alpha_i \phi(s_i, a_i) \big( Q_\pi(s_i, a_i) - \theta_i^\top \phi(s_i, a_i) \big).
\end{aligned}
\]
$Q_\pi(s_i, a_i)$ is unknown; bootstrap it (with $s_{i+1} \sim P(\cdot|s_i, a_i)$ and $a_{i+1} = \pi(s_{i+1})$):
\[
Q_\pi(s_i, a_i) \to [\hat{T}_\pi Q_{\theta_i}](s_i, a_i) = r_i + \gamma Q_{\theta_i}(s_{i+1}, a_{i+1}).
\]
Replace $Q_\pi(s_i, a_i)$ by this estimate in the update rule:
\[
\begin{aligned}
\theta_{i+1} &= \theta_i + \alpha_i \phi(s_i, a_i) \big( r_i + \gamma Q_{\theta_i}(s_{i+1}, a_{i+1}) - Q_{\theta_i}(s_i, a_i) \big) \\
&= \theta_i + \alpha_i \phi(s_i, a_i) \big( r_i + \gamma\, \theta_i^\top \phi(s_{i+1}, a_{i+1}) - \theta_i^\top \phi(s_i, a_i) \big).
\end{aligned}
\]
This is called a temporal difference algorithm.


Algorithm 7: SARSA

Require: an initial parameter vector $\theta_0$, the initial state $s_0$, an initial action $a_0$, the learning rates $(\alpha_i)_{i \geq 0}$
1: $i = 0$
2: while true do
3:   apply action $a_i$ in state $s_i$
4:   get the reward $r_i$ and observe the new state $s_{i+1}$
5:   choose the action $a_{i+1}$ to be applied in state $s_{i+1}$
6:   update the parameter vector of the Q-function according to the transition $(s_i, a_i, r_i, s_{i+1}, a_{i+1})$:
\[
\theta_{i+1} = \theta_i + \alpha_i \phi(s_i, a_i) \big( r_i + \gamma\, \theta_i^\top \phi(s_{i+1}, a_{i+1}) - \theta_i^\top \phi(s_i, a_i) \big)
\]
7:   $i \leftarrow i + 1$
8: end while

Remark: SARSA is an on-policy algorithm.
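The parameter update of Algorithm 7 written as a function (a sketch; the linear feature map phi(s, a) is assumed to be given):

```python
import numpy as np

def sarsa_update(theta, phi, s, a, r, s_next, a_next, gamma, alpha):
    """One SARSA step: theta += alpha * phi(s,a) * (r + gamma*Q(s',a') - Q(s,a))."""
    td_error = r + gamma * theta @ phi(s_next, a_next) - theta @ phi(s, a)
    return theta + alpha * td_error * phi(s, a)
```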


Q-learning

Goal: direct online estimation of $Q_*$. Assume a linear parameterization and, for a moment, that $Q_*$ is known; the risk of interest is
\[
\min_{\theta \in \mathbb{R}^d} \frac{1}{n} \sum_{i=1}^{n} \big( Q_*(s_i, a_i) - Q_\theta(s_i, a_i) \big)^2.
\]
Minimize it with a stochastic gradient descent:
\[
\begin{aligned}
\theta_{i+1} &= \theta_i - \frac{\alpha_i}{2} \nabla_\theta \big( Q_*(s_i, a_i) - Q_\theta(s_i, a_i) \big)^2 \Big|_{\theta = \theta_i} \\
&= \theta_i + \alpha_i \phi(s_i, a_i) \big( Q_*(s_i, a_i) - \theta_i^\top \phi(s_i, a_i) \big).
\end{aligned}
\]
$Q_*(s_i, a_i)$ is unknown; bootstrap it (with $s_{i+1} \sim P(\cdot|s_i, a_i)$):
\[
Q_*(s_i, a_i) \to [\hat{T}_* Q_{\theta_i}](s_i, a_i) = r_i + \gamma \max_{a \in A} Q_{\theta_i}(s_{i+1}, a).
\]
Replace $Q_*(s_i, a_i)$ by this estimate in the update rule:
\[
\begin{aligned}
\theta_{i+1} &= \theta_i + \alpha_i \phi(s_i, a_i) \Big( r_i + \gamma \max_{a \in A} Q_{\theta_i}(s_{i+1}, a) - Q_{\theta_i}(s_i, a_i) \Big) \\
&= \theta_i + \alpha_i \phi(s_i, a_i) \Big( r_i + \gamma \max_{a \in A} \big( \theta_i^\top \phi(s_{i+1}, a) \big) - \theta_i^\top \phi(s_i, a_i) \Big).
\end{aligned}
\]


Algorithm 8: Q-learning

Require: an initial parameter vector $\theta_0$, the initial state $s_0$, the learning rates $(\alpha_i)_{i \geq 0}$
1: $i = 0$
2: while true do
3:   choose the action $a_i$ to be applied in state $s_i$
4:   apply action $a_i$ in state $s_i$
5:   get the reward $r_i$ and observe the new state $s_{i+1}$
6:   update the parameter vector of the Q-function according to the transition $(s_i, a_i, r_i, s_{i+1})$:
\[
\theta_{i+1} = \theta_i + \alpha_i \phi(s_i, a_i) \Big( r_i + \gamma \max_{a \in A} \big( \theta_i^\top \phi(s_{i+1}, a) \big) - \theta_i^\top \phi(s_i, a_i) \Big)
\]
7:   $i \leftarrow i + 1$
8: end while

Remark: Q-learning is an off-policy algorithm.
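The corresponding Q-learning step, which differs only in the bootstrapped target (same assumed feature map phi and a list of discrete actions):

```python
import numpy as np

def q_learning_update(theta, phi, s, a, r, s_next, actions, gamma, alpha):
    """One Q-learning step: the target uses max_a' Q(s', a') instead of Q(s', a')."""
    q_next = max(theta @ phi(s_next, b) for b in actions)
    td_error = r + gamma * q_next - theta @ phi(s, a)
    return theta + alpha * td_error * phi(s, a)
```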

The exploration-exploitation dilemma

With SARSA or Q-learning, which action should be applied?
- Acting always greedily is not wise;
- there is a dilemma between exploration and exploitation.

$\epsilon$-greedy policy:
\[
\pi_\epsilon(s) =
\begin{cases}
\operatorname*{argmax}_{a \in A} Q_\theta(s,a) & \text{with probability } 1 - \epsilon \\
\text{a random action} & \text{with probability } \epsilon
\end{cases}.
\]

Softmax (stochastic) policy ($\tau$ is the temperature parameter):
\[
\pi_\tau(a|s) = \frac{e^{\frac{1}{\tau} Q_\theta(s,a)}}{\sum_{a' \in A} e^{\frac{1}{\tau} Q_\theta(s,a')}}.
\]

Other schemes exist.
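Both exploration schemes in a few lines (a sketch; q_values(s) is an assumed helper returning the vector of Q_theta(s, .) over actions):

```python
import numpy as np

rng = np.random.default_rng()

def epsilon_greedy(q_values, s, n_actions, eps=0.1):
    if rng.random() < eps:
        return int(rng.integers(n_actions))      # explore: random action
    return int(np.argmax(q_values(s)))           # exploit: greedy action

def softmax_action(q_values, s, tau=1.0):
    q = q_values(s) / tau
    p = np.exp(q - q.max())                      # subtract max for numerical stability
    p /= p.sum()
    return int(rng.choice(len(p), p=p))
```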

5 Policy search and actor-critic methods

What about continuous actions (the max and argmax are no longer tractable)? Policy search: parameterize the policy and search directly in the policy space.

We will use stochastic policies:
- a stochastic policy $\pi \in \Delta_A^S$ associates to each state $s$ a conditional probability over actions $\pi(\cdot|s)$;
- everything already defined extends naturally to stochastic policies:
\[
\begin{aligned}
v_\pi(s) &= \sum_{a \in A} \pi(a|s) \Big( r(s,a) + \gamma \sum_{s' \in S} P(s'|s,a)\, v_\pi(s') \Big), \\
v_\pi(s) &= \sum_{a \in A} \pi(a|s)\, Q_\pi(s,a), \\
Q_\pi(s,a) &= r(s,a) + \gamma \sum_{s' \in S} P(s'|s,a)\, v_\pi(s').
\end{aligned}
\]


Examples of parameterized policies:
- discrete actions:
\[
\pi_\theta(a|s) = \frac{e^{\theta^\top \phi(s,a)}}{\sum_{a' \in A} e^{\theta^\top \phi(s,a')}};
\]
- continuous (here one-dimensional) actions:
\[
\pi_\theta(a|s) \propto e^{-\frac{1}{2} \left( \frac{a - \theta^\top \phi(s)}{\sigma} \right)^2}.
\]

The policy search problem:
- let $\nu \in \Delta_S$ be a user-defined distribution over states;
- solve
\[
\max_{\theta \in \mathbb{R}^d} J(\theta) \quad \text{with} \quad J(\theta) = \sum_{s \in S} \nu(s)\, v_{\pi_\theta}(s) = E_{S \sim \nu}[v_{\pi_\theta}(S)].
\]

Difference with (A)DP:
- DP: find a policy that maximizes the value for every state;
- policy search: find a policy that maximizes the value on average.

The policy gradient theorem

A natural approach is gradient ascent:
\[
\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta).
\]
What is the gradient $\nabla_\theta J(\theta)$? Define $d_{\nu,\pi} \in \Delta_S$, the $\gamma$-weighted occupancy measure:
\[
d_{\nu,\pi} = (1 - \gamma)\, \nu^\top (I - \gamma P_\pi)^{-1}.
\]

Theorem (Policy gradient). Let $\pi_\theta$ be such that $\pi_\theta(a|s) > 0$ for all $s, a$. We have
\[
\begin{aligned}
\nabla_\theta J(\theta) &= \frac{1}{1 - \gamma} \sum_{s \in S} d_{\nu,\pi}(s) \sum_{a \in A} \pi_\theta(a|s)\, Q_{\pi_\theta}(s,a)\, \nabla_\theta \ln \pi_\theta(a|s) \\
&= \frac{1}{1 - \gamma}\, E_{S \sim d_{\nu,\pi},\, A \sim \pi_\theta(\cdot|S)}\big[ Q_{\pi_\theta}(S,A)\, \nabla_\theta \ln \pi_\theta(A|S) \big].
\end{aligned}
\]
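A sampled, REINFORCE-style version of the theorem for a softmax policy over discrete actions (a sketch; the feature map phi(s, a) is assumed given, Q_pi is estimated by the discounted return-to-go, and the trajectories are assumed to have been generated with pi_theta):

```python
import numpy as np

def softmax_probs(theta, phi, s, actions):
    logits = np.array([theta @ phi(s, a) for a in actions])
    p = np.exp(logits - logits.max())
    return p / p.sum()

def grad_log_pi(theta, phi, s, a, actions):
    # grad log pi_theta(a|s) = phi(s,a) - sum_a' pi_theta(a'|s) phi(s,a')
    p = softmax_probs(theta, phi, s, actions)
    return phi(s, a) - sum(p_b * phi(s, b) for p_b, b in zip(p, actions))

def policy_gradient_estimate(theta, phi, episodes, actions, gamma):
    """episodes: list of trajectories [(s_0, a_0, r_0), ...] sampled with pi_theta.
    Averages sum_t gamma^t * q_hat_t * grad log pi(a_t|s_t) over trajectories,
    with q_hat_t the discounted return-to-go (Monte Carlo estimate of Q_pi)."""
    grad = np.zeros_like(theta)
    for traj in episodes:
        g, returns = 0.0, []
        for (_, _, r) in reversed(traj):           # returns-to-go, computed backwards
            g = r + gamma * g
            returns.append(g)
        returns.reverse()
        for t, ((s, a, _), q_hat) in enumerate(zip(traj, returns)):
            grad += (gamma ** t) * q_hat * grad_log_pi(theta, phi, s, a, actions)
    return grad / len(episodes)
```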

Actor-critic methods

Policy gradient:
\[
\nabla_\theta J(\theta) = \frac{1}{1 - \gamma}\, E_{S \sim d_{\nu,\pi},\, A \sim \pi_\theta(\cdot|S)}\big[ Q_{\pi_\theta}(S,A)\, \nabla_\theta \ln \pi_\theta(A|S) \big].
\]
$Q_\pi$ can be estimated pointwise using Monte Carlo rollouts; policy search alone is then called an actor method (DPI too). Can we replace $Q_\pi$ by an approximation $Q_w \in \mathcal{H}$ without changing the gradient? If so, the resulting approach is called an actor-critic method.


Policy gradient with a critic

Theorem (Policy gradient with a critic). If the parameterization of the state-action value function is compatible, in the sense that
\[
\forall (s,a) \in S \times A, \quad \nabla_\theta \ln \pi_\theta(a|s) = \nabla_w Q_w(s,a),
\]
and if $Q_w$ is a local optimum of the risk based on the $\ell_2$-loss, with state-action distribution given by $d_{\nu,\pi}$ and with the target function being $Q_\pi$, that is,
\[
\nabla_w E_{d_{\nu,\pi}}\big[ (Q_{\pi_\theta}(S,A) - Q_w(S,A))^2 \big] = 0,
\]
then the gradient satisfies
\[
\nabla_\theta J(\theta) = E_{d_{\nu,\pi}}\big[ Q_w(S,A)\, \nabla_\theta \ln \pi_\theta(A|S) \big].
\]


Policy gradient with a critic (cont.)

Example of a compatible approximation.
- Softmax policy: $\pi_\theta(a|s) = \frac{e^{\theta^\top \phi(s,a)}}{\sum_{a' \in A} e^{\theta^\top \phi(s,a')}}$.
- Gradient: $\nabla_\theta \ln \pi_\theta(a|s) = \phi(s,a) - \sum_{a' \in A} \pi_\theta(a'|s)\, \phi(s,a')$.
- Compatible approximation: $Q_w(s,a) = w^\top \big( \phi(s,a) - \sum_{a' \in A} \pi_\theta(a'|s)\, \phi(s,a') \big)$.
  - It is not a Q-function, as $\sum_{a \in A} \pi_\theta(a|s)\, Q_w(s,a) = 0$;
  - it is more an advantage function: $A_\pi(s,a) = Q_\pi(s,a) - v_\pi(s)$.
- Yet, for any $v \in \mathbb{R}^S$, $E_{d_{\nu,\pi}}[v(S)\, \nabla_\theta \ln \pi_\theta(A|S)] = 0$. As the term $w^\top \sum_{a' \in A} \pi_\theta(a'|s)\, \phi(s,a')$ does not depend on $a$, a compatible approximation is also given by
\[
Q_w(s,a) = w^\top \phi(s,a).
\]


Natural policy gradient

Natural gradient:
- the gradient premultiplied by the inverse of the Fisher information matrix;
- instead of following the steepest direction in the parameter space, it follows the steepest direction with respect to the Fisher metric;
- it tends to be much more efficient empirically.

In our case, the natural gradient $\tilde{\nabla}$ is
\[
\tilde{\nabla}_\theta J(\theta) = F(\theta)^{-1} \nabla_\theta J(\theta)
\quad \text{with} \quad
F(\theta) = E_{d_{\nu,\pi}}\big[ \nabla_\theta \ln \pi_\theta(A|S)\, (\nabla_\theta \ln \pi_\theta(A|S))^\top \big].
\]


Natural policy gradient (cont.)

Theorem (Natural policy gradient with a critic). If the parameterization of the state-action value function is compatible, in the sense that
\[
\forall (s,a) \in S \times A, \quad \nabla_\theta \ln \pi_\theta(a|s) = \nabla_w Q_w(s,a),
\]
and if $Q_w$ is a local optimum of the risk based on the $\ell_2$-loss, with state-action distribution given by $d_{\nu,\pi}$ and with the target function being $Q_\pi$, that is,
\[
\nabla_w E_{d_{\nu,\pi}}\big[ (Q_{\pi_\theta}(S,A) - Q_w(S,A))^2 \big] = 0,
\]
then the natural gradient satisfies
\[
\tilde{\nabla}_\theta J(\theta) = w.
\]