TRANSCRIPT
Reinforcement Learning
Policy Search
Continuous state/action spaces, policy gradient, natural policy gradient, actor-critic, natural actor-critic
Vien Ngo, MLR, University of Stuttgart
Outline
• Policy search overview: exploration, evaluation, updates
• Policy gradient methods
  – Plain policy gradient
  – Second-order policy gradient: Gauss-Newton, natural gradient
  – Actor-critic, natural actor-critic
• Inference-based methods
• Information-theoretic methods
• Black-box optimization methods
Policy Search: motivations
• Drawbacks of value-based methods:
  – Inefficient in high-dimensional spaces
  – Value function approximation suffers from bias due to the use of bootstrapping (value updates based on current estimates)
  – The approximated value function might be discontinuous (e.g. in under-actuated systems)
• Advantages of policy search:
  – Guaranteed convergence properties
  – Learning in high-dimensional problems
  – Enables imitation learning and learning from demonstration effectively
  – Integrates other optimal control policies straightforwardly
• Disadvantage: computationally inefficient
Policy Search
• Directly optimize over the parameter space θ ∈ Θ of parameterized policies. For example:
  – Linear policies: π(s) = θᵀφ(s)
  – Stochastic Gaussian policies: π(a|s, θ) ∝ e^{−f(s,a)}
• Linear parametrization: f(s, a) = ½ (Wφ(s) − a)ᵀ Σ⁻¹ (Wφ(s) − a), where a ∈ R^n, there is one parameter vector θ_i ∈ Θ per action dimension, and W = (θ_1ᵀ; …; θ_nᵀ) stacks them as rows; hence θ = (W, Σ)
• Functional policy: f(s, a) = ½ (h(s) − a)ᵀ Σ⁻¹ (h(s) − a); hence θ = (h, Σ)
• (Recurrent) neural network: a = f(θ, s) (e.g. autonomous helicopter, AlphaGo, robotic control tasks)
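As a sketch, the linear-Gaussian parametrization above can be written in a few lines of Python (the feature map, dimensions, and noise level here are hypothetical choices, not from the lecture):

```python
import numpy as np

def phi(s):
    """Hypothetical feature map: the raw state plus a constant bias feature."""
    return np.append(s, 1.0)

class LinearGaussianPolicy:
    """pi(a|s, theta) = N(a | W phi(s), Sigma), with theta = (W, Sigma)."""
    def __init__(self, n_features, n_actions, noise=0.1):
        self.W = np.zeros((n_actions, n_features))   # row i is theta_i^T
        self.Sigma = noise * np.eye(n_actions)       # exploration covariance

    def mean(self, s):
        return self.W @ phi(s)

    def sample(self, s):
        return np.random.multivariate_normal(self.mean(s), self.Sigma)

policy = LinearGaussianPolicy(n_features=4, n_actions=2)
a = policy.sample(np.array([0.1, -0.3, 0.5]))   # 3 state dims + bias = 4 features
```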
Policy Search
Algorithm 1 Model-Free Policy Search
1: while (!converged) do
2: EXPLORATION: generate trajectories D = {τ_i} using π(θ_k)
3: EVALUATION: evaluate the quality of π(θ_k) using D
4: UPDATE: update π(θ_{k+1}) given the evaluations and trajectories
5: end while
(Deisenroth, Neumann and Peters, 2011)
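Algorithm 1 translates into a short generic loop; the sketch below assumes problem-specific callables `rollout`, `evaluate`, and `update` (hypothetical names) that implement the three steps:

```python
def policy_search(theta0, rollout, evaluate, update, n_iters=100):
    """Generic explore-evaluate-update loop of Algorithm 1 (a sketch)."""
    theta = theta0
    for k in range(n_iters):
        D = rollout(theta)                  # EXPLORATION: trajectories from pi(theta_k)
        scores = evaluate(theta, D)         # EVALUATION: e.g. returns or Q estimates
        theta = update(theta, D, scores)    # UPDATE: e.g. a gradient step
    return theta
```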
Exploration in Model-Free Policy Search
• Exploration in action space vs. exploration in parameter space
• Episode-based vs. step-based exploration
• Uncorrelated vs. correlated exploration
Exploration in Model-Free Policy Search
• Exploration in action space: exploration noise is added directly to the executed actions (e.g. REINFORCE, eNAC); with step-based exploration (e.g. PoWER, PI²) and u_t = φ_t(x)ᵀθ,
  π_θ(u_t|x_t) = N(u_t | θᵀφ(x_t), φ(x_t)ᵀ Σ φ(x_t))
• Exploration in parameter space: the parameters are perturbed at the beginning of an episode (or, alternatively, at each time step), e.g. in black-box approaches:
  u = µ_u(x, θ), θ ∼ N(θ | µ_ω, Σ_θ), where ω parametrizes the upper-level policy
[Figure: graphical model of learning an upper-level policy (Deisenroth, Neumann, Peters, 2011)]
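The two exploration schemes can be contrasted in code; this is a minimal sketch in which the feature map and the noise parameters are assumptions:

```python
import numpy as np

def phi(x):
    """Hypothetical feature map."""
    return np.append(x, 1.0)

def act_with_action_noise(theta, x, sigma_u=0.1):
    """Action-space exploration: Gaussian noise is added to the executed
    action at every time step, u_t ~ N(theta^T phi(x_t), sigma_u^2 I)."""
    mean_u = theta.T @ phi(x)
    return mean_u + sigma_u * np.random.randn(*mean_u.shape)

def sample_episode_parameters(mu_omega, Sigma_theta):
    """Parameter-space exploration: draw a perturbed parameter vector once
    per episode from the upper-level policy N(mu_omega, Sigma_theta), then
    execute u = mu_u(x, theta) deterministically for the whole episode."""
    return np.random.multivariate_normal(mu_omega, Sigma_theta)
```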
Exploration in Model-Free Policy Search
• Step-based exploration (e.g. PoWER, PI²): e.g. u_t = φ_t(x)ᵀθ as a lower-level policy, which can be rewritten as
  π_θ(u_t|x_t, t) = N(u_t | φ_t(x)ᵀµ_θ, φ_t(x)ᵀ Σ_θ φ_t(x))
• Episode-based exploration: e.g. sampling θ ∼ N(θ | µ_ω, Σ_θ) at the beginning of each episode
Exploration in Model-Free Policy Search
• Uncorrelated exploration: Σ_θ is diagonal
• Correlated exploration: Σ_θ is a full covariance matrix. This strategy often gives better learning speed, but requires a large amount of data to estimate Σ_θ accurately (CMA-ES is an efficient choice here, but should not be used when |Θ| > 50)
Step-based Policy Evaluation
Given a set of sampled trajectories D = {ξ_i = (x_t^i, u_t^i)_{t=0}^{T_i}}_{i=1}^{M}:
• the expected cost-to-go of (x_t^i, u_t^i) is
  Q_t^i = Q_t^π(x_t^i, u_t^i) = E_{π_θ(ξ)}[ Σ_{k=t}^{T_i} c(x_k, u_k) | x_t = x_t^i, u_t = u_t^i ] ≈ Σ_{k=t}^{T_i} c(x_k^i, u_k^i)
• the step-based update is based on the data set D′ = {{x_t^i, u_t^i, Q_t^i}_{t=0}^{T_i}}_{i=1}^{M}
• algorithms: REINFORCE, G(PO)MDP, NAC, eNAC, PoWER, PI².
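A sketch of the Monte-Carlo cost-to-go computation above; the trajectory and cost containers are assumed formats, not a prescribed interface:

```python
import numpy as np

def step_based_evaluation(trajectories, costs):
    """Build D' = {(x_t^i, u_t^i, Q_t^i)} with the Monte-Carlo estimate
    Q_t^i ~ sum_{k=t}^{T_i} c(x_k^i, u_k^i) (undiscounted, as on the slide).
    trajectories[i] is a list of (x_t, u_t) pairs and costs[i] the matching
    per-step costs of trajectory i."""
    D_prime = []
    for traj, c in zip(trajectories, costs):
        # a reversed cumulative sum yields the cost-to-go at every step
        Q = np.cumsum(np.asarray(c)[::-1])[::-1]
        D_prime.append([(x, u, q) for (x, u), q in zip(traj, Q)])
    return D_prime
```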
Episode-based Policy Evaluation
Given a set of sampled trajectories D = {ξ_i = (x_t^i, u_t^i)_{t=0}^{T_i}}_{i=1}^{M}:
• use the expected cost of a parameter θ_i,
  C_i = C(θ_i) = E_{π_θ(ξ)}[ Σ_{k=0}^{T_i} c(x_k, u_k) | θ = θ_i ]
• the episode-based update is based on the data set D′ = {θ_i, C_i}_{i=1}^{M}
• algorithms: episode-based REPS, episode-based PI², PEPG, NES, CMA-ES, RWR
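The episode-based counterpart only needs one scalar per sampled parameter vector; a sketch, where `rollout_cost(theta)` is an assumed environment interface returning one episode's total cost:

```python
import numpy as np

def episode_based_evaluation(thetas, rollout_cost, n_rollouts=5):
    """Estimate C(theta_i) = E[sum_k c(x_k, u_k) | theta = theta_i] by
    averaging a few rollouts per sampled parameter vector."""
    return [(theta, np.mean([rollout_cost(theta) for _ in range(n_rollouts)]))
            for theta in thetas]
```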
Step-based vs. Episode-based
• Episode-based evaluation might suffer from large variance.
• Episode-based methods explore (almost) directly in parameter space.
• The choice between the two depends on the given problem.
Policy Update Strategies
• Policy gradient methods: finite difference methods, REINFORCE, G(PO)MDP, natural gradient, Natural Actor-Critic (NAC), episodic NAC, etc.
• Inference-based techniques
• Information-theoretic techniques
• Black-box techniques
Policy gradient
• Family of randomized policies µ(s, a) = p(a|s) (a deterministic policy is a special case).
• A performance measure is
  J(µ) = E_µ[ r_0 + γ r_1 + γ² r_2 + … ]
• Parameterize the policy family µ_θ(s, a) by a parameter vector θ ∈ R^d.
• The parametric performance measure becomes
  J(θ) = E_θ[ r_0 + γ r_1 + γ² r_2 + … ]
• Solution: gradient-ascent algorithms
  θ_{k+1} = θ_k + α ∇_θ J(θ_k)
Policy gradient
• Policy gradient update:
  θ_{µ′} = θ_µ + α ∇_θ J(θ_µ)
• Guaranteed performance improvement: J(θ_{µ′}) ≥ J(θ_µ) ⇒ µ′ is at least as good as µ
Policy gradient: Black-box approaches
• Approximate the gradient using supervised learning (regression).
• Collect data D = {δθ_i, δJ_i}_{i=1}^{M} (the sampled gradient information) by
  – perturbing the parameters: θ + δθ_i
  – applying the perturbed policy µ(θ + δθ_i) to get δJ_i = J(θ + δθ_i) − J(θ)
• The finite-difference (FD) gradient estimate is obtained by regression:
  g_FD(θ) = (∆Θᵀ∆Θ)⁻¹ ∆Θᵀ ∆J
  where ∆Θ = (δθ_1ᵀ; …; δθ_Mᵀ) stacks the perturbations as rows and ∆J = (δJ_1, …, δJ_M)ᵀ
• Gradient update:
  θ ← θ + α g_FD(θ)
(Jan Peters, Scholarpedia 2010)
• This method works well if the return J is not too noisy.
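A sketch of the FD estimator; the perturbation scale and sample count are hypothetical, and `J(theta)` stands for one (possibly noisy) rollout return:

```python
import numpy as np

def fd_gradient(J, theta, n_samples=20, eps=1e-2):
    """Finite-difference gradient by regression,
    g_FD = (DTheta^T DTheta)^{-1} DTheta^T DJ."""
    J0 = J(theta)
    dTheta = eps * np.random.randn(n_samples, len(theta))   # perturbations
    dJ = np.array([J(theta + d) - J0 for d in dTheta])      # dJ_i = J(theta + dtheta_i) - J(theta)
    # lstsq solves the same normal equations, but is numerically safer
    g, *_ = np.linalg.lstsq(dTheta, dJ, rcond=None)
    return g
```

When J is noisy, averaging several evaluations of J(θ + δθ_i) per perturbation reduces the variance of the regression targets.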
Policy gradient: Likelihood Ratio Gradient
• Rewrite the performance measure J(θ):
  J(θ) = E_θ[ r_0 + γ r_1 + γ² r_2 + … ] = ∫ p(ξ|µ_θ) R(ξ) dξ
  where ξ = {s_0, a_0, r_0, s_1, a_1, r_1, …} is a trajectory,
  R(ξ) = r_0 + γ r_1 + γ² r_2 + …
  p(ξ|µ_θ) = p(s_0) ∏_t p(s_{t+1}|s_t, a_t) µ_θ(a_t|s_t)
Policy gradient: Likelihood Ratio Gradient
• Gradient derivation:
  ∇_θ J(θ) = ∫ ∇_θ p(ξ|µ_θ) R(ξ) dξ
           = ∫ p(ξ|µ_θ) ∇_θ log p(ξ|µ_θ) R(ξ) dξ   (the trick is ∇f = f ∇ log f)
           = E[ ∇_θ log p(ξ|µ_θ) R(ξ) ]
• Using Monte-Carlo simulation, sample M trajectories ξ_i from policy µ_θ:
  ∇_θ J(θ) ≈ (1/M) Σ_{i=1}^{M} ∇_θ log p(ξ_i|µ_θ) R(ξ_i) = (1/M) Σ_{i=1}^{M} Σ_{t=0}^{T_i} ∇_θ log µ_θ(a_t|s_t) R(ξ_i)
  because p(s_{t+1}|s_t, a_t) does not depend on θ, so its gradient w.r.t. θ is 0.
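The Monte-Carlo estimator above, as a sketch; `grad_log_pi(s, a)` (the score function of the policy) and the return function `R` are assumed to be supplied:

```python
import numpy as np

def likelihood_ratio_gradient(trajectories, grad_log_pi, R):
    """grad J ~ (1/M) sum_i [ sum_t grad_theta log mu(a_t|s_t) ] R(xi_i),
    where each trajectory is a list of (s, a) pairs."""
    grads = []
    for traj in trajectories:
        score = sum(grad_log_pi(s, a) for s, a in traj)   # sum of per-step scores
        grads.append(score * R(traj))
    return np.mean(grads, axis=0)
```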
Policy gradient: Likelihood Ratio Gradient
A vanilla policy gradient algorithm:
• initialize θ_0
• for k = 0 : ∞ (until convergence)
  – generate data: M trajectories ξ_i from policy µ_{θ_k}
  – evaluate: ∇_θ J(θ_k) from {ξ_i}_{i=1}^{M}
  – update: θ_{k+1} = θ_k + α ∇_θ J(θ_k)
• end for
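The loop above written out as a sketch that plugs in any gradient estimator (e.g. the likelihood-ratio estimate from the previous slide); `sample_trajectories` is an assumed environment interface, and the step size and iteration count are hypothetical:

```python
def vanilla_policy_gradient(theta0, sample_trajectories, estimate_gradient,
                            alpha=1e-2, n_iters=200):
    """Vanilla policy gradient: generate data, evaluate the gradient, update."""
    theta = theta0
    for k in range(n_iters):
        trajs = sample_trajectories(theta)        # M trajectories from mu_theta_k
        grad = estimate_gradient(theta, trajs)    # e.g. the likelihood-ratio estimate
        theta = theta + alpha * grad              # gradient-ascent step
    return theta
```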
Variance Reduction
• Vanilla policy gradient method suffers from large variance.
• Variance reduction using a baseline:
  – retains unbiased gradient estimates
  – reduces the variance of the gradient estimates
• REINFORCE and G(PO)MDP algorithms
Variance Reduction: REINFORCE
• Introduce a baseline b that keeps the estimate unbiased:
  ∇_{θ_i}^{RF} J(θ) = E_{p_θ(ξ)}[ Σ_{t=0}^{T−1} ∇_{θ_i} log µ_θ(a_t|s_t) (R(ξ) − b) ]
                    = E_{p_θ(ξ)}[ Σ_{t=0}^{T−1} ∇_{θ_i} log µ_θ(a_t|s_t) R(ξ) ]
• Unbiased gradient estimate: ∇_θ J(θ) = ∇_θ^{RF} J(θ)
• Choose b to minimize the variance of ∇_θ^{RF} J(θ):
  ∂/∂b_i Var[∇_{θ_i}^{RF} J(θ)] = ∂/∂b_i { E[(∇_{θ_i}^{RF} J(θ))²] − E[∇_{θ_i}^{RF} J(θ)]² }
                                = ∂/∂b_i E[(∇_{θ_i}^{RF} J(θ))²] = 0
  where i denotes the i-th dimension of θ.
• The resulting baseline (for each dimension i) is
  b_i = E_{p_θ(ξ)}[ (Σ_{t=0}^{T−1} ∇_{θ_i} log µ_θ(a_t|s_t))² R(ξ) ] / E_{p_θ(ξ)}[ (Σ_{t=0}^{T−1} ∇_{θ_i} log µ_θ(a_t|s_t))² ]
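A sketch of this baseline estimated from samples; `score_sums[m]` holds Σ_t ∇_θ log µ(a_t|s_t) for trajectory m and `returns[m]` its return R(ξ_m) (both are assumed inputs):

```python
import numpy as np

def optimal_baseline(score_sums, returns):
    """Per-dimension baseline b_i = E[S_i^2 R] / E[S_i^2],
    with S_i = sum_t grad_{theta_i} log mu(a_t|s_t)."""
    S2 = np.asarray(score_sums) ** 2            # (M, d) squared score sums
    R = np.asarray(returns)[:, None]            # (M, 1) returns
    return (S2 * R).mean(axis=0) / (S2.mean(axis=0) + 1e-12)

def reinforce_gradient(score_sums, returns):
    """REINFORCE gradient with the variance-minimizing baseline."""
    b = optimal_baseline(score_sums, returns)   # (d,)
    S = np.asarray(score_sums)                  # (M, d)
    R = np.asarray(returns)[:, None]            # (M, 1)
    return (S * (R - b)).mean(axis=0)
```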
Variance Reduction: G(PO)MDP Algorithm
• The rewards at a given time are independent of future actions:
  E_{p_θ(ξ)}[ ∂_θ log µ_θ(a_t|s_t) r_j ] = 0, ∀j < t
• The gradient computed by G(PO)MDP is
  ∇_{θ_i}^{G(PO)MDP} J(θ) = E_{p_θ(ξ)}[ Σ_{j=0}^{T−1} Σ_{t=0}^{j} ∇_{θ_i} log µ_θ(a_t|s_t) (r_j − b_j) ]
• The baseline is
  b_{j,i} = E_{p_θ(ξ)}[ (Σ_{t=0}^{j} ∇_{θ_i} log µ_θ(a_t|s_t))² r_j ] / E_{p_θ(ξ)}[ (Σ_{t=0}^{j} ∇_{θ_i} log µ_θ(a_t|s_t))² ]
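A sketch of the G(PO)MDP estimator; trajectories are assumed to be lists of (s, a, r) triples, `grad_log_pi` is the policy's score function, and the per-step baselines are optional:

```python
import numpy as np

def gpomdp_gradient(trajectories, grad_log_pi, baselines=None):
    """Each reward r_j is paired only with the scores of the actions
    taken up to and including time j."""
    grads = []
    for traj in trajectories:
        score, g = 0.0, 0.0
        for j, (s, a, r) in enumerate(traj):
            score = score + grad_log_pi(s, a)              # running sum_{t<=j}
            b = 0.0 if baselines is None else baselines[j]
            g = g + score * (r - b)
        grads.append(g)
    return np.mean(grads, axis=0)
```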
Second-order methods
• Natural policy gradient (Sham Kakade, 2003)
• Gauss-Newton policy gradient (Thomas Furmston, Guy Lever, David Barber, 2016)
Actor-Critic methods
• The policy structure is known as the actor (→ tuned by gradient updates).
• The estimated value function is known as the critic (→ tuned by TD errors).
(Introduction to RL, Sutton & Barto)
Policy Gradient Theorem Algorithm
• R(ξ) can be replaced by the estimate Q_t^π(s_t, a_t):
  ∇_θ J(θ) = E[ Σ_{t=0}^{T} ∇_θ log µ_θ(a_t|s_t) (Σ_{j=t}^{T} r_j) ]
           = E[ Σ_{t=0}^{T} ∇_θ log µ_θ(a_t|s_t) Q_t^π(s_t, a_t) ]
  (Baxter & Bartlett, 2001)
• Policy gradient with function approximation:
  Q_t^π(s_t, a_t) ≈ φ(s_t, a_t)ᵀ w
• Compatible function approximation: how should φ(s_t, a_t) be chosen such that it
  – does not introduce bias
  – reduces the variance?
Compatible function approximation
The function approximation is compatible with the policy parametrization.
• The feature function is φ(s, a) = ∇_θ log µ_θ(a|s), found by solving the least-squares problem
  min_w E_{p_θ(ξ)}[ Σ_{t=0}^{T−1} (Q(s_t, a_t) − φ(s_t, a_t)ᵀ w)² ]
  (Sutton, 1999; Konda & Tsitsiklis, 1999)
• The gradient with compatible function approximation:
  ∇_θ J(θ) = E[ Σ_{t=0}^{T} ∇_θ log µ_θ(a_t|s_t) (∇_θ log µ_θ(a_t|s_t))ᵀ ] w = G_θ w
  where G_θ is the Fisher information matrix (actor-critic algorithms)
• The natural policy gradient with compatible function approximation:
  ∇_θ^{NG} J(θ) = G_θ⁻¹ G_θ w = w (natural actor-critic algorithms)
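Since the compatible critic is linear in w, fitting it is a least-squares problem, and its solution is directly the natural gradient; a sketch, with `grad_log_pi` and the Q estimates assumed given:

```python
import numpy as np

def natural_gradient_weights(state_actions, Q_estimates, grad_log_pi):
    """Fit Q(s, a) ~ phi(s, a)^T w with compatible features
    phi(s, a) = grad_theta log mu_theta(a|s); the least-squares
    solution w equals the natural gradient grad^NG J(theta)."""
    Phi = np.array([grad_log_pi(s, a) for s, a in state_actions])   # (N, d)
    Q = np.asarray(Q_estimates)                                     # (N,)
    w, *_ = np.linalg.lstsq(Phi, Q, rcond=None)
    return w
```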
Actor-critic algorithms
A regular-gradient actor-critic algorithm (a stochastic update version):
• initialize θ_0
• for t = 0 : ∞ (until convergence)
  – choose an action a_t ∼ µ_{θ_t}(a_t|s_t)
  – take a_t, observe r_t and s_{t+1}
  – compute the TD error: δ_t = r_t + γ φ(s_{t+1}, a_{t+1})ᵀ w_t − φ(s_t, a_t)ᵀ w_t
  – critic update: w_{t+1} = w_t + α δ_t φ(s_t, a_t)
  – actor update: θ_{t+1} = θ_t + β φ(s_t, a_t) φ(s_t, a_t)ᵀ w_t
• end for
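One stochastic update of the loop above, as a sketch; `phi(s, a)` is the compatible feature map, and the step sizes and discount are hypothetical values:

```python
import numpy as np

def actor_critic_step(theta, w, s, a, r, s_next, a_next,
                      phi, alpha=0.1, beta=0.01, gamma=0.99):
    """TD error, critic update, then actor update, following the slide."""
    delta = r + gamma * phi(s_next, a_next) @ w - phi(s, a) @ w   # TD error
    w = w + alpha * delta * phi(s, a)                             # critic
    theta = theta + beta * np.outer(phi(s, a), phi(s, a)) @ w     # actor
    return theta, w
```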
Episodic natural actor-critic (eNAC)
• initialize θ_0
• for t = 0 : ∞ (until convergence)
  – sample a set of trajectories {ξ_i} from π(θ_t)
  – critic update by regression: (w; v) = (ΦᵀΦ)⁻¹ ΦᵀR
  – actor update: ∇_θ^{eNAC} J(θ) = w
• end for
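A sketch of one eNAC iteration: the regression matrix stacks each episode's summed score features with a constant offset feature, and the weight part w of the solution is the natural gradient (`grad_log_pi` and the step size are assumptions):

```python
import numpy as np

def enac_update(theta, trajectories, returns, grad_log_pi, alpha=0.05):
    """Critic by regression [w; v] = (Phi^T Phi)^{-1} Phi^T R, actor step on w."""
    Phi = np.array([np.append(sum(grad_log_pi(s, a) for s, a in traj), 1.0)
                    for traj in trajectories])       # rows: [sum_t score_t, 1]
    sol, *_ = np.linalg.lstsq(Phi, np.asarray(returns), rcond=None)
    w, v = sol[:-1], sol[-1]                         # w: natural gradient, v: offset
    return theta + alpha * w
```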
Information-theoretic Policy Search
• Information-theoretic methods bound the update of the trajectory distribution, i.e. regularization limits the information loss in each policy update (the natural policy gradient is an example).
• Regularization helps avoid premature convergence to local optima, as occurs in EM-based methods.
• Information-theoretic policy search combines advantages of inference-based methods (updates without a step size) and of the natural policy gradient (a bound on the information loss).
Relative Entropy Policy Search (REPS) (Jan Peters 2010).
Evolutionary policy search
• Evolutionary policy search: cross-entropy method (CE), covariance matrix adaptation evolution strategy (CMA-ES), Natural Evolution Strategies (NES), etc.