
Page 1:

Reinforcement Learning
Lecture 8: Policy gradient methods

Chris G. Willcocks
Durham University

Page 2:

Lecture overview

Lecture covers chapter 13 in Sutton & Barto [1] and examples from David Silver [2]

1. Policy-based methods: definition; characteristics; deterministic vs stochastic policies

2. Policy gradients: gradient-based estimator; Monte Carlo REINFORCE

3. Actor-critic methods: definition; algorithm; extensions

Page 3:

Policy-based methods introduction

Definition: policy-based methods

Last week, we used a function approximator to estimate the value function:

v(s, w) ≈ v_π(s),

and for control we estimated Q:

q(s, a, w) ≈ q_π(s, a).

This week we will estimate policies:

π_θ(a|s) = P(a|s, θ)

Given a state, what's the distribution over actions?
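As an illustration (not from the slides), such a policy is often represented by a small neural network whose softmax output defines the distribution over actions; the class name, layer sizes and state/action dimensions below are made up for the sketch.

import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    # maps a state to a categorical distribution over actions, π_θ(a|s)
    def __init__(self, state_dim, n_actions, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions))

    def forward(self, state):
        logits = self.net(state)  # unnormalised action preferences
        return torch.distributions.Categorical(logits=logits)

# sample an action and keep its log-probability for later gradient updates
policy = PolicyNetwork(state_dim=4, n_actions=2)
dist = policy(torch.randn(4))      # the distribution P(a|s, θ) for one state
action = dist.sample()
log_prob = dist.log_prob(action)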

Example: what’s the optimal policy?

Page 4:

Policy-based methods characteristics

Policy-based RL characteristics

This approach has the following advantages:
• Can be more efficient than computing the value function
• Better convergence guarantees
• Effective in high-dimensional or continuous action spaces (see the Gaussian-policy sketch below)
• Can learn stochastic policies

And the following disadvantages:
• Converges on a local rather than the global optimum
• Inefficient policy evaluation with high variance

Example: continuous action spaces
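To make the continuous-action point concrete, a common choice is a Gaussian policy whose mean comes from the network and whose standard deviation is a learned parameter. This is a hedged sketch with made-up dimensions, not code from the lecture.

import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    # π_θ(a|s) = N(μ_θ(s), σ²) over real-valued actions (e.g. torques)
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.mu = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, action_dim))
        self.log_std = nn.Parameter(torch.zeros(action_dim))  # state-independent log σ

    def forward(self, state):
        return torch.distributions.Normal(self.mu(state), self.log_std.exp())

policy = GaussianPolicy(state_dim=3, action_dim=1)
dist = policy(torch.randn(3))
action = dist.sample()                    # a continuous action
log_prob = dist.log_prob(action).sum(-1)  # joint log-density over action dimensions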

Page 5:

Policy-based methods deterministic vs stochastic policies

Example: deterministic vs stochastic policies

Deterministic policy for feature vectors describing the walls around a state:

a = π(s, θ)

Stochastic policy:

a ∼ π(s, θ)

Example from [2].
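The distinction can be shown in a few lines (illustrative only; the probabilities are made up). In the aliased-gridworld example from [2], the stochastic policy is the one that can escape states that look identical:

import torch

probs = torch.tensor([0.1, 0.7, 0.2])   # π(·|s) over three actions for some state s

# deterministic: a = π(s, θ) always picks the same action in this state
a_det = torch.argmax(probs)

# stochastic: a ∼ π(s, θ) samples, so repeated visits can take different actions
a_stoch = torch.distributions.Categorical(probs=probs).sample()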

Page 6:

Policy gradients gradient estimators

Definition: gradient estimators

While we could optimise θ for non-differentiable functions using approaches such as genetic algorithms or hill climbing, ideally we want to use a gradient-based estimator:

∇_θ L^PG(θ) = E_t[ ∇_θ log π_θ(a_t|s_t) A_t ]

where A_t is an estimate of the 'advantage' (the difference between the return and the state value; you could also replace A_t with q(s, a), at the cost of higher variance). The expectation E_t is an empirical average over a finite batch of samples [3]. Typically π follows a categorical distribution (softmax) or a Gaussian for continuous action spaces.

Therefore we empirically follow the gradient that maximises the likelihood of the actions that give the most advantage.
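In code, this is usually implemented as a surrogate loss whose gradient is the estimator above; a minimal sketch (the function and argument names are assumptions):

import torch

def policy_gradient_loss(log_probs, advantages):
    # minimising this follows E_t[∇_θ log π_θ(a_t|s_t) A_t];
    # the advantages are treated as constants, hence the detach
    return -(log_probs * advantages.detach()).mean()

# usage: loss = policy_gradient_loss(dist.log_prob(actions), returns - values)
#        loss.backward(); optimiser.step()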

Page 7:

Policy gradients Monte Carlo REINFORCE

Definition: Monte Carlo REINFORCE

REINFORCE estimates the return in the previous equation by using a Monte Carlo estimate [4].
• Initialise some arbitrary parameters θ
• Iteratively sample episodes
• Calculate the complete return from each step
• For each step again, update θ in the direction of the gradient times the sampled return

Algorithm: Monte Carlo REINFORCE

PyTorch example:

# initialise θ with random values
π = PolicyNetwork(θ)

while True:
    # sample an episode following π
    S_0, A_0, R_1, ..., S_{T-1}, A_{T-1}, R_T ∼ π

    for t in range(T):
        G_t ← Σ_{k=t+1}^{T} γ^{k-t-1} R_k
        θ ← θ + α γ^t G_t ∇_θ ln π(A_t|S_t, θ)
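A runnable PyTorch version of this loop might look as follows. The environment (CartPole via gymnasium), network sizes and hyperparameters are placeholder choices, so treat this as a sketch of the algorithm rather than the lecture's exact code:

import torch
import torch.nn as nn
import gymnasium as gym   # assumed environment library; any episodic task works

env = gym.make("CartPole-v1")
policy = nn.Sequential(nn.Linear(4, 128), nn.ReLU(), nn.Linear(128, 2))  # action logits
optimiser = torch.optim.Adam(policy.parameters(), lr=1e-3)
gamma = 0.99

for episode in range(1000):
    # sample one episode S_0, A_0, R_1, ..., R_T following π
    log_probs, rewards = [], []
    state, _ = env.reset()
    done = False
    while not done:
        dist = torch.distributions.Categorical(
            logits=policy(torch.as_tensor(state, dtype=torch.float32)))
        action = dist.sample()
        state, reward, terminated, truncated, _ = env.step(action.item())
        log_probs.append(dist.log_prob(action))
        rewards.append(reward)
        done = terminated or truncated

    # complete Monte Carlo return G_t for every step of the episode
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.insert(0, G)
    returns = torch.tensor(returns)

    # ascend Σ_t G_t ∇_θ ln π(A_t|S_t, θ); the γ^t factor in the pseudocode
    # above is often dropped in practice, as it is here
    loss = -(torch.stack(log_probs) * returns).sum()
    optimiser.zero_grad()
    loss.backward()
    optimiser.step()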

Page 8:

Actor-critic methods introduction

Definition: actor-critic methods

We combine policy gradients with action-value function approximation, using two models that may (optionally) share parameters (a possible layout is sketched below).

• We use a critic to estimate the Q values: q_w(s, a) ≈ q_{π_θ}(s, a)

• We use an actor to update the policy parameters θ in the direction suggested by the critic.
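One possible way to lay out the two models with an (optional) shared torso, as a hypothetical sketch rather than a prescribed architecture:

import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    # actor head for π_θ(a|s) and critic head for q_w(s, ·) over shared lower layers
    def __init__(self, state_dim, n_actions, hidden=128):
        super().__init__()
        self.torso = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.actor = nn.Linear(hidden, n_actions)    # action logits
        self.critic = nn.Linear(hidden, n_actions)   # one Q value per action

    def forward(self, state):
        h = self.torso(state)
        return torch.distributions.Categorical(logits=self.actor(h)), self.critic(h)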

Example: Actor critic

Image from freecodecamp.org

Page 9:

Actor-critic methods algorithm

Definition: actor-critic

Putting this together, actor-critic methods use an approximate policy gradient to adjust the actor policy in the direction that maximises the reward according to the critic:

θ ← θ + α ∇_θ log π_θ(a|s) q_w(s, a)

Algorithm: Actor-Critic (PyTorch)

# initialise s, θ, w randomly
# sample a ∼ π_θ(a|s)
for t in range(T):
    sample r_t and s′ from environment(s, a)
    sample a′ ∼ π_θ(a′|s′)
    θ ← θ + α q_w(s, a) ∇_θ ln π_θ(a|s)        # update actor
    δ_t = r_t + γ q_w(s′, a′) − q_w(s, a)       # TD error
    w ← w + α δ_t ∇_w q_w(s, a)                 # update critic
    a ← a′, s ← s′
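A hedged PyTorch sketch of one step of this loop, using the ActorCritic module sketched earlier; in practice the two updates are written as surrogate losses and handed to an optimiser rather than applied by hand:

import torch

def actor_critic_step(model, optimiser, s, a, r, s_next, a_next, gamma=0.99):
    dist, q = model(torch.as_tensor(s, dtype=torch.float32))
    _, q_next = model(torch.as_tensor(s_next, dtype=torch.float32))

    q_sa = q[a]
    td_error = r + gamma * q_next[a_next].detach() - q_sa         # δ_t, target held fixed

    actor_loss = -dist.log_prob(torch.tensor(a)) * q_sa.detach()  # descending this follows q_w(s,a) ∇_θ ln π_θ(a|s)
    critic_loss = td_error.pow(2)                                 # descending this moves w along δ_t ∇_w q_w(s,a)

    optimiser.zero_grad()
    (actor_loss + critic_loss).backward()
    optimiser.step()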

Page 10:

Extensions

This lecture has introduced the foundations; hopefully you now have a good platform to read about its extensions.

1. Recommended further study (papers & code)
2. Recommended further study (theory & STAR)

Recommended extensions include:
• Advantage actor-critic (A3C & A2C) [5]
• Experience replay & prioritised replay [6]
• Proximal policy optimisation [3]
• Rainbow (combining extensions) [7]

Page 11:

Take Away Points

Summary

In summary:

• Policy gradients open up many new extensions
• Choose extensions that reduce variance, to stabilise training
• Consider regularisation to encourage exploration (see the sketch below)
• Going off-policy gives better exploration
• It's possible for the actor and critic to share some lower-layer parameters, but be careful about it
• Experience replay can increase sample efficiency (where simulation is expensive)
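For the regularisation point, a common concrete form is an entropy bonus added to the policy-gradient loss; a minimal hedged sketch (the coefficient beta and the argument names are made up):

def entropy_regularised_loss(log_probs, advantages, dist, beta=0.01):
    # the entropy bonus penalises higher-entropy (more exploratory) policies less
    pg_loss = -(log_probs * advantages.detach()).mean()
    return pg_loss - beta * dist.entropy().mean()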

Page 12:

References I

[1] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction (second edition). Available online. MIT Press, 2018.

[2] David Silver. Reinforcement Learning lectures. https://www.davidsilver.uk/teaching/. 2015.

[3] John Schulman et al. “Proximal policy optimization algorithms”. In: arXiv preprint arXiv:1707.06347 (2017).

[4] Ronald J. Williams. “Simple statistical gradient-following algorithms for connectionist reinforcement learning”. In: Machine Learning 8.3-4 (1992), pp. 229–256.

[5] Volodymyr Mnih et al. “Asynchronous methods for deep reinforcement learning”. In: International Conference on Machine Learning. 2016, pp. 1928–1937.

[6] Ziyu Wang et al. “Sample efficient actor-critic with experience replay”. In: arXiv preprint arXiv:1611.01224 (2016).

Page 13:

References II

[7] Matteo Hessel et al. “Rainbow: Combining improvements in deep reinforcement learning”. In: arXiv preprint arXiv:1710.02298 (2017).

[8] Shixiang Gu et al. “Q-prop: Sample-efficient policy gradient with an off-policy critic”. In: arXiv preprint arXiv:1611.02247 (2016).
