Trust Region Policy Optimization
John Schulman, Sergey Levine, Philipp Moritz, Michael I. Jordan, Pieter Abbeel
@ICML 2015
Presenter: Shivam Kalra
CS 885 (Reinforcement Learning)
Prof. Pascal Poupart
June 20th 2018
Reinforcement Learning
[Diagram: families of RL methods — Action Value Function (Q-Learning) vs. Policy Gradients (TRPO, Actor Critic, PPO, A3C, ACKTR)]
Ref: https://www.youtube.com/watch?v=CKaN5PgkSBc
Policy Gradient
For i = 1, 2, …
  Collect N trajectories for policy $\pi_\theta$
  Estimate advantage function $\hat{A}$
  Compute policy gradient $g$
  Update policy parameter $\theta = \theta_{\text{old}} + \alpha g$
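A minimal sketch of one such update, assuming a Gym-style discrete-action environment and a softmax-linear policy (all names and hyper-parameters here are illustrative, not from the paper):

```python
# Vanilla policy gradient (REINFORCE-style) update, illustrative sketch only.
# Assumes a Gym-like env with reset()/step() and array-valued observations.
import numpy as np

def softmax(x):
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

def policy_gradient_step(env, theta, n_traj=10, gamma=0.99, alpha=0.01):
    """theta has shape (n_actions, obs_dim); returns updated theta and mean return."""
    grads, returns = [], []
    for _ in range(n_traj):
        obs, _ = env.reset()
        done, log_grads, rewards = False, [], []
        while not done:
            probs = softmax(theta @ obs)
            a = np.random.choice(len(probs), p=probs)
            # grad of log pi(a|s) w.r.t. theta for a softmax-linear policy
            dlog = -np.outer(probs, obs)
            dlog[a] += obs
            obs, r, terminated, truncated, _ = env.step(a)
            done = terminated or truncated
            log_grads.append(dlog)
            rewards.append(r)
        # discounted return-to-go, used here as a crude advantage estimate (no baseline)
        G, disc = 0.0, []
        for r in reversed(rewards):
            G = r + gamma * G
            disc.append(G)
        disc.reverse()
        grads.append(sum(dl * G_t for dl, G_t in zip(log_grads, disc)))
        returns.append(disc[0])
    g = np.mean(grads, axis=0)                  # policy gradient estimate
    return theta + alpha * g, np.mean(returns)  # theta = theta_old + alpha * g
```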
Problems of Policy Gradient
For i = 1, 2, …
  Collect N trajectories for policy $\pi_\theta$
  Estimate advantage function $\hat{A}$
  Compute policy gradient $g$
  Update policy parameter $\theta = \theta_{\text{old}} + \alpha g$
Problem: non-stationary input data, because the state and reward distributions change as the policy changes.
Problems of Policy Gradient
For i = 1, 2, …
  Collect N trajectories for policy $\pi_\theta$
  Estimate advantage function $\hat{A}$
  Compute policy gradient $g$
  Update policy parameter $\theta = \theta_{\text{old}} + \alpha g$
Problem: advantage estimates are very noisy early in training.
[Cartoon: the Advantage telling the Policy "You're bad"]
Problems of Policy Gradient
For i = 1, 2, …
  Collect N trajectories for policy $\pi_\theta$
  Estimate advantage function $\hat{A}$
  Compute policy gradient $g$
  Update policy parameter $\theta = \theta_{\text{old}} + \alpha g$
We need a more carefully crafted policy update: we want improvement, not degradation.
Idea: update the old policy $\pi_{\text{old}}$ to a new policy $\pi$ such that they stay a "trusted" distance apart. Such a conservative policy update allows improvement instead of degradation.
RL to Optimization
• Most of ML is optimization; supervised learning is reducing a training loss.
• RL: what is policy gradient optimizing? It favors $(s, a)$ pairs that gave more advantage $\hat{A}$.
• Can we write down an optimization problem that lets us make a small update to a policy $\pi$ based on data sampled from $\pi$ (on-policy data)?
Ref: https://www.youtube.com/watch?v=xvRrgxcpaHY (6:40)
What loss to optimize?
• Optimize $\eta(\pi)$, i.e., the expected return of a policy $\pi$.
• We collect data with $\pi_{\text{old}}$ and optimize the objective to get a new policy $\pi$.

$$\eta(\pi) = \mathbb{E}_{s_0 \sim \rho_0,\; a_t \sim \pi(\cdot \mid s_t)}\left[\sum_{t=0}^{\infty} \gamma^t\, r(s_t)\right]$$
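As a quick worked example of the inner sum, the discounted return of a single sampled trajectory can be computed as follows (illustrative sketch; `rewards` stands for the reward sequence of one rollout):

```python
def discounted_return(rewards, gamma=0.99):
    """Monte Carlo value of sum_t gamma^t * r_t for one sampled trajectory."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# eta(pi) is then approximated by averaging this quantity over many
# trajectories sampled with pi.
```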
What loss to optimize?
• We can express $\eta(\pi)$ in terms of the advantage over the original policy [1]:

$$\underbrace{\eta(\pi)}_{\text{expected return of new policy}} = \underbrace{\eta(\pi_{\text{old}})}_{\text{expected return of old policy}} + \underbrace{\mathbb{E}_{\tau \sim \pi}}_{\text{sample from new policy}}\left[\sum_{t=0}^{\infty} \gamma^t A_{\pi_{\text{old}}}(s_t, a_t)\right]$$

[1] Kakade, Sham, and John Langford. "Approximately optimal approximate reinforcement learning." ICML, 2002.
What loss to optimize?
• The previous equation can be rewritten as [1]:

$$\underbrace{\eta(\pi)}_{\text{expected return of new policy}} = \underbrace{\eta(\pi_{\text{old}})}_{\text{expected return of old policy}} + \sum_s \underbrace{\rho_\pi(s)}_{\text{discounted visitation frequency}} \sum_a \pi(a \mid s)\, A_{\pi_{\text{old}}}(s, a)$$

where $\rho_\pi(s) = P(s_0 = s) + \gamma P(s_1 = s) + \gamma^2 P(s_2 = s) + \cdots$

[1] Schulman, John, et al. "Trust region policy optimization." ICML, 2015.
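For intuition, the discounted visitation frequency can be estimated empirically in a small tabular setting (illustrative sketch; `trajectories` is assumed to be a list of state-index sequences sampled under $\pi$, not anything defined in the paper):

```python
import numpy as np

def discounted_visitation(trajectories, n_states, gamma=0.99):
    """Estimate rho_pi(s) = P(s0=s) + gamma*P(s1=s) + gamma^2*P(s2=s) + ...
    from trajectories sampled under pi (each trajectory is a list of state indices)."""
    rho = np.zeros(n_states)
    for traj in trajectories:
        for t, s in enumerate(traj):
            rho[s] += gamma ** t
    return rho / len(trajectories)
```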
What loss to optimize?
$$\underbrace{\eta(\pi)}_{\text{new expected return}} = \underbrace{\eta(\pi_{\text{old}})}_{\text{old expected return}} + \underbrace{\sum_s \rho_\pi(s) \sum_a \pi(a \mid s)\, A_{\pi_{\text{old}}}(s, a)}_{\geq\, 0}$$

If the advantage term is non-negative, the new expected return is at least the old expected return: improvement from $\pi_{\text{old}} \to \pi$ is guaranteed.
New State Visitation is Difficult
$$\eta(\pi) = \eta(\pi_{\text{old}}) + \sum_s \underbrace{\rho_\pi(s)}_{\text{state visitation under the new policy}} \sum_a \underbrace{\pi(a \mid s)}_{\text{new policy}}\, A_{\pi_{\text{old}}}(s, a)$$

"Complex dependency of $\rho_\pi(s)$ on $\pi$ makes the equation difficult to optimize directly." [1]

[1] Schulman, John, et al. "Trust region policy optimization." ICML, 2015.
New State Visitation is Difficult
Replace $\rho_\pi(s)$ with the old visitation frequency $\rho_{\pi_{\text{old}}}(s)$ to obtain a local approximation of $\eta(\pi)$:

$$L(\pi) = \eta(\pi_{\text{old}}) + \sum_s \rho_{\pi_{\text{old}}}(s) \sum_a \pi(a \mid s)\, A_{\pi_{\text{old}}}(s, a)$$

The approximation is accurate within a step size $\delta$ (the trust region), i.e., as long as the new policy $\pi(a \mid s)$ does not change dramatically from $\pi_{\text{old}}$. Within the trust region, monotonic improvement is guaranteed.

[1] Schulman, John, et al. "Trust region policy optimization." ICML, 2015.
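In practice, $L$ is typically estimated from samples collected under $\pi_{\text{old}}$ using an importance-sampling ratio (a sample-based approximation in the spirit of the TRPO paper; the function and variable names below are illustrative):

```python
import numpy as np

def surrogate_estimate(logp_new, logp_old, advantages):
    """Sample estimate of the surrogate L (up to the constant eta(pi_old)):
    states/actions are drawn under pi_old, and the ratio pi(a|s)/pi_old(a|s)
    re-weights the old-policy advantages."""
    ratio = np.exp(logp_new - logp_old)
    return np.mean(ratio * advantages)
```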
Trust region
• The following bound holds, with $L$ the local approximation of $\eta(\pi)$:

$$\eta(\pi) \geq L(\pi) - C\, D_{KL}^{\max}(\pi_{\text{old}}, \pi), \qquad \text{where } C = \frac{4\epsilon\gamma}{(1-\gamma)^2}$$

• Monotonically improving policies can therefore be generated by:

$$\pi_{\text{new}} = \arg\max_{\pi}\left[L(\pi) - C\, D_{KL}^{\max}(\pi_{\text{old}}, \pi)\right]$$

• This is an instance of the Minorization-Maximization (MM) algorithm: the surrogate $L(\pi) - C\, D_{KL}^{\max}(\pi_{\text{old}}, \pi)$ minorizes the actual function $\eta(\pi)$.
Optimization of Parameterized Policies
• Now policies are parameterized: $\pi_\theta(a \mid s)$ with parameters $\theta$.
• Accordingly, the surrogate objective becomes:

$$\arg\max_{\theta}\left[L(\theta) - C\, D_{KL}^{\max}(\theta_{\text{old}}, \theta)\right]$$
Optimization of Parameterized Policies
$$\arg\max_{\theta}\left[L(\theta) - C\, D_{KL}^{\max}(\theta_{\text{old}}, \theta)\right]$$

• In practice, the penalty coefficient $C$ results in very small step sizes.
• One way to take larger steps is to constrain the KL divergence between the new policy and the old policy, i.e., impose a trust region constraint:

$$\underset{\theta}{\text{maximize}}\; L(\theta) \quad \text{subject to} \quad D_{KL}(\theta_{\text{old}}, \theta) \leq \delta$$
Solving KL-Penalized Problem
• $\underset{\theta}{\text{maximize}}\; L(\theta) - C \cdot D_{KL}^{\max}(\theta_{\text{old}}, \theta)$
• Use the mean KL divergence instead of the max, i.e., $\underset{\theta}{\text{maximize}}\; L(\theta) - C \cdot \bar{D}_{KL}(\theta_{\text{old}}, \theta)$
• Make a linear approximation to $L$ and a quadratic approximation to the KL term:

$$\underset{\theta}{\text{maximize}}\; g \cdot (\theta - \theta_{\text{old}}) - \frac{C}{2}\, (\theta - \theta_{\text{old}})^{T} F\, (\theta - \theta_{\text{old}})$$

$$\text{where } g = \frac{\partial}{\partial \theta} L(\theta)\Big|_{\theta = \theta_{\text{old}}}, \qquad F = \frac{\partial^2}{\partial \theta^2} \bar{D}_{KL}(\theta_{\text{old}}, \theta)\Big|_{\theta = \theta_{\text{old}}}$$
Solving KL-Penalized Problem
• Make a linear approximation to $L$ and a quadratic approximation to the KL term:

$$\underset{\theta}{\text{maximize}}\; g \cdot (\theta - \theta_{\text{old}}) - \frac{C}{2}\, (\theta - \theta_{\text{old}})^{T} F\, (\theta - \theta_{\text{old}}), \qquad F = \frac{\partial^2}{\partial \theta^2} \bar{D}_{KL}(\theta_{\text{old}}, \theta)\Big|_{\theta = \theta_{\text{old}}}$$

• Solution: $\theta - \theta_{\text{old}} = \frac{1}{C} F^{-1} g$. We don't want to form the full Hessian matrix $F$.
• We can compute $F^{-1} g$ approximately using the conjugate gradient algorithm without forming $F$ explicitly.
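A tiny numeric illustration of this solution (toy values only; in practice $F^{-1} g$ is obtained with conjugate gradient, as described next, rather than a direct solve):

```python
import numpy as np

# Illustrative toy values for the local quadratic model.
g = np.array([0.5, -0.2])                 # gradient of L at theta_old
F = np.array([[2.0, 0.3], [0.3, 1.0]])    # Hessian of the mean KL at theta_old
C = 10.0                                  # penalty coefficient

delta_theta = np.linalg.solve(F, g) / C   # maximizer: (1/C) * F^{-1} g
theta_old = np.zeros(2)
theta_new = theta_old + delta_theta
```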
Conjugate Gradient (CG)
• The conjugate gradient (CG) algorithm approximately solves $x = A^{-1} b$ without explicitly forming the matrix $A$; it only needs matrix-vector products $Av$.
• After $k$ iterations, CG has minimized the quadratic $\frac{1}{2} x^{T} A x - b^{T} x$.
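A minimal conjugate-gradient sketch (illustrative; `Avp` is any function returning the product $Av$, e.g. a Fisher-vector product, so $F$ never has to be materialized):

```python
import numpy as np

def conjugate_gradient(Avp, b, iters=10, tol=1e-10):
    """Approximately solve A x = b using only matrix-vector products Avp(v) = A v."""
    x = np.zeros_like(b)
    r = b.copy()          # residual b - A x (x starts at 0)
    p = b.copy()          # search direction
    rr = r @ r
    for _ in range(iters):
        Ap = Avp(p)
        alpha = rr / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rr_new = r @ r
        if rr_new < tol:
            break
        p = r + (rr_new / rr) * p
        rr = rr_new
    return x

# Example usage with an explicit toy matrix standing in for F:
F = np.array([[2.0, 0.3], [0.3, 1.0]])
g = np.array([0.5, -0.2])
step_direction = conjugate_gradient(lambda v: F @ v, g)   # approximates F^{-1} g
```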
TRPO: KL-Constrained
• Unconstrained (penalized) problem: $\underset{\theta}{\text{maximize}}\; L(\theta) - C \cdot \bar{D}_{KL}(\theta_{\text{old}}, \theta)$
• Constrained problem: $\underset{\theta}{\text{maximize}}\; L(\theta)$ subject to $\bar{D}_{KL}(\theta_{\text{old}}, \theta) \leq \delta$
• $\delta$ is a hyper-parameter that remains fixed over the whole learning process.
• Solve the constrained quadratic problem: compute $F^{-1} g$ and then rescale the step to get the correct KL:

$$\underset{\theta}{\text{maximize}}\; g \cdot (\theta - \theta_{\text{old}}) \quad \text{subject to} \quad \frac{1}{2}\, (\theta - \theta_{\text{old}})^{T} F\, (\theta - \theta_{\text{old}}) \leq \delta$$

• Lagrangian: $\mathcal{L}(\theta, \lambda) = g \cdot (\theta - \theta_{\text{old}}) - \frac{\lambda}{2}\left[(\theta - \theta_{\text{old}})^{T} F\, (\theta - \theta_{\text{old}}) - \delta\right]$
• Differentiate w.r.t. $\theta$ and get $\theta - \theta_{\text{old}} = \frac{1}{\lambda} F^{-1} g$
• We want $\frac{1}{2}\, s^{T} F s = \delta$
• Given a candidate step $s_{\text{unscaled}}$, rescale to $s = \sqrt{\dfrac{2\delta}{s_{\text{unscaled}}^{T} F\, s_{\text{unscaled}}}}\; s_{\text{unscaled}}$
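A small numeric sketch of this rescaling (reusing the toy `F` and `g` from the previous snippets; all values illustrative):

```python
import numpy as np

delta = 0.01                                    # KL trust-region radius
F = np.array([[2.0, 0.3], [0.3, 1.0]])
g = np.array([0.5, -0.2])

s_unscaled = np.linalg.solve(F, g)              # candidate step ~ F^{-1} g
beta = np.sqrt(2.0 * delta / (s_unscaled @ (F @ s_unscaled)))
s = beta * s_unscaled                           # now (1/2) * s^T F s == delta
```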
TRPO Algorithm
For i = 1, 2, …
  Collect N trajectories for policy $\pi_\theta$
  Estimate advantage function $\hat{A}$
  Compute policy gradient $g$
  Use CG to compute $H^{-1} g$
  Compute rescaled step $s = \alpha H^{-1} g$ with rescaling and line search
  Apply update: $\theta = \theta_{\text{old}} + \alpha H^{-1} g$

$$\underset{\theta}{\text{maximize}}\; L(\theta) \quad \text{subject to} \quad \bar{D}_{KL}(\theta_{\text{old}}, \theta) \leq \delta$$
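Putting the pieces together, one schematic TRPO update might look like the sketch below. This is a minimal illustration under many simplifying assumptions: `sample_trajectories`, `estimate_advantages`, `policy_gradient`, `fisher_vector_product`, and `surrogate_objective` are hypothetical helpers, `conjugate_gradient` is the sketch from the CG slide, and the backtracking line search is simplified (it only checks surrogate improvement).

```python
import numpy as np

def trpo_step(theta_old, delta=0.01, cg_iters=10, backtrack_coef=0.5, backtrack_iters=10):
    """One schematic TRPO update; the helper functions are assumed, not defined here."""
    trajs = sample_trajectories(theta_old)                        # collect N trajectories
    adv = estimate_advantages(trajs)                              # advantage estimates
    g = policy_gradient(theta_old, trajs, adv)                    # gradient of surrogate L
    Hvp = lambda v: fisher_vector_product(theta_old, trajs, v)    # H v without forming H
    step_dir = conjugate_gradient(Hvp, g, iters=cg_iters)         # ~ H^{-1} g
    # Rescale so the quadratic KL estimate equals delta.
    beta = np.sqrt(2.0 * delta / (step_dir @ Hvp(step_dir)))
    full_step = beta * step_dir
    # Backtracking line search: accept the largest step that improves the surrogate.
    L_old = surrogate_objective(theta_old, theta_old, trajs, adv)
    for i in range(backtrack_iters):
        theta_new = theta_old + (backtrack_coef ** i) * full_step
        if surrogate_objective(theta_new, theta_old, trajs, adv) > L_old:
            return theta_new
    return theta_old                                              # no improving step found
```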
Questions?