date: january 21, 2020 trust region policy optimization (trpo)...
TRANSCRIPT
![Page 1: Date: January 21, 2020 Trust Region Policy Optimization (TRPO) …wangjk/slides/CS2621_TRPO_PPO.pdf · 2020-02-28 · Trust Region Policy Optimization (TRPO) John Schulman, Sergey](https://reader034.vdocuments.net/reader034/viewer/2022042805/5f62051236b2882ae47af225/html5/thumbnails/1.jpg)
Trust Region Policy Optimization (TRPO)John Schulman, Sergey Levine, Philipp Moritz, Michael I. Jordan, Pieter Abbeel
Presenter: Jingkang WangDate: January 21, 2020
![Page 2: Date: January 21, 2020 Trust Region Policy Optimization (TRPO) …wangjk/slides/CS2621_TRPO_PPO.pdf · 2020-02-28 · Trust Region Policy Optimization (TRPO) John Schulman, Sergey](https://reader034.vdocuments.net/reader034/viewer/2022042805/5f62051236b2882ae47af225/html5/thumbnails/2.jpg)
A Taxonomy of RL Algorithms
Image credit: OpenAI Spinning Up, https://spinningup.openai.com/en/latest/spinningup/rl_intro2.html#id20
We are here!
![Page 3: Date: January 21, 2020 Trust Region Policy Optimization (TRPO) …wangjk/slides/CS2621_TRPO_PPO.pdf · 2020-02-28 · Trust Region Policy Optimization (TRPO) John Schulman, Sergey](https://reader034.vdocuments.net/reader034/viewer/2022042805/5f62051236b2882ae47af225/html5/thumbnails/3.jpg)
Policy Gradients (Preliminaries)1) Score function estimator (SF, also referred to as REINFORCE):
Remark: can be either differentiable and non-differentiable functions
Proof:
![Page 4: Date: January 21, 2020 Trust Region Policy Optimization (TRPO) …wangjk/slides/CS2621_TRPO_PPO.pdf · 2020-02-28 · Trust Region Policy Optimization (TRPO) John Schulman, Sergey](https://reader034.vdocuments.net/reader034/viewer/2022042805/5f62051236b2882ae47af225/html5/thumbnails/4.jpg)
Policy Gradients (Preliminaries)1) Score function estimator (SF, also referred to as REINFORCE):
2) Subtracting a control variate
Remark: if baseline is not a function of z
![Page 5: Date: January 21, 2020 Trust Region Policy Optimization (TRPO) …wangjk/slides/CS2621_TRPO_PPO.pdf · 2020-02-28 · Trust Region Policy Optimization (TRPO) John Schulman, Sergey](https://reader034.vdocuments.net/reader034/viewer/2022042805/5f62051236b2882ae47af225/html5/thumbnails/5.jpg)
Policy Gradients (PG)Policy Gradient Theorem [1]:
Subtract the Baseline - state-value function
Expected reward Visitation frequency State-action function (Q-value)
![Page 6: Date: January 21, 2020 Trust Region Policy Optimization (TRPO) …wangjk/slides/CS2621_TRPO_PPO.pdf · 2020-02-28 · Trust Region Policy Optimization (TRPO) John Schulman, Sergey](https://reader034.vdocuments.net/reader034/viewer/2022042805/5f62051236b2882ae47af225/html5/thumbnails/6.jpg)
Policy Gradients (PG)Policy Gradient Theorem [1]:
Subtract the Baseline - state-value function
Expected reward Visitation frequency State-action function (Q-value)
![Page 7: Date: January 21, 2020 Trust Region Policy Optimization (TRPO) …wangjk/slides/CS2621_TRPO_PPO.pdf · 2020-02-28 · Trust Region Policy Optimization (TRPO) John Schulman, Sergey](https://reader034.vdocuments.net/reader034/viewer/2022042805/5f62051236b2882ae47af225/html5/thumbnails/7.jpg)
Motivation - Problem in PG
How to choose the step size?
![Page 8: Date: January 21, 2020 Trust Region Policy Optimization (TRPO) …wangjk/slides/CS2621_TRPO_PPO.pdf · 2020-02-28 · Trust Region Policy Optimization (TRPO) John Schulman, Sergey](https://reader034.vdocuments.net/reader034/viewer/2022042805/5f62051236b2882ae47af225/html5/thumbnails/8.jpg)
Motivation - Problem in PG
How to choose the step size? too large? 1) bad policy -> 2) collected data under bad policy
too small? cannot leverage data sufficiently
![Page 9: Date: January 21, 2020 Trust Region Policy Optimization (TRPO) …wangjk/slides/CS2621_TRPO_PPO.pdf · 2020-02-28 · Trust Region Policy Optimization (TRPO) John Schulman, Sergey](https://reader034.vdocuments.net/reader034/viewer/2022042805/5f62051236b2882ae47af225/html5/thumbnails/9.jpg)
Motivation - Problem in PG
How to choose the step size? too large? 1) bad policy -> 2) collected data under bad policy
too small? cannot leverage data sufficiently
Cannot recover!
![Page 10: Date: January 21, 2020 Trust Region Policy Optimization (TRPO) …wangjk/slides/CS2621_TRPO_PPO.pdf · 2020-02-28 · Trust Region Policy Optimization (TRPO) John Schulman, Sergey](https://reader034.vdocuments.net/reader034/viewer/2022042805/5f62051236b2882ae47af225/html5/thumbnails/10.jpg)
Motivation: Why trust region optimization?
Image credit: https://medium.com/@jonathan_hui/rl-trust-region-policy-optimization-trpo-explained-a6ee04eeeee9
![Page 11: Date: January 21, 2020 Trust Region Policy Optimization (TRPO) …wangjk/slides/CS2621_TRPO_PPO.pdf · 2020-02-28 · Trust Region Policy Optimization (TRPO) John Schulman, Sergey](https://reader034.vdocuments.net/reader034/viewer/2022042805/5f62051236b2882ae47af225/html5/thumbnails/11.jpg)
TRPO - What Loss to optimize?- Original objective
- Improvement of new policy over old policy [1]
- Local approximation (visitation frequency is unknown)
![Page 12: Date: January 21, 2020 Trust Region Policy Optimization (TRPO) …wangjk/slides/CS2621_TRPO_PPO.pdf · 2020-02-28 · Trust Region Policy Optimization (TRPO) John Schulman, Sergey](https://reader034.vdocuments.net/reader034/viewer/2022042805/5f62051236b2882ae47af225/html5/thumbnails/12.jpg)
TRPO - What Loss to optimize?- Original objective
- Improvement of new policy over old policy [1]
- Local approximation (visitation frequency is unknown)
![Page 13: Date: January 21, 2020 Trust Region Policy Optimization (TRPO) …wangjk/slides/CS2621_TRPO_PPO.pdf · 2020-02-28 · Trust Region Policy Optimization (TRPO) John Schulman, Sergey](https://reader034.vdocuments.net/reader034/viewer/2022042805/5f62051236b2882ae47af225/html5/thumbnails/13.jpg)
Proof: Relation between new and old policy:
![Page 14: Date: January 21, 2020 Trust Region Policy Optimization (TRPO) …wangjk/slides/CS2621_TRPO_PPO.pdf · 2020-02-28 · Trust Region Policy Optimization (TRPO) John Schulman, Sergey](https://reader034.vdocuments.net/reader034/viewer/2022042805/5f62051236b2882ae47af225/html5/thumbnails/14.jpg)
TRPO - What Loss to optimize?- Original objective
- Improvement of new policy over old policy [1]
- Local approximation (visitation frequency is unknown)
![Page 15: Date: January 21, 2020 Trust Region Policy Optimization (TRPO) …wangjk/slides/CS2621_TRPO_PPO.pdf · 2020-02-28 · Trust Region Policy Optimization (TRPO) John Schulman, Sergey](https://reader034.vdocuments.net/reader034/viewer/2022042805/5f62051236b2882ae47af225/html5/thumbnails/15.jpg)
Surrogate Loss: Important sampling PerspectiveImportant Sampling:
Matches to first order for parameterized policy:
![Page 16: Date: January 21, 2020 Trust Region Policy Optimization (TRPO) …wangjk/slides/CS2621_TRPO_PPO.pdf · 2020-02-28 · Trust Region Policy Optimization (TRPO) John Schulman, Sergey](https://reader034.vdocuments.net/reader034/viewer/2022042805/5f62051236b2882ae47af225/html5/thumbnails/16.jpg)
Surrogate Loss: Important sampling PerspectiveImportant Sampling:
Matches to first order for parameterized policy:
![Page 17: Date: January 21, 2020 Trust Region Policy Optimization (TRPO) …wangjk/slides/CS2621_TRPO_PPO.pdf · 2020-02-28 · Trust Region Policy Optimization (TRPO) John Schulman, Sergey](https://reader034.vdocuments.net/reader034/viewer/2022042805/5f62051236b2882ae47af225/html5/thumbnails/17.jpg)
Monotonic Improvement Result- Find the lower bound in general stochastic gradient policies
- Optimized objective: maximize guarantees non-decreasing
![Page 18: Date: January 21, 2020 Trust Region Policy Optimization (TRPO) …wangjk/slides/CS2621_TRPO_PPO.pdf · 2020-02-28 · Trust Region Policy Optimization (TRPO) John Schulman, Sergey](https://reader034.vdocuments.net/reader034/viewer/2022042805/5f62051236b2882ae47af225/html5/thumbnails/18.jpg)
Optimization of Parameterized Policies- If we used the penalty coefficient C recommended by the theory above, the
step sizes would be very small
![Page 19: Date: January 21, 2020 Trust Region Policy Optimization (TRPO) …wangjk/slides/CS2621_TRPO_PPO.pdf · 2020-02-28 · Trust Region Policy Optimization (TRPO) John Schulman, Sergey](https://reader034.vdocuments.net/reader034/viewer/2022042805/5f62051236b2882ae47af225/html5/thumbnails/19.jpg)
Optimization of Parameterized Policies- If we used the penalty coefficient C recommended by the theory above, the
step sizes would be very small
- One way to take larger steps in a robust way is to use a constraint on the KL divergence between the new policy and the old policy, i.e., a trust region constraint:
![Page 20: Date: January 21, 2020 Trust Region Policy Optimization (TRPO) …wangjk/slides/CS2621_TRPO_PPO.pdf · 2020-02-28 · Trust Region Policy Optimization (TRPO) John Schulman, Sergey](https://reader034.vdocuments.net/reader034/viewer/2022042805/5f62051236b2882ae47af225/html5/thumbnails/20.jpg)
Optimization of Parameterized Policies- If we used the penalty coefficient C recommended by the theory above, the
step sizes would be very small
- One way to take larger steps in a robust way is to use a constraint on the KL divergence between the new policy and the old policy, i.e., a trust region constraint:
![Page 21: Date: January 21, 2020 Trust Region Policy Optimization (TRPO) …wangjk/slides/CS2621_TRPO_PPO.pdf · 2020-02-28 · Trust Region Policy Optimization (TRPO) John Schulman, Sergey](https://reader034.vdocuments.net/reader034/viewer/2022042805/5f62051236b2882ae47af225/html5/thumbnails/21.jpg)
Solving the Trust-Region Constrained Optimization
1. Compute a search direction, using a linear approximation to objective and quadratic approximation to the constraint
Conjugate gradient
![Page 22: Date: January 21, 2020 Trust Region Policy Optimization (TRPO) …wangjk/slides/CS2621_TRPO_PPO.pdf · 2020-02-28 · Trust Region Policy Optimization (TRPO) John Schulman, Sergey](https://reader034.vdocuments.net/reader034/viewer/2022042805/5f62051236b2882ae47af225/html5/thumbnails/22.jpg)
Solving the Trust-Region Constrained Optimization
1. Compute a search direction, using a linear approximation to objective and quadratic approximation to the constraint
Conjugate gradient
2. Compute the maximal step length
![Page 23: Date: January 21, 2020 Trust Region Policy Optimization (TRPO) …wangjk/slides/CS2621_TRPO_PPO.pdf · 2020-02-28 · Trust Region Policy Optimization (TRPO) John Schulman, Sergey](https://reader034.vdocuments.net/reader034/viewer/2022042805/5f62051236b2882ae47af225/html5/thumbnails/23.jpg)
Solving the Trust-Region Constrained Optimization
1. Compute a search direction, using a linear approximation to objective and quadratic approximation to the constraint
Conjugate gradient
2. Compute the maximal step length: satisfies the KL divergence
3. Line search to ensure the constraints and monotonic improvement
![Page 24: Date: January 21, 2020 Trust Region Policy Optimization (TRPO) …wangjk/slides/CS2621_TRPO_PPO.pdf · 2020-02-28 · Trust Region Policy Optimization (TRPO) John Schulman, Sergey](https://reader034.vdocuments.net/reader034/viewer/2022042805/5f62051236b2882ae47af225/html5/thumbnails/24.jpg)
Summary - TRPO1. Original objective:
![Page 25: Date: January 21, 2020 Trust Region Policy Optimization (TRPO) …wangjk/slides/CS2621_TRPO_PPO.pdf · 2020-02-28 · Trust Region Policy Optimization (TRPO) John Schulman, Sergey](https://reader034.vdocuments.net/reader034/viewer/2022042805/5f62051236b2882ae47af225/html5/thumbnails/25.jpg)
Summary - TRPO1. Original objective:
2. Policy improvement in terms of advantage function:
![Page 26: Date: January 21, 2020 Trust Region Policy Optimization (TRPO) …wangjk/slides/CS2621_TRPO_PPO.pdf · 2020-02-28 · Trust Region Policy Optimization (TRPO) John Schulman, Sergey](https://reader034.vdocuments.net/reader034/viewer/2022042805/5f62051236b2882ae47af225/html5/thumbnails/26.jpg)
Summary - TRPO1. Original objective:
2. Policy improvement in terms of advantage function:
3. Surrogate loss to remove the dependency on the trajectories of new policy
![Page 27: Date: January 21, 2020 Trust Region Policy Optimization (TRPO) …wangjk/slides/CS2621_TRPO_PPO.pdf · 2020-02-28 · Trust Region Policy Optimization (TRPO) John Schulman, Sergey](https://reader034.vdocuments.net/reader034/viewer/2022042805/5f62051236b2882ae47af225/html5/thumbnails/27.jpg)
Summary - TRPO4. Find the lower bound (monotonic improvement guarantee)
![Page 28: Date: January 21, 2020 Trust Region Policy Optimization (TRPO) …wangjk/slides/CS2621_TRPO_PPO.pdf · 2020-02-28 · Trust Region Policy Optimization (TRPO) John Schulman, Sergey](https://reader034.vdocuments.net/reader034/viewer/2022042805/5f62051236b2882ae47af225/html5/thumbnails/28.jpg)
Summary - TRPO4. Find the lower bound (monotonic improvement guarantee)
5. Solve the optimization problem using linear search (Fish matrix and conjugate gradients)
![Page 29: Date: January 21, 2020 Trust Region Policy Optimization (TRPO) …wangjk/slides/CS2621_TRPO_PPO.pdf · 2020-02-28 · Trust Region Policy Optimization (TRPO) John Schulman, Sergey](https://reader034.vdocuments.net/reader034/viewer/2022042805/5f62051236b2882ae47af225/html5/thumbnails/29.jpg)
Experiments (TRPO)
- Sample-based estimation of advantage functions- Single path: sample initial state and generate trajectories following - Vine: pick a “roll-out” subset and sample multiple actions and trajectories (lower variance)
(a) Single Path (b) Vine
![Page 30: Date: January 21, 2020 Trust Region Policy Optimization (TRPO) …wangjk/slides/CS2621_TRPO_PPO.pdf · 2020-02-28 · Trust Region Policy Optimization (TRPO) John Schulman, Sergey](https://reader034.vdocuments.net/reader034/viewer/2022042805/5f62051236b2882ae47af225/html5/thumbnails/30.jpg)
Experiments (TRPO)
- Simulated Robotic Locomotion tasks- Hopper: 12-dim state space- Walker: 18-dim state space- rewards: encourage fast and stable running (hopper); encourage smooth walke (walker)
![Page 31: Date: January 21, 2020 Trust Region Policy Optimization (TRPO) …wangjk/slides/CS2621_TRPO_PPO.pdf · 2020-02-28 · Trust Region Policy Optimization (TRPO) John Schulman, Sergey](https://reader034.vdocuments.net/reader034/viewer/2022042805/5f62051236b2882ae47af225/html5/thumbnails/31.jpg)
Experiments (TRPO)- Atari games (discrete action space) - 0 / 1
![Page 32: Date: January 21, 2020 Trust Region Policy Optimization (TRPO) …wangjk/slides/CS2621_TRPO_PPO.pdf · 2020-02-28 · Trust Region Policy Optimization (TRPO) John Schulman, Sergey](https://reader034.vdocuments.net/reader034/viewer/2022042805/5f62051236b2882ae47af225/html5/thumbnails/32.jpg)
Limitations of TRPO- Hard to use with architectures with multiple outputs, e.g., policy and value
function (need to weight different terms in distance metric)
- Empirically performs poorly on tasks requiring deep CNNs and RNNs, e.g., Atari benchmark (more suitable for locomotion)
- Conjugate gradients makes implementation more complicated than SGD
![Page 33: Date: January 21, 2020 Trust Region Policy Optimization (TRPO) …wangjk/slides/CS2621_TRPO_PPO.pdf · 2020-02-28 · Trust Region Policy Optimization (TRPO) John Schulman, Sergey](https://reader034.vdocuments.net/reader034/viewer/2022042805/5f62051236b2882ae47af225/html5/thumbnails/33.jpg)
Proximal Policy Optimization (PPO)- Clipped surrogate objective
TRPO:
PPO:
![Page 34: Date: January 21, 2020 Trust Region Policy Optimization (TRPO) …wangjk/slides/CS2621_TRPO_PPO.pdf · 2020-02-28 · Trust Region Policy Optimization (TRPO) John Schulman, Sergey](https://reader034.vdocuments.net/reader034/viewer/2022042805/5f62051236b2882ae47af225/html5/thumbnails/34.jpg)
Proximal Policy Optimization (PPO)- Adaptive KL Penalty Coefficient
![Page 35: Date: January 21, 2020 Trust Region Policy Optimization (TRPO) …wangjk/slides/CS2621_TRPO_PPO.pdf · 2020-02-28 · Trust Region Policy Optimization (TRPO) John Schulman, Sergey](https://reader034.vdocuments.net/reader034/viewer/2022042805/5f62051236b2882ae47af225/html5/thumbnails/35.jpg)
Experiments (PPO)
![Page 36: Date: January 21, 2020 Trust Region Policy Optimization (TRPO) …wangjk/slides/CS2621_TRPO_PPO.pdf · 2020-02-28 · Trust Region Policy Optimization (TRPO) John Schulman, Sergey](https://reader034.vdocuments.net/reader034/viewer/2022042805/5f62051236b2882ae47af225/html5/thumbnails/36.jpg)
Takeaways- Trust region optimization guarantees the monotonic policy improvement.
- PPO is a first-order approximation of TRPO that is simpler to implement and achieves better empirical performance (both locomotion and Atari games).
![Page 37: Date: January 21, 2020 Trust Region Policy Optimization (TRPO) …wangjk/slides/CS2621_TRPO_PPO.pdf · 2020-02-28 · Trust Region Policy Optimization (TRPO) John Schulman, Sergey](https://reader034.vdocuments.net/reader034/viewer/2022042805/5f62051236b2882ae47af225/html5/thumbnails/37.jpg)
Related Work[1] S. Kakade. “A Natural Policy Gradient.” NIPS, 2001.
[2] S. Kakade and J. Langford. “Approximately optimal approximate reinforcement learning”. ICML, 2002.
[3] J. Peters and S. Schaal. “Natural actor-critic”. Neurocomputing, 2008.
[4] J. Schulman, S. Levine, P. Moritz, M. I. Jordan, and P. Abbeel. “Trust Region Policy Optimization”. ICML, 2015.
[5] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. “Proximal Policy Optimization Algorithms”. 2017.
![Page 38: Date: January 21, 2020 Trust Region Policy Optimization (TRPO) …wangjk/slides/CS2621_TRPO_PPO.pdf · 2020-02-28 · Trust Region Policy Optimization (TRPO) John Schulman, Sergey](https://reader034.vdocuments.net/reader034/viewer/2022042805/5f62051236b2882ae47af225/html5/thumbnails/38.jpg)
Questions1. What is purpose of trust region? How we construct the trust region in TRPO
(Hint: average KL divergence)
2. Why trust region optimization is not widely used in supervised learning?
(Hint: i.i.d. assumption)
3. What are the differences between PPO and TRPO? Why PPO is preferred?
(Hint: adaptive coefficient, surrogate loss function)
![Page 39: Date: January 21, 2020 Trust Region Policy Optimization (TRPO) …wangjk/slides/CS2621_TRPO_PPO.pdf · 2020-02-28 · Trust Region Policy Optimization (TRPO) John Schulman, Sergey](https://reader034.vdocuments.net/reader034/viewer/2022042805/5f62051236b2882ae47af225/html5/thumbnails/39.jpg)
Reference1. http://www.cs.toronto.edu/~tingwuwang/trpo.pdf2. http://rll.berkeley.edu/deeprlcoursesp17/docs/lec5.pdf3. https://medium.com/@jonathan_hui/rl-trust-region-policy-optimization-trpo-explained-a6ee04eeeee94. https://people.eecs.berkeley.edu/~pabbeel/nips-tutorial-policy-optimization-Schulman-Abbeel.pdf5. https://www.depthfirstlearning.com/2018/TRPO#1-policy-gradient6. https://cs.uwaterloo.ca/~ppoupart/teaching/cs885-spring18/slides/cs885-lecture15a.pdf7. http://www.andrew.cmu.edu/course/10-703/slides/Lecture_NaturalPolicyGradientsTRPOPPO.pdf8. https://towardsdatascience.com/policy-gradients-in-a-nutshell-8b72f9743c5d9. Discretizing Continuous Action Space for On-Policy Optimization. Tang et al, ICLR 2018.
10. Trust Region Policy Optimization. Schulman et al., ICML 2015.11. A Natural Policy Gradient. Sham Kakade., NIPS 2001.12. Proximal Policy Optimization Algorithms. Schulman et al., 2017.13. Variance Reduction for Policy Gradient with Action-Dependent Factorized Baselines. Wu et al., ICLR
2018.