Proximal Policy Optimization (CS885, Spring 2018)
Proximal Policy Optimization (OpenAI)
"PPO has become the default reinforcement learning algorithm at OpenAI because of its ease of use and good performance."
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
https://arxiv.org/pdf/1707.06347
https://blog.openai.com/openai-baselines-ppo/
Policy Gradient (REINFORCE)
In practice, we update on each batch (trajectory).
* We use the same notation as the paper.
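The slide text does not reproduce the update itself, so here is a minimal PyTorch sketch of the REINFORCE batch update it describes (the function and tensor names are illustrative assumptions, not from the slides):

```python
import torch

def reinforce_loss(log_probs: torch.Tensor, returns: torch.Tensor) -> torch.Tensor:
    """Surrogate whose gradient is the REINFORCE estimate
    grad J(theta) ~= E[ grad log pi_theta(a_t|s_t) * R_t ],
    computed over one batch (trajectory)."""
    # log_probs: log pi_theta(a_t|s_t) at each step, shape (T,), requires grad
    # returns:   discounted return from each step, shape (T,)
    return -(log_probs * returns).mean()  # minimizing this ascends J
```

Each update consumes a batch collected by the current policy, which leads directly to the problems on the next slide.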
Problem?
• Unstable updates: the step size is very important.
  • If the step size is too large: a large step → a bad policy → the next batch is generated from this bad policy → we collect bad samples → an even worse policy. (Compare to supervised learning, where the correct labels and data in the following batches may correct it.)
  • If the step size is too small: the learning process is slow.
• Data inefficiency:
  • On-policy method: for each new policy, we need to generate a completely new trajectory.
  • The data is thrown out after just one gradient update.
  • As complex neural networks need many updates, this makes the training process very slow.
Importance Sampling
Estimate an expectation under one distribution by sampling from another distribution.
$$E_{x \sim p}[f(x)] \approx \frac{1}{N} \sum_{i=1,\; x_i \sim p}^{N} f(x_i)$$

$$E_{x \sim p}[f(x)] = \int f(x)\,p(x)\,dx = \int f(x)\,\frac{p(x)}{q(x)}\,q(x)\,dx = E_{x \sim q}\!\left[f(x)\,\frac{p(x)}{q(x)}\right] \approx \frac{1}{N} \sum_{i=1,\; x_i \sim q}^{N} f(x_i)\,\frac{p(x_i)}{q(x_i)}$$
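A quick numerical check of the identity (an illustrative sketch, not from the slides): estimate $E_{x \sim p}[x^2]$ for $p = N(0,1)$ using only samples from $q = N(0.5, 1)$.

```python
import numpy as np

rng = np.random.default_rng(0)

p = lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)          # target: N(0, 1)
q = lambda x: np.exp(-(x - 0.5)**2 / 2) / np.sqrt(2 * np.pi)  # proposal: N(0.5, 1)
f = lambda x: x**2

xs = rng.normal(0.5, 1.0, size=100_000)    # x_i ~ q; we never sample from p
est = np.mean(f(xs) * p(xs) / q(xs))       # (1/N) sum f(x_i) p(x_i)/q(x_i)
print(est)                                 # ~1.0 = E_{x~p}[x^2]
```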
Data Inefficiency
Make it efficient: can we reuse previous samples, like the replay buffer in DQN?
But we need to evaluate the gradient of the current policy.
Question: can we estimate an expectation under one distribution without taking samples from it? Then we could avoid sampling from the current policy.
Importance Sampling in Policy Gradient
$$\nabla J(\theta) = E_{(s_t, a_t) \sim \pi_\theta}\!\left[\nabla \log \pi_\theta(a_t \mid s_t)\, A(s_t, a_t)\right] = E_{(s_t, a_t) \sim \pi_{\theta_{old}}}\!\left[\frac{\pi_\theta(s_t, a_t)}{\pi_{\theta_{old}}(s_t, a_t)}\, \nabla \log \pi_\theta(a_t \mid s_t)\, A(s_t, a_t)\right]$$

$$J(\theta) = E_{(s_t, a_t) \sim \pi_{\theta_{old}}}\!\left[\frac{\pi_\theta(s_t, a_t)}{\pi_{\theta_{old}}(s_t, a_t)}\, A(s_t, a_t)\right] \quad \text{(surrogate objective function)}$$

Recall: $E_{x \sim p}[f(x)] = E_{x \sim q}\!\left[f(x)\,\frac{p(x)}{q(x)}\right]$
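In code, the surrogate objective is a one-liner; here is a PyTorch sketch (function and argument names are illustrative assumptions, not from the slides):

```python
import torch

def surrogate_objective(log_pi_new: torch.Tensor,
                        log_pi_old: torch.Tensor,
                        advantages: torch.Tensor) -> torch.Tensor:
    """J(theta) = E_{pi_old}[ (pi_theta / pi_theta_old) * A ].
    log_pi_old was recorded by the data-collecting policy, so it is detached."""
    ratio = torch.exp(log_pi_new - log_pi_old.detach())  # pi_theta / pi_theta_old
    return (ratio * advantages).mean()
```

Maximizing this on samples from the old policy lets us take several gradient steps without collecting new trajectories.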
Importance Sampling
Problem? No free lunch!
The two expectations are equal, but since we estimate them by sampling, the variance also matters.
$$\mathrm{Var}(X) = E[X^2] - (E[X])^2$$

$$\mathrm{Var}_{x \sim p}[f(x)] = E_{x \sim p}[f(x)^2] - \left(E_{x \sim p}[f(x)]\right)^2$$

$$\mathrm{Var}_{x \sim q}\!\left[f(x)\,\frac{p(x)}{q(x)}\right] = E_{x \sim q}\!\left[\left(f(x)\,\frac{p(x)}{q(x)}\right)^2\right] - \left(E_{x \sim q}\!\left[f(x)\,\frac{p(x)}{q(x)}\right]\right)^2 = E_{x \sim p}\!\left[f(x)^2\,\frac{p(x)}{q(x)}\right] - \left(E_{x \sim p}[f(x)]\right)^2$$

Price (tradeoff): we may need to sample more data if $\frac{p(x)}{q(x)}$ is far away from 1.
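A small experiment makes the price concrete (an illustrative sketch, not from the slides): the importance-weighted estimator stays unbiased, but its variance explodes as $q$ drifts away from $p$.

```python
import numpy as np

rng = np.random.default_rng(0)
p_pdf = lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)   # p = N(0, 1), f(x) = x

for mu in (0.0, 2.0):                                      # proposal q = N(mu, 1)
    xs = rng.normal(mu, 1.0, size=100_000)
    q_pdf = np.exp(-(xs - mu)**2 / 2) / np.sqrt(2 * np.pi)
    est = xs * p_pdf(xs) / q_pdf                           # f(x) p(x)/q(x)
    print(f"mu={mu}: mean={est.mean():+.3f}, var={est.var():.1f}")
    # Both means are ~0 (unbiased), but for mu=2.0 the variance is far
    # larger than the direct-sampling variance of 1.
```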
Unstable Update
Unstable → stable: use an adaptive learning rate, i.e., limit the policy update range.
Can we measure the distance between two distributions? If so, we can make confident updates.
KL Divergence
Measures the "distance" between two distributions (not a true distance: it is not symmetric).
$$D_{KL}(P \,\|\, Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)}$$

KL divergence of two policies at a state $s$:

$$D_{KL}(\pi_1 \,\|\, \pi_2)[s] = \sum_{a \in A} \pi_1(a \mid s) \log \frac{\pi_1(a \mid s)}{\pi_2(a \mid s)}$$

* Image: Kullback–Leibler divergence (Wikipedia): https://en.wikipedia.org/wiki/Kullback–Leibler_divergence
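In code, the per-state KL between two discrete policies is a one-liner (a minimal sketch; the probability vectors are made-up examples):

```python
import numpy as np

def kl(pi1, pi2):
    """D_KL(pi1 || pi2) = sum_a pi1(a|s) * log(pi1(a|s) / pi2(a|s)) at one state."""
    pi1, pi2 = np.asarray(pi1, float), np.asarray(pi2, float)
    return float(np.sum(pi1 * np.log(pi1 / pi2)))

print(kl([0.9, 0.1], [0.8, 0.2]))  # small: the policies are close
print(kl([0.9, 0.1], [0.5, 0.5]))  # larger: the policies differ more
print(kl([0.5, 0.5], [0.9, 0.1]))  # != the line above: KL is not symmetric
```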
Trust Region Policy Optimization (TRPO)
TRPO uses a hard constraint rather than a penalty because it is hard to choose a single value of β that performs well across different problems—or even within a single problem, where the characteristics change over the course of learning
Common trick in optimization: Lagrangian Dual
Proximal Policy Optimization (PPO)
TRPO uses conjugate gradient descent to handle the constraint → the Hessian matrix is expensive in both computation and space.
Idea: the constraint helps the training process. However, maybe it does not have to be a strict constraint: does it matter if we break it just a few times?
What if we treat it as a "soft" constraint and add a proximal (penalty) term to the objective function?
PPO with Adaptive KL Penalty
It is hard to pick a fixed 𝛽 value → use an adaptive 𝛽.
We still need to set a KL divergence target value, though…
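A sketch of the adaptive-𝛽 scheme (the penalized objective and the 1.5/2 update factors follow the PPO paper; the function and variable names are illustrative assumptions):

```python
import torch

def kl_penalty_objective(log_pi_new, log_pi_old, advantages, kl, beta):
    """L^KLPEN(theta) = E[ r_t(theta) * A_t - beta * KL[pi_old || pi_theta] ]."""
    ratio = torch.exp(log_pi_new - log_pi_old.detach())
    return (ratio * advantages - beta * kl).mean()

def update_beta(beta: float, mean_kl: float, kl_target: float) -> float:
    """After each policy update: shrink beta if the KL stayed well below the
    target, grow it if the KL overshot (heuristic factors from the paper)."""
    if mean_kl < kl_target / 1.5:
        beta /= 2.0
    elif mean_kl > kl_target * 1.5:
        beta *= 2.0
    return beta
```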
PPO with Adaptive KL Penalty
* CS294 Fall 2017, Lecture 13: http://rail.eecs.berkeley.edu/deeprlcourse-fa17/f17docs/lecture_13_advanced_pg.pdf
PPO with Clipped Objective
Fluctuation happens when the ratio r changes too quickly → limit r to a range?
[Figure: the clipped objective as a function of the ratio r, for positive and negative advantage; r is clipped to the range [1 − Ɛ, 1 + Ɛ].]
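The clipped surrogate from the PPO paper, written as a PyTorch sketch (tensor names are illustrative assumptions):

```python
import torch

def clipped_objective(log_pi_new, log_pi_old, advantages, eps=0.2):
    """L^CLIP(theta) = E[ min(r_t * A_t, clip(r_t, 1-eps, 1+eps) * A_t) ]."""
    ratio = torch.exp(log_pi_new - log_pi_old.detach())        # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return torch.min(unclipped, clipped).mean()                # pessimistic bound
```

The elementwise min keeps the objective a pessimistic (lower) bound: once the ratio leaves the range, moving it further away contributes no gradient.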
PPO with Clipped Objective
* CS294 Fall 2017, Lecture 13: http://rail.eecs.berkeley.edu/deeprlcourse-fa17/f17docs/lecture_13_advanced_pg.pdf
PPO in practice
The full objective combines the surrogate with two extra terms (as in the PPO paper):

$$L_t^{CLIP+VF+S}(\theta) = \hat{E}_t\!\left[ L_t^{CLIP}(\theta) - c_1 L_t^{VF}(\theta) + c_2 S[\pi_\theta](s_t) \right]$$

• $L_t^{CLIP}$: the clipped surrogate objective function
• $L_t^{VF} = (V_\theta(s_t) - V_t^{targ})^2$: a squared-error loss for the "critic"
• $S$: an entropy bonus to ensure sufficient exploration (encourages "diversity")

* c1, c2: empirical values; in the paper, c1 = 1, c2 = 0.01
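Putting the three terms together as a single loss to minimize (a sketch assuming an actor-critic network; all names are illustrative assumptions):

```python
import torch

def ppo_loss(log_pi_new, log_pi_old, advantages,
             values, value_targets, entropy,
             eps=0.2, c1=1.0, c2=0.01):
    """Negative of L^{CLIP+VF+S}(theta), so minimizing it ascends the PPO
    objective. c1 = 1 and c2 = 0.01 follow the paper's values."""
    ratio = torch.exp(log_pi_new - log_pi_old.detach())
    l_clip = torch.min(ratio * advantages,
                       torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages).mean()
    l_vf = ((values - value_targets) ** 2).mean()   # squared-error "critic" loss
    s = entropy.mean()                              # entropy bonus ("diversity")
    return -(l_clip - c1 * l_vf + c2 * s)
```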
Performance
Results from the continuous control benchmark: average normalized scores over 21 runs of the algorithm, on 7 environments.
Performance
Results in MuJoCo environments, training for one million timesteps
Related Works
[1] Emergence of Locomotion Behaviours in Rich Environments (Distributed PPO)
Interesting fact: this paper was published before the PPO paper; DeepMind got the idea from an OpenAI talk at NIPS 2016.

[2] An Adaptive Clipping Approach for Proximal Policy Optimization (PPO-𝜆)
Changes the clipping range adaptively.

[1] https://arxiv.org/abs/1707.02286
[2] https://arxiv.org/abs/1804.06461
END
Thank you