Maximum-a-Posteriori (MAP) Policy Optimization
Mayank Mittal



Page 1

Maximum-a-Posteriori (MAP) Policy Optimization

Mayank Mittal

Page 2

MAP Policy Optimization
Abbas Abdolmaleki, Jost Tobias Springenberg, Yuval Tassa, Remi Munos, Nicolas Heess, Martin Riedmiller (2018)

V-MPO: On-Policy MAP Policy Optimization for Discrete and Continuous Control
H. Francis Song*, Abbas Abdolmaleki*, Jost Tobias Springenberg, Aidan Clark, Hubert Soyer, Jack W. Rae, Seb Noury, Arun Ahuja, Siqi Liu, Dhruva Tirumala, Nicolas Heess, Dan Belov, Martin Riedmiller, Matthew M. Botvinick (2019)

Page 3

Duality: Control and Estimation

▪ What are the actions which maximize future rewards?

▪ Assuming future success in maximizing rewards, what are the actions most likely to have been taken?

Solved using Expectation Maximization (EM)

Page 4

Expectation Maximization

“global” parameters

▪ Generally, point estimation via MLE/MAP is intractable because it requires marginalizing over the latent variables

▪ Consider a latent variable model:

Slide from course: Topics in Probabilistic Modeling and Inferences, Piyush Rai (Fall Semester, 2019), IIT Kanpur

Page 5

Expectation Maximization

Slide from course: Topics in Probabilistic Modeling and Inferences, Piyush Rai (Fall Semester, 2019), IIT Kanpur

ELBO
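As a sketch of the EM setup behind these slides (notation assumed, not taken from the deck): for observations x, latents z, and parameters θ, the log-likelihood is lower-bounded by the ELBO,

\log p_\theta(x) \;=\; \log \int p_\theta(x, z)\, dz \;\ge\; \mathbb{E}_{q(z)}\big[\log p_\theta(x, z) - \log q(z)\big] \;=\; \mathcal{L}(q, \theta),

and EM alternates an E-step, q \leftarrow p_\theta(z \mid x), which tightens the bound, with an M-step, \theta \leftarrow \arg\max_\theta \mathbb{E}_{q(z)}[\log p_\theta(x, z)].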

Page 6

Expectation Maximization

Slide from course: Topics in Probabilistic Modeling and Inferences, Piyush Rai (Fall Semester, 2019), IIT Kanpur

Page 7

Expectation Maximization

Slide from course: Topics in Probabilistic Modeling and Inferences, Piyush Rai (Fall Semester, 2019), IIT Kanpur

Page 8

Expectation Maximization

Slide from course: Topics in Probabilistic Modeling and Inferences, Piyush Rai (Fall Semester, 2019), IIT Kanpur

Page 9

Duality: Control and Estimation

▪ What are the actions which maximize future rewards?

▪ Assuming future success in maximizing rewards, what are the actions most likely to have been taken?

Solved using Expectation Maximization (EM)

Page 10

MAP Policy Optimization
Abbas Abdolmaleki, Jost Tobias Springenberg, Yuval Tassa, Remi Munos, Nicolas Heess, Martin Riedmiller (2018)

V-MPO: On-Policy MAP Policy Optimization for Discrete and Continuous Control
H. Francis Song*, Abbas Abdolmaleki*, Jost Tobias Springenberg, Aidan Clark, Hubert Soyer, Jack W. Rae, Seb Noury, Arun Ahuja, Siqi Liu, Dhruva Tirumala, Nicolas Heess, Dan Belov, Martin Riedmiller, Matthew M. Botvinick (2019)

Page 11

Inference for Optimal Control

▪ Given a prior distribution over trajectories

▪ Estimate the posterior distribution over trajectories consistent with a desired outcome O (such as achieving a goal)

O is interpreted as the event of succeeding at the RL task

Page 12

Inference for Optimal Control

Likelihood Function (for the undiscounted case):

interpreted as event of succeeding at RL task

temperature

Likelihood Objective:
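A hedged reconstruction following the MPO paper's notation (α is the temperature, O the optimality event):

p(\mathcal{O} = 1 \mid \tau) \;\propto\; \exp\!\Big(\textstyle\sum_t r_t / \alpha\Big), \qquad \max_\theta \; \log p_\theta(\mathcal{O} = 1) \;=\; \log \int p_\theta(\tau)\, p(\mathcal{O} = 1 \mid \tau)\, d\tau .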

Page 13

Inference for Optimal Control

Likelihood Objective:

Auxiliary Distribution

ELBO

E-step: improves the ELBO w.r.t. q

M-step: improves the ELBO w.r.t. the policy
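A sketch of the bound, with q(τ) the auxiliary (variational) distribution over trajectories:

\log p_\theta(\mathcal{O} = 1) \;\ge\; \mathbb{E}_{q(\tau)}\Big[\textstyle\sum_t r_t / \alpha\Big] \;-\; \mathrm{KL}\big(q(\tau)\,\|\,p_\theta(\tau)\big) \;=\; \mathcal{J}(q, \theta).

The E-step maximizes J(q, θ) with respect to q, the M-step with respect to θ.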

Page 14

Inference for Optimal Control

Likelihood Function (for the undiscounted case):

interpreted as event of succeeding at RL task

temperature

Likelihood Objective:

Page 15

Inference for Optimal Control

Definition of variational distribution

policy parameters

prior

Likelihood Objective:

For undiscounted case:

For discounted case:

discount factor
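A sketch following the paper's formulation (θ are the policy parameters, γ the discount factor, p(θ) the prior): the variational distribution factorizes over the trajectory while keeping the true dynamics,

q(\tau) \;=\; p(s_0) \prod_{t} q(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t),

and the objective in the discounted case reads roughly

\mathcal{J}(q, \theta) \;=\; \mathbb{E}_{q}\Big[\sum_t \gamma^t \big(r_t - \alpha\, \mathrm{KL}\big(q(\cdot \mid s_t)\,\|\,\pi_\theta(\cdot \mid s_t)\big)\big)\Big] + \log p(\theta).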

Page 16

Regularized RL

Likelihood Objective:

For discounted case:

Regularized Q-value function:
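One way to write the regularized Q-value the slide refers to is the recursive form (a sketch, not necessarily the paper's exact definition):

Q^{q}_{\alpha}(s, a) \;=\; r(s, a) + \gamma\, \mathbb{E}_{p(s' \mid s, a)}\big[V^{q}_{\alpha}(s')\big], \qquad V^{q}_{\alpha}(s) \;=\; \mathbb{E}_{q(a \mid s)}\big[Q^{q}_{\alpha}(s, a)\big] - \alpha\, \mathrm{KL}\big(q(\cdot \mid s)\,\|\,\pi_\theta(\cdot \mid s)\big).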

Page 17

Regularized RL

Proposition

Page 18

Objective of MPO

▪ Uses EM-style coordinate ascent to maximize the estimation objective

▪ Proposes an off-policy algorithm that is:
  ▪ scalable, robust, and insensitive to hyperparameters
  ▪ data-efficient

(Figure: spectrum from on-policy to off-policy algorithms)

Page 19

Regularized RL

Likelihood Objective:

For discounted case:

Regularized Q-value function:

Page 20

E-step: Maximization w.r.t. q

Consider iteration i:

1. Set

2. Estimate the unregularized action value (using the Retrace algorithm):

Off-policy!

Munos, Remi, et al. ‘Safe and Efficient Off-Policy Reinforcement Learning’. Advances in Neural Information Processing Systems, 2016, pp. 1054–1062.
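For reference, the Retrace(λ) target for off-policy evaluation with behavior policy μ is (Munos et al., 2016):

Q^{\mathrm{ret}}(s_t, a_t) \;=\; Q(s_t, a_t) + \sum_{j \ge t} \gamma^{\,j-t} \Big(\prod_{k=t+1}^{j} c_k\Big)\big[r_j + \gamma\, \mathbb{E}_{\pi}[Q(s_{j+1}, \cdot)] - Q(s_j, a_j)\big], \qquad c_k = \lambda \min\!\Big(1, \tfrac{\pi(a_k \mid s_k)}{\mu(a_k \mid s_k)}\Big).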

Page 21

E-step: Maximization w.r.t. q

Consider iteration i:

3. Maximize the one-step objective:

(constant w.r.t. q)

Interpretation: the policy chooses a soft-optimal action for one step and then resorts to executing the current policy

The state distribution is stationary, since samples come from the replay buffer
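A sketch of this one-step objective, with μ(s) denoting the state distribution induced by the replay buffer and Q^{π_i} the Retrace estimate (constant w.r.t. q):

\bar{\mathcal{J}}(q) \;=\; \mathbb{E}_{\mu(s)}\Big[\mathbb{E}_{q(a \mid s)}\big[Q^{\pi_i}(s, a)\big] - \alpha\, \mathrm{KL}\big(q(\cdot \mid s)\,\|\,\pi_i(\cdot \mid s)\big)\Big].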

Page 22

E-step: Maximization w.r.t. q

Consider iteration i:

3. Maximize the one-step objective:

(Hard) Constrained E-step: the scale of the KL penalty is arbitrary, so replace it with a hard KL constraint
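A sketch of the resulting constrained problem (ε is the assumed KL bound):

\max_q \; \mathbb{E}_{\mu(s)}\big[\mathbb{E}_{q(a \mid s)}[Q^{\pi_i}(s, a)]\big] \quad \text{s.t.} \quad \mathbb{E}_{\mu(s)}\big[\mathrm{KL}\big(q(\cdot \mid s)\,\|\,\pi_i(\cdot \mid s)\big)\big] \;<\; \epsilon .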

Page 23

E-step: Maximization w.r.t. q

Consider iteration i:

3. Maximize the one-step objective:

(Hard) Constrained E-step:

Method 1: use a parametric variational distribution (similar to TRPO/PPO)

Method 2: use a non-parametric variational distribution (a sample-based distribution over actions for a state)

Page 24

E-step: Maximization w.r.t. q

Consider iteration i:

3. Maximize the one-step objective:

(Hard) Constrained E-step:

sample-based distribution over actions for a state

Method 2

Use non-parametric variational distribution

Lagrangian Formulation
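For the non-parametric case, the Lagrangian yields a closed-form solution, with the temperature η obtained by minimizing a convex dual (a sketch with assumed notation):

q_i(a \mid s) \;\propto\; \pi_i(a \mid s)\, \exp\!\big(Q^{\pi_i}(s, a) / \eta^*\big), \qquad \eta^* \;=\; \arg\min_\eta \; \eta\, \epsilon + \eta\, \mathbb{E}_{\mu(s)}\Big[\log \mathbb{E}_{\pi_i(a \mid s)}\big[\exp\!\big(Q^{\pi_i}(s, a)/\eta\big)\big]\Big].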

Page 25

E-step: Maximization w.r.t. q

Consider iteration i:

1. Set

2. Estimate the unregularized action value using the Retrace algorithm

3. Maximize the “one-step” KL-regularized objective to obtain:

Page 26

M-step: Maximization w.r.t. policy

Likelihood Objective:

For the discounted case:

M-step: Partial maximization w.r.t. the policy

Looks similar to supervised learning!

Page 27

M-step: Maximization w.r.t. policy

Likelihood Objective:

For the discounted case:

M-step: Partial maximization w.r.t. the policy

Samples are weighted by the variational distribution q from the E-step

Looks similar to supervised learning!
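A sketch of this partial M-step (q_i fixed from the E-step, μ(s) the state distribution from the buffer): maximizing the ELBO over θ is a weighted maximum-likelihood problem,

\max_\theta \; \mathbb{E}_{\mu(s)}\Big[\mathbb{E}_{q_i(a \mid s)}\big[\log \pi_\theta(a \mid s)\big]\Big] + \log p(\theta),

which is why it resembles supervised learning on E-step-weighted samples.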

Page 28

M-step: Maximization w.r.t. policy

M-step: Partial maximization w.r.t. the policy

Looks similar to supervised learning!

For the generalized case:

Page 29

M-step: Maximization w.r.t. policy

M-step: Partial maximization w.r.t. the policy

For the generalized case:

(Hard) Constrained M-step:

The KL constraint prevents overfitting to the samples, since it reduces the tendency of the policy's entropy to collapse
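A sketch of this constrained M-step (ε_π is the assumed trust-region bound, replacing the prior p(θ)):

\max_\theta \; \mathbb{E}_{\mu(s)}\Big[\mathbb{E}_{q_i(a \mid s)}\big[\log \pi_\theta(a \mid s)\big]\Big] \quad \text{s.t.} \quad \mathbb{E}_{\mu(s)}\big[\mathrm{KL}\big(\pi_i(\cdot \mid s)\,\|\,\pi_\theta(\cdot \mid s)\big)\big] \;<\; \epsilon_\pi .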

Page 30

Algorithm

Page 31

Algorithm
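To convey the flavor of the EM loop, here is a minimal tabular sketch in Python/NumPy. It is illustrative only, not the paper's neural-network implementation: the MDP is a toy random one, the temperature eta is fixed rather than optimized from the KL constraint, and the constrained M-step is omitted (in the tabular, unconstrained case the weighted maximum-likelihood fit is exact, so the M-step simply copies q).

import numpy as np

rng = np.random.default_rng(0)
S, A, gamma, eta = 5, 3, 0.9, 0.5            # toy sizes and hyperparameters (assumed)

# A random tabular MDP (illustrative only, not from the paper).
P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a, s'] transition probabilities
R = rng.uniform(0.0, 1.0, size=(S, A))       # R[s, a] rewards
pi = np.full((S, A), 1.0 / A)                # uniform initial policy

def q_values(pi, iters=200):
    # Unregularized policy evaluation; stands in for the Retrace estimate of Q^{pi_i}.
    Q = np.zeros((S, A))
    for _ in range(iters):
        V = (pi * Q).sum(axis=1)             # V(s) = E_pi[Q(s, a)]
        Q = R + gamma * (P @ V)              # Bellman backup, shape (S, A)
    return Q

for i in range(50):
    Q = q_values(pi)                          # E-step (1-2): evaluate the current policy
    q = pi * np.exp(Q / eta)                  # E-step (3): non-parametric q proportional to pi * exp(Q / eta)
    q /= q.sum(axis=1, keepdims=True)
    pi = q                                    # M-step: weighted ML fit (exact in this tabular case, so copy q)

print("greedy actions:", pi.argmax(axis=1))

With a small eta the update approaches greedy policy iteration; with a large eta the policy stays close to its previous iterate, mirroring the role of the KL constraint in the full algorithm.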

Page 32

Experimental Evaluation

▪ Gaussian parametrization of the policy
▪ Benchmark on continuous control tasks

Page 33

Experimental Evaluation

▪ Stable learning on all tasks
▪ Significant sample efficiency

Page 34

Experimental Evaluation

▪ Stable learning on all tasks
▪ Significant sample efficiency

Page 35

MAP Policy Optimization
Abbas Abdolmaleki, Jost Tobias Springenberg, Yuval Tassa, Remi Munos, Nicolas Heess, Martin Riedmiller (2018)

V-MPO: On-Policy MAP Policy Optimization for Discrete and Continuous Control
H. Francis Song*, Abbas Abdolmaleki*, Jost Tobias Springenberg, Aidan Clark, Hubert Soyer, Jack W. Rae, Seb Noury, Arun Ahuja, Siqi Liu, Dhruva Tirumala, Nicolas Heess, Dan Belov, Martin Riedmiller, Matthew M. Botvinick (2019)

Page 36

Objective of V-MPO

▪ Uses EM-style coordinate ascent to maximize the estimation objective

▪ Proposes an on-policy algorithm that:
  ▪ replaces the state-action value function in MPO with a state value function
  ▪ scales to the multi-task setting without population-based tuning of hyperparameters

Page 37

Objective of V-MPO

▪ Uses EM-style coordinate ascent to maximize the estimation objective

▪ Proposes an on-policy algorithm that:
  ▪ replaces the state-action value function in MPO with a state value function
  ▪ scales to the multi-task setting without population-based tuning of hyperparameters

Page 38

Inference for Optimal Control

▪ In MPO: the latent variable is interpreted as the event of succeeding at the RL task, scaled by a temperature

▪ In V-MPO: it is interpreted as the relative improvement of the policy over the previous policy, scaled by a temperature
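In symbols, a hedged contrast (α and η are the respective temperatures, A the advantage of the previous policy):

\text{MPO:}\;\; p(\mathcal{O} = 1 \mid \tau) \;\propto\; \exp\!\Big(\textstyle\sum_t r_t / \alpha\Big), \qquad \text{V-MPO:}\;\; \psi(s, a) \;\propto\; \exp\!\big(A(s, a) / \eta\big).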

Page 39

Inference for Control

MAP Objective:

Identity:

E-step: improves the ELBO w.r.t. q

M-step: improves the ELBO w.r.t. the policy

Page 40

E-step: Maximization w.r.t. q

Consider iteration i:

1. Set

2. Estimate the value function (using n-step targets):

3. Calculate the advantages:

On-policy!
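A sketch of steps 2-3 (symbols assumed): the value network V_φ is regressed onto n-step returns and advantages are targets minus baselines,

G^{(n)}_t \;=\; \sum_{k=0}^{n-1} \gamma^k r_{t+k} + \gamma^n V_\phi(s_{t+n}), \qquad A_t \;=\; G^{(n)}_t - V_\phi(s_t), \qquad \mathcal{L}_V(\phi) \;=\; \tfrac{1}{2}\,\mathbb{E}\big[(G^{(n)}_t - V_\phi(s_t))^2\big].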

Page 41

E-step: Maximization w.r.t. q

Consider iteration i:

4. Maximize the objective:

(Hard) Constrained E-step:

Page 42

Consider iteration i:

4. Maximize the objective:

E-step: Maximization w.r.t. q

(Hard) Constrained E-step:

sample-based distribution over actions for a state

Lagrangian Formulation

Method: use a non-parametric variational distribution
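With the non-parametric choice, the constrained problem again has a closed-form solution over the samples in the batch D, and the temperature η comes from a convex dual (a sketch with assumed notation):

\psi(s, a) \;=\; \frac{\exp\!\big(A(s, a)/\eta\big)}{\sum_{(s', a') \in \mathcal{D}} \exp\!\big(A(s', a')/\eta\big)}, \qquad \mathcal{L}(\eta) \;=\; \eta\, \epsilon_\eta + \eta \log \Big[\tfrac{1}{|\mathcal{D}|} \sum_{(s, a) \in \mathcal{D}} \exp\!\big(A(s, a)/\eta\big)\Big].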

Page 43

Consider iteration i:

4. Maximize the objective:

E-step: Maximization w.r.t. q

(Hard) Constrained E-step:

Engineering: learning improves substantially if only the samples corresponding to the highest 50% of advantages in each batch are used

Page 44

M-step: Maximization w.r.t. policy

M-step: Partial maximization w.r.t. the policy

Here, a minimization (due to the negative sign):

Weighted maximum-likelihood policy loss!

Assumption: during the sample-based computation of the loss, any state-action pairs not in the batch of trajectories have zero weight
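Under that assumption, the weighted maximum-likelihood policy loss takes roughly the form

\mathcal{L}_\pi(\theta) \;=\; -\sum_{(s, a) \in \mathcal{D}} \psi(s, a)\, \log \pi_\theta(a \mid s).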

Page 45

M-step: Maximization w.r.t. policy

M-step: Partial maximization w.r.t. the policy

(Hard) Constrained M-step:

The KL constraint prevents overfitting to the samples, since it reduces the tendency of the policy's entropy to collapse

For the generalized case (a minimization, due to the negative sign):
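That is, roughly (with ε_KL an assumed bound; the paper handles the constraint with a Lagrange multiplier updated by coordinate ascent):

\min_\theta \; \mathcal{L}_\pi(\theta) \quad \text{s.t.} \quad \mathbb{E}_{s}\big[\mathrm{KL}\big(\pi_{\mathrm{old}}(\cdot \mid s)\,\|\,\pi_\theta(\cdot \mid s)\big)\big] \;\le\; \epsilon_{\mathrm{KL}}.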

Page 46

Experimental Evaluation

Page 47

Experimental Evaluation

Multi-task Control: DMLab-30

Page 48

Experimental Evaluation

Discrete Control: Atari

Page 49

Experimental Evaluation

Continuous Control

Page 50

Summary

▪ Formulation of the RL optimization problem as an inference problem

▪ Two particular formulations:
  ▪ MPO: off-policy algorithm
  ▪ V-MPO: on-policy algorithm

(Side-by-side comparison: MPO vs. V-MPO)

Page 51

Thank you!

Page 52

References

▪ Abdolmaleki, A., et al. (2018). Maximum a Posteriori Policy Optimisation. arXiv:1806.06920.

▪ Song, H. F., Abdolmaleki, A., et al. (2019). V-MPO: On-Policy Maximum a Posteriori Policy Optimization for Discrete and Continuous Control. arXiv:1909.12238.