
Maximum-a-Posteriori (MAP) Policy Optimization

Mayank Mittal

MAP Policy Optimization
Abbas Abdolmaleki, Jost Tobias Springenberg, Yuval Tassa, Remi Munos, Nicolas Heess, Martin Riedmiller (2018)

V-MPO: On-Policy MAP Policy Optimization For Discrete and Continuous Control
H. Francis Song*, Abbas Abdolmaleki*, Jost Tobias Springenberg, Aidan Clark, Hubert Soyer, Jack W. Rae, Seb Noury, Arun Ahuja, Siqi Liu, Dhruva Tirumala, Nicolas Heess, Dan Belov, Martin Riedmiller, Matthew M. Botvinick (2019)

Duality: Control and Estimation

▪ What are the actions which maximize future rewards?

▪ Assuming future success in maximizing rewards, what are the actions most likely to have been taken?

Solved using Expectation Maximization (EM)

Expectation Maximization

▪ Consider a latent variable model with latent variables and “global” parameters

▪ In general, point estimation of the parameters via MLE/MAP is not possible in closed form, because the marginal likelihood requires summing or integrating out the latent variables, which is intractable

Slide from course: Topics in Probabilistic Modeling and Inferences, Piyush Rai (Fall Semester, 2019), IIT Kanpur

Expectation Maximization

▪ Introduce a variational distribution q over the latent variables to obtain the evidence lower bound (ELBO) on the marginal likelihood

▪ E-step: maximize the ELBO w.r.t. q; M-step: maximize the ELBO w.r.t. the “global” parameters, as in the sketch below

Slide from course: Topics in Probabilistic Modeling and Inferences, Piyush Rai (Fall Semester, 2019), IIT Kanpur
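A standard way to write the decomposition these slides build on (notation chosen here: x the observed data, z the latent variables, Θ the “global” parameters, q a variational distribution):

\[
\log p(x \mid \Theta)
= \underbrace{\mathbb{E}_{q(z)}\!\left[\log \frac{p(x, z \mid \Theta)}{q(z)}\right]}_{\text{ELBO}\ \mathcal{L}(q,\,\Theta)}
+ \mathrm{KL}\!\big(q(z)\,\|\,p(z \mid x, \Theta)\big).
\]

The E-step maximizes the ELBO w.r.t. q (setting q(z) = p(z | x, Θ_old) when that posterior is tractable); the M-step maximizes the ELBO w.r.t. Θ with q held fixed.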


MAP Policy Optimization (MPO)
Abbas Abdolmaleki, Jost Tobias Springenberg, Yuval Tassa, Remi Munos, Nicolas Heess, Martin Riedmiller (2018)

Inference for Optimal Control

▪ Given a prior distribution over trajectories

▪ Estimate the posterior distribution over trajectories consistent with a desired outcome O (such as achieving a goal), where O is interpreted as the event of succeeding at the RL task

Inference for Optimal Control

Likelihood Function (for the undiscounted case): the probability of the event O (succeeding at the RL task) is proportional to the exponentiated sum of rewards, scaled by a temperature.

Likelihood Objective: maximize the likelihood of O under the policy. Introducing an auxiliary (variational) distribution over trajectories gives an evidence lower bound (ELBO) on this objective, as sketched below.

E-step: improves the ELBO w.r.t. the auxiliary distribution q
M-step: improves the ELBO w.r.t. the policy
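A sketch of the objects behind those labels, following the MPO paper's control-as-inference setup (α is the temperature, O the optimality event, p_π(τ) the trajectory distribution under policy π, q the auxiliary distribution over trajectories):

\[
p(O = 1 \mid \tau) \;\propto\; \exp\!\Big(\tfrac{1}{\alpha}\sum_{t} r_t\Big),
\]
\[
\log p_{\pi}(O = 1)
\;\ge\; \mathbb{E}_{\tau \sim q}\Big[\tfrac{1}{\alpha}\sum_{t} r_t\Big] - \mathrm{KL}\big(q(\tau)\,\|\,p_{\pi}(\tau)\big)
\;=\; \mathcal{J}(q, \pi).
\]

The E-step and M-step below perform coordinate ascent on this lower bound \(\mathcal{J}\).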


Inference for Optimal Control

Definition of the variational distribution: the distribution over trajectories factorizes into the transition dynamics and a per-step variational policy q(a|s); the policy parameters θ additionally carry a prior p(θ).

Likelihood Objective: stated first for the undiscounted case, then extended to the discounted case by introducing a discount factor γ.

Regularized RL

Likelihood Objective (for the discounted case): the ELBO takes the form of a KL-regularized RL objective, with an associated regularized Q-value function, as written out below.
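Written out, assuming the MPO paper's discounted formulation (γ the discount factor, α the temperature, p(θ) a prior over the policy parameters):

\[
\mathcal{J}(q, \theta)
= \mathbb{E}_{q}\Big[\sum_{t=0}^{\infty} \gamma^{t}\Big(r_t - \alpha\,\mathrm{KL}\big(q(\cdot \mid s_t)\,\|\,\pi(\cdot \mid s_t, \theta)\big)\Big)\Big] + \log p(\theta),
\]
with the corresponding regularized Q-value function
\[
Q^{q}_{\theta}(s, a)
= r_0 + \mathbb{E}_{q}\Big[\sum_{t \ge 1} \gamma^{t}\big(r_t - \alpha\,\mathrm{KL}_t\big) \,\Big|\, s_0 = s,\ a_0 = a\Big],
\qquad
\mathrm{KL}_t = \mathrm{KL}\big(q(\cdot \mid s_t)\,\|\,\pi(\cdot \mid s_t, \theta)\big).
\]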

Regularized RL

Proposition

Objective of MPO

▪ Uses EM-style coordinate ascent to maximize the estimation objective

▪ Proposes an off-policy algorithm that is
  ▪ scalable, robust, and insensitive to hyperparameters (like on-policy algorithms)
  ▪ data-efficient (like off-policy algorithms)


E-step: Maximization w.r.t. q

Consider iteration i:
1. Set q = π_i (the current policy).
2. Estimate the unregularized action-value function using the Retrace algorithm (off-policy!).

Munos, Remi, et al. 'Safe and Efficient Off-Policy Reinforcement Learning'. Advances in Neural Information Processing Systems, 2016, pp. 1054–1062.
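For reference, the Retrace(λ) evaluation operator from Munos et al. (2016), with μ the behaviour policy that generated the replay data and π the policy being evaluated:

\[
\mathcal{R} Q(s, a)
= Q(s, a) + \mathbb{E}_{\mu}\Big[\sum_{t \ge 0} \gamma^{t} \Big(\prod_{k=1}^{t} c_k\Big)
\big(r_t + \gamma\,\mathbb{E}_{\pi}\big[Q(s_{t+1}, \cdot)\big] - Q(s_t, a_t)\big)\Big],
\qquad
c_k = \lambda\,\min\!\Big(1, \frac{\pi(a_k \mid s_k)}{\mu(a_k \mid s_k)}\Big).
\]

The truncated importance weights c_k keep the update safe under arbitrary behaviour policies while staying efficient when μ and π are close.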

E-step: Maximization w.r.t. q

Consider iteration i:
3. Maximize the one-step objective (terms that are constant w.r.t. q are dropped).

INTERPRETATION:
The variational distribution q chooses a soft-optimal action for one step and then resorts to executing the current policy π_i.
The state distribution is treated as stationary, since states are sampled from the replay buffer.

(Hard) Constrained E-step: the soft KL regularization is replaced by a hard KL constraint, since the temperature has an arbitrary scale.

Method 1: use a parametric variational distribution (similar to TRPO/PPO).
Method 2: use a non-parametric, sample-based distribution over actions for each state, obtained from a Lagrangian formulation; its closed form is sketched below.
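The non-parametric route admits a closed form. A sketch of the MPO E-step solution (η is the Lagrange multiplier for the KL bound ε, μ(s) the state distribution given by the replay buffer):

\[
q_i(a \mid s) \;\propto\; \pi(a \mid s, \theta_i)\,\exp\!\Big(\frac{Q_{\theta_i}(s, a)}{\eta}\Big),
\]
where η is found by minimizing the convex dual
\[
g(\eta) = \eta\,\varepsilon + \eta\,\mathbb{E}_{\mu(s)}\Big[\log \mathbb{E}_{\pi(a \mid s, \theta_i)}\Big[\exp\!\Big(\frac{Q_{\theta_i}(s, a)}{\eta}\Big)\Big]\Big].
\]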

E-step: Maximization w.r.t. q (summary)

Consider iteration i:
1. Set q = π_i.
2. Estimate the unregularized action value using the Retrace algorithm.
3. Maximize the "one-step" KL-regularized objective to obtain q_i.

M-step: Maximization w.r.t. θ

Likelihood Objective (for the discounted case):

M-step: partial maximization w.r.t. the policy, with samples weighted by the variational distribution q_i from the E-step. Looks similar to supervised learning!

For the generalized case:

(Hard) Constrained M-step: a KL constraint on the policy update prevents overfitting to the samples, since it decreases the tendency of the policy's entropy to collapse. In symbols, the M-step is sketched below.
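A sketch of the constrained M-step (q_i are the weights from the E-step, ε_π the KL bound for the policy update):

\[
\max_{\theta}\ \mathbb{E}_{\mu(s)}\Big[\mathbb{E}_{q_i(a \mid s)}\big[\log \pi(a \mid s, \theta)\big]\Big] + \log p(\theta)
\quad \text{s.t.} \quad
\mathbb{E}_{\mu(s)}\Big[\mathrm{KL}\big(\pi(\cdot \mid s, \theta_i)\,\|\,\pi(\cdot \mid s, \theta)\big)\Big] \le \varepsilon_{\pi},
\]

i.e. a sample-weighted maximum-likelihood fit of the new policy, trust-regioned around the old one.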

Algorithm

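To make the E-step machinery concrete, here is a minimal numpy sketch (mine, not the authors' code) of the non-parametric weight computation: given Q-values for a batch of states and sampled actions, it picks the temperature η by minimizing the dual g(η) on a grid and forms per-state action weights. The names (e_step_weights, kl_bound, ...) are illustrative, not from the paper's implementation.

import numpy as np

def dual(eta, q_values, kl_bound):
    """Sample-based dual g(eta) of the hard-KL-constrained E-step.

    q_values: array [num_states, num_actions] of Q(s, a) for actions sampled
    from the current policy pi_i (so the inner expectation is a plain mean).
    """
    scaled = q_values / eta
    m = scaled.max(axis=1, keepdims=True)
    # log E_pi[exp(Q/eta)] per state, computed stably (logsumexp-style)
    log_mean_exp = m.squeeze(1) + np.log(np.exp(scaled - m).mean(axis=1))
    return eta * kl_bound + eta * log_mean_exp.mean()

def e_step_weights(q_values, kl_bound=0.1, eta_grid=None):
    """Pick eta by minimizing the dual on a grid, then return softmax weights."""
    if eta_grid is None:
        eta_grid = np.geomspace(1e-3, 1e2, 200)
    eta = min(eta_grid, key=lambda e: dual(e, q_values, kl_bound))
    logits = q_values / eta
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)    # normalize per state
    return weights, eta

# Tiny usage example: 4 states, 8 sampled actions each, random Q-values.
rng = np.random.default_rng(0)
w, eta = e_step_weights(rng.normal(size=(4, 8)))
print("eta:", eta, "row sums:", w.sum(axis=1))

In a full implementation these weights would then supply the targets for the supervised M-step fit of the policy.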

Experimental Evaluation

▪ Gaussian parametrization of the policy
▪ Benchmark on continuous control tasks
▪ Stable learning on all tasks
▪ Significant sample efficiency

V-MPO: On-Policy MAP Policy Optimization For Discrete and Continuous Control
H. Francis Song*, Abbas Abdolmaleki*, Jost Tobias Springenberg, Aidan Clark, Hubert Soyer, Jack W. Rae, Seb Noury, Arun Ahuja, Siqi Liu, Dhruva Tirumala, Nicolas Heess, Dan Belov, Martin Riedmiller, Matthew M. Botvinick (2019)

Objective of V-MPO

▪ Uses EM-style coordinate ascent to maximize the estimation objective

▪ Proposes an on-policy algorithm that
  ▪ replaces the state-action value function in MPO with a state value function
  ▪ scales to the multi-task setting without population-based tuning of hyperparameters


Inference for Optimal Control

▪ In MPO: the binary optimality event O is interpreted as the event of succeeding at the RL task, with its likelihood shaped by a temperature.

▪ In V-MPO: the binary event is instead interpreted as a relative improvement of the policy over the previous policy, again controlled by a temperature.

Inference for Control

MAP Objective and identity: as in MPO, the log-evidence splits into an ELBO plus a KL term.

E-step: improves the ELBO w.r.t. the variational distribution q
M-step: improves the ELBO w.r.t. the policy

E-step: Maximization w.r.t. q

Consider iteration i:
1. Set q (as in the MPO E-step).
2. Estimate the value function V(s) using n-step targets.
3. Calculate the advantages, as sketched below.

On-policy!
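As a sketch (notation mine: V_φ is the value network fit on-policy):

\[
G_t^{(n)} = \sum_{k=0}^{n-1} \gamma^{k} r_{t+k} + \gamma^{n} V_{\phi}(s_{t+n}),
\qquad
A(s_t, a_t) = G_t^{(n)} - V_{\phi}(s_t).
\]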

E-step: Maximization w.r.t. q

Consider iteration i:
4. Maximize the objective.

(Hard) Constrained E-step: use a non-parametric, sample-based distribution over actions for a state, obtained from a Lagrangian formulation.

Engineering: learning improves substantially if only the samples corresponding to the highest 50% of the advantages in each batch are used. The resulting weights and temperature dual are sketched below.
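Concretely, a sketch following the V-MPO paper (D̃ denotes the half of the batch with the highest advantages, η the temperature with its own bound ε_η):

\[
\psi(s, a) = \frac{\exp\big(A(s, a)/\eta\big)}{\sum_{(s', a') \in \tilde{\mathcal{D}}} \exp\big(A(s', a')/\eta\big)},
\qquad
\mathcal{L}_{\eta} = \eta\,\varepsilon_{\eta} + \eta \log\Big(\frac{1}{|\tilde{\mathcal{D}}|} \sum_{(s, a) \in \tilde{\mathcal{D}}} \exp\big(A(s, a)/\eta\big)\Big),
\]

with η updated by gradient descent on \(\mathcal{L}_{\eta}\) alongside the other parameters.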

M-step: Maximization w.r.t. θ

M-step: partial maximization w.r.t. the policy. Written as a minimization (due to the negative sign), this is a weighted maximum-likelihood policy loss!

Assumption: during the sample-based computation of the loss, any state-action pairs not in the batch of trajectories have zero weight.

(Hard) Constrained M-step: a KL constraint on the policy update prevents overfitting to the samples, since it decreases the tendency of the policy's entropy to collapse.

For the generalized case (a minimization, due to the negative sign), the loss is sketched below.
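In symbols, a sketch following V-MPO (ψ are the E-step weights over the top-half batch D̃; the KL bound ε_α is enforced through a Lagrange multiplier α):

\[
\mathcal{L}_{\pi}(\theta) = -\sum_{(s, a) \in \tilde{\mathcal{D}}} \psi(s, a)\,\log \pi_{\theta}(a \mid s)
\quad \text{s.t.} \quad
\mathbb{E}_{s}\Big[\mathrm{KL}\big(\pi_{\theta_{\text{old}}}(\cdot \mid s)\,\|\,\pi_{\theta}(\cdot \mid s)\big)\Big] \le \varepsilon_{\alpha}.
\]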

Experimental Evaluation

▪ Multi-task control: DMLab-30
▪ Discrete control: Atari
▪ Continuous control

Summary

▪ Formulation of the RL optimization problem as an inference problem

▪ Two particular formulations:
  ▪ MPO: off-policy; estimates a state-action value function with Retrace from replay-buffer data
  ▪ V-MPO: on-policy; estimates a state value function with n-step targets and weights samples by their advantages

Thank you!

References

▪ Abdolmaleki, A., et al. (2018). Maximum a Posteriori Policy Optimisation. ArXiv, abs/1806.06920.

▪ Song, H. F., Abdolmaleki, A., et al. (2019). V-MPO: On-Policy MAP Policy Optimization for Discrete and Continuous Control. ArXiv, abs/1909.12238.
