TRANSCRIPT
PILCO: A Model-Based and Data-Efficient Approach to Policy Search
(M.P. Deisenroth and C.E. Rasmussen)
CSC2541, November 4, 2016
PILCO Graphical Model
PILCO – Probabilistic Inference for Learning COntrol
Latent states {Xt} evolve through time based on previous states and controls.
The policy π maps Zt, a noisy observation of Xt, to a control Ut.
PILCO Objective
Transitions follow the dynamical system
xt = f(xt−1, ut−1)
where x ∈ R^D, u ∈ R^F, and f is a latent function.
Let π be parameterized by θ with ut = π(xt, θ). The objective is to find the π that minimizes the expected cost of following π for T steps:
Jπ(θ) = ∑_{t=1}^T E_{xt}[c(xt)]
The cost function encodes information about a target state, e.g.,
c(x) = 1 − exp(−‖x − xtarget‖² / σc²)
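As a concrete illustration, here is a minimal sketch of this saturating cost in JAX; the target state and width σc below are made-up values for demonstration.

```python
import jax.numpy as jnp

def saturating_cost(x, x_target, sigma_c):
    """c(x) = 1 - exp(-||x - x_target||^2 / sigma_c^2), bounded in [0, 1)."""
    sq_dist = jnp.sum((x - x_target) ** 2)
    return 1.0 - jnp.exp(-sq_dist / sigma_c ** 2)

# The cost approaches 0 near the target and saturates at 1 far away.
x_target = jnp.array([0.0, 0.0])
print(saturating_cost(jnp.array([0.1, 0.0]), x_target, sigma_c=0.25))  # ~0.15
print(saturating_cost(jnp.array([3.0, 0.0]), x_target, sigma_c=0.25))  # ~1.0
```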
Algorithm
The high-level loop from the paper:
1. Apply random control signals and record the resulting data.
2. Learn a GP dynamics model from all data collected so far.
3. Policy evaluation: approximate inference to predict the long-term expected cost Jπ(θ).
4. Policy improvement: update θ using the analytic gradient dJπ(θ)/dθ.
5. Apply the updated policy to the system, record the data, and repeat from step 2.
Dynamics Model Learning
Given only a finite set of observed transitions, there are multiple plausible function approximators of f, as illustrated by the sketch below.
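A minimal sketch of this idea, assuming 1-D toy data and made-up hyper-parameters: each sample drawn from a GP posterior is one plausible approximator of the latent function consistent with the data.

```python
import jax.numpy as jnp
from jax import random

def se_kernel_matrix(A, B, ell=1.0, sf2=1.0):
    """Squared exponential kernel matrix between 1-D input vectors A and B."""
    d = (A[:, None] - B[None, :]) / ell
    return sf2 * jnp.exp(-0.5 * d ** 2)

X = jnp.array([-2.0, -1.0, 0.5, 2.0])      # observed inputs (toy)
y = jnp.sin(X)                              # observed targets (toy)
Xs = jnp.linspace(-3.0, 3.0, 100)           # test inputs

K = se_kernel_matrix(X, X) + 1e-4 * jnp.eye(4)      # jitter for stability
Ks = se_kernel_matrix(Xs, X)
Kss = se_kernel_matrix(Xs, Xs)
mean = Ks @ jnp.linalg.solve(K, y)                   # posterior mean
cov = Kss - Ks @ jnp.linalg.solve(K, Ks.T)           # posterior covariance

# Each row of `samples` is one plausible approximator of the latent function.
samples = random.multivariate_normal(
    random.PRNGKey(0), mean, cov + 1e-6 * jnp.eye(100), shape=(5,))
```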
Dynamics Model Learning
Define a Gaussian process (GP) prior on the latent dynamics function f.
Dynamics Model Learning
Let the prior on f be GP(0, k(x̃, x̃′)), where x̃ ≜ [xᵀ uᵀ]ᵀ and the squared exponential kernel is given by
k(x̃, x̃′) = σf² exp(−½ (x̃ − x̃′)ᵀ Λ⁻¹ (x̃ − x̃′)),  Λ = diag([ℓ1², …, ℓ_{D+F}²])
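A minimal sketch of this kernel with one (ARD) lengthscale per input dimension; the inputs and hyper-parameter values below are made up for illustration.

```python
import jax.numpy as jnp

def se_kernel(x1, x2, ell, sf2):
    """k(x1, x2) = sf2 * exp(-0.5 * (x1 - x2)^T diag(ell**2)^{-1} (x1 - x2))."""
    d = (x1 - x2) / ell
    return sf2 * jnp.exp(-0.5 * jnp.dot(d, d))

# Example on a state-control input x_tilde = [x; u].
x1 = jnp.array([0.0, 1.0, 0.5])   # [x_1, x_2, u]
x2 = jnp.array([0.1, 0.9, 0.4])
ell = jnp.array([1.0, 2.0, 0.5])  # one lengthscale per input dimension (ARD)
print(se_kernel(x1, x2, ell, sf2=1.0))
```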
Dynamics Model Learning
Let ∆t = xt − xt−1 + ε, where ε ∼ N(0, Σε) and Σε = diag([σε1², …, σεD²]). The GP yields one-step predictions p(∆t | x̃t−1) = N(mf(x̃t−1), σf²(x̃t−1)) (see Section 2.2 in reference 3).
Given n training inputs X̃ = [x̃1, …, x̃n] and corresponding training targets y = [∆1, …, ∆n]ᵀ, the GP hyper-parameters are learned by evidence maximization (type-II maximum likelihood).
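A minimal sketch of the one-step posterior predictive for a single target dimension, following GPML Section 2.2 (reference 3); it assumes the hyper-parameters ell, sf2, and the noise variance sn2 (all hypothetical here) have already been fit by evidence maximization.

```python
import jax.numpy as jnp

def gp_predict(X_tilde, y, x_star, ell, sf2, sn2):
    """Posterior predictive at x_star: mean = k*^T (K + sn2 I)^{-1} y,
    var = k(x*, x*) - k*^T (K + sn2 I)^{-1} k*."""
    def k(a, b):
        d = (a - b) / ell
        return sf2 * jnp.exp(-0.5 * jnp.dot(d, d))
    n = X_tilde.shape[0]
    K = jnp.array([[k(xi, xj) for xj in X_tilde] for xi in X_tilde])
    k_star = jnp.array([k(xi, x_star) for xi in X_tilde])
    A = K + sn2 * jnp.eye(n)
    mean = k_star @ jnp.linalg.solve(A, y)
    var = k(x_star, x_star) - k_star @ jnp.linalg.solve(A, k_star)
    return mean, var
```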
Policy Evaluation
In evaluating the objective Jπ(θ), we must calculate p(xt), since E_{xt}[c(xt)] = ∫ c(xt) p(xt) dxt.
We have xt = xt−1 + ∆t − ε, where, in general, computing p(∆t) is analytically intractable.
Instead, p(∆t) is approximated with a Gaussian via moment matching.
Moment Matching
The input distribution p(xt−1, ut−1) is assumed Gaussian.
When it is propagated through the GP model, we obtain the predictive distribution p(∆t), which is generally non-Gaussian.
p(∆t) is approximated by a Gaussian via moment matching.
Moment Matching
p(xt) can now be approximated with N(µt, Σt), where
µt = µt−1 + µ∆
Σt = Σt−1 + Σ∆ + cov[xt−1, ∆t] + cov[∆t, xt−1]
µ∆ and Σ∆ are computed exactly via the laws of iterated expectation and total variance.
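PILCO computes these moments in closed form; as a simple stand-in to make the idea concrete, the sketch below estimates them by Monte Carlo, using a hypothetical deterministic one-step model delta_fn (so, unlike PILCO, it ignores the model's own predictive uncertainty).

```python
import jax.numpy as jnp
from jax import random, vmap

def mc_moments(key, mu_prev, Sigma_prev, delta_fn, n_samples=10000):
    """Monte Carlo stand-in for analytic moment matching: estimate the mean
    and covariance of x_t = x_{t-1} + Delta_t by sampling
    x_{t-1} ~ N(mu_prev, Sigma_prev) and pushing samples through delta_fn."""
    x_prev = random.multivariate_normal(
        key, mu_prev, Sigma_prev, shape=(n_samples,))
    x_t = x_prev + vmap(delta_fn)(x_prev)
    return x_t.mean(axis=0), jnp.cov(x_t.T)

# Toy usage with a made-up one-step model.
key = random.PRNGKey(0)
mu, Sigma = jnp.zeros(2), 0.1 * jnp.eye(2)
mu_t, Sigma_t = mc_moments(key, mu, Sigma, lambda x: jnp.tanh(x))
```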
Analytic Gradient for Policy Improvement
Let Et = E_{xt}[c(xt)] so that Jπ(θ) = ∑_{t=1}^T Et.
Et depends on θ through p(xt) = N(µt, Σt), which depends on θ through p(xt−1), …, and ultimately through the control moments µu and Σu, where ut = π(xt, θ).
The chain rule is used to calculate the derivatives dJπ(θ)/dθ analytically.
Analytic gradients allow for gradient-based non-convex optimization methods, e.g., CG or L-BFGS. A toy sketch of this differentiate-through-the-rollout idea follows.
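The sketch below replaces the GP model and moment matching with made-up deterministic linear dynamics and uses automatic differentiation (jax.grad) in place of PILCO's hand-derived chain-rule gradients; only the idea of differentiating the T-step expected cost with respect to θ is illustrated.

```python
import jax.numpy as jnp
from jax import grad

def rollout_cost(theta, x0, T=20):
    """Total cost of a T-step rollout under a linear policy u = theta @ x,
    with the saturating cost toward the origin."""
    A = jnp.array([[1.0, 0.1], [0.0, 1.0]])   # made-up dynamics
    B = jnp.array([[0.0], [0.1]])
    x, total = x0, 0.0
    for _ in range(T):
        u = theta @ x                          # u_t = pi(x_t, theta)
        x = A @ x + B @ u
        total = total + (1.0 - jnp.exp(-jnp.sum(x ** 2) / 0.25))
    return total

# Plain gradient descent on the autodiff gradient of J with respect to theta;
# in practice a method such as CG or L-BFGS would consume these gradients.
theta = jnp.zeros((1, 2))
x0 = jnp.array([1.0, 0.0])
dJ = grad(rollout_cost)
for _ in range(100):
    theta = theta - 0.1 * dJ(theta, x0)
```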
Data-Efficiency
[Figure: interaction time required to learn benchmark tasks such as cart-pole swing-up; PILCO needs orders of magnitude less experience than prior methods.]
Advantages and Disadvantages
Advantages
Data-efficient
Incorporates model uncertainty into long-term planning
Does not rely on expert knowledge, e.g., demonstrations or task-specific prior knowledge.
Disadvantages
Not an optimal control method: if the state distributions p(xt) do not cover the target region and σc induces a cost that is very peaked around the target, PILCO gets stuck in a local optimum because of zero gradients.
Learned dynamics models are only confident in areas of the state space previously observed.
Does not take temporal correlation into account: model uncertainty is treated as uncorrelated noise.
Extension: PILCO with Bayesian Filtering
R. McAllister and C. Rasmussen, "Data-Efficient Reinforcement Learning in Continuous-State POMDPs." https://arxiv.org/abs/1602.02523
References
1. M.P. Deisenroth and C.E. Rasmussen, "PILCO: A Model-Based and Data-Efficient Approach to Policy Search," in Proceedings of the 28th International Conference on Machine Learning, Bellevue, WA, USA, 2011.
2. R. McAllister and C. Rasmussen, "Data-Efficient Reinforcement Learning in Continuous-State POMDPs." https://arxiv.org/abs/1602.02523
3. C.E. Rasmussen and C.K.I. Williams (2006). Gaussian Processes for Machine Learning. MIT Press. www.gaussianprocess.org/gpml/chapters
4. C.M. Bishop (2006). Pattern Recognition and Machine Learning, Chapter 6.4. Springer. ISBN 0-387-31073-8.