TRANSCRIPT
PILCO: A Model-Based and Data-Efficient Approach to Policy Search
(M.P. Deisenroth and C.E. Rasmussen)
CSC2541, November 4, 2016
PILCO Graphical Model
PILCO – Probabilistic Inference for Learning COntrol
Latent states {Xt} evolve through time based on previous states and controls.
The policy π maps Zt, a noisy observation of Xt, to a control Ut.
PILCO Objective
Transitions follow the dynamical system
xt = f(xt−1, ut−1)
where x ∈ R^D, u ∈ R^F, and f is a latent function.
Let π be parameterized by θ with ut = π(xt, θ). The objective is to find the π that minimizes the expected cost of following π for T steps:
Jπ(θ) = ∑_{t=1}^T E_{xt}[c(xt)]
The cost function encodes information about a target state, e.g.,
c(x) = 1 − exp(−‖x − xtarget‖² / σc²)
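As a concrete illustration, here is a minimal sketch of this saturating cost in JAX; the target state and width σc below are made-up values for demonstration.

```python
import jax.numpy as jnp

def saturating_cost(x, x_target, sigma_c):
    """c(x) = 1 - exp(-||x - x_target||^2 / sigma_c^2), bounded in [0, 1)."""
    sq_dist = jnp.sum((x - x_target) ** 2)
    return 1.0 - jnp.exp(-sq_dist / sigma_c ** 2)

# The cost approaches 0 near the target and saturates at 1 far away.
x_target = jnp.array([0.0, 0.0])
print(saturating_cost(jnp.array([0.1, 0.0]), x_target, sigma_c=0.25))  # ~0.15
print(saturating_cost(jnp.array([3.0, 0.0]), x_target, sigma_c=0.25))  # ~1.0
```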
Algorithm
The high-level loop from the paper:
1. Apply random control signals and record the resulting data.
2. Learn a GP dynamics model from all data collected so far.
3. Policy evaluation: approximate inference to predict the long-term expected cost Jπ(θ).
4. Policy improvement: update θ using the analytic gradient dJπ(θ)/dθ.
5. Apply the updated policy to the system, record the data, and repeat from step 2.
Dynamics Model Learning
Given only a finite set of observed transitions, there are multiple plausible function approximators of f, as illustrated by the sketch below.
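A minimal sketch of this idea, assuming 1-D toy data and made-up hyper-parameters: each sample drawn from a GP posterior is one plausible approximator of the latent function consistent with the data.

```python
import jax.numpy as jnp
from jax import random

def se_kernel_matrix(A, B, ell=1.0, sf2=1.0):
    """Squared exponential kernel matrix between 1-D input vectors A and B."""
    d = (A[:, None] - B[None, :]) / ell
    return sf2 * jnp.exp(-0.5 * d ** 2)

X = jnp.array([-2.0, -1.0, 0.5, 2.0])      # observed inputs (toy)
y = jnp.sin(X)                              # observed targets (toy)
Xs = jnp.linspace(-3.0, 3.0, 100)           # test inputs

K = se_kernel_matrix(X, X) + 1e-4 * jnp.eye(4)      # jitter for stability
Ks = se_kernel_matrix(Xs, X)
Kss = se_kernel_matrix(Xs, Xs)
mean = Ks @ jnp.linalg.solve(K, y)                   # posterior mean
cov = Kss - Ks @ jnp.linalg.solve(K, Ks.T)           # posterior covariance

# Each row of `samples` is one plausible approximator of the latent function.
samples = random.multivariate_normal(
    random.PRNGKey(0), mean, cov + 1e-6 * jnp.eye(100), shape=(5,))
```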
Dynamics Model Learning
Define a Gaussian process (GP) prior on the latent dynamics function f.
Dynamics Model Learning
Let the prior on f be GP(0, k(x̃, x̃′)), where x̃ ≜ [xᵀ uᵀ]ᵀ and the squared exponential kernel is given by
k(x̃, x̃′) = σf² exp(−½ (x̃ − x̃′)ᵀ Λ⁻¹ (x̃ − x̃′)),  Λ = diag([ℓ1², …, ℓ_{D+F}²])
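A minimal sketch of this kernel with one (ARD) lengthscale per input dimension; the inputs and hyper-parameter values below are made up for illustration.

```python
import jax.numpy as jnp

def se_kernel(x1, x2, ell, sf2):
    """k(x1, x2) = sf2 * exp(-0.5 * (x1 - x2)^T diag(ell**2)^{-1} (x1 - x2))."""
    d = (x1 - x2) / ell
    return sf2 * jnp.exp(-0.5 * jnp.dot(d, d))

# Example on a state-control input x_tilde = [x; u].
x1 = jnp.array([0.0, 1.0, 0.5])   # [x_1, x_2, u]
x2 = jnp.array([0.1, 0.9, 0.4])
ell = jnp.array([1.0, 2.0, 0.5])  # one lengthscale per input dimension (ARD)
print(se_kernel(x1, x2, ell, sf2=1.0))
```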
Dynamics Model Learning
Let ∆t = xt − xt−1 + ε, where ε ∼ N(0, Σε) and Σε = diag([σε1², …, σεD²]). The GP yields one-step predictions p(∆t | x̃t−1) = N(mf(x̃t−1), σf²(x̃t−1)) (see Section 2.2 in reference 3).
Given n training inputs X̃ = [x̃1, …, x̃n] and corresponding training targets y = [∆1, …, ∆n]ᵀ, the GP hyper-parameters are learned by evidence maximization (type-II maximum likelihood).
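A minimal sketch of the one-step posterior predictive for a single target dimension, following GPML Section 2.2 (reference 3); it assumes the hyper-parameters ell, sf2, and the noise variance sn2 (all hypothetical here) have already been fit by evidence maximization.

```python
import jax.numpy as jnp

def gp_predict(X_tilde, y, x_star, ell, sf2, sn2):
    """Posterior predictive at x_star: mean = k*^T (K + sn2 I)^{-1} y,
    var = k(x*, x*) - k*^T (K + sn2 I)^{-1} k*."""
    def k(a, b):
        d = (a - b) / ell
        return sf2 * jnp.exp(-0.5 * jnp.dot(d, d))
    n = X_tilde.shape[0]
    K = jnp.array([[k(xi, xj) for xj in X_tilde] for xi in X_tilde])
    k_star = jnp.array([k(xi, x_star) for xi in X_tilde])
    A = K + sn2 * jnp.eye(n)
    mean = k_star @ jnp.linalg.solve(A, y)
    var = k(x_star, x_star) - k_star @ jnp.linalg.solve(A, k_star)
    return mean, var
```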
Policy Evaluation
In evaluating the objective Jπ(θ), we must calculate p(xt), since E_{xt}[c(xt)] = ∫ c(xt) p(xt) dxt.
We have xt = xt−1 + ∆t − ε, where, in general, computing p(∆t) is analytically intractable.
Instead, p(∆t) is approximated with a Gaussian via moment matching.
Moment Matching
The input distribution p(xt−1, ut−1) is assumed Gaussian.
When it is propagated through the GP model, we obtain the predictive distribution p(∆t), which is generally non-Gaussian.
p(∆t) is approximated by a Gaussian via moment matching.
Moment Matching
p(xt) can now be approximated with N(µt, Σt), where
µt = µt−1 + µ∆
Σt = Σt−1 + Σ∆ + cov[xt−1, ∆t] + cov[∆t, xt−1]
µ∆ and Σ∆ are computed exactly via the laws of iterated expectation and total variance.
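PILCO computes these moments in closed form; as a simple stand-in to make the idea concrete, the sketch below estimates them by Monte Carlo, using a hypothetical deterministic one-step model delta_fn (so, unlike PILCO, it ignores the model's own predictive uncertainty).

```python
import jax.numpy as jnp
from jax import random, vmap

def mc_moments(key, mu_prev, Sigma_prev, delta_fn, n_samples=10000):
    """Monte Carlo stand-in for analytic moment matching: estimate the mean
    and covariance of x_t = x_{t-1} + Delta_t by sampling
    x_{t-1} ~ N(mu_prev, Sigma_prev) and pushing samples through delta_fn."""
    x_prev = random.multivariate_normal(
        key, mu_prev, Sigma_prev, shape=(n_samples,))
    x_t = x_prev + vmap(delta_fn)(x_prev)
    return x_t.mean(axis=0), jnp.cov(x_t.T)

# Toy usage with a made-up one-step model.
key = random.PRNGKey(0)
mu, Sigma = jnp.zeros(2), 0.1 * jnp.eye(2)
mu_t, Sigma_t = mc_moments(key, mu, Sigma, lambda x: jnp.tanh(x))
```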
Analytic Gradient for Policy Improvement
Let Et = E_{xt}[c(xt)] so that Jπ(θ) = ∑_{t=1}^T Et.
Et depends on θ through p(xt) = N(µt, Σt), which depends on θ through p(xt−1), …, and ultimately through the control moments µu and Σu, where ut = π(xt, θ).
The chain rule is used to calculate the derivatives dJπ(θ)/dθ analytically.
Analytic gradients allow for gradient-based non-convex optimization methods, e.g., CG or L-BFGS. A toy sketch of this differentiate-through-the-rollout idea follows.
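The sketch below replaces the GP model and moment matching with made-up deterministic linear dynamics and uses automatic differentiation (jax.grad) in place of PILCO's hand-derived chain-rule gradients; only the idea of differentiating the T-step expected cost with respect to θ is illustrated.

```python
import jax.numpy as jnp
from jax import grad

def rollout_cost(theta, x0, T=20):
    """Total cost of a T-step rollout under a linear policy u = theta @ x,
    with the saturating cost toward the origin."""
    A = jnp.array([[1.0, 0.1], [0.0, 1.0]])   # made-up dynamics
    B = jnp.array([[0.0], [0.1]])
    x, total = x0, 0.0
    for _ in range(T):
        u = theta @ x                          # u_t = pi(x_t, theta)
        x = A @ x + B @ u
        total = total + (1.0 - jnp.exp(-jnp.sum(x ** 2) / 0.25))
    return total

# Plain gradient descent on the autodiff gradient of J with respect to theta;
# in practice a method such as CG or L-BFGS would consume these gradients.
theta = jnp.zeros((1, 2))
x0 = jnp.array([1.0, 0.0])
dJ = grad(rollout_cost)
for _ in range(100):
    theta = theta - 0.1 * dJ(theta, x0)
```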
Data-Efficiency
[Figure: interaction time required to learn benchmark tasks such as cart-pole swing-up; PILCO needs orders of magnitude less experience than prior methods.]
Advantages and Disadvantages
Advantages
Data-efficient
Incorporates model uncertainty into long-term planning
Does not rely on expert knowledge, e.g., demonstrations or task-specific prior knowledge.
Disadvantages
Not an optimal control method: if the state distributions p(xt) do not cover the target region and σc induces a cost that is very peaked around the target, PILCO gets stuck in a local optimum because of zero gradients.
Learned dynamics models are only confident in areas of the state space previously observed.
Does not take temporal correlation into account: model uncertainty is treated as uncorrelated noise.
Extension: PILCO with Bayesian Filtering
R. McAllister and C. Rasmussen, "Data-Efficient Reinforcement Learning in Continuous-State POMDPs." https://arxiv.org/abs/1602.02523
References
1. M.P. Deisenroth and C.E. Rasmussen, "PILCO: A Model-Based and Data-Efficient Approach to Policy Search," in Proceedings of the 28th International Conference on Machine Learning, Bellevue, WA, USA, 2011.
2. R. McAllister and C. Rasmussen, "Data-Efficient Reinforcement Learning in Continuous-State POMDPs." https://arxiv.org/abs/1602.02523
3. C.E. Rasmussen and C.K.I. Williams (2006). Gaussian Processes for Machine Learning. MIT Press. www.gaussianprocess.org/gpml/chapters
4. C.M. Bishop (2006). Pattern Recognition and Machine Learning, Chapter 6.4. Springer. ISBN 0-387-31073-8.