Model-based Reinforcement Learning with Neural Networks on Hierarchical Dynamic System
Akihiko Yamaguchi and Christopher G. Atkeson
Robotics Institute, Carnegie Mellon University
http://akihikoy.net/
My pizza demonstration https://youtu.be/Wgj32blPGiE
https://youtu.be/GjwfbOur3CQ
Pouring: Manipulation of Deformable Objects
Planning actions
Planning parameters of actions = Dynamic Programming (optimal control, MPC, …)
Dynamics are partially unknown
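The slide equates planning the parameters of actions with dynamic programming. A minimal sketch of that idea for a two-stage pouring task, under stated assumptions: `f_grasp`, `f_flow`, `reward`, the target amount 0.5 and all grids are hypothetical stand-ins rather than the authors' models, and grid-based backward induction replaces the stochastic DDP used in this work.

```python
import numpy as np

# Hypothetical subtask models (stand-ins, not learned from data).
def f_grasp(g):
    """Grasp parameter g -> grasp quality q in [0, 1] (best near g = 0.3)."""
    return float(np.clip(1.0 - 10.0 * (g - 0.3) ** 2, 0.0, 1.0))

def f_flow(q, p):
    """Grasp quality q and flow-control parameter p -> poured amount."""
    return q * p

def reward(amount, target=0.5):
    """Penalize deviation from the target amount."""
    return -(amount - target) ** 2

G = np.linspace(0.0, 0.6, 61)   # candidate grasp parameters
P = np.linspace(0.0, 1.0, 101)  # candidate flow-control parameters
Q = np.linspace(0.0, 1.0, 101)  # discretized intermediate state (grasp quality)

# Backward pass: stage-2 value V2(q) = max_p reward(f_flow(q, p)), tabulated over Q.
V2 = np.array([max(reward(f_flow(q, p)) for p in P) for q in Q])

def plan():
    """Pick g maximizing V2(f_grasp(g)), then p maximizing the final reward."""
    v1 = [np.interp(f_grasp(g), Q, V2) for g in G]  # stage-1 values via the table
    g_best = float(G[int(np.argmax(v1))])
    q = f_grasp(g_best)  # forward pass through the first subtask model
    p_best = float(P[int(np.argmax([reward(f_flow(q, p)) for p in P]))])
    return g_best, p_best

g, p = plan()
print(f"grasp={g:.2f}, flow={p:.2f}, amount={f_flow(f_grasp(g), p):.3f}")
```

Tabulating V2 over the intermediate state is what makes this dynamic programming rather than a joint search: stage 1 only needs the value of the grasp result, not the details of stage 2.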
Reinforcement Learning Problem
RL in pouring
Adaptation: not very hard
Generalization: hard
Is a deep NN useful for this problem? (How can it be used in an RL framework?)
Remarks on Reinforcement Learning
It helps to contrast model-free RL with model-based RL
Most successful RL in robot learning has been model-free (direct policy search) [cf. Kober et al. 2013]
Model-free: good at fine-tuning, lower computational cost (at execution), robust to POMDPs
Model-based: suffers from simulation biases
Model-based:
1. Generalization ability
2. Sharable / reusable
3. Can adapt to reward changes
2 and 3 come from the symbolic (hierarchical) representation
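Point 3 can be illustrated in a few lines: once a dynamics model exists, a new reward only requires re-planning, not new data. In this sketch `model` is a hypothetical fixed function standing in for a learned model, and grid search stands in for a real planner.

```python
import numpy as np

# Hypothetical learned model: flow-control parameter p -> poured amount.
# A fixed function stands in for a model fitted once from data.
def model(p):
    return 0.8 * np.tanh(2.0 * p)

P = np.linspace(0.0, 2.0, 201)  # candidate parameters

def plan(reward):
    """Re-plan against the same model; no new interaction data is needed."""
    scores = [reward(model(p)) for p in P]
    return float(P[int(np.argmax(scores))])

# Two tasks share one model; only the reward (target amount) changes.
p_small = plan(lambda amount: -(amount - 0.3) ** 2)
p_large = plan(lambda amount: -(amount - 0.7) ** 2)
print(p_small, p_large)
```

A model-free policy tuned for the first target would, by contrast, have to be re-tuned through new trials for the second.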
[Figure: neural network with input, hidden and output layers, labeled "FK ANN" [Magtanong et al. 2012]]
How to deal with simulation biases?
Do not learn dx/dt = F(x, u) (dt: small, like xx ms)
Learn (sub)task-level dynamics
  Parameters → F_grasp → Grasp result
  Parameters → F_flow_ctrl → Flow ctrl result
Use stochastic models
  Gaussian → F → Gaussian
  Stochastic Neural Networks [Yamaguchi, Atkeson, ICRA 2016]
Use stochastic dynamic programming
  Stochastic Differential Dynamic Programming [Yamaguchi, Atkeson, Humanoids 2015]
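Why stochastic models and stochastic planning matter can be seen with a hypothetical one-parameter model whose output is Gaussian, with a spread that grows with the parameter. Assuming a quadratic reward, the expected reward has a closed form, and maximizing it picks a more conservative parameter than planning on the mean alone. The model, the 0.2 spread factor and the target are illustrative assumptions.

```python
import numpy as np

def model(p):
    """Hypothetical stochastic subtask model: parameter p -> (mean, std) of the outcome.
    Larger p pours more but less predictably."""
    return p, 0.2 * p

def expected_reward(mean, std, target=0.5):
    """E[-(y - target)^2] for y ~ N(mean, std^2) = -(mean - target)^2 - std^2."""
    return -(mean - target) ** 2 - std ** 2

P = np.linspace(0.0, 1.0, 10001)
mean, std = model(P)  # vectorized over all candidate parameters

p_deterministic = float(P[np.argmax(-(mean - 0.5) ** 2)])  # ignores the spread
p_stochastic = float(P[np.argmax(expected_reward(mean, std))])
print(p_deterministic, p_stochastic)
```

The analytic optimum of -(p - 0.5)² - 0.04p² is p = 1/2.08 ≈ 0.481, slightly below the mean-only optimum of 0.5.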
Stochastic Neural Networks
Propagation of a probability distribution from input to output
Gradients of the output expectation w.r.t. an input
Difficulty: nonlinear activation functions
  ReLU (f(x) = max(0, x))
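For a single ReLU unit with a Gaussian input X ~ N(μ, σ²), the output mean, the output variance and the gradient of the output expectation with respect to μ have a standard closed form. With α = μ/σ, Φ the standard normal CDF and φ its PDF: E[max(0, X)] = μΦ(α) + σφ(α), E[max(0, X)²] = (μ² + σ²)Φ(α) + μσφ(α), and dE[max(0, X)]/dμ = Φ(α). A minimal sketch in plain Python; the function names are hypothetical, and how the cited stochastic neural networks combine these moments across layers is not shown here.

```python
import math

def norm_pdf(a):
    """Standard normal density."""
    return math.exp(-0.5 * a * a) / math.sqrt(2.0 * math.pi)

def norm_cdf(a):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(a / math.sqrt(2.0)))

def relu_gaussian(mu, sigma):
    """Mean and variance of max(0, X) for X ~ N(mu, sigma^2), sigma > 0."""
    a = mu / sigma
    cdf, pdf = norm_cdf(a), norm_pdf(a)
    mean = mu * cdf + sigma * pdf
    second_moment = (mu * mu + sigma * sigma) * cdf + mu * sigma * pdf
    return mean, max(second_moment - mean * mean, 0.0)  # clamp tiny negative rounding

def relu_gaussian_mean_grad(mu, sigma):
    """Gradient of E[max(0, X)] with respect to the input mean mu: Phi(mu / sigma)."""
    return norm_cdf(mu / sigma)

print(relu_gaussian(0.0, 1.0))  # mean ≈ 0.399, var ≈ 0.341
```

For example, relu_gaussian(0.0, 1.0) gives a mean of 1/√(2π) ≈ 0.399 and a variance of 0.5 - 1/(2π) ≈ 0.341.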
[Figure: mean model and error model, sharing the same input]
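A minimal sketch of a mean model and an error model that share the same input. Assumptions: the error model is trained to regress the absolute prediction error of the mean model, and polynomial least squares stands in for both neural networks to keep the example short and deterministic; the data, noise level and polynomial degrees are made up.

```python
import numpy as np
from numpy.polynomial import polynomial as P

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, 2000)
noise_std = 0.05 + 0.2 * np.abs(x)  # hypothetical input-dependent noise
y = np.sin(3.0 * x) + rng.normal(0.0, noise_std)

# Mean model: regress y on x.
mean_coef = P.polyfit(x, y, 9)

def mean_model(xq):
    return P.polyval(xq, mean_coef)

# Error model: same (shared) input, regress the mean model's absolute error.
abs_err = np.abs(y - mean_model(x))
err_coef = P.polyfit(x, abs_err, 4)

def error_model(xq):
    return np.maximum(P.polyval(xq, err_coef), 0.0)  # an error estimate cannot be negative

print(mean_model(0.5), error_model(0.0), error_model(0.9))
```

If the noise is Gaussian, the expected absolute error equals std × √(2/π), so the error model's output can be rescaled into a standard deviation for the Gaussian output.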
Use Case
Independent neural networks for each (sub)dynamical system
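Independent models can be chained at a Gaussian interface: each subsystem model maps an input Gaussian to an output Gaussian, and the next model consumes it. This sketch uses first-order linearization with a finite-difference Jacobian as a swapped-in propagation method, not the network-based propagation on the Stochastic Neural Networks slide; `f_grasp`, `f_flow`, the actions and the noise variances are hypothetical.

```python
import numpy as np

def propagate(f, mean, cov, noise_var, eps=1e-6):
    """Linearized Gaussian propagation through y = f(x) + noise.
    Returns (mean_y, cov_y) with cov_y = J cov J^T + diag(noise_var)."""
    mean = np.asarray(mean, dtype=float)
    y0 = np.atleast_1d(f(mean))
    J = np.zeros((y0.size, mean.size))
    for i in range(mean.size):  # central finite-difference Jacobian
        d = np.zeros_like(mean)
        d[i] = eps
        J[:, i] = (np.atleast_1d(f(mean + d)) - np.atleast_1d(f(mean - d))) / (2.0 * eps)
    return y0, J @ cov @ J.T + np.diag(np.atleast_1d(noise_var))

# Hypothetical, independently trained subsystem models (stand-ins).
def f_grasp(x):  # x = [g] -> grasp quality
    return 1.0 - 10.0 * (x[0] - 0.3) ** 2

def f_flow(x):  # x = [grasp quality, flow parameter p] -> poured amount
    return x[0] * x[1]

# Stage 1: deterministic action g = 0.3.
q_mean, q_cov = propagate(f_grasp, [0.3], np.zeros((1, 1)), noise_var=0.01)

# Stage 2: uncertain grasp quality plus deterministic action p = 0.5.
x_mean = np.array([q_mean[0], 0.5])
x_cov = np.array([[q_cov[0, 0], 0.0], [0.0, 0.0]])
amount_mean, amount_cov = propagate(f_flow, x_mean, x_cov, noise_var=0.0025)
print(amount_mean, amount_cov)
```

With these numbers the chained prediction has a mean of 0.5 and a variance of 0.005: 0.0025 propagated from the grasp-quality uncertainty (0.5² × 0.01) plus 0.0025 of stage-2 noise.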
Stochastic Differential Dynamic Programming
Results of Experiments
DNN+DDP outperformed LWR+DDP (locally weighted regression)
Using redundant features did not affect the learning performance
Worked in pouring with a PR2 robot
Video: https://youtu.be/aM3hE1J5W98
More Information
http://akihikoy.net/
https://www.youtube.com/AkihikoYamaguchi
Akihiko Yamaguchi and Christopher G. Atkeson: Neural Networks and Differential Dynamic Programming for Reinforcement Learning Problems, in Proceedings of the 2016 IEEE International Conference on Robotics and Automation (ICRA2016), Stockholm, Sweden, May 2016. https://www.researchgate.net/publication/294729454
Akihiko Yamaguchi and Christopher G. Atkeson: Differential Dynamic Programming with Temporally Decomposed Dynamics, in Proceedings of the 15th IEEE-RAS International Conference on Humanoid Robots (Humanoids2015), pp. 696-703, Seoul, 2015. https://www.researchgate.net/publication/282157952
Akihiko Yamaguchi, Christopher G. Atkeson, and Tsukasa Ogasawara: Pouring Skills with Planning and Learning Modeled from Human Demonstrations, International Journal of Humanoid Robotics, Vol. 12, No. 3, pp. 1550030, July 2015. https://www.researchgate.net/publication/280733055