
Page 1: Big picture and key challenges

Apprenticeship Learning for Robotic Control, with Applications to Quadruped Locomotion and Autonomous Helicopter Flight

Pieter Abbeel, Stanford University

(Variation hereof) presented at Cornell/USC/UCSD/Michigan/UNC/Duke/UCLA/UW/EPFL/Berkeley/CMU, Winter/Spring 2008

In collaboration with: Andrew Y. Ng, Adam Coates, J. Zico Kolter, Morgan Quigley, Dmitri Dolgov, Sebastian Thrun.

Page 2: Big picture and key challenges
Page 3: Big picture and key challenges

Big picture and key challenges

(Diagram) Dynamics model Psa + Reward function R → Reinforcement Learning / Optimal Control → Controller/Policy π

Dynamics model Psa: probability distribution over next states given the current state and action.
Reward function R: describes the desirability of being in a state.
Controller/Policy π: prescribes the action to take for each state.

Key challenges:
Providing a formal specification of the control task.
Building a good dynamics model.
Finding closed-loop controllers.

Page 4: Big picture and key challenges

Overview

Apprenticeship learning algorithms: leverage expert demonstrations to learn to perform a desired task.

Formal guarantees:
Running time.
Sample complexity.
Performance of resulting controller.

Enabled us to solve highly challenging, previously unsolved, real-world control problems in:
Quadruped locomotion.
Autonomous helicopter flight.

Page 5: Big picture and key challenges

Example task: driving

Page 6: Big picture and key challenges

Problem setup

Input:
Dynamics model / simulator Psa(st+1 | st, at).
No reward function.
Teacher's demonstration: s0, a0, s1, a1, s2, a2, … (= trace of the teacher's policy π*).

Desired output: a policy π which (ideally) has performance guarantees, i.e., E[Σt R*(st) | π] ≥ E[Σt R*(st) | π*] − ε.

Note: R* is unknown.
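To make the performance criterion concrete, here is a minimal Monte Carlo sketch of U(π) = E[Σt γ^t R(st) | π] on a toy simulator. The MDP, the policies, and the use of a known reward are placeholders; in the actual setting R* is never observed by the learner, so this estimate only serves to state the criterion.

```python
# Monte Carlo estimate of U(pi) = E[sum_t gamma^t R(s_t) | pi] in a toy
# simulator. In the apprenticeship-learning setting R* is unknown to the
# learner; this estimator only makes the performance criterion concrete.
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma, T = 10, 2, 0.95, 100
P = rng.integers(0, nS, size=(nS, nA))   # toy deterministic simulator: s' = P[s, a]
R_star = rng.random(nS)                  # hidden "true" reward (for the sketch only)

def expected_return(policy, n_rollouts=200):
    total = 0.0
    for _ in range(n_rollouts):
        s, discount = rng.integers(nS), 1.0
        for _ in range(T):
            total += discount * R_star[s]
            s, discount = P[s, policy(s)], discount * gamma
    return total / n_rollouts

teacher = lambda s: 0                    # stand-in for the teacher's policy pi*
learner = lambda s: rng.integers(nA)     # stand-in for a returned policy pi
print(expected_return(teacher), expected_return(learner))
```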

Page 7: Big picture and key challenges

Prior work: behavioral cloning

Formulate as a standard machine learning problem:
Fix a policy class, e.g., support vector machine, neural network, decision tree, deep belief net, …
Estimate a policy from the training examples (s0, a0), (s1, a1), (s2, a2), …

E.g., Pomerleau, 1989; Sammut et al., 1992; Kuniyoshi et al., 1994; Demiris & Hayes, 1994; Amit & Mataric, 2002.
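As a concrete illustration, a minimal behavioral-cloning sketch, assuming a discrete action set, scikit-learn's LogisticRegression as the policy class, and toy demonstration data; none of this comes from the original experiments.

```python
# Minimal behavioral-cloning sketch (not the original experiments):
# fit a classifier that maps states to the teacher's actions.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Placeholder demonstration data: states are feature vectors phi(s_t),
# actions are discrete labels chosen by a hypothetical teacher.
states = rng.normal(size=(500, 4))           # phi(s_t)
actions = (states[:, 0] > 0).astype(int)     # teacher's a_t (toy rule)

# "Fix a policy class" (here: a linear classifier) and estimate a policy
# from the (s_t, a_t) pairs -- the supervised-learning reduction above.
policy = LogisticRegression().fit(states, actions)

def act(state):
    """Cloned policy: predict the teacher's action for a new state."""
    return int(policy.predict(state.reshape(1, -1))[0])

print(act(np.array([0.5, 0.0, -0.2, 1.0])))
```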

Page 8: Big picture and key challenges

Prior work: behavioral cloning

Limitations:
Fails to provide strong performance guarantees.
Underlying assumption: policy simplicity.

Page 9: Big picture and key challenges

Problem structure

(Diagram) Dynamics model Psa + Reward function R → Reinforcement Learning / Optimal Control → Controller/Policy π

Controller/Policy π: prescribes the action to take for each state; typically very complex.
Reward function R: often fairly succinct.

Page 10: Big picture and key challenges

Apprenticeship learning [Abbeel & Ng, 2004]

Assume the reward is a linear function of known features: Rw(s) = w·φ(s).

Initialize: pick some controller π0.

Iterate for i = 1, 2, … :
"Guess" the reward function: find a reward function Rw such that the teacher maximally outperforms all previously found controllers.
Find the optimal control policy πi for the current guess of the reward function Rw.
If the teacher's advantage under every reward function is small, exit the algorithm: there is no reward function for which the teacher significantly outperforms the thus-far found policies.

Key idea: learning through reward functions rather than directly learning policies. (A sketch of this loop is given below.)
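As referenced above, here is a sketch of the loop using the projection variant of the algorithm on a small synthetic MDP. The MDP, the features, and the "teacher" are toy stand-ins invented for the example, not the systems from the talk.

```python
# Sketch of apprenticeship learning via inverse RL (projection variant,
# in the spirit of Abbeel & Ng, 2004) on a tiny synthetic MDP.
import numpy as np

rng = np.random.default_rng(0)
nS, nA, nF, gamma = 20, 4, 5, 0.95
P = rng.integers(0, nS, size=(nS, nA))      # deterministic next state s' = P[s, a]
phi = rng.random((nS, nF))                  # state features phi(s)
d0 = np.full(nS, 1.0 / nS)                  # start-state distribution

def optimal_policy(w, iters=200):
    """Value iteration for the reward R_w(s) = w . phi(s)."""
    R, V = phi @ w, np.zeros(nS)
    for _ in range(iters):
        V = R + gamma * V[P].max(axis=1)
    return V[P].argmax(axis=1)              # greedy action per state

def feature_expectations(pi, iters=200):
    """mu(pi) = E[sum_t gamma^t phi(s_t)] under policy pi."""
    mu = np.zeros((nS, nF))
    for _ in range(iters):
        mu = phi + gamma * mu[P[np.arange(nS), pi]]
    return d0 @ mu

# Teacher: optimal for a hidden "true" reward (stand-in for demonstrations).
w_true = rng.random(nF)
mu_E = feature_expectations(optimal_policy(w_true))

# Projection algorithm: repeatedly "guess" a reward for which the teacher
# maximally outperforms the policies found so far, then solve for that reward.
pi0 = rng.integers(0, nA, nS)               # arbitrary initial controller pi_0
mu_bar = feature_expectations(pi0)
for i in range(50):
    w = mu_E - mu_bar                       # reward direction favoring the teacher
    if np.linalg.norm(w) <= 1e-3:           # teacher no longer clearly better: done
        break
    mu_i = feature_expectations(optimal_policy(w))
    step = mu_i - mu_bar
    if step @ step < 1e-12:
        break
    mu_bar = mu_bar + (step @ (mu_E - mu_bar)) / (step @ step) * step
print("iterations:", i, "final gap:", np.linalg.norm(mu_E - mu_bar))
```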

Page 11: Big picture and key challenges

Theoretical guarantees

Guarantee w.r.t. unrecoverable reward function of teacher.

Sample complexity does not depend on complexity of teacher’s policy p*.

Page 12: Big picture and key challenges

Related work

Prior work:
Behavioral cloning (covered earlier).
Utility elicitation / inverse reinforcement learning: Ng & Russell, 2000. No strong performance guarantees.

Closely related later work: Ratliff et al., 2006, 2007; Neu & Szepesvari, 2007; Syed & Schapire, 2008.

Work on specialized reward function: trajectories. E.g., Atkeson & Schaal, 1997.

Page 13: Big picture and key challenges

Highway driving

(Videos: teacher in training world; learned policy in testing world.)

Input:
Dynamics model / simulator Psa(st+1 | st, at).
Teacher's demonstration: 1 minute in the "training world".
Note: R* is unknown.
Reward features: 5 features corresponding to lanes/shoulders; 10 features corresponding to presence of other car in current lane at different distances.
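The slide only names the feature groups, so the encoding below is an illustrative guess at how such a linear reward could be represented: 5 lane/shoulder indicators plus 10 distance-binned indicators for a car in the current lane. The split into 3 lanes and 2 shoulders and the example weights are assumptions.

```python
# Illustrative encoding of the driving reward described above:
# a linear function of 5 lane/shoulder indicators and 10 indicators for
# a car in the current lane at different distance bins (15 features total).
import numpy as np

N_LANES, N_DIST_BINS = 5, 10   # assumed: 3 lanes + 2 shoulders, 10 distance bins

def featurize(lane_index, car_distance_bin):
    """phi(s): one-hot lane/shoulder + one-hot 'car ahead in my lane' bin."""
    phi = np.zeros(N_LANES + N_DIST_BINS)
    phi[lane_index] = 1.0
    if car_distance_bin is not None:                 # None = no car in this lane
        phi[N_LANES + car_distance_bin] = 1.0
    return phi

def reward(w, lane_index, car_distance_bin):
    """R_w(s) = w . phi(s); w is what apprenticeship learning recovers."""
    return float(w @ featurize(lane_index, car_distance_bin))

# Example: weights that prefer the right lane and penalize tailgating.
w = np.concatenate([[0.0, 0.2, 0.5, -1.0, -1.0],     # lanes, then shoulders
                    -np.linspace(1.0, 0.1, N_DIST_BINS)])
print(reward(w, lane_index=2, car_distance_bin=0))   # right lane, car very close
```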

Page 14: Big picture and key challenges

More driving examples

In each video, the left sub-panel shows a demonstration of a different driving “style”, and the right sub-panel shows the behavior learned from watching the demonstration.


Page 15: Big picture and key challenges

Parking lot navigation

[Abbeel et al., submitted]

Reward function trades off: curvature smoothness, distance to obstacles, alignment with principal directions.
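A hedged sketch of what such a trade-off can look like as a path cost over waypoints; the specific feature definitions, functional forms, and weights here are illustrative, not the ones from the paper.

```python
# Illustrative path cost combining the three terms named above; the exact
# feature definitions and weights in the original work may differ.
import numpy as np

def path_cost(xy, obstacles, principal_dir, w_smooth, w_obst, w_align):
    """xy: (T,2) waypoints; obstacles: (K,2) points; principal_dir: unit vector."""
    headings = np.diff(xy, axis=0)                        # segment directions
    turn = np.diff(headings, axis=0)                      # change in direction
    smoothness = np.sum(turn ** 2)                        # curvature-smoothness term

    # Distance-to-obstacle term: penalize waypoints close to any obstacle.
    d = np.linalg.norm(xy[:, None, :] - obstacles[None, :, :], axis=2)
    obstacle = np.sum(np.exp(-d.min(axis=1)))

    # Alignment term: reward driving parallel to the lot's principal direction.
    unit = headings / (np.linalg.norm(headings, axis=1, keepdims=True) + 1e-9)
    alignment = -np.sum(np.abs(unit @ principal_dir))

    return w_smooth * smoothness + w_obst * obstacle + w_align * alignment

xy = np.stack([np.linspace(0, 10, 20), np.zeros(20)], axis=1)
print(path_cost(xy, obstacles=np.array([[5.0, 1.0]]),
                principal_dir=np.array([1.0, 0.0]),
                w_smooth=1.0, w_obst=1.0, w_align=0.5))
```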

Page 16: Big picture and key challenges

Experimental setup

Demonstrate parking lot navigation on the "train parking lot."

Run our apprenticeship learning algorithm to find a set of reward weights w.

Receive the "test parking lot" map + starting point and destination.

Find a policy for navigating the test parking lot.

(Figure: learned reward weights.)

Page 17: Big picture and key challenges

Learned controller

Page 18: Big picture and key challenges

Quadruped [Kolter, Abbeel & Ng, 2008]

Reward function trades off 25 features.

Page 19: Big picture and key challenges

Experimental setup

Demonstrate a path across the "training terrain."

Run our apprenticeship learning algorithm to find a set of reward weights w.

Receive the "testing terrain" height map.

Find a policy for crossing the testing terrain.

(Figure: learned reward weights.)
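For illustration, a toy footstep-scoring sketch over a height map. The real system trades off 25 hand-designed terrain features (previous slide) with weights learned by the hierarchical algorithm; the three features, the candidate set, and the weights below are invented for the example.

```python
# Illustrative footstep score on a height map; the actual system uses 25
# hand-designed terrain features with weights learned by hierarchical IRL.
import numpy as np

def footstep_features(height_map, x, y, r=2):
    """A few toy features of the terrain patch around a candidate foothold."""
    patch = height_map[x - r:x + r + 1, y - r:y + r + 1]
    return np.array([
        patch.std(),                          # local roughness
        patch.max() - patch.min(),            # height variation
        abs(height_map[x, y] - patch.mean())  # how much the foothold sticks out
    ])

def best_footstep(height_map, candidates, w):
    """Pick the candidate foothold with the highest learned score w . phi."""
    scores = [w @ footstep_features(height_map, x, y) for x, y in candidates]
    return candidates[int(np.argmax(scores))]

rng = np.random.default_rng(0)
terrain = rng.random((20, 20))
candidates = [(5, 5), (5, 6), (6, 5), (6, 6)]
print(best_footstep(terrain, candidates, w=np.array([-1.0, -0.5, -2.0])))
```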

Page 20: Big picture and key challenges

Without learning

Page 21: Big picture and key challenges

With learned reward function

Page 22: Big picture and key challenges

Apprenticeship learning

(Diagram) Teacher's flight (s0, a0, s1, a1, …) → Learn R → Reward function R; the reward function and the dynamics model Psa feed Reinforcement Learning / Optimal Control, which produces the controller π.

Page 23: Big picture and key challenges


Page 24: Big picture and key challenges

Motivating example

Accurate dynamics model Psa: from a textbook model / specification, or by collecting flight data and learning the model from data.

How to fly for data collection? How to ensure that the entire flight envelope is covered?

Page 25: Big picture and key challenges

Aggressive (manual) exploration

Page 26: Big picture and key challenges

Desired properties

Never any explicit exploration (neither manual nor autonomous).
Near-optimal performance (compared to teacher).
Small number of teacher demonstrations.
Small number of autonomous trials.

Page 27: Big picture and key challenges

Apprenticeship learning of the model

(Diagram) Teacher's flight (s0, a0, s1, a1, …) and autonomous flight (s0, a0, s1, a1, …) → Learn Psa → Dynamics model Psa; the dynamics model and the reward function R feed Reinforcement Learning / Optimal Control, which produces the controller π.

Page 28: Big picture and key challenges

Theoretical guarantees [Abbeel & Ng, 2005]

No explicit exploration required.
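A schematic version of this loop on a toy scalar linear system: fit the dynamics to the teacher's data, solve for a controller in the learned model, fly it, append the autonomous data, and refit, with no explicit exploration step. The system, the least-squares model class, and the certainty-equivalence controller are all simplifications made for the sketch.

```python
# Schematic model-learning loop on a toy scalar system; placeholders throughout.
import numpy as np

rng = np.random.default_rng(0)
a_true, b_true = 1.2, 0.7                      # real (unknown) dynamics

def real_step(x, u):
    return a_true * x + b_true * u + 0.01 * rng.normal()

def rollout(policy, T=50):
    xs, us, x = [], [], 1.0
    for _ in range(T):
        u = policy(x)
        xs.append(x); us.append(u)
        x = real_step(x, u)
    return np.array(xs), np.array(us)

def fit_model(xs, us):
    """Least-squares fit of x_{t+1} ~ a x_t + b u_t from flight data."""
    A = np.stack([xs[:-1], us[:-1]], axis=1)
    return np.linalg.lstsq(A, xs[1:], rcond=None)[0]   # (a_hat, b_hat)

teacher = lambda x: -(a_true / b_true) * x + 0.05 * rng.normal()  # stand-in teacher
xs, us = rollout(teacher)                       # 1. teacher's flight
for it in range(5):
    a_hat, b_hat = fit_model(xs, us)            # 2. learn Psa from all data so far
    controller = lambda x: -(a_hat / b_hat) * x # 3. "optimal" control in the model
    new_xs, new_us = rollout(controller)        # 4. autonomous flight
    xs, us = np.concatenate([xs, new_xs]), np.concatenate([us, new_us])
    print(f"iter {it}: model ({a_hat:.3f}, {b_hat:.3f}), |x_T| = {abs(new_xs[-1]):.4f}")
```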

Page 29: Big picture and key challenges

Apprenticeship learning summary

(Diagram) Teacher's flight (s0, a0, s1, a1, …) → Learn R → Reward function R; teacher's flight and autonomous flight (s0, a0, s1, a1, …) → Learn Psa → Dynamics model Psa; the reward function and dynamics model feed Reinforcement Learning / Optimal Control, which produces the controller π.

Page 30: Big picture and key challenges

Other relevant parts of the story

Learning the dynamics model from data:
Locally weighted models (sketched below).
Exploiting structure from physics.
Simulation accuracy at time-scales relevant for control.

Reinforcement learning / optimal control:
Model predictive control.
Receding horizon differential dynamic programming.

[Abbeel et al. 2005, 2006a, 2006b, 2007]
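As one example of the "locally weighted models" item, a generic locally weighted linear regression predictor for one-step dynamics; this is a textbook construction with made-up data, not the helicopter model itself.

```python
# Generic locally weighted linear regression for one-step dynamics,
# illustrating the "locally weighted models" idea; not the helicopter model.
import numpy as np

def lwr_predict(query, X, Y, bandwidth=0.5):
    """Predict the next state for `query` = [state, action] from logged transitions.
    X: (N, d) inputs [s_t, a_t]; Y: (N, k) targets s_{t+1}."""
    w = np.exp(-np.sum((X - query) ** 2, axis=1) / (2 * bandwidth ** 2))
    Xb = np.hstack([X, np.ones((len(X), 1))])          # affine local model
    W = np.diag(w)
    theta = np.linalg.solve(Xb.T @ W @ Xb + 1e-6 * np.eye(Xb.shape[1]),
                            Xb.T @ W @ Y)
    return np.append(query, 1.0) @ theta

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 3))                  # toy [s, a] pairs
Y = np.sin(X[:, :1]) + 0.3 * X[:, 2:3]                 # toy nonlinear next state
print(lwr_predict(np.array([0.2, 0.0, 0.5]), X, Y))
```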

Page 31: Big picture and key challenges

Related work

Bagnell & Schneider, 2001; LaCivita, Papageorgiou, Messner & Kanade, 2002; Ng, Kim, Jordan & Sastry, 2004a (2001); Roberts, Corke & Buskey, 2003; Saripalli, Montgomery & Sukhatme, 2003; Shim, Chung, Kim & Sastry, 2003; Doherty et al., 2004.

Gavrilets, Martinos, Mettler and Feron, 2002; Ng et al., 2004b.

Maneuvers presented here are significantly more challenging than those flown by any other autonomous helicopter.

Page 32: Big picture and key challenges

Autonomous aerobatic flips (attempt) before apprenticeship learning

Task description: meticulously hand-engineered. Model: learned from (commonly used) frequency-sweep data.

Page 33: Big picture and key challenges

1. Our expert pilot demonstrates the airshow several times.

Experimental setup for helicopter

Page 34: Big picture and key challenges

Demonstrations

Page 35: Big picture and key challenges

1. Our expert pilot demonstrates the airshow several times.

2. Learn a reward function (trajectory). 3. Learn a dynamics model.

Experimental setup for helicopter
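A simple stand-in for step 2, assuming the demonstrations are already time-aligned: average them into a target trajectory and score a flown trajectory by its squared deviation from that target. The actual system fits a probabilistic model over the demonstrations (see the graphical-model slide near the end); this is only the idea in miniature, with synthetic demonstrations.

```python
# Stand-in for "learn a reward function (trajectory)": average time-aligned
# demonstrations into a target and penalize squared deviation from it.
import numpy as np

rng = np.random.default_rng(0)
T, k = 100, 3                                    # timesteps, state dimension
nominal = np.stack([np.sin(np.linspace(0, 2 * np.pi, T))] * k, axis=1)
demos = [nominal + 0.05 * rng.normal(size=(T, k)) for _ in range(5)]

target = np.mean(demos, axis=0)                  # learned target trajectory

def tracking_reward(trajectory, Q=1.0):
    """Reward = negative quadratic deviation from the target trajectory."""
    return -Q * float(np.sum((trajectory - target) ** 2))

print(tracking_reward(demos[0]), tracking_reward(nominal))
```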

Page 36: Big picture and key challenges

Learned reward (trajectory)

Page 37: Big picture and key challenges

Experimental setup for helicopter

1. Our expert pilot demonstrates the airshow several times.
2. Learn a reward function (trajectory).
3. Learn a dynamics model.
4. Find the optimal control policy for the learned reward and dynamics model.
5. Autonomously fly the airshow.
6. Learn an improved dynamics model. Go back to step 4.

Page 38: Big picture and key challenges
Page 39: Big picture and key challenges

Accuracy. White: target trajectory. Black: autonomously flown trajectory.

Page 40: Big picture and key challenges

Summary

Apprenticeship learning algorithms: learn to perform a task from observing expert demonstrations of the task.

Formal guarantees:
Running time.
Sample complexity.
Performance of resulting controller.

Enabled us to solve highly challenging, previously unsolved, real-world control problems.

Page 41: Big picture and key challenges

Current and future work

Applications:
Autonomous helicopters to assist in wildland fire fighting.
Fixed-wing formation flight: estimated fuel savings for a three-aircraft formation: 20%.

Learning from demonstrations only scratches the surface of the potential impact that work at the intersection of machine learning and control can have on robotics.

Safe autonomous learning.
More general advice taking.

Page 42: Big picture and key challenges
Page 43: Big picture and key challenges

Thank you.

Page 44: Big picture and key challenges

Chaos (short)

Page 45: Big picture and key challenges

Chaos (long)

Page 46: Big picture and key challenges

Flips

Page 47: Big picture and key challenges

Auto-rotation descent

Page 48: Big picture and key challenges

Tic-toc

Page 49: Big picture and key challenges

Autonomous nose-in funnel

Page 50: Big picture and key challenges

Model Learning: Proof Idea

From initial pilot demonstrations, our model/simulator Psa will be accurate for the part of the state space (s, a) visited by the pilot.

Our model/simulator will correctly predict the helicopter's behavior under the pilot's controller π*.

Consequently, there is at least one controller (namely π*) that looks capable of flying the helicopter well in our simulation.

Thus, each time we solve for the optimal controller using the current model/simulator Psa, we will find a controller that successfully flies the helicopter according to Psa.

If, on the actual helicopter, this controller fails to fly the helicopter---despite the model Psa predicting that it should---then it must be visiting parts of the state space that are inaccurately modeled.

Hence, we get useful training data to improve the model. This can happen only a small number of times.

Page 51: Big picture and key challenges

Apprenticeship Learning via Inverse Reinforcement Learning, Pieter Abbeel and Andrew Y. Ng. In Proc. ICML, 2004.

Learning First Order Markov Models for Control, Pieter Abbeel and Andrew Y. Ng. In NIPS 17, 2005.

Exploration and Apprenticeship Learning in Reinforcement Learning, Pieter Abbeel and Andrew Y. Ng. In Proc. ICML, 2005.

Modeling Vehicular Dynamics, with Application to Modeling Helicopters, Pieter Abbeel, Varun Ganapathi and Andrew Y. Ng. In NIPS 18, 2006.

Using Inaccurate Models in Reinforcement Learning, Pieter Abbeel, Morgan Quigley and Andrew Y. Ng. In Proc. ICML, 2006.

An Application of Reinforcement Learning to Aerobatic Helicopter Flight, Pieter Abbeel, Adam Coates, Morgan Quigley and Andrew Y. Ng. In NIPS 19, 2007.

Hierarchical Apprenticeship Learning with Application to Quadruped Locomotion, J. Zico Kolter, Pieter Abbeel and Andrew Y. Ng. In NIPS 20, 2008.

Page 52: Big picture and key challenges

Helicopter dynamics model in auto

Page 53: Big picture and key challenges

Performance guarantee intuition

Intuition by example: let R*(s) = w1 φ1(s) + w2 φ2(s).

If the returned controller π matches the teacher's expected feature counts, i.e., E[Σt φi(st) | π] = E[Σt φi(st) | π*] for i = 1, 2,

then no matter what the values of w1 and w2 are, the controller π performs as well as the teacher's controller π*.
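Spelled out with assumed notation (following the linear-reward setup of Abbeel & Ng, 2004), the argument is the standard feature-matching bound:

```latex
% Worked feature-matching bound (notation assumed; cf. Abbeel & Ng, 2004).
\[
  R^*(s) = w^{*\top}\phi(s), \qquad \|w^*\|_2 \le 1, \qquad
  \mu(\pi) = \mathbb{E}\Big[\textstyle\sum_t \gamma^t \phi(s_t) \,\Big|\, \pi\Big].
\]
\[
  U(\pi) = \mathbb{E}\Big[\textstyle\sum_t \gamma^t R^*(s_t) \,\Big|\, \pi\Big]
         = w^{*\top}\mu(\pi).
\]
\[
  \|\mu(\pi) - \mu(\pi^*)\|_2 \le \epsilon \;\Longrightarrow\;
  |U(\pi) - U(\pi^*)| = \big|w^{*\top}\big(\mu(\pi) - \mu(\pi^*)\big)\big|
  \le \|w^*\|_2\,\|\mu(\pi) - \mu(\pi^*)\|_2 \le \epsilon.
\]
```

So matching (or nearly matching) the teacher's expected feature counts suffices, regardless of the unknown reward weights.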

Page 54: Big picture and key challenges

Probabilistic graphical model for multiple demonstrations

Page 55: Big picture and key challenges

Teacher demonstration for quadruped

Full teacher demonstration = sequence of footsteps.

Much simpler to "teach hierarchically": specify a body path; specify the best footstep in a small area.

Page 56: Big picture and key challenges

Hierarchical inverse RL

Quadratic programming problem (QP): quadratic objective, linear constraints.

Constraint generation for path constraints.
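A schematic sketch of such a max-margin QP with constraint generation, using cvxpy (a tooling choice made here for the sketch, not part of the original work) and synthetic candidate-path features rather than the actual footstep and body-path features.

```python
# Schematic max-margin QP with constraint generation, in the spirit of the
# hierarchical inverse-RL formulation above. Feature vectors are synthetic.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
nF = 6
candidates = rng.random((200, nF))               # feature vectors of candidate paths
w_secret = rng.random(nF)                        # hidden weights to fake an "expert"
expert_idx = int(np.argmin(candidates @ w_secret))
expert = candidates[expert_idx]
others = np.delete(candidates, expert_idx, axis=0)

w = cp.Variable(nF)
xi = cp.Variable(nonneg=True)                    # slack variable
constraints = [w >= 0]                           # weights act as penalty coefficients

for it in range(20):
    # QP: quadratic objective, linear constraints (as on the slide).
    cp.Problem(cp.Minimize(cp.sum_squares(w) + 10 * xi), constraints).solve()
    costs = others @ w.value
    j = int(np.argmin(costs))                    # most violated alternative path
    if costs[j] >= expert @ w.value + 1 - 1e-6:  # expert already wins by the margin
        break
    # Constraint generation: require this alternative to cost at least
    # margin 1 more than the expert's path (up to slack xi).
    constraints.append(others[j] @ w >= expert @ w + 1 - xi)

print("constraints generated:", it, "learned weights:", np.round(w.value, 2))
```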

Page 57: Big picture and key challenges

Experimental setup

Training:
Have the quadruped walk straight across a fairly simple board with fixed-spaced foot placements.
Around each foot placement: label the best foot placement (about 20 labels).
Label the best body-path for the training board.
Use our hierarchical inverse RL algorithm to learn a reward function from the footstep and path labels.

Test on hold-out terrains: plan a path across the test board.