An Application of Reinforcement Learning
to Autonomous Helicopter Flight
Pieter Abbeel, Adam Coates, Morgan Quigley and Andrew Y. Ng
Stanford University
Overview
Autonomous helicopter flight is widely accepted to be a highly challenging control/reinforcement learning (RL) problem.
Human expert pilots significantly outperform autonomous helicopters.
Apprenticeship learning algorithms use expert demonstrations to obtain good controllers.
Our experimental results significantly extend the state of the art in autonomous helicopter aerobatics.
Apprenticeship learning and RL

Apprenticeship learning uses an expert demonstration to help select the model and the reward function.

Dynamics Model P_sa + Reward Function R → Reinforcement Learning → Control policy π: max_π E[R(s_0) + … + R(s_T)]

Unknown dynamics: flight data is required to obtain an accurate model.
Hard to specify the reward function for complex tasks such as helicopter aerobatics.
Learning the dynamical model
State-of-the-art: E3 algorithm, Kearns and Singh (2002). (And its variants/extensions: Kearns and Koller, 1999; Kakade, Kearns and Langford, 2003; Brafman and Tennenholtz, 2002.)
Have a good model of the dynamics?
  NO → “Explore”
  YES → “Exploit”
Learning the dynamical model
Exploration policies are impractical: they do not even try to perform well.
Can we avoid explicit exploration and just exploit?
Aggressive manual exploration
Apprenticeship learning of the model
Expert human pilot flight → (a1, s1, a2, s2, a3, s3, …) → Learn P_sa
Autonomous flight → (a1, s1, a2, s2, a3, s3, …) → Learn P_sa

Dynamics Model P_sa + Reward Function R → Reinforcement Learning → Control policy π: max_π E[R(s_0) + … + R(s_T)]

Theorem [Abbeel & Ng, 2005]. The described procedure will return a policy as good as the expert’s policy in a polynomial number of iterations.
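The loop above can be sketched on a hypothetical 1-D system (the names `fly`, `fit_dynamics`, the toy dynamics, and the expert stand-in are all illustrative assumptions, not the paper's implementation): an expert demonstration seeds the first model, and from then on only "exploit" data from autonomous flight is used to refine it.

```python
import numpy as np

# Toy sketch of apprenticeship learning of a dynamics model.
# Hypothetical 1-D true dynamics: s' = A*s + B*a + noise.
A_true, B_true = 0.9, 0.5
rng = np.random.default_rng(0)

def fly(policy, n=200):
    """Collect (s, a, s') transitions under a given policy."""
    s, data = 0.0, []
    for _ in range(n):
        a = policy(s)
        s_next = A_true * s + B_true * a + 0.01 * rng.standard_normal()
        data.append((s, a, s_next))
        s = s_next
    return data

def fit_dynamics(data):
    """Least-squares fit of s' ~ A*s + B*a from flight data."""
    X = np.array([[s, a] for s, a, _ in data])
    y = np.array([sn for _, _, sn in data])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef  # [A_hat, B_hat]

# Expert demonstration seeds the model; the pilot's inputs naturally
# excite the dynamics, so no explicit exploration policy is flown.
expert = lambda s: -1.2 * s + 0.3 * rng.standard_normal()
data = fly(expert)
for _ in range(3):  # iterate: fit model, fly autonomously, refit
    A_hat, B_hat = fit_dynamics(data)
    policy = lambda s, A=A_hat, B=B_hat: -(A / B) * s  # drive predicted next state to 0
    data += fly(policy)
```

After a few iterations the fitted coefficients recover the true dynamics, because the expert data already covers the part of the state space the autonomous controller visits.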
Learning the dynamics model
Details of the algorithm for learning the dynamics model:
Gravity subtraction [Abbeel, Ganapathi & Ng, 2005]
Lagged criterion [Abbeel & Ng, 2004]
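Gravity subtraction can be illustrated with a minimal 2-D sketch (the toy "aerodynamic" term `k_true * u` and all numbers here are assumptions for illustration): the gravity contribution to body-frame acceleration is known exactly from the vehicle's attitude, so it is subtracted before fitting and the model only has to learn the unknown part.

```python
import numpy as np

# Sketch of gravity subtraction on a hypothetical 2-D toy problem.
g_world = np.array([0.0, -9.81])

def rotation(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def gravity_in_body_frame(theta):
    # World->body rotation is the transpose of body->world.
    return rotation(theta).T @ g_world

# Measured body-frame accelerations at various attitudes theta,
# generated here as gravity plus a simple "aerodynamic" term k*u.
k_true = 2.0
thetas = np.linspace(-1.0, 1.0, 50)
u = np.linspace(0.0, 1.0, 50)  # control input
acc = np.array([gravity_in_body_frame(t) + k_true * np.array([ui, 0.0])
                for t, ui in zip(thetas, u)])

# Subtract the known gravity term, then fit only the residual.
resid = acc - np.array([gravity_in_body_frame(t) for t in thetas])
k_hat = np.linalg.lstsq(u.reshape(-1, 1), resid[:, 0], rcond=None)[0][0]
```

Fitting the residual rather than the raw acceleration keeps the large, attitude-dependent gravity signal from dominating the regression.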
Autonomous nose-in funnel
Autonomous tail-in funnel
Apprenticeship learning: reward
Dynamics Model P_sa + Reward Function R → Reinforcement Learning → Control policy π: max_π E[R(s_0) + … + R(s_T)]

Hard to specify the reward function for complex tasks such as helicopter aerobatics.
Example task: flip
Ideal flip: rotate 360 degrees around the horizontal axis going right to left through the helicopter.
[Figure: eight frames of the flip, annotated with the gravity (g) and thrust (T) vectors at each stage.]
Example task: flip (2)
Specify the flip task as: an idealized trajectory, plus a reward function that penalizes deviation from it.
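A deviation-penalty reward of this kind can be sketched as a quadratic cost around the idealized trajectory (the weight matrix `Q` and the hover target below are hypothetical choices, not the paper's):

```python
import numpy as np

# Sketch of a reward that penalizes deviation from an idealized trajectory:
# at time t, reward is minus the squared weighted distance to the target state.
def make_reward(target_traj, Q):
    def reward(s, t):
        err = s - target_traj[t]
        return -err @ Q @ err  # quadratic penalty, 0 when exactly on target
    return reward

# Toy 2-state example: the target is a constant hover state.
target = np.zeros((10, 2))
Q = np.diag([1.0, 0.1])  # weight position error more than velocity error
R = make_reward(target, Q)
print(R(np.array([2.0, 0.0]), 3))  # -4.0
```

The reward is maximal (zero) exactly on the idealized trajectory and falls off quadratically, which is what gradient-based policy optimizers such as DDP handle well.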
Example of a bad reward function
Apprenticeship learning for the reward function
Our approach: observe the expert’s demonstration of the task, then infer the reward function from the demonstration. [See also Ng & Russell, 2000.]

Algorithm: iterate for t = 1, 2, …
Inverse RL step: estimate the expert’s reward function R(s) = w^T φ(s) such that under R(s) the expert outperforms all previously found policies {π_i}.
RL step: compute the optimal policy π_t for the estimated reward function.
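The two alternating steps can be sketched with the projection variant of this algorithm [Abbeel & Ng, 2004] on a hypothetical toy problem, where each "policy" is one of a few fixed behaviors with known feature expectations μ (the candidate values below are invented for illustration):

```python
import numpy as np

# Sketch of the inverse-RL iteration (projection variant) on a toy problem.
candidates = np.array([[1.0, 0.0],   # feature expectations of candidate policies
                       [0.0, 1.0],
                       [0.6, 0.6]])
mu_expert = np.array([0.6, 0.6])     # expert's feature expectations

def best_response(w):
    """RL step: the policy maximizing reward w . phi is, in this toy,
    simply the candidate with the largest w . mu."""
    return candidates[np.argmax(candidates @ w)]

mu_bar = candidates[0].copy()        # start from an arbitrary policy
for _ in range(10):
    w = mu_expert - mu_bar           # inverse-RL step: a reward under which
                                     # the expert beats the policies found so far
    if np.linalg.norm(w) < 1e-6:
        break                        # expert's performance is matched
    mu = best_response(w)
    d = mu - mu_bar
    mu_bar = mu_bar + (d @ (mu_expert - mu_bar)) / (d @ d) * d  # project toward expert
```

The loop terminates once the mixture of found policies matches the expert's feature expectations, at which point it performs as well as the expert under any reward linear in the features.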
Theoretical Results: Convergence
Theorem. After a number of iterations polynomial in the number of features and the horizon, the algorithm outputs a policy that performs nearly as well as the expert, as evaluated on the unknown reward function R*(s) = w*^T φ(s).
[Abbeel & Ng, 2004]
Overview
Dynamics Model P_sa + Reward Function R → Reinforcement Learning → Control policy π: max_π E[R(s_0) + … + R(s_T)]
Optimal control algorithm
Differential dynamic programming [Jacobson & Mayne, 1970; Anderson & Moore, 1989]: an efficient algorithm to (locally) optimize a policy for continuous state/action spaces.
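The core of each DDP iteration can be sketched as follows: around a nominal trajectory the dynamics are linearized and the cost quadratized, which reduces the backward pass to finite-horizon LQR. The toy 2-state system, weights, and horizon below are hypothetical, not the helicopter model:

```python
import numpy as np

# Sketch of the LQR backward pass at the heart of DDP,
# on a hypothetical discretized double integrator (dt = 0.1).
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
Q = np.eye(2)           # state-deviation penalty
Rc = np.array([[0.1]])  # control penalty
T = 50                  # horizon

P = Q                   # value-function Hessian at the final time
gains = []
for _ in range(T):      # backward Riccati recursion
    K = np.linalg.solve(Rc + B.T @ P @ B, B.T @ P @ A)
    P = Q + A.T @ P @ (A - B @ K)
    gains.append(K)
gains.reverse()         # gains[t] is the feedback gain at time t

# Forward pass: u_t = -K_t x_t drives the deviation toward zero.
x = np.array([[1.0], [0.0]])
for K in gains:
    x = A @ x - B @ (K @ x)
```

Full DDP repeats this linearize-solve-rollout cycle until the nominal trajectory converges; the time-varying gains are what make locally aggressive maneuvers feasible.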
DDP design choices and lessons learned
Simplest reward function: penalize deviation from the target state at each time. Insufficient: the resulting controllers perform very poorly.
Penalizing high-frequency control inputs significantly improves the controllers.
To allow aggressive maneuvering, we use a two-step procedure: make a plan off-line, then penalize high-frequency deviations from the planned inputs.
Penalize integrated orientation error. [See paper for details.]
Process noise has little influence on the controllers’ performance.
Observation noise and delay in observations greatly affect the controllers’ performance.
Autonomous stationary flips
Autonomous stationary rolls
Related work
Bagnell & Schneider, 2001; LaCivita et al., 2006; Ng et al., 2004a; Roberts et al., 2003; Saripalli et al., 2003; Ng et al., 2004b; Gavrilets, Martinos, Mettler and Feron, 2002.
Maneuvers presented here are significantly more difficult than those flown by any other autonomous helicopter.
Conclusion
Apprenticeship learning for the dynamics model avoids explicit exploration in our experiments.
Procedure based on inverse RL for the reward function gives performance similar to human pilots.
Our results significantly extend the state of the art in autonomous helicopter flight: the first autonomous completion of stationary flips and rolls, tail-in funnels, and nose-in funnels.
Acknowledgments
Ben Tse, Garett Oku, Antonio Genova. Mark Woodward, Tim Worley.
Continuous flips