An Application of Reinforcement Learning
to Autonomous Helicopter Flight
Pieter Abbeel, Adam Coates, Morgan Quigley and Andrew Y. Ng
Stanford University
Overview
Autonomous helicopter flight is widely accepted to be a highly challenging control/reinforcement learning (RL) problem.
Human expert pilots significantly outperform autonomous helicopters.
Apprenticeship learning algorithms use expert demonstrations to obtain good controllers.
Our experimental results significantly extend the state of the art in autonomous helicopter aerobatics.
Apprenticeship learning and RL

Apprenticeship learning uses an expert demonstration to help select the model and the reward function.

Dynamics Model P_sa + Reward Function R → Reinforcement Learning → Control policy π: max_π E[R(s_0) + … + R(s_T)]

Unknown dynamics: flight data is required to obtain an accurate model.
Hard to specify the reward function for complex tasks such as helicopter aerobatics.
Learning the dynamical model
State-of-the-art: E3 algorithm, Kearns and Singh (2002). (And its variants/extensions: Kearns and Koller, 1999; Kakade, Kearns and Langford, 2003; Brafman and Tennenholtz, 2002.)
Have a good model of the dynamics?
  NO → “Explore”
  YES → “Exploit”
Learning the dynamical model
Exploration policies are impractical: they do not even try to perform well.
Can we avoid explicit exploration and just exploit?
Aggressive manual exploration
Apprenticeship learning of the model
Expert human pilot flight → (a1, s1, a2, s2, a3, s3, …) → Learn P_sa
Autonomous flight → (a1, s1, a2, s2, a3, s3, …) → Learn P_sa

Dynamics Model P_sa + Reward Function R → Reinforcement Learning → Control policy π: max_π E[R(s_0) + … + R(s_T)]

Theorem [Abbeel & Ng, 2005]. The described procedure will return a policy as good as the expert’s policy in a polynomial number of iterations.
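The loop above can be sketched on a hypothetical 1-D system (the names `fly`, `fit_dynamics`, the toy dynamics, and the expert stand-in are all illustrative assumptions, not the paper's implementation): an expert demonstration seeds the first model, and from then on only "exploit" data from autonomous flight is used to refine it.

```python
import numpy as np

# Toy sketch of apprenticeship learning of a dynamics model.
# Hypothetical 1-D true dynamics: s' = A*s + B*a + noise.
A_true, B_true = 0.9, 0.5
rng = np.random.default_rng(0)

def fly(policy, n=200):
    """Collect (s, a, s') transitions under a given policy."""
    s, data = 0.0, []
    for _ in range(n):
        a = policy(s)
        s_next = A_true * s + B_true * a + 0.01 * rng.standard_normal()
        data.append((s, a, s_next))
        s = s_next
    return data

def fit_dynamics(data):
    """Least-squares fit of s' ~ A*s + B*a from flight data."""
    X = np.array([[s, a] for s, a, _ in data])
    y = np.array([sn for _, _, sn in data])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef  # [A_hat, B_hat]

# Expert demonstration seeds the model; the pilot's inputs naturally
# excite the dynamics, so no explicit exploration policy is flown.
expert = lambda s: -1.2 * s + 0.3 * rng.standard_normal()
data = fly(expert)
for _ in range(3):  # iterate: fit model, fly autonomously, refit
    A_hat, B_hat = fit_dynamics(data)
    policy = lambda s, A=A_hat, B=B_hat: -(A / B) * s  # drive predicted next state to 0
    data += fly(policy)
```

After a few iterations the fitted coefficients recover the true dynamics, because the expert data already covers the part of the state space the autonomous controller visits.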
Learning the dynamics model
Details of the algorithm for learning the dynamics model:
Gravity subtraction [Abbeel, Ganapathi & Ng, 2005]
Lagged criterion [Abbeel & Ng, 2004]
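Gravity subtraction can be illustrated with a minimal 2-D sketch (the toy "aerodynamic" term `k_true * u` and all numbers here are assumptions for illustration): the gravity contribution to body-frame acceleration is known exactly from the vehicle's attitude, so it is subtracted before fitting and the model only has to learn the unknown part.

```python
import numpy as np

# Sketch of gravity subtraction on a hypothetical 2-D toy problem.
g_world = np.array([0.0, -9.81])

def rotation(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def gravity_in_body_frame(theta):
    # World->body rotation is the transpose of body->world.
    return rotation(theta).T @ g_world

# Measured body-frame accelerations at various attitudes theta,
# generated here as gravity plus a simple "aerodynamic" term k*u.
k_true = 2.0
thetas = np.linspace(-1.0, 1.0, 50)
u = np.linspace(0.0, 1.0, 50)  # control input
acc = np.array([gravity_in_body_frame(t) + k_true * np.array([ui, 0.0])
                for t, ui in zip(thetas, u)])

# Subtract the known gravity term, then fit only the residual.
resid = acc - np.array([gravity_in_body_frame(t) for t in thetas])
k_hat = np.linalg.lstsq(u.reshape(-1, 1), resid[:, 0], rcond=None)[0][0]
```

Fitting the residual rather than the raw acceleration keeps the large, attitude-dependent gravity signal from dominating the regression.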
Autonomous nose-in funnel
Autonomous tail-in funnel
Apprenticeship learning: reward
Dynamics Model P_sa + Reward Function R → Reinforcement Learning → Control policy π: max_π E[R(s_0) + … + R(s_T)]

Hard to specify the reward function for complex tasks such as helicopter aerobatics.
Example task: flip
Ideal flip: rotate 360 degrees around the horizontal axis going right to left through the helicopter.
[Figure: eight frames of the flip, annotated with the gravity (g) and thrust (T) vectors at each stage.]
Example task: flip (2)
Specify the flip task as: an idealized trajectory, plus a reward function that penalizes deviation from it.
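A deviation-penalty reward of this kind can be sketched as a quadratic cost around the idealized trajectory (the weight matrix `Q` and the hover target below are hypothetical choices, not the paper's):

```python
import numpy as np

# Sketch of a reward that penalizes deviation from an idealized trajectory:
# at time t, reward is minus the squared weighted distance to the target state.
def make_reward(target_traj, Q):
    def reward(s, t):
        err = s - target_traj[t]
        return -err @ Q @ err  # quadratic penalty, 0 when exactly on target
    return reward

# Toy 2-state example: the target is a constant hover state.
target = np.zeros((10, 2))
Q = np.diag([1.0, 0.1])  # weight position error more than velocity error
R = make_reward(target, Q)
print(R(np.array([2.0, 0.0]), 3))  # -4.0
```

The reward is maximal (zero) exactly on the idealized trajectory and falls off quadratically, which is what gradient-based policy optimizers such as DDP handle well.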
Example of a bad reward function
Apprenticeship learning for the reward function
Our approach: observe the expert’s demonstration of the task, then infer the reward function from the demonstration. [See also Ng & Russell, 2000.]

Algorithm: iterate for t = 1, 2, …
Inverse RL step: estimate the expert’s reward function R(s) = w^T φ(s) such that under R(s) the expert outperforms all previously found policies {π_i}.
RL step: compute the optimal policy π_t for the estimated reward function.
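The two alternating steps can be sketched with the projection variant of this algorithm [Abbeel & Ng, 2004] on a hypothetical toy problem, where each "policy" is one of a few fixed behaviors with known feature expectations μ (the candidate values below are invented for illustration):

```python
import numpy as np

# Sketch of the inverse-RL iteration (projection variant) on a toy problem.
candidates = np.array([[1.0, 0.0],   # feature expectations of candidate policies
                       [0.0, 1.0],
                       [0.6, 0.6]])
mu_expert = np.array([0.6, 0.6])     # expert's feature expectations

def best_response(w):
    """RL step: the policy maximizing reward w . phi is, in this toy,
    simply the candidate with the largest w . mu."""
    return candidates[np.argmax(candidates @ w)]

mu_bar = candidates[0].copy()        # start from an arbitrary policy
for _ in range(10):
    w = mu_expert - mu_bar           # inverse-RL step: a reward under which
                                     # the expert beats the policies found so far
    if np.linalg.norm(w) < 1e-6:
        break                        # expert's performance is matched
    mu = best_response(w)
    d = mu - mu_bar
    mu_bar = mu_bar + (d @ (mu_expert - mu_bar)) / (d @ d) * d  # project toward expert
```

The loop terminates once the mixture of found policies matches the expert's feature expectations, at which point it performs as well as the expert under any reward linear in the features.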
Theoretical Results: Convergence
Theorem. After a number of iterations polynomial in the number of features and the horizon, the algorithm outputs a policy that performs nearly as well as the expert, as evaluated on the unknown reward function R*(s) = w*^T φ(s).
[Abbeel & Ng, 2004]
Overview
Dynamics Model P_sa + Reward Function R → Reinforcement Learning → Control policy π: max_π E[R(s_0) + … + R(s_T)]
Optimal control algorithm
Differential dynamic programming [Jacobson & Mayne, 1970; Anderson & Moore, 1989]: an efficient algorithm to (locally) optimize a policy for continuous state/action spaces.
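The core of each DDP iteration can be sketched as follows: around a nominal trajectory the dynamics are linearized and the cost quadratized, which reduces the backward pass to finite-horizon LQR. The toy 2-state system, weights, and horizon below are hypothetical, not the helicopter model:

```python
import numpy as np

# Sketch of the LQR backward pass at the heart of DDP,
# on a hypothetical discretized double integrator (dt = 0.1).
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
Q = np.eye(2)           # state-deviation penalty
Rc = np.array([[0.1]])  # control penalty
T = 50                  # horizon

P = Q                   # value-function Hessian at the final time
gains = []
for _ in range(T):      # backward Riccati recursion
    K = np.linalg.solve(Rc + B.T @ P @ B, B.T @ P @ A)
    P = Q + A.T @ P @ (A - B @ K)
    gains.append(K)
gains.reverse()         # gains[t] is the feedback gain at time t

# Forward pass: u_t = -K_t x_t drives the deviation toward zero.
x = np.array([[1.0], [0.0]])
for K in gains:
    x = A @ x - B @ (K @ x)
```

Full DDP repeats this linearize-solve-rollout cycle until the nominal trajectory converges; the time-varying gains are what make locally aggressive maneuvers feasible.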
DDP design choices and lessons learned
Simplest reward function: penalize deviation from the target state at each time. Insufficient: the resulting controllers perform very poorly.
Penalizing high-frequency control inputs significantly improves the controllers.
To allow aggressive maneuvering, we use a two-step procedure: make a plan off-line, then penalize high-frequency deviations from the planned inputs.
Penalize integrated orientation error. [See paper for details.]
Process noise has little influence on the controllers’ performance.
Observation noise and delay in observations greatly affect the controllers’ performance.
Autonomous stationary flips
Autonomous stationary rolls
Related work
Bagnell & Schneider, 2001; LaCivita et al., 2006; Ng et al., 2004a; Roberts et al., 2003; Saripalli et al., 2003; Ng et al., 2004b; Gavrilets, Martinos, Mettler and Feron, 2002.
Maneuvers presented here are significantly more difficult than those flown by any other autonomous helicopter.
Conclusion
Apprenticeship learning for the dynamics model avoids explicit exploration in our experiments.
Procedure based on inverse RL for the reward function gives performance similar to human pilots.
Our results significantly extend the state of the art in autonomous helicopter flight: the first autonomous completion of stationary flips and rolls, tail-in funnels, and nose-in funnels.
Acknowledgments
Ben Tse, Garett Oku, Antonio Genova. Mark Woodward, Tim Worley.
Continuous flips