Feature Construction for Inverse Reinforcement Learning
Sergey Levine (Stanford University), Vladlen Koltun (Stanford University), Zoran Popović (University of Washington)



[1] P. Abbeel and A. Y. Ng. Apprenticeship learning via inverse reinforcement learning. In ICML ’04: Proceedings of the 21st International Conference on Machine Learning. ACM, 2004.

[2] A. Y. Ng and S. J. Russell. Algorithms for inverse reinforcement learning. In ICML ’00: Proceedings of the 17th International Conference on Machine Learning, pages 663–670. Morgan Kaufmann Publishers Inc., 2000.

[3] N. D. Ratliff, J. A. Bagnell, and M. A. Zinkevich. Maximum margin planning. In ICML ’06: Proceedings of the 23rd International Conference on Machine Learning, pages 729–736. ACM, 2006.

[4] U. Syed, M. Bowling, and R. E. Schapire. Apprenticeship learning using linear programming. In ICML ’08: Proceedings of the 25th International Conference on Machine Learning, pages 1032–1039. ACM, 2008.

Gridworld transfer comparison: 64×64 gridworld with colored objects placed at random. Component features give distance to the object of a specific color. Many colors are irrelevant. Transfer performance corresponds to learning the reward on one random gridworld and evaluating on 10 others (with random object placement). Comparing FIRL (the proposed algorithm), Abbeel & Ng [1], MMP [3], and LPAL [4]. FIRL outperforms prior methods, which cannot distinguish relevant objects from irrelevant ones.
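For concreteness, below is a minimal sketch of how such color-distance component features could be computed on a gridworld. The use of Manhattan distance, the grid size, and the function name are illustrative assumptions, not taken from the poster:

```python
import numpy as np

def color_distance_features(objects, grid_size=64):
    """For every cell, Manhattan distance to the nearest object of each color
    (one integer-valued component feature per color)."""
    rows, cols = np.mgrid[0:grid_size, 0:grid_size]
    features = {}
    for color, positions in objects.items():
        # Distance from each cell to every object of this color, min over objects.
        d = (np.abs(rows[..., None] - positions[:, 0]) +
             np.abs(cols[..., None] - positions[:, 1]))
        features[color] = d.min(axis=-1)
    return features

# Random placement; only some colors are actually reward-relevant.
rng = np.random.default_rng(0)
objects = {"red": rng.integers(0, 64, size=(5, 2)),
           "blue": rng.integers(0, 64, size=(5, 2))}
delta = color_distance_features(objects)   # delta["red"] has shape (64, 64)
```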

              “Lawful” policies                               “Outlaw” policies
              % mispredicted  feat. expect. dist.  avg. speed  % mispredicted  feat. expect. dist.  avg. speed
Expert            0.0%              0.000            2.410          0.0%              0.000            2.375
FIRL             22.9%              0.025            2.314         24.2%              0.027            2.376
MMP              27.0%              0.111            1.068         27.2%              0.096            1.056
A&N              38.6%              0.202            1.054         39.3%              0.164            1.055
Random           42.7%              0.220            1.053         41.4%              0.184            1.053

Highway driving: the “lawful” policy avoids going fast in the right lane; the “outlaw” policy drives fast, but slows down near police. Features indicate presence of police, current lane, speed, distance to cars, etc. The logical connection between speed and lanes/police cars cannot be captured by linear combinations, and prior methods cannot match the expert’s speed while also matching feature expectations. Videos of the learned policies can be found at: http://graphics.stanford.edu/projects/firl/index.htm.
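One way to see the point about linear combinations: with binary indicator features, a reward defined on their conjunction (e.g. a penalty only when driving fast AND in the right lane) has no exact linear representation. A toy check, using made-up indicators rather than the poster's actual driving features:

```python
import numpy as np

# Binary indicator features: x1 = "driving fast", x2 = "in right lane".
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
# Reward penalizes only the conjunction (fast AND right lane).
r = np.array([0.0, 0.0, 0.0, -1.0])

# Best linear fit w·x + b over all four feature settings.
A = np.hstack([X, np.ones((4, 1))])
w, residuals, *_ = np.linalg.lstsq(A, r, rcond=None)
print("best linear fit:", A @ w)   # cannot hit (0, 0, 0, -1) exactly
print("residual:", residuals)      # strictly positive
```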

Overview: Iteratively construct the feature set Φ and reward R, alternating between an optimization phase that determines a reward, and a fitting phase that determines the features.
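A high-level sketch of this alternation is below. The helper names, the fixed iteration count, and the empty initial feature set are illustrative assumptions; sketches related to each phase appear in the blocks that follow:

```python
def firl_loop(mdp, demos, component_features, optimization_phase, fitting_phase,
              n_iterations=10):
    """Alternate between an optimization phase that produces a reward consistent
    with the demonstrated actions, and a fitting phase that builds a new feature
    set (and regression tree) from that reward."""
    Phi = None            # initial feature set; this initialization choice is illustrative
    reward, tree = None, None
    for _ in range(n_iterations):
        reward = optimization_phase(mdp, demos, Phi)            # reward near span(Phi), fits demos
        Phi, tree = fitting_phase(reward, component_features)   # leaf indicators become new Phi
    return reward, Phi, tree
```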

Optimization Phase: Find a reward R “close” to the current features Φ, under which the examples D are part of the optimal policy. Letting Proj_Φ R denote the closest reward to R that is a linear combination of the features Φ, we find R as:

    min_R ‖R − Proj_Φ R‖²   s.t.   π_R(s) = a   ∀ (s, a) ∈ D

Note that R can “step outside” of the current features to satisfy the examples, if the current features Φ are insufficient.
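The projection term in the objective is ordinary least squares onto the span of the current features; a sketch of just that piece is below. Enforcing the constraint that the demonstrated actions remain optimal under R requires a planner or constrained solver and is omitted; the vector shapes are an assumed tabular encoding, not prescribed by the poster:

```python
import numpy as np

def project_onto_features(R, Phi):
    """Proj_Phi R: the closest reward to R (in the least-squares sense) that is
    a linear combination of the columns of Phi.

    R:   reward vector, e.g. one entry per state.
    Phi: matrix with one row per state and one column per current feature.
    """
    w, *_ = np.linalg.lstsq(Phi, R, rcond=None)
    return Phi @ w

# The optimization phase then minimizes ||R - project_onto_features(R, Phi)||^2
# subject to pi_R(s) = a for every demonstrated pair (s, a) in D.
```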

Fitting Phase: Fit a regression tree to R, with component features δ acting as tests at tree nodes. Indicators for the leaves of the tree are the new features Φ. Only component features that are relevant to the structure of R are selected, and leaves correspond to their logical conjunctions.
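A minimal sketch of such a fitting step, using scikit-learn's DecisionTreeRegressor as a stand-in for the poster's tree construction (the actual splitting criterion and stopping rule may differ):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fitting_phase(R, delta, max_depth=None):
    """R:     length-|S| reward vector from the optimization phase.
    delta: (|S|, d) matrix of component feature values, one column per
           component feature (the tests available at tree nodes).
    Returns (Phi, tree): Phi is an (|S|, #leaves) 0/1 matrix whose columns are
    indicator features for the leaves, i.e. logical conjunctions of the
    component-feature tests along each root-to-leaf path."""
    tree = DecisionTreeRegressor(max_depth=max_depth).fit(delta, R)
    leaf_of_state = tree.apply(delta)                  # leaf index for every state
    leaves = np.unique(leaf_of_state)
    Phi = (leaf_of_state[:, None] == leaves[None, :]).astype(float)
    return Phi, tree
```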

Markov Decision Process: M = {S, A, θ, γ, R}
  S – set of states
  A – set of actions
  θ – state transition probabilities: θ_{sas′} = P(s′ | s, a)
  γ – discount factor
  R – reward function
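An illustrative tabular encoding of this tuple (the poster does not prescribe a representation):

```python
from dataclasses import dataclass
from typing import Optional
import numpy as np

@dataclass
class MDP:
    n_states: int                   # |S|, states indexed 0 .. n_states - 1
    n_actions: int                  # |A|, actions indexed 0 .. n_actions - 1
    theta: np.ndarray               # (|S|, |A|, |S|): theta[s, a, s'] = P(s' | s, a)
    gamma: float                    # discount factor
    R: Optional[np.ndarray] = None  # reward, e.g. shape (|S|, |A|); unknown in IRL
```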

Optimal Policy: denoted π∗, maximizes E[ Σ_{t=0}^{∞} γ^t R(s_t, a_t) | π∗, θ ]

Example Traces: D = {(s_{1,1}, a_{1,1}), ..., (s_{n,T}, a_{n,T})}, where s_{i,t} is the t-th state in the i-th trace, and a_{i,t} is the optimal action in s_{i,t}.
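For completeness, a textbook value-iteration sketch of the optimal policy defined above, plus rollouts of π∗ to collect example traces D. This assumes the tabular encoding from the previous block and an (|S|, |A|) reward; it is standard dynamic programming, not the poster's code:

```python
import numpy as np

def value_iteration(theta, R, gamma, n_iters=500):
    """Optimal policy pi* maximizing E[sum_t gamma^t R(s_t, a_t) | pi*, theta].
    theta: (|S|, |A|, |S|) transitions; R: (|S|, |A|) reward."""
    V = np.zeros(theta.shape[0])
    for _ in range(n_iters):
        V = (R + gamma * (theta @ V)).max(axis=1)   # Bellman backup
    Q = R + gamma * (theta @ V)
    return Q.argmax(axis=1)                          # pi*(s) = argmax_a Q(s, a)

def sample_traces(theta, policy, n_traces, T, seed=0):
    """Example traces D = {(s_{i,t}, a_{i,t})}: T-step rollouts of the policy."""
    rng = np.random.default_rng(seed)
    D = []
    for _ in range(n_traces):
        s = int(rng.integers(theta.shape[0]))
        for _ in range(T):
            a = int(policy[s])
            D.append((s, a))
            s = int(rng.choice(theta.shape[2], p=theta[s, a]))
    return D
```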

Previous Work: most existing algorithms require a set of features Φ to be provided, and find a reward function that is a linear combination of the features [1, 2, 3, 4]. Finding features that are relevant and sufficient is difficult. Furthermore, a linear combination is not always a good estimate for the reward.

Component Features: instead of a complete set of relevant features, our method accepts an exhaustive list of component features δ : S → Z. The algorithm finds a regression tree, with relevant component features acting as tests, to represent the reward.

Goal: given a Markov Decision Process (MDP) M without its reward function R, as well as example traces D from its optimal policy, find R.

Motivations: learning policies from examples, inferring goals, specifying tasks by demonstration.

Challenge: many reward functions R fit the examples, but most will not generalize to unobserved states. Selecting a compact set of features that represents R is difficult.

Solution: construct features to represent R from an exhaustive list of component features, using logical conjunctions of component features represented as a regression tree.

1. Introduction

2. Background

3. Algorithm

4. Illustrated Example

5. Experimental Results

6. References


[Illustrated example figure: regression trees built over successive iterations (each panel starting from “root”), alternating between the Optimization Phase and the Fitting Phase.]