inverse reinforcement learning - peoplecbfinn/_files/bootcamp_inverserl.pdf · apprenticeship...

33
Inverse Reinforcement Learning Chelsea Finn Deep RL Bootcamp

Upload: others

Post on 10-Jun-2020

9 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Inverse Reinforcement Learning - Peoplecbfinn/_files/bootcamp_inverserl.pdf · Apprenticeship Learning via Inverse Reinforcement Learning. Good introduction to inverse reinforcement

Inverse Reinforcement LearningChelsea Finn

Deep RL Bootcamp

Page 2: Inverse Reinforcement Learning - Peoplecbfinn/_files/bootcamp_inverserl.pdf · Apprenticeship Learning via Inverse Reinforcement Learning. Good introduction to inverse reinforcement

Where does the reward come from?Computer Games Real World Scenarios

robotics dialog autonomous driving

what is the reward?often use a proxy

*frequently easier to provide expert data* Approach: infer reward function from roll-outs of expert policy

reward

Mnih et al. ‘15

Page 3: Inverse Reinforcement Learning - Peoplecbfinn/_files/bootcamp_inverserl.pdf · Apprenticeship Learning via Inverse Reinforcement Learning. Good introduction to inverse reinforcement

Why infer the reward?Behavioral Cloning/Direct Imitation: Mimic actions of expert

- but no reasoning about outcomes or dynamics - the expert might have different degrees of freedom

Can we reason about what the expert is trying to achieve?

Page 4: Inverse Reinforcement Learning - Peoplecbfinn/_files/bootcamp_inverserl.pdf · Apprenticeship Learning via Inverse Reinforcement Learning. Good introduction to inverse reinforcement

Inverse Optimal Control / Inverse Reinforcement Learning:infer reward function from demonstrations

(IOC/IRL)

Challenges underdefined problem

difficult to evaluate a learned reward demonstrations may not be precisely optimal

(Kalman ’64, Ng & Russell ’00)

given: - state & action space - Roll-outs from π* - dynamics model (sometimes)

goal: - recover reward function - then use reward to get policy

Page 5: Inverse Reinforcement Learning - Peoplecbfinn/_files/bootcamp_inverserl.pdf · Apprenticeship Learning via Inverse Reinforcement Learning. Good introduction to inverse reinforcement

handle ambiguity using probabilistic model of behavior

(energy-based model for behavior)

Notation:

MaxEnt formulation:

Maximum Entropy Inverse RL(Ziebart et al. ’08)

Page 6: Inverse Reinforcement Learning - Peoplecbfinn/_files/bootcamp_inverserl.pdf · Apprenticeship Learning via Inverse Reinforcement Learning. Good introduction to inverse reinforcement

Maximum Entropy IRL Optimization

Page 7: Inverse Reinforcement Learning - Peoplecbfinn/_files/bootcamp_inverserl.pdf · Apprenticeship Learning via Inverse Reinforcement Learning. Good introduction to inverse reinforcement

Maximum Entropy IRL Optimization{blackboard

Page 8: Inverse Reinforcement Learning - Peoplecbfinn/_files/bootcamp_inverserl.pdf · Apprenticeship Learning via Inverse Reinforcement Learning. Good introduction to inverse reinforcement

Maximum Entropy Inverse RL

Algorithm:

(Ziebart et al. ’08)

Page 9: Inverse Reinforcement Learning - Peoplecbfinn/_files/bootcamp_inverserl.pdf · Apprenticeship Learning via Inverse Reinforcement Learning. Good introduction to inverse reinforcement

How can we: (1) handle unknown dynamics? (2) avoid solving the MDP in the inner loop

sample adaptively to estimate Z

[by constructing a policy]sample to estimate Z

Page 10: Inverse Reinforcement Learning - Peoplecbfinn/_files/bootcamp_inverserl.pdf · Apprenticeship Learning via Inverse Reinforcement Learning. Good introduction to inverse reinforcement

Update reward using samples & demos

generate policy samples from π

update π w.r.t. rewardpolicy π reward r

guided cost learning algorithm

policy π

(Finn et al. ICML ’16)

Page 11: Inverse Reinforcement Learning - Peoplecbfinn/_files/bootcamp_inverserl.pdf · Apprenticeship Learning via Inverse Reinforcement Learning. Good introduction to inverse reinforcement

Update reward using samples & demos

generate policy samples from π

update π w.r.t. reward take 1 policy opt. steppolicy π reward r

policy π

Update reward in inner loop of policy optimization

generator

discriminator

Ho & Ermon, NIPS ‘16

guided cost learning algorithm(Finn et al. ICML ’16)

Page 12: Inverse Reinforcement Learning - Peoplecbfinn/_files/bootcamp_inverserl.pdf · Apprenticeship Learning via Inverse Reinforcement Learning. Good introduction to inverse reinforcement

Aside: Generative Adversarial Networks (Goodfellow et al. ’14)

Similar to inverse RL, GANs learn an objective for generative modeling.

Isola et al. ‘17Arjovsky et al. ‘17

trajectory τ

Zhu et al. ‘17

sample x

policy π~q(τ) generator G

reward r discriminator D

Inverse RL GANs

(Finn*, Christiano*, et al. ’16)

Page 13: Inverse Reinforcement Learning - Peoplecbfinn/_files/bootcamp_inverserl.pdf · Apprenticeship Learning via Inverse Reinforcement Learning. Good introduction to inverse reinforcement

Connection to Generative Adversarial Networks(Goodfellow et al. ’14)

trajectory τ sample xpolicy π~q(τ) generator G

reward r discriminator DInverse RL GANs

(Finn*, Christiano*, et al. ’16)

data distribution p

Reward/discriminator optimization:GCL: GAIL:

some classifier

Both:

Page 14: Inverse Reinforcement Learning - Peoplecbfinn/_files/bootcamp_inverserl.pdf · Apprenticeship Learning via Inverse Reinforcement Learning. Good introduction to inverse reinforcement

Connection to Generative Adversarial Networks(Goodfellow et al. ’14)

trajectory τ sample xpolicy π~q(τ) generator G

reward r discriminator DInverse RL GANs

(Finn*, Christiano*, et al. ’16)

data distribution p

Policy/generator optimization:

Baram et al. ICML ’17: use learned dynamics model to backdrop through discriminator

Unknown dynamics: train generator/policy with RL

*entropy-regularized RL*

Page 15: Inverse Reinforcement Learning - Peoplecbfinn/_files/bootcamp_inverserl.pdf · Apprenticeship Learning via Inverse Reinforcement Learning. Good introduction to inverse reinforcement

Guided Cost Learning Experiments

dish placement pouring almonds

Real-world Tasks

ComparisonRelative Entropy IRL

(Boularias et al. ‘11)

state includes goal plate posestate includes unsupervised

visual features [Finn et al. ’16]

Page 16: Inverse Reinforcement Learning - Peoplecbfinn/_files/bootcamp_inverserl.pdf · Apprenticeship Learning via Inverse Reinforcement Learning. Good introduction to inverse reinforcement

Update reward using samples & demos

generate policy samples from q

reward r

ComparisonsRelative Entropy IRL

(Boularias et al. ‘11)Path Integral IRL

(Kalakrishnan et al. ‘13)

Page 17: Inverse Reinforcement Learning - Peoplecbfinn/_files/bootcamp_inverserl.pdf · Apprenticeship Learning via Inverse Reinforcement Learning. Good introduction to inverse reinforcement

Dish placement, demos

Page 18: Inverse Reinforcement Learning - Peoplecbfinn/_files/bootcamp_inverserl.pdf · Apprenticeship Learning via Inverse Reinforcement Learning. Good introduction to inverse reinforcement

Dish placement, standard reward

Page 19: Inverse Reinforcement Learning - Peoplecbfinn/_files/bootcamp_inverserl.pdf · Apprenticeship Learning via Inverse Reinforcement Learning. Good introduction to inverse reinforcement

Dish placement, RelEnt IRL

• video of dish baseline method

Page 20: Inverse Reinforcement Learning - Peoplecbfinn/_files/bootcamp_inverserl.pdf · Apprenticeship Learning via Inverse Reinforcement Learning. Good introduction to inverse reinforcement

Dish placement, GCL policy

• video of dish our method - samples & reoptimizing

Page 21: Inverse Reinforcement Learning - Peoplecbfinn/_files/bootcamp_inverserl.pdf · Apprenticeship Learning via Inverse Reinforcement Learning. Good introduction to inverse reinforcement

Pouring, demos

• video of pouring demos

Page 22: Inverse Reinforcement Learning - Peoplecbfinn/_files/bootcamp_inverserl.pdf · Apprenticeship Learning via Inverse Reinforcement Learning. Good introduction to inverse reinforcement

Pouring, RelEnt IRL

• video of pouring baseline method

Page 23: Inverse Reinforcement Learning - Peoplecbfinn/_files/bootcamp_inverserl.pdf · Apprenticeship Learning via Inverse Reinforcement Learning. Good introduction to inverse reinforcement

Pouring, GCL policy

• video of pouring our method - samples

Page 24: Inverse Reinforcement Learning - Peoplecbfinn/_files/bootcamp_inverserl.pdf · Apprenticeship Learning via Inverse Reinforcement Learning. Good introduction to inverse reinforcement

Generative Adversarial Imitation Learning Experiments(Ho & Ermon NIPS ’16)

- demonstrations from TRPO-optimized policy - use TRPO as a policy optimizer

Conclusion: IRL requires fewer demonstrations than behavioral cloning

Page 25: Inverse Reinforcement Learning - Peoplecbfinn/_files/bootcamp_inverserl.pdf · Apprenticeship Learning via Inverse Reinforcement Learning. Good introduction to inverse reinforcement

learned behaviors from human motion capture

(Ho & Ermon NIPS ’16)

Merel et al. ‘17

Generative Adversarial Imitation Learning Experiments

walking falling & getting up

Page 26: Inverse Reinforcement Learning - Peoplecbfinn/_files/bootcamp_inverserl.pdf · Apprenticeship Learning via Inverse Reinforcement Learning. Good introduction to inverse reinforcement

GCL & GAIL: Pros & Cons

(with an efficiency policy optimizer)

Page 27: Inverse Reinforcement Learning - Peoplecbfinn/_files/bootcamp_inverserl.pdf · Apprenticeship Learning via Inverse Reinforcement Learning. Good introduction to inverse reinforcement

Acquiring a reward function is important (and challenging!)

Inverse Reinforcement Learning Review

Goal of Inverse RL: infer reward function underlying expert demonstrations

Evaluating the partition function: - initial approaches solve the MDP in the inner loop and/or

assume known dynamics - with unknown dynamics, estimate Z using samples

Connection to generative adversarial networks: - sampling-based MaxEnt IRL is a GAN with a special form of

discriminator and uses RL to optimize the generator

Page 28: Inverse Reinforcement Learning - Peoplecbfinn/_files/bootcamp_inverserl.pdf · Apprenticeship Learning via Inverse Reinforcement Learning. Good introduction to inverse reinforcement

Suggested Reading on Inverse RLClassic Papers: Abbeel & Ng ICML ’04. Apprenticeship Learning via Inverse Reinforcement Learning. Good introduction to inverse reinforcement learning Ziebart et al. AAAI ’08. Maximum Entropy Inverse Reinforcement Learning. Introduction to probabilistic method for inverse reinforcement learning

Modern Papers: Wulfmeier et al. arXiv ’16. Deep Maximum Entropy Inverse Reinforcement Learning. MaxEnt inverse RL using deep reward functions Finn et al. ICML ’16. Guided Cost Learning. Sampling based method for MaxEnt IRL that handles unknown dynamics and deep reward functions Ho & Ermon NIPS ’16. Generative Adversarial Imitation Learning. Inverse RL method using generative adversarial networks

Page 29: Inverse Reinforcement Learning - Peoplecbfinn/_files/bootcamp_inverserl.pdf · Apprenticeship Learning via Inverse Reinforcement Learning. Good introduction to inverse reinforcement

Further Reading on Inverse RL

MaxEnt-based IRL: Ziebart et al. AAAI ’08, Wulfmeier et al. arXiv ’16, Finn et al. ICML ’16 Adversarial IRL: Ho & Ermon NIPS ’16, Finn*, Christiano* et al. arXiv ’16, Baram et al. ICML ’17

Handling multimodality: Li et al. arXiv ’17, Hausman et al. arXiv ’17, Wang, Merel et al. arXiv ’17

Handling domain shift: Stadie et al. ICLR ’17

Page 30: Inverse Reinforcement Learning - Peoplecbfinn/_files/bootcamp_inverserl.pdf · Apprenticeship Learning via Inverse Reinforcement Learning. Good introduction to inverse reinforcement

Questions?

Page 31: Inverse Reinforcement Learning - Peoplecbfinn/_files/bootcamp_inverserl.pdf · Apprenticeship Learning via Inverse Reinforcement Learning. Good introduction to inverse reinforcement
Page 32: Inverse Reinforcement Learning - Peoplecbfinn/_files/bootcamp_inverserl.pdf · Apprenticeship Learning via Inverse Reinforcement Learning. Good introduction to inverse reinforcement

IOC is under-definedneed regularization:• encourage slowly changing cost

• cost of demos decreases strictly monotonically in time

Page 33: Inverse Reinforcement Learning - Peoplecbfinn/_files/bootcamp_inverserl.pdf · Apprenticeship Learning via Inverse Reinforcement Learning. Good introduction to inverse reinforcement

Regularization ablation