
Inverse Reinforcement Learning
CS885 Reinforcement Learning, Module 6: November 9, 2021

Ziebart, B. D., Bagnell, J. A., & Dey, A. K. (2010). Modeling interaction via the principle of maximum causal entropy. In ICML.

Finn, C., Levine, S., & Abbeel, P. (2016). Guided cost learning: Deep inverse optimal control via policy optimization. In ICML (pp. 49-58).

CS885 Fall 2021, Pascal Poupart, University of Waterloo


Reinforcement Learning Problem

[Diagram: the agent selects an action; the environment returns the next state and a reward]

Data: $(s_1, a_1, r_1, s_2, a_2, r_2, \ldots, s_h, a_h, r_h)$
Goal: learn to choose actions that maximize rewards


Imitation Learning

[Diagram: the expert selects the optimal action; the environment returns the next state and a reward]

Data: $(s_1, a_1^*, s_2, a_2^*, \ldots, s_h, a_h^*)$
Goal: learn to choose actions by imitating the expert's actions


Problems

- Imitation learning: supervised learning formulation (a minimal sketch follows this list)
  - Issue #1: the assumption that state-action pairs are independently and identically distributed (i.i.d.) is false: $(s_1, a_1^*) \to (s_2, a_2^*) \to \cdots \to (s_h, a_h^*)$
  - Issue #2: the learnt policy cannot easily be transferred to environments with different dynamics
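For reference, the supervised formulation criticized above can be sketched in a few lines of behaviour cloning (toy data and a generic classifier of my own choosing, not from the lecture); note how it treats the $(s, a^*)$ pairs as i.i.d. samples:

```python
# Behaviour cloning: fit a classifier from states to expert actions,
# treating the (s, a*) pairs as i.i.d. samples (the assumption criticized above).
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical expert data: each row of S is a state feature vector,
# A holds the expert's action index in that state.
S = np.array([[0.0, 1.0], [1.0, 0.5], [0.2, 0.8], [0.9, 0.1]])
A = np.array([0, 1, 0, 1])

policy = LogisticRegression().fit(S, A)        # pi(a|s) as a supervised classifier
print(policy.predict(np.array([[0.1, 0.9]])))  # imitated action in a new state
```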


Inverse Reinforcement Learning (IRL)


RL: rewards $\to$ optimal policy $\pi^*$
IRL: optimal policy $\pi^*$ (expert demonstrations) $\to$ rewards

Benefit: the reward function can easily be transferred to a new environment, where an optimal policy can then be learned


Formal Definition

Reinforcement Learning (RL)
- States: $s \in S$
- Actions: $a \in A$
- Transition: $\Pr(s_t \mid s_{t-1}, a_{t-1})$
- Rewards: $r \in \mathbb{R}$
- Reward model: $\Pr(r_t \mid s_t, a_t)$
- Discount factor: $0 \le \gamma \le 1$
- Horizon (i.e., number of time steps): $h$
- Data: $(s_1, a_1, r_1, s_2, a_2, r_2, \ldots, s_h, a_h, r_h)$
- Goal: find an optimal policy $\pi^*$

Inverse Reinforcement Learning (IRL)
- States: $s \in S$
- Optimal actions: $a^* \in A$
- Transition: $\Pr(s_t \mid s_{t-1}, a_{t-1})$
- Rewards: $r \in \mathbb{R}$
- Reward model: $\Pr(r_t \mid s_t, a_t)$
- Discount factor: $0 \le \gamma \le 1$
- Horizon (i.e., number of time steps): $h$
- Data: $(s_1, a_1^*, s_2, a_2^*, \ldots, s_h, a_h^*)$
- Goal: find $\Pr(r_t \mid s_t, a_t)$ for which the expert actions $a^*$ are optimal


IRL Applications

Advantages:
- No assumption that state-action pairs are i.i.d.
- The reward function can be transferred to new environments/tasks

Example applications: autonomous driving, robotics


IRL Techniques

General approach:
1. Find a reward function for which the expert actions are optimal
2. Use the reward function to optimize a policy in the same or new environments

Broad categories of IRL techniques:
- Feature matching
- Maximum margin IRL
- Maximum entropy IRL
- Bayesian IRL

Pipeline: expert trajectories $(s_1, a_1^*, s_2, a_2^*, \ldots, s_h, a_h^*)$ $\to$ reward $R(s,a)$ or $\Pr(r \mid s, a)$ $\to$ optimal policy $\pi^*$


Feature Expectation Matching

- Normally: find $R$ such that $\pi^*$ chooses the same actions $a^*$ as the expert
- Problem: we may not have enough data for some states (especially continuous states) to properly estimate transitions and rewards
- Note: rewards typically depend on features $\phi_i(s,a)$, e.g., $R(s,a) = \sum_i w_i \phi_i(s,a) = \boldsymbol{w}^\top \boldsymbol{\phi}(s,a)$
- Idea: compute feature expectations and match them


Feature Expectation Matching

Let $\boldsymbol{\mu}_e(s_0) = \frac{1}{N}\sum_{n=1}^{N} \sum_t \gamma^t \boldsymbol{\phi}\big(s_t^{(n)}, a_t^{(n)}\big)$ be the average discounted feature count of the expert $e$ (where $n$ indexes the $N$ demonstration trajectories).

Let $\boldsymbol{\mu}_\pi(s_0)$ be the expected discounted feature count of policy $\pi$.

Claim: if $\boldsymbol{\mu}_\pi(s) = \boldsymbol{\mu}_e(s)\ \forall s$, then $V^\pi(s) = V^e(s)\ \forall s$.
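To make $\boldsymbol{\mu}_e$ concrete, here is a minimal numpy sketch (assumed data layout and feature map, not from the slides) of the empirical discounted feature count:

```python
import numpy as np

def expert_feature_expectation(trajectories, phi, gamma=0.95):
    """Average discounted feature count: mu_e = 1/N sum_n sum_t gamma^t phi(s_t, a_t).

    trajectories: list of trajectories, each a list of (state, action) pairs.
    phi: feature map phi(s, a) -> 1-D numpy array.
    """
    total = None
    for traj in trajectories:
        for t, (s, a) in enumerate(traj):
            f = (gamma ** t) * phi(s, a)
            total = f if total is None else total + f
    return total / len(trajectories)

# Tiny hypothetical example: 2 states, 2 actions, one-hot state-action features.
phi = lambda s, a: np.eye(4)[2 * s + a]
demos = [[(0, 1), (1, 1)], [(0, 1), (1, 0)]]
print(expert_feature_expectation(demos, phi, gamma=0.9))
```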


Proof:
Features: $\boldsymbol{\phi}(s,a) = \big(\phi_1(s,a), \phi_2(s,a), \phi_3(s,a), \ldots\big)^\top$
Linear reward function: $R_{\boldsymbol{w}}(s,a) = \sum_i w_i \phi_i(s,a) = \boldsymbol{w}^\top \boldsymbol{\phi}(s,a)$
Discounted state visitation frequency: $\psi^\pi_{s_0}(s') = \delta(s', s_0) + \gamma \sum_s \psi^\pi_{s_0}(s) \Pr\!\big(s' \mid s, \pi(s)\big)$
Value function: $V^\pi(s) = \sum_{s'} \psi^\pi_s(s')\, R_{\boldsymbol{w}}\!\big(s', \pi(s')\big) = \sum_{s'} \psi^\pi_s(s')\, \boldsymbol{w}^\top \boldsymbol{\phi}\!\big(s', \pi(s')\big) = \boldsymbol{w}^\top \sum_{s'} \psi^\pi_s(s')\, \boldsymbol{\phi}\!\big(s', \pi(s')\big) = \boldsymbol{w}^\top \boldsymbol{\mu}_\pi(s)$
Hence: $\boldsymbol{\mu}_\pi(s) = \boldsymbol{\mu}_e(s) \;\Rightarrow\; \boldsymbol{w}^\top\boldsymbol{\mu}_\pi(s) = \boldsymbol{w}^\top\boldsymbol{\mu}_e(s) \;\Rightarrow\; V^\pi(s) = V^e(s)$
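As a numerical sanity check of this identity on a toy two-state MDP of my own (not part of the slides), the sketch below computes the visitation frequencies in closed form, builds $\boldsymbol{\mu}_\pi$, and compares $\boldsymbol{w}^\top\boldsymbol{\mu}_\pi(s)$ with iterative policy evaluation:

```python
import numpy as np

gamma = 0.9
# Toy MDP under a fixed policy pi: P[s, s'] = Pr(s' | s, pi(s)).
P = np.array([[0.8, 0.2],
              [0.3, 0.7]])
Phi = np.array([[1.0, 0.0],   # rows are phi(s, pi(s)) (one-hot on the state here)
                [0.0, 1.0]])
w = np.array([1.0, -0.5])     # reward weights
R = Phi @ w                   # R_w(s, pi(s)) = w^T phi(s, pi(s))

# Discounted visitation frequencies: Psi = I + gamma * Psi @ P  =>  Psi = (I - gamma P)^{-1}
Psi = np.linalg.inv(np.eye(2) - gamma * P)
Mu = Psi @ Phi                # mu_pi(s) = sum_s' psi_s(s') phi(s', pi(s'))

# Value function by iterative policy evaluation: V = R + gamma * P @ V
V = np.zeros(2)
for _ in range(1000):
    V = R + gamma * P @ V

print(Mu @ w)   # w^T mu_pi(s) for each s
print(V)        # matches (up to iteration tolerance)
```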


Indeterminacy of Rewards

- Learning $R_{\boldsymbol{w}}(s,a) = \boldsymbol{w}^\top\boldsymbol{\phi}(s,a)$ amounts to learning $\boldsymbol{w}$
- When $\boldsymbol{\mu}_\pi(s) = \boldsymbol{\mu}_e(s)$, then $V^\pi(s) = V^e(s)$, but $\boldsymbol{w}$ can be anything, since $\boldsymbol{\mu}_\pi(s) = \boldsymbol{\mu}_e(s) \Rightarrow \boldsymbol{w}^\top\boldsymbol{\mu}_\pi(s) = \boldsymbol{w}^\top\boldsymbol{\mu}_e(s)\ \forall \boldsymbol{w}$
- We need a bias to determine $\boldsymbol{w}$
- Ideas:
  - Maximize the margin
  - Maximize entropy


Maximum Margin IRL

- Idea: select the reward function that yields the greatest minimum difference (margin) between the Q-values of the expert actions and those of the other actions:

$\text{margin} = \min_s \Big[\, Q(s, a^*) - \max_{a \neq a^*} Q(s, a) \,\Big]$


Maximum Margin IRL

Let $\boldsymbol{\mu}_\pi(s,a) = \boldsymbol{\phi}(s,a) + \gamma \sum_{s'} \Pr(s' \mid s, a)\, \boldsymbol{\mu}_\pi(s')$
Then $Q^\pi(s,a) = \boldsymbol{w}^\top \boldsymbol{\mu}_\pi(s,a)$

Find the $\boldsymbol{w}^*$ that maximizes the margin:

$\boldsymbol{w}^* = \arg\max_{\boldsymbol{w}} \min_s \Big[\, \boldsymbol{w}^\top \boldsymbol{\mu}_\pi(s, a^*) - \max_{a \neq a^*} \boldsymbol{w}^\top \boldsymbol{\mu}_\pi(s, a) \,\Big]$
subject to $\boldsymbol{\mu}_\pi(s,a) = \boldsymbol{\mu}_e(s,a)\ \forall s, a$

Problem: maximizing the margin is somewhat arbitrary, since it does not allow suboptimal actions to have values that are close to optimal.


Maximum Entropy

Idea: among models that match the expert's average features, select the model with maximum entropy:

$\max_{P(\tau)} H\big(P(\tau)\big)$
subject to $\frac{1}{|data|}\sum_{\tau \in data} \boldsymbol{\phi}(\tau) = E_{P}[\boldsymbol{\phi}(\tau)]$

Trajectory: $\tau = (s_1^\tau, a_1^\tau, s_2^\tau, a_2^\tau, \ldots, s_h^\tau, a_h^\tau)$
Trajectory feature vector: $\boldsymbol{\phi}(\tau) = \sum_t \gamma^t \boldsymbol{\phi}(s_t^\tau, a_t^\tau)$
Trajectory cumulative reward: $R(\tau) = \boldsymbol{w}^\top\boldsymbol{\phi}(\tau) = \sum_t \gamma^t \boldsymbol{w}^\top\boldsymbol{\phi}(s_t^\tau, a_t^\tau)$
Probability of a trajectory: $P_{\boldsymbol{w}}(\tau) = \frac{e^{R(\tau)}}{\sum_{\tau'} e^{R(\tau')}} = \frac{e^{\boldsymbol{w}^\top\boldsymbol{\phi}(\tau)}}{\sum_{\tau'} e^{\boldsymbol{w}^\top\boldsymbol{\phi}(\tau')}}$
Entropy: $H\big(P(\tau)\big) = -\sum_\tau P(\tau)\log P(\tau)$
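For intuition, a small numpy sketch (hypothetical trajectory features, not from the slides) that computes $P_{\boldsymbol{w}}(\tau)$ as a softmax over trajectory rewards and reports its entropy:

```python
import numpy as np

def traj_distribution(w, traj_features):
    """P_w(tau) = exp(w^T phi(tau)) / sum_tau' exp(w^T phi(tau')) over an enumerable set."""
    scores = traj_features @ w        # R_w(tau) = w^T phi(tau)
    scores -= scores.max()            # shift for numerical stability
    p = np.exp(scores)
    return p / p.sum()

# Hypothetical discounted feature vectors phi(tau) for 3 enumerable trajectories.
Phi_tau = np.array([[1.0, 0.0],
                    [0.5, 0.5],
                    [0.0, 1.0]])
w = np.array([2.0, -1.0])
P = traj_distribution(w, Phi_tau)
print(P, -(P * np.log(P)).sum())      # the distribution and its entropy H(P)
```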


Maximum Likelihood

Maximum entropy (primal):
$\max_{P(\tau)} H\big(P(\tau)\big)$
subject to $\frac{1}{|data|}\sum_{\tau\in data}\boldsymbol{\phi}(\tau) = E_P[\boldsymbol{\phi}(\tau)]$

Dual objective: this is equivalent to maximizing the log likelihood of the trajectories under the constraint that $P(\tau)$ takes an exponential form:

$\max_{\boldsymbol{w}} \sum_{\tau\in data}\log P_{\boldsymbol{w}}(\tau)$
subject to $P_{\boldsymbol{w}}(\tau) \propto e^{\boldsymbol{w}^\top\boldsymbol{\phi}(\tau)}$


Maximum Log Likelihood (LL)

$\boldsymbol{w}^* = \arg\max_{\boldsymbol{w}} \frac{1}{|data|}\sum_{\tau\in data}\log P_{\boldsymbol{w}}(\tau)$
$\quad\;\;\, = \arg\max_{\boldsymbol{w}} \frac{1}{|data|}\sum_{\tau\in data}\log \frac{e^{\boldsymbol{w}^\top\boldsymbol{\phi}(\tau)}}{\sum_{\tau'} e^{\boldsymbol{w}^\top\boldsymbol{\phi}(\tau')}}$
$\quad\;\;\, = \arg\max_{\boldsymbol{w}} \frac{1}{|data|}\sum_{\tau\in data}\boldsymbol{w}^\top\boldsymbol{\phi}(\tau) - \log\sum_{\tau'} e^{\boldsymbol{w}^\top\boldsymbol{\phi}(\tau')}$

Gradient:
$\nabla_{\boldsymbol{w}} LL = \frac{1}{|data|}\sum_{\tau\in data}\boldsymbol{\phi}(\tau) - \sum_{\tau''}\frac{e^{\boldsymbol{w}^\top\boldsymbol{\phi}(\tau'')}}{\sum_{\tau'} e^{\boldsymbol{w}^\top\boldsymbol{\phi}(\tau')}}\,\boldsymbol{\phi}(\tau'')$
$\qquad\quad\; = \frac{1}{|data|}\sum_{\tau\in data}\boldsymbol{\phi}(\tau) - \sum_{\tau''} P_{\boldsymbol{w}}(\tau'')\,\boldsymbol{\phi}(\tau'')$
$\qquad\quad\; = E_{data}[\boldsymbol{\phi}(\tau)] - E_{\boldsymbol{w}}[\boldsymbol{\phi}(\tau)]$
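When the trajectory set is small enough to enumerate, this gradient can be computed exactly; a minimal numpy sketch (toy feature vectors of my own) that runs plain gradient ascent on the log likelihood:

```python
import numpy as np

def loglik_grad(w, expert_features, all_features):
    """grad LL = E_data[phi(tau)] - E_w[phi(tau)], enumerating all trajectories."""
    scores = all_features @ w
    scores -= scores.max()
    P = np.exp(scores) / np.exp(scores).sum()        # P_w(tau) over all trajectories
    return expert_features.mean(axis=0) - P @ all_features

all_phi = np.array([[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]])   # enumerable phi(tau)
expert_phi = all_phi[[0, 0, 1]]                            # features of demonstrated taus
w = np.zeros(2)
for _ in range(200):                                       # simple gradient ascent on LL
    w += 0.5 * loglik_grad(w, expert_phi, all_phi)
print(w)
```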


Gradient Estimation

Computing $E_{\boldsymbol{w}}[\boldsymbol{\phi}(\tau)]$ exactly is intractable due to the exponential number of trajectories. Instead, approximate it by sampling:

$E_{\boldsymbol{w}}[\boldsymbol{\phi}(\tau)] \approx \frac{1}{n}\sum_{\tau\sim P_{\boldsymbol{w}}(\tau)} \boldsymbol{\phi}(\tau)$

Importance sampling: since we do not have a simple way of sampling $\tau$ from $P_{\boldsymbol{w}}(\tau)$, sample $\tau$ from a base distribution $q(\tau)$ and then reweight each $\tau$ by $P_{\boldsymbol{w}}(\tau)/q(\tau)$:

$E_{\boldsymbol{w}}[\boldsymbol{\phi}(\tau)] \approx \frac{1}{n}\sum_{\tau\sim q(\tau)} \frac{P_{\boldsymbol{w}}(\tau)}{q(\tau)}\,\boldsymbol{\phi}(\tau)$

We can choose $q(\tau)$ to be (a) uniform, (b) close to the demonstration distribution, or (c) close to $P_{\boldsymbol{w}}(\tau)$.
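A hedged numpy sketch of the importance-sampling estimate (my own arrays; since $P_{\boldsymbol{w}}$ is only known up to its partition function, the weights are normalized to sum to one, a common practical variant):

```python
import numpy as np

def model_feature_expectation(w, sampled_features, q_probs):
    """Self-normalized importance sampling estimate of E_w[phi(tau)].

    sampled_features: phi(tau) for trajectories tau ~ q (one row each).
    q_probs: q(tau) for those samples. The weights exp(w^T phi)/q are
    normalized because the partition function of P_w is unknown.
    """
    logw = sampled_features @ w - np.log(q_probs)
    logw -= logw.max()
    weights = np.exp(logw)
    weights /= weights.sum()
    return weights @ sampled_features

rng = np.random.default_rng(0)
samples = rng.normal(size=(100, 2))      # hypothetical phi(tau) for tau ~ q
q = np.full(100, 1.0 / 100)              # pretend q is uniform over the samples
print(model_feature_expectation(np.array([0.3, -0.2]), samples, q))
```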


Maximum Entropy IRL Pseudocode

Assumption: linear rewards $R_{\boldsymbol{w}}(s,a) = \boldsymbol{w}^\top\boldsymbol{\phi}(s,a)$

Input: expert trajectories $\tau_e \sim \pi_{expert}$, where $\tau_e = (s_1, a_1, s_2, a_2, \ldots)$
Initialize weights $\boldsymbol{w}$ at random
Repeat until stopping criterion:
    Expert feature expectation: $E_{\pi_{expert}}[\boldsymbol{\phi}(\tau)] = \frac{1}{|data|}\sum_{\tau_e\in data}\boldsymbol{\phi}(\tau_e)$
    Model feature expectation:
        Sample $n$ trajectories $\tau \sim q(\tau)$
        $E_{\boldsymbol{w}}[\boldsymbol{\phi}(\tau)] = \frac{1}{n}\sum_{\tau}\frac{P_{\boldsymbol{w}}(\tau)}{q(\tau)}\,\boldsymbol{\phi}(\tau)$
    Gradient: $\nabla_{\boldsymbol{w}} LL = E_{\pi_{expert}}[\boldsymbol{\phi}(\tau)] - E_{\boldsymbol{w}}[\boldsymbol{\phi}(\tau)]$
    Update model: $\boldsymbol{w} \leftarrow \boldsymbol{w} + \alpha\,\nabla_{\boldsymbol{w}} LL$
Return $\boldsymbol{w}$
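A compact end-to-end sketch of this pseudocode under the linear-reward assumption (toy feature vectors, a uniform proposal $q$, and self-normalized importance weights; all names are illustrative, not from the lecture):

```python
import numpy as np

def maxent_irl(expert_phi, sample_phi, q_probs, lr=0.1, iters=500):
    """Linear MaxEnt IRL: gradient ascent on LL with an importance-sampled model term."""
    w = np.zeros(expert_phi.shape[1])
    mu_expert = expert_phi.mean(axis=0)              # E_expert[phi(tau)]
    for _ in range(iters):
        logwts = sample_phi @ w - np.log(q_probs)    # unnormalized log(P_w / q)
        logwts -= logwts.max()
        wts = np.exp(logwts)
        wts /= wts.sum()
        mu_model = wts @ sample_phi                  # approximate E_w[phi(tau)]
        w += lr * (mu_expert - mu_model)             # gradient of LL
    return w

expert_phi = np.array([[1.0, 0.1], [0.9, 0.2]])      # toy expert trajectory features
sample_phi = np.array([[1.0, 0.1], [0.9, 0.2], [0.2, 0.9], [0.1, 1.0]])
q_probs = np.full(len(sample_phi), 1.0 / len(sample_phi))
print(maxent_irl(expert_phi, sample_phi, q_probs))
```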


Non-Linear Rewards

Suppose the rewards are non-linear in $\boldsymbol{w}$, e.g., $R_{\boldsymbol{w}}(s,a) = \mathrm{NeuralNet}_{\boldsymbol{w}}(s,a)$.

Then $R_{\boldsymbol{w}}(\tau) = \sum_t \gamma^t R_{\boldsymbol{w}}(s_t^\tau, a_t^\tau)$

Log likelihood: $LL(\boldsymbol{w}) = \frac{1}{|data|}\sum_{\tau\in data} R_{\boldsymbol{w}}(\tau) - \log\sum_{\tau'} e^{R_{\boldsymbol{w}}(\tau')}$

Gradient: $\nabla_{\boldsymbol{w}} LL = E_{data}[\nabla_{\boldsymbol{w}} R_{\boldsymbol{w}}(\tau)] - E_{\boldsymbol{w}}[\nabla_{\boldsymbol{w}} R_{\boldsymbol{w}}(\tau)]$
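Where the trajectory set is small enough to enumerate, this gradient can be obtained by automatic differentiation rather than coded by hand. Below is a minimal PyTorch sketch (my own illustration under assumed toy data; the network, sizes, and hyperparameters are arbitrary): the loss is $-LL(\boldsymbol{w})$ written with logsumexp, so backpropagation yields exactly $E_{\boldsymbol{w}}[\nabla_{\boldsymbol{w}} R_{\boldsymbol{w}}(\tau)] - E_{data}[\nabla_{\boldsymbol{w}} R_{\boldsymbol{w}}(\tau)]$.

```python
import torch

# Small reward network R_w(s, a); here the per-step "features" are just raw (s, a) vectors.
reward_net = torch.nn.Sequential(
    torch.nn.Linear(2, 16), torch.nn.Tanh(), torch.nn.Linear(16, 1))

def traj_reward(step_feats, gamma=0.9):
    # R_w(tau) = sum_t gamma^t R_w(s_t, a_t); step_feats holds per-step (s, a) features.
    gammas = torch.tensor([gamma ** t for t in range(step_feats.shape[0])])
    return (gammas * reward_net(step_feats).squeeze(-1)).sum()

# Hypothetical per-step features for expert demonstrations and for all enumerable trajectories.
expert_trajs = [torch.rand(5, 2) for _ in range(3)]
all_trajs = expert_trajs + [torch.rand(5, 2) for _ in range(7)]

opt = torch.optim.Adam(reward_net.parameters(), lr=1e-2)
for _ in range(200):
    ll = (torch.stack([traj_reward(t) for t in expert_trajs]).mean()
          - torch.logsumexp(torch.stack([traj_reward(t) for t in all_trajs]), dim=0))
    opt.zero_grad()
    (-ll).backward()   # autograd of -LL gives E_w[grad R] - E_data[grad R]
    opt.step()
```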


Maximum Entropy IRL Pseudocode

General case: non-linear rewards $R_{\boldsymbol{w}}(s,a)$

Input: expert trajectories $\tau_e \sim \pi_{expert}$, where $\tau_e = (s_1, a_1, s_2, a_2, \ldots)$
Initialize weights $\boldsymbol{w}$ at random
Repeat until stopping criterion:
    Expert feature expectation: $E_{\pi_{expert}}[\nabla_{\boldsymbol{w}} R_{\boldsymbol{w}}(\tau)] = \frac{1}{|data|}\sum_{\tau_e\in data}\nabla_{\boldsymbol{w}} R_{\boldsymbol{w}}(\tau_e)$
    Model feature expectation:
        Sample $n$ trajectories $\tau \sim q(\tau)$
        $E_{\boldsymbol{w}}[\nabla_{\boldsymbol{w}} R_{\boldsymbol{w}}(\tau)] = \frac{1}{n}\sum_{\tau}\frac{P_{\boldsymbol{w}}(\tau)}{q(\tau)}\,\nabla_{\boldsymbol{w}} R_{\boldsymbol{w}}(\tau)$
    Gradient: $\nabla_{\boldsymbol{w}} LL = E_{\pi_{expert}}[\nabla_{\boldsymbol{w}} R_{\boldsymbol{w}}(\tau)] - E_{\boldsymbol{w}}[\nabla_{\boldsymbol{w}} R_{\boldsymbol{w}}(\tau)]$
    Update model: $\boldsymbol{w} \leftarrow \boldsymbol{w} + \alpha\,\nabla_{\boldsymbol{w}} LL$
Return $\boldsymbol{w}$


Policy Computation

Two choices:
1. Optimize a policy based on $R_{\boldsymbol{w}}(s,a)$ with your favorite RL algorithm
2. Compute the policy induced by $P_{\boldsymbol{w}}(\tau)$

Induced policy: the probability of choosing $a$ after $s$ in trajectories. Let $(s, a, \tau)$ be a trajectory that starts with $(s, a)$ and then continues with the state-action pairs of $\tau$:

$\pi_{\boldsymbol{w}}(a \mid s) = P_{\boldsymbol{w}}(a \mid s) = \frac{\sum_\tau P_{\boldsymbol{w}}(s, a, \tau)}{\sum_{a', \tau'} P_{\boldsymbol{w}}(s, a', \tau')} = \frac{\sum_\tau e^{R_{\boldsymbol{w}}(s, a) + \gamma R_{\boldsymbol{w}}(\tau)}}{\sum_{a', \tau'} e^{R_{\boldsymbol{w}}(s, a') + \gamma R_{\boldsymbol{w}}(\tau')}}$
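A small sketch of the induced-policy formula for a single state with an enumerable set of continuations (hypothetical rewards, not from the slides):

```python
import numpy as np

def induced_policy(R_sa, R_tau, gamma=0.9):
    """pi_w(a|s) from enumerable continuations.

    R_sa[a]: immediate reward R_w(s, a) for each action a in state s.
    R_tau[a]: cumulative rewards R_w(tau) of the continuations tau after (s, a).
    """
    # Unnormalized mass of action a: sum_tau exp(R_w(s, a) + gamma * R_w(tau))
    mass = np.array([np.exp(R_sa[a] + gamma * np.asarray(R_tau[a])).sum()
                     for a in range(len(R_sa))])
    return mass / mass.sum()

# Hypothetical: 2 actions in state s, each with a few enumerated continuations.
print(induced_policy(R_sa=[1.0, 0.5],
                     R_tau=[[0.2, 0.4], [0.3, 0.9]]))
```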


Demo: Maximum Entropy IRL

Finn, C., Levine, S., & Abbeel, P. (2016). Guided cost learning: Deep inverse optimal control via policy optimization. ICML.
