arXiv:2102.02872v2 [cs.LG] 11 Feb 2021

Feedback in Imitation Learning: The Three Regimes of Covariate Shift

Jonathan Spencer 1 Sanjiban Choudhury 2 Arun Venkatraman 2 Brian Ziebart 2 J. Andrew Bagnell 2 3

Abstract

Imitation learning practitioners have often noted that conditioning policies on previous actions leads to a dramatic divergence between “held out” error and the performance of the learner in situ. Interactive approaches (Ross et al., 2011) can provably address this divergence but require repeated querying of a demonstrator. Recent work identifies this divergence as stemming from a “causal confound” (de Haan et al., 2019; Wen et al., 2020) in predicting the current action, and seeks to ablate causal aspects of the current state using tools from causal inference. In this work, we argue instead that this divergence is simply another manifestation of covariate shift, exacerbated particularly by settings of feedback between decisions and input features. The learner often comes to rely on features that are strongly predictive of decisions, but are subject to strong covariate shift.

Our work demonstrates a broad class of problems where this shift can be mitigated, both theoretically and practically, by taking advantage of a simulator but without any further querying of the expert demonstrator. We analyze existing benchmarks used to test imitation learning approaches and find that these benchmarks are realizable and simple, and thus insufficient for capturing the harder regimes of error compounding seen in real-world decision making problems. We find, in a surprising contrast with previous literature, but consistent with our theory, that naive behavioral cloning provides excellent results. We detail the need for new standardized benchmarks that capture the phenomena seen in robotics problems.

1. Introduction

Imitation learning (IL) is a method of training robot controllers using demonstrations from human experts.

¹Princeton ²Aurora Innovation ³Carnegie Mellon. Correspondence to: S. Choudhury <[email protected]>.

Figure 1. A common example of feedback-driven covariate shift in self-driving. At train time, the robot learns that the previous action (BRAKE) accurately predicts the current action almost all the time. At test time, when the learner mistakenly chooses to BRAKE, it continues to choose BRAKE, creating a bad feedback cycle that causes it to diverge from the expert.

In contrast to, e.g., reinforcement learning (RL), IL controllers can learn to perform complex control tasks “offline”, without interaction in the environment, and achieve remarkable proficiency with relatively few demonstrations. This speedup comes at a cost, however, since offline learning in robotic control introduces feedback artifacts which must be accounted for.

Consider the problem of training a self-driving car to stop at a yellow light, as depicted in Figure 1. When an expert human driver decides whether to THROTTLE or BRAKE, they may pay attention to the timing of the light, pedestrians, or their previous action. A robot trained to naively imitate this demonstration, however, will latch onto the previous action, as it predicts the current action almost all the time. At test time, when the learner inevitably makes a mistake, this mistake feeds into the features for the next timestep. This starts a vicious cycle of errors, as the learner diverges from the expert, not having learned how to recover.

This driving scenario is a classic example of a feedback loop (Bagnell, 2016; Sculley et al., 2014) between the learner's actions and the input features the learner sees. This effect results in a covariate shift between the training data and the input data the learner encounters when it executes its own policy. This shift often leads to a dramatic divergence between “held out” error and the performance of the learner actually executing the policy in situ.

Feedback-driven covariate shift in the context of imitation learning was first observed by (Pomerleau, 1989) and has recently resurfaced as an often recurring problem in self-driving (Kuefler et al., 2017; Bansal et al., 2018; Codevilla et al., 2019). Imitation learner error in settings with explicit feedback is sometimes attributed to a causal confound (de Haan et al., 2019); however, we argue that it is simply a more visible case of feedback-driven covariate shift.

A common approach to address covariate shift in imitation learning is to interactively query an expert in states that the learner visits (Ross et al., 2011; Ross, 2013) using the DAGGER algorithm. Consider a class of learner policies π that can achieve a one-time-step classification error of ε with respect to the expert's chosen action. Whereas naive behavioral cloning (BC) suffers a classification regret of O(T²ε) that grows quadratically in the horizon T, interactive imitation learning approaches like DAGGER guarantee the optimal O(Tε) regret even in the worst case. However, interactively querying experts is impractical in many cases where human experts are unable to provide feedback online.

Although interactive methods are provably necessary in the worst case, we posit that there are many problems with more modest demands in a “Goldilocks Regime” of moderate difficulty that can be solved without the need for an interactive expert. We propose solving those problems in a modified imitation learning setting where we have access to a fixed set of cached expert demonstrations as well as generative access to a high-fidelity simulator of the environment. This is similar to inverse reinforcement learning settings that benefit from environment/simulator access during the process of training (Abbeel & Ng, 2004; Ziebart et al., 2008; Ho & Ermon, 2016; Finn et al., 2016). In this setting, we introduce a novel class of algorithms that provably address feedback-driven covariate shift.

Our key insight is that, as long as the expert demonstrator visits all states that the learner will visit, i.e., the density ratio between the learner and the expert is bounded, P_learner(s)/P_expert(s) ≤ C, we can correct for the covariate shift and enjoy a regret of just O(Tε) using only a simulator and cached demonstrations.

Our contributions are the following:

1. We introduce a general framework for mitigating covariate shift using cached expert demonstrations, called ALICE (Aggregate Losses to Imitate Cached Experts).

2. We propose a family of loss functions within that framework and analyze their consistency.

3. We identify specific characteristics needed for better benchmarks in imitation learning.

2. Preliminaries: The Feedback Problem

For simplicity, we consider an episodic finite-horizon Markov Decision Process (MDP) $\langle S, A, C, P, \rho, T\rangle$, where $S$ is a set of states, $A$ is a set of actions, $C: S \to [0,1]$ is a (potentially unobserved) state-dependent cost function, $P: S \times A \to \Delta(S)$ is the transition dynamics, $\rho \in \Delta(S)$ is the initial distribution over states, and $T \in \mathbb{N}^+$ is the time horizon. Given a policy $\pi: S \to \Delta(A)$, let $\rho_\pi^t$ denote the distribution over states at time $t$ following $\pi$, and let $\rho_\pi$ be the average state distribution $\rho_\pi = \frac{1}{T}\sum_{t=1}^{T} \rho_\pi^t$. Let $Q^\pi$ be the cost-to-go for selecting $a$ at $s$ and following $\pi$ thereafter, $Q_t^\pi(s,a) = C(s) + \mathbb{E}_{s'\sim P(s,a)}[C(s')] + \sum_{t'=t+1}^{T-1} \mathbb{E}_{s''\sim P(s_{t'}, \pi(s_{t'}))}[C(s'')]$. Let the (dis)-advantage function $A^\pi$ be the difference in cost-to-go between selecting action $a$ and selecting action $\pi(s)$, $A^\pi(s,a) = Q^\pi(s,a) - Q^\pi(s, \pi(s))$. The total cost of executing policy $\pi$ for $T$ steps is $J(\pi) = \sum_{t=1}^{T} \mathbb{E}_{s_t\sim\rho_\pi^t}[C(s_t)] = T\,\mathbb{E}_{s\sim\rho_\pi}[C(s)]$.

In imitation learning, we do not observe $C(s)$. Instead, we observe only cached expert demonstrations $D_{exp} = \{(s_t^*, a_t^*)\}$ generated a priori by the expert $\pi^*$, i.e., $s_t^* \sim \rho_{\pi^*}^t$, $a_t^* \sim \pi^*(\cdot|s_t^*)$. The cost $C(s)$ can be constructed as a “mismatch” loss relative to the expert (i.e., 0-1 classification loss for discrete actions or squared error in state or action for continuous spaces). Whether $C(s)$ is set by the MDP or relative to the expert, the analysis is the same, and the goal is to train a policy $\pi$ that minimizes the performance difference $J(\pi) - J(\pi^*)$ via imitation. When $C(s)$ is a loss relative to the expert, minimizing the performance difference is equivalent to minimizing on-policy expert mismatch:

$$J(\pi) - J(\pi^*) = T\,\mathbb{E}_{s\sim\rho_\pi}\left[\mathbb{1}_{\pi(s)\neq\pi^*(s)}\right]$$

Feedback Drives Covariate Shift  The traditional approach to imitation learning, Behavior Cloning (BC) (Pomerleau, 1989), simply trains a policy $\pi$ that correctly classifies the expert actions on a cached set of demonstrations:

$$\pi = \arg\min_{\pi\in\Pi}\ \mathbb{E}_{(s^*,a^*)\sim D_{exp}}\left[\ell_{cs}(\pi(s^*), a^*)\right] \quad (1)$$

where $\ell_{cs}$ is a classification loss. Here, the best we could hope for is to drive down the training error $\mathbb{E}_{(s^*,a^*)\sim D_{exp}}[\ell_{cs}(\pi(s^*), a^*)] \leq \epsilon$ and ensure a similar validation error on held-out data. This is perfectly reasonable in the supervised learning setting, where low held-out error implies low test error.
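As a concrete illustration, Eq. (1) is ordinary supervised classification on the cached demonstrations. The sketch below is our own minimal example (assuming PyTorch and illustrative names such as `expert_states` and `expert_actions`; it is not the paper's implementation):

```python
import torch
import torch.nn as nn

def behavior_cloning(expert_states, expert_actions, n_actions,
                     epochs=100, lr=1e-3):
    """Minimal BC, Eq. (1): fit a policy to cached (state, action) pairs.

    expert_states : float tensor (N, d_s), states drawn from rho_{pi*}
    expert_actions: long tensor (N,), expert action labels
    """
    d_s = expert_states.shape[1]
    policy = nn.Sequential(nn.Linear(d_s, 64), nn.ReLU(),
                           nn.Linear(64, n_actions))
    opt = torch.optim.Adam(policy.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()  # l_cs: a surrogate for the 0-1 loss
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(policy(expert_states), expert_actions)
        loss.backward()
        opt.step()
    # Caveat: a low held-out loss only certifies error under rho_{pi*},
    # not under the state distribution the learned policy itself induces.
    return policy
```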

However, sequential decision making tasks contain inherent feedback, where the past action a_{t-1} affects the input to our learner's decision making at time t. As shown in the graphical model in Fig. 2, this can happen through multiple channels.

Figure 2. Inherent feedback in sequential decision making tasks. Past action a_{t-1} affects current action a_t, either indirectly (blue) via MDP dynamics or directly (green) via explicit conditioning.

It can happen indirectly through the MDP dynamics, where a_{t-1} affects s_t, and it can also happen explicitly when π is directly conditioned on previous actions, i.e., π(a_t | s_t, a_{t-1}). In fact, dependence on past actions is often essential to ensure model smoothness or hysteresis.

A commonly observed phenomenon (Pomerleau, 1989; Ross & Bagnell, 2010; Brantley et al., 2019) is that such feedback loops drive covariate shift, where the states experienced by the learner, s_t ∼ ρ_π^t, diverge significantly from D_exp. Although the learner begins by making an error of just ε, those errors drive it into novel states rarely demonstrated by the expert, where it is even more likely to make mistakes. Distributional shift and compounding errors result in a total error of O(T²ε). This is formalized below:

Theorem 1 (Theorem 2.1 in (Ross et al., 2011)). Let $\mathbb{E}_{s\sim\rho_{\pi^*}}[\ell_{cs}(\pi(s), \pi^*(s))] \leq \epsilon$ be the bounded training error under the expert's state distribution, where $\ell_{cs}$ is the 0-1 loss (or an upper bound on it). We have $J(\pi) \leq J(\pi^*) + T^2\epsilon$.
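To make the compounding concrete, the following toy simulation (our hypothetical illustration, not an experiment from the paper) assumes the learner errs with probability ε on expert-covered states and, once it drifts off-distribution where no demonstrations exist, errs with an assumed higher probability; the cumulative mismatch then grows much faster than the supervised error Tε:

```python
import random

def rollout_mismatch(T, eps, off_dist_err=0.5, trials=20000):
    """Average number of learner/expert mismatches over a T-step rollout.

    On expert-covered states the learner errs w.p. eps; after its first
    mistake it is off-distribution and errs w.p. off_dist_err (assumed)."""
    total = 0.0
    for _ in range(trials):
        on_distribution = True
        for _ in range(T):
            p_err = eps if on_distribution else off_dist_err
            if random.random() < p_err:
                total += 1.0
                on_distribution = False  # one mistake pushes us off-distribution
    return total / trials

if __name__ == "__main__":
    for T in (10, 50, 100):
        eps = 1.0 / T  # supervised error budget T*eps held fixed at 1
        print(T, round(rollout_mismatch(T, eps), 2))
```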

(Ross et al., 2011) go on to show that it is possible to achieve the ideal O(Tε)¹ error by interactively querying the expert on states visited by the learner, s ∼ ρ_π, and minimizing the same classification loss ℓ_cs.

Theorem 2 (Theorem 2.2 in (Ross et al., 2011)). Let $\mathbb{E}_{s\sim\rho_\pi}[\ell_{cs}(\pi(s), \pi^*(s))] \leq \epsilon$ be the bounded on-policy training error, and let $A^{\pi^*}(s,a) \leq u$ be the bounded (dis)-advantage w.r.t. the expert for all $s, a$. We have $J(\pi) \leq J(\pi^*) + uT\epsilon$.

However, querying the expert online is impractical in many settings (Laskey et al., 2017), requiring a human to provide cognitively challenging low-latency open-loop controls. This inspires a search for methods which achieve similar bounds without the need for an interactive demonstrator, asking the following question:

Problem 1. Can we achieve O(Tε) error without querying the expert online, i.e., using only the cached expert D_exp?

¹We are usually interested in a mismatch cost $C(s) = \mathbb{1}_{\pi(s)\neq\pi^*(s)}$, which in Theorem 2 implies $u = 1$, giving us this $O(T\epsilon)$ bound.

3. Confusion on Causality and Covariate Shift

We now take a slight detour to review instances where feedback causes real problems in imitation learning for self-driving, and to clarify a widespread confusion between covariate shift and causality. A common observation in all these instances is that past actions are strongly correlated with future actions, which often leads to a latching effect: once the robot begins to brake, it continues braking. (Muller et al., 2006) first noted such correlations with steering actions, but more recently (Kuefler et al., 2017; Bansal et al., 2018; de Haan et al., 2019) note this with braking actions, and (Codevilla et al., 2019) note a correlation between speed and desired acceleration.

In an important paper, (de Haan et al., 2019) look to understand the problem of feedback through the lens of causal reasoning (Pearl et al., 2016; Spirtes et al., 2000). They note that, practically, the most severe repercussions come from features that enable the learner to condition almost directly on previous actions.² The authors propose that: “This situation presents a give-away symptom of causal misidentification: access to more information leads to worse generalization performance in the presence of distributional shift. Causal misidentification occurs commonly in natural imitation learning settings, especially when the imitator's inputs include history information.” and that their contribution is: “We propose a new interventional causal inference approach suited to imitation learning.”

Fig. 2 shows the graphical model under consideration. The authors claim $z_t = [s_{t-1}, a_{t-1}]$ is a confounder between the independent variable $s_t$ and the dependent variable $a_t$. However, $z_t$ is fully observed. From (Pearl et al., 2016) (Theorem 3.2.2), $s_t$ and $a_t$ are not confounded iff $P(a_t | do(s_t)) = P(a_t | s_t)$, i.e., whenever the observationally witnessed association between them is the same as the association that would be measured in a controlled experiment, with $s_t$ randomized. In the experimental setups they considered, despite explicit conditioning on previous actions, we failed to find true confounders, as all variables are observed.

When all variables are observed, in the infinite data regime and with full support (all modes excited), behavior cloning / MLE is consistent. Indeed, (Zhang et al., 2020) similarly prove that “if the expert and learner share the same policy space, then the policy is always imitable”. While the authors of (de Haan et al., 2019) claim “We define causal misidentification as the phenomenon whereby cloned policies fail by misidentifying the causes of expert actions”, we posit that this is not a causal issue but rather one of the finite data regime and/or model misspecification.

²It is difficult, however, to formalize this intuition because in any interesting imitation learning problem, actions do indeed affect the states the learner must operate with.

Figure 3. Spectrum of feedback-driven covariate shift regimes. Consider the case of training a UAV to fly through a forest using demonstrations (blue). In the EASY regime, the demonstrator is realizable, while in the GOLDILOCKS and HARD regimes, the learner (yellow) is confined to a more restrictive policy class. While model misspecification usually requires interactive demonstrations, in the GOLDILOCKS regime, ALICE achieves O(Tε) without interactive queries.

The learner prediction error ε concentrates unevenly as the learner comes to rely on features that are strongly predictive of actions but create strong feedback loops that lead to covariate shift.

Finally, if we alter the graphical model to have unobserved causal confounders, (de Haan et al., 2019) are correct to note that this would manifest as a covariate shift and lead to poor performance even in the infinite data limit. However, their proposed approach requires expert intervention similar to DAGGER (although it can be much more sample efficient), which makes it difficult to apply in our setting. Instead, we propose to measure the induced covariate shift and directly adjust for it.

4. Feedback Driven Covariate Shift Spectrum

We return to Problem 1 – achieve O(Tε) error without repeatedly querying the expert. We adopt the large-sample, reduction-style analysis of (Beygelzimer et al., 2008; Ross & Bagnell, 2010; Ross et al., 2011) and, for simplicity, consider the infinite data regime to keep the focus squarely on understanding the error-compounding problem so common in real-world applications (Section 3).

While the theoretical analysis demonstrates that BC must incur O(T²ε) error in the worst case, practitioners have often noted settings where behavior cloning is effective (Bojarski et al., 2016). This suggests that there may be varying regimes of difficulty in the imitation learning problem that call for different algorithmic assumptions and approaches, as shown in Fig. 3.

4.1. The EASY Regime: Realizable

In the simplest case, the expert is realizable, π* ∈ Π. We can simply drive the classification error to ε = 0, recovering the expert policy. The prescription in this regime is the easiest practically and often quite powerful: collect as much data as possible and ensure realizability with a sufficiently broad policy class.

Crucially, we note theoretically and demonstrate below empirically that the benchmarks used in the recent imitation learning literature (Ho & Ermon, 2016; Barde et al., 2020) all fall into the EASY regime. Without restrictions on the model class, we find that the commonly used benchmarks are well solved by naive behavioral cloning and fail to capture real-world issues (Sculley et al., 2014; Pomerleau, 1989; Kuefler et al., 2017; Bansal et al., 2018; Codevilla et al., 2019). We identify the methodological issues that may have led to prior beliefs that these benchmarks captured error compounding and suggest improvements for future benchmarks.

4.2. The HARD Regime: Model Misspecification, Infinite Density-ratio

This is the hardest case, where the expert is not realizable, π* ∉ Π, and does not visit every state, thus leading to an infinite learner-expert density ratio. As (Ross et al., 2011) note, when a fundamental limit (of ε classification error) is imposed on how accurately we can learn the expert's policy, whether due to partial observability, optimization limitations, or regularization, a compounding error of O(T²ε) is simply inevitable without repeated access to and interaction with the demonstrator. Interaction with the demonstrator enables the learner to achieve the best possible O(Tε) error.

This comes fundamentally from the fact that there may simply be states that are never visited under the demonstrator policy, and thus there is effectively no way to recover if an error leads our learner to such states. As such, interactive methods like DAGGER (Ross et al., 2011) are inevitably required to achieve high performance.

4.3. The GOLDILOCKS Regime: Model Misspecification, Finite Density-ratio

Consider now a regime where the expert is not realizable, π* ∉ Π, and there thus exists some inevitable classification error (captured crudely via the ε bound), but for which there is sufficient exploration in the data collected by observing the demonstrator that we can hope to learn good “recovery” behavior. We quantify that notion of exploration by saying that the demonstrator visits all states with some small probability, thus bounding the learner-demonstrator density ratio $\|\rho_\pi/\rho_{\pi^*}\|_\infty \leq C$. We denote this the GOLDILOCKS regime, as we speculate this model is “just right” to capture much of the error compounding difficulty seen in real-world imitation learning.

We begin by noting that, due to misspecification, naive BC can still only provide sub-optimal performance O(TCε).

Theorem 3 (BC in the Goldilocks regime). Let $\pi$ be the learnt policy such that $\mathbb{E}_{(s^*,a^*)\sim D_{exp}}[\ell_{cs}(\pi(s^*), a^*)] \leq \epsilon$. For bounded density ratio $\|\rho_\pi/\rho_{\pi^*}\|_\infty \leq C$ and bounded (dis)-advantage $A^{\pi^*}(s,a) \leq u\ \forall (s,a)$, we have

$$J(\pi) \leq J(\pi^*) + \min(TCu\epsilon,\ T^2\epsilon) \quad (2)$$

Proof. See Appendix A.1.

In this regime, the problem becomes one of deploying finite model resources to achieve the best possible performance. We approach Problem 1 from the perspective of covariate-shift correction and provide, for the first time to our knowledge, an approach that achieves O(Tε) without interactively querying the expert.

5. Approach

We speculate that the GOLDILOCKS regime is relatively common in real-world applications and that, in the large data limit, the need to re-query an expert may be relatively uncommon. Although an interactive expert is sufficient to address this issue, we believe that only a set of cached expert demonstrations is necessary. We thus consider the use of classical covariate-shift mitigation strategies, though the IL setting introduces a complication. Due to feedback, we have a subtle chicken-or-egg problem: high-performance covariate shift mitigation strategies rely on samples of the inputs (states) from the test distribution. However, the test distribution here is directly an effect of decisions made earlier by the policy.

We show how this can be addressed both for non-stationary and for stationary policies. For non-stationary policies (Sec 5.1), we use sequential forward training of successive policies. For stationary policies (Sec 5.3), we use an iterative approach which refines both policy and mitigation strategy until they stabilize.

We present ALICE (Aggregate Losses to Imitate Cached Experts), a family of algorithms that train a policy to imitate a cached expert while countering or adjusting for covariate shift. Abstractly, all variants of ALICE optimize:

$$\pi = \arg\min_{\pi\in\Pi}\ \ell(\pi, D_{exp}, \Sigma) \quad (3)$$

where the loss $\ell(\cdot)$ uses both cached demonstrations $D_{exp}$ and a simulator³ $\Sigma : \pi \mapsto \rho_\pi(s)$ to build the induced density $\rho_\pi(s)$. There exist several choices for the loss $\ell(\cdot)$, of which we analyze three, and we show that they all lead to an $O(T\epsilon)$ bound under varying assumptions.

5.1. Forward Training Non-Stationary Policies

A simple resolution to the chicken-or-egg problem in (3) is to train a different policy $\pi_t$ for every timestep. This leads to a simple, recursive recipe: at each time $t$, roll in the trained policies $\pi_1, \ldots, \pi_{t-1}$ in the simulator $\Sigma$ to construct $\rho_\pi^t(s)$ and solve

$$\pi_t = \arg\min_{\pi\in\Pi}\ \ell_t(\pi, D_{exp}, \rho_\pi^t) \quad (4)$$

While training non-stationary policies is at times impractical, we use this setting to theoretically analyze various loss functions $\ell(\cdot)$, deferring to Section 5.3 for a simple online variant with no fixed horizon.
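A minimal sketch of this forward-training recipe is given below; the helpers `simulator_reset`, `simulator_step`, and `train_policy` are hypothetical stand-ins for the simulator Σ and the per-timestep minimization of Eq. (4), not code from the paper:

```python
def forward_training(T, simulator_reset, simulator_step, train_policy, d_exp,
                     n_rollins=256):
    """Train one policy per timestep, Eq. (4).

    simulator_reset() -> an initial state s_0 ~ rho
    simulator_step(s, a) -> a next-state sample s' ~ P(.|s, a)
    train_policy(states_t, d_exp) -> a policy minimizing l_t on rho_t^pi
    d_exp: cached expert demonstrations
    """
    policies = []
    for t in range(T):
        # Roll in the already-trained pi_1..pi_{t-1} to sample states
        # approximately distributed as rho_t^pi.
        states_t = []
        for _ in range(n_rollins):
            s = simulator_reset()
            for k in range(t):
                s = simulator_step(s, policies[k](s))
            states_t.append(s)
        # Fit pi_t against the cached expert data, evaluated on those states.
        policies.append(train_policy(states_t, d_exp))
    return policies
```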

5.2. Family of ALICE Loss Functions

We analyze two different losses – reweigh expert demonstrations by estimating the learner-expert density ratio, and match next-state moments of the learner-expert distribution. We also analyze a third loss that combines both ideas to be doubly robust.

Loss 1 (ALICE-COV): Reweigh by Density Ratio

A classic approach to covariate shift mitigation (Shimodaira, 2000) is to reweigh samples by the density ratio of the target (learner) to the observed (expert) distribution.

³This demonstration + simulator requirement is equivalent to the setting of, e.g., GAIL (Ho & Ermon, 2016), where the learner has access to the environment but does not require access again to the demonstrator after the initial collection.

$$\ell_t(\pi, D_{exp}, \rho_\pi^t) = \mathbb{E}_{s_t^*, a_t^* \sim D_{exp}}\left[\left(\frac{\rho_\pi^t(s_t^*)}{\rho_{\pi^*}^t(s_t^*)}\right)\ell_{cs}(\pi(s_t^*), a_t^*)\right] \quad (5)$$

where $\ell_{cs}(\cdot)$ is the plain classification loss. In practice, the density ratio is not known and must be estimated, $r(s) \approx \frac{\rho_\pi^t(s_t^*)}{\rho_{\pi^*}^t(s_t^*)}$.⁴ ALICE-COV is the simplest variant, requiring only that the ratio estimate be bounded to guarantee $O(T\epsilon)$.

Theorem 4 (ALICE-COV). Let $\pi$ be the learned policy produced by ALICE-COV such that $\ell_t(\pi, D_{exp}, \rho_\pi^t) \leq \epsilon$. Let $r(s_t)$ be the density ratio estimate such that $\mathbb{E}_{s_t^*\sim D_{exp}}\left[r(s_t) - \frac{\rho_\pi^t(s_t)}{\rho_{\pi^*}^t(s_t)}\right] \leq \gamma$. Let $A^{\pi^*}(s,a) \leq u\ \forall (s,a)$ be a bound on the (dis)-advantage w.r.t. the expert. We have

$$J(\pi) \leq J(\pi^*) + Tu(\epsilon + \gamma) \quad (6)$$

Proof. See Appendix A.2.

Corollary 5.1. Let $C(s)$ be the classification loss $\ell_{cs}(\pi(s), \pi^*(s))$ with respect to the expert and $\pi$ be a policy produced as in Theorem 4. In this case,

$$J(\pi) \leq J(\pi^*) + T(\epsilon + \gamma) \quad (7)$$

Proof. For $C(s) = \ell_{cs}(\pi(s), \pi^*(s))$, we have $u = 1$.

We incur a penalty $u$ in Theorem 4 because we are matching expert actions via $\ell_{cs}(\cdot)$ instead of matching values, i.e., minimizing the expert (dis)-advantage function. We look at how to do this next.

Loss 2 (FAIL): Match next-state moments

Since our ultimate objective is to match the expert's value function, we can perhaps do so more effectively by matching moments of that function rather than matching individual actions. The loss function introduced in the FAIL algorithm (Sun et al., 2019) does exactly that, matching moments of the expert's and learner's next-state distribution $\rho_{t+1}$ for non-stationary policies. We reintroduce that loss here within our general framework⁵, and combine it with covariate shift correction in the next section. Please refer to (Sun et al., 2019) for the full implementation details of FAIL.

⁴In practice, since the density ratio cannot be estimated perfectly, we often apply an exponential weighting $r(s_t)^\alpha$, $\alpha \in [0,1]$, which brings the estimate closer to uniform weighting and reins in unreasonably large values, tuning $\alpha$ through validation.

⁵The FAIL setting is observation-only and non-stationary. Since in our setting we observe expert actions, the adversary function class can be expanded to $F : S \times A \to \mathbb{R}$, searching instead for the state-action value function. This requires a stationary target optimization policy $\pi$, and that we uniformly sample both $a_t$ and $a_{t+1}$ and apply appropriate reweighting.


Assuming the value functions $V^*_{t+1}(s)$ belong to a class of moments $F_{t+1} : S \to \mathbb{R}$, the learner can try to match the expert's moments of the next state. An example of a metric that captures deviation from matching moments is the integral probability metric (IPM), where an adversary searches for a moment function $f \in F_{t+1}$ that maximizes the learner-expert moment mismatch. Given a learner distribution $\rho_t$ and expert distribution $\rho^*_{t+1}$, we can define the following IPM:

$$d(\pi\,|\,\rho_t, \rho^*_{t+1}) = \max_{f\in F_{t+1}}\ \mathbb{E}_{\substack{s_t\sim\rho_t,\ a_t\sim\pi \\ s_{t+1}\sim P(\cdot|s_t,a_t)}}\left[f(s_{t+1})\right] - \mathbb{E}_{s^*_{t+1}\sim\rho^*_{t+1}}\left[f(s^*_{t+1})\right] \quad (8)$$

where the first term is the expected moment on states obtained by rolling out $\pi$ and the second term is the expected expert moment. FAIL defines the loss as:

$$\ell_t(\pi, D_{exp}, \rho_\pi^t) = d(\pi\,|\,\rho_\pi^t, D_{exp}) \quad (9)$$
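When the moment class F_{t+1} is the unit ball of linear functions of a fixed feature map, the inner maximization in Eq. (8) has a closed form: the IPM equals the norm of the gap between the mean features of the learner's and the expert's next-state samples. The sketch below illustrates that special case under our own assumptions (the feature map `phi` is illustrative); richer adversary classes require an inner optimization instead:

```python
import numpy as np

def linear_ipm(learner_next_states, expert_next_states, phi):
    """Eq. (8) for F = { f(s) = w . phi(s) : ||w||_2 <= 1 }.

    max_{||w||<=1} E_learner[w . phi(s')] - E_expert[w . phi(s')]
        = || mean_learner(phi) - mean_expert(phi) ||_2
    """
    mu_learner = np.mean([phi(s) for s in learner_next_states], axis=0)
    mu_expert = np.mean([phi(s) for s in expert_next_states], axis=0)
    return float(np.linalg.norm(mu_learner - mu_expert))

# Example (assumed) feature map: the raw state concatenated with its square.
phi = lambda s: np.concatenate([np.asarray(s), np.asarray(s) ** 2])
```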

While FAIL does not require a bounded density ratio to remove error compounding, it does require a different, at times stronger, condition – the notion of one-step recoverability.

Definition 5.1 (One-step Recoverability). For any state distribution $\rho_t$, there exists a policy $\pi$ that bounds $d(\pi\,|\,\rho_t, D_{exp}) \leq \epsilon$.

This basically requires that, no matter what the current distribution is, the learner can recover in a single time-step to drive the loss to $\epsilon$. Without this condition, divergences at earlier time-steps could be unrecoverable and hence compound over time, leading to $O(T^2\epsilon)$. When considering any IL method where $C(s)$ and the corresponding value function $V^*(s)$ are inherent to the MDP (rather than relative to the expert, as in $C(s) = \ell_{cs}(\pi, s)$), recoverability becomes a fundamental requirement. If the environment includes hard branches between high- and low-reward paths, bounds on lost reward ($\epsilon$ here, $u$ in DAGGER) become arbitrarily large, since optimal actions are useless after an initial mistake. These requirements of recoverability can be thought of as imposing a limit on the reward “branchiness” of the environment, which we require here only in expectation. This is illustrated in Appendix A.5. With this condition, we have:

Theorem 5 (FAIL). Let $\pi$ be the learnt policy produced by FAIL such that $\ell_t(\pi, D_{exp}, \rho_\pi^t) \leq \epsilon$, assume the expert is one-step recoverable, and let $\epsilon_{be}$ be the Inherent Bellman Error (IBE). We have:

$$J(\pi) \leq J(\pi^*) + 2T(\epsilon + \epsilon_{be}) \quad (10)$$

Proof. We require a small IBE to account for the richness of the chosen function class $F$ and its ability to realize $V^*$. See Appendix A.3 for the proof and the definition of the IBE.

Loss 3 (ALICE-COV-FAIL): Doubly robust loss

When we combine Losses 1 and 2, we still require a version of recoverability (captured completely in the loss bound ε), but we can eliminate the dependence on the (dis)-advantage and the IBE. We consider the following loss:

$$\ell_t(\pi, D_{exp}, \rho_\pi^t) = \mathbb{E}_{s_t^*, a_t^* \sim D_{exp}}\left[\left(\frac{\rho_\pi^t(s_t^*)}{\rho_{\pi^*}^t(s_t^*)}\right)\ell_{ipm}(\pi(s_t^*), a_t^*)\right] \quad (11)$$

where

$$\ell_{ipm}(\pi(s_t^*), a_t^*) = \max_{f\in F_{t+1}}\ \mathbb{E}_{\substack{a_t\sim\pi(\cdot|s_t^*) \\ s_{t+1}\sim P(\cdot|s_t^*,a_t)}}\left[f(s_{t+1})\right] - \mathbb{E}_{\substack{a_t^*\sim\pi^*(\cdot|s_t^*) \\ s_{t+1}^*\sim P(\cdot|s_t^*,a_t^*)}}\left[f(s_{t+1}^*)\right] \quad (12)$$

Comparing with ALICE-COV (5), we swap out the classification loss $\ell_{cs}(\cdot)$ for the next-state IPM loss $\ell_{ipm}(\cdot)$. We have the following performance guarantee:

Theorem 6 (ALICE-COV-FAIL). Let $\pi$ be the learnt policy produced by ALICE-COV-FAIL such that $\ell_t(\pi, D_{exp}, \rho_\pi^t) \leq \epsilon$. Let $r(s)$ be the density ratio estimate such that $\mathbb{E}_{s_t^*\sim D_{exp}}\left[r(s) - \frac{\rho_\pi^t(s)}{\rho_{\pi^*}^t(s)}\right] \leq \gamma$. Let $A^{\pi^*}(s,a) \leq u\ \forall (s,a)$ be a bound on the (dis)-advantage w.r.t. the expert. We have

$$J(\pi) \leq J(\pi^*) + T(\epsilon + u\gamma) \quad (13)$$

Proof. See Appendix A.4.

If the density ratio estimate is perfect, $\gamma = 0$, the bound is $O(T\epsilon)$, i.e., we have no dependence on $u$.

5.3. Iterative Training Stationary Policies

We now provide a more practical variant of ALICE that trains a single stationary policy. We use the same principle as in (Ross et al., 2011) to reduce the chicken-or-egg problem to iterative no-regret online learning (Shalev-Shwartz & Kakade, 2008), for example, via dataset aggregation.

Algorithm 1 provides an outline. It takes in a dataset of cached expert demonstrations D_exp and a simulator. At iteration i, it rolls in the current learner π_i via the simulator to construct the learner state distribution s_t ∼ ρ_{π_i}^t for any timestep. This, in turn, is used to compute a dataset of losses across various timesteps. This dataset is then aggregated with previous datasets, and a new learner π_{i+1} is computed on the whole dataset. For strongly convex losses, Algorithm 1 is a no-regret online algorithm, and hence achieves sub-linear regret against the best policy in hindsight. With some regularization⁶, we can recover an O(Tε) algorithm with any of the three losses.

⁶To achieve no-regret, we might either use a strongly convex approximation of the classification loss, or we might require regularization as in online gradient descent (Zinkevich, 2003) or methods like weighted majority (Kolter & Maloof, 2007) for arbitrary, non-convex losses.

Algorithm 1 ALICE
Input: Cached expert demonstrations $D_{exp} = \{(s_t^*, a_t^*)\}$, simulator $\Sigma : \pi \to \rho_\pi$
Initialize dataset $D \leftarrow \emptyset$
Initialize learner $\pi_1$ to any policy in $\Pi$
for $i = 1$ to $N$ do
    Sample states $s_t \sim \rho_{\pi_i}^t$ by running $\pi_i$ in simulator $\Sigma$
    Compute a dataset of losses $D_i = \{\ell_t(\pi, D_{exp}, \rho_{\pi_i}^t)\}$
    Aggregate datasets: $D \leftarrow D \cup D_i$
    Train learner $\pi_{i+1}$ on the aggregated $D$
end for
Return the best $\pi_i$ on validation
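For concreteness, a minimal Python rendering of Algorithm 1 with the ALICE-COV loss is sketched below. The helpers `density_ratio_weights` (as in the earlier sketch) and `fit_weighted_bc`, and the `simulator` callable, are illustrative assumptions rather than the authors' released implementation:

```python
def alice_cov(d_exp, simulator, initial_policy, fit_weighted_bc,
              density_ratio_weights, n_iters=10, n_rollout_states=5000):
    """Sketch of Algorithm 1 with the reweighted (ALICE-COV) loss.

    d_exp: list of (state, action) expert pairs, collected once, offline.
    simulator(policy, n) -> n states visited by running `policy`.
    fit_weighted_bc(data) -> policy trained on (state, action, weight) triples.
    """
    expert_states = [s for s, _ in d_exp]
    policy = initial_policy                  # pi_1: any policy in Pi
    aggregated = []                          # aggregated dataset D
    policies = [policy]
    for _ in range(n_iters):
        learner_states = simulator(policy, n_rollout_states)   # s ~ rho_{pi_i}
        w = density_ratio_weights(learner_states, expert_states)
        # Each "loss datapoint" is an expert example with its current weight.
        aggregated.extend((s, a, wi) for (s, a), wi in zip(d_exp, w))
        policy = fit_weighted_bc(aggregated)                    # pi_{i+1}
        policies.append(policy)
    return policies  # in practice, return the best iterate on validation
```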

6. Evaluation and Benchmarks

In this section, we take the opportunity to raise broader concerns with the benchmarks commonly used in imitation learning and identify necessary attributes of better benchmarks.

We performed a set of experiments using what is now the standard IL approach of training an RL agent to provide a set of demonstrations D_exp, then running IL using D_exp. We report the average reward of the expert dataset and behavioral cloning (BC) in Table 1, and share more baselines and results in Appendix A.6. In the stationary policy setting, BC is the initialization point for ALICE and a lower bound on our performance. Unfortunately (or fortunately), as a lower bound for our algorithm, BC leaves no room for improvement because the standard benchmarks fall squarely within the realizable (ε = 0) setting. It is indeed challenging to find settings for which BC does not perform well, leaving little room to showcase relative performance gains. That naive BC performs better than or equivalent to sophisticated IL methods on many of these benchmarks is not necessarily a critique of the IL algorithms, which may be useful for combating covariate shift in real-world applications. Rather, it is an acknowledgment of the inadequacy of the benchmark environments themselves to demonstrate the O(T²) error compounding suffered by BC, commonly observed both theoretically and experimentally (Ross & Bagnell, 2010).
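The demonstrate-then-imitate protocol is straightforward to reproduce; a hedged sketch (assuming the Gymnasium API and an already-trained `expert` callable, with names chosen for illustration) is:

```python
import gymnasium as gym

def collect_demos(env_name, expert, n_trajectories=25):
    """Roll out a pre-trained expert and cache its (state, action) pairs."""
    env = gym.make(env_name)
    demos = []
    for _ in range(n_trajectories):
        obs, _ = env.reset()
        done = False
        while not done:
            action = expert(obs)
            demos.append((obs, action))
            obs, _, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
    return demos

def average_return(env_name, policy, episodes=100):
    """Average episode reward of a (learned or expert) policy."""
    env = gym.make(env_name)
    total = 0.0
    for _ in range(episodes):
        obs, _ = env.reset()
        done = False
        while not done:
            obs, reward, terminated, truncated, _ = env.step(policy(obs))
            total += reward
            done = terminated or truncated
    return total / episodes
```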

Difficulty - We claim these benchmarks are “too easy” in the sense that they do not exhibit the feedback-driven covariate shift and error compounding observed when implementing IL in real world scenarios. This is evidenced by the fact that in many published works (Barde et al., 2020), naive BC outperforms many sophisticated “state-of-the-art” IL algorithms, as reported by the authors themselves. In cases where authors report weak performance by BC (Ho & Ermon, 2016; de Haan et al., 2019), we have found that repeating those experiments with stronger optimization and different model classes (validating our claim of “model misspecification”) produces reasonable results.


Environment     Expert           BC
CartPole        500 ± 0          500 ± 0
Acrobot         −71.7 ± 11.5     −78.4 ± 14.2
MountainCar     −99.6 ± 10.9     −107.8 ± 16.4
Hopper          3554 ± 216       3258 ± 396
Walker2d        5496 ± 89        5349 ± 634
HalfCheetah     4487 ± 164       4605 ± 143
Ant             4186 ± 1081      3353 ± 1801

Table 1. Behavioral Cloning performance on common control benchmarks for 25 expert trajectories, averaged over 7 random seeds. In nearly every case, BC performs within the expert's error margin and often just as well as more sophisticated methods. See appendix for more results and implementation specifics.


Proposed Benchmark Requirements - Ideally, IL researchers would develop their own realistic and repeatable set of IL-centric benchmarks with the following properties:

1. A repeatable sequential decision making environment (à la OpenAI Gym, MuJoCo, ...)

2. A pre-tuned (reward-optimal) expert policy or fixed set of demonstrations, used in common by all researchers

3. A standard success metric of on-policy expert mismatch: $\mathbb{E}_{s\sim\rho_\pi}[\mathbb{1}_{\pi(s)\neq\pi^*(s)}]$ (a minimal evaluation sketch follows this list)

4. Scalable difficulty that is nontrivial yet feasible (i.e., distinct expert/learner model classes, but adequate expert coverage).
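As referenced in item 3 above, the proposed success metric needs only rollouts of the learner plus offline queries of the reference expert policy for labels; a minimal sketch for discrete actions (assuming the same Gymnasium-style environment interface as in the earlier sketch) is:

```python
def on_policy_expert_mismatch(env, learner, expert, episodes=100):
    """Monte-Carlo estimate of E_{s ~ rho_pi}[ 1{ pi(s) != pi*(s) } ].

    The expert is only evaluated on states the learner visits; no new
    demonstrations are collected, so this is not an interactive query."""
    mismatches, visited = 0, 0
    for _ in range(episodes):
        obs, _ = env.reset()
        done = False
        while not done:
            action = learner(obs)
            mismatches += int(action != expert(obs))
            visited += 1
            obs, _, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
    return mismatches / max(visited, 1)
```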

As noted by (Zhang et al., 2020), imitability or difficulty in IL is closely linked to observability, and partial observability removes realizability and often injects substantial covariate shift into IL problems (Wen et al., 2020). However, reducing observability in benchmark problems must be done with care, as it can quickly make the problem so challenging as to become unsolvable even for algorithms such as DAGGER, which achieve the theoretical upper bound in many settings. In pure IL, the upper and lower performance bounds are not Expert and Random, but carefully optimized DAGGER and BC.

Rather than avoiding the fact that carefully optimized BC performs remarkably well in many existing benchmarks or contriving ways to handicap BC, the community can benefit from a focus on IL-centric benchmarks that exhibit the characteristics of more difficult real world problems.

7. Discussion

In this paper we identify the root cause of some classic challenges in imitation learning and present a general framework for addressing them. Specifically, we notice how feedback in sequential decision making tasks causes covariate shift in standard imitation learning, and show how we can correct for that shift under certain assumptions.

Although this Goldilocks regime of moderate-difficulty problems has oft been noted in the real world, an active area of future work is to find and develop benchmarks which are not easily solvable by BC yet still feasible. This is also part of a broader call to ourselves and to the broader community of researchers to develop IL-centric benchmarks which are consistent yet flexible.

Within that Goldilocks regime, we outlined a general IL framework and introduced three specific loss functions to counter and correct covariate shift using density ratio correction and next-state moment matching. Those solutions relied on a bounded density ratio and a notion of one-step recoverability. Single-step recoverability is often a very strong requirement, and we would like to consider approaches that can manage with less demanding requirements.

Acknowledgements

The authors thank Hal Daume for thoughtful conversation on when causal confounding might play a significant role in imitation learning and structured prediction, and Sergey Levine and Dinesh Jayaraman for discussions on identification of causal models and the role of causality in the compounding-error phenomena.

References

Abbeel, P. and Ng, A. Y. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the Twenty-First International Conference on Machine Learning, pp. 1, 2004.

Bagnell, D. Feedback in machine learning. In Workshop on Safety and Control for Artificial Intelligence, CMU, 2016.

Bansal, M., Krizhevsky, A., and Ogale, A. ChauffeurNet: Learning to drive by imitating the best and synthesizing the worst. arXiv preprint arXiv:1812.03079, 2018.

Barde, P., Roy, J., Jeon, W., Pineau, J., Pal, C., and Nowrouzezahrai, D. Adversarial soft advantage fitting: Imitation learning without policy optimization. arXiv preprint arXiv:2006.13258, 2020.

Beygelzimer, A., Langford, J., and Zadrozny, B. Machine learning techniques—reductions between prediction quality metrics. In Performance Modeling and Engineering, pp. 3–28. Springer, 2008.

Bojarski, M., Del Testa, D., Dworakowski, D., Firner, B., Flepp, B., Goyal, P., Jackel, L. D., Monfort, M., Muller, U., Zhang, J., et al. End to end learning for self-driving cars. arXiv preprint arXiv:1604.07316, 2016.

Brantley, K., Sun, W., and Henaff, M. Disagreement-regularized imitation learning. In International Conference on Learning Representations, 2019.

Codevilla, F., Santana, E., Lopez, A. M., and Gaidon, A. Exploring the limitations of behavior cloning for autonomous driving. In Proceedings of the IEEE International Conference on Computer Vision, pp. 9329–9338, 2019.

de Haan, P., Jayaraman, D., and Levine, S. Causal confusion in imitation learning. In Advances in Neural Information Processing Systems, pp. 11693–11704, 2019.

Finn, C., Levine, S., and Abbeel, P. Guided cost learning: Deep inverse optimal control via policy optimization. In International Conference on Machine Learning, pp. 49–58, 2016.

Ho, J. and Ermon, S. Generative adversarial imitation learning. In Advances in Neural Information Processing Systems, pp. 4565–4573, 2016.

Kolter, J. Z. and Maloof, M. A. Dynamic weighted majority: An ensemble method for drifting concepts. Journal of Machine Learning Research, 8(Dec):2755–2790, 2007.

Kuefler, A., Morton, J., Wheeler, T., and Kochenderfer, M. Imitating driver behavior with generative adversarial networks. In 2017 IEEE Intelligent Vehicles Symposium (IV), pp. 204–211. IEEE, 2017.

Laskey, M., Chuck, C., Lee, J., Mahler, J., Krishnan, S., Jamieson, K., Dragan, A., and Goldberg, K. Comparing human-centric and robot-centric sampling for robot deep learning from demonstrations. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 358–365. IEEE, 2017.

Muller, U., Ben, J., Cosatto, E., Flepp, B., and Cun, Y. L. Off-road obstacle avoidance through end-to-end learning. In Advances in Neural Information Processing Systems, pp. 739–746, 2006.

Pearl, J., Glymour, M., and Jewell, N. P. Causal Inference in Statistics: A Primer. John Wiley & Sons, 2016.

Pomerleau, D. ALVINN: An autonomous land vehicle in a neural network. In Touretzky, D. (ed.), Proceedings of Advances in Neural Information Processing Systems 1, pp. 305–313. Morgan Kaufmann, December 1989.

Ross, S. Interactive learning for sequential decisions and predictions. 2013.

Ross, S. and Bagnell, D. Efficient reductions for imitation learning. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 661–668, 2010.

Ross, S., Gordon, G., and Bagnell, J. A. A reduction of imitation learning and structured prediction to no-regret online learning. In AISTATS, 2011.

Sculley, D., Holt, G., Golovin, D., Davydov, E., Phillips, T., Ebner, D., Chaudhary, V., and Young, M. Machine learning: The high interest credit card of technical debt. In SE4ML: Software Engineering for Machine Learning (NIPS 2014 Workshop), 2014.

Shalev-Shwartz, S. and Kakade, S. M. Mind the duality gap: Logarithmic regret algorithms for online optimization. Advances in Neural Information Processing Systems, 21:1457–1464, 2008.

Shimodaira, H. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90(2):227–244, 2000.

Spirtes, P., Glymour, C. N., Scheines, R., and Heckerman, D. Causation, Prediction, and Search. MIT Press, 2000.

Sun, W., Vemula, A., Boots, B., and Bagnell, J. A. Provably efficient imitation learning from observation alone. In Proceedings of the International Conference on Machine Learning (ICML), June 2019.

Wen, C., Lin, J., Darrell, T., Jayaraman, D., and Gao, Y. Fighting copycat agents in behavioral cloning from observation histories. In Advances in Neural Information Processing Systems, 2020.

Zhang, J., Kumor, D., and Bareinboim, E. Causal imitation learning with unobserved confounders. In Advances in Neural Information Processing Systems, 2020.

Ziebart, B. D., Maas, A. L., Bagnell, J. A., and Dey, A. K. Maximum entropy inverse reinforcement learning. In AAAI, volume 8, pp. 1433–1438. Chicago, IL, USA, 2008.

Zinkevich, M. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pp. 928–936, 2003.

A. Appendix

A.1. Proof of Theorem 3 (Behavior Cloning)

Proof. We begin by bounding the expected on-policy (dis)-advantage of the learner w.r.t. the expert:

$$
\begin{aligned}
\mathbb{E}_{s_t\sim\rho_\pi^t}\left[A^{\pi^*}(s_t, \pi(s_t))\right]
&= \mathbb{E}_{s_t\sim\rho_{\pi^*}^t}\left[\frac{\rho_\pi(s_t)}{\rho_{\pi^*}(s_t)} A^{\pi^*}(s_t, \pi(s_t))\right] \\
&\leq \left\|\frac{\rho_\pi^t(\cdot)}{\rho_{\pi^*}^t(\cdot)}\right\|_\infty \mathbb{E}_{s_t\sim\rho_{\pi^*}^t}\left[A^{\pi^*}(s_t, \pi(s_t))\right] \\
&\leq \left\|\frac{\rho_\pi^t(\cdot)}{\rho_{\pi^*}^t(\cdot)}\right\|_\infty \left\|A^{\pi^*}(\cdot,\cdot)\right\|_\infty \mathbb{E}_{\substack{s_t^*\sim\rho_{\pi^*}^t \\ a_t^*\sim\pi^*(\cdot|s_t^*)}}\left[\ell_{cs}(\pi(s_t^*), a_t^*)\right] \\
&\leq Cu\epsilon
\end{aligned}
\quad (14)
$$

where the third inequality follows from the fact that $\ell_{cs}(\cdot)$ upper bounds the 0-1 loss.

Using the Performance Difference Lemma,

$$
\begin{aligned}
J(\pi) &= J(\pi^*) + \sum_{t=1}^{T}\mathbb{E}_{s_t\sim\rho_\pi^t}\left[A^{\pi^*}(s_t, \pi(s_t))\right] \\
&\leq J(\pi^*) + \sum_{t=1}^{T} Cu\epsilon \\
&\leq J(\pi^*) + TCu\epsilon
\end{aligned}
\quad (15)
$$

We can clip the bound above if we assume that costs are bounded in $[0,1]$, using Theorem 2.1 from (Ross et al., 2011):

$$J(\pi) \leq J(\pi^*) + T^2\epsilon \quad (16)$$

A.2. Proof of Theorem 4 (ALICE-COV)

Proof. We begin by bounding the expected on-policy (dis)-advantage of the learner w.r.t. the expert:

$$
\begin{aligned}
\mathbb{E}_{s_t\sim\rho_\pi^t}\left[A^{\pi^*}(s_t, \pi(s_t))\right]
&= \mathbb{E}_{s_t\sim\rho_{\pi^*}^t}\left[\frac{\rho_\pi(s_t)}{\rho_{\pi^*}(s_t)} A^{\pi^*}(s_t, \pi(s_t))\right] \\
&= \mathbb{E}_{s_t\sim\rho_{\pi^*}^t}\left[r(s_t) A^{\pi^*}(s_t, \pi(s_t))\right] + \mathbb{E}_{s_t\sim\rho_{\pi^*}^t}\left[\left(\frac{\rho_\pi(s_t)}{\rho_{\pi^*}(s_t)} - r(s_t)\right) A^{\pi^*}(s_t, \pi(s_t))\right] \\
&\leq u\, \mathbb{E}_{\substack{s_t\sim\rho_{\pi^*}^t \\ a_t^*\sim\pi^*(\cdot|s_t)}}\left[r(s_t)\,\ell_{cs}(\pi(s_t), a_t^*)\right] + u\, \mathbb{E}_{s_t\sim\rho_{\pi^*}^t}\left[\left(\frac{\rho_\pi(s_t)}{\rho_{\pi^*}(s_t)} - r(s_t)\right)\right] \\
&\leq u\, \ell(\pi, D_{exp}, \rho_\pi^t) + u\gamma \\
&\leq u(\epsilon + \gamma)
\end{aligned}
\quad (17)
$$

Using the Performance Difference Lemma,

$$
\begin{aligned}
J(\pi) &= J(\pi^*) + \sum_{t=1}^{T}\mathbb{E}_{s_t\sim\rho_\pi^t}\left[A^{\pi^*}(s_t, \pi(s_t))\right] \\
&\leq J(\pi^*) + \sum_{t=1}^{T} u(\epsilon + \gamma) \\
&\leq J(\pi^*) + Tu(\epsilon + \gamma)
\end{aligned}
\quad (18)
$$

A.3. Proof of Theorem 5 (FAIL)

Proof. We begin by bounding the expected on-policy (dis)-advantage of the learner w.r.t. the expert:

$$
\begin{aligned}
\mathbb{E}_{s_t\sim\rho_\pi^t}\left[A^{\pi^*}(s_t, \pi(s_t))\right]
&= \mathbb{E}_{\substack{s_t\sim\rho_\pi^t,\ a_t\sim\pi(\cdot|s_t) \\ s_{t+1}\sim P(\cdot|s_t,a_t)}}\left[V^*_{t+1}(s_{t+1})\right] - \mathbb{E}_{\substack{s_t\sim\rho_\pi^t,\ a_t^*\sim\pi_t^*(\cdot|s_t) \\ s_{t+1}\sim P(\cdot|s_t,a_t^*)}}\left[V^*_{t+1}(s_{t+1})\right] \\
&= \mathbb{E}_{\substack{s_t\sim\rho_\pi^t,\ a_t\sim\pi(\cdot|s_t) \\ s_{t+1}\sim P(\cdot|s_t,a_t)}}\left[V^*_{t+1}(s_{t+1})\right] - \mathbb{E}_{\substack{s_t^*\sim\rho_{\pi^*}^t,\ a_t^*\sim\pi_t^*(\cdot|s_t^*) \\ s_{t+1}^*\sim P(\cdot|s_t^*,a_t^*)}}\left[V^*_{t+1}(s_{t+1}^*)\right] \\
&\quad + \mathbb{E}_{\substack{s_t^*\sim\rho_{\pi^*}^t,\ a_t^*\sim\pi_t^*(\cdot|s_t^*) \\ s_{t+1}^*\sim P(\cdot|s_t^*,a_t^*)}}\left[V^*_{t+1}(s_{t+1}^*)\right] - \mathbb{E}_{\substack{s_t\sim\rho_\pi^t,\ a_t^*\sim\pi_t^*(\cdot|s_t) \\ s_{t+1}\sim P(\cdot|s_t,a_t^*)}}\left[V^*_{t+1}(s_{t+1})\right] \\
&= \mathbb{E}_{s_{t+1}\sim\rho_\pi^{t+1}}\left[V^*_{t+1}(s_{t+1})\right] - \mathbb{E}_{s_{t+1}^*\sim\rho_{\pi^*}^{t+1}}\left[V^*_{t+1}(s_{t+1}^*)\right] \\
&\quad + \mathbb{E}_{\substack{s_t^*\sim\rho_{\pi^*}^t,\ a_t^*\sim\pi_t^*(\cdot|s_t^*) \\ s_{t+1}^*\sim P(\cdot|s_t^*,a_t^*)}}\left[V^*_{t+1}(s_{t+1}^*)\right] - \mathbb{E}_{\substack{s_t\sim\rho_\pi^t,\ a_t^*\sim\pi_t^*(\cdot|s_t) \\ s_{t+1}\sim P(\cdot|s_t,a_t^*)}}\left[V^*_{t+1}(s_{t+1})\right] \\
&\leq \max_{f\in F_{t+1}} \mathbb{E}_{s_{t+1}\sim\rho_\pi^{t+1}}\left[f(s_{t+1})\right] - \mathbb{E}_{s_{t+1}^*\sim\rho_{\pi^*}^{t+1}}\left[f(s_{t+1}^*)\right] \\
&\quad + \mathbb{E}_{\substack{s_t^*\sim\rho_{\pi^*}^t,\ a_t^*\sim\pi_t^*(\cdot|s_t^*) \\ s_{t+1}^*\sim P(\cdot|s_t^*,a_t^*)}}\left[V^*_{t+1}(s_{t+1}^*)\right] - \mathbb{E}_{\substack{s_t\sim\rho_\pi^t,\ a_t^*\sim\pi_t^*(\cdot|s_t) \\ s_{t+1}\sim P(\cdot|s_t,a_t^*)}}\left[V^*_{t+1}(s_{t+1})\right]
\end{aligned}
\quad (19)
$$

The second term is a little tricky to bound, since the value function $V^*_{t+1}(\cdot)$ needs to be pulled back from $t+1$ to $t$ via the Bellman operator. To do this, we borrow the definition of the Inherent Bellman Error (IBE) (Sun et al., 2019).

Definition A.1 (Inherent Bellman Error). Let the Bellman operator on a function $g \in F_{t+1}$ w.r.t. the optimal policy $\pi^*$ be

$$\mathcal{B}^*_t g(s_t) \triangleq \mathbb{E}_{\substack{a_t^*\sim\pi_t^*(\cdot|s_t) \\ s_{t+1}\sim P(\cdot|s_t,a_t^*)}}\left[g(s_{t+1})\right], \quad (20)$$

i.e., the pull-back of $g$ from $t+1$ to $t$. This pulled-back function may not be in the family of functions, $\mathcal{B}^*_t g(s_t) \notin F_t$. We define the Inherent Bellman Error $\epsilon_{be}$ as the worst-case projection error

$$\epsilon_{be} = \max_t \left(\max_{g\in F_{t+1}} \min_{f\in F_t} \|f - \mathcal{B}^*_t g\|_\infty\right) \quad (21)$$

Applying Definition A.1,

$$
\begin{aligned}
\mathbb{E}_{\substack{s_t^*\sim\rho_{\pi^*}^t,\ a_t^*\sim\pi_t^*(\cdot|s_t^*) \\ s_{t+1}^*\sim P(\cdot|s_t^*,a_t^*)}}\left[V^*_{t+1}(s_{t+1}^*)\right]
&\leq \max_{f\in F_t} \mathbb{E}_{s_t^*\sim\rho_{\pi^*}^t}\left[f(s_t^*)\right] + \min_{f'\in F_t}\left\|f' - \mathcal{B}^*_t V^*_{t+1}\right\|_\infty \\
&\leq \max_{f\in F_t} \mathbb{E}_{s_t^*\sim\rho_{\pi^*}^t}\left[f(s_t^*)\right] + \epsilon_{be}
\end{aligned}
\quad (22)
$$

Using (22) in (19),

$$
\begin{aligned}
\mathbb{E}_{s_t\sim\rho_\pi^t}\left[A^{\pi^*}(s_t, \pi(s_t))\right]
&\leq \max_{f\in F_{t+1}} \mathbb{E}_{s_{t+1}\sim\rho_\pi^{t+1}}\left[f(s_{t+1})\right] - \mathbb{E}_{s_{t+1}^*\sim\rho_{\pi^*}^{t+1}}\left[f(s_{t+1}^*)\right] \\
&\quad + \mathbb{E}_{\substack{s_t^*\sim\rho_{\pi^*}^t,\ a_t^*\sim\pi_t^*(\cdot|s_t^*) \\ s_{t+1}^*\sim P(\cdot|s_t^*,a_t^*)}}\left[V^*_{t+1}(s_{t+1}^*)\right] - \mathbb{E}_{\substack{s_t\sim\rho_\pi^t,\ a_t^*\sim\pi_t^*(\cdot|s_t) \\ s_{t+1}\sim P(\cdot|s_t,a_t^*)}}\left[V^*_{t+1}(s_{t+1})\right] \\
&\leq \max_{f\in F_{t+1}} \mathbb{E}_{s_{t+1}\sim\rho_\pi^{t+1}}\left[f(s_{t+1})\right] - \mathbb{E}_{s_{t+1}^*\sim\rho_{\pi^*}^{t+1}}\left[f(s_{t+1}^*)\right] + \max_{f\in F_t} \mathbb{E}_{s_t^*\sim\rho_{\pi^*}^t}\left[f(s_t^*)\right] - \mathbb{E}_{s_t\sim\rho_\pi^t}\left[f(s_t)\right] + 2\epsilon_{be} \\
&\leq \ell(\pi, D_{exp}, \rho_\pi^{t+1}) + \ell(\pi, D_{exp}, \rho_\pi^t) + 2\epsilon_{be} \\
&\leq 2(\epsilon + \epsilon_{be})
\end{aligned}
\quad (23)
$$

Using the Performance Difference Lemma,

$$
\begin{aligned}
J(\pi) &= J(\pi^*) + \sum_{t=1}^{T}\mathbb{E}_{s_t\sim\rho_\pi^t}\left[A^{\pi^*}(s_t, \pi(s_t))\right] \\
&\leq J(\pi^*) + \sum_{t=1}^{T} 2(\epsilon + \epsilon_{be}) \\
&\leq J(\pi^*) + 2T(\epsilon + \epsilon_{be})
\end{aligned}
\quad (24)
$$

A.4. Proof of Theorem 6 (ALICE-COV-FAIL)

Proof. We begin by bounding the expected on-policy (dis)-advantage of the learner w.r.t. the expert:

$$
\begin{aligned}
\mathbb{E}_{s_t\sim\rho_\pi^t}\left[A^{\pi^*}(s_t, \pi(s_t))\right]
&= \mathbb{E}_{s_t\sim\rho_{\pi^*}^t}\left[\frac{\rho_\pi(s_t)}{\rho_{\pi^*}(s_t)} A^{\pi^*}(s_t, \pi(s_t))\right] \\
&= \mathbb{E}_{s_t\sim\rho_{\pi^*}^t}\left[r(s_t) A^{\pi^*}(s_t, \pi(s_t))\right] + \mathbb{E}_{s_t\sim\rho_{\pi^*}^t}\left[\left(\frac{\rho_\pi(s_t)}{\rho_{\pi^*}(s_t)} - r(s_t)\right) A^{\pi^*}(s_t, \pi(s_t))\right] \\
&\leq \mathbb{E}_{s_t\sim\rho_{\pi^*}^t}\left[r(s_t) A^{\pi^*}(s_t, \pi(s_t))\right] + u\,\mathbb{E}_{s_t\sim\rho_{\pi^*}^t}\left[\left(\frac{\rho_\pi(s_t)}{\rho_{\pi^*}(s_t)} - r(s_t)\right)\right] \\
&\leq \mathbb{E}_{s_t\sim\rho_{\pi^*}^t}\left[r(s_t) A^{\pi^*}(s_t, \pi(s_t))\right] + u\gamma \\
&\leq \mathbb{E}_{s_t\sim\rho_{\pi^*}^t}\left[r(s_t)\left(\mathbb{E}_{\substack{a_t\sim\pi(\cdot|s_t) \\ s_{t+1}\sim P(\cdot|s_t,a_t)}}\left[V^*_{t+1}(s_{t+1})\right] - \mathbb{E}_{\substack{a_t^*\sim\pi^*(\cdot|s_t) \\ s_{t+1}^*\sim P(\cdot|s_t,a_t^*)}}\left[V^*_{t+1}(s_{t+1}^*)\right]\right)\right] + u\gamma \\
&\leq \mathbb{E}_{s_t\sim\rho_{\pi^*}^t}\left[r(s_t)\max_{f\in F_{t+1}}\left(\mathbb{E}_{\substack{a_t\sim\pi(\cdot|s_t) \\ s_{t+1}\sim P(\cdot|s_t,a_t)}}\left[f(s_{t+1})\right] - \mathbb{E}_{\substack{a_t^*\sim\pi^*(\cdot|s_t) \\ s_{t+1}^*\sim P(\cdot|s_t,a_t^*)}}\left[f(s_{t+1}^*)\right]\right)\right] + u\gamma \\
&\leq \ell(\pi, D_{exp}, \rho_\pi^t) + u\gamma \\
&\leq \epsilon + u\gamma
\end{aligned}
\quad (25)
$$

Using the Performance Difference Lemma,

$$
\begin{aligned}
J(\pi) &= J(\pi^*) + \sum_{t=1}^{T}\mathbb{E}_{s_t\sim\rho_\pi^t}\left[A^{\pi^*}(s_t, \pi(s_t))\right] \\
&\leq J(\pi^*) + \sum_{t=1}^{T} (\epsilon + u\gamma) \\
&\leq J(\pi^*) + T(\epsilon + u\gamma)
\end{aligned}
\quad (26)
$$

A.5. Why Recoverability Matters: Various Recoverability Regimes and Performance Bounds


Figure 4. Three different MDPs with varying recoverability regimes. For all MDPs, $C(s_1) = 0$ and $C(s) = 1$ for all $s \neq s_1$. The expert's deterministic policy is therefore $\pi^*(s_1) = a_1$ and $\pi^*(s) = a_2$ for all $s \neq s_1$. Even with one-step recoverability, BC can still result in $O(T^2\epsilon)$ error. For greater than one-step recoverability, even FAIL slides to $O(T^2\epsilon)$, while DAGGER can recover in $k$ steps, leading to $O(kT\epsilon)$. For unrecoverable problems, all algorithms can go up to $O(T^2\epsilon)$. Hence recoverability dictates the lower bound of how well we can do in the model-misspecified regime.

Here we show that recoverability, in a weaker sense than Def. 5.1, is a fundamental requirement for any IL algorithm. Consider the class of MDPs defined in Fig. 4. They vary in a weaker sense of recoverability, i.e., the upper bound of the advantage $\|A^{\pi^*}(s,a)\|_\infty \leq u$. The first MDP has $u = 1$, allowing for one-step recoverability, and hence FAIL achieves $O(T\epsilon)$ while BC suffers from compounding error $O(T^2\epsilon)$. The second MDP needs at least $k$ steps to recover; thus both BC and ALICE hit $O(T^2\epsilon)$, while DAGGER can achieve the best bound of $O(kT\epsilon)$. Finally, for unrecoverable MDPs, all algorithms are resigned to $O(T^2\epsilon)$.

A.6. IL Baseline Implementation Specifics

We use the following settings in training our RL experts (using the RL Baselines Zoo Python package) and IL learners. Learner policy classes consisted of neural networks with ReLU activations everywhere except the final layer and a set number of hidden dimensions (layer 1 dimension, layer 2 dimension, ...). These experiments are all in the stationary setting, where a single policy is trained on all available data and executed for all time-steps.

Box2d Discrete OpenAI Gym                      MuJoCo Continuous Control
(CartPole, Acrobot, MountainCar)               (Ant, HalfCheetah, Reacher, Walker, Hopper)

RL Expert                                      RL Expert
Deep Q Network                                 Soft Actor Critic
100k env. steps                                3M env. steps
exploration fraction = 50%                     learning rate = 0.0003
final ε = 0.1                                  buffer size = 1M

IL Learner                                     IL Learner
learning rate = 0.001                          learning rate = 0.001
hidden layer dimensions = (64,)                hidden layer dimensions = (512, 512)
100k optimization steps                        20M optimization steps

Table 2. Settings used for training RL experts and IL learners

In some settings, it is natural and useful to augment the current observation s_t with the previous action a_{t-1}, i.e., to use the input [s_t, a_{t-1}]. We refer to this addition of single-step action history to BC as “BC+H” and include results in Table 3.

Environment       Expert              BC                  BC+H
CartPole-v1       500.0 ± 0.0         500.0 ± 0.0         500.0 ± 0.0
Acrobot-v1        −71.7 ± 11.5        −78.4 ± 14.2        −80.0 ± 18.0
MountainCar-v0    −99.6 ± 10.9        −107.8 ± 16.4       −105.4 ± 16.6
Ant-v2            4185.9 ± 1081.1     3352.9 ± 1800.9     2919.2 ± 1967.9
HalfCheetah-v2    4514.6 ± 111.4      4388.4 ± 494.8      4338.1 ± 791.7
Reacher-v2        −4.4 ± 1.3          −4.9 ± 2.4          −4.5 ± 1.4
Walker2d          5496 ± 89           5349 ± 634          4451 ± 1491

Table 3. Behavioral Cloning performance on common control benchmarks for 25 expert trajectories. In most cases, BC performs within the expert's error margin and often as well as more sophisticated methods.
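The BC+H variant changes only the featurization: the previous action is appended to the current observation before the policy is trained and queried. A minimal sketch of that augmentation for one trajectory (our illustration, assuming NumPy arrays and a one-hot encoding of discrete actions) is:

```python
import numpy as np

def augment_with_history(states, actions, n_actions):
    """Build BC+H inputs [s_t, one_hot(a_{t-1})] for a single trajectory.

    states : array of shape (T, d_s)
    actions: int array of shape (T,), the action taken at each step
    """
    prev = np.zeros((len(states), n_actions))
    prev[1:, :] = np.eye(n_actions)[actions[:-1]]  # a_{-1} is the zero vector
    return np.concatenate([states, prev], axis=1)

# At test time the same augmentation uses the *learner's* previous action,
# which is exactly the feedback channel that can drive the latching effect.
```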

Although we did not perform experiments for the partially observable setting, please refer to the experiments of (Wen et al., 2020). They use a generative adversarial approach to generate an intermediate state representation that removes all information about the previous action. Although their vanilla implementation of BC didn't achieve the strongest results, they include an implementation of their algorithm without the use of an adversary, effectively performing BC with a more principled autoencoder-type model class. Using that improved model class, they show that BC achieves similar reward to DAgger in many of their experiments. We also refer the reader to (Sun et al., 2019) for experimental results of Loss #2 (FAIL) in the non-stationary policy setting.