
Inverse Reinforcement Learning with Natural Language Goals

Li Zhou, Kevin Small
Amazon Alexa

{lizhouml, smakevin}@amazon.com

Abstract

Humans generally use natural language (NL) to communicate task requirements to each other. Ideally, NL should also be usable for communicating goals to autonomous machines (e.g., robots) to minimize friction in task specification. However, understanding and mapping NL goals to sequences of states and actions is challenging. Specifically, existing work along these lines has encountered difficulty in generalizing learned policies to new NL goals and environments. In this paper, we propose a novel adversarial inverse reinforcement learning algorithm to learn a language-conditioned policy and reward function. To improve generalization of the learned policy and reward function, we use a variational goal generator to relabel trajectories and sample diverse goals during training. Our algorithm outperforms multiple baselines by a large margin on a vision-based NL instruction following dataset (Room-2-Room), demonstrating a promising advance in enabling the use of NL instructions in specifying agent goals.

1 Introduction

Humans use natural language (NL) to communicate with each other. Accordingly, NL is also a desirable medium for conveying goals or assigning tasks to an autonomous machine (e.g., a robot). For example, in household tasks, when asking an agent to pick up a toolbox from the garage, we do not assign the coordinates of the toolbox to the robot. Instead, we tell the agent to retrieve my toolbox from the garage. This requires the agent to understand the semantics of the natural language goal, associate states and actions with the natural language goal, and infer whether the natural language goal has been achieved or not, all of which are very challenging tasks. Furthermore, we want to be able to learn policies that are robust to lexical variation in goals and generalize well to new goals and environments.

Inverse reinforcement learning (IRL) (Abbeel and Ng 2004; Fu, Luo, and Levine 2018), a form of imitation learning (Osa et al. 2018), is the task of learning a reward function, and hence a policy, from expert demonstrations. Imitation learning has been successfully applied to a wide range of tasks including robot manipulation (Finn, Levine, and Abbeel 2016), autonomous driving (Kuefler et al. 2017), human behavior forecasting (Rhinehart and Kitani 2017), and video game AI (Tucker, Gleave, and Russell 2018). Generally, task goals are specified intrinsically by the environment and an agent is trained for each task. To generalize the learned policy to new goals, goal-conditioned imitation learning (Ding et al. 2019) and reinforcement learning (RL) algorithms (Kaelbling 1993; Schaul et al. 2015; Nair et al. 2018) have been proposed where the policy is explicitly conditioned on a goal. Normally, the goals either share the same space with the states or can be easily mapped to the state space. For example, in Ding et al. (2019), the goals and states are both coordinates in the environment and the goals are provided to the agent by specifying goal positions.

Fu et al. (2019) and Bahdanau et al. (2019) represent two recent related works from the perspective of understanding NL goal specifications; they propose learning a language-conditioned reward function under the maximum entropy inverse reinforcement learning (Ziebart et al. 2008) and generative adversarial imitation learning (Ho and Ermon 2016) frameworks, respectively. However, the NL goals are produced by a preset template-based grammar. Fu et al. (2019) optimize their policy exactly in a grid environment with known dynamics, and when the policy is instead optimized by sample-based reinforcement learning algorithms such as deep Q-learning (Mnih et al. 2015), model performance drops significantly. This reveals a sample efficiency challenge in this setting and implies that when the NL goals and the environments are complicated, generalization to new goals and environments becomes a very difficult problem.

One potential method for addressing these shortcomings is goal relabeling, e.g., hindsight experience replay (HER) (Andrychowicz et al. 2017) and latent goal relabeling (Nair et al. 2018), which have been shown to improve sample efficiency in RL settings. However, when the goals are NL and live in a different space than the states, goal relabeling cannot be applied directly, as we cannot easily relabel a state to a NL goal. Cideron et al. (2019) propose to build such a relabeling function with a SEQ2SEQ model that takes a trajectory as input and outputs a relabeled NL goal. However, this approach uses the ground-truth reward function, and their experiments are again based on simple template-based NL goals. Moreover, as we will show in this paper, applying HER to IRL with NL goals does not significantly improve performance, because the reward function does not generalize well to relabeled goals.


In this work, we propose a sample efficient IRL algorithm with natural language (NL) goals. To the best of our knowledge, our work is the first IRL algorithm that works with real human-generated NL goals (as opposed to template-based language goals (Fu, Luo, and Levine 2018; Bahdanau et al. 2019)) in a real-world vision-based environment. Our contributions include: (1) proposing a NL goal-conditioned adversarial inverse reinforcement learning algorithm, (2) specifying a variational goal generator that efficiently samples diverse NL goals given a trajectory, (3) utilizing goal relabeling and sampling strategies to improve the generalization of both the policy and the reward function to new NL goals and new environments, and (4) proposing a self-supervised learning mechanism to further improve generalization in new environments. Through these innovations, we show that our algorithm significantly outperforms strong baselines on the Room-2-Room dataset (Anderson et al. 2018).

2 Problem Formulation

A task is defined as a pair (E, G), where E is an environment that the agent can interact with and G is a NL goal that the agent has to fulfill. G = {w1, w2, ..., wN} consists of N words. For example, in our experiments, E is a realistic 3D indoor environment and G is a human-generated navigation instruction. The true reward function of each task is unknown, but there exists a set of expert demonstrations, each consisting of a task (E, G) and a trajectory τ. The trajectory τ = {s1, a1, s2, a2, ..., sT, aT} consists of a sequence of T states perceived by the expert and the actions taken by the expert given those states. For example, in our experiments the states are first-person views in a 3D indoor environment and the actions are movements towards a direction in the environment. While our proposed methods can be applied to both continuous and discrete action spaces, we focus on discrete action spaces in this paper. The objective of the algorithm is to learn a policy that imitates expert demonstrations such that, given a new NL goal in an existing or new environment, the agent can perform a sequence of actions to achieve that goal. In this paper, we use G to represent an arbitrary goal, G to represent the set of goals in the expert demonstrations, and Gi to represent the i-th goal in G. The same notation style applies to E and τ as well.
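As a concrete illustration of this setup, the sketch below defines minimal container types for trajectories and expert demonstrations; the class and field names are ours, not part of the paper.

```python
from dataclasses import dataclass
from typing import Any, List

@dataclass
class Trajectory:
    """tau = {s1, a1, ..., sT, aT}: states perceived by the expert and actions taken."""
    states: List[Any]   # e.g., encoded first-person views
    actions: List[Any]  # e.g., chosen neighboring viewpoints (discrete actions)

@dataclass
class Demonstration:
    """One expert demonstration: a task (E, G) plus the expert trajectory tau."""
    environment: Any        # E: environment the agent interacts with
    goal: List[str]         # G = {w1, ..., wN}: natural language goal, as words
    trajectory: Trajectory  # tau
```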

3 Method

3.1 Preliminary: Adversarial Inverse Reinforcement Learning (AIRL)

AIRL (Fu, Luo, and Levine 2018) is based on maximum entropy inverse reinforcement learning (MaxEntIRL) (Ziebart et al. 2008). In MaxEntIRL, the probability of a trajectory is defined as pθ(τ) = (1/Z) exp(fθ(τ)), where fθ is the reward function to learn and Z = Σ_τ exp(fθ(τ)) is the partition function. MaxEntIRL learns the reward function from expert demonstrations by maximizing the likelihood of trajectories in the expert demonstrations: max_θ E_{τ∼D}[log pθ(τ)]. Maximizing this likelihood is challenging because the partition function Z is hard to estimate. AIRL maximizes this likelihood using a generative adversarial network (GAN) (Goodfellow et al. 2014). The GAN generator is the policy π to learn, and the GAN discriminator is defined as

Dθ,φ(s, a, s′) = exp{fθ,φ(s, a, s′)} / (exp{fθ,φ(s, a, s′)} + π(a|s))    (1)

where fθ,φ(s, a, s′) = gθ(s, a) + γhφ(s′) − hφ(s), gθ(s, a) is the reward function to learn, and hφ(s′) − hφ(s) is the reward shaping term. The policy π and the discriminator are updated alternately.
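As an illustration, the snippet below evaluates the discriminator in Equation (1) in log space for numerical stability; it is a minimal PyTorch sketch (the paper does not prescribe a framework), and the function and argument names are ours.

```python
import torch
import torch.nn.functional as F

def airl_discriminator_logits(f_value, log_pi):
    """Log-space evaluation of Equation (1): D = exp(f) / (exp(f) + pi(a|s)).

    f_value: f_{theta,phi}(s, a, s') for a batch, shape [B]
    log_pi:  log pi(a|s) for the same batch, shape [B]
    Returns (log D, log(1 - D)), which can be fed to a binary
    cross-entropy objective for the discriminator update.
    """
    # log D = f - log(exp(f) + pi) = -softplus(log_pi - f)
    log_d = -F.softplus(log_pi - f_value)
    # log(1 - D) = log_pi - log(exp(f) + pi)
    log_one_minus_d = log_pi - torch.logaddexp(f_value, log_pi)
    return log_d, log_one_minus_d
```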

3.2 Inverse Reinforcement Learning with Natural Language Goals

Our AIRL-based solution builds on language-conditioned reward learning (LC-RL) (Fu et al. 2019) by adding template-free NL capabilities and continuous state space support. In this problem setting, the policy and the reward function are conditioned on a NL goal G, so we first extend the discriminator in Equation (1) to fθ,φ(s, a, s′, G) = gθ(s, a, G) + γhφ(s′, G) − hφ(s, G) such that

gθ(s, a, G) = MLP([es; Att(s, G); ea])
hφ(s, G) = MLP([es; Att(s, G)])

where MLP(·) is a multilayer perceptron, es and ea are the embeddings of state s and action a, respectively, and Att(s, G) is an attention function: Att(s, G) = Σi αi ewi, where ewi is the word embedding of wi in G, αi = (Linear(es) · ewi) / (Σi Linear(es) · ewi), and Linear(·) is a single-layer perceptron. We use soft actor-critic (SAC) (Haarnoja et al. 2018), one of the state-of-the-art off-policy RL algorithms, to optimize the policy π given the reward function gθ(s, a, G). SAC includes a policy network that predicts an action given a state and a goal, and a Q-network that estimates the Q-value of an action given a state and a goal. We define the policy network as

πw(s, a, G) = Softmax(ea · MLP([es; Att(s, G)]))

and we define the Q-network qψ(s, a, G) with the same network architecture as gθ. Compared with on-policy algorithms such as TRPO (Schulman et al. 2015), SAC utilizes a replay buffer to re-use sampled trajectories. The replay buffer is beneficial to training both the discriminator and the policy. To update the discriminator, we sample negative (s, a, s′, G) examples from the replay buffer and positive (s, a, s′, G) examples from the expert demonstrations. To update the policy, we sample a batch of (s, a, s′, G) from the replay buffer and use gθ to estimate their rewards; then we update the Q-network and the policy network using these reward-augmented samples. We modify SAC slightly to support discrete action spaces. Appendix B contains details about model architecture and optimization.
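The following PyTorch sketch shows one way to realize gθ(s, a, G) with the goal attention described above. It is illustrative only: layer sizes are arbitrary, and it substitutes a softmax over the dot-product scores where the paper normalizes the raw dot products directly.

```python
import torch
import torch.nn as nn

class GoalConditionedReward(nn.Module):
    """Sketch of g_theta(s, a, G) = MLP([e_s; Att(s, G); e_a])."""

    def __init__(self, state_dim, action_dim, word_dim, hidden_dim=256):
        super().__init__()
        self.query = nn.Linear(state_dim, word_dim)  # Linear(e_s) used to score goal words
        self.mlp = nn.Sequential(
            nn.Linear(state_dim + word_dim + action_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def attend(self, e_s, e_w):
        # e_s: [B, state_dim]; e_w: [B, N, word_dim] word embeddings of the goal
        scores = torch.einsum("bd,bnd->bn", self.query(e_s), e_w)
        alpha = torch.softmax(scores, dim=-1)          # attention weights over goal words
        return torch.einsum("bn,bnd->bd", alpha, e_w)  # Att(s, G)

    def forward(self, e_s, e_a, e_w):
        att = self.attend(e_s, e_w)
        return self.mlp(torch.cat([e_s, att, e_a], dim=-1)).squeeze(-1)
```

The potential hφ(s, G) and the policy network πw can reuse the same attention module, dropping or adding the action embedding as in the equations above.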

3.3 A Variational Goal Generator for Improved Generalization and Sample Efficiency

While SAC has better sample efficiency than many on-policy RL algorithms, as we will show in our experiments, in environments with complicated NL goals and high-dimensional state spaces (e.g., vision-based instruction following), AIRL with SAC performs only slightly better than supervised-learning-based behavior cloning, and the learned policy and discriminator do not generalize well to new goals or new environments. In this section, we show that by learning a variational goal generator, we can enrich the training data for both the discriminator and the policy, which leads to large improvements in sample efficiency and generalization for both the discriminator and the policy in NL settings.

A goal generator takes a trajectory τ (a sequence of states and actions) as input and outputs a NL goal G. Given expert demonstrations, a straightforward way of learning a goal generator is to train an encoder-decoder model that encodes a trajectory and decodes a NL goal. However, in reality, there are many possible ways to describe a specific trajectory using NL, and the description will likely be biased by variance in people's expression preferences. For example, consider a trajectory in which the agent goes to the kitchen and washes the dishes. Two possible goal descriptions for this trajectory are go to the kitchen and wash the dishes and clean the dishes on the dining table. To better model variations and generate more diverse NL goals, we learn a variational encoder-decoder model as the goal generator. The generative process of the variational goal generator is

µprior, σ²prior = fprior(τ)
z ∼ N(µprior, σ²prior I)
G = fdec(z, τ)

where fprior is an LSTM-based trajectory encoder and fdec is an attention-based goal decoder. The posterior distribution of the latent variable z is approximated by

µposterior, σ²posterior = fposterior(τ, G)
q(z|τ, G) = N(z | µposterior, σ²posterior I)

where fposterior is an LSTM-based trajectory-and-goal encoder. We then maximize the variational lower bound −DKL(q(z|τ, G) || p(z)) + E_{q(z|τ,G)}[log p(G|z, τ)]. Network architecture and optimization details are in Appendix B. Given the learned variational goal generator, we propose three goal relabeling and sampling strategies to improve generalization, itemized below.
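As a concrete but purely illustrative example, the following sketch computes the negative of this variational lower bound for one batch, assuming the KL term is taken against the learned conditional prior N(µprior, σ²prior I) from the generative process; all tensor names and shapes are assumptions.

```python
import torch
import torch.nn as nn

def goal_generator_loss(mu_prior, logvar_prior, mu_post, logvar_post,
                        logits, target_tokens, pad_id=0):
    """Negative ELBO = KL(q(z|tau,G) || p(z|tau)) - E_q[log p(G|z,tau)].

    mu_* / logvar_*: Gaussian parameters from f_prior and f_posterior, shape [B, latent]
    logits:          decoder outputs over the vocabulary, shape [B, T, vocab]
    target_tokens:   goal word ids, shape [B, T]
    """
    # KL between two diagonal Gaussians, summed over latent dimensions
    kl = 0.5 * (
        logvar_prior - logvar_post
        + (logvar_post.exp() + (mu_post - mu_prior) ** 2) / logvar_prior.exp()
        - 1.0
    ).sum(dim=-1)
    # Reconstruction term: cross-entropy of the decoded goal words
    recon = nn.functional.cross_entropy(
        logits.flatten(0, 1), target_tokens.flatten(),
        ignore_index=pad_id, reduction="none",
    ).view(target_tokens.shape).sum(dim=-1)
    return (kl + recon).mean()
```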

Expert Goal Relabeling (EGR). As discussed previously, multiple goals can be mapped to the same trajectory. We propose to augment the expert demonstrations by generating N other goals for each expert trajectory τi: Gi,n ∼ GoalGenerator(τi), n = 1, 2, ..., N, where Gi,n is the n-th generated goal for τi and GoalGenerator(·) is the learned variational goal generator. For expository and experimental simplicity, we set N = 2, which works well for this dataset. Model performance under different values of N is shown in Appendix D. These newly generated tuples {(Ei, Gi,n, τi)}, n = 1, ..., N, are also treated as expert demonstrations for training. We represent the set of goals in the augmented expert demonstrations by G^1, where the superscript of G represents the round of generation, as described shortly.

Hindsight Goal Relabeling (HGR) for the Discriminator. In the AIRL framework, the quality of the discriminator is crucial to the generalization of the learned policy. A good discriminator learns a good reward function that generalizes well to new goals and new environments, such that the learned policy can also generalize well to new goals and new environments (Fu et al. 2019). To improve the discriminator, we propose to augment the positive examples of the discriminator by relabeling the goals of the sampled trajectories. More specifically, during training, given a goal (Ej, G^1_j), the agent interacts with the environment and samples a trajectory τ^1_j. We then use the variational goal generator to sample a goal for τ^1_j:

τ^1_j ∼ π(Ej, G^1_j)    (2)
G^2_j ∼ GoalGenerator(τ^1_j)    (3)

The tuples (E, G^2, τ^1) are treated as positive examples for the discriminator, as a supplement to the positive examples from expert demonstrations (E, G^1, τ).

Hindsight Goal Sampling (HGS) for Policy Optimization. When optimizing the policy with SAC, we have to sample NL goals to train the policy. One natural way is to sample goals from the expert demonstrations. However, expert demonstrations are relatively scarce, expensive to acquire, and unlikely to cover the goal variance encountered at test time. Meanwhile, as we will show in our experiments, training with a diverse set of goals is beneficial to the generalization of the policy. Therefore, we propose to also sample goals from G^2 so that the policy can train with goals beyond those in the expert demonstrations.

τ^2_j ∼ π(Ej, G^2_j)    (4)
G^3_j ∼ GoalGenerator(τ^2_j)    (5)

Of course, training the policy with G^2 relies on the discriminator's generalization ability to provide reasonable reward estimates for states and actions sampled under G^2. This is ensured by HGR in Section 3.3, as (E, G^2, τ^1) are provided as positive examples for the discriminator.

The process from Equation (2) to Equation (5) can be seen as using G^1 and τ as seeds and iteratively sampling τ^v from G^v using the policy, and then sampling G^(v+1) from τ^v using the goal generator. We can go deeper in this loop to sample more diverse NL goals and trajectories for policy and discriminator training. More specifically, to train the discriminator, positive examples are sampled from (E, G^1, τ) with probability 0.5 and from {(E, G^(v+1), τ^v) | v ≥ 1} with probability 0.5; negative examples are sampled from {(E, G^v, τ^v) | v ≥ 1}. To train the policy, goals are sampled from (E, G^1) with probability 0.5 and from {(E, G^(v+1)) | v ≥ 1} with probability 0.5. Algorithm 1 shows the overall procedure of LangGoalIRL.

Hindsight Experience Replay (HER) for Policy Optimization. A closely related approach, hindsight experience replay (Andrychowicz et al. 2017), has been shown to work very well in RL settings with sparse rewards. In principle, we could easily incorporate HER into our training procedure. That is, when sampling from the replay buffer to optimize policy π, we sample batches not only from {(E, G^v, τ^v) | v ≥ 1}, but also from {(E, G^(v+1), τ^v) | v ≥ 1}.


Algorithm 1 Inverse Reinforcement Learning with Natural Language Goals (LangGoalIRL)

Input: GoalGenerator: the variational goal generator from Section 3.3; D: expert demonstrations; R = ∅: replay buffer; b: batch size; N: number of expert relabeling goals.

1:  D̄ = ∅, Ḡ = ∅
2:  for each (Ei, Gi, τi) ∈ D do
3:      for n = 1, 2, ..., N do
4:          ▷ Expert Goal Relabeling (EGR)
5:          Gi,n ∼ GoalGenerator(τi)
6:      end for
7:      add (Ei, Gi, τi) and {(Ei, Gi,n, τi)}, n = 1..N, to D̄
8:  end for
9:  while not converged do
10:     r1, r2 ∼ Uniform(0, 1)
11:     if r1 < 0.5 then
12:         sample a goal (Ej, Gj) ∼ D̄
13:     else
14:         ▷ Hindsight Goal Sampling (HGS)
15:         sample a goal (Ej, Gj) ∼ Ḡ
16:     end if
17:     sample a trajectory τ′j ∼ π(Ej, Gj)
18:     sample a relabeled goal G′j ∼ GoalGenerator(τ′j)
19:     add (Ej, Gj, G′j, τ′j) to replay buffer R
20:     add (Ej, G′j) to Ḡ
21:     if r2 < 0.5 then
22:         sample a batch P+ = {(s^t_k, a^t_k, s^{t+1}_k, G_k)}, k = 1..b, ∼ D̄
23:     else
24:         ▷ Hindsight Goal Relabeling (HGR)
25:         sample a batch P+ = {(s^t_k, a^t_k, s^{t+1}_k, G′_k)}, k = 1..b, ∼ R
26:     end if
27:     sample a batch P− = {(s^t_k, a^t_k, s^{t+1}_k, G_k)}, k = 1..b, ∼ R
28:     update discriminator parameters with P+ and P− as positive and negative examples, respectively
29:     sample a batch Q = {(s^t_k, a^t_k, s^{t+1}_k, G_k)}, k = 1..b, ∼ R
30:     expand each entry of Q with reward r^t_k = gθ(s^t_k, a^t_k, G_k)
31:     optimize π using Soft Actor-Critic with Q
32: end while

Here D̄ denotes the set of augmented expert demonstrations (line 7) and Ḡ the pool of relabeled goals (line 20).
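For concreteness, here is a minimal Python sketch of one iteration of Algorithm 1 (lines 10-31). All helper objects (expert_demos, replay_buffer, goal_generator, policy, discriminator, sac) are hypothetical stand-ins with the obvious interfaces; only the sampling logic is shown, not the authors' implementation.

```python
import random

def training_iteration(expert_demos, relabeled_goals, replay_buffer,
                       goal_generator, policy, discriminator, sac, batch_size):
    """One iteration of Algorithm 1 (lines 10-31), written against hypothetical helpers."""
    r1, r2 = random.random(), random.random()

    # Lines 11-16: sample a goal from the augmented expert demonstrations,
    # or from previously relabeled goals (Hindsight Goal Sampling)
    if r1 < 0.5 or not relabeled_goals:
        env, goal = expert_demos.sample_goal()
    else:
        env, goal = random.choice(relabeled_goals)

    # Lines 17-20: roll out the policy, relabel the trajectory, store both
    trajectory = policy.rollout(env, goal)
    relabeled_goal = goal_generator.sample(trajectory)
    replay_buffer.add(env, goal, relabeled_goal, trajectory)
    relabeled_goals.append((env, relabeled_goal))

    # Lines 21-28: discriminator update. Positives come from expert demonstrations
    # or from replay transitions paired with their relabeled goals (Hindsight Goal
    # Relabeling); negatives come from the replay buffer.
    if r2 < 0.5:
        positives = expert_demos.sample_transitions(batch_size)
    else:
        positives = replay_buffer.sample_transitions(batch_size, use_relabeled_goal=True)
    negatives = replay_buffer.sample_transitions(batch_size, use_relabeled_goal=False)
    discriminator.update(positives, negatives)

    # Lines 29-31: label a replay batch with the learned reward g_theta
    # and take a Soft Actor-Critic step
    batch = replay_buffer.sample_transitions(batch_size, use_relabeled_goal=False)
    rewards = [discriminator.reward(s, a, g) for (s, a, s_next, g) in batch]
    sac.update(batch, rewards)
```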

The difference between HER and HGR is that the former is goal relabeling for the policy while the latter is goal relabeling for the discriminator. HER is most effective when rewards are sparse; however, in our setting, the rewards provided by the discriminator are not sparse, and we do not observe a boost in performance after applying HER. This is discussed in greater detail in Section 4.

3.4 Self-Supervised Learning in New Environments

In this section, we specifically examine the scenario where the learned policy is deployed to a new environment. For example, after training an embodied agent to perform tasks in a set of buildings, we may deploy this agent to a new building with different floor plans. We assume that we have access to these new environments, but we do not have any expert demonstrations or NL goals in the new environments. Note that we cannot directly apply NL goals from existing environments to new environments, because goals are tied to their environments. For example, in instruction-following tasks, an example goal is go downstairs and walk past the living room. However, there may be no stairs in a new environment. As a result, we cannot just sample goals from expert demonstrations to train in the new environments.

The high-level idea of our method is to first learn a policy πp that is goal-agnostic. That is, the policy selects an action purely based on the state, without conditioning on a goal. This model gives us a prior over how the agent will act given a state. We use behavior cloning to train this policy on expert demonstrations. More details are available in Appendix B. We then sample trajectories in the new environments using this policy and sample goals for these trajectories using the goal generator:

τnew ∼ πp(Enew)

Gnew ∼ GoalGenerator(τnew)

(Enew, Gnew, τnew) are treated as expert demonstrations in the new environments. We then fine-tune the discriminator and the policy in the new environments using Algorithm 1 (without expert goal relabeling). The self-supervised learning signals come from both the generated expert demonstrations for the new environments and the hindsight goal relabeling and sampling proposed in Section 3.3. The generated expert demonstrations can be seen as seeds to bootstrap hindsight goal relabeling and sampling.
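The sketch below illustrates this bootstrapping step; the helper objects are hypothetical, and the default of 15,000 sampled trajectories follows the experimental setting described in Section 4.

```python
import random

def bootstrap_new_environment(new_envs, goal_agnostic_policy, goal_generator,
                              num_demos=15000):
    """Self-supervised bootstrapping from Section 3.4: roll out a goal-agnostic,
    behavior-cloned policy in the new environments and label the resulting
    trajectories with the variational goal generator."""
    pseudo_demos = []
    for _ in range(num_demos):
        env = random.choice(new_envs)
        trajectory = goal_agnostic_policy.rollout(env)  # tau_new ~ pi_p(E_new)
        goal = goal_generator.sample(trajectory)        # G_new ~ GoalGenerator(tau_new)
        pseudo_demos.append((env, goal, trajectory))
    # These tuples are then used as expert demonstrations to fine-tune the
    # discriminator and policy with Algorithm 1 (without expert goal relabeling).
    return pseudo_demos
```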

4 Experiments

While LangGoalIRL applies to any IRL problem with natural language (NL) goals, our experiments focus on a challenging vision-based instruction following problem. We evaluate our model on the Room-2-Room (R2R) dataset (Anderson et al. 2018), a visually-grounded NL navigation task in realistic 3D indoor environments. The dataset contains 7,189 routes sampled from 90 real-world indoor environments. A route is a sequence of viewpoints in the indoor environments with the agent's first-person camera views. Each route is annotated by humans with 3 navigation instructions. The average length of the navigation instructions is 29 words. The dataset is split into train (61 environments and 14,025 instructions), seen validation (the same 61 environments as the train set, and 1,020 instructions), unseen validation (11 new environments and 2,349 instructions), and test (18 new environments and 4,173 instructions). We do not use the test set for evaluation because the ground-truth routes of the test set are not released. Along with the dataset, a simulator is provided to allow the embodied agent to interact with the environments. The observation of the agent is the first-person camera view, which is a panoramic image. The actions of the agent are the nearby viewpoints that the agent can move to. For details about the dataset, please see Appendix A. The R2R dataset has become a very popular testbed for language grounding in visual context (Tan, Yu, and Bansal 2019; Wang et al. 2019; Fried et al. 2018; Zhu et al. 2020).

We evaluate model performance based on the trajectory success rate. Each navigation instruction in the R2R dataset is labeled with a goal position. Following Anderson et al. (2018), the agent is considered to have successfully reached the goal if the navigation error (the distance between the stop position and the goal position) is less than 3 meters. Note that the goal position is not available to our model, as we use navigation instructions as goals.
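A minimal sketch of this metric, assuming a hypothetical list of (stop_position, goal_position) pairs expressed in a common metric coordinate frame:

```python
import math

def success_rate(episodes, threshold_meters=3.0):
    """Success rate as defined by Anderson et al. (2018): an episode counts
    as a success if the agent stops within 3 meters of the goal position."""
    def dist(p, q):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
    successes = sum(1 for stop, goal in episodes if dist(stop, goal) < threshold_meters)
    return successes / len(episodes)
```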

We compare our model with the following baselines: (1) LangGoalIRL BASE, our proposed model from Section 3.2 without the goal relabeling/sampling strategies proposed in Section 3.3. (2) Behavior Cloning, i.e., imitation learning as supervised learning. The model shares the same architecture as the policy network of LangGoalIRL and is trained to minimize cross-entropy loss with the actions in the expert demonstrations as labels. (3) LC-RL (Sampling) (Fu et al. 2019), which also uses MaxEntIRL (Ziebart et al. 2008) to learn a reward function. It optimizes the policy exactly in a grid environment, which is not scalable to our experimental setting, so we use AIRL with SAC for LC-RL to optimize the policy. In this case, LC-RL is very similar to LangGoalIRL BASE, except that LC-RL simply concatenates state and goal embeddings as input to the policy and reward function, while LangGoalIRL BASE uses the attention mechanism Att(s, G) from Section 3.2. (4) AGILE (Bahdanau et al. 2019), which learns a discriminator (reward function) that predicts whether a state is the goal state for a NL goal. The positive examples for the discriminator are the (NL goal, goal state) pairs in the expert demonstrations, and the negative examples are sampled from the replay buffer of SAC. (5) HER (Andrychowicz et al. 2017; Cideron et al. 2019), which uses our learned variational goal generator to relabel trajectories. This corresponds to the final strategy in Andrychowicz et al. (2017) applied to LangGoalIRL BASE. (6) Upper Bound (Tan, Yu, and Bansal 2019). Recently, several vision-language navigation models (Tan, Yu, and Bansal 2019; Wang et al. 2019; Fried et al. 2018; Zhu et al. 2020) have been developed and evaluated on the R2R dataset. However, they assume the goal positions or the optimal actions at any state are known, and assume the environments can be exhaustively searched, without sample efficiency considerations, in order to do training data augmentation. As a result, their problem settings are not really comparable, but we include Tan, Yu, and Bansal (2019)'s result as a potential upper bound. Note that without the additional information described above, their model performance reverts to our Behavior Cloning baseline. We discuss this further in Section 5. For implementation details of our algorithms and the baselines, please refer to Appendix B.

The performance of our algorithm and the baselines is shown in Figure 1, Table 1, and Table 2. From the results, we would like to highlight the following observations.

1. LangGoalIRL outperforms baselines by a large margin. From Table 1 and Figure 1(a) we can see that LangGoalIRL achieves a 0.530 success rate on the seen validation set, a 47.63% improvement over LC-RL (sampling) and a 129% improvement over AGILE. Rewards learned by AGILE are binary and sparse, which may be one of the main reasons why AGILE performs even worse than Behavior Cloning. The only difference between LC-RL (sampling) and LangGoalIRL BASE is the attention mechanism over NL goals, which indicates that an attention mechanism is important when NL goals are complicated. Other powerful attention mechanisms, such as the Transformer architecture (Vaswani et al. 2017), could also easily be considered.

                     seen validation     unseen validation
Behavior Cloning     0.445 ± 0.0122      0.300 ± 0.0114
LC-RL (sampling)     0.359 ± 0.0117      0.216 ± 0.0081
AGILE                0.231 ± 0.0173      0.253 ± 0.0269
HER                  0.464 ± 0.0126      0.306 ± 0.0173
LangGoalIRL BASE     0.449 ± 0.0130      0.308 ± 0.0087
LangGoalIRL          0.530 ± 0.0138      0.357 ± 0.0089
 + self-supervised   –                   0.491 ± 0.0065
Upper Bound          0.621               0.645

Table 1: Success rate after 1 million environment interactions. LangGoalIRL + self-supervised is further fine-tuned in the unseen environments for 1 million environment interactions (explained in Figure 1(c)).

Expert Goal Relabeling   Hindsight Goal Relabeling   Hindsight Goal Sampling   seen validation success rate
–                        –                           –                         0.449 ± 0.0130
✓                        –                           –                         0.450 ± 0.0122
–                        ✓                           –                         0.427 ± 0.0048
–                        –                           ✓                         0.399 ± 0.0124
✓                        ✓                           –                         0.504 ± 0.0185
✓                        –                           ✓                         0.444 ± 0.0134
–                        ✓                           ✓                         0.503 ± 0.0117
✓                        ✓                           ✓                         0.530 ± 0.0138

Table 2: Ablation study of the three proposed strategies on the seen validation set. A complete ablation study that includes HER is in Table 4 of Appendix D.

The success rate of LangGoalIRL is 18.04% and 15.91% higher than LangGoalIRL BASE on seen validation and unseen validation, respectively, which shows that the three proposed strategies efficiently improve the generalization of the policy to new NL goals and new environments. From Figure 1(b) we see that, when combined with HER, LangGoalIRL converges more slowly and performs worse than LangGoalIRL alone. HER has been shown to deal efficiently with reward sparsity; however, the rewards learned by AIRL are not sparse. From Table 1 we see that HER alone barely outperforms LangGoalIRL BASE. As we will discuss shortly, HER only improves the sample efficiency of the generator (policy), while the generalization of the discriminator (reward function) is what is crucial to the overall performance of the policy. Finally, from Figure 1(b) we can also see that LangGoalIRL BASE consistently outperforms Behavior Cloning after about 200k interactions, which effectively indicates the benefit of using inverse reinforcement learning over behavior cloning.

2. Hindsight goal relabeling and expert goal relabeling improve the generalization of the discriminator (reward function); this generalization is crucial to the performance of the policy. From Table 2 and Figure 1(f), we see that if we take out HGR, the success rate drops significantly from 0.530 to 0.444. This shows that HGR plays a key role in learning a better reward function by enriching the positive examples of the discriminator. Meanwhile, when applying HGS without HGR, the success rate drops from 0.503 to 0.399. This is because the discriminator cannot generalize well to sampled goals when HGR is not applied, which shows the importance of discriminator generalization.


Figure 1: Model performance on both seen and unseen validation. Panels: (a) baseline performance in seen validation; (b) HER variants' performance in seen validation; (c) self-supervised learning in unseen validation; (d) more self-supervised learning in unseen validation; (e) ablation study in seen validation; (f) more ablation study in seen validation. LangGoalIRL SS in panels (c) and (d) represents self-supervised learning in unseen environments. LangGoalIRL SS (15K) means that 15,000 demonstrations are sampled by the goal-agnostic policy πp. Random Explore means that a random policy instead of πp is used to sample demonstrations. LangGoalIRL SS G2 means that the policy does not iteratively sample goals from {(E, G^v) | v > 2} as described in Section 3.3.

However, from Table 2 and Figure 1(e) we can also see that if we only keep HGR, the success rate drops to 0.427, which is even lower than LangGoalIRL BASE. We observe that in this case the goals generated by the variational goal generator only appear in the positive examples of the discriminator, so the discriminator easily identifies these goals and simply assigns them high rewards regardless of what action is taken. When EGR and HGR are applied together, relabeled goals appear in both positive and negative examples, and the success rate improves from 0.427 to 0.504.

3. Self-supervised learning largely improves performance in unseen environments. Table 1 shows that with self-supervised learning, LangGoalIRL achieves a success rate of 0.491, a 59.42% improvement over LangGoalIRL BASE and a 37.54% improvement over LangGoalIRL w/o self-supervised learning. This shows that the proposed self-supervised learning algorithm can significantly improve policy performance even though neither expert demonstrations nor NL goals in the new environments are given. The number of trajectories sampled by the goal-agnostic policy πp in new environments is 15,000 by default. In Figure 1(c), we show the performance of a baseline model where the embodied agent randomly explores the new environments (randomly selects an action at any state, but does not revisit visited viewpoints) rather than using πp to sample trajectories. We set the number of sampled trajectories to 15,000 and 5,000. We see that the agent performance drops during the early stages, as randomly explored trajectories are not good seeds to bootstrap Algorithm 1 in new environments. After the early stages, the model performance gradually increases as HGR and HGS provide some self-supervised signals, but overall the performance remains lower than LangGoalIRL. More results are in Table 5 of Appendix D. Finally, self-supervised learning can also be applied to existing environments by augmenting the expert demonstrations in the seen validation set. While this is not our target setting, we include these results in Table 6 of Appendix D.

4. Hindsight goal sampling improves the generalization of the policy by enabling the policy to explore a more diverse set of goals. From Table 2 and Figure 1(f) we can see that HGS further improves the policy performance from 0.504 to 0.530 on the seen validation set. Figure 1(d) shows the impact of HGS on self-supervised learning in new environments. The baseline, LangGoalIRL SS G2, samples goals only from G^1 and G^2 defined in Section 3.3 and does not iteratively sample goals from {(E, G^v) | v > 2}. As the goals sampled are less diverse, LangGoalIRL SS G2 performs worse than LangGoalIRL SS given 5,000, 10,000, and 15,000 generated expert demonstrations in new environments. More experiments are in Table 5 of Appendix D.


Approach                                            Ground-truth reward?              Goal type   Student-forcing?   Goal/path augmentation
LangGoalIRL (ours)                                  No                                NL          No                 EGR, HGS, HGR
Language goal-conditioned IRL                       No                                NL          No                 No
  (Fu et al. 2019; Bahdanau et al. 2019)
Language goal-conditioned RL                        Yes                               NL          No                 HER
  (Cideron et al. 2019; Jiang et al. 2019)
State goal-conditioned IRL (Ding et al. 2019)       No                                States      No                 HER
State goal-conditioned RL                           Yes                               States      No                 HER
  (Andrychowicz et al. 2017; Plappert et al. 2018)
SoTA models on the R2R dataset                      No (Yes if RL fine-tuning used)   NL          Yes                All shortest paths between any
  (Tan, Yu, and Bansal 2019; Fried et al. 2018;                                                                      two locations in the environments
  Zhu et al. 2020)

Table 3: We compare our algorithm with closely related works that fall into 5 different problem settings. In the setting we are interested in, the goals of the tasks are NL goals, and neither the ground-truth reward function (required by RL) nor the optimal action at any state (required by student-forcing) is available.

5 Related Works

Our paper focuses on proposing a sample efficient algorithm for IRL with natural language (NL) goals. Table 3 summarizes the differences between our LangGoalIRL algorithm and recent closely related works on goal-conditioned RL and imitation learning (IL). Goal-conditioned RL and IL have been explored by many prior works (Schaul et al. 2015; Florensa et al. 2017; Nair et al. 2018; Plappert et al. 2018; Ding et al. 2019). Usually, the task is to learn a policy that is parameterized by a goal and can generalize to new goals. In most prior works, the goals and states are in the same space, such as positions in a Cartesian coordinate system (Plappert et al. 2018; Ding et al. 2019). This corresponds to the state goal-conditioned IRL and RL settings in Table 3. However, in this paper we investigate a different setting where the goals are NL goals, which corresponds to the language goal-conditioned IRL and RL settings in Table 3. The closest prior works to our algorithm are language-conditioned IRL and reward learning (Fu et al. 2019; Bahdanau et al. 2019; MacGlashan et al. 2015; Williams et al. 2018; Goyal, Niekum, and Mooney 2019). Goyal, Niekum, and Mooney (2019) propose to train a model to predict whether a language instruction describes an action in a trajectory, and use the predictions to perform reward shaping. Bahdanau et al. (2019) propose to use a GAN (Goodfellow et al. 2014; Ho and Ermon 2016) to learn a discriminator that predicts whether a state is the goal state of a NL instruction. Fu et al. (2019) propose to use the MaxEntIRL (Ziebart et al. 2008) framework to learn a language-conditioned reward function for instruction following tasks. The language instructions in Bahdanau et al. (2019) and Fu et al. (2019)'s experiments are generated by templates and are much easier to understand than those in our dataset. Meanwhile, LangGoalIRL demonstrates significantly better generalization than these two prior works.

There are also many prior works on goal relabeling to improve sample efficiency, but none of them can be directly applied to the IRL-with-NL-goals setting. Andrychowicz et al. (2017) propose hindsight experience replay, which samples additional goals for each transition in the replay buffer and can efficiently deal with the reward sparsity problem. Nair et al. (2018) propose to sample additional diverse goals from a learned latent space. Ding et al. (2019) propose to relabel trajectories using the states within the trajectories. These algorithms do not work with NL goals. Cideron et al. (2019) and Jiang et al. (2019) generalize HER to the language setting and relabel trajectories with hindsight language instructions. However, in this paper we show that, when doing IRL, simply applying HER to policy optimization does not improve policy performance, and it is critical to improve the generalization of the reward function.

Finally, there are multiple recent works (Tan, Yu, and Bansal 2019; Wang et al. 2019; Fried et al. 2018; Zhu et al. 2020) on the R2R dataset that focus on solving vision-language navigation (Anderson et al. 2018). This corresponds to the last row of Table 3. These works do not focus on sample efficient IRL algorithms and use student-forcing (Anderson et al. 2018) to train their supervised learning models or pre-train their RL models. With student-forcing, the policy is assumed to always have access to the optimal action at any state it encounters during training, which is an even stronger assumption than knowing the ground-truth reward function. Fried et al. (2018) and Tan, Yu, and Bansal (2019) propose to learn a speaker model that back-translates a given trajectory to a NL goal, which is similar to our goal generator without latent variables. However, the purpose of the speaker model is to augment training data before training, rather than to relabel goals during training. To augment training data, they generate the shortest path between any two viewpoints. This requires knowing the topological graph of viewpoints before training. Moreover, they augment the training set with every shortest path that is not included in the original training set. This assumes the environments can be exhaustively searched without sampling efficiency considerations. Other techniques proposed in these papers, such as environment dropout (Tan, Yu, and Bansal 2019) and progress monitoring (Zhu et al. 2020), are orthogonal to our work. For a survey of reinforcement learning with NL, please see Luketina et al. (2019).

6 Conclusion

In this work, we propose a sample efficient algorithm for IRL with NL goals. Observing limitations of existing works regarding language-conditioned IRL, LangGoalIRL emphasizes template-free NL goal specification and sample efficiency when generalizing to new goals and environments. Specifically, we propose learning a variational goal generator that can relabel trajectories and sample diverse goals. Based on this variational goal generator, we describe three strategies to improve the generalization and sample efficiency of the language-conditioned policy and reward function. Empirical results demonstrate that LangGoalIRL outperforms existing baselines by a large margin and generalizes well to new natural language goals and new environments, thus increasing flexibility of expression and domain transfer in providing instructions to autonomous agents.


References

Abbeel, P.; and Ng, A. Y. 2004. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the Twenty-First International Conference on Machine Learning, 1.

Anderson, P.; Wu, Q.; Teney, D.; Bruce, J.; Johnson, M.; Sunderhauf, N.; Reid, I.; Gould, S.; and van den Hengel, A. 2018. Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 3674–3683.

Andrychowicz, M.; Wolski, F.; Ray, A.; Schneider, J.; Fong, R.; Welinder, P.; McGrew, B.; Tobin, J.; Abbeel, O. P.; and Zaremba, W. 2017. Hindsight experience replay. In Advances in Neural Information Processing Systems, 5048–5058.

Bahdanau, D.; Hill, F.; Leike, J.; Hughes, E.; Hosseini, A.; Kohli, P.; and Grefenstette, E. 2019. Learning to understand goal specifications by modelling reward. International Conference on Learning Representations.

Chang, A.; Dai, A.; Funkhouser, T.; Halber, M.; Niessner, M.; Savva, M.; Song, S.; Zeng, A.; and Zhang, Y. 2017. Matterport3D: Learning from RGB-D Data in Indoor Environments. International Conference on 3D Vision (3DV).

Cideron, G.; Seurin, M.; Strub, F.; and Pietquin, O. 2019. Self-Educated Language Agent With Hindsight Experience Replay For Instruction Following. arXiv preprint arXiv:1910.09451.

Ding, Y.; Florensa, C.; Abbeel, P.; and Phielipp, M. 2019. Goal-conditioned imitation learning. In Advances in Neural Information Processing Systems, 15298–15309.

Finn, C.; Levine, S.; and Abbeel, P. 2016. Guided cost learning: Deep inverse optimal control via policy optimization. In International Conference on Machine Learning, 49–58.

Florensa, C.; Held, D.; Geng, X.; and Abbeel, P. 2017. Automatic goal generation for reinforcement learning agents. arXiv preprint arXiv:1705.06366.

Fried, D.; Hu, R.; Cirik, V.; Rohrbach, A.; Andreas, J.; Morency, L.-P.; Berg-Kirkpatrick, T.; Saenko, K.; Klein, D.; and Darrell, T. 2018. Speaker-follower models for vision-and-language navigation. In Advances in Neural Information Processing Systems, 3314–3325.

Fu, J.; Korattikara, A.; Levine, S.; and Guadarrama, S. 2019. From language to goals: Inverse reinforcement learning for vision-based instruction following. arXiv preprint arXiv:1902.07742.

Fu, J.; Luo, K.; and Levine, S. 2018. Learning Robust Rewards with Adversarial Inverse Reinforcement Learning. In International Conference on Learning Representations.

Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; and Bengio, Y. 2014. Generative adversarial nets. In Advances in Neural Information Processing Systems, 2672–2680.

Goyal, P.; Niekum, S.; and Mooney, R. J. 2019. Using Natural Language for Reward Shaping in Reinforcement Learning. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, 2385–2391.

Haarnoja, T.; Zhou, A.; Hartikainen, K.; Tucker, G.; Ha, S.; Tan, J.; Kumar, V.; Zhu, H.; Gupta, A.; Abbeel, P.; et al. 2018. Soft actor-critic algorithms and applications. arXiv preprint arXiv:1812.05905.

He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770–778.

Ho, J.; and Ermon, S. 2016. Generative adversarial imitation learning. In Advances in Neural Information Processing Systems, 4565–4573.

Jiang, Y.; Gu, S. S.; Murphy, K. P.; and Finn, C. 2019. Language as an abstraction for hierarchical deep reinforcement learning. In Advances in Neural Information Processing Systems, 9414–9426.

Kaelbling, L. P. 1993. Learning to achieve goals. In Proceedings of the Thirteenth International Joint Conference on Artificial Intelligence, 1094–1098.

Kuefler, A.; Morton, J.; Wheeler, T.; and Kochenderfer, M. 2017. Imitating driver behavior with generative adversarial networks. In 2017 IEEE Intelligent Vehicles Symposium (IV), 204–211. IEEE.

Lovejoy, W. S. 1991. A survey of algorithmic methods for partially observed Markov decision processes. Annals of Operations Research 28(1): 47–65.

Luketina, J.; Nardelli, N.; Farquhar, G.; Foerster, J.; Andreas, J.; Grefenstette, E.; Whiteson, S.; and Rocktaschel, T. 2019. A survey of reinforcement learning informed by natural language. arXiv preprint arXiv:1906.03926.

MacGlashan, J.; Babes-Vroman, M.; desJardins, M.; Littman, M.; Muresan, S.; Squire, S.; Tellex, S.; Arumugam, D.; and Yang, L. 2015. Grounding English Commands to Reward Functions. In Proceedings of Robotics: Science and Systems. Rome, Italy.

Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A. A.; Veness, J.; Bellemare, M. G.; Graves, A.; Riedmiller, M.; Fidjeland, A. K.; Ostrovski, G.; et al. 2015. Human-level control through deep reinforcement learning. Nature 518(7540): 529–533.

Nair, A. V.; Pong, V.; Dalal, M.; Bahl, S.; Lin, S.; and Levine, S. 2018. Visual reinforcement learning with imagined goals. In Advances in Neural Information Processing Systems, 9191–9200.

Osa, T.; Pajarinen, J.; Neumann, G.; Bagnell, J. A.; Abbeel, P.; Peters, J.; et al. 2018. An algorithmic perspective on imitation learning. Foundations and Trends® in Robotics 7(1-2): 1–179.

Pennington, J.; Socher, R.; and Manning, C. D. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1532–1543.


Plappert, M.; Andrychowicz, M.; Ray, A.; McGrew, B.; Baker, B.; Powell, G.; Schneider, J.; Tobin, J.; Chociej, M.; Welinder, P.; et al. 2018. Multi-goal reinforcement learning: Challenging robotics environments and request for research. arXiv preprint arXiv:1802.09464.

Rhinehart, N.; and Kitani, K. M. 2017. First-person activity forecasting with online inverse reinforcement learning. In Proceedings of the IEEE International Conference on Computer Vision, 3696–3705.

Schaul, T.; Horgan, D.; Gregor, K.; and Silver, D. 2015. Universal value function approximators. In International Conference on Machine Learning, 1312–1320.

Schulman, J.; Levine, S.; Abbeel, P.; Jordan, M.; and Moritz, P. 2015. Trust region policy optimization. In International Conference on Machine Learning, 1889–1897.

Stooke, A.; and Abbeel, P. 2019. rlpyt: A research code base for deep reinforcement learning in PyTorch. arXiv preprint arXiv:1909.01500.

Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; and Wojna, Z. 2016. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2818–2826.

Tan, H.; Yu, L.; and Bansal, M. 2019. Learning to Navigate Unseen Environments: Back Translation with Environmental Dropout. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics, 2610–2621.

Tucker, A.; Gleave, A.; and Russell, S. 2018. Inverse reinforcement learning for video games. arXiv preprint arXiv:1810.10593.

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998–6008.

Wang, X.; Huang, Q.; Celikyilmaz, A.; Gao, J.; Shen, D.; Wang, Y.-F.; Wang, W. Y.; and Zhang, L. 2019. Reinforced cross-modal matching and self-supervised imitation learning for vision-language navigation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 6629–6638.

Williams, E. C.; Gopalan, N.; Rhee, M.; and Tellex, S. 2018. Learning to parse natural language to grounded reward functions with weak supervision. In 2018 IEEE International Conference on Robotics and Automation, 1–7.

Zhu, F.; Zhu, Y.; Chang, X.; and Liang, X. 2020. Vision-language navigation with self-supervised auxiliary reasoning tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 10012–10022.

Ziebart, B. D.; Maas, A. L.; Bagnell, J. A.; and Dey, A. K. 2008. Maximum entropy inverse reinforcement learning. In AAAI, volume 8, 1433–1438. Chicago, IL, USA.

Appendix

A Dataset and Environment Details

The Room-2-Room (R2R) dataset¹ (Anderson et al. 2018) is a dataset for a visually-grounded natural language navigation task in realistic 3D indoor environments. The dataset is built on top of the Matterport3D simulator (Anderson et al. 2018). The Matterport3D simulator provides APIs to interact with 90 3D indoor environments, including homes, offices, churches, and hotels. An agent can navigate between viewpoints in a 3D environment. At each viewpoint, the agent can turn around and perceive a 360-degree panoramic image. Images are all real rather than synthetic (Chang et al. 2017).

The routes in the R2R dataset were sampled from the simulator by first sampling two viewpoints in an environment and then calculating the shortest path from one viewpoint to the other. In total, 7,189 visually diverse routes were sampled. Each route was then annotated by 3 Amazon Mechanical Turk workers. The workers were asked to write directions that provide navigation instructions for the sampled routes, so that a robot could follow the instructions to reach the end viewpoint. In total there are 21,567 instructions, and the average length of each instruction is 29 words. The size of the vocabulary for training is 991. The dataset is split into train (61 environments and 14,025 instructions), seen validation (the same 61 environments as the train set, and 1,020 instructions), unseen validation (11 new environments and 2,349 instructions), and test (18 new environments and 4,173 instructions).

The following are 3 navigation instructions in the dataset describing the same route:

1. Head indoors and take the hallway to the living room. Stop and wait on the right hand side of the purple wall.

2. Turn slightly left and walk across the hallway. Turn slightly right and walk towards the green sofas. Walk to the right entrance where the window panes are at and wait there.

3. Go inside and walk straight down the hallway. Keep going until you see a green couch on the right and then walk towards it. Go to the right side of the TV and stop in the doorway leading to an interior room with a table in it.

For more details about the dataset, please see Anderson et al. (2018).

B Implementation Details

Observation Space: The Matterport3D simulator outputs an RGB image corresponding to the embodied agent's first-person camera view. The image resolution is 640×480. Following Fried et al. (2018) and Tan, Yu, and Bansal (2019), at each viewpoint the agent first looks around and gets a 360-degree panoramic image, which consists of 36 first-person camera view images (12 headings and 3 elevations with 30-degree increments). Image features are 2048-dimensional vectors extracted from a pre-trained ResNet-152 model² (He et al. 2016).

¹ https://bringmeaspoon.org/
² https://github.com/peteanderson80/Matterport3DSimulator


Figure 2: The left side shows a bird's-eye view of the indoor environment. This view is for demonstration purposes and is not available to the embodied agent. The right side is the camera view of the embodied agent at a viewpoint. The agent can look around or move to the next viewpoint.

State Space: The observation of the embodied agent only captures views from the current viewpoint. We define the state of the embodied agent as the set of all previous observations in the current trajectory. We use an LSTM model to encode previous and current observations into an n-dimensional vector. If we updated the LSTM model parameters during training, it would make our problem a POMDP problem (Lovejoy 1991). For simplicity, we pre-train the LSTM model using the base model below and fix its parameters during training.

Action Space: The action space at any given state is the collection of nearby viewpoints. The maximum number of nearby viewpoints is 13. The feature vector of each action (viewpoint) is an m-dimensional vector that is described in the base model below. We add an m-dimensional vector of all zeros to represent the stop action, so the maximum number of actions given a state is 14.

Base Model: We use a base model architecture that is similar to many prior works (Fried et al. 2018; Tan, Yu, and Bansal 2019; Wang et al. 2019). The base model is a sequence-to-sequence model. The encoder takes natural language goals as input and the decoder outputs a sequence of actions. Let {o_{t,i}}, i = 1..36, be the observations (36 first-person camera view images) at time step t. Let θ_{t,i} and φ_{t,i} be the heading and elevation of the camera view o_{t,i}. The feature vector of o_{t,i} is f_{t,i} = [f_ResNet(o_{t,i}); cos θ_{t,i}; sin θ_{t,i}; cos φ_{t,i}; sin φ_{t,i}], where f_ResNet is a ResNet-152 model pre-trained on the ImageNet dataset³. Let the natural language goal be G = {w1, w2, ..., wN}. The natural language goal is encoded by an LSTM model:

h^w_1, h^w_2, ..., h^w_N = LSTM_enc(e^w_1, e^w_2, ..., e^w_N)

where e^w_n is the n-th word embedding. We use GloVe (Pennington, Socher, and Manning 2014) to initialize the word embedding weights. The decoder is also an LSTM model.

³ http://image-net.org/index
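As a minimal sketch of this goal encoder (module names and sizes are illustrative assumptions, not the exact implementation):

import torch.nn as nn

class GoalEncoder(nn.Module):
    """Encode the goal G = {w_1, ..., w_N} into hidden states h_{w_1}, ..., h_{w_N}.
    The embedding layer would be initialized from GloVe vectors (omitted here)."""
    def __init__(self, vocab_size, emb_dim=300, hidden=512):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True)

    def forward(self, goal_tokens):            # (B, N) word indices
        e_w = self.emb(goal_tokens)            # (B, N, emb_dim) word embeddings e_{w_n}
        h_w, _ = self.lstm(e_w)                # (B, N, hidden) hidden states h_{w_n}
        return h_w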

Let the hidden state of the decoder at step t be h^s_t. We use h^s_{t−1} to attend over the observations at t:

    α^s_i = Softmax(Linear(h^s_{t−1}) · f_{t,i})
    f_t = Σ_{i=1}^{36} α^s_i f_{t,i}

Then we update the decoder model:

    e^angle_t = Linear([cos θ_t; sin θ_t; cos φ_t; sin φ_t])
    h^s_t = LSTM_dec(h^s_{t−1}, [f_t; e^angle_t])

where θ_t and φ_t are the current heading and elevation of the embodied agent. To predict the next action, we first use h^s_t to attend over the natural language goal: α^w_n = Softmax(Linear(h^s_t) · h_{w_n}). The summarized goal embedding is then g_t = Σ_{n=1}^{N} α^w_n h_{w_n}, and the probability of selecting an action is given by

    p^a_i = Softmax(BiLinear(Linear([h^s_t; g_t]), e^a_{t,i}))    (6)

where BiLinear(v, u) = v^T W u, e^a_{t,i} = [f_ResNet(o_{t,a}); cos θ_{t,a}; sin θ_{t,a}; cos φ_{t,a}; sin φ_{t,a}] is the action embedding, o_{t,a} is the camera view of the action (viewpoint) from the current viewpoint, and θ_{t,a} and φ_{t,a} are the heading and elevation of the action from the current viewpoint. The base model is trained with the cross-entropy loss L_base = CrossEntropy(p^a, y^a), where y^a is the ground-truth action. The trained LSTM_dec is used as the pre-trained state encoding model for all algorithms in our experiments.
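A condensed PyTorch sketch of one decoder step (visual attention, decoder update, goal attention, and the bilinear action scoring of Equation 6); the dimensions and module names are illustrative assumptions rather than the exact implementation:

import torch
import torch.nn as nn
import torch.nn.functional as F

class BaseDecoderStep(nn.Module):
    """One decoder step: attend over the 36 views, update the decoder LSTM,
    attend over the goal words, then score candidate actions (cf. Eq. 6)."""
    def __init__(self, feat_dim=2048 + 4, hidden=512):
        super().__init__()
        self.vis_attn = nn.Linear(hidden, feat_dim)       # scores h^s_{t-1} against f_{t,i}
        self.angle_emb = nn.Linear(4, hidden)             # e^angle_t
        self.dec = nn.LSTMCell(feat_dim + hidden, hidden) # LSTM_dec
        self.goal_attn = nn.Linear(hidden, hidden)        # scores h^s_t against h_{w_n}
        self.act_proj = nn.Linear(2 * hidden, hidden)     # Linear([h^s_t; g_t])
        self.bilinear = nn.Bilinear(hidden, feat_dim, 1)  # BiLinear(v, u) = v^T W u

    def forward(self, h_prev, c_prev, views, goal_h, angle, actions):
        # views: (B, 36, feat_dim), goal_h: (B, N, hidden),
        # angle: (B, 4) = [cos/sin heading; cos/sin elevation], actions: (B, K, feat_dim)
        alpha_s = F.softmax((self.vis_attn(h_prev).unsqueeze(1) * views).sum(-1), dim=1)
        f_t = (alpha_s.unsqueeze(-1) * views).sum(1)                  # summarized view feature
        h_t, c_t = self.dec(torch.cat([f_t, self.angle_emb(angle)], -1), (h_prev, c_prev))
        alpha_w = F.softmax((self.goal_attn(h_t).unsqueeze(1) * goal_h).sum(-1), dim=1)
        g_t = (alpha_w.unsqueeze(-1) * goal_h).sum(1)                 # summarized goal embedding
        v = self.act_proj(torch.cat([h_t, g_t], -1))
        v = v.unsqueeze(1).expand(-1, actions.size(1), -1).contiguous()
        logits = self.bilinear(v, actions.contiguous()).squeeze(-1)   # (B, K) action scores
        return F.softmax(logits, dim=-1), h_t, c_t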

Policy Model: We use soft actor-critic (SAC) (Haarnoja et al. 2018) as our policy optimization algorithm. SAC includes a Q-network and a policy network. The Q-network Q(s, a, G) is defined as

    Q_ψ(s_t, a_{t,i}, G) = MLP([h^s_t; g_t; e^a_{t,i}])


where MLP is a multilayer feedforward neural network, h^s_t is the state of the embodied agent, which is given by the pre-trained LSTM_dec model in the base model, g_t is the summarized goal embedding calculated in the same way as in the base model, and e^a_{t,i} is the action embedding. The policy network is defined as

    π_w(s_t, a_{t,i}, G) = Softmax(e^a_{t,i} · MLP([h^s_t; g_t]))

By default the action space in SAC is continuous; in our setting, however, the action space is discrete and state-conditioned. We therefore modify the SAC algorithm slightly to support discrete action spaces. The Q-network parameters in SAC are trained to minimize the following soft Bellman residual:

    J_Q(ψ) = E_{(s_t, a_{t,i}, G)∼D} [ (1/2) ( Q_ψ(s_t, a_{t,i}, G) − (r_t + γ E_{s_{t+1}}[ V_ψ̄(s_{t+1}, G) ]) )^2 ]

where ψ̄ denotes the target Q-network, obtained as an exponentially moving average of the Q-network. In the discrete action space setting, the value function is given by

    V_ψ̄(s_t, G) = Σ_{i=1}^{K_t} π_w(a_{t,i}|s_t, G) [ Q_ψ̄(s_t, a_{t,i}, G) − α log π_w(a_{t,i}|s_t, G) ]

where K_t is the number of actions available at state s_t. Similarly, in the discrete setting, the policy network is learned by minimizing the KL divergence between the policy and the exponential of the Q-function:

    J_π(w) = E_{(s_t, G)∼D} [ Σ_{i=1}^{K_t} π_w(a_{t,i}|s_t, G) [ α log π_w(a_{t,i}|s_t, G) − Q_ψ(s_t, a_{t,i}, G) ] ]

The temperature α is updated by minimizing

    J(α) = Σ_{i=1}^{K_t} π_w(a_{t,i}|s_t, G) [ −α_t log π_w(a_{t,i}|s_t, G) − α_t H̄ ]

where H̄ is the desired minimum expected entropy of π_w(a_t|s_t, G).
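The following is a minimal sketch of these discrete-action losses (value backup J_Q, policy loss J_π, and temperature loss J(α)), assuming the Q-network and policy return per-action values and log-probabilities and that variable-size action sets have already been padded or masked; the temperature is optimized here through log α, a common reparameterization rather than the exact formulation above:

import torch
import torch.nn.functional as F

def discrete_sac_losses(q_all, q_taken, q_target_next, log_pi, log_pi_next,
                        reward, done, log_alpha, target_entropy, gamma=0.99):
    """Sketch of the discrete-action SAC objectives above (names are illustrative).
    q_all, log_pi:              (B, K)  Q-values / log-probs over current-state actions
    q_taken:                    (B,)    Q-value of the action stored in the replay buffer
    q_target_next, log_pi_next: (B, K') target-network Q-values / log-probs at s_{t+1}
    reward, done:               (B,)    transition reward and terminal flag
    """
    alpha = log_alpha.exp()
    # V(s_{t+1}, G) = sum_i pi(a_i|s_{t+1}) [ Q_target(s_{t+1}, a_i) - alpha * log pi(a_i|s_{t+1}) ]
    with torch.no_grad():
        pi_next = log_pi_next.exp()
        v_next = (pi_next * (q_target_next - alpha * log_pi_next)).sum(-1)
        q_backup = reward + gamma * (1.0 - done) * v_next
    # J_Q(psi): 1/2 * (Q(s_t, a_t, G) - (r_t + gamma * V(s_{t+1}, G)))^2
    q_loss = 0.5 * F.mse_loss(q_taken, q_backup)
    # J_pi(w): sum_i pi(a_i|s_t) [ alpha * log pi(a_i|s_t) - Q(s_t, a_i, G) ]
    pi = log_pi.exp()
    policy_loss = (pi * (alpha.detach() * log_pi - q_all.detach())).sum(-1).mean()
    # J(alpha): sum_i pi(a_i|s_t) [ -alpha * log pi(a_i|s_t) - alpha * H_bar ],
    # optimized through log(alpha) so the temperature stays positive.
    entropy = -(pi * log_pi).sum(-1).detach()
    alpha_loss = (log_alpha * (entropy - target_entropy)).mean()
    return q_loss, policy_loss, alpha_loss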

Discriminator: Our model is based on the adversarial inverse reinforcement learning (AIRL) framework (Fu, Luo, and Levine 2018). The generator in the AIRL framework is the SAC policy model just described, and the discriminator is defined as

    D_{θ,φ}(s_t, a_{t,i}, s_{t+1}, G) = exp{f_{θ,φ}(s_t, a_{t,i}, s_{t+1}, G)} / ( exp{f_{θ,φ}(s_t, a_{t,i}, s_{t+1}, G)} + π(a_{t,i}|s_t, G) )

where f_{θ,φ}(s_t, a_{t,i}, s_{t+1}, G) = g_θ(s_t, a_{t,i}, G) + γ h_φ(s_{t+1}, G) − h_φ(s_t, G), and g_θ and h_φ are multilayer feedforward neural networks:

    g_θ(s_t, a_{t,i}, G) = MLP([h^s_t; g_t; e^a_{t,i}])
    h_φ(s_t, G) = MLP([h^s_t; g_t])

The parameters θ and φ are optimized by minimizing the cross-entropy loss with label smoothing (Szegedy et al. 2016):

    L(θ, φ) = −E_{(s_t, a_{t,i}, s_{t+1}, G)∼D} [ (1 − ℓ) log D(s_t, a_{t,i}, s_{t+1}, G) + ℓ log(1 − D(s_t, a_{t,i}, s_{t+1}, G)) ]
              −E_{(s_t, a_{t,i}, s_{t+1}, G)∼D′} [ ℓ log D(s_t, a_{t,i}, s_{t+1}, G) + (1 − ℓ) log(1 − D(s_t, a_{t,i}, s_{t+1}, G)) ]

where ℓ is the label smoothing hyper-parameter, and D and D′ are the positive and negative examples discussed in Section 4.
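A minimal sketch of this discriminator and its label-smoothed objective (ℓ corresponds to label_smooth below; the g and h networks operate on the same state, goal, and action encodings as above, and the exact layer sizes are assumptions):

import torch
import torch.nn as nn

class AIRLDiscriminator(nn.Module):
    """f(s, a, s', G) = g(s, a, G) + gamma * h(s', G) - h(s, G);
    D = exp(f) / (exp(f) + pi(a|s, G)). Inputs are the encoded state h_s,
    summarized goal g_t, and action embedding e_a described above."""
    def __init__(self, state_dim, goal_dim, act_dim, hidden=512, gamma=0.99):
        super().__init__()
        self.gamma = gamma
        self.g = nn.Sequential(nn.Linear(state_dim + goal_dim + act_dim, hidden),
                               nn.ReLU(), nn.Linear(hidden, 1))
        self.h = nn.Sequential(nn.Linear(state_dim + goal_dim, hidden),
                               nn.ReLU(), nn.Linear(hidden, 1))

    def f(self, h_s, h_s_next, g_t, e_a):
        g_val = self.g(torch.cat([h_s, g_t, e_a], -1)).squeeze(-1)
        h_val = self.h(torch.cat([h_s, g_t], -1)).squeeze(-1)
        h_next = self.h(torch.cat([h_s_next, g_t], -1)).squeeze(-1)
        return g_val + self.gamma * h_next - h_val

    def loss(self, f_pos, log_pi_pos, f_neg, log_pi_neg, label_smooth=0.05):
        # log D = f - logsumexp(f, log pi); log(1 - D) = log pi - logsumexp(f, log pi)
        def log_d(f, log_pi):
            denom = torch.logsumexp(torch.stack([f, log_pi], -1), -1)
            return f - denom, log_pi - denom
        log_d_pos, log_1md_pos = log_d(f_pos, log_pi_pos)
        log_d_neg, log_1md_neg = log_d(f_neg, log_pi_neg)
        loss_pos = -((1 - label_smooth) * log_d_pos + label_smooth * log_1md_pos).mean()
        loss_neg = -(label_smooth * log_d_neg + (1 - label_smooth) * log_1md_neg).mean()
        return loss_pos + loss_neg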

Variational Goal Generator: The variational goal generator has an encoder and a decoder. The encoder encodes the sequence of observations in a trajectory τ, and the decoder samples a natural language goal G for that trajectory. Let e^a_t be the embedding of the action selected at step t. We first use a bi-directional LSTM to encode the actions in the trajectory,

    h^a_1, h^a_2, ..., h^a_T = BiLSTM(e^a_1, e^a_2, ..., e^a_T),

then use h^a_{t−1} to attend over the observations at t: α^a_i = Softmax(Linear(h^a_{t−1}) · f_{t,i}). The summarized observation feature vector at t is then f_t = Σ_{i=1}^{36} α^a_i f_{t,i}. The generative process of the n-th word in the natural language goal is as follows:

    h^o_1, h^o_2, ..., h^o_T = BiLSTM(f_1, f_2, ..., f_T)
    h^o = ElementWiseMax(h^o_1, h^o_2, ..., h^o_T)
    μ_prior, σ²_prior = MLP(h^o), Softplus(MLP(h^o))
    p(z) = N(z | μ_prior, σ²_prior I)
    z ∼ N(μ_prior, σ²_prior I)
    h^g_n = LSTMCell(h^g_{n−1}, e_{w_n})
    α^o_t = Softmax(Linear([h^g_n; z]) · h^o_t)
    p_{n+1}(w) = Softmax(MLP([Σ_{t=1}^{T} α^o_t h^o_t; h^g_n; z]))
    w = argmax_w p_{n+1}(w)

where Softplus is the function f(x) = ln(1 + e^x), which ensures that σ² is positive, e_{w_n} is the word embedding of w_n, and p_{n+1}(w) is the probability of selecting word w at position n + 1. The posterior distribution of the latent variable z is approximated by

    h^g = ElementWiseMax(h^g_1, h^g_2, ..., h^g_N)
    μ_posterior, σ²_posterior = MLP(h^o, h^g), Softplus(MLP(h^o, h^g))
    q(z|τ, G) = N(μ_posterior, σ²_posterior I)

We maximize the variational lower bound −λ D_KL(q(z|τ, G) || p(z)) + E_{q(z|τ,G)}[log p(G|z, τ)], where λ is a KL multiplier that gradually increases from 0 to 1 during training.
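A schematic sketch of the latent-variable component (prior and posterior heads, reparameterized sampling, and the annealed lower bound); the BiLSTM trajectory/goal encoders and the word decoder are omitted, and all names and sizes here are illustrative assumptions:

import torch
import torch.nn as nn
import torch.nn.functional as F

class GoalGeneratorLatent(nn.Module):
    """Prior p(z | tau) and approximate posterior q(z | tau, G).
    h_o / h_g are the element-wise-max pooled BiLSTM encodings of the
    trajectory observations and of the goal words, as described above."""
    def __init__(self, enc_dim=512, z_dim=128):
        super().__init__()
        self.prior_mu = nn.Linear(enc_dim, z_dim)
        self.prior_var = nn.Linear(enc_dim, z_dim)
        self.post_mu = nn.Linear(2 * enc_dim, z_dim)
        self.post_var = nn.Linear(2 * enc_dim, z_dim)

    def forward(self, h_o, h_g=None):
        mu_p = self.prior_mu(h_o)
        var_p = F.softplus(self.prior_var(h_o))            # sigma^2_prior > 0
        if h_g is None:                                     # generation: sample z from the prior
            return mu_p + var_p.sqrt() * torch.randn_like(mu_p), None
        mu_q = self.post_mu(torch.cat([h_o, h_g], -1))
        var_q = F.softplus(self.post_var(torch.cat([h_o, h_g], -1)))
        z = mu_q + var_q.sqrt() * torch.randn_like(mu_q)    # reparameterized sample from q(z|tau, G)
        # KL( N(mu_q, var_q) || N(mu_p, var_p) ) for diagonal Gaussians
        kl = 0.5 * (var_q / var_p + (mu_q - mu_p) ** 2 / var_p
                    - 1.0 + var_p.log() - var_q.log()).sum(-1)
        return z, kl

def goal_generator_loss(recon_log_prob, kl, step, anneal_steps=50_000):
    """Negative of the annealed variational lower bound."""
    lam = min(1.0, step / anneal_steps)                     # KL multiplier: 0 -> 1
    return (lam * kl - recon_log_prob).mean()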


Goal-agnostic Policy π_p: In Section 3.4, we propose to learn a goal-agnostic policy π_p that can sample trajectories in new environments to construct expert demonstrations. We learn this goal-agnostic policy π_p using the same network architecture as the base model described above, except that action selection is no longer conditioned on goals. That is, the probability of selecting an action, originally given by Equation 6, becomes

    p^a_i = Softmax(BiLinear(h^s_t, e^a_{t,i}))

Hyper-parameters: By default, the hidden size of the MLPs in all models in this paper is 512 and the number of layers is 2. The hidden size of the LSTMs is 512 (256 for BiLSTMs) and the number of layers is 1. The word embedding size is 300. The dropout rate is 0.5 for all models. The batch sizes for training SAC and the discriminator are both 100. All models are optimized with Adam and a learning rate of 1e−4. The target entropy in SAC is −log(1/MAX_NUM_ACTIONS) × 0.1, where MAX_NUM_ACTIONS = 14. The size of the replay buffer in SAC is 10^6. The reward discount factor is 0.99. Our implementation of SAC is based on the rlpyt code base (Stooke and Abbeel 2019). We apply label smoothing when optimizing the discriminator: the positive label is 0.95 and the negative label is 0.05. The size of the latent variable in the variational goal generator is 128. When training the variational goal generator, we gradually increase the KL multiplier λ from 0 to 1 during the first 50,000 gradient updates, and we also apply word dropout with probability 0.25. All experiments are run 8 times and the average and standard deviation of the success rate are reported. One run includes 1 million interactions with the environments and takes about 30 hours to train on an NVIDIA V100 GPU.
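For convenience, these settings can be summarized in a single configuration dictionary (the names below are illustrative, not those of the actual code base); the target entropy entry reproduces the formula above:

import math

# Illustrative hyper-parameter configuration matching the values listed above.
MAX_NUM_ACTIONS = 14
CONFIG = dict(
    mlp_hidden=512, mlp_layers=2,
    lstm_hidden=512, bilstm_hidden=256, lstm_layers=1,
    word_emb_size=300, dropout=0.5,
    batch_size=100, lr=1e-4, optimizer="Adam",
    target_entropy=-math.log(1.0 / MAX_NUM_ACTIONS) * 0.1,  # about 0.264
    replay_buffer_size=10**6, gamma=0.99,
    label_pos=0.95, label_neg=0.05,
    z_dim=128, kl_anneal_steps=50_000, word_dropout=0.25,
)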

C Examples of Generated Goals

The following are examples of human-annotated vs. GoalGenerator-generated instructions for a trajectory in the Room-2-Room dataset.

Example 1:
Human: Exit room through the doorway near the fireplace. Keep right and walk through the hallway, turn right, enter the bedroom and wait near the bed.
GoalGenerator: Walk through the doorway to the right of the fireplace. Walk past the fireplace and turn right. Walk down the hallway and turn right into the bedroom. Stop in front of the bed.

Example 2:
Human: Turn right and exit the room through the door on the left. Turn left and walk out into the hallway. Turn right and enter bedroom. Walk through the bedroom and into the bathroom. Stop once you are in front of the sink.
GoalGenerator: Turn around and walk through the doorway. Turn left and walk down the hallway. Turn right at the end of the hall and enter the bathroom. Stop in front of the sink.

Example 3:
Human: Turn around and walk outside through the door behind you. Once outside, turn right and walk to the other end of the pool. At the end of the pool turn right and enter the door back into the house and stop behind the 2 white chairs.
GoalGenerator: Turn around and walk out of the room. Once out, turn right and walk towards the pool. Once you reach the pool, turn right and walk towards the large glass doors. Once you reach the 2 chairs, turn right and enter the large room. Stop once you reach the couch.

The following are goals generated by the GoalGenerator for trajectories sampled by the goal-agnostic policy in new environments (Section 3.4).

1. Walk straight across the room and past the couch. Walk straight until you get to a room with a large green vase. Wait there.

2. Exit the bathroom and turn right. Walk down the hallway and turn right. Walk straight until you get to a kitchen area. Wait near the stove.

3. Walk straight down the hallway and into the room with the large mirror. Walk through the door and stop in front of the table with the plant on it.

D Additional Experiments

Table 4 shows the full ablation study results, which also include hindsight experience replay. Notably, when applying HER together with HGR, the success rate drops to 0.101. The reason is that without EGR and HGS, the discriminator over-estimates the rewards under relabeled goals (as relabeled goals only appear in positive examples of the discriminator), and hence cannot give a good estimate of rewards for trajectories relabeled by HER.

Table 5 shows the policy performance in unseen validation with three different self-supervised learning settings. RandomExplore randomly selects actions instead of using π_p when generating demonstrations in unseen environments (while avoiding visiting a viewpoint twice). After collecting demonstrations, RandomExplore fine-tunes the policy in the same way as LangGoalIRL SS. LangGoalIRL SS G2 differs from LangGoalIRL SS in that it does not iteratively sample goals from {(E, G_v)|v > 2} as described in Section 3.3.3, so the policy experiences a less diverse set of natural language goals than LangGoalIRL SS. The number of generated demonstrations in unseen environments is set to 5,000, 10,000, 15,000, or 20,000. We can see that LangGoalIRL SS performs much better than RandomExplore, showing the importance of collecting demonstrations with π_p instead of a random policy. LangGoalIRL SS also performs better than LangGoalIRL SS G2, although the gap between them decreases as the number of generated demonstrations increases.

Table 6 shows the success rate of LangGoalIRL and LangGoalIRL BASE when self-supervised learning is applied to existing environments.


EGR   HGR   HGS   HER   seen validation success rate
 –     –     –     –    0.449 ± 0.0130
 ✓     –     –     –    0.450 ± 0.0122
 –     ✓     –     –    0.427 ± 0.0048
 –     –     ✓     –    0.399 ± 0.0124
 –     –     –     ✓    0.464 ± 0.0126
 ✓     ✓     –     –    0.504 ± 0.0185
 ✓     –     ✓     –    0.444 ± 0.0134
 ✓     –     –     ✓    0.459 ± 0.0050
 –     ✓     ✓     –    0.503 ± 0.0117
 –     ✓     –     ✓    0.101 ± 0.0790
 –     –     ✓     ✓    0.378 ± 0.0192
 ✓     ✓     ✓     –    0.530 ± 0.0138
 ✓     ✓     –     ✓    0.491 ± 0.0146
 ✓     –     ✓     ✓    0.452 ± 0.0088
 –     ✓     ✓     ✓    0.476 ± 0.0166
 ✓     ✓     ✓     ✓    0.521 ± 0.0093

Table 4: A complete ablation study of EGR (Expert Goal Relabeling), HGR (Hindsight Goal Relabeling), HGS (Hindsight Goal Sampling), and HER (Hindsight Experience Replay) on the seen validation set.

                     5,000            10,000           15,000           20,000
RandomExplore        0.274 ± 0.0086   0.303 ± 0.0092   0.347 ± 0.0098   0.348 ± 0.0115
LangGoalIRL SS G2    0.430 ± 0.0080   0.459 ± 0.0084   0.483 ± 0.0124   0.490 ± 0.0068
LangGoalIRL SS       0.452 ± 0.0067   0.475 ± 0.0062   0.491 ± 0.0065   0.496 ± 0.0039

Table 5: The success rate in unseen validation with different self-supervised learning settings. Algorithms sample 5,000 to 20,000 demonstrations in unseen environments. RandomExplore samples demonstrations by randomly selecting an unvisited nearby viewpoint, while LangGoalIRL SS and LangGoalIRL SS G2 sample demonstrations using the goal-agnostic policy π_p. During fine-tuning, LangGoalIRL SS G2 samples goals only from {(E, G_v)|v ≤ 2}, while the other two algorithms sample goals from {(E, G_v)|v ≥ 1}.

                    0                5,000            10,000           15,000           20,000
LangGoalIRL BASE    0.449 ± 0.0130   0.469 ± 0.0111   0.472 ± 0.0105   0.478 ± 0.0108   0.481 ± 0.0156
LangGoalIRL         0.530 ± 0.0138   0.532 ± 0.0126   0.535 ± 0.0113   0.538 ± 0.0155   0.543 ± 0.0125

Table 6: Self-supervised learning can also be applied to the seen validation set. This table shows the success rate in seen validation when self-supervised learning is used to augment the expert demonstrations in seen validation. We use the goal-agnostic policy π_p to sample 5,000 to 20,000 trajectories, relabel them with the goal generator, and add them to the seen validation set for training.

#relabeled expert goals    0                1                2                3
LangGoalIRL                0.503 ± 0.0117   0.525 ± 0.0090   0.530 ± 0.0138   0.528 ± 0.0105
LangGoal ONLY EGR          0.449 ± 0.0130   0.446 ± 0.0169   0.450 ± 0.0122   0.450 ± 0.0155

Table 7: Success rate in seen validation given different numbers of relabeled expert goals. This corresponds to setting N in Section 3.3.1 to 0, 1, 2, and 3.

                     BLEU    Unigram Precision   Bigram Precision   Trigram Precision
seen validation      0.341   0.804               0.478              0.259
unseen validation    0.316   0.791               0.453              0.234

Table 8: BLEU score and n-gram precision of the variational goal generator evaluated on both seen and unseen validation sets.


More specifically, in Section 3.4 we propose a self-supervised learning algorithm for new environments (the unseen validation set); the same idea can also be applied to existing environments (the seen validation set). To do so, we use the goal-agnostic policy π_p to sample trajectories in existing environments and relabel them with the variational goal generator. Then we augment the seen validation set with these generated demonstrations. During training, EGR is only applied to the original expert demonstrations in the seen validation set. From the results we can see that self-supervised learning improves the performance of both LangGoalIRL and LangGoalIRL BASE in seen validation. Self-supervised learning has a larger impact on LangGoalIRL BASE than on LangGoalIRL. This is reasonable since LangGoalIRL can already generalize the policy and reward function via our three proposed strategies (EGR, HGR, and HGS).

Table 7 shows the success rate of LangGoalIRL in seen validation when we vary the number of relabeled expert goals (N in Section 3.3.1). We can see that the policy performance is comparable whenever the number of relabeled expert goals N ≥ 1.

Table 8 shows the BLEU score and n-gram precision of the variational goal generator on both seen and unseen environments.