

Maximum Entropy Deep Inverse Reinforcement Learning

Markus Wulfmeier  MARKUS@ROBOTS.OX.AC.UK
Peter Ondruska  ONDRUSKA@ROBOTS.OX.AC.UK
Ingmar Posner  INGMAR@ROBOTS.OX.AC.UK

Mobile Robotics Group, Department of Engineering Science, University of Oxford

Abstract

This paper presents a general framework for exploiting the representational capacity of neural networks to approximate complex, nonlinear reward functions in the context of solving the inverse reinforcement learning (IRL) problem. We show in this context that the Maximum Entropy paradigm for IRL lends itself naturally to the efficient training of deep architectures. At test time, the approach leads to a computational complexity independent of the number of demonstrations, which makes it especially well-suited for applications in life-long learning scenarios. Our approach achieves performance commensurate to the state-of-the-art on existing benchmarks while exceeding it on an alternative benchmark based on highly varying reward structures. Finally, we extend the basic architecture - which is equivalent to a simplified subclass of Fully Convolutional Neural Networks (FCNNs) with width one - to include larger convolutions in order to eliminate the dependency on precomputed spatial features and work on raw input representations.

1. Introduction

Recent successes in machine learning, vision and robotics have led to widespread expectations that machines will increasingly succeed in applications of real value to the public domain. A central tenet of any vision delivering on this promise revolves around learning from user interactions. Inverse reinforcement learning (IRL) is playing a pivotal role in these developments and commonly finds applications in robotics (Argall et al., 2009), where it allows robots to learn complex behaviour from human demonstrations, as well as in cognition (Baker et al., 2009) and preference learning (Ziebart et al., 2008), where it serves as a tool to better understand human decisions, or in medicine (Asoh et al., 2013) to predict patient response to treatment.

Figure 1: Fully Convolutional Neural Network for reward approximation in the IRL setting. The network serves to model the relationship between input features and the final reward map.

The objective of inverse reinforcement learning (IRL) is to infer the underlying reward structure guiding an agent's behaviour based on observations as well as a model of the environment. This may be done either to learn the reward structure for modelling purposes or to provide a method to allow the agent to imitate a demonstrator's specific behaviour (Ramachandran & Amir, 2007). While for small problems the complete set of rewards can be learned explicitly, many problems of realistic size require the application of generalisable function approximations.

Much of the prior art in this domain relies on parametrisation of the reward function based on pre-determined features. In addition to better generalisation performance than a direct state-to-reward mapping, this approach enables the transfer of learned reward functions between different scenarios with the same feature representation. A number of early works, including (Ziebart et al., 2008), (Abbeel & Ng, 2004), (Lopes et al., 2009) and (Ratliff et al., 2006), express the reward function as a weighted linear combination of hand-selected features. To overcome the inherent limitations of linear models, (Choi & Kim, 2013) and (Levine et al., 2010) extend this approach to a limited set of nonlinear rewards by learning a set of composites of logical conjunctions of atomic features. Non-parametric methods such as Gaussian Processes (GPs) have also been employed to cater for potentially complex, nonlinear reward functions (Levine et al., 2011). While in principle this extends the IRL paradigm to the flexibility of nonlinear reward approximation, the use of a kernel machine makes this approach scale poorly with the amount of training data and prone to requiring a large number of reward samples in order to approximate highly varying reward functions (Bengio et al., 2007). Even sparse GP approximations as used in (Levine et al., 2011) lead to a query-time complexity that depends on the size of the active set or the number of experienced state-reward pairs. Situations with increasingly complex reward functions, which require larger numbers of inducing points, can quickly render this nonparametric approach computationally impracticable. Furthermore, in comparison to (Babes et al., 2011), we focus on a single expert, which finally leads to an end-to-end learning scenario in Section 4.3 from raw input to reward without compression or preprocessing of the input representation. To our knowledge, the only other work considering the use of deep networks is (Levine et al., 2015), who focus on directly approximating policies with neural networks but briefly refer to the possibility of an extension to cost function learning with neural networks.

In contrast to prior art, we explore the use of neural networks to approximate the reward function. Neural networks already achieve state-of-the-art performance across a variety of domains such as computer vision, natural language processing, speech recognition (Bengio et al., 2012) and reinforcement learning (Mnih et al., 2013). Their application in IRL suggests itself due to their compact representation of highly nonlinear functions through the composition and reuse of the results of many nonlinearities in the layered structure (Bengio et al., 2007). In addition, NNs provide favourable computational complexity (O(1)) at query time with respect to observed demonstrations, allowing them to scale to problems with large state spaces and complex reward structures – circumstances which might render the application of existing prior methods intractable or ineffective. With the approach depicted in Figure 1, a state's reward can be determined either solely based on its own feature representation or – when using wider convolutional layers – analysed in combination with its spatial context. The applied architectures are Fully Convolutional Neural Networks, which – by skipping the fully connected final layers common in classification tasks – preserve spatial information and can create an output of the same spatial dimension and size as the input. Recent applications of FCNNs focus on dense prediction, including pixel-wise semantic segmentation (Long et al., 2014), sliding window detection and prediction of object boundaries (Sermanet et al., 2013), depth estimation from single monocular images (Liu et al., 2015) and human pose estimation in monocular images (Tompson et al., 2014).

Our principal contribution is a framework for Maximum Entropy Deep Inverse Reinforcement Learning (DeepIRL) based on the Maximum Entropy paradigm for IRL (Ziebart et al., 2008), which lends itself naturally to training deep architectures by leading to an objective that is - without approximations - fully differentiable with respect to the network weights. Furthermore, we demonstrate performance commensurate to state-of-the-art methods on a publicly available benchmark, while outperforming the state-of-the-art on a new benchmark where the true underlying reward has a complex interacting structure over the feature representation. In addition, we emphasise the flexibility of the approach and eliminate the requirement of preprocessing and precomputed features by applying wider convolutional layers to learn spatial features of relevance to the IRL task. This enables application without manually crafted feature design as long as the state space is constrained to a regularly gridded representation allowing for convolutions.

We argue that these properties are important for practical large-scale applications of IRL, such as life-long learning, where often complex reward functions and an increasing number of demonstrations require high-capacity models and fast computation.

2. Inverse Reinforcement Learning

This section presents a brief overview of IRL. Let a Markov Decision Process (MDP) be defined as M = {S, A, T, r}, where S denotes the state space, A denotes the set of possible actions, T denotes the transition model and r denotes the reward structure. Given an MDP, an optimal policy π* is one which, when adhered to, maximizes the expected cumulative reward. Furthermore, an additional factor γ ∈ [0, 1] may be considered in order to discount future rewards.

IRL considers the case where an MDP specification is available but the reward structure is unknown. Instead, a set of expert demonstrations D = {ς_1, ς_2, ..., ς_N} is provided, which are sampled from a user policy π, i.e. provided by a demonstrator. Each demonstration consists of a set of state-action pairs such that ς_i = {(s_0, a_0), (s_1, a_1), ..., (s_K, a_K)}. The goal of IRL is to uncover the hidden reward r from the demonstrations.

A number of approaches have been proposed to tackle the IRL problem (see, for example, (Abbeel & Ng, 2004), (Neu & Szepesvari, 2012), (Ratliff et al., 2006), (Syed & Schapire, 2007)). An increasingly popular formulation is Maximum Entropy IRL (Ziebart et al., 2008), which was used to effectively model large-scale user driving behaviour. In this formulation, the probability of user preference for any given trajectory between specified start and goal states is proportional to the exponential of the reward along the path

P(ς | r) ∝ exp{ Σ_{(s,a)∈ς} r_{s,a} }.    (1)

As shown in Ziebart's work, principal benefits of the Maximum Entropy paradigm include the ability to handle expert suboptimality as well as stochasticity by operating on the distribution over possible trajectories. Moreover, the Maximum Entropy based objective function given in Equation 10 enables backpropagation of the objective gradients to the network's weights. The training procedure is then straightforwardly framed as an optimisation task computable e.g. via conjugate gradient or stochastic gradient descent.
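To make the proportionality in Equation 1 concrete, the following minimal sketch scores a handful of trajectories by their exponentiated cumulative reward and normalises the result; the reward values and state indices are made up purely for illustration.

```python
import numpy as np

# Hypothetical reward per state on a small problem (assumption for illustration).
reward = np.array([0.0, -1.0, 2.0, 0.5, -0.5])

# Two example trajectories given as sequences of state indices.
trajectories = [
    [0, 2, 3],      # passes through the high-reward state 2
    [0, 1, 4, 3],   # longer path through low-reward states
]

# Equation 1: P(trajectory | r) is proportional to exp of the summed reward.
scores = np.array([np.exp(reward[traj].sum()) for traj in trajectories])
probs = scores / scores.sum()
print(probs)        # the first trajectory receives most of the probability mass
```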

2.1. Approximating the Reward Structure

Due to the dimensionality and size of the state space in many real world applications, the reward structure cannot be observed explicitly for every state. In these cases state rewards are not modelled directly per state, but the reward structure is restricted by imposing that states with similar features, x, should have similar rewards. To this end, function approximation is used in order to regress the feature representation onto a real valued reward using a mapping g: R^N → R, with N being the dimensionality of the feature space, such that

r = g(f, θ).    (2)

A feature representation, f, is usually hand-crafted based on preprocessing such as segmentation and manually defined distance metrics, but it can also be learned within the proposed framework, as shown in Section 4.3. Furthermore, the application of feature based function approximation enables easier generalisation and transfer of models.

The choice of model used for function approximation has a dramatic impact on the ability of the algorithm to capture the relationship between the state feature vector f and user preference. Commonly, the mapping from state to reward is simply a weighted linear combination of feature values

g(f, θ) = θ^T f.    (3)

This choice, while appropriate in some scenarios, is suboptimal if the true reward cannot be accurately approximated by a linear model. In order to alleviate this limitation, (Choi & Kim, 2013) extend the linear model by introducing a mapping Φ: R^N → {0, 1}^N such that

g(f, θ, Φ) = θ^T Φ(f).    (4)

Here Φ denotes a set of composite features which are jointly learned as part of the objective function. These composites are assumed to be logical conjunctions of the predefined, atomic features f. Due to the nature of the features used, the representational power of this approach is limited to the family of piecewise constant functions.

In contrast, (Levine et al., 2011) employ a Gaussian Process (GP) framework to capture the potentially unbounded complexity of any underlying reward structure. The set of expert demonstrations D is used in this context to identify an active set of GP support points, X_u, and associated rewards u. The mean function is then used to represent the individual reward at a state described by f

g(f, θ, X_u, u) = K_{f,u}^T K_{u,u}^{-1} u.    (5)

Here K_{f,u} denotes the covariance of the reward at f with the active set reward values u located at X_u, and K_{u,u} denotes the covariance matrix of the rewards in the active set, computed via a covariance function k_θ(f_i, f_j) with hyperparameters θ.
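For illustration, a minimal sketch of the mean prediction in Equation 5, assuming a squared-exponential covariance function; the inducing inputs X_u, associated rewards u and hyperparameters below are arbitrary stand-ins rather than quantities from the paper or from (Levine et al., 2011).

```python
import numpy as np

def rbf_kernel(A, B, length_scale=1.0, signal_var=1.0):
    """Squared-exponential covariance k_theta(f_i, f_j) between rows of A and B."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return signal_var * np.exp(-0.5 * sq_dists / length_scale ** 2)

def gp_mean_reward(f, X_u, u, length_scale=1.0, signal_var=1.0, noise=1e-6):
    """Equation 5: g(f) = K_{f,u} K_{u,u}^{-1} u for query features f."""
    K_fu = rbf_kernel(f, X_u, length_scale, signal_var)
    K_uu = rbf_kernel(X_u, X_u, length_scale, signal_var) + noise * np.eye(len(X_u))
    return K_fu @ np.linalg.solve(K_uu, u)

# Hypothetical active set of 5 inducing points in a 3-dimensional feature space.
rng = np.random.default_rng(0)
X_u = rng.normal(size=(5, 3))
u = rng.normal(size=5)            # rewards associated with the inducing points
f_query = rng.normal(size=(2, 3))
print(gp_mean_reward(f_query, X_u, u))
```

Note that the query cost grows with the size of the active set, which is exactly the scaling issue discussed next.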

Nevertheless, a significant drawback of the GPIRL approach is a computational complexity proportional to the number of demonstrations and the size of the active set of inducing points, which in turn depends on the reward complexity. While the modelling of complex, nonlinear reward structures in problems with large state spaces is theoretically feasible for the GPIRL approach, the cardinality of the active set will quickly become unwieldy, putting GPIRL at a significant computational disadvantage or, worse, rendering it entirely intractable. These shortcomings are remedied when using deep parametric architectures for reward function approximation while keeping the accuracy of nonlinear function approximation, as outlined in the next section.

3. Reward Function Approximation with Deep Architectures

We argue that IRL algorithms scalable to MDPs with large feature spaces require models which are able to efficiently represent complex, nonlinear reward structures. In this context, deep architectures are a natural choice as they explicitly exploit the depth-breadth trade-off (Bengio et al., 2007) and increase representational capacity by reusing the computations of earlier nodes in the following layers.

Figure 2: Schema for Neural Network based reward function approximation based on the feature representation of MDP states.

For the remainder of the paper, we consider a network architecture which accepts as input state features x, maps these to a state reward r and is governed by the network parameters θ_{1,2,...,n}. In the context of Section 2.1, the state reward is therefore obtained as

r ≈ g(f, θ_1, θ_2, ..., θ_n)    (6)
  = g_1(g_2(...(g_n(f, θ_n), ...), θ_2), θ_1).    (7)
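The composition in Equation 7 can be written out directly. The sketch below assumes two hidden layers with rectified linear units (the configuration later used in the experiments) and arbitrary layer widths and initialisation; it illustrates the layered mapping from features to a scalar reward and is not the authors' implementation.

```python
import numpy as np

def init_layers(sizes, rng):
    """One (W, b) pair per mapping g_i in Equation 7."""
    return [(rng.normal(scale=0.1, size=(m, n)), np.zeros(m))
            for n, m in zip(sizes[:-1], sizes[1:])]

def reward_forward(f, layers):
    """r = g_1(g_2(...g_n(f, theta_n)...)), with ReLU on the hidden layers."""
    h = f
    for i, (W, b) in enumerate(layers):
        h = W @ h + b
        if i < len(layers) - 1:       # last layer stays linear: a scalar reward
            h = np.maximum(h, 0.0)
    return h.item()

rng = np.random.default_rng(0)
layers = init_layers([4, 16, 16, 1], rng)   # 4 state features -> scalar reward
f = rng.normal(size=4)                      # feature vector of a single state
print(reward_forward(f, layers))
```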

While many choices exist for the individual building blocks of a deep architecture, it has been shown that a sufficiently large NN with as little as two layers and sigmoid activation functions can represent any binary function (Hassoun, 1995) or any piecewise-linear function (Hornik et al., 1989) and can therefore be regarded as a universal approximator. While this holds true in theory, it can be far more computationally practicable to extend the depth of the network structure and reduce the number of required computations in doing so (Bengio, 2009).

Importantly, in applying backpropagation, NNs also lend themselves naturally to training in the maximum entropy IRL framework, and the network structure can be adapted to suit individual tasks without complicating or even invalidating the main IRL learning mechanism. In the DeepIRL framework proposed here, the full range of architecture choices thus becomes available. Different problem domains can utilise different network architectures; convolutional layers, for example, can remove the dependency on hand-crafted spatial features. Furthermore, it is straightforward to show that the linear maximum entropy IRL approach proposed in (Ziebart et al., 2008) can be seen as a simplification of the more general deep approach and can be recreated by applying the rules of back-propagation to a network with a single linear output connected to all inputs with a zero bias term.

While the common NN architectures for whole-image classification regress to fixed-size outputs, the applied FCNNs result in an output with equivalent spatial dimensionality, and by padding the data correspondingly we realise reward maps of the same size as our input. Note that padding is not the only possibility: deconvolutions as applied by (Long et al., 2014) can also transform and reshape the data to create equally sized model output.

3.1. Training Procedure

The task of solving the IRL problem can be framed in the context of Bayesian inference as MAP estimation, maximizing the joint posterior distribution of observing expert demonstrations, D, under a given reward structure and of the model parameters θ:

L(θ) = log P(D, θ | r) = log P(D | r) + log P(θ) = L_D + L_θ.    (8)

This joint log likelihood is differentiable with respect to the parameters θ of a linear reward model, which allows the application of gradient descent methods (Snyman, 2005). We extend this benefit to neural networks by adapting the Maximum Entropy formulation, as presented for L_D in Equation 10, separating the gradient into the gradient of the loss with respect to the rewards r and the gradient of the reward with respect to the network's weights, which is obtained via backpropagation.

The complete gradient is given by the sum of the gradients with respect to θ of the data term L_D and a weight decay term acting as model regulariser L_θ:

∂L/∂θ = ∂L_D/∂θ + ∂L_θ/∂θ.    (9)

The aforementioned separation of derivatives in the gradient of the data term is shown in Equation 10:

∂L_D/∂θ = ∂L_D/∂r · ∂r/∂θ    (10)
        = (µ_D − E[µ]) · ∂g(f, θ)/∂θ,    (11)

where r = g(f, θ). As shown in (Ziebart et al., 2008), the gradient of the expert demonstration term L_D with respect to the model parameters of a linear function is equal to the difference in feature counts along the trajectories. For higher-level models this gradient can be split into the derivative with respect to the reward r and the derivative of the reward with respect to the model parameters, which in the case of a neural network is obtained via backpropagation. The derivative of the Maximum Entropy objective with respect to the reward equals the difference in state visitation counts between the solutions given by the expert demonstrations and the expected visitation counts for the learned system's trajectory distribution given in Equation 12:

E[µ] = Σ_{ς: (s,a)∈ς} P(ς | r).    (12)


Algorithm 1 Maximum Entropy Deep IRL

Input: µ^a_D, f, S, A, T, γ
Output: optimal weights θ*

1:  θ^1 = initialise_weights()
    Iterative model refinement
2:  for n = 1 : N do
3:    r^n = nn_forward(f, θ^n)
      Solution of the MDP with the current reward
4:    π^n = approx_value_iteration(r^n, S, A, T, γ)
5:    E[µ^n] = propagate_policy(π^n, S, A, T)
      Determine Maximum Entropy loss and gradients
6:    L^n_D = log(π^n) × µ^a_D
7:    ∂L^n_D/∂r^n = µ_D − E[µ^n]
      Compute network gradients
8:    ∂L^n_D/∂θ^n_D = nn_backprop(f, θ^n, ∂L^n_D/∂r^n)
9:    θ^{n+1} = update_weights(θ^n, ∂L^n_D/∂θ^n_D)
10: end for

Computation of E[µ] usually involves summation over exponentially many possible trajectories. A more effective algorithm based on dynamic programming, which computes this quantity in polynomial time, can be found in (Ziebart et al., 2008; Kitani et al., 2012). Subsequently, the effective computation of the gradient ∂L_D/∂θ involves first computing the difference in visitation counts using this algorithm and then passing this as an error signal through the network using back-propagation.

The complete proposed method is described by Algorithm 1, with the loss and gradient derivation in lines 6 and 7 given by the linear Maximum Entropy formulation. The expert's state-action frequencies µ^a_D, which are needed for the calculation of the loss, are summed over the actions to compute the expert state frequencies µ_D = Σ_{a=1}^{A} µ^a_D.
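Lines 6-9 of Algorithm 1 reduce to a visitation-count difference pushed through the network by backpropagation. The sketch below illustrates one such update for a small two-hidden-layer ReLU network; the features, visitation counts and step size are made-up placeholders, and the hand-written backward pass merely stands in for nn_backprop.

```python
import numpy as np

def forward(F, params):
    """Reward for every state at once; F is (n_states, n_features)."""
    W1, b1, W2, b2, w3, b3 = params
    h1 = np.maximum(F @ W1 + b1, 0.0)
    h2 = np.maximum(h1 @ W2 + b2, 0.0)
    r = h2 @ w3 + b3                                   # one scalar reward per state
    return r, (h1, h2)

def backward(F, params, cache, grad_r):
    """Backpropagate dL_D/dr = mu_D - E[mu] (line 7) to the network weights (line 8)."""
    W1, b1, W2, b2, w3, b3 = params
    h1, h2 = cache
    g_w3 = h2.T @ grad_r
    g_b3 = grad_r.sum()
    g_pre2 = np.outer(grad_r, w3) * (h2 > 0)           # through the second ReLU
    g_W2, g_b2 = h1.T @ g_pre2, g_pre2.sum(0)
    g_pre1 = (g_pre2 @ W2.T) * (h1 > 0)                # through the first ReLU
    g_W1, g_b1 = F.T @ g_pre1, g_pre1.sum(0)
    return [g_W1, g_b1, g_W2, g_b2, g_w3, g_b3]

rng = np.random.default_rng(0)
n_states, n_features = 25, 4
F = rng.normal(size=(n_states, n_features))            # hypothetical state features
params = [rng.normal(scale=0.1, size=(n_features, 16)), np.zeros(16),
          rng.normal(scale=0.1, size=(16, 16)), np.zeros(16),
          rng.normal(scale=0.1, size=16), 0.0]

r, cache = forward(F, params)                          # line 3 of Algorithm 1
mu_D = rng.random(n_states)                            # expert visitation counts (made up)
E_mu = rng.random(n_states)                            # expected counts from Algorithm 3 (made up)
grads = backward(F, params, cache, mu_D - E_mu)
params = [p + 0.01 * g for p, g in zip(params, grads)] # line 9: ascend the log-likelihood
```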

Lines 4 and 5 are explained in detail in Algorithms 2 and 3 respectively, and are adapted from (Kitani et al., 2012). Algorithm 2 determines the policy given the current reward model via iterative updates of the state-action value function, while Algorithm 3 determines the expected state visitation frequencies by probabilistically traversing the MDP given the current policy.

Algorithm 2 Approximate Value Iteration

1: V(s) = −∞
2: repeat
3:   V_t = V;  V(s_goal) = 0
4:   Q(s, a) = r(s, a) + E_{T(s,a,s′)}[V(s′)]
5:   V = softmax_a Q(s, a)
6: until max_s (V(s) − V_t(s)) < ε
7: π(a|s) = e^{Q(s,a) − V(s)}

Algorithm 3 Policy Propagation

1: E_1[µ(s_start)] = 1
2: for i = 1 : N do
3:   E_i[µ(s_goal)] = 0
4:   E_{i+1}[µ(s)] = Σ_{s′,a} T(s, a, s′) π(a|s′) E_i[µ(s′)]
5: end for
6: E[µ(s)] = Σ_i E_i[µ(s)]

Additional indices representing the iteration of the main algorithm have been omitted from these listings for readability.
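A compact reading of Algorithms 2 and 3 for a tabular MDP with a dense transition array T[s, a, s'] is sketched below; the soft maximum in Algorithm 2 is implemented as a log-sum-exp over actions, and the propagation step follows the usual forward pass that moves probability mass from s' to s. The random MDP at the end is purely illustrative.

```python
import numpy as np

def soft_value_iteration(r, T, goal, eps=1e-4, max_iters=1000):
    """Algorithm 2: soft value iteration; r is (S, A), T is (S, A, S).
    Returns the stochastic policy pi(a|s) = exp(Q(s, a) - V(s))."""
    S, A = r.shape
    V = np.full(S, -1e9)                   # stand-in for minus infinity
    Q = np.zeros((S, A))
    for _ in range(max_iters):
        V_old = V.copy()
        V[goal] = 0.0                      # clamp the goal value (line 3)
        Q = r + T @ V                      # line 4: Q(s,a) = r(s,a) + E_T[V(s')]
        V = np.log(np.exp(Q - Q.max(1, keepdims=True)).sum(1)) + Q.max(1)  # line 5
        if np.max(np.abs(V - V_old)) < eps:
            break
    return np.exp(Q - V[:, None])          # line 7

def policy_propagation(pi, T, start, goal, N=50):
    """Algorithm 3: expected state visitation frequencies E[mu(s)]."""
    S = T.shape[0]
    mu = np.zeros((N + 1, S))
    mu[0, start] = 1.0
    for i in range(N):
        mu[i, goal] = 0.0                  # absorb probability mass at the goal
        # sum over s', a of T(s', a, s) * pi(a|s') * E_i[mu(s')]
        mu[i + 1] = np.einsum('xas,xa,x->s', T, pi, mu[i])
    return mu.sum(0)

# Tiny random MDP for illustration (assumed shapes: S states, A actions).
rng = np.random.default_rng(0)
S, A = 6, 4
T = rng.random((S, A, S)); T /= T.sum(-1, keepdims=True)
r = rng.normal(size=(S, A))
pi = soft_value_iteration(r, T, goal=5)
print(policy_propagation(pi, T, start=0, goal=5))
```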

The presented algorithm is applied to train FCNNs based on the loss derivatives for all states at once. As each of the final state-wise rewards is influenced by its corresponding area in the original state space – its receptive field – training with the summed loss over the whole scene is equivalent to a stochastic gradient formulation with all receptive fields addressed in a minibatch. This formulation is computationally more efficient than separate computation per field, since these fields overlap as soon as the width of our convolutional filters exceeds one (Long et al., 2014).

4. Experiments

We assess the performance of DeepIRL on two benchmark tasks against current state-of-the-art approaches: GPIRL (Levine et al., 2011), NPB-FIRL (Choi & Kim, 2013) and the original MaxEnt (Ziebart et al., 2008), included to illustrate the necessity of non-linear function approximation.

All tests are run multiple times on training and transfer scenarios for the different settings. Learning is performed on synthetically generated, stochastic demonstrations derived from the optimal policy in order to evaluate performance on suboptimal example sets. This is achieved by providing a number of demonstrations sampled from the optimal policy based on the true reward structure, but including 30% random actions.

In our experiments, we employ an FCNN with two hidden layers and rectified linear units as the function approximator between the state feature representation and the reward. This rather shallow network structure suffices for these strongly simplified toy benchmarks; however, the whole framework can be utilised for training networks of arbitrary capacity. All experiments except for the spatial feature learning in Section 4.3 are based on filters of width one to focus on a direct evaluation against the other algorithms, which are in their current form limited to the features of each state for reward approximation. Wider filters, as applied for spatial feature learning, are used to evaluate the performance on raw inputs without manual feature design. For these benchmarks, we apply AdaGrad (Duchi et al., 2011), an approach for stochastic gradient descent with per-parameter adaptive learning rates. Significant parts of the neural network implementation are based on MatConvNet (Vedaldi & Lenc, 2014).
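For reference, the AdaGrad update mentioned above keeps a per-parameter sum of squared gradients and divides each step by its square root; a minimal sketch on a toy quadratic objective (step size and target chosen arbitrarily):

```python
import numpy as np

def adagrad_update(theta, grad, accum, lr=0.5, eps=1e-8):
    """One AdaGrad step: per-parameter learning rates shrink with accumulated gradient energy."""
    accum += grad ** 2
    theta -= lr * grad / (np.sqrt(accum) + eps)
    return theta, accum

theta = np.zeros(3)
accum = np.zeros(3)
for _ in range(500):
    grad = 2 * (theta - np.array([1.0, -2.0, 0.5]))   # gradient of a toy quadratic loss
    theta, accum = adagrad_update(theta, grad, accum)
print(theta)   # close to the minimiser [1, -2, 0.5]
```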

In line with related works, we use the expected value difference as the principal evaluation metric. It is a measure of the suboptimality of the learned policy under the true reward: the score represents the difference between the value function obtained for the optimal policy given the true reward structure and the value function obtained for the optimal policy based on the learned reward model. In addition to the evaluation on each specific training scenario, the trained models are evaluated on a number of randomly generated test environments. The test on these transfer examples serves to analyse each algorithm's ability to generalise to the true reward structure without over-fitting.
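A sketch of how the expected value difference can be computed for a tabular MDP, assuming discounted policy evaluation by a linear solve; the discount factor and the (S, A)-shaped reward array are assumptions for illustration rather than details taken from the benchmarks.

```python
import numpy as np

def policy_value(policy, T, r_true, gamma=0.9):
    """Solve V = R_pi + gamma * P_pi V for a stochastic tabular policy."""
    S = T.shape[0]
    P_pi = np.einsum('sa,sax->sx', policy, T)           # state-to-state transitions under pi
    R_pi = (policy * r_true).sum(1)                     # expected immediate reward per state
    return np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)

def expected_value_difference(pi_opt, pi_learned, T, r_true, gamma=0.9):
    """EVD: value of the truly optimal policy minus value of the policy that is
    optimal under the learned reward, both evaluated under the true reward."""
    v_opt = policy_value(pi_opt, T, r_true, gamma)
    v_learned = policy_value(pi_learned, T, r_true, gamma)
    return (v_opt - v_learned).mean()

# Usage: pi_opt and pi_learned are (S, A) arrays of action probabilities,
# T is the (S, A, S) transition model and r_true the true (S, A) reward.
```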

4.1. Objectworld Benchmark

The Objectworld scenario (Levine et al., 2011) consists of a map of M × M states for M = 32, where possible actions include motions in all four directions as well as staying in place. Two different sets of state features, based on randomly placed coloured objects, are implemented to evaluate the algorithms. For the continuous set, x ∈ R^C: each feature dimension describes the minimum distance to an object of one of C colours. Building on the continuous representation, the discrete set includes C × M binary features, where each dimension indicates whether an object of a given colour is closer than a threshold d ∈ {1, ..., M}.

The reward is positive for cells which are both within distance 3 of colour 1 and distance 2 of colour 2, negative if only within distance 3 of colour 1, and zero otherwise. This is illustrated for a small subset of the state space in Figure 3.
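The following sketch generates features and rewards in the spirit of this description; the grid size, object count, random placement and the use of the Chebyshev distance are illustrative assumptions and not the exact benchmark implementation of (Levine et al., 2011).

```python
import numpy as np

rng = np.random.default_rng(0)
M, C, n_objects = 32, 2, 25

# Randomly place coloured objects on the M x M grid.
obj_pos = rng.integers(0, M, size=(n_objects, 2))
obj_col = rng.integers(0, C, size=n_objects)

# Continuous features: per cell, minimum distance to the nearest object of each
# colour (Chebyshev metric assumed here for simplicity).
ys, xs = np.meshgrid(np.arange(M), np.arange(M), indexing='ij')
cells = np.stack([ys, xs], -1).reshape(-1, 2)                       # (M*M, 2)
dists = np.abs(cells[:, None, :] - obj_pos[None, :, :]).max(-1)     # (M*M, n_objects)
cont_feat = np.stack([np.where(obj_col == c, dists, np.inf).min(1)
                      for c in range(C)], axis=1)                   # (M*M, C)

# Discrete features: thresholded distances, d in {1, ..., M}.
disc_feat = (cont_feat[:, :, None] <= np.arange(1, M + 1)).reshape(len(cells), -1)

# Reward rule: positive within distance 3 of colour 1 AND distance 2 of colour 2,
# negative if only within distance 3 of colour 1, zero otherwise.
near1, near2 = cont_feat[:, 0] <= 3, cont_feat[:, 1] <= 2
reward = np.where(near1 & near2, 1.0, np.where(near1, -1.0, 0.0))
```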

Figure 3: Objectworld benchmark. The true reward is displayed by the brightness of each cell and is based on the surrounding object configuration. Only a subset of colours influences the reward, while the others serve as distracting features. (Legend: optimal policy; example reward-building objects; example distractor objects; reward shown from low to high.)

In line with common benchmarking procedures, we evaluated the algorithms with a set number of features and an increasing number of demonstrations. Additionally, the learned reward functions are deployed on randomly generated transfer scenarios to uncover any overfitting to the training data.

While the original MaxEnt is unable to capture the nonlinear reward structure well, both DeepIRL and GPIRL provide significantly better approximations, as shown in Figure 4. Evaluation of NPB-FIRL on this benchmark was done in (Choi & Kim, 2013), where it showed a similar level of performance to GPIRL. GPIRL generates a good model already with few data points, whereas DeepIRL achieves commensurate performance when the number of available expert demonstrations is increased. The same behaviour is exhibited for both continuous and discrete state features (Fig. 5). The requirement for more training data becomes unimportant in robot applications based on autonomous data acquisition, leaving the lower algorithmic complexity as the dominant advantage of the parametric approach.

Figure 4: Reward reconstruction sample in the Objectworld benchmark, provided N = 64 examples and C = 2 colours with continuous features (panels: ground truth, DeepIRL, GPIRL, MaxEnt). White - high reward; black - low reward.

Additional tests are performed with an increased number of distractor features to evaluate each approach's tendency to overfit. The corresponding figures are omitted due to limited space. Both DeepIRL and GPIRL show robustness to distractor variables, though DeepIRL shows minimally bigger signs of overfitting as the number of distractor variables is increased. This is due to the NN's capacity being brought to bear on the increasing noise introduced by the distractors and will be addressed in future work with additional regularisation methods, such as Dropout (Hinton et al., 2012) and ensemble methods.

Figure 5: Objectworld benchmark. From top left to bottom right: expected value difference (EVD) with C = 2 colours and a varying number of demonstrations N for a) the training and b) the transfer case with continuous features, and subsequently with discrete features in c) & d). Each panel compares GPIRL, MaxEnt and DeepIRL. As the number of demonstrations grows, DeepIRL is able to quickly match the performance of GPIRL on the task.

4.2. Binaryworld Benchmark

In order to test the ability of all approaches to successfully approximate more complex reward structures, the Binaryworld benchmark is presented. This test scenario is similar to Objectworld, but in this problem every state is randomly assigned one of two colours (blue or red). The feature vector for each state consequently consists of a binary vector of length 9, encoding the colour of each cell in its 3x3 neighbourhood. The true reward structure for a particular state is fully determined by the number of blue states in its local neighbourhood: it is positive if exactly four of the nine neighbouring states are blue, negative if exactly five are blue, and zero otherwise. The main difference compared to the Objectworld scenario is that a single feature value does not carry much weight; rather, higher-order relationships amongst the features determine the reward.

Figure 6: Value differences observed in the Binaryworld benchmark for GPIRL, MaxEnt, NPB-FIRL and DeepIRL for the training scenario (left) and the transfer task (right).

Since the reward depends on a higher-level representation of the basic features - that is to say, the number of specific features - this case is arguably more challenging than the original Objectworld experiment, and good performance on this benchmark implies an algorithm's ability to learn and capture this complex relationship.
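The Binaryworld features and reward admit an equally compact description. The sketch below assumes a wrap-around 3x3 neighbourhood purely to avoid border cases; this is an implementation convenience, not something specified in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
M = 32
colour = rng.integers(0, 2, size=(M, M))     # 1 = blue, 0 = red (assumed encoding)

features = np.zeros((M, M, 9), dtype=int)
reward = np.zeros((M, M))
for y in range(M):
    for x in range(M):
        # 3x3 neighbourhood, wrapping around the grid borders.
        patch = colour[np.ix_(np.arange(y - 1, y + 2) % M,
                              np.arange(x - 1, x + 2) % M)]
        features[y, x] = patch.ravel()       # binary feature vector of length 9
        n_blue = patch.sum()                 # the reward depends only on this count
        reward[y, x] = 1.0 if n_blue == 4 else (-1.0 if n_blue == 5 else 0.0)
```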

The performance of DeepIRL compared to GPIRL, linear MaxEnt and NPB-FIRL is depicted in Fig. 6. In this increasingly complex scenario, DeepIRL is able to learn the higher-order dependencies between features, whereas GPIRL struggles as the inherent kernel measure cannot correctly relate the reward of different examples with the similarity in their state features. GPIRL needs a larger number of demonstrations to achieve good performance and to determine an accurate estimate of the reward for all 2^9 possible feature combinations.

Perhaps surprising is the comparatively low performance of the NPB-FIRL algorithm. This can be explained by the limitations of that framework: the true reward in this scenario cannot be efficiently described by the logical conjunctions used. In fact, it would require 2^9 different logical conjunctions, one for each possible combination of features, to accurately model the reward in this framework.

Fig. 7 shows the reconstruction of the reward structures estimated by DeepIRL, MaxEnt and GPIRL. While GPIRL was able to reconstruct the correct reward for some of the states whose features it had encountered before, it provides inaccurate rewards for states which were never encountered. It produces an overall too smooth reward function due to the assumptions and priors in the GP approximation. DeepIRL, on the other hand, is able to reconstruct the reward with high accuracy, demonstrating its ability to effectively learn the highly varying structure of the underlying function.

Figure 7: Reward reconstruction sample for the Binaryworld benchmark provided N = 128 demonstrations (panels: ground truth, DeepIRL, GPIRL, MaxEnt). White - high reward; black - low reward.

4.3. Spatial Feature Learning

While the earlier benchmarks visualise performance compared to current algorithms in the context of precomputed features, the approach can be extended via the use of wider filters to eliminate the requirement of preprocessing or manual feature design. Figure 8 presents the results for both earlier benchmarks, but instead of using the previously described feature representations, the FCNN builds the reward from the raw input representation, which for each state only encodes the presence of each specific object at that state. All spatial information is derived by the convolutional filters. Given the simplicity of the benchmarks, we employed a five-layer network with 3x3 convolutional kernels in the first two layers. By increasing the depth of the network and including convolutional filters, we add enough capacity to enable the learning of features as well as their combination into the reward function within the same architecture and training process.
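To illustrate the idea behind the wider filters, the sketch below computes a reward map directly from a raw per-cell input using two 3x3 convolutional layers followed by a linear per-cell output; the channel widths, depth and initialisation are assumptions and do not reproduce the five-layer network described above.

```python
import numpy as np

def conv3x3(x, W, b):
    """'Same' 3x3 convolution with ReLU; x is (H, W, C_in), W is (3, 3, C_in, C_out)."""
    H, Wd, _ = x.shape
    xp = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros((H, Wd, W.shape[-1]))
    for dy in range(3):
        for dx in range(3):
            out += xp[dy:dy + H, dx:dx + Wd] @ W[dy, dx]
    return np.maximum(out + b, 0.0)

rng = np.random.default_rng(0)
M, n_raw = 32, 3                    # raw per-cell input, e.g. object presence per colour
x = rng.integers(0, 2, size=(M, M, n_raw)).astype(float)

W1, b1 = rng.normal(scale=0.1, size=(3, 3, n_raw, 8)), np.zeros(8)
W2, b2 = rng.normal(scale=0.1, size=(3, 3, 8, 8)), np.zeros(8)
w_out = rng.normal(scale=0.1, size=8)

h = conv3x3(conv3x3(x, W1, b1), W2, b2)     # spatial features learned from the raw input
reward_map = h @ w_out                      # 1x1 linear layer: one reward per cell
print(reward_map.shape)                     # (32, 32), same spatial size as the input
```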

Due to the increased number of parameters, the approach requires additional training data to perform at equal accuracy, but with an increasing number of expert samples it converges towards the performance obtained with predefined features. Since the given features in these simplified toy problems are optimal and the true reward is calculated directly from them, automatically learned features cannot exceed this performance. However, in real-world scenarios, the compression of raw data - such as images - to feature representations leads to information loss, and the learning of task-relevant features becomes even more important.

Figure 8: Application of convolutional layers for spatial feature learning on the Objectworld (left) and Binaryworld (right) benchmarks, comparing DeepIRL with predefined features against DeepIRL&CNN on raw input. Spatial feature learning quickly converges to the performance achieved with optimally designed features.

5. Conclusion and Future Work

This paper presents Maximum Entropy Deep IRL, a framework exploiting FCNNs for reward structure approximation in Inverse Reinforcement Learning. Neural networks lend themselves naturally to this task as they combine representational power with computational efficiency compared to state-of-the-art methods. Unlike prior art in this domain, DeepIRL can therefore be applied in cases where complex reward structures need to be modelled for large state spaces. Moreover, training can be achieved effectively and efficiently within the popular Maximum Entropy IRL framework. A further advantage of DeepIRL lies in its versatility: custom network architectures and types can be developed for any given task while exploiting the same cost function in training. This is demonstrated in Section 4.3, where convolutional filters are applied to eliminate the need for manual feature design.

Our experiments show that DeepIRL's performance is commensurate with the state-of-the-art on a common benchmark. While exhibiting slightly increased training data requirements in this benchmark, a principal strength of the approach lies in its algorithmic complexity being independent of the number of demonstration samples. Therefore, it is particularly well-suited for life-long learning scenarios in the context of robotics, which inherently provide sufficient amounts of training data. We also provide an alternative evaluation on a new benchmark with a significantly more complex reward structure, where DeepIRL significantly outperforms the current state-of-the-art and proves its strong capability in modelling the interaction between features. Furthermore, we extend the approach to wider filters in order to eliminate the dependency on precomputed features and to emphasise the adaptability of framing IRL in the context of deep learning.

In future work we will explore the benefits of autoencoder-style pretraining to reduce the increased demand for expert demonstrations when employing wider convolutional filters. Especially when working with more complex inputs such as raw image data, easily available unsupervised training data will help to learn features which then only need to be refined during the supervised IRL-based training phase. Due to the variety of existing work on FCNN architectures mentioned in Section 1, we expect to benefit from applying more complex networks to real-life problems, such as the skip architecture of (Long et al., 2014), which concatenates fine structural information with coarser higher-level features in the final regression layer to improve performance when evaluating features at multiple scales. Furthermore, other methods for optimising the demonstration data likelihood, such as the one given by (Babes et al., 2011), will be evaluated.

References

Abbeel, Pieter and Ng, Andrew Y. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the Twenty-First International Conference on Machine Learning, pp. 1. ACM, 2004.

Argall, Brenna D., Chernova, Sonia, Veloso, Manuela, and Browning, Brett. A survey of robot learning from demonstration. Robotics and Autonomous Systems, 57(5):469–483, 2009.

Asoh, Hideki, Shiro, Masanori, Akaho, Shotaro, Kamishima, Toshihiro, Hasida, Koiti, Aramaki, Eiji, and Kohro, Takahide. An application of inverse reinforcement learning to medical records of diabetes treatment. 2013.

Babes, Monica, Marivate, Vukosi, Subramanian, Kaushik, and Littman, Michael L. Apprenticeship learning about multiple intentions. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 897–904, 2011.

Baker, Chris L., Saxe, Rebecca, and Tenenbaum, Joshua B. Action understanding as inverse planning. Cognition, 113(3):329–349, 2009.

Bengio, Yoshua. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1–127, 2009.

Bengio, Yoshua, LeCun, Yann, et al. Scaling learning algorithms towards AI. Large-Scale Kernel Machines, 34(5), 2007.

Bengio, Yoshua, Courville, Aaron C., and Vincent, Pascal. Unsupervised feature learning and deep learning: A review and new perspectives. CoRR, abs/1206.5538, 2012. URL http://arxiv.org/abs/1206.5538.

Choi, Jaedeug and Kim, Kee-Eung. Bayesian nonparametric feature construction for inverse reinforcement learning. In Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence, pp. 1287–1293. AAAI Press, 2013.

Duchi, John, Hazan, Elad, and Singer, Yoram. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 12:2121–2159, 2011.

Hassoun, Mohamad H. Fundamentals of Artificial Neural Networks. MIT Press, 1995.

Hinton, Geoffrey E., Srivastava, Nitish, Krizhevsky, Alex, Sutskever, Ilya, and Salakhutdinov, Ruslan. Improving neural networks by preventing co-adaptation of feature detectors. CoRR, abs/1207.0580, 2012. URL http://arxiv.org/abs/1207.0580.

Hornik, Kurt, Stinchcombe, Maxwell, and White, Halbert. Multilayer feedforward networks are universal approximators. Neural Networks, 2(5):359–366, 1989.

Kitani, Kris, Ziebart, Brian, Bagnell, James, and Hebert, Martial. Activity forecasting. Computer Vision – ECCV 2012, pp. 201–214, 2012.

Levine, Sergey, Popovic, Zoran, and Koltun, Vladlen. Feature construction for inverse reinforcement learning. In Advances in Neural Information Processing Systems, pp. 1342–1350, 2010.

Levine, Sergey, Popovic, Zoran, and Koltun, Vladlen. Nonlinear inverse reinforcement learning with Gaussian processes. In Advances in Neural Information Processing Systems, pp. 19–27, 2011.

Levine, Sergey, Finn, Chelsea, Darrell, Trevor, and Abbeel, Pieter. Learning deep vision-based costs and policies. Robotics: Science and Systems, Workshop on Learning from Demonstration, 2015.

Liu, Fayao, Shen, Chunhua, Lin, Guosheng, and Reid, Ian D. Learning depth from single monocular images using deep convolutional neural fields. CoRR, abs/1502.07411, 2015. URL http://arxiv.org/abs/1502.07411.

Long, Jonathan, Shelhamer, Evan, and Darrell, Trevor. Fully convolutional networks for semantic segmentation. CoRR, abs/1411.4038, 2014. URL http://arxiv.org/abs/1411.4038.

Lopes, Manuel, Melo, Francisco, and Montesano, Luis. Active learning for reward estimation in inverse reinforcement learning. In Machine Learning and Knowledge Discovery in Databases, pp. 31–46. Springer, 2009.


Mnih, Volodymyr, Kavukcuoglu, Koray, Silver, David, Graves, Alex, Antonoglou, Ioannis, Wierstra, Daan, and Riedmiller, Martin. Playing Atari with deep reinforcement learning. CoRR, abs/1312.5602, 2013.

Neu, Gergely and Szepesvari, Csaba. Apprenticeship learning using inverse reinforcement learning and gradient methods. CoRR, abs/1206.5264, 2012.

Ramachandran, Deepak and Amir, Eyal. Bayesian inverse reinforcement learning. Urbana, 51:61801, 2007.

Ratliff, Nathan D., Bagnell, J. Andrew, and Zinkevich, Martin A. Maximum margin planning. In Proceedings of the 23rd International Conference on Machine Learning, pp. 729–736. ACM, 2006.

Sermanet, Pierre, Eigen, David, Zhang, Xiang, Mathieu, Michael, Fergus, Rob, and LeCun, Yann. OverFeat: Integrated recognition, localization and detection using convolutional networks. CoRR, abs/1312.6229, 2013. URL http://arxiv.org/abs/1312.6229.

Snyman, Jan. Practical Mathematical Optimization: An Introduction to Basic Optimization Theory and Classical and New Gradient-Based Algorithms, volume 97. Springer Science & Business Media, 2005.

Syed, Umar and Schapire, Robert E. A game-theoretic approach to apprenticeship learning. In Advances in Neural Information Processing Systems, pp. 1449–1456, 2007.

Tompson, Jonathan, Jain, Arjun, LeCun, Yann, and Bregler, Christoph. Joint training of a convolutional network and a graphical model for human pose estimation. CoRR, abs/1406.2984, 2014. URL http://arxiv.org/abs/1406.2984.

Vedaldi, A. and Lenc, K. MatConvNet – convolutional neural networks for MATLAB. CoRR, abs/1412.4564, 2014.

Ziebart, Brian D., Maas, Andrew L., Bagnell, J. Andrew, and Dey, Anind K. Maximum entropy inverse reinforcement learning. In AAAI, pp. 1433–1438, 2008.