
Invariant Transform Experience Replay

Yijiong Lin 1, Jiancong Huang 1, Matthieu Zimmer 2, Juan Rojas 1, Paul Weng 2

Abstract— Deep reinforcement learning (DRL) is a promising approach for adaptive robot control, but its application to robotics is currently hindered by high sample requirements. We propose two novel data augmentation techniques for DRL based on invariant transformations of trajectories in order to reuse observed interactions more efficiently. The first one, called Kaleidoscope Experience Replay, exploits reflectional symmetries, while the second, called Goal-augmented Experience Replay, takes advantage of lax goal definitions. In the Fetch tasks from OpenAI Gym, our experimental results show a large increase in learning speed.

I. INTRODUCTION

Deep reinforcement learning (DRL) has demonstrated great promise in recent years [1], [2]. However, despite being shown to be a viable approach in robotics [3], [4], DRL still suffers from low sample efficiency in practice—an acute issue in robot learning.

Given how critical this issue is, many diverse propositions have been presented. For brevity, we only recall those most related to our work. A first idea is to better utilize observed samples, e.g., memory replay [5] or hindsight experience replay (HER) [6]. Although better observation reuse does not reduce the sample requirements of a DRL algorithm, it decreases the number of actual interactions with the environment needed to obtain a satisfactory behavior, which is the most important factor in robot learning. Another idea is to exploit any a priori domain knowledge one may have (e.g., symmetry [7]) to support learning. Besides, a robot is generally expected to solve not only one fixed task, but multiple related ones. Multi-task reinforcement learning [8] is considered beneficial as it would not be feasible to repeatedly solve each encountered task tabula rasa. Finally, in order to avoid or reduce learning on an actual robot, recent works have investigated how policies learned in a simulator can be transferred to a real robot [9].

In this work, in order to further improve the data efficiency of DRL, we propose two novel data augmentation methods that better reuse the samples observed in the true environment by exploiting the symmetries (i.e., any invariant transformations) of the problem. For simplicity, we present them in the same set-up as HER [6], although the techniques can be instantiated in other settings.

The first technique, Kaleidoscope Experience Replay (KER), is based on reflectional symmetry: it posits that trajectories in robotic spaces can be reflected in multiple ways and remain in the feasible workspace. Namely, for a given

1 Guangdong University of Technology, 2 Shanghai Jiao Tong University. Corresponding author: [email protected]

robotic problem, there is a set of reflections that can generate, from any observed trajectory, many new artificial feasible ones for training. For concreteness, in this paper we focus on reflections with respect to hyperplanes.

The second technique, Goal-augmented Experience Replay (GER), can be seen as a generalization of HER: any artificial goal g generated by HER can instead be replaced by a random goal sampled in a small ball around g. This idea takes advantage of tasks where success is defined as reaching a final pose within a distance of the goal set by a threshold (such tasks are common in robotics). Here, trajectories are augmented using the invariant transformation that consists in changing actual goals to random goals close to reached ones.

In Sec. II, we present the necessary background (DRL and HER) for our work. Sec. III introduces related work that seeks to increase data efficiency. Sec. IV details our two invariant transform data augmentation techniques. Sec. V presents experimental results on OpenAI Gym Fetch tasks [10], which demonstrate the effectiveness of our propositions. Sec. VI and Sec. VII highlight key lessons.

II. BACKGROUND

In this work, we consider robotic tasks that are modeled as multi-goal Markov decision processes [11] with continuous state and action spaces: 〈S, A, G, T, R, p, γ〉 where S is a continuous state space, A is a continuous action space, G is a set of goals, T is the unknown transition function that describes the effects of actions, and R(s, a, s′, g) is the immediate reward when reaching state s′ ∈ S after performing action a ∈ A in state s ∈ S if the goal were g ∈ G. Finally, p is a joint probability distribution over initial states and initial goals, and γ ∈ (0, 1) is a discount factor. In this framework, the robot learning problem corresponds to an RL problem that aims at obtaining a policy π : S × G → A such that the expected discounted sum of rewards is maximized for any given goal.

Due to the continuity of the state-action spaces, this optimization problem is usually restricted to a class of parameterized policies. In DRL, the parameterization is defined by the neural network architecture. To learn such continuous policies, actor-critic algorithms [12] are efficient iterative methods since they can reduce the variance of the estimated gradient using simultaneously learned value functions. DDPG (Deep Deterministic Policy Gradient) [13] is a state-of-the-art model-free, off-policy DRL algorithm that learns a deterministic policy, which is desirable in robotic tasks. In DDPG, transitions are collected into a replay buffer and later used to update the action-value function in a semi-gradient way and the policy with the deterministic policy gradient [14].


Because the policy has to adapt to multiple goals, as in HER, we rely on universal value functions [11]: the classic inputs of the value function and the policy of DDPG are augmented with the desired goal.

When the reward function is sparse, as assumed here, the RL problem is particularly hard to solve. Specifically, we consider reward functions described as follows:

R(s, a, s′, g) = 1[d(s′, g) ≤ εR] − 1 (1)

where 1 is the indicator function, d is a distance, and εR > 0 is a fixed threshold.
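For concreteness, here is a minimal sketch (ours, not the authors' code) of this sparse reward with a Euclidean distance d; the function and argument names are illustrative:

import numpy as np

# Sparse reward of Eqn. (1): 0 if the achieved position is within eps_R of
# the goal, -1 otherwise. Positions and goals are assumed to be 3D arrays.
def sparse_reward(achieved, goal, eps_R):
    d = np.linalg.norm(np.asarray(achieved) - np.asarray(goal))
    return 0.0 if d <= eps_R else -1.0  # 1[d(s', g) <= eps_R] - 1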

To tackle this issue, HER is based on the following principle: any trajectory that failed to reach its goal still carries useful information; it has at least reached the states of its trajectory path. Using this natural and powerful idea, memory replay can be augmented with the failed trajectories by changing their goals in hindsight and computing the new associated rewards.
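The relabeling step can be sketched as follows (a minimal sketch of ours, not the authors' implementation): a transition is copied with its goal replaced by a position achieved later in the episode, and the sparse reward of Eqn. (1) is recomputed. The transition layout is an illustrative assumption.

import numpy as np

def her_relabel(s, a, s_next, achieved_pos, new_goal, eps_R):
    """Return a hindsight copy of a transition: the original goal is replaced
    by new_goal (a position achieved later in the same episode) and the
    sparse reward is recomputed for the position achieved at s_next."""
    d = np.linalg.norm(np.asarray(achieved_pos) - np.asarray(new_goal))
    new_reward = 0.0 if d <= eps_R else -1.0
    return (s, a, s_next, np.asarray(new_goal), new_reward)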

In the robotics tasks solved by HER, the states are generally defined as s = (xs, ys, zs, αs, βs, γs), where xs, ys, zs represent the gripper position and αs, βs, γs describe its orientation. The actions are defined as a = (xa, ya, za, dgrip), where xa, ya, za represent the new position that the gripper should reach and dgrip is the desired distance between the two fingers of the gripper.

III. RELATED WORK

HER [6], [8] has been extended in various ways. Prioritized replay was incorporated in HER to learn from more valuable episodes with higher priority [15]. In [16], HER was generalized to deal with dynamic goals. In [17], a variant of HER was also investigated where completely random goals replace achieved goals, and in [18], it was adapted to work with on-policy RL algorithms. All these extensions are orthogonal to our work and could easily be combined with KER. We leave these for future work.

Symmetry has been considered in MDPs [19] and RL [20]–[24]. It can be known a priori or learned [22]. In this work, we assume the former, which is reasonable in many robotics tasks. A natural approach to exploit symmetry in sequential decision-making is by aggregating states that satisfy an equivalence relation induced by some symmetry [19], [20]. Another related approach takes into account symmetry in the policy representation [24]. Doing so reduces representation size and generally leads to faster solution times. However, the state-aggregated representation may be difficult to recover, especially if many symmetries are considered simultaneously. Still another approach is to use symmetry during training instead. One simple idea is to learn the Q-function by performing an additional symmetrical update [21]. Another method is to augment the training points with their reflections [23]. In this paper, we generalize this idea further as a data augmentation technique where many symmetries can be considered and pairs of symmetrical updates do not need to be applied simultaneously.

To the best of our knowledge, data augmentation has not been considered much to accelerate learning in RL. It has, however, been used extensively and with great success in machine learning [25] and in deep learning [26]. Interestingly, symmetries can also be exploited in neural network architecture design [27]. However, in our case, the integration of symmetry in deep networks will be left as future work.

IV. INVARIANT TRANSFORMATIONS FOR RL

To reduce the number of interactions with the real environment, we propose to leverage symmetries (i.e., any invariant transformations) in the space of feasible trajectories in order to generate artificial training data from actual trajectories collected during the robot's learning.

We now present this idea more formally. We start with some notations and definitions. A trajectory τ of length h with goal g ∈ G can be written as 〈g, (s0, a1, r1, s1, a2, r2, s2, . . . , sh)〉 where s0 ∈ S, ∀i = 1, . . . , h, ai ∈ A, si ∈ S, and ri = R(si−1, ai, si, g). We assume that all trajectories have a length not larger than H ∈ N, which is true in robotics. The set of all trajectories is denoted Γ̄ = ∪_{h=1}^H G × S × (A × R × S)^h.

A trajectory τ is said to be feasible if for i = 1, . . . , h, T(si−1, ai, si) > 0. The set of feasible trajectories is denoted Γ ⊆ Γ̄. A trajectory τ of length h with goal g is said to be successful if R(τ) > Rmin where R(τ) = Σ_{i=1}^h γ^{i−1} R(si−1, ai, si, g) and Rmin = min_{τ′∈Γ} R(τ′). The set of successful trajectories is denoted Γ+ ⊆ Γ.

We can now define the different notions of symmetries that we use in this paper. A symmetry of Γ̄ is a one-to-one mapping σ : Γ̄ → Γ̄ such that σ(Γ) = Γ where σ(Γ) = {τ ∈ Γ̄ | ∃τ′ ∈ Γ, σ(τ′) = τ}. In words, a symmetry of Γ̄ leaves the set of feasible trajectories invariant. As we only apply symmetries to feasible trajectories, we directly consider their restrictions to Γ and keep the same notation, i.e., σ : Γ → Γ. A decomposable symmetry is a symmetry σ such that there exist one-to-one mappings σG : G → G, σS : S → S, and σA : A → A that satisfy for any τ ∈ Γ:

σ(τ) = 〈g′, (s′0, a′1, r′1, s′1, a′2, r′2, s′2, . . . , s′h)〉 (2)

where τ = 〈g, (s0, a1, r1, s1, . . . , sh)〉, g′ = σG(g), s′0 = σS(s0), ∀i = 1, . . . , h, a′i = σA(ai), s′i = σS(si), and r′i = R(σS(si−1), σA(ai), σS(si), σG(g)). A reward-preserving symmetry σ : Γ → Γ is a symmetry such that for any trajectory τ ∈ Γ, the rewards appearing in τ are exactly the same as those in σ(τ), in the same order. The previous definitions of symmetries can be naturally applied to the set of successful trajectories Γ+ as well.

Besides, note that any number of symmetries induces a group structure (i.e., they can be composed). Given n fixed symmetries and a trajectory, one could possibly generate up to 2^n − 1 new trajectories. This property is useful if one only knows a fixed number of symmetries for a given problem, because a recursive application of those symmetries could lead to an exponential increase of trajectories that could be used for training.
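To illustrate the 2^n − 1 count, here is a small sketch (ours) that composes every non-empty subset of n known symmetries, each given as a function mapping a trajectory to a trajectory, in a fixed order:

from itertools import combinations

def augment_with_symmetries(trajectory, symmetries):
    """Apply every non-empty subset of `symmetries` (composed in a fixed
    order) to `trajectory`, producing up to 2**len(symmetries) - 1 new
    trajectories (fewer if some compositions coincide)."""
    augmented = []
    for k in range(1, len(symmetries) + 1):
        for subset in combinations(symmetries, k):
            tau = trajectory
            for sigma in subset:  # compose the chosen symmetries
                tau = sigma(tau)
            augmented.append(tau)
    return augmented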


Fig. 1. ITER framework overview: real and symmetrically transformed transitions are stored in the replay buffer. Sampled minibatches are then augmented with GER before updating the policy. (The block diagram shows the environment exchanging states and actions with the policy, real and KER-augmented transitions entering the replay buffer, and GER applied to sampled minibatches before the policy update.)

As a general approach to increase data efficiency in DRL, one can leverage the symmetries of Γ or Γ+ for data augmentation. As an illustration, we propose ITER (Invariant Transform Experience Replay), an architecture that combines our two proposed techniques (explained in detail below):

• Kaleidoscope Experience Replay (KER) is based on decomposable symmetries of Γ.

• Goal-Augmented Experience Replay (GER) is based on reward-preserving decomposable symmetries of Γ+, but can be applied to all feasible trajectories in the same fashion as HER.

While the two methods are combined here, only one of them could be used instead. An overview of our architecture is illustrated in Fig. 1.

A. Kaleidoscope Experience Replay (KER)

KER uses reflectional symmetry1. Consider a 3D workspace with a bisecting plane xoz as shown in Fig. 2. If a feasible trajectory is generated in the workspace (red in Fig. 2), natural symmetry would then yield a new feasible trajectory reflected on this plane. More generally, the xoz plane may be rotated by some angle θz along axis ~z and still define an invariant symmetry for the robotic task.

We can now precisely define KER, which amounts to augmenting any original trajectory with a certain number of random reflectional symmetries. To ensure the generation of feasible symmetrical trajectories, we can first derive the maximum valid angle θmax according to the relative pose between the robot's workspace and the region where objects and goals are located in any given robotic manipulation task. The hyperparameter nKER denotes the number of symmetry planes KER applies to generate reflected trajectories. We assume that the center of the robot base is positioned at the origin of the world coordinate frame. For any given nKER, there is always a symmetry plane that is parallel to the xoz plane and passes through the center of the robot arm base; we refer to this decomposable symmetry as σxoz : Γ → Γ. Thus, the number of new trajectories generated is 2nKER − 1. If nKER = 1, KER directly applies σxoz to obtain the new trajectory, with:

σxozS(s) = σxozS(〈xs, ys, zs, αs, βs, γs〉) = 〈xs, −ys, zs, −αs, βs, −γs〉,

1 Though more general invariant transformations (e.g., rotation, translation) could also be used in place of reflectional symmetry.

Fig. 2. Kaleidoscope Experience Replay leverages natural symmetry (top). Feasible trajectories are reflected with respect to plane xoz, which can itself be rotated by some θz along axis ~z. The bottom figure illustrates this with xoz rotated twice to obtain the orange and green planes. The three symmetries are then applied to the red trajectory to obtain five new ones.

σxozA(a) = σxozA(〈xa, ya, za, dgrip〉) = 〈xa, −ya, za, dgrip〉,

σxozG(g) = σxozG(〈xg, yg, zg〉) = 〈xg, −yg, zg〉,

where the 3-tuple 〈x, y, z〉 denotes the position, the 3-tuple 〈α, β, γ〉 denotes the orientation, and dgrip is the distance between the two fingers of the gripper. If nKER > 1, KER first generates a set of symmetry planes obtained by rotating the xoz plane along axis ~z by random angles (sampled from a uniform distribution) Θ = {θj | θj ∈ (0, θmax], j = 1, 2, . . . , nKER − 1}. For each plane, the associated decomposable symmetry is denoted σθj : Γ → Γ, and the set of these symmetries is ΣΘ = {σθj | j = 1, 2, . . . , nKER − 1}. The decomposition of σθj is:

σθjS(s) = σθjS(〈xs, ys, zs, αs, βs, γs〉) = 〈Rθt(〈xs, ys, zs〉), Rθr(〈αs, βs, γs〉)〉,

σθjG(g) = σθjG(〈xg, yg, zg〉) = 〈Rθt(〈xg, yg, zg〉)〉,

σθjA(a) = σθjA(〈xa, ya, za, dgrip〉) = 〈Rθt(〈xa, ya, za〉), dgrip〉,

where

Rθt(〈xs, ys, zs〉) = Rz(θ)〈xs, ys, zs〉T = 〈xs′, ys′, zs′〉,


Rθr(〈αs, βs, γs〉) = Car(Rz(θ)Rz(γs)Ry(βs)Rx(αs)) = Car(Rz(γs′)Ry(βs′)Rx(αs′)) = 〈αs′, βs′, γs′〉,

where Rx, Ry, Rz are the standard rotation matrices about the x, y, and z axes respectively, and Car(·) maps a rotation matrix to the corresponding Cardan angles (a Cardan decomposition is a sequence of relative rotations about the x, y, and z axes, in that order; relative means that after rotating about x, the next rotation is performed about the new y axis, and likewise for z).
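The following sketch (ours, not the released implementation) illustrates the two building blocks above on the simplified state/action/goal layout of Sec. II: the reflection σxoz and the position rotation Rθt. Orientation handling via Car(·) is omitted for brevity, and all names are illustrative.

import numpy as np

def rot_z(theta):
    """Standard rotation matrix Rz(theta) about the z axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0],
                     [s,  c, 0.0],
                     [0.0, 0.0, 1.0]])

def reflect_xoz(state, action, goal):
    """sigma^xoz on the simplified layout: positions have their y coordinate
    negated, the roll/yaw angles flip sign, and the gripper opening is kept."""
    xs, ys, zs, al, be, ga = state
    xa, ya, za, dgrip = action
    xg, yg, zg = goal
    return ((xs, -ys, zs, -al, be, -ga),
            (xa, -ya, za, dgrip),
            (xg, -yg, zg))

def rotate_about_z(position, theta):
    """R^theta_t: rotate a 3D position about the z axis (world origin at the
    robot base, as assumed in the text)."""
    return tuple(rot_z(theta) @ np.asarray(position, dtype=float))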

Note that instead of storing the reflected trajectories in the replay buffer, the random symmetries could instead be applied to sampled minibatches. This approach was tried previously for single-symmetry scenarios [7]. Doing so, however, is more computationally taxing, as transitions are reflected every time they are sampled, and more significantly it leads to lower performance2. Our conjecture is that such an approach leads to a lower diversity in the minibatches.

B. Goal-Augmented Experience Replay (GER)

GER exploits the formulation of any reward function (see Eqn. 1) that defines a successful trajectory as one whose end position is within a small radial threshold (a ball) centered around the goal. When the robot obtains a trajectory, we therefore know that it can in fact be considered successful for any goal within a ball centered around its end position. Based on this observation, GER augments trajectories by replacing the original goal with a random goal sampled within that ball. This ball can be formally described as B(sh, ε) = {g ∈ G | d(sh, g) ≤ ε} where sh is the state reached in the original trajectory and ε < εR is a threshold.

Formally, GER is based on reward-preserving decomposable symmetries of Γ+ where σS and σA are identity mappings and σG is randomly chosen, conditional on a trajectory τ reaching some state sh, in the following set: {ρ : G → G | ∀g ∈ G, ρ(g) ∈ B(sh, ε)}. Interestingly, such symmetries, when viewed as mappings from Γ to Γ+, can be applied to the whole set of feasible trajectories to generate successful trajectories, which we do in our architecture. In this sense, GER is a generalization of HER and can be implemented in the same fashion. Thus our architecture, like HER, also operates on minibatches. For each application of GER, the size of the minibatch is increased with the new artificial transitions.
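A minimal sketch (ours) of one GER application, assuming a Euclidean distance and the same transition layout as the earlier sketches: a new goal is drawn uniformly in the ball B(sh, ε) around a position sh reached by the trajectory, and the sparse reward of Eqn. (1) is recomputed for the transition.

import numpy as np

def sample_goal_in_ball(center, eps, rng):
    """Sample a goal uniformly in the ball B(center, eps): uniform random
    direction, radius scaled by U**(1/dim) for uniformity in volume."""
    center = np.asarray(center, dtype=float)
    direction = rng.normal(size=center.shape)
    direction /= np.linalg.norm(direction)
    radius = eps * rng.uniform() ** (1.0 / center.size)
    return center + radius * direction

def ger_relabel(s, a, s_next, achieved_pos, reached_pos, eps, eps_R, rng):
    """GER: replace the goal with a random goal sampled close to a position
    reached_pos attained by the trajectory, then recompute the sparse reward
    of Eqn. (1) for this transition."""
    new_goal = sample_goal_in_ball(reached_pos, eps, rng)
    d = np.linalg.norm(np.asarray(achieved_pos) - new_goal)
    return (s, a, s_next, new_goal, 0.0 if d <= eps_R else -1.0)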

V. EXPERIMENTS AND RESULTS

Our experimental evaluation is performed according to the same set-up as for the evaluation of HER [6]. Namely, we use a simulated 7-DOF Fetch Robotics arm trained with DDPG on the pushing, sliding, and pick-and-place tasks from OpenAI Gym [10].

An epoch is defined as a fixed-size set of successive episodes. Since a trajectory, which is also defined as a single episode, can be qualified as successful (Sec. IV), we can compute a success rate over an epoch by counting the number of successful episodes. In order to highlight the difference between HER and ITER in the earlier learning stage, we use a smaller number of episodes to define an epoch (1 epoch = 100 episodes) than in the HER paper (1 epoch = 800 episodes).

2 Please visit our project page for this and more supplementary information: www.JuanRojas.net/ker.

The neural network learning the action-value function is composed of 3 hidden layers with 256 units and ReLU activation functions. The output neuron predicting the Q-value does not have an activation function. The actor network approximating the policy is also composed of 3 hidden layers with 256 units and ReLU activation functions. The activation function of its output layer is tanh.
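A minimal sketch (ours, in PyTorch) of these networks, with the goal concatenated to the state as in universal value functions; the dimensions below are placeholders, not the actual Fetch dimensions.

import torch
import torch.nn as nn

def mlp(in_dim, out_dim, out_act=None):
    layers, h = [], in_dim
    for _ in range(3):                      # 3 hidden layers of 256 ReLU units
        layers += [nn.Linear(h, 256), nn.ReLU()]
        h = 256
    layers.append(nn.Linear(h, out_dim))    # linear output (e.g., for the Q-value)
    if out_act is not None:
        layers.append(out_act)              # tanh output for the actor
    return nn.Sequential(*layers)

state_dim, goal_dim, action_dim = 10, 3, 4  # placeholder dimensions
actor = mlp(state_dim + goal_dim, action_dim, out_act=nn.Tanh())
critic = mlp(state_dim + goal_dim + action_dim, 1)

s, g, a = torch.randn(1, state_dim), torch.randn(1, goal_dim), torch.randn(1, action_dim)
action = actor(torch.cat([s, g], dim=-1))          # pi(s, g)
q_value = critic(torch.cat([s, g, a], dim=-1))     # Q(s, g, a)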

The provided learning curves (Figs. 3-6) display the testing performance of the target networks of DDPG. It means that the exploration has been disabled and the policy is fully deterministic. After each learning epoch, the testing success rate is computed over 10 episodes. We display an average over 5 random seeds for each curve.

During a learning epoch, the exploration strategy, the learning rates, the soft update ratio, and the replay buffer size are all the same as in HER.

Like HER, GER uses the "future" strategy to choose which reached state sh will be the center of the ball for picking random goals: a random reached state that appeared later in the same trajectory as the current transition.
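A small sketch (ours) of this "future" selection, assuming the positions achieved at each time step of the episode are stored in a list; the helper name is illustrative.

import numpy as np

def sample_future_center(achieved_positions, t, rng):
    """'Future' strategy: for the transition at time step t, pick the position
    achieved at a random later step of the same episode; it serves as the
    center of the ball from which GER samples a goal. Assumes t is not the
    last step of the episode."""
    future_idx = rng.integers(t + 1, len(achieved_positions))
    return achieved_positions[future_idx]

# Illustrative usage: rng = np.random.default_rng(0), then
# center = sample_future_center(achieved_positions, t, rng)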

The environment difficulty is the same as defined in OpenAI Gym (harder than the one in the original HER paper): εR is fixed to 0.05 cm.

We design our experiments to demonstrate the effectiveness of our propositions and to answer the following questions:

• How does ITER (GER+KER) perform compared to HER on single- and multi-goal tasks?

• How much does KER contribute to the performance of ITER? How many symmetry planes nKER should be used?

• What is the contribution of GER to the performance of ITER? What is the impact of nGER?

A. Does ITER improve performance with respect to HER?

The best combination uses nKER = 8 random symmetries for KER and 4 applications of GER (where one of them uses a threshold ε equal to zero in order to also take full advantage of realized goals). We show that ITER outperforms HER in data efficiency on different robotic tasks, in both the multi-goal (Fig. 3) and single-goal (Fig. 4) settings.

The slide task is a difficult task where the result is determined by only a few contacts (generally one) between the gripper and the box. This makes it hard to learn and generalize, which would explain why ITER does not improve as much over HER. The pick-and-place task in the single-goal setting is also difficult. Without using any training trick as in the HER paper, HER is not able to learn anything, while our approach starts to learn after about 180 epochs.


Fig. 3. Comparison of vanilla HER and ITER with 8 KER symmetries and 4 GER applications on multi-goal tasks (FetchPush-v1, FetchSlide-v1, FetchPickAndPlace-v1; curves: DDPG+vanilla HER, DDPG+8KER+4GER). Success rate vs. epoch number (1 epoch = 100 episodes = 100×50 timesteps).

Fig. 4. Comparison of vanilla HER and ITER with 8 KER symmetries and 4 GER applications on single-goal tasks (FetchPush-v1, FetchSlide-v1, FetchPickAndPlace-v1; curves: DDPG+vanilla HER, DDPG+8KER+3GER). Success rate vs. epoch number (1 epoch = 100 episodes = 100×50 timesteps).

Fig. 5. Comparison of different nKER for KER with a single GER application on multi-goal tasks (FetchPush-v1, FetchSlide-v1, FetchPickAndPlace-v1; curves: vanilla HER, 1KER, 2KER, 4KER, 8KER, 16KER). Success rate vs. epoch number (1 epoch = 100 episodes = 100×50 timesteps).

B. How many symmetries should we use in KER?

In this experiment (Fig. 5), we use a single GER application with a zero threshold ε (i.e., HER). We observe that the performance increase is monotonic with respect to the number of random symmetries nKER, although the gain diminishes for larger nKER.


Fig. 6. Comparison of different nGER for GER without KER on multi-goal tasks (FetchPush-v1, FetchSlide-v1, FetchPickAndPlace-v1; curves: 1GER, 2GER, 4GER). Success rate vs. epoch number (1 epoch = 100 episodes = 100×50 timesteps).

C. Does GER improve performance?

In this experiment, KER is not used and we vary the number of GER applications. We observe a behavior similar to KER: the more GER is applied, the better the performance, until a ceiling is reached (Fig. 6).

Because GER changes the size of the minibatches, we performed a control experiment, not shown here, where we increased the size of the minibatches in vanilla HER. We found that it did not improve the performance of HER.

By observing Fig. 5 and Fig. 6, we can conclude that the contributions of KER and GER to ITER are of equivalent order.

VI. DISCUSSIONS

A. Applying invariant transformations to the input or output of the replay buffer

One interesting question concerns how the new data should be used. We could either populate the replay buffer with the artificial transitions or apply the transformations to a minibatch sampled from the replay buffer. If the transformations are applied after sampling, the diversity of the minibatch could be limited because all the new artificial transitions come from the same source. Conversely, if the transformations are applied before, we do not fully exploit the information contained in each transformation because we only sample it once where we could sample it several times. In our experiments, we noticed that applying KER before and GER after sampling works better in practice.

B. Performance drop with KER

After a certain number of epochs, we also observed an unexpected performance drop when applying KER. The larger nKER is, the faster this phenomenon appears. We first thought that the algorithm was overfitting the new artificial goals introduced by KER at the expense of real goals.

However, in an experiment not shown here, we observed that the performance drop was also affecting the artificial goals. We have two other hypotheses to explain this phenomenon:

• We update the policy on too many transitions that are unlikely to appear under the current policy. This implies that there is a mismatch between the state distribution induced by the minibatches and that of an optimal policy. If this is the case, we should weight the new artificial transitions to limit their importance in updates [18].

• As DDPG is known to be very sensitive to hyperparameters [28], they should be optimized for each new nKER.

This investigation is left for future work.

VII. CONCLUSIONS

We proposed two novel data augmentation techniques, KER and GER, to amplify the efficiency of observed samples in a memory replay mechanism. KER exploits reflectional symmetry in the feasible workspace (though in general it could be employed with other types of symmetries). GER, as an extension of HER, is specific to goal-oriented tasks where success is defined in terms of a thresholded distance. The combination of these techniques greatly accelerated learning, as demonstrated in our experiments.

Our next step is to use our method to solve the same tasks with a simulated Baxter robot, and then transfer to a real Baxter using a sim2real methodology. Furthermore, we aim to extend our propositions to other types of symmetries and to other robotic tasks as well.

ACKNOWLEDGMENT

This work is supported by the Major Project of the Guangdong Province Department for Science and Technology [grant number 2016B0911006], by the Natural Science Foundation of China [grant numbers 61750110521 and 61872238], and by the Shanghai Natural Science Foundation [grant number 19ZR1426700].


REFERENCES

[1] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, and others, "Human-level control through deep reinforcement learning," Nature, vol. 518, pp. 529–533, 2015.

[2] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, and others, "Mastering the game of Go with deep neural networks and tree search," Nature, vol. 529, no. 7587, p. 484, 2016.

[3] D. Kalashnikov, A. Irpan, P. Pastor, J. Ibarz, A. Herzog, E. Jang, D. Quillen, E. Holly, M. Kalakrishnan, V. Vanhoucke, and S. Levine, "Scalable Deep Reinforcement Learning for Vision-Based Robotic Manipulation," in Proceedings of The 2nd Conference on Robot Learning, ser. Proceedings of Machine Learning Research, A. Billard, A. Dragan, J. Peters, and J. Morimoto, Eds., vol. 87. PMLR, 2018, pp. 651–673.

[4] OpenAI, M. Andrychowicz, B. Baker, M. Chociej, R. Józefowicz, B. McGrew, J. Pachocki, A. Petron, M. Plappert, G. Powell, A. Ray, J. Schneider, S. Sidor, J. Tobin, P. Welinder, L. Weng, and W. Zaremba, "Learning dexterous in-hand manipulation," CoRR, 2018. [Online]. Available: http://arxiv.org/abs/1808.00177

[5] L.-J. Lin, "Self-improving reactive agents based on reinforcement learning, planning and teaching," Machine Learning, vol. 8, no. 3-4, pp. 293–321, 1992.

[6] M. Andrychowicz, F. Wolski, A. Ray, J. Schneider, R. Fong, P. Welinder, B. McGrew, J. Tobin, P. Abbeel, and W. Zaremba, "Hindsight experience replay," in Advances in Neural Information Processing Systems, 2017, pp. 5049–5059.

[7] Ł. Kidzinski, S. P. Mohanty, C. F. Ong, Z. Huang, S. Zhou, A. Pechenko, A. Stelmaszczyk, P. Jarosik, M. Pavlov, S. Kolesnikov et al., "Learning to run challenge solutions: Adapting reinforcement learning methods for neuromusculoskeletal environments," in The NIPS'17 Competition: Building Intelligent Systems. Springer, 2018, pp. 121–153.

[8] M. Plappert, M. Andrychowicz, A. Ray, B. McGrew, B. Baker, G. Powell, J. Schneider, J. Tobin, M. Chociej, P. Welinder, V. Kumar, and W. Zaremba, "Multi-Goal Reinforcement Learning: Challenging Robotics Environments and Request for Research," Tech. Rep., 2 2018.

[9] J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel, "Domain randomization for transferring deep neural networks from simulation to the real world," in 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2017, pp. 23–30.

[10] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba, "OpenAI Gym," arXiv preprint arXiv:1606.01540, 2016.

[11] T. Schaul, D. Horgan, K. Gregor, and D. Silver, "Universal value function approximators," in International Conference on Machine Learning, 2015, pp. 1312–1320.

[12] V. R. Konda and J. N. Tsitsiklis, "Actor-Critic Algorithms," Neural Information Processing Systems, vol. 13, pp. 1008–1014, 1999.

[13] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, "Continuous control with deep reinforcement learning," International Conference on Learning Representations, 2016.

[14] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller, "Deterministic Policy Gradient Algorithms," Proceedings of the 31st International Conference on Machine Learning, pp. 387–395, 2014.

[15] R. Zhao and V. Tresp, "Energy-Based Hindsight Experience Prioritization," in CoRL, 2018. [Online]. Available: http://arxiv.org/abs/1810.01363

[16] M. Fang, C. Zhou, B. Shi, B. Gong, J. Xu, and T. Zhang, "DHER: Hindsight Experience Replay," in ICLR, 2019.

[17] A. Gerken and M. Spranger, "Continuous Value Iteration (CVI) Reinforcement Learning and Imaginary Experience Replay (IER) for Learning Multi-Goal, Continuous Action and State Space Controllers," in 2019 International Conference on Robotics and Automation (ICRA). IEEE, 5 2019, pp. 7173–7179. [Online]. Available: https://ieeexplore.ieee.org/document/8794347/

[18] P. Rauber, A. Ummadisingu, F. Mutz, and J. Schmidhuber, "Hindsight policy gradients," arXiv preprint arXiv:1711.06006, 2017.

[19] M. Zinkevich and T. Balch, "Symmetry in Markov decision processes and its implications for single agent and multi agent learning," in Proceedings of the 18th International Conference on Machine Learning, pp. 632–640, 2001.

[20] M. Kamal and J. Murata, "Reinforcement learning for problems with symmetrical restricted states," Robotics and Autonomous Systems, vol. 56, no. 9, pp. 717–727, 9 2008.

[21] A. Agostini and E. Celaya, "Exploiting Domain Symmetries in Reinforcement Learning with Continuous State and Action Spaces," in ICMLA, 2009.

[22] A. Mahajan and T. Tulabandhula, "Symmetry Learning for Function Approximation in Reinforcement Learning," arXiv preprint arXiv:1706.02999, 2017.

[23] Ł. Kidzinski, S. P. Mohanty, C. F. Ong, Z. Huang, S. Zhou, A. Pechenko, A. Stelmaszczyk, P. Jarosik, M. Pavlov, S. Kolesnikov, S. Plis, Z. Chen, Z. Zhang, J. Chen, J. Shi, Z. Zheng, C. Yuan, Z. Lin, H. Michalewski, P. Milos, B. Osinski, A. Melnik, M. Schilling, H. Ritter, S. F. Carroll, J. Hicks, S. Levine, M. Salathé, and S. Delp, "Learning to Run Challenge Solutions: Adapting Reinforcement Learning Methods for Neuromusculoskeletal Environments," in The NIPS'17 Competition: Building Intelligent Systems. Springer, 2018, pp. 121–153. [Online]. Available: http://link.springer.com/10.1007/978-3-319-94042-7_7

[24] F. Amadio, A. Colome, and C. Torras, "Exploiting Symmetries in Reinforcement Learning of Bimanual Robotic Tasks," IEEE Robotics and Automation Letters, vol. 4, no. 2, pp. 1838–1845, 4 2019. [Online]. Available: https://ieeexplore.ieee.org/document/8637816/

[25] H. S. Baird, "Document Image Defect Models," in Structured Document Image Analysis. Springer, 1992, pp. 546–556.

[26] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in NeurIPS, 2012.

[27] R. Gens and P. Domingos, "Deep symmetry networks," in Advances in Neural Information Processing Systems, 2014, pp. 2537–2545.

[28] P. Henderson, R. Islam, P. Bachman, J. Pineau, D. Precup, and D. Meger, "Deep Reinforcement Learning that Matters," ArXiv e-prints, 2017.