
Robotic Table Tennis with Model-Free Reinforcement Learning

Wenbo Gao∗§, Laura Graesser∗†, Krzysztof Choromanski∗†, Xingyou Song†, Nevena Lazic‡, Pannag Sanketi†, Vikas Sindhwani† and Navdeep Jaitly¶

∗Equal contribution. †Robotics at Google. ‡DeepMind. §Columbia University; work done during Google internship. ¶Work done at Google.

Abstract—We propose a model-free algorithm for learning efficient policies capable of returning table tennis balls by controlling robot joints at a rate of 100Hz. We demonstrate that evolutionary search (ES) methods acting on CNN-based policy architectures for non-visual inputs and convolving across time learn compact controllers leading to smooth motions. Furthermore, we show that with appropriately tuned curriculum learning on the task and rewards, policies are capable of developing multi-modal styles, specifically forehand and backhand strokes, whilst achieving an 80% return rate on a wide range of ball throws. We observe that multi-modality does not require any architectural priors, such as multi-head architectures or hierarchical policies.

I. INTRODUCTION

Recent advances in machine learning (ML), and in particular reinforcement learning (RL), have led to progress in robotics [17, 43, 16], which has traditionally focused on optimal control. Two key advantages of ML are its ability to leverage increasing data and computation and learn task-specific representations. ML algorithms reduce the need for human knowledge by automatically learning useful representations of the data. For difficult problems, this is crucial because the complexity often exceeds what can be accomplished by explicit engineering. The effectiveness of ML is dependent on having large amounts of data. While this limits ML when little data is available, it becomes a strength because ML continues to scale with ever-increasing amounts of data and computation: a system can be improved, often to the new state of the art, simply by gathering more data. These advantages make RL appealing for robotics, where large-scale data can be generated by simulation, and systems and tasks are often too complex to explicitly program.

In this paper, we apply RL to robotic table tennis. In contrast to some of the previous manipulation tasks solved by RL [17], where an inference time of a few hundred milliseconds is acceptable, table tennis is distinguished by the need for high-speed perception and control and by the high degree of precision required to succeed at the task: the ball must be hit very precisely in time and space. Given these challenges, prior work on robotic table tennis is typically model-based, and combines kinematics for predicting the full ball trajectory with inverse kinematics for planning the robot trajectory. Most recent research first identifies a virtual hitting point [35] from a partial ball trajectory and predicts the ball velocity, and potentially spin, at this point. A target paddle orientation and velocity is then determined, and a trajectory is generated to bring the paddle to the desired target at a particular time [26, 27, 31, 29, 24, 13, 21]. Most systems include a predictive model of the ball (learned or analytical), and may also utilize a model of the robot. It is an open question whether end-to-end RL is an effective approach for robotic table tennis (and complex high-speed tasks in general), which motivates our approach.

Fig. 1. Simulated table tennis 8 DOF robot: 6 DOF arm with revolute joints (labeled j1-j6) + 2 linear axes.

We are also motivated by human play. When humans playtable tennis, they exhibit a variety of stroke styles (multi-modality). These styles include forehand, backhand, topspin,backspin, etc. We are interested in understanding if multi-modal style emerges naturally during training, and if not,what techniques are required to generate this behavior.

In this work, we describe how to learn a multi-modal model-free policy to directly control a simulated table tennis robot in joint space using RL, without relying on human demonstrations, and without having a separate system to predict the ball trajectory. To simplify the task, we focus on forehand and backhand play styles without spin. Our policies take as input short histories of joint positions and ball locations, and output velocities per joint at 100Hz.

A video demonstrating our system can be viewed at www.youtube.com/watch?v=-eHeq1nvHAE.

The video is divided into segments, each exhibiting a different policy (Policy A – E). Policy A is an example of a strong policy which returns 80% of randomly thrown balls to the opponent side of the table (see Table VII), and exhibits high-level decision making through bimodal (forehand and backhand) play. Policies B – E in the video show ablation studies (Sections V-C to V-E) and failure modes (Section V-F2). Policies B and C illustrate the differences in style between CNN and MLP architectures.

In summary, our contributions are the following:

• To the best of our knowledge, we train the first table tennis controller using model-free techniques without a predictive model of the ball or human demonstration data (Section V-B).

• We show that it is possible for a policy to learn multi-modal play styles with careful curriculum learning, but with no need for architectural priors (Section V-F).

• We demonstrate that convolutions across time on non-visual inputs lead to smoother policies. They are also more compact than MLPs, which may be particularly beneficial for ES methods (Sections V-B and V-C).

II. RELATED WORK

Research in robotic table tennis dates back to 1983, when Billingsley [4] proposed a robotic table tennis challenge with simplified rules and a smaller table. Subsequently, several early systems were developed [18, 11, 12]; see [28] for a summary of these approaches. At the time of the last competition in 1993, the problem remained far from solved.

Standard model-based approaches to robotic table tennis, as discussed in the introduction, consist of several steps: identifying virtual hitting points from trajectories, predicting ball velocities, calculating target paddle orientations and velocities, and finally generating robot trajectories leading to the desired paddle targets. [25, 26, 2, 30, 47] take this approach and impose the additional constraint of fixing the intercept plane in the y-axis (perpendicular distance to the net). [13, 39, 21] allow for a variable ball intercept, but still use a virtual hitting point. The predictive ball model is either learned from data [25, 23, 24, 26] or is a parameterized dynamics model, which can vary in complexity from Newtonian dynamics with air drag [28, 30] to incorporating restitution and spin [47]. Robots vary from low-DOF robots with simple motion generation [25, 23, 24, 26] to anthropomorphic arms with strong velocity and acceleration constraints [28, 29, 42, 10].

Once a paddle target has been generated, the trajectory generation problem is still far from straightforward, especially if the robotic system has strong constraints. [28] resolve the redundancy in a 7DOF system by minimizing the distance to a ‘comfort posture’ in joint space whilst finding a paddle position and orientation that coincides with the hitting point. [29, 31] create a movement library of dynamical system motor primitives (DMPs) [15] from demonstrations, and learn to select from amongst them and generalize between them with their Mixture of Motor Primitives (MoMP) algorithm.

[14] and [19] take a different approach and do not identify a virtual hitting point. [14] use a combination of supervised and reinforcement learning to generate robot joint trajectories in response to ball trajectories. An important component of this system is a map which predicts the entire ball trajectory given the initial ball state estimated from a collection of measured ball positions. [19] use three ball models (flight model, ball-table rebound model, ball-racket contact) to generate a discrete set of ball positions and velocities, given ball observations queried from the vision system. Desired racket parameters for each set of ball positions are generated, and then the optimization for robot joint movements is run. The system is fast enough to generate trajectories online.

[14] and [19] are the closest to our work. We use a similar anthropomorphic robotic system, a 6 DOF arm with revolute joints + 2 DOF linear axes, and do not make use of a virtual hitting point. However, in our approach we do not use a predictive model of the ball, nor do we use demonstrations to learn to generate trajectories.

To the best of our knowledge, two classes of model-free approaches have been applied to robotic table tennis. [47] frames the problem as a single-step bandit and uses DDPG [20] to predict the linear velocity and normal vector of the paddle given the position, linear and angular velocity of the ball at the hitting plane in simulation. [1] learns a local quadratic time-dependent Q-function from trajectory data, which they use to optimize a time-dependent stochastic linear feedback controller. The initial policy is learned from demonstration data. By contrast, the main focus of this paper is on-policy methods, and our policies produce temporally extended actions at 100Hz in joint space.

III. PRELIMINARIES

A. Robotic Table Tennis

Our goal in this paper is to train policies to solve the basic task of returning balls launched from the opponent side of the table. Beyond this, we are also interested in the style of the robot's play.

How a policy acts is as important as how many balls it can return, especially if the policy is used to control a physical robot. Good style – smooth control operating within the robot's limits – is crucial, in addition to the success rate.

Our policies should be able to execute complex playing styles involving high-level decision making, as humans do. In the context of robotic table tennis, an instance of this is bimodal play: the ability of the policy to select between forehand and backhand swings. Moreover, this should be extensible; in addition to the aforementioned goals of smoothness and bimodal play, the policy should have avenues for incorporating further styles or strategies as the difficulty of the task increases.

A key contribution of our paper is the development of methods to accomplish these goals using RL. We describe these methods in Section IV.

B. RL Background

To describe RL policies and algorithms, we use the formalism of the Markov Decision Process (MDP), which consists of a state space $S$, an action space $A$, a reward function $R : S \times A \to \mathbb{R}$, and stochastic transition dynamics $p(s_{t+1} \mid s_t, a_t)$. We parameterize a policy $\pi : S \to A$ as a neural network with parameters $\theta \in \Theta$, denoted $\pi_\theta$. The goal is to maximize the expected total reward $F(\theta) := \mathbb{E}_{a_t \sim \pi_\theta(s_t)}\left[\sum_{t=1}^{H} r(s_t, a_t)\right]$, where $a_t \sim \pi_\theta(s_t)$ indicates that actions follow $\pi_\theta$.

Our controller $\pi_\theta$ generates continuous velocity commands in joint space. Within model-free RL, there are two broad classes of algorithms: those based on value functions, with the canonical example being Q-learning, and those using direct policy search [40]. Q-learning, which learns a function $Q(s, a)$ to predict the expected reward starting at the state-action pair $(s, a)$, has been successfully applied to manipulation problems [17]. Its disadvantage is that inference (i.e., selecting an action) involves solving an optimization problem, $\pi_\theta(s) = \arg\max_a Q(s, a)$, which can take several hundred milliseconds for continuous action spaces, as in [17] (using CMA-ES). This is impractical for high-speed tasks like table tennis. Instead, we opt for direct policy search methods, which learn a mapping from states to actions.

TABLE I. Policy model details. For PPO, the architecture applies to both the value and policy networks, and the total parameter count is the sum of both networks' parameters.

Policy type            ES         PPO
Layers                 3          3
Channels (per layer)   8, 12, 8   8-32, 12-64, 8-32
Stride                 1, 1, 1    1, 1, 1
Dilation               1, 2, 4    1, 2, 4
Activation function    tanh       tanh
Gating                 Y, Y, N    Y, Y, Y
Total parameters       1.0k       2.4k - 36k

IV. METHODS: APPLYING RL TO TABLE TENNIS

In this section, we describe our key design choices for successfully applying RL to robotic table tennis: the policy architecture, the use of curriculum learning, and the choice of optimization algorithm.

A. Policy Representation

We represent a policy using a three-layer convolutional neural network (CNN) with gated activation units [44] (Figure 2, Table I). We found the gating mechanism was important for accelerating training, and we varied the number of channels depending on the algorithm (ES, PPO) for best performance.

Figure 2 depicts this architecture. The input to the CNN is a tensor of shape $(T, S + 3)$, where $T$ is the number of past observations and $S$ is the DOF of the robot. The $+3$ corresponds to a measurement of the 3D position of the ball, which is appended to the robot state. In our experiments, $S = 8$ and $T = 8$, so the input has shape $(8, 11)$. The CNN applies 1D convolutions along the time dimension, with the ball and joint state treated as channels. Each gated layer produces two tensors $o_1$ and $o_2$ of equal size. The gating mechanism then multiplies the activations $\tanh(o_1)$ elementwise with the mask $\sigma(o_2)$ to produce the output $y_i$.


Fig. 2. Gated CNN architecture (drawn in PlotNeuralNet: https://github.com/HarisIqbal88/PlotNeuralNet). Rectangles represent channels, and 1D filters are applied on the vertical (time) dimension; dilation reduces the height in successive layers. The symbol $\circledast$ denotes the application of convolutions, $\sigma$ denotes the elementwise sigmoid, and $\odot$ denotes the elementwise (Hadamard) product. The dimensions shown are for the ES policy (Table I).

Formally, let $W_i$ be the kernel of hidden layer $i$, $b_i$ the bias, and $X_i$ the input. We have

$$(o_1, o_2) = X_i \circledast W_i + b_i, \qquad y_i = \tanh(o_1) \odot \sigma(o_2). \tag{1}$$

To better understand the effect of using a CNN controller, we also trained simple three-layer multi-layer perceptron (MLP) controllers with 50 and 10 units in the two hidden layers.
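For concreteness, the following PyTorch sketch gives one plausible reading of Table I and Equation (1). The kernel size of 2 and the ungated tanh output layer are our assumptions (with these choices the dilations 1, 2, 4 shrink the T = 8 input to a single output step, and the parameter count lands near the reported ~1.0k for the ES policy); it is not the authors' released implementation.

```python
import torch
import torch.nn as nn

class GatedConv1d(nn.Module):
    """One gated 1D conv layer: y = tanh(o1) * sigmoid(o2), as in Equation (1)."""
    def __init__(self, in_ch, out_ch, dilation, gated=True):
        super().__init__()
        self.gated = gated
        # kernel_size=2 is an assumption; the paper reports only strides and dilations.
        self.conv = nn.Conv1d(in_ch, (2 if gated else 1) * out_ch,
                              kernel_size=2, dilation=dilation)

    def forward(self, x):
        o = self.conv(x)
        if not self.gated:
            return torch.tanh(o)
        o1, o2 = o.chunk(2, dim=1)          # split channels into activation and mask
        return torch.tanh(o1) * torch.sigmoid(o2)

class CNNPolicy(nn.Module):
    """Three conv layers over the time axis; channels are the joint and ball dimensions."""
    def __init__(self, obs_dim=11, act_dim=8):
        super().__init__()
        self.net = nn.Sequential(
            GatedConv1d(obs_dim, 8, dilation=1),
            GatedConv1d(8, 12, dilation=2),
            GatedConv1d(12, act_dim, dilation=4, gated=False),  # ES variant: last layer ungated
        )

    def forward(self, obs_history):
        # obs_history: (batch, T=8, S+3=11); Conv1d expects (batch, channels, time).
        x = obs_history.transpose(1, 2)
        return self.net(x).squeeze(-1)      # (batch, act_dim) joint-velocity command

policy = CNNPolicy()
print(policy(torch.zeros(1, 8, 11)).shape)  # torch.Size([1, 8])
```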

B. Curriculum Learning in RL

The task of returning the ball defines a sparse reward of 1.0 given at the end of episodes (ball throws) in which the robot succeeded, and 0.0 otherwise. In principle, this type of sparse reward can be used to find an optimal policy, as it exactly expresses the goal of the task. However, simply maximizing the success rate has two major drawbacks:
• Other considerations such as smoothness and style (see Section III-A) are not captured by the sparse reward.
• Sparse rewards make training difficult, as $\nabla F(\theta) = 0$ for a large portion of the parameter space. A large amount of exploration is required to observe any signal when rewards are highly sparse, which makes gradient estimation difficult. Furthermore, gradient-based algorithms are more likely to get trapped in local maxima.

We can address these issues using curriculum learning [3] by carefully adjusting two aspects of the problem:
• Shaping the training distribution, e.g. changing the distribution of ball throws during training so that a policy can improve on its weaknesses. In the MDP formalism for RL, this is changing the distribution of initial states $s_0 \sim P(S)$.
• Shaping the reward function. New rewards can be added to, e.g., discourage moving at high speeds, improve the swing pose, or encourage the use of forehand and backhand swings.

Curriculum learning is especially important for learning complex styles such as bimodal play, as we explore in Section V-F.

We briefly discuss an alternative to curriculum learning: human engineering of the desired behavior. Indeed, we can create a hierarchical policy which uses a fixed decision rule to select between sub-policies based on the predicted ball landing position. This is able to achieve a near-perfect success rate of 94% (Table VII, Hierarchical) by leveraging optimal forehand and backhand policies trained with RL.


Page 4: Robotic Table Tennis with Model-Free Reinforcement Learning · 2020. 5. 29. · Robotic Table Tennis with Model-Free Reinforcement Learning Wenbo Gaox, Laura Graesser y, Krzysztof

However, this approach is ultimately limited. Though rules-based engineering is possible for achieving the narrow goal of selecting between forehand and backhand, it is highly likely that more complex aspects of style or strategy cannot be neatly decomposed into distinct modes. Moreover, training separate policies for each mode quickly increases the total number of parameters. This limits the viability of engineering (it is not extensible), and motivates our use of curriculum learning.
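As a sketch, the engineered baseline discussed above amounts to a fixed routing rule of roughly the following form; the landing-point predictor, the sub-policy objects, and the sign convention for the forehand side are assumed interfaces, not details given in the paper.

```python
def hierarchical_policy(obs, predicted_landing_x, forehand_policy, backhand_policy):
    """Fixed-rule baseline: route to a forehand or backhand sub-policy based on the
    predicted landing side (positive x is taken as the forehand side here)."""
    sub_policy = forehand_policy if predicted_landing_x >= 0.0 else backhand_policy
    return sub_policy(obs)
```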

C. Algorithms

We consider two classes of RL optimization algorithms: evolutionary search and policy gradient methods.

1) Evolutionary Search (ES): This class of optimization algorithms [45, 33] has recently become popular for RL [36, 5, 6, 7]. ES is a general blackbox optimization paradigm, particularly efficient on objectives $F(\theta)$ which are possibly non-smooth. The key idea behind ES is to consider the Gaussian smoothing of $F$, given by

$$F_\sigma(\theta) = \mathbb{E}_{g \sim \mathcal{N}(0, I_d)}[F(\theta + \sigma g)], \tag{2}$$

where $\sigma > 0$ controls the precision of the smoothing. It can be shown that $F_\sigma(\theta)$ is differentiable with gradient

$$\nabla F_\sigma(\theta) = \frac{1}{\sigma}\, \mathbb{E}_{g \sim \mathcal{N}(0, I_d)}[F(\theta + \sigma g)\, g], \tag{3}$$

for which it is easy to derive unbiased Monte Carlo (MC) estimators. The ES method applies stochastic gradient ascent (SGD) to $F_\sigma(\theta)$, using an MC estimator $\widehat{\nabla} F_\sigma(\theta)$, so the update takes the form $\theta \leftarrow \theta + \eta \cdot \widehat{\nabla} F_\sigma(\theta)$.

ES algorithms come in different variations, differentiated by the Monte Carlo estimators used, additional normalizations performed, the specific forms of control variate terms, and more. We use ES methods with state normalization [32, 36], filtering [36], and reward normalization [22], with repeated rollouts for each parameter perturbation $g_i$. We found that averaging the reward from repeated rollouts was important for training good policies. This is likely due to the high degree of random variation between episodes. We also observed that the state-normalization heuristic was crucial for achieving good performance.
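A minimal sketch of one ES update is given below, using the common antithetic (mirrored) MC estimator of Equation (3) and averaging repeated rollouts per perturbation. The state/reward normalization and filtering mentioned above are omitted, and all hyperparameter values are placeholders rather than the paper's settings.

```python
import numpy as np

def es_step(theta, F, sigma=0.05, lr=0.01, num_perturbations=32, rollouts_per_perturbation=4):
    """One ES update: theta <- theta + lr * grad_hat of the Gaussian-smoothed objective."""
    grad = np.zeros_like(theta)
    for _ in range(num_perturbations):
        g = np.random.randn(*theta.shape)
        # Average several rollouts per perturbation to reduce episode-to-episode variance.
        r_plus = np.mean([F(theta + sigma * g) for _ in range(rollouts_per_perturbation)])
        r_minus = np.mean([F(theta - sigma * g) for _ in range(rollouts_per_perturbation)])
        grad += (r_plus - r_minus) / (2.0 * sigma) * g   # antithetic estimator of Eq. (3)
    return theta + lr * grad / num_perturbations
```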

2) Policy Gradient Methods: Policy gradient (PG) methods [46, 41, 40] are commonly used in RL and have been adapted for robotics [20]. We experiment with a state-of-the-art PG algorithm called proximal policy optimization (PPO) [37]. Note that standard PG methods require stochastic policies, which is not the case for ES algorithms.

V. EXPERIMENTS

Observations, results, and conclusions from all conducted experiments are organized as follows:
• ES outperforms PPO in final reward, produces smoother policies, and requires fewer network parameters (Section V-B).
• CNN-based architectures both outperform MLP-based architectures and produce significantly smoother motions (Section V-C).
• Reward shaping can improve the policy's smoothness without impacting the success rate (Section V-D).
• Bimodal play is nontrivial to learn, but can be obtained by using curriculum learning (Sections V-F and V-G). It does not require any modifications of the architecture.

Fig. 3. Simulated robotic table tennis system. Our coordinate system places (0, 0, 0) at the table center, and the axes are color-coded as x = red, y = green, z = blue.

A. System Description

Our robotic table tennis system consists of a 6DOF ABB IRB 120 arm with revolute joints, and two linear axes permitting movement across and behind the table, for a total of 8DOF. The table conforms to ITTF standards: it is 2.74m long and 1.525m wide, and the net is 15.25cm high. See Figure 3 for a depiction of the robot and table. The simulation is built using PyBullet [8].

Each episode consists of one ball throw, and the outcome is a hit if the paddle makes contact with the ball, and a success if the ball is hit and lands on the opponent's table. The simulation uses a simplified ball dynamics model that excludes air drag and spin. Ball throws are generated by randomly sampling the initial ball position $(x_0, y_0, z_0)$ (the coordinate system places $(0, 0, 0)$ at the table center, with the x and y axes parallel to the width and length of the table; see Figure 3), the target landing coordinates on the robot side of the table $(x_1, y_1, 0)$, and the initial z-axis speed $v_z$. The full initial velocity vector $(v_x, v_y, v_z)$ is then solved for, and we accept throws with $v_y \in [-8.5, -3.5]$. Using this approach, we define two ball distributions: a forehand ball distribution with $x_1 \in [-0.2, 0.7]$ and a full table distribution with $x_1 \in [-0.7, 0.7]$. At the start of each ball throw, the arm is initialized to either a forehand pose (see Figure 4, LHS) or a central pose (see Figure 3), depending on the ball distribution. The initial pose is perturbed slightly to prevent overfitting.
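A sketch of this throw generator is shown below, assuming drag-free projectile motion so that $(v_x, v_y)$ can be solved in closed form from the launch point, the target landing point, and the flight time implied by $v_z$. The landing-x and $v_y$ ranges match the text; the launch-position and $v_z$ ranges are illustrative placeholders, not the paper's values.

```python
import numpy as np

G = 9.81  # gravity; the simplified simulation excludes air drag and spin

def sample_ball_throw(full_table=True):
    """Sample a launch position and an initial velocity that lands the ball at a random
    target on the robot side. Launch-position and vz ranges are guesses, not the paper's."""
    while True:
        x0, y0, z0 = np.random.uniform([-0.7, 1.2, 0.2], [0.7, 2.0, 0.6])  # opponent side
        x_lo = -0.7 if full_table else -0.2                                 # full vs forehand
        x1, y1 = np.random.uniform([x_lo, -1.3], [0.7, -0.3])               # robot-side target
        vz = np.random.uniform(1.0, 3.0)
        t_land = (vz + np.sqrt(vz**2 + 2.0 * G * z0)) / G  # time until z returns to table height
        vx, vy = (x1 - x0) / t_land, (y1 - y0) / t_land    # remaining velocity components
        if -8.5 <= vy <= -3.5:                             # acceptance test on the y-speed
            return (x0, y0, z0), (vx, vy, vz)
```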

For policy evaluation, we simulate 2500 episodes, and report success and hit rates, as well as smoothness metrics.

Randomness can be injected into the simulation in a number of different ways. Uniform random noise is added to the ball position at each timestep. Additionally, the ball and robot observations can be delayed independently, by a random number of simulation time steps each episode. Policy actions can also be delayed by a random number of time steps each episode.
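One simple way to implement such per-episode delays is a fixed-length buffer per stream, sketched below; the maximum delay of three steps is an arbitrary placeholder, as the paper does not report the delay ranges used.

```python
import collections
import numpy as np

class DelayedStream:
    """Delays a stream (ball observation, robot observation, or action) by a random
    number of simulation steps, drawn once per episode."""
    def __init__(self, max_delay_steps=3):
        delay = np.random.randint(0, max_delay_steps + 1)
        self.buffer = collections.deque(maxlen=delay + 1)

    def __call__(self, value):
        self.buffer.append(value)
        return self.buffer[0]  # oldest buffered entry: `value` delayed by up to `delay` steps
```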

B. ES and PPO

We first trained CNN-based ES and PPO policies on the forehand ball distribution with sparse rewards: +1 for hitting the ball, +1 for landing the ball. ES policies were each trained for 15K parameter updates (equivalently, 22.7M episodes), whereas the PPO policies were trained for 2M parameter updates (equivalently, 1.1-1.2M episodes).

ES policies were stronger in this setting, with 89% successful returns to the opponent side of the table (see Table II, column S, "sparse" suffix). PPO policies attained only a 70% success rate and required more parameters to train well on this task. For example, we found it was necessary to increase the number of parameters to ~36K to achieve the best performance. PPO policies with a number of parameters comparable to ES (~2.4K PPO vs. ~1.0K ES) reached a success rate of only 22%. The phenomenon of policy gradient requiring more parameters than ES is not new, however; this effect has been observed numerous times in previous work. See, for example, low-displacement-rank policies with a linear number of parameters or linear policies trained with ES outperforming PPO-trained ones [5, 22], as well as reverse trends between performance and architecture size for meta-learning (MAML) [38, 9]. Several explanations have been proposed previously for these effects, such as fewer parameters producing more stable and generalizable behaviors [22, 34]. Meanwhile, for policy gradient methods, complex architectural requirements, such as stronger representational power for value approximation [9] or additional batch normalization layers [20], can be bottlenecks.

TABLE II. Results on the forehand ball distribution. S: success (%), H: hit (%), J: avg max jerk, A: avg max acceleration, V: avg max velocity, JR: sum of joint range.

Policy                       S    H    J     A    V    JR
Random-8-12-8                0    2    1.5   1.0  0.9  6.1
ES-8-12-8-sparse             89   99   4.0   2.9  3.4  9.5
PPO-8-12-8-sparse            22   98   8.1   5.3  4.5  14.1
PPO-16-32-16-sparse          70   98   9.6   6.2  4.6  14.2
PPO-32-64-32-sparse          70   97   8.9   5.8  4.4  14.3
ES-8-12-8-shaped             90   99   2.9   2.2  2.9  9.2
ES-8-12-8-shaped + AF 3Hz    79   98   2.0   1.4  2.1  9.0
PPO-8-12-8-shaped            33   93   5.9   3.9  3.9  12.2
PPO-16-32-16-shaped          50   98   4.4   3.2  4.0  11.6
PPO-32-64-32-shaped          62   99   4.1   3.3  3.9  11.5

We observe a similar performance differential between ES and PPO for the harder task of the full table ball distribution, as shown in Table III. For this reason, we conducted the ablation studies and the training for bimodal play using only ES.

TABLE III. Results on the full table ball distribution. S: success (%), H: hit (%), J: avg max jerk, A: avg max acceleration, V: avg max velocity, JR: sum of joint range.

Policy                       S    H    J     A    V    JR
Random-8-12-8                0    1    0.7   0.5  0.8  8.6
ES-8-12-8-sparse             39   97   5.5   3.6  3.6  10.9
PPO-8-12-8-sparse            18   89   6.3   4.5  3.7  11.1
PPO-16-32-16-sparse          21   94   7.0   4.9  4.4  14.5
PPO-32-64-32-sparse          34   95   9.2   6.0  4.4  14.4
ES-8-12-8-shaped             48   97   4.6   3.1  3.7  11.8
ES-8-12-8-shaped + AF 3Hz    8    98   2.2   2.0  2.7  11.8
PPO-8-12-8-shaped            1    52   2.0   1.6  2.4  10.7
PPO-16-32-16-shaped          4    98   4.1   3.3  4.0  14.4
PPO-32-64-32-shaped          31   96   4.2   3.9  4.2  13.6

C. Policy architecture

Table IV shows that the policy network architecture has a significant effect on both success rates and smoothness. We evaluate gated CNN and MLP policies, each with 3 hidden layers. Both policy types received the same inputs: the 8 most recent time steps of ball and joint positions.

To evaluate smoothness, we look at three metrics averaged over all time steps and joints: (1) maximum jerk per time step (J), (2) maximum acceleration per time step (A), and (3) maximum velocity per time step (V). We also measure the joint range (JR) of a policy.
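A minimal sketch of these metrics is given below. The paper does not spell out the exact reduction order, so the per-timestep maximum over joints followed by a mean over time is our reading, and JR is taken as the sum over joints of each joint's position range.

```python
import numpy as np

def smoothness_metrics(joint_positions, dt=0.01):
    """Smoothness summary of a 100Hz joint-position trajectory with shape (T, num_joints)."""
    q = np.asarray(joint_positions)
    v = np.diff(q, axis=0) / dt          # finite-difference velocity
    a = np.diff(v, axis=0) / dt          # finite-difference acceleration
    j = np.diff(a, axis=0) / dt          # finite-difference jerk
    return {
        "V": np.abs(v).max(axis=1).mean(),   # mean (over time) of per-step max velocity
        "A": np.abs(a).max(axis=1).mean(),   # mean of per-step max acceleration
        "J": np.abs(j).max(axis=1).mean(),   # mean of per-step max jerk
        "JR": np.ptp(q, axis=0).sum(),       # sum over joints of (max - min) position
    }
```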

CNN policies achieved higher success rates than MLPs in both the sparse reward and shaped reward (see Section V-D) settings: 89% vs. 67% with sparse rewards and 90% vs. 46% with shaped rewards. MLP policies are significantly less smooth, with an average max jerk 2.5-3x that of the CNN, 2.5x the average max acceleration, and 1.3-1.5x the average max velocity. MLPs also have a noticeably higher range of motion when compared with CNNs (column JR, Table IV). These differences are also clear from a qualitative inspection of the policy behavior (see policies B and C in the supplementary video).

TABLE IV. Comparing CNN and MLP policies with sparse and shaped rewards on the forehand ball distribution. All policies were trained for 15K parameter updates.

Architecture    S    H    J     A    V    JR
CNN (sparse)    89   99   4.0   2.9  3.4  9.5
MLP (sparse)    67   98   11.1  7.2  4.4  11.4
CNN (shaped)    90   99   2.9   2.2  2.9  9.2
MLP (shaped)    46   95   7.6   5.4  4.3  13.4

D. Reward shaping

Reward shaping is common practice in RL and is an effective technique for shaping policy behavior. We explored the effect of seven different rewards (see footnote 6) shown in Table V, and find that reward shaping improves style with little or no cost in success rate for the same number of optimization steps.

TABLE V. Ablation study: effect of different rewards on success, hit rates, and smoothness metrics. Policies were trained and evaluated on the forehand ball distribution. Reward abbreviations are defined in footnote 6; the canonical set is defined in footnote 7.

Rewards                     S    H    J     A    V    JR
ST (sparse)                 89   99   4.0   2.9  3.4  9.5
ST, IC, BBR                 87   99   4.3   3.1  3.1  8.6
ST, IC, BBR, PH, JA         91   99   3.2   2.5  3.1  9.7
ST, IC, BBR, PH, V, A, J    96   99   2.7   2.0  2.8  8.5
Canonical (all rewards)     90   99   2.9   2.2  2.9  9.2

6. ST: hit and success sparse rewards; IC: penalty for self-collision or colliding with the table; BBR: penalty for rotating the base joint too far backwards; PH: penalty if the paddle gets too close to the table; JA: penalty if the arm position gets too close to any of the joint limits; V / A / J: penalty for exceeding a velocity / acceleration / jerk limit.

7. Canonical reward shaping: ST, IC, BBR, PH, V, A, J, JA.

Page 6: Robotic Table Tennis with Model-Free Reinforcement Learning · 2020. 5. 29. · Robotic Table Tennis with Model-Free Reinforcement Learning Wenbo Gaox, Laura Graesser y, Krzysztof

We observe a similar pattern with PPO, in that reward shaping improves the smoothness metrics (see Table II, for example). However, PPO policies are also noticeably less smooth, with J, A, and V values over 2x those of comparable ES policies, and have a larger range of motion. It may be that PPO results in more sensitive and mobile policies compared to ES, which makes it harder to train smooth policies. It would be interesting to explore this further in future work.

E. Action filters

We find that applying a low-pass Butterworth filter to the policy actions further increases smoothness (see Table VI), but the differences in smoothness metrics between filters with varying cutoff frequencies are not very large. Unlike reward shaping, adding a filter appears to make the primary problem of returning balls harder, and leads to slightly lower success rates.

TABLE VI. Ablation study: effect of applying a low-pass action filter to the policy output. Policies were trained and evaluated on the forehand ball distribution.

Action filter    S    H    J     A    V    JR
None             90   99   2.9   2.1  2.9  9.2
f-cut: 2Hz       84   97   1.8   1.2  2.2  9.0
f-cut: 3Hz       79   98   2.0   1.4  2.1  9.0
f-cut: 5Hz       80   99   1.9   1.3  2.2  7.2
f-cut: 10Hz      88   99   1.9   1.5  2.4  9.5
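A causal per-joint action filter of this kind can be sketched with SciPy as below. The paper specifies only the cutoff frequencies; the filter order (2) and the zero initial filter state are our assumptions.

```python
import numpy as np
from scipy.signal import butter, lfilter

class ActionFilter:
    """Causal low-pass Butterworth filter applied online to per-joint velocity commands."""
    def __init__(self, cutoff_hz=3.0, control_hz=100.0, num_joints=8, order=2):
        self.b, self.a = butter(order, cutoff_hz, btype="low", fs=control_hz)
        self.z = np.zeros((max(len(self.a), len(self.b)) - 1, num_joints))  # filter state

    def __call__(self, action):
        # Filter one action vector (shape: num_joints), carrying the state across steps.
        filtered, self.z = lfilter(self.b, self.a, action[np.newaxis, :], axis=0, zi=self.z)
        return filtered[0]
```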

F. Learning Complex Playstyles

Policy A, as shown in the video, is the result of training with a curriculum on the ball distribution and rewards. It has a success rate of 80% on the full table ball distribution and a hit rate of 99%, and it has a good bimodal style, with balanced success rates on the forehand and backhand (Table VII). We also note that Policy A obtains success rates nearing those of the human-engineered hierarchical policy (discussed in Section IV-B), which can be seen as a near-optimal policy for this problem, despite Policy A having a much smaller model.

TABLE VII. Performance of bimodal policies on the full table ball distribution. Policy A is shown in the video. Column abbreviations are defined in footnote 8.

Policy          S    H     S-F   H-F   S-B   H-B
A               80   99    78    99    81    99
Hierarchical    94   100   92    100   96    100

The curriculum used to train Policy A is listed in Section V-G. To understand the necessity of each component, we carry out a series of ablation studies, shown in Table VIII. In each ablation test, we train a policy for 15K steps and evaluate its success rate and style. Note that success rates are low overall, since training with the full table distribution (Section V-A) requires more steps than the forehand ball distribution used in the ablation studies of Section V-B.

8. 'S' denotes success rate, 'H' denotes hit rate, 'F' denotes the rate for ball throws on the forehand side of the table, and 'B' the backhand.

TABLE VIII. Ablation studies: performance after 15K steps for various regimes. Numbers express percentages; rows 3, 4, and 10 correspond to bimodal policies (from visual inspection).

#    Regime               S     H     S-F   H-F   S-B   H-B
Pose reward (see Section V-F1)
1    None                 11.8  88.3  23.6  96.3  0.8   80.8
2    CPS                  17.7  53.3  28.4  73.8  6.3   31.5
3    DCPS                 2.3   83.7  1.4   85.4  3.3   82.1
4    CPT                  0.8   95.3  1.3   95.2  0.4   95.5
Success + pose (see Section V-F2)
5    DTR + None           9.6   71.4  16.8  92.4  2.4   50.6
6    DTR + DCPS           11.3  88.3  18.9  98.0  2.7   77.3
Ball range + pose (see Section V-F3)
7    (0.5, 0.7) + None    16.4  36.9  32.6  73.6  0     0
8    (0.3, 0.7) + None    0.3   38.3  0.6   77.0  0     0
9    (0.3, 0.7) + CPS     3.6   19.1  7.0   37.5  0     0
10   (0.3, 0.7) + DCPS    8.5   87.6  9.7   87.4  7.3   87.5

A major challenge is that optimizing for success rates typically leads to a unimodal policy. For instance, in Table VIII, row #1 shows that under the canonical rewards, the resulting policy is forehand-only; it achieves 23% success on forehand balls but only 0.8% on backhand balls. This is also confirmed by visual inspection. To reliably obtain bimodal play, we must devise a curriculum to encourage it. First, we introduce pose rewards which can maintain bimodal style (Section V-F1). Then we discuss how adding success bonuses (Section V-F2) and shaping the task distribution (Section V-F3) can improve training. Finally, in Section V-G we present the curriculum for training Policy A.

1) Reward Shaping for Bimodal Play: We have already seen reward shaping used in Section V-D to encourage better style. We use a similar technique for bimodal play.

Precisely characterizing 'forehand' swings, and the ball throws for which a forehand swing should be rewarded, is nontrivial, and the possible definitions vary in the amount of human knowledge implicit in the reward. We consider several variants of pose rewards (a code sketch of the DCPS variant appears after this list):

1) Conditional Pose State (CPS): We define reference poses for the forehand and backhand, which are shown in Figure 4. The reward is given for taking a pose close to the reference pose corresponding to the side on which the ball lands (closeness is measured by the L2 norm in joint space). That is, $R^{F}_{CPS} = 1 - d_F$ is awarded for episodes where the ball lands on the forehand side, where $d_F$ is the minimum distance of the arm's pose to the forehand reference pose, and $R^{B}_{CPS}$ is defined similarly for the backhand.

2) Dense Conditional Pose State (DCPS): A denser version of the CPS reward which also penalizes taking the wrong pose. The reward is $w(R^{F}_{CPS} - R^{B}_{CPS})$ for episodes where the ball is thrown to the forehand, and $w(R^{B}_{CPS} - R^{F}_{CPS})$ for throws to the backhand. We add a scale $w$ which reduces the magnitude of the reward if the ball's landing point is near the center.

3) Conditional Pose Timesteps (CPT): We define forehand and backhand in terms of the rotation of J1 and J4 (Figure 1). The pose is considered to be 'forehand' if J1 and J4 both lie in the appropriate half of their ranges, and similarly for backhand. Let $t_F$ be the percentage of timesteps before ball contact during which the robot is in a forehand pose, and similarly $t_B$ for the backhand. The reward given is $R^{F}_{CPT} = w(t_F - t_B)$ for balls thrown to the forehand, and $R^{B}_{CPT} = w(t_B - t_F)$ for backhand balls.

Fig. 4. Reference forehand (left) and backhand (right) poses.
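The DCPS variant could be sketched as follows; distances are L2 in joint space as in the text, while the center-width parameter, the linear ramp used for the weight $w$, the forehand sign convention, and the use of a single pose (rather than the episode's closest pose) are assumptions for illustration.

```python
import numpy as np

def dcps_reward(joint_pose, forehand_ref, backhand_ref, ball_landing_x, center_width=0.2):
    """Dense conditional pose reward (DCPS). Positive x is taken as the forehand side."""
    r_f = 1.0 - np.linalg.norm(joint_pose - forehand_ref)   # CPS-style forehand reward
    r_b = 1.0 - np.linalg.norm(joint_pose - backhand_ref)   # CPS-style backhand reward
    # Weight w shrinks toward 0 as the landing point approaches the table's centerline.
    w = min(1.0, abs(ball_landing_x) / center_width)
    return w * (r_f - r_b) if ball_landing_x >= 0.0 else w * (r_b - r_f)
```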

We first investigate each type of pose reward (rows #1-4, Table VIII). As noted before, training without pose rewards led to a purely forehand policy (row #1). The CPS reward also led to a forehand policy (row #2), indicating that locally, the policy can still obtain higher reward by improving its forehand play at the expense of the backhand. This was a motivation for the DCPS reward (row #3). Although its success rate after the same period of training is lower, its hit rate is high, and the forehand and backhand success rates are balanced. Video inspection confirms it has a basic bimodal style. The CPT reward produces training similar to DCPS in the early stages, but video inspection shows that DCPS produces a clearer bimodal style. In general, a more 'prescriptive' reward such as DCPS appears to speed up learning at the beginning (especially with distribution shaping; see Section V-F3), whereas a more 'flexible' reward such as CPT has advantages in later training (Section V-F2).

2) Reward Shaping to Escape Plateaus: Bimodal policies often reach a plateau in training where further progress becomes extremely slow. This is likely because of the sparsity of the success reward; since it does not distinguish between returned balls that are close to landing and those that are far from the target, it is difficult for exploration to find nearby policies with sufficiently higher success rates so as to have measurably better rewards. We experiment with two rewards for increasing the success signal:

1) Landing Bonus: Increase the sparse success reward.
2) Distance to Table (DTR): A dense version of the success reward, given by $\max\{1 - d, -2\}$, where $d$ is the minimum distance between the ball and the opponent's table surface during the return trajectory.
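The DTR reward itself is a one-liner; in this sketch, $d$ would come from the simulator as the closest approach of the returned ball to the opponent's table surface.

```python
def dtr_reward(min_dist_to_opponent_table):
    """Dense 'distance to table' success reward: max{1 - d, -2}."""
    return max(1.0 - min_dist_to_opponent_table, -2.0)
```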

Introducing these rewards helps to escape plateaus for policies that are already bimodal and have reasonable success rates. Training such policies with the CPT pose reward and the DTR success reward leads to continued improvement while maintaining style, especially when combined with distribution shaping (Section V-F3). However, we find that the balance between success rewards and pose rewards that leads to bimodal play can be sensitive. For example:

1) Using DTR at the onset leads to a unimodal policy, even with DCPS pose rewards (rows #5, 6).
2) When initialized with a bimodal policy (trained with DCPS), subsequent training with DTR and DCPS leads the policy to collapse to an unusual unimodal style which takes a backhand pose and then manipulates its 'shoulder' joint J3 (video, Policy E) to reduce penalties.
3) The combination of DCPS and a landing bonus of +1.0 leads to collapse.

3) Shaping the Training Distribution: Varying the distribution of tasks can improve training. For instance, if our objective is forehand play only, notice by comparing Table V to Table VIII that much higher success rates are obtained in 15K steps by restricting the ball distribution to forehand-only. Adjusting the difficulty of the task can therefore make training faster. It may also help to avoid local minima in which undesirable style is compensated for by locally maximal success rates.

As we might expect, our experiments show that the policy has the greatest difficulty learning a bimodal style for balls that land at the center of the table. Based on this observation, we consider a spectrum of tasks where the ball's training distribution is supported on the two sides of the table. The ball range $(a, b)$ for $0 \le a < b$ indicates that the x-coordinate of the ball's landing position belongs to $[-b, -a] \cup [a, b]$. To overcome the style collapse problem when training a bimodal policy with the DTR reward, we change the ball range to (0.5, 0.7) when the DTR reward is introduced, and then periodically widen the range as the policy improves until the distribution spans the entire table.
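Sampling a landing x-coordinate under a ball range $(a, b)$ is then straightforward; a minimal sketch (the 50/50 split between sides is our assumption):

```python
import numpy as np

def sample_landing_x(a, b):
    """Sample the target landing x-coordinate from [-b, -a] U [a, b], one side chosen at random."""
    x = np.random.uniform(a, b)
    return x if np.random.rand() < 0.5 else -x
```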

We also consider whether this distribution shaping can be applied at the onset. However, ablation studies indicate that the policy learns to ignore one side of the table entirely (see rows #7-9 of Table VIII), unless the DCPS reward is also given, in which case training proceeds well. This shows that training is most efficient with a curriculum, which leads the policy to improve skills without forgetting previous ones.

G. Effective Curriculum Learning

Based on the observations of the preceding sections, we used the following scheme to train bimodal policies with RL, an example of which is Policy A.

1) Canonical style rewards and the DCPS pose reward on the full table.
2) Introduce the DTR success reward, switch the pose reward to CPT, and set the ball range to (0.5, 0.7).
3) Increase the ball range to (0.3, 0.7), then (0.1, 0.7), and then to the full table.

The precise number of steps required varies owing to the randomness in RL algorithms; the first stage may take 15K steps, as was used in our ablation studies.
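Expressed as a configuration, the curriculum looks roughly like the sketch below. Only the structure and the ~15K-update first stage come from the text; the later stage lengths are unspecified (in practice the range is widened as the policy improves) and are left as None.

```python
# Curriculum for training a bimodal policy (cf. Section V-G); stage lengths beyond the
# first are placeholders, not values reported in the paper.
CURRICULUM = [
    dict(ball_range="full table", pose_reward="DCPS", success_reward="sparse", updates=15_000),
    dict(ball_range=(0.5, 0.7),   pose_reward="CPT",  success_reward="DTR",    updates=None),
    dict(ball_range=(0.3, 0.7),   pose_reward="CPT",  success_reward="DTR",    updates=None),
    dict(ball_range=(0.1, 0.7),   pose_reward="CPT",  success_reward="DTR",    updates=None),
    dict(ball_range="full table", pose_reward="CPT",  success_reward="DTR",    updates=None),
]
```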

VI. CONCLUSIONS

We have shown that model-free reinforcement learning is an effective approach for robotic table tennis and is suitable for high-speed control, generating actions at 100Hz. This avoids ball-prediction modeling and trajectory optimization, without the need for human demonstrations. Policies can learn to hit and return balls simply by being given sparse rewards for contact and success. Moreover, reward shaping and curriculum learning can be used to improve the style of the policy and to develop more complex play. We demonstrated this by training strong bimodal policies which are capable of playing both the forehand and the backhand. This suggests that RL is promising for robotic table tennis. Our experiments show that CNN policies outperform MLP policies both in terms of success and smoothness. We also observe that the evolution strategy (ES) paradigm in RL was particularly effective for this problem and, in comparison to policy gradient methods, was able to train smaller policies.

VII. ACKNOWLEDGEMENTS

We thank David D'Ambrosio, Jie Tan, and Peng Xu for their helpful comments on this work.

REFERENCES

[1] Riad Akrour, Abbas Abdolmaleki, Hany Abdulsamad, Jan Peters, and Gerhard Neumann. Model-free trajectory-based policy optimization with monotonic improvement. J. Mach. Learn. Res., 2016.
[2] Russell Anderson. A Robot Ping-Pong Player: Experiments in Real-Time Intelligent Control. MIT Press, 1988.
[3] Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In ICML, pages 41–48, 2009.
[4] John Billingsley. Robot ping pong. Practical Computing, 1983.
[5] Krzysztof Choromanski et al. Structured evolution with compact architectures for scalable policy optimization. In ICML, 2018.
[6] Krzysztof Choromanski et al. From complexity to simplicity: Adaptive ES-active subspaces for blackbox optimization. In NeurIPS, 2019.
[7] Krzysztof Choromanski et al. Provably robust blackbox optimization for reinforcement learning. In CoRL, 2019.
[8] Erwin Coumans and Yunfei Bai. PyBullet, a Python module for physics simulation for games, robotics and machine learning. http://pybullet.org, 2016–2019.
[9] Chelsea Finn and Sergey Levine. Meta-learning and universality: Deep representations and gradient descent can approximate any learning algorithm. In ICLR, 2018.
[10] Yapeng Gao, Jonas Tebbe, Julian Krismer, and Andreas Zell. Markerless racket pose detection and stroke classification based on stereo vision for table tennis robots. IEEE Robotic Computing, 2019.
[11] J. Hartley. Toshiba progress towards sensory control in real time. The Industrial Robot, 14(1):50–52, 1983.
[12] Hideaki Hashimoto, Fumio Ozaki, and Kuniji Osuka. Development of ping-pong robot system using 7 degree of freedom direct drive robots. In Industrial Applications of Robotics and Machine Vision, 1987.
[13] Yanlong Huang, Bernhard Schölkopf, and Jan Peters. Learning optimal striking points for a ping-pong playing robot. IROS, 2015.
[14] Yanlong Huang, Dieter Buchler, Okan Koç, Bernhard Schölkopf, and Jan Peters. Jointly learning trajectory generation and hitting point prediction in robot table tennis. IEEE-RAS Humanoids, 2016.
[15] Auke Jan Ijspeert, Jun Nakanishi, and Stefan Schaal. Movement imitation with nonlinear dynamical systems in humanoid robots. ICRA, 2002.
[16] Stephen James, Paul Wohlhart, Mrinal Kalakrishnan, Dmitry Kalashnikov, Alex Irpan, Julian Ibarz, Sergey Levine, Raia Hadsell, and Konstantinos Bousmalis. Sim-to-real via sim-to-sim: Data-efficient robotic grasping via randomized-to-canonical adaptation networks. In CVPR, 2019.
[17] Dmitry Kalashnikov et al. QT-Opt: Scalable deep reinforcement learning for vision-based robotic manipulation. CoRL, 2018.
[18] John Knight and David Lowery. Pingpong-playing robot controlled by a microcomputer. Microprocessors and Microsystems - Embedded Hardware Design, 1986.
[19] Okan Koç, Guilherme Maeda, and Jan Peters. Online optimal trajectory generation for robot table tennis. Robotics & Autonomous Systems, 2018.
[20] Timothy Lillicrap, Jonathan Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. ICLR, 2016.
[21] Reza Mahjourian, Navdeep Jaitly, Nevena Lazic, Sergey Levine, and Risto Miikkulainen. Hierarchical policy design for sample-efficient learning of robot table tennis through self-play. arXiv:1811.12927, 2018.
[22] Horia Mania, Aurelia Guy, and Benjamin Recht. Simple random search provides a competitive approach to reinforcement learning. NeurIPS, 2018.
[23] Michiya Matsushima, Takaaki Hashimoto, and Fumio Miyazaki. Learning to the robot table tennis task - ball control and rally with a human. IEEE International Conference on Systems, Man and Cybernetics, 2003.
[24] Michiya Matsushima, Takaaki Hashimoto, Masahiro Takeuchi, and Fumio Miyazaki. A learning approach to robotic table tennis. IEEE Transactions on Robotics, 2005.
[25] Fumio Miyazaki, Masahiro Takeuchi, Michiya Matsushima, Takamichi Kusano, and Takaaki Hashimoto. Realization of the table tennis task based on virtual targets. ICRA, 2002.
[26] Fumio Miyazaki et al. Learning to dynamically manipulate: A table tennis robot controls a ball and rallies with a human being. In Advances in Robot Control, 2006.
[27] Katharina Muelling and Jan Peters. A computational model of human table tennis for robot application. In AMS, 2009.
[28] Katharina Muelling, Jens Kober, and Jan Peters. A biomimetic approach to robot table tennis. Adaptive Behavior, 2010.
[29] Katharina Muelling, Jens Kober, and Jan Peters. Learning table tennis with a mixture of motor primitives. IEEE-RAS Humanoids, 2010.
[30] Katharina Muelling, Jens Kober, and Jan Peters. Simulating human table tennis with a biomimetic robot setup. In Simulation of Adaptive Behavior, 2010.
[31] Katharina Muelling, Jens Kober, Oliver Kroemer, and Jan Peters. Learning to select and generalize striking movements in robot table tennis. The International Journal of Robotics Research, 2012.
[32] Anusha Nagabandi et al. Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning. In ICRA, 2018.
[33] Yurii Nesterov and Vladimir Spokoiny. Random gradient-free minimization of convex functions. FoCM, 2017.
[34] Aravind Rajeswaran, Kendall Lowrey, Emanuel Todorov, and Sham M. Kakade. Towards generalization and simplicity in continuous control. In NeurIPS, 2017.
[35] Marie-Marin Ramanantsoa and Alain Durey. Towards a stroke construction model. The International Journal of Table Tennis Sciences, 1994.
[36] Tim Salimans et al. Evolution strategies as a scalable alternative to reinforcement learning. arXiv:1703.03864, 2017.
[37] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv:1707.06347, 2017.
[38] Xingyou Song, Wenbo Gao, Yuxiang Yang, Krzysztof Choromanski, Aldo Pacchiano, and Yunhao Tang. ES-MAML: Simple Hessian-free meta learning. In ICLR, 2020.
[39] Yichao Sun, Rong Xiong, Qiuguo Zhu, Jingjing Wu, and Jian Chu. Balance motion generation for a humanoid robot playing table tennis. IEEE-RAS Humanoids, 2011.
[40] Richard Sutton and Andrew Barto. Reinforcement Learning: An Introduction. MIT Press, MA, 1st edition, 1998.
[41] Richard Sutton, David McAllester, Satinder Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In NeurIPS, 2000.
[42] Jonas Tebbe, Yapeng Gao, Marc Sastre-Rienietz, and Andreas Zell. A table tennis robot system using an industrial KUKA robot arm. GCPR, 2018.
[43] Josh Tobin et al. Domain randomization and generative models for robotic grasping. In IROS, 2018.
[44] Aaron van den Oord, Nal Kalchbrenner, Lasse Espeholt, Koray Kavukcuoglu, Oriol Vinyals, and Alex Graves. Conditional image generation with PixelCNN decoders. In NeurIPS, 2016.
[45] Daan Wierstra, Tom Schaul, Jan Peters, and Jürgen Schmidhuber. Natural evolution strategies. In IEEE Congress on Evolutionary Computation, 2008.
[46] Ronald Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 1992.
[47] Yifeng Zhu, Yongsheng Zhao, Lisen Jin, Jingjing Wu, and Rong Xiong. Towards high level skill learning: Learn to return table tennis ball using Monte-Carlo based policy gradient method. IEEE International Conference on Real-time Computing and Robotics, 2018.