
A physics-based Juggling Simulation using Reinforcement Learning

Jason Chemin
Department of Computer Science and Engineering
Seoul National University
[email protected]

Jehee Lee
Department of Computer Science and Engineering
Seoul National University
[email protected]

Figure 1: Demonstration of Juggling simulation with 5 balls.

ABSTRACT
Juggling is a physical skill that consists in keeping one or several objects in continuous motion in the air by tossing and catching them. Jugglers need high dexterity to control their throws and catches, which require speed, accuracy and synchronization. The more balls we juggle with, the stronger these qualities have to be to achieve the performance. This complex skill is a good challenge for realistic physics-based simulation, which could be useful for jugglers to evaluate the feasibility of their tricks. Such a simulation has to understand the different notations used in juggling and apply the mathematical theory of juggling to reproduce it. In this paper, we present a deep reinforcement learning method for both tasks, catching and throwing, and we combine them to recreate the whole juggling process. Our character is able to react accurately and with enough speed and power to juggle with up to 7 balls, even with external forces applied to them.

CCS CONCEPTS
• Computing methodologies → Physical simulation; Reinforcement learning;

KEYWORDS
Juggling, Physics-based character animation, Reinforcement Learning

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
MIG '18, November 8–10, 2018, Limassol, Cyprus
© 2018 Association for Computing Machinery.
ACM ISBN 978-1-4503-6015-9/18/11...$15.00
https://doi.org/10.1145/3274247.3274516

ACM Reference Format:
Jason Chemin and Jehee Lee. 2018. A physics-based Juggling Simulation using Reinforcement Learning. In MIG '18: Motion, Interaction and Games (MIG '18), November 8–10, 2018, Limassol, Cyprus. ACM, New York, NY, USA, 7 pages. https://doi.org/10.1145/3274247.3274516

1 INTRODUCTION
The art of juggling has appeared in various cultures throughout history as an entertaining skill that can sometimes be very impressive: an ancient Chinese annal tells of a man who juggled nine balls in front of enemy armies and won by impressing them with his dexterity. The juggler has to show his control over the timing of his tosses and catches if he wants to keep all objects in the air. Specifically, throwing a ball is a task requiring a balance between accuracy, to avoid collisions, and power, to throw the ball high enough. And the more balls we juggle with, the higher or faster we have to throw each ball while keeping track of the others in the air. The current human world record for continuous juggling is eleven balls and, as explained by the professional juggler Jack Kalvan [2018], this is physically very difficult to beat.

Research on juggling started with Shannon [1981], who formulated a mathematical juggling theorem and built the first machine able to perform bounce juggling with three balls. Since then, this physical skill has been an interesting challenge in science and in robotics [Beek and Lewbel 1995]. The first juggling machines were designed to make the balls follow predefined trajectories, and it is only recently that feedback from cameras and other sensors has been used to correct trajectory errors while juggling. Shannon's juggling machine was built with no feedback, repeating the same movement at a constant frequency and avoiding collisions by the design of its system. The trajectory of the ball was corrected using grooved cups or tracks to redirect it. Many juggling machines were conceived this way, but the latest machines have started to use feedback to learn and adapt their behavior.


The machine named ServoJuggler [Burget and Mezera 2010] can juggle with 5 balls using visual feedback and sensors to detect catches. The humanoid robot created by Kober et al. [2012] is able to juggle with a human partner using one hand. It uses external cameras to locate the ball and to predict its trajectory, then throws the ball back at the right velocity to the human participant.

In the computer graphics field, a physics-based simulation could help jugglers create, visualize and evaluate the feasibility of patterns. Some methods can be applied to calculate the movements needed to throw a ball at the right velocity and the position where the ball can be caught, but they can be difficult to apply in real time. The interactive tool of Jain et al. [2009] is used to manually edit motion to create dynamic scenes with human and object interaction, such as tossing or football juggling. Using a framework for simulation and control of a human musculoskeletal system, Lee et al. [2018] demonstrate the efficacy of their model by performing very realistic juggling. Their character was able to juggle with up to 6 balls alone and more with two persons juggling together. However, the Finite Element Method used to compute the agent's actions needs several seconds of preprocessing per frame to create the simulation.

Recent research on machine learning provides more possibilities to solve the real-time problem. To demonstrate their method for learning a feedback policy, Ding et al. [2015] learn to keep a ball in the air using a racket and to kick a ball to reach a target. Reinforcement learning combined with neural networks has led to many successes in learning policies to perform complex tasks, such as playing Atari games [Mnih et al. 2013]. Applied to physics-based simulation, an agent can learn locomotion behaviors in rich environments using the Proximal Policy Optimization algorithm [Heess et al. 2017]. Using reinforcement learning to imitate actions in motion clips, Peng et al. [2018] make an agent learn to throw a ball at a target like a baseball pitch, reproducing human movement from motion capture. Their character is able to throw a ball using the whole body to reach a target.

Our goal is to create an agent on a 2D simulated body able to learn how to juggle in real time using Deep Deterministic Policy Gradients [Lillicrap et al. 2015]. Juggling is composed of two main subtasks, catching and throwing a ball. We train two agents to perform these subtasks with one arm. The trained agents can be mirrored onto the other arm, and we use them sequentially to recreate the whole juggling process.

2 JUGGLING THEORY
We can divide juggling into three subclasses: roll juggling, where the object rolls over the body or a surface; bounce juggling, as in Shannon's machine, which consists in making the object bounce on a surface before catching it; and finally toss juggling, the best-known practice of juggling, which we focus on. It consists in throwing an object with one hand into the air before catching it with the same or the other hand.

2.1 Siteswap notation
Toss juggling can be summarized as a sequence of throws and catches from one or two hands, each with a different height, timing, direction and power. Therefore, to describe the figures performed and the future sequence of catches and throws, a codification is needed. The most common notation is named Siteswap.

Figure 2: Asynchronous juggling system. One ball is thrown (T) from the right hand at every even beat, and one ball is thrown from the left hand at every odd beat. After a throw, our hand is empty and a new ball is caught (C).

Siteswap is a tool to help jugglers communicate with each other about their tricks and to describe, share and discover new patterns and transitions. Some extensions of this notation can describe very complex patterns. In this work, we will use only the basic notation, named vanilla Siteswap, which describes solo patterns where one object is caught and thrown at a time. As shown in Figure 2, at every periodic beat time, a throw is executed from one hand, alternating between the right and left hand. The hand period corresponds to the time between two throws from the same hand, and the dwell time is the duration for which the hand holds a ball during this period. The dwell time can also be seen as the time available to prepare the throw. According to the measurements made on several professional jugglers [Kalvan 2018], the hand period Phand is around 0.55 seconds with three balls, down to 0.44 seconds when juggling with seven balls. During this period, the performer has to catch the incoming ball and then needs some time to prepare the throw. The dwell time corresponds to the hand period multiplied by the dwell ratio Dratio, which is on average around 70%.

To understand the juggling process, we describe the actions of the left hand in Figure 2. At the beginning, the left hand has a ball and is preparing a throw. Then beat 1 comes and we throw this ball (T). After the throw, our left hand is empty until we catch a new ball (C). After the catch, we have a ball in hand and we prepare the next throw at beat 3. In this example, the odd beats correspond to a throw from the left hand, and the even beats to a throw from the right hand. We do not specify where the ball should be thrown; this is determined by the basic patterns.

2.2 Juggling patterns
We call basic pattern, or basic Siteswap, the simplest kind of juggling pattern; its number corresponds to the number of beats you have to wait before throwing that ball again. For example, a ball thrown with pattern '2' will have to be thrown again two beats later. This means that if a ball is thrown at beat 1, it should be thrown again at beat 3 and that the ball will have to be caught during the hand period just before this.


Figure 3: Example of basic pattern (3) and Siteswap (31).

As the hand performing the throw alternates between right and left at every beat, an odd number represents a throw that goes from one hand to the other, and an even number represents a throw back to the same hand. A vanilla Siteswap is a cyclic sequence of several basic patterns, where we throw a ball at each beat with the next pattern in the sequence. There exist as many vanilla Siteswap patterns as there are possible combinations with only one catch per hand period and one ball thrown at each beat. In this paper, we call Siteswap only sequences of at least two different basic patterns. When juggling with only basic patterns, the number of balls we can have is equal to the pattern used. For example, juggling with basic pattern '4' allows you to juggle with four balls, and so on. For a Siteswap, the number of balls we juggle with is equal to the sum of all basic patterns of the sequence divided by its length. For example, the Siteswap '531' is juggled with three balls.
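To make the counting rule concrete, the short sketch below derives the number of balls from a vanilla Siteswap string and checks that the average is an integer, which is a necessary (though not sufficient) condition for the pattern to be juggleable. The function name and the validity check are our own illustration, not part of the paper.

```python
def siteswap_ball_count(pattern: str) -> int:
    """Number of balls needed for a vanilla Siteswap, e.g. '531' -> 3."""
    throws = [int(c) for c in pattern]
    total = sum(throws)
    if total % len(throws) != 0:
        # The average must be an integer for a juggleable vanilla Siteswap.
        raise ValueError(f"'{pattern}' is not a valid vanilla Siteswap")
    return total // len(throws)

# Basic pattern '4' needs four balls; Siteswap '531' needs three.
assert siteswap_ball_count("4") == 4
assert siteswap_ball_count("531") == 3
```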

We associate a throw to each basic pattern. The higher the basic pattern, the longer the ball will have to stay in the air and so the more powerful the throw will have to be.

$T_{ballInAir} = P_{hand} \, (0.5 \, C_{pattern} - D_{ratio})$   (1)

where $T_{ballInAir}$ is the time the ball has to spend in the air before being caught, $P_{hand}$ is the hand period, $C_{pattern}$ is the basic pattern of the throw (greater than or equal to two), and $D_{ratio}$ is the dwell time divided by the hand period.

$V_{yThrow} = 9.81 \, (T_{ballInAir} / 2)$   (2)

$V_{xThrow} = D_{Catch,Throw} / T_{ballInAir}$   (3)

where $V_{yThrow}$ and $V_{xThrow}$ are the desired throw velocities along the Y-axis and X-axis, and $D_{Catch,Throw}$ is the distance between the throw and the catch at the same height. Performing a throw of a specific basic pattern consists in tossing the ball at the velocities $V_{xThrow}$ and $V_{yThrow}$ associated with it.
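The mapping from a basic pattern to a target release velocity can be written as a short helper. The sketch below assumes the Phand and Dratio values listed later in Table 1; the function name and the 0.4 m example distance are illustrative, not from the paper.

```python
G = 9.81  # gravity (m/s^2), as in the paper's parameter table

def throw_velocity(c_pattern, d_catch_throw, p_hand=0.5, d_ratio=0.70):
    """Desired release velocity (vx, vy) for a basic pattern, Eqs. (1)-(3).

    c_pattern      basic Siteswap pattern of the throw (>= 2)
    d_catch_throw  horizontal distance from release to catch at the same height (m)
    p_hand         hand period (s)
    d_ratio        dwell ratio (dwell time / hand period)
    """
    t_in_air = p_hand * (0.5 * c_pattern - d_ratio)   # Eq. (1)
    vy = G * (t_in_air / 2.0)                          # Eq. (2): vertical velocity
    vx = d_catch_throw / t_in_air                      # Eq. (3): horizontal velocity
    return vx, vy

# e.g. a '3' throw to the other hand, assumed to be 0.4 m away
vx, vy = throw_velocity(c_pattern=3, d_catch_throw=0.4)
```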

3 APPROACH
The overall juggling process can be described as throwing the ball at the right velocity and time and catching the next incoming ball. Our approach is to learn these two distinct tasks separately using Deep Reinforcement Learning (Deep RL). We first train our agents only on the right arm of our body and then mirror them onto the left arm.

Figure 4: Sequence of catch and throw mode for one hand.

Then, in parallel on both sides of the body, we use our catching and throwing agents sequentially to recreate the whole juggling process.

3.1 Juggling sequence
During the simulation, we keep track of the beat time for the left and right hand. At every even beat we throw a ball from the right hand, and at every odd beat, from the left. The switching between the catching and throwing agents for one arm is straightforward, as shown in Figure 4. When we have a ball in hand, we are in throwing mode, and when our hand is empty, we are in catching mode. Therefore, we need to detect the frame when a ball is caught and when a ball is thrown. Both tasks also need to adapt to possible external forces applied to balls while in the air, or to a late catch, which leads to a much reduced dwell time and thus less time to prepare the throw.
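A minimal sketch of the per-hand mode switching described above; the class and method names are our own illustration of the scheme, not the paper's implementation.

```python
class HandController:
    """Switches one hand between throwing and catching mode (Figure 4)."""

    def __init__(self, throwing_agent, catching_agent):
        self.throwing_agent = throwing_agent
        self.catching_agent = catching_agent
        self.has_ball = True  # start with a ball in hand

    def step(self, state):
        # Throwing mode while a ball is held, catching mode otherwise.
        agent = self.throwing_agent if self.has_ball else self.catching_agent
        return agent.act(state)

    def on_ball_released(self):
        self.has_ball = False   # switch to catching mode

    def on_ball_caught(self):
        self.has_ball = True    # switch back to throwing mode
```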

3.2 Reinforcement learning
Reinforcement learning is defined as a Markov decision process $(S, A, P, R, \gamma)$ where $S$ is a set of states, $A$ a set of actions, discrete or continuous, $P(s, a, s')$ is the transition probability $p(s' \mid s, a)$, $R$ a reward function mapping $S \times A \to \mathbb{R}$, and $\gamma \in [0, 1)$ a discount factor.

We train an agent to take actions in a fully observable environment, which returns a reward that encourages or discourages the agent from taking this action again. The goal is to learn a policy that gives the action to perform as a function of the current state, $a_t = \pi(s_t)$, in order to maximize the expected cumulative future reward.

To the optimal policy we associate a Q-function, or action-value function, which satisfies the Bellman optimality equation:

$Q(s, a) = R(s, a) + \gamma \max_{a' \in A} Q(s', a')$   (4)

Deep Deterministic Policy Gradients (DDPG) [Lillicrap et al. 2015] is a model-free RL algorithm for continuous action spaces. It is composed of two networks, the actor and the critic. The actor network learns the policy. The critic network learns an approximation of the Q-value function and updates the actor network toward the optimal actions to perform.
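For orientation, a minimal actor-critic network sketch in the spirit of DDPG, using the two hidden layer sizes (512 and 256) and the learning rates reported in Table 1. This is our own simplified illustration (Keras API, tanh-bounded actions), not the authors' code; the full algorithm also needs target networks, a replay buffer and exploration noise.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model, optimizers

def build_actor(state_dim, action_dim):
    # Maps a state to target joint angles, bounded by tanh.
    s = layers.Input(shape=(state_dim,))
    h = layers.Dense(512, activation="relu")(s)
    h = layers.Dense(256, activation="relu")(h)
    a = layers.Dense(action_dim, activation="tanh")(h)
    return Model(s, a)

def build_critic(state_dim, action_dim):
    # Maps a (state, action) pair to an estimated Q-value.
    s = layers.Input(shape=(state_dim,))
    a = layers.Input(shape=(action_dim,))
    h = layers.Concatenate()([s, a])
    h = layers.Dense(512, activation="relu")(h)
    h = layers.Dense(256, activation="relu")(h)
    q = layers.Dense(1)(h)
    return Model([s, a], q)

actor_opt = optimizers.Adam(1e-4)   # actor learning rate from Table 1
critic_opt = optimizers.Adam(1e-3)  # critic learning rate from Table 1
```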


3.3 Catching
The ball is automatically caught by the hand if their positions are close enough. So the goal of the catching task is to put the hand on the ball and, if possible, near the desired catch height. The catching reward is composed of two terms:

$R = w_1 R_{effort} + w_2 R_{catching}$   (5)

where the weights $w_1$ and $w_2$ are constant values. The first term, $R_{effort}$, encourages the agent to move smoothly by penalizing jerky motion:

$R_{effort} = -\sum_{j \in J} \lvert \omega^j_t - \omega^j_{t-1} \rvert$   (6)

where $J$ is the set of joints, and $\omega^j_t$ and $\omega^j_{t-1}$ are the angular velocities of joint $j$ at times $t$ and $t-1$. The second term, $R_{catching}$, penalizes the agent for not catching a ball or for catching it far from the desired height:

$R_{catching} = \begin{cases} -1 & \text{if the ball is not caught} \\ -\lvert \bar{p}_y - p_y \rvert \in (-1, 0] & \text{otherwise} \end{cases}$   (7)

where $\bar{p}_y$ and $p_y$ are the desired and actual catch heights, respectively.

The first term is applied at every step, whereas the second one is applied only once per episode, on the terminal state. The episode ends when the ball is caught or when it lands out of the action range of the arm.
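A sketch of how the catching reward of Eqs. (5)-(7) could be assembled, with the weights from Table 1. The function signature and the clamping detail are our own reading of the equations, not the authors' code.

```python
def catching_reward(joint_vel, joint_vel_prev, caught=None, catch_y=None,
                    target_y=0.0, w1=0.12, w2=1.0):
    """Per-step effort term plus terminal catch term, Eqs. (5)-(7).

    joint_vel, joint_vel_prev : angular joint velocities at t and t-1
    caught    : None while the episode runs, True/False on the terminal step
    catch_y   : actual catch height when the ball is caught
    """
    r_effort = -sum(abs(w - w_prev) for w, w_prev in zip(joint_vel, joint_vel_prev))
    reward = w1 * r_effort
    if caught is not None:                      # terminal state only
        r_catch = -abs(target_y - catch_y) if caught else -1.0
        reward += w2 * max(r_catch, -1.0)       # keep the terminal term in [-1, 0]
    return reward
```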

3.4 Throwing
To simplify the throwing task, we decide to release the ball automatically, according to the juggling beat times. In this setting, the goal of throwing is, at the release time, to move our hand with the grasped object near the desired release position at the right velocity. The throwing reward is composed of three terms:

$R = w_3 R_{effort} + w_4 R_{positionThrow} + w_5 R_{velocity}$   (8)

where the weights $w_3$, $w_4$ and $w_5$ are constant values. The first term, $R_{effort}$, encourages the agent to move smoothly by penalizing jerky motion:

$R_{effort} = -\sum_{j \in J} \lvert \omega^j_t - \omega^j_{t-1} \rvert$   (9)

The second term, $R_{positionThrow}$, penalizes the agent for releasing the ball far from the desired release point:

$R_{positionThrow} = \begin{cases} -\lVert \bar{p} - p \rVert_2 & \text{when the ball is thrown} \\ 0 & \text{otherwise} \end{cases}$   (10)

where $\bar{p}$ is the position where the hand should release the ball and $p$ is the actual release position. The last term, $R_{velocity}$, penalizes the agent for not throwing the ball at the desired velocity:

$R_{velocity} = \begin{cases} -\lVert \bar{v}_{throw} - v_{throw} \rVert_2 & \text{when the ball is thrown} \\ 0 & \text{otherwise} \end{cases}$   (11)

where $\bar{v}_{throw}$ and $v_{throw}$ are the desired and actual ball velocities at release, respectively.

The first term is applied at every step, whereas the second and third terms are applied only once per episode, on the terminal state. The episode ends when the ball is thrown.
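The throwing reward of Eqs. (8)-(11) can be sketched analogously, again with the weights from Table 1; the use of the Euclidean norm follows our reading of the notation and the function layout is our own illustration.

```python
import numpy as np

def throwing_reward(joint_vel, joint_vel_prev, thrown=False,
                    release_pos=None, target_pos=None,
                    release_vel=None, target_vel=None,
                    w3=0.29, w4=9.0, w5=1.4):
    """Per-step effort term plus terminal position/velocity terms, Eqs. (8)-(11).

    thrown is True only on the frame where the ball is released.
    """
    r_effort = -np.sum(np.abs(np.asarray(joint_vel) - np.asarray(joint_vel_prev)))
    reward = w3 * r_effort
    if thrown:  # terminal state: penalize release position and velocity errors
        r_position = -np.linalg.norm(np.asarray(target_pos) - np.asarray(release_pos))
        r_velocity = -np.linalg.norm(np.asarray(target_vel) - np.asarray(release_vel))
        reward += w4 * r_position + w5 * r_velocity
    return reward
```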

4 EXPERIMENTS

Table 1: Parameters

Simulation time step      10 ms
Replay buffer size        1e5
Hidden layer size (1)     512
Hidden layer size (2)     256
Learning rate (actor)     1e-4
Learning rate (critic)    1e-3
Discount factor (γ)       0.95
Pemax                     1.0
Pemin                     0.20
Phand                     0.5 s
Dratio                    70%
kp                        1200.0
kv                        70.0
w1                        0.12
w2                        1.0
w3                        0.29
w4                        9.0
w5                        1.4
Gravity (N/kg)            9.81

Our algorithm was implemented in Python, using DART for the simulation of articulated body dynamics and TensorFlow for the training of deep neural networks. The simulation was run on a PC with an Intel Core i7-4930K (6 cores, 3.40 GHz). The parameters are listed in Table 1. During the training of both the catching and throwing agents, we linearly decay the exploration rate from Pemax to Pemin over 8000 episodes of training. The size of the hidden layers is the same for both actor and critic and for both agents. For this experiment, we do not take collisions into consideration; they are part of a more complicated problem, as we explain later in this paper.
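The linear exploration-rate decay mentioned above can be written as a one-liner; the function name is our own, and the schedule simply interpolates between the Pemax and Pemin values of Table 1.

```python
def exploration_rate(episode, pe_max=1.0, pe_min=0.20, decay_episodes=8000):
    """Linearly decay the exploration rate from pe_max to pe_min."""
    frac = min(episode / decay_episodes, 1.0)
    return pe_max + frac * (pe_min - pe_max)
```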

4.1 States
The state of the agent contains information about the side of the body we use and the ball concerned. For the body state, we take the position $p_{hand}$ and velocity $v_{hand}$ of the hand, and a set of four joint angles $\theta_J$ in radians, with $J = \{neck, shoulder, elbow, hand\}$. We also take the position $p_{ball}$ and the velocity $v_{ball}$ of the ball. For the throwing agent, we augment the state with the time left before throwing the ball, $t_{throw}$, and the desired throw velocity $\bar{v}_{throw}$. As a result, the catching and throwing states are:

$s_{catching} = \{p_{hand}, v_{hand}, p_{ball}, v_{ball}, \theta_J\}$
$s_{throwing} = \{p_{hand}, v_{hand}, p_{ball}, v_{ball}, \theta_J, t_{throw}, \bar{v}_{throw}\}$
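A sketch of how these state vectors might be assembled from simulation quantities; the field order mirrors the symbols above, but the data layout is our own illustration, not the paper's implementation.

```python
import numpy as np

def catching_state(p_hand, v_hand, p_ball, v_ball, joint_angles):
    # Concatenate hand pose, ball state and the four joint angles.
    return np.concatenate([p_hand, v_hand, p_ball, v_ball, joint_angles])

def throwing_state(p_hand, v_hand, p_ball, v_ball, joint_angles,
                   t_throw, v_throw_target):
    # Same as the catching state, plus time to release and target velocity.
    return np.concatenate([p_hand, v_hand, p_ball, v_ball, joint_angles,
                           [t_throw], v_throw_target])
```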

4.2 Actions
For both policies, catching and throwing, an action represents a set of target joint angles $\bar{\theta}_J$. For both agents we have an action $a = \{\bar{\theta}_{neck}, \bar{\theta}_{shoulder}, \bar{\theta}_{elbow}, \bar{\theta}_{hand}\}$.


These target joint angles are used by a PD controller to compute the torques applied to the joints:

$\tau_i = k^i_p (\bar{\theta}_i - \theta_i) + k^i_v (\bar{\omega}_i - \omega_i)$   (12)

where, for joint $i \in J$, $\tau_i$ is the torque applied to it, $\bar{\theta}_i$ and $\theta_i$ are its target and actual joint angles, $\bar{\omega}_i$ and $\omega_i$ are its target and actual joint angular velocities, and $k^i_p$ and $k^i_v$ are manually specified gains. For this experiment, we set the same gains $k_p$ and $k_v$ for all joints, and $\bar{\omega}_i$ is set to zero for all joints in $J$.
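The per-joint torque of Eq. (12) reduces to a few lines, shown here with the shared gains of Table 1 and zero target velocity as described above; the vectorized form is our own sketch.

```python
import numpy as np

def pd_torques(theta_target, theta, omega, kp=1200.0, kv=70.0):
    """PD torque computation, Eq. (12), with zero target joint velocity."""
    theta_target, theta, omega = map(np.asarray, (theta_target, theta, omega))
    return kp * (theta_target - theta) + kv * (0.0 - omega)
```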

5 RESULTS
We first test each agent on the task it has been trained for, and we finally use them sequentially to create our simulated juggler. The results for both tasks can be seen in the supplemental video. The training time needed for the learning curves of our agents to converge is around 1 hour 20 minutes for throwing and 2 hours for catching. The throwing task converges faster and is far more stable than the catching task, but both succeed in their own subtask. As a result, the combination of both agents on our character is able to perform Siteswaps composed of basic patterns up to 7.

Figure 5: Learning curve for throwing task.

Figure 6: Learning curve for catching task.

5.1 Throwing results
We trained our throwing agent to move our hand with the grasped ball toward the desired position at the right velocity before the release time. This velocity is associated with the pattern we want to perform, and we train with patterns up to 7. Figure 5 shows the learning curve for the throwing task. At the start of training, our hand fails to throw the ball at the right velocity, resulting in a high negative reward. Eventually our hand finds a strategy that balances the negative reward terms. At the final state of training, the average reward is around −2, with 40% of it due to the error in velocity, 25% due to the error in the desired throw position and 35% due to the joint effort. We manually selected the reward weights to obtain the best and smoothest agent possible. With the velocity and position reward weights fixed, we found the effort reward weight to be very sensitive to change, as it can lead to a very jerky result or to a failure to throw at all. A vertical velocity error directly impacts the time the ball stays in the air. If this time difference is too high, the ball may land in the wrong hand period, and as we can have only one ball in the hand per period, the ball will not be caught and will fall on the ground. A horizontal error changes the landing position of the ball. As long as this position is in the range of action of the receiving hand, the catch is possible, but if the ball lands outside of it, the pattern will fail. Our agent can successfully throw a ball with a pattern up to 7 with a non-critical velocity error.

5.2 Catching results
We trained our catching agent to place its hand on a ball and catch it near the desired height. The ball can come from any position at any velocity. As a result, the hand waits for the ball to be in its range of action and rapidly moves to its position to catch it. We test the robustness of our agent by adding wind as a constant external force on balls, modifying their velocity. The reward is less than or equal to −1.0 when the catch fails, and greater when the catch is a success. The closer the catch is to the desired position, the higher the reward. The hand successfully catches the ball as long as the ball is in its arm's range, and as close as possible to the desired position.


We manually selected the reward weights to ensure the catch while keeping the motion smooth.

Catching is a very time-sensitive task, which makes the learning very unstable, as we can see in Figure 6. We think this is because the number of frames where the ball is in catching range can be very short, and it is further reduced if the velocity of the incoming ball is high or if we raise the simulation time step. The hand has to be precisely in the grasping area of the ball on these key frames. With the catching reward weight fixed, if we set a low effort reward weight, our catching agent tries to sweep the area very fast near the landing position. This strategy gives a high probability of catching the ball and also makes the learning more stable, but it is full of jerks and unnatural. In this work we keep the effort reward weight at the value where our agent performs the smoothest catches.

5.3 Juggling results
After these agents learned to perform their respective subtasks successfully, we apply them to both arms and switch between them as explained in Figure 4. The catching agent succeeds in catching balls thrown from the same or the other hand. The throwing agent succeeds in tossing the ball at the right velocity and making it land on the aimed hand at the desired hand period, for all patterns up to 7. After the initialization phase, where all balls settle into near-constant trajectories, the combination of both agents appears quite smooth when juggling continuously.

Figure 7 shows some results of our juggling simulation. The basic pattern 3 (a) consists in throwing a ball to the other hand at every beat time. Basic pattern 4 (b) is an even number, so the throw goes back to the same hand. Finally, the Siteswap 53 (c) consists in throwing a ball from the left hand with basic pattern 3 and from the right hand with basic pattern 5. When trained with different reward weights for throwing, our character can show some limitations with patterns greater than or equal to 7, as it does not have enough power to throw the ball high enough. As a result, the hand period when the ball is at catching range is already occupied by another ball, and the juggling fails.

We test the robustness of our agents by adding wind to the balls while in the air, applying a constant horizontal acceleration that makes them land further from or closer to our character. Our agents succeed in catching all balls, as long as they land in the character's range of action, and in performing the next throw. These juggling simulations for different patterns can be seen in the supplemental video.

6 CONCLUSION
We presented a method to learn catching and throwing tasks in a physics-based simulation using Deep Deterministic Policy Gradients, and we used the learned agents sequentially by switching from catching to throwing mode, respecting the mathematics of juggling with the vanilla Siteswap notation. In our simulation, reinforcement learning allows us to obtain a real-time juggling simulation able to adapt to some external forces on balls and to perform some patterns continuously with success.

The unstable learning of the catching agent has an impact on its ability to catch some balls in difficult situations. To make this learning more stable, a solution could be to use the calculated ball landing position directly, so that the agent can anticipate it.

Concerning the throwing agent, our character fails to throw patterns higher than 7, which require more power. An easy solution could be to raise the gains kp and kv of the PD controller to make the joints move faster, but this would lead to more unnaturalness. There are many improvements to make in order to fix the still-jerky movements of our character, such as setting customized gains for each joint. Another direction of exploration would be to move to a 3D simulation. Adding new degrees of freedom may give our character more options to throw high patterns and also lead to more realistic motion. Furthermore, considering collisions and adding balls while juggling requires planning the timing, power and direction of our future throws. The planning of throws and movements could be performed by high- and low-level controllers respectively, as demonstrated by Peng et al. [2017] for locomotion planning. Finally, our agents are trained to catch at a fixed height and to throw at a fixed position. It would be interesting in future work to train them to throw and catch at any position, allowing more complex juggling patterns, and also to modify our simulation to perform other types of juggling. Football juggling, for example, could be done by setting the dwell time to zero and by redefining the conditions of catches and throws in the simulation.

ACKNOWLEDGMENTS
This research was supported by the MSIP (Ministry of Science, ICT and Future Planning), Korea, under the SW STARLab support program (IITP-2017-0536-20170040) supervised by the IITP (Institute for Information & communications Technology Promotion).

REFERENCES
Peter J. Beek and Arthur Lewbel. 1995. The Science of Juggling - Studying the ability to toss and catch balls and rings provides insight into human coordination, robotics and mathematics. Scientific American 273, 5 (1995).
Pavel Burget and Pavel Mezera. 2010. A Visual-Feedback Juggler With Servo Drives. The 11th IEEE International Workshop on Advanced Motion Control (2010).
Kai Ding, Libin Liu, Michiel van de Panne, and KangKang Yin. 2015. Learning Reduced-Order Feedback Policies for Motion Skills. In Proc. ACM SIGGRAPH / Eurographics Symposium on Computer Animation.
Nicolas Heess, Dhruva TB, Srinivasan Sriram, Jay Lemmon, Josh Merel, Greg Wayne, Yuval Tassa, Tom Erez, Ziyu Wang, S. M. Ali Eslami, Martin A. Riedmiller, and David Silver. 2017. Emergence of Locomotion Behaviours in Rich Environments. CoRR abs/1707.02286 (2017).
Sumit Jain and C. Karen Liu. 2009. Interactive Synthesis of Human-Object Interaction. In ACM SIGGRAPH/Eurographics Symposium on Computer Animation (SCA). ACM, New York, NY, USA, 47–53. DOI: http://dx.doi.org/10.1145/1599470.1599476
Jack Kalvan. 2018. The Human Limits - How Many Objects Can Be Juggled. (2018).
Jens Kober, Matthew Glisson, and Michael Mistry. 2012. Playing Catch and Juggling with a Humanoid Robot. 12th IEEE-RAS International Conference on Humanoid Robots (2012).
Seunghwan Lee, Ri Yu, Jungnam Park, Mridul Aanjaneya, Eftychios Sifakis, and Jehee Lee. 2018. Dexterous Manipulation and Control with Volumetric Muscles. ACM Transactions on Graphics (SIGGRAPH 2018) 37, 57 (2018).
Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. 2015. Continuous control with deep reinforcement learning. CoRR abs/1509.02971 (2015).
Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin A. Riedmiller. 2013. Playing Atari with Deep Reinforcement Learning. CoRR abs/1312.5602 (2013).
Xue Bin Peng, Pieter Abbeel, Sergey Levine, and Michiel van de Panne. 2018. DeepMimic: Example-Guided Deep Reinforcement Learning of Physics-Based Character Skills. ACM Transactions on Graphics (Proc. SIGGRAPH 2018) 37, 4 (2018).
Xue Bin Peng, Glen Berseth, KangKang Yin, and Michiel van de Panne. 2017. DeepLoco: Dynamic Locomotion Skills Using Hierarchical Deep Reinforcement Learning. ACM Transactions on Graphics (Proc. SIGGRAPH 2017) 36, 4 (2017).
Claude Elwood Shannon. 1981. Scientific Aspects of Juggling. In Claude Elwood Shannon: Collected Papers, Aaron D. Wyner and N.J.A. Sloane (Eds.). IEEE Press, New York, 850–864.


Figure 7: Juggling patterns.