
Under review as a conference paper at ICLR 2017

LEARNING TO PERFORM PHYSICS EXPERIMENTS VIA DEEP REINFORCEMENT LEARNING

Misha Denil 1, Pulkit Agrawal 2, Tejas D Kulkarni 1, Tom Erez 1, Peter Battaglia 1, Nando de Freitas 1,3,4
1 DeepMind  2 University of California Berkeley  3 University of Oxford

4 Canadian Institute for Advanced Research
{mdenil,tkulkarni,etom,peterbattaglia,nandodefreitas}@google.com

[email protected]

ABSTRACT

When encountering novel objects, humans are able to infer a wide range of physical properties such as mass, friction and deformability by interacting with them in a goal-driven way. This process of active interaction is in the same spirit as a scientist performing an experiment to discover hidden facts. Recent advances in artificial intelligence have yielded machines that can achieve superhuman performance in Go, Atari, natural language processing, and complex control problems, but it is not clear that these systems can rival the scientific intuition of even a young child. In this work we introduce a basic set of tasks that require agents to estimate hidden properties such as mass and cohesion of objects in an interactive simulated environment where they can manipulate the objects and observe the consequences. We found that state-of-the-art deep reinforcement learning methods can learn to perform the experiments necessary to discover such hidden properties. By systematically manipulating the problem difficulty and the cost incurred by the agent for performing experiments, we found that agents learn different strategies that balance the cost of gathering information against the cost of making mistakes in different situations.

1 INTRODUCTION

Our work is inspired by empirical findings and theories in psychology indicating that infant learning and thinking is similar to that of adult scientists (Gopnik, 2012). One important view in developmental science is that babies are endowed with a small number of separable systems of core knowledge for reasoning about objects, actions, number, space, and possibly social interactions (Spelke & Kinzler, 2007). The object core system, covering aspects such as cohesion, continuity, and contact, enables babies and other animals to solve object-related tasks such as reasoning about occlusion and predicting how objects behave.

Core knowledge research has motivated the development of methods that endow agents with physics priors and perception modules so as to infer intrinsic physical properties rapidly from data (Battaglia et al., 2013; Wu et al., 2015; 2016; Stewart & Ermon, 2016). For instance, using physics engines and mental simulation, it becomes possible to infer quantities such as mass from visual input (Hamrick et al., 2016; Wu et al., 2015).

In early stages of life, infants spend a lot of time interacting with objects in a seemingly random manner (Smith & Gasser, 2005). They interact with objects in multiple ways, such as by throwing, pushing, pulling, breaking, and biting. It is quite possible that this process of actively engaging with objects and watching the consequences of the applied actions helps infants understand physical properties of objects that cannot be inferred directly from the sensory systems. It seems infants run a series of “physical” experiments to enhance their knowledge about the world (Gopnik, 2012). The act of performing an experiment (i.e. active interaction with the environment) is useful both for quickly adapting an agent’s policy to a new environment and for understanding object properties in a holistic manner. Despite impressive advances in artificial intelligence that have led to superhuman performance in Go, Atari and natural language processing, it is still unclear whether the systems behind these advances can rival the scientific intuition of even a small child.


While we draw inspiration from child development, it must be emphasized that our purpose is not to provide an account of learning and thinking in humans. Our intent in this paper is to show that we can build agents that learn to experiment so as to learn representations that are informative about physical properties of objects, using deep reinforcement learning. We believe this work establishes an important foundation for future research in grounded experimentation, intuitive physics and transfer of skills across tasks. The act of conducting an experiment involves the agent having a belief about the world, which it then updates by observing the consequences of the actions it performs.

We investigate the ability of state-of-the-art deep reinforcement learning methods to perform experiments to infer object properties such as mass and cohesion by constructing two environments: Which is Heavier and Towers. In the Which is Heavier environment, the agent is allowed to apply forces to blocks and must infer which is the heaviest block, whereas in the Towers environment the agent’s task is to infer how many unique objects are present in the world. Unlike Wu et al. (2015), we assume that the agent has no prior knowledge about the physical properties of objects or the laws of physics, and hence it must interact with the objects in order to learn to answer questions about these properties. Our work can thus be thought of as question answering via interaction.

Our results indicate that in the Which is Heavier environment our agents learn experimentation strategies that are similar to those we would expect from an algorithm designed with knowledge of the underlying structure of the environment. In the Towers environment we show that our agents learn a closed-loop policy that can adapt to a varying time scale in the environment.

2 ANSWERING QUESTIONS THROUGH INTERACTION

We pose the problem of experimentation as that of answering questions about non-visual properties of objects present in the environment. We designed environments that provide rewards when the agent is able to correctly infer these properties, and we train agents using reinforcement learning.

The process by which the agent conducts the experiments can be broken down into three phases:

Exploration via Interaction Initially there is an exploration phase, where the agent is free to interact with the environment and gather information.

Labeling At the end of the exploration phase the agent produces a labeling action through which it communicates its answer to the implicit question posed by the environment.

Reward When the agent produces a labeling action, the environment responds with a reward, +1 for a correct answer and -1 for an incorrect one, and the episode terminates. If the agent does not output any labeling action, the episode terminates automatically after a maximum time limit and the agent gets a reward of -2.

Crucially, the transition between interaction and labeling does not happen at a fixed time step, but is initiated by the agent. This is achieved by allowing the agent, at every time step, to output either an interaction action that influences the objects in the environment or a labeling action. This setup allows the agent to decide when enough information has been gathered, and forces it to balance the trade-off between answering now given its current knowledge, or delaying its answer to gather more information.
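
To make the protocol concrete, the sketch below shows the episode loop using the Which is Heavier action space as an example. The `env` and `agent` objects and their methods are hypothetical stand-ins, not the authors' implementation; only the reward scheme (+1/-1/-2) and the agent-initiated labeling follow the description above.

```python
NUM_BLOCKS = 4
INTERACTION_ACTIONS = tuple(range(NUM_BLOCKS))               # poke block i
LABELING_ACTIONS = tuple(range(NUM_BLOCKS, 2 * NUM_BLOCKS))  # answer "block i is heaviest"
MAX_STEPS = 100

def run_episode(env, agent):
    obs = env.reset()
    for _ in range(MAX_STEPS):
        action = agent.act(obs)
        if action in LABELING_ACTIONS:
            guess = action - NUM_BLOCKS
            # Labeling ends the episode: +1 if correct, -1 otherwise.
            return 1.0 if guess == env.heaviest_block else -1.0
        obs = env.step(action)  # interaction action: apply a force, observe the result
    return -2.0  # the agent never answered: timeout penalty
```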

The optimal trade-off between information gathering and risk of answering depends on two factors. The first factor is the difficulty of the question and the second is the cost of information. The difficulty of the question is environment specific and is addressed later when we describe the environments. The cost of information can be generically controlled by varying the discount factor. A small discount factor places less emphasis on future rewards and encourages the agent to answer the question as quickly as possible. On the other hand, a large discount factor encourages the agent to spend more time gathering information in order to increase the likelihood of outputting a correct answer.
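
As a toy illustration of this trade-off (the accuracy curve below is assumed for illustration, not taken from the paper), one can compare the discounted value of answering after k additional steps under different discount factors; the larger discount factor favors waiting longer.

```python
import numpy as np

def value_of_answering_after(k, gamma, p_correct):
    """Expected discounted terminal reward if the agent waits k steps, then answers."""
    return gamma ** k * (2 * p_correct(k) - 1)  # terminal reward is +1 or -1

# Assumed accuracy curve: gathering information slowly improves the answer.
p_correct = lambda k: 1.0 - 0.5 * np.exp(-0.5 * k)

for gamma in (0.95, 0.99):
    best_k = max(range(30), key=lambda k: value_of_answering_after(k, gamma, p_correct))
    print(f"gamma={gamma}: best to answer after ~{best_k} steps")
```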

3 AGENT ARCHITECTURE AND TRAINING

We use the same basic agent architecture and training procedure for all of our experiments, making only minimal modifications in order to adapt the agents to different observation spaces and actuators.


Figure 1: Left: Diagram of the Which is Heavier environment. Blocks are always arranged in a line, but the density of the different blocks changes from episode to episode. Agents can interact with the blocks through one of the various actuators described in the body of the paper. Right: Distributions of the mass gap for different settings of β.

For all experiments we train recurrent agents using an LSTM with 100 hidden units. When working from features we feed the observations into the LSTM directly. When training from pixels we first scale the observations to 84x84 pixels and feed them through three convolutional layers, each followed by a ReLU non-linearity. These three layers are composed of 32, 64 and 64 square filters with sizes 8, 4 and 3, applied at strides of 4, 2 and 1, respectively. We train the agents using Asynchronous Advantage Actor Critic (Mnih et al., 2016), but ensure that the unroll length is always greater than the timeout length so the agent network is unrolled over the entirety of each episode.
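
A minimal PyTorch sketch of one plausible reading of this architecture is shown below. The actor and critic head sizes and the direct connection from the convolutional stack into the LSTM are assumptions where the text is silent; this is not the authors' code.

```python
import torch
import torch.nn as nn

class PixelAgent(nn.Module):
    """Sketch of the pixel-based A3C agent described above (unspecified details are assumed)."""
    def __init__(self, num_actions):
        super().__init__()
        # Three conv layers: 32/64/64 square filters of size 8/4/3 at strides 4/2/1, each with ReLU.
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        conv_out = 64 * 7 * 7                     # spatial size for 84x84 inputs
        self.lstm = nn.LSTMCell(conv_out, 100)    # recurrent core with 100 hidden units
        self.policy = nn.Linear(100, num_actions) # actor head (action logits)
        self.value = nn.Linear(100, 1)            # critic head (state value)

    def forward(self, obs, state):
        x = self.conv(obs).flatten(start_dim=1)
        h, c = self.lstm(x, state)
        return self.policy(h), self.value(h), (h, c)
```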

4 WHICH IS HEAVIER

The Which is Heavier domain is designed to have agents answer a question about the mass of different objects in a scene.

4.1 ENVIRONMENT

The environment is depicted in the left panel of Figure 1. It consists of four blocks, which are constrained to move only along the vertical axis. The blocks are always the same size, but vary in mass between episodes. The agent’s strength (i.e. the magnitude of force it can apply) remains constant between episodes, and we provide different actuators that allow the agent to interact with the blocks in different ways.

The question to answer in this environment is which of the four blocks is the heaviest. Since the mass of each block is randomly assigned in each episode, the agent must poke the blocks and observe how they respond in order to determine which block is the heaviest. Assigning masses randomly in each episode ensures that it is not possible to solve this task from vision (or features) alone, since the color and identity of each block imparts no information about its mass in the current episode. The only way to obtain information about the masses of the blocks is to interact with them and watch how they respond.

The Which is Heavier environment is designed to encode a latent bandit problem through a “physical” lens. Each block corresponds to an arm of the bandit, and the reward obtained by pulling each arm is proportional to the mass of the block. Identifying the heaviest block can then be seen as a best arm identification problem (Audibert & Bubeck, 2010). Best arm identification is a well studied problem in experimental design, and an understanding of how an optimal solution to the latent bandit should behave can be used to guide our analysis of agents we train on this task.
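
For reference only, a naive best-arm-identification baseline that observes noisy rewards directly might look like the sketch below. Our agents have no such access: mass is observed only through its effect on dynamics, which is exactly the point made in the next paragraph.

```python
import numpy as np

def best_arm_uniform(pull, num_arms, pulls_per_arm):
    """Estimate each arm's mean reward from an equal number of pulls, return the argmax."""
    means = [np.mean([pull(a) for _ in range(pulls_per_arm)]) for a in range(num_arms)]
    return int(np.argmax(means))

# Example: each "pull" is a noisy direct observation of a block's (hidden) mass.
masses = [0.2, 0.9, 0.3, 0.4]
pull = lambda a: masses[a] + np.random.normal(scale=0.1)
print(best_arm_uniform(pull, num_arms=4, pulls_per_arm=10))  # usually prints 1
```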

It is important to emphasize that we cannot simply apply standard bandit algorithms to this task, because we impose a much higher level of prior ignorance on our algorithms than that setting allows. Bandit algorithms assume that the rewards are observed directly, which is not the case in our setting. Our agents observe mass through its role in dynamics (and, in the case of learning from pixels, through the lens of vision as well). One could imagine parameterizing this transformation from reward to observation, and perhaps even learning this mapping as well, in a bandit setting; however, doing so requires explicitly acknowledging this mapping in the design of the learning algorithm, which we avoid doing. Moreover, acknowledging this mapping in any way requires a priori recognition of the existence of the latent bandit structure. From the perspective of our learning algorithm, the mere existence of this structure also lies beyond the veil of ignorance.



Controlling the distribution of masses allows us to control the difficulty of this task. In particular, by controlling the expected size of the mass gap between the two heaviest blocks we can make the task more or less difficult. We can assign masses in the range [0, 1] and scale them to an appropriate range for the agent’s strength.

We use the following scheme for controlling the difficulty of the Which is Heavier environment. First we select one of the blocks uniformly at random to be the “heavy” block and designate the remaining three as “light”. We sample the mass of the heavy block from Beta(β, 1) and the masses of the light blocks from Beta(1, β). The single parameter β controls the distribution of mass gaps (and thus controls the difficulty), with large values of β leading to easier problems.
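
The sampling scheme can be written down directly; the sketch below follows the description above, with the scaling to the agent's strength range left out.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_masses(beta, num_blocks=4):
    """Sample block masses in [0, 1]; scaling to the agent's strength is omitted here."""
    heavy = rng.integers(num_blocks)               # "heavy" block chosen uniformly
    masses = rng.beta(1.0, beta, size=num_blocks)  # "light" blocks ~ Beta(1, beta)
    masses[heavy] = rng.beta(beta, 1.0)            # "heavy" block ~ Beta(beta, 1)
    return masses

# Larger beta widens the typical gap between the two heaviest blocks (easier episodes).
for beta in (3, 5, 10):
    gaps = [np.diff(np.sort(sample_masses(beta))[-2:])[0] for _ in range(10000)]
    print(beta, round(float(np.mean(gaps)), 3))
```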

We distinguish between problem level and instance level difficulty for this domain. Instance level difficulty refers to the size of the mass gap in a single episode. If the mass gap is small it is harder to determine which block is heaviest, and we say that one episode is more difficult than another by comparing their mass gaps. Problem level difficulty refers to the shape of the generating distribution of mass gaps (e.g. as shown in the right panel of Figure 1). A distribution that puts more mass on configurations that have a small mass gap will tend to generate more episodes that are difficult at the instance level. We control the problem level difficulty through β, but we incorporate both problem and instance level difficulty in our analysis.

We set the episode length limit to 100 steps in this environment, which is much longer than a typical episode produced by a successfully trained agent.

4.2 ACTUATORS

The obvious choice for actuation in physical domains is some kind of arm or hand based manipulator. However, controlling an arm or hand is quite challenging on its own, requiring a fair amount of dexterity on the part of the agent. The manipulation problem, while very interesting, is orthogonal to our interest in this work. Therefore we avoid the problem of learning dexterous manipulation by providing the agent with simpler forms of actuation.

We call the actuation strategy for this environment direct actuation, which allows the agent to apply forces to the different blocks directly. At every time step the agent can output one of 8 possible actions. The first four actions result in the application of a vertical force of fixed magnitude to the center of mass of the corresponding block. Actions 5-8 are labeling actions and correspond to the agent communicating which block it believes is the heaviest.
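
Concretely, the 8-way action space could be decoded as in the sketch below (0-indexed, with an assumed force magnitude); the actual environment interface and units are not specified in the paper.

```python
FORCE_MAGNITUDE = 1.0  # fixed magnitude; the value used in the paper is not stated

def decode_direct_action(action):
    """Map one of the 8 discrete actions to its meaning (0-indexed here)."""
    if action < 4:
        return ("apply_vertical_force", action, FORCE_MAGNITUDE)  # push block `action`
    return ("label_heaviest", action - 4)                          # answer: block (action - 4)
```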

4.3 EXPERIMENTS

Our first experiment is a sanity check to show that we can train agents successfully on the Which is Heavier environment using both features and pixels. This experiment is designed simply to show that our task is solvable, and to emphasize the fact that by changing the problem difficulty we can make the Which is Heavier task very hard.

We present two additional experiments showing how varying difficulty leads to differentiated behavior both at the problem level and at the instance level. In both cases knowledge of the latent bandit problem allows us to make predictions about how an experimenting agent should behave, and our experiments are designed to show that qualitatively correct behavior is obtained by our agents in spite of their a-priori ignorance of the underlying structure of the environment.

The first experiment shows that as we increase the problem difficulty the learned policies transition from guessing immediately when a heavy block is found to strongly preferring to poke all blocks before making a decision. This corresponds to the observation that if it is unlikely for more than one arm to give high reward then any high reward arm is likely to be best.

The second experiment shows that we also see behavior adapting to problem difficulty at the level of individual problem instances, where a single agent will tend to spend longer gathering information when the particular problem instance is more difficult. This corresponds to the observation that when the two best arms have similar reward, more information is required to accurately distinguish them.


Figure 2: Learning curves for typical agents trained on the Which is Heavier environment at varying difficulty settings. The y-axis shows the probability of the agent producing the correct answer before the episode times out. Agents do not reach perfect performance on the harder settings because of the presence of very difficult episodes in the generating distribution. Left: Agents trained from features. Right: Agents trained from pixels.


Success in learning For this experiment we trained several agents at three different difficulties corresponding to β ∈ {3, 5, 10}. For each problem difficulty we trained agents on both feature observations, which include the z coordinate of each of the four blocks, and also using raw pixels, providing an 84 × 84 pixel RGB rendering of the scene to the agent. Representative learning curves for each condition are shown in Figure 2. The curves are smoothed over time to show a running estimate of the probability of success, rather than showing the binary outcomes directly.

The agents do not reach perfect performance on this task, with more difficult problems plateauing at progressively lower performance. This can be explained by looking at the distributions of instance level difficulties generated by different settings of β, which are shown in the right panel of Figure 1. For higher difficulties (lower values of β) there is a substantial probability of generating problem instances where the mass gap is near 0, which makes distinguishing between the two heaviest blocks very difficult.

Since the agents exhibit similar performance using pixels and features, we conduct the remaining experiments in this section using feature observations, since these agents are substantially faster to train.1

Population strategy differentiation For this experiment we trained agents at three different difficulties corresponding to β ∈ {3, 5, 10}, all using a discount factor of γ = 0.95, which corresponds to a relatively high cost of gathering information. We trained three agents for each difficulty and show results aggregated across the different replicas.

After training, each agent was run for 10,000 steps under the same conditions they were exposed to during training. We record the number and length of episodes executed during the testing period as well as the outcome of each episode. Episodes are terminated by timeout after 100 steps, but the vast majority of episodes are terminated in < 30 steps by the agent producing a label. Since episodes vary in length, not all agents complete the same number of episodes during testing. The overall success rate on this task is high, with the agents choosing correctly on 96.3% of episodes.

The left plot in Figure 3 shows histograms of the episode lengths broken down by task difficulty. The red vertical line indicates an episode length of four interaction steps, which is the minimum number of actions required for the agents to interact with every block. At a task difficulty of β = 10 the agents appear to learn simply to search for a single heavy block (which can be found with an average of two interactions). However, at a task difficulty of β = 3 we see a strong bias away from terminating the episode before taking at least four exploratory actions.

Individual strategy differentiation For this experiment we trained agents using the same threetask difficulties as in the previous experiment, but with an increased discount factor of γ = 0.99.

1The majority of training time for pixel based agents comes from rendering the observations.


Figure 3: Left: Histograms of episode lengths for different task difficulty (β) settings. There is a transition from β = 10, where the agents answer eagerly as soon as they find a heavy block, to β = 3, where the agents are more conservative about answering before they have acted enough to poke all the blocks at least once. Right: Episode lengths as a function of the normalized mass gap. Units on the x-axis are scaled to the range of possible masses, and the y-axis shows the number of steps before the agent takes a labeling action. The black dots show individual episodes, the red line shows a linear trend fit by OLS, and error bars show a histogram estimate of standard deviations. Each plot shows the testing episodes of a single trained agent.

This decreases the cost of exploration and encourages the agents to gather more information before producing a label, leading to longer episodes.

After training, each agent was run for 100,000 steps under the same conditions they were exposed to during training. We record the length of each episode, as well as the mass gap between the two heaviest blocks in each episode. In the same way that we use the distribution of mass gaps as a measure of task difficulty, we can use the mass gap in a single episode as a measure of the difficulty of that specific problem instance. We again exclude from analysis the very small proportion of episodes that terminate by timeout.

The right plots in Figure 3 show the relationship between the mass gap and episode length across the testing runs of two different agents. From these plots we can see how a single agent has learned to adapt its behavior based on the difficulty of a single problem instance. This behavior reflects what we would expect from a solution to the latent bandit problem; more information is required to identify the best arm when the second best arm is nearly as good. Although the variance is high, there is a clear correlation between the mass gap and the length of the episodes.

5 TOWERS

The Towers domain is designed to have agents count the number of cohesive rigid bodies in a scene. The environment is designed so that in its initial configuration it is not possible to determine the number of rigid bodies from vision alone.

5.1 ENVIRONMENT

The environment is depicted in the left panel of Figure 4. It consists of a tower of five blocks which can move freely in three dimensions. The initial block tower is always in the same configuration, but in each episode we bolt together different subsets of the blocks to form larger rigid bodies, as shown in the figure. The question to answer in this environment is how many rigid bodies are formed from the primitive blocks. Since which blocks are bound together into rigid bodies is randomly assigned in each episode, and the binding forces are invisible, the agent must poke the block tower and observe how it falls down in order to determine how many rigid bodies it is composed of. We create the environment in a way that the distribution over the number of separate rigid bodies in the tower is uniform. This ensures that there is no single action strategy that achieves high reward.

5.2 ACTUATORS

In the “Towers” environment, we used two actuators: direct actuation and the fist actuator shown in the top panel of Figure 4. In the case of direct actuation, the agent can output one of 25 actions.

Figure 4: Top: Example trajectory of a block tower being knocked down using the fist actuator. Left: Schematic diagram of the hidden structure of the Towers environment. The tower on the left is composed of five blocks, but could decompose into rigid objects in any of several ways, which can only be distinguished by interacting with the tower. Right: Behavior of a single trained agent using the fist actuator when varying the agent time step. The x-axis shows different agent time step lengths (the training condition is 0.1s). The blue line shows the probability of the agent correctly identifying the number of blocks. The red line shows the median episode length (in seconds), with error bars showing 95% confidence intervals computed over 50 episodes. The shaded region shows ±1 agent time step around the median.

At every time step, the agent can apply a force of fixed magnitude in the +x, -x, +y or -y direction to one of the five blocks. If two blocks are glued together, both blocks move under the effect of the force. Since there are a maximum of 5 blocks, this results in 20 different possible actions. Actions 21-25 are labeling actions that are used by the agent to communicate the number of distinct rigid bodies in the tower.

The fist is a large spherical object that the agent can actuate by setting its velocity in a 2D horizontal plane. Unlike direct actuation, the agent cannot apply any direct forces to the objects that constitute the tower, but can only manipulate them by pushing or hitting them with the fist. At every time step the agent can output one of nine actions. The first four actions correspond to setting the velocity of the fist to a constant amount in the (+x, -x, +y, -y) directions respectively. Actions 5-9 are labeling actions that are used by the agent to communicate the number of distinct rigid bodies in the tower. For the direct actuators we use an episode timeout of 150 steps and for the fist actuator we use an episode timeout of 100 steps.
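
The two Towers action spaces can be summarized by the decoding sketch below (0-indexed; the exact mapping from action index to block and direction ordering is an assumption, and the environment interface is a hypothetical stand-in).

```python
NUM_BLOCKS = 5
DIRECTIONS = ("+x", "-x", "+y", "-y")

def decode_direct(action):
    """25 actions: 20 pushes (block x direction), then 5 labeling actions."""
    if action < 20:
        return ("push", action // 4, DIRECTIONS[action % 4])  # (block index, direction)
    return ("label", action - 20 + 1)                          # answer: 1..5 rigid bodies

def decode_fist(action):
    """9 actions: 4 fist velocity settings, then 5 labeling actions."""
    if action < 4:
        return ("set_fist_velocity", DIRECTIONS[action])
    return ("label", action - 4 + 1)
```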

In order to investigate whether the agent learns a strategy of stopping after a fixed number of time steps or whether it integrates sensory information in a non-trivial manner, we used a notion of “agent time step”. The idea of an agent time step is similar to that of action repeats: if the physics simulation time step is 0.025s and the agent time step is 0.1s, the same action is repeated 4 times.
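
In other words, the agent time step acts like an action repeat on top of a fixed physics step, as in the sketch below (the `env` interface is a stand-in; only the 0.025s/0.1s example values come from the text).

```python
PHYSICS_DT = 0.025  # seconds per simulation step (example value from the text)

def agent_step(env, action, agent_dt=0.1):
    """Repeat `action` for enough physics steps to span one agent time step."""
    repeats = int(round(agent_dt / PHYSICS_DT))  # e.g. 0.1 / 0.025 = 4 repeats
    obs = None
    for _ in range(repeats):
        obs = env.step(action)  # same action at every underlying physics step
    return obs
```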

5.3 EXPERIMENTS

Our first experiment is again a sanity check to show that we can train agents in the Towers environment. It is designed simply to show that the task is solvable by our agents.

The second experiment shows that the agents learn to wait for an observation where they can identify the number of rigid bodies before producing an answer. This is designed to show that the agents find a sensible, rather than degenerate, strategy for counting the number of rigid bodies. An alternative hypothesis would be that agents learn to wait for (approximately) the same number of steps each time and then take their best guess.

Success in learning For this experiment we trained several agents on the Towers environment using different pairings of actuators and perception. The feature observations include the 3D position of each primitive block, and when training using raw pixels we provide an 84 × 84 pixel RGB rendering of the scene as the agent observation.


Figure 5: Learning curves for a collection of agents trained on the Towers environment under different conditions. The y-axis shows the probability of the agent producing the correct answer before the episode times out. The different plots show different pairings of observations and actuators as indicated in the plot titles.

We show a representative collection of learning curves in Figure 5.

In all cases we obtain agents that solve the task nearly perfectly, although when training from pixels we find that the range of hyperparameters which train successfully is narrower than when training from features. Interestingly, the fist actuator leads to the fastest learning, in spite of the fact that the agent must manipulate the blocks indirectly through the fist. One possible explanation is that the fist can affect multiple blocks in one action step, whereas with direct actuation only one block can be affected per time step.

Waiting for information For this experiment we trained an agent with feature observations and the fist actuator on the Towers task with an agent time step of 0.1 seconds, and examine its behavior at test time with a smaller delay between actions. Reducing the agent time step means that from the agent's perspective time has been slowed down. Moving its fist a fixed distance takes longer, as does waiting for the block tower to collapse once it has been hit.

After training, the agent was run for 50 episodes for each of several different agent time steps. We record the outcome of each episode, as well as the number of steps taken by the agent before it chooses a label. None of the test episodes terminate by timeout, so we include all of them in the analysis.

The plot in Figure 4 shows the probability of answering correctly, as well as the median length of each episode measured in seconds. In terms of absolute performance we see a small drop compared to the training setting, where the agent is essentially perfect, but there is not a strong correlation between performance and time step over the range we tested.

We also observe that episodes with different time steps take approximately the same amount of real time across the majority of the tested range. This corresponds to a large change in episode length as measured by the number of agent actions, since with an agent time step of 0.01 the agent must execute 10x as many actions to cover the same amount of time as with the agent time step used during training. From this we can infer that the agent has learned to wait for an informative observation before producing a label, as opposed to a simpler degenerate strategy of waiting a fixed number of steps before answering.

6 RELATED WORK

Deep learning techniques in conjunction with vast labeled datasets have yielded powerful models for image (Krizhevsky et al., 2012; He et al., 2016) and speech recognition (Hinton et al., 2012). In recent years, as we have approached human level performance on these tasks, there has been a strong interest in the computer vision field in moving beyond semantic classification, to tasks that require a deeper and more nuanced understanding of the world.

Inspired by developmental studies (Smith & Gasser, 2005), some recent works have focused on learning representations by predicting physical embodiment quantities such as ego-motion (Agrawal et al., 2015; Jayaraman & Grauman, 2015), instead of symbolic labels.


Extending the realm of things-to-be-predicted to include quantities beyond class labels, such as viewer-centric parameters (Doersch et al., 2015) or the poses of humans within a scene (Delaitre et al., 2012; Fouhey et al., 2014), has been shown to improve the quality of feature learning and scene understanding. Researchers have also looked at cross-modal learning, for example synthesizing sounds from visual images (Owens et al., 2015), using summary statistics of audio to learn features for object recognition (Owens et al., 2016), or by image colorization (Zhang et al., 2016).

Inverting the prediction tower, another line of work has focused on learning about the visual world by synthesizing, rather than analyzing, images. Major cornerstones of recent work in this area include the Variational Autoencoders of Kingma & Welling (2014) and the Generative Adversarial Networks of Goodfellow et al. (2014); more recently, autoregressive models have also been very successful (van den Oord et al., 2016).

Building on models of single image synthesis, there have been many works on predicting the evolution of video frames over time (Ranzato et al., 2014; Srivastava et al., 2015; van den Oord et al., 2016). Xue et al. (2016) have approached this problem by designing a variational autoencoder architecture that uses the latent stochastic units of the VAE to make choices about the direction of motion of objects, and generates future frames conditioned on these choices.

A different form of uncertainty in video prediction can arise from the effect of actions taken by an agent. In environments with deterministic dynamics (where the possibility of “known unknowns” can, in principle, be eliminated), very accurate action-conditional predictions of future frames can be made (Oh et al., 2015). Introducing actions into the prediction process amounts to learning a latent forward dynamics model, which can be exploited to plan actions to achieve novel goals (Watter et al., 2015; Assael et al., 2015; Fragkiadaki et al., 2016). In these works, frame synthesis plays the role of a regularizer, preventing collapse of the feature space where the dynamics model lives.

Agrawal et al. (2016) break the dependency between frame synthesis and dynamics learning by replacing frame synthesis with an inverse dynamics model. The forward model plays the same role as in the earlier works, but in this work feature space collapse is prevented by ensuring that the model can decode actions from pairs of time-adjacent images. Several works, including Agrawal et al. (2016) and Assael et al. (2015) mentioned above, but also Pinto et al. (2016); Pinto & Gupta (2016); Levine et al. (2016), have gone further in coupling feature learning and dynamics. The learned dynamics models can be used for control not only after learning but also during the learning process in order to collect data in a more targeted way, which has been shown to improve the speed and quality of learning in robot manipulation tasks.

A key challenge of learning from dynamics is collecting the appropriate data. An ingenious solution to this is to import real world data into a physics engine and simulate the application of forces in order to generate ground truth data. This is the approach taken by Mottaghi et al. (2016), who generate an “interactable” data set of scenes, which they use to generate a static data set of image and force pairs, along with the ground truth trajectory of a target object in response to the application of the indicated force.

When the purpose is learning an intuitive understanding of dynamics, it is possible to do interesting work with entirely synthetic data (Fragkiadaki et al., 2016; Lerer et al., 2016). Lerer et al. (2016) show that convolutional networks can learn to make judgments about the stability of synthetic block towers based on a single image of the tower. They also show that their model trained on synthetic data is able to generalize to make accurate judgments about photographs of similar block towers built in the real world.

Making intuitive judgments about block towers has been extensively studied in the psychophysics literature. There is substantial evidence connecting the behavior of human judgments to inference over an explicit latent physics model (Hegarty, 2004; Hamrick et al., 2011; Battaglia et al., 2013). Humans can infer mass by watching movies of complex rigid body dynamics (Hamrick et al., 2016).

A major component of the above line of work is analysis by synthesis, in which understanding of a physical process is obtained by learning to invert it. Observations are assumed to be generated from an explicitly parameterized generative model of the true physical process, and provide constraints to an inference process run over the parameters of this process. The analysis by synthesis approach has been extremely influential due to its power to explain human judgments and generalization patterns in a variety of situations (Lake et al., 2015).


In terms of understanding of dynamics, Galileo (Wu et al., 2015) is a particularly relevant instance of tying together analysis by synthesis and deep learning. This system first infers the physical parameters (mass and friction coefficient) of a variety of blocks by watching videos of them sliding down slopes and colliding with other blocks. This stage of the system uses an off-the-shelf object tracker to ground inference over the parameters of a physical system, and the inference is achieved by matching simulated and observed block trajectories. These inferred physical parameters are used to train a deep network to predict the physical parameters from the initial frame of video. At test time the system is evaluated by using the deep network to infer physical parameters of new block videos, which can be fed into the physics engine and used to answer questions about behaviors not observed at training time.

Physics 101 (Wu et al., 2016) is an extension of Galileo that more fully embraces deep learning. Instead of using a first pass of analysis by synthesis to infer physical parameters based on observations, a deep network is trained to regress the observations directly, and the relevant physical laws are encoded directly into the architecture of the model. They show that they can use latent intrinsic physical properties inferred in this way to make novel predictions. The approach of encoding physical models as architecture constraints has also been proposed by Stewart & Ermon (2016).

Most of the works discussed thus far, including Galileo and Physics 101, are restricted to passive sensing. Pinto et al. (2016); Pinto & Gupta (2016); Agrawal et al. (2016); Levine et al. (2016) are exceptions to this because they learn their models using a sequential greedy data collection bootstrapping strategy. Active sensing appears to be an important aspect of visual object learning in toddlers, as argued by Bambach et al. (2016), providing motivation for the approach presented here.

In computer vision, it is also well known that recognition performance can be improved by moving so as to acquire new views of an object or scene. There is a vast literature on active vision; see the related work section of Jayaraman & Grauman (2016). That particular work applies deep reinforcement learning to construct an agent that chooses how to acquire new views of an object so as to classify it into a semantic category. While this paper and others share deep reinforcement learning and active sensing in common with our work, their goal is to learn a policy that can be applied to images to make decisions based on vision. In contrast, the goal in this paper is to study how agents learn to experiment continually so as to learn representations to answer questions about intrinsic properties of objects. In particular, the focus is on tasks that can only be solved by interaction and not by vision.

7 CONCLUSION

Despite recent advances in artificial intelligence, machines still lack a common sense understanding of our physical world. There has been impressive progress in recognizing objects, segmenting object boundaries and even describing visual scenes with natural language. However, these tasks are not enough for machines to infer physical properties of objects such as mass, friction or deformability. We introduce a deep reinforcement learning agent which actively interacts with physical objects to infer their hidden properties. Our approach is inspired by findings from the developmental psychology literature indicating that infants spend a lot of their early time experimenting with objects through random exploration (Smith & Gasser, 2005; Gopnik, 2012; Spelke & Kinzler, 2007). By letting our agents conduct physical experiments in an interactive simulated environment, they learn to manipulate objects and observe the consequences to infer hidden object properties. We demonstrate the efficacy of our approach on two important physical understanding tasks: inferring mass and counting the number of objects under strong visual ambiguities. Our empirical findings suggest that our agents learn different strategies for these tasks that balance the cost of gathering information against the cost of making mistakes in different situations.

ACKNOWLEDGMENTS

We would like to thank Matt Hoffman for several enlightening discussions about bandits.

REFERENCES

Pulkit Agrawal, Joao Carreira, and Jitendra Malik. Learning to see by moving. In IEEE International Conference on Computer Vision, pp. 37–45, 2015.


Pulkit Agrawal, Ashvin Nair, Pieter Abbeel, and Jitendra Malik. Learning to poke by poking: Experiential learning of intuitive physics. In Neural Information Processing Systems, 2016.

John-Alexander M Assael, Niklas Wahlstrom, Thomas B Schon, and Marc Peter Deisenroth. Data-efficient learning of feedback policies from image pixels using deep dynamical models. arXiv preprint arXiv:1510.02173, 2015.

Jean-Yves Audibert and Sebastien Bubeck. Best arm identification in multi-armed bandits. In Conference on Learning Theory, 2010.

Sven Bambach, David J Crandall, Linda B Smith, and Chen Yu. Active viewing in toddlers facilitates visual object learning: An egocentric vision approach. CogSci, 2016.

Peter W Battaglia, Jessica B Hamrick, and Joshua B Tenenbaum. Simulation as an engine of physical scene understanding. Proceedings of the National Academy of Sciences, 110(45):18327–18332, 2013.

Vincent Delaitre, David F Fouhey, Ivan Laptev, Josef Sivic, Abhinav Gupta, and Alexei A Efros. Scene semantics from long-term observation of people. In European Conference on Computer Vision, pp. 284–298. Springer, 2012.

Carl Doersch, Abhinav Gupta, and Alexei A Efros. Unsupervised visual representation learning by context prediction. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1422–1430, 2015.

David F Fouhey, Vincent Delaitre, Abhinav Gupta, Alexei A Efros, Ivan Laptev, and Josef Sivic. People watching: Human actions as a cue for single view geometry. International Journal of Computer Vision, 110(3):259–274, 2014.

Katerina Fragkiadaki, Pulkit Agrawal, Sergey Levine, and Jitendra Malik. Learning visual predictive models of physics for playing billiards. ICLR, 2016.

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680, 2014.

Alison Gopnik. Scientific thinking in young children: Theoretical advances, empirical research, and policy implications. Science, 337(6102):1623–1627, 2012.

Jessica Hamrick, Peter Battaglia, and Joshua B Tenenbaum. Internal physics models guide probabilistic judgments about object dynamics. In Proceedings of the 33rd Annual Conference of the Cognitive Science Society, pp. 1545–1550. Cognitive Science Society, Austin, TX, 2011.

Jessica B Hamrick, Peter W Battaglia, Thomas L Griffiths, and Joshua B Tenenbaum. Inferring mass in complex scenes by mental simulation. Cognition, 157:61–76, 2016.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Conference on Computer Vision and Pattern Recognition, 2016.

Mary Hegarty. Mechanical reasoning by mental simulation. Trends in Cognitive Sciences, 8(6):280–5, 2004.

Geoffrey E. Hinton, Li Deng, Dong Yu, George E. Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N. Sainath, and Brian Kingsbury. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6):82–97, 2012.

Dinesh Jayaraman and Kristen Grauman. Learning image representations tied to ego-motion. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1413–1421, 2015.

Dinesh Jayaraman and Kristen Grauman. Look-ahead before you leap: End-to-end active recognition by forecasting the effect of motion. In European Conference on Computer Vision, pp. 489–505, 2016.

Diederik P Kingma and Max Welling. Auto-encoding variational bayes. In International Conference on Learning Representations, 2014.

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105, 2012.

Brenden M Lake, Ruslan Salakhutdinov, and Joshua B Tenenbaum. Human-level concept learning through probabilistic program induction. Science, 350(6266):1332–1338, 2015.

Adam Lerer, Sam Gross, and Rob Fergus. Learning physical intuition of block towers by example. In International Conference on Machine Learning, pp. 430–438, 2016.

Sergey Levine, Peter Pastor, Alex Krizhevsky, and Deirdre Quillen. Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection. arXiv preprint arXiv:1603.02199, 2016.

Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy P Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. arXiv preprint arXiv:1602.01783, 2016.

Roozbeh Mottaghi, Mohammad Rastegari, Abhinav Gupta, and Ali Farhadi. “What happens if...” learning to predict the effect of forces in images. In European Conference on Computer Vision, pp. 269–285, 2016.


Junhyuk Oh, Xiaoxiao Guo, Honglak Lee, Richard Lewis, and Satinder Singh. Action-conditional video prediction using deep networks in Atari games. In Neural Information Processing Systems, pp. 2863–2871, 2015.

Andrew Owens, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H Adelson, and William T Freeman. Visually indicated sounds. arXiv preprint arXiv:1512.08512, 2015.

Andrew Owens, Jiajun Wu, Josh H McDermott, William T Freeman, and Antonio Torralba. Ambient sound provides supervision for visual learning. In European Conference on Computer Vision, pp. 801–816. Springer, 2016.

Lerrel Pinto and Abhinav Gupta. Supersizing self-supervision: Learning to grasp from 50k tries and 700 robot hours. In IEEE International Conference on Robotics and Automation, pp. 3406–3413, 2016.

Lerrel Pinto, Dhiraj Gandhi, Yuanfeng Han, Yong-Lae Park, and Abhinav Gupta. The curious robot: Learning visual representations via physical interactions. In European Conference on Computer Vision, pp. 3–18, 2016.

Marc'Aurelio Ranzato, Arthur Szlam, Joan Bruna, Michael Mathieu, Ronan Collobert, and Sumit Chopra. Video (language) modeling: a baseline for generative models of natural videos. arXiv preprint arXiv:1412.6604, 2014.

Linda Smith and Michael Gasser. The development of embodied cognition: Six lessons from babies. Artificial Life, 11(1-2):13–29, 2005.

Elizabeth S Spelke and Katherine D Kinzler. Core knowledge. Developmental Science, 10(1):89–96, 2007.

Nitish Srivastava, Elman Mansimov, and Ruslan Salakhutdinov. Unsupervised learning of video representations using LSTMs. CoRR, abs/1502.04681, 2015.

Russell Stewart and Stefano Ermon. Label-free supervision of neural networks with physics and domain knowledge. arXiv preprint arXiv:1609.05566, 2016.

Aaron van den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural networks. arXiv preprint arXiv:1601.06759, 2016.

Manuel Watter, Jost Springenberg, Joschka Boedecker, and Martin Riedmiller. Embed to control: A locally linear latent dynamics model for control from raw images. In Advances in Neural Information Processing Systems, pp. 2746–2754, 2015.

Jiajun Wu, Ilker Yildirim, Joseph J Lim, Bill Freeman, and Josh Tenenbaum. Galileo: Perceiving physical object properties by integrating a physics engine with deep learning. In Neural Information Processing Systems, 2015.

Jiajun Wu, Joseph J. Lim, Hongyi Zhang, Joshua B. Tenenbaum, and William T. Freeman. Physics 101: Learning physical object properties from unlabeled videos. In British Machine Vision Conference, 2016.

Tianfan Xue, Jiajun Wu, Katherine L. Bouman, and William T. Freeman. Visual dynamics: Probabilistic future frame synthesis via cross convolutional networks. arXiv preprint arXiv:1607.02586, 2016.

Richard Zhang, Phillip Isola, and Alexei A Efros. Colorful image colorization. arXiv preprint arXiv:1603.08511, 2016.
