
arXiv:2105.05145v1 [cs.RO] 11 May 2021

Visual Perspective Taking for Opponent Behavior Modeling

Boyuan Chen* Yuhang Hu Robert Kwiatkowski Shuran Song Hod Lipson
Columbia University

Abstract— In order to engage in complex social interaction, humans learn at a young age to infer what others see and cannot see from a different point-of-view, and learn to predict others’ plans and behaviors. These abilities have been mostly lacking in robots, sometimes making them appear awkward and socially inept. Here we propose an end-to-end long-term visual prediction framework for robots to begin to acquire both these critical cognitive skills, known as Visual Perspective Taking (VPT) and Theory of Behavior (TOB). We demonstrate our approach in the context of visual hide-and-seek, a game that represents a cognitive milestone in human development. Unlike traditional visual predictive models that generate new frames from immediate past frames, our agent can directly predict to multiple future timestamps (25 s), extrapolating by 175% beyond the training horizon. We suggest that visual behavior modeling and perspective taking skills will play a critical role in the ability of physical robots to fully integrate into real-world multi-agent activities. Our website is at http://www.cs.columbia.edu/~bchen/vpttob/.

I. INTRODUCTION

Imagine that you see your colleague trying to open the door, only to find that the door is blocked by the doorstop hidden on the other side. Without exchanging words, you rush to remove the doorstop, allowing the door to swing open. You took this action because you understood what your colleague was trying to do (open the door), and you understood that she could not see the doorstop (from her viewpoint). We refer to these intuitive abilities as Behavior Modeling and Visual Perspective Taking, respectively. Visual Perspective Taking [1]–[6] is essential for many useful multi-agent skills such as understanding social dynamics and relationships, engaging in social interactions, and understanding the intentions of others. A simple action such as getting out of the way of a busy person, or assisting a person about to sit down, involves both of these abilities.

Like humans, robots that can perform VPT and use that ability to model others [7] have many advantages across a variety of applications such as social robotics, assistive robotics, and service robotics. However, even though current robots can perform a variety of tasks by themselves or move as a group, they often do not demonstrate the capability of performing perspective taking or behavior modeling. Several challenges remain.

First, visual observations are high-dimensional and carry a large amount of information. Relying on manual feature extraction and hand-designed learning procedures [8] does not scale

We thank Philippe Wyder and Carl Vondrick for helpful discussions. We thank all the reviewers for their great comments and efforts. This research is supported by NSF NRI 1925157 and DARPA MTO grant HR0011-18-2-0020. *[email protected]

(Fig. 1 panels: Seeker; Hider; Hider’s View; What does the seeker see?)

Fig. 1: Visual Perspective Taking (VPT) refers to the ability of an agent to estimate other agents’ viewpoints from its own observation. With our VPT framework, the hider robot learns to predict the seeker robot’s future observation and plans its actions based on this prediction to avoid being captured.

to complex setups. Second, the behaviors of different robots are cross-dependent. Consequently, it is necessary to account for the mutual interaction [9] between related robots. Third, directly learning on physical robots requires strong data efficiency. Though recent methods [10]–[13] demonstrate that similar cognitive abilities can emerge from reward-driven learning, they often require millions of learning steps. Lastly, learning to predict the long-term future is very challenging. Most existing visual prediction frameworks [14]–[18] follow an iterative paradigm to roll out the prediction, which is both time- and memory-consuming for long-horizon tasks.

In this paper, we propose a self-supervised vision-based learning framework as a small step towards realizing VPT and TOB for navigation robots through future opponent modeling. We call our framework VPT-TOB. Our key idea is to explicitly predict the future perspective of the other robot as an image, conditioned on a top-down visual embedding of first-person RGB-D camera images and time-abstracted action embeddings. We further present a value prediction model that can leverage the future perspective prediction to evaluate action proposals. During test time, we ask the robot to propose future goals on the top-down map and generate potential action plans with its low-level controller. The robot can then use our VPT-TOB model to imagine future perspectives of the other robot and evaluate its goal proposals with the value prediction model. The robot can choose the safest position to navigate to.

Our experiments on a physical hide-and-seek task suggest that our method outperforms baseline algorithms which do not explicitly consider other agents’ perspectives and behaviors, showing the importance of explicit visual perspective taking and behavior modeling. Our hider robot exhibits diverse behaviors during hiding and an interpretable decision-making process, revealing how the robot envisions the future state and actions of another robot.

The primary contribution of this paper is to demonstrate that a long-term vision prediction framework can be used to model the perspectives and behaviors of other robots. We present a value prediction module that takes the perspective prediction to evaluate action proposals at test time. We perform several experiments and ablation studies in both real and simulated environments to evaluate the effectiveness of our design decisions.

II. RELATED WORK

We chose to demonstrate and evaluate the capacity of robots to perform perspective taking and behavior modeling in the game of hide-and-seek. Hide-and-seek has been a representative example [6], [19] in Cognitive Science and Psychology for studying children’s development of VPT skills. Hide-and-seek has also been suggested [20] as a promising framework to study VPT and Theory of Mind related abilities in animals. In order to demonstrate robots with VPT and TOB abilities, we build a simple visual embodied environment in which two robots play a similar hide-and-seek game.

In Embodied AI and Robotics, earlier works [8], [21] present a cognitive architecture for a robot to play hide and seek with humans by learning from hand-designed rules and feature extraction. Recently, Chen et al. [10] proposed a first-person visual hide-and-seek task and analyzed the behaviors that emerge under various environment conditions. Baker et al. [11] trained agents to play the game on low-dimensional observations and found that more complex tool-use behaviors can emerge. Weihs et al. [12] observed similar behaviors by training agents in a variant of the hiding task with objects. There is also a significant body of work [22]–[29] on multi-agent communication with carefully designed protocols. However, these works only operate in simulation environments with perfect sensory inputs. These approaches also require at least millions of learning steps, which makes them impractical for real-world robotic applications. Our work, on the other hand, trains the physical robot from scratch while also showing diverse learned behaviors.

Our work is also closely related to model-based robot learning from visual inputs. There has been significant recent research on learning to plan with Model Predictive Control (MPC) by first learning the dynamics from visual inputs, ranging from object manipulation [15], [30], [31] and locomotion [18], [32] to navigation [33], [34]. A typical setup is to train a model to predict future states from the current and past observations; then the same model or a separate model is used to estimate the state-action value for goal-driven planning or reward-driven policy learning. Our

work differs from these studies mainly in two aspects. First, instead of predicting the future in an auto-regressive manner, our method can directly envision the future at any step within the horizon with a one-step rollout. Second, our work focuses on modeling the opponent’s perspective and behavior, while these works study single-agent skill learning.

III. METHOD OVERVIEW

In this paper, we frame Visual Perspective Taking and Behavior Modeling as an action-conditioned visual predictive model. Our method encourages the learning robot to explicitly contemplate the future visual perspective and behavior of the other robot by outputting, as an image, where the other agent will be and what the other agent will see. Our hider robot can be trained directly on the physical setup. Once trained, the robot can answer “what if” questions visually: what will the other agent do, and what will its view become, if I choose to perform this action for some time?

Task Setup To investigate the effectiveness of this framework in opponent modeling, we consider a hide-and-seek setting [10] where a hider robot needs to avoid being captured by a seeker robot while navigating around an environment with various obstacles. In order to predict the future perspective of the seeker, the hider robot needs to infer the geometric view of the seeker, reason about the spatial layout of the environment, and form a model of the seeker’s policy. This task reflects several challenging aspects of real-world setups, such as occlusions in robot perception and cross-dependent actions among multiple robots.

We 3D printed two wheeled robots of size 110 mm by 130 mm operating on a 1.2 m by 1.2 m plane with three obstacles. Similar to Wu et al. [35], for a simple prototype we place fiducial markers [36] on the robots and track them with a top-down camera. However, our setup aims to represent what could be achieved with only the onboard RGB-D camera for SLAM. Therefore, the robots only perceive what can be captured by the first-person RGB-D camera with an 86° FoV, with noisy depth and color information and occlusions.

Both robots start facing each other at a random location in the environment and move at the same speed (10 mm/step). The robots can move forward or backward and rotate by 10° as the given primitives. As a first step towards grounding VPT on physical robots, we only focus on the stationary case where the seeker robot follows a heuristic expert policy, and leave the non-stationary scenario for future work. If the hider is visible through the seeker’s first-person camera, the seeker navigates towards it with A* [37]; if the seeker loses the hider from its camera view, it navigates to the last known position of the hider and then continuously explores the room until the hider appears again.
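As a rough illustration, the seeker’s heuristic could look like the following sketch. The helper functions `visible`, `a_star`, and `frontier_explore` are hypothetical placeholders for the perception, planning, and exploration modules, not part of the paper.

```python
def seeker_policy(seeker_pose, hider_pose, occupancy_map, memory):
    """Heuristic expert seeker: chase if visible, otherwise search (a sketch).

    `visible`, `a_star`, and `frontier_explore` are hypothetical helpers.
    `memory` is a dict persisting the last position where the hider was seen.
    """
    if visible(seeker_pose, hider_pose, occupancy_map):
        # Hider is in the seeker's first-person view: plan directly towards it.
        memory["last_seen"] = hider_pose
        return a_star(occupancy_map, seeker_pose, hider_pose)
    if memory.get("last_seen") is not None:
        # Otherwise, go to the last known hider position first.
        path = a_star(occupancy_map, seeker_pose, memory["last_seen"])
        if len(path) <= 1:                 # reached the last-seen spot
            memory["last_seen"] = None
        return path
    # No clue left: keep exploring the room until the hider reappears.
    return frontier_explore(occupancy_map, seeker_pose)
```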

A. State Representation

We represent the robot’s state observation as an RGB image. The image depicts a top-down visibility map, similar to the representations used in other mobile navigation and self-driving systems [35], [38]–[41].


(Fig. 2 panels: Physical Arena; Top-Down Projection; Hider’s View; Visual Perspective Taking (VPT-TOB); Predicted Seeker’s View; VPN; Predicted Value Map with scale 0 (Risky) to 1 (Safe); Selected Hiding Spot)

Fig. 2: Method Overview: our robot learns to infer the future view of the other robot with our VPT-TOB model based on its initial visual observation and future action embeddings (blue denotes the invisible area). With these anticipations and a value prediction model, the robot can produce a value map indicating the safety level (color bar) and select the best hiding spot.

We project the depth information from the first-person view to a 2D bird’s-eye view to acquire the top-down visibility map. Areas behind object boundaries are treated as free space, and we colorize the observation map to reflect this. The hider robot is always aware of its own position, while the seeker’s position is only available when it is visible through a color filter applied to the first-person view. The robots are plotted as circles with different colors. We denote the final visual state representations of the hider and the seeker at timestamp $t$ as $I_{H,t}$ and $I_{S,t}$.
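As an illustration, one way such a projection could be implemented is sketched below. The grid resolution, cell size, pose convention, and the use of a single central scanline are assumptions made for brevity; the simple ray-marching loop stands in for whatever projection the authors actually use.

```python
import numpy as np

def topdown_visibility_map(depth, fx, cx, pose, grid=128, cell=0.01):
    """Project one first-person depth scan into a top-down visibility map (a sketch).

    `depth` is an HxW array in meters, `fx`/`cx` the horizontal pinhole
    intrinsics, and `pose` a 3x3 planar homogeneous transform from camera to
    world. Cells along each ray up to the first depth return are marked
    visible; everything else stays unknown.
    """
    h, w = depth.shape
    vis = np.zeros((grid, grid), dtype=np.uint8)      # 0 = unknown, 1 = visible
    row = depth[h // 2]                               # use the central scanline only
    origin = pose[:2, 2]                              # robot position in the world plane
    for u, z in enumerate(row):
        if z <= 0:
            continue
        x = (u - cx) / fx * z                         # lateral offset in the camera frame
        hit = (pose @ np.array([x, z, 1.0]))[:2]      # world-frame hit point
        for a in np.linspace(0.0, 1.0, num=64):       # march along the ray to the hit
            q = origin + a * (hit - origin)
            i, j = int(q[0] / cell), int(q[1] / cell)
            if 0 <= i < grid and 0 <= j < grid:
                vis[i, j] = 1
    return vis
```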

B. Action Representation

We design the action representation with two key considerations in mind. First, we want to align the action representation with the state representation. Second, we need to preserve information about past action sequences. Thus, we propose to encode the robot’s trajectory sequence as two 2-channel images with the same dimensions as our top-down visibility map. Say we are interested in the action sequence from $t=0$ up to a future time $t=t_i$. The first image is a visitation map $F_{t=0:t_i}$ denoting how many times each point on the map will be visited by the robot. This is achieved through the second alpha channel, where a darker color represents more frequent visitation. The second image is a time-encoded trajectory map $T_{t=0:t_i}$ denoting the traversal order. Similar to $F$, the darker the color in $T$, the later the position will be traversed. Both $F$ and $T$ are necessary for a complete action representation to avoid the ambiguity brought by trajectory intersections. An example of a training data pair is shown in Fig. 3.
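A minimal sketch of this encoding is shown below; for simplicity it collapses each 2-channel image into a single intensity channel, and the grid resolution is an assumption.

```python
import numpy as np

def action_embedding(trajectory, grid=128):
    """Encode an action sequence as two image-like maps (a simplified sketch).

    `trajectory` is a list of (row, col) grid cells visited from t=0 to t=t_i.
    F counts how often each cell is visited (higher = more visits);
    T encodes traversal order (higher = visited later). Together they
    disambiguate self-intersecting paths.
    """
    F = np.zeros((grid, grid), dtype=np.float32)
    T = np.zeros((grid, grid), dtype=np.float32)
    n = max(len(trajectory) - 1, 1)
    for step, (r, c) in enumerate(trajectory):
        F[r, c] += 1.0                    # visitation count
        T[r, c] = step / n                # normalized traversal time
    F /= max(F.max(), 1.0)                # normalize visitation to [0, 1]
    return F, T
```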

C. Data Collection

We experiment with two types of exploration policies for the hider to collect the training data. The first is a random policy created by sampling from the given action primitives. However, this policy explores poorly, resulting in the seeker easily catching the hider. The hider mostly stays alive for only 20 to 30 steps, which biases the training data to be short-sighted. The second policy is human play data [42]. We asked a human subject to play as the hider for 100 trajectories in front of a display showing the first-person view. Since the human subject had never played this version of the game before, the trajectories are diverse, ranging from successful trials to failures, making this a good strategy to use.

(Fig. 3 panels: Observation; Action Embedding; Seeker Future View)

Fig. 3: An example of the training data pair for VPT-TOB.

During robot data collection, the hider robot randomly samples a human trajectory and uses the selected trajectory as-is, but from a different initial condition. It is also possible to train an imitation policy on the human data, but we find the current strategy sufficient to provide enough balanced data. A horizon terminates when one of two conditions is triggered: 1) the seeker robot catches the hider robot, or 2) the total number of steps reaches a maximum value (100 in our setup).

D. VPT-TOB Network

The objective of the VPT-TOB network $f_p$ is to anticipate the visual observation $I_{S,t=t_i}$ of the seeker at a future time step $t_i$ given its initial observation $I_{H,t=0}$ and action embedding $(F_{t=0:t_i}, T_{t=0:t_i})$. Intuitively, the hider robot first needs to anticipate where it will end up if it takes a certain action sequence. Second, the robot needs to model the behavior of the other robot so that it learns how the seeker will react. Finally, the hider needs to project the view using its prediction of the seeker agent, which requires an understanding of scene geometry and view projection. During training, the hider has access to the seeker’s view as a supervision signal. During testing, the hider has to predict the seeker’s views.

Network Architecture Our VPT-TOB model is a fully convolutional encoder-decoder network [7]. The encoder is an 8-layer fully convolutional network which takes the concatenated visual observation and action embedding as input with size 128 × 128. The decoder network comprises four transposed convolutional layers, each followed by another 2D convolution and a transposed convolution, producing a high-resolution output [43] of size 128 × 128.
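For concreteness, a rough PyTorch sketch of such an encoder-decoder is shown below. The channel widths, strides, and the use of a single transposed convolution per decoder block are our assumptions; the paper only fixes the layer counts and the 128 × 128 resolution.

```python
import torch
import torch.nn as nn

class VPTTOB(nn.Module):
    """Sketch of the VPT-TOB encoder-decoder (channel widths are assumptions)."""
    def __init__(self, in_ch=3 + 2):            # top-down observation + (F, T) action maps
        super().__init__()
        chs = [32, 32, 64, 64, 128, 128, 256, 256]
        enc, prev = [], in_ch
        for i, c in enumerate(chs):              # 8-layer fully convolutional encoder
            stride = 2 if i % 2 == 0 else 1      # downsample every other layer: 128 -> 8
            enc += [nn.Conv2d(prev, c, 3, stride, 1), nn.ReLU(inplace=True)]
            prev = c
        self.encoder = nn.Sequential(*enc)
        dec = []
        for c in [128, 64, 32, 16]:              # four upsampling blocks: 8 -> 128
            dec += [nn.ConvTranspose2d(prev, c, 4, 2, 1), nn.ReLU(inplace=True),
                    nn.Conv2d(c, c, 3, 1, 1), nn.ReLU(inplace=True)]
            prev = c
        dec += [nn.Conv2d(prev, 3, 3, 1, 1), nn.Sigmoid()]   # predicted seeker view (RGB)
        self.decoder = nn.Sequential(*dec)

    def forward(self, obs, F_map, T_map):
        x = torch.cat([obs, F_map, T_map], dim=1)
        return self.decoder(self.encoder(x))
```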

Training The VPT-TOB model can be supervised with a simple pixel-wise mean-squared error loss. We use 1,000 real-world trajectories for training and 200 trajectories for validation and testing. We use the Adam [44] optimizer with a learning rate of 0.001 and a batch size of 256. We decrease the learning rate by 10% at 25% and 65% of the training progress. Formally, the loss function can be written as:

$$\mathcal{L}_{\text{VPT-TOB}} = \mathrm{MSE}\big(f_p(I_{H,t=0}, F_{t=0:t_i}, T_{t=0:t_i}),\; I_{S,t=t_i}\big)$$
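In code, this training recipe amounts to roughly the following sketch, reusing the VPTTOB sketch above. The data loader, the number of epochs, and the tuple layout of each batch are assumed placeholders.

```python
import torch
import torch.nn.functional as F

def train_vpt_tob(model, loader, num_epochs=100):
    """Training sketch; `loader` yields (obs, F_map, T_map, seeker_future) batches."""
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    # "Decrease the learning rate by 10%" at 25% and 65% of training -> gamma = 0.9.
    sched = torch.optim.lr_scheduler.MultiStepLR(
        opt, milestones=[int(0.25 * num_epochs), int(0.65 * num_epochs)], gamma=0.9)
    for _ in range(num_epochs):
        for obs, F_map, T_map, seeker_future in loader:
            pred = model(obs, F_map, T_map)
            loss = F.mse_loss(pred, seeker_future)   # pixel-wise MSE supervision
            opt.zero_grad()
            loss.backward()
            opt.step()
        sched.step()
    return model
```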

E. Inference and Planning

At test time, the hider robot needs to propose a set of action plans and envision their possible outcomes with VPT-TOB to pick the best plan. One way to represent a plan is to specify a goal location and then perform visual MPC [15], [45] with the learned opponent model to find the best path. However, in our task setup, where no human-specified goal is available, a more reasonable approach is to propose a spatial location in the room and plan with a low-level controller policy (e.g., A*). This parameterization emphasizes finding a safer hiding location instead of a safer path to a given goal.

Value Estimation Map We therefore aim to generate a value map to evaluate the risk level of possible goal locations. A value map is a matrix where each element denotes the safety of choosing the corresponding position as the goal. The earlier the hider robot is caught on the way to a goal, the riskier that goal is. Formally, the value for each element of the value map is defined in the following accumulated form:

$$V_{i,t=0:T} = \sum_{k} v_{i,\,t=(kN-1)}, \qquad v = \begin{cases} 1, & \text{if hider is caught} \\ 0, & \text{otherwise} \end{cases}$$

where $k$ is an integer from 0 to $\lfloor T/N \rfloor$, $T$ is the maximum time budget for this horizon, and $N$ is the step size. Under this formulation, we discretize the environment as a grid map (121 endpoints for our setup). From the current hider position, the planned path from A* can be embedded every $N$ ($N=10$ for us) steps.
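A sketch of how the planning loop could assemble the value map is given below. The helpers `a_star` and `embed_actions`, and the callables `vpt_tob` and `vpn`, are hypothetical stand-ins for the components described above; no particular output convention of the value prediction network is assumed beyond a per-step scalar.

```python
import numpy as np

def predicted_value_map(obs, hider_pos, goals, occupancy, vpt_tob, vpn, N=10, T=100):
    """Score candidate goal locations by accumulating per-step VPN predictions.

    A sketch: `goals` is the list of discretized goal cells, `vpt_tob` and
    `vpn` are the trained models, and `a_star` / `embed_actions` are
    hypothetical helpers for the low-level planner and the (F, T) encoding.
    """
    values = np.zeros(len(goals))
    for g_idx, goal in enumerate(goals):
        path = a_star(occupancy, hider_pos, goal)           # low-level controller
        for k in range(T // N + 1):                         # checkpoints every N steps
            t_i = max(k * N - 1, 0)
            F_map, T_map = embed_actions(path[: t_i + 1])   # action embedding up to t_i
            seeker_view = vpt_tob(obs, F_map, T_map)        # predicted seeker perspective
            values[g_idx] += float(vpn(seeker_view))        # per-step value v
    return values
```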

Value Prediction Network (VPN) Our value prediction network $f_v$ takes in the visual prediction from VPT-TOB and outputs the binary value $v$ for each step. We can then use the formula above to compute the accumulated value along all the steps. Only a few forward passes through VPT-TOB and VPN are needed to obtain the entire value map, within 1.4 seconds on a single GPU.

Intuitively, the VPN estimates whether the hider will be caught given the predicted future view of the seeker. Our VPN consists of 3 convolutional layers followed by 2 fully-connected layers. We apply max-pooling after each convolution and a dropout layer right after the first fully-connected layer. The network is trained to minimize a cross-entropy loss. We augment the data by randomly rotating the images by (90°, 180°, 270°). In total, we have 2,000 images for training and 400 images each for validation and evaluation. We train our model for 100 epochs with a batch size of 256 and a learning rate of 0.0005. We decrease the learning rate by 10% at 20% and 50% of the training progress. We use Adam as the optimizer. Formally, the loss function can be written as:

$$\mathcal{L}_{\text{VPN}} = -\big(v \log(f_v(I_{S,t=t_i})) + (1-v)\log(1 - f_v(I_{S,t=t_i}))\big)$$
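A minimal PyTorch sketch of such a network is given below; the channel widths, kernel sizes, hidden size, and dropout rate are assumptions, since the paper specifies only the layer counts and the training hyperparameters.

```python
import torch
import torch.nn as nn

class VPN(nn.Module):
    """Sketch of the value prediction network (sizes are assumptions)."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # 128 -> 64
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 64 -> 32
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 32 -> 16
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 16 * 16, 128), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(128, 1),             # logit for the per-step binary value v
        )

    def forward(self, predicted_seeker_view):
        return self.classifier(self.features(predicted_seeker_view))

# Binary cross-entropy on the logit, matching the loss above.
criterion = nn.BCEWithLogitsLoss()
```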

IV. EXPERIMENTS

We first perform a series of experiments in our real-world setup. To evaluate the different components of our algorithm, we further replicate our physical setup in simulation and execute several ablation and baseline comparisons.

A. Evaluation Metrics

VPT-TOB and VPN A straightforward evaluation of the VPT-TOB model is to provide qualitative visualizations of the imagined frames. Since our goal with VPT-TOB is to capture the perspective information and opponent behavior for long-term planning, a pixel-level perceptual error metric is not ideal for our evaluation. Meanwhile, our VPN model needs to take in the predicted frame from VPT-TOB to infer whether the hider robot will be safe. Therefore, we can use the performance of VPN on the output of VPT-TOB as a quantitative metric for our VPT-TOB model.

Value Estimation Map To evaluate the overall performance of the entire planning pipeline, we quantify the accuracy of the entire value map. We measure this in simulation, which conveniently provides ground-truth values by moving the hider robot to all possible goal locations and collecting the accumulated values. At test time, what matters for decision making is the relative value between different goal locations rather than the absolute value. To this end, our metric for the value map is a relative ranking accuracy. Say we choose $g_i$ and $g_j$ as a pair to evaluate: the relative ranking of this pair is correct if it matches the relative ranking from the ground-truth value map. We evaluate all pairs at least 3 units apart. This metric yields the same ranking among all compared methods as the previous metric, but provides an overall evaluation of the planning pipeline.
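A sketch of this metric, under the assumption that the per-goal values and their grid coordinates are available as arrays, could look as follows.

```python
import numpy as np
from itertools import combinations

def relative_ranking_accuracy(pred_values, gt_values, positions, min_dist=3):
    """Fraction of goal pairs whose relative ordering matches the ground truth.

    `pred_values` / `gt_values` are per-goal scores, `positions` the goals'
    grid coordinates; only pairs at least `min_dist` units apart are scored.
    """
    correct, total = 0, 0
    for i, j in combinations(range(len(positions)), 2):
        dist = np.linalg.norm(np.asarray(positions[i]) - np.asarray(positions[j]))
        if dist < min_dist:
            continue                       # skip pairs that are too close to compare
        total += 1
        if np.sign(pred_values[i] - pred_values[j]) == np.sign(gt_values[i] - gt_values[j]):
            correct += 1
    return correct / max(total, 1)
```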

B. Real-World Results

The prediction result from our VPT-TOB model is a sequence of images based on a single initial observation from the hider’s view and the future action embedding. In Fig. 4, we show the predictions from the random hider robot every 20 steps up to 100 steps, corresponding to about 25 seconds. Each predicted image shows where the hider thinks the seeker robot will be and what the seeker’s view will look like. Our model gives accurate predictions about the perspective and behavior of the opponent robot.

For test-time planning, we randomly place the hider and seeker robots in the environment and ask the hider robot to plan and execute its goal selection from VPT-TOB and VPN. Fig. 5 shows video frames where the hider robot navigates to its selected goal and successfully avoids the seeker, relying only on its initial value map estimation for the future 100 time steps. There is no extra decision-making cost during execution. Interesting hiding behaviors emerged automatically from our setup, as shown in Fig. 5.

We further test the performance of our VPN model in the real-world setup by feeding it the predictions from the VPT-TOB model. Our VPN model achieves an accuracy of 88.39%.


(Fig. 4 panels: Physical Bird’s-Eye View (Reference); Seeker’s View (Prediction by Hider); Seeker’s View (Ground-truth); shown at the 1st, 20th, 40th, 60th, 80th, and 100th steps for two sequences)

Fig. 4: VPT-TOB predictions from physical robots: we show two example sequences. The predictions are shown as a function of time. Our hider robot accurately predicts the future perspectives of the seeker robot from only its own initial observation and a potential future action plan.

(Fig. 5 panels: Bird View; Hider’s View; Goal Sampling; Value Map; Selected Hiding Spot. (A) Hider robot chose to hide behind the obstacle, t = 20 to 100; (B) Hider robot chose to hide around the corner, t = 24 to 120; (C) Hider robot completely avoided the tracking of the seeker, t = 32 to 128, with t = 37 the last time the seeker sees the hider.)

Fig. 5: Planning and hiding: our hider robot produces a value map with our VPT-TOB and VPN networks (blue is safe and red is dangerous). The hider then navigates to the farthest location with the highest value (star mark). Interestingly, in the last example, the hider completely got rid of the seeker from the 37th time step.

As we will discuss in the next section, this is very similar to what we achieve in simulation, demonstrating the effectiveness of our method on the physical robot platform.

C. Simulation Analysis

Our simulation is built with Unity [46] to evaluate the different components of our approach. The simulation can run in parallel and hence provides an efficient and scalable testing environment. To provide extensive analysis, we use in total 4,500 trajectories for training and 500 trajectories each for validation and testing. We can also obtain the top-down observation directly by applying a mesh filter in Unity, with results similar to those from our real robot sensors. We run the hider robot for 200 steps.

D. Baseline Methods

Self-perspective predicts the hider’s own future states conditioned on its initial observation and action plans, rather than explicitly modeling the other robot. This follows recent work on learning visual world models [15], [16], [18], [30], [32], with adapted architectures for a fair comparison.

Coordinate-value directly predicts whether the hider robot will be caught in the future from the initial observation and action plans. The input consists of an initial state observation from the hider, a vector action embedding with spatial coordinate information, and the future coordinate where the hider will end up using the given actions.

Vector-action is the same as VPT-TOB but replaces our action representation with a vector embedding as in [47]. This embedding is concatenated with the intermediate features to produce the final predicted image.

The self-perspective baseline tests whether modeling the opponent agent is necessary. Both the self-perspective and coordinate-value baselines serve to examine whether explicit opponent modeling is more effective than relying on the network to learn the opponent model implicitly. Vector-action is used to evaluate whether our action representation improves the quality of the VPT-TOB model.

TABLE I: VPT-TOB and VPN Performance

Method | Success Rate on Held-out Examples
Self-perspective + VPN | 76.61%
Coordinate-value | 76.40%
Vector-action + VPN | 82.98%
VPT-TOB (ours) | 88.45%

E. Baseline Comparisons

Through our simulation comparisons, we show the accuracy of VPN across all applicable methods in Tab. I, in which the input of VPN comes from our VPT-TOB model or one of the baseline models.


(Fig. 6 panels: (A) VPT-TOB, (B) self-perspective, (C) vector-action; each showing the hider’s observation (input, t=1), the ground-truth future view, and the hider’s prediction at several future time steps)

Fig. 6: Predictions from VPT-TOB and other baselines. Our method produces more accurate modeling of the other robot’s future perspective, suggesting the advantage of explicit opponent modeling and our visual action embedding.

We run all the methods on 1,000 unseen examples with uniform starting locations and horizon lengths. The results suggest the advantage of our method. Both VPT-TOB and vector-action outperform self-perspective and coordinate-value, validating the necessity of explicitly modeling the opponent’s perspective and behavior. The gap between vector-action and VPT-TOB indicates that our action representation offers additional gains for learning an accurate visual opponent model.

Visualizations of the predicted frames substantiate our conclusion (Fig. 6). For better comparison, we also display the bird’s-eye view in the first row and the ground-truth seeker’s view in the last row. However, these images are only for visualization, not inference.

Fig. 7 shows the value maps generated by our approach compared with the ground-truth value maps.

(Fig. 7 panels: Bird’s-Eye View (t=0); Predicted Value Map; Ground-truth; shown for two examples)

Fig. 7: Predicted value maps versus ground-truth value maps. Our method produces accurate value estimations and reflects the overall value patterns.

Overall, our framework demonstrates an accurate estimation of the safety level across the room.

F. Generalization Across Training Time Horizon

With the relative ranking accuracy, we can quantify the performance of our approach from a higher-level perspective. Since our model should also learn temporal abstractions from our action representation, we extend the test horizon up to 175% of the maximum number of steps given in training and examine how the relative ranking accuracy changes. We achieve this test by simply projecting longer-horizon action trajectories onto our image-based action embedding.

Tab. II shows the performance. Our model is able to predict far beyond its training horizon with very small losses in accuracy. This suggests that our method also learns to model the temporal dynamics while modeling the other agent.

TABLE II: Generalization on Prediction Horizon

Prediction Steps | Relative Ranking Accuracy
200 (max seen during training) | 75.29 ± 12.23%
250 | 73.75 ± 10.35%
300 | 73.03 ± 10.40%
350 | 72.77 ± 10.82%

V. CONCLUSIONS AND FUTURE WORK

We propose a visual predictive framework for realizing Visual Perspective Taking and Behavior Modeling in a real-world robotic system. Our approach shows how to explicitly model the opponent’s view and use the predictions for long-term planning. Our work takes a step towards a general implementation of the VPT and TOB abilities on physical robots.

Our work studies VPT and TOB in a two-robot adversarial task. An interesting future direction is to explore generalization to more robots. Additionally, our current method assumes a fixed policy for the seeker robot. It would be a useful next step to incorporate scenarios where the seeker policy might change, for example by considering stochastic optimization with few-shot learning algorithms in an iterative learning manner. Finally, it would be interesting to study swarm robot [48], [49] applications to deepen our understanding of more complex social dynamics.


REFERENCES

[1] J. Piaget, Child's Conception of Space: Selected Works vol 4. Routledge, 2013, vol. 4.
[2] J. H. Flavell, "The development of knowledge about visual perception," in Nebraska Symposium on Motivation. University of Nebraska Press, 1977.
[3] J. H. Flavell, E. F. Flavell, F. L. Green, and S. A. Wilcox, "Young children's knowledge about visual perception: Effect of observer's distance from target on perceptual clarity of target," Developmental Psychology, vol. 16, no. 1, p. 10, 1980.
[4] N. Newcombe and J. Huttenlocher, "Children's early ability to solve perspective-taking problems," Developmental Psychology, vol. 28, no. 4, p. 635, 1992.
[5] H. Moll and A. N. Meltzoff, "How does it look? Level 2 perspective-taking at 36 months of age," Child Development, vol. 82, no. 2, pp. 661–673, 2011.
[6] A. Pearson, D. Ropar, A. F. d. Hamilton, et al., "A review of visual perspective taking in autism spectrum disorder," Frontiers in Human Neuroscience, vol. 7, p. 652, 2013.
[7] B. Chen, C. Vondrick, and H. Lipson, "Visual behavior modelling for robotic theory of mind," Scientific Reports, vol. 11, no. 1, pp. 1–14, 2021.
[8] J. G. Trafton, A. C. Schultz, D. Perznowski, M. D. Bugajska, W. Adams, N. L. Cassimatis, and D. P. Brock, "Children and robots learning to play hide and seek," in Proceedings of the 1st ACM SIGCHI/SIGART Conference on Human-Robot Interaction, 2006, pp. 242–249.
[9] S. V. Albrecht and P. Stone, "Autonomous agents modelling other agents: A comprehensive survey and open problems," Artificial Intelligence, vol. 258, pp. 66–95, 2018.
[10] B. Chen, S. Song, H. Lipson, and C. Vondrick, "Visual hide and seek," in Artificial Life Conference Proceedings. MIT Press, 2020, pp. 645–655.
[11] B. Baker, I. Kanitscheider, T. Markov, Y. Wu, G. Powell, B. McGrew, and I. Mordatch, "Emergent tool use from multi-agent autocurricula," 2020.
[12] L. Weihs, A. Kembhavi, W. Han, A. Herrasti, E. Kolve, D. Schwenk, R. Mottaghi, and A. Farhadi, "Artificial agents learn flexible visual representations by playing a hiding game," arXiv preprint arXiv:1912.08195, 2019.
[13] A. Labash, J. Aru, T. Matiisen, A. Tampuu, and R. Vicente, "Perspective taking in deep reinforcement learning agents," Frontiers in Computational Neuroscience, vol. 14, 2020.
[14] F. Leibfried, N. Kushman, and K. Hofmann, "A deep learning approach for joint video frame and reward prediction in Atari games," arXiv preprint arXiv:1611.07078, 2016.
[15] C. Finn and S. Levine, "Deep visual foresight for planning robot motion," in 2017 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2017, pp. 2786–2793.
[16] S. Racanière, T. Weber, D. Reichert, L. Buesing, A. Guez, D. J. Rezende, A. P. Badia, O. Vinyals, N. Heess, Y. Li, et al., "Imagination-augmented agents for deep reinforcement learning," in Advances in Neural Information Processing Systems, 2017, pp. 5690–5701.
[17] S. Chiappa, S. Racanière, D. Wierstra, and S. Mohamed, "Recurrent environment simulators," arXiv preprint arXiv:1704.02254, 2017.
[18] D. Hafner, T. Lillicrap, I. Fischer, R. Villegas, D. Ha, H. Lee, and J. Davidson, "Learning latent dynamics for planning from pixels," in International Conference on Machine Learning. PMLR, 2019, pp. 2555–2565.
[19] C. N. Street, W. F. Bischof, and A. Kingstone, "Perspective taking and theory of mind in hide and seek," Attention, Perception, & Psychophysics, vol. 80, no. 1, pp. 21–26, 2018.
[20] A. S. Reinhold, J. I. Sanguinetti-Scheck, K. Hartmann, and M. Brecht, "Behavioral and neural correlates of hide-and-seek in rats," Science, vol. 365, no. 6458, pp. 1180–1183, 2019.
[21] J. G. Trafton, N. L. Cassimatis, M. D. Bugajska, D. P. Brock, F. E. Mintz, and A. C. Schultz, "Enabling effective human-robot interaction using perspective-taking in robots," IEEE Transactions on Systems, Man, and Cybernetics, Part A: Systems and Humans, vol. 35, no. 4, pp. 460–470, 2005.
[22] J. Bratman, M. Shvartsman, R. L. Lewis, and S. Singh, "A new approach to exploring language emergence as boundedly optimal control in the face of environmental and cognitive constraints," in Proceedings of the 10th International Conference on Cognitive Modeling. Citeseer, 2010, pp. 7–12.
[23] J. Foerster, I. A. Assael, N. De Freitas, and S. Whiteson, "Learning to communicate with deep multi-agent reinforcement learning," in Advances in Neural Information Processing Systems, 2016, pp. 2137–2145.
[24] A. Lazaridou, A. Peysakhovich, and M. Baroni, "Multi-agent cooperation and the emergence of (natural) language," arXiv preprint arXiv:1612.07182, 2016.
[25] P. Grouchy, G. M. D'Eleuterio, M. H. Christiansen, and H. Lipson, "On the evolutionary origin of symbolic communication," Scientific Reports, vol. 6, no. 1, pp. 1–9, 2016.
[26] S. Sukhbaatar, R. Fergus, et al., "Learning multiagent communication with backpropagation," in Advances in Neural Information Processing Systems, 2016, pp. 2244–2252.
[27] I. Mordatch and P. Abbeel, "Emergence of grounded compositional language in multi-agent populations," arXiv preprint arXiv:1703.04908, 2017.
[28] K. Cao, A. Lazaridou, M. Lanctot, J. Z. Leibo, K. Tuyls, and S. Clark, "Emergent communication through negotiation," arXiv preprint arXiv:1804.03980, 2018.
[29] U. Jain, L. Weihs, E. Kolve, M. Rastegari, S. Lazebnik, A. Farhadi, A. G. Schwing, and A. Kembhavi, "Two body problem: Collaborative visual task completion," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 6689–6699.
[30] A. Byravan, F. Lceb, F. Meier, and D. Fox, "SE3-Pose-Nets: Structured deep dynamics models for visuomotor control," in 2018 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2018, pp. 1–8.
[31] R. Hoque, D. Seita, A. Balakrishna, A. Ganapathi, A. K. Tanwani, N. Jamali, K. Yamane, S. Iba, and K. Goldberg, "VisuoSpatial foresight for multi-step, multi-task fabric manipulation," arXiv preprint arXiv:2003.09044, 2020.
[32] D. Hafner, T. Lillicrap, J. Ba, and M. Norouzi, "Dream to control: Learning behaviors by latent imagination," arXiv preprint arXiv:1912.01603, 2019.
[33] D. Ha and J. Schmidhuber, "World models," arXiv preprint arXiv:1803.10122, 2018.
[34] N. Hirose, F. Xia, R. Martín-Martín, A. Sadeghian, and S. Savarese, "Deep visual MPC-policy learning for navigation," IEEE Robotics and Automation Letters, vol. 4, no. 4, pp. 3184–3191, 2019.
[35] J. Wu, X. Sun, A. Zeng, S. Song, J. Lee, S. Rusinkiewicz, and T. Funkhouser, "Spatial action maps for mobile manipulation," arXiv preprint arXiv:2004.09141, 2020.
[36] E. Olson, "AprilTag: A robust and flexible visual fiducial system," in 2011 IEEE International Conference on Robotics and Automation. IEEE, 2011, pp. 3400–3407.
[37] P. E. Hart, N. J. Nilsson, and B. Raphael, "A formal basis for the heuristic determination of minimum cost paths," IEEE Transactions on Systems Science and Cybernetics, vol. 4, no. 2, pp. 100–107, 1968.
[38] W. Gao, D. Hsu, W. S. Lee, S. Shen, and K. Subramanian, "Intention-Net: Integrating planning and deep learning for goal-directed autonomous navigation," arXiv preprint arXiv:1710.05627, 2017.
[39] X. Chen, A. Ghadirzadeh, J. Folkesson, M. Bjorkman, and P. Jensfelt, "Deep reinforcement learning to acquire navigation skills for wheel-legged robots in complex environments," in 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2018, pp. 3110–3116.
[40] P. Shah, M. Fiser, A. Faust, J. C. Kew, and D. Hakkani-Tur, "FollowNet: Robot navigation by following natural language directions with deep reinforcement learning," arXiv preprint arXiv:1805.06150, 2018.
[41] T. Bruls, H. Porav, L. Kunze, and P. Newman, "The right (angled) perspective: Improving the understanding of road scenes using boosted inverse perspective mapping," in 2019 IEEE Intelligent Vehicles Symposium (IV). IEEE, 2019, pp. 302–309.
[42] C. Lynch, M. Khansari, T. Xiao, V. Kumar, J. Tompson, S. Levine, and P. Sermanet, "Learning latent plans from play," in Conference on Robot Learning, 2020, pp. 1113–1132.
[43] A. Dosovitskiy, P. Fischer, E. Ilg, P. Hausser, C. Hazirbas, V. Golkov, P. Van Der Smagt, D. Cremers, and T. Brox, "FlowNet: Learning optical flow with convolutional networks," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 2758–2766.
[44] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[45] F. Ebert, C. Finn, S. Dasari, A. Xie, A. Lee, and S. Levine, "Visual foresight: Model-based deep reinforcement learning for vision-based robotic control," arXiv preprint arXiv:1812.00568, 2018.
[46] A. Juliani, V.-P. Berges, E. Vckay, Y. Gao, H. Henry, M. Mattar, and D. Lange, "Unity: A general platform for intelligent agents," arXiv preprint arXiv:1809.02627, 2018.
[47] L. Buesing, T. Weber, S. Racanière, S. Eslami, D. Rezende, D. P. Reichert, F. Viola, F. Besse, K. Gregor, D. Hassabis, et al., "Learning and querying fast generative models for reinforcement learning," arXiv preprint arXiv:1802.03006, 2018.
[48] S.-J. Chung, A. A. Paranjape, P. Dames, S. Shen, and V. Kumar, "A survey on aerial swarm robotics," IEEE Transactions on Robotics, vol. 34, no. 4, pp. 837–855, 2018.
[49] G. GS, A. Arockia, and V. Berlin, "A survey on swarm robotic modeling, analysis and hardware architecture," Procedia Computer Science, vol. 133, pp. 478–485, 2018.