
Perspective Taking: An Organizing Principle for Learning in Human-Robot Interaction

Matt Berlin, Jesse Gray, Andrea L. Thomaz, Cynthia Breazeal
MIT Media Lab
20 Ames St., Cambridge, MA 02139
{mattb, jg, alockerd, cynthiab}@media.mit.edu

Abstract

The ability to interpret demonstrations from the perspective of the teacher plays a critical role in human learning. Robotic systems that aim to learn effectively from human teachers must similarly be able to engage in perspective taking. We present an integrated architecture wherein the robot's cognitive functionality is organized around the ability to understand the environment from the perspective of a social partner as well as its own. The performance of this architecture on a set of learning tasks is evaluated against human data derived from a novel study examining the importance of perspective taking in human learning. Perspective taking, both in humans and in our architecture, focuses the agent's attention on the subset of the problem space that is important to the teacher. This constrained attention allows the agent to overcome ambiguity and incompleteness that can often be present in human demonstrations and thus learn what the teacher intends to teach.

Introduction

This paper addresses an important issue in building robots that can successfully learn from demonstrations that are provided by "naïve" human teachers who do not have expertise in the learning algorithms used by the robot. As a result, the teacher may provide sensible demonstrations from a human's perspective; however, these same demonstrations may be insufficient, incomplete, ambiguous, or otherwise "flawed" in terms of providing a correct and sufficiently complete training set for the learning algorithm to generalize properly.

To address this issue, we believe that socially situated robots will need to be designed as socially cognitive learners that can infer the intention of the human's instruction, even if the teacher's demonstrations are less than perfect for the robot. Our approach to endowing machines with socially-cognitive learning abilities is inspired by leading psychological theories and recent neuroscientific evidence for how human brains might infer the mental states of others. Specifically, Simulation Theory holds that certain parts of the brain have dual use; they are used not only to generate behavior and mental states, but also to predict and infer the same in others (Davies & Stone 1995; Barsalou et al. 2003; Sebanz, Bekkering, & Knoblich 2006).

Copyright © 2006, American Association for Artificial Intelligence (www.aaai.org). All rights reserved.

Figure 1: The Leonardo robot and graphical simulator

In this paper, we present an integrated architecture wherein the robot's cognitive functionality is organized around the ability to understand the environment from the perspective of the teacher using simulation-theoretic mechanisms. Perspective taking focuses the agent's attention on the subset of the problem space that is important to the teacher. Focusing on a subset of the input/problem space directly affects the set of hypotheses entertained by the learning algorithm, and thus directly affects the skill transferred to the agent via the interaction with the teacher. This constrained attention allows the agent to overcome ambiguity and incompleteness that can often be present in human demonstrations.

The outline of this paper is as follows. First, we present an overview of our integrated architecture, which runs on a 65-degree-of-freedom humanoid robot and its graphical simulator (Fig. 1). We then detail the perceptual-belief pipeline, describe our tutelage-inspired approach to task learning, and present how perspective taking integrates with these mechanisms. This architectural integration allows the robot to infer the goal and belief states of the teacher, and thus to more accurately model the intent of their demonstrations. Finally, we present the results of a novel study examining the importance of perspective taking in human learning. Data derived from the suite of learning tasks in this study are used to create a benchmark suite to evaluate the performance of our architecture and to illustrate the robot's ability to learn in a human-compatible way from ambiguous demonstrations presented by a human teacher.


Figure 2: System Architecture Overview

Architecture Overview

Our architecture, based on (Blumberg et al. 2002), incorporates simulation-theoretic mechanisms as a foundational and organizational principle to support collaborative forms of human-robot interaction, such as the tutelage-based learning examined in this paper. An overview of the architecture is shown in Figure 2. Our implementation enables a humanoid robot to monitor an adjacent human teacher by simulating his or her behavior within the robot's own generative mechanisms on the motor, goal-directed action, and perceptual-belief levels (Gray et al. 2005).

While others have identified that visual perspective taking coupled with spatial reasoning is critical for human-robot collaboration on a shared task within a physical space (Trafton et al. 2005), and collaborative dialog systems have investigated the role of plan recognition in identifying and resolving misconceptions (see (Carberry 2001) for a review), this is the first work to examine the role of perspective taking for introceptive states (e.g., beliefs and goals) in a human-robot learning task.

Belief Modeling

In order to convey how the robot interprets the environment from the teacher's perspective, we must first describe how the robot understands the world from its own perspective. This section presents a technical description of two important components of our cognitive architecture: the Perception System and the Belief System. The Perception System is responsible for extracting perceptual features from raw sensory information, while the Belief System is responsible for integrating this information into discrete object representations. The Belief System represents our approach to sensor fusion, object tracking and persistence, and short-term memory.

On every time step, the robot receives a set of sensory observations O = {o1, o2, ..., oN} from its various sensory processes. Information is extracted from these observations by the Perception System. The Perception System consists of a set of percepts P = {p1, p2, ..., pK}, where each p ∈ P is a classification function defined such that

p(o) = (m, c, d), (1)

where m, c ∈ [0, 1] are match and confidence values and d is an optional derived feature value. For each observation oi ∈ O, the Perception System produces a percept snapshot

si = {(p, m, c, d) | p ∈ P, p(oi) = (m, c, d), m ∗ c > k}, (2)

where k ∈ [0, 1] is a threshold value, typically 0.5.

These snapshots are then clustered into discrete object representations called beliefs by the Belief System. This clustering is typically based on the spatial relationships between the various observations, in conjunction with other metrics of similarity. The Belief System maintains a set of beliefs B, where each belief b ∈ B is a set mapping percepts to history functions: b = {(px, hx), (py, hy), ...}. For each (p, h) ∈ b, h is a history function defined such that

h(t) = (m′_t, c′_t, d′_t) (3)

represents the "remembered" evaluation for percept p at time t. History functions may be lossless, but they are often implemented using compression schemes such as low-pass filtering or logarithmic timescale memory structures.
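To make these definitions concrete, the following Python sketch (ours, not the authors' implementation; the names Percept and make_snapshot are illustrative) evaluates a bank of percept classifiers against one observation and keeps only entries whose match-confidence product exceeds the threshold k, per Equations 1 and 2.

```python
from typing import Any, Callable, List, Optional, Tuple

# A percept is a classification function p(o) = (m, c, d): match m and
# confidence c in [0, 1], plus an optional derived feature value d (Eq. 1).
Percept = Callable[[Any], Tuple[float, float, Optional[Any]]]

def make_snapshot(observation: Any, percepts: List[Percept], k: float = 0.5):
    """Build the percept snapshot s_i for one observation (Eq. 2)."""
    snapshot = []
    for p in percepts:
        m, c, d = p(observation)
        if m * c > k:                      # keep only confident matches
            snapshot.append((p, m, c, d))
    return snapshot

# Toy example: a percept that fires on red objects and derives their color.
def red_percept(obs: dict):
    m = 1.0 if obs.get("color") == "red" else 0.0
    return m, 0.9, obs.get("color")

print(make_snapshot({"color": "red", "pos": (0.3, 1.2)}, [red_percept]))
```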

A Belief System is fully described by the tuple (B, G, M, d, q, w, c), where

• B is the current set of beliefs,

• G is a generator function map, G : P → G, where each g ∈ G is a history generator function such that g(m, c, d) = h is a history function as above,

• M is the belief merge function, where M(b1, b2) = b′ represents the "merge" of the history information contained within b1 and b2,

• d = d1, d2, ..., dL is a vector of belief distance functions, di : B × B → R,

• q = q1, q2, ..., qL is a vector of indicator functions, where each element qi denotes the applicability of di, qi : B × B → {0, 1},

• w = w1, w2, ..., wL is a vector of weights, wi ∈ R, and

• c = c1, c2, ..., cJ is a vector of culling functions, cj : B × B → {0, 1}.

Using the above, we define the Belief Distance Function, D, and the Belief Culling Function, C:

D(b1, b2) = Σ_{i=1}^{L} wi qi(b1, b2) di(b1, b2) (4)

C(b) = Π_{j=1}^{J} cj(b) (5)

The Belief System manages three key processes: creating new beliefs from incoming percept snapshots, merging these new beliefs into existing beliefs, and culling stale beliefs. For the first of these processes, we define the function N, which creates a new belief bi from a percept snapshot si:

bi = N(si) = {(p, h) | (p, m, c, d) ∈ si, g = G(p), h = g(m, c, d)} (6)

For the second process, the Belief System merges new beliefs into existing ones by clustering proximal beliefs, which are assumed to represent different observations of the same object. This is accomplished via bottom-up, agglomerative clustering as follows.

For a set of beliefs B:


1: while ∃ bx, by ∈ B such that D(bx, by) < thresh do
2:     find b1, b2 ∈ B such that D(b1, b2) is minimal
3:     B ← B ∪ {M(b1, b2)} \ {b1, b2}
4: end while

Finally, the Belief System culls stale beliefs by removing all beliefs from the current set for which C(b) = 1.
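The merge-and-cull cycle can be sketched in Python as follows. This is an illustrative reimplementation, not the system's code: the distance and merge arguments stand in for D and M (which the architecture composes from the weighted terms of Equation 4), and is_stale stands in for the culling function C.

```python
import math

def merge_beliefs(beliefs, distance, merge, threshold):
    """Bottom-up agglomerative clustering of beliefs (the while-loop above)."""
    beliefs = list(beliefs)
    while len(beliefs) > 1:
        # Find the closest pair of beliefs under the distance function D.
        pairs = [(distance(beliefs[i], beliefs[j]), i, j)
                 for i in range(len(beliefs)) for j in range(i + 1, len(beliefs))]
        best, i, j = min(pairs)
        if best >= threshold:
            break
        # Replace the pair with their merge M(b1, b2).
        merged = merge(beliefs[i], beliefs[j])
        beliefs = [b for k, b in enumerate(beliefs) if k not in (i, j)]
        beliefs.append(merged)
    return beliefs

def cull_beliefs(beliefs, is_stale):
    """Remove stale beliefs, i.e. those for which the culling function C(b) = 1."""
    return [b for b in beliefs if not is_stale(b)]

# Toy usage: beliefs carry a 2D position; distance is Euclidean, merge averages.
dist = lambda a, b: math.dist(a["pos"], b["pos"])
avg = lambda a, b: {"pos": tuple((x + y) / 2 for x, y in zip(a["pos"], b["pos"]))}
print(merge_beliefs([{"pos": (0, 0)}, {"pos": (0.2, 0)}, {"pos": (5, 5)}], dist, avg, 1.0))
```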

Task and Goal Learning

We believe that flexible, goal-oriented, hierarchical task learning is imperative for learning in a collaborative setting from a human partner, due to the human's propensity to communicate in goal-oriented and intentional terms. Hence, we have a hierarchical, goal-oriented task representation, wherein a task is represented by a set, S, of schema hypotheses: one primary hypothesis and n others. A schema hypothesis has x executables, E (each either a primitive action a or another schema), a goal, G, and a tally, c, of how many seen examples have been consistent with this hypothesis.

Goals for actions and schemas are a set of y goal beliefs about what must hold true in order to consider this schema or action achieved. A goal belief represents a desired change during the action or schema by grouping a belief's percepts into i criteria percepts (indicating features that hold constant over the action or schema) and j expectation percepts (indicating an expected feature change). This yields straightforward goal evaluation during execution: for each goal belief, all objects with the criteria features must match the expectation features.

Schema Representation:

S = {[(E1...Ex), G, c]P, [(E1...Ex), G, c]1...n}
E = a | S
G = {B1...By}
B = pC1...pCi ∪ pE1...pEj
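A compact Python rendering of this representation (our illustrative sketch; the field names are assumptions, not the authors' identifiers):

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union

@dataclass
class GoalBelief:
    """One goal belief B: criteria percepts that must hold constant,
    plus expectation percepts describing the desired feature change."""
    criteria: Dict[str, object]      # e.g. {"color": "red"}
    expectation: Dict[str, object]   # e.g. {"hole": "filled"}

@dataclass
class SchemaHypothesis:
    """[(E_1 ... E_x), G, c]: executables, a goal, and a consistency tally."""
    executables: List[Union[str, "Schema"]]   # primitive actions or nested schemas
    goal: List[GoalBelief]
    tally: int = 1

@dataclass
class Schema:
    """A task: one primary schema hypothesis plus n alternative hypotheses."""
    primary: SchemaHypothesis
    others: List[SchemaHypothesis] = field(default_factory=list)
```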

For the purpose of task learning, the robot can take a snapshot of the world (i.e., the state of the Belief System) at time t, Snp(t), in order to later reason about world state changes. Learning is mixed-initiative, such that the robot pays attention to both its own and its partner's actions during a learning episode. When the learning process begins, the robot creates a new schema representation, S, and saves belief snapshot Snp(t0). From time t0 until the human indicates that the task is finished, tend, if either the robot or the human completes an action, act, the robot makes an action representation, a = [act, G], for S:

1: For action act at time tb, given last action at ta
2:     G = belief changes from Snp(ta) to Snp(tb)
3:     append [act, G] to executables of S
4:     ta = tb

At time tend, this same process works to infer the goal for the schema, S, making the goal inference from the differences in Snp(t0) and Snp(tend). The goal inference mechanism notes all changes that occurred over the task; however, there may still be ambiguity around which aspects of the state change are the goal (the change to an object, a class of objects, the whole world state, etc.). Our approach uses hypothesis testing coupled with human interaction to disambiguate the overall task goal over a few examples.
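The snapshot-diff step can be sketched as follows, assuming (as a simplification of the percept/history structure above) that a snapshot maps object identifiers to flat feature dictionaries; features that persist become criteria and features that change become expectations:

```python
def belief_changes(snap_before, snap_after):
    """Infer goal beliefs as per-object feature changes between two snapshots."""
    goals = []
    for obj_id, before in snap_before.items():
        after = snap_after.get(obj_id, {})
        criteria = {k: v for k, v in before.items() if after.get(k) == v}
        expectation = {k: v for k, v in after.items() if before.get(k) != v}
        if expectation:                      # only objects that actually changed
            goals.append({"object": obj_id,
                          "criteria": criteria,
                          "expectation": expectation})
    return goals

# Toy usage: a red block's hole gets filled between Snp(t0) and Snp(t_end).
before = {"block3": {"color": "red", "hole": "empty"}}
after = {"block3": {"color": "red", "hole": "filled"}}
print(belief_changes(before, after))
```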

Figure 3: Architecture for modeling the human's beliefs re-uses the robot's own architecture for belief maintenance.

Once the human indicates that the current task is done, S contains the representation of the seen example ([(E1...Ex), G, 1]). The system uses S to expand other hypotheses about the desired goal state, yielding hypotheses of all goal representations, G, consistent with the current demonstration (for details of this expansion process see (Lockerd & Breazeal 2004); to accommodate the tasks described here we additionally expand hypotheses whose goal is a state change across a simple disjunction of object classes). The current best schema candidate (the primary hypothesis) is chosen through a Bayesian likelihood method: P(h|D) ∝ P(D|h)P(h). The data, D, is the set of all examples seen for this task. P(D|h) is the percentage of the examples in which the state change seen in the example is consistent with the goal representation in h. For priors, P(h), hypotheses whose goal states apply to the broadest object classes with the most specific class descriptions are preferred (determined by number of classes and criteria/expectation features, respectively). Thus, when a task is first learned, every hypothesis schema is equally represented in the data, and the algorithm chooses the most specific schema for the next execution.
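As a sketch of this selection step (ours, not the authors' code; the is_consistent predicate and the prior function are stand-ins for the system's goal-matching and class-specificity heuristics):

```python
def select_primary(hypotheses, examples, prior):
    """Choose the hypothesis maximizing P(h|D), proportional to P(D|h) P(h).

    P(D|h) is the fraction of demonstrations whose observed state change is
    consistent with the goal representation in h; the prior prefers broad
    object classes with specific class descriptions.
    """
    def posterior(h):
        consistent = sum(1 for ex in examples if h.is_consistent(ex))
        likelihood = consistent / len(examples)
        return likelihood * prior(h)

    return max(hypotheses, key=posterior)
```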

Perspective Taking

In this section, we describe how perspective taking integrates with the cognitive mechanisms discussed above: belief modeling and task learning. Inferring the beliefs of the teacher allows the robot to build task models which capture the intent behind human demonstrations.

Perspective Taking and Belief Inference

When demonstrating a task to be learned, it is important that the context within which that demonstration is performed be the same for the teacher as it is for the learner. However, in complex and dynamic environments, it is possible for the instructor's beliefs about the context surrounding the demonstration to diverge from those of the learner. For example, a visual occlusion could block the teacher's view of a region of a shared workspace (but not that of the learner) and consequently lead to ambiguous demonstrations, where the teacher does not realize that the visual information of the scene differs between them.


Figure 4: Timeline following the progress of the robot's beliefs for one button. The robot updates its belief about the button with any sensor data available; however, the robot only integrates new data into its model of the human's belief if the data is available when the human is able to perceive it.

To address this issue, the robot must establish and maintain mutual beliefs with the human instructor about the shared context surrounding demonstrations. The robot keeps track of its own beliefs about object state using its Belief System, described above. In order to model the beliefs of the human instructor as separate and potentially different from its own, the robot re-uses the mechanism of its own Belief System. These beliefs that represent the robot's model of the human's beliefs are in the same format as its own, but are maintained separately so the robot can compare differences between its beliefs and the human's beliefs.

As described above, belief maintenance consists of incorporating new sensor data into existing knowledge of the world. The robot's sensors are all in its reference frame, so objects in the world are perceived relative to the robot's position and orientation. In order to model the beliefs of the human, the robot re-uses the same mechanisms used for its own belief modeling, but first transforms the data into the reference frame of the human (see Fig. 3).

The robot can also filter out incoming data that it believes is not perceivable by the human, thereby preventing that new data from updating the model of the human's beliefs. Recall that the sensory observations O = {o1, o2, ..., oN} are the input to the robot's Belief System. The inputs to the secondary belief system that models the human's beliefs are O′, where:

O′ = {P(o′) | o′ ∈ O, V(o′) = 1} (7)

where:

V(x) = 1 if x is visible to the human, and 0 otherwise (8)

and:

P : {robot local observations} → {person local observations} (9)

Visibility can be determined by a cone calculated from the human's position and orientation, and objects on the opposite side of known occlusions from the human can be marked invisible.
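A geometric sketch of the visibility test and reference-frame transform of Equations 7 to 9 (ours, using a simple 2D view cone and no occlusion handling; the cone half-angle and the pose representation are illustrative assumptions):

```python
import math

def visible_to_human(obs_pos, human_pos, human_facing, half_angle_deg=60.0):
    """V(o): True if the observation lies within a view cone from the human's pose."""
    dx, dy = obs_pos[0] - human_pos[0], obs_pos[1] - human_pos[1]
    dist = math.hypot(dx, dy)
    if dist == 0.0:
        return True
    fx, fy = human_facing
    cos_angle = (dx * fx + dy * fy) / (dist * math.hypot(fx, fy))
    return cos_angle >= math.cos(math.radians(half_angle_deg))

def to_human_frame(obs_pos, human_pos, human_yaw):
    """P(o): express a robot-frame position in the human's local reference frame."""
    dx, dy = obs_pos[0] - human_pos[0], obs_pos[1] - human_pos[1]
    c, s = math.cos(-human_yaw), math.sin(-human_yaw)
    return (c * dx - s * dy, s * dx + c * dy)

def observations_for_human(observations, human_pos, human_facing, human_yaw):
    """Build O' = {P(o) | o in O, V(o) = 1}, the secondary belief system's input."""
    return [dict(o, pos=to_human_frame(o["pos"], human_pos, human_yaw))
            for o in observations
            if visible_to_human(o["pos"], human_pos, human_facing)]

# Toy usage: the human stands at the origin facing +x, so only the button is visible.
obs = [{"id": "button", "pos": (1.0, 0.0)}, {"id": "block", "pos": (-1.0, 0.0)}]
print(observations_for_human(obs, (0.0, 0.0), (1.0, 0.0), 0.0))
```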

Maintaining this parallel set of beliefs is different from simply adding metadata to the robot's original beliefs, because it reuses the entire architecture, which has mechanisms for object permanence, history of properties, etc. This allows for a more sophisticated model of the human's beliefs. For instance, Fig. 4 shows an example where this approach keeps track of the human's incorrect beliefs about objects that have changed state while out of the human's view. This is important for establishing and maintaining mutual beliefs in time-varying situations where the beliefs of individuals can diverge over time.

Perspective Taking and Task Learning

In a similar fashion, in order to model the task from the demonstrator's perspective, the robot runs a parallel copy of its task learning engine that operates on its simulated representation of the human's beliefs. In essence, this focuses the hypothesis generation mechanism on the subset of the input space that matters to the human teacher.

At the beginning of a learning episode, the robot can take a snapshot of the world in order to later reason about world state changes. The integration of perspective taking means that this snapshot can be taken either from the robot's (R) or the human's (H) belief perspective. Thus, when the learning process begins, the robot creates two distinct schema representations, SRobot and SHum, and saves belief snapshots Snp(t0, R) and Snp(t0, H). Learning proceeds as before, but operating on these two parallel schemas.

Once the human indicates that the current task is done, SRobot and SHum both contain the representation of the seen example. Having been created from the same demonstration, the executables will be equivalent, but the goals may not be equal, since they are derived from differing perspectives. Maintaining parallel schema representations gives the robot three options when faced with inconsistent goal hypotheses: assume that the human's schema is correct, assume that its own schema is correct, or attempt to resolve the conflicts between the schemas. Our evaluation in the following section focuses on the simplest approach: take the perspective of the teacher, and assume that their schema is correct.

Human Subjects Study

We conducted a human subjects study to evaluate our approach. The study had two purposes: first, to gather human performance data on a set of learning tasks that were well matched to our robot's existing perceptual and inferential capabilities, creating a benchmark suite for our perspective taking architecture; and second, to highlight the role of perspective taking in human learning.

Study participants were asked to engage in four different learning tasks involving foam building blocks. We gathered data from 41 participants, divided into two groups: 20 participants observed demonstrations provided by a human teacher sitting opposite them (the social condition), while 21 participants were shown static images of the same demonstrations, with the teacher absent from the scene (the nonsocial condition). Participants were asked to show their understanding of the presented skill either by re-performing the skill on a novel set of blocks (in the social context) or by selecting the best matching image from a set of possible images (in the nonsocial context).


Figure 5: The four tasks demonstrated to participants in the study (photos taken from the participant's perspective). Tasks 1 and 2 were demonstrated twice with blocks in different configurations. Tasks 3 and 4 were demonstrated only once.

Fig. 5 illustrates sample demonstrations of each of the four tasks. The tasks were designed to be highly ambiguous, providing the opportunity to investigate how different types of perspective taking might be used to resolve these ambiguities. The subjects' demonstrated rules can be divided into three categories: perspective taking (PT) rules, non-perspective taking (NPT) rules, and rules that did not clearly support either hypothesis (Other).

Task 1 focused on visual perspective taking during the demonstration. Participants were shown two demonstrations with blocks in different configurations. In both demonstrations, the teacher attempted to fill all of the holes in the square blocks with the available pegs. Critically, in both demonstrations, a blue block lay within clear view of the participant but was occluded from the view of the teacher by a barrier. The hole of this blue block was never filled by the teacher. Thus, an appropriate (NPT) rule might be "fill all but blue," or "fill all but this one," but if the teacher's perspective is taken into account, a more parsimonious (PT) rule might be "fill all of the holes" (see Fig. 6).

Task 2 focused on resource perspective taking during the demonstration. Again, participants were shown two demonstrations with blocks in different configurations. Various manipulations were performed to encourage the idea that some of the blocks "belonged" to the teacher, whereas the others "belonged" to the participant, including spatial separation in the arrangement of the two sets of blocks. In both demonstrations, the teacher placed markers on only "his" red and green blocks, ignoring his blue blocks and all of the participant's blocks. Because of the way that the blocks were arranged, however, the teacher's markers were only ever placed on triangular blocks, long, skinny, rectangular blocks, and bridge-shaped blocks, and marked all such blocks in the workspace. Thus, if the blocks' "ownership" is taken into account, a simple (PT) rule might be "mark only red and green blocks," but a more complicated (NPT) rule involving shape preference could account for the marking and non-marking of all of the blocks in the workspace (see Fig. 6).

Table 1: Differential rule acquisition for study participants in social vs. nonsocial conditions. ***: p < 0.001

Task     Condition   PT Rule   NPT Rule   Other   p
Task 1   social      6         1          13
         nonsocial   1         12         8       ***
Task 2   social      16        0          4
         nonsocial   7         12         2       ***
Task 3   social      12        8          -
         nonsocial   0         21         -       ***
Task 4   social      14        6          -
         nonsocial   0         21         -       ***

Tasks 3 and 4 investigated whether or not visual perspective is factored into the understanding of task goals. In both tasks, participants were shown a single construction demonstration, and then were asked to construct "the same thing" using a similar set of blocks. Fig. 5 shows the examples that were constructed by the teacher. In both tasks, the teacher assembled the examples from left to right. In Task 4, the teacher assembled the word "LiT" so that it read correctly from their own perspective. Our question was: would the participants rotate the demonstration (the PT rule) so that it read correctly for themselves, or would they mirror the figure (the NPT rule) so that it looked exactly the same as the demonstration (and thus read backwards from their perspective)? Task 3, in which the teacher assembled a sequence of building-like forms, was essentially included as a control, to see if people would perform any such perspective flipping in a non-linguistic scenario.

The results of the study are summarized in Table 1, where participant behavior was recorded and classified according to the exhibited rule. For every task, differences in rule choice between the social and nonsocial conditions were highly significant (chi-square, p < 0.001). The most popular rule for each condition is highlighted in bold (note that, while many participants fell into the "Other" category for Task 1, there was very little rule agreement between these participants). These results strongly support the intuition that perspective taking plays an important role in human learning in socially situated contexts.

Figure 6: Input domains consistent with the perspective taking (PT) vs. non-perspective taking (NPT) hypotheses. In visual perspective taking (left image), the student's attention is focused on just the blocks that the teacher can see, excluding the occluded block. In resource perspective taking (right image), attention is focused on just the blocks that are considered to be "the teacher's," excluding the other blocks.

Table 2: High-likelihood hypotheses entertained by the robot at the conclusion of benchmark task demonstrations. The highest-likelihood (winning) hypotheses are highlighted in bold.

Task          Condition    High-Likelihood Hypotheses
Task 1        with PT      all; all but blue
              without PT   all but blue
Task 2        with PT      all red and green; shape preference
              without PT   shape preference
Tasks 3 & 4   with PT      rotate figure; mirror figure
              without PT   mirror figure

Robot Evaluation and Discussion

The tasks from our study were used to create a benchmark suite for our architecture. In our graphical simulation environment, the robot was presented with the same task demonstrations as were provided to the study participants (Fig. 7). The learning performance of the robot was analyzed in two conditions: with the perspective taking mechanisms intact, and with them disabled.

The robot was instructed in real-time by a human teacher. The teacher delineated task demonstrations using verbal commands: "Leo, I can teach you to do task 1," "Task 1 is done," etc. The teacher could select and move graphical building blocks within the robot's 3D workspace via a mouse interface. This interface allowed the teacher to demonstrate a wide range of tasks involving complex block arrangements and dynamics. For our benchmark suite, the teacher followed the same task protocol that was used in the study, featuring identical block configurations and movements. For the purposes of perspective taking, the teacher's visual perspective was assumed to be that of the virtual camera through which the scene was rendered.

As the teacher manipulated the blocks, the robot attended to the teacher's movements. The robot's task learning mechanisms parsed these movements into discrete actions and assembled a schema representation for the task at hand, as detailed in previous sections. At the conclusion of each demonstration, the robot expanded and revised a set of hypotheses about the intended goal of the task. After the final demonstration, the robot was instructed to perform the task using a novel set of blocks arranged in accordance with the human study protocol. The robot's behavior was recorded, along with all of the task hypotheses considered to be valid by the robot's learning mechanism.

Table 2 shows the highest-likelihood hypotheses entertained by the robot in the various task conditions at the conclusion of the demonstrations. In the perspective taking condition, likely hypotheses included both those constructed from the teacher's perspective as well as those constructed from the robot's own perspective; however, as described above, the robot preferred hypotheses constructed from the teacher's perspective. The hypotheses favored by the learning mechanism (and thus executed by the robot) are highlighted in bold. For comparison, Table 3 displays the rules selected by study participants, with the most popular rules for each task highlighted in bold.

Table 3: Hypotheses selected by study participants following task demonstrations. The most popular rules are highlighted in bold.

Task          Condition   Hypotheses Selected
Task 1        social      all; number; spatial arrangement
              nonsocial   all but blue; spatial arrangement; all but one
Task 2        social      all red and green; shape preference; spatial arrangement
              nonsocial   shape preference; all red and green
Tasks 3 & 4   social      rotate figure; mirror figure
              nonsocial   mirror figure

For every task and condition, the rule learned by the robot matches the most popular rule selected by the humans. This strongly suggests that the robot's perspective taking mechanisms focus its attention on a region of the input space similar to that attended to by study participants in the presence of a human teacher. It should also be noted, as evident in the tables, that participants generally seemed to entertain a more varied set of hypotheses than the robot. In particular, participants often demonstrated rules based on spatial or numeric relationships between the objects, relationships which are not yet represented by the robot. Thus, the differences in behavior between the humans and the robot can largely be understood as a difference in the scope of the relationships considered between the objects in the example space, rather than as a difference in this underlying space. The robot's perspective taking mechanisms are successful at bringing the agent's focus of attention into alignment with the humans' in the presence of a social teacher.

This is the first work to examine the role of perspective taking for introceptive states (e.g., beliefs and goals) in a human-robot learning task. It builds upon and integrates two important areas of research: (1) ambiguity resolution and perspective taking, and (2) learning from humans. Ambiguity has been a topic of interest in dialog systems (Grosz & Sidner 1990; Gorniak 2005). Others have looked at the use of visual perspective taking in collaborative settings (Trafton et al. 2005; Jones & Hinds 2002). We also draw inspiration from research into learning from humans, which typically focuses either on modeling a human via observation (Horvitz et al. 1998; Lashkari, Metral, & Maes 1994) or on learning in an interactive setting (Lieberman 2001; Atkeson & Schaal 1997; Nicolescu & Mataric 2003). The contribution of our work is in combining and extending these thrusts into a novel, integrated approach where perspective taking is used as an organizing principle for learning in human-robot interaction.


Figure 7: The robot was presented with similar learning tasks in a simulated environment.

Conclusion

This paper makes the following contributions. First, in a novel human subjects study, we show the important role that perspective taking plays in learning within a socially situated context. People use perspective taking to entertain a different set of hypotheses when demonstrations are presented by another person, versus when they are presented in a nonsocial context. Thus, perspective taking abilities will be critical for robots that interact with and learn from people in social contexts.

Second, we present a novel architecture for collaborative human-robot interaction, informed by recent scientific findings about how people are able to take the perspective of others, where simulation-theoretic mechanisms serve as the organizational principle for the robot's perspective taking skills across multiple system levels (e.g., perceptual-belief, action-goal, task learning, etc.).

Finally, we evaluated our architecture on a benchmark suite drawn from the human subjects study and showed that our humanoid robot can apply perspective taking to draw the same conclusions as humans under conditions of high ambiguity. Perspective taking, both in humans and in our architecture, focuses the agent's attention on the subset of the problem space that is important to the teacher. This constrained attention allows the agent to overcome ambiguity and incompleteness that can often be present in human demonstrations.

Acknowledgments

The work presented in this paper is a result of ongoing efforts of the graduate and undergraduate students of the MIT Media Lab Robotic Life Group. This work is funded by the Digital Life and Things That Think consortia of the MIT Media Lab. Jesse Gray would like to thank Samsung Electronics for their support.

References

Atkeson, C. G., and Schaal, S. 1997. Robot learning from demonstration. In Proc. 14th International Conference on Machine Learning, 12–20. Morgan Kaufmann.

Barsalou, L. W.; Niedenthal, P. M.; Barbey, A.; and Ruppert, J. 2003. Social embodiment. The Psychology of Learning and Motivation 43.

Blumberg, B.; Downie, M.; Ivanov, Y.; Berlin, M.; Johnson, M. P.; and Tomlinson, B. 2002. Integrated learning for interactive synthetic characters. ACM Transactions on Graphics 21(3: Proceedings of ACM SIGGRAPH 2002).

Carberry, S. 2001. Techniques for plan recognition. User Modeling and User-Adapted Interaction 11(1-2):31–48.

Davies, M., and Stone, T. 1995. Introduction. In Davies, M., and Stone, T., eds., Folk Psychology: The Theory of Mind Debate. Cambridge: Blackwell.

Gorniak, P. 2005. The Affordance-Based Concept. PhD thesis, MIT.

Gray, J.; Breazeal, C.; Berlin, M.; Brooks, A.; and Lieberman, J. 2005. Action parsing and goal inference using self as simulator. In 14th IEEE International Workshop on Robot and Human Interactive Communication (ROMAN). Nashville, Tennessee: IEEE.

Grosz, B. J., and Sidner, C. L. 1990. Plans for discourse. In Cohen, P. R.; Morgan, J.; and Pollack, M. E., eds., Intentions in Communication. Cambridge, MA: MIT Press. Chapter 20, 417–444.

Horvitz, E.; Breese, J.; Heckerman, D.; Hovel, D.; and Rommelse, K. 1998. The Lumiere project: Bayesian user modeling for inferring the goals and needs of software users. In Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence, 256–265.

Jones, H., and Hinds, P. 2002. Extreme work teams: Using SWAT teams as a model for coordinating distributed robots. In Proceedings of the 2002 ACM Conference on Computer Supported Cooperative Work, 372–381. ACM Press.

Lashkari, Y.; Metral, M.; and Maes, P. 1994. Collaborative interface agents. In Proceedings of the Twelfth National Conference on Artificial Intelligence, volume 1. Seattle, WA: AAAI Press.

Lieberman, H., ed. 2001. Your Wish is My Command: Programming by Example. San Francisco: Morgan Kaufmann.

Lockerd, A., and Breazeal, C. 2004. Tutelage and socially guided robot learning. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).

Nicolescu, M. N., and Mataric, M. J. 2003. Natural methods for robot task learning: Instructive demonstrations, generalization and practice. In Proceedings of the Second International Joint Conference on Autonomous Agents and Multi-Agent Systems.

Sebanz, N.; Bekkering, H.; and Knoblich, G. 2006. Joint action: Bodies and minds moving together. Trends in Cognitive Sciences 10(2):70–76.

Trafton, J. G.; Cassimatis, N. L.; Bugajska, M. D.; Brock, D. P.; Mintz, F. E.; and Schultz, A. C. 2005. Enabling effective human-robot interaction using perspective-taking in robots. IEEE Transactions on Systems, Man, and Cybernetics 35(4):460–470.
