Lecture Notes in Artificial Intelligence 2792
Edited by J. G. Carbonell and J. Siekmann

Subseries of Lecture Notes in Computer Science

Springer
Berlin Heidelberg New York Hong Kong London Milan Paris Tokyo

Thomas Rist, Ruth Aylett, Daniel Ballin, Jeff Rickel (Eds.)

Intelligent Virtual Agents

4th International Workshop, IVA 2003
Kloster Irsee, Germany, September 15-17, 2003
Proceedings

Series Editors

Jaime G. Carbonell, Carnegie Mellon University, Pittsburgh, PA, USA
Jörg Siekmann, University of Saarland, Saarbrücken, Germany

Volume Editors

Thomas Rist
DFKI GmbH
Stuhlsatzenhausweg 3, 66111 Saarbrücken, Germany
E-mail: [email protected]

Ruth Aylett
University of Salford, Centre for Virtual Environments
Business House, Salford, M5 4WT, UK
E-mail: [email protected]

Daniel Ballin
Radical Multimedia Lab, BT Exact
Ross PP4, Adastral Park, Ipswich, IP5 3RE, UK
E-mail: [email protected]

Jeff Rickel
USC Information Sciences Institute
4676 Admiralty Way, Suite 1001, Marina del Rey, USA
E-mail: [email protected]

Cataloging-in-Publication Data applied for

A catalog record for this book is available from the Library of Congress.

Bibliographic information published by Die Deutsche Bibliothek
Die Deutsche Bibliothek lists this publication in the Deutsche Nationalbibliografie; detailed bibliographic data is available in the Internet at <http://dnb.ddb.de>.

CR Subject Classification (1998): I.2.11, I.2, H.5, H.4, K.3

ISSN 0302-9743
ISBN 3-540-20003-7 Springer-Verlag Berlin Heidelberg New York

This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer-Verlag. Violations are liable for prosecution under the German Copyright Law.

Springer-Verlag Berlin Heidelberg New York,
a member of BertelsmannSpringer Science+Business Media GmbH

http://www.springer.de

© Springer-Verlag Berlin Heidelberg 2003
Printed in Germany

Typesetting: Camera-ready by author, data conversion by Boller Mediendesign
Printed on acid-free paper    SPIN: 10931851    06/3142    5 4 3 2 1 0

Preface

This volume, containing the proceedings of IVA 2003, held at Kloster Irsee, in Germany, September 15–17, 2003, is testimony to the growing importance of Intelligent Virtual Agents (IVAs) as a research field. We received 67 submissions, nearly twice as many as for IVA 2001, not only from European countries, but from China, Japan, and Korea, and both North and South America.

As IVA research develops, a growing number of application areas and platforms are also being researched. Interface agents are used as part of larger applications, often on the Web. Education applications draw on virtual actors and virtual drama, while the advent of 3D mobile computing and the convergence of telephones and PDAs produce geographically-aware guides and mobile entertainment applications. A theme that will be apparent in a number of the papers in this volume is the impact of embodiment on IVA research – a characteristic differentiating it to some extent from the larger field of software agents. Believability is a major research concern, and expressiveness – facial, gestural, postural, vocal – a growing research area. Both the modeling of IVA emotional systems and the development of IVA narrative frameworks are represented in this volume. A characteristic of IVA research is its interdisciplinarity, involving Artificial Intelligence (AI) and Artificial Life (ALife), Human-Computer Interaction (HCI), Graphics, Psychology, and Software Engineering, among other disciplines. All of these areas are represented in the papers collected here. The purpose of the IVA workshop series is to bring researchers from all the relevant disciplines together and to help in the building of a common vocabulary and shared research domain. While trying to attract the best work in IVA research, the aim is inclusiveness and the stimulation of dialogue and debate.

The larger this event grows, the larger the number of people whose efforts contribute to its success. First, of course, we must thank the authors themselves, whose willingness to share their work and ideas makes it all possible. Next, the Program Committee, including the editors and 49 distinguished researchers, who worked so hard to tight deadlines to select the best work for presentation, and the extra reviewers who helped to deal with the large number of submissions. The local arrangements committee played a vital role in the smooth running of the event. Finally, all those who attended made the event more than the sum of its presentations with all the discussion and interaction that makes a workshop come to life.

July 2003

Ruth Aylett
Daniel Ballin
Jeff Rickel
Thomas Rist

To Our Friend and Colleague Jeff

Committee Listings

Conference Chairs

Ruth Aylett (Centre for Virtual Environments, University of Salford, UK)
Daniel Ballin (Radical Multimedia Laboratory, BTexact, UK)
Jeff Rickel (USC Information Sciences Institute, USA)

Local Conference Chair

Thomas Rist (DFKI, D)

Organizing Committee

Elisabeth Andre (University of Augsburg, D)
Patrick Gebhard (DFKI, D)
Marco Gillies (UCL@Adastral Park, University College London/BTexact, UK)
Jesus Ibanez (Universitat Pompeu Fabra, E)
Martin Klesen (DFKI, D)
Matthias Rehm (University of Augsburg, D)
Jon Sutton (Radical Multimedia Laboratory, BTexact, UK)

Invited Speakers

Stacy Marsella (USC Information Sciences Institute)
Antonio Kruger (University of Saarland)
Alexander Reinecke (Charamel GmbH)
Marc Cavazza (Teesside University)

Program Committee

Jan Albeck
Elisabeth Andre
Yasmine Arafa
Ruth Aylett
Norman Badler
Daniel Ballin
Josep Blat
Bruce Blumberg
Joanna Bryson
Lola Canamero
Justine Cassell
Marc Cavazza
Elizabeth Churchill
Barry Crabtree
Kerstin Dautenhahn
Angelica de Antonio
Nadja de Carolis
Fiorella de Rosis

Patrick Doyle
Marco Gillies
Patrick Gebhard
Jonathan Gratch
Barbara Hayes-Roth
Randy Hill
Adrian Hilton
Kristina Hook
Katherine Isbister
Mitsuru Ishizuka
Ido Iurgel
Lewis Johnson
Martin Klesen
Jarmo Laaksolahti
John Laird
James Lester
Craig Lindley
Brian Loyall
Yang Lu
Nadia Magnenat-Thalmann
Andrew Marriot
Stacy Marsella
Michael Mateas
Alexander Nareyek
Anton Nijholt
Gregory O'Hare
Sharon Oviatt
Ana Paiva
Catherine Pelachaud
Paolo Petta
Tony Polichroniadis
Helmut Prendinger
Thomas Rist
Matthias Rehm
Jeff Rickel
Daniela Romano
Anthony Steed
Emmanuel Tanguy
Daniel Thalmann
Kris Thorisson
Demetri Terzopoulos
Hannes Vilhjalmsson
John Vince
Michael Young

Sponsoring Institutions

EU 5th Framework VICTEC Project
SIGMEDIA, ACL's Special Interest Group on Multimedia Language Processing
DFKI, German Research Center for Artificial Intelligence GmbH
BTexact Technologies
University of Augsburg, Dept. of Multimedia Concepts and Applications

Table of Contents

Keynote Speech

Interactive Pedagogical Drama: Carmen's Bright IDEAS Assessed
S.C. Marsella

Interface Agents and Conversational Agents

Happy Chatbot, Happy User
G. Tatai, A. Csordas, A. Kiss, A. Szalo, L. Laufer

Interactive Agents Learning Their Environment
M. Hildebrand, A. Eliens, Z. Huang, C. Visser

Socialite in der Spittelberg: Incorporating Animated Conversation into a Web-Based Community-Building Tool
B. Krenn, B. Neumayr

FlurMax: An Interactive Virtual Agent for Entertaining Visitors in a Hallway
B. Jung, S. Kopp

When H.C. Andersen Is Not Talking Back
N.O. Bernsen

Emotion and Believability

Emotion in Intelligent Virtual Agents: The Flow Model of Emotion
L. Morgado, G. Gaspar

The Social Credit Assignment Problem
W. Mao, J. Gratch

Adding the Emotional Dimension to Scripting Character Dialogues
P. Gebhard, M. Kipp, M. Klesen, T. Rist

Synthetic Emotension
C. Martinho, M. Gomes, A. Paiva

FantasyA – The Duel of Emotions
R. Prada, M. Vala, A. Paiva, K. Hook, A. Bullock

Double Bind Situations in Man-Machine Interaction under Contexts of Mental Therapy
T. Nomura

Expressive Animation

Happy Characters Don't Feel Well in Sad Bodies!
M. Vala, A. Paiva, M.R. Gomes

Reusable Gestures for Interactive Web Agents
Z. Ruttkay, Z. Huang, A. Eliens

A Model of Interpersonal Attitude and Posture Generation
M. Gillies, D. Ballin

Modelling Gaze Behaviour for Conversational Agents
C. Pelachaud, M. Bilvi

A Layered Dynamic Emotion Representation for the Creation of Complex Facial Expressions
E. Tanguy, P. Willis, J. Bryson

Eye-Contact Based Communication Protocol in Human-Agent Interaction
H. Nonaka, M. Kurihara

Embodiment and Situatedness

Embodied in a Look: Bridging the Gap between Humans and Avatars
N. Courty, G. Breton, D. Pele

Modelling Accessibility of Embodied Agents for Multi-modal Dialogue in Complex Virtual Worlds
D. Sampath, J. Rickel

Bridging the Gap between Language and Action
T. Takenobu, K. Tomofumi, S. Suguru, O. Manabu

VideoDIMs as a Framework for Digital Immortality Applications
D. DeGroot

Motion Planning

Motion Path Synthesis for Intelligent Avatar
F. Liu, R. Liang

"Is It Within My Reach?" – An Agent's Perspective
Z. Huang, A. Eliens, C. Visser

Simulating Virtual Humans Across Diverse Situations
B. Mac Namee, S. Dobbyn, P. Cunningham, C. O'Sullivan

A Model for Generating and Animating Groups of Virtual Agents
M. Becker Villamil, S. Raupp Musse, L.P. Luna de Oliveira

Scripting Choreographies
S.M. Grunvogel, S. Schwichtenberg

Behavioural Animation of Autonomous Virtual Agents Helped by Reinforcement Learning
T. Conde, W. Tambellini, D. Thalmann

Models, Architectures, and Tools

Designing Commercial Applications with Life-like Characters
A. Reinecke

Comparing Different Control Architectures for Autobiographic Agents in Static Virtual Environments
W.C. Ho, K. Dautenhahn, C.L. Nehaniv

KGBot: A BDI Agent Deploying within a Complex 3D Virtual Environment
I.-C. Kim

Using the BDI Architecture to Produce Autonomous Characters in Virtual Worlds
J.A. Torres, L.P. Nedel, R.H. Bordini

Programmable Agent Perception in Intelligent Virtual Environments
S. Vosinakis, T. Panayiotopoulos

Mediating Action and Music with Augmented Grammars
P. Casella, A. Paiva

Charisma Cam: A Prototype of an Intelligent Digital Sensory Organ for Virtual Humans
M. Bechinie, K. Grammer

Mobile and Portable IVAs

Life-like Characters for the Personal Exploration of Active Cultural Heritage
A. Kruger

Agent Chameleons: Virtual Agents Real Intelligence
G.M.P. O'Hare, B.R. Duffy, B. Schon, A.N. Martin, J.F. Bradley

A Scripting Language for Multimodal Presentation on Mobile Phones
S. Saeyor, S. Mukherjee, K. Uchiyama, M. Ishizuka

Narration and Storytelling

Interacting with Virtual Agents in Mixed Reality Interactive Storytelling
M. Cavazza, O. Martin, F. Charles, S.J. Mead, X. Marichal

An Autonomous Real-Time Camera Agent for Interactive Narratives and Games
A. Hornung, G. Lakemeyer, G. Trogemann

Solving the Narrative Paradox in VEs – Lessons from RPGs
S. Louchart, R. Aylett

That's My Point! Telling Stories from a Virtual Guide Perspective
J. Ibanez, R. Aylett, R. Ruiz-Rodarte

Virtual Actors in Interactivated Storytelling
I.A. Iurgel

Symbolic Acting in a Virtual Narrative Environment
L. Schafer, B. Bokan, A. Oldroyd

Enhancing Believability Using Affective Cinematography
J. Laaksolahti, N. Bergmark, E. Hedlund

Agents with No Aims: Motivation-Driven Continuous Planning
N. Avradinis, R. Aylett

Evaluation and Design Methodologies

Analysis of Virtual Agent Communities by Means of AI Techniques and Visualization
D. Kadlecek, D. Rehor, P. Nahodil, P. Slavík

Persona Effect Revisited
H. Prendinger, S. Mayer, J. Mori, M. Ishizuka

Effects of Embodied Interface Agents and Their Gestural Activity
N.C. Kramer, B. Tietz, G. Bente

Embodiment and Interaction Guidelines for Designing Credible, Trustworthy Embodied Conversational Agents
A.J. Cowell, K.M. Stanney

Animated Characters in Bullying Intervention
S. Woods, L. Hall, D. Sobral, K. Dautenhahn, D. Wolke

Embodied Conversational Agents: Effects on Memory Performance and Anthropomorphisation
R.-J. Beun, E. de Vos, C. Witteman

Agents across Cultures
S. Payr, R. Trappl

Education and Training

Steve Meets Jack: The Integration of an Intelligent Tutor and a Virtual Environment with Planning Capabilities
G. Mendez, J. Rickel, A. de Antonio

Machiavellian Characters and the Edutainment Paradox
D. Sobral, I. Machado, A. Paiva

Socially Intelligent Tutor Agents
D. Heylen, A. Nijholt, R. op den Akker, M. Vissers

Multimodal Training Between Agents
M. Rehm

Posters

Intelligent Camera Direction in Virtual Storytelling
B. Bokan, L. Schafer

Exploring an Agent-Driven 3D Learning Environment for Computer Graphics Education
W. Hu, J. Zhu, Z.G. Pan

An Efficient Synthetic Vision System for 3D Multi-character Systems
M. Lozano, R. Lucia, F. Barber, F. Grimaldo, A. Lucas, A. Fornes

Avatar Arena: Virtual Group-Dynamics in Multi-character Negotiation Scenarios
M. Schmitt, T. Rist

Emotional Behaviour Animation of Virtual Humans in Intelligent Virtual Environments
Z. Liu, Z.G. Pan

Empathic Virtual Agents
C. Zoll, S. Enz, H. Schaub

Improving Reinforcement Learning Algorithm Using Emotions in a Multi-agent System
R. Daneshvar, C. Lucas

Author Index

Interactive Pedagogical Drama: Carmen's Bright IDEAS Assessed

Stacy C. Marsella

Center for Advanced Research in Technology for Education
Information Sciences Institute
University of Southern California
4676 Admiralty Way, Suite 1001
Marina del Rey, California, 90292, [email protected]

1 Extended Abstract

The use of drama as a pedagogical tool has a long tradition. Aristotle argued that drama is an imitation of life, and not only do we learn through that imitation but our enjoyment of drama derives in part from our delight in learning. More recently, research in psychology has argued that narrative is central to how we understand the world and communicate that understanding [1]. And of course, the engaging, motivational nature of story is undeniable; the world consumes stories with a "ravenous hunger" [3].

However, stories traditionally place the learner in the role of passive spectator instead of active learner. The goal of Interactive Pedagogical Drama (IPD) is to exploit the edifying power of story while promoting active learning. An IPD immerses the learner in an engaging, evocative story where she interacts openly with realistic characters. The learner makes decisions or takes actions on behalf of a character in the story, and sees the consequences of her decisions. The learner identifies with and assumes responsibility for the characters in the story, while the control afforded to the learner enhances intrinsic motivation [2]. Since the IPD framework allows for stories with multiple interacting characters, learning can be embedded in a social context [6]. We take a very wide view of the potential applications of interactive story and IPD in particular. We envision interactive story as a means to teach social skills, to teach math and science, to further individual development, to provide health interventions, etc.

We have developed an agent-based approach to interactive pedagogical drama. Our first IPD was Carmen's Bright IDEAS (CBI), an interactive, animated health intervention designed to improve the social problem-solving skills of mothers of pediatric cancer patients. Parents of children with chronic diseases are often poorly equipped to handle the multiple demands required by their ill child as well as the needs of their healthy children, spouse and work. Critical decisions must be made that affect family and work. To help train parents in the problem-solving skills required to address such challenges, CBI teaches a method for social problem-solving called Bright IDEAS [5]. Each letter of IDEAS refers to a separate step in the problem solving method: Identify a solvable problem, Develop possible solutions, Evaluate options, Act on plan and See if it worked. Prior to CBI, the Bright IDEAS method was taught in a series of one-on-one sessions with trained counselors, using worksheets that helped a mother detail her problems in terms of IDEAS steps. The purpose of Carmen's Bright IDEAS is to teach mothers how to apply the Bright IDEAS method in concrete situations. Mothers learn more on their own and at times of their own choosing, and rely less on face-to-face counseling sessions.

The interactive story of Carmen's Bright IDEAS is organized into three acts. The first act reveals the back story: various problems Carmen is facing, including her son's cancer, her daughter Diana's temper tantrums, work problems, etc. The second, main, act takes place in an office, where Carmen discusses her problems with a clinical counselor, Gina, who suggests she pick a solvable problem and use Bright IDEAS to help her find solutions. See Figure 1. With Gina's help, Carmen goes through the initial steps of Bright IDEAS, applying the steps to one of her problems and then completes the remaining steps on her own. The final act reveals the outcomes of Carmen's application of Bright IDEAS.

The learner interacts with the drama by making choices for Carmen such as what problem to work on and how she should cope with the stresses she is facing. The learner can choose alternative internal thoughts for Carmen. These are presented as thought balloons (see Figure 2). Both Gina's dialog moves and the learner's choices influence the cognitive and emotional state of the agent playing Carmen, which in turn impacts her behavior and dialog.

Fig. 1. Gina (left) and Carmen (right) in Gina’s office.

Fig. 2. Interaction with Carmen through Thought Balloons.

In general, creating an IPD requires the designer to balance the demands of creating a good story, achieving pedagogical goals and allowing user control, while maintaining high artistic standards. To ensure a good story, dramatic tension, pacing and the integrity of story and character must be maintained. Pedagogical goals require the design of a pedagogically-appropriate "gaming" space with appropriate consequences for learner choices, scaffolding to help the learner when necessary and a style of play appropriate to the learner's skill and age. To provide for learner control, an interaction framework must be developed to allow the learner's interactions to impact story and the pedagogical goals. These various demands can be in conflict; for example, pedagogically appropriate consequences can conflict with dramatic tension, and learner control can impact pacing and story integrity. As the difficult subject matter and pedagogical goals of Carmen's Bright IDEAS make clear, all these design choices must be sensitive to the learner and their needs.

An early version of the CBI system was first described in [4]. It was subsequently further developed and then tested as an exploratory arm of a clinical trial of the Bright IDEAS method at seven cancer centers across the U.S. Results of the exploratory trial have recently become available. The results overall were very positive and promising for the use of IPD in health interventions. For example, the mothers found the experience very believable and helpful in understanding how to apply Bright IDEAS to their own problems.

This talk will reveal the rationale behind the design choices made in creating CBI, describe the technology that was used in the version that went into clinical trials, as well as discuss in detail results from its evaluation. Our more recent research in applying IPD to language learning will also be discussed. Although the expectation that a system like CBI could substitute for time spent with a trained clinical counselor teaching Bright IDEAS is bold, the reality is that the alternative of repeated one-on-one sessions with counselors is not feasible for reaching a larger audience. Interactive Pedagogical Drama could fill a void in making effective health intervention training available to the larger public at their convenience. The training task for Carmen's Bright IDEAS was a difficult one, fraught with many potential pitfalls. The fact that it was so well received by the mothers was remarkable, and bodes well for applying IPD to other training and learning tasks.

Acknowledgement

I would like to thank my colleagues on the Carmen project, W. Lewis Johnson and Catherine M. LaBore, as well as our clinical collaborators, particularly O.J. Sahler, MD, Ernest Katz, Ph.D., James Varni, Ph.D., and Karin Hart, Psy.D. Supported in part by the National Cancer Institute under grant R25CA65520.

References

1. Bruner, J. (1990). Acts of Meaning. Harvard Univ., Cambridge, MA.
2. Lepper, M.R. and Henderlong, J. (2000). Turning play into work and work into play: 25 years of research in intrinsic versus extrinsic motivation. In Sansone and Harackiewicz (Eds.), Intrinsic and Extrinsic Motivation: The Search for Optimal Motivation and Performance, 257-307. San Diego: Academic Press.
3. McKee, R. (1997). Story. Harper Collins, NY, NY.
4. Marsella, S., Johnson, W.L. and LaBore, C. (2000). Interactive Pedagogical Drama. In Proceedings of the Fourth International Conference on Autonomous Agents, 301-308.
5. Varni, J.W., Sahler, O.J., Katz, E.R., Mulhern, R.K., Copeland, D.R., Noll, R.B., Phipps, S., Dolgin, M.J., and Roghmann, K. (1999). Maternal problem-solving therapy in pediatric cancer. Journal of Psychosocial Oncology, 16, 41-71.
6. Vygotsky, L. (1978). Mind in society: The development of higher psychological processes. (M. Cole, V. John-Steiner, S. Scribner and E. Souberman, Eds. and Trans.). Cambridge, England: Cambridge University Press.

Interactive Agents Learning Their Environment

Michiel Hildebrand, Anton Eliens, Zhisheng Huang, and Cees Visser

Intelligent Multimedia Group
Vrije Universiteit, Amsterdam, Netherlands
{mhildeb,eliens,huang,ctv}@cs.vu.nl

Abstract. In this paper¹ we describe the implementation of interactive agents capable of gathering and extending their knowledge. Interactive agents are designed to perform tasks requested by a user in natural language. Using simple sentences the agent can answer questions, and in case a task cannot be fulfilled the agent must communicate with the user. In particular, an interactive agent can tell when necessary information for a task is missing, giving the user a chance to supply this information, which may in effect result in teaching the agent. The interactive agent platform is implemented in DLP, a tool for the implementation of 3D web agents. In this paper we discuss the motivation for interactive agents, the learning mechanisms and its realization in the DLP platform.

1 Introduction

Research done in the combined fields of computational linguistics, computer graphics and autonomous agents has led to the development of autonomous virtual characters. Agents with a humanoid appearance and autonomous behavior provide a user-friendly alternative to traditional interfaces. The agents may perform actions, display information or, in addition, gather information as well. They use language interaction to learn about their environment by adding or modifying their knowledge. The benefit of this approach is that agents need not be given all information in advance but can instead build up their knowledge during the process.

Our system is implemented in DLP [3], a distributed logic programming language suited for the implementation of 3D intelligent agents [4]. Avatars for the agents are built in the Virtual Reality Modelling Language (VRML). These avatars have a humanoid appearance based on the H-anim specification². The gestures for the agent are made in the STEP³ scripting language [5]. For natural language processing we use a type logical grammar [1], a resource sensitive formalism for syntax and meaning assembly.

Without going into much detail we mention work related to this project. The Gesture and Narrative Language group works on embodied conversational agents capable of multi-modal in- and output [8]. At the Synthetic Creatures group virtual agents are used to simulate animal behavior [9]. The Parlevink Institute created an animated instruction agent called Jacob [10]. RTI International developed Just-Talk, an application for virtual role-plays [11]. In comparison with the work just mentioned we may characterize our project as focusing on natural language interaction between an agent and the user. The agent uses logic-based learning to adapt to his environment.

¹ http://www.cs.vu.nl/~eliens/research/media/title-interactive.html
² http://h-ahim.org
³ http://wasp.cs.vu.nl/step

The structure of this paper is as follows: In section 2 we demonstrate how users and agents can interact with each other. Section 3 provides a formal description of interaction and learning. The realization of the platform is described in section 4. Section 5 ends the paper with a conclusion and suggestions for further research.

2 A Sample Scenario of Interaction

Figure 1 shows the application as it is run in a web browser. The main screen contains the virtual world including the agent. The interface is created with standard HTML forms. At the bottom there are fields for language input, two selection menus for command shortcuts and a status screen with four buttons to display the agent's knowledge. On the right side there are various shortcuts to demo actions and predefined viewpoints, and a selection list of active modules.

Fig. 1. Interactive agent situated in a virtual house

A user can give natural language input to an interactive agent by typing English sentences. Three types of sentences are available: commands, questions and declarative sentences. To indicate the type of a sentence the punctuation marks, respectively (!, ?, .), have to be used. The following sentences present a sample of possible inputs: Switch on the TV!, Sit on the table!, Where is your bed?, Can you give me a book?, Yes, you can sit on the table., There is a book inside the studyroom.
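The punctuation convention above can be made concrete with a small dispatcher. The following Python sketch is purely illustrative (the paper's system is written in DLP; the function name and return labels are assumptions, not part of the original platform):

```python
def classify_input(sentence: str) -> str:
    """Classify user input by its final punctuation mark, following the
    command / question / declarative convention described above."""
    sentence = sentence.strip()
    if sentence.endswith("!"):
        return "command"      # e.g. "Switch on the TV!"
    if sentence.endswith("?"):
        return "question"     # e.g. "Where is your bed?"
    if sentence.endswith("."):
        return "declarative"  # e.g. "There is a book inside the studyroom."
    return "unknown"

# Example dispatch over the sample inputs:
for text in ["Switch on the TV!", "Where is your bed?",
             "Yes, you can sit on the table."]:
    print(text, "->", classify_input(text))
```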

Given a command the agent will try to perform the corresponding action. In response to the command Switch on the TV! the agent will walk to the TV and raise his hand to switch the power button. Successful performance of an action is not always guaranteed. It could for example occur that the whereabouts of an object are unknown. To locate an object the agent uses a search routine. If no object is found the user is asked for help. In response the user can give an indication about where the object is located, for example there is a book inside the studyroom. Information exchange could also be reversed. For example the agent will respond to the question where is your bed? that it is inside the bedroom.

Interactive agents possess knowledge about their environment. In situations where common knowledge is not sufficient to make a judgement the agent will communicate with the user. For example an agent will not immediately sit on a table if requested. Instead he will tell the user he is not sure sitting on tables is allowed. The user can confirm, for example if all seats on the couch are taken, and say: Yes, you can sit on the table. The agent adds this new belief to his knowledge and takes a seat on the table.

3 Formal Description of Interaction

Input can result in three types of actions according to the types of sentences we distinguished. A command results in a basic action, a physical action in the environment. Questions can be used for two different reasons: to ask information, Is there a book on the table?, or to request an action, Can you give me the book?. The former results in an answer action while the latter results in a basic action. Declarative sentences are used to give information that agents can use to modify their knowledge, a learning action.

To control the performance of actions they are embedded in conditions and effects, in a similar way as the capabilities used in [6, 7]. The conditions check if the actions are possible in the current state. Only if all conditions are satisfied are the actions performed. The effects update the agent's knowledge according to the changes made in the environment. The actions themselves consist of calculations, physical actions, text outputs or references to other actions.

A state, which we will denote with the letter S, is determined by the agent's knowledge. The knowledge of interactive agents consists of objects and their properties. We characterize a state as a mapping from objects to (property, value) pairs. Given a state S an agent can perform the actions if the corresponding conditions C can be satisfied in that state. Formally we write:

    ⟨S, C⟩ --a--> S' ,                                (1)

where a is the observable behavior of the agent in his environment. The new state S' is obtained by modifying the agent's knowledge in S by the effects.
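As an illustration of this execution model, the following Python sketch (an assumption-laden toy, not the DLP implementation) represents a state S as a mapping from objects to (property, value) pairs and performs a basic action only when all of its conditions hold, after which its effects yield the new state S':

```python
# State S: object -> {property: value}, as characterized above.
State = dict

def holds(state: State, obj: str, prop: str, value) -> bool:
    return state.get(obj, {}).get(prop) == value

class BasicAction:
    """A basic action embedded in conditions and effects (illustrative)."""
    def __init__(self, name, conditions, effects):
        self.name = name
        self.conditions = conditions  # list of (obj, prop, value) to check
        self.effects = effects        # list of (obj, prop, value) to assert

    def perform(self, state: State) -> bool:
        # <S, C> --a--> S': performed only if all conditions C hold in S
        if not all(holds(state, o, p, v) for (o, p, v) in self.conditions):
            return False
        for (o, p, v) in self.effects:      # update S to the new state S'
            state.setdefault(o, {})[p] = v
        return True

switch_on_tv = BasicAction(
    "switch_on_tv",
    conditions=[("tv", "power", "off")],
    effects=[("tv", "power", "on")],
)
S = {"tv": {"power": "off"}}
print(switch_on_tv.perform(S), S)   # True {'tv': {'power': 'on'}}
```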

Learning. A learning action has no observable behavior. Learning affects the actions an agent can perform. In other words, we can describe a learning action as a transition from one state to another changing the set of possible actions. For example, the permission to sit on a table extends the actions the agent can perform, by adding the possibility to sit on tables.
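Continuing in the same illustrative spirit (again a hedged sketch, not the authors' code): a learning action only adds or modifies a belief, yet by doing so it changes which actions are possible afterwards.

```python
def possible_actions(state):
    """Actions whose conditions hold in the current knowledge state."""
    actions = []
    if state.get("table", {}).get("sitting_allowed"):
        actions.append("sit_on_table")
    return actions

S = {"table": {}}                       # no permission known yet
print(possible_actions(S))              # []

# Learning action triggered by "Yes, you can sit on the table.":
# it has no observable behavior, it only adds a belief ...
S["table"]["sitting_allowed"] = True

# ... but the set of possible actions has changed as a result.
print(possible_actions(S))              # ['sit_on_table']
```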

4 Realization

The agent is built from components as depicted in figure 2. The components are implemented in the distributed logic programming language DLP [3]. The avatar and the virtual world are constructed in VRML. Interaction between the agent and his environment is possible because DLP has programmatic control over VRML through the External Authoring Interface (EAI) [3]. All objects in the VRML world are created according to a prototypical structure. The sensor component uses this structure to check the object's properties. Perception is restricted to objects located in the same room and positioned in the agent's field of vision. The agent's body movements are defined with STEP, a scripting language for embodied agents based on dynamic logic [5]. The actuator component contains the STEP kernel to execute these scripts.

Fig. 2. Agent components

Given an input the controller creates the appropriate behavior based on the agent's knowledge. An answer action creates an appropriate answer by matching the term with the agent's beliefs. A learning action updates the agent's knowledge either by modifying or adding beliefs. A basic action is a collection of conditions, actions and effects. The basic actions, STEP scripts, beliefs and the lexicon are all treated as knowledge. Currently only beliefs can be learned. However, the system is constructed in such a way that it can be extended to learn the other types of knowledge as well. Beliefs gathered through perception give an objective view of the world, since their values are determined by the properties of objects. Subjective beliefs are gathered through interaction with the user, in the form of learning actions. Beliefs can also be given in advance to minimize the need for irrelevant interaction.

Natural language processing is done with Grail, a parser for type logical grammars [2]. A grammar is given by an assignment of types to the words in the lexicon. The types are constructed from atoms and logical connectives. An expression is well-formed if a derivation in the logic for these connectives is possible. Based on the Curry-Howard isomorphism the meaning of an expression is assembled according to this derivation [1].

5 Conclusions and Future Research

In this paper we have described a system where users interact with agents using natural language. In response to user input the agent can perform physical actions, give answers to questions or learn new information. If none of the three actions are possible the agent generates an output message to indicate what information is missing. This approach makes it possible for the agent to learn to act in unfamiliar environments. Currently the agent can learn by modifying his beliefs. In further research the system can be extended to learn other kinds of knowledge as well. For example actions can be learned by combining existing actions, or by defining simple STEP scripts.

References

[1] Moortgat, M. (2002). Categorial Grammar and Formal Semantics. Encyclopedia of Cognitive Science, Nature Publishing Group.

[2] Moot, R. (2002). Proof Nets for Linguistic Analysis. Ph.D. thesis, Utrecht Institute of Linguistics OTS, Utrecht University.

[3] Eliens, A. (1992). DLP, A Language for Distributed Logic Programming, Wiley.

[4] Eliens, A., Huang, C., Visser, C. (2002). A Platform for Embodied Conversational Agents based on Distributed Logic Programming. AAMAS Workshop – Embodied conversational agents - let's specify and evaluate them!, Bologna.

[5] Huang, Z., Eliens, A., Visser, C. (2003). STEP: a Scripting Language for Embodied Agents. In: Helmut Prendinger and Mitsuru Ishizuka (eds.), Life-like Characters, Tools, Affective Functions and Applications, Springer-Verlag.

[6] Hendriks, K., de Boer, F., vd Hoek, W., Meyer, J-J. (1999). Agent programming in 3APL. Autonomous Agents and Multi-Agent Systems, pp. 357-401.

[7] Panayiotopoulos, T., Anastassakis, G. (1999). Towards a virtual reality intelligent agent language. 7th Hellenic Conf. on Informatics, Ioannina.

[8] Cassell, J., Bickmore, B., Campbell, L., Vilhjalmsson, H., Yan, H. (2001). Conversation as a System Framework: Designing Embodied Conversational Agents. Embodied Conversational Agents, pp. 29-63. MIT Press.

[9] Isla, D., Burke, R., Downie, M., Blumberg, B. (2001). A Layered Brain Architecture for Synthetic Creatures. In Proc. of the Int. Joint Conf. on Artificial Intelligence (IJCAI), pp. 1051-1058, Seattle.

[10] Evers, M., Nijholt, A. (2000). Jacob - An animated instruction agent in virtual reality. In Proc. of the 3rd Int. Conf. on Multimodal Interaction.

[11] Frank, G., Hubal, R. (2002). An Application of Responsive Virtual Human Technology. In Proc. of the 24th Interservice/Industry Training, Simulation and Education Conf.

FlurMax: An Interactive Virtual Agent for Entertaining Visitors in a Hallway

Bernhard Jung and Stefan Kopp

Artificial Intelligence and Virtual Reality Lab
University of Bielefeld

http://www.techfak.uni-bielefeld.de/ags/wbski/

Abstract. FlurMax, a virtual agent, inhabits a hallway at the University of Bielefeld. He resides in a wide-screen panel equipped with a video camera to track and interact with visitors using speech, gesture, and emotional facial expression. For example, FlurMax will detect the presence of visitors and greet them with a friendly wave, saying "Hello, I am Max". FlurMax also recognizes simple gesturing of the by-passer, such as waving, and produces natural multimodal behaviors in response. FlurMax's behavior selection is controlled by a simple emotional/motivational system which gradually changes his mood between states like happy, bored, surprised, and neutral.

1 Introduction

In the AI & VR Lab in Bielefeld, a multimodal virtual agent, Max, is under development. Max has been employed as a human-like interlocutor in various applications involving a CAVE-like VR installation [1]. This contribution presents a new application of Max, FlurMax (German Flur: hallway), where he is visualized on a wide-screen panel located in a fairly well frequented hallway, next to the door of our lab (see Fig. 1). Below the panel, a video camera is mounted that provides color images of the hallway area as seen from Max's perspective. The agent's task is to continuously observe his environment and, based on his visual perception, to entertainingly interact with persons passing by or standing in front of the panel. Max thus acts as a kind of receptionist whose presence and non-obtrusive communicative behavior shall contribute to an overall friendly and creative atmosphere at the entrance to our lab.

Fig. 1. FlurMax installation: Max waving back at a by-passer (left); Max follows the by-passer's position with his eyes (center); a bored yawn (right).

Fig. 2. Main components of FlurMax's software architecture (visual perception, control module, emotional system, behavior execution).

2 Software Architecture

FlurMax's software architecture is composed of three main parts (see Fig. 2): (1) a visual perception component, (2) a central control module, and (3) a behavior execution component.

The visual perception component processes the video data from the camera. To ensure an overall reactive agent behavior, only real-time capable image recognition techniques are employed. Image analysis first scans the video data for skin-colored areas. The highest regions are then classified as faces and tracked over time. That way, the visual perception is able to discriminate between different persons (as long as no overlaps of face regions in the image occur). The positions of all visible faces are continuously sent to the control component. In addition, fast movements of small skin-colored regions, suddenly raised to a certain height, are interpreted as hand-waving and reported as separate events. To maintain a reliable perception over longer time periods, the image recognition component adapts itself to moderate changes of the overall lighting conditions.
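A rough sketch of such real-time skin-color segmentation in Python with OpenCV follows; the HSV thresholds, the minimum area, and the "highest region is the face" heuristic mirror the description above but are illustrative assumptions, not the authors' code.

```python
import cv2
import numpy as np

# Illustrative HSV bounds for skin-colored pixels; a real system would
# tune these and adapt them to changing lighting, as described above.
SKIN_LOW = np.array([0, 40, 60], dtype=np.uint8)
SKIN_HIGH = np.array([25, 180, 255], dtype=np.uint8)

def detect_face_and_hands(frame, min_area=500):
    """Return the bounding box of the highest skin-colored region (taken
    as the face) plus all other skin regions (candidate waving hands)."""
    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
    mask = cv2.inRange(hsv, SKIN_LOW, SKIN_HIGH)
    # OpenCV 4.x: findContours returns (contours, hierarchy)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    boxes = [cv2.boundingRect(c) for c in contours
             if cv2.contourArea(c) > min_area]
    if not boxes:
        return None, []
    face = min(boxes, key=lambda b: b[1])   # smallest y = highest region
    hands = [b for b in boxes if b is not face]
    return face, hands
```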

The control component receives events from the visual perception component and schedules FlurMax's behavior routines for reacting to changes in his environment. One main task of the control component is to direct the agent's gaze to appropriate target positions. Every time a new set of positions is delivered, the control component keeps track of the number of people currently present and decides who the agent should look at. In the majority of cases, Max tracks the face of a particular person; sometimes, Max switches his attention by chance to another person. In addition, verbal utterances like "Hello, I am Max" are randomly scheduled to welcome a newly spotted person.

The control component utilizes a simple emotional/motivational system that lets the agent's mood gradually vary between happy, bored, surprised, and neutral depending, e.g., on the presence or absence of people to interact with. The emotional state influences the agent's behavior in several ways: First, the internal emotional state is quantitatively reflected in the animation of Max's facial expression. Second, verbal utterances are modulated in prosody, a bad mood resulting in a lower average pitch, a narrowed amplitude of intonational variation and a slower speech rate. Third, and most importantly, the agent's emotional state affects the selection of behaviors. For example, periods lacking by-passer interaction increase the bored value, which eventually results in the scheduling of certain behaviors, such as Max leaning back, head cupped in hands, displaying frustrated inclination. Other behaviors resulting from Max being bored include taking deep breaths, yawning (see Fig. 1, right), head-scratching, and looking around as well as verbal complaints like "Nothing's up here" or even leaving the panel. In contrast, the presence of people increases Max's happy value by degrees, resulting in friendly facial expression, verbal utterances like "Have fun at work!" or "How are you?", and greetings involving a friendly wave (Fig. 1, left). To increase the variation of Max's behavior, the emotional system now and then amplifies the agent's surprise about visual perceptions, which is immediately reflected in a corresponding facial display. In addition, the dominant emotion is stochastically reduced at regular intervals.
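The qualitative dynamics described here (boredom rising without interaction, happiness rising when people are present, and the dominant emotion decaying stochastically at intervals) can be sketched as a small update loop. This is an illustrative Python toy; the variable names, rates, and thresholds are assumptions, not FlurMax's actual parameters.

```python
import random

class Mood:
    """Toy emotional/motivational state over happy, bored, surprised."""
    def __init__(self):
        self.values = {"happy": 0.0, "bored": 0.0, "surprised": 0.0}

    def update(self, people_present: int, surprise_event: bool = False,
               tick: int = 0):
        if people_present:
            self.values["happy"] = min(1.0, self.values["happy"] + 0.05)
            self.values["bored"] = max(0.0, self.values["bored"] - 0.1)
        else:
            self.values["bored"] = min(1.0, self.values["bored"] + 0.02)
        if surprise_event:
            self.values["surprised"] = 1.0
        if tick % 10 == 0:   # stochastic reduction at regular intervals
            dominant = max(self.values, key=self.values.get)
            self.values[dominant] *= random.uniform(0.8, 1.0)

    def dominant(self) -> str:
        name, value = max(self.values.items(), key=lambda kv: kv[1])
        return name if value > 0.3 else "neutral"

mood = Mood()
for t in range(1, 101):          # a long period without by-passers ...
    mood.update(people_present=0, tick=t)
print(mood.dominant())           # ... reports "bored"
```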

The central control module schedules behaviors by transmitting XML-based action specifications as requests to the execution component. Such requests are formulated in the representation language MURML [3] that provides flexible means of specifying prosodic speech, synchronized gestures of a definable form, emotional expression, locomotion, and arbitrary keyframe animations of the agent's body and face. Upon deciding which behavior to execute, the control component draws a parameterized MURML specification from a library and adapts it by inserting required parameters (e.g. the target location for the look-at behavior). Currently, the control component is implemented in CLIPS, a rule-based production system.

The behavior execution component is responsible for the real-time execution of action requests specified in MURML. The agent's underlying kinematic skeleton comprises 103 DOF in 57 joints, all of which are constrained to realistic joint limits. This articulated body can be driven either by applying keyframe animations or by means of a hierarchical gesture generation model that creates animations from specifications of spatiotemporal gesture properties [2]. The face of the agent is controlled by 21 muscles that are employed to animate lip-sync speaking movements, to create facial expression of emotion (examples shown in Fig. 3), or to perform arbitrary keyframe animations. Gesture generation and face animation are combined with a module for synthesizing prosodic speech in an overall production model for synchronized multimodal utterances (see [2]).

While the central control module is exclusively responsible for selecting and invoking primary behaviors, incessant low-level actions like eye blink and breath movements are controlled directly in the behavior execution component. To avoid interferences between behaviors, a mediator detects whether two behaviors are consuming conflicting body resources at the same time. In this case, the mediator removes the behavior with the lower priority value and the earlier start time. That is, lower-level actions as well as previous behaviors may be interrupted or skipped when a more recent, high-level behavior is to be executed.
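The conflict-resolution rule (drop the behavior with the lower priority, breaking ties in favor of the later start time) can be sketched as follows; the Behavior fields and the mediate function are assumptions used for illustration, not the actual implementation.

```python
from dataclasses import dataclass

@dataclass
class Behavior:
    name: str
    resources: frozenset   # body resources the behavior consumes
    priority: int
    start_time: float

def mediate(running, new):
    """Admit `new` unless a running behavior with higher priority uses a
    conflicting body resource; interrupted behaviors are dropped."""
    kept = []
    for b in running:
        if b.resources & new.resources:               # resource conflict
            loser = min((b, new), key=lambda x: (x.priority, x.start_time))
            if loser is new:
                return running                        # new behavior rejected
            # otherwise b (lower priority, earlier start) is interrupted
        else:
            kept.append(b)
    return kept + [new]

breathing = Behavior("breathe", frozenset({"chest"}), priority=0, start_time=0.0)
wave = Behavior("wave_and_greet", frozenset({"right_arm", "voice"}),
                priority=5, start_time=3.2)
print([b.name for b in mediate([breathing], wave)])  # ['breathe', 'wave_and_greet']
```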

Fig. 3. Facial display of various emotional states.

3 Discussion

FlurMax's personality was designed to be friendly and non-obtrusive to visitors. His facial animation capabilities, in principle, provide for a wide spectrum of emotional expressions, including negative ones (see Fig. 3). Likewise, negative emotional states, like anger, sadness, and disgust, are actually represented by corresponding variables in his emotional system. However, such states are considered inappropriate for FlurMax's task and, thus, remain intentionally unused in the current implementation.

Generally, FlurMax is well received by visitors of our hallway. Especially striking is the attribution of personality and human-like qualities by persons who are, at least in principle, fully aware of his lack of speech understanding and higher-level cognitive capabilities. For example, it is quite typical that FlurMax encourages by-passers to engage in natural conversational behavior ("Hello, Max!"). Although FlurMax's conversational repertoire is currently rather limited, he has proven himself a quite successful entertainer of visitors to our lab waiting for a demonstration of his CAVE-based sibling Max [1].

Current work on FlurMax aims at a camera-based recognition of the by-passers' emotional states. Further work concerns the development of simple chatterbot-like capabilities based on typed natural language.

Acknowledgements. The authors would like to thank Andreas Bartels, Lars Gesellensetter, Nicolas Gorges, Axel Haasch, Timo Krause, Ralf Kruger, and Carsten Spetzler, who developed the visual perception and the control component parts of the FlurMax system in the context of a student project.

References

1. S. Kopp, B. Jung, N. Lessmann, and I. Wachsmuth. Max - a multimodal assistant in virtual reality construction. KI - Künstliche Intelligenz, Special Issue Embodied Conversational Agents, 2003. arenDTap Verlag, Bremen, to be published.

2. S. Kopp and I. Wachsmuth. Model-based animation of coverbal gesture. In Proc. of Computer Animation 2002, pages 252–257, Los Alamitos, CA, 2002. IEEE Computer Society Press.

3. A. Kranstedt, S. Kopp, and I. Wachsmuth. MURML: A multimodal utterance representation markup language. Technical Report 2002/05, SFB 360 Situierte Künstliche Kommunikatoren, Universität Bielefeld, 2002.

Synthetic Emotension

Building Believability

Carlos Martinho, Mario Gomes, and Ana Paiva

Instituto Superior Tecnico, Taguspark Campus
Avenida Prof. Cavaco Silva, TagusPark, 2780-990 Porto Salvo, Portugal

{carlos.martinho,mario.gomes,ana.paiva}@dei.ist.ult.pt

Emotension: concatenation of the words emotion, attention, and tension, expressing the attentional and emotional predisposition towards an action, as well as the cognitive "tension" sustained during this action.

Abstract. We present our first steps towards a framework aiming at increasing the believability of synthetic characters through attentional and emotional control. The framework is based on the hypothesis that the agent mind works as a multi-layered natural evolution system, regulated by bio-digital mechanisms, such as synthetic emotions and synthetic attention, that qualitatively regulate the mind's endless evolution. Built on this assumption, we are developing a semi-autonomous module extending the sensor-effector agent architecture, handling the primary emotensional aspects of the agent behavior, thus providing the necessary elements to enrich its believability.

1 Introduction

Believability is a subjective yet critical concept to account for when creating and developing synthetic beings. Synthetic characters are a proven medium to enhance and enrich the interaction between the user and the machine, be it from the usability point of view or from the entertainment point of view. When focusing on the machine-to-user side of the interaction, the believability of the intervening artificial life forms plays an important role in the definition of the quality of the interaction. By believable character, we mean a digital being that "acts in character, and allows the suspension of disbelief of the viewer" [1].

Disney's concept of awareness [2] provides useful guidelines for building believable characters, specifying in particular the function of attention and emotions in building believability. Although it is more than 80 years old, Disney's approach to creating the illusion of life remains relevant. Although generally interpreted as "display the internal state of the character to the viewer", the concept of awareness suggests more: that expression should be consistent with the surrounding environment, especially in terms of the attentional and emotional reactions of the intervening characters. Consider the following example. Nita stands inside a room when Emy (the main character) enters. Nita should respond by looking at



Emy - implicitly, Nita's attention focus is on Emy. Furthermore, Nita should express an emotional reaction - perceived as caused by Emy, even if indirectly. The same behavior principle should be applied to all intervening characters, including Emy. The richness of the reactive response varies according to the character's importance, but it should always be present, as this behavior loop increases the believability of the main character.

Taking the concept of awareness one step further, this work investigates which mechanisms are suited to control both the focus of attention and the emotional reactions of a synthetic character, in order to increase its believability. Furthermore, it will assess whether such control can be performed on a semi-autonomous basis, that is, with a certain independence from the main processing of the agent. This would allow us to extend the base agent architecture¹ with a module designed to provide support for believability in synthetic character creation.

The document is organized as follows. The next section, "Architecture", presents the architecture extension being researched and exemplifies the role of the emotensional module. Afterwards, "Emotension" discusses the role of evolution, attention and emotion in the architecture. Finally, "Results" and "Conclusions" discuss our preliminary experiments and results.

2 Architecture

The emotensional module architecture is based on the hypothesis that the agent's mind is a natural evolution system of perceptions, regulated by emotension.

Fig. 1. Agent Architecture

The module is composed of two columns where perceptions evolve (Fig. 1): the sensor column, which intercepts the data flow coming from the sensors, and the effector column, which regulates the data flow entering the effector module. The choice of the term "column" will become clear in the next section. The intercepted data is used to update the current emotensional state of the agent. At all times, the current emotensional state is available to the processing module. The data flow between the base agent modules may be altered or induced by the emotensional module, transparently simulating sensor or effector information.

¹ Russell and Norvig's definition of agent: "anything that can be viewed as perceiving its environment through sensors and acting upon that environment through effectors".

The emotensional module transparently provides the agent with the building blocks of its believability. Let us go back to our example and analyse it at the architecture level. When Emy enters the room where Nita is standing, Nita's sensor column appraises the stimulus as highly relevant, since it is unexpected (involuntary attention) and raises a set of emotional memories associated with Emy in Nita's mind (somatic markers), pulled from long-term memory (by similarity with the current stimulus). Due to the nature of these memories, Nita becomes afraid. Regulated by Nita's emotensional state, the effector column starts evolving fear-oriented action tendencies which are fed to Nita's effectors, disclosing her inner state to the viewer. Nita becomes restless. Meanwhile, all emotensional stimuli continue their natural evolution process in the sensor column. Suddenly, and as a result of this evolution process, the memory of an unhappy episode with Emy and a glove pops up in Nita's mind. Nita's attention is now on the beige glove left on the table near her (involuntary attention provoked by the recalled event), although the one she remembers from the story was brown. She cannot take her eyes off the glove (a strong emotional reaction which temporarily floods the evolution process in the sensor column). The same type of processing happens in Emy's emotensional module. She entered the room searching for her missing glove (voluntary attention) and suddenly noticed that Nita was near (involuntary attention). Remembering the same past experience (somatic marker), she leaves the room laughing out loud (gloating behavior evolved in the effector column) while looking at Nita (focus of attention), who is paler than ever. The main point is that everything happened transparently. From the processing module's point of view, Emy entered the room, picked up her glove and left. Nothing more.

3 Emotension

In brief, the emotension module works as follows: the sensor column receives the perceptions and updates the emotensional state, while the effector column generates the action tendencies based on the current emotensional state.

The sensor column receives the agent's internal and external perceptions and appraises their relevance using a process inspired by the psychology of human attention and emotion. As current research shows [3], both concepts are interrelated. In our work, the notion of expectation brings them together. A signal is considered relevant when it is unexpected according to the extrapolations made from past observations, or when it is actively searched for (this information is provided by the processing module). So inner drives which suddenly change, objects which suddenly pop up, or objects that are being searched for will have high relevance. This approach is consistent with the voluntary-involuntary control dichotomy based on Posner's and Muller's theories of attention [4] as well

Page 73: [Lecture Notes in Computer Science] Intelligent Virtual Agents Volume 2792 ||

60 Carlos Martinho, Mario Gomes, and Ana Paiva

as with the notion of emotion as an interruption or warning mechanism [5]. It is also consistent with the developmental theories of emotions from psychology [5] which, following the seminal works of Watson (1929) and Bridges (1932), consider emotions to be a relevant selection mechanism, as well as an adapting mechanism controlling the individual's behavior and the control process itself. The relevance is presently calculated mathematically, using polynomial extrapolation. However, we aim at developing an evolution system to evaluate relevance.

Unexpected changes in inner drives generate primary emotions. For instance, while experiencing a "sugar need" and sensing the need increasing (as energy is spent on movement), we suddenly notice a fall (as energy is replenished after eating a candy): a relief emotion is launched. By monitoring a drive over time, we predict its next state. If the prediction is far from the value read from the drive, an emotion is raised. The greater the error, the higher the intensity. The type and valence of the emotion depend on the previous state, the expected state and the new state. Note that by mapping the concept of emotion onto the concept of drive, a unidimensional value with a resting (desired) state, we achieve a certain semantic independence from the agent's body.
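To illustrate the drive-monitoring mechanism just described, the minimal sketch below predicts the next value of a drive by polynomial extrapolation over its recent history and raises an emotion whose intensity grows with the prediction error. All names (predict_next, appraise_drive, the window size, the relief/distress labels and the resting state at 0.0) are illustrative assumptions, not the authors' implementation.

import numpy as np

def predict_next(history, degree=2):
    """Extrapolate the next drive value from its recent history by fitting
    a low-order polynomial (one possible reading of the 'polynomial
    extrapolation' mentioned above)."""
    t = np.arange(len(history))
    coeffs = np.polyfit(t, history, deg=min(degree, len(history) - 1))
    return float(np.polyval(coeffs, len(history)))

def appraise_drive(history, observed, threshold=0.1):
    """Raise a primary emotion when the observed drive value deviates from
    the prediction; the intensity grows with the error."""
    expected = predict_next(history)
    error = observed - expected
    if abs(error) < threshold:
        return None                       # nothing unexpected: no emotion
    intensity = min(1.0, abs(error))
    # valence depends on whether the drive moved towards or away from its
    # resting (desired) state, assumed here to be 0.0
    emotion = "relief" if abs(observed) < abs(expected) else "distress"
    return emotion, intensity

# Example: a rising "sugar need" that suddenly drops after eating a candy
history = [0.2, 0.3, 0.4, 0.5]
print(appraise_drive(history, observed=0.1))   # -> ('relief', 0.5)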

The external signal then enters the sensor memory (Fig. 2) associated with the calculated relevance, and joins the previously evolved perceptions. The evolution happening inside the column is implemented as a multi-layered cellular automaton. Each plateau is a bidimensional hexagonal wrapped cellular automaton where perceptions are copied, crossed, mutated, and decay. Perceptions also move between plateaux according to specific rules, allowing meaningful perceptions to be recorded in memory or recalled when emotensionally significant². All signals decay. Each plateau has a different decay rate. When a cell becomes empty, it is substituted by a combination of its neighbor cells, with added noise.
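One evolution step of a single plateau might look roughly like the following sketch: a wrapped 2D grid whose cells hold relevance-weighted signals, with a per-plateau decay rate and empty cells refilled from a noisy combination of their neighbours. The simplified neighbourhood (a stand-in for the hexagonal one), the decay law and all names are assumptions made for illustration only; a full implementation would also double-buffer the grid and handle crossover, mutation and movement between plateaux.

import random

class Plateau:
    """One layer of the sensor column: a wrapped grid of perception signals."""
    def __init__(self, size, decay_rate):
        self.size = size
        self.decay = decay_rate
        # each cell holds a relevance value, or None when empty
        self.cells = [[None] * size for _ in range(size)]

    def neighbours(self, x, y):
        # simplified 6-neighbourhood on a wrapped grid (hexagonal stand-in)
        offsets = [(-1, 0), (1, 0), (0, -1), (0, 1), (-1, 1), (1, -1)]
        return [self.cells[(x + dx) % self.size][(y + dy) % self.size]
                for dx, dy in offsets]

    def step(self):
        for x in range(self.size):
            for y in range(self.size):
                v = self.cells[x][y]
                if v is not None:
                    v *= (1.0 - self.decay)           # all signals decay
                    self.cells[x][y] = v if v > 0.01 else None
                else:
                    # empty cell: combine non-empty neighbours, add noise
                    vals = [n for n in self.neighbours(x, y) if n is not None]
                    if vals:
                        self.cells[x][y] = max(
                            0.0, sum(vals) / len(vals) + random.uniform(-0.05, 0.05))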

At all times, the most relevant signals in sensor memory define the primary emotensional state of the agent. This state is used to generate the agent's action tendencies. Although currently the effector column is limited to a random action based on the current emotensional state of the agent³, we aim at developing an action memory as the effector column depicted in Fig. 2.

4 Results

Our first results have been encouraging. Fig. 2 shows some snapshots of the simulations we are performing on the sensor column. Preliminary results point to:

– given time, the column can make recall associations, similar to the association of the beige-brown glove in the previous example. This can potentially enrich idle character behavior in a believable way, at no additional cost;

– a certain immunity to sensor noise and a certain randomness/diversity of directed behavior, due to the chaotic nature of the dynamic system;

– unfortunately, one parameter still has to be tuned: stimulus decay.

² Currently, the signals are compared in terms of intensity with the short-term memory (working memory) and in terms of similarity with the long-term memory.
³ An action is selected in response to an <emotion, object> pair.

Page 74: [Lecture Notes in Computer Science] Intelligent Virtual Agents Volume 2792 ||

Synthetic Emotension 61

Fig. 2. Emotensional module (left) and sensor column prototype (right)

5 Conclusion

We presented an architecture and briefly described the implementation of a module aimed at increasing the believability of synthetic characters. The main motivation is that by providing automated control of a synthetic character's emotensional reactions and following Disney's Illusion of Life guidelines, we will increase its believability.

The framework, based on the hypothesis that the agent mind works as a natural evolution system regulated by synthetic emotions and synthetic attention, provides the basis of a potentially robust, diverse, and generic way of handling the agent's perceptions. The architecture, which extends the base sensor-effector agent architecture, allows believability to emerge without explicit control from the processing module. The multi-layer cellular automaton prototype implementation allowed us to assess the preliminary feasibility of the enunciated concepts.

References

[1] Bates, J.: The role of emotions in believable agents. Technical report, Carnegie Mellon University (1994)

[2] Thomas, F., Johnston, O.: The Illusion of Life. Hyperion Press (1994)
[3] Wells, A., Matthews, G.: Attention and Emotion - a Clinical Perspective. Psychology Press (1994)
[4] Styles, E.: The Psychology of Attention. Psychology Press (1995)
[5] Strongman, K.: The Psychology of Emotions. John Wiley and Sons, Ltd (1996)


FantasyA – The Duel of Emotions

Rui Prada1, Marco Vala1, Ana Paiva1, Kristina Hook2, and Adrian Bullock2

1 IST and INESC-ID, Rua Alves Redol 9, 1000-029 Lisboa, Portugal
{rui.prada,marco.vala,ana.paiva}@gaips.inesc.pt

2 SICS, Box 1263, 164 29 Kista, Sweden
{kia,adrian}@sics.se

Abstract. FantasyA is a computer game where two characters face each other in a duel and emotions are used as the driving elements in the action decisions of the characters. In playing the game, the user influences the emotional state of his or her semi-autonomous avatar using a tangible interface for affective input, the SenToy. In this paper we show how we approached the problem of modelling the emotional states of the synthetic characters, and how to combine them with the perception of the emotions of the opponents in the game. This is done by simulating the opponents' action tendencies in order to predict their possible actions. To play, the user must understand the emotional state of his or her opponent, which is achieved through animations (featuring affective body expressions) of the character. FantasyA was evaluated with 30 subjects of different ages, and the preliminary results showed that the users liked the game and were able to influence the emotional states of their characters, in particular the young users.

1 Introduction

Believability is an important issue when constructing synthetic characters. Characters that are believable provide richer interactions to the users, engaging them more deeply in the interaction experience. Emotions have a crucial role in the creation of such believable characters; as Bates [2] stated, emotions create the "illusion of life" that drives the users to the suspension of disbelief.

A large part of the research on emotions in synthetic characters has been primarily concerned with the problem of expressing emotions [6] [10]. This is a fundamental and indeed quite difficult problem, and our technology so far does not allow us to obtain a truly believable synthetic character.

However, it is not only the expression of emotions that is essential to regulate communication with synthetic characters. Several other important aspects, such as gesture, speech, etc., are needed. In particular, the capability to understand the other's emotional state is part of that regulation process and must not be forgotten.

In this paper we show how we approached the problem of modelling the emotional states of other synthetic characters, combining it with the adequate emotional processes, action tendencies and expression in the agents. This was



done using the context of a computer game, FantasyA, where emotions are the essential mechanism for playing the game.

The remainder of this paper is organized as follows. First we describe FantasyA and how the interaction between the user and the system is achieved. Secondly we describe the emotional theory behind the scenes and how the characters take into account the emotions of others in making decisions. Finally we briefly describe a study conducted to evaluate the system and its results.

2 FantasyA

FantasyA is a computer game where users play the role of an apprentice wizard who is challenged to find the leader of her/his clan. In the first challenge the wizard must duel other apprentices in the magic arena until s/he masters the basic magic skills and is ready to proceed to the exploration of the land of FantasyA.

To control the characters in the game, players use the SenToy, a tangible interface in the form of a doll that allows the user to transmit emotions to a synthetic character (see [9] for more details). It allows the user to influence six emotions (anger, fear, surprise, gloat, sadness and happiness) by performing the associated gestures. E.g., moving the doll energetically up and down will induce happiness, while placing the doll's arms in front of its head will induce fear.

The duel is played in a virtual environment, the arena, by two software agents: one influenced by the user and another influenced by an AI player controlled by the system. The agents are semi-autonomous as they make their own decisions, but those decisions depend on the emotional state that was induced by the player.

The actions performed by the characters in the game are spells. Characters can cast offensive spells to inflict damage on the opponent, cast defensive spells to heal and protect themselves, or cast spells that give them more power during combat.

The duel is run in a turn-taking sequence. Each turn, the acting player induces an emotion in his/her character, the character acts according to its own and the opponent's emotions, and then both characters react emotionally to the results of the action performed. The acting player changes and a new turn is played. The game ends when the maximum number of allowed turns is reached or when a character has taken too much damage.
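A minimal sketch of this turn-taking loop is given below; the function and attribute names (run_duel, induce_emotion, decide_action, react_emotionally, health) are hypothetical stand-ins, and the real game of course involves SenToy input and animation rather than plain method calls.

def run_duel(player, ai_player, max_turns=20):
    """Alternate turns until the turn limit is reached or a character
    has taken too much damage (a sketch of the loop described above)."""
    characters = [player.character, ai_player.character]
    for turn in range(max_turns):
        actor = characters[turn % 2]
        opponent = characters[(turn + 1) % 2]
        controller = player if actor is player.character else ai_player

        actor.emotion = controller.induce_emotion()     # SenToy gesture / AI choice
        action = actor.decide_action(opponent.emotion)  # see Section 3.1
        result = action.apply(actor, opponent)
        for c in characters:
            c.react_emotionally(action, result)         # see Section 3.2

        if actor.health <= 0 or opponent.health <= 0:
            break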

3 Conflict of Emotions

The emotional state of each agent is essential to the whole workings of the game. Basically, it is the emotional state that constrains, or more accurately influences, the actions to be taken by the characters. However, given that this is a game and game-play must be considered, the action tendencies should produce concrete rules that can be easily learned by the player. This means that if


a player learns that making her/his character angry will induce it to attack, then when s/he influences the same emotion in the future the character should also attack, following the player's expectations. On the other hand, we did not want to map the emotions directly into actions (e.g. if a character is angry it always attacks) because this would weaken the role of emotion, as the player might misinterpret the emotional influence and fail to distinguish it from the action itself. So we decided to try an approach where the players had to consider not only their characters' emotions but also their opponents' emotions.

3.1 Action Decision

FantasyA characters use their emotions and their feeling about the opponent's emotions to decide what action to take. The decision is made based on the action tendencies that those emotions induce on the character. These action tendencies can be of two different types: induced by the character's own emotions or induced by the opponent's emotions. The design of the first type of action tendencies was supported by the emotion theories formulated by Lazarus [7], Darwin [3] and Ekman [4].

According to Lazarus, fear's action tendency is avoidance or escape; therefore a frightened character will favor defensive actions. Anger has an innate tendency for attack: angry characters will favor offensive actions. Sadness, in its turn, does not have a "clear action tendency - except inaction, or withdrawal into oneself"; a sad character prefers actions that do not involve the opponent, e.g. non-offensive actions. Happiness induces a sense of security in the world: happy characters are unconcerned about defense and favor offensive actions. Surprise appears in Darwin's definition on the same axis as fear, therefore surprised characters favor defensive spells. Paul Ekman describes gloating as an expression of anger when the relation towards the blameworthy object is one of clear superiority: gloating characters favor offensive actions.
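The mapping from a character's own emotion to an action tendency, as derived above from Lazarus, Darwin and Ekman, can be summarised in a small table; the dictionary name and the two tendency labels are ours, and sadness's "non-offensive" preference is folded into the defensive label for simplicity.

# own emotion -> action tendency induced by the character's own state
OWN_TENDENCY = {
    "fear":      "defensive",   # avoidance / escape (Lazarus)
    "anger":     "offensive",   # innate tendency to attack
    "sadness":   "defensive",   # inaction / withdrawal -> non-offensive actions
    "happiness": "offensive",   # sense of security, unconcerned about defense
    "surprise":  "defensive",   # on the same axis as fear (Darwin)
    "gloat":     "offensive",   # anger from a position of clear superiority (Ekman)
}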

To address the action tendencies induced by the others' emotions we looked at theories of empathy [11], emotional contagion [5] and social referencing [1]. Empathy and emotional contagion suggest mechanisms for transmitting emotions to others, while social referencing has been defined as the process of using another person's interpretive message, or emotional information, about an uncertain situation to form one's own understanding of that situation [1].

Following the social referencing theory, we can evaluate the situation and decide what to do based on the current emotion of the opponent. This emotion induces action tendencies on the opponent that can be assumed to be, and in fact are, the same as for the acting character. By imagining the action that the opponent is willing to perform, the character will have a tendency to counter that action. Therefore, if the opponent's emotional state is such that it induces an attack, the character should defend, otherwise it should attack (e.g. if the opponent is happy, this should mean that it feels comfortable about the current state of the duel and will attack, thus we should defend to counter its confidence in the attack). We agreed that the reaction to the situation depends highly on the personality of the character. In the example above we described the behaviour


of a cautious character, but if it had a more aggressive personality it might respond to the attack tendencies of the opponent with attacks and not defenses. Following this idea, and also to increase the richness of game-play, we defined different personalities for each clan, giving them action tendencies based on the opponent's emotion according to the personality.

Combining both tendencies, we get an overall action tendency for the character that is used in the action decision process. If both action tendencies are offensive then the character chooses strong offensive actions. On the other hand, if both action tendencies are defensive the character chooses strong defensive actions. If the two action tendencies mismatch then weak offensive and defensive actions are possible.
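Combining the two tendencies into a concrete action choice could then be sketched as follows, reusing the OWN_TENDENCY table above. The per-clan personality object and its respond_to method are hypothetical names; matching tendencies yield a strong action while a mismatch yields a weak one, as described in the text.

def decide_action(own_emotion, opponent_emotion, personality):
    """Return (strength, kind) of the chosen action, following the rules above."""
    own = OWN_TENDENCY[own_emotion]
    # what the opponent is expected to do, given its emotion
    expected_from_opponent = OWN_TENDENCY[opponent_emotion]
    # how this personality answers that expectation: e.g. a cautious clan
    # counters expected attacks, an aggressive clan meets them head on
    other = personality.respond_to(expected_from_opponent)

    if own == other == "offensive":
        return ("strong", "offensive")
    if own == other == "defensive":
        return ("strong", "defensive")
    # mismatching tendencies: only weak offensive or defensive actions
    return ("weak", own)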

3.2 Emotional Reaction

After the decision, the character performs the selected action and both characters react emotionally to the results. The emotional reaction depends on the action itself, its results (e.g. whether it succeeded or failed) and on the previous emotional state of the character. Theories like OCC [8] have described appraisal mechanisms that activate emotions in individuals according to the events they perceive. In FantasyA the emotional state creates an action expectation in the character based on the action tendency that the emotion has. This means that an angry character expects to attack its opponent, but this is not necessarily what happens because the action also depends on the opponent's emotion. Characters react differently to the action result depending on whether the action taken was within their expectations or not. In the case of failure, characters will react more drastically if the action was within the expectations. On the other hand, if the spell succeeds the reaction will be more enthusiastic if the action was expected (e.g. if the action was expected the character might gloat instead of just being happy). The emotion and the action result define guidelines for the reaction rules, as discussed above, but the reaction rules also depend on the particular character's personality.
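A rough sketch of these expectation-based reaction guidelines follows; it again reuses OWN_TENDENCY, and the specific emotions chosen on failure, as well as all attribute names, are our own illustrative assumptions rather than the game's actual rules, which also depend on the character's personality.

def react(character, action, succeeded):
    """Update the character's emotion after an action, following the
    expectation-based guidelines described above."""
    expected_kind = OWN_TENDENCY[character.emotion]    # action expectation
    was_expected = (action.kind == expected_kind)

    if succeeded:
        # more enthusiastic reaction when the expected action succeeds
        character.emotion = "gloat" if was_expected else "happiness"
    else:
        # more drastic reaction when the expected action fails
        character.emotion = "anger" if was_expected else "sadness"
    # a full implementation would also filter this reaction through the
    # character's personality, as noted in the text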

4 Study and Results

We conducted a study to evaluate FantasyA and the SenToy. The FantasyA evaluation was conducted with 30 subjects: 8 children, 12 high-school students and 10 adults, aged from 9 to 38. The students and children play computer games for an average of 10 hours per week, while the adults almost did not play at all. We ran 15 sessions of 50 to 90 minutes each. The subjects were given two sheets with the game rules, but not with the emotion rules behind the combat logic. The results were obtained from three sources: video observation, open-ended interviews and a questionnaire.

In general, the character expressions were well accepted and understood, but the more exaggerated ones were better perceived. On the other hand, the game logic seemed too complex, although some subjects got a few ideas about it, as we can see from the following comment:


"I believe that you should check somewhat what the other guy [the opponent] does. What he expresses. [..] Yes, because he is probably expressing the same things as our guy is. Then you react to that. But we did not do that very much. [..]" (adult player)

Regarding the entertainment aspect of the game we were very successful! All subjects were very pleased with the experience and some would even like to buy the game.

"This was a different game, enormously funny!" (adult player) or "It was a fun game that I hope will be released on the market sometime" (13-year old)

5 Conclusions

Although the game was a success in terms of how much the players liked it, we were not very successful in making the users understand the role of emotions in the game. One possible reason may be that the rules of the game are too complex to grasp in the short time given for the evaluation. On the other hand, most of the players performed some kind of mental mapping between the emotional gestures and the behaviour of their avatar, using the gestures to perform certain actions. Although this, at first glance, can be seen as a bad result, we do not think so, as the role of emotions was essential for the whole development of the system, and more specifically for the believability of the agents.

References

[1] M. S. Atkins. Dynamical analysis of infant social referencing. Master's thesis, Eberly College of Arts and Sciences at West Virginia University, 2000.
[2] J. Bates. The role of emotion in believable agents. Technical Report CMU-CS-94-13, Carnegie Mellon University, 1994.
[3] C. Darwin. The expression of emotions in man and animals: 3rd ed. by Paul Ekman. Oxford University Press, 1872/1998.
[4] P. Ekman. Emotion in the Face. New York, Cambridge University Press, 1982.
[5] E. Hatfield, J. Cacioppo, and R. Rapson. Emotional Contagion. Cambridge Press, 1994.
[6] J. Rickel and W. L. Johnson. Integrating pedagogical capabilities in a virtual environment agent. In L. Johnson and B. Hayes-Roth, editors, Autonomous Agents '97. ACM Press, 1997.
[7] R. Lazarus. Emotion and Adaptation. Oxford University Press, 1991.
[8] A. Ortony, G. Clore, and A. Collins. The Cognitive Structure of Emotions. Cambridge University Press, New York, reprinted 1994 edition, 1988.
[9] A. Paiva, M. Costa, R. Chaves, M. Piedade, D. Mourao, G. Hook, K. Andersson, and A. Bullock. SenToy: an affective sympathetic interface. International Journal of Human Computer Studies, 2003.
[10] C. Pelachaud and I. Poggi. Subtleties of facial expressions in embodied agents. Journal of Visualization and Computer Animation, 2002.
[11] L. Wispe. History of the concept of empathy. In N. Eisenberg and J. Strayer, editors, Empathy and its development. Cambridge Press, 1987.


Double Bind Situations in Man-Machine Interaction under Contexts of Mental Therapy

Tatsuya Nomura1,2

1 Faculty of Management Information, Hannan University
5-4-33, Amamihigashi, Matsubara, Osaka 580-8502, Japan

2 ATR Intelligent Robotics and Communication Laboratories

Abstract. This paper suggests some potential negative effects resulting from the application of artificial intelligence to mental therapy. In particular, it focuses on the relations between cultural trends of mental health and the clients' personal traits in mental therapy, and considers the therapeutic effects of robots and software agents based on the theory of double bind situations. Moreover, it reports on a current research plan.

1 Introduction

Recently the application of artificial intelligence (AI) to mental therapy has been studied [1,2]. The ultimate goals of this research are the substitution of therapeutic animals in animal therapy with pet robots, autonomous self-control learning by clients through interaction with intelligent software agents, the construction of novel therapeutic methodologies, and so on. In Japan, in particular, the decreasing worker population, the increasing number of elderly, and the recognized necessity for better welfare and mental health support have been referred to as the background for these studies.

Our previous research suggested that the artificial emotions of therapeutic robots and software agents may impact negatively on clients of mental therapy, from the perspectives of social psychology, sociology of emotions, narrative therapy, and sociology of health and illness [3,4]. However, the use of computers and robots in the context of mental therapy can itself influence some clients in the modern cultural trend of mental health, regardless of the existence of emotional systems in the therapeutic agents and robots.

This paper deals with the deeper relations between the cultural trends of mental health and the clients' personal traits in mental therapy using software agents and robots, that is, the relations between "psychologism" and anxiety traits towards computers and robots. We then suggest that clients of mental therapy using artificial intelligence may be forced into a kind of double bind situation [5]. Finally, we introduce our current research plan to investigate our assumption on the double bind situations of mental therapy clients.



2 Anxiety for Computers and Robots

The concept of computer anxiety refers to anxious emotions that prevent humans from using and learning computers in educational situations and daily life [6,7]. Anxiety can generally be classified into two categories: state and trait anxiety. Trait anxiety is a kind of personal characteristic that is stable within individuals. State anxiety can change depending on the situation and time, and computer anxiety is classified into this category. From the perspective of education, computer anxiety in individuals should and can be reduced by educationally appropriate programs, and several psychological scales for its measurement have been developed [6,7].

On the other hand, from the perspective of mental therapy, computer anxiety can influence the therapeutic effect of software therapy using artificial intelligence. If the client's computer anxiety is high, it can prevent interaction with the therapeutic software agents even if the agents are designed based on theories of mental therapy. Of course, communication anxiety should be considered even in therapy with human therapists [8]. However, it can be reduced during the therapy process by the therapist's careful treatment. It is not clear whether computer anxiety can be reduced during therapy processes conducted by software agents. Thus, anxiety should either be reduced before therapy proceeds, or another person should assist clients in interacting with the software agents during the therapy process.

Similar emotions of anxiety should also be considered in robotic therapy. Robotic therapy may be different from therapy using software agents in the sense that robots have concrete bodies and can influence the client's cognition. Thus, anxiety towards robots may be different from computer anxiety. However, it should be considered that anxiety may be caused by highly technological objects and communication with them. In this sense, anxiety towards robots is a complex emotion combining computer anxiety and communication anxiety [9].

3 Psychologism

The cultural trend called "psychologism" refers to a trend in modern society where psychiatric symptoms in individuals are internalized although they may be caused by social structures and cultural customs, and, as a result, the root social and cultural situations that need to be clarified are concealed. The Japanese sociologist S. Mori focused on psychologism in discussing the extreme self-control of people in modern society [10]. His theory is based on the theory of feeling rules by Hochschild [11] and the theory of the McDonaldization of Society (rationalization) by Ritzer [12].

The theory of feeling rules argues that people in modern society control not only the outer expression but also the inner evocation and suppression of their emotions according to specific rules corresponding to given social situations, and that extreme emotion management in service industries causes the alienation of workers, such as flight attendants [11]. Moreover, the theory of the McDonaldization of


Society argues that the principle of rationalization based on efficiency, calculability (quantification), predictability, and control by technology dominates many fields of modern society [12]. Mori's claim based on these theories is summarized as follows:

– In modern society, we are always forced to pay attention to our own and others' emotions in order not to emotionally hurt each other (the worship of the individual's character). Moreover, this worship of the individual's character and psychologism complement each other.

– Furthermore, psychologism and rationalization in modern society also complement each other, and, as a result, we are required to have a high degree of emotional self-control.

These statements imply that people in modern society are required to perform emotion management and are dependent on mental therapy for it. In addition, modern rationalism, as Ritzer pointed out [12], may also encourage a reduction of manpower in mental therapy, and, as a result, software and robotic therapy using artificial intelligence may be encouraged. Thus, people in modern society are forced to face therapeutic software agents and robots by the social pressure of the self-control of their emotions and mental health, and by rationalism in particular, if mental therapy becomes a duty of members of organizations such as businesses and schools. If these therapies are introduced without consideration of the anxiety that individuals may experience in their interaction with computers and robots, however, they may cause double bind situations for these individuals.

4 Double Bind Situations in Mental Therapy by AI

The Double Bind Theory, proposed as a source of schizophrenia from the viewpoint of social interactions in the 1950s [5], argues that schizophrenia may result not only from impacts on the mental level of individuals, such as trauma, but also from inconsistency in human communication. The conditions for a double bind are formalized as:

1. the existence of one victim (a child in many cases) and an assailant or some assailants (the mother in many cases),
2. the customization of cognition for double bind structures through repeated experience,
3. the first prohibition message with punishment,
4. the second prohibition message, inconsistent with the first one at another level (inconsistent situations),
5. a third message that prohibits the victim from stepping out of the inconsistent situation (prohibition of the victim's movement to a meta level of communication).

It is pointed out that the double bind theory itself has largely not been developed in the theoretical sense since the 1970s [13], and there has not been


enough empirical evidence showing that double bind situations are a source of schizophrenia [14]. Even if double bind situations are not a source of schizophrenia, however, the double bind theory has been applied in the clinical field as a basic concept of family system theory [15], and it is said that double bind situations frequently exist in daily life.

Under social pressure for the self-control of mental health, therapy using software agents and robots can cause a kind of double bind situation of which clients with high anxiety towards computers and robots are the victims. The clients are forced to face these systems by social pressure, but they cannot get sufficient therapeutic benefit due to their anxiety towards the systems if that anxiety is not reduced by appropriate treatment, owing to rationalism in the therapy process. Furthermore, social pressure prohibits them from stepping out of these situations because doing so signifies their rejection of accountability for their own mental health. In other words, this type of client cannot be treated with software or robotic therapy even if these software agents and robots are designed based on theories of mental therapy.

5 Current and Future Work

In order to verify our conclusion, we should investigate whether double bind situations can arise even in interaction with software agents and robots, depending on the situations and on personal traits, and how they influence their users. Because of ethical constraints on psychological experiments, we cannot design experiments that really cause double bind situations in mental therapy. Thus, we executed a psychological experiment in which the users of a software agent were given double bind messages by the agent [16]. Here, we briefly describe its content and results.

Our experiment was presented as a quiz game run by the software agent, and the subjects experienced pseudo double bind situations consisting of inconsistent messages from the agent about their answers and the prohibition of the subjects' exit from the games, implemented using simple animations. These games were designed based on a double bind model consisting of the theory of feeling rules and some social psychological theories on triad relations [17], and the mental reactions of the subjects were measured by their evaluation of the agent in a questionnaire. These questionnaires consisted of several pairs of adjectives (for example, "violent" - "gentle"), and the subjects selected one of seven grades between the two poles corresponding to these adjectives. Then, the scores were compared between the experimental group under the double bind situation and a control group. As a result, a t-test found a statistically significant difference on one item (p < .05, df = 8) and significant tendencies on some items (p < .10). Although these experiments did not take into account the computer anxiety of the subjects as a controlled variable, we are designing a new experiment using the existing questionnaire for computer anxiety [6].
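For illustration, comparing the semantic-differential scores of the experimental and control groups with an independent-samples t-test (df = 8 suggests two groups of five) could be done along the following lines; the numbers below are made up for the example and are not the study's data.

from scipy import stats

# hypothetical 7-point ratings on one adjective pair ("violent" - "gentle")
double_bind_group = [2, 3, 2, 1, 3]   # experimental group
control_group     = [5, 4, 6, 5, 4]

t, p = stats.ttest_ind(double_bind_group, control_group)
df = len(double_bind_group) + len(control_group) - 2
print(f"t = {t:.2f}, df = {df}, p = {p:.3f}")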

The same experiment should also be designed for robots. However, neither a strict concept of anxiety towards robots nor a psychological scale for its measurement has been proposed yet. We are currently constructing a concept of "robot


anxiety" and a psychological scale for it, and will ultimately design experiments using this scale [9].

Acknowledgment: This research was supported by the Telecommunications Advancement Organization of Japan.

References

1. Turkle, S.: Life on the Screen. Simon & Schuster (1995) (Japanese translation: M. Higure. Hayakawa-Shobo, 1998.)
2. Tashima, T., Saito, S., Kudo, T., Osumi, M., Shibata, T.: Interactive pet robot with an emotional model. Advanced Robotics 13 (1999) 225–226
3. Nomura, T., Tejima, N.: Critical consideration of applications of affective robots to mental therapy from psychological and sociological perspectives. In: Proc. 11th IEEE International Workshop on Robot and Human Interactive Communication (ROMAN 2002). (2002) 99–104
4. Nomura, T.: Problems of artificial emotions in mental therapy. In: Proc. IEEE International Symposium on Computational Intelligence in Robotics and Automation (IEEE CIRA 2003). (2003) (to appear)
5. Bateson, G.: Steps to an Ecology of Mind. Harper & Row (1972) (Japanese translation: Y. Sato. Shisaku-Sha, 1990.)
6. Hirata, K.: The concept of computer anxiety and measurement of it. Bulletin of Aichi University of Education 39 (1990) 203–212 (in Japanese)
7. Raub, A.C.: Correlates of computer anxiety in college students. PhD thesis, University of Pennsylvania (1981)
8. Pribyl, C.B., Keaten, J.A., Sakamoto, M., Koshikawa, F.: Assessing the cross-cultural content validity of the Personal Report of Communication Apprehension scale (PRCA-24). Japanese Psychological Research 40 (1998) 47–53
9. Nomura, T., Kanda, T.: On proposing the concept of robot anxiety and considering measurement of it. In: Proc. IEEE ROMAN 2003. (2003) (submitted)
10. Mori, S.: A Cage of Self-Control. Kodansha (2000) (in Japanese)
11. Hochschild, A.R.: The Managed Heart. University of California Press (1983) (Japanese translation: J. Ishikawa and A. Murufushi. Sekaishishosha, 2000.)
12. Ritzer, G.: The McDonaldization of Society. Pine Forge Press (1996) (Japanese edition: K. Masaoka (1999). Waseda University Press)
13. Ciompi, L.: Affektlogik. Klett-Cotta (1982) (Japanese translation: M. Matsumoto et al. (1994). Gakuju Shoin)
14. Koopmans, M.: Schizophrenia and the Family: Double Bind Theory Revisited. Dynamical Psychology (1997) http://goertzel.org/dynapsyc/1997/Koopmans.html (electronic journal)
15. Foley, V.D.: An Introduction to Family Therapy. Allyn & Bacon (1986) (Japanese translation: A. Fujinawa, et al. Sogensha, 1993.)
16. Ohnishi, K., Nomura, T.: Verification of mental influence in man-machine interaction based on double-bind theory. In: Proc. 34th Annual Conference of the International Simulation and Gaming Association (ISAGA 2003). (2003) (to appear)
17. Nomura, T.: Formal representation of double bind situations using feeling rules and triad relations for emotional communication. In: Cybernetics and Systems 2002 (Proc. 16th European Meeting on Cybernetics and Systems Research). (2002) 733–738





script(conduct_half(Intensity, BeatDuration, DownHold, UpHold, DownDynamism, UpDynamism, Hand), Action) :-
    Action = seq([
        get_parameters(beat_down, Intensity, StartHand, EndHand, StartWrist, EndWrist, Hand),
        perform_beat(EndHand, EndWrist, StrokeDuration, Dynamism, Hand),
        wait(DownHold),
        perform_beat(StartHand, StartWrist, StrokeDuration, Dynamism, Hand),
        wait(UpHold)]), !.

get_parameters(…)


A Model of Interpersonal Attitude and Posture Generation*

Marco Gillies1 and Daniel Ballin2

1 UCL@Adastral Park, University College London, Adastral Park, Ipswich IP5 3RE, UK,

[email protected], http://www.cs.ucl.ac.uk/staff/m.gillies

2 Radical Multimedia Lab, BTexact, Adastral Park, Ipswich IP5 3RE, UK,

[email protected]

Abstract. We present a model of interpersonal attitude used for generating expressive postures for computer animated characters. Our model consists of two principal dimensions, affiliation and status. It takes into account the relationships between the attitudes of two characters and allows for a large degree of variation between characters, both in how they react to other characters' behaviour and in the ways in which they express attitude.

Human bodies are highly expressive: a casual observation of a group of people will reveal a large variety of postures. Some people stand straight, while others are slumped or hunched over; some people have very asymmetric postures; heads can be held at many different angles, and arms can adopt a huge variety of postures each with a different meaning: hands on hips or in pockets; arms crossed; scratching the head or neck, or fiddling with clothing. Computer animated characters often lack this variety of expression and can seem stiff and robotic; however, posture has been relatively little studied in the field of expressive virtual characters. It is a useful cue as it is very clearly visible and can be displayed well on even fairly graphically simple characters. Posture is particularly associated with expressing relationships between people or their attitude to each other; for example, a close posture displays liking while drawing up to full height displays a dominant attitude. Attitude is also an area of expressive behaviour that has been less studied than, say, emotion. As such we have chosen to base our model of posture generation primarily on attitude rather than emotion or other factors.

1 Related Work

Various researchers have worked on relationships between animated characters. Prendinger and Ishizuka [7] and Rist and Schmitt [8] have studied the evolution of relationships between characters but, again, have not studied the non-verbal expression aspects. Cassell and Bickmore [4] have investigated models of relationships between characters and users. Closer to our work, Hayes-Roth and van Gent [5] have used status, one of our dimensions of attitude, to guide improvisational scenes between characters.

* This work has been supported by BTexact. We would like to thank Mel Slater and the UCL Virtual Environments and Computer Graphics group for their help and support and Amanda Oldroyd for the use of her character models.

Research on posture generation has been limited relative to research on generating other modalities of non-verbal communication such as facial expression or gesture. Cassell, Nakano, Bickmore, Sidner and Rich [3] have investigated shifts of posture and their relationship to speech, but not the meaning of the postures themselves. As such their work is complementary to ours. Becheiraz and Thalmann [2] use a one-dimensional model of attitude, analogous to our affiliation, to animate the postures of characters. Their model differs from ours in that it involves choosing one of a set of discrete postures rather than continuously blending postures. This means that it is less able to display varying degrees of attitude or combinations of different attitudes.

2 The Psychology of Interpersonal Attitude

We have based our model of interpersonal attitude on the work of Argyle [1] and Mehrabian [6]. Though there is an enormous variety in the ways that people can relate to each other, Argyle identifies two fundamental dimensions that can account for a majority of non-verbal behaviour: affiliation and status. Affiliation can be broadly characterised as liking or wanting a close relationship. It is associated with close postures, either physically close such as leaning forward, or other close interaction such as a direct orientation. Low affiliation or dislike is shown by more distant postures, including postures that present some sort of barrier to interaction, such as crossed arms. Status is the social superiority (dominance) or inferiority (submission) of one person relative to another. It also covers aggressive postures and postures designed to appease an aggressive individual. Status is expressed in two main ways: space and relaxation. High status can be expressed by making the body larger (rising to full height, wide stance of the legs) while low status is expressed with postures that occupy less space (lowering the head, being hunched over). People of high status are also often more relaxed, being in control of the situation (leaning, sitting and asymmetric postures), while lower status people can be more nervous or alert (fidgeting, e.g. head scratching). The meanings of the two types of expression are not fully understood, but Argyle [1] suggests that space filling is more associated with establishing status or with aggressive situations, while relaxation is more associated with an established hierarchy.

Attitude and its expression can depend both on the general disposition of the person and on their relationship to the other person; for example, status depends both on whether they are generally confident and on whether they feel superior to the person they are with. The expression of attitude can also vary between people, both in style and degree.


The relationship between the attitude behaviour of two people can take two forms: compensation and reciprocation. Argyle presents a model in which people have a comfortable level of affiliation with another person and will attempt to maintain it by compensating for the behaviour of the other; for example, if the other person adopts a closer posture they will adopt a more distant one. Similar behaviour can be observed with status, people reacting to dominant postures with submission. Conversely, there are times when more affiliation generates liking and is therefore reciprocated, or when dominance is viewed as a challenge and so met with another dominant posture. Argyle suggests that reciprocation of affiliation occurs in the early stages of a relationship. Status compensation tends to occur in an established hierarchy, while challenges occur outside of a hierarchy.

3 Implementation

This section presents a model of interpersonal behaviour that is used to generate expressive postures for pairs of interactive animated characters. The model integrates information about a character's personality and mood, as well as information about the behaviour and posture of the other character. Firstly a value for each of the two attitude dimensions is generated and then this is used to generate a posture for the character. An overview of the process is shown in Figure 1. As described below, this process is controlled by a number of weights that are able to vary the character's behaviour, thus producing different behaviour for different characters. Values for these weights are saved in a character profile that is loaded to produce behaviour appropriate to a particular character.

The first stage in the process is to generate a value for each of the dimensions of attitude. As described above, these depend both on the character itself and on the behavior of the other character. The character's own reactions can be controlled directly by the user. A number of sliders are presented to the user with parameters that map onto the two dimensions. They take two forms: parameters representing the personality of the character, for example "friendliness", which maps onto affiliation, and parameters representing the character's evaluation of the other character, for example "liking of other". These parameters are combined with variables corresponding to the posture types of the other character (see below) to produce a final value for the attitude. For example, affiliation depends on how close or distant the other person is being, and possibly on other factors such as how relaxed the other character is. Thus the equation for affiliation is:

affiliation = \sum_i w_{self,i} \, sliderValue_i + \sum_i w_{other,i} \, postureType_i

where w_{self,i} is a weighting over the parameters representing the character's own reactions and w_{other,i} is a weighting over the other character's posture types. These weights not only control the relative importance of the various posture types, but their sign also controls whether the character displays reciprocation or compensation. There is an equivalent equation for status.
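A direct transcription of this equation in code might look as follows; the slider and posture-type names are illustrative rather than taken from the paper, and a positive weight in the "other" term corresponds to reciprocation while a negative one corresponds to compensation.

def attitude_value(self_weights, slider_values, other_weights, posture_types):
    """Compute one attitude dimension (affiliation or status) as the
    weighted sum defined above."""
    own = sum(w * slider_values[name] for name, w in self_weights.items())
    other = sum(w * posture_types[name] for name, w in other_weights.items())
    return own + other

# Example character profile for affiliation
self_weights  = {"friendliness": 0.6, "liking_of_other": 0.4}
other_weights = {"close": 0.5, "relaxation": 0.2}      # positive -> reciprocation
slider_values = {"friendliness": 0.8, "liking_of_other": 0.5}
posture_types = {"close": 0.7, "relaxation": 0.3}      # read from the other character
print(attitude_value(self_weights, slider_values, other_weights, posture_types))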

Fig. 1. The posture generation process.

The attitude values are used to generate a new posture. Firstly they are mapped onto a posture type, which represents a description of a posture in terms of its behavioural meaning, as discussed in section 2. The posture types are: close (high affiliation), distant (low affiliation), space filling (high status), shrinking (low status), relaxation (high status) and nervousness (low status). As attitudes can be expressed in different ways, or to a greater or lesser degree, the mapping from attitude to posture type is controlled by a weighting for each posture type that is part of a character's profile. As well as being used to generate concrete postures, the posture type values are also passed to the other character, to be used as described above. The posture type values are clamped between 0 and 1 to prevent extreme postures.

Each posture type can be represented in a number of different ways; for example, space filling can involve rising to full height or putting hands on hips, while closeness can be expressed as leaning forward or adopting a more direct orientation (or some combination). Actual postures are calculated as weighted sums over a set of basic postures, each of which depends on a posture type. The basic postures were designed based on the descriptions in Argyle [1] and Mehrabian [6] combined with informal observations of people in social situations. The weight of each basic posture is the product of the value of its posture type and its own weight relative to the posture type. The weights of the basic postures are varied every so often so that the character changes its posture without changing its meaning, thus producing a realistic variation of posture over time. Each basic posture is represented as an orientation for each joint of the character, and the final posture is calculated as a weighted sum of these orientations. Figure 2 shows example output postures.
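The blending step could be sketched as follows, where each basic posture is a table of per-joint rotations and the final posture is their weighted combination. For clarity this uses a naive per-component weighted sum of Euler angles; a production system would interpolate orientations more carefully (e.g. with quaternions). All data-structure names are assumptions, not the authors' code.

def blend_postures(basic_postures, posture_type_values, relative_weights):
    """basic_postures:      {name: {joint: (rx, ry, rz)}}
       posture_type_values: {posture_type: value in [0, 1]}
       relative_weights:    {name: (posture_type, weight within that type)}"""
    # weight of each basic posture = posture-type value * its own relative weight
    weights = {name: posture_type_values[ptype] * w
               for name, (ptype, w) in relative_weights.items()}
    total = sum(weights.values()) or 1.0

    blended = {}
    for name, posture in basic_postures.items():
        w = weights[name] / total
        for joint, rot in posture.items():
            acc = blended.setdefault(joint, [0.0, 0.0, 0.0])
            for i in range(3):
                acc[i] += w * rot[i]
    return {joint: tuple(r) for joint, r in blended.items()}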

4 Conclusion

We have explored the use of interpersonal attitude for the generation of body language, and in particular posture. Our initial results are encouraging; in particular, attitude seems to account for a wide range of human postures. Figure 2 shows some examples of postures generated for interacting characters.


Fig. 2. Examples of postures generated displaying various attitudes. (a) Affiliation reciprocated by both parties, displaying a close posture with a direct orientation and a forward lean. (b) The male character has high affiliation and the female low affiliation, turning away with a distant crossed-arm posture. (c) Both characters are dominant; the female has a space filling, straight posture with a raised head, while the male also has a space filling posture with a hand on his hips. (d) The male character responds submissively to the dominant female character; his head is lowered and his body is hunched over. (e) The female character responds with positive affiliation to the male character's confident, relaxed, leaning posture. (f) A combined posture: the female character shows both low affiliation and high status and the male character low affiliation and low status.

References

1. Michael Argyle: Bodily Communication. Routledge (1975)
2. Becheiraz, P. and Thalmann, D.: A Model of Nonverbal Communication and Interpersonal Relationship Between Virtual Actors. Proceedings of the Computer Animation. IEEE Computer Society Press (1996) 58–67
3. Cassell, J., Nakano, Y., Bickmore, T., Sidner, C., Rich, C.: Non-Verbal Cues for Discourse Structure. Proceedings of the 41st Annual Meeting of the Association of Computational Linguistics, Toulouse, France (2001) 106–115
4. Cassell, J., Bickmore, T.: Negotiated Collusion: Modeling Social Language and its Relationship Effects in Intelligent Agents. User Modeling and User-Adapted Interaction 13(1-2) (2003) 89–132
5. Barbara Hayes-Roth and Robert van Gent: Story-Making with Improvisational Puppets. In Proc. 1st Int. Conf. on Autonomous Agents (1997) 1–7
6. Albert Mehrabian: Nonverbal Communication. Aldine-Atherton (1972)
7. Helmut Prendinger and Mitsuru Ishizuka: Evolving social relationships with animate characters. In Proceedings of the AISB symposium on Animating expressive characters for social interactions (2002) 73–79
8. Thomas Rist and Markus Schmitt: Applying socio-psychological concepts of cognitive consistency to negotiation dialog scenarios with embodied conversational characters. In Canamero and Aylett (eds) Animating Expressive Characters for Social Interaction. John Benjamins (in press)


Modelling Gaze Behavior for Conversational Agents

Catherine Pelachaud1 and Massimo Bilvi2

1 IUT of Montreuil, University of Paris 8, LINC - [email protected]

http://www.iut.univ-paris8.fr/~pelachaud
2 Department of Computer and Systems Science, University of Rome "La Sapienza"

Abstract. In this paper we propose an eye gaze model for an embodied conversational agent that embeds information on communicative functions as well as statistical information on gaze patterns. The latter information has been derived from analytic studies of an annotated video corpus of conversation dyads. We aim at generating different gaze behaviors to simulate several personalized gaze habits of an embodied conversational agent.

1 Introduction

Toward the creation of more friendly user interfaces, embodied conversational agents (ECAs) are receiving a lot of attention. To be more believable, these agents should be endowed with communicative and expressive capacities similar to those exhibited by humans (speech, gestures, facial expressions, eye gaze, etc.). In the context of the EU project MagiCster³, we aim at building a prototype of a conversational communication interface that makes use of non-verbal signals when delivering information, in order to achieve effective and natural communication with humans or artificial agents. To this aim, we create an ECA, Greta, that incorporates communicative conversational aspects. To determine speech-accompanying non-verbal behaviors, the system relies on a taxonomy of communicative functions proposed by [14]. A communicative function is defined as a pair (meaning, signal), where meaning corresponds to the communicative value the agent wants to communicate and signal to the behavior used to convey this meaning. To control the agent we use a representation language, called ‘Affective Presentation Markup Language’ (APML), whose tags are the communicative functions [13]. Our system takes as input the text (tagged with APML) that the agent has to say. The system instantiates the communicative functions into the appropriate signals. The output of the system is the audio and the animation files that drive the facial model (for further details see [13]).

³ IST project IST-1999-29078; partners: University of Edinburgh, Division of Informatics; DFKI, Intelligent User Interfaces Department; Swedish Institute of Computer Science; University of Bari, Dipartimento di Informatica; University of Rome, Dipartimento di Informatica e Sistemistica; AvartarME

After presenting related work, we present our gaze model. The model embeds a communicative function model and a statistical model (Sections 3 and 4). The gaze model is implemented using a Belief Network (BN). In Sections 5 and 6 we present the algorithm and the parameters that are used to simulate several gaze pattern types. We end the paper by describing examples of our gaze model (Section 7).

2 Related Work

Of particular interest for our work are approaches that aim to produce communicative and affective behaviors for embodied conversational characters (e.g., by Ball & Breese [1], Cassell et al. [5,4], Lester et al. [11], Lundeberg & Beskow [12], Poggi et al. [15]). Some researchers concentrate on gaze models to emulate turn-taking protocols [2,5,6,17,4], to call for the user's attention [18], to indicate objects of interest in the conversation [2,11,16], or to simulate the attending behaviors of agents during different activities and for different cognitive actions [7]. On the other hand, [8,9,10] use a statistical model to drive eye movements. In particular, the model of Colburn et al. [8] uses hierarchical state machines to compute gaze for both one-on-one conversations and multi-party interactions. Fukayama et al. [9] use a two-state Markov model which outputs gaze points in space derived from three gaze parameters (amount of gaze, mean duration of gaze, and gaze points while averted). These three parameters have been selected based on gaze perception studies. While the previous research focused more on eye gaze as a communication channel, Lee et al. [10] propose an eye movement model based on empirical studies of saccades and statistical models of eye-tracking data. An eye saccade model is provided for both talking and listening modes. The eye movement is very realistic, but no information on the communicative functions of gaze drives the model. Most models presented so far concentrate either on the communicative aspects of gaze or on a statistical model. In this paper we propose a method that combines both approaches to obtain a more natural as well as meaningful gaze behavior.

3 The Eye Gaze Model

In previous work, we have developed a gaze model based on the communicative function model proposed by Poggi et al. [15]. This model predicts what the value of gaze should be in order to convey a given meaning in a given conversational context. The model has some drawbacks, as it does not take into account the duration that a given signal remains on the face. To embed this model within temporal considerations, as well as to compensate somewhat for missing factors in our gaze model (such as social and cultural aspects), we have developed a statistical model. That is, we use our previously developed model to compute what the communicative gaze behavior should be; the gaze behavior output by this model is then probabilistically modified. The probabilistic model is not simply a random function; rather, it is a statistical model defined with constraints. This model has been built using data reported in [3]. The data correspond to interactions between two subjects lasting between 20 and 30 minutes. A number of behaviors (vocalic behaviors, gaze, smiles and laughter, head nods, back channels, posture, illustrator gestures, and adaptor gestures) were coded every 1/10th of a second. Analysis of this data (cf. [3]) was done with the aim of establishing two sets of rules: the first, called ‘sequence rules’, refers to the time at which a behavior change occurs and its relation with other behaviors (is mutual gaze broken by both conversants breaking the gaze simultaneously, or one after the other?); the second, called ‘distributional rules’, refers to probabilistic analysis of the data (what is the probability of mutual gaze and mutual smile?). Our model comprises two main steps:

1. Communicative prediction: First, the communicative function model introduced in [13] and [15] is applied to compute the gaze behavior that conveys a given meaning.

2. Statistical prediction: The second step is to compute the final gaze behavior using a statistical model, considering information such as: the gaze behavior of the Speaker (S) and the Listener (L) computed in step one of our algorithm, the gaze behavior S and L were in previously, and the durations of the current gaze states of S and L.

4 Statistical Model

The first step of the model has already been described elsewhere [13,15]. In the remainder of this section we concentrate on the statistical model. We use a Belief Network (BN) made up of several nodes (see Figure 1). Suppose we want to compute the gaze states of two agents, one being the speaker S and the other being the listener L, at time Ti. The nodes are:

– Communicative Functions Model: these nodes correspond to the communicative functions occurring at time Ti. These functions have been extended from the set specified in [15] to take into account the Listener's functions, such as back-channel and turn-taking functions. The nodes are:

• MoveSpeaker STi: the gaze state of S at time Ti. The set of possible states is {0, 1}, corresponding respectively to the states look away and look at.

• MoveListener LTi: the gaze state of L at time Ti. The set of possible states is {0, 1}.

– Previous State: these nodes denote the gaze direction at time Ti−1 (the previous time). As for the previous nodes, the possible values are 0 and 1. We consider:

• PrevGazeSpeaker STi−1: the gaze state of S at time Ti−1.
• PrevGazeListener LTi−1: the gaze state of L at time Ti−1.


– Temporal consideration: these nodes monitor how long S (respectively L) has been in a given gaze state. They ensure that neither S nor L will be blocked in a given state for too long.
• SpeakerDuration SD: this node is used to “force” the speaker to change her current gaze state if she has been in this particular state for too long. The set of possible states is {0, 1}, corresponding respectively to less than duration D (meaning that S has been in the current gaze state for less time than a given duration D) and greater than duration D (S has been in the current gaze state for more time than a given duration D).
• ListenerDuration LD: same function as the SpeakerDuration node, but for the listener.

– NextGaze (S′Ti, L′Ti): the gaze state of both agents at time Ti. The state is computed by setting the root nodes with the respective values and by propagating the probabilities to the leaf node. The set of possible states is: { G00 = (0, 0), G01 = (0, 1), G10 = (1, 0), G11 = (1, 1) }.

Fig. 1. The belief network used for the gaze model

The transition from Ti−1 to Ti is phoneme-based; that is, at each phoneme the system instantiates the BN nodes with the appropriate values to obtain the next gaze state (S′Ti, L′Ti) from the BN. The weights specified within each node of the BN have been computed using the empirical data reported in [3] and respect the sequence and distributional rules [3]. For example, the BN has been built so that a change of state corresponding to ‘breaking mutual gaze’ may not happen by both agents breaking the gaze simultaneously; that is, given the previous states STi−1 = 1 and LTi−1 = 1, our model does not allow the next gaze state to be set to STi = 0 and LTi = 0.

5 Temporal Gaze Parameters

We aim at simulating not a generic and unique gaze behavior type but personalized gaze patterns. The gaze behaviors ought to depend on the communicative functions one desires to convey as well as on factors such as the general purpose of the conversation (persuasive discourse, teaching, ...), personality, cultural roots, social relations, and so on. Not having precise information on the influence each of these factors may have on gaze behavior, we introduce parameters that characterize the gaze patterns. Thus we do not propose a gaze model for a given culture or social role; rather, we propose parameters that control the gaze behavior itself. The parameters we consider are the maximum durations of the mutual gaze state S = 1 and L = 1, of gaze state S = 1, of gaze state L = 1, of gaze state S = 0, and of gaze state L = 0. Their role is to control the gaze pattern in an overall manner rather than phoneme by phoneme as is done in the BN. They provide further control over the overall gaze behavior. We notice that by using different values for the temporal gaze parameters we can simulate different gaze behaviors. For example, if we want to simulate a shy agent who glances very briefly at the interlocutor, we can lower the values for mutual gaze and for gaze state S = 1 while raising the value for gaze state S = 0. On the other hand, to simulate two agents that are very good friends and look at each other a lot, we can raise the time value for mutual gaze.
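As an illustration, the two parameter settings used in Section 7 can be written down as simple configuration tables; the key names below are our own, and the values are those of Cases 1 and 2.

```python
# Maximum duration (seconds) allowed for each gaze state.
# Values follow Cases 1 and 2 of Section 7; the key names are ours.
SHY_SPEAKER = {            # the speaker barely looks at the listener
    "mutual_gaze":  1.54,  # S = 1 and L = 1
    "S_look_at":    0.70,  # S = 1
    "L_look_at":    2.08,  # L = 1
    "S_look_away":  3.27,  # S = 0
    "L_look_away":  1.81,  # L = 0
}

FRIENDLY_AGENTS = {        # the two agents look at each other a lot
    "mutual_gaze":  4.24,
    "S_look_at":    4.50,
    "L_look_at":    4.00,
    "S_look_away":  1.27,
    "L_look_away":  0.81,
}
```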

6 Algorithm for Gaze Personalization

Let us now see how the computation of the gaze behavior is done, using the temporal gaze parameters, the communicative function and statistical models, and the BN. The first step is to compute the values of the BN nodes. The instantiation of the node values MoveSpeaker and MoveListener is provided by the model of communicative functions [15]. PrevGazeSpeaker and PrevGazeListener get the values of the previous gaze state (i.e., at time Ti−1). SpeakerDuration and ListenerDuration correspond to the time a given gaze state has lasted. NextGaze is computed by propagating the probabilities in the BN. The outcomes are probabilities P(G) for each of the four possible states of NextGaze, namely G = {G00, G01, G10, G11}. These four states correspond to the set of possible candidates. From this set of possible solutions for the next gaze value, we build the set of valid states by considering only the probabilities that are greater than a given threshold. The final candidate is obtained by drawing uniformly from this set.
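A minimal sketch of this selection step is given below. The function bn_propagate stands in for the belief network of Figure 1 (which could be implemented with any BN library), and the threshold value is a placeholder; both are our assumptions rather than details taken from the paper.

```python
import random

STATES = [(0, 0), (0, 1), (1, 0), (1, 1)]   # (S, L): 0 = look away, 1 = look at

def next_gaze(bn_propagate, comm_s, comm_l, prev_s, prev_l,
              dur_s, dur_l, max_dur_s, max_dur_l, threshold=0.05):
    """One per-phoneme update of the gaze state (illustrative sketch).

    comm_s, comm_l: gaze values proposed by the communicative function model.
    prev_s, prev_l: gaze state at the previous phoneme (time Ti-1).
    dur_s, dur_l:   how long S and L have been in their current state.
    max_dur_s/l:    temporal gaze parameters, e.g. {0: 3.27, 1: 0.70} for S.
    bn_propagate:   stand-in for the BN; returns P(G) for each NextGaze state.
    """
    # Duration nodes: 1 means the state has lasted longer than duration D.
    sd = int(dur_s > max_dur_s[prev_s])
    ld = int(dur_l > max_dur_l[prev_l])

    probs = bn_propagate(comm_s, comm_l, prev_s, prev_l, sd, ld)

    # Valid candidates: probability above the threshold. The BN weights already
    # forbid both agents inverting their gaze simultaneously (sequence rule [3]);
    # the explicit filter below just makes that constraint visible.
    valid = [g for g in STATES
             if probs[g] > threshold and g != (1 - prev_s, 1 - prev_l)]

    # Final candidate drawn uniformly from the valid set.
    return random.choice(valid) if valid else (prev_s, prev_l)
```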

7 Examples

We aim at simulating gaze behavior by setting different values for the temporal gaze parameters. In this section we illustrate two cases of gaze behavior. In the first case, the speaker barely looks at the listener, while in the second case both agents look at each other quite a lot (“mutual gaze” is set to a high value).

7.1 Case 1

Let us set the values of the gaze parameters to be: T_max(S=1, L=1) = 1.54; T_max(S=1) = 0.70; T_max(L=1) = 2.08; T_max(S=0) = 3.27; T_max(L=0) = 1.81. Table 2 reports the gaze distribution as output by the BN.


S/L      (0, 0)   (0, 1)   (1, 0)   (1, 1)
(0, 0)   34.76     1.42     0.28     0.0
(0, 1)    1.42    30.48     0.0      1.99
(1, 0)    0.28     0.0      3.42     0.28
(1, 1)    0.0      1.71     0.28    23.65

Table 1. Case 1 – State transition probabilities (rows: previous state (S, L); columns: next state)

Fig. 2. Case 1 - Gaze behavior for S (continuous line) and for L (dotted line)

As we can see, the speaker agent looks away from the listener 70.17% of the time. Table 1 provides the probabilities of changing state as specified by the BN. That is, given a certain gaze state (STi−1, LTi−1), the table gives the probability of going to each of the four possible states (STi, LTi). Examining this table, we notice that all the states in which the gaze of S would remain equal to 1 (“look at”) after a transition (e.g., (1,0) to (1,0) or to (1,1)) have much smaller values than the transitions to states with S = 0. Thus, we can deduce that S will gaze at L in short glances. This result is further illustrated in Figure 2: we can see that the gaze of S is set to 0 (“look away”) for longer periods than to 1 (“look at”). From Table 1, we can also remark that our model does not allow simultaneous mutual gaze inversion. Indeed, the probabilities of transition between all mutually inverted states are set to 0 (for example (0,0) to (1,1), or (0,1) to (1,0)). Our model does allow such an inversion to happen, but it must occur through a transition state, as the analysis of the video corpus has outlined [3]. For example, to change from the gaze aversion state (0,0) to the mutual gaze state (1,1), either the speaker should look at L first ((0,0) to (1,0)) or, vice versa, the listener should look at S first ((0,0) to (0,1)). From either state, the transition to the mutual gaze state is allowed.

Speaker Look Away  70.17    Listener Look Away  40.34    Mutual gaze   25.85
Speaker Look At    29.83    Listener Look At    59.66    Averted gaze  74.15

Table 2. Case 1 – Gaze distribution (%)


7.2 Case 2

The setting of the temporal gaze parameters is: T_max(S=1, L=1) = 4.24; T_max(S=1) = 4.50; T_max(L=1) = 4.00; T_max(S=0) = 1.27; T_max(L=0) = 0.81. Figure 3 illustrates the gaze behavior of S and L during S's discourse. We can notice differences with respect to the results of Case 1: both S and L look at each other for a long time. Moreover, mutual gaze occurs 66.19% of the time, as reported in Table 3.

Speaker Look Away  24.72    Listener Look Away  15.63    Mutual gaze   66.19
Speaker Look At    75.28    Listener Look At    84.37    Averted gaze  33.81

Table 3. Case 2 – Gaze distribution (%)

Fig. 3. Case 2 - Gaze behavior

8 Conclusions

In this paper we have proposed a gaze model based on a communicative function model [13,15] and on a statistical model. These models are integrated within a belief network using data reported in [3]. The values of the BN nodes have been set using results from the statistical analysis of conversation dyads [3]. To allow for the creation of personalized gaze behaviors, temporal gaze parameters have been specified. The main purpose of this research is to build different gaze behaviors for different agent characteristics and to investigate the effects on the quality of human-agent dialogues. Animations may be viewed at the URL: http://www.iut.univ-paris8.fr/˜pelachaud/IVA03

References

1. G. Ball and J. Breese. Emotion and personality in a conversational agent. In S. Prevost, J. Cassell, J. Sullivan, and E. Churchill, editors, Embodied Conversational Characters. MIT Press, Cambridge, MA, 2000.
2. J. Beskow. Animation of talking agents. In C. Benoit and R. Campbell, editors, Proceedings of the ESCA Workshop on Audio-Visual Speech Processing, pages 149–152, 1997.
3. J. Cappella and C. Pelachaud. Rules for Responsive Robots: Using Human Interaction to Build Virtual Interaction. In Reis, Fitzpatrick, and Vangelisti, editors, Stability and Change in Relationships, New York, 2001. Cambridge University Press.
4. J. Cassell, T.W. Bickmore, M. Billinghurst, L. Campbell, K. Chang, Hannes Hogni Vilhjalmsson, and H. Yan. Embodiment in Conversational Interfaces: Rea. In Proceedings of CHI99, pages 520–527, Pittsburgh, PA, 1999.
5. J. Cassell, C. Pelachaud, N. Badler, M. Steedman, B. Achorn, T. Becket, B. Douville, S. Prevost, and M. Stone. Animated conversation: Rule-based generation of facial expression, gesture & spoken intonation for multiple conversational agents. In Computer Graphics Proceedings, Annual Conference Series, pages 413–420. ACM SIGGRAPH, 1994.
6. J. Cassell, O. Torres, and S. Prevost. Turn Taking vs. Discourse Structure: How Best to Model Multimodal Conversation. In I. Wilks, editor, Machine Conversations. Kluwer, The Hague, 1999.
7. S. Chopra-Khullar and N. Badler. Where to look? Automating visual attending behaviors of virtual human characters. In Autonomous Agents Conference, Seattle, WA, 1999.
8. R.A. Colburn, M.F. Cohen, and S.M. Drucker. The role of eye gaze in avatar mediated conversational interfaces. Technical Report MSR-TR-2000-81, Microsoft Corporation, 2000.
9. A. Fukayama, T. Ohno, N. Mukawaw, M. Sawaki, and N. Hagita. Messages embedded in gaze on interface agents - Impression management with agent's gaze. In CHI, volume 4, pages 1–48, 2002.
10. S. Lee, J. Badler, and N. Badler. Eyes alive. In ACM Transactions on Graphics, Siggraph, pages 637–644. ACM Press, 2002.
11. J.C. Lester, S.G. Stuart, C.B. Callaway, J.L. Voerman, and P.J. Fitzgeral. Deictic and emotive communication in animated pedagogical agents. In S. Prevost, J. Cassell, J. Sullivan, and E. Churchill, editors, Embodied Conversational Characters. MIT Press, Cambridge, MA, 2000.
12. M. Lundeberg and J. Beskow. Developing a 3D-agent for the August dialogue system. In Proceedings of the ESCA Workshop on Audio-Visual Speech Processing, Santa Cruz, USA, 1999.
13. C. Pelachaud, V. Carofiglio, B. de Carolis, F. de Rosis, and I. Poggi. Embodied Contextual Agent in Information Delivering Agent. In Proceedings of AAMAS, volume 2, 2002.
14. I. Poggi. Mind markers. In N. Trigo, M. Rector, and I. Poggi, editors, Meaning and Use. University Fernando Pessoa Press, Oporto, Portugal, 2002.
15. I. Poggi, C. Pelachaud, and F. de Rosis. Eye communication in a conversational 3D synthetic agent. Special Issue on Behavior Planning for Life-Like Characters and Avatars, Journal of AI Communications, 13(3):169–181, 2000.
16. K.R. Thorisson. Layered modular action control for communicative humanoids. In Computer Animation '97. IEEE Computer Society Press, Geneva, Switzerland, 1997.
17. K.R. Thorisson. Natural turn-taking needs no manual. In I. Karlsson, B. Granstrom, and D. House, editors, Multimodality in Language and Speech Systems, pages 173–207. Kluwer Academic Publishers, 2002.
18. K. Waters, J. Rehg, M. Loughlin, S.B. Kang, and D. Terzopoulos. Visual sensing of humans for active public interfaces. Technical Report CRL 96/5, Cambridge Research Laboratory, Digital Equipment Corporation, 1996.


A Layered Dynamic Emotion Representation for the Creation of Complex Facial Expressions

Emmanuel Tanguy, Philip Willis, and Joanna Bryson

Department of Computer Science, University of Bath
Bath BA2 7AY, United Kingdom
{E.A.R.Tanguy Or P.J.Willis Or J.J.Bryson}@bath.ac.uk
http://www.cs.bath.ac.uk/~{cspeart Or pjw Or jjb}

Abstract. This paper describes the Dynamic Emotional Representation (DER): a series of modules for representing and combining the many different types and time-courses of internal state which underlie complex, human-like emotional responses. This system may be used either to provide a useful real-time, dynamic mechanism for animating emotional characters, or to underlie the personality and action selection of autonomous virtual agents. The system has been implemented and tested in a virtual reality animation tool where it expresses different moods and personalities as well as real-time emotional responses. Preliminary results are presented.

1 Introduction

Emotions are an important part of human expression. They are expressed through behaviours: body motions, facial expressions, speech inflection, word choice and so on. If we wish to make virtual agents (VAs) communicate well with humans, they need to express these emotional cues [3]. Emotional VAs have been classified into two categories: communication-driven and simulation-driven [2]. Communication-driven agents display emotional expressions without any true emotion representation within the character. Simulation-driven agents use modelling techniques to both represent and generate emotions. The problem with the first view is that it provides no mechanism to keep the displayed emotions consistent, making them less believable and the agent less comprehensible. The problem with the simulation of emotion within a character is the complexity of emotion generation, which should be based on assessment of the environment and cognitive processes [1].

The work in this paper provides a third, middle way. We present a system, the Dynamic Emotion Representation (DER), which develops a rich, real-time representation of emotions, but encapsulates this without their automatic generation. This representation gives the consistency needed for communication-driven VAs. For simulation-driven agents, this representation can also be integrated with emotion synthesis. Under either approach, the state in the representation can be used to influence behaviour through action-selection mechanisms.


The DER is based on a two-layer architecture inspired by the CogAff architecture [6] and Thayer's theory of mood [7]. The DER contributes a real-time representation of dynamic emotions which incorporates multiple durations of internal state. Brief emotional stimuli are integrated into more persistent emotional structures in a way that is influenced by long-term mood and personality variables.

The DER's layered representation allows us to create complex facial expressions. Complex here means expressions that present more than one emotion or a conflicting emotional state. For example, looking at Figure 1, which one of the two smiling characters would you trust more?

Fig. 1. One secondary emotion + two different mood states ⇒ two facial expressions.

Our research also addresses the issue of VA personality. In the DER, a VA's personality is represented by a set of values that influences how the emotions vary over time and how they are affected by external emotional stimuli.

2 Layered Dynamic Emotion Representation

The DER architecture is inspired by Sloman's theory of emotions, which describes a three-layer architecture, CogAff [6]. In Sloman's model each layer generates a type of emotion. In the same way, each of the two layers in our architecture represents a type of emotion: Secondary Emotions and Moods.

The Secondary Emotional state is described by six variables (see Fig. 2) which are related to the six universal facial expressions. Typically, this type of emotion lasts from less than a minute to a few minutes [5]. An emotion's intensity increases quickly when the emotion is stimulated and decays slowly afterwards. If an emotion is stimulated repeatedly, the effect of these stimuli is cumulative [5]. This layer requires external Emotional Stimuli (E.S.) to be sorted into categories corresponding to the Secondary Emotions. An emotion is stimulated when a corresponding E.S. occurs, but only if the intensity of this emotion is currently the highest. Otherwise, higher intensities of opposing emotions, such as Anger for Happiness, have their normal decay accelerated.

The mood state is described by two variables: Energy and Tension [7] (see Fig. 2). These two variables vary slowly, owing to the persistence of moods over hours or days. External E.S. are separated into two groups, positive and negative, which respectively decrease and increase the Tension level. The Energy level should vary over time, but in the current implementation it is fixed and equal to the external parameter Disposition to anxiety. The Energy level determines how much negative E.S. influence the Tension. The external parameter Emotional stability determines how much the Tension is influenced by any E.S. A particular set of values for these two parameters determines the emotional personality of the VA. To simulate the influence of the mood on the Secondary Emotions, the intensity of an E.S. is modified by the mood state before the E.S. enters the Secondary Emotions layer.
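A rough sketch of how the two layers could be updated on each emotional stimulus is given below. The numeric constants, the clamping to [0, 1], and the opposition table are illustrative assumptions; only the overall structure (the mood filtering the stimuli, cumulative fast-rise/slow-decay secondary emotions, accelerated decay of opposing emotions) follows the description above.

```python
from dataclasses import dataclass, field

EMOTIONS = ["anger", "joy", "sadness", "surprise", "disgust", "fear"]
OPPOSING = {"joy": "anger", "anger": "joy"}   # illustrative opposition table

@dataclass
class DERState:
    # Personality parameters (fixed per character).
    disposition_to_anxiety: float = 0.5   # also used as the (fixed) Energy level
    emotional_stability: float = 0.5      # 1.0 = very stable
    # Mood layer.
    tension: float = 0.0
    # Secondary emotion layer.
    intensity: dict = field(default_factory=lambda: {e: 0.0 for e in EMOTIONS})

    def on_stimulus(self, emotion, strength, positive):
        # Mood layer: positive stimuli lower the Tension, negative ones raise it;
        # Energy (disposition to anxiety) scales the effect of negative stimuli.
        gain = 1.0 - self.emotional_stability
        if positive:
            self.tension = max(0.0, self.tension - gain * strength)
        else:
            self.tension = min(1.0, self.tension
                               + gain * self.disposition_to_anxiety * strength)

        # The stimulus is filtered by the mood before entering the emotion layer.
        filtered = strength * (1.0 + self.tension)

        # Only the currently dominant emotion is stimulated; otherwise the
        # stronger opposing emotion has its decay accelerated instead.
        dominant = max(self.intensity, key=self.intensity.get)
        if dominant == emotion or self.intensity[dominant] == 0.0:
            self.intensity[emotion] = min(1.0, self.intensity[emotion] + filtered)
        else:
            opp = OPPOSING.get(emotion)
            if opp and self.intensity[opp] > self.intensity[emotion]:
                self.intensity[opp] *= 0.5    # accelerated decay

    def tick(self, dt, decay_rate=0.05):
        # Slow decay of secondary emotions between stimuli; moods decay far slower.
        for e in EMOTIONS:
            self.intensity[e] = max(0.0, self.intensity[e] - decay_rate * dt)
```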

Fig. 2. Diagram of the current DER system implementation. Arrows indicate the flow of information or influence; boxes indicate processing modules and representations. Double-headed arrows represent variable state.

3 From Emotion States to Facial Expressions

To animate the facial mesh we use an implementation of Waters' abstract muscle model [4].

To express the Secondary Emotion state, the emotion with the highest intensity is selected and the facial expression corresponding to the selected emotion is displayed. To communicate the intensity of the emotion, the muscle contractions of the facial expression are proportional to this intensity. The facial expression of the emotion with the highest intensity is also influenced by the facial expressions of the other secondary emotions.

In the mood layer, the same system is used to create a facial expression that communicates the Tension intensity of the character (see Fig. 3). The final facial expression is created by selecting, for each muscle, the highest contraction from the facial expressions of the Secondary Emotions and the mood.
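This composition step can be sketched as follows; the muscle names and the [0, 1] contraction range are assumptions, and the influence of the other secondary emotions mentioned above is omitted for brevity.

```python
def compose_expression(emotion_intensity, emotion_pose, tension, tension_pose):
    """Combine the Facial Emotion and Facial Mood expressions (sketch).

    emotion_pose / tension_pose map muscle names to the contraction values of
    the full-strength expression; contractions are assumed to lie in [0, 1].
    """
    # Facial Emotion module: take the strongest secondary emotion and scale
    # its muscle contractions by that emotion's intensity.
    dominant = max(emotion_intensity, key=emotion_intensity.get)
    strength = emotion_intensity[dominant]
    emotion_expr = {m: c * strength for m, c in emotion_pose[dominant].items()}

    # Facial Mood module: the Tension level drives its own expression.
    mood_expr = {m: c * tension for m, c in tension_pose.items()}

    # Expression Composition: keep the highest contraction for each muscle.
    muscles = set(emotion_expr) | set(mood_expr)
    return {m: max(emotion_expr.get(m, 0.0), mood_expr.get(m, 0.0)) for m in muscles}
```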

4 Preliminary Results

The results can be analysed in two ways: statically and dynamically.

Statically, the system enables an animator to create complex facial expressions. In Fig. 1, the differences between the two VAs are small but play a crucial role in the expressiveness of a person. In fact, these differences are due to different levels of tension. An animator could give a certain expression to a character by first describing its overall emotional state (mood), and then by applying a shorter but highly expressive emotion. By decomposing the creation of emotions into these two stages we support the animator's need to create complex emotional expressions. There is a parallel with real actors, who rehearse to assimilate the state of mind and the personality of the character before they play the scene.

Dynamically, the personality parameters enable the creation of characters that react differently to the same emotional stimuli (see Fig. 3).

(a) Different Tension Levels (b) Different Anger Intensities

Fig. 3. Different facial expressions due to different personalities: 3(a) shows the output of the Facial Mood module: the agent on the right is more tense than the one on the left. 3(b) (the output of the Facial Emotion module) shows that this tension also affects secondary emotion accumulation; in this case, anger in the same two agents.

An important problem with the results presented here is the small perceptual difference between the pictures. This is partially due to a loss of quality in the pictures, but also to the fact that the DER is best appreciated running in real time: these are only snapshots of dynamic expressions.

The number and type of Secondary Emotions we chose to represent is disputable, because there is no consensus on this matter in psychology. However, the DER architecture itself does not constrain which Secondary Emotions developers may employ in their models; others can easily be added.

Given a functional view of emotion [1], we may expect the separation of emotions by type and duration to have a significant impact on decision-making or action selection. The different durations and decay attributes of emotions and moods might reflect their impact on an autonomous agent's intelligence. Artificial life models incorporating the DER could be used to explore many interesting theoretical issues about the evolution of these characteristics and their role in both individual and social behaviour.

5 Conclusion

We have presented the Dynamic Emotional Representation. By introducing two layers of emotional representation, we have created a good platform for building characters which display personality and mood as well as (and at the same time as) transient emotion. We have demonstrated this system as a real-time animation tool suitable either for directed animation or as a VR platform for an intelligent virtual agent.

Acknowledgement

All three authors wish to acknowledge that the first author (Tanguy) was solely responsible for the coding of the DER and most of its design. This work is funded by a studentship from the Department of Computer Science, University of Bath.

References

[1] Canamero, D. (1998). Issues in the design of emotional agents. In Emotional and Intelligent: The Tangled Knot of Cognition, pages 49–54. Papers from the 1998 AAAI Fall Symposium. Menlo Park: AAAI Press.
[2] Gratch, J., Rickel, J., Andre, E., Badler, N., Cassell, J., and Petajan, E. (2002). Creating interactive virtual humans: Some assembly required. IEEE Intelligent Systems, 17(4):54–63.
[3] Johnson, C. G. and Jones, G. J. F. (1999). Effecting affective communication in virtual environments. In Ballin, D., editor, Proceedings of the Second Workshop on Intelligent Virtual Agents, pages 135–138, University of Salford.
[4] Parke, F. I. and Waters, K. (1996). Computer Facial Animation. A K Peters Ltd.
[5] Picard, R. W. (1997). Affective Computing. The MIT Press, Cambridge, Massachusetts; London, England.
[6] Sloman, A. (2001). Beyond shallow models of emotions. Cognitive Processing, 2(1):177–198.
[7] Thayer, R. E. (1996). The Origin of Everyday Moods. Oxford University Press.


Eye-Contact Based Communication Protocol in Human-Agent Interaction

Hidetoshi Nonaka and Masahito Kurihara

Hokkaido University, Sapporo 060 8628, Japan
{nonaka, kurihara}@main.eng.hokudai.ac.jp

Abstract. This paper proposes a communication protocol for human-agent interaction based on eye and head movement tracking. Visual sensorimotor integration with eye-head cooperation is considered; in particular, head gestures accompanied by the vestibulo-ocular reflex are used as commands for agents. Nonverbal communication with eye contact is introduced into an animated character system.

1 Introduction

Interactive animated characters have become widely employed as assistants, guides, entertainers, or various virtual agents in conventional graphical user interface design. They are applied to window systems, web pages, augmented and mixed reality systems, and so on. It is said that character interaction should not be forced on users, and that users should be able to hire and fire the character at any time. If a character always thrusts itself forward and makes itself too conspicuous during the ordinary tasks of the user, he or she may find it bothersome and disturbing, or may want to expunge it. In many applications with animated characters, more politeness in interaction is expected than in our daily face-to-face conversation, especially when communication is being established.

Our face-to-face conversation involves intonation of speech, body posture, clothes or costume, gesture, facial expression, eye gaze, and various other communication channels. Using these nonverbal communication channels, we can start a conversation with unacquainted people (attracting them with eye gaze and gesture), and we can avoid conversation with unwilling people (the ritual of civil inattention). The eye and head directions of an animated agent are effectively used for focusing the user's attention, controlling conversational turns, and so on [1], [2], [3]. On the other hand, those of the user are required for our purpose. For example, head gesture is utilized in [4]. This paper focuses on the user's eye gaze together with head gesture, mainly because eye gaze is most effective in the establishment and continuation of communication [5]. It is applicable to communication between the computer user and an animated character instead of conventional input devices, which are already occupied with the ordinary tasks: programming, debugging, revising papers, and so on.

Recently, various eye-controlled computer input devices have been proposed [6], [7], [8], [9]. A major problem with this technique is the “Midas touch” problem. It is actually difficult to discriminate attentive gaze from bemused or unconscious looking only by tracking eye movement. Furthermore, it is reported that endogenous movement of attention and movement of the gaze point are not necessarily coincident [10], [11]. To cope with the problem, some systems use the dwell time of eye movement to represent the “click” of a mouse, and others use an additional button, blinking, clenching, and so on [12], [13].

In our system, head gesture is used for this purpose [14]. Various eye- and head-based interfaces have been developed, but conventionally head movement is used to compensate the gaze point for fluctuations of the head [15], [16], [17]. In contrast, we consider visual sensorimotor integration with eye-head cooperation. In order to keep the direction of gaze stable in a head-free condition, coordinated movements of eye and head are required. They are naturally achieved by the vestibulo-ocular reflex (VOR), which stabilizes the image on the surface of the retina against head movement. While the user's gaze point is fixed, the eyes and head move antagonistically and spontaneously; therefore, by tracking both eye and head movements, it can be affirmed that the gaze is certainly fixed and the head movement is an intentional gesture. In our system, nodding the head, shaking the head, and inclining the head are assigned to “yes”, “no”, and “undo” respectively.
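The following sketch illustrates the underlying idea under simplifying assumptions (synchronised one-dimensional angle samples and a hand-picked tolerance); the actual system detects VOR fixation and classifies the gestures with S-DP matching and fuzzy inference, as described in Section 2 and in [14].

```python
def vor_active(eye_angles, head_angles, tol=0.2, min_ratio=0.8):
    """Rough check that eye and head rotations cancel each other out (VOR).

    eye_angles, head_angles: synchronised horizontal rotation samples (degrees).
    While the gaze point is fixed, the eyes rotate against the head, so the
    sample-to-sample changes should roughly sum to zero.
    """
    n = min(len(eye_angles), len(head_angles))
    if n < 2:
        return False
    cancelling = 0
    for i in range(1, n):
        d_eye = eye_angles[i] - eye_angles[i - 1]
        d_head = head_angles[i] - head_angles[i - 1]
        if abs(d_eye + d_head) <= tol * (abs(d_eye) + abs(d_head)) + 1e-3:
            cancelling += 1
    return cancelling / (n - 1) >= min_ratio

def interpret_gesture(gesture, gaze_fixed):
    """Map an intentional head gesture to a command, as assigned in the text."""
    if not gaze_fixed:
        return None        # the head movement was not a deliberate gesture
    return {"nod": "yes", "shake": "no", "incline": "undo"}.get(gesture)
```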

2 System Configuration

The hardware configuration is illustrated in Fig. 1. Eye movement is measured by the infrared corneal reflex method in the eye tracking unit. An infrared LED and an image sensor (CCD) are attached to the right frame of a pair of glasses. Pitch and roll of head movement are measured by an accelerometer (ADXL202E, Analog Devices, 2000), and yaw of head movement is measured by a gyroscope (ADXRS300, Analog Devices, 2002). These motion sensors are attached to the left frame of the glasses. The eye tracking unit and head tracking unit are configured in a microcontroller (PIC16F877(QTFP), Microchip, 2001), which is attached behind the head.

Eye-head tracking data are sent to a PC peripheral device by wireless communication: an RF transmitter (AM-RT5, RF Solutions, 2000) and an RF receiver (AM-HRR3, RF Solutions, 2000). The calibration of the eye- and head-tracking unit is initially achieved by In-Circuit Serial Programming in a wired condition. The total weight of the tracking units and sensors is about 8 g, excluding the battery.

The block diagram of the software configuration is shown in Fig. 2. The angle data from the eye tracking unit are smoothed by integration. The gaze line is calculated with compensation from the head tracking data. In parallel, fixation of eye gaze with VOR is detected by S-DP matching. While a head gesture is occurring, the eye movement data are unstable and noisy; therefore, it is difficult to detect a fixation condition from the gaze line data alone. S-DP matching copes well with this. The gestures “Shaking Head”, “Nodding Head”, and “Inclining Head” are also detected using S-DP matching with their reference patterns. Details of S-DP matching are described in [14]. The user's commands “Yes”, “No”, and “Undo” are identified by fuzzy inference.


Fig. 1. Block diagram of hardware configuration

Fig. 2. Block diagram of software configuration


3 Eye-Contact Based Communication Protocol

There are several agents (animated characters) on the screen, and they are usually still, with their faces down and eyes drooping. Each agent corresponds to a respective application such as an on-line manual, dictionary, calendar, timetable, and so on. They are never activated without the proper gaze-and-nod sequence according to the protocol. For example, the startup process is shown in Fig. 3.

In this example, the nodding or shaking gesture is detected at the last stage. Assuming the gaze line has been cast on the center between the agent's eyes, automatic software calibration is activated with the data of the preceding 2 s. The protocol is not fixed: the optimal setting of the protocol depends on the kind of agent (application), individual differences between users, and the context of the user's tasks. Further consideration of customization and adaptation to the user is needed.

Fig. 3. An example of the protocol: establishment of communication

4 Conclusions

This paper proposed an eye-contact based communication protocol for animation-based human-agent interaction, especially for the establishment of conversation. It is based on a communication interface using eye gaze. To cope with the “Midas touch” problem, head gesture is used, taking the vestibulo-ocular reflex into account. In identifying the head gesture, we used the successive dynamic programming (S-DP) matching method with fuzzy inference.

A technique for adaptation to individual differences is needed for further improvement. Consideration of the continuation of conversation and of effective query sequences is under way.


References

1. Bailenson, J. N., Beall, A. C., and Blascovich, J.: Gaze and Task Performance in Shared Virtual Environments. Journal of Visualization and Computer Animation, 13 (2002) 313–320.
2. Andre, E., Rist, T., and Muller, J.: Employing AI Methods to Control the Behavior of Animated Interface Agents. Applied Artificial Intelligence, 13 (1999) 415–448.
3. Cassell, J., Nakano, Y. I., and Bickmore, T.: Non-Verbal Cues for Discourse Structure. Association for Computational Linguistics Joint EACL-2001 ACL Conference (2001).
4. Davis, J. W. and Vaks, S.: A Perceptual User Interface for Recognizing Head Gesture Acknowledgements. ACM Workshop on Perceptual User Interfaces (2001).
5. Argyle, M. and Cook, M.: Gaze and Mutual Gaze. Cambridge Univ. Press (1976).
6. Hutchinson, T. E., White, K. P., Martin, W. N., Reichert, K. C., and Frey, L. A.: Human-Computer Interaction Using Eye-Gaze Input. IEEE Trans. Systems, Man, and Cybernetics, 19, 6 (1989) 1527–1534.
7. LaCourse, J. R. and Hludik, F. C.: An Eye Movement Communication-Control System for the Disabled. IEEE Trans. Biomedical Engineering, 37, 12 (1990) 1215–1220.
8. Jacob, R. K.: Eye-gaze Computer Interfaces: What You Look At is What You Get. IEEE Computer, 26 (1993) 65–67.
9. Cleveland, N. R.: Eyegaze Human-Computer Interface for People with Disabilities. 1st Conference on Automation Technology and Human Performance, http://www.eyegaze.com/doc/cathuniv.htm (1997).
10. Stelmach, L. B., Campsall, J. M., and Herdman, C. M.: Attentional and Ocular Movements. Journal of Experimental Psychology, 23, 3 (1997) 823–844.
11. Ditterich, J., Eggert, T., and Straube, A.: The Role of the Attention Focus in the Visual Information Processing Underlying Saccadic Adaptation. Vision Research, 40 (2000) 1125–1134.
12. Gips, J., Olibieri, C. P., and Tecce, J. J.: Direct Control of the Computer through Electrodes Placed around the Eyes. In Smith, M. J. et al. (Eds.) Human-Computer Interaction, Elsevier (1993) 630–635.
13. Velichkovsky, B. M., Sprenger, A., and Unema, P.: Towards Gaze-Mediated Interaction: Collecting Solutions of the “Midas Touch Problem”. In Howard, S. et al. (Eds.) Human-Computer Interaction: Interact '97 (1997).
14. Nonaka, H.: Communication Interface with Eye-gaze and Head Gesture using Successive DP Matching and Fuzzy Inference. Journal of Intelligent Information Systems, 21, 2 (2003).
15. Xie, X., Cuddhakar, R., and Zhuang, H.: A Cascaded Scheme for Eye Tracking and Head Movement Compensation. T-SMC(A28) (1998) 487–490.
16. Park, K. S. and Lim, C. J.: A Simple Vision-based Head Tracking Method for Eye-controlled Human/Computer Interface. Int. J. Human-Computer Studies, 54 (2001) 319–332.
17. Stiefelhagen, R. and Zhu, J.: Head Orientation and Gaze Direction in Meetings. Proc. of CHI 2002 (2002).


Embodied in a Look: Bridging the Gap between Humans and Avatars

Nicolas Courty1, Gaspard Breton2, and Danielle Pelé2

1 UNISINOS PIPCA, av. Unisinos 950, Sao Leopoldo, RS, Brazil
[email protected]
2 France Télécom R&D, 4 rue du Clos Courtel BP59, F-35012 Cesson Sévigné Cedex, France
{Gaspard.Breton||Danielle.Pele}@rd.francetelecom.com

Abstract. Recent studies on non-verbal communication have put forward the need to provide virtual agents with lifelike looks. In this paper, we present a system allowing a conversational agent to be embodied by modeling one of its perceptual behaviors: gazing at a user. Our system takes as input images of the real scene from a webcam, and allows the virtual agent to look at the person it is facing. Animation issues are mainly explored in this paper through the description of our animation system, which is composed of two types of models: muscle-based (for facial animation) and parametric (to control the gaze). The realism of the animation is also discussed.

1 Introduction

Embodied conversational agents nowadays offer new opportunities to interact with computers. Humans communicate by combining various non-verbal modalities of expression in addition to speech: gestures, facial expressions, gaze, face and body postures. As pointed out in many papers [6,13,1], one challenge regarding Embodied Conversational Agents is to make them as lively as possible by combining all these modalities in the way best suited to the relation with the user. In particular, in face-to-face communication between a human and a virtual character, one important communication cue is related to the eyes: eye contact, eye gaze, eye tracking. These features help to regulate the conversation and the search for feedback [2,5,13]. In this paper, we propose a system that allows the Embodied Conversational Agent to follow the real person with its eyes while they hold a dialogue together. More precisely, difficulties arise when the real world has to be linked with the virtual world; these aspects have to be taken into account.

The remainder of this paper is structured as follows: first, we present an overview of the architecture of the system. Then, animation issues are discussed, notably the combination of two different animation paradigms. We then present an original technique to simulate the vestibulo-ocular reflex, which constitutes a specific animation feature in gazing tasks. Finally, results are presented.


2 Overview of the System

Our aim is to design an embodied conversational agent that can be aware of a human interlocutor. This awareness can be expressed through a particular animation of the avatar's face, which is a “face tracking” animation. To build such an animation, several components are needed (these elements are depicted in the overall architecture presented in Figure 1).

Fig. 1. Overview architecture of our system

First, images of the user are captured by a webcam. These images are then processed using a face tracking system, which allows the position of the user's face to be obtained in the image captured from the webcam. For the purpose of the application, we used the system described in [10], mainly based on two types of information: the flesh color of the face and the position of some singular facial locations. Let us note that any type of face tracking algorithm could be used for this application. Next, the position of the user's face in the captured image is handled by the animation engine to produce the required animation of the avatar. The whole pipeline has to work in real time.

3 Animation Engine

This section deals with the animation of the avatar. Our animation engine is a hybrid animation system that allows different animation paradigms to be used at the same time. Hence, in our application, mainly two different types of animation are used: muscular animation and image-based animation (which can be referred to as inverse kinematics). These two aspects are now discussed.

3.1 Facial Animation

Facial animation is produced by an animation engine called FaceEngine [4]. This animation engine runs in real time and has been designed for realizing conversational agents. It is coupled with real-time speech synthesis and voice segmentation. Scalability can also be achieved in the animation engine as well as in the meshes through the use of dynamic levels of detail.

FaceEngine is a hybrid animation system using both muscular and parametric animation. Muscular animation is very interesting because it allows a very compact representation of human expressions. It also provides a universal set of expressions, making the creation of new faces straightforward. The muscular system is made of 29 muscles working on the same principle as [15]. Parametric animation performs neck and eye rotations. Each part of the face (muscles, neck, eyes, ...) is implemented through the use of animation modules.


Fig. 2. Parametric system. (a) Jaw and neck (b) Example of expression channels

So far, the animation engine is composed of 49 modules. At each frame, each of these modules takes a primitive called a Control Unit (CU) as input. These CUs come from computations performed on the expression corpus or, at a lower level, can be sent directly to the animation engine. In our case, the CUs controlling the neck and the eyes are directly computed by the face tracking system on a separate expression channel (see Figure 2).

The system is made of 8 expression channels working in parallel, each of them producing 49 CUs. The CUs of all the expression channels are composed according to priorities, types, and weights in order to produce a single set of CUs to send to the animation modules. At the very end of the pipeline, some noise can be added in order to produce small variations in the muscle contractions, or quivering.
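The paper does not give the exact composition rule, so the sketch below uses one plausible interpretation (the highest-priority channels win and are merged by a weighted average); the structure of the channel records and the amount of noise are likewise assumptions.

```python
import random

NUM_CU = 49   # one Control Unit per animation module

def compose_channels(channels, noise=0.01):
    """Merge the parallel expression channels into a single CU vector (sketch).

    channels: list of dicts {"priority": int, "weight": float, "cu": [49 floats]}.
    """
    if not channels:
        return [0.0] * NUM_CU
    top = max(c["priority"] for c in channels)
    active = [c for c in channels if c["priority"] == top]
    total_w = sum(c["weight"] for c in active) or 1.0
    merged = [sum(c["weight"] * c["cu"][i] for c in active) / total_w
              for i in range(NUM_CU)]
    # Noise at the very end of the pipeline produces slight quivering.
    return [v + random.uniform(-noise, noise) for v in merged]
```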

3.2 Gazing at the User: Image-Based Animation

The tracking system is in charge of computing the different angles used by the parametric system. The difficulty here lies in estimating the relation between the real-world frame and the virtual-world frame. This estimation could rely upon a calibration phase, but in this paper we propose a rough estimation based on a priori knowledge of the position of the webcam relative to the monitor.

Estimating the position of the user's face. The face tracking system used in our system [10] gives an estimate of the size of the head in the image perceived from the webcam. From this information we estimate the distance between the user's head and the webcam. Thus the position oP of the face can be obtained in the webcam frame (which can be taken as the real-world frame). Let us note that the distance estimate is not that important to the final result, and could also be roughly set to one meter, which corresponds to the common distance between a user and the screen. Then, one needs to estimate the matrix aMo defining the transformation between the webcam frame and the virtual world frame. This transformation is purely virtual, in the sense that it links a real and a virtual world. For the purpose of our system, we only considered a simple translation between the webcam frame and the virtual world frame. Knowing the matrix eyesMa linking the agent's eyes frame and the virtual world frame, it is thus possible to compute the position eyesP of the user's eyes in the eyes frame of the agent:

eyesP = (eyesMa · aMo) · oP

This information is then processed by our animation system and is updated every frame. Figure 3 depicts this transformation pipeline.

Fig. 3. Overview of the transformation pipeline allowing the position of the user's face to be estimated in the virtual world frame
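A minimal sketch of this chain of frames, using homogeneous 4x4 matrices, is given below; all the numeric offsets are placeholders (the paper only assumes a simple translation between the webcam and the virtual world).

```python
import numpy as np

def translation(tx, ty, tz):
    """4x4 homogeneous translation matrix."""
    m = np.eye(4)
    m[:3, 3] = [tx, ty, tz]
    return m

# oP: the user's face in the webcam (real-world) frame, in homogeneous
# coordinates. The depth is the rough distance estimate discussed above
# (it could simply be set to about one metre).
oP = np.array([0.05, -0.10, 1.0, 1.0])

# aMo: webcam frame -> virtual-world frame, assumed to be a simple translation
# (e.g. the webcam sitting a few centimetres above the top of the screen).
aMo = translation(0.0, -0.08, 0.0)

# eyesMa: virtual-world frame -> agent's eyes frame, known from the avatar's pose.
eyesMa = translation(0.0, -1.60, -0.50)

# eyesP = (eyesMa . aMo) . oP : the user's face expressed in the eyes frame.
eyesP = eyesMa @ aMo @ oP
print(eyesP[:3])
```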

Animating the agent. In order to animate the agent, we used an image-based technique already detailed in [8,9]. This animation technique is related to “image-based animation”, in the sense that the tasks given as input to the animation system are specified in the 2D image space, and it allows the motions of an articulated chain to be computed. In our case, only the eyes-neck system is considered, but within this formalism it is also possible to consider more complex chains, involving the spine for instance. The animation system can be seen as a particular type of inverse kinematics, where the task is specified as the regulation in the image space of a set of geometric primitives. For a gazing task, the visual task is specified as “I want to see the face of the user centered in the image space”. In comparison to traditional inverse kinematics, this design of the task is much better suited to our focusing problem.


Knowing eyesP, the position of the user's face in the eyes frame, it is then possible to compute its projection p in the image perceived from the agent's point of view. To see the user's face centered in the image space, we want p = pd = (0, 0), where pd is the desired position of the geometric feature in the image space. The visual task e1 is then defined as e1 = (p − pd). Let us note that it is also possible to consider secondary tasks e2 (generally designed as the gradient of a cost function) that should be realized provided that the primary task e1 is fully realized, by exploiting the possible redundancy of the articulated chain. We invite readers to refer to [9] for further details on this animation technique.

Modeling the vestibulo-ocular reflex. Using this animation model alone does not guarantee realism in the achieved motion. Indeed, neurophysiologists have proved the existence of particular couplings in the processes controlling the eye and neck motions [12]. Human beings are endowed with a special reflex named the Vestibulo-Ocular Reflex (VOR), which allows the orientation of the gaze to be maintained despite the motions of the head. During a gazing task, the eyes focus quickly on the target, while the head, which moves more slowly due to inertia, tends to maintain a perpendicular alignment with the horizon. Broadly speaking, the animation system should imitate such a behavior. We thus designed a special secondary task constraining the realization of the focusing task. Our approach can be related to the search for a less-effort posture, which has already been addressed in the context of inverse kinematics in [11,3]. To make the eyes converge to this less-effort posture qlep, we designed a simple cost function hs such that:

h_s = \frac{\alpha}{2} \sum_i \left( q_i - q_i^{lep} \right)^2    (1)

where i ∈ {1, 2} determines the two possible rotations of the eyes. The secondary task is then given by:

e_2 = \frac{\partial h_s}{\partial q} = \alpha \begin{pmatrix} q_1 - q_1^{lep} \\ q_2 - q_2^{lep} \end{pmatrix}    (2)

where α is a scalar that sets the amplitude of the secondary task. In order to test this method, we performed a comparison with results obtained by Robinson [14] on real patients (see Figure 4 for details). The different curves have the same behavior over time, which justifies the use of this technique to model the particularities of the human motions involved in a gazing task.
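In code, the secondary task of Eq. (2) is simply the scaled gradient of the cost of Eq. (1); the value of alpha below is a placeholder.

```python
import numpy as np

def secondary_task(q, q_lep, alpha=0.1):
    """Secondary task e2 = alpha * (q - q_lep), the gradient of h_s (Eqs. 1-2).

    q:     current eye rotations (q1, q2)
    q_lep: less-effort eye posture (typically the eyes centred in their orbits)
    alpha: amplitude of the secondary task
    """
    return alpha * (np.asarray(q, dtype=float) - np.asarray(q_lep, dtype=float))

# Pulling the eyes back towards their rest position while the primary task
# e1 = p - pd keeps the user's face centred forces the neck to take over the
# remaining rotation, reproducing the VOR-like eye/head cooperation.
e2 = secondary_task(q=[0.3, -0.1], q_lep=[0.0, 0.0])
```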

4 Results

Fig. 4. Cooperation between neck and eyes for a tracking task: (a) results adapted from Robinson [14]; (b) results obtained with our animation system

We tested our application within our simulation framework. Tests were performed on a Pentium 4 PC with a 1 GHz CPU and a GeForce2 graphics card. A photo of the system and some snapshots of the application are presented in Figure 5. The face tracking part is the most time-consuming (handling between 10 and 20 images per second), whereas rendering and computing the animation allow the application to run at a very fast rate (70 frames per second). Hence, if the face tracking system runs too slowly, a small latency may occur between the animation and what is really going on (this is the case for fast motions of the user). Nevertheless, the resulting animations are very interesting, since the simple fact of having the avatar looking at the user makes it very lively and embodied. Moreover, using a simple webcam is not very invasive, and as such a user who is not aware of this type of application can be very surprised when he finds out that the avatar is looking at him. We are planning to begin studies on the impact of such a non-verbal modality on the emotional commitment a user can develop through using such an application.

5 Drawbacks and Future Works

As stated in the introduction, gazing and looking are part of the modalities involved in non-verbal communication. For now, the tracking is performed throughout the animation and is not linked to the speech of the avatar, nor to a potential personality or emotions. This may in fact be a barrier to good interaction, as gazing away is an important part of turn-taking behavior, and having an avatar constantly staring at the user may impede good conversational flow. Future work will consider introducing behavioral models to closely link the visual behavior of the avatar to its personality, and also to its speech. Systems such as BEAT, developed at MIT, allow a phrase to be decorated with attributes related to non-verbal communication [7], and could be used to parameterize visual behaviors. Moreover, visual behaviors could also be executed whenever the user disappears from the screen, or when a new person enters the visual field. Such upgrades may greatly enhance the liveliness of the embodied conversational agent. We plan to design a behavior manager that could handle several visual behaviors at the same time, and switch from one to the other with respect to the current context of the conversation. This behavioral layer would act as a supervision layer for executing the different CUs, thus providing total control over the animation of the avatar.

6 Conclusion

We have presented in this paper an animation system intended to simulate a perceptive behavior of a virtual agent: gazing at a user. Our animation architecture designed for Embodied Conversational Agents is hybrid, allowing different animation paradigms (muscle-based, skeleton-based) to run simultaneously. Images acquired from a webcam are processed using a face tracking algorithm. The position of the user's face in image space provides an estimate of the position of the user in the agent's virtual frame. This information is then used by our animation system. The animation framework allows special constraints to be defined on the generated motions, such as the modeling of the vestibulo-ocular reflex, which yields very lively and embodied results. Future work will consider the use of this tool to model more complex visual behaviors involving non-verbal communication features.

Fig. 5. Face tracking and animation of an embodied conversational agent. (a) Overview of the system. (b) Screenshots of the application. The inset pictures are the video captured from the webcam, while the red cross is the result of the tracking process

Acknowledgments: the authors would like to thank Olivier Bernier for the face tracking module, Tifenn for her work on this application, and the reviewers for their interesting comments.

References

1. J. Allbeck and N. Badler. Toward representing agent behaviors modified by personality and emotion. In AAMAS Workshop on Embodied Conversational Agents, Bologna, Italy, July 2002.
2. M. Argyle and M. Cook. Gaze and Mutual Gaze. Cambridge University Press, London, 1976.
3. R. Boulic, R. Mas, and D. Thalmann. A robust approach for the control of the center of mass with inverse kinetics. Computers and Graphics, 20(5):693–701, September–October 1996.
4. G. Breton, D. Pelé, and C. Bouville. FaceEngine: a 3D facial animation engine for real time applications. In ACM Web3D Symposium, Paderborn, Germany, February 2000.
5. J. Cassell and K. Thorisson. The power of a nod and a glance: envelope vs. emotional feedback in animated conversational agents. Applied Artificial Intelligence, 13:519–538, 1999.
6. J. Cassell. Embodied conversational interface agents. Communications of the ACM, 43(4):70–78, 2000.
7. J. Cassell, H. Vilhjálmsson, and T. Bickmore. BEAT: The Behavior Expression Animation Toolkit. In Proc. of SIGGRAPH 01, Computer Graphics Proceedings, pages 477–486, 2001.
8. N. Courty and E. Marchand. Computer animation: a new application for image-based visual servoing. In IEEE Int. Conf. on Robotics and Automation, volume 1, pages 223–228, Seoul, South Korea, May 2001.
9. N. Courty, E. Marchand, and B. Arnaldi. Through-the-eyes control of a virtual humanoid. In IEEE Int. Conf. on Computer Animation 2001, pages 234–244, Seoul, Korea, November 2001.
10. R. Feraud, O. Bernier, J.-E. Viallet, and M. Collobert. A fast and accurate face detector based on neural networks. IEEE Trans. on Pattern Analysis and Machine Intelligence, 23:42–53, 2001.
11. A. Kuo and F. Zajac. Human standing posture: multi-joint movement strategies based on biomechanical constraints. Progress in Brain Research, 97, 1993.
12. P. Morasso and V. Tagliasco, editors. Human Movement Understanding. North-Holland, Amsterdam, 1986.
13. I. Poggi, C. Pelachaud, and F. DeRosis. Eye communication in a conversational 3D synthetic agent. Artificial Intelligence Communications (Special Issue), 13:169–181, 2000.
14. D. Robinson. The mechanics of human saccadic eye movements. Journal of Physiology, 174:245–264, 1964.
15. K. Waters. A muscle model for animating three-dimensional facial expression. In Proceedings of SIGGRAPH, Anaheim, California, 1987.


Modeling Accessibility of Embodied Agents for Multi-modal Dialogue in Complex Virtual Worlds

Dasarathi Sampath and Jeff Rickel

USC Information Sciences Institute, 4676 Admiralty Way, Suite 1001, Marina del Rey, CA, USA
[email protected], [email protected]

Abstract. Virtual humans are an important part of immersive virtual worlds, where they interact with human users in the roles of mentors, guides, teammates, companions or adversaries. A good dialogue model is essential for achieving realistic interaction between humans and agents. Any such model requires modeling the accessibility of individuals, so that agents know which individuals are accessible for communication, by what modality (e.g. speech, gestures) and to what degree they can see or hear each other. This work presents a computational model of accessibility that is domain independent and capable of handling multiple individuals inhabiting a complex virtual world.

1 Introduction

Immersive virtual worlds have been playing an ever-increasing role in the fieldsof education, training and entertainment. A human user can be transported tovirtual classrooms, fantasy adventures, life-like drama or a friendly chat session;the possibilities are endless. Humans interact with other people as well as vir-tual humans (agents) who manifest themselves as instructors, teammates, socialcompanions or enemies. Virtual worlds range from a small space to complexworlds where individuals (agents and humans) can move among different loca-tions. Complex worlds are becoming more and more common. Examples includethe massive online multiplayer game Everquest [1], which has thousands of play-ers in its fantasy worlds, the Mission Rehearsal Exercise [9, 10] army trainingplatform set in a virtual village, and the social virtual reality system, DiamondPark [13], which has human users interacting in a virtual park.

One of the important goals of virtual worlds is to make users feel that they are interacting with real humans as opposed to computer programs. Face-to-face conversation between humans and agents presents quite a challenging step towards achieving this goal. Agents participating in such conversations make active use of gestures and speech to get and convey information [2]. They need to be aware of individuals who are intentionally or unintentionally (overhearers) involved in the conversation. These issues are especially challenging in a complex world where an agent is influenced by factors like the relative positions of the individuals involved, whether they can see or hear each other, and the level (e.g. talking loud or soft) at which they need to communicate. This calls for a sound dialogue model that takes all the above mentioned aspects into account.

We term that an agent X is in contact with another individual Y when X cancommunicate with Y through some mode. The modes can be basic ones like aural(speech) and visual (gestures) as well as electronic ones like radios or cell phones.Being in contact is a prerequisite for communicating, the communication beingeither intentional or unintentional. For instance, if the mode is aural, contactinformation about Y helps X decide whether Y is accessible by speech, and atwhat loudness X should speak to be audible to Y; contact information aboutother individuals helps X be aware of individuals that may be eavesdropping onthis conversation.

Any dialogue model for face-to-face interaction in complex virtual worldsneeds to have a model of contact built in as a foundation. There has been ex-tensive work in the Embodied Conversational Agents (ECA) [4] community onface-to-face interaction but the issue of contact has been largely simplified oroverlooked. The only explicit model of contact has been in the multi-layereddialogue model proposed by Traum and Rickel [12]. Our current work builds ontheir model.

We present a computational model for determining contact information foran agent inhabiting a virtual world. Our model utilizes the perceptual infor-mation available to the agent and tells it which individuals are accessible forcommunication and at what level. It is designed to handle multiple individualsseamlessly moving in and out of conversations as well as keep track of potentialonlookers or overhearers that an agent is interested in.

2 Platform

The test bed for this research is the Mission Rehearsal Exercise (MRE) project[9, 10]. This places the human user in the role of an army lieutenant who findshimself driving to a situation in a virtual environment resembling a Bosnianvillage and populated by virtual humans. There has been an accident where oneof his platoon’s vehicles has injured a boy from the locality. The boy’s motherand a medic are attending to the boy while a sergeant is in charge. Squad leadersand soldiers are present for support. Now the lieutenant has to take over fromhere, assess the situation by talking to the sergeant and the medic and issueorders accordingly. All characters in the scenario apart from the lieutenant areagents.

Essentially, right from the point where the simulation starts, the situations that arise necessitate the continuous computation of contact. Initially, the lieutenant is in contact via radio with the sergeant before arriving at the scene. Once the lieutenant is there, the sergeant can see and talk to him face-to-face, thus changing the mode of contact. The sergeant and the medic have to constantly watch out for overhearers, mainly the mother and crowd members. Depending on the location of his squad members, the sergeant has to decide how loud he should be talking to them or if he should communicate over the radio. He has to keep track of people in a landing zone who are out of earshot but can be seen. These situations signify issues that are representative of any complex virtual world populated by individuals, thus making MRE an ideal test bed for modeling contact.

The initial version of MRE had only some domain-specific rules for computing contact. These rules told the agents about contact information for a small number of specific cases. For example, if the first squad leader reaches the landing zone, he is deemed to be in visual contact with the sergeant but not in aural contact. Such rules cover only a limited set of situations, are cumbersome to encode, and cannot be reused in a new scenario. Our model of contact is aimed at solving these problems by providing a generic, domain-independent model that can be used in any virtual world where the agents have access to certain basic perceptual information.

3 Related Work

This work builds on prior research in the field of Embodied ConversationalAgents [4]. Most ECAs are designed to communicate only with a single humanuser and no other virtual humans in the scenario. The computation of contactboils down to determining whether the user is present or absent for conversation.The Rea agent [3] and Gandalf agent [11] tracked the human user by placingtrackers on the body and using cameras. Such a model will not be capable ofhandling contact computation in complex worlds, where multiple individuals aremoving among different locations. Steve [7, 8] provided an immersive trainingexperience with multiple humans and agents involved in maintenance operationson a virtual ship with many rooms and floors, a complex world. Steve agentsdonned the role of mentors as well as teammates. Despite the complexity of thevirtual world, the agent assumes that it is in contact with every individual in thevirtual world. Steve ignored the physical boundaries and perceptual limitationsin the 3D world and talked to teammates in distant rooms. Situations like thishinder the goal of having realistic team training.

The above mentioned models of contact fall short of the demands of a com-plex world. In contrast, the dialogue model proposed by Traum and Rickel [12] isdesigned to support face-to-face interaction among agents and humans in com-plex virtual worlds. Their model includes a number of layers, including a contactlayer. Their contact layer includes a representation for contact information anda set of abstract acts that update this representation. Contact was representedas a vector for all individuals that an agent may interact with, with each ele-ment representing the available modes of contact (visual, aural, and radio). Actsincluded make-contact and break-contact for a specified mode.

We extend their model in two important ways. First, we have extended their representation to include additional levels of contact for each mode and a measure of the certainty of the agent's contact information. Second, and most importantly, we provide a domain-independent mapping from an agent's perceptual information to the abstract acts; they only provided a few illustrative examples of how events in the virtual world would correspond to make-contact and break-contact acts.

4 Modeling Contact

Let us consider the sergeant as a representative example of an agent and explore the modeling of contact from his viewpoint. For each individual, like the medic, with whom the sergeant would potentially be interested in holding a conversation, a data structure (ContactInfo) is created. Therefore, the sergeant would have a set of ContactInfo structures, one for each individual he is interested in. For each of the modes of communication, ContactInfo has a level denoting the degree of contact currently available. For example, the aural contact level would tell the sergeant how loud he should speak while talking to the medic.
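A minimal sketch of such a per-individual record is given below; the field names and types are ours, since the paper only describes ContactInfo informally (the belief fields anticipate Section 4.3).

```python
from dataclasses import dataclass
from enum import Enum

class Level(Enum):
    UNKNOWN = 0
    ZERO = 1
    BARE = 2
    LOW = 3
    MEDIUM = 4
    HIGH = 5

@dataclass
class ContactInfo:
    """Per-individual contact record kept by an agent such as the sergeant."""
    individual: str                  # who the record is about, e.g. "medic"
    aural: Level = Level.UNKNOWN     # how loud the agent must speak to be heard
    visual: Level = Level.UNKNOWN    # which kinds of gestures the other can recognise
    radio: str = "No-radio"          # "Radio-on", "Radio-off", "No-radio" or "Unknown"
    location_belief: float = 1.0     # certainty of the location estimate (Sec. 4.3)
    radio_belief: float = 1.0        # certainty of the radio information (Sec. 4.3)

# One record per individual the sergeant may want to talk to
contacts = {name: ContactInfo(name) for name in ("medic", "lieutenant", "mother")}
```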

The ContactInfo structure is updated continuously over the course of a simu-lation as opposed to doing it when the agent needs the information. To illustratethe advantage, let us consider a situation in our scenario where the sergeant isconversing with the lieutenant and the mother is behind the sergeant, talkingoccasionally and thus giving an indication of her position. If contact were tobe computed on demand, every time the sergeant wants to make sure that themother would not hear him talking to the lieutenant, he has to look behindto check her position. On the other hand, computing contact information con-tinuously will ensure that the sergeant has an idea about the mother’s positionbased on her occasional utterances. This would make the agents’ behaviour morenatural and ensure that there is no loss of information.

4.1 Levels of Contact

Each mode of contact has different levels. For example, if the sergeant can makehimself audible to the medic even by talking softly, then the medic is at a higherlevel of aural contact with respect to the sergeant than someone who is far awayand needs the sergeant to shout to make himself audible.

The levels of aural contact are labeled Unknown, High, Medium, Low, Bare and Zero. The levels were decided based on the decibel level (loudness) required for the agent in question to be audible. The required decibel level depends on two factors: the distance between the two individuals and the environmental noise. Initially, given the distance between the sergeant and the medic, the intensity with which the sergeant should speak such that he is just audible (around 35 dB) to the medic is calculated based on the inverse square law [5]. The final required intensity is calculated based on the heuristic that it should be greater than the environmental noise level by around 10 dB. If the final intensity is higher than the limit of human loudness (around 75 dB), then contact is impossible and the level is set to Zero; otherwise the required decibel level is found and translated to an aural contact level by the following mapping. A High means that it is enough for the sergeant to talk at the level of whispering, around 30 dB. A Zero means that he cannot make himself audible. Medium (around 65 dB) corresponds to the normal loudness of a close face-to-face conversation between individuals. If the sergeant needs to shout at the top of his voice, the level is Bare (around 75 dB). Low (around 70 dB) is a level of aural contact added to cover the intermediate level between Medium and Bare; it is applicable in situations where individuals are not close (i.e. more than about 4 m apart) but not so far apart that shouting is required. The level is set to Unknown when the information is considered uncertain; this is explained further in section 4.3.
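The following sketch implements one possible reading of this mapping; the inverse-square spreading loss and the dB thresholds come from the text above, but the reference distance and the exact band boundaries are assumptions.

```python
import math

def required_db(distance_m, noise_db, just_audible_db=35.0, ref_dist=1.0):
    """Loudness the speaker needs (at ref_dist) so the listener receives ~35 dB,
    raised if necessary to sit ~10 dB above the environmental noise."""
    spreading_loss = 20.0 * math.log10(max(distance_m, ref_dist) / ref_dist)
    return max(just_audible_db + spreading_loss, noise_db + 10.0)

def aural_level(distance_m, noise_db):
    db = required_db(distance_m, noise_db)
    if db > 75.0:
        return "Zero"      # beyond the limit of human loudness
    if db <= 30.0:
        return "High"      # whispering is enough
    if db <= 65.0:
        return "Medium"    # normal face-to-face conversation
    if db <= 70.0:
        return "Low"
    return "Bare"          # shouting at the top of the voice
```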

In the case of visual contact, the levels were decided based on the types of non-verbal signals that are visible. An agent's ability to communicate visually with another individual depends on what types of gestures the other agent can recognise. The important gestures commonly employed include facial expressions, gaze shifts, hand and arm motions, head nods and body posture. For determining the levels of visual contact, a study was conducted wherein a subject was made to stand at various distances and perform all the relevant types of gestures. A mapping from distance (between individuals) to the visibility of the various types of gestures was determined. The types of gestures visible progressively decrease with increasing distance. A High (up to about 4–10 metres) corresponds to all gestures being visible. At the Medium level (around 15–25 m), fine gestures like eyebrow movements and eye movements cannot be recognised. Low (around 30–60 m) is the level where no facial expressions can be recognised, and only body postures and hand gestures are visible. The level Bare (around 75–150 m) signifies that merely gross features like body orientation and big hand motions are visible. None (greater than 200 m) denotes that the person is barely visible. Unknown means the information is uncertain, as in aural contact. This study is a suggested mapping because the actual perception depends on various other conditions like lighting and a contrasting background. A mapping which takes all these other factors into account is an area for future work. This representation helps the agent decide what signals to use while communicating.
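A corresponding sketch for the visual mode is shown below. Since the reported ranges leave gaps (e.g. 10–15 m), the band boundaries chosen here are one possible interpolation, not values fixed by the paper.

```python
def visual_level(distance_m):
    """Map inter-individual distance to a visual contact level, following the
    informal study reported above (lighting and background are ignored)."""
    if distance_m <= 10.0:
        return "High"    # all gestures, incl. eyebrow and eye movements
    if distance_m <= 25.0:
        return "Medium"  # fine facial gestures no longer recognisable
    if distance_m <= 60.0:
        return "Low"     # only body postures and hand gestures
    if distance_m <= 150.0:
        return "Bare"    # gross features: body orientation, big hand motions
    return "None"        # barely visible
```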

The final mode of contact, Radio, is just a representative means of com-munication, used in the MRE scenario. There might be other forms of contactlike a cell phone for example, but the basic idea is to represent any means ofcontact apart from aural and visual. Here the levels used are Radio-on, Radio-off, No-radio and Unknown. No-Radio means that there is no radio. Radio-onand Radio-off correspond to the cases where the agent’s radio is on and offrespectively. Unknown is used to indicate that the information about radio isuncertain.

4.2 Resources for Computing Contact

The computation of contact assumes that certain basic resources are available tothe agent: perceptual information and some domain knowledge. The perceptualdata is the aural and visual information that any individual placed in a 3D worldwould receive. Domain knowledge includes minor details about the scenario likethe radio being a method of communication.


Perceptual Information The main motivation behind basing the model onperceptual information was to make it domain-independent without sacrificingthe accuracy. When involved in conversations with individuals, humans tend toknow their locations based on the regular sensory input that they receive, auraland visual. This location information is what helps humans automatically adjusttheir volume when speaking to individuals standing at different locations. Thus,the main problem in computing contact is that of estimating the locations ofindividuals, which we do using the perceptual information. Given an estimate ofan individual’s location, and knowledge of objects in the environment (e.g., trees,walls), an agent could easily infer cases where objects blocked visual contact. Weignore that issue in this paper, but it would be a natural extension to our model.

In a virtual world, the perceptual input to an agent would be human-like per-ceptual information about individuals it can see and hear. For our contact model,the visual input includes the location vector and velocity vector of the individualwho is visible. The aural input consists of the perceived loudness (decibel level)and direction of the speaking individual as well as the total environmental noiseintensity at the agent.

4.3 Computing Contact Information

The computation of contact uses data from perception to update the levels of contact continuously. For aural and visual contact, the first task at hand is to get some measure of distance based on the input information from the perception module. Visual information overrides the aural information whenever both are available; the aural input from the medic is then ignored in this case. In the case that only the medic's voice is heard, the distance is estimated by assuming that the medic is speaking at the typical volume level for a human conversation (around 65 dB). Given the perceived intensity of the medic, the distance to the medic is calculated based on the simple fact that sound intensity decays with distance according to the inverse square law [5].
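A hedged sketch of this distance estimate: assuming the speaker talks at roughly 65 dB at a reference distance of 1 m (both assumptions), the inverse square law is inverted to recover the range.

```python
def estimate_distance(perceived_db, assumed_source_db=65.0, ref_dist=1.0):
    """Invert the inverse square law: distance at which an ~65 dB voice
    would be perceived at the given level."""
    return ref_dist * 10.0 ** ((assumed_source_db - perceived_db) / 20.0)

# A voice perceived at 45 dB would be placed roughly 10 m away
d = estimate_distance(45.0)
```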

Once we have this distance measure, computing the levels of contact isstraightforward. The level of visual contact is determined based on the map-ping between distance and visual levels, which was the result of our experimentdescribed in section 4.1. Aural contact level is determined by taking both thedistance measure and the environmental noise intensity into consideration asdescribed in section 4.1 again. The levels of radio contact are determined basedon simple computation, given domain knowledge about the radio being presentor absent. A successful radio communication with the medic sets the level toRadio-on while an unsuccessful attempt by the sergeant sets it to Radio-off.

The levels of contact are continuously computed as described above as long as there is sensory input continuously available. For example, once the medic is out of the field of view of the sergeant, the medic might not stay in the same position. Same position refers to the medic's last-observed position if his velocity was zero; in case the velocity is non-zero, it is the new position that is updated as time progresses. In both cases, the information cannot be relied on forever since it is a dynamic world. So we have internal fields in each ContactInfo structure, location-belief-level and radio-belief-level, to represent the decay of certainty about location and radio information respectively. Once the sensory input stops, the belief-levels, initially at the maximum value, start decreasing as a function of time. We chose the exponential function mentioned in Moray [6] to simulate this decay. Thresholds for the belief-levels are fixed, and the sergeant relies on the contact information for the medic until the belief-levels decay below the thresholds. When this happens, contact levels are set to Unknown. We do not have a principled way of choosing the threshold values. For location-belief-level, a simple way of doing it would be based on the typical time differences between consecutive inputs, visual and aural, for a wide range of conversations.
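The decay mechanism could look like the sketch below; the exponential time constant and the threshold are placeholders, since the paper deliberately leaves them unspecified.

```python
import math

def decayed_belief(initial_belief, seconds_since_last_input, tau=10.0):
    """Exponential decay of certainty once sensory input stops (after Moray [6]);
    tau is an assumed time constant, not a value given in the paper."""
    return initial_belief * math.exp(-seconds_since_last_input / tau)

def level_or_unknown(level, belief, threshold=0.3):
    """Fall back to Unknown once the belief drops below a hand-picked threshold."""
    return level if belief >= threshold else "Unknown"
```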

When contact levels are set to Unknown, the agent has to execute a task of actively computing contact whenever necessary. This is a separate task on its own, which will be more domain-dependent. However, some simple generic techniques can be employed for aural and visual contact, like looking around for the concerned individual or shouting out to see if he responds. More complex techniques for actively computing contact information are an area for future work.

5 Status

The contact model has been implemented in the MRE project. The main agentsin the scenario like the sergeant and the medic are autonomous virtual humans[9] while the other agents like crowd members are scripted. Contact informationabout all individuals (including scripted agents) is maintained only by thesemain agents. The computation of contact has been completed and tested byhaving individuals moving among different locations with respect to the sergeant,sometimes coming into the field of view and sometimes talking, thus providingaural and visual input. The model seems to provide realistic contact informationin all the situations we have tested and we are currently planning a more formalevaluation.

6 Conclusion

We have presented a computational model of contact for an agent in a complexvirtual world with access to basic perceptual information. The model has beenimplemented and tested in a scenario having many individuals moving amongdifferent locations. This domain-independent model simulates the way humansdetermine contact information in the real world using perceptual information.Our research on contact is an important step towards achieving realistic face-to-face interaction between agents and human users.

Acknowledgements

This research was funded by the U.S. Department of the Army under contract DAAD 19-99-D-0046. We would like to thank Praveen Paruchuri and members of the MRE project, especially Youngjun Kim, David Traum, Chris Kyriakakis, Jon Gratch and Sheryl Kwak. Any opinions, findings, and conclusions expressed in this paper are those of the authors and do not necessarily reflect the views of the Department of the Army.

References

[1] Everquest. http://www.everquest.com.
[2] Justine Cassell. Nudge nudge wink wink: Elements of face-to-face conversation for embodied conversational agents. In J. Cassell, J. Sullivan, S. Prevost, and E. Churchill, editors, Embodied Conversational Agents. MIT Press, Cambridge, MA, 2000.
[3] Justine Cassell, Tim Bickmore, Lee Campbell, Hannes Vilhjalmsson, and Hao Yan. Human conversation as a system framework: Designing embodied conversational agents. In J. Cassell, J. Sullivan, S. Prevost, and E. Churchill, editors, Embodied Conversational Agents. MIT Press, Cambridge, MA, 2000.
[4] Justine Cassell, Joseph Sullivan, Scott Prevost, and Elizabeth Churchill, editors. Embodied Conversational Agents. MIT Press, Cambridge, MA, 2000.
[5] L. E. Kinsler, A. R. Frey, A. B. Coppens, and J. V. Sanders. Fundamentals of Acoustics. Wiley, New York, 1982.
[6] Neville Moray. Designing for attention. In Alan Baddeley and Lawrence Weiskrantz, editors, Attention: Selection, Awareness, and Control. Clarendon Press, Oxford, 1993.
[7] Jeff Rickel and W. Lewis Johnson. Animated agents for procedural training in virtual reality: Perception, cognition, and motor control. Applied Artificial Intelligence, 13:343–382, 1999.
[8] Jeff Rickel and W. Lewis Johnson. Extending virtual humans to support team training in virtual reality. In G. Lakemeyer and B. Nebel, editors, Exploring Artificial Intelligence in the New Millennium, pages 217–238. Morgan Kaufmann, San Francisco, 2002.
[9] Jeff Rickel, Stacy Marsella, Jonathan Gratch, Randall Hill, David Traum, and William Swartout. Toward a new generation of virtual humans for interactive experiences. IEEE Intelligent Systems, 17(4):32–38, 2002.
[10] W. Swartout, R. Hill, J. Gratch, W. L. Johnson, C. Kyriakakis, C. LaBore, R. Lindheim, S. Marsella, D. Miraglia, B. Moore, J. Morie, J. Rickel, M. Thiebaux, L. Tuch, R. Whitney, and J. Douglas. Toward the holodeck: Integrating graphics, sound, character and story. In Proceedings of the Fifth International Conference on Autonomous Agents, pages 409–416, New York, 2001. ACM Press.
[11] Kristinn R. Thorisson. Communicative Humanoids: A Computational Model of Psychosocial Dialogue Skills. PhD thesis, Massachusetts Institute of Technology, 1996.
[12] David Traum and Jeff Rickel. Embodied agents for multi-party dialogue in immersive virtual worlds. In Proceedings of the First International Joint Conference on Autonomous Agents and Multi-Agent Systems, pages 766–773, New York, 2002. ACM Press.
[13] Richard C. Waters, David B. Anderson, John W. Barrus, David C. Brogan, Michael A. Casey, Stephan G. McKeown, Tohei Nitta, Ilene B. Sterns, and William S. Yerazunis. Diamond Park and Spline: A social virtual reality system with 3D animation, spoken interaction and runtime modifiability. Presence: Teleoperators and Virtual Environments, 6:461–480, 1996.


Bridging the Gap between Language and Action

Tokunaga Takenobu1, Koyama Tomofumi1, Saito Suguru2, and Okumura Manabu2

1 Department of Computer Science, Tokyo Institute of Technology, Tokyo Meguro Oookayama 2-12-1, Japan 152-8552
{take@cl,tomoshit@img}.cs.titech.ac.jp
2 Precision and Intelligence Laboratory, Tokyo Institute of Technology, Yokohama Midori Nagatsuta 4259, Japan 226-8503
{suguru,oku}@pi.titech.ac.jp

Abstract. When communicating with animated agents in a virtual space through natural language dialogue, it is necessary to deal with vagueness of language. To deal with vagueness, in particular vagueness of spatial relations, this paper proposes a new representation of locations. The representation is designed to have a bilateral character, symbolic and numeric, in order to bridge the gap between the symbolic system (language processing) and the continuous system (animation generation). Through the implementation of a prototype system, the effectiveness of the proposed representation is evaluated.

1 Introduction

Research of animated agents capable of interacting with humans through naturallanguage has drawn much attention in recent years [1, 2, 3]. When communicat-ing with animated agents in a virtual space, it is necessary to deal with vaguenessof language as well as ambiguity. Vagueness and ambiguity of language are sim-ilar but different concepts.

The following short conversation between a human (H) and a virtual agent (A) highlights the contrast between vagueness and ambiguity.

H: Do you see a ball in front of the desk?
A: Yes.
H: Put it on the desk.

In the third utterance, the pronoun “it” could refer to one of the objects men-tioned in the preceding utterance, “a ball” or “the desk”. So there is ambiguityin reference. Solving this kind of ambiguity has been studied for many years asanaphora resolution [5].

This example includes vagueness as well. When putting the ball on the desk, a location to place the ball should be decided. There is no explicit mention of the location on the desk where the ball is to be placed; it is just mentioned as "on the desk". In contrast with the reference ambiguity, there are, in principle, infinite choices of location. When we interact with virtual agents through natural language, this kind of vagueness is inevitable. In particular, vagueness of spatial relations could be a crucial obstacle for autonomous agents, because the agent cannot perform a proper action without dealing with the vagueness.

As this example shows, solving ambiguity is a process of choosing a correctone from discrete and categorical choices, which has an affinity to the symbolicnature of language. On the other hand, solving vagueness involves finding a plau-sible point or area in continuous space, which is incompatible with the symbolicsystem. This would be one of the main reasons that vagueness has not drawnmuch attention in past natural language processing research.

Most past natural language dialogue systems worked in a discrete space where every relation among objects and locations is described in terms of symbols. The discrete space has an affinity to conventional symbol-based planning, which plays a crucial role in realizing intelligent agents. When moving from discrete space to continuous space, however, symbolic planning faces the difficulty of vagueness. If it treats every location as a symbol, the number of symbols could in theory be infinite. To avoid this problem, Shinyama et al. proposed using composite lambda functions to delay the computation of locations [11]. However, since the result of the computation is the coordinate values of a single location, their method does not deal with vagueness in the strict sense.

The above speculation tells us that the representation of locations and spa-tial relations for the virtual agents needs to have both symbolic and numericcharacter. With such representation, bridging the gap between a symbolic sys-tem (language) and a continuous system (action) could be achieved. Olivier etal. also proposed a similar idea in which the representation had both qualitativeand quantitative properties [9]. However, their motivation was the visualizationof spatial description and did not consider its use in a more dynamic environ-ment. Horswill proposed a framework in which all logical variables are directlygrounded on visual information in the real world [7]. It is not clear if this frame-work is applicable to general linguistic expressions. Our research is motivatedto explore the spatial representation satisfying these requirements for intelligentagents.

The structure of the paper is as follows. Section 2 describes an overview ofour prototype system with its architecture. Section 3 proposes the Space objectwhich fulfills the above requirements. In Section 4, we show an example of howthe Space object behaves in the planning process. Finally, Section 5 concludesthe paper and looks at the future work.

2 System Architecture

To achieve the above goal, we are developing a prototype system K2 as a test bed to evaluate our idea. Fig. 1 shows a screen shot of K2. There are two agents and several objects (colored balls and desks) in a virtual world. Through speech input, a user can command the agents to manipulate the objects. The current system accepts simple Japanese utterances with anaphoric and elliptical expressions, such as "Walk to the desk.", "Further". The size of the lexicon is about 100 words. The agent's behavior and the subsequent changes in the virtual world are presented to the user in terms of a three-dimensional animation.

Fig. 1. A screen shot of K2

Fig. 2 illustrates the architecture of the K2 system. The speech recognitionmodule receives the user’s speech input and generates a sequence of words. Thesyntactic and semantic analysis modules analyze the word sequence to extracta case frame. At this stage, not all case slots are necessarily filled, because ofellipses in the utterance. Even in cases there is no ellipsis, instances of objectsare not identified at this stage. Resolving ellipses and anaphora, and identifyingthe instances in the world are performed by the discourse analysis module.

Fig. 2. The system architecture of K2 (modules: speech recognition, syntactic/semantic analysis, discourse analysis, macro planning, micro planning, space recognition, movement generation and rendering; resources: word dictionary, language model, semantic dictionary, utterance history, ontology, plan library, virtual world; intermediate representations: word sequence, case frame, goal, basic movement, SPACE object, coordinate value)

The discourse analysis module extracts the user's goal as well and hands it over to the planning modules, which build a plan to generate the appropriate animation. In other words, the planning modules translate the user's goal into animation data. However, the properties of these two ends are very different and straightforward translation is rather difficult. The user's goal is represented in terms of symbols, while the animation data is a sequence of numeric values. To bridge this gap, we take a two-stage approach: macro and micro planning.

The macro planner adopts a conventional planning framework like STRIPS [4]; that is, given a goal, it generates a sequence of predefined primitive operators. In this case, the planner generates a sequence of basic movements of the agent. For instance, the goal "on(ball#1, desk#2)" would be satisfied by a sequence of basic movements: "go near to the ball", "pick up the ball" and "put the ball on the desk". It is generally difficult to define a set of basic movements, since it depends on the application domain. One approach to this issue is described in [12].

During macro planning, the planner sometimes needs to know the physical properties of objects, such as their size, location and so on. For example, to pick up a ball, the agent first needs to move to a location at which it can reach the ball. In this planning process, the distance between the ball and the agent needs to be calculated. This sort of information is represented in terms of coordinate values of the virtual space, and is handled by the micro planner.

The results of micro planning update the virtual world database, and the update is reflected in the output animation through the rendering module.

3 The Space Object

To interface the macro and micro planning, we propose the Space object to represent a location in the virtual space, with its bilateral character: symbolic and numeric. To realize such a bilateral character, the following two requirements arise for a Space object describing a location.

R1. It can be an argument of logical functions.
R2. It can represent the plausibility of a location.

The requirement R1. comes from the macro planner side. The macro planneruses plan operators described in terms of logical forms, in which a location isdescribed such as InFrontOf(Obj). Such representation needs to be an argumentof another logical function. From the viewpoint of the macro planner, the Spaceobject is designed to behave as a symbolic object by referring to its uniqueidentifier.

The requirement R2. comes from the micro planner side. A location couldhave vagueness and the most plausible place changes depending on the situation.Therefore it should be treated as a certain region rather than a single point.To fulfill this requirement, we adopt the idea of the potential model proposedby Yamada et al. [13], in which a potential function maps a location to itsplausibility. Vagueness of a location is naturally realized as a potential functionembedded in the Space object.


We design the potential function f to satisfy the following two conditions.

C1. It is differentiable throughout the domain.
C2. Its range is 0 to 1.

When the most plausible point is required by the micro planner for generating the animation, the point is calculated by using the potential function with the Steepest Descent Method (SDM). The condition C1. is necessary to adopt the SDM.

In the condition C2., the point with value 1 is defined as the most plausible location and that with value 0 is the least plausible one. This definition makes it possible to translate the logical AND operation on Space objects into the product of the potential function values of the objects. The NOT operation is defined as (1 − f), and the OR operation can be derived from the combination of the AND and NOT operations. As this shows, relations between objects and locations are represented as symbols in the macro planner, and as compositions of potential functions in the micro planner.
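A minimal sketch of this composition, assuming a Space object's potential is exposed as a function of a 2D ground-plane position (x, z); the parameterization is ours, not the paper's.

```python
def p_and(f, g):
    """Logical AND of two Space potentials as the product of their values."""
    return lambda x, z: f(x, z) * g(x, z)

def p_not(f):
    """Logical NOT as 1 - f."""
    return lambda x, z: 1.0 - f(x, z)

def p_or(f, g):
    """OR derived from AND and NOT (De Morgan)."""
    return p_not(p_and(p_not(f), p_not(g)))
```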

Currently, we have defined the potential functions for the following spatial concepts:

– relations "front", "back", "left", "right", "on", "between"
– a place occupied by an object
– a place close to an object

The parameters of a potential function are derived from the size and shape of the objects.

We can also use the Space object to represent more complex spatial con-straints such as a reachableByHand(Agent) location. In this case the potential func-tion reflects the stress on the agent’s arm.

4 Example of Planning with the Space Object

This section describes how the Space object plays a role as a mediator be-tween the symbolic and continuous system through the example introduced inSection 1.

When an utterance “Do you see a ball in front of the desk?” is given in thesituation shown in Fig. 1, the discourse analysis module identifies an instance of“a ball” in the following steps.

(1) space#1 := new inFrontOf(desk#1, viewpoint#1, MIRROR)

(2) list#1 := space#1.findObjects()

(3) ball#1 := list#1.getFirstMatch(kindOf(BALL))

In step (1), an instance of Space is created as an instance of the class inFrontOf. The constructor of inFrontOf takes three arguments: the reference object, the viewpoint, and the axis order (see footnote 3) [6]. There have been several studies on the classification of spatial reference [6, 10, 8]. In this paper, we follow Herskovits's formulation [6] due to its simplicity, in which a reference frame is determined in terms of the above three parameters.

To interpret a speaker’s utterance correctly, it is necessary to identify thereference frame which the speaker used. However, there is no decisive methodto accomplish this, since various factors, such as object properties, precedingdiscourse context and psychological factors are involved. The current prototypesystem adopts a naive algorithm to determine the reference frame based onheuristic rules. In this paper, we focus on the calculation of potential functionsgiven a reference frame.

Suppose the parameters of inFrontOf have been resolved in the preceding steps, and the discourse analysis module chose the mirror axis order and the orientation of the axis based on the viewpoint, as indicated by the light-colored arrows in Fig. 3. While the desk has four potential directions, (1) through (4), only one of them can be the "front" axis of the desk. The closest one to the viewpoint-based "front" axis is chosen as the "front" of the desk; in this example, (1) is chosen. Then, the parameters of the potential function f corresponding to "front" are set as shown in Fig. 4.

The potential function f is defined as given in equation (1). The first, Gaussian factor expresses the expanse of the potential on both sides of the "front" axis, while the second, sigmoid factor reduces the potential of the "back" side region. Since these two factors satisfy the conditions described in Section 3, f also satisfies them.

f(d_1, d_2) = exp( −d_2² / ( b² (d_1 − l_2/2 + l_1/2)² ) ) × 1 / (1 + exp(−a·d_1))    (1)

where
d_1 : value along the "front" axis
d_2 : value along the "left-right" axis
l_1, l_2 : maximum length of the reference object along axes d_1 and d_2
a, b : coefficients
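The sketch below is one literal reading of Eq. (1); the values of the coefficients a and b, and the exact grouping of l_1/2 and l_2/2 inside the Gaussian width, are assumptions recovered from the damaged typesetting rather than values stated by the authors.

```python
import math

def in_front_of_potential(d1, d2, l1, l2, a=2.0, b=1.0):
    """Potential of Eq. (1): a Gaussian that widens along the 'front' axis d1,
    damped behind the object by a sigmoid in d1."""
    width = b * (d1 - l2 / 2.0 + l1 / 2.0)
    gauss = math.exp(-(d2 ** 2) / (width ** 2)) if width != 0 else 0.0
    sigmoid = 1.0 / (1.0 + math.exp(-a * d1))
    return gauss * sigmoid
```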

In step (2), the method findObjects() returns a list of objects located in the potential field of space#1, shown in Fig. 5. The objects in the list are sorted in descending order of the potential value of their locations.

In step (3), the most plausible object satisfying the type constraint (BALL) is selected by the method getFirstMatch().

Fig. 3. Adjustment of axis
Fig. 4. Parameters of f
Fig. 5. Potential field of f

When receiving the next utterance "Put it on the desk.", the discourse analysis module resolves the referent of the pronoun "it" and extracts the user's goal. The macro planner constructs a plan to satisfy the goal as follows:

(1) walk(inFrontOf(ball#1, viewpoint#1, MIRROR) AND reachableByHand(ball#1) AND NOT(occupied(ball#1)))
(2) grasp(ball#1)
(3) put(ball#1, on(desk#1, viewpoint#1, MIRROR)) 4

3 There are two types of axis order, basic and mirror. In the basic order, the axes are ordered clockwise around the origin as "front", "right", "back" and "left". In the mirror order, however, the order is "front", "left", "back" and "right". The mirror order is used when the speaker faces an object.
4 Actually, further constraints are necessary to ensure enough room for the ball.

Walk, grasp and put are defined as basic movements. They are handed over to the micro planner one by one.

The movement walk takes a Space object representing its destination as an argument. In this example, the conjunction of three Space objects is given as the argument. The potential function of the resultant Space is calculated by multiplying the values of the corresponding three potential functions at each point. Fig. 6 illustrates the three potential fields (a) through (c) and the resultant field (d). The agent walks to the location which has the maximum potential value with respect to the field (d).
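The destination selection could be sketched as follows: the composed field (d) is maximized by a simple numerical gradient ascent, standing in for the Steepest Descent Method mentioned in Section 3. The step size, iteration count and the names of the composed potentials are illustrative, not taken from the system.

```python
import numpy as np

def best_location(potential, x0, step=0.05, iters=500, eps=1e-3):
    """Hill-climb a 2D potential field via a finite-difference gradient."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        grad = np.array([
            (potential(x[0] + eps, x[1]) - potential(x[0] - eps, x[1])) / (2 * eps),
            (potential(x[0], x[1] + eps) - potential(x[0], x[1] - eps)) / (2 * eps),
        ])
        x = x + step * grad
    return x

# Hypothetical composed field, reusing p_and/p_not from the earlier sketch:
# destination = p_and(p_and(in_front_of_ball, reachable_by_hand), p_not(occupied))
# target = best_location(destination, x0=[0.0, 0.0])
```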

After moving to the specified location, the movement grasp is performed tograb the ball#1. This movement should succeed because the agent is guaranteedto be at the location from which the ball is reachable.

When putting the ball on the desk, the micro planner looks for a space onthe desk which no other object occupies by composing the potential functionssimilar to the walk step.

As this example illustrates, the Space object effectively plays a role as amediator between the macro and micro planning.

5 Conclusion

This paper proposed a representation of a location in the virtual world. The proposed representation, the Space object, is designed to have a bilateral character in order to bridge the gap between the symbolic system (language processing) and the continuous system (animation generation). Through the implementation of the prototype system K2, in which a user can interact with animated agents in the virtual world, we found the Space object to be a promising candidate for dealing with vagueness of language.


Fig. 6. Composition of potential fields. (a) Potential field of inFrontOf(ball#1, viewpoint#1, MIRROR); (b) potential field of reachableByHand(ball#1); (c) potential field of NOT(occupied(ball#1)); (d) composition of (a)–(c)

Our future research plan includes increasing the number of spatial relations and utilizing potential fields for path planning in the micro planner. For example, introducing a potential field whose value decreases along the direction orthogonal to a wall makes it possible to deal with an expression like "Walk along the wall to the big desk." In addition, a more principled algorithm to disambiguate the reference frame should be incorporated into the system.

References

[1] N. I. Badler, M. S. Palmer, and R. Bindinganavale. Animation control for real-time virtual humans. Communications of the ACM, 42(8):65–73, 1999.
[2] R. Bindinganavale, W. Schuler, J. Allbeck, N. Badler, A. Joshi, and M. Palmer. Dynamically altering agent behaviors using natural language instructions. In Autonomous Agents 2000, pages 293–300, 2000.
[3] J. Cassell, J. Sullivan, S. Prevost, and E. Churchill, editors. Embodied Conversational Agents. The MIT Press, 2000.
[4] R. E. Fikes. STRIPS: A new approach to the application of theorem proving to problem solving. Artificial Intelligence, 2:189–208, 1971.
[5] B. J. Grosz, A. K. Joshi, and P. Weinstein. Centering: A framework for modeling the local coherence of discourse. Computational Linguistics, 21(2):203–226, 1995.
[6] A. Herskovits. Language and Spatial Cognition: An Interdisciplinary Study of the Prepositions in English. Cambridge University Press, 1986.
[7] I. D. Horswill. Visual routines and visual search. In Proceedings of the 14th International Joint Conference on Artificial Intelligence, August 1995.
[8] W. J. M. Levelt. Speaking: From Intention to Articulation. The MIT Press, 1989.
[9] P. Olivier, T. Maeda, and J. Tsujii. Automatic depiction of spatial descriptions. In AAAI 94, pages 1405–1410, 1994.
[10] G. Retz-Schmidt. Various views on spatial prepositions. AI Magazine, 9(2):95–105, 1988.
[11] Y. Shinyama, T. Tokunaga, and H. Tanaka. Processing of 3-D spatial relations for virtual agents acting on natural language instructions. In the Second Workshop on Intelligent Virtual Agents, pages 67–78, 1999.
[12] T. Tokunaga, M. Okumura, S. Saito, and H. Tanaka. Constructing a lexicon of action. In the 3rd International Conference on Language Resources and Evaluation (LREC), pages 172–175, 2002.
[13] A. Yamada, T. Nishida, and S. Doshita. Figuring out most plausible interpretation from spatial description. In the 12th International Conference on Computational Linguistics (COLING), pages 764–769, 1988.


Motion Path Synthesis for Intelligent Avatar

Feng Liu1 and Ronghua Liang1,2

1 College of Computer Science, Zhejiang University
[email protected]
2 Institute of VR and Multimedia, Hang Zhou Inst. of Electronics
[email protected]

Abstract. In this paper, we present a new motion generation technique, called motion path synthesis. The objective is to generate a motion for an intelligent avatar to move along a planned route using a pre-existing motion library. First, motion primitives, each defined as a dynamic motion segment, are extracted from the motion library. Then a motion graph is constructed based on the motion primitives and their connectivities. Within this motion graph, the desired realistic motion for avatars can be synthesized through a two-stage process: searching for an optimal motion path within the motion graph, then joining the motion path and adapting it to the route. The experiment shows the effectiveness of the presented technique.

1 Introduction

Creating a realistic motion along a planned route is often an important task in making an intelligent virtual avatar. As people are well qualified to discern artificiality in the motion of human-like avatars, both effort and expertise are needed to create realistic motions by key framing, and the large number of DOFs (degrees of freedom) of avatars makes it even more difficult.

A recent popular solution to this problem is motion capture [1]: the required motion along a given route is performed by an actor and then recorded to drive avatars. However, motion capture data is hard to adapt to different routes. Two kinds of methods have been presented to improve the re-use of motion capture data. One is to adapt existing motions to new requirements through interactive control, such as constraint-based methods [2, 3], motion retargeting [4], etc. A more attractive way is to synthesize new motions from examples, that is, to generate motions by selecting and joining existing motions along a specified path or according to statistical distributions. Most presented techniques, however, are limited by the size of the motion library, and the results lack variation.

In this paper, we present a new example-based motion synthesis technique for creating motions for intelligent avatars to move along a planned route. Each motion is considered to be composed of a series of motion primitives, the minimal elements that embody the dynamics of a motion, and is modelled as a first-order Markov process. Then a directed graph, called a motion graph, can be constructed from the motion library, with each motion primitive as a vertex. New motion can be generated by synthesizing primitives along a constrained or specified path within the motion graph. Since each new motion is created from segmented motion capture data (motion primitives), the reality of each motion is preserved. And as motion primitives are used in synthesizing new motions instead of full-length motion clips, the promising motion space is well enlarged.

The remainder of the paper is organized as follows: in the next section, wegive an overview on related work. In Section 3, we describe motion graph con-struction. In Section 4, we propose motion path synthesis for intelligent avatars.We show the result in Section 5 and conclude the paper in the last section.

2 Related Work

Fruitful work has been carried out to adapt motion capture data for avatarsto new application. The early research mostly aimed at providing convenienttools for interactive motion editing. A constraint based method was proposedfor editing a pre-existing motion such that it meets new needs yet preserves theoriginal quality as much as possible [2, 3]. A similar technique was presented foradapting a motion from one character to another [4]. Other researchers, such asBruderlin and Williams [5], apply techniques from image and signal-processingdomain to designing, modifying and adapting animated motion. Similarly, Liu etal.[6] provided a series of tools for editing motion at a high level by introducingwavelet transformation into motion analysis and synthesis. Unuma et al.[7] de-scribed a method for modelling human figure locomotion with emotions. Brandand Hertzmann [8] proposed a style machine to produce new motion with thedesired feature.

More recently, a fantastic approach, motion synthesis by example, is pre-sented. Molina-Tanco and Hilton [9] presented a system that can learn a statis-tical model from motion capture data and synthesize new motions by specifyingthe start and end key-frames, and sampling the original captured sequence ac-cording to the statistical model to generate novel sequences between key-frames.Pullen and Bregler [10] presented a motion capture assisted animation, whichcan derive realistic motions from partially key-framed motion using motion cap-ture data. Arikan and Forsythe [11] constructed a hierarchical graph from amotion library and adopted a randomized search algorithm to extract motionaccording to user constraints. Kovar et al. [12] build a similar motion graph thatencapsulates connections among the motion library and synthesize motions bybuilding walks on the graph. In the work of Li et al.[13], motion data is dividedinto motion textons, each of which can be modelled as a linear dynamic sys-tem. Motions are synthesized by considering the likelihood of switching fromone texton to the next.

3 Motion Graph

Each motion is represented as a frame sequence, with each frame defining a posture using a popular hierarchical posture description, in which the posture of an articulate figure is specified by its joint configurations together with the orientation and position of Root as follows:

F_t = (T_Root(t), Q_Root(t), Q_1(t), Q_2(t), ..., Q_n(t))

where T_Root(t) and Q_Root(t) are the translation vector and rotation vector of Root at time t, and Q_i(t) is the rotation vector of joint i around its parent joint at time t.
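A minimal sketch of this frame representation (the field names are ours):

```python
from dataclasses import dataclass
from typing import List, Tuple

Vec3 = Tuple[float, float, float]

@dataclass
class Frame:
    """One posture F_t: root translation/rotation plus per-joint rotations."""
    t_root: Vec3            # T_Root(t), global translation of the root
    q_root: Vec3            # Q_Root(t), global rotation of the root
    q_joints: List[Vec3]    # Q_1(t)..Q_n(t), rotation of each joint w.r.t. its parent

# A motion is simply an ordered frame sequence
Motion = List[Frame]
```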

A directed graph, called a motion graph, as shown in Fig. 1, is built to capture the connections among motions (motion segments) in a motion library. The motion graph is a newly arisen structure for motion description, planning and generation. Each vertex is associated with a motion segment, varying from a single frame to a full-length motion, and the edge from vertex i to vertex j is associated with a weight denoting the connectivity from the motion segment associated with vertex i to that associated with vertex j. In Arikan and Forsyth [11], each vertex is a single frame, which enlarges the motion graph and the promising motion space. Its drawback, however, is obvious in that a large number of vertices in the graph causes high calculation overhead, and the realism of the original motion is damaged severely because the selected neighbor frames do not necessarily have smooth transitions. When it comes to associating each vertex with a long motion segment or even a full-length motion, the realism of the original motion is preserved at the cost of a small promising motion space. We adopt a scheme similar to Li et al. [13], and associate each vertex with a motion segment. Unlike Li et al. [13], in which each motion segment is replaced completely with a dynamic model, called a motion texton, we preserve all the original frames in the motion segment/primitive and synthesize new motions with them. Thus the detail and reality of the motion capture data is well preserved.

Fig. 1. Motion graph

3.1 Motion Primitive Extraction

A motion primitive is defined as a fundamental segment that captures the dynamics of a motion and can be well fitted to a quadratic dynamic model of the following form: F_t = F_0 + A t^2 + B t, where F_0 and F_t are the initial and t-th frames, and A and B are dynamic parameters.

In this paper, we model an articulated avatar with the global position, orientation and 17 other joints. Thus the posture can be represented as a 57-dimensional vector and a motion M with T frames can be described as a T×57 matrix. To reduce the high dimensionality, we adopt SVD to decompose M and obtain the principal motion components: M_{T×57} = U_{T×T} Λ_{T×57} V_{57×57}. Let U_{1−q} be the first q columns of U; it comprises the principal components of M. A greedy algorithm is employed to extract motion primitives (a sketch of this procedure is given after the following steps):

1. Fetch a motion M from the library, decompose it using SVD and get U_{1−q}.
2. Fetch T_min frames from U_{1−q}, and fit them to a dynamic model using the least square method. Add subsequent frames to them until the fitting error exceeds a given threshold.
3. Fetch the next T_min frames from U_{1−q}, and extract a new motion primitive MP_i in the way described in step 2.
4. Repeat step 3 until all frames in M are processed.
5. Repeat steps 1 to 4 until all motions in the library are processed.
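For illustration only, the following Python sketch shows one way this greedy extraction could be realized. The function names and the default values of q, T_min and the error threshold are our own assumptions (the paper does not fix them); numpy is assumed to be available.

    import numpy as np

    def principal_components(M, q):
        # M: T x 57 motion matrix; return U_{1-q}, the first q columns of U from the SVD.
        U, _, _ = np.linalg.svd(M, full_matrices=False)
        return U[:, :q]

    def fit_error(segment):
        # Least-squares fit of F_t = F_0 + A*t^2 + B*t to the segment; return mean residual.
        T = len(segment)
        t = np.arange(T, dtype=float)
        F0 = segment[0]
        basis = np.stack([t ** 2, t], axis=1)                 # T x 2
        coeff, _, _, _ = np.linalg.lstsq(basis, segment - F0, rcond=None)
        return float(np.mean(np.abs(F0 + basis @ coeff - segment)))

    def extract_primitives(M, q=5, t_min=10, threshold=0.05):
        # Greedily grow motion primitives over the principal components of one motion;
        # the original frames of M are kept for each primitive.
        U = principal_components(M, q)
        primitives, start = [], 0
        while start < len(U):
            end = min(start + t_min, len(U))
            while end < len(U) and fit_error(U[start:end + 1]) <= threshold:
                end += 1
            primitives.append(M[start:end])
            start = end
        return primitives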

3.2 Motion Graph Construction

Inspired by previous work, such as Li et al. [13] and J. Lee et al. [14], we model each motion as a first-order Markov process. Each motion primitive is a state and the next state depends only on the current state. Unlike J. Lee et al. [14], each state is a motion primitive instead of a single frame, which makes earlier states have little effect on the current state and thus makes this motion model more plausible. Each motion in the library can be modelled as a first-order Markov process, and thus the motion library can be represented as a motion graph. Each vertex is a motion primitive, and the directed edge e_ij from MP_i to MP_j is defined as the connectivity from MP_i to MP_j, P(MP_j | MP_i), which can be calculated as the similarity between the last L frames of MP_i and the L frames preceding MP_j. The similarity between two frame sequences is defined as the weighted sum of the similarity between the two corresponding Root trajectories and that between the two posture sequences as follows:

sim(M_i, M_k) = \beta\, sim_t(M_i, M_k) + (1 - \beta)\, sim_p(M_i, M_k)

sim_t(M_i, M_k) = \exp\Big( -\min_{R,T} \frac{1}{L} \sum_{t=1}^{L} \big[ \alpha\,((R M^{p}_{it} + T) - M^{p}_{kt})^2 + (1 - \alpha)\,(R M^{o}_{it} - M^{o}_{kt})^2 \big] \Big)

sim_p(M_i, M_k) = \exp\Big( -\frac{1}{L} \sum_{t=1}^{L} \sum_{j} \omega_j\,(M^{j}_{it} - M^{j}_{kt})^2 \Big)

where sim(M_i, M_k) is the similarity between M_i and M_k, sim_t(M_i, M_k) is the similarity between the two Root trajectories, and sim_p(M_i, M_k) is the similarity between the two posture sequences. M^j_{it} is the rotation of joint j in frame t of M_i, and ω_j is a weight indicating the significance of joint j. M^p_{it} and M^o_{it} are the position and orientation of the Root in frame t of M_i, respectively. To avoid the disturbance of the initial orientation and position of a motion, we calculate the optimal sim_t(M_i, M_k) by transforming M_i globally with a rotation R and a translation T. The optimal R and T can be found using the algorithm detailed in [15]. The result is normalized such that \sum_{j=1}^{N} e_{ij} = 1. If no predecessor of MP_j is found, the connectivity is set to 0.5 empirically.
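The following Python fragment is a sketch of how this similarity could be computed, assuming the optimal alignment R and T from [15] is already available and that the caller passes the two L-frame windows (the last L frames of MP_i and the L frames preceding MP_j); all names are hypothetical.

    import numpy as np

    def sim_posture(Ji, Jk, w):
        # Ji, Jk: L x J joint rotations; w: per-joint significance weights.
        return float(np.exp(-np.sum(w * (Ji - Jk) ** 2) / len(Ji)))

    def sim_trajectory(Pi, Oi, Pk, Ok, R, T, alpha=0.5):
        # Pi, Pk: L x 3 Root positions; Oi, Ok: L x 3 Root orientations.
        # R, T: optimal global rotation/translation aligning M_i to M_k (from [15]).
        pos = np.sum(((Pi @ R.T + T) - Pk) ** 2, axis=1)
        ori = np.sum((Oi @ R.T - Ok) ** 2, axis=1)
        return float(np.exp(-np.mean(alpha * pos + (1.0 - alpha) * ori)))

    def similarity(Mi, Mk, R, T, w, alpha=0.5, beta=0.5):
        # Weighted combination of trajectory and posture similarity.
        return beta * sim_trajectory(Mi["root_pos"], Mi["root_ori"],
                                     Mk["root_pos"], Mk["root_ori"], R, T, alpha) \
             + (1.0 - beta) * sim_posture(Mi["joints"], Mk["joints"], w)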

4 Motion Synthesis

We generate the desired motion along a planned route R through a two-stage process: first, search a motion path within the motion graph; then, joint the motion primitives on the extracted motion path and adapt the result to the route R.


4.1 Motion Path Finding

The optimal motion path shall satisfy the following two criteria: fitting the route and having natural transitions. We define the motion path MS as follows:

MS = \arg\max_{\Pi} P(MP_{k_1}, MP_{k_2}, \ldots, MP_{k_n} \mid MP_{k_1} = MP_s, G, R)

where \Pi is the set of motion primitive sequences beginning with MP_s within the motion graph G. We can transform this problem into a path-finding problem in a graph based on the first-order Markov process as follows:

MS = \arg\max_{\Pi} P(MP_{k_1}, MP_{k_2}, \ldots, MP_{k_n} \mid MP_{k_1} = MP_s, G, R)
   = \arg\max_{\Pi} P(MP_{k_2} \mid MP_s)\,P(MP_{k_2} \mid R, S_{k_2}) \cdots P(MP_{k_n} \mid MP_{k_{n-1}})\,P(MP_{k_n} \mid R, S_{k_n})
   = \arg\max_{\Pi} \lg\big[ P(MP_{k_2} \mid MP_s)\,P(MP_{k_2} \mid R, S_{k_2}) \cdots P(MP_{k_n} \mid MP_{k_{n-1}})\,P(MP_{k_n} \mid R, S_{k_n}) \big]
   = \arg\min_{\Pi} -\big( \lg P(MP_{k_2} \mid MP_s)\,P(MP_{k_2} \mid R, S_{k_2}) + \cdots + \lg P(MP_{k_n} \mid MP_{k_{n-1}})\,P(MP_{k_n} \mid R, S_{k_n}) \big)

where MP_s is the specified start motion primitive, P(MP_{k_j} | MP_{k_i}) is the transition from MP_{k_i} to MP_{k_j}, and P(MP_{k_i} | R, S_{k_i}) is the fitness of MP_{k_i} to the path segment on R starting at the position S_{k_i}, defined as a decreasing function of the total change needed to adapt MP_{k_i} to the path segment. The adaptation process is detailed in Section 4.2.

We construct a hierarchical directed graph as shown in Fig. 2. The start motion primitive is selected as vertex V0. Let the number of all motion primitives be n; we set the other vertices V_ij = MP_j, where V_ij is the j-th vertex at level i, j ∈ [1, n]. The weight of each edge is -lg P(MP_{k_i} | MP_{k_{i-1}}) - lg P(MP_{k_i} | R, S_{k_i}), calculated at runtime. With this graph, the problem can be transformed into finding a shortest path. Because no evident end vertex is present and the number of vertices in the graph is infinite, the traditional Dijkstra algorithm cannot be adopted directly, so we devise an adapted Dijkstra algorithm. Let S be the set of vertices whose final shortest path weights starting at V0 have been determined, and Q be the set of vertices whose best estimates of the shortest path weights have been calculated. The adapted Dijkstra algorithm can be outlined as follows:

Fig. 2. Hierarchical graph for motion path finding.

1) Initialization:
   V0.d = 0                  // shortest path weight from V0 to itself
   V0.p_end = P0             // P0 is the start location of the route R
   Add(V0, Q)                // add V0 to Q

2) While the end location of R has not been approximated closely enough:
   Vc = Extract-Min(Q)       // fetch the vertex with the minimal shortest path weight in Q
   Add(Vc, S)                // add Vc to S
   Remove(Vc, Q)             // remove Vc from Q
   for each vertex Vi in the level next to Vc
       Vi.pre = Vc           // set Vc as the predecessor of Vi
       Vi.d = Vc.d + (-lg(P(Vi|Vc) * P(Vi|R, Vc.p_end)))
                             // shortest path weight from V0 to Vi
       Vi.p_end = Reach-end(R, Vc.p_end, Vi)
                             // end location after the shortest path from V0 to Vi
                             // has been adapted to R
       if Vi is already in Q
           update Vi in Q
       else
           Add(Vi, Q)
       endif
   endfor
endwhile

The calculation of Reach-end(R, Vc.p_end, Vi), the end location after the shortest path from V0 to Vi has been adapted to R, is similar to the process for jointing motion primitives in Section 4.2. The motion path is obtained by starting from the last vertex added to S and tracing its predecessors in S back up to the start vertex.
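A compact Python sketch of the adapted Dijkstra search follows. The transition, fitness and Reach-end computations are passed in as callables, and start_pos, close_enough and the other names are our own; the sketch only illustrates the best-first expansion and back-tracing described above.

    import heapq, math, itertools

    def adapted_dijkstra(primitives, start_prim, start_pos, route,
                         transition, fitness, reach_end, close_enough):
        # primitives: iterable of motion primitive ids; start_prim = MP_s; start_pos = P0.
        # transition(a, b)         -> P(b | a)        (edge connectivity)
        # fitness(b, route, pos)   -> P(b | R, pos)   (fit of b to the route at pos)
        # reach_end(route, pos, b) -> position on the route after adapting b
        # close_enough(pos)        -> True once the end of the route is approximated.
        tie = itertools.count()                      # tie-breaker for the heap
        frontier = [(0.0, next(tie), start_pos, start_prim, None)]
        goal = None
        while frontier:
            d, _, pos, prim, parent = heapq.heappop(frontier)
            node = (prim, parent)
            if close_enough(pos):
                goal = node
                break
            for nxt in primitives:
                p = transition(prim, nxt) * fitness(nxt, route, pos)
                if p <= 0.0:
                    continue                         # skip zero-probability edges
                heapq.heappush(frontier, (d - math.log10(p), next(tie),
                                          reach_end(route, pos, nxt), nxt, node))
        path = []                                    # trace predecessors back to the start
        while goal is not None:
            prim, goal = goal
            path.append(prim)
        return list(reversed(path))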

4.2 Motion Primitives Jointing and Adapting

Let M be the motion resulting from jointing all the motion primitives on the motion path before MP_{i+1}, the current motion primitive to be connected to M, and let PS be the end location of M. We devise the following motion transition and adaptation algorithm to append MP_{i+1} to M.

1. Fetch the first frame from MP_{i+1} as the appended frame.
2. Translate the appended frame to locate its Root at PS.
3. Calculate the tangent of R at PS, and orient the appended frame parallel to the tangent. If the appended frame is the first frame of MP_{i+1}, replace the last frame of M with it; otherwise append it to M.
4. If no frames are left in MP_{i+1}, stop. Otherwise fetch the next frame as the appended frame, and move PS along R by the distance from the appended frame to its previous frame in MP_{i+1}. Go to step 2.
5. Smooth M with a Gauss convolution template G:

\begin{bmatrix} M(m-T_b) \\ M(m-T_b+1) \\ \vdots \\ M(m+T_b) \end{bmatrix}
=
\begin{bmatrix}
G_0(0) & G_0(1) & \cdots & G_0(2T_b) \\
G_1(-1) & G_1(0) & \cdots & G_1(2T_b-1) \\
\vdots & \vdots & \ddots & \vdots \\
G_{2T_b}(-2T_b) & G_{2T_b}(1-2T_b) & \cdots & G_{2T_b}(0)
\end{bmatrix}
\begin{bmatrix} M(m-T_b) \\ M(m-T_b+1) \\ \vdots \\ M(m+T_b) \end{bmatrix},
\qquad
G_i(t) = \frac{p(t)}{\sum_{j=-i}^{2T_b-i} p(j)}

where m is the number of frames in M before appending MP_{i+1}, [m−T_b, m+T_b] is the range for smoothing, and p(t) is the standard normal distribution.
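A sketch of this smoothing step in Python follows (numpy assumed). Since p(t) enters only up to the row-wise normalization, we use an unnormalized Gaussian with σ = 1 (matching the standard normal); the function name and the in-place update are our own choices.

    import numpy as np

    def gauss_smooth_transition(M, m, Tb, sigma=1.0):
        # Smooth the frames M(m-Tb) .. M(m+Tb) around the transition frame m with the
        # row-normalized Gaussian convolution template G of step 5.
        def p(t):                                    # Gaussian weight (sigma = 1: standard normal)
            return np.exp(-0.5 * (t / sigma) ** 2)
        window = M[m - Tb:m + Tb + 1].copy()
        smoothed = np.empty_like(window)
        for i in range(2 * Tb + 1):
            offsets = np.arange(-i, 2 * Tb - i + 1)  # row i: G_i(t) for t = -i .. 2Tb - i
            weights = p(offsets) / p(offsets).sum()
            smoothed[i] = weights @ window
        M[m - Tb:m + Tb + 1] = smoothed
        return M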


Table 1. Composition of the motion library

Motion        Number of frames    Number of motion primitives
Normal Walk   136                 7
Cat Walk      121                 8

Fig. 3. Path synthesis.

5 Experiment

To verify the effectiveness of the presented technique, we build a small motion library as shown in Table 1. We extract motion primitives from this library, and the number of motion primitives extracted from each motion is also shown in Table 1. A motion graph is constructed using these motion primitives. Below we show some motions synthesized within this motion graph.

First, a circular route is specified, and a natural motion along this route is synthesized. This motion consists of 477 frames; the result is shown in Fig. 3(a). Second, a sine curve is sketched. Again, a smooth motion along this route is generated. This motion consists of 1020 frames; the result is shown in Fig. 3(b).

From these two examples, we can see that each generated motion contains far more frames than the original motion library. Some motion primitives are used frequently to generate the desired motion, which demands new transitions between pairs of motion primitives that are not connected in the original motions. Within the presented motion generation framework, we propose two strategies to achieve smooth transitions: selecting the motion primitive sequence which is most likely to have natural transitions between every two neighboring motion primitives, and smoothing every two neighboring motion primitives with a Gauss convolution template. The results above show that these two strategies do help to produce realistic motions. We adopt a frame-by-frame strategy to adapt the motion primitive sequence to the route. To preserve the realism of the motion as much as possible, we incorporate a least-change strategy into the motion primitive sequence extraction. The results show that the generated motions fit the route well; meanwhile, these generated motions remain realistic, though somewhat different from the original ones.


6 Conclusion

In this paper, we present a new example-based motion synthesis technique to create motions for intelligent avatars moving along a planned route. The main contribution of this paper is a new motion graph built from motion capture data. Because we adopt motion primitives, segments of the original motions, as the vertices, and use them to create new motions, our motion graph preserves the realism of the original motion yet provides a large promising motion space. Another contribution is an efficient way to generate motions for intelligent avatars: given a planned route, the motions can be produced automatically.

A limitation of this technique is that we use the motion primitives to construct the motion graph directly, which results in high computational overhead when searching for the optimal motion path in a large motion library. This drawback may restrict the application of our method in real-time settings when the motion library is large. We will focus our future work on this issue.

References

1. Michael Gleicher: Animation from observation: Motion capture and motion editing. Computer Graphics, 1999, 4(33): 51–55
2. Michael Gleicher: Motion editing with spacetime constraints. In: Proceedings of the 1997 Symposium on Interactive 3D Graphics. Providence, 1997. 139–148
3. Jehee Lee, Sung Yong Shin: A hierarchical approach to interactive motion editing for human-like figures. In: Proceedings of SIGGRAPH 99. Los Angeles, 1999. 39–48
4. Michael Gleicher: Retargeting motion to new characters. In: Proceedings of SIGGRAPH 98. Orlando, Florida, 1998. 33–42
5. Bruderlin, A., Williams, L.: Motion signal processing. In: Proceedings of SIGGRAPH 95. Los Angeles, 1995. 97–104
6. Feng Liu, Yueting Zhuang, Zhongxiang Luo, Yunhe Pan: A hybrid motion data manipulation: Wavelet-based motion processing and spacetime rectification. In: Proceedings of IEEE PCM 2002. Hsinchu, Taiwan, 2002. 743–750
7. Unuma, M., Anjyo, K., Takeuchi, R.: Fourier principles for emotion-based human figure animation. In: Proceedings of SIGGRAPH 95. Los Angeles, 1995. 91–96
8. Matthew Brand, Aaron Hertzmann: Style machines. In: Proceedings of SIGGRAPH 2000. New Orleans, 2000. 183–192
9. L. Molina Tanco, A. Hilton: Realistic synthesis of novel human movements from a database of motion capture examples. In: Proceedings of the IEEE Workshop on Human Motion. Austin, Texas, 2000. 137–142
10. Katherine Pullen, Christoph Bregler: Motion capture assisted animation: Texturing and synthesis. In: Proceedings of SIGGRAPH 2002. San Antonio, Texas, 2002. 501–508
11. Okan Arikan, D.A. Forsyth: Interactive motion generation from examples. In: Proceedings of SIGGRAPH 2002. San Antonio, Texas, 2002. 483–490
12. Lucas Kovar, Michael Gleicher, Frédéric Pighin: Motion graphs. In: Proceedings of SIGGRAPH 2002. San Antonio, Texas, 2002. 473–482
13. Yan Li, Tianshu Wang, Heung-Yeung Shum: Motion texture: A two-level statistical model for character synthesis. In: Proceedings of SIGGRAPH 2002. San Antonio, Texas, 2002. 465–472
14. Jehee Lee, Jinxiang Chai, Paul S. A. Reitsma, Jessica K. Hodgins, Nancy S. Pollard: Interactive control of avatars animated with human motion data. In: Proceedings of SIGGRAPH 2002. San Antonio, Texas, 2002. 491–500
15. A. Eden: Directable Motion Texture Synthesis. Technical report, Harvard University, April 2002.


“Is It Within My Reach?” – An Agent's Perspective

Zhisheng Huang, Anton Eliens, and Cees Visser

Intelligent Multimedia Group, Vrije Universiteit Amsterdam, De Boelelaan 1081, 1081 HV Amsterdam, The Netherlands

{huang,eliens,ctv}@cs.vu.nl

Abstract. This paper investigates three levels of reaching actions for intelligent virtual agents: reach by hand, reach by body, and reach by move. 'Reach by hand' is discussed as a typical inverse kinematics problem which involves the arm joints. 'Reach by body' is examined as a decision-making problem for the involved joints. 'Reach by move' is investigated as a search problem with a set of rational postulates for modeling the agent's knowledge/beliefs to reach objects by moving. The paper also discusses how the approaches can be tested and implemented for a virtual agent platform which is based on the distributed logic programming language DLP and VRML/X3D, the standard Web3D technology.

1 Introduction

Intelligent virtual agents are autonomous agents which interact with virtual worlds. In order to increase the efficiency of the interaction, intelligent virtual agents may be provided with the ability to reach for, and to interact with, graphically modeled objects or other virtual agents. These kinds of reach actions include touch, grasp, hold, pinch, hit, etc. Reach actions are typically considered an inverse kinematics problem in computer graphics and robotics. Inverse kinematics is the process of finding the rotations for a chain of joints so that the end effector of the joint chain reaches a given position or a given pose. Actually, the realization of reach actions involves not only inverse kinematics, but also more complicated reasoning about situations, which may require decision-making procedures and planning with knowledge.

For different levels of reachability, agents may need different levels of information and knowledge for computation, reasoning and planning. There exist at least the following three levels of reachability problems.

– Reach by hand. If the object/position is reachable within the arm reach area, this reaching problem can be solved by inverse kinematics which involves only a chain with three arm joints: shoulder, elbow, and wrist, or additional finger joints for more flexible hand shapes, like grasping.

– Reach by body. If the object/position is not reachable within the arm reach area, but still reachable within the body reach area, the agent can change the torso posture to reach it by hand. This reaching problem can also be considered as an inverse kinematics problem which involves more joints, like sacroiliac, hip, and knee. However, more joints in a chain would significantly increase the complexity of inverse kinematics computations. An alternative is the decision-making approach. Under this approach, the agent can reason about the situation to find a proper list of involved joints. For instance, the agent may only bend the torso to reach an object on a table, or may squat down to reach an object on the ground.

– Reach by move. If the object/position is not reachable within the body reach area, the agent has to move to a position from where the object becomes reachable by hand or by body. This reaching problem involves reasoning about situations to find a proper reachable position; we call it the reach-with-knowledge approach. It may also involve planning to search for the shortest path to the reachable area.

This paper investigates the reaching problem at the above-mentioned three levels for virtual agents, in particular for web-based virtual agents. The solutions for web-based virtual agents emphasize satisfactory performance by trading off some level of realism. We will also show how the approach of reaching with knowledge can be used for web-based virtual agents.

The avatars of web-based virtual agents are often built in the Virtual Reality Modeling Language (VRML) or X3D, the new generation of VRML (http://www.web3d.org). The reaching approach proposed in this paper is implemented and tested in the virtual agent platform [5] which is based on VRML/X3D and the distributed logic programming language DLP [4] (http://www.cs.vu.nl/~eliens/projects/logic/index.html). The avatars of 3D web agents are usually humanoid-like. The humanoid animation working group (http://www.h-anim.org) proposes a specification, called the H-anim specification, for the creation of libraries of reusable humanoids in Web-based applications. H-anim specifies a standard way of representing humanoids in VRML. We have implemented a scripting language, STEP, for H-anim based virtual agents by using DLP [7,8] (http://wasp.cs.vu.nl/step).

This paper is organized as follows: Section 2 is a brief introduction to the DLP+VRML platform. Section 3 discusses reaching by hand and shows how it can be solved in STEP. Section 4 investigates the reaching problem by body. Section 5 examines reaching by moving, proposing a theoretical model for reasoning about situations and reachability. Section 6 concludes the paper.

2 DLP+VRML: A Platform for Intelligent Virtual Agents

DLP [4] combines logic programming, object-oriented programming and parallelism. DLP has been used as a tool for web agents, in particular for 3D web agents [6]. DLP incorporates object-oriented programming concepts, which make it a useful tool for programming. The language accepts the syntax and semantics of logic programming languages like Prolog. It is a high-level declarative language suitable for the construction of distributed software architectures in the domain of artificial intelligence. In particular, it is a flexible language for rule-based knowledge representation. DLP has been extended with a run-time library for the VRML EAI. The typical predicates for the manipulation of virtual worlds in DLP are the get/set predicates, like getPosition(Object, X, Y, Z), which gets the position 〈X, Y, Z〉 of the Object in the virtual world.

STEP is a scripting language for embodied agents, in particular for their non-verbal acts like gestures and postures [7]. The design of STEP was motivated by the following principles: convenience, compositional semantics, re-definability, parametrization, and interaction. The principle of convenience implies that STEP uses some natural-language-like terms for 3D graphical references. The principle of compositional semantics states that STEP has a set of built-in action operators. The principle of re-definability means that STEP incorporates a rule-based specification system. The principle of parametrization justifies STEP's Prolog-like syntax. The principle of interaction requires that STEP be based on a more powerful meta-language, like DLP.

STEP is a scripting language for H-anim based virtual agents. Turn and move are the two main primitive actions for body movements in STEP. Scripting actions can be composed using the following composite operators: the sequence operator 'seq' and the parallel operator 'par'. Using high-level interaction operators, scripting actions can directly interact with internal states of embodied agents or with external states of worlds. These interaction operators are based on a meta-language which is used to build embodied agents. The typical higher-level interaction operators in STEP are the do-operator do(φ), which executes a goal φ in the meta-language, and the conditional operator.

3 Reaching by Hand

Reaching by hand is an inverse kinematics problem which involves the calculation of the rotations of the arms and wrists of embodied agents so that their hands can touch an object. As discussed in [10], many research efforts deal with this kind of problem. Finding solutions usually involves complex computations, like solving differential equations or applying particular non-linear optimizations [1,10]. A lot of work has been done on this issue. What we want to point out is the possibility of defining the scripting actions for reaching by hand for virtual agents in STEP. The solution in STEP is suitable for web-based virtual agents, applying to some extent a performance/realism trade-off.

In [7] we have defined the scripting action 'touch' based on inverse kinematics. (This section is only a brief outline; see [7] for more details.) A simplified 'touch' problem can be described as follows: given an agent Agent and a position 〈x0, y0, z0〉 of an object, try to set the rotations of the joints of the shoulder and the elbow so that the hand of the agent touches exactly that position if the position is reachable. Suppose that the length of the upper arm is u, the length of the forearm is f, and the distance between the shoulder center 〈x3, y3, z3〉 and the destination position 〈x0, y0, z0〉 is d. The position 〈x0, y0, z0〉 is reachable if and only if d ≤ u + f, if we ignore the upper and lower limits of the joint rotations. From the cosine law we know that if the object is reachable, then α, the angle between the upper arm and the forearm, can be calculated from the edges u, d, and f. Furthermore, if v is the direction vector which points to the destination position from the shoulder center, v0 the default direction vector of the arm, and v1 the destination direction vector of the upper arm (Fig. 1(a)), then the angle β between the vectors v and v1 can be computed similarly.

The cross product v0 × v, i.e. a normal vector n = 〈xn, yn, zn〉, can be considered as a normal vector for v0 and v1 which defines the plane in which the arm turns from its default rotation to the destination rotation. We require that the vector v1 lies in the same plane as the vectors v and v0 so that the arm turns to the destination position via a shortest path. The angle γ between v0 and v can be calculated with vector predicates. Thus, the rotation for the elbow joint is 〈xn, yn, zn, π−α〉, and the rotation for the shoulder joint is 〈xn, yn, zn, γ−β〉. Here is part of the definition of the action 'touch' in STEP:

script(touch(Agent, position(X0,Y0,Z0),l),Action):-
    Action = seq([getABvalue(Agent,position(X0,Y0,Z0),l,A,B),
                  do(R1 is 3.14-A),
                  getVvalue(Agent,position(X0,Y0,Z0),l,V),
                  get_arm_vector(Agent,l,V0),
                  do(vector_cross_product(V0,V,vector(X3,Y3,Z3),C)),
                  do(R2 is C-B),
                  par([turn(Agent,l_shoulder,rotation(X3,Y3,Z3,R2),fast),
                       turn(Agent,l_elbow,rotation(X3,Y3,Z3,R1),fast),
                       turn(Agent,l_wrist,rotation(X3,Y3,Z3,-0.5),fast)])]).

Several touch situations based on this scripting action are shown in Fig. 1(b). The tests show that the proposed approach is well suited for web-based virtual agents that involve reaching by hand [7].

Fig. 1. Touch a Ball
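The cosine-law computation above can also be illustrated numerically with the following Python sketch (numpy assumed). It is not the authors' STEP/DLP implementation, merely an illustration; the function name and the axis-angle return format are our own.

    import numpy as np

    def touch_rotations(shoulder, target, v0, u, f):
        # shoulder, target: 3D points; v0: default direction vector of the arm;
        # u, f: lengths of the upper arm and forearm.
        v = np.asarray(target, float) - np.asarray(shoulder, float)
        d = np.linalg.norm(v)
        if d > u + f:
            return None                                        # not reachable by hand
        def clip(x):
            return np.clip(x, -1.0, 1.0)
        alpha = np.arccos(clip((u*u + f*f - d*d) / (2*u*f)))   # angle upper arm / forearm
        beta  = np.arccos(clip((u*u + d*d - f*f) / (2*u*d)))   # angle between v and v1
        v0 = np.asarray(v0, float)
        gamma = np.arccos(clip(np.dot(v0, v) / (np.linalg.norm(v0) * d)))
        n = np.cross(v0, v)                                    # normal of the turning plane
        n = n / np.linalg.norm(n)
        shoulder_rot = (*n, gamma - beta)                      # rotation <xn, yn, zn, gamma - beta>
        elbow_rot    = (*n, np.pi - alpha)                     # rotation <xn, yn, zn, pi - alpha>
        return shoulder_rot, elbow_rot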

4 Reaching by Body

If the object/position is not reachable by hand, the agent can try to reach it by changing the torso posture. The torso flexion and extension control the forward and backward bending of the torso, whereas the torso side bending leads to the left and right side bending of the torso [11]. This reaching problem can also be considered as an inverse kinematics problem which involves more body joints, like sacroiliac, hip, and knee. However, more joints in a chain would significantly increase the complexity of the computations, so this is not suitable for web-based intelligent virtual agents. An alternative is to use a decision-making approach. Under this approach, the agent can reason about situations to find a proper list of involved joints. For instance, the agent may only bend the torso to reach an object on a table, or may squat down to reach an object on the ground. The problem of reaching by body can be specified as a rule-based decision-making model. For example, a simplified reaching model can be specified as a set of decision-making rules as follows:

– if the object is lower than the agent's knee, then the agent should squat to reach it;
– if the object is higher than the hip and lower than the head, then the agent should turn the torso to reach it;
– if the object is higher than the head and is not reachable by hand, the agent should jump to reach it.

An improved decision-making model for reaching by body would consider more comprehensive scenarios, like torso side bending and reasoning about which hand, or whether both hands, should be used. Once it has been decided which action should be taken to reach the object, the agent should take the intended action. The reach actions can be defined as scripting actions which calculate the concrete rotations for the joints involved.
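As an illustration only (not the authors' implementation), the simplified rule set above could be encoded as a small decision function; the parameter names and the returned action labels are hypothetical.

    def reach_by_body_action(object_height, knee_height, hip_height, head_height,
                             reachable_by_hand):
        # Simplified rule-based decision model for reaching by body.
        if object_height < knee_height:
            return "squat"                 # squat down to reach a low object
        if hip_height < object_height < head_height:
            return "bend_torso"            # turn/bend the torso to reach it
        if object_height > head_height and not reachable_by_hand:
            return "jump"                  # jump to reach a high object
        return "reach_by_hand"             # otherwise plain arm inverse kinematics suffices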

5 Reaching by Moving

If the object/position is not reachable within the body reach area, the agent can move to a reachable position, i.e., a position from where the targeted object/position is reachable by hand or by body. This reaching problem involves reasoning about situations to find a proper reachable position. To find it, the agent has to adopt a strategy, more precisely a set of rules, for decision making. Different strategies to reach an object may result in different solutions for the agent.

For example, consider a situation in which there is a table and a cup on the table, as shown in Figure 2. In order to reach the cup, the agent may move to position A; this is the shortest path, but it requires maximal bending of the torso. Alternatively, the agent may move to position C, which is the reachable position nearest to the object, but via a longer path. Another possible solution is that the agent moves to position B, from where it can reach the object with a proper bending of the torso.


Fig. 2. Move to a position to reach the object

We will propose a set of rules, alternatively called rational postulates, which model the agent's behavior for reaching with knowledge. Since the agent should try to find a proper position from which to reach the object, the rules can be specified as a preference function on positions by which the agent can evaluate the positions and move to the most preferred one. Let dmax be the maximal reachable distance for the agent when reaching by body. For two positions p and q, the function d(p, q) is the distance between p and q. We also use object and agent to denote the positions of the object and the agent, respectively. Let ≻ be the preference relation on positions. Note that a reachable position is one which the agent can move to; that means a reachable position cannot be blocked by other objects, like the table in the example. We assume that the preference relation ≻ is defined on unblocked positions.

The rational postulates on the preference relation ≻ are formalized as follows.

(A) (Reachability) The agent prefers a reachable position to an unreachable position: d(p1, object) < dmax ∧ d(p2, object) > dmax → p1 ≻ p2.

It is a basic rule for reaching, because it requires that the agent move to a position from which the object is reachable. However, rule (A) can be subsumed by a more general rule as follows:

(B0) (Nearness) The agent prefers a position which is closer to the object: d(p1, object) < d(p2, object) → p1 ≻ p2.

Rule (B0) implies that the agent would move to a nearer position to reach the object. In the example, the agent would move to position C via a longer path, as shown in Figure 2. However, this is not intuitive. An alternative rule is that the agent prefers a shorter path, which can be specified as follows:

(C0) (Shortness) The agent prefers a position for which the move path is shorter: d(p1, agent) < d(p2, agent) → p1 ≻ p2.

It is easy to see that rules (B0) and (C0) together may result in a contradiction, that is, p1 ≻ p2 ∧ p2 ≻ p1 ∧ p1 ≠ p2 may hold. Rules (B0) and (C0) can therefore be modified into the following conditional rules:

(B) (Conditional Nearness) If the object is unreachable, then the agent prefers a position which is closer to the object: d(p1, object) > dmax ∧ d(p2, object) > dmax ∧ d(p1, object) < d(p2, object) → p1 ≻ p2.

(C) (Conditional Shortness) If the object is reachable, then the agent prefers a position for which the move path is shorter: d(p1, object) < dmax ∧ d(p2, object) < dmax ∧ d(p1, agent) < d(p2, agent) → p1 ≻ p2.

Rules (A), (B), and (C) together imply that the agent would try to get as close as possible to the object until the object is reachable, and then stop. Under this behavioral model the agent would move to position A in the example, which means that the agent has to fully bend the torso to reach the object. That is also not intuitive. In real life, most people take neither the shortest path to position A nor the nearest position C, but some position between A and B from which they can reach the object with a proper bending of the torso. This means we need a trade-off between the nearness and shortness postulates. Therefore, we introduce the notion of acceptable reachable distance (ARD), which denotes the distance from which the agent can reach the object with an acceptable bending of the torso. Let dard be the ARD; we have 0 ≤ dard ≤ dmax. The ARD may differ between agents. For instance, for agents which do not mind bending, it can be dard = dmax. Based on the notion of ARD, we have the following rules:

(D) (Acceptable Nearness) If the object is not acceptably reachable, then the agent prefers a position which is closer to the object: d(p1, object) > dard ∧ d(p2, object) > dard ∧ d(p1, object) < d(p2, object) → p1 ≻ p2.

(E) (Acceptable Shortness) If the object is acceptably reachable, then the agent prefers a position for which the move path is shorter: d(p1, object) < dard ∧ d(p2, object) < dard ∧ d(p1, agent) < d(p2, agent) → p1 ≻ p2.

Rules (A), (D), and (E) together imply that the agent would try to get as close as possible to the object until it is acceptably reachable, and then stop. Therefore, the agent would try to move to a position between positions A and B, based on an acceptable reachable distance. Although we have rules (D) and (E), we still keep rules (B) and (C) in the model for the cases in which an acceptable reachable position may not exist (say, because it is blocked by the table); then the agent may use the maximal reachable distance dmax to find a proper position to reach the object.

Based on the preference relation ≻, we can define a weak preference relation ⪰ as follows:

p1 ⪰ p2 iff p1 ≻ p2 or p1 = p2

Proposition 1. The preference relation ⪰ defined by the rules (A)–(E) is a partial order relation, namely, it is reflexive, antisymmetric, and transitive.

The preference relation can serve as a heuristic function for agents to perform an informed search to find a path to reach the object. Searching and planning are frequently used in AI technology [9]. In [2,3], Bandi et al. discuss and propose several processes for finding optimal paths, including the traditional A* search and its variants, for virtual agents. In an informed search with the heuristic function proposed above, the agent in the test example first moves towards position A (because of rule (B)), then moves from position A in the direction of position B (because of rule (D)), and finally stops somewhere between positions A and B (because of rule (E)). Based on these rational postulates for reaching by moving, the agent exhibits intuitive behavior when reaching for objects.
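A minimal sketch of how the postulates could drive the choice of a target position follows; d_max and d_ard correspond to dmax and dard, and the simple greedy selection over unblocked candidate positions is our own simplification, not the authors' search procedure.

    def prefers(p1, p2, agent_pos, obj_pos, d, d_max, d_ard):
        # True if p1 is preferred to p2 under postulates (A), (D) and (E);
        # d(x, y) is the distance function.
        d1, d2 = d(p1, obj_pos), d(p2, obj_pos)
        if d1 < d_max and d2 > d_max:                      # (A) reachability
            return True
        if d1 > d_ard and d2 > d_ard and d1 < d2:          # (D) acceptable nearness
            return True
        if d1 < d_ard and d2 < d_ard and d(p1, agent_pos) < d(p2, agent_pos):
            return True                                    # (E) acceptable shortness
        return False

    def best_position(candidates, agent_pos, obj_pos, d, d_max, d_ard):
        # Greedy selection of a preferred unblocked position among the candidates.
        best = candidates[0]
        for p in candidates[1:]:
            if prefers(p, best, agent_pos, obj_pos, d, d_max, d_ard):
                best = p
        return best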

6 Conclusion and Future Work

We have investigated three levels of reaching actions for intelligent virtual agents, in particular for web-based intelligent virtual agents. 'Reach by hand' is discussed as a typical inverse kinematics problem which involves the arm joints. 'Reach by body' is examined as a decision-making problem for reasoning about the involved body joints. 'Reach by move' is investigated as a search problem with a set of rational postulates for modeling the agent's knowledge/beliefs to reach objects. We have also discussed how the approaches can be tested and implemented in the DLP+VRML platform.

There is a lot of interesting future work. For instance, 'reach by hand' may involve more complicated hand shapes for grasping and holding. 'Reach by body' can be developed into more fine-grained models describing the decision-making procedure. 'Reach by move' could consider more comprehensive scenarios for planning and reasoning. All these issues will improve the behavior of intelligent virtual agents.

References

1. Badler, N., Manoochehri, K., and Walters, G., Articulated Figure Positioning by Multiple Constraints, IEEE Computer Graphics and Applications, 7(6), 28–38, 1987.
2. Bandi, S., and Thalmann, D., Space Discretization for Efficient Human Navigation, Computer Graphics Forum, 17(3), 195–206, 1998.
3. Bandi, S., and Cavazza, M., Integrating World Semantics into Path Planning Heuristics for Virtual Agents, Proceedings of the 1999 Workshop on Intelligent Virtual Agents, 1999.
4. Eliens, A., DLP, A Language for Distributed Logic Programming, Wiley, 1992.
5. Eliens, A., Huang, Z., and Visser, C., A Platform for Embodied Conversational Agents based on Distributed Logic Programming, Proceedings of the AAMAS 2002 Workshop.
6. Huang, Z., Eliens, A., and Visser, C., 3D Agent-based Virtual Communities, Proceedings of the 2002 Web3D Conference, ACM Press, 2002.
7. Huang, Z., Eliens, A., and Visser, C., STEP: a Scripting Language for Embodied Agents, in: Helmut Prendinger and Mitsuru Ishizuka (eds.), Life-like Characters, Tools, Affective Functions and Applications, Springer-Verlag, (to appear).
8. Huang, Z., Eliens, A., and Visser, C., Implementation of a Scripting Language for VRML/X3D-based Embodied Agents, Proceedings of the 2003 Web3D Conference.
9. Russell, S., and Norvig, P., Artificial Intelligence, Prentice Hall, 1995.
10. Tolani, D., Goswami, A., and Badler, N., Real-Time Inverse Kinematics Techniques for Anthropomorphic Limbs, Graphical Models, 62, 353–388, 2000.
11. Zhao, X., Kinematic Control of Human Postures for Task Simulation, Ph.D. dissertation, University of Pennsylvania, 1996.


A Model for Generating and Animating Groups of Virtual Agents

Marta Becker Villamil, Soraia Raupp Musse, and Luiz Paulo Luna de Oliveira

University of Vale do Rio dos Sinos, Masters in Applied Computing, São Leopoldo, Brazil
{martabv,soraiarm,luna}@exatas.unisinos.br

Abstract. This paper presents a model to generate and animate groups which emerge as a function of interaction among virtual agents. The agents are characterized by the following parameters: sociability, communication, comfort, perception and memory. The emergent groups are characterized by the cohesion parameter, which describes the homogeneity of ideas of the group members. In this work we are mainly interested in investigating the formation of groups (membership and time for grouping), the characterization of groups (cohesion parameter) and their visual representation (group formation). The overall results suggest that the interaction among agents contributes to larger groups and higher crowd cohesion values.

1 Introduction and Related Work

Behavioral modelling for animation and virtual reality has advanced dramatically over the past decade, revolutionizing the motion picture, game and multimedia industries. The field has advanced from purely geometric models to more elaborate physics- and biology-based models [11].

Motivated by our previous research on crowd simulation [9][5], we are currently interested in providing a model to generate emergent groups as a function of interaction among virtual agents. In order to investigate group aspects, we use the concept of group cohesion. When every member of a group has a strong feeling of belonging to the group, we can say that the group has high cohesion. The first theoretical discussions about group characteristics describe that, when a group becomes organized as a limited social system, there are growing attractions inside it [7].

Some researchers have worked on simulating the behavior of groups of agents in a realistic way [2][8][10][12]. ViCrowd, the system proposed by Musse and Thalmann [9], manages information from hierarchical crowds, generating behaviors of groups and individuals with different levels of autonomy: programmed, guided or autonomous behaviors.

This work presents a rule-based system to simulate the formation and the behaviors of groups of virtual humans. As in the model of Tu and Terzopoulos [12], some abilities which characterize artificial life (perception, locomotion and group behavior) are implemented in this model.



2 A Model for Grouping Social Agents

One of the hypotheses used in this work is that agents are modelled with a set of attributes which characterize their personality and internal status. We selected the following attributes (values normalized to the interval [0, 1]) in order to define our agents, since we need to simplify reality in order to simulate it. The attributes related to personality are Sociability (S) (Equations 2 and 3), Communication (C) and Comfort (Co) (Equation 4).

The attributes which characterize the agent's mobility and perception are velocity (each agent randomly receives a motion speed) and perception region (the combination of the perception distance di and the perception angle θi) (Fig. 1-left).

The agent’s basic abilities are:

– Movement – at the beginning, before acquiring any ability, the agents move randomly according to their attributed velocity. Afterwards, the agents move according to their acquired abilities.
– Perception – each agent is capable of checking whether another agent is inside its perception region (Fig. 1-left).
– Interaction – it happens when two agents perceive each other (Fig. 1-right). At this moment, the agents' attributes and abilities are evaluated and can be updated.

Fig. 1. Left: graphic representation of the perception angle and distance. Right: two agents perceiving each other.

– Memory – the memory of an agent is represented by an array which contains the agents it interacted with, and the quality of these interactions, represented by Iij (presented later in Equation 1). A second variable in the agent's memory is the Interactions Counter (IC), which indicates the number of interactions the agent has had and is increased at each interaction. At the beginning of the simulation all agents have IC set to zero.

During the simulation, each i-agent can acquire the following new social abilities:


– To Follow – the agent is able to follow other agents.
– To Access Memory – the agent accesses its memory to search for information about the quality of past interactions.
– To Select – the agent is able to select the agents with which it had a good quality of interaction, identified by a score of Iij > 0.6 (Equation 1). We assume that a quality of interaction greater than 0.6 represents a "good" interaction.
– To Group – after having attained the last level of reward (Table 1), the agent is able to form groups with the agents with which it had a good quality of interaction (Iij > 0.6).

The simulation begins with a randomly distributed population of agents in a 3D world space. At the beginning, each agent (i-agent) moves, changing its direction randomly at each time step. An interaction between i-agent and j-agent happens when both are in the perception region of each other (Fig. 1). For each interaction Iij between the i-agent and the j-agent, one point is added to ICi and ICj. Moreover, the quality of interaction (Iij) between i-agent and j-agent is defined as

I_{ij} = \frac{S_i C_i + S_j C_j}{C_i + C_j}.    (1)

In order to describe the mutual influence that happens when people interact among themselves [4], we defined that the agent sociability can be updated during the simulation. The Si and Sj attributes change at each interaction considering the quality of interaction (Iij) and sociability itself. The agent which has the lower S will have its value increased and the opposite will happen to the other agent, representing the referred mutual influence. Therefore, the new Si and Sj will achieve the same value, as shown in the following equations:

S_i = S_i + (S_j - S_i)(1 - I_{ij})    (2)

S_j = S_j + (S_i - S_j) I_{ij}    (3)

In addition, the level of agent comfort is also updated as indicated in Equation 4:

Co_i = S_i \frac{\sum I_{ij}}{IC_i}    (4)
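For illustration, Equations (1)–(4) could be applied per interaction as in the following Python sketch; the dictionary field names ("S", "C", "IC", "sumI", "Co") are hypothetical.

    def interact(agent_i, agent_j):
        # One interaction between two agents, applying Equations (1)-(4).
        Si, Ci = agent_i["S"], agent_i["C"]
        Sj, Cj = agent_j["S"], agent_j["C"]
        Iij = (Si * Ci + Sj * Cj) / (Ci + Cj)        # Equation (1): quality of interaction
        agent_i["S"] = Si + (Sj - Si) * (1.0 - Iij)  # Equation (2)
        agent_j["S"] = Sj + (Si - Sj) * Iij          # Equation (3)
        for a in (agent_i, agent_j):
            a["IC"] += 1                             # interactions counter
            a["sumI"] += Iij                         # accumulated quality of interaction
            a["Co"] = a["S"] * a["sumI"] / a["IC"]   # Equation (4): comfort
        return Iij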

All agents follow the same levels of rewards. The condition to attain new reward levels is described by a relation between the agents' attributes. We defined that the attributes Si, Coi and ICi represent this condition in our model (Table 1). The values val1, val2 and val3 used to attain new levels can be customized by the user and influence the time spent by agents to evolve through the levels of reward.

In this work, each group is created by an i-agent called the leader, which attains the last level of reward and satisfies the condition Si ≥ 0.6. This value has been chosen empirically and influences the number of groups that are created: for instance, if this threshold is lower, more leaders will arise and consequently more groups. Each leader starts to move in order to aggregate with other agents.


Table 1. Rules for attaining a new level. Text in bold marks the abilities acquired at each level.

Reward Level   Condition (X = Si + Coi + ICi)   Related abilities
1              X ≤ val1                         moving, perceiving, relating and memorizing
2              X > val1 and X ≤ val2            moving, perceiving, relating, memorizing, and following
3              X > val2 and X ≤ val3            moving, perceiving, relating, memorizing, following, access memory and selecting
4              X > val3                         moving, perceiving, relating, memorizing, following, access memory, selecting and grouping

Furthermore, the condition for an agent to be aggregated by a leader is the quality of interaction they recorded in their memories: Iij ≥ 0.6.

Once a group is formed, the cohesion rate (K) can be calculated. Inspired by the social literature [4], cohesion expresses the group homogeneity; consequently we define it as inversely proportional to the standard deviation of the group sociability. For group 1:

K_1 = \frac{1}{\sqrt{\frac{1}{N_1}\sum_i (S_i - \bar{S})^2}},    (5)

where N_1 represents the number of individuals of group 1 and \bar{S} is the mean sociability of its members. Consequently, the greater the standard deviation, the smaller the cohesion.
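Under this reading of Equation (5) (cohesion as the inverse of the standard deviation of the members' sociability), a direct Python sketch is:

    import math

    def cohesion(sociabilities):
        # Cohesion K of a group from its members' sociability values (Equation 5).
        n = len(sociabilities)
        mean = sum(sociabilities) / n
        std = math.sqrt(sum((s - mean) ** 2 for s in sociabilities) / n)
        return float("inf") if std == 0.0 else 1.0 / std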

The group cohesion is visually represented by the spatial formation of its members. All agents of the group follow a leader in a shape described by an angle and a distance (Fig. 2-middle). Visually, the leader is the agent located at the vertex of the shape.

Fig. 2. Left: a group with a high cohesion rate (K > 400). Middle: a group with a medium cohesion rate (100 < K ≤ 400). Right: a group with a low cohesion rate (K ≤ 100).

This work is integrated with a visualization framework where the visualization process is completely isolated from the behavioral model. The goal is to work with different kinds of viewers depending on the simulation purpose.


In order to visualize the simulation we can use two viewers, Caterva [3] (Fig. 2) and RTKrowd [3] (Fig. 3-left). The main difference between them is the level of realism provided.

3 Further Discussions and Future Works

The performed simulations aim to describe the influence of a meeting point and of the communication parameter of a population in the virtual space on the social behavior of agents. In this case we generated a region in the center of the virtual room with a higher probability of being visited, consequently providing more interactions among agents (Fig. 3-left).

Fig. 3. Left: the meeting point. Right: the agents form larger groups in less simulation time.

We assume that agents tend to go to the meeting point with a probability proportional to the agent's communication. We performed 10 simulations with all agents having high communication levels (C = 1) and 10 simulations with all agents having low communication levels (C = 0.1). The results indicate that agents which go to the meeting point can easily perceive others and consequently interact more often (Fig. 3-right). Moreover, the groups are formed in less time and at the end of the simulation more agents have grouped. A movie showing the influence of the meeting point can be seen at http://inf.unisinos.br/cglab/meeting.zip.

We are currently investigating the possibility of applying this methodology in combat games, for instance Command and Conquer Generals [6] and Age of Mythology [1], which involve armies and combat.

References

1. "Age of Mythology," http://www.ensemblestudios.com/aom/, 2003.
2. Allbeck, J., Badler, N., Bindiganavale, R.: Describing human and group movements for interactive use by reactive virtual characters. In: Army Science Conference, (2002).
3. Barros, L. M., Evers, T., Musse, S. R.: A framework to investigate behavioral models. In: Journal of WSCG, 40–47, (2002).
4. Belnesch, H.: Atlas de la Psychologie, Encyclopédies d'Aujourd'hui, (1995).
5. Braun, A., Musse, S. R., Oliveira, L. P. L., Bodman, B. E. J.: Modeling Individual Behaviors in Crowd Simulation. In: Computer Animation and Social Agents, 143–148, (2003).
6. "Command and Conquer Generals," http://www.westwood.com, 2003.
7. Lewin, K.: Field Theory and Experiment in Social Psychology: concepts and methods. (1939).
8. Mataric, M. J.: Learning to behave socially. In: From Animals to Animats: International Conference on Simulation of Adaptive Behavior, 453–462, (1994).
9. Musse, S. R., Thalmann, D.: Hierarchical model for real time simulation of virtual human crowds. In: IEEE Transactions on Visualization and Computer Graphics, Vol. 7, 152–164, (2001).
10. Reynolds, C. W.: Flocks, herds, and schools: A distributed behavioral model. Computer Graphics, Vol. 21, no. 4, 25–34, (1987).
11. Terzopoulos, D.: Artificial life for computer graphics. In: Communications of the ACM, Vol. 42, 33–42, (1999).
12. Tu, X., Terzopoulos, D.: Artificial fishes: Physics, locomotion, perception, behavior. In: Computer Animation, 43–50, (1994).


Scripting Choreographies

Stefan M. Grünvogel and Stephan Schwichtenberg*

Laboratory for Mixed Realities, Institute at the Academy of Media Arts Cologne, Am Coloneum 1, D-50829 Köln, Germany

{gruenvogel, schwichtenberg}@lmr.khm.de

Abstract. We present a system for the simple and fast creation of character animation which is used within an augmented reality environment. Instead of using fixed sequences of animation, the motions of the characters are created by subtasks which are able to change the motion dynamically according to the environment. Dynamic motion models are used to produce the animation data. These dynamic motion models are controlled by the subtasks to create their appropriate behaviour.

1 Introduction

We are developing the character animation system for the augmented reality project mqube (http://www.mqube.de). The aim of this project is to build a prototype of a multi-user environment where several people (directors, stage and light designers) work together to create the stage set and to place the lights on a miniaturised stage scaled down by a factor of 4. The user group also tests the interplay between actors, dynamic light and scenery. For the simulation of actors on the stage, virtual characters are used. The users are not interested in the low-level editing of character animation. Instead, a simple and fast way of creating complex character behaviour (e.g. let the character walk along a given path and wave its arms at a certain time) has to be provided. Furthermore, a character should react to the properties of the stage, e.g. jump independently over obstacles which occur on its path.

There are only a few approaches in which interaction with virtual characters in augmented reality is examined (e.g. [1],[2]). There, the behaviour and the animation of the characters are created automatically by the underlying system. Our aim is to give the user herself the possibility to edit the movements and the behaviour of the character within the environment.

2 System Architecture

The CEManager is the interface to the AR-system (cf. Figure 1). It creates and deletes the scene graph nodes of the characters in the render engine and passes commands to the choreography editor.

* This work was supported by the German Ministry of Education and Research (BMBF Grant 01 IR A04 C: mqube – Eine mobile Multi-User Mixed Reality Umgebung, a mobile multi-user mixed reality environment).


The choreography editor controls the behaviour of all the characters in the scene. A character is represented within this level by a CECharacter object. The CEChoreographyEditor creates or deletes CECharacter objects and sends them CEDirectives to control their behaviour.

To each CECharacter belongs a unique AEMotionController. The CECharacter controls the production of the animation of the character by sending AECommands to the AEMotionController. The AEMotionController is responsible for the real-time creation of the animation. The AEBuffer and the AESubmitter connect the output of the AEMotionController with the corresponding node in the scene graph.

Fig. 1. The System Architecture.

The CEManager, the AESubmitters and the AEBuffers are separate threads synchronised by the TimeController. The TimeController also transforms the time information from the AR-system (which has its own virtual time) into the internal virtual time of the character subsystem. The current time within the internal virtual time can be moved backward and forward with arbitrary speed.

3 Choreography Editor

3.1 Creation of Choreographies

Non-linear animation tools (e.g. Maya™ or Filmbox™) create animations by blending and merging together animations layered on a time line. We adopt this paradigm, but instead of fixed animation sequences we use so-called subtasks for the dynamic creation of the character motion.

Subtasks model reflexive behaviour, exhibiting a fixed behavioural pattern in response to given stimuli. They can be classified as level 0 in Brooks' subsumption architecture (cf. [3]). Examples of subtasks are waving the hands or walking along a given path. The resulting motions of the character can be influenced by the user or by objects within the virtual environment during their execution (cf. Figure 2).

For the user and the AR-system these subtasks are hidden behind the CEManager. The user creates a choreography script by spooling forward or backward in virtual time and sending CECommands to the CEManager. These CECommands are used to create new subtasks at the current time or to change the behaviour of a subtask which is active at this time. The overall choreography of the character is given by playing the CECommands at the virtual time they were received by the CEManager.

3.2 Character Model

The CEChoreographyEditor is responsible for the creation of CECharacters and their animation engine (cf. Figure 1). It also interprets the CECommands received from the CEManager and sends the appropriate commands (CEDirectives) to the CECharacter.

Within a CECharacter the subtasks are realized as state machines. The subtasks use dynamic motion models (cf. Section 4) for the creation of their animation. The advantage of using dynamic motion models is that they also provide a high-level interface for the creation and manipulation of animation data, hiding their actual implementation.

Each CECharacter administers the choreography script of the character, which is an ordered list of CEDirectives. They are used to create, delete or manipulate subtasks. When the choreography is played, CEDirectives are sent to the subtasks according to their time stamp. CEDirectives hold information such as the addressed subtask and a set of parameters which may change the behaviour of the subtask.
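As an illustration of how such a script could be organised (a sketch with assumed names, not the authors' code), the directives can be kept sorted by their time stamp and dispatched to the addressed subtasks during playback:

class CEDirective:
    def __init__(self, time, subtask_id, params):
        self.time = time              # virtual time stamp
        self.subtask_id = subtask_id  # the addressed subtask
        self.params = params          # e.g. {"style": "tired"} for a walk subtask


class ChoreographyScript:
    """An ordered list of directives, replayed according to their time stamps."""

    def __init__(self):
        self.directives = []

    def add(self, directive):
        self.directives.append(directive)
        self.directives.sort(key=lambda d: d.time)

    def play(self, from_time, to_time, dispatch):
        """Send every directive with from_time < time <= to_time to dispatch()."""
        for d in self.directives:
            if from_time < d.time <= to_time:
                dispatch(d)


script = ChoreographyScript()
script.add(CEDirective(0.0, "pathwalking", {"path": "P1"}))
script.add(CEDirective(2.5, "wave", {"hand": "left"}))
script.play(0.0, 3.0, dispatch=lambda d: print(d.time, d.subtask_id, d.params))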

The CECharacter also resolves conflicts between different active subtasks, which can occur if two subtasks need to control the same parts of the body. There are three methods to resolve such a conflict: ignore one subtask, suspend one subtask until the other has finished, or abort one subtask.

Fig. 2. Result of two simple choreographies. Two characters are walking along a path with different walk styles; one is waving.

4 Animation Engine

The overall real-time creation of the animation data of a character is controlled by its AEMotionController. The AEMotionController uses dynamic motion models for the creation of animation data. The term motion model first appeared in the offline animation context (Grassia [4]). Motion models resemble the lowest level in the Improv system by Perlin and Goldberg [5] (cf. [6] for a discussion of the differences). In [6] we describe dynamic motion models, which can be used for interactive real-time environments.

Dynamic motion models are models for simple movements like waving, pointing or walking, each having motion-specific sets of parameters (e.g. walk speed and style for the walk motion, or the left or right hand for waving), which can be changed while playing the animation. The dynamic motion models are implemented as state machines. Their state can change according to the elapsed time of the animation or to newly received parameters or commands. All motion models have four states in common: idle, prestart, running and stopping. In the idle state the motion model does not produce any animation. If the motion model receives the command to start, it turns into the prestart state, where it adjusts the character's pose to start the actual motion. After having reached this pose, it switches into the running state, where the actual motion is synthesised. If the goal of the motion is reached, it switches into the stopping state, where the character is brought into a neutral position; after that the motion model switches back into the idle state.
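A minimal skeleton of this state machine, with the four states named in the text and placeholder pose checks, could look like the following sketch (method names and the frame format are assumptions):

class MotionModel:
    """Sketch of a dynamic motion model as a state machine."""

    def __init__(self):
        self.state = "idle"

    def start(self):
        if self.state == "idle":
            self.state = "prestart"   # adjust the pose towards the motion's start

    def stop(self):
        if self.state == "running":
            self.state = "stopping"   # bring the character into a neutral position

    def update(self, dt):
        """Advance the model; return animation data while not idle."""
        if self.state == "prestart" and self.reached_start_pose():
            self.state = "running"
        elif self.state == "stopping" and self.reached_neutral_pose():
            self.state = "idle"
        return None if self.state == "idle" else self.synthesise_frame(dt)

    # Placeholders for the actual pose tests and motion synthesis.
    def reached_start_pose(self):
        return True

    def reached_neutral_pose(self):
        return True

    def synthesise_frame(self, dt):
        return {"state": self.state, "dt": dt}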

The characters are represented within the animation engine as articulated figures, where the rotation and translation values of the joints are updated at a fixed frame rate. Dynamic motion models create their motion by combining animation clips with clip operators. Animation clips are abstract objects that produce animation data for a specific time span on the time line. Clip operators have animation clips as operands and take the animation data of their operands to produce new animation data. Because clip operators are likewise animation clips, they can also be the operands of other clip operators. The actual sources of animation data are clip primitives, which store pre-produced animation data. To create the animation of a character which starts to walk from a standing posture we construct, for example, the operator tree in Figure 3. WalkStart and WalkCycle are clip primitives, where WalkStart is the animation in which the character makes his first step from a standing posture and WalkCycle is a walk cycle. The TimeShift clip operator shifts the start frame of the looped WalkCycle to the last frame of the WalkStart clip on the time line. The Loop repeats the underlying clip for an infinite time. Finally, the overall animation is produced by blending the WalkStart with the result of the Loop. The advantage of dynamic motion models is that, e.g., by exchanging the clip primitives with walk animations in a different style, a whole new animation is produced easily.
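One plausible arrangement of the clips and operators for this example (a sketch with assumed class names and a trivial placeholder blend, not the engine's actual implementation) is:

class Clip:
    """An animation clip: produces pose data for a point on the time line."""
    def sample(self, t):
        raise NotImplementedError


class Primitive(Clip):
    """A clip primitive storing pre-produced animation data (stubbed here)."""
    def __init__(self, name, duration):
        self.name, self.duration = name, duration

    def sample(self, t):
        phase = min(max(t / self.duration, 0.0), 1.0)
        return {self.name: phase}          # placeholder pose data


class TimeShift(Clip):
    def __init__(self, clip, offset):
        self.clip, self.offset = clip, offset

    def sample(self, t):
        return self.clip.sample(t - self.offset)


class Loop(Clip):
    def __init__(self, clip, period):
        self.clip, self.period = clip, period

    def sample(self, t):
        return self.clip.sample(t % self.period)


class Blend(Clip):
    def __init__(self, a, b):
        self.a, self.b = a, b

    def sample(self, t):
        pose = dict(self.a.sample(t))
        pose.update(self.b.sample(t))      # placeholder blend: merge both poses
        return pose


walk_start = Primitive("WalkStart", duration=1.0)
walk_cycle = Primitive("WalkCycle", duration=2.0)
tree = Blend(walk_start, TimeShift(Loop(walk_cycle, period=2.0), offset=1.0))
print(tree.sample(1.5))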

Fig. 3. The operator tree for a walk animation.

Motion models produce their animations by constructing operator trees like the one in Figure 3 with the help of base motions. These are clip primitives together with annotations further describing the clip primitive. For example, for a walk cycle the frames where the heel hits the ground, or the specific style of the animation, are stored in the annotations. The annotations are used to adjust the parameters of the clip operators in the operator tree for different base motions. If a motion model receives new parameters (e.g. change the style of the walk movement), it destroys its current operator tree and creates a new one according to the new parameters. Thus it is possible to change the characteristics of a motion in real time. If more than one motion model is active, the AEMotionController creates the overall animation by blending the operator trees of the motion models together (cf. [6]). It is also responsible for resolving conflicts between motion models which want to use the same parts of the body.

5 Current State and Future Research

The animation engine and the choreography editor have been integrated into the augmented reality system. The user can place commands on the time line by moving the current time on the time line and choosing commands from a graphical menu. Currently we have implemented commands like pathwalking (let the character walk along a path), wave and walkstyle, where the latter lets us choose between different styles of walking. The first impression is that our approach to scripting character animations works and is intuitive for the user. Within a minute a character is created that walks along a path, starts to wave and changes its walk style at user-defined time stamps. We are currently working on improving the capabilities to edit the choreography and on enlarging the animation and interaction possibilities of the character within the virtual environment.

References

[1] Torre, R., Fua, P., Balcisoy, S., Ponder, M., Thalmann, D.: Augmented reality for real and virtual humans. In: CGI 2000, IEEE Computer Society Press (2000) 303–308
[2] Tamura, H.: Real-time interaction in mixed reality space: Entertaining real and virtual worlds. In: Proc. Imagina 2000 (2000)
[3] Brooks, R.A.: Intelligence without representation. Artificial Intelligence 47 (1991) 139–159
[4] Grassia, F.S.: Believable Automatically Synthesized Motion by Knowledge-Enhanced Motion Transformation. PhD thesis, School of Computer Science, Carnegie Mellon University, Pittsburgh (2000)
[5] Perlin, K., Goldberg, A.: Improv: A system for scripting interactive actors in virtual worlds. Computer Graphics 30 (1996) 205–218
[6] Grunvogel, S.M.: Dynamic character animation. International Journal of Intelligent Games & Simulation (2003) 11–19






Mediating Action and Music with Augmented Grammars

Pietro Casella and Ana Paiva

Instituto Superior Tecnico
Intelligent Agents and Synthetic Characters Group

Instituto de Engenharia de Sistemas e Computadores
Lisboa, Portugal

{pietro.casella, ana.paiva}@gaips.inesc.pt
http://gaips.inesc-id.pt

Abstract. Current approaches to the integration of action and music in Intelligent Virtual Environments are, for the most part, “hardwired” to fit a particular interaction scheme. This lack of independence between the two parts prevents the music or sound director from exploring new musical possibilities in a systematic and focused way. A framework is proposed which mediates action and music, using an augmented context-free-grammar-based model of behavior to map events from the environment to musical events. This framework includes mechanisms for self-modification of the model and maintenance of internal state variables. The output module can be changed to perform other, non-musical behavior.

1 Introduction

The use of background music is a very powerful way to communicate with, and even induce certain moods in, the users of Intelligent Virtual Environments. Its use is nowadays commonplace in most applications such as games, virtual theaters, etc. However, it still happens that, due to project planning issues and other misconceptions about the task, the resulting background music and the corresponding dramatic capabilities fall well short of their full potential, even resulting in a poor user experience caused by repetitive music, a bad choice of trigger points for sounds, etc.

Several problems mentioned by the game audio community [2] relate to the integration strategy assumed on most projects. The main such problem is that the integration of music/sound with the action is done at a programming level (i.e. all the code for triggering and loading compositional elements is embedded within the rest of the game's code), leading to an overall lack of flexibility for the composer to explore new musical ideas such as different trigger points, scores, etc.

The present research seeks to close this gap with the development of a framework which facilitates the integration and exploration of mediated action and music. The integration strategy comes down to having the environment output identifiers of all occurring actions. The mediator uses an internal model to decide which music or sound to play in response to these environment events. This model and the corresponding musical behavior may be altered independently from the environment's code.

At the time of this writing, the proposed system is being integrated with the FantasyA Intelligent Virtual Environment, to control a sound/music playing system based on the emotional events of the game. It will also control an algorithmic composition system for the generation of emotion-based music.

2 Related Work

The problem of action and music integration has not been addressed as a research problem per se in the past. Still, some existing systems do partially solve the underlying problems. This section analyzes some of these systems, with a general focus on action and music mediation.

The main efforts towards better musical quality have been the commercial sound engines (e.g. [5], [4]), which mainly provide primitives for simpler sound manipulation. Specific support for adaptive music is recent (most notably DirectMusic [4]) and consists of primitives for the dynamic alignment of generic compositional elements, and of calls to scripts on the engine side which may be programmed by the musician, thus achieving some level of independence. However, the musician has no access to triggering points other than those where the scripts are called.

Another approach to the generation of music for Intelligent Virtual Environments is the work of Casella and Paiva [1], where an agent architecture is presented that generates music automatically based on the emotions coming from the environment. This system exclusively produces background music to support the current emotional state, rather than responding to any desired action. Yet another approach was the work of Downie [3] on the use of Behavioral Artificial Intelligence for the real-time selection and alignment of small music segments. Downie implemented a Music Character which plays in response to observations on the environment. The present work differs from the previous approaches in that it provides the ability to work with the mediation of action and music in a high-level, independent and testable way.

3 System Description

In broad terms, the main execution loop of the system consists of the following steps:

1. An action occurs in the environment.
2. The environment sends a symbol X identifying that action to the system (at this point the system loop begins).
3. The arrival of the symbol X triggers the generation process. An internal grammar model is used to compute one sequence of symbols. The initial rules for the inference are those of the form X → ? whose precondition is satisfied and whose probability is not zero.
4. Once a symbol sequence is output from the generation process, it is interpreted from left to right.
5. The interpretation of each symbol depends on its type and results in either internal state modifications or output actions.

3.1 Inside the Mediator

The architecture is composed of a set of variables and a generative model. These numerical variables may be used with any meaning (for example, to store levels of energy, characters' emotional states, etc.).

The execution model consists of receiving symbols from the environment and computing a list of symbols which denote the actions that result from the input.

Two types of symbols exist: symbols which affect the internal state of the mediator (referred to as internal symbols), and symbols which represent actions for the output module or sound system (referred to as external symbols). These two types of symbols can be seen as a way to generate internal and external actions. The architecture may be extended so that other types of actions exist, so as to provide a framework which may be customized to new types of output modules.

The generative model consists of a set of context-free rules, each augmented with a precondition based on the state of the internal variables and with an associated weight, which determines the probability distribution used by the generation procedure to choose one among several applicable rules, so that each inference produces only one sequence. Note that the grammar is context-free in the sense that the applicability of the rules does not depend on the sequence being generated. However, successive arrivals of the same initial symbol may use different active rules (i.e. a different grammar), as the applicability of these rules depends on the values of the variables through the preconditions, and these values may have been changed by previously generated sequences. This results in a “dynamic” grammar model rather than a context-sensitive grammar.
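As a rough sketch of this generation step (the rule contents, symbol names and the variable are invented for illustration; the real system reads them from the XML specification described in the next subsection):

import random

# Each rule: left-hand symbol, precondition over the variables, weight, right-hand side.
rules = [
    {"lhs": "ATTACK", "pre": lambda v: v["tension"] > 0.5, "w": 2.0,
     "rhs": ["INC_TENSION", "PLAY_BATTLE_THEME"]},
    {"lhs": "ATTACK", "pre": lambda v: v["tension"] <= 0.5, "w": 1.0,
     "rhs": ["INC_TENSION", "PLAY_STINGER"]},
]

variables = {"tension": 0.2}


def generate(symbol):
    """Pick one applicable rule for `symbol` according to the weights."""
    applicable = [r for r in rules
                  if r["lhs"] == symbol and r["w"] > 0 and r["pre"](variables)]
    if not applicable:
        return []
    chosen = random.choices(applicable, weights=[r["w"] for r in applicable])[0]
    return chosen["rhs"]


def interpret(sequence):
    """Internal symbols change the variables; external ones go to the output module."""
    for s in sequence:
        if s == "INC_TENSION":
            variables["tension"] += 0.4
        else:
            print("output module plays:", s)


interpret(generate("ATTACK"))   # low tension: the stinger rule applies
interpret(generate("ATTACK"))   # same symbol later: the battle-theme rule applies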

3.2 Scripting Language

The scripting language is XML-based. The following list summarizes the available primitives; a small illustrative sketch follows the list.

– Variable manipulation primitives - primitives for setting, incrementing and decrementing the value of variables.

– Variable testing primitives - primitives for testing the value of variables against thresholds.

Fig. 1. System Architecture

– Variable declaration - When declared, a variable includes a name, an initial value and an optional trigger. The trigger has an associated value, a behavior type (which specifies whether the trigger is run when the variable's value is under, over or equal to the value of the trigger) and a sequence of symbols (which are treated as if they were generated by the generative process). The triggers are tested each time the variable's value is altered.

– Symbols - Each symbol has an associated identifier and an output token.

– Rule declaration - Rules are composed of a name and a weight plus a precondition, a left side and a right side. The precondition is a variable testing primitive, the left side is a symbol, and the right side is a sequence of symbols.

– Rule manipulation primitives - primitives for changing attributes of rules, such as the weight.
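The concrete XML syntax is not reproduced here; purely as an illustration of the variable-with-trigger semantics described above (names and the threshold behaviour are guesses), a variable whose trigger injects a symbol sequence could be modelled as:

class Variable:
    """A declared variable with an optional threshold trigger."""

    def __init__(self, name, value, trigger=None):
        self.name, self.value = name, value
        # e.g. {"behaviour": "over", "threshold": 0.8, "symbols": ["PLAY_ALARM"]}
        self.trigger = trigger

    def set(self, value, interpret):
        """Change the value and test the trigger, as described in the text."""
        self.value = value
        t = self.trigger
        if t and t["behaviour"] == "over" and value > t["threshold"]:
            interpret(t["symbols"])   # treated as if generated by the grammar


danger = Variable("danger", 0.0,
                  trigger={"behaviour": "over", "threshold": 0.8,
                           "symbols": ["PLAY_ALARM"]})
danger.set(0.9, interpret=lambda seq: print("trigger fired:", seq))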

3.3 Output Module

The output module must have an interface that supports the arrival of action symbols from the mediator. These symbols are interpreted independently from the mediator. The semantics, however, must be fully understood by the developer of the model and must be provided prior to the writing of the specification.

The currently implemented output module has some inertia properties which solve the problem of too many simultaneous requests: each request has an associated behavior, which may be 'enqueue', 'play immediately', 'play if possible' or 'play solo'. The synchronization issues related to playing musical loops are solved in two ways: it is possible to instruct the system to cross-fade two sounds, or to have the system play the second sound only after the currently playing loop finishes. The rationale for this rich output module is similar to the mind-body metaphor used when building, for example, human-like intelligent virtual agents, where the body has some coherence properties which are imposed by the physics of the body rather than by the mind.
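A toy version of these request behaviours (the behaviour names come from the text; the scheduling policy and interface are guessed) might be:

from collections import deque


class OutputModule:
    """Minimal sound sink with the four request behaviours mentioned above."""

    def __init__(self):
        self.playing = None
        self.queue = deque()

    def request(self, sound, behaviour):
        if behaviour == "play solo":
            self.queue.clear()
            self.playing = sound
        elif behaviour == "play immediately":
            self.playing = sound
        elif behaviour == "enqueue":
            self.queue.append(sound)
        elif behaviour == "play if possible" and self.playing is None:
            self.playing = sound

    def on_loop_finished(self):
        """When the current loop ends, start the next queued sound, if any."""
        self.playing = self.queue.popleft() if self.queue else None


out = OutputModule()
out.request("ambient_loop", "play if possible")
out.request("battle_theme", "enqueue")
out.on_loop_finished()
print(out.playing)   # battle_theme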


4 Future Work

The proposed architecture brings a new level of flexibility to the mediation of action and music or other action-based systems. Several future extensions to the system are planned. The first one is the expansion of the capabilities of the mediator, for example the inclusion of other triggering mechanisms such as timers. The second one is to create new execution models which include first-order symbols, hierarchical symbols and predefined specifications for character-specific feature tracking. Another possible direction of evolution is to make interaction possible in the opposite direction, i.e. to allow the output module to call back and send information to the environment.

References

1. Pietro Casella and Ana Paiva. MAgentA: An Architecture for Real Time Automatic Composition of Background Music. IVA01, Madrid, Spain. Springer
2. Gamasutra. http://www.gamasutra.com
3. Mark Downie. Behavior, animation, music: the music and movement of synthetic characters. MSc Thesis. MIT, 2001.
4. Microsoft DirectMusic. http://www.microsoft.com/directx
5. Miles Sound System. http://www.sensaura.com


Life-like Characters for the Personal Exploration of Active Cultural Heritage

Talk Summary

Antonio Kruger

Saarland University
FB Informatik, Geb. 36

66123 Saarbrucken, Germany

This talk will focus on perspectives and problems that arise when using virtual life-like agents to entertain and inform human visitors in an instrumented environment. In this respect, issues of interleaving presentations on mobile devices and stationary devices are addressed in a typical situation of educational entertainment: the visit to a museum. Some of the salient elements of the described work are the emphasis on multi-modality in the dynamic presentation and on coherence throughout the visit. The adopted metaphor is a kind of contextualized TV-like presentation, useful for engaging (young) visitors. A life-like character leads through the presentations on both mobile and stationary devices. On the mobile device, personal video-clips are dynamically generated from personalized verbal presentations; on the larger stationary screens, distributed throughout the museum, further background material and additional information is provided by the virtual presenter.

The use of life-like characters on portable devices has to be carefully weighed because of the small dimensions of the display. Nevertheless, there are specific roles that a properly designed character can play on a mobile device to improve the level of engagement with the presentation. In particular, two roles will be explained: the role of a presenter and that of an anchorman. When playing the role of the presenter, the character introduces new media assets and uses pointing gestures. When playing the role of the anchorman, the character just introduces complex presentations without interfering with them any further. The anchorman provides a context in which the different presentation parts make sense. The character also supports the seamless integration of the mobile devices' small screens and the large screens available in the museum. Similar to a TV presenter who walks around the studio to present different content, the character is able to move between the mobile device and the large screen.

Besides the specific role that the character may play, it is also a metaphor for the actual interests of the visitor. By providing different characters and giving the visitor the choice between them, the different views on the exhibits are transparently conveyed and selected. The talk will also discuss the general technical opportunities for the realization of virtual agents in instrumented environments and will give some perspectives towards the use of virtual inhabitants in those spaces. The described work is part of the PEACH (Personal Experience of Active Cultural Heritage) project and is joint work with Oliviero Stock and Massimo Zancanaro from ITC-IRST in Trento.





An Autonomous Real-Time Camera Agent for Interactive Narratives and Games

Alexander Hornung1, Gerhard Lakemeyer2, and Georg Trogemann1

1 Laboratory for Mixed Realities, Academy for Media Arts Cologne,
Am Coloneum 1, 50829 Koln
[email protected], [email protected]

2 Department of Computer Science V, Aachen University of Technology,
Ahornstr. 55, 52056 Aachen
[email protected]

Abstract. Virtual reality environments provide the possibility to create interactive stories with the audience being an active part of the narrative. This paper presents our work on transferring cinematographic knowledge about the dramaturgical means of expression of cameras to the domain of interactive narratives. Based on this formalisation we developed an autonomous real-time camera agent implementing this cinematographic knowledge, with the goal of incorporating the camera as an active part of the storytelling process. The system was integrated into an interactive narrative environment to demonstrate its practicality.

1 Introduction

The visual presentation of a narrative, as in movies, has a very strong influence on how we perceive and interpret a scene or situation of a story. These effects, based on camera position, image composition, choice of colours, etc., are well known from the theory of cinematography. But in contrast to classical cinematography, research in the field of interactive virtual reality narratives and the application of cinematographic concepts to this field is still rather young.

A variety of ideas for camera handling in 3D-based virtual reality environments exist, ranging from first-person views to complex, scripted camera movements. But as virtual environments evolve into a platform for interactive storytelling, these mostly geometrically oriented techniques fail to actively emphasise narrative content. Figure 1 shows two different examples of how the camera position actively contributes to the perception of an image.

Fig. 1. Examples of visual emphasis of narrative content. a) A dark, low-angle camera shot emphasising the evil, terrifying nature of the vampire Nosferatu. b) A vast, calm landscape with slow camera movement, creating a lyrical, epic impression.

We investigated classical cinematographic concepts applicable to this field, with a strong focus on dramaturgical principles of cameras and the visual emphasis of narrative content. The goal was to find a basic formalisation of the narrative expressiveness of cameras and of cinematographic rules, and to implement this knowledge in the form of a camera agent for interactive narrative applications. The agent autonomously chooses appropriate camera shots for a given situation within the current narrative context. The prototype of this virtual camera system was integrated into an interactive narrative environment to provide an example of the practicality of the system.

Possible target applications for an autonomous camera agent following cinematographic concepts are:

– Computer Games: current 3D-based computer games with a strong focus on narrative content rather than pure action, e.g., adventure games.
– Storytelling Authoring Tools: in tools like alVRed [Lab] for authoring interactive non-linear stories, an autonomous camera agent supports the author by freeing him from having to manually define camera positions.
– Virtual Reality Environments: VR applications or e-learning systems supporting narrative content, like a walk through a virtual museum.
– Interactive Movies: the capabilities of modern TV and DVD applications already point in the direction of interactive stories.

For further information and details about this work, and for a playable demo of the camera agent, see [Hor03].

2 Related Work

Publications dealing with models or implementations of virtual cameras are commonly based on [Ari76] and [Kat91]. In [BGL98], a system based on constraint satisfaction is described; [HCS96] creates a camera working with film idioms implemented as hierarchically organised finite state machines. Another approach for declarative camera control is presented in [CAH+96]. [DZ95] presents a method of encapsulating camera tasks into well-defined camera modules, and [HHS01] deals with a camera system designed for games, focusing on predictive camera planning and frame coherence. Most of these works deal primarily with geometrical constraint satisfaction. [TBN] and [KM02] present cinematographic systems focusing on the emotional content of digital scenes. The mathematical aspects of camera positioning are found in books like [AMH02] and [Bli96].


Fig. 2. Transfer and formalisation of cinematographic concepts to the domain of interactive narratives, resulting in an interface based on eight dramaturgically relevant parameters to communicate between a narrative application and the camera agent

3 Cinematographic Concepts

The inherent idea of visual storytelling is that the interpretation of a picture by the audience is based on a process of identification of the spectator with the camera standpoint and view (see Figure 1). This identification strongly influences the interpretation of an image and must be considered during image creation.

Camera views depend on a large number of parameters. For example, the basic shot characterisations used in the cinematographic literature are based on the shot size, camera angle, camera movement, image composition, and shot duration. Furthermore, shots normally cannot be considered as single, atomic entities, but always have to be seen within the context of preceding and following shots. Within such sequences, temporal, causal, or spatial relationships are established, which significantly influence the narrative interpretation and can even completely change the interpretation of a single shot [Kat91]. There exist many cinematic rules to deal with such sequences, like the “Line of Action” rule, which forbids crossing the virtual line between two acting objects during a cut, in order to preserve a consistent orientation for the viewer [Ari76], [Kat91]. These rules of classical cinematography are often not (directly) transferable to interactive narratives, because future narrative events are usually not known until they occur. This makes narratively consistent planning ahead of shots very difficult.

We identified a set of basic dramaturgical principles which describes the narrative expressiveness of cameras on a narrative level rather than on a camera level. Rules for camera placement were analysed and modified to meet the demands of interactive environments. The idea of this transfer of cinematographic concepts is depicted in Figure 2. Based on these principles we use eight parameters to describe dramaturgically relevant information in a story (Table 1). For continuous parameters ∈ [−1.0 . . . 1.0] like Object Might, positive values stand for a high power of the respective subject of the action (with respect to other objects, for instance), zero stands for neutral values or equilibrium, and negative values for low power or, in general, the opposite meaning. These parameters are the basis for communicating story events from the narrative application to the camera agent for proper visualisation.

Table 1. Eight parameters describing the dramaturgically relevant information of a narrative event, with respect to the expressiveness of cameras

Parameter                      Values                      Description

Given by the narrative application:
Scenery Variant                Action|Dialogue|Both        Classification of the event
Action Type                    Physical|Mental|Predicate   Classification of the action
Radius of Interaction (RI)     ∈ [−1.0 . . . 1.0]          Amount of space covered by the action
Object Might (OM)              ∈ [−1.0 . . . 1.0]          Power of subject or relationship between objects
Emotional Involvement (EI)     ∈ [−1.0 . . . 1.0]          Narrative climax, involvement of the spectator
Hectic / Dynamic (DY)          ∈ [−1.0 . . . 1.0]          Hectic or calm, static or very dynamic situations
Excitement / Stress (EX)       ∈ [−1.0 . . . 1.0]          Dominating inner emotion of characters

Computed by the agent:
Event Coherence                ∈ [0.0 . . . 1.0]           Similarity of objects between subsequent events

4 Narrative Events

Our interface for communicating narrative information and for building a representation of the story within the camera agent are narrative events. Every narrative application can be assumed to have some form of representation of story events, be it in the form of explicit graphs as in [Lab], or as instantly occurring decisions or actions as in [VS98], for example. A narrative event encapsulates a single event of the story. The relevant narrative information about events and situations in a story can be formalised by a sentence-like structure consisting of the subject, the action, and the objects. Figure 3 gives an example of a narrative event. The only 'carrier' of dramaturgically relevant information is the action itself. We characterise the action of a narrative event by the parameters representing the basic dramaturgical principles of cameras as introduced in Table 1. Objects like characters are described only by their respective geometrical information. This way we keep application-specific knowledge out of the camera agent. The coherence between subsequent events is computed by the agent itself, based on the similarity of objects. This knowledge enables the camera agent to detect connected sequences of events.

Fig. 3. An example of a narrative event, representing an attack by a wounded character. The interaction type describes an action with medium range, a low object might because of an injury, emotional involvement of the audience, high dynamics, and even higher stress of the fighting character.
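As a rough data-structure sketch of such an event (field names are assumptions; the coherence measure shown is just one plausible similarity over the participating objects, not necessarily the one the agent uses):

from dataclasses import dataclass
from typing import List


@dataclass
class NarrativeEvent:
    subject: str
    action: str
    objects: List[str]
    scenery_variant: str = "Action"      # Action | Dialogue | Both
    action_type: str = "Physical"        # Physical | Mental | Predicate
    radius_of_interaction: float = 0.0   # RI, in [-1, 1]
    object_might: float = 0.0            # OM, in [-1, 1]
    emotional_involvement: float = 0.0   # EI, in [-1, 1]
    dynamic: float = 0.0                 # DY, in [-1, 1]
    excitement: float = 0.0              # EX, in [-1, 1]
    priority: float = 0.5                # assigned by the narrative application
    time: float = 0.0


def event_coherence(a: NarrativeEvent, b: NarrativeEvent) -> float:
    """Similarity of the participants of two subsequent events (0..1)."""
    oa, ob = set(a.objects + [a.subject]), set(b.objects + [b.subject])
    return len(oa & ob) / len(oa | ob) if oa | ob else 0.0


attack = NarrativeEvent("soldier", "attack", ["player"],
                        radius_of_interaction=0.3, object_might=-0.6,
                        emotional_involvement=0.7, dynamic=0.8, excitement=0.9)
reply = NarrativeEvent("player", "dodge", ["soldier"], dynamic=0.7)
print(event_coherence(attack, reply))   # 1.0: same pair of participants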

In addition, each event contains temporal information and a prioritisation assigned by the narrative application, so that the agent can build its own consistent representation of the story. The prioritisation and the coherence help to further distinguish important events of the story from unimportant ones, and to create visual continuity. In some forms of interactive narrative environments, such as virtual reality systems based on independent agents inhabiting the world, there is seldom an explicit representation of the most important events; instead, the narrative emerges from the interaction between these agents. In such cases it is necessary to enable the agent to distinguish between connected, story-driving events and 'environmental' events. Knowledge about the coherence between events is also needed for cinematographic concepts like establishing shots to introduce new situations.

5 The Camera Agent

One of the strengths of agent-based software design is the construction of complex systems of interacting, autonomous modules. Such approaches are more and more considered for modelling and implementing autonomous characters or entities in virtual-reality-based interactive narratives. Motivated by concepts from multiagent systems, we developed agent-based software with the goal of providing an intuitive software design and understanding of the decision procedures, as well as an easy integration into agent-based, interactive narrative applications.

Figure 4 shows the overall system and the flow of information between the camera agent and the application it is embedded in. The application sends relevant information about its current or future state as narrative events to the camera agent. These events contain geometrical information about the participating objects, and a description of the main action by the parameters listed in Table 1. The camera agent then investigates and recomputes its internal representation of the narrative based on these events. It uses a hybrid method of rules, e.g., for the scenery type, and perceptron-based decision-making for continuous parameters to classify narrative events and to assign matching camera shots.

Fig. 4. System structure and flow of information

Based on the history of the narrative, the coherence between subsequent events, the event priority, and other factors like minimum shot durations, it chooses an active event for visualisation. The decision module then adds potentially matching shots for this event from a user-defined shot library to a priority list. The shots in this list are ordered by a motivation value for applying the shot to the given event. This motivation is based on the activation of perceptrons associated with the different shots. The user can train different types of perceptrons with examples of narrative events so that they return a high activation only for specific actions. The inputs for each perceptron are the continuous parameters described in Table 1. For instance, we trained a perceptron to respond to a high event coherence and neutral interaction types, and associated this neuron with a close-up shot for dialogues (Figure 5). Using this approach, one does not have to hand-tune complex rules for action classification. While finding matching shots, the decision module ensures that cinematic rules like the ‘Line of Action’ are not violated.
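A toy version of this motivation computation (hand-set weights instead of trained ones, a sigmoid for a graded activation, and invented shot names; the inputs are the continuous parameters of Table 1 plus the event coherence):

import math


def activation(weights, bias, inputs):
    """Perceptron-style activation over the continuous event parameters."""
    s = bias + sum(w * x for w, x in zip(weights, inputs))
    return 1.0 / (1.0 + math.exp(-s))


# Hypothetical shot library: weights over (RI, OM, EI, DY, EX, coherence).
shot_library = {
    "close_up_dialogue": ([-0.5, 0.0, 1.0, -1.0, 0.0, 2.0], -0.5),
    "wide_action_shot":  ([1.5, 0.0, 0.0, 1.5, 0.5, -1.0], -0.5),
}


def rank_shots(event_inputs):
    """Order candidate shots by their motivation for the given event."""
    scored = [(activation(w, b, event_inputs), name)
              for name, (w, b) in shot_library.items()]
    return sorted(scored, reverse=True)


calm_dialogue = (0.0, 0.0, 0.3, -0.8, 0.0, 0.9)   # low dynamics, high coherence
print(rank_shots(calm_dialogue))                  # the close-up is ranked first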

In the final step, the module for action realisation tries to carry out the shot with the highest priority for the chosen event. It computes the camera position, orientation, and other parameters like the field of view, and makes this information available to the narrative application for visualisation. If a shot cannot be realised, due to geometrical constraints for instance, the next matching shot is chosen. Cinematic rules are also considered during this phase.

The user can easily add new shots to the shot library. Currently, we provide the possibility to describe shots based on spherical coordinates and on the final on-screen position of the objects. Using these descriptions, we could easily implement all standard shots found in the cinematic literature. We additionally provide key-frame animation for camera movements. Direct interdependencies between shots can be modelled by specifying positive or negative predecessor shot classes, similar to approaches based on state machines like [HCS96].


Fig. 5. A dialogue between four characters. a) Original first-person view. b) Camera agent: the spectator is guided through the conversation by appropriate camera shots.

6 Experimental Results: Half-Life

The computer game Half-Life [VS98] was published in November 1998 and is considered to be one of the first games combining the elements of pure 3D action games with narrative elements, like dialogues, intermissions, and action scenes, which are embedded into the game-play. We modified the software to generate a narrative event every time a character changes its internal state. For instance, if a character switches from an idle state to a conversation, it sends a corresponding event to the camera agent. The experimental evaluation was done by letting unprofessional audiences as well as professional cinematographers experience both views: the original first-person view of the game, and the shots created by the camera agent. The consensus was that the narrative, dramaturgical content of the game could be emphasised significantly by choosing cinematographically appropriate shots. For example, Figure 5 compares the original first-person perspective during a dialogue to a sequence of camera shots created by the camera agent. In the first-person perspective, the player is not forced to concentrate on the dialogue and can miss potentially story-relevant parts of the conversation. By guiding the player through the conversation with appropriate shots of the camera agent, the player's focus can be manipulated to intensify the narrative experience. More examples for diverse situations can be seen online at [Hor03].

7 Conclusion

We presented our work on transferring cinematographic concepts to interactive narratives, and an autonomous camera agent implementing this knowledge. The agent-based approach enabled us to integrate the system easily into an existing application. Our experimental results were very convincing and significantly enhanced the narrative experience of Half-Life for the spectator. Future research is necessary to understand and formalise more complex cinematic concepts, and to integrate planning techniques to allow for sophisticated reasoning about the visual outcome of camera shots to support the narrative. Reasoning about geometrical constraints within a scene is necessary, as well as considering such complex concepts as ‘visual metaphors’.

References

[AMH02] T. Akenine-Moller and E. Haines. Real-Time Rendering, Second Edition. A K Peters, 2002.
[Ari76] D. Arijon. Grammar of the Film Language. Focal Press, Boston, 1976.
[BGL98] W. H. Bares, J. P. Gregoire, and J. C. Lester. Realtime constraint-based cinematography for complex interactive 3d worlds. In AAAI/IAAI, pages 1101–1106, 1998.
[Bli96] J. Blinn. Jim Blinn's Corner: A Trip Down the Graphics Pipeline. Morgan Kaufmann, 1996.
[CAH+96] D. B. Christianson, S. E. Anderson, L. He, D. Salesin, D. S. Weld, and M. F. Cohen. Declarative camera control for automatic cinematography. In AAAI/IAAI, Vol. 1, pages 148–155, 1996.
[DZ95] S. M. Drucker and D. Zeltzer. Camdroid: A system for implementing intelligent camera control. In Symposium on Interactive 3D Graphics, pages 139–144, 1995.
[HCS96] L. He, M. F. Cohen, and D. H. Salesin. The virtual cinematographer: A paradigm for automatic real-time camera control and directing. Computer Graphics, 30(Annual Conference Series):217–224, 1996.
[HHS01] N. Halper, R. Helbing, and T. Strotthotte. A camera engine for computer games: Managing the trade-off between constraint satisfaction and frame coherence. In Computer Graphics Forum: Proceedings EUROGRAPHICS 20(3), pages 174–183, 2001.
[Hor03] A. Hornung. Autonomous real-time camera agents in interactive narratives and games. Master's thesis, Department of Computer Science V, Aachen University of Technology, 2003. Download at http://www.cocoonpage.com/hlcam.
[Kat91] S. Katz. Film Directing Shot By Shot: Visualizing From Concept To Screen. Michael Wiese Productions, 1991.
[KM02] K. Kennedy and R. E. Mercer. Planning animation cinematography and shot structure to communicate theme and mood. In Proceedings of the 2nd international symposium on smart graphics, pages 1–8. ACM Press, 2002.
[Lab] Laboratory for Mixed Realities, Cologne. alVRed. Nonlinear Dramaturgy in VR-Environments, http://www.alvred.de/.
[TBN] B. Tomlinson, B. Blumberg, and D. Nain. Expressive autonomous cinematography for interactive virtual environments. In Proceedings of the Fourth International Conference on Autonomous Agents, Barcelona, Catalonia, Spain. ACM Press.
[VS98] Valve Software. Half-Life. Computer game used as interactive narrative platform for the camera agent, http://www.sierra.com/games/half-life/, 1998.


That’s My Point! Telling Stories from a Virtual Guide Perspective

Jesus Ibanez1,2, Ruth Aylett2, and Rocio Ruiz-Rodarte3

1 Departamento de Tecnologia, Universidad Pompeu Fabra,
Passeig de Circumvallacio, 8, 08003 Barcelona, Spain
[email protected]

2 Centre for Virtual Environments, The University of Salford,
Business House, University Road, Salford, M5 4WT, Manchester
[email protected]

3 Instituto Tecnologico de Estudios Superiores de Monterrey,
Campus Estado de Mexico, 52926 Mexico
[email protected]

Abstract. This paper describes our proposal for storytelling in virtual environments from a virtual guide perspective. In our model the guide begins at a particular location and starts to navigate the world, telling the user stories related to the places she visits. Our guide tries to emulate a real guide's behaviour in such a situation. In particular, she behaves as a spontaneous real guide who knows stories about the places in the virtual world but has prepared neither an exhaustive tour nor a storyline.

1 Introduction

Nowadays, virtual environments are becoming a widely used technology as the price of the hardware necessary to run them decreases. Current video games show 3D environments unimaginable a few years ago. Many recently developed virtual environments recreate real spaces with an impressive degree of realism. In such contexts, however, a lack of information for the user is frequently perceived, which makes him lose his interest in these environments. In the real world, people relate the environments that surround them to the stories they know about the places and objects in the environment. Therefore, in order to obtain more human and useful virtual environments, we need to add a narrative layer to them. We need stories related to the places and objects in the world. And finally, we need a virtual guide able to tell us these stories.

On the other hand, as pointed out in [4], one of the most striking features of historical investigations is the coexistence of multiple interpretations of the same event or process. The same historical events can be told as different stories depending on the storyteller's point of view. The story of the same battle between two cities, for example, will be different depending on the origin of the storyteller. It would be interesting if the virtual guide which tells us stories about the virtual environment she inhabits (in this paper the virtual guide is referred to as female and the human guide as male, to avoid confusion) could tell us these stories from her own perspective. Such a guide would, in addition, be very useful for educational purposes. Children would be more open-minded if they could listen to different versions of the same historical events depending on the profile of the storyteller. In this sense, this paper describes the design and development of a novel proposal for storytelling in virtual environments from a virtual guide perspective.

2 Narrative Construction

In our model the guide begins at a particular location and starts to navigate the world, telling the user stories related to the places she visits. Our guide tries to emulate a real guide's behaviour in such a situation. In particular, she behaves as a spontaneous real guide who knows stories about the places in the virtual world but has prepared neither an exhaustive tour nor a storyline.

Furthermore, our guide tells stories from her own perspective, that is, she narrates historical facts and events taking into account her own interests and roles. In fact, she extends the stories she tells with comments that show her own point of view. This mixture of neutral information and personal comments is what we can expect from a real guide who, on the one hand, has to tell the information he has learnt but, on the other hand, cannot hide his feelings, opinions, etc. about the information he is telling. We have designed a hybrid algorithm that models a virtual guide's behaviour taking into account all the aspects described above. The mechanisms involved in the algorithm can be separated into three global processes which are carried out at every step. The next three subsections detail these general phases.

2.1 Finding a Spot in the Guide’s Memory

Given a particular step in the navigation-storytelling process (that is, the virtual guide is at a particular location and has previously narrated a series of story pieces), the guide should decide where to go and what to tell there. To emulate a real guide's behaviour, the virtual guide evaluates every candidate pair (story element, location) taking into account three different factors: the distance from the current location to location, the story elements already told at the current moment, and the affinity between story element and the guide's profile.

A real guide will usually prefer nearer locations, as locations further away involve long displacements which lead to unnatural and boring delays between the narrated story elements. In this sense, our guide prefers nearer locations too, and therefore shorter displacements. When a real guide is telling stories in an improvisational way, the already narrated story elements make him recall, by association, related story elements. In a spontaneous way, a real guide tends to tell these recently remembered stories. In this sense, our guide prefers story elements related (metaphorically remembered) to the ones previously narrated. Finally, a real guide tends to tell stories related to his own interests (hobbies, preferences, etc.) or roles (gender, job, religion, etc.). In this sense, our guide prefers story elements related to her own profile.

The system evaluates every candidate pair (storyelement, location) such that there is an entry in the knowledge base that relates storyelement to location (note that this means that storyelement can be narrated in location) and such that storyelement has not been narrated yet. In particular, three scores corresponding to the previously mentioned factors are calculated. These three scores are then combined to calculate an overall score for every candidate pair. Finally the system chooses the pair with the highest overall score.
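A minimal sketch of this selection step, assuming a simple additive combination of the three factors (the actual weighting is not given in the paper, and the names are invented):

def overall_score(distance, relatedness, affinity,
                  w_dist=1.0, w_rel=1.0, w_aff=1.0):
    """Combine the three factors; nearer locations score higher."""
    proximity = 1.0 / (1.0 + distance)
    return w_dist * proximity + w_rel * relatedness + w_aff * affinity


def choose_next(candidates):
    """candidates: (story_element, location, distance, relatedness, affinity)."""
    best = max(candidates, key=lambda c: overall_score(c[2], c[3], c[4]))
    return best[0], best[1]


candidates = [
    ("battle_of_the_plaza", "plaza", 12.0, 0.2, 0.9),
    ("founding_legend", "temple", 3.0, 0.8, 0.4),
]
print(choose_next(candidates))   # the nearby, strongly related temple story wins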

2.2 Extending and Contextualising the Information

Figure 1a represents a part of the general memory the guide uses. This memory contains story elements that are interconnected with one another in terms of different relations. In particular, in our case, cause-effect and subject-object relations interconnect the story elements. Figure 1b shows the same part of the memory, where a story element has been selected by obtaining the best overall score described in the previous section. If the granularity provided by the selected story element is not considered large enough to generate a little story, then more story elements are selected. The additional story elements are chosen according to a particular criterion or a combination of several criteria (cause-effect and subject-object in our case). This process can be considered as navigating the memory from the original story element. Figure 1c shows the same part of the memory, where three additional story elements have been selected by navigating from the original story element.

Fig. 1. Storyboard construction. a) Part of the general memory of the guide. b) A story element is selected. c) More elements are selected. d) Some selected elements are translated. e) Commonsense rules. f) Commonsense consequences extend the selected elements. g) A storyboard is generated (storyboard elements with linguistic contents, special effects and guide actions).

Once the granularity provided by the selected story elements is considered large enough, the selected story elements are translated, if possible, from the virtual guide's perspective (see Figure 1d). For this task the system takes into account the guide profile and the meta-rules stored in the knowledge base that are intended to situate the guide's perspective. The translation process also generates guide attitudes that reflect the emotional impact that these story elements cause her. Let us demonstrate this with a simple example. Assume the following information extracted from a selected story element:

fact(colonization, spanish, mayan)

meaning that the Spanish people colonized the Mayan people. Assume also the following meta-rule, included in the knowledge base and aimed at situating the guide's perspective:

fact(colonization, Colonizer, Colonized) and profile(Colonized) =>
    fact(colonizedColonization, Colonizer, Colonized) and
    guideattitude(anger)

meaning that a colonizedColonization fact, and anger as the guide's attitude, should be inferred if a colonization fact is included in the story element and the guide profile matches the third argument of this fact, that is, if the guide is the Colonized. In this example that will happen if the guide is Mayan. The newly inferred fact represents the original one, but from the guide's perspective.

In addition, the new translated story elements are enhanced with new information items generated by applying simple commonsense rules, which allows the guide to add comments that show her perspective. The guide uses the new contextualised story elements (Figure 1d) as input for the rules that codify commonsense knowledge (Figure 1e). By applying these rules the guide obtains consequences that are added to the contextualised story elements (Figure 1f), obtaining a new data structure which codifies the information that should be told. Let us continue with the previous example and assume the following commonsense rule:

fact(colonizedColonization, Colonizer, Colonized) =>
    fact(culturalDestruction, Colonized) and fact(religionChange, Colonized)

meaning that, from the colonized's point of view, colonization implies the destruction of the colonized's culture and a change of the colonized's religion. Therefore, if in our example the guide were Mayan, the story element to be told would be enhanced with the facts culturalDestruction and religionChange.
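The two inference steps above can be pictured with a small sketch. The Java encoding of facts and rules below is an assumption made for this example (the actual system expresses such knowledge as Jess rules); it only shows how a perspective meta-rule and a commonsense rule successively enrich the selected story element.

import java.util.*;

// Illustrative sketch of the perspective and commonsense inference steps.
public class PerspectiveRules {

    record Fact(String name, List<String> args) {}

    public static void main(String[] args) {
        String guideProfile = "mayan";                         // the guide is Mayan
        List<Fact> story = new ArrayList<>(List.of(
            new Fact("colonization", List.of("spanish", "mayan"))));
        List<String> guideAttitudes = new ArrayList<>();

        // Meta-rule: colonization + profile(Colonized) => colonizedColonization + anger
        for (Fact f : new ArrayList<>(story)) {
            if (f.name().equals("colonization") && f.args().get(1).equals(guideProfile)) {
                story.add(new Fact("colonizedColonization", f.args()));
                guideAttitudes.add("anger");
            }
        }

        // Commonsense rule: colonizedColonization => culturalDestruction + religionChange
        for (Fact f : new ArrayList<>(story)) {
            if (f.name().equals("colonizedColonization")) {
                story.add(new Fact("culturalDestruction", List.of(f.args().get(1))));
                story.add(new Fact("religionChange", List.of(f.args().get(1))));
            }
        }

        System.out.println(story);
        System.out.println("guide attitudes: " + guideAttitudes);
    }
}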

2.3 Generating the Story

As a result of the previous processes, the guide obtains a set of inter-related information items to tell (Figure 1f). These elements are stored as a structure that reflects the relations among them, as well as the reasons why each one was selected. Some elements are also related to particular guide attitudes. Now the system generates the text to tell (expressing these elements) as well as the special effects and guide actions to show while telling the story. The phases of this story generation process are as follows:


1. The first step is to order the data elements. To do so we consider three criteria: cause-effect (if an element Y was caused by another element X, then X should precede Y), subject-object (elements whose subjects/objects are similar should be grouped together) and classic climax (the first selected story element, i.e. the one that obtained the best overall score, is supposed to be the climax of the narration, and therefore the remaining elements are arranged around it). A sketch of the ordering appears after this list.

2. The text corresponding to the ordered set of elements is generated. The complexity of this process depends on the particular generation mechanism (we use a template system) and the degree of granularity employed (we use one sentence per story element).

3. A process that relies on the guide expression rules (the set of rules that translate abstract guide attitudes into particular guide actions) generates a set of guide actions (each one related to a particular story element).

4. Every story element is associated with particular environment conditions or special effects. Thus, finally, a storyboard like the one shown in Figure 1g is obtained.
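The ordering criterion of phase 1 can be sketched as follows. The sketch applies only the cause-effect criterion (causes are placed before their effects); the subject-object grouping and the climax arrangement are omitted for brevity, and all element names and edges are illustrative assumptions.

import java.util.*;

// Minimal sketch of the cause-effect ordering criterion (phase 1).
public class StoryOrdering {

    // causes.get(y) = elements that caused y (assumed to belong to the selected set),
    // so they must precede y in the ordering.
    static List<String> orderByCauseEffect(List<String> elements,
                                           Map<String, List<String>> causes) {
        List<String> ordered = new ArrayList<>();
        Set<String> placed = new HashSet<>();
        // Simple repeated pass: place an element once all its causes have been placed.
        while (ordered.size() < elements.size()) {
            boolean progress = false;
            for (String e : elements) {
                if (placed.contains(e)) continue;
                if (placed.containsAll(causes.getOrDefault(e, List.of()))) {
                    ordered.add(e);
                    placed.add(e);
                    progress = true;
                }
            }
            if (!progress) throw new IllegalStateException("cycle in cause-effect relations");
        }
        return ordered;
    }

    public static void main(String[] args) {
        List<String> elements = List.of("culturalDestruction", "colonization", "religionChange");
        Map<String, List<String>> causes = Map.of(
            "culturalDestruction", List.of("colonization"),
            "religionChange",      List.of("colonization"));
        System.out.println(orderByCauseEffect(elements, causes)); // colonization comes first
    }
}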

3 Implementation

We have chosen the Unreal Tournament (UT) engine as the platform on which our virtual worlds run. As we wished our system to be open and portable, we decided to use Gamebots to connect our virtual guide to UT. Gamebots [3] is a modification to UT that allows characters in the game to be controlled via network sockets connected to other programs. The core of the virtual guide is a Java application which is able to connect to UT worlds through Gamebots. This Java application controls the movement and animations of the guide in the world as well as the presentation of the special effects and texts which show the generated narratives. The current version uses a MySQL [2] database to store the knowledge base, which the Java application accesses through JDBC. The developed system uses Jess [1] to carry out inferences on the information.
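As an illustration of the data plumbing only, the fragment below shows how such a Java core might read (storyelement, location) entries from a MySQL knowledge base over JDBC. The table and column names are invented for the example, the connection parameters are placeholders, and the Jess inference and Gamebots socket handling are not shown.

import java.sql.*;
import java.util.*;

// Sketch of knowledge-base access over JDBC (hypothetical schema; requires the MySQL driver).
public class KnowledgeBaseClient {

    public static List<String[]> loadCandidatePairs(Connection conn) throws SQLException {
        String query = "SELECT story_element, location FROM element_location"; // hypothetical table
        List<String[]> pairs = new ArrayList<>();
        try (Statement st = conn.createStatement(); ResultSet rs = st.executeQuery(query)) {
            while (rs.next()) {
                pairs.add(new String[] { rs.getString("story_element"), rs.getString("location") });
            }
        }
        return pairs;
    }

    public static void main(String[] args) throws SQLException {
        // Placeholder URL and credentials.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost/guide_kb", "guide", "secret")) {
            for (String[] pair : loadCandidatePairs(conn)) {
                System.out.println(pair[0] + " can be narrated at " + pair[1]);
            }
        }
    }
}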

The described system is largely implemented and works properly with small knowledge bases. We still have to check how the system behaves when dealing with large knowledge bases and to evaluate different template systems for the generation of text from the storyboard.

References

[1] Jess, the rule engine for the Java platform. Available at http://herzberg.ca.sandia.gov/jess/.

[2] MySQL. Available at http://www.mysql.com/.

[3] G. A. Kaminka, M. M. Veloso, S. Schaffer, C. Sollitto, R. Adobbati, A. N. Marshall, A. Scholer, and S. Tejada. Gamebots: A flexible test bed for multiagent team research. Communications of the ACM 45 (2002), no. 1.

[4] Veronica Tozzi. Past reality and multiple interpretations in historical investigation. Studies in Social and Political Thought 2 (2000).


Persona Effect Revisited

Using Bio-Signals to Measure and Reflect the Impact of Character-Based Interfaces

Helmut Prendinger, Sonja Mayer, Junichiro Mori, and Mitsuru Ishizuka

Department of Information and Communication Engineering, Graduate School of Information Science and Technology

University of Tokyo
[email protected], [email protected], {jmori,ishizuka}@miv.t.u-tokyo.ac.jp

Abstract. The so-called 'persona effect' describes the phenomenon that a life-like interface agent can have a positive effect on the user's perception of a computer-based interaction task. Whereas previous empirical studies rely on questionnaires to evaluate the persona effect, we utilize bio-signals of users in order to precisely associate the occurrence of interface events with users' autonomic nervous system (ANS) activity. In this paper, we first report on the results of an experiment with an agent-guided mathematical game suggesting that an interface character with affective behavior may significantly decrease user stress. Then, we describe a character-based job interview scenario where a user's affective state derived from physiological data is projected back (or 'mirrored') to the user in real-time. Rather than measuring the effect of an interface agent, the focus here is on employing a character as a medium to reflect the user's emotional state, a concept with some potential for emotional intelligence training and the medical domain, especially e-Healthcare.

1 Introduction

While animated agents, or life-like characters, start populating the interfaces of numerous computer-based applications [10], their impact on human users is still largely unexplored, or at least characterized only in very general terms. In the context of educational software, Lester et al. [5] identified the persona effect, which refers to (i) the credibility- and motivation-enhancing effects of character-based interfaces, as well as to (ii) the positive effect of animated agents on the users' perception of the learning experience. Van Mulken et al. [14] conducted a follow-up study to [5] where a life-like character acts as a presenter of technical and non-technical information. In their experiment, the positive effect of an animated interface agent on the 'subjective measures' entertainment and perceived difficulty is supported (for technical information), whereas no significant effect on 'objective' measures of the interaction, such as comprehension and recall, could be shown. Both of the mentioned studies rely on questionnaires as an evaluation method, which does not allow for a precise temporal assessment of which particular agent behavior is responsible for the agents' good overall perception.

In this paper, we propose to take physiological data of users during the interaction with a character-based interface as an evaluation method. As bio-signals have recently been shown to be indicative of the affective state of users [12], we may gain new insights concerning subjective measures of the persona effect.1 The recorded history of users' bio-signals makes it possible to precisely relate ANS activity to the (user-computer) interaction state, and hence to track the impact of agent behavior. Furthermore, by using both bio-signals and questionnaires as evaluation methods, we may detect possible discrepancies between the interaction as perceived by the user and the factual physiological state of the user.

The promising results of the experiment sparked our interest in reflecting the effect of an interface persona to the user in a more direct way, thus allowing the user to inspect his or her arousal in real-time. The Emotion Mirror is a web-based application depicting a job interview scenario. The arousal level of users during the interview, again inferred from their ANS activity, is 'mirrored' to them by employing a life-like character as an embodied mediator of their experienced stress or relaxation states. Although the current demonstrator system is fairly simple, it allows us to gather valuable experience for the next generation of emotional intelligence [1] training systems and e-Healthcare applications [7].

The following section reports on the experiment, and Section 3 briefly explains the Emotion Mirror application. Section 4 concludes the paper.

2 Character-Based Quiz Game – An Empirical Study

The experiment with a simple quiz game described in this section investigates the effect of a life-like character with affective behavior on users' affective state, which is derived from physiological data. The primary hypothesis of this study can be formulated as follows: if a life-like interface agent provides affective feedback to the user, it can effectively reduce user stress. To our knowledge, this is the first investigation that explores the possibility of employing an animated agent to respond to presumed negative feelings on the part of the user. Other research used an embodied character without addressing the issue of user frustration (van Mulken et al. [14]) or provided only a text-based response as a feedback medium to the (deliberately frustrated) user (Klein et al. [3]).

2.1 Theory and Game Design

We implemented a simple mathematical quiz game where subjects are instructed to sum up five consecutively displayed numbers and are then asked to subtract the i-th number of the sequence (i ≤ 4). The instruction is given by the "Shima" character, an animated cartoon-style 2D agent, using synthetic speech and appropriate gestures. The numbers are also displayed in a balloon adjacent to the character. Subjects compete for the best score in terms of correct answers and time. Subjects were told that they would interact with a prototype interface that may still contain some bugs. This warning was essential since in some quiz questions, a delay was inserted before showing the 5th number. The delay was assumed to induce frustration as the subjects' goals of giving the correct answer and achieving a fast score are thwarted.

1 It is important to note that our work differs from other research on the persona effect [5,14] in that we compare "affective persona" vs. "non-affective persona" conditions rather than "persona" vs. "no persona" conditions.

In order to measure user frustration (or stress), we took users' galvanic skin response (GSR) signal, which is an indicator of skin conductance.2 It has been shown that skin conductance varies linearly with the overall level of arousal and increases with anxiety and stress (see Picard [9], Healey [2]).

2.2 Method

Subjects and Design. Participants of the experiment were twenty male students of the School of Engineering at the University of Tokyo, on average 24 years of age, and all of them native speakers of Japanese. According to the two levels of the independent variable, affective vs. non-affective feedback of a life-like character, two versions of the quiz game were prepared:

– Affective version. Depending on whether the subject selects the correct or wrong answer from the menu displayed in the game window (see the numbers in Fig. 1), the character expresses 'happy for' and 'sorry for' emotions both verbally and nonverbally, e.g., by "smiling" (for happiness) and "hanging shoulders" (for sorriness). When a delay in the game flow happens, the character expresses empathy for the subject after the subject answers the question that was affected by the delay (see Fig. 1).

– Non-affective version. The character does not give any affective feedback to the subjects. It simply replies "right" or "wrong" to the answer of the subjects. If a delay happens, the agent does not comment on the occurrence of the delay, and simply remains silent for a short period of time.

If a delay occurs (in the affective version), the character expresses empathy to the subjects by displaying a gesture that Japanese people will easily understand as a signal of the interlocutor's apology (see Fig. 1), and uttering: "I apologize that there was a delay in posing the question" (English translation). Note that the apology is given after the occurrence of the delay, immediately after the subject's answer (and not during the delay period).

In order to show the effect of the character's behavior on the physiological state of subjects, we consider specific segments. (i) The DELAY segment refers to the period from the moment the agent suddenly stops activity, although the question is not yet completed, until the moment the agent continues with the question; (ii) the DELAY-RESPONSE segment refers to the period when the agent expresses empathy concerning the delay, or ignores the occurrence of the delay, which follows the agent's response (regarding the correctness of the answer) to the subject's answer; (iii) the RESPONSE segment refers to the agent's response to the subject's correct or wrong answer to the quiz question.

2 We also recorded subjects' blood volume pulse (BVP) signal, from which the heart rate of subjects can be calculated. Unfortunately, the low reliability of our method used to gather the BVP signal precluded its consideration in the analysis.

Fig. 1. Shima character: "I apologize that there was a delay in posing the question."

Fig. 2. Schematic of the experimental setup.

Procedure and Apparatus. The subjects were recruited directly by the experimenter and offered 1000 Yen for participation, plus an additional 5000 Yen for the best score. Subjects were randomly assigned to one of the two versions of the game. The experiment was conducted in Japanese and lasted about 25 minutes (15 minutes for game play, and 10 minutes for experimenter instructions, attaching the sensors, etc.). Subjects came to the testing room individually and were seated in front of a computer display, keyboard, and mouse. After briefing the subjects about the experiment and asking them to sign the consent form, galvanic skin response and blood volume pulse sensors were attached to the first three fingers of their non-dominant hand (see Fig. 2).

Before subjects actually started to play the game, the character showed some example quiz questions that explain the game. This period also serves to collect the physiological data of subjects that are needed as a baseline to normalize data obtained during game play. In six out of a total of thirty quiz questions, a delay was inserted before showing the 5th number. The duration of delays was 6–14 secs. (9 secs. on average). While subjects played the game the experimenter remained in the room and monitored their physiological activity on a laptop computer. The experimenter and laptop were hidden from the view of the subjects. After the subjects completed the quiz, the sensors were removed from their hand, and they were asked to fill out a short questionnaire, which contained questions about the difficulty and their impression of playing the game. Finally, subjects were told to keep checking a web page that would announce the best score.

The game was displayed on a 20-inch color monitor, running Internet Explorer with browsing buttons deactivated. The Microsoft Agent package [8] was used to control character animations and synthetic speech. Two flat speakers produced the sound. Physiological signals were recorded with the ProComp+ unit and visualized with BioGraph 2.1 software (both from Thought Technology Ltd. [13]).

2.3 Results

The first observation relates to the use of delays in order to induce stress in subjects. All eighteen subjects showed a significant rise in skin conductance in the DELAY segment, indicating an increased level of arousal. The data of two subjects of the non-affective version were discarded because of extremely deviant values. In the following, the significance level α is set to 0.05.

The general hypothesis about the positive effect of life-like characters with affective behavior on a subjective measure, here the users' stress level, can be divided into two specific hypotheses (Empathy and Affective Feedback).

– Empathy Hypothesis: Skin conductance (stress) is lower when the character shows empathy after a delay occurred than when the character does not show empathy.

– Affective Feedback Hypothesis: When the character tells whether the subject's answer is right or wrong, skin conductance is lower in the affective version than in the non-affective version.

To support the Empathy Hypothesis, the differences between the mean values of the GSR signal (in micro-Siemens) in the DELAY and DELAY-RESPONSE segments were calculated for each subject. In the non-affective version (no display of empathy), the difference is even negative (mean = −0.08). In the affective version (display of empathy), GSR decreases when the character responds to the user (mean = 0.14). The t-test (two-tailed, assuming unequal variances) showed a significant effect of the character's empathic behavior as opposed to non-affective behavior (t(16) = −2.47; p = 0.025). This result suggests that an animated agent expressing empathy may undo some of the frustration (or reduce stress) caused by a deficiency of the interface.
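For concreteness, the sketch below shows the kind of computation involved: a per-subject difference of mean GSR between the DELAY and DELAY-RESPONSE segments, followed by the t statistic for two independent samples with unequal variances (Welch's form). All numbers are made up; this is not the study's analysis code.

import java.util.*;

// Illustrative sketch of the GSR analysis: segment differences and a two-sample t statistic.
public class GsrAnalysis {

    static double mean(double[] x) {
        return Arrays.stream(x).average().orElse(0.0);
    }

    static double variance(double[] x) {
        double m = mean(x), s = 0.0;
        for (double v : x) s += (v - m) * (v - m);
        return s / (x.length - 1); // sample variance
    }

    // Per-subject difference: mean GSR in the DELAY segment minus mean GSR in DELAY-RESPONSE.
    static double segmentDifference(double[] delaySegment, double[] delayResponseSegment) {
        return mean(delaySegment) - mean(delayResponseSegment);
    }

    // Welch's t statistic (two-tailed test, unequal variances assumed).
    static double welchT(double[] groupA, double[] groupB) {
        return (mean(groupA) - mean(groupB))
                / Math.sqrt(variance(groupA) / groupA.length + variance(groupB) / groupB.length);
    }

    public static void main(String[] args) {
        // Per-subject differences (micro-Siemens) for the two versions, invented for illustration.
        double[] nonAffective = { -0.05, -0.12, -0.02, -0.10, -0.07, -0.15, -0.03, -0.08 };
        double[] affective    = { 0.20, 0.10, 0.15, 0.12, 0.08, 0.19, 0.11, 0.05, 0.17, 0.23 };
        System.out.println("t = " + welchT(nonAffective, affective));
    }
}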

The Affective Feedback Hypothesis compares the means of the GSR values of the RESPONSE segments for both versions of the game. Note that the character's responses to all queries, not only the queries affected by a delay, are considered here. However, the t-test showed no significant effect (t(16) = 1.75; p = 0.099). When responding to the subject's answer, affective behavior of the character seemingly has no major impact on subjects' skin conductance.

In line with the study of van Mulken et al. [14], who show that embodied interface agents have no significant effect on comprehension and recall, we expected that affective life-like characters would not influence objective performance measures. Our expectation was confirmed, as the average score in the affective version was 28.5 (out of 30 answers), and 28.4 in the non-affective version.

In addition to providing physiological data, subjects were asked to fill out a short questionnaire. Table 1 shows the mean scores for some questions. None of the differences in rating reached the level of significance. Only the scores for the first question suggest a tendency about the subjects' impression of the difficulty of the game (t(17) = 1.74; p = 0.1). This result can be compared to the findings of van Mulken et al. [14], which show that a character may influence the subjects' perception of difficulty. In their experiment, though, van Mulken and coworkers compare "persona" vs. "no persona" conditions rather than "affective persona" vs. "non-affective persona" conditions.

Table 1. Mean scores for questions about interaction experience in the non-affective (NA) and affective (A) game versions. Ratings range from 1 (disagreement) to 10 (agreement).

Question                                 NA    A
I experienced the quiz as difficult.     7.5   5.4
I was frustrated with the delays.        5.2   4.2
I enjoyed playing the quiz game.         6.6   7.2

3 The Emotion Mirror – Future Work

This section briefly describes the Emotion Mirror, a character-based application aimed at training the interpersonal communication skills known as emotional intelligence [1], specifically the abilities to be aware of and to regulate one's emotions. A job interview situation is one example where emotional intelligence is beneficial, as the interviewee has to manage his or her emotions when confronted with unpleasant and probing questions from the interviewer.3 Since physiological manifestations of stress may reflect negatively on the interviewer's impression of the interviewee, a virtual job interview alerting the user (as interviewee) about his or her arousal level might serve as a valuable preparatory training environment. The Emotion Mirror application assumes that users are biased to conceive of life-like characters as veritable social actors (the 'Media Equation' [11]), and hence actually get aroused when interviewed by a virtual agent.4

The job interview scenario features two life-like characters, the interviewer to the left and the 'Mirror Agent' to the right (see Fig. 3). Users in the role of interviewees are attached to the sensors of the ProComp+ device. As in the study described above, we currently take the galvanic skin response (GSR) signal only. Unlike the implementation of the experiment, however, the Emotion Mirror application requires physiological data to be processed in real-time. This was achieved by using Visual C++ and the ProComp+ data capture library, with the Active Template Library (ATL) as an interface to the JavaScript code and the Microsoft Agent controls [8] that drive the agents' animation and speech engines.

3 A job interview scenario featuring an 'affective mirror' has been suggested by Picard [9, p. 86], but to our knowledge, it was never implemented.

4 It is certainly true that an online interview cannot induce the stress level of a face-to-face or phone interview.

Fig. 3. Job Interview Scenario with Emotion Mirror.

The baseline for subsequent bio-signal changes is obtained as the average of GSR values during an initial relaxation period of 40 secs., in which the user listens to music from Cafe del Mar (Vol. 9). An interview episode consists of four segments: (i) the interviewer character asks a question; (ii) the user selects an answer from a set of given options (the lower part in Fig. 3); (iii) the interviewer responds to the user's answer; (iv) the Mirror Agent displays the user's arousal level calculated from the data gathered during segments (i)–(iii). More precisely, we take values every 50 msec. for a period of 5 secs. The psychophysiological literature, e.g., Levenson [6, p. 30], suggests 0.5–4 secs. as an approximation for the duration of an emotion.

When the (average) GSR signal is 15–30% above the baseline, the user's arousal level is assumed to be 'high'. If the signal is on average more than 30% above the baseline, the user is assumed to be very aroused, and the Mirror Agent will display a gesture expressing anxiety and utter, e.g., "You seem to be quite stressed". The Mirror Agent reflects the arousal state of the user in a rather exaggerated way, in order to alert the user to his or her presumed impression on the interviewer.
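A minimal sketch of this arousal classification follows, assuming a sampling loop that delivers one GSR reading every 50 msec. for 5 secs. The threshold values come from the text; everything else (method names, the sample source, the numeric values) is invented for illustration.

import java.util.List;

// Sketch of the Emotion Mirror arousal classification (illustrative only).
public class ArousalClassifier {

    enum Arousal { NORMAL, HIGH, VERY_HIGH }

    // baseline: average GSR during the initial 40-sec. relaxation period.
    static Arousal classify(List<Double> episodeSamples, double baseline) {
        double avg = episodeSamples.stream().mapToDouble(Double::doubleValue).average().orElse(baseline);
        double increase = (avg - baseline) / baseline;   // relative increase over the baseline
        if (increase > 0.30) return Arousal.VERY_HIGH;   // more than 30% above baseline
        if (increase >= 0.15) return Arousal.HIGH;       // 15-30% above baseline
        return Arousal.NORMAL;
    }

    public static void main(String[] args) {
        double baseline = 2.0; // micro-Siemens, made up
        // Sampling every 50 msec. for 5 secs. would give 100 values; a few suffice here.
        List<Double> samples = List.of(2.5, 2.7, 2.8, 2.6);
        Arousal level = classify(samples, baseline);
        // The Mirror Agent would react to this level, e.g. "You seem to be quite stressed".
        System.out.println("arousal level: " + level);
    }
}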

While initial experiences with the Emotion Mirror are promising (people do get aroused for some questions), the Mirror Agent's reaction is still too unspecific about the user's emotion. We are currently adding the electrocardiogram (ECG) rhythm trace as an indicator of heart rate, which together with skin conductance will allow us to identify named emotions in Lang's [4] two-dimensional model. We also plan to use a Bayesian network to combine different sensor data and to account for the uncertainty of the domain.

4 Conclusions

This paper proposes a new approach to the persona effect, a phenomenon that describes the positive effect of life-like interface agents on human users. In addition to questionnaires, we suggest utilizing bio-signals in order to show the persona effect in terms of users' arousal level. While user stress could not be directly deduced from the sensor data, the design of the experiment suggests this interpretation. The main results of the empirical study discussed in the paper are: (i) a character displaying empathy may significantly decrease user stress; (ii) the character's affective behavior has no impact on users' performance in a simple mathematical game; but (iii) it has an (almost significant) positive effect on the users' perception of task difficulty.

The Emotion Mirror application offers a direct interpretation of the persona effect, by reflecting the user's arousal level in a job interview scenario to the user in real-time. This concept is already under consideration for e-Healthcare systems [7]. We also hope that character-based interfaces with emotion recognition capability will prove useful for social skills training and software testing.

Acknowledgments. This research is supported by the JSPS Research Grant (1999-2003) for the Future Program. The authors would like to thank Naoaki Okazaki for his generous help with implementing the ATL interface.

References

1. D. Goleman. Emotional Intelligence. Bantam Books, New York, 1995.
2. J. A. Healey. Wearable and Automotive Systems for Affect Recognition from Physiology. PhD thesis, Massachusetts Institute of Technology, 2000.
3. J. Klein, Y. Moon, and R. Picard. This computer responds to user frustration: Theory, design, and results. Interacting with Computers, 14:119–140, 2002.
4. P. J. Lang. The emotion probe: Studies of motivation and attention. American Psychologist, 50(5):372–385, 1995.
5. J. C. Lester, S. A. Converse, S. E. Kahler, S. T. Barlow, B. A. Stone, and R. S. Bhogal. The persona effect: Affective impact of animated pedagogical agents. In Proceedings of CHI-97, pages 359–366. ACM Press, 1997.
6. R. W. Levenson. Emotion and the autonomic nervous system: A prospectus for research on autonomic specificity. In H. L. Wagner, editor, Social Psychophysiology and Emotion: Theory and Clinical Applications, pages 17–42. John Wiley & Sons, Hoboken, NJ, 1988.
7. C. Lisetti, F. Nasoz, C. LeRouge, O. Ozyer, and K. Alvarez. Developing multimodal intelligent affective interfaces for tele-home health care. International Journal of Human-Computer Studies, 2003. To appear.
8. Microsoft. Developing for Microsoft Agent. Microsoft Press, 1998.
9. R. W. Picard. Affective Computing. The MIT Press, 1997.
10. H. Prendinger and M. Ishizuka, editors. Life-like Characters. Tools, Affective Functions and Applications. Cognitive Technologies. Springer Verlag, 2003. To appear.
11. B. Reeves and C. Nass. The Media Equation. How People Treat Computers, Television and New Media Like Real People and Places. CSLI Publications, Center for the Study of Language and Information. Cambridge University Press, 1998.
12. J. Scheirer, R. Fernandez, J. Klein, and R. W. Picard. Frustrating the user on purpose: A step toward building an affective computer. Interacting with Computers, 14:93–118, 2002.
13. Thought Technology Ltd. URL: http://www.thoughttechnology.com.
14. S. van Mulken, E. Andre, and J. Muller. The persona effect: How substantial is it? In Proceedings of Human Computer Interaction (HCI-98), pages 53–66, Berlin, 1998. Springer.


Steve Meets Jack: The Integration of an

Intelligent Tutor and a Virtual Environment with Planning Capabilities

Gonzalo Mendez1, Jeff Rickel2, and Angelica de Antonio1

1 Computer Science School, Technical University of Madrid
Campus de Montegancedo, 28660 Boadilla del Monte (Madrid)
[email protected], [email protected]

2 Information Sciences Institute, University of Southern California
4676 Admiralty Way, Marina del Rey, CA
[email protected]

Abstract. In this paper, we describe how we have integrated Steve, an intelligent tutor based on Soar, and HeSPI, a human simulation tool for planning and simulating maintenance tasks in nuclear power plants. The objectives of this integration were to test Steve's flexibility for use in different applications and environments, and to extend HeSPI so that it can be used as a virtual environment for training. We discuss the problems encountered and the solutions we have designed to solve them.

1 Introduction

Intelligent animated agents that interact with human users and other agents in virtual worlds have been applied to a wide variety of applications, such as education and training [1], therapy [2], and marketing [3, 4]. However, it is rare for such an agent to be reused across multiple virtual worlds, and rarer still for one to be applied to a new virtual world developed independently, rather than with the agent in mind. Most animated agents, especially those that interact closely with their virtual world, were designed for a particular virtual world and application. Because of the effort required for the development of such agents, it would be desirable to be able to reuse these agents in new environments easily. An unlimited exchange of agents and environments will only be possible if some standard is developed to facilitate this task. Unfortunately, this standard is still far from being available.

In this paper, we describe an experiment in reusing an intelligent animated agent, Steve [5, 6], in a new virtual world, HeSPI [7], which was developed independently. This experience has allowed us to reflect on the problems that may arise when doing such an integration, as a first step towards the definition of a new standard.

Steve was designed to be easy to apply to new domains and virtual worlds. It was originally applied to equipment operation and maintenance training on board a virtual ship. Subsequently, it was significantly extended and applied to leadership training in virtual Bosnia [8]. However, the leadership training application was designed with Steve in mind. In contrast, HeSPI was developed independently as a tool for equipment operation and maintenance training for Nuclear Power Plants (NPPs). Thus, HeSPI provides a good test of how portable Steve really is.

In addition to evaluating Steve's portability, our experiment was also motivated by a real need. HeSPI was designed as a tool for planning and simulating procedures in nuclear power plants. Via a user interface, an experienced operator can specify the required steps in a procedure, and they will be carried out by an animated human figure, Jack [9]. These procedures can subsequently be replayed by a less experienced operator learning to perform them, thus providing a training tool. However, the trainee is limited to watching the procedure, and cannot ask questions, get explanations, or practice the task himself, something that has already proven to be useful in previous systems applied to training in NPPs [10]. In contrast, Steve can demonstrate procedures, monitor students as they practice a task (giving them feedback on their actions), and answer simple questions. Thus, integrating Steve into HeSPI would greatly enhance its training capabilities.

In the remainder of the paper, we provide brief background on Steve and HeSPI and then discuss our experience integrating the two. As we will discuss, many aspects of the integration went smoothly, but some difficulties did arise. The results of our effort serve not only as an evaluation of Steve and HeSPI but also as important lessons in building portable agents and flexible virtual worlds.

2 Steve

Steve (Soar Training Expert for Virtual Environments) is an autonomous, animated agent for training in 3D virtual environments. Steve's role is to help students learn procedural tasks, and he has many pedagogical capabilities one would expect of an intelligent tutoring system. However, because he has an animated body, and cohabits the virtual world with students, he can provide more human-like assistance than previous disembodied tutors. For example, he can demonstrate actions, use gaze and gestures to direct a student's attention, guide students around in the virtual world, and play the role of missing teammates for team training. His architecture allows him to robustly handle a dynamic virtual world, potentially populated with people and other agents; he continually monitors the state of the virtual world, always maintaining a plan for completing his current task, and revising the plan to handle unexpected events. This novel combination of capabilities makes Steve a unique substitute for human instructors and teammates when they are unavailable.

To support these capabilities, Steve consists of three main modules: perception, cognition, and motor control [5]. The perception module monitors messages from other software components, identifies relevant events, and maintains a snapshot of the state of the world. It tracks the following information: the simulation state (in terms of objects and their attributes), actions taken by students and other agents, the location of each student and agent, the objects within a student's field of view, and human and agent speech. The cognition module, implemented in Soar [11], interprets the input it receives from the perception module, chooses appropriate goals, constructs and executes plans to achieve those goals, and sends motor commands to the motor control module. The cognition module includes a wide variety of domain-independent capabilities, including planning, replanning, and plan execution; mixed-initiative dialogue; assessment of student actions; simple question answering; episodic memory; path planning; communication with teammates [6]; and control of the agent's body. The motor control module accepts the following types of commands: move to an object, point at an object, manipulate an object (about ten types of manipulation are currently supported), look at someone or something, change facial expression, nod or shake the head, and speak. The motor control module decomposes these motor commands into a sequence of lower-level messages that are sent to the other software components (simulator, graphics software, speech synthesizer, and other agents) to realize the desired effects.

Steve was designed to make it easy to connect him to new virtual worlds. First, the perception and motor control modules include two layers: an abstract layer, which deals with the types of information Steve needs to exchange with the other software components, and a virtual world interface layer, which maps the abstract layer to the messages used to communicate with a particular set of software components. Thus, one can connect Steve to a new virtual world implemented with new software components simply by rewriting the virtual world interface layer.
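This two-layer design can be pictured as an abstract command interface plus a world-specific adapter, as in the illustrative Java sketch below. The interface, class names, and message formats are ours, invented for illustration; they are not Steve's actual code, which is implemented on top of Soar.

// Illustrative sketch of the two-layer idea: an abstract layer stating what the agent
// needs to exchange with the world, and a world-specific adapter mapping it to the
// messages of one particular set of software components.
interface VirtualWorldInterface {
    void moveTo(String objectId);                    // abstract motor command
    void manipulate(String objectId, String action);
    void speak(String text);
}

// Adapter for one concrete virtual world; connecting to a new world means
// rewriting only this class, not the agent's perception/cognition code.
class ExampleWorldAdapter implements VirtualWorldInterface {
    @Override public void moveTo(String objectId) {
        send("MOVE " + objectId);                    // world-specific message format (hypothetical)
    }
    @Override public void manipulate(String objectId, String action) {
        send("ACT " + action + " " + objectId);
    }
    @Override public void speak(String text) {
        send("SAY " + text);
    }
    private void send(String message) {
        System.out.println("-> world: " + message);  // stand-in for a network socket
    }
}

public class TwoLayerDemo {
    public static void main(String[] args) {
        VirtualWorldInterface world = new ExampleWorldAdapter();
        world.moveTo("valve-3");
        world.manipulate("valve-3", "turn");
        world.speak("Now I turn the valve.");
    }
}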

Second, to allow Steve to operate in a variety of domains, his architecture has a clean separation between domain-independent capabilities and domain-specific knowledge. The code in the perception, cognition, and motor control modules provides a set of general capabilities that are independent of any particular domain. To allow Steve to operate in a new domain, a course author simply specifies the appropriate domain knowledge in a declarative language. The domain knowledge that Steve requires falls into two categories:

Perceptual Knowledge This knowledge tells Steve about the objects in the virtual world, their relevant simulator attributes, and their spatial properties. It resides in the perception module.

Task Knowledge This knowledge tells Steve about the procedures for accomplishing domain tasks and provides text fragments so that he can talk about them. It resides in the cognition module and is organized around a relatively standard hierarchical plan representation [5].

Finally, Steve’s cognition module has a library of actions he understands,including things like manipulating objects, moving objects, and checking thestate of objects. Each action in the library is implemented by a set of Soarproduction rules that tell Steve what sorts of messages to send out to perform the

Page 341: [Lecture Notes in Computer Science] Intelligent Virtual Agents Volume 2792 ||

328 Gonzalo Mendez, Jeff Rickel, and Angelica de Antonio

action and what sorts of messages to expect from the simulator to know whetherthe action succeeded. The library is organized in a hierarchy with inheritance sothat new actions can often be written by simply specializing existing (general)actions.

3 HeSPI

VRIMOR is a shared-cost project funded by the European Union whose aim has been to combine environmental laser-scanning technologies with human modelling and radiological dose estimating tools, and to deliver an intuitive and cost-effective system to be used by operators involved with human interventions in radiologically controlled areas [7].

HeSPI (Tool for Planning and Simulating Interventions) is a tool that has been developed in the VRIMOR project to serve as a means to plan and simulate maintenance tasks in a 3D environment reproducing a Nuclear Power Plant. This system has been built on top of Jack [9], a human simulation tool that provided us with the necessary mechanisms to import the scanned virtual environment (VE) and animate the virtual operators.

HeSPI provides users with two kinds of interfaces. One is the classic graphical interface, which is based on windows and is controlled using a mouse and a keyboard. This has been complemented with a voice recognition system that allows the user to interact with the system in a much faster way.

Once an operation is planned, the user can generate, as output, the trajectories of all the operators that have taken part in the intervention. These trajectories record the position of each operator's head, hands and chest during the operation, so they can be used by an external application to determine the radiation dose received by each operator during the intervention.

There are some other actions the user can perform, such as adding semantic information about the objects of the scene or creating new actions to complete the predefined library of activities that HeSPI offers for the operators.

HeSPI has a very simple architecture that makes it easy to communicate with other applications. Since one of the main objectives of the project was to test different user interfaces, we created an API to control all the actions that could be performed in the VE, so that different applications could make use of the API to communicate with the VE.

HeSPI has already been evaluated at the Almaraz NPP (Spain) and the results have been quite satisfactory, especially in those aspects concerning the voice-controlled tasks.

4 Integration

The integration of these two systems, Steve and HeSPI, took place during the second half of 2002. There are still some features that have to be fully tested, but in general the integration has been quite successful.


Fig. 1. Architecture for the integration

4.1 The Integration Process

As described earlier, there are several steps in connecting Steve to a new VE. The architecture for this integration can be seen in Fig. 1.

First of all, it was necessary to let Steve connect to the VE and have its own physical representation. HeSPI already provided the necessary functions to create a virtual operator using the GUI, so it was just a matter of letting Steve use them. Steve's virtual world interface layer made this easy.

As a second step, it was necessary to provide Steve with all the domain-specific knowledge he needed about the virtual world, i.e. the objects of the world and the tasks that had to be performed in the training process, so that he could have all the necessary information to act as a tutor. Steve's representation for task knowledge was sufficient to represent the tasks that had to be carried out inside the NPP, since procedures are so strict in NPPs that, in most cases, only one course of action is valid to complete a task.

The most logical way to continue the integration was to let Steve perform basic actions, where no interaction with the environment was required, so we chose to make him walk around the VE and look at different places. This involved having to modify the control flow in HeSPI, since it was designed to return the control to the user once the execution of a command had started, so that the user could keep on working while the command was being executed. However, Steve needs to know when a command (e.g., to walk to a new location) has finished executing. Thus, it was necessary to make HeSPI wait for the action to finish before returning the control to Steve, so that the system would work properly.
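The required control-flow change can be illustrated with a small synchronisation sketch: the VE reports completion asynchronously, and a wrapper blocks the caller until that notification arrives. This is only an illustration of the required behaviour, under assumed names; it is not HeSPI's code.

import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Sketch of making an asynchronous VE command behave synchronously for the agent.
public class BlockingCommand {

    // Starts a command and blocks until the VE signals completion (or a timeout expires).
    static boolean executeAndWait(Runnable startCommand, CountDownLatch finished,
                                  long timeoutSeconds) throws InterruptedException {
        startCommand.run();                                       // hand the command to the VE
        return finished.await(timeoutSeconds, TimeUnit.SECONDS);  // wait for the completion callback
    }

    public static void main(String[] args) throws InterruptedException {
        CountDownLatch finished = new CountDownLatch(1);

        // Simulated VE: performs the walk on its own thread and notifies when done.
        Runnable walkCommand = () -> new Thread(() -> {
            try { Thread.sleep(500); } catch (InterruptedException ignored) {}
            finished.countDown();                                 // completion callback
        }).start();

        boolean done = executeAndWait(walkCommand, finished, 5);
        System.out.println(done ? "walk finished, control returned to the agent"
                                : "no completion callback (possible failed action)");
    }
}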

The interaction with objects required a greater effort than the previous steps, since these interactions are, in many cases, domain dependent. On previous occasions, Steve had been used in static environments where he had to press buttons, move handles or turn valves. In this case, we were facing a more dynamic environment, where he had to move objects or use tools to mount and dismount some pipes.

To let Steve work in the NPP it was necessary to extend the existing action library in order to allow him to perform tasks such as picking up objects, which was defined in terms of an existing action, or dropping them in a specific location, which was defined from scratch.


At this point, Steve was already able to perform all the required actions, but he was not yet able to explain them to the student. We used IBM's ViaVoice as the text-to-speech tool to let Steve communicate with the student. It was easy to map Steve's required speech synthesis API, which is part of the virtual world interface layer, to the commands provided by ViaVoice.

Finally, Steve’s path planning requires the virtual world to be described as agraph, where the nodes are different locations in the VE and the edges establishthe possibility to go from one place to another [5]. This information was easy toprovide for the nuclear power plant.

4.2 Difficulties

There were undesired behaviours because both Steve and HeSPI performed redundant actions. When Steve has to work with an object, he first tries to approach it and then sends the command to work with the object. When HeSPI receives this command, it tries to make the mannequin approach the object and then commands him to work with it. Thus, the mannequin receives the command to walk towards the object twice and, due to Jack's management of positions and movements, this produces undesired results.

Another difficulty we found was that, for some actions, Steve needs information that external applications may not have. For example, when dropping an object, you need to tell Steve where you drop it. However, HeSPI does not need this information, since a mannequin drops his objects wherever he is at that moment. Thus, when a mannequin moved somewhere, it was necessary to store that destination in a temporary register so it could be used whenever Steve performed an action that required that location.

To grasp and manipulate objects, Jack requires more information about them than Steve's perceptual knowledge provides. Thus, Jack sometimes grasps things awkwardly. This is a problem in HeSPI too: users complained that it was too complicated to provide such information. Ultimately, an extension to Steve or HeSPI will need to compute such information automatically.

Steve has been used in environments where he could not change objects' locations. However, in an NPP, many tasks involve moving objects from one place to another. As both Steve and the student have their own bodies, if Steve started explaining an action where he had to move an object and the student wanted to finish that action, the student would have to take the object from Steve's hands. Then, if the student needed some help to finish the task, Steve would have to take the object again, which would complicate the process. A possible solution would be for Steve and the student to share the same body, but it would cause two new problems. First, the VE would have to allow different users to manipulate the same mannequin, and if that were possible, then Steve and the student might try to manipulate the mannequin at the same time. Thus, the types of actions in a domain may constrain the times at which Steve and a student can switch control.

Steve assumes that once he commands an action, it will be possible to perform it. If that is not the case, because, for example, there is an unexpected obstacle in the VE that does not allow Jack to keep on moving towards the object Steve wants to manipulate, there are two possible courses of action. If we decide not to send feedback to Steve until the action has finished, he will be waiting for a callback that may never arrive. If we return a value saying that the action could not be performed, he will keep on trying to carry it out, because his plan says that it is the right action to perform. Thus, Steve needs extensions that support more general reasoning about failed actions.

5 Future Work

HeSPI has been tested with an intervention where a team of operators have to change a filter that is kept inside a pipe. We have used this operation to test the integration with Steve, so the number of actions that Steve has had to perform is still somewhat limited. Thus, one of the first things that must be done in the near future is to test both HeSPI and Steve with different operations in order to see if Steve's action library is good enough for these and other environments.

In addition, the integration has been completed for a single tutor teaching a single student, but further work is needed in order to test whether teamwork can also be supported, so that different Steve agents can perform the role of a tutor or a teammate in HeSPI. Steve is already able to support teamwork, and there should not be much trouble in integrating this functionality with HeSPI, since tutors and teammates would be treated as different users.

One last thing that could be achieved is the generation of Steve's domain-specific knowledge using HeSPI's planner and semantic information about the world. As far as we have been able to see, not all the necessary information could be generated, but a fairly complete skeleton could be created.

6 Conclusions

We have tested both Steve and HeSPI in order to see how easy it might be to integrate an intelligent tutor and a VE developed independently. We have shown that Steve is flexible enough to be plugged into a different application for training in new domains. In addition, we have shown that HeSPI has been designed in such a way that it has been easy to convert it into a training tool.

However, some issues have arisen that require further consideration. For the most part, they have to do with the responsibilities each application must assume and with the information each system must provide to external systems and can expect to receive from other applications. For example, the responsibility of deciding whether to approach an object before manipulating it should be assumed by the agent, whereas the VE should only check whether the action is physically feasible. In addition, the VE should be able to provide complete information about the state of the environment and the effects of any event that may occur. Actions should be parameterized, and the agents should be able to set these parameters or let them take default values (e.g. walk expressing a certain mood or just walk normally).


As can be seen, a substantial effort must still be devoted to this standardization process so that, in the near future, it may be possible to easily connect any agent to any VE.

Acknowledgements. Steve was funded by the Office of Naval Research under grant N00014-95-C-0179 and AASERT grant N00014-97-1-0598. HeSPI has been funded by the EU through the VRIMOR project under contract FIKS-CT-2000-00114. This research has been funded by the Information Sciences Institute - University of Southern California, the Spanish Ministry of Education under grant AP2000-1672, and the Spanish Ministry of Science and Technology through the MAEVIF project under contract TIC2000-1346.

References

[1] Johnson, W.L., Rickel, J.W., Lester, J.C.: Animated pedagogical agents: Face-to-face interaction in interactive learning environments. International Journal of Artificial Intelligence in Education 11 (2000) 47–78

[2] Marsella, S.C., Johnson, W.L., LaBore, C.: Interactive pedagogical drama. In: Proceedings of the Fourth International Conference on Autonomous Agents, New York, ACM Press (2000) 301–308

[3] Andre, E., Rist, T., van Mulken, S., Klesen, M., Baldes, S.: The automated design of believable dialogues for animated presentation teams. In Cassell, J., Sullivan, J., Prevost, S., Churchill, E., eds.: Embodied Conversational Agents. MIT Press, Cambridge, MA (2000)

[4] Cassell, J., Bickmore, T., Campbell, L., Vilhjalmsson, H., Yan, H.: Conversation as a system framework: Designing embodied conversational agents. In Cassell, J., Sullivan, J., Prevost, S., Churchill, E., eds.: Embodied Conversational Agents. MIT Press, Cambridge, MA (2000)

[5] Rickel, J., Johnson, W.L.: Animated agents for procedural training in virtual reality: Perception, cognition, and motor control. Applied Artificial Intelligence 13 (1999) 343–382

[6] Rickel, J., Johnson, W.L.: Extending virtual humans to support team training in virtual reality. In Lakemayer, G., Nebel, B., eds.: Exploring Artificial Intelligence in the New Millennium. Morgan Kaufmann, San Francisco (2002) 217–238

[7] de Antonio, A., Ferre, X., Ramírez, J.: Combining virtual reality with an easy to use and learn interface in a tool for planning and simulating interventions in radiologically controlled areas. In: 10th International Conference on Human-Computer Interaction, HCI 2003, Crete, Greece (2003)

[8] Rickel, J., Marsella, S., Gratch, J., Hill, R., Traum, D., Swartout, W.: Toward a new generation of virtual humans for interactive experiences. IEEE Intelligent Systems 17 (2002) 32–38

[9] Badler, N.I., Phillips, C.B., Webber, B.L.: Simulating Humans. Oxford University Press, New York (1993)

[10] Mendez, G., de Antonio, A., Herrero, P.: Prvir: An integration between an intelligent tutoring system and a virtual environment. In: SCI2001. Volume VIII., Orlando, FL, IIIS, IEEE Computer Society (2001) 175–180

[11] Laird, J.E., Newell, A., Rosenbloom, P.S.: Soar: An architecture for general intelligence. Artificial Intelligence 33 (1987) 1–64


Machiavellian Characters and the Edutainment Paradox

Daniel Sobral1, Isabel Machado1, and Ana Paiva2

1 INESC-ID, Rua Alves Redol 9, 1000 Lisboa, Portugal
{daniel.sobral, isabel.machado}@gaips.inesc.pt
2 IST - Technical University of Lisbon, Av. Rovisco Pais 1, P-1049 Lisboa, Portugal
[email protected]

Abstract. Purely script-based approaches to building interactive narratives often have limited interaction capabilities, where variability demands exponential authoring work. This is why Intelligent Virtual Agents (IVAs) are a transparent technique for handling user interaction in interactive narrative systems. However, it is hard to predict any sense of educational purpose in the global behavior of a group of IVAs if no script or control is given. Efforts have been channelled into achieving such control, but are yet to achieve truly satisfactory results. These efforts are usually based on a direct connection between the control and the IVA architecture, which is a source of exponential complication. We propose a system that, based on a common ontology, can flexibly support the human authoring of educational goals independently of any specific IVA architecture. This is done by having a stage manager that follows an episode-based narrative where each episode is only specified through a set of properties and conditions that set the context for the characters. Although the characters act as they please, this contextualization limits their range of action, thereby facilitating the achievement of dramatic and educational goals.

1 Introduction

Recent findings from studies with primates [1] seem to corroborate that narrative communication skills appeared early in man's evolution, making it unquestionable that narrative plays an important role in the social development of the individual. Traditional narratives are usually represented by a single thread of action describing the plot. This sequence of events is the fundamental source of emotional engagement, and a great deal of artistic (and often scientific) knowledge is necessary to successfully achieve a good sequence.

However, with the emergence of interactive narratives, such a single, linear structure was challenged, and other types of narrative constructs were proposed. One of them is the approach in which a narrative may emerge from the actions of autonomous characters that, in a play, act to achieve a goal, which is the play itself. Moreover, the role of the spectator changes and becomes more active in taking decisions on what may happen next in the story.

The need for emotional engagement and an active presence of the spectator leads to what is described as the Narrative Paradox. To keep the audience's interest


one must let them participate and get involved. At the same time, the author intends to transmit a message, which limits what the audience should perceive and do. This effect is particularly severe in educational contexts, where a specific author-established purpose should be apprehended by the audience. Therefore, the Edutainment Paradox is an even more intricate version of the Narrative Paradox, where education and entertainment are in constant tension.

Systems that represent and manipulate story elements at a higher level and try to select the sequences of events defining the narrative are described as plot-based approaches. Some such systems use tree (or graph) structures, achieving variability through the expansion and navigation of these structures. This variability comes at an exponential cost for the creation of the structures (which are usually human-made). Other systems use structuralist narrative knowledge to define the structures and the navigation that creates the narrative. Nevertheless, as Szilas [6] has already noticed, such models are also ineffective in responding to the user's needs for interaction.

Intelligent Virtual Agents (IVAs) provide flexible support for coherently handling user interaction. This ability stems from a set of parameters that define the agents by controlling their behavior (e.g., personality, emotion, goals). Systems that use virtual agents are defined as character-based approaches, originating what is generally described as emergent narratives. While each agent is coherent and effective, the global behavior of a group of agents is hardly predictable (in that sense, it is emergent). The number of parameters usually associated with each agent makes it very hard to attain an emergent narrative that fulfils a global educational purpose [5] [6]. Therefore we feel that IVAs (specifically the ones that represent characters) are Machiavellian in the sense that, although they are able to coherently react to the user's actions, they are generally unaware of the impact of their actions on the user's educational needs. Although unintentionally, they do not respect the system's morals, its global purpose.

Therefore, in certain contexts, namely educational ones, we clearly identify the need for high-level control over the global behavior of a group of IVAs. In Section 2 we describe the context in which our work is being applied. In Section 3 we discuss some related works and the problems they are facing. In Section 4 we propose a different approach to obviate such problems. Then we present some results from the early steps of the implementation (Section 5) and, finally, we derive some conclusions and refer to future work.

2 Context

Before going into the description of our approach to handling the narrative paradox, we will briefly describe the context in which it is applied so that the examples can be better understood.


The major goal of the VICTEC project3 is to apply synthetic characters and emergent narrative to social education for children aged 8-12. By focusing on the issue of bullying and building empathy between child and character, we aim not only at creating a novel experience for the children, but mostly at attaining some educational impact on the bullying problem that affects many countries. The final product of the project will consist of a real-time 3D virtual environment which will give students the possibility to be face to face with bullying situations and to try to help the victims through the interactive drama we are creating. Each session with the user will consist of a sequence of episodes, where each episode depicts a certain dramatic situation in the bullying context. Between episodes, the user will enter an Introspection Phase, where he or she evaluates the situation and suggests a possible course of action for the character (the victim of bullying) the user is helping.

After each episode has terminated successfully, the system will pass to the Introspection Phase. Within this phase we aim to collect some information about what the child feels about the story, specifically about the previous episode, and also about what is happening to his/her friend - the main character. We think that the most efficient way to gather this kind of information is through an open dialogue between the child, who is interacting with the application, and the protagonist of the story. This dialogue-based interaction is a very complex subject by itself and will not be further discussed in this paper.

3 Related Work

In this section we succinctly describe the approaches of a few related works that face problems similar to those we are facing in our context.

Perhaps the most important work in this area, where narrative and education mingle, is Carmen's Bright IDEAS (CBI) [4], a 2D educational system which relies deeply on a dialogue between two characters, Carmen and a mother. There, the user has a third-person perspective and interaction is limited to specific decision points (which comes naturally in a dialogue-based system). Furthermore, CBI's plot structure is based on an explicit graph, demanding an exploding effort from the author. Moreover, the plot-conduction role is performed by one of the characters, making its development dependent on the specific character architecture, with the parameter-tweaking consequences mentioned earlier.

Differently, Teatrix [3], like our system, uses a 3D environment, which faces different problems from the 2D dialogue-based CBI. There, the ability to inspect the characters' minds was a very important aspect of the system, since it conveyed a purpose and increased the sense of immersive interaction. In Teatrix, the authors concluded that there was a need for a Hot Seating, a special place where the child could see the agent's emotional evolution and try to explain it through the events and actions, relating it to the character's personality and role. This notion of a

3 http://www.victec.org. Project funded by the EU Framework 5 programme.


role within a domain (fairy tales in this case) is the major narrative-conducting concept in Teatrix.

Most plot-based narrative systems divide the story into pieces and use explicit links to connect them, which imposes an exploding amount of effort on the author. Facade [5] provides an alternative, as it divides the story into a set of independent pieces (without explicit links), with the story defined by a sequencing policy. Its author did not have an educational purpose in mind, but the same principle can be applied to define a policy that tries to pursue educational goals. Nevertheless, although this sequencing method allows for greater flexibility, each story piece in Facade represents a specific set of behaviors the characters can follow in that piece of the narrative. This may provide the best results, but it imposes an excessive burden on the author.

One of the problems for plot-based systems resides in the level of abstraction at which the plot should be managed [2]. We suggest the author should avoid direct control of the characters. As long as sufficient ontological support is given, the author can assume they will perform according to certain patterns of behavior. This requirement was the major influence in the development of the architecture described in the following section.

4 Architecture

The 3D virtual environment system that generates the interactive stories has an architecture that is schematized in Figure 1.

Fig. 1. The System’s Architecture

The World Model is the central Knowledge Source for the system. It keeps the Narrative Structure and information about the current state of the virtual environment. This component also provides for the communication between the other modules (generically referred to as Agents). The two most important Agents of this architecture are the Characters, which play the story, and the Stage Manager, which guarantees the achievement of educational goals. Finally, we have included in the architecture the notion of a View Manager.


Its main function is the translation of actions performed by other modules into graphic calls to a specific Display Engine, thereby concealing some of the visualization issues from the rest of the system. Furthermore, this component also transmits to the system the user's interaction with the displayed environment.
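
A minimal sketch of how such components might be wired together is given below; all class names, method signatures, and the message format are illustrative assumptions, not the interfaces of the actual system.

    class WorldModel:
        """Central knowledge source: keeps the narrative structure and world state,
        and relays messages between the other modules (the Agents)."""
        def __init__(self, narrative):
            self.narrative = narrative
            self.state = {}
            self.agents = []
        def register(self, agent):
            self.agents.append(agent)
        def broadcast(self, sender, message):
            for agent in self.agents:
                if agent is not sender:
                    agent.receive(message)

    class Character:
        """Acts autonomously within the context set by the current episode."""
        def __init__(self, role):
            self.role = role
        def receive(self, message):
            print(f"{self.role} saw: {message}")

    class StageManager:
        """Watches the story and selects episodes towards the educational goals."""
        def receive(self, message):
            print("StageManager records:", message)

    class ViewManager:
        """Maps abstract actions onto display-engine calls and forwards user input."""
        def receive(self, message):
            print("ViewManager animates:", message)

    world = WorldModel(narrative={"domain": "bullying"})
    for module in (Character("Victim"), Character("Bully"), StageManager(), ViewManager()):
        world.register(module)
    world.broadcast(sender=None, message=("Insult", "Bully", "Victim"))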

4.1 Narrative Structure

In our system the Narrative Structure necessary for the generation of interactive stories includes a Domain (e.g. Bullying), a set of Characters and the Plot.

The Domain defines an Ontology that all components must support in order for the system to work. This is an essential process, carried out iteratively by extracting information from the experts (in our case, experts in bullying). It provides a common vocabulary and enables a better separation between the components. This vocabulary includes a set of communication acts, concepts related to the Characters' roles, and concepts related to the Episodes. Communication acts are used to transmit information between the Agents. Examples of communication acts specific to this domain are Hit, Ask to be Friend, Tease, Insult, and Ask for Help.

Typical characters that are usually present in bullying environments are Victims, Bullies, Helpers, and Neutrals. These roles have specific semantics which the author uses in defining the story. For example, the Bully should usually Hit or Insult the Victim when they are alone. If there is a Helper around, there is an even greater probability of that happening, and the impact of the bully's action should be worse for the victim. Nevertheless, these are just assumptions that are related to ontological knowledge about the concepts in the Domain. Each particular Character must implement the semantics of its role in the story.

The ontology also defines some concepts that are useful for characterizing episodes. For example, episodes can represent Conflicts between characters, offer some Empowerment opportunities for the main character to deal with problems, and can also represent situations where the Resolution of the bullying situation can happen.
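
The fragment below sketches how such a domain ontology could be written down in code; the enumeration names simply transcribe the concepts listed above, and the representation as Python enums is an assumption made for illustration only.

    from enum import Enum, auto

    class Act(Enum):
        """Communication acts named for the bullying domain."""
        HIT = auto()
        INSULT = auto()
        TEASE = auto()
        ASK_TO_BE_FRIEND = auto()
        MAKE_FRIEND = auto()
        ASK_FOR_HELP = auto()

    class Role(Enum):
        """Character roles typically present in bullying environments."""
        VICTIM = auto()
        BULLY = auto()
        HELPER = auto()
        NEUTRAL = auto()

    class EpisodeType(Enum):
        """Concepts used to annotate episodes."""
        CONFLICT = auto()
        EMPOWERMENT = auto()
        RESOLUTION = auto()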

The Plot is the human-authored description of the story. It includes the User's Needs [6] and a set of Episodes [5].

The user's needs represent a set of properties that describe dramatic and educational requirements. A generic dramatic requirement is that there must be at least one action every n seconds. An educational requirement can be that, in a bullying situation, the author intends the child (user) to advise the victim (the character he/she is helping) to ask for help when suffering from abuse.

Episodes represent contexts for the characters. As we have seen, this makes no assumption about the characters' architecture, but it does assume a certain behavior pattern. For example, an episode could be described as follows: “John (the victim) and his new friend are in the schoolyard, and in front of them is Luke (the bully). Some other children (neutrals) are also there”. This episode is annotated as a possible resolution situation because, although we expect the bully to hit or insult the victim (which is a conflict situation), the presence of a friend may provide a solution to the bullying situation.


Episodes are not explicitly linked. Instead, they contain information that can be used by the Stage Manager in a selection process. This information includes a set of Pre-conditions, Contextual information, and Post-conditions.

The Pre-conditions of each episode allow for an important pre-filtering of the available episodes. For example, if we want the user to advise the victim to ask for help from someone he/she trusts, the victim must have previously met a friend (the act Make Friend must have been previously performed by the victim). Therefore, a resolution episode where the victim has the possibility of asking for help should only be available after he/she has made some friends.

Contextual information is a set of properties and logical propositions that set the context in which the characters will perform. For example, it will indicate what type of episode this is (e.g., conflict), which characters are present (namely, which roles are present), what the episode's location is, and several other factors that may be relevant to determine the episode's expected results.

Post-conditions represent probable outcomes of the episode, assuming the characters behave coherently within the domain. They also work as termination conditions, used by the Stage Manager to determine when the episode has ended. These conditions are important due to the lack of a formal definition for episodes. They give the author the power to define (with more or less detail) what his expectations for the episodes are. For example, one of the termination conditions in a resolution episode can be Hit(Bully, Victim). If no post-conditions are given, generic knowledge is used by the stage manager to determine the ending of the episode (e.g., when nothing happens for some time).
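
A possible encoding of this episode information is sketched below; the field names are illustrative, plain strings stand in for the ontology terms so the snippet is self-contained, and the concrete values merely transcribe the schoolyard example from the text.

    from dataclasses import dataclass, field

    @dataclass
    class Episode:
        """An episode carries no explicit links, only information for the selection process."""
        name: str
        episode_type: str                                    # e.g. "conflict" or "resolution"
        preconditions: list = field(default_factory=list)    # acts that must already have occurred
        context: dict = field(default_factory=dict)          # roles present, location, ...
        postconditions: list = field(default_factory=list)   # probable outcomes / termination tests

    # The schoolyard episode described above, annotated as a possible resolution:
    schoolyard = Episode(
        name="schoolyard",
        episode_type="resolution",
        preconditions=["make_friend(victim)"],
        context={"roles": ["victim", "helper", "bully", "neutral"], "location": "schoolyard"},
        postconditions=["hit(bully, victim)", "ask_for_help(victim, helper)"],
    )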

This information, together with the domain knowledge, is used by the stage manager to infer a measure of each episode's contribution to the global goal. Further details of this process are explained in the following section.

4.2 Narrative Navigation

The Stage Manager uses the information available in the Plot to dynamically decide, at each moment, which is the best episode to display next (much like Facade [5]).

In a first step, pre-conditions are checked in order to pre-filter the episodes. For example, at the beginning, resolution episodes are not available because no act determining a conflict has occurred yet.

The contextual information and post-conditions of the filtered episodes are then used to assess their potential to achieve the global goal. For this, simple logical inference is performed to determine which episodes provide the necessary elements. For example, to choose the first episode, the Stage Manager will try to satisfy the need for conflict; in this case, only conflict episodes are chosen. To sort between similar episodes, a probability estimation is performed. For example, the probability associated with the occurrence of bullying actions (denoting a conflict) is greater when some helpers are present to encourage the bully, making such an episode more likely to be chosen.

This process is repeated when the current episode is finished, which is determined either by checking the post-conditions or through internal rules. Finally,


the interaction will end when the goal is reached or when there are no more available episodes.
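
The selection loop might look roughly like the following sketch; the data layout, the "offers" field, and the probability values are invented stand-ins for the author-supplied contextual information and probability estimates described above.

    def select_episode(episodes, history, need):
        """Hypothetical two-step selection: pre-filter on preconditions, then rank candidates."""
        # Step 1: an episode is available only if all its preconditions appear in the history.
        available = [e for e in episodes if all(p in history for p in e["preconditions"])]
        # Step 2: keep episodes whose annotated outcomes can satisfy the current need,
        # then prefer the one whose outcome is judged most probable in this context.
        candidates = [e for e in available if need in e["offers"]]
        return max(candidates, key=lambda e: e["probability"]) if candidates else None

    episodes = [
        {"name": "classroom", "preconditions": [], "offers": {"conflict"}, "probability": 0.8},
        {"name": "schoolyard", "preconditions": ["make_friend"], "offers": {"resolution"},
         "probability": 0.6},
    ]

    history, agenda = [], ["conflict", "resolution"]     # simplified stand-in for the user's needs
    for need in agenda:
        episode = select_episode(episodes, history, need)
        if episode is None:
            break                                        # no more suitable episodes: story ends
        print("playing episode:", episode["name"])
        history.append("make_friend")                    # stand-in for acts observed while playing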

5 First Results

A first prototype has been created with a 3D virtual interface (Figure 2) which tests the visualization process by simulating one sample episode, depicting a situation created by bullying specialists from the VICTEC team. These simple simulations are important for evaluating some aspects of the architecture, the adequacy of the narrative approach, character design, interaction facilities, and usability. Furthermore, the prototype provides a proof of concept for the usefulness of this technology in the educational context.

Fig. 2. The first prototype

The prototype's Display System runs on a web page embedding the WildTangent (WT) plugin and an applet which communicates with the plugin. Geometry and animations for scenarios, props, and characters were all created in 3D Max and exported to the WT proprietary format; they are then loaded and sent to the WT window in the browser. The choice of a 3D game engine was due to its integrated support for graphical and audio display. Furthermore, it supports specific functionalities (such as animation rendering and blending, collision detection, and object picking) which are useful for quickly handling character behavior and user interaction, and it enabled us to rapidly build a quite powerful prototype.

An evaluation of the prototype was conducted at a conference that gathered a diverse audience to discuss bullying in schools. This study concluded that the system has great potential for exploring bullying issues. More details about this study can be found in [7]. Further studies were also performed in Portuguese schools. Although the latter data still needs to be analyzed, the first reactions were very encouraging: the children showed clear enthusiasm for the use of these technologies. This serves as a good indicator of the system's potential, not only in the


specific case of bullying, but also in the general case of an educational interactive system.

6 Conclusions and Future Research

As discussed in Section 3, we believe that an educational system must support a flexible authoring process, not imposing direct control over the characters. The first results obtained show that, provided an appropriate common domain is defined, the architecture developed allows us to work at an appropriate abstraction level.

This will enable us to thoroughly explore the narrative control system with scripted characters, which will bring greater confidence in later stages of the application, when truly autonomous agents (which are being developed in parallel) are included. Moreover, although the user's interactive capabilities are currently mostly limited to the Introspection Phase, this agent-based framework will enable us to smoothly introduce further interaction potential into the episodes, exploring the notion of an invisible advisor that we intend for our system.

7 Acknowledgements

Thanks to all the partners in the VICTEC project for their comments and criticisms, in particular to Ruth Aylett.

References

1. Dautenhahn, K.: Stories of Lemurs and Robots - The Social Origin of Story-Telling. In: Narrative Intelligence, Michael Mateas and Phoebe Sengers (eds.), John Benjamins Publishing Company, 2003

2. Louchart, S. and Aylett, R.: Narrative Theory and Emergent Interactive Narrative. In: Proceedings of the 2nd International Workshop on Narrative and Interactive Learning Environments, 6th-9th August, Edinburgh, Scotland, 2002

3. Machado, I., Paiva, A. and Prada, R.: Is the wolf angry or just hungry? Inspecting, modifying and sharing Characters' Minds. In: Proceedings of the International Conference on Autonomous Agents, ACM Press, 2001

4. Marsella, S., Johnson, W. and LaBore, C.: Interactive Pedagogical Drama. In: Proceedings of the Fourth International Conference on Autonomous Agents, 301-308, 2000

5. Mateas, M.: Interactive Drama, Art, and Artificial Intelligence. PhD Thesis, Technical Report CMU-CS-02-206, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, December 2002

6. Szilas, N.: IDtension: a narrative engine for Interactive Drama. In: Proceedings of the 1st International Conference on Technologies for Interactive Digital Storytelling and Entertainment (TIDSE 2003), March 24-26, Darmstadt (Germany), 2003

7. Woods, S. et al.: What's Going On? Investigating Bullying using Animated Characters. IVA 2003


Multimodal Training Between Agents

Matthias Rehm

Multimedia Concepts and Applications, Institute of Computer Science, University of Augsburg
D-86153 Augsburg, Germany
[email protected]

Abstract. In the system Locator1, agents are treated as individual and autonomous subjects that are able to adapt to heterogeneous user groups. Using multimodal information from their surroundings (visual and linguistic), they acquire the concepts necessary for successful interaction. This approach has proven successful in a domain that exhibits a remarkable variety of possible (often language-specific) structurings: the spatial domain. In this paper, a further development is described (Locator2) that allows for agent-agent interactions in which an agent is instructed by another agent that plays the role of a teacher.

1 Introduction

Current literature on interface agents treats the feature of embodiment in one of two ways: either as a matter of design, or as a possibility to convey more information to the user by the use of gestures or facial expressions. But embodiment may go a step further, regarding the agent as an individual entity with specific sensory abilities and specific ways of interacting in and with its environment (e.g., [5]). In this paper, agents are introduced that learn relevant concepts through multimodal input and even pass this knowledge on to other agents in their surroundings.

2 Related Work

The work of Cassell and colleagues on the Rea agent ([6]) is a good example of embodiment as a means to more communication channels. Rea, short for Real-Estate Agent, is a multimodal conversational interface agent that shows users around virtual houses. An important feature of this approach is the use of the agent's body to convey certain communicative functions like turn-taking or initiation. Another interesting approach combines embodiment and the modeling of emotions. Andre, Rist, and colleagues ([1], [7]) describe sales presentation teams that use their bodies to convey certain personality traits like introvert vs. extrovert.

1 The original work on Locator was funded by the German Research Foundation (DFG) in the framework of the graduate program Task-Oriented Communication.


Fig. 1. A snapshot of Locator2. The agent in front is the teacher, who waits for the learner to catch up.

Few approaches are concerned with how an (embodied) interface learns on the basis of its perceptions. Billard and Hayes present robotic agents that learn object concepts on the basis of their sensory input and that may even teach each other labels for these concepts ([2]). The concepts are formed without taking the labels into account. Thus, they play language games similar to those described in [10]. In Locator, the linguistic input is part of the concept formation process because it gives a positive example of a possible structuring of the perceivable reality.

3 Teaching an Agent

Locator is a testbed for embodied agents that learn via multimodal input. Virtual anthropomorphic agents move around in a virtual, complex world (see Fig. 1). Different sensors allow them to receive two kinds of input during their exploration: visual and linguistic. The agents acquire individual concepts depending on their embodiment (i.e., their specific sensory equipment) and the specific situations they encounter. The linguistic input describes spatial relations between objects in one of two frames of spatial reference ([3], [8]): either in German (relative frame of reference) or in Marquesan2 (absolute frame of reference). Relative and absolute frames of reference have different logical implications concerning the standpoint and orientation of the speaker. Relative frames of reference make use of reference axes that are anchored in the speaker, whereas absolute frames of reference are anchored in the environment and consequently are unaffected by the spatial orientation of the speaker. Thus, it is not possible to map utterances from one language into the other without knowing the exact context in which the utterance was produced.

2 Marquesan speakers employ a directed axis sea – land (tai – uta) and an undirected cross axis (ko) (see [4] for details on Marquesan).
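
The difference between the two frames of reference, and why a context-free mapping between them is impossible, can be illustrated with the small sketch below; the geometric formalization, the seaward direction, and the example coordinates are simplifying assumptions, not the system's actual representation.

    def german_relation(speaker, figure, ground):
        """Relative frame: left/right of the ground object as seen from the speaker."""
        vx, vy = ground[0] - speaker[0], ground[1] - speaker[1]    # viewing direction
        fx, fy = figure[0] - ground[0], figure[1] - ground[1]      # offset of the figure
        return "links von" if vx * fy - vy * fx > 0 else "rechts von"

    def marquesan_relation(figure, ground, seaward=(0.0, -1.0)):
        """Absolute frame: seaward (tai) or landward (uta) of the ground, speaker-independent."""
        fx, fy = figure[0] - ground[0], figure[1] - ground[1]
        return "tai" if fx * seaward[0] + fy * seaward[1] > 0 else "uta"

    ground, figure = (0.0, 0.0), (1.0, -1.0)
    # The relative description flips when the speaker stands on the other side of the scene ...
    print(german_relation(speaker=(0.0, -5.0), figure=figure, ground=ground))   # rechts von
    print(german_relation(speaker=(0.0, 5.0), figure=figure, ground=ground))    # links von
    # ... while the absolute description stays the same.
    print(marquesan_relation(figure, ground))                                   # tai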


Fig. 2. Overview of the coordination behavior between teacher and learner.

Steels ([9]) has proposed discrimination games to model the process of situated acquisition of object concepts. An initial feature detector exists for each possible feature. This initial detector ranges over the whole value range of the corresponding feature; consequently, it is always activated whenever a value for this feature is encountered. During the process of concept formation, the initial feature detectors are elaborated, resulting in a number of different discrimination nets, one for each perceptual feature. Each node is itself a feature detector that corresponds to part of the value range of the given perceptual feature and is activated if a value falls into this range. In Locator, this approach was modified and extended to take linguistic input into account. In Steels' approach, the pressure to modify an existing discrimination net results from the assumption that identical objects do not exist; thus, a distinctive set of features can be determined for every object. Discrimination nets are built up only from the information supplied by the available sensors, without taking linguistic input into account. In Locator, the linguistic input realizes a generally accepted way of structuring the spatial domain and is given as a positive example to the learner. Receiving a linguistic input triggers a categorization attempt and, if this fails, a learning step. The success of this categorization attempt is measured. The visual and linguistic input activate concepts that represent the joint meaning of the different types of input. If a single concept is activated, the categorization attempt is successful.
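
The following sketch illustrates the flavour of this modified discrimination-game learning; the interval-splitting rule, the thresholds, and the data layout are simplifying assumptions and not Locator's actual algorithm.

    from dataclasses import dataclass

    @dataclass
    class Detector:
        """A node of a discrimination net: responds to part of one feature's value range."""
        feature: str
        lo: float
        hi: float
        def activated_by(self, percept):
            return self.lo <= percept[self.feature] < self.hi
        def split(self):
            mid = (self.lo + self.hi) / 2.0
            return [Detector(self.feature, self.lo, mid), Detector(self.feature, mid, self.hi)]

    concepts = {}   # word -> detector: the joint meaning of linguistic and visual input

    def categorize(word, percept, detectors):
        """Categorization attempt: succeeds iff exactly the uttered word's concept is activated."""
        active = [w for w, d in concepts.items() if d.activated_by(percept)]
        if active == [word]:
            return True
        # Learning step: refine the net and attach the word to a matching detector.
        for d in list(detectors):
            if d.activated_by(percept) and (d.hi - d.lo) > 0.1:
                detectors.remove(d)
                detectors.extend(d.split())
        for d in detectors:
            if d.activated_by(percept):
                concepts[word] = d
                break
        return False

    detectors = [Detector("angle", 0.0, 360.0)]               # initial detector: whole value range
    print(categorize("links", {"angle": 250.0}, detectors))   # fails, triggers a learning step
    print(categorize("links", {"angle": 250.0}, detectors))   # now succeeds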

4 Coordinating Teacher and Learner

To achieve successful communication, teacher and learner have to coordinate themselves. Because the sensory abilities of the agents are limited, this can happen only by linguistic or visually perceivable means. The coordination behavior consists of several interaction steps between the agents (Fig. 2). i.) The teacher


Analysis of co-variance, Locator:
              df      Fsucc   Ferr
  Agents      4,89    0.77    0.55
  Relations   1,89    0.24    1.08
  A x R       4,89    0.35    0.81
  (p < 0.05*, p < 0.01**)

Analysis of co-variance, Locator2:
              df      Fsucc   Ferr
  Agents      9,179   0.15    0.23
  Relations   1,179   0.02    6.66**
  A x R       9,179   0.19    0.73
  (p < 0.05*, p < 0.01**)

(Mean categorization success plots not reproduced.)

Fig. 3. Comparing results for Locator (left) and Locator2 (right): analysis of co-variance (above) and mean categorization success (below).

chooses a position which is suited for a spatial description. The autonomous exploration behavior of the agents comes into play here. Then the teacher notifies the learner that she is ready and waiting for him. ii.) Receiving this message, the learner passes it to his natural language interface to analyze it. He tries to find the teacher employing his visual sensor. If he sees the teacher, he starts walking in her direction. Otherwise he starts looking for the teacher by walking around in circles. Having reached the teacher, the learner orients himself and notifies her. iii.) When the teacher has analyzed this message, she starts generating an utterance that describes the currently visible scene. This task corresponds to a user-agent interaction where the user requests the agent to describe the scene, i.e., neither the figure nor the ground object is known; they have to be determined by the teacher. The generated utterance is transmitted to the learner. iv.) The learner tries to categorize the multimodal input. If this process fails, a learning step becomes necessary, resulting in the creation of a new concept, in the modification of an existing concept, or in initiating the modification of existing feature detectors. Finally, the teacher is notified that a new round may begin.

Figure 1 shows a snapshot of the coordination behavior of the two agents. The teacher (in front) has already reached a position to generate a spatial description and has notified the learner that she is waiting. The learner is in a convenient position as he can see the teacher. The coordination of teacher and learner as well as the instruction of the learner are based on the sensory abilities of the agents and thus depend on their specific embodiment.
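
Steps i.)-iv.) can be summarized as a simple message-passing loop. The sketch below uses invented stub classes and an invented example utterance; it stands in for the actual sensor-based implementation only schematically.

    class Teacher:
        """Stub teacher: picks positions, generates spatial descriptions, receives notifications."""
        def choose_position(self):
            print("teacher: choosing a position suited for a spatial description")
        def describe_scene(self):
            print("teacher: generating an utterance for the visible scene")
            return "Der Ball liegt links vom Baum"    # invented example utterance
        def receive(self, message):
            print("teacher: received", repr(message))

    class Learner:
        """Stub learner: joins the teacher, then categorizes or learns from the utterance."""
        def __init__(self):
            self.known = set()
        def find_and_join(self):
            print("learner: looking for the teacher and walking towards her")
        def categorize(self, utterance):
            return utterance in self.known            # succeeds only for already-known input
        def learn(self, utterance):
            print("learner: learning step for", repr(utterance))
            self.known.add(utterance)

    def training_round(teacher, learner):
        """One round of the coordination behaviour, following steps i.-iv. above."""
        teacher.choose_position()                     # i.   pick a position ...
        learner.receive("teacher ready")              #      ... and notify the learner
        learner.find_and_join()                       # ii.  learner locates the teacher and orients
        teacher.receive("learner ready")
        utterance = teacher.describe_scene()          # iii. describe the scene and transmit it
        if not learner.categorize(utterance):         # iv.  categorization attempt ...
            learner.learn(utterance)                  #      ... or a learning step on failure
        teacher.receive("round finished")             #      a new round may begin

    training_round(Teacher(), Learner())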


5 Results

In a first simulation, a group of ten agents had to learn the left/right dichotomy. They are instructed by a teacher agent that has learned this dichotomy from the user. A single agent autonomously explores its environment, i.e., it follows a random path through its environment based on local behaviors to avoid collisions with objects. From time to time the teacher describes the spatial arrangement, which the learner perceives with his visual sensor. The teacher's input triggers a categorization attempt and, if this fails, a learning step. The same kind of simulation was conducted in the original Locator system, which allows for a comparison of the results. Each agent is confronted with 1600 German utterances that realize the relations rechts (right) and links (left). Due to a variation across agents in the specific number of uses of each relation, an analysis of co-variance is necessary. Categorization success is measured every 160 trials as the mean value over the last 160 trials. Figure 3 (above) gives the results for the original system (left) and for Locator2 (right). Concerning categorization success (Fsucc), there is no significant effect, neither for performance between agents nor for performance regarding the two relations. In Locator2, a significant effect shows up for the number of false categorization attempts between the two relations (Ferr = 6.66, p < 0.01). Although the agents use both concepts equally successfully, they make more errors with the concept right. Because there is no significant effect between agents, a probable explanation is that this left/right deficiency is learned from the teacher; this remains to be shown.

Figure 3 (below) shows the mean categorization success for the systems Locator (left) and Locator2 (right). At the end of the simulation, categorization success is near 100% in both cases. The mean over all trials is 93% for Locator and 91% for Locator2. This difference between the mean performance and the performance at the end of the simulations is due to the fact that the agents start with no concepts at all and thus frequently make errors at the beginning.

6 Conclusion

Appointing an agent as teacher has proven a successful extension of the Locator system. Next, more complex learning tasks will be examined that allow for more variation in the acquired concepts and thus also for individual flaws in the individual agents. The hypothesis is that such individual flaws will make the agents more believable.

References

[1] E. Andre and T. Rist. Presenting through Performing. In H. Lieberman, editor, Proceedings of IUI 2000, pages 1–8, 2000.

[2] A. Billard and G. Hayes. DRAMA, a Connectionist Architecture for Control and Learning in Autonomous Robots. Adaptive Behavior, 7(1):35–63, 1999.


[3] M. Bowerman and S. C. Levinson, editors. Language acquisition and conceptual development. Cambridge University Press, Cambridge, 2001.

[4] G. H. Cablitz. The Acquisition of an Absolute System: Learning to talk about SPACE in Marquesan. In Proc. of the 31st SCLR Forum, 2002.

[5] K. Dautenhahn. The Art of Designing Socially Intelligent Agents. Applied Artificial Intelligence, 1998.

[6] J. Cassell et al. Designing embodied conversational agents. In J. Cassell et al., editor, Embodied conversational agents, pages 29–63. MIT Press, Cambridge, MA, 2000.

[7] T. Rist et al. CrossTalk: An Interactive Installation with Animated Presentation Agents. In E. Andre et al., editor, Proc. of COSIGN02, pages 61–67, 2002.

[8] S. C. Levinson. Frames of Reference and Molyneux's Question. In P. Bloom et al., editor, Language and Space, pages 109–169. MIT Press, Cambridge, MA, 1996.

[9] L. Steels. Perceptually grounded meaning creation. In M. Tokoro, editor, Proc. of the Int. Conf. on Multi-Agent Systems, pages 338–344. AAAI Press, 1996.

[10] L. Steels and F. Kaplan. Bootstrapping Grounded Word Semantics. In T. Briscoe, editor, Linguistic evolution through language acquisition. CUP, Cambridge, 1999.


An Efficient Synthetic Vision System for 3D Multi-character Systems*

Miguel Lozano1, Rafael Lucia2, Fernando Barber1, Fran Grimaldo2, Antonio Lucas1, and Alicia Fornes1

1 Computer Science Department, University of Valencia, Dr. Moliner 50, (Burjassot) Valencia, Spain
{Miguel.Lozano}@uv.es
2 Institute of Robotics, University of Valencia, Pol. de la Coma s/n (Paterna) Valencia, Spain

This poster deals with the problem of sensing virtual environments for 3D intelligent multi-character simulations. As these creatures should display reactive skills (navigation or gazing), together with the planning processes required to animate their behaviours, we present an efficient and fully scalable sensor system designed to provide this information (low level + high level) to different kinds of 3D embodied agents (games, storytelling, etc.). Inspired by Latombe's vision system [3], as recently presented by Peters [1], we avoid the second rendering mechanism [4][2] in order to obtain the necessary efficiency, and we introduce a fully scalable communication protocol, based on XML labelling techniques, to let the agent handle the communication flow within its 3D environment (sense + act). Although directly accessing the database is simple and fast, it suffers from scalability and realism problems, so synthetic vision has been considered as the process of capturing the list of visible actors (objects or agents) from the agent's point of view. Firstly, we extract the nearest actors from the Unreal Tournament (UT) 3D environment map area. Secondly, as Non-Player Characters (NPCs) during games are normally allowed to see through walls and other objects, we have filtered this UT list, removing the occluded actors from the agent's point of view (Visible List). Furthermore, assuming the general procedure for distributing all the simulation changes in the 3D environment (event-driven), and in order to be consistent with the behavioural model of such a graphics engine, we have implemented three communication modes to let the agent deal with its vision system: a) event-driven mode (on property change), b) fixed time mode (on time), c) agent's demand (on demand). In this way, each agent can easily manage its communication protocol (XML based), which can be viewed as a general mechanism for handling the knowledge embedded in 3D multi-character simulation systems.
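
A rough sketch of the visible-list computation and the three delivery modes is given below; the function names, the radius value, and the dictionary-based actor representation are illustrative assumptions (the real system exchanges XML messages with the UT engine, which is not reproduced here).

    import math

    def visible_list(agent, actors, occluded, radius=50.0):
        """Two-step synthetic vision: take nearby actors from the map area, drop occluded ones."""
        nearby = [a for a in actors if math.dist(agent["pos"], a["pos"]) <= radius]
        return [a for a in nearby if not occluded(agent, a)]

    def sense(agent, actors, occluded, mode="on_demand", changed=False):
        """The three communication modes for delivering the Visible List to the agent."""
        if mode == "on_property_change":              # a) event-driven: only when something changed
            return visible_list(agent, actors, occluded) if changed else None
        if mode == "on_time":                         # b) fixed time: called from a periodic timer
            return visible_list(agent, actors, occluded)
        return visible_list(agent, actors, occluded)  # c) on demand: whenever the agent asks

    def no_walls(a, b):
        return False                                  # stand-in occlusion test

    actors = [{"name": "npc1", "pos": (3.0, 4.0)}, {"name": "npc2", "pos": (80.0, 0.0)}]
    agent = {"name": "me", "pos": (0.0, 0.0)}
    print(sense(agent, actors, no_walls))             # only npc1 is close enough to be visible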

References

1. C. Peters, C. O'Sullivan, "Synthetic Vision and Memory for Autonomous Virtual Humans". Computer Graphics Forum, Volume 21, Issue 4, 2002

* Partially supported by the GVA project CTIDIB-2002-182 (Spain).


2. O. Renault, D. Thalmann, and N.M. Thalmann, "A vision-based approach to behavioural animation", Visualization and Computer Animation, Vol. 1, pages 18-21, 1990.

3. J. Kuffner and J.C. Latombe, "Fast Synthetic Vision, Memory, and Learning for Virtual Humans". Proc. of Computer Animation, IEEE, pp. 118-127, May 1999.

4. X. Tu and D. Terzopoulos, "Artificial Fishes: Physics, Locomotion, Perception, Behavior", Computer Graphics, SIGGRAPH 94 Conference Proceedings, pp. 43-50, July 1994


Author Index

Akker, Rieks op den 341
Antonio, Angelica de 325
Avradinis, Nikos 269
Aylett, Ruth 244, 249, 269
Ballin, Daniel 88
Barber, Fernando 356
Bechinie, Michael 212
Becker Villamil, Marta 164
Bente, Gary 292
Bergmark, Niklas 264
Bernsen, Niels Ole 27
Beun, Robbert-Jan 315
Bilvi, Massimo 93
Bokan, Bozana 259, 354
Bordini, Rafael H. 197
Bradley, John F. 218
Breton, Gaspard 111
Bryson, Joanna 101
Bullock, Adrian 62
Casella, Pietro 207
Cavazza, Marc 231
Charles, Fred 231
Conde, Toni 175
Courty, Nicolas 111
Cowell, Andrew J. 301
Csordas, Annamaria 5
Cunningham, Padraig 159
Daneshvar, Roozbeh 361
Dautenhahn, Kerstin 182, 310
DeGroot, Doug 136
Dobbyn, Simon 159
Duffy, Brian R. 218
Eliens, Anton 13, 80, 150
Enz, Sibylle 360
Fornes, Alicia 356
Gaspar, Graca 31
Gebhard, Patrick 48
Gillies, Marco 88
Gomes, Mario R. 57, 72
Grammer, Karl 212
Gratch, Jonathan 39
Grimaldo, Fran 356
Grunvogel, Stefan M. 170
Hall, Lynne 310
Hedlund, Erik 264
Heylen, Dirk 341
Hildebrand, Michiel 13
Ho, Wan Ching 182
Hook, Kristina 62
Hornung, Alexander 236
Hu, Weihua 355
Huang, Zhisheng 13, 80, 150
Ibanez, Jesus 249
Ishizuka, Mitsuru 226, 283
Iurgel, Ido A. 254
Jung, Bernhard 23
Kadlecek, David 274
Kim, In-Cheol 192
Kipp, Michael 48
Kiss, Arpad 5
Klesen, Martin 48
Kopp, Stefan 23
Kramer, Nicole C. 292
Krenn, Brigitte 18
Kruger, Antonio 217
Kurihara, Masahito 106
Laaksolahti, Jarmo 264
Lakemeyer, Gerhard 236
Laufer, Laszlo 5
Liang, Ronghua 141
Liu, Feng 141
Liu, Zhen 359
Louchart, Sandy 244
Lozano, Miguel 356
Lucas, Antonio 356
Lucas, Caro 361
Lucia, Rafael 356
Luna de Oliveira, Luiz Paulo 164
Mac Namee, Brian 159
Machado, Isabel 333
Manabu, Okumura 127
Mao, Wenji 39
Marsella, Stacy C. 1
Marichal, Xavier 231
Martin, Alan N. 218
Martin, Olivier 231
Martinho, Carlos 57
Mayer, Sonja 283
Mead, Steven J. 231
Mendez, Gonzalo 325
Morgado, Luís 31
Mori, Junichiro 283
Mukherjee, Suman 226
Nahodil, Pavel 274
Nedel, Luciana P. 197
Nehaniv, Chrystopher L. 182
Neumayr, Barbara 18
Nijholt, Anton 341
Nomura, Tatsuya 67
Nonaka, Hidetoshi 106
O'Hare, Gregory M.P. 218
Oldroyd, Amanda 259
O'Sullivan, Carol 159
Paiva, Ana 57, 62, 72, 207, 333
Pan, Zhi Geng 355, 359
Panayiotopoulos, Themis 202
Payr, Sabine 320
Pelachaud, Catherine 93
Pele, Danielle 111
Prada, Rui 62
Prendinger, Helmut 283
Raupp Musse, Soraia 164
Rehm, Matthias 348
Rehor, David 274
Reinecke, Alexander 181
Rickel, Jeff 119, 325
Rist, Thomas 48, 358
Ruiz-Rodarte, Rocio 249
Ruttkay, Zsofia 80
Saeyor, Santi 226
Sampath, Dasarathi 119
Schafer, Leonie 259, 354
Schaub, Harald 360
Schmitt, Markus 358
Schon, Bianca 218
Schwichtenberg, Stephan 170
Slavík, Pavel 274
Sobral, Daniel 310, 333
Stanney, Kay M. 301
Suguru, Saito 127
Szalo, Attila 5
Takenobu, Tokunaga 127
Tambellini, William 175
Tanguy, Emmanuel 101
Tatai, Gabor 5
Thalmann, Daniel 175
Tietz, Bernd 292
Tomofumi, Koyama 127
Torres, Jorge A. 197
Trappl, Robert 320
Trogemann, Georg 236
Uchiyama, Koki 226
Vala, Marco 62, 72
Visser, Cees 13, 150
Vissers, Maarten 341
Vos, Eveliene de 315
Vosinakis, Spyros 202
Willis, Philip 101
Witteman, Cilia 315
Wolke, Dieter 310
Woods, Sarah 310
Zhu, Jiejie 355
Zoll, Carsten 360