WordsEye: An Automatic Text-to-Scene Conversion System

Bob Coyne    Richard Sproat

AT&T Labs - Research

Abstract

Natural language is an easy and effective medium for describing visual ideas and mental images. Thus, we foresee the emergence of language-based 3D scene generation systems to let ordinary users quickly create 3D scenes without having to learn special software, acquire artistic skills, or even touch a desktop window-oriented interface. WordsEye is such a system for automatically converting text into representative 3D scenes. WordsEye relies on a large database of 3D models and poses to depict entities and actions. Every 3D model can have associated shape displacements, spatial tags, and functional properties to be used in the depiction process. We describe the linguistic analysis and depiction techniques used by WordsEye along with some general strategies by which more abstract concepts are made depictable.

CR Categories: H.5.1 [Information Interfaces and Presentation]: Multimedia Information Systems - Artificial Realities; H.5.2 [Information Interfaces and Presentation]: User Interfaces - Input Devices and Strategies; I.2.7 [Artificial Intelligence]: Natural Language Processing - Language Parsing and Understanding; I.3.6 [Graphics]: Methodologies - Interaction Techniques; I.3.8 [Graphics]: Applications

Keywords: Applications, HCI (Human-Computer Interface), Multimedia, Text-to-Scene Conversion, Scene Generation

1 Introduction

Creating 3D graphics is a difficult and time-consuming process. The user must learn a complex software package, traverse pages of menus, change tools, tweak parameters, save files, and so on. And then there's the task of actually creating the artwork. We see the need for a new paradigm in which the creation of 3D graphics is both effortless and immediate. It should be possible to describe 3D scenes directly through language, without going through the bottleneck of menu-based interfaces. Creating a 3D scene would then be as easy as dashing off an instant message.

Natural language input has been investigated in a number of 3D graphics systems including an early system by [2] and the oft-cited Put system [8]; the Put system shared our goal of making graphics creation easier, but was limited to spatial arrangements of existing objects. Also, input was restricted to an artificial subset of English consisting of expressions of the form Put(X P Y), where X and Y are objects, and P is a spatial preposition. The system did allow for fairly sophisticated interpretation of spatial relations so that on in on the table and on the wall would both be appropriately interpreted. More recently, there has been work at the University of Pennsylvania's Center for Human Modeling and Simulation [3, 4], where language is used to control animated characters in a closed virtual environment. In previous systems the referenced objects and actions are typically limited to what is available and applicable in the pre-existing environment. These systems therefore have a natural affinity to the SHRDLU system [23] which, although it did not have a graphics component, did use natural language to interact with a "robot" living in a closed virtual world.

{coyne,rws}@research.att.com

Figure 1: John uses the crossbow. He rides the horse by the store. The store is under the large willow. The small allosaurus is in front of the horse. The dinosaur faces John. A gigantic teacup is in front of the store. The dinosaur is in front of the horse. The gigantic mushroom is in the teacup. The castle is to the right of the store.

The goal of WordsEye, in contrast, is to provide a blank slate where the user can literally paint a picture with words, where the description may consist not only of spatial relations, but also actions performed by objects in the scene. The text can include a wide range of input. We have also deliberately chosen to address the generation of static scenes rather than the control or generation of animation. This affords us the opportunity to focus on the key issues of semantics and graphical representation without having to address all the problems inherent in automatically generating convincing animation. The expressive power of natural language enables quite complex scenes to be generated with a level of spontaneity and fun unachievable by other methods (see Figure 1); there is a certain magic in seeing one's words turned into pictures.

WordsEye works as follows. An input text is entered, the sentences are tagged and parsed, the output of the parser is then converted to a dependency structure, and this dependency structure is then semantically interpreted and converted into a semantic representation. Depiction rules are used to convert the semantic representation to a set of low-level depictors representing 3D objects, poses, spatial relations, color attributes, etc.; note that a pose can be loosely defined as a character in a configuration suggestive of a particular action. Transduction rules are applied to resolve conflicts and add implicit constraints. The resulting depictors are then used to manipulate the 3D objects that constitute the final, renderable 3D scene. For instance, for a short text such as: John said that the cat was on the table. The animal was next to a bowl of apples, WordsEye would construct a picture of a human character with a cartoon speech bubble coming out of its mouth. In that speech bubble would be a picture of a cat on a table with a bowl of apples next to it.¹

¹ With the exception of the initial tagging and parsing portion of the linguistic analysis, WordsEye is implemented in Common Lisp, runs within the Mirai 3D animation system from IZware, and uses 3D models from Viewpoint.

Figure 2: Dependency structure for John said that the cat was on the table.

Since linguistic descriptions tend to be at a high level of abstraction, there will be a certain amount of unpredictability in the graphical result. This same tradeoff is seen in computational models of behavior [21, 12] and natural phenomena. We also acknowledge up front that it is infeasible to fully capture the semantic content of language in graphics. But we do believe that a large number of interesting 3D scenes can be described and generated directly through language, and likewise that a wide variety of text can be effectively depicted.

In the remainder of this paper, we describe each of the components of WordsEye, starting with an overview of the linguistic analysis techniques used.

2 Linguistic Analysis

The text is initially tagged and parsed using a part-of-speech tagger [7] and a statistical parser [9]. The output of this process is a parse tree that represents the structure of the sentence. Note that since the parser is statistical, it will attempt to resolve ambiguities, such as prepositional phrase attachments, according to the statistics of the corpus on which it was trained (the Penn Treebank [19]). The parse tree is then converted into a dependency representation (see [16], inter alia), which is simply a list of the words in the sentence, showing the words that they are dependent on (the heads) and the words that are dependent on them (the dependents). Figure 2 shows an example dependency structure, with arrows pointing from heads to their dependents. The reason for performing this conversion from parse tree to dependency structure is that the dependency representation is more convenient for semantic analysis. For example, if we wish to depict the large naughty black cat we might actually have no way of depicting naughty, but we still would like to depict large and black. To do this we need merely to look at cat's dependents for depictable adjectives, which is in general simpler than hunting for depictable modifiers in a tree structure headed by cat.
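To make the convenience of the dependency form concrete, the following Common Lisp sketch (toy structures and invented names, not the actual WordsEye code) collects the depictable adjectives hanging off a head noun with a flat scan of its dependents:

;; Sketch only: toy dependency nodes, not WordsEye's real data structures.
(defparameter *depictable-adjectives* '("large" "small" "black" "blue" "red"))

(defstruct dep-node word pos dependents)   ; pos = part-of-speech tag

(defun depictable-modifiers (node)
  "Collect dependents of NODE that are adjectives we know how to depict."
  (remove-if-not
   (lambda (dep)
     (and (string= (dep-node-pos dep) "JJ")
          (member (dep-node-word dep) *depictable-adjectives*
                  :test #'string-equal)))
   (dep-node-dependents node)))

;; "the large naughty black cat": large and black survive, naughty is
;; silently ignored.
(let ((cat (make-dep-node
            :word "cat" :pos "NN"
            :dependents (list (make-dep-node :word "large" :pos "JJ")
                              (make-dep-node :word "naughty" :pos "JJ")
                              (make-dep-node :word "black" :pos "JJ")))))
  (mapcar #'dep-node-word (depictable-modifiers cat)))
;; => ("large" "black")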

The next phase of the analysis involves converting the dependency structure into a semantic representation. The semantic representation is a description of the entities to be depicted in the scene, and the relationships between them. The semantic representation for the sentence John said that the cat is on the table is given in Figure 3. The semantic representation is a list of semantic representation fragments, each fragment corresponding to a particular node of the dependency structure. Consider "node1", which is the semantic representation fragment for the action say, deriving from the node say in the dependency structure. The subject is "node2" (corresponding to John), and the direct object is the collection of "node5", "node4" and "node7", corresponding to nodes associated with the subordinate clause that the cat was on the table. Each of these nodes in turn corresponds to particular nodes in the dependency structure, and will eventually in turn be depicted by a given 3D object: so John will be depicted (in the current system) by a humanoid figure we call "Mr. Happy", and table will be depicted by one of a set of available 3D table objects.²


² An individual semantic representation fragment as currently used in WordsEye may seem relatively simple when compared, say, with the PAR representation of [3]. But bear in mind that an entire semantic representation for a whole scene can be a quite complex object, showing relations between many different individual fragments; further semantic information is expressed in the depiction rules described below. Also note that part of the complexity of PAR is due to the fact that that system is geared towards generating animation rather than static scenes.

(("node2" (:ENTITY :3D-OBJECTS ("mr_happy"):LEXICAL-SOURCE "John" :SOURCE SELF))

("node1" (:ACTION "say" :SUBJECT "node2":DIRECT-OBJECT ("node5" "node4" "node7")...))

("node5" (:ENTITY :3D-OBJECTS ("cat-vp2842")))("node4" (:STATIVE-RELATION "on" :FIGURE "node5"

:GROUND "node7"))("node7" (:ENTITY :3D-OBJECTS

("table-vp14364" "nightstand-vp21374""table-vp4098" "pool_table-vp8359" ...))))

Figure 3: Semantic representation for John said that the cat was on the table.

Semantic representation fragments are derived from the dependency structure by semantic interpretation frames. The appropriate semantic interpretation frames are found by table lookup, given the word in question. These frames differ depending upon what kind of thing the word denotes. For nouns such as cat or table, WordsEye uses WordNet [10], which provides various kinds of semantic relations between words, the particular information of interest here being the hypernym and hyponym relations. The 3D objects are keyed into the WordNet database so that a particular model of a cat, for example, can be referenced as cat, or feline or mammal, etc. Personal names such as John or Mary are mapped appropriately to male or female humanoid figures. Spatial prepositions such as on are handled by semantic functions that look at the left and right dependents of the preposition and construct a semantic representation fragment depending upon their properties. Note that there has been a substantial amount of previous work into the semantics of spatial prepositions; see, inter alia, [5, 14, 15] and the collections in [11, 20]; there has also been a great deal of interesting cross-linguistic work, as exemplified by [22]. There have been only a small number of implementations of these ideas, however; one sophisticated instance is [24]. One important conclusion of much of this research is that the interpretation of spatial relations is often quite object-dependent, and relates as much to the function of the object as its geometric properties, a point that ties in well with our use of spatial tags, introduced below in Section 3.1.
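The effect of keying models into the hypernym hierarchy can be sketched as follows; the hand-built hypernym table below merely stands in for the real WordNet interface, and the model names are only illustrative:

;; Sketch with a toy hypernym table standing in for WordNet.
(defparameter *hypernyms*
  '(("cat" . "feline") ("feline" . "mammal") ("mammal" . "animal")
    ("table" . "furniture")))

(defparameter *models-by-noun*
  '(("cat" "cat-vp2842") ("table" "table-vp14364" "table-vp4098")))

(defun noun-chain (noun)
  "NOUN plus all of its hypernyms, most specific first."
  (if (null noun)
      nil
      (cons noun (noun-chain (cdr (assoc noun *hypernyms* :test #'string=))))))

(defun models-for (noun)
  "3D models keyed under NOUN itself or under any noun that NOUN subsumes."
  (loop for (key . models) in *models-by-noun*
        when (member noun (noun-chain key) :test #'string=)
          append models))

(models-for "feline")   ; => ("cat-vp2842")
(models-for "table")    ; => ("table-vp14364" "table-vp4098")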

Finally, most verbs are handled by semantic frames, which are informed by recent work on verbal semantics, including [18]. The semantic entry for say is shown in Figure 4. This semantic entry contains a set of verb frames, each of which defines the argument structure of one "sense" of the verb say. For example, the first verb frame, named the SAY-BELIEVE-THAT-S-FRAME, has as required arguments a subject and a THAT-S-OBJECT, or in other words an expression such as that the cat is on the table. Optional arguments include action location (e.g., John said in the bathroom that the cat was on the table) and action time (e.g., John said yesterday that the cat was on the table). Each of these argument specifications causes a function to be invoked to check the dependencies of the verb for a dependent with a given property, and assigns such a dependent to a particular slot in the semantic representation fragment. WordsEye currently has semantic entries for about 1300 English nouns (corresponding to the 1300 objects described in Section 3.1), and about 2300 verbs, in addition to a few depictable adjectives, and most prepositions. The vocabulary is, however, readily extensible and is limited only by what we are able to depict.

In addition to semantically interpreting words that denote particular entities, actions, or relations, WordsEye also interprets anaphoric or coreferring expressions.

(SEMANTICS :GENUS say
 :VERB-FRAMES
 ((VERB-FRAME
   :NAME SAY-BELIEVE-THAT-S-FRAME
   :REQUIRED (SUBJECT THAT-S-OBJECT)
   :OPTIONAL (ACTIONLOCATION ACTIONTIME))
  (VERB-FRAME
   :NAME SAY-BELIEVE-S-FRAME
   :REQUIRED (SUBJECT S-OBJECT)
   :OPTIONAL (ACTIONLOCATION ACTIONTIME)) ...))

Figure 4: Semantic entry for say.

Simple pronominals like he or she are interpreted by searching through the context to find an appropriate coreferent (where appropriate includes matching on number and gender features). Nouns can also corefer, as in the example: John said that the cat was on the table. The animal was next to a bowl of apples. While it is not strictly required that the animal denote the cat mentioned in the previous sentence, the coherence of the discourse depends upon the reader or listener making that connection. WordsEye currently handles such associations by noting that in the WordNet hierarchy, the denotations of cat are a subset of the denotations of animal, and "guessing" that the noun phrase might corefer with the previously mentioned cat (see, e.g., [13]).

Before closing the discussion of the natural language component of WordsEye, it is worth noting two points. First of all, WordsEye of necessity performs a fairly deep semantic analysis of the input text. This contrasts with what counts for "understanding" in much recent work on, for example, Message Understanding (MUC) (see, e.g., [1]); in the MUC context one is typically trying to answer a small number of questions about the text (e.g., who left which company to head up which other company), and so most approaches to understanding in this context eschew a complete semantic analysis of sentences. In WordsEye we do not have this luxury since the number of "questions" to be answered in order to construct a scene is in principle unbounded. Second, WordsEye owes an intellectual debt to work in Cognitive Grammar [17]. While WordsEye is not strictly speaking an implementation of this theory, Cognitive Grammar's model of semantics is like WordsEye's in that it constructs meanings of utterances out of graphical components that combine in ways that are similar to WordsEye's depiction phase, which we now describe.

3 Depictors

All scenes are ultimately defined in terms of a set of low-level graphical specifications which we call depictors. Depictors exist to control 3D object visibility, size, position, orientation, surface color and transparency. Depictors are also used to specify poses of human characters, control Inverse Kinematics (IK), and modify vertex displacements for facial expressions or other aspects of the objects. In this section we examine depictors in more detail.

3.1 Object Database

The basic elements of any 3D scene are the objects themselves. WordsEye currently utilizes approximately 2000 3D polygonal objects, with another 10,000 in the process of being integrated into the system. The majority of the database is licensed from Viewpoint Digital and includes models for animal and human characters as well as buildings, vehicles, household items, plants, etc. Since WordsEye is extensible, users can add their own models to the database.

Figure 5: Spatial tag for "canopy area", indicated by the box under the left-hand chair; and "top surface", indicated by the box on the right-hand chair.

In addition to the raw 3D data, WordsEye associates additional information with each 3D model.

Skeletons: Objects can contain skeletal control structures. These are used in human and animal characters to define poses representing different actions.

Shape displacements: Some objects, like human faces, can change shape (e.g., smiling, eyes closed, frowning, etc.). Shape displacements are associated with the object and used to depict emotions or other states of the object.

Parts: These are named collections of polygonal faces representing significant areas of the surface. For example, the headlights, roof, and windshield of a car would be in different parts.

Color parts: These are the set of parts to be colored when the object is specified by the text as having a particular color. For example, in the blue flower, the petals of the flower should be colored blue, not the stem. If no color parts are specified, the largest part is colored.

Opacity parts: These are parts that get a default transparency (e.g., the glass part of a framed window).

Default size: All objects are given a default size, expressed in feet.

Functional properties: These properties are used in the depiction process to determine how an object can be used. For example, cars, bicycles, trucks, and motorcycles are all road vehicles. Then, while depicting the verb ride, we select among these to choose a vehicle in the sentence John rides to the store.

Spatial tags: In order to depict spatial relations, the shape of the objects in question must be known. We do this by associating spatial tags with all objects in the database. For example, the interior area of an ashtray functions as a cup to contain whatever is put in it. The area is marked with a tag (a simple space-filling 3D object) representing the cupped interior. This is used when depicting spatial relations such as in and sometimes on. Some examples of spatial tags are: canopy area and top surface (Figure 5); and base and cup (Figure 6). Other spatial tags not shown here are wall, cap, enclosed-area, ridge, and peak.
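Taken together, the per-model information above amounts to a record along the following lines; this is a condensed sketch with slot and tag names of our own, not the actual database schema:

;; Sketch of one database entry; a spatial tag is modeled as a named
;; space-filling box in the object's local coordinates.
(defstruct spatial-tag name center size)          ; size = (x y z) in feet

(defstruct model-entry
  name                     ; e.g. "chair-vp1234" (hypothetical)
  nouns                    ; WordNet keys: ("chair" "furniture" ...)
  default-size             ; overall height in feet
  parts                    ; named face collections
  color-parts              ; parts to recolor for "the blue chair"
  functional-properties    ; e.g. (:seat :furniture)
  spatial-tags)            ; canopy area, top surface, base, cup, ...

(defparameter *chair*
  (make-model-entry
   :name "chair-vp1234"
   :nouns '("chair" "seat" "furniture")
   :default-size 3.0
   :parts '("seat" "back" "legs")
   :color-parts '("seat" "back")
   :functional-properties '(:seat :furniture)
   :spatial-tags (list (make-spatial-tag :name :top-surface
                                         :center '(0.0 1.5 0.0)
                                         :size '(1.5 0.1 1.5))
                       (make-spatial-tag :name :canopy-area
                                         :center '(0.0 0.7 0.0)
                                         :size '(1.4 1.3 1.4)))))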

3.2 Spatial Relations

Spatial relations define the basic layout of scenes. They include relative positions as in John is next to Mary, distances as in The chair is 5 feet behind the table, and orientations as in John is facing the abyss. And, as discussed later, spatial relations are frequently an implicit part of actions and compound objects.

Spatial relations are often denoted by prepositions like on, under, beyond, etc. The exact placement of objects, in depictions of spatial relations, depends on the shapes and surfaces of those objects.

Figure 6: Spatial tags for "base" and "cup".

Figure 7: The daisy is in the test tube.

Additionally, terms like in and under can have different possible spatial interpretations depending on the objects in question. For example, The cat is under the table and The rug is under the table denote different spatial areas. Some examples of spatial relations are described below.

For The bird is on the cat, we find a top surface tag for the cat (on its back) and a base tag for the bird (under its feet). We then reposition the bird so that its feet are on the cat's back.

For The daisy is in the test tube, we find the cup tag for the test tube and the stem tag for the daisy and put the daisy's stem into the test tube's cupped opening. See Figure 7. Spatial tags for stems are applied to any object with a long, thin base leading to a thicker or wider top area. Some objects with stems are stop signs, umbrellas, palm trees and street lamps.

For The elephant is under the chair, we look for a canopy tag for the chair (the area under the seat of the chair between the legs) and put the elephant there. This might involve resizing the elephant to make it fit. However, as noted earlier, under can also be interpreted so that the chair is put on the elephant's back. Depending on the size and shape of the objects in question, one interpretation or another will be chosen. In general, we try to choose an interpretation that avoids resizing. However, we note that gross changes of scale are extremely common in advertising and often highlight the significance or functional role of the objects in question.

These examples are not meant to be an exhaustive list, but rather illustrate the manner in which we use object tags to depict spatial relations. A rendered example of a spatial relation using the top surface and enclosure spatial tags is shown in Figure 8.
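For the on case, the placement computation these examples rely on can be sketched numerically as follows; the tags are simplified here to single offset points, whereas in WordsEye they are space-filling 3D regions, and the function names are our own:

(defun vec+ (a b) (mapcar #'+ a b))
(defun vec- (a b) (mapcar #'- a b))

(defun find-tag (name tags)
  "TAGS is an alist mapping a tag name to its offset inside the object."
  (cdr (assoc name tags)))

(defun place-on (figure-tags ground-tags ground-position)
  "World position for the figure in \"figure on ground\": translate the
figure so its base tag lands on the ground object's top-surface tag."
  (vec- (vec+ ground-position (find-tag :top-surface ground-tags))
        (find-tag :base figure-tags)))

;; "The bird is on the cat": the cat's back is 1.2 ft above its origin,
;; the bird's feet are 0.1 ft above its own origin.
(place-on '((:base . (0.0 0.1 0.0)))
          '((:top-surface . (0.0 1.2 0.3)))
          '(5.0 0.0 2.0))
;; => (5.0 1.1 2.3), up to float rounding

The in case works the same way with a cup or enclosed-area tag on the ground object and, for stemmed objects like the daisy, a stem tag on the figure.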

Figure 8: The bird is in the bird cage. The bird cage is on the chair.

Figure 9: Usage pose for a 10-speed.

3.3 Poses and Grips

Most actions are depicted in WordsEye via the use of predefined poses, where a pose can be loosely defined as a character in a configuration suggestive of a particular action.

Standalone poses consist of a character in a particular body position. Examples of this are waving, running, bowing, or kneeling.

Specialized usage poses involve a character using a specific instrument or vehicle. Some examples are swinging a baseball bat, shooting a rifle, and riding a bicycle. For a bicycle, a human character will be seated on a bicycle with its feet on the pedals and hands on the handlebars. In these, each pose is tightly associated with a particular manipulated object; see Figure 9 for an example.

Generic usage poses involve a character interacting with a generic stand-in object. The stand-in objects are represented by simple markers like spheres and cubes. We use these in cases where another object can convincingly be substituted for the stand-in. For example, in the throw small object pose (Figure 10, left panel), the ball is represented by a generic sphere. If the input sentence is John threw the watermelon, the watermelon will be substituted for the sphere in the same relative position with respect to the hand. The new object can be substituted as is or, alternatively, resized to match the size of the stand-in sphere. The positional and sizing constraints are associated with the stand-in objects and are stored in the pose.
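The substitution step can be sketched as follows (the slot names are assumptions of ours): the stand-in records a hand-relative offset and a size constraint, and the substituted object simply inherits them.

;; Sketch: a generic usage pose carries a stand-in marker; the depicted
;; object takes over the stand-in's placement and, optionally, its size.
(defstruct stand-in offset-from-hand size resize-p)
(defstruct scene-object name position size)

(defun substitute-for-stand-in (object stand-in hand-position)
  "Place OBJECT where the pose's STAND-IN sits relative to the hand,
optionally resizing it to match the stand-in."
  (setf (scene-object-position object)
        (mapcar #'+ hand-position (stand-in-offset-from-hand stand-in)))
  (when (stand-in-resize-p stand-in)
    (setf (scene-object-size object) (stand-in-size stand-in)))
  object)

;; "John threw the watermelon": the watermelon replaces the generic
;; sphere in the throw-small-object pose, just above the hand.
(substitute-for-stand-in
 (make-scene-object :name "watermelon" :size 1.5)
 (make-stand-in :offset-from-hand '(0.0 0.3 0.2) :size 0.8 :resize-p t)
 '(2.0 4.5 1.0))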

Grip poses involve a character holding a specific object in a certain way.

Figure 10: "Throw small object" pose and "hold wine bottle" grip.

Figure 11: John rides the bicycle. John plays the trumpet.

Some objects can be used in a variety of ways while being held in the same grip. For instance, if we have a grip for holding a wine bottle (Figure 10, right panel), this grip can be used in various poses, such as pouring wine, giving the bottle to someone else, putting the bottle on a surface, and so forth. This technique allows us to avoid a combinatorial explosion in the number of poses for specific objects. We do not want a separate pour, give, and put pose for every object in our database. We avoid this by having a small number of grips for each object and then selecting the grip appropriate for the more generic action pose. To do this, we first put and attach the object in the hand before going to the action pose. This is facilitated by classifying objects and poses into categories representing their shape. For example, the poses swing long object and hold long object might be applied to a sword in a hold sword grip.

Bodywear poses involve a character wearing articles of clothing like hats, gloves, shoes, etc. These are used to attach the object to the appropriate body part and are later combined with other poses and body positions.

Another strategy we adopt is to combine upper and lower body poses. Some poses require the whole body to be positioned, while for others only the upper or lower body needs to be positioned. We use the simple procedure of associating an active body part for each pose, and then moving only those bones that are necessary when more than one pose is applied to the same character. For example, see Figure 11, which shows a character riding a bicycle (lower) while playing the trumpet (upper).

3.4 Inverse Kinematics

Poses are effective for putting a character into a stance that suggests a particular action. But for a scene to look realistic and convincing, the character must sometimes interact directly with the environment.

Figure 12: Spatial tag for "push handle" of baby carriage, indicated by the box around the handle.

Figure 13: The lawn mower is 5 feet tall. John pushes the lawn mower. The cat is 5 feet behind John. The cat is 10 feet tall.

We use IK to do this [25]. So, for example, in pointing, it is not enough just to put the character into a pointing pose since the object pointed at can be anywhere in the environment. Instead, the hand must be moved with IK to point in the desired direction.

We also use IK to modify existing poses. For example, the push large object pose consists of the character leaning toward an object with legs in stride and arms outstretched. Consider, however, pushing various objects such as a lawn mower, a car, or a baby carriage. Since the different objects have handles and surfaces at different heights, no single body pose can work for them all. The hands must be located at the correct position on the object. To do this, the character is first put behind the object in the push large object pose. Then the hands are moved using IK to the handle or vertical surface of the object. Note that this technique relies on object tags for handle or vertical surface in order to determine the target position for the IK; see Figure 12, and Figure 13 for a rendered example that uses IK to move the hands to the handle of a lawn mower.
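A sketch of how an object tag supplies the IK goal is given below; the solver itself lives in the underlying animation system, so it is reduced to a stub here, and the tag and function names are our own:

;; Sketch: pick the push target from the object's tags, preferring a
;; handle over a plain vertical surface, then hand it to the IK layer.
(defun push-target (object-tags)
  "Where the hands should go when pushing this object."
  (or (cdr (assoc :push-handle object-tags))
      (cdr (assoc :vertical-surface object-tags))))

(defun solve-ik (character effector target)
  ;; Stand-in for the real solver; in WordsEye the IK is performed
  ;; inside the animation system [25].
  (format t "~&move ~a of ~a to ~a" effector character target))

(defun depict-push (character object-tags)
  (let ((target (push-target object-tags)))
    (solve-ik character :left-hand target)
    (solve-ik character :right-hand target)))

;; Lawn mower: its only relevant tag is a push handle 3 feet up.
(depict-push "john" '((:push-handle . (0.0 3.0 -1.5))))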

3.5 Attributes

WordsEye currently handles attributes for size, color, transparency and shape. Color and transparency are applied to the object as surface attributes. They are applied to the dominant part (as defined in the object database) of the object unless otherwise specified. The shape of the object can be modified using shape displacements in the Mirai animation system. These are predefined states of vertex positions associated with the 3D model that can be additively combined. For example, in a human face, there can be separate displacements for a smile and a wink. The various displacements can be combined independently. We currently use this in WordsEye to control facial expressions, but the same technique can be used for other shape deformations (e.g., twisting, bending, etc.). An obvious alternative is to use free-form deformations to dynamically compute bending, twisting, and so forth [6]. Size (e.g., large, small) and aspect ratio (e.g., flattened, squashed) attributes are controlled by manipulating the 3D object's transform matrix.

4 The Depiction Process

All scenes are ultimately defined in terms of a set of low-level depictors (3D objects and their spatial and graphical properties). The job of WordsEye's depiction module is to translate the high-level semantic representation produced by the linguistic analysis into these low-level depictors. The process involved in the creation and application of depictors, leading to a final rendering, is outlined below:

1. Convert the semantic representation from the node structure produced by the linguistic analysis to a list of typed semantic elements with all references resolved.

2. Interpret the semantic representation. This means answering "who?", "what?", "when?", "where?", and "how?" when the actor, object, time, location, and method are unspecified.

3. Assign depictors to each semantic element.

4. Resolve implicit and conflicting constraints of depictors.

5. Read in referenced 3D models.

6. Apply each assigned depictor, while maintaining constraints, to incrementally build up the scene.

7. Add background environment, ground plane, and lights.

8. Adjust the camera, either automatically (currently by framing the scene objects in a three-quarters view) or by hand.

9. Render.
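Read as a program, the nine steps amount to a pipeline roughly like the following sketch, in which every run-step call is a stub standing in for the corresponding WordsEye module (the names are placeholders, not the actual API):

;; Sketch of the depiction pipeline; the stubs keep it runnable, but in
;; the real system each step is a substantial module.
(defun run-step (name data)
  (format t "~&~a~%" name)
  data)

(defun depict (semantic-representation)
  (let* ((elements    (run-step "1. resolve references"          semantic-representation))
         (interpreted (run-step "2. interpret unspecified"       elements))
         (depictors   (run-step "3. assign depictors"            interpreted))
         (constrained (run-step "4. resolve constraints"         depictors))
         (scene       (run-step "5. read in 3D models"           constrained)))
    (run-step "6. apply depictors"                 scene)
    (run-step "7. add environment, ground, lights" scene)
    (run-step "8. frame camera"                    scene)
    (run-step "9. render"                          scene)))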

We now describe this process in more detail.

4.1 Depiction Rules

The output of the linguistic analysis is a semantic representation of the text. The semantic representation in turn consists of a list of semantic elements representing the various constituent meanings and relations inherent in the input text. The main semantic element types are ENTITY, ACTION, ATTRIBUTE and RELATION, which roughly correspond to nouns, verbs, adjectives and prepositions. Some additional, more specialized types are PATHS, TIMESPEC, CONJUNCTION, POSSESSIVE, NEGATION, and CARDINALITY. Each type of semantic element has various type-specific parameters. For example, ACTIONS have tense. We omit these incidental parameters in the examples below.

As an example, the sentence The cowboy rode the red bicycle to the store is represented by the following semantic elements:

1. Entity: cowboy
2. Entity: bicycle
3. Entity: store
4. Attribute:
     Subject: <element 2>
     Property: red
5. Action:
     Actor: <element 1>
     Action: ride
     Object: <element 2>
     Path: <element 6>
6. Path:
     Relation: to
     Figure: <element 5>
     Ground: <element 3>

In order to depict a sentence, the semantic elements must be made graphically realizable. This is done by applying a set of depiction rules. Depiction rules are tested for applicability and then applied to translate semantic elements into graphical depictors.

As an example, we examine the depiction rule for the action kick. We handle several cases. The first case is for large objects, where the kicked object should be depicted on the ground in front of the actor's foot. This happens when there is no specified path (e.g., we say John kicked the car, as opposed to something like John kicked the car over the fence) and when the size of the direct object is larger than some arbitrary size (for example, 3 feet).

The second case is for small objects where no path or recipient is specified. It uses the kick object pose and substitutes the kicked object for the stand-in object in the pose, placed above the foot. This would be used with sentences like John kicked the football. The football gets put in the air, just above the foot.

The third case is used when the actor kicks the object on a path (e.g., to a recipient). This might correspond to sentences like John kicked the football to Mary. The PATH-DEPICTOR used in this depiction rule specializes in depicting objects on paths.

Note that some depictors are marked as tentative because they are just defaults and are not inherent to the kicking action. Also note that the depiction rules described above are somewhat simplified; they can be made arbitrarily more complex to handle various special cases and subtleties.

DEFINE-DEPICTION-RULE ACTION kick
  Case 1: no PATH or RECIPIENT, DIRECT-OBJECT size greater than 3 feet
    Pose: kick, ACTOR
    Position: ACTOR directly behind DIRECT-OBJECT
    Orientation: ACTOR facing DIRECT-OBJECT
  Case 2: no PATH or RECIPIENT, DIRECT-OBJECT size less than 3 feet
    Pose: kick object, ACTOR, DIRECT-OBJECT
  Case 3: PATH and RECIPIENT
    Pose: kick, ACTOR
    Path: DIRECT-OBJECT between ACTOR's foot and RECIPIENT
    Orientation: ACTOR facing RECIPIENT
    Pose: catch, RECIPIENT [tentative]
    Orientation: RECIPIENT facing ACTOR [tentative]
    Position: ACTOR 10 feet from RECIPIENT in Z axis [tentative]
    Position: ACTOR 0 feet from RECIPIENT in X axis [tentative]

In some cases, depending on the object in question, different poses may be applied to the same basic action. For example, different objects (e.g., baseball, soccer ball, frisbee, javelin) are thrown in different manners. The depiction rules are responsible for finding the most appropriate pose for the action, given the total context of the semantic elements.

For attributes, depiction rules can create depictors for size, color, transparency, aspect ratio, and other directly depictable properties of the objects. Sometimes an attribute is best depicted by attaching an iconic appendage to the object. This is illustrated in the following depiction rule, which is used to depict a spinning object. Since we cannot directly depict motion, we can suggest it iconically by putting a circular arrow above the given object.

DEFINE-DEPICTION-RULE ATTRIBUTE spinning
  Spatial-Relation: above, SUBJECT, circular arrow 3D model

For entities themselves, depiction rules are normally responsible for selecting which 3D object to use. This is currently done by selecting randomly among those matching the given term. In some cases, however, an entity is depicted by an assembly of separate 3D objects. With environments (e.g., a living room), this will almost always be the case. And in another example, cowboy might be depicted as a human character wearing a cowboy hat. So we create a pose depictor that positions and attaches the cowboy hat to the actor's head.

DEFINE-DEPICTION-RULE ENTITY cowboy
  Pose: wear cowboy hat, ACTOR

4.2 Implicit Constraints

In certain circumstances it is desirable to add depictors for constraints that are not explicitly stated, but rather are based on common sense knowledge or are in some way deducible from the semantic representation. A set of transduction rules is invoked to do this.

Consider: The lamp is on the table. The glass is next to the lamp. Although it is not stated that the glass is on the table, we probably want it there rather than floating in the air next to the lamp. To do this, we invoke a rule that says "If X is next to Y, X is not already on a surface, and X is not an airborne object (e.g., a helium balloon)" then "Put X on the same surface as Y".

Consider next the sentence The cat and the dog are on the rug. Since there is no specification of where on the rug the cat and dog are located, it would be logically consistent to put the dog and cat in exactly the same place on the rug. To avoid this unintuitive result, we invoke a rule that says "If X is on Z, and Y is on Z, and X and Y are not spatially constrained" then "Put X next to Y". Note that although the previous rule is specified only with respect to the on relation, a more general formulation is possible.
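The first of these rules can be sketched in code over a toy depictor representation (property lists; the predicate and slot names are ours, not WordsEye's):

;; Sketch of the "next to" rule over toy depictors.
(defun relation (d) (getf d :relation))
(defun figure (d) (getf d :figure))
(defun ground (d) (getf d :ground))

(defun surface-under (object depictors)
  "The thing OBJECT rests on, if any existing depictor already says so."
  (ground (find-if (lambda (d)
                     (and (eq (relation d) :on) (eq (figure d) object)))
                   depictors)))

(defun add-implicit-surfaces (depictors &key (airborne '(balloon)))
  "If X is next to Y, X is not already on a surface, and X is not an
airborne object, put X on the same surface as Y."
  (let ((new '()))
    (dolist (d depictors (append depictors (nreverse new)))
      (when (and (eq (relation d) :next-to)
                 (not (surface-under (figure d) depictors))
                 (not (member (figure d) airborne))
                 (surface-under (ground d) depictors))
        (push (list :relation :on
                    :figure (figure d)
                    :ground (surface-under (ground d) depictors))
              new)))))

;; "The lamp is on the table.  The glass is next to the lamp."
(add-implicit-surfaces
 '((:relation :on :figure lamp :ground table)
   (:relation :next-to :figure glass :ground lamp)))
;; => the original two depictors plus (:relation :on :figure glass :ground table)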

4.3 Conflicting Constraints

Depiction specifications sometimes conflict with one another. This typically occurs when the default depictors assigned by an action conflict with those explicitly specified elsewhere. For example, John kicked the ball to Mary will generate depictors to put John in a kick pose, put John behind and facing Mary, put the ball between John and Mary, etc. Some of those depictors, such as the exact positions of the two characters, are labeled tentative because they are just defaults and are not inherent to the kicking pose.

1. POSE: John in kick pose
2. PATH: Ball between John's foot and Mary
3. ORIENTATION: John facing Mary
4. POSE: Mary in catch pose [tentative]
5. ORIENTATION: Mary facing John [tentative]
6. POSITION: John 10 feet from Mary in Z axis [tentative]
7. POSITION: John 0 feet from Mary in X axis [tentative]

But assume we add the specifications that Mary is 20 feet to the left of John and Mary is 30 feet behind John, which generates these depictors:

8. POSITION: Mary 20 feet from John in X axis
9. POSITION: Mary 30 feet from John in Z axis

We now have a conflict between depictors 6, 7 and 8, 9. To resolve these, a transduction rule is invoked that, when detecting a conflict between depictors X and Y, where X is tentative, will remove depictor X. So, in this example, since depictors 6 and 7 are marked as tentative, they are removed. The following is the result:

1. POSE: John in kick pose
2. PATH: Ball between John's foot and Mary
3. ORIENTATION: John facing Mary
4. POSE: Mary in catch pose [tentative]
5. ORIENTATION: Mary facing John [tentative]
6. POSITION: John 20 feet from Mary in X axis
7. POSITION: John 30 feet from Mary in Z axis
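The resolution step itself is small; a sketch over a toy depictor representation (the conflict test is deliberately simplified) is:

;; Sketch: drop any tentative depictor that conflicts with a
;; non-tentative one.  Here two POSITION depictors conflict when they
;; constrain the same axis; the real test also checks that they
;; constrain the same objects.
(defun conflicting-p (a b)
  (and (eq (getf a :type) :position)
       (eq (getf b :type) :position)
       (eq (getf a :axis) (getf b :axis))
       (not (eq a b))))

(defun drop-tentative-conflicts (depictors)
  (remove-if (lambda (d)
               (and (getf d :tentative)
                    (some (lambda (other)
                            (and (not (getf other :tentative))
                                 (conflicting-p d other)))
                          depictors)))
             depictors))

;; Depictors 6-9 from the kick example above, reduced to the essentials:
(drop-tentative-conflicts
 '((:type :position :axis :z :feet 10 :tentative t)   ; default
   (:type :position :axis :x :feet 0  :tentative t)   ; default
   (:type :position :axis :x :feet 20)                ; explicit
   (:type :position :axis :z :feet 30)))              ; explicit
;; => only the two explicit POSITION depictors remain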

4.4 Applying Depictors

In order to actually create a coherent scene, the various independent graphical depictors (poses, spatial relations, etc.) derived from the semantic representation need to be applied. This is done by applying constraints in a simple prioritized manner:

1. Objects are initialized to their default size and shape. Size and color changes to objects are made at this stage also.

2. Apply shape changes and attachments. Characters are put into poses, and instruments and other pose-related objects are attached. At the same time, shape changes (for facial expressions, etc.) are made. The integration of upper and lower body poses is also handled at this stage.

3. Once objects are in their correct shapes and poses, all objects are positioned, with the exception of objects placed on paths in step 5. Once objects are constrained together (independently in each axis), neither can be moved without the other (along that axis).

4. With objects in the correct poses/shapes and positions, orientations are applied. This handles cases where one object is specified to face another or in some particular direction.

5. Dynamic operations such as placing objects on paths and IK are now performed.

4.5 Interpretation, Activities, Environment

In order for text to be depicted, it must first be interpreted. A sentence like John went to the store is somewhat vague. We do not know if John drove, walked, ice skated, hitchhiked, etc. Furthermore, we do not know how old John is, how he is dressed, or whether he is doing anything else on the way. Nor do we know the type of store, its location, what type of building it is in, or the path taken to get there. To be depicted, these must be resolved. This will often involve adding details that were not explicitly stated in the text.

We rely on the functional properties of objects to make some of these interpretations. Verbs typically range along a continuum from pure descriptions of state changes like John went to the store to more explicit specifications of manner like John crawled to the store. Sometimes an instrument (or vehicle) is mentioned, as in John rode a bicycle to the store, while in other cases the type of instrument is only implied by the verb, as in John rode to the store. To find implied instruments, we look for objects whose functional properties are compatible with the instrument type demanded by the verb. In this case we want a rideable vehicle and find (among others) a bicycle. We then apply the "usage" pose for that object (bicycle). In this way, the sentence John rode to the store gets depicted with John in a riding pose on a bicycle.
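The instrument search can be sketched as follows (the property names and the table are illustrative assumptions, not the actual WordsEye database):

;; Sketch: "John rode to the store" -> ride demands a rideable vehicle;
;; any object carrying that functional property is a candidate.
(defparameter *functional-properties*
  '(("bicycle"    . (:road-vehicle :rideable))
    ("horse"      . (:rideable))
    ("lawn-mower" . (:pushable))))

(defun objects-with-property (property)
  "Names of all database objects carrying the given functional property."
  (loop for (name . props) in *functional-properties*
        when (member property props) collect name))

(defun implied-instrument (verb)
  "Pick an instrument for a verb that implies one, e.g. ride or push."
  (let ((needed (case verb (ride :rideable) (push :pushable))))
    (first (objects-with-property needed))))   ; the real system chooses among all

(implied-instrument 'ride)   ; => "bicycle"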

Very often the interpretation will depend on the setting of the scene, either an environment (e.g., a forest) or an activity (e.g., a football game). Sometimes there is no explicitly specified environment, in which case an environment compatible with the rest of the text could be supplied. Consider, for example, The flower is blue. Rather than just depicting a blue flower floating on the page, we have the option of supplying a background. The simplest case for this is a ground plane and/or supporting object. For more visually complex cases, we may want to put the flower in a vase on a fireplace mantle in the middle of a fully decorated living room. Even when the setting is specified, as in John walked through the forest, it must be resolved into specific objects in specific places in order to be depicted.

It should be noted that the same type of semantic inferences made with instrumental objects can also be applied to settings. For the sentence John filled his car with gas, we know he is probably at a gas station and we might want to depict John holding the gas pump. WordsEye currently does not have enough real-world knowledge or the mechanisms in place to handle environments or activities, but we recognize their importance both to interpreting semantic representations and adding background interest to otherwise more purely literal depictions.

4.6 Figurative and Metaphorical Depiction

Many sentences include abstractions or describe non-physical properties and relations, and consequently they cannot be directly depicted. We use the following techniques to transform them into something depictable:

Textualization: When we have no other way to depict an entity (for example, it may be abstract or maybe we do not have a matching 3D model in our database), we generate 3D extruded text of the word in question. This can sometimes generate amusing results: see Figure 14.

Emblematization: Sometimes an entity is not directly depictable, but some 3D object can be an emblem for it. In those cases, the emblem is used. A simple example of an emblem is a light bulb to represent the word idea, or a church to (somewhat ethnocentrically) represent religion. We also use emblems to represent fields of study. For example, entomology is depicted by a book with an insect as an emblem on its cover.

Characterization: This is a specialized type of emblematization related to human characters in their various roles. In order to depict these, we add an article of clothing or have the character hold an instrument that is associated with that role. So, for example, a cowboy will wear a cowboy hat, a football player will wear a football helmet, a boxer will wear boxing gloves, a detective might carry a magnifying glass, and so on.

Conventional icons: We use comic book conventions, like thought bubbles, to depict the verbs think or believe. The thought bubble contains the depiction of whatever is being thought of. Likewise, we use a red circle with a slash to depict not; see Figure 15. The interior of the circle contains the depiction of the subject matter being negated.

Figure 14: The cat is facing the wall.

Figure 15: The blue daisy is not in the army boot.

This same sort of depiction process can be applied recursively. For example, for John thinks the cat is not on the table, the thought bubble contains a red-slashed circle which in turn contains the cat on the table. Alternatively, John does not believe the radio is green is depicted with the slashed circle encompassing the entire depiction of John and the thought bubble and its contents; see Figure 16. Similarly, comic book techniques like speed lines and impact marks could be used to depict motion and collisions.

Literalization: Sometimes figurative or metaphorical meanings can be depicted most effectively in a literal manner: see Figure 17. We note that this is a well established technique. For example, T. E. Breitenbach's poster "Proverbidioms"³ contains depictions of hundreds of figures of speech. Throwing the baby out with the bathwater is depicted literally, as a baby being tossed out a window along with a tub of bathwater. This approach comes naturally to WordsEye.

Personification: When metaphorical statements are interpreted literally, an inanimate or abstract entity often needs to be depicted in a human role (e.g., Time marches on). Our current minimalist approach is to affix some representation (flattened object, text, or emblem) of that entity onto a generic human character's chest as a visual identifier, like Superman's "S". A more satisfactory solution would be, in cartoon style, to give the object a set of generic legs and arms and a superimposed face.

³ www.tebreitenbach.com/posters.htm

Figure 16: John does not believe the radio is green.

Figure 17: The devil is in the details.

Degeneralization: General categorical terms like furniture cannot be depicted directly. We depict these by picking a specific object instance of the same class (in this case, perhaps chair). This works well enough in most cases, as in John bought a piece of furniture. But sometimes the reference is to the general class itself and hence the class, not an instance of it, should be depicted, as in This table lamp is not furniture. We currently do not handle this case. One depiction strategy might be to choose a representative, generic looking object within the class and affix a textual label consisting of the class name itself.

5 Discussion and Future Work

We believe WordsEye represents a new approach to creating 3D scenes and images. It is not intended to completely replace more traditional 3D software tools, but rather to augment them by, for example, allowing one to quickly set up a scene to be later refined by other methods. WordsEye focuses on translating the semantic intent of the user, as expressed in language, into a graphic representation. Since semantic intent is inherently ambiguous, the resulting 3D scene might only loosely match what the user expected. Such variability, however, will be an asset in many cases, providing interesting and surprising interpretations. And when users want to control a depiction more precisely, they can adjust their language to better specify the exact meaning and graphical constraints they envision.

Figure 18: Some real 1st Grade homework, and a WordsEye "interpretation".

We believe that the low overhead of language-based scene generation systems will provide a natural and appealing way for everyday users to create imagery and express themselves.

WordsEye is currently a research project and is under active development. We expect that eventually this technology will evolve into something that can be applied to a wide variety of applications, including: first and second language instruction (see Figure 18); e-postcards (cf. www.bluemountain.com); visual chat; story illustrations; game applications; and specialized domains, such as cookbook instructions or product assembly.

In its current state, WordsEye is only a first step toward these goals. There are many areas where the capabilities of the system need to be improved, such as: improvements in the coverage and robustness of the natural language processing, including investigating corpus-based techniques for deriving linguistic and real-world knowledge; language input via automatic speech recognition rather than text; a larger inventory of objects, poses, depiction rules, and states of objects; mechanisms for depicting materials and textures; mechanisms for modifying geometric and surface properties of object parts (e.g., John has a long red nose); environments, activities, and common-sense knowledge about them; comic-style multiple frames for depicting sequences of activities in a text; and methods for handling physical simulations of skeletal dynamics, shape deformation and natural phenomena. Work is ongoing to improve WordsEye along these various lines. However, we feel that even in its present state, WordsEye represents a significant advance in a new approach to the graphical expression of a user's ideas.

6 Acknowledgements

We thank Michael Collins, Larry Stead, Adam Buchsbaum, and the people at IZware for support of some of the software used in WordsEye. We also thank audiences at the University of Pennsylvania and Bell Labs for feedback on earlier presentations of this material. Finally, we thank the anonymous reviewers for SIGGRAPH 2001 for useful comments.

References

[1] Proceedings of the Sixth Message Understanding Conference (MUC-6), San Mateo, CA, 1995. Morgan Kaufmann.

[2] G. Adorni, M. Di Manzo, and F. Giunchiglia. Natural Language Driven Image Generation. In COLING 84, pages 495-500, 1984.

[3] N. Badler, R. Bindiganavale, J. Allbeck, W. Schuler, L. Zhao, and M. Palmer. Parameterized Action Representation for Virtual Human Agents. In J. Cassell, J. Sullivan, S. Prevost, and E. Churchill, editors, Embodied Conversational Agents, pages 256-284. MIT Press, Cambridge, MA, 2000.

[4] R. Bindiganavale, W. Schuler, J. Allbeck, N. Badler, A. Joshi, and M. Palmer. Dynamically Altering Agent Behaviors Using Natural Language Instructions. In Autonomous Agents, pages 293-300, 2000.

[5] C. Brugman. The Story of Over. Master's thesis, University of California, Berkeley, Berkeley, CA, 1980.

[6] Y. Chang and A. P. Rockwood. A Generalized de Casteljau Approach to 3D Free-Form Deformation. In SIGGRAPH 94 Conference Proceedings, pages 257-260. ACM SIGGRAPH, Addison Wesley, 1994.

[7] K. Church. A Stochastic Parts Program and Noun Phrase Parser for Unrestricted Text. In Proceedings of the Second Conference on Applied Natural Language Processing, pages 136-143, Morristown, NJ, 1988. Association for Computational Linguistics.

[8] S. R. Clay and J. Wilhelms. Put: Language-Based Interactive Manipulation of Objects. IEEE Computer Graphics and Applications, pages 31-39, March 1996.

[9] M. Collins. Head-Driven Statistical Models for Natural Language Parsing. PhD thesis, University of Pennsylvania, Philadelphia, PA, 1999.

[10] C. Fellbaum, editor. WordNet: An Electronic Lexical Database. MIT Press, Cambridge, MA, 1998.

[11] C. Freksa, C. Habel, and K. F. Wender, editors. Spatial Cognition. Springer, Berlin, 1998.

[12] J. Funge, X. Tu, and D. Terzopoulos. Cognitive Modeling: Knowledge, Reasoning and Planning for Intelligent Characters. In SIGGRAPH 99 Conference Proceedings, pages 29-38. ACM SIGGRAPH, Addison Wesley, 1999.

[13] Sanda Harabagiu and Steven Maiorano. Knowledge-lean coreference resolution and its relation to textual cohesion and coherence. In Proceedings of the ACL-99 Workshop on the Relation of Discourse/Dialogue Structure and Reference, pages 29-38, College Park, MD, 1999. Association for Computational Linguistics.

[14] B. Hawkins. The Semantics of English Spatial Prepositions. PhD thesis, University of California, San Diego, San Diego, CA, 1984.

[15] A. Herskovitz. Language and Spatial Cognition: An Interdisciplinary Study of the Prepositions in English. Cambridge University Press, Cambridge, 1986.

[16] R. Hudson. Word Grammar. Blackwell, Oxford, 1984.

[17] R. Langacker. Foundations of Cognitive Grammar: Theoretical Prerequisites. Stanford University Press, Stanford, CA, 1987.

[18] B. Levin. English Verb Classes And Alternations: A Preliminary Investigation. University of Chicago Press, Chicago, IL, 1993.

[19] M. Marcus, B. Santorini, and M. A. Marcinkiewicz. Building a Large Annotated Corpus of English: the Penn Treebank. Computational Linguistics, 19(2):313-330, 1993.

[20] P. Olivier and K.-P. Gapp, editors. Representation and Processing of Spatial Prepositions. Lawrence Erlbaum Associates, Mahwah, NJ, 1998.

[21] C. W. Reynolds. Flocks, Herds and Schools: A Distributed Behavioral Model. In SIGGRAPH 87 Conference Proceedings, pages 25-34. ACM SIGGRAPH, Addison Wesley, 1987.

[22] G. Senft, editor. Referring to Space: Studies in Austronesian and Papuan Languages. Clarendon Press, Oxford, 1997.

[23] T. Winograd. Understanding Natural Language. PhD thesis, Massachusetts Institute of Technology, 1972.

[24] A. Yamada. Studies on Spatial Description Understanding based on Geometric Constraints Satisfaction. PhD thesis, Kyoto University, Kyoto, 1993.

[25] J. Zhao and N. Badler. Inverse Kinematics Positioning Using Nonlinear Programming for Highly Articulated Figures. ACM Transactions on Graphics, pages 313-336, October 1994.