CONFUCIUS: An Intelligent MultiMedia Storytelling Interpretation and Presentation System Minhua Eunice Ma Supervisor: Prof. Paul Mc Kevitt School of Computing and Intelligent Systems Faculty of Engineering University of Ulster, Magee


Page 1:

CONFUCIUS: An Intelligent MultiMedia Storytelling

Interpretation and Presentation System

Minhua Eunice Ma
Supervisor: Prof. Paul Mc Kevitt

School of Computing and Intelligent Systems
Faculty of Engineering

University of Ulster, Magee

Page 2:

Faculty Research Student Conference

Jordanstown, 15 Jan 2004

Outline

Related research
Overview of CONFUCIUS
Automatic generation of 3D animation
Semantic representation
Natural language processing
Current state of implementation
Relation to other work
Conclusion & Future work

Page 3:

Related research

3D visualisation
  Virtual humans & embodied agents: Jack, Improv, BEAT
  MultiModal interactive storytelling: AesopWorld, KidsRoom, Larsen & Petersen's Interactive Storytelling, computer games
  Automatic Text-to-Graphics Systems: WordsEye, CD-based language animation

Related research in NLP
  Lexical semantics: Levin's verb classes, Jackendoff's Lexical Conceptual Structure, Schank's scripts

Page 4:

Objectives of CONFUCIUS

To interpret natural language sentences/stories and to extract conceptual semantics from the natural language

To generate 3D animation and virtual worlds automatically from natural language

To integrate 3D animation with speech and non-speech audio, to form an intelligent multimedia storytelling system

[Diagram: a storywriter/playwright provides a story in natural language or a movie/drama script, entered via a tailored menu for script input; CONFUCIUS outputs 3D animation, speech (dialogue) and non-speech audio, presented to the user/story listener.]

Page 5:

Architecture of CONFUCIUS

[Architecture diagram: natural language stories and script writer input pass through a script parser and the Natural Language Processing module, which uses language knowledge (an LCS lexicon, grammar and semantic representations). Semantic representations are mapped onto visual knowledge: a 3D graphic library and a knowledge base of prefabricated objects, built with 3D authoring tools and existing 3D and character models. The Text To Speech, sound effects and animation generation modules feed a synchronizing & fusion stage, which outputs a 3D world with audio in VRML.]
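The data flow in the diagram can be illustrated with a minimal, self-contained sketch. All class and method names below (StorytellingSketch, interpret, generateAnimationVRML, etc.) are hypothetical stand-ins, not the actual CONFUCIUS modules.

// Minimal sketch of the CONFUCIUS data flow described above.
// Class and method names are illustrative assumptions, not the actual implementation.
import java.util.List;

public class StorytellingSketch {

    /** Stand-in for the semantic representation produced by the NLP module. */
    static class Semantics {
        final String event;          // e.g. "push"
        final List<String> roles;    // e.g. ["John", "door"]
        Semantics(String event, List<String> roles) { this.event = event; this.roles = roles; }
    }

    // Script parsing + NLP (CONFUCIUS uses the Connexor FDG parser and WordNet here).
    static Semantics interpret(String sentence) {
        // Placeholder: a real implementation would extract an LVSR structure.
        return new Semantics("push", List.of("John", "door"));
    }

    // Animation generation: map the semantics onto prefabricated 3D models and keyframes.
    static String generateAnimationVRML(Semantics s) {
        return "# VRML fragment animating '" + s.event + "' with roles " + s.roles;
    }

    // Synchronizing & fusion: combine animation, speech and sound into one VRML world.
    public static void main(String[] args) {
        Semantics sem = interpret("John pushed the door.");
        String world = "#VRML V2.0 utf8\n" + generateAnimationVRML(sem);
        System.out.println(world);
    }
}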

Page 6:

Software & Standards

Java
  parsing semantic representations
  changing VRML code to add/modify animation
  integrating modules

Natural language processing tools
  Connexor Machinese FDG parser (morphological and syntactic parsing)
  WordNet (lexicon, semantic inference)

3D graphic modelling
  Existing 3D models (virtual humans/objects) on the Internet
  Authoring tools
    Humanoid characters: Character Studio
    Props & stage: 3D Studio Max
    Narrator: Microsoft Agent

Modelling languages & standards
  VRML 97 for modelling the geometry of objects, props and the environment
  H-Anim specification for humanoid modelling

Page 7:

Agents and Avatars: how much autonomy?

[Diagram: virtual humans (autonomous agents, interface agents, characters in non-interactive storytelling, avatars) placed along an autonomy & intelligence axis from high to low.]

Autonomous agents have higher requirements for sensing, memory, reasoning, planning, behaviour control & emotion (a sense-emotion-control-action structure).
"User-controlled" avatars require fewer autonomous actions; basic naïve physics such as collision detection and reaction is still required.
Virtual characters in non-interactive storytelling lie between agents and avatars: their behaviours, emotions and responses to the changing environment are described in the story input.

Page 8:

Graphics library

[Diagram: the graphics library stores objects/props as simple geometry files, characters as geometry & joint hierarchy files (H-Anim), and motions as an animation library of keyframes; these are instantiated into the generated 3D world.]
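A minimal sketch of how such a library could be organised; the class, method and field names below (GraphicsLibrary, instantiate, etc.) are illustrative assumptions, not the actual CONFUCIUS code.

// Illustrative sketch of the graphics library layout described above (names are assumptions).
import java.util.HashMap;
import java.util.Map;

public class GraphicsLibrary {
    private final Map<String, String> props = new HashMap<>();       // objects/props: simple geometry files
    private final Map<String, String> characters = new HashMap<>();  // characters: geometry & joint hierarchy files (H-Anim)
    private final Map<String, String> motions = new HashMap<>();     // motions: animation library (keyframes)

    public void addProp(String name, String vrmlFile)       { props.put(name, vrmlFile); }
    public void addCharacter(String name, String hanimFile) { characters.put(name, hanimFile); }
    public void addMotion(String name, String keyframeFile) { motions.put(name, keyframeFile); }

    /** Instantiation: reference a stored character model in the scene and note which keyframes to apply. */
    public String instantiate(String characterName, String motionName) {
        return "Inline { url \"" + characters.get(characterName) + "\" }\n"
             + "# apply keyframes from " + motions.get(motionName) + "\n";
    }
}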

Page 9:

Level of Articulation (LOA) of H-Anim

[Figure: joints and segments of LOA1]

CONFUCIUS adopts LOA1 for human animation.
The animation engine adds ROUTEs dynamically, based on H-Anim's joints & animation keyframes.
CONFUCIUS' human animation can be adapted for other LOAs.

[Figure: example Site nodes on the hands, used for pushing and holding objects.]
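A sketch of how an animation engine might emit a keyframe interpolator and its ROUTEs for one H-Anim joint. The joint DEF name follows H-Anim convention; the surrounding class, the "clock" TimeSensor name and the method signature are illustrative assumptions.

// Illustrative sketch: generate a VRML OrientationInterpolator and ROUTEs for one H-Anim joint.
// Assumes a TimeSensor has been defined elsewhere as: DEF clock TimeSensor { ... }
public class RouteGenerator {

    /** Emit keyframed rotation for a single joint, driven by the clock TimeSensor. */
    public static String animateJoint(String jointName, float[] keys, String[] keyValues) {
        StringBuilder vrml = new StringBuilder();
        String interp = jointName + "_rotInterp";

        vrml.append("DEF ").append(interp).append(" OrientationInterpolator {\n  key [");
        for (float k : keys) vrml.append(' ').append(k);
        vrml.append(" ]\n  keyValue [");
        for (String v : keyValues) vrml.append(' ').append(v).append(',');
        vrml.append(" ]\n}\n");

        // Wire the clock to the interpolator and the interpolator to the joint's rotation field.
        vrml.append("ROUTE clock.fraction_changed TO ").append(interp).append(".set_fraction\n");
        vrml.append("ROUTE ").append(interp).append(".value_changed TO ")
            .append(jointName).append(".set_rotation\n");
        return vrml.toString();
    }

    public static void main(String[] args) {
        System.out.println(animateJoint("hanim_l_shoulder",
                new float[] {0f, 0.5f, 1f},
                new String[] {"1 0 0 0", "1 0 0 1.2", "1 0 0 0"}));
    }
}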

Page 10:

Semantic representations

Knowledge representations and their typical applications (decompositional vs. non-decompositional):

Non-decompositional
  rule-based representations: expert systems
  FOPC (First Order Predicate Calculus): sentence representation, expert systems
  semantic networks: lexical semantics
  frame-based representations: general knowledge representation & reasoning
  XML-based representations: multimodal semantics

Decompositional
  Schank's scripts: story understanding
  Conceptual Dependency (CD), event-logic truth conditions, x-schema and f-structure
  Lexical-Conceptual Structure (LCS): physical knowledge representation & reasoning (inc. spatial/temporal reasoning)
  Lexical Visual Semantic Representation (LVSR): dynamic vision (movement) recognition & generation

Page 11:

Lexical Visual Semantic Representation

LVSR: a semantic representation between language syntax and 3D models
LVSR is based on Jackendoff's LCS, adapted to the task of language visualisation (enhanced with Schank's scripts)

Ontological categories: OBJ, HUMAN, EVENT, STATE, PLACE, PATH, PROPERTY
  OBJ: props/places (e.g. buildings)
  HUMAN: human beings/other articulated animated characters (e.g. animals), as long as their skeleton hierarchy is defined
  EVENT: actions, movements and manners
  STATE: static existence
  PROPERTY: attributes of OBJ/HUMAN
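As a rough illustration, the ontological categories and an LCS-style predicate-argument structure could be encoded as below; this is a hypothetical encoding for illustration, not the actual LVSR data structures in CONFUCIUS.

// Illustrative encoding of LVSR ontological categories and expressions (assumed, not actual).
import java.util.Arrays;
import java.util.List;

public class LVSRSketch {

    enum Category { OBJ, HUMAN, EVENT, STATE, PLACE, PATH, PROPERTY }

    /** A typed predicate with arguments, e.g. EVENT go(HUMAN John, PATH to(PLACE in(OBJ house))). */
    static class Expr {
        final Category category;
        final String predicate;
        final List<Expr> args;
        Expr(Category category, String predicate, Expr... args) {
            this.category = category;
            this.predicate = predicate;
            this.args = Arrays.asList(args);
        }
        @Override public String toString() {
            return category + " " + predicate + (args.isEmpty() ? "" : args.toString());
        }
    }

    public static void main(String[] args) {
        // "John went into the house."
        Expr expr = new Expr(Category.EVENT, "go",
                new Expr(Category.HUMAN, "John"),
                new Expr(Category.PATH, "to",
                        new Expr(Category.PLACE, "in",
                                new Expr(Category.OBJ, "house"))));
        System.out.println(expr);
    }
}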

Page 12:

PATH & PLACE predicates

PATH predicates (direction and termination features):

  predicate   direction  termination
  to              1          1
  from            0          1
  toward          1          0
  away_from       0          0
  via            n/a         0
  across         n/a        n/a
  along          n/a        n/a

PLACE predicates (contact/attach feature):

  predicate     contact/attach
  at            unmarked
  behind        <-contact>
  end_of        n/a
  in            unmarked
  in_front_of   <-contact>
  near          <-contact>
  on            <+contact>
  out           unmarked
  over          <-contact>
  top_of        n/a
  under         unmarked

PATH and PLACE predicates interpret the spatial movement of OBJs/HUMANs: 62 common English prepositions are mapped to 7 PATH predicates & 11 PLACE predicates.
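The feature table above could be captured directly in code. The enums below are an illustrative encoding of the table, not the CONFUCIUS source.

// Illustrative encoding of the PATH/PLACE feature table (assumed representation).
public class SpatialPredicates {

    /** Direction/termination feature values: 1, 0, or n/a. */
    enum Bin { ONE, ZERO, NA }

    /** Contact/attach feature values. */
    enum Contact { PLUS, MINUS, UNMARKED, NA }

    enum Path {
        TO(Bin.ONE, Bin.ONE), FROM(Bin.ZERO, Bin.ONE),
        TOWARD(Bin.ONE, Bin.ZERO), AWAY_FROM(Bin.ZERO, Bin.ZERO),
        VIA(Bin.NA, Bin.ZERO), ACROSS(Bin.NA, Bin.NA), ALONG(Bin.NA, Bin.NA);

        final Bin direction;     // oriented toward (1) or away from (0) the reference object
        final Bin termination;   // whether the path terminates at the reference object
        Path(Bin direction, Bin termination) { this.direction = direction; this.termination = termination; }
    }

    enum Place {
        AT(Contact.UNMARKED), BEHIND(Contact.MINUS), END_OF(Contact.NA), IN(Contact.UNMARKED),
        IN_FRONT_OF(Contact.MINUS), NEAR(Contact.MINUS), ON(Contact.PLUS),
        OUT(Contact.UNMARKED), OVER(Contact.MINUS), TOP_OF(Contact.NA), UNDER(Contact.UNMARKED);

        final Contact contact;   // <+contact>, <-contact>, unmarked, or n/a
        Place(Contact contact) { this.contact = contact; }
    }
}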

Page 13:

NLP in CONFUCIUS

[Diagram: NLP components in CONFUCIUS. Pre-processing feeds the Connexor FDG parser (morphological parsing, part-of-speech tagging, syntactic parsing and FEATURES); WordNet and an LCS database support semantic inference, coreference resolution, disambiguation and temporal reasoning (lexical and post-lexical temporal relations).]

Page 14:

Visual valency & verb ontology

2.2.1. Human action verbs
  2.2.1.1. One visual valency (the role is a human; (partial) movement)
    2.2.1.1.1. Biped kinematics: arm actions (wave, scratch), leg actions (walk, jump, kick), torso actions (bow), combined actions (climb)
    2.2.1.1.2. Facial expressions & lip movement, e.g. laugh, fear, say, sing, order
  2.2.1.2. Two visual valencies (at least one role is a human)
    2.2.1.2.1. One human and one object (vt. or vi. + instrument), e.g. throw, push, kick, open, eat, drink, bake, trolley
    2.2.1.2.2. Two humans, e.g. fight, chase, guide
  2.2.1.3. Visual valency ≥ 3 (at least one role is a human)
    2.2.1.3.1. Two humans and one object (inc. ditransitive verbs), e.g. give, show
    2.2.1.3.2. One human and 2+ objects (vt. + object + implicit instrument/goal/theme), e.g. cut, write, butter, pocket, dig, cook
  2.2.1.4. Verbs without distinct visualisation when out of context: verbs of trying, helping, letting, creating/destroying
  2.2.1.5. High-level behaviours (routine events), political and social activities, e.g. interview, eat out (go to a restaurant), go shopping
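A toy lookup of visual valency for a few of the verbs above; the mapping is hand-picked for illustration and is not the full CONFUCIUS verb ontology.

// Toy illustration of classifying verbs by visual valency (hand-picked examples, assumed values).
import java.util.Map;

public class VisualValency {

    // Number of visual roles (humans/objects) the animation of the verb requires.
    static final Map<String, Integer> VALENCY = Map.of(
            "wave", 1, "walk", 1, "laugh", 1,   // one valency: a single human's (partial) movement
            "push", 2, "eat", 2, "chase", 2,    // two valencies: human + object, or two humans
            "give", 3, "cut", 3);               // three or more: two humans + object, or implicit instrument

    public static void main(String[] args) {
        System.out.println("push -> " + VALENCY.get("push") + " visual valencies");
    }
}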

Page 15:

Level-of-Detail (LOD): basic-level verbs & troponyms

[Diagram: verb hierarchy under EVENT. Event-level verbs (go, run, cause, ...); manner-level verbs (walk, climb, jump); troponym-level verbs (limp, stride, swagger, trot, skip, bounce, hop, jog, romp).]

Page 16:

Current status of implementation

Collision detection example (contact verbs: hit, collide, scratch, touch): "The car collided with a wall."
  using ParallelGraphics' VRML extension for object-to-object collision
  non-speech sound effects

H-Anim examples:
  3 visual valency verbs: "John put a cup of coffee on the table."
    H-Anim Site nodes and locative tags on objects (an on_table tag for the table object)
  2 visual valency verbs: "John pushed the door." "John ate the bread." "Nancy sat on the chair."
  1 visual valency verbs: "The waiter came to me: 'Can I help you, Sir?'"
    speech modality & lip synchronization
    camera direction (avatar's point-of-view)

Page 17:

Relation to other work

Domain-independent, general-purpose humanoid character animation

CONFUCIUS' character animation focuses on the language-to-humanoid-animation process rather than on human modelling & motion alone

An implementable semantic representation: LVSR connects linguistic semantics to visual semantics and is suitable for action execution (animation)

Categorization and visualisation of eventive verbs based on visual valency

A reusable common-sense knowledge base to elicit implied actions, instruments, goals and themes underspecified in the language input

Page 18:

Conclusion & Future work

Prospective applications
  Children's education
  Multimedia presentation
  Movie/drama production
  Computer games
  Virtual Reality

CONFUCIUS' humanoid animation explores problems in language visualisation & automatic animation production
  Formalises the meaning of action verbs and spatial prepositions
  Maps language primitives to visual primitives
  Provides a reusable common-sense knowledge base for other systems

Further work
  Discourse-level interpretation
  Action composition for simultaneous activities
  Verbs concerning multiple characters' synchronization & coordination (e.g. introduce)