IST project COMIC

Vision and Approach

Results first 1.5 years

http://www.hcrc.ed.ac.uk/comic/

Vision of COMIC

• Multimodal interaction will only be accepted by non-expert users if fundamental cognitive interaction capabilities of human beings are properly taken into account

Approach of COMIC

• Obtain fundamental knowledge on multimodal interaction – use of speech, pen, and facial expressions

Approach (2)

• Develop new approaches for component technologies that are guided by human factor experiments

Approach (3)

• Obtain hands-on experience by building an integrated multimodal demonstrator for bathroom design that combines new approaches for:
– Automatic speech recognition
– Automatic pen gesture recognition
– Fusion
– Dialogue and Action management
– Fission
– Output generation combining text and speech and facial expression
– System integration
– Cognitive knowledge

The partners of COMIC

• Max Planck Institute for Psycholinguistics – Fundamental Cognitive Research

• Max Planck Institute for Biological Cybernetics – Fundamental Cognitive Research

• University of Nijmegen – ASR and AGR

• University of Sheffield – Dialogue and Action

• University of Edinburgh – Fission and Output

• DFKI – Fusion and System Integration

• ViSoft – Graphical part of Demonstrator

This presentation

• Explanation of the demonstrator

• Results of fundamental cognitive research
– Multimodal Interaction
– Facial Expressions

• Results of Human Factor Experiments

The COMIC demonstrator

• Bathroom design for non-expert users

• Final goal is to implement 4 phases:

– 1: Input shape and dimensions of own bathroom (pen and speech input)

– 2: Choose position of sanitary ware (based on templates)

– 3: Conversational dialogue about types of sanitary ware and tiles

– 4: 3D view of negotiated bathroom

• The result is taken to an expert salesman, who will proceed from there.

The COMIC demonstrator

• Three versions
– T12: Proof of technical integration of all modules
– T24: Limited functionality – fixed bathroom/only tiles
– T36: Full functionality – own bathroom, sanitary ware, tiles

T12 Demonstrator

Fundamental Research on Human-Human Multimodal Interaction

The SLOT Research Platform

• Recording dyadic, natural interactions

• Route negotiation task with road maps

• Use of electronic pen/ink for drawing routes

• Elaborate and theory-free coding of data

• Systematically manipulating available modalities (drawing, visual contact)

Results: Quantitative analysis of turn-taking behavior

4x4 dyads; 6 hours annotated interaction

• Normally, there is no delay between people’s turns

• With one-way mirror, the “blind” person is slower to take up her turn

• This leads to longer silent periods (pauses)

• … which leads to significantly slower communication
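As an illustration of the kind of measurement behind these results, the sketch below computes the silent gap at each speaker change from time-stamped turns. The `Turn` record, the `turn_gaps` helper, and the sample timings are assumptions made for this example; they are not the SLOT annotation format or data.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Turn:
    speaker: str   # hypothetical speaker label, e.g. "A" or "B"
    start: float   # turn onset in seconds
    end: float     # turn offset in seconds


def turn_gaps(turns):
    """Return the gap in seconds at each change of speaker.

    A positive value is a silent pause, zero means no delay,
    and a negative value means the turns overlapped.
    """
    ordered = sorted(turns, key=lambda t: t.start)
    gaps = []
    for prev, nxt in zip(ordered, ordered[1:]):
        if prev.speaker != nxt.speaker:
            gaps.append(nxt.start - prev.end)
    return gaps


# Made-up timings, for illustration only.
turns = [
    Turn("A", 0.0, 2.0),
    Turn("B", 2.5, 4.0),
    Turn("A", 4.0, 5.0),
    Turn("A", 5.5, 6.0),
    Turn("B", 7.0, 8.0),
]
gaps = turn_gaps(turns)
mean_gap = mean(gaps)
```

Comparing the mean gap across conditions (for example with and without visual contact) is one simple way to quantify slower turn uptake.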

Possible relevance for HCI

In conversational HCI with talking head:

• User sees computer’s “face”

• User might assume that the computer sees his face

• Speech recognition has a hard time reliably detecting end-of-speech acoustically

Therefore we hypothesize that

• User will notice (even more) that the computer responds very slowly

Fundamental Research on Facial Expressions

• Faces do a lot in a conversation
– Lip motion for speaking
– Emotional Expression (pleasure, surprise, fear)
– Dialog flow (back-channeling: confusion, comprehension, agreement)
– Co-expression (emphasis and word/topic stress)

• Most work on Avatars focuses exclusively on lip motion for speech.

• We aim to broaden the capabilities of Avatars, allowing for more sophisticated self expression and more subtle dialog control.

• To this end, we use psychophysical knowledge and procedures as a basis for synthesizing human conversational expressions.

First step: Real expressions

• We recorded a variety of conversational expressions from several individuals. We then experimentally determined how identifiable and believable those expressions were.

• In general, we found that:
– The expressions were easily recognized -- even in the complete absence of conversation context! (and thus can be useful for back-channeling).
– The pattern of confusions between expressions indicates potential trouble areas (e.g., thinking was often mistaken for confusion!)
– These (“enacted”) expressions were not always recognized or found to be completely sincere (speech might help here).

Next step: What moves when?

We are now performing a fine-grained analysis of the necessary and sufficient components of conversational facial motion. What must move when to produce an identifiable and believable expression?

Relevance for HCI and eCommerce

• Psychophysical studies of real expressions offer strong insights into how one can produce identifiable, realistic, and believable conversational expressions.

• The expansion of Avatars’ expressive capabilities promises to improve the ease of use of HCI systems.

Human Factors Experiments guiding the technology

• University of Nijmegen investigated Input issues (ASR, AGR, Fusion)

• University of Edinburgh investigated output issues (Text, Graphics, and Face, Fission)

Human Factors Experiments

• Exploratory pilot studies
– Can users combine pen and speech for entering data about the layout of a room?
– Do they like it, what do they prefer?
– System-driven vs mixed-initiative dialogs
– Pen+speech data acquisition and analysis

HF Experiments input

• Task: study a blueprint and specify this using speech and/or pen

• Subjects had to specify position + lengths of walls, doors, windows, sanitary ware

• Experiment is directly related to phase 1 of demonstrator

HF Experiments main results

• Subjects prefer gestures and speech, or gestures only; speech only is not preferred

• Subjects show a large variation in behaviour even when restricted to narrowly defined tasks

• Subjects prefer mixed-initiative dialogue

• System-driven dialogue results in fewer errors, but requires more time

HF Experiments: speech

• Subjects use three types of speech comments
– Within task: “here is a wall with width 3 meter 40 …”
– Out-of-task, within dialogue: “now I am going to draw the next wall”
– Out-of-dialogue: “I hope I'm drawing this in the right way ..”

HF Experiments: pen

Large variation in

• Graphical symbols

• Deictic gestures

• Handwriting

Human Factors Experiments: Output

• Fission module: translates abstract dialogue acts into specifications for output channels

• Goal: model the choices made in the COMIC fission module after naturally-occurring interactions.

• Question: What are important natural actions in multimodal dialogue?
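To make the role of a fission module concrete, here is a minimal sketch of splitting one abstract dialogue act into per-channel output specifications. Everything in it is an assumption for illustration: the `DialogueAct` fields, the channel names, the act labels, and the placeholder rule for the face channel are hypothetical and do not describe the actual COMIC module.

```python
from dataclasses import dataclass


@dataclass
class DialogueAct:
    act_type: str   # hypothetical label, e.g. "describe" or "compare"
    objects: list   # ids of the options being talked about
    text: str       # content to be verbalised


def fission(act):
    """Translate one abstract dialogue act into specifications per output channel."""
    specs = {
        "speech": {"text": act.text},
        "graphics": {"highlight": list(act.objects)},
        "face": {"gaze_at": None},
    }
    # Placeholder rule (an assumption, not an empirical finding):
    # when comparing options, direct the talking head's gaze to the first one.
    if act.act_type == "compare" and act.objects:
        specs["face"]["gaze_at"] = act.objects[0]
    return specs


act = DialogueAct("compare", ["tile_1", "tile_2"], "These are slightly busier than those.")
specs = fission(act)
```

In a data-driven design, rules like the placeholder above would be replaced by choices modelled on annotated human behaviour, which is the goal stated on this slide.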

“Wizard of Oz” recordings

• Set up of the recordings:
– Subjects (native English speakers, not bathroom design experts) played the role of a bathroom sales consultant presenting a range of options to the client.
– Total recordings: 7 interactions; approximately 2.5 hours of video.

Making use of the recordings: Annotation

• Focus on scenes where the “consultant” says things similar to the planned system output
– In particular, descriptions and comparisons of options

• Mark up surface features like those under control of the fission module, and factors predicted to have an effect on those features

Making use of the recordings: Using the results

• Examine:
– Range of surface features, deictic gestures, prosody, facial expressions and gaze; both occurrence and timing
– Correlation between features and factors such as description vs comparison, first vs repeated mention, positive vs negative context

• Use these results in the development of the Fission module
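One simple way to look at the relation between an annotated factor and a surface feature is a co-occurrence count. The sketch below assumes each annotated scene is a dictionary; the key names (`kind`, `deictic_gesture`) and the sample records are invented for illustration and are not taken from the recordings.

```python
from collections import Counter


def cross_tabulate(annotations, factor, feature):
    """Count how often each (factor value, feature value) pair occurs."""
    counts = Counter()
    for scene in annotations:
        counts[(scene[factor], scene[feature])] += 1
    return counts


# Invented annotation records, for illustration only.
annotations = [
    {"kind": "description", "deictic_gesture": True},
    {"kind": "description", "deictic_gesture": False},
    {"kind": "comparison", "deictic_gesture": True},
    {"kind": "comparison", "deictic_gesture": True},
]
table = cross_tabulate(annotations, "kind", "deictic_gesture")
```

The resulting counts can then feed a standard association test or be inspected directly when deciding which factors the fission module should take into account.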

Sample comparison

“So they give you a degree of colour, they’re slight– they’re obviously slightly busier than looking at something like this, but, umm, they’re not quite as intense as having a whole block of colour, such as those two.”

Towards T24

• Presentation ViSoft
