The SmartKom Multimodal Corpus − Data Collection and End−to−End
Evaluation
Nicole Beringer
Institut für Phonetik und Sprachliche Kommunikation
LMU München
Where can the IPSK (LMU) be found within the project?
Data Collection, Evaluation, Annotation

[Diagram: feedback about user reactions and observed user behaviour flow from the modules into the implementation of problem-solving strategies, yielding an improved prototype.]
Responsibilities of the IPSK group in SmartKom - Overview:
- Data Collection
  - WOZ design
  - WOZ experiments
  - some useful results
- End-to-End Evaluation
  - Problems with Multimodality
  - Evaluation Framework
- Annotation
  - Transliteration of the audio data
  - Prosodic Annotation
  - Annotation of the gestures
  - Annotation of facial expression
  - Annotation of user states
[Diagram: Data Collection feeds Evaluation, User Modelling, and the MODULES, providing data for recognition.]
- WOZ System - Studio
- Recordings
- Annotation of audio, gesture, emotion
- Distribution
Responsibility Network
Data Collection
- Creation and publication of data for
  - the training of recognizers (speech, prosodic features, gesture, facial expression, emotion)
  - dialogue creation
  - generation of information (speech)
- Research
  - user modelling
  - evaluation (usability & technical evaluation)
- Software
Training of recognizers, user modelling

The BIG problem: how to persuade users of a nonexistent system just by simulation?

Wizard-of-Oz setup:
- different users
- instruction: "market research"
- 2 recordings (4.5 minutes each)
- recording of audio (different characteristics)
- recording of video (face, profile, display, gestures)
- interview
- Realistic prototype
  - created by partners & LMU
  - influence on development
  - playback of atmosphere
  - creation of the studio
- Reliability
  - quality of speech output
  - experiment design
  - WOZ system with technical defects
- Evocation of behaviour (trial and error, gestures, emotion)
  - instruction
  - provoking of different behaviour (new gestures, anger, new input facilities)
  - design of the display
- Few associations to existing systems
  - dialogue with an intelligent machine, no ordinary input facilities
- good preparation
- intensive training of the wizards
- the system makes mistakes

Perception of the SmartKom system:
- the system is a machine
- the system is a person
- the system is something in between

"That's a telephone box, I wouldn't expect to talk to a human. I do not have illusions!"
Reliability: the fraud should not be noticed.
Only few associations to existing systems are allowed - hence the simulation of a personal assistant:
- an existing dialogue partner
- the assistant has "personality"
- the assistant leads through the dialogue and makes proposals
[Chart: "Polite Users" - proportion of subjects (scale 0 to 1) who used polite expressions, greetings, thanks, and "sorry".]
[Chart: positive aspects (counts, scale 0 to 20) - verbal interaction with the assistant works well; individual applications or pages; positive rating of the Persona; fast; a good idea overall; clearly laid out; practical; fun to use; multimodality; other.]

- verbal interaction!
- multimodality is only noticed by a few users
- too slow
- too few possibilities
- more help needed
- the Persona is not often criticized!
[Chart: negative aspects (counts, scale 0 to 20) - criticism of the speech output; too slow; too limited scope; too little support; criticism of the speech input; not good overall; street noise is disturbing; criticism of the Persona; gesture input not good; display.]
What characterizes a comfortable system?
- easy operation
- speech recognition
- hardware/equipment
- display layout
- speed
- range of services
- multimodality
- synthesis
- other
SmartKom WOZ Recordings and Processing of the Data at the LMU

WOZ recordings (captured streams):
- coordinates of the graphics tablet
- DV video, front
- DV video, side view
- beamer output
- SIVIT stream
- 11 audio streams

Processing steps:
- cutting
- transliteration (TRL)
- preparation of the gesture label stream
- holistic user-state labeling (USH)
- prosodic user-state labeling (TRP)
- gesture labeling (GES)
- user-state labeling of facial expressions (USM)
- delivery of files to the DFKI server
- recording of DVDs
Annotation of emotions
- the system is simulated
- subjects are recorded (audio and video)
- 4.5 min interaction - e.g. "find a movie for this evening"
- emotions are partly provoked by the wizards

[Photos: subjects during a recording - front view and side view.]
- Orthographic annotation
- Marking of repetitions, hesitations, noise, speech disfluencies etc.

w001_pkw_003_SMA: <Geräusch> @1hier @1sehen <:<#> Sie:> <:<#> eine:> Übersicht über das Programm der ~Heidelberger Kinos .
w001_pkd_004_AAA: mhm [PA] [B3 cont] . <Geräusch> oh<Z> [B2] , ~F<Z>ight+Club<ROT> <!1 Flight-Club> [NA] [B2] , ~Das+fünfte+Element<Z><ROT> [NA] [B2] , ~Drum%<ROT> , ~Jakob+der+Lügner<ROT> [NA] [B3 cont] . <A> äh<OOT> [PA] [B2] , ich würde gerne [NA] ~Aimee+_ <äh> _und+Jaguar [PA] sehen [B3 fall] . <Geräusch> wo [PA] wird das gespielt<Z> [NA] [B3 rise] ? <PP>
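The annotation layer in such a line can be stripped off to recover plain orthography. A minimal sketch; the regexes cover only the marker types visible in this excerpt (noise/comment and in-word tags, boundary and accent labels, emphasis digits, proper-name joins, truncation and elongation marks), not the full Transliterationskonventionen:

```python
import re

def strip_trl_markers(utterance: str) -> str:
    """Reduce a SmartKom-style transliteration to plain orthography.

    Applies only to the utterance part (not to file IDs, which contain
    underscores) and only to the markers seen in the sample above.
    """
    utterance = re.sub(r"<[^<>]*>", "", utterance)       # tags, incl. in-word <Z>/<ROT>
    utterance = re.sub(r"\[[^\[\]]*\]", " ", utterance)  # boundary/accent labels
    utterance = re.sub(r"@\d", "", utterance)            # emphasis markers
    utterance = utterance.replace("~", "").replace("+", " ")  # proper-name joins
    utterance = re.sub(r"[%_]", "", utterance)           # truncation/elongation
    return re.sub(r"\s+", " ", utterance).strip()

print(strip_trl_markers("~Das+fünfte+Element<Z><ROT> [NA] [B2] ,"))
# -> "Das fünfte Element ,"
```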
- Annotation of gestures in 3 categories:
  - interactional gestures: pointing (long & short), free gestures
  - supporting gestures: reading, searching, counting
  - residual gestures: emotional gestures, not identifiable gestures

[Images: example labels "I-Point (short -)" and "R-Emotional (+ cubus)".]
- 3 steps:
  - prosodic annotation: audio only, formal labelling system
  - holistic labelling: facial expression, audio, context
  - facial expression: labelling without audio
- Holistic labeling includes context information which is not relevant for the facial expression recognizer.
  - Therefore we included a "facial expression only" labeling step (no audio).
- For the analysis of the prosody, the speech had to be labeled; the functional approach did not seem to work with speech.
  - Therefore we adopted a formal coding step for the prosody that was used in Verbmobil (Fischer, 1999).
  - The holistic and the formal step for the speech can be combined to obtain ecologically valid data.
- Categories for the prosody step:
  - pauses between phrases
  - pauses between words
  - pauses between syllables
  - irregular length of syllables
  - emphasized words
  - strongly emphasized words
  - clearly articulated words
  - hyperarticulated words
  - words overlapped by laughing
- Labeling with defined subjective categories:
  - "anger/irritation"
  - "joy/gratification (being successful)"
  - "helplessness"
  - "pondering/reflecting"
  - "surprise"
  - "neutral"
  - "unidentifiable episode"
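The two closed label inventories above lend themselves to simple validation of annotation files. A minimal sketch; the identifier spellings are illustrative, not the official SmartKom tag names:

```python
# Closed label vocabularies for the two annotation passes described above.
# Spellings are illustrative; the official SmartKom tags differ.
PROSODY_LABELS = {
    "pause_between_phrases", "pause_between_words", "pause_between_syllables",
    "irregular_syllable_length", "emphasized", "strongly_emphasized",
    "clearly_articulated", "hyperarticulated", "laughing_overlap",
}
USER_STATE_LABELS = {
    "anger/irritation", "joy/gratification", "helplessness",
    "pondering/reflecting", "surprise", "neutral", "unidentifiable_episode",
}

def unknown_labels(labels, vocabulary):
    """Return every label that is not part of the closed vocabulary."""
    return set(labels) - vocabulary

print(unknown_labels(["surprise", "boredom"], USER_STATE_LABELS))  # {'boredom'}
```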
Conclusion (WOZ)
- WOZ: realistic data for man-machine interaction
- training of recognizers
- observation of user behaviour
- the WOZ technique is time consuming and expensive
- BUT: results from user observations and questionnaires can influence the development of the system early on
Website: http://www.smartkom.org/
http://www.phonetik.uni-muenchen.de/Forschung/Publications/index.html

- Corpus overview: Schiel, F. et al. (2002): Integration of multi-modal data and annotations into a simple extendable form: the extension of the BAS Partitur Format. LREC Conference.
- Steininger, S. et al. (2002b): User-State Labeling Procedures For The Multimodal Data Collection Of SmartKom. LREC Conference.
- Beringer, N. (2001): Evoking Gestures in SmartKom - Design of the Graphical User Interface. Gesture Workshop 2001, London, UK. To appear in: Springer "Gesture Workshop 2001, London".
- Labeling of gestures: Steininger, S. et al. (2001): Labeling of Gestures in SmartKom - The Coding System. Gesture Workshop 2001, London.
- Transliteration: Oppermann, D. et al.: Transliterationskonventionen.
General Criteria of Dialogue System Evaluation (End-to-End Evaluation)
- "The performance of the evaluation is very often driven by the characteristics of the system that has to be judged" [Andenfilger-97].
- An evaluation framework must abstract from the system itself and from different dialogue strategies.
- It must combine the developers' and the users' needs as well as the constraints on the evaluation of multimodal systems in general.
- It must combine objective and subjective evaluation criteria.
PARADISE: Paradigm for Dialogue Systems Evaluation
- comparison of dialogue strategies
- direct comparison with other dialogue systems
- comparison of usability and objectively measurable results
- generalization and normalization over measures
- standardization of the evaluation of successful transactions via Attribute Value Matrices (AVMs)
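PARADISE measures transaction success by comparing the attribute values a dialogue arrives at against an Attribute Value Matrix key. A minimal exact-match sketch (PARADISE itself uses the kappa statistic; the attribute names here are invented for illustration):

```python
def avm_match_rate(key: dict, observed: dict) -> float:
    """Fraction of AVM attributes whose observed value equals the key value."""
    hits = sum(1 for attribute, value in key.items()
               if observed.get(attribute) == value)
    return hits / len(key)

# Hypothetical cinema-information task
key = {"movie": "Fight Club", "time": "20:00", "cinema": "Heidelberg"}
observed = {"movie": "Fight Club", "time": "20:00", "cinema": None}
print(avm_match_rate(key, observed))  # 2 of 3 attributes correct
```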
Evaluation frameworks for unimodal dialogue systems - problems:
- Usability
  - What about multimodal systems?
  - separation of user satisfaction and dialogue complexity
  - unique scales
- Objective measures
  - multimodal costs
  - higher-dimensional AVMs
  - no static definitions exist for the "keys" needed to compute an AVM
Problems with Spoken Dialogue Evaluation Frameworks in Multimodal Dialogue Environments
- How to score multimodal inputs or outputs?
- How to score the use of multimodal technologies?
- How to weight the several multimodal components of recognition systems?
- How to evaluate different scenarios?
- How to define an optimal dialogue?
- How to evaluate uncompleted tasks?
- How to deal with bad performance due to user uncooperativity?
Usability
- multimodal evaluation criteria
- questionnaire adapted to cost functions
- user satisfaction is compiled separately
- standardization of questions
- user satisfaction ranges from -3 to +3
Objective Evaluation Measures
- optimal dialogues depend on the system processing
- the length of the dialogue is defined by the user
- weighting of quality and quantity measures and task success via the correlation between user satisfaction and the objective measure
- definition of multimodal costs
- definition of a bipolar function τ for the compilation of task success via biunique information clusters
- integration of uncompleted tasks: τ(j) = -1 : task failure
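The bipolar function τ can be written out directly: each biunique information cluster j either succeeds (+1) or fails (-1), with uncompleted tasks counted as failures. A minimal sketch (representing a cluster's outcome as a boolean is an assumption):

```python
def tau(cluster_completed: bool) -> int:
    """Bipolar task success for one biunique information cluster j:
    +1 on success, -1 on failure; uncompleted tasks count as failure."""
    return 1 if cluster_completed else -1

def tau_bar(clusters) -> float:
    """Mean of tau over all information clusters of a dialogue."""
    return sum(tau(c) for c in clusters) / len(clusters)

print(tau_bar([True, True, False]))  # two successes, one failure
```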
Definition of Weights

Quality and quantity measures                                | Usability question
-------------------------------------------------------------|-------------------------------------------------------
Transaction success, task complexity                         | The task was easy to solve
Misunderstanding of input, offtalk                           | SmartKom has understood my input
Misunderstanding of output                                   | SmartKom can easily be understood
Semantic/syntactic correctness, incremental compatibility    | SmartKom has answered properly in most cases
Mean system response time, mean user response time           | The speed of the system was acceptable in each situation
Timeout                                                      | I always knew what to say
Accuracy of gesture recognition                              | The gestural input was successful
Accuracy of ASR                                              | The speech input was successful
Quality and quantity measures                                | Usability question
-------------------------------------------------------------|-------------------------------------------------------
Dialogue complexity                                          | SmartKom worked as assumed; reacted quickly to my input; is easy to handle
Percentage of appropriate/inappropriate system directive diagnostic utterances | SmartKom offered an adequate amount of high-quality information
Percentage of explicit recovery answers                      | SmartKom is easy to handle
Repetitions, no. of ambiguities, diagnostic error messages, rejections | SmartKom needs input only once to successfully complete a task
Timeout, help-analyzer                                       | SmartKom offers adequate help
Quality and quantity measures                                | Usability question
-------------------------------------------------------------|-------------------------------------------------------
Output complexity (display)                                  | The display is clearly designed
Mean elapsed time, task completion time, dialogue elapsed time | SmartKom reacted fast to my input
Duration of speech input, duration of ASR                    | SmartKom reacted fast to speech input
Duration of gestural input, duration of gesture recognition  | SmartKom reacted fast to gestural input
Barge-in, cancel                                             | SmartKom allows interrupts
Dialogue complexity                                          | Was the task difficult?
Gesture turns                                                | input via graphical display
Ways of interaction, display turns                           | output via graphical display
Quality and quantity measures                                | Usability question
-------------------------------------------------------------|-------------------------------------------------------
Speech input                                                 | speech input
Speech synthesis (synchronicity)                             | speech output
N-way communication, ways of interaction, error rate of questions, input complexity | Possibility to interact in a quasi-human way with SmartKom
Recognition/duration of facial expression, prosodic features | SmartKom reacted to my emotional state
Synchronicity, graphical output (turns)                      | How do you score the competence of the agent?
Cooperativity                                                | Were the actions of the Persona natural?
Gestural input                                               | gestural input
Information Clusters
- Extract different superordinate concepts depending on the task at hand.
- Example: EPG (electronic program guide)
  - "City of Angels" (assumption: unique day, time, channel) => one piece of information needed
  - a movie today at 8 p.m. on SAT1 (channel) => three pieces of information needed
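The EPG example can be phrased as choosing the information cluster a query needs before it is biunique, i.e. identifies exactly one broadcast. A hedged sketch; the slot names and the uniqueness flag are illustrative, not part of the SmartKom implementation:

```python
def information_cluster(title_is_unique: bool) -> set:
    """Informations that must be resolved before an EPG request is biunique.

    A unique title suffices on its own; otherwise day, time and channel
    are all needed (slot names are illustrative)."""
    if title_is_unique:
        return {"title"}                # "City of Angels" -> 1 information
    return {"day", "time", "channel"}   # "movie today 8 p.m. on SAT1" -> 3

print(len(information_cluster(True)), len(information_cluster(False)))  # 1 3
```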
User Uncooperativity
- "Smartakus, do the dishes!"
- in other frameworks, such task failures are attributed to the system
- only dialogues with cooperative users are evaluated using empirical methods
- only dialogues which terminate with finished tasks are evaluated
How to score multimodal inputs or outputs?
- multimodal cost functions "no. of multiple input" and "ways of interaction"
- weighting of recognition scores via a defined user satisfaction score
How to evaluate different scenarios?
- intra-scenario: normalization over tasks
- inter-scenario: three systems
- possibility to compute the performance over the three scenarios after all evaluation periods
Performance = α · τ̄ − Σ_{i=1..n} ω_i · N(c_i)

where
  j    = biunique information cluster
  τ(j) = +1 : task success; τ(j) = −1 : task failure
  c_i  = cost function i
  α    = correlation between user satisfaction and the mean value τ̄
  ω_i  = correlation between user satisfaction and the normalized costs
  N(x) = (x − x̄) / σ_x
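Under these definitions the performance score can be computed directly. A sketch under stated assumptions: α and the ω_i are taken as given (in PROMISE they are correlations with user satisfaction computed over all dialogues), costs are z-normalized per cost function across dialogues, and the example numbers are invented:

```python
import statistics

def z_normalize(values):
    """N(x) = (x - x_bar) / sigma_x across the evaluated dialogues."""
    x_bar = statistics.mean(values)
    sigma = statistics.pstdev(values) or 1.0  # guard: constant cost column
    return [(x - x_bar) / sigma for x in values]

def promise_performance(alpha, tau_bar, omegas, normalized_costs):
    """Performance = alpha * tau_bar - sum_i omega_i * N(c_i)."""
    return alpha * tau_bar - sum(w * c for w, c in zip(omegas, normalized_costs))

# Invented example: two cost functions observed over three dialogues
cost_columns = [[10.0, 14.0, 12.0], [3.0, 5.0, 4.0]]        # c_1, c_2
first_dialogue_costs = [z_normalize(col)[0] for col in cost_columns]
score = promise_performance(alpha=0.6, tau_bar=0.5,
                            omegas=[0.3, 0.2],
                            normalized_costs=first_dialogue_costs)
print(round(score, 3))  # 0.912
```

Note that low costs normalize to negative N(c_i), so with positive weights a cheap dialogue raises the score, as the formula intends.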
Conclusion (Evaluation)
- PROMISE offers an overall evaluation result integrating cost functions and user satisfaction
- PROMISE can deal with multimodality
- PROMISE is independent of task definitions (static or dynamic tasks)
- Beringer, N. et al. (2002): End-to-End Evaluation of Multimodal Dialogue Systems - Can We Transfer Established Methods? Proc. of the Third International Conference on Language Resources and Evaluation. Las Palmas, Gran Canaria, Spain.
- Beringer, N. et al. (2002): PROMISE: A Procedure for Multimodal Interactive System Evaluation. Proceedings of the Workshop 'Multimodal Resources and Multimodal Systems Evaluation' 2002, Las Palmas, Gran Canaria, Spain, pp. 77-80.
- Beringer, N. et al. (2002): How to Relate User Satisfaction and System Performance in Multimodal Dialogue Situations - A Graphical Approach. Proceedings of the International CLASS Workshop on Natural, Intelligent and Effective Interaction in Multimodal Dialogue Systems, Copenhagen, Denmark, 28-29 June 2002, pp. 8-14.