The SmartKom Multimodal Corpus − Data Collection and End−to−End
Evaluation
Nicole Beringer
Institut für Phonetik und Sprachliche Kommunikation
LMU München
Where can the IPSK (LMU) be found within the project?
Data Collection, Evaluation, Annotation

[Diagram: feedback about user reactions and observed user behaviour flow from the modules into the implementation of problem-solving strategies, yielding an improved prototype.]
Responsibilities of the IPSK group in SmartKom - Overview:
- Data Collection
  - WOZ design
  - WOZ experiments
  - some useful results
- End-to-End Evaluation
  - Problems with Multimodality
  - Evaluation Framework
- Annotation
  - Transliteration of the audio data
  - Prosodic Annotation
  - Annotation of the gestures
  - Annotation of facial expression
  - Annotation of user states
[Diagram: Data Collection feeds Evaluation, User Modelling, and the MODULES, providing data for recognition.]
- WOZ System - Studio
- Recordings
- Annotation of audio, gesture, emotion
- Distribution
Responsibility Network
Data Collection
- Creation and publication of data for
  - the training of recognizers (speech, prosodic features, gesture, facial expression, emotion)
  - dialogue creation
  - generation of information (speech)
- Research
  - user modelling
  - evaluation (usability & technical evaluation)
- Software
Training of recognizers, user modelling

The BIG problem: how to persuade users of a nonexistent system just by simulation?

Wizard-of-Oz setup:
- different users
- instruction: "market research"
- 2 recordings (4.5 minutes each)
- recording of audio (different characteristics)
- recording of video (face, profile, display, gestures)
- interview
- Realistic prototype
  - created by partners & LMU
  - influence on development
  - playback of atmosphere
  - creation of the studio
- Reliability
  - quality of speech output
  - experiment design
  - WOZ system with technical defects
- Evocation of behaviour (trial and error, gestures, emotion)
  - instruction
  - provoking of different behaviour (new gestures, anger, new input facilities)
  - design of the display
- Few associations to existing systems
  - dialogue with an intelligent machine, no ordinary input facilities
- good preparation
- intensive training of the wizards
- the system makes mistakes

Perception of the SmartKom system:
- the system is a machine
- the system is a person
- the system is something in between

"That's a telephone box, I wouldn't expect to talk to a human. I do not have illusions!"
Reliability: the fraud should not be noticed.
Only few associations to existing systems are allowed - hence the simulation of a personal assistant:
- an existing dialogue partner
- the assistant has "personality"
- the assistant leads through the dialogue and makes proposals
[Chart: "Polite Users" - proportion of subjects (scale 0 to 1) who used polite expressions, greetings, thanks, and "sorry".]
[Chart: positive aspects (counts, scale 0 to 20) - verbal interaction with the assistant works well; individual applications or pages; positive rating of the Persona; fast; a good idea overall; clearly laid out; practical; fun to use; multimodality; other.]

- verbal interaction!
- multimodality is only noticed by a few users
- too slow
- too few possibilities
- more help needed
- the Persona is not often criticized!
[Chart: negative aspects (counts, scale 0 to 20) - criticism of the speech output; too slow; too limited scope; too little support; criticism of the speech input; not good overall; street noise is disturbing; criticism of the Persona; gesture input not good; display.]
What characterizes a comfortable system?
- easy operation
- speech recognition
- hardware/equipment
- display layout
- speed
- range of services
- multimodality
- synthesis
- other
SmartKom WOZ Recordings and Processing of the Data at the LMU

WOZ recordings (captured streams):
- coordinates of the graphics tablet
- DV video, front
- DV video, side view
- beamer output
- SIVIT stream
- 11 audio streams

Processing steps:
- cutting
- transliteration (TRL)
- preparation of the gesture label stream
- holistic user-state labeling (USH)
- prosodic user-state labeling (TRP)
- gesture labeling (GES)
- user-state labeling of facial expressions (USM)
- delivery of files to the DFKI server
- recording of DVDs
Annotation of emotions
- the system is simulated
- subjects are recorded (audio and video)
- 4.5 min interaction - e.g. "find a movie for this evening"
- emotions are partly provoked by the wizards

[Photos: subjects during a recording - front view and side view.]
- Orthographic annotation
- Marking of repetitions, hesitations, noise, speech disfluencies etc.

w001_pkw_003_SMA: <Geräusch> @1hier @1sehen <:<#> Sie:> <:<#> eine:> Übersicht über das Programm der ~Heidelberger Kinos .
w001_pkd_004_AAA: mhm [PA] [B3 cont] . <Geräusch> oh<Z> [B2] , ~F<Z>ight+Club<ROT> <!1 Flight-Club> [NA] [B2] , ~Das+fünfte+Element<Z><ROT> [NA] [B2] , ~Drum%<ROT> , ~Jakob+der+Lügner<ROT> [NA] [B3 cont] . <A> äh<OOT> [PA] [B2] , ich würde gerne [NA] ~Aimee+_ <äh> _und+Jaguar [PA] sehen [B3 fall] . <Geräusch> wo [PA] wird das gespielt<Z> [NA] [B3 rise] ? <PP>
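The annotation layer in such a line can be stripped off to recover plain orthography. A minimal sketch; the regexes cover only the marker types visible in this excerpt (noise/comment and in-word tags, boundary and accent labels, emphasis digits, proper-name joins, truncation and elongation marks), not the full Transliterationskonventionen:

```python
import re

def strip_trl_markers(utterance: str) -> str:
    """Reduce a SmartKom-style transliteration to plain orthography.

    Applies only to the utterance part (not to file IDs, which contain
    underscores) and only to the markers seen in the sample above.
    """
    utterance = re.sub(r"<[^<>]*>", "", utterance)       # tags, incl. in-word <Z>/<ROT>
    utterance = re.sub(r"\[[^\[\]]*\]", " ", utterance)  # boundary/accent labels
    utterance = re.sub(r"@\d", "", utterance)            # emphasis markers
    utterance = utterance.replace("~", "").replace("+", " ")  # proper-name joins
    utterance = re.sub(r"[%_]", "", utterance)           # truncation/elongation
    return re.sub(r"\s+", " ", utterance).strip()

print(strip_trl_markers("~Das+fünfte+Element<Z><ROT> [NA] [B2] ,"))
# -> "Das fünfte Element ,"
```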
- Annotation of gestures in 3 categories:
  - interactional gestures: pointing (long & short), free gestures
  - supporting gestures: reading, searching, counting
  - residual gestures: emotional gestures, not identifiable gestures

[Images: example labels "I-Point (short -)" and "R-Emotional (+ cubus)".]
- 3 steps:
  - prosodic annotation: audio only, formal labelling system
  - holistic labelling: facial expression, audio, context
  - facial expression: labelling without audio
- Holistic labeling includes context information which is not relevant for the facial expression recognizer.
  - Therefore we included a "facial expression only" labeling step (no audio).
- For the analysis of the prosody, the speech had to be labeled; the functional approach did not seem to work with speech.
  - Therefore we adopted a formal coding step for the prosody that was used in Verbmobil (Fischer, 1999).
  - The holistic and the formal step for the speech can be combined to obtain ecologically valid data.
- Categories for the prosody step:
  - pauses between phrases
  - pauses between words
  - pauses between syllables
  - irregular length of syllables
  - emphasized words
  - strongly emphasized words
  - clearly articulated words
  - hyperarticulated words
  - words overlapped by laughing
- Labeling with defined subjective categories:
  - "anger/irritation"
  - "joy/gratification (being successful)"
  - "helplessness"
  - "pondering/reflecting"
  - "surprise"
  - "neutral"
  - "unidentifiable episode"
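The two closed label inventories above lend themselves to simple validation of annotation files. A minimal sketch; the identifier spellings are illustrative, not the official SmartKom tag names:

```python
# Closed label vocabularies for the two annotation passes described above.
# Spellings are illustrative; the official SmartKom tags differ.
PROSODY_LABELS = {
    "pause_between_phrases", "pause_between_words", "pause_between_syllables",
    "irregular_syllable_length", "emphasized", "strongly_emphasized",
    "clearly_articulated", "hyperarticulated", "laughing_overlap",
}
USER_STATE_LABELS = {
    "anger/irritation", "joy/gratification", "helplessness",
    "pondering/reflecting", "surprise", "neutral", "unidentifiable_episode",
}

def unknown_labels(labels, vocabulary):
    """Return every label that is not part of the closed vocabulary."""
    return set(labels) - vocabulary

print(unknown_labels(["surprise", "boredom"], USER_STATE_LABELS))  # {'boredom'}
```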
Conclusion (WOZ)
- WOZ: realistic data for man-machine interaction
- training of recognizers
- observation of user behaviour
- the WOZ technique is time consuming and expensive
- BUT: results from user observations and questionnaires can influence the development of the system early on
Website: http://www.smartkom.org/
http://www.phonetik.uni-muenchen.de/Forschung/Publications/index.html

- Corpus overview: Schiel, F. et al. (2002): Integration of multi-modal data and annotations into a simple extendable form: the extension of the BAS Partitur Format. LREC Conference.
- Steininger, S. et al. (2002b): User-State Labeling Procedures For The Multimodal Data Collection Of SmartKom. LREC Conference.
- Beringer, N. (2001): Evoking Gestures in SmartKom - Design of the Graphical User Interface. Gesture Workshop 2001, London, UK. To appear in: Springer "Gesture Workshop 2001, London".
- Labeling of gestures: Steininger, S. et al. (2001): Labeling of Gestures in SmartKom - The Coding System. Gesture Workshop 2001, London.
- Transliteration: Oppermann, D. et al.: Transliterationskonventionen.
General Criteria of Dialogue System Evaluation (End-to-End Evaluation)
- "The performance of the evaluation is very often driven by the characteristics of the system that has to be judged" [Andenfilger-97].
- An evaluation framework must abstract from the system itself and from different dialogue strategies.
- It must combine the developers' and the users' needs as well as the constraints on the evaluation of multimodal systems in general.
- It must combine objective and subjective evaluation criteria.
PARADISE: Paradigm for Dialogue Systems Evaluation
- comparison of dialogue strategies
- direct comparison with other dialogue systems
- comparison of usability and objectively measurable results
- generalization and normalization over measures
- standardization of the evaluation of successful transactions via Attribute Value Matrices (AVMs)
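PARADISE measures transaction success by comparing the attribute values a dialogue arrives at against an Attribute Value Matrix key. A minimal exact-match sketch (PARADISE itself uses the kappa statistic; the attribute names here are invented for illustration):

```python
def avm_match_rate(key: dict, observed: dict) -> float:
    """Fraction of AVM attributes whose observed value equals the key value."""
    hits = sum(1 for attribute, value in key.items()
               if observed.get(attribute) == value)
    return hits / len(key)

# Hypothetical cinema-information task
key = {"movie": "Fight Club", "time": "20:00", "cinema": "Heidelberg"}
observed = {"movie": "Fight Club", "time": "20:00", "cinema": None}
print(avm_match_rate(key, observed))  # 2 of 3 attributes correct
```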
Evaluation frameworks for unimodal dialogue systems - problems:
- Usability
  - What about multimodal systems?
  - separation of user satisfaction and dialogue complexity
  - unique scales
- Objective measures
  - multimodal costs
  - higher-dimensional AVMs
  - no static definitions exist for the "keys" needed to compute an AVM
Problems with Spoken Dialogue Evaluation Frameworks in Multimodal Dialogue Environments
- How to score multimodal inputs or outputs?
- How to score the use of multimodal technologies?
- How to weight the several multimodal components of recognition systems?
- How to evaluate different scenarios?
- How to define an optimal dialogue?
- How to evaluate uncompleted tasks?
- How to deal with bad performance due to user uncooperativity?
Usability
- multimodal evaluation criteria
- questionnaire adapted to cost functions
- user satisfaction is compiled separately
- standardization of questions
- user satisfaction ranges from -3 to +3
Objective Evaluation Measures
- optimal dialogues depend on the system processing
- the length of the dialogue is defined by the user
- weighting of quality and quantity measures and task success via the correlation between user satisfaction and the objective measure
- definition of multimodal costs
- definition of a bipolar function τ for the compilation of task success via biunique information clusters
- integration of uncompleted tasks: τ(j) = -1 : task failure
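The bipolar function τ can be written out directly: each biunique information cluster j either succeeds (+1) or fails (-1), with uncompleted tasks counted as failures. A minimal sketch (representing a cluster's outcome as a boolean is an assumption):

```python
def tau(cluster_completed: bool) -> int:
    """Bipolar task success for one biunique information cluster j:
    +1 on success, -1 on failure; uncompleted tasks count as failure."""
    return 1 if cluster_completed else -1

def tau_bar(clusters) -> float:
    """Mean of tau over all information clusters of a dialogue."""
    return sum(tau(c) for c in clusters) / len(clusters)

print(tau_bar([True, True, False]))  # two successes, one failure
```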
Definition of Weights

Quality and quantity measures                                | Usability question
-------------------------------------------------------------|-------------------------------------------------------
Transaction success, task complexity                         | The task was easy to solve
Misunderstanding of input, offtalk                           | SmartKom has understood my input
Misunderstanding of output                                   | SmartKom can easily be understood
Semantic/syntactic correctness, incremental compatibility    | SmartKom has answered properly in most cases
Mean system response time, mean user response time           | The speed of the system was acceptable in each situation
Timeout                                                      | I always knew what to say
Accuracy of gesture recognition                              | The gestural input was successful
Accuracy of ASR                                              | The speech input was successful
Quality and quantity measures                                | Usability question
-------------------------------------------------------------|-------------------------------------------------------
Dialogue complexity                                          | SmartKom worked as assumed; reacted quickly to my input; is easy to handle
Percentage of appropriate/inappropriate system directive diagnostic utterances | SmartKom offered an adequate amount of high-quality information
Percentage of explicit recovery answers                      | SmartKom is easy to handle
Repetitions, no. of ambiguities, diagnostic error messages, rejections | SmartKom needs input only once to successfully complete a task
Timeout, help-analyzer                                       | SmartKom offers adequate help
Quality and quantity measures                                | Usability question
-------------------------------------------------------------|-------------------------------------------------------
Output complexity (display)                                  | The display is clearly designed
Mean elapsed time, task completion time, dialogue elapsed time | SmartKom reacted fast to my input
Duration of speech input, duration of ASR                    | SmartKom reacted fast to speech input
Duration of gestural input, duration of gesture recognition  | SmartKom reacted fast to gestural input
Barge-in, cancel                                             | SmartKom allows interrupts
Dialogue complexity                                          | Was the task difficult?
Gesture turns                                                | input via graphical display
Ways of interaction, display turns                           | output via graphical display
Quality and quantity measures                                | Usability question
-------------------------------------------------------------|-------------------------------------------------------
Speech input                                                 | speech input
Speech synthesis (synchronicity)                             | speech output
N-way communication, ways of interaction, error rate of questions, input complexity | Possibility to interact in a quasi-human way with SmartKom
Recognition/duration of facial expression, prosodic features | SmartKom reacted to my emotional state
Synchronicity, graphical output (turns)                      | How do you score the competence of the agent?
Cooperativity                                                | Were the actions of the Persona natural?
Gestural input                                               | gestural input
Information Clusters
- Extract different superordinate concepts depending on the task at hand.
- Example: EPG (electronic program guide)
  - "City of Angels" (assumption: unique day, time, channel) => one piece of information needed
  - a movie today at 8 p.m. on SAT1 (channel) => three pieces of information needed
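The EPG example can be phrased as choosing the information cluster a query needs before it is biunique, i.e. identifies exactly one broadcast. A hedged sketch; the slot names and the uniqueness flag are illustrative, not part of the SmartKom implementation:

```python
def information_cluster(title_is_unique: bool) -> set:
    """Informations that must be resolved before an EPG request is biunique.

    A unique title suffices on its own; otherwise day, time and channel
    are all needed (slot names are illustrative)."""
    if title_is_unique:
        return {"title"}                # "City of Angels" -> 1 information
    return {"day", "time", "channel"}   # "movie today 8 p.m. on SAT1" -> 3

print(len(information_cluster(True)), len(information_cluster(False)))  # 1 3
```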
User Uncooperativity
- "Smartakus, do the dishes!"
- in other frameworks, such task failures are attributed to the system
- only dialogues with cooperative users are evaluated using empirical methods
- only dialogues which terminate with finished tasks are evaluated
How to score multimodal inputs or outputs?
- multimodal cost functions "no. of multiple input" and "ways of interaction"
- weighting of recognition scores via a defined user satisfaction score
How to evaluate different scenarios?
- intra-scenario: normalization over tasks
- inter-scenario: three systems
- possibility to compute the performance over the three scenarios after all evaluation periods
Performance = α · τ̄ − Σ_{i=1..n} ω_i · N(c_i)

where
  j    = biunique information cluster
  τ(j) = +1 : task success; τ(j) = −1 : task failure
  c_i  = cost function i
  α    = correlation between user satisfaction and the mean value τ̄
  ω_i  = correlation between user satisfaction and the normalized costs
  N(x) = (x − x̄) / σ_x
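Under these definitions the performance score can be computed directly. A sketch under stated assumptions: α and the ω_i are taken as given (in PROMISE they are correlations with user satisfaction computed over all dialogues), costs are z-normalized per cost function across dialogues, and the example numbers are invented:

```python
import statistics

def z_normalize(values):
    """N(x) = (x - x_bar) / sigma_x across the evaluated dialogues."""
    x_bar = statistics.mean(values)
    sigma = statistics.pstdev(values) or 1.0  # guard: constant cost column
    return [(x - x_bar) / sigma for x in values]

def promise_performance(alpha, tau_bar, omegas, normalized_costs):
    """Performance = alpha * tau_bar - sum_i omega_i * N(c_i)."""
    return alpha * tau_bar - sum(w * c for w, c in zip(omegas, normalized_costs))

# Invented example: two cost functions observed over three dialogues
cost_columns = [[10.0, 14.0, 12.0], [3.0, 5.0, 4.0]]        # c_1, c_2
first_dialogue_costs = [z_normalize(col)[0] for col in cost_columns]
score = promise_performance(alpha=0.6, tau_bar=0.5,
                            omegas=[0.3, 0.2],
                            normalized_costs=first_dialogue_costs)
print(round(score, 3))  # 0.912
```

Note that low costs normalize to negative N(c_i), so with positive weights a cheap dialogue raises the score, as the formula intends.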
Conclusion (Evaluation)
- PROMISE offers an overall evaluation result integrating cost functions and user satisfaction
- PROMISE can deal with multimodality
- PROMISE is independent of task definitions (static or dynamic tasks)
- Beringer, N. et al. (2002): End-to-End Evaluation of Multimodal Dialogue Systems - Can We Transfer Established Methods? Proc. of the Third International Conference on Language Resources and Evaluation. Las Palmas, Gran Canaria, Spain.
- Beringer, N. et al. (2002): PROMISE: A Procedure for Multimodal Interactive System Evaluation. Proceedings of the Workshop 'Multimodal Resources and Multimodal Systems Evaluation' 2002, Las Palmas, Gran Canaria, Spain, pp. 77-80.
- Beringer, N. et al. (2002): How to Relate User Satisfaction and System Performance in Multimodal Dialogue Situations - A Graphical Approach. Proceedings of the International CLASS Workshop on Natural, Intelligent and Effective Interaction in Multimodal Dialogue Systems, Copenhagen, Denmark, 28-29 June 2002, pp. 8-14.