A comprehensive framework for multimodal meaning representation
Ashwani Kumar, Laurent Romary
Laboratoire Loria, Vandœuvre-lès-Nancy
Overview - 1
Context: Conception phase of the EU IST/MIAMM project (Multidimensional Information Access using Multiple Modalities - with DFKI, TNO, Sony, Canon)
Study of the design factors for a future haptic PDA-like device
Underlying application: multidimensional access to a musical database
Overview - 2
Objectives: design and implementation of a unified representation language within the MIAMM demonstrator
• MMIL: Multimodal Interface Language
“Blind” application of (Bunt & Romary 2002)
Methodology
Basic components
Represent the general organization of any semantic structure
Parameterized by
• data categories taken from a common registry
• application-specific data categories
General mechanisms
To make the whole framework work
General categories
Descriptive categories available to all formats
+ strict conformance to existing standards
MIAMM - wheel mode
MIAMM architecture
[Architecture diagram. Recoverable labels: a Dialogue Manager grouping MultiModal Fusion (MMF), the Dialogue History, and the Action Planner (AP), with dependencies on the MiaDoMo database and the visual configuration; Visual-Haptic Processing (VisHapTac) handling haptic-visual interpretation and visualization/generation for the haptic device, its display, and the haptic processor; Language Generation producing sentences, scheduling information, and speech generation, rendered by speech synthesis to a speaker; a microphone (headset) feeding a continuous speech recognizer whose word/phoneme lattice and word/phoneme sequence go to Structural Analysis (SPIN) via speech analysis.]
Various processing steps - 1
Reco: provides word lattices; out of our scope (MPEG-7 word and phone lattice module)
SPIN: template-based (en, de) or TAG-based (fr) dependency structures; low-level semantic constructs
Various processing steps - 2
MMF (MultiModal Fusion)
Fully interpreted structures
Referential (MMILId) and temporal anchoring
Dialogue history update
AP (Action Planner)
Generates MIAMM internal actions
• Requests to MiaDoMo
• Actions to be generated (Language + VisHapTac)
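As a rough illustration of the referential-anchoring and history-update step above (names and structure are our own, not MIAMM code), the dialogue history can be sketched as a store keyed by MMILId:

```python
# Hypothetical sketch of a dialogue history keyed by MMILId, so that
# later turns can be referentially anchored to earlier events and
# participants. Field names are illustrative, not from the MIAMM spec.
history = {}

def anchor(mmil_id, entry):
    """Record an event or participant under its MMILId."""
    history[mmil_id] = entry

def resolve(mmil_id):
    """Look up an earlier referent by its MMILId, or None."""
    return history.get(mmil_id)

anchor("p2", {"objType": "tune", "refStatus": "pending"})
```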
Various processing steps - 3
VisHapTac: informs MMF of the current graphical and haptic configuration (hierarchies of objects, focus, selection)
MMIL must answer all these needs, but not all at the same time
Main characteristics of MMIL
Basic ontology
Events and participants (organized as hierarchies)
Restrictions on events and participants
Relations among these
Additional mechanisms
Temporal anchoring of events
Ranges and alternatives
Representation
Flat meta-model
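As a hedged sketch of what "flat meta-model" means in practice (our own names, not the MIAMM specification): events and participants sit in flat lists of typed nodes, and relations connect them by id rather than by nesting.

```python
from dataclasses import dataclass, field

# Illustrative sketch of a flat MMIL-style component: no node is
# nested inside another; structure comes only from id-based relations.

@dataclass
class Node:
    id: str
    features: dict  # restrictions, e.g. {"evtType": "speak"}

@dataclass
class Relation:
    source: str
    target: str
    type: str

@dataclass
class MMILComponent:
    events: list = field(default_factory=list)
    participants: list = field(default_factory=list)
    relations: list = field(default_factory=list)

    def related(self, rel_type):
        """All (source, target) pairs carrying a given relation type."""
        return [(r.source, r.target) for r in self.relations
                if r.type == rel_type]

comp = MMILComponent(
    events=[Node("e0", {"evtType": "speak"}),
            Node("e1", {"evtType": "play"})],
    participants=[Node("p1", {"objType": "user"})],
    relations=[Relation("e1", "e0", "propContent")],
)
```

Keeping the lists flat makes incremental updates cheap: a module can add a relation without rewriting any existing node.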
MMIL meta-model (UML)
[UML diagram. Recoverable structure: a generic Struct node carrying a LevelName attribute (NMTOKEN), instantiated at four levels (MMIL, Event, Time, Participant) and linked by dependency associations; multiplicities range over 0..*, 1..1 and 0..1.]
Meta-model DatCat Registry
[Diagram. Recoverable labels: a DatCat specification (a DCR subset plus application-dependent DatCats); interoperability conditions expressed in GMT; dialects defined by expansion trees plus DatCat styles and vocabularies; together these instantiate a semantic markup language such as MMIL.]
An overview of data categories
Underlying ontology for a variety of formats
Distinction between abstract definition and implementation (e.g. in XML)
Standardization objective: implementing a reference registry for NLP applications
Wider set of DatCats than just semantics
ISO 11179 (metadata registries) as a reference standard for implementing such a registry
DatCat example: Addressee
/Addressee/
Definition: the entity that is the intended hearer of a speech event. The scope of this data category is extended to cover any multimodal communication event (e.g. haptics and tactile).
Source: (implicit) an event whose evtType should be /Speak/
Target: a participant (user or system)
Styles and vocabularies
Style: design choice to implement a data category as an XML element, a database field, etc.
Vocabulary: the names to be provided for a given style
E.g. (for /Addressee/)
Style: element
Vocabulary: {“addressee”}
Note: multilingualism
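The style/vocabulary split can be sketched as a small rendering step (a sketch under our own naming assumptions, not MIAMM code): the abstract DatCat stays fixed while the style decides the XML serialization.

```python
# Illustrative table mapping abstract data categories to a concrete
# style ("element" or "attribute") and vocabulary name. The entries
# follow the slides; the function itself is our own sketch.
STYLES = {
    "/Addressee/":      {"style": "element",   "name": "addressee"},
    "/Starting point/": {"style": "attribute", "name": "startPoint"},
}

def render(datcat, value):
    """Serialize a DatCat value according to its declared style."""
    spec = STYLES[datcat]
    if spec["style"] == "element":
        return f"<{spec['name']}>{value}</{spec['name']}>"
    return f'{spec["name"]}="{value}"'  # attribute style

print(render("/Addressee/", "p1"))  # <addressee>p1</addressee>
```

Multilingualism then amounts to swapping the vocabulary table while leaving the abstract categories untouched.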
Time stamping
/Starting point/
• Def: indicates the beginning of the event
• Values: dateTime
• Anchor: time level
Style: attribute
Vocabulary: {“startPoint”}
Example:

```xml
<event id="e4">
  <evtType>yearPeriod</evtType>
  <lex>1991</lex>
  <tempSpan startPoint="1991-01-01T00:00:00"
            endPoint="1991-12-31T23:59:59"/>
</event>
```
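Since the values are plain XML Schema dateTime strings, downstream modules can recover real timestamps directly; a minimal sketch (our code, not part of MIAMM):

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Sketch: read a tempSpan back into datetimes and check the span
# runs forwards. The XML mirrors the yearPeriod example above.
doc = """<event id="e4">
  <evtType>yearPeriod</evtType>
  <tempSpan startPoint="1991-01-01T00:00:00"
            endPoint="1991-12-31T23:59:59"/>
</event>"""

span = ET.fromstring(doc).find("tempSpan")
start = datetime.fromisoformat(span.get("startPoint"))
end = datetime.fromisoformat(span.get("endPoint"))
assert end > start  # a well-formed span runs forwards
```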
Application: a family of formats
Openness: a requirement for MIAMM
Specific formats for the input and output of each module
Each format is defined within the same generic MMIL framework:
• Same meta-model for all
• Specific DatCat specification for each
The MIAMM family of formats
[Diagram. Recoverable labels: SPIN-O, MMF-O, AP-O, VisHapTac-O, MMF-I, MMIL+]
The specifications provide typing information for all these formats
SPIN-O example
“Spiel mir das Lied bitte vor” (Please play the song)
[Dependency graph. Recoverable structure: event e0 (evtType=speak, dialogueAct=request) with speaker p1; event e1 (evtType=play, lex=vorspielen) linked to e0 by propContent; participant p1 (objType=user) linked to e1 by destination; participant p2 (objType=tune, refType=definite, refStatus=pending) linked to e1 by object.]
```xml
<mmilComponent>
  <event id="e0">
    <evtType>speak</evtType>
    <dialogueAct>request</dialogueAct>
    <speaker target="p1"/>
  </event>
  <event id="e1">
    <evtType>play</evtType>
    <lex>vorspielen</lex>
  </event>
  <participant id="p1">
    <objType>user</objType>
  </participant>
  <participant id="p2">
    <objType>tune</objType>
    <refType>definite</refType>
    <refStatus>pending</refStatus>
  </participant>
  <relation source="e1" target="e0" type="propContent"/>
  <relation source="p1" target="e1" type="destination"/>
  <relation source="p2" target="e1" type="object"/>
</mmilComponent>
```
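Because relations carry all the structure, a consumer such as MMF only needs to collect the relation triples; a minimal sketch (our code, assuming the SPIN-O component above):

```python
import xml.etree.ElementTree as ET

# Sketch: extract the (source, type, target) relation triples that
# link events and participants in a SPIN-O component.
mmil = """<mmilComponent>
  <event id="e0"><evtType>speak</evtType></event>
  <event id="e1"><evtType>play</evtType></event>
  <participant id="p1"><objType>user</objType></participant>
  <participant id="p2"><objType>tune</objType></participant>
  <relation source="e1" target="e0" type="propContent"/>
  <relation source="p1" target="e1" type="destination"/>
  <relation source="p2" target="e1" type="object"/>
</mmilComponent>"""

root = ET.fromstring(mmil)
triples = [(r.get("source"), r.get("type"), r.get("target"))
           for r in root.iter("relation")]
# e.g. ("e1", "propContent", "e0")
```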
Reference domains and visual contexts
• The use of perceptual grouping, e.g. “these three objects”, “the triangle”, “the two circles” (shape icons omitted)
• The use of salience
VisHapTac-O
[Diagram. Recoverable structure: an event e0 (the visual-haptic state) linked by a description relation to a participant setting set1; set1 contains sub-divisions s1, s2, …, s25; set2 is further divided into s2-1, s2-2, s2-3, carrying inFocus and inSelection attention statuses.]
VisHapTac output - 1

```xml
<mmilComponent>
  <event id="e0">
    <evtType>HGState</evtType>
    <visMode>galaxy</visMode>
    <tempSpan startPoint="2000-01-20T14:12:06"
              endPoint="2002-01-20T14:12:13"/>
  </event>
  <participant id="set1">
    …
  </participant>
  <relation type="description" source="set1" target="e0"/>
</mmilComponent>
```
VisHapTac output - 2

```xml
<participant id="set1">
  …
  <participant id="s1">
    <Name>Let it be</Name>
  </participant>
  <participant id="set2">
    <individuation>set</individuation>
    <attentionStatus>inFocus</attentionStatus>
    <participant id="s2-1">
      <Name>Lady Madonna</Name>
    </participant>
    …
    <participant id="s2-3">
      <attentionStatus>inSelection</attentionStatus>
      <Name>Revolution 9</Name>
    </participant>
  </participant>
  …
</participant>
```
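Since the participant setting is a recursive hierarchy, extracting, say, the currently selected items is a simple tree walk; a sketch under our own assumptions (not MIAMM code), using a trimmed version of the setting above:

```python
import xml.etree.ElementTree as ET

# Sketch: recursively collect the names of participants whose
# attentionStatus is "inSelection" in a nested VisHapTac setting.
doc = """<participant id="set1">
  <participant id="s1"><Name>Let it be</Name></participant>
  <participant id="set2">
    <attentionStatus>inFocus</attentionStatus>
    <participant id="s2-3">
      <attentionStatus>inSelection</attentionStatus>
      <Name>Revolution 9</Name>
    </participant>
  </participant>
</participant>"""

def selected(node):
    """Depth-first search for in-selection participants."""
    hits = []
    if node.findtext("attentionStatus") == "inSelection":
        name = node.findtext("Name")
        if name:
            hits.append(name)
    for child in node.findall("participant"):
        hits.extend(selected(child))
    return hits

print(selected(ET.fromstring(doc)))  # ['Revolution 9']
```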
Conclusion
Most of the properties we wanted are fulfilled:
Uniformity, incrementality, partiality, openness and extensibility
Discussion point: semantic adequacy
• Not a direct input to an inference system (except for the underlying ontology)
• Semantics provided through the specification