ICMI 2008 Abstracts


    Organized by the:

    ACM

    Telecommunication Systems Institute

    Technical University of Crete

    http://www.icmi2008.org/


    INDUSTRIAL SPONSORS


    Message from the Chairs

On behalf of everyone involved in the organization of ICMI 2008 and the ACM, we would like to welcome all participants to the 10th International Conference on Multimodal Interfaces. ICMI is a truly interdisciplinary conference attracting contributions from both the computer science and engineering communities. Following the example of recent successful ICMI conferences, we have tried to put together an exciting technical program with special sessions, keynotes and demos in both emerging and established research areas of multimodal processing and interaction.

This year the ICMI program includes 44 regular papers, 4 special session papers, 2 keynote presentations, 1 tutorial presentation and 7 demo presentations. The acceptance rate for regular papers was 48%; all papers received at least three reviews. The technical program is organized in five oral sessions, two poster sessions, one demo session and one panel discussion over three days. In addition, the program contains one special session in the exciting emerging area of social signal processing. The program includes keynote addresses in the areas of multimodal interaction and multimedia processing by two distinguished researchers, Phil Cohen and George Drettakis. For more information, see the detailed technical program that follows.

ICMI takes place this year in an especially beautiful setting, in a region rich in natural beauty, history, memories and culture. We hope that the setting will inspire creativity and brainstorming among participants. While at the conference, don't forget to visit the old city of Chania with its famed Venetian harbour and narrow colourful streets. If you are able to spend extra time in Chania, consider exploring the natural beauty of the region by taking excursions to the Samaria Gorge, Falasarna, Elafonisi or Gramvousa.

We would like to express our sincere thanks to the members of the organizing committee for a great collaboration, the scientific review committee for the high-quality review process, and the staff at ACM for their help. Many thanks go to the staff at the Telecommunication Systems Institute and the Technical University of Crete, as well as the student volunteers, all of whom made the staging of this event possible. Finally, we wish to offer our sincere thanks to our sponsors: Nokia Research, Microsoft Research and Dialogos Inc.

All in all, we hope that you will find the 10th International Conference on Multimodal Interfaces a rewarding and fulfilling experience at both the scientific and personal level.

    Alexandros Potamianos

    Vassilis Digalakis

    Matthew Turk

    ICMI 2008 General Chairs


    ICMI 2008 Organizing Committee

    General Co-Chairs

Vassilis Digalakis, TU Crete, Greece

Alex Potamianos, TU Crete, Greece

    Matthew Turk, UC Santa Barbara, USA

    Program Co-Chairs

    Roberto Pieraccini, SpeechCycle, USA

    Yuri Ivanov, MERL Research, USA

    Workshop & Special Sessions

    Fabio Pianesi, FBK-irst, Italy

    Aikaterini Mania, TU Crete, Greece

    Finance Chair

    Rainer Stiefelhagen, U. Karlsruhe, Germany

    Publicity Chair

    Rana el Kaliouby, MIT Media Lab, USA

    Demo Chair

    Joakim Gustafson, KTH, Sweden

    Industrial Liaison

    Gerasimos Potamianos, IBM Research, USA

    Local Arrangements

Lida Dimitropoulou, TU Crete, Greece

Maria Koutrouli, Telecommunication Systems Institute, Greece

    Poster & Web

    Ilias Iosif, TU Crete, Greece


    ICMI 2008 Program Committee

    Nikolaos Bourbakis, Wright Univ., USA

Herve Bourlard, IDIAP, Switzerland

Noelle Carbonell, LORIA, France

    Phil Cohen, Adapx, USA

    James Crowley, INRIA, France

    Trevor Darrell, MIT, USA

    Giuseppe Di Fabbrizio, AT&T Research, USA

    Rana el Kaliouby, MIT Media Lab, USA

Masataka Goto, AIST, Japan

    Evandro Gouvea, MERL, USA

    Joakim Gustafson, KTH, Sweden

Norihiro Hagita, ATR, Japan

    Ilias Iosif, Tech. Univ. of Crete, Greece

    Michael Johnston, AT&T Research, USA

    Aikaterini Mania, Tech. Univ. of Crete, Greece

    Kenji Mase, Nagoya Univ., Japan

Michihiko Minoh, Kyoto Univ., Japan

    Louis-Philippe Morency, Univ. of Southern California, USA

    Sharon Oviatt, Adapx, USA

    Maja Pantic, Imperial College, UK

    Alex Pentland, MIT Media Lab, USA

    Manolis Perakakis, Tech. Univ. of Crete, Greece

    Fabio Pianesi, FBK-IRST, Italy

    Gerasimos Potamianos, IBM Research, USA

Bhiksha Raj, MERL, USA

Gerhard Sagerer, Bielefeld Univ., Germany

Yoichi Sato, Univ. of Tokyo, Japan

    Paris Smaragdis, MIT Media Lab, USA

    Rainer Stiefelhagen, Karlsruhe Univ., Germany

    Kazuya Takeda, Nagoya Univ., Japan

    Haruo Takemura, Osaka Univ., Japan

Alessandro Vinciarelli, IDIAP, Switzerland

    Andy Wilson, Microsoft Research, USA

    Chris Wren, MERL, USA

    Jie Yang, CMU, USA

    Shunsuke Yoshida, ATR, Japan

    Massimo Zancanaro, FBK-IRST, Italy

    ICMI Advisory Board

    Matthew Turk, Chair, UCSB, USA

    Jim Crowley, INRIA, France

    Trevor Darrell, MIT, USA

    Kenji Mase, Nagoya Univ., Japan

    Eric Horvitz, Microsoft Research, USA

    Sharon Oviatt, Adapx, USA

    Fabio Pianesi, FBK-irst, Italy

    Wolfgang Wahlster, DFKI, Germany

    Jie Yang, Carnegie Mellon Univ., USA


    ICMI 2008 Reviewers

    Aguilar, Joaquin; Anthony, Lisa; Araki, Masahiro; Arita, Daisaku; Ariu, Masahide;

Ashimura, Kazuyuki; Ba, Sileye; Banerjee, Satanjeev; Barakonyi, Istvan; Beinhauer,

Wolfgang; Berglund, Aseel; Bernardin, Keni; Bickmore, Timothy; Biswas, Pradipta; Bocchieri, Enrico; Bohus, Dan; Brewster, Stephen; Camurri, Antonio; Caputo, Barbara;

    Carletta, Jean; Casas, Josep R.; Caskey, Sasha; Castellano, Ginevra; Cesar, Pablo; Checka,

    Neal; Chignell, Mark; Choumane, Ali; Christmann, Olivier; Curin, Jan; Dakopoulos,

    Dimitrios; Dannenberg, Roger; Dielmann, Alfred; Dietz, Paul; Dines, John; Dixon, Simon;

    Dong, Wen; Edlund, Jens; Ezzat, Tony; Forlines, Clifton; Fraser, Mike; French, Brian; Fujie,

    Shinya; Fujinaga, Ichiro; Fukuchi, Kentaro; Glas, Dylan; Graziola, Ilenia; Gupta, Abhinav;

Haciahmetoglu, Yonca; Hakkani-Tur, Dilek; Hataoka, Nobuo; Heylen, Dirk;

    Hoggan, Eve; Hudson, Scott; Huerta, Juan; Hung, Hayley; Janin, Adam; Juba, Derek;

    Kalgaonkar, Kaustubh; Kameda, Yoshinari; Kanda, Takayuki; Kapoor, Ashish; Karahalios,

    Karrie; Karargyris, Alexandros; Karpouzis, Kostas; Kato, Hirokazu; Katsurada, Kouichi;

Khalidov, Vasil; Lee, Minkyung; Lepri, Bruno; Libal, Vit; Liu, Yang; Luo, Lu;

    Mana, Nadia; Marcheret, Etienne; Masaaki, Iiyama; McGookin, David; Metze, Florian;

    Miyashita, Takahiro; Morimoto, Carlos; Mourkoussis, Nick; Myers, Brad; Nickel, Kai;

    Nijholt, Anton; Noma, Haruo; Obrenovic, Zeljko; Obrist, Marianna; Oliver, Nuria; Pampalk,

    Elias; Pantelopoulos, Alexandros; Patel, Shwetak; Pfleger, Norbert; Pnevmatikakis,

    Aristodemos; Portillo Rodriguez, Otniel; Radhakrishnan, Regunathan; Raducanu, Bogdan;

    Renals, Steve; Ricci, Elisa; Rienks, Rutger; Rivera, Fiona; Robinson, Peter; Saito, Hideo;

    Schuller, Bjoern; Schultz, Tanja; Sezgin, Metin; Shiomi, Masahiro; Siracusa, Michael;

    Slaney, Malcolm; Starner, Thad; Stent, Amanda; Tan, Hong; Terken, Jacques; Thomaz,

    Andrea; Tian, Tai-peng; Tsukada, Koji; Turaga, Pavan; Tzanetakis, George; Vanacken, Lode;

    Vatavu, Radu-Daniel; Weinberg, Garrett; Wilson, Kevin; Yasumura, Michiaki; Yeh, Tom;

    Zhang, Cha; Zudilova-Seinstra, Elena


    General Information

    Accommodation

    ICMI 2008 will be hosted at the Panorama Hotel located 5 km outside of the city of Chania.

Panorama is a luxurious beach hotel built at a beautiful location offering magnificent all-round views of the bay of Chania and the historic islet of Thodorou. For late reservations or

    accommodation issues please contact the registration desk.

    Weather

    The morphology of the landscape and the location of Crete in the center of the Mediterranean

    have a direct effect on the climate of the prefecture of Chania which is characterized as

    temperate Mediterranean and particularly dry with sunlight 70% of the year. Winter is mild

and the climate from November to March is characterized as cold, but not frosty, with frequent showers. In October it rarely rains, the weather is still warm and mild, and a dip in the sea is

    still a pleasant one. May and September are usually sunny, but not excessively warm. The

    summer however is quite hot and arid with June and July being the hottest months of the year

    and without rainfall. For more information check out the Hellenic National Meteorological

    Service online at http://www.hnms.gr/hnms/english/

    Currency

    The Euro (EUR) is the currency of Greece. Most ATMs accept Visa and MasterCard (you'll

    pay interest on cash withdrawals) as well as debit cards of internationally recognized

networks such as Cirrus and Maestro. Traveller's cheques and Eurocheques issued by official

carriers can be exchanged at all Greek and foreign banks. Most major credit cards are

    accepted in Greece.

    Tax and Tipping

    Sales tax (VAT) is included in prices quoted. Tipping is a matter of personal discretion.

    Although the bill normally includes a service charge, it is customary to tip in restaurants

    (about 5%) and other places that cater to tourists.

    Public Transportation in Chania

    Chania is a rather compact town with short distances between most points of interest that can

easily be walked. There is a bus network that also connects to the outskirts of the town. Bus

    tickets can be purchased in the vehicles from the driver, but are cheaper when bought in

    advance from newspaper kiosks or ticket machines at the stops. In October, service between

    the city and the workshop venue can be sparse.

    Electrical Appliances

    The electricity supply in Greece is alternating current, 220-250 volts, 50 cycles. Appliances

    for 110 or 120 volts may be operated by using step down transformers of 220 - 250/110 volts

    connected to each outlet.


    Conference Information

    Registration

The registration desk will be open:

Sunday 19 October: 17:00-20:00

Monday 20 October: 08:30-18:00

Tuesday 21 October: 08:30-18:00

Wednesday 22 October: 08:30-18:00

    The full conference registration package includes:

All technical sessions, welcome reception, abstract and papers CD-ROM, coffee breaks, lunches, exhibits, and the banquet.

    For administrative issues, please contact the registration desk.

    Badge

    Participants are requested to wear their badge at all times during the conference.

    Catering

Coffee breaks

Coffee breaks will be organized in the hotel lobby and are offered on a complimentary basis at the following hours:

Monday: 10:00-10:30 and 15:30-16:00

Tuesday / Wednesday: 10:00-10:30 and 15:00-15:30

Luncheons

Luncheons will be organized in the hotel restaurant and are offered on a complimentary basis at the following hours:

Monday / Tuesday / Wednesday: 12:00-13:30

    Lost and Found

    All enquiries should be directed to the registration desk. Participants are advised to mark their

    conference material with their name.


    Internet Access

    Free wireless internet access is provided for registered conference participants.

    ICMI 2008 Workshops

    ICMI 2008 is hosting three workshops in emerging areas of multimodal interaction and

    multimedia processing. The following workshops will be hosted at ICMI 2008, following the

    main conference.

    Thursday, October 23, 2008

The Child, Computer and Interaction Workshop aims at bringing together researchers and practitioners from universities and industry working in all aspects of multimodal child-machine interaction, with particular emphasis on, but not limited to, speech interactive interfaces.

    Friday, October 24, 2008

    The Affective Interaction in Natural Environments (AFFINE) Workshop will consider real-time techniques for the recognition and interpretation of human verbal and non-verbal

    behaviour for application in human-agent and human-robot interaction frameworks.

    Friday Morning, October 24, 2008

The Multimedia Analysis for Multimodal Interaction of User Behaviour in a Controlled Environment Workshop aims to study techniques that capture and analyze multi-modal behaviour in controlled environments. As a result of such analysis, information should be adapted to the user's needs and situation. Its goal is to stimulate

    interest in this field and to create productive synergy among researchers who are

    working in this fascinating area.


    Chania City and Region

    Chania is a heavenly and pure land brimming with natural beauty, history, memory and

    culture. It is a land whose visitors will experience nature in all its glory and will encounter

    breathtaking sights. Endless stretches of seashore bordered in frothy lace, inlets and islands of

    exotic beauty and sandy beaches tucked away at the foot of forbidding mountains.

    Impenetrable but yet such majestic gorges, holy caves, blessed rivers and lush, green plains

thickly covered with olive and citrus trees. It is a self-sufficient land in every way that is rich in endemic and rare flora and fauna. Perhaps most importantly, the visitor will encounter a

    people who recognize life's gifts and value. The Cretan soul will infuse him with the feeling

    of true hospitality and will leave him mesmerized and forever partial to its beauty. For more

    information about Chania and Greece visit the Chania Prefecture online at

    http://www.chaniacrete.gr/ and the Greek National Tourism Organization.

    Cretan Diet and Cuisine

    In Crete the earth and its produce are identified with mythology and the ancient gods. The

    purity of the products, their vitamin-rich content and wonderful taste have always been the

    best choice for health food lovers, as well as an item for research and study for the scientific

    community, since their beneficial properties against heart diseases are evident. Geological

    conditions, care for natural cultivation and mostly love for tradition and quality have all

    contributed to the acknowledged excellence of the Cretan products.

    Sights

    The Venetian Harbour of Chania

    The harbour was first constructed between 1320 and

    1356 by the Venetians to protect the City. The Venetian

    physiognomy of its buildings, full of colour and shapes,

    makes the port unique. At the entrance to the harbour,

    there is a Venetian light-house which was reconstructed

    during the Turkish occupation and given the shape of a

    minaret. The visitor can also see the Venetian shipyards

(Neoria) (14th-16th century) and the Great Arsenal which

    has been restored and accommodates the exhibition and

    meeting venue of the Centre of Architecture of the Mediterranean.


    The Municipal Market

    The municipal Market or Agora, is the most central point

    in Chania. Inaugurated in 1913 and built on the site of the

    leveled Venetian bastion, Piatta Forma, traces of which

have been uncovered next to the southern entrance, the Agora is a stately cruciform structure which shelters 76

    shops where the visitor may find almost any product of the

    municipality of Chania.

    Historical quarters of Halepa

    Beyond the touristically and historically rich

    center of Chania and past the walls of the old city,

    the visitor can visit the historical quarters of

    Halepa with the aristocratic houses from the end of

    the 19th and beginning of the 20th century and the

    "tabakaria" neighbourhood. There he will gazeupon the Palace of Prince Georgios and the

    residence of Eleftherios Venizelos which houses

    the National Foundation of Research "Eleftherios

    K. Venizelos", with his statue which has been

    erected in the square, the French Academy (1860), the Russian church of Agia Magdalini and

    the church of Evangelistria.

    Museums and collections

    One of the many beautiful medieval monuments of the city

    of Chania, the imposing church of the monastery of the

Franciscan monks (16th century), houses the

    Archaeological Museum of Chania which has a collection

    of finds from a variety of sites in the prefecture.

    At another Venetian church, the church of San Salvatore of

the Franciscan monks (15th-17th century), the noteworthy

    Byzantine and post - Byzantine Collection of Chania is

    housed. It boasts discoveries from the prefecture such as

    architectural sculptures, inscriptions, frescoes, icons, coins, jewellery and ceramics. Inside the

    fortress of Firka is where the Naval Museum of Chania is housed and where exhibits relevant

    to naval history and the sea such as naval maps, engravings, naval instruments, model ships

    and a rich collection of shells can be found while a big part of the museum is taken up by

memorabilia from the Battle of Crete.

The War Museum of Chania is housed in the Italian Barracks, a work of the Italian architect Macuzo (1870), next to the Public Garden. It is a branch of

    the War Museum of Athens. The Historic

    Archives of Crete was founded in 1932 and it is

    located at 20, I. Sfakianaki Street. Its library

comprises over 6,500 books as well as

    very important documents from the Cretan

    rebellions from 1821, archives of the Cretan State

    as well as archives of civilians. It is one of the

    richest archive collections, second only to the

    General Archives of the State.

    And much more


    Social Program

    Welcoming Reception

All participants are cordially invited to the welcoming reception, to take place on Sunday, October 19, from 18:00 to 20:00 at the Panorama Hotel, by the pool.

    Banquet

    All participants are cordially invited to the banquet on Tuesday October 21, from 19:00 to

    22:30. Please join us and appreciate the gastronomic delicacies of Cretan and international

    cuisine. Additional tickets for accompanying persons can be ordered at the time of

    registration or on site. The location of the banquet is the Neraida restaurant overlooking the

    city of Chania. Transportation will be provided for ICMI 2008 participants. Buses will depart

    in front of the Panorama Hotel at 6:45 pm.

    Other activities

Chania provides a variety of day-long excursions, including the famous Samaria Gorge hike,

    and half-day excursions including Elafonisi and Gramvousa. Water-sport activities include

    sailing and scuba-diving. For more information ask at the hotel desk.


Monday, October 20

08:00-09:00  Breakfast
09:00-10:00  Invited Talk: Natural Interfaces in the Field: the case for Pen and Paper, by Phil Cohen
10:00-10:30  Coffee Break
10:30-12:00  D1O1: Multimodal System Evaluation
12:00-13:30  Lunch
13:30-15:30  D1O2: Special Session on Social Signal Processing
15:30-16:00  Coffee Break
16:00-17:30  D1P1: Multimodal Systems I

Tuesday, October 21

08:00-09:00  Breakfast
09:00-10:00  Technical Panel: Multi-modal: Because we can or because we should?
10:00-10:30  Coffee Break
10:30-12:00  D2O1: Multimodal System Design and Tools
12:00-13:30  Lunch
13:30-15:00  D2O2: Multimodal Interfaces I
15:00-15:30  Coffee Break
15:30-17:00  DS1: Demo Session / ST1: Show and Tell
19:00-22:30  Social Event (Banquet)


INV: Invited Talk

Time: Monday, October 20, 09:00-10:00

    Place: Main Hall

    Chair: Roberto Pieraccini

    Natural Interfaces in the Field: the case for Pen and Paper

Phil Cohen, Adapx Inc

Over the past 7 years, Adapx (formerly Natural Interaction Systems) has been developing digital pen-based natural interfaces for field tasks. Examples include products for field note-taking, mapping and architecture/engineering/construction, which have been applied to such uses as surveying, wild-fire fighting, land use planning and dispute resolution, and civil engineering. In this talk, I will describe the technology and some of these field-based use cases, discussing why natural interfaces are the preferred means for human-computer interaction for these applications.

D1O1: Multimodal System Evaluation (Oral Session)

Time: Monday, October 20, 10:30-12:00

Place: Main Hall
Chair: Massimo Zancanaro

10:30-10:50

D1O1.1 Manipulating Trigonometric Expressions Encoded through Electro-Tactile Signals

Tatiana G. Evreinova, University of Tampere

Visually challenged pupils and students need special developmental tools. To facilitate their skills acquisition in math, different game-like techniques have been implemented. Along with Braille, electro-tactile patterns (eTPs) can be used to deliver mathematical content to the visually challenged user. The goal of this work was to continue an exploration of non-visual manipulation of mathematics. The eTPs denoting four trigonometric functions and their seven arguments (angles) were shaped with a designed electro-tactile unit. A matching software application was used to facilitate the learning process of the eTPs. A permutation puzzle game was employed to improve the perceptual skills of the players in manipulating the encoded trigonometric functions and their arguments. The performance of 8 subjects was investigated and discussed. The experimental findings confirmed the possibility of using the eTPs for communicating different kinds of math content.

10:50-11:10

    D1O1.2 Multimodal System Evaluation using Modality Efficiency and Synergy Metrics

Manolis Perakakis, Alexandros Potamianos, Technical University of Crete

In this paper, we propose two new objective metrics, relative modality efficiency and multimodal synergy, that can provide valuable information and identify usability problems during the evaluation of multimodal systems. Relative modality efficiency (when compared with modality usage) can identify suboptimal use of modalities due to poor interface design or information asymmetries. Multimodal synergy measures the added value from efficiently combining multiple input modalities, and can be used as a single measure of the quality of modality fusion and fission in a multimodal system. The proposed metrics are used to evaluate two multimodal systems that combine pen/speech and mouse/keyboard modalities respectively. The results provide much insight into multimodal interface usability issues, and demonstrate how multimodal systems should adapt to maximize modality synergy, resulting in efficient, natural, and intelligent multimodal interfaces.
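The abstract does not spell out how the two metrics are computed; purely as a hypothetical illustration (the ModalityLog fields and both formulas below are assumptions, not the paper's definitions), metrics of this flavour could be derived from interaction logs as follows:

```python
# Illustrative sketch only: the paper's exact metric definitions are not given in this
# abstract. Here we assume (hypothetically) that a modality's "efficiency" is task
# completions per unit interaction time, and that "synergy" is the relative improvement
# of the multimodal system over the best unimodal baseline.

from dataclasses import dataclass

@dataclass
class ModalityLog:
    name: str
    tasks_completed: int
    interaction_time_s: float   # total time spent interacting in this modality
    usage_share: float          # fraction of turns in which the modality was used

def relative_efficiency(logs):
    """Hypothetical: efficiency = tasks per second, normalized so the best modality = 1."""
    eff = {m.name: m.tasks_completed / m.interaction_time_s for m in logs}
    best = max(eff.values())
    return {name: value / best for name, value in eff.items()}

def multimodal_synergy(unimodal_times, multimodal_time):
    """Hypothetical: fractional time saved by the multimodal system vs. the best unimodal one."""
    best_unimodal = min(unimodal_times.values())
    return (best_unimodal - multimodal_time) / best_unimodal

if __name__ == "__main__":
    logs = [ModalityLog("speech", tasks_completed=20, interaction_time_s=300, usage_share=0.7),
            ModalityLog("pen",    tasks_completed=20, interaction_time_s=420, usage_share=0.3)]
    print(relative_efficiency(logs))                                              # speech 1.0, pen ~0.71
    print(multimodal_synergy({"speech": 300, "pen": 420}, multimodal_time=250))   # ~0.17
```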

11:10-11:30

    D1O1.3 Effectiveness and Usability of an Online Help Agent Embodied as a Talking Head

Jérôme Simonin, Noëlle Carbonell, LORIA, Nancy Université; Danielle Pelé, France Telecom R&D

We present an empirical study which aims at assessing the contribution of embodying online help as a talking head. 22 undergraduate students successively used two multimodal online help systems for learning how to create animations using Flash. Both systems used the same message database; however, one system was embodied, oral messages being spoken by a talking head, while the other system was not. Comparisons between the two conditions (i.e., presence or absence of the talking head) focused on participants' performances and subjective judgments, which were collected using a verbal and a nonverbal questionnaire, a post-experiment debriefing and eye tracking data.

11:30-11:50

D1O1.4 Interaction techniques for the analysis of complex data on high-resolution displays

Chreston Miller, Ashley Robinson, Rongrong Wang, Pak Chung, Francis Quek, Virginia Tech

When combined with the organizational space provided by a simple table, physical notecards are a powerful organizational tool for information analysis. The physical presence of these cards affords many benefits but is also a source of disadvantages. For example, complex relationships among them are hard to represent. A number of notecard software systems have been developed to address these problems. Unfortunately, the amount of visual detail in such systems is lacking compared to real notecards on a large physical table; we look to alleviate this problem by providing a digital solution. One challenge with new display technology and systems is providing an efficient interface for its users. In this paper we compare different interaction techniques for an emerging class of organizational systems that use high-resolution tabletop displays. The focus of these systems is to more easily and efficiently assist interaction with information. Using PDA, token, gesture, and voice interaction techniques, we conducted a within-subjects experiment comparing these techniques over a large high-resolution horizontal display. We found strengths and weaknesses for each technique. In addition, we noticed that some techniques build upon and complement others.

D1O2: Special Session on Social Signal Processing (Oral Session)

Time: Monday, October 20, 13:30-15:30

Place: Main Hall

Chair: Alessandro Vinciarelli

13:30-14:00

D1O2.5 Social Signal Processing: A Survey on Nonverbal Behaviour Analysis in Social Interactions

Alessandro Vinciarelli, IDIAP Research Institute; Maja Pantic, Imperial College

The ability to understand and manage the social signals of a person we are communicating with is the core of social intelligence. Social intelligence is a facet of human intelligence that has been argued to be indispensable and perhaps the most important for success in life. This paper argues that next-generation computing needs to include the essence of social intelligence, the ability to recognize human social signals and social behaviours like politeness and disagreement, in order to become more effective and more efficient. Although each one of us understands the importance of social signals in everyday life situations, and in spite of recent advances in machine analysis of relevant behavioural cues like blinks, smiles, crossed arms, laughter, and similar, the design and development of automated systems for Social Signal Processing (SSP) is rather difficult. This paper surveys past efforts to solve these problems by computer, summarizes the relevant findings in social psychology, and proposes a set of recommendations for enabling the development of the next generation of socially-aware computing.

14:00-14:20

D1O2.1 Role Recognition in Multiparty Recordings using Social Affiliation Networks and Discrete Distributions

Alessandro Vinciarelli, Sarah Favre, Hugues Salamin, John Dines, IDIAP Research Institute

This paper presents an approach for the recognition of roles in multiparty recordings. The approach includes two major stages: extraction of Social Affiliation Networks (speaker diarization and representation of people in terms of their social interactions), and role recognition (application of discrete probability distributions to map people into roles). The experiments are performed over several corpora, including broadcast data and meeting recordings, for a total of roughly 90 hours of material. The results are satisfactory for the broadcast data (around 80 percent of the data time correctly labeled in terms of role), while they still must be improved in the case of the meeting recordings (around 45 percent of the data time correctly labeled). In both cases, the approach significantly outperforms chance.

14:20-14:40


    D1O2.2 Audiovisual Laughter Detection Based on Temporal Features

Stavros Petridis, Imperial College London; Maja Pantic, University of Twente

Past research on automatic laughter detection has focused mainly on audio-based detection. In this study we present an audio-visual approach to distinguishing laughter from speech, and we show that integrating the information from the audio and video channels leads to improved performance over single-modal approaches. Static information is extracted on an audio/video frame basis and combined with temporal information extracted over a temporal window, which describes the path of the static features in time. The use of several different temporal features is investigated, and indeed the addition of temporal information results in improved performance. It is common to use a fixed set of temporal features, which implies that all static features will exhibit the same behaviour over a temporal window. However, this is not always true, and we show that when AdaBoost is used as a feature selector, different temporal features are selected for each static feature, i.e. the time path of each static feature is described by different statistical measures. When tested on 96 audiovisual sequences depicting spontaneously displayed (as opposed to posed) laughter and speech episodes, in a person-independent way, the proposed audiovisual approach achieves an F1 rate of over 89%.
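As a rough sketch of the kind of temporal features described above (the window length, hop size and the particular statistics are assumptions, not the authors' feature set), per-frame static features can be summarized over a sliding window like this:

```python
import numpy as np

def temporal_features(static_feats, win=25, hop=5):
    """Summarize per-frame static features over sliding windows.

    static_feats: array of shape (n_frames, n_static), one static feature vector per
    audio/video frame. Returns one row of window-level statistics (mean, std, linear
    slope) per window. Window and hop sizes here are illustrative values.
    """
    n_frames, n_static = static_feats.shape
    rows = []
    for start in range(0, n_frames - win + 1, hop):
        w = static_feats[start:start + win]          # (win, n_static)
        t = np.arange(win)
        means = w.mean(axis=0)
        stds = w.std(axis=0)
        # slope of a least-squares line fit to each feature's path over the window
        slopes = np.polyfit(t, w, deg=1)[0]
        rows.append(np.concatenate([means, stds, slopes]))
    return np.vstack(rows)

# Example with synthetic data: 200 frames of 4 static features
feats = np.random.randn(200, 4)
print(temporal_features(feats).shape)   # (36, 12)
```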

14:40-15:00

D1O2.3 Predicting two facets of social verticality in meetings from 5 min. time slices and nonverbal cues

Dinesh Jayagopi, Daniel Gatica-Perez, IDIAP Research Institute

This paper addresses the automatic estimation of two aspects of social verticality (status and dominance) in small-group meetings using nonverbal cues. The correlation of nonverbal behavior with these social constructs has been extensively documented in social psychology, but its value for computational models is, in many cases, still unknown. We present a systematic study of automatically extracted cues, including vocalic, visual activity, and visual attention cues, and investigate their relative effectiveness to jointly predict the most-dominant person and the high-status project manager from relatively short observations. We use five hours of task-oriented meeting data with natural behavior for our experiments. Our work shows that, although dominance and role-based status are related concepts, they are not equivalent and are thus not equally explained by the same nonverbal cues. Furthermore, the best cues can correctly predict the person with the highest dominance or role-based status with 70% accuracy.

15:00-15:20

    D1O2.4 Multimodal Recognition of Personality Traits in Social Interactions

Fabio Pianesi, Nadia Mana, Alessandro Cappelletti, Bruno Lepri, Massimo Zancanaro, Fondazione Bruno Kessler

This paper targets the automatic detection of personality traits in a meeting environment by means of audio and visual features; the information about the relational context is captured by keeping track of the relational roles played by meeting participants. Two personality traits are considered: extraversion (from the Big Five) and the Locus of Control. The classification task is applied to short (1 minute) behavioral sequences. SVMs were used to test the performance of several training and testing instance setups, including a restricted set of audio features obtained through feature selection. The outcomes improve considerably over existing results, provide evidence about the feasibility of the multimodal analysis of personality, and pave the way to further studies addressing different feature setups and/or targeting different personality traits.

TS1: Teaser Session

Time: Monday, October 20, 15:30-15:40

    Place: Main Hall

    Chair: Katerina Mania

D1P1: Multimodal Systems I (Poster Session)

Time: Monday, October 20, 16:00-17:30

    Place: Poster Hall

    Chair: Katerina Mania

D1P1.1 VoiceLabel: Using Speech to Label Mobile Sensor Data

S. Harada, K. Patel, J. Lester, T. S. Saponas, J. Fogarty, J. Wobbrock, J. Landay, University of Washington

Mobile sensing and computation is an increasingly common component of everyday life, with one example being the iPhone's integrated accelerometer and location sensing. Such devices enable new applications that leverage advances in supervised machine learning to interpret sensor data and provide new types of context and input. However, supervised machine learning inherently requires the collection of accurately labeled data for use in training a reliable model. In many situations, a traditional graphical user interface on a mobile device may not be appropriate or viable for collecting labeled training data. This paper presents an alternative approach to mobile labeling. VoiceLabel consists of two components: (1) a speech-based data collection tool for mobile devices, and (2) a desktop tool for offline segmentation of recorded data and recognition of spoken labels. Our desktop tool automatically analyzes the audio stream to find and recognize spoken labels, then presents a multimodal interface for reviewing and correcting data labels using a combination of the audio stream, the system's analysis of that audio, and the corresponding mobile sensor data. A study with ten participants showed that VoiceLabel is a viable method for labeling mobile sensor data, and VoiceLabel illustrates several key features that inform the design of other data labeling tools.

D1P1.2 The BabbleTunes System. Talk to Your iPod!

Jan Schehl, Alexander Pfalzgraf, Norbert Pfleger, Jochen Steigner, German Research Center for Artificial Intelligence (DFKI GmbH)

This paper presents a full-fledged multimodal dialogue system for accessing multimedia content in home environments from both portable media players and online sources. We will mainly focus on two aspects of the system that provide the basis for a natural interaction: (i) the automatic processing of named entities, which permits the incorporation of dynamic data into the dialogue (e.g., song or album titles, artist names, etc.), and (ii) general multimodal interaction patterns that are bound to ease the access to large sets of data.

    D1P1.3 Evaluating Talking Heads for Smart Home Systems

Christine Kühnel, Benjamin Weiss, Ina Wechsung, Sascha Fagel, Sebastian Möller, Berlin Institute of Technology

In this paper we report the analysis of a user study focusing on the evaluation of talking heads in the smart home domain. In this test the head and voice components are varied, and the influence on overall quality is analyzed as well as the correlation between voice and head. A detailed description of the test design is provided. Three different ways to assess overall quality are presented. It is shown that they are consistent in their results. Another important result is that in this design speech and visual quality are independent of each other. Furthermore, a linear combination of both qualities models the overall quality of talking heads to a good degree.

    D1P1.4 Perception of Dynamic Audiotactile Feedback to Gesture Input

Teemu Ahmaniemi, Vuokko Lantz, Juha Marila, Nokia Research Center

In this paper we present results of a study in which the perception of dynamic audiotactile feedback to gesture input was examined. Our main motivation was to investigate the effect of the user's active input on perception. The experimental prototype in the study was a handheld sensor-actuator device that responds dynamically to the user's hand movements. The feedback was designed so that the amplitude and frequency were proportional to the overall angular velocity of the device. In the perception tests we used four different feedback designs with different velocity responses. The feedback was presented to the user by the tactile actuator in the device, by audio through headphones, or both. During the experiments, the feedback design was changed at random intervals and the task of the user was to detect the changes while moving the device freely. The performance of the users with audio or audiotactile feedback was quite equal, while tactile feedback alone yielded poorer performance. The feedback design didn't influence the movement velocity or periodicity, but significantly better performance was achieved with slower motion. Furthermore, the modality condition had an effect on the energy of motion: tactile feedback induced the most and audio feedback the least energetic motion. We also found that significant learning happened over time; detection accuracy increased significantly during and between the experiments. The masking noise used in the tactile modality condition didn't significantly influence the detection accuracy when compared to acoustic blocking, but it increased the average reaction time of change detection.
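The abstract only states that feedback amplitude and frequency were proportional to the device's overall angular velocity; the sketch below is a hypothetical version of such a mapping (the gain constants, clamping range and gyroscope input format are assumptions, not the prototype's actual design):

```python
import math

# Hypothetical mapping from device motion to audiotactile feedback parameters, following
# the idea that amplitude and frequency grow with overall angular velocity.
# Gains and clamping ranges are made-up values for illustration.

BASE_FREQ_HZ = 150.0     # feedback frequency when the device is nearly still
FREQ_GAIN = 40.0         # added Hz per rad/s of rotation
MAX_AMPLITUDE = 1.0

def feedback_params(gyro_rad_s):
    """gyro_rad_s: (wx, wy, wz) angular rates from a gyroscope, in rad/s."""
    wx, wy, wz = gyro_rad_s
    omega = math.sqrt(wx * wx + wy * wy + wz * wz)   # overall angular velocity
    amplitude = min(MAX_AMPLITUDE, 0.2 * omega)      # proportional, clamped to [0, 1]
    frequency = BASE_FREQ_HZ + FREQ_GAIN * omega
    return amplitude, frequency

print(feedback_params((0.5, 1.0, 0.0)))   # slow gesture -> soft, low-pitched feedback
print(feedback_params((3.0, 4.0, 1.0)))   # fast gesture -> strong, higher-pitched feedback
```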

    D1P1.5 An Integrative Recognition Method for Speech and Gestures

Madoka Miki, Chiyomi Miyajima, Takanori Nishino, Norihide Kitaoka, Kazuya Takeda, Nagoya University

We propose an integrative recognition method for speech accompanied by gestures such as pointing. Simultaneously generated speech and pointing complementarily help the recognition of both, and thus the integration of these multiple modalities may improve recognition performance. As an example of such multimodal speech, we selected the explanation of a geometry problem. While the problem was being solved, speech and fingertip movements were recorded with a close-talking microphone and a 3D position sensor. To find the correspondence between utterances and gestures, we propose a probability distribution of the time gap between the starting times of an utterance and a gesture. We also propose an integrative recognition method using this distribution. We obtained approximately a 3-point improvement in both speech and fingertip movement recognition performance with this method.
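The abstract describes modeling the time gap between utterance and gesture onsets with a probability distribution; as a loose illustration only (the Gaussian model, its parameters and the rescoring weight are assumptions, not the paper's method), such a prior could be combined with recognizer scores like this:

```python
import math

# Assumed, illustrative model: the onset gap (gesture start - utterance start, in seconds)
# is modeled with a Gaussian fitted on training data; its log-likelihood is then added,
# with a tunable weight, to the recognizer scores of candidate speech/gesture pairings.

GAP_MEAN, GAP_STD = 0.3, 0.4   # made-up parameters of the onset-gap distribution
GAP_WEIGHT = 1.0               # made-up interpolation weight

def gap_log_prob(gap_s):
    z = (gap_s - GAP_MEAN) / GAP_STD
    return -0.5 * z * z - math.log(GAP_STD * math.sqrt(2.0 * math.pi))

def rescore(speech_score, gesture_score, utt_start_s, gesture_start_s):
    """Combine unimodal recognizer log-scores with the onset-gap prior."""
    gap = gesture_start_s - utt_start_s
    return speech_score + gesture_score + GAP_WEIGHT * gap_log_prob(gap)

# A pairing whose onsets are close to the typical gap gets a higher combined score.
print(rescore(-12.0, -8.0, utt_start_s=1.0, gesture_start_s=1.35))
print(rescore(-12.0, -8.0, utt_start_s=1.0, gesture_start_s=3.00))
```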

D1P1.6 As Go the Feet: On the Estimation of Attentional Focus from Stance

Francis Quek, Roger Ehrich, Thurmon Lockhart, Virginia Tech

The estimation of the direction of visual attention is critical to a large number of interactive systems. This paper investigates the cross-modal relation of the position of one's feet (or standing stance) to the focus of gaze. The intuition is that while one CAN have a range of attentional foci from a particular stance, one may be MORE LIKELY to look in specific directions given an approach vector and stance. We posit that the cross-modal relationship is constrained by biomechanics and personal style. We define a stance vector that models the approach direction before stopping and the pose of a subject's feet. We present a study where the subject's feet and approach vector are tracked. The subjects read aloud the contents of note cards in 4 locations. The order of visits to the cards was randomized. Ten subjects read 40 lines of text each, yielding 400 stance vectors and gaze directions. We divided our data into 4 sets of 300 training and 100 test vectors and trained a neural net to estimate the gaze direction given the stance vector. Our results show that 31% of our gaze orientation estimates were within 5 degrees, 51% were within 10 degrees, and 60% were within 15 degrees. Given the ability to track foot position, the procedure is minimally invasive.
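As a rough sketch of this kind of experiment (the stance-vector layout, network size and data below are synthetic assumptions, not the authors' setup), a small neural-network regressor mapping stance vectors to gaze direction could be trained and scored by angular error like this:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in data: each stance vector encodes an approach direction and the
# pose of both feet (angles in degrees); the target is the gaze azimuth in degrees.
n = 400
stance = rng.uniform(-90, 90, size=(n, 5))                              # hypothetical 5-D stance vector
gaze = stance[:, 0] * 0.6 + stance[:, 1] * 0.3 + rng.normal(0, 8, n)    # fake dependency + noise

train_x, test_x = stance[:300], stance[300:]
train_y, test_y = gaze[:300], gaze[300:]

net = MLPRegressor(hidden_layer_sizes=(16,), max_iter=5000, random_state=0)
net.fit(train_x, train_y)

err = np.abs(net.predict(test_x) - test_y)
for tol in (5, 10, 15):
    print(f"within {tol} deg: {np.mean(err <= tol):.0%}")
```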

D1P1.7 Knowledge and Data Flow Architecture for Reference Processing in Multimodal Dialog Systems

Ali Choumane, Jacques Siroux, IRISA, University of Rennes 1

This paper is concerned with the part of the system dedicated to the processing of the user's designation activities for multimodal search of information. We highlight the necessity of using specific knowledge for multimodal input processing. We propose and describe knowledge modeling as well as the associated processing architecture. Knowledge modeling is concerned with the natural language and the visual context; it is adapted to the kind of application and allows several types of filtering of the inputs. Part of this knowledge is dynamically updated to take into account the interaction history. In the proposed architecture, each input modality is first processed using the modeled knowledge, producing intermediate structures. Next, a fusion of these structures allows the determination of the referent aimed at, using dynamic knowledge. The steps of this last process take into account the possible combinations of modalities as well as the clues carried by each modality (linguistic clues, gesture type). The development of this part of our system is mainly complete and tested.

D1P1.8 The CAVA corpus: synchronized stereoscopic and binaural datasets with head movements

Elise Arnaud, Université Joseph Fourier, LJK and INRIA Rhône-Alpes; Heidi Christensen, Yan-Chen Lu, Jon Barker, University of Sheffield; Vasil Khalidov, Miles Hansard, Bertrand Holveck, Hervé Mathieu, Ramya Narasimha, Florence Forbes, INRIA Rhône-Alpes

This paper describes the acquisition and content of a new multi-modal database. Some tools for making use of the data streams are also presented. The Computational Audio-Visual Analysis (CAVA) database is a unique collection of three synchronised data streams obtained from a binaural microphone pair, a stereoscopic camera pair and a head tracking device. All recordings are made from the perspective of a person, i.e. what a human with natural head movements would see and hear in a given environment. The database is intended to facilitate research into humans' ability to optimise their multi-modal sensory input, and fills a gap by providing data that enables human-centred audio-visual scene analysis. It also enables 3D localisation using either audio, visual, or audio-visual cues. A total of 50 sessions, with varying degrees of visual and auditory complexity, were recorded. These range from seeing and hearing a single speaker moving in and out of the field of view, to moving around a 'cocktail party' style situation, mingling and joining different small groups of people chatting.

D1P1.9 Towards A Minimalist Multimodal Dialogue Framework Using Recursive MVC Pattern

Li Li, Wu Chou, Avaya Labs Research

This paper presents a formal framework for multimodal dialogue systems built by applying a set of complexity reduction techniques. The minimalist approach combines recursive application of the Model-View-Controller (MVC) design pattern with layering and interpretation, which results in a modular, concise, flexible and dynamic framework built upon a few core constructs. This framework could expedite the development of complex multimodal dialogue systems by enabling the sharing and reuse of well-defined multimodal components in the research communities. An XML-based prototype multimodal dialogue system that embodies the framework is developed and studied. Experimental results indicate that the proposed framework is effective and well suited for multimodal interaction in complex business transactions.
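The abstract does not give the framework's actual constructs; purely as an illustration of the recursive MVC idea (all class names and methods here are hypothetical, not the paper's API), a top-level controller can itself be composed of nested per-modality MVC triads:

```python
# Hypothetical sketch of a recursive Model-View-Controller composition for a multimodal
# dialogue system: a top-level MVC triad delegates to per-modality child triads.

class Model:
    def __init__(self):
        self.state = {}

    def update(self, key, value):
        self.state[key] = value

class View:
    def render(self, state):
        print("render:", state)

class Controller:
    def __init__(self, model, view, children=None):
        self.model, self.view = model, view
        self.children = children or {}          # modality name -> child Controller

    def handle(self, modality, event):
        # Recurse into the child triad responsible for this modality, if any,
        # then fold its result into the shared dialogue state and re-render.
        if modality in self.children:
            self.children[modality].handle(modality, event)
        self.model.update(modality, event)
        self.view.render(self.model.state)

speech = Controller(Model(), View())
touch = Controller(Model(), View())
dialogue = Controller(Model(), View(), children={"speech": speech, "touch": touch})

dialogue.handle("speech", "play some jazz")
dialogue.handle("touch", ("tap", 120, 48))
```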

    D1P1.10 Explorative Studies on Multimodal Interaction in a PDA and Desktop Scenario

Andreas Ratzka, University of Regensburg

This paper proposes anthropomorphic attentive behaviors for daily-partner robots that are aware of the user's gaze and utterances. The purpose of these behaviors is to give the user particular impressions of the robot: 1) as if it has something to tell the user but does not make an utterance, and 2) as if it is considering the appropriate timing to start talking to the user when the user is unavailable. The target of the user's speech can be estimated by detecting the user's gaze and utterance, giving the robot system the capability of interpreting the user's situation. This capability can be used to design how the robot notifies the user of someone's messages, emergent information, or communicative interactions, at the appropriate timing based on the user's state. Taking advantage of the effectiveness of the anthropomorphic attentive expressions, the behaviors of the proposed stuffed-toy robot, especially the gazing behaviors, are adopted for reacting or implying the need to speak to the user. The results of experiments combining subjects' daily tasks and various attentive behaviors of the robot show that i) crossmodal-aware behaviors are important in communications with the stuffed-toy robot, and ii) speech-implying behaviors are effective for conveying the robot's intention to speak and for drawing the user's attention without disturbing the ongoing task of the user.

TP: Technical Panel

Time: Tuesday, October 21, 09:00-10:00

    Place: Main Hall

    Chair: Roberto Pieraccini

    TP1.1 Multi-modal: Because we can or because we should?

Noelle Carbonell, LORIA; Phil Cohen, Adapx Inc; George Drettakis, INRIA Sophia-Antipolis; Roberto Pieraccini, SpeechCycle

Is it often the case that problems of interaction can be solved in a single modality? What are the strong reasons to go multi-modal? What are the theoretical justifications for multi-modal approaches? What are the new applications that are enabled? Have we seen, so far, any success story of multimodality applied in a business context? Do we really need to speak, touch, click, see, and hear at the same time? Is multimodality just an interesting subject of research?

D2O1: Multimodal System Design and Tools (Oral Session)

Time: Tuesday, October 21, 10:30-12:00

Place: Main Hall

Chair: Louis-Philippe Morency

10:30-10:50

    D2O1.1 Designing Context-Aware Multimodal Virtual Environments

Lode Vanacken, Joan De Boeck, Chris Raymaekers, Karin Coninx, Hasselt University

Despite decades of research, creating intuitive and easy-to-learn interfaces for 3D virtual environments (VEs) is still not obvious, requiring VE specialists to define, implement and evaluate solutions in an iterative way, often using low-level programming code. Moreover, quite often the interaction with the virtual environment may also vary depending on the context in which it is applied, such as the available hardware setup, user experience, or the pose of the user (e.g. sitting or standing). Lacking other tools, the context-awareness of an application is usually implemented in an ad-hoc manner, using low-level programming, as well. This may result in code that is difficult and expensive to maintain. One possible approach to facilitate the process of creating these highly interactive user interfaces is adopting model-based user interface design. This lifts the creation of a user interface to a higher level, allowing the designer to reason more in terms of high-level concepts, rather than writing programming code. In this paper, we adopt a model-based user interface design (MBUID) process for the creation of VEs, and explain how a context system using an Event-Condition-Action paradigm is added. We illustrate our approach by means of a case study.
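As a generic illustration of the Event-Condition-Action idea mentioned here (not the paper's notation or toolchain), a context rule that swaps the interaction technique when the user's pose changes might look like the following:

```python
# Generic Event-Condition-Action (ECA) rule sketch, not the paper's notation:
# when an event arrives, rules whose condition holds in the current context fire their
# action, e.g. switching the interaction technique when the user sits down.

class Rule:
    def __init__(self, event, condition, action):
        self.event, self.condition, self.action = event, condition, action

def dispatch(rules, event, context):
    for rule in rules:
        if rule.event == event and rule.condition(context):
            rule.action(context)

context = {"pose": "standing", "technique": "ray-casting"}

rules = [
    Rule("pose_changed",
         condition=lambda ctx: ctx["pose"] == "sitting",
         action=lambda ctx: ctx.update(technique="tabletop-selection")),
]

context["pose"] = "sitting"
dispatch(rules, "pose_changed", context)
print(context["technique"])   # tabletop-selection
```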

10:50-11:10

D2O1.2 High-Performance Dual-Wizard Infrastructure for Designing Speech, Pen, and Multimodal Interfaces

Phil Cohen, Adapx Inc; Colin Swindells, University of Victoria; Sharon Oviatt, Incaa Designs; Alex Arthur, Adapx Inc

The present paper reports on the design and performance of a novel dual-wizard simulation infrastructure that has been used effectively to prototype next-generation adaptive and implicit multimodal interfaces for collaborative groupwork. This high-fidelity simulation infrastructure builds on past work by Arthur et al. (2006), which developed single-wizard simulation tools for multiparty multimodal interactions involving speech, pen, and visual input. In the new infrastructure, a dual-wizard simulation environment was developed that supports (1) real-time tracking, analysis, and system adaptivity to a user's speech and pen paralinguistic signal features (e.g., speech amplitude, pen pressure), as well as the semantic content of their input. This simulation also supports (2) transparent user training to adapt their speech and pen signal features in a manner that enhances the reliability of system functioning, or the design of mutually-adaptive interfaces. To accomplish these objectives, this new environment also is capable of handling (3) dynamic streaming digital pen input. We illustrate the performance of the simulation infrastructure during longitudinal empirical research in which a user-adaptive interface was designed for implicit system engagement based exclusively on users' speech amplitude and pen pressure (CHI ref). While using this dual-wizard simulation method, the wizards responded successfully to over 3,000 user inputs with 95-98% accuracy and a joint wizard response time of less than 1.0 second during speech interactions and 1.65 seconds during pen interactions. Furthermore, the interactions they handled involved naturalistic multiparty meeting data in which high school students were engaged in peer tutoring, and all participants believed they were interacting with a fully functional system. The type of simulation capability reported in this work enables a new level of flexibility and sophistication in multimodal interface design, including the development of implicit multimodal interfaces that place minimal cognitive load on users during mobile, educational, and other load-critical applications.

11:10-11:30

D2O1.3 The WAMI Toolkit for Developing, Deploying, and Evaluating Web-Accessible Multimodal Interfaces

Alexander Gruenstein, Ian McGraw, Ibrahim Badr, Massachusetts Institute of Technology

Many compelling multimodal prototypes have been developed which pair spoken input and output with a graphical user interface, yet it has often proved difficult to make them available to a large audience. This unfortunate reality limits the degree to which authentic user interactions with such systems can be collected and subsequently analyzed. We present the WAMI toolkit, which alleviates this difficulty by providing a framework for developing, deploying, and evaluating Web-Accessible Multimodal Interfaces in which users interact using speech, mouse, pen, and/or touch. The toolkit makes use of modern web-programming techniques, enabling the development of browser-based applications which rival the quality of traditional native interfaces, yet are available on a wide array of Internet-connected devices. We will showcase several sophisticated multimodal applications developed and deployed using the toolkit, which are available via desktop, laptop, and tablet PCs, as well as via several mobile devices. In addition, we will discuss resources provided by the toolkit for collecting, transcribing, and annotating usage data from multimodal user interactions.

11:30-11:50

D2O1.4 A Three-Dimensional Characterization Space of Software Components for Rapidly Developing Multimodal Interfaces

Marcos Serrano, David Juras, Laurence Nigay, University of Grenoble

In this paper we address the problem of the development of multimodal interfaces. We describe a three-dimensional characterization space for software components along with its implementation in a component-based platform for rapidly developing multimodal interfaces. By graphically assembling components, the designer/developer describes the transformation chain from physical devices to tasks and vice versa. In this context, the key point is to identify generic components that can be reused for different multimodal applications. Nevertheless, for flexibility purposes, a mixed approach that enables the designer to use both generic components and tailored components is required. As a consequence, our characterization space includes one axis dedicated to the reusability aspect of a component. The two other axes of our characterization space respectively depict the role of the component in the data flow from devices to tasks and the level of specification of the component. We illustrate our three-dimensional characterization space, as well as the implemented tool based on it, using a multimodal map navigator.
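Purely to make the three axes concrete (the enum values and field names below are hypothetical labels, not the paper's vocabulary), a component descriptor along these axes could be modeled as:

```python
# Hypothetical component descriptor for the three characterization axes described above:
# reusability, role in the device-to-task data flow, and level of specification.

from dataclasses import dataclass
from enum import Enum

class Reusability(Enum):
    GENERIC = "generic"          # reusable across multimodal applications
    TAILORED = "tailored"        # written for one specific application

class DataFlowRole(Enum):
    DEVICE = "device"                    # produces raw events from a physical device
    TRANSFORMATION = "transformation"    # e.g. fusion, filtering, adaptation
    TASK = "task"                        # delivers task-level commands to the application

class SpecificationLevel(Enum):
    ABSTRACT = "abstract"
    CONCRETE = "concrete"

@dataclass
class ComponentDescriptor:
    name: str
    reusability: Reusability
    role: DataFlowRole
    level: SpecificationLevel

speech_recognizer = ComponentDescriptor(
    "speech-recognizer", Reusability.GENERIC, DataFlowRole.DEVICE, SpecificationLevel.CONCRETE)
map_command_mapper = ComponentDescriptor(
    "map-command-mapper", Reusability.TAILORED, DataFlowRole.TASK, SpecificationLevel.CONCRETE)
print(speech_recognizer)
```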

D2O2: Multimodal Interfaces I (Oral Session)

Time: Tuesday, October 21, 13:30-15:00

Place: Main Hall
Chair: Noelle Carbonell

13:30-13:50

    D2O2.1 Crossmodal Congruence: The Look, Feel and Sound of Touchscreen Widgets

Eve Hoggan, Stephen Brewster, University of Glasgow; Topi Kaaresoja, Pauli Laitinen, Nokia Research Center

Our research considers the following question: how can visual, audio and tactile feedback be combined in a congruent manner for use with touchscreen graphical widgets? For example, if a touchscreen display presents different styles of visual buttons, what should each of those buttons feel and sound like? This paper presents the results of an experiment conducted to investigate methods of congruently combining visual and combined audio/tactile feedback by manipulating the different parameters of the modalities. The results show definite trends, with individual visual parameters such as shape, size and height being combined congruently with audio/tactile parameters such as texture, duration and different actuator technologies. We draw further on the experiment results, using individual quality ratings to evaluate the perceived quality of our touchscreen buttons, and then reveal a correlation between perceived quality and crossmodal congruence. The results of this research enable mobile touchscreen UI designers to create realistic, congruent buttons by selecting the most appropriate audio and tactile counterparts of visual button styles.

13:50-14:10

D2O2.2 MultiML - A General Purpose Representation Language for Multimodal Human Utterances

Manuel Giuliani, Alois Knoll, Technische Universität München

We present MultiML, a markup language for the annotation of multimodal human utterances. MultiML is able to represent input from several modalities, as well as the relationships between these modalities. Since MultiML separates general parts of representation from more context-specific aspects, it can easily be adapted for use in a wide range of contexts. This paper demonstrates how speech and gestures are described with MultiML, showing the principles (including hierarchy and underspecification) that ensure the quality and extensibility of MultiML. As a proof of concept, we show how MultiML is used to annotate a sample human-robot interaction in the domain of a multimodal joint-action scenario.

14:10-14:30

D2O2.3 Deducing the Visual Focus of Attention from Head Pose Estimation in Dynamic Multi-view Meeting Scenarios

Michael Voit, Fraunhofer IITB; Rainer Stiefelhagen, Universität Karlsruhe

This paper presents our work on recognizing the visual focus of attention during dynamic meeting scenarios. We collected a new dataset of meetings, in which acting participants were to follow a predefined script of events, to enforce focus shifts of the remaining, unaware meeting members. Including the whole room, a total of 35 potential focus targets were annotated, of which some were moved or introduced spontaneously during the meeting. On this dynamic dataset, we present a new approach to deduce the visual focus by means of head orientation as a first clue, and show that our system recognizes the correct visual target in over 57% of all frames, compared to ~47% when mapping head pose to the first-best intersecting focus target directly.
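As a rough geometric illustration of the baseline mentioned at the end of this abstract, mapping a head-pose direction to the focus target with the smallest angular deviation could look like the following (the 2D setup and the target coordinates are invented for the example):

```python
import math

# Baseline-style mapping (illustrative only): choose the focus target whose direction
# from the person deviates least from the estimated head-pose direction.

def angle_to(person_xy, target_xy):
    dx, dy = target_xy[0] - person_xy[0], target_xy[1] - person_xy[1]
    return math.atan2(dy, dx)

def angular_diff(a, b):
    """Smallest absolute difference between two angles, in radians."""
    d = (a - b + math.pi) % (2.0 * math.pi) - math.pi
    return abs(d)

def nearest_focus_target(person_xy, head_yaw_rad, targets):
    """targets: dict of target name -> (x, y). Returns the best-matching target name."""
    return min(targets,
               key=lambda name: angular_diff(head_yaw_rad, angle_to(person_xy, targets[name])))

targets = {"projection screen": (4.0, 0.0), "participant B": (1.0, 2.0), "whiteboard": (-2.0, 1.0)}
print(nearest_focus_target((0.0, 0.0), head_yaw_rad=math.radians(60), targets=targets))
# -> "participant B" (direction of (1, 2) is ~63 degrees)
```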

    14:30 – 14:50

    D2O2.4 Context-based Recognition during Human Interactions: Automatic Feature

    Selection and Encoding Dictionary

    Louis-Philippe Morency, Jonathan Gratch, USC Institute for Creative Technologies

    During face-to-face conversation, people use visual feedback such as head nods to communicate relevant information and to synchronize rhythm between participants. In this paper we describe how contextual information from other participants can be used to predict visual feedback and improve recognition of head gestures in human-human interactions. The main challenges addressed in this paper are optimal


    feature representation using an encoding dictionary and automatic selection of the optimal feature-encoding pairs. We evaluate our approach on a dataset involving 78 human participants. Using a discriminative approach to multi-modal integration, our context-based recognizer significantly improves head gesture recognition performance over a vision-only recognizer.
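
    The abstract does not spell out its encoding dictionary, so the following is only a schematic illustration of the general idea: each contextual signal is passed through a small set of candidate encodings, and the feature-encoding pairs that best predict the head-gesture labels are kept. The encoding templates, the correlation-based score and all names are assumptions made for illustration, not the authors' method.

        import numpy as np

        # Hypothetical encoding dictionary: ways of turning a raw contextual signal
        # (e.g. the other participant's speech activity) into a feature stream.
        ENCODINGS = {
            "binary": lambda x: (x > 0).astype(float),
            "step":   lambda x: np.maximum.accumulate((x > 0).astype(float)),
            "ramp":   lambda x: np.convolve((x > 0).astype(float),
                                            np.linspace(1.0, 0.0, 10), mode="same"),
        }

        def select_pairs(context_features, labels, k=2):
            """Score every (feature, encoding) pair against the head-gesture labels
            with a simple correlation and keep the k best-scoring pairs."""
            labels = np.asarray(labels, dtype=float)
            scored = []
            for fname, signal in context_features.items():
                for ename, encode in ENCODINGS.items():
                    encoded = encode(np.asarray(signal, dtype=float))
                    score = abs(np.nan_to_num(np.corrcoef(encoded, labels)[0, 1]))
                    scored.append((score, fname, ename))
            return sorted(scored, reverse=True)[:k]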

    TS2: Tuesday, October 21, 15:15 – 15:30
    TS2 Teaser Session

    Time: Tuesday, October 21, 15:15 – 15:30
    Place: Main Hall

    Chair: Joakim Gustafson

    Demo Session: Tuesday, October 21, 15:30 – 17:00
    DS1 Demo Session

    Time: Tuesday, October 21, 15:30 – 17:00

    Place: Poster Hall

    Chair: Joakim Gustafson

    DS1.1 AcceleSpell, a Gestural Interactive Game to Learn and Practice Finger Spelling

    José L. Hernandez-Rebollar, José D. Alanis-Urquieta, Universidad Tecnológica de Puebla; Ethar Ibrahim Elsakay, Institute for Disabilities Research and Training Inc.

    In this paper, an interactive computer game for learning and practicing continuous fingerspelling is described. The game is controlled by an instrumented glove known as the AcceleGlove and a recognition algorithm based on decision trees. The Graphical User Interface is designed to allow beginners to remember the correct hand shapes and start finger spelling words sooner than with traditional methods of learning.

    DS1.2 A multi-modal spoken dialog system for interactive TV

    R. Balchandran, M. Epstein, G. Potamianos, and L. Seredi, IBM T. J. Watson Research Center

    In this demonstration we present a novel prototype system that implements a multi-modal interface for control of the television. This system combines the standard TV remote control with a dialog-management-based natural language speech interface to allow users to efficiently interact with the TV, and to seamlessly alternate between the two modalities. One of the main objectives of this system is to make the unwieldy Electronic Program Guide information more navigable by the use of voice to filter and locate programs of interest.

    DS1.3 Multimodal Slideshow: Demonstration of the OpenInterface Interaction

    Development Environment

    David Juras, Laurence Nigay, Michael Ortega, Marcos Serrano, University of Grenoble

    In this paper, we illustrate the OpenInterface Interaction Development Environment (OIDE), which addresses the design and development of multimodal interfaces. Multimodal interaction software development presents a particular challenge because of the ever-increasing number of novel interaction devices and modalities that can be used for a given interactive application. To demonstrate our graphical OIDE and its underlying approach, we present a multimodal slideshow implemented with our tool.

    DS1.4 A Browser-based Multimodal Interaction System

    Kouichi Katsurada, Teruki Kirihata and Masashi Kudo, Toyohashi University of Technology

    In this paper, we propose a system that enables users to have multimodal interactions (MMI) with an anthropomorphic agent via a web browser. By using the system, a user can interact simply by accessing a web site from his/her web browser. A notable characteristic of the system is that the anthropomorphic agent is synthesized from a photograph of a real human face. This makes it possible to construct a web site whose owner's facial agent speaks with visitors to the site. This paper describes the structure of the

    system and provides a screen shot.

    DS1.5 iGlasses: An Automatic Wearable Speech Supplement in Face-to-Face

    Communication and Classroom Situations

    Dominic W. Massaro, University of California, Santa Cruz; Miguel Á. Carreira-Perpiñán, University of California, Merced; David J. Merrill, Massachusetts Institute


    of Technology; Cass Sterling, Stephanie Bigler and Elise Piazza, University of California, Santa Cruz

    The need for language aids is pervasive in today's world. There are millions of individuals who have language and speech challenges, and these individuals require additional support for communication and language learning. We demonstrate technology to supplement common face-to-face language interaction to enhance intelligibility, understanding, and communication, particularly for those with hearing impairments. Our research is investigating how to automatically supplement talking faces with information that is ordinarily conveyed by auditory means. This research consists of two areas of inquiry: 1) developing a neural network to perform real-time analysis of selected acoustic features for visual display, and 2) determining how quickly participants can learn to use these selected cues and how much they benefit from them when combined with speechreading.

    DS1.6 Innovative interfaces in MonAMI: The Reminder

    Jonas Beskow, Jens Edlund, Björn Granström, Teodor Germani, Joakim Gustafson, Gabriel Skantze, KTH Speech Music & Hearing; Oskar Jonsson, Swedish Institute of Assistive Technology; Helena Tobiasson, KTH Human Computer Interaction

    This demo paper presents an early version of the Reminder, a prototype ECA developed in the European project MonAMI, which aims at mainstreaming accessibility in consumer goods and services, using advanced technologies to ensure equal access, independent living and participation for all. The Reminder helps users to plan activities and to remember what to do. The prototype merges mobile ECA technology with other, existing technologies: Google Calendar and a digital pen and paper. The solution allows users to continue using a paper calendar in the manner they are used to, whilst the ECA provides notifications on what has been written in the calendar. Users may ask questions such as "When was I supposed to meet Sara?" or "What's my schedule today?"

    DS1.7 PHANTOM Prototype: exploring the potential for learning with multimodal

    features in dentistry

    Jonathan P. San Diego, Margaret Cox, Alastair Barrow and William Harwin, King's College London

    In this paper, we will demonstrate how force feedback, motion parallax and stereoscopic vision can enhance the opportunities for learning in the context of dentistry. A dental training workstation prototype has been developed, intended for use by dental students in their introductory course on preparing a tooth cavity. The multimodal feedback from haptics, motion-tracking cameras, and computer-generated sound and graphics is being exploited to provide 'near-realistic' learning experiences. Whilst the empirical evidence provided is preliminary, we describe the potential of multimodal interaction via these technologies for enhancing dental-clinical skills.

    Show and Tell: Tuesday, October 21, 15:30 – 17:00
    ST1 Show and Tell

    Time: Tuesday, October 21, 15:30 – 17:00

    Place: Main Hall

    Chair: Joakim Gustafson

    ST1.1 A Demonstration of Crossmodal Congruence: The Look, Feel and Sound of

    Touchscreen Widgets

    Eve Hoggan, University of Glasgow, UK; Topi Kaaresoja, Pauli Laitinen, Nokia Research Center; Stephen Brewster, University of Glasgow

    How can visual, audio and tactile feedback be combined in a congruent manner for use in touchscreen GUI buttons? Using multiple modalities (audio, visual, tactile) together can create much richer sets of feedback as opposed to using a single modality. It is well known that adding feedback from one modality to another modality (e.g. adding audio to tactile) can significantly alter perception of the feedback. For instance, by simply changing the timbre of the audio, the perception of tactile texture can change from metallic to soft without ever changing the tactile feedback. Our experiment investigated methods of combining visual, audio and tactile feedback by manipulating different parameters of each modality in order to produce congruent sets of feedback. The demo will involve various different sets of congruent touchscreen buttons (metallic, plastic, rounded, flat, rough, spongy) displayed on the N770 augmented

    with piezo-electric actuators.

    ST1.2 Affective Computer-Aided Learning for Autistic Children
    A. Luneski, E. I. Konstantinidis, M. Hitoglou-Antoniadou and P. D. Bamidis, Lab of


    Medical Informatics, Medical School, Aristotle University of Thessaloniki; Lab of Speech and Communication Disorders, 1st ENT Clinic, Medical School, Aristotle Univ. of Thessaloniki

    Autism is a mental disability that requires early intervention, educating autistic children in everyday social, communication and reasoning skills. Computer-aided learning (CAL) has recently been considered the most successful educational method, and various CAL systems have been developed. In this paper we examine the existing CAL systems and platforms and discuss the benefits of adding an affective/emotional dimension to the interaction process between the CAL system and the autistic person. We present our work on a CAL system that is based on affective avatar interaction, as well as a personalisation database containing user profiles and records of the educational process. The system allows the educator not only to personalise the system for each user, but also to exploit records of the learning progress for further statistical analysis. Pilot and acceptability studies in a school for autistic persons are planned for the immediate future.

    ST1.3 Audio-Visual cues for Dominance Estimation

    Hayley Hung, Dinesh Babu Jayagopi, Daniel Gatica-Perez, IDIAP Research Institute, Switzerland and EPFL, Switzerland

    Our demo will show different cues that we have used for estimating dominance in multi-party meetings. These will range from simpler features, such as speaking activity, to more complex features, like the visual focus of attention of participants. The demo will show our estimates synchronised with real video footage of 4-participant meetings.

    ST1.4 Feel-Good Touch: Finding the Most Pleasant Tactile Feedback for a Mobile Touch Screen Button

    Emilia Koskinen, Topi Kaaresoja, Pauli Laitinen, Nokia Research Center

    Earlier research has shown the benefits of tactile feedback for touch screen widgets in all metrics: performance, usability and user experience. In our current research the goal was to go deeper in understanding the characteristics of a tactile click for virtual buttons. More specifically, we wanted to find a tactile click which is the most pleasant to use with a finger. We used two actuator solutions in a small mobile touch screen: piezo actuators or a standard vibration motor. We conducted three experiments: the first and second experiments aimed to find the most pleasant tactile feedback produced with the piezo actuators or a vibration motor, respectively, and the third one combined and compared the results from the first two experiments. The results from the first two experiments showed significant differences in the perceived pleasantness of the tactile clicks, and we used these most pleasant clicks in the comparison experiment, in addition to a condition with no tactile feedback. Our findings confirmed results from earlier studies showing that tactile feedback is superior to a non-tactile condition when virtual buttons are used with the finger, regardless of the technology behind the tactile feedback. Another finding suggests that users perceived the feedback produced with piezo actuators as slightly more pleasant than the vibration-motor-based feedback, although the difference was not statistically significant. These results indicate that it is possible to tune the characteristics of virtual button tactile clicks towards the most pleasant ones, and this knowledge can help designers to create better touch screen virtual buttons and keyboards.

    ST1.5 HMM-Based Synthesis of Child Speech

    Oliver Watts, Junichi Yamagishi, Kay Berkling, Simon King, University of Edinburgh, UK & Polytechnic University of Puerto Rico

    The synthesis of child speech presents challenges both in the collection of data and in the building of a synthesiser from that data. Because only limited data can be collected, and the domain of that data is constrained, it is difficult to obtain the type of phonetically-balanced corpus usually used in speech synthesis. As a consequence, building a synthesiser from this data is difficult. Concatenative synthesisers are not robust to corpora with many missing units (as is likely when the corpus content is not carefully designed), so we chose to build a statistical parametric synthesiser using the HMM-based system HTS. This technique has previously been shown to perform well for limited amounts of data, and for data collected under imperfect conditions. We compared 6 different configurations of the synthesiser, using both speaker-dependent and speaker-adaptive modelling techniques, and using varying amounts of data. The output from these systems was evaluated alongside natural and vocoded speech, in a Blizzard-style listening test.

    ST1.6 Italian Literacy Tutor

    Piero Cosi, ISTC-SPFD CNR

    A newly trained SONIC children's speech recognition model has been integrated into ILT, the Italian version of the Colorado Literacy Tutor platform. Specifically, children's speech recognition research for Italian was conducted using the complete training and test set of the ITC-irst Children's Speech


    Corpus. Using the University of Colorado SONIC LVSR system, we demonstrate a phonetic recognition error rate of 12.0% for a system which incorporates Vocal Tract Length Normalization (VTLN), Speaker-Adaptive Trained phonetic models, as well as unsupervised Structural MAP Linear Regression (SMAPLR). A simple demo of the system in a reading experiment will be shown during the workshop.

    ST1.7 Movie Summarization Using Audio, Visual and Text Saliency

    G. Evangelopoulos, A. Zlatintsi, G. Skoumas*, K. Rapantzikos, A. Potamianos*, P. Maragos, Y. Avrithis, National Technical University of Athens; *Technical University of Crete, Chania

    Detection of perceptually important video events is formulated here on the basis of saliency models for the audio, visual and textual information conveyed in a video stream. Audio saliency is assessed by cues that quantify multifrequency waveform modulations, extracted through nonlinear operators and energy tracking. Visual saliency is measured through a spatiotemporal attention model driven by intensity, color and motion. Text saliency is extracted from part-of-speech tagging on the subtitle information available with most movie distributions. The various modality curves are integrated into a single attention curve, where the presence of an event may be signified in one or more domains. This multimodal saliency curve (MSC) is the basis of a bottom-up video summarization algorithm that refines results from unimodal or audiovisual-based skimming. The algorithm performs favorably for video summarization in terms of informativeness and enjoyability.
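
    A minimal sketch of the kind of fusion described above: three per-frame saliency curves are normalised and combined into one attention curve, whose peaks select summary segments. The linear weighting and all parameter values are illustrative assumptions, not the authors' algorithm.

        import numpy as np

        def multimodal_saliency(audio, visual, text, weights=(1/3, 1/3, 1/3)):
            """Fuse per-frame saliency curves into a single attention curve.
            Each curve is min-max normalised so no modality dominates by scale."""
            def norm(c):
                c = np.asarray(c, dtype=float)
                span = c.max() - c.min()
                return (c - c.min()) / span if span > 0 else np.zeros_like(c)
            wa, wv, wt = weights
            return wa * norm(audio) + wv * norm(visual) + wt * norm(text)

        def skim(curve, keep=0.2):
            """Crude summary: indices of the top `keep` fraction of frames."""
            k = max(1, int(len(curve) * keep))
            return np.sort(np.argsort(curve)[-k:])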

    ST1.8 Software Platform for Assisted Simulation of Novel User Interfaces, Replay and

    Annotation of Rich Multimodal Interaction Logs
    Jérôme Simonin, Noëlle Carbonell, INRIA; Danielle Pelé, France Télécom R&D

    An empirical study is presented which aims at assessing the possible effects of embodiment on online help effectiveness and attraction. To simulate the intelligence of an online help agent embodied as a talking head, we used the Wizard of Oz technique. A software platform (a client-server platform in Java) was developed for recording participants' interactions with a Windows software application and either help system. To assist the Wizard in his/her task, the platform can: forward displays on the participant's screen to the Wizard; display messages selected by her/him on the participant's screen; and assist the Wizard in the simulation of the help system by displaying, on selection of a message, a log (or history) of the various versions of this message received earlier by the current participant. Logs include time-stamped user and system events, mouse positions and clicks, screen copies and gaze samples. They can be "replayed" and annotated semi-automatically, so an in-depth analysis of participants' interactions and gaze activity can be done using this platform. I will need one screen to plug in my netbook.

    Keynote Address: Wednesday, October 22, 09:00 – 10:00
    KA Keynote Address – George Drettakis

    Time: Wednesday, October 22, 09:00 – 10:00

    Place: Main Hall

    Chair: Yuri Ivanov

    Audiovisual 3D rendering as a tool for Multimodal Interfaces

    George Drettakis, INRIA Sophia-Antipolis

    In this talk, we will start with a short overview of 3D audiovisual rendering and its applicability to multimodal interfaces. In recent years, we have seen the generalization of 3D applications, ranging from computer games, which involve a high level of realism, to applications such as Second Life, in which the visual and auditory quality of the 3D environment leaves much to be desired. In our introduction we will attempt to examine the relationship between the audiovisual rendering of the environment and the interface. We will then review some of the audio-visual rendering algorithms we have developed in the last few years. We will discuss four main challenges we have addressed. The first is the development of realistic illumination and shadow algorithms, which contribute greatly to the realism of 3D scenes but could also be important for interfaces. The second involves the application of these illumination algorithms to augmented reality settings. The third concerns the development of perceptually-based techniques, and in particular the use of audio-visual cross-modal perception. The fourth challenge has been the development of approximate but "plausible", interactive solutions to more advanced rendering effects, both for graphics and audio. On the audio side, our review will include the introduction of clustering, masking and perceptual rendering for 3D spatialized audio, and our recently developed solution for the treatment of contact sounds. On the graphics side, our discussion will include a quick overview of our illumination and shadow work, its application to augmented reality, and our work on interactive rendering approximations and perceptually driven algorithms. For all these techniques we will discuss their relevance to multimodal interfaces, including our experience in an urban design


    case study. We will also attempt to relate these techniques to recent interface research. We will close with a broad reflection on the potential for closer collaboration between 3D audiovisual rendering and

    multimodal interfaces.

    D3O1: Wednesday, October 22, 10:30 – 12:00
    D3O1 Multimodal Interfaces II (Oral Session)

    Time: Wednesday, October 22, 10:30 – 12:00

    Place: Main Hall
    Chair: Yuri Ivanov

    10:30 – 10:50

    D3O1.1 Multimodal Presentation and Browsing of Music

    David Damm, Christian Fremerey, University of Bonn; Frank Kurth, Research Establishment for Applied Science; Meinard Müller, Max-Planck-Institut für Informatik; Michael Clausen, University of Bonn

    Recent digitization efforts have led to large music collections, which contain music documents of various modes comprising textual, visual and acoustic data. In this paper, we present a multimodal music player for presenting and browsing digitized music collections consisting of heterogeneous document types. In particular, we concentrate on music documents of two widely used types for representing a musical work, namely visual music representations (scanned images of sheet music) and associated interpretations (audio recordings). We introduce novel user interfaces for multimodal (audio-visual) music presentation as well as intuitive navigation and browsing. Our system offers high-quality audio playback with time-synchronous display of the digitized sheet music associated with a musical work. Furthermore, our system enables a user to seamlessly crossfade between various interpretations belonging to the currently selected musical work.
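
    A toy illustration of the audio-to-score synchronisation such a player relies on: given a precomputed alignment of playback times to sheet-music regions, look up the region to highlight for the current playback position. The data layout and names are illustrative assumptions, not details of the authors' system.

        import bisect

        # Hypothetical precomputed alignment: playback time (s) -> (page, measure).
        alignment = [(0.0, (1, 1)), (2.4, (1, 2)), (4.9, (1, 3)), (7.1, (2, 4))]
        times = [t for t, _ in alignment]

        def score_region_at(playback_time):
            """Return the sheet-music region to highlight at the given audio time."""
            i = bisect.bisect_right(times, playback_time) - 1
            return alignment[max(i, 0)][1]

        print(score_region_at(5.3))  # -> (1, 3)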

    10:50 – 11:10

    D3O1.2 An Audio-Haptic Interface Based on Auditory Depth Cues

    Delphine Devallez, Federico Fontana, University of Verona; Davide Rocchesso, IUAV of Venice

    Spatialization of sound sources in depth allows a hierarchical display of multiple audio streams and therefore may be an efficient tool for developing novel auditory interfaces. In this paper we present an audio-haptic interface for audio browsing based on rendering distance cues for ordering sound sources in depth. The haptic interface includes a linear position tactile sensor made of conductive material. The touch position on the ribbon is mapped onto the listening position on a rectangular virtual membrane, modeled by a bidimensional Digital Waveguide Mesh and providing distance cues for four equally spaced sound sources. Furthermore, a knob of a MIDI controller controls the position of the mesh along the playlist, which allows the user to browse the whole set of files. Subjects involved in a user study found the interface intuitive and entertaining. In particular, the interaction with the stripe was highly appreciated.

    11:10 – 11:30

    D3O1.3 Detection and Localization of 3D Audio-Visual Objects Using Unsupervised

    Clustering

    Vasil Khalidov, INRIA Rhône-Alpes and Université Joseph Fourier; Florence Forbes, Miles Hansard, INRIA Rhône-Alpes; Elise Arnaud, INRIA Rhône-Alpes and Université Joseph Fourier; Radu Horaud, INRIA Rhône-Alpes

    This paper addresses the issues of detecting and localizing objects in a scene that are both seen and heard. We explain the benefits of a human-like configuration of sensors (binaural and binocular) for gathering auditory and visual observations. It is shown that the detection and localization problem can be recast as the task of clustering the audio-visual observations into coherent groups. We propose a probabilistic generative model that captures the relations between audio and visual observations. This model maps the data into a common audio-visual 3D representation via a pair of mixture models. Inference is performed by a version of the expectation-maximization algorithm, which is formally derived, and which provides cooperative estimates of both the auditory activity and the 3D position of each object. We describe several experiments with single- and multiple-speaker detection and localization, in the presence of other audio sources.
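
    As a generic sketch of this kind of formulation (not the authors' exact model), each hypothesised audio-visual object n, with unknown 3D position s_n, acts as one component of a pair of mixture models, one over visual observations v_i and one over auditory observations a_j, plus an outlier component:

        \[
        p(\mathbf{v}_i) = \sum_{n=1}^{N} \pi_n\,\mathcal{N}\big(\mathbf{v}_i \mid f(\mathbf{s}_n), \Sigma_n\big) + \pi_0\,\mathcal{U}(\mathbf{v}_i),
        \qquad
        p(a_j) = \sum_{n=1}^{N} \rho_n\,\mathcal{N}\big(a_j \mid g(\mathbf{s}_n), \sigma_n^2\big) + \rho_0\,\mathcal{U}(a_j),
        \]

    where f(.) maps an object position into the binocular observation space and g(.) maps it to a binaural cue such as an interaural time difference. The E-step computes soft assignments of each observation to each object, and the M-step re-estimates every s_n from its responsibility-weighted audio and visual observations jointly, which is what couples the two modalities.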

    11:30 – 11:50

    D3O1.4 Robust Gesture Processing for Multimodal Interaction
    Srinivas Bangalore, Michael Johnston, AT&T Labs Research


    With the explosive growth in mobile computing and communication over the past few years, it is possible to access almost any information from virtually anywhere. However, the efficiency and effectiveness of this interaction are severely limited by the inherent characteristics of mobile devices, including small screen size and the lack of a viable keyboard or mouse. This paper concerns the use of multimodal language processing techniques to enable interfaces combining speech and gesture input that overcome these limitations. Specifically, we focus on robust processing of pen gesture inputs in a local search application and demonstrate that edit-based techniques that have proven effective in spoken language processing can also be used to overcome unexpected or errorful gesture input. We also examine the use of a bottom-up gesture aggregation technique to improve the coverage of multimodal understanding.
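
    The abstract does not detail its edit-based technique, so the following is only a generic illustration of the underlying idea: compute an edit distance between an observed gesture-symbol sequence and the sequences an interpreter expects, and recover by matching to the closest interpretable pattern. All symbols and names are assumptions for illustration.

        def edit_distance(observed, expected):
            """Levenshtein distance between two gesture-symbol sequences."""
            m, n = len(observed), len(expected)
            d = [[0] * (n + 1) for _ in range(m + 1)]
            for i in range(m + 1):
                d[i][0] = i
            for j in range(n + 1):
                d[0][j] = j
            for i in range(1, m + 1):
                for j in range(1, n + 1):
                    cost = 0 if observed[i - 1] == expected[j - 1] else 1
                    d[i][j] = min(d[i - 1][j] + 1,         # deletion
                                  d[i][j - 1] + 1,         # insertion
                                  d[i - 1][j - 1] + cost)  # substitution
            return d[m][n]

        # Match a noisy observed sequence to the closest interpretable pattern.
        observed = ["point", "line", "scribble"]
        patterns = [["point", "point"], ["area"], ["point", "line", "point"]]
        print(min(patterns, key=lambda p: edit_distance(observed, p)))
        # -> ['point', 'line', 'point']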

    D3O2: Wednesday, October 22, 13:30 – 15:00
    D3O2 Multimodal Modeling (Oral Session)

    Time: Wednesday, October 22, 13:30 – 15:00

    Place: Main Hall

    Chair: Michael Johnston

    13:30 – 13:50

    D3O2.1 Investigating automatic dominance estimation in groups from visual attention and

    speaking activity
    Hayley Hung, Sileye Ba, Idiap Research Institute; Dinesh Jayagopi, Jean-Marc Odobez, Daniel Gatica-Perez, Idiap Research Institute and Ecole Polytechnique Fédérale de Lausanne

    We study the automation of the visual dominance ratio (VDR), a classic measure of displayed dominance in the social psychology literature, which combines both gaze and speaking activity cues. The VDR is modified to estimate dominance in multi-party group discussions where natural verbal exchanges are possible and other visual targets, such as a table and slide screen, are present. Our findings suggest that fully automated versions of these measures can effectively estimate the most dominant person in a meeting and can match the dominance estimation performance obtained when manual labels of visual attention are used.
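
    For reference, the classic VDR from the social psychology literature (not the paper's multi-party adaptation) compares looking behaviour while speaking with looking behaviour while listening:

        \[
        \mathrm{VDR} \;=\; \frac{\%\ \text{of speaking time spent looking at the interlocutor}}{\%\ \text{of listening time spent looking at the interlocutor}},
        \]

    with higher values read as more displayed dominance; the paper's contribution is estimating the required gaze and speaking-activity quantities automatically in meetings with additional visual targets.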

    13:50 – 14:10

    D3O2.2 Dynamic modality weighting for multi-stream HMMs in Audio-Visual Speech Recognition

    Mihai Gurban, Ecole Polytechnique Fédérale de Lausanne; Thomas Drugman, Faculté Polytechnique de Mons; Jean-Philippe Thiran, Ecole Polytechnique Fédérale de Lausanne; Thierry Dutoit, Faculté Polytechnique de Mons

    Merging decisions from different modalities is a crucial problem in Audio-Visual Speech Recognition. To solve this, state-synchronous multi-stream HMMs have been proposed for their important advantage of incorporating stream reliability in their fusion scheme. This paper focuses on stream weight adaptation based on modality confidence estimators. We assume different and time-varying environment noise, as can be encountered in realistic applications, and, for this, adaptive unsupervised methods are best suited. Stream reliability is assessed directly through classifier outputs, since they are not specific to either noise type or level. The relative importance of transition probabilities with regard to stream

    likelihoods in this framework is also discussed.
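
    As background, the standard state-synchronous multi-stream formulation (a textbook construction rather than a detail quoted from this paper) combines the audio and visual streams through exponent weights:

        \[
        b_j(\mathbf{o}_t) \;=\; \prod_{s \in \{A,V\}} b_{j,s}(\mathbf{o}_{s,t})^{\lambda_{s,t}},
        \qquad \lambda_{A,t} + \lambda_{V,t} = 1, \quad \lambda_{s,t} \ge 0,
        \]

    so that in the log domain the per-stream log-likelihoods are combined linearly, and dynamic weighting amounts to adapting the exponents over time from a confidence estimate of each modality.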

    14:10 – 14:30

    D3O2.3 A Fitts Law Comparison of Eye Tracking and Manual Input in the Selection of

    Visual Targets

    Roel Vertegaal, Julian Lepinski, Human Media Lab, Queen's University

    We present a Fitts Law evaluation of a number of eye tracking and manual input devices in the selection of large visual targets. We compared the performance of two eye tracking techniques, manual click and dwell time click, with that of mouse and stylus. Results show eye tracking with manual click outperformed the mouse by 16%, with dwell time click 46% faster. However, eye tracking conditions suffered a high error rate of 11.7% for manual click and 43% for dwell time click conditions. After Welford correction, eye tracking still appears to outperform manual input, with IPs of 13.8 bits/s for dwell time click and 10.9 bits/s for manual click. Eye tracking with manual click provides the best tradeoff between speed and accuracy, and was preferred by 50% of participants. Mouse and stylus had IPs of 4.7 and 4.2 bits/s, respectively. However, their low error rate of 5% makes these techniques more


    suitable for refined target selection.
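
    For readers unfamiliar with the throughput figures above, the Shannon formulation of Fitts' law (a standard relation, not specific to this paper) is:

        \[
        MT = a + b\,\mathrm{ID}, \qquad \mathrm{ID} = \log_2\!\Big(\frac{A}{W} + 1\Big), \qquad \mathrm{IP} = \frac{\mathrm{ID}}{MT},
        \]

    where A is the movement amplitude, W the target width, MT the movement time, and the index of performance IP (in bits/s) is the throughput quoted for each device.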

    14:30 – 14:50

    D3O2.4 A Wizard of Oz Study for an AR Multimodal Interface

    Minkyung Lee, Mark Billinghurst, HIT Lab NZ, University of Canterbury

    In this paper we describe a Wizard of Oz (WOz) user study of an Augmented Reality (AR) interface that uses multimodal input (MMI) with natural hand interaction and speech commands. Augmented Reality technology creates the illusion that virtual objects are part of the user's real environment, so the goal of AR systems is to provide users with information-enhanced environments with a seamless connection between the real and virtual worlds. To achieve this, we need to consider not only accurate tracking and registration to align real and virtual objects, but also an interface which supports interaction in the real and virtual worlds. There has been a significant amount of research on AR interaction techniques, but there has been almost no research on multimodal interfaces for AR, and none of this research has been based on a Wizard of Oz study. Accordingly, our goal is to use a WOz study to help create a multimodal AR interface which is most natural to the user. In this study we wanted to learn how users would issue multimodal commands, and how different AR display conditions would affect those commands, when they did not have a set of given commands but had perfect speech and gesture recognition.

    We used three virtual object arranging tasks with two different display types (a head mounted display and a desktop monitor) to see how users used multimodal commands, as well as how different AR display conditions affect those commands. The three tasks were (1) changing the colour and shape of simple primitives and copying them to a target object configuration, (2) moving sample objects distributed in 3D space into a final arrangement of objects, and (3) creating a virtual scene by arranging detailed models as users want. Subjects filled out surveys after each condition and their performance was videoed for later analysis. We also interviewed subjects to get additional comments from them.

    The results provided valuable insights into how people naturally interact in a multimodal AR scene assembly task. Video analysis showed that the main types of speech input were words for colour and object shape; 74% of all speech commands were phrases of a few discrete words while only 26% were complete sentences. We also found that the main classes of gestures were deictic (65%) and metaphoric (35%) gestures. Commands that combined speech and gesture input were 63% of the total number of commands, whereas gesture-only commands were 34% and speech-only input was 3.7%. This implies that multimodal AR interfaces for object manipulation will rely heavily on accurate recognition of users' input gestures; almost 97% of commands involved some gesture input. We also found that, overall, 94% of the time gesture commands were issued before the corresponding speech input in a multimodal command in the AR environment. When considering fusion of speech and gesture commands, we defined the time frame for combining gesture and speech input and found an optimal time window of 7.9 seconds, which would capture 98% of combined speech and gesture input that are related to each other. We also found that display type did not produce a significant difference in the type of commands used, although users felt that the screen-based AR application provided a better experience. Using these results, we present design recommendations for multimodal interaction in AR environments which will be useful for others trying to develop multimodal AR interfaces. In the future we will use these WOz results to create a functioning multimodal AR interface.
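
    A minimal sketch of time-window-based fusion of the kind suggested by the 7.9-second finding: a gesture event is paired with the first speech event that starts within the window after it (gestures tend to lead). The event representation and the pairing rule are illustrative assumptions, not the study's implementation.

        def fuse(gestures, speech, window=7.9):
            """Pair each gesture with the first unused speech event that starts
            within `window` seconds after the gesture onset."""
            fused, used = [], set()
            for g_time, g_label in gestures:
                for i, (s_time, s_label) in enumerate(speech):
                    if i not in used and 0.0 <= s_time - g_time <= window:
                        fused.append((g_label, s_label, round(s_time - g_time, 2)))
                        used.add(i)
                        break
            return fused

        gestures = [(1.0, "point_at(obj3)"), (12.5, "grab(obj1)")]
        speech   = [(3.2, "make it red"), (14.0, "put it on the table")]
        print(fuse(gestures, speech))
        # -> [('point_at(obj3)', 'make it red', 2.2), ('grab(obj1)', 'put it on the table', 1.5)]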

    TS3: Wednesday, October 22, 15:20 – 15:30
    TS3 Teaser Session

    Time: Wednesday, October 22, 15:20 – 15:30

    Place: Main Hall

    Chair: Alexandros Potamianos

    D3P1: Wednesday, October 22, 15:30