
EU FP7-ICT-2011.2.1

ICT for Cognitive Systems and Robotics - 600796

Work Package 2: Human Behaviour Analysis and Mobility Assistance Models

Deliverable D2.2: Multimodal sensory corpora annotations

Release date: 15-09-2014

Status: public

Document name: MOBOT_WP2_D2.2

WP2: Work Package Nr. D2.2: Deliverable Number (compare DoW)


EXECUTIVE SUMMARY

This deliverable describes the procedures followed during the post-processing of the multisensory data, as well as the annotation of the visual and audio data acquired during the recording/measurement sessions that took place at the Agaplesion Bethanien Hospital/Geriatric Centre of the University of Heidelberg in November 2013.

The data from the motion capture system were processed in order to gather precise

kinematic information from every action of the MOBOT patient related to the use of

and interaction with the MOBOT passive-rollator device. Such data contribute to the

understanding of elderly-specific motion sequences and general behaviour in MOBOT

related scenarios and also serve as an important input for other research areas involved

in the MOBOT project, such as gait pattern analysis and classification, motion

recognition, safety analysis, on-line control and optimization, as well as the mechanical

design of the device.

The post-processing of motion capture data generally includes two main steps: labelling

and cleaning the raw data in Qualisys software, and extracting human motion data in

Visual3D software. The use of the image-based motion capture system (Qualisys) was

preferred over the initially planned IMU-based system (XSens) since it proved to

provide much more precise results and could be adjusted in accordance with the other

sensors involved in the trials. However, the use of the image-based system in the post-processing of the recorded data required much more effort than had been anticipated for the IMU-based recordings.

The post-processing of the audiovisual data serves two goals: firstly, the synchronization of all media; secondly, the in-depth annotation of the data, which provides timestamps for actions, gestures and speech as well as for audio and visual noise, supplying all technical partners with measurable information about the content of the acquired data. The analysis, testing and implementation of these data will become an important source for the different modules of the MOBOT robotic platform.

In the following sections of this deliverable there is first a brief introduction regarding the multisensory data; subsequently, a section focusing on the post-processing of the acquired data offers some quantitative information and describes the synchronization procedure for all the multisensory data as well as the creation of the PiP video files for the annotation procedure. The third and final section of this deliverable focuses on the conception and realisation of the audiovisual data annotation schemes.


Deliverable Identification Sheet

IST Project No.: FP7 – ICT for Cognitive Systems and Robotics - 600796
Acronym: MOBOT
Full title: Intelligent Active MObility Assistance RoBOT integrating Multimodal Sensory Processing, Proactive Autonomy and Adaptive Interaction
Project URL: http://www.mobot-project.eu
EU Project Officer: Michel Brochard
Deliverable: D2.2 Multimodal sensory corpora annotations
Work package: WP2 Human Behaviour Analysis and Mobility Assistance Models
Date of delivery: Contractual M18, Actual 15-09-2014
Status: Final
Nature: Other
Dissemination Level: Public
Authors (Partner): Evita Fotinea (ATHENA), Athanasia-Lida Dimou (ATHENA)
Responsible Author: Athanasia-Lida Dimou, Email <[email protected]>, Partner ATHENA, Phone +30-210-6875358
Keywords: Data synchronization, data post-processing, multimodal sensory corpora annotation

Version Log

Issue Date | Rev. No. | Author | Change
15-07-2014 | 1.0 | Evita Fotinea | First draft
17-07-2014 | 1.1 | Athanasia-Lida Dimou | Added ILSP data
25-07-2014 | 1.2 | Angelika Peer, Milad Geravand | Added TUM data
25-07-2014 | 1.3 | Khai-Long Ho Hoang | Added UHEI data
26-07-2014 | 1.4 | Athanasia-Lida Dimou, Panagiotis Karioris, Theodore Goulas | Text & figures, finalization of collaborative text on annotation tiers from ICCS, INRIA
28-07-2014 | 2.0 | Evita Fotinea, Eleni Efthimiou | Finalization
01-08-2014 | 2.1 | Davide Dorradi | Internal reviewer's comments
17-08-2014 | 2.1 | Iassonas Kokkinos | Internal reviewer's comments
28-08-2014 | 2.2 | Athanasia-Lida Dimou, Evita Fotinea, Eleni Efthimiou | Internal reviewers' comments, processing & modification, finalization

TABLE OF CONTENTS

Executive Summary
1. Introduction
   1.1. Creating multimodal sensory corpora
2. Data Post-Processing
   2.1. Quantitative information of the obtained multisensory data
   2.2. Synchronization of the acquired multimodal sensory corpus
   2.3. Post-Processing for the motion capture system
      2.3.1. Visual3D
   2.4. Video post-processing for PiP file creation
3. Annotation schemes for the audiovisual data
   3.1. Introduction: Aims, scope
   3.2. Annotation Scheme: From generic to specific
      3.2.1. Generic
      3.2.2. Specific
4. Conclusion
References

LIST OF FIGURES

Figure 2-1: Motion capture markers on one of the patients of the MOBOT Recordings
Figure 2-2: Motion capture data exported from Qualisys (left), and its corresponding biomechanical human model with segment definition (right)
Figure 2-3: Coordinate frames associated with the human body segments
Figure 3-1: Sample of an annotation PiP video (Scenario 3, Patient 6)

LIST OF TABLES

Table 2-1: Total number of files by type of visual sensor for annotation
Table 3-1: Duration and number of files to be annotated per scenario/variant


LIST OF ABBREVIATIONS

Abbreviation | Description
PR | Public Report
WP | Work Package

Partner Abb. | Description
TUM | TECHNISCHE UNIVERSITAET MUENCHEN
ICCS | INSTITUTE OF COMMUNICATION AND COMPUTER SYSTEMS
INRIA | INSTITUT NATIONAL DE RECHERCHE EN INFORMATIQUE ET EN AUTOMATIQUE
ECP | ÉCOLE CENTRALE DES ARTS ET MANUFACTURES
UHEI | RUPRECHT-KARLS-UNIVERSITAET HEIDELBERG
ILSP / ATHENA RC | INSTITUTE FOR LANGUAGE AND SPEECH PROCESSING / ATHENA RESEARCH AND INNOVATION CENTER IN INFORMATION COMMUNICATION & KNOWLEDGE TECHNOLOGIES
ACCREA | ACCREA BARTLOMIEJ MARCIN STAŃCZYK
BETHANIEN | BETHANIEN KRANKENHAUS - GERIATRISCHES ZENTRUM - GEMEINNUTZIGE GMBH
DIAPLASIS | DIAPLASIS REHABILITATION CENTER SA


1. INTRODUCTION

With the data acquisition during the recording/measurement procedure at the Agaplesion/Bethanien Hospital successfully completed in November 2013, a complete multisensory dataset was available to be utilized. Post-processing included several operations that were indispensable in order to provide the data in different, functional formats. The synchronization of the acquired multimodal sensory corpus was one such operation, described in more detail in the next section.

With reference to the audiovisual data, the previously synchronized data drawn from the

multiple video streams of the MOBOT dataset (4 different visual sources) were

provided in a Picture in Picture (PiP) format suitable for the annotation procedure.

Lastly, the PiP video files along with a modular annotation template – adjustable to the

needs of each scenario/variant – were provided to the annotators so that they could

manually annotate the audiovisual data that would set the basis for the HRI

communication model of the project.

1.1. Creating multimodal sensory corpora

The proposal on the recording scenarios for the multimodal sensory corpora was

developed and finalized after discussions among all partners involved in data

acquisition. The planning of these recording scenarios was completed in collaboration

with all MOBOT partners by September 2013; the recording/measurement sessions took

place at BETHANIEN and were completed in November 2013.

The recording/measurement procedure, as well as all the relevant details regarding the data acquisition process (i.e. patients, scenarios, performed tasks, etc.), are extensively presented in the publicly available deliverable D2.1 - Data acquisition and multimodal sensory corpora collection. To keep this document self-contained, we briefly summarize below the three types of sensors that provided the data to be post-processed:

Motion capture system: For the purposes of motion capturing, a Qualisys system

with 8 cameras was used. The cameras were mounted on tripods and placed around

the recording area. Passive reflective markers were attached to the bodies of the patient and the carer to measure their limb movements. The marker set was

specially chosen after taking into account several limiting factors of the recording

population as well as potential supporting areas on the human body (which should

stay free of markers). In order to distinguish between carer and patient, two

additional markers were added to the head of the carer. Further visual markers were

added to objects placed in the acquisition space such as the door, the door frame,

and the obstacle.

Audiovisual data: these included HD cameras, Kinect Cameras and a GoPro

Camera. For the purposes of the recording sessions the following media were used.

o Microphone Arrays: an array of MEMS microphones was obtained for R&D

purposes, even though such devices are not yet a commercial product.

o 4 HD cameras:

Central: It was placed so as to record the patient when walking within the

recording area.


Global: It was placed so as to cover any optical gaps and provide further

information of the patient’s motion and posture as well as details of

manoeuvring.

GoPro: It was set on the passive rollator. The main criterion for the choice of this particular camera was its ability to record, at close range and at a stable distance, the patient's torso and arms and, in some cases, the head as well.

Side: This camera was always placed on one side to supplement the other cameras and provide information they missed. Its position was not predefined and changed according to the different scenarios, so as to achieve the optimum position and offer the best viewing angle.

o 2 Kinect Cameras: We decided on using two Kinect-for-Windows (KFW)

sensors that come equipped with the ‘near mode’ option.

Upper Kinect: The first sensor was facing horizontally towards the patient,

aiming at capturing the area of the torso, waist, hips and the upper part of the

limbs.

Lower Kinect: The second sensor was facing downwards, capturing lower

limb motion, so as to enable the estimation of 3D limb positions and,

eventually, the analysis of gait abnormalities.

o Other sensors: laser range finders and force/torque sensors were mounted on

the rollator.

Laser range finders: two laser range finder sensors are mounted on the

rollator. One is on the front, facing towards the direction of motion, to

provide a full scan of the walking area. The other one is on the back, facing

towards the user's legs, aiming to provide data on the gait of the patient.

Force/torque sensors: two 6 DoF force/torque sensors are placed on the

handles of the rollator.

Below we briefly report on the labour-intensive post-processing of all the available recorded data.

2. DATA POST-PROCESSING

2.1. Quantitative information of the obtained multisensory data

In total, six different scenarios were recorded, each serving a distinct purpose in the data

acquisition process. Each of these six scenarios had three variants; these variants were

specially designed in order to provide the participants with enough flexibility to perform

the easiest variants in case they could not perform all of them. Each variant of each

scenario was designed in a manner that would allow it to be repeated 1-5 times. The

number of trials varied according to the difficulty of each scenario/variant and/or the performance of the individual informants.

Motion capture system: Qualisys motion tracking system with eight infrared

cameras, 48 reflective markers (10 additional markers for static trials), recording

volume: 2.7m x 3m x 2m (width, length, height)

Audiovisual data: HD cameras, Kinect Cameras, GoPro Camera, Microphone

Arrays


Sensor data: laser range sensors.

Table 2-1: Total number of files by type of visual sensor for annotation

2.2. Synchronization of the acquired multimodal sensory corpus

Once the MOBOT multimodal multisensory corpus had been recorded at the premises of the Bethanien/Agaplesion hospital, the video files (from the external HD cameras as well as from the Kinect and GoPro cameras mounted on the sensorized passive rollator) were obtained, and the Kinect raw files were synchronized with the rest of the visual data, providing a synchronization scheme between the external HD data and the RosBag-related multimodal multisensory data.

The 2 HD cameras and the GoPro camera were synchronized to the extracted RGB topic

of the upper Kinect camera. Through this synchronization (HD with RGB topic of the

Upper Kinect) it was possible to have access to all other sensory data within the RosBag

files at specific timestamps, as indicated by the visual sensors.

All visual media were synchronized via a manually triggered flashlight during the recording/measurement sessions. The order in which the sensors started recording created a time offset between the start of the RosBag capture and the flashlight: typically, in the recording sessions the RosBag was initialized first, followed by the initialisation of all audiovisual media (HD cameras and microphone arrays), and afterwards the flashlight was shot manually; this flashlight constitutes the first frame of the PiP video files used for the annotation of the audiovisual data (see below). So, in order to calculate the RosBag initial timestamp, the time offset needed to be added to the timestamp of the flashlight.
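To make the relation between the annotation timeline and the RosBag clock concrete, the sketch below maps a time in the PiP video (whose first frame is the flashlight) onto RosBag timestamps. It is an illustrative Python helper with hypothetical function and parameter names, not part of the MOBOT toolchain; the sign and magnitude of the start-up offset depend on the order in which the sensors were initialized, as described above.

```python
# Illustrative sketch (hypothetical helper names, not part of the MOBOT toolchain):
# relate the PiP/annotation timeline, whose first frame is the flashlight,
# to the RosBag clock.

def flashlight_rosbag_stamp(rosbag_start: float, startup_offset: float) -> float:
    """RosBag timestamp of the flashlight event, given the RosBag start time and
    the start-up offset introduced by the order in which the sensors started."""
    return rosbag_start + startup_offset

def pip_time_to_rosbag_stamp(pip_seconds: float, rosbag_start: float,
                             startup_offset: float) -> float:
    """Convert a PiP-video time (seconds since the flashlight frame) into a
    RosBag timestamp, so the corresponding sensory data can be looked up."""
    return flashlight_rosbag_stamp(rosbag_start, startup_offset) + pip_seconds

# Example: an annotation 12.4 s into the PiP video, with a 3.2 s offset between
# RosBag initialisation and the flashlight, maps to rosbag_start + 15.6 s.
print(pip_time_to_rosbag_stamp(12.4, rosbag_start=0.0, startup_offset=3.2))
```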

2.3. Post-Processing for the motion capture system

The post-processing procedure of the motion capture data is split up into two parts:

Cleaning of raw data and labelling of marker trajectories involving the QTM-

manager software:

The image-based 3D recordings of the trials are cleaned of gaps, phantom markers, flickering and other inconsistencies which occur due to occlusions, reflections, loose clothes of the patient, missing markers, and other unexpected incidents during the recordings. Marker trajectories that have been mismatched by the automatic marker identification algorithms of the software have to be identified and reassigned manually. The corrected marker trajectories are identified and labelled according to a unique nomenclature specifically developed for the MOBOT recordings. Overall, labelling the raw motion capture data is a complex task, as labelling the recording of a single trial can take up to several hours due to the above-mentioned inconsistencies of the recorded data. Despite these inconveniences, the labelling process is still ongoing. The following figure presents a plot of the markers on one of the patients.

Figure 2-1: Motion capture markers on one of the patients of the MOBOT

Recordings

Reconstruction of human model and motion from marker data involving the

Visual3D software:

A 15-segmented model of the human body is generated for each patient by

assigning the segments to marker sets. The consistency of the segment-marker

set assignment depends strongly on the quality of the results from the previous

step. In some cases, virtual markers can be used to replace missing markers in

the recordings. Each assigned segment is associated with biomechanical

parameters that represent the mass and inertia properties of the segment. The model is applied to the static and dynamic recording files of the corresponding patient so that motion data, such as the positions/orientations of the body segments and the relative joint angles, are extracted.

2.3.1. Visual3D

Visual3D is a research software package for 3D motion capture data analysis and modelling, and it is used for the derivation of the patient's motion data. Model definition, application of the model to the motion capture data, and extraction of the desired model-based data are the three main steps to be carried out in Visual3D.

A biomechanical human model including 15 body segments was defined for each patient (see Figure 2-1). Each body segment was defined by using “static trials” and tracked during “movement trials”. To define each body segment, firstly, calibration markers in the static trials were used to associate the orientation of the segment’s coordinate system relative to the tracking markers. Secondly, the segment’s endpoints were defined by considering the mid-point between the medial and lateral markers of each pair of calibration markers, as well as the proper association of the segment’s radius. Thirdly, the frontal plane of each segment was defined by the proper connection of at least three markers constituting a plane. Finally, the segment coordinate system was defined as follows: the z-axis by connecting the segment end points, the y-axis perpendicular to the z-axis and the frontal plane, and the x-axis perpendicular to the z-axis and the y-axis (right-hand rule). Figure 2-3 shows the currently assigned coordinate frames.
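As an illustration of this frame construction (not the actual Visual3D implementation, which is configured inside that software), the following numpy sketch builds such a segment coordinate system from hypothetical marker positions; the marker ordering, the proximal-to-distal direction of the z-axis and the re-orthogonalisation step are assumptions made for the example.

```python
import numpy as np

def unit(v: np.ndarray) -> np.ndarray:
    """Return v scaled to unit length."""
    return v / np.linalg.norm(v)

def segment_frame(p_proximal, p_distal, plane_a, plane_b, plane_c):
    """Sketch of the segment coordinate system described above: z along the line
    connecting the segment end points, y perpendicular to z and to the frontal
    plane (spanned by three markers), x completing a right-handed frame.
    Returns a 3x3 rotation matrix with columns x, y, z."""
    z = unit(np.asarray(p_distal, float) - np.asarray(p_proximal, float))
    # Normal of the frontal plane defined by three non-collinear markers
    # (its sign depends on the chosen marker ordering).
    n = unit(np.cross(np.asarray(plane_b, float) - np.asarray(plane_a, float),
                      np.asarray(plane_c, float) - np.asarray(plane_a, float)))
    # Re-orthogonalise against z, since measured markers are noisy and the
    # plane normal is rarely exactly perpendicular to the segment axis.
    y = unit(n - np.dot(n, z) * z)
    x = np.cross(y, z)          # right-hand rule: x = y x z
    return np.column_stack((x, y, z))

# Example with synthetic marker positions (metres).
R = segment_frame([0, 0, 0], [0, 0, 0.4], [0.05, 0, 0], [-0.05, 0, 0], [0, 0, 0.4])
print(R)
```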

After that, the human model of each patient was applied to his/her movement trials. Biomechanical model-based calculations were then defined and performed to extract the desired information. Currently, joint angles, joint velocities, and accelerations have been derived, while the computation of other values can be carried out upon consortium request. Exploring the data from the movement trials and associating these data with the model is done visually for each single trial. If errors or no correspondence between the obtained and the expected results are found, both the model parameters and the marker sets defining the body segments are modified (as much as required) in Visual3D, or markers are amended in Qualisys. Approved results are saved for further analysis in WP1 and WP2.
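As a pointer to how such model-based quantities relate to the segment frames, the short sketch below computes a relative joint orientation from two segment rotation matrices; it is an illustration under assumed conventions (Cardan sequence, parent/child naming), not the calculation pipeline actually used in Visual3D.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def joint_angles_deg(R_parent: np.ndarray, R_child: np.ndarray,
                     sequence: str = "xyz") -> np.ndarray:
    """Orientation of a child segment expressed in its parent segment's frame,
    decomposed into Cardan angles (degrees). The rotation sequence is an
    assumption; the decomposition order is a modelling choice."""
    R_rel = R_parent.T @ R_child
    return Rotation.from_matrix(R_rel).as_euler(sequence, degrees=True)

# Identical frames give zero joint angles; joint velocities and accelerations
# can then be approximated by numerically differentiating the angle series,
# e.g. with np.gradient over the trial's time vector.
print(joint_angles_deg(np.eye(3), np.eye(3)))  # -> [0. 0. 0.]
```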

Figure 2-2: Motion capture data exported from Qualisys (left),

and its corresponding biomechanical human model with segment definition (right).


Figure 2-3: Coordinate frames associated with the human body segments.

2.4. Video post-processing for PiP file creation

Along with the synchronization procedure, the “picture in picture” (PiP) video files were created. The importance of this procedure is considerable, as the resulting video files provide the input which the annotators are called to annotate manually. The streams from each medium were obtained independently and were brought together into a single PiP stream, so that all the information is accumulated in one view and the annotation of the video channel is facilitated.

The PiP video files consist of 4 visual inputs:

– 1 Kinect camera (Upper Kinect): Upper Left

– 1 Go Pro camera: Upper Right

– HD Central: Lower Left

– HD Global: Lower Right

The files from the GoPro camera as well as the files from the RGB topic of the upper Kinect (Kinect/RosBag) were converted into MPEG-2 format; this was decided due to codec incompatibilities in handling noise (shadows, ghosting, etc.).
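For illustration, the sketch below composes such a 2x2 mosaic with OpenCV, following the layout listed above (Upper Kinect upper left, GoPro upper right, HD Central lower left, HD Global lower right). The file names, tile resolution, frame rate and output codec are hypothetical, and this is not the toolchain actually used to produce the MOBOT PiP files.

```python
import cv2
import numpy as np

# Hypothetical, pre-synchronized input files in the PiP layout order:
# Upper Kinect, GoPro, HD Central, HD Global.
SOURCES = ["upper_kinect.mp4", "gopro.mp4", "hd_central.mp4", "hd_global.mp4"]
TILE_W, TILE_H = 640, 360  # per-tile resolution of the mosaic (assumed)

caps = [cv2.VideoCapture(path) for path in SOURCES]
writer = cv2.VideoWriter("pip.avi", cv2.VideoWriter_fourcc(*"MJPG"), 25.0,
                         (2 * TILE_W, 2 * TILE_H))

while True:
    frames = []
    for cap in caps:
        ok, frame = cap.read()
        if not ok:                     # stop at the end of the shortest stream
            frames = None
            break
        frames.append(cv2.resize(frame, (TILE_W, TILE_H)))
    if frames is None:
        break
    top = np.hstack((frames[0], frames[1]))     # Upper Kinect | GoPro
    bottom = np.hstack((frames[2], frames[3]))  # HD Central | HD Global
    writer.write(np.vstack((top, bottom)))

for cap in caps:
    cap.release()
writer.release()
```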

3. ANNOTATION SCHEMES FOR THE AUDIOVISUAL DATA

The following table presents the duration of each scenario variant, the participants that performed it, as well as the number of files to be annotated.


Table 3-1: Duration and number of files to be annotated per scenario/variant.

3.1. Introduction: Aims, scope

For the annotation procedure the primary recordings were maintained and new ones at maximum resolution (HD) were created; along with these, the recording files were also generated in a compressed format (mp4) in order to facilitate the annotation procedure. Furthermore, we favoured the creation of files in a lossy format and at lower resolution (e.g. 426×240) for better management of the video files as well as faster access and exchange among partners.

The annotation of the visual data was performed in the ELAN environment (ELAN 4.6.2¹), an annotation environment specifically designed for the processing of multimodal resources [Brugman et al., 2004]. Annotation is time-aligned; each channel of information was annotated in a separate annotation tier, which may consist of several sub-tiers according to the level of fine-grained information that is needed. The output of the annotation procedure was exported into .xml files. A previous consortium agreement on the prioritization of annotating the different scenarios had dictated the following scenario order for the actual annotation procedures: 1, 2, 3, 6, 4, 5. However, as the annotation scheme was adapted to the needs of the project, providing extra annotation tiers for the more complex scenarios/variants and hence making annotation more time-consuming, the prioritization order was altered in order to promote all b variants (with the rollator in following mode): 1, 2, 3.b, 3.b.2, 3.a, 6.a, 5.b, 4.a, 3.c, 4.c, 5.a.

1 http://tla.mpi.nl/tools/tla-tools/elan/ , Max Planck Institute for Psycholinguistics, The Language Archive,

Nijmegen, The Netherlands
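Since the exported ELAN files (.eaf) are plain XML, their tiers can be read programmatically by downstream partners. The sketch below, based on the standard EAF layout (TIME_ORDER/TIME_SLOT and TIER/ANNOTATION elements), is an illustration with hypothetical file and tier names, not a tool used in the project.

```python
import xml.etree.ElementTree as ET

def read_eaf_tiers(path: str):
    """Sketch of reading time-aligned annotations from an ELAN .eaf file.
    Returns {tier_id: [(start_ms, end_ms, value), ...]}."""
    root = ET.parse(path).getroot()
    # Time slots map symbolic ids to millisecond values.
    slots = {ts.get("TIME_SLOT_ID"): int(ts.get("TIME_VALUE", 0))
             for ts in root.find("TIME_ORDER")}
    tiers = {}
    for tier in root.findall("TIER"):
        entries = []
        for ann in tier.findall("ANNOTATION/ALIGNABLE_ANNOTATION"):
            value = ann.findtext("ANNOTATION_VALUE", default="")
            entries.append((slots[ann.get("TIME_SLOT_REF1")],
                            slots[ann.get("TIME_SLOT_REF2")],
                            value))
        tiers[tier.get("TIER_ID")] = entries
    return tiers

# Example (hypothetical file and tier names):
# for start, end, label in read_eaf_tiers("scenario3_patient6.eaf")["Actions"]:
#     print(start, end, label)
```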


All the PiP video files along with all the annotation files were uploaded to two servers:

a. the TUM server of the MOBOT project;

b. an ILSP server.

3.2. Annotation Scheme: From generic to specific

A preliminary inspection of the audiovisual data dictated the creation of at least 5 different major annotation tiers describing the scenario, the predefined tasks in each scenario, the actions that were eventually performed, information from the audio channel (noise, oral commands), and information from the visual channel (noise, gestures, pauses, stumbling, etc.). In parallel, discussions between partners led to the finalization of the annotation schemes followed in the different recording scenarios of the MOBOT multimodal multisensory corpus.

3.2.1. Generic

This generic annotation scheme, which was later enriched, consisted of the following three annotation clusters, each containing multiple tiers; a brief description of each is sketched below:

1. Information about the annotated data: Scenario, variant, task. This cluster

provides information, much like metadata, regarding the source of the annotated

data (which scenario and which variant) as well as a standard account of the

duration of the tasks that the patients were asked to perform.

2. Visual Input: Performed actions and gestures of the patient and visual noise

coming from the recording environment (mostly the carers). This annotation

cluster is the richest one as it provides more in-depth information regarding the

actions that were eventually performed and the gestures of the participants:

a. Within the same timestamps attributed to the duration of each task in the task tier (see above), the annotator marks all actually performed actions. In most cases (except for the few in scenario 1 in which the task and the actually performed action coincide), this means that the time boundaries set to describe a task are divided into shorter segments aligned to the different sets of actions each task includes. For example, a simple sit-to-stand transfer with the rollator in following mode is annotated as “Sit-to-stand” in both the Task and the Action tier, while the same sit-to-stand transfer with the rollator assisting the patient is annotated as “Sit-to-stand” in the Task tier and contains two segments in the Action tier: “Grasp handles” and “Sit-to-stand”.

b. Two tiers were attributed to the annotation of the gestures: one marking the duration of the gesturing activity and attributing the equivalent command, and another one marking the Handshape, i.e. the hand formation with which the gesture was performed, adopting the HamNoSys² notation system [Hanke, T. 2004].

c. In the generic annotation scheme, this cluster also contained annotations regarding the carer, which were generally described as noise.

2 http://www.sign-lang.uni-hamburg.de/projects/hamnosys.html


3. Audio Input: Uttered speech vs. non-speech, and audio commands. In its original state, this annotation cluster contained two tiers:

that contained clearly uttered and fairly comprehensible speech versus all parts

that contained noise and noise-like audio interventions as well as non-

comprehensible speech parts, which in most cases consisted of sequences in

which many people talked at the same time.

b. An audio command tier in which timestamps for all audio commands, deriving

either from the patient or the carer(s), are marked.
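To summarise the generic template in one place, the sketch below writes out these three clusters as a simple data structure. The tier names and descriptions are paraphrased from the text above; they are not the literal tier identifiers used in the MOBOT ELAN templates.

```python
# Illustrative summary of the generic annotation scheme described above
# (paraphrased names, not the literal MOBOT ELAN tier identifiers).
GENERIC_ANNOTATION_SCHEME = {
    "metadata": ["Scenario", "Variant", "Task"],
    "visual_input": {
        "Actions": "performed actions, time-aligned within each task",
        "Gesture": "duration of the gesturing activity and the equivalent command",
        "Handshape": "hand formation during the gesture (HamNoSys notation)",
        "Noise": "visual noise, mostly caused by the carers",
    },
    "audio_input": {
        "Speech/Non-speech": "clearly uttered speech vs. noise and overlapping talk",
        "Audio command": "commands uttered by the patient or the carer(s)",
    },
}
```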

3.2.2. Specific

Further along the annotation procedure, and following valuable discussions with all technical partners, it became clear that the finalization of the annotation scheme needed to take into consideration the particularities of each scenario variant in order to provide actually usable annotation data. Consequently, the above-described annotation scheme template was first enriched as a whole and then adapted according to the needs of each particular scenario variant.

Figure 3-1: Sample of an annotation PiP video (Scenario 3, Patient 6).

After this adaptation, the template output of the final version of the annotation scheme

is not a generic scheme but an in-depth annotation/analysis. The alterations and

enhancements that were made with respect to the previous annotation scheme are briefly

described below:

1. Actions/Actions_2: In several variants it was noted that the patient performed a task which comprised two actions that could not be linearly annotated in one tier; a very typical example is the one given above with the “Grab Handles” and the “Sit-to-stand” transfer. Therefore, another tier named “Actions_2” was added in order to provide the correct timestamps and durations of each patient’s actions. Furthermore, the list of actions was enhanced in general, as the descriptions of the tasks needed to be divided into more specific actions.

2. Gestures_Type of movement: In several cases, when asked to perform specific gestures, patients made mistakes which can be clustered into 3 types: a. they repeated the movement of a previous gesture with the correct handshape, b. their gesture had a small but significant deviation from the gesture that was shown to them (e.g. use of only one hand or of a different finger), or c. they performed something entirely different from what they had been shown. In order to provide sufficient data on the patients’ performance in the first two cases, an additional tier was needed, activated only in cases in which this kind of mistake appears. This tier would contain annotations describing the type of movement actually performed.

3. Visual Noise: All involved partners that acquired the annotated data made it

specifically clear that the annotation of most, if not all, types of visual noise that are

present in the MOBOT multisensory data is indispensable; a multisensory recording

dataset such as this, in which the interference of the assistive personnel is

unavoidable, is bound to provide data that need to be treated for noise reduction.

After long but fruitful discussions, it was decided that noise would be annotated for

two out of the four cameras, the Upper Kinect and the GoPro camera.

a. As far as the GoPro camera is concerned, we annotate the presence or the

absence of noise (another person or part of another person) in the vicinity of

(next to) the patient. In case of presence we also mark location of the noise with

respect to the patient (right, left, both).

b. As far as the Kinect camera is concerned, three types of noise are marked:

i. The presence or absence of noise in the frame, which is narrower than the

respective one of the GoPro camera. In case of presence we also mark location

of the noise with respect to the patient (right, left, both).

ii. The cases in which there is occlusion of a carer or an object in the space

between the front of the patient and the Upper Kinect. As in both previous

cases, we also locate the position of the noise in the frame with respect to the

patient (right, left, both).

iii. The cases in which the carer touches the patient; this type of noise includes all

instances in which a carer or part of his/her body is situated right next to the

patient or behind the body of the patient. The location of the noise is also

annotated (right, left, both) with respect to the patient.

4. Visual Channel / Carer: After having discussed the importance of annotating the

presence or absence of noise, it was also made clear that the presence of the assistive

personnel within the visual data can also be treated usefully; therefore, three tiers

dedicated to the carers were added under the section that was intended for the

annotation of the patient actions; of course, the presence of values in these three

tiers implies that the respective noise tiers are active as the carer is present within

the Kinect frame.

i. Meaningful Gesture: This tier is dedicated to the cases in which the carer is visible in the Kinect camera, showing the patient how to perform the specific set of gestures. This tier is complementary to the gesture tier, as it provides information about the visual input the patients had in order to perform the gestures, a control over the patients’ performance, and additional quantitative data of people performing the specific gestures.

In case the tier for noise visibility in the Kinect camera is active, the Activity tier or the Stationary tier should also be marked.

ii. Activity: In this tier we mark that the carer is within the frame of the Upper

Kinect and is moving in a location near the patient.

iii. Stationary: In this tier we mark that the carer is within the frame of the Upper

Kinect and is standing still (probably moving his/her arms and upper part of

the body but remaining in the same position) in a location near the patient.

5. Audio Noise: A tier was added in which the verbal commands are translated into

English.

The finalization of the annotation scheme provided a template that is modular according to the needs of each scenario variant. This procedure, although extremely interesting, proved far more time-consuming than expected, due to its elaboration and adaptation to scenario-specific needs.

4. CONCLUSION

The D2.2 deliverable presents a methodological input on the post-processing of the

multimodal data for the MOBOT project.

Several levels of post-processing were presented: from the synchronization of all multisensory media, a major milestone in the MOBOT project, to the post-processing of the force/torque sensors and the motion capture system, to the PiP video creation for the annotation of the audiovisual data, as well as the actual annotation of the latter.

Significant effort was devoted to the creation of the finally adopted annotation scheme, as it was important that all information valuable to the technical partners be included, in a way that would enable a fast as well as effective annotation procedure.

The measurable outcome of this deliverable will be evaluated through a peer-reviewed

crosscheck of the annotated videos as well as from the feedback that will be received

from all involved partners.

It is foreseen that for a restricted part of the multimodal corpus a further more fine-

grained annotation will be performed. This, after being further discussed among

involved partners, will focus on resolving specific modelling/training issues that the

present annotation scheme might not be able to cover.

REFERENCES

[Brugman et al., 2004] H. Brugman, A. Russel, "Annotating Multimedia/Multi-modal Resources with ELAN", in Proceedings of LREC 2004, 4th International Conference on Language Resources and Evaluation, 2004.

[Folstein et al., 1975] M. F. Folstein, S. E. Folstein, and P. R. McHugh, "Mini-mental state. A practical method for grading the cognitive state of patients for the clinician", Journal of Psychiatric Research, vol. 12(3), pp. 89-98, 1975.

[Hanke, T. 2004] T. Hanke, "HamNoSys - representing sign language data in language resources and language processing contexts", in O. Streiter and C. Vettori (eds.), LREC 2004, Workshop Proceedings: Representation and Processing of Sign Languages, Paris: ELRA, 2004.

[Matthes et al., 2012a] S. Matthes, T. Hanke, A. Regen, J. Storz, S. Worseck, E. Efthimiou, A.-L. Dimou, A. Braffort, J. Glauert, E. Safar, "Dicta-Sign - Building a Multilingual Sign Language Corpus", in Proceedings of the 5th Workshop on the Representation and Processing of Sign Languages: Interactions between Corpus and Lexicon (LREC 2012), Istanbul, Turkey, 2012.

[Matthes et al., 2012b] S. Matthes, T. Hanke, J. Storz, E. Efthimiou, A.-L. Dimou, P. Karioris, A. Braffort, A. Choisier, J. Pelhate, E. Safar, "Elicitation Tasks and Materials Designed for Dicta-Sign's Multi-lingual Corpus", in Proceedings of the 5th Workshop on the Representation and Processing of Sign Languages: Interactions between Corpus and Lexicon (LREC 2012), Istanbul, Turkey, 2012.

[Wallraven et al., 2011] C. Wallraven, M. Schultze, B. Mohler, A. Vatakis and K. Pastra, "The POETICON Enacted Scenario Corpus - A Tool for Human and Computational Experiments on Action Understanding", in Proceedings of the 9th IEEE Conference on Automatic Face and Gesture Recognition (FG'11), 2011.