
Page 1:

MediaEval multimedia benchmarking initiative

Gareth Jones, Dublin City University, Ireland

Martha Larson, TU Delft, The Netherlands

Page 2:

Overview

• What is MediaEval?

• MediaEval task selection

• MediaEval 2011

• MediaEval 2012

Page 3:

What is MediaEval?

• … a multimedia benchmarking initiative.

• … evaluates new algorithms for multimedia access and retrieval.

• … emphasizes the "multi" in multimedia: speech, audio, visual content, tags, users, context.

• … innovates new tasks and techniques focusing on the human and social aspects of multimedia content.

http://www.multimediaeval.org

Page 4:

What is MediaEval?

• MediaEval grew out of the VideoCLEF track at CLEF 2008 and CLEF 2009.

• Established as an independent benchmarking initiative in 2010.

• Supported in 2010 and 2011 by the PetaMedia EC Network of Excellence, and from 2012 by the Cubrik project.

Page 5:

What is MediaEval?

• Follows the standard annual retrieval benchmarking cycle.

• Looks to real-world use scenarios for tasks.

• Where possible, selects tasks with industrial relevance.

• Also draws inspiration from the “PetaMedia Triple Synergy”:

– Multimedia content analysis

– Social network structures

– User-contributed tags

PetaMedia Network of Excellence: Peer-to-peer Tagged Media

Page 6:

MediaEval task selection

• Community-based task selection process.

– Proposed tasks are included in a pre-selection questionnaire.

• Tasks must have:

– a use scenario

– research questions

– an accessible dataset (e.g. Creative Commons)

– a realistic groundtruthing process (e.g. crowdsourcing)

– “champions” willing to act as coordinators

• Selected tasks must have a minimum of 5 registered participants to run.

Page 7:

Participating Teams

Page 8:

MediaEval Tasks 2011

• Placing task (6)

• Spoken Web Search task (5)

• Affect task (6)

• Genre tagging task (10)

• Rich Speech Retrieval task (5)

• Social event detection task (7)

Page 9:

Placing Task

• Motivation: Knowing their geographical location helps users to localise images and videos, anchoring them to real-world locations. Currently, most online images and videos are not labeled with their location.

• Task: Automatically assign geo-coordinates to Flickr videos using one or more of: Flickr metadata, visual content, audio content, social information.

• Data: Creative Commons Flickr data, predominantly English.

Page 10:

Placing Task

• Data: Creative Commons Flickr video data, predominantly English. Participants may use audio and visual data, and any available image metadata.

• Runs: Up to 5 runs may be submitted, including at most one run incorporating a gazetteer and/or one run using additional material crawled from outside the collection.

• Evaluation: Geo-coordinates from Flickr are used as groundtruth. Runs are scored by the number of videos placed within 1 km, 10 km, 100 km, 1,000 km and 10,000 km of the groundtruth (see the sketch below).

• Organizers:
Vanessa Murdock, Yahoo! Research
Adam Rae, Yahoo! Research
Pascal Kelm, TU Berlin
Pavel Serdyukov, Yandex
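
As an illustration of this threshold-based scoring, here is a minimal sketch in Python. The haversine great-circle distance is a standard way to compare geo-coordinates; all function and variable names below are our own, not part of the task release.

    import math

    def haversine_km(lat1, lon1, lat2, lon2):
        # Great-circle distance in kilometres between two (lat, lon) points.
        earth_radius_km = 6371.0
        phi1, phi2 = math.radians(lat1), math.radians(lat2)
        dphi = math.radians(lat2 - lat1)
        dlam = math.radians(lon2 - lon1)
        a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
        return 2 * earth_radius_km * math.asin(math.sqrt(a))

    def placing_scores(predicted, groundtruth, thresholds=(1, 10, 100, 1000, 10000)):
        # predicted, groundtruth: aligned lists of (lat, lon) pairs, one per video.
        # Returns, for each threshold in km, how many videos fall within it.
        counts = {t: 0 for t in thresholds}
        for (plat, plon), (glat, glon) in zip(predicted, groundtruth):
            d = haversine_km(plat, plon, glat, glon)
            for t in thresholds:
                if d <= t:
                    counts[t] += 1
        return counts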

Page 11:

Spoken Web Search

• Motivation: The Spoken Web aspires to make information available to communities in the developing world via audio input and output on mobile devices. The small amount of training data available for these languages poses challenges for speech recognition.

• Task: Search FOR audio content WITHIN audio content USING an audio content query, for poorly resourced languages. This task is particularly intended for speech researchers in the area of spoken term detection (a common baseline approach is sketched below).

• Data: Audio in four languages spoken in India - English, Hindi, Gujarati and Telugu. Each of the ca. 400 data items is an 8 kHz audio file 4-30 seconds in length.

• Organizers: Nitendra Rajput, IBM Research India Florian Metze, CMU
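
The slide does not prescribe a method, but a common baseline for this kind of zero-resource spoken term detection is subsequence dynamic time warping over frame-level acoustic features. A minimal sketch of that general approach, assuming the librosa library; the file names are placeholders:

    import librosa

    def mfcc_features(path, sr=8000):
        # Load the 8 kHz audio and compute MFCCs as frame-level features.
        y, sr = librosa.load(path, sr=sr)
        return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

    def match_cost(query_path, utterance_path):
        # Subsequence DTW: align the spoken query against the best-matching
        # region of the utterance; a low cost suggests the term occurs in it.
        q = mfcc_features(query_path)
        u = mfcc_features(utterance_path)
        D, _ = librosa.sequence.dtw(X=q, Y=u, subseq=True, metric='cosine')
        return D[-1, :].min() / q.shape[1]  # normalise by query length

    # Rank utterances for a query by ascending match cost:
    # ranked = sorted(utterance_files, key=lambda f: match_cost('query.wav', f))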

Page 12:

Affect Task: Violent Scene Detection

• Motivation: Technicolor is developing services which support the management of movie databases. The company seeks to help users choose appropriate content; a particular use case involves helping families choose movies suitable for their children.

• Task: Deploy multimodal features to automatically detect portions of movies containing violent material. Violence is defined as “physical violence or accident resulting in human pain or injury”.

• Data: A set of ca. 15 Hollywood movies (which must be purchased by the participants): 12 used for training, 3 for testing.

Page 13:

Affect Task: Violent Scene Detection

• Groundtruth: 7 human assessors located each violent action, including cases of different overlapping actions. Each violent event is labeled with its start and end frames.

• Features: Participants may use any features extracted from the content and subtitles. They may not use any data external to the collection.

• Evaluation: A detection cost function combining false alarms and misses (see the sketch below).

• Organizers:
Mohammad Soleymani, Univ. Geneva
Claire-Helene Demarty, Technicolor
Guillaume Gravier, IRISA
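
As a sketch of what such a cost function can look like: the slide only states that false alarms and misses are combined, so the weights below are illustrative placeholders, not the official task parameters.

    def detection_cost(n_misses, n_false_alarms, n_violent, n_nonviolent,
                       c_miss=10.0, c_fa=1.0):
        # Weighted sum of the miss rate and the false-alarm rate.
        # c_miss and c_fa are illustrative weights; the official values
        # are not given on the slide.
        p_miss = n_misses / n_violent
        p_fa = n_false_alarms / n_nonviolent
        return c_miss * p_miss + c_fa * p_fa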


Page 14:

Genre Tagging

• Motivation: Genre tags can provide valuable information for browsing video; this is especially true of semi-professional user-generated (SPUG) content.

• Task: Automatically assign genre tags to video using features derived from speech, audio, visual content, associated text or social information.

• Data: Around 350 hours of Creative Commons internet video harvested from blip.tv. The data is accompanied by automatic speech recognition transcripts provided by LIMSI and Vocapia Research, along with automatically extracted shot boundaries and keyframes.

Page 15:

Genre Tagging

• Tag Assignment: Participants are required to predict one of 26 tags. Groundtruth tags were collected directly using the blip.tv API.

• Runs: 5 runs are submitted, one based only on ASR transcripts and one including metadata. Evaluated using mean average precision (MAP; see the sketch below).

• Organizers: Martha Larson, TU-Delft Sebastian Schmiedeke, TU-Berlin Christoph Kofler, TU-Delft Isabelle Ferrané, Université Paul Sabatier
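
For reference, MAP averages the standard average precision over all tag queries. A minimal sketch, with helper names of our own:

    def average_precision(relevance):
        # relevance: 0/1 judgements of the ranked videos for one tag,
        # best-scored first, covering the full ranking.
        hits, score = 0, 0.0
        for rank, rel in enumerate(relevance, start=1):
            if rel:
                hits += 1
                score += hits / rank
        return score / hits if hits else 0.0

    def mean_average_precision(rankings):
        # rankings: one 0/1 relevance list per tag query.
        return sum(average_precision(r) for r in rankings) / len(rankings)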

Page 16:

Rich Speech Retrieval

• Motivation: Emerging semi-professional user generated content contains much potentially interesting material. However, this is only useful if relevant information can be found.

• Task: Given a set of queries and a video collection, participants are required to automatically identify relevant jump-in points into the videos based on a combination of modalities.

• Data: Same blip.tv collection used for Genre tagging task.

• Features: Can be derived from speech, audio, visual content or metadata.

Page 17:

Rich Speech Retrieval

• Speech Act: Speech content is treated in terms of ‘illocutionary acts’ - what speakers are accomplishing by speaking. Five acts were chosen: ‘apology’, ‘definition’, ‘opinion’, ‘promise’ and ‘warning’.

• Search Topics: Formed by crowdsourcing with Amazon Mechanical Turk. Workers located a section of video containing one of the speech acts that they would wish to share, then formed a long description and a short query that they would use to re-find a jump-in point at which to begin playback.

• Evaluation: Evaluated as a known-item search task using mean Generalized Average Precision (mGAP; see the sketch below).

• Organizers: Roeland Ordelman, University of Twente and B&G Maria Eskevich, Dublin City University
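
mGAP rewards runs whose top-ranked jump-in point lies close to the true one, with credit decaying as the returned point drifts away. The sketch below assumes a linear decay and a fixed window, which are our illustrative choices; the slide only names the metric.

    def generalized_ap(ranked_points, true_start, window=60.0):
        # ranked_points: candidate jump-in points (seconds) for one query,
        # best first. Credit decays linearly to zero over +/- window seconds
        # around the true start; both choices are assumptions for illustration.
        for rank, t in enumerate(ranked_points, start=1):
            credit = max(0.0, 1.0 - abs(t - true_start) / window)
            if credit > 0.0:
                return credit / rank  # reciprocal rank scaled by proximity credit
        return 0.0

    def mgap(runs, true_starts, window=60.0):
        # Mean of the generalized AP scores over all known-item queries.
        scores = [generalized_ap(r, s, window) for r, s in zip(runs, true_starts)]
        return sum(scores) / len(scores)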


Page 18:

Social Event Detection Task

• Motivation: Much content related to social events is being made available in different forms. People generally think in terms of events, rather than scattered separate items.

• Task: Discover events and detect media items related to either a specific social event or an event-class of interest. The events are planned by people, attended by people, and captured by people in social media.

• Data: A set of 73,645 photos collected from Flickr using its API: all geo-tagged images that were available for 5 cities - Amsterdam, Barcelona, London, Paris and Rome.

Page 19:

Social Event Detection Task

• Challenge 1: Find all soccer events taking place in Barcelona (Spain) and Rome (Italy) in the test collection. For each event provide all photos associated with it.

– These must be soccer matches, not someone with a ball or a picture of a football stadium.

• Challenge 2: Find all events that took place in May 2009 in the venue named Paradiso (in Amsterdam, NL) and in the Parc del Forum (in Barcelona, Spain). For each event provide all photos associated with it.

• A baseline run using only metadata is required; visual features can be used in other runs.

• Evaluation: F-score and Normalised Mutual Information (see the sketch below).

• Organizers: Raphael Troncy, Eurecom Vasileios Mezaris, ITI CERTH
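
Both measures compare the discovered grouping of photos into events against the groundtruth grouping. A minimal sketch, assuming scikit-learn for NMI; the per-event F-score here is the usual set-overlap formulation, our own illustrative choice:

    from sklearn.metrics import normalized_mutual_info_score

    def clustering_nmi(true_event_ids, predicted_event_ids):
        # One event id per photo, aligned lists; NMI compares the two
        # partitions of the photo set as a whole.
        return normalized_mutual_info_score(true_event_ids, predicted_event_ids)

    def event_f_score(true_photos, predicted_photos):
        # Set-overlap F-score for one event: harmonic mean of precision
        # and recall over the photo ids assigned to that event.
        tp = len(set(true_photos) & set(predicted_photos))
        if tp == 0:
            return 0.0
        precision = tp / len(predicted_photos)
        recall = tp / len(true_photos)
        return 2 * precision * recall / (precision + recall)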

Page 20:

MediaEval 2011 Workshop


• Held at Santa Croce in Fossabanda (a medieval convent in Pisa, Italy), 1st-2nd September 2011

• Official Workshop of InterSpeech 2011

– Unofficial satellite of ACM Multimedia 2010

– Potentially a workshop at ECCV 2012

• Nearly 60 registered participants – up from around 25 in 2010

• 39 2-page working notes papers (13 in 2010)

– Published by CEUR.WS: http://ceur-ws.org/Vol-807/

Page 22:

MediaEval 2011 Workshop


Page 23:

MediaEval Project Support

• Genre Tagging Task: PetaMedia

• Rich Speech Retrieval Task: AXES and IISSCOS with support from PetaMedia

• Affect Task: Violent Scenes Detection Task: PetaMedia and Quaero

• Social Event Detection: PetaMedia, Glocal, weknowit, Chorus+

• Placing Task: Glocal with support from PetaMedia


Page 24:

Special Session

MediaEval 2010 results were presented at ACM ICMR 2011 in a special session entitled “Automatic Tagging and Geo-Tagging in Video Collections and Communities”.


Page 25:

MediaEval 2012


• Informal presentation of task proposals at MediaEval 2011 workshop

• Call for formal task proposals in late 2011 – Please propose a task!

• MediaEval 2012 online task selection questionnaire in early 2012 – Please complete the questionnaire!

• Tasks announced and Call for Participation in spring 2012

• Task participation summer 2012

• MediaEval 2012 workshop - September/October

Page 26:

MediaEval


Thank You

Questions?

http://www.multimediaeval.org