
Page 1:

MediaEval multimedia benchmarking initiative

Gareth Jones, Dublin City University, Ireland

Martha Larson, TU Delft, The Netherlands

Page 2:

Overview

• What is MediaEval?

• MediaEval task selection

• MediaEval 2011

• MediaEval 2012

Page 3:

What is MediaEval?

• … a multimedia benchmarking initiative.

• … evaluates new algorithms for multimedia access and retrieval.

• … emphasizes the "multi" in multimedia: speech, audio, visual content, tags, users, context.

• … innovates new tasks and techniques focusing on the human and social aspects of multimedia content.

http://www.multimediaeval.org

Page 4:

What is MediaEval?

• MediaEval grew out of the VideoCLEF track at CLEF 2008 and CLEF 2009.

• Established as an independent benchmarking initiative in 2010.

• Supported in 2010 and 2011 by the PetaMedia EC Network of Excellence, and from 2012 by the Cubrik project.

Page 5:

What is MediaEval?

• Follows the standard annual retrieval benchmarking cycle.

• Looks to real-world use scenarios for tasks.

• Where possible, selects tasks with industrial relevance.

• Also draws inspiration from the “PetaMedia Triple Synergy”:

– Multimedia content analysis

– Social network structures

– User-contributed tags

PetaMedia Network of Excellence: Peer-to-peer Tagged Media

Page 6:

MediaEval task selection

• Community-based task selection process.

– Proposed tasks are included in a pre-selection questionnaire.

• Tasks must have:

– a use scenario

– research questions

– an accessible dataset (e.g. Creative Commons)

– a realistic groundtruthing process (e.g. crowdsourcing)

– “champions” willing to act as coordinators

• Selected tasks must have a minimum of 5 registered participants to run.

Page 7:

Participating Teams

Page 8:

MediaEval Tasks 2011

• Placing task (6)

• Spoken Web Search task (5)

• Affect task (6)

• Genre tagging task (10)

• Rich Speech Retrieval task (5)

• Social event detection task (7)

Page 9:

Placing Task

• Motivation: Knowing their geographical location helps users to localise images and videos, anchoring them to real-world locations. Currently, most online images and videos are not labeled with their location.

• Task: Automatically assign geo-coordinates to Flickr videos using one or more of: Flickr metadata, visual content, audio content, social information.

• Data: Creative Commons Flickr data, predominantly English.

Page 10:

Placing Task

• Data: Creative Commons Flickr video data, predominantly English. Participants may use audio and visual data, and any available image metadata.

• Runs: Up to 5 runs may be submitted, including at most one run incorporating a gazetteer and/or one run using additional material crawled from outside the collection.

• Evaluation: Geo-coordinates from Flickr are used as groundtruth. Runs are scored by the number of videos placed within 1 km, 10 km, 100 km, 1,000 km and 10,000 km of the groundtruth (see the sketch below).

• Organizers:
Vanessa Murdock, Yahoo! Research
Adam Rae, Yahoo! Research
Pascal Kelm, TU Berlin
Pavel Serdyukov, Yandex
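
As an illustration of this threshold-based scoring, here is a minimal sketch in Python. The haversine great-circle distance is a standard way to compare geo-coordinates; all function and variable names below are our own, not part of the task release.

    import math

    def haversine_km(lat1, lon1, lat2, lon2):
        # Great-circle distance in kilometres between two (lat, lon) points.
        earth_radius_km = 6371.0
        phi1, phi2 = math.radians(lat1), math.radians(lat2)
        dphi = math.radians(lat2 - lat1)
        dlam = math.radians(lon2 - lon1)
        a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
        return 2 * earth_radius_km * math.asin(math.sqrt(a))

    def placing_scores(predicted, groundtruth, thresholds=(1, 10, 100, 1000, 10000)):
        # predicted, groundtruth: aligned lists of (lat, lon) pairs, one per video.
        # Returns, for each threshold in km, how many videos fall within it.
        counts = {t: 0 for t in thresholds}
        for (plat, plon), (glat, glon) in zip(predicted, groundtruth):
            d = haversine_km(plat, plon, glat, glon)
            for t in thresholds:
                if d <= t:
                    counts[t] += 1
        return counts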

Page 11:

Spoken Web Search

• Motivation: The Spoken Web aspires to make information available to communities in the developing world via audio input and output on mobile devices. The small amount of training data available for these languages poses challenges for speech recognition.

• Task: Search FOR audio content WITHIN audio content USING an audio content query, for poorly resourced languages. This task is particularly intended for speech researchers in the area of spoken term detection (a common baseline approach is sketched below).

• Data: Audio in four languages spoken in India - English, Hindi, Gujarati and Telugu. Each of the ca. 400 data items is an 8 kHz audio file 4-30 seconds in length.

• Organizers: Nitendra Rajput, IBM Research India Florian Metze, CMU
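
The slide does not prescribe a method, but a common baseline for this kind of zero-resource spoken term detection is subsequence dynamic time warping over frame-level acoustic features. A minimal sketch of that general approach, assuming the librosa library; the file names are placeholders:

    import librosa

    def mfcc_features(path, sr=8000):
        # Load the 8 kHz audio and compute MFCCs as frame-level features.
        y, sr = librosa.load(path, sr=sr)
        return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

    def match_cost(query_path, utterance_path):
        # Subsequence DTW: align the spoken query against the best-matching
        # region of the utterance; a low cost suggests the term occurs in it.
        q = mfcc_features(query_path)
        u = mfcc_features(utterance_path)
        D, _ = librosa.sequence.dtw(X=q, Y=u, subseq=True, metric='cosine')
        return D[-1, :].min() / q.shape[1]  # normalise by query length

    # Rank utterances for a query by ascending match cost:
    # ranked = sorted(utterance_files, key=lambda f: match_cost('query.wav', f))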

Page 12:

Affect Task: Violent Scene Detection

• Motivation: Technicolor is developing services which support the management of movie databases. The company seeks to help users choose appropriate content; a particular use case involves helping families choose movies suitable for their children.

• Task: Deploy multimodal features to automatically detect portions of movies containing violent material. Violence is defined as “physical violence or accident resulting in human pain or injury”.

• Data: A set of ca. 15 Hollywood movies (which must be purchased by the participants): 12 used for training, 3 for testing.

Page 13:

Affect Task: Violent Scene Detection

• Groundtruth: 7 human assessors located each violent action, including cases of different overlapping actions. Each violent event is labeled with its start and end frames.

• Features: Participants may use any features extracted from the content and subtitles. They may not use any data external to the collection.

• Evaluation: A detection cost function combining false alarms and misses (see the sketch below).

• Organizers:
Mohammad Soleymani, Univ. Geneva
Claire-Helene Demarty, Technicolor
Guillaume Gravier, IRISA
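
As a sketch of what such a cost function can look like: the slide only states that false alarms and misses are combined, so the weights below are illustrative placeholders, not the official task parameters.

    def detection_cost(n_misses, n_false_alarms, n_violent, n_nonviolent,
                       c_miss=10.0, c_fa=1.0):
        # Weighted sum of the miss rate and the false-alarm rate.
        # c_miss and c_fa are illustrative weights; the official values
        # are not given on the slide.
        p_miss = n_misses / n_violent
        p_fa = n_false_alarms / n_nonviolent
        return c_miss * p_miss + c_fa * p_fa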


Page 14:

Genre Tagging

• Motivation: Genre tags can provide valuable information for browsing video; this is especially true of semi-professional user-generated (SPUG) content.

• Task: Automatically assign genre tags to video using features derived from speech, audio, visual content, associated text or social information.

• Data: Around 350 hours of Creative Commons internet video harvested from blip.tv. The data is accompanied by automatic speech recognition transcripts provided by LIMSI and Vocapia Research, along with automatically extracted shot boundaries and keyframes.

Page 15:

Genre Tagging

• Tag Assignment: Participants are required to predict one of 26 tags. Groundtruth tags were collected directly using the blip.tv API.

• Runs: 5 runs are submitted, one based only on ASR transcripts and one including metadata. Evaluated using mean average precision (MAP; see the sketch below).

• Organizers: Martha Larson, TU-Delft Sebastian Schmiedeke, TU-Berlin Christoph Kofler, TU-Delft Isabelle Ferrané, Université Paul Sabatier
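
For reference, MAP averages the standard average precision over all tag queries. A minimal sketch, with helper names of our own:

    def average_precision(relevance):
        # relevance: 0/1 judgements of the ranked videos for one tag,
        # best-scored first, covering the full ranking.
        hits, score = 0, 0.0
        for rank, rel in enumerate(relevance, start=1):
            if rel:
                hits += 1
                score += hits / rank
        return score / hits if hits else 0.0

    def mean_average_precision(rankings):
        # rankings: one 0/1 relevance list per tag query.
        return sum(average_precision(r) for r in rankings) / len(rankings)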

Page 16:

Rich Speech Retrieval

• Motivation: Emerging semi-professional user generated content contains much potentially interesting material. However, this is only useful if relevant information can be found.

• Task: Given a set of queries and a video collection, participants are required to automatically identify relevant jump-in points into the videos based on a combination of modalities.

• Data: Same blip.tv collection used for Genre tagging task.

• Features: Can be derived from speech, audio, visual content or metadata.

Page 17:

Rich Speech Retrieval

• Speech Act: Speech content is treated in terms of ‘illocutionary acts’ - what speakers are accomplishing by speaking. Five acts were chosen: ‘apology’, ‘definition’, ‘opinion’, ‘promise’ and ‘warning’.

• Search Topics: Formed by crowdsourcing with Amazon Mechanical Turk. Workers located a section of video containing one of the speech acts that they would wish to share, then formed a long description and a short query that they would use to re-find a jump-in point at which to begin playback.

• Evaluation: Evaluated as a known-item search task using mean Generalized Average Precision (mGAP; see the sketch below).

• Organizers: Roeland Ordelman, University of Twente and B&G Maria Eskevich, Dublin City University
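
mGAP rewards runs whose top-ranked jump-in point lies close to the true one, with credit decaying as the returned point drifts away. The sketch below assumes a linear decay and a fixed window, which are our illustrative choices; the slide only names the metric.

    def generalized_ap(ranked_points, true_start, window=60.0):
        # ranked_points: candidate jump-in points (seconds) for one query,
        # best first. Credit decays linearly to zero over +/- window seconds
        # around the true start; both choices are assumptions for illustration.
        for rank, t in enumerate(ranked_points, start=1):
            credit = max(0.0, 1.0 - abs(t - true_start) / window)
            if credit > 0.0:
                return credit / rank  # reciprocal rank scaled by proximity credit
        return 0.0

    def mgap(runs, true_starts, window=60.0):
        # Mean of the generalized AP scores over all known-item queries.
        scores = [generalized_ap(r, s, window) for r, s in zip(runs, true_starts)]
        return sum(scores) / len(scores)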


Page 18:

Social Event Detection Task

• Motivation: Much content related to social events is being made available in different forms. People generally think in terms of events, rather than scattered separate items.

• Task: Discover events and detect media items related to either a specific social event or an event-class of interest. The events are planned by people, attended by people, and captured by people in social media.

• Data: A set of 73,645 photos collected from Flickr using its API: all geo-tagged images that were available for 5 cities - Amsterdam, Barcelona, London, Paris and Rome.

Page 19:

Social Event Detection Task

• Challenge 1: Find all soccer events taking place in Barcelona (Spain) and Rome (Italy) in the test collection. For each event provide all photos associated with it.

– These must be soccer matches, not someone with a ball or a picture of a football stadium.

• Challenge 2: Find all events that took place in May 2009 in the venue named Paradiso (in Amsterdam, NL) and in the Parc del Forum (in Barcelona, Spain). For each event provide all photos associated with it.

• A baseline run using only metadata is required; visual features can be used in other runs.

• Evaluation: F-score and Normalised Mutual Information (see the sketch below).

• Organizers: Raphael Troncy, Eurecom Vasileios Mezaris, ITI CERTH
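
Both measures compare the discovered grouping of photos into events against the groundtruth grouping. A minimal sketch, assuming scikit-learn for NMI; the per-event F-score here is the usual set-overlap formulation, our own illustrative choice:

    from sklearn.metrics import normalized_mutual_info_score

    def clustering_nmi(true_event_ids, predicted_event_ids):
        # One event id per photo, aligned lists; NMI compares the two
        # partitions of the photo set as a whole.
        return normalized_mutual_info_score(true_event_ids, predicted_event_ids)

    def event_f_score(true_photos, predicted_photos):
        # Set-overlap F-score for one event: harmonic mean of precision
        # and recall over the photo ids assigned to that event.
        tp = len(set(true_photos) & set(predicted_photos))
        if tp == 0:
            return 0.0
        precision = tp / len(predicted_photos)
        recall = tp / len(true_photos)
        return 2 * precision * recall / (precision + recall)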

Page 20:

MediaEval 2011 Workshop


• Held at Santa Croce in Fossabanda (a medieval convent in Pisa, Italy), 1st-2nd September 2011

• Official Workshop of InterSpeech 2011

– Unofficial satellite of ACM Multimedia 2010

– Potentially a workshop at ECCV 2012

• Nearly 60 registered participants – up from around 25 in 2010

• 39 2-page working notes papers (13 in 2010)

– Published by CEUR.WS: http://ceur-ws.org/Vol-807/

Page 22:

MediaEval 2011 Workshop


Page 23:

MediaEval Project Support

• Genre Tagging Task: PetaMedia

• Rich Speech Retrieval Task: AXES and IISSCOS with support from PetaMedia

• Affect Task: Violent Scenes Detection Task: PetaMedia and Quaero

• Social Event Detection: PetaMedia, Glocal, weknowit, Chorus+

• Placing Task: Glocal with support from PetaMedia


Page 24:

Special Session

MediaEval 2010 results were presented at ACM ICMR 2011 in a special session entitled “Automatic Tagging and Geo-Tagging in Video Collections and Communities”.


Page 25:

MediaEval 2012


• Informal presentation of task proposals at MediaEval 2011 workshop

• Call for formal task proposals in late 2011 – Please propose a task!

• MediaEval 2012 online task selection questionnaire in early 2012 – Please complete the questionnaire!

• Tasks announced and Call for Participation in spring 2012

• Task participation summer 2012

• MediaEval 2012 workshop - September/October

Page 26:

MediaEval


Thank You

Questions?

http://www.multimediaeval.org