Overview of the MediaEval 2012 Tagging Task
MediaEval Workshop – 4-5 October 2012 – Pisa, Italy
2012 Tagging Task Overview
Christoph Kofler, Delft University of Technology
Sebastian Schmiedeke, Technical University of Berlin
Isabelle Ferrané, University of Toulouse (IRIT)
Motivations
• MediaEval
  – evaluates new algorithms for multimedia access and retrieval
  – emphasizes the 'multi' in multimedia
  – focuses on human and social aspects of multimedia tasks
• Tagging Task
  – focuses on semi-professional video on the Internet
  – uses features derived from various media or information sources: speech, audio, visual content or associated metadata, and social information
  – focuses on tags related to the genre of the video
History
• Tagging task at MediaEval
  – 2010: The Tagging Task (Wild Wild Web version): prediction of user tags; too many tags and great diversity
  – 2011: The Genre Tagging Task: genres from blip.tv labels (26 labels)
  – 2012: The Tagging Task: the same labels, but much more dev/test data
• MediaEval joined by Quaero
  – 2009 & 2010: internal Quaero campaigns on video genre classification; too few participants
  – 2011 & 2012: the Tagging Task run as an external Quaero evaluation
Datasets
• A set of videos (ME12TT)
  – created by the PetaMedia Network of Excellence
  – downloaded from blip.tv
  – episodes of shows mentioned in Twitter messages
  – licensed under Creative Commons
  – 14,838 episodes from 2,249 shows, ~3,260 hours of data
  – an extension of the MediaEval Wild Wild Web dataset (2010)
• Split into development and test sets
  – 2011: 247 for development / 1,727 for test (1,974 videos)
  – 2012: 5,288 for development / 9,550 for test (7.5 times more)
Genres
• 26 genre labels from blip.tv, the same as in 2011: 25 genres + 1 default_category

1000 art, 1001 autos_and_vehicles, 1002 business, 1003 citizen_journalism,
1004 comedy, 1005 conferences_and_other_events, 1006 default_category,
1007 documentary, 1008 educational, 1009 food_and_drink, 1010 gaming,
1011 health, 1012 literature, 1013 movies_and_television,
1014 music_and_entertainment, 1015 personal_or_auto-biographical,
1016 politics, 1017 religion, 1018 school_and_education, 1019 sports,
1020 technology, 1021 the_environment, 1022 the_mainstream_media,
1023 travel, 1024 videoblogging, 1025 web_development_and_sites
Information available (1/2)
• From different sources:
  – title, description, user tags, uploader ID, duration
  – tweets mentioning the shows
  – automatic processing:
    • automatic speech recognition (ASR): English transcription, and some other languages
    • shot boundaries and one keyframe per shot
Information available (2/2)
• Focus on speech data
  – LIUM transcripts (5,084 files for dev / 6,879 files for test)
    • English
    • one-best hypotheses (NIST CTM format)
    • word lattices (HTK SLF format, 4-gram topology)
    • confusion networks (ATT FSM-like format)
  – LIMSI-VOCAPIA transcripts (5,237 files for dev / 7,215 files for test)
    • English, French, Spanish, Dutch
    • language identification strategy based on a language confidence score: if the score is above 0.8, the transcription is produced in the detected language; otherwise, the better-scoring of English and the detected language is used (see the sketch below)
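A minimal sketch of this selection rule, assuming a `scores` dict of per-language confidence values; the function and variable names are illustrative, not the actual LIMSI-VOCAPIA implementation.

def choose_transcription_language(detected_lang, confidence, scores):
    """Apply the rule above: trust the detected language if its
    confidence exceeds 0.8, otherwise fall back to the better-scoring
    of English and the detected language."""
    if confidence > 0.8:
        return detected_lang
    return max(("en", detected_lang), key=lambda lang: scores.get(lang, 0.0))

# e.g. a clip detected as French with low confidence:
print(choose_transcription_language("fr", 0.6, {"en": 0.65, "fr": 0.6}))  # -> "en"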
Task Goal
• Same as in 2011:
  – searching and browsing the Internet for video
  – using genre as a search or organizational criterion
  – videos may not be accurately or adequately tagged, so the task is to automatically assign genre labels using features derived from speech, audio, visual content, and associated textual or social information
• What's new in 2012:
  – a huge amount of data (7.5 times more)
  – information retrieval as well as classification approaches are enabled, with more balanced datasets (1/3 dev; 2/3 test)
  – each genre was "equally" distributed between both sets
Genre distribution over datasets
[Charts: genre distribution over the development and test sets. Part 1: genres 1000 to 1012; Part 2: genres 1013 to 1025]
Evaluation Protocol
• Task: predict the genre label for each video of the test set
• Submissions: up to 5 runs, representing different approaches:
  – RUN 1: audio and/or visual information (including information about shots and keyframes)
  – RUN 2: ASR transcripts
  – RUN 3: all data except metadata
  – RUN 4: all data except the uploader ID (the ID used in the 2011 campaign)
  – RUN 5: all data
• Ground truth: the genre label associated with each video
• Metric: Mean Average Precision (MAP), which evaluates ranked retrieval results with respect to a set of queries Q (see the sketch below)
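For illustration, a minimal sketch of how MAP can be computed when each of the 26 genre labels is treated as a query and a run provides a ranked list of test videos per genre; this is the textbook formulation, and the organisers' evaluation script may differ in details.

def average_precision(ranked_ids, relevant_ids):
    """AP for one query: average of the precision values measured
    at the rank of each relevant video."""
    hits, precision_sum = 0, 0.0
    for rank, vid in enumerate(ranked_ids, start=1):
        if vid in relevant_ids:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant_ids) if relevant_ids else 0.0

def mean_average_precision(run, ground_truth):
    """MAP: mean of the per-query AP values over the query set Q."""
    return sum(average_precision(run[q], ground_truth[q])
               for q in ground_truth) / len(ground_truth)

# e.g. one genre with three relevant videos, one of them missed:
print(average_precision(["v1", "v9", "v2"], {"v1", "v2", "v7"}))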
Participants
• 2012: 20 registered, 6 submissions (10 in 2011)
  – 5 veterans, 1 new participant
  – 3 organiser-connected teams, 5 countries

System       | Participant                                                          | Supporting project
KIT          | Karlsruhe Institute of Technology (Germany)                          | Quaero
UNICAMP-UFMG | University of Campinas & Federal University of Minas Gerais (Brazil) | FAPEMIG, FAPESP, CAPES & CNPq
ARF          | University Politehnica of Bucharest (Romania); Johannes Kepler University & Research Institute of Artificial Intelligence (Austria); Polytech Annecy-Chambéry (France) | EXCEL POSDRU
TUB          | Technical University of Berlin (Germany)                             | EU FP7 VideoSense
TUD-MM       | Delft University of Technology (The Netherlands)                     | –
TUD          | Delft University of Technology (The Netherlands)                     | –
Features
• ASR: transcription in English only (LIMSI ASR, LIUM ASR) or with translation of non-English transcriptions; stop-word filtering & stemming; bags of words (BoW); LDA; semantic similarity; TF-IDF; top terms
• Audio: MFCC, LPC, LPS, ZCR, energy; the spectrogram treated as an image; rhythm, timbre, onset strength, loudness, ...
• Visual content:
  – on images: color, texture, face detection
  – on video: self-similarity matrix, shot boundaries, shot length, transitions between shots, motion features
  – on keyframes: bags of visual words (SIFT / rgbSIFT / SURF / HoG)
• Metadata: title, tags, description, filename, show ID, uploader ID; BoW
• Others: videos from YouTube and blip.tv; video distribution over genres (dev); web pages from Google and Wikipedia; synonyms, hyponyms, domain terms and BoW from WordNet; social data from Delicious
• Color coding on the slide distinguished features used in 2011 only, in 2011 & 2012, and in 2012 only.
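The textual features above (stop-word filtering, bags of words, TF-IDF, top terms) can be illustrated with a short scikit-learn sketch; the toy documents and parameter choices are assumptions for illustration, not taken from any participant's system.

from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-ins for the ASR transcripts or metadata text of two videos.
documents = [
    "rumpole angel death book library literature reading",
    "engine race car speed vehicle driving",
]

vectorizer = TfidfVectorizer(stop_words="english")  # stop-word filtering
tfidf = vectorizer.fit_transform(documents)         # videos x terms matrix

# "Top terms": the highest-weighted TF-IDF terms per video.
terms = vectorizer.get_feature_names_out()
for row in tfidf.toarray():
    top = sorted(zip(terms, row), key=lambda t: -t[1])[:3]
    print(top)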
Methods (1/2)
• Machine learning approach (ML)
  – parameter extraction from audio, visual and textual data; early fusion
  – feature processing:
    • transformation (PCA)
    • selection (mutual information, term frequency)
    • dimensionality reduction (LDA)
  – classification methods, supervised or unsupervised:
    • k-NN, SVM, Naive Bayes, DBN, GMM, NN
    • k-means (clustering)
    • CRF (conditional random fields), decision trees (random forest)
  – training step, cross-validation approach and stacking
  – fusion of classifier results / late fusion / majority voting (a minimal sketch follows below)
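As referenced in the last bullet, here is a minimal sketch of supervised classification with late fusion by majority voting, using random placeholder features; real systems trained per-modality classifiers on the audio, visual and textual features listed earlier.

import numpy as np
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 20))      # placeholder feature matrix (120 videos)
y = rng.integers(0, 3, size=120)    # placeholder genre labels (3 classes)

# Majority voting over heterogeneous classifiers ("hard" late fusion).
fusion = VotingClassifier(
    estimators=[("svm", SVC()), ("nb", GaussianNB())],
    voting="hard",
)
print(cross_val_score(fusion, X, y, cv=5).mean())  # cross-validated accuracy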
Methods (2/2)
• Information retrieval approach (IR)
  – text preprocessing & text indexing
  – query and ranking list; query expansion and re-ranking methods
  – fusion of ranked lists from different modalities (RRF; see the sketch below)
  – selection of the category with the highest ranking score
• Evolution since 2011
  – 2011: two distinct communities, using either an ML or an IR approach
  – 2012: mainly ML approaches, or mixed ones
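The ranked-list fusion named above, Reciprocal Rank Fusion, can be sketched in a few lines; the constant k = 60 comes from the original RRF paper (Cormack et al., 2009) and is an assumed default here, not a value reported by the participants.

from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse ranked lists (e.g. one per modality) into a single ranking:
    each item scores the sum of 1 / (k + rank) over all lists."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, vid in enumerate(ranking, start=1):
            scores[vid] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# e.g. fusing an ASR-based ranking with a metadata-based ranking:
print(reciprocal_rank_fusion([["v1", "v2", "v3"], ["v2", "v4", "v1"]]))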
Resources
• Which ones were used?

[Table: which resources (audio, ASR, visual, metadata, social, other) each system used: KIT, UNICAMP-UFMG, ARF, TUB, TUD-MM, TUD]

• Evolution since 2011
  – 2011: use of external data, mainly from the web, and of social data; one participant was especially interested in the social aspect
  – 2012: no external data, no social data
Main results (1/2)
• Each participant's best result:

System       | Best run       | MAP             | Approach | Features                                                           | Method
KIT          | Run 3 (Run 6*) | 0.3499 (0.3581) | ML       | color, texture, rgbSIFT (Run 6*: + video distribution over genres) | SVM
UNICAMP-UFMG | Run 4          | 0.2112          | ML       | BoW                                                                | stacking
ARF          | Run 5          | 0.3793          | ML       | TF-IDF on metadata and LIMSI ASR                                   | linear SVM
TUB          | Run 4          | 0.5225          | ML       | BoW on metadata                                                    | mutual information + Naive Bayes
TUD-MM       | Run 4          | 0.3675          | ML & IR  | TF on visual words, ASR & metadata                                 | linear SVM + reciprocal rank fusion
TUD          | Run 2          | 0.25            | ML       | LIUM ASR one-best                                                  | DBN

Baseline results: all videos assigned to the default category, MAP = 0.0063; videos randomly classified, MAP = 0.002.
Main results (2/2)
• Official run comparison (MAP):

System       | Run 1 (audio/visual)                    | Run 2 (ASR) | Run 3 (excl. metadata) | Run 4 (excl. uploader ID) | Run 5 (all) | Other runs
KIT          | 0.3008 (visual1), 0.2329 (visual2)      | –           | 0.3499 (fusion)        | 0.3461                    | 0.1448      | 0.3581
UNICAMP-UFMG | –                                       | –           | 0.1238                 | 0.2112                    | –           | –
ARF          | 0.1941 (visual & audio), 0.1892 (audio) | –           | 0.2174                 | 0.2204                    | 0.3793      | –
TUB          | 0.2301                                  | 0.1035      | 0.2259                 | 0.5225                    | 0.3304      | –
TUD-MM       | 0.0061                                  | 0.3127      | 0.2279                 | 0.3675                    | 0.2157      | 0.0577, 0.0047
TUD          | –                                       | 0.23 / 0.25 | –                      | –                         | –           | 0.10 / 0.09

Baseline results: all videos assigned to the default category, MAP = 0.0063; videos randomly classified, MAP = 0.002.
Lessons learned or open questions (1/3)
• About data
  – 2011: a small development set (~247 videos); models were difficult to train, and external resources were required
  – 2012: a huge amount of development data (~5,288 videos); enough to train models in a machine learning approach
  This had an impact on the type of methods used (ML versus IR) and on the need for external data.
  No use of social data this year: is it a question of community? It may be disappointing with regard to the MediaEval motivations.
Lessons learned or open questions (2/3)
• About results
  – 2011: best system at MAP 0.5626, using audio, ASR, visual content and metadata (including the uploader ID), plus external data from Google and YouTube
  – 2012: best non-organiser-connected system at MAP 0.3793: TF-IDF on ASR and metadata including the uploader ID, no visual data;
    best organiser-connected system at MAP 0.5225: BoW on metadata without the uploader ID, no visual data
  Results are difficult to compare given the great diversity of features, of methods, and of systems combining both, with monomedia (visual only; ASR only) as well as multimedia contributions. A failure analysis should help to understand "what impacts what?"
Lessons learned or open questions (3/3)
• About the metric
  – MAP was the official metric
  – some participants provided other types of results: correct classification rate, F-score, or detailed AP results per genre
  Would analysing the confusion between genres be of interest?
• About genre labels
  – the labels provided by blip.tv cover two aspects:
    • topics: autos_and_vehicles, health, religion, ...
    • real genres: comedy, documentary, videoblogging, ...
  Would making a distinction between form and content be of interest?
Conclusion
• What to do next time?
  – Has everything been said? Should we leave the task unchanged?
  – If not, we have to define another orientation. Should we focus on another aspect of the content?
    • interaction
    • mood
    • user intention regarding the query
  – We must define what needs to be changed: data, goals and use cases, metric, ...
  – A lot of points need to be considered.
Tagging Task Overview
More details about metadata
Tagging Task Overview
• Motivations & History
• Datasets, Genres, Metadata & Examples
• Task Goal & Evaluation Protocol
• Participants, Features & Methods
• Resources & Main Results
• Conclusion
Tagging Task Overview: Examples
• Example of metadata from blip.tv

<video>
  <title> <![CDATA[One Minute Rumpole and the Angel Of Death]]> </title>
  <description> <![CDATA["Rumpole and the Angel of Death," by John Mortimer, …]]> </description>
  <explicit> false </explicit>
  <duration> 66 </duration>
  <url> http://blip.tv/file/1271048 </url>
  <license>
    <type> Creative Commons Attribution-NonCommercial 2.0 </type>
    <id> 4 </id>
  </license>
  …

Video: Crashsolo-OneMinuteRumpoleAndTheAngelOfDeath659.flv
Tag: 1012 literature (genre label from blip.tv)
Tagging Task Overview: Examples
• Example of metadata from blip.tv (continued)

<tags>                                <!-- tags given by the uploader -->
  <string> oneminutecritic </string>
  <string> fvrl </string>
  <string> vancouver </string>
  <string> library </string>
  <string> books </string>
</tags>
<uploader>
  <uid> 112708 </uid>                 <!-- ID of the uploader -->
  <login> crashsolo </login>
</uploader>
<file>
  <filename> Crashsolo-OneMinuteRumpoleAndTheAngelOfDeath659.flv </filename>
  <link> http://blip.tv/file/get/… </link>
  <size> 3745110 </size>
</file>
<comments />
</video>
Tagging Task Overview: Examples
• Video data (420,000 shots and keyframes): shot boundaries and one keyframe per shot

<?xml version="1.0" encoding="utf-8" ?>
<Segmentation>
  <CreationID> Crashsolo-OneMinuteRumpoleAndTheAngelOfDeath659 </CreationID>
  <InitialFrameID> Crashsolo-OneMinuteRumpoleAndTheAngelOfDeath659/394.jpg </InitialFrameID>
  <Segments type="SHOT">
    <Segment start="T00:00:00:000F1000" end="T00:00:56:357F1000">
      <Index> 0 </Index>
      <KeyFrameID time="T00:00:28:142F1000"> Crashsolo-OneMinuteRumpoleAndTheAngelOfDeath659/394.jpg </KeyFrameID>
    </Segment>
    ...
  </Segments>
</Segmentation>

Video: Crashsolo-OneMinuteRumpoleAndTheAngelOfDeath659.flv
Tag: 1012 literature (genre label from blip.tv)
Tagging Task Overview: Metadata
• Social data (8,856 unique Twitter users)

Video: Crashsolo-OneMinuteRumpoleAndTheAngelOfDeath659.flv
Tag: 1012 literature (genre label from blip.tv)
Code: 1271048
Post: http://blip.tv/file/1271048
From http://twitter.com/crashsolo: "Posted 'One Minute Rumpole and the Angel Of Death' to blip.tv: http://blip.tv/file/1271048"

[Diagram: a user uploads a file on blip.tv and posts a tweet (level 0); the user's contacts form level 1; the contacts' own contacts form level 2]