Overview of the MediaEval 2012 Tagging Task
MediaEval Workshop – 4-5 October 2012 – Pisa, Italy
2012 Tagging Task Overview
Christoph Kofler, Delft University of Technology
Sebastian Schmiedeke, Technical University of Berlin
Isabelle Ferrané, University of Toulouse (IRIT)
Motivations
• MediaEval
  – evaluates new algorithms for multimedia access and retrieval
  – emphasizes the 'multi' in multimedia
  – focuses on human and social aspects of multimedia tasks
• Tagging Task
  – focuses on semi-professional video on the Internet
  – uses features derived from various media or information sources: speech, audio, visual content or associated metadata, and social information
  – focuses on tags related to the genre of the video
History
• Tagging task at MediaEval
  – 2010: The Tagging Task (Wild Wild Web version): prediction of user tags; too many tags and great diversity
  – 2011: The Genre Tagging Task: genres from blip.tv labels (26 labels)
  – 2012: The Tagging Task: the same labels, but much more dev/test data
• MediaEval joined by Quaero
  – 2009 & 2010: internal Quaero campaigns on video genre classification; too few participants
  – 2011 & 2012: the Tagging Task run as an external Quaero evaluation
Datasets
• A set of videos (ME12TT)
  – created by the PetaMedia Network of Excellence
  – downloaded from blip.tv
  – episodes of shows mentioned in Twitter messages
  – licensed under Creative Commons
  – 14,838 episodes from 2,249 shows, ~3,260 hours of data
  – an extension of the MediaEval Wild Wild Web dataset (2010)
• Split into development and test sets
  – 2011: 247 for development / 1,727 for test (1,974 videos)
  – 2012: 5,288 for development / 9,550 for test (7.5 times more)
Genres
• 26 genre labels from blip.tv, the same as in 2011: 25 genres + 1 default_category

1000 art, 1001 autos_and_vehicles, 1002 business, 1003 citizen_journalism,
1004 comedy, 1005 conferences_and_other_events, 1006 default_category,
1007 documentary, 1008 educational, 1009 food_and_drink, 1010 gaming,
1011 health, 1012 literature, 1013 movies_and_television,
1014 music_and_entertainment, 1015 personal_or_auto-biographical,
1016 politics, 1017 religion, 1018 school_and_education, 1019 sports,
1020 technology, 1021 the_environment, 1022 the_mainstream_media,
1023 travel, 1024 videoblogging, 1025 web_development_and_sites
Information available (1/2)
• From different sources:
  – title, description, user tags, uploader ID, duration
  – tweets mentioning the shows
  – automatic processing:
    • automatic speech recognition (ASR): English transcription, and some other languages
    • shot boundaries and one keyframe per shot
Information available (2/2)
• Focus on speech data
  – LIUM transcripts (5,084 files for dev / 6,879 files for test)
    • English
    • one-best hypotheses (NIST CTM format)
    • word lattices (HTK SLF format, 4-gram topology)
    • confusion networks (ATT FSM-like format)
  – LIMSI-VOCAPIA transcripts (5,237 files for dev / 7,215 files for test)
    • English, French, Spanish, Dutch
    • language identification strategy based on a language confidence score: if the score is above 0.8, the transcription is produced in the detected language; otherwise, the better-scoring of English and the detected language is used (see the sketch below)
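A minimal sketch of this selection rule, assuming a `scores` dict of per-language confidence values; the function and variable names are illustrative, not the actual LIMSI-VOCAPIA implementation.

def choose_transcription_language(detected_lang, confidence, scores):
    """Apply the rule above: trust the detected language if its
    confidence exceeds 0.8, otherwise fall back to the better-scoring
    of English and the detected language."""
    if confidence > 0.8:
        return detected_lang
    return max(("en", detected_lang), key=lambda lang: scores.get(lang, 0.0))

# e.g. a clip detected as French with low confidence:
print(choose_transcription_language("fr", 0.6, {"en": 0.65, "fr": 0.6}))  # -> "en"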
Task Goal
• Same as in 2011:
  – searching and browsing the Internet for video
  – using genre as a search or organizational criterion
  – videos may not be accurately or adequately tagged, so the task is to automatically assign genre labels using features derived from speech, audio, visual content, and associated textual or social information
• What's new in 2012:
  – a huge amount of data (7.5 times more)
  – information retrieval as well as classification approaches are enabled, with more balanced datasets (1/3 dev; 2/3 test)
  – each genre was "equally" distributed between both sets
Genre distribution over datasets
[Charts: genre distribution over the development and test sets. Part 1: genres 1000 to 1012; Part 2: genres 1013 to 1025]
Evaluation Protocol
• Task: predict the genre label for each video of the test set
• Submissions: up to 5 runs, representing different approaches:
  – RUN 1: audio and/or visual information (including information about shots and keyframes)
  – RUN 2: ASR transcripts
  – RUN 3: all data except metadata
  – RUN 4: all data except the uploader ID (the ID used in the 2011 campaign)
  – RUN 5: all data
• Ground truth: the genre label associated with each video
• Metric: Mean Average Precision (MAP), which evaluates ranked retrieval results with respect to a set of queries Q (see the sketch below)
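For illustration, a minimal sketch of how MAP can be computed when each of the 26 genre labels is treated as a query and a run provides a ranked list of test videos per genre; this is the textbook formulation, and the organisers' evaluation script may differ in details.

def average_precision(ranked_ids, relevant_ids):
    """AP for one query: average of the precision values measured
    at the rank of each relevant video."""
    hits, precision_sum = 0, 0.0
    for rank, vid in enumerate(ranked_ids, start=1):
        if vid in relevant_ids:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant_ids) if relevant_ids else 0.0

def mean_average_precision(run, ground_truth):
    """MAP: mean of the per-query AP values over the query set Q."""
    return sum(average_precision(run[q], ground_truth[q])
               for q in ground_truth) / len(ground_truth)

# e.g. one genre with three relevant videos, one of them missed:
print(average_precision(["v1", "v9", "v2"], {"v1", "v2", "v7"}))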
Participants
• 2012: 20 registered, 6 submissions (10 in 2011)
  – 5 veterans, 1 new participant
  – 3 organiser-connected teams, 5 countries

System       | Participant                                                          | Supporting project
KIT          | Karlsruhe Institute of Technology (Germany)                          | Quaero
UNICAMP-UFMG | University of Campinas & Federal University of Minas Gerais (Brazil) | FAPEMIG, FAPESP, CAPES & CNPq
ARF          | University Politehnica of Bucharest (Romania); Johannes Kepler University & Research Institute of Artificial Intelligence (Austria); Polytech Annecy-Chambéry (France) | EXCEL POSDRU
TUB          | Technical University of Berlin (Germany)                             | EU FP7 VideoSense
TUD-MM       | Delft University of Technology (The Netherlands)                     | –
TUD          | Delft University of Technology (The Netherlands)                     | –
Features
• ASR: transcription in English only (LIMSI ASR, LIUM ASR) or with translation of non-English transcriptions; stop-word filtering & stemming; bags of words (BoW); LDA; semantic similarity; TF-IDF; top terms
• Audio: MFCC, LPC, LPS, ZCR, energy; the spectrogram treated as an image; rhythm, timbre, onset strength, loudness, ...
• Visual content:
  – on images: color, texture, face detection
  – on video: self-similarity matrix, shot boundaries, shot length, transitions between shots, motion features
  – on keyframes: bags of visual words (SIFT / rgbSIFT / SURF / HoG)
• Metadata: title, tags, description, filename, show ID, uploader ID; BoW
• Others: videos from YouTube and blip.tv; video distribution over genres (dev); web pages from Google and Wikipedia; synonyms, hyponyms, domain terms and BoW from WordNet; social data from Delicious
• Color coding on the slide distinguished features used in 2011 only, in 2011 & 2012, and in 2012 only.
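The textual features above (stop-word filtering, bags of words, TF-IDF, top terms) can be illustrated with a short scikit-learn sketch; the toy documents and parameter choices are assumptions for illustration, not taken from any participant's system.

from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-ins for the ASR transcripts or metadata text of two videos.
documents = [
    "rumpole angel death book library literature reading",
    "engine race car speed vehicle driving",
]

vectorizer = TfidfVectorizer(stop_words="english")  # stop-word filtering
tfidf = vectorizer.fit_transform(documents)         # videos x terms matrix

# "Top terms": the highest-weighted TF-IDF terms per video.
terms = vectorizer.get_feature_names_out()
for row in tfidf.toarray():
    top = sorted(zip(terms, row), key=lambda t: -t[1])[:3]
    print(top)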
Methods (1/2)
• Machine learning approach (ML)
  – parameter extraction from audio, visual and textual data; early fusion
  – feature processing:
    • transformation (PCA)
    • selection (mutual information, term frequency)
    • dimensionality reduction (LDA)
  – classification methods, supervised or unsupervised:
    • k-NN, SVM, Naive Bayes, DBN, GMM, NN
    • k-means (clustering)
    • CRF (conditional random fields), decision trees (random forest)
  – training step, cross-validation approach and stacking
  – fusion of classifier results / late fusion / majority voting (a minimal sketch follows below)
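As referenced in the last bullet, here is a minimal sketch of supervised classification with late fusion by majority voting, using random placeholder features; real systems trained per-modality classifiers on the audio, visual and textual features listed earlier.

import numpy as np
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 20))      # placeholder feature matrix (120 videos)
y = rng.integers(0, 3, size=120)    # placeholder genre labels (3 classes)

# Majority voting over heterogeneous classifiers ("hard" late fusion).
fusion = VotingClassifier(
    estimators=[("svm", SVC()), ("nb", GaussianNB())],
    voting="hard",
)
print(cross_val_score(fusion, X, y, cv=5).mean())  # cross-validated accuracy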
Methods (2/2)
• Information retrieval approach (IR)
  – text preprocessing & text indexing
  – query and ranking list; query expansion and re-ranking methods
  – fusion of ranked lists from different modalities (RRF; see the sketch below)
  – selection of the category with the highest ranking score
• Evolution since 2011
  – 2011: two distinct communities, using either an ML or an IR approach
  – 2012: mainly ML approaches, or mixed ones
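The ranked-list fusion named above, Reciprocal Rank Fusion, can be sketched in a few lines; the constant k = 60 comes from the original RRF paper (Cormack et al., 2009) and is an assumed default here, not a value reported by the participants.

from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse ranked lists (e.g. one per modality) into a single ranking:
    each item scores the sum of 1 / (k + rank) over all lists."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, vid in enumerate(ranking, start=1):
            scores[vid] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# e.g. fusing an ASR-based ranking with a metadata-based ranking:
print(reciprocal_rank_fusion([["v1", "v2", "v3"], ["v2", "v4", "v1"]]))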
Resources
• Which ones were used?

[Table: which resources (audio, ASR, visual, metadata, social, other) each system used: KIT, UNICAMP-UFMG, ARF, TUB, TUD-MM, TUD]

• Evolution since 2011
  – 2011: use of external data, mainly from the web, and of social data; one participant was especially interested in the social aspect
  – 2012: no external data, no social data
Main results (1/2)
• Each participant's best result:

System       | Best run       | MAP             | Approach | Features                                                           | Method
KIT          | Run 3 (Run 6*) | 0.3499 (0.3581) | ML       | color, texture, rgbSIFT (Run 6*: + video distribution over genres) | SVM
UNICAMP-UFMG | Run 4          | 0.2112          | ML       | BoW                                                                | stacking
ARF          | Run 5          | 0.3793          | ML       | TF-IDF on metadata and LIMSI ASR                                   | linear SVM
TUB          | Run 4          | 0.5225          | ML       | BoW on metadata                                                    | mutual information + Naive Bayes
TUD-MM       | Run 4          | 0.3675          | ML & IR  | TF on visual words, ASR & metadata                                 | linear SVM + reciprocal rank fusion
TUD          | Run 2          | 0.25            | ML       | LIUM ASR one-best                                                  | DBN

Baseline results: all videos assigned to the default category, MAP = 0.0063; videos randomly classified, MAP = 0.002.
Main results (2/2)
• Official run comparison (MAP):

System       | Run 1 (audio/visual)                    | Run 2 (ASR) | Run 3 (excl. metadata) | Run 4 (excl. uploader ID) | Run 5 (all) | Other runs
KIT          | 0.3008 (visual1), 0.2329 (visual2)      | –           | 0.3499 (fusion)        | 0.3461                    | 0.1448      | 0.3581
UNICAMP-UFMG | –                                       | –           | 0.1238                 | 0.2112                    | –           | –
ARF          | 0.1941 (visual & audio), 0.1892 (audio) | –           | 0.2174                 | 0.2204                    | 0.3793      | –
TUB          | 0.2301                                  | 0.1035      | 0.2259                 | 0.5225                    | 0.3304      | –
TUD-MM       | 0.0061                                  | 0.3127      | 0.2279                 | 0.3675                    | 0.2157      | 0.0577, 0.0047
TUD          | –                                       | 0.23 / 0.25 | –                      | –                         | –           | 0.10 / 0.09

Baseline results: all videos assigned to the default category, MAP = 0.0063; videos randomly classified, MAP = 0.002.
Lessons learned or open questions (1/3)
• About data
  – 2011: a small development set (~247 videos); models were difficult to train, and external resources were required
  – 2012: a huge amount of development data (~5,288 videos); enough to train models in a machine learning approach
  This had an impact on the type of methods used (ML versus IR) and on the need for external data.
  No use of social data this year: is it a question of community? It may be disappointing with regard to the MediaEval motivations.
Lessons learned or open questions (2/3)
• About results
  – 2011: best system at MAP 0.5626, using audio, ASR, visual content and metadata (including the uploader ID), plus external data from Google and YouTube
  – 2012: best non-organiser-connected system at MAP 0.3793: TF-IDF on ASR and metadata including the uploader ID, no visual data;
    best organiser-connected system at MAP 0.5225: BoW on metadata without the uploader ID, no visual data
  Results are difficult to compare given the great diversity of features, of methods, and of systems combining both, with monomedia (visual only; ASR only) as well as multimedia contributions. A failure analysis should help to understand "what impacts what?"
Lessons learned or open questions (3/3)
• About the metric
  – MAP was the official metric
  – some participants provided other types of results: correct classification rate, F-score, or detailed AP results per genre
  Would analysing the confusion between genres be of interest?
• About genre labels
  – the labels provided by blip.tv cover two aspects:
    • topics: autos_and_vehicles, health, religion, ...
    • real genres: comedy, documentary, videoblogging, ...
  Would making a distinction between form and content be of interest?
Conclusion
• What to do next time?
  – Has everything been said? Should we leave the task unchanged?
  – If not, we have to define another orientation. Should we focus on another aspect of the content?
    • interaction
    • mood
    • user intention regarding the query
  – We must define what needs to be changed: data, goals and use cases, metric, ...
  – A lot of points need to be considered.
Tagging Task Overview
More details about metadata
Tagging Task Overview
• Motivations & History
• Datasets, Genres, Metadata & Examples
• Task Goal & Evaluation Protocol
• Participants, Features & Methods
• Resources & Main Results
• Conclusion
Tagging Task Overview: Examples
• Example of metadata from blip.tv

<video>
  <title> <![CDATA[One Minute Rumpole and the Angel Of Death]]> </title>
  <description> <![CDATA["Rumpole and the Angel of Death," by John Mortimer, …]]> </description>
  <explicit> false </explicit>
  <duration> 66 </duration>
  <url> http://blip.tv/file/1271048 </url>
  <license>
    <type> Creative Commons Attribution-NonCommercial 2.0 </type>
    <id> 4 </id>
  </license>
  …

Video: Crashsolo-OneMinuteRumpoleAndTheAngelOfDeath659.flv
Tag: 1012 literature (genre label from blip.tv)
Tagging Task Overview: Examples
• Example of metadata from blip.tv (continued)

<tags>                                <!-- tags given by the uploader -->
  <string> oneminutecritic </string>
  <string> fvrl </string>
  <string> vancouver </string>
  <string> library </string>
  <string> books </string>
</tags>
<uploader>
  <uid> 112708 </uid>                 <!-- ID of the uploader -->
  <login> crashsolo </login>
</uploader>
<file>
  <filename> Crashsolo-OneMinuteRumpoleAndTheAngelOfDeath659.flv </filename>
  <link> http://blip.tv/file/get/… </link>
  <size> 3745110 </size>
</file>
<comments />
</video>
Tagging Task Overview: Examples
• Video data (420,000 shots and keyframes): shot boundaries and one keyframe per shot

<?xml version="1.0" encoding="utf-8" ?>
<Segmentation>
  <CreationID> Crashsolo-OneMinuteRumpoleAndTheAngelOfDeath659 </CreationID>
  <InitialFrameID> Crashsolo-OneMinuteRumpoleAndTheAngelOfDeath659/394.jpg </InitialFrameID>
  <Segments type="SHOT">
    <Segment start="T00:00:00:000F1000" end="T00:00:56:357F1000">
      <Index> 0 </Index>
      <KeyFrameID time="T00:00:28:142F1000"> Crashsolo-OneMinuteRumpoleAndTheAngelOfDeath659/394.jpg </KeyFrameID>
    </Segment>
    ...
  </Segments>
</Segmentation>

Video: Crashsolo-OneMinuteRumpoleAndTheAngelOfDeath659.flv
Tag: 1012 literature (genre label from blip.tv)
Tagging Task Overview: Metadata
• Social data (8,856 unique Twitter users)

Video: Crashsolo-OneMinuteRumpoleAndTheAngelOfDeath659.flv
Tag: 1012 literature (genre label from blip.tv)
Code: 1271048
Post: http://blip.tv/file/1271048
From http://twitter.com/crashsolo: "Posted 'One Minute Rumpole and the Angel Of Death' to blip.tv: http://blip.tv/file/1271048"

[Diagram: a user uploads a file on blip.tv and posts a tweet (level 0); the user's contacts form level 1; the contacts' own contacts form level 2]