multimulti--frame, multiframe, multi--modal, and ... · • multimulti--modal using multimodal...
TRANSCRIPT
MediaMill TRECVID 2009 17‐11‐2009
http://www.MediaMill.nl 1
MultiMulti--Frame, MultiFrame, Multi--Modal, and MultiModal, and Multi--Kernel Kernel
Concept Detection in VideoConcept Detection in Video
CeesCees G.M. Snoek¹, G.M. Snoek¹, KoenKoen E.A. van de Sande¹, Jasper R.R. Uijlings¹,E.A. van de Sande¹, Jasper R.R. Uijlings¹,Miguel Bugalho², Isabel Trancoso², Miguel Bugalho², Isabel Trancoso², FeiFei Yan³, Yan³, MuhammedMuhammed A. Tahir³, A. Tahir³,
KrystianKrystian Mikolajczyk³, Josef Kittler³, Theo Gevers¹, Dennis C. Koelma¹, Mikolajczyk³, Josef Kittler³, Theo Gevers¹, Dennis C. Koelma¹, Arnold W.M. Smeulders¹Arnold W.M. Smeulders¹
¹¹ ²² ³³
Conclusions TRECVID 2008Conclusions TRECVID 2008
•• Good settings for BagGood settings for Bag--ofof--WordsWords–– SIFT + SIFT + colorSIFTcolorSIFT improves ~8%improves ~8%
–– Soft codebook assignment improves ~7%Soft codebook assignment improves ~7%
–– MultiMulti--frame analysis improves ~20%frame analysis improves ~20%
MediaMill TRECVID 2009 17‐11‐2009
http://www.MediaMill.nl 2
Myth: TRECVID incremental onlyMyth: TRECVID incremental only
>100% improvement in just 3 years
StateState--ofof--thethe--ArtArt
Snoek et al, TRECVID 2008Van de Sande et al, PAMI 2010
Van Gemert et al, PAMI 2010
MediaMill TRECVID 2009 17‐11‐2009
http://www.MediaMill.nl 3
StateState--ofof--thethe--ArtArt
Snoek et al, TRECVID 2008Van de Sande et al, PAMI 2010
Van Gemert et al, PAMI 2010
Software available for download at http://colordescriptors.com
Our TRECVID 2009 focusOur TRECVID 2009 focus
Spatio‐temporalsampling
Visualfeature
extraction
Codebooktransform
Kernel‐based learninglearning
Audio concept detection
MediaMill TRECVID 2009 17‐11‐2009
http://www.MediaMill.nl 4
Our TRECVID 2009 focusOur TRECVID 2009 focus
Spatio‐temporalsampling
Visualfeature
extraction
Codebooktransform
Kernel‐based learning
Multi-k llearning
Audio concept detection kernel
RoadmapRoadmap
Spatio‐temporalsampling
Visualfeature
extraction
Codebooktransform
Kernel‐based learninglearning
Audio concept detection
MediaMill TRECVID 2009 17‐11‐2009
http://www.MediaMill.nl 5
1,000,0001,000,000 frames analyzedframes analyzed
Snoek et al, ICME 2005
•• MultiMulti--frame biggest improvement in 2008frame biggest improvement in 2008–– Extend further by analyzing up to 10 extra Extend further by analyzing up to 10 extra ii--frames/shotframes/shot
–– Yields 1M frames to analyze for the Yields 1M frames to analyze for the test set test set collectioncollection
•• Need to speedNeed to speed--up by up by being “smart and strong”being “smart and strong”SpeedSpeed up feature extractionup feature extraction–– SpeedSpeed--up feature extractionup feature extraction
–– SpeedSpeed--up quantizationup quantization
–– SpeedSpeed--up kernelup kernel--based learningbased learning
–– SpeedSpeed--up by computingup by computing
RoadmapRoadmap
Spatio‐temporalsampling
Visualfeature
extraction
Codebooktransform
Kernel‐based learninglearning
Audio concept detection
MediaMill TRECVID 2009 17‐11‐2009
http://www.MediaMill.nl 6
FastFast densedense descriptorsdescriptorsA
Uijlings et al, CIVR 2009
R
x =
Final Descriptor
Pixel-wise Responses RLinear Interpolation
A x R x BT
Image Patch
16x speed-up2x speed-up
Reuse subregions
RoadmapRoadmap
Spatio‐temporalsampling
Visualfeature
extraction
Codebooktransform
Kernel‐based learninglearning
Audio concept detection
MediaMill TRECVID 2009 17‐11‐2009
http://www.MediaMill.nl 7
Fast quantizationFast quantization
Moosman, PAMI 2008Uijlings et al, CIVR 2009
•• Random Random forestsforests–– Randomized process makes it very fast to buildRandomized process makes it very fast to build
–– Tree structure allows fast vector quantizationTree structure allows fast vector quantization
–– Logarithmic rather than linear projection timeLogarithmic rather than linear projection time
•• RealReal timetime BoWBoW•• RealReal--timetime BoWBoW–– WhenWhen usedused withwith fastfast densedense samplingsampling
–– SURF 2x2 descriptor SURF 2x2 descriptor insteadinstead of 4x4of 4x4
–– RBF RBF kernelkernel
GPUGPU--empowered quantizationempowered quantization
•• Achieve dataAchieve data parallelism by writing Euclideanparallelism by writing Euclidean
Van de Sande et al, ASCI 2009
8,00
10,00
12,00
14,00
e (s)
CPU Xeon (3,4GHz)
CPU Opteron 250 (2,4GHz)
CPU Core 2 Duo 6400 (2,13GHz)
CPU Core i7 (2,66GHz)
•• Achieve dataAchieve data--parallelism by writing Euclidean parallelism by writing Euclidean distance in vector formdistance in vector form
0,00
2,00
4,00
6,00
0 5000 10000 15000 20000
Time Pe
r Im
age
Number of SIFT Descriptors Per Image
GPU Geforce 8800GTX (128 cores)
GPU Geforce GTX260 (216 cores)
17x speed-up
GPU
CPU
MediaMill TRECVID 2009 17‐11‐2009
http://www.MediaMill.nl 8
RoadmapRoadmap
Spatio‐temporalsampling
Visualfeature
extraction
Codebooktransform
Kernel‐based learninglearning
Audio concept detection
SVM preSVM pre--computed computed kernel trickkernel trick
•• Use distance between feature vectors Use distance between feature vectors –– Feature length easily > 100,000Feature length easily > 100,000
•• Increase efficiency significantlyIncrease efficiency significantly–– PrePre--compute the SVM kernel matrixcompute the SVM kernel matrix
–– Long vectors possible as we only need 2 in memoryLong vectors possible as we only need 2 in memory
–– Parameter optimization reParameter optimization re--uses preuses pre--computed matrixcomputed matrix
MediaMill TRECVID 2009 17‐11‐2009
http://www.MediaMill.nl 9
11 Compute average distances perCompute average distances per N²N² kernel subkernel sub--blockblock
GPUGPU--empoweredempoweredprepre--computed kernelcomputed kernel
Van de Sande et al, ASCI 2009
1.1. Compute average distances per Compute average distances per N²N² kernel subkernel sub--blockblock
2.2. Compute kernel function valuesCompute kernel function values
1000
1200
1400
1600
1800
s)
1x Core i7 (2,66GHz)
1x Opteron 250 (2,4GHz)4x Opteron 250 (2,4GHz)16x Opteron 250 (2,4GHz)25x Opteron 250
65x speed-up
1 CPU
4 CPU
0
200
400
600
800
0 20000 40000 60000 80000 100000 120000 140000
Time (
Total Feature Vector Length
25x Opteron 250 (2,4GHz)
p p
3x speed-up
GPU
ComputingComputing
•• 2009 system much 2009 system much more efficient than more efficient than 2008 2008 systemsystem–– 6x more visual data analyzed using less 6x more visual data analyzed using less compute powercompute power
•• Some best estimates:Some best estimates:Visual feature extraction:Visual feature extraction: 84008400 PProcessorrocessor NNodeode HHoursours–– Visual feature extraction: Visual feature extraction: 8400 8400 PProcessorrocessor--NNodeode--HHoursours
–– Training concept detectors: 4000 PNHTraining concept detectors: 4000 PNH
–– Applying concept detectors: ~1 week GPUApplying concept detectors: ~1 week GPU
MediaMill TRECVID 2009 17‐11‐2009
http://www.MediaMill.nl 10
RoadmapRoadmap
Spatio‐temporalsampling
Visualfeature
extraction
Codebooktransform
Kernel‐based learninglearning
Audio concept detection
Audio concept detectionAudio concept detection
Bugalho et al, InterSpeech 2009
External sound corpus:External sound corpus:~100 ~100 conceptsconcepts
Trancoso et al, ICME 2009
Feature extraction
SVM classification
Speech
Music Detector
Telephone
Non Speech
ReasoningSpeaker IDAudio Segmentation
speech, female voice,..
monologue, dialogue,…
music events
low frequency
sirens, water,…
•• Early fusion of features Early fusion of features –– MFCCs (+ deltas), PLPs (+ deltas), Brightness, Bandwidth, MFCCs (+ deltas), PLPs (+ deltas), Brightness, Bandwidth,
ZCR, Pitch, ZCR, Pitch, HarmonicityHarmonicity, Shifted delta , Shifted delta cepstracepstra, Audio , Audio spectrum envelope and flatnessspectrum envelope and flatness
–– 0.50s 0.50s window length, with window length, with 0.25s 0.25s spacingspacing
Telephone detector low frequency
MediaMill TRECVID 2009 17‐11‐2009
http://www.MediaMill.nl 11
RoadmapRoadmap
Spatio‐temporalsampling
Visualfeature
extraction
Codebooktransform
Kernel‐based learninglearning
Audio concept detection
MultiMulti--kernel learningkernel learning
Tahir et al, ICCV-Subspace 2009Yan et al, ICDM 2009
•• KKernel ernel DDiscriminantiscriminant AAnalysis combined with nalysis combined with sspectral pectral rregression egression [Tahir09][Tahir09]
–– We use SRWe use SR--KDA with 6 visual kernels KDA with 6 visual kernels
–– Weighted output combined using SUM ruleWeighted output combined using SUM rule
•• MMultiulti KKernelernel FFisherisher DDiscriminantiscriminant AAnalysisnalysis•• MMultiulti--KKernelernel FFisher isher DDiscriminantiscriminant AAnalysis nalysis
–– We use We use nonnon--sparse sparse L2L2 MKMK--FDA FDA [Yan09][Yan09]
–– Fusion of 1 audio and 6 visual kernelsFusion of 1 audio and 6 visual kernels–– 20 audio concept detector scores used as input for RBF kernel20 audio concept detector scores used as input for RBF kernel
MediaMill TRECVID 2009 17‐11‐2009
http://www.MediaMill.nl 12
Experiments Experiments (all type A)(all type A)
•• BaselineBaseline: single: single--frame SFS on all frame SFS on all visual visual kernelskernels
•• Experiment 1: Experiment 1: mmultiulti--modal & modal & multimulti--kernelkernel–– SRSR--KDA KDA (visual only)(visual only)
–– MKMK--FDA FDA (audiovisual fusion)(audiovisual fusion)
•• Experiment 2: Experiment 2: multimulti--frameframe–– Visual Visual fusion: 5 extra fusion: 5 extra ii--frames + MAX fusion frames + MAX fusion [donated][donated]
–– BestBest--of: 1 to 10 extra of: 1 to 10 extra ii--frames + MAX/AVG fusionframes + MAX/AVG fusion
–– SFS: all multiSFS: all multi--frame visual kernel combinationsframe visual kernel combinations
Results: experiment 1Results: experiment 1
•• MultiMulti--kernel improves upon baseline: ~9%kernel improves upon baseline: ~9%
•• MultiMulti--modal modal kernel outperforms kernel outperforms uniuni--modal modal kernel kernel only slightly: ~2% only slightly: ~2% –– …but for specific (audiovisual) concepts more impressive improvement, up to …but for specific (audiovisual) concepts more impressive improvement, up to 50%50%
Audio not decisive here?
MediaMill TRECVID 2009 17‐11‐2009
http://www.MediaMill.nl 13
Results: experiment 2Results: experiment 2
•• MultiMulti--frame is true performance frame is true performance booster, improvement over baseline: ~booster, improvement over baseline: ~30%30%
•• Best to select optimal number of extra frames, per kernel, per concept, Best to select optimal number of extra frames, per kernel, per concept, –– On average 6 additional On average 6 additional ii--frames with MAX or AVG fusion is a solid choiceframes with MAX or AVG fusion is a solid choice
Visualizing multiVisualizing multi--frame impactframe impact
http://www.MediaMill.nl
MediaMill TRECVID 2009 17‐11‐2009
http://www.MediaMill.nl 14
Conclusions TRECVID 2009Conclusions TRECVID 2009•• MultiMulti--modal using multimodal using multi--kernel kernel seems promisingseems promising
M i t d d t bM i t d d t b l il i–– More experiments needed to be More experiments needed to be conclusiveconclusive
•• MultiMulti--frame frame is true is true performance boosterperformance booster–– 30% 30% improvement over improvement over singlesingle--frame baselineframe baseline
–– Time for the community to move on Time for the community to move on to to videovideo analysisanalysis
Multi-frame
Multi-modal using multi-kernel
References References II
http://www.vidivideo.eu
The MediaMill TRECVID 2008 Semantic Video Search Engine. C.G.M. Snoek et al. Proceedings of the TRECVID Workshop, 2008.
Evaluating Color Descriptors for Object and Scene Recognition. K.E.A. van de Sande, Th. Gevers, C.G.M. Snoek. IEEE Trans. Pattern Analysis and Machine Intelligence (in press), 2010.
Visual Word Ambiguity. Jan C. van Gemert, Cor J. Veenman, Arnold W. M. Smeulders, Jan-Mark Geusebroek. IEEE Trans. Pattern Analysis and Machine Intelligence (in press), 2009.
On the Surplus Value of Semantic Video Analysis Beyond the Key Frame. Cees G. M. Snoek, Marcel Worring, Jan-Mark Geusebroek, Dennis C. Koelma, and Frank J. Seinst a P oc IEEE Int’l Confe ence on M ltimedia & E po 2005Seinstra. Proc. IEEE Int’l Conference on Multimedia & Expo, 2005.
Real-Time Bag of Words, Approximately. Jasper R. R. Uijlings, Arnold W. M. Smeulders, R. J. H. Scha. ACM Int'l Conference on Image and Video Retrieval, 2009.
Empowering Visual Categorization with the GPU. K. E. A. van de Sande, T. Gevers, and C. G. M. Snoek. In Proc. Annual Conference of the Advanced School forComputing and Imaging, 2009.
MediaMill TRECVID 2009 17‐11‐2009
http://www.MediaMill.nl 15
References IIReferences II
http://www.vidivideo.eu
Detecting Audio Events for Semantic Video Search. M. Bugalho, J. Portelo, I. Trancoso, T. Pellegrini, and A. Abad. In InterSpeech, 2009.
Audio Contributions to Semantic Video Search. I. Trancoso, T. Pellegrini, J. Portelo, H. Meinedo, M. Bugalho, A. Abad, and J. Neto. Proc. IEEE Int’l Conference on Multimedia & Expo, 2009.
Visual Category Recognition using Spectral Regression and Kernel DiscriminantAnalysis. M. A. Tahir, J. Kittler, K. Mikolajczyk, F. Yan, K. van de Sande, and T. Gevers. In Proc. 2nd Int’l Workshop on Subspace, In Conjunction with ICCV, 2009.
Nonsparse Multiple Kernel Learning for Fisher Discriminant Analysis. F. Yan, J. Kittler K Mikolajczyk and A Tahir In IEEE Int’l conf Data MiningKittler, K. Mikolajczyk, and A. Tahir. In IEEE Int l conf. Data Mining, 2009.
The MediaMill TRECVID 2009 Semantic Video Search Engine. C.G.M. Snoek et al. Proceedings of the TRECVID Workshop, 2009.
Concept-Based Video Retrieval. C.G.M. Snoek, M. Worring. Foundations and Trends in Information Retrieval, Vol. 4 (2), page 215-322, 2009.
Contact infoContact info
•• Cees Snoek Cees Snoek http://staff.science.uva.nl/~cgmsnoekhttp://staff.science.uva.nl/~cgmsnoek