
MediaMill TRECVID 2009 17‐11‐2009

http://www.MediaMill.nl

Multi-Frame, Multi-Modal, and Multi-Kernel Concept Detection in Video

Cees G.M. Snoek¹, Koen E.A. van de Sande¹, Jasper R.R. Uijlings¹, Miguel Bugalho², Isabel Trancoso², Fei Yan³, Muhammed A. Tahir³, Krystian Mikolajczyk³, Josef Kittler³, Theo Gevers¹, Dennis C. Koelma¹, Arnold W.M. Smeulders¹

Conclusions TRECVID 2008

• Good settings for Bag-of-Words
– SIFT + colorSIFT improves ~8%
– Soft codebook assignment improves ~7%
– Multi-frame analysis improves ~20%
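To make the soft codebook assignment bullet concrete, here is a minimal sketch of one common form of soft assignment, in which each descriptor contributes to every codeword with a Gaussian weight instead of voting only for its nearest codeword. The Gaussian kernel, the `sigma` parameter, and the function name `soft_assign` are assumptions for illustration; the slides do not specify the exact weighting used.

```python
import math

def soft_assign(descriptor, codebook, sigma):
    """Return one weight per codeword; the weights sum to 1 (assumed Gaussian weighting)."""
    weights = []
    for word in codebook:
        sq_dist = sum((d - c) ** 2 for d, c in zip(descriptor, word))
        weights.append(math.exp(-sq_dist / (2.0 * sigma ** 2)))
    total = sum(weights)
    return [w / total for w in weights]

# Toy codebook with two 2-D codewords; the descriptor lies exactly between them.
codebook = [[0.0, 0.0], [2.0, 0.0]]
weights = soft_assign([1.0, 0.0], codebook, sigma=1.0)
```

With hard assignment the descriptor above would be forced onto a single codeword; with soft assignment the ambiguity is kept and both codewords receive equal weight.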



Myth: TRECVID incremental only

>100% improvement in just 3 years

State-of-the-Art

Snoek et al, TRECVID 2008

Van de Sande et al, PAMI 2010

Van Gemert et al, PAMI 2010




Software available for download at http://colordescriptors.com

Our TRECVID 2009 focus

[Pipeline diagram: Spatio-temporal sampling, Visual feature extraction, Codebook transform, Kernel-based learning, Audio concept detection, Multi-kernel learning]

Roadmap

[Pipeline diagram; surviving labels: Visual feature extraction, Codebook transform]





1,000,000 frames analyzed

Snoek et al, ICME 2005

• Multi-frame biggest improvement in 2008
– Extend further by analyzing up to 10 extra i-frames/shot
– Yields 1M frames to analyze for the test set collection
• Need to speed-up by being “smart and strong”
– Speed-up feature extraction
– Speed-up quantization
– Speed-up kernel-based learning
– Speed-up by computing






Fast dense descriptors

Uijlings et al, CIVR 2009

[Figure: Image Patch → Pixel-wise Responses R → Linear Interpolation → Final Descriptor = A × R × Bᵀ. Annotations: Reuse subregions; 16x speed-up; 2x speed-up]
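The A × R × Bᵀ formulation can be illustrated with a small sketch: the pixel-wise responses R of a patch are aggregated into spatial bins by multiplying with one weight matrix for the rows (A) and one for the columns (B). The example below assumes simple box weights (each pixel belongs to exactly one bin); the linear interpolation mentioned on the slide would use fractional weights in A and B instead. The helper names `matmul`, `transpose`, and `box_weights` are hypothetical.

```python
def matmul(X, Y):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def transpose(M):
    return [list(col) for col in zip(*M)]

def box_weights(bins, size):
    """bins x size matrix; row b holds ones over the pixels that fall into bin b."""
    step = size // bins
    return [[1.0 if b * step <= p < (b + 1) * step else 0.0 for p in range(size)]
            for b in range(bins)]

# Toy 4x4 patch of pixel-wise responses, aggregated into 2x2 subregions.
R = [[1, 2, 3, 4],
     [5, 6, 7, 8],
     [9, 10, 11, 12],
     [13, 14, 15, 16]]
A = box_weights(2, 4)  # aggregates rows
B = box_weights(2, 4)  # aggregates columns
descriptor = matmul(matmul(A, R), transpose(B))
```

Each entry of `descriptor` is the sum of responses in one quadrant of the patch, so the whole spatial pooling step reduces to two matrix products.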






Fast quantization

Moosmann et al, PAMI 2008

Uijlings et al, CIVR 2009

• Random forests
– Randomized process makes it very fast to build
– Tree structure allows fast vector quantization
– Logarithmic rather than linear projection time
• Real-time BoW
– When used with fast dense sampling
– SURF 2x2 descriptor instead of 4x4
– RBF kernel
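The logarithmic projection time can be sketched with a single randomized tree: a descriptor reaches its visual word after at most `depth` comparisons, instead of being compared against every codeword. This is a simplified, unsupervised illustration and not the forests of the cited papers; the median split rule and the names `build_tree` and `quantize` are assumptions.

```python
import random

def build_tree(points, depth, rng, next_leaf):
    """Split on a randomly chosen dimension at the median; leaves act as visual words."""
    if depth > 0 and len(points) > 1:
        dim = rng.randrange(len(points[0]))
        thr = sorted(p[dim] for p in points)[len(points) // 2]
        left = [p for p in points if p[dim] < thr]
        right = [p for p in points if p[dim] >= thr]
        if left and right:
            return {"dim": dim, "thr": thr,
                    "left": build_tree(left, depth - 1, rng, next_leaf),
                    "right": build_tree(right, depth - 1, rng, next_leaf)}
    leaf = {"word": next_leaf[0]}  # assign the next free visual-word index
    next_leaf[0] += 1
    return leaf

def quantize(tree, x):
    """Descend from root to leaf; return (visual word, number of comparisons)."""
    node, comparisons = tree, 0
    while "word" not in node:
        node = node["left"] if x[node["dim"]] < node["thr"] else node["right"]
        comparisons += 1
    return node["word"], comparisons

rng = random.Random(0)
descriptors = [[rng.random() for _ in range(8)] for _ in range(64)]
next_leaf = [0]
tree = build_tree(descriptors, depth=4, rng=rng, next_leaf=next_leaf)
num_words = next_leaf[0]
word, comparisons = quantize(tree, descriptors[0])
```

A tree of depth 4 has at most 16 leaves, yet any descriptor is quantized with at most 4 comparisons; a flat codebook of the same size would need 16 distance computations.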

GPU-empowered quantization

Van de Sande et al, ASCI 2009

• Achieve data-parallelism by writing Euclidean distance in vector form

[Figure: Time Per Image (s) versus Number of SIFT Descriptors Per Image (0 to 20000). Curves: CPU Xeon (3.4 GHz), CPU Opteron 250 (2.4 GHz), CPU Core 2 Duo 6400 (2.13 GHz), CPU Core i7 (2.66 GHz), GPU Geforce 8800GTX (128 cores), GPU Geforce GTX260 (216 cores). Annotation: 17x speed-up, GPU versus CPU]
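Writing the Euclidean distance in vector form rests on the identity ‖x − c‖² = ‖x‖² + ‖c‖² − 2 x·c. The norms are computed once per descriptor and once per codeword, and the dot products for all pairs form a single matrix product, which is the part that parallelizes well. The sketch below shows the identity in plain Python as a stand-in for the GPU implementation; the function names are hypothetical.

```python
def squared_norms(rows):
    return [sum(v * v for v in row) for row in rows]

def pairwise_sq_dists(X, C):
    """Squared distances via ||x||^2 + ||c||^2 - 2 x.c for every (descriptor, codeword) pair."""
    xn, cn = squared_norms(X), squared_norms(C)
    return [[xn[i] + cn[j] - 2.0 * sum(a * b for a, b in zip(x, c))
             for j, c in enumerate(C)]
            for i, x in enumerate(X)]

def nearest_codeword(X, C):
    """Index of the closest codeword for each descriptor."""
    return [min(range(len(C)), key=row.__getitem__) for row in pairwise_sq_dists(X, C)]

X = [[0.0, 0.0], [3.0, 4.0], [1.0, 1.0]]  # toy descriptors
C = [[0.0, 0.0], [3.0, 3.0]]              # toy codebook
D = pairwise_sq_dists(X, C)
words = nearest_codeword(X, C)
```

On a GPU the inner dot products would be replaced by one dense matrix multiplication of the descriptor matrix with the transposed codebook.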






SVM pre-computed kernel trick

• Use distance between feature vectors
– Feature length easily > 100,000
• Increase efficiency significantly
– Pre-compute the SVM kernel matrix
– Long vectors possible as we only need 2 in memory
– Parameter optimization re-uses pre-computed matrix
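The pre-computation idea can be sketched as follows: the pairwise distance matrix is filled while holding only two feature vectors at a time, and kernel matrices for different parameter values are then derived from that one matrix without touching the long feature vectors again. The squared Euclidean distance, the RBF form `exp(-gamma * d)`, and the `load_vector` callback (standing in for reading a vector from disk) are assumptions for illustration.

```python
import math

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def distance_matrix(load_vector, n):
    """Fill the symmetric n x n distance matrix; only two vectors are held at once."""
    D = [[0.0] * n for _ in range(n)]
    for i in range(n):
        vi = load_vector(i)
        for j in range(i + 1, n):
            vj = load_vector(j)
            D[i][j] = D[j][i] = sq_dist(vi, vj)
    return D

def rbf_kernel(D, gamma):
    """Kernel values computed from the stored distances only."""
    return [[math.exp(-gamma * d) for d in row] for row in D]

features = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]  # toy stand-in for long vectors
D = distance_matrix(lambda i: features[i], len(features))

# Parameter search re-uses D; no feature vector is read again.
kernels = {gamma: rbf_kernel(D, gamma) for gamma in (0.1, 1.0)}
```

Each resulting matrix could then be handed to an SVM implementation that accepts a pre-computed kernel.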



GPU-empowered pre-computed kernel

Van de Sande et al, ASCI 2009

1. Compute average distances per N² kernel sub-block
2. Compute kernel function values

[Figure: Time (s) versus Total Feature Vector Length (0 to 140000). Curves: 1x Core i7 (2.66 GHz), 1x Opteron 250 (2.4 GHz), 4x Opteron 250 (2.4 GHz), 16x Opteron 250 (2.4 GHz), 25x Opteron 250 (2.4 GHz), GPU. Annotations: 65x speed-up, 3x speed-up; labels 1 CPU, 4 CPU, GPU]

Computing

• 2009 system much more efficient than 2008 system
– 6x more visual data analyzed using less compute power
• Some best estimates:
– Visual feature extraction: 8400 Processor-Node-Hours
– Training concept detectors: 4000 PNH
– Applying concept detectors: ~1 week GPU






Audio concept detection

Bugalho et al, InterSpeech 2009

Trancoso et al, ICME 2009

External sound corpus: ~100 concepts

[Diagram: Feature extraction, SVM classification, Audio Segmentation, Speaker ID, Reasoning; detectors: Speech / Non Speech, Music Detector, Telephone detector (low frequency). Example outputs: speech, female voice, ...; monologue, dialogue, ...; music events; sirens, water, ...]

• Early fusion of features
– MFCCs (+ deltas), PLPs (+ deltas), Brightness, Bandwidth, ZCR, Pitch, Harmonicity, Shifted delta cepstra, Audio spectrum envelope and flatness
– 0.50s window length, with 0.25s spacing
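The windowing described above (0.50 s windows spaced 0.25 s apart) can be sketched together with one of the listed features, the zero-crossing rate (ZCR). The sample rate and the toy signal are assumptions chosen to keep the example small; the other features in the list would be computed per window in the same way and concatenated for early fusion.

```python
def frame_signal(samples, sample_rate, window_s=0.50, hop_s=0.25):
    """Cut the signal into overlapping windows of window_s seconds, hop_s seconds apart."""
    win = int(round(window_s * sample_rate))
    hop = int(round(hop_s * sample_rate))
    return [samples[start:start + win]
            for start in range(0, len(samples) - win + 1, hop)]

def zero_crossing_rate(window):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(window, window[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(window) - 1)

sample_rate = 8          # toy rate (samples per second), assumed for illustration
samples = [1, -1] * 8    # 2 seconds of a signal that flips sign every sample
windows = frame_signal(samples, sample_rate)
zcr = [zero_crossing_rate(w) for w in windows]
```

Because consecutive windows overlap by half, a 2-second signal yields 7 windows rather than 4.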






Multi-kernel learning

Tahir et al, ICCV-Subspace 2009

Yan et al, ICDM 2009

• Kernel Discriminant Analysis combined with spectral regression [Tahir09]
– We use SR-KDA with 6 visual kernels
– Weighted output combined using SUM rule
• Multi-Kernel Fisher Discriminant Analysis
– We use non-sparse L2 MK-FDA [Yan09]
– Fusion of 1 audio and 6 visual kernels
– 20 audio concept detector scores used as input for RBF kernel
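As a rough illustration of kernel fusion, the sketch below combines one visual and one audio kernel matrix as a weighted sum, with weights scaled to unit L2 norm. In MK-FDA the weights are learned; here they are fixed, hypothetical values, and the function names are assumptions. An L2 constraint of this kind tends to leave every kernel with a nonzero weight, in contrast to sparse (L1) weighting.

```python
import math

def l2_normalize(weights):
    """Scale the weights so that their L2 norm equals 1."""
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights]

def combine_kernels(kernels, weights):
    """Weighted sum of kernel matrices of identical size."""
    n = len(kernels[0])
    return [[sum(w * K[i][j] for w, K in zip(weights, kernels)) for j in range(n)]
            for i in range(n)]

K_visual = [[1.0, 0.2], [0.2, 1.0]]  # toy visual kernel on two samples
K_audio = [[1.0, 0.8], [0.8, 1.0]]   # toy audio kernel on the same samples
beta = l2_normalize([3.0, 4.0])      # hypothetical weights, not learned
K = combine_kernels([K_visual, K_audio], beta)
```

The combined matrix `K` can be used wherever a single kernel matrix is expected, which is what allows audio and visual evidence to enter one classifier.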



Experiments (all type A)

• Baseline: single-frame SFS on all visual kernels
• Experiment 1: multi-modal & multi-kernel
– SR-KDA (visual only)
– MK-FDA (audiovisual fusion)
• Experiment 2: multi-frame
– Visual fusion: 5 extra i-frames + MAX fusion [donated]
– Best-of: 1 to 10 extra i-frames + MAX/AVG fusion
– SFS: all multi-frame visual kernel combinations
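MAX and AVG fusion in the multi-frame experiment reduce the concept scores of all analyzed frames of a shot to one shot-level score. A minimal sketch, assuming each frame already has a detector score in [0, 1]; the function name and the toy scores are hypothetical.

```python
def fuse_shot_scores(frame_scores, method="max"):
    """Combine per-frame concept scores into one score for the shot."""
    if method == "max":
        return max(frame_scores)
    if method == "avg":
        return sum(frame_scores) / len(frame_scores)
    raise ValueError("method must be 'max' or 'avg'")

# Toy shot: key frame plus two extra i-frames.
shot = [0.25, 0.75, 0.5]
max_score = fuse_shot_scores(shot, "max")
avg_score = fuse_shot_scores(shot, "avg")
```

MAX fusion rewards a concept that is clearly visible in any single frame, while AVG fusion favors concepts that persist across the shot.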

Results: experiment 1

• Multi-kernel improves upon baseline: ~9%
• Multi-modal kernel outperforms uni-modal kernel only slightly: ~2%
– …but for specific (audiovisual) concepts more impressive improvement, up to 50%

Audio not decisive here?



Results: experiment 2

• Multi-frame is true performance booster, improvement over baseline: ~30%
• Best to select optimal number of extra frames, per kernel, per concept
– On average 6 additional i-frames with MAX or AVG fusion is a solid choice

Visualizing multi-frame impact

http://www.MediaMill.nl



Conclusions TRECVID 2009

• Multi-modal using multi-kernel seems promising
– More experiments needed to be conclusive
• Multi-frame is true performance booster
– 30% improvement over single-frame baseline
– Time for the community to move on to video analysis

[Figure labels: Multi-frame; Multi-modal using multi-kernel]

References I

http://www.vidivideo.eu

The MediaMill TRECVID 2008 Semantic Video Search Engine. C.G.M. Snoek et al. Proceedings of the TRECVID Workshop, 2008.

Evaluating Color Descriptors for Object and Scene Recognition. K.E.A. van de Sande, Th. Gevers, C.G.M. Snoek. IEEE Trans. Pattern Analysis and Machine Intelligence (in press), 2010.

Visual Word Ambiguity. Jan C. van Gemert, Cor J. Veenman, Arnold W. M. Smeulders, Jan-Mark Geusebroek. IEEE Trans. Pattern Analysis and Machine Intelligence (in press), 2009.

On the Surplus Value of Semantic Video Analysis Beyond the Key Frame. Cees G. M. Snoek, Marcel Worring, Jan-Mark Geusebroek, Dennis C. Koelma, and Frank J. Seinstra. Proc. IEEE Int’l Conference on Multimedia & Expo, 2005.

Real-Time Bag of Words, Approximately. Jasper R. R. Uijlings, Arnold W. M. Smeulders, R. J. H. Scha. ACM Int'l Conference on Image and Video Retrieval, 2009.

Empowering Visual Categorization with the GPU. K. E. A. van de Sande, T. Gevers, and C. G. M. Snoek. In Proc. Annual Conference of the Advanced School for Computing and Imaging, 2009.



References II

Detecting Audio Events for Semantic Video Search. M. Bugalho, J. Portelo, I. Trancoso, T. Pellegrini, and A. Abad. In InterSpeech, 2009.

Audio Contributions to Semantic Video Search. I. Trancoso, T. Pellegrini, J. Portelo, H. Meinedo, M. Bugalho, A. Abad, and J. Neto. Proc. IEEE Int’l Conference on Multimedia & Expo, 2009.

Visual Category Recognition using Spectral Regression and Kernel Discriminant Analysis. M. A. Tahir, J. Kittler, K. Mikolajczyk, F. Yan, K. van de Sande, and T. Gevers. In Proc. 2nd Int’l Workshop on Subspace, In Conjunction with ICCV, 2009.

Nonsparse Multiple Kernel Learning for Fisher Discriminant Analysis. F. Yan, J. Kittler, K. Mikolajczyk, and A. Tahir. In IEEE Int’l Conf. Data Mining, 2009.

The MediaMill TRECVID 2009 Semantic Video Search Engine. C.G.M. Snoek et al. Proceedings of the TRECVID Workshop, 2009.

Concept-Based Video Retrieval. C.G.M. Snoek, M. Worring. Foundations and Trends in Information Retrieval, Vol. 4 (2), page 215-322, 2009.

Contact info

• Cees Snoek http://staff.science.uva.nl/~cgmsnoek
