vireo@trecvid 2016: multimedia event detection, ad-hoc video … · concept-bank vs object-vlad...

RESEARCH POSTER PRESENTATION DESIGN © 2015

www.PosterPresentations.com

Hao Zhang, Lei Pang, Yi-Jie Lu, Chong-Wah NgoVideo Retrieval Group (VIREO), City University of Hong Kong

http://vireo.cs.cityu.edu.hk

VIREO@TRECVID 2016: Multimedia Event Detection, Ad-hoc Video Search,Video to Text Description

Multimedia Event Detection (MED) Video to Text Description (VTT)

Motivation

Framework

Experiments

Conv 1Max Pool

Conv 2Max Pool

Conv 3Conv 4Conv 5

Extarction Scaling

Scaling

RoI

RoI

Feature Map ofScaled RoI

Feature Map

Feature Map of RoI

Scaled Feature Map of RoI

Image

RoI

Dee

p F

eatu

re E

xtra

ctio

n

Conv 1Max Pool

Conv 2Max Pool

Conv 3Conv 4Conv 5

COS Similarity Only around 40% of Regional Information left

Discriminatory power

of deep features consistently improves

0.4

0.45

0.5

0.55

0.6

0.65

0.7

conv1 conv2 conv3 conv4 conv5

AlexNet

cos-sim

0.4 0.5 0.6 0.7 0.8 0.9

conv1-1conv1-2conv2-1conv2-2conv3-1conv3-2conv3-3conv3-4conv4-1conv4-2conv4-3conv4-4conv5-1conv5-2conv5-3conv5-4

VGG-19-Net

cos-sim

0.4

0.5

0.6

0.7

0.8

0.9

1

conv1Res2ARes2BRes2CRes3ARes3BRes3CRes3DRes4ARes4BRes4CRes4DRes4ERes4FRes5ARes5BRes5C

Deep-ResNet-50

cos-sim

Video Tim

eFrame

Selective search

Candidate Objects

Temporal line

Spatial & temporal features clustering

Primary Scenes

Regional objects

Deep features K-means & VLAD

Video Tim

e

Frame People

Concept Bank

Baby Cake People

Time

Sequential Features

Average pooling

Chi2 SVM

Concept Bank Vocabulary

Object-VLAD Spatial Pyramid Pooling VLAD Details

Deep Feature MapExtraction

Feature Map7 X 7

Spatial Pyramid Pooling

7 X 7 6 X 6 5 X 5 2 X 2

50

de

scri

pto

rs

VLAD

Max pool filter:

50

de

scri

pto

rs

Concept-Bank VS Object-VLAD

CNN-VLAD

CNN-VLAD VS Object-VLAD

0

5

10

15

20

25

30

35

40

45

Team

2

VIR

EO

Team

3

Team

4

Team

5

Team

6

Team

7

Team

8

Team

9

Team

10

Team

11

MED16-EvalSub (MinfAP200%)

25

27

29

31

33

35

37

39

41

Team2 VIREO Team3 Team4

MED16-EvalFull (MinfAP200%)

Evaluation: PS-10Ex (Object-VLAD + Concept-Bank)

Evaluation: AH-10Ex

(Object-VLAD + Concept-Bank)

25

30

35

40

45

50

Team2 VIREO Team3 Team4

MED16-EvalFull (MinfAP200%)

0

5

10

15

20

25

30

35

40

45

50MED16-EvalSub (MinfAP200%)

Evaluation: PS-100Ex (Object-VLAD)

Pri

mary

Ru

ns

Pri

mary

Ru

ns

Pri

mary

Ru

ns

Framework

Concept-based Text Description Matching

Stacked Attention Model

Pool5 Feature Map

Image embedding layer

Some guy is playing Frisbee with a dog

Text CNN Layer

Soft-m

ax

Soft-m

ax

Multi-modal embedding

layer

Multi-modal embedding

layer

tanh

hA P1

vT u

hA

(2)

PI

(2)

vI

u(2)

S(vI, vT )

: feature for each region

1st Attention Layer 2nd Attention Layer

Objective-Function

Training Dataset Deep Features

Experiments

Ad-hoc Video Search (AVS)

OnlineOnline

OfflineOffline

Semantic Query Generation

Event QueryCleaning an appliance Tokens

NLP Parser

Concept matching and selection

Shot Corpus

$

€

₤

ƒ

$

￥

Concept Bank

Shot Ranking

Concept-based Video Representation

cleaning

refrigeratordishwasherstovemicrowave

spray bottle

kitchen

Semantic Query

Vocabulary

Framework

Performance for different concept settings

in Video-Search 2008 dataset (MinfAP%)

Experiments

Performance for different concept settings in

AVS 2016 evaluation dataset (MinfAP%)

Performance with and without highlighted concept

ranking (HCR)

Highlighted Concept Reranking (HCR)

guitar (#RC)

acoustic guitar (#I-1K)

electric guitar (#I-1K)

daytime outdoor (#SIN)

outdoor (#SIN)

outdoor (#RC)

a person playing guitar outdoors

Rerank 1

500

Without reranking With reranking

vireo@trecvid 2016: multimedia event detection, ad-hoc video … · concept-bank vs object-vlad...

Documents