vireo@trecvid 2016: multimedia event detection, ad-hoc video … · concept-bank vs object-vlad...

1
RESEARCH POSTER PRESENTATION DESIGN © 2015 www.PosterPresentations.com Hao Zhang, Lei Pang, Yi-Jie Lu, Chong-Wah Ngo Video Retrieval Group (VIREO), City University of Hong Kong http://vireo.cs.cityu.edu.hk VIREO@TRECVID 2016: Multimedia Event Detection, Ad-hoc Video Search, Video to Text Description Multimedia Event Detection (MED) Video to Text Description (VTT) Motivation Framework Experiments Conv 1 Max Pool Conv 2 Max Pool Conv 3 Conv 4 Conv 5 Extarction Scaling Scaling RoI RoI Feature Map of Scaled RoI Feature Map Feature Map of RoI Scaled Feature Map of RoI Image RoI Deep Feature Extraction Conv 1 Max Pool Conv 2 Max Pool Conv 3 Conv 4 Conv 5 COS Similarity Only around 40% of Regional Information left Discriminatory power of deep features consistently improves 0.4 0.45 0.5 0.55 0.6 0.65 0.7 conv1 conv2 conv3 conv4 conv5 AlexNet cos-sim 0.4 0.5 0.6 0.7 0.8 0.9 conv1-1 conv1-2 conv2-1 conv2-2 conv3-1 conv3-2 conv3-3 conv3-4 conv4-1 conv4-2 conv4-3 conv4-4 conv5-1 conv5-2 conv5-3 conv5-4 VGG-19-Net cos-sim 0.4 0.5 0.6 0.7 0.8 0.9 1 conv1 Res2A Res2B Res2C Res3A Res3B Res3C Res3D Res4A Res4B Res4C Res4D Res4E Res4F Res5A Res5B Res5C Deep-ResNet-50 cos-sim Video Time Frame Selective search Candidate Objects Temporal line Spatial & temporal features clustering Regional objects Deep features K-means & VLAD Video Time Frame People Concept Bank Baby Cake People Time Sequential Features Average pooling Chi2 SVM Concept Bank Vocabulary Object-VLAD Spatial Pyramid Pooling VLAD Details Deep Feature Map Extraction Feature Map 7 X 7 Spatial Pyramid Pooling 7 X 7 6 X 6 5 X 5 2 X 2 50 descriptors VLAD Max pool filter: 50 descriptors Concept-Bank VS Object-VLAD CNN-VLAD CNN-VLAD VS Object-VLAD 0 5 10 15 20 25 30 35 40 45 Team2 VIREO Team3 Team4 Team5 Team6 Team7 Team8 Team9 Team10 Team11 MED16-EvalSub (MinfAP200%) 25 27 29 31 33 35 37 39 41 Team2 VIREO Team3 Team4 MED16-EvalFull (MinfAP200%) Evaluation: PS-10Ex (Object-VLAD + Concept-Bank) Evaluation: AH-10Ex (Object-VLAD + Concept-Bank) 25 30 35 40 45 50 Team2 VIREO Team3 Team4 MED16-EvalFull (MinfAP200%) 0 5 10 15 20 25 30 35 40 45 50 MED16-EvalSub (MinfAP200%) Evaluation: PS-100Ex (Object-VLAD) Primary Runs Primary Runs Primary Runs Framework Concept-based Text Description Matching Stacked Attention Model Pool5 Feature Map Image embedding layer Some guy is playing Frisbee with a dog Text CNN Layer Soft-max Soft-max Multi-modal embedding layer Multi-modal embedding layer tanh hA P1 vT u hA (2) PI (2) vI u (2) S(vI, vT ) : feature for each region 1 st Attention Layer 2 nd Attention Layer Objective-Function Training Dataset Deep Features Experiments Ad-hoc Video Search (AVS) Online Online Offline Offline Semantic Query Generation Event Query Cleaning an appliance Tokens NLP Parser Concept matching and selection Shot Corpus $ ƒ $ Concept Bank Shot Ranking Concept-based Video Representation cleaning refrigerator dishwasher stove microwave spray bottle kitchen Semantic Query Vocabulary Framework Performance for different concept settings in Video-Search 2008 dataset (MinfAP%) Experiments Performance for different concept settings in AVS 2016 evaluation dataset (MinfAP%) Performance with and without highlighted concept ranking (HCR) Highlighted Concept Reranking (HCR) guitar (#RC) acoustic guitar (#I-1K) electric guitar (#I-1K) daytime outdoor (#SIN) outdoor (#SIN) outdoor (#RC) a person playing guitar outdoors Rerank 1 500 Without reranking With reranking

Upload: others

Post on 07-May-2020

4 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: VIREO@TRECVID 2016: Multimedia Event Detection, Ad-hoc Video … · Concept-Bank VS Object-VLAD CNN-VLAD CNN-VLAD VS Object-VLAD 0 5 10 15 20 25 30 35 40 45 m2 VIREO m3 m4 m5 m6 m7

RESEARCH POSTER PRESENTATION DESIGN © 2015

www.PosterPresentations.com

Hao Zhang, Lei Pang, Yi-Jie Lu, Chong-Wah NgoVideo Retrieval Group (VIREO), City University of Hong Kong

http://vireo.cs.cityu.edu.hk

VIREO@TRECVID 2016: Multimedia Event Detection, Ad-hoc Video Search,Video to Text Description

Multimedia Event Detection (MED) Video to Text Description (VTT)

Motivation

Framework

Experiments

Conv 1Max Pool

Conv 2Max Pool

Conv 3Conv 4Conv 5

Extarction Scaling

Scaling

RoI

RoI

Feature Map ofScaled RoI

Feature Map

Feature Map of RoI

Scaled Feature Map of RoI

Image

RoI

Dee

p F

eatu

re E

xtra

ctio

n

Conv 1Max Pool

Conv 2Max Pool

Conv 3Conv 4Conv 5

COS Similarity Only around 40% of Regional Information left

Discriminatory power

of deep features consistently improves

0.4

0.45

0.5

0.55

0.6

0.65

0.7

conv1 conv2 conv3 conv4 conv5

AlexNet

cos-sim

0.4 0.5 0.6 0.7 0.8 0.9

conv1-1conv1-2conv2-1conv2-2conv3-1conv3-2conv3-3conv3-4conv4-1conv4-2conv4-3conv4-4conv5-1conv5-2conv5-3conv5-4

VGG-19-Net

cos-sim

0.4

0.5

0.6

0.7

0.8

0.9

1

conv1Res2ARes2BRes2CRes3ARes3BRes3CRes3DRes4ARes4BRes4CRes4DRes4ERes4FRes5ARes5BRes5C

Deep-ResNet-50

cos-sim

Video Tim

eFrame

Selective search

Candidate Objects

Temporal line

Spatial & temporal features clustering

Primary Scenes

Regional objects

Deep features K-means & VLAD

Video Tim

e

Frame People

Concept Bank

Baby Cake People

Time

Sequential Features

Average pooling

Chi2 SVM

Concept Bank Vocabulary

Object-VLAD Spatial Pyramid Pooling VLAD Details

Deep Feature MapExtraction

Feature Map7 X 7

Spatial Pyramid Pooling

7 X 7 6 X 6 5 X 5 2 X 2

50

de

scri

pto

rs

VLAD

Max pool filter:

50

de

scri

pto

rs

Concept-Bank VS Object-VLAD

CNN-VLAD

CNN-VLAD VS Object-VLAD

0

5

10

15

20

25

30

35

40

45

Team

2

VIR

EO

Team

3

Team

4

Team

5

Team

6

Team

7

Team

8

Team

9

Team

10

Team

11

MED16-EvalSub (MinfAP200%)

25

27

29

31

33

35

37

39

41

Team2 VIREO Team3 Team4

MED16-EvalFull (MinfAP200%)

Evaluation: PS-10Ex (Object-VLAD + Concept-Bank)

Evaluation: AH-10Ex

(Object-VLAD + Concept-Bank)

25

30

35

40

45

50

Team2 VIREO Team3 Team4

MED16-EvalFull (MinfAP200%)

0

5

10

15

20

25

30

35

40

45

50MED16-EvalSub (MinfAP200%)

Evaluation: PS-100Ex (Object-VLAD)

Pri

mary

Ru

ns

Pri

mary

Ru

ns

Pri

mary

Ru

ns

Framework

Concept-based Text Description Matching

Stacked Attention Model

Pool5 Feature Map

Image embedding layer

Some guy is playing Frisbee with a dog

Text CNN Layer

Soft-m

ax

Soft-m

ax

Multi-modal embedding

layer

Multi-modal embedding

layer

tanh

hA P1

vT u

hA

(2)

PI

(2)

vI

u(2)

S(vI, vT )

: feature for each region

1st Attention Layer 2nd Attention Layer

Objective-Function

Training Dataset Deep Features

Experiments

Ad-hoc Video Search (AVS)

OnlineOnline

OfflineOffline

Semantic Query Generation

Event QueryCleaning an appliance Tokens

NLP Parser

Concept matching and selection

Shot Corpus

$

ƒ

$

Concept Bank

Shot Ranking

Concept-based Video Representation

cleaning

refrigeratordishwasherstovemicrowave

spray bottle

kitchen

Semantic Query

Vocabulary

Framework

Performance for different concept settings

in Video-Search 2008 dataset (MinfAP%)

Experiments

Performance for different concept settings in

AVS 2016 evaluation dataset (MinfAP%)

Performance with and without highlighted concept

ranking (HCR)

Highlighted Concept Reranking (HCR)

guitar (#RC)

acoustic guitar (#I-1K)

electric guitar (#I-1K)

daytime outdoor (#SIN)

outdoor (#SIN)

outdoor (#RC)

a person playing guitar outdoors

Rerank 1

500

Without reranking With reranking