
Page 1:

Kobe University, NICT, and University of Siegen at TRECVID 2016 AVS Task

Yasuyuki Matsumoto, Kuniaki Uehara (Kobe University), Takashi Shinozaki (NICT), Kimiaki Shirahama, Marcin Grzegorzek (University of Siegen)

Page 2:

2

Our Contribution

A method that uses small-scale neural networks to greatly accelerate concept classifier training.

Transfer learning is used to acquire temporal characteristics efficiently by combining small networks with an LSTM.

An evaluation of the effectiveness of using balanced examples during training.

Page 3:

3

The Problem

Using pre-trained neural networks to extract features is a very popular approach.

However, training the classifiers takes a long time.

This gets even worse when many classifiers are required.

[Figure: a pre-trained network is used to extract features, which are then fed to classifiers]

Page 4:

4

Micro Neural Networks

A micro neural network (microNN) is a fully-connected neural network with a single hidden layer.

It is a binary classifier that outputs two values to predict the presence or absence of a concept.

Dropout is used to avoid overfitting.

Training time can be reduced from hours to minutes.
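The slides do not give concrete layer sizes or a framework, so the following is only a minimal PyTorch sketch of such a micro network; the input dimension (4096, matching the VGG FC7 features introduced later), the hidden size, and the dropout rate are assumptions.

```python
import torch.nn as nn

class MicroNN(nn.Module):
    """Tiny binary concept classifier: one hidden layer plus dropout, two
    outputs (concept present / absent). All sizes are illustrative assumptions."""
    def __init__(self, in_dim=4096, hidden_dim=256, p_drop=0.5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),  # single hidden layer
            nn.ReLU(),
            nn.Dropout(p_drop),             # dropout against overfitting
            nn.Linear(hidden_dim, 2),       # two outputs: present / absent
        )

    def forward(self, x):
        return self.net(x)
```

Because the network has only one hidden layer on top of fixed features, training one such classifier per concept is what brings the per-concept training time down from hours to minutes.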

Page 5:

Our Approach - Overview

5

Overview of our method for the TRECVID 2016 AVS task.

[Figure: Query → (manual selection) → Concept → (feature extraction + microNN training + LSTM) → Model → (shot retrieval) → Precision]

Page 6:

6

Our Approach - Overview

How we extracted concepts from the queries.

[Figure: overview pipeline (Query → Concept → Model → shot retrieval → Precision) with the manual selection step highlighted]

Page 7:

Our Approach - Manual Selection

7

Begin with manually selecting relevant concepts for each query.

Query (502): "Find shots of a man indoors looking at camera where a bookcase is behind him"

• "man": pick only nouns and verbs
• "look": reduce to base form
• "bookcase", "bookshelf", "furniture": synonyms (from ImageNet)

A simple rule is used to make it easier to automate the concept selection in the future.

Page 8:

8

Our Approach - Manual Selection

Begin with manually selecting relevant concepts for each query.

Query (502): "Find shots of a man indoors looking at camera where a bookcase is behind him"

• "man": pick only nouns and verbs
• "look": reduce to base form
• "bookcase", "bookshelf", "furniture": synonyms (from ImageNet)

Selected concepts: Indoor, Speaking_to_camera, Bookshelf, Furniture

A simple rule is used to make it easier to automate the concept selection in the future.
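The selection itself was done manually, but the rule (keep nouns and verbs, reduce them to base form, expand with synonyms) is mechanical enough to sketch. The following NLTK/WordNet snippet only illustrates that rule; it is not the tool used by the authors, and details such as the POS tag filter and the two-synset limit are assumptions.

```python
import nltk
from nltk.corpus import wordnet as wn
from nltk.stem import WordNetLemmatizer

# Requires: nltk.download('punkt'), nltk.download('averaged_perceptron_tagger'),
#           nltk.download('wordnet')

def candidate_concepts(query):
    """Keep nouns/verbs, reduce them to base form, add WordNet synonyms."""
    lemmatizer = WordNetLemmatizer()
    tagged = nltk.pos_tag(nltk.word_tokenize(query.lower()))
    concepts = set()
    for word, tag in tagged:
        if tag.startswith('NN'):
            lemma = lemmatizer.lemmatize(word, pos='n')
        elif tag.startswith('VB'):
            lemma = lemmatizer.lemmatize(word, pos='v')
        else:
            continue                       # drop everything but nouns and verbs
        concepts.add(lemma)
        for syn in wn.synsets(lemma)[:2]:  # ImageNet synsets are WordNet-based
            concepts.update(l.name() for l in syn.lemmas())
    return concepts

print(candidate_concepts("Find shots of a man indoors looking at camera "
                         "where a bookcase is behind him"))
```

The candidate set produced this way would still be matched manually against the concepts actually available in the trained concept pool (e.g., Indoor, Speaking_to_camera, Bookshelf, Furniture for query 502).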

Page 9:

9

Our Approach - Overview

Overview of our method for the TRECVID 2016 AVS task.

[Figure: overview pipeline (Query → Concept → Model → shot retrieval → Precision)]

Page 10:

10

Our Approach - Overview

Combine the concepts from each query.

[Figure: overview pipeline with the Concept → Model step highlighted]

Page 11:

Our Approach - Feature Extraction

11

Use the pre-trained VGGNet:
• ILSVRC 2014
• CNN with a very deep architecture
• the 16-layer version is used
• FC7: use the output of the second fully connected layer

A pre-trained network is usually transferred into classifiers suitable for the target problem.

[Figure: VGG-16 architecture: Conv1 → Conv2 → Conv3 → Conv4 → Conv5 → FC6 → FC7 → FC8 → Softmax]

K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition”
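The slides do not say how the FC7 features were actually extracted; as a rough sketch of "use the output of the second fully connected layer of pre-trained VGG-16 as the feature vector", torchvision's model can be truncated as below. The layer indices, the choice to keep the ReLU after FC7, and the preprocessing constants are assumptions about a re-implementation, not details taken from the slides.

```python
import torch
import torch.nn as nn
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

vgg = models.vgg16(pretrained=True).eval()
# torchvision's VGG-16 classifier: [fc6, ReLU, Dropout, fc7, ReLU, Dropout, fc8]
# -> keep everything up to the ReLU after fc7 to get a 4096-d FC7 feature
fc7_extractor = nn.Sequential(
    vgg.features,
    vgg.avgpool,
    nn.Flatten(),
    *list(vgg.classifier.children())[:5],
)

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def extract_fc7(image_path):
    """Return the 4096-d FC7 feature of one image (or one video keyframe)."""
    img = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        return fc7_extractor(img).squeeze(0)
```

These FC7 vectors are what the microNNs in the following steps consume, both for ImageNet images and for keyframes sampled from video shots.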

Page 12:

12

Our Approach - MicroNN Training

Perform gradual transfer learning for each concept in the following steps.

① Start by training the microNN using images.

[Figure: Image → VGG Net (feature extraction)]

Page 13:

13

Previous Approach - SVM Training

Until now...

Previous studies have trained classifiers such as SVMs on the extracted features.

This requires a lot of time.

[Figure: Image → VGG Net → SVM]
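For contrast with the microNN approach, the conventional pipeline referred to here would look roughly like this scikit-learn sketch: features are extracted once with the pre-trained CNN, then one SVM is trained per concept. The kernel and regularization settings are placeholders, not settings from any particular prior study.

```python
from sklearn.svm import SVC

def train_concept_svm(features, labels):
    """Conventional approach: one SVM per concept on fixed CNN features.
    features: (n_samples, 4096) array of FC7 vectors, labels: 0/1 per sample.
    Fitting a kernel SVM on tens of thousands of 4096-d vectors, repeated for
    every concept, is the slow step the microNN is meant to replace."""
    clf = SVC(kernel="rbf", C=1.0)
    clf.fit(features, labels)
    return clf
```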

Page 14:

14

Our Approach - MicroNN Training

Perform gradual transfer learning for each concept in the following steps.

① Start by training the microNN using images.

[Figure: Image → VGG Net → microNN]

Page 15:

15

Our Approach - MicroNN Training

Perform gradual transfer learning for each concept in the following steps.

① Start by training the microNN using images.

[Figure: step ① diagram (repeated)]

Page 16:

16

Our Approach - MicroNN Training

Perform gradual transfer learning for each concept in the following steps.

② Refine the microNN using shots in the video dataset.

[Figure: step ② diagram]

Page 17:

17

Our Approach - MicroNN Training

Perform gradual transfer learning for each concept in the following steps.

② Refine the microNN using shots in the video dataset. The microNN takes the weight parameters (W, b) learned in the first step as its initial values.

[Figure: Video → VGG Net → microNN, initialized with W, b from step ①]
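A minimal sketch of step ②, reusing the MicroNN class from the earlier sketch: the image-trained network is copied and fine-tuned on video-shot features, so the step-① weights (W, b) are the initial values. The optimizer, learning rate, loss, and data-loader format are assumptions.

```python
import copy
import torch
import torch.nn as nn

def refine_on_video(image_trained_micro_nn, video_loader, epochs=5, lr=1e-4):
    """Step 2: fine-tune a copy of the image-trained microNN on video-shot
    features, keeping the learned parameters (W, b) as initial values."""
    model = copy.deepcopy(image_trained_micro_nn)   # start from step-1 weights
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for feats, labels in video_loader:          # (FC7 features, 0/1 labels)
            optimizer.zero_grad()
            loss = loss_fn(model(feats), labels)
            loss.backward()
            optimizer.step()
    return model
```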

Page 18:

18

Our Approach - MicroNN Training

Perform gradual transfer learning for each concept in the following steps.

③ Further, the hidden layer of the microNN is replaced with an LSTM to acquire temporal characteristics. The microNN is refined starting from the weight parameters (W, b) learned in the second step as initial values.

[Figure: Video → VGG Net → microNN with an LSTM hidden layer, initialized with W, b from step ②]
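One plausible reading of "replace the hidden layer with an LSTM while starting from the step-② weights" is sketched below, again building on the earlier MicroNN sketch. Treating a shot as a sequence of per-frame FC7 features, reusing only the output layer's weights, and leaving the LSTM freshly initialized (its shape differs from the dense hidden layer) are assumptions; the slides do not specify the exact wiring.

```python
import torch.nn as nn

class MicroLSTM(nn.Module):
    """Step 3: the fully connected hidden layer is swapped for an LSTM so the
    classifier can use temporal characteristics across frames in a shot."""
    def __init__(self, in_dim=4096, hidden_dim=256):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, 2)

    def forward(self, frame_feats):      # frame_feats: (batch, n_frames, in_dim)
        _, (h_n, _) = self.lstm(frame_feats)
        return self.out(h_n[-1])         # classify from the last hidden state

def init_from_step2(micro_lstm, micro_nn):
    """Reuse the step-2 output layer (W, b) as initial values; the LSTM starts
    fresh because its parameters have a different shape than the dense layer."""
    micro_lstm.out.load_state_dict(micro_nn.net[-1].state_dict())
    return micro_lstm
```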

Page 19:

19

Our Approach - Overview

Overview of our method for the TRECVID 2016 AVS task.

[Figure: overview pipeline (Query → Concept → Model → shot retrieval → Precision)]

Page 20:

20

Our Approach - Overview

How we go from a shot's concept relevance to its search score.

[Figure: overview pipeline with the shot retrieval step highlighted]

Page 21:

Our Approach - Shot Retrieval

21

For each shot, calculate the average of the output values of the microNNs for the concepts selected for the query.

Concept:       Indoor   Speaking_to_camera   Bookshelf   Furniture
Output value:  0.7      0.1                  0.4         0.6

MicroNN outputs are normalized to [-1, 1] to balance different concepts.

Page 22:

Our Approach - Shot Retrieval

22

Concept:       Indoor   Speaking_to_camera   Bookshelf   Furniture
Output value:  0.7      0.1                  0.4         0.6

Average of output values (search score): (0.7 + 0.1 + 0.4 + 0.6) / 4 = 0.45

Calculate the average of the output values and use it as the overall search score.

How does this score compare with those of other shots?
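A small sketch of the scoring and ranking step, using the numbers from the slide above; the shot identifiers and the second shot are invented for illustration, and the per-concept values are assumed to be microNN outputs already normalized to [-1, 1].

```python
def search_score(shot_outputs, concepts):
    """Average the normalized microNN outputs of the selected concepts."""
    return sum(shot_outputs[c] for c in concepts) / len(concepts)

shots = {
    "shot_0001": {"Indoor": 0.7, "Speaking_to_camera": 0.1,
                  "Bookshelf": 0.4, "Furniture": 0.6},
    "shot_0002": {"Indoor": 0.2, "Speaking_to_camera": -0.3,
                  "Bookshelf": 0.1, "Furniture": 0.0},
}
concepts = ["Indoor", "Speaking_to_camera", "Bookshelf", "Furniture"]

print(search_score(shots["shot_0001"], concepts))   # 0.45, as on the slide
ranked = sorted(shots, key=lambda s: search_score(shots[s], concepts),
                reverse=True)
print(ranked)   # shots ordered by search score for the final result list
```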

Page 23:

23

Purpose of Experiment

1. Evaluate the learning speed.
2. Evaluate the effectiveness of using LSTM to acquire temporal characteristics.
3. Evaluate whether using the same number of positive and negative examples ("Balanced") for training improves classification.

Page 24:

Experiment - Three Runs

24

We submitted the following three runs to the TRECVID 2016 AVS task:

kobe_nict_siegen_D_M_1 (Imbalanced): fine-tuning is carried out using imbalanced numbers of positive and negative examples (30,000 in total).

kobe_nict_siegen_D_M_2 (Balanced): fine-tuning is carried out using balanced numbers of positive and negative examples (30,000 in total).

kobe_nict_siegen_D_M_3 (Imbalanced, LSTM): unlike max-pooling, LSTM captures temporal characteristics. LSTM-based microNNs are trained for only the 14 concepts for which temporal relations among video frames are important.

[Figure: positive/negative dataset ratios for the imbalanced and balanced runs]
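To make the Balanced/Imbalanced distinction concrete, here is a sketch of how a balanced 30,000-example training set could be drawn for one concept. The exact sampling strategy (undersampling the majority class, oversampling the minority class with replacement) is an assumption; the slides only state that the positive and negative counts were equalized.

```python
import random

def balanced_sample(positives, negatives, total=30000, seed=0):
    """Draw an equal number of positive and negative examples (total in all).
    A class smaller than total/2 is sampled with replacement."""
    rng = random.Random(seed)
    half = total // 2

    def draw(pool):
        if len(pool) >= half:
            return rng.sample(pool, half)                 # undersample
        return [rng.choice(pool) for _ in range(half)]    # oversample

    data = draw(positives) + draw(negatives)
    rng.shuffle(data)
    return data
```

The imbalanced runs would instead keep the natural positive/negative ratio within the 30,000 examples.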

Page 25:

Experiment - Dataset

25

Datasets used in this study:

• TRECVID IACC (video data): 61 concepts
• ImageNet (image data): 39 concepts
• UCF 101 (video data): 5 concepts

Page 26:

Experiment - Dataset

26

Datasets used in this study:

• TRECVID IACC (video data): 61 concepts
• ImageNet (image data): 39 concepts
• UCF 101 (video data): 5 concepts

[Figure: training time per concept on 30,000 shots, shown in sec/concept versus min/concept]

Page 27:

27

Experiment - Dataset

List of some concepts selected for each query:

query_id | ImageNet  | TRECVID                               | UCF 101
501      |           | Outdoor                               | playingGuitar
502      | bookshelf | Indoor, Speaking_to_camera, Furniture |
503      | drum      | Indoor                                | drumming

Page 28:

Experiment - Result

28

Performance comparison between Imbalanced, Balanced, and LSTM on each of the 30 queries.

[Figure: per-query AP (queries 501-530) for the Imbalanced, Balanced, and LSTM runs]

Page 29:

Experiment - Result

29

Performance comparison between Imbalanced, Balanced, and LSTM on each of the 30 queries.

[Figure: per-query AP (queries 501-530) for the Imbalanced, Balanced, and LSTM runs]

Using imbalanced training examples leads to higher average precisions than using balanced ones.

Page 30:

Experiment - Result

30

Performance comparison between Imbalanced, Balanced, and LSTM on each of the 30 queries.

[Figure: per-query AP (queries 501-530) for the Imbalanced, Balanced, and LSTM runs]

The average precisions obtained with LSTM are more than three times higher than those obtained without LSTM.

Page 31:

Experiment - Result

31

Performance comparison between our method and the other methods developed for the manually-assisted category of the AVS task.

[Figure: MAP (0 to 0.2) of our three runs (LSTM, Imbalanced, Balanced) among the manually-assisted runs; ours vs. others]

Page 32:

Experiment - Result

32

Performance comparison between our method and the other methods developed for the AVS task.

[Figure: MAP (0 to 0.2) of all submitted runs, ours vs. others; our runs (kobe_nict_siegen.161, .162, .163) appear alongside runs from Waseda, NII_Hitachi_UIT, ITI_CERTH, IMOTION, vitrivr, VIREO, INF, MediaMill, EURECOM, FIU_UM, UEC, and ITEC_UNIKLU]

Page 33:

33

Conclusion

Video search through efficient transfer learning using microNN:
• fast
• flexible

Imbalanced examples are more useful than balanced examples.

The validity of the temporal characteristics acquired by LSTM was confirmed.

Page 34:

34

Future work

Further experiments using LSTM with a reduced frame interval: currently one video frame is sampled every 30 frames in a shot; we plan to use more densely sampled video frames.

Page 35:

35

Future work

Acquiring temporal characteristics using optical flow.

Before detecting objects in a scene, we can first classify its environment (the scene type) to improve performance.

[Figure: example frames for optical flow and scene classification]