


Kobe University, NICT and University of Siegen at TRECVID 2016 AVS Task

Yasuyuki Matsumoto, Kuniaki Uehara (Kobe University), Takashi Shinozaki (CiNet, NICT)

Kimiaki Shirahama, Marcin Grzegorzek (University of Siegen)

Overview

Pipeline: Query -> Concepts -> Model -> Precision
+ Manual concept selection
+ Feature extraction
+ MicroNN training: 3 sec / concept (30,000 shots)
+ LSTM: 2 min / concept (30,000 shots)
+ Shot retrieval

Correspondence: Takashi Shinozaki ([email protected]) TRECVID 2016, Nov 14-16, NIST, Washington DC, US

[Figure: ranked shot lists per concept (ranks 1, 2, 3, 50, 100, 500, 1000), combined either by multiplying the scores or by normalizing and averaging them]

Concepts manually selected per query:

query_id  ImageNet   TRECVID                                UCF 101
501       -          Outdoor                                playingGuitar
502       bookshelf  Indoor, Speaking_to_camera, Furniture  -
503       drum       Indoor                                 drumming

[Chart: MAP of our runs vs. the other TRECVID 2016 submissions]

kobe_nict_siegen_D_M_2 (Balanced): fine-tuning is carried out using balanced numbers of positive and negative examples. [Diagram: positive vs. negative dataset ratio]

kobe_nict_siegen_D_M_1 (Unbalanced): fine-tuning is carried out using imbalanced numbers of positive and negative examples. [Diagram: positive vs. negative dataset ratio]
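The two conditions differ only in how the fine-tuning examples are sampled. A minimal sketch in Python (the poster shows no code; undersampling the majority class is our assumption for how balancing was done):

```python
import random

def make_training_set(positives, negatives, balanced=True):
    """Build (example, label) pairs for fine-tuning one concept classifier.
    Under the balanced condition the majority class is undersampled so both
    classes have equal size; otherwise the natural ratio is kept."""
    if balanced:
        n = min(len(positives), len(negatives))
        positives = random.sample(positives, n)
        negatives = random.sample(negatives, n)
    return [(x, 1) for x in positives] + [(x, 0) for x in negatives]

# Unbalanced run: keep all examples; balanced run: equalize the classes.
train_unbalanced = make_training_set(list(range(100)), list(range(5000)), balanced=False)
train_balanced = make_training_set(list(range(100)), list(range(5000)), balanced=True)
print(len(train_unbalanced), len(train_balanced))  # 5100 200
```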

kobe_nict_siegen_D_M_3 (Unbalanced, LSTM): each LSTM takes as input a sequence of feature vectors extracted by a pre-trained CNN; unlike max-pooling over frames, this captures temporal characteristics. LSTMs were trained for only 14 concepts.
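A minimal sketch of such a model in PyTorch (our choice of framework; the hidden size and the 16-frame shot length are illustrative assumptions, only the 4096-d CNN features are from the poster):

```python
import torch
import torch.nn as nn

class ConceptLSTM(nn.Module):
    """LSTM over the per-frame CNN feature vectors of a shot. The final
    hidden state feeds a binary classifier, so frame order matters,
    unlike max-pooling over frames."""
    def __init__(self, feat_dim=4096, hidden=32, num_classes=2):
        # hidden size is not given on the poster; 32 is assumed here
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, x):              # x: (batch, frames, feat_dim)
        _, (h_n, _) = self.lstm(x)     # h_n: (num_layers, batch, hidden)
        return self.head(h_n[-1])      # per-shot logits: (batch, 2)

model = ConceptLSTM()
shots = torch.randn(4, 16, 4096)       # 4 shots, 16 frames each (e.g. 1 fps)
print(model(shots).shape)              # torch.Size([4, 2])
```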

Example (query 502): pick only the nouns and verbs from the query ("man", "bookcase"), reduce verbs to their base form ("looking" -> "look"), and expand with synonyms and related terms ("bookcase" -> "furniture"). The resulting concepts are Indoor, Speaking_to_camera, Bookshelf, and Furniture.
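The noun/verb filtering, base-form reduction, and synonym expansion could look like the following sketch using NLTK and WordNet (our tooling choice, not stated on the poster; the final mapping to dataset concepts remains a manual step):

```python
# Requires: nltk.download('punkt'), nltk.download('averaged_perceptron_tagger'),
#           nltk.download('wordnet')
import nltk
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

query = "Find shots of a man indoors looking at camera where a bookcase is behind him"

lemmatizer = WordNetLemmatizer()
terms = set()
for word, tag in nltk.pos_tag(nltk.word_tokenize(query.lower())):
    if tag.startswith("NN"):                          # keep nouns
        terms.add(lemmatizer.lemmatize(word, "n"))
    elif tag.startswith("VB"):                        # keep verbs in base form
        terms.add(lemmatizer.lemmatize(word, "v"))    # "looking" -> "look"

# Expand with synonyms and hypernyms, e.g. "bookcase" -> "furniture".
related = set()
for term in terms:
    for syn in wordnet.synsets(term):
        related.update(l.name() for l in syn.lemmas())
        for hyper in syn.hypernyms():
            related.update(l.name() for l in hyper.lemmas())

print(sorted(terms))
```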

[Model diagram: video frames -> VGG-16 Net -> FC7 (15th layer) features -> microNN or LSTM]
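A minimal sketch of FC7 feature extraction with torchvision (an assumed implementation; the poster only specifies VGG-16 and the FC7 layer):

```python
import torch
import torchvision.models as models
import torchvision.transforms as T

# Load VGG-16 and truncate its classifier after fc7 + ReLU, so the network
# outputs 4096-d FC7 activations instead of 1000 class scores.
vgg = models.vgg16(pretrained=True).eval()
vgg.classifier = torch.nn.Sequential(*list(vgg.classifier.children())[:5])

# Real frames would be passed through this preprocessing first.
preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

with torch.no_grad():
    frame = torch.randn(1, 3, 224, 224)  # stand-in for a preprocessed frame
    fc7 = vgg(frame)                     # shape: (1, 4096)
```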

Manual Concept Selection

Concept Transfer

Shot Retrieval

[Chart: per-query AP and overall MAP for the LSTM, Imbalanced, and Balanced runs]

Contribution

Query (502): "Find shots of a man indoors looking at camera where a bookcase is behind him"

Results

Our Three Runs

Curriculum Transfer

[Diagram: ImageNet (39 concepts, images) --transfer W & b--> IACC (61 concepts, video) or UCF 101 (5 concepts, video)]

Micro Neural Networks
• Binary classifier: it just discriminates whether a concept is present or not
• Very simple network structure: 4096 input units, 32 hidden units, 2 output units, fully connected
• Recent DNN techniques: ReLU and Dropout on the hidden layer outputs
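A minimal PyTorch sketch of this architecture (PyTorch is our choice here; the layer sizes, ReLU, and Dropout follow the bullets above):

```python
import torch
import torch.nn as nn

class MicroNN(nn.Module):
    """Tiny binary concept classifier: FC7 features in, present/absent out."""
    def __init__(self, in_dim=4096, hidden=32, out_dim=2, p_drop=0.5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden),   # 4096 -> 32, fully connected
            nn.ReLU(),                   # ReLU on hidden outputs
            nn.Dropout(p_drop),          # Dropout on hidden outputs
            nn.Linear(hidden, out_dim),  # 32 -> 2: concept present / absent
        )

    def forward(self, x):
        return self.net(x)

model = MicroNN()
batch = torch.randn(8, 4096)             # a batch of FC7 feature vectors
print(model(batch).shape)                # torch.Size([8, 2])
```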

A method of using small-scale neural networks to greatly accelerate concept classifier training (hours -> minutes). By combining these small networks with LSTMs, transfer learning can also acquire temporal characteristics efficiently. We further evaluate the effectiveness of using unbalanced examples during training.

• Begin with manual selection of relevant concepts for each query
• The simple rules make it easier to automate concept selection in the future

Future work

[Diagram: a VGG-16 Net extracts features on each dataset; the microNN on top is transferred between them]

• Selected concepts are transferred from one dataset to another
• A kind of curriculum learning?
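A sketch of the transfer step, reusing the MicroNN class from the earlier sketch (file names and the elided training loops are placeholders):

```python
import torch

# Train a MicroNN on the source dataset, then reuse its weights W and
# biases b to initialise training on the next dataset in the curriculum.
source = MicroNN()
# ... train `source` on ImageNet examples for the concept ...
torch.save(source.state_dict(), "micro_imagenet.pt")   # hypothetical file name

target = MicroNN()                                     # same architecture
target.load_state_dict(torch.load("micro_imagenet.pt"))
# ... fine-tune `target` on IACC video examples, then repeat for UCF 101 ...
```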

• First, the output values of the microNNs for the target concepts are normalized
• Then, the search score is calculated as the average of those outputs

Example: for the concepts Indoor, Speaking_to_camera, Bookshelf, and Furniture, the microNN outputs are 0.7, 0.2, 0.5, and 0.6, so the search score is (0.7 + 0.2 + 0.5 + 0.6) / 4 = 0.5.
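A sketch of this scoring in NumPy (per-concept min-max normalization is our assumption; the poster only says the outputs are normalized and then averaged):

```python
import numpy as np

def search_scores(outputs):
    """outputs: (num_shots, num_concepts) array of microNN scores for the
    query's selected concepts. Normalize each concept column, then average
    across concepts to get one search score per shot."""
    mins, maxs = outputs.min(axis=0), outputs.max(axis=0)
    normed = (outputs - mins) / np.maximum(maxs - mins, 1e-8)  # min-max (assumed)
    return normed.mean(axis=1)   # averaging, not multiplying: more robust

# The poster's worked example for one shot:
# Indoor 0.7, Speaking_to_camera 0.2, Bookshelf 0.5, Furniture 0.6
print((0.7 + 0.2 + 0.5 + 0.6) / 4)   # 0.5
```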

• Summation is more robust than multiplication of the outputs
• Higher temporal resolution for the LSTM: 1 fps ⇒ 5 fps or more
• Utilize the outputs of several types of CNNs: scene CNNs, optical flow, etc.

• MicroNNs work efficiently
  Easy to use, with reasonable results
  Fast enough for a large number of concepts
• The unbalanced condition is better than the balanced one
  There is no need to insist on data balance
• The LSTM also works correctly
  Much better results on some queries
• Still plenty of room for improvement…

Pros
• Much faster
• Learns iteratively
• Easy to extend

Cons
• Slightly less precise

#509: a crowd demonstrating in a city street at night