kobe university, nict, and university of siegen at trecvid 2016 … · 2016-11-29 ·...
TRANSCRIPT
Kobe University, NICT, and University of Siegen at TRECVID 2016 AVS Task
Yasuyuki Matsumoto, Kuniaki Uehara (Kobe University) Takashi Shinozaki (NICT) Kimiaki Shirahama, Marcin Grzegozek (University of Siegen)
2
Our Contribution
A method of using small-scale neural network to greatly accelerate concept classifier training.
Transfer learning can be used to acquire temporal characteristics effiently by combining both small networks and LSTM.
Evaluate the effectiveness of using balanced examples at the time of training.
3
The Problem
Using pre-trained neural networks to extract features is a very popular approach.
However, training of classifiers takes long time.
This training gets even worse if classifiers required are many.
pre-trained network
~ ?extract feature
4
Micro Neural Networks
Binary classifier that outputs two values to predict the presence or absence of the concept.
A micro Neural Network is a fully-connected neural network with a single hidden layer.
Dropout is used to avoid overfitting.
Calculation time could be reduced (hours->minutes).
Our Approach - Overview
5
+ Manual selection+ Feature extraction + MicroNN training + LSTM
+ Shot retrieval
Query Concept Model Precision
Overview of our method for TRECVID 2016 AVS task
6
Query Concept Model
How we extracted concepts from the queries
+ Shot retrieval
Precision
+ Feature extraction + MicroNN training + LSTM
+ Manual selection
Our Approach - Overview
Our Approach - Manual Selection
7
Query (502) ’’Find shots of a man indoors looking at camera where a bookcase is behind him’’
“look’’ Base form
“man’’ Pick only noun and verb
Simple rule is used to make it easier to automate the concept selection in the future.
“bookcase’’, “bookshelf”, “furniture’’ Synonyms (from ImageNet)
Begin with manually selecting relevant concepts for each query
8
Query (502) ’’Find shots of a man indoors looking at camera where a bookcase is behind him’’
“look’’ Base form
“man’’ Pick only noun and verb
“bookcase’’, “bookshelf”, “furniture’’ Synonyms (from ImageNet)
Begin with manually selecting relevant concepts for each query
Indoor Speaking_to_camera Bookshelf Funiture
Concept
Simple rule is used to make it easier to automate the concept selection in the future.
Our Approach - Manual Selection
9
Query Concept Model
Overview of our method for TRECVID 2016 AVS task
+ Shot retrieval
Precision
Our Approach - Overview
+ Feature extraction + MicroNN training + LSTM
+ Manual selection
10
Query Concept Model
Combine the concepts from each query.
+ Shot retrieval
Precision
+ Feature extraction + MicroNN training + LSTM
+ Manual selection
Our Approach - Overview
Our Approach - Feature Extraction
11
Use pre-trained VGGNet • ILSVRC 2014 • CNN with very deep architecture • The 16 layer version is used • FC7 : Use output at the second
fully connected layer
Pre-trained network is usually transferred into classifiers suitable for the target problem
Conv1
Conv2
Conv3
Conv4
Conv5
FC6
FC7
FC8
Softmax
K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition”
~
Image VGG Net
Our Approach - MicroNN Training
① Start with training microNN using images
Perform gradual transfer learning for each concept in the following step
12
~
Image VGG Net
SVM
Until now . . .
13
Previous Approach - SVM Training
Previous studies have trained classifiers such as SVM
by extracted features.
This requires a lot of time.
~
Image VGG Net
microNN
① Start with training microNN using images
Perform gradual transfer learning for each concept in the following step
14
Our Approach - MicroNN Training
~
① Start with training microNN using images
Perform gradual transfer learning for each concept in the following step
15
Our Approach - MicroNN Training
Perform gradual transfer learning for each concept in the following step
16
Our Approach - MicroNN Training
~
② Refine the microNN using shots in video dataset.
Perform gradual transfer learning for each concept in the following step
17
② Refine the microNN using shots in video dataset. The microNN has weight parameters learned at first step as its initial value.
W, b
Our Approach - MicroNN Training
~
~
Video
Perform gradual transfer learning for each concept in the following step
18
W, b
Video
~
LSTM
Our Approach - MicroNN Training
~
~V
③ Futher, hidden layer of microNN is replaced with LSTM for acquiring temporal characteristics. Refine the microNN starting with weight parameters learned at the second step as initial values.
19
Query Concept Model
Overview of our method for TRECVID 2016 AVS task
+ Shot retrieval
Precision
Our Approach - Overview
+ Feature extraction + MicroNN training + LSTM
+ Manual selection
20
Query Concept Model
How we go from a shot’s concept relevance to its search score
+ Shot retrieval
Precision
Our Approach - Overview
+ Feature extraction + MicroNN training + LSTM
+ Manual selection
Our Approach - Shot Retrieval
21
For each shot, calculate the avarage of output values of microNNs for the selected concepts in a query
Indoor Speaking_to_camera Bookshelf Funiture
Concept
Output values
0.7 0.1 0.4 0.6
MicroNN outputs are normalized to [-1, 1], to balance between different concepts.
Our Approach - Shot Retrieval
22
Indoor Speaking_to_camera Bookshelf Funiture
Concept
Output values
Average of output values (Search Score)
0.7 0.1 0.4 0.6 / 4
0.45
Calculate the average of output values and use it as overall search score.
How do we compare that with other shots
23
Purpose of Experiment
1. Evaluate the learning speed. 2. Evaluate the effectiveness of using LSTM to acquire
temporal characteristics. 3. Evaluate wheather using same number of positive and negative
examples (“Balanced”) for training improves classification.
Experiment - Three Runs
24
kobe_nict_siegen_D_M_1
Imbalanced
Fine-tuning is carried out using imbalanced numbers of positive and negative examples. (30,000 total)
kobe_nict_siegen_D_M_2
Balanced
Fine-tuning is carried out using balanced numbers of positive and negative examples. (30,000 total)
kobe_nict_siegen_D_M_3
(Imbalanced) LSTM
Unlike max-pooling, LSTM obtains temporal characteristics. LSTM-based microNNs are trained only for 14 concepts for which temporal relations among video frames are important positive
negative negative
positive
Dataset Ratio
Dataset Ratio
Submitted the following for TRECVID 2016 AVS task
Only 14 concepts
Experiment - Dataset
25
TRECVID IACC Video data
61 concepts
ImageNet Image data 39 concepts
UCF 101 Video data 5 concepts
Used in this study
Experiment - Dataset
26
TRECVID IACC Video data
61 concepts
ImageNet Image data 39 concepts
UCF 101 Video data 5 concepts
Training time
sec / concept (30000 shots)
min / concept (30000 shots)23
27
List of some concepts selected for each query
query_id ImageNet TRECVID UCF 101
501 Outdoor playingGuitar
502
bookshelf
Indoor Speaking_to_camera Furniture
503 drumIndoor
drumming
Experiment - DatasetUsed in this study
Experiment - Result
28
LSTM
AP
Performance comparison between Imbalanced, Balanced and LSTM on each of the 30 queries
0
0.05
0.1
0.15
0.2
0.25
0.3
0.35
501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530
Imbalanced Balanced
Experiment - Result
29
LSTM
AP
Performance comparison between Imbalanced, Balanced and LSTM on each of the 30 queries
0
0.05
0.1
0.15
0.2
0.25
0.3
0.35
501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530
Imbalanced Balanced
Using imbalanced training examples leads to higher average precisions than using balanced ones.
Experiment - Result
30
LSTM
AP
Performance comparison between Imbalanced, Balanced and LSTM on each of the 30 queries
0
0.05
0.1
0.15
0.2
0.25
0.3
0.35
501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530
Imbalanced Balanced
Using LSTM is more than three times higher than the ones not-using LSTM.
Experiment - Result
31
Ours Others
MAP
Performance comparison between our method and the other methods developed for the manually-assisted category in AVS task
0
0.02
0.04
0.06
0.08
0.1
0.12
0.14
0.16
0.18
0.2
LSTM Imbalanced Balanced
Experiment - Result
32
0
0.02
0.04
0.06
0.08
0.1
0.12
0.14
0.16
0.18
0.2
Waseda.1
62
Waseda.1
61
Waseda.1
64
Waseda.1
63
NII_Hitachi_UIT.164
ITI_CERTH.164
ITI_CERTH.163
ITI_CERTH.161
kobe_nict_siegen.163
IMOTION.161
kobe_nict_siegen.161
IMOTION.162
NII_Hitachi_UIT.163
vitrivr.1
61
VIREO.165
vitrivr.1
62
VIREO.161
ITI_CERTH.164
ITI_CERTH.161
NII_Hitachi_UIT.162
NII_Hitachi_UIT.161
ITI_CERTH.162
kobe_nict_siegen.162
INF.16
1VIREO.166
VIREO.162
ITI_CERTH.163
ITI_CERTH.162
MediaM
ill.16
4MediaM
ill.16
2MediaM
ill.16
1INF.16
2MediaM
ill.16
3EU
RECO
M.162
INF.16
3FIU_U
M.162
FIU_U
M.161
VIREO.163
IMOTION.163
IMOTION.164
EURECO
M.161
VIREO.164
EURECO
M.164
INF.16
4UEC.162
UEC.161
vitrivr.1
64
vitrivr.1
63
ITEC_U
NIKLU
.161
EURECO
M.163
ITEC_U
NIKLU
.162
ITEC_U
NIKLU
.163
Ours Others
MAP
Performance comparison between our method and the other methods developed for the AVS task
33
Conclusion
Video search through efficient transfer learning using microNN • fast • flexibile
Imbalanced examples are more useful than balanced examples
Validity of acquired temporal characteristics by LSTM
34
Future work
Further experiments by using
LSTM on reduced frame
interval.one video frame every 30 frames in a shot
more densly sampled video frames
35
Future work
Acquiring temporal characteristics using optical flow.
Before detecting objects
in a scene, we can first
classify its environment to improve the performance.
~
~
~
oprical flow
scene