TRANSCRIPT
Shuicheng YAN, NUS
PASCAL VOC Classification: Local Features vs. Deep Features
PASCAL VOC
PASCAL VOC: Visual Object Classes challenges. Held yearly from 2007 to 2012; tens of teams from universities and industry participated, including INRIA, Berkeley, Oxford, NEC, etc. It has become "the dataset" for visual object recognition research.
Main tasks: object classification, detection and segmentation. Other tasks: person layout, action recognition, etc.
Data: 20 object classes, ~23,000 images with fine labeling.
[Figure: visual object recognition tasks — object classification (e.g. person, horse, barrier, table), object detection, and object segmentation]
Why valuable? Multi-label, real scenarios!
PASCAL VOC: 2010-2014
NUS-PSL team results:
• 2014: classification mAP raised to 0.91
• 2012, 2011, 2010: winner of the object classification task (cls)
• 2012: winner of the object segmentation task (seg)
• 2010: honorable mention in the object detection task (det)
NUS-PSL architecture: joint learning of cls-det-seg.
[Diagram: cls provides global information, det provides local information, seg provides fine-detailed information, all feeding visual object recognition]
PASCAL VOC: 2010-2014
[Chart: classification mAP progression — 2010: 73.8%, 2011: 78.7%, 2012: 82.2%, 2013: 79.0% (deep feature), 2014: 83.2% (deep feature) and 91.4% (HCP), with LLC, Context-SVM, GHM, sub-category mining, deep features and HCP marking the successive methods; a 25% overall gain is annotated on the chart]
I. Spring of Local Features: 2010-2012
Pipeline
[Pipeline diagram: feature representation (low-level features → feature encoding → feature pooling) followed by model learning (classifier learning → context modeling)]
GHM [2]: Generalized Hierarchical Matching (GHM) for object-central problems; object-central pooling.
Subcategory mining [1]: automatically mining visual subcategories based on ambiguity modeling.
Contextualization [3]: mutual contextualization of the object classification and detection tasks; great performance improvement.
[1] Jian Dong, Qiang Chen, Jiashi Feng, Wei Xia, Zhongyang Huang, Shuicheng Yan. Subcategory-aware Object Classification. CVPR 2013.
[2] Qiang Chen, Zheng Song, Yang Hua, Zhongyang Huang, Shuicheng Yan. Hierarchical Matching with Side Information for Image Classification. CVPR 2012.
[3] Zheng Song*, Qiang Chen*, Zhongyang Huang, Yang Hua, Shuicheng Yan. Contextualizing Object Detection and Classification. CVPR 2011.
Framework – NUS-PSL 2010
[Pipeline diagram: visual features → local feature extraction → feature coding → feature pooling (SPM) → classification (SVM with nonlinear kernel; regression with linear kernel; max pooling over detection results) → post-processing (kernel regression; confidence refinement with exclusive prior). Example class: chair]
Framework – NUS-PSL 2012
[Pipeline diagram: the 2010 pipeline extended with (I) contextualized object classification and detection, (II) generalized hierarchical matching — feature pooling with SPM + GHM and Fisher-kernel (FK) feature coding — and (III) subcategory mining over the detection results, plus image flipping and combined nonlinear + linear kernels; the same SVM/regression classification and post-processing stages (kernel regression, confidence refinement with exclusive prior) follow]
Outline for VOC: 2010-2012
Context model: Contextualized Object Classification and Detection
Feature pooling: Generalized Hierarchical Matching/Pooling
Subcategory learning: Sub-Category Aware Detection & Classification
Contextualized Object Classification and Detection
[Diagram: Cls gives global probabilities (occurrence probability) that the image contains each object; Det gives local patches with matched local shape/texture. Can the two tasks exchange information?]
Observations
Object classification and detection are mutually complementary; each subject task serves as a context task for the other.
Context is not robust for the subject task, so it should be used only when necessary: scene/global-level information is not stable for object detection (e.g. the person class in the figure), and false alarms from object detection harm object classification.
Contextualized SVM - Formulation
• Adaptive contextualization: the original classification hyperplane is augmented with an adaptive embedding of the context features, giving a sample-specific classification (n: feature dim, m: context dim); the embedding is applied selectively to ambiguous samples.
• Configurable model complexity via a low-rank constraint: the n × m context-embedding matrix is factorized so that only R × (n + m) parameters are learned.
• Easy to solve and kernelize once the context embedding is fixed.
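A hedged sketch of the decision function these annotations suggest: the original hyperplane plus a low-rank bilinear context term, so the n × m embedding matrix costs only R × (n + m) parameters (the exact formulation is given in [3]):

```latex
% x_i \in R^n: subject-task feature; c_i \in R^m: context feature of sample i.
f(x_i, c_i) \;=\; \underbrace{w^{\top} x_i + b}_{\text{original hyperplane}}
   \;+\; \underbrace{x_i^{\top} M\, c_i}_{\text{adaptive context embedding}},
\qquad M = P Q^{\top},\;\; P \in \mathbb{R}^{n \times R},\;\; Q \in \mathbb{R}^{m \times R}.
```

With M (or one of its factors) held fixed, the remaining problem is a standard SVM in an augmented feature space, which is why it is easy to solve and kernelize.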
Contextualized SVM - Formulation
Ambiguity modeling: define the ambiguity degree of a sample as the hinge loss of the subject task.
Learn the Ambiguity-guided Mixture Model (AMM) through EM to maximize the corresponding objective.
The multi-mode ambiguity term is defined as the posterior of each mixture component r (sketched below).
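The hinge-loss definition follows directly from the slide; the mixture objective and posterior below are a hedged sketch assuming a standard Gaussian mixture over context features weighted by ambiguity, not the exact AMM objective of [3]:

```latex
% Ambiguity degree of sample i: hinge loss of the subject-task classifier.
s_i \;=\; \max\!\bigl(0,\; 1 - y_i\, f(x_i)\bigr)

% Sketch of an ambiguity-weighted mixture objective optimized by EM
% (assumed form; ambiguous samples carry more weight).
\max_{\{\pi_r,\mu_r,\Sigma_r\}} \;\; \sum_i s_i \,
      \log \sum_{r=1}^{R} \pi_r\, \mathcal{N}\!\bigl(c_i;\; \mu_r, \Sigma_r\bigr)

% Multi-mode ambiguity term: posterior of mixture component r for sample i.
q_{ir} \;=\; \frac{\pi_r\, \mathcal{N}(c_i;\; \mu_r,\Sigma_r)}
                  {\sum_{r'} \pi_{r'}\, \mathcal{N}(c_i;\; \mu_{r'},\Sigma_{r'})}
```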
Iterative Co-training of Detection and Classification
[Diagram: a) initial classification and detection models are trained independently; b) 1st iteration of Context-SVM — the detection pipeline learns to detect from the detection feature plus context from the initial classification, while the classification pipeline learns to classify from the classification feature plus context from the initial detection; c) 2nd iteration of Context-SVM — each pipeline again takes the other's 1st-iteration outputs as context, and so on]
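A minimal Python sketch of this co-training loop; `train_context_svm`, `classify`, and `detect` are hypothetical stand-ins for the actual Context-SVM training and inference routines, so this is pseudocode-level, not the released pipeline.

```python
def iterative_contextualization(train_images, n_iters=2):
    """Alternately retrain classification and detection, each using the
    other's latest outputs as context features (sketch, not the exact code)."""
    # a) initial models, trained without any context
    cls_model = train_context_svm(train_images, task="classification", context=None)
    det_model = train_context_svm(train_images, task="detection", context=None)

    for it in range(n_iters):  # b) 1st iteration, c) 2nd iteration, ...
        # context for classification = current detection outputs, and vice versa
        det_context = [detect(det_model, img) for img in train_images]
        cls_context = [classify(cls_model, img) for img in train_images]

        # retrain each task's Context-SVM with the other task as its context
        cls_model = train_context_svm(train_images, task="classification",
                                      context=det_context)
        det_model = train_context_svm(train_images, task="detection",
                                      context=cls_context)
    return cls_model, det_model
```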
Results
Iterative contextualization: mean AP over the 20 classes on the VOC 2010 train/val split.
Results
Comparison with the state of the art on VOC 2010.
Exemplar results
Representative examples of the baseline (without contextualization) and Context-SVM on the classification task.
Outline for VOC: 2010-2012
Context model: Contextualized Object Classification and Detection
Feature pooling: Generalized Hierarchical Matching/Pooling
Subcategory learning: Sub-Category Aware Detection & Classification
Generalized Hierarchical Matching/Pooling
Traditional pooling: SPM is only an approximate geometric constraint and is not optimal for object recognition due to misalignment.
[Figure: (a) images, (b) SPM partitions, (c) object-confidence-map partitions]
Hierarchical Pooling for Image Classification
Design a general form of hierarchical matching with side information.
Represent the image with a hierarchical structure.
Hierarchical Matching Kernel
The image similarity kernel is defined as the weighted sum over per-cluster kernels.
A general form of SPM, PMK, etc.; flexible to integrate other side information.
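A sketch of that kernel in standard pyramid-matching notation; the per-cluster weights ω and the base kernel k are assumptions about the general form described in [2]:

```latex
% C_l: clusters at level l of the hierarchy (built from side information);
% F_{l,c}(X): encoded local features of image X pooled within cluster c at level l.
K(X, Y) \;=\; \sum_{l} \sum_{c \,\in\, \mathcal{C}_l} \omega_{l,c}\;
              k\!\bigl(F_{l,c}(X),\, F_{l,c}(Y)\bigr)
```

With rectangular grid cells as "clusters" this reduces to SPM, and with feature-space clusters to PMK-style matching, which is the sense in which it is a general form of both.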
Generalized Hierarchical Matching/Pooling
[Figure: (a) side information and image; (b) hierarchical clustering by side information — level 1 (top), 2 (mid), 3 (bottom); (c) hierarchical structure representation; (d) matching/pooling within each cluster; encoded local features vs. side information]
Utilize side information to hierarchically pool local features.
Side information design
[Diagram: a sliding window scans the image; shape and appearance models score each sub-window; the scores are voted back to the image and fused into per-class object confidence maps]
Side Information - Detection Confidence Map
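A minimal numpy sketch of pooling local features hierarchically by a detection confidence map; splitting descriptors into quantile bins of the confidence value is an assumed discretization, not the exact clustering of [2].

```python
import numpy as np

def ghm_pool(codes, confidences, levels=(1, 2, 4)):
    """Pool encoded local features within clusters defined by side information.

    codes:        (N, D) encoded local descriptors of one image.
    confidences:  (N,) object-confidence value sampled at each descriptor location
                  (the side information, e.g. a detection confidence map).
    levels:       number of confidence-based clusters per hierarchy level.
    Returns the concatenated max-pooled representation over all clusters.
    """
    pooled = []
    for n_clusters in levels:
        # split descriptors into n_clusters bins by confidence quantiles
        edges = np.quantile(confidences, np.linspace(0, 1, n_clusters + 1))
        for k in range(n_clusters):
            mask = (confidences >= edges[k]) & (confidences <= edges[k + 1])
            if mask.any():
                pooled.append(codes[mask].max(axis=0))   # max pooling per cluster
            else:
                pooled.append(np.zeros(codes.shape[1]))
    return np.concatenate(pooled)

# toy usage: 500 local codes of dim 128 with random confidences
rep = ghm_pool(np.random.rand(500, 128), np.random.rand(500))
```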
Results on PASCAL VOC
Outline for VOC: 2010-2012
Context model: Contextualized Object Classification and Detection
Feature pooling: Generalized Hierarchical Matching/Pooling
Subcategory learning: Sub-Category Aware Detection & Classification
Sub-Category Mining
[Example images of ambiguous categories: sofa, chair, dining table]
Ambiguity Guided Subcategory Mining
Subcategory-aware Object Classification
[Diagram: subcategory models 1 … N are trained separately and combined by a fusion model]
• Calculate the sample intra-class similarity.
• Calculate the sample inter-class ambiguity.
• Detect dense subgraphs with the graph shift algorithm [1].
• Map subgraphs to subcategories.
Sub-Category Mining
Subcategory Mining based on both Similarity & Ambiguity
[1] Hairong Liu, Shuicheng Yan. Robust Graph Mode Seeking by Graph Shift. ICML 2010
[Diagram: ambiguous categories (chair, sofa) → instance affinity graph built from intra-class similarity and inter-class ambiguity → graph shift → detected subgraphs → visualization of the corresponding subcategories]
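A numpy sketch of building the instance affinity graph from intra-class similarity and inter-class ambiguity; the combination rule is an assumption, and the greedy routine is only a crude stand-in for the graph shift algorithm [1] used in the actual system.

```python
import numpy as np

def affinity_graph(feats, ambiguity, sigma=1.0, alpha=0.5):
    """W[i, j] is high when samples i and j of one class are visually similar
    AND both are ambiguous w.r.t. the confusing class (sketch of the idea)."""
    d2 = ((feats[:, None, :] - feats[None, :, :]) ** 2).sum(-1)
    sim = np.exp(-d2 / (2 * sigma ** 2))              # intra-class similarity
    amb = np.minimum.outer(ambiguity, ambiguity)      # shared inter-class ambiguity
    return alpha * sim + (1 - alpha) * amb

def greedy_dense_subgraph(W, min_size=5):
    """Crude stand-in for graph shift: greedily drop the node with the smallest
    total affinity until the average affinity of the subgraph stops improving."""
    nodes = list(range(len(W)))
    while len(nodes) > min_size:
        sub = W[np.ix_(nodes, nodes)]
        worst = nodes[int(sub.sum(0).argmin())]
        trial = [n for n in nodes if n != worst]
        if W[np.ix_(trial, trial)].mean() <= sub.mean():
            break
        nodes = trial
    return nodes   # indices of one mined subcategory

# toy usage: 50 chair samples, random features and ambiguity scores
W = affinity_graph(np.random.rand(50, 16), np.random.rand(50))
subcategory = greedy_dense_subgraph(W)
```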
Sub-Category Aware Detection & Classification
[Framework diagram: a testing image goes through two streams. Detection stream: sliding/selective window search → feature extraction → subcategory detection models 1 … N → subcategory detection results 1 … N. Classification stream: local feature extraction and coding → GHM pooling → image representation → subcategory classification models 1 … N → subcategory classification results 1 … N. A fusion model combines all subcategory results into the category-level result]
Sub-Category Mining Result
[Examples of mined subcategories and outliers for the bus and chair classes]
Summary of VOC results
| Class | 2010 Our Best | 2010 Others' Best | 2011 Our Best | 2011 Others' Best | 2012 Our Best | 2012 Others' Best |
|---|---|---|---|---|---|---|
| aeroplane | 93 | 93.3 | 95.5 | 94.5 | 97.3 | 92 |
| bicycle | 79 | 77 | 81.1 | 82.6 | 84.2 | 74.2 |
| bird | 71.6 | 69.9 | 79.4 | 79.4 | 80.8 | 73 |
| boat | 77.8 | 77.2 | 82.5 | 80.7 | 85.3 | 77.5 |
| bottle | 54.3 | 53.7 | 58.2 | 57.8 | 60.8 | 54.3 |
| bus | 85.2 | 85.9 | 87.7 | 87.8 | 89.9 | 85.2 |
| car | 78.6 | 80.4 | 84.1 | 85.5 | 86.8 | 81.9 |
| cat | 78.8 | 79.4 | 83.1 | 83.9 | 89.3 | 76.4 |
| chair | 64.5 | 62.9 | 68.5 | 66.6 | 75.4 | 65.2 |
| cow | 64 | 66.2 | 74.7 | 74.2 | 77.8 | 63.2 |
| diningtable | 62.9 | 61.1 | 68.5 | 69.4 | 75.1 | 68.5 |
| dog | 69.6 | 71.1 | 76.4 | 75.2 | 83 | 68.9 |
| horse | 82 | 76.7 | 83.3 | 83 | 87.5 | 78.2 |
| motorbike | 84.4 | 81.7 | 87.5 | 88.1 | 90.1 | 81 |
| person | 91.6 | 90.2 | 92.8 | 93.5 | 95 | 91.6 |
| pottedplant | 48.6 | 53.3 | 56.5 | 58.7 | 57.8 | 55.9 |
| sheep | 65.4 | 66.3 | 77.7 | 75.5 | 79.2 | 69.4 |
| sofa | 59.6 | 58 | 67 | 66.3 | 73.4 | 65.4 |
| train | 89.4 | 87.5 | 91.2 | 90 | 94.5 | 86.7 |
| tvmonitor | 77.2 | 76.2 | 77.5 | 77.2 | 80.7 | 77.4 |
| MAP | 73.8 | — | 78.7 | — | 82.2 | — |
II. Spring of Deep Features: 2013-2014
CNN: Single-label Image Classification
Definition: assign one and only one label from a pre-defined set to an image.
Explicit assumption: the object is roughly aligned.
AlexNet [1] made a great breakthrough in single-label classification in ILSVRC 2012 (about a 10% gain over previous methods).
[1] A. Krizhevsky, I. Sutskever, G. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. NIPS 2012.
CNN: Multi-label Image Classification
Definition: assign multiple labels from a pre-defined set to an image.
Challenges:
• Foreground objects are not roughly aligned.
• Interactions between different objects, e.g. partial visibility and occlusion.
• A large number of training images is required: the label space expands from n to 2^n (for the 20 VOC classes, 2^20 ≈ 10^6 possible label combinations).
[Figure: single-label images vs. multi-label images]
Directly training a CNN on multi-label images is therefore unreasonable and unreliable!
Hypotheses-CNN-Pooling (HCP)
Our framework
[Architecture diagram: an input image is decomposed into object hypotheses; each hypothesis is fed through a shared convolutional neural network (AlexNet-style: five convolutional layers with 96, 256, 384, 384 and 256 channels plus max pooling, followed by two 4096-d fully connected layers), producing scores for each individual hypothesis; cross-hypothesis max pooling then yields the multi-label prediction, e.g. dog, person, sheep]
Hypotheses assumption: each hypothesis is single-labeled.
Characteristics of Our Framework
No ground-truth bounding box information is required for training on the multi-label image dataset.
The proposed HCP infrastructure is robust to noisy and/or redundant hypotheses.
No explicit hypothesis label is required for training.
The shared CNN can be well pre-trained on a large-scale single-label image dataset.
The HCP outputs are naturally multi-label prediction results.
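A minimal PyTorch sketch of the cross-hypothesis max pooling that turns per-hypothesis scores into an image-level multi-label prediction; the tiny `shared_cnn` and the sigmoid output are placeholders for the AlexNet-style shared network and the actual HCP loss.

```python
import torch
import torch.nn as nn

n_classes = 20  # PASCAL VOC

# placeholder for the shared CNN; in HCP this is an AlexNet-style network
shared_cnn = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, stride=2), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(16, n_classes),
)

def hcp_forward(hypothesis_crops):
    """hypothesis_crops: (H, 3, 227, 227) crops extracted from one image.
    Every hypothesis is scored independently by the shared CNN, then the
    image-level score per class is the max over all hypotheses."""
    scores = shared_cnn(hypothesis_crops)          # (H, n_classes)
    image_scores, _ = scores.max(dim=0)            # cross-hypothesis max pooling
    return torch.sigmoid(image_scores)             # multi-label probabilities

probs = hcp_forward(torch.randn(8, 3, 227, 227))   # 8 hypotheses from one image
```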
Training of HCP
Hypotheses extraction
Initialization of HCP: pre-training on a large-scale single-label image set, e.g. ImageNet
Image-fine-tuning on a multi-label image set
Hypotheses-fine-tuning
Hypotheses Extraction
Criteria: High object detection recall rate
Small number of hypotheses
High computational efficiency
Solution: BING [2] + box clustering
[2] M.-M. Cheng, J. Warrell, W.-Y. Lin, and P.H.S.Torr. BING: Binarized normed gradients for objectness estimation at 300fps. CVPR 2014.
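A hedged sketch of the box-clustering step: assuming BING has already produced scored proposals (BING itself is not re-implemented here), the boxes are clustered on normalized coordinates and one high-scoring box is kept per cluster to get a small, high-recall hypothesis set. k-means is an assumed clustering choice.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_hypotheses(boxes, scores, img_w, img_h, n_hypotheses=15):
    """boxes: (N, 4) proposals as (x1, y1, x2, y2) from an objectness detector
    such as BING; scores: (N,) objectness scores.
    Returns n_hypotheses boxes, one (the highest-scoring) per cluster."""
    norm = boxes / np.array([img_w, img_h, img_w, img_h], dtype=float)
    labels = KMeans(n_clusters=n_hypotheses, n_init=10).fit_predict(norm)
    keep = []
    for c in range(n_hypotheses):
        idx = np.where(labels == c)[0]
        keep.append(idx[scores[idx].argmax()])   # best-scoring box in the cluster
    return boxes[keep]

# toy usage with random proposals on a 500x375 image
boxes = np.random.randint(0, 300, size=(200, 4))
boxes[:, 2:] += boxes[:, :2]                      # ensure x2 >= x1, y2 >= y1
hyps = cluster_hypotheses(boxes, np.random.rand(200), 500, 375)
```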
Initialization of HCP
[Diagram: Step 1 — pre-training on single-label images (e.g. ImageNet); Step 2 — parameters are transferred and the network is image-fine-tuned on multi-label images (e.g. PASCAL VOC); finally, hypotheses-fine-tuning]
A subset of the ILSVRC 2013 detection dataset is used for BING training.
Experimental Results
Performance on PASCAL VOC 2007
Experimental Results
Performance on PASCAL VOC 2012
Experimental Results
Complementary Analysis: hand-crafted features vs. deep features
Experimental Results
One test sample from VOC 2007, with 500 hypotheses per image (1~1.5 s):
[Illustration: generate hypotheses → feed them into the shared CNN → cross-hypothesis max pooling, yielding multi-label predictions such as car, horse, person]
New Result: “Network in Network” (NIN)
NIN: a CNN with non-linear filters, yet without the final fully-connected layers.
Intuitively less overfitting globally, and more discriminative locally (not used in our final submission due to the surgery of our main team member, but very effective).
With fewer parameters.
[Figure: CNN vs. NIN, from [4]]
[4] Ian J. Goodfellow, David Warde-Farley, Mehdi Mirza, Aaron C. Courville, Yoshua Bengio. Maxout Networks. ICML 2013: 1319-1327.
Better Local Abstraction
A local patch is projected to its feature vector using a small network.
Motivation: better local abstraction!
Cascaded Cross Channel Parametric Pooling (CCCP)
Lin, Min, Qiang Chen, and Shuicheng Yan. "Network In Network." ICLR 2014.
CCCP ≈ cascaded 1×1 convolution in implementation.
Global Average Pooling
Saves tons of parameters.
[Figure: CNN vs. NIN — NIN's last layer produces a confidence map for each category, which is then globally average-pooled]
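A minimal PyTorch sketch of the two NIN ingredients discussed above: CCCP implemented as cascaded 1×1 convolutions after a normal convolution (an mlpconv block), and global average pooling over per-category confidence maps in place of fully connected layers. The channel sizes are illustrative, not the submitted configuration.

```python
import torch
import torch.nn as nn

def mlpconv(in_ch, conv_ch, kernel, stride, cccp_ch):
    """Convolution followed by cascaded cross-channel parametric pooling,
    i.e. cascaded 1x1 convolutions (a small network applied to each patch)."""
    return nn.Sequential(
        nn.Conv2d(in_ch, conv_ch, kernel, stride=stride, padding=kernel // 2), nn.ReLU(),
        nn.Conv2d(conv_ch, cccp_ch, kernel_size=1), nn.ReLU(),   # cccp layer 1
        nn.Conv2d(cccp_ch, cccp_ch, kernel_size=1), nn.ReLU(),   # cccp layer 2
    )

n_classes = 1000
nin = nn.Sequential(
    mlpconv(3, 96, 7, 2, 96),
    nn.MaxPool2d(3, stride=2),
    mlpconv(96, 256, 5, 2, 256),
    nn.MaxPool2d(3, stride=2),
    # last mlpconv outputs one confidence map per category ...
    mlpconv(256, 512, 3, 1, n_classes),
    # ... and global average pooling replaces the fully connected layers,
    # where most of an AlexNet-style network's parameters live
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
)

logits = nin(torch.randn(1, 3, 224, 224))   # (1, 1000)
```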
To avoid hyper-parameter tuning, we put cccp layers directly on the convolution layers of ZFNet (Network in ZFNet).
ZFNet:
| Layer | Details |
|---|---|
| Conv1 | stride = 2, kernel = 7×7, channel_out = 96 |
| Conv2 | stride = 2, kernel = 5×5, channel_out = 256 |
| Conv3 | stride = 1, kernel = 3×3, channel_out = 512 |
| Conv4 | stride = 1, kernel = 3×3, channel_out = 1024 |
| Conv5 | stride = 1, kernel = 3×3, channel_out = 512 |
| Fc1 | output = 4096 |
| Fc2 | output = 4096 |
| Fc3 | output = 1000 |

Network in ZFNet:
| Layer | Details |
|---|---|
| Conv1 | stride = 2, kernel = 7×7, channel_out = 96 |
| Cccp1 | output = 96 |
| Conv2 | stride = 2, kernel = 5×5, channel_out = 256 |
| Cccp2 | output = 256 |
| Conv3 | stride = 1, kernel = 3×3, channel_out = 512 |
| Cccp3 | output = 256 |
| Conv4 | stride = 1, kernel = 3×3, channel_out = 1024 |
| Cccp4 | output = 512 |
| Cccp5 | output = 384 |
| Conv5 | stride = 1, kernel = 3×3, channel_out = 512 |
| Cccp6 | output = 256 |
| Fc1 | output = 4096 |
| Fc2 | output = 4096 |
| Fc3 | output = 1000 |
(10.91%) with 256×N training and 3-view test.
Zeiler, Matthew D., and Rob Fergus. "Visualizing and Understanding Convolutional Networks." ECCV 2014: 818-833.
NIN in ILSVRC 2014
NIN in HCP
[Diagram: the HCP framework with the shared CNN replaced by a shared NIN — hypotheses → shared NIN → scores for each individual hypothesis → cross-hypothesis max pooling → multi-label prediction, e.g. dog, person, sheep]
Comparison with the state of the art on VOC 2012

| Category | NUS-PSL [1] | PRE-1000C [2] | PRE-1512 [2] | Chatfield et al. [3] | HCP-NIN | HCP-NIN + NUS-PSL |
|---|---|---|---|---|---|---|
| plane | 97.3 | 93.5 | 94.6 | 96.8 | 98.4 | 99.5 |
| bicycle | 84.2 | 78.4 | 82.9 | 82.5 | 89.5 | 93.7 |
| bird | 80.8 | 87.7 | 88.2 | 91.5 | 96.2 | 96.8 |
| boat | 85.3 | 80.9 | 84.1 | 88.1 | 91.7 | 94.0 |
| bottle | 60.8 | 57.3 | 60.3 | 62.1 | 72.5 | 77.7 |
| bus | 89.9 | 85.0 | 89.0 | 88.3 | 91.1 | 95.3 |
| car | 86.8 | 81.6 | 84.4 | 81.9 | 87.2 | 92.4 |
| cat | 89.3 | 89.4 | 90.7 | 94.8 | 97.1 | 98.2 |
| chair | 75.4 | 66.9 | 72.1 | 70.3 | 73.0 | 86.1 |
| cow | 77.8 | 73.8 | 86.8 | 80.2 | 89.5 | 91.3 |
| table | 75.1 | 62.0 | 69.0 | 76.2 | 75.1 | 83.5 |
| dog | 83.0 | 89.5 | 92.1 | 92.9 | 96.3 | 97.3 |
| horse | 87.5 | 83.2 | 93.4 | 90.3 | 93.0 | 96.8 |
| motor | 90.1 | 87.6 | 88.6 | 89.3 | 90.5 | 96.3 |
| person | 95.0 | 95.8 | 96.1 | 95.2 | 94.8 | 95.8 |
| plant | 57.8 | 61.4 | 64.3 | 57.4 | 66.5 | 72.2 |
| sheep | 79.2 | 79.0 | 86.6 | 83.6 | 90.3 | 91.5 |
| sofa | 73.4 | 54.3 | 62.3 | 66.4 | 65.8 | 81.1 |
| train | 94.5 | 88.0 | 91.1 | 93.5 | 95.6 | 97.6 |
| tv | 80.7 | 78.3 | 79.8 | 81.9 | 82.0 | 90.0 |
| MAP | 82.2 | 78.7 | 82.8 | 83.2 | 86.8 | 91.4 |

[1] S. Yan, J. Dong, Q. Chen, Z. Song, Y. Pan, W. Xia, H. Zhongyang, Y. Hua, and S. Shen. Generalized hierarchical matching for subcategory aware object classification. In Visual Recognition Challenge workshop, ECCV, 2012.
[2] M. Oquab, L. Bottou, I. Laptev, and J. Sivic. Learning and transferring mid-level image representations using convolutional neural networks. CVPR, 2014.
[3] K. Chatfield, K. Simonyan, A. Vedaldi, A. Zisserman. Return of the Devil in the Details: Delving Deep into Convolutional Nets. BMVC, 2014.
From 81.7% | < 90.3%
Demo
Online Demo
Five Highest- and Lowest-Scoring Images for Each Class
Aeroplane
Bicycle
Bird
Boat
Bottle
Five Highest- and Lowest-Scoring Images for Each Class
Bus
Car
Cat
Chair
Cow
Dining table
Dog
Horse
Motorbike
Person
Five Highest- and Lowest-Scoring Images for Each Class
Pottedplant
Sheep
Sofa
Train
TV monitor
Five Highest- and Lowest-Scoring Images for Each Class
What’s next?
Better Deep Features?
Better Local Features?
[Chart: classification mAP progression — 2009: 66.5%, 2010: 73.8%, 2011: 78.7%, 2012: 82.2%, 2014: 83.2% (deep feature) and 91.4% (HCP), with LLC, Context-SVM, GHM, sub-category mining, deep features and HCP as the successive methods; a 25% overall gain is annotated on the chart]
More Extra Data?
Better Solutions for Small/Occluded Objects?
Shuicheng YAN