TRANSCRIPT
Shuicheng YAN, NUS
PASCAL VOC Classification: Local Features vs. Deep Features
PASCAL VOC
PASCAL VOC: Visual Object Classes challenges. Held yearly from 2007 to 2012; tens of teams from universities and industry participated, including INRIA, Berkeley, Oxford, NEC, etc. It has become "the dataset" for visual object recognition research.
Main tasks: object classification, detection and segmentation. Other tasks: person layout, action recognition, etc.
Data: 20 object classes, ~23,000 images with fine labeling.
[Figure: visual object recognition tasks — object classification (e.g. person, horse, barrier, table), object detection, and object segmentation]
Why valuable? Multi-label, real scenarios!
PASCAL VOC: 2010-2014
NUS-PSL team results:
• 2014: classification mAP raised to 0.91
• 2012, 2011, 2010: winner of the object classification task (cls)
• 2012: winner of the object segmentation task (seg)
• 2010: honorable mention in the object detection task (det)
NUS-PSL architecture: joint learning of cls-det-seg.
[Diagram: cls provides global information, det provides local information, seg provides fine-detailed information, all feeding visual object recognition]
PASCAL VOC: 2010-2014
[Chart: classification mAP progression — 2010: 73.8%, 2011: 78.7%, 2012: 82.2%, 2013: 79.0% (deep feature), 2014: 83.2% (deep feature) and 91.4% (HCP), with LLC, Context-SVM, GHM, sub-category mining, deep features and HCP marking the successive methods; a 25% overall gain is annotated on the chart]
I. Spring of Local Features: 2010-2012
Pipeline
[Pipeline diagram: feature representation (low-level features → feature encoding → feature pooling) followed by model learning (classifier learning → context modeling)]
GHM [2]: Generalized Hierarchical Matching (GHM) for object-central problems; object-central pooling.
Subcategory mining [1]: automatically mining visual subcategories based on ambiguity modeling.
Contextualization [3]: mutual contextualization of the object classification and detection tasks; great performance improvement.
[1] Jian Dong, Qiang Chen, Jiashi Feng, Wei Xia, Zhongyang Huang, Shuicheng Yan. Subcategory-aware Object Classification. CVPR 2013.
[2] Qiang Chen, Zheng Song, Yang Hua, Zhongyang Huang, Shuicheng Yan. Hierarchical Matching with Side Information for Image Classification. CVPR 2012.
[3] Zheng Song*, Qiang Chen*, Zhongyang Huang, Yang Hua, Shuicheng Yan. Contextualizing Object Detection and Classification. CVPR 2011.
Framework – NUS-PSL 2010
[Pipeline diagram: visual features → local feature extraction → feature coding → feature pooling (SPM) → classification (SVM with nonlinear kernel; regression with linear kernel; max pooling over detection results) → post-processing (kernel regression; confidence refinement with exclusive prior). Example class: chair]
Framework – NUS-PSL 2012
[Pipeline diagram: the 2010 pipeline extended with (I) contextualized object classification and detection, (II) generalized hierarchical matching — feature pooling with SPM + GHM and Fisher-kernel (FK) feature coding — and (III) subcategory mining over the detection results, plus image flipping and combined nonlinear + linear kernels; the same SVM/regression classification and post-processing stages (kernel regression, confidence refinement with exclusive prior) follow]
Outline for VOC: 2010-2012
Context model: Contextualized Object Classification and Detection
Feature pooling: Generalized Hierarchical Matching/Pooling
Subcategory learning: Sub-Category Aware Detection & Classification
Contextualized Object Classification and Detection
[Diagram: Cls gives global probabilities (occurrence probability) that the image contains each object; Det gives local patches with matched local shape/texture. Can the two tasks exchange information?]
Observations
Object classification and detection are mutually complementary; each subject task serves as a context task for the other.
Context is not robust for the subject task, so it should be used only when necessary: scene/global-level information is not stable for object detection (e.g. the person class in the figure), and false alarms from object detection harm object classification.
Contextualized SVM - Formulation
• Adaptive contextualization: the original classification hyperplane is augmented with an adaptive embedding of the context features, giving a sample-specific classification (n: feature dim, m: context dim); the embedding is applied selectively to ambiguous samples.
• Configurable model complexity via a low-rank constraint: the n × m context-embedding matrix is factorized so that only R × (n + m) parameters are learned.
• Easy to solve and kernelize once the context embedding is fixed.
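A hedged sketch of the decision function these annotations suggest: the original hyperplane plus a low-rank bilinear context term, so the n × m embedding matrix costs only R × (n + m) parameters (the exact formulation is given in [3]):

```latex
% x_i \in R^n: subject-task feature; c_i \in R^m: context feature of sample i.
f(x_i, c_i) \;=\; \underbrace{w^{\top} x_i + b}_{\text{original hyperplane}}
   \;+\; \underbrace{x_i^{\top} M\, c_i}_{\text{adaptive context embedding}},
\qquad M = P Q^{\top},\;\; P \in \mathbb{R}^{n \times R},\;\; Q \in \mathbb{R}^{m \times R}.
```

With M (or one of its factors) held fixed, the remaining problem is a standard SVM in an augmented feature space, which is why it is easy to solve and kernelize.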
Contextualized SVM - Formulation
Ambiguity modeling: define the ambiguity degree of a sample as the hinge loss of the subject task.
Learn the Ambiguity-guided Mixture Model (AMM) through EM to maximize the corresponding objective.
The multi-mode ambiguity term is defined as the posterior of each mixture component r (sketched below).
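The hinge-loss definition follows directly from the slide; the mixture objective and posterior below are a hedged sketch assuming a standard Gaussian mixture over context features weighted by ambiguity, not the exact AMM objective of [3]:

```latex
% Ambiguity degree of sample i: hinge loss of the subject-task classifier.
s_i \;=\; \max\!\bigl(0,\; 1 - y_i\, f(x_i)\bigr)

% Sketch of an ambiguity-weighted mixture objective optimized by EM
% (assumed form; ambiguous samples carry more weight).
\max_{\{\pi_r,\mu_r,\Sigma_r\}} \;\; \sum_i s_i \,
      \log \sum_{r=1}^{R} \pi_r\, \mathcal{N}\!\bigl(c_i;\; \mu_r, \Sigma_r\bigr)

% Multi-mode ambiguity term: posterior of mixture component r for sample i.
q_{ir} \;=\; \frac{\pi_r\, \mathcal{N}(c_i;\; \mu_r,\Sigma_r)}
                  {\sum_{r'} \pi_{r'}\, \mathcal{N}(c_i;\; \mu_{r'},\Sigma_{r'})}
```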
Iterative Co-training of Detection and Classification
[Diagram: a) initial classification and detection models are trained independently; b) 1st iteration of Context-SVM — the detection pipeline learns to detect from the detection feature plus context from the initial classification, while the classification pipeline learns to classify from the classification feature plus context from the initial detection; c) 2nd iteration of Context-SVM — each pipeline again takes the other's 1st-iteration outputs as context, and so on]
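A minimal Python sketch of this co-training loop; `train_context_svm`, `classify`, and `detect` are hypothetical stand-ins for the actual Context-SVM training and inference routines, so this is pseudocode-level, not the released pipeline.

```python
def iterative_contextualization(train_images, n_iters=2):
    """Alternately retrain classification and detection, each using the
    other's latest outputs as context features (sketch, not the exact code)."""
    # a) initial models, trained without any context
    cls_model = train_context_svm(train_images, task="classification", context=None)
    det_model = train_context_svm(train_images, task="detection", context=None)

    for it in range(n_iters):  # b) 1st iteration, c) 2nd iteration, ...
        # context for classification = current detection outputs, and vice versa
        det_context = [detect(det_model, img) for img in train_images]
        cls_context = [classify(cls_model, img) for img in train_images]

        # retrain each task's Context-SVM with the other task as its context
        cls_model = train_context_svm(train_images, task="classification",
                                      context=det_context)
        det_model = train_context_svm(train_images, task="detection",
                                      context=cls_context)
    return cls_model, det_model
```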
Results
Iterative contextualization: mean AP over the 20 classes on the VOC 2010 train/val split.
Results
Comparison with the state of the art on VOC 2010.
Exemplar results
Representative examples of the baseline (without contextualization) and Context-SVM on the classification task.
Outline for VOC: 2010-2012
Context model: Contextualized Object Classification and Detection
Feature pooling: Generalized Hierarchical Matching/Pooling
Subcategory learning: Sub-Category Aware Detection & Classification
Generalized Hierarchical Matching/Pooling
Traditional pooling: SPM is only an approximate geometric constraint and is not optimal for object recognition due to misalignment.
[Figure: (a) images, (b) SPM partitions, (c) object-confidence-map partitions]
Hierarchical Pooling for Image Classification
Design a general form of hierarchical matching with side information.
Represent the image with a hierarchical structure.
Hierarchical Matching Kernel
The image similarity kernel is defined as the weighted sum over per-cluster kernels.
A general form of SPM, PMK, etc.; flexible to integrate other side information.
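A sketch of that kernel in standard pyramid-matching notation; the per-cluster weights ω and the base kernel k are assumptions about the general form described in [2]:

```latex
% C_l: clusters at level l of the hierarchy (built from side information);
% F_{l,c}(X): encoded local features of image X pooled within cluster c at level l.
K(X, Y) \;=\; \sum_{l} \sum_{c \,\in\, \mathcal{C}_l} \omega_{l,c}\;
              k\!\bigl(F_{l,c}(X),\, F_{l,c}(Y)\bigr)
```

With rectangular grid cells as "clusters" this reduces to SPM, and with feature-space clusters to PMK-style matching, which is the sense in which it is a general form of both.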
Generalized Hierarchical Matching/Pooling
[Figure: (a) side information and image; (b) hierarchical clustering by side information — level 1 (top), 2 (mid), 3 (bottom); (c) hierarchical structure representation; (d) matching/pooling within each cluster; encoded local features vs. side information]
Utilize side information to hierarchically pool local features.
Side information design
[Diagram: a sliding window scans the image; shape and appearance models score each sub-window; the scores are voted back to the image and fused into per-class object confidence maps]
Side Information - Detection Confidence Map
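A minimal numpy sketch of pooling local features hierarchically by a detection confidence map; splitting descriptors into quantile bins of the confidence value is an assumed discretization, not the exact clustering of [2].

```python
import numpy as np

def ghm_pool(codes, confidences, levels=(1, 2, 4)):
    """Pool encoded local features within clusters defined by side information.

    codes:        (N, D) encoded local descriptors of one image.
    confidences:  (N,) object-confidence value sampled at each descriptor location
                  (the side information, e.g. a detection confidence map).
    levels:       number of confidence-based clusters per hierarchy level.
    Returns the concatenated max-pooled representation over all clusters.
    """
    pooled = []
    for n_clusters in levels:
        # split descriptors into n_clusters bins by confidence quantiles
        edges = np.quantile(confidences, np.linspace(0, 1, n_clusters + 1))
        for k in range(n_clusters):
            mask = (confidences >= edges[k]) & (confidences <= edges[k + 1])
            if mask.any():
                pooled.append(codes[mask].max(axis=0))   # max pooling per cluster
            else:
                pooled.append(np.zeros(codes.shape[1]))
    return np.concatenate(pooled)

# toy usage: 500 local codes of dim 128 with random confidences
rep = ghm_pool(np.random.rand(500, 128), np.random.rand(500))
```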
Results on PASCAL VOC
Outline for VOC: 2010-2012
Context model: Contextualized Object Classification and Detection
Feature pooling: Generalized Hierarchical Matching/Pooling
Subcategory learning: Sub-Category Aware Detection & Classification
Sub-Category Mining
[Example images of ambiguous categories: sofa, chair, dining table]
Ambiguity Guided Subcategory Mining
Subcategory-aware Object Classification
[Diagram: subcategory models 1 … N are trained separately and combined by a fusion model]
• Calculate the sample intra-class similarity.
• Calculate the sample inter-class ambiguity.
• Detect dense subgraphs with the graph shift algorithm [1].
• Map subgraphs to subcategories.
Sub-Category Mining
Subcategory Mining based on both Similarity & Ambiguity
[1] Hairong Liu, Shuicheng Yan. Robust Graph Mode Seeking by Graph Shift. ICML 2010
[Diagram: ambiguous categories (chair, sofa) → instance affinity graph built from intra-class similarity and inter-class ambiguity → graph shift → detected subgraphs → visualization of the corresponding subcategories]
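A numpy sketch of building the instance affinity graph from intra-class similarity and inter-class ambiguity; the combination rule is an assumption, and the greedy routine is only a crude stand-in for the graph shift algorithm [1] used in the actual system.

```python
import numpy as np

def affinity_graph(feats, ambiguity, sigma=1.0, alpha=0.5):
    """W[i, j] is high when samples i and j of one class are visually similar
    AND both are ambiguous w.r.t. the confusing class (sketch of the idea)."""
    d2 = ((feats[:, None, :] - feats[None, :, :]) ** 2).sum(-1)
    sim = np.exp(-d2 / (2 * sigma ** 2))              # intra-class similarity
    amb = np.minimum.outer(ambiguity, ambiguity)      # shared inter-class ambiguity
    return alpha * sim + (1 - alpha) * amb

def greedy_dense_subgraph(W, min_size=5):
    """Crude stand-in for graph shift: greedily drop the node with the smallest
    total affinity until the average affinity of the subgraph stops improving."""
    nodes = list(range(len(W)))
    while len(nodes) > min_size:
        sub = W[np.ix_(nodes, nodes)]
        worst = nodes[int(sub.sum(0).argmin())]
        trial = [n for n in nodes if n != worst]
        if W[np.ix_(trial, trial)].mean() <= sub.mean():
            break
        nodes = trial
    return nodes   # indices of one mined subcategory

# toy usage: 50 chair samples, random features and ambiguity scores
W = affinity_graph(np.random.rand(50, 16), np.random.rand(50))
subcategory = greedy_dense_subgraph(W)
```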
Sub-Category Aware Detection & Classification
[Framework diagram: a testing image goes through two streams. Detection stream: sliding/selective window search → feature extraction → subcategory detection models 1 … N → subcategory detection results 1 … N. Classification stream: local feature extraction and coding → GHM pooling → image representation → subcategory classification models 1 … N → subcategory classification results 1 … N. A fusion model combines all subcategory results into the category-level result]
Sub-Category Mining Result
[Examples of mined subcategories and outliers for the bus and chair classes]
Summary of VOC results
| Class | 2010 Our Best | 2010 Others' Best | 2011 Our Best | 2011 Others' Best | 2012 Our Best | 2012 Others' Best |
|---|---|---|---|---|---|---|
| aeroplane | 93 | 93.3 | 95.5 | 94.5 | 97.3 | 92 |
| bicycle | 79 | 77 | 81.1 | 82.6 | 84.2 | 74.2 |
| bird | 71.6 | 69.9 | 79.4 | 79.4 | 80.8 | 73 |
| boat | 77.8 | 77.2 | 82.5 | 80.7 | 85.3 | 77.5 |
| bottle | 54.3 | 53.7 | 58.2 | 57.8 | 60.8 | 54.3 |
| bus | 85.2 | 85.9 | 87.7 | 87.8 | 89.9 | 85.2 |
| car | 78.6 | 80.4 | 84.1 | 85.5 | 86.8 | 81.9 |
| cat | 78.8 | 79.4 | 83.1 | 83.9 | 89.3 | 76.4 |
| chair | 64.5 | 62.9 | 68.5 | 66.6 | 75.4 | 65.2 |
| cow | 64 | 66.2 | 74.7 | 74.2 | 77.8 | 63.2 |
| diningtable | 62.9 | 61.1 | 68.5 | 69.4 | 75.1 | 68.5 |
| dog | 69.6 | 71.1 | 76.4 | 75.2 | 83 | 68.9 |
| horse | 82 | 76.7 | 83.3 | 83 | 87.5 | 78.2 |
| motorbike | 84.4 | 81.7 | 87.5 | 88.1 | 90.1 | 81 |
| person | 91.6 | 90.2 | 92.8 | 93.5 | 95 | 91.6 |
| pottedplant | 48.6 | 53.3 | 56.5 | 58.7 | 57.8 | 55.9 |
| sheep | 65.4 | 66.3 | 77.7 | 75.5 | 79.2 | 69.4 |
| sofa | 59.6 | 58 | 67 | 66.3 | 73.4 | 65.4 |
| train | 89.4 | 87.5 | 91.2 | 90 | 94.5 | 86.7 |
| tvmonitor | 77.2 | 76.2 | 77.5 | 77.2 | 80.7 | 77.4 |
| MAP | 73.8 | — | 78.7 | — | 82.2 | — |
II. Spring of Deep Features: 2013-2014
CNN: Single-label Image Classification
Definition: assign one and only one label from a pre-defined set to an image.
Explicit assumption: the object is roughly aligned.
AlexNet [1] made a great breakthrough in single-label classification in ILSVRC 2012 (about a 10% gain over previous methods).
[1] A. Krizhevsky, I. Sutskever, G. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. NIPS 2012.
CNN: Multi-label Image Classification
Definition: assign multiple labels from a pre-defined set to an image.
Challenges:
• Foreground objects are not roughly aligned.
• Interactions between different objects, e.g. partial visibility and occlusion.
• A large number of training images is required: the label space expands from n to 2^n (for the 20 VOC classes, 2^20 ≈ 10^6 possible label combinations).
[Figure: single-label images vs. multi-label images]
Directly training a CNN on multi-label images is therefore unreasonable and unreliable!
Hypotheses-CNN-Pooling (HCP)
Our framework
[Architecture diagram: an input image is decomposed into object hypotheses; each hypothesis is fed through a shared convolutional neural network (AlexNet-style: five convolutional layers with 96, 256, 384, 384 and 256 channels plus max pooling, followed by two 4096-d fully connected layers), producing scores for each individual hypothesis; cross-hypothesis max pooling then yields the multi-label prediction, e.g. dog, person, sheep]
Hypotheses assumption: each hypothesis is single-labeled.
Characteristics of Our Framework
No ground-truth bounding box information is required for training on the multi-label image dataset.
The proposed HCP infrastructure is robust to noisy and/or redundant hypotheses.
No explicit hypothesis label is required for training.
The shared CNN can be well pre-trained on a large-scale single-label image dataset.
The HCP outputs are naturally multi-label prediction results.
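A minimal PyTorch sketch of the cross-hypothesis max pooling that turns per-hypothesis scores into an image-level multi-label prediction; the tiny `shared_cnn` and the sigmoid output are placeholders for the AlexNet-style shared network and the actual HCP loss.

```python
import torch
import torch.nn as nn

n_classes = 20  # PASCAL VOC

# placeholder for the shared CNN; in HCP this is an AlexNet-style network
shared_cnn = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, stride=2), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(16, n_classes),
)

def hcp_forward(hypothesis_crops):
    """hypothesis_crops: (H, 3, 227, 227) crops extracted from one image.
    Every hypothesis is scored independently by the shared CNN, then the
    image-level score per class is the max over all hypotheses."""
    scores = shared_cnn(hypothesis_crops)          # (H, n_classes)
    image_scores, _ = scores.max(dim=0)            # cross-hypothesis max pooling
    return torch.sigmoid(image_scores)             # multi-label probabilities

probs = hcp_forward(torch.randn(8, 3, 227, 227))   # 8 hypotheses from one image
```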
Training of HCP
Hypotheses extraction
Initialization of HCP: pre-training on a large-scale single-label image set, e.g. ImageNet
Image-fine-tuning on a multi-label image set
Hypotheses-fine-tuning
Hypotheses Extraction
Criteria: High object detection recall rate
Small number of hypotheses
High computational efficiency
Solution: BING [2] + box clustering
[2] M.-M. Cheng, J. Warrell, W.-Y. Lin, and P.H.S.Torr. BING: Binarized normed gradients for objectness estimation at 300fps. CVPR 2014.
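A hedged sketch of the box-clustering step: assuming BING has already produced scored proposals (BING itself is not re-implemented here), the boxes are clustered on normalized coordinates and one high-scoring box is kept per cluster to get a small, high-recall hypothesis set. k-means is an assumed clustering choice.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_hypotheses(boxes, scores, img_w, img_h, n_hypotheses=15):
    """boxes: (N, 4) proposals as (x1, y1, x2, y2) from an objectness detector
    such as BING; scores: (N,) objectness scores.
    Returns n_hypotheses boxes, one (the highest-scoring) per cluster."""
    norm = boxes / np.array([img_w, img_h, img_w, img_h], dtype=float)
    labels = KMeans(n_clusters=n_hypotheses, n_init=10).fit_predict(norm)
    keep = []
    for c in range(n_hypotheses):
        idx = np.where(labels == c)[0]
        keep.append(idx[scores[idx].argmax()])   # best-scoring box in the cluster
    return boxes[keep]

# toy usage with random proposals on a 500x375 image
boxes = np.random.randint(0, 300, size=(200, 4))
boxes[:, 2:] += boxes[:, :2]                      # ensure x2 >= x1, y2 >= y1
hyps = cluster_hypotheses(boxes, np.random.rand(200), 500, 375)
```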
Initialization of HCP
[Diagram: Step 1 — pre-training on single-label images (e.g. ImageNet); Step 2 — parameters are transferred and the network is image-fine-tuned on multi-label images (e.g. PASCAL VOC); finally, hypotheses-fine-tuning]
A subset of the ILSVRC 2013 detection dataset is used for BING training.
Experimental Results
Performance on PASCAL VOC 2007
Experimental Results
Performance on PASCAL VOC 2012
Experimental Results
Complementary Analysis: hand-crafted features vs. deep features
Experimental Results
One test sample from VOC 2007, with 500 hypotheses per image (1~1.5 s):
[Illustration: generate hypotheses → feed them into the shared CNN → cross-hypothesis max pooling, yielding multi-label predictions such as car, horse, person]
New Result: “Network in Network” (NIN)
NIN: a CNN with non-linear filters, yet without the final fully-connected layers.
Intuitively less overfitting globally, and more discriminative locally (not used in our final submission due to the surgery of our main team member, but very effective).
With fewer parameters.
[Figure: CNN vs. NIN, from [4]]
[4] Ian J. Goodfellow, David Warde-Farley, Mehdi Mirza, Aaron C. Courville, Yoshua Bengio. Maxout Networks. ICML 2013: 1319-1327.
Better Local Abstraction
A local patch is projected to its feature vector using a small network.
Motivation: better local abstraction!
Cascaded Cross Channel Parametric Pooling (CCCP)
Lin, Min, Qiang Chen, and Shuicheng Yan. "Network In Network." ICLR 2014.
CCCP ≈ cascaded 1×1 convolution in implementation.
Global Average Pooling
Saves tons of parameters.
[Figure: CNN vs. NIN — NIN's last layer produces a confidence map for each category, which is then globally average-pooled]
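A minimal PyTorch sketch of the two NIN ingredients discussed above: CCCP implemented as cascaded 1×1 convolutions after a normal convolution (an mlpconv block), and global average pooling over per-category confidence maps in place of fully connected layers. The channel sizes are illustrative, not the submitted configuration.

```python
import torch
import torch.nn as nn

def mlpconv(in_ch, conv_ch, kernel, stride, cccp_ch):
    """Convolution followed by cascaded cross-channel parametric pooling,
    i.e. cascaded 1x1 convolutions (a small network applied to each patch)."""
    return nn.Sequential(
        nn.Conv2d(in_ch, conv_ch, kernel, stride=stride, padding=kernel // 2), nn.ReLU(),
        nn.Conv2d(conv_ch, cccp_ch, kernel_size=1), nn.ReLU(),   # cccp layer 1
        nn.Conv2d(cccp_ch, cccp_ch, kernel_size=1), nn.ReLU(),   # cccp layer 2
    )

n_classes = 1000
nin = nn.Sequential(
    mlpconv(3, 96, 7, 2, 96),
    nn.MaxPool2d(3, stride=2),
    mlpconv(96, 256, 5, 2, 256),
    nn.MaxPool2d(3, stride=2),
    # last mlpconv outputs one confidence map per category ...
    mlpconv(256, 512, 3, 1, n_classes),
    # ... and global average pooling replaces the fully connected layers,
    # where most of an AlexNet-style network's parameters live
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
)

logits = nin(torch.randn(1, 3, 224, 224))   # (1, 1000)
```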
To avoid hyper-parameter tuning, we put cccp layers directly on the convolution layers of ZFNet (Network in ZFNet).
ZFNet:
| Layer | Details |
|---|---|
| Conv1 | stride = 2, kernel = 7×7, channel_out = 96 |
| Conv2 | stride = 2, kernel = 5×5, channel_out = 256 |
| Conv3 | stride = 1, kernel = 3×3, channel_out = 512 |
| Conv4 | stride = 1, kernel = 3×3, channel_out = 1024 |
| Conv5 | stride = 1, kernel = 3×3, channel_out = 512 |
| Fc1 | output = 4096 |
| Fc2 | output = 4096 |
| Fc3 | output = 1000 |

Network in ZFNet:
| Layer | Details |
|---|---|
| Conv1 | stride = 2, kernel = 7×7, channel_out = 96 |
| Cccp1 | output = 96 |
| Conv2 | stride = 2, kernel = 5×5, channel_out = 256 |
| Cccp2 | output = 256 |
| Conv3 | stride = 1, kernel = 3×3, channel_out = 512 |
| Cccp3 | output = 256 |
| Conv4 | stride = 1, kernel = 3×3, channel_out = 1024 |
| Cccp4 | output = 512 |
| Cccp5 | output = 384 |
| Conv5 | stride = 1, kernel = 3×3, channel_out = 512 |
| Cccp6 | output = 256 |
| Fc1 | output = 4096 |
| Fc2 | output = 4096 |
| Fc3 | output = 1000 |
(10.91%) with 256×N training and 3-view test.
Zeiler, Matthew D., and Rob Fergus. "Visualizing and Understanding Convolutional Networks." ECCV 2014: 818-833.
NIN in ILSVRC 2014
NIN in HCP
[Diagram: the HCP framework with the shared CNN replaced by a shared NIN — hypotheses → shared NIN → scores for each individual hypothesis → cross-hypothesis max pooling → multi-label prediction, e.g. dog, person, sheep]
Comparison with the state of the art on VOC 2012

| Category | NUS-PSL [1] | PRE-1000C [2] | PRE-1512 [2] | Chatfield et al. [3] | HCP-NIN | HCP-NIN + NUS-PSL |
|---|---|---|---|---|---|---|
| plane | 97.3 | 93.5 | 94.6 | 96.8 | 98.4 | 99.5 |
| bicycle | 84.2 | 78.4 | 82.9 | 82.5 | 89.5 | 93.7 |
| bird | 80.8 | 87.7 | 88.2 | 91.5 | 96.2 | 96.8 |
| boat | 85.3 | 80.9 | 84.1 | 88.1 | 91.7 | 94.0 |
| bottle | 60.8 | 57.3 | 60.3 | 62.1 | 72.5 | 77.7 |
| bus | 89.9 | 85.0 | 89.0 | 88.3 | 91.1 | 95.3 |
| car | 86.8 | 81.6 | 84.4 | 81.9 | 87.2 | 92.4 |
| cat | 89.3 | 89.4 | 90.7 | 94.8 | 97.1 | 98.2 |
| chair | 75.4 | 66.9 | 72.1 | 70.3 | 73.0 | 86.1 |
| cow | 77.8 | 73.8 | 86.8 | 80.2 | 89.5 | 91.3 |
| table | 75.1 | 62.0 | 69.0 | 76.2 | 75.1 | 83.5 |
| dog | 83.0 | 89.5 | 92.1 | 92.9 | 96.3 | 97.3 |
| horse | 87.5 | 83.2 | 93.4 | 90.3 | 93.0 | 96.8 |
| motor | 90.1 | 87.6 | 88.6 | 89.3 | 90.5 | 96.3 |
| person | 95.0 | 95.8 | 96.1 | 95.2 | 94.8 | 95.8 |
| plant | 57.8 | 61.4 | 64.3 | 57.4 | 66.5 | 72.2 |
| sheep | 79.2 | 79.0 | 86.6 | 83.6 | 90.3 | 91.5 |
| sofa | 73.4 | 54.3 | 62.3 | 66.4 | 65.8 | 81.1 |
| train | 94.5 | 88.0 | 91.1 | 93.5 | 95.6 | 97.6 |
| tv | 80.7 | 78.3 | 79.8 | 81.9 | 82.0 | 90.0 |
| MAP | 82.2 | 78.7 | 82.8 | 83.2 | 86.8 | 91.4 |

[1] S. Yan, J. Dong, Q. Chen, Z. Song, Y. Pan, W. Xia, H. Zhongyang, Y. Hua, and S. Shen. Generalized hierarchical matching for subcategory aware object classification. In Visual Recognition Challenge workshop, ECCV, 2012.
[2] M. Oquab, L. Bottou, I. Laptev, and J. Sivic. Learning and transferring mid-level image representations using convolutional neural networks. CVPR, 2014.
[3] K. Chatfield, K. Simonyan, A. Vedaldi, A. Zisserman. Return of the Devil in the Details: Delving Deep into Convolutional Nets. BMVC, 2014.
From 81.7% | < 90.3%
Demo
Online Demo
Five Highest- and Lowest-Scoring Images for Each Class
Aeroplane
Bicycle
Bird
Boat
Bottle
Five Highest- and Lowest-Scoring Images for Each Class
Bus
Car
Cat
Chair
Cow
Dining table
Dog
Horse
Motorbike
Person
Five Highest- and Lowest-Scoring Images for Each Class
Pottedplant
Sheep
Sofa
Train
TV monitor
Five Highest- and Lowest-Scoring Images for Each Class
What’s next?
Better Deep Features?
Better Local Features?
[Chart: classification mAP progression — 2009: 66.5%, 2010: 73.8%, 2011: 78.7%, 2012: 82.2%, 2014: 83.2% (deep feature) and 91.4% (HCP), with LLC, Context-SVM, GHM, sub-category mining, deep features and HCP as the successive methods; a 25% overall gain is annotated on the chart]
More Extra Data?
Better Solutions for Small/Occluded Objects?
Shuicheng YAN