

Learning Convolutional Feature Hierarchies for Visual Recognition

Koray Kavukcuoglu¹, Pierre Sermanet¹, Y-Lan Boureau¹,², Karol Gregor¹, Michael Mathieu¹, Yann LeCun¹
¹Courant Institute of Mathematical Sciences - NYU, ²INRIA - Willow Project Team

Overview

• Sparse coding is a popular method for learning features in an unsupervised manner [Olshausen'97, Ranzato'07, Kavukcuoglu'08,'09].
• Sparse coding is often trained on isolated image patches, which produces Gabor filters at various orientations and positions to cover the patch. This yields highly redundant representations.
• We use convolutional sparse coding, which is trained on large image regions and produces more diverse filters and less redundant representations [Zeiler'10, Lee'10].
• We use a feed-forward encoder to produce fast approximations of the sparse code [Ranzato'07, Kavukcuoglu'08,'09, Jarrett'09].
• The method is used to pre-train the filters of convolutional networks that are subsequently fine-tuned with supervised back-prop [Hinton'06, Ranzato'07, Bengio'07, Kavukcuoglu'08,'09, Jarrett'09].
• Competitive accuracies are achieved on object recognition and detection tasks.

[Figure: learned 1st-layer and 2nd-layer filters, showing the predictor filters and the dictionary for each layer.]

Pedestrian Detection on INRIA

• One of the most widely used pedestrian detection benchmark datasets.
• Detections are matched to ground-truth bounding boxes at 50% overlap; a sketch of this criterion follows below.
• 4 bootstrapping passes.

[Figure: DET curves (miss rate vs. false positives per image) on INRIA. Legend (miss rate): Shapelet-orig (90.5%), PoseInvSvm (68.6%), VJ-OpenCv (53.0%), PoseInv (51.4%), Shapelet (50.4%), VJ (47.5%), FtrMine (34.0%), Pls (23.4%), HOG (23.1%), HikSvm (21.9%), LatSvm-V1 (17.5%), MultiFtr (15.6%), R+R+ (14.8%), U+U+ (11.5%), MultiFtr+CSS (10.9%), LatSvm-V2 (9.3%), FPDW (9.3%), ChnFtrs (8.7%). The highlighted models are ours: R+R+ at 14.8% and U+U+ at 11.5%.]

[Figure: DET curves on INRIA for the U+U+ model across bootstrapping passes. Legend (miss rate): U+U+-bt0 (23.6%), U+U+-bt1 (16.5%), U+U+-bt2 (13.8%), U+U+-bt3 (11.9%), U+U+-bt4 (11.5%), U+U+-bt5 (11.7%), U+U+-bt6 (12.4%).]
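The poster states the 50%-overlap matching rule but not its implementation; below is a minimal sketch assuming the standard PASCAL-style intersection-over-union test, with boxes as hypothetical (x1, y1, x2, y2) tuples.

```python
# Minimal sketch of the 50%-overlap matching criterion, assuming the
# standard PASCAL intersection-over-union rule; the poster does not give
# this code. Boxes are (x1, y1, x2, y2) with x1 < x2 and y1 < y2.

def overlap_ratio(a, b):
    """Intersection area divided by union area of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def is_match(detection, ground_truth, threshold=0.5):
    """A detection is correct if it overlaps a ground-truth box by >= 50%."""
    return overlap_ratio(detection, ground_truth) >= threshold
```

Detections that match no ground-truth box count as false positives, the x-axis of the DET curves above.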

Model

• Sparse modeling (patch-based). Input $x \in \mathbb{R}^{s \times s}$, dictionary $D \in \mathbb{R}^{K \times s \times s}$, representation $z \in \mathbb{R}^{K}$; minimize over $z$:

  $\frac{1}{2}\|x - Dz\|_2^2 + \lambda \sum_i |z_i|$

• Sparse modeling (convolutional). Input $x \in \mathbb{R}^{w \times h}$, representation $z \in \mathbb{R}^{K \times (w-s+1) \times (h-s+1)}$; minimize over $z$:

  $\frac{1}{2}\Big\|x - \sum_k D_k \ast z_k\Big\|_2^2 + \lambda \sum_{k,i,j} |z_{kij}|$

• Each dictionary item $k$ is a convolutional kernel connected to feature map $k$.
• Convolutional training yields a more diverse set of filters; a code sketch of convolutional sparse inference follows below.
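The poster does not specify how the convolutional codes $z$ are inferred; the sketch below minimizes the convolutional objective above with plain ISTA (iterative soft-thresholding) in NumPy. The function name, step size, and iteration count are illustrative assumptions.

```python
import numpy as np
from scipy.signal import convolve2d, correlate2d

def soft_threshold(v, t):
    """Proximal operator of the L1 penalty: elementwise shrink toward zero."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def conv_sparse_code(x, D, lam=0.1, lr=0.01, n_iter=200):
    """Minimize 1/2 ||x - sum_k D_k * z_k||^2 + lam sum_{kij} |z_kij| over z.
    x: (w, h) image region; D: (K, s, s) dictionary kernels."""
    K, s, _ = D.shape
    w, h = x.shape
    z = np.zeros((K, w - s + 1, h - s + 1))      # one 'valid' map per kernel
    for _ in range(n_iter):
        # Reconstruction is the sum of full convolutions of maps with kernels.
        recon = sum(convolve2d(z[k], D[k], mode="full") for k in range(K))
        residual = recon - x
        # Gradient of the quadratic term w.r.t. z_k is the valid
        # cross-correlation of the residual with D_k.
        grad = np.stack([correlate2d(residual, D[k], mode="valid")
                         for k in range(K)])
        z = soft_threshold(z - lr * grad, lr * lam)   # gradient step + prox
    return z
```

A faster solver (e.g. FISTA or coordinate descent) would be preferable in practice; ISTA is simply the shortest correct illustration of the objective.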

Convolutional Predictive Sparse Decomposition

• Cumulative histogram of the angle between every pair of dictionary items:

[Figure: x-axis: angle in degrees; y-axis: number of dictionary-item pairs with cross-correlation angle greater than the given value (log scale); one curve for patch-based training, one for convolutional training.]

• A dictionary learned with convolutional training produces less redundant filters: the minimum angle between any two convolutional dictionary items is 40°, where the angle between items $i$ and $j$ is $\arccos(\lvert \max(D_i \ast D_j^T) \rvert)$. A sketch of this measure follows below.
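A hedged sketch of that redundancy measure: for each pair of kernels, take the peak absolute cross-correlation over all relative shifts and turn it into an angle. It assumes the kernels are L2-normalized; the function name is illustrative.

```python
import numpy as np
from scipy.signal import correlate2d

def min_pairwise_angle_deg(D):
    """Smallest angle (degrees) between any two kernels in D: (K, s, s),
    each assumed normalized to unit L2 norm."""
    K = D.shape[0]
    best = 90.0
    for i in range(K):
        for j in range(i + 1, K):
            # Peak absolute cross-correlation over all relative shifts,
            # i.e. abs(max(D_i * D_j^T)) in the poster's notation.
            peak = np.abs(correlate2d(D[i], D[j], mode="full")).max()
            peak = min(peak, 1.0)            # guard the arccos domain
            best = min(best, np.degrees(np.arccos(peak)))
    return best
```

Run over a convolutionally trained dictionary, this should return roughly the 40° minimum quoted above.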

Training Efficient Predictors

• Reduce the cost of inference by training a feed-forward predictor function $f(W^k \ast x)$ jointly with the dictionary:

  $\frac{1}{2}\Big\|x - \sum_k D_k \ast z_k\Big\|_2^2 + \lambda \|z\|_1 + \beta \sum_k \big\|z_k - f(W^k \ast x)\big\|_2^2$

• Even a simple tanh non-linearity, $\tilde z = g_k \times \tanh(W^k \ast x)$, produces good accuracy for recognition.
• Second-order derivative information is important (Levenberg-Marquardt).
• Better sparse predictions can be obtained by using a shrinking non-linearity:

  $\mathrm{sh}_{\beta_k, b_k}(s) = \mathrm{sign}(s)\,\Big(\tfrac{1}{\beta_k}\log\big(\exp(\beta_k b_k) + \exp(\beta_k |s|) - 1\big) - b_k\Big)$

A sketch of both encoders follows below.
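Here is a minimal sketch of the two encoders, with the trailing $-b_k$ grouped inside $\mathrm{sign}(s)$ so that $\mathrm{sh}_{\beta_k,b_k}(0) = 0$; the parameter names (`W`, `g`, `beta`, `b`) and the use of SciPy's `correlate2d` are assumptions, not the poster's code.

```python
import numpy as np
from scipy.signal import correlate2d

def smooth_shrink(s, beta, b):
    """sh_{beta,b}(s): smooth, odd approximation of soft-thresholding.
    Roughly 0 for |s| < b and sign(s) * (|s| - b) for large |s|."""
    mag = np.log(np.exp(beta * b) + np.exp(beta * np.abs(s)) - 1.0) / beta - b
    return np.sign(s) * mag

def predict_tanh(x, W, g):
    """z~_k = g_k tanh(W^k * x): per-map gain times tanh of the filter output."""
    pre = np.stack([correlate2d(x, Wk, mode="valid") for Wk in W])
    return g[:, None, None] * np.tanh(pre)

def predict_shrink(x, W, beta, b):
    """z~_k = sh_{beta_k,b_k}(W^k * x): shrinkage applied per feature map."""
    pre = np.stack([correlate2d(x, Wk, mode="valid") for Wk in W])
    return smooth_shrink(pre, beta[:, None, None], b[:, None, None])
```

The shrinkage encoder drives small responses toward zero, so it matches the sparsity of the target code more closely than tanh (compare the loss curves in the figure below).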

[Figure, left: the shrinkage non-linearity $\mathrm{sh}_{\beta,b}(s)$ plotted for $(\beta, b) = (10, 1), (3, 1), (1, 1), (10, 2)$. Right: training loss vs. iteration for the $g_k \times \tanh(W^k \ast x)$ encoder and the $\mathrm{sh}_{\beta,b}(W^k \ast x)$ encoder.]

[Diagram: two-stage architecture. Stage 1: $x$ → Filter Bank → Non-Linearity → Pooling → $z^1$. Stage 2: $z^1$ → Filter Bank → Non-Linearity → Pooling → $z^2$. Both stages receive unsupervised pre-training, followed by supervised refinement of the whole network.]

Object Recognition on Caltech 101

Average recognition accuracy (%) by training regime:

                            1 Stage         1 Stage         2 Stages        2 Stages
                            Unsupervised    Unsup.+Sup.     Unsupervised    Unsup.+Sup.
  Patch-based training      52.2            54.2            63.7            65.5
  Convolutional training    57.1            57.6            65.3            66.3

Convolutional training improves over patch-based training in every setting, and supervised fine-tuning helps both.

• Build a 2-stage model using the predictor function followed by absolute-value rectification and local contrast normalization at each stage; a pipeline sketch follows below.
• The predictor function is initialized using Convolutional Predictive Sparse Decomposition (ConvPSD).
• The complete system is fine-tuned together with a linear logistic regression classifier.
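As an illustration of one such stage, here is a hedged sketch: tanh predictor, absolute-value rectification, a per-feature-map local contrast normalization (the poster's LCN may also normalize across maps), and average pooling. The 5×5 normalization window and 2×2 pooling factor are assumed, not taken from the poster.

```python
import numpy as np
from scipy.signal import correlate2d
from scipy.ndimage import uniform_filter

def local_contrast_norm(m, size=5, eps=1e-6):
    """Subtract the local mean, then divide by the local standard deviation."""
    centered = m - uniform_filter(m, size=size)
    std = np.sqrt(uniform_filter(centered ** 2, size=size))
    return centered / np.maximum(std, eps)

def stage(x, W, g, pool=2):
    """One stage: predictor -> abs rectification -> LCN -> average pooling.
    x: (C, h, w) input maps; W: (K, C, s, s) filters; g: (K,) gains."""
    pre = np.stack([sum(correlate2d(x[c], W[k, c], mode="valid")
                        for c in range(x.shape[0]))
                    for k in range(W.shape[0])])
    z = np.abs(g[:, None, None] * np.tanh(pre))        # predictor + |.|
    z = np.stack([local_contrast_norm(m) for m in z])  # per-map LCN
    K, h, w = z.shape
    h, w = h - h % pool, w - w % pool                  # crop to pooling grid
    return z[:, :h, :w].reshape(K, h // pool, pool,
                                w // pool, pool).mean(axis=(2, 4))

# Two stages chained; a linear logistic-regression classifier (not shown)
# is then trained on the flattened z2 features:
#   z1 = stage(img[None, :, :], W1, g1)
#   z2 = stage(z1, W2, g2)
```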

• Unsupervised initialization using ConvPSD yields better accuracy than patch-based PSD.

[Figure: panels labeled "Unsupervised Training" and "Supervised Fine-Tuning".]