
What is the Best Multi-Stage Architecture for Object Recognition?

Ruiwen Wu

[1] Jarrett, Kevin, et al. "What is the best multi-stage architecture for object recognition?." Computer Vision, 2009 IEEE 12th International Conference on. IEEE, 2009.(Cited by 396 till 2014.11.12)

Usual architecture of the neural networks

Each part of the neural networks

Unsupervised learning conception

Experiment

Contribution of this paper

[Figure: Papers about Neural Networks, by year. 2010: 64,000; 2011: 75,000; 2012: 78,000; 2013: 80,000; 2014: 78,900]

[Figure: Papers about Unsupervised Pre-Training, by year, 2010 to 2014. Only the 2010 value survives: 110]

[Figure: Citations every year. 2009: 4; 2010: 32; 2011: 62; 2012: 64; 2013: 115; 2014: 113]

Deep learning methods aim at learning feature hierarchies, with features from higher levels of the hierarchy formed by the composition of lower-level features [2]

Neural Networks with many hidden layers

Graphical Models with many levels of hidden layers

Other methods

Deep Learning Methods

[2]Erhan, D., Bengio, Y., Courville, A., Manzagol, P. A., Vincent, P., & Bengio, S. Why Does Unsupervised Pre-training Help Deep Discriminant Learning?.

Usual architecture of neural networks

Non-linear Operation: Quantization, Winner-take-all, Sparsification, Normalization, S-function

Pooling Operation: Max, average, histogramming operator

Classifier: Neural Networks(NN), k-Nearest Neighbor(KNN), Support Vector Machine(SVM), Logistic Regression(LR)

This paper addresses three questions:

How do the non-linearities that follow the filter banks influence the recognition accuracy?

Does learning the filter banks in an unsupervised or supervised manner improve the performance over random filters or hardwired filters?

Is there any advantage to using an architecture with two stages of feature extraction, rather than one?

Questions to address

To address these three questions, they experimented with various combinations of architectures:

One stage or two stages of feature extraction

Different types of non-linearities

Different types of filters

Different filter learning methods(random, unsupervised and supervised)

Test Dataset: Caltech-101 dataset; NORB object dataset; MNIST dataset

Experiments Architecture

Filter Bank Layers(FCSG)

Local Contrast Normalization Layer(N)

Pooling and Subsampling Layer(PA or PM)

Model Architecture

The module computes:

Filter Bank Layer(FCSG)

$$y_j = g_j \tanh\Big(\sum_i k_{ij} * x_i\Big)$$

where * is the convolution operator, tanh is the hyperbolic tangent non-linearity, and g_j is a trainable scalar coefficient.

Output size: assume each map is n1 x n2, each kernel is l1 x l2, then the output y is (n1-l1+1) x (n2-l2+1)

The kernel here could be either supervised trained or unsupervised pre-trained
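The filter bank formula and the output-size rule can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the authors' implementation; the function names `conv2d_valid` and `filter_bank_layer`, the `kernels[i][j]` layout, and the toy values are assumptions made for this example.

```python
import math

def conv2d_valid(x, k):
    """'Valid' 2-D convolution: an n1 x n2 map and an l1 x l2 kernel
    give an (n1-l1+1) x (n2-l2+1) output."""
    n1, n2 = len(x), len(x[0])
    l1, l2 = len(k), len(k[0])
    out = [[0.0] * (n2 - l2 + 1) for _ in range(n1 - l1 + 1)]
    for r in range(n1 - l1 + 1):
        for c in range(n2 - l2 + 1):
            s = 0.0
            for p in range(l1):
                for q in range(l2):
                    # A true convolution flips the kernel in both directions.
                    s += k[l1 - 1 - p][l2 - 1 - q] * x[r + p][c + q]
            out[r][c] = s
    return out

def filter_bank_layer(x_maps, kernels, gains):
    """y_j = g_j * tanh(sum_i k_ij * x_i).
    kernels[i][j] (hypothetical layout) links input map i to output map j."""
    outputs = []
    for j, g in enumerate(gains):
        acc = None
        for i, x in enumerate(x_maps):
            conv = conv2d_valid(x, kernels[i][j])
            if acc is None:
                acc = conv
            else:
                acc = [[a + b for a, b in zip(ra, rb)]
                       for ra, rb in zip(acc, conv)]
        outputs.append([[g * math.tanh(v) for v in row] for row in acc])
    return outputs

# Toy usage: one 5x5 input map, two 3x3 kernels, two 3x3 output maps.
x_maps = [[[float(r + c) for c in range(5)] for r in range(5)]]
identity = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
box = [[0.1] * 3 for _ in range(3)]
y = filter_bank_layer(x_maps, [[identity, box]], gains=[1.0, 0.5])
print(len(y), len(y[0]), len(y[0][0]))
```

Here the kernels are fixed by hand; in the paper's setting they would be random, learned without supervision, or trained with supervision.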


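Local Contrast Normalization Layer(N)

$$v_{ijk} = x_{ijk} - \sum_{ipq} w_{pq} \cdot x_{i,j+p,k+q}$$

$$y_{ijk} = v_{ijk} / \max(c, \sigma_{jk})$$

$$\sigma_{jk} = \Big(\sum_{ipq} w_{pq} \cdot v_{i,j+p,k+q}^2\Big)^{1/2}$$

c is mean(σ_jk)

w_pq is a Gaussian weighting window

I do not quite understand this part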
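A small numerical sketch may make the normalization step easier to follow: first subtract a Gaussian-weighted local mean, then divide by the larger of the local standard deviation and its average c. The code below is only an illustration under stated assumptions: the window size, the Gaussian width, the border handling (edge pixels are replicated), and the small `eps` guard against division by zero are choices made for this example and are not taken from the slides.

```python
import math

def gaussian_window(size, sigma, n_maps):
    """Gaussian weights w_pq, scaled so the sum over maps and window is 1."""
    h = size // 2
    w = [[math.exp(-(p * p + q * q) / (2.0 * sigma * sigma))
          for q in range(-h, h + 1)] for p in range(-h, h + 1)]
    total = n_maps * sum(sum(row) for row in w)
    return [[v / total for v in row] for row in w]

def weighted_sum(maps, w, fn):
    """sum over i, p, q of w_pq * fn(maps[i][j+p][k+q]).
    Border pixels are replicated (an assumption of this sketch)."""
    rows, cols, h = len(maps[0]), len(maps[0][0]), len(w) // 2
    out = [[0.0] * cols for _ in range(rows)]
    for j in range(rows):
        for k in range(cols):
            s = 0.0
            for m in maps:
                for p in range(-h, h + 1):
                    for q in range(-h, h + 1):
                        jj = min(max(j + p, 0), rows - 1)
                        kk = min(max(k + q, 0), cols - 1)
                        s += w[p + h][q + h] * fn(m[jj][kk])
            out[j][k] = s
    return out

def local_contrast_normalize(maps, size=3, sigma=1.0, eps=1e-8):
    rows, cols = len(maps[0]), len(maps[0][0])
    w = gaussian_window(size, sigma, len(maps))
    # Subtractive step: v = x - weighted local mean.
    mean = weighted_sum(maps, w, lambda t: t)
    v = [[[m[j][k] - mean[j][k] for k in range(cols)] for j in range(rows)]
         for m in maps]
    # Divisive step: sigma_jk is the weighted local standard deviation of v.
    var = weighted_sum(v, w, lambda t: t * t)
    sd = [[math.sqrt(t) for t in row] for row in var]
    c = sum(sum(row) for row in sd) / (rows * cols)  # c = mean(sigma_jk)
    return [[[vm[j][k] / max(c, sd[j][k], eps) for k in range(cols)]
             for j in range(rows)] for vm in v]

# A vertical step edge: left half 0, right half 1.
step = [[[0.0, 0.0, 0.0, 1.0, 1.0, 1.0] for _ in range(6)]]
out = local_contrast_normalize(step)
print([round(t, 3) for t in out[0][2]])
```

On the step image the output is zero in the flat regions and non-zero only next to the edge, which is consistent with the observation that the module behaves like an edge extractor.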


The result of this module:

[Figure: image panels showing the result of this module; only the axis ticks survive]

It seems like this module is doing edge extraction

Pooling and Subsampling Layer(PA or PM)

For each of the small area:

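$$y_{ijk} = \sum_{pq} w_{pq} \cdot x_{i,j+p,k+q}$$

where w_pq is a uniform weighting window for average pooling (PA); for max pooling (PM) the weighted average is replaced by a max over the same window

Each output feature map is then subsampled spatially by a factor S horizontally and vertically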
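A short sketch of the two pooling variants follows. It assumes a square window and evaluates the window only at every S-th position, which gives the same result as pooling everywhere and then subsampling by S; the function name `pool` and the toy input are made up for illustration.

```python
def pool(x, size, stride, mode="average"):
    """Average (PA) or max (PM) pooling over size x size windows,
    subsampled by `stride` in both directions."""
    rows, cols = len(x), len(x[0])
    out = []
    for j in range(0, rows - size + 1, stride):
        row = []
        for k in range(0, cols - size + 1, stride):
            window = [x[j + p][k + q]
                      for p in range(size) for q in range(size)]
            if mode == "max":
                row.append(max(window))
            else:
                # Uniform weighting window: every weight is 1 / size^2.
                row.append(sum(window) / len(window))
        out.append(row)
    return out

x = [[1.0, 2.0, 5.0, 6.0],
     [3.0, 4.0, 7.0, 8.0],
     [0.0, 0.0, 1.0, 1.0],
     [0.0, 4.0, 1.0, 1.0]]
pa = pool(x, size=2, stride=2, mode="average")  # [[2.5, 6.5], [1.0, 1.0]]
pm = pool(x, size=2, stride=2, mode="max")      # [[4.0, 8.0], [4.0, 1.0]]
print(pa, pm)
```

The bottom-left window shows the difference between the two: a single strong response (4.0) is diluted to 1.0 by averaging but preserved by the max.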

Combine Modules

There could be three types of architectures of this network:

FCSG ---- PA

FCSG ---- N ---- PA

FCSG ---- PM

Training Protocol

Random Features and Supervised Classifier - R and RR

Unsupervised Features, Supervised Classifier - U and UU

Random Features, Global Supervised Refinement - R+ and R+R+

Unsupervised Features, Global Supervised Refinement - U+ and U+U+

Unsupervised Training of Filter Banks

For a given input X and a matrix W whose columns are the dictionary elements, the feature vector Z* is obtained by minimizing the following energy function:

$$E_{OF}(X, Z, W) = \|X - WZ\|_2^2 + \lambda \|Z\|_1$$

$$Z^* = \arg\min_Z E_{OF}(X, Z, W)$$

where λ is a sparsity hyper-parameter.

For any input X, one needs to run a rather expensive optimization algorithm to find Z*. To alleviate this problem, the PSD method is introduced.
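To see why finding Z* is expensive, it helps to write out one possible solver. The sketch below uses ISTA (iterative shrinkage-thresholding), which is a standard choice for this kind of L1-regularized least-squares problem but is an assumption here: the slides do not say which optimizer was used. The step size is a conservative bound based on the squared Frobenius norm of W, and all function names are hypothetical.

```python
def matvec(A, v):
    return [sum(a * b for a, b in zip(row, v)) for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def energy_of(X, Z, W, lam):
    """E_OF(X, Z, W) = ||X - WZ||_2^2 + lam * ||Z||_1."""
    r = [x - y for x, y in zip(X, matvec(W, Z))]
    return sum(t * t for t in r) + lam * sum(abs(z) for z in Z)

def soft_threshold(v, t):
    """Shrink each entry toward zero by t (proximal step for the L1 term)."""
    return [max(abs(a) - t, 0.0) * (1.0 if a > 0 else -1.0) for a in v]

def sparse_code_ista(X, W, lam, n_iter=200):
    """Approximate Z* = argmin_Z E_OF(X, Z, W) with ISTA (assumed solver)."""
    Wt = transpose(W)
    # 1 / (2 * ||W||_F^2) is a safe step: it never exceeds 1 / Lipschitz.
    step = 1.0 / (2.0 * sum(w * w for row in W for w in row))
    Z = [0.0] * len(W[0])
    for _ in range(n_iter):
        residual = [y - x for x, y in zip(X, matvec(W, Z))]  # WZ - X
        grad = [2.0 * g for g in matvec(Wt, residual)]
        Z = soft_threshold([z - step * g for z, g in zip(Z, grad)],
                           step * lam)
    return Z

# Toy problem with an identity dictionary, where the minimizer is known
# in closed form: each coordinate of X is shrunk toward zero by lam / 2.
W = [[1.0, 0.0], [0.0, 1.0]]
X = [3.0, 0.2]
Z_star = sparse_code_ista(X, W, lam=1.0)
print(Z_star, energy_of(X, Z_star, W, 1.0))
```

Every new input X needs its own loop of many iterations, which is the cost that a feed-forward predictor such as PSD is meant to avoid.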

Predictive Sparse Decomposition(PSD)[3]

[3] Kavukcuoglu, Koray, Marc'Aurelio Ranzato, and Yann LeCun. "Fast inference in sparse coding algorithms with applications to object recognition." arXiv preprint arXiv:1010.3467 (2010).(cited by 94)

$$E_{PSD}(X, Z, W, K) = \|X - WZ\|_2^2 + \lambda \|Z\|_1 + \|Z - C(X, K)\|_2^2$$

$$Z^* = \arg\min_Z E_{PSD}(X, Z, W, K)$$

$$K = \{G, S, D\}$$

$$C(X, K) = C(X, G, S, D) = G \tanh(SX + D)$$

where S ∈ R^{m×n} is a filter matrix and D ∈ R^m is a vector of biases
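The PSD energy and its feed-forward predictor can be evaluated directly. In this minimal sketch G is treated as a diagonal gain matrix and stored as a vector of length m; that shape is an assumption, since the slide does not state it. The function names and toy values are likewise made up for illustration.

```python
import math

def psd_encoder(X, G, S, D):
    """C(X, K) = G tanh(S X + D), with G stored as a vector of
    diagonal gains (an assumption of this sketch)."""
    pre = [sum(s * x for s, x in zip(row, X)) + d for row, d in zip(S, D)]
    return [g * math.tanh(p) for g, p in zip(G, pre)]

def energy_psd(X, Z, W, G, S, D, lam):
    """E_PSD = ||X - WZ||_2^2 + lam * ||Z||_1 + ||Z - C(X, K)||_2^2."""
    recon = [sum(w * z for w, z in zip(row, Z)) for row in W]
    pred = psd_encoder(X, G, S, D)
    return (sum((x - r) ** 2 for x, r in zip(X, recon))
            + lam * sum(abs(z) for z in Z)
            + sum((z - c) ** 2 for z, c in zip(Z, pred)))

# Toy setting with n = m = 2.
X = [1.0, 0.0]
W = [[1.0, 0.0], [0.0, 1.0]]   # dictionary, n x m
S = [[1.0, 0.0], [0.0, 1.0]]   # filter matrix, m x n
D = [0.0, 0.0]                 # biases, length m
G = [2.0, 2.0]                 # diagonal gains, length m

Z_pred = psd_encoder(X, G, S, D)  # a single feed-forward pass
print(Z_pred, energy_psd(X, Z_pred, W, G, S, D, lam=0.5))
```

Once K is trained, the code for a new input comes from one call to `psd_encoder` instead of an iterative minimization, and the third term of the energy is zero at Z = C(X, K).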

Result

Why does Unsupervised Pre-training Help Deep Discriminant Learning?[2]

Reference of the graph

[2] Erhan, D., Bengio, Y., Courville, A., Manzagol, P. A., Vincent, P., & Bengio, S. Why Does Unsupervised Pre-training Help Deep Discriminant Learning?.

[3] Hinton, G. E., Osindero, S., & Teh, Y. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18, 1527–1554.

[4] Zhu, L., Chen, Y., & Yuille, A. (2009). Unsupervised learning of probabilistic grammar-markov models for object categories. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31, 114–128.

[5] Weston, J., Ratle, F., & Collobert, R. (2008). Deep learning via semi-supervised embedding. Proceedings of the Twenty-fifth International Conference on Machine Learning (ICML’08) (pp. 1168–1175). New York, NY, USA: ACM.

[6] LeCun, Yann, et al. "Gradient-based learning applied to document recognition." Proceedings of the IEEE 86.11 (1998): 2278-2324.

[7] Hinton, G. E., Osindero, S., & Teh, Y. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18, 1527–1554.

non-convex function

In deep learning, the objective function is usually a highly non-convex function of the parameters, so there must be many local minima in the model parameter space

Supervised learning uses a fixed point or a random point as the initialization, so in many situations it converges to a local minimum

Local Minima

Random Initialization

Unsupervised Pre-training

Reason

There are a few reasonable hypotheses why pre-training might work.

One possibility is that unsupervised pre-training acts as a kind of regularizer, putting the parameter values in the appropriate range for discriminant training

Another possibility is that pre-training initializes the model to a point in parameter space that somehow renders the optimization process more effective, in the sense of achieving a lower minimum of the empirical cost function.

Conclusion

Future work should clarify this hypothesis.

Understanding and improving deep architectures remains a challenge.

This work helps with such understanding via extensive simulations and puts forward and confirms a hypothesis explaining the mechanisms behind the effect of unsupervised pre-training for the final discriminant learning task.

Reference

[1] Jarrett, Kevin, et al. "What is the best multi-stage architecture for object recognition?." Computer Vision, 2009 IEEE 12th International Conference on. IEEE, 2009. (Cited by 396 till 2014.11.12)

[2] Erhan, D., Bengio, Y., Courville, A., Manzagol, P. A., Vincent, P., & Bengio, S. Why Does Unsupervised Pre-training Help Deep Discriminant Learning?.

[3] Hinton, G. E., Osindero, S., & Teh, Y. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18, 1527–1554.

[4] Zhu, L., Chen, Y., & Yuille, A. (2009). Unsupervised learning of probabilistic grammar-markov models for object categories. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31, 114–128.

[5] Weston, J., Ratle, F., & Collobert, R. (2008). Deep learning via semi-supervised embedding. Proceedings of the Twenty-fifth International Conference on Machine Learning (ICML’08) (pp. 1168–1175). New York, NY, USA: ACM.

[6] LeCun, Yann, et al. "Gradient-based learning applied to document recognition." Proceedings of the IEEE 86.11 (1998): 2278-2324.

[7] Hinton, G. E., Osindero, S., & Teh, Y. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18, 1527–1554.

Thank You!
