Deformable Part Models are Convolutional Neural Networks
Ross Girshick, Forrest Iandola, Trevor Darrell, Jitendra Malik
Presenter: YANG Wei
January 25, 2016
Outline
1 Introduction
2 DeepPyramid DPMs
  Feature pyramid front-end CNN
  Constructing an equivalent CNN from a DPM
3 Implementation details
4 Experiments
Deformable Part Models vs. Convolutional Neural Networks
Deformable part models
Convolutional neural networks
Are DPMs and CNNs actually distinct?
DPMs: graphical models
CNNs: “black-box” non-linear classifiers
This paper shows that any DPM can be formulated as an equivalent CNN, i.e., deformable part models are convolutional neural networks.
DeepPyramid DPMs
Schematic model overview: “front-end CNN” + DPM-CNN
input: image pyramid
output: object detection scores
Feature pyramid front-end CNN
front-end CNN: AlexNet (conv1-conv5).
A CNN that maps an image pyramid to a feature pyramid
AlexNet
single-scale architecture
Constructing an equivalent CNN from a DPM
A single-component DPM.
mixture of components
component = root filter + part filters
Inference with DPMs
The matching process at one scale.
Architecture of DPM-CNN
The unrolled detection algorithm of a DPM generates a specific network with fixed length:
1 input: conv5 feature pyramid of the front-end CNN
2 generate P+1 feature maps: 1 root filter and P part filters
3 the P part feature maps are fed into a distance transform layer
4 the root feature map is stacked (channel-wise concatenated) with the transformed part feature maps
5 the resulting (P+1)-channel feature map is convolved with an object geometry filter, which produces the output DPM score map for the input pyramid level
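The five steps above can be sketched for a single pyramid level in NumPy. This is a minimal illustration, not the released implementation: the function names, the "same" zero padding, the fixed search radius of the distance transform, the zero fill at the borders, and the use of one resolution for root and parts are all assumptions made to keep the example short.

```python
import numpy as np

def correlate_same(feat, filt):
    """Cross-correlate a (C, H, W) map with a (C, h, w) filter, zero-padded ('same')."""
    _, h, w = filt.shape
    _, H, W = feat.shape
    padded = np.pad(feat, ((0, 0), (h // 2, h // 2), (w // 2, w // 2)))
    out = np.empty((H, W))
    for y in range(H):
        for x in range(W):
            out[y, x] = np.sum(padded[:, y:y + h, x:x + w] * filt)
    return out

def dt_pool(R, w, radius):
    """D(x, y) = max over (dx, dy) of R(x+dx, y+dy) - w . [dx, dy, dx^2, dy^2]."""
    H, W = R.shape
    padded = np.pad(R, radius, constant_values=-np.inf)
    D = np.full((H, W), -np.inf)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            cost = w[0] * dx + w[1] * dy + w[2] * dx * dx + w[3] * dy * dy
            shifted = padded[radius + dy:radius + dy + H, radius + dx:radius + dx + W]
            D = np.maximum(D, shifted - cost)
    return D

def shift(M, dy, dx, fill=0.0):
    """Return S with S[y, x] = M[y + dy, x + dx]; out-of-range cells get `fill`."""
    H, W = M.shape
    r = max(abs(dy), abs(dx))
    padded = np.pad(M, r, constant_values=fill)
    return padded[r + dy:r + dy + H, r + dx:r + dx + W]

def dpm_cnn_level(conv5, root_filter, part_filters, deform_weights, anchors, radius=2):
    # Step 2: P+1 filter response maps.
    root_map = correlate_same(conv5, root_filter)
    part_maps = [correlate_same(conv5, f) for f in part_filters]
    # Step 3: distance transform of each part map.
    pooled = [dt_pool(R, w, radius) for R, w in zip(part_maps, deform_weights)]
    # Step 4: channel-wise stacking, shape (P+1, H, W).
    stacked = np.stack([root_map] + pooled)
    # Step 5: sparse geometry filter, i.e. one tap per channel at the part anchor.
    score = stacked[0].copy()
    for D, (ay, ax) in zip(stacked[1:], anchors):
        score += shift(D, ay, ax)
    return score

rng = np.random.default_rng(0)
conv5 = rng.standard_normal((256, 12, 12))           # one level of the conv5 pyramid
root_filter = rng.standard_normal((256, 5, 5))
part_filters = [rng.standard_normal((256, 3, 3)) for _ in range(3)]
deform_weights = [np.array([0.0, 0.0, 0.5, 0.5])] * 3
anchors = [(-1, -1), (0, 1), (1, 0)]                  # hypothetical (dy, dx) part anchors
scores = dpm_cnn_level(conv5, root_filter, part_filters, deform_weights, anchors)
print(scores.shape)
```

Every operation here is a correlation, a max, a concatenation, or a sum, which is why the whole detector can be read as a feed-forward network.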
Architecture of DPM-CNN
CNN equivalent to a single-component DPM.
Traditional distance transform
Traditional distance transforms are defined for sets of points on a grid [FH05].
$G$: grid
$d(p - q)$: measure of distance between points $p, q \in G$
$B \subseteq G$
Then the distance transform of $B$ on $G$ is
$$D_B(p) = \min_{q \in B} d(p - q)$$
Distance transform (Euclidean distance)
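The DT can also be formulated as
$$D_B(p) = \min_{q \in G} \bigl(d(p - q) + 1_B(q)\bigr)$$
where
$$1_B(q) = \begin{cases} 0, & \text{if } q \in B, \\ \infty, & \text{if } q \notin B. \end{cases} \tag{1}$$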
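As a small check that the two formulations agree, the sketch below computes both on a 1-D grid. The grid size, the point set `B`, and the use of the absolute difference as the distance are assumptions chosen for illustration.

```python
import numpy as np

def dt_over_set(B, n):
    """D_B(p) = min over q in B of |p - q| on the 1-D grid {0, ..., n-1}."""
    return np.array([min(abs(p - q) for q in B) for p in range(n)], dtype=float)

def dt_with_indicator(B, n):
    """D_B(p) = min over all q in G of (|p - q| + 1_B(q))."""
    indicator = np.full(n, np.inf)
    indicator[list(B)] = 0.0              # 0 inside B, infinity outside
    p = np.arange(n)[:, None]
    q = np.arange(n)[None, :]
    return (np.abs(p - q) + indicator[None, :]).min(axis=1)

B = {1, 6}
d_set = dt_over_set(B, 8)
d_ind = dt_with_indicator(B, 8)
print(d_set)   # distances to the nearest point of B: 1, 0, 1, 2, 2, 1, 0, 1
```

The infinite penalty outside `B` removes those grid points from the minimum, so minimizing over all of `G` gives the same result as minimizing over `B` alone.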
Generalized distance transform
A generalization of distance transforms can be obtained by replacing the indicator function with some arbitrary function $f'$ over the grid $G$:
$$D_{f'}(p) = \min_{q \in G} \bigl(d(p - q) + f'(q)\bigr)$$
We can also define the generalized DT as a maximization by letting $f(q) = -f'(q)$:
$$D_f(p) = \max_{q \in G} \bigl(f(q) - d(p - q)\bigr)$$
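A brute-force 1-D sketch of both forms follows. The quadratic distance and the sample values of `f` are assumptions for illustration. Note that with $f = -f'$ the two results differ by a sign, since $\max_q (f(q) - d) = -\min_q (d + f'(q))$.

```python
import numpy as np

def gdt_min(f_prime, d):
    """Minimization form: min over q of d(p - q) + f'(q)."""
    n = len(f_prime)
    return np.array([min(d(p - q) + f_prime[q] for q in range(n)) for p in range(n)])

def gdt_max(f, d):
    """Maximization form: max over q of f(q) - d(p - q)."""
    n = len(f)
    return np.array([max(f[q] - d(p - q) for q in range(n)) for p in range(n)])

def quadratic(r):
    # Assumed distance measure for this example.
    return 0.5 * r * r

f = np.array([0.0, 3.0, 0.0, 0.0, 1.0])
D_max = gdt_max(f, quadratic)
D_min = gdt_min(-f, quadratic)
print(D_max)   # the peak at index 1 spreads to its neighbours with a quadratic falloff
```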
Distance transform in DPM
In DPM, after computing filter responses we transform the responses of the part filters to allow spatial uncertainty,
$$D_i(x, y) = \max_{dx, dy} \bigl(R_i(x + dx, y + dy) - w_i \cdot \phi_d(dx, dy)\bigr)$$
where
$$\phi_d(dx, dy) = [dx, dy, dx^2, dy^2]$$
The value $D_i(x, y)$ is the maximum contribution of the part to the score of a root location that places the anchor of this part at position $(x, y)$.
By letting $p = (x, y)$, $p - q = (dx, dy)$ and $d(p - q) = w_i \cdot \phi_d(p - q)$, we can see that it is exactly in the form of a distance transform.
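The transform can be evaluated directly from its definition. The sketch below is a brute-force version meant only to make the formula concrete; the function name, the `max_disp` search bound, and the toy response map are assumptions, and practical implementations typically use faster distance transform algorithms.

```python
import numpy as np

def dpm_distance_transform(R, w, max_disp):
    """D(x, y) = max over (dx, dy) of R(x+dx, y+dy) - w . [dx, dy, dx^2, dy^2]."""
    H, W = R.shape
    D = np.full((H, W), -np.inf)
    for y in range(H):
        for x in range(W):
            for dy in range(-max_disp, max_disp + 1):
                for dx in range(-max_disp, max_disp + 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < H and 0 <= xx < W:
                        cost = w[0] * dx + w[1] * dy + w[2] * dx * dx + w[3] * dy * dy
                        D[y, x] = max(D[y, x], R[yy, xx] - cost)
    return D

# Toy part response: a single strong detection at row 2, column 3.
R = np.zeros((5, 5))
R[2, 3] = 4.0
w = np.array([0.0, 0.0, 1.0, 1.0])      # purely quadratic deformation cost (assumed)
D = dpm_distance_transform(R, w, max_disp=4)
print(D)
```

Anchor positions next to the detection inherit its score minus the deformation cost, which is how a part is allowed to sit slightly away from its ideal location.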
Max pooling as distance transform
Consider max pooling of a function $f : G \to \mathbb{R}$ on a regular grid $G$. Let the window half-length be $k$; then max pooling can be defined as
$$M_f(p) = \max_{\Delta p \in \{-k, \dots, k\}} f(p + \Delta p)$$
Max pooling can be expressed equivalently as a distance transform:
$$M_f(p) = \max_{q \in G} \bigl(f(q) - d_{\max}(p - q)\bigr)$$
where
$$d_{\max}(p - q) = \begin{cases} 0, & \text{if } (p - q) \in \{-k, \dots, k\}, \\ \infty, & \text{otherwise.} \end{cases} \tag{2}$$
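The equivalence can be checked numerically in 1-D. In this sketch the pooling uses stride 1 and clips the window at the borders; both choices, along with the function names and sample values, are assumptions for illustration.

```python
import numpy as np

def max_pool_1d(f, k):
    """Stride-1 max pooling with window half-length k, clipped at the borders."""
    n = len(f)
    return np.array([f[max(0, p - k):min(n, p + k + 1)].max() for p in range(n)])

def dt_pool_1d(f, d):
    """Distance transform in max form: max over q of f(q) - d(p - q)."""
    n = len(f)
    return np.array([max(f[q] - d(p - q) for q in range(n)) for p in range(n)])

def d_max(k):
    """Zero cost inside the window, infinite cost outside (equation 2)."""
    return lambda r: 0.0 if abs(r) <= k else np.inf

f = np.array([1.0, 5.0, 2.0, 0.0, 3.0, 4.0])
pooled = max_pool_1d(f, k=1)
via_dt = dt_pool_1d(f, d_max(1))
print(pooled)
```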
Generalize max pooling to distance transform pooling
We can generalize max pooling to distance transform pooling:
unlike max pooling, the distance transform of f at p is taken over the entire domain G
rather than specifying a fixed pooling window a priori, the shape of the pooling region can be learned from the data.
The released code does not include the DT pooling layer. Please refer to [OW13] for more details.
Object geometry filters
The root convolution map and the DT pooled part convolution maps are stacked into a single feature map with P+1 channels and then convolved with a sparse object geometry filter.
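One way to picture the sparse geometry filter is as a (P+1)-channel kernel with a single nonzero tap per channel: the centre for the root, and the anchor offset for each part. The sketch below builds such a kernel and applies it by cross-correlation. The helper names, the unit tap weights, the zero padding, and the anchor offsets are hypothetical.

```python
import numpy as np

def geometry_filter(anchors, size):
    """Build a (P+1, size, size) filter with a single 1 per channel."""
    c = size // 2
    g = np.zeros((len(anchors) + 1, size, size))
    g[0, c, c] = 1.0                          # root channel: centre tap
    for i, (ay, ax) in enumerate(anchors, start=1):
        g[i, c + ay, c + ax] = 1.0            # part channel: tap at the anchor offset
    return g

def correlate_same(stack, g):
    """Cross-correlate a (C, H, W) stack with a (C, s, s) filter, zero-padded."""
    _, H, W = stack.shape
    s = g.shape[1]
    r = s // 2
    padded = np.pad(stack, ((0, 0), (r, r), (r, r)))
    out = np.empty((H, W))
    for y in range(H):
        for x in range(W):
            out[y, x] = np.sum(padded[:, y:y + s, x:x + s] * g)
    return out

# Stack of 1 root map and P = 2 DT-pooled part maps, filled with distinct values.
stack = np.arange(3 * 4 * 4, dtype=float).reshape(3, 4, 4)
anchors = [(-1, 0), (1, 1)]                   # hypothetical (dy, dx) part anchors
g = geometry_filter(anchors, size=3)
score = correlate_same(stack, g)
print(score[1, 1])   # stack[0,1,1] + stack[1,0,1] + stack[2,2,2] = 5 + 17 + 42
```

Because only P+1 of the filter's entries are nonzero, the convolution reduces to summing the root score with each part's transformed score read at its anchor.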
Combining mixture components with maxout
CNN equivalent to a multi-component DPM. A multi-component DPM-CNN is composed of one DPM-CNN per component and a maxout [GWFM+13] layer that takes a max over component DPM-CNN outputs at each location.
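The maxout step is a per-location maximum across the component score maps, which can be sketched as follows. The function name and the toy score maps are assumptions; returning the index of the winning component is an optional extra.

```python
import numpy as np

def maxout_components(component_scores):
    """Max over component score maps at each location, plus the winning component."""
    stacked = np.stack(component_scores)      # shape (C, H, W)
    return stacked.max(axis=0), stacked.argmax(axis=0)

scores_a = np.array([[1.0, 5.0], [0.0, 2.0]])   # output of component 0
scores_b = np.array([[3.0, 4.0], [0.0, 7.0]])   # output of component 1
best, which = maxout_components([scores_a, scores_b])
print(best)
print(which)
```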
Feature pyramid front-end CNN
Implementation details
pretrain on ILSVRC 2012 classification using Caffe
use conv5 as output layer
“same” convolution
  zero-pad each conv/pooling layer’s input with ⌊k/2⌋ zeros on all sides (top, bottom, left and right), where k is the layer’s kernel size
  (x, y) in the conv5 feature map has a receptive field centered on pixel (16x, 16y) in the input image
  conv5 feature maps: stride: 16; receptive field: 163×163
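The stride and receptive field figures can be reproduced from the layer geometry. The sketch below assumes the standard AlexNet kernel sizes and strides up to conv5; the layer table and helper names are not taken from the slides.

```python
# (name, kernel size, stride) for AlexNet up to conv5; assumed standard values.
LAYERS = [
    ("conv1", 11, 4), ("pool1", 3, 2),
    ("conv2", 5, 1), ("pool2", 3, 2),
    ("conv3", 3, 1), ("conv4", 3, 1), ("conv5", 3, 1),
]

def same_padding(k):
    """Zeros added on each side so that the output grid stays aligned with the input."""
    return k // 2

def stride_and_receptive_field(layers):
    """Accumulate the overall stride and receptive field size layer by layer."""
    stride, rf = 1, 1
    for _, k, s in layers:
        rf += (k - 1) * stride    # each layer widens the field by (k - 1) input steps
        stride *= s
    return stride, rf

def receptive_field_box(x, y, stride=16, rf=163):
    """Image-space box (x0, y0, x1, y1), inclusive, seen by conv5 cell (x, y)."""
    cx, cy = stride * x, stride * y
    half = rf // 2
    return (cx - half, cy - half, cx + half, cy + half)

paddings = {name: same_padding(k) for name, k, _ in LAYERS}
total_stride, rf_size = stride_and_receptive_field(LAYERS)
box = receptive_field_box(2, 3)
print(paddings)
print(total_stride, rf_size)
print(box)
```

With this padding, conv5 cell (2, 3) is centered on pixel (32, 48); boxes near the image border extend into the zero padding, hence the negative coordinates.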
Experiments
Detection average precision (%) on VOC 2007 test. Column C shows the number of components and column P shows the number of parts per component.
HOG versus conv5 feature pyramids. In contrast to HOG features, conv5 features are more part-like and scale selective. Each conv5 pyramid shows 1 of 256 feature channels. The top two rows show a HOG feature pyramid and the face channel of a conv5 pyramid on the same input image.
References
[FH05] Pedro F Felzenszwalb and Daniel P Huttenlocher, Pictorial structures for object recognition, International Journal of Computer Vision 61 (2005), no. 1, 55–79.
[GWFM+13] Ian J Goodfellow, David Warde-Farley, Mehdi Mirza, Aaron Courville, and Yoshua Bengio, Maxout networks, arXiv preprint arXiv:1302.4389 (2013).
[OW13] Wanli Ouyang and Xiaogang Wang, Joint deep learning for pedestrian detection, ICCV, IEEE, 2013, pp. 2056–2063.