panda: pose aligned networks for deep attribute modeling ning zhang1;2, manohar paluri1,...

PANDA: Pose Aligned Networks for Deep Attribute ModelingNing Zhang1;2, Manohar Paluri1, Marc’Aurelio Ranzato1, Trevor Darrell2, Lubomir Bourdev11: Facebook AI Research , 2: EECS, UC Berkeley{fnzhang, trevor}@eecs.berkeley.edu {mano, ranzato, lubomir}@fb.com

Outline

• Introduction• Related work• Pose Aligned Networks for Deep Attribute modeling • Datasets• Results• Conclusion

Introduction

• Recognizing human attributes, such as gender, age, hair style, and

clothing style, has many applications.

• The signal associated with some attributes is subtle and the image is

dominated by the effects of pose and viewpoint.

• Deep learning methods, and in particular convolutional nets

[20], have achieved very good performance on several tasks.

• Moreover, Donahue et al. [8] show that features extracted from

the deep convolutional network trained on large datasets are

generic and can help in other visual recognition problems.

• [20] Y. LeCun, B. Boser, J. Denker, D. Henderson, R. E. Howard,W. Hubbard, and L. D. Jackel. Backpropagation applied to hand-written zip code recognition. In Neural Computation,1989. 1, 2

• [8] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. DeCAF: A Deep ConvolutionalActivation Feature for Generic Visual Recognition. In arXiv:1310.1531, 2013. 1, 4, 5

Introduction

• We conjecture that available training data, even ImageNet-scale, is

presently insufficient for learning pose normalization in a CNN.

• Part-based methods have gained significant recent attention

as a method to deal with pose variation and are the state-of- the-art

method for attribute prediction today.

Introduction

Introduction

• Our method can use other parts and we show the performance

using DPM [12] as well. We demonstrate the effectiveness of

PANDA on attribute classification problems and present state-of-

the-art experimental results on three datasets

• [12] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part based models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9):1627–1645, 2010. 1, 2

Introduction

Related work• 2.1. Attribute classification

Attributes are used as an intermediate representation for

knowledge transfer in [17, 10] for object recognition tasks.

• There is also some related work in automatic attribute discovery:

Berg et al. [1] proposed automatic attribute vocabularies discovery

by mining unlabeled text and image data sampled from the web.

• [10] A. Farhadi, I. Endres, D. Hoiem, and D. Forsyth. Describing Objects by their Attributes. In CVPR, 2009. 2

• [17] C. H. Lampert, H. Nickisch, and S. Harmeling. Learning to Detect Unseen Object Classes by Between-Class Attribute

Transfer. In CVPR, 2009. 2

• [1] T. L. Berg, A. C. Berg, and J. Shih. Automatic attribute discovery and characterization from noisy web data. In ECCV,

2010. 2

Related work

• In [16], facial attributes such as gender, mouth shape, facial

expression, are learned for face verification and image

search tasks.

• A very closely related work on attribute prediction is Bourdev et

al. [4], which is a three-layer feed forward classification system.

• [16] N. Kumar, A. C. Berg, P. N. Belhumeur, and S. K. Nayar. Attribute and simile classifiers for face verification. In

ICCV, 2009. 2, 6, 7

• [4] L. Bourdev, S. Maji, and J. Malik. Describing people: A poselet-based approach to attribute classification. In ICCV,

2011. 1, 2, 4, 5, 6

Related work

• 2.2. Deep learning

• The most popular deep learning method for vision, namely the

convolutional neural network (CNN)

• Although very successful when provided very large labeled

datasets, convolutional nets usually generalize poorly on

smaller datasets

Pose Aligned Networks for Deep Attribute modeling

• We explore part-based models, specifically poselets, and deep

learning.

• Our goal is to use poselets for part localization and

incorporate these normalized parts into deep convolutional

nets in order to extract pose normalized representations.

Pose Aligned Networks for Deep Attribute modeling • Towards this goal, we leverage both the power of

convolutional nets for learning discriminative features from

data and the ability of poselets to simplify the learning task by

decomposing the objects into their canonical poses.

• Specifically, we start from poselet patches, resize them to

64x64 pixels (Figure 3), randomly jitter each patch and flip it

horizontally with probability 0.5 to improve generalization, and

train a CNN for each poselet.

Pose Aligned Networks for Deep Attribute modeling

Pose Aligned Networks for Deep Attribute modeling • The whole network is trained jointly by standard back propagation of

the error [24] and stochastic gradient descent[2] using as a loss

function the sum of the log-losses of each attribute for each training

sample.

• The details of the layers are given in Figure 2 and further implementation details can be found in [15].

• [24] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by back-propagating errors. In Nature,

1986.3

• [2] L. Bottou. Stochastic Gradient Descent Tricks. In G. Montavon,G. Orr, and K.-R. M¨uller, editors, Neural Networks:

Tricks of the Trade, volume 7700 of Lecture Notes in Computer Science, pages 421–436. Springer Berlin

Heidelberg,2012. 3• [15] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNetClassification with Deep Convolutional Neural Networks. InNIPS, 2012. 1, 3, 4

Pose Aligned Networks for Deep Attribute modeling • The overall convolutional net architecture is shown in Figure 2.

Pose Aligned Networks for Deep Attribute modeling • Based on our experiments, we find a more complex net is

needed for the whole-person region than for the part regions.

We extract deep convolutional features from the model

trained on Imagenet [15] using the open source package

provided by [8] as our deep representation of the full image

patch.

• [15] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet Classification with Deep Convolutional Neural

Networks. In NIPS, 2012. 1, 3, 4

• [8] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. DeCAF: A Deep Convolutional

Activation Feature for Generic Visual Recognition. In arXiv:1310.1531, 2013. 1, 4, 5

Datasets

• 4.1. The Berkeley Human

Attributes Dataset We tested our method on the Berkeley

Human Attributes Dataset [4]. This dataset consists of 4013

training, and 4022 test images collected from PASCAL and H3D

datasets.

• [4] L. Bourdev, S. Maji, and J. Malik. Describing people: A poselet-based approach to attribute classification. In

ICCV, 2011. 1, 2, 4, 5, 6

Datasets

• 4.2. Attributes 25K Dataset

Unfortunately the training portion of the Berkeley dataset is not

large enough for training our deep-net models (they severely

overfit when trained just on these images).

We collected an additional dataset from Facebook of 24963

people split into 8737 training, 8737 validation and 7489 test

examples.

Results

[4] L. Bourdev, S. Maji, and J. Malik. Describing people: A poselet-based approach to attribute classification. In ICCV,2011. 1, 2, 4, 5, 6 [27] N. Zhang, R. Farrell, F. Iandola, and T. Darrell. Deformable Part Descriptors for Fine-grained Recognition and Attribute Prediction. In ICCV, 2013. 1, 4, 5

Results

Conclusion

• We presented a method for attribute classification of people

that improves performance compared with previously published

methods.

• Our feature representation is generic and we achieve state-of-

the-art results on the Berkeley Attributes of People dataset

and on LFW even if we train our CNNs on a different dataset.

panda: pose aligned networks for deep attribute modeling ning zhang1;2, manohar paluri1,...

Documents

introduction slide

attribute prediction

automatic attribute

class attribute transfer

facial attributes

deep learning methods

object recognition tasks

deep convolutional network