
Unconstrained Face Recognition: Deep Learning Approaches

Chun-Ting Huang

2016/7/22, USC Multimedia Communication Lab


http://www.nytimes.com/2015/08/13/us/facial-recognition-software-moves-from-overseas-wars-to-local-police.html?_r=0

Why Face?

▪ Facial features scored the highest compatibility in a Machine Readable Travel Documents (MRTD) system


Hietmeyer, R.: Biometric identification promises fast and secure processing of airline passengers. ICAO J. 55(9), 10–11 (2000)

Outline

▪ Introduction

▪ Unconstrained face dataset

▪ Unconstrained face recognition with deep learning

▪ Papers from industry

▪ Papers from academia

▪ Discussion and conclusion


Introduction

Categorization

▪ A face recognition system operates in two modes

▪ Face verification (authentication)

▪ Face identification (recognition)

▪ Face verification

▪ One-to-one match

▪ Between a query face image and an enrollment face image

▪ Face identification

▪ One-to-many match

▪ Between a query face and multiple faces in the enrollment database
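The two modes can be sketched in a few lines. This is a minimal illustration, assuming faces are already represented as feature vectors and compared by cosine similarity; the 3-D vectors, the names `verify` and `identify`, and the threshold 0.8 are hypothetical and not from the slides.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def verify(query, enrolled, threshold=0.8):
    """One-to-one match: accept if the pair is similar enough.
    The threshold is an illustrative assumption."""
    return cosine_similarity(query, enrolled) >= threshold

def identify(query, gallery):
    """One-to-many match: return the best-scoring enrolled identity."""
    return max(gallery, key=lambda name: cosine_similarity(query, gallery[name]))

# Hypothetical 3-D features; real systems use learned descriptors.
gallery = {"alice": [1.0, 0.0, 0.0], "bob": [0.0, 1.0, 0.0]}
query = [0.9, 0.1, 0.0]
is_alice = verify(query, gallery["alice"])
best_match = identify(query, gallery)
```

Verification answers a yes/no question about one claimed identity, while identification ranks the whole enrollment database.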


Face Recognition Processing Flow


Jain, Anil K., and Stan Z. Li. Handbook of Face Recognition. Vol. 1. New York: Springer, 2011

Face Subspace


Jain, Anil K., and Stan Z. Li. Handbook of Face Recognition. Vol. 1. New York: Springer, 2011

Frontal Face Recognition


Conventional Approaches


▪ Template matching

▪ PCA: M. Turk, A. Pentland, "Eigenfaces for Recognition", Journal of Cognitive Neuroscience, Vol. 3, No. 1, Win. 1991

▪ LDA: Kamran Etemad and Rama Chellappa, "Discriminant analysis for recognition of human face images", JOSA A, 1997

▪ HOG: Dalal, Navneet, and Bill Triggs. "Histograms of oriented gradients for human detection." Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on. Vol. 1. IEEE, 2005

▪ LBP: Ahonen, Timo and Hadid, Abdenour and Pietikainen, Matti, “Face description with local binary patterns: Application to face recognition”, Pattern Analysis and Machine Intelligence, IEEE Transactions on, 2006
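As an illustration of the hand-crafted descriptors listed above, here is a minimal sketch of the basic 3x3 LBP operator and its histogram. It covers only the basic operator on a grayscale image stored as nested lists; the bit ordering and the helper names are assumptions made for this sketch.

```python
def lbp_code(img, r, c):
    """8-bit LBP code of pixel (r, c) from its 3x3 neighbourhood."""
    center = img[r][c]
    # Neighbours visited clockwise starting at the top-left corner
    # (this ordering is a convention chosen for the sketch).
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    code = 0
    for bit, (dr, dc) in enumerate(offsets):
        if img[r + dr][c + dc] >= center:
            code |= 1 << bit
    return code

def lbp_histogram(img):
    """256-bin histogram of LBP codes over all interior pixels."""
    hist = [0] * 256
    for r in range(1, len(img) - 1):
        for c in range(1, len(img[0]) - 1):
            hist[lbp_code(img, r, c)] += 1
    return hist

# Toy 3x3 grayscale patch with a single interior pixel.
patch = [[5, 9, 1],
         [4, 6, 7],
         [2, 3, 8]]
code = lbp_code(patch, 1, 1)
hist = lbp_histogram(patch)
```

A face descriptor would typically be built by computing such histograms over a grid of image regions and concatenating them.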

Frontal is NOT Enough


Facial Landmark Localization

▪ Model based approach

▪ ASM: T.F. Cootes and C.J. Taylor and D.H. Cooper and J. Graham (1995). "Active shape models - their training and application". Computer Vision and Image Understanding (61): 38–59

▪ AAM: T.F. Cootes, G. J. Edwards, and C. J. Taylor. Active appearance models. ECCV, 2:484–498, 1998

▪ Regression based approach

▪ Cascaded pose regression: P. Dollár, P. Welinder, and P. Perona. “Cascaded pose regression”. In CVPR. IEEE, 2010

▪ Explicit shape regression: X. Cao, Y. Wei, F. Wen, and J. Sun. “Face alignment by explicit shape regression”. In CVPR. IEEE, 2012


Explicit Shape Regression


[Figure: explicit shape regression on image I over stages t = 0, 1, 2, …, 10; the shape is initialized from the face detector, with an affine transform applied and then transformed back]

Unconstrained Face Dataset

Labeled Faces in the Wild

▪ Contains 13233 images

▪ Consists of 5749 people

▪ 1680 people with two or more images

▪ Proposed in ICCV 2007

▪ Photos collected from the internet

▪ Also provides aligned faces produced by three alignment methods


Gary B. Huang, Manu Ramesh, Tamara Berg, and Erik Learned-Miller. Labeled Faces in the Wild: A Database for Studying Face Recognition in Unconstrained Environments. University of Massachusetts, Amherst, Technical Report 07-49, October, 2007.


LFW: Performance (Image-Restricted)


LFW: Performance (Image-Unrestricted)


YouTube Faces Database

▪ Lior Wolf, Tal Hassner and Itay Maoz, Face Recognition in Unconstrained Videos with Matched Background Similarity. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2011

▪ 3425 videos of 1595 people


YTF: Performance (Image-Restricted)


YTF: Performance (Image-Restricted)

▪ EER - the error rate at the ROC operating point where the false positive and false negative rates are equal
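The EER can be approximated from two lists of similarity scores. This is a minimal sketch, assuming higher scores mean "same person" and that scanning the observed scores as thresholds is accurate enough; the score values are made up for illustration.

```python
def error_rates(genuine, impostor, threshold):
    """False positive and false negative rates at a similarity threshold."""
    fpr = sum(s >= threshold for s in impostor) / len(impostor)
    fnr = sum(s < threshold for s in genuine) / len(genuine)
    return fpr, fnr

def equal_error_rate(genuine, impostor):
    """Approximate EER: pick the threshold where FPR and FNR are closest
    and report the average of the two rates at that point."""
    best_gap, eer = None, None
    for t in sorted(set(genuine) | set(impostor)):
        fpr, fnr = error_rates(genuine, impostor, t)
        gap = abs(fpr - fnr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (fpr + fnr) / 2.0
    return eer

# Illustrative scores for same-identity and different-identity pairs.
genuine = [0.9, 0.8, 0.7, 0.4]
impostor = [0.6, 0.3, 0.2, 0.1]
eer = equal_error_rate(genuine, impostor)
accuracy_at_eer = 1.0 - eer  # corresponds to a "100%-EER" style figure
```

With finite score lists the two rates may never be exactly equal, which is why the sketch averages them at the closest point.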


Parkhi, Omkar M., Andrea Vedaldi, and Andrew Zisserman. "Deep face recognition." Proceedings of the British Machine Vision Conference (BMVC), 2015.

IARPA Janus Benchmark A (IJB-A)

▪ Klare et al. Pushing the Frontiers of Unconstrained Face Detection and Recognition: IARPA Janus Benchmark A, CVPR, June 2015

▪ All labeled with manual bounding box annotation with fiducial landmarks

▪ Amazon Mechanical Turk (AMT)

▪ LFW is not fully unconstrained:

▪ A commodity face detector was used to detect all faces

▪ This restricts the range of pose variation, occlusion, and illumination conditions

▪ Three landmarks: two eyes, and base of nose

▪ Geographic distribution


IJB-A Labeled Information

▪ 10-fold gallery / probe image set

▪ 17,000 images for training (333 subjects)

▪ Gallery set: 3000 images (167 subjects)

▪ Probe set: 13,700 images (include non-gallery subjects)

▪ X Y coordinates of eyes and nose base

▪ Face yaw angle (if applicable)

▪ Observation labeling: FOREHEAD_VISIBLE, EYES_VISIBLE, NOSE_MOUTH_VISIBLE, INDOOR, GENDER, SKIN_TONE (6 levels), AGE (5 levels), FACIAL_HAIR


Pose Variant


IJB-A Released Benchmark (1/29/2016)


Unconstrained Face Recognition With Deep Learning

Facebook: DeepFace

▪ DeepFace: Closing the Gap to Human-Level Performance in Face Verification

▪ Yaniv Taigman, Ming Yang, Marc'Aurelio Ranzato, Lior Wolf; The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014, pp. 1701-1708

▪ Claimed contributions

▪ Facial alignment with 3D modeling

▪ Advance LFW benchmark performance

▪ Reaching near human-performance

▪ Advance YTF benchmark performance


3D Facial Alignment

▪ Detected face provided with 6 initial fiducial points

▪ 2D-aligned crop

▪ 67 fiducial points with their Delaunay triangulation

▪ 3D shape transform

▪ Triangle visibility w.r.t. the fitted 3D-2D camera

▪ Affine warping

▪ Final frontalized crop
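One common way to realize an affine warp over a triangulated set of fiducial points is to map each point through the barycentric coordinates of its triangle. The sketch below shows this for a single triangle; it is an illustration of the general technique under that assumption, not the DeepFace implementation, and the triangle coordinates are made up.

```python
def barycentric(p, tri):
    """Barycentric coordinates of 2-D point p in triangle tri."""
    (x1, y1), (x2, y2), (x3, y3) = tri
    det = (y2 - y3) * (x1 - x3) + (x3 - x2) * (y1 - y3)
    w1 = ((y2 - y3) * (p[0] - x3) + (x3 - x2) * (p[1] - y3)) / det
    w2 = ((y3 - y1) * (p[0] - x3) + (x1 - x3) * (p[1] - y3)) / det
    return w1, w2, 1.0 - w1 - w2

def warp_point(p, src_tri, dst_tri):
    """Map p from the source triangle to the destination triangle.
    Keeping the barycentric weights fixed is exactly the affine map
    defined by the three vertex correspondences."""
    w = barycentric(p, src_tri)
    x = sum(wi * v[0] for wi, v in zip(w, dst_tri))
    y = sum(wi * v[1] for wi, v in zip(w, dst_tri))
    return x, y

# Hypothetical triangle in the detected face and its frontal target.
src = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
dst = [(0.0, 0.0), (2.0, 0.0), (0.0, 3.0)]
warped = warp_point((0.25, 0.25), src, dst)
```

Applying this per triangle of the triangulation gives a piecewise affine warp; visibility handling and image resampling are left out of the sketch.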


DeepFace Architecture


DeepFace: Performance

▪ Results on Labeled Faces in the Wild (LFW) and YouTube Faces (YTF) databases


DeepID

▪ Sun, Yi, Xiaogang Wang, and Xiaoou Tang. "Deep learning face representation from predicting 10,000 classes." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2014


60 patches

DeepID


DeepID Performance (1)


160-dimensional feature

DeepID Performance (2)


o: outside dataset

u: unrestricted protocol

r: restricted protocol

Google: FaceNet

▪ Schroff, Florian, Dmitry Kalenichenko, and James Philbin. "FaceNet: A unified embedding for face recognition and clustering." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015


FaceNet

▪ Objective - learning a Euclidean embedding per image with DNN

▪ Map the face images to a compact Euclidean space

▪ Distance in space = Face Similarity

▪ Approach – DNN with triplet loss


Triplet Loss

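▪ Embedding: $f(x) \in \mathbb{R}^d$

▪ Input images: $x_i^a$ (anchor), $x_i^p$ (positive), and $x_i^n$ (negative)

▪ $\alpha$ is a margin enforced between positive and negative pairs

▪ Corresponding loss function: $L = \sum_{i=1}^{N} \left[ \|f(x_i^a) - f(x_i^p)\|_2^2 - \|f(x_i^a) - f(x_i^n)\|_2^2 + \alpha \right]_+$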
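The triplet loss penalizes every triplet whose anchor-positive squared distance is not smaller than its anchor-negative squared distance by at least the margin. A minimal sketch on plain Python lists, assuming the embeddings are already computed; the 2-D vectors and the default margin 0.2 are illustrative.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two embeddings."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def triplet_loss(anchors, positives, negatives, alpha=0.2):
    """Sum over triplets of max(d(a, p) - d(a, n) + alpha, 0)."""
    total = 0.0
    for a, p, n in zip(anchors, positives, negatives):
        total += max(sq_dist(a, p) - sq_dist(a, n) + alpha, 0.0)
    return total

# Two toy triplets: the first already satisfies the margin,
# the second violates it and contributes to the loss.
anchors = [[0.0, 0.0], [0.0, 0.0]]
positives = [[0.0, 1.0], [0.0, 1.0]]
negatives = [[2.0, 0.0], [1.0, 0.0]]
loss = triplet_loss(anchors, positives, negatives)
```

Triplets that already satisfy the margin contribute zero, which is why the choice of triplets matters for training speed.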


Triplet Selection

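▪ To achieve fast convergence for the preceding loss function

▪ Select $x_i^p$ as $\arg\max_{x_i^p} \|f(x_i^a) - f(x_i^p)\|_2^2$ (hard positive)

▪ Select $x_i^n$ as $\arg\min_{x_i^n} \|f(x_i^a) - f(x_i^n)\|_2^2$ (hard negative)

▪ The training set is sampled with

▪ 40 faces per identity in each mini-batch as positive exemplars

▪ Randomly sampled negative faces are added

▪ To avoid converging to bad local minima

▪ Select negatives with $\|f(x_i^a) - f(x_i^p)\|_2^2 < \|f(x_i^a) - f(x_i^n)\|_2^2$ (semi-hard)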
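Semi-hard negative selection can be sketched as a filter over candidate negatives in a mini-batch. The sketch assumes a semi-hard negative is one that is farther from the anchor than the positive but still within the margin, and it picks the closest such candidate; that tie-breaking rule and the function name are assumptions for illustration.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two embeddings."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def select_semi_hard_negative(anchor, positive, candidates, alpha=0.2):
    """Return the closest negative that is farther than the positive
    but still inside the margin, or None if no such candidate exists."""
    d_ap = sq_dist(anchor, positive)
    semi_hard = [n for n in candidates
                 if d_ap < sq_dist(anchor, n) < d_ap + alpha]
    if not semi_hard:
        return None
    return min(semi_hard, key=lambda n: sq_dist(anchor, n))

anchor = [0.0, 0.0]
positive = [1.0, 0.0]
candidates = [
    [0.5, 0.0],   # closer than the positive: a hard negative, skipped
    [1.05, 0.0],  # slightly farther than the positive: semi-hard
    [3.0, 0.0],   # far beyond the margin: contributes no loss
]
chosen = select_semi_hard_negative(anchor, positive, candidates)
```

Skipping the hardest negatives is what guards against the bad local minima mentioned above.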


Deep Convolutional Networks

▪ CNN is trained using Stochastic Gradient Descent (SGD) with standard backpropagation

▪ Two types of architectures

▪ Zeiler&Fergus architecture

▪ GoogLeNet style Inception model

▪ Trained on a CPU cluster for 1000 to 2000 hours

▪ 100M-200M training face thumbnails covering about 8M identities

▪ Input sizes range from 96x96 to 224x224 pixels


Zeiler, Matthew D., and Rob Fergus. "Visualizing and understanding convolutional networks." Computer Vision–ECCV 2014. Springer International Publishing, 2014. pp. 818-833.

Szegedy, Christian, et al. "Going deeper with convolutions." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015.

Network Details


Performance

▪ Validation rate VAL (true accepts / same-identity pairs) on a 1M hold-out test set

▪ VAL as a function of the output (embedding) dimension
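The validation rate can be computed directly from pairwise embedding distances. A minimal sketch, assuming a pair is accepted when its distance is at most a threshold d; the companion false accept rate (FAR) is included as the analogous ratio over different-identity pairs, and all distances shown are made up.

```python
def val_far(same_dists, diff_dists, d):
    """VAL = true accepts / same-identity pairs,
    FAR = false accepts / different-identity pairs, at threshold d."""
    val = sum(x <= d for x in same_dists) / len(same_dists)
    far = sum(x <= d for x in diff_dists) / len(diff_dists)
    return val, far

# Illustrative distances between embeddings of face pairs.
same_dists = [0.2, 0.5, 0.9, 1.4]   # pairs of the same identity
diff_dists = [0.8, 1.5, 1.9, 2.2]   # pairs of different identities
val, far = val_far(same_dists, diff_dists, d=1.0)
```

Reporting VAL at a fixed FAR amounts to choosing d so that the second ratio hits the target value.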


Sensitivity to Image Quality


Deep Face Recognition

▪ Parkhi, Omkar M., Andrea Vedaldi, and Andrew Zisserman. "Deep face recognition." Proceedings of the British Machine Vision Conference (BMVC), 2015.

▪ Achieved similar performance on the LFW and YTF datasets

▪ With fewer training images and identities

▪ 2.6M images collected from Google Images and Bing with the keyword “actor”

▪ Same triplet loss strategy as FaceNet


Fine-tuned with VGG Model

▪ The “Very Deep” Architecture

▪ Different from previous architectures proposed

▪ Network Details:

▪ 3 x 3 Convolution Kernels (Very small)

▪ Conv. Stride 1 px.

▪ ReLU non-linearity

▪ No local contrast normalisation

▪ 3 Fully connected layers
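A short calculation shows why very small 3 x 3 kernels with stride 1 are attractive: stacking them grows the receptive field while using fewer weights than one large kernel. The sketch below is plain arithmetic; the channel count 64 and the comparison against a single 7 x 7 layer are illustrative choices.

```python
def receptive_field(layers):
    """Receptive field of one output unit for a stack of layers,
    each given as a (kernel_size, stride) pair."""
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf

def conv_weights(kernel, channels):
    """Weights in one conv layer with equal input and output channels
    (biases ignored)."""
    return kernel * kernel * channels * channels

two_convs = receptive_field([(3, 1), (3, 1)])            # covers 5 x 5
three_convs = receptive_field([(3, 1), (3, 1), (3, 1)])  # covers 7 x 7

# Three stacked 3 x 3 layers versus a single 7 x 7 layer, 64 channels.
stacked = 3 * conv_weights(3, 64)
single = conv_weights(7, 64)
```

The stacked version also applies a non-linearity after every layer, which a single large kernel cannot do.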


[Figure: network diagram with layers image, Conv-64 (×2), Conv-128 (×2), Conv-256 (×2), Conv-512 (×6), maxpool (×5), fc-4096 (×2), fc-2622, Softmax]

Simonyan, Karen, and Andrew Zisserman. "Very deep convolutional networks for large-scale image recognition." arXiv preprint arXiv:1409.1556 (2014).

Training

• MatConvNet Toolbox

• Nvidia CuDNN bindings

• Multi GPU Training (approx 3.5x speedup)

• Nvidia Titan Black

• 7 days of training

• Stochastic Gradient Descent with back prop.

• Accumulator Descent for large batch sizes

• Batch Size: 256

• Incremental FC layer training

• 2622 way multi class criterion (soft max)
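The 2622-way softmax criterion in the last bullet can be written as a cross-entropy over class scores. A minimal sketch for a single training example, assuming the network outputs one raw score (logit) per identity; the log-sum-exp shift is a standard numerical-stability trick added here.

```python
import math

def softmax_cross_entropy(logits, label):
    """Negative log-probability of the true class under a softmax."""
    m = max(logits)  # shift by the maximum to avoid overflow in exp
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]

NUM_IDENTITIES = 2622

# An untrained network that scores every identity equally is maximally
# uncertain, so its loss equals log(2622).
uniform_logits = [0.0] * NUM_IDENTITIES
initial_loss = softmax_cross_entropy(uniform_logits, label=0)

# Raising the score of the true identity lowers the loss.
confident_logits = [0.0] * NUM_IDENTITIES
confident_logits[0] = 10.0
trained_loss = softmax_cross_entropy(confident_logits, label=0)
```

Stochastic gradient descent would minimize the average of this quantity over each batch.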



Vedaldi, Andrea, and Karel Lenc. "MatConvNet: Convolutional neural networks for MATLAB." Proceedings of the 23rd ACM International Conference on Multimedia. ACM, 2015.

Performance on LFW


No.  Method               # Training Images  # Networks  Accuracy
1    Fisher Vector Faces  -                  -           93.10
2    DeepFace             4 M                3           97.35
3    DeepFace Fusion      500 M              5           98.37
4    DeepID-2,3           Full               200         99.47
5    FaceNet              200 M              1           98.87
6    FaceNet + Alignment  200 M              1           99.63
7    VGG Face             2.6 M              1           98.95

Performance on YTF


No.  Method                     # Training Images  # Networks  100%-EER  Accuracy
1    Video Fisher Vector Faces  -                  -           87.7      93.10
2    DeepFace                   4 M                1           91.4      91.4
4    DeepID-2,2+,3                                 200         -         93.2
5    FaceNet + Alignment        200 M              1           -         95.1
7    VGG Face                   2.6 M              1           97.4      97.3

Lightened CNN

▪ Wu, Xiang, Ran He, and Zhenan Sun. "A Lightened CNN for Deep Face Representation." arXiv preprint arXiv:1511.02683 (2015).

▪ Obtains performance competitive with previous models

▪ Composed of two networks

▪ New activation function: Max-Feature-Map (MFM) to replace ReLU
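The Max-Feature-Map idea can be sketched as an element-wise maximum between two halves of the feature maps. This is a minimal illustration, assuming 2n input channels where channel k is paired with channel k + n and each channel is stored as a flat list of activations; the function name and the toy values are hypothetical.

```python
def max_feature_map(channels):
    """Reduce 2n feature maps to n by taking the element-wise maximum
    of channel k and channel k + n."""
    if len(channels) % 2 != 0:
        raise ValueError("MFM needs an even number of channels")
    n = len(channels) // 2
    return [[max(a, b) for a, b in zip(channels[k], channels[k + n])]
            for k in range(n)]

# Four toy channels with two activations each.
channels = [[1.0, -2.0],
            [0.0, 5.0],
            [3.0, -4.0],
            [-1.0, 2.0]]
out = max_feature_map(channels)
```

Unlike ReLU, which thresholds each activation at zero, this keeps the stronger of two competing responses and halves the number of channels.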


Max-Feature-Map


Performance

▪ On LFW:

▪ On YTF:


Deep Learning Applications Other than Recognition

Incorrect Alignment


Liu, Ziwei, et al. "Deep learning face attributes in the wild." Proceedings of the IEEE International Conference on Computer Vision. 2015.

Deep Learning Face Attributes


Details of the Networks

▪ Applied AlexNet directly for LNet

▪ Pre-trained with ImageNet 1000 object categories

▪ Fine-tuning LNet using attribute tags


Face Localization Performance (LNet)


Face localization performance (LNet)


Face Attributes Visualization


Attribute Accuracy


Discussion and Conclusion

LFW Survey

▪ Labeled Faces in the Wild: A Survey: Erik Learned-Miller, Gary Huang, Aruni RoyChowdhury, Haoxiang Li, Gang Hua

▪ The future of face recognition

▪ Verification versus identification

▪ Not uncommon that two random individuals have large differences in appearance

▪ The more people in a gallery, the greater the chance that two individuals have similar appearance

▪ New face dataset

▪ IJB-A

▪ CASIA

▪ FaceScrub

▪ MegaFace
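The verification-versus-identification point above, that a larger gallery raises the chance of a look-alike, can be made concrete with a small calculation. The sketch assumes each comparison of a probe against a gallery impostor is an independent trial with the same false accept rate; that independence and the rate 0.001 are simplifying assumptions.

```python
def false_match_probability(far, gallery_size):
    """Probability that at least one of gallery_size independent
    impostor comparisons is falsely accepted."""
    return 1.0 - (1.0 - far) ** gallery_size

far = 0.001  # illustrative per-comparison false accept rate
p_single = false_match_probability(far, 1)          # one-to-one verification
p_thousand = false_match_probability(far, 1000)     # gallery of 1,000 faces
p_million = false_match_probability(far, 1000000)   # gallery of 1,000,000 faces
```

A system that looks excellent on one-to-one verification can therefore still produce frequent false matches in one-to-many identification.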


Discussion

▪ Unconstrained face recognition is a competitive field

▪ Target dataset: IJB-A

▪ Testing different approaches (with source code / trained models)

▪ Working on checking the effectiveness of lightened CNN

▪ Facial attributes may serve an auxiliary purpose


Large-scale CelebFaces Attributes (CelebA) Dataset

▪ S. Yang, P. Luo, C. C. Loy, and X. Tang, "From Facial Parts Responses to Face Detection: A Deep Learning Approach", in IEEE International Conference on Computer Vision (ICCV), 2015

▪ 10,177 identities

▪ 202,599 face images

▪ 5 landmark locations and 40 binary attribute annotations per image

▪ Available for download

▪ 1.34 GB for 202,599 aligned and cropped face images

▪ Aligned by a similarity transformation based on the two eye locations and resized to 218×178

▪ 9.8 GB for 202,599 original web face images
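A similarity transformation (rotation, uniform scale, translation) is fully determined by two point correspondences, such as the two eye locations. The sketch below derives it with complex arithmetic; the eye coordinates and target positions are hypothetical and are not the values used to build the dataset.

```python
def similarity_from_eyes(src_left, src_right, dst_left, dst_right):
    """Return (a, b) such that z -> a * z + b maps the source eyes
    onto the target eyes, treating 2-D points as complex numbers."""
    s1, s2 = complex(*src_left), complex(*src_right)
    d1, d2 = complex(*dst_left), complex(*dst_right)
    a = (d2 - d1) / (s2 - s1)  # rotation and uniform scale
    b = d1 - a * s1            # translation
    return a, b

def apply_similarity(a, b, point):
    """Apply the transform to a 2-D point."""
    z = a * complex(*point) + b
    return z.real, z.imag

# Hypothetical detected eyes and hypothetical canonical eye positions.
a, b = similarity_from_eyes((10.0, 10.0), (30.0, 10.0),
                            (70.0, 110.0), (110.0, 110.0))
mapped = apply_similarity(a, b, (20.0, 20.0))
```

Warping every pixel with this transform and cropping to a fixed size yields aligned faces of identical dimensions.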


Large-scale CelebFaces Attributes (CelebA) Dataset


Deep Face Dreams


Representative Image Neuron Inversion

Mahendran, Aravindh, and Andrea Vedaldi. "Understanding deep image representations by inverting them." Computer Vision and Pattern Recognition (CVPR), 2015

Deep Face Dreams


Representative Image Neuron Inversion

Deep Face Dreams


Representative Image Neuron Inversion


Questions?