Lecture 10, Ming Yang: Face Recognition Systems
TRANSCRIPT
• Security and access control
– ATM, buildings, airports/border control, smartphones
• Law enforcement and criminal justice
– Mug-shot systems, post-event analysis, forensics, missing children
• General identity verification (smart cards)
– Driver licenses, passports, voter registration, welfare fraud
• Advanced video surveillance (CCTV control)
• Entertainment and multimedia HCI
– Smart TV, context-aware systems, VIP customers
– Video indexing and celebrity search
– Refining search-engine results
– Photo tagging in social media
• Face verification (1:1)
• Face identification (1:N)
• Face search/retrieval
• Still image
• Video sequence
• 3D/depth sensing or infrared imagery
• Facial key-point / landmark detection
– Face editing, automatic face beautification of selfie photos
• Face pose estimation
– Amazon Fire Phone, rendering 3D scenes w.r.t. the viewer’s head pose
• Face/eye/gaze tracking
– Driver fatigue detection, HCI, sports video analysis
• Facial attribute recognition
– Gender/age/ethnicity estimation, business intelligence
• Facial expression recognition
– Sentiment/emotion analysis, perception of advertising
• Face liveness detection
• Face hallucination or super-resolution
• Face synthesis (3D morphable models)
• Faces belong to a single category
– Subtle intrapersonal and interpersonal variations
• Intrinsic factors
– Aging, facial expressions, hair styles, accessories, makeup, etc.
• Extrinsic factors
– Illumination, pose, resolution, scaling, noise, occlusion, etc.
Boston Marathon bombing suspects in surveillance footage, 4/15/2013
property      constrained      unconstrained
resolution    ~2000×2000       ~50×50
viewpoint     fully frontal    rotated, loose
illumination  controlled       arbitrary
occlusion     disallowed       allowed
• W.W. Bledsoe, “Man-machine facial recognition”, Tech. Report PRI 22, Panoramic Research Inc., 1966
• T. Kanade, “Picture processing system by computer complex and recognition of human faces”, Doctoral Dissertation, Kyoto University, 1973
• Almost all successful algorithms in pattern recognition / machine learning / computer vision have been applied to face recognition!
• Feature-based structural matching approaches
– Kanade, 1973: 16 geometrical facial parameters
– Brunelli and Poggio, 1993: 35 geometric features
– Wiskott et al., 1997, Elastic bunch graph matching (EBGM)
– …
– Chen et al., Blessing of dimensionality, High-dim LBP, CVPR 2013
• Appearance-based holistic approaches
– Turk and Pentland, 1991, Eigenfaces (PCA)
– Belhumeur et al., 1997, Fisherfaces (LDA)
– He and Yan et al., 2003, Laplacianfaces (LPP)
– Wright, et al., 2009, Sparse representation (SRC)
– …
– Taigman and Yang, et al., DeepFace, CVPR 2014
• Hybrid approaches
Face: key-points
Face recognition by elastic bunch graph matching, Wiskott, et al., TPAMI 1997
Face recognition: features versus templates, Brunelli and Poggio, TPAMI 1993
Feature based face recognition using mixture-distance, Cox, et al., CVPR 1996
35 geometric features
Face: A graph of key-points with a bunch of Gabor features
• LBP: 59-bin uniform LBP histogram for the (8,1) circular neighborhood
Face recognition with local binary patterns, Ahonen, et al., ECCV 2004
• Face: spatially enhanced histogram of LBP descriptors
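The uniform-LBP descriptor above can be sketched in NumPy; this minimal version computes the 59-bin histogram for one image region, approximating the (8,1) circular neighborhood by the 3×3 ring (the function name and per-histogram normalization are my choices):

```python
import numpy as np

def uniform_lbp_histogram(img):
    """59-bin uniform LBP histogram for the (8,1) neighborhood.

    A pattern is 'uniform' if its circular bit string has at most two
    0/1 transitions; the 58 uniform patterns get their own bins and
    all non-uniform patterns share one extra bin (58 + 1 = 59).
    """
    img = np.asarray(img, dtype=np.int32)
    c = img[1:-1, 1:-1]
    # 8 neighbors, approximating the radius-1 circle by the 3x3 ring
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    code = np.zeros_like(c)
    for bit, (dy, dx) in enumerate(offsets):
        nb = img[1 + dy:img.shape[0] - 1 + dy, 1 + dx:img.shape[1] - 1 + dx]
        code += (nb >= c).astype(np.int32) << bit
    # map each of the 256 raw codes to one of the 59 bins
    lut = np.full(256, 58, dtype=np.int32)  # non-uniform -> shared bin 58
    next_bin = 0
    for v in range(256):
        bits = [(v >> i) & 1 for i in range(8)]
        transitions = sum(bits[i] != bits[(i + 1) % 8] for i in range(8))
        if transitions <= 2:
            lut[v] = next_bin
            next_bin += 1
    hist = np.bincount(lut[code].ravel(), minlength=59)
    return hist / hist.sum()
```

Ahonen et al. then divide the face into a grid of cells and concatenate the per-cell histograms into the "spatially enhanced" descriptor.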
Face: a 2D array of intensities → projections on Eigenfaces
Face recognition using Eigenfaces, Turk and Pentland, CVPR 1991.
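The Eigenfaces representation reduces to PCA on vectorized face images; a minimal sketch (names are mine, and the SVD route avoids forming the huge n_pixels × n_pixels covariance matrix):

```python
import numpy as np

def eigenfaces(X, k):
    """PCA on vectorized faces X (n_samples x n_pixels): returns the
    mean face and the top-k principal directions ('eigenfaces')."""
    mean = X.mean(axis=0)
    # rows of Vt are the principal directions of the centered data
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:k]

def project(face, mean, U):
    """A face is then represented by k coefficients, not raw pixels."""
    return (face - mean) @ U.T
```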
Eigenfaces vs. Fisherfaces: recognition using class specific linear projection,
Belhumeur, Hespanha, Kriegman,TPAMI 1997.
Face: projections on Fisherfaces (LDA on scatter matrices)
“the variations between the images of the same face due to lighting are almost always
larger than image variations due to a change in face identity”
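That observation is exactly what Fisherfaces addresses: choose projections that maximize between-class scatter relative to within-class scatter. A minimal LDA sketch (names are mine; in practice PCA is applied first so that the within-class scatter matrix is non-singular, since n_samples is usually far below n_pixels):

```python
import numpy as np

def fisherfaces(X, y, k):
    """LDA: directions maximizing between-class scatter S_b relative
    to within-class scatter S_w (generalized eigenproblem
    S_b w = lambda S_w w, solved here via pinv(S_w) @ S_b)."""
    mean = X.mean(axis=0)
    d = X.shape[1]
    Sw = np.zeros((d, d))
    Sb = np.zeros((d, d))
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)
        diff = (mc - mean)[:, None]
        Sb += len(Xc) * (diff @ diff.T)
    evals, evecs = np.linalg.eig(np.linalg.pinv(Sw) @ Sb)
    order = np.argsort(-evals.real)[:k]
    return evecs[:, order].real.T  # k x d projection matrix
```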
• Isometric Feature Mapping (ISOMAP)
• Locally Linear Embedding (LLE)
• Locality Preserving Projection (LPP)
• Other metric learning methods …
Face recognition using Laplacianfaces, He, Yan, Hu, Niyogi, Zhang, TPAMI 2005.
Face: projections on Laplacianfaces, a linear embedding
preserving local manifold structures of an adjacency graph.
Laplacian L=D-S of the
nearest neighbor graph
Low-dimensional embedding
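The graph construction at the core of LPP can be sketched as follows (the neighborhood size and heat-kernel weighting are assumed choices); LPP then finds the linear embedding by solving the generalized eigenproblem X^T L X a = λ X^T D X a for the smallest eigenvalues:

```python
import numpy as np

def graph_laplacian(X, n_neighbors=3):
    """Laplacian L = D - S of a nearest-neighbor similarity graph over
    the data points, the matrix Laplacianfaces/LPP builds from (heat-
    kernel weights with t = 1 are an assumed choice)."""
    n = X.shape[0]
    # pairwise squared distances
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    S = np.zeros((n, n))
    for i in range(n):
        for j in np.argsort(d2[i])[1:n_neighbors + 1]:  # skip self
            S[i, j] = np.exp(-d2[i, j])
    S = np.maximum(S, S.T)  # symmetrize the kNN graph
    D = np.diag(S.sum(axis=1))
    return D - S
```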
Robust face recognition via sparse representation, Wright, et al., TPAMI 2009.
Face: a 2D array of intensities → a sparse representation in a sufficiently large feature space
Eigenfaces, Laplacianfaces,
downsampled, random projections
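SRC's decision rule can be sketched as below. Note that Wright et al. solve an ℓ1-minimization (min ‖x‖₁ s.t. Dx = y); here a greedy orthogonal matching pursuit stands in for the ℓ1 solver to keep the example dependency-free:

```python
import numpy as np

def src_classify(D, labels, y, n_nonzero=5):
    """Sparse-representation classification (SRC) sketch.

    D: dictionary whose columns are unit-normalized training faces,
    labels: class of each column, y: probe face. The probe is assigned
    to the class whose atoms reconstruct it with the lowest residual.
    """
    residual = y.astype(float).copy()
    support = []
    x_s = np.zeros(0)
    for _ in range(n_nonzero):
        if np.linalg.norm(residual) < 1e-10:
            break
        # greedily pick the atom most correlated with the residual
        support.append(int(np.argmax(np.abs(D.T @ residual))))
        x_s, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        residual = y - D[:, support] @ x_s
    x = np.zeros(D.shape[1])
    x[support] = x_s
    # class-wise reconstruction residuals
    errs = {c: np.linalg.norm(y - D[:, labels == c] @ x[labels == c])
            for c in np.unique(labels)}
    return min(errs, key=errs.get)
```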
• Integral images
• Fast Haar features
• Cascaded boosting classifier
• Bootstrapping hard negatives
• Sliding window search on image pyramid
• Non-maximum suppression
Rapid object detection using a boosted cascade of simple features, Viola&Jones, CVPR01
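The first two ingredients can be sketched directly; the two-rectangle feature below is one of the several Haar layouts Viola and Jones use (function names are mine):

```python
import numpy as np

def integral_image(img):
    """Summed-area table with a zero row/column prepended:
    ii[y, x] = sum of img[:y, :x]."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1))
    ii[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)
    return ii

def box_sum(ii, y0, x0, y1, x1):
    """Sum of img[y0:y1, x0:x1] from 4 lookups: the O(1) primitive
    that makes Haar features fast at any scale."""
    return ii[y1, x1] - ii[y0, x1] - ii[y1, x0] + ii[y0, x0]

def haar_two_rect_v(ii, y0, x0, h, w):
    """Vertical two-rectangle Haar feature: top box minus bottom box."""
    return (box_sum(ii, y0, x0, y0 + h, x0 + w)
            - box_sum(ii, y0 + h, x0, y0 + 2 * h, x0 + w))
```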
• In-plane/out-of-plane rotations (roll, yaw, pitch)
High-performance rotation invariant multiview face detection, Huang, et al., PAMI 2007
Non-face
• Width-first-search tree: multi-class vector boosting
• Pixel features in a granular space
High-performance rotation invariant multiview face detection, Huang, et al., PAMI 2007
• Weak classifier: from stump functions to piece-wise functions
• 30K frontal, 25K half profile, 20K full profile faces
• Define and label facial landmarks
• Build landmark detectors – templates, SVM/regression
• Constrain shape variations
– Point distribution models for all landmarks
• EM-like algorithm
– M-step: find the optimal location for each landmark individually
– E-step: smooth the shape using the point distribution model
Active shape models: their training and application, Cootes, CVIU 1995
Active appearance models, Cootes, et al., ECCV 1998.
• Efficient joint optimization
• Efficient landmark detector / regressor
• New modeling of local appearances and global shape
Face alignment at 3000 FPS via regressing local binary features, Ren, Cao, Wei, Sun, CVPR 2014
Face alignment through subspace constrained mean-shifts, Saragih, et al. ICCV 2009
Face alignment by explicit shape regression, Cao, et al., CVPR 2012
Face detection, pose estimation, and landmark localization in the wild, Zhu and
Ramanan, CVPR 2012
Deep convolutional network cascade for facial point detection, Sun, et al., CVPR 2013
Detecting and aligning faces by image retrieval, Shen, et al., CVPR 2013
Head pose estimation in computer vision: A survey, Murphy-Chutorian and Trivedi,
TPAMI 2009
• CMU Multi-PIE
– 750,000+ faces of 337 individuals
– 9 view points, 19 lighting conditions
– Facial expressions in 4 sessions
• Extended Yale B dataset
– 21,800+ faces of 38 individuals
– 9 poses, 64 lighting conditions
• CAS-PEAL face dataset
– 30,900 faces of 1,040 individuals
– Pose, lighting, accessories, etc.
• MORPH dataset
– 55,000 faces of 13,000 individuals
– Mug-shot images, biased distribution
• FERET (Face Recognition Technology), 1993-1997
– 316 individuals (1993), growing to 14,051 images of 1,199 individuals (2000)
• FRVT (Face Recognition Vendor Test) 2000, 2002, 2006
• FRVT 2010 (still image track of MBE 2010)
• FRVT 2013 (completed 5/2014)
– 1.6 million faces
– frontal faces with ambient lighting
Organized by National Institute of Standards and Technology (NIST), USA
J. Phillips, FRVT 2010 Report by NIST
FRR = 0.3% at FAR = 0.1%: the error rate has dropped by three orders of magnitude in 20 years!
Test 1: 9,240 true mates vs. 10K imposters
Test 2: 12×3000×2 = 72K genuine scores vs. 18M imposter scores
• A large gallery: 1.6 million faces
• Probe set: 171K mug-shots and 10.7K webcam images
• Evaluation metric: rank-1 miss rate
Vendor     Mug-shot   Webcam
NEC        4.1%       11.3%
Morpho     9.1%       36.4%
Toshiba    10.7%      23.7%
Cognitec   13.6%      57.6%
3M         17.2%      36.4%
Neurotech  20.5%      66.9%
FRVT 2013 Report from NIST
• A large gallery: 1.6 million faces
• Probe set: 40K mated and 40K non-mated searches
• Evaluation metrics
– Rank-1 miss rate
– FNIR (false negative identification rate) at FPIR (false positive identification rate) = 0.002
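These metrics can be made concrete with a small sketch. The thresholding policy is a simplification of the NIST open-set protocol, and the assumption that a mated probe's best-scoring gallery entry is its true mate is mine:

```python
import numpy as np

def rank1_miss_rate(scores, gallery_ids, probe_ids):
    """Fraction of mated probes whose top-scoring gallery entry is not
    their true identity. scores: probes x gallery similarity matrix."""
    top = gallery_ids[np.argmax(scores, axis=1)]
    return float(np.mean(top != probe_ids))

def fnir_at_fpir(mated_best, nonmated_best, fpir=0.002):
    """Open-set identification sketch: choose the score threshold so
    the fraction of non-mated searches returning a candidate equals
    the target FPIR, then report the fraction of mated searches whose
    best (assumed true-mate) score falls below it: the FNIR."""
    thr = np.quantile(nonmated_best, 1.0 - fpir)
    return float(np.mean(mated_best < thr))
```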
Vendor     FRVT 2013  FRVT 2010  FNIR (FPIR = 0.2%)
NEC        6.4%       8.9%       10.8%
Morpho     12.1%      13.5%      19.4%
Cognitec   17.0%      18.7%      34.2%
Neurotech  23.1%      25.8%      68.4%
FRVT 2013 Report from NIST
• Gallery images: 1 million mug-shot + 6 web images
• Probe images: 5 faces
• Ranked retrieval, without or with demographic filtering
A case study of automated face recognition: the Boston Marathon bombing suspects, J.
C. Klontz and A.K. Jain, IEEE Computer, 2013
• What is the state-of-the-art TPR (true positive rate) at FAR (false alarm rate) = 0.1% for constrained face verification in FRVT 2010?
– (a) 99.7%
– (b) 97.5%
– (c) 95.9%
• What is the state-of-the-art rank-1 accuracy on probe-gallery search among 1.6 million faces for constrained face identification in FRVT 2013?
– (a) 99.7%
– (b) 97.5%
– (c) 95.9%
• Data collection
– 13,233 web photos of 5,749 celebrities
– 6,000 face pairs in 10 splits
• Metric: mean recognition accuracy over 10 folds
– Restricted protocol: only same/not-same labels
– Unrestricted protocol: face identities, additional training pairs
– Unsupervised setting: no training whatsoever on LFW images
Labeled faces in the wild: A database for studying face recognition in
unconstrained environments, Huang, Jain, Learned-Miller, ECCVW, 2008
• User study on Mechanical Turk
– 10 different workers per face pair
– Average human performance
– Original images, tight crops, inverse crops
Attribute and simile classifiers for face verification, Kumar, et al., ICCV 2009
[Chart: LFW verification accuracy. Human performance (Mechanical Turk): 99.20% on original images, 97.53% on tight crops, 94.27% on inverse crops. Algorithm accuracy by year: 60.02%, 73.93%, 78.47%, 85.54%, 88.00%, 92.58%, 95.17%, 96.33%, 97.53%. Reduction of error w.r.t. human by year: 19.24%, 20.52%, 37.08%, 37.09%, 48.06%, 49.15%, 52.32%.]
• Accurate dense facial landmarks (27 points)
• Concatenated multi-scale descriptors
– ~100K-dim LBP, SIFT, Gabor, etc.
• Transfer learning: Joint Bayesian
• WDRef dataset
– 99,773 images of 2,995 individuals
– 95.17% → 96.33% on LFW (unrestricted protocol)
Face alignment by explicit shape regression, Cao, et al., CVPR 2012
Bayesian face revisited: A joint formulation, Chen, et al., ECCV 2012
Blessing of dimensionality: High-dimensional feature and its efficient compression for
face verification, Chen, et al., CVPR 2013
A practical transfer learning algorithm for face verification, Cao, et al., ICCV 2013
Likelihood ratio test:
EM update of the between/within class covariance
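The joint Bayesian formulation behind this (Chen et al., ECCV 2012) models a face as identity plus intra-personal variation and scores a pair by a log-likelihood ratio; a sketch of the model:

```latex
% A face is x = \mu + \varepsilon, with identity \mu \sim N(0, S_\mu)
% and intra-personal variation \varepsilon \sim N(0, S_\varepsilon);
% S_\mu and S_\varepsilon are the covariances the EM procedure updates.
r(x_1, x_2) = \log \frac{P(x_1, x_2 \mid H_I)}{P(x_1, x_2 \mid H_E)},
\qquad
\mathrm{cov}(x_1, x_2) =
\begin{cases}
S_\mu & H_I \ \text{(same identity)} \\
0     & H_E \ \text{(different identities)}
\end{cases}
```

Under both hypotheses each face has marginal covariance S_μ + S_ε; only the cross-covariance differs, which is what makes the ratio discriminative.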
• 12×5 Siamese ConvNets × 8 input pairs + RBM classification
Hybrid deep learning for computing face similarities, Sun, Wang, Tang, ICCV 2013.
12 face regions, 8 pairs of inputs
[Chart: LFW verification accuracy. 2010: 88.00%, 2011: 92.58%, 2012: 95.17%, 2013: 96.33%, DeepFace: 97.35%, Human: 97.53%, New*: 98.4%. Reduction of error w.r.t. human per year: 20.52%, 48.06%, 52.32%, 49.15%, 85.00%.]
Face recognition in unconstrained videos with matched background similarity, Wolf,
Hassner, Maoz, ICCV 2011
• Data collection
– 3,425 YouTube videos of 1,595 celebrities (a subset of LFW subjects)
– 5,000 video pairs in 10 splits
– Detected and roughly aligned face frames available
• Metric: mean recognition accuracy over 10 folds
– Restricted protocol: only same/not-same labels
– Unrestricted protocol: face identities, additional training pairs
[Chart: LFW accuracy (%) under different alignment schemes: No Alignment, 3D Perturbation, 2D Alignment, 3D Alignment, 3D Alignment + LBP; reported values 87.9, 91.3, 93.7, 94.3, 97.35.]
[Plot: LFW verification accuracy vs. representation size. Float features at 4096, 1024, and 256 dimensions vs. 4096-, 1024-, and 256-bit binarized versions: 97.0 vs. 96.07, 96.72 vs. 95.53, 97.17 vs. 95.87.]
[Chart: training-set size vs. DNN test error (%). 100% of the data: 8.74, 50%: 10.9, 20%: 15.1, 10%: 20.7.]
[Chart: depth ablation, DNN test error (%). C1+M2+C3+L4+L5+L6+F7: 8.74, -C3: 11.2, -L4 -L5: 12.6, -C3 -L4 -L5: 13.5.]
• Naïve binarization
Verification accuracy (%) on LFW (restricted protocol), float vs. binary features:
dim     4096   1024   512    256    128    64     32     16     8
float   97.0   96.72  96.78  97.17  96.42  96.1   94.5   92.75  89.8
binary  96.07  95.53  95.5   95.87  93.38  91.45  87.15
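"Naïve binarization" here can be read as keeping one bit per feature dimension; a sketch (thresholding each dimension at its training-set mean is my assumption, since the slide does not specify the threshold):

```python
import numpy as np

def naive_binarize(F):
    """Keep one bit per feature dimension by thresholding at the
    per-dimension mean, so e.g. a 4096-dim float descriptor becomes
    4096 bits. F: n_samples x n_dims training/feature matrix."""
    return F > F.mean(axis=0, keepdims=True)

def hamming_similarity(a, b):
    """Fraction of agreeing bits, the binary stand-in for a float
    similarity such as cosine."""
    return float(np.mean(a == b))
```

The appeal is storage and speed: Hamming comparisons on packed bits are far cheaper than float dot products, at a modest accuracy cost.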
• Coupling 3D alignment with large-capacity locally-connected networks
• At the brink of human-level performance for face verification
Baseline   Rank-1 rate (%)   Rank-1 rate (%) @ 1% FAR   Verification (%)
[1]        25                56.7                       N/A
[2]        44.5              64.9                       97.35
[3]        61.9              82.5                       98.4
• What is the state-of-the-art rank-1 accuracy searching 3K probe faces against 4K gallery faces on the unconstrained LFW dataset?
– (a) ~80%
– (b) ~60%
– (c) ~40%