

Post on 26-Jun-2020






  • Face Recognition

    Abstract

    Face recognition involves identifying or verifying a person from a digital image or video frame and is still one of the most challenging tasks in computer vision today. The conventional face recognition pipeline consists of face detection, face alignment, feature extraction, and classification. This page further explains three exemplary state-of-the-art architectures: DeepID3 (6), FaceNet (9), and Sparse ConvNet (11).

    1 Introduction
    2 Overview
    3 Notable networks
      3.1 DeepID3
      3.2 FaceNet
      3.3 Sparse ConvNet
    4 Literature
    5 Weblinks


    Introduction

    The task of face recognition involves identifying or verifying a person from a digital image or video frame. Computer applications capable of performing this task, known as facial recognition systems, have been around for decades. The general idea of face recognition is to extract facial landmarks from an image and then compare them to those of other images by matching these features.

    However, face recognition is still one of the most relevant and challenging research areas in computer vision and pattern recognition due to variations in facial expressions, poses, and illumination. (1)


    Overview

    The conventional face recognition pipeline consists of four stages: face detection, face alignment (or preprocessing), feature extraction (or face representation), and classification, as illustrated in figure 1.

    A milestone in face detection was the contribution by Viola & Jones in 2001 (2), which provided an object detection framework that operated in real time and was well suited for human faces. The remaining multi-view face detection problem was first tackled by Farfade, Saberian, & Li in 2015 (3) by using deep CNNs instead of the cascade-based approach of Viola & Jones. Current state-of-the-art approaches use region-based CNNs to enable faster and more reliable detection. (4)

    To simplify the extraction part, a proper alignment is crucial. If facial points can be identified correctly, features can be matched in a region around them. Recently, CNN-based architectures showed success in this area. (5)

    The feature extraction part is often considered the most challenging and important of all, since any matching algorithm is limited by the quality of the underlying features.

    Figure 1: Stages of the face recognition pipeline. Source: own illustration based on https://1.bp.blogspot.com/-WEmxpqLpN50/Vy7j-80kMII/AAAAAAAABck/8zqeUCPve_Iglqg1od1-AHUXbAPyxxVfgCLcB/s1600/%25E6%2593%25B7%25E5%258F%25962.PNG and http://electronicimaging.spiedigitallibrary.org/data/journals/electim/935241/jei_25_6_063002_f001.png
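    The four-stage pipeline described in the overview can be pictured as a simple chain of functions. The following is only a minimal sketch: the function `recognize`, the type aliases, and all `toy_*` stages are hypothetical stand-ins invented for illustration, whereas a real system would plug in trained detection, alignment, feature extraction, and classification models.

```python
from typing import Callable, List, Sequence

Image = List[List[float]]      # toy grayscale image: rows of pixel values in [0, 1]
Embedding = Sequence[float]    # feature vector produced by the extraction stage


def recognize(
    image: Image,
    detect: Callable[[Image], Image],
    align: Callable[[Image], Image],
    extract: Callable[[Image], Embedding],
    classify: Callable[[Embedding], str],
) -> str:
    """Run the four pipeline stages in order and return the predicted label."""
    face = detect(image)         # 1. face detection: locate and crop the face
    aligned = align(face)        # 2. face alignment (preprocessing)
    features = extract(aligned)  # 3. feature extraction (face representation)
    return classify(features)    # 4. classification


# Hypothetical toy stages so that the sketch runs end to end.
def toy_detect(image: Image) -> Image:
    return image  # pretend the whole image is the face region


def toy_align(face: Image) -> Image:
    return face  # pretend the face is already aligned


def toy_extract(face: Image) -> Embedding:
    pixels = [p for row in face for p in row]
    return [sum(pixels) / len(pixels)]  # mean brightness as a 1-D "feature"


def toy_classify(features: Embedding) -> str:
    return "bright" if features[0] > 0.5 else "dark"


label = recognize(
    [[0.9, 0.8], [0.7, 1.0]], toy_detect, toy_align, toy_extract, toy_classify
)
```

    Passing the stages in as callables keeps each one replaceable, which mirrors how the networks discussed on this page mostly target the feature extraction stage.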

  • Notable networks

    There is a variety of successful architectures. This section focuses on three different models and explains their idiosyncrasies. Evaluations of face recognition approaches are almost always performed on the Labeled Faces in the Wild (LFW) data set (12), with face verification accuracy as the most common metric. In the verification task, given a pair of face images, the goal is to determine whether they come from a single subject or not.
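    To make the metric concrete, the sketch below computes a verification accuracy for a handful of image pairs. It assumes that a model has already reduced each pair to a single distance value and that a pair is declared "same subject" when the distance does not exceed a threshold. The function name, the distances, and the threshold are hypothetical illustration values and are not taken from the LFW protocol.

```python
from typing import Sequence


def verification_accuracy(
    distances: Sequence[float], same_subject: Sequence[bool], threshold: float
) -> float:
    """Fraction of pairs classified correctly at the given distance threshold."""
    if len(distances) != len(same_subject):
        raise ValueError("distances and labels must have the same length")
    correct = 0
    for dist, is_same in zip(distances, same_subject):
        predicted_same = dist <= threshold  # small distance -> same subject
        if predicted_same == is_same:
            correct += 1
    return correct / len(distances)


# Hypothetical distances for four image pairs and their ground-truth labels.
pair_distances = [0.3, 0.9, 1.4, 0.8]
pair_labels = [True, True, False, False]

accuracy = verification_accuracy(pair_distances, pair_labels, threshold=1.0)
```

    In this toy example three of the four pairs are classified correctly; the last pair has a small distance although the two images show different subjects.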


    DeepID3

    DeepID3 is the third generation of the DeepID architecture, which was introduced in one of the first publications to propose learning discriminative deep face representations (DFR) through large-scale face identity classification. The second generation learned DFR through joint face identification-verification, which finally brought the networks up to human performance.

    In this third approach (shown in figure 2), Sun et al. (6) were trying to use insights of the most successful architectures from the ImageNet challenge in 2014: the inception layers of GoogLeNet (7) and the stacked convolutions of VGG (8). They also included joint identification-verification supervisory signals to multiple layers, to further reduce the intra-personal variance of the representation. The publication shows that very deep neural networks achieve state-of-the-art performance on face recognition tasks and slightly outperform their shallow counterparts. By exposing the architectures to large-scale training data, another increase in effectiveness is expected.

    Figure 2: Layers of DeepID3 network. Source: (6)
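    The idea of a joint identification-verification signal can be illustrated with a small sketch. It assumes one common formulation: a softmax cross-entropy term for identification and a contrastive term on feature pairs for verification. The exact losses, the margin, and the weighting used in DeepID3 are not given on this page, so all function names and the parameters `weight` and `margin` are hypothetical.

```python
import math
from typing import Sequence


def identification_loss(logits: Sequence[float], identity: int) -> float:
    """Softmax cross-entropy: negative log-probability of the true identity."""
    peak = max(logits)  # subtract the maximum for numerical stability
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[identity]


def verification_loss(
    feat_a: Sequence[float],
    feat_b: Sequence[float],
    same: bool,
    margin: float = 1.0,
) -> float:
    """Contrastive term: pull same-identity features together, push others apart."""
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(feat_a, feat_b)))
    if same:
        return 0.5 * dist ** 2
    return 0.5 * max(0.0, margin - dist) ** 2


def joint_loss(
    logits_a: Sequence[float],
    id_a: int,
    logits_b: Sequence[float],
    id_b: int,
    feat_a: Sequence[float],
    feat_b: Sequence[float],
    weight: float = 0.5,
    margin: float = 1.0,
) -> float:
    """Identification loss of both images plus a weighted verification term."""
    ident = identification_loss(logits_a, id_a) + identification_loss(logits_b, id_b)
    verif = verification_loss(feat_a, feat_b, same=(id_a == id_b), margin=margin)
    return ident + weight * verif
```

    The verification term only depends on the features, not on the class scores, which is why it can act on the intra-personal variance of the representation.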


    FaceNet

    The FaceNet publication by Google researchers (9) introduced a novelty to the field by directly learning a mapping from face images to a compact Euclidean space. The distances between representation vectors are a direct measure of their similarity, with 0.0 corresponding to two equal pictures and 4.0 marking the opposite side of the spectrum. The representation is also able to significantly reduce the image complexity to only 128 bytes per face. This generalized embedding significantly differs from other approaches, which are trained over a set of known faces and then generalized via an intermediate bottleneck layer. Figure 3 shows the exemplary scores of pairs of test images.
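    The range from 0.0 to 4.0 follows if one assumes that the distance is the squared Euclidean distance between embeddings of unit length: identical vectors yield 0.0 and exactly opposite vectors yield 4.0. The sketch below illustrates this with toy 3-dimensional vectors instead of real face embeddings. The helper names are hypothetical, and the default threshold of 1.1 is simply the value quoted in the caption of figure 3.

```python
import math
from typing import List, Sequence


def l2_normalize(vector: Sequence[float]) -> List[float]:
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("cannot normalize the zero vector")
    return [x / norm for x in vector]


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def same_identity(
    emb_a: Sequence[float], emb_b: Sequence[float], threshold: float = 1.1
) -> bool:
    """Declare two embeddings the same person if their distance is small enough."""
    dist = squared_distance(l2_normalize(emb_a), l2_normalize(emb_b))
    return dist <= threshold


# Toy embeddings: identical vectors and exactly opposite vectors.
identical = squared_distance(l2_normalize([1.0, 2.0, 2.0]), l2_normalize([1.0, 2.0, 2.0]))
opposite = squared_distance(l2_normalize([1.0, 0.0, 0.0]), l2_normalize([-1.0, 0.0, 0.0]))
```

    Here `identical` evaluates to 0.0 and `opposite` to 4.0, the two ends of the spectrum described above.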

    Figure 4: Model structure. This network consists of a batch input layer and a deep CNN followed by L2 normalization, which results in the face embedding. This is followed by the triplet loss during training. Source: (9)

    The architecture is a combination of the multiple interleaved layers of convolutions of Zeiler & Fergus (10) and the inception model of GoogLeNet (7). These models are interwoven to a deep architecture, which is symbolized as a black box in figure 4. The most important part of the approach lies in the end-to-end learning of the whole system. As a loss function, the Triplet Loss was used, which is explained and shown in figure 5.

    At the time of publication, FaceNet set a new record accuracy of 99.63% on the LFW dataset (12). The drawback of this model is its demand for a large training data set (200 million training samples in this case).

    Figure 3: Illumination and pose invariance. Pose and illumination have been a long-standing problem in face recognition. This figure shows the output distances of FaceNet between pairs of face images under different pose and illumination combinations. A distance of 0.0 means the faces are identical, while 4.0 corresponds to the opposite end of the spectrum, two different identities. You can see that a threshold of 1.1 would classify every pair correctly. Source: (9)

    Figure 5: The Triplet Loss minimizes the distance between an anchor and a positive, both of which have the same identity, and maximizes the distance between the anchor and a negative of a different identity. Source: (9)
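    The behaviour described in the caption of figure 5 can be written as a hinge on the difference of two distances. The sketch below is a minimal per-triplet version under the assumption that squared Euclidean distances are used. The `margin` parameter and its default value are illustrative assumptions, and the selection of triplets during training is not covered.

```python
from typing import Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def triplet_loss(
    anchor: Sequence[float],
    positive: Sequence[float],
    negative: Sequence[float],
    margin: float = 0.2,
) -> float:
    """Zero once the negative is farther from the anchor than the positive by the margin."""
    pos_dist = squared_distance(anchor, positive)  # same identity: should be small
    neg_dist = squared_distance(anchor, negative)  # other identity: should be large
    return max(0.0, pos_dist - neg_dist + margin)


# A well-separated triplet and a violating triplet with swapped roles.
easy = triplet_loss([1.0, 0.0], [1.0, 0.0], [0.0, 1.0])
hard = triplet_loss([1.0, 0.0], [0.0, 1.0], [1.0, 0.0])
```

    The well-separated triplet contributes no loss, while the violating triplet is penalized by its distance gap plus the margin.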

    Sparse ConvNet

    In this recent publication, Sun et al. (11) tried to further improve on their achievements with DeepID3 (6) by taking a trained, dense CNN, sparsifying its connections, and training it even further to improve performance. This architecture increases the baseline performance of DeepID3 from 98.95% to 99.30%, which implies an error rate reduction of 33%. It is important to note that even though it did not achieve a better performance than FaceNet (9), it only required 300,000 training samples and can thereby be considered more efficient.
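    The stated error rate reduction can be checked from the two accuracies alone: the error drops from 1.05% to 0.70%, which is about one third of the original error. A short sketch of this arithmetic, with a hypothetical helper name:

```python
def relative_error_reduction(acc_before: float, acc_after: float) -> float:
    """Relative reduction of the error rate, given two accuracies in percent."""
    err_before = 100.0 - acc_before  # 1.05 percentage points for DeepID3
    err_after = 100.0 - acc_after    # 0.70 percentage points for the sparse network
    return (err_before - err_after) / err_before


reduction = relative_error_reduction(98.95, 99.30)  # roughly 0.33
```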


    Literature

    1) Kasar, M. M., Bhattacharyya, D., & Kim, T. H. (2016). Face Recognition Using Neural Network: A Review. International Journal of Security and Its Applications, 10(3), 81-100. http://www.sersc.org/journals/IJSIA/vol10_no3_2016/8.pdf

    2) Viola, P., & Jones, M. (2001). Rapid object detection using a boosted cascade of simple features. In Computer Vision and Pattern Recognition, 2001. CVPR 2001. Proceedings of the 2001 IEEE Computer Society Conference on (Vol. 1, pp. I-I). IEEE. http://www.merl.com/publications/docs/TR2004-043.pdf

    3) Farfade, S. S., Saberian, M. J., & Li, L. J. (2015, June). Multi-view Face Detection Using Deep Convolutional Neural Networks. In Proceedings of the 5th ACM on International Conference on Multimedia Retrieval (pp. 643-650). ACM. https://arxiv.org/abs/1502.02766v3

    4) Jiang, H., & Learned-Miller, E. (2016). Face detection with the faster R-CNN. arXiv preprint. https://arxiv.org/abs/1606.03473

    5) Sun, Y., Wang, X., & Tang, X. (2013). Deep convolutional network cascade for facial point detection. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 3476-3483). http://www.cv-foundation.org/openaccess/content_cvpr_2013/papers/Sun_Deep_Convolutional_Network_2013_CVPR_paper.pdf

    6) Sun, Y., Liang, D., Wang, X., & Tang, X. (2015). DeepID3: Face recognition with very deep neural networks. arXiv preprint. https://arxiv.org/abs/1502.00873

    7) Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., & Rabinovich, A. (2015). Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 1-9). https://arxiv.org/abs/1409.4842

    8) Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint.

    9) Schroff, F., Kalenichenko, D., & Philbin, J. (2015). Facenet: A unified embedding for face recogniti
