
Low-Quality Video Face Recognition with Deep Networks and Polygonal Chain Distance

Christian Herrmann∗†, Dieter Willersinn†, Jürgen Beyerer†∗
∗Vision and Fusion Lab, Karlsruhe Institute of Technology KIT, Karlsruhe, Germany

†Fraunhofer IOSB, Karlsruhe, Germany
{christian.herrmann|dieter.willersinn|juergen.beyerer}@iosb.fraunhofer.de

Abstract—Face recognition under surveillance circumstances still poses a significant problem due to low data quality. Nevertheless, automatic analysis is highly desired for criminal investigations due to the growing number of security cameras worldwide. We suggest a face recognition system addressing the typical issues such as motion blur, noise, or compression artifacts to improve low-quality recognition rates. A low-resolution adapted residual neural net serves as face image descriptor. It is trained on quality-adjusted public training data generated by data augmentation strategies such as motion blurring or adding compression artifacts. To further reduce noise effects, a noise-resistant manifold-based face track descriptor using a polygonal chain is proposed. This leads to a performance improvement on in-the-wild surveillance data compared to conventional local feature approaches or the state-of-the-art high-resolution VGG-Face network.

I. INTRODUCTION

The increasing availability of security cameras raises the demand for analysis of the vast amounts of video footage, specifically automatic analysis, because manual inspection is infeasible. While older security cameras lack resolution and faces are often unrecognizable, with newer camera generations the faces become clearer and distinguishable. However, the data quality is usually still far from professional footage such as TV or press photographs, where automatic face recognition recently achieved impressive results, surpassing even human performance in certain setups [1, 2]. Addressing the low-quality surveillance domain is still a significant challenge for automatic face recognition approaches, caused by several factors: misalignment, noise, a lack of effective features, and the dimensional mismatch between probe and gallery [3]. Recently, effective alignment methods for low-quality faces were proposed [4], which we found to be sufficiently accurate. Consequently, in this paper, we suggest an effective Convolutional Neural Network (CNN)-based feature which proves to be more effective than previous solutions, and we address noise by data augmentation and a noise-resistant track descriptor that utilizes the temporal information. Data augmentation is necessary because the large face datasets suitable for training a CNN are not surveillance datasets and consequently involve a domain gap. We suggest corresponding augmentation strategies, such as adding motion blur or compression artifacts, to close this gap.


Fig. 1: Qualitative results of the proposed method on low-quality data. Line thickness and numbers denote face similarity (inverse of descriptor distance); line color denotes same (blue) or different (orange) identity.

In detail, the contributions of this paper are threefold: First, the adaptation of the residual net architecture [5] to the low-quality face recognition domain by adjusting layer configuration and setup. Second, a manifold-based and noise-resistant strategy using a polygonal chain to aggregate facial information across multiple frames of a face track. Third, a systematic analysis of the target domain image quality effects and their reproduction as data augmentation for high-quality training data from a different domain.

II. RELATED WORK

Addressing low-quality video face recognition involves several specific problems.

Face recognition with CNNs. Currently, CNNs serve only for high-resolution face recognition, where they significantly improved the performance compared to previously known approaches, even surpassing human capabilities in certain setups [1, 2, 6]. Because these networks are mainly based on solutions for the ImageNet challenge [7], they adopt its high resolutions of 224×224 pixels and above. Low-resolution networks tend to lose performance as shown by [1], which makes it necessary to address this issue.

Low-quality face recognition. One part of the approaches addressing this task tries to mitigate the low data quality by preprocessing steps. This includes super-resolution methods where low-quality images are upsampled to apply a conventional high-resolution face recognition strategy [8, 9], which proved to be a solid strategy for comparing low-quality


to high-quality gallery faces. For low-quality to low-quality matching, feature adaptations [10], blur-resistant features [11], or best-shot selection are known options. For video-to-video matching, preprocessing is usually too computationally expensive, which is why we follow a different strategy and create a low-quality tolerant face descriptor.

Face track descriptors. Given a sequence of face image descriptors from one face track, there exists a wide variety of methods to aggregate these descriptors into a single track descriptor, ranging from primitive best-shot selection to detailed manifold modeling. Usually, three categories are distinguished: set-based, space-based and manifold-based. Set-based methods include best-shot, random [2] or specific [12] selection of image descriptors, or even including all [13] image descriptors in the track descriptor. Set comparison can then be performed, e.g., by minimum or Hausdorff distance [13]. Space-based modeling of sequences tries to fit a linear space model such as an affine subspace or the convex hull [14, 15]. Comparison can, e.g., be performed by the principal angle between the subspaces [15]. In previous work, manifold-based methods operate directly on raw pixel values because it is known that the manifold assumption holds in this case [16]. It allows many possibilities to model the face sequence, ranging from linear approximation by multiple PCA planes [17] over simply applying locally linear embedding (LLE) [18] or combining LLE and k-means [16] to local probabilistic models [19]. Later on in section IV, we will motivate that the manifold assumption still holds in the case of CNN face descriptors under certain conditions and propose an appropriate model for this case. Especially, we take comparison time into consideration, where some manifold methods lack efficiency. Recently successful alternative approaches (fitting in none of the categories) based on cumulative descriptors [20] are impractical in our case because they require local image features instead of the holistic ones produced by CNNs.

Dataset augmentation. To increase the generalization abilities of machine learning systems, especially across domains, data augmentation can be applied to the training data to increase the variety of data the system can learn from. In the area of face recognition, this topic has not been studied systematically yet. Parkhi et al. apply different crops and flipping to train their face network [6], but without any evaluation of the respective contribution. Their strategy is probably motivated by the common training strategies for the ImageNet challenge, which apply cropping, flipping and color shift to improve the results [21, 22]. For low-quality data from surveillance, results for the person re-identification scenario indicate that different cropping, flipping and rotation are helpful, while color changes or affine transformations tend to decrease the results [23]. We will complement these augmentation suggestions with specific low-quality related effects such as motion blur or compression artifacts in section V. This way, the low-quality domain is directly addressed and the domain gap between training and test data is minimized.

TABLE I: Structure of the proposed low-quality residual network (LqNet). After each conv layer, batch normalization [24] is performed.

type            size                              stride, pad            data size out for 32×32 input image
conv            3×3, 64                           1, 1                   32×32×64
max pool        3×3                               2, 0                   16×16×64
relu                                                                     16×16×64
res block (2×)  1×1, 64 / 3×3, 64 / 1×1, 256      2/1, 0 / 1, 1 / 1, 0   8×8×256
res block (2×)  1×1, 128 / 3×3, 128 / 1×1, 512    1, 0 / 1, 1 / 1, 0     8×8×512
res block (2×)  1×1, 256 / 3×3, 256 / 1×1, 1024   2/1, 0 / 1, 1 / 1, 0   4×4×1024
avg pool        3×3                               1, 0                   2×2×1024
fc              2048                                                     2048
relu                                                                     2048
fc              128                                                      128

III. FACE IMAGE DESCRIPTION

To describe a given face track T = [u_1, ..., u_n], each frame u_i in the sequence is described by a CNN descriptor: u_i ↦ z_i. Afterwards, the track descriptor presented in the next section compresses the information from several frames. A modified residual net architecture [5] adapted to the low-resolution verification task provides the face image descriptors. The low-quality residual network (LqNet) is trained from scratch because we found that fine-tuning pre-trained nets is infeasible due to the large resolution difference between widely available high-resolution nets and the target scenario. We stack 3 double “bottleneck” building blocks as shown in table I. In addition, we found it beneficial to insert a fully connected layer between the last residual block and the output layer. Batch normalization [24] is applied after each convolutional layer. This results in a 20-layer architecture. Evaluating further architectures with varying numbers of layers (15 to 50) showed this configuration to be the best solution.
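To make the layout in table I concrete, the following is a minimal PyTorch sketch of the network (the paper's implementation used Caffe); the input channel count, the projection shortcuts and the ceil-mode pooling are our assumptions, chosen to reproduce the output sizes listed in the table.

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """1x1 -> 3x3 -> 1x1 residual bottleneck, batch norm after each conv."""
    def __init__(self, in_ch, mid_ch, out_ch, stride=1):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, 1, stride=stride), nn.BatchNorm2d(mid_ch), nn.ReLU(),
            nn.Conv2d(mid_ch, mid_ch, 3, padding=1), nn.BatchNorm2d(mid_ch), nn.ReLU(),
            nn.Conv2d(mid_ch, out_ch, 1), nn.BatchNorm2d(out_ch))
        # projection shortcut whenever the shape changes (our assumption)
        self.skip = nn.Identity() if stride == 1 and in_ch == out_ch else \
            nn.Sequential(nn.Conv2d(in_ch, out_ch, 1, stride=stride),
                          nn.BatchNorm2d(out_ch))

    def forward(self, x):
        return torch.relu(self.body(x) + self.skip(x))

def stage(in_ch, mid_ch, out_ch, stride):
    """A 'double' block of table I: two bottlenecks, the first may downsample."""
    return nn.Sequential(Bottleneck(in_ch, mid_ch, out_ch, stride),
                         Bottleneck(out_ch, mid_ch, out_ch))

class LqNet(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1), nn.BatchNorm2d(64),    # 32x32x64
            nn.MaxPool2d(3, stride=2, ceil_mode=True), nn.ReLU(),  # 16x16x64
            stage(64, 64, 256, stride=2),                          # 8x8x256
            stage(256, 128, 512, stride=1),                        # 8x8x512
            stage(512, 256, 1024, stride=2),                       # 4x4x1024
            nn.AvgPool2d(3, stride=1))                             # 2x2x1024
        self.embed = nn.Sequential(nn.Flatten(),
                                   nn.Linear(2 * 2 * 1024, 2048),
                                   nn.ReLU(),
                                   nn.Linear(2048, dim))           # 128-dim output

    def forward(self, x):  # x: (batch, 3, 32, 32)
        return self.embed(self.features(x))
```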

For training the LqNet, a Siamese network structure [25] with a max-margin loss l, instead of the contrastive loss, is applied. The loss function l is

l = \sum_{i,j} \max\bigl(0,\, 1 - y_{ij} \cdot (b - d^2(z_i, z_j))\bigr),    (1)

similar to [26], where z_i and z_j denote the face descriptors, y_{ij} ∈ {−1, 1} the indicator variable, b the decision boundary and d^2 the squared Euclidean distance. In this way, the proposed network learns to map face images to discriminative 128-dimensional descriptors which can efficiently be compared by Euclidean distance.
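As a concrete illustration, here is a minimal NumPy sketch of the loss in Eq. (1); the function and variable names are ours, and the boundary b is left as a parameter.

```python
import numpy as np

def max_margin_loss(z_i, z_j, y, b):
    """Max-margin pair loss from Eq. (1).

    z_i, z_j : (n, 128) arrays of face descriptors for n training pairs
    y        : (n,) indicator variables in {-1, +1}, +1 for same identity
    b        : decision boundary on the squared Euclidean distance
    """
    d2 = np.sum((z_i - z_j) ** 2, axis=1)        # squared Euclidean distances
    return np.sum(np.maximum(0.0, 1.0 - y * (b - d2)))
```

For same-identity pairs (y_{ij} = 1) the hinge pushes d^2 below b − 1, and for different identities above b + 1, yielding a margin of width 2 around the boundary b.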


Fig. 2: Projection of face descriptors from a sample face sequence in 2D via PCA. Illustrated at different time steps with 10, 25, 40, 55, 70 and all (79) frames. Newly added descriptors are indicated in yellow-like colors and the face corresponding to the most recently added descriptor is indicated in the respective lower right corner. It can be observed that both principal axes of the descriptor space mainly correspond to two visible effects in this case: head rotation (vertical axis) and blur (horizontal axis).


Fig. 3: Visualization of the proposed manifold modeling. The face descriptors of one sequence (blue dots) are assumed to lie on a manifold M_C (a). This manifold is first approximated by local means (b) and then linearly reconstructed as a polygonal chain by the line segments between the means (c).

IV. TRACK REPRESENTATION

Using the net from the previous section yields one descriptor per image: u_i ↦ z_i. To process face tracks as delivered by a face tracker, one requires a track descriptor D which models all image descriptors from one track: [z_1, ..., z_n] ↦ D. Because image noise potentially influences single image descriptors, a robust track descriptor is required. When analyzing face descriptor sequences as in figure 2, it appears obvious that such a sequence can be modeled by a manifold. Assuming a manifold is justified because

• face images u reside on a manifold M when vectorized (i.e., as raw pixel values) [16],

• linear transformations usually preserve manifolds, and the convolutional neural net C mainly consists of linear operations, thus leading to a transformed manifold M_C, and

• non-linearities are assumed to be noise r: D = C(u) = M_C(α) + r, where α denotes the manifold coordinate vector.

Starting from the manifold concept (figure 3a), a noise-resistant representation is required. We propose k local means {x_1, ..., x_k} to address this (figure 3b); a sketch follows below. In contrast to more common exemplar-based representations such as [12], or to applying the Ramer-Douglas-Peucker algorithm to select good exemplars, the proposed strategy includes an averaging effect which reduces the given noise.
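The text does not spell out how the local means are obtained; one simple reading consistent with the ordered chain in figure 3 is to average k contiguous temporal segments of the descriptor sequence, as in this sketch (the temporal splitting is our assumption; any local clustering that preserves an ordering would do):

```python
import numpy as np

def local_means(Z, k):
    """Approximate a descriptor sequence by k local means (figure 3b).

    Z : (n, d) array of per-frame CNN descriptors z_1..z_n in temporal order
    k : number of local means, i.e., vertices of the polygonal chain

    Splitting along time keeps the means ordered, so connecting them
    directly yields the polygonal chain of figure 3c.
    """
    segments = np.array_split(Z, k)                            # k contiguous chunks
    return np.stack([seg.mean(axis=0) for seg in segments])    # (k, d)
```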

Fig. 4: Example shots from our recorded surveillance data.

We then model the manifold as the polygonal chain given by connecting the local means S_p = [x_1, ..., x_k] (figure 3c). Let S = [x_1, x_2] ∪ ... ∪ [x_{k−1}, x_k] denote the point set of the manifold model and further s_i the line segment between the points x_i and x_{i+1} of the polygonal chain. Then the polygonal chain distance (PCD) d(S^1, S^2) between two polygonal chains S^1 and S^2 is the minimum Euclidean distance between any two points t_1, t_2 on the chains, which is equivalent to the minimum distance between any two line segments:

d(S^1, S^2) = \min_{t_1 \in S^1,\, t_2 \in S^2} d(t_1, t_2) = \min_{i \in \{1,\ldots,k_1-1\},\, j \in \{1,\ldots,k_2-1\}} d(s^1_i, s^2_j).    (2)

The segment-to-segment distance d(s^1_i, s^2_j) in R^n is non-trivial, and basically three cases can be distinguished. For details on how to determine when exactly to apply which case, refer, e.g., to [27].

1) The closest points of the corresponding lines lie inside the segments. In this case, the line-to-line distance applies:

d(s^1_i, s^2_j) = \bigl\| a_{ij} - (a_{ij}^T g^1_i) \cdot g^1_i - (a_{ij}^T g^2_j) \cdot g^2_j \bigr\|_2    (3)

with

a_{ij} = x^1_i - x^2_j    (4)

and

g^k_i = \frac{x^k_{i+1} - x^k_i}{\bigl\| x^k_{i+1} - x^k_i \bigr\|_2}    (5)

denoting the normalized directional vector of the respective line.

2) The point-to-segment distance applies if the endpoint of one segment is part of the minimum distance. For endpoint x^1_i, w.l.o.g., the distance is

d(s^1_i, s^2_j) = \bigl\| a_{ij} - (a_{ij}^T g^2_j) \cdot g^2_j \bigr\|_2.    (6)

3) The point-to-point distance is the right choice if the endpoints of both segments are part of the minimum distance:

d(s^1_i, s^2_j) = \min_{u \in \{i, i+1\},\, v \in \{j, j+1\}} \bigl\| x^1_u - x^2_v \bigr\|.    (7)

Fig. 5: Qualitative comparison of face images from FaceScrub (top), YTF (mid) and SURV (bottom). It can be seen that video data (YTF and SURV) has lower quality than single-image, mugshot-like data (FaceScrub). However, professional video recordings (YTF) still show better quality than surveillance data (SURV).
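In practice, the three cases can be folded into one clamped parametric minimization over the two segments, as in standard closest-point routines [27]. The following is a minimal NumPy sketch under that formulation (function names are ours; non-degenerate segments are assumed):

```python
import numpy as np

def segment_distance(p1, q1, p2, q2, eps=1e-12):
    """Minimum distance between segments [p1, q1] and [p2, q2] in R^n.

    Clamping the line parameters s, t to [0, 1] covers all three cases
    (line-line, point-segment, point-point) distinguished in the text.
    """
    d1, d2, r = q1 - p1, q2 - p2, p1 - p2
    a, e = d1 @ d1, d2 @ d2                    # squared segment lengths
    b, c, f = d1 @ d2, d1 @ r, d2 @ r
    denom = a * e - b * b                      # >= 0; 0 for parallel lines
    s = np.clip((b * f - c * e) / denom, 0.0, 1.0) if denom > eps else 0.0
    t = (b * s + f) / e if e > eps else 0.0
    if t < 0.0 or t > 1.0:                     # an endpoint of segment 2 is active
        t = np.clip(t, 0.0, 1.0)
        s = np.clip((b * t - c) / a, 0.0, 1.0) if a > eps else 0.0
    return np.linalg.norm((p1 + s * d1) - (p2 + t * d2))

def pcd(X1, X2):
    """Polygonal chain distance of Eq. (2) between two chains of local means.

    X1 : (k1, d) and X2 : (k2, d) arrays of ordered chain vertices.
    """
    return min(segment_distance(X1[i], X1[i + 1], X2[j], X2[j + 1])
               for i in range(len(X1) - 1)
               for j in range(len(X2) - 1))
```

The double minimum makes the comparison cost grow with the product of the segment counts, which is why a small k is preferred in section VI.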

V. DATA AND AUGMENTATION

To represent the target scenario, an in-the-wild surveillance dataset (figure 4) was recorded on different days and nights at different locations, including indoor and outdoor, with several cameras per location. Faces are tracked by a Viola-Jones based face tracker, and a subset of all detected face tracks was labeled, resulting in a dataset size of 869 face tracks of 25 people. Face sizes are mostly in the range of 20 to 40 pixels, and the track length varies from 14 to about 1,200 frames with an average length of 59 frames. We refer to this dataset as SURV.

Due to the limited size of the collected data, it is infeasible to train a neural network with a subset of this data. It is also economically unreasonable to try to create a sufficiently large surveillance training dataset because of the required manual labeling. Thus, training with large public face datasets is required. The problem is the domain gap between these datasets and the surveillance domain, because they consist mainly of automatically collected high-quality celebrity face images from the web (e.g. Celebrity-1000 [28], FaceScrub [29], MS-Celeb-1M (MS1M) [30], MSRA-CFW (MSRA) [31], YTF [32]).

Two strategies lead to target domain adaptation. First, in addition to public datasets, we add a TV Collection (TVC) face video dataset (15.4K tracks, 604 persons). Although collected from professional TV footage, this video data is closer to the target domain than public single-image datasets. It has similar image quality as the YTF or Celebrity-1000 dataset, but is significantly larger than the first and has fewer label errors than the second. Second, we are looking for image transformations that adjust the public high-quality datasets to be similar to the low-quality target domain with respect to image properties.

TABLE II: Training data augmentation strategies with their application probabilities and intensities. Unless a normal distribution N is denoted, the intensity is uniformly distributed over the range. The given parameters assume a [0, 1] pixel value range. m denotes the number of active domain augmentations when combining several augmentation strategies.

augmentation   probability   intensity
crop           0.8           up to 2 pixels
flip           0.5
rotation       0.5           2·N(0, 1) degrees
motion blur    0.5/m         up to 5 pixels
noise          0.5/m         up to 0.1·N(0, 1)
compression    0.5/m         JPEG quality down to 6
rescale        0.5/m         up to factor 1.4

Obviously, the first stage is required to be an adjustment to the chosen target resolution of 32×32 pixels face size. The next domain difference is blur. Comparing SURV and downsampled versions of FaceScrub and YTF with regard to blur by the maximum response of a Laplacian filter indicates the respective sharpness level. High responses of the Laplacian filter denote sharp edges, which are typical for unblurred images. Thus, the filter response can be understood as a sharpness level. FaceScrub images show an average sharpness level of 0.534 and YTF images 0.352, compared to 0.300 for SURV face images. This correlates well with the subjective image quality shown in figure 5. According to further tests, the difference corresponds on average to blurring the FaceScrub images with a Gaussian kernel with σ = 0.6 or a motion kernel of 5 pixels length (σ = 0.4 and 1.5 pixels for YTF). Note that blur in surveillance data is usually motion blur caused by object movement and the integration time of the camera. The image formation process involves some further effects besides motion blur and scale effects, including noisy images caused by sensor quantum noise as well as artifacts caused by the compression required to transmit the data.
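The exact kernel and normalization behind the reported sharpness levels are not given in the text; one plausible reading of "maximum response of a Laplacian filter" is sketched below.

```python
import numpy as np
from scipy.ndimage import laplace

def sharpness(gray):
    """Maximum absolute Laplacian response as a sharpness score.

    gray : 2D array with pixel values in [0, 1]. Strong edges give large
    second-derivative responses, so sharper images score higher.
    (The paper's exact filter and normalization are assumptions here.)
    """
    return float(np.abs(laplace(gray)).max())
```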

All in all, this leads to seven different augmentation strategies, the first three (geometric) inspired by the literature [6, 23] and the last four by the domain requirements: flipping, cropping, rotation, motion blur, noise, compression and rescaling. For a training sample, each augmentation is applied at runtime with a certain predefined probability and a bounded random intensity, presented in detail in table II, which reflect the frequency and intensity of these effects in the target domain.
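As an illustration of the runtime application, here is a minimal sketch of some of the domain augmentations from table II using OpenCV; the helper names, the horizontal-only blur kernel, and the reading of the 0.5/m probability split are our assumptions (crop, rotation and rescale are omitted for brevity).

```python
import numpy as np
import cv2

def motion_blur(img, length):
    """Horizontal motion blur with a normalized 1 x length kernel."""
    kernel = np.full((1, max(length, 1)), 1.0 / max(length, 1), np.float32)
    return cv2.filter2D(img, -1, kernel)

def augment(img, rng, m=3):
    """Apply flip / motion blur / noise / JPEG compression at runtime.

    img : float32 grayscale image in [0, 1]; m : number of active domain
    augmentations, each applied with probability 0.5/m (our reading of table II).
    """
    if rng.random() < 0.5:                              # flip
        img = np.ascontiguousarray(img[:, ::-1])
    if rng.random() < 0.5 / m:                          # motion blur, up to 5 px
        img = motion_blur(img, int(rng.integers(1, 6)))
    if rng.random() < 0.5 / m:                          # additive sensor noise
        sigma = 0.1 * abs(rng.standard_normal())        # up to 0.1 * N(0, 1)
        img = np.clip(img + sigma * rng.standard_normal(img.shape), 0.0, 1.0)
    if rng.random() < 0.5 / m:                          # compression artifacts
        quality = int(rng.integers(6, 101))             # JPEG quality down to 6
        _, buf = cv2.imencode(".jpg", (img * 255).astype(np.uint8),
                              [int(cv2.IMWRITE_JPEG_QUALITY), quality])
        img = cv2.imdecode(buf, cv2.IMREAD_GRAYSCALE).astype(np.float32) / 255
    return img
```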

VI. EXPERIMENTS

First, we shed some light on the process to train the proposed LqNet face image descriptor. This includes the choice of appropriate training data and augmentation strategies. Second, some insight into the PCD track descriptor is given. Finally, a comparison to state-of-the-art face recognition approaches on the target domain is performed. All results in this section are based on a 10-fold cross-validation verification setup. Values are given as accuracy with its standard deviation (std) across the folds and are complemented by the area under curve (AUC) and equal error rate (EER) in the final experiments.


TABLE III: Comparison of several datasets and their suitability to train the proposed LqNet face descriptor.

train dataset        public   no augmentation   with augmentation
                              accuracy±std      accuracy±std
Celebrity-1000 [28]  yes      0.642±0.005       0.651±0.003
FaceScrub [29]       yes      0.621±0.005       0.641±0.002
MS1M [30]            yes      0.600±0.004       0.610±0.006
MSRA [31]            yes      0.538±0.006       0.603±0.004
TVC                  no       0.668±0.007       0.686±0.004

TABLE IV: Impact of the geometric (*) and quality augmentation strategies on validation results when training only with the FaceScrub dataset.

augmentation    alone             + geometric
                accuracy±std      accuracy±std
none            0.621±0.005       0.609±0.005
* crop          0.638±0.003       -
* flip          0.628±0.003       -
* rotation      0.617±0.003       -
motion blur     0.641±0.002       0.602±0.004
noise           0.633±0.004       0.639±0.005
compression     0.628±0.004       0.626±0.005
rescale         0.621±0.006       0.630±0.005

AUC and EER denote different characteristics of the ROC curve: the AUC is the area under the ROC curve, while the EER denotes the operating point where both error rates, namely the false positive rate and the false negative rate, are equal.
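For reference, a short sketch of how both quantities can be read off verification scores (a minimal reading using scikit-learn; function and variable names are ours):

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

def auc_and_eer(labels, scores):
    """labels : 1 for same-identity pairs, 0 otherwise.
    scores : similarity values (higher = more similar)."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr                               # false negative rate
    eer = fpr[np.argmin(np.abs(fpr - fnr))]       # point where FPR ~= FNR
    return roc_auc_score(labels, scores), eer
```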

A. LqNet face descriptor

Optimization of the LqNet is performed with a validation dataset consisting of single face images from the target domain. The nets are trained for a fixed number of iterations for fair comparison, using the Caffe framework [33] on a GeForce Titan X. First, possible training datasets are checked for their suitability with respect to the target domain using no augmentation. As table III indicates, the best results are achieved with the video datasets Celebrity-1000 [28] and our own TVC, caused by the smaller domain gap to the target domain. Both Microsoft datasets yield lower results than the rest, probably caused by rather inaccurate automatic labeling. For each dataset, an analysis regarding useful augmentation strategies to close the respective gap to the target domain is performed. Table IV shows the augmentation results for the FaceScrub dataset. In this case, motion blur, noise and different crops improve performance the most. The best results when using augmentation strategies are listed in table III for each dataset. Supported by the results in tables III and IV, the final configuration is the training on the datasets Celebrity-1000, FaceScrub and TVC with crop, flip, motion blur and noise augmentations. Consequently, the training set contains about 3.3M face images from 1,930 persons. Using this setup, an accuracy of 0.691±0.006 is achieved on the validation set after the same number of iterations as before, and 0.756±0.004 after convergence.

Fig. 6: Influence of the local patch number k for the proposed PCD comparison method. Evaluated on the SURV dataset (curves for the LqNet and dense SIFT descriptors).

B. PCD track descriptor

The proposed track descriptor requires the choice of the segment number k. Figure 6 indicates that performance changes only slowly with k, making the choice robust. The trend shows a decrease in accuracy for many local means k when using the LqNet descriptor, which is expected because of the decreasing averaging effect. This matches the desire to keep k small because the track comparison complexity grows with O(k²). For further experiments we use k = 2 or k = 5.

C. Comparison to other approaches

Regarding experiments on the target domain, the proposed LqNet face image descriptor is compared to raw pixel values, local features (LBP and dense SIFT) and the state-of-the-art VGG-Face network [6]. As vector distance for the image descriptors, the best one out of Euclidean, cosine and Hellinger is chosen for each face image descriptor. Two set-based methods, namely best-shot and minset, the space-based method MSM [15] and the manifold-based LLE [18] serve as comparative face track descriptors. The results for all image and track descriptor combinations are listed in table V and indicate mainly three aspects. First, for CNN-based image descriptors, the proposed PCD track descriptor yields superior results compared to the other options. Second, while the VGG-Face descriptor unexpectedly yields the worst results of all image descriptors, the proposed LqNet descriptor outmatches its rivals due to addressing the domain gap between training and target data. Third, the advantage of CNN-based descriptors compared to local features (LBP and dense SIFT) is far lower for the low-quality domain than for the high-quality domain. Comparing the approaches on the high-quality official YTF evaluation setup with faces downscaled to 32×32 pixels (table VI) shows a significantly larger gap in this case than for the low-quality case in table V. CNN-based approaches show a significant loss in performance for low-quality data; especially the VGG-Face network suffers heavily under the low-quality conditions. The proposed LqNet also loses some of its advantage over local features on low-quality data; nevertheless, its performance remains better than these baseline approaches.


TABLE V: Comparison of the proposed LqNet face descriptor and the proposed PCD track descriptor with further descriptors on the SURV dataset at 32×32 pixels face size. For each face descriptor, the applied vector distance is denoted. For details refer to the text.

track descr.  value         LqNet        VGG-Net [6]  dense SIFT [26]  LBP [34]     raw pixel
                            (euclidean)  (cosine)     (Hellinger)      (Hellinger)  (cosine)
best shot     accuracy±std  0.640±0.084  0.571±0.079  0.572±0.065      0.551±0.063  0.549±0.046
              AUC | EER     0.697|0.410  0.585|0.438  0.596|0.437      0.600|0.443  0.562|0.454
minset        accuracy±std  0.709±0.102  0.609±0.093  0.700±0.035      0.698±0.072  0.639±0.079
              AUC | EER     0.799|0.336  0.653|0.395  0.750|0.309      0.734|0.328  0.699|0.337
MSM           accuracy±std  0.705±0.126  0.617±0.097  0.644±0.043      0.651±0.053  0.633±0.086
              AUC | EER     0.826|0.315  0.679|0.369  0.701|0.351      0.698|0.358  0.673|0.367
LLE           accuracy±std  0.615±0.071  0.557±0.028  0.525±0.039      0.538±0.043  0.639±0.101
              AUC | EER     0.656|0.458  0.562|0.458  0.539|0.468      0.542|0.470  0.686|0.357
PCD (k = 2)   accuracy±std  0.726±0.102  0.628±0.087  0.650±0.034      0.646±0.055  0.626±0.070
              AUC | EER     0.809|0.288  0.674|0.370  0.713|0.347      0.695|0.349  0.697|0.351
PCD (k = 5)   accuracy±std  0.728±0.091  0.625±0.093  0.677±0.048      0.650±0.052  0.567±0.044
              AUC | EER     0.802|0.303  0.668|0.378  0.733|0.325      0.700|0.353  0.564|0.449

TABLE VI: Results for the high-quality public YTF dataset at 32×32 pixels face size. For PCD, k = 2.

method                             accuracy±std
raw pixel (cosine) + minset        0.604±0.030
LBP (Hellinger) + minset           0.649±0.015
dense SIFT (Hellinger) + minset    0.664±0.013
VGG-Face (cosine) + PCD            0.854±0.012
LqNet (euclidean) + PCD            0.806±0.017

VII. CONCLUSION

As the last experiments showed, low-quality face recognition is an even harder task for current face recognition approaches than low resolution alone already is. To improve low-quality performance, the proposed face image descriptor has proved to be a valid solution when combined with motion blur and noise data augmentation strategies for the training data. In addition, the presented manifold-based face track descriptor based on a polygonal chain improved performance for CNN-based descriptors through its integrated averaging character. All in all, low-quality face recognition appears to benefit from the current progress in the high-quality domain if accompanied by appropriate transfer strategies. Nevertheless, we think that this topic remains a challenging field for research.

REFERENCES

[1] F. Schroff, D. Kalenichenko, and J. Philbin, “FaceNet: A Unified Embedding for Face Recognition and Clustering,” in Computer Vision and Pattern Recognition, 2015, pp. 815–823.

[2] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf, “DeepFace: Closing the gap to human-level performance in face verification,” in Computer Vision and Pattern Recognition, 2014, pp. 1701–1708.

[3] Z. Wang, Z. Miao, Q. M. J. Wu, Y. Wan, and Z. Tang, “Low-resolution face recognition: a review,” The Visual Computer, vol. 30, no. 4, pp. 359–386, 2014.

[4] C. Qu, E. Monari, and T. Schuchert, “Resolution-aware Constrained Local Model with mixture of local experts,” in Advanced Video and Signal Based Surveillance Workshops, 2013, pp. 454–459.

[5] K. He, X. Zhang, S. Ren, and J. Sun, “Deep Residual Learning for Image Recognition,” arXiv preprint arXiv:1512.03385, 2015.

[6] O. M. Parkhi, A. Vedaldi, and A. Zisserman, “Deep face recognition,” British Machine Vision Conference, vol. 1, no. 3, p. 6, 2015.

[7] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in Computer Vision and Pattern Recognition. IEEE, 2009, pp. 248–255.

[8] X. Wang and X. Tang, “Hallucinating face by eigentransformation,” Systems, Man, and Cybernetics, Part C: Applications and Reviews, vol. 35, no. 3, pp. 425–434, Aug 2005.

[9] P. H. Hennings-Yeomans, S. Baker, and B. V. Kumar, “Simultaneous super-resolution and feature extraction for recognition of low-resolution faces,” in Computer Vision and Pattern Recognition, 2008, pp. 1–8.

[10] C. Herrmann, “Extending a local matching face recognition approach to low-resolution video,” in Advanced Video and Signal Based Surveillance, 2013.

[11] V. Ojansivu and J. Heikkila, “Blur Insensitive Texture Classification Using Local Phase Quantization,” in Image and Signal Processing. Springer, 2008, pp. 236–243.

[12] M. Zhao, J. Yagnik, H. Adam, and D. Bau, “Large Scale Learning and Recognition of Faces in Web Videos,” in Automatic Face and Gesture Recognition, 2008, pp. 1–7.

[13] S. Chen, S. Mau, M. T. Harandi, C. Sanderson, A. Bigdeli, and B. C. Lovell, “Face Recognition from Still Images to Video Sequences: A Local-feature-based Framework,” EURASIP Journal on Image and Video Processing, 2011.

[14] H. Cevikalp and B. Triggs, “Face recognition based on image sets,” in Computer Vision and Pattern Recognition, 2010.

[15] K. Fukui and O. Yamaguchi, “Face Recognition Using Multi-viewpoint Patterns for Robot Vision,” Robotics Research, pp. 192–201, 2005.

[16] A. Hadid and M. Pietikainen, “From still image to video-based face recognition: an experimental analysis,” in Automatic Face and Gesture Recognition. IEEE, 2004, pp. 813–818.

[17] K. Lee, J. Ho, M. Yang, and D. Kriegman, “Video-Based Face Recognition Using Probabilistic Appearance Manifolds,” Computer Vision and Pattern Recognition, vol. 1, pp. 313–320, 2003.

[18] A. Hadid and M. Pietikainen, “Manifold learning for video-to-video face recognition,” Biometric ID Management and Multimodal Communication, pp. 9–16, 2009.

[19] M. E. Wibowo, D. Tjondronegoro, L. Zhang, and I. Himawan, “Heteroscedastic probabilistic linear discriminant analysis for manifold learning in video-based face recognition,” in Workshop on Applications of Computer Vision, 2013, pp. 46–52.

[20] O. M. Parkhi, K. Simonyan, A. Vedaldi, and A. Zisserman, “A Compact and Discriminative Face Track Descriptor,” in Computer Vision and Pattern Recognition, 2014.

[21] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.

[22] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in International Conference on Learning Representations, 2015.

[23] N. McLaughlin, J. M. Del Rincon, and P. Miller, “Data-augmentation for reducing dataset bias in person re-identification,” in Advanced Video and Signal Based Surveillance. IEEE, 2015, pp. 1–6.

[24] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in International Conference on Machine Learning, 2015.

[25] R. Hadsell, S. Chopra, and Y. LeCun, “Dimensionality reduction by learning an invariant mapping,” in Computer Vision and Pattern Recognition, vol. 2. IEEE, 2006, pp. 1735–1742.

[26] K. Simonyan, O. M. Parkhi, A. Vedaldi, and A. Zisserman, “Fisher vector faces in the wild,” in British Machine Vision Conference, vol. 1, no. 2, 2013, p. 7.

[27] D. Eberly, 3D Game Engine Design: A Practical Approach to Real-Time Computer Graphics. CRC Press, 2006.

[28] L. Liu, L. Zhang, H. Liu, and S. Yan, “Toward large-population face identification in unconstrained videos,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 24, no. 11, pp. 1874–1884, 2014.

[29] H.-W. Ng and S. Winkler, “A data-driven approach to cleaning large face datasets,” in International Conference on Image Processing. IEEE, 2014, pp. 343–347.

[30] Y. Guo, L. Zhang, Y. Hu, X. He, and J. Gao, “MS-Celeb-1M: Challenge of Recognizing One Million Celebrities in the Real World,” in Imaging and Multimedia Analytics in a Web and Mobile World, 2016.

[31] X. Zhang, L. Zhang, X.-J. Wang, and H.-Y. Shum, “Finding celebrities in billions of web images,” IEEE Transactions on Multimedia, vol. 14, no. 4, pp. 995–1007, 2012.

[32] L. Wolf, T. Hassner, and I. Maoz, “Face recognition in unconstrained videos with matched background similarity,” in Computer Vision and Pattern Recognition, 2011.

[33] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, “Caffe: Convolutional Architecture for Fast Feature Embedding,” arXiv preprint arXiv:1408.5093, 2014.

[34] T. Ahonen, A. Hadid, and M. Pietikainen, “Face Description with Local Binary Patterns: Application to Face Recognition,” Pattern Analysis and Machine Intelligence, vol. 28, no. 12, pp. 2037–2041, 2006.