
Bilinear discriminative dictionary learning for face recognition

Hui-Dong Liu a, Ming Yang a,*, Yang Gao b, Yilong Yin c, Liang Chen a

a School of Computer Science and Technology, Nanjing Normal University, Nanjing 210023, PR China
b State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, PR China
c School of Computer Science and Technology, Shandong University, Jinan 250101, PR China

* Correspondence to: No. 1, Wenyuan Road, Qixia District, Nanjing, Jiangsu 210023, China. Tel.: +86 25 85891987; fax: +86 25 85891685. E-mail address: [email protected] (M. Yang).

Article info

Article history:
Received 30 March 2013
Accepted 3 November 2013
Available online 15 November 2013

Keywords:
Bilinear
Dictionary learning
Face recognition
l2-Norm regularization

Abstract

This work presents a novel dictionary learning method based on the l2-norm regularization to learn a dictionary more suitable for face recognition. By optimizing the reconstruction error for each class using the dictionary atoms associated with that class, we learn a structured dictionary which is able to make the reconstruction error for each class more discriminative for classification. Moreover, to make the coding coefficients of samples coded over the learned dictionary discriminative, a discriminative term bilinear to the training samples and the coding coefficients is incorporated in our dictionary learning model. The bilinear discriminative term essentially resolves a linear regression problem for patterns concatenated from the training samples and the coding coefficients in the Reproducing Kernel Hilbert Space (RKHS). Consequently, a novel classifier based on the bilinear discriminative model is also proposed. Experimental results on the AR, CMU PIE, CAS-PEAL-R1, and the Sheffield (previously UMIST) face databases show that the proposed method is effective under expression, lighting, and pose variations in face recognition, as well as for gender classification, compared with recently proposed face recognition methods and dictionary learning methods.

© 2013 Elsevier Ltd. All rights reserved.

1. Introduction

Face recognition is an interesting and popular issue in the areas of pattern recognition, computer vision, etc. In recent years, owing to the rise of Sparse Representation (Coding) [1,2] and the progress in optimizing l1-norm based minimization problems [3–8], there has been abundant literature on sparse representation based image classification [1–14], and encouraging results have been achieved by sparse representation based classification methods.

Sparse representation assumes that a test sample can be represented as a linear combination of a few atoms in a dictionary. Such a dictionary could be composed of either the original training samples or some redundant bases learned from the training samples. However, the original training samples may contain outliers or noise, and may only reveal a weak discriminative structure among classes. Therefore, dictionary learning techniques have been proposed to learn a set of dictionary atoms from the given samples, so that the learned atoms can better reconstruct the test samples.

Several dictionary learning methods have been proposed to fulfill this task. In Ref. [15], Aharon et al. proposed the KSVD algorithm to learn an overcomplete dictionary from the training samples by imposing the l0-norm constraint on the coding coefficients, without any supervised information. Therefore, the dictionary learned by KSVD cannot be used for classification. Based on Ref. [15], Pham and Venkatesh [16] and Mairal et al. [17] added different kinds of supervised information to make the learned dictionary discriminative for different classes. Specifically, in Ref. [16], a linear regression term for the coding coefficients of the labeled data was incorporated in the dictionary learning model. Different from Ref. [16], Mairal et al. [17] added a term which aimed to make the reconstruction errors for different classes discriminative. Later, Mairal et al. [18] proposed supervised dictionary learning (SDL), where the coefficients of the samples coded over the learned dictionary are discriminative for classification tasks. Zhang et al. [19] proposed Discriminative KSVD (DKSVD) for dictionary learning. DKSVD is able to simultaneously learn the dictionary, the coding coefficients, and a linear classifier for the coding coefficients in one model using the KSVD algorithm. In Ref. [20], Yang et al. proposed a method to learn a dictionary for each separate class. Ramirez et al. [21] imposed incoherence on the dictionaries associated with different classes to make them as orthogonal as possible. Based on KSVD, Jiang et al. [22] proposed LC-KSVD1, which adds a label consistency term to KSVD to associate the label information with the dictionary items, and subsequently LC-KSVD2, which further adds a classification error term to LC-KSVD1. Yang et al. [23] applied the Fisher discrimination criterion to the dictionary atoms to learn a structured dictionary; meanwhile, the Fisher discrimination criterion was also imposed on the coding coefficients to maximize the between-class scatter and minimize the within-class scatter. In Ref. [24], a dictionary is learned so that the coding coefficients of the training samples can be fed to different tasks, such as semi-supervised learning, regression, binary classification, multiclass classification, etc. However, the methods proposed in Refs. [16–19,22,24] only tried, in different ways, to exploit the discriminability of the coding coefficients to learn a dictionary, and neglected to make the learned dictionary discriminative for each class. Conversely, the methods in Refs. [17,20,21] only focused on minimizing the reconstruction error for each separate class to make the dictionary discriminative, and did not explore the discriminative information hidden in the coding coefficients. Although the method proposed in Ref. [23] applied the Fisher discrimination criterion to make both the dictionary and the coding coefficients discriminative, the Fisher discrimination criterion itself can enforce strong sparsity on the coding coefficients even if no sparse regularization term is imposed. Besides, all the dictionary learning methods mentioned above either impose the l0-norm or l1-norm regularization or otherwise force strong sparsity on the coding coefficients to learn a dictionary.

Nevertheless, recent research suggests that face recognition is not a sparse representation or compressive sensing problem [25], but a collaborative representation problem [26,27]. Because face recognition is a small sample size (SSS) problem [28], given a high dimensional query face image y, the samples in the dictionary sharing the same label as y are typically inadequate to represent y, and samples from other classes can also contribute to reconstructing y. Refs. [26,27] show that, by replacing the l1-norm regularization with the l2-norm regularization, collaborative representation based classification (CRC) [26,27] achieves results competitive with sparse representation based classification (SRC) [1] at a much lower computational cost.

In this paper, we propose a novel dictionary learning method named bilinear discriminative dictionary learning (BDDL) for face recognition. In our dictionary learning model, we not only make the dictionary atoms associated with each class discriminative, but also exploit the discriminability hidden in the coding coefficients. Since it has been suggested that the l2-norm regularization is good for coding face images, in our dictionary learning method we impose the l2-norm regularization on the coding coefficients. Next, we highlight some characteristics of our approach:

(1) The l2-norm regularization is adopted in our dictionary learning method, which not only makes the learned dictionary more suitable for face recognition, but also avoids the high computational cost of imposing the l1-norm regularization on the coding coefficients.

(2) To make the dictionary discriminative for each class, we minimize the reconstruction error for each class using the dictionary atoms associated with that class. In this way, a structured dictionary is learned, and the learned dictionary is able to make the reconstruction error for each class more discriminative for classification.

(3) To exploit the discriminative information hidden in the coding coefficients, we make use of both the original training samples and the coding coefficients in our discriminative term, because we hope that the discriminability of the original training samples and that of the coding coefficients can complement each other. Thus, we construct a new pattern by concatenating a training sample and a coding coefficient, and model a linear regression problem for the new patterns in the Reproducing Kernel Hilbert Space (RKHS) in our dictionary learning method. The final expression of this discriminative term is a bilinear regression term in the training samples and the coding coefficients.

(4) From the mechanism of the proposed BDDL, we define the bilinear discriminative features (BDF) of a sample. The BDF of a sample is obtained by the bilinear transformation of all the training samples and the coding coefficient of this sample. Along with the BDFs, we propose a novel classifier, the bilinear discriminative classifier (BDC), to classify samples using their BDFs, and the confidence of the proposed BDC is evaluated based on the Kullback–Leibler divergence (K–L divergence) [29].

The rest of this paper is organized as follows. Section 2 introduces the proposed bilinear discriminative dictionary learning (BDDL) method. In Section 3, we provide the optimization strategy for BDDL. Section 4 introduces classification approaches suitable for the dictionary learned by BDDL. Experimental results are presented in Section 5. Finally, we provide concluding remarks and future work in Section 6.

2. Bilinear discriminative dictionary learning (BDDL)

In this section, we propose a novel dictionary learning method: Bilinear Discriminative Dictionary Learning (BDDL). We first give some notation below.

Given the training samples of the $i$-th class $X_i = [x_1, \ldots, x_{n_i}] \in \mathbb{R}^{m \times n_i}$, where $x_j \in \mathbb{R}^m$, $j = 1, \ldots, n_i$. Suppose we have $c$ classes, and let $X = [X_1, \ldots, X_c] \in \mathbb{R}^{m \times n}$ be the training data matrix, where $n = n_1 + \cdots + n_c$. Dictionary learning aims to learn a dictionary $D = [D_1, \ldots, D_c] \in \mathbb{R}^{m \times k}$, where $D_i \in \mathbb{R}^{m \times k_i}$ is the sub-dictionary associated with the $i$-th class $X_i$, and $k = k_1 + \cdots + k_c$. Let $A_i = [A_i^1; \ldots; A_i^j; \ldots; A_i^c] \in \mathbb{R}^{k \times n_i}$ denote the coefficients of $X_i$ coded over the dictionary $D$, where $A_i^j \in \mathbb{R}^{k_j \times n_i}$ is the $j$-th sub-coefficient matrix of $A_i$, denoting the coding coefficients of $X_i$ coded over the sub-dictionary $D_j$.

The framework of our BDDL model is expressed as follows:

$$\min_{(D, A, W)} \; f(X, D, A) + \alpha\, t(L, X, W, A) + \beta\, \Omega(A) \qquad (1)$$

The model comprises three parts: the reconstruction error term $f(X, D, A)$, the bilinear discriminative term $t(L, X, W, A)$, and the regularization term $\Omega(A)$. Next, we shall present the details of the three terms.

2.1. The reconstruction error term $f(X, D, A)$ and the regularization term $\Omega(A)$

For $f(X, D, A)$, we adopt the fidelity term defined in Ref. [23]. $f(X, D, A)$ in our BDDL is designed as follows:

$$f(X, D, A) = \|X - DA\|_F^2 + \sum_{i=1}^{c} \|X_i - D_i A_i^i\|_F^2 \qquad (2)$$

where $\|\cdot\|_F$ denotes the Frobenius norm of a matrix. In $f(X, D, A)$, the first term $\|X - DA\|_F^2$ signifies that the training samples $X$ can be well represented by the dictionary $D$ with the coding coefficients $A$. The second term $\sum_{i=1}^{c} \|X_i - D_i A_i^i\|_F^2$ ensures that the training samples $X_i$ are represented by the $i$-th sub-dictionary $D_i$, which is composed of the dictionary atoms from class $i$. In $f(X, D, A)$, we drop the last term $\sum_{i=1}^{c} \sum_{j=1, j \neq i}^{c} \|D_j A_i^j\|_F^2$ of the fidelity term used in Ref. [23], because it has been suggested that face recognition is not a compressive sensing problem [25] but a collaborative representation problem [26,27], and because the term $\sum_{i=1}^{c} \sum_{j=1, j \neq i}^{c} \|D_j A_i^j\|_F^2$ can enforce strong sparsity. To demonstrate this, we conduct an experiment using $\min_{(D, A)} \|X - DA\|_F^2 + \sum_{i=1}^{c} \big(\|X_i - D_i A_i^i\|_F^2 + \sum_{j=1, j \neq i}^{c} \|D_j A_i^j\|_F^2\big)$ on the AR database. Fig. 1(a) plots the coding coefficient of the 100th training sample, which belongs to the 15th subject. From Fig. 1(a) we can see that even though no sparse regularization term, such as $\sum_{i=1}^{n} \|a_i\|_1$, is imposed, the term $\sum_{i=1}^{c} \sum_{j=1, j \neq i}^{c} \|D_j A_i^j\|_F^2$ still enforces strong sparsity. Since we restrict our dictionary learning method to face recognition, we adopt only the first two terms of the fidelity term proposed in Ref. [23].

Additionally, traditional dictionary learning methods [20,21,23,24] typically impose the $l_0$-norm or $l_1$-norm regularization on the coding coefficients $A$. Motivated by Refs. [26,27], in this paper we set $\Omega(A)$ to $\|A\|_F^2 = \sum_{i=1}^{n} \|a_i\|_2^2$ for face-recognition-oriented dictionary learning. Fig. 1(b) plots the coding coefficient of the same sample as displayed in Fig. 1(a), computed by BDDL. The coefficient in Fig. 1(b) is not as sparse as the one shown in Fig. 1(a), implying that although the 100th sample is mainly reconstructed from the dictionary atoms of the 15th class, atoms from other classes also make a small contribution to reconstructing this sample. This not only follows the claim that face recognition is a collaborative representation problem [26,27], but also, to some extent, avoids over-fitting.

Fig. 1. (a) The coding coefficient of the 100th sample from the training set of the AR database computed by $\min_{(D,A)} \|X - DA\|_F^2 + \sum_{i=1}^{c} \big(\|X_i - D_i A_i^i\|_F^2 + \sum_{j=1, j\neq i}^{c} \|D_j A_i^j\|_F^2\big)$. (b) The coding coefficient of the same sample computed by BDDL. (See Section 5.1 for the experimental settings on this database.)
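As a concrete illustration of Eq. (2) and of the regularizer $\Omega(A) = \|A\|_F^2$, the two terms can be evaluated with a few lines of NumPy. This is our own sketch, not the authors' code; the per-class column/row index lists are our bookkeeping convention for the sub-matrices $X_i$, $D_i$, and $A_i^i$.

```python
import numpy as np

def fidelity_term(X, D, A, class_cols, atom_rows):
    """Eq. (2): ||X - D A||_F^2 + sum_i ||X_i - D_i A_i^i||_F^2."""
    total = np.linalg.norm(X - D @ A, 'fro') ** 2
    for cols, rows in zip(class_cols, atom_rows):
        Xi = X[:, cols]                    # training samples of class i
        Di = D[:, rows]                    # sub-dictionary of class i
        Aii = A[np.ix_(rows, cols)]        # coefficients of X_i coded over D_i
        total += np.linalg.norm(Xi - Di @ Aii, 'fro') ** 2
    return total

def regularizer(A):
    """Omega(A) = ||A||_F^2, the l2-norm regularization adopted in BDDL."""
    return np.linalg.norm(A, 'fro') ** 2
```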

2.2. The bilinear discriminative term $t(L, X, W, A)$

The term $t(L, X, W, A)$ is a bilinear discriminative term defined as follows:

$$t(L, X, W, A) = \|L - X^T W A\|_F^2 \qquad (3)$$

where $L \in \{0, 1\}^{n \times n}$ is the discriminative matrix defined by

$$L_{ij} = \begin{cases} 1, & x_i \text{ and } x_j \text{ belong to the same class} \\ 0, & \text{otherwise} \end{cases} \qquad (4)$$

As we can see from the definition of this term, $t(L, X, W, A)$ is bilinear to $X$ and $A$. Next we give the interpretation of this term.
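Before turning to the interpretation, here is a minimal sketch (ours, not the authors') of how the discriminative matrix $L$ of Eq. (4) and the value of the bilinear term in Eq. (3) can be computed from a vector of class labels:

```python
import numpy as np

def discriminative_matrix(labels):
    """Eq. (4): L[i, j] = 1 iff samples i and j share the same class label."""
    labels = np.asarray(labels)
    return (labels[:, None] == labels[None, :]).astype(float)

def bilinear_term(L, X, W, A):
    """Eq. (3): t(L, X, W, A) = ||L - X^T W A||_F^2."""
    return np.linalg.norm(L - X.T @ W @ A, 'fro') ** 2
```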

Refs. [23,30,31] show that the coding coefficients also have discriminative power. In Ref. [30], the low dimensional embedding of the data is obtained by preserving the inner products of the coding coefficients of the data coded over a dictionary in the original space. In Ref. [31], the coding coefficients are used as new features of the original training samples for dimensionality reduction (DR). In Ref. [23], the Fisher discrimination criterion is applied to the coding coefficients to make the learned dictionary more discriminative for the samples represented by those coefficients.

Motivated by the above works, we try to learn a dictionary by exploiting the discriminative information hidden in the coding coefficients of the training samples, so that the coefficients of the testing samples coded over the learned dictionary are discriminative. Different from the previous works, however, we utilize both the original training samples $X$ and the coding coefficients $A$ in our discriminative term, so that the discriminability of $X$ and the discriminability of $A$ can complement each other.

Let $\theta_{ij} = [x_i^T, a_j^T]^T \in \mathbb{R}^{m+k}$ be a new pattern formed by concatenating sample $x_i$ and the coding coefficient $a_j$ of sample $x_j$. We hope to design a function $g(\theta_{ij})$ so that if $x_i$ and $a_j$ are associated with the same class, i.e., $x_i$ and $x_j$ belong to the same class, then $g(\theta_{ij})$ should be close to 1; otherwise it should be close to 0. Thus, the problem can be formulated as the following regression problem:

$$\min_{g} \sum_{i=1}^{n} \sum_{j=1}^{n} (L_{ij} - g(\theta_{ij}))^2 \qquad (5)$$

Since face images in the original input space generally lie on a nonlinear manifold, it is not appropriate to treat the problem in Eq. (5) as a linear regression problem in the original space. Therefore, we apply the kernel trick. Denote by $\phi: \theta_{ij} \in \mathcal{O} \rightarrow \phi(\theta_{ij}) \in \mathcal{H}$ the implicit mapping of the samples in the input space $\mathcal{O} \subset \mathbb{R}^{m+k}$ to a Reproducing Kernel Hilbert Space (RKHS) $\mathcal{H}$, which could be high dimensional or possibly infinite dimensional. We assume that in the RKHS the data $\{\phi(\theta_{ij}), L_{ij}\}$, $i = 1, \ldots, n$, $j = 1, \ldots, n$, can be fitted by linear regression. Let $g(\theta_{ij}) = w^T \phi(\theta_{ij})$. The least squares regression in the RKHS is then expressed as

$$\min_{w} \sum_{i=1}^{n} \sum_{j=1}^{n} (L_{ij} - w^T \phi(\theta_{ij}))^2 \qquad (6)$$

where $w$ is the weight vector. According to Refs. [32,33], $w$ can be expressed as a linear combination of the $\phi(\theta_{ij})$. Thus, there exist coefficients $\alpha_{uv}$ such that $w = \sum_{u=1}^{n} \sum_{v=1}^{n} \alpha_{uv} \phi(\theta_{uv})$. Let $\alpha = [\alpha_{11}, \ldots, \alpha_{1n}, \ldots, \alpha_{21}, \ldots, \alpha_{2n}, \ldots, \alpha_{n1}, \ldots, \alpha_{nn}]^T$; hence, the problem in Eq. (6) is equivalent to

$$\min_{\alpha} \sum_{i=1}^{n} \sum_{j=1}^{n} \Big(L_{ij} - \sum_{u=1}^{n} \sum_{v=1}^{n} \alpha_{uv} \phi(\theta_{uv})^T \phi(\theta_{ij})\Big)^2 \qquad (7)$$

Since the kernel is expressed as $k(\theta_{ij}, \theta_{uv}) = \phi(\theta_{ij})^T \phi(\theta_{uv})$, we further have

$$\min_{\alpha} \sum_{i=1}^{n} \sum_{j=1}^{n} \Big(L_{ij} - \sum_{u=1}^{n} \sum_{v=1}^{n} \alpha_{uv}\, k(\theta_{ij}, \theta_{uv})\Big)^2 \qquad (8)$$

Let $K \in \mathbb{R}^{n^2 \times n^2}$ denote the kernel matrix of the kernel $k(\cdot, \cdot)$ and let $l = [L_{11}, \ldots, L_{1n}, \ldots, L_{21}, \ldots, L_{2n}, \ldots, L_{n1}, \ldots, L_{nn}]^T$. Thus, Eq. (8) can be expressed as

$$\min_{\alpha} \|l - K\alpha\|_2^2 \qquad (9)$$

$\alpha$ can be computed by

$$\alpha = (K^T K)^{-1} K^T l \qquad (10)$$

Since the kernel matrix $K$ is an $n^2 \times n^2$ matrix, the time complexity of computing Eq. (10) is approximately $O(n^{4.75})$ [34], which mainly involves one multiplication and one inversion of $n^2 \times n^2$ matrices.

In order to avoid high computational cost and simplify the problem, we choose the kernel $k(\theta_{ij}, \theta_{uv}) = x_i^T x_u\, a_v^T a_j$, which is the tensor product of two linear kernels, one on the $x$'s and the other on the $a$'s. So, Eq. (8) becomes

$$\min_{\alpha} \sum_{i=1}^{n} \sum_{j=1}^{n} \Big(L_{ij} - \sum_{u=1}^{n} \sum_{v=1}^{n} \alpha_{uv}\, x_i^T x_u a_v^T a_j\Big)^2 \qquad (11)$$

Replacing $\sum_{u=1}^{n} \sum_{v=1}^{n} \alpha_{uv} x_u a_v^T$ by $W$, which we call the bilinear filter, we propose the following problem to approximate the problem in Eq. (11):

$$\min_{W} \sum_{i=1}^{n} \sum_{j=1}^{n} (L_{ij} - x_i^T W a_j)^2 \qquad (12)$$

We use $\hat{g}(\theta_{ij}) = x_i^T W a_j$ to approximate $g(\theta_{ij})$. Finally, we get the following problem:

$$\min_{W} J(W) = \|L - X^T W A\|_F^2 \qquad (13)$$

By differentiating the above formula and setting the derivative to 0, we have

$$\frac{\partial J(W)}{\partial W} = 2 X X^T W A A^T - 2 X L A^T = 0 \qquad (14)$$

Hence, $W$ can be analytically obtained as $W = (X X^T)^{-1} X L A^T (A A^T)^{-1}$. Let $P = (X X^T)^{-1} X L$, so $P$ can be computed and stored in advance. The time complexity of this offline procedure is $O(m^{2.37} + m^2 n + m n^2)$, whereas the online procedure computes $W = P A^T (A A^T)^{-1}$, with time complexity $O(k^{2.37} + m n k + \min(m, n) k^2)$. As the number of atoms $k$ and the number of training samples $n$ are typically of the same order of magnitude, the computational cost of the problem in Eq. (13) is much smaller than that of the problem in Eq. (9).
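The closed-form update can be made concrete with a short NumPy sketch (our illustration, not the authors' implementation); the offline factor $P = (XX^T)^{-1}XL$ is precomputed, and the small ridge `eps` is our addition to keep the Gram matrices invertible.

```python
import numpy as np

def bilinear_filter(X, L, A, eps=1e-6):
    """W = (X X^T)^{-1} X L A^T (A A^T)^{-1}, the minimizer of Eq. (13)."""
    m, _ = X.shape
    k = A.shape[0]
    P = np.linalg.solve(X @ X.T + eps * np.eye(m), X @ L)        # offline: (X X^T)^{-1} X L
    W = np.linalg.solve(A @ A.T + eps * np.eye(k), A @ P.T).T    # online: P A^T (A A^T)^{-1}
    return W
```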

In addition, the term $t(L, X, W, A) = \|L - X^T W A\|_F^2$ can be interpreted in another way. Let $U = W^T X$ be the linearly transformed features of $X$. We hope that the inner products between the transformed features $U$ and the coding coefficients $A$ are preserved. More specifically, if $u_i$ and $a_j$ are associated with different classes, then they should be as orthogonal as possible, i.e., $u_i^T a_j$ should be near 0. Otherwise, they should be as close as possible; in this case, we restrict the inner product $u_i^T a_j$ to be near 1. Therefore, the above analysis can be formulated as

$$\min_{U} \sum_{i=1}^{n} \sum_{j=1}^{n} (L_{ij} - u_i^T a_j)^2 \qquad (15)$$

Replacing $U$ by $W^T X$, we obtain the bilinear term $t(L, X, W, A)$.

2.3. The BDDL model

Our bilinear discriminative dictionary learning model is expressed as follows:

$$\min_{(D, A, W)} \|X - DA\|_F^2 + \sum_{i=1}^{c} \|X_i - D_i A_i^i\|_F^2 + \alpha \|L - X^T W A\|_F^2 + \beta \|A\|_F^2 \qquad (16)$$

However, the model is not jointly convex w.r.t. $(D, A, W)$. In the next section, we shall provide the optimization strategy for the proposed BDDL model.
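Combining the pieces above, the full objective in Eq. (16) can be evaluated, e.g., to monitor convergence during training. This short function reuses the `fidelity_term`, `bilinear_term`, and `regularizer` helpers sketched earlier in this section and is, again, only our illustration.

```python
def bddl_objective(X, D, A, W, L, alpha, beta, class_cols, atom_rows):
    """Value of the BDDL objective in Eq. (16)."""
    return (fidelity_term(X, D, A, class_cols, atom_rows)
            + alpha * bilinear_term(L, X, W, A)
            + beta * regularizer(A))
```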

3. Optimization of BDDL

The objective function in our model is not jointly convex w.r.t. $(D, A, W)$. Nonetheless, it is convex in each of $D$, $A$, and $W$ when the others are fixed. Alternating iterative methods are commonly used for solving dictionary learning problems. Therefore, the proposed BDDL model can be optimized by alternately optimizing $D$, $A$, and $W$.

Firstly, we fix $D$ and $W$, and optimize $A$. In this case, the problem in Eq. (16) reduces to

$$\min_{A} \|X - DA\|_F^2 + \sum_{i=1}^{c} \|X_i - D_i A_i^i\|_F^2 + \alpha \|L - X^T W A\|_F^2 + \beta \|A\|_F^2 \qquad (17)$$

The reduced problem is not only convex in $A$ but also smooth. However, because of the term $\sum_{i=1}^{c} \|X_i - D_i A_i^i\|_F^2$, the closed-form solution to the problem cannot be obtained conveniently. Therefore, we employ the FISTA algorithm [4] to solve the problem in Eq. (17).
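Since the objective in Eq. (17) is smooth, the gradient that FISTA needs can be written down explicitly. The sketch below is our reading of the model (the `class_cols`/`atom_rows` index lists are our convention), not the authors' implementation.

```python
import numpy as np

def grad_A(A, X, D, W, L, alpha, beta, class_cols, atom_rows):
    """Gradient of the objective in Eq. (17) with respect to A."""
    G = 2 * D.T @ (D @ A - X)                          # from ||X - D A||_F^2
    G += 2 * alpha * (W.T @ X) @ (X.T @ W @ A - L)      # from alpha ||L - X^T W A||_F^2
    G += 2 * beta * A                                   # from beta ||A||_F^2
    for cols, rows in zip(class_cols, atom_rows):       # from sum_i ||X_i - D_i A_i^i||_F^2
        Di = D[:, rows]
        Aii = A[np.ix_(rows, cols)]
        G[np.ix_(rows, cols)] += 2 * Di.T @ (Di @ Aii - X[:, cols])
    return G
```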

Secondly, we fix $A$ and $W$ and optimize $D$. Thus, Eq. (16) reduces to

$$\min_{D} \|X - DA\|_F^2 + \sum_{i=1}^{c} \|X_i - D_i A_i^i\|_F^2 \qquad (18)$$

In fact, the problem in Eq. (18) is a sub-problem of the problem in Ref. [23]. Similar to Ref. [23], we update $D$ class by class. When we optimize $D_i$, all $D_j$, $j \neq i$, are fixed. Hence, Eq. (18) further reduces to

$$\min_{D_i} \Big\|X - \sum_{j=1, j \neq i}^{c} D_j A^j - D_i A^i\Big\|_F^2 + \|X_i - D_i A_i^i\|_F^2 \qquad (19)$$

Note that, in dictionary learning, each atom in $D$ is restricted to have unit norm. Thus, we adopt the method proposed in Refs. [20,23] to solve the problem in Eq. (19).

Thirdly, we fix $D$ and $A$, and optimize $W$. The objective function reduces to

$$\min_{W} J(W) = \|L - X^T W A\|_F^2 \qquad (20)$$

According to Eq. (14), $W$ can be analytically obtained as $W = (X X^T)^{-1} X L A^T (A A^T)^{-1}$.

Finally, we summarize the algorithm of our BDDL model in Algorithm 1.

Algorithm 1: Bilinear Discriminative Dictionary Learning (BDDL)

Input: a matrix of training samples $X = [X_1, \ldots, X_c] \in \mathbb{R}^{m \times n}$ for $c$ classes, the numbers of dictionary atoms $[k_1, \ldots, k_c]$, $\alpha$ and $\beta$
Output: the learned dictionary $D$.
1: Generate the discriminative matrix $L$ according to Eq. (4).
2: Initialization: initialize the atoms of $D$ with random vectors having unit $l_2$-norm; initialize the coding coefficients $A$ by $A = (D^T D)^{-1} D^T X$; initialize the bilinear filter $W$ by $W = (X X^T)^{-1} X L A^T (A A^T)^{-1}$.
3: Update the coding coefficients $A$: fix $D$ and $W$, and update $A$ by solving $\min_A \|X - DA\|_F^2 + \sum_{i=1}^{c} \|X_i - D_i A_i^i\|_F^2 + \alpha \|L - X^T W A\|_F^2 + \beta \|A\|_F^2$ using the FISTA algorithm [4].
4: Update the dictionary $D$: fix $A$ and $W$, and update $D$ class by class using the method proposed in Refs. [20,23].
5: Update the bilinear filter $W$ by $W = (X X^T)^{-1} X L A^T (A A^T)^{-1}$.
6: Go back to Step 3 until convergence is met or the maximum number of iterations is reached.
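To show how the pieces fit together, here is a high-level skeleton of Algorithm 1 under deliberate simplifications of ours: plain gradient steps stand in for FISTA in Step 3, and a least-squares refit with atom renormalization stands in for the class-wise dictionary update of Refs. [20,23] in Step 4. It reuses the `bilinear_filter` and `grad_A` helpers sketched above and is only a structural sketch, not the authors' implementation.

```python
import numpy as np

def bddl_sketch(X, labels, atoms_per_class, alpha, beta, n_iter=50, lr=1e-3):
    """Skeleton of Algorithm 1 with simplified sub-solvers (see text above)."""
    labels = np.asarray(labels)
    m, n = X.shape
    classes = np.unique(labels)
    L = (labels[:, None] == labels[None, :]).astype(float)       # Step 1, Eq. (4)

    # Step 2: random unit-norm atoms, least-squares coding, closed-form W.
    D = np.random.randn(m, atoms_per_class * len(classes))
    D /= np.linalg.norm(D, axis=0, keepdims=True)
    A = np.linalg.lstsq(D, X, rcond=None)[0]
    W = bilinear_filter(X, L, A)

    atom_labels = np.repeat(classes, atoms_per_class)
    class_cols = [np.where(labels == c)[0] for c in classes]
    atom_rows = [np.where(atom_labels == c)[0] for c in classes]

    for _ in range(n_iter):
        # Step 3 (simplified): gradient steps on Eq. (17) instead of FISTA.
        for _ in range(10):
            A -= lr * grad_A(A, X, D, W, L, alpha, beta, class_cols, atom_rows)
        # Step 4 (simplified): least-squares refit of D on ||X - D A||_F^2 only,
        # followed by renormalizing each atom to unit l2-norm.
        D = np.linalg.lstsq(A.T, X.T, rcond=None)[0].T
        D /= np.linalg.norm(D, axis=0, keepdims=True) + 1e-12
        # Step 5: closed-form bilinear filter, Eq. (14).
        W = bilinear_filter(X, L, A)
    return D, A, W
```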


4. Classification approaches

With the learned dictionary $D$, various classifiers can be used for classification. As we adopt the regularization term $\|A\|_F^2$ in our dictionary learning model, collaborative representation based classification methods [26,27] are preferable for the dictionary learned by BDDL. In addition, since our dictionary learning model is a bilinear discriminative model, we also propose a bilinear discriminative classifier (BDC). In Section 4.1, we give a brief review of collaborative representation based classification, and in Sections 4.2 and 4.3 we present the proposed bilinear discriminative classifier (BDC) and its confidence measure, respectively.

4.1. Collaborative representation based classification (CRC)

CRC [26,27] was proposed for face recognition. Compared to SRC [1], the regularization term is $\|a\|_2^2$ rather than $\|a\|_1$. Given a test sample $y$, the coding coefficient in the CRC model is obtained by solving

$$a = \arg\min_{a} \|y - Da\|_2^2 + \lambda \|a\|_2^2 \qquad (21)$$

where $\lambda$ is the regularization parameter. The regularized reconstruction error for each class is computed by

$$r_i = \|y - D_i a_i\|_2 / \|a_i\|_2 \qquad (22)$$

where $a_i$ is the coding coefficient associated with class $i$. CRC with regularized least squares (CRC_RLS) assigns the test sample $y$ to the class with the minimal regularized reconstruction error:

$$\mathrm{Identify}(y) = \arg\min_i (r_i) \qquad (23)$$

Similarly, if we replace the regularized reconstruction error in CRC_RLS with the plain reconstruction error

$$r'_i = \|y - D_i a_i\|_2 \qquad (24)$$

then we obtain the CRC_LS classifier:

$$\mathrm{Identify}(y) = \arg\min_i (r'_i) \qquad (25)$$

In our experiments, we will employ CRC_RLS and CRC_LS as our classification methods.
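A compact sketch of CRC_RLS and CRC_LS over a (learned) dictionary follows; it is our illustration, and `class_atom_idx`, a list of per-class atom index arrays, is our bookkeeping device rather than anything defined in the paper.

```python
import numpy as np

def crc_classify(y, D, class_atom_idx, lam=1e-3, regularized=True):
    """CRC_RLS (Eqs. (21)-(23)) or CRC_LS (Eqs. (24)-(25)) for one test sample y."""
    k = D.shape[1]
    a = np.linalg.solve(D.T @ D + lam * np.eye(k), D.T @ y)   # Eq. (21), ridge coding
    errors = []
    for idx in class_atom_idx:                                # atoms of each class
        resid = np.linalg.norm(y - D[:, idx] @ a[idx])        # Eq. (24)
        if regularized:
            resid /= (np.linalg.norm(a[idx]) + 1e-12)         # Eq. (22)
        errors.append(resid)
    return int(np.argmin(errors))                             # Eq. (23)/(25)
```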

4.2. The bilinear discriminative classifier (BDC)

In this subsection, we propose a novel classifier named the Bilinear Discriminative Classifier (BDC). BDC utilizes the information of both the original training samples and the corresponding coding coefficients.

Once the dictionary $D$ is learned by BDDL, we obtain the coding coefficients $A$ of the training samples $X$ by solving

$$A = \arg\min_{A} \|X - DA\|_F^2 + \lambda_1 \|A\|_F^2 \qquad (26)$$

where $\lambda_1$ is the regularization parameter. By differentiating the above formula and setting the derivative to zero, we have

$$A = (D^T D + \lambda_1 I)^{-1} D^T X \qquad (27)$$

Next, we compute the bilinear filter $W$ by solving the following problem:

$$W = \arg\min_{W} \|L - X^T W A\|_F^2 \qquad (28)$$

By differentiating the above formula and setting the derivative to zero, we have

$$W = (X X^T)^{-1} X L A^T (A A^T)^{-1} \qquad (29)$$

For a test sample $y$, we solve

$$a = \arg\min_{a} \|y - Da\|_2^2 + \lambda_2 \|a\|_2^2 \qquad (30)$$

to obtain the coding coefficient of $y$:

$$a = (D^T D + \lambda_2 I)^{-1} D^T y \qquad (31)$$

where $\lambda_2$ is the regularization parameter. Let $\delta_i = [x_i^T, a^T]^T$ and $\theta_{ij} = [x_i^T, a_j^T]^T$, where $x_i$ is the $i$-th training sample and $a_j$, the $j$-th column of $A$, is the coding coefficient of $x_j$. Similar to the derivation in Section 2, we have

$$g(\delta_i) = x_i^T W a \qquad (32)$$

According to the mechanism of $g(\delta_i)$, if $x_i$ and $a$ are associated with the same class, i.e., $x_i$ and $y$ belong to the same class, $g(\delta_i)$ should be near 1; otherwise, it should be near 0. We define the bilinear discriminative features (BDF) of the test sample $y$ by

$$\varphi(X, a) = [g(\delta_1), g(\delta_2), \ldots, g(\delta_n)]^T = X^T W a \qquad (33)$$

Similarly, the BDFs of the training set $X$ are given by

$$\Psi = [\varphi(X, a_1), \ldots, \varphi(X, a_n)] = X^T W A \qquad (34)$$

From the definition of the BDF we find that the dimensionality of a sample's BDF equals the number of training samples, and the entries whose indices correspond to the class the sample belongs to tend to have large values, whereas the other entries have relatively small values. Therefore, the BDFs of the samples have good discriminability. Fig. 2(a) illustrates a test image $y$ from the AR database, and Fig. 2(b) depicts the BDF of $y$, which belongs to the first class of the AR database.

Since the BDFs are discriminative, we only need a simple classifier for classification. In our BDC, we use the 1NN classifier with the cosine metric. The algorithm for the Bilinear Discriminative Classifier (BDC) is given in Algorithm 2.

Algorithm 2: The Bilinear Discriminative Classifier (BDC)

Input: a matrix of training samples $X = [X_1, \ldots, X_c] \in \mathbb{R}^{m \times n}$ for $c$ classes, a dictionary $D$, a test sample $y$, $\lambda_1$ and $\lambda_2$
Output: the identity of the test sample $y$.
1: Normalize the columns of $X$ to have unit $l_2$-norm.
2: Generate the discriminative matrix $L$ according to Eq. (4).
3: Code $X$ over $D$ by $A = (D^T D + \lambda_1 I)^{-1} D^T X$.
4: Compute the bilinear filter by $W = (X X^T)^{-1} X L A^T (A A^T)^{-1}$.
5: Compute the BDFs of the training samples by $\Psi = X^T W A$.
6: Code the test sample $y$ over $D$ by $a = (D^T D + \lambda_2 I)^{-1} D^T y$.
7: Compute the BDF of the test sample $y$ by $\varphi = X^T W a$.
8: Classify $\varphi$ against $\Psi$ using the 1NN classifier with the cosine metric and return the identity of the test sample $y$.
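A direct NumPy transcription of Algorithm 2 might look as follows (our sketch; pseudo-inverses replace plain inverses purely for numerical safety, which is our choice and not part of the paper):

```python
import numpy as np

def bdc_classify(y, X, D, labels, lam1=1e-3, lam2=1e-1):
    """Bilinear Discriminative Classifier (Algorithm 2)."""
    labels = np.asarray(labels)
    k = D.shape[1]
    X = X / np.linalg.norm(X, axis=0, keepdims=True)                     # step 1
    L = (labels[:, None] == labels[None, :]).astype(float)               # step 2, Eq. (4)
    A = np.linalg.solve(D.T @ D + lam1 * np.eye(k), D.T @ X)             # step 3, Eq. (27)
    W = np.linalg.pinv(X @ X.T) @ X @ L @ A.T @ np.linalg.pinv(A @ A.T)  # step 4, Eq. (29)
    Psi = X.T @ W @ A                                                    # step 5, Eq. (34)
    a = np.linalg.solve(D.T @ D + lam2 * np.eye(k), D.T @ y)             # step 6, Eq. (31)
    phi = X.T @ W @ a                                                    # step 7, Eq. (33)
    # Step 8: 1NN with the cosine metric between phi and the columns of Psi.
    sims = (Psi.T @ phi) / (np.linalg.norm(Psi, axis=0) * np.linalg.norm(phi) + 1e-12)
    return labels[int(np.argmax(sims))]
```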

Fig. 2. (a) A test face image from the AR database. (b) The BDF of the image in (a).


4.3. The confidence of BDC

In this section, we provide an evaluation criterion to calculate the confidence of our proposed BDC, i.e., to what extent we can trust this classifier.

In BDC, we use the 1NN classifier to classify the test samples against the training set, and the BDFs are used as the features. According to the definitions (Eqs. (33) and (34)) of the BDFs of the test sample $y$ and the training samples $X$, we see that the performance of BDC strongly depends on the relationship between $a$ and $A$, the coding coefficients of $y$ and $X$, respectively. Therefore, we measure the confidence of BDC by computing the divergence between the distribution of the coefficients of the training samples and the distribution of the coefficient of a test sample. The larger the divergence between the two distributions, the larger the distance between the coefficients of the training samples and the coefficient of the test sample, and, hence, the smaller the confidence of BDC. We employ the Kullback–Leibler divergence (K–L divergence) [29] to measure the difference between the two probability distributions. Denote by $p$ and $q$ the probability distributions of the coefficients of the training samples $X$ and the coefficient of a test sample $y$, respectively. The K–L divergence of $q$ from $p$ is defined as follows:

$$D_{KL}(p \| q) = \int p(a) \log\left(\frac{p(a)}{q(a)}\right) da \qquad (35)$$

The divergence $D_{KL}$ ranges in $[0, +\infty)$. To restrict the confidence to (0, 1], we define the confidence of BDC for classifying a test sample $y$ on the training set $X$ by

$$\mathrm{conf}_{BDC}(X, y) = \begin{cases} \tanh\left(\dfrac{1}{D_{KL}(p \| q)}\right), & D_{KL}(p \| q) > 0 \\ 1, & D_{KL}(p \| q) = 0 \end{cases} \qquad (36)$$

Thus, the confidence $\mathrm{conf}_{BDC}(X, y)$ is in (0, 1], and the larger $\mathrm{conf}_{BDC}(X, y)$, the more we can trust BDC for classifying the test sample $y$. In our experiments, we shall investigate the performance of BDC and its confidence on various databases, and provide an empirical confidence threshold for this classifier.
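The confidence measure can be prototyped as follows. The paper does not spell out how the distributions $p$ and $q$ are estimated, so binning the coefficient values into a shared histogram is our assumption; everything else follows Eqs. (35) and (36).

```python
import numpy as np

def bdc_confidence(A, a, bins=50, eps=1e-10):
    """conf_BDC of Eq. (36), with p and q estimated by histograms (our assumption)."""
    edges = np.histogram_bin_edges(np.concatenate([A.ravel(), a.ravel()]), bins=bins)
    p, _ = np.histogram(A.ravel(), bins=edges, density=True)
    q, _ = np.histogram(a.ravel(), bins=edges, density=True)
    p, q = p + eps, q + eps
    p, q = p / p.sum(), q / q.sum()
    dkl = float(np.sum(p * np.log(p / q)))       # discrete analogue of Eq. (35)
    return 1.0 if dkl <= 0 else float(np.tanh(1.0 / dkl))
```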

5. Experiments

We verify the performance of our method BDDL on four databases: the AR database [35], the CAS-PEAL-R1 database [37], the CMU PIE database [36], and the Sheffield (previously UMIST) face database [38]. We compare our results with the 1-nearest-neighbor classifier (1NN), support vector machines (SVMs) [39], sparse representation based classification (SRC) [1], the l2-based method (l2) [25], collaborative representation based classification with regularized least squares (CRC_RLS) [26,27], and some recently proposed dictionary learning methods: Discriminative KSVD (DKSVD) [19], the label consistent KSVD version 1 (LC-KSVD1) and version 2 (LC-KSVD2) [22], and Fisher discrimination dictionary learning (FDDL) [23]. We implement 1NN and SRC on our own. For the other methods in our experiments, we download the codes from the authors' homepages.

In all the experiments, we apply Principal Component Analysis (PCA) [40] to reduce the dimensionality of the samples to 300, and the preprocessed samples are normalized to have unit l2-norm. The RBF kernel is chosen for SVM, and the kernel width 'g' is set to the mean norm of the training data. For all the compared methods on all the databases, parameters are chosen by cross-validation. For the dictionary learning methods, the number of atoms per class is set to 250 in the gender classification experiment on the AR database, and to the number of training samples of each class in the other experiments.
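For reference, the preprocessing amounts to something like the sketch below (scikit-learn based; fitting the PCA on the training set only is our assumption, as the text does not say):

```python
import numpy as np
from sklearn.decomposition import PCA

def preprocess(X_train, X_test, dim=300):
    """PCA to 300 dimensions followed by unit l2-normalization (samples as rows)."""
    pca = PCA(n_components=dim).fit(X_train)
    Z_train, Z_test = pca.transform(X_train), pca.transform(X_test)
    Z_train /= np.linalg.norm(Z_train, axis=1, keepdims=True) + 1e-12
    Z_test /= np.linalg.norm(Z_test, axis=1, keepdims=True) + 1e-12
    return Z_train, Z_test
```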

5.1. The AR database

The AR database contains over 4000 images of 126 subjects. Like Refs. [1,23], we use a subset of this database. This subset contains 1400 images of 100 individuals (50 men and 50 women). For each individual, 14 images were taken in two sessions separated by 2 weeks, and each session contains 7 images with different expressions and illuminations. To test the effectiveness of different methods under expression and mild illumination variations, we choose all 700 images from session 1 for training and the remaining images from session 2 for testing. The 14 images of the first individual are shown in Fig. 3.

Fig. 4 plots the convergence of BDDL on the AR database. We can see from this figure that as the iteration number increases from 1 to 20, the objective function value drops dramatically from over 300 to approximately 170. From then on, it declines gradually until the 100th iteration.

Table 1 reports the recognition rates of the various methods. From this table we find that the highest recognition rate, 95.3%, is achieved by our dictionary learning method BDDL with the CRC_RLS classifier. In comparison with CRC_RLS applied directly to the original training samples, which gives a recognition rate of 93.8%, BDDL with CRC_RLS improves the performance by 1.5%, which shows the effectiveness of our dictionary learning method. Among the other dictionary learning methods, FDDL gives the best recognition rate, 92% [23], which is still 3.5% less than that of BDDL with CRC_RLS. The recognition rates of the dictionary learning methods DKSVD, LC-DKSVD1, and LC-DKSVD2 are no more than 90%, because these methods impose the l0-norm constraint for learning dictionaries, which enforces strong sparsity on the coding coefficients; hence they are less suitable for face recognition. The proposed classifier BDC gives a recognition rate of 93.6%, which is competitive with recently proposed methods for face recognition: SRC (90.3%), l2 (93.6%), and CRC_RLS (93.8%). According to Table 2, the average confidence and STD of BDC on this database is 0.94 ± 0.10, ranking second highest in that table. Therefore, we could trust this classifier in this experiment.

Fig. 3. All 14 images of the first person in the subset of the AR database. Images in the first row are from session 1; images in the second row are from session 2.

Fig. 4. The convergence of our BDDL on the AR database.


5.2. The CMU PIE database

The CMU PIE database [36] consists of 41,368 images of 68 subjects. In our experiments, we use the dataset provided on Deng Cai's homepage: http://www.cad.zju.edu.cn/home/dengcai/. This dataset contains five near-frontal poses (C05, C07, C09, C27, C29) of all the images under different illuminations and expressions. On this subset, there are 170 images for each subject. Sample images of the first person in this dataset are shown in Fig. 5. We verify on this dataset the effectiveness of different methods under illumination and moderate pose variations. In our experiments, 20 images for each individual are randomly selected for training, and the remaining images are used for testing. We repeat this process 20 times and report the average recognition rates and standard deviations for the different methods.

Table 3 lists the results of the compared methods. According to this table, the best performance is achieved by BDDL with BDC, with the highest recognition rate, 94.4%, and the smallest standard deviation, 0.3%. Moreover, the recognition rates given by the classifiers (CRC_RLS, CRC_LS) using the dictionary learned by our BDDL are higher, and their standard deviations smaller, than those of the other dictionary learning methods. For methods applied directly to the original training samples, SRC achieves the highest mean accuracy, 93.0%, which is still 1.4% less than that of our BDDL with BDC.

According to Table 2, BDC has the highest average confidence and the lowest STD on this database: 0.99 ± 0.01. So we could trust this classifier to a great extent on this database. Fig. 6(a) compares the distributions of the coefficients of the training set and the testing set. From this figure we can see that the coefficients of the testing set have a distribution similar to that of the training set, which is further confirmed by the coefficient matrices of the training set and testing set illustrated in Fig. 7(a) and (b), respectively. In addition, similar to the BDFs of the training set shown in Fig. 7(c), the BDFs of the testing set shown in Fig. 7(d) form a block-diagonally dominant matrix, which implies that, for the BDF of a sample, the entries whose indices correspond to the class the sample belongs to have large values, while the other entries have relatively small values. Since the BDFs of the training samples and the testing samples are discriminative, BDC performs well on this dataset.

5.3. The Sheffield (previously UMIST) face database

The Sheffield (previously UMIST) face database [38] consists of 564 images of 20 subjects. Each individual is shown in a range of poses from right profile to frontal views. To verify the performance of different methods under pose variation, for each subject we choose one image with a right profile (see the first image in Fig. 8(a)) and 9 images with near-frontal views (see the 2nd–10th images in Fig. 8(a)) for training, and use the other images, which exhibit significant pose variations, for testing (see the images in Fig. 8(b)).

Fig. 8. (a) All 10 images of the first person used for training on the Sheffield face database. (b) The remaining 28 images of the first person used for testing on this database.

The comparison of the recognition rates achieved by different methods is shown in Table 4. As we only use profile and near-frontal images for training, while images with significant pose variations are used for testing, there is a significant difference between the distributions of the training set and the testing set. Thus, this is a difficult problem for most classification methods, since no image preprocessing techniques are adopted here. As we can see from this table, only BDDL, SVM, and FDDL give recognition rates over 80%. BDDL with CRC_LS obtains the best recognition rate, 89.3%, which almost halves the error rate achieved by the second best dictionary learning method, FDDL, on this dataset. This implies that, although only the profiles and images with moderate pose variations are provided, our BDDL is able to learn dictionary atoms that better represent samples with significant pose variation. More specifically, a test sample with significant pose variation can be reconstructed by the dictionary atoms of the class it truly belongs to with a smaller reconstruction error than by the dictionary atoms of the other classes. The second highest recognition rate, 85.1%, is achieved by CRC_RLS applied to the dictionary learned by our BDDL, which also halves the error rate given by CRC_RLS performed on the original training samples.

Table 4
The recognition rates (REC) on the Sheffield database.

Methods | REC (%) | Parameters
1NN | 66.9 | No
SVM [39] | 82.9 | C = 100, g = 1
SRC [1] | 77.1 | λ = 0.1
l2 [25] | 66.9 | No
CRC_RLS [26,27] | 70.4 | λ = 0.01
DKSVD [19] | 70.4 | γ = 25, β = 0, T = 12
LC-DKSVD1 [22] | 72.0 | α = 0.03, T = 30
LC-DKSVD2 [22] | 71.7 | α = 6, β = 1, T = 30
FDDL [23] | 80.3 | λ1 = 0.005, λ2 = 0.05, γ = 0.005, w = 0.5
BDDL | | α = 0.001, β = 0.001
  CRC_RLS | 85.1 | λ = 0.1
  CRC_LS | 89.3 | λ = 0.1
  BDC | 72.5 | λ1 = 0.001, λ2 = 0.1

Because there is a significant difference between the distributions of the training set and the testing set, there is also a huge difference between the distributions of the training coefficients and the testing coefficients, as shown in Fig. 6(b).

Table 1
The recognition rates (REC) on the AR database.

Methods | REC (%) | Parameters
1NN | 71.4 | No
SVM [39] | 73.6 | C = 1000, g = 1
SRC [1] | 90.3 | λ = 0.001
l2 [25] | 93.6 | No
CRC_RLS [26,27] | 93.8 | λ = 0.001
DKSVD [19] | 85.8 | γ = 25, β = 0, T = 20
LC-DKSVD1 [22] | 88.6 | α = 0.024, T = 30
LC-DKSVD2 [22] | 89.4 | α = 0.025, β = 0.025, T = 30
FDDL [23] | 92.0 | λ1 = 0.005, λ2 = 0.005, γ = 0.001, w = 0.05
BDDL | | α = 0.1, β = 0.5
  CRC_RLS | 95.3 | λ = 0.01
  CRC_LS | 93.3 | λ = 0.01
  BDC | 93.6 | λ1 = 0.01, λ2 = 0.1

Table 2
The average confidence conf_BDC and its standard deviation (STD) of BDC on various databases.

Database | AR | CMU PIE | Sheffield | CAS-PEAL-R1 | AR gender
conf_BDC & STD | 0.94 ± 0.10 | 0.99 ± 0.01 | 0.57 ± 0.04 | 0.82 ± 0.08 | 0.69 ± 0.21

Fig. 5. Sample images of the first person from the CMU PIE database.

Table 3
The average recognition rates (REC) and standard deviations (STD) of various methods on the CMU PIE database.

Methods | REC & STD (%) | Parameters
1NN | 61.1 ± 0.6 | No
SVM [39] | 89.1 ± 0.7 | C = 1000, g = 1
SRC [1] | 93.0 ± 0.4 | λ = 0.001
l2 [25] | 92.5 ± 0.4 | No
CRC_RLS [26,27] | 92.5 ± 0.4 | λ = 0.001
DKSVD [19] | 90.8 ± 0.7 | γ = 25, β = 0, T = 12
LC-DKSVD1 [22] | 70.5 ± 1.5 | α = 0.05, T = 30
LC-DKSVD2 [22] | 65.5 ± 0.8 | α = 0.001, β = 0.001, T = 30
FDDL [23] | 92.3 ± 0.5 | λ1 = 0.005, λ2 = 0.005, γ = 0.001, w = 0.1
BDDL | | α = 0.001, β = 0.5
  CRC_RLS | 92.4 ± 0.4 | λ = 0.001
  CRC_LS | 93.3 ± 0.4 | λ = 0.001
  BDC | 94.4 ± 0.3 | λ1 = 0.001, λ2 = 0.1


This means that the coefficients of the testing set are not as sparse as those of the training set, implying that they are not as discriminative. Thus, BDDL with CRC_RLS does not perform as well as BDDL with CRC_LS. Besides, because of such a huge difference, the average confidence and STD of BDC is only 0.57 ± 0.04 on this database, and the recognition rate achieved by BDDL with BDC is only 72.5%. Thus, we reject the results provided by BDDL with BDC in this experiment.

5.4. The CAS-PEAL-R1 database

The CAS-PEAL-R1 database consists of 30,900 images of 1040 subjects. The standard evaluation protocol in Ref. [37] comprises three parts: one gallery set, one training set, and several probe sets. The gallery set contains one frontal image captured under the standard lighting condition for each of the 1040 individuals. The probe sets contain six frontal probes: expression, lighting, accessory, background, distance, and time. The training set consists of 1200 frontal images of 300 subjects, with 4 images per subject randomly chosen from all the frontal probes. We test the performance of different methods under illumination variations on this database. In our experiments, from the gallery and training sets, we choose the images of subjects that occur in both the training set and the lighting probe for training, and the images of these subjects in the lighting probe for testing. Thus, we have 5 images for each of 65 individuals for training, and 229 images of them for testing. Sample images of this database are shown in Fig. 9.

Fig. 9. Sample images of the first person from the CAS-PEAL-R1 database.

In our experiments, since the training set is composed of images randomly chosen from all six frontal probes, and the test images have sharp shadows, similar to the experiment conducted on the Sheffield face database, there is a considerable difference between the distributions of the training samples and the testing samples. The recognition rates of the different methods are listed in Table 5. In this table, BDDL with CRC_RLS gives the best result, 76.0%, which is 4.4% higher than CRC_RLS without dictionary learning, and 3.5% higher than FDDL, the second best dictionary learning method reported in this table. SRC, l2, CRC_RLS, LC-DKSVD2, FDDL, and BDDL with CRC_LS provide close recognition rates: from 70% to around 72%.

Fig. 6. The distributions of the coefficients of the training set and the testing set on (a) the CMU PIE database, and (b) the Sheffield face database.

Fig. 7. On the CMU PIE database, (a) the coding coefficients matrix of the training set, (b) the coding coefficients matrix of the testing set, (c) the BDFs of the training set, and (d) the BDFs of the testing set.


Other methods listed in this table give recognition rates under 70%. On this database, the average confidence and STD of BDC is only 0.82 ± 0.08, implying that BDC does not perform well on this database. The reason is the same as that analyzed on the Sheffield face database.

Table 5
The recognition rates (REC) on the CAS-PEAL-R1 database.

Methods | REC (%) | Parameters
1NN | 20.5 | No
SVM [39] | 44.5 | C = 1000, g = 1
SRC [1] | 71.2 | λ = 0.001
l2 [25] | 70.3 | No
CRC_RLS [26,27] | 71.6 | λ = 0.001
DKSVD [19] | 65.5 | γ = 25, β = 0, T = 35
LC-DKSVD1 [22] | 68.6 | α = 0.036, T = 50
LC-DKSVD2 [22] | 71.2 | α = 0.031, β = 0.015, T = 50
FDDL [23] | 72.5 | λ1 = 0.05, λ2 = 0.005, γ = 0.005, w = 0.5
BDDL | | α = 0.001, β = 0.05
  CRC_RLS | 76.0 | λ = 0.01
  CRC_LS | 70.7 | λ = 0.01
  BDC | 65.5 | λ1 = 0.001, λ2 = 0.1

5.5. Gender classification on the AR database

In this subsection, we investigate the performance of BDDL on gender classification. We use the same dataset as in Section 5.1, i.e., 1400 images of 100 subjects (50 men and 50 women). We randomly select the images of 25 men and 25 women for training and those of the other 25 men and 25 women for testing. We repeat this process 20 times and report the average results and standard deviations for the different methods.

From Table 6, it can be seen that BDDL with CRC_RLS and with CRC_LS both give the best average classification rate, 93.2%, with standard deviations of 2.1% and 2.2%, respectively. SRC reports the second best result, 93.0% ± 2.3%, followed by 92.9% ± 1.8% given by the second best dictionary learning method, FDDL. Among SVM, l2, and CRC_RLS, CRC_RLS gives the best result, 92.2% ± 2.2%, which is still 1.0% less than CRC_RLS applied on the dictionary learned by BDDL. The average confidence and STD of BDC given in Table 2 is only 0.69 ± 0.21. Thus, we reject the results provided by BDC in this experiment.

Table 6
The average recognition rates (REC) and standard deviations (STD) of various methods for gender classification on the AR database.

Methods | REC & STD (%) | Parameters
1NN | 89.1 ± 2.8 | No
SVM [39] | 92.1 ± 2.3 | C = 1, g = 1
SRC [1] | 93.0 ± 2.3 | λ = 0.01
l2 [25] | 91.3 ± 2.2 | No
CRC_RLS [26,27] | 92.2 ± 2.2 | λ = 0.01
DKSVD [19] | 91.9 ± 1.7 | γ = 25, β = 0, T = 12
LC-DKSVD1 [22] | 89.9 ± 2.5 | α = 1, T = 30
LC-DKSVD2 [22] | 91.2 ± 1.9 | α = 1, β = 1, T = 30
FDDL [23] with GC | 92.9 ± 1.8 | λ1 = 0.1, λ2 = 0.05, γ = 0.005, w = 0.5
BDDL | | α = 0.001, β = 0.5
  CRC_RLS | 93.2 ± 2.1 | λ = 0.5
  CRC_LS | 93.2 ± 2.2 | λ = 0.5
  BDC | 91.6 ± 2.5 | λ1 = 0.001, λ2 = 0.5

5.6. The summary of the experiments

In the experiments, our dictionary learning method BDDL is able to learn a good dictionary for classifying face images under expression (the AR database), lighting (the CMU PIE and CAS-PEAL-R1 databases), and pose (the Sheffield face database) variations, as well as for gender classification (the AR database). The proposed classifier BDC performs well on the AR and CMU PIE databases, because the samples from the training set and the testing set have close distributions. From the results provided by BDC and the confidences in Table 2, we can empirically set the confidence threshold to 0.9: if the confidence is no less than 0.9, the result of BDC can be trusted; otherwise, we reject the result provided by BDC.

6. Conclusions and future work

We propose in this paper a novel dictionary learning method based on the l2-norm regularization, which is more suitable for face recognition. A structured dictionary is learned by minimizing the reconstruction error for each class using the dictionary atoms associated with that class. More importantly, to exploit the discriminability of the coding coefficients, a discriminative term bilinear to the training samples and the coding coefficients is incorporated in our dictionary learning model. The bilinear discriminative term essentially resolves a linear regression problem for the patterns concatenated from the training samples and the coding coefficients in the Reproducing Kernel Hilbert Space (RKHS). Meanwhile, a new classifier, BDC, based on the bilinear discriminative model is also proposed, and the confidence of BDC is evaluated based on the K–L divergence.



Experiments on the AR, CMU PIE, CAS-PEAL-R1, and Sheffield face databases demonstrate the effectiveness of our method.

In future work, we will try to impose strong sparsity on the coding coefficients to learn dictionaries for other pattern classification tasks, and further explore the effectiveness of the bilinear discriminative classifier.

Conflict of interest statement

None declared.

Acknowledgments

This work is supported by the National Natural Science Foundation of China under Grant nos. 61272222 and 61003116, the Natural Science Foundation of Jiangsu Province of China under Grant no. BK2011782, and the Key (Major) Program of the Natural Science Foundation of Jiangsu Province of China under Grant no. BK2011005.

References

[1] J. Wright, A. Yang, S. Sastry, Y. Ma, Robust face recognition via sparse representation, IEEE Trans. Pattern Anal. Mach. Intell. 31 (2) (2009) 210–227.
[2] J. Wright, Y. Ma, J. Mairal, G. Sapiro, T.S. Huang, S. Yan, Sparse representation for computer vision and pattern recognition, Proc. IEEE 98 (6) (2010) 1031–1044.
[3] S.J. Wright, R.D. Nowak, M.A.T. Figueiredo, Sparse reconstruction by separable approximation, IEEE Trans. Signal Process. 57 (7) (2009) 2479–2493.
[4] A. Beck, M. Teboulle, A fast iterative shrinkage-thresholding algorithm for linear inverse problems, SIAM J. Imaging Sci. 2 (1) (2009) 183–202.
[5] J. Yang, Y. Zhang, Alternating direction algorithms for l1 problems in compressive sensing, SIAM J. Sci. Comput. 33 (1) (2011) 250–278.
[6] D.M. Malioutov, M. Cetin, A.S. Willsky, Homotopy continuation for sparse signal representation, in: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2005.
[7] S.J. Kim, K. Koh, M. Lustig, S. Boyd, D. Gorinevsky, An interior-point method for large-scale l1-regularized least squares, IEEE J. Sel. Top. Signal Process. 1 (4) (2007) 606–617.
[8] J. Liu, J. Ye, Efficient Euclidean projections in linear time, in: Proceedings of the International Conference on Machine Learning (ICML), 2009.
[9] J. Mairal, M. Elad, G. Sapiro, Sparse representation for color image restoration, IEEE Trans. Image Process. 17 (1) (2008) 53–69.
[10] A. Wagner, J. Wright, A. Ganesh, Z.H. Zhou, Y. Ma, Towards a practical face recognition system: robust registration and illumination by sparse representation, in: Computer Vision and Pattern Recognition (CVPR), 2009.
[11] M. Yang, L. Zhang, Gabor feature based sparse representation for face recognition with Gabor occlusion dictionary, in: Proceedings of the European Conference on Computer Vision (ECCV), 2010.
[12] K. Huang, S. Aviyente, Sparse representation for signal classification, in: Advances in Neural Information Processing Systems (NIPS), 2006.
[13] J.C. Yang, K. Yu, T. Huang, Supervised translation-invariant sparse coding, in: Computer Vision and Pattern Recognition (CVPR), 2010.
[14] M. Yang, L. Zhang, J. Yang, D. Zhang, Robust sparse coding for face recognition, in: Computer Vision and Pattern Recognition (CVPR), 2011.
[15] M. Aharon, M. Elad, A.M. Bruckstein, K-SVD: an algorithm for designing overcomplete dictionaries for sparse representation, IEEE Trans. Signal Process. 54 (11) (2006) 4311–4322.
[16] D.S. Pham, S. Venkatesh, Joint learning and dictionary construction for pattern recognition, in: Computer Vision and Pattern Recognition (CVPR), 2008.
[17] J. Mairal, F. Bach, J. Ponce, G. Sapiro, A. Zisserman, Discriminative learned dictionaries for local image analysis, in: Computer Vision and Pattern Recognition (CVPR), 2008.
[18] J. Mairal, F. Bach, J. Ponce, G. Sapiro, A. Zisserman, Supervised dictionary learning, in: Advances in Neural Information Processing Systems (NIPS), 2009.
[19] Q. Zhang, B.X. Li, Discriminative K-SVD for dictionary learning in face recognition, in: Computer Vision and Pattern Recognition (CVPR), 2010.
[20] M. Yang, L. Zhang, J. Yang, D. Zhang, Metaface learning for sparse representation based face recognition, in: IEEE International Conference on Image Processing (ICIP), 2010.
[21] I. Ramirez, P. Sprechmann, G. Sapiro, Classification and clustering via dictionary learning with structured incoherence and shared features, in: Computer Vision and Pattern Recognition (CVPR), 2010.
[22] Z. Jiang, Z. Lin, L.S. Davis, Learning a discriminative dictionary for sparse coding via label consistent K-SVD, in: Computer Vision and Pattern Recognition (CVPR), 2011.
[23] M. Yang, L. Zhang, X. Feng, D. Zhang, Fisher discrimination dictionary learning for sparse representation, in: Computer Vision and Pattern Recognition (CVPR), 2011.
[24] J. Mairal, F. Bach, J. Ponce, Task-driven dictionary learning, IEEE Trans. Pattern Anal. Mach. Intell. 34 (4) (2012) 791–804.
[25] Q. Shi, A. Eriksson, A. Hengel, C. Shen, Is face recognition really a compressive sensing problem? in: Computer Vision and Pattern Recognition (CVPR), 2011.
[26] L. Zhang, M. Yang, X. Feng, Sparse representation or collaborative representation: which helps face recognition? in: IEEE International Conference on Computer Vision (ICCV), 2011.
[27] L. Zhang, M. Yang, X. Feng, Y. Ma, D. Zhang, Collaborative representation based classification for face recognition, arXiv:1204.2358, 2012.
[28] R.O. Duda, P.E. Hart, D.G. Stork, Pattern Classification, John Wiley and Sons, New York, 2001.
[29] S. Kullback, R. Leibler, On information and sufficiency, Ann. Math. Stat. 22 (1951) 79–86.
[30] I. Gkioulekas, T. Zickler, Dimensionality reduction using the sparse linear model, in: Advances in Neural Information Processing Systems (NIPS), 2011.
[31] J. Gao, Q. Shi, T.S. Caetano, Dimensionality reduction via compressive sensing, Pattern Recognit. Lett. 33 (9) (2012) 1163–1170.



[32] G. Baudat, F. Anouar, Generalized discriminant analysis using a kernel approach, Neural Comput. 12 (10) (2000) 2385–2404.
[33] A.J. Smola, Learning with Kernels (Ph.D. thesis), GMD, Birlinghoven, Germany, 1998.
[34] V. Vassilevska Williams, Breaking the Coppersmith–Winograd barrier, Tech. Rep., 2011.
[35] A. Martinez, R. Benavente, The AR face database, CVC Tech. Report No. 24, 1998.
[36] T. Sim, S. Baker, M. Bsat, The CMU Pose, Illumination, and Expression (PIE) database, in: IEEE International Conference on Automatic Face and Gesture Recognition (FG), 2002.
[37] W. Gao, B. Cao, S. Shan, X. Chen, D. Zhou, X. Zhang, D. Zhao, The CAS-PEAL large-scale Chinese face database and baseline evaluations, IEEE Trans. Syst. Man Cybern. (Part A) 38 (1) (2008) 149–161.
[38] D.B. Graham, N.M. Allinson, Characterizing virtual eigensignatures for general purpose face recognition, in: H. Wechsler, P.J. Phillips, V. Bruce, F. Fogelman-Soulie, T. Huang (Eds.), Face Recognition: From Theory to Applications, NATO ASI Series F, Computer and Systems Sciences, vol. 163, 1998, pp. 446–456.
[39] V. Vapnik, The Nature of Statistical Learning Theory, Springer-Verlag, New York, 1999.
[40] M.A. Turk, A.P. Pentland, Face recognition using eigenfaces, in: Computer Vision and Pattern Recognition (CVPR), 1991.

Hui-Dong Liu received his M.S. degree from the School of Computer Science and Technology, Nanjing Normal University, in 2013, and his B.S. degree from the same university in 2010. His research interests include machine learning, pattern recognition, and their applications to face recognition, image processing, etc.

Ming Yang received his Ph.D. degree from the Department of Computer Science and Engineering, Southeast University, Nanjing, in 2004. He received his M.S. degree from the Department of Mathematics, University of Science & Technology of China, and his B.S. degree from the Department of Mathematics, Anhui Normal University, in 1990 and 1987, respectively. He is currently a Professor in the School of Computer Science and Technology at Nanjing Normal University. His research interests include data mining and knowledge discovery, machine learning, pattern recognition, and their applications.

Yang Gao received his Ph.D. degree in computer science from Nanjing University in 2000. He received his M.S. degree in computer aided design from Nanjing University of Science and Technology in 1996, and his B.S. degree from Dalian University of Technology in 1993. His research interests include reinforcement learning, intelligent systems, image processing, and video surveillance.

Yilong Yin is the Director of the MLA Group and a Professor at Shandong University. He received his Ph.D. degree in 2000 from Jilin University. From 2000 to 2002, he worked as a post-doctoral fellow in the Department of Electronic Science and Engineering, Nanjing University. His research interests include machine learning, data mining, and biometrics.

Liang Chen is currently an M.S. candidate in the School of Computer Science and Technology, Nanjing Normal University. Her research interests include machine learning and pattern recognition.
