ARTICLE IN PRESS
0925-2312/$ - see front matter © 2006 Elsevier B.V. All rights reserved.
doi:10.1016/j.neucom.2006.01.009
⁎Corresponding author. Tel.: +852 2766 7280; fax: +852 2774 0842.
E-mail addresses: [email protected] (J. Yang), csdzhang@comp.polyu.edu.hk (D. Zhang), [email protected] (J.-y. Yang).
Neurocomputing 69 (2006) 1697–1701
Letters
Locally principal component learning for face representation and recognition
Jian Yang a,b,⁎, David Zhang a, Jing-yu Yang b
a Department of Computing, Biometric Research Centre, Hong Kong Polytechnic University, Kowloon, Hong Kong
b Department of Computer Science, Nanjing University of Science and Technology, Nanjing 210094, PR China
Received 6 October 2005; received in revised form 23 January 2006; accepted 24 January 2006
Communicated by R.W. Newcomb
Abstract
This paper develops a method called locally principal component analysis (LPCA) for data representation. LPCA is a linear and
unsupervised subspace-learning technique, which focuses on the data points within local neighborhoods and seeks to discover the local
structure of data. This local structure may contain useful information for discrimination. LPCA is tested and evaluated using the AT&T
face database. The experimental results show that LPCA is effective for dimension reduction and more powerful than PCA for face
recognition.
© 2006 Elsevier B.V. All rights reserved.
Keywords: Principal component analysis (PCA); Locality-based learning; Dimensionality reduction; Feature extraction; Face recognition
1. Introduction
Principal component analysis (PCA) is a classical technique for linear dimension reduction. It has been successfully applied to face recognition [5]. PCA produces a compact representation and preserves the global geometric structure of data well in a low-dimensional space when the given data are linearly distributed. But when the data are distributed in a nonlinear way, PCA may fail to discover the intrinsic structure of the data due to its intrinsic linearity. In this respect, kernel PCA (KPCA) may perform well, since it can make the data structure as linear as possible by virtue of an implicit nonlinear mapping determined by a kernel. KPCA turns out to be effective in many real-world applications, but its computational complexity remains a problem, which restricts its applications to a certain extent.
As opposed to the globality-based learning techniqueslike PCA and KPCA, locality-based learning methods,
represented by locally linear embedding (LLE) [4] and the Laplacian eigenmap [1], have appeared in the last few years. Both methods seek to discover the global structure of data via locally linear fits, based on the fact that a global nonlinear structure can be viewed as locally linear. However, the mapping of LLE (or the Laplacian eigenmap) is always implicit and tied to the training data set, so it is difficult to obtain the image of a data point from the testing set. This makes LLE and the Laplacian eigenmap unsuitable for some recognition tasks. Recently, a locality preserving projection (LPP) technique [3] was proposed and applied to face recognition. LPP can preserve the intrinsic geometry of data and yields an explicit linear mapping suitable for both training and testing samples.

Motivated by the idea of LPP, we develop a locally principal component analysis (LPCA) technique for face representation. Differing from PCA, LPCA seeks to discover the local structure of data (not the global structure). When the data are distributed in a nonlinear way, linear methods like PCA may fail to capture the global geometric structure of the data, but it is still possible to use a similar linear method to recover the local structure. This local structure may contain useful information for discrimination.
2. Methods
2.1. PCA
Given a set of M training samples (pattern vectors) x_1, x_2, \ldots, x_M in R^N, PCA seeks to find a projection axis w such that the mean square of the Euclidean distance between all pairs of the projected sample points y_1, y_2, \ldots, y_M (y_j = w^T x_j, j = 1, \ldots, M) is maximized, i.e.,
J(w) = \frac{1}{2}\,\frac{1}{M^2} \sum_{i=1}^{M} \sum_{j=1}^{M} (y_i - y_j)^2.   (1)
It follows that
J(w) = \frac{1}{2}\,\frac{1}{M^2} \sum_{i=1}^{M} \sum_{j=1}^{M} (w^T x_i - w^T x_j)^2
     = w^T \left[ \frac{1}{2}\,\frac{1}{M^2} \sum_{i=1}^{M} \sum_{j=1}^{M} (x_i - x_j)(x_i - x_j)^T \right] w.   (2)
Let us denote
S_t = \frac{1}{2}\,\frac{1}{M^2} \sum_{i=1}^{M} \sum_{j=1}^{M} (x_i - x_j)(x_i - x_j)^T,   (3)
and the mean vector \bar{x} = \frac{1}{M} \sum_{j=1}^{M} x_j. Then it is easy to show that
S_t = \frac{1}{2}\,\frac{1}{M^2} \sum_{i=1}^{M} \sum_{j=1}^{M} \left( x_i x_i^T - 2 x_i x_j^T + x_j x_j^T \right)
    = \frac{1}{2}\,\frac{1}{M^2} \left[ 2M \sum_{i=1}^{M} x_i x_i^T - 2 \sum_{i=1}^{M} \sum_{j=1}^{M} x_i x_j^T \right]
    = \frac{1}{M^2} \left[ M \sum_{i=1}^{M} x_i x_i^T - \left( \sum_{i=1}^{M} x_i \right) \left( \sum_{j=1}^{M} x_j^T \right) \right]
    = \frac{1}{M} \sum_{i=1}^{M} x_i x_i^T - \bar{x}\bar{x}^T
    = \frac{1}{M} \sum_{i=1}^{M} (x_i - \bar{x})(x_i - \bar{x})^T.   (4)
Eq. (4) indicates that S_t is essentially the covariance matrix of the data. So the projection axis w that maximizes Eq. (1) can be selected as the eigenvector of S_t corresponding to the largest eigenvalue. Similarly, we can obtain a set of projection axes of PCA by selecting the d_PCA eigenvectors of S_t corresponding to the d_PCA largest eigenvalues.
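The equivalence in Eq. (4) between the pairwise-difference form of S_t and the ordinary covariance matrix is easy to verify numerically. The following NumPy sketch (variable names are ours, not from the paper) checks both forms on random data:

```python
import numpy as np

# Check Eq. (4): the pairwise scatter matrix of Eq. (3),
#   S_t = (1/2)(1/M^2) * sum_{i,j} (x_i - x_j)(x_i - x_j)^T,
# equals the covariance matrix (1/M) * sum_i (x_i - xbar)(x_i - xbar)^T.
rng = np.random.default_rng(0)
M, N = 50, 5                      # M samples in R^N
X = rng.standard_normal((N, M))   # columns are the samples x_1, ..., x_M

# Pairwise form, Eq. (3)
diffs = X[:, :, None] - X[:, None, :]                  # (N, M, M): x_i - x_j
St_pair = 0.5 / M**2 * np.einsum('aij,bij->ab', diffs, diffs)

# Covariance form, Eq. (4)
xbar = X.mean(axis=1, keepdims=True)
St_cov = (X - xbar) @ (X - xbar).T / M

assert np.allclose(St_pair, St_cov)
```

The pairwise form costs O(M^2) outer products, while the covariance form costs O(M); Eq. (4) is what makes PCA practical for large sample sizes.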
2.2. Basic idea of LPCA
For each sample point x_i, we consider only its neighboring points x_j, for example those within its local \delta-neighborhood (\delta > 0), i.e., x_j \in U^{\delta}_{x_i} = \{ x \mid \|x - x_i\|^2 < \delta \}, where \|\cdot\| denotes the Euclidean norm. Let us define U^{\delta}_i = \{ j \mid x_j \in U^{\delta}_{x_i} \}. Obviously, U^{\delta}_i is the set of indexes (subscripts) of the sample points that belong to the local \delta-neighborhood of x_i. Based on this definition, the mean square of the Euclidean distances between all pairs of the projected sample points within local neighborhoods is given by
J_L(w) = \frac{1}{2}\,\frac{1}{M_L} \sum_{i=1}^{M} \sum_{j \in U^{\delta}_i} (y_i - y_j)^2,   (5)
where M_L = \sum_{i=1}^{M} M_i and M_i is the number of elements in U^{\delta}_i.
It follows from Eq. (5) that
J_L(w) = \frac{1}{2}\,\frac{1}{M_L} \sum_{i=1}^{M} \sum_{j \in U^{\delta}_i} (w^T x_i - w^T x_j)^2
       = w^T \left[ \frac{1}{2}\,\frac{1}{M_L} \sum_{i=1}^{M} \sum_{j \in U^{\delta}_i} (x_i - x_j)(x_i - x_j)^T \right] w.   (6)
Let us denote
S_L = \frac{1}{2}\,\frac{1}{M_L} \sum_{i=1}^{M} \sum_{j \in U^{\delta}_i} (x_i - x_j)(x_i - x_j)^T.   (7)
S_L is called the local covariance matrix. The eigenvectors of S_L corresponding to the d largest eigenvalues form a coordinate system for LPCA.
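As a concrete illustration, the local covariance matrix of Eq. (7) can be built directly from the \delta-neighborhoods. The NumPy sketch below is ours (the names local_cov and delta are illustrative, not from the paper); x_i itself is excluded from its own neighborhood, which only affects the count M_L, since the corresponding difference term is zero:

```python
import numpy as np

def local_cov(X, delta):
    """S_L of Eq. (7) for the columns of X, with squared-distance radius delta."""
    N, M = X.shape
    d2 = ((X[:, :, None] - X[:, None, :]) ** 2).sum(axis=0)  # squared distances
    S = np.zeros((N, N))
    M_L = 0
    for i in range(M):
        # U^delta_i: indexes j with ||x_j - x_i||^2 < delta (x_i itself excluded)
        idx = [j for j in range(M) if j != i and d2[i, j] < delta]
        M_L += len(idx)
        for j in idx:
            diff = X[:, i] - X[:, j]
            S += np.outer(diff, diff)
    return S / (2 * M_L)

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 25))          # 25 samples in R^3
S_L = local_cov(X, delta=2.0)
# LPCA axes: eigenvectors of S_L for the largest eigenvalues
evals, evecs = np.linalg.eigh(S_L)        # ascending order
W = evecs[:, ::-1][:, :2]                 # two leading LPCA axes
```

This direct construction is O(M^2 N^2) and is only meant to make Eq. (7) concrete; the Laplacian formulation derived in Section 2.3 is the efficient route.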
2.3. Implementation of LPCA
For the convenience of implementing LPCA, two issues should be addressed. First, it is hard to choose a proper radius \delta of the local neighborhood in practice, although it is geometrically intuitive to employ the \delta-neighborhood to characterize the locality. Second, it is inefficient to construct S_L and to calculate its eigenvectors using the formula in Eq. (7), particularly in high-dimensional and small-sample-size cases. We address these issues and develop a feasible algorithm for LPCA as follows.

To avoid the difficulty of choosing the radius of the local
neighborhood, the method of K-nearest neighbors is adopted to characterize the "locality", since it is easy to operate and implement. Let us denote U^K_i = \{ j \mid x_j \text{ is among the } K\text{-nearest neighbors of } x_i \}. Since each U^K_i now contains exactly K elements, M_L = MK, and J_L(w) can be reformulated as
J_L(w) = \frac{1}{2}\,\frac{1}{MK} \sum_{i=1}^{M} \sum_{j \in U^K_i} (y_i - y_j)^2.   (8)
Actually, the K-nearest neighbor relationship between all pairs of training samples can be described by an adjacency matrix H, whose elements are given by
H_{ij} = \begin{cases} 1 & \text{if } x_j \text{ is among the } K\text{-nearest neighbors of } x_i, \\ 0 & \text{otherwise.} \end{cases}   (9)
Thus, Eq. (8) can be rewritten as

J_L(w) = \frac{1}{2}\,\frac{1}{MK} \sum_{i=1}^{M} \sum_{j=1}^{M} H_{ij}\,(y_i - y_j)^2.   (10)
Note that the adjacency matrix H is not necessarily a symmetric matrix, since the K-nearest neighbor relationship between a pair of samples may be asymmetric. Specifically, it may happen that x_j is among the K-nearest neighbors of x_i (H_{ij} = 1) while x_i is not among the K-nearest neighbors of x_j (H_{ji} = 0). For the convenience of derivation, we make H symmetric by the following means:
H \leftarrow \tfrac{1}{2}(H + H^T), \quad \text{i.e.,} \quad H_{ij} \leftarrow \tfrac{1}{2}(H_{ij} + H_{ji}).   (11)

If x_j is among the K-nearest neighbors of x_i while x_i is not among the K-nearest neighbors of x_j, then after symmetrization we have H_{ij} = H_{ji} = 1/2. In this case, a symmetric semi-"K-nearest neighbor relationship" can be viewed as existing between x_i and x_j. Finally, it should be stressed that the value of the criterion J_L(w) in Eq. (10) remains invariant under the symmetrization in Eq. (11), since (y_i - y_j)^2 is symmetric in i and j.
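The construction of H in Eq. (9), its symmetrization in Eq. (11), and the claimed invariance of J_L(w) can be checked with a short NumPy sketch (function and variable names here are ours, not the authors'):

```python
import numpy as np

def knn_adjacency(X, K):
    """H[i, j] = 1 iff x_j is among the K nearest neighbors of x_i (Eq. 9)."""
    M = X.shape[1]
    d2 = ((X[:, :, None] - X[:, None, :]) ** 2).sum(axis=0)  # squared distances
    H = np.zeros((M, M))
    for i in range(M):
        order = np.argsort(d2[i])
        neighbors = [j for j in order if j != i][:K]          # exclude x_i itself
        H[i, neighbors] = 1.0
    return H

def J_L(w, X, H, K):
    """Criterion of Eq. (10): (1/2)(1/(MK)) * sum_ij H_ij (y_i - y_j)^2."""
    y = w @ X
    M = X.shape[1]
    return 0.5 / (M * K) * (H * (y[:, None] - y[None, :]) ** 2).sum()

rng = np.random.default_rng(1)
X = rng.standard_normal((4, 30))          # 30 samples in R^4
K = 5
H = knn_adjacency(X, K)
H_sym = 0.5 * (H + H.T)                   # symmetrization, Eq. (11)
w = rng.standard_normal(4)

# J_L is unchanged by symmetrization, since (y_i - y_j)^2 = (y_j - y_i)^2
assert np.isclose(J_L(w, X, H, K), J_L(w, X, H_sym, K))
```

The invariance follows algebraically: replacing H_{ij} by (H_{ij} + H_{ji})/2 in Eq. (10) merely averages each term with its (i, j)-swapped twin, which has the same value.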
After the symmetrization of H, it follows from Eq. (10)that
J_L(w) = \frac{1}{2}\,\frac{1}{MK} \sum_{i=1}^{M} \sum_{j=1}^{M} H_{ij}\,(w^T x_i - w^T x_j)^2
       = w^T \left[ \frac{1}{2}\,\frac{1}{MK} \sum_{i=1}^{M} \sum_{j=1}^{M} H_{ij}\,(x_i - x_j)(x_i - x_j)^T \right] w = w^T S_L w.   (12)
Due to the symmetry of H, we have
S_L = \frac{1}{2}\,\frac{1}{MK} \left( \sum_{i=1}^{M} \sum_{j=1}^{M} H_{ij}\, x_i x_i^T + \sum_{i=1}^{M} \sum_{j=1}^{M} H_{ij}\, x_j x_j^T - 2 \sum_{i=1}^{M} \sum_{j=1}^{M} H_{ij}\, x_i x_j^T \right)
    = \frac{1}{MK} \left( \sum_{i=1}^{M} D_{ii}\, x_i x_i^T - \sum_{i=1}^{M} \sum_{j=1}^{M} H_{ij}\, x_i x_j^T \right)
    = \frac{1}{MK} \left( X D X^T - X H X^T \right)
    = \frac{1}{MK}\, X L X^T,   (13)
where X = (x_1, x_2, \ldots, x_M), and D is a diagonal matrix whose diagonal elements are the column (or, since H is symmetrized, row) sums of H, i.e., D_{ii} = \sum_{j=1}^{M} H_{ij}. The matrix L = D - H is called the Laplacian matrix in [1].

Obviously, L and S_L are both real symmetric matrices. From Eqs. (10) and (12), we know that w^T S_L w \geq 0 for any nonzero vector w. So the local covariance matrix S_L must be a nonnegative definite (positive semidefinite) matrix; that is, its eigenvalues are all nonnegative.
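The identity in Eq. (13) between the pairwise form of S_L and the Laplacian form (1/(MK)) X L X^T holds for any symmetric H with zero diagonal, which a quick numerical check confirms (the variable names are illustrative):

```python
import numpy as np

# Check Eq. (13): with symmetric H,
#   (1/2)(1/(MK)) sum_ij H_ij (x_i - x_j)(x_i - x_j)^T  ==  (1/(MK)) X L X^T,
# where L = D - H and D_ii = sum_j H_ij.
rng = np.random.default_rng(2)
M, N, K = 20, 6, 4
X = rng.standard_normal((N, M))

H = rng.integers(0, 2, size=(M, M)).astype(float)
np.fill_diagonal(H, 0.0)
H = 0.5 * (H + H.T)                          # symmetrize, as in Eq. (11)

# Direct pairwise form (left-hand side of Eq. 13)
diffs = X[:, :, None] - X[:, None, :]        # (N, M, M): x_i - x_j
SL_pair = 0.5 / (M * K) * np.einsum('ij,aij,bij->ab', H, diffs, diffs)

# Laplacian form (right-hand side of Eq. 13)
D = np.diag(H.sum(axis=1))
L = D - H
SL_lap = X @ L @ X.T / (M * K)

assert np.allclose(SL_pair, SL_lap)
```

The Laplacian form replaces an O(M^2) loop of outer products with two matrix products, which is what makes the implementation in Section 2.3 efficient.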
The formulation of S_L in Eq. (13) provides a more efficient way to construct the local covariance matrix and to calculate its eigenvectors in the small-sample-size case. Since L is a real symmetric matrix, its eigenvalues are all real; we calculate all of its eigenvalues and the corresponding eigenvectors. Suppose \Lambda is the diagonal matrix of eigenvalues of L and P is the orthogonal matrix whose columns are the corresponding eigenvectors. Then L can be decomposed as

L = P \Lambda P^T = P_{\Lambda} P_{\Lambda}^T, \quad \text{where } P_{\Lambda} = P \Lambda^{1/2}.   (14)

Based on this decomposition, we have S_L = \frac{1}{MK}\,(X P_{\Lambda})(X P_{\Lambda})^T. Let us define R = (X P_{\Lambda})^T (X P_{\Lambda}), which is an M \times M nonnegative definite matrix. When the training sample size M is smaller than the dimension N of the input space, the size of R is much smaller than that of S_L, so it is computationally easier to obtain its eigenvectors. Let us work out R's orthonormal eigenvectors v_1, v_2, \ldots, v_d corresponding to the d largest eigenvalues \lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_d > 0. Then, by the theorem of singular value decomposition (SVD) [5], the orthonormal eigenvectors w_1, w_2, \ldots, w_d of S_L corresponding to its d largest eigenvalues \lambda_1/(MK), \lambda_2/(MK), \ldots, \lambda_d/(MK) are

w_j = \frac{1}{\sqrt{\lambda_j}}\, X P_{\Lambda} v_j, \quad j = 1, \ldots, d.   (15)
Let W = (w_1, w_2, \ldots, w_d). The LPCA transform of a sample x is

y = W^T x.   (16)
In summary of the above description, the LPCA algorithm is as follows:

Step 1: For the given training data set \{ x_i \mid i = 1, \ldots, M \}, find the K-nearest neighbors of each data point and construct the adjacency matrix H = (H_{ij})_{M \times M} using Eq. (9). Symmetrize H using Eq. (11).

Step 2: Construct the M \times M diagonal matrix D, whose diagonal elements are given by D_{ii} = \sum_{j=1}^{M} H_{ij}. Then construct the Laplacian matrix L = D - H.

Step 3: Perform the eigenvalue decomposition of L: L = P \Lambda P^T = P_{\Lambda} P_{\Lambda}^T, where P_{\Lambda} = P \Lambda^{1/2}.

Step 4: Construct the matrix R = (X P_{\Lambda})^T (X P_{\Lambda}), where X = (x_1, x_2, \ldots, x_M). Calculate the orthonormal eigenvectors v_1, v_2, \ldots, v_d of R corresponding to the d largest eigenvalues \lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_d > 0. The d projection axes of LPCA are w_j = X P_{\Lambda} v_j / \sqrt{\lambda_j}, j = 1, \ldots, d.

Step 5: Perform the linear transform of a sample x using Eq. (16) to obtain the low-dimensional LPCA feature vector y, which is used to represent x for recognition.
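The five steps above can be sketched in NumPy as follows. This is a minimal illustration under our own naming, not the authors' implementation; it also clips tiny negative round-off eigenvalues of L before taking \Lambda^{1/2}:

```python
import numpy as np

def lpca(X, K, d):
    """Return the d LPCA projection axes W (N x d) for data X (N x M, samples in columns)."""
    N, M = X.shape

    # Step 1: K-nearest-neighbor adjacency matrix H (Eq. 9), then symmetrize (Eq. 11).
    d2 = ((X[:, :, None] - X[:, None, :]) ** 2).sum(axis=0)
    H = np.zeros((M, M))
    for i in range(M):
        order = np.argsort(d2[i])
        H[i, [j for j in order if j != i][:K]] = 1.0
    H = 0.5 * (H + H.T)

    # Step 2: degree matrix D and Laplacian L = D - H.
    L = np.diag(H.sum(axis=1)) - H

    # Step 3: eigendecompose L and form P_Lambda = P Lambda^{1/2}.
    lam, P = np.linalg.eigh(L)
    lam = np.clip(lam, 0.0, None)            # guard against tiny negative round-off
    P_Lam = P * np.sqrt(lam)                 # scales column j by sqrt(lam_j)

    # Step 4: the small M x M matrix R = (X P_Lambda)^T (X P_Lambda).
    B = X @ P_Lam
    R = B.T @ B
    mu, V = np.linalg.eigh(R)                # ascending eigenvalues
    idx = np.argsort(mu)[::-1][:d]           # d largest
    mu, V = mu[idx], V[:, idx]

    # Eq. (15): w_j = (1/sqrt(mu_j)) X P_Lambda v_j
    return B @ V / np.sqrt(mu)

# Step 5 / Eq. (16): project samples with y = W^T x.
rng = np.random.default_rng(3)
X = rng.standard_normal((100, 40))           # 40 samples in R^100 (M < N case)
W = lpca(X, K=6, d=10)
Y = W.T @ X                                  # 10-dimensional LPCA features
```

By the argument around Eq. (15), the returned axes are orthonormal (W^T W = I) as long as the d selected eigenvalues of R are strictly positive.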
3. Experiments
The proposed method is tested on the standard AT&T database (http://www.uk.research.att.com/facedatabase.html). This database contains images of 40 individuals, each providing 10 different images. In the first experiment, we use the first five images of each class for training and the remaining five for testing. PCA and the proposed LPCA are, respectively, used for feature extraction. In LPCA, the parameter K is chosen as K = 6.
Table 1
The maximal recognition rates (%) of PCA and LPCA and their corresponding dimensions when the first five images of each class are used for training

Method              Euclidean distance      Mahalanobis cosine distance
                    PCA        LPCA         PCA        LPCA
Recognition rate    93.5       95.5         92.0       96.5
Dimension           42         50           38         38
Table 2
The maximal average recognition rates (%) of PCA and LPCA and their corresponding dimensions (shown in brackets) when the number of training samples per class varies from 1 to 5, using 20-fold cross-validation tests

No. training samples/class            1          2          3          4          5
Euclidean distance          PCA       67.7 [38]  80.7 [60]  87.1 [50]  91.1 [50]  94.0 [56]
                            LPCA      69.0 [30]  82.3 [42]  88.0 [50]  91.7 [46]  94.4 [46]
Mahalanobis cosine distance PCA       68.1 [32]  81.3 [40]  87.1 [34]  90.9 [46]  93.5 [44]
                            LPCA      68.6 [38]  83.4 [44]  89.3 [44]  92.7 [38]  95.1 [56]
Finally, nearest-neighbour classifiers with the Euclidean distance and the Mahalanobis cosine distance [2] are, respectively, employed for classification. The maximal recognition rates of PCA and LPCA and their corresponding dimensions are given in Table 1. Table 1 shows that LPCA outperforms PCA under both distance metrics.
To alleviate the effect of the choice of training set on recognition performance, in the second experiment we perform a series of 20-fold cross-validation tests. In these tests, the number of training samples per class, t, varies from 1 to 5. In LPCA, the parameter K is chosen as K = t + 2. The maximal average recognition rates of PCA and LPCA across 20 runs, under nearest-neighbour classifiers with the two distance metrics, together with their corresponding dimensions, are shown in Table 2. Table 2 indicates that LPCA consistently outperforms PCA regardless of the training sample size.
4. Conclusions
A linear and unsupervised subspace-learning technique, locally principal component analysis (LPCA), is developed in this paper. Since a global nonlinear structure can be viewed as locally linear, it is possible to use a linear technique to recover the local structure of data, although it is almost impossible to use the same technique to recover the global geometric structure. Our experimental results indicate that the local structure does contain effective information for discrimination.
Acknowledgments
This research was supported by the CERG fund from the HKSAR Government and the central fund from the Hong Kong Polytechnic University, and by the National Science Foundation of China under Grant nos. 60503026, 60332010, 60472060, and 60473039. Dr. Yang was supported by China and Hong Kong Polytechnic University Postdoctoral Fellowships.
References
[1] M. Belkin, P. Niyogi, Laplacian eigenmaps for dimensionality reduction and data representation, Neural Comput. 15 (6) (2003) 1373–1396.
[2] R. Beveridge, D. Bolme, M. Teixeira, B. Draper, The CSU Face Identification Evaluation System User's Guide: Version 5.0, http://www.cs.colostate.edu/evalfacerec/.
[3] X. He, S. Yan, Y. Hu, P. Niyogi, H.-J. Zhang, Face recognition using Laplacianfaces, IEEE Trans. Pattern Anal. Mach. Intell. 27 (3) (2005) 328–340.
[4] S.T. Roweis, L.K. Saul, Nonlinear dimensionality reduction by locally linear embedding, Science 290 (5500) (2000) 2323–2326.
[5] M. Turk, A. Pentland, Eigenfaces for recognition, J. Cognitive Neurosci. 3 (1) (1991) 71–86.
Jian Yang was born in Jiangsu, China, in June 1973. He obtained his Bachelor of Science in Mathematics at Xuzhou Normal University in 1995. He then completed a Master of Science degree in Applied Mathematics at Changsha Railway University in 1998 and his Ph.D. at Nanjing University of Science and Technology (NUST), in the Department of Computer Science, on the subject of Pattern Recognition and Intelligence Systems in 2002. From January to December 2003, he was a postdoctoral researcher at the University of Zaragoza, affiliated with the Division of Bioengineering of the Aragon Institute of Engineering Research (I3A). In the same year, he was awarded a Research Fellowship of the RyC program, sponsored by the Spanish Ministry of Science and Technology. He is now a professor in the Department of Computer Science of NUST and, at the same time, a Postdoctoral Research Fellow at the Biometrics Centre of Hong Kong Polytechnic University. He is the author of more than 30 scientific papers in pattern recognition and computer vision. His current research interests include pattern recognition, computer vision, and machine learning.
David Zhang graduated in computer science from Peking University in 1974 and received his M.Sc. and Ph.D. degrees in computer science and engineering from the Harbin Institute of Technology (HIT) in 1983 and 1985, respectively. He received his second Ph.D., in electrical and computer engineering, from the University of Waterloo, Ontario, Canada, in 1994. After that, he was an associate professor at the City University of Hong Kong and a chair professor at the Hong Kong Polytechnic University. Currently, he is a founder and director of the Biometrics Technology Centre supported by the UGC of the Government of the Hong Kong SAR. He is the Founder and Editor-in-Chief of the International Journal of Image and Graphics, and an Associate Editor of international journals such as IEEE Transactions on SMC-C, Pattern Recognition, and the International Journal of Pattern Recognition and Artificial Intelligence. His research interests include automated biometrics-based identification, neural systems and applications, and image processing and pattern recognition. So far, he has published over 180 articles as well as 10 books, and has won numerous prizes.
Jing-yu Yang received the B.S. degree in Computer Science from Nanjing University of Science and Technology (NUST), Nanjing, China. From 1982 to 1984 he was a visiting scientist at the Coordinated Science Laboratory, University of Illinois at Urbana-Champaign. From 1993 to 1994 he was a visiting professor in the Department of Computer Science, Missouri University, and in 1998 he was a visiting professor at Concordia University in Canada. He is currently a professor and Chairman of the Department of Computer Science at NUST. He is the author of over 300 scientific papers in computer vision, pattern recognition, and artificial intelligence. He has won more than 20 provincial and national awards. His current research interests are in the areas of pattern recognition, robot vision, image processing, data fusion, and artificial intelligence.