LIFT: A new framework of learning from testing data for face recognition

Yuan Cao (a), Haibo He (b,*), He (Helen) Huang (b)

a Department of Electrical and Computer Engineering, Stevens Institute of Technology, Hoboken, NJ 07030, USA
b Department of Electrical, Computer, and Biomedical Engineering, University of Rhode Island, Kingston, RI 02881, USA
* Corresponding author.

Article history: Received 30 July 2009; received in revised form 23 September 2010; accepted 18 October 2010. Communicated by D. Xu.

Keywords: Face recognition; Semi-supervised learning; One-against-all; Feature extraction; Data quality

Abstract: In this paper, a novel learning methodology for face recognition, the LearnIng From Testing data (LIFT) framework, is proposed. Because many face recognition problems are characterized by inadequate training examples and an abundance of testing examples, we aim to exploit useful information from the testing data to facilitate learning. The one-against-all technique is integrated into the learning system to recover the labels of the testing data, and the training population is then expanded with the recovered data. In this paper, neural networks and support vector machines are used as the base learning models. Furthermore, we integrate two other transductive methods, the consistency method and the LRGA method, into the LIFT framework. Experimental results and hypothesis tests over five popular face benchmarks illustrate the effectiveness of the proposed framework.

© 2010 Elsevier B.V. All rights reserved. Neurocomputing (2011), doi:10.1016/j.neucom.2010.10.015

1. Introduction

Recently, many new theories and methodologies for face recognition have been developed in the community, and many new algorithms and practical tools have been designed and successfully applied to a wide range of applications, such as biometrics, surveillance, human–computer interfaces, information security, and others [12,59,49,18].
Generally, the face recognition problem aims to identify or verify one or more persons in still or video images of a scene based on a stored database. Three subtasks are generally considered in the solution of a face recognition problem: face segmentation/detection, feature extraction, and face recognition/identification. For instance, face segmentation/detection aims to detect and localize an unknown number (if any) of faces against a simple or complex background in a still image or a video stream [22]. In this paper, we focus on the latter two phases and address the face recognition problem as predicting the identity label of a given still face image by designing an effective learning method.

Feature extraction is an important step in a successful face recognition approach. Generally speaking, feature extraction techniques for face recognition can be categorized into two types [8]: feature-based matching and template matching (also called holistic [15]) methods. In the feature-based matching approach, the most characteristic face components (eyes, nose, mouth, chin, etc.) and their features (colors, shapes, positions, etc.) are recognized and extracted within a face image. In the template matching approach, important features are extracted from the images and represented as a two-dimensional matrix of intensity values. Refs. [8] and [15] argue that although the feature-based approach has many advantages, such as robust performance against rotation/scale and illumination variations, fast computation, and efficient memory utilization, its performance relies heavily on the facial feature detection methods and the quality of the individual facial features.
On the other hand, since many feature extraction methods, such as principal component analysis (PCA) [48], independent component analysis (ICA) [1], Fisher's linear discriminant (FLD) or linear discriminant analysis (LDA) [2], and Gabor wavelets [27], have been applied to face recognition problems, the template matching approach has attracted significantly growing attention in the community. For instance, many variants of the aforementioned basic feature extraction methods have been proposed in the literature, including the fractional-step linear discriminant analysis (F-LDA) method [30], the direct linear discriminant analysis (D-LDA) method [57], the direct fractional-step linear discriminant analysis (DF-LDA) method [31], and the regularized discriminant analysis (RDA) method [13], among others. In F-LDA [30], the concept of fractional dimensionality was introduced and integrated into an incremental dimensionality reduction procedure based on linear discriminant analysis. Due to computational constraints, the traditional LDA approach was performed in the low-dimensional PCA subspace, which may result in a loss of significant discriminatory information contained in the discarded null space. Therefore, D-LDA was proposed to process data directly in the original high-dimensional input space by modifying the simultaneous



diagonalization procedure of the LDA method [57]. DF-LDA [31] combines the strengths of the D-LDA and F-LDA methods by conducting D-LDA and F-LDA sequentially. Based on LDA, RDA was developed in [13] by using a regularized discriminant scheme instead of optimizing the Fisher index. This scheme was used to solve the small-sample-size problem and was evaluated on three popular databases: ORL, Yale, and FERET. However, this method suffers from a high computational load. In [27], Gabor wavelets were used for face recognition within a dynamic link architecture (DLA) framework. Gabor jets were first calculated and then used for a flexible template comparison between the resulting image decompositions. A Gabor–Fisher classifier was proposed in [28], in which an augmented Gabor feature vector was derived from the Gabor wavelet representation of the face images. This method was shown to be robust to variations in illumination and facial expression.

Recently, new approaches and methods have been presented in the literature. In [52], an iterative algorithm was proposed to rearrange elements of the data matrix in order to maximize intra-matrix correlation. This algorithm was extended to supervised learning problems, and simulation results demonstrated the effectiveness of the proposed algorithms. Another iterative method, adaptive regularization-based semi-supervised discriminant analysis with tensor representation (ARSDA/T), and its vector-based variant (ARSDA/V) were presented in [51]. In these algorithms, graph Laplacian regularization was applied based on the data representation in the low-dimensional feature space. In [54], two null-space-based schemes, NS2DLDA and NS2DMFA, extended from the traditional 2-dimensional LDA and MFA algorithms, were presented. The proposed schemes solve the convergence problem, and experimental results on the CMU PIE and FERET datasets showed superior performance of the proposed methods compared to the traditional approaches. The spatially constrained earth mover's distance (SEMD) was used in [53] to improve the robustness of face recognition algorithms against image misalignments. The distances were treated as features in a kernel discriminant analysis framework. Experiments over three benchmark face datasets illustrated the effectiveness of the proposed distance measure. In [55], a concurrent subspaces analysis (CSA) method for object reconstruction and recognition was proposed. A high-order tensor object was encoded in multiple concurrent subspaces that were learned sequentially by an iterative procedure. Simulation results on four popular face datasets showed that CSA outperforms traditional PCA.

Another key aspect of face recognition is the underlying learning algorithm. Learning methods such as neural networks, support vector machines (SVMs), and k-nearest neighbors (KNN), among others, have been studied extensively in the community for face recognition. For instance, in [15], a high-speed face recognition strategy using radial basis function (RBF) neural networks was proposed. This system involves two subtasks: feature extraction based on the discrete cosine transform (DCT) and Fisher's linear discriminant (FLD) analysis, and face classification using RBF neural networks. Simulation results on the Colorado State University (CSU) Face Identification Evaluation System and the Yale Database illustrate that the proposed method can reduce the computational cost and achieve competitive recognition accuracy compared to other face recognition methodologies, such as pseudo-2D hidden Markov models and probabilistic decision-based neural networks. A novel neural architecture, PyraNet, was proposed in [39] for visual pattern recognition. PyraNet consists of two layers: a pyramidal layer used for feature extraction and reduction, and a one-dimensional layer used for classification. Five training methods, including gradient descent, gradient descent with momentum, resilient backpropagation, Polak–Ribiere conjugate gradient, and Levenberg–Marquardt, were analyzed in PyraNet for visual pattern recognition. In [20], binary SVMs were used to tackle the face recognition problem. The one-against-one


technique and a bottom-up binary tree structure were employed to solve a multiclass face recognition problem. Refs. [24] and [25] also discussed SVMs in the context of face recognition. The authors argued that SVMs can capture the relevant discriminatory information from the training data and provide superior learning performance compared to other classification methods such as the Euclidean distance and normalized correlation methods. However, if the data have been preprocessed by a feature extraction technique that captures the discriminatory information, such as FLD, SVMs may not outperform other classification methods. In [33], a nearest neighbor classifier was developed for face recognition, in which a new linear feature extraction technique was proposed by transferring the problem of finding the optimal linear projection matrix in feature extraction to a classification problem that is solved by the AdaBoost algorithm and multitask learning theory. In [26], a face recognition scheme was proposed by combining wavelet decomposition, the Fisherface method, and the fuzzy integral. The wavelet decomposition and Fisherface method are used to extract important features from the image, and the fuzzy integral method is used to combine multiple classifiers that are trained on different subspaces generated by the wavelet decomposition. The effectiveness of the proposed method is indicated by simulation results over the Chungbuk National University face database and the Yale database.

In this paper, we assume a learner is provided with limited training data but a large amount of testing data. Under this scenario, we propose a new learning methodology, the Learning From Testing data (LIFT) framework, which explores useful information from the accessible testing data to facilitate the final decision-making process. Since the proposed framework has a modularized structure, the method in each module of the framework can be replaced by other approaches. For instance, in this paper we substitute the one-against-all strategy with two other transductive learning methods, the consistency method [63,61] and the LRGA method [56], in the data selection step of the proposed framework. The consistency method and the LRGA method employ manifold learning and are developed based on the global patterns and local structures in the data.

The rest of the paper is organized as follows. Section 2 formulates the problem addressed in this paper. Section 3 presents the details of the LIFT approach for face recognition problems; the system-level framework and a learning algorithm are proposed in this section. Experimental results on five popular face databases and statistical analyses of these results are presented in Section 4 to show the effectiveness of this method. The performance of the variants with the consistency method and the LRGA method over the five face datasets is also presented there. In Section 5, a brief analysis of the data quality is provided. Finally, conclusions and a brief discussion of future research directions are given in Section 6.

2. Problem formulation

In traditional face recognition problems, we generally assume that an adequate and representative training dataset is available to develop the decision boundaries for future prediction. However, in many real-world applications, collecting labeled face images is often expensive and time consuming. Meanwhile, such an image collection process normally requires the efforts of experienced human annotators, which is not suitable for automated face recognition systems. This motivates the semi-supervised learning scenario [64,42]. Generally speaking, the key idea of semi-supervised learning is to exploit the unlabeled training examples together with the labeled ones to modify and refine the hypothesis and thereby improve learning accuracy [64,37,10,60]. For instance, self-training methods [50] first develop an initial classifier with the labeled data examples alone. This classifier is then used to recover the labels of the unlabeled data examples, which are appended to the labeled data to retrain the classifier. This procedure is repeated, and each time the most confident unlabeled examples are labeled with the estimated labels. In this way, the classifier uses its own knowledge to teach itself iteratively. Other representative work includes co-training methods [36,58,6,44], semi-supervised support vector machines [11,3], graph-based methods [62,5], and EM with generative mixture models [38,35], among others.

Fig. 1. The proposed LIFT system diagram.

In many applications, it is not uncommon that only scarce labeled training images are available initially, whereas a large amount of unlabeled images becomes available at the online testing stage. For instance, in [46], an active learning paradigm was presented in which the testing phase was regarded as the beginning of a machine learning experiment instead of the end, as in traditional approaches. The testing data were therefore used iteratively to evaluate the learning process and, if necessary, to construct a new training dataset in order to cover the instance space more completely. This learning paradigm was illustrated with a case study of the robotic soccer game. A semi-supervised PCA-based face recognition algorithm based on self-training was proposed in [41]; unlabeled images are used to update the eigenspace and the templates in order to improve the performance of the face recognition system. In our previous study, we proposed an iterative learning strategy for the incremental semi-supervised learning problem by adaptively recovering the labels of testing data that become available incrementally [9].
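As a concrete picture of the self-training idea discussed above (fit on labeled data, label the most confident unlabeled examples, retrain, repeat), the loop can be sketched in a few lines. This is an illustrative toy, not any of the cited methods: the nearest-centroid "classifier", the margin-based confidence rule, and all data values below are invented for the example.

```python
# Toy self-training: repeatedly assign the single most confident unlabeled
# point (largest gap between nearest and second-nearest class centroid)
# to the labeled pool and refit the centroids.
import math

def centroid(points):
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

def dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

def self_train(labeled, unlabeled, per_round=1):
    labeled = {c: list(pts) for c, pts in labeled.items()}
    pool = list(unlabeled)
    while pool:
        cents = {c: centroid(pts) for c, pts in labeled.items()}
        scored = []
        for x in pool:
            d = sorted((dist(x, cen), c) for c, cen in cents.items())
            # confidence = gap between nearest and second-nearest centroid
            scored.append((d[1][0] - d[0][0], d[0][1], x))
        scored.sort(reverse=True)
        for _, c, x in scored[:per_round]:  # keep only the most confident
            labeled[c].append(x)
            pool.remove(x)
    return labeled

labeled = {1: [(0.0, 0.0), (1.0, 0.0)], 2: [(10.0, 10.0), (11.0, 10.0)]}
unlabeled = [(0.5, 1.0), (10.5, 9.0), (2.0, 0.5)]
result = self_train(labeled, unlabeled)
```

Here all three unlabeled points end up attached to the nearby class, and each round's retraining uses the labels the classifier produced itself, which is exactly the "teach itself" behavior described above.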

Motivated by these ideas, in this paper we consider the following face recognition problem: given inadequate labeled training images, can one use the unlabeled testing images to improve the recognition performance? To address this problem, we propose a novel face recognition framework that recovers the labels of the testing images and moves the most confidently recovered testing images into the training set to facilitate learning and recognition. To the best of our knowledge, this is the first study to regard the unknown testing images as a new source of information in face recognition problems. One of the main contributions of this paper is that we provide a new direction for understanding semi-supervised learning. Compared to our previous work in [9], in this article we develop a general-purpose learning framework by combining a feature extraction/reduction method, such as PCA, the one-against-all strategy, and computational intelligence methods, such as the neural networks and support vector machines discussed in this paper. These techniques enable the proposed framework to deal with high-dimensional and multiple-class databases effectively, which makes it suitable for most face recognition problems. We investigate the use of the proposed method in the context of face recognition problems and test it on five popular face databases to illustrate its effectiveness. In this work, we also analyze in detail several aspects of the data quality of the recovered testing data, such as the size, the accuracy rate, and the error type, and show the impact of these attributes of the recovered data on the performance of the final decision-making process for face recognition.

Consider an original training dataset D_tr with n_tr samples, which can be represented as {x_q, y_q}, q = 1, ..., n_tr, where x_q is a face image sample and y_q ∈ Y = {1, ..., C} is the subject identity label associated with x_q. We assume that the testing dataset D_te with n_te images is available without the identity labels, i.e., D_te can be represented as {x_p}, p = 1, ..., n_te. Moreover, we assume that n_te is much greater than n_tr. Due to the inadequate training dataset, the hypothesis built on D_tr cannot provide satisfactory prediction performance. Therefore, the objective here is to design an effective learning framework that exploits useful information from the testing samples in order to improve face recognition performance.

Before we proceed to the details of the proposed framework, we would like to note the major differences between the problem we aim to tackle in this paper and traditional semi-supervised learning problems. In the traditional semi-supervised learning scenario, one aims to exploit the unlabeled training examples to benefit the learning process, which can be accomplished based on the labeled training data information. In this paper, we are interested in finding potentially useful information in the testing data to improve the decision-making process. In addition, we investigate the impact of the data quality of the recovered testing data on the accuracy of the face recognition system. From our observations, three attributes of the recovered testing dataset greatly affect the system performance: the sample size, the error rate, and the error type. Various experiments on five frequently used face benchmarks demonstrate the effectiveness of the proposed framework, using neural networks and SVMs with different kernel functions as the base classifiers.

3. LIFT: learning from testing data framework

We propose the LIFT learning framework as illustrated in Fig. 1. Briefly speaking, the LIFT framework consists of three phases: feature extraction, data selection, and final training. All the images first go through a preprocessing procedure for dimensionality reduction to facilitate learning. Then the one-against-all technique with a base learning model is developed to estimate the identity labels of the testing images. By adding the most confident testing images, with their estimated labels, to the original training dataset, the final classification hypothesis is built on the expanded training dataset. We present this system in detail in the following sections.

Fig. 2. An example of the one-against-all technique: (a) the decision boundary of hypothesis h_3 based on the training data; (b) the prediction of the testing data based on the hypothesis h_3.

3.1. Feature extraction

Similar to much of the existing face classification research, we use the PCA method for feature extraction to handle the high-dimensional face databases. In PCA, feature extraction is conducted on the original face images in order to find a subset of basis images; in this new feature space, the original images can be represented by coordinates that are uncorrelated [48]. Following the well-known eigenface method proposed in [48], we generate the eigenfaces that correspond to the eigenvectors associated with the dominant eigenvalues of the facial image covariance matrix. These eigenfaces define the new feature space, which greatly reduces the dimensionality of the original space and allows efficient learning in the reduced feature space.

Specifically, we take the first N principal components and generate subspaces of dimensionality N for all face images. After this phase, each face image is expressed as a pair {x_i, y_i}, where x_i is an image vector of length N; for images in the training set, y_i ∈ Y = {1, ..., C} is the class identity label associated with x_i, while for those in the testing set, y_i = 0 stands for an unknown label. In our current experiments, we set N to 100. We would like to note that other dimensionality reduction methods can also be integrated into the proposed LIFT framework; interested readers can find more details on this issue in [1,2].
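The projection step above can be sketched as a minimal PCA via SVD, assuming each image has already been flattened into a row vector. The toy sizes and random data below stand in for real face images, and N is reduced from the paper's 100 to 2 so the example stays small.

```python
# Minimal PCA projection sketch: fit on training images, project both sets.
import numpy as np

def pca_project(train, test, n_components):
    """Project train/test rows onto the top principal components of train."""
    mean = train.mean(axis=0)
    # Rows of vt are the principal axes ("eigenfaces"), ordered by variance.
    _, _, vt = np.linalg.svd(train - mean, full_matrices=False)
    basis = vt[:n_components]
    return (train - mean) @ basis.T, (test - mean) @ basis.T

rng = np.random.default_rng(0)
train = rng.normal(size=(20, 50))   # 20 "images", 50 pixels each
test = rng.normal(size=(5, 50))
tr_feat, te_feat = pca_project(train, test, n_components=2)
```

By construction the projected training coordinates are centered and mutually uncorrelated, which is the "uncorrelated coordinates" property cited from [48].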

3.2. Data selection

Due to the inadequate training samples, a classifier obtained from such limited training data may not provide accurate and robust classification performance. The key question is whether one can take advantage of the testing data itself to benefit the learning process. To this end, we propose to use the one-against-all technique to estimate and recover the labels of the testing images and thereby augment the training data.

The one-against-all method [4,29] is a standard technique for solving multiclass classification problems by transforming a multiclass classification problem into multiple binary classification problems. By focusing on one class at a time, the one-against-all method can provide well-suited classification capability. For class label i, we partition the training dataset D_tr into two subsets: D_tr^i, which contains all the examples with label i, and D̄_tr^i, which contains all the examples that do not belong to class i. All the examples in D_tr^i are labeled as 1 and all the examples in D̄_tr^i are labeled as 2. Then a hypothesis h_i is trained on the newly labeled training data. Once the hypothesis h_i is developed, all the testing examples are applied to the hypothesis to predict whether they belong to class i or not. If the recovered label is 1, we consider that the example may belong to class i and add it to the recovered testing dataset D_re^i; otherwise, the example is skipped for class i and may be evaluated for other class labels. Note that any testing example that is predicted to belong to two or more different classes will be excluded from the recovered testing datasets. Finally, the recovered testing datasets for all labels are combined to form the recovered testing dataset D_re = D_re^1 ∪ D_re^2 ∪ ... ∪ D_re^C.

Fig. 2 illustrates an example of the one-against-all technique using the class 3 data of the Yale face database used in this paper. We first divide all the training images into two groups: those that belong to class 3 and those that do not. The hypothesis h_3 is trained to separate class 3 from non-class-3 examples. Then, this hypothesis is used to evaluate the testing dataset. The images classified as class 3, i.e., those to the left of the h_3 decision boundary, are added to the training dataset. One should note that, due to the limited learning capability of the method used to generate h_i, false positive (FP) errors (predicting an example as class i when the correct label is not i) and false negative (FN) errors (predicting an example as not class i when the correct label is i) may occur when evaluating the testing data. For instance, the circled image to the left of the decision boundary h_3 is a false positive example: the correct label of this image is class 2, but it is recovered incorrectly as class 3. On the other hand, the circled image to the right of the h_3 boundary is a false negative misclassification, i.e., the correct label of this image is class 3, but it is misclassified as not class 3. Due to the existence of false positive and false negative errors, the data recovered from the testing dataset may contain some misclassified examples. In this case, the inaccurate information learned from the testing data may undermine the final prediction performance. We discuss the impact of the quality of the recovered data on the classification performance of the system in further detail in Section 5.
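The recovery-and-exclusion bookkeeping of this data selection step can be sketched as follows. The binary "hypotheses" here are hand-written threshold rules on 1-D toy points rather than trained classifiers; only the voting logic (one claim per class, ambiguous examples dropped) mirrors the framework.

```python
# One-against-all label recovery: each per-class hypothesis claims testing
# points; a point claimed by exactly one class is kept in D_re with that
# label, while points claimed by zero or by two-or-more classes are excluded.

def recover_labels(test_points, hypotheses):
    """hypotheses: dict {class_label: h(x) -> True if x looks like that class}."""
    recovered = {}
    for x in test_points:
        claims = [c for c, h in hypotheses.items() if h(x)]
        if len(claims) == 1:   # unambiguous: keep with the recovered label
            recovered[x] = claims[0]
    return recovered

# Toy 1-D setup: class 1 lives near 0, class 2 near 10, class 3 near 20.
hypotheses = {
    1: lambda x: x < 5,
    2: lambda x: 5 <= x < 15,
    3: lambda x: x >= 12,      # deliberately overlaps class 2 on [12, 15)
}
d_re = recover_labels([1.0, 9.0, 13.0, 21.0], hypotheses)
```

The point 13.0 is claimed by both h_2 and h_3 and is therefore excluded, exactly as the framework excludes testing examples recovered to two or more classes.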

3.3. Final training

Based on the discussions in Sections 3.1 and 3.2, we have three datasets: the training dataset D_tr, the testing dataset D_te, and the recovered dataset D_re. We add D_re to D_tr to form an augmented training set D̃_tr = D_tr ∪ D_re. Based on D̃_tr, we develop the final hypothesis h_f for the final face recognition.
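The final-training step above (form the union, then fit h_f on it) is small enough to sketch directly. The `majority_rule` base learner below is a made-up stand-in for the neural network or SVM learners used in the paper, included only to keep the sketch runnable.

```python
# Final training sketch: append the recovered testing examples to the
# original training set and fit the final hypothesis on the union.

def augment_and_train(d_tr, d_re, fit_classifier):
    d_tr_aug = list(d_tr) + [(x, y) for x, y in d_re.items()]
    return fit_classifier(d_tr_aug), d_tr_aug

# Toy base learner: constant majority-label rule (stand-in for Learn2).
def majority_rule(data):
    labels = [y for _, y in data]
    top = max(set(labels), key=labels.count)
    return lambda x: top

d_tr = [(0.0, 1), (0.5, 1), (10.0, 2)]          # original labeled samples
d_re = {9.0: 2, 11.0: 2}                        # recovered testing samples
h_f, d_aug = augment_and_train(d_tr, d_re, majority_rule)
```

Because the recovered examples may contain FP/FN errors (Section 3.2), whatever learner replaces `majority_rule` is trained on a noisier but larger set; Section 5 analyzes this trade-off.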

The objective of the LIFT framework is to design an effective learning methodology to exploit the useful information from the testing samples in order to improve the face recognition performance. The main procedure of the learning framework is summarized as follows.

Algorithm 1 (LIFT Learning Algorithm).

Input:
- Initial training dataset Dtr = {xq, yq}, q = 1, ..., ntr, where xq is a face image sample and yq ∈ Y = {1, ..., C} is the subject identity label associated with xq;
- Available testing dataset Dte = {xp}, p = 1, ..., nte;
- Recovered testing dataset Dre that is empty initially, Dre = ∅;
- Learning algorithm Learn1 used in the data selection phase;
- Learning algorithm Learn2 used to generate the final hypothesis.

Procedure:

Preprocessing (feature extraction)
(1) PCA is performed on all images in Dtr and Dte; the first N principal components are chosen to transform all images into vectors of length N.

Learning from testing data (data selection)
Do for each potential class label i, i ∈ Y = {1, ..., C}:
(1) Partition Dtr into Di_tr and D̄i_tr, where

    Di_tr = {{xk, yk} : {xk, yk} ∈ Dtr, yk = i}
    D̄i_tr = {{xl, yl} : {xl, yl} ∈ Dtr, yl ≠ i}    (1)

(2) Label all examples in Di_tr as class 1 and all examples in D̄i_tr as class 2, and form a binary classification training set D̃i_tr = {{xk, 1} : yk = i} ∪ {{xl, 2} : yl ≠ i}.
(3) Train Learn1 on D̃i_tr and return a hypothesis hi.
(4) Apply the testing dataset Dte to hi, and return the predicted labels ŷq. Add all testing examples that are predicted to class 1 (i.e., class i), Di_re = {{xq, i} : xq ∈ Dte, ŷq = i}, into Dre:

    Dre = Dre ∪ Di_re    (2)

Final training
(1) Exclude from Dre any testing example that is recovered to two or more different classes.
(2) Combine Dtr and Dre to form the augmented training dataset D̃tr:

    D̃tr = Dtr ∪ Dre    (3)

(3) Train Learn2 on D̃tr and return the final hypothesis H.

Output: the final hypothesis H.

Table 1. The benchmark characteristics used in this paper.

        # examples  # classes  # features
YALE    165         15         1024
EYB     2414        38         1024
ORL     400         40         1024
PIE     11554       68         1024
JAFFE   213         10         1024
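As a concrete illustration, the data selection and final training steps of Algorithm 1 can be sketched in a few lines of NumPy. The nearest-centroid classifier below is only a stand-in for Learn1 (the paper uses neural networks and SVMs), and all function names are our own, not the authors' code:

```python
import numpy as np

def nearest_centroid_fit(X, y):
    """Stand-in for Learn1: a nearest-centroid binary classifier."""
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def nearest_centroid_predict(model, X):
    classes = sorted(model)
    dists = np.stack([np.linalg.norm(X - model[c], axis=1) for c in classes], axis=1)
    return np.array(classes)[dists.argmin(axis=1)]

def lift_recover(X_tr, y_tr, X_te):
    """Data selection phase of LIFT: one-against-all label recovery.

    For every class i, a binary hypothesis h_i (class i vs. the rest)
    is trained and applied to the testing data; samples predicted as
    class i are recorded.  Samples recovered to two or more classes
    are excluded, as in the final-training step of Algorithm 1.
    """
    votes = [[] for _ in range(len(X_te))]
    for i in np.unique(y_tr):
        y_bin = np.where(y_tr == i, 1, 2)        # class i -> 1, rest -> 2
        h_i = nearest_centroid_fit(X_tr, y_bin)
        pred = nearest_centroid_predict(h_i, X_te)
        for q in np.flatnonzero(pred == 1):
            votes[q].append(int(i))
    keep = [q for q in range(len(X_te)) if len(votes[q]) == 1]
    return X_te[keep], np.array([votes[q][0] for q in keep])
```

The augmented training set of Eq. (3) is then `np.vstack([X_tr, X_re])` with labels `np.concatenate([y_tr, y_re])`, on which Learn2 is trained.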

Fig. 3 visualizes the main procedure of the proposed LIFT learning framework. In traditional face recognition approaches, the training images and the testing images are handled separately. Normally, the hypothesis is obtained only based on the training images and applied to the testing images to predict the labels. Instead, the LIFT learning framework proposed in this paper aims to take advantage of the vast amount of unlabeled testing images and learn from this unlabeled information. The recovered testing data can be integrated into the training process and largely improve the accuracy and robustness of the final hypothesis. Since many face recognition problems are characterized by inadequate labeled training images and a large amount of testing images, we believe that this idea may provide important new insights for face recognition applications.

One should also note that LIFT is a general learning framework allowing a wide range of choices of the base classification models for Learn1 and Learn2. For instance, different kinds of base learning algorithms, such as neural networks, SVMs, and decision trees, among others, can be integrated into this framework. Furthermore, users can also choose different learning schemes for Learn1 and Learn2 in different applications. For example, when only weak learners that can merely do better than random guessing are available, bootstrap aggregating (bagging) or boosting algorithms can be employed to construct a much stronger learner from these weak learners [16,17,7]. This provides the flexibility of using this framework as a general learning methodology in a wide range of real-world applications.

Fig. 3. Block diagram of the main procedure of the proposed LIFT learning framework.


4. Experimental results and analysis

Five popular face benchmarks, the Yale face (YALE) database [2], the extended Yale face database B (EYB) [19], the Cambridge ORL face (ORL) database [43], the CMU PIE face (PIE) database [45] and the Japanese Female Facial Expression (JAFFE) database [32], are used to investigate the performance of the proposed LIFT learning framework. We obtained these databases from [14] and [23]. The Yale face database contains 165 grayscale face images of 15 persons, with 11 images for each person. The extended Yale face database B consists of 2414 images of 38 subjects, with around 64 near-frontal images under different illuminations per individual. The ORL database of faces consists of 400 images, 10 different images for each of 40 distinct subjects. The CMU PIE database contains 11,554 images of 68 subjects, with around 170 images for each subject. The JAFFE database contains 213 images of seven facial expressions (angry, disgusted, fearful, happy, sad, surprised and neutral) posed by 10 Japanese female models. Each image in the first four databases is originally cropped to 32×32 pixels and expressed as a 1024-dimensional vector. For JAFFE, the images are originally 256×256 pixels. We first crop the images in JAFFE to 128×128 pixels, and then use a 4×4 average filter to reduce them to 32×32 pixels, represented as 1024-dimensional vectors like the other four datasets. Table 1 summarizes the benchmark characteristics and Fig. 4 shows some examples of the databases used in the experiments.

Because PCA is independent of the identity labels of the data, we can first apply the PCA paradigm to all the images to extract the face features from each image and compress the image vectors significantly. In our current experiments, we choose the first 100 principal components and transform each image to a 100-dimensional vector. Then, we randomly partition the whole image dataset into a training dataset and a testing dataset. For example, in the Yale face database, for each subject, we randomly select three images as the training samples and use the remaining eight images as testing images. Therefore, a total of 45 images are used for training and the other 120 images for testing. Table 2 shows the configurations of the training sets and the testing sets for the five image databases. Because some face datasets are not well balanced, the numbers of the testing samples for each class in these datasets may not be the




Fig. 4. Examples of the five face databases used in the experiments. (a) The Yale face database; (b) the extended Yale face database B; (c) the ORL database; (d) the CMU-PIE database; (e) the JAFFE database.

Table 2. The configuration of the training set and testing set in each experiment.

        Training set          Testing set
        Each class   Total    Each class   Total
YALE    3            45       8            120
EYB     15           570      44–49        1844
ORL     3            120      7            280
PIE     40           2720     126–130      8834
JAFFE   4            40       16–19        173

Table 3. Testing error performance comparison (in percentage).

        NN              SVM (linear)    SVM (poly.)     SVM (RBF)
        Trad.   LIFT    Trad.   LIFT    Trad.   LIFT    Trad.   LIFT
YALE    41.42   36.97   41.73   36.87   46.62   43.34   41.61   36.58
EYB     25.54   17.49   19.62   14.32   48.69   35.90   21.14   15.28
ORL     48.62   22.33   15.73   10.35   18.90   12.56   15.70   10.33
PIE     30.73   8.05    6.94    4.79    19.96   9.84    6.87    4.61
JAFFE   4.87    2.77    6.94    3.44    9.05    5.12    6.92    3.43

Table 4. Testing error standard deviation (in percentage).

        NN              SVM (linear)    SVM (poly.)     SVM (RBF)
        Trad.   LIFT    Trad.   LIFT    Trad.   LIFT    Trad.   LIFT
YALE    8.50    5.10    4.08    3.90    4.12    4.53    4.00    4.01
EYB     5.86    1.99    1.35    1.14    2.22    2.12    1.29    1.20
ORL     13.60   6.39    2.87    2.84    3.17    2.95    2.89    2.86
PIE     2.44    2.66    0.44    0.29    0.86    0.52    2.19    0.29
JAFFE   2.98    2.32    2.65    2.49    3.23    2.65    2.65    2.46


same. For example, for the PIE benchmark, there are only 126 testing samples for the class 38 subject, whereas there are 130 for other classes; for the EYB benchmark, the numbers of testing samples for each class range from 44 to 49; for the JAFFE benchmark, the numbers of testing samples for each class range from 16 to 19.
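The label-free preprocessing described above, fitting PCA on all images and projecting each one onto the first 100 principal components, can be sketched as follows. The SVD-based implementation and function names are our own illustration, not the authors' code:

```python
import numpy as np

def pca_transform(X_all, n_components=100):
    """Fit PCA on all images (rows of X_all; no labels needed) and
    project them onto the leading principal components."""
    mean = X_all.mean(axis=0)
    # economy-size SVD of the centered data: rows of Vt are the
    # principal directions, ordered by decreasing singular value
    U, S, Vt = np.linalg.svd(X_all - mean, full_matrices=False)
    W = Vt[:n_components].T          # (n_features, n_components)
    return (X_all - mean) @ W, mean, W

# a new image x would be projected as (x - mean) @ W
```
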

In our experiments, we verify our framework by using two different sets of base algorithms, neural networks with the multi-layer perceptron (MLP) structure and SVMs. Moreover, we adopt the same base algorithms for Learn1 and Learn2. In other words, in the first set of experiments, we use neural networks for Learn1 and Learn2, in which the number of hidden neurons is 20, and the numbers of input and output neurons are equal to the number of dimensions and the number of classes for each dataset, respectively. The sigmoid function is used as the activation function and backpropagation is used to train the network. Parameter settings for the neural networks include a learning rate of 0.0001 and 5000 training cycles. In the second set of experiments, we use SVMs (linear kernel, polynomial kernel with degree 3, and radial basis function (RBF) kernel) for both Learn1 and Learn2. We compare our LIFT learning framework to the traditional learning scheme, in which the final hypothesis is generated only based on the training dataset with the same base learning model.

Tables 3 and 4 show the averaged testing error performance and the error standard deviations over 100 random runs for the LIFT algorithm as well as the traditional learning method on the five benchmarks. From these tables, one can see that the proposed LIFT framework provides better classification accuracy than the traditional learning method.


In order to further investigate the performance improvement of the proposed framework over the traditional method, we compare the statistical characteristics of the results of all 100 runs from both methods by using two different testing schemes, hypothesis testing of the average values and box plots.

In the hypothesis testing, we calculate the mean and the standard deviation that are shown in Tables 3 and 4 using the following equations [34,21]:

    m = (1/n) Σ_{i=1}^{n} err_i    (4)

    s = sqrt( [n Σ_{i=1}^{n} err_i² − (Σ_{i=1}^{n} err_i)²] / [n(n−1)] )    (5)
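Eq. (4) is the sample mean, and Eq. (5) is the computational form of the sample standard deviation. A minimal sketch (our own, for illustration) shows that Eq. (5) agrees with the direct definition sqrt(Σ(err_i − m)² / (n−1)):

```python
import math

def mean_std(err):
    """Sample mean (Eq. 4) and sample standard deviation via the
    computational formula of Eq. (5)."""
    n = len(err)
    m = sum(err) / n
    s = math.sqrt((n * sum(e * e for e in err) - sum(err) ** 2)
                  / (n * (n - 1)))
    return m, s
```
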




We formulate the hypothesis as:

Null hypothesis:

    H0: m1 = m2    (6)

Alternative hypothesis:

    H1: m1 ≠ m2    (7)

Table 5. Hypothesis testing results of LIFT and the traditional method.

        NN      SVM (linear)    SVM (poly.)     SVM (RBF)
YALE    4.48    8.63            5.35            8.88
EYB     13.01   30.03           41.71           33.28
ORL     17.49   13.32           14.65           13.21
PIE     62.74   40.74           101.25          6.68
JAFFE   5.55    9.62            9.40            9.65

Fig. 5. Error performance and Boxplot results on the EYB database using NN.

Fig. 6. Error performance and Boxplot results on the EYB database using SVM (linear kernel).


The test statistic is calculated as follows:

    Z = (m1 − m2) / sqrt(s1²/n1 + s2²/n2)    (8)

For a two-tailed test, we will reject H0 if |Z| > 2.33 (2.33 corresponds to a two-tailed test where the results are significant at a level of 0.02). Table 5 shows the hypothesis testing results. From Table 5 we can see that all results are greater than 2.33 and most of them are even greater than 10. Therefore, we reject H0 and accept the alternative hypothesis H1, which means there is a statistically significant difference between the classification performance of the traditional method and that of the proposed LIFT framework. In other words, LIFT can significantly improve the recognition performance over the traditional method.
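Eq. (8) can be evaluated directly from the entries of Tables 3 and 4. Using the YALE/NN columns (m1 = 41.42, s1 = 8.50 for the traditional method, m2 = 36.97, s2 = 5.10 for LIFT, and n1 = n2 = 100 runs), this sketch reproduces the corresponding Table 5 entry (4.48) up to rounding:

```python
import math

def z_statistic(m1, s1, n1, m2, s2, n2):
    """Two-sample Z statistic of Eq. (8)."""
    return (m1 - m2) / math.sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)

# YALE / NN entries of Tables 3 and 4, 100 runs per method
z = z_statistic(41.42, 8.50, 100, 36.97, 5.10, 100)
```
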

The Boxplot method is a standard technique to depict groups of numerical data by presenting their five-number summary, including the minimum and maximum non-outlier values, the upper and lower quartiles, and the median [47,40]. We have investigated the boxplot results for all five face image sets for the traditional method and the LIFT framework using MLP and SVMs (linear kernel, polynomial kernel, and RBF kernel). Figs. 5–8 provide several snapshots of this analysis for the EYB database using the four base learners. The left parts of these figures are the error






Fig. 7. Error performance and Boxplot results on the EYB database using SVM (polynomial kernel).


Fig. 8. Error performance and Boxplot results on the EYB database using SVM (RBF kernel).

Table 6. Numerical characteristics of the LIFT framework and the traditional strategy on the EYB database (in percentage).

                      NN              SVM (linear)    SVM (poly.)     SVM (RBF)
                      Trad.   LIFT    Trad.   LIFT    Trad.   LIFT    Trad.   LIFT
Largest non-outlier   28.36   21.53   22.78   16.65   52.98   40.02   24.62   17.79
Upper quartile        25.52   18.36   20.63   15.27   49.97   37.36   22.15   16.08
Median                23.86   17.30   19.50   14.40   48.86   36.06   20.96   15.37
Lower quartile        22.40   16.08   18.66   13.42   47.32   34.46   20.20   14.37
Smallest non-outlier  18.66   14.37   15.84   11.66   43.49   30.42   17.95   12.04


rates of the 100 runs and the right parts are the boxplot results of the 100 runs. Table 6 summarizes the corresponding numerical results of the boxplot method on the EYB database. One can see that each numerical result of the LIFT framework is smaller than that of the traditional method. In other words, these statistical analysis results indicate that the proposed LIFT framework can greatly improve the classification performance over the traditional learning method.
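The five-number summary reported in Table 6 can be computed with NumPy percentiles. The 1.5×IQR whisker rule used below for the non-outlier range is a common boxplot convention and an assumption on our part; the paper does not state which whisker convention its boxplots use:

```python
import numpy as np

def five_number_summary(errors):
    """Quartiles plus the largest/smallest non-outlier values of a
    boxplot, assuming the 1.5*IQR whisker rule."""
    q1, med, q3 = np.percentile(errors, [25, 50, 75])
    iqr = q3 - q1
    lo = errors[errors >= q1 - 1.5 * iqr].min()   # smallest non-outlier
    hi = errors[errors <= q3 + 1.5 * iqr].max()   # largest non-outlier
    return lo, q1, med, q3, hi
```
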


As we discussed in Section 1, the proposed framework has a modularized structure, which means that the methods used in the modules of the framework can be replaced seamlessly. For instance, in the data selection step, instead of using the one-against-all strategy, we can use other transductive learning algorithms to explore useful information from the testing dataset. In this work, we investigate the use of the consistency method [63,61]




and the LRGA method [56] in the data selection step. We use the JAFFE database and SVMs (linear kernel, polynomial kernel, and RBF kernel) as learners in the final training step to compare the performance of the LIFT framework with the consistency method and the LRGA method to the traditional approach in terms of the error rates, standard deviations of the error rates, and hypothesis testing in Table 7. Cross-validation is used to find the parameters used in these two methods. For the consistency method, α is set to 0.25, which is consistent with the discussion in [56], and σ is set to 0.75. For LRGA, k is set to 2, and λ is set to 10. In Fig. 9, we investigate the error performance of the LIFT framework with the LRGA method using different k values. The simulation results show that in this particular scenario, better performance can be obtained with a smaller k value. This is probably due to the small size of the datasets and the large

Fig. 9. Error performance of the LIFT framework with the LRGA method using different k on the JAFFE database and SVM learners.

Table 7. Simulation results of the LIFT framework with the consistency method and the LRGA method compared to the traditional strategy on the JAFFE database and SVM learners (in percentage).

                                         Linear kernel  Poly. kernel  RBF kernel
Trad.        Error rate                  6.94           9.05          6.92
             Std.                        2.65           3.23          2.65
Consistency  Error rate                  4.76           5.60          4.73
             Std.                        2.66           2.62          2.68
             Hypothesis testing vs. Trad.  5.80         8.29          5.79
LRGA         Error rate                  5.99           7.64          6.02
             Std.                        2.83           3.23          2.81
             Hypothesis testing vs. Trad.  2.44         3.09          2.34


number of class categories. From Table 7 one can see that there are significant differences between the two variants of the LIFT framework (with the consistency method and the LRGA method, respectively) and the traditional approach, at confidence level 0.02.

Furthermore, we compare the proposed framework with a commonly used semi-supervised learning scheme, the self-training method [50], on the EYB dataset with a neural network as the base learner. The experiment is designed in the following way. The EYB dataset is divided into three datasets: the labeled training dataset with 300 images, the unlabeled training dataset with 1057 images, and the testing dataset with the remaining 1057 images. In the self-training scheme, we use the labeled training dataset to recover the labels of the unlabeled training dataset, and then, with the labeled data and the recovered unlabeled data, a final classifier is trained and applied to the testing dataset. In our learning scheme, only the labeled training data and the testing data are used. Specifically, the labeled training data are used to recover the labels of the testing data, and the information explored from the testing dataset is integrated into the final learning procedure. Table 8 illustrates the simulation results for both methods. We also present the simulation results for the traditional approach, in which only the labeled training data are used to develop the final classifier. Here, 12.72 and 13.39 are the hypothesis testing results


Table 8. Simulation results of the LIFT framework and the self-training scheme compared to the traditional strategy on the EYB database and NN learners (in percentage).

Scheme         Error rate  Standard deviation  Hypothesis testing
Trad.          38.36       5.47                –       –
Self-training  30.81       2.31                12.72   –
LIFT           29.80       3.31                13.39   2.50




of self-training and LIFT against the traditional approach, respectively, and 2.50 is the hypothesis testing result between self-training and LIFT. These results show that both the self-training scheme and LIFT outperform the traditional approach, and that LIFT achieves better performance than the self-training scheme.

5. Data quality for LIFT framework

An important question still remains for the proposed framework, i.e., to what extent, or under what assumptions, can the proposed method benefit the final decision-making process? In this section, we provide some discussion of the impact of the quality of the data learned from the testing set on the performance of the face recognition system. Generally speaking, recovered data of high quality can benefit learning, whereas recovered testing data of low quality may indeed degrade the performance of the proposed system, since they may simply introduce more noise into the learning system. Here we use the YALE benchmark and SVM with a linear kernel as an example to explore how the quality of the recovered data impacts the performance of the LIFT framework. To do this, we explicitly add the recovered testing data into the training set and then train a final hypothesis based on the expanded training set. We adjust three attributes of the added testing dataset, i.e., the sample size, the accuracy rate, and the error type, to investigate their influence on the LIFT framework.

5.1. Size and accuracy rate

Fig. 10 shows the adjustments of the sample size and accuracy rate used to test their influence on the LIFT framework. The x-axis stands for the size of the recovered testing dataset to be added, and the y-axis stands for the accuracy of such recovered testing data. We partition all the testing data (120 samples in this case) into 20 chunks, each chunk with 6 samples. The first k chunks of the testing data are combined to generate the kth block. For example, block 1 contains six samples from chunk 1; block 15 contains 90 samples from chunks 1 to 15. Along the y-axis, we explicitly change the accuracy rate of the recovered class labels from 0% to 100% with a step size of 5%. For instance, in block 15, which contains 90 samples, when the accuracy rate is 40%, 36 samples will be labeled with correct class labels and the remaining 54 samples will be labeled with incorrect class labels.
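The block construction and label corruption just described can be sketched as follows (block k is the first k chunks of 6 samples; a 40% accuracy rate on block 15 leaves 36 of 90 labels correct). The random-label corruption used here anticipates the "random error" type of Section 5.2, and all function names are illustrative:

```python
import numpy as np

def make_block(X_te, y_true, k, chunk_size=6, accuracy=1.0, rng=None):
    """Block k = the first k chunks of the testing data, with labels
    corrupted so that exactly the target accuracy rate is correct.
    Incorrect labels are drawn uniformly at random from the other
    classes (the 'random error' type)."""
    if rng is None:
        rng = np.random.default_rng(0)
    n = k * chunk_size
    Xb, yb = X_te[:n], y_true[:n].copy()
    classes = np.unique(y_true)
    n_wrong = round((1.0 - accuracy) * n)
    # corrupt n_wrong distinct samples, each to a label != its true one
    for q in rng.choice(n, size=n_wrong, replace=False):
        yb[q] = rng.choice(classes[classes != yb[q]])
    return Xb, yb
```
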

5.2. Error type

When we generate incorrect class labels for the recovered testing data to adjust the accuracy rate, we design two

Fig. 10. Examples of the adjustments in the size and the accuracy rate.


approaches to generate the incorrect labels: random error andbiased error.

In the random error approach, we randomly pick a class label other than the actual class label for a recovered testing sample. For example, for the recovered testing sample xt with actual class label yt, we randomly generate a label ŷt ∈ {yi : yi ∈ Y, yi ≠ yt} and use this incorrect label as the recovered label for xt. On the other hand, in the biased error method, we directly use the misclassified label from the LIFT framework as the recovered label for xt. Fig. 11 provides an example of how incorrect labels are generated in these two approaches. In this example, we assume that the size of the block is 12 and the accuracy rate is 75%; therefore, the recovered testing set contains nine samples with correct class labels and three with incorrect class labels. In the random error approach, all incorrect labels are generated by randomly picking a class label other than the correct label, whereas in the biased error approach, since the last three samples are incorrectly estimated by Learn1 in the LIFT framework in our experiments, we directly use the incorrect labels estimated by Learn1 as the recovered labels for these last three samples.
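The two ways of assigning incorrect labels can be sketched as follows; `learn1_pred` stands for the (possibly wrong) labels predicted by Learn1, and the function name is our own:

```python
import numpy as np

def assign_error_labels(y_true, wrong_idx, learn1_pred, error_type, rng=None):
    """Assign incorrect recovered labels to the samples in wrong_idx.

    'random': pick any class label other than the true one;
    'biased': reuse the incorrect label predicted by Learn1
              (passed in as learn1_pred)."""
    if rng is None:
        rng = np.random.default_rng(0)
    classes = np.unique(y_true)
    y_rec = y_true.copy()
    for q in wrong_idx:
        if error_type == "biased":
            y_rec[q] = learn1_pred[q]
        else:
            y_rec[q] = rng.choice(classes[classes != y_true[q]])
    return y_rec
```
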

According to these discussions, Fig. 12 illustrates the results of the two sets of experiments. In each set of experiments, we evaluate the recognition performance with respect to the changing sample size and accuracy rate of the added testing dataset when the different error type methods are adopted. Specifically, in Fig. 12(a), the biased error method is used, whereas in Fig. 12(b), the random error method is used. In each figure, the x-axis and y-axis are the size and the accuracy of the added testing dataset, respectively, and the z-axis is the classification error rate of the final hypothesis. Since the error rate of the traditional approach, e_traditional, is a constant that is independent of the added testing data, we can draw e_traditional as a plane parallel to the xy-plane, as illustrated in Fig. 12(a) and (b). From Fig. 12, we can draw the contours of the performance differences between the approach based on the augmented training set and the traditional approach, i.e., Δe = e_LIFT − e_traditional, where e_LIFT are the error rates of the system based on the augmented training set (the LIFT framework) and e_traditional are the error rates of the traditional approach based on the original training dataset. We illustrate the contours of Δe when the biased error and random error methods are used in Fig. 13(a) and (b), respectively. From Fig. 13, we can see that if the accuracy rate is high, then with increasing size of the recovered testing dataset, the performance is pushed toward the upper-right corner with a larger negative value of Δe. This means that the proposed LIFT framework achieves better recognition performance compared to the traditional approach. However, if the accuracy rate of the added testing

Fig. 11. Examples of the adjustments in the error type.



Fig. 12. The error rate results of the two sets of the experiments: (a) biased error; (b) random error.

Fig. 13. Contour of De: (a) biased error; (b) random error.


data is too low, larger sizes of the recovered testing dataset result in even worse performance, i.e., the upper-left corner. Meanwhile, the contour curve with value 0 can be considered as the performance


boundary, i.e., any point to the left of the boundary indicates that the added testing data degrade the original recognition system, while for any point to the right of the boundary, the recovered testing data benefit the final recognition performance. Furthermore, this boundary provides us with a criterion to decide under what data quality the recovered testing data should be added to the training set. In fact, we can use cross-validation to obtain this contour to provide a criterion for this purpose.

Figs. 14 and 15 show several snapshots of the error performance with a fixed accuracy rate and a fixed size of the added dataset, respectively. In each subfigure of Fig. 14, the accuracy rate of the added dataset is fixed. The x-axis represents the size of the added testing data, increasing from 6 to 120, and the y-axis represents the error rate of the final system. In each subfigure of Fig. 15, the size of the added dataset is fixed. The x-axis represents the accuracy rate of the added testing data, improving from 0% to 100%, and the y-axis represents the error rate of the final system. The results from both error type methods discussed in Section 5.2 (random error and biased error) are shown in each figure, as well as those from the traditional approach based only on the training set. From Fig. 14 one can see that, if the accuracy rate of the added testing data is below 30%, the final recognition performance indeed decreases as the number of augmented data increases. On the other hand, if the recovered accuracy is above 80%, then increasing the size of the augmented data benefits the final recognition process. If the accuracy rate of the added testing data is between 30% and 80%, increasing the size initially brings worse performance, while further increasing the size benefits the performance. When the size of the added testing data is fixed, as shown in Fig. 15, increasing the accuracy rate of the added testing data always benefits the final recognition. Another interesting phenomenon observed is that the results from random errors are always better than those from biased errors, except when the number of added data is very large (i.e., the last subfigure in Fig. 15), where both the random error and biased error methods give the same level of performance (the two lines overlap).




Fig. 14. Some snapshots of the error performance with respect to the size when the accuracy rate is fixed in each subfigure.


6. Conclusion

We propose the LIFT framework for face recognition problems in this paper. The key idea of this approach is to reinforce the final learning system based on the extra information learned from the testing data distribution. In order to effectively explore such useful information from the testing data, we use the one-against-all technique to recover the labels of the testing examples. By adding the recovered testing examples into the training set, a more reliable and robust hypothesis can be developed based on the expanded training set. Neural networks with the multi-layer perceptron structure and support vector machines with three different kernels are integrated into the proposed learning framework. Furthermore, we investigate two variants of the proposed algorithm by integrating two other transductive methods, the consistency method and the LRGA method, into the LIFT framework. Simulation results on five face benchmarks, including the Yale database, the extended Yale face database B, the Cambridge ORL face database, the CMU PIE face database and the Japanese Female Facial Expression database, demonstrate the effectiveness and robustness of the proposed learning methodology.

There are several interesting directions that can be further studied. For instance, different feature extraction methods, such as


ICA, FLD, and others, can be integrated into the LIFT framework. The influence of different feature extraction methods on the classification accuracy and robustness of the LIFT framework is an interesting future direction. Second, our framework requires good data quality of the recovered testing data. Therefore, the method used to recover the labels of the testing data is critical. In our current study, we adopt the one-against-all technique in our experiments. It would be interesting to study other mechanisms for the label recovery process of the proposed LIFT framework. For instance, some existing semi-supervised learning methods, such as co-training and self-training, among others, may be integrated into this framework to facilitate the learning process. Furthermore, a large-scale empirical study of the proposed method across different types of benchmarks will be necessary to fully justify the effectiveness of this framework across different application domains. Currently, we are investigating all these aspects, and new results will be reported in future research publications. Motivated by our results in this paper, we believe that the essential idea of LIFT, that is, the use of testing data to reinforce the final decision-making process, may provide the community with a new angle to address this issue, and can potentially be a powerful method for a wide range of real-world applications.

Please cite this article as: Y. Cao, et al., LIFT: A new framework of learning from testing data for face recognition, Neurocomputing (2011), doi:10.1016/j.neucom.2010.10.015


Fig. 15. Some snapshots of the error performance with respect to the accuracy rates when the size is fixed in each subfigure. (Graphical content not reproduced; the plotted curves compare the biased error, the traditional method, and the random error.)

Y. Cao et al. / Neurocomputing

References

[1] M.S. Bartlett, J.R. Movellan, T.J. Sejnowski, Face recognition by independent component analysis, IEEE Transactions on Neural Networks 13 (6) (2002) 1450–1464.

[2] P. Belhumeur, J. Hespanha, D. Kriegman, Eigenfaces vs. fisherfaces: recognition using class specific linear projection, IEEE Transactions on Pattern Analysis and Machine Intelligence 19 (7) (1997) 711–720.

[3] M. Belkin, P. Niyogi, V. Sindhwani, Manifold regularization: a geometric framework for learning from labeled and unlabeled examples, Journal of Machine Learning Research 7 (2006) 2399–2434.

[4] A. Beygelzimer, J. Langford, B. Zadrozny, Weighted one against all, in: Proceedings of the 20th National Conference on Artificial Intelligence (AAAI), 2005, pp. 720–725.

[5] A. Blum, S. Chawla, Learning from labeled and unlabeled data using graph mincuts, in: Proceedings of the International Conference on Machine Learning (ICML'01), 2001, pp. 19–26.

[6] A. Blum, T. Mitchell, Combining labeled and unlabeled data with co-training, in: Proceedings of the Workshop on Computational Learning Theory (COLT'98), 1998, pp. 92–100.

[7] L. Breiman, Bagging predictors, Machine Learning 24 (2) (1996) 123–140.

[8] R. Brunelli, T. Poggio, Face recognition: features versus templates, IEEE Transactions on Pattern Analysis and Machine Intelligence 15 (10) (1993) 1042–1052.

[9] Y. Cao, H. He, Learning from testing data: a new view of incremental semi-supervised learning, in: Proceedings of the International Joint Conference on Neural Networks (IJCNN'08), 2008, pp. 2873–2879.

[10] O. Chapelle, B. Scholkopf, A. Zien, Semi-Supervised Learning, MIT Press, 2006.

[11] O. Chapelle, V. Sindhwani, S.S. Keerthi, Branch and bound for semi-supervised support vector machines, in: Proceedings of Neural Information Processing Systems (NIPS'06), 2006, pp. 1–8.

[12] R. Chellappa, C.L. Wilson, S. Sirohey, Human and machine recognition of faces: a survey, Proceedings of the IEEE 83 (5) (1995) 705–741.

[13] D.-Q. Dai, P.C. Yuen, Face recognition by regularized discriminant analysis, IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics 37 (4) (2007) 1080–1085.

[14] Four face databases in MATLAB format, [Online], available: http://www.cs.uiuc.edu/homes/dengcai2/Data/FaceData.html.

[15] M.J. Er, W. Chen, S. Wu, High-speed face recognition based on discrete cosine transform and RBF neural networks, IEEE Transactions on Neural Networks 16 (3) (2005) 679–691.


[16] Y. Freund, R.E. Schapire, Experiments with a new boosting algorithm, in: Proceedings of the International Conference on Machine Learning, 1996, pp. 148–156.

[17] Y. Freund, R.E. Schapire, A decision-theoretic generalization of on-line learning and an application to boosting, Journal of Computer and System Sciences 55 (1) (1997) 119–139.

[18] Y. Fu, Z. Li, J. Yuan, Y. Wu, T.S. Huang, Locality versus globality: query-driven localized linear models for facial image computing, IEEE Transactions on Circuits and Systems for Video Technology (T-CSVT) 18 (12) (2008) 1741–1752.

[19] A.S. Georghiades, P.N. Belhumeur, D.J. Kriegman, From few to many: illumination cone models for face recognition under variable lighting and pose, IEEE Transactions on Pattern Analysis and Machine Intelligence 23 (6) (2001) 643–660.

[20] G. Guo, S.Z. Li, K. Chan, Face recognition by support vector machines, in: Proceedings of the Fourth IEEE International Conference on Automatic Face and Gesture Recognition, 2000.

[21] H. He, J.A. Starzyk, A self-organizing learning array system for power quality classification based on wavelet transform, IEEE Transactions on Power Delivery 21 (2006) 286–295.

[22] E. Hjelmas, B.K. Low, Face detection: a survey, Computer Vision and Image Understanding 83 (3) (2001) 236–274.

[23] JAFFE download, [Online], available: http://www.kasrl.org/jaffe_download.html.

[24] K. Jonsson, J. Kittler, Y.P. Li, J. Matas, Support vector machines for face authentication, in: T. Pridmore, D. Elliman (Eds.), BMVC'99, 1999, pp. 543–553.

[25] K. Jonsson, J. Matas, J. Kittler, Y.P. Li, Learning support vectors for face verification and recognition, in: Proceedings of the Fourth IEEE International Conference on Automatic Face and Gesture Recognition, 2000.

[26] K.-C. Kwak, W. Pedrycz, Face recognition using fuzzy integral and wavelet decomposition method, IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics 34 (4) (2004) 1666–1675.

[27] M. Lades, J.C. Vorbruggen, J. Buhmann, J. Lange, C. von der Malsburg, R.P. Wurtz, W. Konen, Distortion invariant object recognition in the dynamic link architecture, IEEE Transactions on Computers 42 (1993) 300–311.

[28] C. Liu, H. Wechsler, Gabor feature based classification using the enhanced Fisher linear discriminant model for face recognition, IEEE Transactions on Image Processing 11 (4) (2002) 467–476.

[29] Y. Liu, Y.F. Zheng, One-against-all multi-class SVM classification using reliability measures, in: Proceedings of the 2005 IEEE International Joint Conference on Neural Networks (IJCNN'05), 2005, pp. 849–854.

[30] R. Lotlikar, R. Kothari, Fractional-step dimensionality reduction, IEEE Transactions on Pattern Analysis and Machine Intelligence 22 (6) (2000) 623–627.




[31] J. Lu, K.N. Plataniotis, A.N. Venetsanopoulos, Face recognition using LDA-based algorithms, IEEE Transactions on Neural Networks 14 (1) (2003) 195–200.

[32] M.J. Lyons, J. Budynek, S. Akamatsu, Automatic classification of single facial images, IEEE Transactions on Pattern Analysis and Machine Intelligence 21 (12) (1999) 1357–1362.

[33] D. Masip, J. Vitria, Shared feature extraction for nearest neighbor face recognition, IEEE Transactions on Neural Networks 19 (4) (2008) 586–595.

[34] I. Miller, J.E. Freund, Probability and Statistics for Engineers, Prentice-Hall, Englewood Cliffs, NJ, 1965.

[35] D.J. Miller, H.S. Uyar, A mixture of experts classifier with learning based on both labelled and unlabelled data, in: Proceedings of Neural Information Processing Systems (NIPS'97), 1997, pp. 571–577.

[36] T. Mitchell, The role of unlabeled data in supervised learning, in: Proceedings of the International Colloquium on Cognitive Science, 1999.

[37] T. Mitchell, The discipline of machine learning, Technical Report CMU-ML-06-108, Carnegie Mellon University, 2006.

[38] K. Nigam, A.K. McCallum, S. Thrun, T. Mitchell, Text classification from labeled and unlabeled documents using EM, Machine Learning 39 (2–3) (2000) 103–134.

[39] S.L. Phung, A. Bouzerdoum, A pyramidal neural network for visual pattern recognition, IEEE Transactions on Neural Networks 18 (2) (2007) 329–343.

[40] K. Potter, Methods for presenting statistical information: the box plot, in: H. Hagen, A. Kerren, P. Dannenmann (Eds.), Visualization of Large and Unstructured Data Sets (LNI), vol. S-4, 2006, pp. 97–106.

[41] F. Roli, G.L. Marcialis, Semi-supervised PCA-based face recognition using self-training, in: D.-Y. Yeung, J.T. Kwok, A.L.N. Fred, F. Roli, D. Ridder (Eds.), Structural, Syntactic, and Statistical Pattern Recognition (SSPR/SPR), Springer, 2006, pp. 560–568.

[42] C. Rosenberg, M. Hebert, H. Schneiderman, Semi-supervised self-training of object detection models, in: Proceedings of the Seventh IEEE Workshops on Application of Computer Vision (WACV/MOTION'05), vol. 1, 2005, pp. 29–36.

[43] F. Samaria, A. Harter, Parameterisation of a stochastic model for human face identification, in: Proceedings of the 2nd IEEE Workshop on Applications of Computer Vision, Sarasota, 1994, pp. 138–142.

[44] A. Sarkar, Applying co-training methods to statistical parsing, in: Proceedings of the North American Chapter of the Association for Computational Linguistics on Language Technologies (NAACL'01), 2001, pp. 95–102.

[45] T. Sim, S. Baker, M. Bsat, The CMU pose, illumination, and expression (PIE) database, in: Proceedings of the IEEE International Conference on Automatic Face and Gesture Recognition, 2002, pp. 46–51.

[46] P. Stone, M. Veloso, Using testing to iteratively improve training, in: Working Notes of the AAAI 1995 Fall Symposium on Active Learning, 1995, pp. 110–111.

[47] J.W. Tukey, Exploratory Data Analysis, Addison-Wesley, Reading, MA, 1977.

[48] M. Turk, A. Pentland, Eigenfaces for recognition, Journal of Cognitive Neuroscience 3 (1) (1991) 71–86.

[49] H. Wang, S. Yan, T. Huang, J. Liu, X. Tang, Misalignment-robust face recognition, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR'08), 2008.

[50] M. Wang, X.S. Hua, L.R. Dai, Y. Song, Enhanced semi-supervised learning for automatic video annotation, in: Proceedings of the IEEE International Conference on Multimedia and Expo, 2006.

[51] D. Xu, S. Yan, Semi-supervised bilinear subspace learning, IEEE Transactions on Image Processing 18 (7) (2009) 1529–1541.

[52] D. Xu, S. Yan, S. Lin, T.S. Huang, S.-F. Chang, Enhancing bilinear subspace learning by element rearrangement, IEEE Transactions on Pattern Analysis and Machine Intelligence 31 (10) (2009) 1913–1920.

[53] D. Xu, S. Yan, J. Luo, Face recognition using spatially constrained earth mover's distance, IEEE Transactions on Image Processing 17 (11) (2008) 2256–2260.

[54] D. Xu, S. Yan, L. Zhang, S. Lin, T.S. Huang, Convergent 2D subspace learning with null space analysis, IEEE Transactions on Circuits and Systems for Video Technology 18 (12) (2008) 1753–1759.

[55] D. Xu, S. Yan, L. Zhang, S. Lin, H.-J. Zhang, T.S. Huang, Reconstruction and recognition of tensor-based objects with concurrent subspaces analysis, IEEE Transactions on Circuits and Systems for Video Technology 18 (1) (2008) 36–47.

[56] Y. Yang, D. Xu, F. Nie, J. Luo, Y. Zhuang, Ranking with local regression and global alignment for cross media retrieval, in: Proceedings of the Seventeenth ACM International Conference on Multimedia, 2009, pp. 175–184.

[57] H. Yu, J. Yang, A direct LDA algorithm for high-dimensional data—with application to face recognition, Pattern Recognition 34 (10) (2001) 2067–2070.

[58] D. Zhang, W.S. Lee, Validating co-training models for web image classification, in: Proceedings of the SMA Annual Symposium, NUS, 2005.

[59] W. Zhao, R. Chellappa, P.J. Phillips, A. Rosenfeld, Face recognition: a literature survey, ACM Computing Surveys (CSUR) 35 (4) (2003) 399–458.

[60] Z.H. Zhou, M. Li, Tri-training: exploiting unlabeled data using three classifiers, IEEE Transactions on Knowledge and Data Engineering 17 (11) (2005) 1529–1541.


[61] D. Zhou, O. Bousquet, T.N. Lal, J. Weston, B. Scholkopf, Learning with local and global consistency, in: S. Thrun, L. Saul (Eds.), Advances in Neural Information Processing Systems, vol. 16, MIT Press, Cambridge, MA, USA, 2004, pp. 321–328.

[62] D. Zhou, B. Scholkopf, T. Hofmann, Semi-supervised learning on directed graphs, in: Proceedings of Neural Information Processing Systems (NIPS'05), 2005, pp. 1633–1640.

[63] D. Zhou, J. Weston, A. Gretton, O. Bousquet, B. Scholkopf, Ranking on data manifolds, MPI Technical Report (113), Max Planck Institute for Biological Cybernetics, Tubingen, Germany, 2003.

[64] X. Zhu, Semi-supervised learning literature survey, Technical Report TR-1530, Department of Computer Sciences, University of Wisconsin at Madison, 2007.

Yuan Cao received the B.E. and M.S. degrees from Zhejiang University, China, in 2001 and 2004, respectively, and the M.S. degree from Oklahoma State University, Stillwater, in 2007, all in electrical engineering. He is currently a Ph.D. candidate in computer engineering at Stevens Institute of Technology, Hoboken.

His current research interests include pattern recognition, machine learning, and data mining.

Haibo He received the B.S. and M.S. degrees in electrical engineering from Huazhong University of Science and Technology (HUST), Wuhan, China, in 1999 and 2002, respectively, and the Ph.D. degree in electrical engineering from Ohio University, Athens, in 2006. From 2006 to 2009, he was an assistant professor in the Department of Electrical and Computer Engineering, Stevens Institute of Technology, Hoboken, New Jersey. He is currently an assistant professor in the Department of Electrical, Computer, and Biomedical Engineering at the University of Rhode Island, Kingston, Rhode Island.

His research interests include self-adaptive intelligent systems, machine learning and data mining, computational intelligence, VLSI and FPGA design, and smart grid. He has served regularly on the organization committees and the program committees of many international conferences and has also been a reviewer for the leading academic journals in his fields. He has also served as a guest editor for several international journals. Currently, he is an Associate Editor of the IEEE Transactions on Neural Networks, an Editor of the IEEE Transactions on Smart Grid, and the Editor of the IEEE Computational Intelligence Society (CIS) Electronic Letter (E-letter).

He (Helen) Huang received the B.S. degree from the School of Electronic and Information Engineering at Xi'an JiaoTong University, China, in 2000, and the M.S. and Ph.D. degrees from the Harrington Department of Bioengineering, Arizona State University, in 2002 and 2006, respectively. She worked as a post-doctoral research associate in the Neural Engineering Center for Artificial Limbs at the Rehabilitation Institute of Chicago from 2006 to 2008. She is currently an assistant professor in the Department of Electrical, Computer, and Biomedical Engineering at the University of Rhode Island.

Dr. Huang's primary research interests include neural-machine interface, modeling and analysis of neuromuscular control of movement in normal and neurologically disordered humans, virtual reality in neuromotor rehabilitation, and design and control of therapeutic robots, orthoses, and prostheses. Her specialties lie in machine learning, adaptive control, biomechanical modeling, signal and image processing, and motion analysis. She is a member of the IEEE Engineering in Medicine and Biology Society and the Society for Neuroscience.
