An optimization criterion for generalized
discriminant analysis on undersampled problems
Jieping Ye, Ravi Janardan, Cheong Hee Park, and Haesun Park
Abstract
An optimization criterion is presented for discriminant analysis. The criterion extends the optimization criteria
of the classical Linear Discriminant Analysis (LDA) through the use of the pseudo-inverse when the scatter
matrices are singular. It is applicable regardless of the relative sizes of the data dimension and sample size,
overcoming a limitation of classical LDA. The optimization problem can be solved analytically by applying the
Generalized Singular Value Decomposition (GSVD) technique. The pseudo-inverse has been suggested and used
for undersampled problems in the past, where the data dimension exceeds the number of data points. The criterion
proposed in this paper provides a theoretical justification for this procedure.
An approximation algorithm for the GSVD-based approach is also presented. It reduces the computational
complexity by finding sub-clusters of each cluster, and uses their centroids to capture the structure of each cluster.
This reduced problem yields much smaller matrices to which the GSVD can be applied efficiently. Experiments
on text data, with up to 7000 dimensions, show that the approximation algorithm produces results that are close
to those produced by the exact algorithm.
Index Terms
Classification, clustering, dimension reduction, generalized singular value decomposition, linear discriminant
analysis, text mining.
J. Ye, R. Janardan, C. Park, and H. Park are with the Department of Computer Science and Engineering, University of Minnesota–Twin
Cities, Minneapolis, MN 55455, U.S.A. Email: {jieping, janardan, chpark, hpark}@cs.umn.edu.
Corresponding author: Jieping Ye; Tel.: 612-626-7504; Fax: 612-625-0572.
I. INTRODUCTION
Many interesting data mining problems involve data sets represented in very high-dimensional spaces.
We consider dimension reduction for high-dimensional, undersampled data, where the dimension of the
data points is higher than the number of data points. Such high-dimensional, undersampled problems
occur frequently in many applications, including information retrieval [25], facial recognition [3], [28]
and microarray analysis [1].
The application area of interest in this paper is vector space-based information retrieval. The dimension
of the document vectors is typically very high, due to the large number of terms that appear in the collection
of the documents. In the vector space-based model, documents are represented as column vectors in a
term-document matrix. For an $N \times n$ term-document matrix $A = (a_{ij})$, its $(i,j)$-th entry, $a_{ij}$, represents the weighted frequency of term $i$ in document $j$. Several weighting schemes have been developed for encoding the document collection in a term-document matrix [17]. An advantage of the vector space-based method is that once the collection of documents is represented as columns of the term-document matrix in a high-dimensional space, the algebraic structure of the related vector space can then be exploited.
When the documents are already clustered, we would like to find a dimension-reducing transformation
that preserves the cluster structure of the original full space even after the dimension reduction. Throughout
the paper, the input documents are assumed to have been already clustered before the dimension reduction
step. When the documents are not clustered, then an efficient clustering algorithm such as K-Means [6],
[16] can be applied before the dimension reduction step. We seek a reduced representation of the document
vectors, which best preserves the structure of the original document vectors.
A. Prior related work
Latent Semantic Indexing (LSI) has been widely used for dimension reduction of text data [2], [5]. It is
based on lower rank approximation of the term-document matrix from the Singular Value Decomposition
(SVD) [11]. Although the SVD provides the optimal reduced rank approximation of the matrix when the
difference is measured in the $L_2$ or Frobenius norm, it has limitations in that it does not consider the cluster structure in the data and is expensive to compute. Moreover, the optimal reduced dimension is difficult to determine theoretically.
The Linear Discriminant Analysis (LDA) method has been applied for decades for dimension reduction
(feature extraction) of clustered data in pattern recognition [10]. It is classically formulated as an optimization problem on scatter matrices. A serious disadvantage of LDA is that its objective function
requires that the total scatter matrix be nonsingular. In many modern data mining problems such as
information retrieval, facial recognition, and microarray data analysis, the total scatter matrix in question
can be singular since the data items are from a very high-dimensional space, and in general, the dimension
exceeds the number of data points. This is known as the undersampled or singularity problem [10], [18].
In recent years, many approaches have been brought to bear on such high-dimensional, undersampled
problems. These methods can be roughly grouped into three categories. The first approach applies an
intermediate dimension reduction method, such as LSI, Principal Component Analysis (PCA), and Partial
Least Squares (PLS), to extract important components for LDA [3], [14], [18], [28]. The algorithm is
named two-stage LDA. This approach is straightforward but the intermediate dimension reduction stage
may remove some useful information. Some relevant references on this approach can be found in [15],
[18]. The second approach solves the singularity problem by adding a perturbation to the scatter matrix
[4], [9], [18], [33]. The algorithm is known as the regularized LDA, or RLDA for short. A limitation of
RLDA is that the amount of perturbation to be used is difficult to determine [4], [18].
The final approach applies the pseudo-inverse to avoid the singularity problem, which is equivalent to
approximating the solution using a least-squares solution method. The work in [12] showed that its use
led to appropriate calculation of Mahalanobis distance in singular cases. The use of classical LDA with
the pseudo-inverse has been proposed in the past. The Pseudo Fisher Linear Discriminant (PFLDA) in
[8], [10], [23], [26], [27] is based on the pseudo-inverse of the scatter matrices. The generalization error of
PFLDA was studied in [8], [26], when the size and dimension of the training data vary. As observed in [8],
[26], with an increase in the size of the training data, the generalization error of PFLDA first decreases,
reaching a minimum, then increases, reaching a maximum at the point when the size of the training data is
comparable with the data dimension, and afterwards begins to decrease. This peaking behavior of PFLDA
can be resolved by either reducing the size of the training data [8], [26] or by increasing the dimension
through addition of redundant features [27]. Pseudo-inverses of the scatter matrices were also studied in
[18], [20], [29]. Experiments in [18] showed that the pseudo-inverse based method achieved comparable
performance with RLDA and two-stage LDA. It is worthwhile to note that both RLDA and two-stage
LDA involve the estimation of some parameters. The authors in [18] applied cross-validation to estimate
the parameters, which can be very expensive. On the other hand, the pseudo-inverse based LDA does not
involve any parameter and is straightforward. However, its theoretical justification has not been studied
well in the literature.
Recently, a generalization of LDA based on the Generalized Singular Value Decomposition (GSVD)
has been developed [13], [14], which is applicable regardless of the data dimension, and, therefore, can be
used for undersampled problems. The classical LDA solution becomes a special case of this LDA/GSVD
method. In [13], [14], the solution from LDA/GSVD is justified to preserve the cluster structure in the
original full space after dimension reduction. However, no explicit global objective function has been
presented.
B. Contributions of this paper
In this paper, we present a generalized optimization criterion, based on the pseudo-inverse, for discriminant analysis. Our cluster-preserving projections are tailored for extracting the cluster structure of high-dimensional data, and are closely related to classical linear discriminant analysis. The main advantage of
the proposed criterion is that it is applicable to undersampled problems. A detailed mathematical derivation
of the proposed optimization problem is presented. The GSVD technique is the key component for the
derivation. The solution from the LDA/GSVD algorithm is a special case of the solution for the proposed
criterion. We call it the exact algorithm, to distinguish it from an efficient approximation algorithm that we
also present. Interestingly, we show the equivalence between the exact algorithm and the pseudo-inverse-
based method, PFLDA. Hence, the criterion proposed in this paper may provide a theoretical justification
for the pseudo-inverse based LDA.
One limitation of the exact algorithm is the high computational complexity of GSVD in handling
large matrices. Typically, large document datasets, such as the ones in [32], contain several thousand documents with dimension in the tens of thousands. We propose an approximation algorithm based on
sub-clustering of clusters to reduce the cost of computing the SVD involved in the computation of the
GSVD. Each cluster is further sub-clustered so that the overall structure of each cluster can be represented
by the set of centroids of sub-clusters. As a result, only a few vectors are needed to define the scatter
matrices, thus reducing the computational complexity quite significantly. Experimental results show that
the approximation algorithm produces results close to those produced by the exact one.
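The sub-clustering idea can be sketched in a few lines: each cluster matrix is summarized by a small number of sub-cluster centroids found with K-Means. The NumPy sketch below is illustrative only, not the authors' implementation; the helper name, the Lloyd-style iteration count, and the fixed seed are our own assumptions:

```python
import numpy as np

def subcluster_centroids(Ai, p, n_iter=20, seed=0):
    """Summarize cluster matrix Ai (columns = points) by the centroids
    of p sub-clusters found with basic Lloyd-style K-Means."""
    rng = np.random.default_rng(seed)
    pts = Ai.T                                   # n_i x N, rows are points
    centers = pts[rng.choice(len(pts), p, replace=False)]
    for _ in range(n_iter):
        # distance from every point to every current center
        d = np.linalg.norm(pts[:, None, :] - centers[None, :, :], axis=2)
        assign = d.argmin(axis=1)
        for l in range(p):
            if np.any(assign == l):              # keep old center if empty
                centers[l] = pts[assign == l].mean(axis=0)
    return centers.T                             # N x p centroid matrix
```

Replacing each cluster's columns by such centroids is what shrinks the matrices handed to the GSVD in the approximation algorithm.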
We compare our proposed algorithms with other dimension reduction algorithms in terms of classifi-
cation accuracy, where the K-Nearest neighbors (K-NN) method [7] based on the Euclidean distance is
used for classification. We use 10-fold cross-validation for estimating the misclassification rate. In 10-fold
cross-validation, we divide the data into ten subsets of (approximately) equal size. Then we do the training
and testing ten times, each time leaving out one of the subsets from training, and using only the omitted
subset for testing. The misclassification rate is the average from the ten runs. The misclassification rates
on different data sets for all the dimension reduction algorithms discussed in this paper are reported for
comparison. The results in Section V show that our exact and approximation algorithms are comparable
with the other dimension reduction algorithms discussed in this paper. More interestingly, the results
produced by the approximation algorithm are close to those produced by the exact algorithm, while the approximation algorithm deals with matrices of much smaller sizes than those in the exact algorithm, and hence has lower computational complexity.
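The evaluation protocol just described, K-NN classification scored by 10-fold cross-validation, can be sketched in a few lines. The following is an illustrative NumPy reimplementation, not the authors' code; the function names, the random shuffling, and the default $k = 1$ are our own assumptions:

```python
import numpy as np

def knn_predict(X_train, y_train, X_test, k=1):
    """Label each test row by majority vote among its k nearest
    training rows under the Euclidean distance."""
    preds = []
    for x in X_test:
        dists = np.linalg.norm(X_train - x, axis=1)
        nearest = y_train[np.argsort(dists)[:k]]
        values, counts = np.unique(nearest, return_counts=True)
        preds.append(values[np.argmax(counts)])
    return np.array(preds)

def cv_misclassification_rate(X, y, n_folds=10, k=1, seed=0):
    """10-fold cross-validation: train on nine subsets, test on the
    omitted one, and average the error over the ten rounds."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    errors = []
    for fold in np.array_split(idx, n_folds):
        mask = np.ones(len(y), dtype=bool)
        mask[fold] = False
        pred = knn_predict(X[mask], y[mask], X[fold], k)
        errors.append(np.mean(pred != y[fold]))
    return float(np.mean(errors))
```

Ties in the majority vote are broken arbitrarily here, since the paper does not specify a tie-breaking rule.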
The main contributions of this paper include: (1) a generalization of classical discriminant analysis to small sample size data using an optimization criterion, where the nonsingularity of the scatter matrices is not required; (2) a mathematical derivation of the solution for the proposed optimization criterion; (3) a theoretical justification of previously proposed pseudo-inverse based LDA, by showing the equivalence between the exact algorithm and the Pseudo Fisher Linear Discriminant Analysis (PFLDA); (4) an efficient approximation algorithm for the proposed optimization criterion; and (5) detailed experimental results for our exact and approximation algorithms and comparisons with competing algorithms.
The rest of the paper is organized as follows: Classical discriminant analysis is reviewed in Section II. A
generalization of the classical discriminant analysis using the proposed criterion is presented in Section III.
An efficient approximation algorithm is presented in Section IV. Experimental results are presented in
Section V and concluding discussions are presented in Section VI.
For convenience, we present in Table I the important notations used in the rest of the paper.
TABLE I
IMPORTANT NOTATIONS USED IN THE PAPER
Notation Description Notation Description
$n$: number of training data points; $N$: dimension of the training data
$k$: number of clusters; $A$: data matrix
$A_i$: data matrix of the $i$-th cluster; $n_i$: size of the $i$-th cluster
$c_i$: centroid of the $i$-th cluster; $c$: global centroid of the training set
$S_b$: between-cluster scatter matrix; $S_w$: within-cluster scatter matrix
$S_t$: total scatter matrix; $G$: dimension reduction transformation matrix
$p_i$: number of sub-clusters of the $i$-th cluster in K-Means; $p$: total number of sub-clusters
$\mu$: perturbation in regularized LDA; $K$: number of nearest neighbors in K-NN
$d$: reduced dimension by LSI; $q^*$: reduced dimension by the exact algorithm
II. CLASSICAL DISCRIMINANT ANALYSIS
Given a data matrix $A \in \mathbb{R}^{N \times n}$, we consider finding a linear transformation $G^T \in \mathbb{R}^{\ell \times N}$ that maps each column $a_i$, for $1 \le i \le n$, of $A$ in the $N$-dimensional space to a vector $y_i$ in the $\ell$-dimensional space: $G^T: a_i \in \mathbb{R}^{N} \rightarrow y_i \in \mathbb{R}^{\ell}$. Assume the original data is already clustered. The goal here is to find the transformation $G^T$ such that the cluster structure of the original high-dimensional space is preserved in the reduced-dimensional space. Let the document matrix $A$ be partitioned into $k$ clusters as $A = [A_1, \ldots, A_k]$, where $A_i \in \mathbb{R}^{N \times n_i}$ and $\sum_{i=1}^{k} n_i = n$. Let $N_i$ be the set of column indices that belong to the $i$-th cluster, i.e., $a_j$, for $j \in N_i$, belongs to the $i$-th cluster.
In general, if each cluster is tightly grouped, but well separated from the other clusters, the quality of the clustering is considered to be high. In discriminant analysis [10], three scatter matrices, called the within-cluster, between-cluster, and total scatter matrices, are defined to quantify the quality of the clustering, as follows:
$$S_w = \sum_{i=1}^{k} \sum_{j \in N_i} (a_j - c_i)(a_j - c_i)^T,$$

$$S_b = \sum_{i=1}^{k} \sum_{j \in N_i} (c_i - c)(c_i - c)^T = \sum_{i=1}^{k} n_i\, (c_i - c)(c_i - c)^T,$$

$$S_t = \sum_{j=1}^{n} (a_j - c)(a_j - c)^T = S_b + S_w, \qquad (1)$$

where the centroid $c_i$ of the $i$-th cluster is defined as $c_i = \frac{1}{n_i} A_i e_i$, with $e_i = (1, 1, \ldots, 1)^T \in \mathbb{R}^{n_i}$, and the global centroid $c$ is defined as $c = \frac{1}{n} A e$, with $e = (1, 1, \ldots, 1)^T \in \mathbb{R}^{n}$.

Define the matrices

$$H_w = [A_1 - c_1 e_1^T, \ \ldots, \ A_k - c_k e_k^T], \qquad H_b = [\sqrt{n_1}\,(c_1 - c), \ \ldots, \ \sqrt{n_k}\,(c_k - c)]. \qquad (2)$$
Then the scatter matrices $S_w$ and $S_b$ can be expressed as $S_w = H_w H_w^T$ and $S_b = H_b H_b^T$. The traces of the two scatter matrices can be computed as follows:

$$\operatorname{trace}(S_w) = \sum_{i=1}^{k} \sum_{j \in N_i} (a_j - c_i)^T (a_j - c_i) = \sum_{i=1}^{k} \sum_{j \in N_i} \| a_j - c_i \|^2,$$

$$\operatorname{trace}(S_b) = \sum_{i=1}^{k} n_i\, (c_i - c)^T (c_i - c) = \sum_{i=1}^{k} n_i\, \| c_i - c \|^2.$$

Hence, $\operatorname{trace}(S_w)$ measures the closeness of the vectors within the clusters, while $\operatorname{trace}(S_b)$ measures the separation between clusters.
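As an illustration, the scatter matrices of Eq. (1) can be computed directly from a data matrix whose columns are the data points. The following NumPy sketch uses our own function name and a label vector in place of the index sets $N_i$:

```python
import numpy as np

def scatter_matrices(A, labels):
    """Scatter matrices of Eq. (1) for a data matrix A whose columns
    are the data points; labels[j] is the cluster of column j."""
    N, n = A.shape
    c = A.mean(axis=1, keepdims=True)            # global centroid
    S_w = np.zeros((N, N))
    S_b = np.zeros((N, N))
    for i in np.unique(labels):
        Ai = A[:, labels == i]                   # i-th cluster
        ci = Ai.mean(axis=1, keepdims=True)      # cluster centroid
        D = Ai - ci
        S_w += D @ D.T                           # within-cluster part
        S_b += Ai.shape[1] * (ci - c) @ (ci - c).T
    return S_w, S_b, S_w + S_b                   # S_t = S_w + S_b
```

By construction the returned total scatter satisfies $S_t = S_b + S_w$, so its trace equals $\operatorname{trace}(S_w) + \operatorname{trace}(S_b)$.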
In the lower-dimensional space resulting from the linear transformation $G^T$, the within-cluster and between-cluster scatter matrices become

$$S_w^L = (G^T H_w)(G^T H_w)^T = G^T S_w G, \qquad S_b^L = (G^T H_b)(G^T H_b)^T = G^T S_b G. \qquad (3)$$

An optimal transformation $G^T$ would maximize $\operatorname{trace}(S_b^L)$ and minimize $\operatorname{trace}(S_w^L)$. Common optimizations in classical discriminant analysis include

$$\max_G \ \operatorname{trace}\big((S_w^L)^{-1} S_b^L\big) \qquad \text{and} \qquad \min_G \ \operatorname{trace}\big((S_b^L)^{-1} S_w^L\big). \qquad (4)$$
The optimization problem in Eq. (4) is equivalent to finding all the eigenvectors $x$ that satisfy $S_b x = \lambda S_w x$, for $\lambda \neq 0$. The solution can be obtained by solving an eigenvalue problem on the matrix $S_w^{-1} S_b$, if $S_w$ is nonsingular, or on $S_b^{-1} S_w$, if $S_b$ is nonsingular. Equivalently, we can compute the optimal solution by solving an eigenvalue problem on the matrix $S_t^{-1} S_b$ [10]. There are at most $k-1$ eigenvectors corresponding to nonzero eigenvalues, since the rank of the matrix $S_b$ is bounded above by $k-1$. Therefore, the reduced dimension obtained by classical LDA is at most $k-1$. A stable way to solve this eigenvalue problem is to apply the SVD to the scatter matrices. Details can be found in [28].
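When $S_t$ is nonsingular, the eigenproblem on $S_t^{-1} S_b$ just described can be solved with a dense eigensolver. The following is a minimal NumPy sketch under our own naming; as the text notes, a practical implementation would prefer the SVD-based route for stability:

```python
import numpy as np

def classical_lda(A, labels):
    """Classical LDA: eigenvectors of inv(S_t) @ S_b with the largest
    eigenvalues; at most k-1 are nonzero. Requires S_t nonsingular,
    which in practice means n > N."""
    N, n = A.shape
    classes = np.unique(labels)
    c = A.mean(axis=1, keepdims=True)
    S_b = np.zeros((N, N))
    for i in classes:
        Ai = A[:, labels == i]
        ci = Ai.mean(axis=1, keepdims=True)
        S_b += Ai.shape[1] * (ci - c) @ (ci - c).T
    S_t = (A - c) @ (A - c).T                    # total scatter
    evals, evecs = np.linalg.eig(np.linalg.inv(S_t) @ S_b)
    order = np.argsort(-evals.real)
    return evecs[:, order[:len(classes) - 1]].real   # N x (k-1)
```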
Note that a limitation of classical discriminant analysis in many applications involving small sample data, including text processing in information retrieval, is that the matrix $S_t$ must be nonsingular. Several methods, including two-stage LDA, regularized LDA, and pseudo-inverse based LDA, have been proposed in the past to deal with this singularity problem, as follows.
A common way to deal with the singularity problem is to apply an intermediate dimension reduction
stage to reduce the dimension of the original data before classical LDA is applied. Several choices of the
intermediate dimension reduction algorithms include LSI [14], PCA [3], [18], [28], Partial Least Squares
(PLS) [18], etc. The application area of this paper is text mining, hence LSI is a natural choice for
the intermediate dimension reduction. In this two-stage LSI+LDA algorithm, the discriminant stage is
preceded by a dimension reduction stage using LSI. A limitation of this approach is that the optimal
value of the reduced dimension for LSI is difficult to determine [14].
A simple way to deal with the singularity of $S_t$ is to add some constant value to the diagonal elements of $S_t$, as $S_t + \mu I_N$, for some $\mu > 0$, where $I_N$ is the $N \times N$ identity matrix [9]. It is easy to check that $S_t + \mu I_N$ is positive definite, hence nonsingular. This approach is called regularized Linear Discriminant Analysis (RLDA). A limitation of RLDA is that the optimal value of the parameter $\mu$ is difficult to determine. It is evident that when $\mu \rightarrow \infty$, we lose the information on $S_t$, while very small values of $\mu$ may not be sufficiently effective [26]. Cross-validation is commonly applied for estimating the optimal $\mu$. More recent studies on RLDA can be found in [4], [18], [26], [33].
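A minimal sketch of the RLDA remedy, given precomputed scatter matrices: the perturbation $\mu I$ makes the linear system solvable even when $S_t$ is singular. The helper name and the use of np.linalg.solve are our own choices:

```python
import numpy as np

def rlda_directions(S_b, S_t, mu, k):
    """RLDA: perturb S_t to S_t + mu*I so it is nonsingular, then take
    the top (k-1) eigenvectors of (S_t + mu*I)^{-1} S_b."""
    N = S_t.shape[0]
    M = np.linalg.solve(S_t + mu * np.eye(N), S_b)
    evals, evecs = np.linalg.eig(M)
    order = np.argsort(-evals.real)
    return evecs[:, order[:k - 1]].real
```

The sketch runs even for a rank-deficient $S_t$, which is exactly the undersampled situation the regularization is meant to handle.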
Another common way to deal with the singularity problem is to apply the pseudo-inverse of the scatter matrices instead of the inverse. The pseudo-inverse of a matrix is defined as follows.

Definition 2.1: The pseudo-inverse of a matrix $A$, denoted by $A^+$, is the unique matrix satisfying the following four conditions:

$$A A^+ A = A, \quad A^+ A A^+ = A^+, \quad (A A^+)^T = A A^+, \quad (A^+ A)^T = A^+ A.$$

The pseudo-inverse $A^+$ is commonly computed via the SVD [11] as follows. Let

$$A = U \begin{pmatrix} \Sigma & 0 \\ 0 & 0 \end{pmatrix} V^T$$

be the SVD of $A$, where $U$ and $V$ are orthogonal and $\Sigma$ is diagonal with positive diagonal entries. Then the pseudo-inverse of $A$, denoted by $A^+$, can be computed as

$$A^+ = V \begin{pmatrix} \Sigma^{-1} & 0 \\ 0 & 0 \end{pmatrix} U^T.$$

The Pseudo Fisher Linear Discriminant Analysis (PFLDA) [8], [10], [23], [26], [27] is equivalent to finding all the eigenvectors $x$ satisfying $S_t^+ S_b x = \lambda x$, for $\lambda \neq 0$. An advantage of PFLDA over the other two methods is that no parameter is involved in PFLDA.
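The SVD-based computation of the pseudo-inverse described above is straightforward to realize and to check against the four conditions of Definition 2.1. The sketch below is our own helper, with an assumed tolerance of 1e-10 for deciding which singular values count as nonzero:

```python
import numpy as np

def pinv_svd(A, tol=1e-10):
    """Pseudo-inverse via the SVD: invert only the singular values
    above tol, leave the rest at zero, and transpose the factors."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    s_inv = np.array([1.0 / x if x > tol else 0.0 for x in s])
    return Vt.T @ np.diag(s_inv) @ U.T
```

For a rank-deficient matrix this agrees with NumPy's built-in np.linalg.pinv and satisfies all four conditions of Definition 2.1 up to rounding.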
III. GENERALIZATION OF DISCRIMINANT ANALYSIS
Classical discriminant analysis expresses its solution by solving an eigenvalue problem when $S_t$ is nonsingular. However, for a general document matrix $A$, the number of documents ($n$) may be smaller than the dimension ($N$), in which case the total scatter matrix $S_t$ is singular. This implies that both $S_b$ and $S_w$ are singular. In this paper, we propose the criterion $Q_1$ below, where the nonsingularity of $S_t$ is not required. The criterion aims to minimize the within-cluster distance, $\operatorname{trace}(S_w^L)$, and maximize the between-cluster distance, $\operatorname{trace}(S_b^L)$.
The criterion $Q_1$ is a natural extension of the classical one in Eq. (4), in that the inverse of a matrix is replaced by the pseudo-inverse [11]. While the inverse of a matrix may not exist, the pseudo-inverse of any matrix is well defined. Moreover, when the matrix is invertible, its pseudo-inverse coincides with its inverse.

The criterion $Q_1$ that we propose is:

$$Q_1(G) = \operatorname{trace}\big((S_b^L)^+ S_w^L\big). \qquad (5)$$

Hence the trace optimization is to find an optimal transformation matrix $G^T$ such that $Q_1(G)$ is minimized under a certain constraint, defined in more detail in Section III-C.
We show how to solve the above minimization problem in Section III-C. The main technique applied
there is the GSVD, briefly introduced in Section III-A below. The constraints on the optimization problem
are based on the observations in Section III-B.
A. Generalized singular value decomposition
The Generalized Singular Value Decomposition (GSVD) was introduced in [21], [31]. A simple algo-
rithm to compute GSVD can be found in [13], where the algorithm is based on [21].
The GSVD of the matrix pair $(H_b^T, H_w^T)$ gives orthogonal matrices $U \in \mathbb{R}^{k \times k}$ and $V \in \mathbb{R}^{n \times n}$, and a nonsingular matrix $X \in \mathbb{R}^{N \times N}$, such that

$$U^T H_b^T X = (\Sigma_b, \ 0), \qquad V^T H_w^T X = (\Sigma_w, \ 0), \qquad (6)$$

where

$$\Sigma_b = \begin{pmatrix} I_b & & \\ & D_b & \\ & & O_b \end{pmatrix} \in \mathbb{R}^{k \times t}, \qquad \Sigma_w = \begin{pmatrix} O_w & & \\ & D_w & \\ & & I_w \end{pmatrix} \in \mathbb{R}^{n \times t}.$$

Here $t = \operatorname{rank}(K)$ for $K = \begin{pmatrix} H_b^T \\ H_w^T \end{pmatrix}$, $I_b \in \mathbb{R}^{r \times r}$ and $I_w \in \mathbb{R}^{(t-r-s) \times (t-r-s)}$ are identity matrices, $O_b \in \mathbb{R}^{(k-r-s) \times (t-r-s)}$ and $O_w \in \mathbb{R}^{(n-t+r) \times r}$ are zero matrices with $r = \operatorname{rank}(K) - \operatorname{rank}(H_w^T)$ and $s = \operatorname{rank}(H_b^T) + \operatorname{rank}(H_w^T) - \operatorname{rank}(K)$, and $D_b = \operatorname{diag}(\alpha_{r+1}, \ldots, \alpha_{r+s})$ and $D_w = \operatorname{diag}(\beta_{r+1}, \ldots, \beta_{r+s})$ are diagonal matrices satisfying

$$1 = \alpha_1 = \cdots = \alpha_r > \alpha_{r+1} \ge \cdots \ge \alpha_{r+s} > 0, \qquad 0 < \beta_{r+1} \le \cdots \le \beta_{r+s} < 1,$$

and $\alpha_i^2 + \beta_i^2 = 1$ for $i = r+1, \ldots, r+s$.

From Eq. (6), we have

$$H_b^T X = U (\Sigma_b, \ 0), \qquad H_w^T X = V (\Sigma_w, \ 0),$$

hence

$$X^T S_b X = \begin{pmatrix} \Sigma_b^T \Sigma_b & 0 \\ 0 & 0 \end{pmatrix}, \qquad X^T S_w X = \begin{pmatrix} \Sigma_w^T \Sigma_w & 0 \\ 0 & 0 \end{pmatrix}. \qquad (7)$$

Therefore

$$S_b^L = G^T X^{-T} \big( X^T S_b X \big) X^{-1} G = \tilde{G} \begin{pmatrix} \Sigma_b^T \Sigma_b & 0 \\ 0 & 0 \end{pmatrix} \tilde{G}^T, \qquad S_w^L = \tilde{G} \begin{pmatrix} \Sigma_w^T \Sigma_w & 0 \\ 0 & 0 \end{pmatrix} \tilde{G}^T, \qquad (8)$$

where

$$\tilde{G} = (X^{-1} G)^T. \qquad (9)$$

We use the above representations of $S_b^L$ and $S_w^L$ in Section III-C for the minimization of $Q_1$. It follows from Eq. (7) that

$$X^T S_t X = X^T (S_b + S_w) X = \begin{pmatrix} I_t & 0 \\ 0 & 0 \end{pmatrix}, \qquad (10)$$

which is used in the following Lemma 3.5.
B. Linear subspace spanned by the centroids
Let

$$\mathcal{C} = \operatorname{span}\{c_1, \ldots, c_k\}$$

be the subspace of $\mathbb{R}^{N}$ spanned by the $k$ centroids of the data vectors. In the lower-dimensional space obtained by $G^T$, the linear subspace spanned by the $k$ centroids is

$$\mathcal{C}^L = \operatorname{span}\{c_1^L, \ldots, c_k^L\},$$

where $c_i^L = G^T c_i$, for $i = 1, \ldots, k$.

The following lemma studies the relationship between the dimension of the subspace $\mathcal{C}$ and the rank of the matrix $H_b$, as well as the corresponding relationship in the reduced space.

Lemma 3.1: Let $\mathcal{C}$ and $\mathcal{C}^L$ be defined as above and let $H_b$ be defined as in Eq. (2). Let $\dim(\mathcal{C})$ and $\dim(\mathcal{C}^L)$ denote the dimensions of the subspaces $\mathcal{C}$ and $\mathcal{C}^L$, respectively. Then

1. $\dim(\mathcal{C}) = \operatorname{rank}(H_b) + 1$, or $\operatorname{rank}(H_b)$, and $\dim(\mathcal{C}^L) = \operatorname{rank}(G^T H_b) + 1$, or $\operatorname{rank}(G^T H_b)$;
2. If $\dim(\mathcal{C}) = \operatorname{rank}(H_b)$, then $\dim(\mathcal{C}^L) = \operatorname{rank}(G^T H_b)$. If $\dim(\mathcal{C}^L) = \operatorname{rank}(G^T H_b) + 1$, then $\dim(\mathcal{C}) = \operatorname{rank}(H_b) + 1$.
Proof: Consider the matrix

$$\tilde{H} = [\,c_1 - c, \ \ldots, \ c_k - c, \ c\,],$$

where the global centroid $c$ can be rewritten as $c = \sum_{i=1}^{k} \frac{n_i}{n} c_i$.

It is easy to check that $\operatorname{rank}(\tilde{H}) = \dim(\mathcal{C})$. On the other hand, the rank of the matrix $\tilde{H}$ equals $\operatorname{rank}(H_b)$, if $c$ lies in the space spanned by $\{c_1 - c, \ldots, c_k - c\}$, and equals $\operatorname{rank}(H_b) + 1$, otherwise. Hence $\dim(\mathcal{C}) = \operatorname{rank}(H_b) + 1$ or $\operatorname{rank}(H_b)$. Similarly, we can prove that $\dim(\mathcal{C}^L) = \operatorname{rank}(G^T H_b) + 1$ or $\operatorname{rank}(G^T H_b)$. This completes the proof of part 1.

If $\dim(\mathcal{C}) = \operatorname{rank}(H_b)$, then by the above argument, $c = \sum_{i=1}^{k} \gamma_i (c_i - c)$ for some coefficients $\gamma_i$. It follows that $c^L = G^T c = \sum_{i=1}^{k} \gamma_i (c_i^L - c^L)$, where $c^L$ is the global centroid in the lower-dimensional space. Again by the above argument, $\dim(\mathcal{C}^L) = \operatorname{rank}(G^T H_b)$. Similarly, if $\dim(\mathcal{C}^L) = \operatorname{rank}(G^T H_b) + 1$, we can show that $\dim(\mathcal{C}) = \operatorname{rank}(H_b) + 1$.
C. Generalized discriminant analysis using the $Q_1$ measure

We start with a more general optimization problem:

$$\min_G \ Q_1(G) \quad \text{subject to} \quad \operatorname{rank}(G^T H_b) = q, \qquad (11)$$

for some integer $q > 0$. The optimization in Eq. (11) depends on the value of $q$. The optimal transformation $G^T$ we are looking for is a special case of the above formulation, where we set $q = \operatorname{rank}(H_b)$. This choice of $q$ guarantees that the dimension of the linear subspace spanned by the centroids in the original high-dimensional space and the dimension of the corresponding subspace in the transformed lower-dimensional space are kept close to each other, as shown in Proposition 3.1 below.
To solve the minimization problem in Eq. (11), we need the following three lemmas, where the proofs
of the first two are straightforward from the definition of the pseudo-inverse [11].
Lemma 3.2: For any matrix $A \in \mathbb{R}^{N \times n}$, we have $\operatorname{trace}(A A^+) = \operatorname{rank}(A)$.

Lemma 3.3: For any matrix $A \in \mathbb{R}^{N \times n}$, we have $(A A^T)^+ = (A^T)^+ A^+$.
The following lemma is critical for our main result:
Lemma 3.4: Let $\Sigma = \operatorname{diag}(\sigma_1, \ldots, \sigma_g)$ be any diagonal matrix with $\sigma_1 \ge \cdots \ge \sigma_g > 0$. Then for any matrix $W \in \mathbb{R}^{\ell \times g}$ with $\operatorname{rank}(W) = q$, we have

$$\operatorname{trace}\big((W \Sigma W^T)^+ (W W^T)\big) \ \ge \ \sum_{i=1}^{q} \frac{1}{\sigma_i}.$$

Furthermore, the equality holds if and only if

$$W = P \begin{pmatrix} \Delta_1 \hat{Q}^T \Delta_2 & 0 \\ 0 & 0 \end{pmatrix},$$

for some orthogonal matrix $P \in \mathbb{R}^{\ell \times \ell}$ and matrix $\hat{Q} \in \mathbb{R}^{q \times q}$, where $\hat{Q}$ is orthogonal and $\Delta_1, \Delta_2 \in \mathbb{R}^{q \times q}$ are diagonal matrices with positive diagonal entries.
Proof: It is easy to check that

$$(W \Sigma W^T)^+ = (W_1 W_1^T)^+ = (W_1^T)^+ W_1^+, \qquad (12)$$

where $W_1 = W \Sigma^{1/2}$ and the last equality follows from Lemma 3.3. Note also that $W W^T = W_1 \Sigma^{-1} W_1^T$.

It is clear that $\operatorname{rank}(W_1) = \operatorname{rank}(W) = q$, since $\Sigma^{1/2}$ is nonsingular. Let $W_1 = P \begin{pmatrix} \Sigma_1 & 0 \\ 0 & 0 \end{pmatrix} Q^T$ be the SVD of $W_1$. Then by the definition of the pseudo-inverse, $W_1^+ = Q \begin{pmatrix} \Sigma_1^{-1} & 0 \\ 0 & 0 \end{pmatrix} P^T$. Using the fact that $\operatorname{trace}(AB) = \operatorname{trace}(BA)$, for any $A$ and $B$, together with Eq. (12), we have

$$\operatorname{trace}\big((W \Sigma W^T)^+ (W W^T)\big) = \operatorname{trace}\big((W_1 W_1^T)^+ W_1 \Sigma^{-1} W_1^T\big) = \operatorname{trace}\big(W_1^T (W_1 W_1^T)^+ W_1 \, \Sigma^{-1}\big) = \operatorname{trace}\big(Q_1 Q_1^T \Sigma^{-1}\big) \ \ge \ \sum_{i=1}^{q} \frac{1}{\sigma_i},$$

where $Q_1$ contains the first $q$ columns of the orthogonal matrix $Q$ (so that $W_1^T (W_1 W_1^T)^+ W_1 = Q_1 Q_1^T$), and the last inequality follows since the diagonal elements of the diagonal matrix $\Sigma^{-1}$ are nondecreasing. Moreover, the minimum of $\operatorname{trace}(Q_1 Q_1^T \Sigma^{-1})$ is attained if and only if

$$Q_1 = \begin{pmatrix} \hat{Q} \\ 0 \end{pmatrix}$$

for some orthogonal matrix $\hat{Q} \in \mathbb{R}^{q \times q}$. Hence

$$W = W_1 \Sigma^{-1/2} = P \begin{pmatrix} \Delta_1 \hat{Q}^T \Delta_2 & 0 \\ 0 & 0 \end{pmatrix},$$

where $\Delta_1 = \Sigma_1$, $\Delta_2 = \Sigma_q^{-1/2}$, and $\Sigma_q$ is the $q$-th principal sub-matrix of $\Sigma$.
We are now ready to present our main result for this section. To solve the minimization problem in Eq. (11), we first give a lower bound on the objective function $Q_1$ in Theorem 3.1. A sufficient condition on $G^T$, under which the lower bound computed in Theorem 3.1 is attained, is presented in Corollary 3.1. A simple solution is then presented in Corollary 3.2.

The optimal solution for the generalized discriminant analysis presented in this paper is a special case of the solution in Corollary 3.2, where we set $q = \operatorname{rank}(H_b)$.
Theorem 3.1: Assume the transformation matrix $G^T \in \mathbb{R}^{\ell \times N}$ satisfies $\operatorname{rank}(G^T H_b) = q$, for some integer $q > 0$. Then the following inequality holds:

$$Q_1 = \operatorname{trace}\big((S_b^L)^+ S_w^L\big) \ \ge \ \begin{cases} \displaystyle\sum_{i=r+1}^{q} \beta_i^2 / \alpha_i^2, & \text{if } q \ge r+1, \\[4pt] 0, & \text{if } q < r+1, \end{cases} \qquad (13)$$

where $S_b^L = G^T S_b G$ and $S_w^L = G^T S_w G$ are defined in Eq. (3), and the $\alpha_i$, $\beta_i$ are specified by the GSVD of $(H_b^T, H_w^T)$ in Eq. (6).
Proof: First consider the easy case when $q < r+1$. Since both $(S_b^L)^+$ and $S_w^L$ are positive semi-definite, $\operatorname{trace}\big((S_b^L)^+ S_w^L\big) \ge 0$. Next consider the case when $q \ge r+1$.

Recall from Eq. (8) that, by the GSVD,

$$S_b^L = \tilde{G} \begin{pmatrix} \Sigma_b^T \Sigma_b & 0 \\ 0 & 0 \end{pmatrix} \tilde{G}^T, \qquad S_w^L = \tilde{G} \begin{pmatrix} \Sigma_w^T \Sigma_w & 0 \\ 0 & 0 \end{pmatrix} \tilde{G}^T,$$

where $\tilde{G}$ is defined in Eq. (9). Partition $\tilde{G}$ into

$$\tilde{G} = [\tilde{G}_1, \ \tilde{G}_2, \ \tilde{G}_3, \ \tilde{G}_4],$$

such that $\tilde{G}_1 \in \mathbb{R}^{\ell \times r}$, $\tilde{G}_2 \in \mathbb{R}^{\ell \times s}$, $\tilde{G}_3 \in \mathbb{R}^{\ell \times (t-r-s)}$, and $\tilde{G}_4 \in \mathbb{R}^{\ell \times (N-t)}$.

Since $\alpha_i^2 + \beta_i^2 = 1$, for $i = 1, \ldots, t$, we have $\Sigma_b^T \Sigma_b + \Sigma_w^T \Sigma_w = I_t$, where $I_t \in \mathbb{R}^{t \times t}$ is an identity matrix. Define $G_{12} = [\tilde{G}_1, \tilde{G}_2]$. We have

$$S_w^L + S_b^L = G_{12} G_{12}^T + \tilde{G}_3 \tilde{G}_3^T.$$

By the property of the pseudo-inverse in Lemma 3.2,

$$\operatorname{trace}\big((S_b^L)^+ S_b^L\big) = \operatorname{rank}(S_b^L) = \operatorname{rank}(G^T S_b G) = \operatorname{rank}(G^T H_b) = q. \qquad (14)$$

It follows that

$$\operatorname{trace}\big((S_b^L)^+ (S_w^L + S_b^L)\big) = \operatorname{trace}\big((G_{12} \Sigma_\alpha G_{12}^T)^+ (G_{12} G_{12}^T + \tilde{G}_3 \tilde{G}_3^T)\big) \ \ge \ \operatorname{trace}\big((G_{12} \Sigma_\alpha G_{12}^T)^+ G_{12} G_{12}^T\big) \ \ge \ r + \sum_{i=r+1}^{q} \frac{1}{\alpha_i^2}, \qquad (15)$$

where $\Sigma_\alpha = \operatorname{diag}(\alpha_1^2, \ldots, \alpha_{r+s}^2)$, so that $S_b^L = G_{12} \Sigma_\alpha G_{12}^T$. The first inequality follows since $\tilde{G}_3 \tilde{G}_3^T$ is positive semi-definite, and the equality holds if $\tilde{G}_3 = 0$. The second inequality follows from Eq. (14) and Lemma 3.4, and the equality holds if and only if

$$G_{12} = P \begin{pmatrix} \Delta_1 \hat{Q}^T \Delta_2 & 0 \\ 0 & 0 \end{pmatrix} \in \mathbb{R}^{\ell \times (r+s)}, \qquad (16)$$

for some orthogonal matrix $P \in \mathbb{R}^{\ell \times \ell}$ and some orthogonal matrix $\hat{Q} \in \mathbb{R}^{q \times q}$, where $\Delta_1, \Delta_2 \in \mathbb{R}^{q \times q}$ are diagonal matrices with positive diagonal entries. In particular, it holds if $\hat{Q}$, $\Delta_1$, and $\Delta_2$ are all identity matrices.

By Eq. (14) and Eq. (15), we have

$$Q_1 = \operatorname{trace}\big((S_b^L)^+ (S_w^L + S_b^L)\big) - q \ \ge \ r + \sum_{i=r+1}^{q} \frac{1}{\alpha_i^2} - q = \sum_{i=r+1}^{q} \frac{\beta_i^2}{\alpha_i^2}.$$
Theorem 3.1 gives a lower bound on $Q_1$, when $q$ is fixed. From the arguments above, all the inequalities become equalities if $G_{12}$ has the form in Eq. (16) and $\tilde{G}_3 = 0$. A simple choice is to set the matrices $\hat{Q}$, $\Delta_1$, and $\Delta_2$ in Eq. (16) to be identity matrices, which is used in our implementation later. The result is summarized in the following corollary.

Corollary 3.1: Let $G^T$ and $\tilde{G}$ be defined as in Theorem 3.1 and let $q \ge r+1$. Then the equality in Eq. (13) holds, if the partition $\tilde{G} = [\tilde{G}_1, \tilde{G}_2, \tilde{G}_3, \tilde{G}_4]$ satisfies

$$G_{12} = P \begin{pmatrix} \Delta_1 \hat{Q}^T \Delta_2 & 0 \\ 0 & 0 \end{pmatrix} \quad \text{and} \quad \tilde{G}_3 = 0,$$

where $G_{12} = [\tilde{G}_1, \tilde{G}_2]$.
Theorem 3.1 does not mention the connection between $\ell$, the row dimension of the transformation matrix $G^T$, and $q$. We can choose a transformation matrix $G^T$ that has large row dimension $\ell$ and still satisfies the condition stated in Corollary 3.1. However, we are more interested in a lower-dimensional representation of the original data, while keeping the same information from the original data.

The following corollary says that the smallest possible value for $\ell$ is $q$, and, more importantly, that we can find a transformation $G^T$ which has row dimension $q$ and satisfies the condition stated in Corollary 3.1.

Corollary 3.2: For every $q \le r + s$, there exists a transformation $G^T \in \mathbb{R}^{q \times N}$, such that the equality in Eq. (13) holds, i.e., the minimum value for $Q_1$ is attained. Furthermore, for any transformation $(G')^T \in \mathbb{R}^{\ell \times N}$, such that the assumption in Theorem 3.1 holds, we have $\ell \ge q$.
Proof: Construct $G^T \in \mathbb{R}^{q \times N}$, such that

$$\tilde{G} = (X^{-1} G)^T = [\tilde{G}_1, \ \tilde{G}_2, \ \tilde{G}_3, \ \tilde{G}_4]$$

satisfies $G_{12} = (I_q, \ 0)$, $\tilde{G}_3 = 0$, and $\tilde{G}_4 = 0$, where $I_q \in \mathbb{R}^{q \times q}$ is an identity matrix. Hence $G^T = X_q^T$, where $X_q \in \mathbb{R}^{N \times q}$ contains the first $q$ columns of the matrix $X$. By Corollary 3.1, the equality in Eq. (13) holds under the above transformation $G^T$. This completes the proof of the first part.

For any transformation $(G')^T \in \mathbb{R}^{\ell \times N}$, such that $\operatorname{rank}((G')^T H_b) = q$, it is clear that $q \le \operatorname{rank}((G')^T) \le \ell$. Hence $\ell \ge q$.
Theorem 3.1 shows that the minimum value of the objective function, � � , is dependent on the rank
of the matrix & ' * . As shown in Lemma 3.1, the rank of the matrix*
, which is � % � by GSVD,
denotes the degree of linear independence of the�
centroids in the original data, while the rank,�, of
the matrix & ' * implies the degree of linear independence of the�
centroids in the reduced space by
Lemma 3.1. By Corollary 3.2, if we fix the degree of linear independence of the centroids in the reduced
space, i.e. if rank� & ' * � ���
is fixed, then we can always find a transformation matrix & ' IR � # " with
row dimension�
such that the minimum value for � � is obtained.
Next we consider the case when q varies. By Theorem 3.1, the minimum value of the objective function F1 is smaller for smaller q. However, as the value of q decreases (note that q is never larger than rank(H_b) ≤ k − 1), the degree of linear independence of the k centroids in the reduced space falls further below that in the original high-dimensional space, which may cause a loss of information about the original data. Hence, in our implementation we choose q = rank(H_b), its maximum possible value. One nice property of this choice is stated in the following proposition:
Proposition 3.1: If q = rank(G^T H_b) equals rank(H_b), then δ − 1 ≤ δ^L ≤ δ, where δ and δ^L denote the degrees of linear independence of the k centroids in the original and in the reduced space, respectively.

Proof: The proof follows directly from Lemma 3.1.
Proposition 3.1 implies that choosing q = rank(H_b) keeps the degree of linear independence of the centroids in the reduced space equal to, or one less than, that in the original space. With this choice, the reduced dimension under the transformation G^T is rank(H_b). We use ℓ* = rank(H_b) to denote the optimal reduced dimension for our generalized discriminant analysis (also called the exact algorithm) throughout the rest of the paper. In this case, we can choose the transformation matrix G^T
Algorithm 1: Exact algorithm
1. Form the matrices H_b and H_w as in Eq. (2).
2. Compute the SVD of K = [H_b^T; H_w^T] ∈ IR^{(k+n)×m} as K = P [ R  0 ; 0  0 ] Q^T, where P and Q are orthogonal and R is diagonal and nonsingular.
3. t ← rank(K); ℓ* ← rank(H_b).
4. Compute the SVD of P(1:k, 1:t), the sub-matrix consisting of the first k rows and the first t columns of the matrix P from Step 2, as P(1:k, 1:t) = U Σ_1 W^T, where U and W are orthogonal and Σ_1 is diagonal.
5. X ← Q [ R^{-1}W  0 ; 0  I ].
6. G ← X_{ℓ*}, where X_{ℓ*} contains the first ℓ* columns of the matrix X.
as in Corollary 3.2: G^T = X_{ℓ*}^T, where X_{ℓ*} contains the first ℓ* columns of the matrix X. The pseudo-code for our main algorithm is shown in Algorithm 1, where Lines 2–5 compute the GSVD of the matrix pair (H_b^T, H_w^T) to obtain the matrix X as in Eq. (6).
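As an illustration, the steps of Algorithm 1 can be sketched in NumPy as follows. This is a minimal sketch of our own, not the authors' code: the helper forming H_b and H_w, the rank tolerance, and all variable names are assumptions.

```python
import numpy as np

def scatter_factors(A, labels):
    # Factors H_b, H_w with S_b = H_b H_b^T and S_w = H_w H_w^T (cf. Eq. (2))
    c = A.mean(axis=1, keepdims=True)                  # global centroid
    Hb_cols, Hw_cols = [], []
    for lab in np.unique(labels):
        Ai = A[:, labels == lab]
        ci = Ai.mean(axis=1, keepdims=True)
        Hb_cols.append(np.sqrt(Ai.shape[1]) * (ci - c))
        Hw_cols.append(Ai - ci)
    return np.hstack(Hb_cols), np.hstack(Hw_cols)

def exact_algorithm(A, labels):
    # Algorithm 1: GSVD-based generalized discriminant analysis
    Hb, Hw = scatter_factors(A, labels)
    m, k = A.shape[0], Hb.shape[1]
    K = np.vstack([Hb.T, Hw.T])                        # (k + n) x m
    P, s, QT = np.linalg.svd(K, full_matrices=True)    # K = P [R 0; 0 0] Q^T
    t = int(np.sum(s > 1e-10 * s[0]))                  # t = rank(K)
    ell = np.linalg.matrix_rank(Hb)                    # optimal reduced dimension
    _, _, WT = np.linalg.svd(P[:k, :t])                # SVD of P(1:k, 1:t)
    X = np.empty((m, m))
    X[:, :t] = QT.T[:, :t] @ np.diag(1.0 / s[:t]) @ WT.T   # Q[:, :t] R^{-1} W
    X[:, t:] = QT.T[:, t:]
    return X[:, :ell]                                  # G: first ell columns of X
```

On a toy undersampled data set, the returned G satisfies G^T S_t G = I on the reduced space, consistent with Eq. (10).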
In principle, it is not necessary to choose the reduced dimension to be exactly ℓ*. Experiments show good performance for values of the reduced dimension close to ℓ*. However, the results also show that the performance can be very poor for small values of the reduced dimension, probably due to information loss during dimension reduction.
D. Relation between the exact algorithm and the pseudo-inverse based LDA
As discussed in the Introduction, the Pseudo Fisher Linear Discriminant (PFLDA) method [8], [10], [23], [26], [27] applies the pseudo-inverse to the scatter matrices; its solution can be obtained by solving the eigenvalue problem on the matrix S_t^+ S_b. The exact algorithm proposed in this paper is based on the criterion F1 defined in Eq. (5), which extends the criterion of classical LDA by replacing the inverse with the pseudo-inverse. Both PFLDA and the exact algorithm apply the pseudo-inverse to deal with the singularity of the scatter matrices on undersampled problems. A natural question is: what is the relation between the two methods? We show in this section that the solution from the exact algorithm also solves the eigenvalue problem on the matrix S_t^+ S_b; that is, the two methods are essentially equivalent. The significance of this equivalence is that the criterion proposed in this paper provides a theoretical justification for the pseudo-inverse based LDA, which was used in an ad hoc manner in the past.
The following lemma is essential for establishing the relation between the exact algorithm and PFLDA.

Lemma 3.5: Let X be the matrix obtained by computing the GSVD of the matrix pair (H_b^T, H_w^T) (see Line 5 of Algorithm 1), and denote D = diag(I_t, 0) ∈ IR^{m×m}, where t = rank(S_t) and I_t ∈ IR^{t×t} is the identity matrix. Then S_t^+ = X D X^T, where S_t is the total scatter matrix.
Proof: By Eq. (10), we have X^T S_t X = D, i.e., S_t = X^{-T} D X^{-1}. From Line 5 of Algorithm 1, X has the form

X = Q [ R^{-1}W  0 ; 0  I ],

where Q and W are orthogonal and R is diagonal and nonsingular. Then

S_t = X^{-T} D X^{-1} = Q [ RW  0 ; 0  I ] D [ W^T R  0 ; 0  I ] Q^T = Q [ R^2  0 ; 0  0 ] Q^T.

It follows from the definition of the pseudo-inverse that

S_t^+ = Q [ R^{-2}  0 ; 0  0 ] Q^T.

It is easy to check that

X D X^T = Q [ R^{-1}W  0 ; 0  I ] D [ W^T R^{-1}  0 ; 0  I ] Q^T = Q [ R^{-2}  0 ; 0  0 ] Q^T.

Hence S_t^+ = X D X^T.
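Lemma 3.5 can be checked numerically by building a random X of exactly the form produced by Line 5 of Algorithm 1 and comparing X D X^T against a pseudo-inverse computed directly. A small sketch (all sizes and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
m, t = 8, 5

# Random orthogonal Q, W and a nonsingular diagonal R, as in Line 5 of Algorithm 1
Q, _ = np.linalg.qr(rng.normal(size=(m, m)))
W, _ = np.linalg.qr(rng.normal(size=(t, t)))
R = np.diag(rng.uniform(0.5, 2.0, size=t))

# X = Q [R^{-1} W  0; 0  I]
X = np.empty((m, m))
X[:, :t] = Q[:, :t] @ np.linalg.inv(R) @ W
X[:, t:] = Q[:, t:]

# S_t reconstructed from X^T S_t X = D = diag(I_t, 0)  (Eq. (10))
D = np.diag(np.r_[np.ones(t), np.zeros(m - t)])
Xinv = np.linalg.inv(X)
St = Xinv.T @ D @ Xinv

# Lemma 3.5: the pseudo-inverse of S_t equals X D X^T
assert np.allclose(np.linalg.pinv(St), X @ D @ X.T, atol=1e-8)
```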
Theorem 3.2: Let X be the matrix as in Lemma 3.5. Then the first ℓ* columns of X solve the eigenvalue problem S_t^+ S_b x = λ x, for λ ≠ 0.

Proof: From Eq. (7), we have X^T S_b X = D_b, where D_b is diagonal as defined in Eq. (7). It follows from Lemma 3.5 that

S_t^+ S_b X = (X D X^T) S_b X = X D (X^T S_b X) = X D D_b,

where D D_b is diagonal. This implies that the columns of X are eigenvectors of the matrix S_t^+ S_b. The result of the theorem follows, since only the first ℓ* diagonal entries of the diagonal matrix D D_b are nonzero.
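The same construction verifies Theorem 3.2: if X^T S_b X = D_b is diagonal with only its first ℓ* entries nonzero, the columns of X are eigenvectors of S_t^+ S_b. A sketch under the same illustrative assumptions as above:

```python
import numpy as np

rng = np.random.default_rng(1)
m, t, ell = 8, 5, 2     # ell plays the role of rank(H_b)

Q, _ = np.linalg.qr(rng.normal(size=(m, m)))
W, _ = np.linalg.qr(rng.normal(size=(t, t)))
R = np.diag(rng.uniform(0.5, 2.0, size=t))

X = np.empty((m, m))
X[:, :t] = Q[:, :t] @ np.linalg.inv(R) @ W
X[:, t:] = Q[:, t:]
Xinv = np.linalg.inv(X)

# Impose X^T S_t X = diag(I_t, 0) and X^T S_b X = D_b (first ell entries nonzero)
D = np.diag(np.r_[np.ones(t), np.zeros(m - t)])
lam = np.r_[rng.uniform(0.1, 0.9, size=ell), np.zeros(m - ell)]
St = Xinv.T @ D @ Xinv
Sb = Xinv.T @ np.diag(lam) @ Xinv

# Theorem 3.2: S_t^+ S_b X = X diag(lam), so columns of X are eigenvectors
M = np.linalg.pinv(St) @ Sb
assert np.allclose(M @ X, X @ np.diag(lam), atol=1e-8)
```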
IV. APPROXIMATION ALGORITHM
One limitation of the above exact algorithm is the expensive computation of the GSVD, which requires the SVD of the matrix K ∈ IR^{(k+n)×m} formed by stacking H_b^T and H_w^T. In this section, an efficient approximation algorithm is presented to overcome this limitation.
The K-Means algorithm [6], [16] is widely used to capture the clustered structure of data by decomposing the whole data set into a disjoint union of small sets, called clusters. The K-Means algorithm minimizes the within-cluster distance; hence the data points in each cluster are close to the cluster centroid, and the centroids themselves are a good approximation of the original data set.
In Section II, we used the matrix H_w ∈ IR^{m×n} to capture the closeness of the vectors within each cluster. Although the number of data points n is smaller than the dimension m, the absolute value of n can still be large. To simplify the model, we use centroids only to approximate the structure of each cluster, by applying the K-Means algorithm within each cluster.
More specifically, let N_1, ..., N_k be the k clusters in the data set, with cluster sizes |N_i| = n_i and Σ_{i=1}^k n_i = n, and let c_i be the centroid of N_i. The K-Means algorithm is applied to each cluster N_i to produce k_i sub-clusters N_i^(1), ..., N_i^(k_i), with ∪_{j=1}^{k_i} N_i^(j) = N_i and sub-cluster sizes |N_i^(j)| = n_i^(j). Let c_i^(j) be the centroid of the sub-cluster N_i^(j). The within-cluster distance of the i-th cluster, Σ_{x ∈ N_i} ||x − c_i||², can then be approximated as

Σ_{x ∈ N_i} ||x − c_i||² = Σ_{j=1}^{k_i} Σ_{x ∈ N_i^(j)} ||x − c_i||² ≈ Σ_{j=1}^{k_i} n_i^(j) ||c_i^(j) − c_i||²,

by approximating every point in the sub-cluster N_i^(j) by its centroid c_i^(j).
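This approximation has a useful interpretation: by the standard scatter decomposition, the within-cluster distance equals the approximated quantity plus the within-sub-cluster scatter, which is exactly the objective K-Means minimizes. The identity can be checked numerically; the kmeans helper below is a plain Lloyd's iteration of our own, not the authors' implementation.

```python
import numpy as np

def kmeans(X, kc, iters=50, seed=0):
    # Plain Lloyd's algorithm: returns sub-cluster labels and centroids
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), kc, replace=False)]
    for _ in range(iters):
        d = ((X[:, None, :] - C[None]) ** 2).sum(-1)
        lab = d.argmin(1)
        C = np.array([X[lab == j].mean(0) if np.any(lab == j) else C[j]
                      for j in range(kc)])
    return lab, C

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))               # one cluster: 100 points, 5 dims
c = X.mean(axis=0)                          # cluster centroid
lab, C = kmeans(X, kc=8)

total  = ((X - c) ** 2).sum()               # within-cluster distance
approx = sum((lab == j).sum() * ((C[j] - c) ** 2).sum() for j in range(8))
within = sum(((X[lab == j] - C[j]) ** 2).sum() for j in range(8))

# Exact decomposition: total = approx + within-sub-cluster scatter
assert np.isclose(total, approx + within) and approx <= total
```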
Algorithm 2: Approximation algorithm
1. Form the matrix H_b as defined in Eq. (2).
2. Run the K-Means algorithm on each cluster N_i with k_i = k_c sub-clusters.
3. Form H̃_w as defined in Eq. (17) to approximate H_w.
4. Compute the GSVD of the matrix pair (H_b^T, H̃_w^T) to obtain the matrix X as in Eq. (6).
5. ℓ* ← rank(H_b).
6. G ← X_{ℓ*}, where X_{ℓ*} contains the first ℓ* columns of the matrix X.
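Step 3 of Algorithm 2 replaces each column block of H_w by one weighted column per sub-cluster. A minimal NumPy sketch, assuming (as in our reading of Eq. (17)) that each column carries the weight sqrt(n_i^(j)) so that the scatter of the approximation matches the approximated within-cluster distance; the function name and argument conventions are ours:

```python
import numpy as np

def approx_Hw(A, labels, sublabels):
    # One column sqrt(n_ij) * (c_ij - c_i) per sub-cluster (cf. Eq. (17))
    cols = []
    for lab in np.unique(labels):
        Ai = A[:, labels == lab]
        ci = Ai.mean(axis=1)
        sub = sublabels[labels == lab]
        for j in np.unique(sub):
            Aij = Ai[:, sub == j]
            cij = Aij.mean(axis=1)
            cols.append(np.sqrt(Aij.shape[1]) * (cij - ci))
    return np.column_stack(cols)
```

In the degenerate case where every point is its own sub-cluster, H̃_w H̃_w^T reproduces S_w exactly; coarser sub-clusterings trade accuracy for a much smaller matrix.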
Hence the matrix H_w can be approximated as

H̃_w = [ √n_1^(1) (c_1^(1) − c_1), ..., √n_1^(k_1) (c_1^(k_1) − c_1), ..., √n_k^(1) (c_k^(1) − c_k), ..., √n_k^(k_k) (c_k^(k_k) − c_k) ] ∈ IR^{m×k̃},   (17)

where k̃ = Σ_{i=1}^k k_i is the total number of sub-clusters, which is typically much smaller than n, the total number of data points, thus reducing the complexity of the GSVD computation dramatically. The main steps of our approximation algorithm are summarized in Algorithm 2. For simplicity, in our implementation we chose all the k_i to have the same value k_c, where k_c stands for the number of sub-clusters per cluster. We discuss the choice of k_c below.

A. The choice of the number of sub-clusters (k_c)
The value of k_c, the number of sub-clusters within each cluster N_i, determines the complexity of the approximation algorithm. If k_c is too large, the approximation algorithm produces results close to those obtained using all the points, but the computation of the GSVD remains expensive. If k_c is too small, the sub-clusters do not approximate the original data set well. For our problem, the K-Means algorithm is applied to data points belonging to the same cluster, which are already close to each other. Indeed, in our experiments we found that small values of k_c worked well; in particular, choosing k_c around 6 to 10 gave good results.
TABLE II
COMPARISON OF DECOMPOSITION AND QUERY TIME: n IS THE SIZE OF THE TRAINING SET, m IS THE DIMENSION OF THE TRAINING DATA, k IS THE NUMBER OF CLUSTERS IN THE TRAINING SET, K IS THE NUMBER OF NEAREST NEIGHBORS USED IN K-NN, AND d IS THE REDUCED DIMENSION BY LSI.

Method        | decomposition time | query time
Full          | 0                  | O(nm)
LSI           | O(n^2 m)           | O(nd)
RLDA          | O(n^2 m)           | O(nk)
LSI+LDA       | O(n^2 m)           | O(nk)
Exact         | O(n^2 m)           | O(nk)
Approximation | O(m)               | O(nk)
B. The time complexity
Under the approximation algorithm, the i-th cluster is partitioned into k_i sub-clusters. Hence the number of rows in the stacked matrix formed from H_b^T and the approximate H̃_w^T is k + k̃, where k̃ = Σ_{i=1}^k k_i. The complexity of the GSVD computation using the approximate H̃_w matrix is thus reduced to O(m(k + k̃)²). Note that the K-Means algorithm used for the clustering is very fast [16]. Within every iteration, the computational complexity is linear in the number of vectors and the number of clusters, and the algorithm converges within a few iterations. Hence the complexity of the K-Means step for cluster i is O(n_i k_i), where n_i is the size of the i-th cluster. Therefore, the total time for all k clusterings is O(Σ_{i=1}^k n_i k_i) = O(n k_c).

Hence the total complexity of our approximation algorithm is O(m(k + k̃)² + n k_c). Since k_c is usually chosen to be a small number and k is much smaller than n, while n is smaller than m on undersampled problems, the complexity simplifies to O(m).
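For concreteness, with the Dataset 1 sizes from Table III and an illustrative k_c = 8 from the 6-10 range above, the row dimension of the stacked matrix handed to the SVD shrinks as follows (a back-of-the-envelope sketch, not a measurement):

```python
# Rows of the stacked matrix [H_b^T; H_w^T] before and after the approximation
n, k, kc = 210, 7, 8            # Dataset 1 (Table III); kc is illustrative
exact_rows  = k + n             # exact algorithm
approx_rows = k + k * kc        # approximation: k_tilde = k * kc sub-clusters
print(exact_rows, approx_rows)  # 217 63
```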
Table II lists the complexity of different methods discussed above. The K-NN algorithm is used for
querying the data.
TABLE III
SUMMARY OF DATASETS USED FOR EVALUATION

Data Set              | 1    | 2             | 3
Source                | TREC | Reuters-21578 | Reuters-21578
number of documents n | 210  | 320           | 490
number of terms m     | 7454 | 2887          | 3759
number of clusters k  | 7    | 4             | 5
V. EXPERIMENTAL RESULTS
A. Datasets
In the following experiments, we use three different datasets1, summarized in Table III. For all datasets, we use a stop-list to remove common words, and the words are stemmed using Porter's suffix-stripping algorithm [22]. Moreover, any term that occurs in fewer than two documents is eliminated, as in [32]. Dataset 1 is derived from the TREC-5, TREC-6, and TREC-7 collections [30]. It consists of 210 documents in a space of dimension 7454, with 7 clusters; each cluster has 30 documents. Datasets 2 and 3 are from the Reuters-21578 text categorization test collection, Distribution 1.0 [19]. Dataset 2 contains 4 clusters, each with 80 documents; the dimension of the document set is 2887. Dataset 3 has 5 clusters, each with 98 documents; the dimension of the document set is 3759.
For all the examples, we use the tf-idf weighting scheme [24], [32] for encoding the document collection
with a term-document matrix.
B. Experimental methodology
To evaluate the methods proposed in this paper, we compared them with three other dimension reduction methods, namely LSI, RLDA, and two-stage LSI+LDA, on the three datasets in Table III. The K-Nearest Neighbor algorithm [7] (for K = 1, 7, and 15) was applied to evaluate the quality of the different dimension reduction algorithms, as in [13]. For each method, we applied 10-fold cross-validation to compute the
1These datasets are available at www.cs.umn.edu/~jieping/Research.html.
misclassification rate and the standard deviation. For LSI and two-stage LSI+LDA, the results depend on the reduced dimension d chosen in the LSI step; our experiments showed that moderate values of d produced good overall results for both. Similarly, RLDA depends on the choice of its regularization parameter λ; we ran RLDA with different choices of λ and used a value that produced good overall results on the three datasets. We also applied K-NN in the original high-dimensional space without the dimension reduction step, which we call "FULL" in the following experiments.
The clustering by K-Means in the approximation algorithm is sensitive to the choice of the initial centroids. To mitigate this, we ran the algorithm ten times, with the initial centroids for each run generated randomly; the final result is the average over the ten runs.
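The evaluation protocol above (K-NN plus 10-fold cross-validation) can be sketched as follows. This is our own minimal re-implementation with hypothetical function names, not the code used in the experiments:

```python
import numpy as np

def knn_predict(Ztr, ytr, Zte, K):
    # Majority vote among the K nearest training points (Euclidean distance)
    d = ((Zte[:, None, :] - Ztr[None]) ** 2).sum(-1)
    idx = np.argsort(d, axis=1)[:, :K]
    return np.array([np.bincount(v).argmax() for v in ytr[idx]])

def cv_misclassification(Z, y, K, folds=10, seed=0):
    # Cross-validated misclassification rate and its standard deviation
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(y))
    errs = []
    for f in range(folds):
        te = perm[f::folds]                  # held-out fold
        tr = np.setdiff1d(perm, te)          # remaining training points
        pred = knn_predict(Z[tr], y[tr], Z[te], K)
        errs.append(np.mean(pred != y[te]))
    return float(np.mean(errs)), float(np.std(errs))
```

Here Z would be the reduced representation, e.g. Z = (G^T A)^T for a transformation G produced by one of the methods above.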
C. Results
Fig. 1. Effect of the value of k_c on the approximation algorithm using Datasets 1-3. The x-axis denotes the value of k_c.
1) Effect of the value of k_c on the approximation algorithm: In Section IV, we used k_c to denote the number of sub-clusters for the approximation algorithm, and noted that the algorithm works well for small values of k_c. We tested the approximation algorithm on Datasets 1-3 for values of k_c ranging from 2 to 30 and computed the misclassification rates; K-NN was used for classification. As seen from Figure 1, the misclassification rates did not fluctuate much within this range. For efficiency, we would like k_c to be small; therefore, in our experiments we chose a small value of k_c, as the approximation algorithm performed very well on all three datasets in the vicinity of such values. As Figure 1 shows, the performance of the approximation algorithm is generally insensitive to the choice of k_c.

2) Comparison of misclassification rates: In the following experiment, we evaluated our exact and
approximation algorithms and compared them with competing algorithms (LSI, RLDA, LSI+LDA) based
on the misclassification rates, using the three datasets in Table III.
Fig. 2. Performance of different dimension reduction methods on Dataset 1
Fig. 3. Performance of different dimension reduction methods on Dataset 2
The results for Datasets 1-3 are summarized in Figures 2-4, respectively, where the x-axis shows the three different choices of K (1, 7, and 15) used in K-NN for classification, and the y-axis shows the misclassification rate. "EXACT" and "APPR" refer to the exact and approximation algorithms, respectively. We also report the results in the original space without dimension reduction, denoted "FULL" in Figures 2-4. The corresponding standard deviations of the different methods for Datasets 1-3 are summarized in Table IV. For each dataset, the results for K = 1, 7, and 15 neighbors used in K-NN
Fig. 4. Performance of different dimension reduction methods on Dataset 3
are reported.
TABLE IV
STANDARD DEVIATIONS OF DIFFERENT METHODS ON DATASETS 1-3, EXPRESSED AS A PERCENTAGE

        | Dataset 1        | Dataset 2        | Dataset 3
Method  | K=1  K=7  K=15   | K=1  K=7  K=15   | K=1  K=7  K=15
FULL    | 5.41 5.61 5.96   | 7.99 7.86 6.04   | 5.63 5.71 5.40
LSI     | 5.41 6.27 5.70   | 7.97 6.43 7.10   | 3.78 5.20 5.13
RLDA    | 4.69 2.30 2.30   | 6.50 6.20 5.40   | 3.29 4.18 4.35
LSI+LDA | 5.52 3.33 2.30   | 6.56 6.60 5.55   | 4.09 4.28 4.43
EXACT   | 2.01 2.01 2.01   | 7.18 5.80 6.07   | 3.57 3.57 3.57
APPR    | 3.21 3.67 2.67   | 6.40 6.80 5.60   | 4.59 4.12 3.24
For all three datasets, the number of terms m in the term-document matrix is larger than the number of documents n. It follows that all scatter matrices are singular and classical discriminant analysis breaks down; our proposed generalized discriminant analysis, however, circumvents this problem.
The main observations from these experiments are: (1) dimension reduction algorithms such as RLDA, LSI+LDA, and our exact and approximation algorithms do improve classification performance, even for very high-dimensional data sets such as those derived from text documents; (2) algorithms that use the label information, namely the proposed exact and approximation algorithms, RLDA, and LSI+LDA, perform better than the one that does not, i.e., LSI; (3) the approximation algorithm deals with a problem of much smaller size than the exact algorithm, yet the results from all the experiments show that the two have similar misclassification rates; (4) the performance of the exact algorithm is close to that of RLDA and LSI+LDA in all experiments, but both RLDA and LSI+LDA require certain parameters to be chosen, while the exact algorithm can be applied directly without setting any parameters.
3) Effect of the reduced dimension on the exact and approximation algorithms: In Section III, we showed that our exact algorithm based on generalized discriminant analysis has optimal reduced dimension ℓ*, which equals the rank of the matrix H_b and is typically very small. We also mentioned in Section III that the exact algorithm may not work well if the reduced dimension is chosen to be much smaller than the optimal one. In this experiment, we study the effect of the reduced dimension on the performance of the exact and approximation algorithms.
Fig. 5. Effect of the reduced dimension using Dataset 1 (optimal reduced dimension ℓ* = 6)
Figures 5-7 illustrate the effect of different choices of the reduced dimension on our exact and approximation algorithms; K-NN was used for classification. The x-axis is the value of the reduced dimension, and the y-axis is the misclassification rate. For each reduced dimension, the results for the exact and approximation algorithms are ordered from left to right. Our exact and approximation algorithms achieved their best performance in all cases when the reduced dimension is around ℓ*. Note that the optimal reduced dimension ℓ* may differ across datasets. These results further confirm our theoretical analysis of the optimal choice of the reduced dimension for the exact algorithm, as discussed in Section III.
Fig. 6. Effect of the reduced dimension using Dataset 2 (optimal reduced dimension ℓ* = 3)
Fig. 7. Effect of the reduced dimension using Dataset 3 (optimal reduced dimension ℓ* = 4)
VI. CONCLUSIONS
An optimization criterion for generalized linear discriminant analysis is presented. The criterion is applicable to undersampled problems, thus overcoming the limitation of classical linear discriminant analysis. A new formulation of the proposed generalized linear discriminant analysis, based on trace optimization, is discussed, and the generalized singular value decomposition is applied to solve the optimization problem. The solution from the LDA/GSVD algorithm is a special case of the solution to this new optimization problem. The equivalence between the exact algorithm and the Pseudo Fisher LDA (PFLDA) is shown theoretically, thus providing a justification for the pseudo-inverse based LDA.
The exact algorithm involves the matrix H_w ∈ IR^{m×n}, whose column dimension n can be large. To reduce the decomposition time of the exact algorithm, an approximation algorithm is presented, which applies the K-Means algorithm to each cluster and replaces the cluster by the centroids of the resulting sub-clusters. The column dimension of the matrix H_w is thereby reduced dramatically, which in turn reduces the complexity of the GSVD computation. Experimental results on several real data sets show that the approximation algorithm produces results close to those produced by the exact algorithm.
Acknowledgement
We thank the Associate Editor and the four reviewers for helpful comments that greatly improved the
paper.
Research of J. Ye and R. Janardan is sponsored, in part, by the Army High Performance Computing
Research Center under the auspices of the Department of the Army, Army Research Laboratory cooperative
agreement number DAAD19-01-2-0014, the content of which does not necessarily reflect the position or
the policy of the government, and no official endorsement should be inferred. Research of C. Park and
H. Park has been supported in part by the National Science Foundation Grants CCR-0204109 and ACI-
0305543. Any opinions, findings and conclusions or recommendations expressed in this material are those
of the authors and do not necessarily reflect the views of the National Science Foundation.
REFERENCES
[1] P. Baldi and G.W. Hatfield. DNA Microarrays and Gene Expression: From experiments to data analysis and modeling, Cambridge,
2002.
[2] M.W. Berry, S.T. Dumais, and G.W. O’Brien. Using linear algebra for intelligent information retrieval. SIAM Review, 37:573-595,
1995.
[3] P.N. Belhumeur, J.P. Hespanha, and D.J. Kriegman. Eigenfaces vs. fisherfaces: Recognition using class specific linear projection. IEEE
Trans. Pattern Analysis and Machine Intelligence, vol. 19, no. 7, pp. 711–720, 1997.
[4] D.Q. Dai and P.C. Yuen. Regularized discriminant analysis and its application to face recognition. Pattern Recognition, vol. 36, pp.
845–847, 2003.
[5] S. Deerwester, S.T. Dumais, G.W. Furnas, T.K. Landauer, and R. Harshman. Indexing by latent semantic analysis. J. Society for
Information Science, 41, pp. 391–407, 1990.
[6] I.S. Dhillon and D.S. Modha. Concept Decompositions for Large Sparse Text Data using Clustering. Machine Learning. 42, pp.
143–175, 2001.
[7] R.O. Duda, P.E. Hart, and D.G. Stork. Pattern Classification. Wiley, 2000.
[8] R.P.W. Duin. Small sample size generalization. Proc. 9th Scandinavian Conf. on Image Analysis, Vol. 2, 957–964, 1995.
[9] J. H. Friedman. Regularized discriminant analysis. J. American Statistical Association, vol. 84, no. 405, pp. 165–175, 1989.
[10] K. Fukunaga. Introduction to Statistical Pattern Recognition, second edition. Academic Press, Inc., 1990.
[11] G.H. Golub and C.F. Van Loan. Matrix Computations, Johns Hopkins University Press, 3rd edition, 1996.
[12] J.C. Gower. Some distance properties of latent root and vector methods used in multivariate analysis. Biometrika, 53, pp. 325–338, 1966.
[13] P. Howland, M. Jeon, and H. Park. Cluster structure preserving dimension reduction based on the generalized singular value
decomposition. SIAM J. Matrix Analysis and Applications, 25(1), pp. 165–179, 2003.
[14] P. Howland and H. Park. Generalizing discriminant analysis using the generalized singular value decomposition, IEEE TPAMI, 2003,
to appear.
[15] P. Howland and H. Park. Equivalence of Several Two-stage Methods for Linear Discriminant Analysis, Proc. Fourth SIAM International
Conference on Data Mining, 2004, to appear.
[16] A.K. Jain and R.C. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988.
[17] G. Kowalski. Information Retrieval System: Theory and Implementation, Kluwer Academic Publishers, 1997.
[18] W.J. Krzanowski, P. Jonathan, W.V McCarthy, and M.R. Thomas. Discriminant analysis with singular covariance matrices: methods
and applications to spectroscopic data. Applied Statistics, 44, pp. 101–115, 1995.
[19] D.D. Lewis. Reuters-21578 text categorization test collection, Distribution 1.0. http://www.research.att.com/~lewis, 1999.
[20] A. Mehay, C. Cai, and P.B. Harrington. Regularized Linear Discriminant Analysis of Wavelet Compressed Ion Mobility Spectra. Applied
Spectroscopy, Vol. 15, No. 2, pp. 219-227, 2002.
[21] C.C. Paige and M.A. Saunders. Towards a generalized singular value decomposition. SIAM J. Numerical Analysis, 18, pp. 398–405,
1981.
[22] M.F. Porter. An algorithm for suffix stripping. Program, 14(3), pp. 130–137, 1980.
[23] S. Raudys and R.P.W. Duin. On expected classification error of the Fisher linear classifier with pseudo-inverse covariance matrix,
Pattern Recognition Letters, vol. 19, no. 5-6, pp. 385-392, 1998.
[24] G. Salton. Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer. Addison-Wesley, 1989.
[25] G. Salton and M.J. McGill. Introduction to Modern Information Retrieval, McGraw-Hill, 1983.
[26] M. Skurichina and R.P.W. Duin. Stabilizing classifiers for very small sample size. Proc. International Conference on Pattern Recognition,
Vienna, Austria, pp. 891–896, 1996.
[27] M. Skurichina and R.P.W. Duin, Regularization of linear classifiers by adding redundant features, Pattern Analysis and Applications,
vol. 2, no. 1, pp. 44-52, 1999.
[28] D. L. Swets, and J. Weng. Using Discriminant Eigenfeatures for Image Retrieval, IEEE Trans. Pattern Analysis and Machine Intelligence,
vol. 18, no. 8, pp. 831–836, 1996.
[29] Q. Tian, M. Barbero, Z.H. Gu, and S.H. Lee, Image classification by the Foley-Sammon transform. Optical Engineering, Vol. 25, No.
7, pp. 834-840, 1986.
[30] TREC. Text Retrieval Conference. http://trec.nist.gov, 1999.
[31] C. F. Van Loan. Generalizing the singular value decomposition. SIAM J. Numerical Analysis, 13(1), pp. 76–83, 1976.
[32] Y. Zhao and G. Karypis. Criterion Functions for Document Clustering: Experiments and Analysis. Technical Report TR01-040, Department of Computer Science and Engineering, University of Minnesota, Twin Cities, U.S.A., 2001.
[33] X. S. Zhou, and T.S. Huang. Small Sample Learning during Multimedia Retrieval using BiasMap. Proc. IEEE Computer Society
Conference on Computer Vision and Pattern Recognition, Kauai, Hawaii, pp. 11–17, 2001.