An optimization criterion for generalized
discriminant analysis on undersampled problems
Jieping Ye, Ravi Janardan, Cheong Hee Park, and Haesun Park
Abstract
An optimization criterion is presented for discriminant analysis. The criterion extends the optimization criteria
of the classical Linear Discriminant Analysis (LDA) through the use of the pseudo-inverse when the scatter
matrices are singular. It is applicable regardless of the relative sizes of the data dimension and sample size,
overcoming a limitation of classical LDA. The optimization problem can be solved analytically by applying the
Generalized Singular Value Decomposition (GSVD) technique. The pseudo-inverse has been suggested and used
for undersampled problems in the past, where the data dimension exceeds the number of data points. The criterion
proposed in this paper provides a theoretical justification for this procedure.
An approximation algorithm for the GSVD-based approach is also presented. It reduces the computational
complexity by finding sub-clusters of each cluster, and uses their centroids to capture the structure of each cluster.
This reduced problem yields much smaller matrices to which the GSVD can be applied efficiently. Experiments
on text data, with up to 7000 dimensions, show that the approximation algorithm produces results that are close
to those produced by the exact algorithm.
Index Terms
Classification, clustering, dimension reduction, generalized singular value decomposition, linear discriminant
analysis, text mining.
J. Ye, R. Janardan, C. Park, and H. Park are with the Department of Computer Science and Engineering, University of Minnesota–Twin
Cities, Minneapolis, MN 55455, U.S.A. Email: {jieping, janardan, chpark, hpark}@cs.umn.edu.
Corresponding author: Jieping Ye; Tel.: 612-626-7504; Fax: 612-625-0572.
I. INTRODUCTION
Many interesting data mining problems involve data sets represented in very high-dimensional spaces.
We consider dimension reduction for high-dimensional, undersampled data, where the dimension of the
data points is higher than the number of data points. Such high-dimensional, undersampled problems
occur frequently in many applications, including information retrieval [25], facial recognition [3], [28]
and microarray analysis [1].
The application area of interest in this paper is vector space-based information retrieval. The dimension
of the document vectors is typically very high, due to the large number of terms that appear in the collection
of the documents. In the vector space-based model, documents are represented as column vectors in a
term-document matrix. For an $N \times n$ term-document matrix $A = (a_{ij})$, its $(i,j)$-th entry, $a_{ij}$, represents the weighted frequency of term $i$ in document $j$. Several weighting schemes have been developed for encoding the document collection in a term-document matrix [17]. An advantage of the vector space-based method is that once the collection of documents is represented as columns of the term-document matrix in a high-dimensional space, the algebraic structure of the related vector space can then be exploited.
When the documents are already clustered, we would like to find a dimension-reducing transformation
that preserves the cluster structure of the original full space even after the dimension reduction. Throughout
the paper, the input documents are assumed to have been already clustered before the dimension reduction
step. When the documents are not clustered, then an efficient clustering algorithm such as K-Means [6],
[16] can be applied before the dimension reduction step. We seek a reduced representation of the document
vectors, which best preserves the structure of the original document vectors.
A. Prior related work
Latent Semantic Indexing (LSI) has been widely used for dimension reduction of text data [2], [5]. It is
based on lower rank approximation of the term-document matrix from the Singular Value Decomposition
(SVD) [11]. Although the SVD provides the optimal reduced rank approximation of the matrix when the
difference is measured in the $L_2$ or Frobenius norm, it has limitations in that it does not consider the cluster structure in the data and is expensive to compute. Moreover, the optimal reduced dimension is difficult to determine theoretically.
The Linear Discriminant Analysis (LDA) method has been applied for decades for dimension reduction
(feature extraction) of clustered data in pattern recognition [10]. It is classically formulated as an optimization problem on scatter matrices. A serious disadvantage of LDA is that its objective function
requires that the total scatter matrix be nonsingular. In many modern data mining problems such as
information retrieval, facial recognition, and microarray data analysis, the total scatter matrix in question
can be singular since the data items are from a very high-dimensional space, and in general, the dimension
exceeds the number of data points. This is known as the undersampled or singularity problem [10], [18].
In recent years, many approaches have been brought to bear on such high-dimensional, undersampled
problems. These methods can be roughly grouped into three categories. The first approach applies an
intermediate dimension reduction method, such as LSI, Principal Component Analysis (PCA), and Partial
Least Squares (PLS), to extract important components for LDA [3], [14], [18], [28]. The algorithm is
named two-stage LDA. This approach is straightforward but the intermediate dimension reduction stage
may remove some useful information. Some relevant references on this approach can be found in [15],
[18]. The second approach solves the singularity problem by adding a perturbation to the scatter matrix
[4], [9], [18], [33]. The algorithm is known as the regularized LDA, or RLDA for short. A limitation of
RLDA is that the amount of perturbation to be used is difficult to determine [4], [18].
The final approach applies the pseudo-inverse to avoid the singularity problem, which is equivalent to
approximating the solution using a least-squares solution method. The work in [12] showed that its use
led to appropriate calculation of Mahalanobis distance in singular cases. The use of classical LDA with
the pseudo-inverse has been proposed in the past. The Pseudo Fisher Linear Discriminant (PFLDA) in
[8], [10], [23], [26], [27] is based on the pseudo-inverse of the scatter matrices. The generalization error of
PFLDA was studied in [8], [26], when the size and dimension of the training data vary. As observed in [8],
[26], with an increase in the size of the training data, the generalization error of PFLDA first decreases,
reaching a minimum, then increases, reaching a maximum at the point when the size of the training data is
comparable with the data dimension, and afterwards begins to decrease. This peaking behavior of PFLDA
can be resolved by either reducing the size of the training data [8], [26] or by increasing the dimension
through addition of redundant features [27]. Pseudo-inverses of the scatter matrices were also studied in
[18], [20], [29]. Experiments in [18] showed that the pseudo-inverse based method achieved comparable
performance with RLDA and two-stage LDA. It is worthwhile to note that both RLDA and two-stage
LDA involve the estimation of some parameters. The authors in [18] applied cross-validation to estimate
the parameters, which can be very expensive. On the other hand, the pseudo-inverse based LDA does not
involve any parameter and is straightforward. However, its theoretical justification has not been studied
well in the literature.
Recently, a generalization of LDA based on the Generalized Singular Value Decomposition (GSVD)
has been developed [13], [14], which is applicable regardless of the data dimension, and, therefore, can be
used for undersampled problems. The classical LDA solution becomes a special case of this LDA/GSVD
method. In [13], [14], the solution from LDA/GSVD is justified to preserve the cluster structure in the
original full space after dimension reduction. However, no explicit global objective function has been
presented.
B. Contributions of this paper
In this paper, we present a generalized optimization criterion, based on the pseudo-inverse, for discriminant analysis. Our cluster-preserving projections are tailored for extracting the cluster structure of high-dimensional data, and are closely related to classical linear discriminant analysis. The main advantage of
the proposed criterion is that it is applicable to undersampled problems. A detailed mathematical derivation
of the proposed optimization problem is presented. The GSVD technique is the key component for the
derivation. The solution from the LDA/GSVD algorithm is a special case of the solution for the proposed
criterion. We call it the exact algorithm, to distinguish it from an efficient approximation algorithm that we
also present. Interestingly, we show the equivalence between the exact algorithm and the pseudo-inverse-
based method, PFLDA. Hence, the criterion proposed in this paper may provide a theoretical justification
for the pseudo-inverse based LDA.
One limitation of the exact algorithm is the high computational complexity of GSVD in handling
large matrices. Typically, large document datasets, such as the ones in [32], contain several thousand documents with dimension in the tens of thousands. We propose an approximation algorithm based on
sub-clustering of clusters to reduce the cost of computing the SVD involved in the computation of the
GSVD. Each cluster is further sub-clustered so that the overall structure of each cluster can be represented
by the set of centroids of sub-clusters. As a result, only a few vectors are needed to define the scatter
matrices, thus reducing the computational complexity quite significantly. Experimental results show that
the approximation algorithm produces results close to those produced by the exact one.
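The sub-clustering idea can be sketched in a few lines: each cluster matrix is summarized by a small number of sub-cluster centroids found with K-Means. The NumPy sketch below is illustrative only, not the authors' implementation; the helper name, the Lloyd-style iteration count, and the fixed seed are our own assumptions:

```python
import numpy as np

def subcluster_centroids(Ai, p, n_iter=20, seed=0):
    """Summarize cluster matrix Ai (columns = points) by the centroids
    of p sub-clusters found with basic Lloyd-style K-Means."""
    rng = np.random.default_rng(seed)
    pts = Ai.T                                   # n_i x N, rows are points
    centers = pts[rng.choice(len(pts), p, replace=False)]
    for _ in range(n_iter):
        # distance from every point to every current center
        d = np.linalg.norm(pts[:, None, :] - centers[None, :, :], axis=2)
        assign = d.argmin(axis=1)
        for l in range(p):
            if np.any(assign == l):              # keep old center if empty
                centers[l] = pts[assign == l].mean(axis=0)
    return centers.T                             # N x p centroid matrix
```

Replacing each cluster's columns by such centroids is what shrinks the matrices handed to the GSVD in the approximation algorithm.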
We compare our proposed algorithms with other dimension reduction algorithms in terms of classifi-
cation accuracy, where the K-Nearest neighbors (K-NN) method [7] based on the Euclidean distance is
used for classification. We use 10-fold cross-validation for estimating the misclassification rate. In 10-fold
cross-validation, we divide the data into ten subsets of (approximately) equal size. Then we do the training
and testing ten times, each time leaving out one of the subsets from training, and using only the omitted
subset for testing. The misclassification rate is the average from the ten runs. The misclassification rates
on different data sets for all the dimension reduction algorithms discussed in this paper are reported for
comparison. The results in Section V show that our exact and approximation algorithms are comparable
with the other dimension reduction algorithms discussed in this paper. More interestingly, the results
produced by the approximation algorithm are close to those produced by the exact algorithm, while the approximation algorithm deals with matrices of much smaller sizes than those in the exact algorithm, and hence has lower computational complexity.
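The evaluation protocol just described, K-NN classification scored by 10-fold cross-validation, can be sketched in a few lines. The following is an illustrative NumPy reimplementation, not the authors' code; the function names, the random shuffling, and the default $k = 1$ are our own assumptions:

```python
import numpy as np

def knn_predict(X_train, y_train, X_test, k=1):
    """Label each test row by majority vote among its k nearest
    training rows under the Euclidean distance."""
    preds = []
    for x in X_test:
        dists = np.linalg.norm(X_train - x, axis=1)
        nearest = y_train[np.argsort(dists)[:k]]
        values, counts = np.unique(nearest, return_counts=True)
        preds.append(values[np.argmax(counts)])
    return np.array(preds)

def cv_misclassification_rate(X, y, n_folds=10, k=1, seed=0):
    """10-fold cross-validation: train on nine subsets, test on the
    omitted one, and average the error over the ten rounds."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    errors = []
    for fold in np.array_split(idx, n_folds):
        mask = np.ones(len(y), dtype=bool)
        mask[fold] = False
        pred = knn_predict(X[mask], y[mask], X[fold], k)
        errors.append(np.mean(pred != y[fold]))
    return float(np.mean(errors))
```

Ties in the majority vote are broken arbitrarily here, since the paper does not specify a tie-breaking rule.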
The main contributions of this paper include: (1) a generalization of classical discriminant analysis to small sample size data using an optimization criterion, where the nonsingularity of the scatter matrices is not required; (2) a mathematical derivation of the solution for the proposed optimization criterion; (3) a theoretical justification of previously proposed pseudo-inverse based LDA, by showing the equivalence between the exact algorithm and the Pseudo Fisher Linear Discriminant Analysis (PFLDA); (4) an efficient approximation algorithm for the proposed optimization criterion; and (5) detailed experimental results for our exact and approximation algorithms and comparisons with competing algorithms.
The rest of the paper is organized as follows: Classical discriminant analysis is reviewed in Section II. A
generalization of the classical discriminant analysis using the proposed criterion is presented in Section III.
An efficient approximation algorithm is presented in Section IV. Experimental results are presented in
Section V and concluding discussions are presented in Section VI.
For convenience, we present in Table I the important notations used in the rest of the paper.
TABLE I
IMPORTANT NOTATIONS USED IN THE PAPER
Notation Description Notation Description
$n$: number of training data points; $N$: dimension of the training data
$k$: number of clusters; $A$: data matrix
$A_i$: data matrix of the $i$-th cluster; $n_i$: size of the $i$-th cluster
$c_i$: centroid of the $i$-th cluster; $c$: global centroid of the training set
$S_b$: between-cluster scatter matrix; $S_w$: within-cluster scatter matrix
$S_t$: total scatter matrix; $G$: dimension reduction transformation matrix
$p_i$: number of sub-clusters of the $i$-th cluster in K-Means; $p$: total number of sub-clusters
$\mu$: perturbation in regularized LDA; $K$: number of nearest neighbors in K-NN
$d$: reduced dimension by LSI; $q^*$: reduced dimension by the exact algorithm
II. CLASSICAL DISCRIMINANT ANALYSIS
Given a data matrix $A \in \mathbb{R}^{N \times n}$, we consider finding a linear transformation $G^T \in \mathbb{R}^{\ell \times N}$ that maps each column $a_i$, for $1 \le i \le n$, of $A$ in the $N$-dimensional space to a vector $y_i$ in the $\ell$-dimensional space: $G^T: a_i \in \mathbb{R}^{N} \rightarrow y_i \in \mathbb{R}^{\ell}$. Assume the original data is already clustered. The goal here is to find the transformation $G^T$ such that the cluster structure of the original high-dimensional space is preserved in the reduced-dimensional space. Let the document matrix $A$ be partitioned into $k$ clusters as $A = [A_1, \ldots, A_k]$, where $A_i \in \mathbb{R}^{N \times n_i}$ and $\sum_{i=1}^{k} n_i = n$. Let $N_i$ be the set of column indices that belong to the $i$-th cluster, i.e., $a_j$, for $j \in N_i$, belongs to the $i$-th cluster.
In general, if each cluster is tightly grouped, but well separated from the other clusters, the quality of the clustering is considered to be high. In discriminant analysis [10], three scatter matrices, called the within-cluster, between-cluster, and total scatter matrices, are defined to quantify the quality of the clustering, as follows:
$$S_w = \sum_{i=1}^{k} \sum_{j \in N_i} (a_j - c_i)(a_j - c_i)^T,$$

$$S_b = \sum_{i=1}^{k} \sum_{j \in N_i} (c_i - c)(c_i - c)^T = \sum_{i=1}^{k} n_i\, (c_i - c)(c_i - c)^T,$$

$$S_t = \sum_{j=1}^{n} (a_j - c)(a_j - c)^T = S_b + S_w, \qquad (1)$$

where the centroid $c_i$ of the $i$-th cluster is defined as $c_i = \frac{1}{n_i} A_i e_i$, with $e_i = (1, 1, \ldots, 1)^T \in \mathbb{R}^{n_i}$, and the global centroid $c$ is defined as $c = \frac{1}{n} A e$, with $e = (1, 1, \ldots, 1)^T \in \mathbb{R}^{n}$.

Define the matrices

$$H_w = [A_1 - c_1 e_1^T, \ \ldots, \ A_k - c_k e_k^T], \qquad H_b = [\sqrt{n_1}\,(c_1 - c), \ \ldots, \ \sqrt{n_k}\,(c_k - c)]. \qquad (2)$$
Then the scatter matrices $S_w$ and $S_b$ can be expressed as $S_w = H_w H_w^T$ and $S_b = H_b H_b^T$. The traces of the two scatter matrices can be computed as follows:

$$\operatorname{trace}(S_w) = \sum_{i=1}^{k} \sum_{j \in N_i} (a_j - c_i)^T (a_j - c_i) = \sum_{i=1}^{k} \sum_{j \in N_i} \| a_j - c_i \|^2,$$

$$\operatorname{trace}(S_b) = \sum_{i=1}^{k} n_i\, (c_i - c)^T (c_i - c) = \sum_{i=1}^{k} n_i\, \| c_i - c \|^2.$$

Hence, $\operatorname{trace}(S_w)$ measures the closeness of the vectors within the clusters, while $\operatorname{trace}(S_b)$ measures the separation between clusters.
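As an illustration, the scatter matrices of Eq. (1) can be computed directly from a data matrix whose columns are the data points. The following NumPy sketch uses our own function name and a label vector in place of the index sets $N_i$:

```python
import numpy as np

def scatter_matrices(A, labels):
    """Scatter matrices of Eq. (1) for a data matrix A whose columns
    are the data points; labels[j] is the cluster of column j."""
    N, n = A.shape
    c = A.mean(axis=1, keepdims=True)            # global centroid
    S_w = np.zeros((N, N))
    S_b = np.zeros((N, N))
    for i in np.unique(labels):
        Ai = A[:, labels == i]                   # i-th cluster
        ci = Ai.mean(axis=1, keepdims=True)      # cluster centroid
        D = Ai - ci
        S_w += D @ D.T                           # within-cluster part
        S_b += Ai.shape[1] * (ci - c) @ (ci - c).T
    return S_w, S_b, S_w + S_b                   # S_t = S_w + S_b
```

By construction the returned total scatter satisfies $S_t = S_b + S_w$, so its trace equals $\operatorname{trace}(S_w) + \operatorname{trace}(S_b)$.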
In the lower-dimensional space resulting from the linear transformation $G^T$, the within-cluster and between-cluster scatter matrices become

$$S_w^L = (G^T H_w)(G^T H_w)^T = G^T S_w G, \qquad S_b^L = (G^T H_b)(G^T H_b)^T = G^T S_b G. \qquad (3)$$

An optimal transformation $G^T$ would maximize $\operatorname{trace}(S_b^L)$ and minimize $\operatorname{trace}(S_w^L)$. Common optimizations in classical discriminant analysis include

$$\max_G \ \operatorname{trace}\big((S_w^L)^{-1} S_b^L\big) \qquad \text{and} \qquad \min_G \ \operatorname{trace}\big((S_b^L)^{-1} S_w^L\big). \qquad (4)$$
The optimization problem in Eq. (4) is equivalent to finding all the eigenvectors $x$ that satisfy $S_b x = \lambda S_w x$, for $\lambda \neq 0$. The solution can be obtained by solving an eigenvalue problem on the matrix $S_w^{-1} S_b$, if $S_w$ is nonsingular, or on $S_b^{-1} S_w$, if $S_b$ is nonsingular. Equivalently, we can compute the optimal solution by solving an eigenvalue problem on the matrix $S_t^{-1} S_b$ [10]. There are at most $k-1$ eigenvectors corresponding to nonzero eigenvalues, since the rank of the matrix $S_b$ is bounded above by $k-1$. Therefore, the reduced dimension obtained by classical LDA is at most $k-1$. A stable way to solve this eigenvalue problem is to apply the SVD to the scatter matrices. Details can be found in [28].
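When $S_t$ is nonsingular, the eigenproblem on $S_t^{-1} S_b$ just described can be solved with a dense eigensolver. The following is a minimal NumPy sketch under our own naming; as the text notes, a practical implementation would prefer the SVD-based route for stability:

```python
import numpy as np

def classical_lda(A, labels):
    """Classical LDA: eigenvectors of inv(S_t) @ S_b with the largest
    eigenvalues; at most k-1 are nonzero. Requires S_t nonsingular,
    which in practice means n > N."""
    N, n = A.shape
    classes = np.unique(labels)
    c = A.mean(axis=1, keepdims=True)
    S_b = np.zeros((N, N))
    for i in classes:
        Ai = A[:, labels == i]
        ci = Ai.mean(axis=1, keepdims=True)
        S_b += Ai.shape[1] * (ci - c) @ (ci - c).T
    S_t = (A - c) @ (A - c).T                    # total scatter
    evals, evecs = np.linalg.eig(np.linalg.inv(S_t) @ S_b)
    order = np.argsort(-evals.real)
    return evecs[:, order[:len(classes) - 1]].real   # N x (k-1)
```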
Note that a limitation of classical discriminant analysis in many applications involving small sample data, including text processing in information retrieval, is that the matrix $S_t$ must be nonsingular. Several methods, including two-stage LDA, regularized LDA, and pseudo-inverse based LDA, have been proposed in the past to deal with this singularity problem, as follows.
A common way to deal with the singularity problem is to apply an intermediate dimension reduction
stage to reduce the dimension of the original data before classical LDA is applied. Several choices of the
intermediate dimension reduction algorithms include LSI [14], PCA [3], [18], [28], Partial Least Squares
(PLS) [18], etc. The application area of this paper is text mining, hence LSI is a natural choice for
the intermediate dimension reduction. In this two-stage LSI+LDA algorithm, the discriminant stage is
preceded by a dimension reduction stage using LSI. A limitation of this approach is that the optimal
value of the reduced dimension for LSI is difficult to determine [14].
A simple way to deal with the singularity of $S_t$ is to add some constant value to the diagonal elements of $S_t$, as $S_t + \mu I_N$, for some $\mu > 0$, where $I_N$ is the $N \times N$ identity matrix [9]. It is easy to check that $S_t + \mu I_N$ is positive definite, hence nonsingular. This approach is called regularized Linear Discriminant Analysis (RLDA). A limitation of RLDA is that the optimal value of the parameter $\mu$ is difficult to determine. It is evident that when $\mu \rightarrow \infty$, we lose the information on $S_t$, while very small values of $\mu$ may not be sufficiently effective [26]. Cross-validation is commonly applied for estimating the optimal $\mu$. More recent studies on RLDA can be found in [4], [18], [26], [33].
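A minimal sketch of the RLDA remedy, given precomputed scatter matrices: the perturbation $\mu I$ makes the linear system solvable even when $S_t$ is singular. The helper name and the use of np.linalg.solve are our own choices:

```python
import numpy as np

def rlda_directions(S_b, S_t, mu, k):
    """RLDA: perturb S_t to S_t + mu*I so it is nonsingular, then take
    the top (k-1) eigenvectors of (S_t + mu*I)^{-1} S_b."""
    N = S_t.shape[0]
    M = np.linalg.solve(S_t + mu * np.eye(N), S_b)
    evals, evecs = np.linalg.eig(M)
    order = np.argsort(-evals.real)
    return evecs[:, order[:k - 1]].real
```

The sketch runs even for a rank-deficient $S_t$, which is exactly the undersampled situation the regularization is meant to handle.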
Another common way to deal with the singularity problem is to apply the pseudo-inverse of the scatter matrices instead of the inverse. The pseudo-inverse of a matrix is defined as follows.

Definition 2.1: The pseudo-inverse of a matrix $A$, denoted by $A^+$, is the unique matrix satisfying the following four conditions:

$$A A^+ A = A, \quad A^+ A A^+ = A^+, \quad (A A^+)^T = A A^+, \quad (A^+ A)^T = A^+ A.$$

The pseudo-inverse $A^+$ is commonly computed via the SVD [11] as follows. Let

$$A = U \begin{pmatrix} \Sigma & 0 \\ 0 & 0 \end{pmatrix} V^T$$

be the SVD of $A$, where $U$ and $V$ are orthogonal and $\Sigma$ is diagonal with positive diagonal entries. Then the pseudo-inverse of $A$, denoted by $A^+$, can be computed as

$$A^+ = V \begin{pmatrix} \Sigma^{-1} & 0 \\ 0 & 0 \end{pmatrix} U^T.$$

The Pseudo Fisher Linear Discriminant Analysis (PFLDA) [8], [10], [23], [26], [27] is equivalent to finding all the eigenvectors $x$ satisfying $S_t^+ S_b x = \lambda x$, for $\lambda \neq 0$. An advantage of PFLDA over the other two methods is that no parameter is involved in PFLDA.
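The SVD-based computation of the pseudo-inverse described above is straightforward to realize and to check against the four conditions of Definition 2.1. The sketch below is our own helper, with an assumed tolerance of 1e-10 for deciding which singular values count as nonzero:

```python
import numpy as np

def pinv_svd(A, tol=1e-10):
    """Pseudo-inverse via the SVD: invert only the singular values
    above tol, leave the rest at zero, and transpose the factors."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    s_inv = np.array([1.0 / x if x > tol else 0.0 for x in s])
    return Vt.T @ np.diag(s_inv) @ U.T
```

For a rank-deficient matrix this agrees with NumPy's built-in np.linalg.pinv and satisfies all four conditions of Definition 2.1 up to rounding.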
III. GENERALIZATION OF DISCRIMINANT ANALYSIS
Classical discriminant analysis expresses its solution by solving an eigenvalue problem when $S_t$ is nonsingular. However, for a general document matrix $A$, the number of documents ($n$) may be smaller than the dimension ($N$), in which case the total scatter matrix $S_t$ is singular. This implies that both $S_b$ and $S_w$ are singular. In this paper, we propose the criterion $Q_1$ below, where the nonsingularity of $S_t$ is not required. The criterion aims to minimize the within-cluster distance, $\operatorname{trace}(S_w^L)$, and maximize the between-cluster distance, $\operatorname{trace}(S_b^L)$.
The criterion $Q_1$ is a natural extension of the classical one in Eq. (4), in that the inverse of a matrix is replaced by the pseudo-inverse [11]. While the inverse of a matrix may not exist, the pseudo-inverse of any matrix is well defined. Moreover, when the matrix is invertible, its pseudo-inverse coincides with its inverse.

The criterion $Q_1$ that we propose is:

$$Q_1(G) = \operatorname{trace}\big((S_b^L)^+ S_w^L\big). \qquad (5)$$

Hence the trace optimization is to find an optimal transformation matrix $G^T$ such that $Q_1(G)$ is minimized under a certain constraint, defined in more detail in Section III-C.
We show how to solve the above minimization problem in Section III-C. The main technique applied
there is the GSVD, briefly introduced in Section III-A below. The constraints on the optimization problem
are based on the observations in Section III-B.
A. Generalized singular value decomposition
The Generalized Singular Value Decomposition (GSVD) was introduced in [21], [31]. A simple algo-
rithm to compute GSVD can be found in [13], where the algorithm is based on [21].
The GSVD of the matrix pair $(H_b^T, H_w^T)$ gives orthogonal matrices $U \in \mathbb{R}^{k \times k}$ and $V \in \mathbb{R}^{n \times n}$, and a nonsingular matrix $X \in \mathbb{R}^{N \times N}$, such that

$$U^T H_b^T X = (\Sigma_b, \ 0), \qquad V^T H_w^T X = (\Sigma_w, \ 0), \qquad (6)$$

where

$$\Sigma_b = \begin{pmatrix} I_b & & \\ & D_b & \\ & & O_b \end{pmatrix} \in \mathbb{R}^{k \times t}, \qquad \Sigma_w = \begin{pmatrix} O_w & & \\ & D_w & \\ & & I_w \end{pmatrix} \in \mathbb{R}^{n \times t}.$$

Here $t = \operatorname{rank}(K)$ for $K = \begin{pmatrix} H_b^T \\ H_w^T \end{pmatrix}$, $I_b \in \mathbb{R}^{r \times r}$ and $I_w \in \mathbb{R}^{(t-r-s) \times (t-r-s)}$ are identity matrices, $O_b \in \mathbb{R}^{(k-r-s) \times (t-r-s)}$ and $O_w \in \mathbb{R}^{(n-t+r) \times r}$ are zero matrices with $r = \operatorname{rank}(K) - \operatorname{rank}(H_w^T)$ and $s = \operatorname{rank}(H_b^T) + \operatorname{rank}(H_w^T) - \operatorname{rank}(K)$, and $D_b = \operatorname{diag}(\alpha_{r+1}, \ldots, \alpha_{r+s})$ and $D_w = \operatorname{diag}(\beta_{r+1}, \ldots, \beta_{r+s})$ are diagonal matrices satisfying

$$1 = \alpha_1 = \cdots = \alpha_r > \alpha_{r+1} \ge \cdots \ge \alpha_{r+s} > 0, \qquad 0 < \beta_{r+1} \le \cdots \le \beta_{r+s} < 1,$$

and $\alpha_i^2 + \beta_i^2 = 1$ for $i = r+1, \ldots, r+s$.

From Eq. (6), we have

$$H_b^T X = U (\Sigma_b, \ 0), \qquad H_w^T X = V (\Sigma_w, \ 0),$$

hence

$$X^T S_b X = \begin{pmatrix} \Sigma_b^T \Sigma_b & 0 \\ 0 & 0 \end{pmatrix}, \qquad X^T S_w X = \begin{pmatrix} \Sigma_w^T \Sigma_w & 0 \\ 0 & 0 \end{pmatrix}. \qquad (7)$$

Therefore

$$S_b^L = G^T X^{-T} \big( X^T S_b X \big) X^{-1} G = \tilde{G} \begin{pmatrix} \Sigma_b^T \Sigma_b & 0 \\ 0 & 0 \end{pmatrix} \tilde{G}^T, \qquad S_w^L = \tilde{G} \begin{pmatrix} \Sigma_w^T \Sigma_w & 0 \\ 0 & 0 \end{pmatrix} \tilde{G}^T, \qquad (8)$$

where

$$\tilde{G} = (X^{-1} G)^T. \qquad (9)$$

We use the above representations of $S_b^L$ and $S_w^L$ in Section III-C for the minimization of $Q_1$. It follows from Eq. (7) that

$$X^T S_t X = X^T (S_b + S_w) X = \begin{pmatrix} I_t & 0 \\ 0 & 0 \end{pmatrix}, \qquad (10)$$

which is used in the following Lemma 3.5.
B. Linear subspace spanned by the centroids
Let

$$\mathcal{C} = \operatorname{span}\{c_1, \ldots, c_k\}$$

be the subspace of $\mathbb{R}^{N}$ spanned by the $k$ centroids of the data vectors. In the lower-dimensional space obtained by $G^T$, the linear subspace spanned by the $k$ centroids is

$$\mathcal{C}^L = \operatorname{span}\{c_1^L, \ldots, c_k^L\},$$

where $c_i^L = G^T c_i$, for $i = 1, \ldots, k$.

The following lemma studies the relationship between the dimension of the subspace $\mathcal{C}$ and the rank of the matrix $H_b$, as well as the corresponding relationship in the reduced space.

Lemma 3.1: Let $\mathcal{C}$ and $\mathcal{C}^L$ be defined as above and let $H_b$ be defined as in Eq. (2). Let $\dim(\mathcal{C})$ and $\dim(\mathcal{C}^L)$ denote the dimensions of the subspaces $\mathcal{C}$ and $\mathcal{C}^L$, respectively. Then

1. $\dim(\mathcal{C}) = \operatorname{rank}(H_b) + 1$, or $\operatorname{rank}(H_b)$, and $\dim(\mathcal{C}^L) = \operatorname{rank}(G^T H_b) + 1$, or $\operatorname{rank}(G^T H_b)$;
2. If $\dim(\mathcal{C}) = \operatorname{rank}(H_b)$, then $\dim(\mathcal{C}^L) = \operatorname{rank}(G^T H_b)$. If $\dim(\mathcal{C}^L) = \operatorname{rank}(G^T H_b) + 1$, then $\dim(\mathcal{C}) = \operatorname{rank}(H_b) + 1$.
Proof: Consider the matrix

$$\tilde{H} = [\,c_1 - c, \ \ldots, \ c_k - c, \ c\,],$$

where the global centroid $c$ can be rewritten as $c = \sum_{i=1}^{k} \frac{n_i}{n} c_i$.

It is easy to check that $\operatorname{rank}(\tilde{H}) = \dim(\mathcal{C})$. On the other hand, the rank of the matrix $\tilde{H}$ equals $\operatorname{rank}(H_b)$, if $c$ lies in the space spanned by $\{c_1 - c, \ldots, c_k - c\}$, and equals $\operatorname{rank}(H_b) + 1$, otherwise. Hence $\dim(\mathcal{C}) = \operatorname{rank}(H_b) + 1$ or $\operatorname{rank}(H_b)$. Similarly, we can prove that $\dim(\mathcal{C}^L) = \operatorname{rank}(G^T H_b) + 1$ or $\operatorname{rank}(G^T H_b)$. This completes the proof of part 1.

If $\dim(\mathcal{C}) = \operatorname{rank}(H_b)$, then by the above argument, $c = \sum_{i=1}^{k} \gamma_i (c_i - c)$ for some coefficients $\gamma_i$. It follows that $c^L = G^T c = \sum_{i=1}^{k} \gamma_i (c_i^L - c^L)$, where $c^L$ is the global centroid in the lower-dimensional space. Again by the above argument, $\dim(\mathcal{C}^L) = \operatorname{rank}(G^T H_b)$. Similarly, if $\dim(\mathcal{C}^L) = \operatorname{rank}(G^T H_b) + 1$, we can show that $\dim(\mathcal{C}) = \operatorname{rank}(H_b) + 1$.
C. Generalized discriminant analysis using the $Q_1$ measure

We start with a more general optimization problem:

$$\min_G \ Q_1(G) \quad \text{subject to} \quad \operatorname{rank}(G^T H_b) = q, \qquad (11)$$

for some integer $q > 0$. The optimization in Eq. (11) depends on the value of $q$. The optimal transformation $G^T$ we are looking for is a special case of the above formulation, where we set $q = \operatorname{rank}(H_b)$. This choice of $q$ guarantees that the dimension of the linear subspace spanned by the centroids in the original high-dimensional space and the dimension of the corresponding subspace in the transformed lower-dimensional space are kept close to each other, as shown in Proposition 3.1 below.
To solve the minimization problem in Eq. (11), we need the following three lemmas, where the proofs
of the first two are straightforward from the definition of the pseudo-inverse [11].
Lemma 3.2: For any matrix $A \in \mathbb{R}^{N \times n}$, we have $\operatorname{trace}(A A^+) = \operatorname{rank}(A)$.

Lemma 3.3: For any matrix $A \in \mathbb{R}^{N \times n}$, we have $(A A^T)^+ = (A^T)^+ A^+$.
The following lemma is critical for our main result:
Lemma 3.4: Let $\Sigma = \operatorname{diag}(\sigma_1, \ldots, \sigma_g)$ be any diagonal matrix with $\sigma_1 \ge \cdots \ge \sigma_g > 0$. Then for any matrix $W \in \mathbb{R}^{\ell \times g}$ with $\operatorname{rank}(W) = q$, we have

$$\operatorname{trace}\big((W \Sigma W^T)^+ (W W^T)\big) \ \ge \ \sum_{i=1}^{q} \frac{1}{\sigma_i}.$$

Furthermore, the equality holds if and only if

$$W = P \begin{pmatrix} \Delta_1 \hat{Q}^T \Delta_2 & 0 \\ 0 & 0 \end{pmatrix},$$

for some orthogonal matrix $P \in \mathbb{R}^{\ell \times \ell}$ and matrix $\hat{Q} \in \mathbb{R}^{q \times q}$, where $\hat{Q}$ is orthogonal and $\Delta_1, \Delta_2 \in \mathbb{R}^{q \times q}$ are diagonal matrices with positive diagonal entries.
Proof: It is easy to check that

$$(W \Sigma W^T)^+ = (W_1 W_1^T)^+ = (W_1^T)^+ W_1^+, \qquad (12)$$

where $W_1 = W \Sigma^{1/2}$ and the last equality follows from Lemma 3.3. Note also that $W W^T = W_1 \Sigma^{-1} W_1^T$.

It is clear that $\operatorname{rank}(W_1) = \operatorname{rank}(W) = q$, since $\Sigma^{1/2}$ is nonsingular. Let $W_1 = P \begin{pmatrix} \Sigma_1 & 0 \\ 0 & 0 \end{pmatrix} Q^T$ be the SVD of $W_1$. Then by the definition of the pseudo-inverse, $W_1^+ = Q \begin{pmatrix} \Sigma_1^{-1} & 0 \\ 0 & 0 \end{pmatrix} P^T$. Using the fact that $\operatorname{trace}(AB) = \operatorname{trace}(BA)$, for any $A$ and $B$, together with Eq. (12), we have

$$\operatorname{trace}\big((W \Sigma W^T)^+ (W W^T)\big) = \operatorname{trace}\big((W_1 W_1^T)^+ W_1 \Sigma^{-1} W_1^T\big) = \operatorname{trace}\big(W_1^T (W_1 W_1^T)^+ W_1 \, \Sigma^{-1}\big) = \operatorname{trace}\big(Q_1 Q_1^T \Sigma^{-1}\big) \ \ge \ \sum_{i=1}^{q} \frac{1}{\sigma_i},$$

where $Q_1$ contains the first $q$ columns of the orthogonal matrix $Q$ (so that $W_1^T (W_1 W_1^T)^+ W_1 = Q_1 Q_1^T$), and the last inequality follows since the diagonal elements of the diagonal matrix $\Sigma^{-1}$ are nondecreasing. Moreover, the minimum of $\operatorname{trace}(Q_1 Q_1^T \Sigma^{-1})$ is attained if and only if

$$Q_1 = \begin{pmatrix} \hat{Q} \\ 0 \end{pmatrix}$$

for some orthogonal matrix $\hat{Q} \in \mathbb{R}^{q \times q}$. Hence

$$W = W_1 \Sigma^{-1/2} = P \begin{pmatrix} \Delta_1 \hat{Q}^T \Delta_2 & 0 \\ 0 & 0 \end{pmatrix},$$

where $\Delta_1 = \Sigma_1$, $\Delta_2 = \Sigma_q^{-1/2}$, and $\Sigma_q$ is the $q$-th principal sub-matrix of $\Sigma$.
We are now ready to present our main result for this section. To solve the minimization problem in Eq. (11), we first give a lower bound on the objective function $Q_1$ in Theorem 3.1. A sufficient condition on $G^T$, under which the lower bound computed in Theorem 3.1 is attained, is presented in Corollary 3.1. A simple solution is then presented in Corollary 3.2.

The optimal solution for the generalized discriminant analysis presented in this paper is a special case of the solution in Corollary 3.2, where we set $q = \operatorname{rank}(H_b)$.
Theorem 3.1: Assume the transformation matrix $G^T \in \mathbb{R}^{\ell \times N}$ satisfies $\operatorname{rank}(G^T H_b) = q$, for some integer $q > 0$. Then the following inequality holds:

$$Q_1 = \operatorname{trace}\big((S_b^L)^+ S_w^L\big) \ \ge \ \begin{cases} \displaystyle\sum_{i=r+1}^{q} \beta_i^2 / \alpha_i^2, & \text{if } q \ge r+1, \\[4pt] 0, & \text{if } q < r+1, \end{cases} \qquad (13)$$

where $S_b^L = G^T S_b G$ and $S_w^L = G^T S_w G$ are defined in Eq. (3), and the $\alpha_i$, $\beta_i$ are specified by the GSVD of $(H_b^T, H_w^T)$ in Eq. (6).
Proof: First consider the easy case when $q < r+1$. Since both $(S_b^L)^+$ and $S_w^L$ are positive semi-definite, $\operatorname{trace}\big((S_b^L)^+ S_w^L\big) \ge 0$. Next consider the case when $q \ge r+1$.

Recall from Eq. (8) that, by the GSVD,

$$S_b^L = \tilde{G} \begin{pmatrix} \Sigma_b^T \Sigma_b & 0 \\ 0 & 0 \end{pmatrix} \tilde{G}^T, \qquad S_w^L = \tilde{G} \begin{pmatrix} \Sigma_w^T \Sigma_w & 0 \\ 0 & 0 \end{pmatrix} \tilde{G}^T,$$

where $\tilde{G}$ is defined in Eq. (9). Partition $\tilde{G}$ into

$$\tilde{G} = [\tilde{G}_1, \ \tilde{G}_2, \ \tilde{G}_3, \ \tilde{G}_4],$$

such that $\tilde{G}_1 \in \mathbb{R}^{\ell \times r}$, $\tilde{G}_2 \in \mathbb{R}^{\ell \times s}$, $\tilde{G}_3 \in \mathbb{R}^{\ell \times (t-r-s)}$, and $\tilde{G}_4 \in \mathbb{R}^{\ell \times (N-t)}$.

Since $\alpha_i^2 + \beta_i^2 = 1$, for $i = 1, \ldots, t$, we have $\Sigma_b^T \Sigma_b + \Sigma_w^T \Sigma_w = I_t$, where $I_t \in \mathbb{R}^{t \times t}$ is an identity matrix. Define $G_{12} = [\tilde{G}_1, \tilde{G}_2]$. We have

$$S_w^L + S_b^L = G_{12} G_{12}^T + \tilde{G}_3 \tilde{G}_3^T.$$

By the property of the pseudo-inverse in Lemma 3.2,

$$\operatorname{trace}\big((S_b^L)^+ S_b^L\big) = \operatorname{rank}(S_b^L) = \operatorname{rank}(G^T S_b G) = \operatorname{rank}(G^T H_b) = q. \qquad (14)$$

It follows that

$$\operatorname{trace}\big((S_b^L)^+ (S_w^L + S_b^L)\big) = \operatorname{trace}\big((G_{12} \Sigma_\alpha G_{12}^T)^+ (G_{12} G_{12}^T + \tilde{G}_3 \tilde{G}_3^T)\big) \ \ge \ \operatorname{trace}\big((G_{12} \Sigma_\alpha G_{12}^T)^+ G_{12} G_{12}^T\big) \ \ge \ r + \sum_{i=r+1}^{q} \frac{1}{\alpha_i^2}, \qquad (15)$$

where $\Sigma_\alpha = \operatorname{diag}(\alpha_1^2, \ldots, \alpha_{r+s}^2)$, so that $S_b^L = G_{12} \Sigma_\alpha G_{12}^T$. The first inequality follows since $\tilde{G}_3 \tilde{G}_3^T$ is positive semi-definite, and the equality holds if $\tilde{G}_3 = 0$. The second inequality follows from Eq. (14) and Lemma 3.4, and the equality holds if and only if

$$G_{12} = P \begin{pmatrix} \Delta_1 \hat{Q}^T \Delta_2 & 0 \\ 0 & 0 \end{pmatrix} \in \mathbb{R}^{\ell \times (r+s)}, \qquad (16)$$

for some orthogonal matrix $P \in \mathbb{R}^{\ell \times \ell}$ and some orthogonal matrix $\hat{Q} \in \mathbb{R}^{q \times q}$, where $\Delta_1, \Delta_2 \in \mathbb{R}^{q \times q}$ are diagonal matrices with positive diagonal entries. In particular, it holds if $\hat{Q}$, $\Delta_1$, and $\Delta_2$ are all identity matrices.

By Eq. (14) and Eq. (15), we have

$$Q_1 = \operatorname{trace}\big((S_b^L)^+ (S_w^L + S_b^L)\big) - q \ \ge \ r + \sum_{i=r+1}^{q} \frac{1}{\alpha_i^2} - q = \sum_{i=r+1}^{q} \frac{\beta_i^2}{\alpha_i^2}.$$
Theorem 3.1 gives a lower bound on $Q_1$, when $q$ is fixed. From the arguments above, all the inequalities become equalities if $G_{12}$ has the form in Eq. (16) and $\tilde{G}_3 = 0$. A simple choice is to set the matrices $\hat{Q}$, $\Delta_1$, and $\Delta_2$ in Eq. (16) to be identity matrices, which is used in our implementation later. The result is summarized in the following corollary.

Corollary 3.1: Let $G^T$ and $\tilde{G}$ be defined as in Theorem 3.1 and let $q \ge r+1$. Then the equality in Eq. (13) holds, if the partition $\tilde{G} = [\tilde{G}_1, \tilde{G}_2, \tilde{G}_3, \tilde{G}_4]$ satisfies

$$G_{12} = P \begin{pmatrix} \Delta_1 \hat{Q}^T \Delta_2 & 0 \\ 0 & 0 \end{pmatrix} \quad \text{and} \quad \tilde{G}_3 = 0,$$

where $G_{12} = [\tilde{G}_1, \tilde{G}_2]$.
Theorem 3.1 does not mention the connection between $\ell$, the row dimension of the transformation matrix $G^T$, and $q$. We can choose a transformation matrix $G^T$ that has large row dimension $\ell$ and still satisfies the condition stated in Corollary 3.1. However, we are more interested in a lower-dimensional representation of the original data, while keeping the same information from the original data.

The following corollary says that the smallest possible value for $\ell$ is $q$, and, more importantly, that we can find a transformation $G^T$ which has row dimension $q$ and satisfies the condition stated in Corollary 3.1.

Corollary 3.2: For every $q \le r + s$, there exists a transformation $G^T \in \mathbb{R}^{q \times N}$, such that the equality in Eq. (13) holds, i.e., the minimum value for $Q_1$ is attained. Furthermore, for any transformation $(G')^T \in \mathbb{R}^{\ell \times N}$, such that the assumption in Theorem 3.1 holds, we have $\ell \ge q$.
Proof: Construct $G^T \in \mathbb{R}^{q \times N}$, such that

$$\tilde{G} = (X^{-1} G)^T = [\tilde{G}_1, \ \tilde{G}_2, \ \tilde{G}_3, \ \tilde{G}_4]$$

satisfies $G_{12} = (I_q, \ 0)$, $\tilde{G}_3 = 0$, and $\tilde{G}_4 = 0$, where $I_q \in \mathbb{R}^{q \times q}$ is an identity matrix. Hence $G^T = X_q^T$, where $X_q \in \mathbb{R}^{N \times q}$ contains the first $q$ columns of the matrix $X$. By Corollary 3.1, the equality in Eq. (13) holds under the above transformation $G^T$. This completes the proof of the first part.

For any transformation $(G')^T \in \mathbb{R}^{\ell \times N}$, such that $\operatorname{rank}((G')^T H_b) = q$, it is clear that $q \le \operatorname{rank}((G')^T) \le \ell$. Hence $\ell \ge q$.
Theorem 3.1 shows that the minimum value of the objective function, � � , is dependent on the rank
of the matrix & ' * . As shown in Lemma 3.1, the rank of the matrix*
, which is � % � by GSVD,
denotes the degree of linear independence of the�
centroids in the original data, while the rank,�, of
the matrix & ' * implies the degree of linear independence of the�
centroids in the reduced space by
Lemma 3.1. By Corollary 3.2, if we fix the degree of linear independence of the centroids in the reduced
space, i.e. if rank� & ' * � ���
is fixed, then we can always find a transformation matrix & ' IR � # " with
row dimension�
such that the minimum value for � � is obtained.
Next we consider the case when q varies. By Theorem 3.1, the minimum value of the objective function F1 is smaller for smaller q. However, as the value of q decreases (note that q is never larger than rank(H_b) ≤ k − 1), the degree of linear independence of the k centroids in the reduced space falls further below that in the original high-dimensional space, which may cause a loss of information about the original data. Hence, in our implementation we choose q = rank(H_b), its maximum possible value. One nice property of this choice is stated in the following proposition:
Proposition 3.1: If q = rank(G^T H_b) equals rank(H_b), then δ − 1 ≤ δ^L ≤ δ, where δ and δ^L denote the degrees of linear independence of the k centroids in the original and in the reduced space, respectively.

Proof: The proof follows directly from Lemma 3.1.
Proposition 3.1 implies that choosing q = rank(H_b) keeps the degree of linear independence of the centroids in the reduced space equal to, or one less than, that in the original space. With this choice, the reduced dimension under the transformation G^T is rank(H_b). We use ℓ* = rank(H_b) to denote the optimal reduced dimension for our generalized discriminant analysis (also called the exact algorithm) throughout the rest of the paper. In this case, we can choose the transformation matrix G^T
Algorithm 1: Exact algorithm
1. Form the matrices H_b and H_w as in Eq. (2).
2. Compute the SVD of K = [H_b^T; H_w^T] ∈ IR^{(k+n)×m} as K = P [ R  0 ; 0  0 ] Q^T, where P and Q are orthogonal and R is diagonal and nonsingular.
3. t ← rank(K); ℓ* ← rank(H_b).
4. Compute the SVD of P(1:k, 1:t), the sub-matrix consisting of the first k rows and the first t columns of the matrix P from Step 2, as P(1:k, 1:t) = U Σ_1 W^T, where U and W are orthogonal and Σ_1 is diagonal.
5. X ← Q [ R^{-1}W  0 ; 0  I ].
6. G ← X_{ℓ*}, where X_{ℓ*} contains the first ℓ* columns of the matrix X.
as in Corollary 3.2: G^T = X_{ℓ*}^T, where X_{ℓ*} contains the first ℓ* columns of the matrix X. The pseudo-code for our main algorithm is shown in Algorithm 1, where Lines 2–5 compute the GSVD of the matrix pair (H_b^T, H_w^T) to obtain the matrix X as in Eq. (6).
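As an illustration, the steps of Algorithm 1 can be sketched in NumPy as follows. This is a minimal sketch of our own, not the authors' code: the helper forming H_b and H_w, the rank tolerance, and all variable names are assumptions.

```python
import numpy as np

def scatter_factors(A, labels):
    # Factors H_b, H_w with S_b = H_b H_b^T and S_w = H_w H_w^T (cf. Eq. (2))
    c = A.mean(axis=1, keepdims=True)                  # global centroid
    Hb_cols, Hw_cols = [], []
    for lab in np.unique(labels):
        Ai = A[:, labels == lab]
        ci = Ai.mean(axis=1, keepdims=True)
        Hb_cols.append(np.sqrt(Ai.shape[1]) * (ci - c))
        Hw_cols.append(Ai - ci)
    return np.hstack(Hb_cols), np.hstack(Hw_cols)

def exact_algorithm(A, labels):
    # Algorithm 1: GSVD-based generalized discriminant analysis
    Hb, Hw = scatter_factors(A, labels)
    m, k = A.shape[0], Hb.shape[1]
    K = np.vstack([Hb.T, Hw.T])                        # (k + n) x m
    P, s, QT = np.linalg.svd(K, full_matrices=True)    # K = P [R 0; 0 0] Q^T
    t = int(np.sum(s > 1e-10 * s[0]))                  # t = rank(K)
    ell = np.linalg.matrix_rank(Hb)                    # optimal reduced dimension
    _, _, WT = np.linalg.svd(P[:k, :t])                # SVD of P(1:k, 1:t)
    X = np.empty((m, m))
    X[:, :t] = QT.T[:, :t] @ np.diag(1.0 / s[:t]) @ WT.T   # Q[:, :t] R^{-1} W
    X[:, t:] = QT.T[:, t:]
    return X[:, :ell]                                  # G: first ell columns of X
```

On a toy undersampled data set, the returned G satisfies G^T S_t G = I on the reduced space, consistent with Eq. (10).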
In principle, it is not necessary to choose the reduced dimension to be exactly ℓ*. Experiments show good performance for values of the reduced dimension close to ℓ*. However, the results also show that the performance can be very poor for small values of the reduced dimension, probably due to information loss during dimension reduction.
D. Relation between the exact algorithm and the pseudo-inverse based LDA
As discussed in the Introduction, the Pseudo Fisher Linear Discriminant (PFLDA) method [8], [10], [23], [26], [27] applies the pseudo-inverse to the scatter matrices; its solution can be obtained by solving the eigenvalue problem on the matrix S_t^+ S_b. The exact algorithm proposed in this paper is based on the criterion F1 defined in Eq. (5), which extends the criterion of classical LDA by replacing the inverse with the pseudo-inverse. Both PFLDA and the exact algorithm apply the pseudo-inverse to deal with the singularity of the scatter matrices on undersampled problems. A natural question is: what is the relation between the two methods? We show in this section that the solution from the exact algorithm also solves the eigenvalue problem on the matrix S_t^+ S_b; that is, the two methods are essentially equivalent. The significance of this equivalence is that the criterion proposed in this paper provides a theoretical justification for the pseudo-inverse based LDA, which was used in an ad hoc manner in the past.
The following lemma is essential for establishing the relation between the exact algorithm and PFLDA.

Lemma 3.5: Let X be the matrix obtained by computing the GSVD of the matrix pair (H_b^T, H_w^T) (see Line 5 of Algorithm 1), and denote D = diag(I_t, 0) ∈ IR^{m×m}, where t = rank(S_t) and I_t ∈ IR^{t×t} is the identity matrix. Then S_t^+ = X D X^T, where S_t is the total scatter matrix.
Proof: By Eq. (10), we have X^T S_t X = D, i.e., S_t = X^{-T} D X^{-1}. From Line 5 of Algorithm 1, X has the form

X = Q [ R^{-1}W  0 ; 0  I ],

where Q and W are orthogonal and R is diagonal and nonsingular. Then

S_t = X^{-T} D X^{-1} = Q [ RW  0 ; 0  I ] D [ W^T R  0 ; 0  I ] Q^T = Q [ R^2  0 ; 0  0 ] Q^T.

It follows from the definition of the pseudo-inverse that

S_t^+ = Q [ R^{-2}  0 ; 0  0 ] Q^T.

It is easy to check that

X D X^T = Q [ R^{-1}W  0 ; 0  I ] D [ W^T R^{-1}  0 ; 0  I ] Q^T = Q [ R^{-2}  0 ; 0  0 ] Q^T.

Hence S_t^+ = X D X^T.
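Lemma 3.5 can be checked numerically by building a random X of exactly the form produced by Line 5 of Algorithm 1 and comparing X D X^T against a pseudo-inverse computed directly. A small sketch (all sizes and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
m, t = 8, 5

# Random orthogonal Q, W and a nonsingular diagonal R, as in Line 5 of Algorithm 1
Q, _ = np.linalg.qr(rng.normal(size=(m, m)))
W, _ = np.linalg.qr(rng.normal(size=(t, t)))
R = np.diag(rng.uniform(0.5, 2.0, size=t))

# X = Q [R^{-1} W  0; 0  I]
X = np.empty((m, m))
X[:, :t] = Q[:, :t] @ np.linalg.inv(R) @ W
X[:, t:] = Q[:, t:]

# S_t reconstructed from X^T S_t X = D = diag(I_t, 0)  (Eq. (10))
D = np.diag(np.r_[np.ones(t), np.zeros(m - t)])
Xinv = np.linalg.inv(X)
St = Xinv.T @ D @ Xinv

# Lemma 3.5: the pseudo-inverse of S_t equals X D X^T
assert np.allclose(np.linalg.pinv(St), X @ D @ X.T, atol=1e-8)
```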
Theorem 3.2: Let X be the matrix as in Lemma 3.5. Then the first ℓ* columns of X solve the eigenvalue problem S_t^+ S_b x = λ x, for λ ≠ 0.

Proof: From Eq. (7), we have X^T S_b X = D_b, where D_b is diagonal as defined in Eq. (7). It follows from Lemma 3.5 that

S_t^+ S_b X = (X D X^T) S_b X = X D (X^T S_b X) = X D D_b,

where D D_b is diagonal. This implies that the columns of X are eigenvectors of the matrix S_t^+ S_b. The result of the theorem follows, since only the first ℓ* diagonal entries of the diagonal matrix D D_b are nonzero.
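The same construction verifies Theorem 3.2: if X^T S_b X = D_b is diagonal with only its first ℓ* entries nonzero, the columns of X are eigenvectors of S_t^+ S_b. A sketch under the same illustrative assumptions as above:

```python
import numpy as np

rng = np.random.default_rng(1)
m, t, ell = 8, 5, 2     # ell plays the role of rank(H_b)

Q, _ = np.linalg.qr(rng.normal(size=(m, m)))
W, _ = np.linalg.qr(rng.normal(size=(t, t)))
R = np.diag(rng.uniform(0.5, 2.0, size=t))

X = np.empty((m, m))
X[:, :t] = Q[:, :t] @ np.linalg.inv(R) @ W
X[:, t:] = Q[:, t:]
Xinv = np.linalg.inv(X)

# Impose X^T S_t X = diag(I_t, 0) and X^T S_b X = D_b (first ell entries nonzero)
D = np.diag(np.r_[np.ones(t), np.zeros(m - t)])
lam = np.r_[rng.uniform(0.1, 0.9, size=ell), np.zeros(m - ell)]
St = Xinv.T @ D @ Xinv
Sb = Xinv.T @ np.diag(lam) @ Xinv

# Theorem 3.2: S_t^+ S_b X = X diag(lam), so columns of X are eigenvectors
M = np.linalg.pinv(St) @ Sb
assert np.allclose(M @ X, X @ np.diag(lam), atol=1e-8)
```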
IV. APPROXIMATION ALGORITHM
One limitation of the above exact algorithm is the expensive computation of the GSVD, which requires the SVD of the matrix K ∈ IR^{(k+n)×m} formed by stacking H_b^T and H_w^T. In this section, an efficient approximation algorithm is presented to overcome this limitation.
The K-Means algorithm [6], [16] is widely used to capture the clustered structure of data by decomposing the whole data set into a disjoint union of small sets, called clusters. The K-Means algorithm minimizes the within-cluster distance; hence the data points in each cluster are close to the cluster centroid, and the centroids themselves are a good approximation of the original data set.
In Section II, we used the matrix H_w ∈ IR^{m×n} to capture the closeness of the vectors within each cluster. Although the number of data points n is smaller than the dimension m, the absolute value of n can still be large. To simplify the model, we use centroids only to approximate the structure of each cluster, by applying the K-Means algorithm within each cluster.
More specifically, let N_1, ..., N_k be the k clusters in the data set, with cluster sizes |N_i| = n_i and Σ_{i=1}^k n_i = n, and let c_i be the centroid of N_i. The K-Means algorithm is applied to each cluster N_i to produce k_i sub-clusters N_i^(1), ..., N_i^(k_i), with ∪_{j=1}^{k_i} N_i^(j) = N_i and sub-cluster sizes |N_i^(j)| = n_i^(j). Let c_i^(j) be the centroid of the sub-cluster N_i^(j). The within-cluster distance of the i-th cluster, Σ_{x ∈ N_i} ||x − c_i||², can then be approximated as

Σ_{x ∈ N_i} ||x − c_i||² = Σ_{j=1}^{k_i} Σ_{x ∈ N_i^(j)} ||x − c_i||² ≈ Σ_{j=1}^{k_i} n_i^(j) ||c_i^(j) − c_i||²,

by approximating every point in the sub-cluster N_i^(j) by its centroid c_i^(j).
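This approximation has a useful interpretation: by the standard scatter decomposition, the within-cluster distance equals the approximated quantity plus the within-sub-cluster scatter, which is exactly the objective K-Means minimizes. The identity can be checked numerically; the kmeans helper below is a plain Lloyd's iteration of our own, not the authors' implementation.

```python
import numpy as np

def kmeans(X, kc, iters=50, seed=0):
    # Plain Lloyd's algorithm: returns sub-cluster labels and centroids
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), kc, replace=False)]
    for _ in range(iters):
        d = ((X[:, None, :] - C[None]) ** 2).sum(-1)
        lab = d.argmin(1)
        C = np.array([X[lab == j].mean(0) if np.any(lab == j) else C[j]
                      for j in range(kc)])
    return lab, C

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))               # one cluster: 100 points, 5 dims
c = X.mean(axis=0)                          # cluster centroid
lab, C = kmeans(X, kc=8)

total  = ((X - c) ** 2).sum()               # within-cluster distance
approx = sum((lab == j).sum() * ((C[j] - c) ** 2).sum() for j in range(8))
within = sum(((X[lab == j] - C[j]) ** 2).sum() for j in range(8))

# Exact decomposition: total = approx + within-sub-cluster scatter
assert np.isclose(total, approx + within) and approx <= total
```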
Algorithm 2: Approximation algorithm
1. Form the matrix H_b as defined in Eq. (2).
2. Run the K-Means algorithm on each cluster N_i with k_i = k_c sub-clusters.
3. Form H̃_w as defined in Eq. (17) to approximate H_w.
4. Compute the GSVD of the matrix pair (H_b^T, H̃_w^T) to obtain the matrix X as in Eq. (6).
5. ℓ* ← rank(H_b).
6. G ← X_{ℓ*}, where X_{ℓ*} contains the first ℓ* columns of the matrix X.
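Step 3 of Algorithm 2 replaces each column block of H_w by one weighted column per sub-cluster. A minimal NumPy sketch, assuming (as in our reading of Eq. (17)) that each column carries the weight sqrt(n_i^(j)) so that the scatter of the approximation matches the approximated within-cluster distance; the function name and argument conventions are ours:

```python
import numpy as np

def approx_Hw(A, labels, sublabels):
    # One column sqrt(n_ij) * (c_ij - c_i) per sub-cluster (cf. Eq. (17))
    cols = []
    for lab in np.unique(labels):
        Ai = A[:, labels == lab]
        ci = Ai.mean(axis=1)
        sub = sublabels[labels == lab]
        for j in np.unique(sub):
            Aij = Ai[:, sub == j]
            cij = Aij.mean(axis=1)
            cols.append(np.sqrt(Aij.shape[1]) * (cij - ci))
    return np.column_stack(cols)
```

In the degenerate case where every point is its own sub-cluster, H̃_w H̃_w^T reproduces S_w exactly; coarser sub-clusterings trade accuracy for a much smaller matrix.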
Hence the matrix H_w can be approximated as

H̃_w = [ √n_1^(1) (c_1^(1) − c_1), ..., √n_1^(k_1) (c_1^(k_1) − c_1), ..., √n_k^(1) (c_k^(1) − c_k), ..., √n_k^(k_k) (c_k^(k_k) − c_k) ] ∈ IR^{m×k̃},   (17)

where k̃ = Σ_{i=1}^k k_i is the total number of sub-clusters, which is typically much smaller than n, the total number of data points, thus reducing the complexity of the GSVD computation dramatically. The main steps of our approximation algorithm are summarized in Algorithm 2. For simplicity, in our implementation we chose all the k_i to have the same value k_c, where k_c stands for the number of sub-clusters per cluster. We discuss the choice of k_c below.

A. The choice of the number of sub-clusters (k_c)
The value of k_c, the number of sub-clusters within each cluster N_i, determines the complexity of the approximation algorithm. If k_c is too large, the approximation algorithm produces results close to those obtained using all the points, but the computation of the GSVD remains expensive. If k_c is too small, the sub-clusters do not approximate the original data set well. For our problem, the K-Means algorithm is applied to data points belonging to the same cluster, which are already close to each other. Indeed, in our experiments we found that small values of k_c worked well; in particular, choosing k_c around 6 to 10 gave good results.
TABLE II
COMPARISON OF DECOMPOSITION AND QUERY TIME: n IS THE SIZE OF THE TRAINING SET, m IS THE DIMENSION OF THE TRAINING DATA, k IS THE NUMBER OF CLUSTERS IN THE TRAINING SET, K IS THE NUMBER OF NEAREST NEIGHBORS USED IN K-NN, AND d IS THE REDUCED DIMENSION BY LSI.

Method        | decomposition time | query time
Full          | 0                  | O(nm)
LSI           | O(n^2 m)           | O(nd)
RLDA          | O(n^2 m)           | O(nk)
LSI+LDA       | O(n^2 m)           | O(nk)
Exact         | O(n^2 m)           | O(nk)
Approximation | O(m)               | O(nk)
B. The time complexity
Under the approximation algorithm, the i-th cluster is partitioned into k_i sub-clusters. Hence the number of rows in the stacked matrix formed from H_b^T and the approximate H̃_w^T is k + k̃, where k̃ = Σ_{i=1}^k k_i. The complexity of the GSVD computation using the approximate H̃_w matrix is thus reduced to O(m(k + k̃)²). Note that the K-Means algorithm used for the clustering is very fast [16]. Within every iteration, the computational complexity is linear in the number of vectors and the number of clusters, and the algorithm converges within a few iterations. Hence the complexity of the K-Means step for cluster i is O(n_i k_i), where n_i is the size of the i-th cluster. Therefore, the total time for all k clusterings is O(Σ_{i=1}^k n_i k_i) = O(n k_c).

Hence the total complexity of our approximation algorithm is O(m(k + k̃)² + n k_c). Since k_c is usually chosen to be a small number and k is much smaller than n, while n is smaller than m on undersampled problems, the complexity simplifies to O(m).
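For concreteness, with the Dataset 1 sizes from Table III and an illustrative k_c = 8 from the 6-10 range above, the row dimension of the stacked matrix handed to the SVD shrinks as follows (a back-of-the-envelope sketch, not a measurement):

```python
# Rows of the stacked matrix [H_b^T; H_w^T] before and after the approximation
n, k, kc = 210, 7, 8            # Dataset 1 (Table III); kc is illustrative
exact_rows  = k + n             # exact algorithm
approx_rows = k + k * kc        # approximation: k_tilde = k * kc sub-clusters
print(exact_rows, approx_rows)  # 217 63
```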
Table II lists the complexity of different methods discussed above. The K-NN algorithm is used for
querying the data.
TABLE III
SUMMARY OF DATASETS USED FOR EVALUATION

Data Set              | 1    | 2             | 3
Source                | TREC | Reuters-21578 | Reuters-21578
number of documents n | 210  | 320           | 490
number of terms m     | 7454 | 2887          | 3759
number of clusters k  | 7    | 4             | 5
V. EXPERIMENTAL RESULTS
A. Datasets
In the following experiments, we use three different datasets1, summarized in Table III. For all datasets, we use a stop-list to remove common words, and the words are stemmed using Porter's suffix-stripping algorithm [22]. Moreover, any term that occurs in fewer than two documents is eliminated, as in [32]. Dataset 1 is derived from the TREC-5, TREC-6, and TREC-7 collections [30]. It consists of 210 documents in a space of dimension 7454, with 7 clusters; each cluster has 30 documents. Datasets 2 and 3 are from the Reuters-21578 text categorization test collection, Distribution 1.0 [19]. Dataset 2 contains 4 clusters, each with 80 documents; the dimension of the document set is 2887. Dataset 3 has 5 clusters, each with 98 documents; the dimension of the document set is 3759.
For all the examples, we use the tf-idf weighting scheme [24], [32] for encoding the document collection
with a term-document matrix.
B. Experimental methodology
To evaluate the methods proposed in this paper, we compared them with three other dimension reduction methods, namely LSI, RLDA, and two-stage LSI+LDA, on the three datasets in Table III. The K-Nearest Neighbor algorithm [7] (for K = 1, 7, and 15) was applied to evaluate the quality of the different dimension reduction algorithms, as in [13]. For each method, we applied 10-fold cross-validation to compute the
1These datasets are available at www.cs.umn.edu/~jieping/Research.html.
misclassification rate and the standard deviation. For LSI and two-stage LSI+LDA, the results depend on the reduced dimension d chosen in the LSI step; our experiments showed that moderate values of d produced good overall results for both. Similarly, RLDA depends on the choice of its regularization parameter λ; we ran RLDA with different choices of λ and used a value that produced good overall results on the three datasets. We also applied K-NN in the original high-dimensional space without the dimension reduction step, which we call "FULL" in the following experiments.
The clustering by K-Means in the approximation algorithm is sensitive to the choice of the initial centroids. To mitigate this, we ran the algorithm ten times, with the initial centroids for each run generated randomly; the final result is the average over the ten runs.
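The evaluation protocol above (K-NN plus 10-fold cross-validation) can be sketched as follows. This is our own minimal re-implementation with hypothetical function names, not the code used in the experiments:

```python
import numpy as np

def knn_predict(Ztr, ytr, Zte, K):
    # Majority vote among the K nearest training points (Euclidean distance)
    d = ((Zte[:, None, :] - Ztr[None]) ** 2).sum(-1)
    idx = np.argsort(d, axis=1)[:, :K]
    return np.array([np.bincount(v).argmax() for v in ytr[idx]])

def cv_misclassification(Z, y, K, folds=10, seed=0):
    # Cross-validated misclassification rate and its standard deviation
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(y))
    errs = []
    for f in range(folds):
        te = perm[f::folds]                  # held-out fold
        tr = np.setdiff1d(perm, te)          # remaining training points
        pred = knn_predict(Z[tr], y[tr], Z[te], K)
        errs.append(np.mean(pred != y[te]))
    return float(np.mean(errs)), float(np.std(errs))
```

Here Z would be the reduced representation, e.g. Z = (G^T A)^T for a transformation G produced by one of the methods above.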
C. Results
Fig. 1. Effect of the value of k_c on the approximation algorithm using Datasets 1-3. The x-axis denotes the value of k_c.
1) Effect of the value of k_c on the approximation algorithm: In Section IV, we used k_c to denote the number of sub-clusters for the approximation algorithm, and noted that the algorithm works well for small values of k_c. We tested the approximation algorithm on Datasets 1-3 for values of k_c ranging from 2 to 30 and computed the misclassification rates; K-NN was used for classification. As seen from Figure 1, the misclassification rates did not fluctuate much within this range. For efficiency, we would like k_c to be small; therefore, in our experiments we chose a small value of k_c, as the approximation algorithm performed very well on all three datasets in the vicinity of such values. As Figure 1 shows, the performance of the approximation algorithm is generally insensitive to the choice of k_c.

2) Comparison of misclassification rates: In the following experiment, we evaluated our exact and
approximation algorithms and compared them with competing algorithms (LSI, RLDA, LSI+LDA) based
on the misclassification rates, using the three datasets in Table III.
Fig. 2. Performance of different dimension reduction methods on Dataset 1
Fig. 3. Performance of different dimension reduction methods on Dataset 2
The results for Datasets 1-3 are summarized in Figures 2-4, respectively, where the x-axis shows the three different choices of K (1, 7, and 15) used in K-NN for classification, and the y-axis shows the misclassification rate. "EXACT" and "APPR" refer to the exact and approximation algorithms, respectively. We also report the results in the original space without dimension reduction, denoted "FULL" in Figures 2-4. The corresponding standard deviations of the different methods for Datasets 1-3 are summarized in Table IV. For each dataset, the results for K = 1, 7, and 15 neighbors used in K-NN
Fig. 4. Performance of different dimension reduction methods on Dataset 3
are reported.
TABLE IV
STANDARD DEVIATIONS OF DIFFERENT METHODS ON DATASETS 1-3, EXPRESSED AS A PERCENTAGE

        | Dataset 1        | Dataset 2        | Dataset 3
Method  | K=1  K=7  K=15   | K=1  K=7  K=15   | K=1  K=7  K=15
FULL    | 5.41 5.61 5.96   | 7.99 7.86 6.04   | 5.63 5.71 5.40
LSI     | 5.41 6.27 5.70   | 7.97 6.43 7.10   | 3.78 5.20 5.13
RLDA    | 4.69 2.30 2.30   | 6.50 6.20 5.40   | 3.29 4.18 4.35
LSI+LDA | 5.52 3.33 2.30   | 6.56 6.60 5.55   | 4.09 4.28 4.43
EXACT   | 2.01 2.01 2.01   | 7.18 5.80 6.07   | 3.57 3.57 3.57
APPR    | 3.21 3.67 2.67   | 6.40 6.80 5.60   | 4.59 4.12 3.24
For all three datasets, the number of terms m in the term-document matrix is larger than the number of documents n. It follows that all scatter matrices are singular and classical discriminant analysis breaks down; our proposed generalized discriminant analysis, however, circumvents this problem.
The main observations from these experiments are: (1) dimension reduction algorithms such as RLDA, LSI+LDA, and our exact and approximation algorithms do improve classification performance, even for very high-dimensional data sets such as those derived from text documents; (2) algorithms that use the label information, namely the proposed exact and approximation algorithms, RLDA, and LSI+LDA, perform better than the one that does not, i.e., LSI; (3) the approximation algorithm deals with a problem of much smaller size than the exact algorithm, yet the results from all the experiments show that the two have similar misclassification rates; (4) the performance of the exact algorithm is close to that of RLDA and LSI+LDA in all experiments, but both RLDA and LSI+LDA require certain parameters to be chosen, while the exact algorithm can be applied directly without setting any parameters.
3) Effect of the reduced dimension on the exact and approximation algorithms: In Section III, we showed that our exact algorithm based on generalized discriminant analysis has optimal reduced dimension ℓ*, which equals the rank of the matrix H_b and is typically very small. We also mentioned in Section III that the exact algorithm may not work well if the reduced dimension is chosen to be much smaller than the optimal one. In this experiment, we study the effect of the reduced dimension on the performance of the exact and approximation algorithms.
Fig. 5. Effect of the reduced dimension using Dataset 1 (optimal reduced dimension ℓ* = 6)
Figures 5-7 illustrate the effect of different choices of the reduced dimension on our exact and approximation algorithms; K-NN was used for classification. The x-axis is the value of the reduced dimension, and the y-axis is the misclassification rate. For each reduced dimension, the results for the exact and approximation algorithms are ordered from left to right. Our exact and approximation algorithms achieved their best performance in all cases when the reduced dimension is around ℓ*. Note that the optimal reduced dimension ℓ* may differ across datasets. These results further confirm our theoretical analysis of the optimal choice of the reduced dimension for the exact algorithm, as discussed in Section III.
Fig. 6. Effect of the reduced dimension using Dataset 2 (optimal reduced dimension ℓ* = 3)
Fig. 7. Effect of the reduced dimension using Dataset 3 (optimal reduced dimension ℓ* = 4)
VI. CONCLUSIONS
An optimization criterion for generalized linear discriminant analysis is presented. The criterion is applicable to undersampled problems, thus overcoming the limitation of classical linear discriminant analysis. A new formulation of the proposed generalized linear discriminant analysis, based on trace optimization, is discussed, and the generalized singular value decomposition is applied to solve the optimization problem. The solution from the LDA/GSVD algorithm is a special case of the solution to this new optimization problem. The equivalence between the exact algorithm and the Pseudo Fisher LDA (PFLDA) is shown theoretically, thus providing a justification for the pseudo-inverse based LDA.
The exact algorithm involves the matrix H_w ∈ IR^{m×n}, whose column dimension n can be large. To reduce the decomposition time of the exact algorithm, an approximation algorithm is presented, which applies the K-Means algorithm to each cluster and replaces the cluster by the centroids of the resulting sub-clusters. The column dimension of the matrix H_w is thereby reduced dramatically, which in turn reduces the complexity of the GSVD computation. Experimental results on several real data sets show that the approximation algorithm produces results close to those produced by the exact algorithm.
Acknowledgement
We thank the Associate Editor and the four reviewers for helpful comments that greatly improved the
paper.
Research of J. Ye and R. Janardan is sponsored, in part, by the Army High Performance Computing
Research Center under the auspices of the Department of the Army, Army Research Laboratory cooperative
agreement number DAAD19-01-2-0014, the content of which does not necessarily reflect the position or
the policy of the government, and no official endorsement should be inferred. Research of C. Park and
H. Park has been supported in part by the National Science Foundation Grants CCR-0204109 and ACI-
0305543. Any opinions, findings and conclusions or recommendations expressed in this material are those
of the authors and do not necessarily reflect the views of the National Science Foundation.
REFERENCES
[1] P. Baldi and G.W. Hatfield. DNA Microarrays and Gene Expression: From experiments to data analysis and modeling, Cambridge,
2002.
[2] M.W. Berry, S.T. Dumais, and G.W. O’Brien. Using linear algebra for intelligent information retrieval. SIAM Review, 37:573-595,
1995.
[3] P.N. Belhumeur, J.P. Hespanha, and D.J. Kriegman. Eigenfaces vs. fisherfaces: Recognition using class specific linear projection. IEEE
Trans. Pattern Analysis and Machine Intelligence, vol. 19, no. 7, pp. 711–720, 1997.
[4] D.Q. Dai and P.C. Yuen. Regularized discriminant analysis and its application to face recognition. Pattern Recognition, vol. 36, pp.
845–847, 2003.
[5] S. Deerwester, S.T. Dumais, G.W. Furnas, T.K. Landauer, and R. Harshman. Indexing by latent semantic analysis. J. Society for
Information Science, 41, pp. 391–407, 1990.
[6] I.S. Dhillon and D.S. Modha. Concept Decompositions for Large Sparse Text Data using Clustering. Machine Learning. 42, pp.
143–175, 2001.
[7] R.O. Duda, P.E. Hart, and D.G. Stork. Pattern Classification. Wiley, 2000.
[8] R.P.W. Duin. Small sample size generalization. Proc. 9th Scandinavian Conf. on Image Analysis, Vol. 2, 957–964, 1995.
[9] J. H. Friedman. Regularized discriminant analysis. J. American Statistical Association, vol. 84, no. 405, pp. 165–175, 1989.
[10] K. Fukunaga. Introduction to Statistical Pattern Recognition, second edition. Academic Press, Inc., 1990.
[11] G.H. Golub and C.F. Van Loan. Matrix Computations, Johns Hopkins University Press, 3rd edition, 1996.
[12] J.C. Gower. Some distance properties of latent root and vector methods used in multivariate analysis. Biometrika, 53, pp. 325–338, 1966.
[13] P. Howland, M. Jeon, and H. Park. Cluster structure preserving dimension reduction based on the generalized singular value
decomposition. SIAM J. Matrix Analysis and Applications, 25(1), pp. 165–179, 2003.
[14] P. Howland and H. Park. Generalizing discriminant analysis using the generalized singular value decomposition, IEEE TPAMI, 2003,
to appear.
[15] P. Howland and H. Park. Equivalence of Several Two-stage Methods for Linear Discriminant Analysis, Proc. Fourth SIAM International
Conference on Data Mining, 2004, to appear.
[16] A.K. Jain and R.C. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988.
[17] G. Kowalski. Information Retrieval System: Theory and Implementation, Kluwer Academic Publishers, 1997.
[18] W.J. Krzanowski, P. Jonathan, W.V McCarthy, and M.R. Thomas. Discriminant analysis with singular covariance matrices: methods
and applications to spectroscopic data. Applied Statistics, 44, pp. 101–115, 1995.
[19] D.D. Lewis. Reuters-21578 text categorization test collection, Distribution 1.0. http://www.research.att.com/~lewis, 1999.
[20] A. Mehay, C. Cai, and P.B. Harrington. Regularized Linear Discriminant Analysis of Wavelet Compressed Ion Mobility Spectra. Applied
Spectroscopy, Vol. 15, No. 2, pp. 219-227, 2002.
[21] C.C. Paige and M.A. Saunders. Towards a generalized singular value decomposition. SIAM J. Numerical Analysis, 18, pp. 398–405,
1981.
[22] M.F. Porter. An algorithm for suffix stripping. Program, 14(3), pp. 130–137, 1980.
[23] S. Raudys and R.P.W. Duin. On expected classification error of the Fisher linear classifier with pseudo-inverse covariance matrix,
Pattern Recognition Letters, vol. 19, no. 5-6, pp. 385-392, 1998.
[24] G. Salton. Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer. Addison-Wesley, 1989.
[25] G. Salton and M.J. McGill. Introduction to Modern Information Retrieval, McGraw-Hill, 1983.
[26] M. Skurichina and R.P.W. Duin. Stabilizing classifiers for very small sample size. Proc. International Conference on Pattern Recognition,
Vienna, Austria, pp. 891–896, 1996.
[27] M. Skurichina and R.P.W. Duin, Regularization of linear classifiers by adding redundant features, Pattern Analysis and Applications,
vol. 2, no. 1, pp. 44-52, 1999.
[28] D. L. Swets, and J. Weng. Using Discriminant Eigenfeatures for Image Retrieval, IEEE Trans. Pattern Analysis and Machine Intelligence,
vol. 18, no. 8, pp. 831–836, 1996.
[29] Q. Tian, M. Barbero, Z.H. Gu, and S.H. Lee, Image classification by the Foley-Sammon transform. Optical Engineering, Vol. 25, No.
7, pp. 834-840, 1986.
[30] TREC. Text Retrieval Conference. http://trec.nist.gov, 1999.
[31] C. F. Van Loan. Generalizing the singular value decomposition. SIAM J. Numerical Analysis, 13(1), pp. 76–83, 1976.
[32] Y. Zhao and G. Karypis. Criterion Functions for Document Clustering: Experiments and Analysis. Technical Report TR01-040, Department of Computer Science and Engineering, University of Minnesota, Twin Cities, U.S.A., 2001.
[33] X. S. Zhou, and T.S. Huang. Small Sample Learning during Multimedia Retrieval using BiasMap. Proc. IEEE Computer Society
Conference on Computer Vision and Pattern Recognition, Kauai, Hawaii, pp. 11–17, 2001.