Chapter 2 Dimensionality Reduction. Linear Methods

Page 1: Chapter 2 Dimensionality Reduction. Linear Methods

Chapter 2: Dimensionality Reduction. Linear Methods

Page 2: Chapter 2 Dimensionality Reduction. Linear Methods

2.1 Introduction

• Dimensionality reduction – the process of finding a suitable lower-dimensional space in which to represent the original data.
• Goal:
  – Explore high-dimensional data
  – Visualize the data using 2-D or 3-D plots
  – Analyze the data using statistical methods, such as clustering and smoothing

Page 3: Chapter 2 Dimensionality Reduction. Linear Methods

Possible methods

• Just select subsets of the variables for processing.
• An alternative would be to create new variables that are functions of the original variables.
• The methods we describe in this book are of the second type.

Page 4: Chapter 2 Dimensionality Reduction. Linear Methods

Example 2.1

• A projection will be in the form of a matrix that takes the data from the original space to a lower-dimensional one.
• Here we project 2-D data onto a line that is θ radians from the horizontal or x axis.
• The mapping is performed by a projection matrix P (see the sketch below).
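The matrix P itself is not in the transcript. For projection onto the line through the origin spanned by the unit vector $\mathbf{u} = (\cos\theta, \sin\theta)^T$, the standard projection matrix (presumably the P of this example) is

$$\mathbf{P} = \mathbf{u}\mathbf{u}^T = \begin{bmatrix} \cos^2\theta & \cos\theta\sin\theta \\ \sin\theta\cos\theta & \sin^2\theta \end{bmatrix},$$

and the projected data are obtained as Px for each 2-D observation x.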

Page 5: Chapter 2 Dimensionality Reduction. Linear Methods
Page 6: Chapter 2 Dimensionality Reduction. Linear Methods

Example 2.1

Page 7: Chapter 2 Dimensionality Reduction. Linear Methods

2.2 Principal Component Analysis (PCA)

• Aim: PCA reduces the dimensionality from p to d, where d < p, while at the same time accounting for as much of the variation in the original data set as possible.
• PCA gives a new set of coordinates or variables that are linear combinations of the original variables.
• The observations in the new principal component space are uncorrelated.

Page 8: Chapter 2 Dimensionality Reduction. Linear Methods

2.2.1 PCA Using the Sample Covariance Matrix

• We start with the centered data matrix Xc, which has dimension n × p.
• Variable definitions:
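The definitions themselves are not in the transcript; a standard set consistent with the text is: n is the number of observations, p is the number of original variables, and the sample covariance matrix computed from the centered data is

$$\mathbf{S} = \frac{1}{n-1}\,\mathbf{X}_c^T\mathbf{X}_c.$$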

Page 9: Chapter 2 Dimensionality Reduction. Linear Methods

2.2.1 PCA Using the Sample Covariance Matrix

• The next step is to calculate the eigenvectors and eigenvalues of the matrix S, subject to the condition that the set of eigenvectors is orthonormal.
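In symbols, this step solves the usual eigenvalue problem (a standard statement, not transcribed from the slide):

$$\mathbf{S}\,\mathbf{a}_j = l_j\,\mathbf{a}_j, \qquad \mathbf{a}_j^T\mathbf{a}_j = 1, \quad j = 1, \ldots, p.$$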

Page 10: Chapter 2 Dimensionality Reduction. Linear Methods

2.2.1 PCA Using the Sample Covariance Matrix

• A major result in matrix algebra shows that any square, symmetric, nonsingular matrix can be transformed to a diagonal matrix (see the relation below).
• The columns of A contain the eigenvectors of S, and L is a diagonal matrix with the eigenvalues along the diagonal.
• By convention, the eigenvalues are ordered in descending order.
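The relation referred to above is not transcribed; given an orthogonal matrix A whose columns are the eigenvectors of S, it is presumably the diagonalization

$$\mathbf{A}^T\mathbf{S}\mathbf{A} = \mathbf{L}, \qquad \mathbf{L} = \mathrm{diag}(l_1, \ldots, l_p), \quad l_1 \ge l_2 \ge \cdots \ge l_p.$$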

Page 11: Chapter 2 Dimensionality Reduction. Linear Methods

2.2.1 PCA Using the Sample Covariance Matrix

• use the eigenvectors of S to obtain new variables called principal components (PCs)

• Equation 2.2 shows that the PCs are linear combinations of the original variables.

• Scaling the eigenvectors: using the scaled eigenvectors wj in the transformation yields PCs that are uncorrelated with unit variance.
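Equation 2.2 and the scaled eigenvectors are not reproduced in the transcript; a standard form consistent with the surrounding text (x̄ the sample mean, a_j and l_j the j-th eigenvector and eigenvalue) would be

$$z_j = \mathbf{a}_j^T(\mathbf{x} - \bar{\mathbf{x}}), \qquad \mathbf{w}_j = \frac{\mathbf{a}_j}{\sqrt{l_j}},$$

so each PC z_j has variance l_j, while using w_j in place of a_j rescales it to unit variance.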

Page 12: Chapter 2 Dimensionality Reduction. Linear Methods

2.2.1 PCA Using the Sample Covariance Matrix

• transform the observations to the PC coordinate system via the following equation

• The matrix Z contains the principal component scores

• To summarize: the transformed variables are the PCs and the individual transformed data values are the PC scores.
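The transformation equation mentioned above is not in the transcript; the usual matrix form, with Xc the centered data and A the matrix of eigenvectors, is

$$\mathbf{Z} = \mathbf{X}_c\mathbf{A},$$

so row i of Z holds the PC scores of observation i.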

Page 13: Chapter 2 Dimensionality Reduction. Linear Methods

2.2.1 PCA Using the Sample Covariance Matrix

• linear algebra theorem: the sum of the variances of the original variables is equal to the sum of the eigenvalues

• The idea of dimensionality reduction with PCA is that one could include in the analysis only those PCs that have the highest eigenvalues

• Reduce the dimensionality to d with the transformation written out after this slide, where Ad contains the first d eigenvectors, i.e., the first d columns of A.
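In the notation above, the reduction is presumably

$$\mathbf{Z}_d = \mathbf{X}_c\mathbf{A}_d.$$

A minimal NumPy sketch of the whole procedure, assuming only the definitions above (the function and variable names are illustrative, not from the book):

```python
import numpy as np

def pca_cov(X, d):
    """PCA via eigendecomposition of the sample covariance matrix.

    X : (n, p) data matrix, one observation per row.
    d : number of principal components to keep (d < p).
    Returns the (n, d) matrix of PC scores and all p eigenvalues.
    """
    Xc = X - X.mean(axis=0)               # centered data matrix
    S = (Xc.T @ Xc) / (X.shape[0] - 1)    # sample covariance matrix
    eigvals, A = np.linalg.eigh(S)        # eigh: S is symmetric
    order = np.argsort(eigvals)[::-1]     # descending eigenvalue order
    eigvals, A = eigvals[order], A[:, order]
    Zd = Xc @ A[:, :d]                    # PC scores, reduced to d dimensions
    return Zd, eigvals

# Example: reduce 5-dimensional data to 2 dimensions
X = np.random.default_rng(0).normal(size=(100, 5))
Zd, eigvals = pca_cov(X, d=2)
```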

Page 14: Chapter 2 Dimensionality Reduction. Linear Methods

2.2.2 PCA Using the Sample Correlation Matrix

• We can scale the data first to have standard units

• The standardized data x* are then treated as observations in the PCA process.

• The covariance matrix of the standardized data is the sample correlation matrix R.
• The correlation matrix should be used for PCA when the variances along the original dimensions are very different.
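The scaling itself is not written out in the transcript; the usual form, with x̄_j and s_j the sample mean and standard deviation of the j-th variable, is

$$x^*_{ij} = \frac{x_{ij} - \bar{x}_j}{s_j}.$$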

Page 15: Chapter 2 Dimensionality Reduction. Linear Methods

2.2.2 PCA Using the Sample Correlation Matrix

• Something should be noted:
  – Methods for statistical inference based on the sample PCs from covariance matrices are easier and are available in the literature.
  – The PCs obtained from the correlation and covariance matrices do not provide equivalent information.

Page 16: Chapter 2 Dimensionality Reduction. Linear Methods

2.2.3 How Many Dimensions Should We Keep?

• Possible ways to address this question:
  – Cumulative Percentage of Variance Explained
  – Scree Plot
  – The Broken Stick
  – Size of Variance
• Example 2.2 – We show how to perform PCA using the yeast cell cycle data set.

Page 17: Chapter 2 Dimensionality Reduction. Linear Methods

2.2.3 How Many Dimensions Should We Keep?

• Cumulative Percentage of Variance Explained
• The idea is to select those d PCs that contribute a specified cumulative percentage of total variation in the data (see the expression below).
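The percentage itself is not in the transcript; the usual criterion is to choose the smallest d for which

$$t_d = 100\,\frac{\sum_{j=1}^{d} l_j}{\sum_{j=1}^{p} l_j}$$

exceeds a chosen threshold, commonly in the range of 90 to 95 percent.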

Page 18: Chapter 2 Dimensionality Reduction. Linear Methods

2.2.3 How Many Dimensions Should We Keep?

• Scree Plot
• A graphical way to decide the number of PCs.
• The original idea: a plot of l_k (the eigenvalue) versus k (the index of the eigenvalue).
  – In some cases, we might plot the log of the eigenvalues when the first eigenvalues are very large.
• Look for the elbow in the curve, or the place where the curve levels off and becomes almost flat.
• The value of k at this elbow is an estimate of how many PCs to retain (a small plotting sketch follows this list).
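A minimal matplotlib sketch of such a plot; the eigenvalues here are illustrative placeholders, not values from the book:

```python
import matplotlib.pyplot as plt

# Eigenvalues in descending order (illustrative values only)
eigvals = [4.2, 1.8, 0.6, 0.3, 0.1]

k = range(1, len(eigvals) + 1)
plt.plot(k, eigvals, 'o-')
plt.xlabel('k (index of the eigenvalue)')
plt.ylabel('l_k (eigenvalue)')
plt.title('Scree plot')
plt.show()
```

The elbow, where the curve flattens out, suggests how many PCs to keep.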

Page 19: Chapter 2 Dimensionality Reduction. Linear Methods

2.2.3 How Many Dimensions Should We Keep?

• The Broken Stick
• Choose the number of PCs based on the size of the eigenvalue, or the proportion of the variance explained by the individual PC.
• If we take a line segment and randomly divide it into p segments, the expected length of the k-th longest segment is g_k (given below). If the proportion of the variance explained by the k-th PC is greater than g_k, that PC is kept.
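The expression for g_k is not transcribed; the standard broken-stick value is

$$g_k = \frac{1}{p}\sum_{i=k}^{p}\frac{1}{i}.$$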

Page 20: Chapter 2 Dimensionality Reduction. Linear Methods

2.2.3 How Many Dimensions Should We Keep?

• Size of Variance
• We would keep PCs whose eigenvalues exceed a cutoff (a commonly used form of the criterion is sketched below).
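The inequality and the definition following "where" are not in the transcript. A commonly used version of this criterion (Jolliffe's rule, which may be what is intended here) keeps the k-th PC if

$$l_k \ge 0.7\,\bar{l}, \qquad \bar{l} = \frac{1}{p}\sum_{j=1}^{p} l_j,$$

where l̄ is the average eigenvalue; the exact cutoff used in the book may differ.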

Page 21: Chapter 2 Dimensionality Reduction. Linear Methods

2.2.3 How Many Dimensions Should We Keep?

• Example 2.2
• The yeast cell cycle data contain 384 genes corresponding to five phases, measured at 17 time points.

Page 22: Chapter 2 Dimensionality Reduction. Linear Methods

2.3 Singular Value Decomposition (SVD)

• SVD provides a way to find the PCs without explicitly calculating the covariance matrix.
• The technique is valid for an arbitrary matrix.
  – We use the noncentered form of the data matrix in the explanation that follows.
• The SVD of X is given by the decomposition written out below.
  – U is an n × n matrix, D is a diagonal matrix with n rows and p columns, and V has dimensions p × p.
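The decomposition itself is not in the transcript; written out, the full SVD consistent with these dimensions is

$$\mathbf{X} = \mathbf{U}\mathbf{D}\mathbf{V}^T,$$

where the diagonal entries of D are the singular values and the columns of U and V are the left and right singular vectors. A quick NumPy check of the shapes (the matrix here is an arbitrary illustration):

```python
import numpy as np

X = np.random.default_rng(1).normal(size=(6, 4))   # n = 6 observations, p = 4 variables
U, s, Vt = np.linalg.svd(X, full_matrices=True)    # s holds the singular values

# U is 6 x 6, Vt is 4 x 4; D is the 6 x 4 matrix with s on its diagonal
D = np.zeros(X.shape)
np.fill_diagonal(D, s)
assert np.allclose(X, U @ D @ Vt)                  # X = U D V^T
```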

Page 23: Chapter 2 Dimensionality Reduction. Linear Methods

2.3 Singular Value Decomposition (SVD)

Page 24: Chapter 2 Dimensionality Reduction. Linear Methods

2.3 Singular Value Decomposition (SVD)

• The first r columns of U form an orthonormal basis for the column space of X, where r is the rank of X.
• The first r columns of V form an orthonormal basis for the row space of X.
• As with PCA, we order the singular values in decreasing order and impose the same order on the columns of U and V.
• An approximation to the original matrix X is obtained by keeping only the largest k singular values and the corresponding columns of U and V (see below).
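The truncated form is not written out in the transcript; the usual rank-k approximation, with U_k and V_k the first k columns of U and V and D_k the leading k × k block of D, is

$$\mathbf{X} \approx \mathbf{X}_k = \mathbf{U}_k\mathbf{D}_k\mathbf{V}_k^T.$$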

Page 25: Chapter 2 Dimensionality Reduction. Linear Methods

2.3 Singular Value Decomposition (SVD)

• Example 2.3
• SVD applied to information retrieval (IR).
• Start with a data matrix, where each row corresponds to a term and each column corresponds to a document in the corpus.
• A query is likewise given by a column vector over the terms.

Page 26: Chapter 2 Dimensionality Reduction. Linear Methods

Example 2.3

Page 27: Chapter 2 Dimensionality Reduction. Linear Methods

Example 2.3

• Method to find the most relevant documents:
  – Compute the cosine of the angle between the query vectors and the columns of the term-document matrix (written out below).
  – Use a cutoff value of 0.5.
  – The second query matches the first book, but misses the fourth one.
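The cosine itself is not written out in the transcript; for a query q and document column x_j it is the usual

$$\cos\theta_j = \frac{\mathbf{q}^T\mathbf{x}_j}{\|\mathbf{q}\|\,\|\mathbf{x}_j\|},$$

and documents whose cosine exceeds the cutoff are returned.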

Page 28: Chapter 2 Dimensionality Reduction. Linear Methods

Example 2.3

• The idea is that some of the dimensions represented by the full term-document matrix are noise and that documents will have closer semantic structure after dimensionality reduction using SVD

• Find the representation of the query vector in the reduced space given by the first k columns of U.

Page 29: Chapter 2 Dimensionality Reduction. Linear Methods

Example 2.3

• Why?
• Consider Equation 2.6, the rank-k approximation X ≈ U_k D_k V_k^T.
• Note that the columns of U and V are orthonormal, so U_k^T U_k = I.
• Left-multiplying Equation 2.6 by U_k^T gives

$$\mathbf{U}_k^T(\mathbf{x}_1, \ldots, \mathbf{x}_n) = \mathbf{D}_k(\mathbf{v}_1^T, \ldots, \mathbf{v}_n^T),$$

where x_j is the j-th document (column of X) and v_j is the j-th row of V_k. The columns of D_k V_k^T are thus the documents in the reduced space, and U_k^T q plays the same role for a query q.

Page 30: Chapter 2 Dimensionality Reduction. Linear Methods

Example 2.3

Using a cutoff value of 0.5, we now correctly have documents 1 and 4 as being relevant to our queries on baking bread and baking.
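A minimal NumPy sketch of the reduced-space matching described in this example; the term-document matrix, query vector, and function name below are illustrative placeholders, not the book's actual data:

```python
import numpy as np

def lsi_scores(X, q, k):
    """Cosine scores of query q against the documents (columns of X)
    after rank-k reduction with the SVD."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    docs_k = np.diag(s[:k]) @ Vt[:k, :]        # D_k V_k^T: documents in k-D space
    q_k = U[:, :k].T @ q                       # query in the same reduced space
    num = docs_k.T @ q_k
    den = np.linalg.norm(docs_k, axis=0) * np.linalg.norm(q_k)
    return num / den

# Illustrative 5-term x 4-document matrix and a 5-term query vector
X = np.array([[1, 0, 0, 1],
              [1, 1, 0, 0],
              [0, 1, 1, 0],
              [0, 0, 1, 1],
              [1, 0, 1, 1]], dtype=float)
q = np.array([1, 0, 0, 0, 1], dtype=float)

scores = lsi_scores(X, q, k=2)
relevant = np.where(scores > 0.5)[0] + 1       # document numbers above the 0.5 cutoff
```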

Page 31: Chapter 2 Dimensionality Reduction. Linear Methods

Thanks