Principal Components Analysis (PCA)
DESCRIPTION
Principal Components Analysis (PCA). An exploratory technique used to reduce the dimensionality of a data set to 2D or 3D. It can be used to reduce the number of dimensions in data, find patterns in high-dimensional data, and visualize data of high dimensionality. Example applications: face recognition, image compression, gene expression analysis.
TRANSCRIPT
Slide 1
Principal Components Analysis (PCA)
• An exploratory technique used to reduce the dimensionality of the data set to 2D or 3D
• Can be used to:
  – Reduce the number of dimensions in the data
  – Find patterns in high-dimensional data
  – Visualize data of high dimensionality
• Example applications:
  – Face recognition
  – Image compression
  – Gene expression analysis
Slide 2
Principal Components Analysis (PCA): Ideas
• Does the data set 'span' the whole of d-dimensional space?
• For a matrix of m samples × n genes, create a new covariance matrix of size n × n.
• Transform some large number of variables into a smaller number of uncorrelated variables called principal components (PCs).
• Developed to capture as much of the variation in the data as possible.
Slide 3
Principal Component Analysis
See online tutorials such as http://www.cs.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf

[Scatter plot of data points in the (X1, X2) plane, with the principal directions Y1 and Y2 drawn through the point cloud.]

Note: Y1 is the first eigenvector, Y2 is the second. Y2 is ignorable.

Key observation: the variance along Y1 is the largest!
Slide 4
Principal Component Analysis: one attribute first
• Question: how much spread is there in the data along the axis? (distance to the mean)
• Variance = (standard deviation)²
Temperature: 42, 40, 24, 30, 15, 18, 15, 30, 15, 30, 35, 30, 40, 30

$$ s^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1} $$
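A minimal sketch of this formula in Python/NumPy (the slides later use MATLAB; the language choice here is an assumption), applied to the temperature readings as parsed above. NumPy's `np.var` with `ddof=1` applies the same n − 1 divisor:

```python
import numpy as np

# Temperature readings as parsed from the slide above
temperature = np.array([42, 40, 24, 30, 15, 18, 15, 30,
                        15, 30, 35, 30, 40, 30], dtype=float)

mean = temperature.mean()
# Sample variance: sum of squared deviations divided by (n - 1)
s2 = np.sum((temperature - mean) ** 2) / (len(temperature) - 1)

print(s2)                           # from the definition
print(np.var(temperature, ddof=1))  # same value via NumPy's built-in
```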
Slide 5
Now consider two dimensions
X = Temperature, Y = Humidity

X: 40, 40, 40, 30, 15, 15, 15, 30, 15, 30, 30, 30, 40, 30
Y: 90, 90, 90, 90, 70, 70, 70, 90, 70, 70, 70, 90, 70, 90

$$ \mathrm{cov}(X,Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1} $$

Covariance measures the correlation between X and Y:
• cov(X,Y) = 0: X and Y are uncorrelated
• cov(X,Y) > 0: X and Y move in the same direction
• cov(X,Y) < 0: X and Y move in opposite directions
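As a sketch in Python/NumPy, the covariance definition can be checked against `np.cov` on the (X, Y) pairs above:

```python
import numpy as np

# (Temperature, Humidity) pairs from the slide
x = np.array([40, 40, 40, 30, 15, 15, 15, 30, 15, 30, 30, 30, 40, 30], dtype=float)
y = np.array([90, 90, 90, 90, 70, 70, 70, 90, 70, 70, 70, 90, 70, 90], dtype=float)

# Sample covariance from the definition (divide by n - 1)
cov_xy = np.sum((x - x.mean()) * (y - y.mean())) / (len(x) - 1)

print(cov_xy)              # positive here: X and Y move in the same direction
print(np.cov(x, y)[0, 1])  # same value, taken from NumPy's 2x2 covariance matrix
```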
Slide 6
More than two attributes: covariance matrix
• Contains covariance values between all possible dimensions (=attributes):
• Example for three attributes (x,y,z):
$$ C_{n \times n} = \left( c_{ij} \mid c_{ij} = \mathrm{cov}(Dim_i, Dim_j) \right) $$

$$ C = \begin{pmatrix} \mathrm{cov}(x,x) & \mathrm{cov}(x,y) & \mathrm{cov}(x,z) \\ \mathrm{cov}(y,x) & \mathrm{cov}(y,y) & \mathrm{cov}(y,z) \\ \mathrm{cov}(z,x) & \mathrm{cov}(z,y) & \mathrm{cov}(z,z) \end{pmatrix} $$
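A short Python/NumPy sketch with made-up three-attribute data (the data here is an assumption, purely for illustration); `np.cov` builds the full matrix in one call:

```python
import numpy as np

# Hypothetical data set: 10 samples of three attributes (x, y, z)
rng = np.random.default_rng(0)
data = rng.normal(size=(10, 3))   # rows = samples, columns = attributes

# np.cov expects one variable per row, so transpose; C is then 3 x 3
C = np.cov(data.T)

print(C)                    # C[i, j] = cov(attribute_i, attribute_j)
print(np.allclose(C, C.T))  # True: covariance matrices are symmetric
```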
Slide 7
Eigenvalues & eigenvectors
• Vectors x having the same direction as Ax are called eigenvectors of A (A is an n by n matrix).
• In the equation Ax = λx, λ is called an eigenvalue of A.

$$ \begin{pmatrix} 2 & 3 \\ 2 & 1 \end{pmatrix} \begin{pmatrix} 3 \\ 2 \end{pmatrix} = \begin{pmatrix} 12 \\ 8 \end{pmatrix} = 4 \begin{pmatrix} 3 \\ 2 \end{pmatrix} $$
Slide 8
Eigenvalues & eigenvectors
• Ax = λx ⇔ (A − λI)x = 0
• How to calculate x and λ:
  – Calculate det(A − λI); this yields a polynomial of degree n
  – Determine the roots of det(A − λI) = 0; the roots are the eigenvalues λ
  – Solve (A − λI)x = 0 for each λ to obtain the eigenvectors x
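This recipe can be checked in Python/NumPy against the 2 × 2 example from the previous slide:

```python
import numpy as np

# The 2x2 example matrix from the previous slide
A = np.array([[2.0, 3.0],
              [2.0, 1.0]])

# det(A - lambda*I) = lambda^2 - 3*lambda - 4 = (lambda - 4)(lambda + 1),
# so the eigenvalues are 4 and -1.
eigenvalues, eigenvectors = np.linalg.eig(A)
print(eigenvalues)   # [ 4. -1.] (order not guaranteed)

# The eigenvector for lambda = 4 is proportional to (3, 2), as on the slide;
# rescale the unit-length vector returned by NumPy to check this.
v = eigenvectors[:, np.argmax(eigenvalues)]
print(v / v[0] * 3)  # approximately [3. 2.]
```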
Slide 9
Principal components
• First principal component (PC1)
  – The eigenvalue with the largest absolute value indicates that the data have the largest variance along its eigenvector, the direction of greatest variation
• Second principal component (PC2)
  – The direction with the maximum variation left in the data, orthogonal to PC1
• In general, only a few directions capture most of the variability in the data.
Slide 10
Steps of PCA
• Let $\bar{x}$ be the mean vector (taking the mean of all rows)
• Adjust the original data by the mean: $X' = X - \bar{x}$
• Compute the covariance matrix C of the adjusted X
• Find the eigenvectors and eigenvalues of C:
  – For matrix C, the eigenvectors are the (column) vectors e having the same direction as Ce, i.e. Ce = λe, where λ is called an eigenvalue of C
  – Ce = λe ⇔ (C − λI)e = 0
  – Most data mining packages do this for you.
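A minimal Python/NumPy sketch of these steps (the helper name `pca` and the toy data are assumptions, not from the slides):

```python
import numpy as np

def pca(X, p):
    """Follow the steps above: X has one sample per row; keep p components."""
    x_bar = X.mean(axis=0)                 # mean vector
    X_adj = X - x_bar                      # adjust the data by the mean
    C = np.cov(X_adj.T)                    # covariance matrix of adjusted X
    eigvals, eigvecs = np.linalg.eigh(C)   # eigh is suited to symmetric C
    order = np.argsort(eigvals)[::-1]      # largest eigenvalue first
    return eigvals[order], eigvecs[:, order[:p]], x_bar

# Hypothetical 2-D data that varies mostly along one direction
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 1)) @ np.array([[3.0, 1.0]]) \
    + rng.normal(scale=0.1, size=(100, 2))

eigvals, top_vecs, x_bar = pca(X, p=1)
print(eigvals)    # the first eigenvalue dominates
print(top_vecs)   # direction close to normalized (3, 1), up to sign
```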
Slide 11
Eigenvalues
• Calculate the eigenvalues λ and eigenvectors x of the covariance matrix:
  – Eigenvalues λ_j are used for the calculation of the percentage of total variance (V_j) for each component j:

$$ V_j = 100 \cdot \frac{\lambda_j}{\sum_{x=1}^{n} \lambda_x} $$
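A one-line check of this formula in Python/NumPy, using the two eigenvalues from the worked example later in these slides:

```python
import numpy as np

# Eigenvalues of a covariance matrix (here: the two values quoted
# in the worked example on the later slides)
eigenvalues = np.array([560.2, 51.8])

# V_j = 100 * lambda_j / sum over all eigenvalues
V = 100.0 * eigenvalues / eigenvalues.sum()
print(V)  # approximately [91.5  8.5]: PC1 captures ~92% of the variance
```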
Slide 12
Principal components - Variance

[Bar chart: percentage of variance (%) explained by each component, PC1 through PC10.]
Slide 13
Transformed Data
• Eigenvalue λ_j corresponds to the variance on each component j
• Thus, sort the eigenvectors by eigenvalue λ_j
• Take the first p eigenvectors e_i, where p is the number of top eigenvalues kept
• These are the directions with the largest variances

$$ \begin{pmatrix} y_{i1} \\ y_{i2} \\ \vdots \\ y_{ip} \end{pmatrix} = \begin{pmatrix} e_1 \\ e_2 \\ \vdots \\ e_p \end{pmatrix} \begin{pmatrix} x_{i1} - \bar{x}_1 \\ x_{i2} - \bar{x}_2 \\ \vdots \\ x_{in} - \bar{x}_n \end{pmatrix} $$
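In matrix form this transformation is a single product; a minimal Python/NumPy sketch (the helper name `transform` is an assumption):

```python
import numpy as np

def transform(X, eigvecs, p):
    """Project mean-adjusted data onto the top-p eigenvectors.

    X: (m, n) data with one sample per row; eigvecs: (n, n) matrix whose
    columns are eigenvectors already sorted by decreasing eigenvalue.
    Returns the (m, p) transformed data, one y_i per row.
    """
    X_adj = X - X.mean(axis=0)   # x_i - x_bar for every sample
    E = eigvecs[:, :p]           # keep only the first p eigenvectors
    return X_adj @ E             # row-wise y_i = E^T (x_i - x_bar)
```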
Slide 14
An Example
| X1 | X2 | X1' | X2' |
|----|----|------|--------|
| 19 | 63 | -5.1 | 9.25 |
| 39 | 74 | 14.9 | 20.25 |
| 30 | 87 | 5.9 | 33.25 |
| 30 | 23 | 5.9 | -30.75 |
| 15 | 35 | -9.1 | -18.75 |
| 15 | 43 | -9.1 | -10.75 |
| 15 | 32 | -9.1 | -21.75 |
| 30 | 73 | 5.9 | 19.25 |
Mean1 = 24.1, Mean2 = 53.8

[Scatter plots of the original data (X1 vs. X2) and of the mean-adjusted data (X1' vs. X2').]
Slide 15
Covariance Matrix
$$ C = \begin{pmatrix} 75 & 106 \\ 106 & 482 \end{pmatrix} $$

• Using MATLAB, we find out:
  – Eigenvectors: e1 = (−0.98, −0.21), λ1 = 51.8
  – e2 = (0.21, −0.98), λ2 = 560.2
  – Thus the second eigenvector is more important!
Slide 16
If we only keep one dimension: e2
• We keep the dimension of e2 = (0.21, −0.98)
• We can obtain the final data as

$$ y_i = \begin{pmatrix} 0.21 & -0.98 \end{pmatrix} \begin{pmatrix} x_{i1} - \bar{x}_1 \\ x_{i2} - \bar{x}_2 \end{pmatrix} = 0.21\,(x_{i1} - \bar{x}_1) - 0.98\,(x_{i2} - \bar{x}_2) $$

y_i: −10.14, −16.72, −31.35, 31.374, 16.464, 8.624, 19.404, −17.63

[Plot of the projected one-dimensional values y_i along a line.]
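Putting the worked example together as a Python/NumPy sketch. Eigenvector signs are arbitrary and the exact eigenvalues depend on the n versus n − 1 divisor, so the printed values may differ in sign and rounding from the slides:

```python
import numpy as np

# The eight data points from the example slides
X = np.array([[19, 63], [39, 74], [30, 87], [30, 23],
              [15, 35], [15, 43], [15, 32], [30, 73]], dtype=float)

X_adj = X - X.mean(axis=0)       # means: 24.125 and 53.75, as on the slide
C = np.cov(X_adj.T, bias=True)   # bias=True divides by n, matching C above
print(C)                         # approximately [[75, 106], [106, 482]]

# Eigen-decomposition of the covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)
e = eigvecs[:, np.argmax(eigvals)]   # direction with the largest variance
y = X_adj @ e                        # the one-dimensional final data
print(eigvals)
print(y)   # comparable (up to sign) to the y_i values above
```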
Slide 17
Slide 18
Slide 19
Slide 20
PCA → Original Data
• Retrieving the original data (e.g., in data compression):
  – RetrievedRowData = (RowFeatureVector^T × FinalData) + OriginalMean
  – Yields the original data using the chosen components
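A sketch of this retrieval step in Python/NumPy (the helper name `retrieve` and the column-eigenvector convention are assumptions):

```python
import numpy as np

def retrieve(final_data, eigvecs_kept, original_mean):
    """Invert the transform: (RowFeatureVector^T x FinalData) + OriginalMean.

    final_data: (m, p) projected data; eigvecs_kept: (n, p) matrix whose
    columns are the kept eigenvectors; original_mean: (n,) mean vector.
    With this column convention the product becomes final_data @ eigvecs_kept.T.
    Exact only if all n components were kept; otherwise an approximation.
    """
    return final_data @ eigvecs_kept.T + original_mean
```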
Slide 21
Principal components
• General properties of principal components:
  – summary variables
  – linear combinations of the original variables
  – uncorrelated with each other
  – capture as much of the original variance as possible
Slide 22
Applications – Gene expression analysis
• Reference: Raychaudhuri et al. (2000)
• Purpose: determine a core set of conditions for useful gene comparison
• Dimensions: conditions; observations: genes
• Yeast sporulation dataset (7 conditions, 6118 genes)
• Result: two components capture most of the variability (90%)
• Issues: uneven data intervals, data dependencies
• PCA is common prior to clustering
• Crisp clustering questioned: genes may correlate with multiple clusters
• Alternative: determination of each gene's closest neighbours
Slide 23
Two Way (Angle) Data Analysis

[Diagram: the same gene expression matrix analyzed two ways. Sample space analysis: samples (10¹–10²) by genes (10³–10⁴). Gene space analysis: genes (10³–10⁴) by conditions (10¹–10²).]
Slide 24
PCA - example
Slide 25
PCA on all genes (leukemia data, precursor B and T)
Plot of 34 patients, dimension of 8973 genes reduced to 2
Slide 26
PCA on the 100 top significant genes (leukemia data, precursor B and T)
Plot of 34 patients, dimension of 100 genes reduced to 2
Slide 27
PCA of genes (leukemia data)
Plot of 8973 genes, dimension of 34 patients reduced to 2