Intermediate R - Principal Component Analysis
TRANSCRIPT
Principal Component Analysis
Violeta I. Bartolome
Senior Associate Scientist
PBGB-CRIL
Principal Component Analysis
• PCA is a statistical method capable of reducing the dimensionality of multivariate data.
• Multivariate data are:
  - Obtained by taking more than one response variable from each experimental unit.
  - These response variables are correlated.
Example
• An agronomist might be interested in:
  - Seedling height (X1)
  - Leaf length (X2)
  - Leaf width (X3)
  - Ligule length (X4)
• Data were collected from 5 plants.
Data from such studies are usually stored in a 'data matrix', D (sketched in R below):

              Seed Ht   Leaf Ln   Leaf Wd   Lig Ln
    Plant 1     x11       x12       x13       x14
    Plant 2      :         :         :         :
    Plant 3      :         :         :         :
    Plant 4      :         :         :         :
    Plant 5     x51       x52       x53       x54
• Scenario: Usually the number of rows (experimental units/objects) or the number of columns (response variables/attributes) is large.
• Goal: Data reduction on either the rows or the columns of D.
• Solution: Perform data reduction techniques -- PCA is one!
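As a minimal sketch of such a data matrix in R (the plant labels and all numbers here are hypothetical, made up for illustration only):

    # Hypothetical 5 x 4 data matrix D: 5 plants, 4 response variables
    D <- matrix(
      c(45.2, 21.3, 1.2, 0.4,
        50.1, 23.8, 1.4, 0.5,
        47.6, 22.1, 1.3, 0.4,
        52.3, 24.5, 1.5, 0.6,
        49.0, 22.9, 1.3, 0.5),
      nrow = 5, byrow = TRUE,
      dimnames = list(paste("Plant", 1:5),
                      c("SeedHt", "LeafLn", "LeafWd", "LigLn"))
    )
    D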
The Problem of Data Reduction
• Case 1: No variability in X1
[Figure: scatter plot of X2 against X1]
The amount of information from Xi is shown by its variability. X1 gives very little information, so it can be dropped, reducing the data to X2 alone with no loss of information.
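A one-line check of this idea in R (the numbers are hypothetical): a constant column has zero variance and therefore carries no information.

    X1 <- rep(16, 5)                       # no variability
    X2 <- c(16.2, 17.5, 18.9, 19.4, 20.1)
    var(X1)   # 0: X1 can be ignored
    var(X2)   # > 0: all the information is in X2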
• Case 2: Small variability in X1

[Figure: scatter plot of X2 against X1]
Ignoring X1 results in only a small loss of information.
• Case 3: Significant variation in both variables

[Figure: scatter plot of X2 against X1]
We cannot simply discard either variable.
What PCA does
• Find a smaller set of new variables to represent the data with as little loss of information as possible.
  - The new variables are called principal components, and they are uncorrelated.

Y1 = A11X1 + A21X2 + … + Ap1Xp
  ⋮
Yp = A1pX1 + A2pX2 + … + AppXp

The Yi's are linear combinations of the original variables.
[Figure: the same scatter plot with translated and rotated axes Y1 and Y2 overlaid on the (X1, X2) axes]
To reduce the dimensionality of the data in the last case, we translate the (X1, X2) axes and rotate them into new (Y1, Y2) axes.
Y1 = A11X1 + A21X2
Y2 = A12X1 + A22X2
We can now ignore Y2 with little loss of information. Hence, PCA is used to construct Y1 and Y2, which are called the "principal components" of (X1, X2).
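As a sketch of this construction in R (the covariance values are hypothetical): the coefficients Aij are the eigenvectors of the covariance matrix of (X1, X2), and the eigenvalues show that Y2 carries little of the variance.

    # Hypothetical covariance matrix of (X1, X2)
    S <- matrix(c(2.0, 1.5,
                  1.5, 2.0), nrow = 2)
    e <- eigen(S)
    e$values    # 3.5 and 0.5: Var(Y1) is large, Var(Y2) is small
    e$vectors   # columns are the coefficients (A11, A21) and (A12, A22)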
Example
Let X be defined as:
  X1 Plant height (cm)
  X2 Leaf area (cm²)
  X3 Days to maturity
  X4 Dry matter weight
  X5 Panicle length
  X6 Grain dry weight (g/hill)
  X7 % filled grains
• Then new variables may be formed as 'linear combinations' of the X variables; the variables with large coefficients are the big contributors to that component:

Y1 = A11X1 + A21X2 + A31X3 + A41X4 + A51X5 + A61X6 + A71X7
Y2 = A12X1 + A22X2 + A32X3 + A42X4 + A52X5 + A62X6 + A72X7
• We may call Y1 the "growth component" and Y2 the "yield component".
• PCA is used to find the values of the Aij's (the eigenvectors) so that the information in X is retained by Y1 and Y2.
TOTAL SYSTEM VARIABILITY
The total variability among the p variables is fully reproduced by the p principal components. If the first k components account for most of it, the remaining p - k components can be dropped: there is almost as much information in the k components as there is in the original p variables.
• The variance of each principal component is its respective eigenvalue (characteristic root):

Var(Yi) = λi,   Cov(Yi, Yk) = 0 for i ≠ k,
where λ1 ≥ λ2 ≥ … ≥ λp ≥ 0
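This property is easy to verify in R; a small sketch using R's built-in iris measurements as stand-in data (not the presentation's data set): the variance of each column of prcomp() scores equals the corresponding eigenvalue of the covariance matrix.

    X <- iris[, 1:4]          # stand-in numeric data
    pca <- prcomp(X)          # covariance-based PCA (no scaling)
    apply(pca$x, 2, var)      # Var(Yi) for each component
    eigen(cov(X))$values      # the eigenvalues: the same numbers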
Properties of Principal Components
• Total variance of the X's = total of the eigenvalues:

Var(X1) + Var(X2) + … + Var(Xp) = σ11 + σ22 + … + σpp
• Proportion of total variance due to the kth PC:

λk / (λ1 + λ2 + … + λp)
If most (≥ 80%) of the total variance can be attributed to the first one, two, or three components, then these components can "replace" the original p variables without much loss of information.
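In R these proportions are printed by summary(); a minimal sketch with the same stand-in data:

    pca <- prcomp(iris[, 1:4])
    summary(pca)                     # proportion and cumulative proportion of variance
    pca$sdev^2 / sum(pca$sdev^2)     # the same proportions computed by hand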
Properties of Principal Components
• Total variance of the Y's = total of the eigenvalues as well:

Var(Y1) + Var(Y2) + … + Var(Yp) = λ1 + λ2 + … + λp
Eigenvector
• The set of coefficients of the X's in the kth PC, [ A1k A2k … Aik … Apk ], is the eigenvector of the kth PC.
• The magnitude of Aij measures the importance of Xi to Yj.
• The sign of Aij denotes the direction of the contribution of Xi to Yj.
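In prcomp() output the eigenvectors are stored in the rotation matrix; a brief sketch (stand-in data again) of inspecting magnitudes and signs:

    pca <- prcomp(iris[, 1:4], scale. = TRUE)
    pca$rotation    # columns are eigenvectors; magnitude = importance, sign = direction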
Principal components may be obtained using either the variance-covariance matrix Σ or the correlation matrix ρ.
• PCA using Σ considers the relative weights of the variables (larger variance ⇒ larger weight).
• PCA using ρ considers all variables with equal weights.
The correlation matrix is recommended when:
• attributes have large variances
• scales have highly varied ranges
• variables have different units of measurement
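In prcomp() this choice corresponds to the scale. argument; a minimal sketch contrasting the two analyses on the same stand-in data:

    pca_cov <- prcomp(iris[, 1:4], scale. = FALSE)   # uses the covariance matrix
    pca_cor <- prcomp(iris[, 1:4], scale. = TRUE)    # uses the correlation matrix
    pca_cov$rotation[, 1]   # first eigenvector differs between the two
    pca_cor$rotation[, 1]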
Example: Random variables X1, X2, and X3 with covariance matrix

            |  1  -2   0 |
    Σ   =   | -2   5   0 |
            |  0   0   2 |

Eigenvalue-eigenvector pairs:
  λ1 = 5.83   e1' = [0.383  -0.924  0]
  λ2 = 2.00   e2' = [0       0      1]
  λ3 = 0.17   e3' = [0.924   0.383  0]

Principal components:
  Y1 = 0.383 X1 - 0.924 X2
  Y2 = X3
  Y3 = 0.924 X1 + 0.383 X2

Proportion of total variance accounted for by the PCs:
  λ1 / (λ1 + λ2 + λ3) = 5.83 / 8.00 = 0.73
  λ2 / (λ1 + λ2 + λ3) = 2.00 / 8.00 = 0.25

Components Y1 and Y2 explain 98% of the total variation. Hence they could replace the 3 original variables with little loss of information.
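The example can be checked numerically in R with eigen():

    Sigma <- matrix(c( 1, -2, 0,
                      -2,  5, 0,
                       0,  0, 2), nrow = 3, byrow = TRUE)
    e <- eigen(Sigma)
    e$values                   # 5.83, 2.00, 0.17
    e$vectors                  # columns are e1, e2, e3 (signs may be flipped)
    e$values / sum(e$values)   # 0.73, 0.25, 0.02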
PCA in R
• Sample data: read the data into R, then run prcomp().
• scale. = TRUE causes the data to be standardized to have unit variance before the analysis.
• In the sample output, 80% of the variance is explained by 3 principal components.
• The output also gives the eigenvectors and the scores.
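The R session itself is not reproduced in the transcript; the following is a minimal sketch of the workflow it describes, with a hypothetical input file mydata.csv containing only numeric response variables:

    dat <- read.csv("mydata.csv")       # hypothetical file, numeric columns only
    pca <- prcomp(dat, scale. = TRUE)   # scale. = TRUE standardizes to unit variance
    summary(pca)     # proportion of variance explained per component
    pca$rotation     # eigenvectors
    pca$x            # scores: the data on the new PC axes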
• PCA plot: biplot() (the session also walks through another example).
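A one-line sketch of the plotting step, continuing the hypothetical object above:

    biplot(pca)   # scores and eigenvectors displayed on one plot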
Thank you!