Intermediate R - Principal Component Analysis


Page 1: Intermediate R - Principal Component Analysis


Principal Component Analysis

Violeta I. Bartolome

Senior Associate Scientist

PBGB-CRIL

[email protected]

Principal Component Analysis

• PCA is a statistical method capable of reducing the dimensionality of multivariate data.

• Multivariate data are:
  - Obtained by taking more than one response variable from each experimental unit.
  - These response variables are correlated.

Example

• An agronomist might be interested in:
  - Seedling height (X1)
  - Leaf length (X2)
  - Leaf width (X3)
  - Ligule length (X4)

• Data were collected from 5 plants.

Data from such studies are usually stored in a 'data matrix', D:

            Seed Ht   Leaf Ln   Leaf Wd   Lig Ln
Plant 1       x11       x12       x13       x14
Plant 2        :         :         :         :
Plant 3        :         :         :         :
Plant 4        :         :         :         :
Plant 5       x51       x52       x53       x54
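As a minimal sketch, a matrix like D can be set up in R; the measurement values below are invented purely for illustration:

# Hypothetical measurements for 5 plants (values invented for illustration)
D <- matrix(c(45.2, 30.1, 1.2, 2.1,
              48.7, 32.4, 1.4, 2.3,
              43.9, 29.8, 1.1, 2.0,
              50.1, 33.0, 1.5, 2.4,
              46.5, 31.2, 1.3, 2.2),
            nrow = 5, byrow = TRUE,
            dimnames = list(paste("Plant", 1:5),
                            c("SeedHt", "LeafLen", "LeafWid", "LigLen")))
D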

Page 2: Intermediate R - Principal Component Analysis

Scenario: Usually the number of rows (experimental units/objects) or the number of columns (response variables/attributes) is large.

Goal: Data reduction on either the rows or the columns of D.

Solution: Perform data reduction techniques -- PCA is one!

The Problem of Data Reduction

• Case 1: No variability in X1

[Scatter plot of X2 against X1: the points spread in X2 while X1 stays essentially constant]

The amount of information in Xi is reflected by its variability. X1 carries very little information, so it can be ignored, reducing the data to X2 alone with no loss of information.

• Case 2: Small variability in X1

[Scatter plot of X2 against X1: X1 varies only slightly relative to X2]

Ignoring X1 results in little loss of information.

• Case 3: Significant variation in both variables

[Scatter plot of X2 against X1: both variables vary substantially]

We cannot simply discard either variable.

Page 3: Intermediate R - Principal Component Analysis

What PCA does

• Find a smaller set of new variables to represent the data with as little loss of information as possible.
  - The new variables are called principal components, and they are uncorrelated.

Y1 = A11X1 + A21X2 + … + Ap1Xp
  ⋮
Yp = A1pX1 + A2pX2 + … + AppXp

The Yi's are linear combinations of the original variables.

[Scatter plot: the (X1, X2) point cloud with rotated axes Y1 and Y2 overlaid along its main directions]

To reduce the dimensionality of the data in this last case, we translate the (X1, X2) axes and rotate them into new (Y1, Y2) axes:


Y1 = A11X1 + A21X2

Y2 = A12X1 + A22X2

Most of the variation now lies along Y1, so we can ignore Y2. PCA is used to construct Y1 and Y2, which are called the "principal components" of (X1, X2).
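As a minimal sketch of this construction (on simulated data, not the slide's), the rotation coefficients are the eigenvectors of the covariance matrix, and the rotated variables come out uncorrelated:

set.seed(1)
X1 <- rnorm(100, mean = 16, sd = 1.5)
X2 <- 17 + 0.8 * (X1 - 16) + rnorm(100, sd = 0.5)   # correlated with X1
X  <- cbind(X1, X2)
Xc <- scale(X, center = TRUE, scale = FALSE)   # center the point cloud
A  <- eigen(cov(X))$vectors                    # columns hold the Aij coefficients
Y  <- Xc %*% A                                 # Y1, Y2: data on the rotated axes
round(cov(Y), 4)                               # off-diagonals ~ 0: uncorrelated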

Example

Let X be defined as:

X1 Plant height (cm)

X2 Leaf area (cm2)

X3 Days to maturity

X4 Dry matter weight

X5 Panicle length

X6 Grain dry weight (g/hill)

X7 % filled grains

Page 4: Intermediate R - Principal Component Analysis

• Then new variables may be formed by some 'linear combinations' of the X variables:

Y1 = A11X1 + A21X2 + A31X3 + A41X4 + A51X5 + A61X6 + A71X7
     (the slide highlights the big contributors to Y1)
Y2 = A12X1 + A22X2 + A32X3 + A42X4 + A52X5 + A62X6 + A72X7
     (the slide highlights the big contributors to Y2)

• We may call Y1 the "growth component" and Y2 the "yield component".

• PCA is used to find the values of the Aij's (the eigenvectors) so that the information in X is retained by Y1 and Y2.

TOTAL SYSTEM VARIABILITY

The total variability among the p variables is reproduced by the p principal components, and most of it is often captured by the first k of them, leaving p - k to be dropped. There is almost as much information in the k components as there is in the original p variables.

• The variance of each principal component is its respective eigenvalue (characteristic root):

  Var(Yi) = λi,   Cov(Yi, Yk) = 0 for i ≠ k,

  where λ1 ≥ λ2 ≥ … ≥ λp ≥ 0.

Properties of Principal Components

• Total variance of the X's = total of the eigenvalues:

  Var(X1) + Var(X2) + … + Var(Xp) = σ11 + σ22 + … + σpp = λ1 + λ2 + … + λp

Proportion of total variance due to the kth PC:

  λk / (λ1 + λ2 + … + λp)

If most (≥ 80%) of the total variance can be attributed to the first one, two, or three components, then these components can "replace" the original p variables without much loss of information.

Properties of Principal Components

• The variances of the principal components sum to the same total:

  Var(Y1) + Var(Y2) + … + Var(Yp) = λ1 + λ2 + … + λp
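Both properties can be checked numerically in R; a minimal sketch on simulated data (any numeric dataset would do):

set.seed(2)
X  <- matrix(rnorm(200), ncol = 4)   # simulated data: 50 units, 4 variables
pc <- prcomp(X)                      # PCA on the covariance matrix
lambda <- pc$sdev^2                  # Var(Yi) = eigenvalue i
sum(diag(cov(X)))                    # total variance of the X's ...
sum(lambda)                          # ... equals the total of the eigenvalues
lambda / sum(lambda)                 # proportion of total variance per PC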

Page 5: Intermediate R - Principal Component Analysis

Eigenvector

• The set of corresponding coefficients of

the X’s in the kth PC [ A1k A2k … Aik … Apk ]

is the eigenvector of the kth PC.

• The magnitude of the Aij measures the

importance of Xi to Yj.

• The sign of Aij denotes the direction of the

contribution of Xi to Yj.

Principal components may be obtained using either the variance-covariance matrix Σ or the correlation matrix ρ.

• PCA using Σ considers the relative weights of the variables (larger variance ⇒ larger weight).

• PCA using ρ considers all variables with equal weight.

The correlation matrix is recommended when:

• attributes have large variances
• scales have highly varied ranges
• variables are in different units of measurement
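In R this choice maps onto prcomp()'s scale. argument: the default uses the covariance matrix, while scale. = TRUE uses the correlation matrix. A minimal sketch with invented mixed-scale variables:

set.seed(3)
dat <- data.frame(height_cm = rnorm(50, 100, 15),
                  weight_g  = rnorm(50, 5000, 800),   # far larger variance
                  score     = rnorm(50, 3, 0.5))
prcomp(dat)$sdev^2                 # covariance PCA: weight_g dominates
prcomp(dat, scale. = TRUE)$sdev^2  # correlation PCA: equal weighting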

Example: Random variables X1, X2, and X3 with the covariance matrix

      |  1  -2   0 |
  Σ = | -2   5   0 |
      |  0   0   2 |

Eigenvalue-eigenvector pairs:

λ1 = 5.83   e1' = [0.383  -0.924  0]
λ2 = 2.00   e2' = [0  0  1]
λ3 = 0.17   e3' = [0.924  0.383  0]

Principal Components:

Y1 = 0.383 X1 - 0.924 X2
Y2 = X3
Y3 = 0.924 X1 + 0.383 X2

Proportion of total variance accounted for by the PCs:

λ1 / (λ1 + λ2 + λ3) = 5.83 / 8 = 0.73
λ2 / (λ1 + λ2 + λ3) = 2.00 / 8 = 0.25

Components Y1 and Y2 explain 98% of the total variation. Hence they could replace the 3 original variables with little loss of information.
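This worked example can be reproduced with R's eigen() (eigenvector signs may come out flipped, which leaves the components unchanged):

Sigma <- matrix(c( 1, -2, 0,
                  -2,  5, 0,
                   0,  0, 2), nrow = 3, byrow = TRUE)
e <- eigen(Sigma)
round(e$values, 2)                   # 5.83 2.00 0.17
round(e$vectors, 3)                  # columns are e1, e2, e3 (up to sign)
round(e$values / sum(e$values), 2)   # 0.73 0.25 0.02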

Page 6: Intermediate R - Principal Component Analysis

PCA in R

Sample Data

Read the data into R and pass it to prcomp().

Setting scale. = TRUE causes the data to be standardized to unit variance before the analysis.

In the slide's output, 80% of the variance is explained by 3 principal components.
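The transcript does not preserve the slide's sample data or output; as a stand-in sketch, the same steps on R's built-in USArrests data:

pc <- prcomp(USArrests, scale. = TRUE)   # standardize, then PCA
summary(pc)   # "Proportion of Variance": how much each PC explains
              # "Cumulative Proportion": how many PCs reach ~80%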

Page 7: Intermediate R - Principal Component Analysis

Eigenvectors and Scores
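Continuing the USArrests sketch from the previous page, the eigenvectors and the scores are stored in the prcomp object:

pc$rotation   # eigenvectors (loadings): the Aij coefficients of each PC
head(pc$x)    # scores: the observations expressed on the PC axes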

PCA Plot
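A basic PCA plot is simply the first two columns of the scores (again using the sketch object pc from above):

plot(pc$x[, 1], pc$x[, 2], xlab = "PC1", ylab = "PC2",
     main = "PCA plot (USArrests sketch)")

Another Example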

Page 8: Intermediate R - Principal Component Analysis

biplot()
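biplot() overlays the scores (points) and the eigenvectors (arrows) in a single display; a minimal sketch with the pc object from above:

biplot(pc)   # observations as points, variable loadings as arrows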

Thank you!