
Principal component analysis

Course 27411

Biological data analysis and chemometrics

Jens C. Frisvad Department of Systems Biology

Building 221

Technical University of Denmark

2800 Kgs. Lyngby – Denmark

E-mail: [email protected]

Thanks to Lars Nørgaard, CAMO, Michael Edberg and Harald Martens for some of the figures

Principal component analysis OUTLINE

• Some matrix algebra

• What is PCA?

• Goals of PCA

• Projection

• Q and R analysis

• Matrix notation

• Double example (continuous variables and a series of single variables)

• Connection to PCR and PLSR

• Validation (short)

• How to do it!


Scalars, Vectors and Matrices

• Scalar: variable described by a single number (magnitude)
  – Temperature = 20 °C
  – Density = 1 g·cm⁻³
  – Image intensity (pixel value) = 2546 a.u.

• Vector: variable described by magnitude and direction

• Matrix: rectangular array of scalars

Column vector, row vector:

$$\mathbf{b} = \begin{bmatrix} 1 \\ 1 \\ 2 \end{bmatrix} \qquad \mathbf{d} = \begin{bmatrix} 3 & 4 & 9 \end{bmatrix}$$

Square (3 x 3), rectangular (3 x 2):

$$\mathbf{A} = \begin{bmatrix} 1 & 2 & 3 \\ 5 & 4 & 1 \\ 6 & 7 & 4 \end{bmatrix} \qquad \mathbf{C} = \begin{bmatrix} 1 & 4 \\ 2 & 7 \\ 3 & 8 \end{bmatrix}$$

$$\mathbf{D} = \begin{bmatrix} d_{11} & d_{12} & d_{13} \\ d_{21} & d_{22} & d_{23} \\ d_{31} & d_{32} & d_{33} \end{bmatrix}, \qquad d_{ij}: i\text{th row}, j\text{th column}$$

Vector Operations

• Transpose operator

• Outer product = matrix

column → row, row → column

$$\mathbf{b} = \begin{bmatrix} 1 \\ 1 \\ 2 \end{bmatrix}, \quad \mathbf{b}^T = \begin{bmatrix} 1 & 1 & 2 \end{bmatrix} \qquad \mathbf{d} = \begin{bmatrix} 3 & 4 & 9 \end{bmatrix}, \quad \mathbf{d}^T = \begin{bmatrix} 3 \\ 4 \\ 9 \end{bmatrix}$$

$$\mathbf{A} = \begin{bmatrix} 1 & 2 & 3 \\ 5 & 4 & 1 \\ 6 & 7 & 4 \end{bmatrix}, \qquad \mathbf{A}^T = \begin{bmatrix} 1 & 5 & 6 \\ 2 & 4 & 7 \\ 3 & 1 & 4 \end{bmatrix}$$

$$\mathbf{x}\mathbf{y}^T = \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix}\begin{bmatrix} y_1 & y_2 & y_3 \end{bmatrix} = \begin{bmatrix} x_1y_1 & x_1y_2 & x_1y_3 \\ x_2y_1 & x_2y_2 & x_2y_3 \\ x_3y_1 & x_3y_2 & x_3y_3 \end{bmatrix}$$


Vector Operations

• Inner product = scalar

• Length of a vector

$$\mathbf{x}^T\mathbf{y} = \begin{bmatrix} x_1 & x_2 & x_3 \end{bmatrix}\begin{bmatrix} y_1 \\ y_2 \\ y_3 \end{bmatrix} = x_1y_1 + x_2y_2 + x_3y_3 = \sum_{i=1}^{3} x_iy_i$$

$$\mathbf{x}^T\mathbf{y} = \begin{bmatrix} x_1 & x_2 & \cdots & x_n \end{bmatrix}\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} = \sum_{i=1}^{n} x_iy_i$$

$$\|\mathbf{x}\| = (x_1^2 + x_2^2)^{1/2} \ \text{(in 2D)} \qquad \|\mathbf{x}\| = (x_1^2 + x_2^2 + x_3^2)^{1/2} \ \text{(in 3D)}$$

Inner product of a vector with itself = (vector length)²:

$$\mathbf{x}^T\mathbf{x} = x_1^2 + x_2^2 + x_3^2 = \|\mathbf{x}\|^2$$

[Figure: vector x in the plane with components x₁, x₂ and length ‖x‖]
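To make the algebra concrete, here is a minimal NumPy sketch of the operations above; it reuses the slide's example vectors b and d, while x and y are made-up values for the inner product and length:

```python
import numpy as np

b = np.array([[1], [1], [2]])   # column vector (3 x 1), from the slide
d = np.array([[3, 4, 9]])       # row vector (1 x 3), from the slide

bT = b.T                        # transpose: column -> row, [[1, 1, 2]]
outer = b @ d                   # outer product: (3 x 1)(1 x 3) = 3 x 3 matrix

x = np.array([1.0, 2.0, 2.0])   # made-up example values
y = np.array([3.0, 0.0, 4.0])
inner = x @ y                   # inner product: x1*y1 + x2*y2 + x3*y3 = 11.0
length = np.sqrt(x @ x)         # ||x|| = sqrt(x'x) = 3.0
assert np.isclose(length, np.linalg.norm(x))
```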

Matrix Operations

• Addition (matrix of same size)

– Commutative: A+B=B+A

– Associative: (A+B)+C=A+(B+C)

$$\mathbf{A} + \mathbf{B} = \begin{bmatrix} 2 & 2 \\ 2 & 2 \end{bmatrix} + \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix} = \begin{bmatrix} 3 & 3 \\ 3 & 3 \end{bmatrix}$$


Matrix Operations

• Multiplication – number of columns in first matrix = number of rows in second

– Associative: (A B) C = A (B C)

– Distributive: A (B+C) = A B + A C

– Not commutative: A B ≠ B A !!!

– (A B)T = BT AT

C = A B

(m x p) = (m x n) (n x p)

c_ij = inner product between the ith row in A and the jth column in B

(2 x 3)(3 x 2) = (2 x 2):

$$\mathbf{A}\mathbf{B} = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{bmatrix}\begin{bmatrix} 7 & 1 \\ 8 & 2 \\ 9 & 3 \end{bmatrix} = \begin{bmatrix} 1{\cdot}7+2{\cdot}8+3{\cdot}9 & 1{\cdot}1+2{\cdot}2+3{\cdot}3 \\ 4{\cdot}7+5{\cdot}8+6{\cdot}9 & 4{\cdot}1+5{\cdot}2+6{\cdot}3 \end{bmatrix} = \begin{bmatrix} 50 & 14 \\ 122 & 32 \end{bmatrix}$$

Some Definitions …

• Identity Matrix

• Diagonal Matrix

• Symmetric Matrix

I A = A I = A

B = BT bij = bji

$$\mathbf{I} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} \qquad \mathbf{D} = \begin{bmatrix} 3 & 0 & 0 \\ 0 & 5 & 0 \\ 0 & 0 & 7 \end{bmatrix} \qquad \mathbf{B} = \begin{bmatrix} 3 & 1 & 0 \\ 1 & 5 & 2 \\ 0 & 2 & 7 \end{bmatrix}$$


Matrix Inverse

• $\mathbf{A}^{-1}\mathbf{A} = \mathbf{A}\mathbf{A}^{-1} = \mathbf{I}$

Properties

$\mathbf{A}^{-1}$ only exists if A is square (n x n)

If $\mathbf{A}^{-1}$ exists, then A is non-singular (invertible)

$(\mathbf{A}\mathbf{B})^{-1} = \mathbf{B}^{-1}\mathbf{A}^{-1}$, since $\mathbf{B}^{-1}\mathbf{A}^{-1}\mathbf{A}\mathbf{B} = \mathbf{B}^{-1}\mathbf{B} = \mathbf{I}$

$(\mathbf{A}^T)^{-1} = (\mathbf{A}^{-1})^T$, since $(\mathbf{A}^{-1})^T\mathbf{A}^T = (\mathbf{A}\mathbf{A}^{-1})^T = \mathbf{I}$

$$\mathbf{D} = \begin{bmatrix} 3 & 0 & 0 \\ 0 & 5 & 0 \\ 0 & 0 & 7 \end{bmatrix}, \qquad \mathbf{D}^{-1} = \begin{bmatrix} \tfrac{1}{3} & 0 & 0 \\ 0 & \tfrac{1}{5} & 0 \\ 0 & 0 & \tfrac{1}{7} \end{bmatrix}, \qquad \mathbf{D}^{-1}\mathbf{D} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} = \mathbf{I}$$
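A short NumPy sketch of the matrix rules above, using the slides' own example matrices (the assertions check (A B)' = B'A' and D⁻¹D = I):

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])                     # 2 x 3
B = np.array([[7, 1],
              [8, 2],
              [9, 3]])                        # 3 x 2

C = A @ B                                     # (2 x 3)(3 x 2) = 2 x 2: [[50, 14], [122, 32]]
assert np.array_equal((A @ B).T, B.T @ A.T)   # (A B)' = B' A'

D = np.diag([3.0, 5.0, 7.0])                  # diagonal matrix
D_inv = np.linalg.inv(D)                      # diag(1/3, 1/5, 1/7)
assert np.allclose(D_inv @ D, np.eye(3))      # D^-1 D = I
```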

• Samples as observations in a Multi/hyper-dimensional Space:

– Objects are characterized by a profile of features.

– Features are dimensions.

– Objects are points in a multidimensional space.

• Mathematical notation

– N is the number of observations

– M is the number of variables/features

$$\mathbf{X} = \begin{bmatrix} \mathbf{x}_1 \\ \mathbf{x}_2 \\ \vdots \\ \mathbf{x}_N \end{bmatrix} = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1M} \\ x_{21} & x_{22} & \cdots & x_{2M} \\ \vdots & \vdots & & \vdots \\ x_{N1} & x_{N2} & \cdots & x_{NM} \end{bmatrix}$$

Data as observations


PCA

• Developed by Karl Pearson in 1901
• Pearson, K. (1901) On lines and planes of closest fit to systems of points in space. Philosophical Magazine (6) 2: 559-572.

• Also called:
  – Singular value decomposition (see the code sketch after this list)

– Karhunen-Loève expansion

– Eigenvector analysis

– Latent vector analysis

– Characteristic vector analysis

– Hotelling transformation
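The SVD name is more than an analogy: on centered data, the scores and loadings come straight out of the singular value decomposition. A minimal NumPy sketch, where the random X merely stands in for any N x M data table:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(34, 10))              # stand-in data table: N = 34 samples, M = 10 variables

Xc = X - X.mean(axis=0)                    # center each variable
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

T = U * s                                  # scores (N x k)
P = Vt.T                                   # loadings (M x k)
eigvals = s**2 / (X.shape[0] - 1)          # variances along the PCs = eigenvalues of cov(X)

assert np.allclose(Xc, T @ P.T)            # with all PCs kept, the model is exact (E = 0)
```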

Principal Component Analysis (PCA)

• Projection method

• Exploratory data analysis

• Extract information and remove noise

• Reduce dimensionality / Compression

• Classification and clustering

X = Model + Noise

Data table (X):
• Instrument measurements
• Quality parameters
• Process settings

Model:
• Structured variation
• Information

Noise:
• Unstructured variation
• Measurement error
• Contains no information


Goals of PCA

• Simplification

• Data reduction

• Modeling

• Outlier detection

• Variable selection

• Classification (unsupervised)

• Unmixing (curve resolution)

PCA, which distance measure is used?

• PCA uses Euclidean distance!


PCA in a nutshell

• First, the illustrative/intuitive approach:

Projection of data

• Linear transformation (by proper rotation)

[Figure: feature observations in 3D space projected onto a reduced 2D space]


Projection of data

• By proper linear transformation

• The PCA approach:
  – Rotation according to maximum variance in data

[Figure: feature observations in 3D space rotated and projected onto a reduced 2D space]

Projection of data

• By proper linear transformation

• The PCA approach:
  – Rotation according to maximum variance in data

• Fisher approach:
  – Rotation according to maximum discrimination between groups

[Figure: feature observations in 3D space rotated and projected onto a reduced 2D space]


Principal Component Analysis

• For a given dataset:

[Figure: scatter of data points in the (x1, x2) plane]

Principal Component Analysis

• Calculate the centroid (=”mean in all directions”):

[Figure: data points with the centroid marked in the (x1, x2) plane]


Principal Component Analysis

• Subtract the mean:

[Figure: data in (x1, x2) before and after subtracting the mean]

Principal Component Analysis

• Take this as our new coordinate system:

[Figure: mean-centered data in the new (x1, x2) coordinate system]


Principal Component Analysis

• Calculate the direction in which the variance is maximal:

[Figure: direction p1 of maximal variance in the (x1, x2) plane]

Principal Component Analysis

• And repeat this for the next orthonormal axis (direction with second most variance):

[Figure: orthogonal directions p1 and p2 in the (x1, x2) plane]


Principal Component Analysis

• Leaving us with a rotated grid:

[Figure: rotated grid spanned by p1 and p2]

Principal Component Analysis

• Which we can rotate to a “normal” position:

[Figure: grid rotated to a "normal" position with p1 and p2 as axes]


Principal Component Analysis

• Showing us maximal variance:

[Figure: data in the (p1, p2) coordinate system, with maximal variance along p1]

Principal Component Analysis

• We can also use this to reduce the complexity of the data set:

[Figure: data in (p1, p2) before dimension reduction]


Principal Component Analysis

• By eliminating a number of axes by projecting the points:

[Figure: points projected onto p1, eliminating the p2 axis]

Principal Component Analysis

• In this example moving from two…:

[Figure: two-dimensional data in (p1, p2)]


Principal Component Analysis

• … to one dimensional data points:

[Figure: in 3D (x1, x2, x3), the 1st and 2nd PC and the projections onto them; Var(PC1) > Var(PC2)]


Principal Component Analysis

• Assuming variation equals species diversity…:

[Figure: data cloud in the (x1, x2) plane]

Principal Component Analysis

• … the first PC depicts this information…:

[Figure: first principal axis p1 (and p2) through the data cloud]


Scores and Loadings

Scores
• Map of samples
• Displays the distribution of the samples in the new space defined by the PCs

Loadings
• Map of variables
• Shows how the original variables are related to the PCs

Data as observations: there are often two matrices to compare.

$$\mathbf{X} = \begin{bmatrix} x_{11} & x_{12} & x_{13} & \cdots & x_{1M} \\ x_{21} & x_{22} & x_{23} & \cdots & x_{2M} \\ x_{31} & x_{32} & x_{33} & \cdots & x_{3M} \\ x_{41} & x_{42} & x_{43} & \cdots & x_{4M} \\ \vdots & & & & \vdots \\ x_{N1} & x_{N2} & x_{N3} & \cdots & x_{NM} \end{bmatrix}$$

• For each sample n = 1, 2, 3, 4, …, N there is also a row of a priori covariates, yₙ = (y_n1, …, y_nP):

$$\mathbf{Y} = \begin{bmatrix} y_{11} & \cdots & y_{1P} \\ y_{21} & \cdots & y_{2P} \\ y_{31} & \cdots & y_{3P} \\ y_{41} & \cdots & y_{4P} \\ \vdots & & \vdots \\ y_{N1} & \cdots & y_{NP} \end{bmatrix}$$


Data as observations

• For each sample n = 1, 2, 3, 4, …, N we thus have both the measurements X and the a priori covariates Y, with rows yₙ = (y_n1, …, y_nP)

Data as observations — extensions of PCA: PCR, PLS etc.

• OBJECTIVE: Can we model Y based on X?


PCA

• PCA is used for the analysis of one data matrix, in the hope of distinguishing latent signals from noise in the data

[Figure: the data matrix X, with the samples (n) as rows (index i) and the variables (p) as columns (index k)]

Principal component analysis

$$\mathbf{X} = \bar{\mathbf{x}} + \mathbf{T}\mathbf{P}' + \mathbf{E}$$

where X is the measured data, x̄ the average of all X, T the scores, P' the loadings and E the errors.


PCA depends on feature size and scale

• Variables = features = characteristics = attributes = characters

• When, in any one variable, the largest value is 8 times greater than the lowest value, use the log transformation x_log1 = log(x + 1)

• Often the data are centered: $x_{ik}(\text{new}) = x_{ik} - \bar{x}_k$

Mean, variance and standard deviation:

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \quad \text{(mean, avg)}$$

$$\mathrm{Var}(x) = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n - 1} \quad \text{(variance)}$$

$$s_x = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n - 1}} \quad \text{(standard deviation, std)}$$


Autoscaling = standardization

• To avoid the problem of different scales the data are often centered and all values are divided by the standard deviation for each variable

$$\mathbf{X}_{aut} = \frac{\mathbf{X} - \mathbf{X}_{avg}}{\mathbf{X}_{std}}$$
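A minimal sketch of centering, autoscaling and the log transformation mentioned above (the small X is a made-up raw-data table):

```python
import numpy as np

def autoscale(X):
    """Center each variable and divide by its standard deviation (n - 1 in the denominator)."""
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

def log1(X):
    """xlog1 = log(x + 1), for variables spanning a large range."""
    return np.log(X + 1)

X = np.array([[1.0,  200.0],
              [2.0,  900.0],
              [3.0, 8000.0]])              # made-up raw data: 3 samples x 2 variables

Xc = X - X.mean(axis=0)                    # centering only
Xaut = autoscale(X)                        # each column now has mean 0 and std 1
assert np.allclose(Xaut.mean(axis=0), 0.0)
assert np.allclose(Xaut.std(axis=0, ddof=1), 1.0)
```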

Q versus R analysis

• Sometimes the objects of interest may instead be the variables; in that case the matrix X can be transposed (to X')

• Objects are rarely transformed regarding the variables

• Normalization (all variables summing to 1 or 100 % per object) is rarely done: one risks closure of the data
  – E.g. used in GC: % area (but do not use this as raw data; rather standardize your method)

Page 23: Principal component analysis - Technical University of · PDF filePrincipal component analysis ... (1901) On lines and planes of closest fit to systems of points in space. ... Projection

23

Objects and variables

• In contingency tables, objects and variables may have equal status. Often correspondence analysis is used on such data

• In some cases only the objects or only the variables are interesting
  – Example: Fluorescence or other spectrophotometric data are occasionally used just for characterization of the objects. Those soft curves are often corrected, e.g. by Multiple Scatter Correction, before PCA, but they are not always used in the final evaluation of the data

Spectral measurements on sugar samples from three different factories (B, E, F): 34 samples and 1023 variables (Lars Nørgaard example)

Lars Nørgaard, Foss


[Figure: raw data and the effect of centering (Lars Nørgaard, Foss)]

[Figure: raw spectrum decomposed into mean spectrum, 1st loading, 2nd loading and residual (Lars Nørgaard, Foss)]


[Figure: two-dimensional score plot, PC1 vs. PC2, separating factories B, E and F (Lars Nørgaard, Foss)]

[Figure: two-dimensional score plot, PC1 vs. PC3, separating factories B, E and F (Lars Nørgaard, Foss)]


Chemical/physical measurements on sugar samples from 3 different factories (B, E, F); 34 samples, 10 variables (Lars Nørgaard, Foss)

[Figure: raw data and the effects of centering and scaling (Lars Nørgaard, Foss)]


[Figure: score plot separating factories B, E and F (Lars Nørgaard, Foss)]

Principal component analysis

[Figure: raw data (scaled) decomposed into mean spectrum, 1st loading, 2nd loading and residual (Lars Nørgaard, Foss)]


[Figure: loading plot (Lars Nørgaard, Foss)]

[Figure: biplot (Lars Nørgaard, Foss)]


NIPALS algorithm (Non-linear Iterative PArtial Least Squares)

1. Center or autoscale X
2. Choose a random column in X as a starting guess for t
3. Solve X = tp' + E with regard to p: "p = X/t", i.e. p = X't/(t't); normalize p to length one: p = p/||p||
4. Solve X = tp' + E with regard to t: "t = X/p", i.e. t = Xp/(p'p)
5. Repeat steps 3 & 4 until convergence
6. Orthogonalize (deflate) X: Xnew = X − tp'
7. Start from step 2 again with Xnew to extract the next factor (scores and loadings); see the Python sketch below
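A direct Python transcription of the seven steps, as a sketch (the only liberty taken is using the first column of X as the starting guess rather than a random one):

```python
import numpy as np

def nipals_pca(X, n_components, tol=1e-10, max_iter=500):
    """NIPALS on a centered/autoscaled X; returns scores T, loadings P and the residual E."""
    X = X.copy()                              # step 1 (centering/autoscaling) assumed done
    T = np.zeros((X.shape[0], n_components))
    P = np.zeros((X.shape[1], n_components))
    for a in range(n_components):
        t = X[:, 0].copy()                    # step 2: a column of X as starting guess for t
        for _ in range(max_iter):
            p = X.T @ t / (t @ t)             # step 3: "p = X/t" (least squares)
            p /= np.linalg.norm(p)            #         normalize p to length one
            t_new = X @ p                     # step 4: "t = X/p" (p'p = 1)
            if np.linalg.norm(t_new - t) < tol * np.linalg.norm(t_new):
                t = t_new                     # step 5: converged
                break
            t = t_new
        X = X - np.outer(t, p)                # step 6: deflate, Xnew = X - t p'
        T[:, a], P[:, a] = t, p               # step 7: repeat with Xnew for the next factor
    return T, P, X                            # X is now the residual E
```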

Principal Component Analysis

• The ith principal component of X is the projection $Y_i = \mathbf{p}_i^T\mathbf{x}$

• The vector $\mathbf{Y} = (Y_1, \ldots, Y_k)^T = \mathbf{P}^T\mathbf{x}$ is called the vector of principal components, and

$$\mathrm{cov}(\mathbf{Y}) = \mathrm{cov}(\mathbf{P}^T\mathbf{x}) = \mathbf{P}^T\boldsymbol{\Sigma}\,\mathbf{P} = \begin{bmatrix} \lambda_1 & & 0 \\ & \ddots & \\ 0 & & \lambda_k \end{bmatrix}$$
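That cov(Y) is diagonal is easy to check numerically. A sketch, borrowing the covariance matrix Σ from the numerical example on the following slides:

```python
import numpy as np

Sigma = np.array([[1.87, 0.96],
                  [0.96, 0.99]])                # covariance matrix from the example below

rng = np.random.default_rng(1)
X = rng.multivariate_normal([0.0, 0.0], Sigma, size=5000)

S = np.cov(X, rowvar=False)                     # estimated covariance of the data
eigvals, P = np.linalg.eigh(S)                  # columns of P are the eigenvectors
Y = X @ P                                       # principal components, y = P'x per sample
assert np.allclose(np.cov(Y, rowvar=False), np.diag(eigvals))   # cov(Y) = P' S P = diag(lambda)
```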


Principal Component Analysis

• New variables Yᵢ that are linear combinations of the original variables xᵢ: $Y_i = a_{i1}x_1 + a_{i2}x_2 + \cdots + a_{ip}x_p$, i = 1, …, p

• The new variables Yi are derived in decreasing order of importance.

• They are called “principal components”.

Principal Component Analysis

• The direction of p1 is given by the eigenvector corresponding to the largest eigenvalue λ1 of the matrix C

• The second vector, orthogonal (uncorrelated) to the first, is the one with the second-highest variance, which is the eigenvector corresponding to the second-largest eigenvalue

• And so on …


Principal Component Analysis

• Let C = cov(X)

• The eigenvalues λi of C are found by solving the equation det(C- λI) = 0

• Eigenvectors are columns (rows) of the matrix P

Example

• Let us take two variables with covariance c > 0:

$$\mathbf{C} = \mathrm{cov}(\mathbf{X}) = \begin{bmatrix} 1 & c \\ c & 1 \end{bmatrix}$$

• Solving the eigenvalue problem:

$$\det(\mathbf{C} - \lambda\mathbf{I}) = \det\begin{bmatrix} 1-\lambda & c \\ c & 1-\lambda \end{bmatrix} = (1-\lambda)^2 - c^2 = 0$$

• Solving this we find:

$$\lambda_1 = 1 + c, \qquad \lambda_2 = 1 - c$$
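A quick numerical check of this result, with a made-up value c = 0.6:

```python
import numpy as np

c = 0.6                                      # any covariance 0 < c < 1 (made-up value)
C = np.array([[1.0, c],
              [c, 1.0]])

eigvals, eigvecs = np.linalg.eigh(C)         # eigh: symmetric matrices, eigenvalues ascending
assert np.allclose(eigvals, [1 - c, 1 + c])  # lambda_2 = 1 - c, lambda_1 = 1 + c
# eigenvectors: (1, -1)/sqrt(2) for 1 - c and (1, 1)/sqrt(2) for 1 + c (up to sign)
```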


Principal Component Analysis

• A numerical example. Before PCA:

$$\boldsymbol{\mu} = (0.0,\ 0.0), \qquad \boldsymbol{\Sigma} = \begin{bmatrix} 1.87 & 0.96 \\ 0.96 & 0.99 \end{bmatrix}$$

• PCA gives the loadings and eigenvalues

$$\mathbf{P} = [\mathbf{p}_1,\ \mathbf{p}_2] = \begin{bmatrix} -0.85 & 0.53 \\ -0.53 & -0.85 \end{bmatrix}, \qquad (\lambda_1,\ \lambda_2) = (2.50,\ 0.39)$$

• After PCA (in the coordinates of p1 and p2):

$$\boldsymbol{\mu} = (0.0,\ 0.0), \qquad \boldsymbol{\Sigma} = \begin{bmatrix} 2.50 & 0.00 \\ 0.00 & 0.39 \end{bmatrix}$$

[Figure: data cloud with the principal directions p1 and p2 drawn in]

Principal Component Analysis

• From the PCA we may extract the set of linear combinations that explains the most variation

• And hereby condense and reduce the dimensionality of the feature space.

• From the example before we see that the first q linear combinations explain

$$\frac{\sum_{m=1}^{q}\lambda_m}{\sum_{m=1}^{k}\lambda_m}$$

of the total variance; here $\mathbf{P} = [\mathbf{p}_1,\ \mathbf{p}_2]$ explains 86.3 % and 13.7 %.
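In code, the explained-variance fractions are just the eigenvalues divided by their sum; with the rounded eigenvalues from the example this gives about 86.5 % / 13.5 % (the slide's 86.3 % / 13.7 % come from the unrounded values):

```python
import numpy as np

eigvals = np.array([2.50, 0.39])            # eigenvalues from the example (rounded)
explained = eigvals / eigvals.sum()         # fraction of total variance per PC
print(explained)                            # approx. [0.865, 0.135]
cumulative = np.cumsum(explained)           # used to decide how many PCs to keep
```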


Output

• Loadings

– The weights

• Scores

• Plots

– Loadings plot

– Scores plot

– Biplot

Overview of PCA

• Select objects (Sneath rule of thumb: use 24 objects per supposed class)
• Select variables (in general use many)
• Examine the raw data (plot it and calculate averages and standard deviations)
• Consider transformations of the raw data
• Perform the PCA (see the code sketch after this list)
• Use cross-validation to estimate the number of components A
• Other tests for the number of components can also be used (broken stick etc.)
• In general it is good to have more than 75 % of the variance in the data explained
• Plot scores
• Plot loadings (maybe also biplots)
• Look for outliers (probably representatives of a new class or, more often, an error; leverage or high-residual outliers)
• Be conservative: do not take too many (object) outliers out before a new PCA
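A minimal end-to-end sketch of the checklist above (autoscale, PCA, explained variance, pick the number of components). It uses the simple 75 %-variance rule from the list rather than cross-validation, which is treated separately below:

```python
import numpy as np

def pca_overview(X, var_target=0.75):
    """Autoscale, run PCA via SVD, and keep enough PCs to explain var_target of the variance."""
    Xaut = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    U, s, Vt = np.linalg.svd(Xaut, full_matrices=False)
    T, P = U * s, Vt.T                                   # scores and loadings
    explained = s**2 / np.sum(s**2)                      # explained-variance fractions
    A = int(np.searchsorted(np.cumsum(explained), var_target)) + 1
    return T[:, :A], P[:, :A], explained, A

# Typical use: T, P, explained, A = pca_overview(X)
# then plot T[:, 0] vs T[:, 1] (scores), P[:, 0] vs P[:, 1] (loadings), and look for outliers.
```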


Validation

• Cross-validation (will be explained in detail in lecture on PCR)

• Jack-knifing

• Bootstrapping

• PTP analysis & Monte Carlo simulations

Crossvalidation (number of PCs?)

[Figure: predicted error on Y vs. number of PCs; underfitting to the left, overfitting to the right, with the optimum in between]


Crossvalidation (methods)

• “Leave one out” or “full validation” leaves one sample out each time
• Segmented (category) validation:
  – On replicates (independent replicates): leaves one replicate group (one segment) out each time
  – On mixtures: leaves one mixture group (one segment) out each time
• Remember that repeated experiments are dangerous to use in cross-validation (i.e. use real biological/chemical replicates!)

[Figure: segmented cross-validation example with samples 3 & 8 left out]


[Figure: segmented cross-validation example with samples 3 & 7 left out, 8 & 4 doubled]


Can we always use PCA? (test not used in UNSCRAMBLER, but in NTSYS)

• Bartlett's (1950) sphericity test (an approximate chi-squared test):

$$\chi^2 = -\left(n - 1 - \frac{2p + 5}{6}\right)\ln|\mathbf{R}|, \qquad \mathrm{d.f.} = \frac{p(p-1)}{2}$$

• ln |R| is the natural log of the determinant of the correlation matrix
• p(p − 1)/2 is the number of degrees of freedom associated with the chi-square test statistic
• p = the number of variables
• n = the number of observations

Bartlett's (1951a,b) test of whether all of the roots except the first are equal (in NTSYS under Eigenvectors):

$$\chi^2 = \left(n - \frac{2p + 5}{6}\right)\left[(p - 1)\ln\bar{\lambda} - \sum_{i=2}^{p}\ln\lambda_i\right], \qquad \bar{\lambda} = \frac{1}{p - 1}\sum_{i=2}^{p}\lambda_i$$

where d.f. = (p + 2)(p − 1)/2.
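A sketch of the first (1950) sphericity test in Python; it follows the formula above and uses SciPy only for the chi-squared tail probability (the 1951 test for the remaining roots is left out):

```python
import numpy as np
from scipy import stats

def bartlett_sphericity(X):
    """Bartlett's (1950) test of whether the correlation matrix differs from the identity."""
    n, p = X.shape
    R = np.corrcoef(X, rowvar=False)                  # correlation matrix of the variables
    chi2 = -(n - 1 - (2 * p + 5) / 6.0) * np.log(np.linalg.det(R))
    dof = p * (p - 1) / 2.0
    p_value = stats.chi2.sf(chi2, dof)                # approximate chi-squared test
    return chi2, dof, p_value                         # small p-value: PCA is worthwhile
```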


UNSCRAMBLER EXAMPLE

• Datafile: Fischerout, but add a first variable giving the grouping (1, 2, 3)

• Task: PCA, select objects (all), select variables (2-5: Iris petal & sepal width and length)

• Use autoscaling & cross-validation


Colouring the groups

• You may colour each class (if you know them!)
  – E.g. use PCA on each of the classes and create a group for each class; when you have these classes, use them for sample grouping (right-click on the score plot). (You can use these classes for SIMCA later!)

• An alternative: use K-means clustering or fuzzy clustering to get the variable showing the groups (classes)


Pros and Cons of PCA

• Positives

– Can deal with large datasets (both in objects and variables)

– There are no special assumptions on the data, and PCA can be applied to all datasets

• Negatives

– Non-linear structure is hard to model with PCA

– The meaning of the original variables may be difficult to assess directly from the latent variables (but use the loading plot, or Varimax rotation, factor analysis, etc.)

• Still:

– Non-linear PCA methods exist (kernel methods)

– Sparseness of supervised projections can be introduced to emphasize important features

Exercises in PCA

• The classical Iris data set (150 Iris plants characterized by 4 variables): Fischerout (Excel file)

• The wine data set (178 wines characterized by 13 variables): VIN2 (Excel file): for the exam!