principal component analysis


Upload: saki

Post on 16-Mar-2016


Page 1: Principal Component Analysis

UNIVERSITY OF AMSTERDAM

1

Principal Component Analysis

Biosystems Data Analysis

From molecule to networks

[Figure: protein network of SRD5A2]

[Figure: yeast metabolic network of glycolysis]

Disease gene network

[Figure: disease gene network]

Biological data

[Figure: the DATA matrix, with samples as rows and genes, proteins or metabolites as columns; a single measured value, 5.31, is shown twice]

• Repeatability
• Reproducibility
• Biological variability

How to explore such networks

[Figure: the DATA matrix (samples × genes, proteins or metabolites) is turned into a correlation matrix (genes, proteins or metabolites × genes, proteins or metabolites)]

• Results are specific for the selected samples/situation.
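As a minimal sketch of the step from a data matrix to a correlation matrix, the snippet below uses NumPy on made-up numbers. The array `data` and its size (6 samples, 4 variables) are assumptions for illustration only, not values from the lecture.

```python
import numpy as np

# Hypothetical data: 6 samples (rows) x 4 variables (columns).
rng = np.random.default_rng(0)
data = rng.normal(size=(6, 4))
# Make variable 1 follow variable 0 closely so that one strong correlation exists.
data[:, 1] = 2.0 * data[:, 0] + 0.1 * rng.normal(size=6)

# rowvar=False tells NumPy that the columns, not the rows, are the variables.
corr = np.corrcoef(data, rowvar=False)
print(np.round(corr, 2))
```

The result is a symmetric variables × variables matrix with ones on the diagonal. A different selection of samples would give different off-diagonal values.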

Goals
• If you measure multiple variables on an object, it can be important to analyze the measurements simultaneously.
• Understand the most important tool in multivariate data analysis: Principal Component Analysis.

Multiple measurements
• If there is a mutual relationship between two or more measurements, they are correlated.
• There are strong correlations and weak correlations:
– Strong: the mass of an object and the weight of that object at the Earth's surface.
– Weak: capabilities in sports and month of birth.

Correlation
• Correlation occurs everywhere!
• Example: mean height vs. age of a group of young children.

[Figure: mean height (cm, axis from 75 to 84) plotted against age (months, axis from 18 to 30)]

• A strong linear relationship between height and age is seen.
• For young children, height and age are correlated.

Moore, D.S. and McCabe, G.P., Introduction to the Practice of Statistics (1989).

Correlation in spectroscopy
• Example: a pure compound is measured at two wavelengths over a range of concentrations.

[Figure: absorbance (units) vs. wavelength (200 to 300 nm), with the wavelengths 230 nm and 265 nm marked]

Conc. (MMol)   Intensity at 230 nm   Intensity at 265 nm
5              0.166                 0.090
10             0.332                 0.181
15             0.498                 0.270
20             0.664                 0.362
25             0.831                 0.453

Correlation in spectroscopy

[Figure: absorbance at 265 nm (units) plotted against absorbance at 230 nm (units); an arrow marks increasing concentration]

• The intensities at 230 nm and 265 nm are highly correlated.
• There is only one factor underlying the data: concentration.
• The data is not two-dimensional, but one-dimensional.
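The one-dimensional claim can be checked numerically. The sketch below assumes the five intensity pairs from the spectroscopy table (concentrations 5 to 25) and uses NumPy. The variable names are illustrative choices.

```python
import numpy as np

# Intensities read from the spectroscopy table (concentrations 5, 10, 15, 20, 25).
a230 = np.array([0.166, 0.332, 0.498, 0.664, 0.831])
a265 = np.array([0.090, 0.181, 0.270, 0.362, 0.453])

# Correlation between the two wavelengths.
r = np.corrcoef(a230, a265)[0, 1]

# Singular values of the 5 x 2 matrix: if the data are essentially
# one-dimensional, almost all of the sum of squares sits on the first one.
X = np.column_stack([a230, a265])
s = np.linalg.svd(X, compute_uv=False)
share_first = s[0] ** 2 / np.sum(s ** 2)

print(f"correlation between 230 nm and 265 nm: {r:.5f}")
print(f"share of sum of squares on the first direction: {share_first:.6f}")
```

Both numbers come out very close to 1, which is the numerical counterpart of "only one factor underlying the data".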

The data matrix
• Information often comes in the form of a matrix, with objects as rows and variables as columns:

0.15   0.22   0.78   0.65
0.13   0.24   0.85   0.33
0.14   0.34   0.93   0.81
0.12   0.45   0.65   0.29

• For example,
– Spectroscopy: sample × wavelength
– Proteomics: patient × protein

Large amounts of data
• In (bio)chemical analysis, the measured data matrices can be very large.
– An infrared spectrum measured for 50 samples gives a data matrix of size 50 × 800 = 40,000 numbers!
– The metabolome of 100 patients yields a data matrix of size 100 × 1000 = 100,000 numbers.
• We need a way of extracting the important information from large data matrices.

Principal Component Analysis
• Data reduction
– PCA reduces large data matrices into two smaller matrices which can be more easily examined, plotted and interpreted.
• Data exploration
– PCA extracts the most important factors (principal components) from the data. These factors describe multivariate interactions between the measured variables.
• Data understanding
– Principal components can be used to classify samples, identify compound spectra, determine biomarkers, etc.

Different views of PCA
• Statistically, PCA is a multivariate analysis technique closely related to
– eigenvector analysis
– singular value decomposition (SVD)
• In matrix terms, PCA is a decomposition of X into two smaller matrices plus a set of residuals: $X = T P^{T} + E$
• Geometrically, PCA is a projection technique in which X is projected onto a subspace of reduced dimensions.
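To illustrate how the eigenvector view and the SVD view relate, here is a small sketch on random data (an assumption, since no data set is given at this point). It computes the principal directions both ways with NumPy and compares them.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 5))        # hypothetical data: 20 objects x 5 variables
Xc = X - X.mean(axis=0)             # mean-centre each column

# Route 1: eigenvectors of the covariance matrix of X.
cov = Xc.T @ Xc / (Xc.shape[0] - 1)
eigvals, eigvecs = np.linalg.eigh(cov)     # eigh returns eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]          # re-sort from largest to smallest
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Route 2: singular value decomposition Xc = U S Vt.
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
svd_eigvals = S ** 2 / (Xc.shape[0] - 1)   # squared singular values give the same eigenvalues

print(np.round(eigvals, 4))
print(np.round(svd_eigvals, 4))
```

The columns of `eigvecs` and the rows of `Vt` describe the same directions. They can differ in sign, which does not change the model.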

PCA: mathematics
• The basic equation for PCA is written as

$$X = t_1 p_1^{T} + t_2 p_2^{T} + \dots + t_R p_R^{T} + E = T P^{T} + E$$

where
X (I × J) is a data matrix,
T (I × R) are the scores,
P (J × R) are the loadings and
E (I × J) are the residuals.
R is the number of principal components used to describe X.
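A minimal sketch of the basic PCA equation in code, assuming the model is computed by SVD (one common route, not necessarily the one used in the lecture). The helper name `pca_model` and the random test matrix are hypothetical.

```python
import numpy as np

def pca_model(X, R):
    """Return scores T (I x R), loadings P (J x R) and residuals E (I x J)."""
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    T = U[:, :R] * S[:R]      # scores: left singular vectors scaled by singular values
    P = Vt[:R].T              # loadings: right singular vectors as columns
    E = X - T @ P.T           # whatever R components do not describe
    return T, P, E

rng = np.random.default_rng(2)
X = rng.normal(size=(8, 4))   # hypothetical data, I = 8 objects, J = 4 variables
X = X - X.mean(axis=0)

T, P, E = pca_model(X, R=2)

# The same model written as a sum of rank-one terms t_r p_r^T.
rank_one_sum = sum(np.outer(T[:, r], P[:, r]) for r in range(2))
print(np.allclose(rank_one_sum + E, X))
```

The shapes of `T`, `P` and `E` match the dimensions I × R, J × R and I × J listed for the equation.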

Principal components
• A principal component is defined by one pair of loadings and scores, $(t_r, p_r)$, sometimes also known as a latent variable.
• Principal components describe maximum variance and are calculated in order of importance, e.g.

Principal comp.   % X explained   Total % X explained
1                 45.6            45.6
2                 23.9            69.5
3                 18.1            87.6
4                 1.3             88.9

and so on... up to 100%

PCA: matrices

[Figure: block diagram of $X = T P^{T} + E$, in which X is written as a sum of principal components (each a scores vector times a loadings vector) plus the residuals E]

Scores and loadings
• Scores
– relationships between objects
– orthogonal, $T^{T} T$ = diagonal matrix
• Loadings
– relationships between variables
– orthonormal, $P^{T} P$ = identity matrix, I
• Similarities and differences between objects (or variables) can be seen by plotting scores (or loadings) against each other.
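The two orthogonality properties can be verified directly. The sketch below assumes scores and loadings obtained from an SVD of mean-centred random data. The data and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(10, 4))          # hypothetical data: 10 objects x 4 variables
Xc = X - X.mean(axis=0)

U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
T = U * S                             # scores
P = Vt.T                              # loadings

TtT = T.T @ T                         # should be diagonal
PtP = P.T @ P                         # should be the identity matrix

print(np.round(TtT, 6))
print(np.round(PtP, 6))
```

With this construction the diagonal of `TtT` holds the squared singular values, so the size of each score vector reflects how much variation its component describes.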

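Numbers example

$$
\begin{bmatrix}
1 & 1 & 0 & 1 & 1 \\
2 & 2 & 0 & 2 & 2 \\
3 & 3 & 0 & 3 & 3 \\
4 & 4 & 0 & 4 & 4 \\
5 & 5 & 0 & 5 & 5 \\
6 & 6 & 0 & 6 & 6
\end{bmatrix}
=
\begin{bmatrix} 2 \\ 4 \\ 6 \\ 8 \\ 10 \\ 12 \end{bmatrix}
\begin{bmatrix} 1/2 & 1/2 & 0 & 1/2 & 1/2 \end{bmatrix}
$$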
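The numbers example can be reproduced in a few lines. The sketch reads the slide's numbers as a scores vector (2, 4, ..., 12) and a loadings vector (1/2, 1/2, 0, 1/2, 1/2), which is an interpretation of the extracted digits, and checks that a single principal component describes the matrix exactly.

```python
import numpy as np

t = np.array([2, 4, 6, 8, 10, 12], dtype=float)   # scores, one per object
p = np.array([0.5, 0.5, 0.0, 0.5, 0.5])           # loadings, unit length
X = np.outer(t, p)                                 # 6 x 5 matrix with rows 1 1 0 1 1 ... 6 6 0 6 6

U, S, Vt = np.linalg.svd(X, full_matrices=False)
scores = U[:, 0] * S[0]
loadings = Vt[0]
if loadings.sum() < 0:                             # SVD signs are arbitrary; pick the positive one
    scores, loadings = -scores, -loadings

print(X)
print(np.round(S, 10))
print(scores, loadings)
```

Only the first singular value is non-zero, so no residual is left after one component.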

PCA: simple projection
• Simplest case: two correlated variables

[Figure: height (cm) vs. age (months) with the directions PC1 and PC2 drawn through the data; PCA turns this into a scores plot of PC 2 (0.23%) against PC 1 (99.77%)]

• PC1 describes 99.77% of the total variation in X.
• PC2 describes residual variation (0.23%).

PCA: projections
• PCA is a projection technique.
– In the previous example, we projected the two-dimensional data onto a one-dimensional space, i.e. onto a line.
– Now we will project some J-dimensional data onto a two-dimensional space, i.e. onto a plane.
• Each row of the data matrix X (I × J) can be considered as a point in J-dimensional space. This data is projected orthogonally onto a subspace of lower dimensionality.
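A sketch of the orthogonal projection of J-dimensional points onto a plane, assuming random data with J = 6 and loadings taken from an SVD. All names and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(30, 6))        # hypothetical: 30 points in 6-dimensional space
Xc = X - X.mean(axis=0)

_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
P = Vt[:2].T                        # J x 2 loadings that span the plane

T = Xc @ P                          # coordinates of each point within the plane (scores)
X_hat = T @ P.T                     # the projected points, expressed in the original J-dim space
E = Xc - X_hat                      # the part of each point that the plane does not capture

# Orthogonal projection: the residuals have no component left inside the plane.
print(np.allclose(E @ P, 0.0))
```

Plotting the two columns of `T` against each other gives the scores plot of the projected data.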

[Figure: point clouds illustrating the decomposition $X = T P^{T} + E$]

Example: Protein data
• Protein consumption across Europe was studied.
• 9 variables describe different sources of protein.
• 25 objects are the different countries.
• Data matrix has dimensions 25 × 9.
• Which countries are similar?
• Which foods are related to red meat consumption?

Weber, A., Agrarpolitik im Spannungsfeld der internationalen Ernaehrungspolitik, Institut fuer Agrarpolitik und Marktlehre, Kiel (1973).


PCA on the protein data
• The data is mean-centred and each variable is scaled to unit variance. Then a PCA is performed.

Percent Variance Captured by PCA Model

Principal   Eigenvalue   % Variance   % Variance
Component   of           Captured     Captured
Number      Cov(X)       This PC      Total
---------   ----------   ----------   ----------
1           4.01e+000    44.52        44.52
2           1.63e+000    18.17        62.68
3           1.13e+000    12.53        75.22
4           9.55e-001    10.61        85.82
5           4.64e-001     5.15        90.98
6           3.25e-001     3.61        94.59
7           2.72e-001     3.02        97.61
8           1.16e-001     1.29        98.90
9           9.91e-002     1.10       100.00

How many principal components do you want to keep? 4

[Figure: Eigenvalue vs. PC Number, for PC numbers 1 to 9]
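The percentage columns follow directly from the eigenvalues. The sketch below takes the nine eigenvalues as printed in the table (rounded to three significant digits) and recomputes the per-component and cumulative percentages with NumPy, so small differences in the second decimal are to be expected.

```python
import numpy as np

# Eigenvalues of Cov(X) as printed in the table (rounded values).
eigenvalues = np.array([4.01, 1.63, 1.13, 0.955, 0.464, 0.325, 0.272, 0.116, 0.0991])

# For autoscaled data every variable has variance 1, so the eigenvalues
# sum to the number of variables (here 9, up to rounding).
percent = 100 * eigenvalues / eigenvalues.sum()
cumulative = np.cumsum(percent)

for k, (ev, pc, cum) in enumerate(zip(eigenvalues, percent, cumulative), start=1):
    print(f"{k}  {ev:7.3f}  {pc:6.2f}  {cum:6.2f}")
```

Keeping four components corresponds to stopping at the fourth row of the cumulative column, a little under 86%.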

Scores: PC1 vs PC2

[Figure: scores plot of PC 2 (18.17%) against PC 1 (44.52%), one labelled point per country: Albania, Austria, Belgium, Bulgaria, Czechoslovakia, Denmark, East Germany, Finland, France, Greece, Hungary, Ireland, Italy, Netherlands, Norway, Poland, Portugal, Romania, Spain, Sweden, Switzerland, UK, USSR, West Germany, Yugoslavia]

Loadings

[Figure: PC1 and PC2 loadings for the nine variables: red meat, white meat, eggs, milk, fish, cereals, starch, beans/nuts/oil, fruit & veg]

Biplot: PC1 vs PC2

[Figure: biplot of PC 2 against PC 1 showing the 25 countries together with the nine protein sources]

• PC2 primarily says that the Spanish and Portuguese especially like fruit, vegetables, fish, oils.
• SE Europeans eat cereal crops.

Biplot: PC1 vs PC3

[Figure: biplot of PC 3 against PC 1 showing the 25 countries together with the nine protein sources]

• Scandinavians eat fish!
• Red meat and milk are correlated.
• The Dutch like ‘patat’ (fries)... with mayonnaise!?

Residuals
• It is also important to look at the model residuals, E.

[Figure: residual variation plotted against variable number (1 to 9)]

• Ideally, the residuals will not contain any structure - just unsystematic variation (noise).

Residuals
• The (squared) model residuals can be summed along the object or variable direction:

$$Q_i = \sum_{j=1}^{J} e_{ij}^{2}$$

[Figure: Q (sum of squared residuals) plotted against object number]

• Country 23 (USSR) fits the model least well.
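A sketch of how the Q residuals might be computed. The protein data themselves are not reproduced in the slides, so a random 25 × 9 matrix stands in for them. The choice of R = 4 follows the example, everything else is an assumption.

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(25, 9))                       # hypothetical stand-in with the protein data's shape
Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)  # mean-centre and scale to unit variance

_, _, Vt = np.linalg.svd(Xs, full_matrices=False)
P = Vt[:4].T                                       # loadings of a model with R = 4 components
E = Xs - Xs @ P @ P.T                              # model residuals, same shape as Xs

Q_objects = np.sum(E ** 2, axis=1)                 # Q_i: summed over variables j, one value per object
Q_variables = np.sum(E ** 2, axis=0)               # summed over objects i, one value per variable

worst = int(np.argmax(Q_objects))                  # zero-based index of the worst-fitting object
print(f"object {worst + 1} has the largest Q: {Q_objects[worst]:.3f}")
```

An object with an unusually large Q is described poorly by the retained components, which is how the USSR stands out in the plot.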

Centering and scaling
• We are often interested in the differences between objects, not in their absolute values.
– protein data: differences between countries
• If different variables are measured in different units, some scaling is needed to give each variable an equal chance of contributing to the model.

Mean-centering
• Subtract the mean from each column of X:

X:
6.7      38.1     10711
6.3      36.2     10548
6.5      35.5     11857
6.6      37.2     10245
column means: 6.525   36.75   10840

After mean-centering:
 0.175    1.350   -129.3
-0.225   -0.550   -292.3
-0.025   -1.250   1016
 0.075    0.450   -595.2
column means: 0.0   0.0   0.0

Scaling
• Divide each column of X by its standard deviation:

Mean-centered X:
 0.175    1.350   -129.3
-0.225   -0.550   -292.3
-0.025   -1.250   1016
 0.075    0.450   -595.2
column standard deviations: 0.171   1.139   704.8

After scaling:
 1.025    1.186   -0.183
-1.318   -0.483   -0.415
-0.146   -1.098    1.443
 0.439    0.395   -0.845
column standard deviations: 1.0   1.0   1.0
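Mean-centering followed by scaling can be written in a few lines of NumPy. The 4 × 3 matrix below is a reading of the example numbers on the mean-centering slide and should be treated as an assumption. It does reproduce the column means and standard deviations quoted there.

```python
import numpy as np

# Example matrix: 4 objects x 3 variables measured on very different scales.
X = np.array([
    [6.7, 38.1, 10711.0],
    [6.3, 36.2, 10548.0],
    [6.5, 35.5, 11857.0],
    [6.6, 37.2, 10245.0],
])

mean = X.mean(axis=0)            # column means
Xc = X - mean                    # mean-centering: every column now has mean 0

std = Xc.std(axis=0, ddof=1)     # sample standard deviation (divides by I - 1)
Xs = Xc / std                    # scaling: every column now has standard deviation 1

print(np.round(mean, 3))
print(np.round(std, 3))
print(np.round(Xs, 3))
```

Without the scaling step the third column, with values in the thousands, would dominate the first principal component purely because of its units.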

How many PCs to use?

$X = T P^{T} + E$, where $T P^{T}$ holds the systematic variation and E the noise.

• Too few PCs:
– some systematic variation is not described.
– model does not fully summarise the data.
• Too many PCs:
– later PCs describe noise.
– model is not robust when applied to new data.
• How to select the correct number of PCs?

How many PCs to use?
• Eigenvalue plots
• Select components where explained % variance > noise level
• Look at PC scores and loadings - do they make sense?! Do residuals have structure?
• Cross-validation

[Figure: Eigenvalue vs. PC Number for the protein data; the ‘knee’ in the curve suggests selecting 4 PCs]

Cross-validation
• Remove a subset of the data - the test set.
• Build a model on the remaining data - the training set.
• Project the test set onto the model and calculate the residuals.
• Repeat for the next test set.
• Repeat for R = 1, 2, 3...
• Calculate PRESS:

$$\mathrm{PRESS}_r = \sum_{i=1}^{I} \sum_{j=1}^{J} e_{ij}^{2}$$

[Figure: the data matrix drawn as a block of dots, with a subset of the data marked as the test set]
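The cross-validation steps can be sketched as follows. This is a deliberately simple version that holds out whole rows, which is an assumption about the scheme. The function name `press_rowwise`, the fold count and the random data are all hypothetical.

```python
import numpy as np

def press_rowwise(X, max_R, n_splits=5):
    """PRESS for R = 1..max_R, holding out groups of rows as test sets."""
    folds = np.array_split(np.arange(X.shape[0]), n_splits)
    press = np.zeros(max_R)
    for test_idx in folds:
        train = np.delete(X, test_idx, axis=0)     # training set: all remaining rows
        mean = train.mean(axis=0)
        _, _, Vt = np.linalg.svd(train - mean, full_matrices=False)
        test = X[test_idx] - mean                  # centre the test set with the training mean
        for R in range(1, max_R + 1):
            P = Vt[:R].T                           # loadings of the R-component model
            E = test - test @ P @ P.T              # residuals after projecting the test rows
            press[R - 1] += np.sum(E ** 2)         # accumulate squared residuals over i and j
    return press

rng = np.random.default_rng(6)
X = rng.normal(size=(25, 9))                       # hypothetical data: 25 objects x 9 variables
press = press_rowwise(X, max_R=8)
print(np.round(press, 2))
```

One caveat of this simplified scheme: each test row is projected using all of its own variables, so PRESS can only decrease as R grows. Schemes that also hold out individual variables or elements of the test rows are needed to obtain a PRESS curve with a minimum.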

PRESS plot

[Figure: two panels plotted against latent variable number (1 to 8): the eigenvalue of Cov(x), and PRESS(r)]

• First minimum at 2 PCs
• Overall minimum at 4 PCs
• 8 PCs gives very high CV error

Outliers
• Outliers are objects which are very different from the rest of the data. These can have a large effect on the principal component model and should be removed.

[Figure: T (°C) plotted against pH, before and after removing an outlier labelled "bad experiment"]

Outliers
• Outliers can also be found in the model space or in the residuals.

[Figure: scores plot of PC 2 against PC 1, and sum-of-squared residuals plotted against time (min)]

Model extrapolation can be dangerous!

[Figure: height (cm, 0 to 300) plotted against age (years, 0 to 30), with the linear model extended beyond the measured range]

• Linear model was valid for this age range...
• ...but is not valid for 30 year olds!

Conclusions
• Principal component analysis (PCA) reduces large, collinear matrices into two smaller matrices - scores and loadings:

$$X = t_1 p_1^{T} + t_2 p_2^{T} + \dots + t_R p_R^{T} + E = T P^{T} + E$$

• Principal components
– describe the important variation in the data.
– are calculated in order of importance.
– are orthogonal.

Conclusions
• Scores plots and biplots can be useful for exploring and understanding the data.
• It is often appropriate to mean-center and scale the variables prior to analysis.
• It is important to include the correct number of PCs in the PCA model. One method for determining this is called cross-validation.