principal component analysis

UNIVERSITY OF AMSTERDAM

1

Principal Component Analysis

Biosystems Data Analysis


2

From molecule to networks

Protein network of SRD5A2

Yeast metabolic network of Glycolysis


3

Disease gene network


4

Biological data

[Figure: DATA matrix with samples along one axis and genes, proteins or metabolites along the other; example entry 5.31]

Repeatability (Dutch: herhaalbaarheid)

Reproducibility (Dutch: reproduceerbaarheid)

Biological variability


5

How to explore such networks

[Figure: the DATA matrix (samples by genes, proteins or metabolites) shown next to a correlation matrix between the genes, proteins or metabolites]

Results are specific for the selected samples/situation.


6

Goals

• If you measure multiple variables on an object, it can be important to analyze the measurements simultaneously.

• Understand the most important tool in multivariate data analysis: Principal Component Analysis.


7

Multiple measurements

• If there is a mutual relationship between two or more measurements, they are correlated.

• There are strong correlations and weak correlations:

– Strong: the mass of an object and the weight of that object on the Earth's surface

– Weak: capabilities in sports and month of birth


8

Correlation

• Correlation occurs everywhere!

• Example: mean height vs. age of a group of young children

[Figure: scatter plot of height (cm, 75 to 84) against age (months, 18 to 30)]

• A strong linear relationship between height and age is seen.

• For young children, height and age are correlated.

Moore, D.S. and McCabe, G.P., Introduction to the Practice of Statistics (1989).


9

Correlation in spectroscopy

• Example: a pure compound is measured at two wavelengths over a range of concentrations

[Figure: absorbance (units, 0 to 0.9) against wavelength (nm, 200 to 300), with the wavelengths 230 and 265 marked]

Conc. (mmol)   Intensity at 230nm   Intensity at 265nm
 5             0.166                0.090
10             0.332                0.181
15             0.498                0.270
20             0.664                0.362
25             0.831                0.453


10

Correlation in spectroscopy

[Figure: absorbance at 265nm (units, 0 to 0.5) against absorbance at 230nm (units, 0 to 1); an arrow along the points marks increasing concentration]

• The intensities at 230 and 265 nm are highly correlated.

• There is only one factor underlying the data: concentration.

• The data is not two-dimensional, but one-dimensional.
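The one-factor claim can be checked numerically. The sketch below uses the five pairs of absorbance values from the spectroscopy example (as read from the slide table) and assumes NumPy is available. The variable names are illustrative and not part of the original slides.

```python
import numpy as np

# Absorbance values from the spectroscopy example.
# Columns: 230 nm, 265 nm; rows: the five concentration levels.
A = np.array([
    [0.166, 0.090],
    [0.332, 0.181],
    [0.498, 0.270],
    [0.664, 0.362],
    [0.831, 0.453],
])

# Pearson correlation between the two wavelengths.
r = np.corrcoef(A[:, 0], A[:, 1])[0, 1]

# Singular values of the (uncentered) matrix: if one value dominates,
# the data are essentially one-dimensional.
s = np.linalg.svd(A, compute_uv=False)
share = s[0] ** 2 / np.sum(s ** 2)

print(f"correlation r = {r:.4f}")
print(f"variation carried by the first factor = {100 * share:.2f}%")
```

Almost all of the variation sits in a single factor, which is the sense in which the two-column data set is one-dimensional.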


11

The data matrix

• Information often comes in the form of a matrix:

[Figure: example numeric data matrix, with objects as rows and variables as columns]

• For example,

– Spectroscopy: sample × wavelength

– Proteomics: patient × protein


12

Large amounts of data

• In (bio)chemical analysis, the measured data matrices can be very large.

– An infrared spectrum measured for 50 samples gives a data matrix of size 50 × 800 = 40,000 numbers!

– The metabolome of 100 patients yields a data matrix of size 100 × 1000 = 100,000 numbers.

• We need a way of extracting the important information from large data matrices.


13

Principal Component Analysis

• Data reduction

– PCA reduces large data matrices into two smaller matrices which can be more easily examined, plotted and interpreted.

• Data exploration

– PCA extracts the most important factors (principal components) from the data. These factors describe multivariate interactions between the measured variables.

• Data understanding

– Principal components can be used to classify samples, identify compound spectra, determine biomarkers, etc.


14

Different views of PCA

• Statistically, PCA is a multivariate analysis technique closely related to

– eigenvector analysis

– singular value decomposition (SVD)

• In matrix terms, PCA is a decomposition of X into two smaller matrices plus a set of residuals: $X = T P^T + E$

• Geometrically, PCA is a projection technique in which X is projected onto a subspace of reduced dimensions.


15

PCA: mathematics

• The basic equation for PCA is written as

$$X = t_1 p_1^T + t_2 p_2^T + \dots + t_R p_R^T + E = T P^T + E$$

where

X (I × J) is a data matrix,

T (I × R) are the scores,

P (J × R) are the loadings and

E (I × J) are the residuals.

R is the number of principal components used to describe X.
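One common way to obtain T, P and E is the singular value decomposition of the mean-centered data. The sketch below assumes NumPy and uses made-up random data. The function name `pca` and its arguments are illustrative, not something defined in the slides.

```python
import numpy as np

def pca(X, R):
    """Return scores T (I x R), loadings P (J x R) and residuals E (I x J)."""
    Xc = X - X.mean(axis=0)                  # mean-center each column
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    T = U[:, :R] * s[:R]                     # scores: left singular vectors times singular values
    P = Vt[:R].T                             # loadings: first R right singular vectors
    E = Xc - T @ P.T                         # residuals: what R components do not describe
    return T, P, E

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))                 # made-up data, I = 20 objects, J = 5 variables
T, P, E = pca(X, R=2)

Xc = X - X.mean(axis=0)
print(T.shape, P.shape, E.shape)             # (20, 2) (5, 2) (20, 5)
print(np.allclose(Xc, T @ P.T + E))          # the decomposition reproduces the centered data
print(np.allclose(Xc @ P, T))                # scores are the projection of X onto the loadings
```

Note that the decomposition here is applied to the centered matrix, so the equation holds for X after its column means have been subtracted.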


16

Principal components

• A principal component is defined by one pair of loadings and scores, $(t_r, p_r)$, sometimes also known as a latent variable.

• Principal components describe maximum variance and are calculated in order of importance, e.g.

Principal comp.   % X explained   Total % X explained
1                 45.6            45.6
2                 23.9            69.5
3                 18.1            87.6
4                  1.3            88.9

and so on... up to 100%


17

PCA: matrices

[Figure: block diagram of $X = t_1 p_1^T + \dots + t_R p_R^T + E = T P^T + E$; each principal component is the product of one column of scores and one row of loadings]


18

Scores and loadings

• Scores

– relationships between objects

– orthogonal, $T^T T$ = diagonal matrix

• Loadings

– relationships between variables

– orthonormal, $P^T P$ = identity matrix, I

• Similarities and differences between objects (or variables) can be seen by plotting scores (or loadings) against each other.
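The two orthogonality properties can be verified on any data set. The sketch below assumes NumPy, computes scores and loadings through the SVD of mean-centered made-up data, and checks both conditions. All names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(12, 4))                 # made-up data
Xc = X - X.mean(axis=0)                      # mean-center each column

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
T = U * s                                    # scores, one column per component
P = Vt.T                                     # loadings, one column per component

TtT = T.T @ T
PtP = P.T @ P
off_diag = TtT - np.diag(np.diag(TtT))       # everything except the diagonal of T^T T

print(np.allclose(off_diag, 0))              # scores are orthogonal
print(np.allclose(PtP, np.eye(P.shape[1])))  # loadings are orthonormal
print(np.allclose(np.diag(TtT), s ** 2))     # diagonal holds the squared singular values
```

With this convention the diagonal of $T^T T$ carries the amount of variation described by each component, which is why scores are orthogonal but not normalised.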


19

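Numbers example

$$
\begin{bmatrix}
1 & 1 & 0 & 1 & 1 \\
2 & 2 & 0 & 2 & 2 \\
3 & 3 & 0 & 3 & 3 \\
4 & 4 & 0 & 4 & 4 \\
5 & 5 & 0 & 5 & 5 \\
6 & 6 & 0 & 6 & 6
\end{bmatrix}
=
\begin{bmatrix} 2 \\ 4 \\ 6 \\ 8 \\ 10 \\ 12 \end{bmatrix}
\begin{bmatrix} 1/2 & 1/2 & 0 & 1/2 & 1/2 \end{bmatrix}
$$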
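Reading the numbers example as one score vector t = (2, 4, ..., 12) multiplied by one loading vector p = (1/2, 1/2, 0, 1/2, 1/2), the matrix can be rebuilt and checked in a few lines. This is a sketch assuming NumPy, and the reading of the slide is itself an interpretation of the extracted numbers.

```python
import numpy as np

t = np.array([2, 4, 6, 8, 10, 12], dtype=float)   # scores, one value per object
p = np.array([0.5, 0.5, 0.0, 0.5, 0.5])           # loadings, one value per variable

X = np.outer(t, p)                                # X = t p^T, a 6 x 5 matrix

print(X)
print(np.linalg.matrix_rank(X))                   # a single component reproduces X exactly
print(p @ p)                                      # the loading vector has unit length
```

Because X has rank one, a PCA model with R = 1 and no centering leaves no residuals for this matrix.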


20

PCA: simple projection

• Simplest case: two correlated variables

[Figure: height (cm) against age (months), with the directions of PC1 and PC2 drawn through the data]

• PC1 describes 99.77% of the total variation in X.

• PC2 describes residual variation (0.23%).

[Figure: scores plot after PCA, Scores PC 1 (99.77%) against Scores PC 2 (0.23%)]


21

PCA: projections

• PCA is a projection technique.

– In the previous example, we projected the two-dimensional data onto a one-dimensional space, i.e. onto a line.

– Now we will project some J-dimensional data onto a two-dimensional space, i.e. onto a plane.

• Each row of the data matrix X (I × J) can be considered as a point in J-dimensional space. This data is projected orthogonally onto a subspace of lower dimensionality.


22

[Figure: data points projected onto a plane, illustrating $X = T P^T + E$]


23

Example: Protein data

• Protein consumption across Europe was studied.

• 9 variables describe different sources of protein.

• 25 objects are the different countries.

• Data matrix has dimensions 25 × 9.

• Which countries are similar?

• Which foods are related to red meat consumption?

Weber, A., Agrarpolitik im Spannungsfeld der internationalen Ernaehrungspolitik, Institut fuer Agrarpolitik und Marktlehre, Kiel (1973).


24


25

PCA on the protein data

• The data is mean-centred and each variable is scaled to unit variance. Then a PCA is performed.

Percent Variance Captured by PCA Model

Principal    Eigenvalue   % Variance   % Variance
Component    of           Captured     Captured
Number       Cov(X)       This PC      Total
---------    ----------   ----------   ----------
1            4.01e+000    44.52         44.52
2            1.63e+000    18.17         62.68
3            1.13e+000    12.53         75.22
4            9.55e-001    10.61         85.82
5            4.64e-001     5.15         90.98
6            3.25e-001     3.61         94.59
7            2.72e-001     3.02         97.61
8            1.16e-001     1.29         98.90
9            9.91e-002     1.10        100.00

How many principal components do you want to keep? 4

[Figure: eigenvalue vs. PC number, for PC 1 to 9]
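The percentages in the table follow directly from the eigenvalues. Since every variable was scaled to unit variance, the nine eigenvalues should add up to about 9, and each percentage is an eigenvalue divided by that total. The sketch below assumes NumPy and takes the eigenvalues as printed (rounded to three digits), so the recomputed percentages can differ from the table in the second decimal.

```python
import numpy as np

# Eigenvalues of Cov(X) as printed in the table (rounded values).
eig = np.array([4.01, 1.63, 1.13, 0.955, 0.464, 0.325, 0.272, 0.116, 0.0991])

pct = 100 * eig / eig.sum()      # % variance captured by each PC
cum = np.cumsum(pct)             # cumulative % variance captured

for k, (e, p, c) in enumerate(zip(eig, pct, cum), start=1):
    print(f"PC{k}: eigenvalue {e:.3f}  {p:5.2f}%  cumulative {c:6.2f}%")
```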


26

Scores: PC1 vs PC2

[Figure: scores plot, Scores PC 1 (44.52%) against Scores PC 2 (18.17%), with one labelled point per country: Albania, Austria, Belgium, Bulgaria, Czechoslovakia, Denmark, East Germany, Finland, France, Greece, Hungary, Ireland, Italy, Netherlands, Norway, Poland, Portugal, Romania, Spain, Sweden, Switzerland, UK, USSR, West Germany, Yugoslavia]


27

Loadings

[Figure: PC loadings (about -0.8 to 0.6) for PC1 and PC2, plotted per variable: Red meat, White meat, Eggs, Milk, Fish, Cereals, Starch, Beans/nuts/oil, Fruit & veg]


28

Biplot: PC1 vs PC2

[Figure: biplot of PC 1 against PC 2, showing the 25 countries together with the nine protein sources (Red meat, White meat, Eggs, Milk, Fish, Cereals, Starch, Beans/nuts/oil, Fruit & veg)]

PC2 primarily says that the Spanish and Portuguese especially like fruit, vegetables, fish, oils.

SE Europeans eat cereal crops


29

Biplot: PC1 vs PC3

[Figure: biplot of PC 1 against PC 3, showing the 25 countries together with the nine protein sources (Red meat, White meat, Eggs, Milk, Fish, Cereals, Starch, Beans/nuts/oil, Fruit & veg)]

Scandinavians eat fish!

Red meat and milk are correlated

The Dutch like ‘patat’ (fries)...

...with mayonnaise!?


30

Residuals

• It is also important to look at the model residuals, E.

[Figure: residual variation (about -1 to 1.5) against variable number (1 to 9)]

• Ideally, the residuals will not contain any structure - just unsystematic variation (noise).


31

Residuals

• The (squared) model residuals can be summed along the object or variable direction:

$$Q_i = \sum_{j=1}^{J} e_{ij}^2$$

[Figure: Q (sum of squared residuals) against object number]

Country 23 (USSR) fits the model least well.
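As a rough illustration of how Q is obtained, the sketch below fits a four-component model to a made-up 25 × 9 matrix (a stand-in with the same shape as the protein data, not the real values) and sums the squared residuals per object and per variable. It assumes NumPy, and the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(25, 9))                          # made-up stand-in, 25 objects x 9 variables
Xa = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)     # mean-center and scale to unit variance

R = 4                                                 # number of components kept
U, s, Vt = np.linalg.svd(Xa, full_matrices=False)
P = Vt[:R].T                                          # loadings of the R-component model
E = Xa - Xa @ P @ P.T                                 # residuals after projecting onto the model

Q_objects = np.sum(E ** 2, axis=1)                    # Q_i: one value per object (row)
Q_variables = np.sum(E ** 2, axis=0)                  # one value per variable (column)

worst = int(np.argmax(Q_objects)) + 1                 # 1-based object number
print(f"object {worst} fits the model least well")
```

Both sums add up to the same total, which equals the variation left in the discarded components.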


32

Centering and scaling

• We are often interested in the differences between objects, not in their absolute values.

– protein data: differences between countries

• If different variables are measured in different units, some scaling is needed to give each variable an equal chance of contributing to the model.


33

Mean-centering

• Subtract the mean from each column of X:

[Figure: numeric example of mean-centering; the vector of column means $\bar{x}$ is subtracted from every row of X, after which every column mean is 0.0]


34

Scaling

• Divide each column of X by its standard deviation:

[Figure: numeric example of scaling; each column of the mean-centered matrix is divided by its standard deviation, after which every column has standard deviation 1.0]
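Mean-centering followed by scaling (often called autoscaling) takes two lines with NumPy. The matrix below is made up for illustration, with three variables on very different scales. It is not the numeric example from the slides.

```python
import numpy as np

# Made-up matrix: 4 objects, 3 variables measured on very different scales.
X = np.array([
    [1.0, 200.0, 0.03],
    [2.0, 180.0, 0.01],
    [3.0, 260.0, 0.04],
    [4.0, 240.0, 0.02],
])

mean = X.mean(axis=0)            # column means
std = X.std(axis=0, ddof=1)      # column sample standard deviations

Xc = X - mean                    # mean-centering: every column mean becomes 0
Xa = Xc / std                    # scaling: every column gets standard deviation 1

print(Xa.mean(axis=0))
print(Xa.std(axis=0, ddof=1))
```

The choice of `ddof=1` (sample standard deviation) is an assumption. Using `ddof=0` changes every column by the same factor and does not affect the loadings.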


35

How many PCs to use?

$X = T P^T + E$, where $T P^T$ describes the systematic variation and E the noise.

• Too few PCs:

– some systematic variation is not described.

– model does not fully summarise the data.

• Too many PCs:

– later PCs describe noise.

– model is not robust when applied to new data.

• How to select the correct number of PCs?


36

How many PCs to use?

• Eigenvalue plots

[Figure: eigenvalue vs. PC number for PC 1 to 9, annotated "‘Knee’ here - select 4 PCs"]

• Select components where explained % variance > noise level

• Look at PC scores and loadings - do they make sense?! Do residuals have structure?

• Cross-validation


37

Cross-validation

• Remove subset of the data - test set.

• Build model on remaining data - training set.

• Project test set onto model - calculate residuals.

• Repeat for next test set.

• Repeat for R = 1,2,3...

• Calculate PRESS:

$$\mathrm{PRESS}_r = \sum_{i=1}^{I} \sum_{j=1}^{J} e_{ij}^2$$

[Figure: data matrix drawn as a grid of dots, with a subset marked as the test set]
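The steps above can be sketched as follows. The slides do not say which cross-validation scheme is used, so this is one possible variant and not necessarily the one behind the plots: rows are left out in folds, and for each left-out row every variable is predicted in turn from the remaining variables, because projecting complete test rows onto the model tends to favour ever more components. The function name `press_curve`, the fold assignment and the simulated data are all assumptions.

```python
import numpy as np

def press_curve(X, max_R, n_folds=5):
    """PRESS for R = 1..max_R using row-wise folds and variable-wise prediction."""
    I, J = X.shape
    folds = np.arange(I) % n_folds               # assign rows to folds in turn
    press = np.zeros(max_R)
    for f in range(n_folds):
        train, test = X[folds != f], X[folds == f]
        mean = train.mean(axis=0)
        _, _, Vt = np.linalg.svd(train - mean, full_matrices=False)
        Xt = test - mean                         # center test rows with the training mean
        for R in range(1, max_R + 1):
            P = Vt[:R].T                         # loadings of the training model (J x R)
            for j in range(J):
                keep = np.arange(J) != j
                # estimate the scores without variable j, then predict variable j
                t, *_ = np.linalg.lstsq(P[keep], Xt[:, keep].T, rcond=None)
                pred = P[j] @ t
                press[R - 1] += np.sum((Xt[:, j] - pred) ** 2)
    return press

# Simulated data with two real components plus a little noise.
rng = np.random.default_rng(3)
T_true = rng.normal(size=(30, 2)) * np.array([5.0, 3.0])
P_true, _ = np.linalg.qr(rng.normal(size=(6, 2)))
X = T_true @ P_true.T + 0.1 * rng.normal(size=(30, 6))

press = press_curve(X, max_R=4)
print(press)
```

For data built from two components, PRESS is expected to drop sharply from R = 1 to R = 2. Where exactly the minimum falls beyond that depends on the noise.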


38

PRESS plot

[Figure: two panels against latent variable number (1 to 8): eigenvalue of Cov(x), and PRESS(r)]

First minimum at 2 PCs

Overall minimum at 4 PCs

8 PCs gives very high CV error


39

Outliers

• Outliers are objects which are very different from the rest of the data. These can have a large effect on the principal component model and should be removed.

[Figure: T (°C) against pH, shown twice: with the outlier, marked "bad experiment", and after the step "Remove outlier"]


40

Outliers

• Outliers can also be found in the model space or in the residuals.

[Figure: scores plot, Scores PC 1 against Scores PC 2]

[Figure: sum-of-squared residuals against time (min, 22 to 42)]


41

Model extrapolation can be dangerous!

[Figure: height (cm, 0 to 300) against age (years, 0 to 30), with the linear model extended beyond the measured ages]

Linear model was valid for this age range...

...but is not valid for 30 year olds!


42

Conclusions

• Principal component analysis (PCA) reduces large, collinear matrices into two smaller matrices - scores and loadings:

$$X = t_1 p_1^T + t_2 p_2^T + \dots + t_R p_R^T + E = T P^T + E$$

• Principal components

– describe the important variation in the data.

– are calculated in order of importance.

– are orthogonal.


43

Conclusions

• Scores plots and biplots can be useful for exploring and understanding the data.

• It is often appropriate to mean-center and scale the variables prior to analysis.

• It is important to include the correct number of PCs in the PCA model. One method for determining this is called cross-validation.
