principal component analysis

UNIVERSITY OF AMSTERDAM

Principal Component Analysis

Biosystems Data Analysis

From molecule to networks

Protein network of SRD5A2

Yeast metabolic network of Glycolysis

Disease gene network

Biological data

Genes or proteins or metabolites

Repeatability

Reproducibility

Biological variability

How to explore such networks

Results are specific to the selected samples/situation

Correlation matrix

Goals

• If you measure multiple variables on an object, it can be important to analyze the measurements simultaneously.

• Understand the most important tool in multivariate data analysis: Principal Component Analysis.

Multiple measurements

• If there is a mutual relationship between two or more measurements, they are correlated.

• There are strong correlations and weak correlations:

– Strong: the mass of an object and the weight of that object on the earth's surface

– Weak: capabilities in sports and month of birth

Correlation

• Correlation occurs everywhere!

[Figure: mean height vs. age; x-axis: age (months), 18 to 30]

• Example: mean height vs. age of a group of young children

• A strong linear relationship between height and age is seen.

• For young children, height and age are correlated.

Moore, D.S. and McCabe G.P., Introduction to the Practice of Statistics (1989).

Correlation in spectroscopy

[Figure: spectra; x-axis: wavelength (nm), 200 to 300, with 230 and 265 marked]

Conc. (MMol)   Intensity at 230nm   Intensity at 265nm
5              0.166                0.090
10             0.332                0.181
15             0.498                0.270
20             0.664                0.362
25             0.831                0.453

• Example: a pure compound is measured at two wavelengths over a range of concentrations

[Figure: plot with x-axis absorbance at 230 nm (units), annotated 'increasing concentration']

Correlation in spectroscopy

• The intensities at 230 nm and 265 nm are highly correlated.

• There is only one factor underlying the data: concentration.

• The data is not two-dimensional, but one-dimensional.
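The claim that the two-wavelength data is effectively one-dimensional can be checked numerically. The sketch below uses the five intensity pairs from the spectroscopy example; the variable names and the use of an uncentred SVD are illustrative choices, not part of the original slides.

```python
import numpy as np

# Intensities from the spectroscopy example (columns: 230 nm, 265 nm).
# Rows are ordered by increasing concentration.
X = np.array([
    [0.166, 0.090],
    [0.332, 0.181],
    [0.498, 0.270],
    [0.664, 0.362],
    [0.831, 0.453],
])

# Correlation between the two wavelengths.
r = np.corrcoef(X[:, 0], X[:, 1])[0, 1]

# Singular values of the (uncentred) matrix: if one factor underlies the
# data, the first singular value carries almost all of the sum of squares.
singular_values = np.linalg.svd(X, compute_uv=False)
share_first = singular_values[0] ** 2 / np.sum(singular_values ** 2)

print(f"correlation between 230 nm and 265 nm: {r:.5f}")
print(f"share of sum of squares in first component: {share_first:.6f}")
```

A share close to 1 for the first component is what "one factor underlying the data" looks like in numbers.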

The data matrix

• Information often comes in the form of a matrix:

[Figure: example data matrix with objects as rows and variables as columns]

• For example,

– Spectroscopy: sample × wavelength

– Proteomics: patient × protein

Large amounts of data

• In (bio)chemical analysis, the measured data matrices can be very large.

– An infrared spectrum measured for 50 samples gives a data matrix of size 50 × 800 = 40,000 numbers!

– The metabolome of 100 patients yields a data matrix of size 100 × 1000 = 100,000 numbers.

• We need a way of extracting the important information from large data matrices.

Principal Component Analysis

• Data reduction

– PCA reduces large data matrices into two smaller matrices which can be more easily examined, plotted and interpreted.

• Data exploration

– PCA extracts the most important factors (principal components) from the data. These factors describe multivariate interactions between the measured variables.

• Data understanding

– Principal components can be used to classify samples, identify compound spectra, determine biomarkers, etc.

Different views of PCA

• Statistically, PCA is a multivariate analysis technique closely related to

– eigenvector analysis

– singular value decomposition (SVD)

• In matrix terms, PCA is a decomposition of X into two smaller matrices plus a set of residuals: $X = TP^T + E$

• Geometrically, PCA is a projection technique in which X is projected onto a subspace of reduced dimensions.
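The three views can be compared on a small example. The following is a minimal sketch on hypothetical random data (the sizes, seed and noise level are assumptions made for illustration): scores and loadings are taken from the SVD of the mean-centred matrix, and the same loadings are recovered, up to sign, from an eigenvector analysis of $X^T X$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: I objects, J variables, R underlying factors plus noise.
I, J, R = 20, 5, 2
X = rng.normal(size=(I, R)) @ rng.normal(size=(R, J)) + 0.05 * rng.normal(size=(I, J))
Xc = X - X.mean(axis=0)          # mean-centre each column

# SVD view: Xc = U S V^T, scores T = U S, loadings P = V.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
T = U[:, :R] * s[:R]             # scores   (I x R)
P = Vt[:R].T                     # loadings (J x R)
E = Xc - T @ P.T                 # residuals (I x J)

# Eigenvector view: eigenvectors of Xc^T Xc, sorted by decreasing eigenvalue.
eigvals, eigvecs = np.linalg.eigh(Xc.T @ Xc)
order = np.argsort(eigvals)[::-1]
eigvals = eigvals[order]
P_eig = eigvecs[:, order[:R]]

print("squared singular values:", np.round(s[:R] ** 2, 4))
print("largest eigenvalues:    ", np.round(eigvals[:R], 4))
```

The eigenvalues equal the squared singular values, and each eigenvector matches the corresponding loading vector apart from a possible sign flip.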

PCA: mathematics

• The basic equation for PCA is written as

$$X = t_1 p_1^T + \dots + t_R p_R^T + E = TP^T + E$$

where

X (I × J) is a data matrix,

T (I × R) are the scores,

P (J × R) are the loadings and

E (I × J) are the residuals.

R is the number of principal components used to describe X.

Principal components

• A principal component is defined by one pair of loadings and scores, $(t_r, p_r)$, sometimes also known as a latent variable.

• Principal components describe maximum variance and are calculated in order of importance, e.g.

Principal comp.   % X explained   Total % X explained
1                 45.6            45.6
2                 23.9            69.5
3                 18.1            87.6
4                 1.3             88.9

and so on... up to 100%

PCA: matrices

[Figure: X drawn as a sum of principal components, each the product of a scores vector and a loadings vector]

Scores and loadings

• Scores

– relationships between objects

– orthogonal, $T^T T$ = diagonal matrix

• Loadings

– relationships between variables

– orthonormal, $P^T P = I$ (identity matrix)

• Similarities and differences between objects (or variables) can be seen by plotting scores (or loadings) against each other.
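The two orthogonality properties can be verified directly. This is a small sketch on hypothetical random data, assuming scores and loadings are obtained from the SVD of the mean-centred matrix; the names `TtT` and `PtP` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: 12 objects, 4 variables.
X = rng.normal(size=(12, 4))
Xc = X - X.mean(axis=0)

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
R = 3
T = U[:, :R] * s[:R]   # scores
P = Vt[:R].T           # loadings

TtT = T.T @ T          # expected: diagonal, with squared singular values
PtP = P.T @ P          # expected: identity

print(np.round(TtT, 6))
print(np.round(PtP, 6))
```

The diagonal of `TtT` holds the sum of squares described by each component, which is why the components come out ordered by importance.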

Numbers example

[Figure: worked numerical example of a PCA decomposition]

PCA: simple projection

• Simplest case: two correlated variables

[Figure: the two variables plotted against each other; x-axis: age (months), 18 to 30]

• PC1 describes 99.77% of the total variation in X.

[Figure: scores plot; x-axis: scores PC 1 (99.77%)]

• PC2 describes residual variation (0.23%).

PCA: projections

• PCA is a projection technique.

• Each row of the data matrix X (I × J) can be considered as a point in J-dimensional space. This data is projected orthogonally onto a subspace of lower dimensionality.

– In the previous example, we projected the two-dimensional data onto a one-dimensional space, i.e. onto a line.

– Now we will project some J-dimensional data onto a two-dimensional space, i.e. onto a plane.

[Figure: points in J-dimensional space projected onto a plane, illustrating $X = TP^T + E$]
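The projection onto a plane can be sketched in a few lines. The example below uses hypothetical three-dimensional points scattered around a plane (the data, seed and noise level are assumptions); it computes the scores as coordinates within the plane and confirms that the residuals are orthogonal to it.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical data: 30 points in J = 3 dimensions lying close to a plane.
I, J = 30, 3
plane = rng.normal(size=(2, J))
X = rng.normal(size=(I, 2)) @ plane + 0.1 * rng.normal(size=(I, J))
Xc = X - X.mean(axis=0)

# Loadings of the first two principal components span the fitted plane.
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
P = Vt[:2].T            # J x 2, orthonormal columns

T = Xc @ P              # scores: coordinates of each point within the plane
X_hat = T @ P.T         # projected points, expressed in the original space
E = Xc - X_hat          # residuals: the part of each point outside the plane

# Orthogonal projection means the residuals have no component in the plane.
print("max |E P| =", np.abs(E @ P).max())
```

Because the loadings are orthonormal, the scores are obtained by a plain matrix product, with no matrix inversion needed.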

Example: Protein data

• Protein consumption across Europe was studied.

• 9 variables describe different sources of protein.

• 25 objects are the different countries.

• Data matrix has dimensions 25 × 9.

• Which countries are similar?

• Which foods are related to red meat consumption?

Weber, A., Agrarpolitik im Spannungsfeld der internationalen Ernaehrungspolitik, Institut fuer Agrarpolitik und Marktlehre, Kiel (1973).

PCA on the protein data

• The data is mean-centred and each variable is scaled to unit variance. Then a PCA is performed.

Percent Variance Captured by PCA Model

Principal   Eigenvalue   % Variance   % Variance
Component   of           Captured     Captured
Number      Cov(X)       This PC      Total
---------   ----------   ----------   ----------
1           4.01e+000    44.52        44.52
2           1.63e+000    18.17        62.68
3           1.13e+000    12.53        75.22
4           9.55e-001    10.61        85.82
5           4.64e-001     5.15        90.98
6           3.25e-001     3.61        94.59
7           2.72e-001     3.02        97.61
8           1.16e-001     1.29        98.90
9           9.91e-002     1.10       100.00

How many principal components do you want to keep? 4

[Figure: eigenvalue vs. PC number]
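The percentage columns follow directly from the eigenvalues: each eigenvalue is divided by the sum of all eigenvalues, which for nine variables scaled to unit variance is approximately 9. The sketch below redoes that arithmetic from the eigenvalues as printed; because they are rounded to three significant figures, the results can differ from the table in the second decimal.

```python
import numpy as np

# Eigenvalues of Cov(X) as listed in the PCA output for the protein data.
eigenvalues = np.array([4.01, 1.63, 1.13, 0.955, 0.464, 0.325, 0.272, 0.116, 0.0991])

# Each eigenvalue as a share of the total variance, and the running total.
percent = 100 * eigenvalues / eigenvalues.sum()
cumulative = np.cumsum(percent)

print(f"sum of eigenvalues: {eigenvalues.sum():.4f}")
for k, (ev, p, c) in enumerate(zip(eigenvalues, percent, cumulative), start=1):
    print(f"PC{k}: eigenvalue {ev:.3f}, {p:.2f}% this PC, {c:.2f}% total")
```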

Scores: PC1 vs PC2

[Figure: scores plot with one labelled point per country; x-axis: scores PC 1 (44.52%)]

Loadings

[Figure: loadings on PC1 and PC2 for the nine variables: red meat, white meat, eggs, milk, fish, cereals, starch, beans/nuts/oil, fruit & veg]

Biplot: PC1 vs PC2

[Figure: biplot of country scores and variable loadings, PC1 vs PC2]

• PC2 primarily says that the Spanish and Portuguese especially like fruit, vegetables, fish and oils.

• South-eastern Europeans eat cereal crops.

Biplot: PC1 vs PC3

[Figure: biplot of country scores and variable loadings, PC1 vs PC3]

• Scandinavians eat fish!

• Red meat and milk are correlated.

• The Dutch like 'patat' (fries)... with mayonnaise!?

Residuals

• It is also important to look at the model residuals, E.

[Figure: residuals plotted against variable number, 1 to 9]

• Ideally, the residuals will not contain any structure, just unsystematic variation (noise).

Residuals

• The (squared) model residuals can be summed along the object or variable direction:

$$Q_i = \sum_{j} e_{ij}^2$$

[Figure: Q plotted against object number, 0 to 25]

• Country 23 (USSR) fits the model least well.
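Summing the squared residuals along each direction takes one line per direction. The sketch below uses a hypothetical 25 × 9 matrix as a stand-in for the protein data (which is not reproduced here), fits a four-component model and reports the object with the largest Q; all names and the random data are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical stand-in for a 25 x 9 data matrix (not the real protein data).
X = rng.normal(size=(25, 3)) @ rng.normal(size=(3, 9)) + 0.3 * rng.normal(size=(25, 9))

# Mean-centre and scale to unit variance, then fit a PCA model with R components.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
U, s, Vt = np.linalg.svd(Z, full_matrices=False)
R = 4
T = U[:, :R] * s[:R]
P = Vt[:R].T
E = Z - T @ P.T                      # model residuals

Q_objects = np.sum(E ** 2, axis=1)   # one value per object (row)
Q_variables = np.sum(E ** 2, axis=0) # one value per variable (column)

worst = int(np.argmax(Q_objects))
print(f"object with the largest Q: row index {worst} (Q = {Q_objects[worst]:.3f})")
```

Both sums add up to the same total, the residual sum of squares of the model, which also equals the sum of the squared singular values that were left out.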

Centering and scaling

• We are often interested in the differences between objects, not in their absolute values.

– protein data: differences between countries

• If different variables are measured in different units, some scaling is needed to give each variable an equal chance of contributing to the model.

Mean-centering

• Subtract the mean from each column of X:

[Figure: a numerical data matrix before and after mean-centering; after centering, every column mean is 0.0]

Scaling

• Divide each column of X by its standard deviation:

[Figure: the mean-centred matrix before and after scaling; after scaling, every column has standard deviation 1.0]
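Mean-centering followed by scaling can be sketched as follows. The matrix is hypothetical, with columns on deliberately different scales, and the use of the sample standard deviation (`ddof=1`) is an assumption; the slides do not say which definition was used.

```python
import numpy as np

# Hypothetical data: 4 objects, 3 variables measured on very different scales.
X = np.array([
    [6.1, 2400.0, 10.2],
    [5.9, 1180.0, 12.5],
    [7.3,  540.0, 11.1],
    [6.8,  220.0, 13.0],
])

# Mean-centering: subtract the column mean from each column.
column_means = X.mean(axis=0)
X_centered = X - column_means

# Scaling: divide each column by its standard deviation.
# ddof=1 (sample standard deviation) is an assumption made for this sketch.
column_stds = X_centered.std(axis=0, ddof=1)
X_scaled = X_centered / column_stds

print("column means after centering:", np.round(X_centered.mean(axis=0), 12))
print("column std after scaling:    ", np.round(X_scaled.std(axis=0, ddof=1), 12))
```

After both steps every variable has mean 0 and standard deviation 1, so a variable recorded in large units no longer dominates the model.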

How many PCs to use?

• $X = TP^T + E$, where $TP^T$ is the systematic variation and $E$ is the noise.

• Too few PCs:

– some systematic variation is not described.

– model does not fully summarise the data.

• Too many PCs:

– later PCs describe noise.

– model is not robust when applied to new data.

• How to select the correct number of PCs?

How many PCs to use?

• Eigenvalue plots

• Select components where explained % variance > noise level

• Look at PC scores and loadings: do they make sense? Do residuals have structure?

• Cross-validation

[Figure: eigenvalue vs. PC number; the 'knee' in the curve suggests selecting 4 PCs]

Cross-validation

• Remove a subset of the data: the test set.

• Build the model on the remaining data: the training set.

• Project the test set onto the model and calculate the residuals.

• Repeat for the next test set.

• Repeat for R = 1, 2, 3...

• Calculate PRESS:

$$\mathrm{PRESS}_r = \sum_{i} \sum_{j} e_{ij}^2$$

[Figure: the data matrix split into training and test sets]

PRESS plot

[Figure: PRESS vs. number of latent variables, 1 to 8]

• First minimum at 2 PCs

• Overall minimum at 4 PCs

• 8 PCs gives very high CV error
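The cross-validation steps listed above can be sketched as a short function. This is a minimal row-wise version under stated assumptions: the function name `press_rowwise`, the contiguous five-fold split and the random data are all hypothetical, and it is not the routine that produced the PRESS plot. Because each held-out row is projected using all of its own variables, the residual in this simple version cannot increase as components are added; schemes that yield a PRESS minimum typically also withhold part of each test row, which is not shown here.

```python
import numpy as np

def press_rowwise(X, max_components, n_splits=5):
    """PRESS for 1..max_components PCs using row-wise cross-validation."""
    n_rows = X.shape[0]
    folds = np.array_split(np.arange(n_rows), n_splits)
    press = np.zeros(max_components)
    for test_idx in folds:
        train_mask = np.ones(n_rows, dtype=bool)
        train_mask[test_idx] = False
        X_train, X_test = X[train_mask], X[test_idx]

        # Centre and scale both sets with statistics from the training set only.
        mean = X_train.mean(axis=0)
        std = X_train.std(axis=0, ddof=1)
        Z_train = (X_train - mean) / std
        Z_test = (X_test - mean) / std

        # Build the model on the training set.
        _, _, Vt = np.linalg.svd(Z_train, full_matrices=False)

        # Project the test set onto the model for each number of components.
        for r in range(1, max_components + 1):
            P = Vt[:r].T
            E = Z_test - Z_test @ P @ P.T
            press[r - 1] += np.sum(E ** 2)
    return press

rng = np.random.default_rng(4)
# Hypothetical 25 x 9 data with three underlying factors plus noise.
X = rng.normal(size=(25, 3)) @ rng.normal(size=(3, 9)) + 0.3 * rng.normal(size=(25, 9))
press = press_rowwise(X, max_components=8)
for r, value in enumerate(press, start=1):
    print(f"R = {r}: PRESS = {value:.3f}")
```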

Outliers

• Outliers are objects which are very different from the rest of the data. These can have a large effect on the principal component model and should be removed.

[Figure: two panels, before and after removing an outlier caused by a bad experiment; axis label: T (°C)]

Outliers

• Outliers can also be found in the model space or in the residuals.

[Figure: scores plot (x-axis: scores PC 1) and a second plot with x-axis time (min)]

Model extrapolation can be dangerous!

[Figure: linear model extended over a wider range; x-axis: age (years), 0 to 30]

• Linear model was valid for this age range...

• ...but is not valid for 30 year olds!

Conclusions

• Principal component analysis (PCA) reduces large, collinear matrices into two smaller matrices, scores and loadings:

$$X = t_1 p_1^T + \dots + t_R p_R^T + E = TP^T + E$$

• Principal components

– describe the important variation in the data.

– are calculated in order of importance.

– are orthogonal.

Conclusions

• Scores plots and biplots can be useful for exploring and understanding the data.

• It is often correct to mean-center and scale the variables prior to analysis.

• It is important to include the correct number of PCs in the PCA model. One method for determining this is called cross-validation.
