principal component analysis
Post on 16-Mar-2016
UNIVERSITY OF AMSTERDAM
Principal Component Analysis
Biosystems Data Analysis
From molecule to networks
Protein network of SRD5A2
Yeast metabolic network of Glycolysis
Disease gene network
Biological data
[Figure: DATA matrix with samples as rows and genes, proteins or metabolites as columns; the example value 5.31 is shown twice]
Repeatability (Dutch: herhaalbaarheid)
Reproducibility (Dutch: reproduceerbaarheid)
Biological variability
How to explore such networks
[Figure: the DATA matrix (samples × genes, proteins or metabolites) is turned into a correlation matrix (genes, proteins or metabolites × genes, proteins or metabolites)]
Results are specific for the selected samples/situation
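As a rough illustration of going from a samples × variables data matrix to a variables × variables correlation matrix, here is a minimal NumPy sketch. The data are randomly generated stand-ins (an assumption for illustration, not taken from the slides); only `np.corrcoef` does the actual work.

```python
import numpy as np

# Hypothetical data: 6 samples (rows) x 4 variables (columns, e.g. genes).
rng = np.random.default_rng(0)
base = rng.normal(size=(6, 1))
X = np.hstack([
    base,                                        # variable 0
    2.0 * base + 0.1 * rng.normal(size=(6, 1)),  # variable 1, tied to variable 0
    rng.normal(size=(6, 2)),                     # variables 2 and 3, unrelated
])

# rowvar=False tells NumPy that columns are the variables,
# so the result is a 4 x 4 (variables x variables) matrix.
C = np.corrcoef(X, rowvar=False)
print(np.round(C, 2))
```

Because the matrix is computed from these particular rows, a different selection of samples would generally give different correlations.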
Goals
• If you measure multiple variables on an object, it can be important to analyze the measurements simultaneously.
• Understand the most important tool in multivariate data analysis: Principal Component Analysis.
Multiple measurements
• If there is a mutual relationship between two or more measurements, they are correlated.
• There are strong correlations and weak correlations:
  – strong: the mass of an object and the weight of that object on the earth's surface
  – weak: capabilities in sports and month of birth
Correlation
• Correlation occurs everywhere!
[Figure: scatter plot of Height (cm), 75 to 84, against Age (months), 18 to 30]
• Example: mean height vs. age of a group of young children
• A strong linear relationship between height and age is seen.
• For young children, height and age are correlated.
Moore, D.S. and McCabe G.P., Introduction to the Practice of Statistics (1989).
Correlation in spectroscopy
• Example: a pure compound is measured at two wavelengths over a range of concentrations
[Figure: spectra of Absorbance (units), 0 to 0.9, against Wavelength (nm), 200 to 300, with 230 nm and 265 nm marked]
Conc. (MMol)   Intensity at 230nm   Intensity at 265nm
 5             0.166                0.090
10             0.332                0.181
15             0.498                0.270
20             0.664                0.362
25             0.831                0.453
Correlation in spectroscopy
[Figure: Absorbance at 265nm (units), 0 to 0.5, against Absorbance at 230nm (units), 0 to 1, with an arrow marking increasing concentration]
• The intensities at 230 nm and 265 nm are highly correlated.
• There is only one factor underlying the data: concentration.
• The data is not two-dimensional, but one-dimensional.
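The claim that the two-wavelength data is effectively one-dimensional can be checked numerically. The sketch below assumes the five (230 nm, 265 nm) intensity pairs from the concentration table of the previous slide, and uses the singular values of the mean-centred data as a measure of how much variation lies along each direction.

```python
import numpy as np

# Intensities read from the slide's table (concentrations 5 to 25).
a230 = np.array([0.166, 0.332, 0.498, 0.664, 0.831])
a265 = np.array([0.090, 0.181, 0.270, 0.362, 0.453])

# Pearson correlation between the two wavelengths.
r = np.corrcoef(a230, a265)[0, 1]

# Fraction of the total variation along each principal direction.
X = np.column_stack([a230, a265])
Xc = X - X.mean(axis=0)
s = np.linalg.svd(Xc, compute_uv=False)
explained = s**2 / np.sum(s**2)

print(f"correlation: {r:.5f}")
print("fraction of variation per direction:", explained)
```

Nearly all of the variation falls on the first direction, which corresponds to the single underlying factor, concentration.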
The data matrix
• Information often comes in the form of a matrix, with objects as rows and variables as columns:
    0.15  0.22  0.78  0.65
    0.13  0.24  0.85  0.33
    0.14  0.34  0.93  0.81
    0.12  0.45  0.65  0.29
• For example,
  – Spectroscopy: sample × wavelength
  – Proteomics: patient × protein
Large amounts of data
• In (bio)chemical analysis, the measured data matrices can be very large.
  – An infrared spectrum measured for 50 samples gives a data matrix of size 50 × 800 = 40,000 numbers!
  – The metabolome of 100 patients yields a data matrix of size 100 × 1000 = 100,000 numbers.
• We need a way of extracting the important information from large data matrices.
Principal Component Analysis
• Data reduction
  – PCA reduces large data matrices into two smaller matrices which can be more easily examined, plotted and interpreted.
• Data exploration
  – PCA extracts the most important factors (principal components) from the data. These factors describe multivariate interactions between the measured variables.
• Data understanding
  – Principal components can be used to classify samples, identify compound spectra, determine biomarkers, etc.
Different views of PCA
• Statistically, PCA is a multivariate analysis technique closely related to
  – eigenvector analysis
  – singular value decomposition (SVD)
• In matrix terms, PCA is a decomposition of X into two smaller matrices plus a set of residuals: $X = T P^T + E$
• Geometrically, PCA is a projection technique in which X is projected onto a subspace of reduced dimensions.
PCA: mathematics
• The basic equation for PCA is written as
$$X = t_1 p_1^T + t_2 p_2^T + \dots + t_R p_R^T + E = T P^T + E$$
where
X (I × J) is a data matrix,
T (I × R) are the scores,
P (J × R) are the loadings and
E (I × J) are the residuals.
R is the number of principal components used to describe X.
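One common way to obtain T, P and E is through the singular value decomposition. The following is a minimal sketch under that assumption (the slides do not say which algorithm is used); the function name `pca` and the random test matrix are illustrative only.

```python
import numpy as np

def pca(X, R):
    """Return scores T (I x R), loadings P (J x R) and residuals E (I x J)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    T = U[:, :R] * s[:R]   # scores: columns t_1 ... t_R
    P = Vt[:R].T           # loadings: columns p_1 ... p_R
    E = X - T @ P.T        # what the R components do not describe
    return T, P, E

# Hypothetical mean-centred data matrix with I = 10 objects, J = 5 variables.
rng = np.random.default_rng(1)
X = rng.normal(size=(10, 5))
X = X - X.mean(axis=0)

T, P, E = pca(X, R=2)

# The sum of rank-one terms t_r p_r^T is the same thing as T P^T.
rank_one_sum = sum(np.outer(T[:, r], P[:, r]) for r in range(2))
print(np.allclose(rank_one_sum, T @ P.T))
print(np.allclose(X, T @ P.T + E))
```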
Principal components
• A principal component is defined by one pair of loadings and scores, $(t_r, p_r)$, sometimes also known as a latent variable.
• Principal components describe maximum variance and are calculated in order of importance, e.g.
Principal comp.   % X explained   Total % X explained
1                 45.6            45.6
2                 23.9            69.5
3                 18.1            87.6
4                  1.3            88.9
and so on... up to 100%
PCA: matrices
[Figure: block diagram of $X = t_1 p_1^T + \dots + t_R p_R^T + E = T P^T + E$; each principal component is the product of a scores vector and a loadings vector]
Scores and loadings
• Scores
  – relationships between objects
  – orthogonal, $T^T T$ = diagonal matrix
• Loadings
  – relationships between variables
  – orthonormal, $P^T P$ = identity matrix, I
• Similarities and differences between objects (or variables) can be seen by plotting scores (or loadings) against each other.
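The two orthogonality properties can be verified numerically. This sketch assumes scores and loadings obtained from an SVD of a mean-centred matrix (a hypothetical random matrix here), which is one standard way to compute them.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(8, 4))
X = X - X.mean(axis=0)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
R = 3
T = U[:, :R] * s[:R]   # scores
P = Vt[:R].T           # loadings

TtT = T.T @ T          # expected: diagonal matrix
PtP = P.T @ P          # expected: identity matrix

print(np.round(TtT, 10))
print(np.round(PtP, 10))
```

With this construction the diagonal of `TtT` holds the squared singular values, so the scores carry the size of each component while the loadings have unit length.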
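Numbers example
$$\begin{pmatrix} 1 & 1 & 0 & 1 & 1 \\ 2 & 2 & 0 & 2 & 2 \\ 3 & 3 & 0 & 3 & 3 \\ 4 & 4 & 0 & 4 & 4 \\ 5 & 5 & 0 & 5 & 5 \\ 6 & 6 & 0 & 6 & 6 \end{pmatrix} = \begin{pmatrix} 2 \\ 4 \\ 6 \\ 8 \\ 10 \\ 12 \end{pmatrix} \begin{pmatrix} 1/2 & 1/2 & 0 & 1/2 & 1/2 \end{pmatrix}$$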
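The numbers on this slide appear to show a matrix that is exactly the product of one scores vector and one loadings vector. The sketch below assumes that reading (scores 2, 4, ..., 12 and loadings 1/2, 1/2, 0, 1/2, 1/2), which is an interpretation of the slide rather than something it states in words.

```python
import numpy as np

t = np.array([2.0, 4.0, 6.0, 8.0, 10.0, 12.0])   # assumed scores
p = np.array([0.5, 0.5, 0.0, 0.5, 0.5])           # assumed loadings

X = np.outer(t, p)   # 6 x 5 matrix built from a single component
print(X)

# One component describes X completely, and the loadings have unit length,
# so projecting X onto p gives the scores back.
print(np.linalg.matrix_rank(X))
print(np.linalg.norm(p))
print(X @ p)
```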
PCA: simple projection
• Simplest case: two correlated variables
[Figure: Height (cm) against Age (months) with the PC1 and PC2 directions drawn; PCA turns this into a scores plot of Scores PC 2 (0.23%) against Scores PC 1 (99.77%), both axes from -8 to 8]
• PC1 describes 99.77% of the total variation in X.
• PC2 describes residual variation (0.23%).
PCA: projections
• PCA is a projection technique.
  – Now we will project some J-dimensional data onto a two-dimensional space, i.e. onto a plane.
  – In the previous example, we projected the two-dimensional data onto a one-dimensional space, i.e. onto a line.
• Each row of the data matrix X (I × J) can be considered as a point in J-dimensional space. This data is projected orthogonally onto a subspace of lower dimensionality.
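The orthogonal projection of each row onto a plane can be sketched as follows. The data are hypothetical random numbers with J = 5, and the plane is taken to be the one spanned by the first two loading vectors from an SVD, which is an assumption about how the subspace is chosen.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(20, 5))    # 20 points in 5-dimensional space
X = X - X.mean(axis=0)

_, _, Vt = np.linalg.svd(X, full_matrices=False)
P = Vt[:2].T                    # J x 2 orthonormal basis of the plane

T = X @ P                       # coordinates of each point within the plane
X_hat = T @ P.T                 # the projected points, back in 5 dimensions
E = X - X_hat                   # part of each point that sticks out of the plane

# Orthogonal projection: the residuals have no component inside the plane.
print(np.allclose(E @ P, 0.0))
```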
[Figure: point clouds illustrating the decomposition $X = T P^T + E$]
Example: Protein data
• Protein consumption across Europe was studied.
• 9 variables describe different sources of protein.
• 25 objects are the different countries.
• Data matrix has dimensions 25 × 9.
• Which countries are similar?
• Which foods are related to red meat consumption?
Weber, A., Agrarpolitik im Spannungsfeld der internationalen Ernaehrungspolitik, Institut fuer Agrarpolitik und marktlehre, Kiel (1973).
PCA on the protein data
• The data is mean-centred and each variable is scaled to unit variance. Then a PCA is performed.

Percent Variance Captured by PCA Model
Principal   Eigenvalue   % Variance   % Variance
Component   of           Captured     Captured
Number      Cov(X)       This PC      Total
---------   ----------   ----------   ----------
1           4.01e+000    44.52         44.52
2           1.63e+000    18.17         62.68
3           1.13e+000    12.53         75.22
4           9.55e-001    10.61         85.82
5           4.64e-001     5.15         90.98
6           3.25e-001     3.61         94.59
7           2.72e-001     3.02         97.61
8           1.16e-001     1.29         98.90
9           9.91e-002     1.10        100.00
How many principal components do you want to keep? 4
[Figure: Eigenvalue vs. PC Number; Eigenvalue (0 to 4.5) against PC Number (1 to 9)]
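The percentage columns in the table follow directly from the eigenvalues: each eigenvalue is divided by their sum. A short sketch using the rounded eigenvalues as printed (so the last digit may differ slightly from the table):

```python
import numpy as np

# Eigenvalues of Cov(X) as printed in the table (rounded to 3 digits).
eigvals = np.array([4.01, 1.63, 1.13, 0.955, 0.464,
                    0.325, 0.272, 0.116, 0.0991])

pct = 100 * eigvals / eigvals.sum()   # % variance captured by each PC
cum = np.cumsum(pct)                  # % variance captured in total

for k, (p_k, c_k) in enumerate(zip(pct, cum), start=1):
    print(f"PC {k}: {p_k:6.2f}  {c_k:6.2f}")
```

The eigenvalues sum to about 9 because each of the nine variables was scaled to unit variance.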
Scores: PC1 vs PC2
[Figure: Scores PC 2 (18.17%) against Scores PC 1 (44.52%), with one labelled point per country: Albania, Austria, Belgium, Bulgaria, Czechoslovakia, Denmark, East Germany, Finland, France, Greece, Hungary, Ireland, Italy, Netherlands, Norway, Poland, Portugal, Romania, Spain, Sweden, Switzerland, UK, USSR, West Germany, Yugoslavia]
Loadings
[Figure: PC loadings (-0.8 to 0.6) for PC1 and PC2 across the nine variables: Red meat, White meat, Eggs, Milk, Fish, Cereals, Starch, Beans/nuts/oil, Fruit & veg]
Biplot: PC1 vs PC2
[Figure: biplot of PC 2 against PC 1 showing the 25 countries together with the nine variables (Red meat, White meat, Eggs, Milk, Fish, Cereals, Starch, Beans/nuts/oil, Fruit & veg)]
PC2 primarily says that the Spanish and Portuguese especially like fruit, vegetables, fish, oils.
SE Europeans eat cereal crops
Biplot: PC1 vs PC3
[Figure: biplot of PC 3 against PC 1 showing the 25 countries together with the nine variables (Red meat, White meat, Eggs, Milk, Fish, Cereals, Starch, Beans/nuts/oil, Fruit & veg)]
Scandinavians eat fish!
Red meat and milk are correlated
The Dutch like ‘patat’...
...with mayonnaise!?
Residuals
• It is also important to look at the model residuals, E.
[Figure: Residual variation (-1 to 1.5) against Variable number (1 to 9)]
• Ideally, the residuals will not contain any structure - just unsystematic variation (noise).
Residuals
• The (squared) model residuals can be summed along the object or variable direction:
$$Q_i = \sum_{j=1}^{J} e_{ij}^2$$
[Figure: Q (sum of squared residuals), 0 to 3.5, against Object number, 0 to 25]
Country 23 (USSR) fits the model least well
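A minimal sketch of the Q statistic, summing squared residuals per object and per variable. The 25 × 9 matrix below is random stand-in data with the same shape as the protein data (the real values are not in the transcript), and four components are kept as on the earlier slide.

```python
import numpy as np

# Hypothetical stand-in for the autoscaled 25 x 9 protein data.
rng = np.random.default_rng(4)
X = rng.normal(size=(25, 9))
X = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
R = 4
E = X - (U[:, :R] * s[:R]) @ Vt[:R]    # residuals of a 4-component model

Q_objects = np.sum(E**2, axis=1)       # Q_i: one value per object
Q_variables = np.sum(E**2, axis=0)     # one value per variable

worst = int(np.argmax(Q_objects))      # 0-based index of the worst-fitting object
print("worst-fitting object:", worst + 1)
```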
Centering and scaling
• We are often interested in the differences between objects, not in their absolute values.
  – protein data: differences between countries
• If different variables are measured in different units, some scaling is needed to give each variable an equal chance of contributing to the model.
Mean-centering
• Subtract the mean from each column of X:
X before mean-centering:
   6.7     38.1    10711
   6.3     36.2    10548
   6.5     35.5    11857
   6.6     37.2    10245
column means:   6.525   36.75   10840
X after mean-centering:
   0.175    1.350   -129.3
  -0.225   -0.550   -292.3
  -0.025   -1.250   1016
   0.075    0.450   -595.2
column means:   0.0   0.0   0.0
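Scaling
• Divide each column of X by its standard deviation:
X before scaling (mean-centered):
   0.175    1.350   -129.3
  -0.225   -0.550   -292.3
  -0.025   -1.250   1016
   0.075    0.450   -595.2
column standard deviations:   0.171   1.139   704.8
X after scaling:
   1.025    1.186   -0.183
  -1.318   -0.483   -0.415
  -0.146   -1.098    1.443
   0.439    0.395   -0.845
column standard deviations:   1.0   1.0   1.0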
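Mean-centering followed by scaling can be sketched in a few lines. The 4 × 3 matrix below is an assumed reading of the example values on the two preceding slides, and the sample standard deviation (`ddof=1`) is used because it reproduces the standard deviations shown there.

```python
import numpy as np

# Assumed reading of the slide's example matrix (4 objects x 3 variables).
X = np.array([
    [6.7, 38.1, 10711.0],
    [6.3, 36.2, 10548.0],
    [6.5, 35.5, 11857.0],
    [6.6, 37.2, 10245.0],
])

mean = X.mean(axis=0)
Xc = X - mean                  # mean-centering: column means become 0
std = Xc.std(axis=0, ddof=1)   # sample standard deviation per column
Xs = Xc / std                  # scaling: column standard deviations become 1

print("means:", mean)
print("standard deviations:", np.round(std, 3))
print(np.round(Xs, 3))
```

Without the scaling step the third column, whose values are in the thousands, would dominate the variance and therefore the first principal component.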
How many PCs to use?
$X = T P^T + E$, where $T P^T$ describes the systematic variation and $E$ the noise
• Too few PCs:
  – some systematic variation is not described.
  – model does not fully summarise the data.
• Too many PCs:
  – later PCs describe noise.
  – model is not robust when applied to new data.
• How to select the correct number of PCs?
How many PCs to use?
• Eigenvalue plots
• Select components where explained % variance > noise level
• Look at PC scores and loadings - do they make sense?! Do residuals have structure?
• Cross-validation
[Figure: Eigenvalue vs. PC Number; Eigenvalue (0 to 4.5) against PC Number (1 to 9), annotated "‘Knee’ here - select 4 PCs"]
Cross-validation
• Remove subset of the data - test set.
• Build model on remaining data - training set.
• Project test set onto model - calculate residuals.
• Repeat for next test set.
• Repeat for R = 1,2,3...
• Calculate PRESS:
$$PRESS_r = \sum_{i=1}^{I} \sum_{j=1}^{J} e_{ij}^2$$
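The steps above can be sketched as a simple row-wise cross-validation. This is only one literal reading of the procedure: the function name `press_curve`, the fold layout and the synthetic two-factor data are all assumptions made for illustration. In this naive form the test rows are projected using all of their variables, so PRESS cannot increase with R; practical implementations often also leave out variables to obtain a curve with a minimum.

```python
import numpy as np

def press_curve(X, max_R, n_splits=5):
    """PRESS for R = 1..max_R using row-wise cross-validation."""
    folds = np.array_split(np.arange(X.shape[0]), n_splits)
    press = np.zeros(max_R)
    for test_idx in folds:
        train = np.delete(X, test_idx, axis=0)       # training set
        mean = train.mean(axis=0)
        _, _, Vt = np.linalg.svd(train - mean, full_matrices=False)
        test = X[test_idx] - mean                    # centre with training mean
        for R in range(1, max_R + 1):
            P = Vt[:R].T                             # loadings from training set
            E = test - test @ P @ P.T                # test-set residuals
            press[R - 1] += np.sum(E**2)
    return press

# Hypothetical data: two underlying factors plus a little noise.
rng = np.random.default_rng(5)
X = rng.normal(size=(30, 2)) @ rng.normal(size=(2, 6))
X = X + 0.05 * rng.normal(size=(30, 6))

press = press_curve(X, max_R=4)
print(np.round(press, 3))
```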
PRESS plot
[Figure: two panels against Latent Variable (1 to 8): Eigenvalue of Cov(x), and PRESS (r)]
First minimum at 2 PCs
Overall minimum at 4 PCs
8 PCs gives very high CV error
Outliers
• Outliers are objects which are very different from the rest of the data. These can have a large effect on the principal component model and should be removed.
[Figure: two scatter plots of T (°C), 4 to 18, against pH, 1 to 4.5, before and after the step "Remove outlier"; the outlying point is labelled "bad experiment"]
Outliers
• Outliers can also be found in the model space or in the residuals.
[Figure: two panels: Scores PC 2 against Scores PC 1 (both -8 to 8), and Sum-of-squared residuals (0 to 14) against Time (min), 22 to 42]
Model extrapolation can be dangerous!
[Figure: Height (cm), 0 to 300, against Age (years), 0 to 30]
Linear model was valid for this age range...
...but is not valid for 30 year olds!
Conclusions
• Principal component analysis (PCA) reduces large, collinear matrices into two smaller matrices - scores and loadings:
$$X = t_1 p_1^T + t_2 p_2^T + \dots + t_R p_R^T + E = T P^T + E$$
• Principal components
  – describe the important variation in the data.
  – are calculated in order of importance.
  – are orthogonal.
Conclusions
• Scores plots and biplots can be useful for exploring and understanding the data.
• It is often correct to mean-center and scale the variables prior to analysis.
• It is important to include the correct number of PCs in the PCA model. One method for determining this is called cross-validation.