Principal Component Analysis (PCA) - Politechnika Gdańska
TRANSCRIPT
Principal Component Analysis (PCA)
$$X_{(m \times n)} = S_{(m \times f)} L_{(f \times n)} + E_{(m \times n)}$$
X = S*L + E, S = XL’
m – number of objects
n – number of variables
f – number of the new latent factors (PCs)
S – scores matrix
L – loadings matrix
E – error matrix containing the unexplained part of X
Reference: B.M.G. Vandeginste, D.L. Massart, L.M.C. Buydens, S. de Jong, P.J. Lewi, J. Smeyers-Verbeke, 1998. Handbook of Chemometrics and Qualimetrics: Part B. Elsevier Science, Amsterdam.
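The model X = S*L + E can be sketched in a few lines of NumPy. This is only an illustration: the data are random, the choice f = 2 is arbitrary, and the sketch assumes that X is column mean-centered and that the loadings are taken from the singular value decomposition.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, f = 20, 5, 2                      # objects, variables, retained PCs (illustrative)
X = rng.normal(size=(m, n))
X = X - X.mean(axis=0)                  # column mean-centering (assumed preprocessing)

# Rows of Vt are orthonormal loading vectors, ordered by decreasing singular value.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
L = Vt[:f, :]                           # loadings, f x n
S = X @ L.T                             # scores, m x f  (S = XL')
E = X - S @ L                           # unexplained part of X, m x n

explained = 1.0 - (E ** 2).sum() / (X ** 2).sum()
print(S.shape, L.shape, E.shape, round(explained, 3))
```

Because the rows of L are orthonormal, S = XL’ and X = S*L + E hold together; increasing f moves variation from E into S*L.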
Principal Component Analysis (PCA)
• the new variables (PCs) are linear combinations of the original ones.
• PCs are uncorrelated.
• the first PC captures as much as possible of the variability in the original variables: among all such linear combinations it is the one with maximum variance.
• each successive PC accounts for as much of the remaining variability as possible.
[Figure: data plotted in the (x1, x2) plane, showing the PC1 and PC2 directions and the mean of the data set]
Reference: T. Naes, T. Isaksson, T. Fearn, T. Davies, 2002. A user-friendly guide to Multivariate Calibration and Classification. NIR Publications, UK.
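The properties listed on this slide can be checked numerically. The sketch below uses two synthetic correlated variables (not data from the slides) and computes the PCs from the eigenvectors of the covariance matrix, which is one common way of doing it.

```python
import numpy as np

rng = np.random.default_rng(1)
# Two correlated variables x1, x2 (synthetic data, for illustration only).
x1 = rng.normal(size=200)
x2 = 0.8 * x1 + 0.3 * rng.normal(size=200)
X = np.column_stack([x1, x2])
Xc = X - X.mean(axis=0)                 # PCs are defined relative to the data mean

# Eigenvectors of the covariance matrix give the PC directions.
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]       # reorder so that PC1 comes first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

scores = Xc @ eigvecs                   # each PC is a linear combination of x1 and x2
score_cov = np.cov(scores, rowvar=False)

print("PC variances:", eigvals)
print("covariance between PC1 and PC2:", score_cov[0, 1])
```

The PC variances come out in decreasing order, and the off-diagonal covariance of the scores is zero up to rounding error, which is what "uncorrelated" means here.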
Principal Component Analysis (PCA)
[Figure: data in the (x1, x2) plane with an outlier and the resulting PC1 direction; PC1 vs. PC2 score plots before and after removal of outliers]
PCA of autoscaled Austrian data
• score plots
Reference: I. Stanimirova, M. Daszykowski, D. L. Massart, F. Questier, V. Simeonov, H. Puxbaum, Chemometrical exploration of the wet precipitation chemistry from the Austrian Monitoring Network (1988-1999), accepted in JEM.
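Autoscaling is usually understood as centering each variable to zero mean and scaling it to unit standard deviation, so that variables measured on different scales contribute equally to the PCA. A minimal sketch, using a hypothetical data table rather than the Austrian data:

```python
import numpy as np

def autoscale(X):
    """Center each column to mean 0 and scale it to unit standard deviation."""
    mean = X.mean(axis=0)
    std = X.std(axis=0, ddof=1)         # sample standard deviation
    return (X - mean) / std

rng = np.random.default_rng(2)
# Hypothetical table: rows are samples, columns are variables on very different scales.
X = rng.normal(loc=[5.0, 50.0, 500.0], scale=[0.1, 5.0, 100.0], size=(30, 3))
Z = autoscale(X)
print(Z.mean(axis=0), Z.std(axis=0, ddof=1))
```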
PCA of autoscaled Austrian data
• loading bar plots
[Figure: three loading bar plots over the variables H+, NH4+, Na+, K+, Ca2+, Mg2+, Cl-, NO3-, SO42-]
PC 1 is a global size factor reflecting the total mineral content
PC 2 describes the acidity of the samples
PC 3 is a “mixed salt” factor, contrasting K+ and Mg2+ on the one hand with Cl- on the other
PCA of autoscaled Austrian data
• score plots (after removal of the extreme objects)
PCA of autoscaled Austrian data
• Innervillgraten, Sonnblick, Reutte – low mineral content.
• Kufstein, Werfenweng, Nassfeld, Lunz, Nasswald – intermediate mineral content.
• Haunsberg, Litschau and Lobau – large mineral content.
• Lobau is much more influenced by anthropogenic factors than Nasswald.
• Litschau and Haunsberg samples can be distinguished from those of Werfenweng by their higher proportion of H+, NO3-, SO42-.
[Figure: score plot with the sampling sites labelled: Reutte, Kufstein, Innervillgraten, Sonnblick, Nassfeld, Haunsberg, Werfenweng, Lunz, Nasswald, Lobau, Litschau]
Tucker3
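[Figure: the three-way array X (i × j × k) is decomposed into loading matrices A (i × D), B (j × E) and C (k × F), a core array G (D × E × F) and a residual array E]
$$x_{ijk} = \sum_{d=1}^{D} \sum_{e=1}^{E} \sum_{f=1}^{F} a_{id} b_{je} c_{kf} g_{def} + e_{ijk}$$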
Reference: R. Henrion, N-way principal component analysis. Theory, algorithms and applications, Chemom. Intell. Lab. Syst. 25 (1994) 1-23.
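The triple sum of the Tucker3 model maps directly onto a single `numpy.einsum` call. The sketch below uses random loading matrices and a random core array; the array sizes are illustrative only, and the residual term is left out.

```python
import numpy as np

rng = np.random.default_rng(3)
I, J, K = 11, 9, 8                      # array dimensions (illustrative)
D, E, F = 3, 3, 2                       # number of components per mode

A = rng.normal(size=(I, D))             # loadings, mode 1
B = rng.normal(size=(J, E))             # loadings, mode 2
C = rng.normal(size=(K, F))             # loadings, mode 3
G = rng.normal(size=(D, E, F))          # core array

# x_ijk = sum_d sum_e sum_f a_id * b_je * c_kf * g_def  (model part, no residual)
X_hat = np.einsum("id,je,kf,def->ijk", A, B, C, G)

# The same element computed with explicit loops, for one (i, j, k).
i, j, k = 4, 2, 7
x_loop = sum(A[i, d] * B[j, e] * C[k, f] * G[d, e, f]
             for d in range(D) for e in range(E) for f in range(F))
print(X_hat.shape, np.isclose(X_hat[i, j, k], x_loop))
```

Each core element g_def weights the interaction between component d of the first mode, component e of the second and component f of the third, which is why the interpretation slides look at one core element at a time.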
Tucker3 of standardized Austrian data
[Figure: three-way data array X with modes sites, time and parameters, decomposed into loading matrices A, B and C and a core data array]
perfect data structure - the time dimension was set to be 8 years for each sampling site
core data array [3 3 2] - 89.7%
-1.81   3.10   0.17
-2.44  -5.85  -3.57
 0.17  -0.33   2.87

48.24  -0.13  -0.17
 0.26  -7.75  -4.29
 0.03  -6.06   8.89
elements: (1 1 1); (3 3 1); (2 2 1); (3 2 1)
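One plausible reading of the listed elements is that they are the core entries of largest magnitude. The sketch below ranks the entries of the 3 × 3 block that contains 48.24; treating that block as the slice with third index 1 is an assumption, not something the slide states.

```python
import numpy as np

# Values copied from the slide; treating this block as the slice G[:, :, 0]
# (third index = 1 in the slide's 1-based notation) is an assumption.
G1 = np.array([[48.24, -0.13, -0.17],
               [ 0.26, -7.75, -4.29],
               [ 0.03, -6.06,  8.89]])

# Rank the elements by squared magnitude, largest first.
flat_order = np.argsort((G1 ** 2).ravel())[::-1]
rows, cols = np.unravel_index(flat_order, G1.shape)

# Report the four largest in 1-based (d e f) notation.
top4 = [(int(r) + 1, int(c) + 1, 1) for r, c in zip(rows[:4], cols[:4])]
print(top4)  # [(1, 1, 1), (3, 3, 1), (2, 2, 1), (3, 2, 1)]
```

Under that assumption the ranking reproduces the four elements named on the slide, in the same order.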
Interpretation of Tucker3 - core element (1 1 1)
[Figure: plots labelled with the parameters H+, NH4+, NO3-, SO42-, Na+, K+, Ca2+, Cl-, Mg2+]
A1  B1  C1  G(1 1 1)  product
+   +   +   +         +
Samples from all sampling sites are ranked according to their increasing content of NH4+, NO3- and SO42- during the whole sampling period, and particularly during 1993, 1995 and 1996.
Interpretation of Tucker3 – core element (3 2 1)
[Figure: plots labelled with the parameters H+, NH4+, NO3-, SO42-, Na+, K+, Ca2+, Cl-, Mg2+]
A3  B2  C1  G(3 2 1)  product
+   +   +   -         -
+   -   +   -         +
Samples from all sites except Werfenweng and Lobau show a decrease (-) of Na+, K+, Ca2+, Mg2+ and Cl- during the whole sampling period, particularly in 1993, 1995 and 1996, and an increase (+) of acidity.
Interpretation of Tucker3 – core element (2 2 1)
[Figure: plots labelled with the parameters H+, NH4+, NO3-, SO42-, Na+, K+, Ca2+, Cl-, Mg2+]
A2  B2  C1  G(2 2 1)  product
-   +   +   -         +
Samples from Haunsberg can be distinguished from the samples of the other sites by high Ca2+ and Cl- ion concentrations during 1995 and 1996.
Interpretation of Tucker3 – core element (3 3 1)
[Figure: plots labelled with the parameters H+, NH4+, NO3-, SO42-, Na+, K+, Ca2+, Cl-, Mg2+]
A3  B3  C1  G(3 3 1)  product
-   -   +   +         +
Samples from Lobau can be distinguished from the samples of the other sites by high K+ and Mg2+ ion concentrations during 1995 and 1996.
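The "product" column in the sign tables is the sign of the product of the three loading signs and the sign of the core element. A small sketch that recomputes it for the sign rows printed on the slides (the helper name `product_sign` is made up for this illustration):

```python
import numpy as np

def product_sign(a, b, c, g):
    """Sign of a * b * c * g for loading signs a, b, c and core-element sign g."""
    return int(np.sign(a * b * c * g))

# Sign rows as printed on the slides: (A, B, C, G) for each core element.
rows = {
    "(1 1 1)": [(+1, +1, +1, +1)],
    "(3 2 1)": [(+1, +1, +1, -1), (+1, -1, +1, -1)],
    "(2 2 1)": [(-1, +1, +1, -1)],
    "(3 3 1)": [(-1, -1, +1, +1)],
}
for element, signs in rows.items():
    for s in signs:
        print(element, s, "->", "+" if product_sign(*s) > 0 else "-")
```

A positive product is read as an increase and a negative product as a decrease for the combination of sites, parameters and years picked out by the three loading vectors.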
Least Squares Regression (LSR)
[Figure: Y plotted against x with the fitted line, showing an observation y_i, its fitted value and its residual]
$$Y = b_0 + b_1 x + f$$
Fitted values: $\hat{y}_i = \hat{b}_0 + \hat{b}_1 x_i$
Residual: $\hat{f}_i = y_i - \hat{y}_i$
$$RSS = \sum_{i=1}^{N} \hat{f}_i^{2}$$
The sum over all N observations of the squared residuals (RSS) is an overall measure of the goodness of fit of the line to the data. The principle of least squares (LS) chooses the values of b0 and b1 that give the smallest RSS.
b0 – intercept
b1 – slope
f – random error term
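The least squares estimates have a simple closed form for a single predictor. The sketch below applies it to synthetic data (not Forbes's measurements) and computes the fitted values, residuals, RSS and R2 as defined above.

```python
import numpy as np

rng = np.random.default_rng(4)
x = np.linspace(0.0, 10.0, 25)
y = 2.0 + 0.5 * x + rng.normal(scale=0.2, size=x.size)   # synthetic data

# Closed-form least squares estimates of slope and intercept.
x_mean, y_mean = x.mean(), y.mean()
b1 = ((x - x_mean) * (y - y_mean)).sum() / ((x - x_mean) ** 2).sum()
b0 = y_mean - b1 * x_mean

y_hat = b0 + b1 * x                     # fitted values
f_hat = y - y_hat                       # residuals
rss = (f_hat ** 2).sum()                # residual sum of squares
r2 = 1.0 - rss / ((y - y_mean) ** 2).sum()
print(b0, b1, rss, r2)
```

Any other choice of b0 and b1 gives a larger RSS on the same data, which is the defining property of the LS line.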
EXAMPLE * Forbes’s data: He measured the boiling point of water and the atmospheric pressure at various locations in the Alps and Scotland.
Aim: Forbes’s aim was to be able to estimate pressure (Y) (and hence altitude) from the boiling point of water (x).
[Figure: plot of the data with the fitted line; one point is marked as an outlier]
b0 = -81.00; b1 = 0.52
RSS = 0.79; R2 = 0.994
The outlier has little leverage: its boiling point is close to the mean. It fits the linear model.
* The example is presented in T. Naes, T. Isaksson. T. Fearn, T. Davies, 2002. A user-friendly guide to Multivariate Calibration and Classification. NIR Publications, UK
Multiple Linear Regression (MLR)
Why are multivariate methods needed?
In many cases several predictor variables used in combination can give dramatically better results than any of the individual predictors alone.
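[Figure: Y observed vs. Y predicted using one variable. R2 = 0.500]
[Figure: Y observed vs. Y predicted using several predictor variables. R2 = 0.996]
[Figure: data table with rows 1 to N and columns Y, x1, x2, x3, ..., xK]
$$Y = b_0 + \sum_{k=1}^{K} b_k x_k + f$$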
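The MLR model can be fitted with ordinary least squares on a design matrix that has a column of ones for b0. The sketch below uses synthetic data with three predictors and compares R2 for a single predictor with R2 for all of them; the numbers are illustrative and unrelated to the figures on the slide.

```python
import numpy as np

rng = np.random.default_rng(5)
N, K = 50, 3                            # samples, predictors (illustrative)
X = rng.normal(size=(N, K))
b_true = np.array([1.0, -2.0, 0.5])
y = 4.0 + X @ b_true + rng.normal(scale=0.1, size=N)     # synthetic data

def r_squared(y, y_hat):
    """Fraction of the variation in y explained by the fitted values."""
    return 1.0 - ((y - y_hat) ** 2).sum() / ((y - y.mean()) ** 2).sum()

# All K predictors: design matrix with a leading column of ones for b0.
X1 = np.column_stack([np.ones(N), X])
b, *_ = np.linalg.lstsq(X1, y, rcond=None)
r2_all = r_squared(y, X1 @ b)

# Only the first predictor, for comparison.
X1_single = X1[:, :2]
b_single, *_ = np.linalg.lstsq(X1_single, y, rcond=None)
r2_single = r_squared(y, X1_single @ b_single)
print(round(r2_single, 3), round(r2_all, 3))
```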
Disadvantage of MLR
When the number of available samples is smaller than the number of variables, there are exact linear relationships among the variables (so-called exact multicollinearity) and the LS solution becomes non-unique. MLR will then give a bad predictor.
Example: “WHEAT” data set contains 100 NIR spectra of wheat samples (X) and their moisture content values (Y). Samples were measured in diffuse reflectance mode as log(1/R) in the range 1100-2500 nm in 2 nm intervals using a Bran+Luebbe instrument.
Y observed vs. Y predicted using all of the predictor variables.
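The non-uniqueness of the LS solution with fewer samples than variables is easy to demonstrate. In the sketch below the data are pure random noise (not the WHEAT spectra) and no intercept is fitted, to keep the example short; two different coefficient vectors reproduce y exactly.

```python
import numpy as np

rng = np.random.default_rng(6)
N, K = 10, 30                           # fewer samples than variables (illustrative)
X = rng.normal(size=(N, K))
y = rng.normal(size=N)                  # pure noise: there is nothing to learn

# Minimum-norm least squares solution (no intercept, for brevity).
b_min, *_ = np.linalg.lstsq(X, y, rcond=None)

# Any vector in the null space of X can be added without changing the fit.
_, _, Vt = np.linalg.svd(X, full_matrices=True)
null_vec = Vt[-1]                       # X @ null_vec is (numerically) zero
b_other = b_min + 5.0 * null_vec

print(np.allclose(X @ b_min, y), np.allclose(X @ b_other, y))
print(np.linalg.norm(b_min - b_other))
```

Both vectors fit the calibration samples perfectly even though y is noise, so a perfect fit says nothing about how well new samples will be predicted.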
How to solve the multicollinearity problem?
• A careful variable selection deletes some of the variables in the model.
[Figure: selected variables among x1, x2, x3, x4 are related to Y by MLR]
• All x-variables are transformed into linear combinations t1 and t2, which are related to Y in a regression equation.
[Figure: x1, x2, x3, x4 are combined into t1 and t2, which are related to Y (PCR, PLSR)]
Reference: H. Martens, T. Naes, 1989. Multivariate Calibration, John Wiley & Sons.
General model structure for PCR and PLSR
[Figure: the x-variables (x1, x2, ...) are compressed into components t, which are related to Y]
$X = TP^{t} + E$: the information in X is compressed to a few components t.
$Y = Tq + f$: these components, t, are used as independent variables in a regression equation with Y as dependent variable.
P, q – loadings describing how the variables in T relate to X and Y
E, f – residuals representing the noise in X and Y
Principal Component Regression (PCR)
The score matrix, T, is computed using the criterion of maximizing the variance in X.
Example: WHEAT data set:
calibration set: 51 samples
test set: 49 samples
RMSEP = 0.244
RMSECV = 0.219
$$RMSECV = \sqrt{\sum_{i=1}^{N} (\hat{y}_{CV,i} - y_i)^2 / N}$$
$$RMSEP = \sqrt{\sum_{i=1}^{N} (\hat{y}_i - y_i)^2 / N}$$
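A minimal PCR sketch in NumPy, under stated assumptions: the data are synthetic collinear "spectra" (the WHEAT data are not reproduced here), the 51/49 split only mirrors the sample counts above, three components are used, and leave-one-out is assumed as the cross-validation scheme because the slide does not name one. The function names are made up for this illustration.

```python
import numpy as np

def pcr_fit(X, y, n_comp):
    """Fit PCR: PCA on centered X, then least squares of centered y on the scores T."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:n_comp].T                   # X-loadings, K x n_comp
    T = Xc @ P                          # scores with maximum variance in X
    q, *_ = np.linalg.lstsq(T, yc, rcond=None)   # y-loadings
    return x_mean, y_mean, P, q

def pcr_predict(model, X):
    x_mean, y_mean, P, q = model
    return y_mean + (X - x_mean) @ P @ q

def rmse(y, y_hat):
    return np.sqrt(((y_hat - y) ** 2).mean())

rng = np.random.default_rng(7)
# Synthetic collinear "spectra": 100 samples, 200 variables driven by 3 latent factors.
latent = rng.normal(size=(100, 3))
X = latent @ rng.normal(size=(3, 200)) + 0.01 * rng.normal(size=(100, 200))
y = latent @ np.array([1.0, 0.5, -0.3]) + 0.05 * rng.normal(size=100)

X_cal, y_cal, X_test, y_test = X[:51], y[:51], X[51:], y[51:]
model = pcr_fit(X_cal, y_cal, n_comp=3)
rmsep = rmse(y_test, pcr_predict(model, X_test))

# Leave-one-out cross-validation on the calibration set (assumed scheme).
cv_pred = np.empty_like(y_cal)
for i in range(len(y_cal)):
    keep = np.arange(len(y_cal)) != i
    cv_model = pcr_fit(X_cal[keep], y_cal[keep], n_comp=3)
    cv_pred[i] = pcr_predict(cv_model, X_cal[i:i + 1])[0]
rmsecv = rmse(y_cal, cv_pred)
print("RMSEP:", rmsep, "RMSECV:", rmsecv)
```

RMSEP is computed on samples that were never used for fitting, while RMSECV reuses the calibration samples by leaving each one out in turn.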
Partial Least Squares Regression (PLSR)
The score matrix, T, is computed using the criterion of maximizing the covariance between Y and all possible linear functions of X.
RMSEP = 0.221
RMSECV = 0.228
RMSECV = 0.223
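For a single response, PLS regression can be sketched with a NIPALS-style loop (often called PLS1). This is one standard formulation, not necessarily the algorithm used for the results above; the data are synthetic and the function names are made up for this illustration.

```python
import numpy as np

def pls1_fit(X, y, n_comp):
    """PLS1 (single y) via a NIPALS-style loop; returns what is needed for prediction."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xk, yk = X - x_mean, y - y_mean
    K = X.shape[1]
    W, P, q = np.zeros((K, n_comp)), np.zeros((K, n_comp)), np.zeros(n_comp)
    for a in range(n_comp):
        w = Xk.T @ yk                   # weight: direction of maximum covariance with y
        w /= np.linalg.norm(w)
        t = Xk @ w                      # score vector
        tt = t @ t
        P[:, a] = Xk.T @ t / tt         # X-loading
        q[a] = yk @ t / tt              # y-loading
        W[:, a] = w
        Xk = Xk - np.outer(t, P[:, a])  # deflate X
        yk = yk - q[a] * t              # deflate y
    b = W @ np.linalg.solve(P.T @ W, q) # regression vector in the original x-space
    return x_mean, y_mean, b

def pls1_predict(model, X):
    x_mean, y_mean, b = model
    return y_mean + (X - x_mean) @ b

rng = np.random.default_rng(8)
# Synthetic collinear data: 100 samples, 200 variables driven by 3 latent factors.
latent = rng.normal(size=(100, 3))
X = latent @ rng.normal(size=(3, 200)) + 0.01 * rng.normal(size=(100, 200))
y = latent @ np.array([1.0, 0.5, -0.3]) + 0.05 * rng.normal(size=100)

X_cal, y_cal, X_test, y_test = X[:51], y[:51], X[51:], y[51:]
model = pls1_fit(X_cal, y_cal, n_comp=3)
rmsep = np.sqrt(((pls1_predict(model, X_test) - y_test) ** 2).mean())
print("RMSEP:", rmsep)
```

The only difference from PCR in the first step is the weight vector: PLS takes the direction X'y of maximum covariance with y, whereas PCR takes the direction of maximum variance in X without looking at y.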
Comparison of PCR and PLS regression
PLS regression can give good prediction results with fewer components than PCR. A consequence of this is that the number of components needed for interpreting the information in X that is related to Y is smaller for PLS than for PCR.
From a computational point of view, PLS is faster than PCR.
From a theoretical point of view, PCR is better understood than PLS.