principal component analysis (pca) - politechnika gdańska › chem › ceeam › dokumenty ›...

Principal Component Analysis (PCA)

X = S*L + E, S = XL'

Dimensions: X is m × n, S is m × f, L is f × n, E is m × n.

m – number of objects
n – number of variables
f – number of the new latent factors (PCs)
S – scores matrix
L – loadings matrix
E – error matrix containing the unexplained part of X
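The decomposition can be sketched numerically. The following is a minimal illustration, assuming autoscaled data and an SVD-based computation; the function name `pca` and the synthetic data are hypothetical and not taken from the slides.

```python
import numpy as np

def pca(X, f):
    """Return autoscaled data Xc, scores S (m x f), loadings L (f x n), residuals E (m x n)."""
    # Autoscale: centre each column and divide by its standard deviation.
    Xc = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # Rows of Vt are orthonormal loading vectors, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    L = Vt[:f]          # loadings, f x n
    S = Xc @ L.T        # scores, S = X L'
    E = Xc - S @ L      # part of X not explained by f components
    return Xc, S, L, E

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))   # synthetic data: m = 20 objects, n = 5 variables
Xc, S, L, E = pca(X, f=2)

print(np.allclose(Xc, S @ L + E))   # the model X = S*L + E holds by construction
print(S[:, 0] @ S[:, 1])            # close to zero: the score columns are uncorrelated
```

With f equal to the rank of the data, E vanishes; a smaller f leaves the low-variance directions in E.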

Reference: B.M.G. Vandeginste, D.L. Massart, L.M.C. Buydens, S. de Jong, P.J. Lewi, J. Smeyers-Verbeke, 1998. Handbook of Chemometrics and Qualimetrics: Part B. Elsevier Science, Amsterdam.


• The new variables (PCs) are linear combinations of the original ones.
• PCs are uncorrelated.
• The first PC captures as much as possible of the variability in the original variables: among all such linear combinations, it is the one with maximum variance.
• Each successive PC accounts for as much of the remaining variability as possible.

[Figure: data in the (x1, x2) plane with the PC1 and PC2 axes; the mean of the data set is marked]

Reference: T. Naes, T. Isaksson, T. Fearn, T. Davies, 2002. A user-friendly guide to Multivariate Calibration and Classification. NIR Publications, UK.


[Figure: data in the (x1, x2) plane with PC1 and an outlier; PC1 vs. PC2 plots illustrating the removal of outliers]

PCA of autoscaled Austrian data: score plots

Reference: I. Stanimirova, M. Daszykowski, D.L. Massart, F. Questier, V. Simeonov, H. Puxbaum, Chemometrical exploration of the wet precipitation chemistry from the Austrian Monitoring Network (1988-1999), accepted in JEM.

PCA of autoscaled Austrian data: loading bar plots

[Figure: three loading bar plots over the variables H+, NH4+, Na+, K+, Ca2+, Mg2+, Cl-, NO3-, SO42-]

PC 1 is a global size factor reflecting the total mineral content.

PC 2 describes the acidity of the samples.

PC 3 is a “mixed salt” factor, contrasting K+ and Mg2+ on the one hand with Cl- on the other.

PCA of autoscaled Austrian data: score plots (after removal of the extreme objects)

PCA of autoscaled Austrian data

• Innervillgraten, Sonnblick, Reutte – low mineral content.
• Kufstein, Werfenweng, Nassfeld, Lunz, Nasswald – intermediate mineral content.
• Haunsberg, Litschau and Lobau – large mineral content.
• Lobau is much more influenced by anthropogenic factors than Nasswald.
• Litschau and Haunsberg samples can be distinguished from those of Werfenweng by their higher proportion of H+, NO3-, SO42-.

[Figure: score plot labelled by sampling site: Reutte, Kufstein, Innervillgraten, Sonnblick, Nassfeld, Haunsberg, Werfenweng, Lunz, Nasswald, Lobau, Litschau]

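Tucker3

[Figure: the three-way array X (i × j × k) is decomposed into the component matrices A (i × D), B (j × E) and C (k × F), the core array G (D × E × F) and the residual array E (i × j × k)]

$x_{ijk} = \sum_{d=1}^{D} \sum_{e=1}^{E} \sum_{f=1}^{F} a_{id} \, b_{je} \, c_{kf} \, g_{def} + e_{ijk}$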
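The triple sum can be evaluated directly once A, B, C and G are known. The sketch below only reconstructs the model part of X from given (here random, hypothetical) component matrices and a core array; it does not fit a Tucker3 model, and the array sizes are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
I, J, K = 4, 5, 6      # hypothetical sizes of the three modes
D, E, F = 3, 3, 2      # numbers of components per mode, as in a [3 3 2] model

A = rng.normal(size=(I, D))
B = rng.normal(size=(J, E))
C = rng.normal(size=(K, F))
G = rng.normal(size=(D, E, F))   # core array

# Model part of x_ijk: sum over d, e, f of a_id * b_je * c_kf * g_def
X_hat = np.einsum("id,je,kf,def->ijk", A, B, C, G)

# The same element computed with explicit sums, for one (i, j, k)
i, j, k = 1, 2, 3
x_single = sum(A[i, d] * B[j, e] * C[k, f] * G[d, e, f]
               for d in range(D) for e in range(E) for f in range(F))

print(X_hat.shape)
print(np.isclose(X_hat[i, j, k], x_single))
```

Each core element g_def weights the interaction of component d of the first mode, e of the second and f of the third, which is why large core elements are the ones worth interpreting.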

Reference: R. Henrion, N-way principal component analysis. Theory, algorithms and applications, Chemom. Intell. Lab. Syst. 25 (1994) 1-23.

Tucker3 of standardized Austrian data

[Figure: three-way array X with the modes sites, parameters and time, decomposed into A, B and C]

Perfect data structure: the time dimension was set to be 8 years for each sampling site.

Core data array, [3 3 2] - 89.7%

G(d, e, 1):
 48.24  -0.13  -0.17
  0.26  -7.75  -4.29
  0.03  -6.06   8.89

G(d, e, 2):
 -1.81   3.10   0.17
 -2.44  -5.85  -3.57
  0.17  -0.33   2.87

Elements: (1 1 1); (3 3 1); (2 2 1); (3 2 1)

Interpretation of Tucker3 - core element (1 1 1)

[Figure: plot labelled with the parameters H+, NH4+, NO3-, SO42-, Na+, K+, Ca2+, Cl-, Mg2+]

Samples from all sampling sites are ranked according to their increasing content of NH4+, NO3- and SO42- during the whole sampling period, and particularly during 1993, 1995, 1995.

A1   B1   C1   G(1 1 1)   product
+    +    +    +          +

Interpretation of Tucker3 – core element (3 2 1)

[Figure: plot labelled with the parameters H+, NH4+, NO3-, SO42-, Na+, K+, Ca2+, Cl-, Mg2+]

A3   B2   C1   G(3 2 1)   product
+    +    +    -          -
+    -    +    -          +

Samples from all sites except Werfenweng and Lobau show a decrease (-) of Na+, K+, Ca2+, Mg2+ and Cl- during the whole sampling period, particularly in 1993, 1995 and 1996, and an increase (+) of acidity.


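Interpretation of Tucker3 – core element (2 2 1)

[Figure: plot labelled with the parameters H+, NH4+, NO3-, SO42-, Na+, K+, Ca2+, Cl-, Mg2+]

Samples from Haunsberg can be distinguished from the samples of the other sites by high Ca2+ and Cl- ion concentrations during 1995 and 1996.

A2   B2   C1   G(2 2 1)   product
-    +    +    -          +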


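Interpretation of Tucker3 – core element (3 3 1)

[Figure: plot labelled with the parameters H+, NH4+, NO3-, SO42-, Na+, K+, Ca2+, Cl-, Mg2+]

Samples from Lobau can be distinguished from the samples of the other sites by high K+ and Mg2+ ion concentrations during 1995 and 1996.

A3   B3   C1   G(3 3 1)   product
-    -    +    +          +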

Least Squares Regression (LSR)

[Figure: Y (N × 1) plotted against x with the fitted line, showing an observation y_i, its fitted value and its residual]

Model: $Y = b_0 + b_1 x + f$

b0 – intercept
b1 – slope
f – random error term

Fitted values: $\hat{y}_i = \hat{b}_0 + \hat{b}_1 x_i$
Residual: $\hat{f}_i = y_i - \hat{y}_i$, where $y_i$ is the observation
$RSS = \sum_{i=1}^{N} \hat{f}_i^2$

The sum over all N observations of the squared residuals (RSS) is an overall measure of the goodness of fit of the line to the data. The principle of least squares (LS) chooses the values of b0 and b1 that give the smallest RSS.
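For a single predictor the least squares estimates have a closed form. The sketch below computes b0, b1 and RSS from the usual formulas; the helper name `fit_line` and the data values are made up for illustration and are not Forbes's measurements.

```python
import numpy as np

def fit_line(x, y):
    """Least squares estimates of intercept b0 and slope b1, plus the RSS."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = x.mean(), y.mean()
    # Slope: co-variation of x and y divided by the variation of x.
    b1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    b0 = y_mean - b1 * x_mean
    y_hat = b0 + b1 * x          # fitted values
    f_hat = y - y_hat            # residuals
    rss = float(np.sum(f_hat ** 2))
    return b0, b1, rss

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])   # made-up observations
b0, b1, rss = fit_line(x, y)
print(b0, b1, rss)
```

Any other choice of b0 and b1 gives a larger RSS on the same data, which is what "least squares" refers to.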

EXAMPLE*
Forbes’s data: he measured the boiling point of water and the atmospheric pressure at various locations in the Alps and Scotland.
Aim: to be able to estimate pressure (Y), and hence altitude, from the boiling point of water (x).

[Figure: fitted line for Forbes’s data with one outlier marked]

b0 = -81.00; b1 = 0.52

RSS = 0.79; R2 = 0.994

The outlier has little leverage: its boiling point is close to the mean. It fits the linear model.

* The example is presented in T. Naes, T. Isaksson, T. Fearn, T. Davies, 2002. A user-friendly guide to Multivariate Calibration and Classification. NIR Publications, UK.

Multiple Linear Regression (MLR)

Why are multivariate methods needed?

In many cases several predictor variables used in combination can give dramatically better results than any of the individual predictors alone.

[Figure: Y observed vs. Y predicted using one variable. R2 = 0.500]

[Figure: Y observed vs. Y predicted using several predictor variables. R2 = 0.996]

[Figure: Y (N × 1) and the predictor variables x1, x2, x3, ..., xK]

$Y = b_0 + \sum_{k=1}^{K} b_k x_k + f$
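As an illustration of the MLR model, the sketch below fits b0 and b1, ..., bK by least squares on synthetic data and compares R2 for one predictor against all predictors. The data, the coefficients and the helper `r2` are assumptions made for the example.

```python
import numpy as np

def r2(y, y_hat):
    """Coefficient of determination."""
    return 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

rng = np.random.default_rng(1)
N, K = 50, 3
X = rng.normal(size=(N, K))                    # synthetic predictors x1..xK
b_true = np.array([1.0, 2.0, -1.0, 0.5])       # arbitrary b0, b1, b2, b3
y = b_true[0] + X @ b_true[1:] + 0.01 * rng.normal(size=N)

# A column of ones carries the intercept b0.
X1 = np.column_stack([np.ones(N), X])
b, *_ = np.linalg.lstsq(X1, y, rcond=None)
r2_all = r2(y, X1 @ b)

# Same fit using only the first predictor.
X_single = X1[:, :2]
b_s, *_ = np.linalg.lstsq(X_single, y, rcond=None)
r2_single = r2(y, X_single @ b_s)

print(b)
print(r2_single, r2_all)
```

On the calibration data, adding predictors can never lower R2, so the comparison only shows the potential gain; predictive ability has to be checked on separate samples.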

Disadvantage of MLR

When the number of available samples is smaller than the number of variables, exact linear relationships exist among the variables (so-called exact multicollinearity) and the LS solution becomes non-unique. MLR then gives a poor predictor.
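The non-uniqueness can be seen on a small synthetic case. This sketch, with arbitrary sizes N = 5 and K = 8 and no intercept for brevity, builds two different coefficient vectors that fit the same calibration data exactly.

```python
import numpy as np

rng = np.random.default_rng(2)
N, K = 5, 8                       # fewer samples than variables
X = rng.normal(size=(N, K))
y = rng.normal(size=N)

rank = np.linalg.matrix_rank(X)   # at most N, so smaller than K

# lstsq returns the minimum-norm solution among infinitely many.
b_min, *_ = np.linalg.lstsq(X, y, rcond=None)

# Full SVD: the last K - N rows of Vt span the null space of X.
_, _, Vt = np.linalg.svd(X)
null_vec = Vt[-1]                 # X @ null_vec is (numerically) zero
b_other = b_min + 10.0 * null_vec

print(rank)
print(np.allclose(X @ b_min, y), np.allclose(X @ b_other, y))
```

Both vectors reproduce y perfectly on the calibration samples, yet they will generally predict different values for new samples, which is why the fit says nothing about predictive ability here.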

Example: the “WHEAT” data set contains 100 NIR spectra of wheat samples (X) and their moisture content values (Y). Samples were measured in diffuse reflectance mode as log(1/R) in the range 1100-2500 nm in 2 nm intervals using a Bran+Luebbe instrument.

Y observed vs. Y predicted using all of the predictor variables.

How to solve the multicollinearity problem?

• A careful variable selection deletes some of the variables in the model (MLR).

[Figure: x1, x2, x3, x4 and Y (MLR)]

• All x-variables are transformed into linear combinations t1 and t2, which are related to Y in a regression equation (PCR, PLSR).

[Figure: x1, x2, x3, x4 combined into t1 and t2, which are related to Y (PCR, PLSR)]

Reference: H. Martens, T. Naes, 1989. Multivariate Calibration, John Wiley & Sons.

General model structure for PCR and PLSR

[Figure: the x-variables x1, x2, ... are compressed into components t, which are related to Y]

X = TP^t + E
The information in X is compressed to a few components t.

Y = Tq + f
These components, t, are used as independent variables in a regression equation with Y as the dependent variable.

P, q – loadings describing how the variables in T relate to X and Y
E, f – residuals representing the noise in X and Y

Principal Component Regression (PCR)

The score matrix, T, is computed using the criterion of maximizing the variance in X.

Example: WHEAT data set
calibration set: 51 samples
test set: 49 samples

RMSEP = 0.244
RMSECV = 0.219

$RMSECV = \sqrt{\sum_{i=1}^{N} (\hat{y}_{CV,i} - y_i)^2 / N}$

$RMSEP = \sqrt{\sum_{i=1}^{N} (\hat{y}_i - y_i)^2 / N}$
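A minimal PCR sketch follows, assuming mean-centred data, scores from an SVD of X and leave-one-out cross-validation (the slides do not state which cross-validation scheme was used). The WHEAT data are not reproduced here; the synthetic data, the 51/49 split and the function names are illustrative assumptions.

```python
import numpy as np

def pcr_fit(X, y, n_comp):
    """PCR: regress centred y on the first n_comp PCA scores of centred X."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:n_comp].T                              # loadings, K x n_comp
    T = Xc @ P                                     # scores, maximum variance in X
    q = np.linalg.lstsq(T, yc, rcond=None)[0]      # Y = Tq + f
    return x_mean, y_mean, P @ q                   # coefficients for the x-variables

def predict(model, X):
    x_mean, y_mean, b = model
    return y_mean + (X - x_mean) @ b

def rmse(y, y_hat):
    return float(np.sqrt(np.mean((y_hat - y) ** 2)))

def rmsecv(X, y, n_comp):
    """Leave-one-out cross-validation error."""
    preds = np.empty_like(y)
    for i in range(len(y)):
        keep = np.arange(len(y)) != i
        preds[i] = predict(pcr_fit(X[keep], y[keep], n_comp), X[i:i + 1])[0]
    return rmse(y, preds)

# Synthetic, strongly collinear data: 200 variables driven by 3 latent factors.
rng = np.random.default_rng(3)
latent = rng.normal(size=(100, 3))
X = latent @ rng.normal(size=(3, 200)) + 0.01 * rng.normal(size=(100, 200))
y = latent @ np.array([1.0, -0.5, 0.25]) + 0.05 * rng.normal(size=100)

X_cal, y_cal, X_test, y_test = X[:51], y[:51], X[51:], y[51:]
rmsep = rmse(y_test, predict(pcr_fit(X_cal, y_cal, n_comp=3), X_test))
rmsecv_val = rmsecv(X_cal, y_cal, n_comp=3)
print(rmsep, rmsecv_val)
```

In practice RMSECV is computed for a range of component numbers and the number of components is chosen where the error levels off.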

Partial Least Squares Regression (PLSR)

The score matrix, T, is computed using the criterion of maximizing the covariance between Y and all possible linear functions of X.

RMSEP = 0.221
RMSECV = 0.228
RMSECV = 0.223
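The covariance criterion can be sketched with a NIPALS-style PLS1 algorithm for a single y-variable. The slides do not say which PLS algorithm was used, so this choice, the function names and the synthetic data are assumptions.

```python
import numpy as np

def pls1_fit(X, y, n_comp):
    """PLS1 for one y-variable; returns means and regression coefficients."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xk, yk = X - x_mean, y - y_mean
    K = X.shape[1]
    W = np.zeros((K, n_comp))     # weights
    P = np.zeros((K, n_comp))     # X-loadings
    q = np.zeros(n_comp)          # y-loadings
    for a in range(n_comp):
        w = Xk.T @ yk
        w /= np.linalg.norm(w)    # unit direction with maximum covariance with y
        t = Xk @ w                # score vector
        tt = t @ t
        p = Xk.T @ t / tt
        q[a] = yk @ t / tt
        Xk = Xk - np.outer(t, p)  # deflate X
        yk = yk - q[a] * t        # deflate y
        W[:, a], P[:, a] = w, p
    b = W @ np.linalg.solve(P.T @ W, q)
    return x_mean, y_mean, b

def predict(model, X):
    x_mean, y_mean, b = model
    return y_mean + (X - x_mean) @ b

# Synthetic collinear data: 50 variables driven by 3 latent factors.
rng = np.random.default_rng(4)
latent = rng.normal(size=(100, 3))
X = latent @ rng.normal(size=(3, 50)) + 0.01 * rng.normal(size=(100, 50))
y = latent @ np.array([1.0, -0.5, 0.25]) + 0.05 * rng.normal(size=100)

model = pls1_fit(X[:51], y[:51], n_comp=3)
rmsep = float(np.sqrt(np.mean((predict(model, X[51:]) - y[51:]) ** 2)))
print(rmsep)
```

Unlike PCR, each weight vector depends on y, so components that describe large but y-irrelevant variation in X are not extracted first.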

Comparison of PCR and PLS regression

PLS regression can give good prediction results with fewer components than PCR. A consequence of this is that the number of components needed for interpreting the information in X that is related to Y is smaller for PLS than for PCR.

From a computational point of view, PLS is faster than PCR.

From a theoretical point of view, PCR is better understood than PLS.
