basics of multivariate modelling and data analysis
TRANSCRIPT
KEH Basics of Multivariate Modelling and Data Analysis 1
Basics of Multivariate Modelling and Data Analysis
Kurt-Erik Häggblom

9. Linear regression with latent variables
9.1 Principal component regression (PCR)
9.2 Partial least-squares regression (PLS)
[ mostly from Varmuza and Filzmoser (2009) and the PLS-toolbox manual by Wise et al. (2006) ]
9. Linear regression with latent variables
9.1 Principal component regression (PCR)
9.1.1 Overview
The case with many correlated regressor variables ("independent" variables that are collinear) is notoriously difficult in classical multiple linear regression (MLR) such as ordinary least-squares (OLS) regression. Usually it is necessary to select a subset of variables to reduce the number of regressors and the collinearity. In general, this is a very difficult task. However, principal component regression (PCR) is a way of avoiding/simplifying the variable selection task.
PCR is a combination of
– principal component analysis (PCA)
– multiple linear regression (MLR), usually OLS
where the PC scores are used as regressor variables:
– since the scores are orthogonal, the multicollinearity problem is avoided
– since often only a few PCs are needed, the number of regressor variables is small
9.1.2 Calculating regression coefficients
Multiple linear regression (MLR)
If the data are mean-centred, the regression model in MLR is
$$\mathbf{y} = \mathbf{X}\mathbf{b} + \mathbf{e}$$
Using OLS, the regression coefficients are determined by minimizing $\mathbf{e}^{\mathrm{T}}\mathbf{e}$. This gives a solution that can be expressed as
$$\mathbf{b}_{\mathrm{OLS}} = (\mathbf{X}^{\mathrm{T}}\mathbf{X})^{-1}\mathbf{X}^{\mathrm{T}}\mathbf{y}$$
The problem here is the inverse $(\mathbf{X}^{\mathrm{T}}\mathbf{X})^{-1}$:
– for collinear data, it is very sensitive to small errors in $\mathbf{X}$, which means that $\mathbf{b}_{\mathrm{OLS}}$ is also very sensitive to small errors
– the number of observations must be larger than the number of variables
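As a quick numerical illustration (a standalone NumPy sketch, not part of the course material), the formula $\mathbf{b}_{\mathrm{OLS}} = (\mathbf{X}^{\mathrm{T}}\mathbf{X})^{-1}\mathbf{X}^{\mathrm{T}}\mathbf{y}$ can be evaluated for two nearly collinear regressors; the condition number of $\mathbf{X}^{\mathrm{T}}\mathbf{X}$ shows why the coefficients are so sensitive:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100

# Two nearly collinear regressors (x2 is almost a copy of x1)
x1 = rng.standard_normal(n)
x2 = x1 + 1e-6 * rng.standard_normal(n)
X = np.column_stack([x1, x2])
X -= X.mean(axis=0)                    # mean-centred data, as assumed above

y = x1 + 0.1 * rng.standard_normal(n)
y -= y.mean()

# OLS: b = (X'X)^{-1} X'y
b_ols = np.linalg.solve(X.T @ X, X.T @ y)

# The near-singular X'X is the culprit: its condition number is enormous,
# so b_ols reacts violently to small perturbations in X or y
cond = np.linalg.cond(X.T @ X)
print(cond)
print(b_ols)   # typically far from the "true" coefficients (1, 0)
```

Note that the fitted values $\hat{\mathbf{y}} = \mathbf{X}\mathbf{b}$ remain well-behaved; it is the individual coefficients that become unreliable.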
Principal component regression
In PCR, a principal component analysis (PCA) is first done:
$$\mathbf{X} = \mathbf{T}\mathbf{P}^{\mathrm{T}} + \mathbf{E}, \qquad \mathbf{T} = \mathbf{X}\mathbf{P}$$
The decomposition $\mathbf{X} = \mathbf{T}\mathbf{P}^{\mathrm{T}} + \mathbf{E}$ is inserted into the linear regression model:
$$\mathbf{y} = \mathbf{X}\mathbf{b} + \mathbf{e} = \mathbf{T}\mathbf{P}^{\mathrm{T}}\mathbf{b} + \mathbf{E}\mathbf{b} + \mathbf{e} = \mathbf{T}\mathbf{g} + \mathbf{e}_{\mathrm{PCR}}$$
where $\mathbf{g} = \mathbf{P}^{\mathrm{T}}\mathbf{b}$ and $\mathbf{e}_{\mathrm{PCR}} = \mathbf{E}\mathbf{b} + \mathbf{e}$. Minimization of $\mathbf{e}_{\mathrm{PCR}}^{\mathrm{T}}\mathbf{e}_{\mathrm{PCR}}$ by OLS gives
$$\mathbf{g} = (\mathbf{T}^{\mathrm{T}}\mathbf{T})^{-1}\mathbf{T}^{\mathrm{T}}\mathbf{y}$$
Here the inverse $(\mathbf{T}^{\mathrm{T}}\mathbf{T})^{-1}$ is well-conditioned, and it always exists, because
– the columns of $\mathbf{T}$ are orthogonal
– the number of principal components is never larger than the number of observations
In terms of $\mathbf{b}$, the solution for the linear regression model can be expressed as
$$\mathbf{b}_{\mathrm{PCR}} = \mathbf{P}\mathbf{g} = \mathbf{P}(\mathbf{T}^{\mathrm{T}}\mathbf{T})^{-1}\mathbf{T}^{\mathrm{T}}\mathbf{y}$$
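The PCR formulas above translate almost line by line into NumPy (an illustrative sketch with ad-hoc variable names). With all $p$ components retained, $\mathbf{b}_{\mathrm{PCR}}$ reduces to the OLS solution, which the final check confirms:

```python
import numpy as np

def pcr_fit(X, y, k):
    """PCR: PCA on mean-centred X, then OLS on the first k score vectors."""
    # PCA loadings P from the eigendecomposition of X'X
    eigval, eigvec = np.linalg.eigh(X.T @ X)
    order = np.argsort(eigval)[::-1]        # largest variance first
    P = eigvec[:, order[:k]]                # p x k loading matrix
    T = X @ P                               # n x k score matrix
    # g = (T'T)^{-1} T'y : well-conditioned, since T has orthogonal columns
    g = np.linalg.solve(T.T @ T, T.T @ y)
    return P @ g                            # b_PCR = P g

rng = np.random.default_rng(1)
X = rng.standard_normal((50, 6))
X -= X.mean(axis=0)
b_true = np.array([1.0, -2.0, 0.5, 0.0, 0.0, 0.0])
y = X @ b_true
y -= y.mean()

b_pcr = pcr_fit(X, y, k=6)     # with all 6 PCs, PCR reduces to OLS
b_ols = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.allclose(b_pcr, b_ols))
```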
9.1.3 Selecting principal components
The problem
Overfitting of a regression model is strongly related to the collinearity problem. For PCR this means that
– PCR is less susceptible to overfitting than MLR, because it directly addresses the collinearity problem
– a PCR model could nevertheless become overfitted through the retention of too many principal components
Therefore, an important part of PCR is the determination of the optimal number of PCs to retain in the model.
Another problem is that
– PCA determines and ranks PCs to explain as much as possible of the variance of the regressor variables $x_j$
– in PCR, we want PCs that give the best possible prediction of the dependent variable $y$; such PCs have a high correlation with $y$
Therefore, some variable selection technique has to be applied also in PCR; it is not necessarily best to choose the highest ranked PCs according to the PCA.
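A small synthetic NumPy experiment (not the SFCM data; the low-variance direction is deliberately constructed to drive $y$) illustrates that ranking PCs by explained $\mathbf{X}$-variance can bury the one PC that actually predicts $y$:

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.standard_normal((80, 6))
X -= X.mean(axis=0)
X[:, 4] *= 0.3                                  # one variable: small variance ...
y = X[:, 4] + 0.05 * rng.standard_normal(80)    # ... but a strong link to y
y -= y.mean()

# PCA scores, ranked by explained X-variance (largest first)
eigval, eigvec = np.linalg.eigh(X.T @ X)
order = np.argsort(eigval)[::-1]
T = X @ eigvec[:, order]

# Correlation of each PC with y: the top-variance PC is not the winner here
corr = np.array([abs(np.corrcoef(T[:, j], y)[0, 1]) for j in range(6)])
print(corr.round(2))
```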
Cross-validation
As for PCA and MLR, cross-validation is an important tool for variable selection.
This means that the data have to be split observation-wise into
– a modelling (or training) data set
– a test data set
The prediction residual error on the test-set observations is then determined as a function of the number of PCs retained in the PCR model.
This procedure is usually repeated several times using different selections of observation subsets for training and test sets, such that each sample in the original data set is part of a test set at least once.
– A good rule of thumb is to use the square root of the number of observations for the number of repetitions of the cross-validation procedure, up to a maximum of ten repetitions.
A plot of the total "composite" prediction error over all test sets as a function of the number of PCs retained in the model is used to determine the optimal number of PCs.
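The procedure can be sketched in NumPy (a simplified illustration with contiguous test blocks and a toy data set; the PLS-toolbox automates all of this, and variable and function names here are ad hoc):

```python
import numpy as np

def fit_pcr(Xtr, ytr, k):
    """PCR with k components on already mean-centred training data."""
    eigval, eigvec = np.linalg.eigh(Xtr.T @ Xtr)
    P = eigvec[:, np.argsort(eigval)[::-1][:k]]
    T = Xtr @ P
    g = np.linalg.solve(T.T @ T, T.T @ ytr)
    return P @ g

def rmsecv(X, y, max_k, n_splits=5):
    """RMSECV for 1..max_k PCs using contiguous-block test sets."""
    n = len(y)
    blocks = np.array_split(np.arange(n), n_splits)
    press = np.zeros(max_k)          # prediction residual sum of squares
    for test_idx in blocks:
        train_idx = np.setdiff1d(np.arange(n), test_idx)
        # centre the test set with the training-set means
        xm, ym = X[train_idx].mean(axis=0), y[train_idx].mean()
        Xtr, ytr = X[train_idx] - xm, y[train_idx] - ym
        Xte, yte = X[test_idx] - xm, y[test_idx] - ym
        for k in range(1, max_k + 1):
            b = fit_pcr(Xtr, ytr, k)
            press[k - 1] += np.sum((yte - Xte @ b) ** 2)
    return np.sqrt(press / n)

rng = np.random.default_rng(2)
X = rng.standard_normal((60, 5))
y = X[:, 0] - 0.5 * X[:, 1] + 0.1 * rng.standard_normal(60)
err = rmsecv(X, y, max_k=5)
print(err)
```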
9.1.4 PCR application using the PLS-toolbox
We shall apply PCR to a Slurry-Fed Ceramic Melter (SFCM) system (Wise et al., 1991), where nuclear waste from fuel reprocessing is combined with glass-forming materials. Data from the process, consisting of temperatures in 20 locations within the melter and the molten glass level, are shown in the figure.
It is apparent that there is a great deal of correlation in the data. Many of the variables appear to follow a sawtooth pattern.
We shall develop a PCR model that will enable estimation of the level of molten glass using temperature measurements.
Starting and loading data
The SFCM temperature and molten glass level data are stored in the file plsdata.mat. The file contains 300 calibration or “training” samples (xblock1 and yblock1) and 200 test samples (xblock2 and yblock2). We will load the data into MATLAB and delete a few samples that are known to be outliers.
Preprocessing of data
Now that the data are loaded, we need to decide how to preprocess the data for modelling.
Because the predictor (temperature) variables with the greatest variance in this data set also appear to be correlated to the molten glass level, we choose to mean-centre (rather than autoscale) the data.
The scaling for Y is irrelevant if there is only one Y variable, as in this case.
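For concreteness, the two preprocessing options look as follows in NumPy (a standalone sketch; the course itself uses the PLS-toolbox preprocessing menu):

```python
import numpy as np

def mean_center(X):
    """Subtract the column means; variables keep their original variance."""
    return X - X.mean(axis=0)

def autoscale(X):
    """Mean-centre, then divide by the column standard deviations,
    so that all variables get unit variance and hence equal weight."""
    Xc = X - X.mean(axis=0)
    return Xc / Xc.std(axis=0, ddof=1)

rng = np.random.default_rng(0)
X = rng.standard_normal((300, 20)) * np.arange(1, 21)   # unequal variances
Xc = mean_center(X)
Xa = autoscale(X)
print(np.abs(Xc.mean(axis=0)).max())        # ~0 after centring
print(Xa.std(axis=0, ddof=1))               # all ~1 after autoscaling
```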
A preliminary model
Cross-validation
We must now decide how to cross-validate the model.
We will choose to split the data into ten contiguous block-wise subsets, and to calculate all twenty PCs.
A new model
Choice of principal components
Now that the PCR model and the cross-validation results have been computed, one can view the cross-validation results in various ways. A common plot used to analyse cross-validation results is an RMSECV plot (root mean squared error of cross-validation).
Note how the RMSECV has several local minima and a global minimum at eleven PCs. Two rules of thumb:
– do not include a PC unless it improves the RMSECV by at least 2 %
– use a model with the lowest possible complexity among close alternatives
Here the rules suggest that a model with six PCs would be the best choice.
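One way to make the 2 % rule precise in code (an illustrative formalization; the PLS-toolbox may apply the rule differently) is to walk through the RMSECV values and only accept a larger model when it beats the best model so far by at least 2 %:

```python
import numpy as np

def pick_n_pcs(rmsecv, rel_improvement=0.02):
    """Return the number of PCs after which no later PC improves
    the best RMSECV so far by at least `rel_improvement` (2 %)."""
    k, best = 1, rmsecv[0]
    for i in range(1, len(rmsecv)):
        if rmsecv[i] < (1 - rel_improvement) * best:
            k, best = i + 1, rmsecv[i]
    return k

# Hypothetical RMSECV curve: the global minimum is at 5 PCs, but the
# improvement from 4 to 5 PCs is below 2 %, so 4 PCs are chosen
rmsecv = np.array([1.0, 0.7, 0.69, 0.50, 0.495])
print(pick_n_pcs(rmsecv))  # → 4
```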
Model suggested by CV
Choose the desired # of PCs by clicking on the corresponding line.
Returning to the RMSECV plot for the final PCR model, we note that some PCs in the final model (specifically, PCs 2, 4 and 5) result in an
increase in the model’s estimated prediction error; this suggests that these specific PCs, although they help explain variation
in the X variables (temperatures), are not useful for prediction of the molten glass level.
Saving the model
The model can be exported
– to the Matlab workspace: the name can be changed; the model will be lost after the session unless saved by the Matlab save command
– to disk
9.2 Partial least-squares regression (PLS)
9.2.1 Overview
PLS stands for Projection to Latent Structures by means of Partial Least Squares and is a method to relate a matrix $\mathbf{X}$ to a vector $\mathbf{y}$ or to a matrix $\mathbf{Y}$.
Essentially, the model structures of PLS and PCR are the same:
– The x-data are first transformed into a set of (a few) latent variables (components).
– The latent variables are used for regression (by OLS) with one or several dependent variables.
The regression criterion (most often applied) is maximum covariance between the scores of the dependent variables and the regressor variables (i.e. the latent variables).
Maximum covariance combines
– high variance of the $\mathbf{X}$ scores with
– high correlation with the $\mathbf{Y}$ scores
Relationship with MLR and PCR
PLS is related to both MLR (OLS) and PCR/PCA. To see this, consider the linear regression model
$$\mathbf{y} = \mathbf{X}\mathbf{b} + \mathbf{e}$$
The best prediction/estimation of $\mathbf{y}$ using $\mathbf{X}$ is $\hat{\mathbf{y}} = \mathbf{X}\mathbf{b}$.
MLR by OLS maximizes the correlation between $\mathbf{y}$ and $\hat{\mathbf{y}}$, as seen from
$$\mathbf{b}_{\mathrm{OLS}} = \arg\max_{\mathbf{b}}\, r(\mathbf{y}, \hat{\mathbf{y}}) = \arg\max_{\mathbf{b}} \frac{\mathbf{y}^{\mathrm{T}}\mathbf{X}\mathbf{b}}{\sqrt{\mathbf{b}^{\mathrm{T}}\mathbf{X}^{\mathrm{T}}\mathbf{X}\mathbf{b}\;\mathbf{y}^{\mathrm{T}}\mathbf{y}}} = (\mathbf{X}^{\mathrm{T}}\mathbf{X})^{-1}\mathbf{X}^{\mathrm{T}}\mathbf{y}$$
PCR also maximizes the correlation between $\mathbf{y}$ and $\hat{\mathbf{y}}$, but with the constraint $\mathbf{b} = \mathbf{P}\mathbf{g}$, $\dim(\mathbf{g}) = k \le p = \dim(\mathbf{b})$, where $\mathbf{P}$ is the PCA loading matrix that maximizes the variance of the columns in $\mathbf{T} = \mathbf{X}\mathbf{P}$. This is seen from
$$\mathbf{g}_{\mathrm{PCR}} = \arg\max_{\mathbf{g}}\, r(\mathbf{y}, \mathbf{T}\mathbf{g}) = (\mathbf{T}^{\mathrm{T}}\mathbf{T})^{-1}\mathbf{T}^{\mathrm{T}}\mathbf{y}$$
As a summary of this, we can say that
– MLR (OLS) gives the prediction $\hat{\mathbf{y}} = \mathbf{X}\mathbf{b}_{\mathrm{OLS}} = \mathbf{X}\mathbf{I}\mathbf{b}_{\mathrm{OLS}}$, i.e. the identity matrix $\mathbf{I}$ can be considered as a loading matrix
– PCR/PCA gives the prediction $\hat{\mathbf{y}} = \mathbf{X}\mathbf{P}\mathbf{g}$, where $\mathbf{P}$ is the loading matrix that maximizes the variance of the columns in $\mathbf{T} = \mathbf{X}\mathbf{P}$
The idea in PLS is to determine a loading matrix $\mathbf{W}$ for the prediction $\hat{\mathbf{y}}_{\mathrm{PLS}} = \mathbf{X}\mathbf{W}\mathbf{h}$, where both $\mathbf{W}$ and $\mathbf{h}$ are determined in such a way that they help maximize the correlation between $\mathbf{y}$ and $\hat{\mathbf{y}}$.
This solution is
– different from MLR (OLS) because of the constraint $\mathbf{b} = \mathbf{W}\mathbf{h}$, $\dim(\mathbf{h}) = k \le p = \dim(\mathbf{b})$
– different from PCR/PCA because the loading matrix $\mathbf{W}$ is determined to maximize the correlation between $\mathbf{y}$ and $\hat{\mathbf{y}}$
We have also said that PLS maximizes the covariance between the dependent variable(s) and the scores. When the scores are constrained to have a given constant variance (e.g. = 1), this is equivalent to maximizing the correlation.
9.2.2 One dependent variable
The main purpose of PLS is to determine a linear model for prediction (estimation) of one or several dependent variables from a set of predictor (independent) variables.
The modelling procedure is here outlined for one dependent variable.
The first PLS-component is calculated as the latent variable which has the maximum covariance between the scores $\mathbf{t}_1 = \mathbf{X}\mathbf{w}_1$ and the modelled property $\mathbf{y}$:
$$\mathbf{w}_1 = \arg\max_{\mathbf{w},\, \|\mathbf{w}\| = 1} \mathrm{cov}(\mathbf{y}, \mathbf{X}\mathbf{w})$$
Next, the information (variance) of this component is removed from the $\mathbf{X}$ data. This process is called peeling or deflation. It is a projection of the $\mathbf{X}$ space onto a (hyper-)plane that is orthogonal to the direction of the found component. The resulting matrix after deflation is
$$\mathbf{X}_1 = \mathbf{X} - \mathbf{t}_1\mathbf{t}_1^{\mathrm{T}}\mathbf{X} = (\mathbf{I} - \mathbf{t}_1\mathbf{t}_1^{\mathrm{T}})\mathbf{X}$$
(with normalized scores $\mathbf{t}_1$), which as required has the property $\mathbf{X}_1\mathbf{w}_1 = \mathbf{0}$.
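The first component and the deflation step can be verified numerically (a NumPy sketch with synthetic data; note that $\mathbf{w}_1 \propto \mathbf{X}^{\mathrm{T}}\mathbf{y}$ for the covariance criterion, and that the deflation formula assumes normalized scores):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((40, 5))
X -= X.mean(axis=0)
y = X[:, 0] + 0.5 * X[:, 1]
y -= y.mean()

# w1 maximizes cov(y, Xw) subject to ||w|| = 1  =>  w1 proportional to X'y
w1 = X.T @ y
w1 /= np.linalg.norm(w1)

# Scores, normalized so that the deflation formula applies directly
t1 = X @ w1
t1 /= np.linalg.norm(t1)

# Deflation: X1 = (I - t1 t1') X
X1 = X - np.outer(t1, t1 @ X)

print(np.linalg.norm(X1 @ w1))   # ~0: the required property X1 w1 = 0
```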
The next PLS component is derived from the residual matrix $\mathbf{X}_1$, again with maximum covariance between the scores and $\mathbf{y}$.
This procedure is continued to produce "sufficiently many" PLS-components.
The final choice of the number of components to retain in the model can be made as for PCR (mainly using cross-validation).
Some comments
– In the standard versions of PLS, the scores of the PLS-components are uncorrelated; the loading vectors are in general not orthogonal.
– Because PLS components are developed as latent variables possessing a high correlation with $\mathbf{y}$, the optimum number of PLS-components is usually smaller than the optimum number of PCA-components in PCR.
– However, PLS models may be less "stable" than PCR models because less of the variance of $\mathbf{X}$ is contained.
– The more components are used, the more similar PCR and PLS models become.
A complicating aspect of most PLS algorithms is the stepwise calculation of components.
– After a component is computed, the residual matrices for $\mathbf{X}$ (and $\mathbf{Y}$) are determined.
– The next PLS-component is calculated from the residual matrices, and therefore its parameters (scores, loadings, weights) do not relate to the original matrices.
– However, equations exist that relate the PLS parameters to the original data and that also provide the regression coefficients $\mathbf{b}_{\mathrm{PLS}}$ of the final model for the original data.
9.2.3 Many dependent variables
If there is more than one dependent variable, the dependent data are stored in a matrix $\mathbf{Y}$. The basic regression model is then
$$\mathbf{Y} = \mathbf{X}\mathbf{B} + \mathbf{E}$$
where $\mathbf{B}$ is a matrix of regression coefficients and $\mathbf{E}$ is a matrix of residuals. Here the columns $\mathbf{b}_j$ and $\mathbf{e}_j$ in $\mathbf{B}$ and $\mathbf{E}$ correspond to the column $\mathbf{y}_j$ in $\mathbf{Y}$.
If the dependent variables are considered to be mutually independent (uncorrelated), PLS (or any other regression method) for one dependent variable can be applied to each variable $\mathbf{y}_j$, one at a time.
– MLR using OLS then gives the solution $\mathbf{B} = (\mathbf{X}^{\mathrm{T}}\mathbf{X})^{-1}\mathbf{X}^{\mathrm{T}}\mathbf{Y}$.
If the dependent variables are correlated, it is best to deal with them jointly. PLS, which is called PLS2 when there is more than one dependent variable, is then very suitable.
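A short NumPy check (synthetic data) confirms that the joint OLS solution $\mathbf{B} = (\mathbf{X}^{\mathrm{T}}\mathbf{X})^{-1}\mathbf{X}^{\mathrm{T}}\mathbf{Y}$ is exactly the same as regressing each column $\mathbf{y}_j$ on $\mathbf{X}$ separately:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.standard_normal((30, 4))
X -= X.mean(axis=0)
B_true = rng.standard_normal((4, 2))
Y = X @ B_true
Y -= Y.mean(axis=0)

# Joint OLS solution B = (X'X)^{-1} X'Y ...
B = np.linalg.solve(X.T @ X, X.T @ Y)

# ... equals column-by-column regression of each y_j on X
B_cols = np.column_stack(
    [np.linalg.solve(X.T @ X, X.T @ Y[:, j]) for j in range(Y.shape[1])]
)
print(np.allclose(B, B_cols))
```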
Main idea
In PLS2, both $\mathbf{X}$ and $\mathbf{Y}$ are decomposed into scores and loadings:
– $\mathbf{p}_k$ and $\mathbf{t}_k$, $k = 1, \ldots$, are loadings and scores of latent variables for $\mathbf{X}$
– $\mathbf{q}_k$ and $\mathbf{u}_k$, $k = 1, \ldots$, are loadings and scores of latent variables for $\mathbf{Y}$
in such a way that the covariance between the $\mathbf{X}$ and $\mathbf{Y}$ scores is maximized.
The regression is performed between the $\mathbf{X}$ and $\mathbf{Y}$ scores. The regression coefficients can be transformed to allow direct estimation of $\mathbf{Y}$ from $\mathbf{X}$.
Mathematical development
There are many variants of PLS algorithms. It is e.g. possible to specify that
– loading vectors are orthogonal (the "Eigenvector algorithm")
– scores are uncorrelated (i.e. orthogonal), loadings are non-orthogonal (algorithms such as Kernel, NIPALS, SIMPLS, O-PLS)
Here a PLS2 method producing uncorrelated scores is outlined. $\mathbf{X}$ and $\mathbf{Y}$ are modelled by linear latent variables as
$$\mathbf{X} = \mathbf{T}\mathbf{P}^{\mathrm{T}} + \mathbf{E}_X \quad \text{and} \quad \mathbf{Y} = \mathbf{U}\mathbf{Q}^{\mathrm{T}} + \mathbf{E}_Y$$
Instead of the loadings $\mathbf{p}_k$ and $\mathbf{q}_k$, new loading vectors $\mathbf{w}_k$ and $\mathbf{c}_k$ that satisfy
$$\mathbf{t}_k = \mathbf{X}_{k-1}\mathbf{w}_k \quad \text{and} \quad \mathbf{u}_k = \mathbf{Y}\mathbf{c}_k$$
are introduced. Here $k = 1, \ldots$, $\mathbf{X}_0 = \mathbf{X}$, and $\mathbf{X}_{k-1}$ is an updated (deflated) version of $\mathbf{X}$.
For the first PLS-component, the vectors $\mathbf{w}_1$ and $\mathbf{c}_1$ are determined by solving
$$[\mathbf{w}_1 \ \mathbf{c}_1] = \arg\max_{\mathbf{w}, \mathbf{c}}\, (\mathbf{X}\mathbf{w})^{\mathrm{T}}(\mathbf{Y}\mathbf{c}), \qquad \|\mathbf{X}\mathbf{w}\| = 1, \quad \|\mathbf{Y}\mathbf{c}\| = 1$$
The vectors $\mathbf{w}_1$ and $\mathbf{c}_1$ can be found by solving eigenvalue problems:
– $\mathbf{w}_1$ is the eigenvector corresponding to the largest eigenvalue of $\mathbf{X}^{\mathrm{T}}\mathbf{Y}\mathbf{Y}^{\mathrm{T}}\mathbf{X}$
– $\mathbf{c}_1$ is the eigenvector corresponding to the largest eigenvalue of $\mathbf{Y}^{\mathrm{T}}\mathbf{X}\mathbf{X}^{\mathrm{T}}\mathbf{Y}$
The first-component scores are given by $\mathbf{t}_1 = \mathbf{X}\mathbf{w}_1$ and $\mathbf{u}_1 = \mathbf{Y}\mathbf{c}_1$.
The first-component loadings for $\mathbf{X}$ are given by
$$\mathbf{p}_1 = \mathbf{X}^{\mathrm{T}}\mathbf{t}_1 = \mathbf{X}^{\mathrm{T}}\mathbf{X}\mathbf{w}_1$$
The first-component loadings for $\mathbf{Y}$ are not needed for the regression.
The deflated $\mathbf{X}$ matrix is given by
$$\mathbf{X}_1 = \mathbf{X} - \mathbf{t}_1\mathbf{t}_1^{\mathrm{T}}\mathbf{X} = (\mathbf{I} - \mathbf{t}_1\mathbf{t}_1^{\mathrm{T}})\mathbf{X}$$
The deflated $\mathbf{Y}$ matrix could be calculated similarly, but it is not needed.
The second PLS-components are calculated similarly from $\mathbf{X}_1$ and $\mathbf{Y}$.
The procedure is repeated to obtain "sufficiently many" components.
When all components are known, the regression coefficients are given by
$$\mathbf{B} = \mathbf{W}(\mathbf{P}^{\mathrm{T}}\mathbf{W})^{-1}\mathbf{C}^{\mathrm{T}}, \qquad \mathbf{W} = [\mathbf{w}_1 \cdots \mathbf{w}_k], \quad \mathbf{C} = [\mathbf{c}_1 \cdots \mathbf{c}_k]$$
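The outlined algorithm can be sketched in NumPy. This is an illustrative implementation, not the PLS-toolbox code; in particular, $\mathbf{c}_k$ is computed here as $\mathbf{Y}^{\mathrm{T}}\mathbf{t}_k$ for normalized scores (the regression of $\mathbf{Y}$ on $\mathbf{t}_k$) rather than as a unit-norm eigenvector, which fixes the scaling so that $\mathbf{B} = \mathbf{W}(\mathbf{P}^{\mathrm{T}}\mathbf{W})^{-1}\mathbf{C}^{\mathrm{T}}$ can be used directly:

```python
import numpy as np

def pls2(X, Y, k):
    """PLS2 sketch with uncorrelated scores, following the outline above."""
    Xk = X.copy()
    W, P, C = [], [], []
    for _ in range(k):
        # w: eigenvector for the largest eigenvalue of Xk' Y Y' Xk
        eigval, eigvec = np.linalg.eigh(Xk.T @ Y @ Y.T @ Xk)
        w = eigvec[:, -1]                 # eigh sorts eigenvalues ascending
        t = Xk @ w
        t /= np.linalg.norm(t)            # normalized scores
        p = Xk.T @ t                      # X loadings
        c = Y.T @ t                       # Y coefficients on the scores
        Xk = Xk - np.outer(t, t @ Xk)     # deflation: (I - t t') Xk
        W.append(w); P.append(p); C.append(c)
    W, P, C = (np.column_stack(M) for M in (W, P, C))
    return W @ np.linalg.solve(P.T @ W, C.T)   # B = W (P'W)^{-1} C'

rng = np.random.default_rng(7)
X = rng.standard_normal((30, 4))
X -= X.mean(axis=0)
Y = X @ rng.standard_normal((4, 2))       # Y exactly linear in X
B = pls2(X, Y, k=4)                       # all components: exact fit expected
print(np.linalg.norm(X @ B - Y))
```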
9.2.4 Geometric illustration [ based on UMETRICS material]
Data consist of $n$ observations and
– a set of $p$ independent variables (inputs) in an $n \times p$ matrix $\mathbf{X}$
– a set of $q$ dependent variables (outputs) in an $n \times q$ matrix $\mathbf{Y}$
Each variable has a coordinate axis:
– $p$ coordinates for the $\mathbf{X}$ data
– $q$ coordinates for the $\mathbf{Y}$ data
– illustration for $p = q = 3$
Each observation is represented by
– one point in the $\mathbf{X}$ space
– one point in the $\mathbf{Y}$ space
The mean value of each variable in both data sets is here denoted by a red dot in the two coordinate systems (it is not an observation). Data are here mean-centred.
The first PLS-component is a
– line in the $\mathbf{X}$ space
– line in the $\mathbf{Y}$ space
calculated to
– approximate the points well in $\mathbf{X}$ and $\mathbf{Y}$
– yield a good correlation between the projections $\mathbf{t}_1$ and $\mathbf{u}_1$ (the scores)
The vector directions are $\mathbf{w}_1$ and $\mathbf{c}_1$; the coordinates are $\mathbf{t}_1$ and $\mathbf{u}_1$.
The $\mathbf{t}_1$-$\mathbf{u}_1$ plot shows how well the first PLS-component models the data:
– points on the line are modelled exactly
– points not on the line may be modelled by other PLS-components
The second PLS-component is also represented by lines in the $\mathbf{X}$ and $\mathbf{Y}$ spaces,
– calculated to approximate the points well and provide a good correlation
– in such a way that the $\mathbf{t}$-lines are orthogonal; the $\mathbf{u}$-lines may be orthogonal
These lines, with directions $\mathbf{w}_2$ and $\mathbf{c}_2$ and coordinates $\mathbf{t}_2$ and $\mathbf{u}_2$, improve the approximation and correlation as much as possible.
The second projection coordinates
– usually correlate less well than the first projection coordinates
– may correlate better than the first projection coordinates if there is a strong structure in $\mathbf{X}$ that is not related to (or present in) $\mathbf{Y}$.
The first two PLS-components form planes in the $\mathbf{X}$ and $\mathbf{Y}$ spaces.
The variability around the $\mathbf{X}$-plane can be used to calculate a tolerance interval within which new observations will (should) be located. Observations outside this interval imply that the model may not be valid for this data.
Plotting successive pairs of latent variables ($\mathbf{u}_k$ vs. $\mathbf{t}_k$) against each other will give a good picture of the correlation structure. The plot in the SE corner indicates that there is almost no information left in the $k$:th pair of latent variables.
9.2.5 Evaluation and diagnostics
The PLS result and the data can be analysed and evaluated in many ways.
Detection of outliers
– The techniques for outlier detection used in PCA can also be used in PLS.
– Since PLS uses a $\mathbf{Y}$ block in addition to the $\mathbf{X}$ block, one can also look for outliers in the prediction of $\mathbf{Y}$.
The figure illustrates a way of plotting the prediction error against a score-related parameter ("leverage"):
– the y-axis prediction error is autoscaled; the error unit is standard deviations
– the x-axis "leverage" defines the influence of a given observation on the model; it is proportional to Hotelling's T2
Four of the marked observations are very clear outliers. These outliers should be removed from the model-building data.
Cross-validation
– The standard techniques for selecting the number of latent variables based on cross-validation can be used.
– In addition, one can consider how much of the variation in $\mathbf{Y}$ each LV describes, expressed e.g. as $\hat{\mathbf{u}}^{\mathrm{T}}\hat{\mathbf{u}} / \mathbf{u}^{\mathrm{T}}\mathbf{u}$.
Relationships between observations
Relationships between observations can be studied by various score plots, e.g.
– $\mathbf{u}_1$ vs. $\mathbf{t}_1$, $\mathbf{u}_2$ vs. $\mathbf{t}_2$: a linear relationship with high correlation is desired
– $\mathbf{t}_2$ vs. $\mathbf{t}_1$, $\mathbf{u}_2$ vs. $\mathbf{u}_1$: no correlation is desired
Variable interpretations
There are many ways to analyse the contribution and importance of variables in the PLS model, e.g.
– loadings on LVs (the $\mathbf{p}_k$:s)
– Q-residuals
– Hotelling's T2 statistic
– regression coefficients
– VIP scores ("Variable Importance in Projection")
In the PLS-toolbox, these plots are obtained via "Loadings plots".
9.2.6 PLS application using the PLS-toolbox
We shall apply PLS to the same Slurry-Fed Ceramic Melter (SFCM) system that was used in the PCR application (section 9.1.4).
Except for the analysis startup for PLS, the model-building steps for PLS are exactly the same as for PCR, including cross-validation for the choice of the number of PLS-components.
Cross-validation results
Model-building combined with cross-validation produces the result shown. The variance captured by the model for each number of latent variables (LVs) is shown for
– the $\mathbf{X}$ block (% of the $\mathbf{X}$ variance)
– the $\mathbf{Y}$ block (% of the $\mathbf{Y}$ variance)
The suggested model is (apparently) based on the $\mathbf{X}$ block variance. Note the small $\mathbf{Y}$ block variance for LV4: LV4 does not contain information about $\mathbf{Y}$.
This suggests that 3 LVs would be sufficient for predictive purposes.
Another (better) way to select the number of LVs is to consider the prediction error based on cross-validation, which can be quantified by an RMSECV plot:
– the figure suggests 3 or 4 LVs
– based on the rule of thumb (at least 2 % improvement), 3 LVs are sufficient