basics of multivariate modelling and data analysis
TRANSCRIPT
KEH Basics of Multivariate Modelling and Data Analysis 1
Basics of Multivariate Modelling and Data Analysis
Kurt-Erik Häggblom

9. Linear regression with latent variables
9.1 Principal component regression (PCR)
9.2 Partial least-squares regression (PLS)
[ mostly from Varmuza and Filzmoser (2009) and the PLS-toolbox manual by Wise et al. (2006) ]
9. Linear regression with latent variables
9.1 Principal component regression (PCR)
9.1.1 Overview
The case with many correlated regressor variables ("independent" variables that are collinear) is notoriously difficult in classical multiple linear regression (MLR) such as ordinary least-squares (OLS) regression. Usually it is necessary to select a subset of variables to reduce the number of regressors and the collinearity. In general, this is a very difficult task. However, principal component regression (PCR) is a way of avoiding/simplifying the variable selection task.
PCR is a combination of
– principal component analysis (PCA)
– multiple linear regression (MLR), usually OLS
where the PC scores are used as regressor variables:
– since the scores are orthogonal, the multicollinearity problem is avoided
– since often only a few PCs are needed, the number of regressor variables is small
9.1.2 Calculating regression coefficients
Multiple linear regression (MLR)
If the data are mean-centred, the regression model in MLR is
$$\mathbf{y} = \mathbf{X}\mathbf{b} + \mathbf{e}$$
Using OLS, the regression coefficients are determined by minimizing $\mathbf{e}^{\mathrm{T}}\mathbf{e}$. This gives a solution that can be expressed as
$$\mathbf{b}_{\mathrm{OLS}} = (\mathbf{X}^{\mathrm{T}}\mathbf{X})^{-1}\mathbf{X}^{\mathrm{T}}\mathbf{y}$$
The problem here is the inverse $(\mathbf{X}^{\mathrm{T}}\mathbf{X})^{-1}$:
– for collinear data, it is very sensitive to small errors in $\mathbf{X}$, which means that $\mathbf{b}_{\mathrm{OLS}}$ is also very sensitive to small errors
– the number of observations must be larger than the number of variables
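As a quick numerical illustration (a standalone NumPy sketch, not part of the course material), the formula $\mathbf{b}_{\mathrm{OLS}} = (\mathbf{X}^{\mathrm{T}}\mathbf{X})^{-1}\mathbf{X}^{\mathrm{T}}\mathbf{y}$ can be evaluated for two nearly collinear regressors; the condition number of $\mathbf{X}^{\mathrm{T}}\mathbf{X}$ shows why the coefficients are so sensitive:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100

# Two nearly collinear regressors (x2 is almost a copy of x1)
x1 = rng.standard_normal(n)
x2 = x1 + 1e-6 * rng.standard_normal(n)
X = np.column_stack([x1, x2])
X -= X.mean(axis=0)                    # mean-centred data, as assumed above

y = x1 + 0.1 * rng.standard_normal(n)
y -= y.mean()

# OLS: b = (X'X)^{-1} X'y
b_ols = np.linalg.solve(X.T @ X, X.T @ y)

# The near-singular X'X is the culprit: its condition number is enormous,
# so b_ols reacts violently to small perturbations in X or y
cond = np.linalg.cond(X.T @ X)
print(cond)
print(b_ols)   # typically far from the "true" coefficients (1, 0)
```

Note that the fitted values $\hat{\mathbf{y}} = \mathbf{X}\mathbf{b}$ remain well-behaved; it is the individual coefficients that become unreliable.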
Principal component regression
In PCR, a principal component analysis (PCA) is first done:
$$\mathbf{X} = \mathbf{T}\mathbf{P}^{\mathrm{T}} + \mathbf{E}, \qquad \mathbf{T} = \mathbf{X}\mathbf{P}$$
The decomposition $\mathbf{X} = \mathbf{T}\mathbf{P}^{\mathrm{T}} + \mathbf{E}$ is inserted into the linear regression model:
$$\mathbf{y} = \mathbf{X}\mathbf{b} + \mathbf{e} = \mathbf{T}\mathbf{P}^{\mathrm{T}}\mathbf{b} + \mathbf{E}\mathbf{b} + \mathbf{e} = \mathbf{T}\mathbf{g} + \mathbf{e}_{\mathrm{PCR}}$$
where $\mathbf{g} = \mathbf{P}^{\mathrm{T}}\mathbf{b}$ and $\mathbf{e}_{\mathrm{PCR}} = \mathbf{E}\mathbf{b} + \mathbf{e}$. Minimization of $\mathbf{e}_{\mathrm{PCR}}^{\mathrm{T}}\mathbf{e}_{\mathrm{PCR}}$ by OLS gives
$$\mathbf{g} = (\mathbf{T}^{\mathrm{T}}\mathbf{T})^{-1}\mathbf{T}^{\mathrm{T}}\mathbf{y}$$
Here the inverse $(\mathbf{T}^{\mathrm{T}}\mathbf{T})^{-1}$ is well-conditioned, and it always exists, because
– the columns of $\mathbf{T}$ are orthogonal
– the number of principal components is never larger than the number of observations
In terms of $\mathbf{b}$, the solution for the linear regression model can be expressed as
$$\mathbf{b}_{\mathrm{PCR}} = \mathbf{P}\mathbf{g} = \mathbf{P}(\mathbf{T}^{\mathrm{T}}\mathbf{T})^{-1}\mathbf{T}^{\mathrm{T}}\mathbf{y}$$
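The PCR formulas above translate almost line by line into NumPy (an illustrative sketch with ad-hoc variable names). With all $p$ components retained, $\mathbf{b}_{\mathrm{PCR}}$ reduces to the OLS solution, which the final check confirms:

```python
import numpy as np

def pcr_fit(X, y, k):
    """PCR: PCA on mean-centred X, then OLS on the first k score vectors."""
    # PCA loadings P from the eigendecomposition of X'X
    eigval, eigvec = np.linalg.eigh(X.T @ X)
    order = np.argsort(eigval)[::-1]        # largest variance first
    P = eigvec[:, order[:k]]                # p x k loading matrix
    T = X @ P                               # n x k score matrix
    # g = (T'T)^{-1} T'y : well-conditioned, since T has orthogonal columns
    g = np.linalg.solve(T.T @ T, T.T @ y)
    return P @ g                            # b_PCR = P g

rng = np.random.default_rng(1)
X = rng.standard_normal((50, 6))
X -= X.mean(axis=0)
b_true = np.array([1.0, -2.0, 0.5, 0.0, 0.0, 0.0])
y = X @ b_true
y -= y.mean()

b_pcr = pcr_fit(X, y, k=6)     # with all 6 PCs, PCR reduces to OLS
b_ols = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.allclose(b_pcr, b_ols))
```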
9.1.3 Selecting principal components
The problem
Overfitting of a regression model is strongly related to the collinearity problem. For PCR this means that
– PCR is less susceptible to overfitting than MLR, because it directly addresses the collinearity problem
– a PCR model could nevertheless become overfitted through the retention of too many principal components
Therefore, an important part of PCR is the determination of the optimal number of PCs to retain in the model.
Another problem is that
– PCA determines and ranks PCs to explain as much as possible of the variance of the regressor variables $x_j$
– in PCR, we want PCs that give the best possible prediction of the dependent variable $y$; such PCs have a high correlation with $y$
Therefore, some variable selection technique has to be applied also in PCR; it is not necessarily best to choose the highest ranked PCs according to the PCA.
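A small synthetic NumPy experiment (not the SFCM data; the low-variance direction is deliberately constructed to drive $y$) illustrates that ranking PCs by explained $\mathbf{X}$-variance can bury the one PC that actually predicts $y$:

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.standard_normal((80, 6))
X -= X.mean(axis=0)
X[:, 4] *= 0.3                                  # one variable: small variance ...
y = X[:, 4] + 0.05 * rng.standard_normal(80)    # ... but a strong link to y
y -= y.mean()

# PCA scores, ranked by explained X-variance (largest first)
eigval, eigvec = np.linalg.eigh(X.T @ X)
order = np.argsort(eigval)[::-1]
T = X @ eigvec[:, order]

# Correlation of each PC with y: the top-variance PC is not the winner here
corr = np.array([abs(np.corrcoef(T[:, j], y)[0, 1]) for j in range(6)])
print(corr.round(2))
```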
Cross-validation
As for PCA and MLR, cross-validation is an important tool for variable selection.
This means that the data have to be split observation-wise into
– a modelling (or training) data set
– a test data set
The prediction residual error on the test-set observations is then determined as a function of the number of PCs retained in the PCR model.
This procedure is usually repeated several times using different selections of observation subsets for training and test sets, such that each sample in the original data set is part of a test set at least once.
– A good rule of thumb is to use the square root of the number of observations for the number of repetitions of the cross-validation procedure, up to a maximum of ten repetitions.
A plot of the total "composite" prediction error over all test sets as a function of the number of PCs retained in the model is used to determine the optimal number of PCs.
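The procedure can be sketched in NumPy (a simplified illustration with contiguous test blocks and a toy data set; the PLS-toolbox automates all of this, and variable and function names here are ad hoc):

```python
import numpy as np

def fit_pcr(Xtr, ytr, k):
    """PCR with k components on already mean-centred training data."""
    eigval, eigvec = np.linalg.eigh(Xtr.T @ Xtr)
    P = eigvec[:, np.argsort(eigval)[::-1][:k]]
    T = Xtr @ P
    g = np.linalg.solve(T.T @ T, T.T @ ytr)
    return P @ g

def rmsecv(X, y, max_k, n_splits=5):
    """RMSECV for 1..max_k PCs using contiguous-block test sets."""
    n = len(y)
    blocks = np.array_split(np.arange(n), n_splits)
    press = np.zeros(max_k)          # prediction residual sum of squares
    for test_idx in blocks:
        train_idx = np.setdiff1d(np.arange(n), test_idx)
        # centre the test set with the training-set means
        xm, ym = X[train_idx].mean(axis=0), y[train_idx].mean()
        Xtr, ytr = X[train_idx] - xm, y[train_idx] - ym
        Xte, yte = X[test_idx] - xm, y[test_idx] - ym
        for k in range(1, max_k + 1):
            b = fit_pcr(Xtr, ytr, k)
            press[k - 1] += np.sum((yte - Xte @ b) ** 2)
    return np.sqrt(press / n)

rng = np.random.default_rng(2)
X = rng.standard_normal((60, 5))
y = X[:, 0] - 0.5 * X[:, 1] + 0.1 * rng.standard_normal(60)
err = rmsecv(X, y, max_k=5)
print(err)
```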
9.1.4 PCR application using the PLS-toolbox
We shall apply PCR to a Slurry-Fed Ceramic Melter (SFCM) system (Wise et al., 1991), where nuclear waste from fuel reprocessing is combined with glass-forming materials. Data from the process, consisting of temperatures in 20 locations within the melter and the molten glass level, are shown in the figure.
It is apparent that there is a great deal of correlation in the data. Many of the variables appear to follow a sawtooth pattern.
We shall develop a PCR model that will enable estimation of the level of molten glass using temperature measurements.
Starting and loading data
The SFCM temperature and molten glass level data are stored in the file plsdata.mat. The file contains 300 calibration or “training” samples (xblock1 and yblock1) and 200 test samples (xblock2 and yblock2). We will load the data into MATLAB and delete a few samples that are known to be outliers.
Preprocessing of data
Now that the data are loaded, we need to decide how to preprocess the data for modelling.
Because the predictor (temperature) variables with the greatest variance in this data set also appear to be correlated to the molten glass level, we choose to mean-centre (rather than autoscale) the data.
The scaling for Y is irrelevant if there is only one Y variable, as in this case.
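For concreteness, the two preprocessing options look as follows in NumPy (a standalone sketch; the course itself uses the PLS-toolbox preprocessing menu):

```python
import numpy as np

def mean_center(X):
    """Subtract the column means; variables keep their original variance."""
    return X - X.mean(axis=0)

def autoscale(X):
    """Mean-centre, then divide by the column standard deviations,
    so that all variables get unit variance and hence equal weight."""
    Xc = X - X.mean(axis=0)
    return Xc / Xc.std(axis=0, ddof=1)

rng = np.random.default_rng(0)
X = rng.standard_normal((300, 20)) * np.arange(1, 21)   # unequal variances
Xc = mean_center(X)
Xa = autoscale(X)
print(np.abs(Xc.mean(axis=0)).max())        # ~0 after centring
print(Xa.std(axis=0, ddof=1))               # all ~1 after autoscaling
```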
A preliminary model
Cross-validation
We must now decide how to cross-validate the model.
We will choose to split the data into ten contiguous block-wise subsets, and to calculate all twenty PCs.
A new model
Choice of principal components
Now that the PCR model and the cross-validation results have been computed, one can view the cross-validation results in various ways. A common plot used to analyse cross-validation results is an RMSECV plot (root mean squared error of cross-validation).
Note how the RMSECV has several local minima and a global minimum at eleven PCs. Two rules of thumb:
– do not include a PC unless it improves the RMSECV by at least 2 %
– use a model with the lowest possible complexity among close alternatives
Here the rules suggest that a model with six PCs would be the best choice.
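One way to make the 2 % rule precise in code (an illustrative formalization; the PLS-toolbox may apply the rule differently) is to walk through the RMSECV values and only accept a larger model when it beats the best model so far by at least 2 %:

```python
import numpy as np

def pick_n_pcs(rmsecv, rel_improvement=0.02):
    """Return the number of PCs after which no later PC improves
    the best RMSECV so far by at least `rel_improvement` (2 %)."""
    k, best = 1, rmsecv[0]
    for i in range(1, len(rmsecv)):
        if rmsecv[i] < (1 - rel_improvement) * best:
            k, best = i + 1, rmsecv[i]
    return k

# Hypothetical RMSECV curve: the global minimum is at 5 PCs, but the
# improvement from 4 to 5 PCs is below 2 %, so 4 PCs are chosen
rmsecv = np.array([1.0, 0.7, 0.69, 0.50, 0.495])
print(pick_n_pcs(rmsecv))  # → 4
```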
Model suggested by CV
Choose the desired # of PCs by clicking on the corresponding line.
Returning to the RMSECV plot for the final PCR model, we note that some PCs in the final model (specifically, PCs 2, 4 and 5) result in an
increase in the model’s estimated prediction error; this suggests that these specific PCs, although they help explain variation
in the X variables (temperatures), are not useful for prediction of the molten glass level.
Saving the model
The model can be exported
– to the Matlab workspace: the name can be changed; the model will be lost after the session unless saved by the Matlab save command
– to disk
9.2 Partial least-squares regression (PLS)
9.2.1 Overview
PLS stands for Projection to Latent Structures by means of Partial Least Squares and is a method to relate a matrix $\mathbf{X}$ to a vector $\mathbf{y}$ or to a matrix $\mathbf{Y}$.
Essentially, the model structures of PLS and PCR are the same:
– The x-data are first transformed into a set of (a few) latent variables (components).
– The latent variables are used for regression (by OLS) with one or several dependent variables.
The regression criterion (most often applied) is maximum covariance between the scores of the dependent variables and the regressor variables (i.e. the latent variables).
Maximum covariance combines
– high variance of the $\mathbf{X}$ scores with
– high correlation with the $\mathbf{Y}$ scores
Relationship with MLR and PCR
PLS is related to both MLR (OLS) and PCR/PCA. To see this, consider the linear regression model
$$\mathbf{y} = \mathbf{X}\mathbf{b} + \mathbf{e}$$
The best prediction/estimation of $\mathbf{y}$ using $\mathbf{X}$ is $\hat{\mathbf{y}} = \mathbf{X}\mathbf{b}$.
MLR by OLS maximizes the correlation between $\mathbf{y}$ and $\hat{\mathbf{y}}$, as seen from
$$\mathbf{b}_{\mathrm{OLS}} = \arg\max_{\mathbf{b}}\, r(\mathbf{y}, \hat{\mathbf{y}}) = \arg\max_{\mathbf{b}} \frac{\mathbf{y}^{\mathrm{T}}\mathbf{X}\mathbf{b}}{\sqrt{\mathbf{b}^{\mathrm{T}}\mathbf{X}^{\mathrm{T}}\mathbf{X}\mathbf{b}\;\mathbf{y}^{\mathrm{T}}\mathbf{y}}} = (\mathbf{X}^{\mathrm{T}}\mathbf{X})^{-1}\mathbf{X}^{\mathrm{T}}\mathbf{y}$$
PCR also maximizes the correlation between $\mathbf{y}$ and $\hat{\mathbf{y}}$, but with the constraint $\mathbf{b} = \mathbf{P}\mathbf{g}$, $\dim(\mathbf{g}) = k \le p = \dim(\mathbf{b})$, where $\mathbf{P}$ is the PCA loading matrix that maximizes the variance of the columns in $\mathbf{T} = \mathbf{X}\mathbf{P}$. This is seen from
$$\mathbf{g}_{\mathrm{PCR}} = \arg\max_{\mathbf{g}}\, r(\mathbf{y}, \mathbf{T}\mathbf{g}) = (\mathbf{T}^{\mathrm{T}}\mathbf{T})^{-1}\mathbf{T}^{\mathrm{T}}\mathbf{y}$$
As a summary of this, we can say that
– MLR (OLS) gives the prediction $\hat{\mathbf{y}} = \mathbf{X}\mathbf{b}_{\mathrm{OLS}} = \mathbf{X}\mathbf{I}\mathbf{b}_{\mathrm{OLS}}$, i.e. the identity matrix $\mathbf{I}$ can be considered as a loading matrix
– PCR/PCA gives the prediction $\hat{\mathbf{y}} = \mathbf{X}\mathbf{P}\mathbf{g}$, where $\mathbf{P}$ is the loading matrix that maximizes the variance of the columns in $\mathbf{T} = \mathbf{X}\mathbf{P}$
The idea in PLS is to determine a loading matrix $\mathbf{W}$ for the prediction $\hat{\mathbf{y}}_{\mathrm{PLS}} = \mathbf{X}\mathbf{W}\mathbf{h}$, where both $\mathbf{W}$ and $\mathbf{h}$ are determined in such a way that they help maximize the correlation between $\mathbf{y}$ and $\hat{\mathbf{y}}$.
This solution is
– different from MLR (OLS) because of the constraint $\mathbf{b} = \mathbf{W}\mathbf{h}$, $\dim(\mathbf{h}) = k \le p = \dim(\mathbf{b})$
– different from PCR/PCA because the loading matrix $\mathbf{W}$ is determined to maximize the correlation between $\mathbf{y}$ and $\hat{\mathbf{y}}$
We have also said that PLS maximizes the covariance between the dependent variable(s) and the scores. When the scores are constrained to have a given constant variance (e.g. = 1), this is equivalent to maximizing the correlation.
9.2.2 One dependent variable
The main purpose of PLS is to determine a linear model for prediction (estimation) of one or several dependent variables from a set of predictor (independent) variables.
The modelling procedure is here outlined for one dependent variable.
The first PLS-component is calculated as the latent variable which has the maximum covariance between the scores $\mathbf{t}_1 = \mathbf{X}\mathbf{w}_1$ and the modelled property $\mathbf{y}$:
$$\mathbf{w}_1 = \arg\max_{\mathbf{w},\, \|\mathbf{w}\| = 1} \mathrm{cov}(\mathbf{y}, \mathbf{X}\mathbf{w})$$
Next, the information (variance) of this component is removed from the $\mathbf{X}$ data. This process is called peeling or deflation. It is a projection of the $\mathbf{X}$ space onto a (hyper-)plane that is orthogonal to the direction of the found component. The resulting matrix after deflation is
$$\mathbf{X}_1 = \mathbf{X} - \mathbf{t}_1\mathbf{t}_1^{\mathrm{T}}\mathbf{X} = (\mathbf{I} - \mathbf{t}_1\mathbf{t}_1^{\mathrm{T}})\mathbf{X}$$
(with normalized scores $\mathbf{t}_1$), which as required has the property $\mathbf{X}_1\mathbf{w}_1 = \mathbf{0}$.
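The first component and the deflation step can be verified numerically (a NumPy sketch with synthetic data; note that $\mathbf{w}_1 \propto \mathbf{X}^{\mathrm{T}}\mathbf{y}$ for the covariance criterion, and that the deflation formula assumes normalized scores):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((40, 5))
X -= X.mean(axis=0)
y = X[:, 0] + 0.5 * X[:, 1]
y -= y.mean()

# w1 maximizes cov(y, Xw) subject to ||w|| = 1  =>  w1 proportional to X'y
w1 = X.T @ y
w1 /= np.linalg.norm(w1)

# Scores, normalized so that the deflation formula applies directly
t1 = X @ w1
t1 /= np.linalg.norm(t1)

# Deflation: X1 = (I - t1 t1') X
X1 = X - np.outer(t1, t1 @ X)

print(np.linalg.norm(X1 @ w1))   # ~0: the required property X1 w1 = 0
```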
The next PLS component is derived from the residual matrix $\mathbf{X}_1$, again with maximum covariance between the scores and $\mathbf{y}$.
This procedure is continued to produce "sufficiently many" PLS-components.
The final choice of the number of components to retain in the model can be made as for PCR (mainly using cross-validation).
Some comments
– In the standard versions of PLS, the scores of the PLS-components are uncorrelated; the loading vectors are in general not orthogonal.
– Because PLS components are developed as latent variables possessing a high correlation with $\mathbf{y}$, the optimum number of PLS-components is usually smaller than the optimum number of PCA-components in PCR.
– However, PLS models may be less "stable" than PCR models because less of the variance of $\mathbf{X}$ is contained.
– The more components are used, the more similar PCR and PLS models become.
A complicating aspect of most PLS algorithms is the stepwise calculation of components.
– After a component is computed, the residual matrices for $\mathbf{X}$ (and $\mathbf{Y}$) are determined.
– The next PLS-component is calculated from the residual matrices, and therefore its parameters (scores, loadings, weights) do not relate to the original matrices.
– However, equations exist that relate the PLS parameters to the original data and that also provide the regression coefficients $\mathbf{b}_{\mathrm{PLS}}$ of the final model for the original data.
9.2.3 Many dependent variables
If there is more than one dependent variable, the dependent data are stored in a matrix $\mathbf{Y}$. The basic regression model is then
$$\mathbf{Y} = \mathbf{X}\mathbf{B} + \mathbf{E}$$
where $\mathbf{B}$ is a matrix of regression coefficients and $\mathbf{E}$ is a matrix of residuals. Here the columns $\mathbf{b}_j$ and $\mathbf{e}_j$ in $\mathbf{B}$ and $\mathbf{E}$ correspond to the column $\mathbf{y}_j$ in $\mathbf{Y}$.
If the dependent variables are considered to be mutually independent (uncorrelated), PLS (or any other regression method) for one dependent variable can be applied to each variable $\mathbf{y}_j$, one at a time.
– MLR using OLS then gives the solution $\mathbf{B} = (\mathbf{X}^{\mathrm{T}}\mathbf{X})^{-1}\mathbf{X}^{\mathrm{T}}\mathbf{Y}$.
If the dependent variables are correlated, it is best to deal with them jointly. PLS, which is called PLS2 when there is more than one dependent variable, is then very suitable.
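A short NumPy check (synthetic data) confirms that the joint OLS solution $\mathbf{B} = (\mathbf{X}^{\mathrm{T}}\mathbf{X})^{-1}\mathbf{X}^{\mathrm{T}}\mathbf{Y}$ is exactly the same as regressing each column $\mathbf{y}_j$ on $\mathbf{X}$ separately:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.standard_normal((30, 4))
X -= X.mean(axis=0)
B_true = rng.standard_normal((4, 2))
Y = X @ B_true
Y -= Y.mean(axis=0)

# Joint OLS solution B = (X'X)^{-1} X'Y ...
B = np.linalg.solve(X.T @ X, X.T @ Y)

# ... equals column-by-column regression of each y_j on X
B_cols = np.column_stack(
    [np.linalg.solve(X.T @ X, X.T @ Y[:, j]) for j in range(Y.shape[1])]
)
print(np.allclose(B, B_cols))
```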
Main idea
In PLS2, both $\mathbf{X}$ and $\mathbf{Y}$ are decomposed into scores and loadings:
– $\mathbf{p}_k$ and $\mathbf{t}_k$, $k = 1, \ldots$, are loadings and scores of latent variables for $\mathbf{X}$
– $\mathbf{q}_k$ and $\mathbf{u}_k$, $k = 1, \ldots$, are loadings and scores of latent variables for $\mathbf{Y}$
in such a way that the covariance between the $\mathbf{X}$ and $\mathbf{Y}$ scores is maximized.
The regression is performed between the $\mathbf{X}$ and $\mathbf{Y}$ scores. The regression coefficients can be transformed to allow direct estimation of $\mathbf{Y}$ from $\mathbf{X}$.
Mathematical development
There are many variants of PLS algorithms. It is e.g. possible to specify that
– loading vectors are orthogonal (the "Eigenvector algorithm")
– scores are uncorrelated (i.e. orthogonal), loadings are non-orthogonal (algorithms such as Kernel, NIPALS, SIMPLS, O-PLS)
Here a PLS2 method producing uncorrelated scores is outlined. $\mathbf{X}$ and $\mathbf{Y}$ are modelled by linear latent variables as
$$\mathbf{X} = \mathbf{T}\mathbf{P}^{\mathrm{T}} + \mathbf{E}_X \quad \text{and} \quad \mathbf{Y} = \mathbf{U}\mathbf{Q}^{\mathrm{T}} + \mathbf{E}_Y$$
Instead of the loadings $\mathbf{p}_k$ and $\mathbf{q}_k$, new loading vectors $\mathbf{w}_k$ and $\mathbf{c}_k$ that satisfy
$$\mathbf{t}_k = \mathbf{X}_{k-1}\mathbf{w}_k \quad \text{and} \quad \mathbf{u}_k = \mathbf{Y}\mathbf{c}_k$$
are introduced. Here $k = 1, \ldots$, $\mathbf{X}_0 = \mathbf{X}$, and $\mathbf{X}_{k-1}$ is an updated (deflated) version of $\mathbf{X}$.
For the first PLS-component, the vectors $\mathbf{w}_1$ and $\mathbf{c}_1$ are determined by solving
$$[\mathbf{w}_1 \ \mathbf{c}_1] = \arg\max_{\mathbf{w}, \mathbf{c}}\, (\mathbf{X}\mathbf{w})^{\mathrm{T}}(\mathbf{Y}\mathbf{c}), \qquad \|\mathbf{X}\mathbf{w}\| = 1, \quad \|\mathbf{Y}\mathbf{c}\| = 1$$
The vectors $\mathbf{w}_1$ and $\mathbf{c}_1$ can be found by solving eigenvalue problems:
– $\mathbf{w}_1$ is the eigenvector corresponding to the largest eigenvalue of $\mathbf{X}^{\mathrm{T}}\mathbf{Y}\mathbf{Y}^{\mathrm{T}}\mathbf{X}$
– $\mathbf{c}_1$ is the eigenvector corresponding to the largest eigenvalue of $\mathbf{Y}^{\mathrm{T}}\mathbf{X}\mathbf{X}^{\mathrm{T}}\mathbf{Y}$
The first-component scores are given by $\mathbf{t}_1 = \mathbf{X}\mathbf{w}_1$ and $\mathbf{u}_1 = \mathbf{Y}\mathbf{c}_1$.
The first-component loadings for $\mathbf{X}$ are given by
$$\mathbf{p}_1 = \mathbf{X}^{\mathrm{T}}\mathbf{t}_1 = \mathbf{X}^{\mathrm{T}}\mathbf{X}\mathbf{w}_1$$
The first-component loadings for $\mathbf{Y}$ are not needed for the regression.
The deflated $\mathbf{X}$ matrix is given by
$$\mathbf{X}_1 = \mathbf{X} - \mathbf{t}_1\mathbf{t}_1^{\mathrm{T}}\mathbf{X} = (\mathbf{I} - \mathbf{t}_1\mathbf{t}_1^{\mathrm{T}})\mathbf{X}$$
The deflated $\mathbf{Y}$ matrix could be calculated similarly, but it is not needed.
The second PLS-components are calculated similarly from $\mathbf{X}_1$ and $\mathbf{Y}$.
The procedure is repeated to obtain "sufficiently many" components.
When all components are known, the regression coefficients are given by
$$\mathbf{B} = \mathbf{W}(\mathbf{P}^{\mathrm{T}}\mathbf{W})^{-1}\mathbf{C}^{\mathrm{T}}, \qquad \mathbf{W} = [\mathbf{w}_1 \cdots \mathbf{w}_k], \quad \mathbf{C} = [\mathbf{c}_1 \cdots \mathbf{c}_k]$$
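The outlined algorithm can be sketched in NumPy. This is an illustrative implementation, not the PLS-toolbox code; in particular, $\mathbf{c}_k$ is computed here as $\mathbf{Y}^{\mathrm{T}}\mathbf{t}_k$ for normalized scores (the regression of $\mathbf{Y}$ on $\mathbf{t}_k$) rather than as a unit-norm eigenvector, which fixes the scaling so that $\mathbf{B} = \mathbf{W}(\mathbf{P}^{\mathrm{T}}\mathbf{W})^{-1}\mathbf{C}^{\mathrm{T}}$ can be used directly:

```python
import numpy as np

def pls2(X, Y, k):
    """PLS2 sketch with uncorrelated scores, following the outline above."""
    Xk = X.copy()
    W, P, C = [], [], []
    for _ in range(k):
        # w: eigenvector for the largest eigenvalue of Xk' Y Y' Xk
        eigval, eigvec = np.linalg.eigh(Xk.T @ Y @ Y.T @ Xk)
        w = eigvec[:, -1]                 # eigh sorts eigenvalues ascending
        t = Xk @ w
        t /= np.linalg.norm(t)            # normalized scores
        p = Xk.T @ t                      # X loadings
        c = Y.T @ t                       # Y coefficients on the scores
        Xk = Xk - np.outer(t, t @ Xk)     # deflation: (I - t t') Xk
        W.append(w); P.append(p); C.append(c)
    W, P, C = (np.column_stack(M) for M in (W, P, C))
    return W @ np.linalg.solve(P.T @ W, C.T)   # B = W (P'W)^{-1} C'

rng = np.random.default_rng(7)
X = rng.standard_normal((30, 4))
X -= X.mean(axis=0)
Y = X @ rng.standard_normal((4, 2))       # Y exactly linear in X
B = pls2(X, Y, k=4)                       # all components: exact fit expected
print(np.linalg.norm(X @ B - Y))
```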
9.2.4 Geometric illustration [ based on UMETRICS material]
Data consist of $n$ observations and
– a set of $p$ independent variables (inputs) in an $n \times p$ matrix $\mathbf{X}$
– a set of $q$ dependent variables (outputs) in an $n \times q$ matrix $\mathbf{Y}$
Each variable has a coordinate axis:
– $p$ coordinates for the $\mathbf{X}$ data
– $q$ coordinates for the $\mathbf{Y}$ data
– illustration for $p = q = 3$
Each observation is represented by
– one point in the $\mathbf{X}$ space
– one point in the $\mathbf{Y}$ space
The mean value of each variable in both data sets is here denoted by a red dot in the two coordinate systems (it is not an observation). Data are here mean-centred.
The first PLS-component is a
– line in the $\mathbf{X}$ space
– line in the $\mathbf{Y}$ space
calculated to
– approximate the points well in $\mathbf{X}$ and $\mathbf{Y}$
– yield a good correlation between the projections $\mathbf{t}_1$ and $\mathbf{u}_1$ (the scores)
The vector directions are $\mathbf{w}_1$ and $\mathbf{c}_1$; the coordinates are $\mathbf{t}_1$ and $\mathbf{u}_1$.
The $\mathbf{t}_1$-$\mathbf{u}_1$ plot shows how well the first PLS-component models the data:
– points on the line are modelled exactly
– points not on the line may be modelled by other PLS-components
The second PLS-component is also represented by lines in the $\mathbf{X}$ and $\mathbf{Y}$ spaces,
– calculated to approximate the points well and provide a good correlation
– in such a way that the $\mathbf{t}$-lines are orthogonal; the $\mathbf{u}$-lines may be orthogonal
These lines, with directions $\mathbf{w}_2$ and $\mathbf{c}_2$ and coordinates $\mathbf{t}_2$ and $\mathbf{u}_2$, improve the approximation and correlation as much as possible.
The second projection coordinates
– usually correlate less well than the first projection coordinates
– may correlate better than the first projection coordinates if there is a strong structure in $\mathbf{X}$ that is not related to (or present in) $\mathbf{Y}$.
The first two PLS-components form planes in the $\mathbf{X}$ and $\mathbf{Y}$ spaces.
The variability around the $\mathbf{X}$-plane can be used to calculate a tolerance interval within which new observations will (should) be located. Observations outside this interval imply that the model may not be valid for this data.
Plotting successive pairs of latent variables ($\mathbf{u}_k$ vs. $\mathbf{t}_k$) against each other will give a good picture of the correlation structure. The plot in the SE corner indicates that there is almost no information left in the $k$:th pair of latent variables.
9.2.5 Evaluation and diagnostics
The PLS result and the data can be analysed and evaluated in many ways.
Detection of outliers
– The techniques for outlier detection used in PCA can also be used in PLS.
– Since PLS uses a $\mathbf{Y}$ block in addition to the $\mathbf{X}$ block, one can also look for outliers in the prediction of $\mathbf{Y}$.
The figure illustrates a way of plotting the prediction error against a score-related parameter ("leverage"):
– the y-axis prediction error is autoscaled; the error unit is standard deviations
– the x-axis "leverage" defines the influence of a given observation on the model; it is proportional to Hotelling's T2
Four of the marked observations are very clear outliers. These outliers should be removed from the model-building data.
Cross-validation
– The standard techniques for selecting the number of latent variables based on cross-validation can be used.
– In addition, one can consider how much of the variation in $\mathbf{Y}$ each LV describes, expressed e.g. as $\hat{\mathbf{u}}^{\mathrm{T}}\hat{\mathbf{u}} / \mathbf{u}^{\mathrm{T}}\mathbf{u}$.
Relationships between observations
Relationships between observations can be studied by various score plots, e.g.
– $\mathbf{u}_1$ vs. $\mathbf{t}_1$, $\mathbf{u}_2$ vs. $\mathbf{t}_2$: a linear relationship with high correlation is desired
– $\mathbf{t}_2$ vs. $\mathbf{t}_1$, $\mathbf{u}_2$ vs. $\mathbf{u}_1$: no correlation is desired
Variable interpretations
There are many ways to analyse the contribution and importance of variables in the PLS model, e.g.
– loadings on LVs (the $\mathbf{p}_k$:s)
– Q-residuals
– Hotelling's T2 statistic
– regression coefficients
– VIP scores ("Variable Importance in Projection")
In the PLS-toolbox, these plots are obtained via "Loadings plots".
9.2.6 PLS application using the PLS-toolbox
We shall apply PLS to the same Slurry-Fed Ceramic Melter (SFCM) system that was used in the PCR application (section 9.1.4).
Except for the analysis startup for PLS, the model-building steps for PLS are exactly the same as for PCR, including cross-validation for the choice of the number of PLS-components.
Cross-validation results
Model-building combined with cross-validation produces the result shown. The variance captured by the model for each number of latent variables (LVs) is shown for
– the $\mathbf{X}$ block (% of the $\mathbf{X}$ variance)
– the $\mathbf{Y}$ block (% of the $\mathbf{Y}$ variance)
The suggested model is (apparently) based on the $\mathbf{X}$ block variance. Note the small $\mathbf{Y}$ block variance for LV4: LV4 does not contain information about $\mathbf{Y}$.
This suggests that 3 LVs would be sufficient for predictive purposes.
Another (better) way to select the number of LVs is to consider the prediction error based on cross-validation, which can be quantified by an RMSECV plot:
– the figure suggests 3 or 4 LVs
– based on the rule of thumb (at least 2 % improvement), 3 LVs are sufficient