
Volume 56, Number 5, 2002 APPLIED SPECTROSCOPY 615
0003-7028/02/5605-0615$2.00/0 © 2002 Society for Applied Spectroscopy

Concentration Residual Augmented Classical Least Squares (CRACLS): A Multivariate Calibration Method with Advantages over Partial Least Squares

DAVID K. MELGAARD,* DAVID M. HAALAND, and CHRISTINE M. WEHLBURG
Sandia National Laboratories,† Albuquerque, New Mexico 87185-0889

A significant extension to the classical least-squares (CLS) algorithm called concentration residual augmented CLS (CRACLS) has been developed. Previously, unmodeled sources of spectral variation have rendered CLS models ineffective for most types of problems, but with the new CRACLS algorithm, CLS-type models can be applied to a significantly wider range of applications. This new quantitative multivariate spectral analysis algorithm iteratively augments the calibration matrix of reference concentrations with concentration residuals estimated during CLS prediction. Because these residuals represent linear combinations of the unmodeled spectrally active component concentrations, the effects of these components are removed from the calibration of the analytes of interest. This iterative process allows the development of a CLS-type calibration model comparable in prediction ability to implicit multivariate calibration methods such as partial least squares (PLS) even when unmodeled spectrally active components are present in the calibration sample spectra. In addition, CRACLS retains the improved qualitative spectral information of the CLS algorithm relative to PLS. More importantly, CRACLS provides a model compatible with the recently presented prediction-augmented CLS (PACLS) method. The CRACLS/PACLS combination generates an adaptable model that can achieve excellent prediction ability for samples of unknown composition that contain unmodeled sources of spectral variation. The CRACLS algorithm is demonstrated with both simulated and real data derived from a system of dilute aqueous solutions containing glucose, ethanol, and urea. The simulated data demonstrate the effectiveness of the new algorithm and help elucidate the principles behind the method. Using experimental data, we compare the prediction abilities of CRACLS and PLS during cross-validated calibration. In combination with PACLS, the CRACLS predictions are comparable to PLS for the prediction of the glucose, ethanol, and urea components for validation samples collected when significant instrument drift was present. However, the PLS predictions required recalibration using nonstandard cross-validated rotations while CRACLS/PACLS was rapidly updated during prediction without the need for time-consuming cross-validated recalibration. The CRACLS/PACLS algorithm provides a more general approach to removing the detrimental effects of unmodeled components.

Index Headings: Near-infrared spectra; Multivariate calibration; Classical least squares; CLS; Augmented classical least squares; ACLS; Concentration residual augmented classical least squares; Partial least squares; PLS; Prediction-augmented classical least squares.

INTRODUCTION

Received 22 August 2001; accepted 31 December 2001.
* Author to whom correspondence should be sent.
† Sandia is a multi-program laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy under Contract DE-AC04-94AL85000.

Over the past 20 years, quantitative multivariate spectral analysis has primarily shifted from the explicit classical least-squares (CLS)1,2 algorithm to the implicit principal component regression (PCR)3 and partial least-squares (PLS)4–6 methods. The principal motivation for this shift is that CLS is based on an explicit linear additive model, e.g., the Beer–Lambert law. As such, CLS has the significant limitation that it requires the concentrations of all spectrally active constituents to be known and included in the calibration before an adequate prediction model can be developed. On the other hand, the PCR and PLS algorithms can achieve excellent predictions for data sets where not all of the constituents have been determined. Consequently, CLS has been relegated to solving a small set of well-defined linear problems with known constituents that adhere to the Beer–Lambert law, e.g., infrared spectra of gas-phase samples.7 However, PCR and PLS do not have the qualitative capabilities of CLS because they do not generate explicit estimated pure-component spectra using all the available concentration information. Also, they are not well suited to take advantage of the newly developed prediction-augmented CLS (PACLS) technique.8 PACLS adds spectral variations (i.e., spectral intensity information) to the CLS calibration model during prediction to account for spectrally active components or other spectral effects present in the prediction samples that were not modeled during calibration. PACLS has proven to be effective at updating the prediction model by removing the deleterious effects of unmodeled constituents, accounting for spectrometer drift, and allowing the transfer of calibrations from one spectrometer to another.8 However, due to the inverse nature of PCR and PLS, the PACLS method cannot be directly applied to either, so its benefits are not readily available to them.

To overcome these deficiencies, we had previously developed a new hybrid multivariate calibration algorithm that combined the best features of both PACLS and PLS.9 This PACLS/PLS hybrid algorithm was shown to provide prediction ability better than either algorithm separately applied to the quantitative analysis of a multicomponent system of dilute aqueous solutions. However, to achieve the best predictions when the prediction data contained new sources of spectral variation, recalibration of the hybrid method was often required.9 In order to use the PACLS algorithm to update the model during prediction without recalibration, we have developed a new augmented CLS method that we have labeled concentration residual augmented classical least-squares (CRACLS). The CRACLS algorithm is based on CLS so it retains the qualitative benefits of CLS, yet it has the flexibility of PLS and the hybrid algorithm in that it can define a comparable model even when spectrally active components are not explicitly included in the calibration. These components may be unknown components, known components with unknown concentrations, or other sources of spectral variation that are present in the calibration spectra. Also, since CRACLS is based on CLS, it can incorporate the PACLS feature of updating the prediction model for new sources of spectral variation without recalibrating. In this paper, we discuss the CRACLS algorithm and provide examples of its application to both simulated and real spectral data sets. We will demonstrate that the combination of CRACLS with PACLS provides prediction ability comparable to that of PLS, but in a faster and simpler manner. Our examples in this paper involve the use of continuous spectra; however, CRACLS may also use any set of discontinuous spectral intensities that are selected in the calibration for the least-squares analysis.

THEORY

A number of forms of the CLS calibration and prediction algorithms have been published.2,10–15 The notation in the equations below uses upper-case bold letters for matrices, lower-case bold letters for vectors, and italicized letters for scalars. We use the ^ to indicate estimated values, T to denote a transposed matrix, −1 for matrix inversion, and + for the pseudoinverse of a matrix. The use of the pseudoinverse can result in improved numerical precision from a variety of methods, e.g., singular value decomposition.16 The basic CLS model is

$$\mathbf{A} = \mathbf{C}\mathbf{K} + \mathbf{E}_A \tag{1}$$

where $\mathbf{A}$ is the $n \times p$ matrix of the absorbance spectra from the $n$ samples at the $p$ frequencies, $\mathbf{C}$ is the $n \times m$ reference concentration matrix containing $m$ components, $\mathbf{K}$ is the $m \times p$ matrix of in situ pure-component spectra scaled to unit concentration and pathlength, and $\mathbf{E}_A$ is the $n \times p$ matrix of spectral noise (and model errors if the model is not linear or if $\mathbf{K}$ does not include all the pure-component spectra or other sources of spectral variation).

The pure-component spectra represent the best overall estimate of the spectral contributions of each chemical constituent or physical change in the sample spectra given the range of variation of all the components in the calibration sample set. Examples of sources of spectral variation that must be modeled include the effect of the individual chemical species, chemical interactions, or any changes caused by a wide variety of physical parameters that can induce spectral variations, such as temperature variations, spectrometer drift, humidity changes, and sample insertions. Note that the physical property of temperature can have a considerable quantifiable effect on the sample spectra, as can be seen from samples containing an aqueous solvent.9 In this paper, the term component applies to any source of spectral variation, be it chemical, physical, or otherwise.

Many preprocessing procedures such as centering, scaling, path-length correction, smoothing, derivatives, multiplicative signal correction, and covariance filtering have been developed that can be performed on the spectra when using this method.17,18 However, we will exclude further discussion of these preprocessing methods from this section, as they do not influence the theoretical discussion. In the CLS analyses of the data presented in this paper, we have generally not mean-centered the data, in order to directly generate pure-component spectra that represent the linear least-squares estimates of the pure-component spectra of calibration mixtures.

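The calibration step requires determining the least-squares solution for $\mathbf{K}$, i.e., $\hat{\mathbf{K}}$. This solution is given by:

$$\hat{\mathbf{K}} = (\mathbf{C}^T\mathbf{C})^{-1}\mathbf{C}^T\mathbf{A} = \mathbf{C}^{+}\mathbf{A} \tag{2}$$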

$\hat{\mathbf{K}}$ will provide accurate estimates of the pure-component spectra only if the system is linear and $\mathbf{C}$ contains all the chemical and non-chemical components that contribute to spectral variation in the sample spectra. The accuracy of the pure-component spectral estimates improves with each additional known component concentration vector added to the CLS model. In the more general case, each estimated pure-component spectrum will consist of the pure-component spectrum at unit concentration for the corresponding known component in $\mathbf{C}$ plus linear combinations of the unmodeled pure-component spectra, i.e.,

$$\hat{\mathbf{k}}_j = \mathbf{k}_j + \sum_{i=1}^{m_u} b_i \mathbf{k}_i^u \tag{3}$$

where $\hat{\mathbf{k}}_j$ represents the estimated pure-component spectrum for the $j$th component and $b_i$ are the linear scale factors for the $m_u$ unmodeled pure-component spectra $\mathbf{k}_i^u$. These contaminated estimated pure-component spectra are then used during CLS prediction to approximate the original reference concentrations as follows:

$$\hat{\mathbf{C}} = \mathbf{A}\hat{\mathbf{K}}^T(\hat{\mathbf{K}}\hat{\mathbf{K}}^T)^{-1} = \mathbf{A}(\hat{\mathbf{K}}^T)^{+} \tag{4}$$

The $n \times m$ concentration residuals matrix, $\mathbf{E}_C$, then is given by:

$$\mathbf{E}_C = \hat{\mathbf{C}} - \mathbf{C} \tag{5}$$
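The calibration, prediction, and residual steps of Eqs. 2, 4, and 5 can be sketched numerically as follows. This is a minimal illustration under stated assumptions, not the software used in this work: the Gaussian bands standing in for pure-component spectra, the random concentrations, and the function names `cls_calibrate` and `cls_predict` are all hypothetical, and the data are noise-free.

```python
import numpy as np

def cls_calibrate(C, A):
    """Eq. 2: least-squares estimate of the pure-component spectra, K_hat = C^+ A."""
    return np.linalg.pinv(C) @ A

def cls_predict(A, K_hat):
    """Eq. 4: C_hat = A K_hat^T (K_hat K_hat^T)^-1, computed here as A @ pinv(K_hat)."""
    return A @ np.linalg.pinv(K_hat)

# Hypothetical noise-free data: 3 components, 20 samples, 50 spectral channels.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 50)
K_true = np.stack([np.exp(-((x - c) / 0.2) ** 2) for c in (0.25, 0.5, 0.75)])
C_all = rng.uniform(0.0, 1.0, size=(20, 3))
A = C_all @ K_true

# Fully specified model: every component is in C, so the residuals vanish.
K_full = cls_calibrate(C_all, A)
E_full = cls_predict(A, K_full) - C_all      # Eq. 5

# Under-specified model: the third component is left out of C.
C_known = C_all[:, :2]
K_part = cls_calibrate(C_known, A)
E_part = cls_predict(A, K_part) - C_known    # Eq. 5, now clearly non-zero
```

In this toy case the non-zero columns of `E_part` are the quantities that CRACLS later reuses as additional concentration columns.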

To account for baseline variations, $\hat{\mathbf{K}}$ can be augmented with explicit baseline functions, i.e., vectors representing the potential variations in baselines in the spectra.2,8,19 These baseline functions may include an offset, general polynomials, orthogonal Legendre polynomials, or any expected functional form of the baselines. A row is added to $\hat{\mathbf{K}}$ for each additional order in the baseline function added during CLS prediction. Using the PACLS algorithm, spectral information for other components can be added to $\hat{\mathbf{K}}$ as well. A corresponding column must be added to $\hat{\mathbf{C}}$ for each row added to $\hat{\mathbf{K}}$ before solving Eq. 4. If rows have been added to $\hat{\mathbf{K}}$, then $\mathbf{E}_C$ is computed using only the estimated concentrations corresponding to the original reference values in $\mathbf{C}$.
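As an illustration of baseline augmentation, the sketch below appends Legendre polynomial rows to an estimated pure-component matrix before prediction and then discards the extra concentration columns. The helper name `augment_with_baselines`, the placeholder Gaussian spectra, and the random offset and slope amplitudes are assumptions made for this example only.

```python
import numpy as np
from numpy.polynomial import legendre

def augment_with_baselines(K_hat, degree):
    """Append Legendre baseline rows (orders 0..degree) to K_hat."""
    p = K_hat.shape[1]
    x = np.linspace(-1.0, 1.0, p)                 # Legendre polynomials are defined on [-1, 1]
    baselines = legendre.legvander(x, degree).T   # shape (degree + 1, p)
    return np.vstack([K_hat, baselines])

rng = np.random.default_rng(1)
p = 80
x = np.linspace(-1.0, 1.0, p)
# Placeholder pure-component spectra (hypothetical Gaussian bands).
K_hat = np.stack([np.exp(-((x - c) / 0.15) ** 2) for c in (-0.4, 0.3)])
C_true = rng.uniform(0.0, 1.0, size=(10, 2))

# Prediction spectra carrying a random offset and slope in addition to the analytes.
offset = rng.normal(0.0, 0.05, size=(10, 1))
slope = rng.normal(0.0, 0.05, size=(10, 1))
A = C_true @ K_hat + offset + slope * x

C_plain = A @ np.linalg.pinv(K_hat)                        # baseline leaks into the estimates
K_aug = augment_with_baselines(K_hat, degree=1)
C_aug = (A @ np.linalg.pinv(K_aug))[:, : K_hat.shape[0]]   # keep only the analyte columns
```

With noise-free data and a baseline that lies entirely in the span of the added rows, `C_aug` recovers `C_true`, whereas `C_plain` does not.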

Now if we consider the case where there are unmodeled sources of spectral interference, Eq. 1 can be rewritten:

$$\mathbf{A} = \mathbf{C}\mathbf{K} + \mathbf{C}_u\mathbf{K}_u + \mathbf{E} \tag{6}$$

where $\mathbf{C}_u$ is an $n \times m_u$ matrix that represents the unknown concentrations for each of the $m_u$ unmodeled pure-component spectra in $\mathbf{K}_u$, and $\mathbf{E}$ represents the error remaining after removing $m_u$ unmodeled linear spectral effects. Note that even if $\mathbf{E}_A$ in Eq. 1 is the result of nonlinear factors, the nonlinearities can be estimated by a linear approximation in $\mathbf{C}_u\mathbf{K}_u$, so even in those cases, $\mathbf{E}$ can be considerably smaller than $\mathbf{E}_A$. Following the discussion by Martens and Naes18 on the extended mixture model, the matrix $\mathbf{K}_u$ can be decomposed into the sum of two terms. One part of $\mathbf{K}_u$ can be written as $P(\mathbf{K})\mathbf{K}_u = \mathbf{D}\mathbf{K}$, where $P(\mathbf{K})$ means the projection onto the space spanned by $\mathbf{K}$, and $\mathbf{D}$ is dimensioned $m_u \times m$, where $m_u$ is the number of rows in $\mathbf{K}_u$ and $m$ is the number of rows in $\mathbf{K}$. The other part of the decomposition of $\mathbf{K}_u$, denoted by $\mathbf{G}$, is orthogonal to $\mathbf{K}$. Now we can write:

$$\mathbf{K}_u = \mathbf{D}\mathbf{K} + \mathbf{G} \tag{7}$$

FIG. 1. CRACLS calibration diagram.

FIG. 2. CRACLS prediction diagram.

Substituting Eq. 7 into Eq. 6 and gathering terms yields:

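$$\mathbf{A} = (\mathbf{C} + \mathbf{C}_u\mathbf{D})\mathbf{K} + \mathbf{C}_u\mathbf{G} + \mathbf{E} \tag{8}$$

The predicted concentration values in Eq. 4 are then approximately equal to $\mathbf{C} + \mathbf{C}_u\mathbf{D}$. Consequently, from Eq. 5, we have

$$\mathbf{E}_C \approx \mathbf{C}_u\mathbf{D} \tag{9}$$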

Although $\mathbf{D}$ is not known, Eq. 9 still shows that each column, $\mathbf{e}_c$, of $\mathbf{E}_C$ approximates a linear combination of the unknown concentrations unless $\mathbf{K}_u$ is orthogonal to $\mathbf{K}$, i.e., $\mathbf{D} = \mathbf{0}$, where $\mathbf{0}$ is the null matrix. However, if $\mathbf{D} = \mathbf{0}$, the unknowns will not contaminate the estimated known concentrations, so they can be ignored without affecting the predicted concentrations. In practice, $\mathbf{D} = \mathbf{0}$ will generally occur only if the spectral components do not overlap in the spectral region being analyzed. For calibration of the known components, the magnitudes for unit concentration of the pure-component spectra for the unmodeled components are not required, since only the relative concentration values are needed to generate the correct net-analyte signals (NAS)20 for each of the known components. By including linear combinations of the unmodeled component concentrations as additional columns in $\mathbf{C}$ and solving Eq. 2, the resulting additional pure-component spectra in $\hat{\mathbf{K}}$ will be linear combinations of the spectra of the unmodeled pure components, $\mathbf{K}_u$. When enough linear combinations are added to cover all sources of spectral variation beyond the noise, the NAS for each of the known components in $\hat{\mathbf{K}}$ will be correct (see Ref. 8 for a general proof of this statement for the case when error is absent).
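The decomposition in Eqs. 7 and 8 can be checked numerically. The sketch below is only an illustration: the Gaussian bands, the random concentrations, and the helper `band` are hypothetical, $\mathbf{E}$ is set to zero, and the projection coefficients are computed as $\mathbf{D} = \mathbf{K}_u\mathbf{K}^{+}$, which is one way to realize $P(\mathbf{K})\mathbf{K}_u = \mathbf{D}\mathbf{K}$. The last line uses the true $\mathbf{K}$ rather than the estimated $\hat{\mathbf{K}}$, which is why the relation of Eq. 9 holds exactly here and only approximately in practice.

```python
import numpy as np

def band(x, center, width=0.2):
    """Hypothetical Gaussian band used as a stand-in pure-component spectrum."""
    return np.exp(-((x - center) / width) ** 2)

rng = np.random.default_rng(2)
x = np.linspace(0.0, 1.0, 60)
K = np.stack([band(x, 0.2), band(x, 0.4)])      # modeled spectra (m x p)
K_u = np.stack([band(x, 0.6), band(x, 0.8)])    # unmodeled spectra (m_u x p)
C = rng.uniform(0.0, 1.0, size=(15, 2))
C_u = rng.uniform(0.0, 1.0, size=(15, 2))

# Eq. 7: project K_u onto the row space of K; G is the remainder.
D = K_u @ np.linalg.pinv(K)                     # m_u x m
G = K_u - D @ K                                 # orthogonal to every row of K

A = C @ K + C_u @ K_u                           # Eq. 6 with E = 0
A_eq8 = (C + C_u @ D) @ K + C_u @ G             # Eq. 8 with E = 0

# Prediction with the true K: the residual equals C_u D exactly in this idealized case.
E_C = A @ np.linalg.pinv(K) - C
```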

We take one vector of the concentration residuals, $\mathbf{e}_c$, and augment the original $\mathbf{C}$ matrix with the $\mathbf{e}_c$ vector as a new column to create the augmented $n \times (m + 1)$ matrix, $\tilde{\mathbf{C}}$. If there is more than one component, only one of the concentration residual vectors is used in the augmentation, since the concentration errors for the other components contain redundant information. In the Results and Discussion section, criteria for selecting which concentration residual to use will be discussed in more detail. Using $\tilde{\mathbf{C}}$, the augmented $(m + 1) \times p$ pure-component matrix $\hat{\tilde{\mathbf{K}}}$ is computed by the equation:

$$\hat{\tilde{\mathbf{K}}} = (\tilde{\mathbf{C}}^T\tilde{\mathbf{C}})^{-1}\tilde{\mathbf{C}}^T\mathbf{A} = \tilde{\mathbf{C}}^{+}\mathbf{A} \tag{10}$$

This step is illustrated in Fig. 1, where the concentration residuals used to augment the concentration matrix are given as the $e_i$'s, and the additional estimated pure-component spectrum in Eq. 10 is represented by the dotted line.

The new $\hat{\tilde{\mathbf{K}}}$ is then used in the prediction step:

$$\hat{\tilde{\mathbf{C}}} = \mathbf{A}\hat{\tilde{\mathbf{K}}}^T(\hat{\tilde{\mathbf{K}}}\hat{\tilde{\mathbf{K}}}^T)^{-1} = \mathbf{A}(\hat{\tilde{\mathbf{K}}}^T)^{+} \tag{11}$$

shown in Fig. 2, where the $\hat{c}_{e_i}$'s represent the estimated concentrations for the augmented pure-component spectrum (i.e., the dotted line). Since the additional spectrum is a linear combination of unspecified magnitude of the unmodeled pure-component spectra, the $\hat{c}_{e_i}$'s in $\hat{\tilde{\mathbf{C}}}$ usually will not provide any useful quantitative information. The new $n \times m$ concentration residuals matrix, $\mathbf{E}_C'$, then is given by:

$$\mathbf{E}_C' = \hat{\mathbf{C}} - \mathbf{C} \tag{12}$$

where $\hat{\mathbf{C}}$ consists only of the estimated concentrations in $\hat{\tilde{\mathbf{C}}}$ corresponding to the known concentrations in $\mathbf{C}$.

The steps delineated by Eqs. 5, 10, and 11 remove one linear combination of the unmodeled spectral variations from the calibrations. These steps can be repeated using one column from the new concentration residuals matrix, $\mathbf{E}_C'$, to augment the concentration matrix. To mitigate all the additional sources of spectral variation (i.e., reduce the error from $\mathbf{E}_A$ in Eq. 1 to $\mathbf{E}$ in Eq. 8), these steps must be repeated for each of the independent sources of spectral variation present in the calibration data. In other words, the augmented concentration residual vectors must span the space defined by the unmodeled spectrally active components. The iterative process of generating residual concentration vectors, augmenting the $\mathbf{C}$ matrix, and estimating a new set of augmented pure-component spectra constitutes the CRACLS algorithm.
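The iterative procedure can be sketched in a few lines. The following is a minimal illustration of the steps described above, not the authors' implementation: the function names, the choice of always augmenting with the first residual column (`resid_col=0`), and the noise-free four-band test system are all assumptions made for the example.

```python
import numpy as np

def cracls_calibrate(C, A, n_aug, resid_col=0):
    """Sketch of CRACLS calibration: augment C with one residual column per pass."""
    m = C.shape[1]
    C_aug = C.copy()
    for _ in range(n_aug):
        K_aug = np.linalg.pinv(C_aug) @ A             # Eq. 2 (first pass) or Eq. 10
        C_hat = (A @ np.linalg.pinv(K_aug))[:, :m]    # Eq. 4 or Eq. 11, known columns only
        E_C = C_hat - C                               # Eq. 5 or Eq. 12
        C_aug = np.column_stack([C_aug, E_C[:, resid_col]])
    return np.linalg.pinv(C_aug) @ A                  # rows: m known + n_aug augmented

def cracls_predict(A, K_aug, m):
    """Predict with the augmented model and keep only the m known components."""
    return (A @ np.linalg.pinv(K_aug))[:, :m]

# Hypothetical noise-free system: 4 overlapping bands, only the first 2 concentrations known.
rng = np.random.default_rng(3)
x = np.linspace(0.0, 1.0, 60)
K_true = np.stack([np.exp(-((x - c) / 0.2) ** 2) for c in (0.2, 0.4, 0.6, 0.8)])
C_cal, C_val = rng.uniform(0.0, 1.0, size=(2, 25, 4))
A_cal, A_val = C_cal @ K_true, C_val @ K_true
m = 2

# Plain CLS with only the two known components.
K_cls = np.linalg.pinv(C_cal[:, :m]) @ A_cal
err_cls = np.abs(A_val @ np.linalg.pinv(K_cls) - C_val[:, :m]).max()

# CRACLS with two augmentations, one per unmodeled component.
K_cracls = cracls_calibrate(C_cal[:, :m], A_cal, n_aug=2)
err_cracls = np.abs(cracls_predict(A_val, K_cracls, m) - C_val[:, :m]).max()
```

In this idealized setting, two augmentations span the two unmodeled concentration directions, so `err_cracls` drops to numerical round-off while `err_cls` remains large. With noisy data the number of augmentations would have to be chosen empirically.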

Martens and Naes18,21 suggested another approach for removing the contamination of unmodeled components. Their approach augmented $\hat{\mathbf{K}}$ with the eigenvectors derived from the CLS calibration spectral residuals in the same way that columns are added to account for baseline variation. The difference is that their approach augments the pure-component spectral matrix during the prediction step while CRACLS augments the concentration matrix during the calibration step. In a future paper, we will address the implications of these differences.

Because the resulting CRACLS model is a CLS-type model, it is well suited to take advantage of the newly developed PACLS algorithm. The PACLS algorithm augments the estimated pure-component spectra with spectral information representing unmodeled sources of spectral variation present in the unknown sample spectra to be predicted. Basically, $\hat{\tilde{\mathbf{K}}}$ in Eq. 11 can be further augmented with any spectral intensities that encompass spectrally active components in the prediction data set that were not present in the calibration spectra. The advantage of PACLS is that the model can be updated quickly during prediction to account for new sources of spectral variation. Recomputing the calibration model using the original and the additional spectra would take considerably more time. For a more detailed description of PACLS, we refer the reader to Ref. 8.
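The prediction-time update can be illustrated as follows. This sketch only shows the idea of appending spectral shapes as extra rows before prediction; it is not the PACLS implementation of Ref. 8. The function name `pacls_predict`, the placeholder spectra, and the quadratic drift shape are hypothetical.

```python
import numpy as np

def pacls_predict(A, K_hat, extra_shapes, m):
    """Augment K_hat with extra spectral shapes (rows) and return the first m estimates."""
    K_aug = np.vstack([K_hat, np.atleast_2d(extra_shapes)])
    return (A @ np.linalg.pinv(K_aug))[:, :m]

rng = np.random.default_rng(4)
x = np.linspace(0.0, 1.0, 60)
K_hat = np.stack([np.exp(-((x - c) / 0.2) ** 2) for c in (0.3, 0.7)])  # calibration model
drift = x ** 2                                   # hypothetical drift shape, absent in calibration
C_new = rng.uniform(0.0, 1.0, size=(12, 2))
d = rng.normal(0.0, 0.1, size=(12, 1))           # drift amplitude per prediction sample
A_new = C_new @ K_hat + d * drift

err_plain = np.abs(A_new @ np.linalg.pinv(K_hat) - C_new).max()
err_pacls = np.abs(pacls_predict(A_new, K_hat, drift, m=2) - C_new).max()
```

No recalibration is involved: only the pseudoinverse of the augmented matrix is recomputed, which is why the update is fast.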

The spectra added during prediction need to contain all new sources of spectral variation introduced during the collection of the prediction data that were not included in the calibration model. If these sources of spectral variation can be attributed to a known component, a sample can be doped with the identified component to capture its impact on the spectra. However, often the new sources of spectral variation, such as those caused by instrument drift or sample insertion effects, cannot be easily identified. For those sources of variation, a single stable sample might be used to provide the required spectral information to augment the model during prediction. By repeatedly measuring a stable sample during calibration and prediction spectral measurements, spectral variations due to artifacts not related to concentration, such as instrument drift and sample insertion, can be captured. The differences between repeated spectral measurements obtained during both calibration and prediction or an eigenvector analysis of those differences can provide the spectral information required for the prediction augmentation. Note that, unlike PLS, the spectral variation can be incorporated into the PACLS model without knowing the concentrations of the sample. When using the repeat sample spectra, the assumption is that the spectra can be used to provide a reasonable estimate of the error covariance matrix, i.e., the unmodeled sources of spectral variation affect the repeat sample spectra and the prediction sample spectra in the same way. This assumption is valid if the spectra are similar across all samples, which is the case for the dilute aqueous solutions considered in this paper. However, if the range of concentrations causes large variations in the sample spectra, it may be necessary to use a subset of repeat samples to adequately capture the impact of the unknown sources of variation.
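One plausible way to turn repeat-sample spectra into augmentation shapes is sketched below: the shift of the mean repeat spectrum between calibration and prediction, plus the leading right singular vectors of the mean-centered prediction repeats. How these pieces are combined here is an assumption for illustration, as are the function name `repeat_sample_shapes`, the simulated base spectrum, and the linear drift shape.

```python
import numpy as np

def repeat_sample_shapes(rep_cal, rep_pred, n_vectors=1):
    """Return the mean shift between the two sets plus leading eigenvector shapes."""
    mean_shift = rep_pred.mean(axis=0) - rep_cal.mean(axis=0)
    centered = rep_pred - rep_pred.mean(axis=0)
    # Rows of Vt are the eigenvectors of the repeat-spectra covariance, largest first.
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    return np.vstack([mean_shift, Vt[:n_vectors]])

# Hypothetical repeat-sample spectra: a fixed spectrum plus a drift of varying size.
rng = np.random.default_rng(5)
x = np.linspace(0.0, 1.0, 60)
base = np.exp(-((x - 0.5) / 0.3) ** 2)
drift = 0.01 * (x - 0.5)
rep_cal = base + rng.normal(0.0, 1e-6, size=(8, 60))
t = rng.uniform(0.5, 1.5, size=(8, 1))           # drift amplitude in each prediction repeat
rep_pred = base + t * drift + rng.normal(0.0, 1e-6, size=(8, 60))

shapes = repeat_sample_shapes(rep_cal, rep_pred, n_vectors=1)
```

The rows of `shapes` could then be appended to the estimated pure-component matrix during prediction; no concentration values for the repeat sample are needed.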

EXPERIMENTAL

Experimental Data. The calibration and validation samples and data used in this study have been presented previously.22 The samples consisted of a series of dilute aqueous solutions of glucose, ethanol, and urea each independently varied over the concentration range of 0–500 mg/dL. The samples were prepared by weight and volume in a pseudo D-optimal design23 that allowed each of the three analytes to be varied separately at 9 levels over the concentration range. The aqueous solvent was obtained from a single source of buffered saline solution. A detailed error analysis indicated that the samples were made to an accuracy of better than 1 mg/dL. The calibration data set consisted of 27 samples plus a repeat sample taken from the center of the design, so the repeat sample contained approximately 250 mg/dL each of glucose, ethanol, and urea. The validation samples, the set of samples used for prediction, included the same repeat sample and 27 new samples using the same design as the calibration set. The validation samples spanned the same 0 to 500 mg/dL concentration range for each of the analytes as in the calibration set, but no sample from the prediction set had the same composition as any sample in the calibration set. Five of the prediction samples were removed from consideration when they were determined to be outlier samples contaminated by the epoxy used to seal the lids on the cuvettes.

The samples were sealed with a magnetic stirring bar in 10-mm pathlength cuvettes. An HP89090A Peltier temperature controller propelled the magnetic stirrer while holding the samples at a temperature constant to ±0.05 °C (1σ). The samples were placed in the temperature controller and held in the beam of the spectrometer for 4 min with stirring to allow the sample temperature to equilibrate. The near-infrared spectra of the samples were obtained on a Nicolet Model 750 Fourier transform infrared (FT-IR) spectrometer. The spectrometer employed a 75 W tungsten-halogen lamp, quartz beam splitter, and liquid nitrogen cooled InSb detector. Spectra at 16 cm⁻¹ resolution were obtained after averaging the interferograms over a 2-min period.

The run order of the calibration and validation sample sets was randomized. The spectrum of the repeat sample held at 32 °C was obtained after each group of two calibration or prediction samples. The spectra for the prediction samples were obtained approximately one month after the calibration spectra were obtained. Purge variation was introduced during the prediction data set to induce additional short-term system drift and to accentuate the difficulty in maintaining the calibration. Sample temperature was varied in random order in 2 °C steps over the range of 30 to 34 °C for both the calibration and prediction samples. A background of the empty sample holder was obtained after each sample. Transmittance spectra of the calibration and validation sample sets were obtained by ratioing each single-beam sample spectrum to either its corresponding background or to the average of the background for the day. The spectra were then converted to absorbance. Since the best calibration and prediction results were obtained with the spectra obtained using the averaged daily background, only the results obtained from these spectra are reported.

The data were analyzed in the spectral region from 7500 to 11 000 cm⁻¹ using the CLS, PLS, and CRACLS/PACLS (i.e., the combination of CRACLS during calibration and PACLS during prediction) algorithms incorporated into software developed at Sandia National Laboratories. The software was programmed using the Array Basic programming language from the GRAMS 32 software obtained from Thermo Galactic. The calibration models for each algorithm were used for prediction of the validation sample spectra.

FIG. 3. Four loading vectors (solid lines) estimated from simulated spectra using the CRACLS calibration that includes urea and water concentrations and two concentration residual augmentations. The pure-component spectra (dotted lines) for water, glucose, urea, and ethanol are placed close to the corresponding LV for shape comparisons. The loading vectors have been shifted up or down slightly to display the shape comparisons more clearly.

FIG. 4. Mean-centered simulated spectra showing high-frequency noise.

Simulated Data. To demonstrate the effectiveness of the CRACLS algorithm, we also constructed a simulated data set using the pure-component spectra derived from experimental dilute aqueous data using standard CLS methods. While these spectra do not precisely match the true pure-component spectra primarily because of significant baseline variations resulting from spectrometer drift during the collection of the calibration spectra, they can be used to demonstrate the capabilities of CRACLS. The simulated data were constructed from these pure-component spectra and a series of concentrations to generate two different sets of 25 samples, one for calibration and the other for validation, containing 0–500 mg/dL each of glucose, ethanol, and urea and 98 000 to 100 000 mg/dL of water. The sum of the concentrations of all the components for each sample was constrained to 100 000 mg/dL. In addition, normally distributed random spectral noise was added to the absorbance spectra at a level of 0.3% of the maximum spectral absorbance intensity. The level of noise amounted to approximately 5% of the absorbance caused by the minor components, which is slightly more noise than we observed in the measured spectral data.
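A data set with the same overall structure can be generated as sketched below. This is an assumption-laden illustration rather than the data actually used: the CLS-derived pure-component spectra are not reproduced here, so hypothetical Gaussian bands stand in for them (with ethanol and urea placed near 8500 and 9900 cm⁻¹), and the analyte concentrations are drawn uniformly at random instead of from the series of concentrations used in the paper.

```python
import numpy as np

def simulate_spectra(K_pure, n_samples, rng, total=100_000.0, noise_frac=0.003):
    """Simulate absorbance spectra with a closure constraint and added noise."""
    analytes = rng.uniform(0.0, 500.0, size=(n_samples, 3))     # glucose, ethanol, urea (mg/dL)
    water = total - analytes.sum(axis=1, keepdims=True)         # concentrations sum to `total`
    C = np.hstack([analytes, water])
    A_clean = C @ K_pure
    sigma = noise_frac * np.abs(A_clean).max()                  # 0.3% of the maximum absorbance
    return C, A_clean + rng.normal(0.0, sigma, size=A_clean.shape)

rng = np.random.default_rng(6)
x = np.linspace(7500.0, 11000.0, 200)                           # wavenumber axis (cm^-1)
# Placeholder pure-component spectra per unit concentration (hypothetical shapes).
bands = [(9000.0, 900.0), (8500.0, 300.0), (9900.0, 300.0), (10300.0, 1200.0)]
K_pure = 1e-5 * np.stack([np.exp(-((x - c) / w) ** 2) for c, w in bands])

C_cal, A_cal = simulate_spectra(K_pure, 25, rng)
C_val, A_val = simulate_spectra(K_pure, 25, rng)
```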

The pure-component spectra of the analytes and the water solvent used to generate the simulated data are shown as the dotted lines in Fig. 3. The spectra were derived from real data containing significant spectrometer drift resulting in some computed negative absorbance values. Note that the glucose pure-component spectrum is similar to that of a modified water spectrum since glucose has no noticeable vibrational features of its own in this region of the spectrum. Rather, the CLS-estimated glucose spectrum is dominated by the interaction of glucose with the water solvent. Consequently, building a multivariate spectral model to estimate glucose concentration in the samples is more difficult than for the other analytes. On the other hand, both ethanol and urea have distinctive spectral features (e.g., at 8500 and 9900 cm⁻¹, respectively), so they are more easily quantified.

Because water is the dominant component, the spectra of the simulated samples closely resemble that of water. However, if we remove the average spectrum, the spectral variation due to the analytes is apparent, as shown in Fig. 4. Most of the spectral variation shown is due to differences in the concentrations between samples, which are accentuated by the baseline variations present in these pure-component spectra. The high-frequency component of the spectra is the added noise.

RESULTS AND DISCUSSION

Augmentation Prediction Ability for the Simulated Data. When a CLS model is built from the simulated data using only two of the four components, glucose and ethanol, the cross-validated standard errors of prediction (CVSEPs) for glucose and ethanol are 234 and 93 mg/dL, respectively. There is almost no prediction ability for glucose. However, if we build a CRACLS model, augmenting the model with the concentration residuals twice, we achieve CVSEPs of 17 mg/dL for glucose and 3 mg/dL for ethanol, demonstrating that the CRACLS model has prediction ability for both components. If we include only glucose or ethanol concentrations in the CRACLS calibration model, and augment three times, we get essentially the same CVSEPs as above. Adding the concentration residuals to the concentration matrix successfully removed contamination of the unmodeled components from the calibration model for the included components.
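For reference, a cross-validated standard error of prediction for a CLS model can be computed along the following lines. The cross-validation scheme is not specified in this section, so leave-one-out is assumed here, and the CVSEP is taken as the root mean square of the left-out prediction errors; the function name `cvsep_cls` and the noise-free test data are hypothetical.

```python
import numpy as np

def cvsep_cls(C, A):
    """Leave-one-out CVSEP for a CLS model, one value per column of C."""
    n = C.shape[0]
    errors = np.empty(C.shape, dtype=float)
    for i in range(n):
        keep = np.arange(n) != i                          # leave sample i out
        K_hat = np.linalg.pinv(C[keep]) @ A[keep]         # calibrate on the rest (Eq. 2)
        errors[i] = A[i] @ np.linalg.pinv(K_hat) - C[i]   # predict the left-out sample (Eq. 4)
    return np.sqrt(np.mean(errors ** 2, axis=0))

# Hypothetical noise-free data with three overlapping bands.
rng = np.random.default_rng(7)
x = np.linspace(0.0, 1.0, 50)
K_true = np.stack([np.exp(-((x - c) / 0.2) ** 2) for c in (0.25, 0.5, 0.75)])
C_all = rng.uniform(0.0, 500.0, size=(20, 3))
A = C_all @ K_true

cvsep_full = cvsep_cls(C_all, A)             # all components modeled
cvsep_partial = cvsep_cls(C_all[:, :2], A)   # third component left out of the model
```

The same loop could wrap a CRACLS calibration instead of plain CLS, with the augmentations recomputed inside each cross-validation pass.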

Qualitative Information in the CRACLS Loading Vectors. The estimated pure-component spectra or loading vectors (LV) that resulted from including water and urea concentrations in the calibration and augmenting with concentration residuals twice are shown as the solid lines in Fig. 3. The vectors have been shifted slightly to show the comparisons more clearly with the pure-component spectra (the dotted lines). Also, the loading vector for ethanol was multiplied by −1 to match the orientation of the pure-component spectrum. The qualitative information of the loading vectors is apparent when they are compared with the actual pure-component shapes shown using the dotted lines. LV1 and LV2 closely resemble the pure-component spectra of the given components, urea and water, respectively. Also, even though ethanol was not included in the calibration, LV3 resembles its pure-component spectrum. Because the glucose has weak vibrational bands in this region, its spectral information is not apparent in any of the loading vectors, even in LV4.


FIG. 5. Shape comparison of the first weight loading vector of a PLS glucose calibration and the first loading vector of a CLS calibration including glucose and water using the simulated data with the pure-component spectrum for glucose.

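TABLE I. Quantity of pure-component spectra in CRACLS loading vectors using glucose and ethanol concentrations and two augmentations.

        Glucose    Ethanol    Urea       Water
        (mg/dL)    (mg/dL)    (mg/dL)    (mg/dL)
LV 1    1.0        0.0        0.4        196
LV 2    0.0        1.0        0.5        134
LV 3    0.0        0.0        −0.1       −150a
LV 4    0.0        0.0        −1.0       248a

a The last two Water entries are fused in the source text as "2150248" (with "2" standing for a minus sign); the split shown is one reading, and −150 and −48 is also possible.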

If the major component is left out of the CLS calibration, the qualitative information will not be as apparent. The loading vectors will generally match the shape of the major component due to its overwhelming influence. The extent of this influence on the simulated data is shown by the decomposition of the vectors in the next section. With mean centering, the loading vectors obtained from a CLS calibration without the solvent concentrations will reveal more of the distinctive spectral features of the nonsolvent components. However, for a mixture system where the sum of the concentrations is constrained to a constant, the CLS estimated pure-component analyte spectrum represents the net change in the sample spectrum due to a unit change in the analyte concentration. Consequently, the estimated pure-component spectrum will include negative spectral changes due to the displacement of the solvent that will contaminate the mean-centered loading vectors.

In every case, the qualitative information in the loading vectors for CLS will be equal to or better than that for PLS, since the first PLS weight-loading vector, which retains the best qualitative information, is generated from only one component.5,26 For CLS, and therefore for CRACLS, the qualitative information is improved as additional analyte concentrations are included in the model. To illustrate this statement, a PLS calibration on glucose and a CLS calibration using only glucose and water were computed using the simulated data. The weight loading vector (WLV) from the PLS calibration and the loading vector (LV) from the CLS calibration for glucose are given in Fig. 5 as the dashed and dotted lines, respectively. The PLS weight loading vector is contaminated by the displacement of the water by the analyte, so the spectral shape does not match the pure-component shape (solid line) very well. The loading vector from a CLS calibration using only glucose and mean-centering has the same shape. However, by including the water concentration, its contamination is removed, and the CLS first loading vector as shown in the figure provides an improved match to the pure-component spectrum. The more information included in the CLS calibration, the better the fit. If the spectrometer drift has a linear component, adding time of spectrum collection to the CLS concentration matrix will yield better pure-component spectra, as demonstrated in Ref. 24. Also, CLS will generate better pure-component spectra than PLS when the known component concentrations are correlated.

Decomposition of the CRACLS Loading Vectors. CRACLS is effective even if only the minor components are included in the calibration. The decomposition of the first four CRACLS loading vectors provides insight into this effectiveness when only the minor components of glucose and ethanol are included in the calibration without mean centering. By defining a CLS prediction model using all four true pure-component spectra with unit concentration, shown in Fig. 3, that were used to create the simulated data, the contribution of each of the true spectra to the loading vectors can be quantified as shown in Table I. The spectral residuals, after removing the given amount for each of the pure-component spectra, represent only random noise, indicating that all the spectral information contained in these loading vectors was removed. As shown in Table I, the first and second loading vectors are the only ones that contain the glucose and ethanol pure-component spectra, respectively, corresponding to the order in which their concentrations were placed in the concentration matrix. Note also that the pure-component spectra are present at unit concentration so they can provide quantitative information during prediction. In addition, the urea and water spectra contaminate these first two loading vectors, as expected from Eq. 3. In Table I, the amount of water far exceeds the other components, so all the loading vector shapes are very similar to the pure water spectral shape. The last two estimated pure-component spectra are linear combinations of only urea and water, the components left out of the concentration matrix. Consequently, during prediction using these four loading vectors, the resulting NAS for glucose and ethanol (i.e., the part of each pure-component spectrum orthogonal to all other vectors) is the same as obtained from the fully specified CLS model. Basically, the contamination of both water and urea is removed from the prediction of the glucose and ethanol by the third and fourth loading vectors. Therefore, the prediction model correctly determines the glucose and ethanol concentrations even though the concentrations for the other two components were not explicitly included in the calibration.
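The NAS argument above can be checked numerically. The following is a minimal sketch, not the authors' code: the four Gaussian bands are synthetic stand-ins for the glucose, ethanol, urea, and water spectra, and the mixing matrix `M` holds invented coefficients that only mimic the structure described for Table I (unit glucose and ethanol in the first two loading vectors, urea and water contamination everywhere, and only urea and water in the last two). The helper name `net_analyte_signal` is hypothetical.

```python
import numpy as np

def net_analyte_signal(vectors, index):
    """Return the part of vectors[index] orthogonal to all other rows."""
    target = vectors[index]
    others = np.delete(vectors, index, axis=0)
    # Fit the target with the other rows, then subtract the fitted part.
    coeffs = np.linalg.lstsq(others.T, target, rcond=None)[0]
    return target - others.T @ coeffs

x = np.linspace(0.0, 1.0, 200)

def band(center, width):
    return np.exp(-0.5 * ((x - center) / width) ** 2)

# Synthetic stand-ins; rows: glucose, ethanol, urea, water.
K_true = np.vstack([band(0.30, 0.05), band(0.45, 0.06),
                    band(0.60, 0.07), band(0.50, 0.30)])

# Invented mixing coefficients (assumption, not the values of Table I).
M = np.array([[1.0, 0.0,  0.4, 50.0],
              [0.0, 1.0, -0.2, 40.0],
              [0.0, 0.0,  1.3, 30.0],
              [0.0, 0.0, -0.7, 65.0]])
loadings = M @ K_true

# Decompose the loading vectors with a CLS model built from the true spectra.
contributions = loadings @ np.linalg.pinv(K_true)

# NAS for glucose from the fully specified model and from the loading vectors.
nas_full = net_analyte_signal(K_true, 0)
nas_partial = net_analyte_signal(loadings, 0)
print(np.allclose(nas_full, nas_partial, atol=1e-6))
```

Because the last two loading vectors span the same space as the urea and water spectra, projecting them out removes the contamination from the first loading vector, which is why the two NAS vectors agree in this toy case.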

Results from Experimental Data. The average calibration spectrum and the mean-centered calibration sample spectra from the experimental data are shown in Fig. 6. As with the simulated data, the water spectral features dominate the spectra before mean-centering. However, most of the spectral variations in the mean-centered spectra shown in Fig. 7B are not due to concentration


FIG. 6. The calibration experimental spectral data set. (A) Mean spectrum, and (B) mean-centered calibration spectra.

FIG. 7. The spectra added to PACLS. (A) Difference between average repeat spectra for the calibration and prediction data sets. (B) Mean-centered average repeat spectra from the prediction data set.

TABLE II. Minor component CVSEPs for dilute aqueous solutions using CLS, PLS, and CRACLS calibration models.

Component         CLS   PLS^a     CRACLS(10)^b   CRACLS(15)^b   CRACLS(20)^b
Glucose (mg/dL)   71    15 (11)   18             16             15
Ethanol (mg/dL)   23    6 (11)    6              5              5
Urea (mg/dL)      18    6 (9)     6              5              5

a Value in parentheses is the optimal number of PLS factors used in the model.
b Values in parentheses represent the number of augmentations used in the CRACLS calibration.

differences but rather are the result of spectrometer drift and temperature variations. Due to these additional sources of variation, even a fully specified CLS model is not able to adequately model the data. This fact is demonstrated in the CVSEPs from the full CLS model of the real calibration data shown in the first column of Table II. All the chemical components and temperature were included in the model, and yet CLS still was unable to generate a precise calibration model. These results indicate why CLS is seldom used for prediction in multivariate calibrations. Also provided in Table II are the CVSEPs for PLS, with the optimal number of factors shown in parentheses, and for CRACLS, using 10, 15, and 20 augmentations. Although there was a gradual decrease in the CVSEPs with more augmentations, the optimal number of augmentations was not apparent. Later in this section, we discuss the issue of choosing the optimal number of augmentations in CRACLS. However, differences between the CRACLS calibration results and the PLS results for these data are not statistically significant.

To predict the validation data, information from the repeat sample was included in both the PLS and the CRACLS/PACLS models. For PLS, all the spectra from the repeat sample taken with the prediction data set were added to the original calibration data set along with its component concentrations, and the PLS model was recalibrated. The cross-validated calibrations were performed initially by using the standard approach of removing all the data for each sample during the rotations (sample-out rotation), so that all the spectra for the repeat sample were removed at once (Ref. 25). The number of loading vectors or factors for the PLS model was determined using an F-test on the cross-validated calibrations, as recommended by Haaland and Thomas (Ref. 5).

For CRACLS, no information from the repeat sample was given to the original CRACLS model to compute new pure-component spectra. Instead, the sources of spectral variation from the repeat sample shown in Fig. 7 were used (without concentration information) to augment the pure-component matrix during prediction. The advantage of this augmented method is that the model can be rapidly updated to account for new sources of spectral variation. Recomputing the calibration model using the original and the additional spectra would take considerably more computation time. The spectral difference between the average of the repeat spectra from the calibration and prediction data sets, shown in Fig. 7A, was added to capture the long-term spectrometer drift between the calibration and prediction days. Without the addition of this mean-difference spectrum, the predicted values show a definite bias. In addition, all the mean-centered repeat spectra for the prediction repeat sample (shown in Fig. 7B) were added to eliminate the detrimental effects of any short-term system drift during the collection of the prediction sample spectra. By mean centering, the spectral contribution of the repeat sample's specific concentration is removed, leaving only the variation from non-chemical sources. Note the sharp features in the mean-centered spectra at the wavenumbers around 8800 and 10 600 cm⁻¹. These features are the result of water vapor variations caused by the changes in the quality of the purge. Other less obvious spectrometer drift features contribute to the spectral variation as well. The background spectra are inadequate to correct for all sources of spectrometer drift. In fact, prediction results were best when an average background, rather than individual backgrounds, was used to obtain the absorbance spectra.
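The prediction-time augmentation described above can be illustrated with a small noise-free simulation. This is a sketch under stated assumptions rather than the authors' implementation: `K_hat` stands in for the calibrated pure-component spectra, `long_drift` and `short_drift` are invented drift shapes, and the function name `pacls_predict` is hypothetical. The mean-difference spectrum and the mean-centered repeat spectra are stacked under `K_hat` without any concentration information, and only the analyte columns of the least-squares solution are kept.

```python
import numpy as np

def pacls_predict(spectra, K_hat, extra_shapes=None):
    """CLS prediction; extra_shapes are appended to K_hat as extra rows."""
    n_analytes = K_hat.shape[0]
    K_aug = K_hat if extra_shapes is None else np.vstack([K_hat, extra_shapes])
    # rcond guards against the exactly collinear rows of this toy example.
    coeffs = np.linalg.lstsq(K_aug.T, spectra.T, rcond=1e-10)[0].T
    return coeffs[:, :n_analytes]

rng = np.random.default_rng(3)
x = np.linspace(0.0, 1.0, 200)

def band(center, width):
    return np.exp(-0.5 * ((x - center) / width) ** 2)

K_hat = np.vstack([band(0.30, 0.05), band(0.60, 0.06)])  # stand-in spectra
long_drift = 0.5 * x             # assumed slow change between the two days
short_drift = band(0.33, 0.01)   # assumed sharp, vapor-like feature

# Repeat sample measured on the calibration day and on the prediction day.
c_repeat = np.array([0.5, 0.5])
repeat_cal = np.tile(c_repeat @ K_hat, (5, 1))
s = rng.uniform(-0.2, 0.2, size=5)
repeat_pred = c_repeat @ K_hat + long_drift + np.outer(s, short_drift)

mean_difference = repeat_pred.mean(axis=0) - repeat_cal.mean(axis=0)
centered_repeats = repeat_pred - repeat_pred.mean(axis=0)
extra = np.vstack([mean_difference, centered_repeats])

# Prediction-day samples carry the same two drift sources.
C_true = rng.uniform(0.0, 1.0, size=(20, 2))
t = rng.uniform(-0.2, 0.2, size=20)
A_pred = C_true @ K_hat + long_drift + np.outer(t, short_drift)

err_plain = np.abs(pacls_predict(A_pred, K_hat) - C_true).max()
err_aug = np.abs(pacls_predict(A_pred, K_hat, extra) - C_true).max()
print(err_plain, err_aug)
```

In this idealized setting the unaugmented prediction is biased by the drift, while the augmented prediction recovers the concentrations because the added shapes span both drift sources. With real, noisy spectra the improvement would be less clean.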

Table III shows the standard errors of prediction (SEPs) for all the minor constituents in the prediction data set from various CLS, PLS, and CRACLS methods using calibration models based upon the calibration data


TABLE III. CVSEPs for PLS and SEPs for CLS, PLS, and CRACLS on the prediction data set using models containing the repeat sample spectral information.

                  CVSEPs   SEPs
Component         PLS      CLS   PLS-A^a   PLS-B^a   CRACLS/PACLS(10)^b   CRACLS/PACLS(20)^b
Glucose (mg/dL)   18       287   55 (6)    21 (12)   19                   19
Ethanol (mg/dL)   5        27    9 (8)     4 (13)    5                    4
Urea (mg/dL)      5        42    4 (11)    4 (12)    4                    4

a Value in parentheses is the optimal number of PLS factors used in the model.
b Values in parentheses represent the number of augmentations used in the CRACLS calibration.

FIG. 8. (A) Ethanol, (B) urea, and (C) water concentration residuals vs. reference glucose concentration from the CLS model applied to the simulated data when leaving out glucose concentrations.

set and the repeat sample information. As a comparison standard, the first column of Table III provides the CVSEPs obtained from a separate PLS calibration obtained from the prediction data. As expected, CLS without the information about the interferents had poor prediction ability. PLS-A in Table III represents PLS recalibration with the optimal number of PLS factors selected during the sample-out cross-validated rotation. The results show that the model was inadequate even though this method of rotating spectra out during cross-validation has generally been recommended in order to minimize the possibility of overfitting the data (Ref. 25). Therefore, we modified the cross-validation rotation during the recalibration so that each of the repeat spectra from the prediction set was rotated out one at a time (spectrum-out rotation). Using this rotation, a more correct number of factors was selected, and the PLS-B model shown in Table III predicted well for these data. Cross-validated rotation removing individual spectra was preferred in this case over rotation removing all repeat spectra for a given sample since the repeat spectra were added to capture the system drift, not the component concentrations. By rotating all repeat spectra out at once, the F-test on the cross-validated calibrations did not adequately gauge the impact of the drift on the residuals for factor selection. While the spectrum-out rotation worked better for this data set, we do not recommend this rotation as a general rule. Adding several spectra for the same sample can overemphasize that sample in the model, which could lead to overfitting. Consequently, a careful analysis on a case-by-case basis is required to determine the appropriate type of spectral rotation during cross-validation for PLS. The proper choice of rotation for PLS will depend on the relative effects of the concentrations and the spectrometer drift or other sources of spectral variations that need to be modeled.
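The difference between the two rotations can be made concrete with a short sketch of how the cross-validation folds might be generated. The function `rotation_folds` and its arguments are hypothetical, and the sketch assumes that only the samples named in `spectrum_out_samples` (here the repeat sample) are rotated out one spectrum at a time while every other sample is still rotated out as a whole; the text does not state how the remaining samples were handled.

```python
def rotation_folds(sample_ids, spectrum_out_samples=()):
    """Return (train, test) index lists for cross-validation.

    Each sample is held out as a whole (sample-out rotation) unless its ID is
    listed in spectrum_out_samples, in which case each of its spectra is held
    out individually (spectrum-out rotation).
    """
    ordered_ids = []
    for sid in sample_ids:
        if sid not in ordered_ids:
            ordered_ids.append(sid)

    folds = []
    for sid in ordered_ids:
        members = [i for i, s in enumerate(sample_ids) if s == sid]
        if sid in spectrum_out_samples:
            groups = [[i] for i in members]   # one spectrum per fold
        else:
            groups = [members]                # whole sample per fold
        for test in groups:
            train = [i for i in range(len(sample_ids)) if i not in test]
            folds.append((train, test))
    return folds

# Two ordinary samples with duplicate spectra plus three repeat spectra.
ids = ["s1", "s1", "s2", "s2", "repeat", "repeat", "repeat"]
sample_out = rotation_folds(ids)
spectrum_out = rotation_folds(ids, spectrum_out_samples={"repeat"})
print(len(sample_out), len(spectrum_out))
```

With the sample-out rotation the repeat spectra never appear in the training set while a repeat spectrum is being predicted, so the drift they carry cannot influence factor selection. With the spectrum-out rotation the remaining repeat spectra stay in the training set.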

The results for CRACLS models using 10 and 20 augmentations are also presented in Table III. Again we see that CRACLS compares favorably with PLS. Because the CRACLS model was developed using all constituents and the PLS model uses only one component at a time, it could be argued that the comparison is not valid. Therefore, we recalibrated using CRACLS with 20 augmentations and used the concentrations from only one component at a time to generate three separate models. Applying these models to the prediction data set, the SEPs for glucose, ethanol, and urea were 20, 4, and 4 mg/dL, respectively. Essentially, those results matched the values derived when all the constituents were included in the CRACLS calibration model. Using one or more of the component concentrations, the CRACLS approach effectively removed the impact of all sources of spectral variation not represented by the given/known concentrations. Notice, however, that with CRACLS, there is no problem with selecting a rotation method since the mean-centered repeat spectra are simply augmented to the estimated pure-component spectra during prediction.

Concentration Residual Choice. When there is more than one known component, we need to choose which concentration residual vector to use for the augmentation. To make the choice, consider the impact that the unmodeled components will have on each of the concentration residuals. From Eq. 9, we know that the concentration residual vector is related to the projection of the unmodeled pure-component spectra, K_u, onto the space spanned by the modeled pure-component spectra, K, which is the portion of the unmodeled pure spectra that overlaps with the modeled pure spectra. If one of the unmodeled and one of the modeled pure-component spectra happen to be orthogonal, the concentration residual from that modeled component will not provide any information about the concentrations of the unmodeled component. However, in practice, orthogonal pure-component spectra rarely occur since orthogonality implies that there is no spectral overlap. Even using simulated data with pure-component spectra configured to be nearly orthogonal, augmentation with the concentration residual vector was still sufficient to produce a good predictive model.

To demonstrate the applicability of the various component residuals, a calibration model was developed using the simulated data and leaving out only the glucose concentrations during calibration. The concentration residuals vs. the reference concentration for the three components included in the calibration are shown in Fig. 8. The absolute value of the concentration residuals in Fig. 8 varies from 20 to 100 mg/dL depending on the analyte. However, the pattern of the sample residuals for each of the components is basically the same, indicating that the contamination from glucose contributed approximately the same relative error to each of the samples for each of the components. Consequently, any of these component residuals will give statistically equivalent CRACLS models since the magnitude of the differences will only change the scaling factor of the estimated glucose pure-component spectrum corresponding to the augmented concentration residual vector, C_u, in Eq. 9. The resulting NAS of the known components will be essentially identical regardless of which concentration residual vector is selected. Considering this discussion, at this time we find no theoretical or empirical basis for choosing which residual to use for the augmentation.
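The observation that every residual column shows the same sample pattern, differing only in scale, can be reproduced in a noise-free toy calculation. The sketch below is an illustration under simplifying assumptions (one unmodeled component, synthetic Gaussian spectra, no noise) and is not taken from the paper; the identity it checks follows from ordinary least-squares algebra for this setup and is not claimed to be the paper's Eq. 9. Here `r` is the part of the unmodeled concentration vector that the modeled concentrations cannot explain, and `q` holds the coefficients of the unmodeled spectrum projected onto the estimated spectra.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 120)

def band(center, width=0.08):
    return np.exp(-0.5 * ((x - center) / width) ** 2)

K = np.vstack([band(0.35), band(0.65)])   # modeled pure spectra (synthetic)
k_u = band(0.50)                          # unmodeled pure spectrum (synthetic)

C = rng.uniform(0.0, 1.0, size=(30, 2))   # known reference concentrations
c_u = rng.uniform(0.0, 1.0, size=30)      # unknown to the calibration
A = C @ K + np.outer(c_u, k_u)            # noise-free calibration spectra

# CLS calibration followed by CLS prediction of the calibration spectra.
K_hat = np.linalg.lstsq(C, A, rcond=None)[0]
C_hat = A @ np.linalg.pinv(K_hat)
residuals = C_hat - C                     # one residual column per analyte

# Part of c_u not explained by the modeled concentrations.
r = c_u - C @ np.linalg.lstsq(C, c_u, rcond=None)[0]
# Projection coefficients of the unmodeled spectrum on the estimated spectra.
q = k_u @ np.linalg.pinv(K_hat)

print(np.allclose(residuals, np.outer(r, q)))
```

Each residual column equals `r` times a different scalar, so choosing a different column only rescales the augmented concentration vector and therefore only rescales the corresponding estimated spectrum. An entry of `q` that is zero would correspond to a residual column carrying no information about the unmodeled component.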

Since the concentration residuals are used as the estimated concentrations for the unknown components, there is also the question about the impact of reference errors on the prediction ability of CRACLS. The implications of the reference errors are currently under investigation and will be addressed in a future paper. However, for the data sets we have used, reference errors have not degraded the prediction ability of the CRACLS models any more than they have the PLS methods.

Number of Augmentations. Another choice for CRACLS is selection of the number of augmentations to use in the model. As discussed by Haaland and Thomas (Ref. 5), when using PLS, it is important to avoid using too many factors because overfitting the concentration data will degrade the prediction model. The excessive PLS factors have concentration residuals associated with them that can degrade the concentration predictions if they are included in the model. Finding the correct number of augmentations in CRACLS may be less critical than choosing the optimal number of factors for PLS. Each additional loading vector generated by the concentration residual augmentation in CRACLS results in a reduction of the NASs for the analytes of interest. After the major unmodeled sources of spectral variation have been removed, the addition of more concentration residual vectors will generate estimated pure spectra that represent primarily random noise. Since random spectral noise is nearly orthogonal to the pure-component spectra of the known analytes, the impact on the NAS will be minimal. The insensitivity of CRACLS to the number of augmentations was demonstrated in the example above using real data. As shown in Tables II and III, the models generated using 10 or 20 augmentations gave equivalent results with no evidence of overfitting. We needed to add a sufficient number of residual vectors to capture all the unmodeled sources of spectral variation, but adding more residual vectors did not degrade the results. For these data, the CVSEPs for CRACLS dramatically decreased with increasing augmentations as real sources of spectral variation were modeled, and then the CVSEPs just gradually continued to decrease. If true, the insensitivity of CRACLS to the number of augmentations would be an advantage over PLS since at times it is difficult to determine the optimal number of factors for PLS. Currently we use the F-test on the cross-validated calibrations, suggested by Haaland and Thomas (Ref. 5) for PLS, to select the number of augmentations in CRACLS. This F-test often results in selecting the maximum or near the maximum number of augmentations computed. Because overfitting may occur with other data, the choice of the number of augmentations during CRACLS calibration certainly warrants further study and will be discussed in a future paper.
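To make the role of the number of augmentations concrete, the following is a minimal sketch of the augmentation loop as it is described in this paper: calibrate with CLS, predict the calibration spectra, append one concentration residual column to the concentration matrix, and repeat. Several details are assumptions rather than the authors' procedure: the residual is always taken from one fixed analyte column, no mean centering or cross-validation is used, and the data are synthetic with a single unmodeled component. The names `cracls_calibrate`, `cls_predict`, and `sep` are hypothetical.

```python
import numpy as np

def cls_calibrate(C, A):
    """Estimate pure-component spectra K_hat from A ~ C @ K_hat."""
    return np.linalg.lstsq(C, A, rcond=None)[0]

def cls_predict(A, K_hat):
    """Estimate concentrations from A ~ C_hat @ K_hat."""
    return A @ np.linalg.pinv(K_hat)

def cracls_calibrate(C, A, n_aug, residual_col=0):
    """Augment C with n_aug concentration residual columns, then calibrate."""
    C_aug = C.copy()
    for _ in range(n_aug):
        C_hat = cls_predict(A, cls_calibrate(C_aug, A))
        resid = C_hat[:, residual_col] - C[:, residual_col]
        C_aug = np.column_stack([C_aug, resid])
    return cls_calibrate(C_aug, A)

def sep(pred, ref):
    """Root-mean-square prediction error per analyte."""
    return np.sqrt(np.mean((pred - ref) ** 2, axis=0))

rng = np.random.default_rng(2)
x = np.linspace(0.0, 1.0, 150)

def band(center, width=0.08):
    return np.exp(-0.5 * ((x - center) / width) ** 2)

# Rows 0 and 1 are modeled; row 2 is an unmodeled, overlapping component.
K_true = np.vstack([band(0.35), band(0.65), band(0.50)])

def simulate(n):
    C_all = rng.uniform(0.0, 1.0, size=(n, 3))
    A = C_all @ K_true + rng.normal(0.0, 1e-3, size=(n, x.size))
    return C_all[:, :2], A

C_cal, A_cal = simulate(40)
C_test, A_test = simulate(30)

seps = {}
for n_aug in (0, 1, 5):
    K_hat = cracls_calibrate(C_cal, A_cal, n_aug)
    C_pred = cls_predict(A_test, K_hat)[:, :C_cal.shape[1]]
    seps[n_aug] = sep(C_pred, C_test)
    print(n_aug, seps[n_aug])
```

In this toy setting the first augmentation captures the single unmodeled component, and the later ones mostly model noise. How much those extra augmentations change the prediction error depends on the noise level, the number of samples, and the number of spectral channels, so the sketch should not be read as confirming the insensitivity discussed above for other data.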

Augmenting CRACLS or PACLS. CRACLS involves the use of the known concentration and augmented spectral information. An important observation is that the PACLS prediction models are mathematically identical whether the pure-component spectra are added to K of Eq. 4 during the cross-validation in the calibration phase or to K before performing true prediction. In addition, as we demonstrated earlier with our data, equivalent prediction ability was achieved with CRACLS by using all the concentrations simultaneously or by using the concentrations one component at a time. However, to enhance the outlier detection sensitivity and to achieve better qualitative information in the estimated pure-component spectra, it is advisable to include all known reference concentrations in the cross-validated calibration.

CONCLUSION

CRACLS, a new method for developing a CLS-type model, has been presented. The method is able to generate accurate and precise models using data with unmodeled spectral interferents present. Our new method provides the improved qualitative information of CLS models with the prediction ability of the implicit multivariate calibration methods by augmenting the calibration concentration matrix with column vectors that account for unmodeled constituents. These column vectors consist of selected concentration residuals from previous calibration iterations. The concentration residuals are approximate linear combinations of the concentrations of the unmodeled sources of spectral variation. By augmenting the concentration matrix with these residual vectors, estimated pure-component spectra, which are linear combinations of the actual pure-component spectra of the unmodeled sources of spectral variation, are generated. Augmenting the prediction with these additional spectral estimates then generates regression vectors that are appropriate for the analytes of interest in estimating concentrations uncontaminated by the unmodeled spectral interferences. Also, CRACLS might not experience the same overfitting problem that can afflict PLS. More importantly, combining the CRACLS model with the PACLS technique results in models for our data that are comparable in prediction ability to the standard PLS models. Moreover, unlike PLS, the CRACLS/PACLS combination can be used to update models without the need for time-consuming recalibration. Updating the CRACLS/PACLS models requires only spectral information, while PLS requires spectral and concentration information during recalibration. With more recent analysis, we have found that an augmented CLS algorithm actually outperforms the prediction ability of PLS (Refs. 22 and 26). Finally, the CRACLS algorithm provides better qualitative information about the analytes by generating better estimates of their pure-component spectra.

ACKNOWLEDGMENTS

The authors would like to acknowledge Laura Martin for making the samples and for collecting the spectral data and Edward Thomas for providing the experimental design and for helpful discussions. Howland Jones and Jean Martin of Rio Grande Medical Technologies, Inc., aided in the making of the calibration and validation samples and provided the methods and equipment to achieve the high accuracy for the reference concentrations.

1. M. K. Antoon, J. H. Koenig, and J. L. Koenig, Appl. Spectrosc. 31, 518 (1977).

2. D. M. Haaland, R. G. Easterling, and D. A. Vopicka, Appl. Spectrosc. 39, 73 (1985).

3. P. M. Fredericks, J. B. Lee, P. R. Osborn, and D. A. Swinkels, Appl. Spectrosc. 39, 303 (1985).

4. W. Linberg, J. A. Persson, and S. Wold, Anal. Chem. 55, 643 (1983).

5. D. M. Haaland and E. V. Thomas, Anal. Chem. 60, 1193 (1988).

6. D. M. Haaland and E. V. Thomas, Anal. Chem. 60, 1202 (1988).

7. D. W. T. Griffith, Appl. Spectrosc. 50, 59 (1996).

8. D. M. Haaland and D. K. Melgaard, Appl. Spectrosc. 54, 1303 (2000).

9. D. M. Haaland and D. K. Melgaard, Appl. Spectrosc. 55, 1 (2001).

10. D. M. Haaland, "Multivariate Calibration Methods Applied to Quantitative FT-IR Analyses", in Practical Fourier Transform Infrared Spectroscopy, J. R. Ferraro and K. Krishnan, Eds. (Academic Press, New York, 1989), Chap. 8, pp. 396–468.

11. D. M. Haaland, "Methods to Include Beer's Law Nonlinearities in Quantitative Spectral Analysis", in Computerized Quantitative Infrared Analysis, ASTM Special Technical Publication, G. L. McClure, Ed., STP Vol. 934 (American Society of Testing Materials, Philadelphia, Pennsylvania, 1987), p. 78.

12. P. Saarinen and J. Kauppinen, Appl. Spectrosc. 45, 953 (1991).

13. P. Jaakkola, J. D. Tate, M. Paakkunainen, J. Kauppinen, and P. Saarinen, Appl. Spectrosc. 51, 1159 (1997).

14. D. M. Haaland and R. G. Easterling, Appl. Spectrosc. 34, 539 (1980).

15. D. M. Haaland and R. G. Easterling, Appl. Spectrosc. 36, 665 (1982).

16. C. L. Lawson and R. J. Hanson, Solving Least Squares Problems (Prentice-Hall, Englewood Cliffs, NJ, 1974).

17. C. D. Brown, L. VegaMontoto, and P. D. Wentzell, Appl. Spectrosc. 54, 1055 (2000).

18. H. Martens and T. Naes, Multivariate Calibration (John Wiley and Sons, Chichester, 1989).

19. I. K. Salomaa and J. K. Kauppinen, Appl. Spectrosc. 52, 579 (1998).

20. A. Lorber, Anal. Chem. 58, 1167 (1986).

21. H. Martens and T. Naes, "Multivariate Calibration by Data Compression", in Near-Infrared Technology in Agricultural and Food Industries, P. C. Williams and K. Norris, Eds. (American Association of Cereal Chemists, St. Paul, Minnesota, 1987), pp. 57–87.

22. C. M. Wehlburg, D. M. Haaland, D. K. Melgaard, and L. E. Martin, Appl. Spectrosc. 56, 605 (2002).

23. E. V. Thomas and N. Ge, Technometrics 42, 168 (2000).

24. D. M. Haaland, L. Han, and T. M. Niemczyk, Appl. Spectrosc. 53, 390 (1999).

25. D. M. Haaland, Anal. Chem. 60, 1208 (1988).

26. C. M. Wehlburg, D. M. Haaland, and D. K. Melgaard, Appl. Spectrosc., paper accepted for publication.
