data analysis in two-dimensional chromatography

Data analysis in two-dimensional chromatography

Data analysis in two-dimensional chromatography

Van ‘t Hoff Institute for Molecular SciencesVan ‘t Hoff Institute for Molecular Sciences University of AmsterdamUniversity of Amsterdam

Gabriel Vivó TruyolsAnalytical-chemistry group

Van ‘t Hoff Institute for Molecular SciencesUniversity of Amsterdam

[email protected]


IntroductionFrom 1D chromatography to 2D chromatography: what does change?

2 4 6 8 10 12 14 16 18Time, min

0x100

4x106

8x106

1x107

Inte

nsity


The complexity of the data changesInstruments can be classified according to the order of the tensor of data used to represent a single experiment:

Zero-order instruments Produce Zero-order tensor

(e.g. a number) Example pH - meter

First-order instruments Produce First-order tensor

(e.g. a vector) Example UV-VIS spectrometer

Second-order instruments Produce Second-order tensor

(e.g. a matrix) Example LC-MS, GCxGC

Third-order instrument Produce Third-order tensor

(e.g. a “cube” of data) Example GCxGC-MS

nth-order instrument exist, but they are rare

The changes from 1D to 2D


The concept of “peak vicinity” changes

Two-dimensional

1tR2 t R

One-dimensional

tR

Peak of interest

Peak of interest

Neighbors Neighbors

S. Peters, G. Vivó-Truyols, P. Marriott, P.J. Schoenmakers, J. Chromatogr. A. 1146 (2007), 232-241.



The concept of “peak resolution” is different

A

B

One-dimensional

tR

Two-dimensional

1tR2 t R

f

g

Valley-to-peak ratio(between A and B) g

fP gfP

AB?

Valley-to-peak ratio(between A and B)?

S. Peters, G. Vivó-Truyols, P. Marriott, P.J. Schoenmakers, J. Chromatogr. A. 1146 (2007), 232-241.



The concept of “peak” changes

One-dimensional

tR

Two-dimensional



The concept of “peak” changes

One-dimensional

tR

Two-dimensional


GCxGC, Riva - 2014


Do the main steps in data processing change?

Ste

p 2

Pre-process

Ste

p 3

MeasureS

tep

1View

• Base-line correction

• Noise filtering• Spike filtering• Alignment• … etc.

• Peak detection / integration

• Calibration• Deconvolution• Pattern

recognition



Do the main steps in data processing change?

Ste

p 2

Pre-process

Ste

p 3

MeasureS

tep

1View





recognition• Class separation


Basically, the steps are the same, but the algorithms used for each step may be different.

Univariate

Multivariate

• Folding• Phasing

First step: visualizationFirst step: visualization


Ste

p 2

Pre-process

Ste

p 3

Measure

Ste

p 1

View








VisualisationRaw data in 2D chromatography

0 10 20 30Time, min

-101234

Abso

rban

ce, A

U

Chromatogram of Glycine preparation (254nm)(data courtesy of University of Valencia)

Two-dimensional chromatographyGCxGC chromatogram of diesel (FID detector)First dimension: non-polar; Second dimension: polar


“Folding” the chromatogram Visualisation

Mod time (8 seconds)

27002710

27202730

2740

02

46

80

2000

4000

6000

8000

10000

1tR, s2tR, s

FID

sig

nal


“Folding” the chromatogram Visualisation

27002720

2740

02

46

80

5000

10000

1tR, s2tR, s

FID

sig

nal

2700 2710 2720 2730 27400

1

2

3

4

5

6

7

8

1tR, s

2 t R, s

0

2000

4000

6000

8000

10000



“Folding” the chromatogram: phasing Visualisation


27002720

2740

46

8100

5000

10000

1tR, s2tR, s

FID

sig

nal

2700 2710 2720 2730 2740

3

4

5

6

7

8

9

10

1tR, s

2 t R, s

0

2000

4000

6000

8000

10000

Phase = 0.3

2700 2710 2720 2730 27404

5

6

7

8

9

10

11

12

1tR, s

2 t R, s

0

2000

4000

6000

8000

10000

27002720

2740

46

810

120

5000

10000

1tR, s2tR, s

FID

sig

nal




Phase = 0.5

2700 2710 2720 2730 27406

7

8

9

10

11

12

13

14

1tR, s

2 t R, s

0

2000

4000

6000

8000

10000

27002720

2740

68

1012

140

5000

10000

1tR, s2tR, s

FID

sig

nal




Phase = 0.75


Cylindrical coordinates. An alternative way to represent the data.

Visualisation

J.J.A.M. Weusten, E.P.P.A. Derks , J.H.M. Mommers, S. van der Wal, Anal. Chim. Acta 726 (2012), 9

1tR

=2tR


Interpolation Visualisation

27002720

2740

46

8100

5000

10000

1tR, s2tR, s

FID

sig

nal


+ =

Welcome to the magic world of chemometrics!

Interpolation Visualisation


“Folding” the chromatogram: final result Visualisation


Conclusions Visualisation

• Visualising is simple, and gives a lot of information.

• Folding (one-dimensional) data into (2D) image introduces discontinuities in the edges. Other visualization methods (cylindrical coordinates) possible.

• Phasing can be of great help.

• Careful with “cosmetic” effects!

Second step: Pre-processingSecond step: Pre-processing


Ste

p 2

Pre-process

Ste

p 3

Measure

Ste

p 1

View








Pre-processingTypical problems: base-line drifts and noise

0 10 20 30Time, min

-101234

Abso

rban

ce, A

U

0 10 20 30Time, min

-0.15-0.1

-0.050

0.050.1

Abso

rban

ce, A

U

Base-line drifts Noise

20 20.2 20.4 20.6 20.8 21Time, min

-0.04-0.02

00.020.040.06

Abso

rban

ce, A

U

0 10 20 30Time, min

-0.2

-0.1

0

0.1

0.2

Abso

rban

ce, A

U


Pre-processingBase-line drifts.

0 10 20 30Time, min

-0.15-0.1

-0.050

0.050.1

Abso

rban

ce, A

U

Original chromatogram

Corrected chromatogr.

Fitted base-line correction

Weighted least squares fitting


Pre-processing

Originalchromatogram

Base-line drifts.



Correctedchromatogram


Pre-processing




Other (more sophisticated) options:

- Use splines- Base-line correction coupled to peak detection- Fourier-transform based approaches- Wavelet-based approaches

Base-line drifts.



Pre-processing

S.E. Reichenbach, M. Ni, D. Zhang, E.B. Ledford Jr., J. Chromatogr. A, 985 (2003) 47 - 56

Base-line is reached at the (half) upper part

Base-line is reached at the (half) bottom part

Base-line drifts.


Pre-processing

S.E. Reichenbach, M. Ni, D. Zhang, E.B. Ledford Jr., J. Chromatogr. A, 985 (2003) 47 - 56

Consider the positions with the smallest values in each half

Estimate local background parameters using neighboring

values

Interpolate the main background trend and

subtract it

Base-line drifts.


Pre-processingNoise removal. Smoothing and derivatives.

Savitzky-Golay filter is the most common method

Two parameters should be optmisized

• Window size• Polynomial degree

These parameters govern how much correlated noise is

removed

• Large window sizes and low polynomial degree

Too much noise is removed (chromatograms appear deformed)

• Small window sizes and large polynomial degrees Too much noise remains


Pre-processingNoise removal. Smoothing and derivatives.

Window size = 11Polynomial = 2





G. Vivó-Truyols, P.J. Schoenmakers, "Automatic selection of optimal Savitzky-Golay smoothing”, Anal. Chem. 78 (2006) 4598-4608.

FID

sig

nal


Pre-processingNoise removal. Spikes.

A good way of removing spikes consists of passing a median filter(before the Savitsky-Golay filter)

Savitzky-Golay filter

Median filterBase-line correction

Original data

Parameter to tune: window

size

Parameters to tune: window size and

polynomial degreeOptimizing three parameters


Pre-processingAlignment.

Two types of alignment

Between-chromatogram alignment

Between-modulation alignment

Alignment is not always necessary, depending on the final objective of the analysis

Rarely doneUsing 2D

techniques(folded data)

Using 1D techniques

(unfolded data)


15 different (but related) chromatograms

13.5 13.6 13.7 13.8Time, min

Alignment

Pre-processingAlignment. COW (1D)


Pre-processingAlignment. Using (truly) 2D algorithms

S. Castillo, I. Mattila, J. Miettinen, M. Orešič, T.Hyötyläinen, Anal. Chem. 83 ( 2011) 3058–3067

Score alignment in GCxGC-MS

D. Zhang, X. Huang, F.E. Regnier,M. Zhang, Anal. Chem., 80 (2008) 2664–2671

COW-adapted GCxGC-MS (using

single channel)


Conclusions Pre-processing

• Pre-processing methods are almost the same: one-dimensional = two-dimensional. Normally done in the (pre-folded) raw data.

• Every case needs a particular solution (it always exists, but some care should be taken!)

Third step: measureThird step: measure


Ste

p 2

Pre-process

Ste

p 3

Measure

Ste

p 1

View








Peak detection in one step: the watershed algorithm Peak detection

Most common software programs use the watershed algorithm to detect peaks in 2D chromatography:

J. De Bock et al., doi 10.1007/11558484

1 2 3 4


??? ?Single catchmentbasin?

G. Vivó-Truyols, H.G. Janssen, J. Chromatogr. A, doi:10.1016/j.chroma.2009.12.063



The first problem using the watershed algorithm.



The watershed algorithm

Does a two-dimensional chromatographic peak form a single basin?

50100

150

0

5

100

500

1000

1500

y-3

y-2

y-1

y0

y3

y2y1

70 90 110 1304.5

5

5.5

1tR, AU

2 t R, A

U

4 4.5 5 5.5 60

500

1000

1500

y-2

y-1

y-3

y2

y1

y0

2tR, AU

Watershed algorithm works


Automated peak detection in 2D chromatography


The watershed algorithm

Does a two-dimensional chromatographic peak form a single basin?

70 90 110 1304.5

5

5.5

1tR, AU

2 t R, A

U

50100

150

0

5

100

500

1000

1500

y-3

y-2

y-1

y0

y3

y2y1

4 4.5 5 5.5 60

500

1000

1500

2tR, AU

y-2

y-1

y3

y2

y1

y0

Saddle point

Watershed algorithm fails!

y-1

y0




Critical variability in second-dimension retention time:

2

2 1

pq

tP critfail

22 1

pq

tP critfail

mp

2

1

m

p

2

1

1mq 1mq

q=0.5 (eight cuts per peak)

q=1 (four cuts per peak)

q=2 (two cuts per peak)

q=4 (1 cut per peak)0 0.04 0.08 0.12 0.16 0.2

tR,crit, min

0

0.2

0.4

0.6

0.8

1

Pfa

il

p=5p=20

p=2.5

p=10

Current instruments

Current chromatographic practice


Peak detectionin one step: the watershed algorithm Peak detection


S. Peters, G. Vivó-Truyols, P.J. Marriott and P.J. Schoenmakers, J. Chromatogr. A 1156 (2007) 14.E.J.C. van der Klift, G. Vivó-Truyols, F.W. Claassen, F.L. van Holthoon, T.A. van Beek, J. Chromatogr. A, 1178 (2008) 43.

Time, arbitrary units (AU)

Ste

p

1 Detect peaks as in one-dimensional chromatography

Use information from derivatives (pre-processing

step)

Peak detectioni in two steps. Peak detection


Ste

p

2 Merge peaks that belong to the same compound according to 2nd-dimension retention time differences

50100

150

0

5

100

500

1000

1500

4 4.5 5 5.5 60

500

1000

1500

2tR, AU

T: Tolerance criterion

70 90 110 1304.5

5

5.5

1tR, AU

2 t R, A

U



Ste

p

2 Merge peaks that belong to the same compound according to 2nd-dimension retention time differences

50100

150

0

5

100

500

1000

1500

70 90 110 1304.5

5

5.5

1tR, AU

2 t R, A

U

4 4.5 5 5.5 60

500

1000

1500

2tR, AU

D

D>T ?



Ste

p

3 Check unimodality

50100

150

0

5

100

500

1000

1500

4 4.5 5 5.5 60

500

1000

1500

2tR, AU

70 90 110 1304.5

5

5.5

1tR, AU

2 t R, A

U



Alternative algorithms?

0 5000 10000 15000

50

60

70

80

90

100

1st dimension ret. time, s

2nd

dim

ensi

on re

t. tim

e, s

0

1

2

3

4

5

6

7

8

9

10x 107


1st dimension, Ag-Column

2nd

dim

ensi

on, R

P

Data courtesy of Teris van Beek, University of Wageningen (NL)



Each of these dots corresponds to a detected peak



7400 7500 7600 7700 7800 7900 8000 8100 820048

50

52

54-0.5

0

0.5

1

1.5

2

2.5

x 108

Sig

nali

nten

sity

, AU

Peak A

Peak B

7400 7500 7600 7700 7800 7900 8000 8100 820048

50

52

54-0.5

0

0.5

1

1.5

2

2.5

x 108

Sig

nali

nten

sity

, AU

Peak A

Peak B

Possibility 1 Possibility 2

Automated peak detection in 2D chromatographyAlternative algorithms?



0 5000 10000 15000

50

60

70

80

90

100

1st dimension ret. time, s

2nd

dim

ensi

on re

t. tim

e, s

0

1

2

3

4

5

6

7

8

9

10x 107

In general, any group of 1D peaks may exhibit x possibilities of arrangement in 2D peaks that do not violate the rules of unimodality and 2tR < T (tolerance criterion)!!!



The Bayesian approach

1tR

2 t R

Detected 1D peaksPeak A

Peak B

Peak A

Peak B

Peak A

Peak B

Peak A

Peak B

Peak C

Sol 1 Sol 2 Sol 3

Sol 4

Peak A

Peak B

Sol n

…

Ste

p

1 Let’s consider all possible solutions of peak arrangement


1tR

2 t R

Detected 1D peaksPeak A

Peak B

Peak A

Peak B

Peak A

Peak B

Peak A

Peak B

Peak C

Sol 1 Sol 2 Sol 3

Sol 4

Peak A

Peak B

Sol n

…


The Bayesian approachS

tep

2 Discard those solutions that violate the unimodality criterion. Discard also those solutions that imply a too fragmented chromatographic peak.


Peak A

Peak B

Peak A

Peak B

Sol 1

Sol 2

Peak A

Peak B

Sol n

…


The Bayesian approachS

tep

3 Apply the Bayes theorem to calculate the probability of each solution

H1

Hypothesis

H2

…Hn

|D D| |D D|


1tR

2 t R


The Bayesian approach

|D ∝ D||D ∝ D|

I’m interested only in a relative value of

p(Hn|D)

|D ∝ D||D ∝ D|

All the priors have the same probability

|D D| |D D|



The Bayesian approach Automated peak detection in 2D chromatography

HPLC-2011

| … , 1 , , 1 1 | … , 1 , , 1 1

|

…12

12

12 1 1

212

2 21 1

|

…12

12

12 1 1

212

2 21 1

Peak phase

1st

dimension peak width

Total peak area

Prior probabilities

How much does your 1D peak profile look like a

peak?

Are the 2nd dimension retention times too far

away?The computer willdo the work for you!


Automated peak detection in 2D chromatographyThe Bayesian approach

7400 7500 7600 7700 7800 7900 8000 8100 820048

50

52

54-0.5

0

0.5

1

1.5

2

2.5

x 108

Sig

nali

nten

sity

, AU

Peak A

Peak B

7400 7500 7600 7700 7800 7900 8000 8100 820048

50

52

54-0.5

0

0.5

1

1.5

2

2.5

x 108

Sig

nali

nten

sity

, AU

Peak A

Peak B

Possibility 1 Possibility 2

Probability = 51% Probability = 49%


2 t R

1tR

> 5000 peaks(or peak clusters)

Analyse each spot using

deconvolution

Deconvolution methods. Deconvolution

2D-FID chromatogram

Deconvolution

εHεyxA εHεyxA Bilinear model (for a single compound)

Cross productof two vectors

Modelled variance

xy

=

Unmodelled variance

Matrix

+

Total response

Matrix

A

Bilinear model


εHεyxA

n

iii

1εHεyxA

n

iii

1

Total response

Matrix

A

Unmodelled variance

Matrix

+

Modelled variance

=

+

+

+

...

x1y1

x2y2

xnyn


Deconvolution

Bilinear model (for a n compounds)

Bilinear model

-50

5

10

15

20

25

4.04.2

4.44.6

225250

275300

325350

375

Derivatisationagent

Amino-acid

Deconvolution


Solving the bilinear model. OPA-ALS

εYXεHεyxA

n

iii

1εYXεHεyxA

n

iii

1

y1

y2

Peak profiles inthe first order ofmeasurement(spectra)

x1 x2

Peak profiles inthe second orderof measurement(chromatograms)

AXY AXY

YAX YAX

Apply constraints

on Y

Apply constraints

on X

Initial estimates of X (or Y)

Final X and Y

DeconvolutionOPA-ALS


DDTiid det DDTiid det ixRD ixRD

1

x1 x2 …References(R) xi

spectrum at the ith retention time (xi)

The user selects the retention time of

maximum dissimilarity

Consider the spectrum at the selected time as the new R matrix

3.8 4 4.2 4.4 4.6time, min

d i (m

AU

)

Dissimilarity Reference(s) (R)

Mean spectrum

200 240 280 320 360 400lambda, nm

0

2

4

6

Abs

orba

nce,

mAU

DissimilarityDeconvolutionOPA-ALS


Dissimilarity Reference(s) (R)

3.8 4 4.2 4.4 4.6time, min

d i (m

AU

)

200 240 280 320 360 400lambda, nm

0

4

8

12

16

Abs

orba

nce,

mAU2

3.8 4 4.2 4.4 4.6time, min

d i (m

AU

)

200 240 280 320 360 400lambda, nm

05

10152025

Abs

orba

nce,

mAU

3 Only noise

Initial spectra (X)

DeconvolutionOPA-ALS


Deconvolution

Initial X

200 240 280 320 360 400lambda, nm

05

10152025

Abs

orba

nce,

mAU

AXY AXY

YAX YAX

Apply constraints

on Y

Apply constraints

on X

200 240 280 320 360 400lambda, nm

0

0.1

0.2

0.3

0.4

Abso

rban

ce, m

AU(n

orm

alis

ed)

3.8 4 4.2 4.4 4.6time, min

0

20

40

60

80

Abs

orba

nce,

mAU

Derivatisationagent

Amino-acid

Derivatisationagent

Amino-acid

Final X

Final Y


50100

150

0

5

100

500

1000

1500

y-3

y-2

y-1

y0

y3

y2y1


Let’s reflect… Deconvolution

… peak profiles in the second dimension are not exactly the same (e.g. due to retention time misalignments…)

What happens with the model if…

4 4.5 5 5.5 60

500

1000

1500

2tR, AU

y-2

y-1

y3

y2

y1

y0

A Cross productof two vectors

xy

MatrixMatrix= +


What to do then in practice? Deconvolution

If you have multichannel detection: matrix unfolding

If you don’t have multichannel detection… you

could align… but forget about it!

Misalignments in the 2nd-dimension


Matrix unfolding Deconvolution

1tR (region)

2 t R(r

egio

n) tR

m/z

m/z=50m/z=51

…m/z=750

m/z=50 m/z=51 … m/z=750


x

y

MatrixMatrix= +

MatrixA


Matrix unfolding. Example. Deconvolution

PLO and POL

1st dimension, Ag-Column

2nd

dim

ensi

on, R

P

Injection of corn oil in

LCxLC-MS(zoom in)

Data courtesy of Teris van Beek, University of Wageningen (NL). M. Navarro et al., presented at HPLC-Geneva.


Matrix unfolding. Example. Deconvolution

20 22 24 26 28 30 326065707580859095

100105

570575580585590595600605610615

0

0.2

0.4

0.6

0.8

1 575

577

601

570575580585590595600605610615

0

0.2

0.4

0.6

0.8

1

575

577

601

m/z

2 t R

1tR

2 t R

1tR

Abu

ndan

ce

m/z

Abu

ndan

ce

m/z

PLO

POL

Reference library

Data courtesy of Teris van Beek, University of Wageningen (NL). M. Navarro et al., presented at HPLC-Geneva.


Matrix unfolding & matrix augmentation Deconvolution

H. Parastar et. al, Anal. Chem. 83 (2011) 9289

εYXεHεyxA

n

iii

1εYXεHεyxA

n

iii

1

y1

y2

Peak profiles inthe first order ofmeasurement(spectra)

x1 x2

Peak profiles inthe second orderof measurement(chromatograms)

AXY AXY

YAX YAX

Apply constraints

on Y

Apply constraints

on X

Initial estimates of X (or Y)

Final X and Y

DeconvolutionFinding some unique mases. AMDIS.


If I can find a unique mass for each compound…

AXY AXY

DeconvolutionFinding some unique mases. AMDIS.


If I can find a unique mass for each compound…

… I could find the peak profiles (X) for each compound…


Let’s reflect… Deconvolution

Two peaks are completely resolved in one dimension but retention times are the same in the other dimension…

What happens with the model if…

εHεyxA

n

iii

1εHεyxA

n

iii

1For n compounds

“Mathematical compound” is not the same as

“chemical compound”!!


Matrix unfolding Deconvolution

1tR (region)

2 t R(r

egio

n) tR

m/z

m/z=50m/z=51

…m/z=750

m/z=50 m/z=51 … m/z=750


x

y

MatrixMatrix= +

MatrixA

Could we avoid this unfolding and use true

third-order data?


Using third-order data Deconvolution

Trilinear model (for n compounds)

??? ?What are the implications of this?

Profiles are unique in the three order of measurements

Solving the tri-linear model

Via PARAFAC, the three profiles are unique

Via PARAFAC2, one of the conditions of uniqueness can be relaxed (normally the

alignment between 1st and 2nd dimensions)

Other algorithms possible (e.g., TLD), but rarely used…


Deconvolution methods. Deconvolution

A.E. Sinha, J.L. Hope, B.J. Prazen, C.G. Fraga, E.J. Nilsson, R.E. Synovec, J. Chromatogr. A, 1056 (2004) 145 - 154

An example of PARAFAC


Deconvolution methods. Summary. Deconvolution

Deconvolutionmethods

Using 1D (unfolded) data

Use truly (folded) 2D data

• ALS or rank annihilation methods• Does not need between-

modulation alignment• Similar to AMDIS

• PARAFAC (needs between-modulation alignment)

• PARAFAC2 (more robust against between-modulation alignment)

Main problem: determine the number of components behind the peak cluster


Deconvolution methods. Discussion. Deconvolution

Trilinear model (for n compounds)

??? ?What are the

implications of having high-resolution MS

instead of nominal mass?

Bilinear model (for n compounds)

Third step: measureThird step: measure


Ste

p 2

Pre-process

Ste

p 3

Measure

Ste

p 1

View








Pattern recognition as a variable reductionS

tep

1 Obtain chromatogram(s)

Ste

p

2 Peak detection

Ste

p

3 Pattern recognition

Ste

p

4 Unknown sample characterized

HPLC-MS/MS GCxGC-MS

GC-MS

Base-line correction Alignment

Peak detection

Digital variables

Chemicalvariables

Process/biologicalvariables

From digital variables (chromatogram) to process variables

Raw data

Features

Healthy/sick

Pattern recognition

Information


Is peak detection necessary?

Chromatographic data

Using raw data directly

Ana

lysi

s of

flam

e ac

cele

rant

s us

ing

GC

xGC

Source ASource B

… for example:

Pattern recognition


Pattern recognition as a variable reductionS

tep

1 Obtain chromatogram(s)

Ste

p

2 Peak detection

Ste

p

3 Pattern recognition

Ste

p

4 Unknown sample characterized

HPLC-MS/MS GCxGC-MS

GC-MS

Base-line correction Alignment

Peak detection

Digital variables

Chemicalvariables

Process/biologicalvariables

From digital variables (chromatogram) to process variables

Raw data

Features

Healthy/sick

Pattern recognition


Pattern recognition in GCxGC. The options Pattern recognition

Pattern recognition in GCxGC

Using raw data Using peak tables

• Alignment is critical• Less chance to miss important

compounds• Normally done with the unfolded

(raw) data, but not always (e.g. N-PLS)

• Alignment not important, but peak tracking is essential (normally MS should be present)

• Chance to miss important compounds (close to the S/N)

• (Truly) 1D method


Pattern recognition in GCxGC. Supervised methods. Pattern recognition

Any method will be prone to overfitting

In supervised pattern recognition of GCxGC, a tremendous reduction of variables is performed (form millions to a few tens/hundreds)

Any variable pre-reduction (e.g. using Fisher ratios) should be done within a cross-validation loop

Otherwise the results will be optimistic (a method that seems to work, when in fact it only works for that data)


Pattern recognition in GCxGC. Example of a wrong strategy Pattern recognition

Ste

p 2Variable

selection: Fisher ratio on the raw

data Ste

p 3Supervised

pattern recognition: PLS-DA to

separate sick from healthyS

tep

1Obtain GCxGCchromatograms for sick (50) and

healthy (50) Ste

p 4Consider the

coefficients from PLS-DA as indicators of

potential metabolites

Objective: discovering metabolites responsible for cancer tumor

??? ?Aren’t you Overfitting? No, I’ve been cross-

validating the PLS-DA

… but the variable pre-selection has been done with the full data set!!


Pattern recognition in GCxGC. Example of a wrong strategy Pattern recognition

Ste

p 2Variable

selection: Fisher ratio on the raw

data Ste

p 3Supervised

pattern recognition: PLS-DA to

separate sick from healthyS

tep

1Obtain GCxGCchromatograms for sick (50) and

healthy (50) Ste

p 4Consider the

coefficients from PLS-DA as indicators of

potential metabolites

Objective: discovering metabolites responsible for cancer tumor

??? ?Aren’t you Overfitting? No, I’ve been cross-

validating the variable selection and the PLS-DA

correct


Orig

inal

dat

a se

t

Ste

p 2Withdraw

subset 1, fit the model with the rest, and test

the model with subset 1 S

tep

3

Do the same for the

other k sectionsS

tep

1Divide the original data

set in k subsets

(randomly) Ste

p 4

Sum up the error of the model in all

validation sets

1

2

…

k

Cal

ibra

tion

Val

Cal

ibra

tion

Val

Cal

Test the model here

Test the model here

Cal

Val

Cal

Pattern recognition in GCxGC. Example of a correct strategy Pattern recognition

“model” = “variable pre-selection (Fisher ratio) + PLS-DA”


Conclusions Multivariate methods

• Deconvolution: normally done with the unfolded data (less problems with between-modulation alignment)

• Deconvolution: problem to establish the number of compounds (normally done in a manual way)

• Two ways for pattern recognition: with raw data (normally preferred) or with peak table.

• Careful with validation of supervised pattern recognition. Variable pre-selection should be included in the validation loop.


Further debate… Multivariate methods

This presentation has been uploaded in my blog: www.tecnometrix.com

Feel free to download and generate debate if you wish!

data analysis in two-dimensional chromatography

Documents