usingmulti-block analysis to select informative...

Douglas N. Rutledge

Laboratoire de Chimie Analytique, AgroParisTech

16, rue Claude Bernard, 75005 Paris, France

[email protected]

Douglas N. Rutledge

Laboratoire de Chimie Analytique, AgroParisTech

16, rue Claude Bernard, 75005 Paris, France

[email protected]

Using multi-block analysis

to select informative variables

Using multi-block analysis

to select informative variables

Starting point

•Huge data sets

•Variable selection

•Multiple data sets

Variable Selection

[1] V. Centner, D. L. Massart, O. E. deNoord, S. deJong, B. M. Vandeginste, C. Sterna, Elimination of uninformative variables for multivariate calibration. Analytical Chemistry 1996, 68, 3851-3858.[2] A.S. Bangalore, R.E. Shaffer, G.W. Small, M.A. Arnold, Genetic algorithm-based method for selecting wavelengths and model size for use with partial least-squares regression: Application to near-infrared spectroscopy. Analytical Chemistry 1996, 68, 4200-4212.[3] L. Norgaard, A. Saudland, J. Wagner, J.P. Nielsen, L. Munck, S.B. Engelsen, Interval partial least-squares regression (iPLS): A comparative chemometric study with an example from near-infrared spectroscopy. Applied Spectroscopy 2000, 54, 413-419.

The quality of multivariate predictive models is increased by eliminatinguninformative variables.

For discriminant models, pp-ANOVA is often used :- test each variable separately- varies more between groups than within groups ?

For regression analysis, many methods :- Uninformative Variable Elimination-PLS [1]- Genetic Algorithm-PLS [2]- iPLS [3] ...

pp-ANOVA and iPLS

The most commonly used methods :

• pp-ANOVA is intrinsically UNIVARIATE

• iPLS applies PLS regression to ISOLATED BLOCKS

• And then there is the multiple-testing problem !

SO - better to use an intrinsically MULTIVARIATE, MULTIBLOCK method

Multi-block analysis

[4] E. Qannari, I. Wakeling, P. Courcoux, H.J.H MacFie,

Defining the underlying sensory dimensions. Food Quality and Preference 2000, 11, 151-154.

"Common Components and Specific Weights Analysis" - CCSWA [4]

Simultaneously study several matrices- with different variables describing the same samples

Describe m data tables observed for the same n samples :- a set of m data matrices (X) each with n rows,- but not necessarily the same number columns

Determine a common space for all m data table,- each matrix has a specific contribution ("salience")to the definition of each dimension of this common space


Start with pmatrices Xi of size n × ki (i = 1 to p)

Each Xi column-centered and scaled by dividing by matrix norm :Xsi

For each Xsi, an n × n scalar product matrix Wi can be computed as :

Wi = Xsi • Xsi T

Wi reflect the dispersion of the samples in the space of that table

The common dimensions of all the tables are computed iterativelyAt each iteration, a weighted sum of the pWimatrices is computed, resulting in a global WGmatrix

For each successive Common Dimension, calculate a scores vector q(coordinates of the n samples along the common dimension)

is the specific weight ("salience") associated withthe ith table for the jth Common Dimension generated by qj

Differences in the values of the specific weights for a dimension :- information present in some tables but not others

Subsequent components calculated after deflating the data tables


)(i

jλ

( )

1

j ni T

i j j j

j

W q q=

=

= λ∑

Classical ComDim

"ComDim" the implementation of CCSWA used here is part of the SAISIR toolbox SAISIR (2008): Statistics Applied to the Interpretation of Spectra in the InfraRed

Dominique Bertrand ([email protected])

PLS-ComDim

1) NIR on apples

Samples

• 2 Varieties :

• Cox, Jonagold

• 2 Faces :

• Red, Green

• 3 Maturity levels :

• fresh, ripe, over-ripe

• 8 different apples

Spectra

• 94 x 200 points

Tables

• 50 blocks of 4 variables

• 6 Common Dimensions

NIR Spectra

Spectra

ComDimSaliences

Correlation between ComDim Scores

and "Face"

Correlation between PLS-ComDim Scores

and "Face"

1 2 3 4 5 6 7 8 9 10111213141516171819202122232425262728293031323334353637383940414243444546474849500

0.05

0.1

0.15

0.2

0.25

0.3

0.35

0.4

0.45

RM

SE

CV

Dotte d line is RMSECV (4 LV 's) fo r g loba l m ode l / Ita lic numbe rs a re optim a l LVs in inte rva l m ode l

Interval num ber

3 2 3 4 2 2 3 3 4 4 4 1 4 2 1 1 2 1 2 2 2 4 2 4 2 3 2 3 2 1 3 1 1 2 3 1 2 2 3 2 2 1 1 2 1 1 1 1 1 1

i-PLS between NIR and "Face"

10 2 0 30 4 0 50 60 7 0 80 9 0

-1. 5

-1

-0. 5

0

0. 5

1

1. 5

2

- Blocks = 50 - Mean centred- Max LVs = 4 - CV = Full

- Scores on LV1for Block 8


and "Maturity"

1 2 3 4 5 6 7 8 9 10111213141516171819202122232425262728293031323334353637383940414243444546474849500

0.05

0.1

0.15

0.2

0.25

0.3

0.35

0.4

0.45

RM

SE

CV


Interval num ber

1 1 3 2 3 2 2 3 3 3 3 3 3 2 3 1 1 3 2 2 3 4 1 1 2 2 3 1 1 3 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 1

i-PLS between NIR and "Maturity"

10 2 0 30 4 0 50 60 7 0 80 9 0

-1. 5

-1

-0. 5

0

0. 5

1

1. 5




and "Variety"

1 2 3 4 5 6 7 8 9 10111213141516171819202122232425262728293031323334353637383940414243444546474849500

0.05

0.1

0.15

0.2

0.25

0.3

0.35

0.4

0.45

0.5

RM

SE

CV


Interval num ber

3 2 2 2 3 3 4 3 1 3 2 2 4 4 4 3 3 4 4 2 4 1 4 3 3 3 3 1 2 3 2 2 3 3 3 3 2 3 2 2 2 4 3 3 1 2 2 4 2 2

i-PLS between NIR and "Variety"

10 2 0 30 4 0 50 60 7 0 80 9 0-0. 4

-0. 3

-0. 2

-0. 1

0

0. 1

0. 2

0. 3



Genomics data

0 . 5 1 1 . 5 2 2 . 5 3 3 . 5 4 4 . 5

x 1 04

Chromosome 1

1 0 0 0 2 0 0 0 3 0 0 0 4 0 0 0 5 0 0 0 6 0 0 0 7 0 0 0 8 0 0 0 9 0 0 0

Chromosome 22

………

Genomics data

Samples

• 940 healthy individuals :

• 7 regions :

• 53 ethnic groups :

Variables

• 644,138 Single Nucleotide Polymorphisms (SNPs)

/ 22 Chromosomes

Pretreatment

• PCT on each chromosome = 22x(940*940)� 22x(940*100)

Analysis

• ComDim on the set of 22 PCT blocks

[5] H. M. Cann, C. de Toma, L. Cazes, M-F. Legrand, V. Morel, L. Piouffre, J. Bodmer, et al. (2002),A human genome diversity cell line panel. Science 296, 261–262

[6] E. Génin - Inserm UMR 946 - Univ Paris Diderot

X

PCA

X1

T1

P1

X2

T2

P2

Xq

Tq

Pq

T1 T2 ... T1

PCA/PLS...

TPCTPT

PCT

PX1 PX2 PXq...

Segmented PCT on Obese data sets

TX TPCT=

(PTX1=inv(TPCT

T*TPCT)*TPCTT*X1)

Combined PCs

Component t1=Xu1

Space X3

Space X1

SpaceX 2

SegPCT Component tPCT

Component t2=X2u2

Component t3=X3u3

PCA and SegPCT-PCA (SW=2048, 10 PCT-PCs) memory allocation profiles

SegPCT-PCA21 Mbytes1207 s (~20 min.)

PCA

230 Mbytes14819 s (~250 min.)

With PCT : faster and requires less memory

SegPCT-PCA versus PCA

-0.002

0

0.002

0.004

0.006

0.008

0.01

0.012

0.014

0.016

0.018

500106016202180274033003860442049805540

PCA & SegPCT-PCA Loadings

With PCT : identical results

Multivariate Analysis of all SNPs

[7] J. Li et al.Worldwide Human Relationships Inferred from Genome-Wide Patterns of Variation,Science, 319(5866), 1100 - 1104

PCT-ComDim

Saliences of chromosome blocks

PCT-ComDim

CC1 Scores

Africa

Mid-EastEuropeCS_Asia

AmericaE_Asia

Oceania

PCT-ComDim

CC2 Scores

Africa

Mid-East

Europe

CS_Asia

AmericaE_Asia

Oceania

PCT-ComDim

CC3 Scores

America

E_Asia Oceania

PCT-ComDim

CC4 Scores

Oceania

PCT-ComDim

CC5 Scores

Bantu_S

BiakaPygmies

Mandenka

MbutiPygmies San

Yoruba

Bantu_N


and European populations

PCT-ComDim

Saliences of chromosome blocks

PCT-ComDim

French

Tuscan &Italian

Sardinian

Orcadian

Adygei

FrenchBasque

Russian

East

West

South

North

PCT-ComDim PCA

French

Tuscan & Italian

Sardinian

Orcadian

Adygei

FrenchBasque

Russian

[7] J. Li et al.Worldwide Human Relationships Inferred from Genome-Wide Patterns of Variation,Science, 319(5866), 1100 - 1104

Good & Bad news

With segmented PCT :

• analyse data sets of any width

• quickly

• using less memory

With multi-block analysis :

• detect groups of interesting variables ("salience")

• visualise relations among samples ("scores")

39

First African-European Conference on

Chemometrics

Mining School of Rabat

Morocco, 20th to 24th of September 2010

www.afrodata.org

usingmulti-block analysis to select informative...

Documents