usingmulti-block analysis to select informative...
TRANSCRIPT
Douglas N. Rutledge
Laboratoire de Chimie Analytique, AgroParisTech
16, rue Claude Bernard, 75005 Paris, France
Douglas N. Rutledge
Laboratoire de Chimie Analytique, AgroParisTech
16, rue Claude Bernard, 75005 Paris, France
Using multi-block analysis
to select informative variables
Using multi-block analysis
to select informative variables
Starting point
•Huge data sets
•Variable selection
•Multiple data sets
Variable Selection
[1] V. Centner, D. L. Massart, O. E. deNoord, S. deJong, B. M. Vandeginste, C. Sterna, Elimination of uninformative variables for multivariate calibration. Analytical Chemistry 1996, 68, 3851-3858.[2] A.S. Bangalore, R.E. Shaffer, G.W. Small, M.A. Arnold, Genetic algorithm-based method for selecting wavelengths and model size for use with partial least-squares regression: Application to near-infrared spectroscopy. Analytical Chemistry 1996, 68, 4200-4212.[3] L. Norgaard, A. Saudland, J. Wagner, J.P. Nielsen, L. Munck, S.B. Engelsen, Interval partial least-squares regression (iPLS): A comparative chemometric study with an example from near-infrared spectroscopy. Applied Spectroscopy 2000, 54, 413-419.
The quality of multivariate predictive models is increased by eliminatinguninformative variables.
For discriminant models, pp-ANOVA is often used :- test each variable separately- varies more between groups than within groups ?
For regression analysis, many methods :- Uninformative Variable Elimination-PLS [1]- Genetic Algorithm-PLS [2]- iPLS [3] ...
pp-ANOVA and iPLS
The most commonly used methods :
• pp-ANOVA is intrinsically UNIVARIATE
• iPLS applies PLS regression to ISOLATED BLOCKS
• And then there is the multiple-testing problem !
SO - better to use an intrinsically MULTIVARIATE, MULTIBLOCK method
Multi-block analysis
[4] E. Qannari, I. Wakeling, P. Courcoux, H.J.H MacFie,
Defining the underlying sensory dimensions. Food Quality and Preference 2000, 11, 151-154.
"Common Components and Specific Weights Analysis" - CCSWA [4]
Simultaneously study several matrices- with different variables describing the same samples
Describe m data tables observed for the same n samples :- a set of m data matrices (X) each with n rows,- but not necessarily the same number columns
Determine a common space for all m data table,- each matrix has a specific contribution ("salience")to the definition of each dimension of this common space
Multi-block analysis
Start with pmatrices Xi of size n × ki (i = 1 to p)
Each Xi column-centered and scaled by dividing by matrix norm :Xsi
For each Xsi, an n × n scalar product matrix Wi can be computed as :
Wi = Xsi • Xsi T
Wi reflect the dispersion of the samples in the space of that table
The common dimensions of all the tables are computed iterativelyAt each iteration, a weighted sum of the pWimatrices is computed, resulting in a global WGmatrix
For each successive Common Dimension, calculate a scores vector q(coordinates of the n samples along the common dimension)
is the specific weight ("salience") associated withthe ith table for the jth Common Dimension generated by qj
Differences in the values of the specific weights for a dimension :- information present in some tables but not others
Subsequent components calculated after deflating the data tables
Multi-block analysis
)(i
jλ
( )
1
j ni T
i j j j
j
W q q=
=
= λ∑
Classical ComDim
"ComDim" the implementation of CCSWA used here is part of the SAISIR toolbox SAISIR (2008): Statistics Applied to the Interpretation of Spectra in the InfraRed
Dominique Bertrand ([email protected])
PLS-ComDim
1) NIR on apples
Samples
• 2 Varieties :
• Cox, Jonagold
• 2 Faces :
• Red, Green
• 3 Maturity levels :
• fresh, ripe, over-ripe
• 8 different apples
Spectra
• 94 x 200 points
Tables
• 50 blocks of 4 variables
• 6 Common Dimensions
NIR Spectra
Spectra
ComDimSaliences
Correlation between ComDim Scores
and "Face"
Correlation between PLS-ComDim Scores
and "Face"
1 2 3 4 5 6 7 8 9 10111213141516171819202122232425262728293031323334353637383940414243444546474849500
0.05
0.1
0.15
0.2
0.25
0.3
0.35
0.4
0.45
RM
SE
CV
Dotte d line is RMSECV (4 LV 's) fo r g loba l m ode l / Ita lic numbe rs a re optim a l LVs in inte rva l m ode l
Interval num ber
3 2 3 4 2 2 3 3 4 4 4 1 4 2 1 1 2 1 2 2 2 4 2 4 2 3 2 3 2 1 3 1 1 2 3 1 2 2 3 2 2 1 1 2 1 1 1 1 1 1
i-PLS between NIR and "Face"
10 2 0 30 4 0 50 60 7 0 80 9 0
-1. 5
-1
-0. 5
0
0. 5
1
1. 5
2
- Blocks = 50 - Mean centred- Max LVs = 4 - CV = Full
- Scores on LV1for Block 8
Correlation between ComDim Scores
and "Maturity"
Correlation between PLS-ComDim Scores
and "Maturity"
1 2 3 4 5 6 7 8 9 10111213141516171819202122232425262728293031323334353637383940414243444546474849500
0.05
0.1
0.15
0.2
0.25
0.3
0.35
0.4
0.45
RM
SE
CV
Dotte d line is RMSECV (4 LV 's) fo r g loba l m ode l / Ita lic numbe rs a re optim a l LVs in inte rva l m ode l
Interval num ber
1 1 3 2 3 2 2 3 3 3 3 3 3 2 3 1 1 3 2 2 3 4 1 1 2 2 3 1 1 3 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 1
i-PLS between NIR and "Maturity"
10 2 0 30 4 0 50 60 7 0 80 9 0
-1. 5
-1
-0. 5
0
0. 5
1
1. 5
- Blocks = 50 - Mean centred- Max LVs = 4 - CV = Full
- Scores on LV1for Block 10
Correlation between ComDim Scores
and "Variety"
Correlation between PLS-ComDim Scores
and "Variety"
1 2 3 4 5 6 7 8 9 10111213141516171819202122232425262728293031323334353637383940414243444546474849500
0.05
0.1
0.15
0.2
0.25
0.3
0.35
0.4
0.45
0.5
RM
SE
CV
Dotte d line is RMSECV (4 LV 's) fo r g loba l m ode l / Ita lic numbe rs a re optim a l LVs in inte rva l m ode l
Interval num ber
3 2 2 2 3 3 4 3 1 3 2 2 4 4 4 3 3 4 4 2 4 1 4 3 3 3 3 1 2 3 2 2 3 3 3 3 2 3 2 2 2 4 3 3 1 2 2 4 2 2
i-PLS between NIR and "Variety"
10 2 0 30 4 0 50 60 7 0 80 9 0-0. 4
-0. 3
-0. 2
-0. 1
0
0. 1
0. 2
0. 3
- Blocks = 50 - Mean centred- Max LVs = 4 - CV = Full
- Scores on LV1for Block 33
Genomics data
0 . 5 1 1 . 5 2 2 . 5 3 3 . 5 4 4 . 5
x 1 04
Chromosome 1
1 0 0 0 2 0 0 0 3 0 0 0 4 0 0 0 5 0 0 0 6 0 0 0 7 0 0 0 8 0 0 0 9 0 0 0
Chromosome 22
………
Genomics data
Samples
• 940 healthy individuals :
• 7 regions :
• 53 ethnic groups :
Variables
• 644,138 Single Nucleotide Polymorphisms (SNPs)
/ 22 Chromosomes
Pretreatment
• PCT on each chromosome = 22x(940*940)� 22x(940*100)
Analysis
• ComDim on the set of 22 PCT blocks
[5] H. M. Cann, C. de Toma, L. Cazes, M-F. Legrand, V. Morel, L. Piouffre, J. Bodmer, et al. (2002),A human genome diversity cell line panel. Science 296, 261–262
[6] E. Génin - Inserm UMR 946 - Univ Paris Diderot
X
PCA
X1
T1
P1
X2
T2
P2
Xq
Tq
Pq
T1 T2 ... T1
PCA/PLS...
TPCTPT
PCT
PX1 PX2 PXq...
Segmented PCT on Obese data sets
TX TPCT=
(PTX1=inv(TPCT
T*TPCT)*TPCTT*X1)
Combined PCs
Component t1=Xu1
Space X3
Space X1
SpaceX 2
SegPCT Component tPCT
Component t2=X2u2
Component t3=X3u3
PCA and SegPCT-PCA (SW=2048, 10 PCT-PCs) memory allocation profiles
SegPCT-PCA21 Mbytes1207 s (~20 min.)
PCA
230 Mbytes14819 s (~250 min.)
With PCT : faster and requires less memory
SegPCT-PCA versus PCA
-0.002
0
0.002
0.004
0.006
0.008
0.01
0.012
0.014
0.016
0.018
500106016202180274033003860442049805540
PCA & SegPCT-PCA Loadings
With PCT : identical results
Multivariate Analysis of all SNPs
[7] J. Li et al.Worldwide Human Relationships Inferred from Genome-Wide Patterns of Variation,Science, 319(5866), 1100 - 1104
PCT-ComDim
Saliences of chromosome blocks
PCT-ComDim
CC1 Scores
Africa
Mid-EastEuropeCS_Asia
AmericaE_Asia
Oceania
PCT-ComDim
CC2 Scores
Africa
Mid-East
Europe
CS_Asia
AmericaE_Asia
Oceania
PCT-ComDim
CC3 Scores
America
E_Asia Oceania
PCT-ComDim
CC4 Scores
Oceania
PCT-ComDim
CC5 Scores
Bantu_S
BiakaPygmies
Mandenka
MbutiPygmies San
Yoruba
Bantu_N
Correlation between ComDim Scores
and European populations
PCT-ComDim
Saliences of chromosome blocks
PCT-ComDim
French
Tuscan &Italian
Sardinian
Orcadian
Adygei
FrenchBasque
Russian
East
West
South
North
PCT-ComDim PCA
French
Tuscan & Italian
Sardinian
Orcadian
Adygei
FrenchBasque
Russian
[7] J. Li et al.Worldwide Human Relationships Inferred from Genome-Wide Patterns of Variation,Science, 319(5866), 1100 - 1104
Good & Bad news
With segmented PCT :
• analyse data sets of any width
• quickly
• using less memory
With multi-block analysis :
• detect groups of interesting variables ("salience")
• visualise relations among samples ("scores")
39
First African-European Conference on
Chemometrics
Mining School of Rabat
Morocco, 20th to 24th of September 2010
www.afrodata.org