TRANSCRIPT
GEPSI – PSE Group Encontros com a Ciência
Chemical Process Engineering and Forest Products
Research Center – CIEPQPF
Department of Chemical Engineering
University of Coimbra
CIEPQPF/DEQ-FCTUC
Semi-empirical model building: should we go univariate, multivariate or ... intermediate?
Marco S. Reis
04 June, 2014
UC 2014
Outline
1. Motivation
2. Methods: The Network Induced Supervised Learning framework (NI-SL)
i. Network Induced Classification (NIC)
ii. Network Induced Regression (NIR)
3. Results
i. Classification
ii. Regression
4. Conclusions
1. Motivation
• Current regression and classification approaches are strongly
focused on optimizing prediction ability
Prediction
Interpretation
PCA - total variation
OLS - quality of fit (TSS, R2)
PLS - prediction ability (RMSEP)
LDA, LQA - classification rate
1. Motivation
• Interpretation is the main target in several problems
– Process improvement activities
– New complex reaction mechanisms
• Primarily interested in how variables/reactants interact, in
order to design a better system or to conduct the next round of
experimental identification trials.
– Analysis of natural systems (metabolic, gene regulation networks)
• Extracting the connectivity and causal structure of the system, rather than
predicting accurately the amounts of metabolites/proteins produced.
1. Motivation
• On the other hand…
– Systems are not ensembles of isolated quantities…
– … nor are they the result of the cooperative action of all their
constituent elements.
1. Motivation
• But classification/regression methods are either …
– One variable-at-a-time …
• Variable selection (FA, BR, FS, GA, …)
– … multivariate methods.
• PLS, PCR, CCR, FDA, RR …
• Flowsheet of a distillation unit.
Network representation of the distillation system in an oil
refinery.
1. Motivation
Network representation of a flowsheet of a distillation and
catalytic “cracking” unit in an oil refinery.
1. Motivation
• More examples:
Simplified metabolic networks of two microorganisms, covering 89 metabolites (a: E.
coli; b: Buchnera aphidicola).
1. Motivation
1. Motivation
The “proteome” of S. cerevisiae (map of protein-protein interactions)
1. Motivation
• The nature of systems is:
– Modular, hierarchical, specialized …
• Methods should adapt their model structures to this reality
(and not the contrary … “squeezing” reality into our models)
Source: BMC Systems Biology 2011, 5(Suppl 3):S7
doi:10.1186/1752-0509-5-S3-S7
2. Methods
• We propose a platform for incorporating the modular
structure of variables in regression and classification problems
• Both problems share the same interpretation-oriented
backbone: NETWORK INDUCED CLUSTERING
• The clusters CONSTRAIN the final structure of the models
The models are composed of VARIATES from the RELEVANT
modules
2. Methods
The Integrated Network Induced Supervised Learning Framework
I. (Network Induced) Feature extraction
Training set → NI-Clustering Algorithm → Groups of variables (direct interactions)
II. Modeling
NI-C: Select groups for classification and construct a classifier
NI-R: Select groups for regression and estimate a model
2. Methods
• STAGE I: NI-Clustering
1. Compute partial-correlation coefficients (1st, 2nd and full order)
instead of marginal correlations
2. Threshold partial-correlation coefficients using a criterion for
statistical significance
3. Obtain the resulting adjacency matrix (Adj)
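[Figure: three-variable example with nodes Z, A and B]
$r_{AZ} = 0.8, \quad r_{BZ} = 0.6, \quad r_{AB} = 0.48$
$r_{AZ \cdot B} = 0.7295, \quad r_{BZ \cdot A} = 0.4104, \quad r_{AB \cdot Z} = 0$
Corr(A,B) ↑↑ but R(A,B|Z) is ↓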
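The three-variable example can be checked numerically. The sketch below uses the standard first-order partial correlation formula; the function name `partial_corr` is illustrative and not taken from the talk. With the marginal correlations of the slide it gives approximately 0.7295, 0.4104 and 0.

```python
import math

def partial_corr(r_xy, r_xz, r_yz):
    """First-order partial correlation between x and y, controlling for z."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# Marginal correlations from the slide
r_az, r_bz, r_ab = 0.8, 0.6, 0.48

r_az_b = partial_corr(r_az, r_ab, r_bz)  # A and Z, controlling for B
r_bz_a = partial_corr(r_bz, r_ab, r_az)  # B and Z, controlling for A
r_ab_z = partial_corr(r_ab, r_az, r_bz)  # A and B, controlling for Z

print(round(r_az_b, 4), round(r_bz_a, 4), round(abs(r_ab_z), 4))
```

The marginal correlation between A and B (0.48) vanishes once Z is controlled for, which is why the thresholding step works on partial rather than marginal correlations.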
2. Methods
• STAGE I: NI-Clustering (cont.)
4. Compute the Generalized Topological Overlap Measure (GTOM)
between variables
- Introduces robustness in the formation of modules: for each pair of
variables under analysis, GTOM also takes into account the number of
neighbours the two variables have in common.
Generalization of:
$TOM(i,j) = \dfrac{l(i,j) + Adj(i,j)}{\min(k_i, k_j) + 1 - Adj(i,j)}$
$l(i,j) = \sum_{u \neq i,j} Adj(i,u) \times Adj(u,j)$
$k_i = \sum_{u \neq i} Adj(i,u)$
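A minimal sketch of the base TOM formula (not its higher-order generalization), assuming a symmetric 0/1 adjacency matrix stored as a list of lists. The example network and the choice of setting the diagonal to 1 are assumptions made for illustration only.

```python
def tom_matrix(adj):
    """Topological overlap for a symmetric 0/1 adjacency matrix."""
    n = len(adj)
    # k_i: number of neighbours of variable i
    k = [sum(adj[i][u] for u in range(n) if u != i) for i in range(n)]
    tom = [[1.0] * n for _ in range(n)]  # diagonal set to 1 (assumption)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            # l(i,j): number of neighbours shared by i and j
            shared = sum(adj[i][u] * adj[u][j] for u in range(n) if u not in (i, j))
            tom[i][j] = (shared + adj[i][j]) / (min(k[i], k[j]) + 1 - adj[i][j])
    return tom

# Hypothetical 4-variable network with edges 0-1, 0-2, 1-2 and 2-3
adj = [
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
]
tom = tom_matrix(adj)
for row in tom:
    print(row)
```

Variables 0 and 3 are not directly linked, yet their overlap is non-zero because they share neighbour 2; this is the robustness effect mentioned above.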
2. Methods
• STAGE I: NI-Clustering (cont.)
5. Using the GTOM similarity, compute the respective distance matrix
$d_l(i,j) = 1 - GTOM_l(i,j)$
6. Hierarchical clustering algorithm, with linkage criterion set to the
unweighted average distance
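Steps 5 and 6 can be sketched in plain Python as follows. The similarity values are hypothetical, and the naive agglomerative loop is only one straightforward way to implement unweighted average linkage; it is not the implementation used in the talk.

```python
def average_linkage(dist, n_clusters):
    """Agglomerative clustering with unweighted average linkage."""
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Average of all pairwise distances between the two clusters
                d = sum(dist[i][j] for i in clusters[a] for j in clusters[b])
                d /= len(clusters[a]) * len(clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # merge the closest pair
        del clusters[b]
    return [sorted(c) for c in clusters]

# Hypothetical GTOM similarities between 5 variables
sim = [
    [1.0, 0.9, 0.8, 0.1, 0.2],
    [0.9, 1.0, 0.7, 0.2, 0.1],
    [0.8, 0.7, 1.0, 0.1, 0.1],
    [0.1, 0.2, 0.1, 1.0, 0.9],
    [0.2, 0.1, 0.1, 0.9, 1.0],
]
dist = [[1.0 - s for s in row] for row in sim]  # d(i,j) = 1 - GTOM(i,j)
groups = average_linkage(dist, n_clusters=2)
print(groups)
```

Here `n_clusters` plays the role of NCLUST; in practice it would be chosen after inspecting the dendrogram rather than fixed in advance.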
• Notes
– By analysing the dendrogram, the TOM plot and the silhouette values, S(i), one can
propose a number of clusters (NCLUST), i.e. natural groups of variables that reflect the
variables' direct associations and therefore their potentially similar functional
role.
$S(i) = \dfrac{\min\left(AVER\_BETWEEN(i,k)\right) - AVER\_WITHIN(i)}{\max\left(AVER\_WITHIN(i),\ \min\left(AVER\_BETWEEN(i,k)\right)\right)}$
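The silhouette value can be computed directly from a distance matrix and a proposed partition. The sketch below assumes at least two clusters and assigns S(i) = 0 to singleton clusters, a common convention that the slide does not state; the distances are hypothetical.

```python
def silhouette(dist, clusters):
    """Silhouette value S(i) of each item for a given partition."""
    s = {}
    for c_idx, members in enumerate(clusters):
        for i in members:
            others = [j for j in members if j != i]
            if not others:
                s[i] = 0.0  # singleton cluster (assumed convention)
                continue
            # AVER_WITHIN(i): mean distance to the other members of i's cluster
            within = sum(dist[i][j] for j in others) / len(others)
            # min over k of AVER_BETWEEN(i,k): closest other cluster on average
            between = min(
                sum(dist[i][j] for j in other) / len(other)
                for o_idx, other in enumerate(clusters)
                if o_idx != c_idx
            )
            s[i] = (between - within) / max(within, between)
    return s

# Hypothetical distances between 4 variables, split into 2 clusters
dist = [
    [0.0, 0.1, 0.9, 0.8],
    [0.1, 0.0, 0.8, 0.9],
    [0.9, 0.8, 0.0, 0.2],
    [0.8, 0.9, 0.2, 0.0],
]
s = silhouette(dist, [[0, 1], [2, 3]])
average_silhouette = sum(s.values()) / len(s)
print(s, average_silhouette)
```

Values close to 1 indicate a variable that sits well inside its cluster; values near 0 or negative suggest a different NCLUST may be worth trying.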
2. Methods
[Figure: dendrogram, TOM plot and silhouette plot (silhouette values by cluster, clusters 1 to 4) for an example with 11 variables. Average Silhouette = 0.26977]
• STAGE II: Modeling
Stage I
Training set → NI-Clustering Algorithm → Groups of variables (direct interactions)
Stage II
NIC - Network Induced Classification
1. Compute discriminant directions for each group of variables (clusters) → Putative discriminants for classification
2. Select discriminants for classification → Final set of discriminants
3. Estimate a classifier using selected discriminants and class information → Classification of samples
NIR - Network Induced Regression
1. Compute predictive directions for each group of variables (clusters) → Putative scores for prediction
2. Select scores for predictive model → Final set of scores
3. Estimate model using selected scores and response data → Predict samples
• STAGE II: Modeling
– Selection of variates (linear discriminants, in the case of NIC)
[Figure: MC-CV mean global error rate versus number of scores in best combination (1 to 4); Kmax=3, NVMod=3]
Best Combinations of Scores:
Combination 1: CL2/DL1
Combination 2: CL2/DL1, CL2/DL2
Combination 3: CL1/DL1, CL2/DL1, CL3/DL1
Combination 4: CL1/DL1, CL1/DL2, CL2/DL1, CL2/DL2
2. Methods
• Main adjustable parameters (NIC)
– Number of clusters to retain (NCLUST);
– Maximum number of variates per cluster to use before selection (NVC);
– Maximum number of variates to use for classification (NVMod).
• Secondary adjustable parameters (usually kept constant)
– Order of partial correlations (1, 2 or m-2 = full order);
– Order of topological overlap (GTOM parameter; 0, 1, 2);
– Significance level in thresholding operations (ALPHA);
– Percentage of samples to set aside in internal MC-CV analysis (PER);
– Number of MC-CV trials (ITERMC).
2. Methods
3. Results – Classification: NIC
• Datasets
– WINE. X(178×13) dataset (178 samples and 13 descriptors). The variables are analytically measured wine constituents, and the class labels correspond to three different cultivars from the same region of Italy (available at http://archive.ics.uci.edu/ml/datasets/Wine).
– ROUGHNESS. X(36×11) dataset. The variables are geometrical features that summarize different aspects of an accurate profile taken from a sheet of paper at the roughness scale (a fine scale), using high-resolution mechanical stylus profilometry. The classes correspond to the evaluation, made by a panel of experts, of the quality of the paper sheets with respect to their sensory perception of surface roughness. In addition, a quantitative variable was collected for each sample: measurements obtained with the Bendtsen tester.
3. Results – Classification: NIC
• Performance metrics
– Re-substitution accuracy. Classification accuracy is computed with the same dataset used to “train” the methodology.
– Monte-Carlo cross-validation accuracy. Random train/test data splits are performed a number of times, say k=1,…,N_CV_TRIALS; the training sets are used to estimate the model parameters and the test sets to evaluate the model's performance (accuracy for trial k).
– Leave-one-out cross-validation accuracy. Similar to MC-CV, but now the splitting is deterministic. In each trial, exactly one sample is left out for testing the method while the remaining ones are used to estimate it, and the process is repeated for all samples.
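The two resampling schemes differ only in how the train/test indices are generated. The sketch below is a minimal illustration: the function names, the fixed seed, and the treatment of the set-aside percentage as a fraction are assumptions, not details given in the talk.

```python
import random

def mc_cv_splits(n_samples, per, n_trials, seed=0):
    """Monte-Carlo CV: random train/test index splits, repeated n_trials times."""
    rng = random.Random(seed)
    n_test = max(1, round(per * n_samples))  # per: fraction of samples set aside
    for _ in range(n_trials):
        idx = list(range(n_samples))
        rng.shuffle(idx)
        yield sorted(idx[n_test:]), sorted(idx[:n_test])

def loo_splits(n_samples):
    """Leave-one-out CV: deterministic, each sample is the test set exactly once."""
    for i in range(n_samples):
        yield [j for j in range(n_samples) if j != i], [i]

# 36 samples, as in the ROUGHNESS dataset; 20% set aside and 5 trials are hypothetical
mc = list(mc_cv_splits(n_samples=36, per=0.2, n_trials=5))
loo = list(loo_splits(36))
print(len(mc), len(mc[0][0]), len(mc[0][1]))
print(len(loo), len(loo[0][0]), len(loo[0][1]))
```

MC-CV yields one accuracy per trial, so a mean and a spread over trials can be reported; LOO-CV yields a single pooled accuracy over all samples.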
3. Results
• Benchmark method
– The same methods as in NI-SL but with total freedom to use all the
variables available.
– NI-SL constrains the predictive space by allowing only some linear
combinations of the variables, those with functional meaning.
3. Results – Classification: NIC
• Results NIC
Global accuracy measures (%). 1st line: Proposed method (NI-C); 2nd line: Benchmark.
NCLUST, NVC and NVMod are the method's adjustable parameters.

Case study   Partial correlation order   NCLUST   NVC   NVMod   Method      Re-subs.   MC-CV          LOO-CV
Wine         foPC                        2        2     3       NI-C        98.3       98.2 (1.9)     97.2
                                                                Benchmark   100        98.3 (1.5)     98.9
Wine         2oPC                        2        2     3       NI-C        99.4       97.1 (2.3)     97.8
                                                                Benchmark   100        98.3 (1.5)     98.9
Roughness    foPC                        3        2     3       NI-C        97.2       89.3 (13.8)    94.4
                                                                Benchmark   100        82.9 (11.0)    88.9
Roughness    2oPC                        4        2     3       NI-C        97.2       85.0 (10.8)    77.8
                                                                Benchmark   100        82.0 (11.0)    88.9

Legend: foPC – full-order partial correlation; 2oPC – second-order partial correlation. Other
parameters are defined in the text.
3. Results – Regression: NIR
• Performance metrics
– Re-substitution accuracy.
$RMSE_C = \sqrt{\dfrac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n}}$
$R^2 = 1 - \dfrac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} = \dfrac{\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$
– Monte-Carlo cross-validation accuracy.
$RMSE_{CV}(k) = \sqrt{\dfrac{\sum_{i=1}^{n_{out}} (y_i - \hat{y}_i)^2}{n_{out}}}$
$R^2_{CV}(k) = 1 - \dfrac{\sum_{i=1}^{n_{out}} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n_{out}} (y_i - \bar{y}_{out})^2}$
– Leave-one-out cross-validation accuracy.
$RMSE_{LOO\text{-}CV}(k) = \sqrt{\dfrac{\sum_{i=1}^{n} (y_{(i)} - \hat{y}_{(i)})^2}{n}}$
$R^2_{LOO\text{-}CV}(k) = 1 - \dfrac{\sum_{i=1}^{n} (y_{(i)} - \hat{y}_{(i)})^2}{\sum_{i=1}^{n} (y_{(i)} - \bar{y})^2}$
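The RMSE and R2 metrics share the same residual sum of squares, as the short sketch below makes explicit. The observed and predicted values are made up for illustration; the same two functions apply to re-substitution, MC-CV and LOO-CV once the appropriate predictions are supplied.

```python
import math

def rmse(y, y_hat):
    """Root mean squared error between observed and predicted values."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y))

def r2(y, y_hat):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_bar = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    ss_tot = sum((a - y_bar) ** 2 for a in y)
    return 1 - ss_res / ss_tot

# Hypothetical observed and predicted responses
y = [1.0, 2.0, 3.0, 4.0]
y_hat = [1.5, 1.5, 3.5, 3.5]

print(rmse(y, y_hat), r2(y, y_hat))
```

Note that under cross-validation R2 can become negative when predictions are worse than the mean of the test responses, which helps explain large spreads over MC-CV trials on small datasets.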
• Results NIR
3. Results – Regression: NIR
RMSE / R2. 1st line: Proposed method (NI-R); 2nd line: Benchmark (OLS); 3rd line: Benchmark (PLS).
NCLUST, NVC and NVMod are the method's adjustable parameters.

Case study   Partial correlation order*   NCLUST   NVC   NVMod   Method   Re-subs.       MC-CV                              LOO-CV
Roughness    2oPC                         4        2     4       NI-R     31.32 / 0.95   60.81 (37.51) / 0.6240 (0.6158)    49.74 / 0.8739
                                                                 OLS      28.45 / 0.96   46.56 (20.44) / 0.7775 (0.1814)    52.36 / 0.8602
                                                                 PLS      34.25 / 0.94   48.22 (23.74) / 0.7569 (0.2515)    55.83 / 0.8412
[Figure: VIP (NIR) versus variable index (1 to 11); variables labelled Ra, Rz, Rq, Rp, Rt, RSm, RS, RSk, RKu, Rv, Rdq]

Roughness parameters   Description
Ra                     Arithmetical mean deviation of profile
Rz                     Maximum height of profile
Rq                     RMS deviation of profile
Rp                     Maximum profile peak height
Rt                     Total height of profile
RSm                    Mean width of profile elements
RS                     Mean distance between local peaks
RSk                    Skewness of profile
RKu                    Kurtosis of profile
Rv                     Maximum profile valley depth
Rdq                    RMS slope of profile

$VIP_{NI\text{-}R}(k) = \sum_{j \in \Omega_k} \beta_j^2 \times w_{k,j}^2$
$\Omega_k \equiv$ set of variates containing variable k
$w_{k,j} \equiv$ kth entry of the PLS weighting vector used to compute the jth variate
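A minimal sketch of the VIP sum, assuming that β_j is the regression coefficient of the jth variate (the slide does not define it) and that each variate stores its weights in a dictionary keyed by variable index. All numbers are hypothetical.

```python
def vip_ni_r(beta, weights, n_vars):
    """VIP of each variable: sum over the variates containing it of beta_j^2 * w_kj^2.

    beta[j]    : coefficient of variate j (assumed meaning of beta_j)
    weights[j] : dict {variable index k: weight w_kj} for variate j
    """
    vip = [0.0] * n_vars
    for b_j, w_j in zip(beta, weights):
        # Only the variables in this variate's cluster contribute (j in Omega_k)
        for k, w_kj in w_j.items():
            vip[k] += (b_j ** 2) * (w_kj ** 2)
    return vip

# Two hypothetical variates over three variables; variable 1 belongs to both
beta = [2.0, 1.0]
weights = [{0: 0.6, 1: 0.8}, {1: 0.5, 2: 0.5}]

vip = vip_ni_r(beta, weights, n_vars=3)
print(vip)
```

Storing weights per variate keeps the cluster constraint explicit: a variable outside a variate's cluster simply has no entry and contributes nothing to that term.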
3. Results – Regression: NIR
4. Conclusions
• NI-SL was developed to help elucidate the natural clusters of variables present in a dataset, which are associated with some system function or operation.
• NI-C and NI-R enable the intersection of the two sources of knowledge arising from data: the structure of the variables and the structure of the observations.
• The results presented illustrate that such knowledge can indeed be extracted without compromising classification and prediction accuracy.
• The proposed methodology finds interesting applications in the analysis of datasets from a variety of sources, such as biosystems, where one is concerned with explaining both the variable structure and the natural groups of specimens, and industrial systems, where data are increasingly complex and models that are both parsimonious and informative are required to support their analysis.
• Future work will address applications in other scenarios and try to clarify the role of the partial correlation order within the proposed methodology.
Thank you!
Acknowledgments: The author acknowledges financial support through project
PTDC/EQU-ESI/108597/2008,
co-financed by the Portuguese FCT and European Union’s FEDER through “Eixo I do Programa Operacional Factores de Competitividade (POFC)” of QREN (with ref. FCOMP-01-0124-FEDER-010398).