
GEPSI – PSE Group Encontros com a Ciência

Chemical Process Engineering and Forest Products

Research Center – CIEPQPF

Department of Chemical Engineering

University of Coimbra

CIEPQPF/DEQ-FCTUC

Semi-empirical model building: should we go univariate, multivariate or ... intermediate?

Marco S. Reis

04 June, 2014

UC 2014


Outline

1. Motivation

2. Methods: The Network Induced Supervised Learning framework (NI-SL)

i. Network Induced Classification (NIC)

ii. Network Induced Regression (NIR)

3. Results

i. Classification

ii. Regression

4. Conclusions


1. Motivation

• Current regression and classification approaches are strongly

focused on optimizing prediction ability

Prediction

Interpretation

PCA - total variation

OLS - quality of fit (TSS, R2)

PLS - prediction ability (RMSEP)

LDA, LQA - classification rate


1. Motivation

• Interpretation is the main target in several problems

– Process improvement activities

– New complex reaction mechanisms

• The primary interest lies in how variables/reactants interact, in

order to design a better system or to plan the next round of

experimental identification trials.

– Analysis of natural systems (metabolic, gene regulation networks)

• Extracting the connectivity and causal structure of the system, rather than

predicting accurately the amounts of metabolites/proteins produced.


1. Motivation

• On the other hand…

– Systems are not ensembles of isolated quantities…

– … nor are they the result of the cooperative action of all their

constituent elements.


1. Motivation

• But classification/regression methods are either …

– One variable-at-a-time …

• Variable selection (FA, BR, FS, GA, …)

– … multivariate methods.

• PLS, PCR, CCR, FDA, RR …


1. Motivation

• Flowsheet of a distillation unit.

Network representation of the distillation system in an oil refinery.


1. Motivation

Network representation of the flowsheet of a distillation and catalytic “cracking” unit in an oil refinery.


1. Motivation

• More examples:

Simplified metabolic networks of two microorganisms, covering 89 metabolites (a: E. coli; b: Buchnera aphidicola).


1. Motivation

The “proteome” of S. cerevisiae (map of protein-protein interactions)


1. Motivation

• The nature of systems is:

– Modular, hierarchical, specialized …

• Methods should adapt their model structures to this reality

(and not the contrary … “squeezing” reality into our models)

Source: BMC Systems Biology 2011, 5(Suppl 3):S7

doi:10.1186/1752-0509-5-S3-S7


2. Methods

• We propose a platform for incorporating the modular

structure of variables in regression and classification problems

• Both problems share the same interpretation-oriented

backbone: NETWORK INDUCED CLUSTERING

• The clusters CONSTRAIN the final structure of the models:

the models are composed of VARIATES from the RELEVANT

modules


2. Methods

The Integrated Network Induced Supervised Learning Framework

I. (Network Induced) Feature extraction

– Training set → NI-Clustering Algorithm → Groups of variables (direct interactions)

II. Modeling

– NI-C: Select groups for classification and construct a classifier

– NI-R: Select groups for regression and estimate a model


2. Methods

• STAGE I: NI-Clustering

1. Compute partial-correlation coefficients (1st, 2nd and full order)

instead of marginal correlations

2. Threshold partial-correlation coefficients using a criterion for

statistical significance

3. Obtain the resulting adjacency matrix (Adj)
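Steps 2 and 3 can be sketched as follows. The slides do not say which significance criterion is used, so this minimal sketch assumes Fisher's z-transform test; the function name `adjacency_from_partial_corr`, the sample size `n_samples=50` and the default `alpha=0.05` are hypothetical choices made for illustration only.

```python
import math
from statistics import NormalDist


def adjacency_from_partial_corr(pcorr, n_samples, order, alpha=0.05):
    """Threshold a partial-correlation matrix into a 0/1 adjacency matrix.

    Assumption (not stated in the slides): significance is judged with
    Fisher's z-transform, where sqrt(n - order - 3) * atanh(r) is roughly
    standard normal when the true partial correlation is zero.
    """
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    scale = math.sqrt(n_samples - order - 3)
    m = len(pcorr)
    adj = [[0] * m for _ in range(m)]
    for i in range(m):
        for j in range(i + 1, m):  # off-diagonal entries only
            if scale * abs(math.atanh(pcorr[i][j])) > z_crit:
                adj[i][j] = adj[j][i] = 1
    return adj


# First-order partial correlations of the three-variable example (order: A, B, Z).
pcorr = [
    [1.0, 0.0, 0.7295],
    [0.0, 1.0, 0.4104],
    [0.7295, 0.4104, 1.0],
]
adj = adjacency_from_partial_corr(pcorr, n_samples=50, order=1)  # n_samples is hypothetical
print(adj)
```

With these illustrative settings, A and B each keep an edge to Z, while the A-B edge is dropped because their partial correlation given Z is zero.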

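Example with three variables Z, A and B:

$r_{AZ} = 0.8$, $r_{BZ} = 0.6$, $r_{AB} = 0.48$

$r_{AZ \cdot B} = 0.7295$, $r_{BZ \cdot A} = 0.4104$, $r_{AB \cdot Z} = 0$

Corr(A,B) is high (↑↑), but R(A,B|Z) is low (↓)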
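The numbers in this example can be checked with the standard first-order partial correlation formula, r_xy·z = (r_xy − r_xz r_yz) / sqrt((1 − r_xz²)(1 − r_yz²)). The sketch below is only a verification aid; the function name is hypothetical.

```python
import math


def partial_corr_first_order(r_xy, r_xz, r_yz):
    """Partial correlation of x and y after controlling for a single variable z."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz**2) * (1 - r_yz**2))


# Marginal correlations from the slide.
r_AZ, r_BZ, r_AB = 0.8, 0.6, 0.48

r_AZ_B = partial_corr_first_order(r_AZ, r_AB, r_BZ)  # A and Z, controlling for B
r_BZ_A = partial_corr_first_order(r_BZ, r_AB, r_AZ)  # B and Z, controlling for A
r_AB_Z = partial_corr_first_order(r_AB, r_AZ, r_BZ)  # A and B, controlling for Z

print(round(r_AZ_B, 4), round(r_BZ_A, 4), round(r_AB_Z, 4))
```

The marginal correlation between A and B is fully explained by Z (0.8 × 0.6 = 0.48), which is why it vanishes once Z is controlled for.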


2. Methods

• STAGE I: NI-Clustering (cont.)

4. Compute the Generalized Topological Overlap Measure (GTOM)

between variables

- Introduces robustness in the formation of modules: for each pair of

variables under analysis, GTOM also takes into account the number of

neighbouring variables that the two have in common.

GTOM is a generalization of the topological overlap measure (TOM):

$$TOM(i,j) = \frac{l(i,j) + Adj(i,j)}{\min(k_i, k_j) + 1 - Adj(i,j)}$$

$$l(i,j) = \sum_{u \neq i,j} Adj(i,u) \times Adj(u,j)$$

$$k_i = \sum_{u \neq i} Adj(i,u)$$
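A minimal sketch of the basic topological overlap computation on a 0/1 adjacency matrix, following the TOM, l(i,j) and k_i definitions on this slide. The function name and the convention of setting the diagonal to 1 are assumptions; the higher-order GTOM generalization is not covered here.

```python
def topological_overlap(adj):
    """Topological overlap between all pairs of nodes of a 0/1 adjacency matrix."""
    m = len(adj)
    # k_i: number of direct neighbours of node i.
    k = [sum(adj[i][u] for u in range(m) if u != i) for i in range(m)]
    tom = [[1.0] * m for _ in range(m)]  # diagonal set to 1 by convention (assumption)
    for i in range(m):
        for j in range(m):
            if i == j:
                continue
            # l(i, j): number of neighbours shared by i and j.
            shared = sum(adj[i][u] * adj[u][j] for u in range(m) if u not in (i, j))
            tom[i][j] = (shared + adj[i][j]) / (min(k[i], k[j]) + 1 - adj[i][j])
    return tom


# Illustrative three-node network (order: A, B, Z) where Z is linked to A and to B.
adj = [[0, 0, 1], [0, 0, 1], [1, 1, 0]]
tom = topological_overlap(adj)
print(tom)
```

In this toy network A and B are not directly connected, yet they receive a non-zero overlap because they share the neighbour Z.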


2. Methods

• STAGE I: NI-Clustering (cont.)

5. Using the GTOM similarity, compute the respective distance matrix:

$$d_l(i,j) = 1 - GTOM_l(i,j)$$

6. Apply a hierarchical clustering algorithm, with the linkage criterion set to the

unweighted average distance
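Steps 5 and 6 can be sketched as below: similarities are turned into distances with d = 1 − GTOM, and clusters are merged with unweighted average linkage. This is a simple pure-Python illustration rather than the author's implementation; the function name and the similarity values are hypothetical.

```python
def average_linkage(dist, n_clusters):
    """Agglomerative clustering with unweighted average linkage.

    Starts from singletons and repeatedly merges the two clusters whose
    average pairwise distance is smallest, until n_clusters remain.
    """
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                total = sum(dist[i][j] for i in clusters[a] for j in clusters[b])
                avg = total / (len(clusters[a]) * len(clusters[b]))
                if best is None or avg < best[0]:
                    best = (avg, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


# Hypothetical GTOM similarities for four variables (not taken from the slides).
gtom = [
    [1.0, 0.9, 0.1, 0.2],
    [0.9, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.8],
    [0.2, 0.1, 0.8, 1.0],
]
dist = [[1.0 - s for s in row] for row in gtom]  # d(i, j) = 1 - GTOM(i, j)
clusters = average_linkage(dist, n_clusters=2)
print(clusters)
```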


2. Methods

• Notes

– By analysing the dendrogram, the TOM plot and the silhouette values, S(i), one can propose a number of

clusters, or natural variable groups (NCLUST), that reflect the direct

associations between variables and therefore their potentially similar functional role.

$$S(i) = \frac{\min_k\left(AVER\_BETWEEN(i,k)\right) - AVER\_WITHIN(i)}{\max\left(AVER\_WITHIN(i),\ \min_k\left(AVER\_BETWEEN(i,k)\right)\right)}$$
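A minimal sketch of the silhouette value S(i) computed from a distance matrix and a cluster assignment, where AVER_WITHIN(i) is the average distance from i to the other members of its cluster and AVER_BETWEEN(i,k) the average distance from i to the members of another cluster k. The function name, the zero value assigned to singleton clusters, and the example distances are assumptions.

```python
def silhouette_values(dist, labels):
    """Silhouette value of every item, given pairwise distances and cluster labels."""
    n = len(labels)
    values = []
    for i in range(n):
        by_cluster = {}
        for j in range(n):
            if j != i:
                by_cluster.setdefault(labels[j], []).append(dist[i][j])
        own = by_cluster.pop(labels[i], None)
        if not own or not by_cluster:
            values.append(0.0)  # singleton or single-cluster case (assumed convention)
            continue
        within = sum(own) / len(own)
        between = min(sum(d) / len(d) for d in by_cluster.values())
        values.append((between - within) / max(within, between))
    return values


# Hypothetical distances between four variables, grouped into two clusters.
dist = [
    [0.0, 0.1, 0.9, 0.8],
    [0.1, 0.0, 0.8, 0.9],
    [0.9, 0.8, 0.0, 0.2],
    [0.8, 0.9, 0.2, 0.0],
]
labels = [0, 0, 1, 1]
s = silhouette_values(dist, labels)
print(s, sum(s) / len(s))
```

Values close to 1 indicate a variable that sits well inside its cluster; values near zero or negative suggest that it could equally belong elsewhere.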

[Figure: dendrogram (leaf order 6, 7, 1, 3, 2, 4, 10, 8, 5, 11, 9), TOM plot, and silhouette values by cluster (clusters 1 to 4). Average Silhouette = 0.26977]


• STAGE II: Modeling

Stage I (common to NIC and NIR)

– Training set → NI-Clustering Algorithm → Groups of variables (direct interactions)

Stage II, NIC - Network Induced Classification

1. Compute discriminant directions for each group of variables (clusters) → Putative discriminants for classification

2. Select discriminants for classification → Final set of discriminants

3. Estimate a classifier using selected discriminants and class information

4. Classification of samples

Stage II, NIR - Network Induced Regression

1. Compute predictive directions for each group of variables (clusters) → Putative scores for prediction

2. Select scores for predictive model → Final set of scores

3. Estimate model using selected scores and response data

4. Predict samples

2. Methods

• STAGE II: Modeling

– Selection of variates (linear discriminants, in the case of NIC)

[Figure: MC-CV mean global error rate versus number of scores in best combination (1 to 4). Kmax=3, NVMod=3]

Best Combinations of Scores:

Combination 1: CL2/DL1

Combination 2: CL2/DL1, CL2/DL2

Combination 3: CL1/DL1, CL2/DL1, CL3/DL1

Combination 4: CL1/DL1, CL1/DL2, CL2/DL1, CL2/DL2
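The selection of variates can be pictured as a search for the best combination of scores at each size. The sketch below assumes an exhaustive search and a user-supplied error estimate; the slides do not state the search strategy, and `best_combinations`, `toy_error` and the `usefulness` numbers are hypothetical. In practice the error function would be the MC-CV mean global error rate of a classifier built on the chosen scores.

```python
from itertools import combinations


def best_combinations(candidates, max_size, error_fn):
    """For each size 1..max_size, return the combination with the lowest error."""
    best = {}
    for size in range(1, max_size + 1):
        best[size] = min(combinations(candidates, size), key=error_fn)
    return best


# Candidate scores named as on the slide (cluster / discriminant direction).
candidates = ["CL1/DL1", "CL1/DL2", "CL2/DL1", "CL2/DL2", "CL3/DL1"]

# Toy stand-in for a cross-validated error rate (made-up numbers, not results).
usefulness = {"CL1/DL1": 0.05, "CL1/DL2": 0.01, "CL2/DL1": 0.30,
              "CL2/DL2": 0.10, "CL3/DL1": 0.02}


def toy_error(combo):
    # Error drops with useful scores and grows slightly with model size.
    return 0.5 - sum(usefulness[c] for c in combo) + 0.02 * len(combo)


best = best_combinations(candidates, max_size=3, error_fn=toy_error)
for size, combo in best.items():
    print(size, combo)
```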


2. Methods

• Main adjustable parameters (NIC)

– Number of clusters to retain (NCLUST);

– Maximum number of variates per cluster to use before selection (NVC);

– Maximum number of variates to use for classification (NVMod);

• Secondary adjustable parameters (usually kept constant)

– Order of partial correlations (1, 2 or m-2 = full order);

– Order of topological overlap (GTOM parameter; 0, 1, 2);

– Significance level in thresholding operations (ALPHA);

– Percentage of samples to set aside in internal MC-CV analysis (PER);

– Number of MC-CV trials (ITERMC);


3. Results – Classification: NIC

• Datasets

– WINE. X(178×13) dataset (178 samples and 13 descriptors). Variables are analytically measured wine constituents, and the class labels correspond to three different cultivars from the same region of Italy (available at http://archive.ics.uci.edu/ml/datasets/Wine).

– ROUGHNESS. X(36×11) dataset. Variables are geometrical features that summarize different aspects of an accurate profile taken from a sheet of paper at the roughness scale (a fine scale), using high-resolution mechanical stylus profilometry. Classes correspond to the evaluation, made by a panel of experts, of the quality of the paper sheets with respect to their sensory perception of surface roughness. In addition, one quantitative variable was collected for each sample: the measurement obtained with the Bendtsen tester.



• Performance metrics

– Re-substitution accuracy. Classification accuracy is computed on the same dataset used to “train” the methodology.

– Monte-Carlo cross-validation accuracy. Random train/test data splits are performed a number of times, say, k=1,…,N_CV_TRIALS, where the training sets are used to estimate the model parameters and the test sets to evaluate the model's performance (accuracy for trial k).

– Leave-one-out cross-validation accuracy. Similar to MC-CV, but now the splitting is deterministic. In each trial, exactly one sample is left out for testing the method, while the remaining ones are used to estimate it, and the process is repeated for all samples.
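The two cross-validation schemes differ only in how the train/test index sets are generated. A minimal sketch of both splitters follows; the function names, the 20% test fraction and the fixed seed are hypothetical (the test fraction and the number of trials play the roles of PER and ITERMC in the method's parameter list).

```python
import random


def mc_cv_splits(n_samples, test_fraction, n_trials, seed=0):
    """Monte-Carlo CV: yield (train, test) index lists from random splits."""
    rng = random.Random(seed)
    n_test = max(1, round(test_fraction * n_samples))
    for _ in range(n_trials):
        idx = list(range(n_samples))
        rng.shuffle(idx)
        yield sorted(idx[n_test:]), sorted(idx[:n_test])


def loo_splits(n_samples):
    """Leave-one-out CV: each sample is the test set exactly once."""
    for i in range(n_samples):
        yield [j for j in range(n_samples) if j != i], [i]


# 36 samples, as in the ROUGHNESS dataset; the other settings are illustrative.
mc_splits = list(mc_cv_splits(n_samples=36, test_fraction=0.2, n_trials=5))
loo = list(loo_splits(36))
print(len(mc_splits), len(mc_splits[0][1]), len(loo))
```

A per-trial accuracy would then be computed on each test set, and its mean and spread reported across trials.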


3. Results

• Benchmark method

– The same methods as in NI-SL, but with total freedom to use all the

variables available.

– NI-SL constrains the predictive space by allowing only certain linear

combinations of the variables, namely those with functional meaning.


• Results NIC

Global accuracy measures (%). In each cell: proposed method (NI-C) / benchmark.

Case study | Partial correlation order | NCLUST | NVC | NVMod | Re-subs.   | MC-CV                     | LOO-CV
Wine       | foPC                      | 2      | 2   | 3     | 98.3 / 100 | 98.2 (1.9) / 98.3 (1.5)   | 97.2 / 98.9
Wine       | 2oPC                      | 2      | 2   | 3     | 99.4 / 100 | 97.1 (2.3) / 98.3 (1.5)   | 97.8 / 98.9
Roughness  | foPC                      | 3      | 2   | 3     | 97.2 / 100 | 89.3 (13.8) / 82.9 (11.0) | 94.4 / 88.9
Roughness  | 2oPC                      | 4      | 2   | 3     | 97.2 / 100 | 85.0 (10.8) / 82.0 (11.0) | 77.8 / 88.9

Legend: foPC – full-order partial correlation; 2oPC – second-order partial correlation. NCLUST, NVC and NVMod are the method's adjustable parameters; other parameters are defined in the text.


3. Results – Regression: NIR

• Performance metrics

– Re-substitution accuracy.

$$RMSE_C = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n}}$$

$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} = \frac{\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$

– Monte-Carlo cross-validation accuracy.

$$RMSE_{CV}(k) = \sqrt{\frac{\sum_{i=1}^{n_{out}} (y_i - \hat{y}_i)^2}{n_{out}}}$$

$$R^2_{CV}(k) = 1 - \frac{\sum_{i=1}^{n_{out}} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n_{out}} (y_i - \bar{y}_{out})^2}$$

– Leave-one-out cross-validation accuracy.

$$RMSE_{LOO-CV}(k) = \sqrt{\frac{\sum_{i=1}^{n} (y_{(i)} - \hat{y}_{(i)})^2}{n}}$$

$$R^2_{LOO-CV}(k) = 1 - \frac{\sum_{i=1}^{n} (y_{(i)} - \hat{y}_{(i)})^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$
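The RMSE and R² metrics reduce to a few lines of code. The sketch below implements the generic definitions (root of the mean squared residual, and one minus the ratio of residual to total sum of squares); applied to the training set it gives the re-substitution values, and applied to left-out samples the cross-validation ones. Function names and the toy data are hypothetical.

```python
import math


def rmse(y, y_hat):
    """Root mean squared error between observed and predicted responses."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y))


def r_squared(y, y_hat):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    y_mean = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    ss_tot = sum((a - y_mean) ** 2 for a in y)
    return 1 - ss_res / ss_tot


# Toy observed and predicted values (not data from the study).
y = [1.0, 2.0, 3.0, 4.0]
y_hat = [1.5, 1.5, 3.5, 3.5]
print(rmse(y, y_hat), r_squared(y, y_hat))
```

Note that on left-out data R² can become negative when the residual sum of squares exceeds the total sum of squares.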


• Results NIR

RMSE / R2 for the proposed method (NI-R) and the benchmarks (OLS and PLS).

Case study: Roughness. Partial correlation order: 2oPC. Method's adjustable parameters: NCLUST = 4, NVC = 2, NVMod = 4.

Method | Re-subs.     | MC-CV                           | LOO-CV
NI-R   | 31.32 / 0.95 | 60.81 (37.51) / 0.6240 (0.6158) | 49.74 / 0.8739
OLS    | 28.45 / 0.96 | 46.56 (20.44) / 0.7775 (0.1814) | 52.36 / 0.8602
PLS    | 34.25 / 0.94 | 48.22 (23.74) / 0.7569 (0.2515) | 55.83 / 0.8412


[Figure: VIP (NIR) versus variable index (1 to 11); variable labels: Ra, Rz, Rq, Rp, Rt, RSm, RS, RSk, RKu, Rv, Rdq]

Roughness parameters | Description
Ra                   | Arithmetical mean deviation of profile
Rz                   | Maximum height of profile
Rq                   | RMS deviation of profile
Rp                   | Maximum profile peak height
Rt                   | Total height of profile
RSm                  | Mean width of profile elements
RS                   | Mean distance between local peaks
RSk                  | Skewness of profile
RKu                  | Kurtosis of profile
Rv                   | Maximum profile valley depth
Rdq                  | RMS slope of profile

$$VIP_{NI-R}(k) = \sum_{j \in \Omega_k} \beta_j^2 \times w_{k,j}^2$$

$\Omega_k$ ≡ set of variates containing variable k

$w_{k,j}$ ≡ kth entry of the PLS weighting vector used to compute the jth variate
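The VIP expression can be sketched in a few lines. The slide does not define β_j, so this example assumes it is the regression coefficient of the jth variate in the final model. The function name, the variate names and all numerical values are hypothetical.

```python
def vip_ni_r(variable, beta, weights):
    """Sum of beta_j^2 * w_{k,j}^2 over the variates j that contain the variable.

    beta:    dict mapping variate name -> coefficient (assumed meaning of beta_j)
    weights: dict mapping variate name -> {variable name: PLS weight}
    """
    return sum(
        beta[j] ** 2 * w[variable] ** 2
        for j, w in weights.items()
        if variable in w  # j belongs to Omega_k only if it contains the variable
    )


# Hypothetical variates built from two clusters of roughness parameters.
weights = {
    "CL1/V1": {"Ra": 0.6, "Rq": 0.8},
    "CL2/V1": {"Rv": 1.0},
}
beta = {"CL1/V1": 2.0, "CL2/V1": 0.5}

vip = {name: vip_ni_r(name, beta, weights) for name in ("Ra", "Rq", "Rv")}
print(vip)
```

A variable that enters no selected variate gets a VIP of zero, which is how the cluster constraint shows up in the importance profile.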



4. Conclusions

• NI-SL was developed to help elucidate the natural clusters of variables present in a dataset that are associated with some system function or operation.

• NI-C and NI-R enable the intersection of the two sources of knowledge arising from data: variable structure and observation structure.

• The results presented illustrate that such knowledge can indeed be extracted without compromising classification and prediction accuracy.

• The proposed methodology finds interesting applications in the analysis of datasets from a variety of sources, such as those arising from biosystems, where one is concerned with explaining both the variable structure and the natural groups of specimens, and industrial systems, where data are increasingly complex and models that are both parsimonious and informative are required to support their analysis.

• Future work will address applications in other scenarios and try to clarify the role of the partial correlation order within the proposed methodology.


Thank you!

Acknowledgments: The author acknowledges financial support through project

PTDC/EQU-ESI/108597/2008

co-financed by the Portuguese FCT and European Union’s FEDER through “Eixo I do Programa Operacional Factores de Competitividade (POFC)” of QREN (with ref. FCOMP-01-0124-FEDER-010398).
