TRANSCRIPT

Public examination of PhD thesis:
Mykola Pechenizkiy, "Feature Extraction for Supervised Learning in Knowledge Discovery Systems"

JYU, Agora Building, Auditorium 2
December 20, 2005, 12:00

Supervisors: Prof. Seppo Puuronen (JYU), Dr. Alexey Tsymbal (TCD), Prof. Tommi Kärkkäinen (JYU)
Reviewers: Prof. Ryszard Michalski (GMU), Prof. Peter Kokol (UM)
Opponent: Dr. Kari Torkkola (Motorola Labs)
Outline
– DM and KDD background
  – KDD as a process
  – DM strategy
– Classification
  – Curse of dimensionality and indirectly relevant features
  – Feature extraction (FE) as dimensionality reduction
– Feature Extraction for Classification
  – Conventional Principal Component Analysis
  – Class-conditional FE: parametric and non-parametric
– Research Questions
– Research Methods
– Contributions
Knowledge discovery as a process
Fayyad, U., Piatetsky-Shapiro, G., Smyth, P., Uthurusamy, R., Advances in Knowledge Discovery and Data Mining, AAAI/MIT Press, 1997.
The task of classification

Given J classes, n training observations and p features: a training set of n instances (xi, yi), where xi are the attribute values and yi is the class label.

Goal: given a new instance x0 to be classified, predict its class y0 (the class membership of the new instance).

Examples:
– diagnosis of thyroid diseases;
– heart attack prediction, etc.
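The classification task described above can be sketched as follows. This is a minimal 1-nearest-neighbour predictor on toy data, purely illustrative; the data, seed and function name are not from the thesis.

```python
# Sketch: predict class y0 for a new instance x0 from n training
# pairs (xi, yi) by 1-nearest-neighbour lookup. Toy data only.
import numpy as np

rng = np.random.default_rng(6)
X_train = np.vstack([rng.normal(0, 1, (30, 3)), rng.normal(3, 1, (30, 3))])
y_train = np.array([0] * 30 + [1] * 30)

def predict_1nn(x0, X, y):
    """Assign x0 the class of its nearest training instance."""
    dists = np.linalg.norm(X - x0, axis=1)
    return y[np.argmin(dists)]

x0 = np.array([3.0, 3.0, 3.0])   # new instance to be classified
print(predict_1nn(x0, X_train, y_train))
```

Any learner that maps a new x0 to a predicted y0 fits this template; the thesis studies how transforming the feature space affects such learners.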
Improvement of Representation Space

– Curse of dimensionality: a drastic increase in computational complexity and classification error with data having a large number of dimensions.
– Indirectly relevant features.
FE example: "Heart Disease"

Extracted features (loadings on the original features):
  0.1·Age − 0.6·Sex − 0.73·RestBP − 0.33·MaxHeartRate
  −0.01·Age + 0.78·Sex − 0.42·RestBP − 0.47·MaxHeartRate
  −0.7·Age + 0.1·Sex − 0.43·RestBP + 0.57·MaxHeartRate

Variance covered: 100% (original) → 87% (extracted); 3NN accuracy: 60% (original) → 67% (extracted).
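A transformation like the one above can be obtained with conventional PCA. The sketch below, on made-up data (the 4-feature toy set and seed are illustrative, not the Heart Disease data), shows how components and the variance they cover are computed.

```python
# Illustrative PCA sketch: extract components as eigenvectors of the
# covariance matrix and report the variance covered by the kept ones.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))            # 100 instances, 4 features
Xc = X - X.mean(axis=0)                  # centre the data first

cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]        # sort by decreasing variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

Z = Xc @ eigvecs[:, :3]                  # keep the 3 leading components
covered = eigvals[:3].sum() / eigvals.sum()
print(f"variance covered by 3 PCs: {covered:.0%}")
```

Each column of `eigvecs` holds the loadings of one extracted feature on the original features, exactly the kind of linear combination shown on the slide.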
How to construct a good representation space (RS) for SL?

[Figure: original features vs. extracted features; representations of instances of classes y1…yk; selecting the most relevant features vs. selecting the most representative instances]

Research questions:
RQ1 – How important is it to use class information in the FE process?
RQ2 – Is FE data-oriented, SL-oriented, or both?
RQ3 – Is FE for dynamic integration of base-level classifiers useful in a similar way as for a single base-level classifier?
RQ4 – Which features – original, extracted or both – are useful for SL?
RQ5 – How many extracted features are useful for SL?
RQ6 – How to cope with the presence of contextual features in data, and data heterogeneity?
RQ7 – What is the effect of sample reduction on the performance of FE for SL?
Research Problem
Studying both the theoretical background and the practical aspects of FE for SL in KDSs.

Main Contribution
A many-sided analysis of the research problem; an ensemble of relatively small contributions.

Research Method
A multimethodological approach to the construction of an artefact for DM (following Nunamaker et al., 1990–91), iterating over DM artifact development, experimentation, theory building and observation.
Further Research

[Figure: architecture of a KDD system – a KDD-Manager coordinating a meta-model (ES, KB), meta-data and meta-learning, data pre-processors, feature manipulators, instance manipulators, ML algorithms/classifiers, evaluators, post-processors/visualisers, a data generator, data sets, and a GUI]

Open questions:
– How to help in decision making on the selection of the appropriate DM strategy for the problem under consideration?
– When is FE useful for SL?
– What is the effect of FE on the interpretability of results and the transparency of SL?
Additional Slides …
Further Slides for Step-by-Step Analysis of Research Questions and Corresponding Contributions
Research Questions:
RQ1 – How important is it to use class information in the FE process?
RQ2 – Is FE a data- or hypothesis-driven constructive induction?
RQ3 – Is FE for dynamic integration of base-level classifiers useful in a similar way as for a single base-level classifier?
RQ4 – Which features – original, extracted or both – are useful for SL?
RQ5 – How many extracted features are useful for SL?
Research Questions (cont.):
RQ6 – How to cope with the presence of contextual features in data, and data heterogeneity?
RQ7 – What is the effect of sample reduction on the performance of FE for SL?
RQ8 – When is FE useful for SL?
RQ9 – What is the effect of FE on interpretability of results and transparency of SL?
RQ10 – How to make a decision about the selection of the appropriate DM strategy for the problem under consideration?
RQ1: Use of class information in FE
Tsymbal A., Puuronen S., Pechenizkiy M., Baumgarten M., Patterson D. 2002. Eigenvector-based Feature Extraction for Classification (Article I, FLAIRS’02)
Use of class information in the FE process is crucial for many datasets: class-conditional FE can result in better classification accuracy, while solely variance-based FE has no effect on the accuracy or deteriorates it.

[Figure: two scatter plots (a) and (b) of x1 vs. x2, showing the principal directions PC(1) and PC(2)]

There is no single superior technique, but nonparametric approaches are more stable across various dataset characteristics.
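The parametric class-conditional idea can be sketched in a few lines: instead of directions of maximum total variance (plain PCA), choose directions that maximise between-class scatter Sb relative to within-class scatter Sw, i.e. eigenvectors of inv(Sw)·Sb. Toy data and a minimal Fisher-style computation, not the thesis implementation:

```python
# Parametric class-conditional FE sketch: eigen-directions of
# inv(Sw) @ Sb separate the classes, unlike variance-only PCA.
import numpy as np

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 4)), rng.normal(2, 1, (50, 4))])
y = np.array([0] * 50 + [1] * 50)

mean_all = X.mean(axis=0)
Sw = np.zeros((4, 4))                    # within-class scatter
Sb = np.zeros((4, 4))                    # between-class scatter
for c in np.unique(y):
    Xc = X[y == c]
    mc = Xc.mean(axis=0)
    Sw += (Xc - mc).T @ (Xc - mc)
    d = (mc - mean_all).reshape(-1, 1)
    Sb += len(Xc) * (d @ d.T)

eigvals, eigvecs = np.linalg.eig(np.linalg.inv(Sw) @ Sb)
order = np.argsort(eigvals.real)[::-1]
W = eigvecs[:, order].real[:, :1]        # one direction for 2 classes
Z = X @ W                                # class-conditionally extracted feature
```

With two classes there is at most one such discriminant direction; the nonparametric variants studied in Article I relax the single-Gaussian-per-class assumption behind Sw and Sb.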
RQ2: Is FE a data- or hypothesis-driven CI?
Pechenizkiy M. 2005. Impact of the Feature Extraction on the Performance of a Classifier: kNN, Naïve Bayes and C4.5 (Article III, AI’05)
[Figure: FE-for-SL pipeline – a search for the most appropriate FE technique produces an FE model that transforms the train set; a search for the most appropriate SL technique then runs the SL process on the transformed train set to build an SL model, which yields predictions on the correspondingly transformed test set]

The ranking of different FE techniques according to the corresponding accuracy results of an SL technique can vary a lot across datasets. Different FE techniques also behave differently when integrated with different SL techniques. Hence, the selection of an FE method is not independent of the selection of the classifier.
RQ3: FE for Dynamic Integration of Classifiers
[Figure: training and application phases of dynamic integration of classifiers. Training phase: the data set is divided into a training set, a validation set and a test set; the random subspace method RSM(S, N), combined with feature extraction (PCA, parametric, non-parametric) and feature-subset refinement, produces transformed training subsets TS1…TSS; base classifiers BC1…BCS are trained on them, and local accuracy estimates form the meta-data. Application phase: for each new instance the nearest neighbours are searched in the transformed space, WNN predicts the local errors of every BC for each neighbour, and the ensemble is combined by Dynamic Selection, Dynamic Voting, or Dynamic Voting with Selection. Legend: S – size of the ensemble, N – number of features, TS – training subset, BC – base classifier, NN – nearest neighbourhood.]
(Article VIII, Pechenizkiy et al., 2005)
RQ4: How to construct good RS for SL?
Pechenizkiy M., Tsymbal A., Puuronen S. 2004. PCA-based feature transformation for classification: issues in medical diagnostics, (Article II, CBMS’2004)
[Figure: original vs. extracted features; representations of instances of classes y1…yk; selecting the most relevant features vs. selecting the most representative instances]

Which features – original, extracted or both – are useful for SL? A combination of the original features with extracted features can be beneficial for SL on many datasets, especially when tree-based inducers like C4.5 are used for classification.
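The "both" option above is simply a concatenation of the original attributes with the extracted components, so the learner can exploit whichever helps. A toy numpy sketch with PCA as the extractor (data and sizes are illustrative only):

```python
# Sketch: build a representation space from original + extracted
# features by appending the leading PCs to the original attributes.
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(60, 5))
Xc = X - X.mean(axis=0)

eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
pcs = Xc @ eigvecs[:, np.argsort(eigvals)[::-1][:2]]   # 2 leading PCs

X_combined = np.hstack([X, pcs])   # original + extracted features
print(X_combined.shape)            # (60, 7)
```

Any SL technique is then trained on `X_combined` instead of `X` or `pcs` alone.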
RQ4: How to construct good RS for SL? (cont.)
Pechenizkiy M., Tsymbal A., Puuronen S. 2005. On Combining Principal Components with Parametric LDA-based Feature Extraction for Supervised Learning. (Article III, FCDS)
[Figure: pipeline – PCA produces PCs and parametric LDA produces LDs from the training data; SL is trained on PCs + LDs; the test data are transformed accordingly and the classifier's accuracy is measured]

[Figure: bar charts comparing LDA, PCA and LDA+PCA for 3NN, NB and C4.5 – accuracies in the 0.70–0.80 range, and the numbers of features used (0–12)]
RQ5: How many extracted features are useful?
Criteria for selecting the most useful transformed features are often based on the variance accounted for by the features to be selected:
– keep all the components whose corresponding eigenvalues are significantly greater than one;
– a ranking procedure: select the principal components that have the highest correlations with the class attribute.

[Figure: eigenvalues plotted against the numbers of features and instances]
Pechenizkiy M., Tsymbal A., Puuronen S. 2004. PCA-based feature transformation for classification: issues in medical diagnostics, (Article II, CBMS’2004)
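The two selection criteria listed above can be sketched as follows, on made-up standardised data (data, seed and the "top 3" cut-off are illustrative, not results from the thesis):

```python
# Sketch of two criteria for choosing extracted components:
# (a) eigenvalue > 1 (Kaiser-style rule on standardised data);
# (b) ranking components by |correlation| with the class attribute.
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(80, 6))
y = (X[:, 0] + 0.1 * rng.normal(size=80) > 0).astype(float)

Xs = (X - X.mean(axis=0)) / X.std(axis=0)        # standardise first
eigvals, eigvecs = np.linalg.eigh(np.cov(Xs, rowvar=False))
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
Z = Xs @ eigvecs                                  # all components

keep_var = np.where(eigvals > 1.0)[0]             # criterion (a)

corr = np.array([abs(np.corrcoef(Z[:, j], y)[0, 1])
                 for j in range(Z.shape[1])])
keep_cls = np.argsort(corr)[::-1][:3]             # criterion (b): top 3
```

Criterion (a) ignores the class entirely; criterion (b) is supervised, which is why the two can keep quite different component sets.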
RQ6: How to cope with data heterogeneity?
Pechenizkiy M., Tsymbal A., Puuronen S. 2005. Supervised Learning and Local Dimensionality Reduction within Natural Clusters: Biomedical Data Analysis. (T-ITB, Special Issue "Mining Biomedical Data")
[Figure: three strategies compared on training and test data – (1) plain SL producing a single classifier; (2) SL after global dimensionality reduction (DR); (3) natural clustering into Cluster1…Clustern, then SL (with or without local DR) within each cluster, producing local classifiers C1…Cn; accuracy is measured for each strategy]
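The local-DR idea in strategy (3) can be sketched as: split the data into natural clusters, then reduce dimensionality within each cluster separately instead of globally. Toy data with two known centroids standing in for a clustering step (everything here is illustrative, not the T-ITB experiment):

```python
# Sketch: assign instances to natural clusters, then apply PCA
# locally within each cluster rather than once over all the data.
import numpy as np

rng = np.random.default_rng(7)
X = np.vstack([rng.normal(0, 1, (40, 4)), rng.normal(5, 1, (40, 4))])
centroids = np.array([[0.0] * 4, [5.0] * 4])

# nearest-centroid assignment stands in for a clustering algorithm
labels = np.argmin(
    np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2), axis=1)

def local_pca(Xc, k=2):
    """Project one cluster onto its own k leading principal components."""
    Xm = Xc - Xc.mean(axis=0)
    vals, vecs = np.linalg.eigh(np.cov(Xm, rowvar=False))
    return Xm @ vecs[:, np.argsort(vals)[::-1][:k]]

reduced = {c: local_pca(X[labels == c]) for c in np.unique(labels)}
```

A separate classifier Ci is then trained on each locally reduced cluster, as in the figure.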
RQ7: What is the effect of sample reduction?
Pechenizkiy M., Puuronen S., Tsymbal A. 2005. The Impact of Sample Reduction on PCA-based Feature Extraction for Naïve Bayes Classification. (Article V, ACM SAC’06: DM Track)
[Figure: sample-reduction schemes before FE + SL. From a data set of N instances with classes 1…c (N1…Nc instances per class), a sample of size S is drawn either by random sampling over the whole data, by stratified random sampling taking p% of each class (so that Σ Si = S and the class distribution is preserved), or by kd-tree based selection, where a kd-tree is built over the data (per class) and representatives are taken from its leaves; k nearest neighbours per region feed the reduced sample]
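One of the reduction schemes in the figure, stratified random sampling keeping p% of every class, can be sketched as below (toy data; the function name and sizes are illustrative):

```python
# Sketch: stratified random sampling that keeps roughly p% of the
# instances of each class, preserving the class distribution.
import numpy as np

def stratified_sample(X, y, p, rng):
    """Keep about p% of the instances of every class."""
    keep = []
    for c in np.unique(y):
        idx = np.where(y == c)[0]
        k = max(1, int(round(len(idx) * p / 100)))
        keep.append(rng.choice(idx, size=k, replace=False))
    keep = np.concatenate(keep)
    return X[keep], y[keep]

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 3))
y = np.array([0] * 150 + [1] * 50)
Xs, ys = stratified_sample(X, y, p=20, rng=rng)
print(len(ys), (ys == 0).sum(), (ys == 1).sum())   # 40 30 10
```

FE (here PCA) is then fit on the reduced sample `Xs` before SL, which is exactly the setting whose effect RQ7 measures.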
RQ8: When FE is useful for SL?
Kaiser-Meyer-Olkin (KMO) criterion: accounts for total and partial correlation:

  KMO = Σ_{i≠j} r_ij² / ( Σ_{i≠j} r_ij² + Σ_{i≠j} a_ij² ),

where r_ij are the pairwise correlations between features X_i and X_j, and the partial correlations are a_ij = −R_ij / √(R_ii · R_jj), with R the inverse of the correlation matrix.

General recommendation: IF KMO > 0.5 THEN apply PCA.

This recommendation rarely works in the context of SL.
RQ9: What is the effect of FE on interpretability?
Pechenizkiy M., Tsymbal A., Puuronen S. 2004. PCA-based feature transformation for classification: issues in medical diagnostics, (Article II, CBMS’2004)
Interpretability refers to whether a classifier is easy to understand:
– rule-based classifiers like decision trees and association rules are very easy to interpret;
– neural networks and other connectionist, "black-box" classifiers have low interpretability.

FE enables:
– new concepts – new understanding;
– summarising the information from a large number of features into a limited number of components;
– transformation formulae that provide information about the importance of the original features;
– a better RS – a better neighbourhood – better interpretability by analogy with similar medical cases;
– visual analysis by projecting the data onto 2D or 3D plots.
RQ9: Feature Extraction & Interpretability (cont.)
Objectivity of interpretability:
– The assessment of interpretability relies on the user's perception of the classifier.
– The assessment of an algorithm's practicality depends much on a user's background, preferences and priorities.
– Most of the characteristics related to practicality can be described only by reporting users' subjective evaluations.

Thus, the interpretability issues are disputable and difficult to evaluate, and many conclusions on interpretability are relative and subjective. Collaboration between DM researchers and domain experts is needed for further analysis of interpretability issues.
Pechenizkiy M., Tsymbal A., Puuronen S. 2004. PCA-based feature transformation for classification: issues in medical diagnostics, (Article II, CBMS’2004)
RQ10: Framework for DM Strategy Selection
Pechenizkiy M. 2005. DM strategy selection via empirical and constructive induction. (Article IX, DBA’05)
[Figure: architecture for DM strategy selection – a KDD-Manager coordinating a meta-model (ES, KB), meta-data and meta-learning, data pre-processors, feature manipulators, instance manipulators, ML algorithms/classifiers, evaluators, post-processors/visualisers, a data generator, data sets, and a GUI]
Additional Slides …
Meta-Learning
[Figure: meta-learning scheme – a collection of data sets and a collection of techniques are evaluated against performance criteria; the results populate a meta-learning space and a knowledge repository, from which a meta-model is built that suggests a technique for a new data set]
New Research Framework for DM Research
[Figure: Environment (people, organizations, technology) states business needs, ensuring relevance; DM Research develops/builds and justifies/evaluates artefacts in an assess–refine cycle; the Knowledge Base (foundations, design knowledge) supplies applicable knowledge, ensuring rigor, and receives contributions; (un-)successful applications return to the appropriate environment]
[Figure, detailed version: Environment – People (roles, capabilities, characteristics), Organizations (strategy, structure & culture, processes), Technology (infrastructure, applications, communications architecture, development capabilities). Knowledge Base – Foundations (base-level theories, frameworks, models, instantiations, validation criteria) and Design knowledge (methodologies, validation criteria; not instantiations of models but KDD processes, services, systems). DM Research – Develop/Build (theories, artifacts) and Justify/Evaluate (analytical, case study, experimental, field study, simulation) in an assess–refine cycle. Business needs ensure relevance; applicable knowledge ensures rigor; (un-)successful applications go to the appropriate environment and contributions to the knowledge base]
New Research Framework for DM Research
… following Hevner et al. framework
Some Multidisciplinary Research
– Pechenizkiy M., Puuronen S., Tsymbal A. 2005. Why Data Mining Does Not Contribute to Business? In: C. Soares et al. (Eds.), Proc. of Data Mining for Business Workshop, DMBiz (ECML/PKDD'05), Porto, Portugal, pp. 67-71.
– Pechenizkiy M., Puuronen S., Tsymbal A. 2005. Competitive Advantage from Data Mining: Lessons Learnt in the Information Systems Field. In: IEEE Workshop Proc. of DEXA'05, 1st Int. Workshop on Philosophies and Methodologies for Knowledge Discovery, PMKD'05, IEEE CS Press, pp. 733-737 (invited paper).
– Pechenizkiy M., Puuronen S., Tsymbal A. 2005. Does the Relevance of Data Mining Research Matter? (resubmitted as a book chapter to) Foundations of Data Mining, Springer.
– Pechenizkiy M., Tsymbal A., Puuronen S. 2005. Knowledge Management Challenges in Knowledge Discovery Systems. In: IEEE Workshop Proc. of DEXA'05, 6th Int. Workshop on Theory and Applications of KM, TAKMA'05, IEEE CS Press, pp. 433-437.
Some Applications
– Pechenizkiy M., Tsymbal A., Puuronen S., Shifrin M., Alexandrova I. 2005. Knowledge Discovery from Microbiology Data: Many-sided Analysis of Antibiotic Resistance in Nosocomial Infections. In: K.D. Althoff et al. (Eds.), Post-Conference Proc. of 3rd Conf. on Professional Knowledge Management: Experiences and Visions, LNAI 3782, Springer Verlag, pp. 360-372.
– Pechenizkiy M., Tsymbal A., Puuronen S. 2005. Supervised Learning and Local Dimensionality Reduction within Natural Clusters: Biomedical Data Analysis. (T-ITB, Special Issue "Mining Biomedical Data")
– Tsymbal A., Pechenizkiy M., Cunningham P., Puuronen S. 2005. Dynamic Integration of Classifiers for Handling Concept Drift. (submitted to Special Issue on Application of Ensembles, Information Fusion, Elsevier)
Contact Info

Mykola Pechenizkiy
Department of Computer Science and Information Systems,
University of Jyväskylä, FINLAND
E-mail: [email protected]
Tel. +358 14 2602472, Mobile: +358 44 3851845
Fax: +358 14 2603011
www.cs.jyu.fi/~mpechen

THANK YOU!

MS PowerPoint slides of recent talks and full texts of selected publications are available online at: www.cs.jyu.fi/~mpechen