TRANSCRIPT
Split Criterions for Variable Selection using Decision Trees
J. Abellán, A. R. Masegosa
Department of Computer Science and A.I., University of Granada
Spain
Outline 1. Introduction
2. Previous knowledge
3. Experimentation
4. Conclusions & future work
Introduction Information from a data base
Attribute variables: Calcium, Tumor, Coma, Migraine. Class variable: Cancer.
Data Base:
Calcium | Tumor | Coma    | Migraine | Cancer
normal  | a1    | absent  | absent   | absent
high    | a1    | present | absent   | present
normal  | a1    | absent  | absent   | absent
normal  | a1    | absent  | absent   | absent
high    | a0    | present | present  | absent
...     | ...   | ...     | ...      | ...
Introduction Classification tree (decision tree)
[Figure: a classification tree. The root node Tumor has one branch leading to a leaf "Classification: absent" and another leading to the node Calcium, whose two leaves are "Classification: absent" and "Classification: present".]
Node = attribute variable. Leaf = case of the class variable.
SPLIT CRITERION, STOP CRITERION. 1 LEAF = 1 RULE.
Introduction Classification tree. New observation
Observation: (high, a1, absent, present) over the variables [Calcium, Tumor, Coma, Migraine].
Following the tree: Tumor = a1 leads to the node Calcium, and Calcium = high leads to the leaf "Classification: present".
Classification: Cancer present.
[Figure: the same tree, with branches Tumor = a0/a1 and Calcium = normal/high, and the path of the observation highlighted; a sketch of this classification step follows.]
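To make the classification step concrete, here is a minimal runnable sketch; the Node structure and function names are illustrative (not from the slides), but the example tree and observation are the ones above.

```python
# Minimal sketch of classification with a decision tree. The Node structure
# is illustrative; the tree mirrors the slides' example.

class Node:
    def __init__(self, attribute=None, children=None, label=None):
        self.attribute = attribute       # splitting variable (None at a leaf)
        self.children = children or {}   # branch value -> child Node
        self.label = label               # class value at a leaf

def classify(node, observation):
    """Walk from the root to a leaf, following the observed branch values."""
    while node.attribute is not None:
        node = node.children[observation[node.attribute]]
    return node.label

# The example tree and observation from the slides:
tree = Node("Tumor", {
    "a0": Node(label="absent"),
    "a1": Node("Calcium", {
        "normal": Node(label="absent"),
        "high":   Node(label="present"),
    }),
})
obs = {"Calcium": "high", "Tumor": "a1", "Coma": "absent", "Migraine": "present"}
print(classify(tree, obs))   # -> "present": Cancer present
```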
Introduction Principal problems for classifiers
Redundant attribute variables. Irrelevant attribute variables. An excessive number of variables.
Variable selection methods: filter methods and wrapper methods (which depend on the classifier).
Mark A. Hall and G. Holmes, Benchmarking Attribute Selection Techniques for Discrete Class Data Mining, IEEE TKDE (2003)
Introduction Variable selection with classification trees
[Figure: a decision tree over the variables {Xa, Xb, ..., Xk}, with Xa at the root and Xb, Xc, Xd as the other split nodes of its first levels; the first levels yield the selection {Xa, Xb, Xc, Xd}.]
FIRST LEVELS = MOST SIGNIFICANT VARIABLES (see the sketch below)
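A minimal sketch of this extraction, reusing the illustrative Node structure from the classification sketch above: collect every attribute that appears as a split node in the first k levels.

```python
def variables_in_first_levels(root, k):
    """Return the set of attributes used as split nodes in levels 0..k-1."""
    selected, frontier = set(), [root]
    for _ in range(k):
        next_frontier = []
        for node in frontier:
            if node.attribute is not None:       # internal node, not a leaf
                selected.add(node.attribute)
                next_frontier.extend(node.children.values())
        frontier = next_frontier
    return selected
```

For the example tree above, variables_in_first_levels(tree, 2) returns {'Tumor', 'Calcium'}.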
Introduction Variable selection with classification trees
[Figure: the training set DB is resampled into m training sets; a decision tree is built on each one, yielding the variable sets SET1, SET2, ..., SETm; the final selection is SET1 U ... U SETm.]
Introduction Variable selection with classification trees
[Figure: the same scheme, in which the trees also provide an INFORMATIVE ORDER FOR THE ROOT NODE (Abellán & Masegosa, 2007); the final selection is again SET1 U ... U SETm.]
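A sketch of the composed scheme, under stated assumptions: build_tree is a hypothetical function that fits a decision tree with some split criterion, and bootstrap resampling stands in for the slides' unspecified way of drawing the m training sets.

```python
import random

def composed_selection(data, m, k, build_tree):
    """Build a tree on each of m resamples of the training set and return
    the union SET1 U ... U SETm of their first-k-level variable sets."""
    selected = set()
    for _ in range(m):
        sample = [random.choice(data) for _ in range(len(data))]  # bootstrap
        tree = build_tree(sample)              # hypothetical tree inducer
        selected |= variables_in_first_levels(tree, k)
    return selected
```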
Introduction Approach of the work presented
Establish the most suitable split criterion for building decision trees, to be used as the base of the composed methods for VARIABLE SELECTION.
The variables in the first levels of a single decision tree are extracted.
The performance of these variables is evaluated with a Naive Bayes classifier.
We carry out EXPERIMENTS on a large set of data bases using well-known split criteria (InfoGain, IGRatio and GiniIndex) and one based on imprecise probabilities (Abellán & Moral, 2003): Imprecise InfoGain.
Outline 1. Introduction
2. Previous knowledge
3. Experimentation
4. Conclusions & future work
Previous knowledge Naive Bayes (Duda & Hart, 1973)
Attribute variables {Xi | i = 1,...,r}; class variable C with states in {c1,...,ck}.
Select the state of C: arg max_ci P(ci | X).
Assumption of independence given the class variable:
arg max_ci ( P(ci) ∏_{j=1}^{r} P(xj | ci) )
[Figure: graphical structure, with C as the parent of X1, X2, ..., Xr.]
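A minimal sketch of this rule, assuming discrete attributes and plain relative-frequency estimates (the slides do not mention smoothing):

```python
from collections import Counter, defaultdict

def train_nb(rows, labels):
    """Estimate P(c) and the counts behind P(x_j | c) from the data."""
    class_count = Counter(labels)
    cond = defaultdict(Counter)                  # (j, c) -> value counts
    for row, c in zip(rows, labels):
        for j, v in enumerate(row):
            cond[(j, c)][v] += 1
    prior = {c: k / len(labels) for c, k in class_count.items()}
    return prior, cond, class_count

def predict_nb(prior, cond, class_count, row):
    """Select arg max_c  P(c) * prod_j P(x_j | c)."""
    best, best_score = None, -1.0
    for c in prior:
        score = prior[c]
        for j, v in enumerate(row):
            score *= cond[(j, c)][v] / class_count[c]
        if score > best_score:
            best, best_score = c, score
    return best
```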
Previous knowledge Split criteria for decision trees: Info-Gain (Quinlan, 1986)
Selects the attribute variable with the highest positive value of
IG(Xi, C) = H(C) - H(C | Xi)
H(C) = -∑_j P(cj) log P(cj)   (SHANNON ENTROPY)
H(C | Xi) = -∑_t P(xi^t) ∑_j P(cj | xi^t) log P(cj | xi^t)
ID3: works only with discrete data bases; has a tendency to select variables with a great number of cases.
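A short sketch of the computation, estimating the probabilities by relative frequencies from paired samples xs (attribute values) and cs (class values):

```python
from collections import Counter
from math import log2

def entropy(values):
    """Shannon entropy H of a discrete sample."""
    n = len(values)
    return -sum(k / n * log2(k / n) for k in Counter(values).values())

def info_gain(xs, cs):
    """IG(X, C) = H(C) - H(C | X)."""
    n = len(cs)
    h_c_given_x = 0.0
    for x in set(xs):                    # one term per value x_t of X
        subset = [c for xi, c in zip(xs, cs) if xi == x]
        h_c_given_x += len(subset) / n * entropy(subset)
    return entropy(cs) - h_c_given_x
```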
Previous knowledge Split criteria for decision trees: Info-Gain Ratio (Quinlan, 1993)
Selects the attribute variable with the highest positive value of
IGR(Xi, C) = IG(Xi, C) / H(Xi)
C4.5: works with continuous data bases; has a posterior pruning process; penalizes the use of variables with a higher number of cases.
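Reusing entropy and info_gain from the previous sketch, the ratio is one line; dividing by H(X) is what penalizes many-valued attributes:

```python
def info_gain_ratio(xs, cs):
    """IGR(X, C) = IG(X, C) / H(X)."""
    h_x = entropy(xs)
    return info_gain(xs, cs) / h_x if h_x > 0 else 0.0
```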
Previous knowledge Split criteria for decision trees: Gini index (Breiman et al., 1984)
Selects the attribute variable with the highest positive value of
GIx(Xi, C) = gini(C) - gini(C | Xi)
gini(C) = 1 - ∑_j P(cj)^2
gini(C | Xi) = ∑_t P(xi^t) gini(C | Xi = xi^t)
GINI INDEX: quantifies the impurity degree of a partition (a "pure" partition has all its values in one case of C).
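The analogous sketch for the Gini criterion, with the same sampling conventions as the Info-Gain sketch above:

```python
from collections import Counter

def gini(values):
    """Gini impurity 1 - sum_j P(c_j)^2: 0 for a pure partition."""
    n = len(values)
    return 1.0 - sum((k / n) ** 2 for k in Counter(values).values())

def gini_reduction(xs, cs):
    """GIx(X, C) = gini(C) - gini(C | X): the impurity reduction of the split."""
    n = len(cs)
    g_cond = 0.0
    for x in set(xs):
        subset = [c for xi, c in zip(xs, cs) if xi == x]
        g_cond += len(subset) / n * gini(subset)
    return gini(cs) - g_cond
```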
Previous knowledge Split criteria for decision trees: Imprecise Info-Gain (Abellán & Moral, 2003)
Representing the information from a data base: Imprecise Dirichlet Model (IDM).
Probability estimation (n_cj: frequency of cj in the data base; N: sample size; s: IDM parameter):
P(cj) ∈ I_cj ≡ [ n_cj / (N + s), (n_cj + s) / (N + s) ]
Credal sets:
K(C) = { q | q(cj) ∈ I_cj }
K(C | Xi = xi) = { q | q(cj) ∈ I_(cj, xi) }
Previous knowledge Split criteria for decision trees: Imprecise Info-Gain (Abellán & Moral, 2003)
Selects the attribute variable with the highest positive value of:
IGI(Xi, C) = S(K(C)) - ∑_t P(xi^t) S(K(C | Xi = xi^t))
with S the maximum entropy function of a credal set.
Maximum entropy is a global uncertainty measure that comprises conflict and non-specificity: conflict favours ramification, while non-specificity tends to reduce it.
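A sketch of IGI. For the IDM credal set, the maximum entropy S is reached by pouring the extra mass s onto the least frequent classes until they are levelled (the procedure of Abellán & Moral, 2003); the water-filling code below is an illustrative implementation of that idea, not the authors' own.

```python
from math import log2

def idm_max_entropy(counts, s=1.0):
    """Entropy of the maximum-entropy distribution in the IDM credal set:
    distribute the extra mass s over the smallest class counts (water-filling)."""
    c = sorted(counts)
    total = sum(c) + s
    rest, i = s, 0
    while i < len(c) - 1 and rest > 0:
        need = (c[i + 1] - c[i]) * (i + 1)  # mass to level c[0..i] with c[i+1]
        if need >= rest:
            for j in range(i + 1):
                c[j] += rest / (i + 1)
            rest = 0.0
        else:
            for j in range(i + 1):
                c[j] = c[i + 1]
            rest -= need
        i += 1
    if rest > 0:                             # all counts level: spread evenly
        c = [x + rest / len(c) for x in c]
    return -sum(x / total * log2(x / total) for x in c if x > 0)

def imprecise_info_gain(xs, cs, s=1.0):
    """IGI(X, C) = S(K(C)) - sum_t P(x_t) S(K(C | X = x_t))."""
    classes = sorted(set(cs))
    n = len(cs)
    def s_max(rows):
        return idm_max_entropy([sum(1 for c in rows if c == cj)
                                for cj in classes], s)
    result = s_max(cs)
    for x in set(xs):
        subset = [c for xi, c in zip(xs, cs) if xi == x]
        result -= len(subset) / n * s_max(subset)
    return result
```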
Outline 1. Introduction
2. Previous knowledge
3. Experimentation
4. Conclusions & future work
Experimentation Data Bases
Preprocess:
- Filling of missing data (average & mode)
- Discretization of continuous values
Application of the selection methods.
Application of NB on the original DBs with the sets of selected variables.
• Percentage of correct classifications of NB before and after the selection process
• Number of variables selected
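A possible preprocessing sketch with pandas; the slides do not specify the discretization scheme, so the equal-frequency binning and the bin count below are assumptions.

```python
import pandas as pd

def preprocess(df, n_bins=5):
    """Impute missing values (mean for numeric, mode for nominal columns),
    then discretize numeric columns; binning details are assumptions."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].mean())
            out[col] = pd.qcut(out[col], q=n_bins, duplicates="drop")
        else:
            out[col] = out[col].fillna(out[col].mode().iloc[0])
    return out
```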
Experimentation Results with 3 levels. Correct classifications
NB comparison:
Accumulated Comparison:
10-fold cross-validation repeated 10 times. Corrected paired t-test with a 5% significance level.
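Assuming the test meant here is the corrected resampled t-test of Nadeau & Bengio (the usual choice for 10x10-fold CV), a sketch: scores_a and scores_b hold the 100 per-fold accuracies of the two classifiers, and |t| is compared with Student's t with k - 1 degrees of freedom at the 5% level.

```python
from math import sqrt
from statistics import mean, variance

def corrected_paired_ttest(scores_a, scores_b, test_train_ratio=1 / 9):
    """Corrected paired t-statistic; test_train_ratio is 1/9 for 10-fold CV."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    k = len(diffs)                              # e.g. 100 for 10x10-fold CV
    return mean(diffs) / sqrt((1 / k + test_train_ratio) * variance(diffs))
```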
Experimentation Results with 3 levels. Number of variables
Accumulated Comparison:
Experimentation Results with 4 levels
Comparison of correct classifications:
Comparison of the number of variables:
Experimentation Results Analysis
1. Using only one tree, all the procedures obtain good results with a small number of variables.
2. The improvement from 3 to 4 levels is not very significant, except for IGR.
3. IGR excessively penalizes variables with a high number of cases (Audiology, Optdigits, ...).
4. Using 3 levels, IGI has better results than the other criteria; this advantage is larger with 4 levels.
Outline 1. Introduction
2. Previous knowledge
3. Experimentation
4. Conclusions & future work
Conclusions & future work Experiments over 27 DBs show IGI to be an outperforming split criterion considering the trade-off between accuracy and the number of variables.
Apply the IGI criterion, and other criteria based on Bayesian scores, in the composed methods explained in the introduction.
Study the use of combined criteria, i.e. using one criterion or another depending on the characteristics of the DB (size, number of variables, number of cases, etc.) and on the level of the tree.