TRANSCRIPT
Split Criterions for Variable Selection using Decision Trees
J. Abellán, A. R. Masegosa
Department of Computer Science and A.I., University of Granada
Spain
Outline 1. Introduction
2. Previous knowledge
3. Experimentation
4. Conclusions & future work
Introduction Information from a data base
Attribute variables: Calcium, Tumor, Coma, Migraine. Class variable: Cancer.
Data Base:
Calcium | Tumor | Coma    | Migraine | Cancer
normal  | a1    | absent  | absent   | absent
high    | a1    | present | absent   | present
normal  | a1    | absent  | absent   | absent
normal  | a1    | absent  | absent   | absent
high    | a0    | present | present  | absent
...     | ...   | ...     | ...      | ...
Introduction Classification tree (decision tree)
[Figure: a classification tree. The root node Tumor has one branch leading to a leaf "Classification: absent" and another leading to the node Calcium, whose two leaves are "Classification: absent" and "Classification: present".]
Node = attribute variable. Leaf = case of the class variable.
SPLIT CRITERION, STOP CRITERION. 1 LEAF = 1 RULE.
Introduction Classification tree. New observation
Observation: (high, a1, absent, present) over the variables [Calcium, Tumor, Coma, Migraine].
Following the tree: Tumor = a1 leads to the node Calcium, and Calcium = high leads to the leaf "Classification: present".
Classification: Cancer present.
[Figure: the same tree, with branches Tumor = a0/a1 and Calcium = normal/high, and the path of the observation highlighted; a sketch of this classification step follows.]
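To make the classification step concrete, here is a minimal runnable sketch; the Node structure and function names are illustrative (not from the slides), but the example tree and observation are the ones above.

```python
# Minimal sketch of classification with a decision tree. The Node structure
# is illustrative; the tree mirrors the slides' example.

class Node:
    def __init__(self, attribute=None, children=None, label=None):
        self.attribute = attribute       # splitting variable (None at a leaf)
        self.children = children or {}   # branch value -> child Node
        self.label = label               # class value at a leaf

def classify(node, observation):
    """Walk from the root to a leaf, following the observed branch values."""
    while node.attribute is not None:
        node = node.children[observation[node.attribute]]
    return node.label

# The example tree and observation from the slides:
tree = Node("Tumor", {
    "a0": Node(label="absent"),
    "a1": Node("Calcium", {
        "normal": Node(label="absent"),
        "high":   Node(label="present"),
    }),
})
obs = {"Calcium": "high", "Tumor": "a1", "Coma": "absent", "Migraine": "present"}
print(classify(tree, obs))   # -> "present": Cancer present
```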
Introduction Principal problems for classifiers
Redundant attribute variables. Irrelevant attribute variables. An excessive number of variables.
Variable selection methods: filter methods and wrapper methods (which depend on the classifier).
Mark A. Hall and G. Holmes, Benchmarking Attribute Selection Techniques for Discrete Class Data Mining, IEEE TKDE (2003)
Introduction Variable selection with classification trees
[Figure: a decision tree over the variables {Xa, Xb, ..., Xk}, with Xa at the root and Xb, Xc, Xd as the other split nodes of its first levels; the first levels yield the selection {Xa, Xb, Xc, Xd}.]
FIRST LEVELS = MOST SIGNIFICANT VARIABLES (see the sketch below)
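A minimal sketch of this extraction, reusing the illustrative Node structure from the classification sketch above: collect every attribute that appears as a split node in the first k levels.

```python
def variables_in_first_levels(root, k):
    """Return the set of attributes used as split nodes in levels 0..k-1."""
    selected, frontier = set(), [root]
    for _ in range(k):
        next_frontier = []
        for node in frontier:
            if node.attribute is not None:       # internal node, not a leaf
                selected.add(node.attribute)
                next_frontier.extend(node.children.values())
        frontier = next_frontier
    return selected
```

For the example tree above, variables_in_first_levels(tree, 2) returns {'Tumor', 'Calcium'}.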
Introduction Variable selection with classification trees
[Figure: the training set DB is resampled into m training sets; a decision tree is built on each one, yielding the variable sets SET1, SET2, ..., SETm; the final selection is SET1 U ... U SETm.]
Introduction Variable selection with classification trees
[Figure: the same scheme, in which the trees also provide an INFORMATIVE ORDER FOR THE ROOT NODE (Abellán & Masegosa, 2007); the final selection is again SET1 U ... U SETm.]
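A sketch of the composed scheme, under stated assumptions: build_tree is a hypothetical function that fits a decision tree with some split criterion, and bootstrap resampling stands in for the slides' unspecified way of drawing the m training sets.

```python
import random

def composed_selection(data, m, k, build_tree):
    """Build a tree on each of m resamples of the training set and return
    the union SET1 U ... U SETm of their first-k-level variable sets."""
    selected = set()
    for _ in range(m):
        sample = [random.choice(data) for _ in range(len(data))]  # bootstrap
        tree = build_tree(sample)              # hypothetical tree inducer
        selected |= variables_in_first_levels(tree, k)
    return selected
```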
Introduction Approach of the work presented
Establish the most suitable split criterion for building decision trees, to be used as the base of the composed methods for VARIABLE SELECTION.
The variables in the first levels of a single decision tree are extracted.
The performance of these variables is evaluated with a Naive Bayes classifier.
We carry out EXPERIMENTS on a large set of data bases using well-known split criteria (InfoGain, IGRatio and GiniIndex) and one based on imprecise probabilities (Abellán & Moral, 2003): Imprecise InfoGain.
Outline 1. Introduction
2. Previous knowledge
3. Experimentation
4. Conclusions & future work
Previous knowledge Naive Bayes (Duda & Hart, 1973)
Attribute variables {Xi | i = 1,...,r}; class variable C with states in {c1,...,ck}.
Select the state of C: arg max_ci P(ci | X).
Assumption of independence given the class variable:
arg max_ci ( P(ci) ∏_{j=1}^{r} P(xj | ci) )
[Figure: graphical structure, with C as the parent of X1, X2, ..., Xr.]
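A minimal sketch of this rule, assuming discrete attributes and plain relative-frequency estimates (the slides do not mention smoothing):

```python
from collections import Counter, defaultdict

def train_nb(rows, labels):
    """Estimate P(c) and the counts behind P(x_j | c) from the data."""
    class_count = Counter(labels)
    cond = defaultdict(Counter)                  # (j, c) -> value counts
    for row, c in zip(rows, labels):
        for j, v in enumerate(row):
            cond[(j, c)][v] += 1
    prior = {c: k / len(labels) for c, k in class_count.items()}
    return prior, cond, class_count

def predict_nb(prior, cond, class_count, row):
    """Select arg max_c  P(c) * prod_j P(x_j | c)."""
    best, best_score = None, -1.0
    for c in prior:
        score = prior[c]
        for j, v in enumerate(row):
            score *= cond[(j, c)][v] / class_count[c]
        if score > best_score:
            best, best_score = c, score
    return best
```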
Previous knowledge Split criteria for decision trees: Info-Gain (Quinlan, 1986)
Selects the attribute variable with the highest positive value of
IG(Xi, C) = H(C) - H(C | Xi)
H(C) = -∑_j P(cj) log P(cj)   (SHANNON ENTROPY)
H(C | Xi) = -∑_t P(xi^t) ∑_j P(cj | xi^t) log P(cj | xi^t)
ID3: works only with discrete data bases; has a tendency to select variables with a great number of cases.
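A short sketch of the computation, estimating the probabilities by relative frequencies from paired samples xs (attribute values) and cs (class values):

```python
from collections import Counter
from math import log2

def entropy(values):
    """Shannon entropy H of a discrete sample."""
    n = len(values)
    return -sum(k / n * log2(k / n) for k in Counter(values).values())

def info_gain(xs, cs):
    """IG(X, C) = H(C) - H(C | X)."""
    n = len(cs)
    h_c_given_x = 0.0
    for x in set(xs):                    # one term per value x_t of X
        subset = [c for xi, c in zip(xs, cs) if xi == x]
        h_c_given_x += len(subset) / n * entropy(subset)
    return entropy(cs) - h_c_given_x
```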
Previous knowledge Split criteria for decision trees: Info-Gain Ratio (Quinlan, 1993)
Selects the attribute variable with the highest positive value of
IGR(Xi, C) = IG(Xi, C) / H(Xi)
C4.5: works with continuous data bases; has a posterior pruning process; penalizes the use of variables with a higher number of cases.
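Reusing entropy and info_gain from the previous sketch, the ratio is one line; dividing by H(X) is what penalizes many-valued attributes:

```python
def info_gain_ratio(xs, cs):
    """IGR(X, C) = IG(X, C) / H(X)."""
    h_x = entropy(xs)
    return info_gain(xs, cs) / h_x if h_x > 0 else 0.0
```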
Previous knowledge Split criteria for decision trees: Gini index (Breiman et al., 1984)
Selects the attribute variable with the highest positive value of
GIx(Xi, C) = gini(C) - gini(C | Xi)
gini(C) = 1 - ∑_j P(cj)^2
gini(C | Xi) = ∑_t P(xi^t) gini(C | Xi = xi^t)
GINI INDEX: quantifies the impurity degree of a partition (a "pure" partition has all its values in one case of C).
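The analogous sketch for the Gini criterion, with the same sampling conventions as the Info-Gain sketch above:

```python
from collections import Counter

def gini(values):
    """Gini impurity 1 - sum_j P(c_j)^2: 0 for a pure partition."""
    n = len(values)
    return 1.0 - sum((k / n) ** 2 for k in Counter(values).values())

def gini_reduction(xs, cs):
    """GIx(X, C) = gini(C) - gini(C | X): the impurity reduction of the split."""
    n = len(cs)
    g_cond = 0.0
    for x in set(xs):
        subset = [c for xi, c in zip(xs, cs) if xi == x]
        g_cond += len(subset) / n * gini(subset)
    return gini(cs) - g_cond
```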
Previous knowledge Split criteria for decision trees: Imprecise Info-Gain (Abellán & Moral, 2003)
Representing the information from a data base: Imprecise Dirichlet Model (IDM).
Probability estimation (n_cj: frequency of cj in the data base; N: sample size; s: IDM parameter):
P(cj) ∈ I_cj ≡ [ n_cj / (N + s), (n_cj + s) / (N + s) ]
Credal sets:
K(C) = { q | q(cj) ∈ I_cj }
K(C | Xi = xi) = { q | q(cj) ∈ I_(cj, xi) }
Previous knowledge Split criteria for decision trees: Imprecise Info-Gain (Abellán & Moral, 2003)
Selects the attribute variable with the highest positive value of:
IGI(Xi, C) = S(K(C)) - ∑_t P(xi^t) S(K(C | Xi = xi^t))
with S the maximum entropy function of a credal set.
Maximum entropy is a global uncertainty measure that comprises conflict and non-specificity: conflict favours ramification, while non-specificity tends to reduce it.
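A sketch of IGI. For the IDM credal set, the maximum entropy S is reached by pouring the extra mass s onto the least frequent classes until they are levelled (the procedure of Abellán & Moral, 2003); the water-filling code below is an illustrative implementation of that idea, not the authors' own.

```python
from math import log2

def idm_max_entropy(counts, s=1.0):
    """Entropy of the maximum-entropy distribution in the IDM credal set:
    distribute the extra mass s over the smallest class counts (water-filling)."""
    c = sorted(counts)
    total = sum(c) + s
    rest, i = s, 0
    while i < len(c) - 1 and rest > 0:
        need = (c[i + 1] - c[i]) * (i + 1)  # mass to level c[0..i] with c[i+1]
        if need >= rest:
            for j in range(i + 1):
                c[j] += rest / (i + 1)
            rest = 0.0
        else:
            for j in range(i + 1):
                c[j] = c[i + 1]
            rest -= need
        i += 1
    if rest > 0:                             # all counts level: spread evenly
        c = [x + rest / len(c) for x in c]
    return -sum(x / total * log2(x / total) for x in c if x > 0)

def imprecise_info_gain(xs, cs, s=1.0):
    """IGI(X, C) = S(K(C)) - sum_t P(x_t) S(K(C | X = x_t))."""
    classes = sorted(set(cs))
    n = len(cs)
    def s_max(rows):
        return idm_max_entropy([sum(1 for c in rows if c == cj)
                                for cj in classes], s)
    result = s_max(cs)
    for x in set(xs):
        subset = [c for xi, c in zip(xs, cs) if xi == x]
        result -= len(subset) / n * s_max(subset)
    return result
```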
Outline 1. Introduction
2. Previous knowledge
3. Experimentation
4. Conclusions & future work
Experimentation Data Bases
Preprocess:
- Filling of missing data (average & mode)
- Discretization of continuous values
Application of the selection methods.
Application of NB on the original DBs with the sets of selected variables.
• Percentage of correct classifications of NB before and after the selection process
• Number of variables selected
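A possible preprocessing sketch with pandas; the slides do not specify the discretization scheme, so the equal-frequency binning and the bin count below are assumptions.

```python
import pandas as pd

def preprocess(df, n_bins=5):
    """Impute missing values (mean for numeric, mode for nominal columns),
    then discretize numeric columns; binning details are assumptions."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].mean())
            out[col] = pd.qcut(out[col], q=n_bins, duplicates="drop")
        else:
            out[col] = out[col].fillna(out[col].mode().iloc[0])
    return out
```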
Experimentation Results with 3 levels. Correct classifications
NB comparison:
Accumulated Comparison:
10-fold cross-validation repeated 10 times. Corrected paired t-test with a 5% significance level.
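Assuming the test meant here is the corrected resampled t-test of Nadeau & Bengio (the usual choice for 10x10-fold CV), a sketch: scores_a and scores_b hold the 100 per-fold accuracies of the two classifiers, and |t| is compared with Student's t with k - 1 degrees of freedom at the 5% level.

```python
from math import sqrt
from statistics import mean, variance

def corrected_paired_ttest(scores_a, scores_b, test_train_ratio=1 / 9):
    """Corrected paired t-statistic; test_train_ratio is 1/9 for 10-fold CV."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    k = len(diffs)                              # e.g. 100 for 10x10-fold CV
    return mean(diffs) / sqrt((1 / k + test_train_ratio) * variance(diffs))
```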
Experimentation Results with 3 levels. Number of variables
Accumulated Comparison:
Experimentation Results with 4 levels
Comparison of correct classifications:
Comparison of the number of variables:
Experimentation Results Analysis
1. Using only one tree, all the procedures obtain good results with a small number of variables.
2. The improvement from 3 to 4 levels is not very significant, except for IGR.
3. IGR excessively penalizes variables with a high number of cases (Audiology, Optdigits, ...).
4. Using 3 levels, IGI has better results than the other criteria; this advantage is larger with 4 levels.
Outline 1. Introduction
2. Previous knowledge
3. Experimentation
4. Conclusions & future work
Conclusions & future work Experiments over 27 DBs show IGI to be an outperforming split criterion considering the trade-off between accuracy and the number of variables.
Apply the IGI criterion, and other criteria based on Bayesian scores, in the composed methods explained in the introduction.
Study the use of combined criteria, i.e. using one criterion or another depending on the characteristics of the DB (size, number of variables, number of cases, etc.) and on the level of the tree.