
A Bayesian Random Split to Build Ensembles of Classification Trees

Andrés Cano and Andrés R. Masegosa and Serafín Moral

Department of Computer Science and Artificial Intelligence

University of Granada

Verona, July 2009

10th European Conference on Symbolic and Quantitative Approaches to Reasoning with Uncertainty

ECSQARU 2009 Verona (Italy) 1/22

Part I

Introduction

ECSQARU 2009 Verona (Italy) 2/22

Introduction

Ensembles of Randomized Decision Trees I

Features

Randomized decision trees were first proposed by Mingers [2].

Ensembles of decision trees exploited this idea: several DTs were generated with random differences among them.

They take advantage of their inherent instability to ease the introduction of randomness.

ECSQARU 2009 Verona (Italy) 3/22

Introduction

Ensembles of Randomized Decision Trees II

Randomized Training Datasets

Bagging: Bootstrap Aggregating.

Each DT is built with a different replica of the training data set, generated by random sampling with replacement.
Randomness is introduced by the random variations of the bootstrapped datasets.
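A minimal sketch of this bootstrap-and-vote scheme (an illustration only; the use of scikit-learn's DecisionTreeClassifier, NumPy arrays and integer class labels are assumptions, not part of the slides):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, n_trees=50, seed=0):
    """Fit one decision tree per bootstrap replica of the training set
    (X and y are assumed to be NumPy arrays)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, n, size=n)   # random sampling with replacement
        trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return trees

def bagging_predict(trees, X):
    """Aggregate the ensemble by majority vote (class labels assumed to be
    non-negative integers)."""
    votes = np.stack([t.predict(X) for t in trees]).astype(int)
    return np.array([np.bincount(votes[:, j]).argmax() for j in range(votes.shape[1])])
```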

Random Split Operators

Dietterich et al. [4] built ensembles where the split node was randomly selected from among the K best split attributes.

Breiman’s Random Forests [1] combined a Bagging scheme with random split node selection.

Performance of ensembles of DT

Notion of bias-variance error decomposition [7,8].

Ensembles of DTs maintain the bias while reducing the variance component as the size of the ensemble increases.

ECSQARU 2009 Verona (Italy) 4/22

Introduction

Motivation of this study

Description

Propose a new random split method, inspired by a Bayesian approach, for building ensembles of trees.

It is similar to the random split operator of Random Forests.

It is less sensitive to the selection of a parameter and has a competitive average error, bias and variance.

Outline

A Bayesian Account to Classification Trees.

Experimental Results.

Conclusions and Future Works.

ECSQARU 2009 Verona (Italy) 5/22

Part II

A Bayesian Account to Classification Trees

ECSQARU 2009 Verona (Italy) 6/22

A Bayesian Account to Classification Trees

Basic Framework

Classification Trees

Example Description

Tree T and parameters ΦT

ΦT = {(0.9, 0.1), ..., (0.8, 0.2)}
P(c2 | (x2, y2), T, ΦT) = 0.95

Notation

Data set D = (~x, ~c), where ~x is a set of multinomial attributes and ~c are the class values.

ΦT = {φk,l : k = 1, ..., K; l ∈ leaves(T)}
P(C = ck | x, T, ΦT) = φk,l, if x falls in leaf l.

Bayesian Approach to Classification Trees

T and ΦT are considered random variables [9].

A prior distribution is defined.

Posterior distributions of T and ΦT given the data D are computed.

ECSQARU 2009 Verona (Italy) 7/22

A Bayesian Account to Classification Trees

Basic Framework

Assumptions

Parameter Priors

A uniform prior distribution for T and a Dirichlet prior distribution for ΦT.

Parameter priors are independent across the different leaves of the tree.

$$P(\Phi_T \mid T) = \prod_{l \in \mathrm{leaves}(T)} \frac{1}{B_K(\alpha_{1,l}, \ldots, \alpha_{K,l})} \prod_{k=1}^{K} \phi_{k,l}^{\alpha_{k,l}-1}$$

Posterior Tree Probability

It is computed using Bayes' rule and integrating over the parameters. That is feasible because the Dirichlet and the multinomial are conjugate distributions.

$$P(T \mid \bar{x}, \bar{c}) \propto P(T) \prod_{l \in \mathrm{leaves}(T)} \frac{B_K(n_{1,l} + \alpha_{1,l}, \ldots, n_{K,l} + \alpha_{K,l})}{B_K(\alpha_{1,l}, \ldots, \alpha_{K,l})}$$

Posterior Class Probabilities Estimates

$$P(C = c_k \mid x, T, \bar{c}, \bar{x}) = E_{\Phi_T \mid T, \bar{c}, \bar{x}}(\phi_{k,l}) = \frac{n_{k,l} + \alpha_{k,l}}{n_l + \alpha_l}$$
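As an illustration of the quantities above (my own sketch, not code from the paper), the per-leaf factor B_K(n + α) / B_K(α) and the posterior-mean class estimates can be computed directly from the leaf counts n_{k,l} and the Dirichlet parameters α_{k,l}:

```python
import numpy as np
from scipy.special import gammaln

def log_multivariate_beta(alpha):
    """log B_K(a_1, ..., a_K) = sum_k log Gamma(a_k) - log Gamma(sum_k a_k)."""
    alpha = np.asarray(alpha, dtype=float)
    return gammaln(alpha).sum() - gammaln(alpha.sum())

def leaf_log_score(counts, alpha):
    """Per-leaf factor of the tree posterior: log B_K(n + alpha) - log B_K(alpha)."""
    counts = np.asarray(counts, dtype=float)
    alpha = np.asarray(alpha, dtype=float)
    return log_multivariate_beta(counts + alpha) - log_multivariate_beta(alpha)

def leaf_class_probs(counts, alpha):
    """Posterior-mean class estimate (n_{k,l} + alpha_{k,l}) / (n_l + alpha_l)."""
    counts = np.asarray(counts, dtype=float)
    alpha = np.asarray(alpha, dtype=float)
    return (counts + alpha) / (counts.sum() + alpha.sum())

# Example: a two-class leaf with counts (9, 1) and a uniform Dirichlet prior alpha = (1, 1).
print(leaf_class_probs([9, 1], [1, 1]))   # -> [0.833..., 0.166...]
```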

ECSQARU 2009 Verona (Italy) 8/22

A Bayesian Account to Classification Trees

Basic Framework

Ensembles of Classification Trees

Bayesian Solution: Model Averaging [9]

Consider all possible models with the corresponding posterior probability.

The final prediction is composed of a set of weighted predictions.

$$P(C = c_k \mid x, \bar{c}, \bar{x}) = \sum_{T} \int_{\Phi_T} P(C = c_k \mid x, T, \Phi_T)\, P(T, \Phi_T \mid \bar{c}, \bar{x})\, d\Phi_T \qquad (1)$$

Implementation Issues

Equation (1) can be approximated by Monte-Carlo methods.
Problem: if the model space is huge, this would lead to a very computationally expensive approach.

Hybrid Solution: Bayesian Random Split

Tree structures are generated in an approximate proportion to their posterior probabilities (similar to Monte-Carlo methods).

We employ an efficient technique very similar to random forests.
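A minimal sketch of how Equation (1) is then approximated: if trees are drawn in approximate proportion to their posterior probability, an unweighted average of the per-tree class estimates approximates the model average. The predict_proba-style interface below is an assumption made for illustration, not the authors' implementation.

```python
import numpy as np

def ensemble_class_probs(trees, X):
    """Monte-Carlo style approximation of Equation (1): when trees are drawn
    (approximately) in proportion to P(T | data), averaging their per-tree
    class estimates approximates the Bayesian model average."""
    per_tree = np.stack([t.predict_proba(X) for t in trees])  # (n_trees, n_samples, n_classes)
    return per_tree.mean(axis=0)
```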

ECSQARU 2009 Verona (Italy) 9/22

A Bayesian Account to Classification Trees

Ensembles of Classification Trees

Random Forests: Random Split Node Selection

Scheme Steps

1 K attributes (X_{i_1}, ..., X_{i_K}) are randomly selected from the set of available split attributes.

2 Compute a split score for each of the selected attributes.

3 Select the attribute with the highest score.

If the maximum score is negative, repeat the procedure. (A sketch of this scheme follows after the comments below.)

Comments

Different scores can be used: Gini index (GIx), Information Gain (IG) or Information Gain Ratio (IGR).

Trees are not pruned.

Suggested value: K = ln N.
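A minimal sketch of the selection scheme above (my own illustration; the names candidates and score_fn, and the retry-on-negative convention, are assumptions):

```python
import numpy as np

def random_split_selection(candidates, score_fn, K, rng):
    """Random-Forest-style split selection: evaluate K randomly chosen
    candidate attributes and keep the best-scoring one."""
    chosen = rng.choice(len(candidates), size=min(K, len(candidates)), replace=False)
    scored = [(score_fn(candidates[i]), candidates[i]) for i in chosen]
    best_score, best_attr = max(scored, key=lambda pair: pair[0])
    # "If the maximum score is negative, repeat the procedure": signal that by
    # returning None and letting the caller retry with a fresh random subset.
    return best_attr if best_score >= 0 else None
```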

ECSQARU 2009 Verona (Italy) 10/22

A Bayesian Account to Classification Trees

Ensembles of Classification Trees

A Bayesian Random Split Operator (BRS)

Scheme Steps

1 K attributes are randomly selected.

2 Compute the Bayesian score of:
A tree, T_{X_i}, branched by each X_i.
The tree, T̄, without branching.

3 Normalize the previous scores.

4 Define a multinomial probability distribution:
One case for each of the K split attributes.
One case to stop the branching.

5 The split action is randomly selected according to this multinomial distribution.

If stop branching is selected, repeat the procedure.
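A minimal sketch of the BRS steps above (my own reading of the slides; the function names, the log-space normalization and the score interfaces are assumptions): log_score_fn(X_i) would return the log Bayesian score of the tree branched by X_i, and stop_log_score the log score of the unbranched tree.

```python
import numpy as np

def bayesian_random_split(candidates, log_score_fn, stop_log_score, K, rng):
    """Bayesian Random Split (BRS) sketch: draw one action from a multinomial
    whose probabilities are the normalized Bayesian scores of K randomly
    sampled split attributes plus the 'do not branch' option."""
    chosen = rng.choice(len(candidates), size=min(K, len(candidates)), replace=False)
    log_scores = np.array([log_score_fn(candidates[i]) for i in chosen] + [stop_log_score])
    # Normalizing in log space for numerical stability is an implementation
    # detail assumed here, not something stated in the slides.
    probs = np.exp(log_scores - log_scores.max())
    probs /= probs.sum()
    action = rng.choice(len(probs), p=probs)
    if action == len(chosen):     # the stop-branching case was drawn
        return None               # caller repeats the procedure
    return candidates[chosen[action]]
```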

ECSQARU 2009 Verona (Italy) 11/22

A Bayesian Account to Classification Trees

Ensembles of Classification Trees

A Bayesian Random Split Operator (BRS)

Comments

Split scores are probabilities, so random split node selection can be properly carried out.

Informative scores do not allow this process.

Bayesian scores have an inherent penalty for complexity.

Stop branching is probabilistically managed.

Random Forests fully grows the trees.

A heuristic which tries to recover the classification trees in an approximate proportion to their posterior probability.

Random Forests justifies its performance with the notion of strength andcorrelation among trees.

ECSQARU 2009 Verona (Italy) 12/22

Part III

Experimental Evaluation

ECSQARU 2009 Verona (Italy) 13/22

Experimental Evaluation

Experimental Evaluation Set-up

Data Sets Benchmark

23 UCI datasets with very different features.

Missing values were replaced with mean and mode values for continuous and discrete attributes respectively.

Continuous attributes were discretized with Fayyad & Irani’s method [15].

Preprocessing was carried out using only information from the training data sets.

Bias-Variance error decomposition analysis [8,16]

Error can be seen as the sum of bias and variance components.

Bias measures the component of the error resulting from the incapacity of thepredictor to model the underlying distribution.

Variance represents the component of the error that stems from the particularities of the training sample.
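As an illustration (a sketch following the usual "main prediction" formulation for 0/1 loss; the exact protocol of [8,16] may differ in details), bias and variance terms can be estimated from the predictions of models trained on different training samples:

```python
import numpy as np

def bias_variance_01(predictions, y_true):
    """Estimate 0/1-loss bias and variance terms.
    predictions: array of shape (n_models, n_test) with integer class labels,
    one row per model trained on a different training sample."""
    predictions = np.asarray(predictions, dtype=int)
    y_true = np.asarray(y_true, dtype=int)
    n_models, n_test = predictions.shape
    bias = np.empty(n_test)
    variance = np.empty(n_test)
    for j in range(n_test):
        votes = predictions[:, j]
        main_pred = np.bincount(votes).argmax()      # the "main" (modal) prediction
        bias[j] = float(main_pred != y_true[j])      # systematic part of the error
        variance[j] = np.mean(votes != main_pred)    # fluctuation around the main prediction
    return bias.mean(), variance.mean()
```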

Evaluation

Friedman test and the Bonferroni post-hoc test [17].
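A minimal usage sketch of this evaluation protocol with SciPy (the error values below are placeholders, not results from the paper):

```python
from scipy.stats import friedmanchisquare

# Placeholder per-dataset error rates for three methods (one value per dataset);
# these numbers are illustrative only.
errors_a = [0.12, 0.20, 0.08, 0.15, 0.31]
errors_b = [0.14, 0.22, 0.09, 0.18, 0.30]
errors_c = [0.11, 0.19, 0.10, 0.16, 0.29]

stat, p_value = friedmanchisquare(errors_a, errors_b, errors_c)
print(f"Friedman statistic = {stat:.3f}, p-value = {p_value:.3f}")

# A Bonferroni-style post-hoc comparison against a control method then tests
# each of the m pairwise differences at the corrected significance level alpha / m.
```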

ECSQARU 2009 Verona (Italy) 14/22

Experimental Evaluation

The role of K in Random Forests

The ranks assessed by the Friedman test at the 0.05 level are depicted.
The lower the rank, the better the classification error.
Ranks in bold face indicate a statistically significant difference with respect to BRS with 10 trees and K=1 under the Bonferroni test.

RF Trees   BRS (10 trees) K=1   RF K=1   RF K=3   RF K=5   RF K=Log N
10         2.0                  3.9      3.1      2.9      3.1
50         3.4                  3.5      2.6      2.9      2.7
100        3.7                  3.2      2.3      2.8      2.9
200        3.8                  3.1      2.7      2.8      2.6

Analysis

RF with 10 trees is statistically worse for K=1 and K=3.

Only RF with 100 trees and K=3 is statistically better than BRS with 10 trees.

Random Forests seem very sensitive to the specific K value.

ECSQARU 2009 Verona (Italy) 15/22

Experimental Evaluation

The role of K in BRS

The ranks assessed by the Friedman test are depicted.
The lower the rank, the better the classification error.
Ranks in bold face indicate a statistically significant difference with respect to RF with 10 trees and K=Log N under the Bonferroni test.

BRS Trees   RF (10 trees) K=Log N   BRS K=1   BRS K=3   BRS K=5   BRS K=Log N
10          4.3                     3.2       2.1       2.6       2.8
50          4.9                     2.7       2.3       2.5       2.6
100         5.0                     2.3       2.5       2.7       2.6
200         5.0                     2.4       2.4       2.7       2.5

Analysis

BRS with 10 trees is statistically better than RF with 10 trees for the different K values.

Our approach with a higher number of trees is always better with any K value.

BRS seems to be much less sensitive to the specific K value.

ECSQARU 2009 Verona (Italy) 16/22

Experimental Evaluation

Bias-Variance error decomposition

Average Classification Error

The ranks assessed by the Friedman test are depicted.
The lower the rank, the better the performance.
Ranks in bold face indicate a statistically significant difference with respect to BRS with K=1 under the Bonferroni test.

Trees   BRS K=1   RF K=1   RF K=3   RF K=5   RF K=Log N
10      2.0¹      3.9      3.1      2.9      3.1
50      2.4¹      3.7      2.9      3.1      2.9
100     2.5¹      3.6      2.7      3.1      3.2
200     2.4¹      3.5      3.1      3.1      2.9
(¹ marks the best rank in each row)

Conclusions

BRS clearly has a better ranking across the different sizes of ensembles and the different K values employed by RF.

The K value with best ranking for RF changes with the size of the ensemble.

There are no statistically significant differences under the non-parametric tests.

ECSQARU 2009 Verona (Italy) 17/22

Experimental Evaluation

Bias-Variance error decomposition

Average Classification Bias

The ranks assessed by the Friedman test are depicted.
The lower the rank, the better the performance.
Ranks in bold face indicate a statistically significant difference with respect to BRS with K=1 under the Bonferroni test.

Trees   BRS K=1   RF K=1   RF K=3   RF K=5   RF K=Log N
10      2.5¹      3.8      2.9      3.0      2.9
50      2.2¹      3.8      3.0      3.1      2.9
100     2.1¹      3.8      2.8      3.0      3.2
200     2.3¹      3.9      3.0      3.0      2.7
(¹ marks the best rank in each row)

Conclusions

BRS clearly has a better ranking across the different sizes of ensembles and the different K values employed by RF.

The K value with best ranking for RF changes with the size of the ensemble.

There are statistically significant differences for RF with K=1.

ECSQARU 2009 Verona (Italy) 18/22

Experimental Evaluation

Bias-Variance error decomposition

Average Classification Variance

The ranks assessed by the Friedman test are depicted.
The lower the rank, the better the performance.
Ranks in bold face indicate a statistically significant difference with respect to BRS with K=1 under the Bonferroni test.

Trees   BRS K=1   RF K=1   RF K=3   RF K=5   RF K=Log N
10      2.3¹      3.5      3.2      3.0      3.0
50      2.8¹      2.9      3.0      3.2      3.0
100     2.9       3.0      2.9      3.3      2.8¹
200     2.8       2.4¹     3.0      3.5      3.2
(¹ marks the best rank in each row)

Conclusions

BRS has a better ranking for ensembles with 10 and 50 trees.

For ensembles with 100 and 200 trees, RF has a better ranking, but this depends on the K value.

There are no statistically significant differences under the non-parametric tests.

ECSQARU 2009 Verona (Italy) 19/22

Part IV

Conclusions and Future Works

ECSQARU 2009 Verona (Italy) 20/22

Conclusions and Future Works

Conclusions

We have proposed a new random split operator, inspired by Bayesian methods, to build ensembles of classification trees.

This operator has some nice features:
It allows a proper random selection of the split node.
The stop-branching criterion is probabilistically managed.

It seems to be less sensitive to the K value than Random Forests.

Although this method does not statistically outperform Random Forests, it consistently has better average error, bias and variance for the different evaluated numbers of trees.

Future Works

Apply Markov-Chain Monte Carlo methods to generate the trees.

Introduce some Bayesian method to assign a weight to each classification rule.

ECSQARU 2009 Verona (Italy) 21/22

Thanks for your attention!!

Questions?

ECSQARU 2009 Verona (Italy) 22/22