
Page 1:

Decision Tree And Random Forest

Dr. Ammar Mohammed, Associate Professor of Computer Science
ISSR, Cairo University. PhD of CS (Uni. Koblenz-Landau, Germany)

Spring 2019

Contact: [email protected], [email protected]

Page 2:

Decision Tree

A decision tree is a simple classifier in the form of a hierarchical tree structure, which performs supervised classification using a divide-and-conquer strategy.

- The tree starts at a root node; branches correspond to the possible answers to the test at each node.
- A leaf node indicates the class (category).
- Classification is performed by routing from the root node until arriving at a leaf node.
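The slides contain no code, but this root-to-leaf routing can be sketched in a few lines of Python. The nested-dictionary representation, the tennis_tree example, and the classify helper below are illustrative assumptions, not part of the original material:

    # A decision node maps an attribute name to the branches for each answer;
    # a leaf is just the class label (a plain string).
    tennis_tree = {
        "Outlook": {
            "overcast": "Yes",
            "sunny": {"Humidity": {"high": "No", "normal": "Yes"}},
            "rain": {"Windy": {True: "No", False: "Yes"}},
        }
    }

    def classify(tree, example):
        """Route an example from the root to a leaf and return the class."""
        while isinstance(tree, dict):
            attribute = next(iter(tree))                 # attribute tested at this node
            tree = tree[attribute][example[attribute]]   # follow the matching branch
        return tree

    print(classify(tennis_tree, {"Outlook": "sunny", "Humidity": "normal", "Windy": False}))  # Yes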

Page 3:

Decision Tree: Example

A decision tree for determining whether to play tennis:

- Root node: Outlook (sunny / overcast / rain).
- Outlook = overcast -> Yes.
- Outlook = sunny -> test Humidity: high -> No, normal -> Yes.
- Outlook = rain -> test Windy: true -> No, false -> Yes.

Equivalent rules:
(Outlook == overcast) -> yes
(Outlook == rain) and (Windy == false) -> yes
(Outlook == sunny) and (Humidity == normal) -> yes

Page 4:

Decision Tree: Training Dataset (Training Example)

age      income  student  credit_rating  buys_computer
<=30     high    no       fair           no
<=30     high    no       excellent      no
31..40   high    no       fair           yes
>40      medium  no       fair           yes
>40      low     yes      fair           yes
>40      low     yes      excellent      no
31..40   low     yes      excellent      yes
<=30     medium  no       fair           no
<=30     low     yes      fair           yes
>40      medium  yes      fair           yes
<=30     medium  yes      excellent      yes
31..40   medium  no       excellent      yes
31..40   high    yes      fair           yes
>40      medium  no       excellent      no

Page 5:

Decision Tree: Training Dataset

Output: a decision tree for "buys_computer":

- Root node: age?
  - age <= 30  -> test student?        (no -> no, yes -> yes)
  - age 31..40 -> yes
  - age > 40   -> test credit rating?  (excellent -> no, fair -> yes)

Page 6:

Decision Tree: How to Build the Tree

Basic algorithm (a greedy algorithm):
- The tree is constructed in a top-down, recursive, divide-and-conquer manner.
- At the start, all the training examples are at the root.
- Attributes are categorical (if continuous-valued, they are discretized in advance).
- Examples are partitioned recursively based on the best selected attributes.
- Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain).

Conditions for stopping partitioning:
- All samples for a given node belong to the same class.
- There are no remaining attributes for further partitioning (majority voting is employed for classifying the leaf).
- There are no samples left.

Page 7:

Example

Page 8:

Example: possible answer

[Figure: one possible decision tree for this example. The root splits on EyeColor (brown / blue); the subtrees test Married (yes / no) and HairLength (long / short), with leaves labeled football or Netball.]

Page 9:

Example: Another answer

[Figure: a much simpler tree. The root splits on Sex: female -> Netball, male -> football.]

The question is:
I. What is the best tree?
II. Which attribute should be selected for splitting the branches?

Page 10:

How to Construct a Tree

Algorithm: BuildSubtree
Required: Node n, Data D
1: (n→split, D_L, D_R) = FindBestSplit(D)
2: if StoppingCriteria(D_L) then
3:     n→left_Prediction = FindPrediction(D_L)
4: else
5:     BuildSubtree(n→left, D_L)
6: if StoppingCriteria(D_R) then
7:     n→right_Prediction = FindPrediction(D_R)
8: else
9:     BuildSubtree(n→right, D_R)
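A rough Python rendering of this recursion, as a sketch only: the dictionary node layout and the find_prediction / stopping_criteria / build_subtree names are assumptions mirroring the pseudocode, and find_best_split is left to the caller (it could maximise information gain or minimise the Gini index):

    from collections import Counter

    def find_prediction(data):
        # Majority class label of the (features, label) pairs in this partition.
        labels = [label for _, label in data]
        return Counter(labels).most_common(1)[0][0] if labels else None

    def stopping_criteria(data):
        # Stop when the partition is empty or pure (a single class remains).
        return len({label for _, label in data}) <= 1

    def build_subtree(data, find_best_split):
        # find_best_split(data) is assumed to return (test, left_data, right_data).
        if stopping_criteria(data):
            return {"prediction": find_prediction(data)}
        test, left, right = find_best_split(data)
        return {"split": test,
                "left": build_subtree(left, find_best_split),
                "right": build_subtree(right, find_best_split)}

Unlike the listing above, this sketch tests the stopping criterion on the node's own data before splitting, which is a common equivalent formulation; the per-child variant is a direct transliteration of lines 2-9.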

Page 11:

Information, Entropy and Impurity

- Entropy is a measure of the disorder or unpredictability in a system. - Given a binary (two-class) classification, C, and a set of examples, S,the class distribution at any node can be written as (p

0 , p

1 ), where p

1 =1 - p

0 ,

and the entropy of S is the sum of the information:Entropy of : E(S) = - p

0 log

2 p

0 - p1 log

2 p

1

Attribute A

Attribute A

Attribute A

5 5 6 4 0 10

Entropy =1Highest impurityUseless classifier

Entropy =0.97High impurityPoor classifier

Entropy =0Impurity= 0Good classifier

10 10 10
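The three entropy values above can be reproduced with a short Python helper (a sketch; the entropy function name and count-list interface are assumptions, not from the slides):

    from math import log2

    def entropy(counts):
        """Entropy of a class distribution given as a list of class counts."""
        total = sum(counts)
        probs = [c / total for c in counts if c > 0]   # 0 * log2(0) is taken as 0
        return sum(-p * log2(p) for p in probs)

    print(round(entropy([5, 5]), 3))   # 1.0   -> highest impurity
    print(round(entropy([6, 4]), 3))   # 0.971 -> high impurity
    print(round(entropy([0, 10]), 3))  # 0.0   -> pure node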

Page 12:

Information, Entropy and Impurity

In the general case, the target attribute can take on m different values (multi-way split), and the entropy of T (one feature) relative to this m-wise classification is given by

  E(T) = -\sum_{i=1}^{m} p_i \log_2(p_i)

Entropy of two features: for each value c of a feature X, weight the conditional entropy E(Y | X = c) by P(X = c). This measures how much a given attribute X tells us about the class Y.

Example:

  X     Y
  AI    Yes
  Math  No
  CS    Yes
  AI    No
  AI    No
  CS    Yes
  AI    Yes
  Math  No

  P(X=AI) = 0.5, P(X=Math) = 0.25, P(X=CS) = 0.25

  E(Y) = -1/2 log2(1/2) - 1/2 log2(1/2) = 1 bit
  E(Y | X=AI)   = -2/4 log2(2/4) - 2/4 log2(2/4) = 1 bit
  E(Y | X=Math) = -2/2 log2(2/2) - 0/2 log2(0/2) = 0 bit
  E(Y | X=CS)   = -0/2 log2(0/2) - 2/2 log2(2/2) = 0 bit

  v     P(X=v)  E(Y|X=v)
  AI    0.5     1
  Math  0.25    0
  CS    0.25    0

  E(Y, X) = P(X=AI) E(Y|X=AI) + P(X=Math) E(Y|X=Math) + P(X=CS) E(Y|X=CS)
          = 1*0.5 + 0*0.25 + 0*0.25 = 0.5

Page 13:

Information, Entropy and Impurity

Information gain measures how much a given attribute X tells us about the class Y.

  Entropy(Y, X) = \sum_j P(X = v_j) * E(Y | X = v_j)

The term E(Y | X = v_j) denotes the entropy of the partition of the examples having X = v_j.

  Gain(X) = Entropy(Y) - Entropy(Y, X)

The information gain is based on the decrease in entropy after a dataset is split on an attribute. Constructing a decision tree is all about finding the attribute that returns the highest information gain (i.e., the most homogeneous branches).

Page 14:

Attribute Selection Measure: Information Gain

Select the attribute with the highest information gain.

Let p_i be the probability that an arbitrary tuple in D (a data set) belongs to class C_i, estimated by |C_{i,D}| / |D|.

Expected information (entropy) needed to classify a tuple in D:

  Entropy(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)

Information needed (after using attribute A to split D into v partitions) to classify D:

  Entropy(D, A) = \sum_{j=1}^{v} (|D_j| / |D|) * Entropy(D_j)

Information gained by branching on attribute A:

  Gain(A) = Entropy(D) - Entropy(D, A)
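A small Python sketch that applies these formulas to the buys_computer table from Page 4 (the info_gain helper and the list-of-dicts layout are illustrative assumptions, not part of the slides):

    from math import log2
    from collections import Counter

    def entropy(labels):
        total = len(labels)
        return sum(-c / total * log2(c / total) for c in Counter(labels).values())

    def info_gain(rows, labels, attribute):
        """Gain(A) = Entropy(D) - sum_j |Dj|/|D| * Entropy(Dj)."""
        total = len(labels)
        partitions = {}
        for row, label in zip(rows, labels):
            partitions.setdefault(row[attribute], []).append(label)
        remainder = sum(len(part) / total * entropy(part) for part in partitions.values())
        return entropy(labels) - remainder

    ages = ["<=30", "<=30", "31..40", ">40", ">40", ">40", "31..40",
            "<=30", "<=30", ">40", "<=30", "31..40", "31..40", ">40"]
    buys = ["no", "no", "yes", "yes", "yes", "no", "yes",
            "no", "yes", "yes", "yes", "yes", "yes", "no"]
    rows = [{"age": a} for a in ages]
    print(round(info_gain(rows, buys, "age"), 3))  # 0.247 (the slides round to 0.246)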

Page 15:

Example:

Page 16:

Example: Answer

  Entropy(D) = 0.940 bits
  Entropy(D, age) = 0.694 bits
  Gain(age) = Entropy(D) - Entropy(D, age) = 0.940 - 0.694 = 0.246 bits

Similarly, we can compute Gain(income) = 0.029 bits, Gain(student) = 0.151 bits, Gain(credit_rating) = 0.048 bits.

Because age has the highest information gain among the attributes, it is selected as the splitting attribute.

Page 17:

Attribute Selection: Information Gain

The attribute age has the highest information gain and therefore becomes the splitting attribute at the root node of the decision tree. Branches are grown for each outcome of age, and the tuples are partitioned accordingly.

Page 18:

Information Gain for Continuous-Valued Attributes

- Let attribute A be a continuous-valued attribute.
- We must determine the best split point for A:
  - Sort the values of A in increasing order.
  - Typically, the midpoint between each pair of adjacent values is considered as a possible split point: (a_i + a_{i+1}) / 2 is the midpoint between the values of a_i and a_{i+1}.
  - The point with the minimum expected information requirement for A is selected as the split point for A.
- Split: D1 is the set of tuples in D satisfying A <= split-point, and D2 is the set of tuples in D satisfying A > split-point.
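A sketch of this procedure in Python (illustrative only; the best_split_point helper and the toy values are assumptions, not from the slides):

    from math import log2
    from collections import Counter

    def entropy(labels):
        n = len(labels)
        return sum(-c / n * log2(c / n) for c in Counter(labels).values())

    def best_split_point(values, labels):
        """Try the midpoint between each pair of adjacent sorted values."""
        pairs = sorted(zip(values, labels))
        best = None
        for i in range(len(pairs) - 1):
            if pairs[i][0] == pairs[i + 1][0]:
                continue                                   # identical values: no midpoint
            split = (pairs[i][0] + pairs[i + 1][0]) / 2
            left = [l for v, l in pairs if v <= split]     # D1: A <= split-point
            right = [l for v, l in pairs if v > split]     # D2: A > split-point
            info = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
            if best is None or info < best[1]:
                best = (split, info)
        return best

    print(best_split_point([48, 60, 72, 80, 90], ["no", "no", "yes", "yes", "yes"]))
    # (66.0, 0.0): splitting at 66 needs zero expected information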

Page 19:

Gain Ratio for Attribute Selection (C4.5)

- Information gain is biased towards attributes with a large number of values.
- The C4.5 decision tree algorithm uses the gain ratio to overcome this problem (a normalization of the information gain). The gain ratio is the ratio of the information gain to the intrinsic information (split info):

  SplitInfo_A(D) = -\sum_{j=1}^{v} (|D_j| / |D|) \times \log_2(|D_j| / |D|)

  GainRatio(A) = Gain(A) / SplitInfo_A(D)

- Example (income splits D into partitions of size 4, 6, and 4):

  SplitInfo_income(D) = -(4/14) \log_2(4/14) - (6/14) \log_2(6/14) - (4/14) \log_2(4/14) = 1.557

  gain_ratio(income) = 0.029 / 1.557 = 0.019

- The attribute with the maximum gain ratio is selected as the splitting attribute.
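The split info can be checked with a few lines of Python (a sketch; the split_info helper is an assumption, and the 4/6/4 sizes are the income counts from the Page 4 table):

    from math import log2

    def split_info(partition_sizes):
        total = sum(partition_sizes)
        return sum(-(s / total) * log2(s / total) for s in partition_sizes)

    si = split_info([4, 6, 4])      # income splits D into |high|=4, |medium|=6, |low|=4
    print(round(si, 3))             # 1.557
    print(round(0.029 / si, 3))     # gain_ratio(income) = 0.019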

Page 20:

Gini Index (IBM IntelligentMiner)

If a data set D contains examples from n classes, the Gini index, gini(D), is defined as

  gini(D) = 1 - \sum_{j=1}^{n} p_j^2

where p_j is the relative frequency of class j in D.

If a data set D is split on a feature A into two subsets D1 and D2, the Gini index gini_A(D) is defined as

  gini_A(D) = (|D1| / |D|) gini(D1) + (|D2| / |D|) gini(D2)

Reduction in impurity:

  \Delta gini(A) = gini(D) - gini_A(D)

The attribute that provides the smallest gini_split(D) (or, equivalently, the largest reduction in impurity) is chosen to split the node (we need to enumerate all the possible splitting points for each attribute).

Page 21:

Gini Index

Example: D has 9 tuples with buys_computer = "yes" and 5 with "no":

  gini(D) = 1 - (9/14)^2 - (5/14)^2 = 0.459

Suppose the attribute income partitions D into 10 tuples in D1: {low, medium} and 4 tuples in D2: {high}:

  gini_{income in {low,medium}}(D) = (10/14) gini(D1) + (4/14) gini(D2) = 0.443

Similarly, gini_{low,high} is 0.458 and gini_{medium,high} is 0.450, so the split into {low, medium} and {high} is the best since it has the lowest Gini index.

- All attributes are assumed continuous-valued.
- May need other tools, e.g., clustering, to get the possible split values.
- Can be modified for categorical attributes.
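A short Python sketch that reproduces these values from the class counts in the Page 4 table (the gini and gini_split helpers are illustrative assumptions):

    def gini(counts):
        total = sum(counts)
        return 1 - sum((c / total) ** 2 for c in counts)

    def gini_split(groups):
        """Weighted Gini index of a partition given as lists of class counts."""
        total = sum(sum(g) for g in groups)
        return sum(sum(g) / total * gini(g) for g in groups)

    print(round(gini([9, 5]), 3))                  # gini(D) = 0.459
    # income in {low, medium} -> (7 yes, 3 no); income in {high} -> (2 yes, 2 no)
    print(round(gini_split([[7, 3], [2, 2]]), 3))  # 0.443
    # income in {medium, high} -> (6 yes, 4 no); income in {low} -> (3 yes, 1 no)
    print(round(gini_split([[6, 4], [3, 1]]), 3))  # 0.450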

Page 22:

Comparing Attribute Selection Measures

The three measures, in general, return good results, but:
- Information gain: biased towards multivalued attributes.
- Gain ratio: tends to prefer unbalanced splits in which one partition is much smaller than the others.
- Gini index: biased towards multivalued attributes; has difficulty when the number of classes is large; tends to favor tests that result in equal-sized partitions and purity in both partitions.

Page 23:

Decision Tree Over-fitting

Overfitting: an induced tree may overfit the training data.

• A decision tree can perfectly fit any training data.
• Zero bias, high variance.
• Too many branches, some of which may reflect anomalies due to noise or outliers.
• Poor accuracy on unseen samples.

Two approaches:
1. Stop growing the tree when further splitting the data does not yield an improvement.
2. Grow a full tree, then prune the tree by eliminating nodes.

Page 24:

Exercise

1. Suppose that the probabilities of five events are P(1) = 0.5 and P(2) = P(3) = P(4) = P(5) = 0.125. Calculate the entropy.
2. Three binary nodes, N1, N2, and N3, split examples into (0, 6), (1, 5), and (3, 3), respectively. For each node, calculate its entropy and Gini impurity.
3. Build a decision tree that computes the logical AND function.
4. On the side is a list of everything someone has done in the past 10 days. Which feature do you use as the starting (root) node? For this you need to compute the entropy and then find out which feature has the maximal information gain.

Page 25:

Bagging

• Bagging (bootstrap aggregation) is a technique for reducing the variance of an estimated prediction function.
• For classification, a committee of trees is built and each tree casts a vote for the predicted class.

Page 26:

Bagging

[Figure: create bootstrap samples from the training data (M features).]

Page 27:

Bagging

[Figure: from N examples and M features, construct a decision tree for each bootstrap sample, then take the majority vote over the trees.]
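A minimal Python sketch of this procedure (illustrative assumptions: train_tree is any function that learns a tree from a sample and returns a callable predictor; bagging_fit and bagging_predict are hypothetical names, not from the slides):

    import random
    from collections import Counter

    def bagging_fit(data, train_tree, n_trees=25, seed=0):
        """Train n_trees trees, each on a bootstrap sample drawn with replacement."""
        rng = random.Random(seed)
        trees = []
        for _ in range(n_trees):
            bootstrap = [rng.choice(data) for _ in range(len(data))]
            trees.append(train_tree(bootstrap))
        return trees

    def bagging_predict(trees, example):
        """The committee of trees votes; the majority class wins."""
        votes = [tree(example) for tree in trees]
        return Counter(votes).most_common(1)[0][0]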

Page 28:

Weakness of Bagging

- Bagging gave a distinct advance in machine learning when it was introduced.
- Analyzing the details of many models, however, it was found that the trees in the bagger were too similar to each other.
- This calls for a way to make the trees dramatically more different.
- The new idea was to introduce randomness not just into the training samples but also into the actual tree growing as well.

Introducing Randomness

- Suppose that instead of always picking the best splitter we picked the splitter at random.
- This would guarantee that different trees would be quite dissimilar to each other.

Page 29:

Random Forests

[Figure: training data with N examples and M features.]

Random forests are one of the most powerful, fully automated machine learning techniques.

Random forest is a tool that leverages the power of many decision trees, judicious randomization, and ensemble learning to produce astonishingly accurate predictive models.

Page 30:

Random Forest

[Figure: as in bagging, create bootstrap samples from the training data (N examples, M features).]

Page 31:

Random Forest

[Figure: a tree is grown on each bootstrap sample (N examples, M features).]

At each node, when choosing the split feature, choose only among m < M features.

Page 32:

Random Forest

[Figure: each of the trees casts a vote; take the majority vote as the prediction.]

Page 33:

Random Forest Algorithm

Given a training set S:
For i = 1 to k do:
    Build subset S_i by sampling with replacement from S
    Learn tree T_i from S_i
        At each node: choose the best split from a random subset of F features
        Each tree grows to the largest extent; no pruning
Make predictions according to the majority vote of the set of k trees.
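This is essentially the recipe implemented by scikit-learn's RandomForestClassifier (bootstrap samples, a random feature subset considered at each split, and votes aggregated across the trees). A brief usage sketch, assuming scikit-learn is installed and using the built-in iris data purely as an example:

    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # k = 100 trees, each grown on a bootstrap sample; at each node only
    # max_features="sqrt" of the M features are considered for the split.
    forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
    forest.fit(X_train, y_train)
    print(forest.score(X_test, y_test))   # accuracy of the aggregated vote of the 100 trees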

Page 34:

Random Forest vs Bagging

Pros:
• It is more robust.
• It is faster to train (no reweighting; each split considers only a small subset of the data and features).

Cons:
• The feature selection process is not explicit.
• Feature fusion is also less obvious.
• It has weaker performance on small training data sets.