
Page 1:

Decision Tree And Random Forest

Dr. Ammar Mohammed, Associate Professor of Computer Science
ISSR, Cairo University. PhD of CS (Uni. Koblenz-Landau, Germany)

Spring 2019

Contact: [email protected], [email protected]

Page 2:

Decision Tree

A decision tree is a simple classifier in the form of a hierarchical tree structure, which performs supervised classification using a divide-and-conquer strategy.

- The tree starts at a root node; branches correspond to the possible answers to the test at each node.
- A leaf node indicates the class (category).
- Classification is performed by routing from the root node until arriving at a leaf node.
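The slides contain no code, but this root-to-leaf routing can be sketched in a few lines of Python. The nested-dictionary representation, the tennis_tree example, and the classify helper below are illustrative assumptions, not part of the original material:

    # A decision node maps an attribute name to the branches for each answer;
    # a leaf is just the class label (a plain string).
    tennis_tree = {
        "Outlook": {
            "overcast": "Yes",
            "sunny": {"Humidity": {"high": "No", "normal": "Yes"}},
            "rain": {"Windy": {True: "No", False: "Yes"}},
        }
    }

    def classify(tree, example):
        """Route an example from the root to a leaf and return the class."""
        while isinstance(tree, dict):
            attribute = next(iter(tree))                 # attribute tested at this node
            tree = tree[attribute][example[attribute]]   # follow the matching branch
        return tree

    print(classify(tennis_tree, {"Outlook": "sunny", "Humidity": "normal", "Windy": False}))  # Yes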

Page 3:

Decision Tree: Example

A decision tree for determining whether to play tennis:

- Root node: Outlook (sunny / overcast / rain).
- Outlook = overcast -> Yes.
- Outlook = sunny -> test Humidity: high -> No, normal -> Yes.
- Outlook = rain -> test Windy: true -> No, false -> Yes.

Equivalent rules:
(Outlook == overcast) -> yes
(Outlook == rain) and (Windy == false) -> yes
(Outlook == sunny) and (Humidity == normal) -> yes

Page 4:

Decision Tree: Training Dataset (Training Example)

age      income  student  credit_rating  buys_computer
<=30     high    no       fair           no
<=30     high    no       excellent      no
31..40   high    no       fair           yes
>40      medium  no       fair           yes
>40      low     yes      fair           yes
>40      low     yes      excellent      no
31..40   low     yes      excellent      yes
<=30     medium  no       fair           no
<=30     low     yes      fair           yes
>40      medium  yes      fair           yes
<=30     medium  yes      excellent      yes
31..40   medium  no       excellent      yes
31..40   high    yes      fair           yes
>40      medium  no       excellent      no

Page 5:

Decision Tree: Training Dataset

Output: a decision tree for "buys_computer":

- Root node: age?
  - age <= 30  -> test student?        (no -> no, yes -> yes)
  - age 31..40 -> yes
  - age > 40   -> test credit rating?  (excellent -> no, fair -> yes)

Page 6:

Decision Tree: How to Build the Tree

Basic algorithm (a greedy algorithm):
- The tree is constructed in a top-down, recursive, divide-and-conquer manner.
- At the start, all the training examples are at the root.
- Attributes are categorical (if continuous-valued, they are discretized in advance).
- Examples are partitioned recursively based on the best selected attributes.
- Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain).

Conditions for stopping partitioning:
- All samples for a given node belong to the same class.
- There are no remaining attributes for further partitioning (majority voting is employed for classifying the leaf).
- There are no samples left.

Page 7:

Example

Page 8:

Example: possible answer

[Figure: one possible decision tree for this example. The root splits on EyeColor (brown / blue); the subtrees test Married (yes / no) and HairLength (long / short), with leaves labeled football or Netball.]

Page 9:

Example: Another answer

[Figure: a much simpler tree. The root splits on Sex: female -> Netball, male -> football.]

The question is:
I. What is the best tree?
II. Which attribute should be selected for splitting the branches?

Page 10:

How to Construct a Tree

Algorithm: BuildSubtree
Required: Node n, Data D
1: (n→split, D_L, D_R) = FindBestSplit(D)
2: if StoppingCriteria(D_L) then
3:     n→left_Prediction = FindPrediction(D_L)
4: else
5:     BuildSubtree(n→left, D_L)
6: if StoppingCriteria(D_R) then
7:     n→right_Prediction = FindPrediction(D_R)
8: else
9:     BuildSubtree(n→right, D_R)
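A rough Python rendering of this recursion, as a sketch only: the dictionary node layout and the find_prediction / stopping_criteria / build_subtree names are assumptions mirroring the pseudocode, and find_best_split is left to the caller (it could maximise information gain or minimise the Gini index):

    from collections import Counter

    def find_prediction(data):
        # Majority class label of the (features, label) pairs in this partition.
        labels = [label for _, label in data]
        return Counter(labels).most_common(1)[0][0] if labels else None

    def stopping_criteria(data):
        # Stop when the partition is empty or pure (a single class remains).
        return len({label for _, label in data}) <= 1

    def build_subtree(data, find_best_split):
        # find_best_split(data) is assumed to return (test, left_data, right_data).
        if stopping_criteria(data):
            return {"prediction": find_prediction(data)}
        test, left, right = find_best_split(data)
        return {"split": test,
                "left": build_subtree(left, find_best_split),
                "right": build_subtree(right, find_best_split)}

Unlike the listing above, this sketch tests the stopping criterion on the node's own data before splitting, which is a common equivalent formulation; the per-child variant is a direct transliteration of lines 2-9.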

Page 11:

Information, Entropy and Impurity

- Entropy is a measure of the disorder or unpredictability in a system. - Given a binary (two-class) classification, C, and a set of examples, S,the class distribution at any node can be written as (p

0 , p

1 ), where p

1 =1 - p

0 ,

and the entropy of S is the sum of the information:Entropy of : E(S) = - p

0 log

2 p

0 - p1 log

2 p

1

Attribute A

Attribute A

Attribute A

5 5 6 4 0 10

Entropy =1Highest impurityUseless classifier

Entropy =0.97High impurityPoor classifier

Entropy =0Impurity= 0Good classifier

10 10 10
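The three entropy values above can be reproduced with a short Python helper (a sketch; the entropy function name and count-list interface are assumptions, not from the slides):

    from math import log2

    def entropy(counts):
        """Entropy of a class distribution given as a list of class counts."""
        total = sum(counts)
        probs = [c / total for c in counts if c > 0]   # 0 * log2(0) is taken as 0
        return sum(-p * log2(p) for p in probs)

    print(round(entropy([5, 5]), 3))   # 1.0   -> highest impurity
    print(round(entropy([6, 4]), 3))   # 0.971 -> high impurity
    print(round(entropy([0, 10]), 3))  # 0.0   -> pure node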

Page 12:

Information, Entropy and Impurity

In the general case, the target attribute can take on m different values (multi-way split), and the entropy of T (one feature) relative to this m-wise classification is given by

  E(T) = -\sum_{i=1}^{m} p_i \log_2(p_i)

Entropy of two features: for each value c of a feature X, weight the conditional entropy E(Y | X = c) by P(X = c). This measures how much a given attribute X tells us about the class Y.

Example:

  X     Y
  AI    Yes
  Math  No
  CS    Yes
  AI    No
  AI    No
  CS    Yes
  AI    Yes
  Math  No

  P(X=AI) = 0.5, P(X=Math) = 0.25, P(X=CS) = 0.25

  E(Y) = -1/2 log2(1/2) - 1/2 log2(1/2) = 1 bit
  E(Y | X=AI)   = -2/4 log2(2/4) - 2/4 log2(2/4) = 1 bit
  E(Y | X=Math) = -2/2 log2(2/2) - 0/2 log2(0/2) = 0 bit
  E(Y | X=CS)   = -0/2 log2(0/2) - 2/2 log2(2/2) = 0 bit

  v     P(X=v)  E(Y|X=v)
  AI    0.5     1
  Math  0.25    0
  CS    0.25    0

  E(Y, X) = P(X=AI) E(Y|X=AI) + P(X=Math) E(Y|X=Math) + P(X=CS) E(Y|X=CS)
          = 1*0.5 + 0*0.25 + 0*0.25 = 0.5

Page 13:

Information, Entropy and Impurity

Information gain measures how much a given attribute X tells us about the class Y.

  Entropy(Y, X) = \sum_j P(X = v_j) * E(Y | X = v_j)

The term E(Y | X = v_j) denotes the entropy of the partition of the examples having X = v_j.

  Gain(X) = Entropy(Y) - Entropy(Y, X)

The information gain is based on the decrease in entropy after a dataset is split on an attribute. Constructing a decision tree is all about finding the attribute that returns the highest information gain (i.e., the most homogeneous branches).

Page 14:

Attribute Selection Measure: Information Gain

Select the attribute with the highest information gain.

Let p_i be the probability that an arbitrary tuple in D (a data set) belongs to class C_i, estimated by |C_{i,D}| / |D|.

Expected information (entropy) needed to classify a tuple in D:

  Entropy(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)

Information needed (after using attribute A to split D into v partitions) to classify D:

  Entropy(D, A) = \sum_{j=1}^{v} (|D_j| / |D|) * Entropy(D_j)

Information gained by branching on attribute A:

  Gain(A) = Entropy(D) - Entropy(D, A)
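A small Python sketch that applies these formulas to the buys_computer table from Page 4 (the info_gain helper and the list-of-dicts layout are illustrative assumptions, not part of the slides):

    from math import log2
    from collections import Counter

    def entropy(labels):
        total = len(labels)
        return sum(-c / total * log2(c / total) for c in Counter(labels).values())

    def info_gain(rows, labels, attribute):
        """Gain(A) = Entropy(D) - sum_j |Dj|/|D| * Entropy(Dj)."""
        total = len(labels)
        partitions = {}
        for row, label in zip(rows, labels):
            partitions.setdefault(row[attribute], []).append(label)
        remainder = sum(len(part) / total * entropy(part) for part in partitions.values())
        return entropy(labels) - remainder

    ages = ["<=30", "<=30", "31..40", ">40", ">40", ">40", "31..40",
            "<=30", "<=30", ">40", "<=30", "31..40", "31..40", ">40"]
    buys = ["no", "no", "yes", "yes", "yes", "no", "yes",
            "no", "yes", "yes", "yes", "yes", "yes", "no"]
    rows = [{"age": a} for a in ages]
    print(round(info_gain(rows, buys, "age"), 3))  # 0.247 (the slides round to 0.246)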

Page 15:

Example:

Page 16:

Example: Answer

  Entropy(D) = 0.940 bits
  Entropy(D, age) = 0.694 bits
  Gain(age) = Entropy(D) - Entropy(D, age) = 0.940 - 0.694 = 0.246 bits

Similarly, we can compute Gain(income) = 0.029 bits, Gain(student) = 0.151 bits, Gain(credit_rating) = 0.048 bits.

Because age has the highest information gain among the attributes, it is selected as the splitting attribute.

Page 17:

Attribute Selection: Information Gain

The attribute age has the highest information gain and therefore becomes the splitting attribute at the root node of the decision tree. Branches are grown for each outcome of age, and the tuples are partitioned accordingly.

Page 18:

Information Gain for Continuous-Valued Attributes

- Let attribute A be a continuous-valued attribute.
- We must determine the best split point for A:
  - Sort the values of A in increasing order.
  - Typically, the midpoint between each pair of adjacent values is considered as a possible split point: (a_i + a_{i+1}) / 2 is the midpoint between the values of a_i and a_{i+1}.
  - The point with the minimum expected information requirement for A is selected as the split point for A.
- Split: D1 is the set of tuples in D satisfying A <= split-point, and D2 is the set of tuples in D satisfying A > split-point.
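A sketch of this procedure in Python (illustrative only; the best_split_point helper and the toy values are assumptions, not from the slides):

    from math import log2
    from collections import Counter

    def entropy(labels):
        n = len(labels)
        return sum(-c / n * log2(c / n) for c in Counter(labels).values())

    def best_split_point(values, labels):
        """Try the midpoint between each pair of adjacent sorted values."""
        pairs = sorted(zip(values, labels))
        best = None
        for i in range(len(pairs) - 1):
            if pairs[i][0] == pairs[i + 1][0]:
                continue                                   # identical values: no midpoint
            split = (pairs[i][0] + pairs[i + 1][0]) / 2
            left = [l for v, l in pairs if v <= split]     # D1: A <= split-point
            right = [l for v, l in pairs if v > split]     # D2: A > split-point
            info = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
            if best is None or info < best[1]:
                best = (split, info)
        return best

    print(best_split_point([48, 60, 72, 80, 90], ["no", "no", "yes", "yes", "yes"]))
    # (66.0, 0.0): splitting at 66 needs zero expected information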

Page 19:

Gain Ratio for Attribute Selection (C4.5)

- Information gain is biased towards attributes with a large number of values.
- The C4.5 decision tree algorithm uses the gain ratio to overcome this problem (a normalization of the information gain). The gain ratio is the ratio of the information gain to the intrinsic information (split info):

  SplitInfo_A(D) = -\sum_{j=1}^{v} (|D_j| / |D|) \times \log_2(|D_j| / |D|)

  GainRatio(A) = Gain(A) / SplitInfo_A(D)

- Example (income splits D into partitions of size 4, 6, and 4):

  SplitInfo_income(D) = -(4/14) \log_2(4/14) - (6/14) \log_2(6/14) - (4/14) \log_2(4/14) = 1.557

  gain_ratio(income) = 0.029 / 1.557 = 0.019

- The attribute with the maximum gain ratio is selected as the splitting attribute.
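The split info can be checked with a few lines of Python (a sketch; the split_info helper is an assumption, and the 4/6/4 sizes are the income counts from the Page 4 table):

    from math import log2

    def split_info(partition_sizes):
        total = sum(partition_sizes)
        return sum(-(s / total) * log2(s / total) for s in partition_sizes)

    si = split_info([4, 6, 4])      # income splits D into |high|=4, |medium|=6, |low|=4
    print(round(si, 3))             # 1.557
    print(round(0.029 / si, 3))     # gain_ratio(income) = 0.019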

Page 20:

Gini Index (IBM IntelligentMiner)

If a data set D contains examples from n classes, the Gini index, gini(D), is defined as

  gini(D) = 1 - \sum_{j=1}^{n} p_j^2

where p_j is the relative frequency of class j in D.

If a data set D is split on a feature A into two subsets D1 and D2, the Gini index gini_A(D) is defined as

  gini_A(D) = (|D1| / |D|) gini(D1) + (|D2| / |D|) gini(D2)

Reduction in impurity:

  \Delta gini(A) = gini(D) - gini_A(D)

The attribute that provides the smallest gini_split(D) (or, equivalently, the largest reduction in impurity) is chosen to split the node (we need to enumerate all the possible splitting points for each attribute).

Page 21:

Gini Index

Example: D has 9 tuples with buys_computer = "yes" and 5 with "no":

  gini(D) = 1 - (9/14)^2 - (5/14)^2 = 0.459

Suppose the attribute income partitions D into 10 tuples in D1: {low, medium} and 4 tuples in D2: {high}:

  gini_{income in {low,medium}}(D) = (10/14) gini(D1) + (4/14) gini(D2) = 0.443

Similarly, gini_{low,high} is 0.458 and gini_{medium,high} is 0.450, so the split into {low, medium} and {high} is the best since it has the lowest Gini index.

- All attributes are assumed continuous-valued.
- May need other tools, e.g., clustering, to get the possible split values.
- Can be modified for categorical attributes.
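A short Python sketch that reproduces these values from the class counts in the Page 4 table (the gini and gini_split helpers are illustrative assumptions):

    def gini(counts):
        total = sum(counts)
        return 1 - sum((c / total) ** 2 for c in counts)

    def gini_split(groups):
        """Weighted Gini index of a partition given as lists of class counts."""
        total = sum(sum(g) for g in groups)
        return sum(sum(g) / total * gini(g) for g in groups)

    print(round(gini([9, 5]), 3))                  # gini(D) = 0.459
    # income in {low, medium} -> (7 yes, 3 no); income in {high} -> (2 yes, 2 no)
    print(round(gini_split([[7, 3], [2, 2]]), 3))  # 0.443
    # income in {medium, high} -> (6 yes, 4 no); income in {low} -> (3 yes, 1 no)
    print(round(gini_split([[6, 4], [3, 1]]), 3))  # 0.450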

Page 22:

Comparing Attribute Selection Measures

The three measures, in general, return good results, but:
- Information gain: biased towards multivalued attributes.
- Gain ratio: tends to prefer unbalanced splits in which one partition is much smaller than the others.
- Gini index: biased towards multivalued attributes; has difficulty when the number of classes is large; tends to favor tests that result in equal-sized partitions and purity in both partitions.

Page 23:

Decision Tree Over-fitting

Overfitting: an induced tree may overfit the training data.

• A decision tree can perfectly fit any training data.
• Zero bias, high variance.
• Too many branches, some of which may reflect anomalies due to noise or outliers.
• Poor accuracy on unseen samples.

Two approaches:
1. Stop growing the tree when further splitting the data does not yield an improvement.
2. Grow a full tree, then prune the tree by eliminating nodes.

Page 24:

Exercise

1. Suppose that the probabilities of five events are P(1) = 0.5 and P(2) = P(3) = P(4) = P(5) = 0.125. Calculate the entropy.
2. Three binary nodes, N1, N2, and N3, split examples into (0, 6), (1, 5), and (3, 3), respectively. For each node, calculate its entropy and Gini impurity.
3. Build a decision tree that computes the logical AND function.
4. On the side is a list of everything someone has done in the past 10 days. Which feature do you use as the starting (root) node? For this you need to compute the entropy and then find out which feature has the maximal information gain.

Page 25:

Bagging

• Bagging (bootstrap aggregation) is a technique for reducing the variance of an estimated prediction function.
• For classification, a committee of trees is built and each tree casts a vote for the predicted class.

Page 26:

Bagging

[Figure: create bootstrap samples from the training data (M features).]

Page 27:

Bagging

[Figure: from N examples and M features, construct a decision tree for each bootstrap sample, then take the majority vote over the trees.]
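A minimal Python sketch of this procedure (illustrative assumptions: train_tree is any function that learns a tree from a sample and returns a callable predictor; bagging_fit and bagging_predict are hypothetical names, not from the slides):

    import random
    from collections import Counter

    def bagging_fit(data, train_tree, n_trees=25, seed=0):
        """Train n_trees trees, each on a bootstrap sample drawn with replacement."""
        rng = random.Random(seed)
        trees = []
        for _ in range(n_trees):
            bootstrap = [rng.choice(data) for _ in range(len(data))]
            trees.append(train_tree(bootstrap))
        return trees

    def bagging_predict(trees, example):
        """The committee of trees votes; the majority class wins."""
        votes = [tree(example) for tree in trees]
        return Counter(votes).most_common(1)[0][0]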

Page 28:

Weakness of Bagging

- Bagging gave a distinct advance in machine learning when it was introduced.
- Analyzing the details of many models, however, it was found that the trees in the bagger were too similar to each other.
- This calls for a way to make the trees dramatically more different.
- The new idea was to introduce randomness not just into the training samples but also into the actual tree growing as well.

Introducing Randomness

- Suppose that instead of always picking the best splitter we picked the splitter at random.
- This would guarantee that different trees would be quite dissimilar to each other.

Page 29:

Random Forests

[Figure: training data with N examples and M features.]

Random forests are one of the most powerful, fully automated machine learning techniques.

Random forest is a tool that leverages the power of many decision trees, judicious randomization, and ensemble learning to produce astonishingly accurate predictive models.

Page 30:

Random Forest

[Figure: as in bagging, create bootstrap samples from the training data (N examples, M features).]

Page 31:

Random Forest

[Figure: a tree is grown on each bootstrap sample (N examples, M features).]

At each node, when choosing the split feature, choose only among m < M features.

Page 32:

Random Forest

[Figure: each of the trees casts a vote; take the majority vote as the prediction.]

Page 33:

Random Forest Algorithm

Given a training set S:
For i = 1 to k do:
    Build subset S_i by sampling with replacement from S
    Learn tree T_i from S_i
        At each node: choose the best split from a random subset of F features
        Each tree grows to the largest extent; no pruning
Make predictions according to the majority vote of the set of k trees.
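This is essentially the recipe implemented by scikit-learn's RandomForestClassifier (bootstrap samples, a random feature subset considered at each split, and votes aggregated across the trees). A brief usage sketch, assuming scikit-learn is installed and using the built-in iris data purely as an example:

    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # k = 100 trees, each grown on a bootstrap sample; at each node only
    # max_features="sqrt" of the M features are considered for the split.
    forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
    forest.fit(X_train, y_train)
    print(forest.score(X_test, y_test))   # accuracy of the aggregated vote of the 100 trees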

Page 34:

Random Forest vs Bagging

Pros:
• It is more robust.
• It is faster to train (no reweighting; each split considers only a small subset of the data and features).

Cons:
• The feature selection process is not explicit.
• Feature fusion is also less obvious.
• It has weaker performance on small training data sets.