TRANSCRIPT
Decision Tree Models in Data Mining
Matthew J. Liberatore
Thomas Coghlan
Decision Trees in Data Mining
Decision trees can be used to predict a categorical or a continuous target (called regression trees in the latter case).
Like logistic regression and neural networks, decision trees can be applied for classification and prediction.
Unlike these methods, no equations are estimated. A tree structure of rules over the input variables is used to classify or predict the cases according to the target variable.
The rules are of an IF-THEN form – for example: if Risk = Low, then predict on-time payment of a loan.
Decision Tree Approach
A decision tree represents a hierarchical segmentation of the data.
The original segment is called the root node and is the entire data set.
The root node is partitioned into two or more segments by applying a series of simple rules over an input variable – for example, risk = low, risk = not low.
Each rule assigns the observations to a segment based on its input value.
Each resulting segment can be further partitioned into sub-segments, and so on. For example, risk = low can be partitioned into income = low and income = not low.
The segments are also called nodes, and the final segments are called leaf nodes or leaves.
Decision Tree Example – Loan Payment
[Tree diagram] The root splits on Income. If Income < $30k, the next split is on Age: age < 25 predicts not on-time, age >= 25 predicts on-time. If Income >= $30k, the next split is on Credit Score: score < 600 predicts not on-time, score >= 600 predicts on-time.
Growing the Decision Tree
Growing the tree involves successively partitioning the data – recursive partitioning.
If an input variable is binary, then its two categories can be used to split the data.
If an input variable is interval, a splitting value is used to classify the data into two segments – for example, income < $30k and income >= $30k.
If household income is interval and there are 100 possible incomes in the data set, then there are 100 possible splitting values.
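The search over candidate splitting values for an interval input can be sketched as follows. This is an illustrative sketch with made-up income values, not the Enterprise Miner implementation:

```python
# Sketch: enumerating candidate binary splits for an interval input.
# The income values are hypothetical, for illustration only.
incomes = [22_000, 28_000, 31_000, 45_000, 60_000]

# Each distinct observed value is a candidate splitting value: cases with
# income < threshold go to one segment, income >= threshold to the other.
splits = []
for threshold in sorted(set(incomes)):
    left = [x for x in incomes if x < threshold]
    right = [x for x in incomes if x >= threshold]
    splits.append((threshold, left, right))

# With 5 distinct incomes there are 5 candidate splitting values,
# mirroring the "100 incomes -> 100 splitting values" example.
print(len(splits))
```

Each candidate split would then be scored (e.g., by chi-square for a categorical target) and the best one kept.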
Evaluating the Partitions
When the target is categorical, a chi-square statistic is computed for each partition of an input variable.
A contingency table is formed that maps responders and non-responders against the partitioned input variable.
For example, the null hypothesis might be that there is no difference between people with income < $30k and those with income >= $30k in making an on-time loan payment.
The lower the significance or p-value, the more likely we are to reject this hypothesis, meaning that this income split is a discriminating factor.
Contingency Table

                      income < $30k   income >= $30k   total
payment on-time
payment not on-time
total
Chi-Square Statistic
The chi-square statistic measures how different the number of observations in each of the four cells is from the expected number.
The p-value associated with the null hypothesis is computed.
Enterprise Miner then computes the logworth of the p-value: logworth = -log10(p-value).
The split that generates the highest logworth for a given input variable is selected.
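The chi-square test and logworth calculation can be sketched with SciPy. The cell counts below are hypothetical, chosen only to illustrate the computation:

```python
import math
from scipy.stats import chi2_contingency

# Hypothetical contingency table: rows = payment on-time / not on-time,
# columns = income < $30k / income >= $30k.
table = [[10, 40],   # on-time
         [30, 20]]   # not on-time

chi2, p_value, dof, expected = chi2_contingency(table)

# Logworth as used by Enterprise Miner: larger = stronger evidence
# that the split discriminates between responders and non-responders.
logworth = -math.log10(p_value)
print(f"chi2={chi2:.2f}, p={p_value:.4g}, logworth={logworth:.2f}")
```

A small p-value (large logworth) means the income split separates on-time from not-on-time payers better than chance.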
Growing the Tree
In our loan payment example, we have three interval-valued input variables: income, age, and credit score.
We compute the logworth of the best split for each of these variables.
We then select the variable that has the highest logworth and use its split – suppose it is income.
Under each of the two income nodes, we then find the logworth of the best split of age and credit score, and continue the process – subject to meeting the threshold on the significance of the chi-square value for splitting, and other stopping criteria (described later).
Other Splitting Criteria for a Categorical Target
The gini and entropy measures are based on how heterogeneous the observations at a given node are – that is, the mix of responders and non-responders at the node.
Let p1 and p0 represent the proportions of responders and non-responders at a node, respectively.
If two observations are chosen (with replacement) from a node, the probability that they are either both responders or both non-responders is (p1)^2 + (p0)^2.
The gini index = 1 – [(p1)^2 + (p0)^2], the probability that the two observations are different. The best case is a gini index of 0 (all observations are the same); an index of ½ means both groups are equally represented.
Other Splitting Criteria for a Categorical Target
The rarity of an event is defined as -log2(pi).
Entropy sums up the rarity of response and non-response over all observations at the node.
Entropy ranges from the best case of 0 (all responders or all non-responders) to 1 (an equal mix of responders and non-responders).
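The gini and entropy measures for the binary case can be sketched directly from the definitions above (with p0 = 1 – p1):

```python
import math

def gini(p1: float) -> float:
    """Gini index: probability that two draws (with replacement) differ."""
    p0 = 1.0 - p1
    return 1.0 - (p1**2 + p0**2)

def entropy(p1: float) -> float:
    """Binary entropy: rarity -p*log2(p) summed over both classes."""
    p0 = 1.0 - p1
    return sum(-p * math.log2(p) for p in (p1, p0) if p > 0)

print(gini(0.5), entropy(0.5))   # worst case: maximally mixed node
print(gini(1.0), entropy(1.0))   # best case: pure node
```

At an equal mix (p1 = ½) gini reaches its maximum of ½ and entropy its maximum of 1; a pure node scores 0 on both.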
Splitting Criteria for a Continuous (Interval) Target
An F-statistic is used to measure the degree of separation of a split for an interval target, such as revenue
Similar to the sum of squares discussion under multiple regression, the F-statistic is based on the ratio of the sum of squares between the groups and the sum of squares within groups, both adjusted for the number of degrees of freedom
The null hypothesis is that there is no difference in the target mean between the two groups
As before, the logworth of the p-value is computed
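The F-test for an interval target can be sketched with SciPy's one-way ANOVA. The revenue values below are hypothetical, chosen to show two well-separated segments:

```python
import math
from scipy.stats import f_oneway

# Hypothetical target values (e.g. revenue) in the two segments
# produced by a candidate split.
left  = [10.0, 12.0, 11.5, 9.5]    # e.g. income < $30k
right = [20.0, 22.5, 19.0, 21.0]   # e.g. income >= $30k

# F = (between-group sum of squares / df) / (within-group sum of squares / df);
# the null hypothesis is equal target means in the two groups.
f_stat, p_value = f_oneway(left, right)
logworth = -math.log10(p_value)
print(f"F={f_stat:.2f}, p={p_value:.4g}, logworth={logworth:.2f}")
```

A split that separates the target means well yields a large F, a small p-value, and hence a large logworth.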
Some Adjustments
The more possible splits of an input variable, the less accurate the p-value (the bigger the chance of incorrectly rejecting the null hypothesis).
If there are m possible splits, the Bonferroni adjustment adjusts the p-value of the best split by subtracting log10(m) from its logworth.
If the Time of Kass Adjustment property is set to Before, then the p-values of the splits are compared after the Bonferroni adjustment is applied.
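The Bonferroni adjustment can be sketched in a few lines. The p-value and split count below are hypothetical:

```python
import math

# Suppose the best split of an input variable has p = 0.001 and the
# variable admitted m = 100 candidate splits (hypothetical numbers).
p_best = 0.001
m = 100

logworth = -math.log10(p_best)                 # 3.0
adjusted_logworth = logworth - math.log10(m)   # 3.0 - 2.0 = 1.0

# Subtracting log10(m) from the logworth is equivalent to multiplying
# the p-value by m (the classical Bonferroni correction).
print(adjusted_logworth)
```

Variables with many candidate splits are thus penalized, so they do not win the split selection merely by offering more chances to find a small p-value.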
Some Adjustments
Setting the Split Adjustment property to Yes means that the significance of the p-value is adjusted by the depth of the tree.
For example, at the fourth split, a calculated p-value of 0.04 becomes 0.04 * 2^4 = 0.64, making the split statistically insignificant.
This leads to rejecting more splits, limiting the size of the tree.
Tree growth can also be controlled by setting:
Leaf Size property (minimum number of observations in a leaf)
Split Size property (minimum number of observations required to allow a node to be split)
Maximum Depth property (maximum number of generations of nodes)
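The depth adjustment arithmetic is straightforward; a minimal sketch using the example numbers above:

```python
# Depth-based split adjustment (Split Adjustment property = Yes):
# the raw p-value is scaled by 2**depth before comparison with the
# significance threshold.
alpha = 0.05
raw_p = 0.04
depth = 4                      # the fourth split

adjusted_p = raw_p * 2**depth  # 0.04 * 16 = 0.64
significant = adjusted_p < alpha
print(adjusted_p, significant)
```

A split that looked significant at the root (0.04 < 0.05) is rejected four levels down, which is exactly how this adjustment limits tree size.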
Some Results
The posterior probabilities are the proportions of responders and non-responders at each node.
A node is classified as a responder or non-responder depending on which posterior probability is larger.
In selecting the best tree, one can use Misclassification, Lift, or Average Squared Error.
Creating a Decision Tree Model in Enterprise Miner
Open the bankrupt project, and create a new diagram called Bankrupt_DecTree.
Drag and drop the bankrupt data node and the Decision Tree node (from the Model tab) onto the diagram, and connect the nodes.
Select ProbChisq for the Criterion under Splitting Rule. Change Use Input Once to Yes (otherwise, the same variable can appear more than once in the tree).
Under Subtree, select Misclassification for the Assessment Measure. Keep the defaults under P-Value Adjustment and Output Variables. Under Score, set Variable Selection to No (otherwise, variables with importance values greater than 0.05 are set as rejected and not considered by the tree).
The Decision Tree has only one split, on RE/TA. The misclassification rate is 0.15 (3/20), with 2 false negatives and 1 false positive. The cumulative lift is somewhat lower than the best cumulative lift, starting at 1.777 vs. the best value of 2.000.
Under Subtree, set Method to Largest and rerun. The results show that another split is added, using EBIT/TA. However, the misclassification rate is unchanged at 0.15. This result shows that setting Method to Assessment with Misclassification as the Assessment Measure finds the smallest tree having the lowest misclassification rate.
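For readers without Enterprise Miner, a rough open-source analogue of this workflow can be sketched with scikit-learn. The financial ratios and labels below are synthetic, and scikit-learn offers entropy rather than EM's ProbChisq splitting rule, so this is an illustration of the idea, not a reproduction of the course results:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic RE/TA and EBIT/TA ratios for 8 hypothetical firms.
X = [[-0.20, -0.10], [-0.10, 0.00], [0.00, -0.05], [0.05, 0.02],
     [0.30,  0.10], [0.40,  0.15], [0.25,  0.08], [0.50, 0.20]]
y = [1, 1, 1, 1, 0, 0, 0, 0]   # 1 = bankrupt, 0 = not bankrupt

# max_depth and min_samples_leaf play roughly the roles of the
# Maximum Depth and Leaf Size properties.
tree = DecisionTreeClassifier(criterion="entropy", max_depth=2,
                              min_samples_leaf=1, random_state=0)
tree.fit(X, y)

# Print the fitted IF-THEN rules over the input variables.
print(export_text(tree, feature_names=["RE/TA", "EBIT/TA"]))
```

Because the synthetic classes are perfectly separable, a single split suffices, echoing the one-split RE/TA tree described above.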
Model Comparison
The Model Comparison node under the Assess tab can be used to compare several different models
Create a diagram called Full Model that includes the bankrupt data node connected to the regression, decision tree, and neural network nodes.
Connect the three model nodes to the Model Comparison node, and connect it and the bankrupt_score data node to a Score node.
For Regression, set Selection Model to None; for Neural Network, set Model Selection Criterion to Average Error, and the Network properties as before; for Decision Tree, set the Assessment Measure to Average Squared Error, and the other properties as before. This puts each of the models on a similar basis for fit. For Model Comparison, set the Selection Criterion to Average Squared Error.
Neural Network is selected, although Regression is nearly identical in average squared error. The Receiver Operating Characteristic (ROC) curve shows sensitivity (true positive rate) vs. 1 - specificity (false positive rate) for various cutoff probabilities of a response. The chart shows that no matter what the cutoff probabilities are, regression and neural network classify 100% of responders as responders (sensitivity) and 0% of non-responders as responders (1 - specificity). Decision tree performs reasonably well, as indicated by the area above the diagonal line.
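The ROC computation behind such a chart can be sketched with scikit-learn (this is not Enterprise Miner's chart; the predicted probabilities and labels below are hypothetical):

```python
from sklearn.metrics import roc_curve, roc_auc_score

# Hypothetical true labels and predicted response probabilities.
y_true = [1, 1, 1, 0, 0, 0]
y_prob = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]

# fpr = 1 - specificity, tpr = sensitivity, one point per cutoff.
fpr, tpr, thresholds = roc_curve(y_true, y_prob)
auc = roc_auc_score(y_true, y_prob)   # area under the ROC curve
print(list(zip(fpr, tpr)), auc)
```

Here every responder scores above every non-responder, so the curve hugs the top-left corner and the area under it is 1.0, the ideal case the regression and neural network models approach in the comparison above.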