BOAI: Fast Alternating Decision Tree
Induction based on Bottom-up Evaluation
Bishan Yang, Tengjiao Wang, Dongqing Yang, and Lei Chang
School of EECS, Peking University
Outline
• Motivation
• Related work
• Preliminaries
• Our Approach
• Experimental Results
• Summary
Motivation
• Alternating Decision Tree (ADTree) is an effective decision tree algorithm based on AdaBoost.
  – Highly accurate classifier
  – Small tree size, easy to interpret
  – Provides measures of prediction confidence
• Wide range of applications
  – Customer churn prediction
  – Fraud detection
  – Disease trait modeling
  – ……
Limitation of existing work
• Very expensive to apply to large training sets
  – Takes hours to train on a large number of examples and attributes
• Training time grows exponentially with the size of the data
Related work
• Several techniques have been developed to tackle the efficiency problem for traditional decision trees (ID3, C4.5)
  – SLIQ (EDBT — Mehta et al.)
    • Introduces data structures called attribute list and class list
  – SPRINT (VLDB — J. Shafer et al.)
    • Only constructs attribute lists
  – PUBLIC (VLDB — Rastogi & Shim)
    • Integrates MDL pruning into the tree-building process
  – Rainforest (VLDB — Gehrke, Ramakrishnan & Ganti)
    • Introduces AVC-groups, which are sufficient for split evaluation
  – BOAT (PODS — Gehrke, Ganti, Ramakrishnan & Loh)
    • Builds the tree from a subset of the data using bootstrapping
• These cannot be directly applied to ADTree, which evaluates splits based on information at the current node
Related work
• Several optimizing methods for ADTree
  – PAKDD — Pfahringer et al.
    • Zpure cut-off (cons: little improvement until a large number of iterations is reached)
    • Merging
    • Three heuristic mechanisms (cons: cannot guarantee model quality)
  – ILP — Vanassche et al.
    • Caching optimization (cons: the additional memory consumption grows fast with an increasing number of boosting rounds)
Preliminaries
• Alternating Decision Tree (ADTree) (Freund & Mason, ICML)
• Classification:
  – the sign of the sum of the prediction values along the paths defined by the instance
  – E.g., instance (age, income) = (35, 1300):
  – sign(f(x)) = sign(0.5 − 0.5 + 0.4 + 0.3) = sign(0.7) = +1
[Figure: an example ADTree. The root prediction node is +0.5; decision nodes include "Age < 40" (prediction children −0.5 / +0.2), "Income > 1000" (+0.3 / −0.6), "Income > 1200" (+0.4 / −0.2), and "Age > 30" (−0.1 / +0.1). The legend distinguishes prediction nodes from decision nodes.]
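To make the traversal rule concrete, here is a minimal Python sketch of ADTree scoring. The class names and dictionary-shaped instances are illustrative, not from the paper, and the "Age > 30" decision node is omitted because its attachment point is not recoverable from the transcript (the example instance does not reach it in any case).

```python
# Minimal sketch of ADTree classification (illustrative data structures).

class PredictionNode:
    def __init__(self, value, deciders=None):
        self.value = value              # prediction value added to the score
        self.deciders = deciders or []  # decision nodes attached below

class DecisionNode:
    def __init__(self, test, yes, no):
        self.test = test  # function: instance -> bool
        self.yes = yes    # PredictionNode taken when the test holds
        self.no = no      # PredictionNode taken otherwise

def score(node, x):
    """Sum the prediction values along all paths defined by instance x."""
    s = node.value
    for d in node.deciders:
        s += score(d.yes if d.test(x) else d.no, x)
    return s

# Example from the slide: instance (age, income) = (35, 1300)
root = PredictionNode(0.5, [
    DecisionNode(lambda x: x["age"] < 40,
                 PredictionNode(-0.5, [
                     DecisionNode(lambda x: x["income"] > 1200,
                                  PredictionNode(0.4), PredictionNode(-0.2))]),
                 PredictionNode(0.2)),
    DecisionNode(lambda x: x["income"] > 1000,
                 PredictionNode(0.3), PredictionNode(-0.6)),
])
print(score(root, {"age": 35, "income": 1300}))  # ~0.7 -> sign +1
```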
Preliminaries

Algorithm: ADTree Induction
Input: training set $S = \{(x_1, y_1), \ldots, (x_m, y_m)\}$ with $x_i \in \mathbb{R}^d$, $y_i \in \{-1, +1\}$

For $t = 1$ to $T$ do
  /* Evaluation phase */
  for all $c_1$ such that $c_1 \in P_t$
    for all $c_2$ such that $c_2 \in C$
      calculate $Z_t(c_1, c_2) = 2\big(\sqrt{W_+(c_1 \wedge c_2)\,W_-(c_1 \wedge c_2)} + \sqrt{W_+(c_1 \wedge \neg c_2)\,W_-(c_1 \wedge \neg c_2)}\big) + W(\neg c_1)$
  Select $(c_1, c_2)$ that minimizes $Z_t(c_1, c_2)$
  /* Partition phase */
  $R_{t+1} = R_t \cup \{r_t\}$, where rule $r_t$ has precondition $c_1$, condition $c_2$, and prediction values
    $a = \frac{1}{2}\ln\frac{W_+(c_1 \wedge c_2) + 1}{W_-(c_1 \wedge c_2) + 1}$, $b = \frac{1}{2}\ln\frac{W_+(c_1 \wedge \neg c_2) + 1}{W_-(c_1 \wedge \neg c_2) + 1}$
  Set $P_{t+1} = P_t \cup \{c_1 \wedge c_2, c_1 \wedge \neg c_2\}$
  Update weights: $w_{i,t+1} = w_{i,t}\, e^{-r_t(x_i)\, y_i}$, where $r_t(x_i)$ is the prediction value

Notes:
• $P_t$ is the set of preconditions; $C$ is the set of base conditions, $c = (A \le (v_i + v_{i+1})/2)$ for a numeric attribute or $c = (A = v_i)$ for a categorical one
• $W_+(c)$ (resp. $W_-(c)$) is the total weight of the positive (resp. negative) instances satisfying condition $c$
• Complexity mainly lies in the evaluation phase!
• Weights are increased for misclassified instances and decreased for correctly classified instances
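As a reading aid, the evaluation phase above can be transcribed directly into Python. This is a deliberately naive sketch (function and variable names are mine, not the authors'): it is exactly the brute-force computation whose cost BOAI later attacks.

```python
import math

def weight_sums(data, cond):
    """Total weight of positive / negative instances satisfying cond.
    data: list of (x, y, w) triples with label y in {-1, +1}."""
    wp = sum(w for x, y, w in data if cond(x) and y == +1)
    wn = sum(w for x, y, w in data if cond(x) and y == -1)
    return wp, wn

def z_value(data, c1, c2):
    """Z_t(c1, c2) as defined on the slide; c1, c2 are predicates over x."""
    wp1, wn1 = weight_sums(data, lambda x: c1(x) and c2(x))
    wp2, wn2 = weight_sums(data, lambda x: c1(x) and not c2(x))
    wp3, wn3 = weight_sums(data, lambda x: not c1(x))   # W(not c1)
    return 2 * (math.sqrt(wp1 * wn1) + math.sqrt(wp2 * wn2)) + (wp3 + wn3)

def best_split(data, preconditions, base_conditions):
    """Pick the (c1, c2) pair minimizing Z_t -- the evaluation phase."""
    return min(((c1, c2) for c1 in preconditions for c2 in base_conditions),
               key=lambda pair: z_value(data, *pair))
```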
Our approach — BOAI
• Evaluation phase in top-down evaluation:
  – instances need to be sorted on numeric attributes at each prediction node
  – the weight distribution for all possible splits must be calculated by scanning instances at each prediction node
  – a great deal of sorting and computing overlap
• BOAI (Bottom-up Evaluation for ADTree Induction)
  – Pre-sorting technique
    • reduces the sorting cost
  – Bottom-up evaluation
    • evaluates splits from the leaf nodes to the root node
    • avoids much redundant computing and sorting cost
    • obtains exactly the same evaluation results as the top-down evaluation approach
Pre-sorting technique
• Preprocessing step
  – Sort the values of each numeric attribute, map the sorted space of values x0, x1, …, xm−1 to 0, 1, …, m−1, and then replace the attribute values with their sorted indexes.
• Use the sorted indexes to speed up sorting in the split evaluation phase.

Example (values on attribute Income):
  original:  1500 1800 1800 1600 1600 1500
  sorting:   1500 1500 1600 1600 1800 1800  (sorting space: 1500, 1600, 1800; sorted indexes: 0, 1, 2)
  replacing: 0 2 2 1 1 0
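A minimal sketch of this preprocessing step, reproducing the Income example above (the function name is illustrative):

```python
def presort_column(values):
    """Replace numeric values with their rank in the sorted space of
    distinct values; returns (index column, sorting space)."""
    space = sorted(set(values))                 # e.g. [1500, 1600, 1800]
    rank = {v: i for i, v in enumerate(space)}  # value -> sorted index
    return [rank[v] for v in values], space

indexes, space = presort_column([1500, 1800, 1800, 1600, 1600, 1500])
print(indexes)  # [0, 2, 2, 1, 1, 0]
print(space)    # [1500, 1600, 1800]
```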
VW-set (Attribute-Value, Class-Weight)
• Only the weight distribution (W+(c), W−(c)) over distinct attribute values is needed for split evaluation.
  – Just keep the necessary information!
• VW-set of attribute A at node p
  – stores the weight distribution of each class for each distinct value of A in F(p) (F(p) denotes the instances projected onto p)
  – If A is a numeric attribute, the distinct values in the VW-set must be sorted.
• VW-group of node p
  – the set of all VW-sets at node p
• Each prediction node can be evaluated based on its VW-group
VW-set (Attribute-Value, Class-Weight)
• The size of a VW-set is determined by the distinct attribute values appearing in F(p), and is not proportional to the size of F(p)

Training data:
  Dept.  Income  Class  Weight
  2      1800    0      0.8
  2      1600    1      1.2
  2      2000    1      1.2
  3      1600    0      0.8
  3      1600    0      0.8
  4      1800    0      0.8

VW-set of Dept.:
  Value  PosW  NegW
  2      2.4   0.8   ← W+(Dept.=2), W−(Dept.=2)
  3      0.0   1.6
  4      0.0   0.8

VW-set of Income:
  Value  PosW  NegW
  1600   1.2   1.6
  1800   0.0   1.6
  2000   1.2   0.0
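A sketch of how such a VW-set could be assembled with a hash table, reproducing the Dept. numbers above (class 1 taken as positive; names are illustrative, not the authors' code):

```python
from collections import defaultdict

def vw_set(rows, attr):
    """Hash-table VW-set: attribute value -> [PosW, NegW]."""
    acc = defaultdict(lambda: [0.0, 0.0])
    for r in rows:
        acc[r[attr]][0 if r["class"] == 1 else 1] += r["weight"]
    return dict(acc)

rows = [
    {"dept": 2, "income": 1800, "class": 0, "weight": 0.8},
    {"dept": 2, "income": 1600, "class": 1, "weight": 1.2},
    {"dept": 2, "income": 2000, "class": 1, "weight": 1.2},
    {"dept": 3, "income": 1600, "class": 0, "weight": 0.8},
    {"dept": 3, "income": 1600, "class": 0, "weight": 0.8},
    {"dept": 4, "income": 1800, "class": 0, "weight": 0.8},
]
print(vw_set(rows, "dept"))  # {2: [2.4, 0.8], 3: [0.0, 1.6], 4: [0.0, 0.8]}
```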
Bottom-up evaluation
• The main idea:
  – evaluate splits from the leaf nodes to the root node
  – use already-computed statistics to evaluate parent nodes
  – much computing and sorting redundancy can be avoided
• Evaluation is based on VW-groups
Bottom-up evaluation (cont.)
• For leaf nodes
  – directly construct the VW-group by scanning the instances at the node
  – VW-set on a categorical attribute:
    • use a hash table to index distinct values and accumulate their weights
  – VW-set on a numeric attribute:
    • map weights to the corresponding index in the value space, then compress them into the VW-set
    • sorting thus takes linear time: O(n + m)

Instances (Income index, Class, Weight):
  0 0 0.8
  1 0 0.8
  3 1 1.2
  1 0 0.8
  1 1 1.2
  0 0 0.8
  3 1 1.2

Value space:
  Value  PosW  NegW
  0      0.0   1.6
  1      1.2   1.6
  2      0.0   0.0
  3      2.4   0.0

VW-set of Income (after compression):
  Value  PosW  NegW
  0      0.0   1.6
  1      1.2   1.6
  3      2.4   0.0
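A sketch of the numeric-attribute case under these assumptions: instances already carry pre-sorted indexes from the preprocessing step, and names are illustrative. Scattering weights into the value space and compressing afterwards avoids any comparison sort, hence the O(n + m) cost.

```python
def leaf_vw_set(instances, m):
    """Build a sorted VW-set for a numeric attribute at a leaf node.
    instances: (sorted_index, cls, weight) triples; m: value-space size."""
    space = [[0.0, 0.0] for _ in range(m)]   # index -> [PosW, NegW]
    for idx, cls, w in instances:
        space[idx][0 if cls == 1 else 1] += w
    # compress: drop values that never occur at this node
    return [(i, pw, nw) for i, (pw, nw) in enumerate(space) if pw or nw]

inst = [(0, 0, 0.8), (1, 0, 0.8), (3, 1, 1.2), (1, 0, 0.8),
        (1, 1, 1.2), (0, 0, 0.8), (3, 1, 1.2)]
print(leaf_vw_set(inst, 4))
# [(0, 0.0, 1.6), (1, 1.2, 1.6), (3, 2.4, 0.0)]
```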
Bottom-up evaluation (cont.)
• For internal nodes
  – construct the VW-group by merging the VW-groups of the two children (prediction) nodes
  – sort cost: O(V1 + V2)

Child (Y branch):
  VW-set of Dept.:          VW-set of Income:
    Value  PosW  NegW         Value  PosW  NegW
    A      0.0   1.6          0      0.0   2.4
    B      1.2   0.8          1      1.2   0.0
    C      1.2   1.6          2      1.2   1.6

Child (N branch):
  VW-set of Dept.:          VW-set of Income:
    Value  PosW  NegW         Value  PosW  NegW
    B      0.0   1.6          1      0.0   0.8
    C      1.2   0.8          2      1.2   0.8
                              3      0.0   0.8

Parent (merged):
  VW-set of Dept.:          VW-set of Income:
    Value  PosW  NegW         Value  PosW  NegW
    A      0.0   1.6          0      0.0   2.4
    B      1.2   2.4          1      1.2   0.8
    C      2.4   2.4          2      2.4   2.4
                              3      0.0   0.8
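Merge construction is essentially the merge step of merge sort over the two children's sorted VW-sets, which is what keeps the parent sorted at cost O(V1 + V2). A sketch with illustrative names, using the Income VW-sets above:

```python
def merge_vw_sets(a, b):
    """Merge two VW-sets sorted by value; equal values add their weights.
    a, b: lists of (value, pos_w, neg_w) tuples."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i][0] == b[j][0]:        # same value: sum class weights
            out.append((a[i][0], a[i][1] + b[j][1], a[i][2] + b[j][2]))
            i += 1; j += 1
        elif a[i][0] < b[j][0]:
            out.append(a[i]); i += 1
        else:
            out.append(b[j]); j += 1
    out.extend(a[i:]); out.extend(b[j:])
    return out

left  = [(0, 0.0, 2.4), (1, 1.2, 0.0), (2, 1.2, 1.6)]
right = [(1, 0.0, 0.8), (2, 1.2, 0.8), (3, 0.0, 0.8)]
print(merge_vw_sets(left, right))
# [(0, 0.0, 2.4), (1, 1.2, 0.8), (2, 2.4, 2.4), (3, 0.0, 0.8)]  (up to float rounding)
```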
Evaluation Algorithm

[Flowchart: leaf nodes use direct construction; internal nodes use merge construction. Each VW-group is used to evaluate categorical and numeric splits, after which the algorithm moves on to evaluate the other children.]
Computation analysis
• Prediction node p, |F(p)| = n
• Top-down evaluation:
  – sorting cost for each numeric attribute: O(n log n)
  – Z-value calculation cost for each attribute: O(n)
• Bottom-up evaluation:
  – sorting cost for each numeric attribute:
    • leaf node: in most cases O(n + m)
    • internal node: sort through merging, O(V1 + V2), where V1, V2 are the numbers of distinct values in the two merged VW-groups; they are usually much smaller than n
  – Z-value calculation cost for each attribute: O(V), where V is the number of distinct values in the VW-group, usually much smaller than n
Experiments
• Data sets
  – Synthetic data sets: IBM Quest data mining group, up to 500,000 instances
  – Real data set: China Mobile Communication Company, 290,000 subscribers covering 92 variables
• Environment
  – AMD 3200+ CPU running Windows XP with 768MB main memory
Experimental results (Synthetic data)
Experimental results (real data)
• Application to churn prediction
  – Calibration set: 20,083 instances; validation set: 5,062
  – Imbalance problem: about 2.1% churn rate
  – Re-balancing strategy (see the sketch below):
    • multiply the weight of each instance in the minority class by Wmaj/Wmin (Wmaj (resp. Wmin) is the total weight of the majority (resp. minority) class instances)
    • little information loss, and on average no extra computation is introduced
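A minimal sketch of this re-balancing rule (variable names are illustrative): after scaling, the minority class carries the same total weight as the majority class.

```python
def rebalance(weights, labels, minority=+1):
    """Scale minority-class instance weights by Wmaj / Wmin."""
    w_min = sum(w for w, y in zip(weights, labels) if y == minority)
    w_maj = sum(w for w, y in zip(weights, labels) if y != minority)
    factor = w_maj / w_min
    return [w * factor if y == minority else w
            for w, y in zip(weights, labels)]
```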
Models                  F-measure  G-mean  W-accuracy  Modeling Time (sec)
ADT (w/o re-balancing)  56.04      65.65   44.53       75.56
Random Forests          19.21      84.04   84.71       960.00
TreeNet                 72.81      79.61   64.40       30.00
BOAI                    50.62      90.81   85.84       7.625
Summary
• We developed a novel approach for ADTree induction to speed up training on large data sets
• Key insight:
  – eliminate the great redundancy of sorting and computation in tree induction by using a bottom-up evaluation approach based on VW-groups
• Experiments on both synthetic and real data sets show that BOAI offers significant performance improvements while constructing exactly the same model
• It's an attractive algorithm for modeling on large data sets!
Thanks!