
Page 1

Haifa Research Lab

© 2008 IBM Corporation

Parallel streaming decision trees

Yael Ben-Haim & Elad Yom-Tov

Presented by: Yossi Richter

Page 2

Why decision trees?

Simple classification model, short testing time

Understandable by humans

BUT:

– Difficult to train on large data (need to sort each feature)

Page 3

Previous work

Presorting (SLIQ, 1996)

Approximations (BOAT, 1999) (CLOUDS, 1997)

Parallel (e.g. SPRINT 1996)

– Vertical parallelism

– Task parallelism

– Hybrid parallelism

Streaming

– Minibatch (SPIES, 2003)

– Statistic (pCLOUDS, 1999)

Page 4

Streaming parallel decision tree

[Figure: only the label "Data" is recoverable]

Page 5

Iterative parallel decision tree

[Figure: master/workers timeline. Labels: Initialize root; Data (per worker); Build histogram; Merge; Compute node splits; Until convergence; Time.]

Page 6

Building an on-line histogram

A histogram is a list of pairs (p1, m1) … (pn, mn)

Initialize: c = 0, p = [ ], m = [ ]

For each data point p:

– If p == pj for any j <= c

• mj = mj + 1

– Otherwise

• Add a bin to the histogram with the value (p, 1)

• c = c + 1

• If c > max_bins

– Merge the two closest bins in the histogram

– c = max_bins
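The update procedure above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function names `update` and `merge_closest` are hypothetical, the histogram is kept as a list of (p, m) pairs sorted by p, and the slide does not say how two bins are combined, so the sketch assumes the merged bin sits at the count-weighted mean of the two centers and carries the sum of their counts.

```python
from bisect import bisect_left

def merge_closest(bins):
    """Merge, in place, the two bins whose centers are closest."""
    # bins is sorted by center, so the closest pair is always adjacent.
    i = min(range(len(bins) - 1), key=lambda k: bins[k + 1][0] - bins[k][0])
    (p1, m1), (p2, m2) = bins[i], bins[i + 1]
    # Assumption: merged center = count-weighted mean; counts add up.
    bins[i:i + 2] = [((p1 * m1 + p2 * m2) / (m1 + m2), m1 + m2)]

def update(bins, x, max_bins):
    """Add one data point x to a histogram stored as a sorted list of (p, m)."""
    i = bisect_left(bins, (x,))          # first bin with center >= x
    if i < len(bins) and bins[i][0] == x:
        bins[i] = (x, bins[i][1] + 1)    # existing bin: increment its count
    else:
        bins.insert(i, (x, 1))           # new bin (x, 1)
        if len(bins) > max_bins:
            merge_closest(bins)          # back down to max_bins bins

hist = []
for x in [23, 19, 10, 16, 36, 2, 9]:
    update(hist, x, max_bins=5)
print(hist)  # [(2, 1), (9.5, 2), (17.5, 2), (23, 1), (36, 1)]
```

The bin count `c` from the slide is simply `len(bins)` here, and the total of the counts always equals the number of points seen.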

Page 7

Merging two histograms

Concatenate the two histogram lists, creating a list of length c

Repeat until c <= max_bins

– Merge the two closest bins
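A minimal sketch of this merge step, under the same assumptions as any (p, m) histogram kept as a Python list: the name `merge_histograms` is hypothetical, and the rule for combining two bins (count-weighted mean of the centers, summed counts) is an assumption, since the slide only says "merge the two closest bins".

```python
def merge_histograms(h1, h2, max_bins):
    """Merge two histograms (lists of (p, m) pairs) into at most max_bins bins."""
    bins = sorted(h1 + h2)               # concatenate, then order by center
    while len(bins) > max_bins:          # repeat until c <= max_bins
        # In sorted order the two closest bins are adjacent.
        i = min(range(len(bins) - 1), key=lambda k: bins[k + 1][0] - bins[k][0])
        (p1, m1), (p2, m2) = bins[i], bins[i + 1]
        # Assumption: merged center = count-weighted mean; counts add up.
        bins[i:i + 2] = [((p1 * m1 + p2 * m2) / (m1 + m2), m1 + m2)]
    return bins

h1 = [(2, 1), (9.5, 2), (17.5, 2), (23, 1), (36, 1)]
h2 = [(32, 1), (30, 1), (45, 1)]
merged = merge_histograms(h1, h2, max_bins=5)
print([(round(p, 2), m) for p, m in merged])
# [(2, 1), (9.5, 2), (19.33, 3), (32.67, 3), (45, 1)]
```

Because the result has the same form as its inputs, histograms built independently by different workers can be combined pairwise in any order.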

Page 8

Example of the histogram

50 bins, 1000 data points

Page 9

Pruning

Taken from the MDL-based SLIQ algorithm

Consists of two phases:

– Tree construction

– Bottom-up pass on the complete tree

During tree construction, for each tree node, set cleaf = 1 + number of samples that reached the node and do not belong to the majority class

The bottom-up pass:

– for each leaf, set cboth = cleaf

– for each internal node for which cboth(left) and cboth(right) have been assigned, set cboth = 2 + cboth(left) + cboth(right)

– The subtree rooted at a node is to be pruned if cleaf is small, namely:

• Only a few samples reach it

• A substantial portion of the samples that reach it belongs to the majority class

– If cleaf < cboth (i.e., the subtree does not contribute much information) then:

• Prune the subtree

• Set cboth = cleaf
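The bottom-up pass can be sketched as a short recursion. This is an illustrative sketch only: the `Node` class and its field names are hypothetical, the tree is assumed to be binary with every internal node having both children, and `c_leaf` / `c_both` stand for the slide's cleaf and cboth.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    n_samples: int                    # samples that reached the node
    n_majority: int                   # of those, samples in the majority class
    left: Optional["Node"] = None
    right: Optional["Node"] = None

    @property
    def c_leaf(self) -> int:
        # 1 + number of samples at the node outside the majority class
        return 1 + self.n_samples - self.n_majority

def prune(node: Node) -> int:
    """Bottom-up pass: prunes in place and returns c_both for `node`."""
    if node.left is None or node.right is None:     # leaf: c_both = c_leaf
        return node.c_leaf
    c_both = 2 + prune(node.left) + prune(node.right)
    if node.c_leaf < c_both:                         # subtree adds little
        node.left = node.right = None                # prune the subtree
        c_both = node.c_leaf
    return c_both

tree = Node(100, 60,
            left=Node(50, 48, left=Node(25, 24), right=Node(25, 24)),
            right=Node(50, 49))
cost = prune(tree)
print(cost, tree.left.left is None)  # 7 True
```

In this made-up tree the left child has c_leaf = 3 while its two leaves give c_both = 2 + 2 + 2 = 6, so its subtree is pruned; the root has c_leaf = 41 against c_both = 2 + 3 + 2 = 7, so the root split is kept.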

Page 10

IBM Parallel Machine Learning toolbox

A toolbox for conducting large-scale machine learning

– Supports architectures ranging from single machines with multiple cores to large distributed clusters

Works by distributing the computations across multiple nodes

– Allows for rapid learning of very large datasets

Includes state-of-the-art machine learning algorithms for:

– Classification: Support-vector machines (SVM), decision tree

– Regression: Linear and SVM

– Clustering: k-means, fuzzy k-means, kernel k-means, Iclust

– Feature reduction: Principal component analysis (PCA) and kernel PCA

Includes an API for adding algorithms

Freely available from alphaWorks

Joint project of the Haifa Machine Learning group and the Watson Data Analytics group

[Figure: K-means on Blue Gene. Speedup (compared to a single node) versus number of processors (0 to 1200); speedup axis from 0 to 18.]

[Figure: master/workers timeline. Labels: Initialize parameters; Data (per worker); Compute kernel matrix; Compute local updates; Merge; Compute global update; Until convergence; Time.]

Shameless PR slide

Page 11

Results: Comparing single node solvers

Dataset      Number of examples  Number of features  Standard tree  SPDT
Adult        32561 (16281)       105                 17.7           15.7
Isolet       6238 (1559)         617                 18.7           14.6
Letter       20000               16                  7.5            8.6
Nursery      12960               25                  1.0            2.6
Page blocks  5473                10                  3.1            3.1
Pen digits   7494 (3498)         16                  4.6            5.4
Spambase     4601                57                  8.4            10.5

No statistically significant difference

Ten-fold cross-validation, unless test/train partition exists

Page 12

Results: Pruning

Dataset      Standard tree  SPDT before pruning  SPDT after pruning  Tree size before pruning  Tree size after pruning
Adult        17.7           15.7                 14.3                1645                      409
Isolet       18.7           14.6                 17.8                211                       141
Letter       7.5            8.6                  9.3                 135                       67
Nursery      1.0            2.6                  3.2                 178                       167
Page blocks  3.1            3.1                  3.4                 55                        36
Pen digits   4.6            5.4                  5.8                 89                        81
Spambase     8.4            10.5                 11.4                572                       445

80% reduction in size
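To read the two tree-size columns as relative reductions, a small calculation using only the numbers in the table may help. The slide does not say which rows its callout summarises, so this sketch simply reports the reduction per dataset; the variable names are illustrative.

```python
# (tree size before pruning, tree size after pruning), copied from the table
sizes = {
    "Adult": (1645, 409),
    "Isolet": (211, 141),
    "Letter": (135, 67),
    "Nursery": (178, 167),
    "Page blocks": (55, 36),
    "Pen digits": (89, 81),
    "Spambase": (572, 445),
}

# Relative reduction in tree size, in percent
reduction = {name: 100 * (1 - after / before)
             for name, (before, after) in sizes.items()}

for name, pct in reduction.items():
    print(f"{name:12s} {pct:5.1f}%")
```

On these numbers the largest reduction is for Adult, at roughly 75%.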

Page 13

Speedup (Strong scalability)

[Figure: speedup plots, two panels: Alpha and Beta]

Speedup improves with data size!
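For reference, strong-scaling speedup is usually defined as the single-processor run time divided by the run time on p processors for the same problem size. The sketch below uses that standard definition; the timings are invented purely for illustration and are not results from the talk.

```python
def speedup(t_single: float, t_parallel: float) -> float:
    """Strong-scaling speedup: same data, more processors."""
    return t_single / t_parallel

def efficiency(t_single: float, t_parallel: float, n_procs: int) -> float:
    """Fraction of the ideal (linear) speedup actually achieved."""
    return speedup(t_single, t_parallel) / n_procs

# Hypothetical wall-clock times in seconds, keyed by processor count
timings = {1: 120.0, 2: 64.0, 4: 36.0, 8: 24.0}

for p, t in timings.items():
    s = speedup(timings[1], t)
    e = efficiency(timings[1], t, p)
    print(f"p={p}: speedup={s:.2f}, efficiency={e:.2f}")
```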

Page 14

Weak scalability

[Figure: weak-scalability plots, two panels: Alpha and Beta]

Scalability improves with the number of processors!
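The slide does not define the quantity it plots. A common way to measure weak scalability is to grow the data in proportion to the number of processors and compare run times; ideally the time stays constant. The sketch below uses that common definition, with invented timings that are not results from the talk.

```python
def weak_scaling_efficiency(t_base: float, t_scaled: float) -> float:
    """T(1 processor, n samples) / T(p processors, p * n samples); 1.0 is ideal."""
    return t_base / t_scaled

# Hypothetical wall-clock times in seconds; the data size grows with p
timings = {1: 50.0, 2: 52.0, 4: 55.0, 8: 62.5}

for p, t in timings.items():
    print(f"p={p}: efficiency={weak_scaling_efficiency(timings[1], t):.2f}")
```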

Page 15

Algorithm complexity

Page 16

Summary

An efficient new algorithm for parallel streaming decision trees

Results as good as single-node trees, but with scalability that improves with the data size and the number of processors

Ongoing work: a proof that the algorithm differs by only epsilon from the previous decision tree algorithm

Page 17

Thank You

תודה (Toda, Hebrew)

Merci (French)

Grazie (Italian)

Gracias (Spanish)

Obrigado (Portuguese)

Danke (German)

Kiitos (Finnish)