Post on 29-Dec-2015


YaDT (Yet another Decision Tree builder)

Ah Young Shin

ayoung18@uos.ac.kr

Visual Communication Lab.

Dept. Electronic Computer Engineering. University Of Seoul. Visual Communication Lab.

1. Introduction

• YaDT is a from-scratch, main-memory implementation of a C4.5-like decision tree algorithm.

• The algorithm family was extended in the order ID3 (entropy-based information gain) → C4.5 (gain ratio) → C5.0.

• Unfortunately, C4.5 (and EC4.5) are implemented in old-style K&R C code. The sources are therefore hard to understand, profile and extend.

• Experimental results are reported comparing YaDT with Weka, dti and (E)C4.5.
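For reference, the entropy and information gain measures named in the lineage above can be written as follows. These are the standard textbook definitions, not formulas taken from the slides; the symbols S (set of cases at a node), A (attribute), S_v (cases of S with value v for A) and p_c (fraction of cases in class c) are notation assumed here, while NC is the number of class values as used later in the slides.

```latex
H(S) = -\sum_{c=1}^{NC} p_c \log_2 p_c

\mathrm{Gain}(S, A) = H(S) - \sum_{v} \frac{|S_v|}{|S|}\, H(S_v)

\mathrm{GainRatio}(S, A) =
  \frac{\mathrm{Gain}(S, A)}
       {-\sum_{v} \frac{|S_v|}{|S|} \log_2 \frac{|S_v|}{|S|}}
```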

1. Introduction - C4.5

• C4.5 addresses the following issues:

① Handling continuous (numeric) attributes

② Excluding irrelevant attributes

③ How deeply to grow the decision tree

④ Handling missing attribute values

⑤ Handling attributes with different costs

⑥ Improving computational efficiency

2. Meta data representation

• Each attribute has one of the following attribute types: discrete, continuous, weights or class.

• The values of an attribute in a case belong to some data type, including integer, float, double and string. (A special value, '?' or NULL, is also allowed.)

2. Meta data representation

• In summary, the meta data describing the training set TS in YaDT can be structured as a table with three columns: attribute name, data type and attribute type.

• Such a table can be provided as a database table or as a text file.
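As a rough illustration of such a three-column table, the sketch below parses a small text version of it. The comma-separated layout, the PlayTennis attribute names and the `load_metadata` helper are all assumptions made for this example; the slides do not show YaDT's actual file format.

```python
import csv
import io

# Hypothetical meta data for PlayTennis: one row per attribute,
# columns = attribute name, data type, attribute type.
METADATA_TEXT = """\
outlook,string,discrete
temperature,integer,continuous
humidity,integer,continuous
windy,string,discrete
play,string,class
"""

# Allowed values, as listed on the slide.
DATA_TYPES = {"integer", "float", "double", "string"}
ATTRIBUTE_TYPES = {"discrete", "continuous", "weights", "class"}


def load_metadata(text):
    """Parse the table and reject unknown data or attribute types."""
    rows = []
    for name, data_type, attr_type in csv.reader(io.StringIO(text)):
        if data_type not in DATA_TYPES:
            raise ValueError(f"unknown data type: {data_type}")
        if attr_type not in ATTRIBUTE_TYPES:
            raise ValueError(f"unknown attribute type: {attr_type}")
        rows.append((name, data_type, attr_type))
    return rows


metadata = load_metadata(METADATA_TEXT)
for name, data_type, attr_type in metadata:
    print(f"{name:12} {data_type:8} {attr_type}")
```

The same rows could equally come from a database table; only the three columns matter.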

3. Data representation

• Example: the training data for PlayTennis may include the following case:

• C4.5 models an attribute value by a union structure to distinguish discrete from continuous attributes.
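The union idea can be sketched with Python's `ctypes.Union`, which mirrors a C union. The field names, the choice of `short` and `float` members, and the sample values are assumptions for illustration only; they are not copied from the C4.5 sources or from the slides.

```python
import ctypes


class AttValue(ctypes.Union):
    """One attribute value: both members share the same storage."""
    _fields_ = [
        ("discr_val", ctypes.c_short),  # index of a discrete value
        ("cont_val", ctypes.c_float),   # numeric value of a continuous attribute
    ]


def make_case(raw_values, is_continuous):
    """Build one case as an array of unions, one slot per attribute."""
    case = (AttValue * len(raw_values))()
    for i, (raw, cont) in enumerate(zip(raw_values, is_continuous)):
        if cont:
            case[i].cont_val = float(raw)
        else:
            case[i].discr_val = int(raw)
    return case


# Hypothetical attribute layout: discrete, continuous, continuous, discrete.
IS_CONTINUOUS = [False, True, True, False]
case = make_case([0, 85, 85, 0], IS_CONTINUOUS)

for value, cont in zip(case, IS_CONTINUOUS):
    print(value.cont_val if cont else value.discr_val)
```

The union itself does not record which member is valid, so the reader of a value has to consult the attribute type from the meta data, as `IS_CONTINUOUS` does here.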

4.1 YaDT optimizations

• All the strategies implement several optimizations, mainly related to the efficient computation of information gain.

① The first strategy computes the local threshold using the algorithm of C4.5, which in particular sorts the cases by means of quicksort.

② The second strategy also uses the algorithm of C4.5, but adopts counting sort instead.

⇒ The strategy to adopt is selected according to an analytic comparison of their efficiency.
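A minimal sketch of the second strategy, assuming an integer-valued continuous attribute with a small value range: counting sort groups the class counts by value in linear time, and a single left-to-right scan then evaluates the information gain of each candidate threshold. The function names and the toy data are hypothetical, and the code is not taken from YaDT or C4.5.

```python
import math
from collections import Counter


def entropy(class_counts):
    """Entropy (in bits) of a class distribution given as a Counter."""
    total = sum(class_counts.values())
    return -sum((n / total) * math.log2(n / total)
                for n in class_counts.values() if n > 0)


def counting_sort_by_value(values, labels, lo, hi):
    """Bucket class counts by integer value in O(n + (hi - lo)) time."""
    buckets = [Counter() for _ in range(hi - lo + 1)]
    for v, y in zip(values, labels):
        buckets[v - lo][y] += 1
    return buckets


def best_threshold(values, labels):
    """Return (threshold, gain) maximizing information gain for value <= threshold."""
    lo, hi = min(values), max(values)
    buckets = counting_sort_by_value(values, labels, lo, hi)
    total = Counter(labels)
    n = len(values)
    base = entropy(total)
    left = Counter()
    best_t, best_gain = None, 0.0
    # The last bucket is skipped: a split there would leave the right side empty.
    for offset, bucket in enumerate(buckets[:-1]):
        if not bucket:
            continue  # no case has this value, so it is not a candidate
        left += bucket
        right = total - left
        n_left = sum(left.values())
        gain = (base
                - (n_left / n) * entropy(left)
                - ((n - n_left) / n) * entropy(right))
        if gain > best_gain:
            best_t, best_gain = lo + offset, gain
    return best_t, best_gain


values = [1, 2, 3, 10, 11, 12]
labels = ["a", "a", "a", "b", "b", "b"]
threshold, gain = best_threshold(values, labels)
print(threshold, gain)
```

Replacing `counting_sort_by_value` with a comparison sort of the cases would correspond to the first strategy; counting sort pays off only when the value range is small relative to the number of cases.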

4.1 YaDT optimizations

• After splitting a node, a (weighted) subset of cases is "pushed down" to each child node (pushed down = LIFO).

• YaDT builds a weight array for each node.

• The depth-first strategy is slightly faster, since the following optimization can be implemented.

• The breadth-first strategy has better memory occupation: it needs to maintain arrays of weights and case indexes for a total of at most 2∙|TS| elements. → This is the strategy adopted by YaDT.
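The bookkeeping behind breadth-first growth can be sketched as follows, assuming discrete attributes only and no missing values (so each case goes to exactly one child with weight 1). The helper names, the fixed attribute order and the toy cases are hypothetical; the sketch only shows how per-node arrays of case indexes and weights move through a FIFO queue, not how YaDT chooses splits.

```python
from collections import deque


def split_by_attribute(cases, indexes, weights, attr):
    """Partition the (indexes, weights) arrays by the value of `attr`."""
    children = {}
    for i, w in zip(indexes, weights):
        idx, wts = children.setdefault(cases[i][attr], ([], []))
        idx.append(i)
        wts.append(w)
    return children


def grow_breadth_first(cases, attrs):
    """Split on attrs[depth] at each level; return case-index counts per level."""
    root = (list(range(len(cases))), [1.0] * len(cases))
    queue = deque([(0, root)])
    elements_per_level = {}
    while queue:
        depth, (indexes, weights) = queue.popleft()  # FIFO gives breadth-first order
        elements_per_level[depth] = elements_per_level.get(depth, 0) + len(indexes)
        if depth == len(attrs):
            continue  # leaf: no attribute left to split on
        children = split_by_attribute(cases, indexes, weights, attrs[depth])
        for child in children.values():
            queue.append((depth + 1, child))
    return elements_per_level


cases = [
    {"outlook": "sunny", "windy": "false"},
    {"outlook": "sunny", "windy": "true"},
    {"outlook": "rain", "windy": "false"},
    {"outlook": "overcast", "windy": "true"},
]
levels = grow_breadth_first(cases, ["outlook", "windy"])
print(levels)
```

Under these assumptions, every level holds exactly |TS| case indexes and |TS| weights in total. Swapping `popleft()` for `pop()` turns the queue into a stack and yields depth-first (LIFO) growth instead.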

4.2 Some experiments on efficiency

• TS name: the name of the training set

• |TS|: the number of cases

• NC: the number of class values

5. YaDT version 1.2.5

6. Conclusion

• a structured, object-oriented programming implementation

• portable code across Windows (Visual Studio) and Linux (gcc)

• 32-bit and 64-bit executables

• a documented C++ library of classes

• compressed binary output/input of trees

• a command-line tree builder and a Java GUI
