YaDT (Yet another Decision Tree builder)
Ah Young Shin (ayoung18@uos.ac.kr)
Visual Communication Lab.
Dept. Electronic Computer Engineering. University Of Seoul.
1. Introduction
• YaDT is a from-scratch, main-memory implementation of a C4.5-like decision tree algorithm.
• The algorithm family was extended in the order ID3 (entropy-based information gain) → C4.5 (information gain ratio) → C5.0.
• Unfortunately, C4.5 (and EC4.5) are implemented in old-style K&R C, so the sources are hard to understand, profile, and extend.
• Experimental results are reported comparing YaDT with Weka, dti
and (E)C4.5.
1. Introduction - C4.5
• Issues addressed by C4.5:
① Handling continuous attributes
② Excluding irrelevant attributes
③ How deeply to grow the decision tree
④ Handling missing attribute values
⑤ Handling attributes with different costs
⑥ Improving computational efficiency
2. Meta data representation
• Each attribute has one of the following attribute types
: discrete, continuous, weights or class.
• The value of an attribute in a case belongs to a data type such as
: integer, float, double, string. (The special value '?' or NULL marks an unknown value.)
• In summary, the YaDT meta data describing the training set TS can be structured as a table with columns
: attribute name, data type and attribute type.
• Such a table can be provided as a database table or as a text file.
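A minimal sketch of such a meta data table, read from text. The comma-separated layout, the attribute names, and the `AttributeMeta` and `parse_metadata` names are assumptions made for illustration; they are not YaDT's actual file format or API.

```python
from dataclasses import dataclass

# Data types and attribute types named on the slide.
DATA_TYPES = {"integer", "float", "double", "string"}
ATTRIBUTE_TYPES = {"discrete", "continuous", "weights", "class"}


@dataclass(frozen=True)
class AttributeMeta:
    """One row of the meta data table (hypothetical structure)."""
    name: str
    data_type: str
    attribute_type: str


def parse_metadata(text):
    """Parse lines of the form 'name,data type,attribute type'."""
    rows = []
    for line in text.strip().splitlines():
        name, data_type, attribute_type = (f.strip() for f in line.split(","))
        if data_type not in DATA_TYPES:
            raise ValueError(f"unknown data type: {data_type}")
        if attribute_type not in ATTRIBUTE_TYPES:
            raise ValueError(f"unknown attribute type: {attribute_type}")
        rows.append(AttributeMeta(name, data_type, attribute_type))
    return rows


# Hypothetical meta data for a PlayTennis-style training set.
PLAY_TENNIS_META = """
outlook,string,discrete
temperature,integer,continuous
humidity,integer,continuous
windy,string,discrete
weight,float,weights
play,string,class
"""

for row in parse_metadata(PLAY_TENNIS_META):
    print(f"{row.name:12} {row.data_type:8} {row.attribute_type}")
```

Keeping the data type separate from the attribute type lets an integer column be treated either as continuous or as discrete, depending on the table.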
3. Data representation
• Example: training data for PlayTennis may include the following case:
• C4.5 models an attribute value by a union structure to distinguish
discrete from continuous attributes.
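The union idea can be sketched with Python's `ctypes.Union`, which mirrors a C union: the same storage holds either a discrete value index or a continuous number, and the meta data decides which field to read. The field names, the two-attribute case, and the value list below are assumptions for illustration, not the C4.5 source.

```python
import ctypes


class AttValue(ctypes.Union):
    """One attribute value: a discrete index or a continuous number."""
    _fields_ = [
        ("discr_val", ctypes.c_short),  # index into the attribute's value list
        ("cont_val", ctypes.c_float),   # value of a continuous attribute
    ]


# Hypothetical discrete values for an "outlook" attribute.
OUTLOOK_VALUES = ["sunny", "overcast", "rain"]


def make_case(outlook, temperature):
    """Build a two-attribute case: outlook (discrete), temperature (continuous)."""
    case = (AttValue * 2)()
    case[0].discr_val = OUTLOOK_VALUES.index(outlook)
    case[1].cont_val = temperature
    return case


def describe(case):
    """Read each slot through the field that matches its attribute type."""
    return OUTLOOK_VALUES[case[0].discr_val], case[1].cont_val


case = make_case("rain", 71.0)
print(describe(case))
print(ctypes.sizeof(AttValue))  # the union is as large as its largest member
```

Because the union itself carries no tag, reading a slot through the wrong field silently yields a meaningless value; the attribute type in the meta data is what keeps the access correct.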
4.1 YaDT optimizations
• All the strategies implement several optimizations, mainly related to the efficient computation of information gain.
① The first strategy computes the local threshold using the algorithm of C4.5, which sorts the cases with the quicksort method.
② The second strategy also uses the algorithm of C4.5, but adopts a counting sort method.
⇒ The strategy to adopt is selected according to an analytic comparison of their efficiency.
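The two strategies above differ only in how the cases are sorted before the threshold scan. The following is a simplified sketch under stated assumptions: unit weights, no unknown values, plain information gain rather than gain ratio, and Python's built-in sort standing in for quicksort. All function names are hypothetical.

```python
import math
from collections import Counter


def entropy(class_counts):
    """Entropy (in bits) of a class distribution given as label -> count."""
    total = sum(class_counts.values())
    return -sum((n / total) * math.log2(n / total)
                for n in class_counts.values() if n > 0)


def sort_by_comparison(values, labels):
    """Strategy 1: comparison sort (built-in sort in place of quicksort)."""
    return sorted(zip(values, labels), key=lambda pair: pair[0])


def sort_by_counting(values, labels):
    """Strategy 2: counting sort, usable when values are integers in a small range."""
    low = min(values)
    counts = [0] * (max(values) - low + 1)
    for v in values:
        counts[v - low] += 1
    start = 0
    for i, c in enumerate(counts):  # turn counts into start positions
        counts[i] = start
        start += c
    out = [None] * len(values)
    for v, label in zip(values, labels):  # stable placement
        out[counts[v - low]] = (v, label)
        counts[v - low] += 1
    return out


def best_threshold(sorted_pairs):
    """Scan sorted (value, label) pairs; return (threshold, information gain).

    The test is 'attribute <= threshold'; candidates lie between distinct values.
    """
    total = Counter(label for _, label in sorted_pairs)
    n = len(sorted_pairs)
    base = entropy(total)
    left = Counter()
    best = (None, 0.0)
    for i in range(n - 1):
        value, label = sorted_pairs[i]
        left[label] += 1
        if value == sorted_pairs[i + 1][0]:
            continue  # cannot split between equal values
        right = total - left
        k = i + 1
        gain = base - (k / n) * entropy(left) - ((n - k) / n) * entropy(right)
        if gain > best[1]:
            best = (value, gain)
    return best


values = [3, 1, 4, 2]
labels = ["b", "a", "b", "a"]
print(best_threshold(sort_by_comparison(values, labels)))
print(best_threshold(sort_by_counting(values, labels)))
```

Counting sort runs in time linear in the number of cases plus the value range, so it pays off only when the range is small relative to the number of cases at the node; that trade-off is the kind of comparison the selection step has to make.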
• After splitting a node, a (weighted) subset of cases is “pushed down” to each child node. (pushed down = LIFO)
• YaDT builds an array of weights for each node.
• The depth-first strategy is slightly faster, since it permits an additional optimization.
• The breadth-first strategy uses less memory: it maintains arrays of weights and case indexes for a total of at most 2∙|TS| elements. YaDT adopts this strategy.
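The weighted push-down can be sketched as follows, using the C4.5 treatment of unknown values: a case with a known value goes to one child with its full weight, while a case with an unknown value goes to every child with its weight scaled by that child's share of the known weight. The function name and data layout are assumptions for illustration, not YaDT's code, and the sketch assumes at least one case has a known value.

```python
def push_down(case_ids, weights, attribute_values, branch_values):
    """Split a node's weighted cases among the children of a discrete test.

    attribute_values maps a case id to its value, or to None when unknown.
    Returns branch value -> (list of case ids, list of weights).
    """
    # Total weight of cases with a known value, per branch.
    known_weight = {b: 0.0 for b in branch_values}
    for cid, w in zip(case_ids, weights):
        v = attribute_values[cid]
        if v is not None:
            known_weight[v] += w
    total_known = sum(known_weight.values())

    children = {b: ([], []) for b in branch_values}
    for cid, w in zip(case_ids, weights):
        v = attribute_values[cid]
        if v is not None:
            children[v][0].append(cid)
            children[v][1].append(w)
        else:
            # Unknown value: distribute the case fractionally to all children.
            for b in branch_values:
                fraction = known_weight[b] / total_known
                if fraction > 0:
                    children[b][0].append(cid)
                    children[b][1].append(w * fraction)
    return children


outlook = {0: "sunny", 1: "sunny", 2: "rain", 3: None}
children = push_down([0, 1, 2, 3], [1.0, 1.0, 1.0, 1.0], outlook, ["sunny", "rain"])
for branch, (ids, ws) in children.items():
    print(branch, ids, ws)
```

Each child receives a pair of parallel arrays (case indexes and weights), and the weights across all children sum to the weight at the parent.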
4.2 Some experiments on efficiency
• TS name : the name of the training set
• |TS| : the number of cases
• NC : the number of class values
5. YaDT version 1.2.5
6. Conclusion
• a structured object-oriented programming implementation
• portable code over Windows (Visual Studio) and Linux (gcc)
• 32-bit and 64-bit executables
• a documented C++ library of classes
• compressed binary output/input of trees
• a command line tree builder and a Java GUI.