
  • EDPNMO006/2001 (D.Taniar & K.Smith)

    1

    Module 4
    Classification – Decision Trees

    VPAC Education Grant Round 2
    EDPNMO006/2001

    David Taniar and Kate Smith
    Monash University

  • EDPNMO006/2001 (D.Taniar & K.Smith)

    2

    2

    Decision tree
    • A decision tree is used for classification.
    • Classification is the process of assigning new objects to predefined categories or classes:

    ⇒ Given a set of labeled records

    ⇒ Build a model (decision tree)

    ⇒ Predict labels for future unlabeled records
    • A decision tree is usually a directed graph consisting of nodes and directed arcs. The nodes frequently correspond to a question or a test.

    • The data are a collection of records. Each record contains attributes and a corresponding target class.

    • The decision tree is one of the most popular tools for classification because its results are comprehensible, taking the form of decision rules.

    • Other classification tools are neural networks, statistical models, and genetic models.

    • Decision tree learning was pioneered by Hunt and subsequently developed by Quinlan (ID3, and later C4.5 and C5.0).

    • Another famous decision tree algorithm, CART, was proposed by Breiman, Friedman, Olshen and Stone.

    • In a decision tree, the objective is to create a set of rules that can be used to differentiate one target class from another.

    • The target class is a labeled categorical value (e.g. Bird, Tiger, Rabbit), a binary value (e.g. Yes, No), or any other categorizable value.

    • After a set of decision rules has been constructed, it can be used to support decision making (e.g. classification) or estimation.

  • EDPNMO006/2001 (D.Taniar & K.Smith)

    3

    3

    • For example, a decision tree is used to classify a few animals based on whether or not they have Hair or Feathers, and on their Color.

    Decision tree

    [Figure: decision tree (1). Hair? True -> KANGAROO; False -> Feathers?. Feathers? True -> PELICAN; False -> Color?. Color? White -> TUNA; Grey -> DOLPHIN; Black -> WHALE.]

    • In the example above, a decision tree is built from a training data set consisting of animal classes.

    • The name “decision tree” comes from the tree-like structure constructed from the training data set.

  • EDPNMO006/2001 (D.Taniar & K.Smith)

    4

    4

    • Hair?, Feathers? and Color? are questions or attributes.
    • Possible values:
      ◆ Hair => True or False
      ◆ Feathers => True or False
      ◆ Color => Black, Grey or White
    • KANGAROO, PELICAN, TUNA, DOLPHIN and WHALE are the classifications.

    Decision tree

    • The attribute values of the training data can be divided into categorical values or continuous values.

    • In the example above, all attributes are categorical (countable).

    • Continuous values are real numbers (e.g. the height of a person).

    • The possible target class values must be countable.

  • EDPNMO006/2001 (D.Taniar & K.Smith)

    5

    5

    • Structure of a decision tree

    ◆ A decision tree is composed of three types of nodes: the root node, the intermediate nodes, and the leaf nodes.

    ◆ Leaf nodes contain a final decision or target class for a decision tree.

    ◆ The root node is the starting point of a decision tree.

    ◆ Every intermediate node corresponds to a question or test.

    Decision tree

    • An intermediate node can have as many branches as there are possible values of the question at that node.

    • In some decision trees a node contains at most two child nodes; we call this a binary tree. A binary branch has only two answers: true/yes or false/no.

    • A node is assigned as a leaf node when all the examples at that node belong to the same class, or when the majority of the examples belong to one class.

  • EDPNMO006/2001 (D.Taniar & K.Smith)

    6

    6

    • To assign a target class to an example:

    1. start at the root node and perform its test

    2. branch to the correct path based on the test result

    3. if a leaf node has been reached, assign its target class to the example; otherwise go to step 1 with the current node as the root node.

    • From the steps above, tracing a decision tree is a recursive process. An intermediate node can be thought of as the root node of its own sub-tree.

    • For example, if Hair? = False and Feathers? = False and Color? = Grey, then classification = DOLPHIN.

    Decision tree
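    • The tracing procedure above can be written as a short recursive function. The following is a minimal illustrative sketch (not part of the module's code); the Node structure, attribute names and values are assumptions chosen to match the animal example.

    #include <iostream>
    #include <map>
    #include <string>

    // Hypothetical node type: a leaf stores a class label; an internal node
    // stores the attribute it tests and one child per attribute value.
    struct Node {
        bool leaf;
        std::string label;                       // target class if leaf
        std::string attribute;                   // question, e.g. "Hair"
        std::map<std::string, Node*> children;   // one branch per attribute value
    };

    // Steps 1-3 of the slide: perform the test at the current node, follow the
    // matching branch, and stop when a leaf is reached (recursion on the sub-tree).
    std::string classify(const Node* node,
                         const std::map<std::string, std::string>& record) {
        if (node->leaf) return node->label;
        const std::string& value = record.at(node->attribute);
        return classify(node->children.at(value), record);
    }

    int main() {
        // Build decision tree (1) of the example: Hair? -> Feathers? -> Color?
        Node kangaroo{true, "KANGAROO"}, pelican{true, "PELICAN"};
        Node tuna{true, "TUNA"}, dolphin{true, "DOLPHIN"}, whale{true, "WHALE"};
        Node color{false, "", "Color",
                   {{"White", &tuna}, {"Grey", &dolphin}, {"Black", &whale}}};
        Node feathers{false, "", "Feathers", {{"True", &pelican}, {"False", &color}}};
        Node hair{false, "", "Hair", {{"True", &kangaroo}, {"False", &feathers}}};

        // Hair = False, Feathers = False, Color = Grey  =>  DOLPHIN
        std::map<std::string, std::string> rec = {
            {"Hair", "False"}, {"Feathers", "False"}, {"Color", "Grey"}};
        std::cout << classify(&hair, rec) << std::endl;   // prints DOLPHIN
        return 0;
    }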

  • EDPNMO006/2001 (D.Taniar & K.Smith)

    7

    7

    • Using the same training data set, two different decision trees can be generated.

    Decision tree

    [Figure: decision tree (1), as on the previous slide — Hair? True -> KANGAROO; False -> Feathers?; Feathers? True -> PELICAN; False -> Color?; Color? White -> TUNA; Grey -> DOLPHIN; Black -> WHALE.]

    [Figure: decision tree (2) — Color? at the root, with Hair? and Feathers? tests at the second level leading to the KANGAROO, DOLPHIN, WHALE, PELICAN and TUNA leaves.]

    Which decision tree is better?

    • There are many ways to construct a decision tree. Although the resulting trees differ, they are all valid in the sense that they correctly classify the training data.

    • The generalization ability of two decision trees generated from the same domain may differ.

  • EDPNMO006/2001 (D.Taniar & K.Smith)

    8

    8

    • Feature selection criterion

    ◆ Decision trees (1) and (2) differ in how they position the features, or input attributes.

    ◆ The choice of feature at a node may result in a tree that is easier to understand.

    ◆ The main aim of feature selection at each point in a decision tree is to create a tree that is as simple as possible while giving the correct classification.

    ◆ Poor selection of attributes can result in a poor decision tree.

    Decision tree - Feature selection (1)

  • EDPNMO006/2001 (D.Taniar & K.Smith)

    9

    9

    Decision tree - Feature selection (2)
    • When constructing a decision tree, it is necessary to have a means of determining:

    √ the important attributes needed for the classification

    √ the ordering of the important attributes.
    • A feature selection criterion is used to determine the ranking of the input attributes.
    • The frequency information of the training data subset at a node is used to find the best splitting attribute.

    • Calculation is needed to find the best splitting attribute at a node. All possible splitting attributes are evaluated using the feature selection criterion to find the best one.

    • The feature selection criterion still does not guarantee the best decision tree; the result also relies on the completeness of the training data and on whether the training data provides enough information.

  • EDPNMO006/2001 (D.Taniar & K.Smith)

    10

    10

    Decision tree - Feature selection (3)
    • There are two frequently used feature selection criteria:
      1. Gain criterion (ID3)
      2. Gini index (CART)

    Gain criterion
    • an information-based criterion

    • let S represent the training data set, and let there be x, y, and z examples of classes C1, C2, and C3 respectively

    • the probability that an arbitrary example belongs to class C1, C2, or C3 is, respectively, x/(x+y+z), y/(x+y+z), or z/(x+y+z)

  • EDPNMO006/2001 (D.Taniar & K.Smith)

    11

    11

    ◆ when a decision tree is used to classify an example, it returns a class. A decision tree can thus be regarded as a source of a message C1, C2 or C3, with the expected information needed to generate this message given by:

    Decision tree - Feature selection (4)

    info(C1,C2,C3) = - x/(x+y+z) × log2( x/(x+y+z) )
                     - y/(x+y+z) × log2( y/(x+y+z) )
                     - z/(x+y+z) × log2( z/(x+y+z) )

  • EDPNMO006/2001 (D.Taniar & K.Smith)

    12

    12

    Decision tree - Feature selection (5)

    ◆ the expected information required for the tree with attribute A as its root is given by

      E(A) = Σ_{i=1..n} [ (xi + yi + zi) / (x + y + z) ] × info(xi, yi, zi)

    ◆ where n is the number of attribute values for attribute A, and xi, yi and zi are the numbers of examples of classes C1, C2 and C3 respectively with value Ai of attribute A

    ◆ the information gained by branching on attribute A is

      GAIN(A) = info(C1,C2,C3) - E(A)

    • the attribute with the highest GAIN is chosen for splitting.

  • EDPNMO006/2001 (D.Taniar & K.Smith)

    13

    13

    GAIN example (1)

    • 4 attributes:
      Hair = (true, false), Swims = (true, false), Color = (grey, brown, white), and Size = (small, medium, large)

    • 3 output classes (C1, C2, C3): Kangaroo, Dolphin, and Whale

    [Table: training data — 5 examples (2 Kangaroo, 2 Dolphin, 1 Whale) described by Hair, Swims, Color and Size.]

    • At the root we need to calculate the GAIN of all attributes:

    info(C1,C2,C3) = -2/5 log2(2/5) - 2/5 log2(2/5) - 1/5 log2(1/5) = 1.5219

    Attribute Hair:

    info(Hair=true)  = - 2/2 log2(2/2) - 0 - 0 = 0.0000

    info(Hair=false) = - 0 - 2/3 log2(2/3) - 1/3 log2(1/3) = 0.9183

    E(Hair) = 2/5 × info(Hair=true) + 3/5 × info(Hair=false)
            = 2/5 × 0.0000 + 3/5 × 0.9183 = 0.5509

    GAIN(Hair) = info(C1,C2,C3) - E(Hair) = 1.5219 - 0.5509 = 0.9710

    • A similar calculation on attribute Swims results in GAIN(Swims) = 0.9710

  • EDPNMO006/2001 (D.Taniar & K.Smith)

    14

    Attribute Color:

    info(Color=white) = - 0 - 1/1 log2(1/1) - 0 = 0.0000

    info(Color=grey)  = - 1/2 log2(1/2) - 1/2 log2(1/2) - 0 = 1.0000

    info(Color=brown) = - 1/2 log2(1/2) - 0 - 1/2 log2(1/2) = 1.0000

    E(Color) = 1/5 × info(Color=white) + 2/5 × info(Color=grey) + 2/5 × info(Color=brown)
             = (1/5 × 0.00) + (2/5 × 1.00) + (2/5 × 1.00) = 0.80

    hence,

    GAIN(Color) = info(C1,C2,C3) - E(Color) = 1.5219 - 0.8000 = 0.7219

    Attribute Size:

    GAIN(Size) = 0.5710

    Results:

    GAIN(Hair) = 0.9710, GAIN(Swims) = 0.9710, GAIN(Color) = 0.7219, and GAIN(Size) = 0.5710

    We can choose either attribute Swims or attribute Hair as the splitting attribute.

    [Figure: the root node is split on Swims? with true and false branches; one branch contains 3 examples.]
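    • The hand calculation above can be reproduced with a few lines of C++. This is an illustrative sketch only (it is not part of the module's SPRINT code); the class counts for the Hair attribute are taken from the example.

    #include <cmath>
    #include <cstdio>
    #include <vector>

    // info(counts) = -sum_k p_k log2 p_k over the class counts of a subset.
    double info(const std::vector<int>& counts) {
        int total = 0;
        for (int c : counts) total += c;
        double e = 0.0;
        for (int c : counts)
            if (c > 0) {
                double p = static_cast<double>(c) / total;
                e -= p * std::log2(p);
            }
        return e;
    }

    // E(A): info of each value subset of attribute A, weighted by subset size.
    double expectedInfo(const std::vector<std::vector<int>>& subsets, int total) {
        double e = 0.0;
        for (const std::vector<int>& s : subsets) {
            int n = 0;
            for (int c : s) n += c;
            e += (static_cast<double>(n) / total) * info(s);
        }
        return e;
    }

    int main() {
        std::vector<int> all = {2, 2, 1};                       // 2 Kangaroo, 2 Dolphin, 1 Whale
        std::vector<std::vector<int>> hair = {{2, 0, 0},        // Hair = true
                                              {0, 2, 1}};       // Hair = false
        double gain = info(all) - expectedInfo(hair, 5);
        std::printf("info=%.4f E(Hair)=%.4f GAIN(Hair)=%.4f\n",
                    info(all), expectedInfo(hair, 5), gain);    // 1.5219 0.5510 0.9710
        return 0;
    }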

  • EDPNMO006/2001 (D.Taniar & K.Smith)

    15

    15

    Continuous attribute
    • The values of the continuous attribute need to be sorted.
    • The middle point Z between two consecutive values is used as a candidate split point; e.g. for two consecutive values vi and vi+1:
      Z = (vi + vi+1) / 2
      giving the test A ≤ Z versus A > Z, where A is the continuous attribute.
    • The best split point is chosen.
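    • A minimal sketch of candidate split-point generation for a continuous attribute (illustrative only, not the module's code):

    #include <algorithm>
    #include <vector>

    // Return the candidate split points Z = (v_i + v_{i+1}) / 2: sort the values,
    // then take the mid-point between each pair of consecutive distinct values.
    std::vector<double> candidateSplits(std::vector<double> values) {
        std::sort(values.begin(), values.end());
        std::vector<double> splits;
        for (std::size_t i = 0; i + 1 < values.size(); ++i)
            if (values[i] != values[i + 1])                  // skip duplicates
                splits.push_back((values[i] + values[i + 1]) / 2.0);
        return splits;
    }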

  • EDPNMO006/2001 (D.Taniar & K.Smith)

    16

    16

    Gini index
    • Measures the purity or impurity of a node.
    • The attribute that results in the purer subsets will be chosen as the splitting attribute.
    • Assume an attribute A divides the set S into two subsets S1 and S2. The Gini index of a subset and of the split on attribute A are

      Gini(S) = 1 - Σ_{i=1..c} pi^2

      Gini(S1,S2) = (n1/n) Gini(S1) + (n2/n) Gini(S2)

    • where c is the number of target classes, pi is the relative frequency of class i in the subset, n is the total number of examples, n1 is the size of S1 and n2 is the size of S2.

    • The Gini index for all candidate attributes is calculated and the attribute with the smallest Gini is selected.

    • If a subset S is pure, Gini(S) = 0.

    • Example: C1 and C2 are target classes and CarType is an attribute that splits the examples into S1, S2 and S3 with class distributions (1, 4), (2, 1) and (1, 1) respectively:

    Gini(S1) = 1 - [(1/5)^2 + (4/5)^2] = 0.32

    Gini(S2) = 1 - [(2/3)^2 + (1/3)^2] = 0.444

    Gini(S3) = 1 - [(1/2)^2 + (1/2)^2] = 0.50

    Gini(S1,S2,S3) = 5/10 (0.32) + 3/10 (0.444) + 2/10 (0.50) = 0.16 + 0.133 + 0.10 = 0.393
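    • A small illustrative helper (not part of the module's code) that reproduces the CarType example above:

    #include <cstdio>
    #include <vector>

    // Gini(S) = 1 - sum_i p_i^2 over the class counts of one subset.
    double gini(const std::vector<int>& counts) {
        int n = 0;
        for (int c : counts) n += c;
        double sum = 0.0;
        for (int c : counts) {
            double p = static_cast<double>(c) / n;
            sum += p * p;
        }
        return 1.0 - sum;
    }

    // Gini of a split: size-weighted average of the subsets' Gini values.
    double giniSplit(const std::vector<std::vector<int>>& subsets) {
        int total = 0;
        for (const std::vector<int>& s : subsets)
            for (int c : s) total += c;
        double g = 0.0;
        for (const std::vector<int>& s : subsets) {
            int n = 0;
            for (int c : s) n += c;
            g += (static_cast<double>(n) / total) * gini(s);
        }
        return g;
    }

    int main() {
        // Class distributions of the example: S1=(1,4), S2=(2,1), S3=(1,1).
        std::vector<std::vector<int>> split = {{1, 4}, {2, 1}, {1, 1}};
        std::printf("Gini(S1)=%.3f Gini(S2)=%.3f Gini(S3)=%.3f split=%.3f\n",
                    gini(split[0]), gini(split[1]), gini(split[2]), giniSplit(split));
        // prints Gini(S1)=0.320 Gini(S2)=0.444 Gini(S3)=0.500 split=0.393
        return 0;
    }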

  • EDPNMO006/2001 (D.Taniar & K.Smith)

    17

    17

    Decision tree algorithm
    • The C4.5 decision tree construction algorithm:

    Partition(Data S)

    if (all data points in S are of the same class) then

    return;

    for each attribute A do

    evaluate splits on attribute A;

    Use best split found to partition S into S1 and S2;

    Partition(S1);

    Partition(S2);

    • The decision tree construction algorithm is a divide and conquer method.

    • The tree is constructed in a depth-first fashion. Branching can be binary (only 2 branches) or multi-way (>= 2 branches).

    • In general, a tree can be constructed in either a depth-first or a breadth-first fashion.

  • EDPNMO006/2001 (D.Taniar & K.Smith)

    18

    18

    C4.5
    • Problems with the C4.5 tree construction algorithm:
      ◆ a large amount of time is spent sorting continuous values at each node.
      ◆ it does not scale well; it assumes the training data fits into main memory.
    • Two major issues affect tree-growth performance:
      ◆ how to find the split points that define node tests
      ◆ having chosen a split point, how to partition the data

    • The main limitation of the C4.5 algorithm is how it handles continuous values.

    • Many algorithms have been proposed to overcome this limitation, for example:

    • Discretize continuous attributes (CLOUDS, SPEC)

    • Use a pre-sorted list for each continuous attribute (SPRINT, SLIQ, ScalParC)

    • Other limitations are: a) a feature selection criterion that favours multi-valued attributes, b) bushy trees when noise is present.

    • The tree-growth phase consumes most of the decision tree construction time.

    • The time to find the split points and to split the training data at each split point must be minimized.

  • EDPNMO006/2001 (D.Taniar & K.Smith)

    19

    19

    SPRINT
    • designed to overcome the limitations of the C4.5 way of handling continuous attributes.
    • main design concepts:
      ◆ pre-sorted continuous attributes
      ◆ one attribute list for each attribute
      ◆ new attribute lists are created at each child node

    • SPRINT (Shafer et al., 1996) was designed to handle large training data sets.

    • It was motivated by the SLIQ algorithm (Mehta et al., 1996).

    • Pre-sorting the continuous attributes saves the time of sorting them at each node.

    • The sorted order of a continuous attribute is preserved when the training data is split, so re-sorting is not necessary after splitting.

  • EDPNMO006/2001 (D.Taniar & K.Smith)

    20

    20

    Serial SPRINT - Data Structures (1)
    • Attribute lists

    ◆ initially an attribute list is created for each attribute in the training data.

    ◆ entries in these lists are attribute records that consist of an attribute value, a class label, and the index of the record (rid) from which these values were obtained.

    ◆ lists for continuous attributes are sorted by attribute value

    • The rid of each entry in an attribute list is used to keep track of which record an attribute value belongs to. In this way, even after sorting a continuous attribute, we can still find the attribute records that belong to the same input pattern.

    • If the entire data set does not fit in memory, the attribute lists can be maintained on disk.

    • The initial lists created from the training data are associated with the root of the decision tree. When a list is partitioned, the order of the records in the list is preserved; thus, partitioned lists do not require resorting.

  • EDPNMO006/2001 (D.Taniar & K.Smith)

    21

    21

    Serial SPRINT - Data Structures (2)

    [Figure: example attribute lists, e.g. the Age list sorted by value, with each entry holding an attribute value, a class label and a rid.]

  • EDPNMO006/2001 (D.Taniar & K.Smith)

    22

    22

    Serial SPRINT - Data Structures (3)
    • Histograms

    ◆ a histogram is used to store the class distribution of each attribute in the training data

    ◆ for continuous attributes, two histograms are associated with each node: one captures the class distribution below the threshold (Cbelow) and the other the class distribution above it (Cabove).

    ◆ for each categorical attribute, a class distribution histogram is associated with it.

    • The histograms are needed to calculate the Gini criterion values used to find the best split attribute.

    • The two histograms for a continuous attribute are denoted Cabove and Cbelow.

    • The histogram for a categorical attribute consists of the class distribution for each of its attribute values. This histogram is called the count matrix.

    • At each node the count matrices and continuous-attribute histograms are established to find the split attribute.

  • EDPNMO006/2001 (D.Taniar & K.Smith)

    23

    23

    Serial SPRINT - Finding split points (1)
    • The Gini index is used to find the best split attribute.
    • Only binary splits are considered.
    • The attribute with the lowest value of the Gini index is used to split a node.

    Continuous attribute
    • The mid-points between consecutive values are evaluated using the Cabove and Cbelow histograms.
    • After an attribute-list record is read, the two histograms are updated and the Gini index is calculated.
    • The split value that has the lowest Gini value is chosen as the continuous attribute's split point.

    • The SPRINT algorithm only performs binary splits. For a categorical attribute with more than two attribute values, two groups of attribute values are created; the categorical attribute list is then split by checking whether a record's attribute value falls in the first group or not.

    • At the root node, Cbelow is initialized with count 0, whereas Cabove is initialized to the class distribution of all the records.

    • After the best split value has been obtained, the class histograms' memory is deallocated.

  • EDPNMO006/2001 (D.Taniar & K.Smith)

    24

    24

    Serial SPRINT - Finding split points (2)

    [Figure: the sorted Age attribute list with the read cursor at positions 0, 3 and 6, and the corresponding Cbelow/Cabove class histograms.]

    • the target classes are H and L.

    • At cursor position 0, the Cbelow histogram is initialized with 0 for both target classes.

    • When the cursor is at position 3, the Cbelow and Cabove class histograms have been updated as shown in the figure.

    • All mid-points between consecutive values are evaluated; the split point with the smallest Gini index is chosen.
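    • The scan described above can be sketched as follows. This is a simplified, illustrative version of the idea (a single in-memory pass, no attribute lists or rids), not the SPRINT implementation discussed later.

    #include <vector>

    struct Rec { double val; int cls; };      // one entry of a sorted attribute list

    // Scan the sorted list once, moving one record at a time from Cabove to Cbelow;
    // at each mid-point compute the weighted Gini of the two partitions and keep
    // the cut with the lowest value.
    double bestCut(const std::vector<Rec>& list, int numClasses, double& bestGini) {
        std::vector<int> below(numClasses, 0), above(numClasses, 0);
        for (const Rec& r : list) above[r.cls]++;             // Cabove starts with all records
        int nBelow = 0, nAbove = static_cast<int>(list.size());
        bestGini = 2.0;                                       // larger than any Gini value
        double bestPoint = 0.0;
        for (std::size_t i = 0; i + 1 < list.size(); ++i) {
            below[list[i].cls]++;  above[list[i].cls]--;      // advance the read cursor
            nBelow++;  nAbove--;
            double g1 = 1.0, g2 = 1.0;
            for (int c = 0; c < numClasses; ++c) {
                double p1 = static_cast<double>(below[c]) / nBelow;
                double p2 = static_cast<double>(above[c]) / nAbove;
                g1 -= p1 * p1;
                g2 -= p2 * p2;
            }
            double g = (static_cast<double>(nBelow) / list.size()) * g1 +
                       (static_cast<double>(nAbove) / list.size()) * g2;
            if (g < bestGini) {
                bestGini = g;
                bestPoint = (list[i].val + list[i + 1].val) / 2.0;   // mid-point cut
            }
        }
        return bestPoint;
    }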

  • EDPNMO006/2001 (D.Taniar & K.Smith)

    25

    25

    Serial SPRINT - Finding split points (3)
    • Categorical attributes

    ◆ the splits for a categorical attribute A are of the form A ∈ S', where S' ⊂ S and S is the set of possible values of attribute A.
      Example: attribute Color, S = {blue, white, black}, S' = {blue, white}

    ◆ If the cardinality of S is less than a threshold, then all of the subsets of S are evaluated. Otherwise, a greedy algorithm is used to obtain the desired subset.

    [Figure: binary split with branches Color ∈ S' and Color ∉ S'.]

    • The SPRINT implementation only performs binary splits.

    • For a categorical attribute that is not binary, two subsets of its attribute values need to be created; the pair of subsets that produces the purest split is chosen.

    • Given m attribute values for attribute A, there are 2^m possible subsets. If m is large, evaluating all subsets may not be feasible.

    • When m is larger than a threshold, the greedy algorithm is used instead of evaluating all subsets.

    • The greedy algorithm starts with an empty subset S' and repeatedly adds the element of S that gives the best split, until there is no further improvement in the split; a simplified sketch follows.
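    • A simplified sketch of this greedy search (illustrative only; the module's own GreedyAlgorithm() method appears later in these notes). The giniOfSubset callable is an assumed helper supplied by the caller: it must return the Gini index of the binary split "value ∈ S'" versus "value ∉ S'", computed from the count matrix.

    #include <functional>
    #include <set>
    #include <string>
    #include <vector>

    // Greedy subset search: start with an empty S' and repeatedly add the attribute
    // value whose inclusion lowers the split Gini the most; stop when nothing improves.
    std::set<std::string> greedySubset(
        const std::vector<std::string>& values,
        const std::function<double(const std::set<std::string>&)>& giniOfSubset) {
        std::set<std::string> best;            // current S'
        double bestGini = 1.5;                 // larger than any possible Gini value
        bool improved = true;
        while (improved && best.size() + 1 < values.size()) {
            improved = false;
            std::string bestValue;
            for (const std::string& v : values) {
                if (best.count(v)) continue;                   // already in S'
                std::set<std::string> candidate = best;
                candidate.insert(v);
                double g = giniOfSubset(candidate);
                if (g < bestGini) { bestGini = g; bestValue = v; improved = true; }
            }
            if (improved) best.insert(bestValue);
        }
        return best;    // S'; the remaining values form the other branch
    }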

  • EDPNMO006/2001 (D.Taniar & K.Smith)

    26

    26

    Serial SPRINT - Finding split points (4)
    • Categorical attribute example

    [Figure: count matrix for attribute CarType.]

    • In the example above, the count matrix for attribute CarType is first established.

    • S will be {family, sports, truck}, let S’ = {family, sports}

    Gini(CarType ∈ S') = 1 - ((4/5)^2 + (1/5)^2) = 0.294
    Gini(CarType ∉ S') = 1 - ((0/1)^2 + (1/2)^2) = 0.75
    Gini(S1, S2) = 5/6 (0.294) + 1/6 (0.75) = 0.37

    • There are 2^3 = 8 possible subsets of S to evaluate.

  • EDPNMO006/2001 (D.Taniar & K.Smith)

    27

    27

    Serial SPRINT - Performing the split (1)
    • After the split attribute has been found for a node, two child nodes are created and the attribute records are divided between them.
    • Each attribute list is divided into two parts.
    • Entries with the same rid in different attribute lists must go to the same branch.
    • The rids of the split attribute are used to guide the other attributes on how to split their attribute lists.
    • A hash table is used to store, for each rid of the split attribute, which branch it goes to.
    • The other attribute lists probe the hash table to determine which branch each attribute record should go to.

    • As each attribute is kept in a separate attribute list, the split at a node must ensure that all parts of a record go to the same child node.

    • When partitioning the splitting attribute's list, each record's rid is hashed into a hash table to keep track of which child the record was moved to.

    • For the lists of the remaining attributes, we scan each attribute-list record and probe the hash table with the record's rid. The retrieved information tells us in which child to place the record.

    • When child nodes are created, the continuous-attribute class histograms are initialized: Cbelow with 0 and Cabove with the class distribution of all attribute-list records at that child.

  • EDPNMO006/2001 (D.Taniar & K.Smith)

    28

    28

    Serial SPRINT - Performing the split (2)
    • Example

    [Figure: the Age and CarType attribute lists split between the left and right children via the rid hash table.]

    • In the example above, the Age attribute is chosen as the split attribute. The hash table is created by hashing the rids of the Age attribute-list records; each entry records which child (L or R) the rid was assigned to.

    • The hash table is then probed with each CarType record's rid to determine which child the record should be assigned to.
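    • A condensed sketch of this rid/hash-table mechanism (illustrative; the module's Branch2Child() method shown later is the full version). The cut value and the list layout are assumptions for the example.

    #include <unordered_map>
    #include <vector>

    struct AttRec { double val; int cls; long rid; };   // attribute-list entry

    // Phase 1: split the chosen attribute's list and remember each rid's branch.
    // Phase 2: split every other attribute list by probing the rid -> branch table.
    void splitLists(const std::vector<AttRec>& splitAttList,
                    double cut,                                    // e.g. Age < cut
                    const std::vector<std::vector<AttRec>>& otherLists,
                    std::vector<std::vector<AttRec>>& leftLists,
                    std::vector<std::vector<AttRec>>& rightLists) {
        std::unordered_map<long, char> branchOf;                   // rid -> 'L' or 'R'
        for (const AttRec& r : splitAttList)
            branchOf[r.rid] = (r.val < cut) ? 'L' : 'R';

        leftLists.assign(otherLists.size(), std::vector<AttRec>());
        rightLists.assign(otherLists.size(), std::vector<AttRec>());
        for (std::size_t a = 0; a < otherLists.size(); ++a)
            for (const AttRec& r : otherLists[a])                  // probe with the rid
                (branchOf[r.rid] == 'L' ? leftLists[a] : rightLists[a]).push_back(r);
    }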

  • EDPNMO006/2001 (D.Taniar & K.Smith)

    29

    29

    Parallel SPRINT
    • a data-parallel algorithm
    • synchronized tree construction
    • the training data are partitioned among all processors; each processor works on only 1/P of the total data, where P is the number of available processors.
    • parallel sorting is used to sort the continuous attributes
    • Cabove is initialized when the continuous attribute-list records are distributed.
    • continuous attribute-list records are divided equally among all processors

    • Parallel SPRINT tree construction is similar to the serial version.

    • Each processor works on its local data (shared-nothing architecture).

    • At each node, the processors synchronize to exchange attribute histogram information; hence, synchronized tree construction.

    • A parallel sorting algorithm is used to sort each continuous attribute list. After sorting, the attribute-list records are distributed evenly to all processors.

    • Data skew may occur at lower levels of the tree.

  • EDPNMO006/2001 (D.Taniar & K.Smith)

    30

    30

    SPRINT

    [Figure: processors p0, p1, ..., pn exchange their local node information during synchronized tree construction.]

    • The SPRINT algorithm constructs the decision tree in a synchronous, breadth-first fashion.

    • The processors synchronize at each node to exchange local information such as the class distribution and the total number of examples at that node.

    • Each processor has its own data partition.

  • EDPNMO006/2001 (D.Taniar & K.Smith)

    31

    31

    Parallel SPRINT
    • Example of initial data placement

    [Figure: initial placement of the Age and CarType attribute lists on Processor 0 and Processor 1 at the root node.]

    • The figure shows the initial data placement at the root node on Processor 0 and Processor 1 (assuming only 2 processors are used).

    • As seen from the diagram, entries with the same rid need not be distributed to the same processor.

    • The continuous Age attribute list has been sorted before being distributed to the processors.

    • Problems may occur in the parallel sort when the whole attribute list cannot fit into main memory.

  • EDPNMO006/2001 (D.Taniar & K.Smith)

    32

    32

    Parallel SPRINT - Finding split points (1)
    • Similar to the serial version.
    • Each processor scans its local attribute lists in parallel to collect the local histogram information.
    • The local class distribution alone is not enough to calculate the Gini index value.
    • Processors cooperate to gather the global class distribution.
    • For continuous attributes, global class distributions are exchanged before the local scan starts.
    • For categorical attributes, global class distributions are exchanged after the local scan ends.

    Continuous attributes

    • In parallel SPRINT the Cbelow and Cabove histograms need to be initialized to reflect the class distributions held by the other processors.

    • Cbelow initially reflects the lower sections of the attribute list on the other processors, that is

      Cbelow(p) = Σ_{i=0..p-1} Cabove(i)

      where p is the processor rank.

    • Cabove must initially reflect the local attribute-list section plus the sections held by the higher-ranked processors, that is

      Cabove(p) = Σ_{i=p..n-1} Cabove(i)

      where n is the total number of processors and Cabove(i) denotes the local class distribution of the attribute-list section on processor i.

    • Once all the local attribute-list sections have been processed, the processors communicate to determine which of the N candidate split points has the lowest cost.

  • EDPNMO006/2001 (D.Taniar & K.Smith)

    33

    Categorical attributes

    • After each processor has scanned its local attribute list to establish its local count matrix, all processors perform an all-to-all exchange of count matrices to produce the global count matrix.

    • Alternatively, a coordinator can be assigned to collect all count matrices and then distribute the result to the other processors.

    • Each processor can then find the best split attribute in parallel.

  • EDPNMO006/2001 (D.Taniar & K.Smith)

    34

    34

    Parallel SPRINT - Finding split points (3)

    [Figure: local and global class histograms (Age) and count matrices (CarType) on two processors.]

    • The figure gives an example of how the global class histograms and count matrices are obtained for continuous and categorical attributes.

    • For the continuous attribute (Age), the local histograms are created when the attribute-list records are first distributed. The Cabove of processor 0 is initialized by adding processor 1's local class histogram to its own; the Cbelow of processor 1 is initialized with the Cabove (local class histogram) of processor 0.

    • For the categorical attribute (CarType), both processors establish their local count matrices. The count matrices are then summed to obtain the global count matrix; both processors end up with the same count matrix.

  • EDPNMO006/2001 (D.Taniar & K.Smith)

    35

    35

    Parallel SPRINT - Split attribute lists (1)
    • After the split point has been determined, two new children are created and the attribute lists are split into two sections.
    • Due to the parallel sorting, each processor holds a different set of continuous attribute records. A processor may therefore not know which branch to assign a record of a non-split attribute to, because the split-attribute part of the same record may reside on another processor.
    • Thus, it is necessary for a processor to collect the hash-table rids from all other processors.

    • In parallel SPRINT, each processor holds a different portion of the attribute-list records.

    • When performing a split based on the chosen attribute A, a processor may not know which child to assign a record of attribute list B to, because the corresponding rid entry of attribute A is held by another processor.

    • To overcome this problem, a processor needs to collect the rids from all other processors so that it knows how to split its local attribute-list records.

    • After collecting all the rids, each processor can perform the split independently.

  • EDPNMO006/2001 (D.Taniar & K.Smith)

    36

    36

    Parallel SPRINT - Split attribute lists (2)
    • Each processor builds its own local hash table based on the chosen split attribute.
    • The hash-table entries (rids) are collected from all processors.
    • Each processor then has the complete set of entries in its local hash table and can probe it to perform the splits.

  • EDPNMO006/2001 (D.Taniar & K.Smith)

    37

    37

    Parallel SPRINT - Implementation
    • Files
      ◆ sprint.cpp
      ◆ treenode.cpp, treenode.h
      ◆ chtbl.cpp, chtbl.h
      ◆ list.cpp, list.h
      ◆ support.cpp
      ◆ global.h

    • Object classes definition
      ◆ TreeNode { }

  • EDPNMO006/2001 (D.Taniar & K.Smith)

    38

    38

    Input Format
    • Input files
      filename.data
      filename.names

    • Input format
      ◆ attributes and classes (.names):

        @class class1, class2.
        @attribute att1: continuous.
        @attribute att2: attval1, attval2.

      ◆ training pattern (.data):

        rid value1, value2, ..., valuen, class.

    • The attribute and class information is provided by the .names file. The target classes available in the training patterns are stated with the keyword “@class”, followed by the class names separated by commas. The class line ends with a dot “.”. Spaces are allowed between class names, but no space is allowed between the last class name and the “.”.

    • Each attribute is specified using the “@attribute” keyword, followed by the attribute name and a colon “:”. After that, if it is a continuous attribute the keyword “continuous” is used; otherwise it is a discrete attribute and all allowable discrete attribute values are listed, separated by commas. The attribute line ends with the dot “.” character.

    • The training patterns are stored in the .data file (both files must have the same prefix). Each training pattern starts with the record id, followed by the attribute values separated by commas. The attribute values MUST follow exactly the order given in the .names file. The class name is the last value.

    • Example:

      @class Play, Don't Play.
      @attribute Humidity: continuous.
      @attribute Outlook: sunny, overcast, rain.
      @attribute Windy: true, false.
      @attribute Temp: continuous.

      A training pattern is: 1 70, sunny, false, 75, Don't Play.

  • EDPNMO006/2001 (D.Taniar & K.Smith)

    39

    39

    Global variables
    • MaxAtt - total number of attributes
    • MaxClass - total number of target classes
    • MaxDiscVal - maximum number of attribute values among all discrete attributes
    • MaxContAtt - total number of continuous attributes
    • ClassName - array of class names (strings)
    • MINOBJS - minimum number of objects in a node before it can be split
    • Attributes - array storing each attribute's information
    • MAXSETSIZE - maximum number of discrete attribute values before greedy subsetting is used

  • EDPNMO006/2001 (D.Taniar & K.Smith)

    40

    40

    TreeNode structure (1)
    • Contains the attribute lists, class histograms and count matrix.
    • Methods to evaluate the Gini index for continuous and discrete attributes.
    • A method for exchanging hash table entries.
    • Stores the split attribute information, such as the cut point for a continuous attribute and the value subset used for splitting a discrete attribute.

    1. class TreeNode

    { public:

    TreeNode(bool *);

    ~TreeNode();

    5. void FindSplit();

    void AddAttList(List * attlst, short att,

    COUNT_TYPE *& cabv, COUNT_TYPE *&);

    void ExtractRules(long);

    private:

    DiscSplit* EvalDiscAtt(short att);

    10. DiscSplit* GreedyAlgorithm(short);

    DiscSplit* EvalAllSubset(short);

    ContSplit* EvalContAtt(short att);

    int ExchangeSplitInfo(CHTbl *);

    void Branch2Child(void);

    15. float EvalSubsetGini(short , int *, unsigned int);

    bool IsSubset(float );

    void reset();

    void ReclaimMemory();

  • EDPNMO006/2001 (D.Taniar & K.Smith)

    41

    19. COUNT_TYPE **Cbelow, **Cabove;

    COUNT_TYPE *ClassFreq;

    COUNT_TYPE **countmatrix;

    short SplitAtt;

    float cut;

    int * BestSubset;

    25. short BestSubsetLen;

    List **AttList;

    bool * Tested;

    TreeNode *LeftBranch,

    *RightBranch;

    bool LeafNode;

    30. short TargetClass;

    };

    • Lines 5: FindSplit() - method to find the best split in the current node

    • Lines 6: AddAttList() - method to add a split attribute list received from the parent node.

    • Lines 7: ExtractRules() - Extract decision tree rules.

    • Lines 9: EvalDiscAtt() - evaluate discrete attributes to find the gini index.

    • Lines 10-11: Helper method to EvalDiscAtt().

    • Lines 12: EvalContAtt() - find gini values for continuous attributes.

    • Lines 13: ExchangeSplitInfo() - Exchange hash table entries from all processors.

    • Lines 14: Branch2Child() - Split the attribute lists using the split attribute and create two child nodes.

    • Lines 15: Used for discrete attribute evaluation.

    • Lines 19: Continuous attribute histograms. Cbelow[i][j] or Cabove[i][j] is the Cbelow/Cabove count for the ith attribute and class j.

    • Lines 21: Discrete attributes count matrix (shared by all discrete variables)

    • Lines 22: SplitAtt - the best split attribute. -1 means dead end leaf.

    • Lines 24: The indexes of discrete attribute values for splitting.

    • Lines 26: Attribute lists. AttList[i] means the list for the ith attribute.

    • Lines 27: Array of size MaxAtt to store which attributes have been used for splitting so far, so that it won't be used for splitting again in the child node.

    • Lines 28: Pointer to left and right child nodes.

    • Lines 29: true if it is a leaf node.

    • Lines 30: target class if it is a leaf node.

  • EDPNMO006/2001 (D.Taniar & K.Smith)

    42

    42

    Data Structures
    • Attribute

    typedef struct
    { char  *AttName;
      short MaxDiscVal;
      char  **AttValName;
    } Attribute;

    • Attribute list record

    typedef struct
    { float val;
      short Class;
      long  rid;
    } AttRec;

    • Hash table entry

    typedef struct
    { long rid;
      char branch;
    } HshCell;

  • EDPNMO006/2001 (D.Taniar & K.Smith)

    43

    43

    Main Control Flows
    • Partition the database file into P partitions.
    • Read the local partition and create the attribute lists.
    • For all continuous attributes, perform a parallel sort and distribute the values to all processors.
    • Create a root node and add the attribute lists to it. For continuous attributes, exchange local histogram information to initialize the Cabove and Cbelow in the local nodes.
    • Add the root node to the queue.
    • While the queue is not empty, take one tree node from the queue and find its split point.
    • The root processor extracts the decision rules.

    1. int main(int argc, char * argv[])

    { ...

    if (p_rank == ROOT_PROC)

    partition(filename,total_p,totitems);

    5. MPI_Barrier(io_comm);

    getNames(filename);

    getData(filename);

    BFQueue = new Queue((long)pow(2.0,MaxAtt+1));

    ...

    for( i=0;i < MaxAtt; i++)

    10. { if (Attributes[i].MaxDiscVal == 0)

    { SortContAtt(arr);

    ...

    RecvArr = ParallelSort(arr,s,RecvSize);

    gAttlist[i] = array_to_list(RecvArr,RecvSize);

    }

    15. }

    root = new TreeNode(NULL); /*starting root node*/

    Cabove = (unsigned int*) malloc(sizeof(unsigned int) *MaxClass);

    Cbelow = (unsigned int*) malloc(sizeof(unsigned int) *MaxClass);

  • EDPNMO006/2001 (D.Taniar & K.Smith)

    44

    for(j=0; j < MaxAtt; j++)

    20. { if (Attributes[j].MaxDiscVal ==0)

    {

    hd = list_head(gAttlist[j]);

    while(hd != NULL)

    {

    25. Cabove[((AttRec *)hd->data)->Class]++;

    hd=list_next(hd);

    }

    InitContHistograms(Cabove,Cbelow);

    }

    30. root->AddAttList(gAttlist[j], j, Cabove, Cbelow);

    }

    BFQueue->Enqueue(root);

    while((Node = BFQueue->Dequeue()) != NULL)

    {

    35. Node->FindSplit();

    Node = NULL;

    }

    if (p_rank == ROOT_PROC)

    root->ExtractRules(0);

    ...

    40. return 1;

    }

    • Lines 4: The root processor performs the data partitioning. The total number of training examples is input by the user.

    • Lines 6-7: Each processor reads its local partition.

    • Lines 9-15: Perform parallel sort for continuous attributes.

    • Lines 16: Create a root node.

    • Lines 19-31: Initialize the Cabove and Cbelow histograms to reflect the other parts of the continuous attribute lists. In lines 23-27 the local Cabove histogram is initialized. The InitContHistograms() function is called to initialize the local Cabove and Cbelow.

    • Lines 32: Put the root node into queue.

    • Lines 33-37: Get one node from the queue and find the split attribute.

    • Lines 39: The root processor extracts the decision rules.

  • EDPNMO006/2001 (D.Taniar & K.Smith)

    45

    45

    Parallel Sort (1)
    • Sample-based partitioning (p = number of processors)

    Step 1. Sort the records of continuous attribute A locally on each processor.

    Step 2. Pick a sample of size p of attribute A from each processor. Gather the samples on the root processor and sort them locally.

    Step 3. Pick p-1 splitters (the range partition vector) from this list on the root processor and broadcast them.

    1.AttRec * ParallelSort(AttRec *& Attarray, long &inArrSize, long &outArrSize)

    { ...//variables declaration

    Find_range_vector(Attarray,&inArrSize);

    send_displacements = (int*) calloc(total_p,sizeof(int));

    5. recv_displacements = (int*) calloc(total_p,sizeof(int));

    send_counts = (int*) calloc(total_p, sizeof(int));

    recv_counts = (int*) calloc(total_p, sizeof(int));

    Find_send_params(Attarray, inArrSize, send_displacements,

    send_counts);

    MPI_Alltoall(send_counts,1, MPI_INT, recv_counts, 1,

    MPI_INT, io_comm);

    10. mytotal = 0;

    for(i=0; i < total_p; i++)

    mytotal += recv_counts[i];

    outArrSize = mytotal;

    mypart = (AttRec*) malloc(sizeof(AttRec) * mytotal);

    15. Find_recv_params(recv_counts,recv_displacements);

    CreateMpiTyp(&recordtyp, mypart);

  • EDPNMO006/2001 (D.Taniar & K.Smith)

    46

    46

    Parallel Sort (2)

    Step 4. Using the splitters, each processor divides its local attribute records into p sorted partitions, with all values of attribute A in a lower partition being lower than those in a higher-numbered partition.

    Step 5. Using MPI_Alltoallv() communication, each partition is sent to its destination processor.

    Step 6. Each processor merges the sorted partitions obtained from the other processors, resulting in a set of attribute records sorted globally on attribute A.

    MPI_Alltoallv(Attarray, send_counts, send_displacements,

    recordtyp, mypart, recv_counts, recv_displacements,recordtyp,

    io_comm);

    qsort(mypart,mytotal,sizeof(AttRec),cmpContval);

    return mypart;

    20. }

    • Lines 3: Perform step 2 and step 3.

    • Lines 8: Find the send counts and displacements for the MPI call, i.e. how many continuous attribute records will be sent to each of the other processors.

    • Lines 9: Distribute the information on how many records each processor will receive from this processor, including itself. (Step 4)

    • Lines 14: Allocate memory for the receive buffer.

    • Lines 15: Find the displacement for the receive buffer.

    • Lines 17: Perform all-to-all exchange on the continuous attribute records (Step 5).

    • Lines 18: Sort local partition (Step 6).

  • EDPNMO006/2001 (D.Taniar & K.Smith)

    47

    47

    Parallel Sort (3)
    • Forming the range partitioning vector using the sample-based method.

    [Figure: Processors 0, 1, 2, ..., p each contribute p sample values; MPI_Gather collects them on Processor 0, which picks p-1 values as the range vector.]

    • The figure shows each processor collecting p sample values from its local partition; these are then gathered on the root processor.

    • The root processor then picks p-1 continuous attribute values as the range partitioning vector.

  • EDPNMO006/2001 (D.Taniar & K.Smith)

    48

    48

    Parallel Sort (4)
    • Partitioning the continuous attribute using MPI_Alltoall.

    [Figure: each of Processors 0, 1, 2, ..., p sends one partition of its locally sorted records to every other processor (all-to-all exchange).]

  • EDPNMO006/2001 (D.Taniar & K.Smith)

    49

    49

    Initializing Histograms
    • Initialize Cbelow of processor i with the Cabove (local class counts) of the processors j < i.
    • Initialize Cabove of processor i by adding the Cabove of the processors j with i < j < p to its own local counts.

    1. void InitContHistograms(unsigned int *cabvHis, unsigned int*cblwHis)

    { int i,k;

    unsigned int * CaboveBuffer;

    CaboveBuffer=(unsigned int*) malloc(sizeof(unsigned int) *

    MaxClass*total_p);

    5. MPI_Allgather(cabvHis, MaxClass, MPI_UNSIGNED, CaboveBuffer,

    MaxClass, MPI_UNSIGNED, io_comm);

    for(i=(p_rank+1); i < total_p; i++)

    {

    for(k=0;k < MaxClass; k++)

    cabvHis[k] += CaboveBuffer[(i*MaxClass)+k];

    10. }

    for(i=0; i < p_rank; i++)

    {

    for(k=0; k < MaxClass; k++)

    cblwHis[k] += CaboveBuffer[(i*MaxClass)+k];

    15. }

    }

    • Lines 5: Gather the Cabove histograms from all processors.

    • Lines 6-10: Initialize Cabove.

    • Lines 11-15: Initialize Cbelow.

  • EDPNMO006/2001 (D.Taniar & K.Smith)

    50

    50

    Finding Split Points (1)
    • Obtain the global class distribution at this node.
    • Check whether this node can be split by looking at the frequency of the most frequent class in this node.
    • If the most frequent class in this node satisfies the leaf-node conditions, then set this node to be a leaf node and assign a class label to it.
    • Otherwise, evaluate all continuous and discrete attributes.
    • Find the best split attribute (lowest Gini value).
    • Create two child nodes based on the best split attribute.

    1. void TreeNode::FindSplit()

    { ...//variables initialization omitted for brevity.

    TotalCases = 0;

    for(j=0; j < MaxAtt; j++)

    5. if(AttList[j] != NULL)

    break;

    elem = list_head(AttList[j]);

    for(i=0; i < MaxClass; i++)

    ClassFreq[i] = 0;

    10. while(elem!= NULL)

    {

    Class = ((AttRec*)elem->data)->Class;

    ClassFreq[Class]++;

    elem = list_next(elem);

    15. }

    MPI_Allreduce(ClassFreq, GlobalClassFreq, MaxClass, MPI_UNSIGNED,

    MPI_SUM, io_comm);

    BestClass = 0;

    TotalCases = GlobalClassFreq[0];

  • EDPNMO006/2001 (D.Taniar & K.Smith)

    51

    for(i=1; i < MaxClass; i++)

    20. { if (GlobalClassFreq[i] > GlobalClassFreq[BestClass])

    BestClass = i;

    TotalCases += GlobalClassFreq[i];

    }

    if( (GlobalClassFreq[BestClass] == TotalCases) ||

    (TotalCases < (2*MINOBJS)))

    25. { LeafNode = true;

    TargetClass = BestClass;

    return;

    }

    ...

    for(i=0; i < MaxAtt; i++)

    30. { if (Tested[i])

    continue;

    if (Attributes[i].MaxDiscVal > 0){

    DiscGini[i] = EvalDiscAtt(i);

    GiniAtt[i] = DiscGini[i]->gini;

    35. }else

    { ContGini = EvalContAtt(i);

    GiniAtt[i]= ContGini->gini;

    CutPoint[i]=ContGini->cut;

    }

    40. }

    BestAtt = -1; BestGini = 1.5;

    for(i=0; i < MaxAtt; i++)

    { if (Tested[i])

    continue;

    45. if(GiniAtt[i] < BestGini)

    { BestAtt = i;

    BestGini = GiniAtt[i];

    }

    }

  • EDPNMO006/2001 (D.Taniar & K.Smith)

    52

    50. if(BestAtt == -1)

    { LeafNode = true;

    TargetClass = -1;

    SplitAtt = -1;

    }else

    55. { SplitAtt = BestAtt;

    if (Attributes[BestAtt].MaxDiscVal == 0)

    cut = CutPoint[BestAtt];

    else{

    BestSubset = (int *)malloc(sizeof(int) *

    DiscGini[BestAtt]->len +1);

    60. for(i=1; i <= DiscGini[BestAtt]->len; i++)

    BestSubset[i] = DiscGini[BestAtt]->BestSubset[i];

    BestSubsetLen = DiscGini[BestAtt]->len;

    }

    Tested[BestAtt] = true;

    65. }

    if (SplitAtt != -1)

    Branch2Child();

    }

    • Lines 4-15: Find the classes distribution in this tree node.

    • Lines 16: Exchange local classes distribution with others processors.

    • Lines 19-23: Find the most frequent class for examples in this node.

    • Lines 24-28: Check whether this node can be a leaf node. It is a leaf node if all examples in this node belong to a single class, or if the total number of examples is less than (2 × MINOBJS).

    • Lines 29-40: Evaluate all attributes in this node that have not been tested so far. Call EvalDiscAtt() to evaluate a discrete attribute and EvalContAtt() to evaluate a continuous attribute.

    • Lines 41-49: Find the best split attribute, i.e. the one with the lowest Gini index value.

    • Lines 50-54: If all attributes have already been used at higher levels of the decision tree, no attribute can be used to split this node, so it is a dead-end leaf.

    • Lines 55-65: Save the chosen split attribute information.

    • Lines 67: Create two child nodes by splitting the attribute lists in this node using the selected split attribute.

  • EDPNMO006/2001 (D.Taniar & K.Smith)

    53

    1. DiscSplit * TreeNode::EvalDiscAtt(short att)

    { ...

    reset();

    Numval = Attributes[att].MaxDiscVal;

    5. if (Numval > 0)

    { hd = list_head(AttList[att]);

    while(hd != NULL)

    { c = ((AttRec*)list_data(hd))->Class;

    v= (int)((AttRec*)list_data(hd))->val;

    10. countmatrix[v][c] = countmatrix[v][c] + 1;

    hd = list_next(hd);

    }

    ...

    for(i=0; i < Numval; i++)

    15. for(j=0; j < MaxClass; j++)

    localcountmat[(i * MaxClass) + j] = countmatrix[i][j];

    totsr = Numval * MaxClass;

    MPI_Allreduce(localcountmat, globalcountmat, totsr,

    MPI_INT, MPI_SUM, io_comm);

    for(i=0; i < Numval; i++)

    20. for(j=0; j < MaxClass; j++)

    countmatrix[i][j] = globalcountmat[(i*MaxClass) + j];

    if (Attributes[att].MaxDiscVal > MAXSETSIZE)

    GiniRec = GreedyAlgorithm(att);

    else

    25. GiniRec = EvalAllSubset(att);

    }else

    return 0;

    return GiniRec;

    }

    • Lines 7-12: Build the count matrix for the attribute currently being evaluated.

    • Lines 14-16: Store the count matrix entries in a flat array so that they can be passed to the MPI function.

    • Lines 18: Build the global count matrix from all processors using the MPI_SUM operator.

    • Lines 19-21: Update the local count matrix with the global count matrix.

    • Lines 23: If the number of attribute values for this attribute is greater than MAXSETSIZE, use the greedy algorithm to find the best Gini; otherwise evaluate all value subsets of this attribute (Lines 25).

  • EDPNMO006/2001 (D.Taniar & K.Smith)

    54

    54

    Greedy Subset Evaluation (1)
    • Create two subsets of attribute values for the attribute being evaluated.
    • Incrementally add one attribute value to the first subset until there is no further improvement in the Gini index value.

    1. DiscSplit *TreeNode::GreedyAlgorithm(short att)

    { ...

    totrecords = 0;

    rowtot = (unsigned int *) malloc(sizeof(unsigned int) *

    Attributes[att].MaxDiscVal );

    5. for(i=0; i < Attributes[att].MaxDiscVal; i++)

    {

    rowtot[i] = 0;

    for(j=0; j < MaxClass; j++)

    {

    10. rowtot[i] += countmatrix[i][j]; /*total of each row */

    }

    totrecords += rowtot[i];

    }

    s1 = 0;

    15. curGini = 1.5; //arbitrary value

    for (i=0; i < Attributes[att].MaxDiscVal ; i++)

    {

    prevGini = curGini;

    rowsubtot=0;

  • EDPNMO006/2001 (D.Taniar & K.Smith)

    55

    20.     for(j=0; j <= i; j++)
                rowsubtot += rowtot[j];   /* total records in rows 0..i (first subset) */
            ...                           /* gini evaluation of the two subsets (see the notes below) */
            newsplit->gini = curGini;
            newsplit->len = (i+1);
            for(j=1; j <= newsplit->len; j++)
                newsplit->BestSubset[j] = (j-1);
            return newsplit;
    }

  • EDPNMO006/2001 (D.Taniar & K.Smith)

    56

    Remember that the count matrix is organized as below for an attribute that has v values; c is the total number of classes.

    [Table: count matrix — rows correspond to the attribute values (0 .. v) and columns to the classes (0 .. c); entry (i, j) is the number of examples with attribute value i and class j.]

    • Lines 20-21: Calculate total number of records for row 0 to i.

    • Lines 23-31: Find the gini index for the first subset.

    • Lines 34-37: If the first subset already has size v, stop, because no attribute values are left for the second subset.

    • Lines 39-50: Calculate the Gini value for the second subset.

    • Lines 52-55: Check whether the Gini value of the current pair of subsets is greater than that of the previous pair; if it is, stop.

  • EDPNMO006/2001 (D.Taniar & K.Smith)

    57

    57

    Creating Child Nodes (1)

    Steps
    • For each record in the chosen split attribute's list, determine which child (left or right) it goes to, based on the value subset (discrete attribute) or the cut point (continuous attribute). Also record which child the record goes to by hashing the record's rid into a hash table.

    • Exchange the hash table entries with the other processors.

    • Split the other attribute lists' records by probing the hash table for the branch each record should go to.

    1. void TreeNode::Branch2Child(void)

    { ...

    hashtbl = (CHTbl*) malloc(sizeof(CHTbl));

    if(list_size(AttList[SplitAtt]) > 0)

    5. { chtbl_init(hashtbl,list_size(AttList[SplitAtt]) *

    total_p,hashfunc,matchcell,free);

    }else

    chtbl_init(hashtbl,total_p*99,hashfunc,matchcell, free);

    LCabove = (unsigned int*) malloc(sizeof(unsigned int) * MaxClass);

    RCabove = (unsigned int*) malloc(sizeof(unsigned int) * MaxClass);

    10. LCbelow = (unsigned int*) malloc(sizeof(unsigned int) * MaxClass);

    RCbelow = (unsigned int*) malloc(sizeof(unsigned int) * MaxClass);

    for(i=0; i < MaxClass; i++)

    {

    LCabove[i]=RCabove[i]= 0;

    15. LCbelow[i]=RCbelow[i]=0;

    }

    hd = list_head(AttList[SplitAtt]);

    Llist = (List*) malloc(sizeof(List));

    Rlist = (List*) malloc(sizeof(List));

    ...

  • EDPNMO006/2001 (D.Taniar & K.Smith)

    58

    20. while(hd != NULL)

    {

    rec = (AttRec*)list_data(hd);

    CLS = rec->Class;

    newrec = (AttRec*) malloc(sizeof(AttRec));

    25. newrec->Class = CLS;

    newrec->rid = rec->rid;

    newrec->val = rec->val;

    newcell = (HshCell*) malloc(sizeof(HshCell));

    newcell->rid = rec->rid;

    30. if (IsSubset(rec->val))

    {

    if(Attributes[SplitAtt].MaxDiscVal ==0)

    LCabove[CLS]++;

    list_ins_next(Llist,list_tail(Llist),newrec);

    newcell->branch = '0';

    }else

    {

    if(Attributes[SplitAtt].MaxDiscVal ==0)

    RCabove[CLS]++;

    40. list_ins_next(Rlist,list_tail(Rlist),newrec);

    newcell->branch = '1';

    }

    chtbl_insert(hashtbl, newcell);

    45. hd = list_next(hd);

    }//while

    Tested[SplitAtt] = true;

    if(Attributes[SplitAtt].MaxDiscVal == 0)

    InitContHistograms(LCabove, LCbelow);

    50. LeftChild = new TreeNode(Tested);

    RightChild= new TreeNode(Tested);

    LeftChild->AddAttList(Llist,SplitAtt,LCabove,LCbelow);

    if(Attributes[SplitAtt].MaxDiscVal == 0)

    InitContHistograms(RCabove, RCbelow);

  • EDPNMO006/2001 (D.Taniar & K.Smith)

    59

    55. RightChild->AddAttList(Rlist,SplitAtt,RCabove,RCbelow);

    temp = (HshCell*) malloc(sizeof(HshCell));

    data = temp;

    ExchangeSplitInfo(hashtbl);

    Llist=Rlist=NULL;

    60. for(i=0;i < MaxAtt; i++)

    {

    if (i==SplitAtt)

    continue;

    Llist = Rlist=NULL;

    65. Llist = (List *) malloc(sizeof(List));

    Rlist = (List *) malloc(sizeof(List));

    list_init(Llist,free); list_init(Rlist,free);

    hd = list_head(AttList[i]);

    if(Attributes[i].MaxDiscVal ==0)

    70. for(j=0; j < MaxClass;j++)

    { LCabove[j]=RCabove[j]=0;

    LCbelow[j]=RCbelow[j]=0;

    }

    while(hd != NULL)

    75. {

    rec = (AttRec*)list_data(hd);

    data->rid = rec->rid;

    chtbl_lookup(hashtbl,(void**)&data);

    newrec = (AttRec*) malloc(sizeof(AttRec));

    80. newrec->Class = rec->Class;

    newrec->rid = rec->rid;

    newrec->val= rec->val;

    if (data->branch == '0')

    {

    85. if (Attributes[i].MaxDiscVal == 0)

    LCabove[newrec->Class]++;

    list_ins_next(Llist,list_tail(Llist), newrec);

    } else if (data->branch == '1')

    { if (Attributes[i].MaxDiscVal == 0)

    90. RCabove[newrec->Class]++;

    list_ins_next(Rlist,list_tail(Rlist), newrec);

    }else {...}

    newrec = NULL;

  • EDPNMO006/2001 (D.Taniar & K.Smith)

    60

    hd = list_next(hd);

    95. data = temp;

    }//while

    if (Attributes[i].MaxDiscVal == 0)

    { InitContHistograms(LCabove, LCbelow);

    InitContHistograms(RCabove, RCbelow);

    100. }

    LeftChild->AddAttList(Llist,i,LCabove,LCbelow);

    RightChild->AddAttList(Rlist,i,RCabove, RCbelow);

    }//for

    ...

    LeftBranch = LeftChild;

    105. RightBranch = RightChild;

    BFQueue->Enqueue(LeftBranch);

    BFQueue->Enqueue(RightBranch);

    }

    • Lines 3-7: Create and initialize the chained hash table.

    • Lines 18-19: Create the left and right attribute lists.

    • Lines 20-46: Split the split attribute's list into the left and right subtrees and hash the records into the hash table.

    • Lines 30: Check whether the current attribute record belongs to the left or the right subtree based on the record value.

    • Lines 33, 39: For a continuous attribute the Cabove histogram needs to be updated.

    • Lines 49: If the split attribute is a continuous attribute, then get the global histogram information.

    • Lines 50,51: Create two child nodes.

    • Lines 52,55: Add the split attribute's records to the left and right subtrees.

    • Lines 58: Exchange hash table entries between all processors.

    • Lines 60-96: Split the other attributes by probing the global hash table.

    • Lines 78: Look up the hash table to find where the current record should go.

    • Lines 101-102: Add the attribute-list records to the left and right subtrees.

    • Lines 104-107: Point this node to its left and right children and push them into the queue for further splitting.

  • EDPNMO006/2001 (D.Taniar & K.Smith)

    61

    61

    Evaluate Continuous Attribute (1)
    • Evaluate all the cut points in the local node to find their Gini index values.
    • Perform an all-gather to collect the cut points and Gini values from the other processors.
    • Each processor independently finds the best cut point from the global cut points and Gini values.

    1. ContSplit* TreeNode::EvalContAtt(short att)

    { ... //variables declaration

    totalcutpoint = list_size(AttList[att]);

    if(totalcutpoint > 0)

    5. {

    totalcutpoint = totalcutpoint+1;

    ... //calculate the first cut point here

    cutpointidx = 1;

    while((elem != NULL) && (cutpointidx < (totalcutpoint-1)))

    {

    10. c = ((AttRec*)elem->data)->Class;

    next = list_next(elem);

    val1 = ((AttRec*)elem->data)->val;

    val2 = ((AttRec*)next->data)->val;

    cutpoint[cutpointidx] = (float)(val1 + val2)/2.0;

    15. Cbelow[att][c]++;

    Cabove[att][c]--;

    cbelowtot++;

    cabovetot--;

    s1 = 0; s2 = 0;

  • EDPNMO006/2001 (D.Taniar & K.Smith)

    62

    20. for(i=0; i < MaxClass; i++)

    { prob1 = (double)Cbelow[att][i]/(double)cbelowtot;

    s1 += (prob1 * prob1);

    prob2 = (double)Cabove[att][i]/(double)cabovetot;

    s2 += (prob2 * prob2);

    25. }

    s1 = 1.0 - s1; s2 = 1.0 - s2;

    Gini1 = ((double)cbelowtot/(double)sumrecord) * s1 + \

    ((double)cabovetot/(double)sumrecord) * s2;

    ginival[cutpointidx] = Gini1;

    cutpointidx++;

    30. elem = list_next(elem);

    }//while

    ...//calculate the last cut point

    }else

    {

    cutpoint = (float*)malloc(sizeof(float));

    35. ginival = (float*)malloc(sizeof(float));

    }

    recvcounts = (int*)malloc(sizeof(int) * total_p);

    MPI_Allgather(&totalcutpoint,1, MPI_INT, recvcounts, 1,

    MPI_INT, io_comm);

    sumrecord = 0;

    40. for(i=0; i < total_p; i++)

    sumrecord += recvcounts[i];

    ...

    MPI_Allgatherv(cutpoint, totalcutpoint, MPI_FLOAT,

    globalcutpoint, recvcounts, displacements, MPI_FLOAT, io_comm);

    MPI_Allgatherv(ginival, totalcutpoint, MPI_FLOAT, globalgini,

    recvcounts, displacements, MPI_FLOAT, io_comm);

    bestpoint = (ContSplit*) malloc(sizeof(ContSplit));

    45. bestpoint->gini =1.5;

    for(i=0; i < sumrecord; i++)

    if (globalgini[i] < bestpoint->gini)

    {

    bestpoint->gini= globalgini[i];

    50. bestpoint->cut = globalcutpoint[i];

    }

    return bestpoint;

    }

  • EDPNMO006/2001 (D.Taniar & K.Smith)

    63

    • Lines 3: Get the total number of records in the attribute list.

    • Lines 4-31: Calculate the gini index values for all cut points.

    • Lines 12-14: The cut point is the mid-point between two consecutive values. Note that even if the two values are equal they may still belong to different classes, so no special check is made for duplicates.

    • Lines 15-18: After advancing the cursor by one record, the class distributions Cbelow (and cbelowtot) and Cabove (and cabovetot) must be updated.

    • Lines 20-25: Calculate the Gini index for the current cut point.

    • Lines 38: Gather the number of cut points available on each processor, so that the receive buffer can be sized.

    • Lines 42-43: Gather all cut points and Gini values from the other processors using MPI_Allgatherv().

    • Lines 45-51: Each processor independently determines the best cut point for this attribute and returns it.

  • EDPNMO006/2001 (D.Taniar & K.Smith)

    64

    64

    Decision Rules
    • Generate the decision rules in depth-first order.
    • Example of the decision tree generated for the Weather data:

      Humidity < 90.1
        Temp < 73.5
          Outlook in [sunny,overcast] ==> Play
          Outlook in [rain]
            Windy in [true] ==> Don't Play
            Windy in [false] ==> Play
        Temp >= 73.5
          Outlook in [sunny] ==> Don't Play
          Outlook in [overcast,rain] ==> Play
      Humidity >= 90.1 ==> Play

    Note: The implementation does not include tree pruning, so it is possible to generate dead-end leaves. Also, no default rule is assumed if no rule corresponds to a decision (target class).
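    • A minimal depth-first rule-printing sketch, as an assumption of how ExtractRules() behaves (this is not the module's actual implementation):

    #include <iostream>
    #include <string>

    // Simplified binary rule node: a leaf carries a class label, an internal node
    // carries the textual tests of its two branches (e.g. "Humidity < 90.1").
    struct RuleNode {
        bool leaf;
        std::string label;       // target class if leaf
        std::string leftTest;    // condition of the left branch
        std::string rightTest;   // condition of the right branch (e.g. "Humidity >= 90.1")
        RuleNode* left;
        RuleNode* right;
    };

    // Depth-first traversal: print each test indented by its depth and emit
    // "==> class" at the leaves, mirroring the Weather-data rules shown above.
    void extractRules(const RuleNode* n, int depth) {
        std::string indent(2 * depth, ' ');
        if (n->leaf) {
            std::cout << indent << "==> " << n->label << "\n";
            return;
        }
        std::cout << indent << n->leftTest << "\n";
        extractRules(n->left, depth + 1);
        std::cout << indent << n->rightTest << "\n";
        extractRules(n->right, depth + 1);
    }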

  • EDPNMO006/2001 (D.Taniar & K.Smith)

    65

    End of Module 4
    Classification – Decision Trees