SPRINT : A Scalable Parallel Classifier for Data Mining
![Page 1: SPRINT : A Scalable Parallel Classifier for Data Mining](https://reader036.vdocuments.net/reader036/viewer/2022062502/56814a1b550346895db74639/html5/thumbnails/1.jpg)
SPRINT : A Scalable Parallel Classifier for Data Mining
John Shafer, Rakesh Agrawal, Manish Mehta
![Page 2: SPRINT : A Scalable Parallel Classifier for Data Mining](https://reader036.vdocuments.net/reader036/viewer/2022062502/56814a1b550346895db74639/html5/thumbnails/2.jpg)
PATHWAY
Terms
Partition Algorithm
Data Structures
Performing Split
Serial SPRINT
Parallel SPRINT
Results
![Page 3: SPRINT : A Scalable Parallel Classifier for Data Mining](https://reader036.vdocuments.net/reader036/viewer/2022062502/56814a1b550346895db74639/html5/thumbnails/3.jpg)
Terms
Training Data Set
Attributes : Categorical and Continuous
Class Label
![Page 4: SPRINT : A Scalable Parallel Classifier for Data Mining](https://reader036.vdocuments.net/reader036/viewer/2022062502/56814a1b550346895db74639/html5/thumbnails/4.jpg)
Partition Algorithm
Partition( Data S ) {
    if all points in S are in the same class
        return
    for each attribute A
        evaluate split on attribute A
    find best split
    partition S into S1 and S2
    call Partition( S1 )
    call Partition( S2 )
}
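The recursive procedure above can be sketched in Python. This is a minimal in-memory illustration, not the SPRINT implementation: the helper `candidate_tests`, the tuple-based tree representation, and the choice of the Gini index as the split measure at this step are assumptions made for the example. The rows are the six training records from the example slide of this deck.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def candidate_tests(rows, attr):
    """Yield (description, predicate) pairs for one attribute (hypothetical helper)."""
    values = sorted({r[attr] for r in rows})
    if all(isinstance(v, (int, float)) for v in values):
        # Continuous attribute: midpoints between successive distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            yield f"{attr} < {t}", (lambda r, t=t: r[attr] < t)
    else:
        # Categorical attribute: single-value tests only, to keep the sketch short.
        for v in values:
            yield f"{attr} == {v}", (lambda r, v=v: r[attr] == v)

def partition(rows, attrs, label):
    """Return a class label (leaf) or a (test, left_subtree, right_subtree) tuple."""
    classes = [r[label] for r in rows]
    if len(set(classes)) <= 1:          # all points in S are in the same class
        return classes[0]
    best = None
    for attr in attrs:                  # evaluate splits on every attribute
        for desc, test in candidate_tests(rows, attr):
            s1 = [r for r in rows if test(r)]
            s2 = [r for r in rows if not test(r)]
            if not s1 or not s2:
                continue
            score = (len(s1) * gini([r[label] for r in s1])
                     + len(s2) * gini([r[label] for r in s2])) / len(rows)
            if best is None or score < best[0]:
                best = (score, desc, s1, s2)
    if best is None:                    # no usable split: fall back to majority class
        return Counter(classes).most_common(1)[0][0]
    _, desc, s1, s2 = best
    return (desc, partition(s1, attrs, label), partition(s2, attrs, label))

rows = [
    {"Age": 23, "CarT": "Family", "Risk": "High"},
    {"Age": 17, "CarT": "Sports", "Risk": "High"},
    {"Age": 43, "CarT": "Sports", "Risk": "High"},
    {"Age": 68, "CarT": "Family", "Risk": "Low"},
    {"Age": 32, "CarT": "Truck", "Risk": "Low"},
    {"Age": 20, "CarT": "Family", "Risk": "High"},
]
tree = partition(rows, ["Age", "CarT"], "Risk")
print(tree)
```

The sketch re-scans the rows at every node; SPRINT avoids that cost by keeping pre-sorted attribute lists, which the following slides describe.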
![Page 5: SPRINT : A Scalable Parallel Classifier for Data Mining](https://reader036.vdocuments.net/reader036/viewer/2022062502/56814a1b550346895db74639/html5/thumbnails/5.jpg)
Data Structures
Attribute Lists
Histograms : Continuous and Categorical
![Page 6: SPRINT : A Scalable Parallel Classifier for Data Mining](https://reader036.vdocuments.net/reader036/viewer/2022062502/56814a1b550346895db74639/html5/thumbnails/6.jpg)
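Finding Split Point
$$\text{Gini}(S) = 1 - \sum_j p_j^2$$
where $p_j$ is the relative frequency of class $j$ in $S$.
$$\text{Gini Index}(S) = \frac{n_1}{n}\,\text{Gini}(S_1) + \frac{n_2}{n}\,\text{Gini}(S_2)$$
where $S$ is split into $S_1$ and $S_2$ containing $n_1$ and $n_2$ of the $n$ points.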
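As a worked illustration of the two formulas, the following sketch computes them from per-class counts. The function names `gini` and `gini_index` and the list-of-counts representation are assumptions for this example; the counts used are those of the Age attribute list in the later slides when split after the third record (3 High, 0 Low on one side and 1 High, 2 Low on the other).

```python
def gini(class_counts):
    """Gini(S) = 1 - sum(p_j^2), given the number of points in each class."""
    n = sum(class_counts)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in class_counts)

def gini_index(counts_s1, counts_s2):
    """Weighted Gini of a binary split of S into S1 and S2."""
    n1, n2 = sum(counts_s1), sum(counts_s2)
    n = n1 + n2
    return gini(counts_s1) * n1 / n + gini(counts_s2) * n2 / n

# S1 holds 3 High / 0 Low, S2 holds 1 High / 2 Low.
split_quality = gini_index([3, 0], [1, 2])
print(split_quality)
```

A lower value means a purer split, so the split with the smallest Gini Index is the one chosen.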
![Page 7: SPRINT : A Scalable Parallel Classifier for Data Mining](https://reader036.vdocuments.net/reader036/viewer/2022062502/56814a1b550346895db74639/html5/thumbnails/7.jpg)
Split on Continuous Attributes
Threshold value : Cabove and Cbelow
Sorted Once and Sequential Scan
Deallocation of Cabove and Cbelow
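The sequential scan over a sorted attribute list can be sketched as follows. This is an illustrative sketch under assumptions: the function names are hypothetical, the histograms are modeled as `Counter` objects, and candidate thresholds are taken as midpoints between successive distinct values. The list is the Age attribute list from the example slides.

```python
from collections import Counter

def gini_from_counts(counts):
    """Gini impurity from a class -> count histogram."""
    n = sum(counts.values())
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_continuous_split(attr_list):
    """attr_list holds (value, class, rid) entries already sorted by value.

    Returns (gini_index, threshold) of the best split found in one scan.
    """
    n = len(attr_list)
    c_below = Counter()                                  # records already scanned
    c_above = Counter(cls for _, cls, _ in attr_list)    # records still to scan
    best = (float("inf"), None)
    for i in range(n - 1):
        value, cls, _rid = attr_list[i]
        c_below[cls] += 1            # move the record from Cabove to Cbelow
        c_above[cls] -= 1
        next_value = attr_list[i + 1][0]
        if next_value == value:      # no threshold fits between equal values
            continue
        n_below = i + 1
        score = (n_below * gini_from_counts(c_below)
                 + (n - n_below) * gini_from_counts(c_above)) / n
        if score < best[0]:
            best = (score, (value + next_value) / 2)
    return best

age_list = [(17, "High", 1), (20, "High", 5), (23, "High", 0),
            (32, "Low", 4), (43, "High", 2), (68, "Low", 3)]
score, threshold = best_continuous_split(age_list)
print(score, threshold)
```

Because the list is sorted once up front, each node needs only a single pass and two small histograms, which is why the histograms can be released as soon as the scan ends.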
![Page 8: SPRINT : A Scalable Parallel Classifier for Data Mining](https://reader036.vdocuments.net/reader036/viewer/2022062502/56814a1b550346895db74639/html5/thumbnails/8.jpg)
Split on Categorical Attributes
Create Count-Matrix
All subsets of attribute values as possible split point
Compute Gini Index
Gini from Count Matrix only
Memory deallocation
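A small sketch of the categorical case, assuming the count matrix is a mapping from attribute value to a per-class `Counter` and that every proper non-empty subset of values is tried exhaustively (practical only for attributes with few distinct values). The function names are hypothetical; the data is the CarT attribute list from the example slides.

```python
from collections import Counter, defaultdict
from itertools import combinations

def count_matrix(attr_list):
    """Build value -> class histogram from (value, class, rid) entries."""
    matrix = defaultdict(Counter)
    for value, cls, _rid in attr_list:
        matrix[value][cls] += 1
    return matrix

def gini_from_counts(counts):
    n = sum(counts.values())
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_categorical_split(matrix):
    """Try every subset of values using only the count matrix, not the data."""
    values = sorted(matrix)
    total = Counter()
    for row in matrix.values():
        total.update(row)
    n = sum(total.values())
    best = (float("inf"), None)
    for size in range(1, len(values)):
        for subset in combinations(values, size):
            left = Counter()
            for v in subset:
                left.update(matrix[v])
            right = total - left     # class counts of the complementary subset
            n_left = sum(left.values())
            score = (n_left * gini_from_counts(left)
                     + (n - n_left) * gini_from_counts(right)) / n
            if score < best[0]:
                best = (score, set(subset))
    return best

cart_list = [("Family", "High", 0), ("Sport", "High", 1), ("Sport", "High", 2),
             ("Family", "Low", 3), ("Truck", "Low", 4), ("Family", "High", 5)]
matrix = count_matrix(cart_list)
score, subset = best_categorical_split(matrix)
print(score, subset)
```

Each subset and its complement describe the same split, so the enumeration visits every split twice; a tighter loop could skip half of them.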
![Page 9: SPRINT : A Scalable Parallel Classifier for Data Mining](https://reader036.vdocuments.net/reader036/viewer/2022062502/56814a1b550346895db74639/html5/thumbnails/9.jpg)
Perform Split and Partitioning
Select splitting attribute and splitting value
Create two child nodes and divide data on RIDs
Optimization using Hashing <RID,child-ptr>
Optimization depending on number of RIDs
Partitioned Hashing for large hash-table
Create new histogram and count-matrix of children
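The RID-based partitioning step can be sketched as follows. This is an assumption-laden illustration: the hash table is a plain Python dict from RID to a boolean "goes to the left child", the predicate `goes_left` is a hypothetical stand-in for the chosen split test, and everything fits in memory (so the partitioned-hashing case for large tables is not shown).

```python
def split_attribute_lists(lists, split_attr, goes_left):
    """Divide every attribute list between two children.

    lists      : attribute name -> list of (value, class, rid) entries
    split_attr : name of the splitting attribute
    goes_left  : predicate on the splitting attribute's value (hypothetical)
    """
    # Scan the splitting attribute's list once to build the <RID, child> table.
    probe = {rid: goes_left(value) for value, _cls, rid in lists[split_attr]}
    left = {name: [] for name in lists}
    right = {name: [] for name in lists}
    # Every list, including the splitting one, is divided by probing on RID.
    # Appending in scan order keeps each child list in its original order.
    for name, entries in lists.items():
        for entry in entries:
            target = left if probe[entry[2]] else right
            target[name].append(entry)
    return left, right

lists = {
    "Age": [(17, "High", 1), (20, "High", 5), (23, "High", 0),
            (32, "Low", 4), (43, "High", 2), (68, "Low", 3)],
    "CarT": [("Family", "High", 0), ("Sport", "High", 1), ("Sport", "High", 2),
             ("Family", "Low", 3), ("Truck", "Low", 4), ("Family", "High", 5)],
}
left, right = split_attribute_lists(lists, "Age", lambda age: age < 27.5)
print(left["CarT"])
print(right["CarT"])
```

Since order is preserved, the continuous lists of the children stay sorted and never need to be re-sorted.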
![Page 10: SPRINT : A Scalable Parallel Classifier for Data Mining](https://reader036.vdocuments.net/reader036/viewer/2022062502/56814a1b550346895db74639/html5/thumbnails/10.jpg)
Parallel SPRINT
Environment : Shared nothing
Data placement and workload balancing
Parallel computation of categorical attribute lists
![Page 11: SPRINT : A Scalable Parallel Classifier for Data Mining](https://reader036.vdocuments.net/reader036/viewer/2022062502/56814a1b550346895db74639/html5/thumbnails/11.jpg)
Repartition of Continuous Attributes
Global Sort
Equal re-partitioning
Relation between Cabove and Cbelow and processor number
Parallel computation of split index
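The relation between the histograms and the processor number can be illustrated with a single-process simulation. The sketch assumes each processor holds one contiguous section of the globally sorted list, and that a processor starts its scan with Cbelow covering all sections of lower rank and Cabove covering its own section plus all sections of higher rank. The function name and the list-of-sections representation are hypothetical.

```python
from collections import Counter

def initial_histograms(sections):
    """Return the starting (Cbelow, Cabove) pair for each processor.

    sections[p] is the contiguous, sorted section of (value, class, rid)
    entries held by processor p after the global sort.
    """
    local = [Counter(cls for _v, cls, _rid in section) for section in sections]
    total = sum(local, Counter())       # class counts over the whole list
    result = []
    below = Counter()
    for counts in local:
        above = total - below           # own section plus higher-ranked ones
        result.append((Counter(below), above))
        below = below + counts          # the next processor starts after this one
    return result

# Age attribute list from the example slides, spread over two processors.
sections = [
    [(17, "High", 1), (20, "High", 5), (23, "High", 0)],
    [(32, "Low", 4), (43, "High", 2), (68, "Low", 3)],
]
for rank, (c_below, c_above) in enumerate(initial_histograms(sections)):
    print(rank, dict(c_below), dict(c_above))
```

In a real shared-nothing run the local class counts would be exchanged between processors; here the exchange is replaced by a loop over the sections. Once initialized, every processor can scan its own section independently and report its best local split.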
![Page 12: SPRINT : A Scalable Parallel Classifier for Data Mining](https://reader036.vdocuments.net/reader036/viewer/2022062502/56814a1b550346895db74639/html5/thumbnails/12.jpg)
Split point for Categorical Attributes
Create global matrix at coordinator
Compute split-index
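Building the global matrix amounts to adding up the local count matrices cell by cell. The sketch below simulates this in one process; the function name `merge_count_matrices` and the dict-of-`Counter` layout are assumptions, and the two local matrices correspond to the CarT list of the example slides cut into two halves.

```python
from collections import Counter

def merge_count_matrices(local_matrices):
    """Sum per-processor count matrices into one global matrix."""
    global_matrix = {}
    for matrix in local_matrices:
        for value, counts in matrix.items():
            global_matrix.setdefault(value, Counter()).update(counts)
    return global_matrix

# Processor 0 holds rids 0-2, processor 1 holds rids 3-5 of the CarT list.
local_0 = {"Family": Counter(High=1), "Sport": Counter(High=2)}
local_1 = {"Family": Counter(High=1, Low=1), "Truck": Counter(Low=1)}
global_matrix = merge_count_matrices([local_0, local_1])
print({value: dict(counts) for value, counts in global_matrix.items()})
```

The coordinator can then compute the split index from the global matrix alone, exactly as in the serial case.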
![Page 13: SPRINT : A Scalable Parallel Classifier for Data Mining](https://reader036.vdocuments.net/reader036/viewer/2022062502/56814a1b550346895db74639/html5/thumbnails/13.jpg)
Partitioning
Collect RIDs of splitting attributes from processors
Exchange RIDs
![Page 14: SPRINT : A Scalable Parallel Classifier for Data Mining](https://reader036.vdocuments.net/reader036/viewer/2022062502/56814a1b550346895db74639/html5/thumbnails/14.jpg)
Attribute lists at node 0
Age Class Rid
17 High 1
20 High 5
23 High 0
32 Low 4
43 High 2
68 Low 3
CarT Class Rid
Family High 0
Sport High 1
Sport High 2
Family Low 3
Truck Low 4
Family High 5
Split at node 0 : Age < 27.5
Attribute lists at node 1
Age Class Rid
17 High 1
20 High 5
23 High 0
CarT Class Rid
Family High 0
Sport High 1
Family High 5
Attribute lists at node 2
Age Class Rid
32 Low 4
43 High 2
68 Low 3
CarT Class Rid
Sport High 2
Family Low 3
Truck Low 4
![Page 15: SPRINT : A Scalable Parallel Classifier for Data Mining](https://reader036.vdocuments.net/reader036/viewer/2022062502/56814a1b550346895db74639/html5/thumbnails/15.jpg)
Attribute List
Age Class Rid
17 High 1
20 High 5
23 High 0
32 Low 4
43 High 2
68 Low 3
Histograms (class counts H L)
Position 0
Cbelow 0 0
Cabove 4 2
Position 3
Cbelow 3 0
Cabove 1 2
![Page 16: SPRINT : A Scalable Parallel Classifier for Data Mining](https://reader036.vdocuments.net/reader036/viewer/2022062502/56814a1b550346895db74639/html5/thumbnails/16.jpg)
Attribute List
CarT Class Rid
Family High 0
Sport High 1
Sport High 2
Family Low 3
Truck Low 4
Family High 5
Count Matrix
H L
family 2 1
sport 2 0
truck 0 1
![Page 17: SPRINT : A Scalable Parallel Classifier for Data Mining](https://reader036.vdocuments.net/reader036/viewer/2022062502/56814a1b550346895db74639/html5/thumbnails/17.jpg)
Breakdown of Response Time
![Page 18: SPRINT : A Scalable Parallel Classifier for Data Mining](https://reader036.vdocuments.net/reader036/viewer/2022062502/56814a1b550346895db74639/html5/thumbnails/18.jpg)
Scaleup of SPRINT
![Page 19: SPRINT : A Scalable Parallel Classifier for Data Mining](https://reader036.vdocuments.net/reader036/viewer/2022062502/56814a1b550346895db74639/html5/thumbnails/19.jpg)
Speedup of SPRINT
![Page 20: SPRINT : A Scalable Parallel Classifier for Data Mining](https://reader036.vdocuments.net/reader036/viewer/2022062502/56814a1b550346895db74639/html5/thumbnails/20.jpg)
Sizeup of SPRINT
![Page 21: SPRINT : A Scalable Parallel Classifier for Data Mining](https://reader036.vdocuments.net/reader036/viewer/2022062502/56814a1b550346895db74639/html5/thumbnails/21.jpg)
Example: Decision Tree
Age CarT Risk
23 Family High
17 Sports High
43 Sports High
68 Family Low
32 Truck Low
20 Family High
Decision tree
Age < 25
    yes : High
    no : CarType=sports
        yes : High
        no : Low