![Page 1: Adaptive Sampling Methods for Scaling up Knowledge Discovery Algorithms, from Ch. 8 of Instance Selection and Construction for Data Mining (2001)](https://reader036.vdocuments.net/reader036/viewer/2022062518/56649f265503460f94c3def0/html5/thumbnails/1.jpg)
Adaptive Sampling Methods for Scaling up Knowledge Discovery Algorithms
From Ch. 8 of Instance Selection and Construction for Data Mining (2001)
By Carlos Domingo et al., Kluwer Academic Publishers (Summarized by Jinsan Yang, SNU Biointelligence Lab)
Abstract
Methods for large amounts of data
Adaptive sampling instead of random (batch) sampling
Keywords: Data Mining, Knowledge Discovery, Scalability, Adaptive Sampling, Concentration Bounds
Outline
Introduction
General Rule Selection Problem
Adaptive Sampling Algorithm
An Application of Adaselect: Problem and Algorithm, Experiments
Concluding Remarks
Introduction (1)
Analysis of large data:
Redesign a known algorithm
Reduce the data size
A typical task in data mining: finding or selecting some rules or laws (General Rule Selection)
General Rule Selection: by random sampling (batch sampling)
Proper sample size: by concentration bounds or deviation bounds (Chernoff, Hoeffding bounds)
Problems:
An immense sample size is needed for good accuracy and confidence
For batch sampling, the sample size must be fixed a priori at the worst-case size, which is an overestimate in most situations
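To make the batch-sampling problem concrete, here is a minimal sketch of a worst-case sample size computed from the two-sided Hoeffding bound with a union bound over the candidate models. It assumes each model's score on an example lies in [0, 1]; the function name hoeffding_sample_size and its parameters (epsilon, delta, num_models) are illustrative and not taken from the chapter.

```python
import math


def hoeffding_sample_size(epsilon, delta, num_models=1):
    """Worst-case batch sample size (illustrative helper, not from the chapter).

    Returns n such that, for num_models averages of [0, 1]-valued scores,
    every empirical average is within epsilon of its true mean with
    probability at least 1 - delta. Derived from the two-sided Hoeffding
    bound 2 * exp(-2 * n * epsilon**2) plus a union bound over the models.
    """
    if not (0 < epsilon < 1 and 0 < delta < 1 and num_models >= 1):
        raise ValueError("need 0 < epsilon < 1, 0 < delta < 1, num_models >= 1")
    return math.ceil(math.log(2 * num_models / delta) / (2 * epsilon ** 2))


# The required size grows as 1 / epsilon**2, regardless of how easy the
# actual data happens to be.
for eps in (0.1, 0.05, 0.01):
    print(eps, hoeffding_sample_size(eps, delta=0.05, num_models=1000))
```

The size depends only on epsilon, delta and the number of models, never on the data itself, which is why it tends to be an overestimate in practice.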
Introduction (2)
Overcoming these problems:
Sampling in an online, sequential fashion (one by one or block by block)
Adaptive sample sizes (adaptive sampling)
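The idea of sequential sampling with an adaptive stopping point can be sketched as follows. This is an illustrative sketch under stated assumptions, not the chapter's algorithm: scores are assumed to lie in [0, 1], the confidence radius is a Hoeffding radius with a union bound over models and steps, and the names adaptive_select, draw_example and max_samples are hypothetical. The stopping rule used here guarantees that, if every estimate is within the radius, the returned model's utility is at least (1 - epsilon) times the best utility.

```python
import math
import random


def adaptive_select(models, draw_example, epsilon, delta, max_samples=1_000_000):
    """Draw examples one by one and stop as soon as the leader is good enough.

    models: list of functions mapping an example to a score in [0, 1].
    draw_example: zero-argument function returning one random example.
    Returns (index of the selected model, number of examples used).
    """
    k = len(models)
    totals = [0.0] * k
    best = 0
    for t in range(1, max_samples + 1):
        x = draw_example()
        for i, h in enumerate(models):
            totals[i] += h(x)
        # Hoeffding radius; delta is split over models and over steps
        # (the weights delta / (t * (t + 1)) sum to delta over all t).
        radius = math.sqrt(math.log(2 * k * t * (t + 1) / delta) / (2 * t))
        best = max(range(k), key=lambda i: totals[i])
        # If all estimates are within `radius`, this condition implies
        # (u_best - radius) / (u_best + radius) >= 1 - epsilon.
        if totals[best] / t >= radius * (2 / epsilon - 1):
            return best, t
    return best, max_samples


# Toy usage: three hypothetical models with true utilities 0.2, 0.5, 0.8.
rng = random.Random(0)
toy_models = [lambda u, p=p: 1.0 if u < p else 0.0 for p in (0.2, 0.5, 0.8)]
print(adaptive_select(toy_models, rng.random, epsilon=0.2, delta=0.05))
```

The number of examples consumed depends on how large the best utility turns out to be, so easy problems stop early instead of paying the worst-case batch size.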
General Rule Selection Problem
Given data D (discrete, categorical?) and a model set H,
select a model h with the maximum value of the utility U(h) (supervised learning)
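As a small illustration of the selection problem stated above, the sketch below evaluates every model in H on all of D and returns the one with the highest utility. Accuracy is used as the utility purely as an assumption for the example; the names select_rule and accuracy, and the toy data, are hypothetical.

```python
def accuracy(model, data):
    """Fraction of labelled examples (x, y) that the model predicts correctly."""
    return sum(1 for x, y in data if model(x) == y) / len(data)


def select_rule(data, models, utility=accuracy):
    """Exhaustive rule selection: return the name of the model maximizing utility."""
    return max(models, key=lambda name: utility(models[name], data))


# Toy data D and a tiny model set H (both invented for illustration).
D = [(1, 1), (2, 1), (3, 0), (4, 0), (5, 1)]
H = {
    "x <= 2": lambda x: int(x <= 2),
    "x <= 3": lambda x: int(x <= 3),
    "always 1": lambda x: 1,
}
print(select_rule(D, H))
```

Scanning all of D for every model is exactly the cost that sampling is meant to avoid when D is very large.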
Adaptive Sampling Algorithm (1)
Extension of the Hoeffding bound
Reliability of the algorithm
[Formula lost in extraction; the surviving symbols are U(h), c, d and 0]
Adaptive Sampling Algorithm (2)
An Application of Adaselect (1)
Can be applied as a tool for the General Rule Selection problem
Example chosen: a boosting-based classification algorithm that uses a simple decision stump learner as its base learner
Decision stump: a single-split decision tree
AdaBoost performs boosting by sub-sampling or by re-weighting
Apply adaptive sampling to the base learner (boosting by filtering)
Use MadaBoost, which keeps each example's weight bounded by its initial weight
An Application of Adaselect (2)
Algorithm
Data: discrete instance vectors with labels
Classification rule: decision stump
0-1 error measure; U: utility function
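The setting above (discrete instance vectors, decision stumps, 0-1 error) can be sketched as an exhaustive stump search. This is a minimal sketch under assumptions: a stump is encoded as a hypothetical triple (attribute index, value, sign) that predicts sign when the attribute equals the value and -sign otherwise, labels are +1/-1, and the optional weights stand in for boosting weights. It scans the full data set and does not include the adaptive sampling step.

```python
def stump_predict(stump, x):
    """Predict +1 or -1 with a single-split rule on one discrete attribute."""
    attr, value, sign = stump
    return sign if x[attr] == value else -sign


def weighted_error(stump, X, y, w):
    """Weighted 0-1 error of the stump on examples X with labels y."""
    wrong = sum(wi for xi, yi, wi in zip(X, y, w) if stump_predict(stump, xi) != yi)
    return wrong / sum(w)


def best_stump(X, y, w=None):
    """Enumerate every (attribute, value, sign) stump and keep the lowest error."""
    w = w or [1.0] * len(X)
    candidates = [
        (a, v, s)
        for a in range(len(X[0]))
        for v in sorted({x[a] for x in X})
        for s in (1, -1)
    ]
    return min(candidates, key=lambda st: weighted_error(st, X, y, w))


# Toy data: the label is +1 exactly when attribute 0 has value 0.
X = [(0, 1), (0, 2), (1, 1), (1, 2), (2, 1)]
y = [1, 1, -1, -1, -1]
stump = best_stump(X, y)
print(stump, weighted_error(stump, X, y, [1.0] * len(X)))
```

In a boosting loop the weights w would change each round, so the same search returns a different stump per round.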
An Application of Adaselect (3)
Experiments
Discretize into 5 intervals and treat a missing value as an additional value
Original UCI data artificially inflated (100 copies)
Two-class problems only
10-fold cross-validation; results are averaged over 10 runs
Computer: Alpha CPU at 600 MHz, 250 MB memory, 4.3 GB hard disk, running Linux
C4.5 and a Naïve Bayes classifier for comparison
Boosting rounds: 10
Number of all possible decision stumps: |H_DS|
(the hypothesis space is the set of weighted majorities of ten depth-1 decision trees)
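The preprocessing step listed above (5 intervals, with a missing value as an additional value) might look like the sketch below. The slides do not say how the intervals are chosen, so equal-width intervals are an assumption here, as are the function name discretize and the use of None to mark a missing value.

```python
def discretize(values, bins=5):
    """Map numeric values to interval indices 0..bins-1; None maps to `bins`.

    Equal-width intervals over the observed range are an assumption of this
    sketch; the extra index `bins` is the additional value for missing data.
    """
    present = [v for v in values if v is not None]
    if not present:
        return [bins] * len(values)
    lo, hi = min(present), max(present)
    width = (hi - lo) / bins or 1.0  # avoid a zero width when all values are equal
    result = []
    for v in values:
        if v is None:
            result.append(bins)
        else:
            # The maximum value would land in index `bins`, so clamp it.
            result.append(min(int((v - lo) / width), bins - 1))
    return result


print(discretize([0.0, 2.5, None, 5.0, 10.0]))
```

After this step every attribute takes at most six discrete values, which keeps the number of candidate decision stumps small.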
An Application of Adaselect (4)
An Application of Adaselect (5)
AdaSel is faster than C4.5, particularly for large sample sizes.
Concluding Remarks
Justification and efficiency analysis
Applied in the design of a base learner for a boosting algorithm