Instance Construction via Likelihood-Based Data Squashing
Madigan D., et al. (Ch. 12, Instance Selection and Construction for Data Mining (2001), Kluwer Academic Publishers)
Summarized by: Jinsan Yang, SNU Biointelligence Lab
Abstract
Data compression method: squashing
LDS: likelihood-based data squashing
Keywords: instance construction, data squashing
Outline
- Introduction
- The LDS Algorithm
- Evaluation: Logistic Regression
- Evaluation: Neural Networks
- Iterative LDS
- Discussion
Introduction
Massive data examples
- Large-scale retailing
- Telecommunications
- Astronomy
- Computational biology
- Internet logging
Some computational challenges
- Multiple passes over the data are needed
- Data access from disk is 10^5 to 10^6 times slower than from main memory
- Current solution: scaling up existing algorithms
- Here: scaling down the data
Data squashing: 750000 → 8443 data points (DuMouchel et al. (1999)); outperforms a random sample of size 7543 by a factor of 500 in MSE
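LDS Algorithm
Motivation: Bayes' rule
Given three data points d1, d2, d3, estimate the parameter θ:
$p(\theta \mid d_1, d_2, d_3) \propto p(d_1 \mid \theta)\, p(d_2 \mid \theta)\, p(d_3 \mid \theta)\, p(\theta)$
If $p(d_1 \mid \theta) \approx p(d_2 \mid \theta)$, squash $d_1, d_2$ by $d^*$ with
$p(d^* \mid \theta)^2 = p(d_1 \mid \theta)\, p(d_2 \mid \theta)$
Clusters by likelihood profile:
$(p(d_i \mid \theta_1), \ldots, p(d_i \mid \theta_k))$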
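The squashing condition can be checked numerically. The sketch below is only an illustration under an assumed model that is not in the slides: each data point is drawn from N(θ, 1), and the pseudo data point d* is taken to be the midpoint of d1 and d2. Under that assumption, the combined log-likelihood of d1 and d2 differs from twice the log-likelihood of d* by a constant that does not depend on θ, so the posterior for θ is unchanged.

```python
import math

def log_lik(d, theta):
    """Log-likelihood of one observation d under the assumed N(theta, 1) model."""
    return -0.5 * math.log(2 * math.pi) - 0.5 * (d - theta) ** 2

d1, d2 = 1.9, 2.1            # two points with similar likelihoods
d_star = (d1 + d2) / 2       # assumed pseudo data point (midpoint)
thetas = [-1.0, 0.0, 1.0, 2.0, 3.0]

# log p(d1|theta) + log p(d2|theta) minus 2 * log p(d*|theta), for each theta
gaps = [log_lik(d1, t) + log_lik(d2, t) - 2 * log_lik(d_star, t) for t in thetas]
print(gaps)
```

For this Gaussian example the gap works out to -(d1 - d2)^2 / 4 at every θ, which is why d* with weight 2 can stand in for the pair.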
LDS Algorithm
Details of the LDS algorithm
[Select] Select values of θ by a central composite design
(Figure: central composite design for 3 factors)
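A central composite design for 3 factors can be generated as below. This is a minimal sketch: the rotatable axial distance and the helper `to_theta`, which centers the coded points on an initial estimate and scales them, are assumptions for illustration; the slides do not say how the design is scaled.

```python
import itertools

def central_composite_design(k, alpha=None):
    """Coded design points: 2^k factorial corners, 2k axial points, one center."""
    if alpha is None:
        alpha = (2 ** k) ** 0.25   # rotatable choice of axial distance (assumption)
    corners = [list(p) for p in itertools.product([-1.0, 1.0], repeat=k)]
    axial = []
    for i in range(k):
        for sign in (-1.0, 1.0):
            point = [0.0] * k
            point[i] = sign * alpha
            axial.append(point)
    center = [[0.0] * k]
    return corners + axial + center

def to_theta(design, theta_hat, scale):
    """Hypothetical helper: place coded points around an initial estimate theta_hat."""
    return [[t + s * c for t, s, c in zip(theta_hat, scale, point)]
            for point in design]

design = central_composite_design(3)
# theta_hat and scale are made-up values for illustration only
thetas = to_theta(design, theta_hat=[0.5, -1.0, 2.0], scale=[0.1, 0.2, 0.1])
print(len(design))   # 8 corners + 6 axial points + 1 center
```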
LDS Algorithm
[Profile] Evaluate the likelihood profiles
[Cluster] Cluster the mother data in a single pass
- Select n' random samples as initial cluster centers
- Assign each of the remaining data points to a cluster
[Construct] Construct the pseudo data: cluster center
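The profile, cluster, and construct steps can be sketched as follows. Several choices here are assumptions, not details from the slides: the model is a logistic regression without intercept, points are assigned to the nearest center by Euclidean distance between log-likelihood profiles, and each pseudo data point is simply the seed point of its cluster with a weight equal to the cluster size.

```python
import math
import random

def log_lik(point, theta):
    """Log-likelihood of one (x, y) pair under an assumed logistic model."""
    x, y = point
    z = sum(t * xi for t, xi in zip(theta, x))
    return y * z - math.log1p(math.exp(z))

def profile(point, thetas):
    """[Profile] Log-likelihood of one point at each selected theta."""
    return [log_lik(point, th) for th in thetas]

def squash(data, thetas, n_prime, rng):
    # [Cluster] n' random points serve as fixed cluster centers
    seeds = rng.sample(range(len(data)), n_prime)
    centers = [profile(data[i], thetas) for i in seeds]
    weights = [1] * n_prime
    seed_set = set(seeds)
    # single pass over the mother data
    for i, point in enumerate(data):
        if i in seed_set:
            continue
        prof = profile(point, thetas)
        nearest = min(range(n_prime), key=lambda c: math.dist(prof, centers[c]))
        weights[nearest] += 1
    # [Construct] one weighted pseudo data point per cluster (the seed point)
    return [(data[i], w) for i, w in zip(seeds, weights)]

rng = random.Random(0)
# placeholder data: 200 random (x, y) pairs with two features
data = [([rng.gauss(0, 1), rng.gauss(0, 1)], rng.randint(0, 1)) for _ in range(200)]
thetas = [[0.0, 0.0], [1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
pseudo = squash(data, thetas, n_prime=20, rng=rng)
```

Keeping the centers fixed is what makes a single pass sufficient; using the cluster mean as the pseudo data point would be another possible choice.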
Evaluation: Logistic Regression
Small-scale simulations:
• Initial estimate of θ
• Plot: log(error ratio)
• Three methods of initial parameter estimation
• 100 data points / 48 squashed data points
$\log \frac{p(y=1)}{1 - p(y=1)} = \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4 + \beta_5 X_5$
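One way to fit this five-predictor logistic model to squashed data is to maximize a weighted log-likelihood, where each pseudo data point carries a weight. The sketch below is an assumption about how that could be done, not the procedure from the chapter: it uses plain gradient ascent, and the 48 rows are random placeholder values rather than real squashed data.

```python
import math
import random

def weighted_log_lik(beta, rows):
    """Sum of w * log p(y | x, beta) over rows of (x, y, w); no intercept."""
    total = 0.0
    for x, y, w in rows:
        z = sum(b * xi for b, xi in zip(beta, x))
        total += w * (y * z - math.log1p(math.exp(z)))
    return total

def fit(rows, steps=200, lr=0.5):
    """Gradient ascent on the weighted log-likelihood, averaged over total weight."""
    k = len(rows[0][0])
    beta = [0.0] * k
    total_w = sum(w for _, _, w in rows)
    for _ in range(steps):
        grad = [0.0] * k
        for x, y, w in rows:
            z = sum(b * xi for b, xi in zip(beta, x))
            p = 1.0 / (1.0 + math.exp(-z))
            for j in range(k):
                grad[j] += w * (y - p) * x[j]
        beta = [b + lr * g / total_w for b, g in zip(beta, grad)]
    return beta

rng = random.Random(0)
# placeholder rows: (X1..X5 in [-1, 1], binary y, integer weight)
rows = [([rng.uniform(-1, 1) for _ in range(5)], rng.randint(0, 1), rng.randint(1, 3))
        for _ in range(48)]
beta_hat = fit(rows)
```

A row with weight 2 contributes exactly what two identical rows with weight 1 would, which is the property squashing relies on.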
Evaluation: Logistic Regression
Medium scale: 100000 data points; baseline: 1% simple random sample
Evaluation: Logistic Regression
Large scale: 744963 data points; baseline: 1% simple random sample
Evaluation: Neural Networks
Feed-forward network: two input nodes, one hidden layer with 3 units, single binary output
Mother data: 10000; squashed data: 1000; repetitions: 30
Test data: 1000 from the same network
Comparisons of P(whole) - P(reduced)
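The comparison of P(whole) and P(reduced) can be illustrated with a small forward pass. This is a sketch under assumptions: the slides do not name the activation function, so logistic units are assumed, and both weight sets below are hypothetical stand-ins for a network fitted on the whole data and one fitted on the reduced data.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(params, x):
    """Output probability of a 2-3-1 feed-forward network.

    params = (W1, b1, w2, b2): W1 has 3 rows of 2 input weights, b1 has
    3 hidden biases, w2 has 3 output weights, b2 is the output bias.
    """
    W1, b1, w2, b2 = params
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(W1, b1)]
    return sigmoid(sum(w * h for w, h in zip(w2, hidden)) + b2)

def mean_abs_gap(params_whole, params_reduced, test_inputs):
    """Average |P(whole) - P(reduced)| over the test inputs."""
    gaps = [abs(forward(params_whole, x) - forward(params_reduced, x))
            for x in test_inputs]
    return sum(gaps) / len(gaps)

# Hypothetical weights, for illustration only
whole = ([[1.0, -1.0], [0.5, 0.5], [-1.0, 2.0]], [0.0, 0.1, -0.1],
         [1.0, -2.0, 0.5], 0.2)
reduced = ([[0.9, -1.1], [0.6, 0.4], [-1.0, 1.9]], [0.0, 0.1, -0.1],
           [1.0, -1.9, 0.5], 0.2)
test_inputs = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
gap = mean_abs_gap(whole, reduced, test_inputs)
```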
Iterative LDS
Used when the estimate of θ is not accurate.
1. Set θ from a simple random sample
2. Squash by LDS
3. Estimate θ
4. Go to 2.
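The loop above can be sketched for a one-parameter logistic model. Everything specific here is an assumption made for illustration: the model, the three design values θ - spread, θ, θ + spread, the nearest-center clustering, the fixed number of rounds (the slides give no stopping rule), and the simulated data.

```python
import math
import random

def log_lik(point, theta):
    x, y = point
    z = theta * x
    return y * z - math.log1p(math.exp(z))

def estimate(weighted, theta=0.0, steps=200):
    """Weighted maximum likelihood by gradient ascent with a curvature bound."""
    bound = sum(w * x * x for (x, _), w in weighted) / 4.0
    for _ in range(steps):
        grad = sum(w * (y - 1.0 / (1.0 + math.exp(-theta * x))) * x
                   for (x, y), w in weighted)
        theta += grad / bound
    return theta

def squash(data, thetas, n_prime, rng):
    """Single-pass clustering on log-likelihood profiles; seeds become pseudo data."""
    seeds = rng.sample(range(len(data)), n_prime)
    centers = [[log_lik(data[i], t) for t in thetas] for i in seeds]
    weights = [1] * n_prime
    seed_set = set(seeds)
    for i, pt in enumerate(data):
        if i in seed_set:
            continue
        prof = [log_lik(pt, t) for t in thetas]
        c = min(range(n_prime), key=lambda j: math.dist(prof, centers[j]))
        weights[c] += 1
    return [(data[i], w) for i, w in zip(seeds, weights)]

def iterative_lds(data, n_prime, rounds, spread, rng):
    sample = rng.sample(data, n_prime)
    theta = estimate([(pt, 1) for pt in sample])      # 1. set theta from a random sample
    for _ in range(rounds):                           # 4. go to 2 (fixed number of rounds)
        thetas = [theta - spread, theta, theta + spread]
        pseudo = squash(data, thetas, n_prime, rng)   # 2. squash by LDS
        theta = estimate(pseudo, theta)               # 3. estimate theta
    return theta, pseudo

rng = random.Random(1)
data = []
for _ in range(500):                                  # simulated data, true theta = 1
    x = rng.gauss(0, 1)
    p = 1.0 / (1.0 + math.exp(-x))
    data.append((x, 1 if rng.random() < p else 0))
theta_hat, pseudo = iterative_lds(data, n_prime=50, rounds=3, spread=0.5, rng=rng)
```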