stratified k-means clustering over a deep web data source

Stratified K-means Clustering Over A Deep Web Data Source

Tantan Liu, Gagan Agrawal

Dept. of Computer Science & Engineering

Ohio State University

Aug. 14, 2012

Outline

• Introduction

– Deep Web

– Clustering on the deep web

• Stratified K-means Clustering

– Stratification

– Sample Allocation

• Conclusion

Deep Web

• Data sources hidden from the Internet

– Online query interface vs. database

– Database accessible through online interface

– Input attribute vs. output attribute

• An example of Deep Web

Data Mining over the Deep Web

• High-level summary of data

– Scenario 1: a user wants to relocate to the county.

• Summary of the residences of the county?

– Age, Price, Square Footage

– County property assessor's web site only allows simple queries

Challenges

• Databases cannot be accessed directly

– Sampling method for deep web mining

• Obtaining data is time-consuming

– Efficient sampling method

– High accuracy with low sampling cost

An Example of Deep Web for Real-Estate

k-means clustering over a deep web data source

• Goal: Estimating k centers for the underlying clusters, so that the estimated k centers based on the sample are close to the k true centers in the whole population.

Overview of Method

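[Figure: the population is stratified into Sub-population 1, Sub-population 2, …, Sub-population n; sample allocation draws Sample 1, Sample 2, …, Sample n from them; the combined sample is passed to stratification-based k-means clustering, which produces the clusters.]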

Stratification on the deep web

• Partitioning the entire population into strata

– Stratifies on the query space of input attributes

– Goal: homogeneous query subspaces

– Radius of query subspace:

– Rule: choose the input attribute that most decreases the radius of a node

– For an input attribute , decrease of radius:

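[Figure: stratification tree. The root node (NULL) is split on Year of construction (Y=1980, Y=1990, Y=2000, Y=2008); a node is further split on Bedroom (B=3, B=4), and so on.]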
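The splitting rule above can be sketched in a few lines. The slide's radius formula did not survive, so this sketch assumes a stand-in: the radius of a node is the mean Euclidean distance of the pilot-sample output vectors to their mean, and the decrease for an attribute is the parent radius minus the size-weighted average radius of the children. The names `radius`, `radius_decrease`, `best_split` and the `pilot` records are hypothetical.

```python
import math
from collections import defaultdict

def radius(points):
    """Assumed radius: mean Euclidean distance of the points to their mean vector."""
    if not points:
        return 0.0
    dim = len(points[0])
    center = [sum(p[d] for p in points) / len(points) for d in range(dim)]
    return sum(math.dist(p, center) for p in points) / len(points)

def radius_decrease(records, attr):
    """Parent radius minus the size-weighted average radius of the child nodes
    obtained by splitting the records on input attribute `attr`."""
    children = defaultdict(list)
    for inputs, output in records:
        children[inputs[attr]].append(output)
    child_avg = sum(len(c) * radius(c) for c in children.values()) / len(records)
    return radius([output for _, output in records]) - child_avg

def best_split(records, attrs):
    """Pick the input attribute with the largest decrease of radius."""
    return max(attrs, key=lambda a: radius_decrease(records, a))

# Hypothetical pilot sample: (input attributes, output vector (price, square feet)).
pilot = [
    ({"year": 1980, "bedrooms": 3}, (100.0, 1000.0)),
    ({"year": 1980, "bedrooms": 4}, (110.0, 1100.0)),
    ({"year": 2008, "bedrooms": 3}, (300.0, 2000.0)),
    ({"year": 2008, "bedrooms": 4}, (310.0, 2100.0)),
]
chosen = best_split(pilot, ["year", "bedrooms"])
print(chosen)
```

In this toy data the outputs separate by year of construction but not by bedroom count, so splitting on the year gives the larger decrease.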

Partition on Space of Output Attributes

[Figure: partition of the output-attribute space, Price vs. Square Feet, by year of construction (1980, 1990, 2000, 2008).]

Sampling Allocation Methods

• We have created c*k partitions and c*k subspaces

– A pilot sample

– c*k-means clustering generates c*k partitions

• Representative sampling

– Good estimation of the statistics of the c*k subspaces

• Centers

• Proportions

Representative Sampling-Centers

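• Center of a subspace

– Mean vector of all data points belonging to the subspace

• Let sample S = {DR1, DR2, …, DRn}

– For the i-th subspace, the center is estimated from the sample:

[Equation: estimated center c_i of the i-th subspace, computed from the outputs O(DR_m) of the sampled records]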

Distance Function

• For c*k estimated centers with true centers

• Using Euclidean Distance

– Integrated variance

• Computed based on pilot sample

– : number of samples drawn from the j-th stratum

Optimized Sample Allocation

• Goal:

• Using Lagrange multipliers:

• More samples are drawn from strata with large variance

• Data are spread over a wide area, and more data are needed to represent the population
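A minimal sketch of what such an allocation could look like. The slide's goal and solution formulas did not survive, so this assumes the distance function has the form sum_j V_j / n_j, where V_j is the integrated variance of stratum j and n_j its sample count, minimized subject to sum_j n_j = n. Under that assumption the Lagrange-multiplier solution is n_j proportional to sqrt(V_j). The function names are hypothetical.

```python
import math

def optimized_allocation(total, variances):
    """Real-valued allocation n_j = total * sqrt(V_j) / sum_l sqrt(V_l).
    Assumes the objective sum_j V_j / n_j with sum_j n_j = total."""
    roots = [math.sqrt(v) for v in variances]
    norm = sum(roots)
    if norm == 0:
        # No variance anywhere: fall back to an equal split.
        return [total / len(variances)] * len(variances)
    return [total * r / norm for r in roots]

def expected_distance(variances, allocation):
    """Value of the assumed objective sum_j V_j / n_j for a given allocation."""
    return sum(v / n for v, n in zip(variances, allocation) if v > 0)

variances = [1.0, 4.0, 9.0]           # hypothetical integrated variances
alloc = optimized_allocation(60, variances)
equal = [20, 20, 20]
print(alloc, expected_distance(variances, alloc), expected_distance(variances, equal))
```

With these made-up variances the stratum with the largest variance receives half of the budget, and the assumed objective is lower than under an equal split.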

Active Learning based sampling Method

• In machine learning

– Passive learning: data are randomly chosen

– Active learning

• Certain data are selected to help build a better model

• Obtaining data is costly and/or time-consuming

• Choosing stratum i, the estimated decrease of distance function is

• Iterative sampling process

– At each iteration, the stratum with the largest decrease of the distance function is selected for sampling

– Integrated variance is updated
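The iterative process can be sketched as a greedy loop. The slide's formula for the estimated decrease is missing, so this assumes each stratum contributes V_j / n_j to the distance function, making the estimated decrease of one more draw V_j / n_j - V_j / (n_j + 1). The `update_variance` callback is a hypothetical hook standing in for drawing a record and re-estimating the integrated variance.

```python
def estimated_decrease(variance, n):
    """Assumed drop in V/n when one more record is drawn from a stratum
    that already has n >= 1 samples."""
    return variance / n - variance / (n + 1)

def active_sampling(initial_counts, variances, budget, update_variance=None):
    """Greedy loop: at each iteration, sample the stratum whose next draw
    gives the largest estimated decrease, then refresh its variance."""
    counts = list(initial_counts)
    variances = list(variances)
    for _ in range(budget):
        j = max(range(len(counts)),
                key=lambda s: estimated_decrease(variances[s], counts[s]))
        counts[j] += 1
        if update_variance is not None:
            # Hypothetical hook: returns the re-estimated variance of stratum j.
            variances[j] = update_variance(j, variances[j])
    return counts

# One pilot record per stratum, nine further draws, fixed made-up variances.
counts = active_sampling([1, 1, 1], [1.0, 4.0, 9.0], budget=9)
print(counts)
```

With fixed variances the greedy choice ends up favouring high-variance strata, in line with the allocation rule on the previous slide; the difference is that the variance estimates can be revised as new records arrive.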

Representative Sampling-Proportion

• Proportion of a sub-space:

– Fraction of data records belonging to the sub-space

– Depends on proportion of the sub-space in each stratum

• In j-th stratum,

• Risk function

– Distance between estimated fractions and their true values

• Iterative sampling process

– At each iteration, the stratum with the largest decrease of the risk function is chosen for sampling

– Parameters are updated
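How the overall proportion depends on the per-stratum proportions can be illustrated with the standard stratified estimator, used here as an assumption since the slide's own formula is missing: the proportion of sub-space i is the sum over strata of (N_j / N) times the fraction of stratum j's sample that falls in sub-space i. The stratum sizes N_j are assumed to be known, and `stratified_proportions` is a hypothetical name.

```python
def stratified_proportions(stratum_sizes, sample_labels, num_subspaces):
    """Estimate the proportion of each sub-space over the whole population.

    stratum_sizes[j]  : assumed-known population size N_j of stratum j
    sample_labels[j]  : sub-space index of each record sampled from stratum j
    """
    total = sum(stratum_sizes)
    proportions = [0.0] * num_subspaces
    for size, labels in zip(stratum_sizes, sample_labels):
        if not labels:
            continue  # no sample from this stratum yet
        for i in range(num_subspaces):
            within = labels.count(i) / len(labels)   # proportion inside the stratum
            proportions[i] += (size / total) * within
    return proportions

sizes = [6000, 2000]                 # hypothetical stratum sizes
labels = [[0, 0, 0, 1], [1, 1]]      # sub-space of each sampled record
props = stratified_proportions(sizes, labels, 2)
print(props)
```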

Stratified K-means Clustering

• Weight for data records in i-th stratum

– , : size of population, : size of sample

• Similar to k-means clustering

– Center for i-th cluster
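A minimal sketch of a weighted k-means step along these lines. The weight formula is missing from the slide, so this assumes the weight of a record from stratum i is N_i / n_i (population size over sample size), and that each cluster center is the weighted mean of the records assigned to it. The names `stratum_weights` and `stratified_kmeans` and the data are hypothetical.

```python
import math
from collections import Counter

def stratum_weights(stratum_sizes, sample_strata):
    """Assumed weight N_i / n_i for every sampled record, by its stratum."""
    counts = Counter(sample_strata)
    return [stratum_sizes[s] / counts[s] for s in sample_strata]

def stratified_kmeans(points, weights, centers, iterations=20):
    """Lloyd-style iterations where each record counts `weight` times."""
    centers = [tuple(c) for c in centers]
    dim = len(centers[0])
    for _ in range(iterations):
        sums = [[0.0] * dim for _ in centers]
        totals = [0.0] * len(centers)
        for p, w in zip(points, weights):
            # Assign the record to its nearest center.
            k = min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
            totals[k] += w
            for d in range(dim):
                sums[k][d] += w * p[d]
        new_centers = []
        for k, old in enumerate(centers):
            if totals[k] > 0:
                new_centers.append(tuple(s / totals[k] for s in sums[k]))
            else:
                new_centers.append(old)  # keep an empty cluster's center
        centers = new_centers
    return centers

points = [(0.0, 0.0), (4.0, 0.0), (10.0, 10.0), (14.0, 10.0)]
strata = [0, 1, 0, 1]                            # stratum of each sampled record
weights = stratum_weights([600, 100], strata)    # 300 for stratum 0, 50 for stratum 1
centers = stratified_kmeans(points, weights, [(0.0, 0.0), (10.0, 10.0)])
print(centers)
```

Because records from the larger stratum carry more weight, each center is pulled toward them rather than sitting at the unweighted midpoint of its cluster.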

Experiment Result

• Data set:

– Yahoo! data set:

• Data on used cars

• 8,000 data records

• Average Distance

Representative Sampling-Yahoo! Data set

• Benefit of stratification

– Compared with rand, the decreases in AvgDist are 7.2%, 13.2%, 15.0% and 16.8%

• Benefit of representative sampling

– Compared with rand_st, the decreases in AvgDist are 6.6%, 8.5%, 10.5%

• Center-based sampling methods perform better

• The optimized sampling method performs better in the long run

Conclusion

• Clustering over a deep web data source is challenging

• A stratified k-means clustering method over the deep web

• Representative sampling

– Centers

– Proportions

• The experimental results show the efficiency of our work
