dm lecture 4


Lecture 4: Data Pre-processing

Fall 2010

Dr. Tariq MAHMOOD

NUCES (FAST) KHI


November 25, 2010 Data Mining: Concepts and Techniques

Reduce data volume by choosing alternative, smaller forms of data representation

Parametric methods

  Assume the data fits some model, estimate the model parameters, store only the parameters, and discard the data (except possible outliers)

  Example: log-linear models obtain the value at a point in m-D space as the product over appropriate marginal subspaces

Non-parametric methods

  Do not assume models

  Major families: histograms, clustering, sampling




Linear regression: data are modeled to fit a straight line

  Often uses the least-squares method to fit the line

Multiple regression: allows a response variable Y to be modeled as a linear function of a multidimensional feature vector

Log-linear model: approximates discrete multidimensional probability distributions



Linear regression: Y = w X + b

  Two regression coefficients, w and b, specify the line and are to be estimated by using the data at hand

  Apply the least-squares criterion to the known values of Y1, Y2, ..., X1, X2, ...

Multiple regression: Y = b0 + b1 X1 + b2 X2

  Many nonlinear functions can be transformed into the above

Log-linear models:

  The multi-way table of joint probabilities is approximated by a product of lower-order tables

  Probability: $p(a, b, c, d) = \alpha_{ab}\,\beta_{ac}\,\chi_{ad}\,\delta_{bcd}$
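The least-squares estimate of w and b for Y = w X + b can be sketched as follows. This is a minimal illustration, not the lecture's own code: the function name `fit_line` and the sample data are assumptions made for the example.

```python
def fit_line(xs, ys):
    """Least-squares estimates of w and b for the model Y = w*X + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: co-variation of X and Y divided by the variation of X.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    w = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    b = mean_y - w * mean_x
    return w, b


# Hypothetical data lying exactly on Y = 2X + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
w, b = fit_line(xs, ys)
print(w, b)
```

Once w and b are stored, the raw (X, Y) pairs can be discarded, which is the point of parametric data reduction.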




Divide data into buckets and store the average (or sum) for each bucket

Partitioning rules:

  Equal-width: equal bucket range

  Equal-frequency (or equal-depth): each bucket holds roughly the same number of values

  V-optimal: the histogram with the least variance (weighted sum of the original values that each bucket represents)

  MaxDiff: place a bucket boundary between each pair of adjacent values whose difference is among the β - 1 largest, where β is the number of buckets
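The first two partitioning rules can be sketched in a few lines. This is an illustrative sketch only: the function names and the sample values are assumptions, and it assumes the values are not all identical and that there are at least k of them.

```python
def equal_width_buckets(values, k):
    """Split the value range into k equal-width buckets: (low, high, count, sum)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    buckets = [[lo + i * width, lo + (i + 1) * width, 0, 0] for i in range(k)]
    for v in values:
        # The maximum value sits on the last edge, so clamp it into the last bucket.
        i = min(int((v - lo) / width), k - 1)
        buckets[i][2] += 1
        buckets[i][3] += v
    return [tuple(b) for b in buckets]


def equal_frequency_buckets(values, k):
    """Split the sorted values into k buckets with (nearly) equal counts."""
    ordered = sorted(values)
    n = len(ordered)
    buckets = []
    for i in range(k):
        chunk = ordered[i * n // k:(i + 1) * n // k]
        buckets.append((chunk[0], chunk[-1], len(chunk), sum(chunk)))
    return buckets


# Hypothetical attribute values.
values = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
width_buckets = equal_width_buckets(values, 3)
freq_buckets = equal_frequency_buckets(values, 3)
print(width_buckets)
print(freq_buckets)
```

Only the bucket boundaries and the per-bucket count and sum are kept, so the average of each bucket can still be recovered as sum / count.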




Partition the data set into clusters based on similarity, and store only the cluster representation (e.g., centroid and diameter)

Can be very effective if the data is clustered, but not if the data is smeared

Clustering can be hierarchical, with clusters stored in multi-dimensional index tree structures

There are many choices of clustering definitions and clustering algorithms

Cluster analysis will be studied in depth in Chapter 7
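Storing a cluster representation instead of the points can be sketched as below. The cluster assignments are taken as given (no clustering algorithm is run), the function name `summarize_cluster` is hypothetical, and the diameter is assumed here to mean the largest pairwise Euclidean distance, which is only one of several definitions in use.

```python
from itertools import combinations
from math import dist  # Python 3.8+


def summarize_cluster(points):
    """Return (centroid, diameter) for one cluster of equal-length tuples."""
    dims = len(points[0])
    centroid = tuple(sum(p[d] for p in points) / len(points) for d in range(dims))
    # Assumed definition: diameter = maximum distance between any two members.
    diameter = max((dist(p, q) for p, q in combinations(points, 2)), default=0.0)
    return centroid, diameter


# Hypothetical, already clustered 2-D data.
clusters = {
    "A": [(1.0, 1.0), (1.0, 3.0), (3.0, 1.0), (3.0, 3.0)],
    "B": [(10.0, 10.0), (13.0, 14.0)],
}
summary = {name: summarize_cluster(pts) for name, pts in clusters.items()}
print(summary)
```

Each cluster shrinks to one centroid and one number, regardless of how many points it contained.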




Sampling: obtaining a small sample s to represent the whole data set N

Allow a mining algorithm to run in complexity that is potentially sub-linear to the size of the data

Choose a representative subset of the data

  Simple random sampling may have very poor performance in the presence of skew

Develop adaptive sampling methods

  Stratified sampling:

    Approximate the percentage of each class (or subpopulation of interest) in the overall database

    Used in conjunction with skewed data
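A minimal sketch of simple random sampling and stratified sampling using Python's standard `random` module. The function `stratified_sample`, the record layout, and the 10% fraction are assumptions chosen for the example; keeping at least one record per stratum is a design choice of this sketch, not something the slides prescribe.

```python
import random
from collections import defaultdict


def stratified_sample(records, key, fraction, seed=0):
    """Draw roughly `fraction` of each stratum, at least one record per stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for r in records:
        strata[key(r)].append(r)
    sample = []
    for group in strata.values():
        k = max(1, round(len(group) * fraction))
        sample.extend(rng.sample(group, k))  # without replacement inside a stratum
    return sample


# Hypothetical skewed data: 90 "young" records and 10 "senior" records.
records = [("young", i) for i in range(90)] + [("senior", i) for i in range(10)]

rng = random.Random(0)
srswor = rng.sample(records, 10)     # simple random sample without replacement
srswr = rng.choices(records, k=10)   # simple random sample with replacement
strat = stratified_sample(records, key=lambda r: r[0], fraction=0.1)
print(len(srswor), len(srswr), len(strat))
```

A simple random sample of 10 may contain no "senior" record at all, whereas the stratified sample preserves the 9:1 class proportion.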




Sampling: With or Without Replacement

[Figure: raw data and the samples drawn from it]




[Figure: raw data compared with a cluster/stratified sample]




Why preprocess the data?

Data cleaning

Data integration and transformation

Data reduction

Discretization and concept hierarchy generation

Summary




Three types of attributes:

  Nominal: values from an unordered set, e.g., color, profession

  Ordinal: values from an ordered set, e.g., military or academic rank

  Continuous: numeric values, e.g., integers or real numbers

Discretization:

  Divide the range of a continuous attribute into intervals

  Some classification algorithms only accept categorical attributes

  Reduce data size by discretization

  Prepare for further analysis




Discretization

  Reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals

  Interval labels can then be used to replace actual data values

  Supervised vs. unsupervised

  Split (top-down) vs. merge (bottom-up)

  Discretization can be performed recursively on an attribute

Concept hierarchy formation

  Recursively reduce the data by collecting and replacing low-level concepts (such as numeric values for age) with higher-level concepts (such as young, middle-aged, or senior)
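Replacing numeric ages with higher-level concepts can be sketched as a lookup against interval boundaries. The cut points 30 and 60 are hypothetical; the slides name the concepts but not where the intervals begin and end.

```python
from bisect import bisect_right

# Hypothetical cut points: below 30 is "young", 30 to 59 "middle-aged", 60+ "senior".
CUTS = [30, 60]
LABELS = ["young", "middle-aged", "senior"]


def age_to_concept(age):
    """Map a numeric age to the label of the interval that contains it."""
    return LABELS[bisect_right(CUTS, age)]


ages = [13, 29, 30, 45, 59, 60, 72]
labels = [age_to_concept(a) for a in ages]
print(labels)
```

Seven distinct numeric values collapse to three interval labels, which is the data-size reduction that discretization is meant to provide.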




Typical methods (all can be applied recursively):

  Binning (covered above): top-down split, unsupervised

  Histogram analysis (covered above): top-down split, unsupervised

  Clustering analysis (covered above): either top-down split or bottom-up merge, unsupervised

  Entropy-based discretization: supervised, top-down split

  Interval merging by χ² analysis: supervised (it compares class distributions), bottom-up merge

  Segmentation by natural partitioning: top-down split, unsupervised




Given a set of samples S, if S is partitioned into two intervals S_1 and S_2 using boundary T, the class entropy after partitioning is

$$I(S, T) = \frac{|S_1|}{|S|}\,\mathrm{Entropy}(S_1) + \frac{|S_2|}{|S|}\,\mathrm{Entropy}(S_2)$$

Entropy is calculated based on the class distribution of the samples in the set. Given m classes, the entropy of S_1 is

$$\mathrm{Entropy}(S_1) = -\sum_{i=1}^{m} p_i \log_2(p_i)$$

where p_i is the probability of class i in S_1

The boundary T that minimizes I(S, T) over all possible boundaries is selected as a binary discretization

The process is recursively applied to the partitions obtained until some stopping criterion is met

Such a boundary may reduce data size and improve classification accuracy
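One binary split of entropy-based discretization can be sketched as follows. The function names are illustrative, and taking candidate boundaries at the midpoints between adjacent distinct values is an assumption of this sketch (a common convention, not something the slide specifies).

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Entropy of a list of class labels: -sum(p_i * log2(p_i))."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def best_boundary(values, labels):
    """Return (T, I(S, T)) for the boundary T with the smallest weighted entropy."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = None
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no boundary between equal values
        # Assumed candidate: midpoint between two adjacent distinct values.
        t = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        info = len(left) / n * entropy(left) + len(right) / n * entropy(right)
        if best is None or info < best[1]:
            best = (t, info)
    return best


# Hypothetical samples: class "a" for small values, class "b" for large ones.
values = [1, 2, 3, 10, 11, 12]
labels = ["a", "a", "a", "b", "b", "b"]
boundary, info = best_boundary(values, labels)
print(boundary, info)
```

Applying `best_boundary` again to the samples on each side of the chosen boundary gives the recursive, top-down procedure; a stopping criterion (for example a minimum interval size) would have to be added.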




Merging-based (bottom-up) vs. splitting-based methods

Merge: find the best neighboring intervals and merge them to form larger intervals recursively

ChiMerge [Kerber AAAI 1992; see also Liu et al. DMKD 2002]

  Initially, each distinct value of a numerical attribute A is considered to be one interval

  χ² tests are performed for every pair of adjacent intervals

  Adjacent intervals with the least χ² values are merged together, since low χ² values for a pair indicate similar class distributions

  This merge process proceeds recursively until a predefined stopping criterion is met (such as significance level)
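The merge loop can be sketched as below. This is a simplified illustration rather than Kerber's exact procedure: the function names are hypothetical, cells with an expected count of zero are skipped, and the stopping criterion is a fixed χ² threshold (2.706, the critical value for one degree of freedom at the 0.10 significance level, which applies to two classes).

```python
from collections import Counter


def chi2(a, b, classes):
    """Pearson chi-square statistic for the class counts of two adjacent intervals."""
    total = sum(a.values()) + sum(b.values())
    stat = 0.0
    for row in (a, b):
        row_total = sum(row.values())
        for c in classes:
            expected = row_total * (a[c] + b[c]) / total
            if expected > 0:  # simplification: skip classes absent from both intervals
                stat += (row[c] - expected) ** 2 / expected
    return stat


def chimerge(values, labels, threshold):
    """Return the lower bounds of the intervals left after merging."""
    classes = sorted(set(labels))
    # Start with one interval per distinct value: [lower bound, class counts].
    counts = {}
    for v, lab in zip(values, labels):
        counts.setdefault(v, Counter())[lab] += 1
    intervals = [[v, counts[v]] for v in sorted(counts)]
    while len(intervals) > 1:
        stats = [chi2(intervals[i][1], intervals[i + 1][1], classes)
                 for i in range(len(intervals) - 1)]
        i = min(range(len(stats)), key=stats.__getitem__)
        if stats[i] >= threshold:
            break  # every adjacent pair differs significantly
        intervals[i][1] = intervals[i][1] + intervals[i + 1][1]
        del intervals[i + 1]
    return [lo for lo, _ in intervals]


# Hypothetical two-class data.
values = [1, 2, 3, 4, 10, 11, 12, 13]
labels = ["a", "a", "a", "a", "b", "b", "b", "b"]
bounds = chimerge(values, labels, threshold=2.706)
print(bounds)
```

Intervals whose class distributions are indistinguishable are merged first, so the eight initial intervals collapse to two, one per class.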




A simple 3-4-5 rule can be used to segment numeric data into relatively uniform, natural intervals.

  If an interval covers 3, 6, 7, or 9 distinct values at the most significant digit, partition the range into 3 equi-width intervals

  If it covers 2, 4, or 8 distinct values at the most significant digit, partition the range into 4 intervals

  If it covers 1, 5, or 10 distinct values at the most significant digit, partition the range into 5 intervals
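One level of the 3-4-5 rule can be sketched as a small function. The name `partition_345` is hypothetical, the sketch assumes `low` and `high` have already been rounded to the most significant digit `msd`, and it uses equal-width intervals in every case, as the rule is stated above.

```python
def partition_345(low, high, msd):
    """Split [low, high] by the 3-4-5 rule; low and high are multiples of msd."""
    distinct = (high - low) // msd  # distinct values at the most significant digit
    if distinct in (3, 6, 7, 9):
        parts = 3
    elif distinct in (2, 4, 8):
        parts = 4
    elif distinct in (1, 5, 10):
        parts = 5
    else:
        raise ValueError("range not covered by the 3-4-5 rule")
    width = (high - low) / parts
    return [(low + i * width, low + (i + 1) * width) for i in range(parts)]


top = partition_345(-1000, 2000, 1000)  # 3 distinct values -> 3 intervals
sub = partition_345(0, 1000, 1000)      # 1 distinct value  -> 5 intervals
neg = partition_345(-400, 0, 100)       # 4 distinct values -> 4 intervals
print(top)
print(sub)
print(neg)
```

Calling the function again on each resulting interval, with a smaller `msd` where appropriate, produces the next level of the hierarchy.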




Example: 3-4-5 rule applied to the attribute profit

Step 1: Min = -$351, Low (i.e., 5th percentile) = -$159, High (i.e., 95th percentile) = $1,838, Max = $4,700

Step 2: msd = 1,000, Low = -$1,000, High = $2,000

Step 3: (-$1,000 to $2,000) is partitioned into (-$1,000 to 0), (0 to $1,000), ($1,000 to $2,000)

Step 4: the top-level range becomes (-$400 to $5,000), and each of its intervals is partitioned further:

  (-$400 to 0): (-$400 to -$300), (-$300 to -$200), (-$200 to -$100), (-$100 to 0)

  (0 to $1,000): (0 to $200), ($200 to $400), ($400 to $600), ($600 to $800), ($800 to $1,000)

  ($1,000 to $2,000): ($1,000 to $1,200), ($1,200 to $1,400), ($1,400 to $1,600), ($1,600 to $1,800), ($1,800 to $2,000)

  ($2,000 to $5,000): ($2,000 to $3,000), ($3,000 to $4,000), ($4,000 to $5,000)




Specification of a partial/total ordering of attributes explicitly at the schema level by users or experts

  street < city < state < country

Specification of a hierarchy for a set of values by explicit data grouping

  {Urbana, Champaign, Chicago} < Illinois

Specification of only a partial set of attributes

  E.g., only street < city, not others

Automatic generation of hierarchies (or attribute levels) by the analysis of the number of distinct values

  E.g., for a set of attributes: {street, city, state, country}




Some hierarchies can be automatically generated based on the analysis of the number of distinct values per attribute in the data set

  The attribute with the most distinct values is placed at the lowest level of the hierarchy

  Exceptions, e.g., weekday, month, quarter, year

country: 15 distinct values

province_or_state: 365 distinct values

city: 3567 distinct values

street: 674,339 distinct values
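The distinct-value heuristic amounts to a sort. A minimal sketch, with a hypothetical function name, using the distinct-value counts from the example:

```python
def generate_hierarchy(distinct_counts):
    """Order attributes from the top level (fewest distinct values) to the lowest."""
    return sorted(distinct_counts, key=distinct_counts.get)


counts = {"street": 674339, "country": 15, "city": 3567, "province_or_state": 365}
hierarchy = generate_hierarchy(counts)
print(" > ".join(hierarchy))
```

The heuristic would order weekday (7 values) above month (12 values), which is why such attributes are listed as exceptions that still need an explicit specification.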




Why preprocess the data?

Data cleaning

Data integration and transformation

Data reduction

Discretization and concept hierarchy generation

Summary




Data preparation, or preprocessing, is a big issue for both data warehousing and data mining

Descriptive data summarization is needed for quality data preprocessing

Data preparation includes

  Data cleaning and data integration

  Data reduction and feature selection

  Discretization

Many methods have been developed, but data preprocessing is still an active area of research.



1. What is meant by symmetric and skewed data? [5]

2. Describe techniques for smoothing out data. [10]

3. Why is it important to carry out descriptive data summarization? Justify your response through a fictitious quantile-quantile plot. [5]

4. Why is it necessary to carry out correlation analysis? [5]

5. Describe data cube aggregation and its advantages. [5]

6. Can you suggest some change(s) to the state-of-the-art data pre-processing activity? [10]
