dm lecture 4
-
8/8/2019 DM Lecture 4
Lecture 4 Data Pre-processing
Fall 2010
Dr. Tariq MAHMOOD
NUCES (FAST) KHI
-
November 25, 2010 Data Mining: Concepts and Techniques 2
Reduce data volume by choosing alternative, smaller forms of data representation
Parametric methods
Assume the data fits some model, estimate model parameters, store only the parameters, and discard the data (except possible outliers)
Example: Log-linear models obtain the value at a point in m-D space as the product on appropriate marginal subspaces
Non-parametric methods
Do not assume models
Major families: histograms, clustering, sampling
-
Linear regression: Data are modeled to fit a straight line
Often uses the least-squares method to fit the line
Multiple regression: allows a response variable Y to be modeled as a linear function of a multidimensional feature vector
Log-linear model: approximates discrete multidimensional probability distributions
-
Linear regression: Y = w X + b
Two regression coefficients, w and b, specify the line and are to be estimated by using the data at hand
Apply the least squares criterion to the known values of Y1, Y2, ..., X1, X2, ...
Multiple regression: Y = b0 + b1 X1 + b2 X2
Many nonlinear functions can be transformed into the above
Log-linear models:
The multi-way table of joint probabilities is approximated by a product of lower-order tables
Probability: p(a, b, c, d) = E_ab F_ac G_ad H_bcd
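The least squares fit of Y = w X + b can be sketched in a few lines of Python. This is a minimal illustration using the closed-form solution for one predictor; the function name fit_line and the sample data are assumptions made for the example, not part of the lecture.

```python
def fit_line(xs, ys):
    """Return (w, b) minimizing sum((y - (w*x + b))**2) over the data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (unnormalized).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)  # assumes xs are not all equal
    w = sxy / sxx
    b = mean_y - w * mean_x
    return w, b


# Illustrative data lying exactly on Y = 2X + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
w, b = fit_line(xs, ys)
print(w, b)
```

Once w and b are estimated, only these two numbers need to be stored in place of the raw (X, Y) pairs, which is the point of parametric data reduction.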
-
Divide data into buckets and store average (sum) for each bucket
Partitioning rules:
Equal-width: equal bucket range
Equal-frequency (or equal-depth): each bucket holds roughly the same number of values
V-optimal: the histogram with the least variance (weighted sum of the original values that each bucket represents)
MaxDiff: set a bucket boundary between each pair of adjacent values whose difference is among the β-1 largest, where β is the number of buckets
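The first two partitioning rules can be sketched as follows. This is a rough illustration only; the function names and the prices list are hypothetical, and the equal-width version assumes the values are not all identical.

```python
def equal_width_buckets(values, k):
    """Split the value range into k buckets of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k  # assumes hi > lo
    buckets = [[] for _ in range(k)]
    for v in values:
        # The maximum value falls on the upper edge, so clamp it into the last bucket.
        idx = min(int((v - lo) / width), k - 1)
        buckets[idx].append(v)
    return buckets


def equal_frequency_buckets(values, k):
    """Split the sorted values into k buckets of (nearly) equal size."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[i * n // k:(i + 1) * n // k] for i in range(k)]


prices = [1, 1, 5, 5, 5, 8, 10, 14, 15, 18, 20, 30]
by_width = equal_width_buckets(prices, 3)
by_freq = equal_frequency_buckets(prices, 3)

# Store only one summary (here the average) per bucket.
averages = [sum(b) / len(b) for b in by_freq]
print(by_width)
print(by_freq, averages)
```

On skewed data such as this, equal-width buckets end up very uneven in size, while equal-frequency buckets stay balanced.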
-
Partition data set into clusters based on similarity, and store cluster representation (e.g., centroid and diameter) only
Can be very effective if data is clustered but not if data is smeared
Can have hierarchical clustering and be stored in multi-dimensional index tree structures
There are many choices of clustering definitions and clustering algorithms
Cluster analysis will be studied in depth in Chapter 7
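Storing only a cluster representation might look like the sketch below. It assumes the cluster has already been found, and it takes "diameter" to mean the largest pairwise Euclidean distance, which is one common definition among several; the function name summarize and the points are illustrative.

```python
from itertools import combinations
from math import dist  # Python 3.8+


def summarize(cluster):
    """Return (centroid, diameter) for a list of equal-length point tuples."""
    dims = len(cluster[0])
    centroid = tuple(sum(p[d] for p in cluster) / len(cluster) for d in range(dims))
    # Assumption: diameter = maximum distance between any two members.
    diameter = max((dist(p, q) for p, q in combinations(cluster, 2)), default=0.0)
    return centroid, diameter


cluster = [(0.0, 0.0), (3.0, 4.0), (3.0, 0.0)]
centroid, diameter = summarize(cluster)
print(centroid, diameter)
```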
-
Sampling: obtaining a small sample s to represent the whole data set N
Allow a mining algorithm to run in complexity that is potentially sub-linear to the size of the data
Choose a representative subset of the data
Simple random sampling may have very poor performance in the presence of skew
Develop adaptive sampling methods
Stratified sampling:
Approximate the percentage of each class (or subpopulation of interest) in the overall database
Used in conjunction with skewed data
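Stratified sampling can be sketched as drawing the same fraction from every class, so that the class percentages in the sample approximate those in the full data. The function name, the key argument, and the toy records below are assumptions for illustration.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(records, key, fraction, seed=0):
    """Draw about `fraction` of each stratum, where key(record) names the stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for record in records:
        strata[key(record)].append(record)
    sample = []
    for members in strata.values():
        # Keep at least one record so that small strata are never lost.
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample


# Skewed toy data: 80 "young" records and 20 "senior" records.
records = [("young", i) for i in range(80)] + [("senior", i) for i in range(20)]
sample = stratified_sample(records, key=lambda r: r[0], fraction=0.1)
counts = Counter(label for label, _ in sample)
print(counts)
```

A simple random sample of 10 records from the same data could easily contain no "senior" records at all, which is the weakness under skew noted above.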
-
Sampling: With or Without Replacement
[Figure: Raw Data]
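The difference between the two schemes can be shown with Python's standard random module. This is a small sketch; the raw list, the sample size, and the seed are arbitrary choices for the example.

```python
import random

rng = random.Random(42)
raw = list(range(1, 11))  # stand-in for the raw data

# Without replacement: each record can be drawn at most once.
without_replacement = rng.sample(raw, 4)

# With replacement: a drawn record is put back, so duplicates are possible.
with_replacement = rng.choices(raw, k=4)

print(without_replacement, with_replacement)
```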
-
[Figure: Raw Data and Cluster/Stratified Sample]
-
Why preprocess the data?
Data cleaning
Data integration and transformation
Data reduction
Discretization and concept hierarchy generation
Summary
-
Three types of attributes:
Nominal: values from an unordered set, e.g., color, profession
Ordinal: values from an ordered set, e.g., military or academic rank
Continuous: numeric values, e.g., integer or real numbers
Discretization:
Divide the range of a continuous attribute into intervals
Some classification algorithms only accept categorical attributes
Reduce data size by discretization
Prepare for further analysis
-
Discretization
Reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals
Interval labels can then be used to replace actual data values
Supervised vs. unsupervised
Split (top-down) vs. merge (bottom-up)
Discretization can be performed recursively on an attribute
Concept hierarchy formation
Recursively reduce the data by collecting and replacing low-level concepts (such as numeric values for age) by higher-level concepts (such as young, middle-aged, or senior)
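Replacing numeric ages with higher-level concepts can be sketched with a sorted list of cut points. The cut points 35 and 60 are hypothetical values chosen only for this example; the lecture does not specify them.

```python
from bisect import bisect_right

# Hypothetical interval boundaries: [0, 35) young, [35, 60) middle-aged, 60+ senior.
CUTS = [35, 60]
LABELS = ["young", "middle-aged", "senior"]


def age_concept(age):
    """Map a numeric age to its interval label."""
    return LABELS[bisect_right(CUTS, age)]


ages = [22, 35, 47, 60, 71]
labels = [age_concept(a) for a in ages]
print(labels)
```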
-
Typical methods: All the methods can be applied recursively
Binning (covered above): top-down split, unsupervised
Histogram analysis (covered above): top-down split, unsupervised
Clustering analysis (covered above): either top-down split or bottom-up merge, unsupervised
Entropy-based discretization: supervised, top-down split
Interval merging by χ² analysis: supervised (it uses class information), bottom-up merge
Segmentation by natural partitioning: top-down split, unsupervised
-
Given a set of samples S, if S is partitioned into two intervals S1 and S2 using boundary T, the entropy after partitioning is
$$I(S, T) = \frac{|S_1|}{|S|}\,\mathrm{Entropy}(S_1) + \frac{|S_2|}{|S|}\,\mathrm{Entropy}(S_2)$$
Entropy is calculated based on the class distribution of the samples in the set. Given m classes, the entropy of S1 is
$$\mathrm{Entropy}(S_1) = -\sum_{i=1}^{m} p_i \log_2(p_i)$$
where p_i is the probability of class i in S1
The boundary that minimizes the entropy function over all possible boundaries is selected as a binary discretization
The process is recursively applied to partitions obtained until some stopping criterion is met
Such a boundary may reduce data size and improve classification accuracy
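One binary split of this procedure can be sketched as follows. The sketch assumes candidate boundaries are the midpoints between adjacent distinct values, which is a common convention but is not stated in the slides; the function names and the toy samples are illustrative.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def best_boundary(samples):
    """samples: list of (value, class_label). Return (T, I(S, T)) with minimal I."""
    ordered = sorted(samples)
    n = len(ordered)
    best = None
    for i in range(1, n):
        left_v, right_v = ordered[i - 1][0], ordered[i][0]
        if left_v == right_v:
            continue  # no boundary can separate equal values
        t = (left_v + right_v) / 2  # assumption: midpoint as candidate boundary
        s1 = [c for _, c in ordered[:i]]
        s2 = [c for _, c in ordered[i:]]
        info = len(s1) / n * entropy(s1) + len(s2) / n * entropy(s2)
        if best is None or info < best[1]:
            best = (t, info)
    return best


samples = [(21, "no"), (25, "no"), (30, "no"), (42, "yes"), (47, "yes"), (55, "yes")]
t, info = best_boundary(samples)
print(t, info)
```

Applying best_boundary again to each resulting interval, until a stopping criterion is met, gives the recursive top-down discretization described above.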
-
Merging-based (bottom-up) vs. splitting-based methods
Merge: Find the best neighboring intervals and merge them to form larger intervals recursively
ChiMerge [Kerber AAAI 1992, See also Liu et al. DMKD 2002]
Initially, each distinct value of a numerical attribute A is considered to be one interval
χ² tests are performed for every pair of adjacent intervals
Adjacent intervals with the least χ² values are merged together, since low χ² values for a pair indicate similar class distributions
This merge process proceeds recursively until a predefined stopping criterion is met (such as significance level)
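The statistic computed for a pair of adjacent intervals can be sketched as a Pearson chi-square test on a 2 x m table of class counts. This is a simplified illustration: cells with zero expected count are skipped, which is an assumption of this sketch rather than a detail taken from the ChiMerge paper, and the function names and counts are hypothetical.

```python
def chi_square(counts_a, counts_b):
    """Chi-square statistic for two adjacent intervals.

    counts_a, counts_b: per-class sample counts in each interval,
    listed in the same class order.
    """
    rows = [counts_a, counts_b]
    row_totals = [sum(r) for r in rows]
    col_totals = [a + b for a, b in zip(counts_a, counts_b)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(rows):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            if expected > 0:  # assumption: skip empty classes
                stat += (observed - expected) ** 2 / expected
    return stat


def lowest_pair(intervals):
    """Index i of the adjacent pair (i, i+1) with the smallest chi-square value."""
    scores = [chi_square(intervals[i], intervals[i + 1])
              for i in range(len(intervals) - 1)]
    return min(range(len(scores)), key=scores.__getitem__)


similar = chi_square([4, 1], [8, 2])     # same class proportions
different = chi_square([5, 0], [0, 5])   # opposite class proportions
merge_at = lowest_pair([[4, 1], [8, 2], [0, 5]])
print(similar, different, merge_at)
```

The pair with identical class proportions scores zero and would be merged first, while the pair with opposite class distributions scores high and stays separate.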
-
A simple 3-4-5 rule can be used to segment numeric data into relatively uniform, natural intervals.
If an interval covers 3, 6, 7 or 9 distinct values at the most significant digit, partition the range into 3 equi-width intervals
If it covers 2, 4, or 8 distinct values at the most significant digit, partition the range into 4 intervals
If it covers 1, 5, or 10 distinct values at the most significant digit, partition the range into 5 intervals
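One level of the rule can be sketched as below. The sketch assumes the range has already been rounded to the most significant digit (msd) and uses equal-width intervals in every case, as the rule is stated above; the function names and the example ranges are illustrative.

```python
def intervals_for(distinct_msd_values):
    """Number of intervals the 3-4-5 rule assigns to a range."""
    if distinct_msd_values in (3, 6, 7, 9):
        return 3
    if distinct_msd_values in (2, 4, 8):
        return 4
    if distinct_msd_values in (1, 5, 10):
        return 5
    raise ValueError("range does not cover 1 to 10 distinct msd values")


def partition(low, high, msd):
    """Split [low, high] into equal-width intervals chosen by the 3-4-5 rule.

    low and high are assumed to be multiples of msd already.
    """
    distinct = round((high - low) / msd)
    k = intervals_for(distinct)
    width = (high - low) / k
    return [(low + i * width, low + (i + 1) * width) for i in range(k)]


top = partition(-1000, 2000, 1000)  # 3 distinct values at the msd -> 3 intervals
sub = partition(0, 1000, 1000)      # 1 distinct value at the msd -> 5 intervals
print(top)
print(sub)
```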
-
Example: 3-4-5 rule applied to profit
Step 1: Min = -$351, Low (i.e., 5%-tile) = -$159, High (i.e., 95%-tile) = $1,838, Max = $4,700
Step 2: msd = 1,000, Low = -$1,000, High = $2,000
Step 3: (-$1,000 - $2,000) is partitioned into (-$1,000 - 0), (0 - $1,000), ($1,000 - $2,000)
Step 4: (-$400 - $5,000) is partitioned into:
(-$400 - 0): (-$400 - -$300), (-$300 - -$200), (-$200 - -$100), (-$100 - 0)
(0 - $1,000): (0 - $200), ($200 - $400), ($400 - $600), ($600 - $800), ($800 - $1,000)
($1,000 - $2,000): ($1,000 - $1,200), ($1,200 - $1,400), ($1,400 - $1,600), ($1,600 - $1,800), ($1,800 - $2,000)
($2,000 - $5,000): ($2,000 - $3,000), ($3,000 - $4,000), ($4,000 - $5,000)
-
Specification of a partial/total ordering of attributes explicitly at the schema level by users or experts
street < city < state < country
Specification of a hierarchy for a set of values by explicit data grouping
{Urbana, Champaign, Chicago} < Illinois
Specification of only a partial set of attributes
E.g., only street < city, not others
Automatic generation of hierarchies (or attribute levels) by the analysis of the number of distinct values
E.g., for a set of attributes: {street, city, state, country}
-
Some hierarchies can be automatically generated based on the analysis of the number of distinct values per attribute in the data set
The attribute with the most distinct values is placed at the lowest level of the hierarchy
Exceptions, e.g., weekday, month, quarter, year
country: 15 distinct values
province_or_state: 365 distinct values
city: 3567 distinct values
street: 674,339 distinct values
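The distinct-value heuristic can be sketched by counting distinct values per attribute and sorting. The function name and the five toy rows below are made up for illustration and are far smaller than the counts quoted above.

```python
def generate_hierarchy(rows, attributes):
    """Order attributes from top level (fewest distinct values) to lowest level."""
    counts = {a: len({row[a] for row in rows}) for a in attributes}
    ordered = sorted(attributes, key=lambda a: counts[a])
    return ordered, counts


# Hypothetical records.
rows = [
    {"street": "1 Main St", "city": "Urbana", "state": "Illinois", "country": "USA"},
    {"street": "2 Oak Ave", "city": "Urbana", "state": "Illinois", "country": "USA"},
    {"street": "3 Elm St", "city": "Chicago", "state": "Illinois", "country": "USA"},
    {"street": "4 Pine Rd", "city": "Madison", "state": "Wisconsin", "country": "USA"},
    {"street": "5 Lake Dr", "city": "Madison", "state": "Wisconsin", "country": "USA"},
]
hierarchy, counts = generate_hierarchy(rows, ["street", "city", "state", "country"])
print(hierarchy, counts)
```

The heuristic fails for the exceptions listed above: a year attribute usually has more distinct values in the data than weekday (7) or month (12), yet it belongs at the top of a time hierarchy, so the generated order should be reviewed by a user or expert.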
-
Why preprocess the data?
Data cleaning
Data integration and transformation
Data reduction
Discretization and concept hierarchy generation
Summary
-
Data preparation or preprocessing is a big issue for both data warehousing and data mining
Descriptive data summarization is needed for quality data preprocessing
Data preparation includes
Data cleaning and data integration
Data reduction and feature selection
Discretization
A lot of methods have been developed, but data preprocessing is still an active area of research
-
1. What is meant by symmetric and skewed data? [5]
2. Describe techniques for smoothing out data. [10]
3. Why is it important to carry out descriptive data summarization? Justify your response through a fictitious quantile-quantile plot. [5]
4. Why is it necessary to carry out correlation analysis? [5]
5. Describe data cube aggregation and its advantages. [5]
6. Can you suggest some change(s) to the state-of-the-art data pre-processing activity? [10]