![Page 1: Descriptive Modeling](https://reader035.vdocuments.net/reader035/viewer/2022081417/568145e9550346895db2ea9b/html5/thumbnails/1.jpg)
Descriptive Modeling
Based in part on Chapter 9 of Hand, Mannila, & Smyth
And Section 14.3 of HTF
David Madigan
![Page 2: Descriptive Modeling](https://reader035.vdocuments.net/reader035/viewer/2022081417/568145e9550346895db2ea9b/html5/thumbnails/2.jpg)
What is a descriptive model?
•“presents the main features of the data”
•“a summary of the data”
•Data randomly generated from a “good” descriptive model will have the same characteristics as the real data
•Chapter focuses on techniques and algorithms for fitting descriptive models to data
![Page 3: Descriptive Modeling](https://reader035.vdocuments.net/reader035/viewer/2022081417/568145e9550346895db2ea9b/html5/thumbnails/3.jpg)
Estimating Probability Densities
•parametric versus non-parametric
•log-likelihood is a common score function:
•Fails to penalize complexity
•Common alternatives:
Log-likelihood: $S_L(\theta) = \sum_{i=1}^{n} \log p(x_i;\theta)$

BIC: $S_{BIC}(M_k) = -2\, S_L(\hat{\theta}_k; M_k) + d_k \log n$

Validation log-likelihood: $S_V(M_k) = \sum_{x \in D_v} \log \hat{p}(x \mid M_k)$
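As a concrete (and purely illustrative) sketch of these scores, the following Python snippet fits a univariate Gaussian by maximum likelihood and computes $S_L$, $S_{BIC}$, and the held-out validation score $S_V$; the synthetic data, sample sizes, and the choice of a two-parameter Gaussian are assumptions made for the example, not part of the slides.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
train = rng.normal(loc=5.0, scale=2.0, size=200)   # data used to fit the model
valid = rng.normal(loc=5.0, scale=2.0, size=100)   # held-out validation set D_v

# Maximum-likelihood fit of a univariate Gaussian (d_k = 2 parameters)
mu_hat, sigma_hat = train.mean(), train.std()

# Log-likelihood score S_L(theta) = sum_i log p(x_i; theta)
S_L = norm.logpdf(train, loc=mu_hat, scale=sigma_hat).sum()

# BIC penalizes complexity: S_BIC = -2 * S_L + d_k * log(n)
d_k, n = 2, len(train)
S_BIC = -2.0 * S_L + d_k * np.log(n)

# Validation log-likelihood: score the fitted density on held-out data
S_V = norm.logpdf(valid, loc=mu_hat, scale=sigma_hat).sum()

print(f"S_L = {S_L:.1f}, S_BIC = {S_BIC:.1f}, S_V = {S_V:.1f}")
```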
![Page 4: Descriptive Modeling](https://reader035.vdocuments.net/reader035/viewer/2022081417/568145e9550346895db2ea9b/html5/thumbnails/4.jpg)
Parametric Density Models
•Multivariate normal
•For large p, number of parameters dominated by the covariance matrix
•Assume Σ = I?
•Graphical Gaussian Models
•Graphical models for categorical data
![Page 5: Descriptive Modeling](https://reader035.vdocuments.net/reader035/viewer/2022081417/568145e9550346895db2ea9b/html5/thumbnails/5.jpg)
Mixture Models
A two-component mixture of Poisson counts: $f(x) = p\,\frac{e^{-\lambda_1}\lambda_1^{x}}{x!} + (1-p)\,\frac{e^{-\lambda_2}\lambda_2^{x}}{x!}$
“Two-stage model”
$f(x) = \sum_{k=1}^{K} \pi_k\, f_k(x;\theta_k)$
![Page 6: Descriptive Modeling](https://reader035.vdocuments.net/reader035/viewer/2022081417/568145e9550346895db2ea9b/html5/thumbnails/6.jpg)
Mixture Models and EM
•No closed-form for MLE’s
•EM widely used - flip-flop between estimating the parameters assuming the class (mixture component) memberships are known, and estimating class membership given the parameters.
•Time complexity O(Kp2n); space complexity O(Kn)
•Can be slow to converge; local maxima
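As one concrete illustration of this flip-flop (not from the slides), scikit-learn's `GaussianMixture` runs the same EM loop; the two-component synthetic data and all parameter settings below are assumptions for the example.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
# Synthetic data: two Gaussian clusters in p = 2 dimensions
X = np.vstack([rng.normal(0.0, 1.0, size=(300, 2)),
               rng.normal(5.0, 1.0, size=(200, 2))])

gmm = GaussianMixture(n_components=2, covariance_type="full",
                      max_iter=200, n_init=5, random_state=0)
gmm.fit(X)                       # EM: alternate membership (E) and parameter (M) steps

print(gmm.weights_)              # mixing proportions pi_k
print(gmm.means_)                # component means
print(gmm.predict_proba(X[:5]))  # soft class-membership probabilities (E-step output)
print(gmm.bic(X))                # BIC score, useful when choosing K
```

Multiple restarts (`n_init`) are one common way to reduce the risk of stopping at a poor local maximum.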
![Page 7: Descriptive Modeling](https://reader035.vdocuments.net/reader035/viewer/2022081417/568145e9550346895db2ea9b/html5/thumbnails/7.jpg)
Mixture-model example
Market basket: $x_j(i) = 1$ if person $i$ purchased item $j$, and 0 otherwise.

For cluster k, item j: $p(x_j;\theta_{kj}) = \theta_{kj}^{x_j}(1-\theta_{kj})^{1-x_j}$

Thus for person i: $p(x(i)) = \sum_{k=1}^{K} \pi_k \prod_j \theta_{kj}^{x_j(i)}(1-\theta_{kj})^{1-x_j(i)}$

Probability that person i is in cluster k (E-step): $p(k \mid i) = \frac{\pi_k \prod_j \theta_{kj}^{x_j(i)}(1-\theta_{kj})^{1-x_j(i)}}{p(x(i))}$

Update within-cluster parameters (M-step): $\theta_{kj}^{\text{new}} = \frac{\sum_{i=1}^{n} p(k \mid i)\, x_j(i)}{\sum_{i=1}^{n} p(k \mid i)}$
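A minimal sketch of these E- and M-steps for binary basket data, assuming independent Bernoulli items within each cluster; the synthetic purchase matrix, the random initialization, and the clipping constants are illustrative choices, not part of the slide.

```python
import numpy as np

def bernoulli_mixture_em(X, K, n_iter=100, seed=0):
    """EM for a mixture of independent Bernoullis (market-basket clustering).

    X: (n, p) binary matrix with X[i, j] = 1 if person i purchased item j.
    Returns mixing weights pi (K,), item probabilities theta (K, p),
    and responsibilities p(k|i) of shape (n, K).
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    pi = np.full(K, 1.0 / K)
    theta = rng.uniform(0.25, 0.75, size=(K, p))    # keep away from 0/1 so logs stay finite

    for _ in range(n_iter):
        # E-step: log p(x(i) | k) = sum_j [x_ij log theta_kj + (1 - x_ij) log(1 - theta_kj)]
        log_lik = X @ np.log(theta).T + (1 - X) @ np.log(1 - theta).T   # (n, K)
        log_post = np.log(pi) + log_lik
        log_post -= log_post.max(axis=1, keepdims=True)                 # numerical stabilization
        resp = np.exp(log_post)
        resp /= resp.sum(axis=1, keepdims=True)                         # p(k | i)

        # M-step: theta_kj_new = sum_i p(k|i) x_j(i) / sum_i p(k|i); pi_k = average responsibility
        Nk = resp.sum(axis=0)
        theta = np.clip((resp.T @ X) / Nk[:, None], 1e-6, 1 - 1e-6)
        pi = Nk / n

    return pi, theta, resp

# Tiny synthetic example: two purchasing "styles" over 6 items
rng = np.random.default_rng(42)
true_theta = np.array([[0.9, 0.8, 0.7, 0.1, 0.1, 0.1],
                       [0.1, 0.1, 0.2, 0.8, 0.9, 0.7]])
z = rng.integers(0, 2, size=500)
X = (rng.uniform(size=(500, 6)) < true_theta[z]).astype(float)
pi, theta, resp = bernoulli_mixture_em(X, K=2)
print(np.round(theta, 2))
```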
![Page 8: Descriptive Modeling](https://reader035.vdocuments.net/reader035/viewer/2022081417/568145e9550346895db2ea9b/html5/thumbnails/8.jpg)
Fraley and Raftery (2000)
![Page 9: Descriptive Modeling](https://reader035.vdocuments.net/reader035/viewer/2022081417/568145e9550346895db2ea9b/html5/thumbnails/9.jpg)
Non-parametric density estimation
•Doesn’t scale very well - Silverman’s example
•Note that for Gaussian-type kernels estimating f(x) for some x involves summing over contributions from all n points in the dataset
![Page 10: Descriptive Modeling](https://reader035.vdocuments.net/reader035/viewer/2022081417/568145e9550346895db2ea9b/html5/thumbnails/10.jpg)
What is Cluster Analysis?
• Cluster: a collection of data objects
  – Similar to one another within the same cluster
  – Dissimilar to the objects in other clusters
• Cluster analysis
  – Grouping a set of data objects into clusters
• Clustering is unsupervised classification: no predefined classes
• Typical applications
  – As a stand-alone tool to get insight into data distribution
  – As a preprocessing step for other algorithms
![Page 11: Descriptive Modeling](https://reader035.vdocuments.net/reader035/viewer/2022081417/568145e9550346895db2ea9b/html5/thumbnails/11.jpg)
General Applications of Clustering
• Pattern Recognition
• Spatial Data Analysis
  – create thematic maps in GIS by clustering feature spaces
  – detect spatial clusters and explain them in spatial data mining
• Image Processing
• Economic Science (especially market research)
• WWW
  – Document classification
  – Cluster Weblog data to discover groups of similar access patterns
![Page 12: Descriptive Modeling](https://reader035.vdocuments.net/reader035/viewer/2022081417/568145e9550346895db2ea9b/html5/thumbnails/12.jpg)
Examples of Clustering Applications
• Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
• Land use: Identification of areas of similar land use in an earth observation database
• Insurance: Identifying groups of motor insurance policy holders with a high average claim cost
• City-planning: Identifying groups of houses according to their house type, value, and geographical location
• Earthquake studies: Observed earthquake epicenters should be clustered along continent faults
![Page 13: Descriptive Modeling](https://reader035.vdocuments.net/reader035/viewer/2022081417/568145e9550346895db2ea9b/html5/thumbnails/13.jpg)
What Is Good Clustering?
• A good clustering method will produce high quality clusters with
  – high intra-class similarity
  – low inter-class similarity
• The quality of a clustering result depends on both the similarity measure used by the method and its implementation.
• The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns.
![Page 14: Descriptive Modeling](https://reader035.vdocuments.net/reader035/viewer/2022081417/568145e9550346895db2ea9b/html5/thumbnails/14.jpg)
Requirements of Clustering in Data Mining
• Scalability
• Ability to deal with different types of attributes
• Discovery of clusters with arbitrary shape
• Minimal requirements for domain knowledge to determine input parameters
• Able to deal with noise and outliers
• High dimensionality
• Interpretability and usability
![Page 15: Descriptive Modeling](https://reader035.vdocuments.net/reader035/viewer/2022081417/568145e9550346895db2ea9b/html5/thumbnails/15.jpg)
Measure the Quality of Clustering
• Dissimilarity/Similarity metric: Similarity is expressed in terms of a distance function, which is typically metric: d(i, j)
• There is a separate “quality” function that measures the “goodness” of a cluster.
• The definitions of distance functions are usually very different for interval-scaled, boolean, categorical, and ordinal variables.
• Weights should be associated with different variables based on applications and data semantics.
• It is hard to define “similar enough” or “good enough” – the answer is typically highly subjective.
![Page 16: Descriptive Modeling](https://reader035.vdocuments.net/reader035/viewer/2022081417/568145e9550346895db2ea9b/html5/thumbnails/16.jpg)
Major Clustering Approaches
• Partitioning algorithms: Construct various partitions and then evaluate them by some criterion
• Hierarchy algorithms: Create a hierarchical decomposition of the set of data (or objects) using some criterion
• Density-based: based on connectivity and density functions
• Grid-based: based on a multiple-level granularity structure
• Model-based: A model is hypothesized for each of the clusters, and the idea is to find the best fit of that model to the data
![Page 17: Descriptive Modeling](https://reader035.vdocuments.net/reader035/viewer/2022081417/568145e9550346895db2ea9b/html5/thumbnails/17.jpg)
Partitioning Algorithms: Basic Concept
• Partitioning method: Construct a partition of a database D of n objects into a set of k clusters
• Given k, find a partition of k clusters that optimizes the chosen partitioning criterion
  – Global optimum: exhaustively enumerate all partitions
  – Heuristic methods: k-means and k-medoids algorithms
  – k-means (MacQueen'67): Each cluster is represented by the center of the cluster
  – k-medoids or PAM (Partitioning Around Medoids) (Kaufman & Rousseeuw'87): Each cluster is represented by one of the objects in the cluster
![Page 18: Descriptive Modeling](https://reader035.vdocuments.net/reader035/viewer/2022081417/568145e9550346895db2ea9b/html5/thumbnails/18.jpg)
The K-Means Algorithm
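The slide presents the algorithm itself; as a stand-in, here is a minimal NumPy sketch of the assign/update loop (an illustration under the usual Euclidean-distance assumption, not the slide's own pseudocode).

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal K-means: alternate the assignment step and the mean-update step."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # arbitrarily chosen initial centers
    for _ in range(n_iter):
        # Assignment step: each object goes to its most similar (closest) center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: recompute each cluster mean (keep old center if a cluster empties)
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):                 # assignments have stabilized
            break
        centers = new_centers
    return centers, labels
```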
![Page 19: Descriptive Modeling](https://reader035.vdocuments.net/reader035/viewer/2022081417/568145e9550346895db2ea9b/html5/thumbnails/19.jpg)
The K-Means Clustering Method: Example

[Figure: five scatter plots (axes 0-10) showing one run with K = 2: arbitrarily choose K objects as the initial cluster centers; assign each object to the most similar center; update the cluster means; reassign objects; update the means again, repeating until the assignments stop changing.]
![Page 20: Descriptive Modeling](https://reader035.vdocuments.net/reader035/viewer/2022081417/568145e9550346895db2ea9b/html5/thumbnails/20.jpg)
![Page 21: Descriptive Modeling](https://reader035.vdocuments.net/reader035/viewer/2022081417/568145e9550346895db2ea9b/html5/thumbnails/21.jpg)
Comments on the K-Means Method
• Strength: Relatively efficient: O(tkn), where n is # objects, k is # clusters, and t is # iterations. Normally, k, t << n.
• Comparing: PAM: O(k(n-k)^2), CLARA: O(ks^2 + k(n-k))
• Comment: Often terminates at a local optimum. The global optimum may be found using techniques such as: deterministic annealing and genetic algorithms
• Weakness– Applicable only when mean is defined, then what about
categorical data?
– Need to specify k, the number of clusters, in advance
– Unable to handle noisy data and outliers
– Not suitable to discover clusters with non-convex shapes
![Page 22: Descriptive Modeling](https://reader035.vdocuments.net/reader035/viewer/2022081417/568145e9550346895db2ea9b/html5/thumbnails/22.jpg)
Variations of the K-Means Method
• A few variants of the k-means which differ in
– Selection of the initial k means
– Dissimilarity calculations
– Strategies to calculate cluster means
• Handling categorical data: k-modes (Huang’98)
– Replacing means of clusters with modes
– Using new dissimilarity measures to deal with categorical objects
– Using a frequency-based method to update modes of clusters
– A mixture of categorical and numerical data: k-prototype method
![Page 23: Descriptive Modeling](https://reader035.vdocuments.net/reader035/viewer/2022081417/568145e9550346895db2ea9b/html5/thumbnails/23.jpg)
![Page 24: Descriptive Modeling](https://reader035.vdocuments.net/reader035/viewer/2022081417/568145e9550346895db2ea9b/html5/thumbnails/24.jpg)
![Page 25: Descriptive Modeling](https://reader035.vdocuments.net/reader035/viewer/2022081417/568145e9550346895db2ea9b/html5/thumbnails/25.jpg)
What is the problem with the k-means method?

• The k-means algorithm is sensitive to outliers!
  – An object with an extremely large value may substantially distort the distribution of the data.
• K-Medoids: Instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used, which is the most centrally located object in a cluster.
![Page 26: Descriptive Modeling](https://reader035.vdocuments.net/reader035/viewer/2022081417/568145e9550346895db2ea9b/html5/thumbnails/26.jpg)
The K-Medoids Clustering Method
• Find representative objects, called medoids, in
clusters
• PAM (Partitioning Around Medoids, 1987)
– starts from an initial set of medoids and iteratively replaces
one of the medoids by one of the non-medoids if it improves
the total distance of the resulting clustering
– PAM works effectively for small data sets, but does not scale
well for large data sets
• CLARA (Kaufmann & Rousseeuw, 1990)
• CLARANS (Ng & Han, 1994): Randomized sampling
• Focusing + spatial data structure (Ester et al., 1995)
![Page 27: Descriptive Modeling](https://reader035.vdocuments.net/reader035/viewer/2022081417/568145e9550346895db2ea9b/html5/thumbnails/27.jpg)
Typical k-medoids algorithm (PAM)

[Figure: K = 2 example (axes 0-10): arbitrarily choose k objects as the initial medoids; assign each remaining object to the nearest medoid (total cost = 20); randomly select a non-medoid object O_random; compute the total cost of swapping (total cost = 26); swap O and O_random if the quality is improved; loop until no change.]
![Page 28: Descriptive Modeling](https://reader035.vdocuments.net/reader035/viewer/2022081417/568145e9550346895db2ea9b/html5/thumbnails/28.jpg)
PAM (Partitioning Around Medoids) (1987)

• PAM (Kaufman and Rousseeuw, 1987), built into S-Plus
• Uses real objects to represent the clusters
  – Select k representative objects arbitrarily
  – For each pair of non-selected object h and selected object i, calculate the total swapping cost TC_ih
  – For each pair of i and h,
    • If TC_ih < 0, i is replaced by h
    • Then assign each non-selected object to the most similar representative object
  – Repeat steps 2-3 until there is no change
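A compact sketch of the swap-based search, assuming a precomputed distance matrix: it accepts any medoid/non-medoid swap that lowers the total distance of objects to their nearest medoid, mirroring the TC_ih < 0 rule above (illustrative code, not the original PAM implementation).

```python
import numpy as np
from itertools import product

def pam(D, k, seed=0):
    """Minimal PAM-style k-medoids on a precomputed n x n distance matrix D."""
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    medoids = list(rng.choice(n, size=k, replace=False))     # select k representatives arbitrarily

    def total_cost(meds):
        # Sum of distances of every object to its nearest medoid
        return D[:, meds].min(axis=1).sum()

    improved = True
    while improved:                                           # repeat until there is no change
        improved = False
        for i, h in product(list(medoids), range(n)):         # all (selected i, non-selected h) pairs
            if h in medoids:
                continue
            candidate = [h if m == i else m for m in medoids]
            if total_cost(candidate) < total_cost(medoids):   # swap only if it improves quality
                medoids, improved = candidate, True
    labels = D[:, medoids].argmin(axis=1)                     # assign each object to nearest medoid
    return medoids, labels
```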
![Page 29: Descriptive Modeling](https://reader035.vdocuments.net/reader035/viewer/2022081417/568145e9550346895db2ea9b/html5/thumbnails/29.jpg)
PAM Clustering: Total swapping cost $TC_{ih} = \sum_j C_{jih}$

[Figure: four cases for the reassignment cost of a non-selected object j when medoid i is swapped with non-medoid h (i and t are the current medoids):
  – $C_{jih} = 0$
  – $C_{jih} = d(j, h) - d(j, i)$
  – $C_{jih} = d(j, t) - d(j, i)$
  – $C_{jih} = d(j, h) - d(j, t)$]
![Page 30: Descriptive Modeling](https://reader035.vdocuments.net/reader035/viewer/2022081417/568145e9550346895db2ea9b/html5/thumbnails/30.jpg)
What is the problem with PAM?

• PAM is more robust than k-means in the presence of noise and outliers because a medoid is less influenced by outliers or other extreme values than a mean
• PAM works efficiently for small data sets but does not scale well to large data sets
  – O(k(n-k)^2) per iteration, where n is the number of data points and k is the number of clusters
• Sampling-based method: CLARA (Clustering LARge Applications)
![Page 31: Descriptive Modeling](https://reader035.vdocuments.net/reader035/viewer/2022081417/568145e9550346895db2ea9b/html5/thumbnails/31.jpg)
CLARA (Clustering Large Applications) (1990)
• CLARA (Kaufmann and Rousseeuw in 1990)
– Built in statistical analysis packages, such as R
• It draws multiple samples of the data set, applies PAM on each sample, and gives the best clustering as the output
• Strength: deals with larger data sets than PAM
• Weakness:– Efficiency depends on the sample size
– A good clustering based on samples will not necessarily represent a good clustering of the whole data set if the sample is biased
![Page 32: Descriptive Modeling](https://reader035.vdocuments.net/reader035/viewer/2022081417/568145e9550346895db2ea9b/html5/thumbnails/32.jpg)
K-Means Example

• Given: {2, 4, 10, 12, 3, 20, 30, 11, 25}, k = 2
• Randomly assign means: m1 = 3, m2 = 4
• Solve for the rest ...
• Similarly try for k-medoids
![Page 33: Descriptive Modeling](https://reader035.vdocuments.net/reader035/viewer/2022081417/568145e9550346895db2ea9b/html5/thumbnails/33.jpg)
K-Means Example

• Given: {2, 4, 10, 12, 3, 20, 30, 11, 25}, k = 2
• Randomly assign means: m1 = 3, m2 = 4
• K1 = {2, 3}, K2 = {4, 10, 12, 20, 30, 11, 25}; m1 = 2.5, m2 = 16
• K1 = {2, 3, 4}, K2 = {10, 12, 20, 30, 11, 25}; m1 = 3, m2 = 18
• K1 = {2, 3, 4, 10}, K2 = {12, 20, 30, 11, 25}; m1 = 4.75, m2 = 19.6
• K1 = {2, 3, 4, 10, 11, 12}, K2 = {20, 30, 25}; m1 = 7, m2 = 25
• Stop, as the clusters with these means are the same.
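These iterations can be replayed with a few lines of Python (my sketch, not the slides'); ties are broken toward the first cluster and the means are recomputed after each full pass over the data.

```python
# Replaying the 1-D run above
pts = [2, 4, 10, 12, 3, 20, 30, 11, 25]
m1, m2 = 3.0, 4.0                      # the randomly assigned initial means
while True:
    k1 = [x for x in pts if abs(x - m1) <= abs(x - m2)]
    k2 = [x for x in pts if abs(x - m1) > abs(x - m2)]
    new1, new2 = sum(k1) / len(k1), sum(k2) / len(k2)
    print(k1, k2, new1, new2)
    if (new1, new2) == (m1, m2):       # clusters (and means) unchanged: stop
        break
    m1, m2 = new1, new2
```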
![Page 34: Descriptive Modeling](https://reader035.vdocuments.net/reader035/viewer/2022081417/568145e9550346895db2ea9b/html5/thumbnails/34.jpg)
Cluster Summary Parameters
![Page 35: Descriptive Modeling](https://reader035.vdocuments.net/reader035/viewer/2022081417/568145e9550346895db2ea9b/html5/thumbnails/35.jpg)
Distance Between Clusters
• Single Link: smallest distance between points
• Complete Link: largest distance between points
• Average Link: average distance between points
• Centroid: distance between centroids
![Page 36: Descriptive Modeling](https://reader035.vdocuments.net/reader035/viewer/2022081417/568145e9550346895db2ea9b/html5/thumbnails/36.jpg)
Hierarchical Clustering

•Agglomerative versus divisive
•Generic Agglomerative Algorithm:
•Computational complexity O(n^2)
![Page 37: Descriptive Modeling](https://reader035.vdocuments.net/reader035/viewer/2022081417/568145e9550346895db2ea9b/html5/thumbnails/37.jpg)
![Page 38: Descriptive Modeling](https://reader035.vdocuments.net/reader035/viewer/2022081417/568145e9550346895db2ea9b/html5/thumbnails/38.jpg)
Height of the cross-bar shows the change in within-cluster SS
Agglomerative
![Page 39: Descriptive Modeling](https://reader035.vdocuments.net/reader035/viewer/2022081417/568145e9550346895db2ea9b/html5/thumbnails/39.jpg)
Hierarchical Clustering

Single link / nearest neighbor ("chaining"): $D_{sl}(C_i, C_j) = \min_{x,y}\{\, d(x, y) \mid x \in C_i,\ y \in C_j \,\}$

Complete link / furthest neighbor (tends toward clusters of equal volume): $D_{fl}(C_i, C_j) = \max_{x,y}\{\, d(x, y) \mid x \in C_i,\ y \in C_j \,\}$

•centroid measure (distance between centroids)
•group average measure (average of pairwise distances)
•Ward's: $SS(C_{i+j}) - (SS(C_i) + SS(C_j))$
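For concreteness, small Python versions of these between-cluster distances, assuming Euclidean distances between points (the two toy clusters are made up for the example):

```python
import numpy as np

def d_sl(Ci, Cj):
    """Single link: smallest pairwise distance between the two clusters."""
    return min(np.linalg.norm(x - y) for x in Ci for y in Cj)

def d_fl(Ci, Cj):
    """Complete link (furthest neighbor): largest pairwise distance."""
    return max(np.linalg.norm(x - y) for x in Ci for y in Cj)

def d_avg(Ci, Cj):
    """Group average: mean of all pairwise distances."""
    return float(np.mean([np.linalg.norm(x - y) for x in Ci for y in Cj]))

def d_centroid(Ci, Cj):
    """Centroid measure: distance between the cluster centroids."""
    return float(np.linalg.norm(Ci.mean(axis=0) - Cj.mean(axis=0)))

Ci = np.array([[0.0, 0.0], [1.0, 0.0]])
Cj = np.array([[4.0, 0.0], [5.0, 1.0]])
print(d_sl(Ci, Cj), d_fl(Ci, Cj), d_avg(Ci, Cj), d_centroid(Ci, Cj))
```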
![Page 40: Descriptive Modeling](https://reader035.vdocuments.net/reader035/viewer/2022081417/568145e9550346895db2ea9b/html5/thumbnails/40.jpg)
![Page 41: Descriptive Modeling](https://reader035.vdocuments.net/reader035/viewer/2022081417/568145e9550346895db2ea9b/html5/thumbnails/41.jpg)
Single-Link Agglomerative Example

|   | A | B | C | D | E |
|---|---|---|---|---|---|
| A | 0 | 1 | 2 | 2 | 3 |
| B | 1 | 0 | 2 | 4 | 3 |
| C | 2 | 2 | 0 | 1 | 5 |
| D | 2 | 4 | 1 | 0 | 3 |
| E | 3 | 3 | 5 | 3 | 0 |

[Figure: the graph on {A, B, C, D, E} and the dendrogram produced by single-link merges as the distance threshold is raised through 1, 2, 3, 4, 5.]
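Assuming SciPy is available, single-link clustering on this distance matrix can be reproduced directly; cutting the merge tree at a threshold of 2 groups A, B, C, D together and leaves E on its own.

```python
import numpy as np
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage, fcluster

labels = ["A", "B", "C", "D", "E"]
D = np.array([[0, 1, 2, 2, 3],
              [1, 0, 2, 4, 3],
              [2, 2, 0, 1, 5],
              [2, 4, 1, 0, 3],
              [3, 3, 5, 3, 0]], dtype=float)

Z = linkage(squareform(D), method="single")    # single-link merge tree
print(Z)                                       # each row: cluster1, cluster2, merge distance, size

flat = fcluster(Z, t=2, criterion="distance")  # cut the dendrogram at distance threshold 2
print(dict(zip(labels, flat)))
```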
![Page 42: Descriptive Modeling](https://reader035.vdocuments.net/reader035/viewer/2022081417/568145e9550346895db2ea9b/html5/thumbnails/42.jpg)
Clustering Example
![Page 43: Descriptive Modeling](https://reader035.vdocuments.net/reader035/viewer/2022081417/568145e9550346895db2ea9b/html5/thumbnails/43.jpg)
AGNES (Agglomerative Nesting)
• Introduced in Kaufmann and Rousseeuw (1990)
• Implemented in statistical analysis packages, e.g., Splus
• Use the Single-Link method and the dissimilarity matrix.
• Merge nodes that have the least dissimilarity
• Go on in a non-descending fashion
• Eventually all nodes belong to the same cluster
[Figure: three scatter plots (axes 0-10) illustrating successive AGNES merges.]
![Page 44: Descriptive Modeling](https://reader035.vdocuments.net/reader035/viewer/2022081417/568145e9550346895db2ea9b/html5/thumbnails/44.jpg)
DIANA (Divisive Analysis)
• Introduced in Kaufmann and Rousseeuw (1990)
• Implemented in statistical analysis packages, e.g., Splus
• Inverse order of AGNES
• Eventually each node forms a cluster on its own
[Figure: three scatter plots (axes 0-10) illustrating successive DIANA splits, the inverse of the AGNES merges.]
![Page 45: Descriptive Modeling](https://reader035.vdocuments.net/reader035/viewer/2022081417/568145e9550346895db2ea9b/html5/thumbnails/45.jpg)
Clustering Market Basket Data: ROCK

• ROCK: RObust Clustering using linKs, by S. Guha, R. Rastogi, K. Shim (ICDE'99)
  – Use links to measure similarity/proximity
  – Not distance based
  – Computational complexity: $O(n^2 + n\, m_m m_a + n^2 \log n)$
• Basic ideas:
  – Similarity function and neighbors: let $T_1 = \{1,2,3\}$, $T_2 = \{3,4,5\}$
  – $Sim(T_1, T_2) = \frac{|T_1 \cap T_2|}{|T_1 \cup T_2|}$
  – $Sim(T_1, T_2) = \frac{|\{3\}|}{|\{1,2,3,4,5\}|} = \frac{1}{5} = 0.2$
Han & Kamber
![Page 46: Descriptive Modeling](https://reader035.vdocuments.net/reader035/viewer/2022081417/568145e9550346895db2ea9b/html5/thumbnails/46.jpg)
Rock: Algorithm

• Links: the number of common neighbors of the two points
• Algorithm
  – Draw random sample
  – Cluster with links
  – Label data in disk
• Example: transactions {1,2,3}, {1,2,4}, {1,2,5}, {1,3,4}, {1,3,5}, {1,4,5}, {2,3,4}, {2,3,5}, {2,4,5}, {3,4,5}; neighbors are transactions with sim > threshold, and link({1,2,3}, {1,2,4}) = 3
Han & Kamber
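A short sketch of the similarity and link computations on the ten transactions above (my illustration; the neighbor threshold of 0.4 is an assumed value chosen so that transactions sharing two of their three items count as neighbors):

```python
def sim(t1, t2):
    """Jaccard-style similarity used by ROCK for market-basket transactions."""
    t1, t2 = set(t1), set(t2)
    return len(t1 & t2) / len(t1 | t2)

baskets = [{1, 2, 3}, {1, 2, 4}, {1, 2, 5}, {1, 3, 4}, {1, 3, 5},
           {1, 4, 5}, {2, 3, 4}, {2, 3, 5}, {2, 4, 5}, {3, 4, 5}]

theta = 0.4   # illustrative threshold: neighbors have sim > theta
neighbors = {i: {j for j in range(len(baskets))
                 if j != i and sim(baskets[i], baskets[j]) > theta}
             for i in range(len(baskets))}

def links(i, j):
    """Number of common neighbors of two transactions."""
    return len(neighbors[i] & neighbors[j])

print(sim({1, 2, 3}, {3, 4, 5}))   # 1/5 = 0.2, as on the previous slide
print(links(0, 1))                 # link({1,2,3}, {1,2,4}) = 3
```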
![Page 47: Descriptive Modeling](https://reader035.vdocuments.net/reader035/viewer/2022081417/568145e9550346895db2ea9b/html5/thumbnails/47.jpg)
CLIQUE (Clustering In QUEst) • Agrawal, Gehrke, Gunopulos, Raghavan
(SIGMOD’98).
• Automatically identifying subspaces of a high dimensional data space that allow better clustering than original space
• CLIQUE can be considered as both density-based and grid-based
  – It partitions each dimension into the same number of equal-length intervals
– It partitions an m-dimensional data space into non-overlapping rectangular units
– A unit is dense if the fraction of total data points contained in the unit exceeds the input model parameter
– A cluster is a maximal set of connected dense units within a subspace
Han & Kamber
![Page 48: Descriptive Modeling](https://reader035.vdocuments.net/reader035/viewer/2022081417/568145e9550346895db2ea9b/html5/thumbnails/48.jpg)
CLIQUE: The Major Steps

• Partition the data space and find the number of points that lie inside each unit of the partition
• Identify the dense units using the Apriori principle
• Determine connected dense units in all subspaces of interest
• Generate a minimal description for the clusters
  – Determine maximal regions that cover a cluster of connected dense units for each cluster
  – Determine the minimal cover for each cluster
Han & Kamber
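A much-simplified sketch of the first two steps (partition each dimension, count points per unit, keep the dense 1-d units, and form Apriori-style 2-d candidates only from dense 1-d units); the interval count, the threshold `tau`, and the synthetic data are assumptions for illustration, and this is not the full CLIQUE algorithm.

```python
import numpy as np
from collections import Counter
from itertools import combinations

def dense_units_1d(X, n_intervals=10, tau=3):
    """Partition each dimension into equal-length intervals and keep units with > tau points."""
    dense = {}
    for dim in range(X.shape[1]):
        lo, hi = X[:, dim].min(), X[:, dim].max()
        unit = np.minimum(((X[:, dim] - lo) / (hi - lo + 1e-12) * n_intervals).astype(int),
                          n_intervals - 1)
        counts = Counter(unit)
        dense[dim] = {u for u, c in counts.items() if c > tau}
    return dense

def candidate_2d(dense):
    """Apriori principle: a 2-d unit can only be dense if both of its 1-d projections are."""
    return [((d1, u1), (d2, u2))
            for d1, d2 in combinations(sorted(dense), 2)
            for u1 in dense[d1] for u2 in dense[d2]]

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([2.0, 2.0], 0.3, size=(40, 2)),
               rng.normal([7.0, 7.0], 0.3, size=(40, 2))])
dense_1d = dense_units_1d(X)
print(dense_1d)
print(len(candidate_2d(dense_1d)))
```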
![Page 49: Descriptive Modeling](https://reader035.vdocuments.net/reader035/viewer/2022081417/568145e9550346895db2ea9b/html5/thumbnails/49.jpg)
Example
![Page 50: Descriptive Modeling](https://reader035.vdocuments.net/reader035/viewer/2022081417/568145e9550346895db2ea9b/html5/thumbnails/50.jpg)
[Figure: CLIQUE example of counting points per unit in the (age, salary) plane, with age on 20-60 and salary in units of $10,000; density threshold = 2.]
![Page 51: Descriptive Modeling](https://reader035.vdocuments.net/reader035/viewer/2022081417/568145e9550346895db2ea9b/html5/thumbnails/51.jpg)
Strength and Weakness of CLIQUE
• Strength
  – It automatically finds subspaces of the highest dimensionality such that high density clusters exist in those subspaces
  – It is insensitive to the order of records in input and does not presume some canonical data distribution
  – It scales linearly with the size of input and has good scalability as the number of dimensions in the data increases
• Weakness
  – The accuracy of the clustering result may be degraded at the expense of the simplicity of the method
Han & Kamber
![Page 52: Descriptive Modeling](https://reader035.vdocuments.net/reader035/viewer/2022081417/568145e9550346895db2ea9b/html5/thumbnails/52.jpg)
Model-based Clustering
$f(x) = \sum_{k=1}^{K} \pi_k\, f_k(x;\theta_k)$
![Page 53: Descriptive Modeling](https://reader035.vdocuments.net/reader035/viewer/2022081417/568145e9550346895db2ea9b/html5/thumbnails/53.jpg)
[Figure: mixture-model fits after EM iterations 0, 1, 2, 5, 10, and 25.]
![Page 54: Descriptive Modeling](https://reader035.vdocuments.net/reader035/viewer/2022081417/568145e9550346895db2ea9b/html5/thumbnails/54.jpg)
[Figure: Log-Likelihood Cross-Section: log-likelihood plotted against log(sigma).]
![Page 55: Descriptive Modeling](https://reader035.vdocuments.net/reader035/viewer/2022081417/568145e9550346895db2ea9b/html5/thumbnails/55.jpg)
[Figure: the two component densities (Component 1, Component 2) and the resulting mixture model, p(x) versus x.]
![Page 56: Descriptive Modeling](https://reader035.vdocuments.net/reader035/viewer/2022081417/568145e9550346895db2ea9b/html5/thumbnails/56.jpg)
[Figure: the two component densities (Component 1, Component 2) and the resulting mixture model, p(x) versus x.]
![Page 57: Descriptive Modeling](https://reader035.vdocuments.net/reader035/viewer/2022081417/568145e9550346895db2ea9b/html5/thumbnails/57.jpg)
[Figure: the component models and the resulting mixture model, p(x) versus x.]
![Page 58: Descriptive Modeling](https://reader035.vdocuments.net/reader035/viewer/2022081417/568145e9550346895db2ea9b/html5/thumbnails/58.jpg)
Advantages of the Probabilistic Approach
•Provides a distributional description for each component
•For each observation, provides a K-component vector of probabilities of class membership
•Method can be extended to data that are not in the form of p-dimensional vectors, e.g., mixtures of Markov models
•Can find clusters-within-clusters
•Can make inference about the number of clusters
•But... it's computationally somewhat costly
![Page 59: Descriptive Modeling](https://reader035.vdocuments.net/reader035/viewer/2022081417/568145e9550346895db2ea9b/html5/thumbnails/59.jpg)
Mixtures of {Sequences, Curves, …}
$p(D_i) = \sum_{k=1}^{K} \pi_k\, p(D_i \mid c_k)$
Generative Model
- select a component ck for individual i
- generate data according to p(Di | ck)
- p(Di | ck) can be very general
- e.g., sets of sequences, spatial patterns, etc
[Note: given p(Di | ck), we can define an EM algorithm]
![Page 60: Descriptive Modeling](https://reader035.vdocuments.net/reader035/viewer/2022081417/568145e9550346895db2ea9b/html5/thumbnails/60.jpg)
Application 1: Web Log Visualization
(Cadez, Heckerman, Meek, Smyth, KDD 2000)

• MSNBC Web logs
  – 2 million individuals per day
  – different session lengths per individual
  – difficult visualization and clustering problem
• WebCanvas
  – uses mixtures of SFSMs to cluster individuals based on their observed sequences
  – software tool: EM mixture modeling + visualization
![Page 61: Descriptive Modeling](https://reader035.vdocuments.net/reader035/viewer/2022081417/568145e9550346895db2ea9b/html5/thumbnails/61.jpg)
![Page 62: Descriptive Modeling](https://reader035.vdocuments.net/reader035/viewer/2022081417/568145e9550346895db2ea9b/html5/thumbnails/62.jpg)
Example: Mixtures of SFSMs
Simple model for traversal on a Web site (equivalent to first-order Markov with end state)

Generative model for large sets of Web users
  - different behaviors <=> mixture of SFSMs

EM algorithm is quite simple: weighted counts
![Page 63: Descriptive Modeling](https://reader035.vdocuments.net/reader035/viewer/2022081417/568145e9550346895db2ea9b/html5/thumbnails/63.jpg)
WebCanvas: Cadez, Heckerman, et al, KDD 2000
![Page 64: Descriptive Modeling](https://reader035.vdocuments.net/reader035/viewer/2022081417/568145e9550346895db2ea9b/html5/thumbnails/64.jpg)
![Page 65: Descriptive Modeling](https://reader035.vdocuments.net/reader035/viewer/2022081417/568145e9550346895db2ea9b/html5/thumbnails/65.jpg)
![Page 66: Descriptive Modeling](https://reader035.vdocuments.net/reader035/viewer/2022081417/568145e9550346895db2ea9b/html5/thumbnails/66.jpg)
![Page 67: Descriptive Modeling](https://reader035.vdocuments.net/reader035/viewer/2022081417/568145e9550346895db2ea9b/html5/thumbnails/67.jpg)
![Page 68: Descriptive Modeling](https://reader035.vdocuments.net/reader035/viewer/2022081417/568145e9550346895db2ea9b/html5/thumbnails/68.jpg)
![Page 69: Descriptive Modeling](https://reader035.vdocuments.net/reader035/viewer/2022081417/568145e9550346895db2ea9b/html5/thumbnails/69.jpg)
![Page 70: Descriptive Modeling](https://reader035.vdocuments.net/reader035/viewer/2022081417/568145e9550346895db2ea9b/html5/thumbnails/70.jpg)
Partition-based Clustering: Scores
Cluster center $r_k$ is often chosen to be the mean: $r_k = \frac{1}{n_k}\sum_{x \in C_k} x$

within-cluster variation: $wc(C) = \sum_{k=1}^{K} wc(C_k) = \sum_{k=1}^{K}\ \sum_{x_i \in C_k} d(x_i, r_k)^2$

between-cluster variation: $bc(C) = \sum_{1 \le j < k \le K} d(r_j, r_k)^2$
Global score could combine within & between e.g. bc(C)/ wc(C)
K-means uses Euclidean distance and minimizes wc(C). Tends to lead to spherical clusters
Using: leads to more elongated clusters (“single-link criterion”)
$wc(C) = \sum_{k=1}^{K} \max_{x(i) \in C_k}\ \min_{x(j) \in C_k,\, j \ne i} d(x(i), x(j))$
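A small Python sketch of wc(C) and bc(C) as defined above; the squared Euclidean distance, the synthetic data, and the bc/wc ratio used as a combined score are illustrative choices.

```python
import numpy as np

def wc(X, labels):
    """Within-cluster variation: sum of squared distances of points to their cluster mean."""
    return sum(np.sum((X[labels == k] - X[labels == k].mean(axis=0)) ** 2)
               for k in np.unique(labels))

def bc(X, labels):
    """Between-cluster variation: sum of squared distances between cluster means."""
    centers = [X[labels == k].mean(axis=0) for k in np.unique(labels)]
    return sum(np.sum((centers[j] - centers[k]) ** 2)
               for j in range(len(centers)) for k in range(j + 1, len(centers)))

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)), rng.normal(6.0, 1.0, size=(50, 2))])
labels = np.repeat([0, 1], 50)
print(wc(X, labels), bc(X, labels), bc(X, labels) / wc(X, labels))
```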
![Page 71: Descriptive Modeling](https://reader035.vdocuments.net/reader035/viewer/2022081417/568145e9550346895db2ea9b/html5/thumbnails/71.jpg)
Partition-based Clustering: Algorithms
•Enumeration of all allocations is infeasible: e.g., there are roughly 10^30 ways of allocating 100 objects into two classes
•Iterative improvement algorithms based on local search are very common (e.g., K-Means)
•Computational cost can be high (e.g., O(KnI) for K-Means)
![Page 72: Descriptive Modeling](https://reader035.vdocuments.net/reader035/viewer/2022081417/568145e9550346895db2ea9b/html5/thumbnails/72.jpg)
![Page 73: Descriptive Modeling](https://reader035.vdocuments.net/reader035/viewer/2022081417/568145e9550346895db2ea9b/html5/thumbnails/73.jpg)
BIRCH (1996)• Birch: Balanced Iterative Reducing and Clustering using
Hierarchies, by Zhang, Ramakrishnan, Livny (SIGMOD’96)
• Incrementally construct a CF (Clustering Feature) tree, a hierarchical data structure for multiphase clustering
  – Phase 1: scan DB to build an initial in-memory CF tree (a multi-level compression of the data that tries to preserve the inherent clustering structure of the data)
  – Phase 2: use an arbitrary clustering algorithm to cluster the leaf nodes of the CF-tree
• Scales linearly: finds a good clustering with a single scan and improves the quality with a few additional scans
• Weakness: handles only numeric data, and sensitive to the order of the data record.
Han & Kamber
![Page 74: Descriptive Modeling](https://reader035.vdocuments.net/reader035/viewer/2022081417/568145e9550346895db2ea9b/html5/thumbnails/74.jpg)
Clustering Feature Vector
Clustering Feature: CF = (N, LS, SS)

N: number of data points

LS: linear sum of the N points, $\sum_{i=1}^{N} X_i$

SS: square sum of the N points, $\sum_{i=1}^{N} X_i^2$

[Figure: the five points (3,4), (2,6), (4,5), (4,7), (3,8) plotted on 0-10 axes, giving CF = (5, (16,30), (54,190)).]

Han & Kamber

Radius: $R = \left(\frac{\sum_{i=1}^{N}(X_i - \bar{X})^2}{N}\right)^{1/2}$

Diameter: $D = \left(\frac{\sum_{i=1}^{N}\sum_{j=1}^{N}(X_i - X_j)^2}{N(N-1)}\right)^{1/2}$
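A short sketch of the CF triple for the five points on the slide; the radius and diameter above can be recovered from (N, LS, SS) alone, which is the point of storing only the CF (the helper names below are mine).

```python
import numpy as np

def clustering_feature(X):
    """CF = (N, LS, SS): point count, per-dimension linear sum, per-dimension square sum."""
    return len(X), X.sum(axis=0), (X ** 2).sum(axis=0)

def radius(N, LS, SS):
    """Average distance of the points to the centroid, computed from the CF alone."""
    centroid = LS / N
    return np.sqrt(SS.sum() / N - centroid @ centroid)

def diameter(N, LS, SS):
    """Average pairwise distance within the cluster, computed from the CF alone."""
    return np.sqrt((2 * N * SS.sum() - 2 * LS @ LS) / (N * (N - 1)))

X = np.array([[3, 4], [2, 6], [4, 5], [4, 7], [3, 8]], dtype=float)
N, LS, SS = clustering_feature(X)
print(N, LS, SS)                       # 5, [16. 30.], [ 54. 190.] as on the slide
print(radius(N, LS, SS), diameter(N, LS, SS))
```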
![Page 75: Descriptive Modeling](https://reader035.vdocuments.net/reader035/viewer/2022081417/568145e9550346895db2ea9b/html5/thumbnails/75.jpg)
CF Tree

[Figure: a CF tree with branching factor B = 7 and maximum leaf size L = 6: the root and non-leaf nodes hold CF entries (CF1, CF2, ...) with child pointers, and the leaf nodes hold CF entries chained together with prev/next pointers.]
Han & Kamber
![Page 76: Descriptive Modeling](https://reader035.vdocuments.net/reader035/viewer/2022081417/568145e9550346895db2ea9b/html5/thumbnails/76.jpg)
Insertion Into the CF Tree
• Start from the root and recursively descend the tree choosing closest child node at each step.
• If some leaf node entry can absorb the entry (i.e., the resulting T_new < threshold T), do it
• Else, if space on leaf, add new entry to leaf• Else, split leaf using farthest pair as seeds
and redistributing remaining entries (may need to split parents)
• Also include a merge step
![Page 77: Descriptive Modeling](https://reader035.vdocuments.net/reader035/viewer/2022081417/568145e9550346895db2ea9b/html5/thumbnails/77.jpg)
![Page 78: Descriptive Modeling](https://reader035.vdocuments.net/reader035/viewer/2022081417/568145e9550346895db2ea9b/html5/thumbnails/78.jpg)
[Figure: CLIQUE example: dense units found in the (age, salary) and (age, vacation) subspaces, with age on 20-60, salary in units of $10,000, and vacation in weeks; intersecting the dense regions (roughly age 30-50) yields a candidate cluster in the higher-dimensional subspace; density threshold = 3.]
Han & Kamber