1
ACCTG 6910 Building Enterprise & Business Intelligence Systems (e.bis)
Clustering
Olivia R. Liu Sheng, Ph.D.
Emma Eccles Jones Presidential Chair of Business
2
Introduction
• Clustering
– Groups objects without pre-specified class labels into a set of non-predetermined classes of similar objects
[Diagram: objects O1–O6, each containing relevant attribute values without class labels, are grouped by clustering into non-predetermined classes: Class X = {O1, O2, O6}, Class Y = {O5}, Class Z = {O3, O4}]
3
An example
We can cluster customers based on their purchase behavior.
Cluster 1: (2, $1700), (3, $2000), (4, $2300)
Cluster 2: (10, $1800), (11, $2040), (12, $2100)
Cluster 3: (2, $100), (3, $200), (3, $150)
4
Applications
• For discovery
– Customers by shopping behavior, credit rating and/or demographics
– Insurance policy holders
– Plants, animals, genes, protein structures
– Handwriting
– Images
– Drawings
– Land uses
– Documents
– Web pages
• For pre-processing – data segmentation and outlier analysis
• For conceptual clustering – traditional clustering + classification/characterization to describe each cluster
5
Basic Terminology
• Cluster – a collection of objects that are similar to one another within the same cluster and are dissimilar to the objects in other clusters.
• Distance measure – how dissimilar (similar) objects are
– Non-negative
– Distance between identical objects = 0
– Symmetric
– Triangle inequality: the distance between two objects A and B is no greater than the sum of the distance from A to a third object C and the distance from C to B
6
Clustering Process
• Compute similarity between objects/clusters
• Clustering based on similarity between objects/clusters
7
Similarity/Dissimilarity
• An object (e.g., a customer) has a list of variables (e.g., attributes of a customer such as age, spending, gender etc.)
• When measuring similarity between objects, we compare the values of their variables.
• In practice, rather than measuring similarity directly, we use a distance over the variable values to measure dissimilarity.
8
Similarity/Dissimilarity
• Continuous variables
– Manhattan distance
– Euclidean distance
9
Dissimilarity
• For two objects X and Y with continuous variables 1, 2, …, n, Manhattan distance is defined as:

d(X, Y) = |x1 − y1| + |x2 − y2| + … + |xn − yn|

where x1, …, xn are the values of the variables of object X
and y1, …, yn are the values of the variables of object Y
10
Dissimilarity
• Example of Manhattan distance

NAME   AGE   SPENDING ($)
Sue     21     2,300
Carl    27     2,600
Tom     45     5,400
Jack    52     6,000
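The Manhattan distance between the customers above can be sketched in a few lines of Python (the helper name `manhattan` is illustrative, not from the slides):

```python
def manhattan(x, y):
    """Manhattan distance: sum of absolute differences over the variables."""
    return sum(abs(a - b) for a, b in zip(x, y))

# (age, spending) rows from the table above
customers = {
    "Sue":  (21, 2300),
    "Carl": (27, 2600),
    "Tom":  (45, 5400),
    "Jack": (52, 6000),
}

# d(Sue, Carl) = |21 - 27| + |2300 - 2600| = 306
d_sue_carl = manhattan(customers["Sue"], customers["Carl"])
```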
11
Dissimilarity
• For two objects X and Y with continuous variables 1, 2, …, n, Euclidean distance is defined as:

d(X, Y) = sqrt((x1 − y1)^2 + (x2 − y2)^2 + … + (xn − yn)^2)

where x1, …, xn are the values of the variables of object X
and y1, …, yn are the values of the variables of object Y
12
Dissimilarity
• Example of Euclidean distance

NAME   AGE   SPENDING ($)
Sue     21     23,200
Carl    27     23,330
Tom     45     23,260
Jack    52     23,400
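The same table can be run through a Euclidean distance helper (again an illustrative sketch, not code from the slides):

```python
import math

def euclidean(x, y):
    """Euclidean distance: square root of the sum of squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# (age, spending) rows from the table above
# d(Sue, Carl) = sqrt((21-27)^2 + (23200-23330)^2) = sqrt(36 + 16900)
d_sue_carl = euclidean((21, 23200), (27, 23330))
```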
13
Similarity/Dissimilarity
• Binary variables
Normalized Manhattan distance = number of unmatched variables / total number of variables

NAME   Married  Gender  Home Internet
Sue      Y        F        Y
Carl     Y        M        Y
Tom      N        M        N
Jack     N        M        N
14
Similarity/Dissimilarity
• Nominal/ordinal variables

NAME    AGE   BALANCE($)  INCOME  EYES   GENDER
Karen    21     2,300      high    Blue    F
Sue      21     2,300      high    Blue    F
Carl     27     5,400      high    Brown   M

• We assign 0/1 based on exact-match criteria:
– Same gender = 0, different gender = 1
– Same eye color = 0, different eye color = 1
• We can also rank an ordinal attribute:
– Income: high = 3, medium = 2, low = 1
– E.g., distance(high, low) = 2
15
Distance Calculation

NAME   AGE   BALANCE($)  INCOME  EYES   GENDER
Sue     21     2,300      high    Blue    F
Carl    27     5,400      high    Brown   M

Manhattan distance: 6 + 3100 + 0 + 1 + 1 = 3108
Euclidean distance: sqrt(6^2 + 3100^2 + 0 + 1 + 1)

Is there a problem?
16
Normalization
• Normalization of dimension values:
– In the previous example, "balance" is dominant
– Set the minimum and maximum distance values for each dimension to the same range (e.g., 0 – 100)

NAME   AGE   BALANCE($)  INCOME  EYES   GENDER
Sue     21     2,300      high    Blue    F
Carl    27     5,400      high    Brown   M
Don     18         0      low     Black   M
Amy     62    16,543      low     Blue    F

Assume that age ranges from 0 – 100.
Manhattan distance (Sue, Carl): 6 + 100 × ((5400 − 2300)/16543) + 0 + 100 + 100
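The normalization rule just described (rescale every dimension to 0–100, with each unmatched categorical value contributing 100) might be coded as follows; the helper name and the `ranges` encoding are assumptions for illustration:

```python
def normalized_distance(x, y, ranges):
    """Manhattan distance with each dimension rescaled to 0-100.
    ranges[i] is a (min, max) pair for a numeric dimension,
    or None for a categorical one (a mismatch contributes 100)."""
    total = 0.0
    for a, b, r in zip(x, y, ranges):
        if r is None:
            total += 0.0 if a == b else 100.0
        else:
            lo, hi = r
            total += 100.0 * abs(a - b) / (hi - lo)
    return total

# Sue and Carl from the table; age assumed to span 0-100,
# balance spans 0 (Don) to 16,543 (Amy)
sue  = (21, 2300, "high", "Blue", "F")
carl = (27, 5400, "high", "Brown", "M")
ranges = ((0, 100), (0, 16543), None, None, None)
# 6 + 100*(3100/16543) + 0 + 100 + 100, matching the slide's calculation
```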
17
Standardization
• Calculate the mean value• Calculate mean absolute deviation• Standardize each variable value as:
Standardized value = (original value – mean value)/ mean absolute deviation
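The three steps above can be sketched as a small function (the name `standardize` is illustrative):

```python
def standardize(values):
    """Standardize one variable using the mean absolute deviation."""
    mean = sum(values) / len(values)                          # step 1: mean value
    mad = sum(abs(v - mean) for v in values) / len(values)    # step 2: mean absolute deviation
    return [(v - mean) / mad for v in values]                 # step 3: standardized values

# e.g., standardize([1, 2, 3, 4, 5]) centers the values at 0
# and scales them by the mean absolute deviation of 1.2
```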
18
Hierarchical Algorithms
• Output: a tree of clusters where a parent node (cluster) consists of the objects in its child nodes (clusters)
• Input: objects and a distance measure only. No need for a pre-specified number of clusters.
• Agglomerative hierarchical clustering:
– Bottom-up
– Leaf nodes are individual objects
– Merge lower-level clusters by optimizing a clustering criterion until the termination conditions are satisfied
– More popular
19
Hierarchical Algorithms
• Output: a tree of clusters where a parent node (cluster) consists of the objects in its child nodes (clusters)
• Input: objects and a distance measure only. No need for a pre-specified number of clusters.
• Divisive hierarchical clustering:
– Top-down
– The root node corresponds to the whole set of objects
– Subdivide a cluster into smaller clusters by optimizing a clustering criterion until the termination conditions are met
20
Clustering based on dissimilarity
• After calculating dissimilarity between objects, a dissimilarity matrix can be created with objects as indexes and dissimilarities between objects as elements.
• Distance between clusters – Min, Max, Mean and Average
21
Clustering based on dissimilarity
       Sue  Tom  Carl  Jack  Mary
Sue     0    6    8     2     7
Tom     6    0    1     5     3
Carl    8    1    0    10     9
Jack    2    5   10     0     4
Mary    7    3    9     4     0
22
Bottom-up Hierarchical Clustering
Step 1: Initially, place each object in its own cluster
Step 2: Calculate the dissimilarity between clusters. The dissimilarity between two clusters is the minimum dissimilarity between two objects, one from each cluster
Step 3: Merge the two clusters with the least dissimilarity
Step 4: Repeat steps 2-3 until all objects are in one cluster
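Steps 1–4 amount to single-linkage agglomerative clustering over a precomputed dissimilarity matrix. A minimal sketch, using the matrix from the earlier slide:

```python
def single_linkage(names, dist):
    """Bottom-up clustering: dist[a][b] is the dissimilarity between
    objects a and b. Returns each merge as (cluster1, cluster2, distance)."""
    clusters = [frozenset([n]) for n in names]  # step 1: one cluster per object
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # step 2: cluster dissimilarity = minimum over object pairs
                d = min(dist[a][b] for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))  # step 3: merge the closest pair
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges  # step 4: repeated until one cluster remains

# Dissimilarity matrix from the earlier slide
names = ["Sue", "Tom", "Carl", "Jack", "Mary"]
rows = [[0, 6, 8, 2, 7], [6, 0, 1, 5, 3], [8, 1, 0, 10, 9],
        [2, 5, 10, 0, 4], [7, 3, 9, 4, 0]]
dist = {a: {b: rows[i][j] for j, b in enumerate(names)}
        for i, a in enumerate(names)}
# Merge order: {Tom, Carl} at distance 1, {Sue, Jack} at 2,
# Mary joins {Tom, Carl} at 3, and the final merge happens at 4.
```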
23
Nearest Neighbor Clustering (Demographic Clustering)
• Dissimilarity by votes
• Merge an object into the cluster with the lowest average dissimilarity
• If the average dissimilarity with every cluster exceeds a threshold, the object forms its own cluster
• Stop after a maximum number of passes, a maximum number of clusters, or when there are no significant changes in the average dissimilarities within each cluster
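A single pass of the merge rule above might look like this simplified sketch; the function name, threshold, and dissimilarity callable are placeholders, and the multi-pass stopping rules from the slide are omitted:

```python
def demographic_pass(objects, dissim, threshold):
    """Assign each object to the cluster with the lowest average
    dissimilarity, or start a new cluster if every average exceeds
    the threshold."""
    clusters = []
    for obj in objects:
        best, best_avg = None, None
        for c in clusters:
            avg = sum(dissim(obj, o) for o in c) / len(c)
            if best_avg is None or avg < best_avg:
                best, best_avg = c, avg
        if best is not None and best_avg <= threshold:
            best.append(obj)          # lowest average within the threshold
        else:
            clusters.append([obj])    # too dissimilar to every existing cluster
    return clusters

# Toy 1-D example: with threshold 3, the values split into {1, 2} and {10, 11}
groups = demographic_pass([1, 2, 10, 11], lambda a, b: abs(a - b), 3)
```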
24
Comparative Criteria for Clustering Algorithms
• Performance scalability
• Ability to deal with different attribute types
• Clusters with arbitrary shape
• Need K or not
• Noise handling
• Sensitivity to the order of input records
• High dimensionality (# of attributes)
• Constraint-based clustering
• Interpretability and usability
25
Summary
• Problem definition
– Input: objects without class labels
– Output: clusters for discovery and conceptual clustering for prediction
• Similarity/dissimilarity measures and calculations
• Hierarchical clustering
• Criteria for comparing algorithms
• Readings – T2, pp. 335–344 and 354–356