1
ACCTG 6910 Building Enterprise & Business Intelligence Systems (e.bis)
Clustering
Olivia R. Liu Sheng, Ph.D.
Emma Eccles Jones Presidential Chair of Business
2
Introduction
• Clustering
– Groups objects without pre-specified class labels into a set of non-predetermined classes of similar objects
[Diagram: objects O1–O6, each containing relevant attribute values without class labels, are grouped by clustering into non-predetermined classes: Class X = {O1, O2, O6}, Class Y = {O5}, Class Z = {O3, O4}]
3
An example
We can cluster customers based on their purchase behavior.
Cluster 1: (2, $1700), (3, $2000), (4, $2300)
Cluster 2: (10, $1800), (11, $2040), (12, $2100)
Cluster 3: (2, $100), (3, $200), (3, $150)
4
Applications
• For discovery
– Customers by shopping behavior, credit rating and/or demographics
– Insurance policy holders
– Plants, animals, genes, protein structures
– Handwriting
– Images
– Drawings
– Land uses
– Documents
– Web pages
• For pre-processing – data segmentation and outlier analysis
• For conceptual clustering – traditional clustering + classification/characterization to describe each cluster
5
Basic Terminology
• Cluster – a collection of objects that are similar to one another within the same cluster and are dissimilar to the objects in other clusters.
• Distance measure – how dissimilar (similar) objects are
– Non-negative
– Distance between identical objects = 0
– Symmetric
– Triangle inequality: the distance between two objects A and B is no greater than the sum of the distance from A to a third object C and the distance from C to B
6
Clustering Process
• Compute similarity between objects/clusters
• Clustering based on similarity between objects/clusters
7
Similarity/Dissimilarity
• An object (e.g., a customer) has a list of variables (e.g., attributes of a customer such as age, spending, gender etc.)
• When measuring similarity between objects, we compare the values of their variables.
• In practice, rather than measuring similarity directly, we use a distance over the variable values to measure dissimilarity.
8
Similarity/Dissimilarity
• Continuous variables
– Manhattan distance
– Euclidean distance
9
Dissimilarity
• For two objects X and Y with continuous variables 1, 2, …, n, Manhattan distance is defined as:

d(X, Y) = |x1 − y1| + |x2 − y2| + … + |xn − yn|

where x1, …, xn are the values of the variables of object X
and y1, …, yn are the values of the variables of object Y
10
Dissimilarity
• Example of Manhattan distance

NAME   AGE   SPENDING ($)
Sue     21     2,300
Carl    27     2,600
Tom     45     5,400
Jack    52     6,000
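The Manhattan distance between the customers above can be sketched in a few lines of Python (the helper name `manhattan` is illustrative, not from the slides):

```python
def manhattan(x, y):
    """Manhattan distance: sum of absolute differences over the variables."""
    return sum(abs(a - b) for a, b in zip(x, y))

# (age, spending) rows from the table above
customers = {
    "Sue":  (21, 2300),
    "Carl": (27, 2600),
    "Tom":  (45, 5400),
    "Jack": (52, 6000),
}

# d(Sue, Carl) = |21 - 27| + |2300 - 2600| = 306
d_sue_carl = manhattan(customers["Sue"], customers["Carl"])
```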
11
Dissimilarity
• For two objects X and Y with continuous variables 1, 2, …, n, Euclidean distance is defined as:

d(X, Y) = sqrt((x1 − y1)^2 + (x2 − y2)^2 + … + (xn − yn)^2)

where x1, …, xn are the values of the variables of object X
and y1, …, yn are the values of the variables of object Y
12
Dissimilarity
• Example of Euclidean distance

NAME   AGE   SPENDING ($)
Sue     21     23,200
Carl    27     23,330
Tom     45     23,260
Jack    52     23,400
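The same table can be run through a Euclidean distance helper (again an illustrative sketch, not code from the slides):

```python
import math

def euclidean(x, y):
    """Euclidean distance: square root of the sum of squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# (age, spending) rows from the table above
# d(Sue, Carl) = sqrt((21-27)^2 + (23200-23330)^2) = sqrt(36 + 16900)
d_sue_carl = euclidean((21, 23200), (27, 23330))
```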
13
Similarity/Dissimilarity
• Binary variables
Normalized Manhattan distance = number of unmatched variables / total number of variables

NAME   Married  Gender  Home Internet
Sue      Y        F        Y
Carl     Y        M        Y
Tom      N        M        N
Jack     N        M        N
14
Similarity/Dissimilarity
• Nominal/ordinal variables

NAME    AGE   BALANCE($)  INCOME  EYES   GENDER
Karen    21     2,300      high    Blue    F
Sue      21     2,300      high    Blue    F
Carl     27     5,400      high    Brown   M

• We assign 0/1 based on exact-match criteria:
– Same gender = 0, different gender = 1
– Same eye color = 0, different eye color = 1
• We can also rank an ordinal attribute:
– Income: high = 3, medium = 2, low = 1
– E.g., distance(high, low) = 2
15
Distance Calculation

NAME   AGE   BALANCE($)  INCOME  EYES   GENDER
Sue     21     2,300      high    Blue    F
Carl    27     5,400      high    Brown   M

Manhattan distance: 6 + 3100 + 0 + 1 + 1 = 3108
Euclidean distance: sqrt(6^2 + 3100^2 + 0 + 1 + 1)

Is there a problem?
16
Normalization
• Normalization of dimension values:
– In the previous example, "balance" is dominant
– Set the minimum and maximum distance values for each dimension to the same range (e.g., 0 – 100)

NAME   AGE   BALANCE($)  INCOME  EYES   GENDER
Sue     21     2,300      high    Blue    F
Carl    27     5,400      high    Brown   M
Don     18         0      low     Black   M
Amy     62    16,543      low     Blue    F

Assume that age ranges from 0 – 100.
Manhattan distance (Sue, Carl): 6 + 100 × ((5400 − 2300)/16543) + 0 + 100 + 100
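The normalization rule just described (rescale every dimension to 0–100, with each unmatched categorical value contributing 100) might be coded as follows; the helper name and the `ranges` encoding are assumptions for illustration:

```python
def normalized_distance(x, y, ranges):
    """Manhattan distance with each dimension rescaled to 0-100.
    ranges[i] is a (min, max) pair for a numeric dimension,
    or None for a categorical one (a mismatch contributes 100)."""
    total = 0.0
    for a, b, r in zip(x, y, ranges):
        if r is None:
            total += 0.0 if a == b else 100.0
        else:
            lo, hi = r
            total += 100.0 * abs(a - b) / (hi - lo)
    return total

# Sue and Carl from the table; age assumed to span 0-100,
# balance spans 0 (Don) to 16,543 (Amy)
sue  = (21, 2300, "high", "Blue", "F")
carl = (27, 5400, "high", "Brown", "M")
ranges = ((0, 100), (0, 16543), None, None, None)
# 6 + 100*(3100/16543) + 0 + 100 + 100, matching the slide's calculation
```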
17
Standardization
• Calculate the mean value• Calculate mean absolute deviation• Standardize each variable value as:
Standardized value = (original value – mean value)/ mean absolute deviation
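The three steps above can be sketched as a small function (the name `standardize` is illustrative):

```python
def standardize(values):
    """Standardize one variable using the mean absolute deviation."""
    mean = sum(values) / len(values)                          # step 1: mean value
    mad = sum(abs(v - mean) for v in values) / len(values)    # step 2: mean absolute deviation
    return [(v - mean) / mad for v in values]                 # step 3: standardized values

# e.g., standardize([1, 2, 3, 4, 5]) centers the values at 0
# and scales them by the mean absolute deviation of 1.2
```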
18
Hierarchical Algorithms
• Output: a tree of clusters where a parent node (cluster) consists of the objects in its child nodes (clusters)
• Input: objects and a distance measure only. No need for a pre-specified number of clusters.
• Agglomerative hierarchical clustering:
– Bottom-up
– Leaf nodes are individual objects
– Merge lower-level clusters by optimizing a clustering criterion until the termination conditions are satisfied
– More popular
19
Hierarchical Algorithms
• Output: a tree of clusters where a parent node (cluster) consists of the objects in its child nodes (clusters)
• Input: objects and a distance measure only. No need for a pre-specified number of clusters.
• Divisive hierarchical clustering:
– Top-down
– The root node corresponds to the whole set of objects
– Subdivide a cluster into smaller clusters by optimizing a clustering criterion until the termination conditions are met
20
Clustering based on dissimilarity
• After calculating dissimilarity between objects, a dissimilarity matrix can be created with objects as indexes and dissimilarities between objects as elements.
• Distance between clusters – Min, Max, Mean and Average
21
Clustering based on dissimilarity
       Sue  Tom  Carl  Jack  Mary
Sue     0    6    8     2     7
Tom     6    0    1     5     3
Carl    8    1    0    10     9
Jack    2    5   10     0     4
Mary    7    3    9     4     0
22
Bottom-up Hierarchical Clustering
Step 1: Initially, place each object in its own cluster
Step 2: Calculate the dissimilarity between clusters. The dissimilarity between two clusters is the minimum dissimilarity between two objects, one from each cluster
Step 3: Merge the two clusters with the least dissimilarity
Step 4: Repeat steps 2-3 until all objects are in one cluster
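Steps 1–4 amount to single-linkage agglomerative clustering over a precomputed dissimilarity matrix. A minimal sketch, using the matrix from the earlier slide:

```python
def single_linkage(names, dist):
    """Bottom-up clustering: dist[a][b] is the dissimilarity between
    objects a and b. Returns each merge as (cluster1, cluster2, distance)."""
    clusters = [frozenset([n]) for n in names]  # step 1: one cluster per object
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # step 2: cluster dissimilarity = minimum over object pairs
                d = min(dist[a][b] for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))  # step 3: merge the closest pair
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges  # step 4: repeated until one cluster remains

# Dissimilarity matrix from the earlier slide
names = ["Sue", "Tom", "Carl", "Jack", "Mary"]
rows = [[0, 6, 8, 2, 7], [6, 0, 1, 5, 3], [8, 1, 0, 10, 9],
        [2, 5, 10, 0, 4], [7, 3, 9, 4, 0]]
dist = {a: {b: rows[i][j] for j, b in enumerate(names)}
        for i, a in enumerate(names)}
# Merge order: {Tom, Carl} at distance 1, {Sue, Jack} at 2,
# Mary joins {Tom, Carl} at 3, and the final merge happens at 4.
```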
23
Nearest Neighbor Clustering (Demographic Clustering)
• Dissimilarity by votes
• Merge an object into the cluster with the lowest average dissimilarity
• If the average dissimilarity with every cluster exceeds a threshold, the object forms its own cluster
• Stop after a maximum number of passes, a maximum number of clusters, or when there are no significant changes in the average dissimilarities within each cluster
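A single pass of the merge rule above might look like this simplified sketch; the function name, threshold, and dissimilarity callable are placeholders, and the multi-pass stopping rules from the slide are omitted:

```python
def demographic_pass(objects, dissim, threshold):
    """Assign each object to the cluster with the lowest average
    dissimilarity, or start a new cluster if every average exceeds
    the threshold."""
    clusters = []
    for obj in objects:
        best, best_avg = None, None
        for c in clusters:
            avg = sum(dissim(obj, o) for o in c) / len(c)
            if best_avg is None or avg < best_avg:
                best, best_avg = c, avg
        if best is not None and best_avg <= threshold:
            best.append(obj)          # lowest average within the threshold
        else:
            clusters.append([obj])    # too dissimilar to every existing cluster
    return clusters

# Toy 1-D example: with threshold 3, the values split into {1, 2} and {10, 11}
groups = demographic_pass([1, 2, 10, 11], lambda a, b: abs(a - b), 3)
```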
24
Comparative Criteria for Clustering Algorithms
• Performance scalability
• Ability to deal with different attribute types
• Clusters with arbitrary shape
• Need K or not
• Noise handling
• Sensitivity to the order of input records
• High dimensionality (# of attributes)
• Constraint-based clustering
• Interpretability and usability
25
Summary
• Problem definition
– Input: objects without class labels
– Output: clusters for discovery and conceptual clustering for prediction
• Similarity/dissimilarity measures and calculations
• Hierarchical clustering
• Criteria for comparing algorithms
• Readings – T2, pp. 335–344 and 354–356