Data Mining: Concepts and Techniques
Cluster Analysis (Clustering)
© Jiawei Han and Micheline Kamber, Intelligent Database Systems Research Lab, School of Computing Science, Simon Fraser University, Canada
http://www.cs.sfu.ca

DESCRIPTION

What is cluster analysis? A cluster is a collection of data objects that are similar to one another within the same cluster and dissimilar to the objects in other clusters. Cluster analysis (clustering) groups a set of data objects into different clusters. Clustering is unsupervised classification: there are no predefined classes. Typical purposes: as a stand-alone tool to get insight into the data distribution, or as a preprocessing step for other algorithms.

TRANSCRIPT

Page 1:
Cluster Analysis (Clustering)

© Jiawei Han and Micheline Kamber
Intelligent Database Systems Research Lab
School of Computing Science, Simon Fraser University, Canada
http://www.cs.sfu.ca

Page 2:
Chapter 7. Cluster Analysis

Outline:
What is Cluster Analysis?
Types of Data in Cluster Analysis
Major Clustering Methods: Partitioning Methods, Hierarchical Methods, Density-Based Methods, Grid-Based Methods, Model-Based Clustering Methods
Summary

Page 3:
General Applications of Clustering (Exploratory Analysis)

Business/economic science (especially marketing research)
Biomedical informatics
Pattern recognition
Spatial data analysis: detect spatial clusters and explain them in spatial data mining
WWW: Web document profiling; clustering log data to discover groups of similar access patterns

Domain knowledge is very important to the applications of clustering!

Page 4:

Specific Examples of Clustering Applications

Marketing: Help marketers discover group profiles of customers, and then use this knowledge to develop targeted marketing programs

Insurance: Identify group profiles of motor insurance policy holders with a high average claim cost

Biology: Categorize genes with similar functionality, derive their taxonomies, and gain insight into the structures inherent in populations

Page 5:
What Is Good Clustering?

A good clustering method will produce high-quality clusters with high intra-class similarity and low inter-class similarity.

The quality of a clustering result depends on both the similarity measure and the clustering method.

The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns.

Page 6:
Issues of Clustering in Data Mining

Scalability (important for big data analysis)
Ability to deal with different types of attributes
Discovery of clusters with arbitrary shape
Minimal requirements for domain knowledge to determine input parameters
Ability to deal with noise and outliers
Insensitivity to the order of input records
Ability to deal with high dimensionality
Incorporation of user-specified constraints
Interpretability and usability

Page 7:
Chapter 7. Cluster Analysis

What is Cluster Analysis?
Types of Data in Cluster Analysis
A Categorization of Major Clustering Methods
Partitioning Methods
Hierarchical Methods
Density-Based Methods
Grid-Based Methods
Model-Based Clustering Methods
Outlier Analysis
Summary

Page 8:
Data Structures in Clustering

Two data structures:

Data matrix (n objects × p variables):

$$\begin{bmatrix} x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\ \vdots & & \vdots & & \vdots \\ x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\ \vdots & & \vdots & & \vdots \\ x_{n1} & \cdots & x_{nf} & \cdots & x_{np} \end{bmatrix}$$

Dissimilarity matrix (n objects × n objects):

$$\begin{bmatrix} 0 & & & & \\ d(2,1) & 0 & & & \\ d(3,1) & d(3,2) & 0 & & \\ \vdots & \vdots & \vdots & \ddots & \\ d(n,1) & d(n,2) & \cdots & \cdots & 0 \end{bmatrix}$$

Page 9:
Similarity Measurement between Samples

Dissimilarity/similarity metric: similarity is expressed in terms of a distance function d(i, j), which is typically a metric.

The definitions of distance functions are usually very different for interval-scaled (numeric), boolean, categorical, ordinal, and ratio variables.

It is hard to define "similar enough" or "good enough"; the answer is typically highly subjective.

Sometimes, weights should be associated with different variables based on the application and the data semantics.

Page 10:
Types of Data in Cluster Analysis

Interval-scaled variables: data values have order and equal intervals, measured on a linear scale; the intervals keep the same importance throughout the scale
Binary variables
Nominal, ordinal, and ratio variables
Variables of mixed types

Page 11:
Data Normalization (Interval-Valued Variables)

Data normalization (for each feature f):

1. Calculate the mean: $m_f = \frac{1}{n}(x_{1f} + x_{2f} + \cdots + x_{nf})$

2. Calculate the mean absolute deviation: $s_f = \frac{1}{n}(|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|)$

3. Calculate the standardized measurement (z-score, i.e., zero-mean normalization): $z_{if} = \frac{x_{if} - m_f}{s_f}$

Using the mean absolute deviation $s_f$ instead of the standard deviation $s'_f = \sqrt{\frac{1}{n}\left[(x_{1f} - m_f)^2 + (x_{2f} - m_f)^2 + \cdots + (x_{nf} - m_f)^2\right]}$ is more robust to outliers in cluster analysis.
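A minimal sketch of the three steps above (the function name standardize is my own; it assumes $s_f \ne 0$):

```python
def standardize(values):
    """Standardize one feature f: z_if = (x_if - m_f) / s_f,
    where s_f is the mean absolute deviation (not the standard deviation)."""
    n = len(values)
    m_f = sum(values) / n                           # step 1: mean
    s_f = sum(abs(x - m_f) for x in values) / n     # step 2: mean absolute deviation
    return [(x - m_f) / s_f for x in values]        # step 3: z-scores

print(standardize([2.0, 4.0, 4.0, 6.0]))  # symmetric data -> [-2.0, 0.0, 0.0, 2.0]
```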

Page 12:
Similarity and Dissimilarity Between Data Objects

Distances are normally used to measure the similarity or dissimilarity between two data objects xi and xj (if data objects are viewed as points in the data space).

Some popular ones include the Minkowski distance:

$d(x_i, x_j) = \left(|x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \cdots + |x_{ip} - x_{jp}|^q\right)^{1/q}$

where $x_i = \langle x_{i1}, x_{i2}, \ldots, x_{ip}\rangle$ and $x_j = \langle x_{j1}, x_{j2}, \ldots, x_{jp}\rangle$ are two p-dimensional data points, and q is a positive integer.

If q = 1, d is the Manhattan distance:

$d(x_i, x_j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|$
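A small illustrative sketch of the Minkowski family (not part of the slides):

```python
def minkowski(xi, xj, q=2):
    """d(xi, xj) = (sum_f |xi_f - xj_f|^q)^(1/q); q=1 Manhattan, q=2 Euclidean."""
    return sum(abs(a - b) ** q for a, b in zip(xi, xj)) ** (1.0 / q)

p1, p2 = (1.0, 2.0), (4.0, 6.0)
print(minkowski(p1, p2, q=1))  # Manhattan: 3 + 4 = 7.0
print(minkowski(p1, p2, q=2))  # Euclidean: sqrt(9 + 16) = 5.0
```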

Page 13:
Similarity and Dissimilarity Between Objects (Cont.)

If q = 2, d is the Euclidean distance:

$d(x_i, x_j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2}$

Properties:
d(i, j) ≥ 0
d(i, i) = 0
d(i, j) = d(j, i)
d(i, j) ≤ d(i, k) + d(k, j)

One can also use a weighted distance, the parametric Pearson product-moment correlation (or simply correlation r), or other dissimilarity measures.

Page 14:
Distance Functions for Binary Variables

A contingency table for binary data:

              Object j
               1      0     sum
Object i  1    a      b     a+b
          0    c      d     c+d
        sum   a+c    b+d     p

a: the # of variables that equal 1 for both objects i and j
d: the # of variables that equal 0 for both objects i and j
b: the # of variables that equal 1 for object i but 0 for object j
c: the # of variables that equal 0 for object i but 1 for object j

Example:
i: 011001100110
j: 110011001111
a: 4, b: 2, c: 4, d: 2

Page 15:
Distance Functions for Binary Variables

Given the contingency table above:

Simple matching coefficient (if the binary variable is symmetric):

$d(i, j) = \frac{b + c}{a + b + c + d}$

Jaccard coefficient (if the binary variable is asymmetric):

$d(i, j) = \frac{b + c}{a + b + c}$

with the corresponding similarity

$sim(i, j) = 1 - d(i, j) = \frac{a}{a + b + c}$
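An illustrative sketch of both coefficients (the helper name binary_dissimilarity is my own):

```python
def binary_dissimilarity(i, j, symmetric=True):
    """Simple matching (symmetric) or Jaccard (asymmetric) dissimilarity
    for two equal-length 0/1 vectors."""
    a = sum(1 for x, y in zip(i, j) if x == 1 and y == 1)
    b = sum(1 for x, y in zip(i, j) if x == 1 and y == 0)
    c = sum(1 for x, y in zip(i, j) if x == 0 and y == 1)
    d = sum(1 for x, y in zip(i, j) if x == 0 and y == 0)
    if symmetric:                       # simple matching coefficient
        return (b + c) / (a + b + c + d)
    return (b + c) / (a + b + c)        # Jaccard: 0-0 matches are ignored

i = [0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0]
j = [1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1]   # the slide's example: a=4, b=2, c=4, d=2
print(binary_dissimilarity(i, j))           # (2+4)/12 = 0.5
print(binary_dissimilarity(i, j, False))    # (2+4)/10 = 0.6
```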

Page 16:
Dissimilarity between Binary Variables

Example:

Name  Gender  Fever  Cough  Test-1  Test-2  Test-3  Test-4
Jack    M       Y      N      P       N       N       N
Mary    F       Y      N      P       N       P       N
Jim     M       Y      P      N       N       N       N

Gender is a symmetric attribute, which is ignored here.
The remaining attributes are asymmetric binary: let the values Y and P be set to 1, and the value N be set to 0.

Page 17:
Dissimilarity between Binary Variables (Asymmetric Example)

Preprocessing:
1. Ignore Gender
2. Y, P → 1
3. N → 0

Name  Fever  Cough  Test-1  Test-2  Test-3  Test-4
Jack    1      0      1       0       0       0
Mary    1      0      1       0       1       0
Jim     1      1      0       0       0       0

Using the Jaccard coefficient:

$d(\text{jack}, \text{mary}) = \frac{0 + 1}{2 + 0 + 1} = 0.33$

$d(\text{jack}, \text{jim}) = \frac{1 + 1}{1 + 1 + 1} = 0.67$

$d(\text{jim}, \text{mary}) = \frac{1 + 2}{1 + 1 + 2} = 0.75$

Conclusion: the clinical conditions of Jack and Mary are the most similar.

Page 18:
Nominal Variables (Categorical Variables)

A generalization of the binary variable in that it can take more than 2 states, e.g., color: red, yellow, blue, ...

Dissimilarity between nominal variables:

Method 1: simple matching

$d(i, j) = \frac{p - m}{p}$

where m is the # of matches and p is the total # of variables.

Method 2: use a large number of asymmetric binary variables, creating a new binary variable for each of the M nominal states of a particular variable, e.g., <0, 0, 0, 0, 0, 0, 1> for purple; purple is then totally different from blue.

Method 3: use a small number of additive binary variables, e.g., <1, 0, 1> for purple (in a 3-bit RGB representation); purple is then partially (50%) different from blue.

Page 19:
Dissimilarity between Ordinal Variables

An ordinal variable can be discrete or continuous. Its values are ordered in a meaningful sequence, e.g., job ranks.

It can be treated like an interval-scaled variable (for each feature f):

1. Define ranks for the feature values: the $M_f$ states map to ranks $r_{if} \in \{1, \ldots, M_f\}$
2. Replace $x_{if}$ (the value of the f-th variable of the i-th object) by its rank $r_{if}$
3. Map the range of feature f onto [0, 1] by replacing the rank by

$z_{if} = \frac{r_{if} - 1}{M_f - 1}$

Then compute the dissimilarity using methods for interval-scaled variables ($x_{if} \to r_{if} \to z_{if}$, with $z_{if} \in [0, 1]$).

Page 20:
Dissimilarity between Ratio-Scaled Variables

Ratio-scaled variable: a positive measurement on a nonlinear scale, approximately at exponential scale, such as $Ae^{Bt}$ or $Ae^{-Bt}$

Alternative methods:
1. treat them like interval-scaled variables: not a good choice, since the scale may be distorted
2. apply a logarithmic transformation $y_{if} = \log(x_{if})$ and treat the transformed values as interval-scaled
3. treat them as continuous ordinal data and treat their ranks as interval-scaled values

Page 21:
Variables of Mixed Types

The data set may contain all six types of variables: symmetric binary, asymmetric binary, nominal, ordinal, interval, and ratio-scaled.

One may use a weighted formula to combine their effects (distance between objects i and j):

$d(i, j) = \frac{\sum_{f=1}^{p} w_f\, d_f(i, j)}{\sum_{f=1}^{p} w_f}$

If feature f is binary or nominal: $d_f(i, j) = 0$ if $x_{if} = x_{jf}$, or $d_f(i, j) = 1$ otherwise
If feature f is interval-based: use the normalized distance
If feature f is ordinal or ratio-scaled: compute ranks $r_{if}$ and $z_{if} = \frac{r_{if} - 1}{M_f - 1}$, and treat $z_{if}$ as interval-scaled ($x_{if} \to r_{if} \to z_{if}$)
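A hedged sketch of the weighted formula for mixed types; the feature kind labels and the helper name mixed_dissimilarity are my own, and ordinal/ratio features are assumed already mapped to [0, 1] ranks:

```python
def mixed_dissimilarity(x, y, kinds, ranges, weights=None):
    """d(i,j) = sum_f w_f * d_f(i,j) / sum_f w_f over mixed feature types."""
    weights = weights or [1.0] * len(kinds)
    num = den = 0.0
    for xf, yf, kind, rng, w in zip(x, y, kinds, ranges, weights):
        if kind in ("binary", "nominal"):
            df = 0.0 if xf == yf else 1.0       # exact-match distance
        else:                                   # interval (or rank-mapped ordinal)
            df = abs(xf - yf) / rng             # normalized distance
        num += w * df
        den += w
    return num / den

# Two objects: a nominal color, a binary flag, and an interval age with range 50.
print(mixed_dissimilarity(("red", 1, 30), ("blue", 1, 40),
                          kinds=("nominal", "binary", "interval"),
                          ranges=(None, None, 50)))   # (1 + 0 + 0.2) / 3 = 0.4
```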

Page 22:
Chapter 7. Cluster Analysis

What is Cluster Analysis?
Types of Data in Cluster Analysis
A Categorization of Major Clustering Methods
Partitioning Methods
Hierarchical Methods
Density-Based Methods
Grid-Based Methods
Model-Based Clustering Methods
Summary

Page 23:
Major Clustering Approaches

Partitioning algorithms: construct a number of partitions, and then improve them iteratively by some optimization criterion

Hierarchy algorithms: create a hierarchical (clustering) structure for the data set using some optimization criterion

Density-based: based on connectivity and/or density functions

Grid-based: quantize the data space into a finite number of grid cells on which cluster analysis is performed

Model-based: a cluster model is hypothesized for each cluster, and the models that best fit the data are found

Page 24:
Chapter 7. Cluster Analysis

What is Cluster Analysis?
Types of Data in Cluster Analysis
A Categorization of Major Clustering Methods
Partitioning Methods
Hierarchical Methods
Density-Based Methods
Grid-Based Methods
Model-Based Clustering Methods
Summary

Page 25:
Partitioning Algorithms: Basic Concept (Cluster Representation)

Partitioning method: given n data objects, a partitioning algorithm organizes the objects into k partitions, each of which represents a cluster.

Given k, find the k clusters that optimize the chosen partitioning criterion (based on a similarity function):
Global optimal solution: exhaustively enumerate all combinations
Heuristic methods: the k-means and k-medoids algorithms
k-means (MacQueen'67): each cluster is represented by the center of the cluster during the iterative clustering process
k-medoids or PAM (Partitioning Around Medoids) (Kaufman & Rousseeuw'87): each cluster is represented by a representative object of the cluster during the iterative clustering process

Page 26:
Centroid-Based Clustering Technique (The k-Means Clustering Method)

Given k, the k-means algorithm is implemented in 4 steps:

1. Randomly select k of the data objects as the seed points; each seed point represents a cluster.
2. (Re)assign each data object to the cluster with the nearest seed point (n objects, k clusters).
3. Compute new seed points as the centroids of the clusters of the current epoch; the centroid is the center (mean) of a cluster.
4. Go back to Step 2; stop when no new data assignment happens (t iterations).

Page 27:
The k-Means Clustering Method

Example: 2-means

[Figure: four scatter plots of points on a 10 × 10 grid showing one run of 2-means: pick initial seeds, reassign all data to the nearest seed, find the centroids, reassign the points to the new clusters, find the centroids again, and stop once the assignment is stabilized.]
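The loop below is a minimal NumPy sketch of the four k-means steps above (illustrative only; the function name k_means is my own, and the sketch assumes no cluster goes empty during the iterations):

```python
import numpy as np

def k_means(X, k, rng=np.random.default_rng(0)):
    seeds = X[rng.choice(len(X), k, replace=False)]        # step 1: random seeds
    while True:
        d = np.linalg.norm(X[:, None, :] - seeds[None, :, :], axis=2)
        labels = d.argmin(axis=1)                          # step 2: nearest seed
        new = np.array([X[labels == c].mean(axis=0) for c in range(k)])
        if np.allclose(new, seeds):                        # step 4: stop if stable
            return labels, new
        seeds = new                                        # step 3: new centroids

X = np.array([[1., 1.], [1.5, 2.], [8., 8.], [9., 8.5], [1., 0.5], [8.5, 9.]])
labels, centers = k_means(X, k=2)
print(labels)
print(centers)
```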

Page 28:
Comments on the k-Means Method

Strength:
Relatively efficient: O(nkt), where n is the # of objects, k the # of clusters, and t the # of iterations; normally k, t << n.
Often terminates at a local optimum. The global optimum may be found using techniques such as simulated annealing and genetic algorithms.

Weakness:
Applicable only when the mean is defined; it cannot handle categorical data such as colors (red, blue, green, ...)
Need to specify k, the number of clusters, in advance
Sensitive to noisy data and outliers, which may distort the distribution of the data
Not suitable for discovering clusters with non-convex shapes

Page 29:
Variations of the k-Means Method

A few variants of k-means differ in:
Selection of the k initial seed points
Dissimilarity calculations
Strategies to calculate cluster means

Handling categorical data: k-modes (Huang'98)
Uses new dissimilarity measures to deal with categorical objects
Replaces the means of clusters with modes
Uses a frequency-based method to update the modes of clusters

A mixture of categorical and numerical data: the k-prototype method

Page 30:
Representative Object-Based Technique (The k-Medoids Clustering Method)

Find representative objects, called medoids, in clusters.
Medoid: the most centrally located data object in a cluster.
k-means is sensitive to noisy data and outliers, which motivates using medoids instead.

PAM (Partitioning Around Medoids, 1987)
PAM works effectively for small data sets, but does not scale well to large data sets: O(C(n,k)) candidate medoid sets.

Improved algorithms:
CLARA (Kaufmann & Rousseeuw, 1990): uses sampling to find representatives of the data (by reducing the input data size)
CLARANS (Ng & Han, 1994): uses a different sample set to search for representatives at each run, and returns the best representative set within a user-defined number of runs

Page 31:
PAM (Partitioning Around Medoids) (1987)

PAM (Kaufman and Rousseeuw, 1987): O(k(n-k)²t)
Uses representative objects of a cluster as seed points.

1. Arbitrarily select k objects as representative objects (medoids).
2. Assign each non-medoid object to the most similar representative object (n objects, k clusters).
3. Do a greedy search, O(k(n-k)²): for each pair of a non-medoid object h and a medoid object i, calculate the total swapping cost $TC_{ih} = D_T^{Best} - D_T$ (after cluster reassignment), where the total distance is

$D_T = \sum_{j=1}^{k} \sum_{p \in C_j} |p - o_j|$

(summing the distance from each object p to the medoid $o_j$ of its cluster $C_j$). Find the best (smallest) $TC_{ih}$; if $TC_{ih}^{Best} < 0$, replace medoid i by h, otherwise stop.
4. Repeat Step 3 until there is no change (t iterations).

Page 32:
CLARA (Clustering LARge Applications) (1990)

CLARA (Kaufmann and Rousseeuw, 1990)
Built into statistical analysis packages, such as S+
It draws a sample set of the data, applies PAM on the sample, and gives the best clustering as the output.

Strength: deals with larger data sets than PAM.
Weakness:
Efficiency depends on the sample size.
If the sample is biased, a good clustering based on the sample will not necessarily represent a good clustering of the whole data set.

Page 33:
CLARANS ("Randomized" CLARA) (1994)

(Clustering Large Applications based upon RANdomized Search, Ng & Han'94)
The PAM clustering process can be presented as searching a graph where every node (a set of k medoids) is a potential solution: O(C(n,k)) nodes.
At each step, all the neighbors of the current node are examined, and the current node is replaced by the neighbor with the best improvement. (Two nodes are neighbors if their medoid sets differ by only one object; every node has k(n-k) neighbors.)
CLARA confines the search to a fixed sample of size m at each stage.
CLARANS dynamically draws a random sample of neighbors in each search step.
If a local optimum is found, CLARANS starts with a new sample of neighbors in search of a new local optimum. Once a user-specified number of local optima has been found, CLARANS outputs the best one.
It is more efficient and scalable than both PAM and CLARA.

Page 34:
Chapter 7. Cluster Analysis

What is Cluster Analysis?
Types of Data in Cluster Analysis
A Categorization of Major Clustering Methods
Partitioning Methods
Hierarchical Methods
Density-Based Methods
Grid-Based Methods
Model-Based Clustering Methods
Outlier Analysis
Summary

Page 35:
Hierarchical Clustering

Uses the distance matrix as the clustering criterion. This method does not require the number of clusters k as an input, but it may need a termination condition.

[Figure: five objects a, b, c, d, e. Agglomerative clustering (AGNES) runs Step 0 → Step 4, merging a,b and d,e, then c,d,e, until a,b,c,d,e form a single cluster; divisive clustering (DIANA) runs the same hierarchy in the reverse direction, Step 4 → Step 0.]

Page 36:
AGNES (Agglomerative Nesting)

Introduced in Kaufmann and Rousseeuw (1990)
Implemented in statistical analysis packages, e.g., S+
Uses the single-link method and the dissimilarity matrix
Merges the nodes that have the least dissimilarity
Goes on in a non-descending fashion
Eventually all nodes belong to the same cluster

[Figure: three scatter plots on a 10 × 10 grid showing AGNES merging nearby points into progressively larger clusters.]
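A toy single-link merge loop in the spirit of AGNES (illustrative only; the function name single_link_agnes is my own, and the O(n³) search is fine only for small data):

```python
def single_link_agnes(points, dist):
    """Repeatedly merge the two clusters whose closest members are nearest."""
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))  # one dendrogram level
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return merges

euclid = lambda p, q: ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5
pts = [(1, 1), (1.2, 1.1), (5, 5), (5.1, 4.9), (9, 1)]
for step in single_link_agnes(pts, euclid):
    print(step)   # each merge corresponds to one level of the dendrogram
```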

Page 37:
Dendrogram for Cluster Visualization (Shows How Clusters Merge Hierarchically)

The hierarchical clustering algorithm AGNES organizes data objects into a tree of hierarchical partitions (a tree of clusters), called a dendrogram.

A clustering of the data objects is obtained by cutting the dendrogram at the desired level; each connected component then forms a cluster.

Page 38:
Clustering Dendrogram in Gene Expression Analysis

[Figure: gene expression heat map and dendrogram; finding differentially regulated genes by clustering.]

Page 39:
DIANA (Divisive Analysis)

Introduced in Kaufmann and Rousseeuw (1990)
Implemented in statistical analysis packages, e.g., S+
Inverse order of AGNES
Eventually each node (data object) forms a cluster on its own

[Figure: three scatter plots on a 10 × 10 grid showing DIANA splitting one cluster into progressively smaller clusters.]

Page 40:
More on Hierarchical Clustering Methods

Weaknesses of agglomerative clustering methods:
They do not scale well: time complexity of at least O(n²), where n is the number of data objects
They can never undo what was done at a previous stage

Integration of hierarchical and distance-based clustering:
BIRCH (1996): uses a CF-tree and incrementally adjusts the quality of sub-clusters (good for big data analysis)
CURE (1998): selects well-scattered points from each cluster and then shrinks them towards the center of the cluster by a specified fraction
CHAMELEON (1999): hierarchical clustering using dynamic modeling

Page 41:
BIRCH (1996)

BIRCH: Balanced Iterative Reducing and Clustering using Hierarchies, by Zhang, Ramakrishnan, Livny (SIGMOD'96)
Incrementally constructs a CF-tree (Clustering Feature tree), a hierarchical data structure for multiphase clustering:
Phase 1: scan the DB to build an initial in-memory CF-tree (a hierarchical compression of the data that tries to preserve its inherent clustering structure)
Phase 2: use an arbitrary (hierarchical) clustering algorithm to cluster the data in the leaf nodes of the CF-tree
Scales linearly: finds a good clustering with a single scan and improves the quality with a few additional scans (a good feature for big data analysis)
Weakness: handles only numeric data, and is sensitive to the input order of the data records. An example follows.

Page 42:
BIRCH's Clustering Feature Vector

Clustering feature of a sub-cluster: CF = (N, LS, SS)

N: number of data vectors in the sub-cluster
$LS = \sum_{i=1}^{N} X_i$ (linear sum)
$SS = \sum_{i=1}^{N} X_i^2$ (square sum, kept per coordinate)

Example: the five points (3,4), (2,6), (4,5), (4,7), (3,8) have CF = (5, (16,30), (54,190)).
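A sketch of the CF statistics and the additivity that lets BIRCH build them incrementally (the helper names cf and cf_merge are my own):

```python
def cf(points):
    """CF = (N, LS, SS) for 2-D points, with LS and SS kept per coordinate."""
    n = len(points)
    ls = (sum(x for x, _ in points), sum(y for _, y in points))
    ss = (sum(x * x for x, _ in points), sum(y * y for _, y in points))
    return n, ls, ss

def cf_merge(cf1, cf2):
    """CFs are additive: merging two sub-clusters just adds the components."""
    (n1, ls1, ss1), (n2, ls2, ss2) = cf1, cf2
    return (n1 + n2,
            tuple(p + q for p, q in zip(ls1, ls2)),
            tuple(p + q for p, q in zip(ss1, ss2)))

pts = [(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)]
print(cf(pts))                              # (5, (16, 30), (54, 190)) as on the slide
print(cf_merge(cf(pts[:2]), cf(pts[2:])))   # same CF, built incrementally
```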

Page 43:
BIRCH's Parameters & Their Meanings

A clustering feature holds the summary statistics of the data in a sub-cluster:
The zeroth moment: the number N of data vectors in the sub-cluster
The first moment: the linear sum of the N data vectors
The second moment: the square sum of the N data vectors

BIRCH uses two parameters to control node splits while constructing a CF-tree:
Branching factor: constrains the maximum number of child nodes per non-leaf node (a leaf node is required to fit in a memory page of size P, which determines the branching factor of a leaf node)
Threshold: constrains the maximum average distance between the data pairs of a sub-cluster in the leaf nodes

Page 44:
CF Tree

[Figure: a CF-tree with branching factor = 7 and threshold = 6. The root and the non-leaf nodes hold entries CF1, CF2, ..., each with a pointer to a child node; the leaf nodes hold CF entries and are chained by prev/next pointers.]

An object is inserted into the closest leaf entry, and the node is split as needed. After an insertion, information about it is passed toward the root of the tree. The construction is similar to that of a B+-tree.

Page 45:
CURE (Clustering Using REpresentatives)

CURE: proposed by Guha, Rastogi & Shim, 1998; a type of hierarchical clustering.
Most clustering algorithms either favor clusters with spherical shapes and similar sizes, or are fragile in the presence of outliers; CURE overcomes these problems.
Uses multiple representative points to represent a cluster. At each step of clustering, the two clusters with the closest pair of representative points are merged.
Adjusts well to arbitrarily shaped clusters: the representative points of a cluster attempt to capture the data distribution and shape of the cluster.
Stops growing the cluster hierarchy when only k clusters remain.

Page 46:
The Steps for Large Data Sets in CURE

[Figure: the steps involved in clustering a large data set using CURE.]

Page 47:
CURE: The Algorithm (for large data sets) (1/2)

1. Phase 1: Pre-clustering
a. Draw a random sample S of the original data (if the original data set is very large)
b. Partition sample S into p partitions (partitioning for speedup)
c. Partially cluster each partition into |S|/(p × q) clusters, for some q > 1 (by CURE hierarchical clustering, which starts with each input point as a separate cluster)
d. Outlier elimination: most outliers are already eliminated by the random sampling in step a; additionally, if a cluster grows too slowly, eliminate it.

Page 48:
CURE: The Algorithm (2/2)

2. Phase 2: find the final clusters (by CURE)
a. Cluster the partial clusters (# of partial clusters ≤ |S|/q):
Find well-scattered representative points for the new cluster at each iteration; the representative points of a cluster are used to compute its distance from the other clusters.
The distance between two clusters is the distance between their closest pair of representative points; merge the closest pair of clusters.
The representative points falling in each newly formed cluster are "shrunk", i.e., moved toward the cluster center, by a user-specified fraction α.
The representative points capture the shape of the cluster.
b. Mark the data with the corresponding cluster labels.

Page 49:
Data Partitioning and Clustering

Sample size: s = 50; # of partitions: p = 2; partition size: s/p = 25; # of clusters per partition: s/(pq) = 5

[Figure: scatter plots showing the sample split into two partitions and each partition pre-clustered into five partial clusters.]

Page 50:
Representative Points Shrinking in CURE

Shrink the multiple representative points towards the gravity center by a fraction α.
Multiple representatives capture the shape of the cluster.

[Figure: the representative points of a merged cluster being shrunk from the old cluster centers toward the new cluster center.]

Page 51:
CHAMELEON

CHAMELEON: a type of hierarchical clustering, by G. Karypis, E.H. Han and V. Kumar'99
Measures the similarity based on a dynamic model:
Two clusters are merged only if the interconnectivity and closeness (proximity) between the two clusters are high relative to the internal interconnectivity of the clusters and the closeness of items within the clusters.

A two-phase algorithm:
1. Use a graph partitioning algorithm to cluster the objects into a large number of relatively small sub-clusters.
2. Use an agglomerative hierarchical clustering algorithm to find the genuine clusters by repeatedly combining these sub-clusters.

Page 52:
Overall Framework of CHAMELEON

Data set → construct a sparse graph (k-nearest-neighbor graph) → partition the graph by cutting long edges → merge the partitions according to relative interconnectivity and relative closeness → final clusters

Page 53:
Chapter 7. Cluster Analysis

What is Cluster Analysis?
Types of Data in Cluster Analysis
A Categorization of Major Clustering Methods
Partitioning Methods
Hierarchical Methods
Density-Based Methods
Grid-Based Methods
Model-Based Clustering Methods
Outlier Analysis
Summary

Page 54:
Density-Based Clustering Methods

Clustering based on density (a local cluster criterion, such as density-connected points)

Major features:
Discover clusters of arbitrary shape
Handle noise
One scan
Need density parameters as a termination condition

Several interesting studies:
DBSCAN: Ester, et al. (KDD'96)
OPTICS: Ankerst, et al. (SIGMOD'99)
DENCLUE: Hinneburg & Keim (KDD'98)

Page 55:
Density-Based Clustering: Background (I)

Two density parameters:
ε: maximum neighborhood radius of a core point
MinPts: minimum number of points in an ε-neighborhood of a core point

N_ε(p) = {q ∈ D | dist(p, q) ≤ ε}, where D is the data space and N_ε(p) is the set of neighbors of point p

Directly density-reachable: a point p is directly density-reachable from a point q wrt. ε, MinPts if
1) p belongs to N_ε(q), and
2) |N_ε(q)| ≥ MinPts (q is a core point)

[Figure: p inside the ε-neighborhood of core point q, with MinPts = 5 and ε = 1 cm.]

Page 56:
Density-Based Clustering: Background (II)

Density-reachable: a point p is density-reachable from a point q wrt. ε, MinPts if there is a chain of points p_1, ..., p_n with p_1 = q and p_n = p such that p_{i+1} is directly density-reachable from p_i.

Density-connected: a point p is density-connected to a point q wrt. ε, MinPts if there is a point o such that both p and q are density-reachable from o wrt. ε and MinPts.

[Figure: a chain q → p_1 → p illustrating density-reachability, and points p, q both density-reachable from o illustrating density-connectivity.]

Page 57:
DBSCAN: Density-Based Spatial Clustering of Applications with Noise

Relies on a density-based notion of cluster: a cluster is defined as a maximal set of density-connected points.

If the ε-neighborhood of an object contains at least MinPts objects, the object is called a core object.

[Figure: core points, border points, and an outlier, for ε = 1 cm and MinPts = 5.]

Page 58:
DBSCAN: The Algorithm

Arbitrarily select a point p.
Retrieve all points density-reachable from p wrt. ε and MinPts.
If p is a core point, a cluster is formed.
If p is a border point, no points are density-reachable from p, and DBSCAN visits the next point of the database.
Continue the process until all of the points have been processed.

Page 59:
Chapter 7. Cluster Analysis

What is Cluster Analysis?
Types of Data in Cluster Analysis
A Categorization of Major Clustering Methods
Partitioning Methods
Hierarchical Methods
Density-Based Methods
Grid-Based Methods
Model-Based Clustering Methods
Outlier Analysis
Summary

Page 60:
Grid-Based Clustering Method

Uses a multi-resolution grid data structure

Several interesting methods:
STING (a STatistical INformation Grid approach) by Wang, Yang and Muntz (VLDB'97)
WaveCluster by Sheikholeslami, Chatterjee, and Zhang (VLDB'98): a multi-resolution clustering approach using wavelets
CLIQUE: Agrawal, et al. (SIGMOD'98)

Page 61:
STING: A Statistical Information Grid Approach (1)

Wang, Yang and Muntz (VLDB'97)
The spatial area is divided into rectangular cells.
There are several levels of cells corresponding to different levels of resolution (1 → 4 → 16 → 64 → ...).

Page 62:
STING: A Statistical Information Grid Approach (2)

Each cell at a high level is partitioned into a number of smaller cells at the next lower level.
Statistical info of each cell is calculated and stored beforehand and is used to answer queries.
Parameters of higher-level cells can be easily calculated from the parameters of lower-level cells: count, mean, s, min, max; type of distribution (normal, uniform, etc.).

Use a top-down approach to answer spatial data queries:
Start from a pre-selected layer, typically one with a small number of cells.
For each cell in the current level, compute the confidence interval.

Page 63:
STING: A Statistical Information Grid Approach (3)

Remove the irrelevant cells from further consideration.
When finished examining the current layer, proceed to the next lower level.
Repeat this process until the bottom layer is reached.

Advantages:
Query-independent, easy to parallelize, incremental update
O(K), where K is the number of grid cells at the lowest level

Disadvantages:
All cluster boundaries are either horizontal or vertical; no diagonal boundary is detected.

Page 64:
CLIQUE (Clustering In QUEst) (Clustering for High-Dimensional Space)

Agrawal, Gehrke, Gunopulos, Raghavan (SIGMOD'98)
Automatically identifies subspaces of a high-dimensional data space that allow better clustering than the original space.
CLIQUE can be considered both density-based and grid-based:
It partitions each dimension into the same number of equal-length units.
A unit is dense if the fraction of the total data points contained in the unit exceeds an input model parameter.
A cluster is a maximal set of connected dense units within a subspace of higher dimensionality.

Page 65:
CLIQUE: The Major Steps

Identify dense units of higher dimensionality:
Partition the data space into units and find the number of points that lie inside each unit.
Determine the dense units in all 1-D dimensions of interest.
Identify dense units of higher dimensionality using the Apriori principle (if a k-D unit is dense, then its (k-1)-D projection units must be dense).

Generate a minimal description for each cluster:
Determine the maximal region that covers a set of connected dense units of the cluster.
Determine a minimal cover for each cluster.

Page 66:
[Figure: CLIQUE example. Dense units are found in the (age, salary (×10,000)) and (age, vacation (weeks)) subspaces over ages 20-60, with density threshold = 3; the shaded maximal regions (e.g., ages 30-50) intersect to identify a cluster in the (age, salary, vacation) space.]

Page 67:
Strength and Weakness of CLIQUE

Strength:
It automatically finds subspaces of the highest dimensionality such that high-density clusters exist in those subspaces.
It is insensitive to the order of the input records and does not presume some canonical data distribution.
It scales linearly with the size of the input and has good scalability as the number of dimensions in the data increases.

Weakness:
The accuracy of the clustering result may be degraded at the expense of the simplicity of the method.

Page 68:
Chapter 7. Cluster Analysis

What is Cluster Analysis?
Types of Data in Cluster Analysis
A Categorization of Major Clustering Methods
Partitioning Methods
Hierarchical Methods
Density-Based Methods
Grid-Based Methods
Model-Based Clustering Methods
Outlier Analysis
Summary

Page 69:
Model-Based Clustering Methods

Attempt to optimize the fit between the data and a given cluster model.

Conceptual clustering (a statistical/AI approach):
Clustering is performed first, followed by characterization.
Produces a classification scheme for a set of unlabeled objects.
Finds a conceptual characteristic description for each class/cluster.

COBWEB (Fisher'87):
A popular and simple method of incremental conceptual clustering.
Creates a hierarchical clustering in the form of a classification tree.
Each node denotes a concept and contains a probabilistic description of that concept.

Page 70:
COBWEB Clustering Method

A classification tree:

animal: P(C0) = 1.0, P(scales|C0) = 0.25, ...
├─ fish: P(C1) = 0.25, P(scales|C1) = 1.0, ...
├─ amphibian: P(C2) = 0.25, P(moist|C2) = 1.0, ...
└─ mammal/bird: P(C3) = 0.5, P(hair|C3) = 0.5, ...
   ├─ mammal: P(C4) = 0.25, P(hair|C4) = 1.0, ...
   └─ bird: P(C5) = 0.25, P(feathers|C5) = 1.0, ...

Page 71:
Other Model-Based Clustering Methods

Neural network approaches:
Represent each cluster as an exemplar, acting as a "prototype" of the cluster.
In the clustering process, a data object is assigned to the cluster whose exemplar is the most similar to it according to some distance measure.

They are based on competitive learning:
Involving an organized architecture of neuron units
The neuron units compete in a "winner-takes-all" fashion for the current input training data object

Page 72:
Self-Organizing Feature Maps (SOMs) (A Neural Network Approach)

Clustering is performed by having output neurons compete for the current input data:
The output neuron whose connection weight vector is closest to the current input training object wins.
The winner and its neighbors learn by having their connection weights adjusted.

SOMs are believed to resemble processing that can occur in the human brain.
They are also useful for visualizing high-dimensional data mapped into 2- or 3-D space. An example follows.

Page 73:
An Example of SOM Architecture

An example SOM network:
• 2 input neurons
• Format of input X: <x1, x2>
• 16 output neurons (clusters) in a 4 × 4 architecture
• Connection weight matrix: 16 × 2
• Weight vector of neuron i: <wi1, wi2>

[Figure: the input neurons fully connected to the 4 × 4 grid of output neurons, and an example of an SOM result representation.]

Page 74:

Examples of SOM Feature Maps

Page 75:
Mathematical Model of an SOM Example (A 3-Input, 25-Output SOM)

3 input neurons x1, x2, x3; 25 output neurons a1, ..., a25 arranged as a 5 × 5 competitive layer (neurons numbered 1-25, row by row).

Input, connection weights, and competitive layer:

$a_{25 \times 1} = \operatorname{compet}(W_{25 \times 3}\, X_{3 \times 1})$

The output activations a form the feature map.

Page 76:
Weights Update & Neighborhood in SOM

Update the weight vectors in a neighborhood of the winning neuron i*:

$W_i(p+1) = W_i(p) + \alpha\,(X(p) - W_i(p)), \quad i \in N_{i^*}(d)$

where X(p) is the training vector at iteration p, and $N_{i^*}(d) = \{j \mid d_{j,i^*} \le d\}$ is the neighborhood of the winning neuron i*.

Neighborhood example on the 5 × 5 map, where 13 is the winning neuron:

$N_{13}(1) = \{8, 12, 13, 14, 18\}$
$N_{13}(2) = \{3, 7, 8, 9, 11, 12, 13, 14, 15, 17, 18, 19, 23\}$

Page 77:
SOM Learning Algorithm

• Step 1: Initialisation
• Step 2: Find the winning neuron
• Step 3: Learning
• Step 4: Iteration

Page 78:
SOM Learning Algorithm

Step 1: Initialisation. Set the initial connection weights to small random values, say in an interval [0, 1], and assign a small positive value to the learning rate parameter α.

Page 79:
SOM Learning Algorithm

Step 2: Find the winning neuron (iteration p) (activation and similarity matching).
Activate the Kohonen network by applying an input vector X.
Find the winning (best-matching) neuron i*, using the criterion of minimal Euclidean distance:

$i^*(p) = \arg\min_i \|X - W_i(p)\| = \arg\min_i \left[\sum_{j=1}^{n} \big(x_j - w_{ij}(p)\big)^2\right]^{1/2}, \quad i = 1, \ldots, m$

where $X = \langle x_1, x_2, \ldots, x_n \rangle$, m is the # of neurons in the Kohonen layer, and n is the # of input neurons.

Page 80:
SOM Learning Algorithm

Step 3: Learning.
Update the connection weights between input neuron j and output neuron i:

$w_{ij}(p+1) = w_{ij}(p) + \Delta w_{ij}(p), \quad 1 \le i \le m,\ 1 \le j \le n$

where $\Delta w_{ij}(p)$ is the weight correction at iteration p. The weight correction is determined by the competitive learning rule:

$\Delta w_{ij}(p) = \begin{cases} \alpha\,\big[x_j - w_{ij}(p)\big], & i \in N_{i^*}(d, p) \\ 0, & i \notin N_{i^*}(d, p) \end{cases}$

where α is the learning rate parameter, and $N_{i^*}(d, p)$ is the neighbourhood function centred around the winning neuron i* at distance d at iteration p.

Page 81:
SOM Learning Algorithm

Step 4: Iteration.
Increase iteration p by one, go back to Step 2, and continue until the criterion of minimal Euclidean distance is satisfied, or until no noticeable changes occur in the feature map.
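Putting Steps 1-4 together, a minimal NumPy sketch for a 1-D Kohonen layer (illustrative only; the function name som_step, the fixed neighborhood radius d, and the fixed learning rate alpha are my own simplifications):

```python
import numpy as np

def som_step(W, x, alpha=0.1, d=1):
    """One SOM iteration: find the winner i*, then pull the weights of
    every neuron within distance d of i* toward the input x."""
    winner = np.argmin(np.linalg.norm(W - x, axis=1))   # step 2: best match
    for i in range(len(W)):
        if abs(i - winner) <= d:                        # i in the neighborhood of i*
            W[i] += alpha * (x - W[i])                  # step 3: learning rule
    return W

rng = np.random.default_rng(1)
W = rng.random((10, 2))          # step 1: 10 output neurons, 2 inputs, random weights
for _ in range(500):             # step 4: iterate over training vectors
    W = som_step(W, rng.random(2))
print(W.round(2))                # weights spread out over the input space
```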

Page 82:
Learning in Kohonen Network

The overall effect of the SOM learning rule is to move the connection weight vector $W_i$ of the trained output neuron i towards the input pattern X:

$\Delta W_i = \alpha\,(X - W_i), \qquad W_i(p+1) = W_i(p) + \Delta W_i(p) = W_i(p) + \alpha\,\big(X(p) - W_i(p)\big)$

[Figure: vector diagram of the weight vector $W_i$ moving toward the input pattern X by $\Delta W_i$.]

Page 83:
Convergence Process of SOM

[Figure: four snapshots of the feature map unfolding over the input space [-1, 1] × [-1, 1] after 100, 1000, 2000, and 5000 iterations.]

Page 84:
What Is Outlier Discovery?

What are outliers?
A set of objects that are considerably dissimilar from the remainder of the data
Example: in sports, Michael Jordan, ...

Problem: find the top n outlier points

Applications:
Credit card fraud detection
Telecom fraud detection
Customer segmentation
Medical analysis

Page 85:
Outlier Discovery: Statistical Approaches

Assume a model of the underlying distribution that generates the data set (e.g., a normal distribution)
Use discordancy tests, which depend on:
the data distribution
the distribution parameters (e.g., mean, variance)
the number of expected outliers

Drawbacks:
most tests are for a single attribute
in many cases, the data distribution may not be known

Page 86:
Outlier Discovery: Distance-Based Approach

Introduced to counter the main limitations imposed by statistical methods: we need multi-dimensional analysis without knowing the data distribution.

Distance-based outlier: a DB(p, D)-outlier is an object O in a dataset T such that at least a fraction p of the objects in T lie at a distance greater than D from O.

Algorithms for mining distance-based outliers (a brute-force sketch follows):
Index-based algorithm
Nested-loop algorithm
Cell-based algorithm
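A brute-force, nested-loop-style sketch of the DB(p, D) test (the helper name db_outliers is my own):

```python
def db_outliers(data, p, D, dist):
    """O is a DB(p, D)-outlier if at least a fraction p of the other
    objects lie at distance greater than D from O."""
    out = []
    for i, o in enumerate(data):
        far = sum(1 for j, x in enumerate(data) if j != i and dist(o, x) > D)
        if far / (len(data) - 1) >= p:
            out.append(o)
    return out

euclid = lambda a, b: ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5
pts = [(1, 1), (1.2, 0.9), (0.8, 1.1), (1.1, 1.0), (8, 8)]
print(db_outliers(pts, p=0.9, D=3.0, dist=euclid))   # -> [(8, 8)]
```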

Page 87:
Outlier Discovery: Deviation-Based Approach

Identifies outliers by examining the main characteristics of the objects in a group; objects that "deviate" from this description are considered outliers.

Sequential exception technique: simulates the way in which humans distinguish unusual objects from a series of similar objects.

OLAP data cube technique: uses data cubes to identify regions of anomalies in a large multi-dimensional data set.

Page 88:
Summary

Cluster analysis groups data objects based on their similarity and has wide applications.
Measures of similarity can be computed for various types of data.
Clustering algorithms can be categorized into partitioning methods, hierarchical methods, density-based methods, grid-based methods, and model-based methods.
Outlier detection and analysis are very useful for fraud detection and can be performed by statistical, distance-based, or deviation-based approaches.
There are still many open research issues in cluster analysis, such as constraint-based clustering.