Data Mining: Concepts and Techniques
Cluster Analysis (Clustering)
© Jiawei Han and Micheline Kamber, Intelligent Database Systems Research Lab, School of Computing Science, Simon Fraser University, Canada
http://www.cs.sfu.ca

DESCRIPTION

What is cluster analysis? A cluster is a collection of data objects that are similar to one another within the same cluster and dissimilar to the objects in other clusters. Cluster analysis (clustering) groups a set of data objects into different clusters. Clustering is unsupervised classification: there are no predefined classes. Typical purposes: as a stand-alone tool to get insight into the data distribution, or as a preprocessing step for other algorithms.

TRANSCRIPT

Page 1:
Cluster Analysis (Clustering)

© Jiawei Han and Micheline Kamber
Intelligent Database Systems Research Lab
School of Computing Science, Simon Fraser University, Canada
http://www.cs.sfu.ca

Page 2:
Chapter 7. Cluster Analysis

Outline:
What is Cluster Analysis?
Types of Data in Cluster Analysis
Major Clustering Methods: Partitioning Methods, Hierarchical Methods, Density-Based Methods, Grid-Based Methods, Model-Based Clustering Methods
Summary

Page 3:
General Applications of Clustering (Exploratory Analysis)

Business/economic science (especially marketing research)
Biomedical informatics
Pattern recognition
Spatial data analysis: detect spatial clusters and explain them in spatial data mining
WWW: Web document profiling; clustering log data to discover groups of similar access patterns

Domain knowledge is very important to the applications of clustering!

Page 4:

Specific Examples of Clustering Applications

Marketing: Help marketers discover group profiles of customers, and then use this knowledge to develop targeted marketing programs

Insurance: Identify group profiles of motor insurance policy holders with a high average claim cost

Biology: Categorize genes with similar functionality, derive their taxonomies, and gain insight into the structures inherent in populations

Page 5:
What Is Good Clustering?

A good clustering method will produce high-quality clusters with high intra-class similarity and low inter-class similarity.

The quality of a clustering result depends on both the similarity measure and the clustering method.

The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns.

Page 6:
Issues of Clustering in Data Mining

Scalability (important for big data analysis)
Ability to deal with different types of attributes
Discovery of clusters with arbitrary shape
Minimal requirements for domain knowledge to determine input parameters
Ability to deal with noise and outliers
Insensitivity to the order of input records
Ability to deal with high dimensionality
Incorporation of user-specified constraints
Interpretability and usability

Page 7:
Chapter 7. Cluster Analysis

What is Cluster Analysis?
Types of Data in Cluster Analysis
A Categorization of Major Clustering Methods
Partitioning Methods
Hierarchical Methods
Density-Based Methods
Grid-Based Methods
Model-Based Clustering Methods
Outlier Analysis
Summary

Page 8:
Data Structures in Clustering

Two data structures:

Data matrix (n objects × p variables):

$$\begin{bmatrix} x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\ \vdots & & \vdots & & \vdots \\ x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\ \vdots & & \vdots & & \vdots \\ x_{n1} & \cdots & x_{nf} & \cdots & x_{np} \end{bmatrix}$$

Dissimilarity matrix (n objects × n objects):

$$\begin{bmatrix} 0 & & & & \\ d(2,1) & 0 & & & \\ d(3,1) & d(3,2) & 0 & & \\ \vdots & \vdots & \vdots & \ddots & \\ d(n,1) & d(n,2) & \cdots & \cdots & 0 \end{bmatrix}$$

Page 9:
Similarity Measurement between Samples

Dissimilarity/similarity metric: similarity is expressed in terms of a distance function d(i, j), which is typically a metric.

The definitions of distance functions are usually very different for interval-scaled (numeric), boolean, categorical, ordinal, and ratio variables.

It is hard to define "similar enough" or "good enough"; the answer is typically highly subjective.

Sometimes, weights should be associated with different variables based on the application and the data semantics.

Page 10:
Types of Data in Cluster Analysis

Interval-scaled variables: data values have order and equal intervals, measured on a linear scale; the intervals keep the same importance throughout the scale
Binary variables
Nominal, ordinal, and ratio variables
Variables of mixed types

Page 11:
Data Normalization (Interval-Valued Variables)

Data normalization (for each feature f):

1. Calculate the mean: $m_f = \frac{1}{n}(x_{1f} + x_{2f} + \cdots + x_{nf})$

2. Calculate the mean absolute deviation: $s_f = \frac{1}{n}(|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|)$

3. Calculate the standardized measurement (z-score, i.e., zero-mean normalization): $z_{if} = \frac{x_{if} - m_f}{s_f}$

Using the mean absolute deviation $s_f$ instead of the standard deviation $s'_f = \sqrt{\frac{1}{n}\left[(x_{1f} - m_f)^2 + (x_{2f} - m_f)^2 + \cdots + (x_{nf} - m_f)^2\right]}$ is more robust to outliers in cluster analysis.
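A minimal sketch of the three steps above (the function name standardize is my own; it assumes $s_f \ne 0$):

```python
def standardize(values):
    """Standardize one feature f: z_if = (x_if - m_f) / s_f,
    where s_f is the mean absolute deviation (not the standard deviation)."""
    n = len(values)
    m_f = sum(values) / n                           # step 1: mean
    s_f = sum(abs(x - m_f) for x in values) / n     # step 2: mean absolute deviation
    return [(x - m_f) / s_f for x in values]        # step 3: z-scores

print(standardize([2.0, 4.0, 4.0, 6.0]))  # symmetric data -> [-2.0, 0.0, 0.0, 2.0]
```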

Page 12:
Similarity and Dissimilarity Between Data Objects

Distances are normally used to measure the similarity or dissimilarity between two data objects xi and xj (if data objects are viewed as points in the data space).

Some popular ones include the Minkowski distance:

$d(x_i, x_j) = \left(|x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \cdots + |x_{ip} - x_{jp}|^q\right)^{1/q}$

where $x_i = \langle x_{i1}, x_{i2}, \ldots, x_{ip}\rangle$ and $x_j = \langle x_{j1}, x_{j2}, \ldots, x_{jp}\rangle$ are two p-dimensional data points, and q is a positive integer.

If q = 1, d is the Manhattan distance:

$d(x_i, x_j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|$
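A small illustrative sketch of the Minkowski family (not part of the slides):

```python
def minkowski(xi, xj, q=2):
    """d(xi, xj) = (sum_f |xi_f - xj_f|^q)^(1/q); q=1 Manhattan, q=2 Euclidean."""
    return sum(abs(a - b) ** q for a, b in zip(xi, xj)) ** (1.0 / q)

p1, p2 = (1.0, 2.0), (4.0, 6.0)
print(minkowski(p1, p2, q=1))  # Manhattan: 3 + 4 = 7.0
print(minkowski(p1, p2, q=2))  # Euclidean: sqrt(9 + 16) = 5.0
```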

Page 13:
Similarity and Dissimilarity Between Objects (Cont.)

If q = 2, d is the Euclidean distance:

$d(x_i, x_j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2}$

Properties:
d(i, j) ≥ 0
d(i, i) = 0
d(i, j) = d(j, i)
d(i, j) ≤ d(i, k) + d(k, j)

One can also use a weighted distance, the parametric Pearson product-moment correlation (or simply correlation r), or other dissimilarity measures.

Page 14:
Distance Functions for Binary Variables

A contingency table for binary data:

              Object j
               1      0     sum
Object i  1    a      b     a+b
          0    c      d     c+d
        sum   a+c    b+d     p

a: the # of variables that equal 1 for both objects i and j
d: the # of variables that equal 0 for both objects i and j
b: the # of variables that equal 1 for object i but 0 for object j
c: the # of variables that equal 0 for object i but 1 for object j

Example:
i: 011001100110
j: 110011001111
a: 4, b: 2, c: 4, d: 2

Page 15:
Distance Functions for Binary Variables

Given the contingency table above:

Simple matching coefficient (if the binary variable is symmetric):

$d(i, j) = \frac{b + c}{a + b + c + d}$

Jaccard coefficient (if the binary variable is asymmetric):

$d(i, j) = \frac{b + c}{a + b + c}$

with the corresponding similarity

$sim(i, j) = 1 - d(i, j) = \frac{a}{a + b + c}$
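An illustrative sketch of both coefficients (the helper name binary_dissimilarity is my own):

```python
def binary_dissimilarity(i, j, symmetric=True):
    """Simple matching (symmetric) or Jaccard (asymmetric) dissimilarity
    for two equal-length 0/1 vectors."""
    a = sum(1 for x, y in zip(i, j) if x == 1 and y == 1)
    b = sum(1 for x, y in zip(i, j) if x == 1 and y == 0)
    c = sum(1 for x, y in zip(i, j) if x == 0 and y == 1)
    d = sum(1 for x, y in zip(i, j) if x == 0 and y == 0)
    if symmetric:                       # simple matching coefficient
        return (b + c) / (a + b + c + d)
    return (b + c) / (a + b + c)        # Jaccard: 0-0 matches are ignored

i = [0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0]
j = [1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1]   # the slide's example: a=4, b=2, c=4, d=2
print(binary_dissimilarity(i, j))           # (2+4)/12 = 0.5
print(binary_dissimilarity(i, j, False))    # (2+4)/10 = 0.6
```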

Page 16:
Dissimilarity between Binary Variables

Example:

Name  Gender  Fever  Cough  Test-1  Test-2  Test-3  Test-4
Jack    M       Y      N      P       N       N       N
Mary    F       Y      N      P       N       P       N
Jim     M       Y      P      N       N       N       N

Gender is a symmetric attribute, which is ignored here.
The remaining attributes are asymmetric binary: let the values Y and P be set to 1, and the value N be set to 0.

Page 17:
Dissimilarity between Binary Variables (Asymmetric Example)

Preprocessing:
1. Ignore Gender
2. Y, P → 1
3. N → 0

Name  Fever  Cough  Test-1  Test-2  Test-3  Test-4
Jack    1      0      1       0       0       0
Mary    1      0      1       0       1       0
Jim     1      1      0       0       0       0

Using the Jaccard coefficient:

$d(\text{jack}, \text{mary}) = \frac{0 + 1}{2 + 0 + 1} = 0.33$

$d(\text{jack}, \text{jim}) = \frac{1 + 1}{1 + 1 + 1} = 0.67$

$d(\text{jim}, \text{mary}) = \frac{1 + 2}{1 + 1 + 2} = 0.75$

Conclusion: the clinical conditions of Jack and Mary are the most similar.

Page 18:
Nominal Variables (Categorical Variables)

A generalization of the binary variable in that it can take more than 2 states, e.g., color: red, yellow, blue, ...

Dissimilarity between nominal variables:

Method 1: simple matching

$d(i, j) = \frac{p - m}{p}$

where m is the # of matches and p is the total # of variables.

Method 2: use a large number of asymmetric binary variables, creating a new binary variable for each of the M nominal states of a particular variable, e.g., <0, 0, 0, 0, 0, 0, 1> for purple; purple is then totally different from blue.

Method 3: use a small number of additive binary variables, e.g., <1, 0, 1> for purple (in a 3-bit RGB representation); purple is then partially (50%) different from blue.

Page 19:
Dissimilarity between Ordinal Variables

An ordinal variable can be discrete or continuous. Its values are ordered in a meaningful sequence, e.g., job ranks.

It can be treated like an interval-scaled variable (for each feature f):

1. Define ranks for the feature values: the $M_f$ states map to ranks $r_{if} \in \{1, \ldots, M_f\}$
2. Replace $x_{if}$ (the value of the f-th variable of the i-th object) by its rank $r_{if}$
3. Map the range of feature f onto [0, 1] by replacing the rank by

$z_{if} = \frac{r_{if} - 1}{M_f - 1}$

Then compute the dissimilarity using methods for interval-scaled variables ($x_{if} \to r_{if} \to z_{if}$, with $z_{if} \in [0, 1]$).

Page 20:
Dissimilarity between Ratio-Scaled Variables

Ratio-scaled variable: a positive measurement on a nonlinear scale, approximately at exponential scale, such as $Ae^{Bt}$ or $Ae^{-Bt}$

Alternative methods:
1. treat them like interval-scaled variables: not a good choice, since the scale may be distorted
2. apply a logarithmic transformation $y_{if} = \log(x_{if})$ and treat the transformed values as interval-scaled
3. treat them as continuous ordinal data and treat their ranks as interval-scaled values

Page 21:
Variables of Mixed Types

The data set may contain all six types of variables: symmetric binary, asymmetric binary, nominal, ordinal, interval, and ratio-scaled.

One may use a weighted formula to combine their effects (distance between objects i and j):

$d(i, j) = \frac{\sum_{f=1}^{p} w_f\, d_f(i, j)}{\sum_{f=1}^{p} w_f}$

If feature f is binary or nominal: $d_f(i, j) = 0$ if $x_{if} = x_{jf}$, or $d_f(i, j) = 1$ otherwise
If feature f is interval-based: use the normalized distance
If feature f is ordinal or ratio-scaled: compute ranks $r_{if}$ and $z_{if} = \frac{r_{if} - 1}{M_f - 1}$, and treat $z_{if}$ as interval-scaled ($x_{if} \to r_{if} \to z_{if}$)
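A hedged sketch of the weighted formula for mixed types; the feature kind labels and the helper name mixed_dissimilarity are my own, and ordinal/ratio features are assumed already mapped to [0, 1] ranks:

```python
def mixed_dissimilarity(x, y, kinds, ranges, weights=None):
    """d(i,j) = sum_f w_f * d_f(i,j) / sum_f w_f over mixed feature types."""
    weights = weights or [1.0] * len(kinds)
    num = den = 0.0
    for xf, yf, kind, rng, w in zip(x, y, kinds, ranges, weights):
        if kind in ("binary", "nominal"):
            df = 0.0 if xf == yf else 1.0       # exact-match distance
        else:                                   # interval (or rank-mapped ordinal)
            df = abs(xf - yf) / rng             # normalized distance
        num += w * df
        den += w
    return num / den

# Two objects: a nominal color, a binary flag, and an interval age with range 50.
print(mixed_dissimilarity(("red", 1, 30), ("blue", 1, 40),
                          kinds=("nominal", "binary", "interval"),
                          ranges=(None, None, 50)))   # (1 + 0 + 0.2) / 3 = 0.4
```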

Page 22:
Chapter 7. Cluster Analysis

What is Cluster Analysis?
Types of Data in Cluster Analysis
A Categorization of Major Clustering Methods
Partitioning Methods
Hierarchical Methods
Density-Based Methods
Grid-Based Methods
Model-Based Clustering Methods
Summary

Page 23:
Major Clustering Approaches

Partitioning algorithms: construct a number of partitions, and then improve them iteratively by some optimization criterion

Hierarchy algorithms: create a hierarchical (clustering) structure for the data set using some optimization criterion

Density-based: based on connectivity and/or density functions

Grid-based: quantize the data space into a finite number of grid cells on which cluster analysis is performed

Model-based: a cluster model is hypothesized for each cluster, and the models that best fit the data are found

Page 24:
Chapter 7. Cluster Analysis

What is Cluster Analysis?
Types of Data in Cluster Analysis
A Categorization of Major Clustering Methods
Partitioning Methods
Hierarchical Methods
Density-Based Methods
Grid-Based Methods
Model-Based Clustering Methods
Summary

Page 25:
Partitioning Algorithms: Basic Concept (Cluster Representation)

Partitioning method: given n data objects, a partitioning algorithm organizes the objects into k partitions, each of which represents a cluster.

Given k, find the k clusters that optimize the chosen partitioning criterion (based on a similarity function):
Global optimal solution: exhaustively enumerate all combinations
Heuristic methods: the k-means and k-medoids algorithms
k-means (MacQueen'67): each cluster is represented by the center of the cluster during the iterative clustering process
k-medoids or PAM (Partitioning Around Medoids) (Kaufman & Rousseeuw'87): each cluster is represented by a representative object of the cluster during the iterative clustering process

Page 26:
Centroid-Based Clustering Technique (The k-Means Clustering Method)

Given k, the k-means algorithm is implemented in 4 steps:

1. Randomly select k of the data objects as the seed points; each seed point represents a cluster.
2. (Re)assign each data object to the cluster with the nearest seed point (n objects, k clusters).
3. Compute new seed points as the centroids of the clusters of the current epoch; the centroid is the center (mean) of a cluster.
4. Go back to Step 2; stop when no new data assignment happens (t iterations).

Page 27:
The k-Means Clustering Method

Example: 2-means

[Figure: four scatter plots of points on a 10 × 10 grid showing one run of 2-means: pick initial seeds, reassign all data to the nearest seed, find the centroids, reassign the points to the new clusters, find the centroids again, and stop once the assignment is stabilized.]
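The loop below is a minimal NumPy sketch of the four k-means steps above (illustrative only; the function name k_means is my own, and the sketch assumes no cluster goes empty during the iterations):

```python
import numpy as np

def k_means(X, k, rng=np.random.default_rng(0)):
    seeds = X[rng.choice(len(X), k, replace=False)]        # step 1: random seeds
    while True:
        d = np.linalg.norm(X[:, None, :] - seeds[None, :, :], axis=2)
        labels = d.argmin(axis=1)                          # step 2: nearest seed
        new = np.array([X[labels == c].mean(axis=0) for c in range(k)])
        if np.allclose(new, seeds):                        # step 4: stop if stable
            return labels, new
        seeds = new                                        # step 3: new centroids

X = np.array([[1., 1.], [1.5, 2.], [8., 8.], [9., 8.5], [1., 0.5], [8.5, 9.]])
labels, centers = k_means(X, k=2)
print(labels)
print(centers)
```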

Page 28:
Comments on the k-Means Method

Strength:
Relatively efficient: O(nkt), where n is the # of objects, k the # of clusters, and t the # of iterations; normally k, t << n.
Often terminates at a local optimum. The global optimum may be found using techniques such as simulated annealing and genetic algorithms.

Weakness:
Applicable only when the mean is defined; it cannot handle categorical data such as colors (red, blue, green, ...)
Need to specify k, the number of clusters, in advance
Sensitive to noisy data and outliers, which may distort the distribution of the data
Not suitable for discovering clusters with non-convex shapes

Page 29:
Variations of the k-Means Method

A few variants of k-means differ in:
Selection of the k initial seed points
Dissimilarity calculations
Strategies to calculate cluster means

Handling categorical data: k-modes (Huang'98)
Uses new dissimilarity measures to deal with categorical objects
Replaces the means of clusters with modes
Uses a frequency-based method to update the modes of clusters

A mixture of categorical and numerical data: the k-prototype method

Page 30:
Representative Object-Based Technique (The k-Medoids Clustering Method)

Find representative objects, called medoids, in clusters.
Medoid: the most centrally located data object in a cluster.
k-means is sensitive to noisy data and outliers, which motivates using medoids instead.

PAM (Partitioning Around Medoids, 1987)
PAM works effectively for small data sets, but does not scale well to large data sets: O(C(n,k)) candidate medoid sets.

Improved algorithms:
CLARA (Kaufmann & Rousseeuw, 1990): uses sampling to find representatives of the data (by reducing the input data size)
CLARANS (Ng & Han, 1994): uses a different sample set to search for representatives at each run, and returns the best representative set within a user-defined number of runs

Page 31:
PAM (Partitioning Around Medoids) (1987)

PAM (Kaufman and Rousseeuw, 1987): O(k(n-k)²t)
Uses representative objects of a cluster as seed points.

1. Arbitrarily select k objects as representative objects (medoids).
2. Assign each non-medoid object to the most similar representative object (n objects, k clusters).
3. Do a greedy search, O(k(n-k)²): for each pair of a non-medoid object h and a medoid object i, calculate the total swapping cost $TC_{ih} = D_T^{Best} - D_T$ (after cluster reassignment), where the total distance is

$D_T = \sum_{j=1}^{k} \sum_{p \in C_j} |p - o_j|$

(summing the distance from each object p to the medoid $o_j$ of its cluster $C_j$). Find the best (smallest) $TC_{ih}$; if $TC_{ih}^{Best} < 0$, replace medoid i by h, otherwise stop.
4. Repeat Step 3 until there is no change (t iterations).

Page 32:
CLARA (Clustering LARge Applications) (1990)

CLARA (Kaufmann and Rousseeuw, 1990)
Built into statistical analysis packages, such as S+
It draws a sample set of the data, applies PAM on the sample, and gives the best clustering as the output.

Strength: deals with larger data sets than PAM.
Weakness:
Efficiency depends on the sample size.
If the sample is biased, a good clustering based on the sample will not necessarily represent a good clustering of the whole data set.

Page 33:
CLARANS ("Randomized" CLARA) (1994)

(Clustering Large Applications based upon RANdomized Search, Ng & Han'94)
The PAM clustering process can be presented as searching a graph where every node (a set of k medoids) is a potential solution: O(C(n,k)) nodes.
At each step, all the neighbors of the current node are examined, and the current node is replaced by the neighbor with the best improvement. (Two nodes are neighbors if their medoid sets differ by only one object; every node has k(n-k) neighbors.)
CLARA confines the search to a fixed sample of size m at each stage.
CLARANS dynamically draws a random sample of neighbors in each search step.
If a local optimum is found, CLARANS starts with a new sample of neighbors in search of a new local optimum. Once a user-specified number of local optima has been found, CLARANS outputs the best one.
It is more efficient and scalable than both PAM and CLARA.

Page 34:
Chapter 7. Cluster Analysis

What is Cluster Analysis?
Types of Data in Cluster Analysis
A Categorization of Major Clustering Methods
Partitioning Methods
Hierarchical Methods
Density-Based Methods
Grid-Based Methods
Model-Based Clustering Methods
Outlier Analysis
Summary

Page 35:
Hierarchical Clustering

Uses the distance matrix as the clustering criterion. This method does not require the number of clusters k as an input, but it may need a termination condition.

[Figure: five objects a, b, c, d, e. Agglomerative clustering (AGNES) runs Step 0 → Step 4, merging a,b and d,e, then c,d,e, until a,b,c,d,e form a single cluster; divisive clustering (DIANA) runs the same hierarchy in the reverse direction, Step 4 → Step 0.]

Page 36:
AGNES (Agglomerative Nesting)

Introduced in Kaufmann and Rousseeuw (1990)
Implemented in statistical analysis packages, e.g., S+
Uses the single-link method and the dissimilarity matrix
Merges the nodes that have the least dissimilarity
Goes on in a non-descending fashion
Eventually all nodes belong to the same cluster

[Figure: three scatter plots on a 10 × 10 grid showing AGNES merging nearby points into progressively larger clusters.]
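A toy single-link merge loop in the spirit of AGNES (illustrative only; the function name single_link_agnes is my own, and the O(n³) search is fine only for small data):

```python
def single_link_agnes(points, dist):
    """Repeatedly merge the two clusters whose closest members are nearest."""
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))  # one dendrogram level
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return merges

euclid = lambda p, q: ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5
pts = [(1, 1), (1.2, 1.1), (5, 5), (5.1, 4.9), (9, 1)]
for step in single_link_agnes(pts, euclid):
    print(step)   # each merge corresponds to one level of the dendrogram
```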

Page 37:
Dendrogram for Cluster Visualization (Shows How Clusters Merge Hierarchically)

The hierarchical clustering algorithm AGNES organizes data objects into a tree of hierarchical partitions (a tree of clusters), called a dendrogram.

A clustering of the data objects is obtained by cutting the dendrogram at the desired level; each connected component then forms a cluster.

Page 38:
Clustering Dendrogram in Gene Expression Analysis

[Figure: gene expression heat map and dendrogram; finding differentially regulated genes by clustering.]

Page 39:
DIANA (Divisive Analysis)

Introduced in Kaufmann and Rousseeuw (1990)
Implemented in statistical analysis packages, e.g., S+
Inverse order of AGNES
Eventually each node (data object) forms a cluster on its own

[Figure: three scatter plots on a 10 × 10 grid showing DIANA splitting one cluster into progressively smaller clusters.]

Page 40:
More on Hierarchical Clustering Methods

Weaknesses of agglomerative clustering methods:
They do not scale well: time complexity of at least O(n²), where n is the number of data objects
They can never undo what was done at a previous stage

Integration of hierarchical and distance-based clustering:
BIRCH (1996): uses a CF-tree and incrementally adjusts the quality of sub-clusters (good for big data analysis)
CURE (1998): selects well-scattered points from each cluster and then shrinks them towards the center of the cluster by a specified fraction
CHAMELEON (1999): hierarchical clustering using dynamic modeling

Page 41:
BIRCH (1996)

BIRCH: Balanced Iterative Reducing and Clustering using Hierarchies, by Zhang, Ramakrishnan, Livny (SIGMOD'96)
Incrementally constructs a CF-tree (Clustering Feature tree), a hierarchical data structure for multiphase clustering:
Phase 1: scan the DB to build an initial in-memory CF-tree (a hierarchical compression of the data that tries to preserve its inherent clustering structure)
Phase 2: use an arbitrary (hierarchical) clustering algorithm to cluster the data in the leaf nodes of the CF-tree
Scales linearly: finds a good clustering with a single scan and improves the quality with a few additional scans (a good feature for big data analysis)
Weakness: handles only numeric data, and is sensitive to the input order of the data records. An example follows.

Page 42:
BIRCH's Clustering Feature Vector

Clustering feature of a sub-cluster: CF = (N, LS, SS)

N: number of data vectors in the sub-cluster
$LS = \sum_{i=1}^{N} X_i$ (linear sum)
$SS = \sum_{i=1}^{N} X_i^2$ (square sum, kept per coordinate)

Example: the five points (3,4), (2,6), (4,5), (4,7), (3,8) have CF = (5, (16,30), (54,190)).
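A sketch of the CF statistics and the additivity that lets BIRCH build them incrementally (the helper names cf and cf_merge are my own):

```python
def cf(points):
    """CF = (N, LS, SS) for 2-D points, with LS and SS kept per coordinate."""
    n = len(points)
    ls = (sum(x for x, _ in points), sum(y for _, y in points))
    ss = (sum(x * x for x, _ in points), sum(y * y for _, y in points))
    return n, ls, ss

def cf_merge(cf1, cf2):
    """CFs are additive: merging two sub-clusters just adds the components."""
    (n1, ls1, ss1), (n2, ls2, ss2) = cf1, cf2
    return (n1 + n2,
            tuple(p + q for p, q in zip(ls1, ls2)),
            tuple(p + q for p, q in zip(ss1, ss2)))

pts = [(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)]
print(cf(pts))                              # (5, (16, 30), (54, 190)) as on the slide
print(cf_merge(cf(pts[:2]), cf(pts[2:])))   # same CF, built incrementally
```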

Page 43:
BIRCH's Parameters & Their Meanings

A clustering feature holds the summary statistics of the data in a sub-cluster:
The zeroth moment: the number N of data vectors in the sub-cluster
The first moment: the linear sum of the N data vectors
The second moment: the square sum of the N data vectors

BIRCH uses two parameters to control node splits while constructing a CF-tree:
Branching factor: constrains the maximum number of child nodes per non-leaf node (a leaf node is required to fit in a memory page of size P, which determines the branching factor of a leaf node)
Threshold: constrains the maximum average distance between the data pairs of a sub-cluster in the leaf nodes

Page 44:
CF Tree

[Figure: a CF-tree with branching factor = 7 and threshold = 6. The root and the non-leaf nodes hold entries CF1, CF2, ..., each with a pointer to a child node; the leaf nodes hold CF entries and are chained by prev/next pointers.]

An object is inserted into the closest leaf entry, and the node is split as needed. After an insertion, information about it is passed toward the root of the tree. The construction is similar to that of a B+-tree.

Page 45:
CURE (Clustering Using REpresentatives)

CURE: proposed by Guha, Rastogi & Shim, 1998; a type of hierarchical clustering.
Most clustering algorithms either favor clusters with spherical shapes and similar sizes, or are fragile in the presence of outliers; CURE overcomes these problems.
Uses multiple representative points to represent a cluster. At each step of clustering, the two clusters with the closest pair of representative points are merged.
Adjusts well to arbitrarily shaped clusters: the representative points of a cluster attempt to capture the data distribution and shape of the cluster.
Stops growing the cluster hierarchy when only k clusters remain.

Page 46:
The Steps for Large Data Sets in CURE

[Figure: the steps involved in clustering a large data set using CURE.]

Page 47:
CURE: The Algorithm (for large data sets) (1/2)

1. Phase 1: Pre-clustering
a. Draw a random sample S of the original data (if the original data set is very large)
b. Partition sample S into p partitions (partitioning for speedup)
c. Partially cluster each partition into |S|/(p × q) clusters, for some q > 1 (by CURE hierarchical clustering, which starts with each input point as a separate cluster)
d. Outlier elimination: most outliers are already eliminated by the random sampling in step a; additionally, if a cluster grows too slowly, eliminate it.

Page 48:
CURE: The Algorithm (2/2)

2. Phase 2: find the final clusters (by CURE)
a. Cluster the partial clusters (# of partial clusters ≤ |S|/q):
Find well-scattered representative points for the new cluster at each iteration; the representative points of a cluster are used to compute its distance from the other clusters.
The distance between two clusters is the distance between their closest pair of representative points; merge the closest pair of clusters.
The representative points falling in each newly formed cluster are "shrunk", i.e., moved toward the cluster center, by a user-specified fraction α.
The representative points capture the shape of the cluster.
b. Mark the data with the corresponding cluster labels.

Page 49:
Data Partitioning and Clustering

Sample size: s = 50; # of partitions: p = 2; partition size: s/p = 25; # of clusters per partition: s/(pq) = 5

[Figure: scatter plots showing the sample split into two partitions and each partition pre-clustered into five partial clusters.]

Page 50:
Representative Points Shrinking in CURE

Shrink the multiple representative points towards the gravity center by a fraction α.
Multiple representatives capture the shape of the cluster.

[Figure: the representative points of a merged cluster being shrunk from the old cluster centers toward the new cluster center.]

Page 51:
CHAMELEON

CHAMELEON: a type of hierarchical clustering, by G. Karypis, E.H. Han and V. Kumar'99
Measures the similarity based on a dynamic model:
Two clusters are merged only if the interconnectivity and closeness (proximity) between the two clusters are high relative to the internal interconnectivity of the clusters and the closeness of items within the clusters.

A two-phase algorithm:
1. Use a graph partitioning algorithm to cluster the objects into a large number of relatively small sub-clusters.
2. Use an agglomerative hierarchical clustering algorithm to find the genuine clusters by repeatedly combining these sub-clusters.

Page 52:
Overall Framework of CHAMELEON

Data set → construct a sparse graph (k-nearest-neighbor graph) → partition the graph by cutting long edges → merge the partitions according to relative interconnectivity and relative closeness → final clusters

Page 53:
Chapter 7. Cluster Analysis

What is Cluster Analysis?
Types of Data in Cluster Analysis
A Categorization of Major Clustering Methods
Partitioning Methods
Hierarchical Methods
Density-Based Methods
Grid-Based Methods
Model-Based Clustering Methods
Outlier Analysis
Summary

Page 54:
Density-Based Clustering Methods

Clustering based on density (a local cluster criterion, such as density-connected points)

Major features:
Discover clusters of arbitrary shape
Handle noise
One scan
Need density parameters as a termination condition

Several interesting studies:
DBSCAN: Ester, et al. (KDD'96)
OPTICS: Ankerst, et al. (SIGMOD'99)
DENCLUE: Hinneburg & Keim (KDD'98)

Page 55:
Density-Based Clustering: Background (I)

Two density parameters:
ε: maximum neighborhood radius of a core point
MinPts: minimum number of points in an ε-neighborhood of a core point

N_ε(p) = {q ∈ D | dist(p, q) ≤ ε}, where D is the data space and N_ε(p) is the set of neighbors of point p

Directly density-reachable: a point p is directly density-reachable from a point q wrt. ε, MinPts if
1) p belongs to N_ε(q), and
2) |N_ε(q)| ≥ MinPts (q is a core point)

[Figure: p inside the ε-neighborhood of core point q, with MinPts = 5 and ε = 1 cm.]

Page 56:
Density-Based Clustering: Background (II)

Density-reachable: a point p is density-reachable from a point q wrt. ε, MinPts if there is a chain of points p_1, ..., p_n with p_1 = q and p_n = p such that p_{i+1} is directly density-reachable from p_i.

Density-connected: a point p is density-connected to a point q wrt. ε, MinPts if there is a point o such that both p and q are density-reachable from o wrt. ε and MinPts.

[Figure: a chain q → p_1 → p illustrating density-reachability, and points p, q both density-reachable from o illustrating density-connectivity.]

Page 57:
DBSCAN: Density-Based Spatial Clustering of Applications with Noise

Relies on a density-based notion of cluster: a cluster is defined as a maximal set of density-connected points.

If the ε-neighborhood of an object contains at least MinPts objects, the object is called a core object.

[Figure: core points, border points, and an outlier, for ε = 1 cm and MinPts = 5.]

Page 58:
DBSCAN: The Algorithm

Arbitrarily select a point p.
Retrieve all points density-reachable from p wrt. ε and MinPts.
If p is a core point, a cluster is formed.
If p is a border point, no points are density-reachable from p, and DBSCAN visits the next point of the database.
Continue the process until all of the points have been processed.

Page 59:
Chapter 7. Cluster Analysis

What is Cluster Analysis?
Types of Data in Cluster Analysis
A Categorization of Major Clustering Methods
Partitioning Methods
Hierarchical Methods
Density-Based Methods
Grid-Based Methods
Model-Based Clustering Methods
Outlier Analysis
Summary

Page 60:
Grid-Based Clustering Method

Uses a multi-resolution grid data structure

Several interesting methods:
STING (a STatistical INformation Grid approach) by Wang, Yang and Muntz (VLDB'97)
WaveCluster by Sheikholeslami, Chatterjee, and Zhang (VLDB'98): a multi-resolution clustering approach using wavelets
CLIQUE: Agrawal, et al. (SIGMOD'98)

Page 61:
STING: A Statistical Information Grid Approach (1)

Wang, Yang and Muntz (VLDB'97)
The spatial area is divided into rectangular cells.
There are several levels of cells corresponding to different levels of resolution (1 → 4 → 16 → 64 → ...).

Page 62:
STING: A Statistical Information Grid Approach (2)

Each cell at a high level is partitioned into a number of smaller cells at the next lower level.
Statistical info of each cell is calculated and stored beforehand and is used to answer queries.
Parameters of higher-level cells can be easily calculated from the parameters of lower-level cells: count, mean, s, min, max; type of distribution (normal, uniform, etc.).

Use a top-down approach to answer spatial data queries:
Start from a pre-selected layer, typically one with a small number of cells.
For each cell in the current level, compute the confidence interval.

Page 63:
STING: A Statistical Information Grid Approach (3)

Remove the irrelevant cells from further consideration.
When finished examining the current layer, proceed to the next lower level.
Repeat this process until the bottom layer is reached.

Advantages:
Query-independent, easy to parallelize, incremental update
O(K), where K is the number of grid cells at the lowest level

Disadvantages:
All cluster boundaries are either horizontal or vertical; no diagonal boundary is detected.

Page 64:
CLIQUE (Clustering In QUEst) (Clustering for High-Dimensional Space)

Agrawal, Gehrke, Gunopulos, Raghavan (SIGMOD'98)
Automatically identifies subspaces of a high-dimensional data space that allow better clustering than the original space.
CLIQUE can be considered both density-based and grid-based:
It partitions each dimension into the same number of equal-length units.
A unit is dense if the fraction of the total data points contained in the unit exceeds an input model parameter.
A cluster is a maximal set of connected dense units within a subspace of higher dimensionality.

Page 65:
CLIQUE: The Major Steps

Identify dense units of higher dimensionality:
Partition the data space into units and find the number of points that lie inside each unit.
Determine the dense units in all 1-D dimensions of interest.
Identify dense units of higher dimensionality using the Apriori principle (if a k-D unit is dense, then its (k-1)-D projection units must be dense).

Generate a minimal description for each cluster:
Determine the maximal region that covers a set of connected dense units of the cluster.
Determine a minimal cover for each cluster.

Page 66:
[Figure: CLIQUE example. Dense units are found in the (age, salary (×10,000)) and (age, vacation (weeks)) subspaces over ages 20-60, with density threshold = 3; the shaded maximal regions (e.g., ages 30-50) intersect to identify a cluster in the (age, salary, vacation) space.]

Page 67:
Strength and Weakness of CLIQUE

Strength:
It automatically finds subspaces of the highest dimensionality such that high-density clusters exist in those subspaces.
It is insensitive to the order of the input records and does not presume some canonical data distribution.
It scales linearly with the size of the input and has good scalability as the number of dimensions in the data increases.

Weakness:
The accuracy of the clustering result may be degraded at the expense of the simplicity of the method.

Page 68:
Chapter 7. Cluster Analysis

What is Cluster Analysis?
Types of Data in Cluster Analysis
A Categorization of Major Clustering Methods
Partitioning Methods
Hierarchical Methods
Density-Based Methods
Grid-Based Methods
Model-Based Clustering Methods
Outlier Analysis
Summary

Page 69:
Model-Based Clustering Methods

Attempt to optimize the fit between the data and a given cluster model.

Conceptual clustering (a statistical/AI approach):
Clustering is performed first, followed by characterization.
Produces a classification scheme for a set of unlabeled objects.
Finds a conceptual characteristic description for each class/cluster.

COBWEB (Fisher'87):
A popular and simple method of incremental conceptual clustering.
Creates a hierarchical clustering in the form of a classification tree.
Each node denotes a concept and contains a probabilistic description of that concept.

Page 70:
COBWEB Clustering Method

A classification tree:

animal: P(C0) = 1.0, P(scales|C0) = 0.25, ...
├─ fish: P(C1) = 0.25, P(scales|C1) = 1.0, ...
├─ amphibian: P(C2) = 0.25, P(moist|C2) = 1.0, ...
└─ mammal/bird: P(C3) = 0.5, P(hair|C3) = 0.5, ...
   ├─ mammal: P(C4) = 0.25, P(hair|C4) = 1.0, ...
   └─ bird: P(C5) = 0.25, P(feathers|C5) = 1.0, ...

Page 71:
Other Model-Based Clustering Methods

Neural network approaches:
Represent each cluster as an exemplar, acting as a "prototype" of the cluster.
In the clustering process, a data object is assigned to the cluster whose exemplar is the most similar to it according to some distance measure.

They are based on competitive learning:
Involving an organized architecture of neuron units
The neuron units compete in a "winner-takes-all" fashion for the current input training data object

Page 72:
Self-Organizing Feature Maps (SOMs) (A Neural Network Approach)

Clustering is performed by having output neurons compete for the current input data:
The output neuron whose connection weight vector is closest to the current input training object wins.
The winner and its neighbors learn by having their connection weights adjusted.

SOMs are believed to resemble processing that can occur in the human brain.
They are also useful for visualizing high-dimensional data mapped into 2- or 3-D space. An example follows.

Page 73:
An Example of SOM Architecture

An example SOM network:
• 2 input neurons
• Format of input X: <x1, x2>
• 16 output neurons (clusters) in a 4 × 4 architecture
• Connection weight matrix: 16 × 2
• Weight vector of neuron i: <wi1, wi2>

[Figure: the input neurons fully connected to the 4 × 4 grid of output neurons, and an example of an SOM result representation.]

Page 74:

Examples of SOM Feature Maps

Page 75:
Mathematical Model of an SOM Example (A 3-Input, 25-Output SOM)

3 input neurons x1, x2, x3; 25 output neurons a1, ..., a25 arranged as a 5 × 5 competitive layer (neurons numbered 1-25, row by row).

Input, connection weights, and competitive layer:

$a_{25 \times 1} = \operatorname{compet}(W_{25 \times 3}\, X_{3 \times 1})$

The output activations a form the feature map.

Page 76:
Weights Update & Neighborhood in SOM

Update the weight vectors in a neighborhood of the winning neuron i*:

$W_i(p+1) = W_i(p) + \alpha\,(X(p) - W_i(p)), \quad i \in N_{i^*}(d)$

where X(p) is the training vector at iteration p, and $N_{i^*}(d) = \{j \mid d_{j,i^*} \le d\}$ is the neighborhood of the winning neuron i*.

Neighborhood example on the 5 × 5 map, where 13 is the winning neuron:

$N_{13}(1) = \{8, 12, 13, 14, 18\}$
$N_{13}(2) = \{3, 7, 8, 9, 11, 12, 13, 14, 15, 17, 18, 19, 23\}$

Page 77:
SOM Learning Algorithm

• Step 1: Initialisation
• Step 2: Find the winning neuron
• Step 3: Learning
• Step 4: Iteration

Page 78:
SOM Learning Algorithm

Step 1: Initialisation. Set the initial connection weights to small random values, say in an interval [0, 1], and assign a small positive value to the learning rate parameter α.

Page 79:
SOM Learning Algorithm

Step 2: Find the winning neuron (iteration p) (activation and similarity matching).
Activate the Kohonen network by applying an input vector X.
Find the winning (best-matching) neuron i*, using the criterion of minimal Euclidean distance:

$i^*(p) = \arg\min_i \|X - W_i(p)\| = \arg\min_i \left[\sum_{j=1}^{n} \big(x_j - w_{ij}(p)\big)^2\right]^{1/2}, \quad i = 1, \ldots, m$

where $X = \langle x_1, x_2, \ldots, x_n \rangle$, m is the # of neurons in the Kohonen layer, and n is the # of input neurons.

Page 80:
SOM Learning Algorithm

Step 3: Learning.
Update the connection weights between input neuron j and output neuron i:

$w_{ij}(p+1) = w_{ij}(p) + \Delta w_{ij}(p), \quad 1 \le i \le m,\ 1 \le j \le n$

where $\Delta w_{ij}(p)$ is the weight correction at iteration p. The weight correction is determined by the competitive learning rule:

$\Delta w_{ij}(p) = \begin{cases} \alpha\,\big[x_j - w_{ij}(p)\big], & i \in N_{i^*}(d, p) \\ 0, & i \notin N_{i^*}(d, p) \end{cases}$

where α is the learning rate parameter, and $N_{i^*}(d, p)$ is the neighbourhood function centred around the winning neuron i* at distance d at iteration p.

Page 81:
SOM Learning Algorithm

Step 4: Iteration.
Increase iteration p by one, go back to Step 2, and continue until the criterion of minimal Euclidean distance is satisfied, or until no noticeable changes occur in the feature map.
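Putting Steps 1-4 together, a minimal NumPy sketch for a 1-D Kohonen layer (illustrative only; the function name som_step, the fixed neighborhood radius d, and the fixed learning rate alpha are my own simplifications):

```python
import numpy as np

def som_step(W, x, alpha=0.1, d=1):
    """One SOM iteration: find the winner i*, then pull the weights of
    every neuron within distance d of i* toward the input x."""
    winner = np.argmin(np.linalg.norm(W - x, axis=1))   # step 2: best match
    for i in range(len(W)):
        if abs(i - winner) <= d:                        # i in the neighborhood of i*
            W[i] += alpha * (x - W[i])                  # step 3: learning rule
    return W

rng = np.random.default_rng(1)
W = rng.random((10, 2))          # step 1: 10 output neurons, 2 inputs, random weights
for _ in range(500):             # step 4: iterate over training vectors
    W = som_step(W, rng.random(2))
print(W.round(2))                # weights spread out over the input space
```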

Page 82:
Learning in Kohonen Network

The overall effect of the SOM learning rule is to move the connection weight vector $W_i$ of the trained output neuron i towards the input pattern X:

$\Delta W_i = \alpha\,(X - W_i), \qquad W_i(p+1) = W_i(p) + \Delta W_i(p) = W_i(p) + \alpha\,\big(X(p) - W_i(p)\big)$

[Figure: vector diagram of the weight vector $W_i$ moving toward the input pattern X by $\Delta W_i$.]

Page 83:
Convergence Process of SOM

[Figure: four snapshots of the feature map unfolding over the input space [-1, 1] × [-1, 1] after 100, 1000, 2000, and 5000 iterations.]

Page 84:
What Is Outlier Discovery?

What are outliers?
A set of objects that are considerably dissimilar from the remainder of the data
Example: in sports, Michael Jordan, ...

Problem: find the top n outlier points

Applications:
Credit card fraud detection
Telecom fraud detection
Customer segmentation
Medical analysis

Page 85:
Outlier Discovery: Statistical Approaches

Assume a model of the underlying distribution that generates the data set (e.g., a normal distribution)
Use discordancy tests, which depend on:
the data distribution
the distribution parameters (e.g., mean, variance)
the number of expected outliers

Drawbacks:
most tests are for a single attribute
in many cases, the data distribution may not be known

Page 86:
Outlier Discovery: Distance-Based Approach

Introduced to counter the main limitations imposed by statistical methods: we need multi-dimensional analysis without knowing the data distribution.

Distance-based outlier: a DB(p, D)-outlier is an object O in a dataset T such that at least a fraction p of the objects in T lie at a distance greater than D from O.

Algorithms for mining distance-based outliers (a brute-force sketch follows):
Index-based algorithm
Nested-loop algorithm
Cell-based algorithm
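A brute-force, nested-loop-style sketch of the DB(p, D) test (the helper name db_outliers is my own):

```python
def db_outliers(data, p, D, dist):
    """O is a DB(p, D)-outlier if at least a fraction p of the other
    objects lie at distance greater than D from O."""
    out = []
    for i, o in enumerate(data):
        far = sum(1 for j, x in enumerate(data) if j != i and dist(o, x) > D)
        if far / (len(data) - 1) >= p:
            out.append(o)
    return out

euclid = lambda a, b: ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5
pts = [(1, 1), (1.2, 0.9), (0.8, 1.1), (1.1, 1.0), (8, 8)]
print(db_outliers(pts, p=0.9, D=3.0, dist=euclid))   # -> [(8, 8)]
```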

Page 87:
Outlier Discovery: Deviation-Based Approach

Identifies outliers by examining the main characteristics of the objects in a group; objects that "deviate" from this description are considered outliers.

Sequential exception technique: simulates the way in which humans distinguish unusual objects from a series of similar objects.

OLAP data cube technique: uses data cubes to identify regions of anomalies in a large multi-dimensional data set.

Page 88:
Summary

Cluster analysis groups data objects based on their similarity and has wide applications.
Measures of similarity can be computed for various types of data.
Clustering algorithms can be categorized into partitioning methods, hierarchical methods, density-based methods, grid-based methods, and model-based methods.
Outlier detection and analysis are very useful for fraud detection and can be performed by statistical, distance-based, or deviation-based approaches.
There are still many open research issues in cluster analysis, such as constraint-based clustering.