Grouping Data: Methods of cluster analysis

Uploaded by cala on 25-Feb-2016


Page 1: Grouping Data

Grouping Data

Methods of cluster analysis

Page 2: Grouping Data

Goals 1

1. We want to identify groups of similar artifacts or features or sites or graves, etc. that represent cultural, functional, or chronological differences

2. We want to create groups as a measurement technique to see how they vary with external variables

Page 3: Grouping Data

Goals 2

3. We want to cluster artifacts or sites based on their location to identify spatial clusters

Page 4: Grouping Data

Real vs. Created Types

• Differences in goals
– Real types are the aim of Goal 1
– Created types are the aim of Goal 2
• Debate over whether Real types can be discovered with any degree of certainty
• Cluster analysis guarantees groups – you must confirm their utility

Page 5: Grouping Data

Initial Decisions 1

• What variables to use?
– All possible
– Constructed variables (from principal components, correspondence analysis, or multi-dimensional scaling)
– Restricted set of variables that support the goal(s) of creating groups (e.g. functional groups, cultural or stylistic groups)

Page 6: Grouping Data

Initial Decisions 2

• How to transform the variables?
– Log transforms
– Conversion to percentages (to weight rows equally)
– Size standardization (dividing by geometric mean)
– Z-scores (to weight columns equally)
– Conversion of categorical variables
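The transformations listed above can be sketched in a few lines of base R. The counts matrix below (three artifact types at four sites) is made up purely for illustration:

```r
# Hypothetical counts: 4 sites (rows) x 3 artifact types (columns)
counts <- matrix(c(10, 5, 1,
                   20, 8, 4,
                    2, 9, 6,
                   15, 3, 2), nrow = 4, byrow = TRUE)

logged <- log(counts + 1)                        # log transform (+1 to handle zeros)
pct    <- prop.table(counts, margin = 1) * 100   # row percentages: weights rows equally
gmean  <- apply(counts, 1, function(r) exp(mean(log(r))))
sized  <- counts / gmean                         # size standardization by row geometric mean
z      <- scale(counts)                          # z-scores: weights columns equally

rowSums(pct)   # every row now sums to 100
colMeans(z)    # every column now has mean 0
```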

Page 7: Grouping Data

Initial Decisions 3

• How to measure distance?
– Types of variables
– Goals of the analysis
– If uncertain, try multiple methods
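In R these distance choices map onto dist() (and, for mixed variable types, daisy() in the cluster package). A quick sketch on made-up standardized data:

```r
set.seed(1)
m <- matrix(rnorm(20), nrow = 5)   # 5 made-up rows, 4 variables

d_euc <- dist(m)                        # Euclidean (the default), for interval data
d_man <- dist(m, method = "manhattan")  # less sensitive to large single differences
d_bin <- dist(m > 0, method = "binary") # presence/absence version of the same rows

# Each result is a lower-triangle distance object: choose(5, 2) = 10 pairs
length(d_euc)
```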

Page 8: Grouping Data

Methods of Grouping

• Partitioning Methods – divide the data into groups
• Hierarchical Methods
– Agglomerative – from n clusters to 1 cluster
– Divisive – from 1 cluster to k clusters

Page 9: Grouping Data

Partitioning

• K-Means, K-Medoids, Fuzzy
• Uses a measure of distance, but does not need to compute the full distance matrix
• Specify the number of groups in advance
• Minimizes within-group variability
• Finds spherical clusters

Page 10: Grouping Data

Procedure

• Start with centers for k groups (user-supplied or random)
• Repeat up to iter.max times (default 10):
– Allocate rows to their closest center
– Recalculate the center positions
• Stop
• Different criteria for allocation
• Use multiple starts (e.g. 5–15)
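A minimal sketch of this procedure with base R's kmeans(), on made-up data with two obvious blobs; nstart supplies the multiple starts recommended above:

```r
set.seed(42)
# Two made-up blobs of 10 points each, centered at 0 and at 5
pts <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
             matrix(rnorm(20, mean = 5), ncol = 2))

km <- kmeans(pts, centers = 2, iter.max = 10, nstart = 10)

km$centers        # the recalculated center positions
table(km$cluster) # group sizes after allocation
km$tot.withinss   # total within-group sum of squares
```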

Page 11: Grouping Data

Evaluation 1

• Compute groups for a range of cluster sizes and plot within-group sums of squares to look for sharp increases
• Cluster randomized versions of the data and compare the results
• Examine table of statistics by group
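For the "table of statistics by group", one plain-R option is aggregate() over the cluster assignment. The data and column names below are made up:

```r
set.seed(7)
# 20 made-up rows with three invented measurement columns
x <- matrix(rnorm(60), ncol = 3, dimnames = list(NULL, c("L", "W", "Th")))
km <- kmeans(x, centers = 2, nstart = 5)

# Mean of each variable within each cluster (the group centroids)
tab <- aggregate(x, by = list(cluster = km$cluster), FUN = mean)
tab
```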

Page 12: Grouping Data

Evaluation 2

• Plot groups in two dimensions with PCA, CA, or MDS
• Compare the groups using data or information not included in the analysis
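A sketch of the PCA display with princomp(), plotting cluster numbers as the point symbols; the data here are made up:

```r
set.seed(3)
x   <- matrix(rnorm(80), ncol = 4)               # 20 made-up rows, 4 variables
grp <- kmeans(x, centers = 3, nstart = 5)$cluster

pc <- princomp(x)
# First two principal components, labeled by cluster membership
plot(pc$scores[, 1:2], pch = as.character(grp),
     xlab = "PC 1", ylab = "PC 2")
```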

Page 13: Grouping Data

Partitioning Using R

• Base R includes kmeans() for forming groups by partitioning
• Rcmdr includes KMeans() to iterate kmeans() for the best solution
• Package cluster includes pam(), which uses medoids for more robust grouping, and fanny(), which forms fuzzy clusters
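A minimal sketch of the two cluster-package functions just named, on made-up data. Medoids are actual data rows, which is what makes pam() more robust to outliers than k-means:

```r
library(cluster)
set.seed(9)
x <- matrix(rnorm(40), ncol = 2)   # 20 made-up rows

fit <- pam(x, k = 2)
fit$medoids             # the two representative rows chosen as centers
table(fit$clustering)   # hard group membership

ff <- fanny(x, k = 2)   # fuzzy version: membership weights, not hard labels
head(ff$membership)
```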

Page 14: Grouping Data

Example

• DarlPoints (not DartPoints) has 4 measurements for 23 Darl points
• Create Z-scores to weight variables equally with Data | Manage variables in active data set | Standardize variables …
• (or could use PCA and PC scores)
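Outside Rcmdr, the "Standardize variables" step is just scale(). The small data frame below is a made-up stand-in for the four Darl point measurements:

```r
# Invented values standing in for Length/Width/Thickness/Weight
darl <- data.frame(Length    = c(45, 52, 38, 61),
                   Width     = c(20, 22, 18, 24),
                   Thickness = c(6, 7, 5, 8),
                   Weight    = c(5.1, 6.3, 3.9, 7.4))

z <- scale(darl)   # each column rescaled to mean 0, sd 1
colMeans(z)
apply(z, 2, sd)
```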

Page 15: Grouping Data

Example (cont)

• Use Rcmdr to partition the data into 5, 4, 3, and 2 groups
• Statistics | Dimensional analysis | Cluster analysis | k-means cluster analysis …
• TWSS = 15.42, 19.78, 25.83, 34.24
• Select group number and have Rcmdr add group to data set

Page 16: Grouping Data
Page 17: Grouping Data
Page 18: Grouping Data

Evaluation

• Evaluate groups against randomized data
– Randomly permute each variable
– Run k-means
– Compare random and non-random results
• Evaluate groups against external criteria (location, material, age, etc.)

Page 19: Grouping Data

KMPlotWSS <- function(data, ming, maxg) {
  WSS <- sapply(ming:maxg, function(x)
    kmeans(data, centers = x, iter.max = 10, nstart = 10)$tot.withinss)
  plot(ming:maxg, WSS, las = 1, type = "b", xlab = "Number of Groups",
       ylab = "Total Within Sum of Squares", pch = 16)
  print(WSS)
}

KMRandWSS <- function(data, samples, min, max) {
  KRand <- function(data, min, max) {
    Rnd <- apply(data, 2, sample)
    sapply(min:max, function(y)
      kmeans(Rnd, y, iter.max = 10, nstart = 5)$tot.withinss)
  }
  Sim <- sapply(1:samples, function(x) KRand(data, min, max))
  t(apply(Sim, 1, quantile, c(0, .005, .01, .025, .5, .975, .99, .995, 1)))
}

Page 20: Grouping Data

# Compare data to randomized sets
KMPlotWSS(DarlPoints[, 6:9], 1, 10)
Qtiles <- KMRandWSS(DarlPoints[, 6:9], 2000, 1, 10)
matlines(1:10, Qtiles[, c(1, 5, 9)], lty = c(3, 2, 3), lwd = 2, col = "dark gray")
legend("topright", c("Observed", "Median (Random)", "Max/Min Random"),
       col = c("black", "dark gray", "dark gray"), lwd = c(1, 2, 2), lty = c(1, 2, 3))

Page 21: Grouping Data
Page 22: Grouping Data

Hierarchical Methods

• Agglomerative – successive merging
• Divisive – successive splitting
– Monothetic – binary data
– Polythetic – interval/ratio

Page 23: Grouping Data

Agglomerative

• At the start all rows are in separate groups (n groups or clusters)
• At each stage two rows are merged, a row and a group are merged, or two groups are merged
• The process stops when all rows are in a single cluster
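The merging sequence can be watched directly in base R: hclust() records each step in $merge and $height. The data are made up, and average linkage is an arbitrary choice here:

```r
set.seed(5)
x <- matrix(rnorm(12), ncol = 2)   # 6 made-up rows -> 5 merge steps

hc <- hclust(dist(x), method = "average")

hc$merge    # negative entries = original rows; positive = earlier merge steps
hc$height   # dissimilarity at which each merge occurred
```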

Page 24: Grouping Data

Agglomeration Methods

• How should clusters be formed?
– Single Linkage – irregularly shaped groups
– Average Linkage – spherical groups
– Complete Linkage – spherical groups
– Ward’s Method – spherical groups
– Median – dendrogram inversions
– Centroid – dendrogram inversions
– McQuitty – similarity by reciprocal pairs
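These criteria map onto hclust()'s method argument (Ward's method appears as "ward.D"/"ward.D2" in current R; the median and centroid methods are documented as intended for squared Euclidean distances). A made-up comparison:

```r
set.seed(11)
d <- dist(matrix(rnorm(20), ncol = 2))   # distances among 10 made-up rows

methods <- c("single", "average", "complete", "ward.D2",
             "median", "centroid", "mcquitty")

# Same distance matrix, different merge criteria: compare tree heights
heights <- sapply(methods, function(m) max(hclust(d, method = m)$height))
round(heights, 3)
```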

Page 25: Grouping Data

Agglomerating with R

• Base R includes hclust() for forming groups by agglomeration
• Package cluster includes agnes()
• Rcmdr uses hclust() via Statistics | Dimensional analysis | Cluster analysis | Hierarchical cluster analysis …

Page 26: Grouping Data

HClust

• Rcmdr menus provide
– Cluster analysis and plot
– Summary statistics by group
– Adding cluster to data set
• To get a traditional dendrogram:
plot(HClust.1, hang = -1, main = "Darl Points", xlab = "Catalog Number",
     sub = "Method=Ward; Distance=Euclidean")
rect.hclust(HClust.1, 3)

Page 27: Grouping Data
Page 28: Grouping Data
Page 29: Grouping Data

summary(as.factor(cutree(HClust.1, k = 3)))  # Cluster Sizes
 1  2  3
11  6  6

by(model.matrix(~ -1 + Z.Length + Z.Thickness + Z.Weight + Z.Width, DarlPoints),
   as.factor(cutree(HClust.1, k = 3)), mean)  # Cluster Centroids
INDICES: 1
  Z.Length Z.Thickness    Z.Weight     Z.Width
-0.1345150  -0.1585615  -0.2523805  -0.1241642
------------------------------------------------------------
INDICES: 2
  Z.Length Z.Thickness    Z.Weight     Z.Width
-1.1085541  -0.9209550  -0.9400026  -0.8200594
------------------------------------------------------------
INDICES: 3
 Z.Length Z.Thickness  Z.Weight   Z.Width
 1.355165    1.211651  1.402700  1.047694

> biplot(princomp(model.matrix(~ -1 + Z.Length + Z.Thickness + Z.Weight + Z.Width,
    DarlPoints)), xlabs = as.character(cutree(HClust.1, k = 3)))

Page 30: Grouping Data
Page 31: Grouping Data

> cbind(HClust.1$merge, HClust.1$height)
      [,1] [,2]       [,3]
 [1,]  -12  -13  0.3983821
 [2,]   -2   -3  0.5112670
 [3,]   -9  -14  0.5247650
 [4,]  -10  -17  0.5572146
 [5,]  -15    3  0.7362171
 [6,]   -1  -11  0.7471874
 [7,]   -6  -18  0.8120594
 [8,]   -7   -8  0.8491895
 [9,]    4    5  0.9841552
[10,]    2    6  1.2150606
[11,]  -19  -21  1.2300507
[12,]    1   10  1.4059158
[13,]  -22   11  1.4963400
[14,]  -16  -20  1.5800167
[15,]   -4    9  1.6195709
[16,]   -5   12  2.1556543
[17,]  -23   13  2.4007863
[18,]    7   14  2.4252670
[19,]    8   17  3.2632812
[20,]   16   18  4.9021149
[21,]   15   20  6.6290417
[22,]   19   21 18.7730146

Page 32: Grouping Data
Page 33: Grouping Data
Page 34: Grouping Data

Divisive

• At the start all rows are considered to be a single group
• At each stage a group is divided into two groups based on the average dissimilarities
• The process stops when all rows are in separate clusters
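Divisive clustering as described here is available as diana() in the cluster package; a minimal sketch on made-up data:

```r
library(cluster)
set.seed(13)
x <- matrix(rnorm(30), ncol = 3)   # 10 made-up rows

dv <- diana(x)                     # starts from one group, splits successively

# The divisive tree can be cut like any hierarchical result
grp <- cutree(as.hclust(dv), k = 2)
table(grp)
```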