cluster analysis handout
Post on 30-Oct-2014
Cluster Analysis
LEARNING OBJECTIVES:
1. Define cluster analysis, its roles and its limitations.
2. Identify the research questions addressed by cluster analysis.
3. Understand how interobject similarity is measured.
4. Distinguish between the various distance measures.
5. Differentiate between clustering algorithms.
6. Understand the differences between hierarchical and nonhierarchical clustering
techniques.
7. Describe how to select the number of clusters to be formed.
8. Follow the guidelines for cluster validation.
9. Construct profiles for the derived clusters and assess managerial significance.
Cluster Analysis
Cluster analysis groups objects (respondents,
products, firms, variables, etc.) so that each object is
similar to the other objects in its cluster and
different from objects in all the other clusters.
Cluster Analysis Defined
Cluster analysis is a group of multivariate techniques whose
primary purpose is to group objects based on the
characteristics they possess.
• It has been referred to as Q analysis, typology construction,
classification analysis, and numerical taxonomy.
• The essence of all clustering approaches is the classification of
data as suggested by “natural” groupings of the data
themselves.
What is Cluster Analysis?
Between-Cluster Variation = Maximize
Within-Cluster Variation = Minimize
Three Cluster Diagram Showing
Between-Cluster and Within-Cluster Variation
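The maximize/minimize idea above can be made concrete with a small sketch. It assumes variation is measured as sums of squared distances to centroids (a common choice, not stated on the slide), and the data points are hypothetical.

```python
def mean_point(points):
    """Centroid (coordinate-wise mean) of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def variation(clusters):
    """Return (within, between) sums of squares for a list of clusters."""
    all_points = [p for cluster in clusters for p in cluster]
    grand = mean_point(all_points)
    within = 0.0
    between = 0.0
    for cluster in clusters:
        center = mean_point(cluster)
        # Within: spread of each object around its own cluster centroid.
        within += sum(squared_distance(p, center) for p in cluster)
        # Between: spread of cluster centroids around the grand centroid.
        between += len(cluster) * squared_distance(center, grand)
    return within, between

# Hypothetical data: the same six objects grouped well and grouped badly.
tight = [[(1, 1), (1, 2), (2, 1)], [(8, 8), (9, 8), (8, 9)]]
mixed = [[(1, 1), (8, 8), (2, 1)], [(1, 2), (9, 8), (8, 9)]]

w_tight, b_tight = variation(tight)
w_mixed, b_mixed = variation(mixed)
print(w_tight, b_tight)
print(w_mixed, b_mixed)
```

Within plus between is the same total for both groupings, so lowering within-cluster variation necessarily raises between-cluster variation.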
[Figure: Scatter Diagram for Cluster Observations (four panels). Horizontal axis: frequency of going to fast food restaurants, low to high. Vertical axis: frequency of eating out, low to high.]
Criticisms of Cluster Analysis
The following must be addressed by conceptual
rather than empirical support:
• Cluster analysis is descriptive, atheoretical, and
noninferential.
• Cluster analysis will always create clusters, regardless of the
actual existence of any structure in the data.
• The cluster solution is not generalizable because it is
totally dependent upon the variables used as the basis for
the similarity measure.
What Can We Do With Cluster Analysis?
1. Determine if statistically different clusters
exist.
2. Identify the meaning of the clusters.
3. Explain how the clusters can be used.
Stage 1: Objectives of Cluster Analysis
Primary Goal = to partition a set of objects into two or
more groups based on the similarity of the objects
for a set of specified characteristics (the cluster
variate).
There are two key issues:
• The research questions being addressed, and
• The variables used to characterize objects in the
clustering process.
Research Questions in Cluster Analysis
Three basic research questions:
• How to form the taxonomy – an empirically based
classification of objects.
• How to simplify the data – by grouping
observations for further analysis.
• Which relationships can be identified – the process
reveals relationships among the observations.
Selection of Clustering Variables
Two Issues:
1. Conceptual considerations- include
only variables that:
– Characterize the objects being
clustered
– Relate specifically to the
objectives of the cluster analysis
2. Practical considerations.
Rules of Thumb- 1
OBJECTIVES OF CLUSTER ANALYSIS
• Cluster analysis is used for:
– Taxonomy description: identifying natural groups within the data.
– Data simplification: the ability to analyze groups of similar observations instead of
all individual observations.
– Relationship identification: the simplified structure from cluster analysis portrays
relationships not revealed otherwise.
• Theoretical, conceptual and practical considerations must be observed when selecting
clustering variables for cluster analysis:
– Only variables that relate specifically to objectives of the cluster analysis are
included, since “irrelevant” variables cannot be excluded from the analysis
once it begins.
– Variables are selected which characterize the individuals (objects) being
clustered.
Stage 2: Research Design in Cluster Analysis
Four Questions:
• Is the sample size adequate?
• Can outliers be detected and, if so, should they be
deleted?
• How should object similarity be measured?
• Should the data be standardized?
Measuring Similarity
Interobject similarity is an empirical measure of correspondence, or
resemblance, between objects to be clustered. It can be measured in a variety
of ways, but three methods dominate the applications of cluster analysis:
• Correlational Measures- correlation between profiles of two objects.
High correlation indicates similarity while low correlation denotes lack of
it.
• Distance Measures- are actually a measure of dissimilarity with larger
values denoting lesser similarity.
• Association Measures- used to compare objects whose characteristics are measured
only in nonmetric terms (for example, the percentage of times two respondents
agree, both answering yes or both answering no to a question).
Similarity measures calculated across the entire set of clustering variables
allow for the grouping of observations and their comparison to each other.
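A minimal sketch of the three kinds of similarity measure listed above, using only the Python standard library. The function names and the respondent data are hypothetical; simple matching is used as one example of an association measure.

```python
import math

def profile_correlation(a, b):
    """Pearson correlation between two objects' profiles across the variables."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    norm_a = math.sqrt(sum((x - mean_a) ** 2 for x in a))
    norm_b = math.sqrt(sum((y - mean_b) ** 2 for y in b))
    return cov / (norm_a * norm_b)

def simple_matching(a, b):
    """Association measure: share of nonmetric items on which two objects agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

# Hypothetical ratings: same pattern across variables, different level.
resp_1 = [1, 2, 3, 4, 5]
resp_2 = [5, 6, 7, 8, 9]

corr = profile_correlation(resp_1, resp_2)   # pattern similarity
dist = math.dist(resp_1, resp_2)             # Euclidean distance (dissimilarity)

# Hypothetical yes/no answers for the association measure.
match = simple_matching(["yes", "no", "yes", "yes"], ["yes", "no", "no", "yes"])
print(corr, dist, match)
```

The two respondents correlate perfectly yet are far apart in distance, which shows that correlational measures respond to the pattern of a profile while distance measures respond to its magnitude.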
Types of Distance Measures
• Euclidean distance- measure of the length of a straight line drawn
between two objects when represented graphically.
• Squared (or absolute) Euclidean distance- the sum of squared
differences, without taking the square root. It is the recommended
measure for the centroid and Ward’s methods of clustering.
• Mahalanobis distance (D²)- standardized form of Euclidean distance that
also adjusts for correlations among the variables.
• City-block (Manhattan) distance- the sum of the absolute differences
across the variables.
• Chebychev distance- the largest absolute difference on any single
variable.
Given the sensitivity of some procedures to the similarity measure
used, the researcher should employ several distance measures and
compare the results from each with other results or theoretical/known
patterns.
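The distance measures above can be sketched in a few lines of plain Python. Mahalanobis distance is left out here because it needs the inverse covariance matrix of the variables; the two example objects are hypothetical.

```python
import math

def euclidean(a, b):
    """Length of the straight line between two objects."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def squared_euclidean(a, b):
    """Sum of squared differences, with no square root taken."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def city_block(a, b):
    """Sum of absolute differences (Manhattan distance)."""
    return sum(abs(x - y) for x, y in zip(a, b))

def chebychev(a, b):
    """Largest absolute difference on any single variable."""
    return max(abs(x - y) for x, y in zip(a, b))

# Hypothetical objects measured on two variables.
obj_a = (1, 2)
obj_b = (4, 6)

print(euclidean(obj_a, obj_b))          # 5.0
print(squared_euclidean(obj_a, obj_b))  # 25
print(city_block(obj_a, obj_b))         # 7
print(chebychev(obj_a, obj_b))          # 4
```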
Sample Size
• The sample size required is not based on statistical considerations for
inference testing, but rather:
– Sufficient size is needed to ensure representativeness of the population and
its underlying structure, particularly small groups within the population.
– Minimum group sizes are based on the relevance of each group to the
research question and the confidence needed in characterizing that group.
• Outliers can severely distort the representativeness of the results if they appear as
structure (clusters) that are inconsistent with the research objectives.
– They should be removed if the outlier represents:
  – Aberrant observations not representative of the population
  – Observations of small or insignificant segments within the population which are
  of no interest to the research objectives
– They should be retained if representing an under-sampling/poor representation of
relevant groups in the population. In this case, the sample should be augmented to
ensure representation of these groups.
• Outliers can be identified based on the similarity measure by:
– Finding observations with large distances from all other observations
– Graphic profile diagrams highlighting outlying cases
– Their appearance in cluster solutions as single-member or very small clusters
• Clustering variables should be standardized whenever possible to avoid problems
resulting from the use of different scale values among clustering variables.
– The most common standardization conversion is Z scores.
– If groups are to be identified according to an individual’s response style, then
within-case or row-centering standardization is appropriate.
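The two standardization options above can be sketched as follows. The sketch assumes Z scores use the sample standard deviation; the income column and the respondent row are hypothetical.

```python
import statistics

def z_scores(column):
    """Standardize one variable (a column) to mean 0 and standard deviation 1."""
    mean = statistics.mean(column)
    sd = statistics.stdev(column)  # sample standard deviation (an assumption)
    return [(value - mean) / sd for value in column]

def row_center(row):
    """Within-case standardization: subtract the respondent's own mean rating."""
    mean = statistics.mean(row)
    return [value - mean for value in row]

# Hypothetical variable measured on a large scale.
income = [20000, 40000, 60000]
income_z = z_scores(income)

# Hypothetical respondent who rates everything near the top of the scale.
ratings = [4, 5, 3]
ratings_centered = row_center(ratings)

print(income_z)
print(ratings_centered)
```

Z scores put every variable on the same scale so that none dominates the distance calculation. Row centering removes each respondent's overall level so that only the pattern of answers remains.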
Outliers
Assumptions of Cluster Analysis
• Representativeness of the sample.
• Impact of multicollinearity:
– Reduce the variables to equal numbers in each set of
correlated measures, or
– Use a distance measure that compensates
for the correlation, like Mahalanobis
distance.
Stage 4: Deriving Clusters and Assessing Overall Fit
The researcher must:
• Select the partitioning procedure used for
forming clusters, and
• Make the decision on the number of
clusters to be formed.
Clustering Procedures
Hierarchical Clustering Procedures- stepwise clustering procedures
involving a combination (or division) of the objects into clusters. Such a
procedure produces N-1 cluster solutions. Two types:
1. Agglomerative Methods (buildup)
2. Divisive Methods (breakdown)
Nonhierarchical Clustering Procedures- produce only a single cluster
solution for a set of cluster seeds (initial centroids or starting points for
the clusters). Cluster seeds are used to group objects within a pre-specified
distance of the seeds. If four clusters are specified, only four are
formed.
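A minimal sketch of a nonhierarchical procedure, assuming k-means as the algorithm (the handout does not name one). Objects are assigned to the nearest seed rather than by a distance threshold, and the function name, seeds and data are hypothetical.

```python
import math

def k_means(points, seeds, max_iter=100):
    """Assign each point to its nearest centroid, then update the centroids.

    The number of seeds fixes the number of clusters in the solution.
    """
    centroids = [tuple(s) for s in seeds]
    assignment = None
    for _ in range(max_iter):
        new_assignment = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # memberships stopped changing
        assignment = new_assignment
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:  # an empty cluster keeps its previous centroid
                centroids[k] = tuple(sum(c) / len(members) for c in zip(*members))
    return assignment, centroids

# Hypothetical data and two seed points chosen by the analyst.
data = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
labels, centers = k_means(data, seeds=[(0, 0), (10, 10)])
print(labels)   # [0, 0, 0, 1, 1, 1]
print(centers)
```

Two seeds give exactly two clusters. The procedure never produces the range of solutions that a hierarchical method does.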
How Agglomerative Approaches Work
• Start with all observations as their own cluster.
• Using the selected similarity measure, combine the two most
similar observations into a new cluster, now containing two
observations.
• Repeat the clustering procedure using the similarity measure to
combine the two most similar observations or combinations of
observations into another new cluster.
• Continue the process until all observations are in a single cluster.
• The divisive approach is the opposite of the agglomerative approach: it starts
with all observations in a single cluster and splits it step by step.
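The agglomerative steps above can be sketched as a short loop. This version assumes single linkage on Euclidean distance as the similarity measure, and the function names and data are hypothetical. It is a naive illustration, not an efficient implementation.

```python
import math

def single_link(c1, c2):
    """Distance between the closest pair of objects in two clusters."""
    return min(math.dist(a, b) for a in c1 for b in c2)

def agglomerate(points):
    """Return every cluster solution, from N single-object clusters down to 1."""
    clusters = [[p] for p in points]          # each observation is its own cluster
    history = [[list(c) for c in clusters]]
    while len(clusters) > 1:
        # Find the two most similar clusters.
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]),
        )
        # Combine them into a new cluster and record the solution.
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
        history.append([list(c) for c in clusters])
    return history

# Hypothetical observations; (20, 20) sits far from the rest.
observations = [(0, 0), (0, 1), (5, 5), (5, 6), (20, 20)]
solutions = agglomerate(observations)
print([len(s) for s in solutions])   # [5, 4, 3, 2, 1]
print(solutions[3])                  # the two-cluster solution
```

With five observations there are four merge steps. In the two-cluster solution the distant observation remains a single-member cluster.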
Agglomerative Algorithms
• Single Linkage (nearest neighbor)- interobject similarity is defined as
the distance between the closest objects in two clusters.
• Complete Linkage (farthest neighbor)- interobject similarity is based
on the maximum distance between objects in two clusters.
• Average Linkage- average distance from all objects in one cluster to
all objects in another cluster.
• Centroid Method- similarity between clusters is measured as the
distance between cluster centroids. When two clusters are combined a
new centroid is computed.
• Ward’s Method- similarity used to join clusters is the sum of squares
between the two clusters summed over all variables.
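The five agglomerative algorithms differ only in how they score a candidate pair of clusters. A minimal sketch on Euclidean distance follows. For Ward's method it assumes the standard identity that the increase in within-cluster sum of squares from a merge equals n1*n2/(n1+n2) times the squared distance between the centroids. All names and data are hypothetical.

```python
import math

def centroid(cluster):
    """Coordinate-wise mean of the objects in a cluster."""
    return tuple(sum(v) / len(cluster) for v in zip(*cluster))

def single_linkage(c1, c2):
    return min(math.dist(a, b) for a in c1 for b in c2)

def complete_linkage(c1, c2):
    return max(math.dist(a, b) for a in c1 for b in c2)

def average_linkage(c1, c2):
    return sum(math.dist(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))

def centroid_method(c1, c2):
    return math.dist(centroid(c1), centroid(c2))

def ward_increase(c1, c2):
    """Increase in within-cluster sum of squares if c1 and c2 are merged."""
    n1, n2 = len(c1), len(c2)
    return n1 * n2 / (n1 + n2) * math.dist(centroid(c1), centroid(c2)) ** 2

# Hypothetical clusters.
left = [(0, 0), (0, 2)]
right = [(3, 0), (3, 2)]

print(single_linkage(left, right))    # 3.0
print(complete_linkage(left, right))
print(average_linkage(left, right))
print(centroid_method(left, right))   # 3.0
print(ward_increase(left, right))
```

At each step an agglomerative procedure merges the pair of clusters with the smallest value of whichever measure was chosen.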
DERIVING CLUSTERS
• Selection of hierarchical or nonhierarchical methods is based on:
– Hierarchical clustering solutions are preferred when:
– A wide range, even all, alternative clustering solutions is to be examined
– The sample size is moderate (under 300-400, not exceeding 1,000) or a sample of the larger
dataset is acceptable
– Nonhierarchical clustering methods are preferred when:
– The number of clusters is known and initial seed points can be specified according to some
practical, objective or theoretical basis.
– There is concern about outliers since nonhierarchical methods generally are less susceptible to
outliers.
• A combination approach using a hierarchical approach followed by a nonhierarchical
approach is often advisable.
– A hierarchical approach is used to select the number of clusters and profile cluster centers
that serve as initial cluster seeds in the nonhierarchical procedure.
– A nonhierarchical method then clusters all observations using the seed points to provide more
accurate cluster memberships.
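A compact sketch of the combination approach: a hierarchical pass supplies the cluster centers, and a nonhierarchical pass uses them as seeds. It assumes the centroid method for the first pass and k-means for the second. The number of clusters k is supplied by the analyst here rather than chosen from the hierarchy, and all names and data are hypothetical.

```python
import math

def centroid(cluster):
    return tuple(sum(v) / len(cluster) for v in zip(*cluster))

def hierarchical_seeds(points, k):
    """Agglomerate (centroid method) until k clusters remain; return their centers."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: math.dist(centroid(clusters[ij[0]]), centroid(clusters[ij[1]])),
        )
        merged = clusters[i] + clusters[j]
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)] + [merged]
    return [centroid(c) for c in clusters]

def refine(points, seeds, max_iter=100):
    """Nonhierarchical pass (k-means) starting from the hierarchical centers."""
    seeds = list(seeds)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(seeds)), key=lambda s: math.dist(p, seeds[s]))
            for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for s in range(len(seeds)):
            members = [p for p, lab in zip(points, labels) if lab == s]
            if members:  # an empty cluster keeps its previous center
                seeds[s] = centroid(members)
    return labels, seeds

# Hypothetical data with two visible groups.
sample = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
start = hierarchical_seeds(sample, k=2)
final_labels, final_centers = refine(sample, start)
print(final_labels)
```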
Stage 5: Interpretation of the Clusters
• This stage involves examining each cluster in terms of
the cluster variate to name or assign a label accurately
describing the nature of the clusters.
Validation and Profiling of the Clusters
Validation:
• Cross-validation.
• Criterion validity.
Profiling: describing the characteristics of each cluster
to explain how they may differ on relevant
dimensions. This typically involves the use of
discriminant analysis or ANOVA.
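Profiling can be sketched as a table of cluster means plus a one-way ANOVA F statistic for each variable. The sketch below computes both by hand using only the standard library. The variables, data and cluster labels are hypothetical, and no p-value is computed.

```python
from statistics import mean

def profile(data, labels):
    """Mean of each variable within each cluster: {label: [means]}."""
    groups = {}
    for row, lab in zip(data, labels):
        groups.setdefault(lab, []).append(row)
    return {lab: [mean(col) for col in zip(*rows)] for lab, rows in groups.items()}

def anova_f(values, labels):
    """One-way ANOVA F statistic for a single variable across clusters."""
    groups = {}
    for value, lab in zip(values, labels):
        groups.setdefault(lab, []).append(value)
    grand = mean(values)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups.values())
    ss_within = sum((v - mean(g)) ** 2 for g in groups.values() for v in g)
    df_between = len(groups) - 1
    df_within = len(values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

# Hypothetical profiling variables: (age, visits per month), plus cluster labels.
respondents = [(22, 1), (25, 3), (24, 2), (51, 8), (48, 9), (50, 7)]
membership = [0, 0, 0, 1, 1, 1]

cluster_profile = profile(respondents, membership)
f_age = anova_f([r[0] for r in respondents], membership)
print(cluster_profile)
print(f_age)
```

A large F value indicates that the cluster means on that variable differ by much more than the spread within clusters, which is the kind of difference profiling is meant to describe.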