2014 cluster analysis handout
TRANSCRIPT
8/9/2019 2014 Cluster Analysis Handout
Cluster Analysis: Segmenting the Market

Cluster Analysis (classification analysis, numerical taxonomy):
a class of techniques used to classify objects or cases into relatively homogeneous groups, called clusters, based on the set of variables considered. Objects in each cluster tend to be similar to each other and dissimilar to objects in the other clusters.

objects: either variables or observations;
likeness: calculated from the measurements for each object.
Applications:
1. market segmentation: e.g., benefit segmentation: clustering consumers on the basis of benefits sought from the purchase of a product,
2. understanding buyer behaviors: e.g., by clustering consumers into homogeneous groups, a firm can examine the buying behavior or information-seeking behavior of each group,
3. identifying new product opportunities: e.g., by clustering brands and products to identify competitive sets within the market, a firm can examine its current offerings compared to those of its competitors to identify potential new product opportunities,
4. selecting test markets: e.g., by clustering cities into homogeneous clusters, a firm can select comparable cities to test various marketing strategies.
Distance measures for individual observations

• To measure similarity between two observations, a distance measure is needed
• With a single variable, similarity is straightforward
  • Example: income – two individuals are similar if their income level is similar, and the level of dissimilarity increases as the income gap increases
• Multiple variables require an aggregate distance measure
  • With many characteristics (e.g. income, age, consumption habits, brand loyalty, purchase frequency, family composition, education level, ...), it becomes more difficult to define similarity with a single value
• The best-known measure of distance is the Euclidean distance, which is the concept we use in everyday life for spatial coordinates.
Model:

Data: each object is characterized by a set of numbers (measurements);
e.g., object 1: (x_11, x_12, ..., x_1n)
      object 2: (x_21, x_22, ..., x_2n)
      ...
      object p: (x_p1, x_p2, ..., x_pn)

Distance: Euclidean distance, d_ij:

d_ij = sqrt[ (x_i1 - x_j1)^2 + (x_i2 - x_j2)^2 + ... + (x_in - x_jn)^2 ]
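The formula above can be written as a minimal plain-Python sketch (the handout itself works in SPSS; the function name here is our own, for illustration only):

```python
import math

def euclidean_distance(x_i, x_j):
    """Euclidean distance between two objects measured on the same n variables."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x_i, x_j)))

# Two objects measured on three variables
print(euclidean_distance((1.0, 2.0, 3.0), (4.0, 6.0, 3.0)))  # 5.0
```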
Example

Household   Income   Household Size
A           50K      5
B           50K      4
C           20K      2
D           20K      1

(unit for income: 10K)

d_AB = 1,  d_CD = 1
d_AC = sqrt(3^2 + 3^2) ≈ 4.24
d_BC = sqrt(3^2 + 2^2) ≈ 3.61

[Scatter plot: income ($, unit 10K) against household size, with A and B plotted near 50K and C and D near 20K.]
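The distances above can be checked with a short Python sketch (income rescaled to units of 10K, as in the slide, so both variables are on comparable scales):

```python
import math

def dist(p, q):
    """Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# (income in 10K, household size)
households = {"A": (5, 5), "B": (5, 4), "C": (2, 2), "D": (2, 1)}

print(round(dist(households["A"], households["C"]), 2))  # 4.24
print(round(dist(households["B"], households["C"]), 2))  # 3.61
print(dist(households["A"], households["B"]))            # 1.0
```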
Three Cluster Diagram Showing Between-Cluster and Within-Cluster Variation

Between-Cluster Variation = Maximize
Within-Cluster Variation = Minimize
[Scatter Diagrams for Cluster Observations: frequency of eating out (Low–High) plotted against frequency of going to fast food restaurants (Low–High).]
Comparison of Score Profiles for Factor Analysis and Hierarchical Cluster Analysis

              Variables
Respondent    1    2    3
A             7    6    7
B             6    7    6
C             4    3    4
D             3    4    3

[Line chart of the score profiles (scale 1-7) for Respondents A, B, C, and D across the three variables.]
Clustering procedures

• Hierarchical procedures
  • Agglomerative (start from n clusters to get to 1 cluster)
  • Divisive (start from 1 cluster to get to n clusters)
• Non-hierarchical procedures
  • K-means clustering
Hierarchical clustering
• Agglomerative:
  • Each of the n observations constitutes a separate cluster
  • The two clusters that are most similar according to some distance rule are aggregated, so that in step 1 there are n-1 clusters
  • In the second step another cluster is formed (n-2 clusters) by nesting the two clusters that are most similar, and so on
  • There is a merging in each step until all observations end up in a single cluster in the final step.
• Divisive:
  • All observations are initially assumed to belong to a single cluster
  • The most dissimilar observation(s) is extracted to form a separate cluster
  • In step 1 there will be 2 clusters, in the second step three clusters, and so on, until the final step produces as many clusters as there are observations. This technique is used in medical research and is not in the scope of our course.
• The number of clusters determines the stopping rule for the algorithms
Non-hierarchical clustering

• These algorithms do not follow a hierarchy and produce a single partition
• Knowledge of the number of clusters (c) is required
• In the first step, initial cluster centres (the seeds) are determined for each of the c clusters, either by the researcher or by the software
• Each iteration allocates observations to each of the c clusters, based on their distance from the cluster centres
• Cluster centres are computed again and observations may be reallocated to the nearest cluster in the next iteration
• When no observations can be reallocated or a stopping rule is met, the process stops
Distance between clusters
• Algorithms vary according to the way thedistance between two clusters is defined.
• The most common algorithm forhierarchical methods include
• centroid method
• single linkage method
• complete linkage method
• average linkage method
• Ward algorithm
Linkage methods• Single linkage method (nearest neighbour):
distance between two clusters is the minimumdistance among all possible distances betweenobservations belonging to the two clusters.
• Complete linkage method (furthest neighbour):nests two cluster using as a basis the maximumdistance between observations belonging toseparate clusters.
• Average linkage method: the distance betweentwo clusters is the average of all distancesbetween observations in the two clusters.
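A minimal Python sketch of the three linkage rules (illustrative only; the cluster coordinates below are made-up examples):

```python
import math

def dist(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def single_linkage(c1, c2):
    # nearest neighbour: minimum of all pairwise distances
    return min(dist(p, q) for p in c1 for q in c2)

def complete_linkage(c1, c2):
    # furthest neighbour: maximum of all pairwise distances
    return max(dist(p, q) for p in c1 for q in c2)

def average_linkage(c1, c2):
    # mean of all pairwise distances
    d = [dist(p, q) for p in c1 for q in c2]
    return sum(d) / len(d)

c1 = [(0.0, 0.0), (1.0, 0.0)]
c2 = [(4.0, 0.0), (6.0, 0.0)]
print(single_linkage(c1, c2))    # 3.0
print(complete_linkage(c1, c2))  # 6.0
print(average_linkage(c1, c2))   # 4.5
```

Note how the three rules can rank the same pair of clusters quite differently, which is why the choice of linkage changes the resulting dendrogram.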
Ward algorithm

1. The sum of squared distances is computed within each of the clusters, considering all distances between observations within the same cluster.
2. The algorithm proceeds by choosing the aggregation of two clusters which generates the smallest increase in the total sum of squared distances.

• It is a computationally intensive method, because at each step all the sums of squared distances need to be computed, together with all potential increases in the total sum of squared distances for each possible aggregation of clusters.
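A rough Python sketch of one Ward step. Note an assumption: we compute the within-cluster sum of squares from the cluster centroid, which is proportional to the all-pairs sum of squared distances described above; the function names and test points are our own:

```python
import itertools

def within_ss(cluster):
    # sum of squared deviations from the cluster centroid
    n = len(cluster)
    dims = len(cluster[0])
    centroid = [sum(p[d] for p in cluster) / n for d in range(dims)]
    return sum((p[d] - centroid[d]) ** 2 for p in cluster for d in range(dims))

def ward_merge(clusters):
    """Pick the pair of clusters whose merge least increases total within-cluster SS."""
    best = None
    for i, j in itertools.combinations(range(len(clusters)), 2):
        increase = (within_ss(clusters[i] + clusters[j])
                    - within_ss(clusters[i]) - within_ss(clusters[j]))
        if best is None or increase < best[0]:
            best = (increase, i, j)
    return best

clusters = [[(0.0, 0.0)], [(1.0, 0.0)], [(10.0, 0.0)]]
print(ward_merge(clusters))  # merging the two nearest points costs least: (0.5, 0, 1)
```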
Non-hierarchical clustering: K-means method

1. The number k of clusters is fixed
2. An initial set of k "seeds" (aggregation centres) is provided, e.g. the first k elements
3. Given a certain fixed threshold, all units are assigned to the nearest cluster seed
4. New seeds are computed
5. Go back to step 3 until no reclassification is necessary

Units can be reassigned in successive steps (optimising partitioning).
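The steps above can be sketched in plain Python as a toy version of Lloyd's algorithm, using the "first k elements" seeding mentioned in step 2 (this is an illustration, not how SPSS implements K-means internally):

```python
def kmeans(points, k, max_iter=100):
    """Toy K-means: seeds = first k elements, iterate until no reclassification."""
    seeds = [list(p) for p in points[:k]]          # steps 1-2: fix k, pick seeds
    assignment = [None] * len(points)
    for _ in range(max_iter):
        # step 3: assign each unit to the nearest seed (squared Euclidean distance)
        new_assignment = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, seeds[c])))
            for p in points
        ]
        if new_assignment == assignment:           # step 5: no reclassification, stop
            break
        assignment = new_assignment
        # step 4: recompute seeds as cluster means
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                seeds[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, seeds

points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.0)]
labels, centres = kmeans(points, k=2)
print(labels)   # [0, 0, 1, 1]
```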
Hierarchical vs. non-hierarchical methods

Hierarchical methods:
• No decision about the number of clusters needed in advance
• Problems when data contain a high level of error
• Can be very slow; preferable with small data sets
• At each step they require computation of the full proximity matrix

Non-hierarchical methods:
• Faster, more reliable, work with large data sets
• Need to specify the number of clusters
• Need to set the initial seeds
• Only cluster distances to seeds need to be computed in each iteration
How many clusters?

There are no hard and fast rules; consider:
a. theoretical, conceptual, or practical considerations;
b. the distances at which clusters are combined in a hierarchical clustering;
c. the relative size of the clusters should be meaningful, etc.
Outliers

• An outlier can distort your cluster solution if you don't remove it!
• But removing it can also affect your cluster solution (especially with a small sample size)!
• Should we standardize clustering variables?
• What is the effect of multicollinearity in cluster analysis?
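On the standardization question: a quick Python sketch of z-score standardization shows why scale matters. With raw values, income in dollars would dominate household size in any Euclidean distance; after standardization both variables contribute on the same scale (the data echo the earlier household example):

```python
import math

def zscores(values):
    """Standardize a variable to mean 0, standard deviation 1 (population SD)."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / sd for v in values]

income = [50000.0, 50000.0, 20000.0, 20000.0]   # dollars: huge numeric range
size = [5.0, 4.0, 2.0, 1.0]                     # household size: tiny range

print([round(z, 2) for z in zscores(income)])  # [1.0, 1.0, -1.0, -1.0]
print([round(z, 2) for z in zscores(size)])    # [1.26, 0.63, -0.63, -1.26]
```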
Cluster Analysis – Variable Selection

• Variables are typically measured metrically, but the technique can be applied to non-metric variables with caution.
• Variables are logically related to a single underlying concept or construct.

Variable  Description                                                          Type

Work Environment Measures
X1        I am paid fairly for the work I do.                                  Metric
X2        I am doing the kind of work I want.                                  Metric
X3        My supervisor gives credit and praise for work well done.            Metric
X4        There is a lot of cooperation among the members of my work group.    Metric
X5        My job allows me to learn new skills.                                Metric
X6        My supervisor recognizes my potential.                               Metric
X7        My work gives me a sense of accomplishment.                          Metric
X8        My immediate work group functions as a team.                         Metric
X9        My pay reflects the effort I put into doing my work.                 Metric
X10       My supervisor is friendly and helpful.                               Metric
X11       The members of my work group have the skills and/or training
          to do their job well.                                                Metric
X12       The benefits I receive are reasonable.                               Metric

Relationship Measures
X13       I have a sense of loyalty to McDonald's restaurant.                  Metric
X14       I am willing to put in a great deal of effort beyond that
          expected to help McDonald's restaurant to be successful.             Metric
X15       I am proud to tell others that I work for McDonald's restaurant.     Metric

Classification Variables
X16       Intention to Search                                                  Metric
X17       Length of Time an Employee                                           Nonmetric
X18       Work Type = Part-Time vs. Full-Time                                  Nonmetric
X19       Gender                                                               Nonmetric
X20       Age                                                                  Metric
X21       Performance                                                          Metric
Using SPSS to Identify Clusters

For this example we are looking for subgroups among all the 63 employees of McDonald's restaurant using the "organizational commitment" variables. The SPSS click-through sequence is: Analyze > Classify > Hierarchical Cluster. This will take you to a dialog box where you select and move variables X13, X14 and X15 into the "Variables" box. Next you go to the Statistics box; the agglomeration schedule is selected as the default option, and cluster membership 'none' is selected as default. We shall continue with the default options here. Next click on the 'Plots' box, check Dendrogram, and in the Icicle window click on the None button. Then Continue. Next click on the Method box and select Ward's under Cluster Method (it is the last option). Squared Euclidean Distance is the default under Measure and we will use it, and we do not need to standardize this data. We will not select anything in the Save option now. Now click on "OK" to run the program.
Notice the change in the coefficients in the last two stages.
Identify the number of clusters from the dendrogram.
Using SPSS to Identify Clusters

In the next step the SPSS click-through sequence is: Analyze > Classify > K-Means Cluster. This will take you to a dialog box where you select and move variables X13, X14 and X15 into the "Variables" box. In the box 'Number of Clusters', put 3 in place of 2. Next you go to the Save box and check Cluster Membership. Next click on Options, uncheck the initial cluster option and check the ANOVA table. Now click on "OK" to run the program.
Determine if clusters exist . . .
– Run ANOVA with cluster IDs and organizational commitment variables.
ANOVA

1. Move the cluster ID variable into the window
2. Move the three cluster variables into the window
3. Click on Options, check Descriptive, next Continue, and then OK
Step 1: Determine if clusters exist?
– 2 Cluster ANOVA Results –
Three issues to examine: (1) statistical significance, (2) cluster sample sizes, and (3) variable means.

Conclusion:
Cluster 1 – More Committed
Cluster 2 – Less Committed
Step 2: Determine if clusters exist?
– 3 Cluster ANOVA –
Must run "post-hoc" tests

1. Take the 2 cluster ID variable out and insert the 3 cluster ID variable
2. Click on the Post Hoc button and check Scheffe

Conclusions:
Cluster 1 – Least Committed
Cluster 2 – Moderately Committed
Cluster 3 – Most Committed

• Individual cluster sample sizes OK.
• Clusters significantly different, but must examine post hoc tests.
Step 2: Determine if clusters exist?
– 3 Cluster ANOVA –

Step 3: Determine if clusters exist?
– 4 Cluster ANOVA –
1. Remove the 3 cluster ID variable and insert the 4 cluster ID variable
2. Click OK to run
Determine if clusters exist?
– 4 Cluster ANOVA –

Conclusions:
1. Group sample sizes still OK.
2. Clusters are significantly different.
3. Means of the four clusters are more difficult to interpret – may want to examine "polar extremes". The most likely approach is to combine clusters 1 and 2 and do a three-cluster solution, or remove groups 1 and 2 and compare the extreme groups (3 & 4).

Four Cluster ANOVA
– Post Hoc results –
1. All clusters are significantly different.
2. Largest differences are consistently between clusters 3 and 4.
Decide number of clusters . . .

1. Examine the cluster analysis Agglomeration Schedule (error coefficients).
2. Consider cluster sample sizes.
3. Consider statistical significance.
4. Evaluate differences in cluster means.
5. Evaluate interpretation & communication issues.

Error Reduction:
1 – 2 Clusters = 58.4%
2 – 3 Clusters = 25.5%
3 – 4 Clusters = 22.8%
4 – 5 Clusters = 22.2%

Conclusion: benefit similar or less after 3 clusters.
Step 4: Describe cluster characteristics . . .

1. Use ANOVA
2. Remove clustering variables from the "Dependent List" window
3. Insert demographic variables
4. Change the "Factor" variable if necessary
Step 4: Describe cluster characteristics . . .

1. Go to Variable View.
2. Under the Values column, click on None beside the variable for the number of cluster groups you will examine.
3. Assign value labels to each cluster.
4. Run ANOVA on demographics.

• Describe demographic characteristics
• Conclusions – 3 cluster solution:
  • Clusters are significantly different.
  • More committed cluster (must know coding to interpret) . . .
    – Less likely to search (lower mean)
    – Full-time employees (code = 0)
    – Females (code = 1)
    – High performers (higher mean)
Thank you