2014 cluster analysis handout
TRANSCRIPT
8/9/2019 2014 Cluster Analysis Handout
Cluster Analysis: Segmenting the Market

Cluster Analysis (classification analysis, numerical taxonomy):
a class of techniques used to classify objects or cases into relatively homogeneous groups, called clusters, based on the set of variables considered. Objects in each cluster tend to be similar to each other and dissimilar to objects in the other clusters.

objects: either variables or observations;
likeness: calculated from the measurements for each object.
Applications:
1. market segmentation: e.g., benefit segmentation: clustering consumers on the basis of benefits sought from the purchase of a product,
2. understanding buyer behaviors: e.g., by clustering consumers into homogeneous groups, a firm can examine the buying behavior or information-seeking behavior of each group,
3. identifying new product opportunities: e.g., by clustering brands and products to identify competitive sets within the market, a firm can examine its current offerings compared to those of its competitors to identify potential new product opportunities,
4. selecting test markets: e.g., by clustering cities into homogeneous clusters, a firm can select comparable cities to test various marketing strategies.
Distance measures for individual observations

• To measure similarity between two observations, a distance measure is needed
• With a single variable, similarity is straightforward
  • Example: income – two individuals are similar if their income level is similar, and the level of dissimilarity increases as the income gap increases
• Multiple variables require an aggregate distance measure
  • With many characteristics (e.g. income, age, consumption habits, brand loyalty, purchase frequency, family composition, education level, ...), it becomes more difficult to define similarity with a single value
• The best-known measure of distance is the Euclidean distance, which is the concept we use in everyday life for spatial coordinates.
Model:

Data: each object is characterized by a set of numbers (measurements);
e.g., object 1: (x_11, x_12, ..., x_1n)
      object 2: (x_21, x_22, ..., x_2n)
      ...
      object p: (x_p1, x_p2, ..., x_pn)

Distance: Euclidean distance, d_ij:

d_ij = sqrt[ (x_i1 - x_j1)^2 + (x_i2 - x_j2)^2 + ... + (x_in - x_jn)^2 ]
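The formula above can be written as a minimal plain-Python sketch (the handout itself works in SPSS; the function name here is our own, for illustration only):

```python
import math

def euclidean_distance(x_i, x_j):
    """Euclidean distance between two objects measured on the same n variables."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x_i, x_j)))

# Two objects measured on three variables
print(euclidean_distance((1.0, 2.0, 3.0), (4.0, 6.0, 3.0)))  # 5.0
```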
Example

Household   Income   Household Size
A           50K      5
B           50K      4
C           20K      2
D           20K      1

(unit for income: 10K)

d_AB = 1,  d_CD = 1
d_AC = sqrt(3^2 + 3^2) ≈ 4.24
d_BC = sqrt(3^2 + 2^2) ≈ 3.61

[Scatter plot: income ($, unit 10K) against household size, with A and B plotted near 50K and C and D near 20K.]
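The distances above can be checked with a short Python sketch (income rescaled to units of 10K, as in the slide, so both variables are on comparable scales):

```python
import math

def dist(p, q):
    """Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# (income in 10K, household size)
households = {"A": (5, 5), "B": (5, 4), "C": (2, 2), "D": (2, 1)}

print(round(dist(households["A"], households["C"]), 2))  # 4.24
print(round(dist(households["B"], households["C"]), 2))  # 3.61
print(dist(households["A"], households["B"]))            # 1.0
```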
Three Cluster Diagram Showing Between-Cluster and Within-Cluster Variation

Between-Cluster Variation = Maximize
Within-Cluster Variation = Minimize
[Scatter Diagrams for Cluster Observations: frequency of eating out (Low–High) plotted against frequency of going to fast food restaurants (Low–High).]
Comparison of Score Profiles for Factor Analysis and Hierarchical Cluster Analysis

              Variables
Respondent    1    2    3
A             7    6    7
B             6    7    6
C             4    3    4
D             3    4    3

[Line chart of the score profiles (scale 1-7) for Respondents A, B, C, and D across the three variables.]
Clustering procedures

• Hierarchical procedures
  • Agglomerative (start from n clusters to get to 1 cluster)
  • Divisive (start from 1 cluster to get to n clusters)
• Non-hierarchical procedures
  • K-means clustering
Hierarchical clustering
• Agglomerative:
  • Each of the n observations constitutes a separate cluster
  • The two clusters that are most similar according to some distance rule are aggregated, so that in step 1 there are n-1 clusters
  • In the second step another cluster is formed (n-2 clusters) by nesting the two clusters that are most similar, and so on
  • There is a merging in each step until all observations end up in a single cluster in the final step.
• Divisive:
  • All observations are initially assumed to belong to a single cluster
  • The most dissimilar observation(s) is extracted to form a separate cluster
  • In step 1 there will be 2 clusters, in the second step three clusters, and so on, until the final step produces as many clusters as there are observations. This technique is used in medical research and is not in the scope of our course.
• The number of clusters determines the stopping rule for the algorithms
Non-hierarchical clustering

• These algorithms do not follow a hierarchy and produce a single partition
• Knowledge of the number of clusters (c) is required
• In the first step, initial cluster centres (the seeds) are determined for each of the c clusters, either by the researcher or by the software
• Each iteration allocates observations to each of the c clusters, based on their distance from the cluster centres
• Cluster centres are computed again and observations may be reallocated to the nearest cluster in the next iteration
• When no observations can be reallocated or a stopping rule is met, the process stops
Distance between clusters
• Algorithms vary according to the way thedistance between two clusters is defined.
• The most common algorithm forhierarchical methods include
• centroid method
• single linkage method
• complete linkage method
• average linkage method
• Ward algorithm
Linkage methods• Single linkage method (nearest neighbour):
distance between two clusters is the minimumdistance among all possible distances betweenobservations belonging to the two clusters.
• Complete linkage method (furthest neighbour):nests two cluster using as a basis the maximumdistance between observations belonging toseparate clusters.
• Average linkage method: the distance betweentwo clusters is the average of all distancesbetween observations in the two clusters.
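A minimal Python sketch of the three linkage rules (illustrative only; the cluster coordinates below are made-up examples):

```python
import math

def dist(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def single_linkage(c1, c2):
    # nearest neighbour: minimum of all pairwise distances
    return min(dist(p, q) for p in c1 for q in c2)

def complete_linkage(c1, c2):
    # furthest neighbour: maximum of all pairwise distances
    return max(dist(p, q) for p in c1 for q in c2)

def average_linkage(c1, c2):
    # mean of all pairwise distances
    d = [dist(p, q) for p in c1 for q in c2]
    return sum(d) / len(d)

c1 = [(0.0, 0.0), (1.0, 0.0)]
c2 = [(4.0, 0.0), (6.0, 0.0)]
print(single_linkage(c1, c2))    # 3.0
print(complete_linkage(c1, c2))  # 6.0
print(average_linkage(c1, c2))   # 4.5
```

Note how the three rules can rank the same pair of clusters quite differently, which is why the choice of linkage changes the resulting dendrogram.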
Ward algorithm

1. The sum of squared distances is computed within each of the clusters, considering all distances between observations within the same cluster.
2. The algorithm proceeds by choosing the aggregation of two clusters which generates the smallest increase in the total sum of squared distances.

• It is a computationally intensive method, because at each step all the sums of squared distances need to be computed, together with all potential increases in the total sum of squared distances for each possible aggregation of clusters.
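A rough Python sketch of one Ward step. Note an assumption: we compute the within-cluster sum of squares from the cluster centroid, which is proportional to the all-pairs sum of squared distances described above; the function names and test points are our own:

```python
import itertools

def within_ss(cluster):
    # sum of squared deviations from the cluster centroid
    n = len(cluster)
    dims = len(cluster[0])
    centroid = [sum(p[d] for p in cluster) / n for d in range(dims)]
    return sum((p[d] - centroid[d]) ** 2 for p in cluster for d in range(dims))

def ward_merge(clusters):
    """Pick the pair of clusters whose merge least increases total within-cluster SS."""
    best = None
    for i, j in itertools.combinations(range(len(clusters)), 2):
        increase = (within_ss(clusters[i] + clusters[j])
                    - within_ss(clusters[i]) - within_ss(clusters[j]))
        if best is None or increase < best[0]:
            best = (increase, i, j)
    return best

clusters = [[(0.0, 0.0)], [(1.0, 0.0)], [(10.0, 0.0)]]
print(ward_merge(clusters))  # merging the two nearest points costs least: (0.5, 0, 1)
```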
Non-hierarchical clustering: K-means method

1. The number k of clusters is fixed
2. An initial set of k "seeds" (aggregation centres) is provided, e.g. the first k elements
3. Given a certain fixed threshold, all units are assigned to the nearest cluster seed
4. New seeds are computed
5. Go back to step 3 until no reclassification is necessary

Units can be reassigned in successive steps (optimising partitioning).
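The steps above can be sketched in plain Python as a toy version of Lloyd's algorithm, using the "first k elements" seeding mentioned in step 2 (this is an illustration, not how SPSS implements K-means internally):

```python
def kmeans(points, k, max_iter=100):
    """Toy K-means: seeds = first k elements, iterate until no reclassification."""
    seeds = [list(p) for p in points[:k]]          # steps 1-2: fix k, pick seeds
    assignment = [None] * len(points)
    for _ in range(max_iter):
        # step 3: assign each unit to the nearest seed (squared Euclidean distance)
        new_assignment = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, seeds[c])))
            for p in points
        ]
        if new_assignment == assignment:           # step 5: no reclassification, stop
            break
        assignment = new_assignment
        # step 4: recompute seeds as cluster means
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                seeds[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, seeds

points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.0)]
labels, centres = kmeans(points, k=2)
print(labels)   # [0, 0, 1, 1]
```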
Hierarchical vs. non-hierarchical methods

Hierarchical methods:
• No decision about the number of clusters needed in advance
• Problems when data contain a high level of error
• Can be very slow; preferable with small data sets
• At each step they require computation of the full proximity matrix

Non-hierarchical methods:
• Faster, more reliable, work with large data sets
• Need to specify the number of clusters
• Need to set the initial seeds
• Only cluster distances to seeds need to be computed in each iteration
How many clusters?

There are no hard and fast rules; consider:
a. theoretical, conceptual, or practical considerations;
b. the distances at which clusters are combined in a hierarchical clustering;
c. the relative size of the clusters should be meaningful, etc.
Outliers

• An outlier can distort your cluster solution if you don't remove it!
• But removing it can also affect your cluster solution (especially with a small sample size)!
• Should we standardize clustering variables?
• What is the effect of multicollinearity in cluster analysis?
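On the standardization question: a quick Python sketch of z-score standardization shows why scale matters. With raw values, income in dollars would dominate household size in any Euclidean distance; after standardization both variables contribute on the same scale (the data echo the earlier household example):

```python
import math

def zscores(values):
    """Standardize a variable to mean 0, standard deviation 1 (population SD)."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / sd for v in values]

income = [50000.0, 50000.0, 20000.0, 20000.0]   # dollars: huge numeric range
size = [5.0, 4.0, 2.0, 1.0]                     # household size: tiny range

print([round(z, 2) for z in zscores(income)])  # [1.0, 1.0, -1.0, -1.0]
print([round(z, 2) for z in zscores(size)])    # [1.26, 0.63, -0.63, -1.26]
```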
Cluster Analysis – Variable Selection

• Variables are typically measured metrically, but the technique can be applied to non-metric variables with caution.
• Variables are logically related to a single underlying concept or construct.

Variable  Description                                                          Type

Work Environment Measures
X1        I am paid fairly for the work I do.                                  Metric
X2        I am doing the kind of work I want.                                  Metric
X3        My supervisor gives credit and praise for work well done.            Metric
X4        There is a lot of cooperation among the members of my work group.    Metric
X5        My job allows me to learn new skills.                                Metric
X6        My supervisor recognizes my potential.                               Metric
X7        My work gives me a sense of accomplishment.                          Metric
X8        My immediate work group functions as a team.                         Metric
X9        My pay reflects the effort I put into doing my work.                 Metric
X10       My supervisor is friendly and helpful.                               Metric
X11       The members of my work group have the skills and/or training
          to do their job well.                                                Metric
X12       The benefits I receive are reasonable.                               Metric

Relationship Measures
X13       I have a sense of loyalty to McDonald's restaurant.                  Metric
X14       I am willing to put in a great deal of effort beyond that
          expected to help McDonald's restaurant to be successful.             Metric
X15       I am proud to tell others that I work for McDonald's restaurant.     Metric

Classification Variables
X16       Intention to Search                                                  Metric
X17       Length of Time an Employee                                           Nonmetric
X18       Work Type = Part-Time vs. Full-Time                                  Nonmetric
X19       Gender                                                               Nonmetric
X20       Age                                                                  Metric
X21       Performance                                                          Metric
Using SPSS to Identify Clusters

For this example we are looking for subgroups among all the 63 employees of McDonald's restaurant using the "organizational commitment" variables. The SPSS click-through sequence is: Analyze > Classify > Hierarchical Cluster. This will take you to a dialog box where you select and move variables X13, X14 and X15 into the "Variables" box. Next you go to the Statistics box; the agglomeration schedule is selected as the default option, and cluster membership 'none' is selected as default. We shall continue with the default options here. Next click on the 'Plots' box, check Dendrogram, and in the Icicle window click on the None button. Then Continue. Next click on the Method box and select Ward's under Cluster Method (it is the last option). Squared Euclidean Distance is the default under Measure and we will use it, and we do not need to standardize this data. We will not select anything in the Save option now. Now click on "OK" to run the program.
Notice the change in the coefficients in the last two stages.
Identify the number of clusters from the dendrogram.
Using SPSS to Identify Clusters

In the next step the SPSS click-through sequence is: Analyze > Classify > K-Means Cluster. This will take you to a dialog box where you select and move variables X13, X14 and X15 into the "Variables" box. In the box 'Number of Clusters', put 3 in place of 2. Next you go to the Save box and check Cluster Membership. Next click on Options, uncheck the initial cluster option and check the ANOVA table. Now click on "OK" to run the program.
Determine if clusters exist . . .
– Run ANOVA with cluster IDs and organizational commitment variables.
ANOVA

1. Move the cluster ID variable into the window
2. Move the three cluster variables into the window
3. Click on Options, check Descriptive, next Continue, and then OK
Step 1: Determine if clusters exist?
– 2 Cluster ANOVA Results –
Three issues to examine: (1) statistical significance, (2) cluster sample sizes, and (3) variable means.

Conclusion:
Cluster 1 – More Committed
Cluster 2 – Less Committed
Step 2: Determine if clusters exist?
– 3 Cluster ANOVA –
Must run "post-hoc" tests

1. Take the 2 cluster ID variable out and insert the 3 cluster ID variable
2. Click on the Post Hoc button and check Scheffe

Conclusions:
Cluster 1 – Least Committed
Cluster 2 – Moderately Committed
Cluster 3 – Most Committed

• Individual cluster sample sizes OK.
• Clusters significantly different, but must examine post hoc tests.
Step 2: Determine if clusters exist?
– 3 Cluster ANOVA –

Step 3: Determine if clusters exist?
– 4 Cluster ANOVA –
1. Remove the 3 cluster ID variable and insert the 4 cluster ID variable
2. Click OK to run
Determine if clusters exist?
– 4 Cluster ANOVA –

Conclusions:
1. Group sample sizes still OK.
2. Clusters are significantly different.
3. Means of the four clusters are more difficult to interpret – may want to examine "polar extremes". The most likely approach is to combine clusters 1 and 2 and do a three-cluster solution, or remove groups 1 and 2 and compare the extreme groups (3 & 4).

Four Cluster ANOVA
– Post Hoc results –
1. All clusters are significantly different.
2. Largest differences are consistently between clusters 3 and 4.
Decide number of clusters . . .

1. Examine the cluster analysis Agglomeration Schedule (error coefficients).
2. Consider cluster sample sizes.
3. Consider statistical significance.
4. Evaluate differences in cluster means.
5. Evaluate interpretation & communication issues.

Error Reduction:
1 – 2 Clusters = 58.4%
2 – 3 Clusters = 25.5%
3 – 4 Clusters = 22.8%
4 – 5 Clusters = 22.2%

Conclusion: benefit similar or less after 3 clusters.
Step 4: Describe cluster characteristics . . .

1. Use ANOVA
2. Remove clustering variables from the "Dependent List" window
3. Insert demographic variables
4. Change the "Factor" variable if necessary
Step 4: Describe cluster characteristics . . .

1. Go to Variable View.
2. Under the Values column, click on None beside the variable for the number of cluster groups you will examine.
3. Assign value labels to each cluster.
4. Run ANOVA on demographics.

• Describe demographic characteristics
• Conclusions – 3 cluster solution:
  • Clusters are significantly different.
  • More committed cluster (must know coding to interpret) . . .
    – Less likely to search (lower mean)
    – Full-time employees (code = 0)
    – Females (code = 1)
    – High performers (higher mean)
Thank you