1
PSY6010: Statistics, Psychometrics and Research Design
Professor Leora Lawton
Spring 2006
Wednesdays 7-10 PM
Room 204
FACTOR ANALYSIS, CLUSTER ANALYSIS and SEGMENTATIONS
2
1. Purpose of Factor Analysis
Factor Analysis – a ‘data reduction’ technique
1. Technique for dealing with multicollinearity.
2. Used to transform Likert scales into factor scores as an alternative to a linear additive scale.
3. Groups attitudes that respondents tend to share into sets (explains variables in terms of their underlying dimensions).
4. Facilitates interpretation of a large number of variables.
5. Factor scores (the grouped attitudes) can then be used as independent variables.
3
2. Steps to conducting FA
• When creating a questionnaire, often you may want to include a number of attitudinal questions around certain issues.
• When analyzing the data with all these variables you start by selecting those attitudes that you think describe some overall category, for example ‘Taste in Music’.
• These attitudinal variables ideally should be of the same metric (e.g., 1,2,3,4,5). Some say the variables should have 7 values, but 5 works fine. Don’t use dichotomous variables.
• Begin by computing a correlation matrix of all the variables in question. There should be some significant correlations, both positive and negative.
• There should be a 4:1 ratio of cases to variables (e.g., 100 cases for 25 variables minimum), and sample size of at least 50.
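The two sample-size rules of thumb in the last bullet can be checked mechanically before running anything. This is a minimal sketch; the function name `meets_sample_size` and its parameter names are illustrative assumptions, and the thresholds are simply the ones stated on the slide.

```python
def meets_sample_size(n_cases, n_vars, min_ratio=4.0, min_cases=50):
    """Return True when both rules of thumb from the slide hold:
    at least `min_ratio` cases per variable and at least `min_cases` cases."""
    if n_vars <= 0:
        raise ValueError("n_vars must be positive")
    return n_cases >= min_cases and n_cases / n_vars >= min_ratio


# The slide's own example: 100 cases for 25 variables is exactly 4:1.
print(meets_sample_size(100, 25))
# A small study: 40 cases on 5 variables passes the ratio but not the minimum.
print(meets_sample_size(40, 5))
```

Both conditions have to hold at once, which is why a high ratio on a very small sample still fails.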
4
Correlation matrix of musical tastes
• Research issue: You’ve been asked by a music store owner to assist in increasing sales by making sure the placement of music genres in the store is optimal.
• Using GSS93 subset.sav, run a set of frequencies to check that the variables fit the requirements.
• Then run a correlation matrix of all the music questions.
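To make the correlation step concrete, here is a minimal sketch of a Pearson correlation that skips cases with a missing answer on either variable (pairwise deletion). That is one plausible reason the N differs from cell to cell in SPSS correlation output, though the slides do not say which option was used. The function name and the ratings below are made up for illustration; they are not GSS data.

```python
from math import sqrt


def pearson_pairwise(x, y):
    """Pearson r using only cases where both values are present (not None).
    Returns (r, n), where n is the number of complete pairs."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two complete pairs")
    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    sxx = sum((a - mean_x) ** 2 for a, _ in pairs)
    syy = sum((b - mean_y) ** 2 for _, b in pairs)
    if sxx == 0 or syy == 0:
        raise ValueError("a variable with no variance has no correlation")
    return sxy / sqrt(sxx * syy), n


# Hypothetical 1-5 ratings from six respondents; None marks a skipped item.
jazz_ratings = [1, 2, 4, None, 5, 3]
blues_ratings = [2, 2, 5, 4, None, 3]
r, n = pearson_pairwise(jazz_ratings, blues_ratings)
print(round(r, 3), n)
```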
5
Correlation Matrix
Correlations

                              bigband  blugrass country  blues    musicals classicl folk     jazz     opera    rap      hvymetal
bigband   Pearson Correlation 1        .267**   .111**   .269**   .526**   .365**   .311**   .280**   .363**   -.059*   -.091**
          Sig. (2-tailed)              .000     .000     .000     .000     .000     .000     .000     .000     .033     .001
          N                   1337     1253     1328     1307     1299     1302     1300     1320     1285     1292     1290
blugrass  Pearson Correlation .267**   1        .409**   .194**   .153**   .109**   .399**   .062*    .149**   -.026    -.005
          Sig. (2-tailed)     .000              .000     .000     .000     .000     .000     .024     .000     .352     .847
          N                   1253     1335     1331     1307     1286     1292     1296     1311     1277     1298     1298
country   Pearson Correlation .111**   .409**   1        .033     .008     -.109**  .214**   -.110**  -.029    -.041    -.075**
          Sig. (2-tailed)     .000     .000              .211     .757     .000     .000     .000     .275     .127     .005
          N                   1328     1331     1468     1421     1398     1409     1404     1436     1398     1417     1413
blues     Pearson Correlation .269**   .194**   .033     1        .220**   .199**   .167**   .556**   .206**   .183**   .107**
          Sig. (2-tailed)     .000     .000     .211              .000     .000     .000     .000     .000     .000     .000
          N                   1307     1307     1421     1434     1381     1387     1379     1416     1370     1391     1383
musicals  Pearson Correlation .526**   .153**   .008     .220**   1        .499**   .363**   .262**   .452**   .030     -.115**
          Sig. (2-tailed)     .000     .000     .757     .000              .000     .000     .000     .000     .272     .000
          N                   1299     1286     1398     1381     1412     1381     1362     1398     1359     1369     1366
classicl  Pearson Correlation .365**   .109**   -.109**  .199**   .499**   1        .407**   .281**   .583**   .014     .000
          Sig. (2-tailed)     .000     .000     .000     .000     .000              .000     .000     .000     .604     .996
          N                   1302     1292     1409     1387     1381     1425     1375     1406     1377     1383     1375
folk      Pearson Correlation .311**   .399**   .214**   .167**   .363**   .407**   1        .112**   .328**   -.058*   -.039
          Sig. (2-tailed)     .000     .000     .000     .000     .000     .000              .000     .000     .031     .144
          N                   1300     1296     1404     1379     1362     1375     1414     1393     1359     1374     1373
jazz      Pearson Correlation .280**   .062*    -.110**  .556**   .262**   .281**   .112**   1        .231**   .197**   .100**
          Sig. (2-tailed)     .000     .024     .000     .000     .000     .000     .000              .000     .000     .000
          N                   1320     1311     1436     1416     1398     1406     1393     1451     1388     1405     1401
opera     Pearson Correlation .363**   .149**   -.029    .206**   .452**   .583**   .328**   .231**   1        .116**   -.013
          Sig. (2-tailed)     .000     .000     .275     .000     .000     .000     .000     .000              .000     .623
          N                   1285     1277     1398     1370     1359     1377     1359     1388     1410     1370     1366
rap       Pearson Correlation -.059*   -.026    -.041    .183**   .030     .014     -.058*   .197**   .116**   1        .360**
          Sig. (2-tailed)     .033     .352     .127     .000     .272     .604     .031     .000     .000              .000
          N                   1292     1298     1417     1391     1369     1383     1374     1405     1370     1431     1392
hvymetal  Pearson Correlation -.091**  -.005    -.075**  .107**   -.115**  .000     -.039    .100**   -.013    .360**   1
          Sig. (2-tailed)     .001     .847     .005     .000     .000     .996     .144     .000     .623     .000
          N                   1290     1298     1413     1383     1366     1375     1373     1401     1366     1392     1423

Variable labels: bigband = Bigband Music; blugrass = Bluegrass Music; country = Country Western Music; blues = Blues or R & B Music; musicals = Broadway Musicals; classicl = Classical Music; folk = Folk Music; jazz = Jazz Music; opera = Opera; rap = Rap Music; hvymetal = Heavy Metal Music.
**. Correlation is significant at the 0.01 level (2-tailed).
*. Correlation is significant at the 0.05 level (2-tailed).
6
Evaluating Appropriateness of FA
• Check the correlation matrix, which examines only relationships between pairs of variables (i.e., bivariate, not multivariate, correlation).
• So, then select these variables into the FA.
• Analyze – Data Reduction – Factor
• Move all 11 music variables to the Variables window.
• Under Descriptives, click on the option for KMO and Bartlett’s test of sphericity.
• Use Bartlett’s Test of Sphericity to examine the entire matrix. You want to reject the null hypothesis that the matrix is an identity matrix (i.e., the test should be significant). An identity matrix is one in which all the correlations are 0 except, of course, the correlation between a variable and itself (=1). (Note that our text says not to place much value on this test in most cases.)
• KMO stands for the Kaiser-Meyer-Olkin Measure of sampling adequacy. It compares the magnitude of the observed correlation coefficients to that of the partial correlation coefficients (that is, what’s unique to a pair of attributes once the other variables are controlled). Here you want a number closer to 1. Less than .5 indicates that FA may not be appropriate. Ours is .748.
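For reference, the two diagnostics can be written out as formulas. The notation here is an assumption rather than the course text's: n is the number of cases, p the number of variables, R the correlation matrix, r_ij an observed correlation, and a_ij the partial correlation between variables i and j controlling for all the others.

```latex
\chi^2 = -\left[(n-1) - \frac{2p+5}{6}\right]\ln\lvert\mathbf{R}\rvert,
\qquad df = \frac{p(p-1)}{2}

\mathrm{KMO} = \frac{\sum_{i \neq j} r_{ij}^{2}}
                    {\sum_{i \neq j} r_{ij}^{2} + \sum_{i \neq j} a_{ij}^{2}}
```

With the 11 music variables, df = 11(10)/2 = 55. If R were an identity matrix its determinant would be 1 and the statistic would be 0, which is why a large, significant value is what you hope to see. KMO approaches 1 when the partial correlations are small relative to the observed ones.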
7
SPSS for PCA/FA
• Analyze – Data Reduction – Factor
• Under Extraction, choose the options for Principal Components, Eigenvalues over 1, Display unrotated factor solution and Scree plot.
• Note that there is an option for Number of Factors. There are times you may want to impose a number rather than letting SPSS decide for you (and it decides based on the eigenvalues in the extraction).
• For Rotation, choose Varimax (variance maximization; it’s the most commonly used), and Display Rotated Solution.
• For Scores, you will want to select Save as Variables/Regression once you have found your solution, but not while you are still in the exploration phase.
8
SPSS for PCA/FA
FACTOR
  /VARIABLES bigband blugrass country blues musicals classicl folk jazz opera rap hvymetal
  /MISSING LISTWISE
  /ANALYSIS bigband blugrass country blues musicals classicl folk jazz opera rap hvymetal
  /PRINT INITIAL KMO EXTRACTION ROTATION
  /CRITERIA MINEIGEN(1) ITERATE(25)
  /EXTRACTION PC
  /CRITERIA ITERATE(25)
  /ROTATION VARIMAX
  /METHOD=CORRELATION .
9
Interpreting SPSS results
• Under the chart ‘Total Variance Explained’ you will see that four factors have been identified, based on having eigenvalues > 1.
• The scree plot gives you a pictorial view of the eigenvalues. We have four factors; some might want to try a fifth, because that’s where the slope of the eigenvalues changes, or similarly, try only two. The most important thing is that the solution is interpretable: it makes sense, and the factors provide insight into your overall concept. An eigenvalue is the amount of variance in the correlation matrix that a factor accounts for, so the factor with the largest eigenvalue explains the most variance (and the bigger the differences in variance explained, the more distinguishable the factors are). Because each standardized variable contributes a variance of 1, a factor with an eigenvalue above 1 explains more than any single variable does.
• The unrotated matrix doesn’t tell you too much, so go directly to the rotated matrix: here’s where the ‘rotated view’ can give you a better picture of the distinctiveness of each factor. Rotation pushes each variable’s loadings toward either high or near-zero values, which makes the factors more distinguishable to the ‘naked eye.’
• In the rotated matrix, you then select the variables (attributes) with the highest coefficients on each factor. This one works out pretty well; sometimes you have to go back to the drawing board and redefine your set of variables.
• Try it by limiting the result to just two factors. What underlying issue might be explaining this result compared to the four-factor solution?
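A small worked case may help with what an eigenvalue measures. For just two standardized variables with correlation r, the correlation matrix [[1, r], [r, 1]] has eigenvalues 1 + r and 1 − r. The sketch below applies the eigenvalue-over-1 rule to the bigband and musicals pair, whose correlation in the matrix on slide 5 is .526. The function names are illustrative assumptions, and this is not how SPSS computes the 11-variable solution.

```python
def eigenvalues_two_variables(r):
    """Eigenvalues of the 2x2 correlation matrix [[1, r], [r, 1]], largest first."""
    return sorted([1 + r, 1 - r], reverse=True)


def kaiser_summary(eigenvalues):
    """Count eigenvalues above 1 and give each one's share of total variance.
    For a correlation matrix the eigenvalues sum to the number of variables."""
    total = sum(eigenvalues)
    n_retained = sum(1 for ev in eigenvalues if ev > 1.0)
    proportions = [ev / total for ev in eigenvalues]
    return n_retained, proportions


evs = eigenvalues_two_variables(0.526)  # bigband x musicals
n_keep, props = kaiser_summary(evs)
print(evs, n_keep, [round(p, 3) for p in props])
```

Here one component is retained and it carries about 76% of the variance in the pair, which is the same kind of information the ‘Total Variance Explained’ chart reports for the full set of variables.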
10
Interpreting SPSS results
Rotated Component Matrix(a)
                                          Component
                                          1       2       3       4
bigband   Bigband Music                   .597    .340    .206   -.189
blugrass  Bluegrass Music                 .164    .137    .813    .018
country   Country Western Music          -.074   -.045    .825   -.058
blues     Blues or R & B Music            .133    .850    .143    .105
musicals  Broadway Musicals               .764    .190    .033   -.091
classicl  Classical Music                 .841    .097   -.072    .046
folk      Folk Music                      .604   -.040    .463   -.012
jazz      Jazz Music                      .204    .843   -.086    .099
opera     Opera                           .785    .090    .006    .103
rap       Rap Music                       .020    .142   -.027    .793
hvymetal  Heavy Metal Music              -.044    .018   -.012    .822
Extraction Method: Principal Component Analysis. Rotation Method: Varimax with Kaiser Normalization.
a. Rotation converged in 5 iterations.
You want to find components where the coefficients are at least above .3 and where there is a clear demarcation between the highest coefficients per component. Note that folk music is high for both components 1 and 3. It is therefore sometimes worthwhile to set the number of components to one above, and one below, the default number based on the eigenvalue cutoff you’ve selected.
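The reading rule above can be sketched in a few lines: give each variable to the component where its loading is largest in absolute value, and flag any other component where it also reaches the .3 cutoff. The loadings are copied from the rotated component matrix; the function name `assign_components` and the choice to compare absolute values are assumptions made for this illustration.

```python
# Rotated loadings from the slide, components 1-4 in order.
LOADINGS = {
    "bigband":  (0.597, 0.340, 0.206, -0.189),
    "blugrass": (0.164, 0.137, 0.813, 0.018),
    "country":  (-0.074, -0.045, 0.825, -0.058),
    "blues":    (0.133, 0.850, 0.143, 0.105),
    "musicals": (0.764, 0.190, 0.033, -0.091),
    "classicl": (0.841, 0.097, -0.072, 0.046),
    "folk":     (0.604, -0.040, 0.463, -0.012),
    "jazz":     (0.204, 0.843, -0.086, 0.099),
    "opera":    (0.785, 0.090, 0.006, 0.103),
    "rap":      (0.020, 0.142, -0.027, 0.793),
    "hvymetal": (-0.044, 0.018, -0.012, 0.822),
}


def assign_components(loadings, threshold=0.3):
    """Map each variable to (primary component, list of cross-loading components).
    Components are numbered from 1, matching the SPSS output."""
    result = {}
    for name, row in loadings.items():
        strengths = [abs(value) for value in row]
        primary = strengths.index(max(strengths)) + 1
        cross = [i + 1 for i, s in enumerate(strengths)
                 if s >= threshold and i + 1 != primary]
        result[name] = (primary, cross)
    return result


for name, (primary, cross) in assign_components(LOADINGS).items():
    note = f" (also loads on {cross})" if cross else ""
    print(f"{name:9s} -> component {primary}{note}")
```

With a .3 cutoff, folk is flagged on component 3 as the slide notes, and bigband also shows a secondary loading (.340) on component 2.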
11
Scree Plot: Number of Components
12
Interpreting SPSS results
Rotated Component Matrix(a)
Component
1 2 3 4
Bigband Music 0.597 0.340 0.206 -0.189
Bluegrass Music 0.164 0.137 0.813 0.018
Country Western Music -0.074 -0.045 0.825 -0.058
Blues or R & B Music 0.133 0.850 0.143 0.105
Broadway Musicals 0.764 0.190 0.033 -0.091
Classical Music 0.841 0.097 -0.072 0.046
Folk Music 0.604 -0.040 0.463 -0.012
Jazz Music 0.204 0.843 -0.086 0.099
Opera 0.785 0.090 0.006 0.103
Rap Music 0.020 0.142 -0.027 0.793
Heavy Metal Music -0.044 0.018 -0.012 0.822
Extraction Method: Principal Component Analysis. Rotation Method:
Varimax with Kaiser Normalization.
a Rotation converged in 5 iterations.
13
Project Recommendations
Current
Aisle 1A    Aisle 1B    Aisle 2A    Aisle 2B
bigband jazz heavy metal rap
bluegrass blues musicals C&W
classical opera
folk
Recommended
Aisle 1A    Aisle 1B    Aisle 2A    Aisle 2B
bigband folk C&W metal
musicals classical bluegrass rap
opera blues
Jazz
14
Homework #8
• Using our own employee dataset (or if you wish, use your SDA data set and select your own variables), take the attitudinal variables below to understand how people define “quality of work.”
– V11 I have the necessary resources (e.g., computers, databases) to do my work comfortably and efficiently.
– V13 The work I'm responsible for is appropriate for my level of capability.
– V16 I'm challenged and interested in my work.
– V17 My immediate manager recognizes and acknowledges my contributions.
– V22 I have responsibility with the required authority.
– V24 I am satisfied with communications between management and employees.
– v41r Your total compensation (salary, bonuses)
– v42r 401(k), retirement and/or pension
– v43r Availability of PTO (vacation) days
– v44r The office itself (lighting, space, decor)
– v45r Performance awards and bonuses
15
Homework #8
• Run a frequencies test to make sure they are appropriate. Are they? Explain.
• Run a correlations table. Is this appropriate for PCA/FA? Explain.
• On this same selection of variables, conduct tests for KMO and Bartlett. Are we still on track for PCA/FA? Explain.
• Now conduct a factor analysis using these variables, setting the defaults as in the class example. Are you happy with this result? Then try setting the number of components differently, adding one or more, or subtracting, from the first result. Are you happy with this result? Explain.
• What can you say about components of Quality of Work?
16
Using Factor Scores
• Rarely are factor analyses conducted just for themselves. Rather, they are used as attitudinal measures to predict or be associated with other behavior or statuses.
• One could use factor scores as predictors in regression analyses.
• Or, as will be seen in segmentation later this semester, one can use factor scores to cluster with other characteristics to create typologies, or segments, of subgroups in a population.
• Today we’ll go back and use our music taste factors as predictors in other behaviors.
17
Review of Factor Analysis
First, let’s not twist our brains into pretzels, so begin by doing an automatic recode on all musical variables. Give them consistent new names by adding an ‘r’ at the start or end, e.g., BIGBAND becomes RBIGBAND.
/VARIABLES bigband blugrass country blues musicals classicl folk jazz opera rap hvymetal
18
Saving the Factor Score
• Analyze – Data Reduction – Factor
– Descriptives (check KMO and Bartlett’s test)
– Extraction (uncheck the unrotated matrix, check Scree Plot, and select method = principal components)
– Rotation (select Varimax)
– Scores (select Save as Variables)
• Run. Now look at your Variable View, and then at the Data View.
• Now run Descriptive Statistics – Descriptives (Mean, Std Dev, Min, Max).
19
Using Factor Scores in a Regression
• Now, let’s predict tv viewing.
• First, run a frequencies of the variable TV hours watched per week.
• Recode it so that 8 hours and above = 8.
• Create a conceptual model:
TV viewing = a + musical taste + education + sex + age.
Run your regression with these variables.
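The recode step above is a top-code: every value of 8 or more collapses to 8 and everything else is left alone. A minimal sketch of that logic follows; the function name `top_code` and the sample values are illustrative, and in class the recode is done in SPSS.

```python
def top_code(hours, cap=8):
    """Collapse values at or above `cap` to `cap`; keep missing values missing."""
    if hours is None:
        return None
    return min(hours, cap)


# Hypothetical answers to the TV hours question, including one missing value.
tv_hours = [0, 2, 5, 8, 12, None, 24]
print([top_code(h) for h in tv_hours])
```

Top-coding keeps a handful of very heavy viewers from dominating the regression while preserving the ordering of everyone below the cap.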
20
Homework #9
• Using the same factor analysis you ran last week with the employee data (see slide #14), run the factor analysis again and save the factor score variables.
• Now run a regression:
Overall satisfaction = a + (factor scores) + male + hours worked (hourswk) + whether there was a layoff (v32)
• Explain why this model makes theoretical sense. Now explain the results. If you were an HR manager, what areas would you either try to improve, or make sure they stay as good?
21
Segmentation Using Factor Analysis and Cluster Analysis
• As you learned last week, segmentation analysis is used to create typologies or categorical groups of constituents, such as customers, patrons, etc.
• Often segmentations employ factor score results as well.
• In a segmentation, one first develops any necessary factor scores and saves them as output variables (you will see them added to your data set).
• Then, because the purpose of the segmentation is to create groups that can then be reached through some sort of marketing (social or commercial), or for some other actionable purpose, use demographics that can be employed to target the groups.
• Then, with the factor scores and the sociodemographic variables identified as being logical, use a clustering technique to create the groups.
• We will use cluster analysis, but other techniques include discriminant (also in SPSS), CHAID and CART (separate software packages), and the most adventurous is latent class models (also separate software, such as AMOS).
22
Cluster Analysis - 1
• We’ll use GSS93 subset.sav.
• You will remember our musical factors (go back to slide #12 for results).
• First create names for your factor scores. I’ve labeled them: bigclass, bluejazz, cwgrass, heavyrap. Clients like meaningful labels, plus it helps you when reading the output.
• Then, consider possible demographic factors that might relate to musical taste, e.g., sex, age, race, region, education, income.
• Because this kind of analysis tends to be exploratory, you don’t need to specify the logic behind the relationships, but you should have some a priori idea about why these factors might be important in distinguishing the possible groups, in this case, musical taste.
• Cluster analysis doesn’t require recoding of IVs the way the other methods do…specify a categorical variable, or a covariate, as is appropriate.
23
Cluster Analysis - 2
• Analyze – Classify – TwoStep Cluster – select factors (categorical variables, e.g., sex) and covariates (ratio, interval or continuous variables).
• In our first round, do not specify the number of clusters.
• Because segmentations are part art, part science, you need to experiment until you find one that ‘works’ for you, so let’s try it with a different number of clusters.
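To see what "trying a different number of clusters" involves, here is a minimal sketch that uses plain k-means as a stand-in. It is not the SPSS TwoStep algorithm, which handles categorical variables and uses a likelihood-based distance; k-means works only on continuous scores. The points are made-up pairs of factor scores, and the function names are assumptions for this illustration.

```python
from math import dist


def kmeans(points, k, max_iter=100):
    """Plain k-means on tuples of numbers.
    Starts from the first k points, so repeated runs give the same answer."""
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    centers = [tuple(p) for p in points[:k]]
    labels = None
    for _ in range(max_iter):
        # Assign every point to its nearest center.
        new_labels = [min(range(k), key=lambda c: dist(p, centers[c]))
                      for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an emptied cluster keeps its old center
                centers[c] = tuple(sum(col) / len(members)
                                   for col in zip(*members))
    return labels, centers


def within_ss(points, labels, centers):
    """Total squared distance from each point to its cluster center."""
    return sum(dist(p, centers[lab]) ** 2 for p, lab in zip(points, labels))


# Hypothetical (factor 1, factor 2) scores for six respondents.
points = [(-1.2, -1.0), (1.0, 0.8), (-0.9, -1.3),
          (1.3, 1.1), (-1.0, -0.7), (0.8, 1.2)]
for k in (2, 3):
    labels, centers = kmeans(points, k)
    print(k, labels, round(within_ss(points, labels, centers), 3))
```

Comparing the within-cluster sum of squares across values of k is one rough guide, but as the slide says, the deciding question is whether the groups make sense for the purpose at hand.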
24
Syntax for Cluster Analysis
TWOSTEP CLUSTER
  /CATEGORICAL VARIABLES = sex politics
  /CONTINUOUS VARIABLES = bigclass bluejazz cwgrass heavyrap age educ
  /DISTANCE LIKELIHOOD
  /NUMCLUSTERS FIXED = 4
  /HANDLENOISE 0
  /MEMALLOCATE 64
  /CRITERIA INITHRESHOLD (0) MXBRANCH (8) MXLEVEL (3)
  /PLOT BARFREQ PIEFREQ
  /PRINT COUNT SUMMARY
  /SAVE VARIABLE=TSC_4337 .
AIM TSC_4337
  /CATEGORICAL sex politics
  /CONTINUOUS bigclass bluejazz cwgrass heavyrap age educ
  /PLOT ERRORBAR CATEGORY CLUSTER (TYPE=PIE) .
25
Segmentation Homework
• Use the same data set, but this time use the variables for tv viewing and attendance at sports events and art museums for your factors.
• Label the factors, then cluster them with age, sex, political views.
• Try it with 3, 4, and 5 clusters. Which do you find, if any, to be believable? Why?