

Categorical Data Analysis
Packet CD01

Schedule, Syllabus, Assignments
Spring 2014

Dale Berger
Claremont Graduate University
([email protected])
Statistics website: http://wise.cgu.edu

Contents:

Schedule
Syllabus
Exercise Set 1: Mediation and Moderation Analyses
Exercise Set 2: Nonparametric Statistics
Exercise Set 3: Binary Logistic Regression
Exercise Set 4: Resampling
Missing Data Resources
Supplemental Exercise Set 5 (not required): Log-linear Analysis

Additional Handouts:

CD02 Mediation Analysis with Regression
CD03 Moderation Analysis with Regression
CD04 PROCESS: An SPSS Macro for Mediation and Moderation
CD05 Selected Nonparametric Statistics
    D11: Wilcoxon Ws and Mann-Whitney U
    D12: Comparing two groups with SPSS
    D13: Wilcoxon T for paired data
    D14: SPSS Crosstabs (likelihood ratio, gamma, phi, kappa, etc.)
    D15: McNemar’s test of related proportions
    D16: Spearman r and SPSS
    Tables for Spearman r, Mann-Whitney U, Wilcoxon T, runs
CD06 Binary Logistic Regression
CD07 Log-linear Modeling
CD08 Resampling – permutation and bootstrapping procedures
CD09 Missing data
APA Statistical Methods in Psychology Journals (APA Taskforce on Statistical Methods)
Chance What your doctor should know about statistics

Instructor: Dale Berger

Teaching Associates: Nic Barreto, Maggie Burkhart, Val Dubon, Aly Lopez, Stephen Weltz

CD01 Categorical Data Analysis Assignments 1


Psychology 308d SCHEDULE Dale Berger
Categorical Data Analysis Spring 2014

Please read assignments before class. The schedule may be modified as the course progresses.

Date | Topic | Sources and Assignments
3/25 | Overview, Mediation and moderation | CD01, CD02, CD03, CD04; Howell 15.14
3/27 | Counting rules, binomial, sign test, Wilcoxon Ws, T, Mann-Whitney U | CD05 (nonparametric): D11, D12, D13, Howell 18.6-18.8
4/1 | Chi-square, Crosstab stats, median; Fisher Exact; McNemar, Kappa; runs | Howell 18.9-18.10; CD05: D14, D15
4/3 | Exercise 1 due; phi, Spearman’s rho, Gamma, Kruskal-Wallis, loess | Howell Chapter 10; CD05: D16
4/8 | Intro to binary logistic regression | Howell 15.15; CD06 (logistic regression)
4/10 | Logistic regression, propensity | Howell 15.15
4/15 | Exercise 2 due; Intro to resampling | CD08 (resampling); Howell 18.1 – 18.5
4/17 | Review; Resampling | CD08; WISE bootstrapping applet and guide
4/22 | Midterm Examination (mediation, moderation, nonparametrics) |
4/24 | No class | Go to WPA in Portland!
4/29 | Review midterm; loglinear | CD07 (loglinear)
5/1 | Exercise 3 due; missing data | CD09 (missing data)
5/6 | Data analysis guidelines; Ho testing | APA Task Force reports
5/8 | Exercise 4 due; Review | What your doctor should know about statistics
5/13 | Final Examination 8:30-11:50 (comprehensive) |
5/15 | Review scoring of final |
5/18 | CGU Graduation Day (join the celebration!) |


Psychology 308d Syllabus for Categorical Data Analysis Dale Berger Spring 2014

Course Description:

You will learn about the logic, application, and interpretation of selected models for categorical data analysis, including nonparametric tests of central tendency and relationships, logistic regression, and resampling. We also will discuss advanced regression applications including mediation and moderation with categorical data and SPSS macros, and missing data analysis.

Required Text:

Howell, D. C. (2013). Statistical methods for psychology (8th ed.). Belmont: Wadsworth.

Computer Support:

Class information, materials, links, and supplemental materials are on our Sakai website. Let me know if you have any difficulty accessing it. https://sakai.claremont.edu:8443/portal

http://wise.cgu.edu provides tutorials and other resources, plus links to other sites. Be sure to check the applet and exercises with resampling.

Howell’s free Resampling program can be downloaded from http://www.uvm.edu/~dhowell/StatPages/Resampling/Resampling.html

Also, remember that Google is your friend – seek out additional resources on your own.

Recommended Supplemental Sources:

Aiken, L.S. & West, S. G. (1991). Multiple regression: Testing and interpreting interactions. Newbury Park: Sage Publications.

Stevens, J. P. (2009). Applied Multivariate Statistics for the Social Sciences (5th ed.). New York: Routledge. [This inexpensive paperback is accessible, filled with examples and useful advice.]

Tabachnick, B. G., & Fidell, L. S. (2007). Using multivariate statistics (5th ed.). Needham Heights, MA: Allyn & Bacon. [This is an excellent resource for common multivariate methods.]

Professor:

Dale Berger 909-621-8084 (SBOS office); 607-3714 (direct line); 909-240-9566 (home)
Office Hours in ACB101: Most Tuesdays 2-4
Also by appointment and by email: [email protected]


Teaching Associates: 909-621-8084

Name | Email | Lab/Review Session | Room
Aly | [email protected] | Monday 4:00 -- 5:50 | ACB 119
Val | [email protected] | Tuesday 4:00 -- 5:50 | ACB 208
Maggie | [email protected] | Wednesday 11:00 -- 12:50 | Burkle 26
Nic | [email protected] | Wednesday 4:00 -- 5:50 | McManus 31
Stephen | [email protected] | Thursday 4:00 -- 5:50 | Harper 2

Class Meetings (March 25 – May 15, 2014):

Lecture (Dale): Tuesday and Thursday, 9:00-10:50am, Burkle 16
Lab/Review (TAs): Tuesday and Thursday, 8:00-9:00am and 11:00-11:50am, Burkle 16

Student Learning Outcomes:

By the end of this course, students will be able to do the following:
1. Conduct mediation and moderation analyses with continuous and/or categorical variables.
2. Assess the appropriateness of basic parametric tests and identify nonparametric alternatives where warranted.
3. Conduct and interpret selected nonparametric procedures, including the following:
    Wilcoxon Ws and Mann-Whitney U for independent groups
    Median test
    Wilcoxon T for paired data
    McNemar’s test for related proportions
    Spearman r and gamma for ordinal relationships
    Runs test (too many or too few)
4. Identify when to use binary logistic regression
    Run SPSS analyses and interpret
    Interpret odds ratios correctly
5. Identify when to use resampling procedures
    Run and interpret permutation tests and bootstrapping procedures

6. Describe findings for sophisticated and for lay audiences (the APA and the PTA).

Homework:

There are four exercise sets (each worth 20 points). Homework is due at 9:00am on the due date. Late homework will be penalized one point per day up to a maximum of five points deducted. Homework with scores less than 17 (not counting any late penalty) may be resubmitted to receive up to 17 points (not counting late penalty). The final homework is due not later than 9:00am on Thursday, May 15. Please be neat with your homework: Use a word processor for text where possible, identify problems by number, show your work, circle or highlight your answers, and present answers for computer problems on a separate sheet.

Study Partners:

You are encouraged to complete each homework exercise on your own, but then consult with one or two others to prepare a shared homework to turn in. Be sure that you can run and explain the computer output yourself.


Examinations:

The midterm examination will be on Tuesday, April 22, covering material from the first two homework exercise sets (mediation, moderation, nonparametrics). The final examination on Tuesday, May 13 will be comprehensive but will emphasize material covered after the midterm. Both exams run 9:00-11:50 in class and are open-book and open-notes; a calculator is allowed but no computer. You may come earlier to begin the exams at 8:30am if you wish. We will review the midterm exam during class on Tuesday, April 29 and review the final exam on Thursday, May 15.

CGU Policy on Grading Standards:

Letter Grade | Grade Points | Description | Learning Outcome
A | 4.0 | Complete mastery of course material and additional insight beyond course material | Insightful
B | 3.0 | Complete mastery of course material | Proficient
C | 2.0 | Gaps in mastery of the course material; not at level expected by the program | Developing
U | 0.0 | Unsatisfactory | Ineffective

Grading:

The course grade will be based on total points: homework (80), midterm examination (60), and final examination (100).

CGU Policy on Scientific and Professional Ethics:

The work you do in this course must be your own. Feel free to build on, react to, criticize, and analyze the ideas of others but, when you do, make it known whose ideas you are working with. You must explicitly acknowledge when your work builds on someone else's ideas, including ideas of classmates, professors, and authors you read. If you ever have questions about drawing the line between others' work and your own, ask the course professor who will give you guidance.


Categorical Data Analysis Exercise Set 1
Psychology 308d Dale Berger
Mediation and Moderation with categorical variables

Sources:
Howell, Section 15.14; class notes
Class handout on mediation and moderation

Exercises: For these exercises please use Word (or other word processor) to prepare your report. You may paste in key results from SPSS or create new tables and figures based on the SPSS results. Attach relevant SPSS output that includes syntax. Annotate your output by identifying key findings in your output.

We are interested in gender differences in beginning salaries for college graduates. We will use a data set provided by SPSS: University of Florida graduate salaries.sav, available for download on Sakai, under Resources, Data Files. This data set includes starting salary and gender for graduates of the University of Florida for four semesters (Fall 1989 through Spring 1991) from eight colleges. This is real data with some messy issues. If you get really stuck, you can check UFsalRev.sav, the same file with college4 and other variables already created.

Hints and answers follow the exercises, but first give it a try yourself.

1. The first task is to examine the data and prepare the data for further analyses.

a) Run frequencies on all variables (ask for Mean, Median, Minimum, Maximum, Skewness, and Kurtosis, ask for Histograms, suppress tables if more than 10 categories). Briefly summarize what you learned from this analysis.

b) We see that some colleges have as few as two graduates. Limit all further analyses to the four largest colleges, those with at least 50 graduates.

c) Create a new variable college4 coded 1,2,3,4 for the four colleges. Code 1=agriculture; 2=construction; 3=business; 4=engineering

d) Create a dummy code for each of the four colleges (e.g., agricd for Agriculture).

e) Scan the data for outliers. A good technique is ‘clustered box plots’: click Graphs, Boxplot, where Variable is Salary, Category is College, and Clusters are by Gender. We see a few extreme outliers in several different groups. Here reasonable people might do different things. We could define outliers on overall salary ignoring groupings, we could define outliers within different subgroups, we could omit outliers, we could Winsorize outliers, or we could try other transformations. Probably it would be best to Winsorize within Colleges. However, for purposes of the exercise, and to keep us all together, Winsorize the two most extreme cases on each end of the overall salary distribution and call the new variable salaryw2 (set the two most extreme values equal to the third most extreme on each end of the overall salary distribution for the four colleges). We can compare results for salary and salaryw2 later.


For Exercises 2 through 5 include a figure that shows the conceptual model for that analysis. Here is a figure for Exercise 3.

2. Does gender predict salaryw2 for the combined group of graduates from these four colleges, ignoring all other variables? Conduct this analysis with a t-test and then with regression. Report means and a test of statistical significance of gender differences.

3. Does graddate (D) mediate the relationship between gender (G) and salaryw2 (S)? Treat graddate as an interval variable. (3a) Test paths a and b with regression. (3b) Use the PROCESS macro from Hayes to conduct a bootstrap test of mediation. To run it on your own computer, download the PROCESS macro from http://www.afhayes.com/introduction-to-mediation-moderation-and-conditional-process-analysis.html and install it in SPSS. PROCESS has already been installed on the CGU lab computers under SPSS Regression. This is PROCESS Model 4 (simple mediation). Consult the handout CD04, first example.

4. Does college4 mediate the relationship between gender and salaryw2? (Hint: There is no nice test here because the mediator is a categorical variable. There are three a and three b paths. If both paths a and b are significant, that also indicates mediation. You can describe the difference between paths c and c′ as an indication of effect size.)

5. Does college4 moderate the relationship between gender and salaryw2? (Hint: You need to create three interaction components, each the product of gender and one college dummy variable, to add in the last step of the model predicting salaryw2. Test the moderating effect of college as a whole and each interaction component.)

6. Write an ‘Executive Summary’ for a nontechnical audience to describe the big picture, with a summary of the relationships between salary, graduation date, gender, and the four colleges. This should be no more than two pages including figures (include SPSS Clustered Boxplots to show the distributions of salaries for males and females within each college). Include findings from questions 1 to 5 in your summary.

Hints and answers for Exercise Set 1

First give it a go on your own for each question. Hints and answers are not complete.

1a. Run frequencies on all variables (ask for Mean, Median, Minimum, Maximum, Skewness, and Kurtosis, ask for Histograms, suppress tables if more than 10 categories). Some things that should catch your eye: kurtosis on Salary is 1.778, large enough to call for a close look at salary. The histogram doesn’t look bad though, maybe just a couple of outliers. Gender is split reasonably evenly (though maybe not within colleges). Some colleges have very few graduates. Everyone earned a BA, so that is not a variable in this data set. Graduate is just an ID code. We need to deal with colleges and maybe salary.

1b. We need to identify which colleges have at least 50 graduates. Check frequencies on College to find that four colleges qualify. These are 1, 3, 4, and 7 (check to make sure).


We can use Select Cases to limit the analyses to these colleges. We wish to select all cases for which College is 1, 3, 4, or 7. Click Data, Select Cases…, select If condition is satisfied, click If…, College, click the arrow to enter College into the selection box, click =, 1. Now click the ‘or’ symbol, which is a vertical line |, click College again, arrow to enter into selection box, =, 3, |. Repeat with College = 4 | College = 7. When you are finished, you should have
college = 1 | college = 3 | college = 4 | college = 7
Click Continue, OK. Check your handiwork by running Frequencies again on College. This analysis should confirm that you now are using data from only the four largest colleges.

Shortcut Hint: In the selection box you could enter college = 1 | by keyboard, highlight it and press Ctrl-c (i.e., hold Ctrl and press c) to copy, and then press Ctrl-v three times to insert three copies, then edit to select colleges 3, 4 and 7.

1c. For subsequent analyses, it will be convenient to have a variable for college that is coded 1 through 4 for the four colleges of interest. We can create that variable with a recode. Click Transform, Recode into Different Variables, select college, click arrow to enter college into the recode box, under Output Variable enter a new name like college4, click Change. Click Old and New Values…, if there are already values in the Old New: box, click on one at a time followed by Remove.

Under Old Value, select Value:, enter 1, under New Value select Value:, enter 1, click Add.
Under Old Value, Value:, enter 3, under New Value, Value:, enter 2, click Add.
Under Old Value, Value:, enter 4, under New Value, Value:, enter 3, click Add.
Under Old Value, Value:, enter 7, under New Value, Value:, enter 4, click Add.
Under Old Value, select All other values, under New Value, select System-missing, click Add, Continue.

Alternatively, you could simply enter the following syntax into the syntax window:
recode college (1=1)(3=2)(4=3)(7=4)(else=sysmis) into college4.

It would be good to provide value labels for this new variable. With syntax, simply enter the following:
value labels college4 1 'Agric' 2 'Construct' 3 'Business' 4 'Engineer'.

1d. We have four colleges, so we need only three dummy variables, but create all four anyway. You can use point and click commands as in 1c, but it is easier to just enter syntax like the following:

recode college4 (1=1)(2,3,4=0)(else=sysmis) into agricd.
recode college4 (2=1)(1,3,4=0)(else=sysmis) into constd.
recode college4 (3=1)(1,2,4=0)(else=sysmis) into busd.
recode college4 (4=1)(1,2,3=0)(else=sysmis) into engind.

It would be prudent to check that your coding worked as intended. I suggest running frequencies:
freq var=college4, agricd, constd, busd engind.
Check: Total N=1073
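The recode and dummy-coding logic in 1c and 1d can also be written out in a few lines of Python as a way to check your understanding. This is only an illustrative sketch, not part of the SPSS workflow: the function names are hypothetical, None stands in for SPSS system-missing, and the college codes in the loop are made up rather than read from the data file.

```python
# Illustrative sketch of the 1c/1d recodes; the actual analysis is done in SPSS.
RECODE_MAP = {1: 1, 3: 2, 4: 3, 7: 4}  # original college code -> college4
LABELS = {1: "Agric", 2: "Construct", 3: "Business", 4: "Engineer"}
DUMMY_NAMES = {1: "agricd", 2: "constd", 3: "busd", 4: "engind"}


def recode_college(college):
    """Return college4 (1-4), or None to stand in for system-missing."""
    return RECODE_MAP.get(college)


def dummy_codes(college4):
    """Return the four 0/1 dummy variables, or all None if college4 is missing."""
    if college4 is None:
        return {name: None for name in DUMMY_NAMES.values()}
    return {name: int(code == college4) for code, name in DUMMY_NAMES.items()}


# Made-up college codes for illustration (5 is a college that is not retained).
for college in [1, 3, 4, 7, 5]:
    c4 = recode_college(college)
    print(college, c4, LABELS.get(c4), dummy_codes(c4))
```

Each retained case has exactly one dummy variable equal to 1, which is the same thing the frequency check above verifies in SPSS.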

1e. Winsorizing (note the spelling – NOT Windsorizing) is a technique for ‘robust’ analyses that reduces the influence of extreme scores while retaining all cases. First find the k most extreme cases on each end of the distribution. An easy way to see the extreme values is to sort the cases by salary. You can go to the Data View window, right click on salary, and select ‘Sort ascending.’ Two cases have extremely large values; those values will be changed to the value for the third largest case (52000). The two cases with the smallest salaries (7200 and 8000) will be set to the value for the third smallest case (8000). We can create a new variable salaryw2 with a recode. We will use this variable instead of salary from here forward.

Click Transform, Recode into different variables…, select salary, click the arrow to move salary into the Input box. Under Output Variable, enter a name for the new recoded variable, salaryw2, and click Change. Click Old and New Values…, under Old Value, click Range, LOWEST through value:, enter 8000, under New Value enter 8000, click Add. Under Old Value, select Range:, enter 8000 in the first box, enter 52000 in the second box, under New Value select Copy old value(s), click Add. Under Old Value, select Range, value through HIGHEST:, enter 52000 in the box, under New Value, select Value:, enter 52000, click Add, Continue, OK.

Alternatively, you could go into the syntax window and enter the following code:
recode salary (Lowest thru 8000=8000) (8000 thru 52000=Copy) (52000 thru Highest=52000) into salaryw2.
exe.
(Isn’t syntax nifty?)
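To see what the recode accomplishes, here is a minimal Python sketch of Winsorizing the k most extreme values on each end. The function name and the salary list are hypothetical; the list only mimics the pattern described above (smallest values 7200, 8000, 8000 and a third-largest value of 52000), and the two largest values are invented.

```python
def winsorize(values, k=2):
    """Set the k most extreme values on each end to the (k+1)th most extreme value."""
    if len(values) < 2 * k + 1:
        raise ValueError("need at least 2k + 1 values")
    ordered = sorted(values)
    low, high = ordered[k], ordered[-(k + 1)]
    # Clamp every value into the range [low, high]; no cases are dropped.
    return [min(max(v, low), high) for v in values]


# Made-up salaries for illustration only, not the University of Florida data.
salary = [7200, 8000, 8000, 21000, 26000, 30000, 52000, 61000, 75000]
salaryw2 = winsorize(salary, k=2)
print(salaryw2)
```

Unlike trimming, every case is retained; only the extreme values are pulled in, which is why N stays at 1073 in the later analyses.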

2. Does gender predict salaryw2 for the combined group of graduates from these four colleges, ignoring all other variables? Conduct this analysis with a t-test and then with regression. Report means and a test of statistical significance of gender differences. Include a figure that shows an arrow from Gender to Salary.

Mean salary is $24,895.81 for females and $27,073.75 for males. t (1071) = 5.185, p < .001 (Note: if you provide the means, you can report the t value as positive.)

Regression gives a constant of 24895.81 with B for Gender = 2177.94; t (1071) = 5.185, p < .001. What do the B values and the test of Gender tell us in nontechnical terms?
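As a check on how the t-test and the regression fit together, the following Python sketch recomputes the gender difference and the pooled-variance t from summary statistics. It assumes the Total rows of the Descriptive Statistics table later in this packet (means, standard deviations, and Ns for salaryw2) and the coding 0 = Female, 1 = Male; the function name is hypothetical.

```python
import math


def pooled_t(mean1, sd1, n1, mean2, sd2, n2):
    """Independent-groups t (equal variances assumed) from summary statistics."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
    se_diff = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / se_diff, df


# (mean, SD, N) for salaryw2, taken from the Total rows of the table
female = (24895.8057, 6909.81443, 453)
male = (27073.7500, 6710.99433, 620)

constant = female[0]              # predicted salary when gender = 0 (female)
b_gender = male[0] - female[0]    # regression B: male mean minus female mean
t, df = pooled_t(*male, *female)
print(round(constant, 2), round(b_gender, 2), round(t, 3), df)
```

With a single dichotomous predictor, the regression constant is the mean of the group coded 0 and B is the difference between the two group means, so the regression test of Gender and the equal-variance t-test are the same test.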

3. Does graddate mediate the relationship between gender and salaryw2?

a = -0.14864, SEa = 0.06487, t = -2.2914, p = .0221
b = 508.8889, SEb = 197.3456, t = 2.5787, p = .0101
Check a*b = c – c'? -0.14864 * 508.8889 = -75.64; 2177.944 – 2253.584 = -75.64
PROCESS boot = -75.6394, SEboot ≈ 47.84, LLCI ≈ -189.4, ULCI ≈ -8.29
Your bootstrap values will differ somewhat from these. Attach output (not syntax).
(The obsolete Sobel Test Statistic is -1.7128, giving two-tailed p = .0868, non-significant.)
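The arithmetic in this answer can be verified with a short Python sketch. It takes a, b, their standard errors, c, and c′ from the values reported above and uses the common first-order form of the Sobel standard error, sqrt(b²·SEa² + a²·SEb²); treating a as the gender-to-graddate path and b as the graddate-to-salaryw2 path is the usual convention and is assumed here. The bootstrap interval itself cannot be reproduced without the raw data.

```python
import math

a, se_a = -0.14864, 0.06487        # assumed: gender -> graddate
b, se_b = 508.8889, 197.3456       # assumed: graddate -> salaryw2, controlling for gender
c, c_prime = 2177.944, 2253.584    # total and direct effects of gender

indirect = a * b                   # should match c - c' within rounding
sobel_se = math.sqrt(b ** 2 * se_a ** 2 + a ** 2 * se_b ** 2)
z = indirect / sobel_se
p_two_tailed = math.erfc(abs(z) / math.sqrt(2))  # two-tailed p from the normal distribution

print(round(indirect, 2), round(c - c_prime, 2), round(z, 4), round(p_two_tailed, 4))
```

The Sobel test assumes the product a*b is normally distributed, which it generally is not; that is why the bootstrap interval, which excludes zero here, can disagree with the non-significant Sobel result.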

4. Does college4 mediate the relationship between gender and salaryw2?

There is no simple test here for paths a and b: because the mediator is represented by three dummy variables, there are three a paths and three b paths. Instead, we can describe the difference between coefficients c and c′, describe the a and b paths, and test the overall relationships and each component.

First draw a figure to help conceptualize the model. We first enter gender as a predictor of salaryw2 (path c). Next we add college4 in the form of three dummy variables. The weight on gender in this model gives us c′.


For the analog of path b we can use R squared change to test whether the mediators as a group are related to salaryw2, and we can test individual contributions as well. For the analog of path a we actually have three paths because we have three mediators. Here ordinary regression is not appropriate because each dependent variable is a dichotomous dummy variable. Also, the predictor is dichotomous.

We can examine the relationships between the two categorical variables gender and college4 with a chi-square test of independence. First, a 2x4 crosstab and chi-square test of independence will test the null hypothesis that gender and college4 are independent. We find Chi-square (df=3) = 197.38, p<.001, so the analog to path a is nonzero.

Now we can put the information together to assess whether there is a mediating relationship. Compare c and c′, and then also assess paths a and b (or their analogs). We find c = 2177.944 and c′ = -184.434, which suggests nearly complete mediation. Gender by College4 shows a significant relationship, with more women than men in Agriculture but fewer in Business and especially fewer in Engineering and Construction, so path a is nonzero. When entered along with gender in a regression model to predict salaryw2, the three dummy variables for college contribute significantly to the model (R square change = .162, F(3, 1068) = 71.042, p < .001), so path b is nonzero. (Both a and b are nonzero.)
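The chi-square value for the gender by college4 crosstab can be checked by hand from the cell counts. The Python sketch below assumes the Ns reported in the Descriptive Statistics table later in this packet and computes the Pearson chi-square from observed and expected frequencies; it is an illustration of the calculation, not a replacement for the SPSS Crosstabs output.

```python
# Cell counts (N) by gender and college4, in the order Agric, Construct, Business, Engineer
observed = {
    "Female": [271, 4, 133, 45],
    "Male": [144, 51, 189, 236],
}

col_totals = [sum(col) for col in zip(*observed.values())]
row_totals = {gender: sum(counts) for gender, counts in observed.items()}
n = sum(row_totals.values())

chi_square = 0.0
for gender, counts in observed.items():
    for obs, col_total in zip(counts, col_totals):
        expected = row_totals[gender] * col_total / n   # expected count under independence
        chi_square += (obs - expected) ** 2 / expected

df = (len(observed) - 1) * (len(col_totals) - 1)
print(round(chi_square, 2), df)
```

Most of the chi-square comes from Agriculture (many more women than expected) and Engineering (many fewer), which matches the verbal description of path a above.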

5. Does college4 moderate the relationship between gender and salaryw2?

Create three interaction terms (e.g., COMPUTE GxAg= gender* agricd). Enter gender into the model at the first step, the three college dummy variables on the second step, and the three interaction terms on the third step. We can test the R squared added by the interaction terms on the third step, and we can examine the components represented by the individual interaction terms. Because all predictor variables are dichotomous, interpretation is easier if we do not center these variables (but be careful when interpreting main effects and significance tests for main effects in a model that includes the interaction terms – weights refer to cases that have values of zero on the other predictors).

The test of the interaction gives us an R squared added of only .006, which is not quite statistically significant, F(3, 1065) = 2.539, p = .055. Checking the coefficients table, we find that none of the three interaction components is significant when Engineering is used as the reference group. Note that we can use the B weights in the final model to examine the pattern of sex differences between colleges. The final model constant = 31377.778 indicates the predicted score for someone who is zero on all predictors: the average salary for women in Engineering is $31,377.78. The B = -596.42 for Gender indicates that male engineers are paid $596.42 less than female engineers. The negative B weights on all three dummy variables indicate that, for women, Engineers have the highest average salary. The B weight of -1195.245 for the Gender by Construction interaction indicates that relative to the sex difference in Engineering, in Construction the pay advantage for women is $1195.24 greater. On the other hand, in Business compared to Engineering, the pay advantage is $1922.09 more toward men, giving men $1922.09 – $596.42 = $1325.67 more in Business. These differences are not statistically significant, but they describe the sample exactly. The Descriptive Statistics table that follows summarizes the sample means in all eight gender by college groups. Verify the interpretations of these and other regression coefficients.


If we were interested only in describing group means, we could simply ask for descriptive statistics. However, in a more complex analysis we might have control variables that make it more difficult to generate descriptive tables. The computations we did with the B weights would apply to cases that are equivalent on the control variables.

Descriptive Statistics
Dependent Variable: salaryw2

college4 (Four largest colleges) | gender (Gender) | Mean | Std. Deviation | N
1 Agric | 0 Female | 24201.4760 | 7724.53988 | 271
1 Agric | 1 Male | 22992.3611 | 7543.20034 | 144
1 Agric | Total | 23781.9277 | 7674.64211 | 415
2 Construct | 0 Female | 29825.0000 | 4692.10330 | 4
2 Construct | 1 Male | 28033.3333 | 3881.16821 | 51
2 Construct | Total | 28163.6364 | 3923.16783 | 55
3 Business | 0 Female | 23969.1729 | 4080.37024 | 133
3 Business | 1 Male | 25294.8413 | 5718.56839 | 189
3 Business | Total | 24747.2826 | 5140.66790 | 322
4 Engineer | 0 Female | 31377.7778 | 4704.82774 | 45
4 Engineer | 1 Male | 30781.3559 | 5280.40858 | 236
4 Engineer | Total | 30876.8683 | 5189.21947 | 281
Total | 0 Female | 24895.8057 | 6909.81443 | 453
Total | 1 Male | 27073.7500 | 6710.99433 | 620
Total | Total | 26154.2637 | 6877.17171 | 1073
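One way to verify the interpretation of the B weights in Exercise 5 is to rebuild them from the cell means in this table. Because the model with gender, three college dummy variables, and three interaction terms has one parameter per cell, its weights are simple differences of cell means. The Python sketch below assumes Engineering as the reference college and gender coded 0 = Female, 1 = Male, as in the answer above; the variable names are hypothetical.

```python
# Cell means of salaryw2 from the table: (female mean, male mean) for each college
means = {
    "Agric": (24201.4760, 22992.3611),
    "Construct": (29825.0000, 28033.3333),
    "Business": (23969.1729, 25294.8413),
    "Engineer": (31377.7778, 30781.3559),
}

ref_female, ref_male = means["Engineer"]
constant = ref_female                 # women in Engineering (zero on all predictors)
b_gender = ref_male - ref_female      # sex difference within Engineering

weights = {}
for college, (female, male) in means.items():
    if college == "Engineer":
        continue
    weights[college] = {
        "dummy": female - ref_female,                 # women in this college vs. women in Engineering
        "interaction": (male - female) - b_gender,    # sex difference here vs. in Engineering
    }

print(round(constant, 3), round(b_gender, 2))
for college, w in weights.items():
    print(college, round(w["dummy"], 2), round(w["interaction"], 3))
```

With control variables in the model this shortcut no longer applies, which is the point made above about computations with B weights applying to cases that are equivalent on the controls.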

6. Provide an executive summary for a nontechnical audience, giving the big picture and a description of the relationships between salary, graduation date, gender, and the four colleges. This should be no more than two pages including figures (include SPSS Clustered Boxplots to show the distributions of salaries for males and females within each college).

To make the clustered boxplots: Click Graphs, Chart Builder…, Under Gallery select Boxplot, drag the Clustered Boxplot (the middle one) into the chart preview window, from the Variables: window drag salaryw2 onto the Y axis, try to drag college4 onto the X axis but it won’t go because it is a scale variable – right click college4 and select Nominal, now you can drag it to the X axis, finally drag Gender to the Cluster on X box.

To generate the table of means shown here:
Click Analyze, General linear model, Univariate.
Dependent = salaryw2; Fixed = college4, gender
Options, Descriptive statistics


Categorical Data Analysis Exercise Set 2
Psychology 308d Dale Berger
Nonparametrics

Topics:
Comparison of parametric t test and nonparametric tests
Binomial distribution, sign test
Introduction to resampling techniques

Sources:
Howell, Sections 18.6-18.8 for nonparametrics: Wilcoxon Ws and T
Howell, Sections 18.1-18.5 for resampling
Class handouts and class notes

Exercises:
For these two exercises please use Word (or other word processor) to prepare your report. Label the parts of each exercise. You may paste in key results from SPSS or create new tables and figures based on the SPSS results. Attach relevant SPSS output that includes syntax. Annotate your output by identifying and interpreting key findings in your output. SPSS doesn’t provide all of the analyses needed here.

1. You are evaluating a prenatal care program for teenage mothers. For the first eight mothers who were in the program for three trimesters the birth weights of their babies (in grams) were 2940, 3380, 4900, 2810, 2800, 3210, 3080, and 2950. For ten mothers who began care in the third trimester, birth weights were 1680, 3830, 3110, 2760, 1700, 2790, 3050, 2660, 1400, and 2775. We are interested in whether there is a difference in birth weight between the two populations represented by these samples. Hint: See CD05: D11 and D12.

a) Enter these data into SPSS in preparation for analysis. Include three columns (variables) with 18 rows each. The first column is an ID code, with sequential values from 1 to 18. The second column is a group code. Use 0 for only third trimester and 1 for all three trimesters (a bigger number means more treatment). The third column is birth weight. Include value labels for the group variable and variable labels for all three variables. Use the names id, group, and weight. [see CD05: D12]

b) Examine the shape of the distribution of weight for each group. [Use split file to compare the two groups. Under ‘Data’ in data view, split file, compare groups, use group. Then Graphs, histogram…, select weight as the variable.] Copy each histogram as an object and paste it into your Word document.

c) Conduct a t-test for independent groups (remember to turn off the split file). [See CD05: D12: Analyze, compare means, independent samples….] Can you assume equal variance?

d) Use SPSS to conduct a nonparametric Wilcoxon rank sum test to compare the two groups. What is the correct value for Ws? Does SPSS give this value? [See CD05: D12.]

e) Summarize your findings in no more than two pages. Include summary numbers in the text, and a discussion of the validity and interpretation of the various tests. You may include technical interpretations, but focus on non-technical interpretations.


2. Bumble is interested in how closely married couples agree in reporting their family income. He asked each partner in eight married couples independently to report their family income. He obtained the following data, in $1,000s:

Couple     1    2    3    4    5    6    7    8
Husband   32   18   50   35   47   87  160   45
Wife      30   18   49   30   45   80  100   50

Your task is to test whether there is a systematic difference between husbands and wives in reporting income. The null hypothesis is that there is no difference.

a) Enter these data into SPSS in preparation for analysis. Include three columns (variables) with 8 rows. The first column is a code number for each couple, with sequential values from 1 to 8. The second column is the income value reported by the husband, and the third column is the value reported by the wife. [See CD05: D13]

b) Conduct a paired t-test using SPSS. (Hint: Use Analyze, Compare means, Paired Samples t-test, select both husband and wife)

c) Conduct a Wilcoxon matched pairs signed ranks test using SPSS. (Hint: Analyze, Nonparametrics, 2 Related Samples….) Report T and the two-tailed p value. [See CD05: D13]

d) Summarize your findings. Include a discussion of the validity of each of the two tests.

3. Alice and Bob were on a selection committee where they each rated the same 50 candidates as ‘hire’ or ‘do not hire.’ They agreed on ‘hire’ for 10 candidates, and agreed on ‘do not hire’ for 30 candidates. Bob said ‘hire’ for 8 candidates where Alice said ‘do not hire’ and Alice said ‘hire’ for 2 candidates where Bob said ‘do not hire.’ See CD05: D14 and D15.

a) What is their percent agreement? Hint: make a 2x2 table for Alice vs. Bob.

b) Is their agreement greater than chance? Test and interpret.

c) What is the value of Cohen’s kappa? Interpret.

d) Is Bob significantly more likely to say ‘hire’ than Alice? Find the p-value from the SPSS test, and also the exact binomial p-value for x ≤ 2 using StatWISE Binomial (n = 10, X = 2, p = .5). Interpret.


Calculate 3a, 3b, 3c, and 3d by hand and also using SPSS (you may need to do some small hand calculations on the SPSS findings). Attach annotated SPSS output.


Study Questions (to discuss with others):

How do you decide whether to use a parametric or nonparametric test (e.g., t vs. W)?
Under what conditions might you use the t-test for independent groups even when the variances are unequal?

Hints and answers:

If you have trouble with SPSS, step through the examples in the handouts. Bring problems to the lab or office hours or email questions. Don’t spend a lot of time spinning wheels. Get help.

1c) t(16) = 1.975, p=.066 two-tailed
1d) Be sure to show how you knew the Ws=71 was wrong and how you found the correct Ws; correct Ws = 52; Mann-Whitney U = 16, p=.034 two-tailed is correct
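The values in 1c and 1d can be cross-checked outside SPSS. The following is a minimal Python sketch, not part of the SPSS workflow. It assumes the pooled-variance t-test, and it assumes that Ws is taken as the smaller of the rank sum for the smaller group and its complement n1(n1 + n2 + 1) - W. The variable names are illustrative only.

```python
import math

# Birth weights in grams, from Exercise 1
full_care = [2940, 3380, 4900, 2810, 2800, 3210, 3080, 2950]              # group = 1
late_care = [1680, 3830, 3110, 2760, 1700, 2790, 3050, 2660, 1400, 2775]  # group = 0

def mean(xs):
    return sum(xs) / len(xs)

def sum_sq(xs):
    """Sum of squared deviations from the group mean."""
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs)

n1, n2 = len(full_care), len(late_care)

# Pooled-variance t for independent groups, df = n1 + n2 - 2
pooled_var = (sum_sq(full_care) + sum_sq(late_care)) / (n1 + n2 - 2)
t = (mean(full_care) - mean(late_care)) / math.sqrt(pooled_var * (1 / n1 + 1 / n2))

# Rank all 18 weights together; there are no ties, so the ranks are 1..18
ordered = sorted(full_care + late_care)
rank = {w: i + 1 for i, w in enumerate(ordered)}
w_small = sum(rank[w] for w in full_care)   # rank sum for the smaller group (n = 8)
w_large = sum(rank[w] for w in late_care)   # rank sum for the larger group (n = 10)
w_prime = n1 * (n1 + n2 + 1) - w_small      # complement of the smaller group's rank sum
ws = min(w_small, w_prime)

# Mann-Whitney U: the smaller of the two U values
u = min(w_small - n1 * (n1 + 1) // 2, w_large - n2 * (n2 + 1) // 2)

print(round(t, 3), w_small, w_large, ws, u)
```

The rank sum of 71 belongs to the larger group, which is why it is not the Ws that the tables expect.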

2b) t(7) = 1.218, p=.263 two-tailed
2c) T = 4.50, N=7, p>.05; z = 1.612, p=.107 [SPSS Asymp. two-tailed uses a normal approximation; the exact two-tailed probability for T≤4 = .109 (double .0547)]
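The hand calculation for 2b and 2c can be sketched in Python as a check on the SPSS output. This sketch assumes that zero differences are dropped before ranking, that tied absolute differences receive average ranks, and that the normal approximation uses the usual tie correction to the variance. The exact tail probability is computed for untied ranks 1 to N, as in standard tables.

```python
import math
from itertools import combinations

# Reported family income in $1,000s, from Exercise 2
husband = [32, 18, 50, 35, 47, 87, 160, 45]
wife = [30, 18, 49, 30, 45, 80, 100, 50]

diffs = [h - w for h, w in zip(husband, wife)]
n_all = len(diffs)

# Paired t-test on all eight differences, df = 7
d_bar = sum(diffs) / n_all
sd = math.sqrt(sum((d - d_bar) ** 2 for d in diffs) / (n_all - 1))
t = d_bar / (sd / math.sqrt(n_all))

# Wilcoxon T: drop zero differences, then rank the absolute differences
nonzero = [d for d in diffs if d != 0]
n = len(nonzero)
abs_sorted = sorted(abs(d) for d in nonzero)

def avg_rank(value):
    """Average of the rank positions occupied by a (possibly tied) value."""
    positions = [i + 1 for i, v in enumerate(abs_sorted) if v == value]
    return sum(positions) / len(positions)

t_neg = sum(avg_rank(abs(d)) for d in nonzero if d < 0)
t_pos = sum(avg_rank(abs(d)) for d in nonzero if d > 0)
T = min(t_neg, t_pos)

# Normal approximation with tie correction to the variance
tie_sizes = [abs_sorted.count(v) for v in set(abs_sorted)]
var_T = n * (n + 1) * (2 * n + 1) / 24 - sum(c ** 3 - c for c in tie_sizes) / 48
z = (n * (n + 1) / 4 - T) / math.sqrt(var_T)

# Exact one-tailed P(T <= 4): count subsets of the ranks 1..n whose sum is <= 4
count = sum(
    1
    for k in range(n + 1)
    for subset in combinations(range(1, n + 1), k)
    if sum(subset) <= 4
)
p_one = count / 2 ** n

print(round(t, 3), n, T, round(z, 3), round(p_one, 4), round(2 * p_one, 3))
```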

3. You can enter the data in several different ways. You can create 50 cases in the SPSS data editor, perhaps with variables ID, Alice, and Bob, coding an ID number and perhaps 0 or 1 for Alice and Bob to indicate ‘hire’ or ‘do not hire.’ You may find it easy to use Excel to create the data, using the Excel copy feature to create multiple copies of each of the four kinds of cases. Then you can cut and paste into the SPSS data editor.

Perhaps the easiest way is to use the weighting feature in SPSS as shown in the handout D14: Statistics for Contingency Tables.

WEIGHT BY freq.
CROSSTABS
  /TABLES=Alice BY Bob
  /FORMAT=AVALUE TABLES
  /STATISTICS=CHISQ KAPPA MCNEMAR
  /CELLS=COUNT
  /COUNT ROUND CELL.

a) 80% b) Yes c) .532 d) no, two-tailed p = .109 (one-tailed p = .0547)
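The "by hand" part of Exercise 3 can be checked with a short Python sketch. It assumes a Pearson chi-square without continuity correction for part b, and an exact binomial (McNemar) test on the ten discordant pairs for part d. The variable names are illustrative only.

```python
from math import comb

# Cell counts from Exercise 3
both_hire = 10    # Alice 'hire', Bob 'hire'
alice_only = 2    # Alice 'hire', Bob 'do not hire'
bob_only = 8      # Alice 'do not hire', Bob 'hire'
both_no = 30      # both 'do not hire'
n = both_hire + alice_only + bob_only + both_no

# a) Observed proportion of agreement
p_obs = (both_hire + both_no) / n

# c) Cohen's kappa: agreement beyond what the marginal totals predict by chance
alice_hire = both_hire + alice_only
bob_hire = both_hire + bob_only
p_chance = (alice_hire * bob_hire + (n - alice_hire) * (n - bob_hire)) / n ** 2
kappa = (p_obs - p_chance) / (1 - p_chance)

# b) Pearson chi-square for the 2x2 table (df = 1, no continuity correction)
chi_sq = n * (both_hire * both_no - alice_only * bob_only) ** 2 / (
    alice_hire * (n - alice_hire) * bob_hire * (n - bob_hire)
)

# d) Exact McNemar test: binomial with n = discordant pairs, p = .5
discordant = alice_only + bob_only
p_one = sum(comb(discordant, k) for k in range(alice_only + 1)) / 2 ** discordant
p_two = min(1.0, 2 * p_one)

print(p_obs, round(kappa, 3), round(chi_sq, 2), round(p_one, 4), round(p_two, 3))
```

The McNemar test uses only the 8 + 2 = 10 candidates on whom Alice and Bob disagree, which is why n = 10 in the StatWISE Binomial calculation.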


Categorical Data Analysis Exercise Set 3
Psychology 308d Dale Berger
Binary Logistic Regression

Topics:
How does logistic regression differ from ordinary least squares regression?
How do we assess model fit with logistic regression?
How do we test the contribution of individual predictors or sets of predictors?
How do we use the model to make predictions for specific cases?

Sources:
Handout: Introduction to Binary Logistic Regression
Class notes
Howell, Section 15.15

Exercises: We would like to be able to predict who is likely to drive after drinking. We will use real data from a national survey of licensed drivers. You can access this data set on Sakai as driver.sav.

Our criterion variable (dependent variable) is the dichotomous variable willdd2, as described below.

For this exercise we will use the following variables:
age      Respondent’s age (0, 99 missing)
sex      Respondent’s sex (1=male; 2=female)
marst    Marital status (1=single; 2=married or stable; 3=divorced or separated; 4=widowed; 5=missing)
drink2   Did you drink alcohol in the past year? (0=no; 1=yes)
willdd2  Will you drive after drinking in the coming year? (0=no; 1=maybe)

1. Use binary logistic regression to predict willdd2 as the dependent variable using age, sex, and marital status. Limit the analyses to people who did drink some alcohol in the past year and people who have complete data. Include information on the size and direction of any effects along with tests of statistical significance.

Include a table of results and a writeup similar to the example shown at the end of the handout Introduction to Binary Logistic Regression. Address the following in the writeup. Attach relevant printout and syntax. Hints are on the next page.

a. How is age related to willdd2?
b. How is sex related to willdd2? Does sex contribute beyond age to predicting willdd2?
c. How is marital status related to willdd2? How is marital status related to willdd2 when we control for sex and age?

How likely is each of the following to indicate that they might drive after drinking in the next year?
d. a 90-year-old widowed woman
e. a 25-year-old divorced man


Source:

Berger, D.E., Snortum, J.R., Homel, R.J., Hauge, R., & Loxley, W. (1990). Deterrence and prevention of alcohol-impaired driving in Australia, the United States, and Norway. Justice Quarterly, 7(3), 453-465.

Hints:

Check the frequency distribution for each of the variables; note the coding for each variable.

To make sure that you have all of the same cases for each analysis, you can use the ‘select cases’ option. Data, Select cases, If…, click If, click drink2, click the black triangle to enter the variable into the top-right window, click = 1, click &, click age, click the triangle, click >=0, click &, click marst, click the triangle, click >=0, click &, …. Continue for sex and willdd2. The syntax begins drink2 = 1 & age >= 0 & marst >= 0 and continues in the same way for sex and willdd2.

Run frequencies on all of the variables you will use to make sure you have the same number of cases for each analysis. You should have 1122.

Test how well each variable by itself predicts the criterion. You can use a t-test for a continuous predictor and chi-square for a categorical or dichotomous predictor. These analyses will give you useful descriptive statistics.

Enter each variable by itself into a logistic model. Be sure to tell the program that marst is a categorical variable, or it will treat marst as a continuous variable. All marst variables go in on one step.

If you tell the program that sex is a categorical variable, pay close attention to how the program codes this variable. It may be reversed from the original coding.

The analyses are quite similar to those in the handout, but make sure you can do these analyses on your own and that you understand each step of the analysis and the interpretations.

Fewer than 10% of 90-year-old widowed women are predicted to report that they will drive after drinking, compared to just over 80% for 25-year-old divorced men. Find the predicted values to the nearest .1%.
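To turn fitted coefficients into a predicted value for a specific case, sum the intercept and each coefficient times the case's value to get the logit, then convert the logit to a probability. The Python sketch below illustrates only the arithmetic. The coefficient values and the dummy-variable names (female, married, divorced, widowed) are hypothetical placeholders, not estimates from driver.sav; substitute the B values and the coding reported in your own SPSS output.

```python
import math

def predicted_probability(intercept, coefs, values):
    """Return P(Y = 1) from a logistic model: 1 / (1 + exp(-logit))."""
    logit = intercept + sum(coefs[name] * values[name] for name in coefs)
    return 1.0 / (1.0 + math.exp(-logit))

# Hypothetical placeholder coefficients (NOT results from driver.sav)
intercept = 1.0
coefs = {"age": -0.03, "female": -0.5, "married": -0.2, "divorced": 0.3, "widowed": -0.4}

# A 25-year-old divorced man under the hypothetical dummy coding above
case = {"age": 25, "female": 0, "married": 0, "divorced": 1, "widowed": 0}

p = predicted_probability(intercept, coefs, case)
print(f"Predicted probability: {p:.3f}")
```

Check how SPSS coded sex and marst before plugging in values; if the reference category is reversed, the signs of the coefficients reverse as well.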

Study questions (to discuss with others, not to be turned in):

Why can’t we use ordinary regression when we have a dichotomous outcome variable?

Give an example where an odds ratio is very different from a risk ratio.


Categorical Data Analysis Exercise Set 4
Psychology 308d Dale Berger
Introduction to Resampling

Topics:
What is resampling? How do you explain to a lay audience?
What are the advantages and disadvantages of resampling?
What is the difference between bootstrapping and permutation tests with resampling?
What are recommended methods for dealing with missing data?

Resampling Sources:
Handout: Introduction to Resampling (CD08)
Class notes
Howell, Sections 18.1 to 18.5

Resampling Exercises:

For these exercises you can use Howell’s resampling program. The program, instructions for using the program, references for resampling, and discussions of resampling can be accessed through his web page http://www.uvm.edu/~dhowell/StatPages/Resampling/Resampling.html .

You are asked to assess the effectiveness of a program designed to help people who have been unemployed for more than six months. One goal of the program is to increase the number of job applications submitted by participants. Below are data from eight people who completed the program and from twelve people who are on a waiting list for the program. The data represent the number of job applications submitted in the past week.

Trainees: 5, 7, 5, 10, 3, 5, 9, 6

Control: 0, 2, 0, 1, 0, 0, 3, 0, 38, 2, 0, 1

1. Is a t-test for independent groups appropriate to test the null hypothesis that the population means are equal? Why or why not? (How do you know?)

2. Use Howell’s Randomization test for the difference of two medians to test the null hypothesis that the medians are the same for the two groups. The first step is to prepare the data for analysis with Howell’s program. Use Notepad or other program that can save a file in ASCII format (no hidden characters – Word has hidden characters). Remember to name the file with a .DAT extension rather than a .TXT extension so Howell’s program can see it.

a) What is the median for the Trainees?
b) What is the median for the Controls?
c) Describe your analysis and conclusions for a professional audience (e.g., APA journal).
d) Explain your findings for a lay audience (e.g., Board members).

3. Use Howell’s Bootstrapping difference of two medians to find a confidence interval for the difference in medians.

a) Describe your analysis and conclusions for a professional audience (e.g., APA journal).
b) Explain your findings for a lay audience (e.g., Board members).


Hints and answers:

1. You could use SPSS to do a thorough examination of your data, though an eyeball test may be sufficient to identify relevant issues.

2. Your data file should look like this:

8 12
5 7 5 10 3 5 9 6
0 2 0 1 0 0 3 0 38 2 0 1

c) Be sure to include a description of what you did, with size and direction of effects.

d) You might try your answer on a friend who is not familiar with resampling.
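If you want an independent check on the output from Howell’s program, the logic of both procedures can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not a reproduction of Howell’s algorithms: the randomization test reshuffles group labels and counts absolute median differences at least as large as the observed one, and the bootstrap uses a simple percentile interval. The function names, the number of resamples, and the seed are all illustrative choices.

```python
import random
from statistics import median

# Job applications submitted in the past week, from the exercise
trainees = [5, 7, 5, 10, 3, 5, 9, 6]
control = [0, 2, 0, 1, 0, 0, 3, 0, 38, 2, 0, 1]

def median_diff(a, b):
    return median(a) - median(b)

def permutation_p(a, b, reps=10000, seed=1):
    """Two-sided randomization p-value for the difference in medians."""
    rng = random.Random(seed)
    observed = abs(median_diff(a, b))
    pooled = a + b                      # new list, so the inputs are not modified
    extreme = 0
    for _ in range(reps):
        rng.shuffle(pooled)             # random reassignment of scores to groups
        diff = median_diff(pooled[:len(a)], pooled[len(a):])
        if abs(diff) >= observed:
            extreme += 1
    return extreme / reps

def bootstrap_ci(a, b, reps=10000, seed=1, level=0.95):
    """Percentile bootstrap interval for the difference in medians."""
    rng = random.Random(seed)
    stats = sorted(
        median_diff(rng.choices(a, k=len(a)), rng.choices(b, k=len(b)))
        for _ in range(reps)            # resample each group with replacement
    )
    lo = stats[int(reps * (1 - level) / 2)]
    hi = stats[int(reps * (1 + level) / 2) - 1]
    return lo, hi

print("Observed difference in medians:", median_diff(trainees, control))
print("Randomization p-value:", permutation_p(trainees, control))
print("Bootstrap 95% interval:", bootstrap_ci(trainees, control))
```

Results from resampling vary a little from run to run and from program to program, so expect values close to, but not identical with, those from Howell’s program.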

Missing Data Resources

SPSS documentation on missing values analysis:

ftp://public.dhe.ibm.com/software/analytics/spss/documentation/statistics/20.0/en/client/Manuals/IBM_SPSS_Missing_Values.pdf

Howell’s supplements:

http://www.uvm.edu/~dhowell/StatPages/More_Stuff/Missing_Data/Missing.html

http://www.uvm.edu/~dhowell/StatPages/More_Stuff/Missing_Data/Mixed Models for Repeated Measures.pdf .

http://www.uvm.edu/~dhowell/StatPages/More_Stuff/Missing_Data/Missing-Part-Two.html

http://www.uvm.edu/~dhowell/StatPages/More_Stuff/Missing_Data/Missing-Logistic/MissingLogistic.html


Categorical Data Analysis Non-required Exercise Set 5
Psychology 308d Dale Berger
Introduction to Loglinear Models

Topics:
Introduction to loglinear models
How to screen models and test hypotheses
Loglinear analysis with SPSS for Windows

Sources:
Class notes and handouts
Materials on Sakai, including SPSS User's Guide (CROSSTABS, HILOGLINEAR)

Non-required Exercises:

1. A random survey of 1000 U.S. licensed drivers included the following variables:

Name  Label                        Coding
A     Age                          (1=16-24; 2=25-44; 3=45+)
S     Sex                          (1=male; 2=female)
P     Preferred drink              (0=abstain; 1=beer; 2=wine; 3=spirits)
D     Drunk driving in past year   (0=none; 1=some)

This data set is available on Sakai as DDAT.SAV. The missing data code is 9 for each variable. An associated DDAT.TXT file describes the source of the data set. Use ‘Model Selection’ loglinear to analyze these data for drinkers only (do not include abstainers).

a) Find the ‘best’ loglinear model for these data.
b) Ask for the Association table. Interpret the test of the partial effect of PD.
c) Test the fit of model [D,ASP]. Interpret.
d) Show how the degrees of freedom are computed for model [D,ASP].

2. Summarize your findings. Include the following:
a) Describe the research issue (how are age, sex, and preferred drink related to D)
b) Describe the source of your data (n, sampling)
c) Specify any a priori hypotheses
d) Check data for problems (describe checks of coding, distributions)
e) Analyze the data - fit specific models
f) Report results - include effect size (report proportions, perhaps in a table)
g) Interpret findings for a lay audience
h) Annotate the output file for future reference and include the syntax


Question for discussion (not to be turned in):

Bumble did a loglinear analysis with four factors (A, B, C, and D). He told you that he used ‘backwards elimination’ and the program told him that the best model was [AC,BC,ABD]. Why might this not be the 'best' model for the data?

Hints and answers (have a go at the problems before you look at our answers):

The loglinear handout has a similar analysis - you can use the syntax file for that analysis as a guide. One approach is to type the syntax directly into the syntax window. An alternate approach is to use the mouse to point and click in SPSS. See the loglinear handout for more details. There are 612 valid cases.

1. a) /method = backward will generate model [PD,AP,SP,ASD] as the ‘final’ model in SPSS. When interpreting, be aware that including ASD in the model implies that AS, AD, and SD are also in the model. Any (or none) of these two-way interactions may be important. Consider the size of the effects as reflected in differences in proportions.

b) You can take this from the association table, or you can compare two models.

c) The test of the model [D,ASP] gives chi-square (df=17) = 136.68, p<.001. The only terms not in this model are all of the interactions with D. The model does not fit the observed data. This indicates that some interaction(s) with D are important (i.e., the proportion of drunk drivers depends on the levels of other variables).

d) You can calculate df for terms not in the model, or subtract df for terms in the model from the total number of cells. Recall that if ASP is in the model, this tells you that AS, AP, SP, A, S, P, and the mean are each in the model as well.

e) Paste your commands into a syntax file, edit them if necessary before the final run, and then save them into a separate file that can be combined with your output file.
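The degrees of freedom described in hint 1d can be verified with a short Python sketch. It assumes that P has three levels once abstainers are excluded, that each effect uses the product of (levels - 1) over its factors, and that a hierarchical model includes every lower-order term implied by its generators. The function names are illustrative only.

```python
from itertools import combinations
from math import prod

# Number of levels per factor; P has 3 levels once abstainers are dropped
levels = {"A": 3, "S": 2, "P": 3, "D": 2}

def term_df(term):
    """df for one effect: product of (levels - 1); the mean ("") counts as 1."""
    return prod(levels[f] - 1 for f in term)

def implied_terms(generators):
    """All effects implied by hierarchical generators, including the mean."""
    terms = set()
    for g in generators:
        for k in range(len(g) + 1):
            terms.update("".join(c) for c in combinations(g, k))
    return terms

cells = prod(levels.values())                    # total cells in the A x S x P x D table
fitted = implied_terms(["D", "ASP"])             # terms in model [D,ASP]
model_df = cells - sum(term_df(t) for t in fitted)

print(cells, sorted(fitted), model_df)
```

The same answer comes from summing the df for the terms left out of the model (AD, SD, PD, ASD, APD, SPD, and ASPD).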

Discussion Question (not to be turned in):
The program selects the ‘best’ model on the basis of statistical significance alone, so there is a substantial risk of capitalization on chance. A new set of data may find a different ‘best’ model. Also, it is possible that a different approach would arrive at a different solution (e.g., the final model may vary with the level of alpha that you use). It may be that some effects are much larger than others, and that some of the effects included in the model are very small in comparison, so it may be more meaningful to include only the larger effects. If the sample size is very large, even trivially small effects may be statistically significant.
