Friday 20th February: Cross-Tabular Analysis

Upload: sydney-taylor

Post on 29-Dec-2015


Page 1: Friday 20th February Cross-Tabular Analysis. Outline Week 9 (two weeks time) – seminar organization Recap on last week Cross-tabular Analysis –Testing

Friday 20th February

Cross-Tabular Analysis

Outline
• Week 9 (two weeks time) – seminar organization
• Recap on last week
• Cross-tabular Analysis
  – Testing for association: Chi-Square
  – Testing for the strength of association: Cramér’s V and Odds Ratios
• Choosing statistical tests
• Multivariate Analysis – quick overview
  – What multivariate analysis tells you
  – Multivariate analysis in cross-tabular analysis

Week 9 (in two weeks)

In the seminar we’ll be looking at multivariate analysis in published work. In preparation for this I would like you to work in groups to take a look at a published sociological article that employs multivariate analysis. It is up to you to choose an article that you are interested in. If you want to check whether your article is suitable, email me a link to it (or show it to me). Generally any article that says it uses multiple (or ordinal) regression or logistic regression will be okay. If you don’t have other ideas the Social Science Quarterly is available online and a large proportion of the articles include multivariate analysis, covering a wide range of topics (albeit with a somewhat US focus): http://www.blackwell-synergy.com/loi/ssqu

In your group, read the article, and try to work out:
• What is the question that is being addressed (hypothesis or hypotheses)?
• What is the dependent variable (or variables)?
• What is the main independent variable(s) (the focus of the article)?
• What control variable(s) is/are being used?
• Which independent/control variables have a significant effect?
• What does this mean substantively?

*Do not worry if you do not understand everything. Just work out what you can!

Group 1: Sevine, Stephen, Anne-Claire, Mael, Aysenur

Group 2: Elgars, Caroline, Kassandra, Isobel, Ty

Group 3: Emma, Shahen, Lucy, Weronika, Nikos

Group 4: Tom, Jenny, Pamela, Kaz

Group 5: Ulf, Faeeza, Danny, Jiyoung, Tahsina

Last week recap
• Hypothesis testing involves testing the NULL HYPOTHESIS: no effect/relationship, or independence between variables.
• When we test a hypothesis we test whether relationships/differences that we observe in our sample are likely if the null hypothesis were true of the population.

• The sampling distribution (the distribution of possible samples) plays an important role in inferential statistics because it allows us to work out how much variation is likely to be produced by sampling error (i.e. that the sample of Warwick students contained an extraordinarily large number of people who drink a lot), as opposed to a real population parameter (i.e. that the population of Warwick students drink a lot).

• When we find a very low probability (p<0.05, or less than 5%) that we could have found what we found in our sample if the null hypothesis were true we can infer that it is unlikely to be true.

• And therefore, that the ALTERNATIVE HYPOTHESIS (of difference/ an effect) is likely to be true.

• This is a sort of backwards logic. So… if you find it easier to think forwards, the simple version (although less technically correct) is that when we find that p<0.05 we have found a relationship/effect.

Last week recap

Last week we looked at tests that had interval-ratio variables (such as income, or years of residence) as dependent variables. Specifically we looked at tests that investigated:

• Whether our population has a mean that is different from a stated mean (z-tests, or ‘one-sample t-test’ in SPSS). e.g. Do people tend to stay in their domestic residence for 10 years?

• Whether our population has a proportion in category x that is different from a stated proportion (‘binomial’ tests in SPSS). e.g. Is the proportion of people who have been in their residence over five years 50%?

• Whether two groups have population means that are different from one another (a t-test, or ‘independent sample t-test’ in SPSS). e.g. Do men and women have different lengths of residency?

• Whether the (more than two) different categories of a variable (e.g. social class) have means that are different from one another (ANOVA, or ‘one way ANOVA’ in SPSS). e.g. Are people in different social classes likely to have different lengths of residency?

Categorical Data analysis
• Today we are going to look at relationships between categorical variables (e.g. gender, race, religion).
• When both variables are categorical we cannot produce means. But we can construct contingency tables that show the frequency with which cases fall into each combination of categories, e.g. ‘man’ and ‘Christian’ (we cannot do this with continuous variables such as age as most people would fall into different categories and so the tables would be enormous and unmanageable).

• When we conduct statistical analysis of tabular data we are trying to work out whether there is any systematic relationship between the different variables being analyzed or whether cases are randomly distributed across the cells.

• Therefore the tests that we do compare what we find with what might be expected if there were no relationship – this is the null hypothesis.

Example: A table showing (imaginary) frequencies for outlook on life by gender

                    Men    Women    Total
Glass half empty    15     35       50
Glass half full     25     25       50
Total               40     60       100

• So, what proportion of men think the glass is half-empty?
• 15/40 (.375)
• What proportion of women think the glass is half-empty?
• 35/60 (.58)
• So: Can we say that men and women have significantly different attitudes to life, or could this apparent difference be the result of chance sampling error?

How do you work out whether the difference between men and women is likely to be due to chance?
• For each cell in the table we work out the frequency that we would expect and see how much the observed frequency differs from this.
• We then square this difference and divide by the expected frequency.
• We then sum these values.
• The observed frequency for each cell is what we see.
• The expected frequency is the frequency that you would get in each cell if men and women were exactly as likely as each other to think the glass was half empty (or half full) but the totals remained the same (i.e. there were still 40 men and 60 women and there were still 50 people who thought the glass was half empty and 50 who thought it was half full).

                    Men    Women    Total
Glass half empty    15     35       50
Glass half full     25     25       50
Total               40     60       100

We use chi-square (χ2) to look at the difference between what we observe and what would be likely if there were no difference except that generated by chance:

$$\chi^2 = \sum_{i,j} \frac{(\text{Observed}_{ij} - \text{Expected}_{ij})^2}{\text{Expected}_{ij}}$$

How do you work out expected frequencies?

$$\text{Expected}_{i,j} = \frac{\text{Column Total}_i \times \text{Row Total}_j}{n}$$

In the following: CT = column total, RT = row total.

$$\text{Expected}_{\text{Men},\text{Empty}} = \frac{CT_{\text{Men}} \times RT_{\text{Empty}}}{n} = \frac{40 \times 50}{100} = 20$$

$$\text{Expected}_{\text{Men},\text{Full}} = \frac{CT_{\text{Men}} \times RT_{\text{Full}}}{n} = \frac{40 \times 50}{100} = 20$$

$$\text{Expected}_{\text{Women},\text{Empty}} = \frac{CT_{\text{Women}} \times RT_{\text{Empty}}}{n} = \frac{60 \times 50}{100} = 30$$

$$\text{Expected}_{\text{Women},\text{Full}} = \frac{CT_{\text{Women}} \times RT_{\text{Full}}}{n} = \frac{60 \times 50}{100} = 30$$

Observed frequencies, with expected frequencies in brackets:

                    Men        Women      Total
Glass half empty    15 (20)    35 (30)    50
Glass half full     25 (20)    25 (30)    50
Total               40         60         100
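The expected-frequency rule above can be sketched in a few lines of Python. This is only an illustration: the function name `expected_frequencies` and the nested-list layout (rows are outlook, columns are gender) are assumptions made for the example, not part of the course materials.

```python
# Observed counts from the glass half empty / half full example.
# Rows: half empty, half full. Columns: men, women.
observed = [[15, 35],
            [25, 25]]

def expected_frequencies(table):
    """Expected count per cell under independence: column total x row total / n."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[ct * rt / n for ct in col_totals] for rt in row_totals]

expected = expected_frequencies(observed)
print(expected)  # [[20.0, 30.0], [20.0, 30.0]]
```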

Working out Chi-square

$$\chi^2 = \sum_{i,j} \frac{(\text{Observed}_{ij} - \text{Expected}_{ij})^2}{\text{Expected}_{ij}}$$

$$= \frac{(15-20)^2}{20} + \frac{(25-20)^2}{20} + \frac{(35-30)^2}{30} + \frac{(25-30)^2}{30}$$

$$= \frac{(-5)^2}{20} + \frac{(5)^2}{20} + \frac{(5)^2}{30} + \frac{(-5)^2}{30}$$

$$= 1.25 + 1.25 + 0.833 + 0.833$$

$$= 4.166$$

Observed and Expected by cell:

                    Men             Women           Total
Glass half empty    O: 15  E: 20    O: 35  E: 30    50
Glass half full     O: 25  E: 20    O: 25  E: 30    50
Total               40              60              100

This statistic can be checked against a distribution with known properties in a table (similarly to how you looked up z-scores).

However to look up a chi-square value we need to know the degrees of freedom (df).

Degrees of freedom in crosstabulations are calculated by (r-1)(c-1), where r is the number of rows and c is the number of columns. (i.e. the number of categories in each variable minus one and multiplied).

So here we have (2-1) (2-1) = 1 df.
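As a rough check on the hand calculation, the same sum and the degrees of freedom can be computed in Python. The variable names are illustrative assumptions. The unrounded statistic is 4.1667, which the slides write as 4.166 because the rounded cell values were added together.

```python
# Observed and expected counts for the glass half empty / half full table.
# Rows: half empty, half full. Columns: men, women.
observed = [[15, 35], [25, 25]]
expected = [[20.0, 30.0], [20.0, 30.0]]

# Sum (observed - expected)^2 / expected over every cell.
chi_square = sum((o - e) ** 2 / e
                 for obs_row, exp_row in zip(observed, expected)
                 for o, e in zip(obs_row, exp_row))

# Degrees of freedom: (rows - 1) * (columns - 1).
df = (len(observed) - 1) * (len(observed[0]) - 1)

print(round(chi_square, 3), df)  # 4.167 1
```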

Chi-Square Distribution for 1 to 5 degrees of freedom

[Chart: chi-square distributions for k = 1 to 5 degrees of freedom]

Chi-Square values of greater than 6 are relatively rare.

However higher values of chi-square are more common when the degrees of freedom (k in this chart) is larger.

This is reflected in chi-square tables: you will see that the values for p=0.05 (the point at which only 5% of cases lie to the outside) increase as the degrees of freedom increase.

Checking the chi-square statistic
• To check whether what we find is significant we look up critical values of chi-square in a table.
• For 1 df the critical values are 3.84 (p = 0.05) and 6.63 (p = 0.01).
• Because our chi-square value (4.166) is bigger than 3.84 we can say that it is significant at p<0.05. (Since it is not bigger than 6.63 we cannot say that it is significant at p<0.01.)

• What this means is that we would find a difference in our sample of at least the size of the difference that we observed in the proportions of men and women who thought that the glass was half empty as opposed to half full between 1% and 5% of the time if there were no real difference between them in the population (which is the null hypothesis).

• This is rare enough that we consider the null hypothesis to be unlikely. We can therefore reject the null hypothesis and accept the alternative hypothesis of a relationship between gender and outlook.

• When you do this on SPSS it will produce an estimate of the precise probability of obtaining a chi-square statistic as big as (in this case) 4.166 by chance. If this p-value is less than 0.05 we can say that it is significant (and that there is a significant relationship between our two variables).
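For readers without SPSS to hand, the precise probability for a table with 1 degree of freedom can be approximated with the Python standard library. This is a sketch under one stated assumption: it relies on the fact that, for 1 df only, the upper-tail chi-square probability equals erfc(sqrt(χ2/2)). The helper name `p_value_1df` is made up for this example and does not work for other degrees of freedom.

```python
import math

def p_value_1df(chi_square):
    """Upper-tail probability of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi_square / 2))

p = p_value_1df(4.1667)
print(round(p, 3))                   # roughly 0.041
print(round(p_value_1df(3.84), 2))   # 0.05, the tabulated critical value
print(round(p_value_1df(6.63), 2))   # 0.01
```

The result lies between 0.01 and 0.05, which agrees with the table look-up.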

A note on using Chi-square

• You can only use chi-square tests where each cell in the table has an expected value of at least 5. (Although where samples are larger it may be acceptable to have one cell with an expected frequency under 5). If you violate this assumption the chi-square approximation becomes unreliable: the test loses power and may not detect a genuine effect.

• The categories must be discrete. No case should fall into more than one category (i.e. you cannot think that the glass is half empty and half full).

And a note on presenting tables:
1. When presenting tables you should always present percentages (within the independent variable), not frequencies (as these can be affected by the numbers of people in each independent category). Percentages ‘tell the story’ more clearly. You do not need to give these to a million decimal places; 1 decimal place is usually adequate (and simpler to read).
2. However you should also give totals as frequencies (so that someone could re-construct the table if they wanted).
3. You only need to include horizontal lines and these can be minimal.
4. You only need to include totals going one way, either row or column, since this is enough information for people to reconstruct all other information themselves.
5. Always title a table:

Table 1: Outlook on life, by Gender

                    Men      Women
Glass Half Empty    37.5%    58.3%
Glass Half Full     62.5     41.7
Total               100      100
(N)                 (40)     (60)
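The percentages in Table 1 can be reproduced from the raw frequencies with a short sketch like the one below. The dictionary layout and the function name `column_percentages` are assumptions chosen for the example. Percentages are taken within each category of the independent variable (gender) and rounded to 1 decimal place, with N kept alongside.

```python
# Raw frequencies, grouped by the independent variable (gender).
counts = {
    "Men": {"Glass half empty": 15, "Glass half full": 25},
    "Women": {"Glass half empty": 35, "Glass half full": 25},
}

def column_percentages(by_group):
    """Percentages within each group, to 1 decimal place, plus the group N."""
    result = {}
    for group, outcomes in by_group.items():
        n = sum(outcomes.values())
        result[group] = {k: round(100 * v / n, 1) for k, v in outcomes.items()}
        result[group]["N"] = n
    return result

table = column_percentages(counts)
print(table["Men"])    # {'Glass half empty': 37.5, 'Glass half full': 62.5, 'N': 40}
print(table["Women"])  # {'Glass half empty': 58.3, 'Glass half full': 41.7, 'N': 60}
```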

Strength of Association
• Chi-square tells us whether there is a ‘significant’ association between two variables (or whether there exists an association that would be unlikely to be found by chance).
• However it does not tell us in a clear way how strong this association is since the size of chi-square depends in part on the sample size.
• We will look at two different statistics that do tell us about the strength of association:
  – Cramér’s V. This is a PRE statistic that tells us the strength of associations in contingency tables (the meaning of PRE will be explained). It also enables us to compare different associations and decide which is stronger.
  – Odds Ratios. These tell us the relative odds of an event occurring for different categories or groups of cases (or people). As such they’re quite easily understood ways to discuss the strength of association in contingency tables.

PRE
• Strength of association is generally measured by PRE or Proportional Reduction in Error.
• Another way of saying this is, how much better our prediction of the dependent variable will be if we know something about the independent variable.
• PRE statistics range from 0 to ±1. Where (roughly speaking):
  – values between 0 and ±0.25 indicate a non-existent to weak association;
  – values between ±0.26 and ±0.50 indicate a moderate association;
  – values between ±0.51 and ±0.75 indicate a moderate to strong association;
  – values between ±0.76 and ±1 indicate a strong to perfect association.

Example: Sex, Age and Sport
Data from Young People’s Social Attitudes Study 2003 (available from Nesstar)

Boys:

YP played sport as part of sports club * Age band Crosstabulation (YP sex household grid [BSA2003] = Male)
Cells show Count (% within Age band).

                             Age band
YP played sport as part
of sports club               12-13          14-16          17-19          Total
Yes                          62 (65.3%)     68 (54.4%)     34 (41.0%)     164 (54.1%)
No                           33 (34.7%)     57 (45.6%)     49 (59.0%)     139 (45.9%)
Total                        95 (100.0%)    125 (100.0%)   83 (100.0%)    303 (100.0%)

Girls:

YP played sport as part of sports club * Age band Crosstabulation (YP sex household grid [BSA2003] = Female)
Cells show Count (% within Age band).

                             Age band
YP played sport as part
of sports club               12-13          14-16          17-19          Total
Yes                          55 (44.4%)     48 (32.0%)     11 (12.8%)     114 (31.7%)
No                           69 (55.6%)     102 (68.0%)    75 (87.2%)     246 (68.3%)
Total                        124 (100.0%)   150 (100.0%)   86 (100.0%)    360 (100.0%)

What can we say about these tables?

• It looks like boys play sport with clubs more than girls.

• And it looks like both boys and girls become less likely to play sport with clubs as they get older.

Questions:
1. Is there a significant association between age and sports club membership for boys? For girls?
2. Is the association between age and sports club membership stronger/weaker for boys than it is for girls?

1. Does age significantly affect sports club membership for boys?

YP played sport as part of sports club * Age band (counts, YP sex = Male)

          12-13    14-16    17-19    Total
Yes       62       68       34       164
No        33       57       49       139
Total     95       125      83       303

To answer this question we work out chi-square, by calculating:

$$\chi^2 = \sum_{i,j} \frac{(\text{Observed}_{ij} - \text{Expected}_{ij})^2}{\text{Expected}_{ij}}$$

$$= \frac{\left(62 - \frac{95 \times 164}{303}\right)^2}{\frac{95 \times 164}{303}} + \frac{\left(33 - \frac{95 \times 139}{303}\right)^2}{\frac{95 \times 139}{303}} + \frac{\left(68 - \frac{125 \times 164}{303}\right)^2}{\frac{125 \times 164}{303}} + \frac{\left(57 - \frac{125 \times 139}{303}\right)^2}{\frac{125 \times 139}{303}} + \frac{\left(34 - \frac{83 \times 164}{303}\right)^2}{\frac{83 \times 164}{303}} + \frac{\left(49 - \frac{83 \times 139}{303}\right)^2}{\frac{83 \times 139}{303}}$$

$$= \frac{(62-51.4)^2}{51.4} + \frac{(33-43.6)^2}{43.6} + \frac{(68-67.7)^2}{67.7} + \frac{(57-57.3)^2}{57.3} + \frac{(34-44.9)^2}{44.9} + \frac{(49-38.1)^2}{38.1}$$

$$= 2.2 + 2.6 + 0 + 0 + 2.6 + 3.1 = 10.5$$
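The same calculation generalises to any table of counts. The sketch below is one possible way to write it in Python. The function name `chi_square` and the list-of-rows layout are assumptions for illustration. It works from unrounded expected frequencies, so it gives the slightly more precise 10.541 rather than 10.5.

```python
def chi_square(table):
    """Return (chi-square statistic, degrees of freedom) for a table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = row_totals[i] * col_totals[j] / n  # expected under independence
            stat += (obs - exp) ** 2 / exp
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df

# Boys: rows are Yes / No, columns are age bands 12-13, 14-16, 17-19.
boys = [[62, 68, 34],
        [33, 57, 49]]
stat, df = chi_square(boys)
print(round(stat, 3), df)  # 10.541 2
```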

1. Does age significantly affect sports club membership for boys?

• Chi-square = 10.5.
• df = (r-1)(c-1) = 1*2 = 2
• If we look up the .05 value for 2 degrees of freedom it’s 5.99 and the .01 value is 9.21. Since 10.5 is bigger than both of these it is significant at p<0.01.
• And the computer printout, below, confirms that in fact the p value is .005, which is less than 0.01 (n.b. chi-square without rounding is 10.541).

Chi-Square Tests (YP sex household grid [BSA2003] = Male)

                                Value     df    Asymp. Sig. (2-sided)
Pearson Chi-Square              10.541    2     .005
Likelihood Ratio                10.626    2     .005
Linear-by-Linear Association    10.457    1     .001
N of Valid Cases                303

0 cells (.0%) have expected count less than 5. The minimum expected count is 38.08.
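The tabulated values for 2 degrees of freedom can be checked with a small sketch. It assumes one known special case: with exactly 2 df the upper-tail chi-square probability is exp(-χ2/2), so the critical value for a given significance level is -2·ln(level). The function names are invented for this example and apply to 2 df only.

```python
import math

def p_value_2df(chi_square):
    """Upper-tail probability of a chi-square statistic with 2 degrees of freedom."""
    return math.exp(-chi_square / 2)

def critical_value_2df(alpha):
    """Chi-square value that cuts off the upper `alpha` tail for 2 degrees of freedom."""
    return -2 * math.log(alpha)

print(round(critical_value_2df(0.05), 2))  # 5.99
print(round(critical_value_2df(0.01), 2))  # 9.21
print(round(p_value_2df(10.541), 3))       # 0.005
```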

And girls…?

• Work out the chi-square statistic and test whether it is significant for girls…

YP played sport as part of sports club * Age band Crosstabulation (YP sex household grid [BSA2003] = Female)
Cells show Count (% within Age band).

          12-13          14-16          17-19          Total
Yes       55 (44.4%)     48 (32.0%)     11 (12.8%)     114 (31.7%)
No        69 (55.6%)     102 (68.0%)    75 (87.2%)     246 (68.3%)
Total     124 (100.0%)   150 (100.0%)   86 (100.0%)    360 (100.0%)
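One way to check your answer to this exercise is the sketch below. The helper `chi_square_with_expected` is a hypothetical name, and the p-value line again relies on the 2 df special case exp(-χ2/2). It also reports the smallest expected count, which is what the rule of at least 5 per cell is checked against.

```python
import math

def chi_square_with_expected(table):
    """Return (statistic, df, smallest expected count) for a table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    expected = [[rt * ct / n for ct in col_totals] for rt in row_totals]
    stat = sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(table, expected)
               for o, e in zip(obs_row, exp_row))
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df, min(min(row) for row in expected)

# Girls: rows are Yes / No, columns are age bands 12-13, 14-16, 17-19.
girls = [[55, 48, 11],
         [69, 102, 75]]
stat, df, smallest_expected = chi_square_with_expected(girls)
p = math.exp(-stat / 2)  # valid because df is 2 here

print(round(stat, 3), df, round(smallest_expected, 2))  # 23.394 2 27.23
print(p < 0.001)                                        # True
```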

And girls…?
• The computer printout of the chi-square test for girls shows a chi-square value of 23.394.
• The p-value for this chi-square statistic with 2 degrees of freedom is displayed as 0.000 (SPSS only shows you results to 3 decimal places). This means that it is less than 0.001.
• Therefore “Age has a significant effect on whether or not girls play sport in clubs (p<.001)”:

Chi-Square Tests (YP sex household grid [BSA2003] = Female)

                                Value     df    Asymp. Sig. (2-sided)
Pearson Chi-Square              23.394    2     .000
Likelihood Ratio                25.370    2     .000
Linear-by-Linear Association    22.862    1     .000
N of Valid Cases                360

0 cells (.0%) have expected count less than 5. The minimum expected count is 27.23.

2. Is the effect of age stronger/weaker for boys than it is for girls?

• The chi-square statistic is bigger for girls than for boys. However, the sample of girls is also bigger (360 versus 303) so this may have affected it.

• To work out the strength of association we need to correct for both sample size and for the table shape (as this also affects the magnitude of chi-square statistics). A frequently used measure of association is Cramér’s V:

$$V = \sqrt{\frac{\chi^2}{N(L-1)}}$$

where χ2 is chi-square, N is the sample size, and L is the lesser of the number of rows and number of columns.

Note: In any table where either the number of rows or columns is equal to 2, Cramér’s V is equal to another measure of association, referred to as phi (or Φ).
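The formula translates directly into a small function. This is a sketch only: the name `cramers_v` and its argument order are assumptions for the example. It is applied here to the earlier glass half empty / half full table (χ2 = 4.1667, N = 100, 2 rows and 2 columns).

```python
import math

def cramers_v(chi_square, n, n_rows, n_cols):
    """Cramér's V: sqrt(chi-square / (N * (L - 1))), L = lesser of rows and columns."""
    smaller = min(n_rows, n_cols)
    return math.sqrt(chi_square / (n * (smaller - 1)))

v = cramers_v(4.1667, 100, 2, 2)
print(round(v, 3))  # 0.204
```

Because L - 1 is 1 whenever a table has only 2 rows or 2 columns, the function reduces to sqrt(χ2/N) in those cases.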

Comparing strength of association between age and involvement in sport for boys and girls

Boys: χ2 = 10.541, N = 303

Girls: χ2 = 23.394, N = 360

Both tables have 2 rows and 3 columns, so L = 2.

$$V = \sqrt{\frac{\chi^2}{N(L-1)}}$$

Cramér’s V for the two tables is therefore:

Boys:

$$V = \sqrt{\frac{10.5}{303(2-1)}} = 0.186$$

Girls:

$$V = \sqrt{\frac{23.4}{360(2-1)}} = 0.255$$
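Using the unrounded chi-square values from the SPSS printouts, the two comparisons can be reproduced as follows. The dictionary layout is an assumption made for the example. Both tables have 2 rows, so L - 1 = 1.

```python
import math

# Pearson chi-square and sample size for each 2 x 3 table.
tables = {
    "Boys": {"chi_square": 10.541, "n": 303},
    "Girls": {"chi_square": 23.394, "n": 360},
}
L = 2  # lesser of the number of rows (2) and the number of columns (3)

cramers_v = {name: math.sqrt(t["chi_square"] / (t["n"] * (L - 1)))
             for name, t in tables.items()}

print({name: round(v, 3) for name, v in cramers_v.items()})
# {'Boys': 0.187, 'Girls': 0.255}
```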

Cramér’s V in SPSS

• We can also see Cramér’s V in the SPSS printout.

• Boys are above and girls below. SPSS reports Cramér’s V as 0.187 for boys and 0.255 for girls, which matches the values worked out by hand apart from small differences due to rounding.

Symmetric Measures (YP sex household grid [BSA2003] = Male)

                                    Value    Approx. Sig.
Nominal by Nominal    Phi           .187     .005
                      Cramer's V    .187     .005
N of Valid Cases                    303

Symmetric Measures (YP sex household grid [BSA2003] = Female)

                                    Value    Approx. Sig.
Nominal by Nominal    Phi           .255     .000
                      Cramer's V    .255     .000
N of Valid Cases                    360

a. Not assuming the null hypothesis.
b. Using the asymptotic standard error assuming the null hypothesis.

What does it mean substantively?

• We can say that age has a significant but relatively weak effect on boys’ participation in sport.

• And that age has a significant and slightly stronger effect on girls’ participation in sport.

• Thus, both girls and boys are likely to decrease their participation in sports clubs as they get older but this effect is more pronounced among girls than among boys.

• And there is a (small) gender difference in the relationship between age and participation in sport.

A different way of measuring the strength of association: Odds Ratios

The odds ratio is the odds of an outcome given membership of group a divided by the odds of that outcome given membership of group b.

Or, looking at the table below, the odds of playing sport if you’re male (or membership of group ‘male’), divided by the odds of playing sport if you’re female (or membership of group ‘female’).

YP played sport as part of sports club * YP sex Crosstabulation
Cells show Count (% within YP sex).

                             YP sex
YP played sport as part
of sports club               Male            Female          Total
Yes                          164 (54.1%)     114 (31.7%)     278 (41.9%)
No                           139 (45.9%)     246 (68.3%)     385 (58.1%)
Total                        303 (100.0%)    360 (100.0%)    663 (100.0%)

Working out the odds

The odds of an event occurring can be worked out by the number of times that it occurs divided by the number of times that it does not occur.

YP played sport as part of sports club * YP sex (Count, % within YP sex)

          Male            Female          Total
Yes       164 (54.1%)     114 (31.7%)     278 (41.9%)
No        139 (45.9%)     246 (68.3%)     385 (58.1%)
Total     303 (100.0%)    360 (100.0%)    663 (100.0%)

ODDS OF A MALE PLAYING SPORT IN A CLUB = 164/139 = 1.18

This means that a male is 1.18 times as likely to be a member of a sports club as not to be.

ODDS OF A FEMALE PLAYING SPORT IN A CLUB = 114/246 = 0.46

This means that a female is 0.46 times as likely to be a member of a sports club as not to be (or less than 50% as likely). n.b. it’s often easier to talk about it the other way round (i.e. 246/114 = 2.16 – therefore women are about two times as likely not to take part in sports clubs as they are to take part).

The ODDS RATIO is the odds of a male playing sport divided by the odds of a female playing sport: 1.18 / 0.46 = 2.57 (or 2.55 if the two odds are not rounded first).

Therefore the odds that a male is part of a sports club are over two and a half times as great as the odds of a female being part of a sports club.
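The odds and the odds ratio can be worked through in Python as a quick check. The variable names are illustrative assumptions. The code keeps full precision until the final rounding, which is why the ratio comes out at 2.55 rather than the 2.57 obtained by dividing the rounded odds.

```python
# Counts from the sports club by sex crosstabulation.
male_yes, male_no = 164, 139
female_yes, female_no = 114, 246

# Odds = times the event occurs / times it does not occur.
odds_male = male_yes / male_no
odds_female = female_yes / female_no

# Odds ratio = odds for males / odds for females.
odds_ratio = odds_male / odds_female

print(round(odds_male, 2), round(odds_female, 2), round(odds_ratio, 2))
# 1.18 0.46 2.55
```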

Odds Ratios
• When Odds Ratios are equal to 1 the two groups are identical (the odds of the given outcome are the same for both groups).
• When odds ratios get close to 0 or to infinity the groups are totally different (the odds of the given outcome are very high (or certain) for one group and very low (or zero) for the other).
• We will come back to odds ratios when we look at logistic regression and log-linear analysis.
• Note: where the independent variable is in the columns a mathematical description of the odds ratio is:

  (a/c) / (b/d)

  which is the same as:

  (a*d) / (b*c)

• Where the independent variable is in the rows it’s (a/b) / (c/d)

           Columns
Rows       a     b
           c     d
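A short sketch can confirm that these expressions agree. The function name `odds_ratio` is an assumption for the example, and the cell values a, b, c, d are taken from the sports club by sex table (a = male yes, b = female yes, c = male no, d = female no).

```python
def odds_ratio(a, b, c, d):
    """Cross-product form of the odds ratio for a 2 x 2 table laid out as a b / c d."""
    return (a * d) / (b * c)

a, b, c, d = 164, 114, 139, 246

by_columns = (a / c) / (b / d)  # independent variable in the columns
by_rows = (a / b) / (c / d)     # independent variable in the rows

print(round(odds_ratio(a, b, c, d), 3))  # 2.546
print(abs(by_columns - odds_ratio(a, b, c, d)) < 1e-9)  # True
print(abs(by_rows - odds_ratio(a, b, c, d)) < 1e-9)     # True
```

Both orientations simplify to the same cross-product (a*d)/(b*c). What changes with the layout is which comparison of groups that number describes.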

Group activity: Interpreting Output

• In groups discuss the handout that you have been given.

• What do the percentages in the table show? Can you substantively describe the percentages in different cells?

• Is the association between job tenure and work hours significant for men? For women?

• How much variation in job tenure does work-hours explain for men? For women?

• How much more/less likely is a man working part-time to have a temporary job than a man working full-time? What about for women?

Choosing Statistical Tests

Most of the statistical tests/procedures that we have covered are only suitable in some instances. You can find this discussed in full in Buckingham and Saunders’ (2004) “Appendix E: Choosing Statistical Tests” (available online).

In order to determine what sort of test to use you need to ask yourself the following questions:

1. Univariate, Bivariate or Multivariate?
2. What type of variable is your first (dependent) variable?
3. What type of variable is/are your second (or independent) variable(s)?
4. What do you want to know (do you want to infer things about the population? Make a causal argument? Investigate interactions?)

The tables on the following slides are schematic representations that use the answers to these questions to determine where to go next with analysis.

Note: The tables only include the tests that we have covered in this class. In some instances there may be other statistical tests that are appropriate (and discussed in textbooks); however, you are only expected to be able to use the ones listed, and we have covered sufficient tests that you should be able to find a reasonably suitable one in most instances.

Choosing and running Statistical tests: Univariate analysis

Categorical variables (nominal or ordinal)
• Statistics: Frequency tables displaying raw numbers or percentages (nominal data); percentiles, medians, deciles and quartiles (ordinal data)
• Inferential: z-test for population proportion
• Graphics: 1. Bar Charts 2. Pie Charts

Interval-ratio variables
• Statistics: Summary statistics displaying means, medians, standard deviations, skew etc.
• Inferential: z-test for population mean
• Graphics: 1. Histograms 2. Stem and leaf plots 3. Box Plots 4. Normal Plots

Choosing and running Statistical tests: Bivariate analysis

The grid has type of variable (1) across the columns and type of variable (2) down the rows, each ordered: Categorical (binary), Categorical (3+ categories), Interval-ratio.

Variable (2): Categorical (binary)
• Variable (1) Categorical (binary): Cross-tabulation, chi-square, Phi or Cramer’s V, Odds ratios
• Variable (1) Categorical (3+ categories): Cross-tabulation, chi-square, Cramer’s V. Compare bar-charts or pie-charts by independent variable
• Variable (1) Interval-ratio: T-test. Compare boxplots or histograms by category (group)

Variable (2): Categorical (3+ categories)
• Variable (1) Categorical (binary): Cross-tabulation, chi-square, Cramer’s V. Compare bar-charts or pie-charts by independent variable
• Variable (1) Categorical (3+ categories): Cross-tabulation, chi-square, Cramer’s V. Compare bar-charts or pie-charts by independent variable (avoid if too complex)
• Variable (1) Interval-ratio: ANOVA. Compare boxplots or histograms by category (group)

Variable (2): Interval-ratio
• Variable (1) Categorical (binary): T-test. Compare boxplots or histograms by category (group)
• Variable (1) Categorical (3+ categories): ANOVA. Compare boxplots or histograms by category (group)
• Variable (1) Interval-ratio: Pearson’s correlation coefficient, Simple linear regression (regression demands that the dependent variable is distinguished). Scatterplot

Note: you can always reduce the level of measurement of a variable (from interval-ratio to categorical) by recoding. Therefore variables can (with recoding) be used to perform any test that is above or to the left of the test that is possible without recoding.

Multivariate Analysis
• So far we’ve concentrated on two-way relationships (e.g. between gender and participation in sports). But we are starting to talk about three-way relationships.

• Social relationships are usually more complex than bivariate analysis allows.

• Multivariate analysis is used in order to better detect this complexity.

• In future weeks we will look at linear regression, logistic regression and log-linear analysis, all types of multivariate analysis.

• This week we’ll talk briefly about the rationale for multivariate analysis and begin to look at cross-tabular techniques for conducting this analysis.

Page 35:

Multivariate Analysis

De Vaus (1996:198) suggests that we can use multivariate analysis to elaborate bivariate relationships in order to answer the following questions:

1. Why does the relationship [between two variables] exist? What are the mechanisms and processes by which one variable is linked to another?

2. What is the nature of the relationship? Is it causal or non-causal?

3. How general is the relationship? Does it hold for most types of people or is it specific to certain subgroups?

This is because multivariate analysis enables detection of:
• Spurious Relationships
• Intervening Variables
• Replication of Relationships
• Specification of Relationships

Page 36:

Spurious relationships

• A spurious relationship exists where two variables are not causally related but a relationship between them is produced by their relationships with a third variable.

• For example:

[Diagram: Age is linked to both shoe size and reading ability; the apparent link between shoe size and reading ability is a spurious relationship.]

Page 37:

Example contd:

Spurious relationship

                       Small shoe size   Large shoe size
Poor reading ability   90%               20%
Good reading ability   10%               80%

                       Under 10 years old                  Over 10 years old
                       Small shoe size   Large shoe size   Small shoe size   Large shoe size
Poor reading ability   90                91                20                19
Good reading ability   10                9                 80                81
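To see how the strength of association changes once age is controlled for, Cramér’s V can be computed for the overall table and for each age group. The sketch below is a plain-Python illustration, not SPSS output. It assumes, purely for illustration, that each column of the slide’s tables contains 100 children, so that the percentages can be read as counts; the function name `cramers_v` is also just a label for this example.

```python
import math


def cramers_v(table):
    """Cramér's V for a table of counts given as a list of rows.

    Uses the Pearson chi-square statistic without continuity correction.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi_square = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi_square += (observed - expected) ** 2 / expected
    smaller_dim = min(len(row_totals), len(col_totals)) - 1
    return math.sqrt(chi_square / (n * smaller_dim))


# Rows: poor / good reading ability. Columns: small / large shoe size.
# ASSUMPTION: 100 children per column (the slide gives percentages only).
overall = [[90, 20], [10, 80]]
under_10 = [[90, 91], [10, 9]]
over_10 = [[20, 19], [80, 81]]

for label, table in [("All children", overall),
                     ("Under 10", under_10),
                     ("Over 10", over_10)]:
    print(f"{label}: V = {cramers_v(table):.2f}")
```

Under these assumed counts, V is roughly 0.70 for the overall table and close to zero within each age group, which is the pattern expected when a third variable (here age) produces a spurious relationship.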

Page 38:

Intervening Variables

• Sometimes, although there is a real (non-spurious) relationship between two variables, we want to identify why that relationship exists.

• For example, if we find that there is a relationship between high levels of unemployment and ethnicity we want to know why that is the case. One possibility is that some ethnic groups have lower educational levels and that this has implications for their ability to get work. In this case education would be an intervening variable.

• Intervening variables enable us to answer questions about the bivariate relationship between two variables – suggesting that (in this case) the relationship between ethnicity and unemployment is not direct but (at least in part) occurs via educational levels.

[Diagram: Ethnicity → Education → Unemployment]

Page 39:

Is it spurious or intervening?

When we do statistical tests we will find similar results for a spurious relationship and an intervening variable: in both cases the effect of the independent variable on the dependent variable will be reduced once we control for the third variable. So how do we know whether this third variable is evidence of a spurious relationship or is an intervening variable?

– There is no hard-and-fast statistical rule for this.

– But if we are suggesting that it is intervening the logic must make sense, i.e. you must have a cogent theoretical reason for thinking that your independent variable affects the intervening variable, which in turn affects the dependent variable.

– This is easiest to show when the timing supports this, i.e. when the intervening variable occurs in time between the independent and the dependent variable (for example education in the earlier example of the relationship between ethnicity and unemployment).

Page 40:

Replication

• Sometimes when we have found a basic (zero-order) relationship between two things (e.g. ethnicity and unemployment) we want to demonstrate that this relationship persists across different subgroups of the population (e.g. both men and women; those of different ages…).

• Where the relationship IS replicated we can rule out the possibility that it is produced by the third variable we have controlled for.

• This provides evidence that our original relationship is robust.

Page 41:

Specification

• Sometimes a particular variable only has an effect under specific conditions. The variable that determines these conditions is said to ‘interact’ with the independent variable.

• For example, de Vaus shows that going to a religious school makes boys more religious but has no effect on girls.

• In this case type of school interacts with gender: Religious education only affects students’ religiosity in combination with being male.

Page 42:

Specification (interactions)

Graphical representation of relationship between religious education and religiousness, controlling for sex:

[Figure: two panels, each plotting religiousness (low to high) against "How religious was your education?" (not at all to very), with separate lines for boys and girls. Panel titles: "Interaction between sex and religiousness of school" and "No interaction".]

Page 43:

Using Cramér’s V to determine what sort of relationship it is

If we use SPSS to produce a cross-tabulation of two variables, then we can elaborate this relationship by introducing a third variable as a layer variable. Examining the Cramér’s V values for the original cross-tabulation and for the layers of the elaborated cross-tabulation tells us what kind of situation we are looking at:

• If the Cramér’s V values for the layers are all similar to the original value, then we have a situation of replication.

• If the Cramér’s V values are smaller for the layered cross-tabulation than that for the original cross-tabulation, then we either have a situation where the third variable is acting as an intervening variable, or one where it is inducing a spurious relationship between the original two variables. Deciding between these two options involves reflecting on whether the third variable makes sense conceptually as part of some causal mechanism linking the original two variables.

Page 44:

Using Cramér’s V to determine what sort of relationship it is (cont’d)

• If the Cramér’s V values for the layered cross-tabulation vary in size, perhaps with some being smaller than the original value and some being as large or larger than it, then the situation is one of specification.

• However, if one or more of the Cramér’s V values is larger than the original value, then a failure to take account of the third variable in the first instance may also have been suppressing an underlying relationship between the two variables. (This is a variation on the theme of spuriousness, where it is the absence of a bivariate relationship that is spurious rather than the presence of one!)
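The comparison of Cramér’s V values described on these two slides can be sketched as a small Python function. This is only an illustration of the logic: the slides give no numerical cut-off for "similar", so the `tolerance` of 0.05 is an arbitrary assumption, and the function name and returned labels are invented for this example.

```python
def describe_elaboration(original_v, layer_vs, tolerance=0.05):
    """Suggest which situation a set of Cramér's V values points to.

    original_v: V for the original (zero-order) cross-tabulation.
    layer_vs:   V for each layer of the elaborated cross-tabulation.
    tolerance:  ASSUMED cut-off for treating two V values as similar.
    """
    diffs = [v - original_v for v in layer_vs]
    smaller = [d < -tolerance for d in diffs]
    larger = [d > tolerance for d in diffs]

    if not any(smaller) and not any(larger):
        return "replication"
    if all(smaller):
        # The statistics cannot separate these two; that is a conceptual decision.
        return "spurious or intervening"
    if all(larger):
        return "suppression"
    # Mixed pattern: the layers differ from one another.
    if any(larger):
        return "specification, possibly with suppression"
    return "specification"


print(describe_elaboration(0.30, [0.31, 0.28]))
print(describe_elaboration(0.70, [0.02, 0.01]))
print(describe_elaboration(0.30, [0.05, 0.45]))
```

The function only names a candidate situation. Choosing between a spurious relationship and an intervening variable still depends on whether the third variable makes sense as part of a causal mechanism.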

Page 45:

Generally…

• Multivariate analysis uses many other techniques – and we’ll be looking at multivariate regression, logistic regression and log-linear analysis – in order to determine whether the relationship between two variables persists or is altered when we “control for” a third (or fourth or fifth) variable.

• Multivariate analysis also enables us to determine which variable(s) has the greatest effect on our dependent variable, e.g. is sex more important than race in determining income?

This week in the computer lab we will do cross-tabular analysis.

Next week in the lecture you will look at correlation and regression, and the week after at logistic regression. In the seminar that afternoon we will discuss multivariate analysis in published research.