
Post on 28-Mar-2015


1

Matters arising

1. Summary of last week’s lecture

2. The exercises

2

Last week

• Last week I extended my discussion of statistical association to the topic of partial correlation.

• A partial correlation can help the researcher to choose from different causal models.

• I also considered the analysis of nominal data in the form of contingency tables.

• The chi-square statistic can be used to test for the presence of an association between qualitative or categorical variables.

3

CORRELATION

does not necessarily mean

CAUSATION

4

The choice

• A strong positive correlation between Exposure and Actual violence was obtained.

• But at least three CAUSAL MODELS are compatible with that result.

5

A background variable

• Fortunately, we had information on a third variable, a measure of parental orientation towards violence.

• Both Exposure and Actual violence correlated highly with this Background variable.

6

Partial correlation

A PARTIAL CORRELATION is what remains of a Pearson correlation between two variables when the influence of a third variable has been removed, or PARTIALLED OUT.

7

Partial correlation

Removes the influence of the third variable.

Rescales with the new (reduced) variances, so that the partial correlation, like any Pearson correlation, lies in the range from –1 to +1.

8

The partial correlation

• When correlations with Background are taken into account, the original correlation is no longer significant.

• The third model seems the most convincing.

9

A medical question

• Is there an association between the type of body tissue one has and the presence of a potentially harmful antibody?

• This is a question of whether two QUALITATIVE VARIABLES are associated.

10

A contingency table

• The pattern of frequencies in the CONTINGENCY TABLE suggests that there is indeed an association between Presence and Tissue Type.

• The null hypothesis is that the two variables are INDEPENDENT.

11

Expected cell frequencies (E)

• The EXPECTED FREQUENCY E in each cell of the table is calculated from the MARGINAL TOTALS of the contingency table on the assumption that Tissue Type and Presence are independent.

• We compare the values of E with the OBSERVED FREQUENCIES O.
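The calculation of E from the marginal totals can be sketched in a few lines of NumPy. The cell counts below are hypothetical, since the slides do not reproduce the actual table, though they are chosen to sum to the study's total of 79 cases:

```python
import numpy as np

# Hypothetical 4x2 table of observed frequencies O
# (rows: four tissue types; columns: antibody No/Yes).
# The real study's cell counts are not given in the slides.
O = np.array([[12, 5],
              [9, 8],
              [10, 9],
              [4, 22]], dtype=float)

row_totals = O.sum(axis=1, keepdims=True)   # marginal totals for rows
col_totals = O.sum(axis=0, keepdims=True)   # marginal totals for columns
N = O.sum()

# Under independence, E = (row total x column total) / N for each cell
E = row_totals @ col_totals / N
print(E.round(2))
```

Note that E has the same marginal totals as O: independence changes the pattern within the table, not the margins.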

12

The expected frequencies

• In the Critical group, there seem to be large discrepancies between O and E: fewer No’s than expected and more Yes’s.

13

Formula for chi-square

• The magnitude of the discrepancies feeds into the value of the CHI-SQUARE statistic: χ² = Σ (O – E)² / E, summed over all the cells of the table.

14

The value of chi-square

The value of chi-square is 10.66.

15

Degrees of freedom

• To decide whether a given value of chi-square is significant, we must specify the DEGREES OF FREEDOM df.

• If a contingency table has R rows and C columns, the degrees of freedom is given by

• df = (R – 1)(C – 1)

• In our example, R = 4 and C = 2, and so

• df = (4 – 1)(2 – 1) = 3.

16

Significance

• SPSS will tell us that the p-value of a chi-square with a value of 10.655 in the chi-square distribution with three degrees of freedom is .014.

• We should write this result as: χ2(3) = 10.66; p = .014 .

• Since the result is significant beyond the .05 level, we have evidence against the null hypothesis of independence and evidence for the scientific hypothesis.
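The p-value that SPSS reports can be reproduced from the upper tail of the chi-square distribution; a minimal sketch, assuming scipy is available:

```python
from scipy.stats import chi2

# Upper-tail probability of 10.655 in the chi-square
# distribution with 3 degrees of freedom
p = chi2.sf(10.655, df=3)
print(round(p, 3))  # 0.014, matching the SPSS output
```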

17

Multiple-choice example

18

Solution

• It isn’t easy to ask a sensible multiple-choice question about partial correlation.

• C is obviously the correct answer.

19

Example

20

Solution

• A is wrong: we usually hope the null hypothesis will be falsified.

• B is wrong: it’s the null hypothesis that is tested.

• C is wrong: the p-value must be less than 0.05 for significance.

• D is correct: significance requires a p-value of less than 0.05.

21

Example

22

Solution

• df = (R – 1)(C – 1) = (4 – 1)(5 – 1) = 12.

• So the correct answer is B.

23

Lecture 10

RUNNING CHI-SQUARE TESTS ON SPSS

24

In Variable View

• In Variable View, Name three variables and assign Values to the code numbers making up the various tissue groups.

• Always assign CLEAR VALUE LABELS to make the output comprehensible.

25

In Data View

• The third variable, Count, contains the frequencies of occurrence of the antibody in the different groups.

• When entering the data, it’s helpful to be able to view the value labels.

26

What the rows in Data View represent

• SPSS assumes that, in Data View, each row contains information on just ONE participant or CASE.

• In our example, each row contains information about SEVERAL people.

• At some point, SPSS must be informed of this.

• You do this by WEIGHTING THE CASES with the frequencies.

27

Weighting the cases

• Select Weight Cases from the Data menu.

• Complete the Weight Cases dialog by transferring Count to the Frequency Variable slot.

• Click OK to weight the cases with frequencies.

28

Another approach

• We could have dispensed with the Count variable and simply entered the data on each of the 79 people in the study.

• Here are 8 of the 79 cases.

• You don’t need the Weight cases procedure here.
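Weighting cases by Count is equivalent to repeating each row of the table Count times, yielding the case-level layout. A sketch of that equivalence in NumPy, with hypothetical code numbers and counts:

```python
import numpy as np

# One row per cell of the table, with a frequency count
# (the code numbers and counts here are hypothetical)
tissue   = np.array([1, 1, 2, 2])   # tissue-type codes
presence = np.array([0, 1, 0, 1])   # 0 = No, 1 = Yes
count    = np.array([3, 2, 1, 4])   # cell frequencies

# Expanding by the counts yields one row per case,
# as in the unweighted, case-level data layout
case_tissue   = np.repeat(tissue, count)
case_presence = np.repeat(presence, count)
print(len(case_tissue))  # 10 cases in all
```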

29

Selecting the chi-square test

The chi-square test is available in Crosstabs, on the Descriptive Statistics menu.

30

The Crosstabs dialog

• We want the columns to represent the Presence variable, as in the contingency table.

31

Clustered bar charts

Check the box labelled ‘Display clustered bar charts’

32

Crosstabs: Statistics

• Choose Chi-square.

• The Chi-square statistic itself is not suitable as a measure of the strength of an association, because it is affected by the size of the data set.

• Click ‘Phi and Cramer’s V’. These are measures of the STRENGTH of the association between tissue type and the incidence of the antibody.

33

Crosstabs: Cell Display

• Check the Observed and Expected buttons.

• Since the columns represent Yes’s and No’s, it will be useful to have the column PERCENTAGES.

34

The output: contingency table

The percentages are useful: they show a marked predominance of Presence of the antibody in the Critical tissue group only.

35

The clustered bar chart

• The figure shows the trend apparent from inspection of the column percentages.

• There is a marked presence of the antibody in the Critical tissue group.

36

Result of the chi-square test

• The p-value appears in the column headed ‘Asymp. Sig.’: p = .014 .

• Write the result as:

χ2(3) = 10.66; p = .014 .

• Notice the information about the number of cells with values of E less than 5.

• When there are too many, the usual p-value cannot be trusted.

37

Strength of the association

• Unlike a correlation, the value of chi-square is partly determined by the sample size and is therefore unsuitable as a measure of association strength.

• Interpret either Phi or Cramer’s statistic as the extent to which the incidence of the antibody can be accounted for by tissue type. Cramer’s V can take values in the range from 0 to +1.
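Cramér’s V can in fact be recovered from the chi-square value itself: V = √(χ² / (N(k – 1))), where k is the smaller of the numbers of rows and columns. Applying this formula to the figures in the slides:

```python
import math

chi_sq = 10.655
N = 79            # total number of cases in the study
k = min(4, 2)     # smaller of the row and column counts (4 rows, 2 columns)

# Cramer's V rescales chi-square into the range from 0 to 1
V = math.sqrt(chi_sq / (N * (k - 1)))
print(round(V, 2))  # 0.37
```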

38

A smaller data set

• Is there an association between Tissue Type and Presence of the antibody?

• The antibody is indeed more in evidence in the ‘Critical’ tissue group.

High incidence in Critical category

39

Result of the chi-square test

• How disappointing! It looks as if we haven’t demonstrated a significant association.

• Under the column headed ‘Asymp. Sig.’ is the p-value, which is given as .060.

40

Sampling distributions

• Because of sampling variability, the values of the statistics we calculate from our data would vary were the data-gathering exercise to be repeated.

• The distribution of a statistic is known as its SAMPLING DISTRIBUTION.

• Test statistics such as t, F and chi-square have known sampling distributions.

• You must know the sampling distribution of any statistic to produce an accurate p-value.

41

The familiar chi-square formula

χ² = Σ (O – E)² / E, summed over all the cells of the contingency table.

42

The true definition of chi-square

• The familiar formula is not the defining formula for chi-square.

• Chi-square is NOT defined in the context of nominal data, but in terms of continuously distributed, independent standard normal variables Z as follows:

43

True definition of chi-square

χ² = Z₁² + Z₂² + … + Z_k², the sum of k squared independent standard normal variables, where k is the number of degrees of freedom.
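This definition is easy to check by simulation: sums of k squared independent standard normal variables should behave like chi-square with k degrees of freedom, which has mean k and variance 2k. A NumPy sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw many sums of k = 3 squared independent standard normal variables
k, n = 3, 200_000
samples = (rng.standard_normal((n, k)) ** 2).sum(axis=1)

# Chi-square with k degrees of freedom has mean k and variance 2k
print(samples.mean())  # close to 3
print(samples.var())   # close to 6
```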

44

An approximation

• The familiar chi-square statistic is only APPROXIMATELY distributed as chi-square.

• The approximation is good, provided that the expected frequencies E are adequately large.

45

The meaning of ‘Asymptotic’

• The term ASYMPTOTIC denotes the limiting distribution of a statistic as the sample size approaches infinity.

• The ‘asymptotic’ p-value of a statistic is its p-value under the assumption that the statistic has the limiting distribution.

• That assumption may be false.

46

Goodness of the approximation…

• In the SPSS output, the column headed ‘Asymp. Sig.’ contains a p-value calculated on the assumption that the approximate chi-square statistic behaves like the real chi-square statistic.

• But underneath the table there is a warning about low frequencies, indicating that the ‘asymptotic’ p-value cannot be relied upon.

Warning about low expected frequencies.

47

Exact tests

• Fortunately, there are available EXACT TESTS, which do not make the assumption that the approximation is good.

• There are the Fisher exact tests, designed by R. A. Fisher many years ago; and there are modern ‘brute force’ methods requiring massive computation.
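For 2×2 tables, a Fisher exact test is available in scipy; the lecture’s 4×2 table would need software such as SPSS Exact Tests or R’s fisher.test for the exact computation. A sketch on an illustrative 2×2 table (not the lecture’s data), assuming scipy is installed:

```python
from scipy.stats import fisher_exact

# Illustrative 2x2 table: rows = two tissue groups,
# columns = antibody No/Yes (hypothetical counts)
table = [[3, 1],
         [1, 3]]

# Fisher's exact test enumerates all tables with the same
# marginal totals rather than relying on the chi-square approximation
odds_ratio, p = fisher_exact(table, alternative='two-sided')
print(round(p, 4))  # 0.4857
```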

48

Ordering an exact test

• Click the Exact… button at the bottom of the Crosstabs dialog box.

• Check the Exact radio button in the Exact Tests dialog.

49

A better result!

• The exact test has shown that we DO have evidence for an association between tissue type and incidence of the antibody.

• The exact p-value is markedly lower than the asymptotic value.

50

Regression

51

The violence study scatterplot

52

Linear association

• If two variables have a PERFECT linear relationship, the graph of one against the other is a straight line.

• The graph of temperature in degrees Fahrenheit against the equivalent Celsius temperature is a straight line.

53

A perfect positive linear relationship

[Figure: graph of degrees Fahrenheit against degrees Celsius, a straight line F = (9/5)C + 32 passing through (0, 32), with intercept 32 and slope 9/5.]

54

The slope of the line

• The COEFFICIENT 9/5 in front of the Celsius variable is the SLOPE of the straight line.

• When the Celsius temperature increases by FIVE degrees, the Fahrenheit temperature increases by NINE degrees. When the Celsius temperature increases by one degree, the Fahrenheit temperature increases by 1.8 degrees.
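The Fahrenheit–Celsius line can be written directly as a function; a one-unit increase in C always produces a 1.8-unit increase in F, whatever the starting temperature:

```python
def fahrenheit(celsius):
    # F = (9/5) * C + 32: slope 9/5, intercept 32
    return 9 / 5 * celsius + 32

print(fahrenheit(0))    # 32.0 (the intercept)
print(fahrenheit(100))  # 212.0
print(round(fahrenheit(21) - fahrenheit(20), 1))  # 1.8: the slope
```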

55

A strong linear association

• A narrowly elliptical scatterplot like this indicates a strong positive association between the two variables.

• The Pearson correlation is + 0.89 .

56

Regression

• Regression is a set of techniques for exploiting the presence of statistical association among variables to make PREDICTIONS of values of one variable (the DV or CRITERION) from knowledge of the values of other variables (the IVs or REGRESSORS).

57

Simple and multiple regression

• In the simplest case, there is just one IV or regressor. This is known as SIMPLE regression.

• In MULTIPLE regression, there are two or more IVs.

58

The regression line

59

The regression line of Violence upon Preference

The REGRESSION LINE is the line that fits the points best from the point of view of predicting Actual Violence from Preference. There is a precise criterion for the ‘best-fitting’ line.

60

The regression equation

Y′ = B0 + BX, where B0 is the intercept and B is the slope.

61

F is a linear function of C

[Figure: the Fahrenheit–Celsius graph again: the straight line F = (9/5)C + 32, with intercept 32 and slope 9/5.]

62

The regression line

[Figure: scatterplot of Y (Violence) against X (Exposure), with the regression line Y′ = 2.091 + 0.736X; the slope is 0.736.]

63

Using the equation

64

Predicting a score

[Figure: the regression line Y′ = 2.091 + 0.736X, with intercept 2.091. At an Exposure score of X = 9, the predicted Violence score is Y′ = 8.7; the true Violence score is Y = 8.0; so the error in prediction is e = Y – Y′ = –0.7.]
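The prediction on this slide can be reproduced from the regression equation Y′ = 2.091 + 0.736X given earlier; a minimal Python sketch:

```python
def predict_violence(exposure):
    # Regression equation from the slides: Y' = 2.091 + 0.736 * X
    return 2.091 + 0.736 * exposure

y_hat = predict_violence(9)  # predicted score for an Exposure score of 9
e = 8.0 - y_hat              # residual: true score minus predicted score
print(round(y_hat, 1))  # 8.7
print(round(e, 1))      # -0.7
```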

predicted Violence scoreY

65

The error in prediction

66

Simple regression

• B is the slope and B0 is the intercept of the regression equation Y′ = B0 + BX.

• Y′ is the y-coordinate of the point on the line above the value of X.

• An increase of one unit on variable X results in an estimated increase of B units on variable Y.

• A NEGATIVE value of B means that an increase of one unit on variable X results in an estimated REDUCTION of |B| units on Y.

• B0 is the regression constant (intercept); B is the regression coefficient (slope).

67

The ‘least-squares’ criterion

• In ORDINARY LEAST SQUARES (OLS) REGRESSION, a RESIDUAL score e is the difference between the real value Y and the estimate Y′ from the regression equation.

• For example, e = (Fred’s real violence score – Fred’s predicted violence score from the regression equation).

• OLS regression minimizes the sum of the squares of the residuals: Σ(Y – Y′)² = Σe².
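The least-squares slope and intercept have closed forms: B = Σ(X – MX)(Y – MY) / Σ(X – MX)² and B0 = MY – B·MX. A sketch on toy (hypothetical) data, checked against NumPy’s own least-squares fit:

```python
import numpy as np

# Toy data (hypothetical, not the violence study)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])

mx, my = x.mean(), y.mean()
B  = ((x - mx) * (y - my)).sum() / ((x - mx) ** 2).sum()  # slope
B0 = my - B * mx                                          # intercept

# np.polyfit minimizes the same sum of squared residuals
slope, intercept = np.polyfit(x, y, 1)
print(round(B, 2), round(B0, 2))  # 0.99 1.05
```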

68

Finding the values of B0 and B

B = Σ(X – MX)(Y – MY) / Σ(X – MX)² and B0 = MY – B·MX.

69

Regression line with independence

• When the variables show no association, the slope of the regression line is zero and the line runs horizontally through the mean MY of the criterion or dependent variable.

• The intercept (B0) is MY in this case.

70

Intercept-only prediction

• In OLS regression, the intercept B0 is related to the regression coefficient B according to

• B0 = MY – B·MX

• When X and Y are independent, the slope of the regression line is zero and

• B0 = MY

• The best we can do with regression is to draw a horizontal line at Y = MY through the middle of the cloud of points.

• Whatever the degree of association between X and Y, the INTERCEPT-ONLY prediction is Y′ = MY.
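The claim that the mean is the best intercept-only prediction can be checked numerically: among candidate constants c, the sum of squared errors Σ(Y – c)² is smallest at c = MY. A small sketch on toy scores:

```python
import numpy as np

y = np.array([2.0, 4.0, 6.0, 8.0])  # toy criterion scores

# Evaluate the sum of squared errors for a grid of constant predictions
candidates = np.linspace(0.0, 10.0, 1001)
sse = ((y[:, None] - candidates[None, :]) ** 2).sum(axis=0)

# The minimizing constant coincides with the mean of y
best = candidates[sse.argmin()]
print(best, y.mean())  # both 5.0
```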

71

Improved prediction

• There is a strong linear association here.

• The regression line makes much more accurate predictions than simply using the mean score on Actual violence as your prediction whatever the Preference value.

72

Summary

• How to run chi-square tests of association on SPSS.

• When the data are scarce, the usual chi-square test can give a misleading result.

• Run an EXACT TEST if there are warnings about low expected frequencies.

• REGRESSION is a set of techniques for predicting a target (dependent) variable from a regressor or independent variable.

73

An exercise

• I have placed the larger and smaller data sets for the Tissue and Antibody example on my website.

• Try running the chi-square tests.
