
Nonparametric tests

Dr William SimpsonPsychology, University of Plymouth

1

Hypothesis testing

2

An experiment

•Volunteers sign up to weight loss expt
•Randomly assign half to low carb diet, half to low fat diet
•For each subject, find weight loss at end
•Low carb (C): 10, 6, 7, 8, 14 kg
•Low fat (F): 0, 1, 3, 9, 2 kg

3

Is it “significant”?

•We have:
•C<-c(10,6,7,8,14); mean(C) is 9
•F<-c(0,1,3,9,2); mean(F) is 3
•It's obvious that low carb works better for these subjects
•Statistical significance comes in when we want to talk about people in general, or if we were to repeat the expt, or if we wonder if the low fat diet "really works"

4

Hypothesis testing

•A random process was involved with these data: random assignment
•Suppose that each person would lose the same amount of weight regardless of diet: 10, 6, 7, 8, 14, 0, 1, 3, 9, 2
•By chance, the big weight losers were assigned to the low carb diet and the small losers to low fat
•How likely is this sceptical idea?

5

Argument by contradiction

1. Assume the opposite of what we want to show ("A")
2. Show that this assumption leads to an absurd conclusion
3. Therefore the initial assumption was wrong; conclude "not A"

6

•Guy at party asserts: "solids are denser than liquids"
•I disagree. I want to show that liquids can be denser
•Assume the opposite of what I want to show: solid H2O is denser than liquid
•If ice were denser, then it would sink in water
•Ice does not sink
•Therefore ice is less dense than water

7

Null hypothesis testing

1. Assume the opposite of what we want to show: Pattern of weight loss just due to random assignment

2. Show that this assumption leads to very unlikely conclusion

3. Therefore initial assumption was wrong; weight loss NOT just random assignment (ie due to diet)

8

Weight loss hypo testing

•Null hypo: Pattern of weight loss just due to random assignment
•Calculate a "test statistic"
•Find prob of getting such an extreme test statistic if null hypo is true
•If prob is low, reject null hypo. The difference is "statistically significant"

“Nonparametric” tests

• Some types of statistical test make assumptions about the data distribution (e.g. Normal)

• Nonparametric tests make no such assumptions

10

When useful?

1. Interval or ratio data, when we don't want to make assumptions about the distribution and the sample size is small

2. Ordinal (rank) data

11

Ordinal data

•Data in graded categories. E.g. Likert scale:
1. Strongly disagree
2. Disagree
3. Neither agree nor disagree
4. Agree
5. Strongly agree
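In R (the language used throughout these slides), ordinal data like this can be stored as an ordered factor. A minimal sketch; the example responses are made up, and the level labels just mirror the Likert scale above:

```r
# Likert responses as an ordered factor (labels mirror the scale above)
resp <- c("Agree", "Disagree", "Strongly agree")
f <- factor(resp,
            levels = c("Strongly disagree", "Disagree",
                       "Neither agree nor disagree", "Agree", "Strongly agree"),
            ordered = TRUE)
as.integer(f)   # numeric codes of the responses: 4 2 5
```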

12

The tests

13

1. Two independent groups, between subjects

14

a) Permutation test

•In weight loss expt, each subject assigned randomly to one of two groups
•Null hypo says that our data are due simply to a fluke of random assignment

15

•Permutation test: use computer to do many random permutations. Compute diff in means each time. Get distrib. See how likely it is to get a diff as big as ours:
•mean(C) - mean(F) = 9 - 3 = 6 kg

16

•What mean diff C-F should we get if just random assignment?
•Should be near zero, but will vary.

17

•C: (10,6,7,8,14)  F: (0,1,3,9,2)
•Shuffle all 10 scores, treat the first 5 as C and the last 5 as F, and compute the mean diff:

  9  6  3  1  0 |  2 14  7 10  8    diff = -4.4
  2  6  8 10  7 | 14  0  9  3  1    diff =  1.2
  7  3  9 14  0 |  6 10  1  8  2    diff =  1.2
 14  0  1  6  9 | 10  8  2  7  3    diff =  0.0
 ... 1000s of times

18

C <- c(10,6,7,8,14)
F <- c(0,1,3,9,2)
x <- c(C,F)

nsim <- 5000
d <- rep(0,nsim)
for (i in 1:nsim)
{
  samp <- sample(x)                          # random permutation of all 10 scores
  d[i] <- mean(samp[1:5]) - mean(samp[6:10])
}

19

•hist(d)

•P(diff>=6)=.01

•sum(d>=6)/nsim

21

•If null hypo is true, chance of getting as big a mean diff as we found (6 kg) or bigger is about .01

•This is a “low” prob. Conventional low probs are .05, .01, .001

22

•Reject null hypo. Diff in weight loss not just due to random assignment. Statistically significant (p=.01)
•"Those on the low-carb diet lost significantly more weight (permutation test, p=.01)"

23

•Why do we say "p of getting diff as big as we got or bigger"?
•Because we would also reject null if we had a diff bigger than 6

24

Tails

25

One-tailed

•If we predicted that low carb would work better, expect mean(C) - mean(F) > 0

• What is chance of getting C-F=6 or more?

26

•P(diff>=6) is the right-hand tail

27

Two-tailed

•Reviewer says: “Yeah, but it could have turned out the other way, with C-F<0. You should have tested for both possibilities”

28

•Can test both possibilities at same time.
•Reject null either if C-F is a big negative or a big positive diff.
•Both tails of distribution.

29

30

•One-tailed or directional test: p=.0142
•sum(d>=6)/length(d)
•Two-tailed or nondirectional test: p=.034
•sum(d>=6)/length(d) + sum(d<= -6)/length(d)
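The whole permutation test above can be run as one self-contained sketch. The seed is fixed here only for reproducibility (the slides did not fix one), so simulated p-values will be close to, not identical to, those quoted:

```r
C <- c(10, 6, 7, 8, 14); F <- c(0, 1, 3, 9, 2)
x <- c(C, F)
obs <- mean(C) - mean(F)   # observed difference: 6
set.seed(1)                # fixed seed for a reproducible run
d <- replicate(5000, { s <- sample(x); mean(s[1:5]) - mean(s[6:10]) })
p1 <- mean(d >= obs)                    # one-tailed p, roughly .01-.02
p2 <- mean(d >= obs) + mean(d <= -obs)  # two-tailed p, roughly .03
c(p1, p2)
```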

31

One- vs two-tailed

•The p-value for 2-tailed will always be about twice as big as for 1-tailed
•Harder to get statistical signif
•More convincing to reviewers

32

Fallibility of hypo tests

•When p-value is small (<.05), we reject null hypo
•BUT even when the null hypo is actually true, we will reject it 5 times in 100!

Type I error

33

• Also possible to get a big p-value and fail to reject null even if a real effect exists. Type II error

• Will happen if effect is small and if sample size is small. Low power

34

b) Mann-Whitney-Wilcoxon test

•Suppose that we lump all the scores together
•C: (10,6,7,8,14)  F: (0,1,3,9,2)
•c,c,c,c,c,f,f,f,f,f
•10,6,7,8,14,0,1,3,9,2

35

•Now rank these scores
•If the diet had no effect on weight loss, expect the average of the ranks associated with the Fs and with the Cs to be similar.

36

•Pretend we originally had
•0 7 10 8 2 | 9 3 1 6 14
•Ranks:
•1 6 9 7 3 | 8 4 2 5 10
•mean rank C = mean(1,6,9,7,3) = 5.2;  mean rank F = mean(8,4,2,5,10) = 5.8

37

•If the diet had an effect, expect the mean of the ranks assoc with F to be markedly different from the mean of the ranks assoc with C.

38

•Pretend we originally had
•0 1 2 3 6 | 7 8 9 10 14
•Ranks:
•1 2 3 4 5 | 6 7 8 9 10
•mean rank C = mean(1,2,3,4,5) = 3;  mean rank F = mean(6,7,8,9,10) = 8

39

•Thus, if the average (or sum*) of the ranks associated with the Cs or Fs is too large or small, we have evidence that the null (weight loss same in both) should be rejected
•*mean = sum/n, so same except for scale factor

40

•Low carb (C): 10, 6, 7, 8, 14
•Low fat (F): 0, 1, 3, 9, 2

Score  Rank  Group
 14     10     C
 10      9     C
  9      8     F
  8      7     C
  7      6     C
  6      5     C
  3      4     F
  2      3     F
  1      2     F
  0      1     F

Sum of ranks for Group C = 10 + 9 + 7 + 6 + 5 = 37
Sum of ranks for Group F = 8 + 4 + 3 + 2 + 1 = 18
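These rank sums can be checked in R with rank(), which pools and ranks the scores in one step (a quick check of the hand calculation above):

```r
C <- c(10, 6, 7, 8, 14)   # low carb
F <- c(0, 1, 3, 9, 2)     # low fat
r <- rank(c(C, F))        # ranks of all 10 pooled scores
sum(r[1:5])    # rank sum for group C: 37
sum(r[6:10])   # rank sum for group F: 18
```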

Weight loss example

•Using the summed ranks, calculate a statistic (Mann-Whitney U)
•Distribution of U has been tabulated, given sample sizes n1 and n2
•Look up p-value in table

42

•wilcox.test() performs one- and two-sample Wilcoxon tests on vectors of data; the two-sample test is also known as the 'Mann-Whitney' test.

•wilcox.test(C,F,alternative="greater")

        Wilcoxon rank sum test

data: C and F
W = 22, p-value = 0.02778
alternative hypothesis: true location shift is greater than 0

43

•wilcox.test(C,F,alternative="two.sided")

        Wilcoxon rank sum test

data: C and F
W = 22, p-value = 0.05556
alternative hypothesis: true location shift is not equal to 0
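A note on the reported statistic: R's wilcox.test() returns W as the rank sum of the first sample minus its minimum possible value n1(n1+1)/2, which is why W = 37 - 15 = 22 here:

```r
C <- c(10, 6, 7, 8, 14); F <- c(0, 1, 3, 9, 2)
r <- rank(c(C, F))
W <- sum(r[1:5]) - 5*(5 + 1)/2   # 37 - 15 = 22, matching the output above
W
```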

44

Note: different tests

•Not all tests give the same answers
•The permutation test gave a smaller p-value (p=.034) than the U test (p=.056)
•Which one to believe? Use judgement

45

2. Paired groups, repeated measures, within subjects

46

Repeated measures design

•Repeated measures: each subject participates in both conditions, in random order
•Each subject serves as own control
•Data to be used: differences between each pair of scores.

47

a) Permutation test

•Use computer to re-assign order many times. Each time find mean of the diffs. Distribution of these gives prob of getting mean diff as big as we observe

48

•Null hypo: each person has a pair of scores, emitting one the first time tested and the other the 2nd time tested. These scores not related to treatment (C or F)

49

•Randomly shuffle the scores. Find mean diff each time.
•At end, have distrib of mean diffs

50

•If diff between diets just due to random assignment of order, expect our mean of diffs to be near zero. We had:
•C - F = (10,6,7,8,14) - (0,1,3,9,2) = 10, 5, 4, -1, 12;  mean = 6

51

C <- c(10,6,7,8,14)
F <- c(0,1,3,9,2)

nsim <- 5000
d <- rep(0,nsim)
for (i in 1:nsim)
{
  ord <- (runif(5) > .5)*2 - 1   # flip sign of each difference randomly
  samp <- (C - F)*ord
  d[i] <- mean(samp)
}

52

53

hist(d)

•One-tailed or directional test: p=.06
•sum(d>=6)/nsim

54

•Two-tailed or nondirectional test: p=.12
•sum(d>=6)/nsim + sum(d<= -6)/nsim
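With only 5 differences there are just 2^5 = 32 equally likely sign patterns, so the sign-flip null distribution can be enumerated exactly rather than simulated (a sketch of the same test):

```r
d <- c(10, 5, 4, -1, 12)   # the C - F differences; observed mean = 6
# all 32 combinations of +1/-1 signs, one row per pattern
signs <- as.matrix(expand.grid(rep(list(c(-1, 1)), 5)))
m <- signs %*% abs(d) / 5  # mean difference under each sign pattern
mean(m >= 6)               # exact one-tailed p: 2/32 = 0.0625
```

The simulated p of about .06 above is an estimate of this exact value.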

55

b) Wilcoxon signed-ranks test

•Repeated measures uses diffs
•C - F = (10,6,7,8,14) - (0,1,3,9,2) = 10, 5, 4, -1, 12

56

•Basic idea: if random order is all that determined scores, expect diffs below and above 0 to balance out
•Use signed ranks rather than raw scores

57

•Original diffs: 10, 5, 4, -1, 12
•Ranked by abs size: 4, 3, 2, 1, 5
•Then give any rank a minus sign if the original diff had a minus sign:
•Signed ranks: 4, 3, 2, -1, 5

58

•Find sum of the pos ranks
•Find |sum| of the neg ranks
•[under null hypo, expect them to be about equal]
•sum(4, 3, 2, 5) = 14;  |sum(-1)| = 1

59

•W = smaller of the 2 sums*
•sum(4, 3, 2, 5) = 14;  |sum(-1)| = 1;  so W = 1
•Use table to get p-value
•*different methods of calculating W exist
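The signed ranks and the two sums can be reproduced directly, using rank() on the absolute differences (a check of the hand calculation above):

```r
d <- c(10, 5, 4, -1, 12)        # C - F differences
sr <- rank(abs(d)) * sign(d)    # signed ranks: 4, 3, 2, -1, 5
sum(sr[sr > 0])                 # sum of positive ranks: 14
abs(sum(sr[sr < 0]))            # |sum of negative ranks|: 1
```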

60

•W = 1, n = 5
•For signif at p = .05, 1-tail, need W = 0
•Not signif

61

C <- c(10,6,7,8,14)
F <- c(0,1,3,9,2)

wilcox.test(C,F,alternative="greater",paired=T)

        Wilcoxon signed rank test

data: C and F
V = 14, p-value = 0.0625
alternative hypothesis: true location shift is greater than 0

62

C <- c(10,6,7,8,14)
F <- c(0,1,3,9,2)

wilcox.test(C,F,alternative="two.sided",paired=T)

        Wilcoxon signed rank test

data: C and F
V = 14, p-value = 0.125
alternative hypothesis: true location shift is not equal to 0

63

Panic study

• Efficacy of internet therapy for panic disorder. Journal of Behavior Therapy and Experimental Psychiatry 37 (2006) 213–238

64

• Agoraphobic Cognitions Questionnaire: 14-item self-report questionnaire. Rate how often each thought occurs during a period of anxiety from 0 (never) to 4 (always).

65


3. Independent, more than 2 groups: Kruskal-Wallis

70

ANOVA

•A significance test can be done with more than 2 groups
•It tests null hypo: "all groups are equal"

71

•Kruskal-Wallis is nonparametric version of ANOVA
•ANOVA = ANalysis Of VAriance

72

73

Total deviation of point around grand mean = Deviation of point around group mean + Deviation of group mean around grand mean

Total variance = Within-group variance + Between-group variance

•ANOVA computes the ratio: (variance between groups) / (variance within groups)
•A big ratio happens when not all groups are the same (ie the treatment has an effect)

74

Kruskal-Wallis

•Kruskal-Wallis is like indep groups ANOVA except calculation uses ranks

75

•Basic idea: if random order is all that determined scores, expect all groups to have about same average rank

76

example

•Attitude towards the use of preservatives in food: 6 vegans, 6 vegetarians, and 6 meat eaters. The data were collected using a 50-point rating scale. A higher score represents a more positive attitude.

77

Group:  1. Vegan   2. Vegetarian   3. Carnivore

        32          35              40
        26          29              28
        38          37              38
        29          42              39
        31          27              43
        30          36              41

78

rankings

Rank the observations from lowest to highest, regardless of group (ranks in parentheses):

Group:  1. Vegan    2. Vegetarian   3. Carnivore

        32 (8)      35 (9)          40 (15)
        26 (1)      29 (4.5)        28 (3)
        38 (12.5)   37 (11)         38 (12.5)
        29 (4.5)    42 (17)         39 (14)
        31 (7)      27 (2)          43 (18)
        30 (6)      36 (10)         41 (16)

Test statistic

•Essentially calculates variability of group mean ranks about grand mean
•If it is big, reject null (groups equal)
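The group mean ranks that the statistic compares can be computed directly from the data (a sketch; the group labels 1-3 for vegan, vegetarian, carnivore are just for illustration):

```r
x <- c(32,26,38,29,31,30)   # vegan
y <- c(35,29,37,42,27,36)   # vegetarian
z <- c(40,28,38,39,43,41)   # carnivore
r <- rank(c(x, y, z))       # pooled ranks; ties get the average rank
g <- rep(1:3, each = 6)     # group labels
tapply(r, g, mean)          # mean rank per group: 6.5, 8.92, 13.08
```

The carnivores' noticeably higher mean rank is what drives the (here nonsignificant) test statistic.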

80

x <- c(32,26,38,29,31,30)  # vegan
y <- c(35,29,37,42,27,36)  # vegetarian
z <- c(40,28,38,39,43,41)  # carnivore
kruskal.test(list(x, y, z))

        Kruskal-Wallis rank sum test

data: list(x, y, z)
Kruskal-Wallis chi-squared = 4.6792, df = 2, p-value = 0.09636

81

4. Repeated measures, more than 2 groups: Friedman

82

Friedman test (cf repeated measures ANOVA)

•Friedman is like repeated measures ANOVA except calculation uses ranks

83

•Ranking is now for indiv subject across conditions. This takes account of repeated measures

•For indep grps, ranking was across all subjects

84

example

•10 participants rated attractiveness (10 pt scale) of Photoshopped images of the same person. Picture 1 was unaltered. Picture 2 simulated a face-lift, Picture 3 a nose job, and Picture 4 a collagen implant. Did the manipulations affect attractiveness?

85

Rank the observations for each subject across conditions (ranks in parentheses):

                              Picture
Participant  1. Unaltered  2. Face-lift  3. Nose   4. Lips
 1            8 (4)         6 (2.5)       6 (2.5)   4 (1)
 2            5 (4)         4 (2.5)       3 (1)     4 (2.5)
 3            7 (4)         5 (2)         6 (3)     3 (1)
 4            5 (3)         7 (4)         3 (1)     4 (2)
 5            9 (4)         6 (3)         5 (2)     3 (1)
 6            7 (4)         6 (3)         5 (2)     4 (1)
 7            6 (3)         8 (4)         5 (1.5)   5 (1.5)
 8            6 (4)         5 (3)         3 (1)     4 (2)
 9            8 (4)         7 (3)         4 (1)     5 (2)
10            7 (4)         5 (2)         4 (1)     6 (3)

Test statistic

•Essentially calculates variability of group mean ranks about grand mean
•If it is big, reject null (groups equal)
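Here the ranking is done within each subject (row) first; the column means then summarise each picture (a check against the ranked table above):

```r
x1 <- c(8,5,7,5,9,7,6,6,8,7)  # unaltered
x2 <- c(6,4,5,7,6,6,8,5,7,5)  # face-lift
x3 <- c(6,3,6,3,5,5,5,3,4,4)  # nose
x4 <- c(4,4,3,4,3,4,5,4,5,6)  # lips
m <- cbind(x1, x2, x3, x4)
rk <- t(apply(m, 1, rank))    # rank each subject's 4 ratings across conditions
colMeans(rk)                  # mean rank per picture: 3.8, 2.9, 1.6, 1.7
```

The unaltered picture's clearly highest mean rank is what makes the Friedman statistic large.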

87

x1 <- c(8,5,7,5,9,7,6,6,8,7)  # unaltered
x2 <- c(6,4,5,7,6,6,8,5,7,5)  # face-lift
x3 <- c(6,3,6,3,5,5,5,3,4,4)  # nose
x4 <- c(4,4,3,4,3,4,5,4,5,6)  # lips
m <- cbind(x1,x2,x3,x4)
friedman.test(m)

        Friedman rank sum test

Friedman chi-squared = 20.4124, df = 3, p-value = 0.0001394

88

•“The Photoshop manipulation of the face images produced a significant effect on attractiveness ratings (Friedman chi-squared = 20.41, df = 3, p-value = 0.00014).”

89

Big issues

90

Sample size

•The nonparametric approach is most useful when the sample size is small
•Why small?
•Nonparametric statistics are used when we don't want to make assumptions about the data distrib

91

•When the sample is large (rule of thumb: 25 or more), we don't need to make assumptions anyway
•Due to the central limit theorem

92

•Parametric versions of the tests use calculations involving, and inferences about, sums of data
•Central limit theorem says that the distribution of a sum approaches the normal as sample size increases
•http://onlinestatbook.com/stat_sim/sampling_dist/index.html
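A quick simulation illustrates the point: even for a clearly non-normal parent distribution (an exponential is assumed here purely as an example), sums of 25 values already look roughly normal.

```r
set.seed(1)
# sums of n = 25 draws from a skewed (exponential, rate 1) distribution
sums <- replicate(10000, sum(rexp(25)))
hist(sums)                # roughly bell-shaped, as the CLT predicts
c(mean(sums), sd(sums))   # near the theoretical values 25 and 5
```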

93

Robustness

•Parametric tests (t-test, ANOVA) can be quite robust to violations of assumptions underlying them
•http://www.ruf.rice.edu/~lane/stat_sim/robustness/index.html

94

Summary

•logic of hypo testing: null hypo, test statistic, reject null, p-value
•Type I, Type II errors
•power, effect size, sample size

95

Nonparametric and parametric tests

•Permutation tests possible for every scenario

Nonparametric     Parametric
Mann-Whitney      indep groups t-test
Wilcoxon          repeated measures t-test
Kruskal-Wallis    indep groups ANOVA
Friedman          repeated measures ANOVA

96
