
Nonparametric tests

Dr William SimpsonPsychology, University of Plymouth

1

Hypothesis testing

2

An experiment

•Volunteers sign up to weight loss expt
•Randomly assign half to low carb diet, half to low fat diet
•For each subject, find weight loss at end
•Low carb (C): 10, 6, 7, 8, 14 kg
•Low fat (F): 0, 1, 3, 9, 2 kg

3

Is it “significant”?

•We have:
•C<-c(10,6,7,8,14); mean(C) is 9
•F<-c(0,1,3,9,2); mean(F) is 3
•It's obvious that low carb works better for these subjects
•Statistical significance comes in when we want to talk about people in general, or if we were to repeat the expt, or if we wonder if the low fat diet "really works"

4

Hypothesis testing

•A random process was involved with these data: random assignment
•Suppose that each person would lose the same amount of weight regardless of diet: 10, 6, 7, 8, 14, 0, 1, 3, 9, 2
•By chance, the big weight losers were assigned to the low carb diet and the small losers to low fat
•How likely is this sceptical idea?

5

Argument by contradiction

1. Assume the opposite of what we want to show ("A")
2. Show that this assumption leads to an absurd conclusion
3. Therefore the initial assumption was wrong; conclude "not A"

6

•Guy at party asserts: "solids are denser than liquids"
•I disagree. I want to show that liquids can be denser
•Assume the opposite of what I want to show: solid H2O is denser than liquid
•If ice were denser, then it would sink in water
•Ice does not sink
•Therefore ice is less dense than water

7

Null hypothesis testing

1. Assume the opposite of what we want to show: Pattern of weight loss just due to random assignment

2. Show that this assumption leads to very unlikely conclusion

3. Therefore initial assumption was wrong; weight loss NOT just random assignment (ie due to diet)

8

Weight loss hypo testing

•Null hypo: Pattern of weight loss just due to random assignment
•Calculate a "test statistic"
•Find prob of getting such an extreme test statistic if null hypo is true
•If prob is low, reject null hypo. The difference is "statistically significant"

“Nonparametric” tests

• Some types of statistical test make assumptions about the data distribution (e.g. Normal)

• Nonparametric tests make no such assumptions

10

When useful?

1. Interval or ratio data, when we don't want to make assumptions about the distribution and the sample size is small

2. Ordinal (rank) data

11

Ordinal data

•Data in graded categories. E.g. Likert scale:
1. Strongly disagree
2. Disagree
3. Neither agree nor disagree
4. Agree
5. Strongly agree
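In R (the language used throughout these slides), ordinal data like this can be stored as an ordered factor. A minimal sketch; the example responses are made up, and the level labels just mirror the Likert scale above:

```r
# Likert responses as an ordered factor (labels mirror the scale above)
resp <- c("Agree", "Disagree", "Strongly agree")
f <- factor(resp,
            levels = c("Strongly disagree", "Disagree",
                       "Neither agree nor disagree", "Agree", "Strongly agree"),
            ordered = TRUE)
as.integer(f)   # numeric codes of the responses: 4 2 5
```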

12

The tests

13

1. Two independent groups, between subjects

14

a) Permutation test

•In weight loss expt, each subject assigned randomly to one of two groups
•Null hypo says that our data are due simply to a fluke of random assignment

15

•Permutation test: use computer to do many random permutations. Compute diff in means each time. Get distrib. See how likely it is to get a diff as big as ours:
•mean(C) - mean(F) = 9 - 3 = 6 kg

16

•What mean diff C-F should we get if just random assignment?
•Should be near zero, but will vary.

17

•C: (10,6,7,8,14)  F: (0,1,3,9,2)
•Shuffle all 10 scores, treat the first 5 as C and the last 5 as F, and compute the mean diff:

  9  6  3  1  0 |  2 14  7 10  8    diff = -4.4
  2  6  8 10  7 | 14  0  9  3  1    diff =  1.2
  7  3  9 14  0 |  6 10  1  8  2    diff =  1.2
 14  0  1  6  9 | 10  8  2  7  3    diff =  0.0
 ... 1000s of times

18

C <- c(10,6,7,8,14)
F <- c(0,1,3,9,2)
x <- c(C,F)

nsim <- 5000
d <- rep(0,nsim)
for (i in 1:nsim)
{
  samp <- sample(x)                          # random permutation of all 10 scores
  d[i] <- mean(samp[1:5]) - mean(samp[6:10])
}

19

•hist(d)

•P(diff>=6)=.01

•sum(d>=6)/nsim

21

•If null hypo is true, chance of getting as big a mean diff as we found (6 kg) or bigger is about .01

•This is a “low” prob. Conventional low probs are .05, .01, .001

22

•Reject null hypo. Diff in weight loss not just due to random assignment. Statistically significant (p=.01)
•"Those on the low-carb diet lost significantly more weight (permutation test, p=.01)"

23

•Why do we say "p of getting diff as big as we got or bigger"?
•Because we would also reject null if we had a diff bigger than 6

24

Tails

25

One-tailed

•If we predicted that low carb would work better, expect mean(C) - mean(F) > 0

• What is chance of getting C-F=6 or more?

26

•P(diff>=6) is the right-hand tail

27

Two-tailed

•Reviewer says: “Yeah, but it could have turned out the other way, with C-F<0. You should have tested for both possibilities”

28

•Can test both possibilities at same time.
•Reject null either if C-F is a big negative or a big positive diff.
•Both tails of distribution.

29

30

•One-tailed or directional test: p=.0142
•sum(d>=6)/length(d)
•Two-tailed or nondirectional test: p=.034
•sum(d>=6)/length(d) + sum(d<= -6)/length(d)
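The whole permutation test above can be run as one self-contained sketch. The seed is fixed here only for reproducibility (the slides did not fix one), so simulated p-values will be close to, not identical to, those quoted:

```r
C <- c(10, 6, 7, 8, 14); F <- c(0, 1, 3, 9, 2)
x <- c(C, F)
obs <- mean(C) - mean(F)   # observed difference: 6
set.seed(1)                # fixed seed for a reproducible run
d <- replicate(5000, { s <- sample(x); mean(s[1:5]) - mean(s[6:10]) })
p1 <- mean(d >= obs)                    # one-tailed p, roughly .01-.02
p2 <- mean(d >= obs) + mean(d <= -obs)  # two-tailed p, roughly .03
c(p1, p2)
```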

31

One- vs two-tailed

•The p-value for 2-tailed will always be about twice as big as for 1-tailed
•Harder to get statistical signif
•More convincing to reviewers

32

Fallibility of hypo tests

•When p-value is small (<.05), we reject null hypo
•BUT even when the null hypo is actually true, we will reject it 5 times in 100!

Type I error

33

• Also possible to get a big p-value and fail to reject null even if a real effect exists. Type II error

• Will happen if effect is small and if sample size is small. Low power

34

b) Mann-Whitney-Wilcoxon test

•Suppose that we lump all the scores together
•C: (10,6,7,8,14)  F: (0,1,3,9,2)
•c,c,c,c,c,f,f,f,f,f
•10,6,7,8,14,0,1,3,9,2

35

•Now rank these scores
•If the diet had no effect on weight loss, expect the average of the ranks associated with the Fs and with the Cs to be similar.

36

•Pretend we originally had
•0 7 10 8 2 | 9 3 1 6 14
•Ranks:
•1 6 9 7 3 | 8 4 2 5 10
•mean rank C = mean(1,6,9,7,3) = 5.2;  mean rank F = mean(8,4,2,5,10) = 5.8

37

•If the diet had an effect, expect the mean of the ranks assoc with F to be markedly different from the mean of the ranks assoc with C.

38

•Pretend we originally had
•0 1 2 3 6 | 7 8 9 10 14
•Ranks:
•1 2 3 4 5 | 6 7 8 9 10
•mean rank C = mean(1,2,3,4,5) = 3;  mean rank F = mean(6,7,8,9,10) = 8

39

•Thus, if the average (or sum*) of the ranks associated with the Cs or Fs is too large or small, we have evidence that the null (weight loss same in both) should be rejected
•*mean = sum/n, so same except for scale factor

40

•Low carb (C): 10, 6, 7, 8, 14
•Low fat (F): 0, 1, 3, 9, 2

Score  Rank  Group
 14     10     C
 10      9     C
  9      8     F
  8      7     C
  7      6     C
  6      5     C
  3      4     F
  2      3     F
  1      2     F
  0      1     F

Sum of ranks for Group C = 10 + 9 + 7 + 6 + 5 = 37
Sum of ranks for Group F = 8 + 4 + 3 + 2 + 1 = 18
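These rank sums can be checked in R with rank(), which pools and ranks the scores in one step (a quick check of the hand calculation above):

```r
C <- c(10, 6, 7, 8, 14)   # low carb
F <- c(0, 1, 3, 9, 2)     # low fat
r <- rank(c(C, F))        # ranks of all 10 pooled scores
sum(r[1:5])    # rank sum for group C: 37
sum(r[6:10])   # rank sum for group F: 18
```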

Weight loss example

•Using the summed ranks, calculate a statistic (Mann-Whitney U)
•Distribution of U has been tabulated, given sample sizes n1 and n2
•Look up p-value in table

42

•wilcox.test() performs one- and two-sample Wilcoxon tests on vectors of data; the two-sample test is also known as the 'Mann-Whitney' test.

•wilcox.test(C,F,alternative="greater")

        Wilcoxon rank sum test

data: C and F
W = 22, p-value = 0.02778
alternative hypothesis: true location shift is greater than 0

43

•wilcox.test(C,F,alternative="two.sided")

        Wilcoxon rank sum test

data: C and F
W = 22, p-value = 0.05556
alternative hypothesis: true location shift is not equal to 0
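A note on the reported statistic: R's wilcox.test() returns W as the rank sum of the first sample minus its minimum possible value n1(n1+1)/2, which is why W = 37 - 15 = 22 here:

```r
C <- c(10, 6, 7, 8, 14); F <- c(0, 1, 3, 9, 2)
r <- rank(c(C, F))
W <- sum(r[1:5]) - 5*(5 + 1)/2   # 37 - 15 = 22, matching the output above
W
```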

44

Note: different tests

•Not all tests give the same answers
•The permutation test gave a smaller p-value (p=.034) than the U test (p=.056)
•Which one to believe? Use judgement

45

2. Paired groups, repeated measures, within subjects

46

Repeated measures design

•Repeated measures: each subject participates in both conditions, in random order
•Each subject serves as own control
•Data to be used: differences between each pair of scores.

47

a) Permutation test

•Use computer to re-assign order many times. Each time find mean of the diffs. Distribution of these gives prob of getting mean diff as big as we observe

48

•Null hypo: each person has a pair of scores, emitting one the first time tested and the other the 2nd time tested. These scores not related to treatment (C or F)

49

•Randomly shuffle the scores. Find mean diff each time.
•At end, have distrib of mean diffs

50

•If diff between diets just due to random assignment of order, expect our mean of diffs to be near zero. We had:
•C - F = (10,6,7,8,14) - (0,1,3,9,2) = 10, 5, 4, -1, 12;  mean = 6

51

C <- c(10,6,7,8,14)
F <- c(0,1,3,9,2)

nsim <- 5000
d <- rep(0,nsim)
for (i in 1:nsim)
{
  ord <- (runif(5) > .5)*2 - 1   # flip sign of each difference randomly
  samp <- (C - F)*ord
  d[i] <- mean(samp)
}

52

53

hist(d)

•One-tailed or directional test: p=.06
•sum(d>=6)/nsim

54

•Two-tailed or nondirectional test: p=.12
•sum(d>=6)/nsim + sum(d<= -6)/nsim
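With only 5 differences there are just 2^5 = 32 equally likely sign patterns, so the sign-flip null distribution can be enumerated exactly rather than simulated (a sketch of the same test):

```r
d <- c(10, 5, 4, -1, 12)   # the C - F differences; observed mean = 6
# all 32 combinations of +1/-1 signs, one row per pattern
signs <- as.matrix(expand.grid(rep(list(c(-1, 1)), 5)))
m <- signs %*% abs(d) / 5  # mean difference under each sign pattern
mean(m >= 6)               # exact one-tailed p: 2/32 = 0.0625
```

The simulated p of about .06 above is an estimate of this exact value.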

55

b) Wilcoxon signed-ranks test

•Repeated measures uses diffs
•C - F = (10,6,7,8,14) - (0,1,3,9,2) = 10, 5, 4, -1, 12

56

•Basic idea: if random order is all that determined scores, expect diffs below and above 0 to balance out
•Use signed ranks rather than raw scores

57

•Original diffs: 10, 5, 4, -1, 12
•Ranked by abs size: 4, 3, 2, 1, 5
•Then give any rank a minus sign if the original diff had a minus sign:
•Signed ranks: 4, 3, 2, -1, 5

58

•Find sum of the pos ranks
•Find |sum| of the neg ranks
•[under null hypo, expect them to be about equal]
•sum(4, 3, 2, 5) = 14;  |sum(-1)| = 1

59

•W = smaller of the 2 sums*
•sum(4, 3, 2, 5) = 14;  |sum(-1)| = 1;  so W = 1
•Use table to get p-value
•*different methods of calculating W exist
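The signed ranks and the two sums can be reproduced directly, using rank() on the absolute differences (a check of the hand calculation above):

```r
d <- c(10, 5, 4, -1, 12)        # C - F differences
sr <- rank(abs(d)) * sign(d)    # signed ranks: 4, 3, 2, -1, 5
sum(sr[sr > 0])                 # sum of positive ranks: 14
abs(sum(sr[sr < 0]))            # |sum of negative ranks|: 1
```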

60

•W = 1, n = 5
•For signif at p = .05, 1-tail, need W = 0
•Not signif

61

C <- c(10,6,7,8,14)
F <- c(0,1,3,9,2)

wilcox.test(C,F,alternative="greater",paired=T)

        Wilcoxon signed rank test

data: C and F
V = 14, p-value = 0.0625
alternative hypothesis: true location shift is greater than 0

62

C <- c(10,6,7,8,14)
F <- c(0,1,3,9,2)

wilcox.test(C,F,alternative="two.sided",paired=T)

        Wilcoxon signed rank test

data: C and F
V = 14, p-value = 0.125
alternative hypothesis: true location shift is not equal to 0

63

Panic study

• Efficacy of internet therapy for panic disorder. Journal of Behavior Therapy and Experimental Psychiatry 37 (2006) 213–238

64

• Agoraphobic Cognitions Questionnaire: 14-item self-report questionnaire. Rate how often each thought occurs during a period of anxiety from 0 (never) to 4 (always).

65


3. Independent, more than 2 groups: Kruskal-Wallis

70

ANOVA

•A significance test can be done with more than 2 groups
•It tests null hypo: "all groups are equal"

71

•Kruskal-Wallis is nonparametric version of ANOVA
•ANOVA = ANalysis Of VAriance

72

73

Total deviation of point around grand mean = Deviation of point around group mean + Deviation of group mean around grand mean

Total variance = Within-group variance + Between-group variance

•ANOVA computes the ratio: (variance between groups) / (variance within groups)
•A big ratio happens when not all groups are the same (ie the treatment has an effect)

74

Kruskal-Wallis

•Kruskal-Wallis is like indep groups ANOVA except calculation uses ranks

75

•Basic idea: if random order is all that determined scores, expect all groups to have about same average rank

76

example

•Attitude towards the use of preservatives in food: 6 vegans, 6 vegetarians, and 6 meat eaters. The data were collected using a 50-point rating scale. A higher score represents a more positive attitude.

77

Group:  1. Vegan   2. Vegetarian   3. Carnivore

        32          35              40
        26          29              28
        38          37              38
        29          42              39
        31          27              43
        30          36              41

78

rankings

Rank the observations from lowest to highest, regardless of group (ranks in parentheses):

Group:  1. Vegan    2. Vegetarian   3. Carnivore

        32 (8)      35 (9)          40 (15)
        26 (1)      29 (4.5)        28 (3)
        38 (12.5)   37 (11)         38 (12.5)
        29 (4.5)    42 (17)         39 (14)
        31 (7)      27 (2)          43 (18)
        30 (6)      36 (10)         41 (16)

Test statistic

•Essentially calculates variability of group mean ranks about grand mean
•If it is big, reject null (groups equal)
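The group mean ranks that the statistic compares can be computed directly from the data (a sketch; the group labels 1-3 for vegan, vegetarian, carnivore are just for illustration):

```r
x <- c(32,26,38,29,31,30)   # vegan
y <- c(35,29,37,42,27,36)   # vegetarian
z <- c(40,28,38,39,43,41)   # carnivore
r <- rank(c(x, y, z))       # pooled ranks; ties get the average rank
g <- rep(1:3, each = 6)     # group labels
tapply(r, g, mean)          # mean rank per group: 6.5, 8.92, 13.08
```

The carnivores' noticeably higher mean rank is what drives the (here nonsignificant) test statistic.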

80

x <- c(32,26,38,29,31,30)  # vegan
y <- c(35,29,37,42,27,36)  # vegetarian
z <- c(40,28,38,39,43,41)  # carnivore
kruskal.test(list(x, y, z))

        Kruskal-Wallis rank sum test

data: list(x, y, z)
Kruskal-Wallis chi-squared = 4.6792, df = 2, p-value = 0.09636

81

4. Repeated measures, more than 2 groups: Friedman

82

Friedman test (cf repeated measures ANOVA)

•Friedman is like repeated measures ANOVA except calculation uses ranks

83

•Ranking is now for indiv subject across conditions. This takes account of repeated measures

•For indep grps, ranking was across all subjects

84

example

•10 participants rated attractiveness (10 pt scale) of Photoshopped images of the same person. Picture 1 was unaltered. Picture 2 simulated a face-lift, Picture 3 a nose job, and Picture 4 a collagen implant. Did the manipulations affect attractiveness?

85

Rank the observations for each subject across conditions (ranks in parentheses):

                              Picture
Participant  1. Unaltered  2. Face-lift  3. Nose   4. Lips
 1            8 (4)         6 (2.5)       6 (2.5)   4 (1)
 2            5 (4)         4 (2.5)       3 (1)     4 (2.5)
 3            7 (4)         5 (2)         6 (3)     3 (1)
 4            5 (3)         7 (4)         3 (1)     4 (2)
 5            9 (4)         6 (3)         5 (2)     3 (1)
 6            7 (4)         6 (3)         5 (2)     4 (1)
 7            6 (3)         8 (4)         5 (1.5)   5 (1.5)
 8            6 (4)         5 (3)         3 (1)     4 (2)
 9            8 (4)         7 (3)         4 (1)     5 (2)
10            7 (4)         5 (2)         4 (1)     6 (3)

Test statistic

•Essentially calculates variability of group mean ranks about grand mean
•If it is big, reject null (groups equal)
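Here the ranking is done within each subject (row) first; the column means then summarise each picture (a check against the ranked table above):

```r
x1 <- c(8,5,7,5,9,7,6,6,8,7)  # unaltered
x2 <- c(6,4,5,7,6,6,8,5,7,5)  # face-lift
x3 <- c(6,3,6,3,5,5,5,3,4,4)  # nose
x4 <- c(4,4,3,4,3,4,5,4,5,6)  # lips
m <- cbind(x1, x2, x3, x4)
rk <- t(apply(m, 1, rank))    # rank each subject's 4 ratings across conditions
colMeans(rk)                  # mean rank per picture: 3.8, 2.9, 1.6, 1.7
```

The unaltered picture's clearly highest mean rank is what makes the Friedman statistic large.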

87

x1 <- c(8,5,7,5,9,7,6,6,8,7)  # unaltered
x2 <- c(6,4,5,7,6,6,8,5,7,5)  # face-lift
x3 <- c(6,3,6,3,5,5,5,3,4,4)  # nose
x4 <- c(4,4,3,4,3,4,5,4,5,6)  # lips
m <- cbind(x1,x2,x3,x4)
friedman.test(m)

        Friedman rank sum test

Friedman chi-squared = 20.4124, df = 3, p-value = 0.0001394

88

•“The Photoshop manipulation of the face images produced a significant effect on attractiveness ratings (Friedman chi-squared = 20.41, df = 3, p-value = 0.00014).”

89

Big issues

90

Sample size

•The nonparametric approach is most useful when the sample size is small
•Why small?
•Nonparametric statistics are used when we don't want to make assumptions about the data distrib

91

•When the sample is large (rule of thumb: 25 or more), we don't need to make assumptions anyway
•Due to the central limit theorem

92

•Parametric versions of the tests use calculations involving, and inferences about, sums of data
•Central limit theorem says that the distribution of a sum approaches the normal as sample size increases
•http://onlinestatbook.com/stat_sim/sampling_dist/index.html
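A quick simulation illustrates the point: even for a clearly non-normal parent distribution (an exponential is assumed here purely as an example), sums of 25 values already look roughly normal.

```r
set.seed(1)
# sums of n = 25 draws from a skewed (exponential, rate 1) distribution
sums <- replicate(10000, sum(rexp(25)))
hist(sums)                # roughly bell-shaped, as the CLT predicts
c(mean(sums), sd(sums))   # near the theoretical values 25 and 5
```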

93

Robustness

•Parametric tests (t-test, ANOVA) can be quite robust to violations of assumptions underlying them
•http://www.ruf.rice.edu/~lane/stat_sim/robustness/index.html

94

Summary

•logic of hypo testing: null hypo, test statistic, reject null, p-value
•Type I, Type II errors
•power, effect size, sample size

95

Nonparametric and parametric tests

•Permutation tests possible for every scenario

Nonparametric     Parametric
Mann-Whitney      indep groups t-test
Wilcoxon          repeated measures t-test
Kruskal-Wallis    indep groups ANOVA
Friedman          repeated measures ANOVA

96
