A Comparison of Statistical Significance Tests for Information Retrieval Evaluation

CIKM'07, November 2007

Summary

Motivation
Significance Testing
General Approach
Significance Tests: Randomization test, Wilcoxon test, Sign test, Bootstrap test, Student’s t test
Results
Discussion
Conclusions

Motivation

Goal => Promote retrieval methods that truly are better, rather than methods that perform better by chance given the set of topics, judgments, and documents used in the evaluation.

Given two information retrieval (IR) systems, how can we determine which one is better? Common evaluations, such as those at TREC, compare systems by the difference in Mean Average Precision (MAP). Problems? How can they be solved? Use significance tests!

Which significance test should IR researchers use? Student’s paired t test? Wilcoxon signed rank test? Sign test? Bootstrap? Fisher’s randomization test?

Significance Testing

A significance test consists of three ingredients:

1. A test statistic or criterion by which to judge the two systems. IR researchers commonly use the difference in mean average precision (MAP) or the difference in the mean of another IR metric.

2. A distribution of the test statistic given a null hypothesis. A typical null hypothesis is that there is no difference between our two systems.

3. A significance level (p-value) that is computed by taking the value of the test statistic for our experimental systems and determining how likely a value at least that large could have occurred under the null hypothesis.
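As a small illustration of the first ingredient, the sketch below computes the difference in MAP from per-topic average precision (AP) scores. The function name `map_difference` and the scores are hypothetical, chosen only for illustration; they are not data from the paper.

```python
from statistics import mean


def map_difference(ap_a, ap_b):
    """Test statistic: MAP of system B minus MAP of system A.

    Both lists hold one average precision score per topic, in the same
    topic order, so the comparison is paired by topic.
    """
    if len(ap_a) != len(ap_b):
        raise ValueError("both systems must be scored on the same topics")
    return mean(ap_b) - mean(ap_a)


# Hypothetical per-topic AP scores (illustrative only).
ap_a = [0.20, 0.35, 0.10, 0.50, 0.42]
ap_b = [0.25, 0.30, 0.22, 0.61, 0.47]

observed = map_difference(ap_a, ap_b)
print(f"difference in MAP: {observed:.3f}")  # approximately 0.056
```

The remaining two ingredients, the null distribution and the p-value, are where the individual tests differ.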

General Approach

Randomization test: p-value = 0.0138
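A minimal sketch of a paired randomization test, assuming a Monte Carlo approximation rather than full enumeration of all permutations. Under the null hypothesis the two systems are interchangeable, so each topic's pair of scores is swapped with probability 0.5 and the statistic is recomputed. The function name, the number of trials, and the sample scores are assumptions made for this example.

```python
import random
from statistics import mean


def randomization_test(scores_a, scores_b, statistic=mean, trials=10000, seed=0):
    """Two-sided paired randomization test (Monte Carlo approximation).

    Returns the observed difference statistic(B) - statistic(A) and the
    fraction of random relabelings whose absolute difference is at least
    as large as the observed one.
    """
    rng = random.Random(seed)
    observed = statistic(scores_b) - statistic(scores_a)
    extreme = 0
    for _ in range(trials):
        perm_a, perm_b = [], []
        for a, b in zip(scores_a, scores_b):
            # Under the null hypothesis the system labels carry no
            # information, so swap them for this topic half of the time.
            if rng.random() < 0.5:
                a, b = b, a
            perm_a.append(a)
            perm_b.append(b)
        if abs(statistic(perm_b) - statistic(perm_a)) >= abs(observed):
            extreme += 1
    return observed, extreme / trials


# Hypothetical per-topic AP scores (illustrative only).
ap_a = [0.20, 0.35, 0.10, 0.50, 0.42]
ap_b = [0.25, 0.30, 0.22, 0.61, 0.47]
diff, p_value = randomization_test(ap_a, ap_b)
print(f"difference in MAP = {diff:.3f}, p-value = {p_value:.4f}")
```

Because the statistic is passed in as a function, the same sketch works for other criteria, for example `statistics.median` instead of the mean.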

Wilcoxon Test: p-value = 0.0560
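A rough sketch of the Wilcoxon signed rank test, assuming the large-sample normal approximation with no tie or continuity correction (statistical packages usually offer exact or corrected variants). The function name is hypothetical. Note that the statistic is built from the ranks of the per-topic differences, not from the difference in means.

```python
import math


def wilcoxon_signed_rank(scores_a, scores_b):
    """Two-sided Wilcoxon signed rank test via the normal approximation.

    Returns (W+, p-value), where W+ is the sum of the ranks of the
    positive differences. Zero differences are dropped.
    """
    diffs = [b - a for a, b in zip(scores_a, scores_b) if b != a]
    n = len(diffs)
    if n == 0:
        return 0.0, 1.0

    # Rank the absolute differences; tied values share their average rank.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1

    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean_w = n * (n + 1) / 4
    sd_w = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean_w) / sd_w
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return w_plus, p_value


# Hypothetical per-topic AP scores (illustrative only).
ap_a = [0.20, 0.35, 0.10, 0.50, 0.42]
ap_b = [0.25, 0.30, 0.22, 0.61, 0.47]
w_plus, p_value = wilcoxon_signed_rank(ap_a, ap_b)
print(f"W+ = {w_plus}, p-value = {p_value:.4f}")
```

With only a handful of topics the normal approximation is crude; the example is meant to show how the statistic is formed, not to give a trustworthy p-value.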

Sign Test: p-value = 0.3222

p-value = 0.3604
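A minimal sketch of the sign test, assuming the exact two-sided binomial version with ties dropped. It counts only which system wins on each topic and ignores the size of the difference. The function name and the scores are hypothetical.

```python
from math import comb


def sign_test(scores_a, scores_b):
    """Exact two-sided sign test.

    Returns (wins for A, wins for B, p-value). Topics where both systems
    score the same are dropped before testing.
    """
    wins_a = sum(1 for a, b in zip(scores_a, scores_b) if a > b)
    wins_b = sum(1 for a, b in zip(scores_a, scores_b) if b > a)
    n = wins_a + wins_b
    if n == 0:
        return wins_a, wins_b, 1.0
    k = min(wins_a, wins_b)
    # Probability of k or fewer wins out of n fair coin flips.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return wins_a, wins_b, min(1.0, 2 * tail)


# Hypothetical per-topic AP scores (illustrative only).
ap_a = [0.20, 0.35, 0.10, 0.50, 0.42]
ap_b = [0.25, 0.30, 0.22, 0.61, 0.47]
wins_a, wins_b, p_value = sign_test(ap_a, ap_b)
print(f"A wins {wins_a}, B wins {wins_b}, p-value = {p_value:.4f}")
```

Because only the direction of each difference is used, a large gain on one topic counts the same as a tiny one, which is one way to see why this criterion differs from the difference in MAP.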

Bootstrap Test: p-value = 0.0107
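A minimal sketch of a paired bootstrap test. It assumes the "shift" variant works as follows: resample topics with replacement, record the mean difference for each resample, then shift that bootstrap distribution so its mean is zero to stand in for the null hypothesis. The function name, the number of resamples, and the scores are assumptions made for this example.

```python
import random
from statistics import mean


def bootstrap_shift_test(scores_a, scores_b, resamples=10000, seed=0):
    """Two-sided paired bootstrap test using a shifted distribution.

    Returns the observed mean difference (B - A) and the fraction of
    shifted bootstrap means at least as extreme as the observed value.
    """
    rng = random.Random(seed)
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    observed = mean(diffs)

    # Resample the per-topic differences with replacement.
    boot_means = [mean(rng.choices(diffs, k=n)) for _ in range(resamples)]

    # Shift the bootstrap distribution so that it is centered on zero.
    center = mean(boot_means)
    extreme = sum(1 for m in boot_means if abs(m - center) >= abs(observed))
    return observed, extreme / resamples


# Hypothetical per-topic AP scores (illustrative only).
ap_a = [0.20, 0.35, 0.10, 0.50, 0.42]
ap_b = [0.25, 0.30, 0.22, 0.61, 0.47]
diff, p_value = bootstrap_shift_test(ap_a, ap_b)
print(f"difference in MAP = {diff:.3f}, p-value = {p_value:.4f}")
```

Resampling topics treats them as a random sample from a larger population of topics, which is the assumption the bootstrap relies on.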

Student’s Paired t-test: p-value = 0.0153
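A minimal sketch of the paired t statistic, computed from the per-topic differences. Only the statistic and its degrees of freedom are computed here; turning them into a p-value requires the t distribution, which the Python standard library does not provide (a statistics package such as SciPy, via `scipy.stats.ttest_rel`, could be used if available). The function name and the scores are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t_statistic(scores_a, scores_b):
    """Paired t statistic for the mean per-topic difference (B - A).

    Returns (t, degrees of freedom). Under the null hypothesis, and
    assuming roughly normal differences, t follows a t distribution
    with n - 1 degrees of freedom.
    """
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two topics")
    sd = stdev(diffs)  # sample standard deviation (n - 1 denominator)
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    return mean(diffs) / (sd / sqrt(n)), n - 1


# Hypothetical per-topic AP scores (illustrative only).
ap_a = [0.20, 0.35, 0.10, 0.50, 0.42]
ap_b = [0.25, 0.30, 0.22, 0.61, 0.47]
t, dof = paired_t_statistic(ap_a, ap_b)
print(f"t = {t:.3f} with {dof} degrees of freedom")
```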

Results

Discussion

Sign and Wilcoxon tests: These tests should not be used because they test criteria that do not match the criterion of interest.

Randomization and Bootstrap tests: These tests can use whatever criterion we specify, while the other tests are fixed in their test statistics.

Bootstrap test and Student’s t test: These tests assume that the scores from the two IR systems are random samples from a single population. However, test topics are not random samples from the population of topics; they are hand selected to meet various criteria.

Student’s t test: This test can only be used for the difference between means, not for the median or other test statistics. At smaller sample sizes, violations of normality may result in errors in the t test.

Conclusion

The Randomization test is the recommended test to use when comparing two IR systems.

The Wilcoxon Signed Rank test and the Sign test should no longer be used in this context.

The Randomization test, the Bootstrap shift method test, and Student’s t test all produced comparable significance values => there is no practical difference between them!

The Wilcoxon Signed Rank test and the Sign test both produced very different p-values => they can incorrectly predict significance and can fail to detect significant results.
