A Comparison of Statistical Significance Tests for Information Retrieval Evaluation

CIKM'07, November 2007

Summary

Motivation
Significance Testing
General Approach
Significance Tests: Randomization test, Wilcoxon test, Sign test, Bootstrap test, Student's t test
Results
Discussion
Conclusions

Motivation

Goal => Promote retrieval methods that truly are better, rather than methods that perform better by chance given the set of topics, judgments, and documents used in the evaluation.

Given two information retrieval (IR) systems, how can we determine which one is better? Common approaches, such as those used at TREC, compare the difference in Mean Average Precision (MAP). The problem: an observed difference in MAP may be due to chance. How can this be solved? Use significance tests!

Which significance test should IR researchers use? Student's paired t-test? Wilcoxon signed-rank test? Sign test? Bootstrap? Fisher's randomization test?

Significance Testing

A significance test consists of three ingredients:

1. A test statistic or criterion by which to judge the two systems. IR researchers commonly use the difference in mean average precision (MAP) or the difference in the mean of another IR metric.

2. A distribution of the test statistic given a null hypothesis. A typical null hypothesis is that there is no difference between the two systems.

3. A significance level (p-value) that is computed by taking the value of the test statistic for our experimental systems and determining how likely a value at least that large could have occurred under the null hypothesis.
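
The first ingredient, the test statistic, can be sketched in a few lines of Python. This is only an illustration: the per-topic average precision (AP) scores below are made up, and the function name map_difference is a hypothetical helper, not something taken from the slides.

```python
from statistics import mean

# Hypothetical per-topic average precision (AP) scores for two systems,
# aligned by topic. These numbers are invented for illustration only.
ap_a = [0.42, 0.31, 0.58, 0.12, 0.77, 0.25, 0.49, 0.36]
ap_b = [0.38, 0.35, 0.51, 0.10, 0.70, 0.27, 0.40, 0.30]


def map_difference(scores_a, scores_b):
    """Test statistic: MAP(A) - MAP(B), computed over the same topics."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both systems must be scored on the same topics")
    return mean(scores_a) - mean(scores_b)


observed = map_difference(ap_a, ap_b)
print(f"MAP(A) = {mean(ap_a):.4f}, MAP(B) = {mean(ap_b):.4f}")
print(f"observed difference in MAP = {observed:.4f}")
```

The remaining two ingredients, the null distribution and the p-value, are what distinguish the individual tests from one another.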

General Approach

Randomization test

p-value = 0.0138
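
A minimal sketch of a paired randomization test, assuming per-topic scores for both systems are available. Under the null hypothesis the system labels are interchangeable on each topic, so the code randomly flips the sign of each per-topic difference and counts how often the permuted mean difference is at least as extreme as the observed one. The function name, the Monte Carlo sample count, and the example scores are assumptions; the scores are not the data behind the p-value above.

```python
import random


def randomization_test(scores_a, scores_b, num_samples=10_000, seed=0):
    """Two-sided Monte Carlo randomization test on the mean difference."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    observed = abs(sum(diffs) / n)
    extreme = 0
    for _ in range(num_samples):
        # Swapping the two systems' scores on a topic flips the sign
        # of that topic's difference.
        permuted = [d if rng.random() < 0.5 else -d for d in diffs]
        # Small tolerance guards against floating-point rounding.
        if abs(sum(permuted) / n) >= observed - 1e-12:
            extreme += 1
    return extreme / num_samples


# Hypothetical per-topic AP scores (invented for illustration).
ap_a = [0.42, 0.31, 0.58, 0.12, 0.77, 0.25, 0.49, 0.36]
ap_b = [0.38, 0.35, 0.51, 0.10, 0.70, 0.27, 0.40, 0.30]
print(f"p-value = {randomization_test(ap_a, ap_b):.4f}")
```

Because the statistic is computed inside the loop, the mean could be swapped for any other criterion, such as the median, without changing the rest of the procedure.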

Wilcoxon Test

p-value = 0.0560
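
As an illustration of what the Wilcoxon signed-rank test measures, the sketch below ranks the absolute per-topic differences and sums the ranks of the positive ones. It then computes an exact two-sided p-value by enumerating every sign assignment, which is only practical for a small number of topics. The helper names, the rounding used to detect ties, and the example scores are assumptions made for this sketch.

```python
from itertools import product


def signed_ranks(diffs):
    """Return (rank, is_positive) per nonzero difference, averaging tied ranks."""
    nonzero = [d for d in diffs if d != 0]
    ordered = sorted(abs(d) for d in nonzero)
    rank_of = {}
    i = 0
    while i < len(ordered):
        j = i
        while j < len(ordered) and ordered[j] == ordered[i]:
            j += 1
        rank_of[ordered[i]] = (i + 1 + j) / 2  # average of ranks i+1 .. j
        i = j
    return [(rank_of[abs(d)], d > 0) for d in nonzero]


def wilcoxon_signed_rank(scores_a, scores_b):
    """Return (W+, exact two-sided p-value); enumerates 2^n sign patterns."""
    # Rounding makes equal-sized differences compare equal despite float noise.
    diffs = [round(a - b, 6) for a, b in zip(scores_a, scores_b)]
    ranked = signed_ranks(diffs)
    ranks = [r for r, _ in ranked]
    w_plus = sum(r for r, positive in ranked if positive)
    center = sum(ranks) / 2  # expected W+ under the null hypothesis
    observed = abs(w_plus - center)
    extreme = total = 0
    for signs in product((0, 1), repeat=len(ranks)):
        w = sum(r for r, s in zip(ranks, signs) if s)
        total += 1
        if abs(w - center) >= observed:
            extreme += 1
    return w_plus, extreme / total


# Hypothetical per-topic AP scores (invented for illustration).
ap_a = [0.42, 0.31, 0.58, 0.12, 0.77, 0.25, 0.49, 0.36]
ap_b = [0.38, 0.35, 0.51, 0.10, 0.70, 0.27, 0.40, 0.30]
w_plus, p_value = wilcoxon_signed_rank(ap_a, ap_b)
print(f"W+ = {w_plus}, p-value = {p_value:.4f}")
```

Note that the statistic depends only on the ranks of the differences, not on their magnitudes, so it is not a test of the difference in MAP.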

Sign Test

p-value = 0.3222

p-value = 0.3604
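
A minimal sketch of a two-sided sign test, assuming ties are simply dropped. It counts the topics on which each system wins and asks how likely such a lopsided split would be if each system were equally likely to win a topic. The function name and the example scores are assumptions, and the sketch does not attempt to reproduce either p-value above.

```python
from math import comb


def sign_test(scores_a, scores_b):
    """Exact two-sided sign test on paired scores; tied topics are ignored."""
    wins_a = sum(1 for a, b in zip(scores_a, scores_b) if a > b)
    wins_b = sum(1 for a, b in zip(scores_a, scores_b) if a < b)
    n = wins_a + wins_b
    if n == 0:
        return 1.0  # every topic tied: no evidence of a difference
    k = max(wins_a, wins_b)
    # P(X >= k) for X ~ Binomial(n, 0.5), doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical per-topic AP scores (invented for illustration).
ap_a = [0.42, 0.31, 0.58, 0.12, 0.77, 0.25, 0.49, 0.36]
ap_b = [0.38, 0.35, 0.51, 0.10, 0.70, 0.27, 0.40, 0.30]
print(f"p-value = {sign_test(ap_a, ap_b):.4f}")
```

The test sees only which system wins each topic, so a tiny win and a large win count the same.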

Bootstrap Test

p-value = 0.0107
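
A rough sketch of a bootstrap test using a shift method, written under the assumption that the procedure is: resample topics with replacement, compute the mean difference for each resample, shift the resulting distribution so that it is centered on zero, and count how often a shifted value is at least as extreme as the observed difference. The function name, sample count, and example scores are assumptions, and the details may differ from the exact procedure used to obtain the p-value above.

```python
import random


def bootstrap_shift_test(scores_a, scores_b, num_samples=10_000, seed=0):
    """Two-sided bootstrap test on the mean difference (shift method sketch)."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    observed = sum(diffs) / n
    # Resample topics with replacement and record each resample's mean.
    boot_means = [sum(rng.choices(diffs, k=n)) / n for _ in range(num_samples)]
    # Shift the bootstrap distribution so it is centered on zero,
    # which makes it stand in for the null distribution.
    center = sum(boot_means) / num_samples
    extreme = sum(1 for m in boot_means if abs(m - center) >= abs(observed))
    return extreme / num_samples


# Hypothetical per-topic AP scores (invented for illustration).
ap_a = [0.42, 0.31, 0.58, 0.12, 0.77, 0.25, 0.49, 0.36]
ap_b = [0.38, 0.35, 0.51, 0.10, 0.70, 0.27, 0.40, 0.30]
print(f"p-value = {bootstrap_shift_test(ap_a, ap_b):.4f}")
```

Unlike the randomization test, this procedure treats the topics as a random sample from a larger population of topics.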

Student's Paired t-test

p-value = 0.0153
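
The paired t statistic itself is simple to compute: it is the mean per-topic difference divided by its standard error. The sketch below stops at the statistic and its degrees of freedom, because turning it into a p-value requires the Student's t distribution, which the Python standard library does not provide (a statistics package such as SciPy, with scipy.stats.ttest_rel, is one option). The function name and the example scores are assumptions.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees_of_freedom) for paired per-topic scores."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two topics")
    sd = stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("t is undefined when all differences are identical")
    t = mean(diffs) / (sd / sqrt(n))
    return t, n - 1


# Hypothetical per-topic AP scores (invented for illustration).
ap_a = [0.42, 0.31, 0.58, 0.12, 0.77, 0.25, 0.49, 0.36]
ap_b = [0.38, 0.35, 0.51, 0.10, 0.70, 0.27, 0.40, 0.30]
t, df = paired_t_statistic(ap_a, ap_b)
print(f"t = {t:.4f} with {df} degrees of freedom")
```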

Results

Discussion

Sign and Wilcoxon tests: These tests should not be used because they test criteria that do not match the criterion of interest.

Randomization and Bootstrap tests: These tests can use whatever criterion we specify, while the other tests are fixed in their test statistics.

Bootstrap test and Student's t test: Both assume that the scores from the two IR systems are random samples from a single population. However, test topics are not random samples from the population of topics; they are hand selected to meet various criteria.

Student's t test: This test can only be used for the difference between means, not for the median or other test statistics. At smaller sample sizes, violations of normality may result in errors in the t-test.

Conclusion

The randomization test is the recommended test to use when comparing two IR systems.

The Wilcoxon signed-rank test and the sign test should no longer be used in this context.

The randomization test, the bootstrap shift method test, and Student's t test all produced comparable significance values => there is no practical difference between them!

The Wilcoxon signed-rank test and the sign test both produced very different p-values from the other three tests => they can incorrectly predict significance and can fail to detect significant results.